[AMD] Enhance non-negative proof for buffer ops #5374

danzimm · 2024-12-08T23:53:39Z

Hello AMD triton friends! I want to emphasize that this is an RFC. I understand the work wasn't discussed ahead of time, so it may or may not align with what you folks are doing/planning.

Context

Buffer ops have an obvious advantage of reducing register pressure, as well as various other improvements over stanard load/store ops.

We want to start using them over at Meta. We found that the current implementation in triton is very strict, and because of this weren't able to use buffer ops without heavy usage of tl.assume.

Working Example: Matmul

We were able to improve the situation by expanding verifyNonNegativeExpr to be less conservative (eliding details on the exact changes here in favor of conversation over code). With the changes in the PR we're able to (more often) successfully convert to buffer ops. For example, we were able to convert some of the loads/stores in the matmul in the tutorial.

Specifically, working off a copy of the matmul tutorial (https://gist.github.com/danzimm/8ce3cdb52e9630986a71542a239090cd) we're able to successfully convert some loads/stores to buffer ops: https://gist.github.com/danzimm/cdda36a68d3940cf407f403d3af72627 .

Tests

As this is an RFC I didn't write any tests. If the idea is accepted I'll write lit tests.

New contributor declaration

I am not making a trivial change, such as fixing a typo in a comment.
I have written a PR description following these
rules.
I have run pre-commit run --from-ref origin/main --to-ref HEAD.
Select one of the following.
- I have added tests.
  - /test for lit tests
  - /unittest for C++ tests
  - /python/test for end-to-end tests
- This PR does not need a test because This is an RFC. If the idea is accepted I'll write tests.
Select one of the following.
- I have not added any lit tests.
- The lit tests I have added follow these best practices,
  including the "tests should be minimal" section. (Usually running Python code
  and using the instructions it generates is not minimal.)

danzimm · 2024-12-08T23:57:22Z

#5373 is also interesting here -- I think kernel authors are more used to adding typing, so if that PR lands buffer ops may pass the checks edited in this PR more often

sjw36

Nice improvements @danzimm , Thanks!

third_party/amd/lib/TritonAMDGPUTransforms/ConvertToBufferOps.cpp

giuseros · 2024-12-09T14:43:10Z

I don't see anything wrong here! Thanks a lot @danzimm ! My only suggestions is to add more tests, but I understand you wanted some opinions before adding those. I think the main pain point is the pid_n in the matmul, but I think we can cover it in your other PR: #5373

danzimm · 2024-12-11T19:03:01Z

As I was writing up debug cases I realized there're some bits in this PR that aren't quite right:

Triton will never emit unsigned integers, so the "is unsigned integer" check is defunct
The offset must always be a 32bit tensor, which means
- if the type is ever i1, it must get extended:
  - If sext is used then the i32 value will be 2^32-1, which I think is out of bounds (cc @giuseros / @makslevental to confirm we shouldn't allow this value as an offset)
  - If zext is used then we'll pick that up as we recurse in verifyNonNegativeExpr before we get to the is i1 check
- Thus we shouldn't check that the type is i1
- If the type is a pointer then tt.ptr_to_int must be used at some point, thus we'll pick up that op while recursing before the type check

Altogether the above tells us we should get rid of the "isUnsignedType" check

danzimm · 2024-12-12T18:42:29Z

Other than #5373 I have a few follow up PRs planned:

Update tutorials to annotate unsigned where possible
- Motivation: evangelize marking values as unsigned when they're in fact unsigned so that folks start getting buffer ops for "free"
Introduce a new AMD dialect op "AssumeNonNeg" which accepts an integer like (notably, this includes tensors, which llvm.assume does not accept), introduce a new pass which runs prior to canonicalization to mark e.g. zext'd values as NonNeg (essentially this will push some logic from the ConvertToBufferOps pass to earlier. The plan is to drop this op when lowering to LLVM
- Motivation: the only loads/stores that don't get bufferized in my addmm example after this PR + the afore-linked one can't determine the offsets are non-negative because the zext instruction gets elided by the canonicalization pass (namely the canonicalization pass truncates offsets, thus annihilating the zext, so that ConvertToBufferOps doesn't see that type information)
  EDIT: I had a good chat with @makslevental and we determined a better way forward on the above than what I wrote. Need to wait for his canonicalize pass refactoring to land: [AMD] re-enable canonicalize pointers pass #5329

I was also going to dig into the rest of the tutorial kernels (probably as I work through the first mentioned PR to create) to see why else buffer ops might not get emitted

antiagainst

Overall LGTM other than one logic. But I'd leave @giuseros and @makslevental to take another look. @makslevental is expected to pick up the buffer ops work to push it to complete from our side. Nice that you two are already connected!

third_party/amd/lib/TritonAMDGPUTransforms/ConvertToBufferOps.cpp

…veExpr Hello AMD triton friends! I want to emphasize that this is an RFC. I understand the work wasn't discussed ahead of time, so it may or may not align with what you folks are doing/planning. ## Context Buffer ops have an obvious advantage of reducing register pressure, as well as various other improvements over stanard load/store ops. We want to start using them over at Meta. We found that the current implementation in triton is very strict, and because of this weren't able to use buffer ops without *heavy* usage of `tl.assume`. ## Working Example: Matmul We were able to improve the situation by expanding `verifyNonNegativeExpr` to be less conservative (eliding details on the exact changes here in favor of conversation over code). With the changes in the PR we're able to (more often) successfully convert to buffer ops. For example, we were able to convert some of the loads/stores in the matmul in the tutorial. Specifically, working off a copy of the matmul tutorial (https://gist.github.com/danzimm/8ce3cdb52e9630986a71542a239090cd) we're able to successfully convert some loads/stores to buffer ops: https://gist.github.com/danzimm/cdda36a68d3940cf407f403d3af72627 . ## Tests As this is an RFC I didn't write any tests. If the idea is accepted I'll write lit tests.

…port

giuseros

It all LGTM! Thanks @danzimm , this is very good work!

sjw36 reviewed Dec 9, 2024

View reviewed changes

third_party/amd/lib/TritonAMDGPUTransforms/ConvertToBufferOps.cpp Show resolved Hide resolved

third_party/amd/lib/TritonAMDGPUTransforms/ConvertToBufferOps.cpp Outdated Show resolved Hide resolved

ThomasRaoux reviewed Dec 9, 2024

View reviewed changes

third_party/amd/lib/TritonAMDGPUTransforms/ConvertToBufferOps.cpp Outdated Show resolved Hide resolved

danzimm force-pushed the danzimm/amdbufferops-nonneg branch 3 times, most recently from b901d6a to df8062e Compare December 12, 2024 18:29

danzimm marked this pull request as ready for review December 12, 2024 18:36

danzimm requested review from antiagainst, zhanglx13 and ptillet as code owners December 12, 2024 18:36

danzimm changed the title ~~[RFC][AMD] Expand support for buffer ops by expanding verifyNonNegativeExpr~~ [AMD] Expand support for buffer ops by expanding verifyNonNegativeExpr Dec 12, 2024

antiagainst requested changes Dec 14, 2024

View reviewed changes

third_party/amd/lib/TritonAMDGPUTransforms/ConvertToBufferOps.cpp Outdated Show resolved Hide resolved

third_party/amd/lib/TritonAMDGPUTransforms/ConvertToBufferOps.cpp Outdated Show resolved Hide resolved

antiagainst changed the title ~~[AMD] Expand support for buffer ops by expanding verifyNonNegativeExpr~~ [AMD] Enhance non-negative proof for buffer ops Dec 14, 2024

danzimm added 4 commits December 14, 2024 08:05

Add lit tests, remove unsigned comparisons & typecheck, expand op sup…

07c4d3a

…port

Add test that extui marks a value as nonneg + scf.if is transitive

d51daab

Fix hardcoding scf::IfOp.getOperand(0), add lit test for new behavior

caed070

danzimm force-pushed the danzimm/amdbufferops-nonneg branch from c11daf4 to caed070 Compare December 14, 2024 16:05

fix_typo

9a25c6a

danzimm requested a review from antiagainst December 14, 2024 16:07

giuseros approved these changes Dec 16, 2024

View reviewed changes

antiagainst approved these changes Dec 16, 2024

View reviewed changes

antiagainst merged commit eefafc4 into triton-lang:main Dec 16, 2024
7 checks passed

danzimm deleted the danzimm/amdbufferops-nonneg branch December 16, 2024 18:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AMD] Enhance non-negative proof for buffer ops #5374

[AMD] Enhance non-negative proof for buffer ops #5374

danzimm commented Dec 8, 2024 •

edited

Loading

danzimm commented Dec 8, 2024 •

edited

Loading

sjw36 left a comment

giuseros commented Dec 9, 2024

danzimm commented Dec 11, 2024

danzimm commented Dec 12, 2024 •

edited

Loading

antiagainst left a comment

giuseros left a comment

[AMD] Enhance non-negative proof for buffer ops #5374

[AMD] Enhance non-negative proof for buffer ops #5374

Conversation

danzimm commented Dec 8, 2024 • edited Loading

Context

Working Example: Matmul

Tests

New contributor declaration

danzimm commented Dec 8, 2024 • edited Loading

sjw36 left a comment

Choose a reason for hiding this comment

giuseros commented Dec 9, 2024

danzimm commented Dec 11, 2024

danzimm commented Dec 12, 2024 • edited Loading

antiagainst left a comment

Choose a reason for hiding this comment

giuseros left a comment

Choose a reason for hiding this comment

danzimm commented Dec 8, 2024 •

edited

Loading

danzimm commented Dec 8, 2024 •

edited

Loading

danzimm commented Dec 12, 2024 •

edited

Loading