-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AMD] Enhance non-negative proof for buffer ops #5374
[AMD] Enhance non-negative proof for buffer ops #5374
Conversation
#5373 is also interesting here -- I think kernel authors are more used to adding typing, so if that PR lands buffer ops may pass the checks edited in this PR more often |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice improvements @danzimm , Thanks!
third_party/amd/lib/TritonAMDGPUTransforms/ConvertToBufferOps.cpp
Outdated
Show resolved
Hide resolved
third_party/amd/lib/TritonAMDGPUTransforms/ConvertToBufferOps.cpp
Outdated
Show resolved
Hide resolved
As I was writing up debug cases I realized there're some bits in this PR that aren't quite right:
Altogether the above tells us we should get rid of the "isUnsignedType" check |
b901d6a
to
df8062e
Compare
Other than #5373 I have a few follow up PRs planned:
I was also going to dig into the rest of the tutorial kernels (probably as I work through the first mentioned PR to create) to see why else buffer ops might not get emitted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM other than one logic. But I'd leave @giuseros and @makslevental to take another look. @makslevental is expected to pick up the buffer ops work to push it to complete from our side. Nice that you two are already connected!
third_party/amd/lib/TritonAMDGPUTransforms/ConvertToBufferOps.cpp
Outdated
Show resolved
Hide resolved
third_party/amd/lib/TritonAMDGPUTransforms/ConvertToBufferOps.cpp
Outdated
Show resolved
Hide resolved
…veExpr Hello AMD triton friends! I want to emphasize that this is an RFC. I understand the work wasn't discussed ahead of time, so it may or may not align with what you folks are doing/planning. ## Context Buffer ops have an obvious advantage of reducing register pressure, as well as various other improvements over stanard load/store ops. We want to start using them over at Meta. We found that the current implementation in triton is very strict, and because of this weren't able to use buffer ops without *heavy* usage of `tl.assume`. ## Working Example: Matmul We were able to improve the situation by expanding `verifyNonNegativeExpr` to be less conservative (eliding details on the exact changes here in favor of conversation over code). With the changes in the PR we're able to (more often) successfully convert to buffer ops. For example, we were able to convert some of the loads/stores in the matmul in the tutorial. Specifically, working off a copy of the matmul tutorial (https://gist.github.com/danzimm/8ce3cdb52e9630986a71542a239090cd) we're able to successfully convert some loads/stores to buffer ops: https://gist.github.com/danzimm/cdda36a68d3940cf407f403d3af72627 . ## Tests As this is an RFC I didn't write any tests. If the idea is accepted I'll write lit tests.
c11daf4
to
caed070
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It all LGTM! Thanks @danzimm , this is very good work!
Hello AMD triton friends! I want to emphasize that this is an RFC. I understand the work wasn't discussed ahead of time, so it may or may not align with what you folks are doing/planning.
Context
Buffer ops have an obvious advantage of reducing register pressure, as well as various other improvements over stanard load/store ops.
We want to start using them over at Meta. We found that the current implementation in triton is very strict, and because of this weren't able to use buffer ops without heavy usage of
tl.assume
.Working Example: Matmul
We were able to improve the situation by expanding
verifyNonNegativeExpr
to be less conservative (eliding details on the exact changes here in favor of conversation over code). With the changes in the PR we're able to (more often) successfully convert to buffer ops. For example, we were able to convert some of the loads/stores in the matmul in the tutorial.Specifically, working off a copy of the matmul tutorial (https://gist.github.com/danzimm/8ce3cdb52e9630986a71542a239090cd) we're able to successfully convert some loads/stores to buffer ops: https://gist.github.com/danzimm/cdda36a68d3940cf407f403d3af72627 .
Tests
As this is an RFC I didn't write any tests. If the idea is accepted I'll write lit tests.
New contributor declaration
I am not making a trivial change, such as fixing a typo in a comment.
I have written a PR description following these
rules.
I have run
pre-commit run --from-ref origin/main --to-ref HEAD
.Select one of the following.
/test
forlit
tests/unittest
for C++ tests/python/test
for end-to-end testsThis is an RFC. If the idea is accepted I'll write tests
.Select one of the following.
lit
tests.lit
tests I have added follow these best practices,including the "tests should be minimal" section. (Usually running Python code
and using the instructions it generates is not minimal.)