Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AMD] Enhance non-negative proof for buffer ops #5374

Merged
merged 5 commits into from
Dec 16, 2024

Conversation

danzimm
Copy link
Contributor

@danzimm danzimm commented Dec 8, 2024

Hello AMD triton friends! I want to emphasize that this is an RFC. I understand the work wasn't discussed ahead of time, so it may or may not align with what you folks are doing/planning.

Context

Buffer ops have an obvious advantage of reducing register pressure, as well as various other improvements over stanard load/store ops.

We want to start using them over at Meta. We found that the current implementation in triton is very strict, and because of this weren't able to use buffer ops without heavy usage of tl.assume.

Working Example: Matmul

We were able to improve the situation by expanding verifyNonNegativeExpr to be less conservative (eliding details on the exact changes here in favor of conversation over code). With the changes in the PR we're able to (more often) successfully convert to buffer ops. For example, we were able to convert some of the loads/stores in the matmul in the tutorial.

Specifically, working off a copy of the matmul tutorial (https://gist.github.com/danzimm/8ce3cdb52e9630986a71542a239090cd) we're able to successfully convert some loads/stores to buffer ops: https://gist.github.com/danzimm/cdda36a68d3940cf407f403d3af72627 .

Tests

As this is an RFC I didn't write any tests. If the idea is accepted I'll write lit tests.

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because This is an RFC. If the idea is accepted I'll write tests.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

@danzimm
Copy link
Contributor Author

danzimm commented Dec 8, 2024

#5373 is also interesting here -- I think kernel authors are more used to adding typing, so if that PR lands buffer ops may pass the checks edited in this PR more often

Copy link
Contributor

@sjw36 sjw36 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvements @danzimm , Thanks!

@giuseros
Copy link
Contributor

giuseros commented Dec 9, 2024

I don't see anything wrong here! Thanks a lot @danzimm ! My only suggestions is to add more tests, but I understand you wanted some opinions before adding those. I think the main pain point is the pid_n in the matmul, but I think we can cover it in your other PR: #5373

@danzimm
Copy link
Contributor Author

danzimm commented Dec 11, 2024

As I was writing up debug cases I realized there're some bits in this PR that aren't quite right:

  • Triton will never emit unsigned integers, so the "is unsigned integer" check is defunct
  • The offset must always be a 32bit tensor, which means
    • if the type is ever i1, it must get extended:
      • If sext is used then the i32 value will be 2^32-1, which I think is out of bounds (cc @giuseros / @makslevental to confirm we shouldn't allow this value as an offset)
      • If zext is used then we'll pick that up as we recurse in verifyNonNegativeExpr before we get to the is i1 check
    • Thus we shouldn't check that the type is i1
    • If the type is a pointer then tt.ptr_to_int must be used at some point, thus we'll pick up that op while recursing before the type check

Altogether the above tells us we should get rid of the "isUnsignedType" check

@danzimm danzimm force-pushed the danzimm/amdbufferops-nonneg branch 3 times, most recently from b901d6a to df8062e Compare December 12, 2024 18:29
@danzimm danzimm marked this pull request as ready for review December 12, 2024 18:36
@danzimm
Copy link
Contributor Author

danzimm commented Dec 12, 2024

Other than #5373 I have a few follow up PRs planned:

  • Update tutorials to annotate unsigned where possible
    • Motivation: evangelize marking values as unsigned when they're in fact unsigned so that folks start getting buffer ops for "free"
  • Introduce a new AMD dialect op "AssumeNonNeg" which accepts an integer like (notably, this includes tensors, which llvm.assume does not accept), introduce a new pass which runs prior to canonicalization to mark e.g. zext'd values as NonNeg (essentially this will push some logic from the ConvertToBufferOps pass to earlier. The plan is to drop this op when lowering to LLVM
    • Motivation: the only loads/stores that don't get bufferized in my addmm example after this PR + the afore-linked one can't determine the offsets are non-negative because the zext instruction gets elided by the canonicalization pass (namely the canonicalization pass truncates offsets, thus annihilating the zext, so that ConvertToBufferOps doesn't see that type information)
      EDIT: I had a good chat with @makslevental and we determined a better way forward on the above than what I wrote. Need to wait for his canonicalize pass refactoring to land: [AMD] re-enable canonicalize pointers pass #5329

I was also going to dig into the rest of the tutorial kernels (probably as I work through the first mentioned PR to create) to see why else buffer ops might not get emitted

@danzimm danzimm changed the title [RFC][AMD] Expand support for buffer ops by expanding verifyNonNegativeExpr [AMD] Expand support for buffer ops by expanding verifyNonNegativeExpr Dec 12, 2024
Copy link
Collaborator

@antiagainst antiagainst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM other than one logic. But I'd leave @giuseros and @makslevental to take another look. @makslevental is expected to pick up the buffer ops work to push it to complete from our side. Nice that you two are already connected!

@antiagainst antiagainst changed the title [AMD] Expand support for buffer ops by expanding verifyNonNegativeExpr [AMD] Enhance non-negative proof for buffer ops Dec 14, 2024
…veExpr

Hello AMD triton friends! I want to emphasize that this is an RFC. I understand the work wasn't discussed ahead of time, so it may or may not align with what you folks are doing/planning.

## Context
Buffer ops have an obvious advantage of reducing register pressure, as well as various other improvements over stanard load/store ops.

We want to start using them over at Meta. We found that the current implementation in triton is very strict, and because of this weren't able to use buffer ops without *heavy* usage of `tl.assume`.

## Working Example: Matmul

We were able to improve the situation by expanding `verifyNonNegativeExpr` to be less conservative (eliding details on the exact changes here in favor of conversation over code). With the changes in the PR we're able to (more often) successfully convert to buffer ops. For example, we were able to convert some of the loads/stores in the matmul in the tutorial.

Specifically, working off a copy of the matmul tutorial (https://gist.github.com/danzimm/8ce3cdb52e9630986a71542a239090cd) we're able to successfully convert some loads/stores to buffer ops: https://gist.github.com/danzimm/cdda36a68d3940cf407f403d3af72627 .

## Tests

As this is an RFC I didn't write any tests. If the idea is accepted I'll write lit tests.
@danzimm danzimm force-pushed the danzimm/amdbufferops-nonneg branch from c11daf4 to caed070 Compare December 14, 2024 16:05
@danzimm danzimm requested a review from antiagainst December 14, 2024 16:07
Copy link
Contributor

@giuseros giuseros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It all LGTM! Thanks @danzimm , this is very good work!

@antiagainst antiagainst merged commit eefafc4 into triton-lang:main Dec 16, 2024
7 checks passed
@danzimm danzimm deleted the danzimm/amdbufferops-nonneg branch December 16, 2024 18:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants