Replies: 2 comments
-
I think I actually fixed my own issue. I called [...]
-
Yep, Triton executes kernels in the current stream, so synchronizing it should be enough.
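For reference, a minimal sketch of what waiting on the launch stream looks like in PyTorch (the kernel name and grid are hypothetical, not from this thread):

```python
import torch

# Triton enqueues kernels on the current CUDA stream, so ordering works
# the same way as for any other PyTorch op.
# my_triton_kernel[grid](x, out, n)   # hypothetical Triton kernel launch

# Block the host until the launch stream has drained:
torch.cuda.current_stream().synchronize()

# Or, if another stream consumes the output, order that stream behind the
# launch stream instead of blocking the host:
consumer = torch.cuda.Stream()
consumer.wait_stream(torch.cuda.current_stream())
```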
-
Hello!
I am working with very large 5D tensors ([B, C, X, Y, Z]) for medical image segmentation, and I am trying to use Triton to fuse some operations. I am running into trouble ensuring that all Triton kernel instances finish execution while training in a distributed environment. I initialize the output tensor as empty, usually with shape [2, 3, 300, 300, 30], and launch the Triton kernels; however, occasionally there are NaNs in the output. I think this is because that region of memory has not yet been populated by any kernel in the grid. Is there a good way to ensure all Triton programs finish execution? I tried torch.cuda.synchronize, but for whatever reason that does not always work...
I've been struggling with this for a while and would appreciate any help! Thank you so much :)
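For context, here is a minimal, self-contained sketch of the pattern being described (not the original kernel; `fused_double_kernel`, the block size, and the shapes are made up). A masked Triton kernel whose grid covers every element of a `torch.empty` output, followed by a wait on the current stream. If the grid or mask leaves elements uncovered, the uninitialized memory from `torch.empty` can surface as NaNs or garbage even after synchronizing.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def fused_double_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program handles one BLOCK_SIZE-wide slice of the flattened tensor.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * 2.0, mask=mask)


x = torch.randn(2, 3, 300, 300, 30, device="cuda")
out = torch.empty_like(x)                # uninitialized until the kernel writes it
n_elements = out.numel()

# The grid must cover every element; untouched memory stays uninitialized.
grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
fused_double_kernel[grid](x, out, n_elements, BLOCK_SIZE=1024)

# Triton launches on the current stream; wait on it before reading `out`
# on the host or handing it to work on a different stream.
torch.cuda.current_stream().synchronize()
assert not torch.isnan(out).any()
```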