Replies: 2 comments
-
I think I actually fixed my own issue. I called [...]
-
Yep, Triton executes kernels in the current stream, so synchronizing it should be enough.
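For reference, a minimal sketch of what waiting on the launch stream looks like in PyTorch (the kernel name and grid are hypothetical, not from this thread):

```python
import torch

# Triton enqueues kernels on the current CUDA stream, so ordering works
# the same way as for any other PyTorch op.
# my_triton_kernel[grid](x, out, n)   # hypothetical Triton kernel launch

# Block the host until the launch stream has drained:
torch.cuda.current_stream().synchronize()

# Or, if another stream consumes the output, order that stream behind the
# launch stream instead of blocking the host:
consumer = torch.cuda.Stream()
consumer.wait_stream(torch.cuda.current_stream())
```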
-
Hello!
I am working with very large 5D tensors ([B, C, X, Y, Z]) for medical image segmentation, and I am trying to use Triton to fuse some operations. I am running into trouble ensuring that all Triton kernel instances finish execution while training in a distributed environment. I initialize the output tensor as empty, usually with shape [2, 3, 300, 300, 30], and launch the Triton kernels; however, occasionally there are NaNs in the output. I think this is because that region of memory has not yet been populated by any kernel in the grid. Is there a good way to ensure all Triton programs finish execution? I tried torch.cuda.synchronize, but for whatever reason that does not always work...
I've been struggling with this for a while and would appreciate any help! Thank you so much :)
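For context, here is a minimal, self-contained sketch of the pattern being described (not the original kernel; `fused_double_kernel`, the block size, and the shapes are made up). A masked Triton kernel whose grid covers every element of a `torch.empty` output, followed by a wait on the current stream. If the grid or mask leaves elements uncovered, the uninitialized memory from `torch.empty` can surface as NaNs or garbage even after synchronizing.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def fused_double_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program handles one BLOCK_SIZE-wide slice of the flattened tensor.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * 2.0, mask=mask)


x = torch.randn(2, 3, 300, 300, 30, device="cuda")
out = torch.empty_like(x)                # uninitialized until the kernel writes it
n_elements = out.numel()

# The grid must cover every element; untouched memory stays uninitialized.
grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
fused_double_kernel[grid](x, out, n_elements, BLOCK_SIZE=1024)

# Triton launches on the current stream; wait on it before reading `out`
# on the host or handing it to work on a different stream.
torch.cuda.current_stream().synchronize()
assert not torch.isnan(out).any()
```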