There is a slight chance that TL CUDA is not getting picked up as expected.
I had similar issues with the OSU microbenchmarks; then I realized I should explicitly enable TL_CUDA, otherwise TL_UCP gets preferred. Try this in your script:
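Roughly, the idea is to raise TL_CUDA's selection score through UCC's tuning environment variables before the communicator is created. Below is a minimal sketch, assuming a PyTorch build with the ucc backend; the UCC_TL_CUDA_TUNE value and the log level are assumptions to verify against your UCC version, not confirmed settings for this setup.

```python
# Sketch: bias UCC's transport-layer selection toward TL_CUDA.
# Assumption: the score-modifier string below matches your UCC version's tuning syntax.
import os

os.environ["UCC_TL_CUDA_TUNE"] = "allgather:cuda:inf"  # give TL_CUDA top score for allgather on CUDA buffers
os.environ["UCC_LOG_LEVEL"] = "debug"                   # verbose logs to see which TL served the collective

import torch
import torch.distributed as dist

dist.init_process_group(backend="ucc")  # UCC reads the environment when the group is created
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

x = torch.ones(1 << 20, device="cuda")
out = [torch.empty_like(x) for _ in range(dist.get_world_size())]
dist.all_gather(out, x)  # should be served by TL_CUDA if the GPUs are directly connected
dist.destroy_process_group()
```

For benchmarks launched outside Python (e.g. the OSU microbenchmarks under mpirun), exporting the same variables in the run script has the same effect.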
One example of this issue in our case: if the GPUs involved in the collective are not directly connected (for example, peer-to-peer access is not enabled correctly, or there is no NVLink), then TL_CUDA refuses to handle the collective, which is, by the way, both understandable and disappointing.
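A quick way to verify direct connectivity from a PyTorch process is to check CUDA peer-to-peer reachability between the local GPUs, as in the sketch below (`nvidia-smi topo -m` gives the same topology view from the command line).

```python
# Sketch: report GPU pairs that cannot reach each other over CUDA peer-to-peer,
# which is the situation in which TL_CUDA declines to handle the collective.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j and not torch.cuda.can_device_access_peer(i, j):
            print(f"GPU {i} -> GPU {j}: no P2P access")
```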
I am seeing bad performance for single-node TL/CUDA allgather on GPUs connected through NVLink.
On H100
Setup: DGX 8*H100, one node
osu benchmark osu_iallgather
osu benchmark osu_allgather
nccl-tests
ucc_perftest
On V100
osu benchmark osu_iallgather
reproducer: osu-benchmarks
nvFuser Overlap benchmark
nvFuser comm/compute overlap experiment and comparison with NCCL. In this experiment, we post a single allgather followed by a single matmul op. After warmup and averaging across multiple iterations, we find that NCCL's latency is significantly better than UCC's.
reproducer:
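For illustration, here is a minimal sketch of the measurement described above: one allgather followed by one matmul, timed with CUDA events after warmup and averaged over iterations. The tensor shapes, iteration counts, and the BENCH_BACKEND variable are placeholders, not the actual nvFuser reproducer.

```python
# Sketch of the latency measurement: allgather + matmul, warmup, then average.
# Shapes, iteration counts, and the backend selection are illustrative only.
import os
import torch
import torch.distributed as dist

def bench(backend: str, m: int = 4096, k: int = 4096, n: int = 4096,
          warmup: int = 10, iters: int = 50) -> None:
    dist.init_process_group(backend=backend)
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    shard = torch.randn(m // world, k, device="cuda")            # local shard of A
    gathered = [torch.empty_like(shard) for _ in range(world)]   # full A after allgather
    b = torch.randn(k, n, device="cuda")

    start = torch.cuda.Event(enable_timing=True)
    stop = torch.cuda.Event(enable_timing=True)
    times = []
    for i in range(warmup + iters):
        start.record()
        dist.all_gather(gathered, shard)   # communication
        a = torch.cat(gathered, dim=0)
        _ = a @ b                          # compute
        stop.record()
        torch.cuda.synchronize()
        if i >= warmup:
            times.append(start.elapsed_time(stop))   # milliseconds

    if rank == 0:
        print(f"{backend}: avg allgather+matmul latency {sum(times) / len(times):.3f} ms")
    dist.destroy_process_group()

if __name__ == "__main__":
    bench(os.environ.get("BENCH_BACKEND", "nccl"))
```

Running it with torchrun --nproc_per_node=8 once with BENCH_BACKEND=ucc and once with BENCH_BACKEND=nccl gives the backend comparison described above.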