[BUG] Nightly CI issue: CUDA 11.4 jobs were running with CUDA 11.8 when nccl wasn't available #2402

Open
dantegd opened this issue Jul 30, 2024 · 0 comments
Labels
bug Something isn't working

Comments

Member

dantegd commented Jul 30, 2024

NCCL 2.22.3.1 in conda-forge was not available for CUDA < 11.8 until yesterday, so all of cuML's CUDA 11.4 CI jobs failed until today. RAFT's CUDA 11.4 CI kept passing regardless, which confused me for a while.

Checking the jobs, they were installing cuda-version 11.8 and the corresponding packages. From this CUDA 11.4 log, for example, the following snippet shows the issue when installing the downloaded artifacts:

  Upgrade:
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

  - cuda-version                              11.4  hfb901f2_3                       conda-forge             Cached
  + cuda-version                              11.8  h70ddcb2_3                       conda-forge               21kB
  - cudatoolkit                             11.4.3  h39f8164_13                      conda-forge             Cached
  + cudatoolkit                             11.8.0  h4ba93d1_13                      conda-forge              716MB

This should not be happening on CUDA 11.4 jobs, of course. I think it is no longer an issue with nccl, but any other package could cause a situation like this. That could make things fail silently in the future and catch us by surprise, eliminating the point of having 11.4 jobs in nightly CI.
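
One possible guard, as a rough sketch only: scan the solver's transaction output for `cuda-version` or `cudatoolkit` being installed at a version other than the one the job is meant to test, and fail the job if any are found. The function name `cuda_version_drift`, its parameters, and the idea of running it as a CI step are all hypothetical and not part of RAFT's CI. The line format is assumed to match the snippet above.

```python
import re

# Matches transaction lines of the form shown in the log above, e.g.
#   + cuda-version    11.8  h70ddcb2_3    conda-forge    21kB
# Captures the sign (+ or -), the package name, and the version.
_TRANSACTION_LINE = re.compile(r"^\s*([+-])\s+(\S+)\s+(\S+)\s+\S+")


def cuda_version_drift(log_text, expected,
                       packages=("cuda-version", "cudatoolkit")):
    """Return (package, version) pairs being installed whose major.minor
    version differs from `expected` (for example "11.4").

    Hypothetical helper: the name and signature are illustrative only.
    """
    drift = []
    for line in log_text.splitlines():
        match = _TRANSACTION_LINE.match(line)
        if match is None:
            continue
        sign, name, version = match.groups()
        # Only packages being added matter; "-" lines are removals.
        if sign != "+" or name not in packages:
            continue
        # Compare major.minor only, so 11.4.3 counts as 11.4.
        major_minor = ".".join(version.split(".")[:2])
        if major_minor != expected:
            drift.append((name, version))
    return drift
```

A CI step could pass the captured install log and the job's CUDA version to this function and exit non-zero when the returned list is not empty, so a silent upgrade like the one above would fail the CUDA 11.4 job instead of passing.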

dantegd added the bug label Jul 30, 2024