Fix Nvidia Image build #489

Merged
merged 2 commits into from
Oct 9, 2024

@Issacwww (Contributor) commented Oct 8, 2024

Issue #, if available:

Description of changes:
Issue 1:
The image build failed in CodeBuild:

Step 17/26 : ARG NCCL_VERSION=2.22.3-1+cuda${CUDA_MAJOR_VERSION}.${CUDA_MINOR_VERSION}
 ---> Running in 1a06f4ac470e
Removing intermediate container 1a06f4ac470e
 ---> 07cde702fb9b
Step 18/26 : RUN apt update   && apt install -y     libnccl2=${NCCL_VERSION}      libnccl-dev=${NCCL_VERSION}
...
E: Version '2.22.3-1' for 'libnccl2' was not found
E: Version '2.22.3-1' for 'libnccl-dev' was not found
The command '/bin/sh -c apt update   && apt install -y     libnccl2=${NCCL_VERSION}      libnccl-dev=${NCCL_VERSION}' returned a non-zero code: 100

Temporary fix: hardcode the NCCL version.
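
The version in the apt error above ('2.22.3-1') lacks the `+cuda<major>.<minor>` suffix that the `ARG` line composes. As a rough sketch only, a guard like the following could flag that before `apt install` runs. The helper name `has_cuda_suffix` and the CUDA version `12.5` are hypothetical and are not part of this PR.

```shell
#!/bin/sh
# Hypothetical helper (not from this PR): succeeds only if the version string
# carries a "+cuda<major>.<minor>" suffix such as "+cuda12.5".
has_cuda_suffix() {
  case "$1" in
    *+cuda[0-9]*.[0-9]*) return 0 ;;
    *) return 1 ;;
  esac
}

# The version string apt reported in the failing build.
if has_cuda_suffix "2.22.3-1"; then
  echo "version looks complete"
else
  echo "missing +cuda suffix"
fi

# A fully composed version (CUDA 12.5 is an assumed example value).
if has_cuda_suffix "2.22.3-1+cuda12.5"; then
  echo "version looks complete"
else
  echo "missing +cuda suffix"
fi
```

This only checks the shape of the string; it does not confirm that the package version exists in the configured apt repositories.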

Issue 2:
The unit test test_nvidia_persistence_status fails on Bottlerocket because NVIDIA persistence is not enabled there. An upcoming release will fix this; in the meantime, this change extends a flag for skipping tests, which also adds flexibility.

Tested with the Job manifest below:

---
kind: Job
apiVersion: batch/v1
metadata:
  name: unit-test-job
  labels:
    app: unit-test-job
spec:
  template:
    metadata:
      labels:
        app: unit-test-job
    spec:
      containers:
        - name: unit-test-container
          image: "171391670848.dkr.ecr.us-west-2.amazonaws.com/test-images:nvtest-withSkip"
          command:
            - /bin/bash
            - ./gpu_unit_tests/unit_test
          env:
            - name: SKIP_TESTS_SUBCOMMAND
              value: "-s test_05_dcgm_diagnostics|test_nvidia_persistence_status"
          imagePullPolicy: Always
          resources:
            limits:
              cpu: "4"
              memory: 4Gi
              nvidia.com/gpu: "1"
            requests:
              cpu: "1"
              memory: 1Gi
      restartPolicy: Never
  backoffLimit: 4

Output:

k logs unit-test-job-77xkh -f
# Running tests in gpu_unit_tests/tests/test_basic.sh
ok - test_01_device_query
ok - test_02_vector_add
ok - test_03_bandwidth
ok - test_04_bus_grind
ok -  # skip skip pattern: test_05_dcgm_diagnostics|test_nvidia_persistence_status
# Running tests in gpu_unit_tests/tests/test_sysinfo.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:02 --:--:--     0
curl: (56) Recv failure: Connection reset by peer
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    10  100    10    0     0  13717      0 --:--:-- --:--:-- --:--:-- 10000
ok - test_numa_topo_topo
ok - test_nvidia_gpu_count
ok - test_nvidia_gpu_throttled
ok - test_nvidia_gpu_unused
ok -  # skip skip pattern: test_05_dcgm_diagnostics|test_nvidia_persistence_status
ok - test_nvidia_smi_topo
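
The output above shows two tests skipped, which is consistent with the `-s` value being treated as an alternation pattern. The sketch below illustrates that kind of matching with `grep -E`; the test runner's actual matching logic is not shown in this PR, so the function name `should_skip` and the use of extended regular expressions are assumptions.

```shell
#!/bin/sh
# Pattern taken from the SKIP_TESTS_SUBCOMMAND value in the Job manifest,
# without the leading "-s " flag.
SKIP_PATTERN='test_05_dcgm_diagnostics|test_nvidia_persistence_status'

# Hypothetical helper: succeeds if the test name matches the skip pattern.
# Assumes extended-regex alternation; the real runner may match differently.
should_skip() {
  printf '%s\n' "$1" | grep -Eq -- "$SKIP_PATTERN"
}

for t in test_04_bus_grind test_05_dcgm_diagnostics \
         test_nvidia_persistence_status test_nvidia_smi_topo; do
  if should_skip "$t"; then
    echo "skip $t"
  else
    echo "run  $t"
  fi
done
```

Because the pattern contains `|`, it is quoted in the manifest so the shell does not treat it as a pipe.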

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@bryantbiggs (Member)

What is the Docker version in CodeBuild, or is it using something else, like Finch, to build the images?

@Issacwww Issacwww merged commit 7d8a797 into main Oct 9, 2024
5 checks passed
@Issacwww Issacwww deleted the fixNvImgBuild branch October 9, 2024 23:23