
When the HAMI-device-plugin crashes, hami.io/node-handshake is still Requesting_. #272

Closed · chaunceyjiang opened this issue Apr 16, 2024 · 1 comment · Fixed by #328 · May be fixed by #322

Comments

chaunceyjiang (Contributor) commented Apr 16, 2024

1. Issue or feature description

When the HAMI-device-plugin on a node crashes, the node's hami.io/node-handshake annotation stays at Requesting_, and the scheduler keeps treating the node's GPU as available.

2. Steps to reproduce the issue

Driver: the NVIDIA driver DaemonSet pod on worker-a800-1 is in Init:CrashLoopBackOff.

root@controller-node-1:~# kubectl get pods -n gpu-operator -owide
NAME                                                          READY   STATUS                  RESTARTS         AGE     IP              NODE                NOMINATED NODE   READINESS GATES
nvidia-container-toolkit-daemonset-r4x86                      1/1     Running                 0                2m39s   10.233.97.176   worker-a800-1       <none>           <none>
nvidia-container-toolkit-daemonset-splfp                      1/1     Running                 0                6d4h    10.233.90.209   worker-a800-3       <none>           <none>
....
nvidia-driver-daemonset-2x8xv                                 0/2     Init:CrashLoopBackOff   21 (2m39s ago)   95m     10.233.97.186   worker-a800-1       <none>           <none>
nvidia-driver-daemonset-xw2xz                                 2/2     Running                 44 (6d4h ago)    8d      10.233.90.208   worker-a800-3       <none>           <none>



Device plugin: the HAMi device plugin pod on worker-a800-1 is also in CrashLoopBackOff.

root@controller-node-1:~# kubectl get pods -n nvidia-vgpu -owide
NAME                                                    READY   STATUS               RESTARTS      AGE     IP              NODE                NOMINATED NODE   READINESS GATES
nvidia-vgpu-hami-device-plugin-4mb4f                    2/2     Running              0             3d22h   10.20.100.213   worker-a800-3       <none>           <none>
nvidia-vgpu-hami-device-plugin-jds85                    1/2     CrashLoopBackOff     3 (13s ago)   92s     10.20.100.211   worker-a800-1       <none>           <none>
nvidia-vgpu-hami-scheduler-765f6f84b8-29qfs             2/2     Running              2 (25h ago)   3d22h   10.233.97.156   worker-a800-1       <none>           <none>

Node: the hami.io/node-handshake annotation on worker-a800-1 is still Requesting_.

root@controller-node-1:~# kubectl get nodes  worker-a800-1 -oyaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    NVIDIA: Deleted_2024.04.16 07:44:03
    csi.volume.kubernetes.io/nodeid: '{"nfs.csi.k8s.io":"worker-a800-1"}'
    hami.io/mutex.lock: "2024-04-16T07:44:00Z"
    hami.io/node-handshake: Requesting_2024.04.16 05:52:15
    hami.io/node-handshake-mlu: Requesting_2024.04.07 09:47:49
    hami.io/node-nvidia-register: 'GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf,10,245760,200,NVIDIA-NVIDIA
      A800 80GB PCIe,0,true:'
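
For context, the hami.io/node-nvidia-register value above is a ':'-separated list of devices, each device a comma-separated record. The sketch below is only an illustration of how such a record could be decoded; the field names (UUID, split count, total memory, core percentage, type, NUMA id, health) are my reading of the annotation and of the "remaining devices" log lines below, not HAMi's actual decoder.

// Illustrative decoder for the "hami.io/node-nvidia-register" annotation shown
// above. The field order is inferred from the annotation value and the
// scheduler-extender log lines; it is an assumption, not HAMi source code.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// DeviceInfo mirrors the fields visible in the annotation; names are illustrative.
type DeviceInfo struct {
	UUID    string
	Count   int
	Devmem  int
	Devcore int
	Type    string
	Numa    int
	Health  bool
}

// decodeNodeDevices splits the annotation on ':' (one device per segment) and
// each segment on ',' into its fields.
func decodeNodeDevices(annotation string) ([]DeviceInfo, error) {
	var devices []DeviceInfo
	for _, seg := range strings.Split(strings.TrimSpace(annotation), ":") {
		if seg == "" {
			continue // trailing ':' produces an empty segment
		}
		fields := strings.Split(seg, ",")
		if len(fields) < 7 {
			return nil, fmt.Errorf("unexpected device segment %q", seg)
		}
		count, _ := strconv.Atoi(fields[1])
		devmem, _ := strconv.Atoi(fields[2])
		devcore, _ := strconv.Atoi(fields[3])
		numa, _ := strconv.Atoi(fields[5])
		devices = append(devices, DeviceInfo{
			UUID:    fields[0],
			Count:   count,
			Devmem:  devmem,
			Devcore: devcore,
			Type:    fields[4],
			Numa:    numa,
			Health:  fields[6] == "true",
		})
	}
	return devices, nil
}

func main() {
	devs, err := decodeNodeDevices("GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf,10,245760,200,NVIDIA-NVIDIA A800 80GB PCIe,0,true:")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", devs)
}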

vgpu-scheduler-extender: the scheduler extender still reports the node's GPU as a remaining device.



root@controller-node-1:~# kubectl logs -f -n nvidia-vgpu nvidia-vgpu-hami-scheduler-765f6f84b8-29qfs -c vgpu-scheduler-extender |grep remaining
I0416 07:31:41.251339       1 scheduler.go:152] node worker-a800-1 device NVIDIA:&{worker-a800-3 []} leave, <nil> remaining devices:[{GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf 0 10 245760 200 NVIDIA-NVIDIA A800 80GB PCIe 0 true}]
I0416 07:31:57.851477       1 scheduler.go:152] node worker-a800-1 device NVIDIA:&{worker-a800-3 []} leave, <nil> remaining devices:[{GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf 0 10 245760 200 NVIDIA-NVIDIA A800 80GB PCIe 0 true}]
I0416 07:32:16.052935       1 scheduler.go:152] node worker-a800-1 device NVIDIA:&{worker-a800-3 []} leave, <nil> remaining devices:[{GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf 0 10 245760 200 NVIDIA-NVIDIA A800 80GB PCIe 0 true}]
I0416 07:32:32.254100       1 scheduler.go:152] node worker-a800-1 device NVIDIA:&{worker-a800-3 []} leave, <nil> remaining devices:[{GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf 0 10 245760 200 NVIDIA-NVIDIA A800 80GB PCIe 0 true}]
I0416 07:32:50.251122       1 scheduler.go:152] node worker-a800-1 device NVIDIA:&{worker-a800-3 []} leave, <nil> remaining devices:[{GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf 0 10 245760 200 NVIDIA-NVIDIA A800 80GB PCIe 0 true}]
I0416 07:33:06.850208


root@controller-node-1:~# kubectl logs -f -n nvidia-vgpu nvidia-vgpu-hami-scheduler-765f6f84b8-29qfs -c vgpu-scheduler-extender |grep "before rm"
I0416 07:31:41.251310       1 nodes.go:69] before rm: [{GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf 0 10 245760 200 NVIDIA-NVIDIA A800 80GB PCIe 0 true}] needs remove []
I0416 07:31:57.851455       1 nodes.go:69] before rm: [{GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf 0 10 245760 200 NVIDIA-NVIDIA A800 80GB PCIe 0 true}] needs remove []
I0416 07:32:16.052908       1 nodes.go:69] before rm: [{GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf 0 10 245760 200 NVIDIA-NVIDIA A800 80GB PCIe 0 true}] needs remove []
I0416 07:32:32.254021       1 nodes.go:69] before rm: [{GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf 0 10 245760 200 NVIDIA-NVIDIA A800 80GB PCIe 0 true}] needs remove []
I0416 07:32:50.251091       1 nodes.go:69] before rm: [{GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf 0 10 245760 200 NVIDIA-NVIDIA A800 80GB PCIe 0 true}] needs remove []




I0416 08:18:03.975727       1 score.go:94] checkUUID result is true for NVIDIA type
I0416 08:18:03.975736       1 score.go:158] "first fitted" pod="kebe/convert-model2-worker-0" device="GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf"
I0416 08:18:03.975752       1 score.go:169] "device allocate success" pod="kebe/convert-model2-worker-0" allocate device={"NVIDIA":[{"Idx":0,"UUID":"GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf","Type":"NVIDIA","Usedmem":10000,"Usedcores":0}]}
I0416 08:18:03.975764       1 score.go:235] "calcScore:pod fit node score results" pod="kebe/convert-model2-worker-0" node="worker-a800-1" score=1.25
I0416 08:18:03.975776       1 scheduler.go:380] schedule kebe/convert-model2-worker-0 to worker-a800-1 map[NVIDIA:[[{0 GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf NVIDIA 10000 0}]]]
I0416 08:18:03.975792       1 util.go:137] Encoded container Devices: GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf,NVIDIA,10000,0:
I0416 08:18:03.975801       1 util.go:160] Encoded pod single devices GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf,NVIDIA,10000,0:;
I0416 08:18:03.975808       1 pods.go:54] Pod added: Name: convert-model2-worker-0, UID: add43aba-26fb-4628-9d2a-3465946ef404, Namespace: kebe, NodeID: worker-a800-1
I0416 08:18:04.039512       1 request.go:629] Waited for 587.713159ms due to client-side throttling, not priority and fairness, request: GET:https://10.233.0.1:443/api/v1/nodes/worker-a800-1
E0416 08:18:04.044525       1 scheduler.go:313] "Failed to lock node" err="node worker-a800-1 has been locked within 5 minutes" node="worker-a800-1"
I0416 08:18:04.239427       1 request.go:629] Waited for 594.764445ms due to client-side throttling, not priority and fairness, request: PATCH:https://10.233.0.1:443/api/v1/namespaces/default/pods/inference-llama2-7b-64bddc9497-tfchb
I0416 08:18:04.250011       1 util.go:228] "Decoded pod annos" poddevices={"Iluvatar":[[{"Idx":0,"UUID":"GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf","Type":"NVIDIA","Usedmem":20000,"Usedcores":0}]],"NVIDIA":[[{"Idx":0,"UUID":"GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf","Type":"NVIDIA","Usedmem":20000,"Usedcores":0}]]}
I0416 08:18:04.252357       1 scheduler.go:337] After Binding Process
I0416 08:18:04.253444       1 util.go:228] "Decoded pod annos" poddevices={"Iluvatar":[[{"Idx":0,"UUID":"GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf","Type":"NVIDIA","Usedmem":20000,"Usedcores":0}]],"NVIDIA":[[{"Idx":0,"UUID":"GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf","Type":"NVIDIA","Usedmem":20000,"Usedcores":0}]]}
I0416 08:18:04.336663       1 pods.go:63] Deleted pod inference-llama2-7b-64bddc9497-tfchb with node ID worker-a800-1
I0416 08:18:04.345710       1 route.go:131] Into webhookfunc
I0416 08:18:04.439145       1 request.go:629] Waited for 463.271887ms due to client-side throttling, not priority and fairness, request: PATCH:https://10.233.0.1:443/api/v1/namespaces/kebe/pods/convert-model2-worker-0

I believe that when the HAMI-device-plugin crashes, hami.io/node-handshake should be set to Deleted_.
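
To make the expected behavior concrete, here is a minimal sketch (not HAMi's implementation) of a scheduler-side check that treats a Requesting_ handshake older than an assumed timeout as stale and patches it to Deleted_. The annotation key and timestamp layout are taken from the node YAML above; the 5-minute timeout and the client-go wiring are assumptions.

// Illustrative sketch only: NOT the actual HAMi code. It shows one way the
// scheduler side could detect a stale "hami.io/node-handshake" annotation and
// flip it to "Deleted_..." when the device plugin has not responded in time.
package main

import (
	"context"
	"fmt"
	"strings"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

const (
	handshakeAnnotation = "hami.io/node-handshake"
	timeLayout          = "2006.01.02 15:04:05" // matches "Requesting_2024.04.16 05:52:15"
	handshakeTimeout    = 5 * time.Minute       // assumed timeout, not taken from HAMi
)

// markStaleHandshakes patches Requesting_ annotations older than the timeout to Deleted_.
func markStaleHandshakes(ctx context.Context, client kubernetes.Interface) error {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, node := range nodes.Items {
		val, ok := node.Annotations[handshakeAnnotation]
		if !ok || !strings.HasPrefix(val, "Requesting_") {
			continue
		}
		ts, err := time.Parse(timeLayout, strings.TrimPrefix(val, "Requesting_"))
		if err != nil {
			continue // unrecognized timestamp, leave it alone
		}
		if time.Since(ts) < handshakeTimeout {
			continue // the device plugin may still answer
		}
		// Mark the handshake as Deleted_ so the node's devices are no longer offered.
		patch := fmt.Sprintf(
			`{"metadata":{"annotations":{%q:%q}}}`,
			handshakeAnnotation,
			"Deleted_"+time.Now().Format(timeLayout),
		)
		if _, err := client.CoreV1().Nodes().Patch(
			ctx, node.Name, types.MergePatchType, []byte(patch), metav1.PatchOptions{},
		); err != nil {
			return err
		}
	}
	return nil
}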

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g. /etc/docker/daemon.json)
  • The vgpu-device-plugin container logs
  • The vgpu-scheduler container logs
  • The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Any relevant kernel output lines from dmesg

Hi @chaunceyjiang,
Thanks for opening an issue!
We will look into it as soon as possible.


Instructions for interacting with me using comments are available here.
If you have questions or suggestions related to my behavior, please file an issue against the gh-ci-bot repository.
