
When the HAMI-device-plugin crashes, hami.io/node-handshake is still Requesting_. #272

Closed · chaunceyjiang opened this issue Apr 16, 2024 · 1 comment · Fixed by #328 · May be fixed by #322

Comments

chaunceyjiang (Contributor) commented Apr 16, 2024

1. Issue or feature description

When the HAMI-device-plugin on a node crashes, the node's hami.io/node-handshake annotation stays at Requesting_, and the scheduler keeps treating the node's GPU as available.

2. Steps to reproduce the issue

Driver: the NVIDIA driver DaemonSet pod on worker-a800-1 is in Init:CrashLoopBackOff.

root@controller-node-1:~# kubectl get pods -n gpu-operator -owide
NAME                                                          READY   STATUS                  RESTARTS         AGE     IP              NODE                NOMINATED NODE   READINESS GATES
nvidia-container-toolkit-daemonset-r4x86                      1/1     Running                 0                2m39s   10.233.97.176   worker-a800-1       <none>           <none>
nvidia-container-toolkit-daemonset-splfp                      1/1     Running                 0                6d4h    10.233.90.209   worker-a800-3       <none>           <none>
....
nvidia-driver-daemonset-2x8xv                                 0/2     Init:CrashLoopBackOff   21 (2m39s ago)   95m     10.233.97.186   worker-a800-1       <none>           <none>
nvidia-driver-daemonset-xw2xz                                 2/2     Running                 44 (6d4h ago)    8d      10.233.90.208   worker-a800-3       <none>           <none>



Device plugin: the HAMi device plugin pod on worker-a800-1 is also in CrashLoopBackOff.

root@controller-node-1:~# kubectl get pods -n nvidia-vgpu -owide
NAME                                                    READY   STATUS               RESTARTS      AGE     IP              NODE                NOMINATED NODE   READINESS GATES
nvidia-vgpu-hami-device-plugin-4mb4f                    2/2     Running              0             3d22h   10.20.100.213   worker-a800-3       <none>           <none>
nvidia-vgpu-hami-device-plugin-jds85                    1/2     CrashLoopBackOff     3 (13s ago)   92s     10.20.100.211   worker-a800-1       <none>           <none>
nvidia-vgpu-hami-scheduler-765f6f84b8-29qfs             2/2     Running              2 (25h ago)   3d22h   10.233.97.156   worker-a800-1       <none>           <none>

Node: the hami.io/node-handshake annotation on worker-a800-1 is still Requesting_.

root@controller-node-1:~# kubectl get nodes  worker-a800-1 -oyaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    NVIDIA: Deleted_2024.04.16 07:44:03
    csi.volume.kubernetes.io/nodeid: '{"nfs.csi.k8s.io":"worker-a800-1"}'
    hami.io/mutex.lock: "2024-04-16T07:44:00Z"
    hami.io/node-handshake: Requesting_2024.04.16 05:52:15
    hami.io/node-handshake-mlu: Requesting_2024.04.07 09:47:49
    hami.io/node-nvidia-register: 'GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf,10,245760,200,NVIDIA-NVIDIA
      A800 80GB PCIe,0,true:'
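
For context, the hami.io/node-nvidia-register value above is a ':'-separated list of devices, each device a comma-separated record. The sketch below is only an illustration of how such a record could be decoded; the field names (UUID, split count, total memory, core percentage, type, NUMA id, health) are my reading of the annotation and of the "remaining devices" log lines below, not HAMi's actual decoder.

// Illustrative decoder for the "hami.io/node-nvidia-register" annotation shown
// above. The field order is inferred from the annotation value and the
// scheduler-extender log lines; it is an assumption, not HAMi source code.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// DeviceInfo mirrors the fields visible in the annotation; names are illustrative.
type DeviceInfo struct {
	UUID    string
	Count   int
	Devmem  int
	Devcore int
	Type    string
	Numa    int
	Health  bool
}

// decodeNodeDevices splits the annotation on ':' (one device per segment) and
// each segment on ',' into its fields.
func decodeNodeDevices(annotation string) ([]DeviceInfo, error) {
	var devices []DeviceInfo
	for _, seg := range strings.Split(strings.TrimSpace(annotation), ":") {
		if seg == "" {
			continue // trailing ':' produces an empty segment
		}
		fields := strings.Split(seg, ",")
		if len(fields) < 7 {
			return nil, fmt.Errorf("unexpected device segment %q", seg)
		}
		count, _ := strconv.Atoi(fields[1])
		devmem, _ := strconv.Atoi(fields[2])
		devcore, _ := strconv.Atoi(fields[3])
		numa, _ := strconv.Atoi(fields[5])
		devices = append(devices, DeviceInfo{
			UUID:    fields[0],
			Count:   count,
			Devmem:  devmem,
			Devcore: devcore,
			Type:    fields[4],
			Numa:    numa,
			Health:  fields[6] == "true",
		})
	}
	return devices, nil
}

func main() {
	devs, err := decodeNodeDevices("GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf,10,245760,200,NVIDIA-NVIDIA A800 80GB PCIe,0,true:")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", devs)
}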

vgpu-scheduler-extender: the scheduler extender still reports the node's GPU as a remaining device.



root@controller-node-1:~# kubectl logs -f -n nvidia-vgpu nvidia-vgpu-hami-scheduler-765f6f84b8-29qfs -c vgpu-scheduler-extender |grep remaining
I0416 07:31:41.251339       1 scheduler.go:152] node worker-a800-1 device NVIDIA:&{worker-a800-3 []} leave, <nil> remaining devices:[{GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf 0 10 245760 200 NVIDIA-NVIDIA A800 80GB PCIe 0 true}]
I0416 07:31:57.851477       1 scheduler.go:152] node worker-a800-1 device NVIDIA:&{worker-a800-3 []} leave, <nil> remaining devices:[{GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf 0 10 245760 200 NVIDIA-NVIDIA A800 80GB PCIe 0 true}]
I0416 07:32:16.052935       1 scheduler.go:152] node worker-a800-1 device NVIDIA:&{worker-a800-3 []} leave, <nil> remaining devices:[{GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf 0 10 245760 200 NVIDIA-NVIDIA A800 80GB PCIe 0 true}]
I0416 07:32:32.254100       1 scheduler.go:152] node worker-a800-1 device NVIDIA:&{worker-a800-3 []} leave, <nil> remaining devices:[{GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf 0 10 245760 200 NVIDIA-NVIDIA A800 80GB PCIe 0 true}]
I0416 07:32:50.251122       1 scheduler.go:152] node worker-a800-1 device NVIDIA:&{worker-a800-3 []} leave, <nil> remaining devices:[{GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf 0 10 245760 200 NVIDIA-NVIDIA A800 80GB PCIe 0 true}]
I0416 07:33:06.850208


root@controller-node-1:~# kubectl logs -f -n nvidia-vgpu nvidia-vgpu-hami-scheduler-765f6f84b8-29qfs -c vgpu-scheduler-extender |grep "before rm"
I0416 07:31:41.251310       1 nodes.go:69] before rm: [{GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf 0 10 245760 200 NVIDIA-NVIDIA A800 80GB PCIe 0 true}] needs remove []
I0416 07:31:57.851455       1 nodes.go:69] before rm: [{GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf 0 10 245760 200 NVIDIA-NVIDIA A800 80GB PCIe 0 true}] needs remove []
I0416 07:32:16.052908       1 nodes.go:69] before rm: [{GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf 0 10 245760 200 NVIDIA-NVIDIA A800 80GB PCIe 0 true}] needs remove []
I0416 07:32:32.254021       1 nodes.go:69] before rm: [{GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf 0 10 245760 200 NVIDIA-NVIDIA A800 80GB PCIe 0 true}] needs remove []
I0416 07:32:50.251091       1 nodes.go:69] before rm: [{GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf 0 10 245760 200 NVIDIA-NVIDIA A800 80GB PCIe 0 true}] needs remove []




I0416 08:18:03.975727       1 score.go:94] checkUUID result is true for NVIDIA type
I0416 08:18:03.975736       1 score.go:158] "first fitted" pod="kebe/convert-model2-worker-0" device="GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf"
I0416 08:18:03.975752       1 score.go:169] "device allocate success" pod="kebe/convert-model2-worker-0" allocate device={"NVIDIA":[{"Idx":0,"UUID":"GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf","Type":"NVIDIA","Usedmem":10000,"Usedcores":0}]}
I0416 08:18:03.975764       1 score.go:235] "calcScore:pod fit node score results" pod="kebe/convert-model2-worker-0" node="worker-a800-1" score=1.25
I0416 08:18:03.975776       1 scheduler.go:380] schedule kebe/convert-model2-worker-0 to worker-a800-1 map[NVIDIA:[[{0 GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf NVIDIA 10000 0}]]]
I0416 08:18:03.975792       1 util.go:137] Encoded container Devices: GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf,NVIDIA,10000,0:
I0416 08:18:03.975801       1 util.go:160] Encoded pod single devices GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf,NVIDIA,10000,0:;
I0416 08:18:03.975808       1 pods.go:54] Pod added: Name: convert-model2-worker-0, UID: add43aba-26fb-4628-9d2a-3465946ef404, Namespace: kebe, NodeID: worker-a800-1
I0416 08:18:04.039512       1 request.go:629] Waited for 587.713159ms due to client-side throttling, not priority and fairness, request: GET:https://10.233.0.1:443/api/v1/nodes/worker-a800-1
E0416 08:18:04.044525       1 scheduler.go:313] "Failed to lock node" err="node worker-a800-1 has been locked within 5 minutes" node="worker-a800-1"
I0416 08:18:04.239427       1 request.go:629] Waited for 594.764445ms due to client-side throttling, not priority and fairness, request: PATCH:https://10.233.0.1:443/api/v1/namespaces/default/pods/inference-llama2-7b-64bddc9497-tfchb
I0416 08:18:04.250011       1 util.go:228] "Decoded pod annos" poddevices={"Iluvatar":[[{"Idx":0,"UUID":"GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf","Type":"NVIDIA","Usedmem":20000,"Usedcores":0}]],"NVIDIA":[[{"Idx":0,"UUID":"GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf","Type":"NVIDIA","Usedmem":20000,"Usedcores":0}]]}
I0416 08:18:04.252357       1 scheduler.go:337] After Binding Process
I0416 08:18:04.253444       1 util.go:228] "Decoded pod annos" poddevices={"Iluvatar":[[{"Idx":0,"UUID":"GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf","Type":"NVIDIA","Usedmem":20000,"Usedcores":0}]],"NVIDIA":[[{"Idx":0,"UUID":"GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf","Type":"NVIDIA","Usedmem":20000,"Usedcores":0}]]}
I0416 08:18:04.336663       1 pods.go:63] Deleted pod inference-llama2-7b-64bddc9497-tfchb with node ID worker-a800-1
I0416 08:18:04.345710       1 route.go:131] Into webhookfunc
I0416 08:18:04.439145       1 request.go:629] Waited for 463.271887ms due to client-side throttling, not priority and fairness, request: PATCH:https://10.233.0.1:443/api/v1/namespaces/kebe/pods/convert-model2-worker-0

I believe that when the HAMI-device-plugin crashes, hami.io/node-handshake should be set to Deleted_.
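
To make the expected behavior concrete, here is a minimal sketch (not HAMi's implementation) of a scheduler-side check that treats a Requesting_ handshake older than an assumed timeout as stale and patches it to Deleted_. The annotation key and timestamp layout are taken from the node YAML above; the 5-minute timeout and the client-go wiring are assumptions.

// Illustrative sketch only: NOT the actual HAMi code. It shows one way the
// scheduler side could detect a stale "hami.io/node-handshake" annotation and
// flip it to "Deleted_..." when the device plugin has not responded in time.
package main

import (
	"context"
	"fmt"
	"strings"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

const (
	handshakeAnnotation = "hami.io/node-handshake"
	timeLayout          = "2006.01.02 15:04:05" // matches "Requesting_2024.04.16 05:52:15"
	handshakeTimeout    = 5 * time.Minute       // assumed timeout, not taken from HAMi
)

// markStaleHandshakes patches Requesting_ annotations older than the timeout to Deleted_.
func markStaleHandshakes(ctx context.Context, client kubernetes.Interface) error {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, node := range nodes.Items {
		val, ok := node.Annotations[handshakeAnnotation]
		if !ok || !strings.HasPrefix(val, "Requesting_") {
			continue
		}
		ts, err := time.Parse(timeLayout, strings.TrimPrefix(val, "Requesting_"))
		if err != nil {
			continue // unrecognized timestamp, leave it alone
		}
		if time.Since(ts) < handshakeTimeout {
			continue // the device plugin may still answer
		}
		// Mark the handshake as Deleted_ so the node's devices are no longer offered.
		patch := fmt.Sprintf(
			`{"metadata":{"annotations":{%q:%q}}}`,
			handshakeAnnotation,
			"Deleted_"+time.Now().Format(timeLayout),
		)
		if _, err := client.CoreV1().Nodes().Patch(
			ctx, node.Name, types.MergePatchType, []byte(patch), metav1.PatchOptions{},
		); err != nil {
			return err
		}
	}
	return nil
}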

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g. /etc/docker/daemon.json)
  • The vgpu-device-plugin container logs
  • The vgpu-scheduler container logs
  • The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Any relevant kernel output lines from dmesg

Hi @chaunceyjiang,
Thanks for opening an issue!
We will look into it as soon as possible.


Instructions for interacting with me using comments are available here.
If you have questions or suggestions related to my behavior, please file an issue against the gh-ci-bot repository.
