Pod rejection errors and unexpected MIG slice counts #96

Closed
tardieu opened this issue Sep 13, 2024 · 9 comments
@tardieu
Contributor

tardieu commented Sep 13, 2024

When following the Kueue+InstaSlice demo #92, I expect 7 pods to run to completion with no more than 3 pods running at a time due to the configuration of the cluster queue. While I observed no violation of the max concurrency, in many runs I observed two unexpected outcomes:

  1. One or several pods report the status OutOfnvidia.com/mig-1g.5gb with the message Pod was rejected: Node didn't have enough resource: nvidia.com/mig-1g.5gb, requested: 1, used: 0, capacity: 0
  2. The count of nvidia.com/mig-1g.5gb resources in the node capacity no longer matches the count of org.instaslice/* resources in the node capacity. More specifically, the former is lower than the latter and lower than the number of slices required to run the pending, ungated pods (a quick check is sketched below).

This is running kind 1.31 on Rancher Desktop 1.15.1 on macOS Sonoma. The demo scenario relies on fake GPUs and InstaSlice emulator mode.
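For reference, here is one way to eyeball the mismatch from the node status. The jq query is only an illustrative sketch; the resource names are the ones that appear in the capacity output further down and the node name is the kind default.

# Compare the advertised MIG slice count against the number of org.instaslice/* entries.
kubectl get node kind-control-plane -o json | jq '{
  mig_slices: (.status.capacity["nvidia.com/mig-1g.5gb"] // "0"),
  instaslice_entries: (.status.capacity | keys | map(select(startswith("org.instaslice/"))) | length)
}'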

@tardieu
Contributor Author

tardieu commented Sep 13, 2024

I encounter the same node capacity mismatch issue (nvidia.com/mig-1g.5gb count vs org.instaslice/* count) on kind 1.30.4:

kind create cluster --image kindest/node:v1.30.4@sha256:976ea815844d5fa93be213437e3ff5754cd599b040946b5cca43ca45c2047114

@tardieu
Contributor Author

tardieu commented Sep 13, 2024

My suspicion at this point is that, in this scenario, InstaSlice may destroy the slice intended for a pod right after ungating the pod, leading to the three scenarios I have observed:

  1. The pod was scheduled and started by the kubelet before the deletion. It appears to run successfully, but only because the untimely deletion of the slice does not affect the sleep command running in the pod.
  2. The pod remains pending because the slice was deleted before the scheduler had a chance to schedule the pod.
  3. The pod was scheduled, but the kubelet reports an error because the slice was gone between the scheduler binding the pod and the kubelet admitting it.

In these emulated scenarios, the creation/deletion of the slice boils down to the addition and removal of one unit of the MIG resource in the node capacity.
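To illustrate what that amounts to (this is not InstaSlice's actual implementation, just the kind of node-status mutation the emulator effectively performs, assuming a kubectl recent enough to support --subresource):

# "Create" a slice: add one unit of the extended resource to the node status.
kubectl patch node kind-control-plane --subresource=status --type=json \
  -p='[{"op":"add","path":"/status/capacity/nvidia.com~1mig-1g.5gb","value":"1"}]'
# "Delete" it again: drop the count back down.
kubectl patch node kind-control-plane --subresource=status --type=json \
  -p='[{"op":"replace","path":"/status/capacity/nvidia.com~1mig-1g.5gb","value":"0"}]'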

@tardieu
Contributor Author

tardieu commented Sep 13, 2024

Example of scenario 2:

tardieu@indigo:instaslice-operator$ kubectl get node kind-control-plane -o json | jq .status.capacity; kubectl get pods
{
  "cpu": "8",
  "ephemeral-storage": "102625208Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "hugepages-32Mi": "0",
  "hugepages-64Ki": "0",
  "memory": "16351912Ki",
  "nvidia.com/accelerator-memory": "80Gi",
  "nvidia.com/mig-1g.5gb": "0",
  "org.instaslice/a5870bf6-1307-4ca9-b36d-d8f5937042c3": "1",
  "pods": "110"
}
NAME   READY   STATUS      RESTARTS   AGE
p1     0/1     Pending     0          21m
p2     0/1     Completed   0          21m
p3     0/1     Completed   0          21m
p4     0/1     Completed   0          21m
p5     0/1     Completed   0          21m
p6     0/1     Completed   0          21m
p7     0/1     Completed   0          21m

It should not be possible for a pod to be pending, i.e., already ungated by InstaSlice, while the count of MIG slices is 0.
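A quick way to check that invariant (illustrative jq only; the resource and node names are the ones from the demo):

# Ungated Pending pods requesting the MIG resource...
kubectl get pods -o json | jq '[.items[]
  | select(.status.phase == "Pending")
  | select((.spec.schedulingGates // []) | length == 0)
  | select(any(.spec.containers[]; .resources.requests["nvidia.com/mig-1g.5gb"] != null))] | length'
# ...should never outnumber the advertised slices:
kubectl get node kind-control-plane -o json | jq -r '.status.capacity["nvidia.com/mig-1g.5gb"]'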

@tardieu
Contributor Author

tardieu commented Sep 13, 2024

Example of scenario 3:

tardieu@indigo:instaslice-operator$ kubectl get node kind-control-plane -o json | jq .status.capacity; kubectl get pods
{
  "cpu": "8",
  "ephemeral-storage": "102625208Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "hugepages-32Mi": "0",
  "hugepages-64Ki": "0",
  "memory": "16351912Ki",
  "nvidia.com/accelerator-memory": "80Gi",
  "nvidia.com/mig-1g.5gb": "2",
  "org.instaslice/171f3a10-1663-4cc9-992e-a9141734ea21": "1",
  "org.instaslice/17b63565-c7a8-489d-8491-d9a165ad4745": "1",
  "org.instaslice/8e2a5159-ae95-4eb5-8898-02081732f1b0": "1",
  "org.instaslice/8f9ddcf2-1caf-4979-b051-477c88a3149a": "1",
  "pods": "110"
}
NAME   READY   STATUS                       RESTARTS   AGE
p1     0/1     SchedulingGated              0          2m33s
p2     0/1     Completed                    0          2m33s
p3     0/1     Completed                    0          2m33s
p4     0/1     SchedulingGated              0          2m33s
p5     0/1     OutOfnvidia.com/mig-1g.5gb   0          2m33s
p6     0/1     Completed                    0          2m33s
p7     1/1     Running                      0          2m33s
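For what it's worth, the kubelet's full rejection reason and message for a pod like p5 can be pulled from the pod status (jsonpath sketch):

kubectl get pod p5 -o jsonpath='{.status.reason}: {.status.message}{"\n"}'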

@harche
Contributor

harche commented Sep 16, 2024

/cc @sairameshv

@asm582
Contributor

asm582 commented Sep 17, 2024

Thanks for this issue. Scenario 2 can be reproduced and we have a PR for it: #99. Scenario 3 could result from a dangling org.instaslice resource on the node, which causes the emulator to perform excessive deletes.
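If it helps with debugging, any dangling org.instaslice entries can be listed straight from the node capacity (illustrative jq; the UUIDs differ per run):

kubectl get node kind-control-plane -o json \
  | jq '.status.capacity | with_entries(select(.key | startswith("org.instaslice/")))'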

@asm582
Contributor

asm582 commented Sep 23, 2024

Solved by #121

@harche
Contributor

harche commented Sep 23, 2024

Solved by #121

/assign @asm582

@asm582
Contributor

asm582 commented Oct 8, 2024

With the new design changes, I am not sure this issue is still valid.

@asm582 asm582 closed this as completed Oct 11, 2024