Pod rejection errors and unexpected MIG slice counts #96
I encounter the same node capacity mismatch issue (nvidia.com/mig-1g.5gb count vs org.instaslice/* count) on kind 1.30.4: kind create cluster --image kindest/node:v1.30.4@sha256:976ea815844d5fa93be213437e3ff5754cd599b040946b5cca43ca45c2047114
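To confirm the mismatch, I compare the two counts directly on the node. This is a hedged check of my own (it assumes a single fake-GPU node and the 1g.5gb profile used by the demo), not something from the InstaSlice docs:

```sh
# Compare the advertised MIG slice capacity with the number of org.instaslice/*
# entries in each node's capacity; the two counts should stay in sync.
kubectl get nodes -o json | jq '.items[]
  | {node: .metadata.name,
     mig_1g_5gb: .status.capacity["nvidia.com/mig-1g.5gb"],
     instaslice_entries: (.status.capacity | to_entries
                          | map(select(.key | startswith("org.instaslice/"))) | length)}'
```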
My suspicion at this point is that in this scenario InstaSlice may destroy the slice intended for a pod right after ungating the pod, leading to the 3 scenarios I have observed:
In these emulated scenarios, the creation/deletion of the slice boils down to the addition and removal of one unit of the MIG resource in the node capacity.
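As a rough sketch of what that means in practice (an assumption about emulator mode on my part, not InstaSlice's actual code path), creating or destroying a slice amounts to bumping the extended-resource count on the node's status, e.g.:

```sh
# Sketch only: add/remove one emulated MIG slice by editing the node's reported
# capacity through the status subresource. Node name and value are placeholders.
kubectl patch node kind-worker --subresource=status --type=json \
  -p '[{"op": "replace", "path": "/status/capacity/nvidia.com~1mig-1g.5gb", "value": "7"}]'
```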
Example of scenario 2:
It should not be possible for a pod to be pending, i.e., ungated by InstaSlice, while the count of MIG slices on the node is 0.
Example of scenario 3:
/cc @sairameshv
Thanks for this issue. Scenario 2 can be reproduced and we have a PR for it: #99. Scenario 3 could result from a dangling org.instaslice resource on the node, which causes the emulator to perform excessive deletes.
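For reference, one way to spot such dangling entries (a hedged sketch, not an official InstaSlice diagnostic) is to dump the org.instaslice/* keys each node still advertises:

```sh
# List org.instaslice/* extended resources still present in each node's capacity.
kubectl get nodes -o json | jq '.items[]
  | {node: .metadata.name,
     instaslice: (.status.capacity
                  | with_entries(select(.key | startswith("org.instaslice/"))))}'
```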
Solved by #121
With the new design changes, I am not sure if this issue is still valid.
When following the Kueue+InstaSlice demo #92, I expect 7 pods to run to completion, with no more than 3 pods running at a time due to the configuration of the cluster queue. While I observed no violation of the max concurrency, in many instances I observed a couple of unexpected outcomes:

1. Pods fail with status OutOfnvidia.com/mig-1g.5gb and the message: Pod was rejected: Node didn't have enough resource: nvidia.com/mig-1g.5gb, requested: 1, used: 0, capacity: 0
2. The count of nvidia.com/mig-1g.5gb resources in the node capacity no longer matches the count of org.instaslice/* resources in the node capacity. More specifically, the former is lower than the latter and lower than the number of slices required to run the pending ungated pods.

This is running Kind 1.31 on Rancher Desktop 1.15.1 on macOS Sonoma. The demo scenario relies on fake GPUs and InstaSlice emulator mode.
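For what it's worth, the rejected pods can be surfaced with something like the following (a hedged sketch based on how the kubelet reports OutOf<resource> rejections, not part of the demo itself):

```sh
# List pods rejected by the kubelet with an OutOf<resource> reason and show the message.
kubectl get pods -A -o json | jq -r '.items[]
  | select((.status.reason // "") | startswith("OutOf"))
  | "\(.metadata.namespace)/\(.metadata.name): \(.status.reason) - \(.status.message)"'
```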