fix: When a node cannot bind, it should stop scheduling pods to that node #273

Merged 1 commit on Apr 17, 2024

Conversation

Contributor

@chaunceyjiang chaunceyjiang commented Apr 16, 2024

What type of PR is this?
/kind bug

What this PR does / why we need it:

When a pod cannot be bound to a node (for example, because the node is still locked), the scheduler should stop scheduling pods to that node.

Which issue(s) this PR fixes:
Part of #272

Special notes for your reviewer:

While the node lock has not yet expired:

kube-scheduler log:

kube-scheduler I0417 07:44:16.769658       1 schedule_one.go:794] "Failed to bind pod" pod="default/gpu-brun-high-5959f868c8-4qnlr"
kube-scheduler E0417 07:44:16.769797       1 scheduler.go:367] "Error scheduling pod; retrying" err="binding rejected: node node6 has been locked within 5 minutes" pod="default/gpu-brun-high-5959f868c8-4qnlr"
kube-scheduler I0417 07:44:16.769966       1 schedule_one.go:847] "Updating pod condition" pod="default/gpu-brun-high-5959f868c8-4qnlr" conditionType=PodScheduled conditionStatus=False conditionReason="SchedulerError"

Pod status:

status:
  phase: Pending
  conditions:
    - type: PodScheduled
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2024-04-17T07:43:57Z'
      reason: SchedulerError
      message: 'binding rejected: node node6 has been locked within 5 minutes'
  qosClass: Guaranteed

After the lock expires:

I0417 07:47:34.906040       1 scheduler.go:323] "Bind" pod="gpu-brun-high-5959f868c8-4qnlr" namespace="default" podUID="7d5af6d4-a7fe-493d-ae4b-ef259cac9e99" node="node6"
I0417 07:47:34.919515       1 nodelock.go:96] "Node lock expired" node="node6" lockTime="2024-04-17 07:42:25 +0000 UTC"
I0417 07:47:34.952036       1 nodelock.go:78] "Node lock released" node="node6"
I0417 07:47:34.983562       1 nodelock.go:46] "Node lock set" node="node6"
I0417 07:47:34.998518       1 util.go:228] "Decoded pod annos" poddevices={"Iluvatar":[[{"Idx":0,"UUID":"GPU-e290caca-2f0c-9582-acab-67a142b61ffa","Type":"NVIDIA","Usedmem":1000,"Usedcores":15}]],"NVIDIA":[[{"Idx":0,"UUID":"GPU-e290caca-2f0c-9582-acab-67a142b61ffa","Type":"NVIDIA","Usedmem":1000,"Usedcores":15}
I0417 07:47:35.003494       1 scheduler.go:364] After Binding Process

Does this PR introduce a user-facing change?:

…node.

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

Hi @chaunceyjiang,
Thanks for your pull request!
If the PR is ready, use the /auto-cc command to assign a reviewer.
We will review it shortly.

Instructions for interacting with me using comments are available here.
If you have questions or suggestions related to my behavior, please file an issue against the gh-ci-bot repository.

@github-actions github-actions bot added the kind/bug Something isn't working label Apr 16, 2024
@chaunceyjiang
Contributor Author

/cc @wawa0210 @haitwang-cloud @archlitchi PTAL.

@wawa0210
Member

/lgtm looks good to me

cc @archlitchi

@github-actions github-actions bot added the lgtm label Apr 17, 2024
@archlitchi
Collaborator

thanks /lgtm

@archlitchi archlitchi merged commit 4e9dedb into Project-HAMi:master Apr 17, 2024
6 checks passed
@chaunceyjiang chaunceyjiang deleted the bind branch April 17, 2024 11:13
Labels: kind/bug, lgtm
3 participants