[YUNIKORN-2978] Fix handling of reserved allocations where node differs #996
YUNIKORN-2700 introduced a bug where allocations of previously-reserved tasks were not handled correctly in the case where we schedule on a different node than the reservation. Ensure that we unreserve and allocate using the proper node in both cases. Also introduce additional logging of allocations on nodes to make finding issues like this easier in the future.
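To illustrate the intended behaviour, here is a minimal sketch of the handling described above. It is a toy model and not the YuniKorn code: `Result`, `Node`, `Cluster`, `Reserve` and `Apply` are hypothetical names made up for this example. The point it shows is that the reservation is released on the node that holds it, while the allocation is recorded on the node the scheduler actually chose, and both steps are logged.

```go
package main

import "fmt"

// ResultType is a hypothetical stand-in for the scheduler's result kinds.
type ResultType int

const (
	Allocated ResultType = iota
	AllocatedReserved
	Unreserved
)

// Result is a simplified, hypothetical allocation result.
type Result struct {
	Type          ResultType
	AllocationKey string
	NodeID        string // node chosen for the allocation
}

// Node tracks which allocation keys are reserved or allocated on it.
type Node struct {
	reserved  map[string]bool
	allocated map[string]bool
}

// Cluster is a toy model of a set of nodes keyed by node ID.
type Cluster struct {
	nodes map[string]*Node
}

// NewCluster creates a cluster with one empty node per ID.
func NewCluster(nodeIDs ...string) *Cluster {
	c := &Cluster{nodes: make(map[string]*Node)}
	for _, id := range nodeIDs {
		c.nodes[id] = &Node{reserved: map[string]bool{}, allocated: map[string]bool{}}
	}
	return c
}

// Reserve records a reservation for key on nodeID.
func (c *Cluster) Reserve(key, nodeID string) error {
	n, ok := c.nodes[nodeID]
	if !ok {
		return fmt.Errorf("unknown node %s", nodeID)
	}
	n.reserved[key] = true
	return nil
}

// reservedNodeID returns the node holding a reservation for key, or "".
func (c *Cluster) reservedNodeID(key string) string {
	for id, n := range c.nodes {
		if n.reserved[key] {
			return id
		}
	}
	return ""
}

// Apply releases any reservation on the node that holds it and records
// the allocation on the node named in the result.
func (c *Cluster) Apply(r Result) error {
	if r.Type == Unreserved || r.Type == AllocatedReserved {
		reservedNodeID := c.reservedNodeID(r.AllocationKey)
		if reservedNodeID == "" {
			return fmt.Errorf("no reservation found for %s", r.AllocationKey)
		}
		// The reserving node may differ from r.NodeID.
		delete(c.nodes[reservedNodeID].reserved, r.AllocationKey)
		fmt.Printf("unreserved %s on %s\n", r.AllocationKey, reservedNodeID)
	}
	if r.Type == Allocated || r.Type == AllocatedReserved {
		target, ok := c.nodes[r.NodeID]
		if !ok {
			return fmt.Errorf("unknown node %s", r.NodeID)
		}
		target.allocated[r.AllocationKey] = true
		fmt.Printf("allocated %s on %s\n", r.AllocationKey, r.NodeID)
	}
	return nil
}

func main() {
	c := NewCluster("node-1", "node-2")
	if err := c.Reserve("alloc-1", "node-1"); err != nil {
		fmt.Println("error:", err)
		return
	}
	// Scheduled on node-2 even though the reservation sits on node-1.
	r := Result{Type: AllocatedReserved, AllocationKey: "alloc-1", NodeID: "node-2"}
	if err := c.Apply(r); err != nil {
		fmt.Println("error:", err)
	}
}
```

In this model a stale reservation would remain on `node-1` if the unreserve step used `r.NodeID` instead of looking up the reserving node, which is the kind of mismatch this PR addresses.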
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
##            master     #996    +/-
  Coverage  81.34%   81.34%
  Files         97       97
  Lines      15590    15620    +30
+ Hits       12681    12706    +25
- Misses      2630     2634     +4
- Partials     279      280     +1
Is it possible to have at least a single unit test that fails with the old code and passes with this PR?
alloc := result.Request
targetNodeID := result.NodeID
var srcNodeID string
Rename this to reservedNodeID, which is more logical than srcNodeID.
The reserved node is only used if the result is Unreserved or AllocatedReserved. This retrieval and the logging should be moved inside that check (line 911) so they do not clutter the rest of the code.
I think the missing bit is just a single line:
We need a unit test, and it should be doable to create one:
Before the fix, the allocation will show the reserved node ID or none at all.
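As a rough sketch of the shape such a regression check could take (not the actual YuniKorn test): the types and the functions `recordBuggy`, `recordFixed` and `checkAllocatedNode` below are hypothetical, and `recordBuggy` only mimics the symptom described above, where the allocation ends up on the reserved node. The check fails for the buggy variant and passes for the fixed one.

```go
package main

import "fmt"

// reservedAlloc is a hypothetical, minimal view of an allocation that
// holds a reservation on one node.
type reservedAlloc struct {
	key            string
	reservedNodeID string
}

// recordBuggy mimics the symptom described in the review: the allocation
// is recorded against the reserved node instead of the target node.
func recordBuggy(a reservedAlloc, targetNodeID string) string {
	return a.reservedNodeID
}

// recordFixed records the allocation against the node it was scheduled on.
func recordFixed(a reservedAlloc, targetNodeID string) string {
	return targetNodeID
}

// checkAllocatedNode reserves on node-1, schedules on node-2, and returns
// an error if the allocation is not recorded on node-2.
func checkAllocatedNode(record func(reservedAlloc, string) string) error {
	a := reservedAlloc{key: "alloc-1", reservedNodeID: "node-1"}
	const target = "node-2"
	if got := record(a, target); got != target {
		return fmt.Errorf("allocation %s recorded on %q, want %q", a.key, got, target)
	}
	return nil
}

func main() {
	fmt.Println("buggy:", checkAllocatedNode(recordBuggy))
	fmt.Println("fixed:", checkAllocatedNode(recordFixed))
}
```

In a real unit test the same assertion would live in a `testing` test function and run against the partition code, with the reservation and the target node set up through the existing test helpers.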
What type of PR is it?
Todos
What is the Jira issue?
https://issues.apache.org/jira/browse/YUNIKORN-2978
How should this be tested?
Verified successful processing of 1000-pod job on autoscaled cluster where previously this would fail.
Screenshots (if appropriate)
Questions: