[YUNIKORN-2978] Fix handling of reserved allocations where node differs #996

Open: wants to merge 1 commit into base: master
craigcondit (Contributor) commented Nov 15, 2024

What is this PR for?

YUNIKORN-2700 introduced a bug where allocations of previously-reserved tasks were not handled correctly in the case where we schedule on a different node than the reservation. Ensure that we unreserve and allocate using the proper node in both cases.

Also introduce additional logging of allocations on nodes to make finding issues like this easier in the future.

What type of PR is it?

  • - Bug Fix
  • - Improvement
  • - Feature
  • - Documentation
  • - Hot Fix
  • - Refactoring

Todos

  • - Task

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-2978

How should this be tested?

Verified successful processing of 1000-pod job on autoscaled cluster where previously this would fail.

Screenshots (if appropriate)

Questions:

  • - The license files need an update.
  • - There are breaking changes for older versions.
  • - It needs documentation.

craigcondit self-assigned this Nov 15, 2024

codecov bot commented Nov 15, 2024

Codecov Report

Attention: Patch coverage is 79.54545% with 9 lines in your changes missing coverage. Please review.

Project coverage is 81.34%. Comparing base (ac32595) to head (a731db7).

| Files with missing lines   | Patch % | Lines                       |
|----------------------------|---------|-----------------------------|
| pkg/scheduler/partition.go | 59.09%  | 7 Missing and 2 partials ⚠️ |

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #996   +/-   ##
=======================================
  Coverage   81.34%   81.34%           
=======================================
  Files          97       97           
  Lines       15590    15620   +30     
=======================================
+ Hits        12681    12706   +25     
- Misses       2630     2634    +4     
- Partials      279      280    +1     


pbacsko (Contributor) left a comment

Is it possible to have at least a single unit test that fails with the old code and passes with this PR?

alloc := result.Request
targetNodeID := result.NodeID
var srcNodeID string

Renaming srcNodeID to reservedNodeID would be more logical.
The reserved node is only used if the result is Unreserved or AllocatedReserved. This retrieval and the logging should be moved inside that check (line 911) so they do not clutter the rest of the code.
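
To illustrate the scoping suggestion, here is a minimal standalone sketch. Every type, constant, and field name in it is a hypothetical stand-in chosen for illustration and is not taken from pkg/scheduler/partition.go; it only shows the reserved node ID being read inside the branch that needs it.

```go
package main

import "fmt"

// Hypothetical stand-ins for the scheduler's result types (assumption, not the real API).
type resultKind int

const (
	allocated resultKind = iota
	allocatedReserved
	unreserved
)

type allocationResult struct {
	kind           resultKind
	nodeID         string // node chosen by the scheduling cycle
	reservedNodeID string // node that held the reservation, if any
}

// describe reports which nodes a result touches. The reserved node ID is
// only looked up inside the check for the two result kinds that use it.
func describe(result allocationResult) string {
	targetNodeID := result.nodeID
	if result.kind == unreserved || result.kind == allocatedReserved {
		reservedNodeID := result.reservedNodeID
		if result.kind == unreserved {
			return fmt.Sprintf("unreserve %s", reservedNodeID)
		}
		return fmt.Sprintf("unreserve %s, allocate on %s", reservedNodeID, targetNodeID)
	}
	return fmt.Sprintf("allocate on %s", targetNodeID)
}

func main() {
	fmt.Println(describe(allocationResult{kind: allocatedReserved, nodeID: "node-2", reservedNodeID: "node-1"}))
	fmt.Println(describe(allocationResult{kind: allocated, nodeID: "node-2"}))
}
```

With this shape, the plain allocation path never references the reserved node at all.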

wilfred-s (Contributor)

> Is it possible to have at least a single unit test that fails with the old code and passes with this PR?

I think the missing bit is just a single line:

928       alloc.SetNodeID(targetNodeID)

We need a unit test, and it should be doable to create one:

  • fill up a node with allocations
  • create a request that does not fit on the used node
  • manually create a reservation for that request on that filled up node
  • run the normal allocation and get it to allocate on the "other" node.
  • the new allocation should show the correct node.

Before the fix, the allocation will show the reserved node ID or none at all.
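
The scenario above can be modelled in a small self-contained sketch. This is a toy model under stated assumptions, not the YuniKorn test harness: the node, allocation, and result types, the schedule and apply helpers, and the setNodeID switch are all hypothetical names invented for illustration. The switch stands in for the presence or absence of the single SetNodeID line.

```go
package main

import "fmt"

// All types and helpers here are hypothetical, for illustration only.
type node struct {
	id       string
	capacity int
	used     int
	reserved map[string]bool // allocation keys reserved on this node
}

type allocation struct {
	key    string
	size   int
	nodeID string // empty until the allocation is bound to a node
}

type result struct {
	request        *allocation
	nodeID         string // node picked by this scheduling pass
	reservedNodeID string // node holding the reservation, "" if none
}

// schedule records any existing reservation and picks the first node with room.
func schedule(ask *allocation, nodes []*node) *result {
	res := &result{request: ask}
	for _, n := range nodes {
		if n.reserved[ask.key] {
			res.reservedNodeID = n.id
		}
	}
	for _, n := range nodes {
		if n.used+ask.size <= n.capacity {
			res.nodeID = n.id
			return res
		}
	}
	return nil
}

// apply unreserves on the reserved node and allocates on the chosen node.
// setNodeID models whether the allocation is updated with the target node.
func apply(res *result, nodes map[string]*node, setNodeID bool) {
	if res.reservedNodeID != "" {
		delete(nodes[res.reservedNodeID].reserved, res.request.key)
	}
	nodes[res.nodeID].used += res.request.size
	if setNodeID {
		res.request.nodeID = res.nodeID
	}
}

// runScenario follows the steps listed above and returns the node ID shown
// on the allocation and whether the old reservation is still in place.
func runScenario(setNodeID bool) (string, bool) {
	full := &node{id: "node-1", capacity: 10, used: 10, reserved: map[string]bool{}}
	other := &node{id: "node-2", capacity: 10, reserved: map[string]bool{}}
	ask := &allocation{key: "alloc-1", size: 5}
	full.reserved[ask.key] = true // manual reservation on the filled-up node
	res := schedule(ask, []*node{full, other})
	if res == nil {
		return "", full.reserved[ask.key]
	}
	apply(res, map[string]*node{full.id: full, other.id: other}, setNodeID)
	return ask.nodeID, full.reserved[ask.key]
}

func main() {
	for _, fixed := range []bool{false, true} {
		nodeID, stillReserved := runScenario(fixed)
		fmt.Printf("fixed=%v nodeID=%q stillReserved=%v\n", fixed, nodeID, stillReserved)
	}
}
```

In this model, running without the node ID update leaves the allocation with no node at all, which corresponds to one of the two symptoms described above; a real unit test would assert against the partition's actual allocation objects instead.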
