[YUNIKORN-2978] Fix handling of reserved allocations where node differs #996

craigcondit · 2024-11-15T20:16:46Z

What is this PR for?

YUNIKORN-2700 introduced a bug where allocations of previously-reserved tasks were not handled correctly in the case where we schedule on a different node than the reservation. Ensure that we unreserve and allocate using the proper node in both cases.

Also introduce additional logging of allocations on nodes to make finding issues like this easier in the future.

What type of PR is it?

Todos

- Task

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-2978

How should this be tested?

Verified successful processing of a 1000-pod job on an autoscaled cluster, where this previously failed.

Screenshots (if appropriate)

Questions:

- The license files need an update.
- There are breaking changes for older versions.
- It needs documentation.

codecov · 2024-11-15T20:19:18Z

Codecov Report

Attention: Patch coverage is 79.54545% with 9 lines in your changes missing coverage. Please review.

Project coverage is 81.34%. Comparing base (ac32595) to head (a731db7).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/scheduler/partition.go | 59.09% | 7 Missing and 2 partials ⚠️ |

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #996   +/-   ##
=======================================
  Coverage   81.34%   81.34%           
=======================================
  Files          97       97           
  Lines       15590    15620   +30     
=======================================
+ Hits        12681    12706   +25     
- Misses       2630     2634    +4     
- Partials      279      280    +1

pbacsko

Is it possible to have at least a single unit test that fails with the old code and passes with this PR?

wilfred-s · 2024-11-18T03:22:30Z

pkg/scheduler/partition.go

 	alloc := result.Request
+	targetNodeID := result.NodeID
+	var srcNodeID string


Renaming srcNodeID to reservedNodeID would be more logical.
The reserved node is only used if the result is Unreserved or AllocatedReserved. The retrieval and the logging should be moved inside that check (line 911) so that they do not clutter the rest of the code.

wilfred-s · 2024-11-18T03:29:42Z

> Is it possible to have at least a single unit test that fails with the old code and passes with this PR?

I think the missing bit is just a single line:

928       alloc.SetNodeID(targetNodeID)

We need a unit test, and it should be doable to create one:

- Fill up a node with allocations.
- Create a request that does not fit on the used node.
- Manually create a reservation for that request on the filled-up node.
- Run the normal allocation and get it to allocate on the "other" node.
- The new allocation should show the correct node.

Before the fix, the allocation will show the reserved node ID or none at all.
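To make the scenario above concrete, here is a minimal, self-contained sketch in Go. It is a toy model and not the yunikorn-core code: the `Partition`, `Result`, `Allocation`, and `ResultType` types and the `process` function are hypothetical stand-ins, and only the names `targetNodeID`, `reservedNodeID`, `SetNodeID`, `Unreserved`, and `AllocatedReserved` are borrowed from the discussion. It assumes the reservation lookup is scoped to the two reserved result types and that the allocation is always stamped with the node the scheduler picked.

```go
package main

import "fmt"

// ResultType is a hypothetical stand-in for the scheduler's result kinds.
type ResultType int

const (
	Allocated ResultType = iota
	AllocatedReserved
	Unreserved
)

// Allocation is a toy request model; it is not the YuniKorn type.
type Allocation struct {
	Key    string
	NodeID string
}

// SetNodeID records the node the allocation was placed on.
func (a *Allocation) SetNodeID(id string) { a.NodeID = id }

// Result carries the request and the node the scheduler actually picked.
type Result struct {
	Type    ResultType
	Request *Allocation
	NodeID  string
}

// Partition tracks reservations as allocation key -> reserved node ID.
type Partition struct {
	reservations map[string]string
}

// process applies a scheduling result to the partition.
func (p *Partition) process(result *Result) {
	alloc := result.Request
	targetNodeID := result.NodeID
	if result.Type == Unreserved || result.Type == AllocatedReserved {
		// The reserved node only matters for these two result types,
		// so the lookup and the logging stay inside this check.
		if reservedNodeID, ok := p.reservations[alloc.Key]; ok {
			fmt.Printf("releasing reservation: alloc=%s reserved=%s target=%s\n",
				alloc.Key, reservedNodeID, targetNodeID)
			delete(p.reservations, alloc.Key)
		}
		if result.Type == Unreserved {
			return // nothing was allocated, only the reservation was dropped
		}
	}
	// Always record the node that was picked, not the reserved one.
	alloc.SetNodeID(targetNodeID)
}

func main() {
	// node-1 is full and holds the reservation; the scheduler picks node-2.
	p := &Partition{reservations: map[string]string{"req-1": "node-1"}}
	alloc := &Allocation{Key: "req-1"}
	p.process(&Result{Type: AllocatedReserved, Request: alloc, NodeID: "node-2"})
	fmt.Println("allocated on:", alloc.NodeID)
}
```

A real unit test would build the same situation through the partition's own test helpers and assert both that the reservation on the full node is gone and that the allocation reports the other node.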

craigcondit self-assigned this Nov 15, 2024

craigcondit requested review from wilfred-s, pbacsko and manirajv06 November 15, 2024 21:23

pbacsko reviewed Nov 17, 2024

wilfred-s requested changes Nov 18, 2024
