[YUNIKORN-2855] Rethink the call for update metrics with nil resource #960

Closed
wants to merge 6 commits into from

@zhuqi-lucas zhuqi-lucas commented Sep 6, 2024

What is this PR for?

Currently, the resource metrics are not updated when a resource is set to nil. This PR fixes that.

What type of PR is it?

  • - Bug Fix
  • - Improvement
  • - Feature
  • - Documentation
  • - Hot Fix
  • - Refactoring

Todos

  • - Task

What is the Jira issue?

How should this be tested?

Screenshots (if appropriate)

Questions:

  • - The license files need an update.
  • - There are breaking changes for older versions.
  • - It needs documentation.

codecov bot commented Sep 6, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 3 lines in your changes missing coverage. Please review.

Project coverage is 80.96%. Comparing base (9e10746) to head (d520339).
Report is 2 commits behind head on master.

Files with missing lines | Patch % | Lines
pkg/metrics/queue.go | 76.92% | 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #960      +/-   ##
==========================================
+ Coverage   80.95%   80.96%   +0.01%     
==========================================
  Files          97       97              
  Lines       12514    12527      +13     
==========================================
+ Hits        10131    10143      +12     
- Misses       2113     2115       +2     
+ Partials      270      269       -1     

@zhuqi-lucas zhuqi-lucas self-assigned this Sep 6, 2024
@zhuqi-lucas zhuqi-lucas changed the title from "[YUNIKORN-2855] Remove the call for update metrics with nil resource" to "[YUNIKORN-2855] Rethink the call for update metrics with nil resource" Sep 6, 2024
@craigcondit craigcondit left a comment

I'm -1 on this approach. This adds significant complexity to the code and requires all callers to do the calculations... make the metrics functions handle the nil case properly themselves.

@zhuqi-lucas

Thanks @craigcondit for the review. I reworked the code in a more reasonable way: it now sets a zero resource in the nil update case.

@craigcondit craigcondit left a comment

Maybe I'm missing something entirely... Why are we setting arbitrary resources to zero here? What is it we're trying to accomplish?

@zhuqi-lucas zhuqi-lucas commented Sep 10, 2024

@craigcondit, we support setting the max resource or the guaranteed resource to nil, but the metrics update is wrong in that case: the metric keeps its previous value instead of being updated to zero/nil. For example:

sq.maxResource = nil

sq.guaranteedResource = nil
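
A minimal sketch of the problem described above, assuming the metric is refreshed by looping over the new resource map. The `gauge` type and `naiveUpdate` function are hypothetical stand-ins for illustration, not YuniKorn code:

```go
package main

import "fmt"

// gauge is a hypothetical stand-in for a labelled metrics gauge:
// resource name -> last value emitted.
type gauge map[string]float64

// naiveUpdate writes only the entries present in res.
// With a nil or empty map it writes nothing, so old values stay in place.
func naiveUpdate(g gauge, res map[string]float64) {
	for name, value := range res {
		g[name] = value
	}
}

func main() {
	maxMetric := gauge{}
	naiveUpdate(maxMetric, map[string]float64{"memory": 1024, "vcore": 8})

	// The queue's max resource is cleared, which corresponds to a nil map here.
	naiveUpdate(maxMetric, nil)

	// The gauge still reports the values from before the reset.
	fmt.Println(maxMetric["memory"], maxMetric["vcore"])
}
```

In this sketch the gauge keeps reporting 1024 and 8 after the reset, which is the stale-metric behaviour the comment describes.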

@craigcondit craigcondit commented Sep 10, 2024

Then it seems the proper solution is to retrieve the previous value of the metric and update the values there (an "unprune" so to speak). Missing values in the new map should be set to zero. Setting arbitrary resources is just wrong. Doing it this way also has the advantage of ensuring that any resource types that were emitted by the metric previously are still emitted, but with zero instead of missing.
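
The "unprune" idea can be sketched roughly as follows. This assumes the collector can enumerate the resource names it has already emitted; the `queueGauge` type and its `update` method are hypothetical and use a plain map in place of a real metrics gauge:

```go
package main

import (
	"fmt"
	"sort"
)

// queueGauge is a hypothetical stand-in for the gauge held by the collector.
type queueGauge struct {
	values map[string]float64 // resource name -> last emitted value
}

func newQueueGauge() *queueGauge {
	return &queueGauge{values: make(map[string]float64)}
}

// update emits every entry in res, and emits zero for any resource type
// that was emitted before but is missing from res (res may be nil).
func (g *queueGauge) update(res map[string]float64) {
	// Zero out previously emitted types that are absent from the new map.
	for name := range g.values {
		if _, ok := res[name]; !ok {
			g.values[name] = 0
		}
	}
	// Write the new values.
	for name, value := range res {
		g.values[name] = value
	}
}

func main() {
	g := newQueueGauge()
	g.update(map[string]float64{"memory": 1024, "vcore": 8})
	g.update(nil) // resource cleared: both types are still emitted, as zero

	names := make([]string, 0, len(g.values))
	for name := range g.values {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s=%v\n", name, g.values[name])
	}
}
```

Because the zero-filling lives inside `update`, callers pass whatever map they have, including nil, and need no extra logic of their own.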

@zhuqi-lucas

Thanks @craigcondit, I like the unprune idea; I will address it this way.

@craigcondit craigcondit left a comment

Just... no. You already have the metric in the collector itself. Read it out to get the list of values that are previously set and build a new metric using a combination of the old and new values. There should be no code changes outside the collector at all.

@zhuqi-lucas

I see... @craigcondit let me change my code.

@zhuqi-lucas

Addressed in latest PR:

  1. Get the existing metrics.
  2. Set those fields to zero.

@craigcondit craigcondit left a comment

The code is still too complex. See review comments.

pkg/scheduler/objects/queue.go (outdated, resolved)
pkg/scheduler/objects/queue.go (resolved)
}

func (m *QueueMetrics) SetQueueMaxResourceMetrics(resourceName string, value float64) {
m.setQueueResource(QueueMax, resourceName, value)
func (m *QueueMetrics) SetQueueNilResourceMetrics(state string) {
Contributor

This should be made an internal function once logic is moved here from scheduler/objects/queue.go.

pkg/metrics/queue.go (outdated, resolved)
@craigcondit craigcondit left a comment

Add locking per comments.

pkg/metrics/queue.go (resolved)
pkg/metrics/queue.go (resolved)
pkg/metrics/queue.go (resolved)
@craigcondit craigcondit left a comment

Change lock type.

knownResourceTypes map[string]struct{}
lock locking.RWMutex
Contributor

This doesn't need to be an RWMutex as we never lock for reads.
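
A small sketch of the point about the lock type, using the standard library's `sync.Mutex` as a stand-in for the project's `locking` package (an assumption made so the example is self-contained). Only the `knownResourceTypes` field name comes from the diff; the `resourceTracker` type and `record` method are hypothetical:

```go
package main

import (
	"fmt"
	"sync"
)

// resourceTracker is a hypothetical holder for the set of resource types
// that have been emitted so far.
type resourceTracker struct {
	knownResourceTypes map[string]struct{}
	lock               sync.Mutex // plain mutex: no caller takes a read-only lock
}

func newResourceTracker() *resourceTracker {
	return &resourceTracker{knownResourceTypes: make(map[string]struct{})}
}

// record adds the given names to the set and returns how many types are known.
// Every access goes through the exclusive lock, so RLock would never be used.
func (t *resourceTracker) record(names ...string) int {
	t.lock.Lock()
	defer t.lock.Unlock()
	for _, name := range names {
		t.knownResourceTypes[name] = struct{}{}
	}
	return len(t.knownResourceTypes)
}

func main() {
	tracker := newResourceTracker()
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			tracker.record("memory", "vcore")
		}()
	}
	wg.Wait()
	fmt.Println(tracker.record()) // number of distinct types recorded
}
```

A read-write lock pays off only when some callers take the read lock; when every path acquires the lock exclusively, a plain mutex states the intent more directly.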

Contributor Author

Good catch, addressed in latest PR, thanks!

@craigcondit craigcondit left a comment

+1 LGTM.
