[Enhancement]: Ensure API Server availability recording rules are consistent #975

Open
4 tasks done
lorenzofelletti opened this issue Sep 9, 2024 · 0 comments


What's the general idea for the enhancement?

The API Server availability recording rules do not guarantee consistency between cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase*{le="+Inf"} and cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase*.

This happens because Prometheus does not guarantee that rules in the same rule group are evaluated against the same data; it only guarantees that they are evaluated in order.
On several occasions I have seen apiserver_request:availability30d go above 100%, and I tracked the cause down to this inconsistency.

The following plot compares
cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h{cluster="mycluster", le="+Inf", verb="GET", scope="resource"} (green line) with cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h{cluster="mycluster", verb="GET", scope="resource"} (yellow line):
[Image: discrepancy]

As the image shows, the yellow line (*_count) is always below or equal to the green one (*_bucket{le="+Inf"}). In the code, the count rules come before the bucket rules. The observed behaviour is therefore consistent with the *_count rule being evaluated before the *_bucket rule, and thus on "less" data.
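
To make the link to an availability above 100% concrete, here is a simplified model of the availability expression. It is an assumption for illustration, not the exact rule: $C$ is the recorded `*_count` increase, $B_{\text{fast}}$ is the recorded `*_bucket` increase for the latency threshold, $E$ is the number of errored requests, and $T$ is the total number of requests. The numbers are made up.

$$
A = 1 - \frac{(C - B_{\text{fast}}) + E}{T}
$$

If almost every request is fast and the bucket rule is evaluated on slightly more data than the count rule, $B_{\text{fast}}$ can exceed $C$. For example, with $C = 1000$, $B_{\text{fast}} = 1004$, $E = 0$ and $T = 1000$:

$$
A = 1 - \frac{(1000 - 1004) + 0}{1000} = 1 + 0.004 = 1.004
$$

The "too slow" term becomes negative, which pushes the result above 100%.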

My suggestion is to change the availability recording rules so that the *_bucket rules are evaluated first, and the *_count rules are then derived from the value of the *_bucket{le="+Inf"} recording rule just evaluated. This would enforce consistency between the two time series at the expression level.
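
A minimal sketch of what the reordered rules could look like for the 1h window, written as a plain Prometheus rule file. The group name, the `job="apiserver"` selector, and the exact aggregation labels are assumptions for illustration and may differ from what the mixin actually generates; the other `increase*` windows would follow the same pattern.

```yaml
groups:
  - name: kube-apiserver-availability.rules  # assumed group name
    rules:
      # 1. Record the bucket increase first, from the raw histogram buckets.
      - record: cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h
        expr: |
          sum by (cluster, verb, scope, le) (
            increase(apiserver_request_sli_duration_seconds_bucket{job="apiserver"}[1h])
          )
      # 2. Derive the count from the le="+Inf" series recorded by the rule above,
      #    instead of querying the raw *_count metric a second time.
      - record: cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h
        expr: |
          sum by (cluster, verb, scope) (
            cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h{le="+Inf"}
          )
```

Because rules in a group are evaluated in order, the second rule reads the value the first rule has just recorded, so the count can no longer drift from the `le="+Inf"` bucket. The trade-off is that the count rule now depends on the bucket rule living in the same group and staying ahead of it.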

Please provide any helpful snippets.

No response

What parts of the codebase does the enhancement target?

Rules

Anything else relevant to the enhancement that would help with the triage process?

Some reference links:

I agree to the following terms:

  • I agree to follow this project's Code of Conduct.
  • I have filled out all the required information above to the best of my ability.
  • I have searched the issues of this repository and believe that this is not a duplicate.
  • I have confirmed this proposal applies to the default branch of the repository, as of the latest commit at the time of submission.