[Enhancement]: Ensure API Server availability recording rules are consistent #975

lorenzofelletti · 2024-09-09T10:51:19Z

What's the general idea for the enhancement?

The API Server availability recording rules do not guarantee that cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase*{le="+Inf"} and cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase* stay consistent with each other, even though both should report the same number of requests.

This happens because Prometheus does not guarantee that rules in the same rule group are evaluated against the same data; it only guarantees that they are evaluated in order.
I have seen apiserver_request:availability30d go above 100% several times, and I tracked it down to this behaviour.

This is an example of plotting
cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h{cluster="mycluster", le="+Inf", verb="GET", scope="resource"} (green line) vs cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h{cluster="mycluster", verb="GET", scope="resource"} (yellow line):

As the image shows, the yellow line (*_count) is always below or equal to the green one (*_bucket{le="+Inf"}). In the code, the count rules come before the bucket ones. The observed behaviour is therefore consistent with the *_count rule being evaluated before the *_bucket rule, and thus on "less" data.

My suggestion is to change the availability recording rules so that the *_bucket rules are evaluated first, and the *_count rules are then derived from the value of the *_bucket{le="+Inf"} recording rule just evaluated. This would enforce consistency between the two time series at the expression level.
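
A rough sketch of what this reordering could look like for the 1h window is shown below. It is only an illustration: the group name is hypothetical, the job="apiserver" selector is an assumption, and the actual rules in the repository carry more selectors and are generated from jsonnet.

```yaml
# Sketch only: group name and selectors are illustrative assumptions.
groups:
  - name: kube-apiserver-availability-sketch  # hypothetical group name
    rules:
      # 1. Evaluate the bucket rule first.
      - record: cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h
        expr: |
          sum by (cluster, verb, scope, le) (
            increase(apiserver_request_sli_duration_seconds_bucket{job="apiserver"}[1h])
          )
      # 2. Derive the count from the +Inf bucket recorded just above,
      #    instead of querying apiserver_request_sli_duration_seconds_count again.
      - record: cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h
        expr: |
          sum by (cluster, verb, scope) (
            cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h{le="+Inf"}
          )
```

Because rules in a group are evaluated in order, the second rule should read the value the first one just recorded, so the two series cannot drift apart the way two independent queries can.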

Please provide any helpful snippets.

No response

What parts of the codebase does the enhancement target?

Rules

Anything else relevant to the enhancement that would help with the triage process?

Some reference links:

I agree to the following terms:

I agree to follow this project's Code of Conduct.
I have filled out all the required information above to the best of my ability.
I have searched the issues of this repository and believe that this is not a duplicate.
I have confirmed this proposal applies to the default branch of the repository, as of the latest commit at the time of submission.

lorenzofelletti mentioned this issue Sep 9, 2024

feat: API Server availability rules consistency #976

Open
