What's the general idea for the enhancement?
The API server availability recording rules do not ensure consistency between `cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase*{le="+Inf"}` and `cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase*`.
This happens because Prometheus does not guarantee that rules in the same rule group are evaluated against the same data; it only guarantees that they are evaluated in order.
I have seen `apiserver_request:availability30d` go above 100% several times, and I traced it back to this inconsistency.
This is an example of plotting `cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h{cluster="mycluster", le="+Inf", verb="GET", scope="resource"}` (green line) vs `cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h{cluster="mycluster", verb="GET", scope="resource"}` (yellow line):
As the image shows, the yellow line (`*_count`) is always below or equal to the green one (`*_bucket{le="+Inf"}`). In the code, the count rules come before the bucket ones. The observed behaviour is therefore consistent with the `*_count` rule being evaluated before the `*_bucket` rule, and thus on "less" data.
My suggestion would be to change the availability recording rules to evaluate the `*_bucket` rules first, and then the `*_count` rules, basing the `*_count` expression on the value of the `*_bucket{le="+Inf"}` recording rule just evaluated. This would enforce consistency between the values of these two time series at the expression level.
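To make the suggestion concrete, here is a minimal sketch of how the 1h pair could look as plain Prometheus rules. It is not the current mixin code: the group name, the `job="apiserver"` selector and the exact `sum by (...)` label sets are assumptions for illustration, and the real rules may carry extra selectors. It also assumes the default sequential evaluation within a group, so that the second rule can read the sample just recorded by the first.

```yaml
groups:
  - name: kube-apiserver-availability.rules  # assumed group name
    rules:
      # 1. Record the bucket increase first, straight from the raw histogram.
      - record: cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h
        expr: |
          sum by (cluster, verb, scope, le) (
            increase(apiserver_request_sli_duration_seconds_bucket{job="apiserver"}[1h])  # selector is an assumption
          )
      # 2. Derive the count from the le="+Inf" bucket recorded above, instead of
      #    querying the raw *_count series again, so both series share one source.
      - record: cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h
        expr: |
          sum by (cluster, verb, scope) (
            cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h{le="+Inf"}
          )
```

With this ordering the `*_count` value should never differ from the `*_bucket{le="+Inf"}` value recorded in the same evaluation cycle, which is the property the availability ratio relies on. The same pattern would apply to the other `increase*` windows.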
Please provide any helpful snippets.
No response
What parts of the codebase does the enhancement target?
Rules
Anything else relevant to the enhancement that would help with the triage process?
Some reference links: