SLO Generator Version
v2.3.4
Python Version
3.9
What happened?
When using the Google Cloud Monitoring backend, we sometimes (roughly every other hour) notice wrong SLI and error budget burn rate values being calculated for a short time, even though there are no "bad" events. After a few minutes, the calculated metrics are back to the expected, correct numbers. We see this happen for different sliding windows such as 1h, 12h, 7d, or 28d. For example, you can see a sudden peak in the error budget burn rate for one sliding window, e.g. "28 days", while the other sliding windows are unaffected and show correct values.
Example SLO configuration
apiVersion: sre.google.com/v2
kind: ServiceLevelObjective
metadata:
  name: projects-inventory-query-availability
  labels:
    service_name: projects
    feature_name: inventory-query
    slo_name: availability
    team: xyz
spec:
  description: 95% of inventory query API HTTP responses are successful
  backend: cloud_monitoring
  method: good_bad_ratio
  service_level_indicator:
    filter_good: >
      project=${PROJECT_ID}
      metric.type="logging.googleapis.com/user/inventory_query_requests_total"
      metric.labels.http_status = 200
    filter_valid: >
      project=${PROJECT_ID}
      metric.type="logging.googleapis.com/user/inventory_query_requests_total"
      ( metric.labels.http_status = 200 OR
        metric.labels.http_status = 500 OR
        metric.labels.http_status = 501 OR
        metric.labels.http_status = 502 OR
        metric.labels.http_status = 503 OR
        metric.labels.http_status = 504 OR
        metric.labels.http_status = 505 OR
        metric.labels.http_status = 506 OR
        metric.labels.http_status = 507 OR
        metric.labels.http_status = 508 OR
        metric.labels.http_status = 509 OR
        metric.labels.http_status = 510 OR
        metric.labels.http_status = 511 )
  goal: 0.95
  frequency: "* * * * *"
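To cross-check the raw counts those two filters return, here is a minimal sketch using the google-cloud-monitoring Python client. It assumes google-cloud-monitoring 2.x, a log-based counter metric, and an ALIGN_SUM/REDUCE_SUM aggregation similar to what the cloud_monitoring backend appears to apply over each window; PROJECT_ID, the 1h window, and the function name are placeholders, not part of the original report.

import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # hypothetical placeholder
WINDOW_SECONDS = 3600      # e.g. the 1h sliding window

METRIC = 'metric.type="logging.googleapis.com/user/inventory_query_requests_total"'
BAD_STATUSES = [500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511]
FILTER_GOOD = f"{METRIC} metric.labels.http_status = 200"
FILTER_VALID = (
    f"{METRIC} ("
    + " OR ".join(f"metric.labels.http_status = {code}" for code in [200] + BAD_STATUSES)
    + ")"
)


def count_events(client, metric_filter):
    """Sum all points returned for the filter over the window."""
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now}, "start_time": {"seconds": now - WINDOW_SECONDS}}
    )
    aggregation = monitoring_v3.Aggregation(
        {
            "alignment_period": {"seconds": WINDOW_SECONDS},
            "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
            "cross_series_reducer": monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
        }
    )
    series = client.list_time_series(
        request={
            "name": f"projects/{PROJECT_ID}",
            "filter": metric_filter,
            "interval": interval,
            "aggregation": aggregation,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    total = 0
    for ts in series:
        for point in ts.points:
            # Log-based counters are INT64; fall back to double just in case.
            total += point.value.int64_value or point.value.double_value
    return total


if __name__ == "__main__":
    client = monitoring_v3.MetricServiceClient()
    good = count_events(client, FILTER_GOOD)
    valid = count_events(client, FILTER_VALID)
    print(f"good={good} valid={valid} bad={valid - good}")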
What did you expect?
Correct SLI and error budget burn rate values when there are only "good" events.
Relevant log output
Quite noteworthy:
Wrong values: Good: 510 | Bad: 469. Interesting: if you sum them up, you get the same total of 979, but why are there 469 "bad" events which do not exist in reality? And why is it back to the correct numbers after a few minutes?
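For reference, a quick sanity check of what those counts imply, assuming the usual good_bad_ratio formulas (SLI = good / valid, error budget burn rate = (1 - SLI) / (1 - goal)) and the 0.95 goal from the config above; the helper name is just for illustration.

goal = 0.95  # from the SLO config above

def sli_and_burn_rate(good, bad, goal):
    # SLI = good / (good + bad); burn rate = error budget consumed / error budget target
    valid = good + bad
    sli = good / valid
    burn_rate = (1 - sli) / (1 - goal)
    return sli, burn_rate

print(sli_and_burn_rate(979, 0, goal))    # expected:  SLI = 1.0,   burn rate = 0.0
print(sli_and_burn_rate(510, 469, goal))  # reported:  SLI ~ 0.521, burn rate ~ 9.58

With 469 spurious "bad" events, the SLI drops to roughly 0.52 and the burn rate against a 95% goal jumps to roughly 9.6, which would explain the sudden spike seen on the affected window.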
Code of Conduct
I agree to follow this project's Code of Conduct
Hi @svenmueller, thanks for reporting this. Apologies for the late reply. I was on vacation and off the grid.
Just like you, I was immediately surprised by the 510 + 469 == 979 coincidence upon seeing the screenshot for the first time. Any chance you could enable debug mode so we get more details about what's going on under the hood? For example by temporarily setting the DEBUG environment variable to 1?
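For anyone wanting to try that, a minimal sketch of running the computation with debug output from Python. The DEBUG=1 convention is the one mentioned above; the config file names are placeholders, and the `slo-generator compute -f ... -c ...` invocation is an assumption based on the v2 CLI.

import os
import subprocess

# DEBUG=1 as suggested above; slo.yaml and config.yaml are placeholder paths.
env = dict(os.environ, DEBUG="1")
subprocess.run(
    ["slo-generator", "compute", "-f", "slo.yaml", "-c", "config.yaml"],
    env=env,
    check=True,
)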