SLI Values Sometimes incorrect W/ Google Cloud Monitoring backend (MQF) #345

svenmueller · 2023-08-15T12:44:33Z

SLO Generator Version

v2.3.4

Python Version

3.9

What happened?

When using the Google Cloud Monitoring backend, we sometimes (roughly every other hour) see wrong SLI and error budget burn rate metrics calculated for a short time (wrong because, for example, there are no "bad" events). After a few minutes, the calculated metrics return to the expected, correct numbers. We see this for different sliding windows such as 1h, 12h, 7d or 28d. For example, there can be a sudden peak in the error budget burn rate for one sliding window, e.g. "28 days", while the other sliding windows are unaffected and show correct values.

Example SLO configuration

apiVersion: sre.google.com/v2
kind: ServiceLevelObjective
metadata:
  name: projects-inventory-query-availability
  labels:
    service_name: projects
    feature_name: inventory-query
    slo_name: availability
    team: xyz
spec:
  description: 95% of inventory query API HTTP responses are successful
  backend: cloud_monitoring
  method: good_bad_ratio
  service_level_indicator:
    filter_good: >
      project=${PROJECT_ID}
      metric.type="logging.googleapis.com/user/inventory_query_requests_total"
      metric.labels.http_status = 200
    filter_valid: >
      project=${PROJECT_ID}
      metric.type="logging.googleapis.com/user/inventory_query_requests_total"
      ( metric.labels.http_status = 200 OR
        metric.labels.http_status = 500 OR
        metric.labels.http_status = 501 OR
        metric.labels.http_status = 502 OR
        metric.labels.http_status = 503 OR
        metric.labels.http_status = 504 OR
        metric.labels.http_status = 505 OR
        metric.labels.http_status = 506 OR
        metric.labels.http_status = 507 OR
        metric.labels.http_status = 508 OR
        metric.labels.http_status = 509 OR
        metric.labels.http_status = 510 OR
        metric.labels.http_status = 511 )
  goal: 0.95
  frequency: "* * * * *"
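
For context on how the two filters above relate to the Good/Bad counts in the log output: when `filter_valid` is configured instead of a bad-events filter, the bad count is presumably not queried directly but derived as valid minus good. The sketch below only illustrates that relationship. `derive_bad_count` is a hypothetical helper written for this example, and the derivation is an assumption, not a quote of slo-generator's code.

```python
def derive_bad_count(good_count: int, valid_count: int) -> int:
    """Hypothetical helper: derive bad events as valid minus good.

    Assumption: with filter_good + filter_valid, the backend runs two
    separate queries and subtracts one result from the other.
    """
    return valid_count - good_count


# Both queries agree: no bad events.
print(derive_bad_count(good_count=979, valid_count=979))  # 0

# The good query returns fewer events than the valid query.
print(derive_bad_count(good_count=510, valid_count=979))  # 469
```

Under this assumption, a good-events query that temporarily returns fewer events than the valid-events query would surface as "bad" events while the total stays constant.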

What did you expect?

Correct SLI and error budget burn rate values when there are only "good" events.

Screenshots

Relevant log output

2023-08-14 15:28:29.414 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:29.148 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:28.093 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:27.841 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:25.684 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:23.380 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:12.086 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:08.331 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469
2023-08-14 15:28:01.790 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469
2023-08-14 15:27:55.168 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469
2023-08-14 15:27:52.479 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:27:47.765 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469
2023-08-14 15:27:38.083 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469
2023-08-14 15:27:36.766 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:27:19.565 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:25:55.593 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:48.714 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:44.925 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:44.663 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:44.216 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:38.536 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:37.906 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:37.842 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:37.840 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:37.164 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:33.929 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:31.986 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
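
To spot such flips in a longer log export, a small script along these lines could help. It is a minimal sketch that assumes the report line layout shown in the excerpt above. `REPORT_RE` and `find_split_flips` are hypothetical names made up for this example and are not part of slo-generator.

```python
import re
from collections import defaultdict

# Matches report lines as they appear in the excerpt above.
# The field layout is inferred from this log only (assumption).
REPORT_RE = re.compile(
    r"INFO - (?P<slo>\S+) \| (?P<window>[^|]+?) \| .*"
    r"Good: (?P<good>\d+) \| Bad: (?P<bad>\d+)"
)


def find_split_flips(lines):
    """Return (slo, window, total) keys seen with more than one good/bad split."""
    splits = defaultdict(set)
    for line in lines:
        match = REPORT_RE.search(line)
        if match is None:
            continue  # timestamp lines and anything else that is not a report
        good, bad = int(match["good"]), int(match["bad"])
        key = (match["slo"], match["window"].strip(), good + bad)
        splits[key].add((good, bad))
    # Keep only totals that were reported with differing good/bad splits.
    return {key: seen for key, seen in splits.items() if len(seen) > 1}


sample = [
    "INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0",
    "INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469",
]
flips = find_split_flips(sample)
```

For the two sample lines, `flips` holds a single key for the 28-day window with total 979 and both splits, `(979, 0)` and `(510, 469)`.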

Quite noteworthy:

correct: Good: 979 | Bad: 0
wrong: Good: 510 | Bad: 469

Interestingly, the wrong counts add up to the same total of 979. But why are there 469 "bad" events that do not exist in reality, and why are the numbers back to the correct values after a few minutes?
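
The logged SLI, gap and burn rate follow directly from these counts, which suggests the calculation itself is consistent and only its inputs are off. A minimal sketch to reproduce the numbers, assuming the usual definitions (SLI as good over total, burn rate as consumed error budget over allowed error budget). `sli_report` is a hypothetical helper, not a slo-generator function.

```python
def sli_report(good: int, bad: int, slo_target: float) -> dict:
    """Recompute the report fields from raw counts (assumed definitions)."""
    sli = good / (good + bad)
    gap = sli - slo_target
    # Burn rate: observed error rate divided by the error rate the SLO allows.
    burn_rate = (1 - sli) / (1 - slo_target)
    return {
        "sli_pct": round(sli * 100, 3),
        "gap_pct": round(gap * 100, 2),
        "burn_rate": round(burn_rate, 1),
    }


correct = sli_report(good=979, bad=0, slo_target=0.95)
wrong = sli_report(good=510, bad=469, slo_target=0.95)
print(correct)  # {'sli_pct': 100.0, 'gap_pct': 5.0, 'burn_rate': 0.0}
print(wrong)    # {'sli_pct': 52.094, 'gap_pct': -42.91, 'burn_rate': 9.6}
```

Both results match the log lines above: 100.0 % / +5.0 % / 0.0 for the correct counts and 52.094 % / -42.91 % / 9.6 for the wrong ones.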

Code of Conduct

I agree to follow this project's Code of Conduct


lvaylet · 2023-08-16T13:53:55Z

Hi @svenmueller, thanks for reporting this. Apologies for the late reply. I was on vacation and off the grid.

Just like you, I was immediately surprised by the 510 + 469 == 979 coincidence upon seeing the screenshot for the first time. Any chance you could enable debug mode so we get more details about what's going on under the hood? For example by temporarily setting the DEBUG environment variable to 1?

svenmueller added the bug and triage labels Aug 15, 2023

svenmueller assigned lvaylet Aug 15, 2023

lvaylet removed the triage label Aug 16, 2023
