SLI Values Sometimes incorrect W/ Google Cloud Monitoring backend (MQF) #345

Open
1 task done
svenmueller opened this issue Aug 15, 2023 · 1 comment

svenmueller commented Aug 15, 2023

SLO Generator Version

v2.3.4

Python Version

3.9

What happened?

When using the Google Cloud Monitoring backend, we sometimes (roughly every other hour) see wrong SLI and error budget burn rate metrics being calculated for a short time. The values are clearly wrong because, for example, there are no "bad" events at all. After a few minutes, the calculated metrics return to the expected, correct numbers. We see this happen for different sliding windows, such as 1h, 12h, 7d or 28d. For example, the error budget burn rate can show a sudden peak for one sliding window (e.g. "28 days") while the other sliding windows are unaffected and show correct values.

Example SLO configuration

apiVersion: sre.google.com/v2
kind: ServiceLevelObjective
metadata:
  name: projects-inventory-query-availability
  labels:
    service_name: projects
    feature_name: inventory-query
    slo_name: availability
    team: xyz
spec:
  description: 95% of inventory query API HTTP responses are successful
  backend: cloud_monitoring
  method: good_bad_ratio
  service_level_indicator:
    filter_good: >
      project=${PROJECT_ID}
      metric.type="logging.googleapis.com/user/inventory_query_requests_total"
      metric.labels.http_status = 200
    filter_valid: >
      project=${PROJECT_ID}
      metric.type="logging.googleapis.com/user/inventory_query_requests_total"
      ( metric.labels.http_status = 200 OR
        metric.labels.http_status = 500 OR
        metric.labels.http_status = 501 OR
        metric.labels.http_status = 502 OR
        metric.labels.http_status = 503 OR
        metric.labels.http_status = 504 OR
        metric.labels.http_status = 505 OR
        metric.labels.http_status = 506 OR
        metric.labels.http_status = 507 OR
        metric.labels.http_status = 508 OR
        metric.labels.http_status = 509 OR
        metric.labels.http_status = 510 OR
        metric.labels.http_status = 511 )
  goal: 0.95
  frequency: "* * * * *"

What did you expect?

Correct SLI/error budget rate values when there are only "good" events.

Screenshots

Screenshot 2023-08-15 at 14 22 19

Relevant log output

2023-08-14 15:28:29.414 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:29.148 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:28.093 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:27.841 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:25.684 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:23.380 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:12.086 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:28:08.331 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469
2023-08-14 15:28:01.790 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469
2023-08-14 15:27:55.168 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469
2023-08-14 15:27:52.479 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:27:47.765 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469
2023-08-14 15:27:38.083 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469
2023-08-14 15:27:36.766 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:27:19.565 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:25:55.593 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:48.714 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:44.925 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:44.663 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:44.216 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:38.536 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:37.906 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:37.842 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:37.840 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:37.164 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:33.929 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
2023-08-14 15:24:31.986 CEST
INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0
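To make the flapping easier to spot in a longer log, the report lines can be tallied by their Good/Bad counts. The following is a minimal sketch, not part of slo-generator: the regular expression and the `summarize` helper are illustrative assumptions based only on the line format shown above, and the two sample lines are copied from this log.

```python
import re
from collections import Counter

# Two report lines copied verbatim from the log output above (timestamps omitted).
LOG_LINES = [
    "INFO - projects-inventory-query-availability | 28 days | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 1.0 | Alert: 0 | Good: 979 | Bad: 0",
    "INFO - projects-inventory-query-availability | 28 days | SLI: 52.094 % | SLO: 95.0 % | Gap: -42.91% | BR: 9.6 / 1.0 | Alert: 1 | Good: 510 | Bad: 469",
]

# Hypothetical pattern matching the "Good: N | Bad: M" tail of each report line.
COUNTS = re.compile(r"Good:\s*(\d+)\s*\|\s*Bad:\s*(\d+)")


def summarize(lines):
    """Count how often each (good, bad) pair appears in the given log lines."""
    pairs = Counter()
    for line in lines:
        match = COUNTS.search(line)
        if match:
            pairs[(int(match.group(1)), int(match.group(2)))] += 1
    return pairs


pairs = summarize(LOG_LINES)
# Distinct totals (good + bad) seen across all report lines.
totals = {good + bad for good, bad in pairs}
print(pairs)
print(totals)
```

If every report yields the same total while the good/bad split changes, the event count itself is stable and only the classification differs between runs.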

Quite noteworthy:

  • correct: Good: 979 | Bad: 0
  • wrong: Good: 510 | Bad: 469 -> Interestingly, these two counts sum to the same total of 979. But why are there 469 "bad" events that don't exist in reality, and why do the numbers return to the correct values after a few minutes?
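The logged numbers for both cases can be reproduced with a few lines of arithmetic. This is a minimal sketch under two assumptions that are not confirmed in this thread: that the bad count is derived as valid minus good when `filter_valid` is configured, and that the burn rate is the consumed error fraction divided by the error budget, `(1 - SLI) / (1 - SLO)`. The `sli_report` helper is hypothetical and only mirrors the rounding visible in the log.

```python
def sli_report(good, valid, slo_target):
    """Derive the figures shown in a report line from good and valid counts."""
    # Assumption: with filter_good and filter_valid, bad events are not
    # queried directly but computed as the difference of the two counts.
    bad = valid - good
    sli = good / valid
    gap = sli - slo_target
    # Assumption: burn rate = observed error rate / allowed error rate.
    burn_rate = (1 - sli) / (1 - slo_target)
    return {
        "bad": bad,
        "sli_pct": round(sli * 100, 3),
        "gap_pct": round(gap * 100, 2),
        "burn_rate": round(burn_rate, 1),
    }


correct = sli_report(good=979, valid=979, slo_target=0.95)
wrong = sli_report(good=510, valid=979, slo_target=0.95)
print(correct)
print(wrong)
```

Under these assumptions, a "good" query that briefly returns only 510 of the 979 events while the "valid" query still returns 979 would produce exactly the wrong line seen above. That is one possible reading of why the totals match, not an established root cause.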

svenmueller added the bug (Something isn't working) and triage labels Aug 15, 2023
Collaborator

lvaylet commented Aug 16, 2023

Hi @svenmueller, thanks for reporting this. Apologies for the late reply. I was on vacation and off the grid.

Just like you, I was immediately surprised by the 510 + 469 == 979 coincidence upon seeing the screenshot for the first time. Any chance you could enable debug mode so we get more details about what's going on under the hood? For example by temporarily setting the DEBUG environment variable to 1?
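For reference, one way to set the variable for a single run is sketched below. The `DEBUG=1` setting comes from the suggestion above. The `slo-generator compute -f ... -c ...` invocation and the file names `slo.yaml` and `config.yaml` are assumptions about a typical local setup, so adjust them to however the tool is actually launched (for example, as an environment variable on the deployed service).

```python
import os
import shutil
import subprocess


def debug_invocation(slo_path, config_path):
    """Build the command and environment for one debug run."""
    # Copy the current environment and switch debug mode on for this run only.
    env = dict(os.environ, DEBUG="1")
    # Assumed CLI form; both paths are placeholders.
    cmd = ["slo-generator", "compute", "-f", slo_path, "-c", config_path]
    return cmd, env


cmd, env = debug_invocation("slo.yaml", "config.yaml")

# Only run when the CLI and both placeholder files are actually present.
if shutil.which(cmd[0]) and all(os.path.exists(p) for p in ("slo.yaml", "config.yaml")):
    subprocess.run(cmd, env=env, check=False)
```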

lvaylet removed the triage label Aug 16, 2023