🐛 [BUG] - Synthetic Probes show a high number of 504 errors #441

Open · 1 task done
lvaylet opened this issue Mar 27, 2024 · 7 comments

Assignees
lvaylet

Labels
bug Something isn't working

Comments

@lvaylet
Collaborator

lvaylet commented Mar 27, 2024

SLO Generator Version

v2.5.2

Python Version

3.9

What happened?

After designing synthetic probes and availability/latency SLOs in the CI environment, a lot of 504 errors started showing up in the logs in a seemingly random pattern. As a result, the error budget burns quite fast during these outages. Each incident usually resolves itself in under 10 minutes, then shows up again later on.

What did you expect?

No 504 errors and no timeouts when calling the SLO Generator.

Screenshots

No response

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@lvaylet lvaylet added bug Something isn't working triage labels Mar 27, 2024
@lvaylet lvaylet self-assigned this Mar 27, 2024
@lvaylet lvaylet removed the triage label Mar 27, 2024
@lvaylet
Collaborator Author

lvaylet commented Mar 27, 2024

Example log entry:

{
  "textPayload": "The request has been terminated because it has reached the maximum request timeout. To change this limit, see https://cloud.google.com/run/docs/configuring/request-timeout",
  "insertId": "66042246000dd918ac9c23d5",
  "httpRequest": {
    "requestMethod": "POST",
    "requestUrl": "<REDACTED>",
    "requestSize": "494",
    "status": 504,
    "responseSize": "713",
    "userAgent": "Google-Cloud-Scheduler",
    "remoteIp": "107.178.196.96",
    "serverIp": "216.239.32.53",
    "latency": "29.999816421s",
    "protocol": "HTTP/1.1"
  },
  "resource": {
    "type": "cloud_run_revision",
    "labels": {
      "configuration_name": "slo-generator",
      "service_name": "slo-generator",
      "revision_name": "slo-generator-00154-2gv",
      "project_id": "<REDACTED>",
      "location": "<REDACTED>"
    }
  },
  "timestamp": "2024-03-27T13:42:00.906092Z",
  "severity": "ERROR",
  "labels": {
    "instanceId": "<REDACTED>"
  },
  "logName": "projects/<REDACTED>/logs/run.googleapis.com%2Frequests",
  "trace": "projects/<REDACTED>/traces/de38c4e5fb97cd35e44c685f67a1768b",
  "receiveTimestamp": "2024-03-27T13:42:30.966597414Z",
  "spanId": "16424922216080379185"
}
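
For the record, these entries can be pulled programmatically with a short script along these lines. A minimal sketch using the google-cloud-logging Python client (the project ID is a placeholder, and credentials are assumed to be set up via Application Default Credentials):

# Minimal sketch (not part of slo-generator): list the most recent 504 request logs
# for the slo-generator Cloud Run service. The project ID below is a placeholder.
from google.cloud import logging

client = logging.Client(project="<REDACTED>")

log_filter = (
    'resource.type="cloud_run_revision" '
    'AND resource.labels.service_name="slo-generator" '
    'AND httpRequest.status=504'
)

for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING, max_results=20):
    req = entry.http_request or {}
    print(entry.timestamp, req.get("latency"), req.get("userAgent"))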

@lvaylet
Collaborator Author

lvaylet commented Mar 27, 2024

The reported "latency" value (29.999816421s) is really close to the 30s timeout configured in Cloud Run:

[screenshot: Cloud Run service configuration showing the 30s request timeout]

@skinlayers

Additional info: at one point we tried increasing the timeout to the maximum of 1 hour. This did not seem to help, i.e. if the SLO evaluation does not finish in 5 minutes, it will not finish in 1 hour either. To make things worse, this actually ran up our GCP bill, as we were charged for the compute time the instance spent waiting to time out. We also tried increasing the Cloud Run max concurrency and max instances. Max instances helped a little, as it would sometimes allow SLOs to be processed by additional instances, but eventually all instances would end up stuck waiting to time out.
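
For reference, an application-level deadline inside the container would at least turn a hung evaluation into a fast, explicit error instead of billed time spent waiting for the Cloud Run timeout. A minimal sketch, not actual slo-generator code; compute_slo_report is a hypothetical stand-in for whatever call is blocking:

# Minimal sketch, not slo-generator code: bound a potentially hanging SLO computation
# with a hard deadline so the HTTP handler returns quickly instead of waiting for the
# Cloud Run request timeout. `compute_slo_report` is a hypothetical stand-in.
import concurrent.futures

def compute_with_deadline(compute_slo_report, deadline_seconds=25):
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(compute_slo_report)
    try:
        return future.result(timeout=deadline_seconds)
    except concurrent.futures.TimeoutError:
        # Raise an explicit application error; the edge would otherwise report a bare 504.
        raise RuntimeError(f"SLO computation did not finish within {deadline_seconds}s")
    finally:
        # Do not wait for the (possibly hung) worker thread; it still needs its own
        # network timeouts on the backend/exporter clients to actually stop working.
        executor.shutdown(wait=False)

Note that this only bounds how long the caller waits; the underlying backend/exporter calls still need their own timeouts to actually stop the hung work.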

@lvaylet
Collaborator Author

lvaylet commented Apr 3, 2024

Thank you @skinlayers. Same conclusions here. The requests to Cloud Run are timing out. However, increasing the timeout value does not help. The issue is somewhere in the middle. I have not found a way to reproduce it consistently though. I will enable debug mode and wait for the next occurrence.

@lvaylet
Collaborator Author

lvaylet commented Apr 3, 2024

@skinlayers Does this happen for a specific backend/exporter, or all backends/exporters in general?

@lvaylet
Collaborator Author

lvaylet commented Apr 3, 2024

I also found that deploying a new revision of the Cloud Run service removes most (all?) of the 504 errors. But they come back after some time. I ruled out memory leaks as memory utilization never goes above 35% (with 512MiB allocated to each instance of the container). Quite the opposite actually. 504 errors consistently show up when memory utilization is at its lowest point, around 22%.
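
To double-check that correlation with revision age, the per-minute 5xx rate can be broken down by revision from the run.googleapis.com/request_count metric. A rough sketch using the Cloud Monitoring Python client (project ID is a placeholder):

# Minimal sketch: per-minute 5xx rate per revision of the slo-generator service over
# the last 6 hours, to correlate 504 spikes with revision age. Project ID is a placeholder.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    start_time={"seconds": now - 6 * 3600},
    end_time={"seconds": now},
)
aggregation = monitoring_v3.Aggregation(
    alignment_period={"seconds": 60},
    per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_RATE,
    cross_series_reducer=monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
    group_by_fields=["resource.labels.revision_name"],
)
results = client.list_time_series(
    request={
        "name": "projects/<REDACTED>",
        "filter": (
            'metric.type="run.googleapis.com/request_count" '
            'AND resource.labels.service_name="slo-generator" '
            'AND metric.labels.response_code_class="5xx"'
        ),
        "interval": interval,
        "aggregation": aggregation,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    revision = series.resource.labels.get("revision_name", "?")
    for point in series.points:
        print(revision, point.interval.end_time, point.value.double_value)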

@lvaylet
Collaborator Author

lvaylet commented Apr 3, 2024

In my CI environment, I just:

  • Disabled all existing uptime checks and synthetic probes (Cloud Scheduler and Synthetic Monitoring alike) except one: slo_cm_gae_latency724ms, triggered by Cloud Scheduler every minute.
  • Redeployed a new revision of the Service with DEBUG set to 1.

I will keep monitoring the 504 error rate, and introduce new uptime checks or synthetic probes on a regular basis.
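
For reference, a simple probe loop that mimics the Cloud Scheduler job can also help catch the next occurrence with precise client-side timings. A minimal sketch (the URL and payload are placeholders since the real values are redacted above, and the service may require an identity token for authentication):

# Minimal sketch: POST to the slo-generator service every minute, like the Cloud
# Scheduler job does, and record client-side status and latency for each attempt.
# SERVICE_URL and PAYLOAD are placeholders; if the service requires authentication,
# an identity token header must be added to the request.
import time
import requests

SERVICE_URL = "https://<REDACTED>.run.app/"          # placeholder
PAYLOAD = b"<body sent by the Cloud Scheduler job>"  # placeholder (~494 bytes, see log above)

while True:
    start = time.monotonic()
    try:
        response = requests.post(SERVICE_URL, data=PAYLOAD, timeout=60)
        status = response.status_code
    except requests.RequestException as exc:
        status = f"error ({exc.__class__.__name__})"
    latency = time.monotonic() - start
    print(f"{time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())} status={status} latency={latency:.1f}s", flush=True)
    time.sleep(60)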
