🐛 [BUG] - Synthetic Probes show a high number of 504 errors #441

Open · 1 task done
lvaylet opened this issue Mar 27, 2024 · 7 comments

Assignees
lvaylet

Labels
bug Something isn't working

Comments

@lvaylet
Collaborator

lvaylet commented Mar 27, 2024

SLO Generator Version

v2.5.2

Python Version

3.9

What happened?

After designing synthetic probes and availability/latency SLOs in the CI environment, a lot of 504 errors started showing up in the logs in a seemingly random pattern. As a result, the error budget burns quite fast during these outages. Each incident usually resolves itself in under 10 minutes, then shows up again later on.

What did you expect?

No 504 errors and no timeouts when calling the SLO Generator.

Screenshots

No response

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@lvaylet lvaylet added bug Something isn't working triage labels Mar 27, 2024
@lvaylet lvaylet self-assigned this Mar 27, 2024
@lvaylet lvaylet removed the triage label Mar 27, 2024
@lvaylet
Collaborator Author

lvaylet commented Mar 27, 2024

Example log entry:

{
  "textPayload": "The request has been terminated because it has reached the maximum request timeout. To change this limit, see https://cloud.google.com/run/docs/configuring/request-timeout",
  "insertId": "66042246000dd918ac9c23d5",
  "httpRequest": {
    "requestMethod": "POST",
    "requestUrl": "<REDACTED>",
    "requestSize": "494",
    "status": 504,
    "responseSize": "713",
    "userAgent": "Google-Cloud-Scheduler",
    "remoteIp": "107.178.196.96",
    "serverIp": "216.239.32.53",
    "latency": "29.999816421s",
    "protocol": "HTTP/1.1"
  },
  "resource": {
    "type": "cloud_run_revision",
    "labels": {
      "configuration_name": "slo-generator",
      "service_name": "slo-generator",
      "revision_name": "slo-generator-00154-2gv",
      "project_id": "<REDACTED>",
      "location": "<REDACTED>"
    }
  },
  "timestamp": "2024-03-27T13:42:00.906092Z",
  "severity": "ERROR",
  "labels": {
    "instanceId": "<REDACTED>"
  },
  "logName": "projects/<REDACTED>/logs/run.googleapis.com%2Frequests",
  "trace": "projects/<REDACTED>/traces/de38c4e5fb97cd35e44c685f67a1768b",
  "receiveTimestamp": "2024-03-27T13:42:30.966597414Z",
  "spanId": "16424922216080379185"
}
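
For the record, these entries can be pulled programmatically with a short script along these lines. A minimal sketch using the google-cloud-logging Python client (the project ID is a placeholder, and credentials are assumed to be set up via Application Default Credentials):

# Minimal sketch (not part of slo-generator): list the most recent 504 request logs
# for the slo-generator Cloud Run service. The project ID below is a placeholder.
from google.cloud import logging

client = logging.Client(project="<REDACTED>")

log_filter = (
    'resource.type="cloud_run_revision" '
    'AND resource.labels.service_name="slo-generator" '
    'AND httpRequest.status=504'
)

for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING, max_results=20):
    req = entry.http_request or {}
    print(entry.timestamp, req.get("latency"), req.get("userAgent"))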

@lvaylet
Collaborator Author

lvaylet commented Mar 27, 2024

The reported "latency" value (29.999816421s) is really close to the 30s timeout configured in Cloud Run:

[screenshot: Cloud Run service configuration showing the 30s request timeout]

@skinlayers

Additional info: at one point we tried increasing the timeout to the maximum of 1 hour. This did not seem to help, i.e. if the SLO evaluation does not finish in 5 minutes, it will not finish in 1 hour either. To make things worse, this actually ran up our GCP bill, as we were charged for the compute time the instance spent waiting to time out. We also tried increasing the Cloud Run max concurrency and max instances. Max instances helped a little, as it would sometimes allow SLOs to be processed by additional instances, but eventually all instances would end up stuck waiting to time out.
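
For reference, an application-level deadline inside the container would at least turn a hung evaluation into a fast, explicit error instead of billed time spent waiting for the Cloud Run timeout. A minimal sketch, not actual slo-generator code; compute_slo_report is a hypothetical stand-in for whatever call is blocking:

# Minimal sketch, not slo-generator code: bound a potentially hanging SLO computation
# with a hard deadline so the HTTP handler returns quickly instead of waiting for the
# Cloud Run request timeout. `compute_slo_report` is a hypothetical stand-in.
import concurrent.futures

def compute_with_deadline(compute_slo_report, deadline_seconds=25):
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(compute_slo_report)
    try:
        return future.result(timeout=deadline_seconds)
    except concurrent.futures.TimeoutError:
        # Raise an explicit application error; the edge would otherwise report a bare 504.
        raise RuntimeError(f"SLO computation did not finish within {deadline_seconds}s")
    finally:
        # Do not wait for the (possibly hung) worker thread; it still needs its own
        # network timeouts on the backend/exporter clients to actually stop working.
        executor.shutdown(wait=False)

Note that this only bounds how long the caller waits; the underlying backend/exporter calls still need their own timeouts to actually stop the hung work.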

@lvaylet
Collaborator Author

lvaylet commented Apr 3, 2024

Thank you @skinlayers. Same conclusions here. The requests to Cloud Run are timing out. However, increasing the timeout value does not help. The issue is somewhere in the middle. I have not found a way to reproduce it consistently though. I will enable debug mode and wait for the next occurrence.

@lvaylet
Collaborator Author

lvaylet commented Apr 3, 2024

@skinlayers Does this happen for a specific backend/exporter, or all backends/exporters in general?

@lvaylet
Collaborator Author

lvaylet commented Apr 3, 2024

I also found that deploying a new revision of the Cloud Run service removes most (all?) of the 504 errors. But they come back after some time. I ruled out memory leaks as memory utilization never goes above 35% (with 512MiB allocated to each instance of the container). Quite the opposite actually. 504 errors consistently show up when memory utilization is at its lowest point, around 22%.
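
To double-check that correlation with revision age, the per-minute 5xx rate can be broken down by revision from the run.googleapis.com/request_count metric. A rough sketch using the Cloud Monitoring Python client (project ID is a placeholder):

# Minimal sketch: per-minute 5xx rate per revision of the slo-generator service over
# the last 6 hours, to correlate 504 spikes with revision age. Project ID is a placeholder.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    start_time={"seconds": now - 6 * 3600},
    end_time={"seconds": now},
)
aggregation = monitoring_v3.Aggregation(
    alignment_period={"seconds": 60},
    per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_RATE,
    cross_series_reducer=monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
    group_by_fields=["resource.labels.revision_name"],
)
results = client.list_time_series(
    request={
        "name": "projects/<REDACTED>",
        "filter": (
            'metric.type="run.googleapis.com/request_count" '
            'AND resource.labels.service_name="slo-generator" '
            'AND metric.labels.response_code_class="5xx"'
        ),
        "interval": interval,
        "aggregation": aggregation,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    revision = series.resource.labels.get("revision_name", "?")
    for point in series.points:
        print(revision, point.interval.end_time, point.value.double_value)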

@lvaylet
Collaborator Author

lvaylet commented Apr 3, 2024

In my CI environment, I just:

  • Disabled all existing uptime checks and synthetic probes (Cloud Scheduler and Synthetic Monitoring alike) except one: slo_cm_gae_latency724ms, triggered by Cloud Scheduler every minute.
  • Redeployed a new revision of the Service with DEBUG set to 1.

I will keep monitoring the 504 error rate, and introduce new uptime checks or synthetic probes on a regular basis.
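
For reference, a simple probe loop that mimics the Cloud Scheduler job can also help catch the next occurrence with precise client-side timings. A minimal sketch (the URL and payload are placeholders since the real values are redacted above, and the service may require an identity token for authentication):

# Minimal sketch: POST to the slo-generator service every minute, like the Cloud
# Scheduler job does, and record client-side status and latency for each attempt.
# SERVICE_URL and PAYLOAD are placeholders; if the service requires authentication,
# an identity token header must be added to the request.
import time
import requests

SERVICE_URL = "https://<REDACTED>.run.app/"          # placeholder
PAYLOAD = b"<body sent by the Cloud Scheduler job>"  # placeholder (~494 bytes, see log above)

while True:
    start = time.monotonic()
    try:
        response = requests.post(SERVICE_URL, data=PAYLOAD, timeout=60)
        status = response.status_code
    except requests.RequestException as exc:
        status = f"error ({exc.__class__.__name__})"
    latency = time.monotonic() - start
    print(f"{time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())} status={status} latency={latency:.1f}s", flush=True)
    time.sleep(60)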
