Frequent downtime on pulls.web-platform-tests.org #47
Comments
@lukebjerring FYI |
This appears to be caused by long SELECT times in postgres. This usually causes the web server to hang, which makes the application appear down, but results still get aggregated. In the case of #56, this may have caused the results to never be populated. Two possible solutions:
|
It's surprising that there are selects that take anything more than milliseconds given the small amount of data in the system still. What are those queries? |
I'm not sure; I'll have to chase this a bit more through the Flask ORM. I'll likely do so when we're closer to implementing a solution, though a better-tuned machine is probably what is actually in order. For what it's worth, I observed multi-second Postgres selects in htop for each load of the home page. When I loaded it several times in a row, the server became non-responsive. |
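One way to find out which selects are slow, without waiting on a full monitoring setup, is to time each statement at the cursor level and keep the ones that cross a threshold. The following is only a minimal sketch under assumptions: `timed_execute`, `SLOW_QUERY_SECONDS`, and the `job` table are hypothetical names, and the standard-library `sqlite3` module stands in for Postgres so the example runs on its own. It is not how the application currently issues queries.

```python
import sqlite3
import time

# Hypothetical threshold: statements at least this slow get recorded.
SLOW_QUERY_SECONDS = 0.5


def timed_execute(cursor, sql, params=(), threshold=SLOW_QUERY_SECONDS, slow_log=None):
    """Execute one statement on a DB-API cursor and record it if it is slow.

    A (elapsed_seconds, sql) tuple is appended to slow_log when the
    statement takes at least `threshold` seconds.
    """
    start = time.perf_counter()
    cursor.execute(sql, params)
    elapsed = time.perf_counter() - start
    if slow_log is not None and elapsed >= threshold:
        slow_log.append((elapsed, sql))
    return cursor


def demo(threshold=0.0):
    """Run three statements against an in-memory database (a stand-in for Postgres)."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    slow_log = []
    timed_execute(cur, "CREATE TABLE job (id INTEGER PRIMARY KEY, state TEXT)",
                  threshold=threshold, slow_log=slow_log)
    timed_execute(cur, "INSERT INTO job (state) VALUES (?)", ("passed",),
                  threshold=threshold, slow_log=slow_log)
    rows = timed_execute(cur, "SELECT state FROM job",
                         threshold=threshold, slow_log=slow_log).fetchall()
    conn.close()
    return rows, slow_log
```

With a threshold of `0.0` every statement is recorded, which is handy for checking the wiring; in practice the threshold would be set high enough that only the multi-second selects show up.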
This continues to be a serious problem. I am getting 504 Gateway Time-out on https://pulls.web-platform-tests.org/job/23710.13 and other URLs today, and https://bit.ly/ecosystem-infra-status shows very frequent downtime. |
I recommend hooking it up to New Relic to understand which queries are slow and what the downtime is like. |
This is a problem today as well: I need to look at https://pulls.web-platform-tests.org/job/24794.11 to understand what's wrong with web-platform-tests/wpt#9641, but it returns 504 Gateway Time-out. |
https://pulls.web-platform-tests.org/ is now 504 Gateway Time-out.
@mdittmer set up https://bit.ly/ecosystem-infra-status a while ago, and from that it's clear that downtime is pretty frequent, with some downtime almost every day. This matches what I've experienced: every so often when I take a look, it's slow or down. Recent reports of the same kind: #39 #42 #46
I'm calling this a roadmap issue, because apparently there's something not quite right about the setup causing it to frequently go down. Let's call this resolved when we've seen a week with no downtime.
@mdittmer, can you increase the checking rate to 5 minutes for these checks?
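For illustration, a 5-minute check could look roughly like the sketch below. This is an assumption-laden example, not the configuration behind the status page: `probe`, `is_up`, and `CHECK_INTERVAL_SECONDS` are hypothetical names, and only the Python standard library is used.

```python
import urllib.error
import urllib.request

# Hypothetical setting: the 5-minute checking rate requested above.
CHECK_INTERVAL_SECONDS = 5 * 60


def probe(url, timeout=10, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, or None if no response arrived.

    `opener` is injectable so the function can be exercised without
    network access.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Error responses such as 504 Gateway Time-out land here.
        return err.code
    except OSError:
        # Covers URLError, connection failures, and socket timeouts.
        return None


def is_up(status):
    """Treat 2xx and 3xx responses as up; anything else, or no response, as down."""
    return status is not None and 200 <= status < 400
```

A scheduler would call `probe("https://pulls.web-platform-tests.org/")` once every `CHECK_INTERVAL_SECONDS` and record `is_up(...)` for each sample, so that a week with no downtime shows up as a week of consecutive successful samples.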