Frequent downtime on pulls.web-platform-tests.org #47

foolip · 2017-12-18T10:30:25Z

https://pulls.web-platform-tests.org/ is now 504 Gateway Time-out.

@mdittmer set up https://bit.ly/ecosystem-infra-status a while ago, and from that it's clear that downtime is pretty frequent, with some downtime almost every day. This matches what I've experienced: every so often when I take a look, it's slow or down. Recent reports of the same kind: #39 #42 #46

I'm calling this a roadmap issue, because apparently there's something not quite right about the setup causing it to frequently go down. Let's call this resolved when we've seen a week with no downtime.

@mdittmer, can you increase the check frequency to once every 5 minutes for these checks?
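For illustration, a minimal sketch of what a 5-minute availability check could look like. This is not the setup behind the status page linked above; the names `check_once`, `monitor`, and `downtime_fraction` are hypothetical, and treating any 2xx/3xx status as "up" is an assumption.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, NamedTuple


class CheckResult(NamedTuple):
    ok: bool        # True if the status was 2xx or 3xx
    status: int     # HTTP status code, or 0 if no response was received
    seconds: float  # how long the check took


def http_status(url: str, timeout: float = 30.0) -> int:
    """Return the HTTP status for a GET of url, or 0 on a network failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 504 Gateway Time-out
    except (urllib.error.URLError, OSError):
        return 0


def check_once(url: str,
               fetch: Callable[[str], int] = http_status,
               clock: Callable[[], float] = time.monotonic) -> CheckResult:
    """Run a single check and record its status and duration."""
    start = clock()
    status = fetch(url)
    return CheckResult(200 <= status < 400, status, clock() - start)


def monitor(url: str,
            interval_seconds: float = 300,  # 5 minutes, as requested above
            checks: int = 12,
            fetch: Callable[[str], int] = http_status,
            sleep: Callable[[float], None] = time.sleep) -> List[CheckResult]:
    """Run a fixed number of checks, sleeping between consecutive checks."""
    results = []
    for i in range(checks):
        results.append(check_once(url, fetch))
        if i < checks - 1:
            sleep(interval_seconds)
    return results


def downtime_fraction(results: Iterable[CheckResult]) -> float:
    """Fraction of checks that failed (0.0 when there are no checks)."""
    results = list(results)
    if not results:
        return 0.0
    return sum(1 for r in results if not r.ok) / len(results)
```

Calling `monitor("https://pulls.web-platform-tests.org/")` would run 12 checks over roughly an hour; `downtime_fraction` over a week of results would give a number to compare against the "a week with no downtime" goal.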

foolip · 2017-12-19T16:26:27Z

@lukebjerring FYI

boazsender · 2018-01-05T18:58:20Z

This appears to be caused by long SELECT times in postgres.

This usually causes the web server to hang, which makes the application appear down, but results still get aggregated.

In the case of #56, this may have caused the results to never be populated.

Two possible solutions:

1. Increase CPU resources on the server (Postgres selects appear to be CPU-bound according to htop).
2. Separate the web server from the database server, and consider using a managed database product such as Amazon RDS in production. If/when this second option is taken, we should also consider how the pullresults services will share resources, data models, and programs with the w3c/wptdashboard and http://wpt.fyi constellation of services.

foolip · 2018-01-08T21:13:56Z

It's surprising that there are selects that take anything more than milliseconds given the small amount of data in the system still. What are those queries?

boazsender · 2018-01-09T06:25:31Z

I'm not sure; I'll have to chase this a bit more through the Flask ORM. I'll likely do so when we're closer to implementing a solution, though a better-tuned server is probably what is actually in order.

For what it's worth, I observed multi-second postgres selects in htop for each load of the home page. When I did several, the server became non-responsive.
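One way to answer the "what are those queries?" question would be to time each statement and record the slow ones. The sketch below is only an illustration of that idea: it uses the standard-library `sqlite3` module as a stand-in for the real Postgres connection, and `timed_execute`, the 0.5 second threshold, and the `jobs` table are all hypothetical. On the Postgres side, the `log_min_duration_statement` setting can log slow statements without any application changes.

```python
import sqlite3
import time
from typing import Callable, List, Optional, Sequence, Tuple

# Module-level log of (elapsed_seconds, sql) pairs for slow statements.
SLOW_QUERIES: List[Tuple[float, str]] = []


def timed_execute(cursor,
                  sql: str,
                  params: Sequence = (),
                  threshold_seconds: float = 0.5,
                  clock: Callable[[], float] = time.perf_counter,
                  log: Optional[List[Tuple[float, str]]] = None):
    """Execute sql on a DB-API cursor and record it if it was slow.

    Only the execute() call is timed; fetching rows is not included.
    """
    if log is None:
        log = SLOW_QUERIES
    start = clock()
    cursor.execute(sql, params)
    elapsed = clock() - start
    if elapsed >= threshold_seconds:
        log.append((elapsed, sql))
    return cursor


if __name__ == "__main__":
    # Stand-in database; the real application would pass its own cursor.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    timed_execute(cur, "CREATE TABLE jobs (id INTEGER, state TEXT)")
    timed_execute(cur, "SELECT id FROM jobs WHERE state = ?", ("passed",))
    # Print the slowest statements first.
    for elapsed, sql in sorted(SLOW_QUERIES, reverse=True):
        print(f"{elapsed:.3f}s  {sql}")
```

Sorting the recorded pairs by elapsed time shows which statements to look at first, for example with `EXPLAIN ANALYZE` in Postgres.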

foolip · 2018-02-01T09:50:42Z

This continues to be a serious problem. I am getting 504 Gateway Time-out on https://pulls.web-platform-tests.org/job/23710.13 and other URLs today, and https://bit.ly/ecosystem-infra-status shows very frequent downtime.

jgraham · 2018-02-01T11:16:12Z

I recommend hooking it up to New Relic to understand which queries are slow and what the downtime is like.

foolip · 2018-02-25T17:51:46Z

This is a problem today as well: I need to look at https://pulls.web-platform-tests.org/job/24794.11 to understand what's wrong with web-platform-tests/wpt#9641, but it's returning 504 Gateway Time-out.

foolip added the priority:roadmap label Dec 18, 2017

foolip mentioned this issue Jan 5, 2018

No results for PR #8899 #56

Closed

foolip mentioned this issue Feb 1, 2018

Use Edge 16 in Travis builds web-platform-tests/wpt#9338

Merged
