We scraped and parsed the homepages, politics pages, and top10 lists of prominent news sites for 2012 and 2016--2017. All of the scraping was done in 2016--2017, so the 2012 data comes exclusively from the Internet Archive. For 2016--2017, most of the data comes from scraping live sites, but for sites that we decided to scrape only belatedly, some of the data also comes from the Internet Archive.
For a summary of the data, see here. The raw data (HTML files) and the CSVs with the parsed data are posted here.
The scripts for scraping the Internet Archive still run nicely. The scripts for scraping current homepages, politics pages, and top10 lists are largely intact, but changes in page formats since then mean they may no longer work.
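For readers who want a feel for what pulling an archived page involves, here is a minimal sketch. It is not the code from this repository. It assumes the Internet Archive's public Wayback availability endpoint (`https://archive.org/wayback/available`), and the helper names (`build_query`, `closest_snapshot`, `fetch_snapshot_html`) as well as the site and date in the usage note are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public Wayback Machine availability endpoint.
WAYBACK_API = "https://archive.org/wayback/available"


def build_query(url, timestamp):
    """Build the availability query for `url` near `timestamp` (YYYYMMDD)."""
    return WAYBACK_API + "?" + urlencode({"url": url, "timestamp": timestamp})


def closest_snapshot(payload):
    """Return the URL of the closest archived snapshot, or None if absent."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"]


def fetch_snapshot_html(url, timestamp):
    """Look up the closest snapshot and download its raw HTML (bytes)."""
    with urlopen(build_query(url, timestamp), timeout=30) as resp:
        payload = json.load(resp)
    snapshot_url = closest_snapshot(payload)
    if snapshot_url is None:
        return None
    with urlopen(snapshot_url, timeout=30) as resp:
        return resp.read()
```

A call such as `fetch_snapshot_html("nytimes.com", "20121106")` would return the HTML of the snapshot closest to that date, or `None` if the archive reports no snapshot. The returned snapshot may be from a different day than the one requested, so it is worth recording the snapshot's own timestamp alongside the saved file.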
To learn how we set up live scraping and parsing of the data, including monitoring, see here.
Released under the MIT License
Suriyan Laohaprapanon and Gaurav Sood