GitHub - notnews/top10: Top 10 News! Scraping and Parsing Home pages and Top 10 Lists on News Sites

Top 10 News! Scraping and Parsing Home pages and Top 10 Lists on News Sites

We scraped and parsed the homepages, politics pages, and top10 lists of prominent news sites for 2012 and 2016--2017. We did all of this in 2016--2017, so the 2012 data comes exclusively from the Internet Archive. For 2016--2017, most of the data comes from scraping live sites, but some of it (for sites we realized much too late that we wanted to scrape) also comes from the Internet Archive.

Data

For a summary of the data, see here. The raw data (HTML files) and the CSVs with the parsed data are posted here.
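To give a sense of how a raw HTML file can become rows in a CSV, here is a minimal sketch that uses only the Python standard library. It is not the parser from this repository. The markup it expects (an `<ol>` with a `most-popular` class), the `TopListParser` and `top_list_to_csv` names, and the `rank,title,url` columns are all assumptions made for illustration. Real sites each need their own selectors.

```python
import csv
import io
from html.parser import HTMLParser


class TopListParser(HTMLParser):
    """Collect (title, href) pairs from links inside one ordered list.

    Assumes the list is an <ol> carrying `list_class`; this is a
    hypothetical structure, not one taken from any particular site.
    """

    def __init__(self, list_class):
        super().__init__()
        self.list_class = list_class
        self.in_list = False
        self.current_href = None
        self.current_text = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "ol" and self.list_class in classes:
            self.in_list = True
        elif tag == "a" and self.in_list:
            # Links without an href leave current_href as None and are skipped.
            self.current_href = attrs.get("href")
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            # Collapse runs of whitespace inside the headline.
            title = " ".join("".join(self.current_text).split())
            self.items.append((title, self.current_href))
            self.current_href = None
        elif tag == "ol":
            self.in_list = False


def top_list_to_csv(html, list_class="most-popular"):
    """Return CSV text with one ranked row per link found in the list."""
    parser = TopListParser(list_class)
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["rank", "title", "url"])  # hypothetical column names
    for rank, (title, url) in enumerate(parser.items, start=1):
        writer.writerow([rank, title, url])
    return out.getvalue()
```

Rank here is simply the order in which links appear in the list, which is a reasonable default for a top10 list but may not hold for every layout.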

Scripts

The scripts for scraping the Internet Archive still run nicely. The scripts for scraping current homepages, politics pages, and top10 lists are largely intact, but changes in page formats mean they may no longer work.
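For readers who want to see the general idea behind pulling a dated copy of a page from the Internet Archive, here is a small sketch. It is not taken from the scripts in this repository. It relies on two things the Wayback Machine provides: snapshot URLs of the form `https://web.archive.org/web/<timestamp>/<url>`, which resolve to the capture nearest the timestamp, and the availability endpoint at `https://archive.org/wayback/available`. The function names and the choice of noon as the target time are assumptions for illustration. The sketch only builds URLs and reads an already-decoded JSON response, so it makes no network calls itself.

```python
from datetime import date
from typing import Optional
from urllib.parse import urlencode

WAYBACK_PREFIX = "https://web.archive.org/web"
AVAILABILITY_API = "https://archive.org/wayback/available"


def wayback_url(page_url: str, day: date) -> str:
    """Build a snapshot URL targeting noon on `day` (noon is an arbitrary choice).

    The Wayback Machine serves the capture closest to the timestamp.
    """
    timestamp = day.strftime("%Y%m%d") + "120000"
    return f"{WAYBACK_PREFIX}/{timestamp}/{page_url}"


def availability_query(page_url: str, day: date) -> str:
    """Build an availability API query for the capture closest to `day`."""
    query = urlencode({"url": page_url, "timestamp": day.strftime("%Y%m%d")})
    return f"{AVAILABILITY_API}?{query}"


def closest_snapshot(response: dict) -> Optional[str]:
    """Extract the snapshot URL from a decoded availability response, if any."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None
```

A caller would fetch `availability_query(...)` with any HTTP client, decode the JSON, and pass the result to `closest_snapshot`. Checking availability first avoids downloading a capture that is far from the requested date.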

To learn how we set up live scraping and parsing of the data, including monitoring, see here.

License

Released under the MIT License

Authors

Suriyan Laohaprapanon and Gaurav Sood

Folders and files

scripts
.gitattributes
.gitignore
README.md
data_summary.md
internet_archive.md
live_pages.md
scripts.md
