This repository describes and collects the work I did for the Internet Archive as a student developer in the Google Summer of Code 2018.
For details on each of my applications and information on how they work please look at the README files in their repositories.
My project was Idea 5: Integrate the “Scanner” software into the Wayback Machine. The aim was to identify changes between archived captures of a given URL and to highlight those changes, by integrating the web-monitoring software into the Wayback Machine and helping to advance that software. In practice this meant improving the web-monitoring software, recognising which of its components are useful to the Internet Archive’s implementation and how they can best be used, and then extracting them for use in a new project.
The web-monitoring project consists of four components:
- web-monitoring-db: A Ruby on Rails app that serves database data via a REST API, serves diffs and collects human-entered annotations.
- web-monitoring-ui: A React front-end that provides useful views of the diffs. It communicates with the Rails app via JSON.
- web-monitoring-processing: A Python backend that ingests new captured HTML, computes diffs, performs prioritization/filtering, and populates databases for the Rails app.
- web-monitoring-versionista-scraper: A set of Node.js scripts used to extract data from Versionista and load it into the database.
After carefully studying these components, it was apparent that only the third one (web-monitoring-processing) and part of the second one (web-monitoring-ui) were required for our implementation.
Since we were going to use web-monitoring-processing as a whole, it was the best place to begin work for the GSoC. My mentor encouraged me to find issues and ways to improve this project. We were mainly going to use the wm-diffing-server, so I forked the GitHub repository and started digging deeper into the related code. In total, I made 3 Pull Requests (see Table 1), consisting of 12 commits (see Table 2), on the project.
Table 1
Pull Request | Link |
---|---|
Diffing server exception handling #185 | edgi-govdata-archiving/web-monitoring-processing#185 |
Add tornado debug env value #187 | edgi-govdata-archiving/web-monitoring-processing#187 |
Cors #188 | edgi-govdata-archiving/web-monitoring-processing#188 |
Table 2
The web-monitoring-ui project is a complex React front-end application that communicates with the web-monitoring-db Rails backend, undertakes the task of displaying pages for which snapshots exist, and displays their differences. After studying the component, I decided that only the components that render the diff views are what our project requires.
Wayback-diff is the React component which I created for the Google Summer of Code 2018 and will be integrated into the Wayback Machine. It queries the wm-diffing-server (part of the web-monitoring-processing app) for the differences between two captures of a webpage. wm-diffing-server in turn fetches the two captures from the Wayback Machine, calculates their differences and returns a JSON response to the component. After that, wayback-diff renders the snapshots with their differences marked in the user’s browser. This component is exported as an ES module and can be used in any React project.
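The request flow described above can be sketched in Python as a small URL builder. This is only an illustration and not the code wayback-diff actually uses: the server address, the `html_token` diff type and the `a`, `b` and `format` query parameters are assumptions about how a diffing server might be called, and `capture_url` and `diff_request_url` are hypothetical helper names. The `id_` suffix asks the Wayback Machine for the capture's original content without the replay toolbar.

```python
from urllib.parse import urlencode

# Base address for raw Wayback Machine captures.
WAYBACK_PREFIX = "http://web.archive.org/web"


def capture_url(timestamp: str, url: str) -> str:
    """Address of one capture of `url`, taken at `timestamp` (YYYYMMDDhhmmss)."""
    return f"{WAYBACK_PREFIX}/{timestamp}id_/{url}"


def diff_request_url(server: str, diff_type: str, url: str,
                     timestamp_a: str, timestamp_b: str) -> str:
    """Build the request a client could send to a diffing server.

    ASSUMPTION: the server exposes one path per diff type and reads the
    two documents to compare from the `a` and `b` query parameters.
    """
    query = urlencode({
        "format": "json",
        "a": capture_url(timestamp_a, url),
        "b": capture_url(timestamp_b, url),
    })
    return f"{server.rstrip('/')}/{diff_type}?{query}"


if __name__ == "__main__":
    # Hypothetical local server and example captures.
    print(diff_request_url("http://localhost:8888", "html_token",
                           "http://example.com/",
                           "20170101000000", "20180101000000"))
```

The server would then fetch both capture addresses itself, which is why the client only needs to send URLs rather than page content.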
As an example project, and in order to demo its usage, I created the wayback-diff-test repository. This repository contains an empty project, as it would be initialized using the yarn create react-app command. It merely imports wayback-diff and uses it under a BrowserRouter.
My mentor tried the same thing in a minimal project of his own, the minimal-react-starter repository, where he imported and used my component. There, I submitted a Pull Request that resolved a configuration issue with express, which also gives guidance to anyone who wants to use this component with express.
Wayback-discover-diff is a Python server I built that calculates the Simhash values of snapshots of webpages. It runs on Flask and Celery and uses Redis as its database.
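For readers unfamiliar with Simhash, the idea can be sketched in a few lines of Python. This is a minimal textbook version and not the implementation used in wayback-discover-diff: the word-level tokenisation, the SHA-256 token hash and the 64-bit fingerprint size are all assumptions made for the example, and `simhash` is a hypothetical function name.

```python
import hashlib
import re
from collections import Counter


def simhash(text: str, bits: int = 64) -> int:
    """Return a `bits`-wide Simhash fingerprint of `text`.

    ASSUMPTIONS: features are lowercase words weighted by frequency,
    and each word is hashed with SHA-256 truncated to `bits` bits.
    """
    weights = Counter(re.findall(r"\w+", text.lower()))
    totals = [0] * bits
    for token, weight in weights.items():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes for (+weight) or against (-weight) every bit.
        for i in range(bits):
            totals[i] += weight if (token_hash >> i) & 1 else -weight
    # A bit is set in the fingerprint when its votes sum to a positive value.
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


if __name__ == "__main__":
    print(hex(simhash("The quick brown fox jumps over the lazy dog")))
```

Because every word contributes votes to every bit, documents that share most of their words tend to produce fingerprints that differ in only a few bits, which is what makes the value useful for comparing snapshots.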
During the development of wayback-discover-diff, my mentor and I saw that its execution was not fast enough to be practical in any use case. So I started looking into improving the algorithm wherever possible. Once all the improvements were made, I wrote a detailed report describing them and the performance boost each of them offered. Together, my improvements reduced the runtime by 94.32%.
Depending on the look of the Wayback Machine’s User Interface, wayback-diff might require minor changes to its look to match the Wayback Machine’s style.
Also, wayback-diff needs more work to be ready for production. We must make it scalable, which means handling URLs with large numbers of captures (100k+). Last but not least, it needs better error handling for special cases such as unusual file types or encodings.
wayback-discover-diff also needs more work to be ready for production. This work includes writing more tests, testing the app under a considerable load (100+ concurrent requests) and improving error handling.
A feature that is currently missing from wayback-discover-diff is calculating how much two snapshots have changed. Since their Simhash values would already have been calculated and stored in the database, the distance between the two snapshots’ Simhash values would be a very useful piece of information. Integrating this kind of information into the Wayback Machine would help its users identify how much a webpage has changed since a specific moment in time. In addition, it would save resources for the Internet Archive and time for its users, as they would know which snapshots it makes sense to compare. Comparing snapshots that are 100% or 5% identical would not help users gain any insight into how the page has changed.
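The distance between two Simhash values is usually taken to be the Hamming distance, that is, the number of bit positions in which the two fingerprints differ. A minimal sketch of this proposed feature follows; it assumes the stored values can be read back as Python integers of a known width (64 bits here), and the function names are hypothetical.

```python
def simhash_distance(a: int, b: int) -> int:
    """Hamming distance: the number of bits that differ between a and b."""
    return bin(a ^ b).count("1")


def simhash_similarity(a: int, b: int, bits: int = 64) -> float:
    """Share of identical bits, from 0.0 (all differ) to 1.0 (identical).

    ASSUMPTION: both fingerprints are `bits` wide.
    """
    return 1.0 - simhash_distance(a, b) / bits


if __name__ == "__main__":
    first, second = 0b1011, 0b0010
    print(simhash_distance(first, second))    # two bits differ
    print(simhash_similarity(first, second))
```

A user interface could use such a similarity value to skip pairs of snapshots that are identical, or so different that a diff view would show little of use.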