GitHub User and Repository data scraper for take-home assignment.
github-scraper is a service written in Python that scrapes User and Repository data from the GitHub API and stores a small subset of the dataset. Additionally, it provides an API to query the scraped data.
This project uses Python 3.6+ with the libraries listed in requirements.txt.
It is highly encouraged to use a virtual environment to isolate the project's dependencies.
python3 -m venv scraper-environment
source ./scraper-environment/bin/activate
git clone https://github.com/jcaraballo17/github-scraper.git
cd github-scraper
pip install -r requirements.txt
The base directory for the Django project is src, where the manage.py script is located.
cd src
The project configuration is stored in a file called config.json inside the github_scraper/settings directory. You can find a template of this file in the same directory.
Here is an example of config.json for a development setup using a sqlite3 database:
{
    "secret_key": "UaJYaHUsViHcds5Un3vhEsH-5ZAsXg3V6cgxhToR6wU",
    "github_oauth_token": "b6706204bf1911b29c7208adb5b0ecbrc1f6bad7",
    "debug_mode": true,
    "databases": [
        {
            "connection_name": "default",
            "database_name": "scraper_data.sqlite3",
            "engine": "django.db.backends.sqlite3"
        }
    ]
}
Other database managers can also be used; the appropriate configuration options can be found in the official Django documentation.
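For reference, the settings module can read this file along these lines; this is a minimal sketch assuming config.json sits next to the settings module, and the project's actual loader may differ:

```python
import json
from pathlib import Path

# Sketch: load config.json from the settings directory (the path is an assumption).
CONFIG_PATH = Path(__file__).resolve().parent / "config.json"
with open(CONFIG_PATH) as config_file:
    config = json.load(config_file)

SECRET_KEY = config["secret_key"]
DEBUG = config["debug_mode"]

# Map each "databases" entry onto Django's DATABASES setting.
DATABASES = {
    entry["connection_name"]: {
        "ENGINE": entry["engine"],
        "NAME": entry["database_name"],
    }
    for entry in config["databases"]
}
```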
To create the database schema and tables, run the migrate Django command:
python manage.py migrate
The service has two main parts:
- The API exposing the scraped data.
- The scrape_git Django command to scrape data.
Here we'll cover how to use both of them.
After creating the configuration file, you should be able to run Django's integrated development server and open it in a Web Browser to use the API.
python manage.py runserver
Configuring a web server is out of the scope of this project, but I recommend following this guide on how to set up nginx and gunicorn to serve a Django project on Ubuntu.
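For a rough idea, with gunicorn installed the project could be served with something like the following, assuming the default WSGI module Django generates for a project named github_scraper:

gunicorn github_scraper.wsgi:application --bind 0.0.0.0:8000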
After starting the server, you can navigate to http://127.0.0.1:8000 in a web browser and use these available API endpoints:
/users/
/users/?since=<id>
/users/<login>/
/users/<login>/repos/
/repos/
/repos/<owner>/<name>/
/users/ - shows a list of all users.
/users/?since=<id> - shows a list of users with an id greater than <id>.
/users/<login>/ - shows the details of the user with username <login>.
/users/<login>/repos/ - shows a list of repositories by the user with username <login>.
/repos/ - shows a list of all repositories.
/repos/<owner>/<name>/ - shows the details of the repository with owner username <owner> and repository name <name> (the repository full name).
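As an illustration, here is one way to query these endpoints from Python with the requests library, assuming the development server is running on the default port and the endpoints return JSON:

```python
import requests

BASE_URL = "http://127.0.0.1:8000"

# List users with an id greater than 46.
users = requests.get(f"{BASE_URL}/users/", params={"since": 46}).json()

# Fetch one user's details and their repositories.
user = requests.get(f"{BASE_URL}/users/jcaraballo17/").json()
repos = requests.get(f"{BASE_URL}/users/jcaraballo17/repos/").json()

print(len(users), user, len(repos))
```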
The scrape_git command can be used to get individual users or a range of users starting at an id.
Usage:
scrape_git
scrape_git [user] [user] [user]
scrape_git [--since [ID]]
scrape_git [--users [number_of_users]]
scrape_git [--repositories [number_of_repositories]]
scrape_git [--retry]
scrape_git --help
Scrape all users and all their repositories (this will fail when the rate limit is exceeded)
python manage.py scrape_git
Scrape the users jcaraballo17 and Maurier and all of their repositories
python manage.py scrape_git jcaraballo17 Maurier
Scrape the user jcaraballo17 with only one repository
python manage.py scrape_git jcaraballo17
Scrape the first 100 users with all of their repositories, and wait until the rate limit reset time has passed to continue scraping if the rate limit is exceeded (see the sketch after these examples)
python manage.py scrape_git --users 100 --retry
Scrape 50 users starting at ID 3000, with 5 repositories for each user
python manage.py scrape_git --since 3000 --users 50 --repositories 5
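Conceptually, the --retry flag makes the scraper sleep until GitHub's rate limit window resets instead of failing. Here is a simplified sketch of the idea, not the project's actual implementation (the reset timestamp would come from GitHub's X-RateLimit-Reset response header):

```python
import time

def wait_for_rate_limit_reset(reset_timestamp: float) -> None:
    """Sleep until the UNIX timestamp at which GitHub resets the rate limit."""
    delay = reset_timestamp - time.time()
    if delay > 0:
        time.sleep(delay)
```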
To test the code with code coverage, run:
pytest --cov=github_data/ --cov-report html
open htmlcov/index.html
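A new test would follow the usual pytest-django style; the sketch below uses hypothetical model and field names purely for illustration, and assumes pytest-django is installed:

```python
import pytest

from github_data.models import User  # hypothetical import path


@pytest.mark.django_db
def test_scraped_user_is_stored():
    # Hypothetical field names, shown only to illustrate the test style.
    User.objects.create(id=3000, login="jcaraballo17")
    assert User.objects.get(login="jcaraballo17").id == 3000
```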
The project was developed in Python for its simplicity and versatility. Python 3.6+ was chosen above all for its support for type hints.
Django was used as the web framework for its powerful ORM, its Model-View-Template architecture, and the scalability it provides. On the other hand, Django usually ends up being too complicated for small projects, so it was not the first option when considering which web framework to use. The other option, considering the scope of the project, was FastAPI, for its high performance, simplicity, and API-oriented design.
The GitHub API has extensive User and Repository data. For this project I chose to scrape only what I considered the most important data for each, to keep the project simple and easier to develop.
This was a personal choice, and what I'm most used to for Django projects, because all of the configuration is defined in JSON format in one place.
Old habits from other programming languages and a code convention in previous jobs.
The ghapi library provides simple, idiomatic access to the entire GitHub API. The interface is built with GitHub's OpenAPI specification for their REST API, and it's much simpler to use and more efficient than other GitHub libraries like PyGithub.
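For a taste of the library, fetching a user and their repositories takes only a couple of calls; a minimal sketch (in the project itself, the token comes from config.json's github_oauth_token):

```python
from ghapi.all import GhApi

# Authenticate with a GitHub OAuth token.
api = GhApi(token="<github_oauth_token>")

# Fetch a single user and the list of their repositories.
user = api.users.get_by_username("jcaraballo17")
repos = api.repos.list_for_user("jcaraballo17")

print(user.login, [repo.name for repo in repos])
```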
One setback I encountered is that, because it's a fairly new library, the only reference is the official documentation, which is hard to navigate. This led me to make some not-so-efficient design decisions, like the pagination methods in the Scraper tool, which turned out to be unnecessary and complicated the development of the Scraper.