Github Solidity Scraper

A tool to collect and store Solidity files, their version history and some metadata from public GitHub repositories.

Key features

  • The script uses the GitHub REST API to collect data about Solidity repositories, files and commits.
  • When run, it creates a local database with information about Solidity files, their repositories and their commit history.
  • Repositories are found through the repositories endpoint of the GitHub Search API (https://api.github.com/search/repositories).
  • Because the Search API returns at most 1,000 results per query, the script uses stratified search: it splits the search population into size ranges and queries each range separately.
  • Request throttling is used to make the best use of the rate-limited API.
  • Search results can be filtered according to various criteria.
  • If specified, the script only includes data that falls under open source licenses, in order to avoid copyright issues.
  • You can also decide whether to include forks in the search.

Script Steps

  1. Stratified Search on GitHub Search API
  2. For each repository collect files
  3. For each file collect commit history
  4. For each commit get content
  5. Store in local sqlite database
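As a rough illustration of steps 3 and 4, the sketch below builds request URLs for the GitHub REST API commits and contents endpoints and decodes the Base64 payload that the contents endpoint returns. The function names are hypothetical, and whether the script uses exactly these endpoints is an assumption.

```python
import base64
from urllib.parse import quote

API_ROOT = "https://api.github.com"


def commits_url(owner: str, repo: str, path: str) -> str:
    """URL listing the commits that touched one file (step 3)."""
    return f"{API_ROOT}/repos/{owner}/{repo}/commits?path={quote(path)}"


def content_url(owner: str, repo: str, path: str, sha: str) -> str:
    """URL for the content of one file at one commit (step 4)."""
    return f"{API_ROOT}/repos/{owner}/{repo}/contents/{quote(path)}?ref={sha}"


def decode_content(payload: dict) -> str:
    """Decode the Base64 'content' field of a contents API response."""
    # b64decode ignores the line breaks that GitHub inserts into the payload.
    return base64.b64decode(payload["content"]).decode("utf-8")
```

The decoded text is what would end up stored next to the commit metadata in step 5.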

How To Use

Getting Started: To clone and run this script, you will need Python 3 and pip installed on your computer. From your command line:

# Clone this repository
$ git clone https://github.com/carl-egge/github-solidity-scraper.git

# Go into the repository
$ cd github-solidity-scraper

# Install dependencies
$ python3 -m pip install -r requirements.txt

# Run the app (optionally use arguments)
$ python3 github-solidity-scraper.py [--github-token TOKEN]

Usage: You can control the behavior of the script with the command-line arguments listed below. It is strongly recommended to provide a GitHub access token using the --github-token argument.

  • --database : Specify the name of the database file that the results will be stored in (default: results.db)
  • --statistics : Specify a name for the spreadsheet file that stores the sampling statistics. This file can be used to continue a previous search if the script gets interrupted or throws an exception (default: sampling.csv)
  • --stratum-size : The width of each size range (stratum) into which the search population is partitioned (default: 5)
  • --min-size : The minimum code size that is searched for (default: 1)
  • --max-size : The maximum code size that is searched for (default: 393216)
  • --no-throttle : Disable the request throttling
  • --license-filter : When enabled, the script restricts the search to repositories that fall under one of GitHub's licenses
  • --search-forks : When enabled the search includes forks of repositories.
  • --github-token : With this argument you should specify a personal access token for GitHub (by default, the environment variable GITHUB_TOKEN is used)
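The arguments listed above could be declared with `argparse` roughly as follows. This sketch only mirrors the documented flags and defaults. The integer types are an assumption, and the real script's parser may differ.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented command-line arguments (illustrative only)."""
    parser = argparse.ArgumentParser(description="Collect Solidity files from GitHub")
    parser.add_argument("--database", default="results.db")
    parser.add_argument("--statistics", default="sampling.csv")
    parser.add_argument("--stratum-size", type=int, default=5)
    parser.add_argument("--min-size", type=int, default=1)
    parser.add_argument("--max-size", type=int, default=393216)
    parser.add_argument("--no-throttle", action="store_true")
    parser.add_argument("--license-filter", action="store_true")
    parser.add_argument("--search-forks", action="store_true")
    # Fall back to the GITHUB_TOKEN environment variable, as documented.
    parser.add_argument("--github-token", default=os.environ.get("GITHUB_TOKEN"))
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Note that `argparse` exposes dashed flags with underscores, so `--stratum-size` is read as `args.stratum_size`.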

Note: The GitHub API limits unauthenticated clients to 60 requests per hour. If a personal access token is provided, this limit rises to 5000 requests per hour. You should therefore always specify an access token, or have it stored in the shell environment, so that the script can run efficiently. More information on how to generate a personal access token can be found in the GitHub documentation.
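One simple way to throttle requests is to spread the remaining request budget evenly over the time left until the rate limit resets. The sketch below assumes the values come from GitHub's `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers. The function names are hypothetical, and the script's own throttling strategy may work differently.

```python
import time
from typing import Dict, Optional


def auth_headers(token: Optional[str]) -> Dict[str, str]:
    """Request headers, with the token attached when one is available."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


def seconds_until_next_request(remaining: int, reset_epoch: float,
                               now: Optional[float] = None) -> float:
    """Delay that spreads the remaining requests over the current window.

    remaining   -- value of X-RateLimit-Remaining
    reset_epoch -- value of X-RateLimit-Reset (UTC epoch seconds)
    """
    if now is None:
        now = time.time()
    window = max(reset_epoch - now, 0.0)
    if remaining <= 0:
        # Budget exhausted: wait until the limit resets.
        return window
    return window / remaining
```

With 10 requests left and 100 seconds until the reset, this pacing waits 10 seconds between requests instead of using up the budget immediately and then stalling.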

Showcase Smart Contract Repository

The results.db: The output of the script is a SQLite database that consists of three tables: repo, file and comit. These tables store the information that the script collects.

  • repo: This table holds data about the repositories that were found (e.g. url, path, owner ...)
  • file: This table contains data about the Solidity files that were found (e.g. path, sha ...)
    • The repo_id is a foreign key that refers to the repo in which the file was found.
  • comit: Each commit corresponds to a file and is stored in this table together with some metadata and the actual file content at that commit (e.g. sha, message, content, file_id ...)
    • The file_id is a foreign key that refers to the file to which the commit corresponds.
    • Commit is a reserved keyword in SQLite, so the table is named comit with one m.
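The relationship between the three tables can be illustrated with a small schema. Only the table names and the columns mentioned above (url, path, owner, sha, message, content, repo_id, file_id) come from this README. The primary key names and all column types are assumptions, so the real results.db may look different.

```python
import sqlite3

# Assumed schema: key names and column types are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS repo (
    repo_id INTEGER PRIMARY KEY,
    url     TEXT,
    path    TEXT,
    owner   TEXT
);
CREATE TABLE IF NOT EXISTS file (
    file_id INTEGER PRIMARY KEY,
    path    TEXT,
    sha     TEXT,
    repo_id INTEGER REFERENCES repo(repo_id)
);
CREATE TABLE IF NOT EXISTS comit (
    comit_id INTEGER PRIMARY KEY,
    sha      TEXT,
    message  TEXT,
    content  TEXT,
    file_id  INTEGER REFERENCES file(file_id)
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the three tables and enable foreign key enforcement."""
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    conn.execute("INSERT INTO repo VALUES (1, 'https://example.org/r', 'owner/r', 'owner')")
    conn.execute("INSERT INTO file VALUES (1, 'contracts/Token.sol', 'f00', 1)")
    conn.execute("INSERT INTO comit VALUES (1, 'c0ffee', 'Initial commit', 'contract Token {}', 1)")
    # Walk from a commit back to its file and repository.
    print(conn.execute(
        "SELECT repo.url, file.path, comit.message FROM comit "
        "JOIN file ON comit.file_id = file.file_id "
        "JOIN repo ON file.repo_id = repo.repo_id"
    ).fetchall())
```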

Look At The Data: To view and analyse the data you need a SQLite interface. If you do not have one installed, you can use one of many free online graphical user interfaces like ...

or you can download a free database interface such as ...

Feel free to use any tool you want to look at the output data.
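If you prefer to stay in Python, the standard library's `sqlite3` module is enough for a first look. The helper below is a hypothetical example that lists every table in a database file together with its row count. It makes no assumptions about the columns.

```python
import sqlite3
from typing import Dict


def table_row_counts(db_path: str) -> Dict[str, int]:
    """Return {table name: number of rows} for a SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )]
        # Table names come from the database itself, so quoting them is enough.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()


if __name__ == "__main__":
    # Note: sqlite3.connect creates an empty file if the path does not exist.
    print(table_row_counts("results.db"))
```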


The Showcase Database: To show the database schema and some example data of Solidity files, this repository contains a small showcase.db that you can explore. This way you can look at some output without having to run the script yourself. The showcase database can be viewed using your favorite SQLite interface.

License

The MIT License (MIT). Please have a look at the LICENSE for more details.
