Extract any data from the blockchain using gRPC-enabled endpoints (powered by Firehose and Substreams)
substreams_firehose is a data extraction tool that uses the Firehose and Substreams technologies to access blockchain data. It acts as an easy-to-use interface to gRPC endpoints, simplifying the extraction of the data that matters to you.
Using a flexible approach, you can review and select which information to extract into the final output (JSONL is the default, but the final shape of the data is entirely up to you). You can then use this data to power other applications or for your own purposes.
Requires Python >= 3.7
$ python3 -m venv .venv
$ source .venv/bin/activate
(.venv) $ pip install substreams_firehose
$ git clone git@github.com:pinax-network/substreams_firehose.git
$ cd substreams_firehose
$ python3 -m venv .venv
$ source .venv/bin/activate
(.venv) $ pip install -r requirements.txt
A .pylintrc file is provided if you want to run Pylint:
(.venv) $ pylint substreams_firehose --rcfile=.pylintrc
Auto-generated documentation can be browsed here.
Important: see the next section for how to set up your API keys.
To manage the list of blockchain data providers and the specific settings for individual gRPC connections, substreams_firehose uses two kinds of configuration files (written in Hjson, an extended JSON format that notably allows comments).
A TUI (Terminal User Interface) is available to manage (almost) all aspects of the configuration required for the tool. You can run it with the following command:
(.venv) $ python -m substreams_firehose.config
Press F1 on any screen of the TUI to access a help menu. To get started, you can bring up the main menu on the starting screen by pressing CTRL+X.
Note: the TUI requires your terminal emulator to support a 256-color palette for syntax highlighting (if it doesn't, the tool still works, you just won't see the pretty colors!). Check here if you're not sure that's the case.
You can also edit any configuration file manually (stored under .venv/lib/{PYTHON_VERSION}/site-packages/substreams_firehose/ with a PyPI install) with the editor of your choice, but beware that an incorrect edit might break the application.
This file holds the list of endpoints serving data using either Firehose, Substreams or both. It specifies which authentication endpoint to use and adds a few details describing each endpoint as well as some other settings like the number of retries for failed blocks, etc. (see comments).
It is available at .venv/lib/{PYTHON_VERSION}/site-packages/substreams_firehose/config.hjson in the PyPI install. From source, you will want to copy the sample.config.hjson file and rename it.
The most important part is to fill in the api_key setting by registering an API key (with a free account) for each authentication endpoint you plan on using:
- To use Dfuse-based endpoints, go to https://dfuse.eosnation.io/.
- To use Pinax-based endpoints, go to https://pinax.network/.
- To use StreamingFast-based endpoints, go to https://app.streamingfast.io/.
You can run the TUI configuration tool to set your API keys. The tool will check that they are valid against the specified endpoint (which is why the dummy key used in the video below fails the check):
set_api_keys_demo.mp4
The TUI also allows you to add your own authentication providers and manage the list of data endpoints that you can query.
A stub configuration file describes a particular way to run a gRPC stream for querying blockchain data from an endpoint. It specifies the objects that will be manipulated, the request parameters to send, and the data to filter into the final output. As such, you can have multiple stub configs for an endpoint, each dedicated to extracting some particular data.
Here's an example of a stub config dedicated to a substream:
{
  "base": "sf.substreams.v1", // Package of the protofile holding the gRPC objects
  "service": "Stream", // Service to use on the gRPC endpoint
  "method": "Blocks", // Method of the service to use
  "request": {
    "object": "Request", // Request object of the method
    "params": {
      "output_modules": [ // Substream output parameter
        "map_block",
      ],
      "modules": "substreams_firehose/proto/common.spkg" // Substream package describing the data format
    }
  },
  "response": {
    "object": "Response", // Response object of the method
    "params": {} // Output filter (empty will get ALL the fields from the gRPC JSON response)
  }
}
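For orientation, the three top-level keys line up with how gRPC addresses a method, namely as /<package>.<Service>/<Method>. The sketch below is an illustration only, not the tool's own loader: it assumes the stub has already been parsed into a Python dict, checks for the keys shown in the example, and derives that path. The names REQUIRED_KEYS and grpc_method_path are hypothetical.

```python
from typing import Any, Dict

# Keys taken from the example stub above; the real tool may accept more.
REQUIRED_KEYS = ("base", "service", "method", "request", "response")


def grpc_method_path(stub: Dict[str, Any]) -> str:
    """Return the gRPC path '/<package>.<Service>/<Method>' for a stub dict."""
    missing = [key for key in REQUIRED_KEYS if key not in stub]
    if missing:
        raise ValueError(f"stub config is missing keys: {', '.join(missing)}")
    for section in ("request", "response"):
        # Both sections name a protobuf object and carry a params mapping.
        if "object" not in stub[section] or "params" not in stub[section]:
            raise ValueError(f"'{section}' needs 'object' and 'params'")
    return f"/{stub['base']}.{stub['service']}/{stub['method']}"


# Same content as the Hjson example, written as a plain dict.
stub = {
    "base": "sf.substreams.v1",
    "service": "Stream",
    "method": "Blocks",
    "request": {
        "object": "Request",
        "params": {
            "output_modules": ["map_block"],
            "modules": "substreams_firehose/proto/common.spkg",
        },
    },
    "response": {"object": "Response", "params": {}},
}

print(grpc_method_path(stub))  # /sf.substreams.v1.Stream/Blocks
```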
Stub configuration files can be easily managed and edited using the TUI configuration tool. A step-by-step approach allows for easily filling all the required information to end up with a stub completely describing the extraction process for a given endpoint.
Here's a demo of how to set up a stub configuration for extracting gas price data on the Ethereum mainnet and run it with the tool:
stub_config_demo.mp4
substreams_firehose comes with a number of predefined endpoints, .proto file definitions and stub configurations, so you can start using it without losing too much time setting up your own configurations.
From the TUI configuration tool starting screen (go back to the start of Configuration files if you missed how to run it), you can see the list of available endpoints and the blockchain they target.
To extract data for a given block range, simply pass the id of an endpoint to the command-line tool. For example, to retrieve the data of the latest 100 blocks from the Ethereum main chain, you can run:
(.venv) $ LAST_ETH_BLOCK=$(curl -s https://api.blockcypher.com/v1/eth/main | jq .height) && echo $LAST_ETH_BLOCK
(.venv) $ python -m substreams_firehose $(($LAST_ETH_BLOCK - 100)) $LAST_ETH_BLOCK --grpc-entry eth_mainnet --out-file jsonl/eth.jsonl
Note: work is in progress to allow specifying a date range instead of block numbers for the query, stay tuned!
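If you drive the tool from a script rather than a shell, the same invocation can be assembled in Python. This is a minimal sketch: build_command is a hypothetical helper, and it only uses the positional block range and the --grpc-entry and --out-file options shown in the command above.

```python
from typing import List


def build_command(latest_block: int, count: int, entry_id: str, out_file: str) -> List[str]:
    """Build the argv for extracting the `count` blocks before `latest_block`."""
    if count <= 0 or count > latest_block:
        raise ValueError("count must be positive and no larger than latest_block")
    start = latest_block - count  # same arithmetic as $(($LAST_ETH_BLOCK - 100))
    return [
        "python", "-m", "substreams_firehose",
        str(start), str(latest_block),
        "--grpc-entry", entry_id,
        "--out-file", out_file,
    ]


# 17_000_000 is a placeholder height, not a value fetched from any API.
command = build_command(17_000_000, 100, "eth_mainnet", "jsonl/eth.jsonl")
print(" ".join(command))
```

The resulting list can be handed to subprocess.run(command, check=True) from inside the virtual environment.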
The data of the latest 100 blocks will be stored in the jsonl/eth.jsonl file, with one row for each block. If you have jq installed, you can then preview the output with the following command:
$ cat jsonl/eth.jsonl | jq --color-output | less --RAW-CONTROL-CHARS
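Without jq, the same file can be inspected from Python, since each line of a JSONL file is a standalone JSON document. The helpers read_jsonl and preview below are hypothetical names for a small sketch that relies only on the standard library.

```python
import json
from typing import Any, Iterator


def read_jsonl(path: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)


def preview(path: str, limit: int = 1) -> None:
    """Pretty-print the first `limit` rows, similar to piping through jq."""
    for index, row in enumerate(read_jsonl(path)):
        if index >= limit:
            break
        print(json.dumps(row, indent=2))
```

Calling preview("jsonl/eth.jsonl") prints the first block; because read_jsonl is a generator, large files are not loaded into memory at once.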
To see all available options for the tool, run:
(.venv) $ python -m substreams_firehose -h
For even more control over the data extracted, the extraction process uses a modular approach for manipulating response objects coming from a gRPC stream. A block processing function is used for extracting the data that is later stored in the output file at the end of the block extraction process.
Several block processors are available by default:
- default_processor will output all the data (filtered according to the stub config) from the gRPC response.
- default_substream_processor should be used with a substream and will output the data (filtered according to the stub config) from each of the output modules in the gRPC response.
- filtered_block_processor will output the data (filtered according to the stub config) using the legacy FirehoseV1 filtering system.
All three will output the response data in JSON, with the final data being compiled in a JSONL file (one line for each response parsed).
Customizing the format of the data extracted is the main goal of writing a custom block processor.
In order to write custom block processing functions, some conditions must be respected:
- The function should be placed inside the processors.py file (avoid name conflicts with existing functions).
- The function should act as a generator (using the yield keyword) to return the data. A dictionary is the preferred format, but it could be any format (specify the --no-json-output flag if you don't want to convert the final output to JSON).
- The first parameter of the function should take the raw data extracted from the gRPC stream (Google protobuf Message type).
You can use the _filter_data function to apply the filters defined in the stub config to the output and process it further from there. Or you can directly get all the content from the response using the MessageToJson function. See the other block processors in the processors.py file for details and instructions.
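To make the conditions concrete, here is a minimal sketch of a custom processor that takes the MessageToJson route. Several things are assumptions: the field names number and transactionTraces are hypothetical and depend on the chain and stub used, summarize_block is an illustrative helper, and **kwargs is only there in case the tool passes extra arguments. The _filter_data helper is deliberately not called, since its signature should be checked in processors.py.

```python
import json
from typing import Any, Dict, Iterator


def summarize_block(block: Dict[str, Any]) -> Dict[str, Any]:
    """Keep two fields of a decoded block (hypothetical field names)."""
    return {
        "number": block.get("number"),
        "transaction_count": len(block.get("transactionTraces", [])),
    }


def my_block_processor(raw_data: Any, **kwargs: Any) -> Iterator[Dict[str, Any]]:
    """Yield one summary dict per gRPC response message."""
    # Imported lazily so summarize_block can be tried without protobuf installed.
    from google.protobuf.json_format import MessageToJson

    # MessageToJson returns a JSON string; decode it to work with a dict.
    block = json.loads(MessageToJson(raw_data))
    yield summarize_block(block)
```

Keeping the field selection in a plain function makes it easy to test on sample dictionaries before wiring it to a live stream.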
You can then use a custom block processor through the command-line using the --custom-processor (or -p) argument and providing the name of the function. Also, if you do not want the final output to be converted to JSON before being sent to the output file, you can pass the --no-json-output flag.
For example, let's say you've implemented a custom function my_block_processor in processors.py. You would then pass the argument as --custom-processor my_block_processor. The script will locate it inside the processors.py module and use the my_block_processor function to parse block data and extract it to the output file.
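The general idea of locating a function by name can be sketched as follows. This illustrates the technique and is not the tool's actual code: resolve_processor is a hypothetical helper, and a stand-in module object replaces processors.py so the example runs on its own.

```python
import types
from typing import Callable


def resolve_processor(module: types.ModuleType, name: str) -> Callable:
    """Return the callable called `name` from `module`, or raise ValueError."""
    candidate = getattr(module, name, None)
    if not callable(candidate):
        raise ValueError(f"no processor named '{name}' in {module.__name__}")
    return candidate


# Stand-in for processors.py so the sketch is self-contained.
processors = types.ModuleType("processors")


def my_block_processor(raw_data):
    """Toy generator that wraps whatever it receives."""
    yield {"raw": raw_data}


processors.my_block_processor = my_block_processor

processor = resolve_processor(processors, "my_block_processor")
print(next(processor("block-1")))  # {'raw': 'block-1'}
```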