This project implements an ETL (Extract, Transform, Load) pipeline that extracts data from the GitHub Firehose API, transforms it, and loads it into a PostgreSQL database. It is built from the following components:
- GitHub Firehose API: Provides a real-time stream of public events from GitHub.
- PostgreSQL: Stores the transformed data from the GitHub Firehose API.
- Kafka: Serves as the message queue.
- Spark: Used for processing and transforming the data in real-time.
- Docker: Orchestrates services.
- Quix: A library for building and managing streaming data pipelines. It simplifies interaction with Kafka by providing a high-level API for producing and consuming messages.
- requests-sse: A Python library for consuming Server-Sent Events (SSE) on top of the Requests library (see the extraction sketch after this list).
- PySpark: The Python API for Spark, used to define and run the data transformations.
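A minimal sketch of the extraction step, assuming the `EventSource` interface exposed by requests-sse; the endpoint URL and event field names are illustrative assumptions rather than values taken from the project:

```python
import json

from requests_sse import EventSource

# Illustrative endpoint; the project may point at a different firehose URL.
FIREHOSE_URL = "http://github-firehose.libraries.io/events"


def stream_github_events():
    """Yield GitHub events from the SSE stream as dictionaries."""
    with EventSource(FIREHOSE_URL) as source:
        for event in source:
            if event.data:  # skip keep-alive messages with empty payloads
                yield json.loads(event.data)


if __name__ == "__main__":
    for evt in stream_github_events():
        print(evt.get("type"), evt.get("repo", {}).get("name"))
```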
The producer in this project is tuned with additional Kafka settings to keep up with a high-traffic source like the GitHub Firehose API (a configuration sketch follows this list). The settings are:
- Statistics Interval: statistics.interval.ms is set to collect performance statistics at regular intervals.
- Batch Size: batch.size is set to 1 MB so messages are batched efficiently, trading a small amount of latency for higher throughput.
- Linger Time: linger.ms is set to 500ms to wait and collect messages before sending them in a batch, further optimizing throughput.
- Compression: compression.type is set to gzip to compress message data, saving storage and bandwidth at the cost of additional CPU usage for compression and decompression.
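A minimal sketch of how these settings might be passed to the producer, assuming the Quix Streams `Application` API and its `producer_extra_config` parameter; the broker address, topic name, and statistics interval shown here are illustrative assumptions:

```python
import json

from quixstreams import Application

app = Application(
    broker_address="kafka:9092",  # illustrative; the project reads this from its own config
    producer_extra_config={
        "statistics.interval.ms": 60000,  # illustrative interval for librdkafka statistics
        "batch.size": 1048576,            # 1 MB batches for efficient batching
        "linger.ms": 500,                 # wait up to 500 ms to fill a batch
        "compression.type": "gzip",       # compress batches to save bandwidth
    },
)

topic = app.topic("github_events")

with app.get_producer() as producer:
    event = {"type": "PushEvent", "repo": {"name": "example/repo"}}
    producer.produce(
        topic=topic.name,
        key=event["type"],
        value=json.dumps(event).encode("utf-8"),
    )
```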
- Docker
- Docker Compose
- Build and run the Docker containers:
docker-compose up --build
- Verify Data Ingestion:
docker-compose exec postgres psql -U postgres -d github_events
SELECT * FROM github_events;
- main.py: Extracts data from the GitHub Firehose API, transforms it, and produces messages to Kafka.
- consumer.py: Consumes messages from Kafka, transforms them, and loads them into PostgreSQL (a sketch of this stage follows the list).
- entrypoint.sh: Determines whether to run the producer or consumer based on the RUN_MODE environment variable.
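One way the consumer stage could be structured with PySpark Structured Streaming, reading from Kafka and appending each micro-batch to PostgreSQL over JDBC. This is a sketch under assumptions: the topic name, JDBC URL, credentials, and event schema are illustrative, it requires the Kafka connector and PostgreSQL JDBC driver on the Spark classpath, and the project's consumer.py may be organized differently:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("github-events-consumer").getOrCreate()

# Illustrative schema covering only a couple of fields of a GitHub event.
event_schema = StructType([
    StructField("id", StringType()),
    StructField("type", StringType()),
])

# Read the raw Kafka stream (topic name and broker address are assumptions).
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "github_events")
    .load()
)

# Parse the JSON payload into columns.
events = (
    raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)


def write_batch(batch_df, _batch_id):
    # Append each micro-batch to PostgreSQL via JDBC (illustrative connection settings).
    (
        batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://postgres:5432/github_events")
        .option("dbtable", "github_events")
        .option("user", "postgres")
        .option("password", "postgres")
        .mode("append")
        .save()
    )


query = events.writeStream.foreachBatch(write_batch).start()
query.awaitTermination()
```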