This project implements an ETL (Extract, Transform, Load) pipeline that extracts data from the GitHub Firehose API, transforms it, and loads it into a PostgreSQL database. It is built from the following components:
- GitHub Firehose API: Provides a real-time stream of public events from GitHub.
- PostgreSQL: Stores the transformed data from the GitHub Firehose API.
- Kafka: Serves as the message queue.
- Spark: Used for processing and transforming the data in real-time.
- Docker: Orchestrates services.
- Quix: A library for building and managing streaming data pipelines. It simplifies interaction with Kafka by providing a high-level API for producing and consuming messages.
- requests-sse: A Python library for consuming Server-Sent Events (SSE) on top of the Requests library (see the extraction sketch after this list).
- PySpark: The Python API for Spark, used to define and run the data transformations.
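A minimal sketch of the extraction step, assuming the `EventSource` interface exposed by requests-sse; the endpoint URL and event field names are illustrative assumptions rather than values taken from the project:

```python
import json

from requests_sse import EventSource

# Illustrative endpoint; the project may point at a different firehose URL.
FIREHOSE_URL = "http://github-firehose.libraries.io/events"


def stream_github_events():
    """Yield GitHub events from the SSE stream as dictionaries."""
    with EventSource(FIREHOSE_URL) as source:
        for event in source:
            if event.data:  # skip keep-alive messages with empty payloads
                yield json.loads(event.data)


if __name__ == "__main__":
    for evt in stream_github_events():
        print(evt.get("type"), evt.get("repo", {}).get("name"))
```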
The producer in this project is tuned with additional Kafka settings to keep up with a high-traffic source like the GitHub Firehose API (a configuration sketch follows this list). The settings are:
- Statistics Interval: statistics.interval.ms is set to collect performance statistics at regular intervals.
- Batch Size: batch.size is set to 1 MB so messages are batched efficiently, trading a small amount of latency for higher throughput.
- Linger Time: linger.ms is set to 500ms to wait and collect messages before sending them in a batch, further optimizing throughput.
- Compression: compression.type is set to gzip to compress message data, saving storage and bandwidth at the cost of additional CPU usage for compression and decompression.
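A minimal sketch of how these settings might be passed to the producer, assuming the Quix Streams `Application` API and its `producer_extra_config` parameter; the broker address, topic name, and statistics interval shown here are illustrative assumptions:

```python
import json

from quixstreams import Application

app = Application(
    broker_address="kafka:9092",  # illustrative; the project reads this from its own config
    producer_extra_config={
        "statistics.interval.ms": 60000,  # illustrative interval for librdkafka statistics
        "batch.size": 1048576,            # 1 MB batches for efficient batching
        "linger.ms": 500,                 # wait up to 500 ms to fill a batch
        "compression.type": "gzip",       # compress batches to save bandwidth
    },
)

topic = app.topic("github_events")

with app.get_producer() as producer:
    event = {"type": "PushEvent", "repo": {"name": "example/repo"}}
    producer.produce(
        topic=topic.name,
        key=event["type"],
        value=json.dumps(event).encode("utf-8"),
    )
```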
- Docker
- Docker Compose
- Build and run the Docker containers:
docker-compose up --build
- Verify Data Ingestion:
docker-compose exec postgres psql -U postgres -d github_events
SELECT * FROM github_events;
- main.py: Extracts data from the GitHub Firehose API, transforms it, and produces messages to Kafka.
- consumer.py: Consumes messages from Kafka, transforms them, and loads them into PostgreSQL (a sketch of this stage follows the list).
- entrypoint.sh: Determines whether to run the producer or consumer based on the RUN_MODE environment variable.
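One way the consumer stage could be structured with PySpark Structured Streaming, reading from Kafka and appending each micro-batch to PostgreSQL over JDBC. This is a sketch under assumptions: the topic name, JDBC URL, credentials, and event schema are illustrative, it requires the Kafka connector and PostgreSQL JDBC driver on the Spark classpath, and the project's consumer.py may be organized differently:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("github-events-consumer").getOrCreate()

# Illustrative schema covering only a couple of fields of a GitHub event.
event_schema = StructType([
    StructField("id", StringType()),
    StructField("type", StringType()),
])

# Read the raw Kafka stream (topic name and broker address are assumptions).
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "github_events")
    .load()
)

# Parse the JSON payload into columns.
events = (
    raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)


def write_batch(batch_df, _batch_id):
    # Append each micro-batch to PostgreSQL via JDBC (illustrative connection settings).
    (
        batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://postgres:5432/github_events")
        .option("dbtable", "github_events")
        .option("user", "postgres")
        .option("password", "postgres")
        .mode("append")
        .save()
    )


query = events.writeStream.foreachBatch(write_batch).start()
query.awaitTermination()
```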