Publicly available real-time data sets on Kafka, Redpanda, RabbitMQ & Apache Pulsar
This project is a starting point for analyzing real-time streaming data. We have prepared a few datasets that can be streamed via Kafka, Redpanda, RabbitMQ, and Apache Pulsar. For now, you can clone or fork the repo and start the service locally; we plan to add publicly available clusters that you can connect to directly.
Currently available datasets: github, art-blocks, movielens, and amazon-books.
From the repository's root folder, run:
python3 start.py --platforms <PLATFORMS> --dataset <DATASET>
The argument <PLATFORMS> can be: kafka, redpanda, rabbitmq, and/or pulsar.
The argument <DATASET> can be: github, art-blocks, movielens, or amazon-books.
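To make the accepted values concrete, here is a minimal sketch of an argument parser that mirrors the two options described above. It is not the project's actual start.py; the function name build_parser is hypothetical, and the assumption that several platforms are passed as space-separated values is ours.

```python
import argparse

# Values taken from the argument descriptions above.
PLATFORMS = ["kafka", "redpanda", "rabbitmq", "pulsar"]
DATASETS = ["github", "art-blocks", "movielens", "amazon-books"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented CLI options."""
    parser = argparse.ArgumentParser(description="Start streaming platforms for a dataset")
    # Assumption: one or more platforms, separated by spaces.
    parser.add_argument("--platforms", nargs="+", choices=PLATFORMS, required=True)
    # Exactly one dataset.
    parser.add_argument("--dataset", choices=DATASETS, required=True)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--platforms", "kafka", "pulsar", "--dataset", "art-blocks"])
print(args.platforms, args.dataset)
```

With choices set, argparse rejects any platform or dataset name outside the lists and exits with a usage message.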
The start.py script starts the chosen streaming platforms in Docker containers, and you will see messages from the chosen dataset being consumed.
You can then connect with Memgraph and stream the data into the database by running:
docker-compose up <DATASET>-memgraph
For example, if you choose Kafka as the streaming platform and art-blocks as the dataset, you should run:
python3 start.py --platforms kafka --dataset art-blocks
If you are a Windows user and the command above doesn't work, try replacing python3 with python.
Next, in a new terminal window, run:
docker-compose up art-blocks-memgraph
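The two commands can also be assembled from a script. The sketch below only builds the argument lists; the helper names start_command and memgraph_command are hypothetical, and passing several platforms as space-separated values is an assumption. Using sys.executable (the path of the running interpreter) sidesteps the python3 versus python difference between operating systems.

```python
import sys


def start_command(platforms, dataset, python=sys.executable):
    """Build the start.py invocation as an argument list (hypothetical helper)."""
    # Assumption: multiple platforms are passed as separate values.
    return [python, "start.py", "--platforms", *platforms, "--dataset", dataset]


def memgraph_command(dataset):
    """Build the docker-compose invocation for the matching Memgraph service."""
    return ["docker-compose", "up", f"{dataset}-memgraph"]


print(start_command(["kafka"], "art-blocks"))
print(memgraph_command("art-blocks"))
```

Each list can be handed to subprocess.run from the repository's root folder, with the second command started in a separate process or terminal once the first is running.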
There's no documentation yet, but it's coming soon! Throw us a star to keep up with upcoming changes.