This project simulates a small real-time dataflow with Apache Kafka and saves the data to a Cassandra table. This README contains instructions for setting up Kafka and Cassandra on Windows.
Kafka requires a JDK installation. You can verify that you have one by running the following cmd command:
java -version
If you get an error, you don't have Java installed on your machine. In that case you can download it, for example from Oracle.
- Download the Kafka files from https://dlcdn.apache.org/kafka/3.5.0/kafka_2.13-3.5.0.tgz
- Extract the package.
- Open cmd in the Kafka folder and start Zookeeper with the following command:
.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties
- Open a new cmd and start the Kafka server with the following command:
.\bin\windows\kafka-server-start.bat .\config\server.properties
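If you want to double-check from Python that the broker is reachable, here is a minimal sketch using the kafka-python package (an assumption about the client library; install it with pip install kafka-python):

```python
# Minimal sketch: confirm the local broker answers on localhost:9092.
# Assumes the kafka-python package; this is not part of the project scripts.
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
print(consumer.topics())  # prints the set of existing topic names; fails if the broker is down
consumer.close()
```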
- Now Kafka is up and running, so the next step is to create a topic for the project. Create the topic by opening yet another cmd and running the following command:
.\bin\windows\kafka-topics.bat --create --topic *TOPIC NAME* --bootstrap-server localhost:9092
The port used here is the port of the Kafka server, which defaults to 9092.
- Next up we will create the consumer. You can reuse the cmd you used to create the topic and run the following command:
.\bin\windows\kafka-console-consumer.bat --topic *TOPIC NAME* --from-beginning --bootstrap-server localhost:9092
Make sure to use the topic you created earlier.
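For reference, the console consumer above roughly corresponds to the following Python sketch, assuming the kafka-python package (the project's own consumer.py may differ):

```python
# Minimal sketch: print every message in the topic, like the console consumer above.
# Replace "kafka-datastream" with the topic name you created.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "kafka-datastream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # corresponds to the --from-beginning flag
)
for message in consumer:
    print(message.value.decode("utf-8"))
```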
That pretty much covers setting up Apache Kafka. If you want to make sure that the consumer is working correctly, you can start a Kafka producer by opening yet another cmd and using the following command:
.\bin\windows\kafka-console-producer.bat --topic kafka-datastream --bootstrap-server localhost:9092
The command starts a console producer, and you will be able to send messages to the consumer from it.
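The same test can also be done from Python; a minimal producer sketch with kafka-python (again an assumption about the client library) would be:

```python
# Minimal sketch: send one test message that the running consumer should print.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("kafka-datastream", b"hello from python")
producer.flush()  # make sure the message is actually sent before exiting
producer.close()
```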
In the end you should have three or four cmds open: Zookeeper, the Kafka server, the consumer, and optionally the producer. You can close them with CTRL + C when you are done. If you encounter an error saying the input line is too long, try making the path shorter, for example by shortening folder names.
NOTE: When running the project, the Zookeeper, Kafka server and Kafka consumer must be up and running.
The requirements for this are that you have Docker installed and Docker Desktop running.
- First, open up a cmd and pull the latest Cassandra Docker image:
docker pull cassandra:latest
- Now we can run the container with the following command:
docker run --name cassandra -p 127.0.0.1:9042:9042 -p 127.0.0.1:9160:9160 -d cassandra
- Verify that the container is running by using
docker container ls
or
docker ps
You should see the container info in the output.
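You can also check connectivity from Python with the cassandra-driver package (an assumption; install it with pip install cassandra-driver). Note that Cassandra can take a minute or so to finish starting inside the container, so the connection may fail right after docker run:

```python
# Minimal sketch: confirm Cassandra accepts connections on the published port.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"], port=9042)
session = cluster.connect()
print(session.execute("SELECT release_version FROM system.local").one())
cluster.shutdown()
```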
Now that Cassandra is up and running, the next step is to create a keyspace for the project.
- Keep using the old cmd or open up a new one. First we need to get inside the Docker container; use the following command:
docker exec -it cassandra bash
- When you are in the root of the container, use the command
cqlsh
to start up a Cassandra shell.
- Now we can create the keyspace with the following Cassandra Query Language (CQL) statement:
CREATE KEYSPACE IF NOT EXISTS kafka_datastream WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : '1' };
NOTE: You cannot use "-" in keyspace names. A keyspace in Cassandra is like a schema in a relational database and can contain multiple tables.
- You can check that the keyspace has been created successfully by using the following command:
desc keyspaces
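This README does not show the table schema; the project code presumably creates the table it writes to. Purely for illustration, a table inside the keyspace could be created from Python like this (the column names here are made up, not the project's real schema):

```python
# Hypothetical sketch: create a table in the kafka_datastream keyspace.
# The columns below are illustrative only; the project defines its own schema.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"], port=9042)
session = cluster.connect("kafka_datastream")
session.execute("""
    CREATE TABLE IF NOT EXISTS datastream_table (
        id uuid PRIMARY KEY,
        payload text
    )
""")
cluster.shutdown()
```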
- Set up Kafka according to the instructions and leave Zookeeper, Kafka server and Kafka consumer running.
- Set up Cassandra according to the instructions and leave the Docker container running.
- Run consumer.py
- Run producer.py
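Under the hood, consumer.py presumably reads messages from the Kafka topic and writes them to the Cassandra table. A minimal sketch of that pattern, assuming kafka-python and cassandra-driver and the illustrative (id, payload) schema from above, looks like this (not the project's exact code):

```python
# Minimal sketch of the Kafka -> Cassandra flow; schema and names are illustrative.
import uuid

from cassandra.cluster import Cluster
from kafka import KafkaConsumer

consumer = KafkaConsumer("kafka-datastream", bootstrap_servers="localhost:9092")
session = Cluster(["127.0.0.1"], port=9042).connect("kafka_datastream")

insert = session.prepare("INSERT INTO datastream_table (id, payload) VALUES (?, ?)")
for message in consumer:
    session.execute(insert, (uuid.uuid4(), message.value.decode("utf-8")))
```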
- Open up a cmd and go inside the container with
docker exec -it cassandra bash
- Start a Cassandra shell with
cqlsh
- Use the created keyspace with
use kafka_datastream;
Match the command with the keyspace name you created.
- Run a CQL query to select data from the table, for example:
select * from datastream_table;
NOTE: If you have uploaded a ton of data, it might be a good idea to specify a limit for the number of rows to retrieve, like:
select * from datastream_table limit 5;
- The output will be a table with the inserted data.
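If you would rather check the data from Python than from cqlsh, the same query can be run with cassandra-driver (an assumption about the client library):

```python
# Minimal sketch: read a few rows back from the table.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"], port=9042).connect("kafka_datastream")
for row in session.execute("SELECT * FROM datastream_table LIMIT 5"):
    print(row)
```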
- Stop the producer.py with CTRL + C
- Stop the consumer.py with CTRL + C
- Stop the Kafka consumer with CTRL + C
- Stop the Kafka server with CTRL + C
- Stop the Zookeeper with CTRL + C
- Stop the Cassandra container from the Docker Desktop UI or from cmd with
docker stop cassandra