Apache Hadoop, Apache Spark, Apache ZooKeeper, Kafka, Scala, Python, PySpark, PostgreSQL, Docker, Django, Flexmonster, Real-Time Streaming, Data Pipeline, Dashboard
- Introduction
- Frameworks/Technologies used
- Prerequisites
- Environment Setup
- Development Setup
- Final Result
- Team Members
- In many data centers, different types of servers generate large amounts of data in real time (events; in this case, the status of the servers in the data center).
- There is always a need to process this data in real time and generate insights for the data-center server manager, who has to track server status regularly and find resolutions when issues occur, for better server stability.
- Since the data is huge and generated in real time, we need to choose the right architecture, with scalable storage and computation frameworks/technologies.
- Hence we build a real-time data pipeline using Apache Kafka, Apache Spark, Hadoop, PostgreSQL, Django and Flexmonster on Docker to generate insights from this data.
- The Spark project/data pipeline is built with Apache Spark, Python and PySpark on an Apache Hadoop cluster running on top of Docker.
- Data visualization is built using the Django web framework and Flexmonster.
- Apache Hadoop
- Apache Spark
- Apache ZooKeeper
- Docker
- Kafka
- Django
- Flexmonster
- Python
- PySpark
- asgiref==3.8.1
- certifi==2024.2.2
- charset-normalizer==3.3.2
- Django==5.0.4
- idna==3.7
- kafka-python==2.0.2
- psycopg2-binary==2.9.9
- py4j==0.10.9
- pyspark==3.0.1
- python-restcountries==2.0.0
- requests==2.31.0
- sqlparse==0.5.0
- typing_extensions==4.11.0
- urllib3==2.2.1
- Python==3.10
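The pinned packages above can be installed with pip, assuming they are collected in a requirements.txt at the project root:

```bash
pip install -r requirements.txt
```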
Note: Download apache-hive-2.3.7-bin.tar.gz, hadoop-3.2.1.tar.gz and spark-3.0.1-bin-hadoop3.2.tgz, and keep them in the cloud_hive, cloud_hadoop and cloud_spark folders (which contain the respective Dockerfiles) before proceeding with the setup.
Install Docker Desktop for the Windows/Linux operating system.
Note: If you are installing on Windows, make sure the system requirements are met and WSL2 is enabled to run the Docker Engine.
| Steps | Type Commands in Terminal |
|---|---|
| 1. Create Docker network | `docker network create --subnet=172.20.0.0/16 cloudnet` |
| 2. Create ZooKeeper container | `docker pull zookeeper:3.4` <br> `docker run -d --rm --hostname zookeepernode --net cloudnet --ip 172.20.1.3 --name cloud_zookeeper --publish 2181:2181 zookeeper:3.4` |
| 3. Create Kafka container | `docker pull ches/kafka` <br> `docker run -d --rm --hostname kafkanode --net cloudnet --ip 172.20.1.4 --name cloud_kafka --publish 9092:9092 --publish 7203:7203 --env KAFKA_AUTO_CREATE_TOPICS_ENABLE=true --env KAFKA_ADVERTISED_HOST_NAME=127.0.0.1 --env ZOOKEEPER_IP=172.20.1.3 ches/kafka` |
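Optionally, verify the Kafka/ZooKeeper setup by pre-creating the topic the pipeline will consume. The topic name below is an assumption; with KAFKA_AUTO_CREATE_TOPICS_ENABLE=true the topic can also be created automatically on first use:

```bash
# Pre-create the (assumed) topic used by the simulator and the Spark job
docker run --rm --net cloudnet ches/kafka \
  kafka-topics.sh --create --topic server_status \
  --replication-factor 1 --partitions 1 --zookeeper 172.20.1.3:2181
```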
| Steps | Type Commands in Terminal |
|---|---|
| 1. Create Docker images for Apache Hadoop, Apache Spark, Apache Hive and PostgreSQL | `cd hadoop_spark_cluster_image` <br> `bash 1_create_hadoop_spark_image.sh` |
| 2. Create Docker containers for Apache Hadoop, Apache Spark, Apache Hive and PostgreSQL | `bash 2_create_hadoop_spark_cluster.sh create` |
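Before moving on, it helps to confirm that the cluster containers are up and HDFS is healthy:

```bash
docker ps --format "table {{.Names}}\t{{.Status}}"           # all cluster containers should be listed
docker exec -it masternode bash -c "hdfs dfsadmin -report"   # datanodes should report as live
```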
Run the Python script data_center_server_status_simulator.py on the host. This script simulates live creation of server status events and publishes them to Kafka.
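The actual simulator lives in the repository; as a rough illustration of what it does, here is a minimal sketch assuming a kafka-python producer, a server_status topic and JSON-encoded events (the topic name, field names and value ranges are all illustrative assumptions):

```python
# Illustrative sketch only -- the real data_center_server_status_simulator.py may differ.
import json
import random
import time
from datetime import datetime

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",          # Kafka is published on the host at 9092
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

SERVER_TYPES = ["web", "database", "application", "cache"]
STATUSES = ["RUNNING", "IDLE", "WARNING", "DOWN"]

while True:
    event = {
        "server_id": f"server-{random.randint(1, 50):03d}",
        "server_type": random.choice(SERVER_TYPES),
        "status": random.choice(STATUSES),
        "event_time": datetime.utcnow().isoformat(),
    }
    producer.send("server_status", event)        # assumed topic name
    time.sleep(1)                                # one simulated status event per second
```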
```bash
docker exec -it postgresqlnode bash
psql -U postgres
```
```sql
CREATE USER demouser WITH PASSWORD 'demouser';
ALTER USER demouser WITH SUPERUSER;
CREATE DATABASE event_message_db;
GRANT ALL PRIVILEGES ON DATABASE event_message_db TO demouser;
\c event_message_db;
```
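If the Spark job does not create its target table automatically, it can be created here. The schema below is purely illustrative (it mirrors the hypothetical event layout used in the sketches elsewhere in this README); adjust it to whatever the pipeline actually writes:

```sql
-- Hypothetical target table; the real schema depends on real_time_streaming_data_pipeline.py.
CREATE TABLE IF NOT EXISTS server_status_events (
    event_id    SERIAL PRIMARY KEY,
    server_id   TEXT,
    server_type TEXT,
    status      TEXT,
    event_time  TIMESTAMP
);
```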
```bash
docker exec -it masternode bash
mkdir -p /opt/workarea/code
mkdir -p /opt/workarea/spark_jars
exit
```
```bash
cd real_time_data_pipeline
docker cp real_time_streaming_data_pipeline.py masternode:/opt/workarea/code/
cd ../spark_dependency_jars
docker cp commons-pool2-2.8.1.jar masternode:/opt/workarea/spark_jars
docker cp kafka-clients-2.6.0.jar masternode:/opt/workarea/spark_jars
docker cp postgresql-42.2.16.jar masternode:/opt/workarea/spark_jars
docker cp spark-sql-kafka-0-10_2.12-3.0.1.jar masternode:/opt/workarea/spark_jars
docker cp spark-streaming-kafka-0-10-assembly_2.12-3.0.1.jar masternode:/opt/workarea/spark_jars
```
```bash
docker exec -it masternode bash
cd /opt/workarea/code/
spark-submit --master local[*] \
  --jars /opt/workarea/spark_jars/commons-pool2-2.8.1.jar,/opt/workarea/spark_jars/postgresql-42.2.16.jar,/opt/workarea/spark_jars/spark-sql-kafka-0-10_2.12-3.0.1.jar,/opt/workarea/spark_jars/kafka-clients-2.6.0.jar,/opt/workarea/spark_jars/spark-streaming-kafka-0-10-assembly_2.12-3.0.1.jar \
  --conf spark.executor.extraClassPath=/opt/workarea/spark_jars/commons-pool2-2.8.1.jar:/opt/workarea/spark_jars/postgresql-42.2.16.jar:/opt/workarea/spark_jars/spark-sql-kafka-0-10_2.12-3.0.1.jar:/opt/workarea/spark_jars/kafka-clients-2.6.0.jar:/opt/workarea/spark_jars/spark-streaming-kafka-0-10-assembly_2.12-3.0.1.jar \
  --conf spark.executor.extraLibrary=/opt/workarea/spark_jars/commons-pool2-2.8.1.jar:/opt/workarea/spark_jars/postgresql-42.2.16.jar:/opt/workarea/spark_jars/spark-sql-kafka-0-10_2.12-3.0.1.jar:/opt/workarea/spark_jars/kafka-clients-2.6.0.jar:/opt/workarea/spark_jars/spark-streaming-kafka-0-10-assembly_2.12-3.0.1.jar \
  --conf spark.driver.extraClassPath=/opt/workarea/spark_jars/commons-pool2-2.8.1.jar:/opt/workarea/spark_jars/postgresql-42.2.16.jar:/opt/workarea/spark_jars/spark-sql-kafka-0-10_2.12-3.0.1.jar:/opt/workarea/spark_jars/kafka-clients-2.6.0.jar:/opt/workarea/spark_jars/spark-streaming-kafka-0-10-assembly_2.12-3.0.1.jar \
  real_time_streaming_data_pipeline.py
```
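real_time_streaming_data_pipeline.py itself lives in the repository and is not reproduced here. As a rough, hedged illustration of the approach (Spark Structured Streaming reading JSON events from Kafka and appending each micro-batch to PostgreSQL over JDBC), with the topic, schema, table and connection details all being assumptions:

```python
# Illustrative sketch only -- the real real_time_streaming_data_pipeline.py may differ.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("RealTimeServerStatusPipeline").getOrCreate()

# Assumed JSON layout of the simulator's messages.
event_schema = StructType([
    StructField("server_id", StringType()),
    StructField("server_type", StringType()),
    StructField("status", StringType()),
    StructField("event_time", TimestampType()),
])

# Read the Kafka topic as a streaming DataFrame.
# Bootstrap address and topic name depend on how Kafka is reachable from the cluster.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "172.20.1.4:9092")
    .option("subscribe", "server_status")
    .option("startingOffsets", "latest")
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("event"))
    .select("event.*")
)

def write_to_postgres(batch_df, batch_id):
    """Append each micro-batch to the PostgreSQL table read by the dashboard."""
    (
        batch_df.write.format("jdbc")
        .option("url", "jdbc:postgresql://postgresqlnode:5432/event_message_db")
        .option("dbtable", "server_status_events")
        .option("user", "demouser")
        .option("password", "demouser")
        .option("driver", "org.postgresql.Driver")
        .mode("append")
        .save()
    )

query = events.writeStream.foreachBatch(write_to_postgres).start()
query.awaitTermination()
```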
```bash
cd server_status_monitoring
python manage.py makemigrations dashboard
python manage.py migrate dashboard
python manage.py runserver
```
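The dashboard app reads the rows that the Spark job writes to PostgreSQL. The actual models live in the dashboard app; a purely hypothetical sketch, mirroring the illustrative server_status_events table above:

```python
# Hypothetical Django model -- the real model in the dashboard app may differ.
from django.db import models


class ServerStatusEvent(models.Model):
    event_id = models.AutoField(primary_key=True)
    server_id = models.TextField()
    server_type = models.TextField()
    status = models.TextField()
    event_time = models.DateTimeField()

    class Meta:
        db_table = "server_status_events"  # assumed table populated by the Spark job
        managed = False                    # Django does not create or migrate this table
```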
The dashboard updates live as the Python simulator keeps feeding new data. Some sample screenshots:
- Aayushi Padia (21bds002)
- Aryan TN (21bds006)
- Nupur Sangwai (21bds046)
- Sharan Thummagunti (21bds061)