This project uses Airflow to orchestrate scraping and cleaning population-density data from Wikipedia. The cleaned data is then pushed to Azure Data Lake for processing, and Tableau is used to visualize it.
View the interactive Tableau dashboard here: https://public.tableau.com/app/profile/yusreen.shah/viz/PopulationDensityProject/Dashboard1
The goal of this project was to understand the role of Docker and Airflow in data engineering. BeautifulSoup is used to extract the data from Wikipedia, and Geocoder is used to look up the location of each country/territory, as sketched below.
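As a rough illustration of how those two libraries fit together, the sketch below pulls rows out of the first wikitable on the page and geocodes a place name. The URL points at Wikipedia's population-density list, but the function names, table selection, and geocoding provider are assumptions for illustration, not the project's actual code:

```python
import requests
import geocoder
from bs4 import BeautifulSoup

# Assumed source page; the project may parse a different table or columns.
URL = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population_density"

def fetch_density_table(url: str = URL) -> list[list[str]]:
    """Download the page and pull the rows out of the first wikitable."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = []
    for tr in table.find_all("tr")[1:]:  # skip the header row
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        if cells:
            rows.append(cells)
    return rows

def locate(name: str):
    """Resolve a country/territory name to [lat, lng] with geocoder."""
    g = geocoder.arcgis(name)  # ArcGIS provider needs no API key
    return g.latlng            # None when the lookup fails
```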
- Python 3.9 (minimum)
- Docker
- PostgreSQL
- Apache Airflow 2.6 (minimum)
- Clone the repository:
  git clone https://github.com/Yusreen/CountriesByPopulationDensityProject.git
- Install the Python dependencies:
  pip install -r requirements.txt
- Start the services with Docker:
  docker compose up -d
- Trigger the DAG from the Airflow UI (or from the CLI, as shown below).
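A manual run can also be started from the Airflow CLI. The `airflow-webserver` service name and the DAG id below are placeholders; the real values come from the project's compose file and DAG definition:

  docker compose exec airflow-webserver airflow dags trigger <dag_id>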
Once triggered, the DAG runs four tasks (sketched below):

- Fetches the data from Wikipedia.
- Cleans the data.
- Transforms the data.
- Pushes the data to Azure Data Lake.
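A minimal sketch of how those four tasks might be wired together with Airflow's PythonOperator, assuming hypothetical task callables and a placeholder DAG id (the project's actual DAG file will differ):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables: each would hold the real scraping, cleaning,
# transforming, and uploading logic.
def fetch_data():
    ...  # scrape the Wikipedia table (see the BeautifulSoup sketch above)

def clean_data():
    ...  # strip footnote markers, normalize names, cast numeric columns

def transform_data():
    ...  # attach coordinates via geocoder and reshape for Tableau

def push_to_datalake():
    ...  # upload the result to Azure Data Lake Storage

with DAG(
    dag_id="population_density_pipeline",  # placeholder id
    start_date=datetime(2023, 1, 1),
    schedule=None,                         # run only when triggered manually
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_data", python_callable=fetch_data)
    clean = PythonOperator(task_id="clean_data", python_callable=clean_data)
    transform = PythonOperator(task_id="transform_data", python_callable=transform_data)
    push = PythonOperator(task_id="push_to_datalake", python_callable=push_to_datalake)

    fetch >> clean >> transform >> push
```

Chaining the operators with `>>` encodes the linear fetch, clean, transform, push order listed above.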
- I learnt how to inspect a page's HTML with the browser's Developer Tools and fetch the correct table with BeautifulSoup.
- I learnt how to use geocoder to look up the location of each country/territory.
- I learnt how to define and run DAGs in Airflow, and familiarized myself with its UI.
- I learnt how to use Tableau for effective storytelling.
This project was inspired by: https://github.com/airscholar/FootballDataEngineering