This project extracts data from Wikipedia using Apache Airflow, processes the data in Azure, and visualizes it using Tableau.

Yusreen/Countries-By-Population-Density-Project

Countries By Population Density Project

This project uses Airflow to scrape and clean data from Wikipedia. The clean data is then pushed to Azure Data Lake for processing. Tableau is then used to visualize the data.

View the interactive Tableau dashboard here: https://public.tableau.com/app/profile/yusreen.shah/viz/PopulationDensityProject/Dashboard1

Overview

The goal of this project was to understand the role of Docker and Airflow in data engineering. BeautifulSoup is used to extract the data from Wikipedia, and Geocoder is used to locate each country from the extracted data.

System Architecture

[Image: system architecture diagram]

Data Visualization

The dashboard is as follows: [Image: Tableau dashboard]

Requirements

  • Python 3.9 (minimum)
  • Docker
  • PostgreSQL
  • Apache Airflow 2.6 (minimum)

Getting Started

  1. Clone the repository.

    git clone https://github.com/Yusreen/CountriesByPopulationDensityProject.git
  2. Install Python dependencies.

    pip install -r requirements.txt

Running the Code With Docker

  1. Start the services with Docker Compose:
    docker compose up -d
  2. Trigger the DAG on the Airflow UI.

How It Works

  1. Fetches data from Wikipedia.
  2. Cleans the data.
  3. Transforms the data.
  4. Pushes the data to Azure Data Lake.
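The cleaning and transformation steps above can be sketched in plain Python. This is a minimal illustration, not the repository's code: the column names (`country`, `population`, `area_km2`), the helpers `clean_text`, `clean_number` and `transform_row`, and the sample rows are all hypothetical. It assumes scraped cells arrive as strings that may carry thousands separators and footnote markers such as `[a]`.

```python
import re

# Matches footnote markers such as [a] or [12] left over from the scraped table.
FOOTNOTE = re.compile(r"\[[^\]]*\]")


def clean_text(value: str) -> str:
    """Remove footnote markers and surrounding whitespace."""
    return FOOTNOTE.sub("", value).strip()


def clean_number(value: str) -> float:
    """Convert a string like '1,000[b]' to a float."""
    return float(clean_text(value).replace(",", ""))


def transform_row(row: dict) -> dict:
    """Clean one row and derive population density (people per square km)."""
    population = clean_number(row["population"])
    area = clean_number(row["area_km2"])
    return {
        "country": clean_text(row["country"]),
        "population": int(population),
        "area_km2": area,
        "density_per_km2": round(population / area, 2),
    }


# Made-up sample rows (hypothetical countries, not real figures).
raw_rows = [
    {"country": "Exampleland[a]", "population": "1,000", "area_km2": "10"},
    {"country": " Samplestan ", "population": "2,500[b]", "area_km2": "4"},
]
clean_rows = [transform_row(row) for row in raw_rows]
```

In an Airflow setup, functions like these would typically be called from separate tasks, so that a failure in cleaning or transformation can be retried without fetching the page again.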

Lessons Learned

  1. I learnt how to read the HTML using Developer Tools and fetch the correct table using BeautifulSoup.

  2. I learnt how to use geocoder to get the correct location of each country/territory.

  3. I learnt how to correctly use DAGs in Airflow and familiarized myself with the Airflow UI.

  4. I learnt how to use Tableau for effective storytelling.
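
As an illustration of the first lesson, the sketch below picks one table out of a page with BeautifulSoup and reads its rows. It is not the repository's code: the inline HTML, the `wikitable` class filter, and the helper `extract_rows` are assumptions for demonstration. The class to match on the real page should be confirmed in the browser's Developer Tools.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for a downloaded page; the values are made up.
HTML = """
<table class="navbox"><tr><td>Navigation</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Country</th><th>Density</th></tr>
  <tr><td>Exampleland</td><td>100</td></tr>
  <tr><td>Samplestan</td><td>625</td></tr>
</table>
"""


def extract_rows(html: str, css_class: str = "wikitable") -> list:
    """Return the cell text of each row in the first table carrying css_class."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any one of the table's CSS classes, so "wikitable sortable" qualifies.
    table = soup.find("table", class_=css_class)
    if table is None:
        raise ValueError(f"no table with class {css_class!r} found")
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


rows = extract_rows(HTML)
```

Filtering on a class (or a header string) rather than taking the first `<table>` on the page avoids picking up navigation or infobox tables by mistake, which is the kind of error that inspecting the page in Developer Tools helps catch.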

References

This project was inspired by: https://github.com/airscholar/FootballDataEngineering
