This Python-based project crawls data from Wikipedia using Apache Airflow, cleans it, and pushes it to Azure Data Lake for processing.
- Python 3.9 (minimum)
- Docker
- PostgreSQL
- Apache Airflow 2.6 (minimum)
- Clone the repository.
git clone https://github.com/airscholar/FootballDataEngineering.git
- Install Python dependencies.
pip install -r requirements.txt
- Start the services with Docker Compose.
docker compose up -d
- Trigger the DAG on the Airflow UI.
- Fetches data from Wikipedia.
- Cleans the data.
- Transforms the data.
- Pushes the data to Azure Data Lake.
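
The clean and transform steps above can be illustrated with a minimal, standard-library-only sketch. It is not the project's actual DAG code: the sample rows, the field names (`name`, `capacity`, `country`), and every function name are hypothetical. Fetching from Wikipedia and uploading to Azure Data Lake are left out because they need network access and credentials; here the final step only serializes the rows to CSV text, which is one plausible format to upload.

```python
import csv
import io
import re

# Hypothetical rows standing in for a table scraped from Wikipedia.
SAMPLE_ROWS = [
    {"name": " Example Arena ", "capacity": "81,044[1]", "country": "Exampleland"},
    {"name": "Sample Park", "capacity": "60,704", "country": " Sampleland "},
]


def clean_text(value):
    """Remove footnote markers such as [1] and surrounding whitespace."""
    return re.sub(r"\[\d+\]", "", value).strip()


def clean(rows):
    """Apply clean_text to every field of every row."""
    return [{key: clean_text(val) for key, val in row.items()} for row in rows]


def transform(rows):
    """Convert the capacity field from a string like '81,044' to an int."""
    return [
        {**row, "capacity": int(row["capacity"].replace(",", ""))}
        for row in rows
    ]


def to_csv(rows):
    """Serialize rows to CSV text (assumed stand-in for the upload step)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def run_pipeline(rows):
    """Chain the steps in the same order as the list above."""
    return to_csv(transform(clean(rows)))


if __name__ == "__main__":
    print(run_pipeline(SAMPLE_ROWS))
```

In an Airflow deployment, each of these functions would typically be wrapped in its own task so that a failed step can be retried independently.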