This project provides a template for bootstrapping a toy Spark ETL. It is a useful starting point when creating a Spark project from scratch.
- Docker
- Python > 3.10
- Poetry as the dependency management tool. If you need to change a library version in pyproject.toml, run `poetry lock`. Then run `poetry env info -p` to make sure the environment was set up properly.
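
As a quick sanity check of the requirements above, the following is a minimal sketch that reports whether the tools are visible from the current interpreter. The function name `check_prerequisites` is hypothetical and not part of this template, and the script assumes that "Python > 3.10" means 3.10 or newer; adjust `MIN_PYTHON` if a stricter bound is intended.

```python
import shutil
import sys

# Assumption: the requirement "Python > 3.10" is read here as 3.10 or newer.
MIN_PYTHON = (3, 10)


def check_prerequisites() -> dict:
    """Hypothetical helper: report which requirements appear to be satisfied."""
    return {
        # shutil.which returns None when the executable is not on PATH.
        "docker": shutil.which("docker") is not None,
        "poetry": shutil.which("poetry") is not None,
        # Compare only (major, minor) of the running interpreter.
        "python": sys.version_info[:2] >= MIN_PYTHON,
    }
```

Calling `check_prerequisites()` returns a dictionary of booleans keyed by tool name. It only checks that the executables are on `PATH`, not that Docker is running or that the Poetry environment is installed.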
- `make format_code`: rewrites source code using black and isort to keep it in the standard format
- `make lint`: checks the source code for syntax violations
- `make test`: runs unit tests
- `make run_etl`: runs the ETL for the sample_etl. The output data will be located in the output/sample folder
- Further documentation can be placed in the docs folder.
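
To give a feel for what a toy Spark ETL of this kind might look like, here is a minimal PySpark sketch. It is not the template's actual `sample_etl` code: the names `output_dir` and `run_sample_etl`, the in-memory dataset, its columns, and the Parquet output format are all assumptions made for illustration. Only the `output/sample` location comes from the description of `make run_etl` above.

```python
from pathlib import Path

# From the README: the sample ETL writes its results under output/sample.
OUTPUT_ROOT = Path("output")


def output_dir(name: str) -> Path:
    """Hypothetical helper: map an ETL name such as "sample" to its output folder."""
    return OUTPUT_ROOT / name


def run_sample_etl() -> None:
    """Hypothetical toy ETL: extract in-memory rows, aggregate, write Parquet."""
    # Imported lazily so this module can be loaded without PySpark installed.
    from pyspark.sql import SparkSession, functions as F

    spark = (
        SparkSession.builder.master("local[*]").appName("sample_etl").getOrCreate()
    )
    try:
        # Extract: a tiny in-memory dataset standing in for a real source.
        orders = spark.createDataFrame(
            [("alice", 10.0), ("bob", 5.5), ("alice", 2.5)],
            ["customer", "amount"],
        )
        # Transform: total amount per customer.
        totals = orders.groupBy("customer").agg(
            F.sum("amount").alias("total_amount")
        )
        # Load: overwrite any previous run's output.
        totals.write.mode("overwrite").parquet(str(output_dir("sample")))
    finally:
        # Release the local Spark resources even if a step fails.
        spark.stop()
```

Calling `run_sample_etl()` requires PySpark and a Java runtime to be available. Keeping the extract, transform, and load steps visibly separate makes it easier to replace each one with real sources and sinks later.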