This is the repo associated with user apwheele's entry in the DrivenData algae bloom prediction competition.
To set up the Python environment, I use Anaconda. In particular, here is my initial setup (I have a difficult time with geopandas, so I install it first):
```
git clone https://github.com/apwheele/algaebloom.git
cd ./algaebloom
conda create --name bloom python=3.9 pip geopandas
conda activate bloom
pip install -r requirements.txt
ipython kernel install --name "bloom_jpy" --user
```
Where `requirements.txt` has the necessary libraries for data download/manipulation and statistical modeling. I saved the final built versions via `pip list > final_env.txt`, which is also uploaded to the repo.
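As a quick sanity check that the environment built correctly, you can try importing the key libraries (a minimal sketch, not part of the repo; the exact versions should line up with `final_env.txt`):

```python
# Quick sanity check that the key libraries import after the install.
import geopandas
import pandas
import numpy

print("geopandas", geopandas.__version__)
print("pandas", pandas.__version__)
print("numpy", numpy.__version__)
```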
Once you have downloaded this repository, to simply replicate the final models you can do:
```
# everything should be run from root directory
cd ./algaebloom
python main_prepdata.py
```
This prepares the local SQLite database with the data needed to run the models. Then run:
```
python main_preds.py
```
to estimate the model, generate predictions, and cache the final model object.
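If you want to peek at what was loaded, here is a minimal sketch; the database filename is an assumption on my part, so point it at whatever SQLite file `main_prepdata.py` actually creates:

```python
import sqlite3

# Hypothetical path; substitute the SQLite file main_prepdata.py created.
con = sqlite3.connect("./data/bloom.sqlite")
tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)  # the tables populated by main_prepdata.py
con.close()
```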
If you want to replicate downloading the original data from the Planetary Computer, see below.
I have saved the final files I used in the competition as CSV files in the `./data` folder. These include:

- `elevation_dem.csv`, data obtained from the Planetary Computer DEM source, see `get_dem.py`
- `spat_lag.csv`, spatial lags in space/time (only from the input data), see `get_lag.py`
- `sat.csv`, feature engineering of satellite imagery, see `get_sat.py`
- `split_pred.csv`, weights used for train/test splits, see `get_split.py` (needs to be run after `get_dem.py`)
In addition, the competition-provided CSV files, `train_labels.csv`, `metadata.csv`, and `submission_format.csv`, are all expected to be saved in the `./data` folder.
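As a rough sketch of how these files can be pulled together with pandas (illustrative only; the shared `uid` key and left joins are assumptions on my part, not necessarily what the modeling scripts do):

```python
import pandas as pd

# Competition labels plus the engineered feature files in ./data.
labels = pd.read_csv("./data/train_labels.csv")
dem = pd.read_csv("./data/elevation_dem.csv")
lag = pd.read_csv("./data/spat_lag.csv")
sat = pd.read_csv("./data/sat.csv")

# Assumes each file carries the competition's uid sample identifier.
train = (labels.merge(dem, on="uid", how="left")
               .merge(lag, on="uid", how="left")
               .merge(sat, on="uid", how="left"))
print(train.shape)
```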
If you have already run `python main_prepdata.py`, these scripts will not work, as the tables are already populated. Downloading the original data as described below only works if you start fresh from an empty database.
Note that each of these scripts downloads (and caches) data in the local SQLite database. On my personal machine they could take a very long time, and if your internet goes out they can error out. The scripts are written so you can just rerun them, and each rerun will attempt to fill in the missing information and add more data. E.g., if you are in the root of the project, you can run:
```
python get_dem.py
```
Check the output, and if some data is still missing, rerun the exact same script:
```
python get_dem.py
```
to attempt to download more data. In the end I signed up for the Planetary Computer Hub; running the scripts on their machines went a bit faster than on my local machine.
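To make the rerun behavior concrete, here is a minimal sketch of the cache-then-resume pattern (not the actual code in `get_dem.py`; the database path, table schema, metadata column names, and `fetch_elevation` helper are all hypothetical):

```python
import sqlite3
import pandas as pd

DB = "./data/bloom.sqlite"    # hypothetical cache database path
META = "./data/metadata.csv"  # competition metadata, one row per sample

def fetch_elevation(uid, lat, lon):
    """Hypothetical stand-in for the real Planetary Computer DEM query."""
    raise NotImplementedError

con = sqlite3.connect(DB)
con.execute("CREATE TABLE IF NOT EXISTS dem (uid TEXT PRIMARY KEY, elevation REAL)")

meta = pd.read_csv(META)
done = {r[0] for r in con.execute("SELECT uid FROM dem")}
todo = meta[~meta["uid"].isin(done)]

for row in todo.itertuples():
    try:
        elev = fetch_elevation(row.uid, row.latitude, row.longitude)
    except Exception as e:
        # A dropped connection just ends this pass; rerunning resumes here.
        print(f"stopping early at {row.uid}: {e}")
        break
    con.execute("INSERT OR REPLACE INTO dem VALUES (?, ?)", (row.uid, elev))
    con.commit()

con.close()
```

Because already-cached rows are skipped up front, each rerun only attempts the samples that are still missing.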
To run the final model, you can do:
```
python main_preds.py
```
This will save a model in the `./models` folder with the current date. You should get a printout showing that it is the same as the final winning solution, which I have saved in the repo as `./submissions/sub_2023_02_07.csv`.
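If you want to verify the match yourself, here is a minimal comparison sketch (the freshly generated filename is hypothetical; substitute whatever your run of `main_preds.py` wrote out):

```python
import pandas as pd

# Hypothetical path for the newly generated submission; adjust to your run.
new = pd.read_csv("./submissions/sub_2023_05_01.csv")
old = pd.read_csv("./submissions/sub_2023_02_07.csv")

# True means the replication matches the winning submission exactly.
print(new.equals(old))
```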
In addition to this, the root folder has `main_hypertune.py` with hyperparameter tuning experiments. The results are saved in `hypertune_results.txt` (e.g., by running `python main_hypertune.py > hypertune_results.txt`). These helped guide the final models that I experimented with, but the final ones are due to more idiosyncratic experimentation, uploading submissions every day.
To go over the modeling strategy, see the notebook `model_strategy.ipynb`.