The ecosystem of geospatial machine learning tools in the Pangeo world.
Presenter: Wei Ji Leong
When: Wednesday, 18 October 2023, 13:50–14:15 (NZDT)
Where: Te Iringa (Wave Room - WG308), Auckland University of Technology (AUT), Auckland, New Zealand
Website: https://2019.foss4g-oceania.org/schedule/2019-11-12?sessionId=SPGUQV
Presentation slides: https://hackmd.io/@weiji14/foss4g2023oceania
Blog post (part 1): https://weiji14.github.io/blog/the-pangeo-machine-learning-ecosystem-in-2023
Blog post (part 2): https://weiji14.github.io/blog/when-cloud-native-geospatial-meets-gpu-native-machine-learning
Several open source tools are enabling the shift to cloud-native geospatial Machine Learning workflows. Stream data from STAC APIs, generate Machine Learning ready chips on-the-fly and train models for different downstream tasks! Find out about advances in the Pangeo ML community towards scalable GPU-native workflows.
An overview of open source Python packages in the Pangeo (big data geoscience) Machine Learning community will be presented. On read/write, kvikIO allows low-latency data transfers from Zarr archives via NVIDIA GPU Direct Storage. With tensors loaded in xarray data structures, xbatcher enables efficient slicing of arrays in an iterative fashion. To connect the pieces, zen3geo acts as the glue between geospatial libraries - from reading STAC items and rasterizing vector geometries to stacking multi-resolution datasets for custom data pipelines. Learn more as the Pangeo community develops tutorials at Project Pythia, and join in to hear about the challenges and ideas on scaling machine learning in the geosciences with the Pangeo ML Working Group.
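To make the idea of slicing arrays "in an iterative fashion" concrete, here is a minimal pure-Python sketch of fixed-size batching along one axis. It is not xbatcher's API: the helper name `iter_batches` and the choice to drop the shorter final batch are assumptions made for illustration only.

```python
from typing import Iterator


def iter_batches(n_items: int, batch_size: int) -> Iterator[slice]:
    """Yield consecutive slices of length batch_size over range(n_items).

    Hypothetical helper: the shorter remainder at the end is dropped so that
    every batch has the same size.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_items - batch_size + 1, batch_size):
        yield slice(start, start + batch_size)


# Stand-in for the time axis of a dataset (ten time steps).
time_steps = list(range(10))
batches = [time_steps[s] for s in iter_batches(len(time_steps), batch_size=4)]
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

In a real pipeline the slices would index a lazily loaded array, so that only one batch needs to be in memory at a time.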
Follow instructions at https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#install-gpudirect-storage to install NVIDIA GPU Direct Storage (GDS).
Note
Starting with CUDA toolkit 12.2.2, GDS kernel driver package nvidia-gds version 12.2.2-1 (provided by nvidia-fs-dkms 2.17.5-1) and above is only supported with the NVIDIA open kernel driver. Follow instructions in NVIDIA Open GPU Kernel Modules to install NVIDIA open kernel driver packages.
Verify that NVIDIA GDS has been installed properly following https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html#verify-suc-install. E.g. if you are on Linux and have CUDA 12.2 installed, run:
/usr/local/cuda-12.2/gds/tools/gdscheck.py -p
Alternatively, if you have already set up the conda environment described below, follow https://xarray.dev/blog/xarray-kvikio#appendix-ii--making-sure-gds-is-working and run:
mamba activate foss4g2023oceania
curl -s https://raw.githubusercontent.com/rapidsai/kvikio/branch-23.08/python/benchmarks/single-node-io.py | python
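The gdscheck.py path shown earlier depends on which CUDA version is installed. As a small convenience, the sketch below searches for the script under every `cuda-*` directory. The helper name `find_gdscheck` and the default root `/usr/local` are assumptions based on the example path above, not something NVIDIA documents as an API.

```python
import glob
import os
from typing import List


def find_gdscheck(cuda_root: str = "/usr/local") -> List[str]:
    """Return every gdscheck.py found under <cuda_root>/cuda-*/gds/tools."""
    pattern = os.path.join(cuda_root, "cuda-*", "gds", "tools", "gdscheck.py")
    return sorted(glob.glob(pattern))


candidates = find_gdscheck()
if candidates:
    for path in candidates:
        print("Found:", path)
else:
    print("No gdscheck.py found under /usr/local/cuda-*/gds/tools")
```

Any path it prints can then be run with the `-p` flag as shown above.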
To help out with development, start by cloning this repository:
git clone <repo-url>
Then I recommend using mamba to install the dependencies. A virtual environment will also be created with Python and JupyterLab installed.
cd foss4g2023oceania
mamba env create --file environment.yml
Activate the virtual environment first.
mamba activate foss4g2023oceania
Finally, double-check that the libraries have been installed.
mamba list
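Besides reading through the `mamba list` output, a quick programmatic check is to ask Python whether the key modules can be found. This is only a sketch: the module names below are guesses based on the packages mentioned in this README (the authoritative list is in environment.yml), and `missing_modules` is a hypothetical helper.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed module names; adjust to match environment.yml.
expected = ["xarray", "zarr", "kvikio"]
missing = missing_modules(expected)
print("Missing modules:", ", ".join(missing) if missing else "none")
```

Run it inside the activated foss4g2023oceania environment; an empty result means every listed module is discoverable, not that GDS itself is working.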
This is for those who want full reproducibility of the virtual environment. Create a virtual environment with just Python and conda-lock installed first.
mamba create --name foss4g2023oceania python=3.10 conda-lock=2.3.0
mamba activate foss4g2023oceania
Generate a unified conda-lock.yml file based on the dependency specification in environment.yml. Use only when creating a new conda-lock.yml file or refreshing an existing one.
conda-lock lock --mamba --file environment.yml --platform linux-64 --with-cuda=11.8
Installing/Updating a virtual environment from a lockfile. Use this to sync your dependencies to the exact versions in the conda-lock.yml file.
conda-lock install --mamba --name foss4g2023oceania conda-lock.yml
See also https://conda.github.io/conda-lock/output/#unified-lockfile for more usage details.
To create a subset of the WeatherBench2 Zarr dataset, run:
python 0_weatherbench2zarr.py
This will save a one-year subset of the WeatherBench2 ERA5 dataset at 6-hourly resolution to your local disk (total size is about 18.2 GB). It will include data at the 500 hPa pressure level only, with the variables 'geopotential', 'u_component_of_wind', and 'v_component_of_wind'.
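As a rough sanity check on the quoted size, the arithmetic below estimates the uncompressed footprint of such a subset. The grid shape (1440 x 721, i.e. 0.25 degree), the 365-day year, and 4-byte float32 values are all assumptions not stated in this README, so treat the result as a back-of-the-envelope figure rather than a description of what the script writes.

```python
# All constants below are assumptions for illustration.
TIME_STEPS = 365 * 4      # one year at 6-hourly resolution
N_VARIABLES = 3           # geopotential, u and v components of wind
N_LON, N_LAT = 1440, 721  # assumed 0.25 degree global grid
BYTES_PER_VALUE = 4       # assumed float32

n_bytes = TIME_STEPS * N_VARIABLES * N_LON * N_LAT * BYTES_PER_VALUE
print(f"{n_bytes / 1e9:.1f} GB")  # 18.2 GB
```

Under these assumptions the estimate lands close to the size quoted above; the actual size on disk will also depend on chunking and compression.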
To run the benchmark experiment loading with the kvikIO engine, run:
python 1_benchmark_kvikIOzarr.py
This will print out a progress bar showing the ERA5 data being loaded in mini-batches (simulating a neural network training loop). One 'epoch' should take under 15 seconds on an Ampere generation (e.g. RTX A2000) NVIDIA GPU. A total of ten epochs will be run, and the total time taken will be reported, as well as the median, mean, and standard deviation of the time taken per epoch.
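The per-epoch statistics described above can be sketched with the standard library alone. This is not the code in 1_benchmark_kvikIOzarr.py: `time_epochs` and `fake_epoch` are hypothetical names, and the fake epoch just burns a little CPU time in place of loading mini-batches.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_epochs(run_epoch: Callable[[], None], n_epochs: int = 10) -> Dict[str, float]:
    """Time n_epochs calls of run_epoch and summarise the durations in seconds.

    n_epochs must be at least 2, because the sample standard deviation is
    undefined for a single measurement.
    """
    durations: List[float] = []
    for _ in range(n_epochs):
        start = time.perf_counter()
        run_epoch()
        durations.append(time.perf_counter() - start)
    return {
        "total": sum(durations),
        "median": statistics.median(durations),
        "mean": statistics.mean(durations),
        "stdev": statistics.stdev(durations),
    }


def fake_epoch() -> None:
    """Placeholder workload standing in for one pass over the mini-batches."""
    for _ in range(20):
        sum(range(10_000))


report = time_epochs(fake_epoch, n_epochs=10)
for name, seconds in report.items():
    print(f"{name}: {seconds:.6f} s")
```

Using `time.perf_counter` rather than `time.time` gives a monotonic, high-resolution clock, which matters when individual epochs are short.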
To compare the benchmark results between the kvikio and zarr engines, do the following:
- Run jupyter lab to launch a JupyterLab session
- In your browser, open the 2_compare_results.ipynb notebook in JupyterLab
- Run all the cells in the notebook
The time to load the ERA5 subset data using the kvikio and zarr engines will be printed out. There will also be a summary report of the relative time difference between the CPU-based zarr and GPU-based kvikio engine, and bar plots of the absolute time taken for each backend engine.
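One common way to express a relative time difference is as a percentage of the baseline time. The sketch below uses that definition; whether the notebook computes it the same way is an assumption, `relative_time_difference` is a hypothetical helper, and the input values are placeholders rather than benchmark results.

```python
def relative_time_difference(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how much faster the candidate is, as a percentage of the baseline.

    Positive values mean the candidate took less time than the baseline.
    """
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return 100.0 * (baseline_seconds - candidate_seconds) / baseline_seconds


# Placeholder timings (not measurements): baseline = zarr, candidate = kvikio.
saving = relative_time_difference(baseline_seconds=20.0, candidate_seconds=15.0)
print(f"{saving:.1f}% less time than the baseline")  # 25.0% less time than the baseline
```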
- https://xarray.dev/blog/xarray-kvikio
- https://developer.nvidia.com/blog/gpudirect-storage
- https://developer.nvidia.com/blog/machine-learning-frameworks-interoperability-part-2-data-loading-and-data-transfer-bottlenecks/
- https://developmentseed.org/blog/2023-09-20-see-you-at-foss4g-sotm-oceania-2023
- https://medium.com/rapids-ai/pytorch-rapids-rmm-maximize-the-memory-efficiency-of-your-workflows-f475107ba4d4
- https://developer.nvidia.com/blog/introduction-cuda-aware-mpi/
All code in this repository is licensed under GNU Lesser General Public License 3.0 (LGPL-3.0). All other non-code content is licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).