
Add "datastores" to represent input data from zarr, npy, etc #66

Merged
merged 358 commits into mllam:main on Nov 21, 2024

Conversation

@leifdenby (Member) commented Jul 17, 2024

Describe your changes

This PR builds on #54 (which introduces zarr-based training data) by splitting the Config class introduced in #54 so that the configuration describing what data to load is separated from the functionality that loads it (the latter is what I call a "datastore"). In doing this I have also introduced a general interface through an abstract base class BaseDatastore, with a set of methods that are called from the rest of neural-lam to provide data for training/validation/test and information about this data (see #58 for my overview of the methods that #54 uses to load data).

The motivation for this work is to allow for a clear separation between how data is loaded into neural-lam and how training/validation/test samples are created from that data. Making the interface between these two steps explicit clarifies what is expected to be provided when people want to add new data sources to neural-lam.
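
A minimal sketch of what such an interface could look like, using illustrative method names (coords_projection and get_xy_extent are mentioned later in this PR, while get_dataarray stands in for the data-access methods; the actual BaseDatastore defines more methods and properties than shown here):

```python
import abc

import xarray as xr


class BaseDatastore(abc.ABC):
    """Sketch of the interface that datastores are expected to implement."""

    @abc.abstractmethod
    def get_dataarray(self, category: str, split: str) -> xr.DataArray:
        """Return all data for a given category ("state", "forcing" or
        "static") and split ("train", "val" or "test") as one xr.DataArray."""

    @property
    @abc.abstractmethod
    def coords_projection(self):
        """Return a cartopy.crs.Projection describing the horizontal coordinates."""

    @abc.abstractmethod
    def get_xy_extent(self, category: str):
        """Return the [x_min, x_max, y_min, y_max] extent of the grid."""
```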

In the text below I am trying to use the same nomenclature that @sadamov introduced, namely:

  • data "category": relates to whether a multidimensional array represents state, forcing or static data.
  • data "transformation": this refers to the operations of the extracting of specific variables from source datasets (e.g. zarr datasets), flattening spatial coordinates into a grid_index coordinate, levels and variables into a {category}_feature coordinate (i.e. these are operations that
BaseDatastore-derived classes WeatherDataset
returns only python primitive types, np.ndarray and xr.Dataset/xr.DataArray objects torch.Tensor objects
provides transformed train/test/val datasets that cover the entire time and space range for a given category of data individual time samples (including windowing and handling both analysis and forecasts) for train/test/val, optionally sample from ensemble members
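
A minimal sketch of that split, assuming the hypothetical get_dataarray method from the sketch above (the real WeatherDataset also handles forcing, static data, standardization and ensemble members):

```python
import torch
from torch.utils.data import Dataset


class WeatherDataset(Dataset):
    """Sketch: turn a datastore's full-range xr.DataArray into torch samples."""

    def __init__(self, datastore, split="train", ar_steps=3):
        # the datastore returns xarray objects covering the whole split
        self.da_state = datastore.get_dataarray(category="state", split=split)
        self.ar_steps = ar_steps

    def __len__(self):
        # each sample needs 2 initial states + ar_steps target states
        return self.da_state.sizes["time"] - (self.ar_steps + 2) + 1

    def __getitem__(self, idx):
        da_window = self.da_state.isel(time=slice(idx, idx + 2 + self.ar_steps))
        tensor = torch.tensor(da_window.values, dtype=torch.float32)
        init_states, target_states = tensor[:2], tensor[2:]
        return init_states, target_states
```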

To support both the multizarr config format that @sadamov introduced in #54, the old npyfiles format, and also data transformed with mllam-data-prep, I have currently implemented the following three datastore classes:

  • neural_lam.datastore.NpyDataStore: reads data from .npy-files in the format introduced in neural-lam v0.1.0 - this uses dask.delayed so no array content is read until it is used
  • neural_lam.datastore.MultizarrDatastore: combines multiple zarr files during train/val/test sampling, with the transformations to facilitate this implemented within neural_lam.datastore.MultizarrDatastore (since removed, as we decided MDPDatastore was enough)
  • neural_lam.datastore.MDPDatastore: can combine multiple zarr datasets either as a preprocessing step or during sampling, but offloads the implementation of the transformations to the mllam-data-prep package.

Each of these inherits from BaseCartesianDatastore, which itself inherits from BaseDatastore. I have added this extra layer of indirection to make it easier for non-gridded data to be used in neural-lam in the future.
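
A rough sketch of this layering (class bodies elided; names follow the list above, and the exact names in the package may differ slightly):

```python
import abc


class BaseDatastore(abc.ABC):
    ...  # generic interface: data access per category/split, metadata


class BaseCartesianDatastore(BaseDatastore):
    ...  # adds x/y grid concepts: projection, grid shape, xy extents


class NpyDataStore(BaseCartesianDatastore):
    ...  # .npy files in the neural-lam v0.1.0 format (lazy via dask.delayed)


class MDPDatastore(BaseCartesianDatastore):
    ...  # zarr datasets produced/described by mllam-data-prep
```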

Testing:

Caveats:

  • storage of graphs and other auxiliary information: Reading @sadamov's Multiple Zarr to Rule them All #54, I got the feeling that the intention was that the location of the config file describing where the data comes from is, in effect, the directory for a dataset. It makes sense to me to put everything relative to the parent directory of this config file, if only because it is an easy convention to use. With the configuration file living outside the neural-lam repository (once neural-lam becomes a package, Refactor codebase into a python package #32), I think this is necessary and less arbitrary than saying everything has to be in a "data" directory. For this reason I have assumed that any path in the mllam and multizarr configs that does not start with a protocol and is not absolute is resolved relative to the parent directory of the config (see the sketch after this list). For example, multizarr's "create_forcing" CLI interface defines a path, but so does the config, which I think was inconsistent and error-prone.
  • I have renamed the coordinate you introduced, @sadamov, from grid to grid_index. I think it is ambiguous what "grid" refers to, since that could mean the grid itself as well as the grid index, which is how it was used.
  • We shouldn’t use .variable as a variable name for an xr.DataArray, because xr.DataArray.variable is a reserved attribute on data-arrays.
  • I think the comment # target_states: (ar_steps-2, N_grid, d_features) in WeatherDataset.__getitem__ is incorrect @sadamov, or at least my understanding of what ar_steps represents is different. I expect the target states to contain exactly ar_steps entries, rather than ar_steps-2. Or, put another way, what would otherwise happen if ar_steps == 0?
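
To make the path convention in the first caveat concrete, here is a minimal sketch of the assumed resolution rule (the function name is illustrative, not the actual neural-lam helper):

```python
from pathlib import Path
from urllib.parse import urlparse


def resolve_path_in_config(config_path: str, path_in_config: str) -> str:
    """Resolve a path found in a datastore config file."""
    if urlparse(path_in_config).scheme != "":
        # starts with a protocol, e.g. "s3://..." or "https://..."
        return path_in_config
    if Path(path_in_config).is_absolute():
        return path_in_config
    # otherwise: resolve relative to the directory containing the config file
    return str(Path(config_path).parent / path_in_config)


# resolve_path_in_config("/data/danra/datastore.yaml", "graphs/multiscale")
# -> "/data/danra/graphs/multiscale"
```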

Things I am unsure about:

On whether something should be in BaseDatastore vs WeatherDataset:

  • I have moved “apply_windowing” to WeatherDataset because, for example, it doesn’t apply to the “state” category

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe its purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the README to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging)

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • author has added an entry to the changelog (and designated the change as added, changed or fixed)
  • Once the PR is ready to be merged, squash commits and merge the PR.

@sadamov (Collaborator) commented Nov 16, 2024

Okay, the remaining bug in neural_lam.datastore.npyfilesmeps.compute_standardization_stats was related to the number of workers defined in the WeatherDataset. Setting num_workers=0 solves the problem of infinite waiting time before the one-step differences are calculated. I assume this is related to the low number of samples in the example datasets we are using in the tests, and to the way multiprocessing jobs are spawned when num_workers>0 (see line 617ff in weather_dataset.py). I have now run all tests locally on a machine with 2 CUDA devices and can happily report that ALL tests pass ✔️✔️✔️✔️✔️✔️✔️✔️✔️✔️✔️✔️✔️✔️✔️✔️✔️✔️
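
As a minimal sketch of that workaround (argument names are illustrative, not the exact call inside compute_standardization_stats):

```python
from torch.utils.data import DataLoader

ds = WeatherDataset(datastore=datastore, split="train", ar_steps=1)
# num_workers=0 keeps loading in the main process, avoiding the hang seen
# when worker processes are spawned for the small example datasets
loader = DataLoader(ds, batch_size=32, num_workers=0)
for batch in loader:
    ...  # accumulate means, standard deviations and one-step differences
```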


joeloskarsson and others added 6 commits November 18, 2024 10:57
commit 2cc617e
Author: Joel Oskarsson <joel.oskarsson@liu.se>
Date:   Mon Nov 18 08:35:03 2024 +0100

    Add weights_only=True to all torch.load calls (mllam#86)

    ## Describe your changes

    Currently running neural-lam with the latest version of pytorch gives a
    warning:

    ```
    FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models  for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
    ```

    As we only use `torch.load` to load tensors and lists, we can just set
    `weights_only=True` and get rid of this warning (and increase security I
    suppose).
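
    For example (file name is a placeholder):

    ```python
    import torch

    # weights_only=True restricts unpickling to tensors and other primitive
    # containers, rather than arbitrary pickled objects
    stats = torch.load("tensor_file.pt", weights_only=True)
    ```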

    ## Issue Link
    None

    ## Type of change

    - [x] 🐛 Bug fix (non-breaking change that fixes an issue)
    - [ ] ✨ New feature (non-breaking change that adds functionality)
    - [ ] 💥 Breaking change (fix or feature that would cause existing
    functionality to not work as expected)
    - [ ] 📖 Documentation (Addition or improvements to documentation)

    ## Checklist before requesting a review

    - [x] My branch is up-to-date with the target branch - if not update
    your fork with the changes from the target branch (use `pull` with
    `--rebase` option if possible).
    - [x] I have performed a self-review of my code
    - [x] For any new/modified functions/classes I have added docstrings
    that clearly describe its purpose, expected inputs and returned values
    - [x] I have placed in-line comments to clarify the intent of any
    hard-to-understand passages of my code
    - [x] I have updated the [README](README.MD) to cover introduced code
    changes
    - [ ] I have added tests that prove my fix is effective or that my
    feature works
    - [x] I have given the PR a name that clearly describes the change,
    written in imperative form
    ([context](https://www.gitkraken.com/learn/git/best-practices/git-commit-message#using-imperative-verb-form)).
    - [x] I have requested a reviewer and an assignee (assignee is
    responsible for merging). This applies only if you have write access to
    the repo, otherwise feel free to tag a maintainer to add a reviewer and
    assignee.

    ## Checklist for reviewers

    Each PR comes with its own improvements and flaws. The reviewer should
    check the following:
    - [x] the code is readable
    - [ ] the code is well tested
    - [x] the code is documented (including return types and parameters)
    - [x] the code is easy to maintain

    ## Author checklist after completed review

    - [ ] I have added a line to the CHANGELOG describing this change, in a
    section
      reflecting type of change (add section where missing):
      - *added*: when you have added new functionality
      - *changed*: when default behaviour of the code has been changed
      - *fixes*: when your contribution fixes a bug

    ## Checklist for assignee

    - [ ] PR is up to date with the base branch
    - [ ] the tests pass
    - [ ] author has added an entry to the changelog (and designated the
    change as *added*, *changed* or *fixed*)
    - Once the PR is ready to be merged, squash commits and merge the PR.
@joeloskarsson (Collaborator)

Do we know exactly why the tests are not passing here? From the comments above I thought @sadamov's fixes made all tests pass, but they still look red here on GH. Maybe I have missed something? Looking at the logs I see some "Process completed with exit code 137.", which would indicate running out of memory. Is that the issue?

@sadamov (Collaborator) commented Nov 19, 2024

> Do we know exactly why the tests are not passing here? From the comments above I thought @sadamov's fixes made all tests pass, but they still look red here on GH. Maybe I have missed something? Looking at the logs I see some "Process completed with exit code 137.", which would indicate running out of memory. Is that the issue?

I just remembered that I had to locally fix MDP: https://github.com/mllam/mllam-data-prep/blob/8e7a5bc63a1ae1235b82b1f702c00eb33e891a79/mllam_data_prep/config.py#L306

where I added this line (at line 307): `extra: Optional[Dict[str, Any]] = None`
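
For reference, a minimal sketch of that change, assuming a dataclass-based Config like the one linked above (surrounding fields elided; the layout here is illustrative):

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Config:
    # ... existing mllam-data-prep config fields elided ...
    extra: Optional[Dict[str, Any]] = None  # free-form section ignored by mllam-data-prep
```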

@leifdenby, when will you release v0.3.0 of MDP? I think these issues will be fixed there.

I don't know about the memory issue, but I cannot run test_training.py on my local machine with 16 GB of RAM because of memory constraints. Maybe the GitHub runner also runs out of memory?

@joeloskarsson (Collaborator) commented Nov 19, 2024

The failing tests partially relate to MDP, but there also seems to be an OOM issue. Will investigate.

@joeloskarsson (Collaborator)

And we're green again 🟢 🥳

@joeloskarsson (Collaborator) left a comment


Alright, this is pretty much good to go now! Only waiting for the MDP compatibility fix before hitting merge. I am happy with everything else.

pyproject.toml (review comment, resolved)
@leifdenby (Member, author)

> And we're green again 🟢 🥳

OMG! That makes me happy. With tests running on CPU and GPU. AMAZING!

Ok, the good news is @observingClouds and I have decided how to add the projection info to the datastore config. We are going to go with the approach I already implemented, where we use the extra section of the config that mllam-data-prep ignores (this is because it turns out it is not currently possible to define projection info with a WKT string and from that create a cartopy.crs.Projection that can be used for plotting, mllam/mllam-data-prep#33 (comment)). I still need to fix the example to set the projection info correctly (which is also about setting the globe radius, mllam/mllam-data-prep#18 (comment)). Once I have completed that and @observingClouds has reviewed it, we will release v0.5.0 and this PR can be merged 🥳
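
As a rough sketch of how that could look (the key names and structure below are assumptions about the extra section, not the final schema):

```python
import cartopy.crs as ccrs

# e.g. parsed from the "extra" section of the datastore config
# (a section that mllam-data-prep itself ignores)
projection_info = {
    "class_name": "LambertConformal",
    "kwargs": {"central_longitude": 15.0, "central_latitude": 60.0},
    "globe": {"semimajor_axis": 6371000.0, "semiminor_axis": 6371000.0},
}

globe = ccrs.Globe(ellipse=None, **projection_info["globe"])
proj = getattr(ccrs, projection_info["class_name"])(
    **projection_info["kwargs"], globe=globe
)
# `proj` can then be returned from the datastore's coords_projection
# property and used for plotting with cartopy/matplotlib
```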

@joeloskarsson (Collaborator) left a comment


Since I am off for a few days I'm gonna change to approve here, so you can go ahead and hit merge on this once #66 (comment) is sorted.

@leifdenby (Member, author)

Ok, this is the big moment! All the tests have passed and we have approvals from both @joeloskarsson and @sadamov! Finally merging after 4 months of work! 🥳 I am merging! 🚀

@leifdenby leifdenby merged commit c3c1722 into mllam:main Nov 21, 2024
8 checks passed
@sadamov (Collaborator) commented Nov 21, 2024

@leifdenby This is just awesome! So happy with the result. You really introduced a very nice and clean structure to the data-pipeline. Biggest PR of my life; I learned a lot about Python classes along the way and thoroughly enjoyed working with all of you here on this PR ❤️

leifdenby added a commit that referenced this pull request Dec 4, 2024
Fix bugs in recently introduced datastore functionality #66 (error in
calculation in `BaseDatastore.get_xy_extent()` and overlooked in-place
modification of config dict in `MDPDatastore.coords_projection`), and
also fix issue in `ARModel.plot_examples` by using newly introduced
(#66) `WeatherDataset.create_dataarray_from_tensor()` to create
`xr.DataArray` from prediction tensor and calling plot methods directly
on `xr.DataArray` rather than using bare numpy arrays with `matplotlib`.
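
A minimal sketch of that plotting pattern (the create_dataarray_from_tensor arguments and the variable name used here are assumptions for illustration, not the exact signature):

```python
# wrap the prediction tensor as an xr.DataArray using the dataset helper...
da_prediction = weather_dataset.create_dataarray_from_tensor(
    tensor=prediction, time=sample_time, category="state"
)
# ...then plot a single variable directly via xarray, instead of passing
# bare numpy arrays to matplotlib
da_prediction.sel(state_feature="t2m").plot(ax=ax)
```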