Improve handling of existing zarr-archives in mdp datastore #82

sadamov · 2024-10-25T11:34:33Z

Idea: Save Dataset Creation Config as YAML Attribute for Improved Reproducibility

Description:

To improve the reproducibility and traceability of datasets created using mllam-data-prep, we should save the dataset creation configuration directly as an attribute of the dataset itself. This will enable proper handling of mismatches between the saved dataset and config used to create it.

Currently, the user only gets a warning that no new datastore was created and that the existing one will be used instead.

Key points:

Serialize the dataset creation config (self._config) to YAML format
Save the serialized YAML config as an attribute of the dataset (self._ds)
This provides a record of the exact settings used to generate the dataset
Enables detecting mismatches between dataset and config
Improves reproducibility by allowing datasets to be recreated from the saved config

Implementation:

In the MDPDatastore constructor, serialize the dataset creation config (self._config) to YAML:

import yaml

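# Serialize config to YAML string.
# Note: yaml.dump on a non-plain object emits Python-specific tags that
# yaml.safe_load (used when loading, see below) rejects, so self._config
# must be converted to plain dicts/lists first if it is not one already.
config_yaml = yaml.dump(self._config)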

Save the YAML config string as a dataset attribute, e.g. "creation_config":

self._ds.attrs["creation_config"] = config_yaml

When loading datasets (e.g. in get_dataarray), check for the presence of the "creation_config" attribute.
If present, deserialize the YAML back to a config object:

import yaml

config_yaml = self._ds.attrs["creation_config"]
loaded_config = yaml.safe_load(config_yaml)

Compare the deserialized config (loaded_config) with the current self._config.
Warn or raise an error if a mismatch is detected between the saved config and the current config.
Decide whether to rename or overwrite the existing zarr.
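
The steps above could be sketched roughly as follows. This is only an illustration under assumptions: ExampleConfig is a hypothetical stand-in for the real config class, a plain dict stands in for self._ds.attrs, and the helper names config_to_yaml and find_config_mismatches are made up for this example. The config is converted to plain dicts and lists before dumping so that yaml.safe_load can read it back.

```python
from dataclasses import asdict, dataclass, field

import yaml


@dataclass
class ExampleConfig:
    """Hypothetical stand-in for the dataset creation config."""

    schema_version: str
    variables: list = field(default_factory=list)


def config_to_yaml(config) -> str:
    # asdict() yields plain dicts/lists, so the output contains no
    # Python-specific tags and can be read back with yaml.safe_load.
    return yaml.safe_dump(asdict(config), sort_keys=True)


def find_config_mismatches(attrs, current_config):
    """Compare the saved config against the current one.

    Returns None if no config was saved, otherwise a dict mapping each
    differing top-level key to a (saved, current) tuple.
    """
    saved_yaml = attrs.get("creation_config")
    if saved_yaml is None:
        return None
    saved = yaml.safe_load(saved_yaml)
    current = asdict(current_config)
    keys = sorted(set(saved) | set(current))
    return {
        key: (saved.get(key), current.get(key))
        for key in keys
        if saved.get(key) != current.get(key)
    }


# A plain dict standing in for self._ds.attrs (assumption for this sketch).
attrs = {}
attrs["creation_config"] = config_to_yaml(ExampleConfig("v0.5.0", ["t2m", "u10"]))

same = find_config_mismatches(attrs, ExampleConfig("v0.5.0", ["t2m", "u10"]))
changed = find_config_mismatches(attrs, ExampleConfig("v0.5.0", ["t2m"]))
```

Returning the differing keys rather than a plain boolean would let the warning or error message say which settings changed. The comparison here only looks at top-level keys; a nested diff may be more useful for a real config.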

leifdenby · 2024-10-29T15:07:20Z

> To improve the reproducibility and traceability of datasets created using mllam-data-prep, we should save the dataset creation configuration directly as an attribute of the dataset itself. This will enable proper handling of mismatches between the saved dataset and config used to create it.

I completely agree with this! I was just not sure which format to use, and I wondered whether plain yaml would somehow be a "dirty" approach 😆 I would propose that serialising the config into the dataset should be added to mllam-data-prep, while the logic for deciding whether to recreate the dataset should maybe reside in nl (that's what the cool kids call neural-lam these days, I have heard).
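
To picture the neural-lam side of that split, here is a hypothetical sketch of a small policy function. The name plan_archive_action, the on_mismatch parameter and the returned action strings are all assumptions made for illustration, not existing API. It only decides what to do; the actual renaming, overwriting or dataset creation would happen elsewhere.

```python
def plan_archive_action(archive_exists, saved_config, current_config, on_mismatch="error"):
    """Decide how to treat an existing zarr archive (hypothetical policy).

    saved_config is the config read back from the dataset attribute, or
    None if the archive has no such attribute. Both configs are expected
    to be plain dicts so they can be compared with ==.
    """
    if not archive_exists:
        return "create"
    if saved_config is None:
        # Older archive without a stored config: nothing to compare against.
        return "reuse-unverified"
    if saved_config == current_config:
        return "reuse"
    if on_mismatch in ("rename", "overwrite"):
        return on_mismatch
    raise ValueError("existing zarr archive was created with a different config")


current = {"schema_version": "v0.5.0", "variables": ["t2m"]}
older = {"schema_version": "v0.5.0", "variables": ["t2m", "u10"]}

action_new = plan_archive_action(False, None, current)
action_same = plan_archive_action(True, dict(current), current)
action_renamed = plan_archive_action(True, older, current, on_mismatch="rename")
```

Raising by default on a mismatch is one possible choice; it is stricter than the current warning-only behaviour and forces an explicit decision between renaming and overwriting.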

sadamov added this to the v 0.5.0 (proposed) milestone Dec 10, 2024
