Improve handling of existing zarr-archives in mdp datastore #82

Open
sadamov opened this issue Oct 25, 2024 · 1 comment

Comments

@sadamov
Collaborator

sadamov commented Oct 25, 2024

Idea: Save Dataset Creation Config as YAML Attribute for Improved Reproducibility

Description:

To improve the reproducibility and traceability of datasets created using mllam-data-prep, we should save the dataset creation configuration directly as an attribute of the dataset itself. This will enable proper handling of mismatches between the saved dataset and config used to create it.

Currently, the user only gets a warning that no new datastore was created and that the existing one will be used instead.

Key points:

  • Serialize the dataset creation config (self._config) to YAML format
  • Save the serialized YAML config as an attribute of the dataset (self._ds)
  • This provides a record of the exact settings used to generate the dataset
  • Enables detecting mismatches between dataset and config
  • Improves reproducibility by allowing datasets to be recreated from the saved config
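
To make "detecting mismatches" a bit more concrete, here is a minimal sketch that lists which settings differ, assuming both the saved and the current config have already been reduced to nested plain dicts. The helper name `diff_configs` and the example keys are made up for illustration and are not part of mllam-data-prep or neural-lam.

```python
def diff_configs(saved, current, prefix=""):
    """Return dotted paths of entries that differ between two nested dicts.

    Hypothetical helper: only dicts are compared recursively, every other
    value (lists, numbers, strings) is compared with plain equality.
    """
    differences = []
    for key in sorted(set(saved) | set(current)):
        path = f"{prefix}{key}"
        if key not in saved:
            differences.append(f"{path}: only in current config")
        elif key not in current:
            differences.append(f"{path}: only in saved config")
        elif isinstance(saved[key], dict) and isinstance(current[key], dict):
            differences.extend(diff_configs(saved[key], current[key], f"{path}."))
        elif saved[key] != current[key]:
            differences.append(f"{path}: {saved[key]!r} != {current[key]!r}")
    return differences


# Made-up example configs, not taken from a real mllam-data-prep file
saved_config = {"output": {"chunking": {"time": 1}}, "inputs": {"a": 1}}
current_config = {"output": {"chunking": {"time": 4}}, "inputs": {"a": 1}, "extra": True}
report = diff_configs(saved_config, current_config)
```

An empty list means the two configs agree; a non-empty list could be included in the warning or error message so the user sees what changed.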

Implementation:

  1. In the MDPDatastore constructor, serialize the dataset creation config (self._config) to YAML:

     import yaml

     # Serialize config to YAML string.
     # Note: yaml.safe_load (step 4) can only read this back if the config is
     # dumped as plain dicts/lists; a custom config object needs converting first.
     config_yaml = yaml.dump(self._config)

  2. Save the YAML config string as a dataset attribute, e.g. "creation_config":

     self._ds.attrs["creation_config"] = config_yaml

  3. When loading datasets (e.g. in get_dataarray), check for the presence of the "creation_config" attribute.
  4. If present, deserialize the YAML back to a config object:

     import yaml

     config_yaml = self._ds.attrs["creation_config"]
     loaded_config = yaml.safe_load(config_yaml)

  5. Compare the deserialized config (loaded_config) with the current self._config.
  6. Warn or raise an error if a mismatch is detected between the saved config and the current config.
  7. Decide whether to rename or overwrite the existing zarr.
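
The round trip in the steps above could be sketched as follows. This is only an illustration under a few assumptions: `ExampleConfig` is a hypothetical stand-in for the real config class, a plain dict stands in for `self._ds.attrs`, and the config is converted with `dataclasses.asdict` before dumping so that `yaml.safe_load` can read it back. The function names are invented for this sketch.

```python
import dataclasses
import warnings

import yaml  # PyYAML

ATTR_NAME = "creation_config"  # attribute name suggested in this issue


@dataclasses.dataclass
class ExampleConfig:
    """Hypothetical stand-in for the real dataset creation config."""

    variables: list
    chunk_size: int = 100


def config_to_yaml(config):
    # Convert to plain dicts/lists first: yaml.dump on an arbitrary object
    # emits Python-specific tags that yaml.safe_load refuses to read.
    return yaml.safe_dump(dataclasses.asdict(config), sort_keys=True)


def store_config(attrs, config):
    """Save the serialized config in an attrs mapping (e.g. ds.attrs)."""
    attrs[ATTR_NAME] = config_to_yaml(config)


def config_matches(attrs, config, strict=False):
    """Return True if the stored config equals the current one."""
    if ATTR_NAME not in attrs:
        warnings.warn("dataset has no stored creation config; cannot verify it")
        return False
    stored = yaml.safe_load(attrs[ATTR_NAME])
    if stored == dataclasses.asdict(config):
        return True
    message = "stored creation config differs from the current config"
    if strict:
        raise ValueError(message)
    warnings.warn(message)
    return False


attrs = {}  # stand-in for self._ds.attrs
store_config(attrs, ExampleConfig(variables=["t2m", "u10"]))
same = config_matches(attrs, ExampleConfig(variables=["t2m", "u10"]))
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the expected mismatch warning
    changed = config_matches(attrs, ExampleConfig(variables=["t2m"]))
```

Storing the config as a single string keeps the attribute simple to write to zarr, and comparing the parsed dicts rather than the raw strings avoids false mismatches caused by key order or formatting.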
@leifdenby
Member

> To improve the reproducibility and traceability of datasets created using mllam-data-prep, we should save the dataset creation configuration directly as an attribute of the dataset itself. This will enable proper handling of mismatches between the saved dataset and config used to create it.

I completely agree with this! I just wasn't sure which format to use, and I was wondering whether plain YAML would somehow be a "dirty" approach 😆 I would propose that serialising the config into the dataset be added to mllam-data-prep, while the logic for deciding whether to recreate the dataset should perhaps reside in nl (that's what the cool kids call neural-lam these days, I have heard).
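
As a rough sketch of what the "recreate or not" logic on the neural-lam side might look like, assuming the config comparison has already produced a boolean: the helper name `resolve_existing_store`, the policy names, and the `.oldN` backup suffix are all hypothetical and only meant to illustrate the rename-or-overwrite choice.

```python
import shutil
import tempfile
from pathlib import Path


def resolve_existing_store(zarr_path, configs_match, on_mismatch="rename"):
    """Decide what to do with an existing zarr directory (hypothetical helper).

    Returns "reuse" if the existing store can be used as-is, otherwise
    "create" once any stale store has been moved aside or deleted.
    """
    if on_mismatch not in ("rename", "overwrite"):
        raise ValueError(f"unknown on_mismatch policy: {on_mismatch!r}")
    zarr_path = Path(zarr_path)
    if not zarr_path.exists():
        return "create"
    if configs_match:
        return "reuse"
    if on_mismatch == "rename":
        # Keep the old store next to the new one under the first free name.
        index = 0
        backup = zarr_path.with_name(f"{zarr_path.name}.old{index}")
        while backup.exists():
            index += 1
            backup = zarr_path.with_name(f"{zarr_path.name}.old{index}")
        zarr_path.rename(backup)
    else:
        # "overwrite": delete the stale store so it can be recreated.
        shutil.rmtree(zarr_path)
    return "create"


# Small demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "example.zarr"
    store.mkdir()
    reuse_action = resolve_existing_store(store, configs_match=True)
    rename_action = resolve_existing_store(store, configs_match=False)
    backup_exists = (Path(tmp) / "example.zarr.old0").is_dir()
    store_moved = not store.exists()
```

Defaulting to "rename" is the more cautious choice, since recreating a large dataset can be expensive and an accidental overwrite cannot be undone.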

@sadamov sadamov added this to the v 0.5.0 (proposed) milestone Dec 10, 2024