
New data sources #2584

Open
bouweandela opened this issue Nov 22, 2024 · 0 comments
bouweandela commented Nov 22, 2024

Design for improving the tool so new data sources can be added easily

The plan for what the configuration should look like is in #2371. We plan to ship these configuration files with ESMValCore for supported data sources, so users won't have to configure this themselves. The configuration mentions the name of a class (e.g. esmvalcore.local.DataSource) and any arguments needed to construct an instance of it that can be used to find data. This class should have a method to find the data, e.g. named find_files or find_data, that takes the facets from the recipe (plus any automatically added facets) as arguments and returns an object/a list of objects that can be used to access the data (e.g. esmvalcore.local.LocalFile or esmvalcore.esgf.ESGFFile, but maybe these could also be Iris Cubes or Xarray Datasets; I'm not sure whether intermediate objects are needed in cases where constructing an Iris Cube or Xarray Dataset is really fast). The found data objects may then be passed on to the esmvalcore.preprocessor.load function or, alternatively, be inserted somewhere in the esmvalcore.dataset.Dataset.load method (skipping the current load/fix/concatenate functions, though CMOR checking will need to be done regardless) and enter the preprocessing chain from there.
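A minimal sketch of what such a data-source interface could look like. The class and method names (DataSource, find_files) follow the description above, but the data-object type and the example implementation are purely illustrative; none of this is existing ESMValCore API:

```python
# Hypothetical sketch of the proposed data-source interface; the names
# DataSource and find_files come from the issue text, everything else
# is illustrative and NOT real ESMValCore code.
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class DataObject:
    """Something that can be loaded later, e.g. a local or ESGF file."""

    name: str
    version: str
    path: str


class DataSource(Protocol):
    """A data source that can locate data matching a set of facets."""

    def find_files(self, **facets: Any) -> list[DataObject]:
        """Return data objects matching the recipe facets."""
        ...


@dataclass
class LocalDirectorySource:
    """Toy implementation that searches an in-memory catalogue."""

    catalogue: list[DataObject] = field(default_factory=list)

    def find_files(self, **facets: Any) -> list[DataObject]:
        # A real implementation would translate facets into glob patterns
        # or catalogue queries; here we only match on the variable name.
        name = facets.get("short_name")
        return [obj for obj in self.catalogue if obj.name == name]


source = LocalDirectorySource(
    catalogue=[DataObject("tas", "v20240101", "/data/tas.nc")]
)
found = source.find_files(short_name="tas", mip="Amon")
print([obj.path for obj in found])  # → ['/data/tas.nc']
```

Because the configuration only names the class and its constructor arguments, any object satisfying this protocol (intake-based, ESGF-based, local) could be plugged in without changes elsewhere.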

Some additional things to consider:

  • For some data sources we need the ability to deduplicate input data across multiple data sources, in particular for CMIP data (use case: most data available in a centrally managed directory and the rest in a user-managed directory). This could e.g. be done by adding a name and version attribute (currently this is done based on the filename and the 'version' facet here) and having a generic function that is applied to all input data objects and filters them so there is only one data object for each name, and it is the requested (or latest) version. Individual data sources must also have the ability to deduplicate, e.g. here. Side note: the current implementation does not work if newer versions of files use different filenames because the time slices stored in the files differ from the old version; this issue is probably hidden by the concatenate preprocessor function, which takes out duplicated data, but not necessarily the correct bits.
  • Fixes are often specific to the data source, but there can also be overlap. Therefore they should probably be applied as part of the data object load instead of in the generic esmvalcore.dataset.Dataset.load function.
  • In the future we would like to make fixes standalone Python packages based on Xarray (perhaps in combination with ncdata) so they will see larger community uptake and contributions. It seems likely that there will be one fixes package per project (e.g. CMIP7, CMIP6, CORDEX-CMIP6) and data source (e.g. NetCDF files, Xarray datasets, Zarr).
  • We would like to add support for intake-esm, intake-esgf, xcube, and possibly more (e.g. intake-STAC or PySTAC), so the design should be easy to extend with additional data sources.
  • In the long run, we may be able to replace our esmvalcore.local and esmvalcore.esgf modules by intake-esgf, depending on how that develops.
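The generic deduplication step from the first bullet could be sketched roughly as follows. All names are illustrative, and a real implementation would compare DRS version identifiers properly rather than relying on lexicographic string order:

```python
# Hypothetical sketch of the generic deduplication function described
# above: keep one data object per name, preferring the requested version
# and falling back to the latest. Illustrative only, not ESMValCore code.
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataObject:
    name: str
    version: str  # e.g. "v20240101"; assumed lexicographically sortable


def deduplicate(
    objects: list[DataObject],
    version: Optional[str] = None,
) -> list[DataObject]:
    """Return one object per name: the requested version, else the latest."""
    by_name: dict[str, list[DataObject]] = defaultdict(list)
    for obj in objects:
        by_name[obj.name].append(obj)

    result = []
    for candidates in by_name.values():
        if version is not None:
            wanted = [c for c in candidates if c.version == version]
            if wanted:
                result.append(wanted[0])
                continue
        # Fall back to the latest available version.
        result.append(max(candidates, key=lambda c: c.version))
    return result


objs = [
    DataObject("tas_2000-2010.nc", "v20230101"),  # central archive
    DataObject("tas_2000-2010.nc", "v20240101"),  # user directory, newer
    DataObject("pr_2000-2010.nc", "v20230101"),
]
print(sorted(o.version for o in deduplicate(objs)))
# → ['v20230101', 'v20240101']
```

Note that keying on name alone runs into exactly the side note above: if a newer version repackages the same data into files with different time slices (and thus different names), a name-based filter cannot pair them up, so deduplication would need to happen at a coarser granularity than individual files.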