
New data sources #2584

Open
bouweandela opened this issue Nov 22, 2024 · 0 comments
bouweandela commented Nov 22, 2024

Design for improving the tool so new data sources can be added easily

The plan for what the configuration should look like is in #2371. We plan to ship these configuration files with ESMValCore for supported data sources, so users won't have to configure this themselves. The configuration mentions the name of a class (e.g. esmvalcore.local.DataSource) and any arguments needed to construct an instance of it that can be used to find data. This class should have a method to find the data, e.g. named find_files or find_data, that takes the facets from the recipe (plus any automatically added facets) as arguments and returns an object/a list of objects that can be used to access the data (e.g. esmvalcore.local.LocalFile or esmvalcore.esgf.ESGFFile, but maybe these could also be Iris Cubes or Xarray Datasets; I'm not sure whether intermediate objects are needed in cases where constructing an Iris Cube or Xarray Dataset is really fast). The found data objects may then be passed on to the esmvalcore.preprocessor.load function or, alternatively, be inserted somewhere in the esmvalcore.dataset.Dataset.load method (skipping the current load/fix/concatenate functions, though CMOR checking will need to be done regardless) and enter the preprocessing chain from there.
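A minimal sketch of what such a data-source interface could look like. The class and method names (DataSource, find_files) follow the description above, but the data-object type and the example implementation are purely illustrative; none of this is existing ESMValCore API:

```python
# Hypothetical sketch of the proposed data-source interface; the names
# DataSource and find_files come from the issue text, everything else
# is illustrative and NOT real ESMValCore code.
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class DataObject:
    """Something that can be loaded later, e.g. a local or ESGF file."""

    name: str
    version: str
    path: str


class DataSource(Protocol):
    """A data source that can locate data matching a set of facets."""

    def find_files(self, **facets: Any) -> list[DataObject]:
        """Return data objects matching the recipe facets."""
        ...


@dataclass
class LocalDirectorySource:
    """Toy implementation that searches an in-memory catalogue."""

    catalogue: list[DataObject] = field(default_factory=list)

    def find_files(self, **facets: Any) -> list[DataObject]:
        # A real implementation would translate facets into glob patterns
        # or catalogue queries; here we only match on the variable name.
        name = facets.get("short_name")
        return [obj for obj in self.catalogue if obj.name == name]


source = LocalDirectorySource(
    catalogue=[DataObject("tas", "v20240101", "/data/tas.nc")]
)
found = source.find_files(short_name="tas", mip="Amon")
print([obj.path for obj in found])  # → ['/data/tas.nc']
```

Because the configuration only names the class and its constructor arguments, any object satisfying this protocol (intake-based, ESGF-based, local) could be plugged in without changes elsewhere.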

Some additional things to consider:

  • For some data sources we need the ability to deduplicate input data across multiple data sources, in particular for CMIP data (use case: most data available in a centrally managed directory and the rest in a user-managed directory). This could e.g. be done by adding a name and version attribute (currently this is done based on the filename and the 'version' facet here) and having a generic function that is applied to all input data objects and filters them so there is only one data object for each name, and it is the requested (or latest) version. Individual data sources must also have the ability to deduplicate, e.g. here. Side note: the current implementation does not work if newer versions of files use different filenames because the time slices stored in the files differ from the old version; this issue is probably hidden by the concatenate preprocessor function, which takes out duplicated data, but not necessarily the correct bits.
  • Fixes are often specific to the data source, but there can also be overlap. Therefore they should probably be applied as part of the data object load instead of in the generic esmvalcore.dataset.Dataset.load function.
  • In the future we would like to make fixes standalone Python packages based on Xarray (perhaps in combination with ncdata) so they will see larger community uptake and contributions. It seems likely that there will be one fixes package per project (e.g. CMIP7, CMIP6, CORDEX-CMIP6) and data source (e.g. NetCDF files, Xarray datasets, Zarr).
  • We would like to add support for intake-esm, intake-esgf, xcube, and possibly more (e.g. intake-STAC or PySTAC), so the design should be easy to extend with additional data sources.
  • In the long run, we may be able to replace our esmvalcore.local and esmvalcore.esgf modules by intake-esgf, depending on how that develops.
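The generic deduplication step from the first bullet could be sketched roughly as follows. All names are illustrative, and a real implementation would compare DRS version identifiers properly rather than relying on lexicographic string order:

```python
# Hypothetical sketch of the generic deduplication function described
# above: keep one data object per name, preferring the requested version
# and falling back to the latest. Illustrative only, not ESMValCore code.
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataObject:
    name: str
    version: str  # e.g. "v20240101"; assumed lexicographically sortable


def deduplicate(
    objects: list[DataObject],
    version: Optional[str] = None,
) -> list[DataObject]:
    """Return one object per name: the requested version, else the latest."""
    by_name: dict[str, list[DataObject]] = defaultdict(list)
    for obj in objects:
        by_name[obj.name].append(obj)

    result = []
    for candidates in by_name.values():
        if version is not None:
            wanted = [c for c in candidates if c.version == version]
            if wanted:
                result.append(wanted[0])
                continue
        # Fall back to the latest available version.
        result.append(max(candidates, key=lambda c: c.version))
    return result


objs = [
    DataObject("tas_2000-2010.nc", "v20230101"),  # central archive
    DataObject("tas_2000-2010.nc", "v20240101"),  # user directory, newer
    DataObject("pr_2000-2010.nc", "v20230101"),
]
print(sorted(o.version for o in deduplicate(objs)))
# → ['v20230101', 'v20240101']
```

Note that keying on name alone runs into exactly the side note above: if a newer version repackages the same data into files with different time slices (and thus different names), a name-based filter cannot pair them up, so deduplication would need to happen at a coarser granularity than individual files.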