Skip to content

Commit

Permalink
Merge pull request #83 from climate-resource/fix-up-drs
Browse files Browse the repository at this point in the history
Fix up DRS handling
  • Loading branch information
znichollscr authored Nov 11, 2024
2 parents bcea3a4 + 1c2db42 commit 1609e5c
Show file tree
Hide file tree
Showing 12 changed files with 317 additions and 43 deletions.
11 changes: 11 additions & 0 deletions changelog/83.breaking.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
Updated to better be in line with [the DRS description](https://docs.google.com/document/d/1h0r8RZr_f3-8egBMMh7aqLwy3snpD6_MrDz1q8n5XUk).
The DRS makes no comment on the values in the directory names,
hence we now make no assumptions and put no requirements on characters used in directory names.
The restriction to only a-z, A-Z, 0-9 and the hyphen plus special use of the underscore
is now only applied to file names.
As a result, files with variables that have underscores in their names
are now written with underscores in their directories, but not in their file name.
For example,
`.../mon/mole-fraction-of-methane-in-air/gn/.../mole-fraction-of-methane-in-air_input4MIPs_GHGConcentrations_CMIP_CR-CMIP-0-2-0_gn_200001-201012.nc`
is now written as
`.../mon/mole_fraction_of_methane_in_air/gn/.../mole-fraction-of-methane-in-air_input4MIPs_GHGConcentrations_CMIP_CR-CMIP-0-2-0_gn_200001-201012.nc`.
3 changes: 3 additions & 0 deletions changelog/83.fix.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
Fixed a bug in
[`validate_file_written_according_to_drs`][input4mips_validation.cvs.drs.DataReferenceSyntax.validate_file_written_according_to_drs]
which caused it to break if the file contained metadata which was not a string.
1 change: 1 addition & 0 deletions changelog/83.trivial.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Improved stubs for iris.
135 changes: 111 additions & 24 deletions src/input4mips_validation/cvs/drs.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,29 @@
"""
Data reference syntax data model
This module needs a re-write.
The rules are not clear enough and the code misrepresents the logic.
For the re-write:
- change to just having a `parse_filepath` and `parse_filename` function for each DRS
- introduce a prototype to allow different DRS handlers to be injected as needed
- then you just do wrappers around this DRS handling
- get metadata from filename
- get metadata from path
- create filename
- create path
- validate file correctly written according to DRS
- etc.
- remove the ability to inject the DRS via a string
- fundamentally, it doesn't make sense and is too hard to get the logic right
- it also hides the fact
that there is lots of logic that can't be handled in a single string
- as a result of the above, remove the DRS from the CVs entirely
- maybe use a string as a key
- maybe put the DRS in a separate package to allow better re-use
- maybe, because there are some couplings here
that might mean you can't do a clean split from the CVs/other logic that easily
"""

from __future__ import annotations
Expand Down Expand Up @@ -34,6 +58,10 @@
"""Form into which the DRS is serialised for the CVs"""


DEFAULT_REPLACEMENTS: dict[str, str] = {".": "-"}
"""Default replacements for characters in directories and file names"""


@frozen
class DataReferenceSyntax:
"""
Expand Down Expand Up @@ -145,7 +173,7 @@ def get_file_path( # noqa: PLR0913
Time range information is required by the DRS,
but `time_start` and `time_end` are not supplied.
"""
# First step: apply a number of known replacements
# First step: apply a number of known replacements globally
all_available_metadata = {
k: apply_known_replacements(v) for k, v in available_attributes.items()
}
Expand Down Expand Up @@ -190,19 +218,29 @@ def get_file_path( # noqa: PLR0913

apply_subs = functools.partial(
apply_substitutions,
metadata=all_available_metadata,
validate_substituted_metadata=True,
)
# Cast to path to ensure windows compatibility
directory = Path(
apply_subs(
self.directory_path_template,
metadata=all_available_metadata,
substitutions=directory_substitutions,
to_directory_path=True,
)
)
# In the filename, underscore is swapped for hyphen to avoid delimiter issues.
# Burying this here feels too deep,
# but I don't know how to express this in a more obvious way.
all_available_metadata_for_filename = {
k: apply_known_replacements(v, {"_": "-"})
for k, v in all_available_metadata.items()
}
filename = apply_subs(
self.filename_template,
metadata=all_available_metadata_for_filename,
substitutions=filename_substitutions,
to_directory_path=False,
)

generated_path = directory / filename
Expand Down Expand Up @@ -406,15 +444,16 @@ def get_regexp_for_capturing_directory_information(
Notes
-----
According to [the DRS description](https://docs.google.com/document/d/1h0r8RZr_f3-8egBMMh7aqLwy3snpD6_MrDz1q8n5XUk):
- only [a-zA-Z0-9-] are allowed in file path names
except where underscore is used as a separator.
We use this to significantly simplify our regular expression.
According to
[the DRS description](https://docs.google.com/document/d/1h0r8RZr_f3-8egBMMh7aqLwy3snpD6_MrDz1q8n5XUk),
there are no restrictions on the directory characters
(there are only restrictions on the characters in the filename,
see
[`get_regexp_for_capturing_filename_information`][input4mips_validation.cvs.drs.DataReferenceSyntax.get_regexp_for_capturing_filename_information]).
"""
# Hard-code according to the spec
allowed_chars = "[a-zA-Z0-9-]"
# All characters except the path separator
allowed_chars = "[^/]"

drs_template = self.directory_path_template
directory_substitutions = self.parse_drs_template(drs_template=drs_template)
Expand Down Expand Up @@ -538,7 +577,10 @@ def validate_file_written_according_to_drs(

ds = xr.open_dataset(file, use_cftime=True)
comparison_metadata = {
k: apply_known_replacements(v) for k, v in ds.attrs.items()
k: apply_known_replacements(v)
for k, v in ds.attrs.items()
# Ignore everything that isn't a string for comparisons
if isinstance(v, str)
}

# Infer time range information, in case it appears in the DRS.
Expand Down Expand Up @@ -580,10 +622,18 @@ def validate_file_written_according_to_drs(

for k, v in file_metadata.items():
if v is None:
# No info in directory, presumably because key was optional
# No info in filename, presumably because key was optional
continue

if comparison_metadata[k] != v:
# In the filename, underscore
# can be swapped for hyphen to avoid delimiter issues.
# Burying this here feels too deep,
# but I don't know how to express this in a more obvious way.
valid_filename_values = {
comparison_metadata[k],
apply_known_replacements(comparison_metadata[k], {"_": "-"}),
}
if v not in valid_filename_values:
mismatches.append(
[k, "filename", file_metadata[k], comparison_metadata[k]]
)
Expand Down Expand Up @@ -735,6 +785,7 @@ def apply_substitution(
self,
start: str,
metadata: dict[str, str],
to_directory_path: bool,
validate_substituted_metadata: bool = True,
) -> str:
"""
Expand All @@ -748,12 +799,18 @@ def apply_substitution(
metadata
Metadata from which the substitution values can be retrieved
to_directory_path
Are the substitutions being applied to create a directory path?
If `False`, we assume that we are creating a file name.
validate_substituted_metadata
If `True`, the substituted metadata is validated to ensure that its values
only contain allowed characters before being applied.
Returns
-------
:
`start` with the substitution defined by `self` applied
"""
missing_metadata = [k for k in self.required_metadata if k not in metadata]
Expand All @@ -771,7 +828,8 @@ def apply_substitution(
metadata_to_substitute = {k: metadata[k] for k in self.required_metadata}
if validate_substituted_metadata:
assert_all_metadata_substitutions_only_contain_valid_characters(
metadata_to_substitute
metadata_to_substitute,
to_directory_path=to_directory_path,
)

res = start.replace(
Expand Down Expand Up @@ -801,14 +859,16 @@ def apply_known_replacements(
known_replacements
Known replacements to apply.
If `None`, a default set is used.
If `None`, we use
[`DEFAULT_REPLACEMENTS`][input4mips_validation.cvs.drs.DEFAULT_REPLACEMENTS].
Result
------
:
`inp` with known replacements applied.
""" # noqa: E501
if known_replacements is None:
known_replacements = {"_": "-", ".": "-"}
known_replacements = DEFAULT_REPLACEMENTS

res = inp
for old, new in known_replacements.items():
Expand Down Expand Up @@ -847,6 +907,7 @@ def assert_only_valid_chars(inp: str | Path, valid_chars: set[str]) -> None:

def assert_all_metadata_substitutions_only_contain_valid_characters(
metadata: dict[str, str],
to_directory_path: bool,
) -> None:
"""
Assert that all the metadata substitutions only contain valid characters
Expand All @@ -855,21 +916,33 @@ def assert_all_metadata_substitutions_only_contain_valid_characters(
According to [the DRS description](https://docs.google.com/document/d/1h0r8RZr_f3-8egBMMh7aqLwy3snpD6_MrDz1q8n5XUk):
- only [a-zA-Z0-9-] are allowed in file path names
except where underscore is used as a separator.
This is enforced here.
> All strings appearing in the file name
> are constructed using only the following characters:
> a-z, A-Z, 0-9, and the hyphen ("-"),
> except the hyphen must not appear in variable_id.
> Underscores are prohibited throughout except as shown in the template.
We prohibit the use of underscores in the filenames, following the DRS description.
However, we've clearly ignored the rule
about no hyphens in variable IDs in input4MIPs,
so we allow hyphens to appear in the variable ID part of the file name.
Hyphens will only appear in the variable ID part of the file name
when the original variable had underscores,
and these underscores have been replaced with hyphens to avoid breaking the DRS.
- we're meant to use the CMIP data request variable names, not CF standards,
so that there aren't hyphens in the values for the "variable_id" key.
We've clearly not done that in input4MIPs,
so we are ignoring that rule here
(but it would be important for other CV implementations!).
Nothing is said about the directory names,
so all values are allowed for directory names.
Parameters
----------
metadata
Metadata substitutions to check
to_directory_path
Are the substitutions being applied to create a directory path?
If `False`, we assume that we are creating a file name.
Raises
------
ValueError
Expand All @@ -880,7 +953,13 @@ def assert_all_metadata_substitutions_only_contain_valid_characters(
[`assert_full_filepath_only_contains_valid_characters`][input4mips_validation.cvs.drs.assert_full_filepath_only_contains_valid_characters]
"""
# Hard-code according to the spec
valid_chars = set(string.ascii_letters + string.digits + "-")
if to_directory_path:
# Truth is that this is probably even wider than this, but ok
valid_chars = set(string.ascii_letters + string.digits + "-" + "_")

else:
valid_chars = set(string.ascii_letters + string.digits + "-")

for k, v in metadata.items():
# Special case for variable_id would go here if we applied it
try:
Expand Down Expand Up @@ -924,6 +1003,7 @@ def apply_substitutions(
drs_template: str,
substitutions: Iterable[DRSSubstitution],
metadata: dict[str, str],
to_directory_path: bool,
validate_substituted_metadata: bool = True,
) -> str:
"""
Expand All @@ -940,19 +1020,26 @@ def apply_substitutions(
metadata
Metadata from which the substitution values can be retrieved
to_directory_path
Are the substitutions being applied to create a directory path?
If `False`, we assume that we are creating a file name.
validate_substituted_metadata
Passed to
[`DRSSubstitution.apply_substitution`][input4mips_validation.cvs.drs.DRSSubstitution.apply_substitution].
Returns
-------
:
DRS template, with all substitutions in `substitutions` applied
"""
res = drs_template
for substitution in substitutions:
res = substitution.apply_substitution(
res,
metadata=metadata,
to_directory_path=to_directory_path,
validate_substituted_metadata=validate_substituted_metadata,
)
# # TODO: swap to the below
Expand Down
13 changes: 11 additions & 2 deletions stubs/iris/cube.pyi
Original file line number Diff line number Diff line change
@@ -1,4 +1,13 @@
from typing import Any

from numpy.typing import NDArray

class Cube:
attributes: dict[str, str]
data: NDArray[Any]

def name(self) -> str: ...

class CubeList:
def __getitem__(self, key: int) -> Cube: ...
def __len__(self) -> int: ...

class Cube: ...
Original file line number Diff line number Diff line change
Expand Up @@ -104,7 +104,7 @@ def test_validate_written_single_variable_file(tmp_path):
dataset_category="GHGConcentrations",
datetime_end="2010-12-01T00:00:00Z",
datetime_start="2000-01-01T00:00:00Z",
esgf_dataset_master_id=f"input4MIPs.CMIP6Plus.CMIP.CR.CR-CMIP-0-2-0.atmos.mon.mole-fraction-of-carbon-dioxide-in-air.gn.v{version_exp}",
esgf_dataset_master_id=f"input4MIPs.CMIP6Plus.CMIP.CR.CR-CMIP-0-2-0.atmos.mon.mole_fraction_of_carbon_dioxide_in_air.gn.v{version_exp}",
filepath=str(written_file),
frequency="mon",
further_info_url="http://www.tbd.invalid",
Expand Down
2 changes: 1 addition & 1 deletion tests/integration/cli/test_write_in_drs_path.py
Original file line number Diff line number Diff line change
Expand Up @@ -125,7 +125,7 @@ def test_validate_write_in_drs(tmp_path):
dataset_category="GHGConcentrations",
datetime_end="2010-12-01T00:00:00Z",
datetime_start="2000-01-01T00:00:00Z",
esgf_dataset_master_id=f"input4MIPs.CMIP6Plus.CMIP.CR.CR-CMIP-0-2-0.atmos.mon.mole-fraction-of-carbon-dioxide-in-air.gn.v{version_exp}",
esgf_dataset_master_id=f"input4MIPs.CMIP6Plus.CMIP.CR.CR-CMIP-0-2-0.atmos.mon.mole_fraction_of_carbon_dioxide_in_air.gn.v{version_exp}",
filepath=str(written_file),
frequency="mon",
further_info_url="http://www.tbd.invalid",
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -18,15 +18,15 @@ <h3>Passing files</h3>
<ul>
<details>
<summary>1: mole-fraction-of-methane-in-air_input4MIPs_GHGConcentrations_CMIP_CR-CMIP-0-2-0_gn_200001-201012.nc</summary>
<p>Full path: TREE_ROOT/input4MIPs/CMIP6Plus/CMIP/CR/CR-CMIP-0-2-0/atmos/mon/mole-fraction-of-methane-in-air/gn/vVERSION/mole-fraction-of-methane-in-air_input4MIPs_GHGConcentrations_CMIP_CR-CMIP-0-2-0_gn_200001-201012.nc</p>
<p>Full path: TREE_ROOT/input4MIPs/CMIP6Plus/CMIP/CR/CR-CMIP-0-2-0/atmos/mon/mole_fraction_of_methane_in_air/gn/vVERSION/mole-fraction-of-methane-in-air_input4MIPs_GHGConcentrations_CMIP_CR-CMIP-0-2-0_gn_200001-201012.nc</p>
</details>
</ul>
</details>

<h3>Failing files</h3>
<details>
<summary>1: mole-fraction-of-carbon-dioxide-in-air_input4MIPs_GHGConcentrations_CMIP_CR-CMIP-0-2-0_gn_200001-201012.nc</summary>
<p>Full path: TREE_ROOT/input4MIPs/CMIP6Plus/CMIP/CR/CR-CMIP-0-2-0/atmos/mon/mole-fraction-of-carbon-dioxide-in-air/gn/vVERSION/mole-fraction-of-carbon-dioxide-in-air_input4MIPs_GHGConcentrations_CMIP_CR-CMIP-0-2-0_gn_200001-201012.nc</p>
<p>Full path: TREE_ROOT/input4MIPs/CMIP6Plus/CMIP/CR/CR-CMIP-0-2-0/atmos/mon/mole_fraction_of_carbon_dioxide_in_air/gn/vVERSION/mole-fraction-of-carbon-dioxide-in-air_input4MIPs_GHGConcentrations_CMIP_CR-CMIP-0-2-0_gn_200001-201012.nc</p>
<details>
<summary>Passed: 85.71% (6 / 7)</summary>
<ol>
Expand All @@ -52,7 +52,7 @@ <h3>Failing files</h3>
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File .../subprocess.py", line no., in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command .../bin/cfchecks', '-v', '1.7', 'TREE_ROOT/input4MIPs/CMIP6Plus/CMIP/CR/CR-CMIP-0-2-0/atmos/mon/mole-fraction-of-carbon-dioxide-in-air/gn/vVERSION/mole-fraction-of-carbon-dioxide-in-air_input4MIPs_GHGConcentrations_CMIP_CR-CMIP-0-2-0_gn_200001-201012.nc']' returned non-zero exit status 255.
subprocess.CalledProcessError: Command .../bin/cfchecks', '-v', '1.7', 'TREE_ROOT/input4MIPs/CMIP6Plus/CMIP/CR/CR-CMIP-0-2-0/atmos/mon/mole_fraction_of_carbon_dioxide_in_air/gn/vVERSION/mole-fraction-of-carbon-dioxide-in-air_input4MIPs_GHGConcentrations_CMIP_CR-CMIP-0-2-0_gn_200001-201012.nc']' returned non-zero exit status 255.

The above exception was the direct cause of the following exception:

Expand All @@ -63,7 +63,7 @@ <h3>Failing files</h3>
raise ValueError(error_msg) from exc
ValueError: cf-checker validation failed. cfchecks output:

CHECKING NetCDF FILE: TREE_ROOT/input4MIPs/CMIP6Plus/CMIP/CR/CR-CMIP-0-2-0/atmos/mon/mole-fraction-of-carbon-dioxide-in-air/gn/vVERSION/mole-fraction-of-carbon-dioxide-in-air_input4MIPs_GHGConcentrations_CMIP_CR-CMIP-0-2-0_gn_200001-201012.nc
CHECKING NetCDF FILE: TREE_ROOT/input4MIPs/CMIP6Plus/CMIP/CR/CR-CMIP-0-2-0/atmos/mon/mole_fraction_of_carbon_dioxide_in_air/gn/vVERSION/mole-fraction-of-carbon-dioxide-in-air_input4MIPs_GHGConcentrations_CMIP_CR-CMIP-0-2-0_gn_200001-201012.nc
=====================
Using CF Checker Version 4.1.0
Checking against CF Version CF-1.7
Expand Down
Loading

0 comments on commit 1609e5c

Please sign in to comment.