
add a first version #1

Merged — 37 commits merged into main from first_review on Apr 30, 2024

Conversation

@cosunae (Contributor) commented Apr 2, 2024

No description provided.

@twicki left a comment

Some comments, most of them in this PR as individual commits. Cherry-pick what you deem worthwhile.

priority = "default"

[[tool.poetry.source]]
name = "pypi-mch-publish"
-url = "https://hub.meteoswiss.ch/nexus/repository/python-mch/"
+url = "https://service.meteoswiss.ch/nexus/repository/python-mch/"
@twicki:

How is that supposed to work?

service is somehow required for building it at CSCS; hub is what the labvm wants?

How is building it on Jenkins, where we need nexus.mch, planned to work?

@cosunae (Contributor, Author) replied:

This is an attempt to make it work:
https://github.com/MeteoSwiss-APN/cosmo-archive-retrieve/pull/1/files#diff-e6ffa5dc854b843b3ee3c3c28f8eae2f436c2df2b1ca299cca1fa5982e390cf8R52

I hope this is temporary; in some discussions it was suggested that the URLs might be unified in the near future.

@twicki replied:

I see, perfect


def main():

    prefix = os.environ["CONDA_PREFIX"]
@twicki:

this already fails if no CONDA_PREFIX is defined. Should that be caught?

@cosunae replied:

We can, but what would you do? Raise another custom exception?
The script was meant to install into the share path of the conda prefix, so letting it fail seems right if that cannot be done.

@twicki replied:

In the next line you raise a custom exception saying the prefix does not exist when it exists but does not point to a directory. Why not be more descriptive than a KeyError here as well?
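For illustration, the more descriptive handling being suggested might look like this (a hypothetical sketch; the error messages and the choice of RuntimeError are assumptions, not the code merged in this PR):

```python
import os
from pathlib import Path


def main():
    # Hypothetical sketch: fail with a descriptive message instead of a
    # bare KeyError when CONDA_PREFIX is missing or not a directory.
    prefix = os.environ.get("CONDA_PREFIX")
    if prefix is None:
        raise RuntimeError(
            "CONDA_PREFIX is not set; activate a conda environment first."
        )
    if not Path(prefix).is_dir():
        raise RuntimeError(f"CONDA_PREFIX is not a directory: {prefix}")
```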

poetry.lock Outdated
@twicki:

I can't say what is wrong, but when I try to build the environment, I get stuck on:

$ poetry install
Warning: Found deprecated priority 'default' for source 'pypi-mch' in pyproject.toml. You can achieve the same effect by changing the priority to 'primary' and putting the source first.
Installing dependencies from lock file

Package operations: 40 installs, 3 updates, 0 removals

  - Installing pydantic-core (2.16.3): Pending...
  - Installing snowballstemmer (2.2.0): Pending...
  - Installing snuggs (1.4.7): Pending...
  - Installing soupsieve (2.5): Pending...
  - Installing sphinxcontrib-applehelp (1.0.8): Pending...
  - Installing sphinxcontrib-devhelp (1.0.6): Pending...
  - Installing sphinxcontrib-htmlhelp (2.0.5): Pending...
  - Installing sphinxcontrib-jsmath (1.0.1): Pending...

ds = idpi.grib_decoder.load(
    idpi.data_source.DataSource(datafiles=[full_path]),
    {
        "param": [
@twicki:

we somehow don't get PP here. How is that handled?

@cosunae replied:

Oh yes, very good point. I think the list came from the original pressure-levels extraction, which does not require P. I added it.

ds = xr.open_zarr(data_path)
checkpoint = ds.time.values[-1]

return datetime.utcfromtimestamp(checkpoint.tolist() / 1e9)
@twicki:

this is deprecated
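For reference, datetime.utcfromtimestamp is deprecated since Python 3.12. A sketch of the timezone-aware replacement (the function name and signature here are illustrative, not necessarily the code as merged):

```python
from datetime import datetime, timezone

import xarray as xr


def get_checkpoint(data_path: str) -> datetime:
    """Return the last timestamp stored in the Zarr archive."""
    ds = xr.open_zarr(data_path)
    checkpoint = ds.time.values[-1]  # numpy datetime64[ns] scalar
    # tolist() yields nanoseconds since the epoch; utcfromtimestamp is
    # deprecated, so use fromtimestamp with an explicit UTC timezone.
    return datetime.fromtimestamp(checkpoint.tolist() / 1e9, tz=timezone.utc)
```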



def load_data(config: dict) -> None:
    """Load weather data from archive and store it in a Zarr archive."""
@twicki:

missing args

@cosunae replied:

done
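The kind of docstring addition being asked for might look like this (a sketch; the wording and the exact config keys documented in the merged version are assumptions):

```python
def load_data(config: dict) -> None:
    """Load weather data from archive and store it in a Zarr archive.

    Args:
        config: Run configuration. Expected keys include, for example,
            "n_pool" (number of worker processes); the full schema is
            defined by this repository's configuration, not shown here.
    """
```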

def load_data(config: dict) -> None:
    """Load weather data from archive and store it in a Zarr archive."""

    n_pool = config["n_pool"]
@twicki:

this seems out of place here

"""Load weather data from archive and store it in a Zarr archive."""

n_pool = config["n_pool"]
tar_file_paths = []
@twicki:

Should we describe in a comment what we're expecting here? With the list containing the files and having the same name as the tar?

@cosunae replied:

Good point. Since it was not very clear, it seems, I thought we could better move it into its own function. Let me know if that improves things...
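A hypothetical sketch of that extraction into its own function, under the expectation described above (every name here is illustrative; the actual helper merged in the PR may differ):

```python
import os


def collect_tar_file_paths(archive_dir: str) -> list[str]:
    """Collect the tar archives to process.

    Illustrative sketch: each tar in the archive directory is expected
    to be accompanied by a listing file with the same base name that
    enumerates the tar's members.
    """
    tar_file_paths = []
    for name in sorted(os.listdir(archive_dir)):
        if name.endswith(".tar"):
            tar_file_paths.append(os.path.join(archive_dir, name))
    return tar_file_paths
```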

pdset["RELHUM"] = relhum(
pdset["QV"], pdset["T"], pdset["P"], clipping=True, phase="water"
)
pdset["FI"] = pdset["HFL"] * 9.80665
@twicki:

Why are we computing FI? It seems not to be in the dataset we're using?

@cosunae replied:

Yes, originally it was extracted because it was requested by FCN. But indeed FI is just height times g, and we already have HFL, so it does not add anything. I will remove it.

@twicki:

this should not be here

* isdir instead

* remove deprecated

* update suffix

* use path instead of basename

* renaming

* remove unused arg

* move npool to usage

* returnvalue
@sadamov commented Apr 10, 2024

Thanks @cosunae! Some questions and a request for additional variables. I looked at the zarr archive on Tsa: /scratch/cosuna/neural-lam/zarr/cosmo_ml_data.zarr

  • Why did we switch from PP to P?
  • Why did we switch from P0FL to HFL?
  • Why do these variables have a time dimension: FI, HFL, HSURF?
  • T_2M, PMSL, U_10M, V_10M, FI, HFL have z=62 vertical levels, but only one level is not NaN. Better to remove all other levels or drop the z-dimension completely.
  • U, V, T, RELHUM have missing data at 2016-01-03T05:00:00
  • Vertical levels 0 and 61 are sponge-layers and should be disregarded.

With regard to the NeuralLAM paper and some discussions with my professor, I think we might want to add additional variables. Using this document, we are trying to align the variables of ERA5 with our regional datasets. Could we also retrieve the following variables (ecCodes param IDs)?

Param name                                    Param.-db ID
mean_surface_latent_heat_flux                 147
mean_surface_net_long_wave_radiation_flux     176
mean_surface_net_short_wave_radiation_flux    169
mean_surface_sensible_heat_flux               146
total_cloud_cover                             164
vertical_velocity                             135
divergence                                    155
vorticity                                     138

This is my initial suggestion for additional retrievals; I think we should discuss it here or in our ML sync.

@sadamov commented Apr 10, 2024

I have one additional comment wrt collaboration within the community. It might be worthwhile to look at the following GRIB2Zarr libraries and maybe try to consolidate:

@sadamov commented Apr 10, 2024

In case you are looking for multi-threaded, high-performance Zarr creation, I have written a simple script using xr.open_mfdataset and SLURMCluster with Dask:
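The referenced script itself is not included in this page; a minimal sketch of that approach might look as follows (cluster sizing, file pattern, and output path are all assumptions):

```python
import xarray as xr
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Spin up Dask workers as SLURM jobs (sizing values are placeholders).
cluster = SLURMCluster(cores=32, memory="64GB", walltime="02:00:00")
cluster.scale(jobs=4)
client = Client(cluster)

# Lazily open many files as one dataset and write it to Zarr in parallel
# across the workers.
ds = xr.open_mfdataset("extracted/*.nc", combine="by_coords", parallel=True)
ds.to_zarr("cosmo_ml_data.zarr", mode="w")
```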

@cosunae (Contributor, Author) commented Apr 16, 2024

Thanks @sadamov for the comments. I have pushed a few commits to address them:

  • Why did we switch from PP to P?
    No particular reason; I started from an older version of the namelist we used, and at the time we were using P. I switched it back to PP.
  • Why did we switch from P0FL to HFL?
    I am not sure I understand this. P0FL is the reference pressure of the atmosphere, so that P = P0FL + PP. HFL is just height, so they are not related.
  • Why do these variables have a time dimension: FI, HFL, HSURF?
    No reason; I removed them now.
  • T_2M, PMSL, U_10M, V_10M, FI, HFL have z=62 vertical levels, but only one level is not NaN. Better to remove all other levels or drop the z-dimension completely.
    Yes, this is because we were creating a dataset with some arrays that still had a dimension z=1. Now we are squeezing it, so that they have one level only (i.e. no z dim), with the exception of HFL, which is truly a 3D field with 60 levels. FI is also removed, since we have height (HFL).
  • U, V, T, RELHUM have missing data at 2016-01-03T05:00:00.
    While testing, I often kill generation jobs; if that happens in the middle of creating the zarr archive, I guess this can happen.
  • Vertical levels 0 and 61 are sponge layers and should be disregarded.
    I am also not sure I understand this comment. All of the fields should contain only 60 levels, not 62. We had fields in zarr with 62 levels (also in the version we are currently using for neural-lam) because creating the xarray dataset from arrays whose z coordinates are not aligned (like T_2M and T) produces artificial additional levels. After I changed that, we now have surface fields and 3D fields with only 60 levels.

Additional changes since the last commit reviewed:

  • adding a function check_hypercube, to make sure all z-dimension coordinates are aligned before creating the dataset (a sketch follows below).
  • adding the following fields (you could double-check whether they are the right ones; the param IDs you posted correspond to IFS, and the COSMO definitions are different):
    mean_surface_latent_heat_flux ALHFL_S (from FG)
    mean_surface_net_long_wave_radiation_flux ATHB_S (from FG)
    mean_surface_net_short_wave_radiation_flux ASOBS_S (from FG)
    mean_surface_sensible_heat_flux ASHFL_S (from FG)
    total_cloud_cover CLCT (from ANA)
    vertical_velocity W (from ANA) -> destaggered

The following we would still need to compute (so they are not yet added): divergence, vorticity.
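A hypothetical sketch of what a check_hypercube-style validation could look like (the actual implementation is in the PR's commits and may differ):

```python
import xarray as xr


def check_hypercube(ds: xr.Dataset) -> None:
    """Raise if the 3D fields do not share aligned z coordinates.

    Illustrative sketch of the check described above: misaligned z
    coordinates (e.g. T_2M vs T) would otherwise create artificial
    extra levels when the dataset is assembled.
    """
    z_coords = {
        name: tuple(da.coords["z"].values)
        for name, da in ds.data_vars.items()
        if "z" in da.dims
    }
    if len(set(z_coords.values())) > 1:
        raise ValueError(
            f"z coordinates are not aligned across fields: {sorted(z_coords)}"
        )
```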

@cosunae (Contributor, Author) commented Apr 16, 2024

I completely agree that we should look into anemoi. I think we only need this repo for generating data for as long as we need it. As soon as we have a strategy for how to align with anemoi, we should discuss how to proceed.

@sadamov commented Apr 17, 2024

Thanks a lot @cosunae, this is great!

  • Clearly I was ignorant about HFL, P0FL and FI; it now makes sense.
  • Tbh I find the different param IDs/GRIB definitions very confusing, but the selection you retrieved looks on point.
  • I was informed about sponge layers by @twicki and that we should not use those, but maybe that is not related. Your explanation makes a lot of sense, and the latest zarr archive shows 60 levels for all vertical variables, without NaNs.
  • With regard to our verification pipeline, this leaves DURSUN and wind_gust as two final considerations for additional variables. Other than that, this PR can be merged from my side.

@sadamov commented Apr 17, 2024

Summarizing our discussion in the VC:

  • Variables to be removed: P0FL, PP
  • Variables to be added: TD_2M, FI
  • Variables to be derived and then added: snow_depth, wind_gust
  • Additional variables: DURSUN

@cosunae (Contributor, Author) commented Apr 17, 2024

Thanks @sadamov. I think, though, that I still managed to add confusion by mixing the P vs PP story and HFL vs FI (I think I was mixing HFL and P during the meeting). Let me try to clarify:

  • FI = height * g. Height in COSMO is HFL. So for us it is either HFL or FI. We decided to go for FI, so I already replaced HFL by FI.
  • P = PP + P0FL, so we could choose between P, or PP + P0FL, in the data extraction.
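In code, the two relations clarified above amount to the following (a sketch; the variable container and names follow the snippets earlier in this thread):

```python
G = 9.80665  # standard gravity [m s-2]

# Full pressure from base-state ("reference") pressure plus perturbation:
pdset["P"] = pdset["P0FL"] + pdset["PP"]

# Geopotential from geometric height of full levels:
pdset["FI"] = pdset["HFL"] * G
```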

@cosunae (Contributor, Author) commented Apr 17, 2024

Concerning P0FL, what I found is that the difference between two close time steps is not negligible.
This is the diff (for surface level):
[image: difference of P0FL between two close time steps, surface level]
where P0FL is:
[image: the P0FL field]

Both Jean-Marie and Andre think that this can be produced by the error of GRIB simple packing with 16 bits. In our archive we indeed have 16 bits for these variables.
So, assuming this is a packing error, P0FL would be time invariant.
I removed the time dimension in this PR.

@cosunae (Contributor, Author) commented Apr 17, 2024

I added wind gusts (VMAX_10M, UMAX_10M) and DURSUN from the first guess, and W_SNOW from the analysis. Does this make sense?

@sadamov commented Apr 17, 2024

It makes sense to me, thanks for the explanations and the fast updates 🚀. If the reviewer agrees, we can merge. The model/dataloader is also ready for the new single_zarr archive for a first full training: MeteoSwiss/neural-lam#12

@cosunae (Contributor, Author) commented Apr 17, 2024

Concerning the 2m dew point temperature, TD_2M: it can be computed from QV_2M, T_2M and PS:
https://github.com/COSMO-ORG/fieldextra/blob/develop/src/fxtr_operator_generic.f90#L9416

But we do not have QV_2M in the analysis, so I added TD_2M directly. OK with that as well?
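For context, a derivation along those lines could look like this (a hedged sketch using the Magnus formula; the constants, names, and clipping against T_2M are assumptions, not the fieldextra implementation linked above):

```python
import numpy as np


def dew_point_2m(qv_2m: np.ndarray, t_2m: np.ndarray, ps: np.ndarray) -> np.ndarray:
    """2m dew point [K] from specific humidity [kg/kg], temperature [K],
    and surface pressure [Pa]. Illustrative sketch only."""
    eps = 0.622  # R_dry / R_vapour
    # Water-vapour partial pressure [hPa]
    e = qv_2m * (ps / 100.0) / (eps + (1.0 - eps) * qv_2m)
    # Invert the Magnus saturation-pressure formula (over water)
    ln_ratio = np.log(e / 6.112)
    td = 243.5 * ln_ratio / (17.67 - ln_ratio) + 273.15
    # Dew point cannot exceed the air temperature
    return np.minimum(td, t_2m)
```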




tar.extractall(path=os.path.join(tarextract_dir, "ANA"))

fg_file = re.sub(r"KENDA\/ANA(\d{2})", r"KENDA/FG\1", ana_file)
shutil.copyfile(fg_file, tmp_tar_file)
@twicki:

I think it is OK like this; because it is only temporary, it works. Just wanted to double-check.

"PMSL",
"FI",
"HSURF",
"PP",
@twicki:

why do we still write out pp and p0fl?

@cosunae (Contributor, Author) replied Apr 28, 2024:

That was a confusion we had the last time we discussed it.
After clarification: we need either PP and P0FL, or P; and either HFL or FI.
So the decision was to go for PP & P0FL, and FI.

@sadamov commented Apr 24, 2024

From my side this PR can be merged.

  • Retrieved variables look complete
  • Open points from the last meeting and here in the comments were resolved
  • The script ran through and created a 6.2 TB archive (~2 years still missing)
  • The archive is successfully being used for ML training of https://github.com/MeteoSwiss/neural-lam

@twicki left a comment

Looks to be in a mergeable state now! Thanks a lot

@cosunae merged commit dcbbadf into main on Apr 30, 2024
1 check failed
@cosunae deleted the first_review branch April 30, 2024 14:44