
add a first version #1

Merged — 37 commits merged into main from first_review on Apr 30, 2024

Conversation

@cosunae (Contributor) commented Apr 2, 2024

No description provided.

@twicki left a comment

Some comments, most of them in this PR as individual commits. Cherry-pick what you deem worthwhile.

priority = "default"

[[tool.poetry.source]]
name = "pypi-mch-publish"
-url = "https://hub.meteoswiss.ch/nexus/repository/python-mch/"
+url = "https://service.meteoswiss.ch/nexus/repository/python-mch/"
@twicki:

How is that supposed to work?

service is somehow required for building it at CSCS; hub is what the labvm wants?

How is building it on Jenkins, where we need nexus.mch, planned to work?

@cosunae (Contributor, Author) replied:

This is an attempt to make it work:
https://github.com/MeteoSwiss-APN/cosmo-archive-retrieve/pull/1/files#diff-e6ffa5dc854b843b3ee3c3c28f8eae2f436c2df2b1ca299cca1fa5982e390cf8R52

I hope this is temporary; in some discussions it was suggested that the URLs might be unified in the near future.

@twicki replied:

I see, perfect


def main():

    prefix = os.environ["CONDA_PREFIX"]
@twicki:

this already fails if no CONDA_PREFIX is defined. Should that be caught?

@cosunae replied:

We can, but what would you do? Raise another custom exception?
The script was meant to install into the share path of the conda prefix, so letting it fail seems right if that cannot be done.

@twicki replied:

In the next line you raise a custom exception saying the prefix does not exist when it exists but does not point to a directory. Why not be more descriptive than a KeyError here as well?
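For illustration, the more descriptive handling being suggested might look like this (a hypothetical sketch; the error messages and the choice of RuntimeError are assumptions, not the code merged in this PR):

```python
import os
from pathlib import Path


def main():
    # Hypothetical sketch: fail with a descriptive message instead of a
    # bare KeyError when CONDA_PREFIX is missing or not a directory.
    prefix = os.environ.get("CONDA_PREFIX")
    if prefix is None:
        raise RuntimeError(
            "CONDA_PREFIX is not set; activate a conda environment first."
        )
    if not Path(prefix).is_dir():
        raise RuntimeError(f"CONDA_PREFIX is not a directory: {prefix}")
```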

poetry.lock Outdated
@twicki:

I can't say what is wrong, but when I try to build the environment, I get stuck on:

$ poetry install
Warning: Found deprecated priority 'default' for source 'pypi-mch' in pyproject.toml. You can achieve the same effect by changing the priority to 'primary' and putting the source first.
Installing dependencies from lock file

Package operations: 40 installs, 3 updates, 0 removals

  - Installing pydantic-core (2.16.3): Pending...
  - Installing snowballstemmer (2.2.0): Pending...
  - Installing snuggs (1.4.7): Pending...
  - Installing soupsieve (2.5): Pending...
  - Installing sphinxcontrib-applehelp (1.0.8): Pending...
  - Installing sphinxcontrib-devhelp (1.0.6): Pending...
  - Installing sphinxcontrib-htmlhelp (2.0.5): Pending...
  - Installing sphinxcontrib-jsmath (1.0.1): Pending...

ds = idpi.grib_decoder.load(
    idpi.data_source.DataSource(datafiles=[full_path]),
    {
        "param": [
@twicki:

we somehow don't get PP here. How is that handled?

@cosunae replied:

Oh yes, very good point. I think the list came from the original pressure-levels extraction, which does not require P. I added it.

ds = xr.open_zarr(data_path)
checkpoint = ds.time.values[-1]

return datetime.utcfromtimestamp(checkpoint.tolist() / 1e9)
@twicki:

this is deprecated
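For reference, datetime.utcfromtimestamp is deprecated since Python 3.12. A sketch of the timezone-aware replacement (the function name and signature here are illustrative, not necessarily the code as merged):

```python
from datetime import datetime, timezone

import xarray as xr


def get_checkpoint(data_path: str) -> datetime:
    """Return the last timestamp stored in the Zarr archive."""
    ds = xr.open_zarr(data_path)
    checkpoint = ds.time.values[-1]  # numpy datetime64[ns] scalar
    # tolist() yields nanoseconds since the epoch; utcfromtimestamp is
    # deprecated, so use fromtimestamp with an explicit UTC timezone.
    return datetime.fromtimestamp(checkpoint.tolist() / 1e9, tz=timezone.utc)
```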



def load_data(config: dict) -> None:
    """Load weather data from archive and store it in a Zarr archive."""
@twicki:

missing args

@cosunae replied:

done
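The kind of docstring addition being asked for might look like this (a sketch; the wording and the exact config keys documented in the merged version are assumptions):

```python
def load_data(config: dict) -> None:
    """Load weather data from archive and store it in a Zarr archive.

    Args:
        config: Run configuration. Expected keys include, for example,
            "n_pool" (number of worker processes); the full schema is
            defined by this repository's configuration, not shown here.
    """
```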

def load_data(config: dict) -> None:
    """Load weather data from archive and store it in a Zarr archive."""

    n_pool = config["n_pool"]
@twicki:

this seems out of place here

"""Load weather data from archive and store it in a Zarr archive."""

n_pool = config["n_pool"]
tar_file_paths = []
@twicki:

Should we describe in a comment what we're expecting here? With the list containing the files and having the same name as the tar?

@cosunae replied:

Good point. Since it was not very clear, it seems, I thought we could better move it into its own function. Let me know if that improves things...
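A hypothetical sketch of that extraction into its own function, under the expectation described above (every name here is illustrative; the actual helper merged in the PR may differ):

```python
import os


def collect_tar_file_paths(archive_dir: str) -> list[str]:
    """Collect the tar archives to process.

    Illustrative sketch: each tar in the archive directory is expected
    to be accompanied by a listing file with the same base name that
    enumerates the tar's members.
    """
    tar_file_paths = []
    for name in sorted(os.listdir(archive_dir)):
        if name.endswith(".tar"):
            tar_file_paths.append(os.path.join(archive_dir, name))
    return tar_file_paths
```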

pdset["RELHUM"] = relhum(
pdset["QV"], pdset["T"], pdset["P"], clipping=True, phase="water"
)
pdset["FI"] = pdset["HFL"] * 9.80665
@twicki:

Why are we computing FI? It seems not to be in the dataset we're using?

@cosunae replied:

Yes, originally it was extracted because it was requested by FCN. But indeed FI is just height times g, and we already have HFL, so it does not add anything. I will remove it.

@twicki:

this should not be here

* isdir instead

* remove deprecated

* update suffix

* use path instead of basename

* renaming

* remove unused arg

* move npool to usage

* returnvalue
@sadamov commented Apr 10, 2024

Thanks @cosunae! Some questions and a request for additional variables. I looked at the zarr archive on Tsa: /scratch/cosuna/neural-lam/zarr/cosmo_ml_data.zarr

  • Why did we switch from PP to P?
  • Why did we switch from P0FL to HFL?
  • Why do these variables have a time dimension: FI, HFL, HSURF?
  • T_2M, PMSL, U_10M, V_10M, FI, HFL have z=62 vertical levels, but only one level is not NaN. Better to remove all other levels or drop the z-dimension completely.
  • U, V, T, RELHUM have missing data at 2016-01-03T05:00:00
  • Vertical levels 0 and 61 are sponge-layers and should be disregarded.

With regard to the NeuralLAM paper and some discussions with my professor, I think we might want to add additional variables. Using this document, we are trying to align the variables of ERA5 with our regional datasets. Could we also retrieve the following variables (ecCodes param IDs)?

Param name                                    Param.-db ID
mean_surface_latent_heat_flux                 147
mean_surface_net_long_wave_radiation_flux     176
mean_surface_net_short_wave_radiation_flux    169
mean_surface_sensible_heat_flux               146
total_cloud_cover                             164
vertical_velocity                             135
divergence                                    155
vorticity                                     138

This is my initial suggestion for additional retrievals; I think we should discuss it here or in our ML sync.

@sadamov commented Apr 10, 2024

I have one additional comment wrt collaboration within the community. It might be worthwhile to look at the following GRIB2Zarr libraries and maybe try to consolidate:

@sadamov commented Apr 10, 2024

In case you are looking for multi-threaded, high-performance Zarr creation, I have written a simple script using xr.open_mfdataset and SLURMCluster with Dask:
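The referenced script itself is not included in this page; a minimal sketch of that approach might look as follows (cluster sizing, file pattern, and output path are all assumptions):

```python
import xarray as xr
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Spin up Dask workers as SLURM jobs (sizing values are placeholders).
cluster = SLURMCluster(cores=32, memory="64GB", walltime="02:00:00")
cluster.scale(jobs=4)
client = Client(cluster)

# Lazily open many files as one dataset and write it to Zarr in parallel
# across the workers.
ds = xr.open_mfdataset("extracted/*.nc", combine="by_coords", parallel=True)
ds.to_zarr("cosmo_ml_data.zarr", mode="w")
```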

@cosunae (Contributor, Author) commented Apr 16, 2024

Thanks @sadamov for the comments. I have pushed a few commits to address them:

  • Why did we switch from PP to P?
    No particular reason; I started from an older version of the namelist we used, and at the time we were using P. I switched it back to PP.
  • Why did we switch from P0FL to HFL?
    I am not sure I understand this. P0FL is the reference pressure of the atmosphere, so that P = P0FL + PP. HFL is just height, so they are not related.
  • Why do these variables have a time dimension: FI, HFL, HSURF?
    No reason; I removed them now.
  • T_2M, PMSL, U_10M, V_10M, FI, HFL have z=62 vertical levels, but only one level is not NaN. Better to remove all other levels or drop the z-dimension completely.
    Yes, this is because we were creating a dataset with some arrays that still had a dimension z=1. Now we are squeezing it, so that they have one level only (i.e. no z dim), with the exception of HFL, which is truly a 3D field with 60 levels. FI is also removed, since we have height (HFL).
  • U, V, T, RELHUM have missing data at 2016-01-03T05:00:00.
    While testing, I often kill generation jobs; if that happens in the middle of creating the zarr archive, I guess this can happen.
  • Vertical levels 0 and 61 are sponge layers and should be disregarded.
    I am also not sure I understand this comment. All of the fields should contain only 60 levels, not 62. We had fields in zarr with 62 levels (also in the version we are currently using for neural-lam) because creating the xarray dataset from arrays whose z coordinates are not aligned (like T_2M and T) produces artificial additional levels. After I changed that, we now have surface fields and 3D fields with only 60 levels.

Additional changes since the last commit reviewed:

  • adding a function check_hypercube, to make sure all z-dimension coordinates are aligned before creating the dataset (a sketch follows below).
  • adding the following fields (you could double-check whether they are the right ones; the param IDs you posted correspond to IFS, and the COSMO definitions are different):
    mean_surface_latent_heat_flux ALHFL_S (from FG)
    mean_surface_net_long_wave_radiation_flux ATHB_S (from FG)
    mean_surface_net_short_wave_radiation_flux ASOBS_S (from FG)
    mean_surface_sensible_heat_flux ASHFL_S (from FG)
    total_cloud_cover CLCT (from ANA)
    vertical_velocity W (from ANA) -> destaggered

The following we would still need to compute (so they are not yet added): divergence, vorticity.
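A hypothetical sketch of what a check_hypercube-style validation could look like (the actual implementation is in the PR's commits and may differ):

```python
import xarray as xr


def check_hypercube(ds: xr.Dataset) -> None:
    """Raise if the 3D fields do not share aligned z coordinates.

    Illustrative sketch of the check described above: misaligned z
    coordinates (e.g. T_2M vs T) would otherwise create artificial
    extra levels when the dataset is assembled.
    """
    z_coords = {
        name: tuple(da.coords["z"].values)
        for name, da in ds.data_vars.items()
        if "z" in da.dims
    }
    if len(set(z_coords.values())) > 1:
        raise ValueError(
            f"z coordinates are not aligned across fields: {sorted(z_coords)}"
        )
```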

@cosunae (Contributor, Author) commented Apr 16, 2024

I completely agree that we should look into anemoi. I think we only need this repo for generating data for as long as we need it. As soon as we have a strategy for how to align with anemoi, we should discuss how to proceed.

@sadamov commented Apr 17, 2024

Thanks a lot @cosunae, this is great!

  • Clearly I was ignorant about HFL, P0FL and FI; it now makes sense.
  • Tbh I find the different param IDs/GRIB definitions very confusing, but the selection you retrieved looks on point.
  • I was informed about sponge layers by @twicki and that we should not use those, but maybe that is not related. Your explanation makes a lot of sense, and the latest zarr archive shows 60 levels for all vertical variables, without NaNs.
  • With regard to our verification pipeline, this leaves DURSUN and wind_gust as two final considerations for additional variables. Other than that, this PR can be merged from my side.

@sadamov commented Apr 17, 2024

Summarizing our discussion in the VC:

  • Variables to be removed: P0FL, PP
  • Variables to be added: TD_2M, FI
  • Variables to be derived and then added: snow_depth, wind_gust
  • Additional variables: DURSUN

@cosunae (Contributor, Author) commented Apr 17, 2024

Thanks @sadamov. I think, though, that I still managed to add confusion by mixing the P vs PP story and HFL vs FI (I think I was mixing HFL and P during the meeting). Let me try to clarify:

  • FI = height * g. Height in COSMO is HFL. So for us it is either HFL or FI. We decided to go for FI, so I already replaced HFL by FI.
  • P = PP + P0FL, so we could choose between P, or PP + P0FL, in the data extraction.
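In code, the two relations clarified above amount to the following (a sketch; the variable container and names follow the snippets earlier in this thread):

```python
G = 9.80665  # standard gravity [m s-2]

# Full pressure from base-state ("reference") pressure plus perturbation:
pdset["P"] = pdset["P0FL"] + pdset["PP"]

# Geopotential from geometric height of full levels:
pdset["FI"] = pdset["HFL"] * G
```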

@cosunae (Contributor, Author) commented Apr 17, 2024

Concerning P0FL, what I found is that the difference between two close time steps is not negligible.
This is the diff (for surface level):
[image: difference of P0FL between two close time steps, surface level]
where P0FL is:
[image: the P0FL field]

Both Jean-Marie and Andre think that this can be produced by the error of GRIB simple packing with 16 bits. In our archive we indeed have 16 bits for these variables.
So, assuming this is a packing error, P0FL would be time invariant.
I removed the time dimension in this PR.

@cosunae (Contributor, Author) commented Apr 17, 2024

I added wind gusts (VMAX_10M, UMAX_10M) and DURSUN from the first guess, and W_SNOW from the analysis. Does this make sense?

@sadamov commented Apr 17, 2024

It makes sense to me, thanks for the explanations and the fast updates 🚀. If the reviewer agrees, we can merge. The model/dataloader is also ready for the new single_zarr archive for a first full training: MeteoSwiss/neural-lam#12

@cosunae (Contributor, Author) commented Apr 17, 2024

Concerning the 2m dew point temperature, TD_2M: it can be computed from QV_2M, T_2M and PS:
https://github.com/COSMO-ORG/fieldextra/blob/develop/src/fxtr_operator_generic.f90#L9416

But we do not have QV_2M in the analysis, so I added TD_2M directly. OK with that as well?
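For context, a derivation along those lines could look like this (a hedged sketch using the Magnus formula; the constants, names, and clipping against T_2M are assumptions, not the fieldextra implementation linked above):

```python
import numpy as np


def dew_point_2m(qv_2m: np.ndarray, t_2m: np.ndarray, ps: np.ndarray) -> np.ndarray:
    """2m dew point [K] from specific humidity [kg/kg], temperature [K],
    and surface pressure [Pa]. Illustrative sketch only."""
    eps = 0.622  # R_dry / R_vapour
    # Water-vapour partial pressure [hPa]
    e = qv_2m * (ps / 100.0) / (eps + (1.0 - eps) * qv_2m)
    # Invert the Magnus saturation-pressure formula (over water)
    ln_ratio = np.log(e / 6.112)
    td = 243.5 * ln_ratio / (17.67 - ln_ratio) + 273.15
    # Dew point cannot exceed the air temperature
    return np.minimum(td, t_2m)
```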




tar.extractall(path=os.path.join(tarextract_dir, "ANA"))

fg_file = re.sub(r"KENDA\/ANA(\d{2})", r"KENDA/FG\1", ana_file)
shutil.copyfile(fg_file, tmp_tar_file)
@twicki:

I think it is OK like this; because it is only temporary, it works. Just wanted to double-check.

"PMSL",
"FI",
"HSURF",
"PP",
@twicki:

why do we still write out pp and p0fl?

@cosunae (Contributor, Author) replied Apr 28, 2024:

That was a confusion we had the last time we discussed it.
After clarification: we need either PP and P0FL, or P; and either HFL or FI.
So the decision was to go for PP & P0FL, and FI.

@sadamov commented Apr 24, 2024

From my side this PR can be merged.

  • Retrieved variables look complete
  • Open points from the last meeting and here in the comments were resolved
  • The script ran through and created a 6.2 TB archive (~2 years still missing)
  • The archive is successfully being used for ML training of https://github.com/MeteoSwiss/neural-lam

@twicki left a comment

Looks to be in a mergeable state now! Thanks a lot

@cosunae merged commit dcbbadf into main on Apr 30, 2024
1 check failed
@cosunae deleted the first_review branch April 30, 2024 14:44