Example pipeline for GFS Archive #50
@raybellwaves thanks for opening this issue, and apologies for the delay in responding. Just tagging a few others who may be more familiar with grib-specific considerations. Do any of @rabernat, @TomAugspurger, or @martindurant know if we can handle `.grib2` inputs at this time?
I don't see why not - xarray can load them, so long as they are cached on a local filesystem.
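As a minimal sketch of the "cached on a local filesystem" point: cfgrib needs a real local file, so one option is to chain fsspec's `simplecache` in front of the remote path. The function name and the cache directory are hypothetical; this assumes `fsspec`, `xarray`, and the `cfgrib` engine are installed.

```python
import fsspec
import xarray as xr


def open_grib2(url: str) -> xr.Dataset:
    """Open a remote .grib2 file with xarray via a local cache.

    `url` would be something like an s3:// path to a single grib2
    file (hypothetical; depends on the bucket layout). The file is
    downloaded once into the cache directory, then cfgrib reads it
    from local disk.
    """
    local = fsspec.open_local(
        f"simplecache::{url}",
        s3={"anon": True},
        simplecache={"cache_storage": "/tmp/grib_cache"},
    )
    return xr.open_dataset(local, engine="cfgrib")
```

Nothing here is specific to GFS; the same pattern should work for any grib2 object store, subject to cfgrib handling the particular message types in the file.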
@raybellwaves, looks like we're good to go. 😄 Are you interested in learning how to develop recipes yourself? If so, I'd be delighted, and will be happy to guide you through the process. The first step would be for you to make a PR to this repo containing a new draft recipe under …
Is writing a separate zarr store for every init time a good idea? I've been struggling a lot with how to build time series from the hrrrzarr data, which was written that way. Opening thousands of hours in a loop can take hours, and isn't nearly as simple or efficient to parallelize as if time were just a dimension in the dataset from the get-go. That said, that dataset isn't optimized for xarray: because of the way it's written as a zarr hierarchy, the .zmetadata isn't visible to xarray, so I can't use the consolidated option. But I noticed that both #17 and #18 are making init time a dimension rather than creating separate stores (IIUC). How would you decide between the two approaches?
I don't think so. I think we want init_time and lead_time both as dimensions. For this to work, we need to resolve pangeo-forge/pangeo-forge-recipes#140.
@rabernat Is there any concern with xarray's handling of the time dimension for continuously updating datasets? I assume the GFS (like the HRRR and GEFS) produces new model runs frequently. Some of my colleagues have been avoiding creating a time dimension in these situations because of cases where it's been painful, but it's not clear to me whether any of those apply to situations like this. Does the .zmetadata get updated efficiently when you just append data? Also, do we actually need 140 for this one? Shouldn't you be able to just do it in stages: look at a single init_time and concat over lead_time, then concat the result over init_time? Or do recipes have to be one stage?
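The staged concatenation suggested here can be sketched directly in xarray with synthetic data. Everything below is hypothetical (the variable name `t2m`, the grid, the `fake_forecast` helper standing in for loading one grib2 file); the point is only the two-stage concat order:

```python
import numpy as np
import xarray as xr

init_times = np.array(
    ["2021-06-01T00", "2021-06-01T06"], dtype="datetime64[ns]"
)
lead_times = np.arange(3)


def fake_forecast(init, lead):
    # Stand-in for a dataset loaded from a single grib2 file:
    # one init time, one lead time, a tiny 2x2 grid.
    return xr.Dataset(
        {"t2m": (("lat", "lon"), np.random.rand(2, 2))},
        coords={
            "init_time": init,
            "lead_time": lead,
            "lat": [0, 1],
            "lon": [0, 1],
        },
    )


# Stage 1: for each init_time, concat over lead_time.
per_init = [
    xr.concat([fake_forecast(i, l) for l in lead_times], dim="lead_time")
    for i in init_times
]
# Stage 2: concat the per-init results over init_time.
ds = xr.concat(per_init, dim="init_time")
```

The result has both init_time and lead_time as real dimensions, which is the layout argued for in the comment above; whether a recipe can express this in stages is the open question.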
Good question. I imagine there are open questions regarding one giant zarr store versus smaller zarr stores which can be concatenated; it may be use-case driven. There are probably lessons learned from what people do with tabular (parquet) data, which can also be stored as separate files or appended (row-wise, or as a new row group/partition, e.g. partitioned on reftime). A step beyond what people do with tabular data would be streaming data. I imagine once you get it out of grib and into a zarr store of some kind, you can iterate through these questions more quickly.
Quick note that the "reference" views I have been working with could provide both, without having to copy or reorganise the data. They can be used to produce a single logical zarr over many zarr datasets.
@martindurant Where would I get started if I wanted to try that out? |
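For context on what the "reference" approach looks like on the consumer side: fsspec ships a `reference` filesystem that maps zarr keys onto byte ranges inside the original files, described by a JSON file of references. The sketch below only defines the open step; the references file name is hypothetical, and producing it (e.g. by scanning the grib2 files) is a separate step not shown here.

```python
import fsspec
import xarray as xr


def open_reference_dataset(refs_json: str) -> xr.Dataset:
    """Open a reference-based 'logical zarr' spanning many files.

    `refs_json` is a path/URL to a JSON references file (hypothetical
    name); the actual bytes are read on demand from the original
    objects on s3, so nothing is copied or reorganised.
    """
    fs = fsspec.filesystem(
        "reference",
        fo=refs_json,
        remote_protocol="s3",
        remote_options={"anon": True},
    )
    return xr.open_dataset(
        fs.get_mapper(""), engine="zarr", consolidated=False
    )
```

The appeal for this thread is that a single logical dataset with init_time and lead_time dimensions could be presented without rewriting the archive.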
Source Dataset
s3://noaa-gfs-bdp-pds/gfs.* (although it's ~20210226 onwards)
Access via pydap, or download the grib files, or …
Transformation / Alignment / Merging
Concat along `reftime` (init time) and `time`.
Output Dataset
zarr store.
I imagine one giant zarr store would be crazy, so it could be stored as one store per init time containing all forecast times. Ideally with init time as an expandable dim so the stores can be concatenated later.