Proposed Recipes for Global Drifter Program #101
Comments
Thanks for posting this, Philippe! Pinging @selipot, who I believe has NSF EarthCube funding to work on this topic!
Not currently, but this is in the works (see pangeo-forge/pangeo-forge-recipes#140)
On the target Zarr dataset? No, Zarr does not support this. Given the ragged nature of the data you described, Xarray / Zarr may not be a good fit at all for the target format. Have you considered either Awkward Array or pandas?
Thank you for pinging me. Yes, @philippemiron's work on this is funded through my CloudDrift EarthCube project :)
Thanks for the suggestion; I will take a look at Awkward Array. I have not considered pandas. The data set is not that large (~17 GB), but I believe xarray would be easier to work with. I realize that ragged arrays are not really a good fit for Xarray, but I have been testing the groupby() operation from flox, setting non-uniform Dask chunks per trajectory, and hope to make operations per trajectory efficient.
For working with Lagrangian data (which we do a lot of in our lab), I have always preferred Pandas, due to the ability to group on any of the different attributes (float ID, lat/lon, time, etc.). 17 GB is not so bad if you use dask.dataframe + Parquet. I would recommend the following approach:
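For illustration, a minimal version of that kind of dask.dataframe + Parquet workflow might look like the sketch below (file paths and the "ID" column name are assumptions, not part of the original recommendation; ve/vn are the velocity variables named elsewhere in this thread):

```python
# Sketch only: convert per-drifter netCDF files to Parquet, then analyze with dask.dataframe.
import glob

import dask.dataframe as dd
import xarray as xr

# 1. One-time conversion: each per-drifter netCDF file becomes one Parquet file.
for path in glob.glob("gdp_netcdf/*.nc"):
    df = xr.open_dataset(path).to_dataframe().reset_index()
    df.to_parquet(path.replace(".nc", ".parquet"))

# 2. Lazily open the whole collection and group by drifter ID.
ddf = dd.read_parquet("gdp_netcdf/*.parquet")
mean_velocity = ddf.groupby("ID")[["ve", "vn"]].mean().compute()
```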
Thanks for the cc!
It's looking more and more like the ideal on-disk format is Parquet. pyarrow support for converting rich data types between Arrow and Parquet is getting better and better, and as of pyarrow 6 and Awkward 2, we carry over enough metadata to do lossless Awkward ↔ Arrow ↔ Parquet round-trips. Also, we've been thinking/hoping to add a Zarr v3 extension (zarr-developers/zarr-specs#62), though this would require the Awkward library to interpret a Zarr dataset as ragged. An advantage of Parquet is that the raggedness concept is built into the file format, so any other library reading it would interpret it correctly.
We're in the midst of developing a Dask collection for Awkward Arrays (https://github.com/ContinuumIO/dask-awkward/) and would be eager to help with your use case as a way to get feedback for our development. I'm on https://gitter.im/pangeo-data/Lobby and https://discourse.pangeo.io/.
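For a concrete toy illustration of that round-trip (made-up data, not the drifter dataset; uses the same awkward._v2 API that appears later in this thread):

```python
# Toy Awkward -> Arrow -> Parquet -> Awkward round-trip; the ragged structure survives.
import awkward._v2 as ak

trajectories = ak.Array([
    [{"longitude": 119.0, "latitude": 21.2}, {"longitude": 115.0, "latitude": 20.9}],
    [{"longitude": -67.3, "latitude": 41.9}],
])

table = ak.to_arrow_table(trajectories)               # Awkward -> Arrow
ak.to_parquet(trajectories, "trajectories.parquet")   # Awkward -> Parquet
roundtrip = ak.from_parquet("trajectories.parquet")   # Parquet -> Awkward

assert roundtrip.to_list() == trajectories.to_list()
```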
Hi @rabernat, we are still testing out file formats and are actually putting together a notebook for the EarthCube annual meeting benchmarking typical workflow tasks with xarray, pandas, and Awkward Array. We are now wondering if it would be possible to host the dataset on the Pangeo Cloud. Currently, I have a set of preprocessing tools that:
With those tools, I can create the ragged array as follows:
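Roughly, the construction looks like the sketch below (illustrative only, not the actual preprocessing code; file paths and the metadata variable names are placeholders, and the per-observation variables are assumed to be 1-d along 'obs' in each source file):

```python
# Sketch: combine one-trajectory-per-file netCDFs into a single ragged-array Dataset
# with a 'traj' dimension for per-trajectory variables and an 'obs' dimension for
# the concatenated observations.
import glob

import xarray as xr

files = sorted(glob.glob("gdp_netcdf/*.nc"))          # one trajectory per file
datasets = [xr.open_dataset(f) for f in files]

# per-observation ("vector") variables, laid end to end along 'obs'
obs = xr.concat([ds[["lon", "lat", "ve", "vn", "time", "err"]] for ds in datasets], dim="obs")

# per-trajectory ("scalar") variables, one value per file, along 'traj'
# (variable names here are placeholders)
traj = xr.concat([ds[["deploy_date", "drogue_lost_date"]] for ds in datasets], dim="traj")

# rowsize records how many observations belong to each trajectory
rowsize = xr.DataArray([ds.sizes["obs"] for ds in datasets], dims="traj", name="rowsize")

ragged = xr.merge([obs, traj, rowsize])
```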
My guess/hope is that this could be converted to a pangeo-forge recipe, but to be honest I'm not totally sure how to proceed (or whether it's possible) with the current model.
@philippemiron - we would love to help support this use case. If the data can be stored as netCDF, then we can make it work here (with Zarr). However, I'm confused about what this ragged array looks like in xarray. My understanding is that xarray does not support ragged arrays. Can you share the xarray repr of the dataset?
See here. I'm also working with @jpivarski on a Parquet representation that would be handled by Awkward Array.
Could you please just do |
I am very interested in how this gets solved. I just wanted to raise the question of whether this is the best format to save the data in.
Hi @dhruvbalwada, the variables are indeed each stored in a 1-d ragged array along the 'obs' dimension, whose total length is the sum of the per-trajectory rowsize values.
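To make the layout concrete, one trajectory can be recovered from the flat 'obs' dimension with cumulative rowsize offsets (a sketch; the file name matches the aggregated file mentioned later in the thread):

```python
# Sketch: slice trajectory i out of the flat ragged array using rowsize offsets.
import numpy as np
import xarray as xr

ds = xr.open_dataset("gdp_v2.00.nc")                      # aggregated ragged-array file
offsets = np.insert(np.cumsum(ds["rowsize"].values), 0, 0)

def trajectory(i):
    """Observations belonging to trajectory i, as a slice along 'obs'."""
    return ds.isel(obs=slice(int(offsets[i]), int(offsets[i + 1])))
```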
Oh wow, that sounds amazing! Can't wait to play with this. I am sure the pros will have better comments, but the way the xarray dataset is formatted, it seems like it should be little trouble getting it into Pangeo Forge.
To perform operations per trajectory, you can chunk it like this, with one non-uniform Dask chunk per trajectory, and then use the groupby() operation from flox to make operations per trajectory efficient.
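For example (using the path and rowsize variable that also appear in the snippets further down the thread):

```python
# One non-uniform Dask chunk per trajectory along 'obs', sized by rowsize.
import xarray as xr

path_gdp = "gdp_v2.00.nc"                                 # aggregated ragged-array file
rowsize = xr.open_dataset(path_gdp)["rowsize"].values     # per-trajectory lengths

chunk_settings = {"obs": tuple(rowsize.tolist())}
ds = xr.open_dataset(path_gdp, chunks=chunk_settings)
```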
After examining the code and the output dataset, I believe it should be possible to produce this dataset using a standard Pangeo Forge XarrayZarrRecipe. You are basically just concatenating all of the netCDF files along the traj dimension. We need to play around with this a bit, but I don't see any fundamental blockers here.
I'm not 100% certain what the issue is. I managed to create the XarrayZarrRecipe:
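For reference, a recipe along these lines looks roughly like the sketch below (the URL template, drifter ID list, and parameters are placeholders, not the actual recipe; pangeo-forge-recipes 0.x API):

```python
# Sketch of an XarrayZarrRecipe for one-file-per-trajectory inputs.
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

drifter_ids = ["101143", "101509"]            # placeholder: the real list has ~20,000 entries


def make_url(traj):
    # FilePattern calls this with a keyword argument named after the ConcatDim ('traj')
    return f"https://example.invalid/gdp/drifter_{traj}.nc"


pattern = FilePattern(make_url, ConcatDim("traj", keys=drifter_ids))

recipe = XarrayZarrRecipe(
    pattern,
    inputs_per_chunk=20,                      # group several small files into each Zarr chunk
)
```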
But then when trying to retrieve one chunk, I get the following error:
I will give it a second try over the weekend and report back on Monday.
I will try to spend some time playing around with it.
Parquet does that in a smarter way: the construction of a ragged array from
>>> import awkward._v2 as ak
>>> dataset = ak.from_parquet("global-drifter/global-drifter-0000.parquet", "obs.l*itude")
>>> dataset.show(limit_rows=6)
[{obs: [{longitude: 119, latitude: 21.2}, ..., {longitude: 115, ...}]},
{obs: [{longitude: -67.3, latitude: 41.9}, ..., {longitude: -52.6, ...}]},
{obs: [{longitude: -50, latitude: -32.5}, ..., {longitude: -51.8, ...}]},
...,
{obs: [{longitude: 177, latitude: 56.1}, ..., {longitude: 176, ...}]},
{obs: [{longitude: -109, latitude: -55.3}, ..., {longitude: -96.4, ...}]}]
>>> print(dataset.type)
907 * {obs: var * ?{longitude: float32, latitude: float32}}
>>> ak.num(dataset)
<Array [{obs: 1488}, {obs: 4136}, ..., {obs: 44962}] type='907 * {obs: int64}'>
But since this structure is relatively simple (ragged arrays with only one level of depth), distributing them with the existing NetCDF infrastructure and building that structure on demand is pretty reasonable:
>>> import h5py
>>> import numpy as np
>>> file = h5py.File("gdp_v2.00.nc") # a different data sample
>>> content = ak.zip({"longitude": np.asarray(file["longitude"]), "latitude": np.asarray(file["latitude"])})
>>> dataset = ak.unflatten(content, np.asarray(file["rowsize"]))
>>> dataset.show(limit_rows=6)
[[{longitude: -17.7, latitude: 14.7}, {...}, ..., {longitude: -16.9, ...}],
[{longitude: -17.3, latitude: 14.5}, {...}, ..., {longitude: -17.3, ...}],
[{longitude: 136, latitude: 10}, {...}, ..., {longitude: 125, latitude: 13.7}],
...,
[{longitude: -25.7, latitude: 65}, {...}, ..., {...}, {longitude: -30.2, ...}],
[{longitude: -24.9, latitude: 64.8}, {...}, ..., {longitude: -30.4, ...}]]
>>> print(dataset.type)
17324 * var * {longitude: float32, latitude: float32}
>>> ak.num(dataset)
<Array [417, 2005, 1970, 1307, ..., 4572, 17097, 3127] type='17324 * int64'>
(The last numbers are the lengths of the trajectories.)
Although Parquet does this stuff in general, this particular dataset isn't complex enough to really need it. Actually, I'd like to get involved in this for Argopy, for exactly the same reasons that it applies to @philippemiron's project. I did some early tests converting Argo's NetCDF files into Parquet, and the Parquet versions were 20× smaller, as well as faster to load. But even there, Parquet isn't strictly needed: the same on-demand approach would work.
For this particular project, the main thing that I think Awkward + Dask provides over xarray + Dask is that when you do
chunk_settings = {'obs': tuple(rowsize.tolist())}
ds = xr.open_dataset(path_gdp, chunks=chunk_settings)
the chunking size for parallelization becomes tied to the physically meaningful variable, "trajectory length." Datasets with a large number of small trajectories would have a per-task overhead that you can't get rid of. With Awkward + Dask, a single task could be run on any number of trajectories.
Meanwhile, however, I need to learn the "+ Dask" part of this to actually develop some examples. @philippemiron and I have also considered some examples with Numba, and that's a third part to consider. @douglasdavis tried an example of Awkward + Dask + Numba this morning and it worked, though it evaluated the first partition locally to determine the output type, and I think we can streamline that.
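A toy sketch of that Awkward + Dask direction (dask-awkward API, Awkward >= 2.0; the data and partition count are made up):

```python
# Toy example: partition a ragged Awkward Array with dask-awkward so that one task
# can cover many trajectories, regardless of their individual lengths.
import awkward as ak          # Awkward >= 2.0
import dask_awkward as dak
import numpy as np

lengths = np.array([3, 1, 4, 2])                                  # trajectory lengths
flat = ak.zip({"longitude": np.arange(10, dtype=np.float32),
               "latitude": np.arange(10, dtype=np.float32)})
trajectories = ak.unflatten(flat, lengths)

lazy = dak.from_awkward(trajectories, npartitions=2)              # 2 tasks for 4 trajectories
per_trajectory_count = dak.num(lazy, axis=1).compute()
```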
I've modified the preprocess function to take a
Cheers, |
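For context, the preprocess hook in an XarrayZarrRecipe is a callable applied to each input dataset; a sketch of what the metadata-to-variable step might look like (the attribute names promoted to 'traj' variables are placeholders, not the real GDP names):

```python
# Sketch of a per-input preprocess step (passed to XarrayZarrRecipe as process_input=preprocess).
import xarray as xr


def preprocess(ds: xr.Dataset, fname: str) -> xr.Dataset:
    # promote selected per-file metadata attributes to length-1 variables along 'traj'
    for attr in ("DeployingShip", "DrogueType"):
        if attr in ds.attrs:
            ds[attr] = ("traj", [ds.attrs[attr]])
    return ds
```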
I just got an update from @selipot about this
Sounds like we are ready to move ahead with creating a recipe. The dataset is just a single netCDF file at https://www.nodc.noaa.gov/archive/arc0199/0248584/1.1/data/0-data/gdp_v2.00.nc
@selipot - I browsed around the website, but I couldn't find any information about the GDP data license. Without a clear license, we may be in violation of copyright law if we host the data. Can you say any more about the license under which GDP is distributed?
Working on finding this out with the GDP. I suspect it would be whatever the license is for NOAA data, which is unknown to me. Again, reaching out to NOAA.
From NOAA and NCEI: All environmental data managed/archived at NCEI is publicly available without restriction. NCEI doesn't currently employ a data usage license.
Source Dataset
I'm trying to port code that converts the hourly GDP dataset (~20,000 individual netCDF files) into a ragged array over to a pangeo-forge recipe.
Transformation / Alignment / Merging
The files contain a few variables, plus metadata that should in fact be stored as variables. I have a draft recipe that cleans up the files, parses the dates, and converts metadata to variables. The files have two dimensions, ['traj', 'obs'], where each file (one ['traj']) contains a different number of observations ['obs'] (ranging from ~1,000 to 20,000+). More precisely, the scalar ['traj'] variables are: type of buoy, dimensions, launch date, drogue loss date, etc., and the vector ['obs'] variables are: lon, lat, ve, vn, time, err, etc.
To create the ragged array, scalar variables should be concatenated together, and the same goes for the various vector variables. My two issues are:
Cheers,
Output Dataset
Single netCDF archive, or a Zarr folder.