
Support for more than one ConcatDim #140

Open
nbarlowATI opened this issue May 20, 2021 · 10 comments

@nbarlowATI

Hi all, I'm trying to convert a dataset from NetCDF to Zarr, and would really like to concatenate over more than one dimension (in my case "time" and "ensemble_id").
I see that XarrayZarrRecipe/FilePattern doesn't currently support this.
I'd be happy to work on implementing this if others think it's worthwhile, and if there isn't already someone working on it?

@martindurant
Contributor

I'd like to take this chance to once more gently remind everyone that ReferenceFileSystem really wants this logic: figuring out, for every chunk in the input, what chunk index it should have in a theoretical output Zarr dataset. This process doesn't need to live inside a recipe class, but I suppose that's a good place to start.

@rabernat
Contributor

Hi @nbarlowATI and welcome!

You are correct that this is not yet supported. You are welcome to try to implement it, and we would love to have your contribution. However, I am concerned that the current code related to this is not very clear or accessible to outside contributors. Consequently, you may become frustrated if you try to work on this. As Martin suggested, abstracting this logic into a standalone module would probably be wise; however that is also a difficult task.

I will try to give you some hints about how to proceed:

The key method to look at is region_and_conflicts_for_chunk:

def region_and_conflicts_for_chunk(self, chunk_key: ChunkKey):
# return a dict suitable to pass to xr.to_zarr(region=...)
# specifies where in the overall array to put this chunk's data
# also return the conflicts with other chunks

This method translates a chunk_key into a specific slice within a multidimensional dataset (e.g. {'time': slice(10, 20)}). chunk_key is a tuple of ints; it is a key internal data structure which encodes the chunk's place within the broader dataset. It is set here:

chunk_key = tuple([v[0] if hasattr(v, "__len__") else v for v in k])

(This is the part of the code that is hardest to understand.) Currently chunk_key is either a one- or two-item tuple, e.g. (2,) [only a ConcatDim] or (2, 3) [ConcatDim and MergeDim]. To implement multiple ConcatDims, we would probably need to generalize chunk_key to an arbitrary length.
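To make that generalization concrete, here is a minimal sketch of how an arbitrary-length chunk_key could map to a region dict suitable for xr.Dataset.to_zarr(region=...). The names region_for_chunk, concat_dims, and nitems_per_chunk are illustrative, not actual recipe attributes:

```python
def region_for_chunk(chunk_key, concat_dims, nitems_per_chunk):
    """Map an N-dimensional chunk_key to a zarr region dict.

    chunk_key: tuple of ints, one entry per ConcatDim
    concat_dims: dim names, in the same order as chunk_key
    nitems_per_chunk: items each chunk contributes along its dim
    """
    # One slice per concat dimension; chunk i covers items [i*n, (i+1)*n)
    return {
        dim: slice(idx * n, (idx + 1) * n)
        for dim, idx, n in zip(concat_dims, chunk_key, nitems_per_chunk)
    }

# e.g. chunk (2, 3) over ("time", "ensemble_id"), 10 and 1 items per chunk:
region = region_for_chunk((2, 3), ("time", "ensemble_id"), (10, 1))
# region == {"time": slice(20, 30), "ensemble_id": slice(3, 4)}
```

This assumes uniform chunk sizes along each dimension; the real code would also have to handle ragged chunks and conflict detection.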

Another key method is input_position:

def input_position(self, input_key):
# returns the index position of an input key wrt the concat_dim
concat_dim_axis = list(self.file_pattern.dims).index(self._concat_dim)
return input_key[concat_dim_axis]

This would also need to be generalized to handle multiple ConcatDims.
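One possible shape for that generalization, again as a hypothetical sketch (input_positions is an invented name, not the existing method), returns a position along each ConcatDim rather than a single index:

```python
def input_positions(input_key, dims, concat_dims):
    """Return the index position of an input key along each concat dim.

    input_key: tuple of ints, one entry per dim in `dims`
    dims: ordered dim names, as in FilePattern.dims (concat and merge dims)
    concat_dims: the subset of `dims` that are ConcatDims
    """
    # Find which axis of input_key corresponds to each concat dim
    axes = [list(dims).index(d) for d in concat_dims]
    return tuple(input_key[a] for a in axes)

# input_key (5, 0, 2) over dims ("time", "variable", "ensemble_id"),
# where "time" and "ensemble_id" are ConcatDims:
positions = input_positions(
    (5, 0, 2), ("time", "variable", "ensemble_id"), ("time", "ensemble_id")
)
# positions == (5, 2)
```

The single-ConcatDim behavior falls out as the length-one case of the returned tuple.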

I hope these tips are helpful. I don't want to suggest it will be easy, but we will be happy to support you, answer questions, review PRs etc.

@rabernat
Contributor

Having just written that up, I realize that I may be able to make some progress on this fairly quickly and also refactor the code along the lines Martin suggested. So if you can wait a few weeks, I may be able to implement the feature myself.

@nbarlowATI - I'm curious what your application is for this...

@martindurant
Contributor

Hurray @rabernat ! I feel like it might be useful for us two (or more) to have a brainstorm on how to bring this about.

@cisaacstern
Member

Would love to listen in on this brainstorm.

@rabernat
Contributor

Since we all seem to be online, could we drop in https://whereby.com/pangeo right now?

I can work for another 2 hours today, then I am checking out for a long weekend.

@martindurant
Contributor

Sorry, Dask is taking all of my time right now, I would prefer after my ESIP talk on Monday.

@rabernat
Contributor

Also noting that this is basically the same as #98.

@aaron-kaplan

We're working on providing the SubX data in Zarr format, and this issue kept us from using pangeo-forge to do the conversion.

@martindurant
Contributor

It should be possible to use the reference-maker idea to merge subsets of the total data along one dimension, and then merge these intermediate virtual datasets along a second dimension, and so on. The original data would only need scanning once. Having done all this, you can rechunk as required. Of course, it would be worth keeping the reference-maker output too, which would view the original data with whatever chunks it had.
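The two-stage merge idea (concatenate along one dimension, then combine the intermediates along a second) can be illustrated with xarray's nested combine, standing in here for the virtual-dataset merging; the make_piece helper and variable name "tas" are purely illustrative:

```python
import numpy as np
import xarray as xr

def make_piece(t0, e0):
    """Build a small dataset piece covering 2 times and 1 ensemble member."""
    return xr.Dataset(
        {"tas": (("time", "ensemble_id"), np.zeros((2, 1)))},
        coords={"time": [t0, t0 + 1], "ensemble_id": [e0]},
    )

# 2x2 grid of pieces: inner lists vary along "time",
# the outer list varies along "ensemble_id"
pieces = [
    [make_piece(0, 0), make_piece(2, 0)],
    [make_piece(0, 1), make_piece(2, 1)],
]
ds = xr.combine_nested(pieces, concat_dim=["ensemble_id", "time"])
# ds.sizes == {"time": 4, "ensemble_id": 2}
```

In the reference-maker setting, each "piece" would be a virtual view of the original files rather than data loaded into memory, but the combination logic is the same.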
