-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Grouper object design doc #8510
Conversation
[0, 0, 0, 1, 1]] | ||
``` | ||
|
||
### The `preferred_chunks` method (?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite speculative but seems cool!
2. Merge existing two class implementation to a single Grouper class | ||
1. This design was implemented in [this PR](https://github.com/pydata/xarray/pull/7206) to account for some annoying data dependencies. | ||
2. See [PR](https://github.com/pydata/xarray/pull/8509) | ||
3. Clean up what's returned by `factorize` methods. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs input on what we expect custom factorize
methods to implement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's probably best to try implementing the most important use cases from the lists you've been assembling: by doing so I believe we can figure out the most important ones.
3. Clean up what's returned by `factorize` methods. | ||
1. A solution here might be to have `group_indices: Mapping[int, Sequence[int]]` be a mapping from group index in `full_index` to a sequence of integers. | ||
2. Return a `namedtuple` or `dataclass` from existing Grouper factorize methods to facilitate API changes in the future. | ||
4. Figure out what to pass to `factorize` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use some input here too, requires some internal cleanup.
4. Figure out what to pass to `factorize` | ||
1. Xarray eagerly reshapes nD variables to 1D. This is an implementation detail we need not expose. | ||
2. When grouping by an unindexed variable Xarray passes a `_DummyGroup` object. This seems like something we don't want in the public interface. We could special case "internal" Groupers to preserve the optimizations in `UniqueGrouper`. | ||
5. Grouper objects will exposed under the `xr.groupers` Namespace. At first these will include `UniqueGrouper`, `BinGrouper`, and `TimeResampler`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on the namespace? And what other "groupers" would be useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems fine to me. We'd need to figure out which groupers to declare out-of-scope for inclusion in xarray
itself, though.
2. When grouping by an unindexed variable Xarray passes a `_DummyGroup` object. This seems like something we don't want in the public interface. We could special case "internal" Groupers to preserve the optimizations in `UniqueGrouper`. | ||
5. Grouper objects will exposed under the `xr.groupers` Namespace. At first these will include `UniqueGrouper`, `BinGrouper`, and `TimeResampler`. | ||
|
||
## Alternatives |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this at a dev meeting a while ago, but would be good to rethink and commit to a path now.
return weights | ||
``` | ||
|
||
### The `factorize` method |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better name? "encode"
? "to_codes"
? "label"
? "to_integer_labels"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for something involving "labels", as IIUC this method is about grouping coordinate "labels" into common groups
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.find_groups()
? IIUC The goal of a Grouper
is to find and extract the groups from a variable? Assuming that this method returns all the information needed to represent the groups, wouldn't it be better to encapsulate the returned items in an instance of a simple dataclass Groups
?
And maybe rename the method argument from group
to var
? I find "group" a bit confusing here... It is more a variable that possibly contains groups of a certain kind.
class MyGrouper(Grouper):
def find_groups(self, var: Variable) -> Groups:
...
return Groups(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of like find_groups
. (As an aside this maybe does not depend so much, as this method will almost only be used internally?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. In flox I use by
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit late, but throughout the document you're using the term "factorize" when you talk about constructing an array of integer indices for the "split" step, so maybe that means that using factorize
as the method name would be fine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't like factorize
because no one knows what it means :) AFAICT pandas chose that name, and I haven't seen it used anywhere else. I think label_groups
would be best, and most easy to understand for a new user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, that makes sense. However, since this is advanced API I think we can require people to learn the new meaning, which is not actually new since we're borrowing from pandas
(we could take split_indices
, to_split_indices
or just .split
if we wanted to stay closer to "split-apply-combine", though).
What does other "split-apply-combine" software call this? From what I can tell:
polars
:to_physical
(I don't get what the background there is)dataframes.jl
:compute_indices
There is some redundancy here since `unique_coord` is always equal to or a subset of `full_index`. | ||
We can clean this up (see Implementation below). | ||
|
||
### The `weights` method (?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"to_weights"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"compute_weights"? "find_weights"?
For monthly means, since the period of repetition of labels is 12, the Grouper might choose possible chunk sizes of `((2,),(3,),(4,),(6,))`. | ||
For resampling, the Grouper could choose to resample to a multiple or an even fraction of the resampling frequency. | ||
|
||
## Related work |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very interested in what else is out there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sklearn's clustering?
https://scikit-learn.org/stable/modules/clustering.html#clustering
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very excited about this! Especially the potential for weighted groupby!
return weights | ||
``` | ||
|
||
### The `factorize` method |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of like find_groups
. (As an aside this maybe does not depend so much, as this method will almost only be used internally?)
This is excellent and exciting work @dcherian! These new features will definitely be useful in xCDAT. A few of the top of my head include:
For the custom season grouping feature, will it include the ability to group across calendar years as mentioned in this xCDAT issue? Also, I'd be happy to experiment with this feature whenever I get some time. |
I just saw "Support seasons that span a year-end." in the design doc! |
1. A new `SpaceResampler` would allow specifying resampling spatial dimensions. ([issue](https://github.com/pydata/xarray/issues/4008)) | ||
2. `RollingTimeResampler` would allow rolling-like functionality that understands timestamps ([issue](https://github.com/pydata/xarray/issues/3216)) | ||
3. A `QuantileBinGrouper` to abstract away `pd.cut` ([issue](https://github.com/pydata/xarray/discussions/7110)) | ||
4. A `SeasonGrouper` and `SeasonResampler` would abstract away common annoyances with such calculations today |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prototype Implemented here: https://github.com/dcherian/xarray/blob/grouper-public/xarray/core/groupers.py
Looks cool! I would like to second a desire for "Grouping by multiple variables". In our use case we often want to do something like "group cells by patient_status and cell_type". Maybe this is a Another feature (which is admittedly less common, and may be out of scope) is support for MultiLabel groupers – e.g the kinds of labels from scikit-multilearn. This means that a single observation can be present in more than one group. This is a little similar to rolling groupers, just not based on position. |
This looks awesome, but I will have to go through it in more detail.
|
Thanks for the input, looks like we have great support for the idea. I have been thinking that there's lots of redundancy in what's returned by
@ivirshup @tomvothecoder This is already supported in flox!. It'll be a while before Xarray gets there.
Yup, this is just a special "factorize" / "weights" generation problem which will be encapsulated by these Grouper objects. The exact API that would work is still up for debate but we can enable that usecase. |
Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>
### The `weights` method (?) | ||
|
||
The proposed `weights` method is optional and unimplemented today. | ||
Groupers with `weights` will allow composing `weighted` and `groupby` ([issue](https://github.com/pydata/xarray/issues/3937)). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weights
seems potentially useful, but I'm not sure we need it for composing weighted()
and groupby()
. For example, you could implement x.weighted(w).groupby(k).mean()
with something like (x * w).groupby(k).sum() / w.groupby(k).sum()
. This would likely be more efficient, too, at least if the grouped over dimension is large.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with your implementation point but the weights
can depend on the parameters to the Grouper object. For example a time bounds-aware resampling would have non-uniform weights, (and the Grouper object is more of a Regridder object). Adding the weights method would allow us to encapsulate that functionality.
|
||
The proposed `weights` method is optional and unimplemented today. | ||
Groupers with `weights` will allow composing `weighted` and `groupby` ([issue](https://github.com/pydata/xarray/issues/3937)). | ||
The `weights` method should return an appropriate array of weights such that the following property is satisfied |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to allow these operations to be efficient, we should definitely allow and encourage returning weights as sparse arrays.
This allows reuse of Grouper objects, example | ||
```python | ||
grouper = BinGrouper(...) | ||
ds.groupby(x=grouper, y=grouper) | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there cases where reusing Groupers would actually be useful? I think this is mostly just about a syntax.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had one where I wanted to use the same bin specification for two variables. But I agree that its minor.
Nominally every property on the grouper is independent of the data variable it is applied to. I think that motivated my suggestion to not encapsulate the data variable in the grouper. But every method on the grouper relies on the data variable as input. So 🤷🏾♂️ .
Do you have any advice here on how to think about the coupling/decoupling?
grouper = BinGrouper(...) | ||
ds.groupby(x=grouper, y=grouper) | ||
``` | ||
but requires that all variables being grouped by (`x` and `y` above) are present in Dataset `ds`. This does not seem like a bad requirement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, this is probably fine. The variables can also be coordinates (e.g., on a DataArray).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we go with that, how would we support variables that don't align with the dataset? I was under the impression that that currently works (not sure, though). If we require a new dimension name for the grouper variable that would be fine with me, but it's also something we should explicitly state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how would we support variables that don't align with the dataset?
We don't. We use an "exact" join today:
Lines 616 to 620 in 41d33f5
if isinstance(group, DataArray): | |
try: | |
align(obj, group, join="exact", copy=False) | |
except ValueError: | |
raise ValueError(error_msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in that case sounds good to me, .assign_coords({name: arr})
is not that much more to write.
### The `factorize` method | ||
Today, the `factorize` method takes as input the group variable and returns 4 variables (I propose to clean this up below): | ||
1. `codes`: An array of same shape as the `group` with int dtype. NaNs in `group` are coded by `-1` and ignored later. | ||
2. `group_indices` is a list of index location of `group` elements that belong to a single group. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be part of the interface? I think it can be calculated from codes
if necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it can be calculated.
We'd lose an optimization where the current code uses slices for resampling, and unindexed coordinates. I think it's valuable to preserve the former at least. What do you think of this function returning a named tuple with an optional indices
property. That way, Groupers can opt in to optimizing the indices, potentially reusing any intermediate step they've computed for generating the codes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think returning a dataclass with some optional fields would indeed be a nice way to handle this. It would also be forward compatible with any future interface changes.
What's the issue here for xarray?
I would really like to see some compatibility with scikit-learn here, where these are represented as a binary matrix where each column is a mask for that class (docs). |
2. scikit-downscale provides [`PaddedDOYGrouper`]( | ||
https://github.com/pangeo-data/scikit-downscale/blob/e16944a32b44f774980fa953ea18e29a628c71b8/skdownscale/pointwise_models/groupers.py#L19) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would be quite happy to have this directly in xarray! We have are own implementation at xclim (https://github.com/Ouranosinc/xclim/blob/b905e5529f2757a49f4485b7713b0ef01a45df2c/xclim/sdba/base.py#L103) but it's kind of a mess. Grouping per time period with a window is very useful and common in model output statistics.
return weights | ||
``` | ||
|
||
### The `factorize` method |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit late, but throughout the document you're using the term "factorize" when you talk about constructing an array of integer indices for the "split" step, so maybe that means that using factorize
as the method name would be fine?
grouper = BinGrouper(...) | ||
ds.groupby(x=grouper, y=grouper) | ||
``` | ||
but requires that all variables being grouped by (`x` and `y` above) are present in Dataset `ds`. This does not seem like a bad requirement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we go with that, how would we support variables that don't align with the dataset? I was under the impression that that currently works (not sure, though). If we require a new dimension name for the grouper variable that would be fine with me, but it's also something we should explicitly state.
2. Merge existing two class implementation to a single Grouper class | ||
1. This design was implemented in [this PR](https://github.com/pydata/xarray/pull/7206) to account for some annoying data dependencies. | ||
2. See [PR](https://github.com/pydata/xarray/pull/8509) | ||
3. Clean up what's returned by `factorize` methods. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's probably best to try implementing the most important use cases from the lists you've been assembling: by doing so I believe we can figure out the most important ones.
4. Figure out what to pass to `factorize` | ||
1. Xarray eagerly reshapes nD variables to 1D. This is an implementation detail we need not expose. | ||
2. When grouping by an unindexed variable Xarray passes a `_DummyGroup` object. This seems like something we don't want in the public interface. We could special case "internal" Groupers to preserve the optimizations in `UniqueGrouper`. | ||
5. Grouper objects will exposed under the `xr.groupers` Namespace. At first these will include `UniqueGrouper`, `BinGrouper`, and `TimeResampler`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems fine to me. We'd need to figure out which groupers to declare out-of-scope for inclusion in xarray
itself, though.
1. A new `SpaceResampler` would allow specifying resampling spatial dimensions. ([issue](https://github.com/pydata/xarray/issues/4008)) | ||
2. `RollingTimeResampler` would allow rolling-like functionality that understands timestamps ([issue](https://github.com/pydata/xarray/issues/3216)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How fundamentally different are space and time resampling? If the difference is small enough (units / step size), would it be possible to merge both into a single Resampler
Grouper object? Then it might also be possible to apply this to coordinates with physical dimensions other than space / time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this could be done, they're the same, its just the user interface that could be different. For time it's nice to write freq="M"
and for space spacing=2.5
(?). The other arguments carry over but for time
they can be timedelta
or datetime
objects. So it feels nice to separate them for easy validation, and makes the user code more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main point was that we might want to resample using (numeric?) coordinates that are neither time nor space, but "spacing" might be appropriate in that case, as well? In which case we'd only need a general "Resampler" and a specialized "TimeResampler" (4 in total: Resampler
/ RollingResampler
and TimeResampler
/ RollingTimeResampler
). Unless I'm missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resample using (numeric?) coordinates that are neither time nor space, but "spacing" might be appropriate in that case, as well
Ah yes, agree! The Resampler
is really just a convenient wrapper around BinGrouper
Toward pydata#8510 1. Rename to Resampler from ResampleGrouper 2. Move code from common.resample to TimeResampler
* main: (31 commits) correctly encode/decode _FillValues/missing_values/dtypes for packed data (pydata#8713) Expand use of `.oindex` and `.vindex` (pydata#8790) Return a dataclass from Grouper.factorize (pydata#8777) [skip-ci] Fix upstream-dev env (pydata#8839) Add dask-expr for windows envs (pydata#8837) [skip-ci] Add dask-expr dependency to doc.yml (pydata#8835) Add `dask-expr` to environment-3.12.yml (pydata#8827) Make list_chunkmanagers more resilient to broken entrypoints (pydata#8736) Do not attempt to broadcast when global option ``arithmetic_broadcast=False`` (pydata#8784) try to get the `upstream-dev` CI to complete again (pydata#8823) Bump the actions group with 1 update (pydata#8818) Update documentation for clarity (pydata#8817) DOC: link to zarr.convenience.consolidate_metadata (pydata#8816) Refactor Grouper objects (pydata#8776) Grouper object design doc (pydata#8510) Bump the actions group with 2 updates (pydata#8804) tokenize() should ignore difference between None and {} attrs (pydata#8797) fix: remove Coordinate from __all__ in xarray/__init__.py (pydata#8791) Fix non-nanosecond casting behavior for `expand_dims` (pydata#8782) Migrate treenode module. (pydata#8757) ...
* main: (42 commits) correctly encode/decode _FillValues/missing_values/dtypes for packed data (pydata#8713) Expand use of `.oindex` and `.vindex` (pydata#8790) Return a dataclass from Grouper.factorize (pydata#8777) [skip-ci] Fix upstream-dev env (pydata#8839) Add dask-expr for windows envs (pydata#8837) [skip-ci] Add dask-expr dependency to doc.yml (pydata#8835) Add `dask-expr` to environment-3.12.yml (pydata#8827) Make list_chunkmanagers more resilient to broken entrypoints (pydata#8736) Do not attempt to broadcast when global option ``arithmetic_broadcast=False`` (pydata#8784) try to get the `upstream-dev` CI to complete again (pydata#8823) Bump the actions group with 1 update (pydata#8818) Update documentation for clarity (pydata#8817) DOC: link to zarr.convenience.consolidate_metadata (pydata#8816) Refactor Grouper objects (pydata#8776) Grouper object design doc (pydata#8510) Bump the actions group with 2 updates (pydata#8804) tokenize() should ignore difference between None and {} attrs (pydata#8797) fix: remove Coordinate from __all__ in xarray/__init__.py (pydata#8791) Fix non-nanosecond casting behavior for `expand_dims` (pydata#8782) Migrate treenode module. (pydata#8757) ...
xref #8509, #6610
Rendered version here
@pydata/xarray I've been poking at this on and off for a year now and finally figured out how to do it cleanly (#8509).
I wrote up a design doc for
Grouper
objects that allow custom conversion of DataArrays to integer group codes, following the NEP template (which is absolutely great!). Such Grouper objects allow us to generalize the GroupBy interface to a much larger class of problems, and eventually provide a nice path to grouping by multiple variables (#6610)#8509 implements two custom Groupers for you to try out :)
All comments are welcome,
Grouper
objects we should provide in Xarray.cc @ilan-gold @ivirshup @aulemahal @tomvothecoder @jbusecke @katiedagon - it would be good to hear what "Groupers" would be useful for your work / projects. I bet you already have examples that fit this proposal