
ValueError: Region (...) does not align with Zarr chunks (). #644

Open
ghislainp opened this issue Nov 1, 2023 · 9 comments

Comments

@ghislainp

I'm trying to merge several NetCDF files into a single Zarr store.

The recipe is:

recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type, xarray_open_kwargs={"decode_coords": "all"})
    | StoreToZarr(
        combine_dims=pattern.combine_dim_keys,
        target_root='.',
        store_name='out.zarr',
        # target_chunks=chunks,
    )
)

and the pattern is built with pattern = pattern_from_file_sequence(ncfiles, 'time')

The structure of the NetCDF files is:

<xarray.Dataset>
Dimensions:                             (y: 402, x: 462, time: 365, nv: 4)
Coordinates:
  * time                                (time) datetime64[ns] 2005-04-01 ... ...
  * x                                   (x) float64 -2.975e+06 ... 2.788e+06
  * y                                   (y) float64 2.625e+06 ... -2.388e+06
Dimensions without coordinates: nv
Data variables:
    lat                                 (y, x) float32 ...
    lon                                 (y, x) float32 ...
    bounds_lat                          (y, x, nv) float32 ...
    bounds_lon                          (y, x, nv) float32 ...
    spatial_ref                         int64 ...
    snow_status_wet_dry_19H_ASC_raw     (time, y, x) float32 ...
    snow_status_wet_dry_19H_ASC_filter  (time, y, x) float32 ...
    snow_status_wet_dry_19H_DSC_raw     (time, y, x) float32 ...
    snow_status_wet_dry_19H_DSC_filter  (time, y, x) float32 ...

I get the error: ValueError: Region (slice(0, 365, None), slice(None, None, None), slice(None, None, None)) does not align with Zarr chunks (402, 462).

It seems that StoreToZarr tries to use the 'time' dimension to write variables that do not depend on time. When I remove the variables lat, lon, bounds_lat, and bounds_lon, it works fine.

How can I solve this problem? I did not have this problem with XarrayZarrRecipe.

@norlandrhagen
Contributor

Thanks for raising an issue @ghislainp. Any chance you can share the input list of netcdf files used to create the file pattern?

@ghislainp
Author

Sure, you can download the data from here: https://filesender.renater.fr/?s=download&token=17666c2e-d738-4447-b338-406315b08aae The link is valid for 2 weeks.

@rabernat
Contributor

rabernat commented Nov 1, 2023

This is almost certainly due to the presence of coordinates in the data variables. I know there are other similar issues, but I can't find them. Any data variable without time in its dims will trigger this error.

@norlandrhagen
Contributor

Thanks for the files @ghislainp. I moved them to a temp s3 bucket and added a transform to drop the offending dims/vars to get a working example. Hope this helps.

import apache_beam as beam

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.transforms import (
    Indexed,
    OpenURLWithFSSpec,
    OpenWithXarray,
    StoreToZarr,
    T,
)

year_list = [2004, 2005, 2006]


def make_url(time):
    return f"s3://carbonplan-scratch/pgf/melt-AMSRU-Antarctic-{time}-12km.nc"


concat_dim = ConcatDim("time", year_list)
pattern = FilePattern(make_url, concat_dim)


class DropDims(beam.PTransform):

    @staticmethod
    def _drop_dims(item: Indexed[T]) -> Indexed[T]:
        index, ds = item
        ds = ds.drop_dims('nv')
        ds = ds[[
            'snow_status_wet_dry_19H_ASC_raw',
            'snow_status_wet_dry_19H_ASC_filter',
            'snow_status_wet_dry_19H_DSC_raw',
            'snow_status_wet_dry_19H_DSC_filter',
        ]]
        return index, ds

    def expand(self, pcoll: beam.PCollection) -> beam.PCollection:
        return pcoll | beam.Map(self._drop_dims)


recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type, xarray_open_kwargs={"decode_coords": "all"})
    | DropDims()
    | StoreToZarr(
        combine_dims=pattern.combine_dim_keys,
        target_root='.',
        store_name='out.zarr',
    )
)

with beam.Pipeline() as p:
    p | recipe

@ghislainp
Author

Thank you. I also obtained the same effect by removing the variables manually with NCO...
However, is there an elegant way to re-add the missing variables after the StoreToZarr? A parallel flow, perhaps? I'm a complete novice with Beam...

Also, is there a way to improve StoreToZarr to recover the previous behavior of XarrayZarrRecipe, which dealt correctly with variables that do not depend on the combine dim?

@rabernat
Contributor

rabernat commented Nov 1, 2023

I don't think you have to drop all these variables. Just move them to coords instead of data variables.

is there a way to improve StoreToZarr to recover the previous behavior of XarrayZarrRecipe that was dealing correctly with these variables not depending on the combine dim ?

Are you sure about that? In the previous version, would lon and lat gain a time dimension? It's pretty ambiguous how to handle the presence of these coordinate variables in each dataset fragment.
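For illustration, moving the time-independent variables from data_vars to coords is a one-liner in xarray. Here is a minimal sketch on a toy dataset shaped like the one in this issue (shapes shrunk, variable name shortened to snow; this only demonstrates the xarray call, not the full recipe):

```python
import numpy as np
import xarray as xr

# A small dataset shaped like the one in this issue: 2-D lat/lon stored
# as data variables alongside a time-dependent variable.
ds = xr.Dataset(
    {
        "lat": (("y", "x"), np.zeros((3, 4), dtype="float32")),
        "lon": (("y", "x"), np.zeros((3, 4), dtype="float32")),
        "snow": (("time", "y", "x"), np.zeros((2, 3, 4), dtype="float32")),
    },
    coords={"time": np.array(["2005-04-01", "2005-04-02"], dtype="datetime64[ns]")},
)

# Promote the time-independent variables to coordinates, so that only
# time-dependent variables remain in data_vars.
ds = ds.set_coords(["lat", "lon"])

print(list(ds.data_vars))  # only 'snow' remains a data variable
```

After this, every remaining data variable carries the concat dim, which is the condition the region writes in StoreToZarr rely on.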

@jbusecke
Contributor

I think I also ran into the same issue over at LEAP.
I have not confirmed that it works on Dataflow, but I think a minimal solution here could be something like:

...
| OpenWithXarray(xarray_open_kwargs={'preprocess':lambda ds: ds.set_coords(['list', 'of', 'offending', 'coords'])})
...

In these relatively simple cases, I wonder if we could provide a much more helpful error message by catching the ValueError: Region ... does not align with Zarr chunks ... and performing a quick test:

  • Are there data_vars that do not include concat_dim?
    • If yes, give a more useful warning and a suggestion how to fix it.
    • If not, just raise the original exception.
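The check proposed above could be sketched roughly like this (the function name and wording are hypothetical, not part of pangeo-forge-recipes):

```python
import numpy as np
import xarray as xr

def find_vars_missing_concat_dim(ds: xr.Dataset, concat_dim: str) -> list:
    """Return the data variables that lack the concat dim -- the usual
    culprits behind 'Region ... does not align with Zarr chunks'."""
    return [name for name, var in ds.data_vars.items() if concat_dim not in var.dims]

# A fragment with one offending variable (lat has no time dimension).
ds = xr.Dataset(
    {
        "lat": (("y", "x"), np.zeros((3, 4))),
        "snow": (("time", "y", "x"), np.zeros((2, 3, 4))),
    }
)

offenders = find_vars_missing_concat_dim(ds, "time")
if offenders:
    # A more useful message than the raw ValueError: name the variables
    # and suggest ds.set_coords(...) or dropping them before StoreToZarr.
    print(f"Variables {offenders} do not depend on 'time'; "
          "consider ds.set_coords(...) or dropping them.")
```

If the list comes back empty, the original exception would simply be re-raised.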

@mattjbr123

mattjbr123 commented Aug 28, 2024

After a bit of snooping around in various feedstocks, I was able to cobble together a solution that seems to do the job when this became an issue for me (forgive all the comments, which I added to help myself figure it out!):

import apache_beam as beam
from pangeo_forge_recipes.transforms import (
    ConsolidateDimensionCoordinates,
    ConsolidateMetadata,
    Indexed,
    OpenWithXarray,
    StoreToZarr,
    T,
)

# Preprocessors are implemented as subclasses of the beam.PTransform class
class DataVarToCoordVar(beam.PTransform):

    # not sure why it needs to be a staticmethod
    @staticmethod
    # the preprocess function should take in and return an
    # object of type Indexed[T]. These are pangeo-forge-recipes
    # derived types, internal to the functioning of the
    # pangeo-forge-recipes transforms.
    # I think they consist of 2-item tuples, each containing
    # some type of 'index' and a 'chunk' of the dataset or a
    # reference to it, as can be seen in the first line of the
    # function below
    def _datavar_to_coordvar(item: Indexed[T]) -> Indexed[T]:
        index, ds = item
        # do something to each ds chunk here
        # and leave index untouched.
        # Here we convert some of the variables in the file
        # to coordinate variables so that pangeo-forge-recipes
        # can process them
        print(f'Preprocessing before {ds =}')
        ds = ds.set_coords(['x_bnds', 'y_bnds', 'time_bnds', 'crs'])
        print(f'Preprocessing after {ds =}')
        return index, ds

    # the expand method is a necessary part of developing your
    # own Beam PTransform: it wraps the preprocess function above
    # and applies it to the PCollection, i.e. to all the 'ds'
    # chunks in Indexed
    def expand(self, pcoll: beam.PCollection) -> beam.PCollection:
        return pcoll | beam.Map(self._datavar_to_coordvar)


recipe = (
    beam.Create(pattern.items())
    | OpenWithXarray(file_type=pattern.file_type)
    | DataVarToCoordVar()  # the preprocess step
    | StoreToZarr(
        target_root=td,
        store_name=tn,
        combine_dims=pattern.combine_dim_keys,
        target_chunks=target_chunks,
    )
    | ConsolidateDimensionCoordinates()
    | ConsolidateMetadata()
)

@jbusecke
Contributor

Great to see this worked for you @mattjbr123.
This seems to be a very common issue (I have run into it many times now), and I think we should at the very least add a "common errors and how to (maybe) fix them" section to the docs (or even consider changing the error message, as suggested above).
