Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KeyError in Group.__getitem__ for zarr_format=2 #2275

Open
TomAugspurger opened this issue Sep 30, 2024 · 5 comments
Open

KeyError in Group.__getitem__ for zarr_format=2 #2275

TomAugspurger opened this issue Sep 30, 2024 · 5 comments
Labels
bug Potential issues with the zarr-python library

Comments

@TomAugspurger
Copy link
Contributor

Zarr version

v3

Numcodecs version

na

Python Version

na

Operating System

na

Installation

na

Description

In zarr-v2, you could create an array at a path and access it via a group:

In [4]: import zarr  # 2.18.3

In [5]: store = {}

In [6]: zarr.open_array(store=store, shape=(4,), path="foo")
Out[6]: <zarr.core.Array '/foo' (4,) float64>

In [7]: zarr.open_group(store)["foo"]
Out[7]: <zarr.core.Array '/foo' (4,) float64>

Steps to reproduce

In zarr v3, this raises a KeyError:

import zarr

store = zarr.store.MemoryStore(store_dict={}, mode="w")
zarr.open_array(store=store, shape=(4,), path="foo", zarr_format=2)
zarr.open_group(store)["foo"]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[1], line 5
      3 store = zarr.store.MemoryStore(store_dict={}, mode="w")
      4 zarr.open_array(store=store, shape=(4,), path="foo", zarr_format=2)
----> 5 zarr.open_group(store)["foo"]

File ~/gh/zarr-developers/zarr-python/src/zarr/core/group.py:814, in Group.__getitem__(self, path)
    813 def __getitem__(self, path: str) -> Array | Group:
--> 814     obj = self._sync(self._async_group.getitem(path))
    815     if isinstance(obj, AsyncArray):
    816         return Array(obj)

File ~/gh/zarr-developers/zarr-python/src/zarr/core/sync.py:135, in SyncMixin._sync(self, coroutine)
    132 def _sync(self, coroutine: Coroutine[Any, Any, T]) -> T:
    133     # TODO: refactor this to to take *args and **kwargs and pass those to the method
    134     # this should allow us to better type the sync wrapper
--> 135     return sync(
    136         coroutine,
    137         timeout=config.get("async.timeout"),
    138     )

File ~/gh/zarr-developers/zarr-python/src/zarr/core/sync.py:91, in sync(coro, loop, timeout)
     88 return_result = next(iter(finished)).result()
     90 if isinstance(return_result, BaseException):
---> 91     raise return_result
     92 else:
     93     return return_result

File ~/gh/zarr-developers/zarr-python/src/zarr/core/sync.py:50, in _runner(coro)
     45 """
     46 Await a coroutine and return the result of running it. If awaiting the coroutine raises an
     47 exception, the exception will be returned.
     48 """
     49 try:
---> 50     return await coro
     51 except Exception as ex:
     52     return ex

File ~/gh/zarr-developers/zarr-python/src/zarr/core/group.py:228, in AsyncGroup.getitem(self, key)
    226 zarr_json_bytes = await (store_path / ZARR_JSON).get()
    227 if zarr_json_bytes is None:
--> 228     raise KeyError(key)
    229 else:
    230     zarr_json = json.loads(zarr_json_bytes.to_bytes())

KeyError: 'foo'

Additional output

No response

@TomAugspurger TomAugspurger added the bug Potential issues with the zarr-python library label Sep 30, 2024
@TomAugspurger
Copy link
Contributor Author

Ah I see the potential issue: the Group is opened with zarr_format=None which defaults to 3. So we have a Group with v3 metadata, which assumes that all its children will be v3 arrays, which is probably sensible. The fix here is to set zarr_format in both places:

zarr.open_group(store, zarr_format=2)["foo"]

So probably not a bug.

@d-v-b
Copy link
Contributor

d-v-b commented Sep 30, 2024

For top level stuff like zarr.open, what do you think about automatically discovering the zarr format? We would need to handle the case where both a v2 and v3 node are under the same prefix, but I think this would be a pretty user-friendly and intuitive feature.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Sep 30, 2024

I think the current implementation does a good job with auto-discovery when reading. The problem with my snippet is that the zarr.open_group(store) automatically created a Group, which falls back to a v3 default.

If the store had already had a (v2) Group then this works:

In [1]: import zarr

In [2]: store = zarr.store.MemoryStore(store_dict={}, mode="w")

In [3]: zarr.open_array(store=store, shape=(4,), path="foo", zarr_format=2)
Out[3]: <Array memory://4537150336/foo shape=(4,) dtype=float64>

In [4]: zarr.open_group(store, zarr_format=2)  # create the v2 Group
Out[4]: Group(_async_group=<AsyncGroup memory://4537150336>)

In [5]: zarr.open_group(store)["foo"]  # read the v2 Group (relying on zarr_format inference)
Out[5]: <Array memory://4537150336/foo shape=(4,) dtype=float64>

So maybe this question here is to what extent we want to support hierarchies with a mixture of Zarr V2 and Zarr V3 nodes. That makes my head hurt a little bit, but I guess it's doable (don't assume that Group.zarr_format will match the child zarr format, so don't pass that through in the __getitem__ call and rely on zarr's auto discovery?).

@d-v-b
Copy link
Contributor

d-v-b commented Sep 30, 2024

So maybe this question here is to what extent we want to support hierarchies with a mixture of Zarr V2 and Zarr V3 nodes.

I think we want no support for this. IMO, a v3 group with a v2 array as a member should not be expressible in zarr-python. But I don't think it should be an error to create a v3 and v2 group in the same place. Maybe part of the problem is the open_group function itself, which by name is ambiguous about whether it creates or reads. I would suggest adding a read_group function that uses mode=r under the hood, thereby avoiding surprise group creation. Similarly for read_array, and an array-or-group read function. I can open a PR for this when i get some time this week.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Sep 30, 2024

Explicit read_* and create_* would be fantastic :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Potential issues with the zarr-python library
Projects
None yet
Development

No branches or pull requests

2 participants