[r/python] R and Python create groups with incompatible SOMA metadata #2698

Open
bkmartinjr opened this issue Jun 8, 2024 · 3 comments · May be fixed by #3469

Comments

@bkmartinjr
Member

Objects created by the R and Python bindings should have identical metadata, but the R and Python packages currently write the dataset_type, soma_encoding_version and soma_object_type metadata tags with different, incompatible value types.

  • The Python package creates objects with Unicode strings (e.g., "dataset_type": "soma")
  • The R package creates objects with byte arrays (e.g., "dataset_type": b"soma")

I used TileDB-Py to inspect two arrays.

When the array is created by Python (array info):

Out[25]: {'dataset_type': 'soma', 'soma_encoding_version': '1', 'soma_object_type': 'SOMAExperiment'}

When the array is created by R (array info):

Out[23]: {'dataset_type': b'soma', 'soma_encoding_version': b'1', 'soma_object_type': b'SOMAExperiment'}

I also checked directly reading from S3, i.e., not using the tiledb:// URI, and the result is the same.
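
To make the incompatibility concrete, here is a minimal sketch in plain Python that uses only the two outputs shown above. The `mismatched_tags` helper is hypothetical and written for illustration; it is not part of TileDB-Py or tiledbsoma.

```python
# Metadata as reported above for the Python-created and R-created objects.
python_meta = {
    "dataset_type": "soma",
    "soma_encoding_version": "1",
    "soma_object_type": "SOMAExperiment",
}
r_meta = {
    "dataset_type": b"soma",
    "soma_encoding_version": b"1",
    "soma_object_type": b"SOMAExperiment",
}


def mismatched_tags(a, b):
    """Hypothetical helper: sorted keys whose values differ in type or content."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


# In Python 3, str and bytes never compare equal, so every tag mismatches.
print(python_meta == r_meta)
print(mismatched_tags(python_meta, r_meta))

# A typical end-user check silently fails on the R-written object.
print(r_meta["soma_object_type"] == "SOMAExperiment")
```

Any reader code that compares a tag against a `str` literal would take the wrong branch on the R-written object without raising an error.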

Whether a "string" or a "byte array" is the right type, I think it is reasonably clear that there is a bug here - the mandatory metadata tags should be identical no matter which ingestion system is used, and which package is used to read them back.

Side note: the current Python package seems to have a work-around for this, as it detects byte array metadata and converts it to utf-8. This is nice, but doesn't seem like the right answer, as it requires any other user of that metadata (e.g., end-user code) to do the same encoding/decoding step for any/all metadata values.
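
For illustration, the decoding step that every consumer would have to repeat might look like the sketch below. `decode_metadata` is a hypothetical helper, not the actual work-around in the Python package, and it assumes the byte values are valid utf-8.

```python
def decode_metadata(meta, encoding="utf-8"):
    """Return a copy of `meta` with bytes values decoded to str.

    Values that are not bytes (str, int, float, ...) are passed through unchanged.
    """
    return {
        key: value.decode(encoding) if isinstance(value, bytes) else value
        for key, value in meta.items()
    }


# Metadata as read from the R-created object in the report above.
r_meta = {
    "dataset_type": b"soma",
    "soma_encoding_version": b"1",
    "soma_object_type": b"SOMAExperiment",
}
print(decode_metadata(r_meta))
```

The helper is trivial, but it has to be applied at every read site, which is the burden described above.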

In my opinion, we should be using utf-8 everywhere (and document that in the SOMA spec), but at a minimum, we should have common behavior across all reader/writer code.

tiledbsoma.__version__              1.11.4
TileDB-Py version                   0.29.0
TileDB core version (tiledb)        2.23.0
TileDB core version (libtiledbsoma) 2.23.0
python version                      3.11.9.final.0
OS version                          Linux 6.8.0-76060800daily20240311-generic
@johnkerl
Member

@nguyenv @eddelbuettel is this complete now that #2819 is merged?

@bkmartinjr
Member Author

@johnkerl would also like to confirm we have a unit test that would catch this...
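
As a rough sketch of what such a test could assert (not the project's actual test suite), the check below takes the metadata of the same object as written by each binding and requires every mandatory tag to be a `str` with identical values. The names `MANDATORY_TAGS`, `non_str_tags` and `check_metadata_parity` are hypothetical, and how the two metadata dicts are obtained is left to the caller.

```python
MANDATORY_TAGS = ("dataset_type", "soma_encoding_version", "soma_object_type")


def non_str_tags(meta):
    """Return the mandatory tags that are missing or not stored as str."""
    return [tag for tag in MANDATORY_TAGS if not isinstance(meta.get(tag), str)]


def check_metadata_parity(python_meta, r_meta):
    """Raise AssertionError if the two writers disagree on any mandatory tag."""
    assert non_str_tags(python_meta) == [], non_str_tags(python_meta)
    assert non_str_tags(r_meta) == [], non_str_tags(r_meta)
    for tag in MANDATORY_TAGS:
        assert python_meta[tag] == r_meta[tag], tag


# With the values from the original report, the R-written metadata fails the check.
reported_r_meta = {
    "dataset_type": b"soma",
    "soma_encoding_version": b"1",
    "soma_object_type": b"SOMAExperiment",
}
print(non_str_tags(reported_r_meta))
```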

@johnkerl
Member

@johnkerl johnkerl assigned johnkerl and unassigned eddelbuettel Aug 12, 2024
@johnkerl johnkerl changed the title R and Python create groups with incompatible SOMA metadata [r/python] R and Python create groups with incompatible SOMA metadata Aug 15, 2024
mojaveazure added a commit that referenced this issue Dec 18, 2024
String group-level metadata was previously encoded using
`TILEDB_CHAR` or `TILEDB_STRING_ASCII`; however, this resulted in the
metadata being read in as `bytes` in the Python API instead of as `str`.
The Python API already [encodes all strings (`str`) as
`TILEDB_STRING_UTF8`](https://github.com/single-cell-data/TileDB-SOMA/blob/884342a1ceb994d677c52c74ba2d789fc4e208d4/apis/python/src/tiledbsoma/common.cc#L211-L223)
so this PR brings the R API in line with the Python API.

[SC-61001](https://app.shortcut.com/tiledb-inc/story/61001)

resolves #2698