[r/python] R and Python create groups with incompatible SOMA metadata #2698

bkmartinjr · 2024-06-08T16:40:28Z

Objects created by R or Python bindings should have identical metadata, but currently the R and Python packages tag SOMA objects with different and incompatible metadata tags for dataset_type, soma_encoding_version and soma_object_type.

The Python package creates objects with Unicode strings (e.g., "dataset_type": "soma").
The R package creates objects with byte arrays (e.g., "dataset_type": b"soma").

Using TileDB-Py to inspect two arrays.

When the array is created by Python (array info):

Out[25]: {'dataset_type': 'soma', 'soma_encoding_version': '1', 'soma_object_type': 'SOMAExperiment'}

When the array is created by R (array info):

Out[23]: {'dataset_type': b'soma', 'soma_encoding_version': b'1', 'soma_object_type': b'SOMAExperiment'}
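To make the practical consequence concrete, here is a minimal sketch in plain Python (no TileDB required) that copies the two metadata dictionaries shown above into literals. The variable names are illustrative only. It shows that downstream checks written against `str` values silently fail on the R-written object, because `str` and `bytes` never compare equal in Python 3.

```python
# Metadata as reported above for the Python-written and R-written objects.
python_written = {
    "dataset_type": "soma",
    "soma_encoding_version": "1",
    "soma_object_type": "SOMAExperiment",
}
r_written = {
    "dataset_type": b"soma",
    "soma_encoding_version": b"1",
    "soma_object_type": b"SOMAExperiment",
}

# In Python 3, str and bytes never compare equal, even with identical content.
dicts_equal = python_written == r_written
is_soma_naive = r_written["dataset_type"] == "soma"

# Tags whose value types differ between the two writers.
mismatched = sorted(
    key
    for key in python_written
    if type(python_written[key]) is not type(r_written[key])
)

print(dicts_equal, is_soma_naive, mismatched)
```

All three mandatory tags end up in `mismatched`, so any reader that tests `dataset_type == "soma"` would reject the R-written object.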

I also checked directly reading from S3, i.e., not using the tiledb:// URI, and the result is the same.

Whether a "string" or a "byte array" is right, I think it is reasonably clear that there is a bug here: the mandatory metadata tags should be identical no matter which ingestion system is used and which package is used to read them back.

Side note: the current Python package seems to have a work-around for this, as it detects byte array metadata and converts it to utf-8. This is nice, but doesn't seem like the right answer, as it requires any other user of that metadata (e.g., end-user code) to do the same encoding/decoding step for any/all metadata values.
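For illustration, this is roughly the decoding step that every consumer of the metadata would have to carry if the difference were left in place. It is a sketch only: `normalize_metadata` is a hypothetical helper written for this comment, not the function used inside the Python package.

```python
from typing import Any, Mapping


def normalize_metadata(metadata: Mapping[str, Any]) -> dict[str, Any]:
    """Return a copy in which bytes values are decoded as UTF-8.

    Hypothetical helper for illustration; non-bytes values pass through unchanged.
    """
    return {
        key: value.decode("utf-8") if isinstance(value, bytes) else value
        for key, value in metadata.items()
    }


# The R-written metadata from above, plus one non-string value for contrast.
raw = {"dataset_type": b"soma", "soma_encoding_version": b"1", "n_obs": 42}
clean = normalize_metadata(raw)
print(clean)
```

Requiring every caller to remember a step like this is exactly the burden that writing UTF-8 strings at the source would remove.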

In my opinion, we should be using utf-8 everywhere (and document that in the SOMA spec), but at a minimum, we should have common behavior across all reader/writer code.

tiledbsoma.__version__              1.11.4
TileDB-Py version                   0.29.0
TileDB core version (tiledb)        2.23.0
TileDB core version (libtiledbsoma) 2.23.0
python version                      3.11.9.final.0
OS version                          Linux 6.8.0-76060800daily20240311-generic

johnkerl · 2024-08-12T01:33:20Z

@nguyenv @eddelbuettel is this complete now that #2819 is merged?

bkmartinjr · 2024-08-12T14:55:38Z

@johnkerl would also like to confirm we have a unit test that would catch this...

johnkerl · 2024-08-12T15:25:30Z

@bkmartinjr I'll add a cross-language test here https://github.com/single-cell-data/TileDB-SOMA/tree/1.13.0/apis/system/tests
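As a rough idea of what such a test could assert, here is a sketch of the check itself, independent of how the metadata is obtained. The helper name `find_non_str_tags` is hypothetical, and the tag list is taken from the issue description. A real cross-language test would presumably write an object with each binding and read its metadata back with the other before applying a check like this.

```python
# Mandatory tags named in the issue description.
MANDATORY_TAGS = ("dataset_type", "soma_encoding_version", "soma_object_type")


def find_non_str_tags(metadata):
    """Return the mandatory tags that are missing or not stored as str.

    Hypothetical helper; an empty list means the metadata passes the check.
    """
    return [tag for tag in MANDATORY_TAGS if not isinstance(metadata.get(tag), str)]


# Stand-ins for metadata read back from objects written by each binding.
from_python = {
    "dataset_type": "soma",
    "soma_encoding_version": "1",
    "soma_object_type": "SOMAExperiment",
}
from_r_before_fix = {
    "dataset_type": b"soma",
    "soma_encoding_version": b"1",
    "soma_object_type": b"SOMAExperiment",
}

print(find_non_str_tags(from_python))
print(find_non_str_tags(from_r_before_fix))
```

A test built on this would fail for the byte-array metadata reported at the top of this issue and pass once both bindings write UTF-8 strings.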

String group-level metadata was previously encoded using `TILEDB_CHAR` or `TILEDB_STRING_ASCII`; however, this resulted in the metadata being read in as `bytes` in the Python API instead of as `str`. The Python API already [encodes all strings (`str`) as `TILEDB_STRING_UTF8`](https://github.com/single-cell-data/TileDB-SOMA/blob/884342a1ceb994d677c52c74ba2d789fc4e208d4/apis/python/src/tiledbsoma/common.cc#L211-L223), so this PR brings the R API in line with the Python API. [SC-61001](https://app.shortcut.com/tiledb-inc/story/61001). Resolves #2698.

johnkerl assigned eddelbuettel and nguyenv Jun 8, 2024

johnkerl added r-api cpp-api r-python-parity labels Jun 17, 2024

johnkerl unassigned nguyenv Jul 15, 2024

johnkerl mentioned this issue Aug 2, 2024

[r] Metadata read/write support via libtiledbsoma #2819

Merged

johnkerl assigned johnkerl and unassigned eddelbuettel Aug 12, 2024

johnkerl changed the title ~~R and Python create groups with incompatible SOMA metadata~~ [r/python] R and Python create groups with incompatible SOMA metadata Aug 15, 2024

johnkerl mentioned this issue Aug 15, 2024

[c++] Final update for Python/R metadata typing #2900

Merged

mojaveazure linked a pull request Dec 18, 2024 that will close this issue

[r] Write group-level string metadata as TILEDB_STRING_UTF8 #3469

Open

johnkerl assigned mojaveazure Dec 18, 2024
