Support writing string data (TLeafC) to TTrees #516

Closed
meliache opened this issue Nov 30, 2021 · 3 comments

@meliache

meliache commented Nov 30, 2021

Writing a dataframe containing a Categorical column to a ROOT TTree raises an AttributeError (see MWE below). I guess this is not supported, and I don't know whether ROOT TTrees even have an equivalent datatype. Of course, as a user I can still convert the columns to integers or something similar and then write the resulting dataframe.

I saw that with a string object column, we get:

NotImplementedError: array of strings

With this issue I ask to maybe raise a more helpful NotImplementedError for Categorical axes. It would also be ideal to support writing those datatypes, though I assume that's not trivial and probably would also require implementing string axes etc.

While playing around with examples for this issue, I also tried the experimental StringArray datatype (dtype="string"), and it also raises the same AttributeError instead of a NotImplementedError, so maybe I can include that in this issue as well.

Here is my MWE to reproduce the error when writing TTrees from Categorical columns:

import uproot
import pandas as pd

df = pd.DataFrame({0: pd.Categorical(["a", "a", "b", "c", "c", "c"])})
f = uproot.recreate("/tmp/cat.root")
f["tree"] = df

which results in the traceback:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ <ipython-input-10-bc1c7ac44da4>:1 in <module>                                                    │
│ /home/michael/.local/lib/python3.9/site-packages/uproot/writing/writable.py:967 in __setitem__   │
│                                                                                                  │
│    964 │   def __setitem__(self, where, what):                                                   │
│    965 │   │   if self._file.sink.closed:                                                        │
│    966 │   │   │   raise ValueError("cannot write data to a closed file")                        │
│ ❱  967 │   │   self.update({where: what})                                                        │
│    968 │                                                                                         │
│    969 │   def __delitem__(self, where):                                                         │
│    970 │   │   if self._file.sink.closed:                                                        │
│                                                                                                  │
│ /home/michael/.local/lib/python3.9/site-packages/uproot/writing/writable.py:1468 in update       │
│                                                                                                  │
│   1465 │   │   │   for item in path:                                                             │
│   1466 │   │   │   │   directory = directory[item]                                               │
│   1467 │   │   │                                                                                 │
│ ❱ 1468 │   │   │   uproot.writing.identify.add_to_directory(v, name, directory, streamers)       │
│   1469 │   │                                                                                     │
│   1470 │   │   self._file._cascading.streamers.update_streamers(self._file.sink, streamers)      │
│   1471                                                                                           │
│                                                                                                  │
│ /home/michael/.local/lib/python3.9/site-packages/uproot/writing/identify.py:79 in                │
│ add_to_directory                                                                                 │
│                                                                                                  │
│     76 │   │   │   module_name = type(branch_array).__module__                                   │
│     77 │   │   │                                                                                 │
│     78 │   │   │   if module_name == "pandas" or module_name.startswith("pandas."):              │
│ ❱   79 │   │   │   │   branch_array = uproot.writing._cascadetree.dataframe_to_dict(             │
│     80 │   │   │   │   │   branch_array                                                          │
│     81 │   │   │   │   )                                                                         │
│     82                                                                                           │
│                                                                                                  │
│ /home/michael/.local/lib/python3.9/site-packages/uproot/writing/_cascadetree.py:1477 in          │
│ dataframe_to_dict                                                                                │
│                                                                                                  │
│   1474 │   """                                                                                   │
│   1475 │   Converts a Pandas DataFrame into a dict of NumPy arrays for writing.                  │
│   1476 │   """                                                                                   │
│ ❱ 1477 │   out = {"index": df.index.values}                                                      │
│   1478 │   for column_name in df.columns:                                                        │
│   1479 │   │   out[str(column_name)] = df[column_name].values                                    │
│   1480 │   return out                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'Categorical' object has no attribute 'index'

meliache added the "feature" (New feature or request) label Nov 30, 2021
meliache changed the title from "Support writing dataframes with Categorical columns to TTrees" to "Support writing dataframes with Categorical columns to TTrees or raise NotImplementedError" Nov 30, 2021
@jpivarski
Member

With #517, we'll at least get a NotImplementedError message. You saw the exception because

https://github.com/scikit-hep/uproot4/blob/624b0338995d2cd35bfd7a6c06da3c5cc62330de/src/uproot/writing/identify.py#L78-L81

just checks to see if the object is from the pandas library and then assumes it's a DataFrame. pandas.core.arrays.categorical.Categorical is from the pandas library, but it's a kind of array; Series objects are like that, too. What we really wanted was

https://github.com/scikit-hep/uproot4/blob/8bdd6839cb86ec571e96fcd3b03d0783be3c53b1/src/uproot/writing/identify.py#L79-L85

(which is what's done elsewhere in that file).
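
As a rough illustration of the difference between the two checks, here is a minimal sketch. It assumes pandas is installed; the helper names `looks_like_pandas` and `is_dataframe` are hypothetical and are not Uproot functions.

```python
import pandas as pd


def looks_like_pandas(obj):
    """Module-name test only: true for any object whose type lives in pandas."""
    module_name = type(obj).__module__
    return module_name == "pandas" or module_name.startswith("pandas.")


def is_dataframe(obj):
    """Stricter test: the type comes from pandas AND is an actual DataFrame."""
    return looks_like_pandas(obj) and isinstance(obj, pd.DataFrame)


cat = pd.Categorical(["a", "a", "b"])
series = pd.Series([1, 2, 3])
frame = pd.DataFrame({"x": [1, 2, 3]})

# All three pass the module-name test, but only the DataFrame passes the
# stricter one.
for obj in (cat, series, frame):
    print(type(obj).__name__, looks_like_pandas(obj), is_dataframe(obj))
```

With only the module-name test, a Categorical or a Series would be handed to code that expects DataFrame attributes such as `index` and `columns`, which is consistent with the AttributeError in the traceback above.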

Unfortunately, you still can't write it because I haven't implemented strings (TLeafC), just the numeric types. The "categoricalness" (fact that each unique string is stored only once and referred to by integers; a.k.a "dictionary encoding") would not be preserved in any case because I don't think ROOT I/O has a type like that—definitely not one of the basic TLeaf types. But the data are compressed (if you don't opt out), and a compression algorithm effectively does dictionary encoding.
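
To make the compression point concrete, here is a small sketch using Python's built-in zlib as a stand-in. The labels are made up, and the compression algorithm and level that actually apply when writing a ROOT file may differ, so treat the numbers as illustrative only.

```python
import zlib

# Hypothetical column: a few distinct labels repeated many times.
labels = ["signal", "background", "continuum"] * 10_000

# Store every string in full, one per line, with no dictionary encoding.
raw = "\n".join(labels).encode("utf-8")

# A general-purpose compressor replaces repeated byte sequences with short
# back-references, which has much the same effect as dictionary encoding.
compressed = zlib.compress(raw, 1)

print(len(raw), len(compressed))
```

The compressed size comes out far smaller than the raw size because each repeated label is stored as a reference to an earlier occurrence rather than spelled out again.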

At least, though, PR #517 has Uproot return an error message like this:

NotImplementedError: array of strings

which would have been clearer than what you got.

I'll leave this issue open, since it's a request for writing string data. In the meantime, you might want to use np.unique with return_inverse=True, which you can use to convert strings into numerical categories. It's also possible to write TObjString objects directly into directories, which can be JSON metadata of unique strings and their corresponding integers.
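
A minimal sketch of that workaround, assuming only NumPy and the standard library. The variable names are illustrative, and actually writing `codes` as a branch and `lookup` as a string into the file is not shown here.

```python
import json

import numpy as np

labels = np.array(["a", "a", "b", "c", "c", "c"])

# categories: the sorted unique strings
# codes: for each entry, the position of its string in `categories`
categories, codes = np.unique(labels, return_inverse=True)

# JSON lookup table from string to integer code; this is the kind of
# metadata that could be stored next to the tree.
lookup = json.dumps({str(c): i for i, c in enumerate(categories)})

# Decoding is a plain integer index into the categories array.
restored = categories[codes]

print(codes)   # [0 0 1 2 2 2]
print(lookup)  # {"a": 0, "b": 1, "c": 2}
```

The `codes` array is purely numeric, so it is the kind of data that could already be written before string support existed.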

jpivarski changed the title from "Support writing dataframes with Categorical columns to TTrees or raise NotImplementedError" to "Support writing string data (TLeafC) to TTrees" Nov 30, 2021
@meliache
Author

Thanks for the quick reaction and for the tips on how to work around this.

"... and a compression algorithm effectively does dictionary encoding."

Yes, I guess that pandas has this type because the dataframe is not compressed in-memory (I assume), but as ROOT workflows are mostly out-of-memory, I guess that's not really needed.

Another reason I personally had used Categorical columns is because the categories are ordered and I currently use this metadata in my plotting/histogramming functions. I have a MC-category in a categorical column and then my histogramming/plotting functions use the ordered categories to define the MC-category-axis of a 2D histogram and in the end this also decides the order in which the components of my stacked histogram are plotted. But I'm sure I could achieve this otherwise with string columns or as you say I could just use an integer encoding.
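
For what it's worth, the integer encoding can keep the category order if the codes are taken from an ordered Categorical. This is a minimal sketch assuming pandas is installed; the category names are made up for illustration.

```python
import pandas as pd

# Hypothetical MC categories, listed in the desired plotting order.
order = ["signal", "continuum", "other"]
col = pd.Categorical(
    ["continuum", "signal", "other", "signal"],
    categories=order,
    ordered=True,
)

# Integer codes follow the category order: signal=0, continuum=1, other=2.
codes = col.codes
names = list(col.categories)

# With the codes plus the ordered list of names, the column can be rebuilt.
rebuilt = pd.Categorical.from_codes(codes, categories=names, ordered=True)

print(codes.tolist())  # [1, 0, 2, 0]
print(list(rebuilt))
```

The ordered list of names would have to be stored separately (for example as metadata) for the round trip to work.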

@ioanaif
Collaborator

ioanaif commented Oct 3, 2023

Hi!

#940 added support for writing string data (TLeafC).

Your use case is now working as expected:

import uproot
import pandas as pd

df = pd.DataFrame({0: pd.Categorical(["a", "a", "b", "c", "c", "c"])})
f = uproot.recreate("cat.root")
f["tree"] = df


>>> nf = uproot.open("cat.root")
>>> nf["tree"].keys()
['index', '0']
>>> nf["tree"]["index"].arrays().show()
[{index: 0},
 {index: 1},
 {index: 2},
 {index: 3},
 {index: 4},
 {index: 5}]
>>> nf["tree"]["0"].arrays().show()
[{'0': 'a'},
 {'0': 'a'},
 {'0': 'b'},
 {'0': 'c'},
 {'0': 'c'},
 {'0': 'c'}]

ioanaif closed this as completed Oct 3, 2023