ENH: Add dtype argument to StringMethods get_dummies() #59577

aaronchucarroll · 2024-08-21T18:30:29Z

Adding these args makes StringMethods get_dummies() similarly powerful to pd.get_dummies(), but usable in the common scenario that the strings are separated by some separator.

closes ENH: Allow different dtype in pandas.Series.str.get_dummies #47872
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.


aaronchucarroll · 2024-08-30T17:59:50Z

@rhshadrach Is this PR under review at all? I believe it's a helpful change to a function with currently limited functionality.

rhshadrach

Thanks for the PR! A few questions on some of the added arguments. It seems to me this would make the API significantly larger without providing any features users can't already readily attain.

rhshadrach · 2024-09-03T18:22:08Z

pandas/core/strings/accessor.py

@@ -2409,6 +2419,17 @@ def get_dummies(self, sep: str = "|"):
        ----------
        sep : str, default "|"
            String to split on.
+        prefix : str, list of str, or dict of str, default None


Can't users just prefix the columns by calling .rename after the call to get_dummies?

rhshadrach · 2024-09-03T18:23:06Z

pandas/core/strings/accessor.py

+        prefix_sep : str, default '_'
+            If appending prefix, separator/delimiter to use.


Can't users just add the separator to their prefix, e.g. prefix="prefix_"?
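The rename-based alternative raised in the two comments above can be sketched as follows. This is only an illustration and not code from the PR; the Series contents and the "tag_" prefix are made up for the example.

```python
import pandas as pd

# Hypothetical data: tags joined by "|".
ser = pd.Series(["a|b", "a", "b|c"])

dummies = ser.str.get_dummies(sep="|")

# The prefix and its separator are folded into one string, so no separate
# prefix / prefix_sep arguments are needed.
prefixed = dummies.rename(columns=lambda col: f"tag_{col}")

print(prefixed.columns.tolist())
```

DataFrame.add_prefix("tag_") produces the same column labels and may read more directly when every column gets the same prefix.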

rhshadrach · 2024-09-03T18:25:50Z

pandas/core/strings/accessor.py

+        dummy_na : bool, default False
+            Add a column to indicate NaNs, if False NaNs are ignored.


It seems to me users can already do this in a straightforward manner:

pd.concat([ser.str.get_dummies(), ser.isna().rename("NaN")], axis=1)

Is this not sufficient?
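A slightly expanded version of the one-liner above, as a sketch with made-up data. The column name "NaN" follows the comment; everything else is an assumption for illustration.

```python
import pandas as pd

# Hypothetical data with one missing value.
ser = pd.Series(["a|b", "a", None])

# Missing values produce an all-zero row in the dummies.
dummies = ser.str.get_dummies(sep="|")

# A boolean indicator column marks the rows that were missing.
na_indicator = ser.isna().rename("NaN")

with_na = pd.concat([dummies, na_indicator], axis=1)
print(with_na)
```

The indicator column is boolean while the dummy columns keep their own dtype; an astype call on the indicator would align them if a uniform dtype matters.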

aaronchucarroll · 2024-09-03T19:11:07Z

I don't disagree with that. I was only adding trivial arguments to mimic the behavior of pd.get_dummies. In the interest of simplicity, I can remove all those args. However, I think dtype is a pretty critical behavior.


rhshadrach · 2024-09-04T20:08:28Z

I don't disagree with that. I was only adding trivial arguments to mimic the behavior of pd.get_dummies.

Ah, I see! Still, I think it might be better to deprecate prefix, prefix_sep, and columns from pd.get_dummies and rely on the user using rename and concat to accomplish their goal. Thanks for removing!

rhshadrach

This is looking quite good, but will need it to work with PyArrow and nullable dtypes.

rhshadrach · 2024-09-04T20:12:55Z

pandas/tests/strings/test_get_dummies.py

    tm.assert_frame_equal(result, expected)
+    assert (result.dtypes == np.int8).all()


No need to assert this, it is covered by assert_frame_equal.
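As a small illustration of why the extra assertion is redundant: pandas.testing.assert_frame_equal compares dtypes by default (check_dtype=True), so a frame with the wrong dtype already fails the comparison. The frames below are made up for the example.

```python
import numpy as np
import pandas as pd

result = pd.DataFrame({"a": [1, 0]}, dtype=np.int8)
# Same values, but a different integer width.
expected_wrong = pd.DataFrame({"a": [1, 0]}, dtype=np.int64)

try:
    pd.testing.assert_frame_equal(result, expected_wrong)
except AssertionError:
    # Raised because the column dtypes differ, even though the values match.
    dtype_mismatch_detected = True
else:
    dtype_mismatch_detected = False

print(dtype_mismatch_detected)
```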

rhshadrach · 2024-09-04T20:13:25Z

pandas/tests/strings/test_get_dummies.py

-    s = Series(["a", "b,name", "b"], dtype=any_string_dtype)
-    result = s.str.get_dummies(",")
-    expected = DataFrame([[1, 0, 0], [0, 1, 1], [0, 1, 0]], columns=["a", "b", "name"])
+def test_get_dummies_int8_dtype():


Can you parametrize these tests with dtype?

In addition, can you add dtype=str, PyArrow, and nullable dtypes (e.g. "Int64")? Specifying PyArrow and nullable dtypes currently fails:

ser = pd.Series(["a|b", "a", "a|c"], dtype="string[pyarrow]")
ser.str.get_dummies(dtype=pd.ArrowDtype(pa.int64()))

but is successful with pd.get_dummies:

pd.get_dummies(ser, dtype=pd.ArrowDtype(pa.int64()))

I think this will need to be fixed. You may find it necessary to have multiple tests - perhaps one for NumPy (which is already present), one for str, one for PyArrow etc. But just try to consolidate with pytest.parametrize as much as is reasonable.


aaronchucarroll · 2024-09-05T18:57:57Z

I previously had the pyarrow tests parametrized, but the td.skip_if_no decorator does not seem to work concurrently with the pytest parametrize decorator. So I split the pyarrow tests by types.

rhshadrach

Can you also add a test with strings:

s = Series(["a|b", "a|c", np.nan])
result = s.str.get_dummies("|", dtype=str)
expected = DataFrame(
    [["T", "T", "F"], ["T", "F", "T"], ["F", "F", "F"]],
    columns=list("abc"),
    dtype=str,
)
tm.assert_frame_equal(result, expected)

and similarly with PyArrow strings? (I think PyArrow strings still fail.)

rhshadrach · 2024-09-07T10:48:47Z

pandas/core/strings/accessor.py

+        if is_extension_array_dtype(dtype):
+            return self._wrap_result(
+                DataFrame(result, columns=name, dtype=dtype),
+                name=name,
+                returns_string=False,
+            )
+        if isinstance(dtype, ArrowDtype):
+            return self._wrap_result(
+                DataFrame(result, columns=name, dtype=dtype),
+                name=name,
+                returns_string=False,
+            )


Can you consolidate these two using if ... or ...:

rhshadrach · 2024-09-07T10:50:07Z

pandas/core/arrays/arrow/array.py

+        dummy_dtype: NpDtype
+        if isinstance(_dtype, np.dtype):
+            dummy_dtype = _dtype
+        else:
+            dummy_dtype = np.bool_
+        dummies = np.zeros(n_rows * n_cols, dtype=dummy_dtype)


This is on the verge of being (and arguably is) a nitpick, but I think it'd be better to call it dummies_dtype, since it is the dtype of the dummies array throughout.

pandas/tests/strings/test_get_dummies.py

rhshadrach · 2024-09-07T11:10:02Z

I previously had the pyarrow tests parametrized, but the td.skip_if_no decorator does not seem to work concurrently with the pytest parametrize decorator. So I split the pyarrow tests by types.

The issue is just with parametrize, namely that @pytest.mark.parametrize(..., [ArrowDtype(pa.uint8())]) requires pa to be imported, which fails when PyArrow is not installed. You can instead use the string alias, e.g. "uint8[pyarrow]".
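A sketch of what such a parametrized test might look like. The test name and the exact assertions are hypothetical and are not the tests added in the PR; the dtype keyword is assumed to exist only in pandas versions that include this change, so the sketch skips otherwise.

```python
import inspect

import pandas as pd
import pytest


# String aliases are plain Python strings, so building this list does not
# import pyarrow when the test module is collected.
@pytest.mark.parametrize("dtype", ["uint8[pyarrow]", "int64[pyarrow]"])
def test_get_dummies_pyarrow_dtype(dtype):
    # Skip at run time (not at import time) when PyArrow is missing.
    pytest.importorskip("pyarrow")
    # Assumption: older pandas versions lack the dtype keyword, so skip there.
    if "dtype" not in inspect.signature(pd.Series.str.get_dummies).parameters:
        pytest.skip("str.get_dummies has no dtype keyword in this pandas")

    ser = pd.Series(["a|b", "a", "a|c"])
    result = ser.str.get_dummies("|", dtype=dtype)

    assert result.columns.tolist() == ["a", "b", "c"]
    # str() of an Arrow-backed dtype gives back the alias form.
    assert all(str(col_dtype) == dtype for col_dtype in result.dtypes)
```

Keeping the PyArrow check inside the test body is one option; the point is only that nothing at decoration time touches the pa namespace.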

rhshadrach

Looks good! Can you add a line in doc/source/whatsnew/v3.0.0.rst in the Other Enhancements section?

aaronchucarroll · 2024-09-09T20:44:12Z

Just added the info to whatsnew.

rhshadrach

lgtm

pandas/tests/strings/test_get_dummies.py

pandas/core/strings/accessor.py

mroeschke · 2024-09-09T21:51:34Z

pandas/core/strings/accessor.py

+        result, name = self._data.array._str_get_dummies(sep, dtype)
+        if is_extension_array_dtype(dtype) or isinstance(dtype, ArrowDtype):
+            return self._wrap_result(
+                DataFrame(result, columns=name, dtype=dtype),


I think you can use _wrap_result(result, name=name, dtype=dtype, expand=True) here instead to avoid the DataFrame import

aaronchucarroll · 2024-09-09

Making this change causes failures because numpy.ndarray does not take non-NumPy dtypes. It doesn't seem like _wrap_result handles this case.

pandas/core/strings/object_array.py

mroeschke · 2024-09-09T21:56:11Z

pandas/core/arrays/arrow/array.py

+        if dtype == str:
+            dummies[:] = False


Can you just put this logic before the dummies creation, i.e.

if dtype == str:
    dummies_dtype = np.bool_
dummies = ...

aaronchucarroll · 2024-09-09

The string types do not need to use a dummy dtype; they can handle boolean values. The issue is with the str type's interaction with np.zeros(), which considers the zero value to be the empty string.

doc/source/whatsnew/v3.0.0.rst

jorisvandenbossche · 2024-09-10T11:47:58Z

Can you also add a test with strings:

s = Series(["a|b", "a|c", np.nan])
result = s.str.get_dummies("|", dtype=str)
expected = DataFrame(
    [["T", "T", "F"], ["T", "F", "T"], ["F", "F", "F"]],
    columns=list("abc"),
    dtype=str,
)
tm.assert_frame_equal(result, expected)

Is there any specific reason we want the expected result to use "T" and "F" for strings? (compared to "True" and "False" as actual string repr of bools, or "1" and "0" if we could actually cast the standard integer result to strings)

jorisvandenbossche · 2024-09-10T11:55:35Z

Looking in the code, I see the technical reason for it. dtype=str is translated to numpy's U type, but we first create an empty array to then fill in, and at that point numpy makes that more concrete as a U1 type (length-1 strings), and then assigning boolean True/False values into that array gives the resulting "T" and "F".
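The NumPy behaviour described here can be reproduced in isolation. This is a standalone sketch with plain NumPy, not the pandas internals; the array length of 3 is arbitrary.

```python
import numpy as np

# dtype=str without a length resolves to a length-1 unicode dtype (U1),
# and the "zero" value of a string array is the empty string.
dummies = np.zeros(3, dtype=str)
empty_dtype = dummies.dtype
initial = dummies.tolist()

# Assigning booleans stores their string form truncated to one character.
dummies[:] = False
dummies[0] = True
truncated = dummies.tolist()

print(empty_dtype, initial, truncated)
```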

Now, that doesn't look very intentional, and I also notice that with the pyarrow string backend, it has a different result.
And with pd.get_dummies() (not the .str. variant), it seems we get yet another result:

In [3]: s = Series(["a", "b", "a", "c", np.nan])

In [4]: pd.get_dummies(s)
Out[4]: 
       a      b      c
0   True  False  False
1  False   True  False
2   True  False  False
3  False  False   True
4  False  False  False

In [5]: pd.get_dummies(s, dtype=str)
Out[5]: 
   a  b  c
0  1      
1     1   
2  1      
3        1
4  0  0  0

But that also looks wrong ... (at least inconsistent in empty string vs "0" for False cases depending on whether it was a missing value or not).

The dtype keyword in general is definitely useful, but I think mostly to ask for a different numeric data type (like a smaller integer bitwidth), so we could maybe also just disallow asking for string dtype (that would also avoid having to decide which of those string results is the best ..)
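For the numeric use case mentioned here, a minimal sketch with made-up data. The astype route works without the new keyword; the commented-out call assumes a pandas version that includes this PR.

```python
import numpy as np
import pandas as pd

ser = pd.Series(["a|b", "a", "a|c"])

# Works without the new keyword: build the dummies, then cast down.
small = ser.str.get_dummies(sep="|").astype(np.int8)

# Assumption: on a pandas version that includes this PR, the smaller
# dtype can be requested directly instead of casting afterwards.
# small = ser.str.get_dummies(sep="|", dtype=np.int8)

print(small.dtypes)
```

Passing the dtype up front avoids materialising the wider intermediate frame, which is the main practical difference from casting afterwards.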

rhshadrach · 2024-09-12T02:00:30Z

And with pd.get_dummies() (not the .str. variant), it seems we get yet another result:

I was under the impression that pd.get_dummies(..., dtype=str) would return T/F prior to this PR, but that doesn't seem to be the case. I'd be okay with raising on string dtype in str.get_dummies and deprecating it in pd.get_dummies.

@aaronchucarroll - would you be interested in doing a follow up addressing the above comments by @mroeschke and @jorisvandenbossche?

aaronchucarroll · 2024-09-12T16:16:22Z

Yes, I could do a follow-up but it may take me a bit of time to get to it. We want to allow dtype for numeric types only and raise an error on string dtype arguments? I will open a PR soon for this.

aaronchucarroll added 3 commits August 21, 2024 14:26

Add prefix, prefix_sep, dummy_na, and dtype args to StringMethods get…

e6f9527

…_dummies()

Fix import issue

dafb61d

Fix typing of dtype

bb79ef2

aaronchucarroll changed the title ~~Add prefix, prefix_sep, dummy_na, and dtype args to StringMethods get_dummies()~~ ENH: Add prefix, prefix_sep, dummy_na, and dtype args to StringMethods get_dummies() Aug 21, 2024

aaronchucarroll and others added 6 commits August 21, 2024 17:08

Fix NaN type issue

24be84f

Support categorical string backend

09b2fad

Fix dtype type hints

50ed90c

Add dtype to get_dummies docstring

9e95485

Fix get_dummies dtype docstring

9a47768

Merge branch 'main' into stringmethods-get-dummies

0c94bff

rhshadrach added the Enhancement and Strings (String extension data type and string data) labels Aug 25, 2024

rhshadrach reviewed Sep 3, 2024

aaronchucarroll added 2 commits September 3, 2024 15:52

remove changes for unnecessary args

9702bf7

Merge branch 'stringmethods-get-dummies' of https://github.com/aaronc…

8793516

…hucarroll/pandas into stringmethods-get-dummies

aaronchucarroll changed the title ~~ENH: Add prefix, prefix_sep, dummy_na, and dtype args to StringMethods get_dummies()~~ ENH: Add dtype argument to StringMethods get_dummies() Sep 3, 2024

Merge branch 'main' into stringmethods-get-dummies

bad1038

aaronchucarroll requested a review from rhshadrach September 3, 2024 21:00

rhshadrach requested changes Sep 4, 2024

aaronchucarroll and others added 8 commits September 5, 2024 00:07

parametrize dtype tests

163fe09

Merge branch 'stringmethods-get-dummies' of https://github.com/aaronc…

3d75fdc

…hucarroll/pandas into stringmethods-get-dummies

support pyarrow and nullable dtypes

d68bece

Merge branch 'main' into stringmethods-get-dummies

c2aa7d5

fix pyarrow import error

0fd2401

skip pyarrow tests when not present

920c865

split pyarrow tests

800f787

Merge branch 'main' into stringmethods-get-dummies

d8149e6

aaronchucarroll requested a review from rhshadrach September 5, 2024 18:58

rhshadrach requested changes Sep 7, 2024

aaronchucarroll added 6 commits September 7, 2024 15:02

parametrize pyarrow tests

6cbc3e8

change var name to dummies_dtype

532e139

fix string issue

cd5c2ab

consolidate conditionals

822b3f4

add tests for str and pyarrow strings

ba05a8d

skip pyarrow string tests if not present

37dddb8

aaronchucarroll requested a review from rhshadrach September 7, 2024 23:36

rhshadrach requested changes Sep 9, 2024

add info to whatsnew doc

6fbe183

change func to meth in doc info

87a1ee8

rhshadrach added this to the 3.0 milestone Sep 9, 2024

rhshadrach approved these changes Sep 9, 2024

rhshadrach merged commit 715585d into pandas-dev:main Sep 9, 2024
47 checks passed

mroeschke reviewed Sep 9, 2024

pandas/tests/strings/test_get_dummies.py

mroeschke reviewed Sep 9, 2024

pandas/core/strings/accessor.py

mroeschke reviewed Sep 9, 2024

pandas/core/strings/object_array.py

mroeschke reviewed Sep 9, 2024

doc/source/whatsnew/v3.0.0.rst

aaronchucarroll mentioned this pull request Sep 26, 2024

ENH: Series.str.get_dummies() raise on string type (follow up to PR #59577) #59786

Open

jorisvandenbossche mentioned this pull request Nov 18, 2024

TST (string dtype): clean-up assorted xfails #60354

Draft
