Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add dtype argument to StringMethods get_dummies() #59577

Merged
merged 28 commits into from
Sep 9, 2024

Conversation

aaronchucarroll
Copy link
Contributor

@aaronchucarroll aaronchucarroll commented Aug 21, 2024

Adding these args makes StringMethods get_dummies() similarly powerful to pd.get_dummies(), but usable in the common scenario that the strings are separated by some separator.

@aaronchucarroll aaronchucarroll changed the title Add prefix, prefix_sep, dummy_na, and dtype args to StringMethods get_dummies() ENH: Add prefix, prefix_sep, dummy_na, and dtype args to StringMethods get_dummies() Aug 21, 2024
@rhshadrach rhshadrach added Enhancement Strings String extension data type and string data labels Aug 25, 2024
@aaronchucarroll
Copy link
Contributor Author

@rhshadrach Is this PR under review at all? I believe its a helpful change to a function with currently limited functionality.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! A few questions on some of the added arguments. It seems to me this would make the API significantly larger without provided any features users can't already readily attain.

@@ -2409,6 +2419,17 @@ def get_dummies(self, sep: str = "|"):
----------
sep : str, default "|"
String to split on.
prefix : str, list of str, or dict of str, default None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't users just prefix the columns by calling .rename after the call to get_dummies?

Comment on lines 2427 to 2428
prefix_sep : str, default '_'
If appending prefix, separator/delimiter to use.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't users just add the separator to their prefix, e.g. prefix="prefix_"?

Comment on lines 2429 to 2430
dummy_na : bool, default False
Add a column to indicate NaNs, if False NaNs are ignored.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me users can already do this in a straightforward manner:

pd.concat([ser.str.get_dummies(), ser.isna().rename("NaN")], axis=1)

Is this not sufficient?

@aaronchucarroll
Copy link
Contributor Author

I don't disagree with that. I was only adding trivial arguments to mimic the behavior of pd.get_dummies. In the interest of simplicity, I can remove all those args. However, I think dtype is a pretty critical behavior.

@aaronchucarroll aaronchucarroll changed the title ENH: Add prefix, prefix_sep, dummy_na, and dtype args to StringMethods get_dummies() ENH: Add dtype argument to StringMethods get_dummies() Sep 3, 2024
@rhshadrach
Copy link
Member

I don't disagree with that. I was only adding trivial arguments to mimic the behavior of pd.get_dummies.

Ah, I see! Still, I think it might be better to deprecate prefix, prefix_sep, and columns from pd.get_dummies and rely on the user using rename and concat to accomplish their goal. Thanks for removing!

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking quite good, but will need it to work with PyArrow and nullable dtypes.

tm.assert_frame_equal(result, expected)
assert (result.dtypes == np.int8).all()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to assert this, it is covered by assert_frame_equal.

s = Series(["a", "b,name", "b"], dtype=any_string_dtype)
result = s.str.get_dummies(",")
expected = DataFrame([[1, 0, 0], [0, 1, 1], [0, 1, 0]], columns=["a", "b", "name"])
def test_get_dummies_int8_dtype():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you parametrize these tests with dtype.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition, can you add dtype=str, PyArrow, and nullable dtypes (e.g. "Int64"). Specifying PyArrow and nullable dtypes currently fails:

ser = pd.Series(["a|b", "a", "a|c"], dtype="string[pyarrow]")
ser.str.get_dummies(dtype=pd.ArrowDtype(pa.int64()))

but is successful with pd.get_dummies

pd.get_dummies(ser, dtype=pd.ArrowDtype(pa.int64()))

I think this will need to be fixed. You may find it necessary to have multiple tests - perhaps one for NumPy (which are already present), one for str, one for PyArrow etc. But just try to consolidate with pytest.parametrize as much as is reasonable.

@aaronchucarroll
Copy link
Contributor Author

I previously had the pyarrow tests parametrized, but the td.skip_if_no decorator does not seem to work concurrently with the pytest parametrize decorator. So I split the pyarrow tests by types.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add a tests with strings:

s = Series(["a|b", "a|c", np.nan])
result = s.str.get_dummies("|", dtype=str)
expected = DataFrame(
    [["T", "T", "F"], ["T", "F", "T"], ["F", "F", "F"]],
    columns=list("abc"),
    dtype=str,
)
tm.assert_frame_equal(result, expected)

and similarly with PyArrow strings (I think Pyarrow strings still fail)

Comment on lines 2490 to 2501
if is_extension_array_dtype(dtype):
return self._wrap_result(
DataFrame(result, columns=name, dtype=dtype),
name=name,
returns_string=False,
)
if isinstance(dtype, ArrowDtype):
return self._wrap_result(
DataFrame(result, columns=name, dtype=dtype),
name=name,
returns_string=False,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you consolidate these two using if ... or ...:

Comment on lines 2530 to 2535
dummy_dtype: NpDtype
if isinstance(_dtype, np.dtype):
dummy_dtype = _dtype
else:
dummy_dtype = np.bool_
dummies = np.zeros(n_rows * n_cols, dtype=dummy_dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is on the verge (and arguably, is) a nitpick, but I think it'd be better to call it dummies_dtype being the dtype of the dummies array throughout.

pandas/tests/strings/test_get_dummies.py Show resolved Hide resolved
@rhshadrach
Copy link
Member

rhshadrach commented Sep 7, 2024

I previously had the pyarrow tests parametrized, but the td.skip_if_no decorator does not seem to work concurrently with the pytest parametrize decorator. So I split the pyarrow tests by types.

The issue is just with parametrize - namely that @pytest.mark.parametrize(..., [ArrowDtype(pa.uint8()]) requires pa be imported which fails when PyArrow is not installed. You can instead use the string alias, e.g. "uint8[pyarrow]".

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Can you add a line in doc/source/whatsnew/v3.0.0.rst in the Other Enhancments section.

@aaronchucarroll
Copy link
Contributor Author

Just added the info to whatsnew.

@rhshadrach rhshadrach added this to the 3.0 milestone Sep 9, 2024
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@rhshadrach rhshadrach merged commit 715585d into pandas-dev:main Sep 9, 2024
47 checks passed
result, name = self._data.array._str_get_dummies(sep, dtype)
if is_extension_array_dtype(dtype) or isinstance(dtype, ArrowDtype):
return self._wrap_result(
DataFrame(result, columns=name, dtype=dtype),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can use _wrap_result(result, name=name, dtype=dtype, expand=True) here instead to avoid the DataFrame import

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making this change causes failures because the numpy.ndarray does not take non-numpy dtypes. It doesn't seem like _wrap_result handles this case.

Comment on lines +2536 to +2537
if dtype == str:
dummies[:] = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just put this logic before the dummies creation i.e.

if dtype == str:
    dummies_dtype = np.bool_
dummies = ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string types do not need to use a dummy dtype, they can handle boolean values. The issue is with the str type interaction with np.zeroes(), where it considers the zero value to be the empty string.

@jorisvandenbossche
Copy link
Member

Can you also add a tests with strings:

s = Series(["a|b", "a|c", np.nan])
result = s.str.get_dummies("|", dtype=str)
expected = DataFrame(
    [["T", "T", "F"], ["T", "F", "T"], ["F", "F", "F"]],
    columns=list("abc"),
    dtype=str,
)
tm.assert_frame_equal(result, expected)

Is there any specific reason we want the expected result to use "T" and "F" for strings? (compared to "True" and "False" as actual string repr of bools, or "1" and "0" if we could actually cast the standard integer result to strings)

@jorisvandenbossche
Copy link
Member

Looking in the code, I see the technical reason for it. dtype=str is translated to numpy's U type, but we first create an empty array to then fill in, and at that point numpy makes that more concrete as a U1 type (length-1 strings), and then assining boolean True/False values into that array gives the resulting "T" and "F".

Now, that doesn't look very intentional, and I also notice that with the pyarrow string backend, it has a different result.
And with pd.get_dummies() (not the .str. variant), it seems we get yet another result:

In [3]: s = Series(["a", "b", "a", "c", np.nan])

In [4]: pd.get_dummies(s)
Out[4]: 
       a      b      c
0   True  False  False
1  False   True  False
2   True  False  False
3  False  False   True
4  False  False  False

In [5]: pd.get_dummies(s, dtype=str)
Out[5]: 
   a  b  c
0  1      
1     1   
2  1      
3        1
4  0  0  0

But that also looks wrong ... (at least inconsistent in empty string vs "0" for False cases depending on whether it was a missing value or not).

The dtype keyword in general is definitely useful, but I think mostly to ask for a different numeric data type (like a smaller integer bitwidth), so we could maybe also just disallow asking for string dtype (that would also avoid having to decide which of those string results is the best ..)

@rhshadrach
Copy link
Member

rhshadrach commented Sep 12, 2024

And with pd.get_dummies() (not the .str. variant), it seems we get yet another result:

I was under the impression that pd.get_dummies(..., dtype=str) would return T/F prior to this PR, but that doesn't seem to be the case. I'd be okay with raising on string dtype in str.get_dummies and deprecate pd.get_dummies.

@aaronchucarroll - would you be interested in doing a follow up addressing the above comments by @mroeschke and @jorisvandenbossche?

@aaronchucarroll
Copy link
Contributor Author

Yes, I could do a follow-up but it may take me a bit of time to get to it. We want to allow dtype for numeric types only and raise an error on string dtype arguments? I will open a PR soon for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Allow different dtype in pandas.Series.str.get_dummies
4 participants