-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add dtype argument to StringMethods get_dummies() #59577
ENH: Add dtype argument to StringMethods get_dummies() #59577
Conversation
@rhshadrach Is this PR under review at all? I believe its a helpful change to a function with currently limited functionality. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! A few questions on some of the added arguments. It seems to me this would make the API significantly larger without provided any features users can't already readily attain.
pandas/core/strings/accessor.py
Outdated
@@ -2409,6 +2419,17 @@ def get_dummies(self, sep: str = "|"): | |||
---------- | |||
sep : str, default "|" | |||
String to split on. | |||
prefix : str, list of str, or dict of str, default None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't users just prefix the columns by calling .rename
after the call to get_dummies?
pandas/core/strings/accessor.py
Outdated
prefix_sep : str, default '_' | ||
If appending prefix, separator/delimiter to use. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't users just add the separator to their prefix, e.g. prefix="prefix_"
?
pandas/core/strings/accessor.py
Outdated
dummy_na : bool, default False | ||
Add a column to indicate NaNs, if False NaNs are ignored. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me users can already do this in a straightforward manner:
pd.concat([ser.str.get_dummies(), ser.isna().rename("NaN")], axis=1)
Is this not sufficient?
I don't disagree with that. I was only adding trivial arguments to mimic the behavior of pd.get_dummies. In the interest of simplicity, I can remove all those args. However, I think dtype is a pretty critical behavior. |
…hucarroll/pandas into stringmethods-get-dummies
Ah, I see! Still, I think it might be better to deprecate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking quite good, but will need it to work with PyArrow and nullable dtypes.
tm.assert_frame_equal(result, expected) | ||
assert (result.dtypes == np.int8).all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to assert this, it is covered by assert_frame_equal
.
s = Series(["a", "b,name", "b"], dtype=any_string_dtype) | ||
result = s.str.get_dummies(",") | ||
expected = DataFrame([[1, 0, 0], [0, 1, 1], [0, 1, 0]], columns=["a", "b", "name"]) | ||
def test_get_dummies_int8_dtype(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you parametrize these tests with dtype
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition, can you add dtype=str
, PyArrow, and nullable dtypes (e.g. "Int64"). Specifying PyArrow and nullable dtypes currently fails:
ser = pd.Series(["a|b", "a", "a|c"], dtype="string[pyarrow]")
ser.str.get_dummies(dtype=pd.ArrowDtype(pa.int64()))
but is successful with pd.get_dummies
pd.get_dummies(ser, dtype=pd.ArrowDtype(pa.int64()))
I think this will need to be fixed. You may find it necessary to have multiple tests - perhaps one for NumPy (which are already present), one for str
, one for PyArrow etc. But just try to consolidate with pytest.parametrize
as much as is reasonable.
…hucarroll/pandas into stringmethods-get-dummies
I previously had the pyarrow tests parametrized, but the td.skip_if_no decorator does not seem to work concurrently with the pytest parametrize decorator. So I split the pyarrow tests by types. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add a tests with strings:
s = Series(["a|b", "a|c", np.nan])
result = s.str.get_dummies("|", dtype=str)
expected = DataFrame(
[["T", "T", "F"], ["T", "F", "T"], ["F", "F", "F"]],
columns=list("abc"),
dtype=str,
)
tm.assert_frame_equal(result, expected)
and similarly with PyArrow strings (I think Pyarrow strings still fail)
pandas/core/strings/accessor.py
Outdated
if is_extension_array_dtype(dtype): | ||
return self._wrap_result( | ||
DataFrame(result, columns=name, dtype=dtype), | ||
name=name, | ||
returns_string=False, | ||
) | ||
if isinstance(dtype, ArrowDtype): | ||
return self._wrap_result( | ||
DataFrame(result, columns=name, dtype=dtype), | ||
name=name, | ||
returns_string=False, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you consolidate these two using if ... or ...:
pandas/core/arrays/arrow/array.py
Outdated
dummy_dtype: NpDtype | ||
if isinstance(_dtype, np.dtype): | ||
dummy_dtype = _dtype | ||
else: | ||
dummy_dtype = np.bool_ | ||
dummies = np.zeros(n_rows * n_cols, dtype=dummy_dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is on the verge (and arguably, is) a nitpick, but I think it'd be better to call it dummies_dtype
being the dtype of the dummies
array throughout.
The issue is just with parametrize - namely that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Can you add a line in doc/source/whatsnew/v3.0.0.rst
in the Other Enhancments
section.
Just added the info to whatsnew. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
result, name = self._data.array._str_get_dummies(sep, dtype) | ||
if is_extension_array_dtype(dtype) or isinstance(dtype, ArrowDtype): | ||
return self._wrap_result( | ||
DataFrame(result, columns=name, dtype=dtype), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can use _wrap_result(result, name=name, dtype=dtype, expand=True)
here instead to avoid the DataFrame import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making this change causes failures because the numpy.ndarray does not take non-numpy dtypes. It doesn't seem like _wrap_result handles this case.
if dtype == str: | ||
dummies[:] = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you just put this logic before the dummies
creation i.e.
if dtype == str:
dummies_dtype = np.bool_
dummies = ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The string types do not need to use a dummy dtype, they can handle boolean values. The issue is with the str type interaction with np.zeroes(), where it considers the zero value to be the empty string.
Is there any specific reason we want the expected result to use |
Looking in the code, I see the technical reason for it. Now, that doesn't look very intentional, and I also notice that with the pyarrow string backend, it has a different result.
But that also looks wrong ... (at least inconsistent in empty string vs "0" for False cases depending on whether it was a missing value or not). The |
I was under the impression that @aaronchucarroll - would you be interested in doing a follow up addressing the above comments by @mroeschke and @jorisvandenbossche? |
Yes, I could do a follow-up but it may take me a bit of time to get to it. We want to allow dtype for numeric types only and raise an error on string dtype arguments? I will open a PR soon for this. |
Adding these args makes StringMethods get_dummies() similarly powerful to pd.get_dummies(), but usable in the common scenario that the strings are separated by some separator.
dtype
inpandas.Series.str.get_dummies
#47872