-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF (string): de-duplicate str_endswith, startswith #59568
REF (string): de-duplicate str_endswith, startswith #59568
Conversation
The question also is to what extent we necessarily want to keep this difference. I don't have a strong opinion on what |
mask=isna(self._pa_array), | ||
) | ||
else: | ||
# For empty tuple, pd.StringDtype() returns null for missing values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# For empty tuple, pd.StringDtype() returns null for missing values | |
# For empty tuple, ArrowDtype(string) returns null for missing values |
I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, this is the existing comment in the ArrowEA method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be an existing comment, but isn't it wrong? This is in the else
path and so the code path for ArrowDtype(string), as far as I understand? (For StringDtype, _object_compat
will be set to True?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i interpreted it as "StringDtype does this other thing [...] in contrast to what we do here"
But looking at the implementations (and trying it with N=1 example) i think they might actually behave the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the code, I think that is indeed calculating the same thing. Small equivalent example:
In [7]: arr = pa.array(["a", None, "b"])
In [8]: pa.array(np.zeros(len(arr), dtype=np.bool_), mask=pd.array(arr).isna())
Out[8]:
<pyarrow.lib.BooleanArray object at 0x7fbd98e303a0>
[
false,
null,
false
]
In [9]: pc.if_else(pc.is_null(arr), None, False)
Out[9]:
<pyarrow.lib.BooleanArray object at 0x7fbd98ec1360>
[
false,
null,
false
]
I don't see a way that this can be different. So the question is which one is the most efficient. Timing both, the second seems to be the best option:
In [10]: arr = pa.array(["a", None, "b"]*100_000)
In [18]: %timeit pa.array(np.zeros(len(arr), dtype=np.bool_), mask=arr.is_null().to_numpy(zero_copy_only=False))
711 µs ± 1.96 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [20]: %timeit pc.if_else(pc.is_null(arr), None, False)
25.5 µs ± 94.4 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great! updated to remove the need for the comment altogether. let's see if the code check job likes it this time
This is my general hope for |
I don't have an opinion on the behavior, but would prefer to keep behavior-changing PRs separate from REF/de-dup PRs. |
f91bd67
to
0df2005
Compare
i dont understand these complaints |
7cf5d4f
to
08691d4
Compare
lint complaints fixed! |
|
||
def __init__(self, *args, **kwargs) -> None: | ||
raise NotImplementedError | ||
|
||
def _result_converter(self, values, na=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def _result_converter(self, values, na=None): | |
def _convert_bool_result(self, values, na=None): |
Or something that is more specific about it being for bool dtype (in #59616 I was renaming it to _predicate_result_converter
)
And then can use a consistent scheme in #59562 for integer conversion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have a branch doing this renaming for all the affected methods. for now am keeping the existing naming scheme
pandas/core/arrays/string_arrow.py
Outdated
@@ -278,8 +278,11 @@ def astype(self, dtype, copy: bool = True): | |||
|
|||
# ------------------------------------------------------------------------ | |||
# String methods interface | |||
_object_compat = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed now since it was removed from the mixin? (at least for now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, will update
0a9cec5
to
cdaa99b
Compare
I think comments have been addressed here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
This adds a _object_compat attribute to govern whether the subclass uses object-dtype semantics in cases where they differ.update_object_compat turned out to be unnecessary.