Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF (string): de-duplicate str_endswith, startswith #59568

Merged
merged 5 commits into from
Aug 29, 2024

Conversation

jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Aug 20, 2024

This adds a _object_compat attribute to govern whether the subclass uses object-dtype semantics in cases where they differ.update_object_compat turned out to be unnecessary.

@mroeschke mroeschke added Refactor Internal refactoring of code Strings String extension data type and string data labels Aug 21, 2024
@jorisvandenbossche
Copy link
Member

This adds a _object_compat attribute to govern whether the subclass uses object-dtype semantics in cases where they differ.

The question also is to what extent we necessarily want to keep this difference. I don't have a strong opinion on what ArrowDtype(pa.string()) should do (I am also fine with it being a more "strictly using pyarrow" version, for now), but aligning the behaviour would simplify the implementation .. (cc @mroeschke)

mask=isna(self._pa_array),
)
else:
# For empty tuple, pd.StringDtype() returns null for missing values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# For empty tuple, pd.StringDtype() returns null for missing values
# For empty tuple, ArrowDtype(string) returns null for missing values

I think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, this is the existing comment in the ArrowEA method

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be an existing comment, but isn't it wrong? This is in the else path and so the code path for ArrowDtype(string), as far as I understand? (For StringDtype, _object_compat will be set to True?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i interpreted it as "StringDtype does this other thing [...] in contrast to what we do here"

But looking at the implementations (and trying it with N=1 example) i think they might actually behave the same?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the code, I think that is indeed calculating the same thing. Small equivalent example:

In [7]: arr = pa.array(["a", None, "b"])

In [8]: pa.array(np.zeros(len(arr), dtype=np.bool_), mask=pd.array(arr).isna())
Out[8]: 
<pyarrow.lib.BooleanArray object at 0x7fbd98e303a0>
[
  false,
  null,
  false
]

In [9]: pc.if_else(pc.is_null(arr), None, False)
Out[9]: 
<pyarrow.lib.BooleanArray object at 0x7fbd98ec1360>
[
  false,
  null,
  false
]

I don't see a way that this can be different. So the question is which one is the most efficient. Timing both, the second seems to be the best option:

In [10]: arr = pa.array(["a", None, "b"]*100_000)

In [18]: %timeit pa.array(np.zeros(len(arr), dtype=np.bool_), mask=arr.is_null().to_numpy(zero_copy_only=False))
711 µs ± 1.96 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [20]: %timeit pc.if_else(pc.is_null(arr), None, False)
25.5 µs ± 94.4 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great! updated to remove the need for the comment altogether. let's see if the code check job likes it this time

@mroeschke
Copy link
Member

I don't have a strong opinion on what ArrowDtype(pa.string()) should do (I am also fine with it being a more "strictly using pyarrow" version, for now)

This is my general hope for ArrowExtensionArray and it's behaviors generally to match pyarrow as closely as possible.

@jbrockmendel
Copy link
Member Author

The question also is to what extent we necessarily want to keep this difference. I don't have a strong opinion on what ArrowDtype(pa.string()) should do (I am also fine with it being a more "strictly using pyarrow" version, for now), but aligning the behaviour would simplify the implementation

I don't have an opinion on the behavior, but would prefer to keep behavior-changing PRs separate from REF/de-dup PRs.

@jbrockmendel
Copy link
Member Author

  /home/runner/work/pandas/pandas/pandas/core/arrays/_arrow_string_mixins.py:129:16 - error: Invalid conditional operand of type "bool | NDArray[bool_] | NDFrame"
    Method __bool__ for type "NDFrame" returns type "NoReturn" rather than "bool" (reportGeneralTypeIssues)
  /home/runner/work/pandas/pandas/pandas/core/arrays/_arrow_string_mixins.py:154:16 - error: Invalid conditional operand of type "bool | NDArray[bool_] | NDFrame"
    Method __bool__ for type "NDFrame" returns type "NoReturn" rather than "bool" (reportGeneralTypeIssues)

i dont understand these complaints

@jbrockmendel jbrockmendel force-pushed the ref-startswith branch 2 times, most recently from 7cf5d4f to 08691d4 Compare August 27, 2024 14:38
@jbrockmendel
Copy link
Member Author

lint complaints fixed!


def __init__(self, *args, **kwargs) -> None:
raise NotImplementedError

def _result_converter(self, values, na=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _result_converter(self, values, na=None):
def _convert_bool_result(self, values, na=None):

Or something that is more specific about it being for bool dtype (in #59616 I was renaming it to _predicate_result_converter)

And then can use a consistent scheme in #59562 for integer conversion

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have a branch doing this renaming for all the affected methods. for now am keeping the existing naming scheme

@@ -278,8 +278,11 @@ def astype(self, dtype, copy: bool = True):

# ------------------------------------------------------------------------
# String methods interface
_object_compat = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed now since it was removed from the mixin? (at least for now)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, will update

@mroeschke mroeschke added this to the 2.3 milestone Aug 28, 2024
@jbrockmendel
Copy link
Member Author

I think comments have been addressed here

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@jorisvandenbossche jorisvandenbossche merged commit 27c7d51 into pandas-dev:main Aug 29, 2024
47 checks passed
@jbrockmendel jbrockmendel deleted the ref-startswith branch August 29, 2024 14:10
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Oct 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backported Refactor Internal refactoring of code Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants