REF (string): de-duplicate str_endswith, startswith #59568

jbrockmendel · 2024-08-20T20:55:31Z

~~This adds a _object_compat attribute to govern whether the subclass uses object-dtype semantics in cases where they differ.~~ Update: _object_compat turned out to be unnecessary.

pandas/core/arrays/string_arrow.py

jorisvandenbossche · 2024-08-21T20:20:54Z

This adds a _object_compat attribute to govern whether the subclass uses object-dtype semantics in cases where they differ.

The question also is to what extent we necessarily want to keep this difference. I don't have a strong opinion on what ArrowDtype(pa.string()) should do (I am also fine with it being a more "strictly using pyarrow" version, for now), but aligning the behaviour would simplify the implementation. (cc @mroeschke)

jorisvandenbossche · 2024-08-21T20:22:20Z

pandas/core/arrays/_arrow_string_mixins.py

+                        mask=isna(self._pa_array),
+                    )
+                else:
+                    # For empty tuple, pd.StringDtype() returns null for missing values


Suggested change

-                    # For empty tuple, pd.StringDtype() returns null for missing values
+                    # For empty tuple, ArrowDtype(string) returns null for missing values

I think?

no, this is the existing comment in the ArrowEA method

It might be an existing comment, but isn't it wrong? This is in the else path and so the code path for ArrowDtype(string), as far as I understand? (For StringDtype, _object_compat will be set to True?)

I interpreted it as "StringDtype does this other thing [...] in contrast to what we do here".

But looking at the implementations (and trying it with an N=1 example), I think they might actually behave the same?

Looking at the code, I think that is indeed calculating the same thing. Small equivalent example:

    In [7]: arr = pa.array(["a", None, "b"])

    In [8]: pa.array(np.zeros(len(arr), dtype=np.bool_), mask=pd.array(arr).isna())
    Out[8]:
    <pyarrow.lib.BooleanArray object at 0x7fbd98e303a0>
    [
      false,
      null,
      false
    ]

    In [9]: pc.if_else(pc.is_null(arr), None, False)
    Out[9]:
    <pyarrow.lib.BooleanArray object at 0x7fbd98ec1360>
    [
      false,
      null,
      false
    ]

I don't see a way that this can be different. So the question is which one is the most efficient. Timing both, the second seems to be the best option:

    In [10]: arr = pa.array(["a", None, "b"]*100_000)

    In [18]: %timeit pa.array(np.zeros(len(arr), dtype=np.bool_), mask=arr.is_null().to_numpy(zero_copy_only=False))
    711 µs ± 1.96 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

    In [20]: %timeit pc.if_else(pc.is_null(arr), None, False)
    25.5 µs ± 94.4 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

great! updated to remove the need for the comment altogether. let's see if the code check job likes it this time

mroeschke · 2024-08-21T20:47:41Z

I don't have a strong opinion on what ArrowDtype(pa.string()) should do (I am also fine with it being a more "strictly using pyarrow" version, for now)

This is my general hope: for ArrowExtensionArray and its behaviors generally to match pyarrow as closely as possible.

jbrockmendel · 2024-08-22T16:24:35Z

The question also is to what extent we necessarily want to keep this difference. I don't have a strong opinion on what ArrowDtype(pa.string()) should do (I am also fine with it being a more "strictly using pyarrow" version, for now), but aligning the behaviour would simplify the implementation

I don't have an opinion on the behavior, but would prefer to keep behavior-changing PRs separate from REF/de-dup PRs.

jbrockmendel · 2024-08-22T22:50:22Z

  /home/runner/work/pandas/pandas/pandas/core/arrays/_arrow_string_mixins.py:129:16 - error: Invalid conditional operand of type "bool | NDArray[bool_] | NDFrame"
    Method __bool__ for type "NDFrame" returns type "NoReturn" rather than "bool" (reportGeneralTypeIssues)
  /home/runner/work/pandas/pandas/pandas/core/arrays/_arrow_string_mixins.py:154:16 - error: Invalid conditional operand of type "bool | NDArray[bool_] | NDFrame"
    Method __bool__ for type "NDFrame" returns type "NoReturn" rather than "bool" (reportGeneralTypeIssues)

I don't understand these complaints.
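
A possible reading of the message, as a hedged sketch: pyright reports the operand's static type as `bool | NDArray[bool_] | NDFrame`, and truth-testing an NDFrame (a Series or DataFrame) always raises, which is why its `__bool__` is annotated as `NoReturn`. The snippet below assumes the flagged lines truth-test the result of an `isna(...)` call on a scalar `na` argument; that is an assumption, since the lines themselves are not shown here.

```python
import pandas as pd

# At runtime, isna on a scalar yields a plain truth value, so `if` is safe.
na = None
scalar_result = pd.isna(na)
if scalar_result:
    print("na is missing")

# The static union also covers Series/DataFrame, and truth-testing those
# raises at runtime. This is the case the type checker is warning about.
raised = False
try:
    bool(pd.isna(pd.Series(["a", None])))
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```

So the complaint is about the declared union type rather than about what happens when `na` is a scalar.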

jbrockmendel · 2024-08-27T16:46:20Z

lint complaints fixed!

jorisvandenbossche · 2024-08-27T16:59:42Z

pandas/core/arrays/_arrow_string_mixins.py


    def __init__(self, *args, **kwargs) -> None:
        raise NotImplementedError

+    def _result_converter(self, values, na=None):


Suggested change

-    def _result_converter(self, values, na=None):
+    def _convert_bool_result(self, values, na=None):

Or something that is more specific about it being for bool dtype (in #59616 I was renaming it to _predicate_result_converter)

And then can use a consistent scheme in #59562 for integer conversion

I have a branch doing this renaming for all the affected methods. For now I am keeping the existing naming scheme.

jorisvandenbossche · 2024-08-27T17:05:19Z

pandas/core/arrays/string_arrow.py

@@ -278,8 +278,11 @@ def astype(self, dtype, copy: bool = True):

    # ------------------------------------------------------------------------
    # String methods interface
+    _object_compat = True


This can be removed now since it was removed from the mixin? (at least for now)

yep, will update

jbrockmendel · 2024-08-28T20:01:49Z

I think comments have been addressed here

jorisvandenbossche

Thanks!

mroeschke added the Refactor (Internal refactoring of code) and Strings (String extension data type and string data) labels Aug 21, 2024

jorisvandenbossche reviewed Aug 21, 2024

pandas/core/arrays/string_arrow.py

jorisvandenbossche reviewed Aug 21, 2024

jbrockmendel force-pushed the ref-startswith branch from f91bd67 to 0df2005 August 22, 2024 19:11

jbrockmendel force-pushed the ref-startswith branch 2 times, most recently from 7cf5d4f to 08691d4 August 27, 2024 14:38

jbrockmendel mentioned this pull request Aug 27, 2024

REF (string): de-duplicate ArrowStringArray methods #59555

Merged

5 tasks

jorisvandenbossche reviewed Aug 27, 2024

mroeschke added this to the 2.3 milestone Aug 28, 2024

jbrockmendel added 5 commits August 28, 2024 11:04

REF (string): de-duplicate str_endswith, startswith

8dfb6cb

specify override

78bc224

No need for _object_compat

dcb77b6

pyright ignore

16ac7fd

CLN: remove no-longer-needed object_compat

cdaa99b

jbrockmendel force-pushed the ref-startswith branch from 0a9cec5 to cdaa99b August 28, 2024 18:04

jorisvandenbossche approved these changes Aug 29, 2024

jorisvandenbossche merged commit 27c7d51 into pandas-dev:main Aug 29, 2024
47 checks passed

jbrockmendel deleted the ref-startswith branch August 29, 2024 14:10

jorisvandenbossche added the backported label Oct 10, 2024

jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Oct 10, 2024

REF (string): de-duplicate str_endswith, startswith (pandas-dev#59568)

3121121

jorisvandenbossche pushed a commit that referenced this pull request Oct 10, 2024

REF (string): de-duplicate str_endswith, startswith (#59568)

807d8d5
