Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial Backport of string changes for 2.3 release #59513

Merged
merged 52 commits into from
Oct 9, 2024

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Aug 14, 2024

Specific backports or the 2.3.0 release (#59664), for the string dtype )#54792)

@rhshadrach rhshadrach added the Strings String extension data type and string data label Aug 14, 2024
@rhshadrach rhshadrach added this to the 2.2.3 milestone Aug 14, 2024
@WillAyd WillAyd force-pushed the 2.3.0 branch 3 times, most recently from e126806 to a2c3db0 Compare August 15, 2024 17:54
@WillAyd WillAyd requested a review from MarcoGorelli as a code owner August 15, 2024 17:54
@WillAyd
Copy link
Member Author

WillAyd commented Aug 15, 2024

I'm unsure about the failing tests. Seems like we might have behavior in our string types that works off of what is done in #57361 but guessing we don't want to backport that PR to 2.3?

@WillAyd
Copy link
Member Author

WillAyd commented Aug 21, 2024

A lot of failures seem related to the handling of to_numeric with errors="ignore", which was removed for the 3.0 release.

Prior to #57361, this is what a to_numeric call would yield:

>>> pd.to_numeric('-47393996303418497800', errors="ignore")
'-47393996303418497800'

with all of the backport, it looks like the new behavior casts it to a float:

>>> pd.to_numeric('-47393996303418497800', errors="ignore")
-4.73939963e+19

@WillAyd WillAyd removed this from the 2.2.3 milestone Aug 22, 2024
@WillAyd WillAyd force-pushed the 2.3.0 branch 2 times, most recently from 5563064 to 5583700 Compare August 22, 2024 19:51
@WillAyd
Copy link
Member Author

WillAyd commented Aug 22, 2024

@mroeschke could I get some of your help on the test failures? A lot of them look related to NumPy and have to do with how it handles overflows, casting, and some syntax things with numexpr.

Any chance you saw those on main and know what the fix is? https://github.com/pandas-dev/pandas/actions/runs/10514684927/job/29133089833?pr=59513#step:8:2013

@jbrockmendel
Copy link
Member

@WillAyd looking at the failures in the first build, i think #58484 may need to be backported

@mroeschke
Copy link
Member

I don't remember seeing failures related to eval but I'll dig a little bit more.

In addition to what @jbrockmendel mentioned, you'll need #59545 and the nomkl install part of #59553

@jbrockmendel
Copy link
Member

Getting close here

___________________ TestBasic.test_write_index[fastparquet] ____________________
[gw2] linux -- Python 3.12.5 /home/runner/micromamba/envs/test/bin/python3.12
[XPASS(strict)] fastparquet write into index

@bnavigator
Copy link
Contributor

I don't remember seeing failures related to eval but I'll dig a little bit more.

Already reported in May (und unfortunately not picked up by anyone) here: #58548. It tracks the failures in main and sees a fix somewhere between a3e751c and 236d89b.

Fixed by backport here: #59535

@WillAyd
Copy link
Member Author

WillAyd commented Aug 27, 2024

It tracks the failures in main and sees a fix somewhere between a3e751c and 236d89b.

I don't believe that issue was directly solved, but it was at least removed from the test suite as part of changes in https://github.com/pandas-dev/pandas/pull/56709/files#diff-f145e441b5820d235a78589ee9ee6c9c49fea0de6ca659a972b61b6aa29f9df8 that replaced the fixture of:

@pytest.mark.parametrize("dtype", [np.float32, np.float64])

with a fixture that instead uses:

[float, "float32", "float64"]

@lithomas1
Copy link
Member

lithomas1 commented Sep 7, 2024

@WillAyd

Mind if I push some fixes for the non string dtype builds to your branch?
I finally got to taking a look at this PR and I have a couple of fixes locally.

@WillAyd
Copy link
Member Author

WillAyd commented Sep 7, 2024

Sure by all means. FWIW I've been trying to keep the order of cherry picked commits maintained and put any custom patches at the end

jbrockmendel and others added 26 commits October 7, 2024 13:45
…andas-dev#59514)

* REF (string): Move StringArrayNumpySemantics methods to base class

* mypy fixup
* REF (string): remove _str_na_value

* mypy fixup
…ss (pandas-dev#59501)

* REF: move ArrowStringArrayNumpySemantics methods to parent class

* REF: move methods to ArrowStringArray

* mypy fixup

* Fix incorrect double-unpacking

* move methods to subclass
…pandas-dev#59526)

* API (string): return str dtype for .dt methods, DatetimeIndex methods

* mypy fixup
…maybe_converts_object`) if requested (pandas-dev#59487)

* String dtype: maybe_converts_object give precedence to nullable dtype

* update datetimelike input validation

* update tests and remove xfails

* explicitly test pd.array() behaviour (remove xfail)

* fixup allow_2d

* undo changes related to datetimelike input validation

* fix test for str on current main

---------

Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Build Library building on various platforms Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants