
ENH: support Arrow PyCapsule Interface on Series for export #59587

Merged: 3 commits into pandas-dev:main on Aug 26, 2024

Conversation

@MarcoGorelli (Member)

Comment on lines 584 to 589
    # todo: how should this be supported?
    msg = (
        "Passing `requested_schema` to `Series.__arrow_c_stream__` is not yet "
        "supported"
    )
    raise NotImplementedError(msg)
@MarcoGorelli (Member Author):

@kylebarron @jorisvandenbossche @WillAyd @PyCapsuleGang how should this be handled? I was looking at the Polars implementation, and there are no tests there where requested_schema is not None.

Member:

I think this is fine; I believe we'd have to do a lower level implementation to unpack the requested_schema capsule anyway

@kylebarron (Contributor) commented Aug 23, 2024:

In a general sense, you can ignore it.

The callee should attempt to provide the data in the requested schema. However, if the callee cannot provide the data in the requested schema, they may return with the same schema as if None were passed to requested_schema.

https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#schema-requests

However, in this case you should just delegate to pyarrow's implementation

        ca = pa.chunked_array([pa.Array.from_pandas(self, type=requested_schema)])
        return ca.__arrow_c_stream__(requested_schema)

@mroeschke added the "Arrow" label (pyarrow functionality) on Aug 23, 2024
@WillAyd (Member) left a comment:

Looks pretty good, although I am still worried that we are using __arrow_c_stream__ when most of our types fit better with __arrow_c_array__.

This seems pedantic, but it affects binary data exchange, so I want to be careful. Right now I'm trying to think about what the future looks like for pandas, where we might actually have a ChunkedSeries or another type for which streaming actually makes sense, and how that could get tripped up by the standard Series supporting this dunder.

Comment on lines:

    "supported"
    )
    raise NotImplementedError(msg)
    ca = pa.chunked_array([pa.Array.from_pandas(self, type=requested_schema)])
@WillAyd (Member):

The pyarrow types already use a ChunkedArray for storage, right? I think we can short-circuit on that (or, in a larger PR, reassess why we use ChunkedArray for storage).

Member:

As Will said, I think we should short-circuit (or special case it) for when the pyarrow array already is a chunked array.

Right now, if you have a column using e.g. StringDtype with pyarrow storage, which uses chunked arrays under the hood, the above will apparently concatenate the chunks, so the conversion will not be zero-copy in a case where you would actually expect it to be (and this is exactly the case that makes us use __arrow_c_stream__ instead of __arrow_c_array__ in the first place).

@MarcoGorelli marked this pull request as ready for review on August 23, 2024, 18:51
@mroeschke added this to the 3.0 milestone on Aug 26, 2024
@mroeschke merged commit bb4ab4f into pandas-dev:main on Aug 26, 2024
47 checks passed
@mroeschke (Member):

Thanks @MarcoGorelli

Comment on lines:

        PyCapsule
        """
        pa = import_optional_dependency("pyarrow", min_version="16.0.0")
        ca = pa.chunked_array([pa.Array.from_pandas(self, type=requested_schema)])
Member:

Something else: passing requested_schema is not going to work like this, I think. See how I first converted it to a pyarrow object before passing it on:

pandas/pandas/core/frame.py, lines 985 to 986 in bb4ab4f:

    if requested_schema is not None:
        requested_schema = pa.Schema._import_from_c_capsule(requested_schema)

You can use the same approach here, but with pa.DataType instead of pa.Schema.


Comment on lines:

    ca = pa.chunked_array(s)
    expected = pa.chunked_array([[1, 4, 2]])
    assert ca.equals(expected)
Member:

Best to add a test case here specifying the type (to cover the requested_schema part). Something like:

    arr = pa.array(s, type=pa.int32())
    expected = pa.array([1, 4, 2], pa.int32())
    assert arr.equals(expected)

Member:

(But then using chunked_array() instead of array(), because array(...) doesn't actually work if we only define __arrow_c_stream__ and not __arrow_c_array__.)

@WillAyd (Member) commented Aug 26, 2024:

Absolutely my mistake for not putting up a "changes requested," but I don't think this was ready to be merged. @jorisvandenbossche has some great points that need to be addressed, and I still want to discuss a bit more about how we feel the pd.Series should be exported through the Arrow C Data interface.

As mentioned before, I am a little wary about having developers write extensions that treat the Series as a stream (when in the vast majority of current use cases it is not), and how that may impact us as both a consumer and producer of Arrow C data.

@WillAyd (Member) commented Aug 26, 2024:

To talk more practically, I am wondering about a scenario where we have a series that holds a chunked array. In this PR, we convert that to a singular array before then exposing it back as a stream, but the other point of view is that we could allow the stream to iterate over each chunk.

But then the question becomes: what happens when that same Series gets used in a DataFrame? Does the DataFrame iterate its chunks? In most cases that seems highly unlikely to be possible (unless all the other Series of the DataFrame share the same chunk layout), so you get a rather interesting scenario where iterating by DataFrame could be far more expensive than iterating the Series.

The more I think it through, the more I lean towards -1 on supporting the stream interface for Series; I think we should just expose it as an array for now.

@mroeschke (Member):

Sorry if I merged too soon. Yeah, I think generally our ArrowEA should just be backed by a contiguous pyarrow.array instead of a pyarrow.chunked_array, especially if that makes it easier to decide to implement array over stream.

IIRC there are some ops where we need to call .combine_chunks beforehand, so it would be nice to avoid that copy. I'm not sure there are any ops in pandas where operating on chunks is more optimal than operating on the whole array.

@MarcoGorelli (Member Author):

Ah, sorry about this; I should've marked it as draft whilst there was still some discussion underway.

I'll open a follow-up shortly and we can iterate on the required changes there (I think that's simpler than reverting and starting again).

@jorisvandenbossche (Member):

The fixes for the current implementation can indeed be done in a follow-up PR.
And let's continue the discussion about stream vs array interface for Series in the original issue (I'll reopen it).

> Yeah I think generally our ArrowEA should just be backed by a contiguous pyarrow.array instead of a pyarrow.chunked_array, especially if it makes it easier to decide to implement array over stream.

If we want to consider changing that, let's open a new dedicated issue about it (we discussed the pros/cons quite extensively in the past, and it has much larger impact beyond this stream interface)

Labels: Arrow (pyarrow functionality)
Linked issue: ENH: support the Arrow PyCapsule Interface on pandas.Series (export)
5 participants