
ENH: support Arrow PyCapsule Interface on Series for export #59587

Merged: 3 commits into pandas-dev:main on Aug 26, 2024

Conversation

@MarcoGorelli (Member)

Comment on lines 584 to 589
    # todo: how should this be supported?
    msg = (
        "Passing `requested_schema` to `Series.__arrow_c_stream__` is not yet "
        "supported"
    )
    raise NotImplementedError(msg)
@MarcoGorelli (Member Author):

@kylebarron @jorisvandenbossche @WillAyd @PyCapsuleGang how should this be handled? I was looking at the Polars implementation, and there are no tests there where requested_schema is not None.

Member:

I think this is fine; I believe we'd have to do a lower level implementation to unpack the requested_schema capsule anyway

@kylebarron (Contributor) commented Aug 23, 2024:

In a general sense, you can ignore it.

The callee should attempt to provide the data in the requested schema. However, if the callee cannot provide the data in the requested schema, they may return with the same schema as if None were passed to requested_schema.

https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#schema-requests

However, in this case you should just delegate to pyarrow's implementation

        ca = pa.chunked_array([pa.Array.from_pandas(self, type=requested_schema)])
        return ca.__arrow_c_stream__(requested_schema)

@mroeschke added the "Arrow" label (pyarrow functionality) on Aug 23, 2024
@WillAyd (Member) left a comment:

Looks pretty good, although I am still worried that we are using __arrow_c_stream__ when most of our types fit better with __arrow_c_array__.

This seems pedantic, but it affects binary data exchange, so I want to be careful. Right now I'm trying to think about what the future looks like for pandas, where we might actually have a ChunkedSeries or another type for which streaming actually makes sense, and how that could get tripped up by the standard Series supporting this dunder.

Comment on lines:

    "supported"
    )
    raise NotImplementedError(msg)
    ca = pa.chunked_array([pa.Array.from_pandas(self, type=requested_schema)])
@WillAyd (Member):

The pyarrow types already use a ChunkedArray for storage, right? I think we can short-circuit on that (or, in a larger PR, reassess why we use ChunkedArray for storage).

Member:

As Will said, I think we should short-circuit (or special case it) for when the pyarrow array already is a chunked array.

Right now, if you have a column using e.g. StringDtype with pyarrow storage, which uses chunked arrays under the hood, the above will apparently concatenate the chunks, so the conversion will not be zero-copy in a case where you would actually expect it to be (and this is exactly the case that makes us use __arrow_c_stream__ instead of __arrow_c_array__ in the first place).

@MarcoGorelli marked this pull request as ready for review on August 23, 2024, 18:51
@mroeschke added this to the 3.0 milestone on Aug 26, 2024
@mroeschke merged commit bb4ab4f into pandas-dev:main on Aug 26, 2024
47 checks passed
@mroeschke (Member):

Thanks @MarcoGorelli

Comment on lines:

        PyCapsule
        """
        pa = import_optional_dependency("pyarrow", min_version="16.0.0")
        ca = pa.chunked_array([pa.Array.from_pandas(self, type=requested_schema)])
Member:

Something else: passing requested_schema is not going to work like this, I think. See how I first converted it to a pyarrow object before passing it on:

pandas/pandas/core/frame.py, lines 985 to 986 in bb4ab4f:

    if requested_schema is not None:
        requested_schema = pa.Schema._import_from_c_capsule(requested_schema)

You can use the same approach here, but with pa.DataType instead of pa.Schema.


Comment on lines:

    ca = pa.chunked_array(s)
    expected = pa.chunked_array([[1, 4, 2]])
    assert ca.equals(expected)
Member:

Best to add a test case here specifying the type (to cover the requested_schema part). Something like:

    arr = pa.array(s, type=pa.int32())
    expected = pa.array([1, 4, 2], pa.int32())
    assert arr.equals(expected)

Member:

(But then using chunked_array() instead of array(), because array(...) doesn't actually work if we only define __arrow_c_stream__ and not __arrow_c_array__.)

@WillAyd (Member) commented Aug 26, 2024:

Absolutely my mistake for not putting up a "changes requested," but I don't think this was ready to be merged. @jorisvandenbossche has some great points that need to be addressed, and I still want to discuss a bit more about how we feel the pd.Series should be exported through the Arrow C Data interface.

As mentioned before, I am a little wary about having developers write extensions that treat the Series as a stream (when in the vast majority of current use cases it is not), and how that may impact us as both a consumer and producer of Arrow C data.

@WillAyd (Member) commented Aug 26, 2024:

To talk more practically, I am wondering about a scenario where we have a series that holds a chunked array. In this PR, we convert that to a singular array before then exposing it back as a stream, but the other point of view is that we could allow the stream to iterate over each chunk.

But then the question becomes: what happens when that same Series gets used in a DataFrame? Does the DataFrame iterate its chunks? In most cases that seems highly unlikely to be possible (unless all the other Series of the DataFrame share the same chunk layout), so you get a rather interesting scenario where iterating by DataFrame could be far more expensive than iterating the Series.

The more I think it through, the more I lean towards -1 on supporting the stream interface for Series; I think we should just expose it as an array for now.

@mroeschke (Member):

Sorry if I merged too soon. Yeah, I think generally our ArrowEA should just be backed by a contiguous pyarrow.array instead of a pyarrow.chunked_array, especially if that makes it easier to decide to implement array over stream.

IIRC there are some ops where we need to call .combine_chunks beforehand, so it would be nice to avoid that copy. I'm not sure there are any ops in pandas where operating on chunks is more optimal than operating on the whole array.

@MarcoGorelli (Member Author):

Ah, sorry about this; I should've marked it as draft whilst there was still some discussion underway.

I'll open a follow-up shortly and we can iterate on the required changes there (I think that's simpler than reverting and starting again).

@jorisvandenbossche (Member):

The fixes for the current implementation can indeed be done in a follow-up PR.
And let's continue the discussion about stream vs array interface for Series in the original issue (I'll reopen it).

> Yeah I think generally our ArrowEA should just be backed by a contiguous pyarrow.array instead of a pyarrow.chunked_array, especially if it makes it easier to decide to implement array over stream.

If we want to consider changing that, let's open a new dedicated issue about it (we discussed the pros/cons quite extensively in the past, and it has much larger impact beyond this stream interface)

Labels: Arrow (pyarrow functionality)
Linked issue: ENH: support the Arrow PyCapsule Interface on pandas.Series (export)
5 participants