ENH: support Arrow PyCapsule Interface on Series for export #59587
Conversation
Force-pushed from a46a29a to 84543af
pandas/core/series.py (Outdated)
# todo: how should this be supported?
msg = (
    "Passing `requested_schema` to `Series.__arrow_c_stream__` is not yet "
    "supported"
)
raise NotImplementedError(msg)
@kylebarron @jorisvandenbossche @WillAyd @PyCapsuleGang how should this be handled? I was looking at the Polars implementation and there are no tests there where `requested_schema` is not `None`.
I think this is fine; I believe we'd have to do a lower-level implementation to unpack the `requested_schema` capsule anyway.
In a general sense, you can ignore it.
The callee should attempt to provide the data in the requested schema. However, if the callee cannot provide the data in the requested schema, they may return with the same schema as if None were passed to requested_schema.
https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#schema-requests
However, in this case you should just delegate to pyarrow's implementation:
ca = pa.chunked_array([pa.Array.from_pandas(self, type=requested_schema)])
return ca.__arrow_c_stream__(requested_schema)
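For context, once the Series exposes `__arrow_c_stream__`, a consumer can round-trip it through the capsule protocol roughly like this (a sketch assuming this PR's export is in place; recent pyarrow's `pa.chunked_array()` consumes any object implementing `__arrow_c_stream__`):

import pandas as pd
import pyarrow as pa

s = pd.Series([1, 4, 2])
# chunked_array() detects __arrow_c_stream__ on the input and imports it.
ca = pa.chunked_array(s)
assert ca.equals(pa.chunked_array([[1, 4, 2]]))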
Looks pretty good, although I am still worried that we are using `__arrow_c_stream__` when most of our types fit better with `__arrow_c_array__`. Seems pedantic, but this affects binary data exchange, so I want to be careful. Right now I'm trying to think what the future looks like for pandas, where we might actually have a ChunkedSeries or a different type where streaming actually makes sense, and how that may get tripped up by the standard Series supporting this dunder.
pandas/core/series.py (Outdated)
"supported" | ||
) | ||
raise NotImplementedError(msg) | ||
ca = pa.chunked_array([pa.Array.from_pandas(self, type=requested_schema)]) |
The pyarrow types already use a ChunkedArray for storage, right? I think we can short-circuit on that (or, in a larger PR, reassess why we use ChunkedArray for storage).
As Will said, I think we should short-circuit (or special-case it) for when the pyarrow array already is a chunked array.
Right now, if you have a column using e.g. StringDtype with pyarrow storage, which uses chunked arrays under the hood, the above will apparently concatenate the result, and this conversion will not be zero-copy in a case where you actually expect it to be zero-copy (and this is actually the case which makes us use `__arrow_c_stream__` instead of `__arrow_c_array__` in the first place).
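A sketch of what that short-circuit could look like (hypothetical, not the PR's code; it assumes the pyarrow-backed extension array exposes its underlying `pa.ChunkedArray` as `_pa_array`, a pandas-internal detail that may change):

import pyarrow as pa

def __arrow_c_stream__(self, requested_schema=None):
    # Hypothetical fast path: if the Series is already backed by a
    # pyarrow ChunkedArray (e.g. ArrowDtype, pyarrow-backed StringDtype),
    # export it directly so the chunks stream out zero-copy.
    pa_array = getattr(self.array, "_pa_array", None)
    if isinstance(pa_array, pa.ChunkedArray) and requested_schema is None:
        return pa_array.__arrow_c_stream__()
    # Fallback: materialize a single chunk via pyarrow's conversion.
    # (requested_schema handling is discussed further below.)
    ca = pa.chunked_array([pa.Array.from_pandas(self, type=requested_schema)])
    return ca.__arrow_c_stream__(requested_schema)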
Thanks @MarcoGorelli
PyCapsule
"""
pa = import_optional_dependency("pyarrow", min_version="16.0.0")
ca = pa.chunked_array([pa.Array.from_pandas(self, type=requested_schema)])
Something else: the passing of `requested_schema` is not going to work like this, I think. See how I first converted it to a pyarrow object before passing it on:

Lines 985 to 986 in bb4ab4f

if requested_schema is not None:
    requested_schema = pa.Schema._import_from_c_capsule(requested_schema)

You can use the same but using `pa.DataType` instead of `pa.Schema`.
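Putting that together, the export could unpack the capsule before converting (a sketch under the assumption that the delegation approach above is kept; `pa.DataType._import_from_c_capsule` is the DataType analogue of the Schema method referenced above):

import pyarrow as pa

def __arrow_c_stream__(self, requested_schema=None):
    # requested_schema arrives as a raw ArrowSchema PyCapsule; unpack it
    # into a pyarrow DataType before handing it to from_pandas().
    if requested_schema is not None:
        requested_schema = pa.DataType._import_from_c_capsule(requested_schema)
    ca = pa.chunked_array([pa.Array.from_pandas(self, type=requested_schema)])
    # The requested type is already applied, so export without a request.
    return ca.__arrow_c_stream__()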
ca = pa.chunked_array(s)
expected = pa.chunked_array([[1, 4, 2]])
assert ca.equals(expected)
Best to add a test case here specifying the type (to cover the `requested_schema` part). Something like:
arr = pa.array(s, type=pa.int32())
expected = pa.array([1, 4, 2], pa.int32())
assert arr.equals(expected)
(but then using `chunked_array()` instead of `array()`, because `array(..)` actually doesn't work if we only define `__arrow_c_stream__` and not `__arrow_c_array__`)
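Combining both suggestions, the test might end up looking roughly like this (a sketch assuming this PR's Series export; `pa.chunked_array` forwards the `type` argument as the requested schema capsule):

import pandas as pd
import pyarrow as pa

s = pd.Series([1, 4, 2])
# Consume via __arrow_c_stream__, requesting int32 as the schema.
ca = pa.chunked_array(s, type=pa.int32())
expected = pa.chunked_array([[1, 4, 2]], type=pa.int32())
assert ca.equals(expected)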
Absolutely my mistake for not putting up a "changes requested", but I don't think this was ready to be merged. @jorisvandenbossche has some great points that need to be addressed, and I still want to discuss a bit more how we feel the pd.Series should be exported through the Arrow C Data interface. As mentioned before, I am a little wary about having developers write extensions that treat the Series as a stream (when in the vast majority of current use cases it is not) and how that may impact us as both a consumer and producer of Arrow C Data.
To talk more practically, I am wondering about a scenario where we have a Series that holds a chunked array. In this PR, we convert that to a singular array before then exposing it back as a stream, but the other point of view is that we could allow the stream to iterate over each chunk. But then the question becomes: what happens when that same Series gets used in a DataFrame? Does the DataFrame iterate its chunks? In most cases, it seems highly unlikely that this is possible (unless all other Series of the DataFrame share the same chunked array size), so you get a rather interesting scenario where iterating by DataFrame could be potentially far more expensive than the Series iteration. The more I think through it, the more I am leaning towards -1 on supporting the stream interface for the Series; I think we should just expose as an array for now.
Sorry if I merged too soon. Yeah, I think generally our ArrowEA should just be backed by a contiguous array. IIRC there are some ops where we need to [...]
Ah, sorry about this, I should've marked this as draft whilst there was still some discussion underway. I'll open up a follow-up shortly and we can iterate on the required changes there (I think that's simpler than reverting and starting again).
The fixes for the current implementation can indeed be done in a follow-up PR.
If we want to consider changing that, let's open a new dedicated issue about it (we discussed the pros/cons quite extensively in the past, and it has much larger impact beyond this stream interface).