ENH: support the Arrow PyCapsule Interface on pandas.Series (export) #59518
Comments
Yes, this is on my to do list for the coming weeks! :)
For our numpy-based dtypes, this is indeed the case. But the problem is that for the pyarrow-based columns, we actually store chunked arrays. So when implementing one protocol, it should probably be the stream interface.
Yes, that makes sense and I'd agree. I'd suggest always exporting a stream.
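A minimal sketch of what stream-only export could look like, assuming the column wraps a pyarrow.ChunkedArray internally (the class name here is illustrative, not the actual pandas implementation):

```python
import pyarrow as pa

class ArrowBackedSeries:
    """Illustrative stand-in for a Series backed by a pyarrow.ChunkedArray."""

    def __init__(self, chunks):
        self._pa_array = pa.chunked_array(chunks)

    def __arrow_c_stream__(self, requested_schema=None):
        # Delegate to pyarrow's ChunkedArray, which already implements the
        # PyCapsule stream protocol; chunked data is exported without copying.
        return self._pa_array.__arrow_c_stream__(requested_schema)
```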
Why wouldn't we support both methods? The fact that pyarrow arrays use chunked storage behind the scenes is an implementation detail, but for everything else (and from a front-end perspective) array makes more sense than array_stream |
N.B. I'm also not sure of the reasoning behind the decision to use chunked array storage for pyarrow-backed types
I think the question is: what should data consumers be able to infer based on the presence of one or more of these methods? See apache/arrow#40648
There is still some discussion about using the array vs. stream interface, so reopening this issue. @WillAyd's comment from #59587 (comment):
For the specific example of a Series with multiple chunks, and then put in a DataFrame: if you access that through the DataFrame's stream interface, you will also get multiple chunks. In pyarrow, a Table is also not required to have equally chunked columns, but when converting to a stream of batches, it will kind of merge the chunking of the different columns (I think by default in such a way that everything can be zero-copy slices from the original data, i.e. so the smallest batches needed to do that for all columns). Code example to illustrate:
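The original code example from the comment is not reproduced here; a roughly equivalent sketch, assuming a recent pyarrow, could look like this:

```python
import pyarrow as pa

# Two columns with deliberately different chunk layouts.
col_a = pa.chunked_array([[1, 2, 3], [4, 5, 6]])              # chunk sizes 3 + 3
col_b = pa.chunked_array([["a", "b"], ["c", "d", "e", "f"]])  # chunk sizes 2 + 4

table = pa.table({"a": col_a, "b": col_b})

# Converting the table to record batches aligns the chunk boundaries of all
# columns, so every batch can be a zero-copy slice of the original chunks.
print([batch.num_rows for batch in table.to_batches()])
# expected to print something like [2, 1, 3] (boundaries at rows 2, 3 and 6)
```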
So in practice, while we are discussing the usage of the Stream interface for Series, the DataFrame's existing stream interface already exposes chunked data in this way.
@jorisvandenbossche that's super cool - I did not realize Arrow created RecordBatches in that fashion. And that's zero-copy for the integer array in your example? In that case, maybe we should export both the stream and array interfaces. I guess there's some ambiguity as a consumer whether that is a zero-copy exchange or not, but maybe that doesn't matter (?)
Another consideration point is how we want to act as consumers of Arrow data, not just as producers. If we push developers towards preferring stream data in this interface, the implication is that as a consumer we would need to prefer Arrow chunked array storage for our arrays. I'm not sure that is a bad thing, but it's different from where we are today
My general position is that PyCapsule Interface producers should not make it unintentionally easy to force memory copies. So if a pandas Series could be either chunked or non-chunked, then only the C Stream interface should be implemented. Otherwise, implementing the C Array interface as well could allow consumers to unknowingly cause pandas to concatenate a Series into an array. IMO, the C Array interface should only be implemented if a pandas Series always stores a contiguous array under the hood. It's easy for consumers of a C stream to see if the stream only holds a single array (assuming the consumer materializes the full stream)
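For example, a consumer could materialize the stream and check the chunk count (a sketch assuming a pyarrow version whose chunked_array() accepts objects implementing __arrow_c_stream__):

```python
import pyarrow as pa

def holds_single_array(obj) -> bool:
    """Check whether an Arrow C stream producer exposes exactly one chunk."""
    # `obj` is any object implementing __arrow_c_stream__ (e.g. a Series).
    chunked = pa.chunked_array(obj)
    return chunked.num_chunks <= 1
```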
AFAIK yes
That's already the case for dtypes backed by pyarrow, I think. If you read a large parquet file, pyarrow will read that into a table with chunks, and converting that to pandas using a pyarrow-backed dtype for one of the columns will preserve that chunking.
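For instance (a sketch; the file path is a placeholder):

```python
import pandas as pd

# Reading with the pyarrow dtype backend keeps each column backed by a
# pyarrow.ChunkedArray, preserving whatever chunking the parquet reader produced.
df = pd.read_parquet("large_file.parquet", dtype_backend="pyarrow")
```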
Thanks, this is good input. I'm bought in!
Cool! I don't think I voiced it well, but my major concern is around our usage of NumPy by default. For the I/O operation you described, I think the dtype_mapper will in many cases still end up copying that chunked data into a single array to fit into the NumPy view of the world. Not trying to solve that here, but I think with I/O methods there is some control over how you want that conversion to happen. With this interface, I don't think we get that control? So if we produced data from a NumPy array, sent it to a consumer that then wants to send back new data for us to consume, wouldn't we be automatically going from NumPy to Arrow-backed?
We don't yet have the consumption part in pandas (I am planning to look into that in the near future, but let that not stop anyone from already doing that), but if we have a method to generally convert Arrow data to a pandas DataFrame, that in theory could have a similar keyword as other IO methods to determine which dtypes to use.
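In the meantime, a consumer-side conversion can already be expressed through pyarrow itself, where types_mapper plays the role of that dtype keyword (a sketch, assuming pa.table() accepts objects implementing __arrow_c_stream__):

```python
import pandas as pd
import pyarrow as pa

def arrow_to_pandas(obj, use_arrow_dtypes=True):
    """Convert a tabular Arrow C stream producer to a pandas DataFrame."""
    table = pa.table(obj)  # import via the PyCapsule Interface
    mapper = pd.ArrowDtype if use_arrow_dtypes else None
    return table.to_pandas(types_mapper=mapper)
```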
Makes sense. So for where we are on main now, I think we just need to make it so that the implementation doesn't always copy (?)
Yes, and fix …
Problem Description
Similar to the existing DataFrame export in #56587, I would like pandas to export Series via the Arrow PyCapsule Interface.
Feature Description
Implementation would be quite similar to #56587. I'm not sure what the ideal API is to convert a pandas Series to a pyarrow Array/ChunkedArray.
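For reference, both conversion targets are straightforward to produce today from a numpy-backed Series (a sketch; which one the export should use is exactly the open question):

```python
import pandas as pd
import pyarrow as pa

ser = pd.Series([1, 2, 3], dtype="int64")

arr = pa.Array.from_pandas(ser)      # a single contiguous pyarrow.Array
chunked = pa.chunked_array([arr])    # the same data wrapped as a ChunkedArray
```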
Alternative Solutions
Require users to put the Series into a DataFrame before passing it to an external consumer?
Additional Context
Pyarrow has implemented the PyCapsule Interface on ChunkedArrays for the last couple of versions. Ref apache/arrow#40818
If converting pandas objects to pyarrow always creates non-chunked Arrow objects, then perhaps we should implement __arrow_c_array__ instead?

cc @jorisvandenbossche as he implemented #56587
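To make the distinction concrete, recent pyarrow versions expose the two protocols on different objects (a small check, assuming a pyarrow version that includes the PyCapsule methods):

```python
import pyarrow as pa

arr = pa.array([1, 2, 3])
chunked = pa.chunked_array([[1, 2], [3]])

print(hasattr(arr, "__arrow_c_array__"))       # True: a contiguous Array
print(hasattr(chunked, "__arrow_c_array__"))   # False: chunks can't be one array without copying
print(hasattr(chunked, "__arrow_c_stream__"))  # True: the stream interface handles chunked data
```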