ENH: support the Arrow PyCapsule Interface on pandas.Series (export) #59518
Comments
Yes, this is on my to do list for the coming weeks! :)
For our numpy-based dtypes, this is indeed the case. But the problem is that for the pyarrow-based columns, we actually store chunked arrays. So when implementing one protocol, it should probably be the stream interface.
Yes, that makes sense and I'd agree. I'd suggest always exporting a stream.
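A minimal sketch of what stream-only export could look like, assuming the column wraps a pyarrow.ChunkedArray internally (the class name here is illustrative, not the actual pandas implementation):

```python
import pyarrow as pa

class ArrowBackedSeries:
    """Illustrative stand-in for a Series backed by a pyarrow.ChunkedArray."""

    def __init__(self, chunks):
        self._pa_array = pa.chunked_array(chunks)

    def __arrow_c_stream__(self, requested_schema=None):
        # Delegate to pyarrow's ChunkedArray, which already implements the
        # PyCapsule stream protocol; chunked data is exported without copying.
        return self._pa_array.__arrow_c_stream__(requested_schema)
```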
Why wouldn't we support both methods? The fact that pyarrow arrays use chunked storage behind the scenes is an implementation detail, but for everything else (and from a front-end perspective) array makes more sense than array_stream |
N.B. I'm also not sure of the reasoning behind the decision to use chunked array storage for pyarrow-backed types
I think the question is: what should data consumers be able to infer based on the presence of one or more of these methods? See apache/arrow#40648
There is still some discussion about using the array vs. stream interface, so reopening this issue. @WillAyd's comment from #59587 (comment):
For the specific example of a Series with multiple chunks, and then put in a DataFrame: if you access that through the DataFrame's stream interface, you will also get multiple chunks. In pyarrow, a Table is also not required to have equally chunked columns, but when converting to a stream of batches, it will kind of merge the chunking of the different columns (I think by default in such a way that everything can be zero-copy slices from the original data, i.e. so the smallest batches needed to do that for all columns). Code example to illustrate:
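The original code example from the comment is not reproduced here; a roughly equivalent sketch, assuming a recent pyarrow, could look like this:

```python
import pyarrow as pa

# Two columns with deliberately different chunk layouts.
col_a = pa.chunked_array([[1, 2, 3], [4, 5, 6]])              # chunk sizes 3 + 3
col_b = pa.chunked_array([["a", "b"], ["c", "d", "e", "f"]])  # chunk sizes 2 + 4

table = pa.table({"a": col_a, "b": col_b})

# Converting the table to record batches aligns the chunk boundaries of all
# columns, so every batch can be a zero-copy slice of the original chunks.
print([batch.num_rows for batch in table.to_batches()])
# expected to print something like [2, 1, 3] (boundaries at rows 2, 3 and 6)
```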
So in practice, while we are discussing the usage of the Stream interface for Series, the DataFrame's existing stream interface already exposes chunked data in this way.
@jorisvandenbossche that's super cool - I did not realize Arrow created RecordBatches in that fashion. And that's zero-copy for the integer array in your example? In that case, maybe we should export both the stream and array interfaces. I guess there's some ambiguity as a consumer whether that is a zero-copy exchange or not, but maybe that doesn't matter (?)
Another consideration point is how we want to act as consumers of Arrow data, not just as producers. If we push developers towards preferring stream data in this interface, the implication is that as a consumer we would need to prefer Arrow chunked array storage for our arrays. I'm not sure that is a bad thing, but it's different from where we are today
My general position is that PyCapsule Interface producers should not make it unintentionally easy to force memory copies. So if a pandas Series could be either chunked or non-chunked, then only the C Stream interface should be implemented. Otherwise, implementing the C Array interface as well could allow consumers to unknowingly cause pandas to concatenate a Series into an array. IMO, the C Array interface should only be implemented if a pandas Series always stores a contiguous array under the hood. It's easy for consumers of a C stream to see if the stream only holds a single array (assuming the consumer materializes the full stream)
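For example, a consumer could materialize the stream and check the chunk count (a sketch assuming a pyarrow version whose chunked_array() accepts objects implementing __arrow_c_stream__):

```python
import pyarrow as pa

def holds_single_array(obj) -> bool:
    """Check whether an Arrow C stream producer exposes exactly one chunk."""
    # `obj` is any object implementing __arrow_c_stream__ (e.g. a Series).
    chunked = pa.chunked_array(obj)
    return chunked.num_chunks <= 1
```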
AFAIK yes
That's already the case for dtypes backed by pyarrow, I think. If you read a large parquet file, pyarrow will read that into a table with chunks, and converting that to pandas using a pyarrow-backed dtype for one of the columns will preserve that chunking.
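For instance (a sketch; the file path is a placeholder):

```python
import pandas as pd

# Reading with the pyarrow dtype backend keeps each column backed by a
# pyarrow.ChunkedArray, preserving whatever chunking the parquet reader produced.
df = pd.read_parquet("large_file.parquet", dtype_backend="pyarrow")
```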
Thanks, this is good input. I'm bought in!
Cool! I don't think I voiced it well, but my major concern is around our usage of NumPy by default. For the I/O operation you described, I think the dtype_mapper will in many cases still end up copying that chunked data into a single array to fit into the NumPy view of the world. Not trying to solve that here, but I think with I/O methods there is some control over how you want that conversion to happen. With this interface, I don't think we get that control? So if we produced data from a NumPy array, sent it to a consumer that then wants to send back new data for us to consume, wouldn't we be automatically going from NumPy to Arrow-backed?
We don't yet have the consumption part in pandas (I am planning to look into that in the near future, but let that not stop anyone from already doing that), but if we have a method to generally convert Arrow data to a pandas DataFrame, that in theory could have a similar keyword as other IO methods to determine which dtypes to use.
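In the meantime, a consumer-side conversion can already be expressed through pyarrow itself, where types_mapper plays the role of that dtype keyword (a sketch, assuming pa.table() accepts objects implementing __arrow_c_stream__):

```python
import pandas as pd
import pyarrow as pa

def arrow_to_pandas(obj, use_arrow_dtypes=True):
    """Convert a tabular Arrow C stream producer to a pandas DataFrame."""
    table = pa.table(obj)  # import via the PyCapsule Interface
    mapper = pd.ArrowDtype if use_arrow_dtypes else None
    return table.to_pandas(types_mapper=mapper)
```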
Makes sense. So for where we are on main now, I think we just need to make it so that the implementation doesn't always copy (?)
Yes, and fix …
Problem Description
Similar to the existing DataFrame export in #56587, I would like pandas to export Series via the Arrow PyCapsule Interface.
Feature Description
Implementation would be quite similar to #56587. I'm not sure what the ideal API is to convert a pandas Series to a pyarrow Array/ChunkedArray.
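For reference, both conversion targets are straightforward to produce today from a numpy-backed Series (a sketch; which one the export should use is exactly the open question):

```python
import pandas as pd
import pyarrow as pa

ser = pd.Series([1, 2, 3], dtype="int64")

arr = pa.Array.from_pandas(ser)      # a single contiguous pyarrow.Array
chunked = pa.chunked_array([arr])    # the same data wrapped as a ChunkedArray
```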
Alternative Solutions
Require users to put the Series into a DataFrame before passing it to an external consumer?
Additional Context
Pyarrow has implemented the PyCapsule Interface on ChunkedArrays for the last couple of versions. Ref apache/arrow#40818
If converting pandas objects to pyarrow always creates non-chunked Arrow objects, then perhaps we should implement __arrow_c_array__ instead?

cc @jorisvandenbossche as he implemented #56587
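To make the distinction concrete, recent pyarrow versions expose the two protocols on different objects (a small check, assuming a pyarrow version that includes the PyCapsule methods):

```python
import pyarrow as pa

arr = pa.array([1, 2, 3])
chunked = pa.chunked_array([[1, 2], [3]])

print(hasattr(arr, "__arrow_c_array__"))       # True: a contiguous Array
print(hasattr(chunked, "__arrow_c_array__"))   # False: chunks can't be one array without copying
print(hasattr(chunked, "__arrow_c_stream__"))  # True: the stream interface handles chunked data
```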