
[WIP] Support file splitting in ReadParquetPyarrowFS #1139

Draft · rjzamora wants to merge 10 commits into main

Conversation

@rjzamora (Member) commented Sep 21, 2024

Proposed Changes (Revised Sep 23, 2024)

Adds optimization-time ("tune up") support for large-file splitting.

I like how dask-expr currently "squashes" small files together at optimization time. This PR simply expands `_tune_up` to split (rather than fuse) oversized parquet files. The splitting behavior is controlled by a new `"dataframe.parquet.maximum-partition-size"` config, complementing the existing `"dataframe.parquet.minimum-partition-size"` config that already controls fusion.
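
For illustration, here is a minimal sketch of the kind of size-based planning `_tune_up` could perform at optimization time. The helper `plan_splits` and its output layout are hypothetical, not the actual dask-expr code; the real implementation would map each piece to a contiguous range of row groups within the file:

```python
import math

def plan_splits(file_sizes, max_size):
    # For each file, emit one read task per chunk: files at or under
    # `max_size` map to a single task, while larger files are divided
    # into ceil(size / max_size) pieces.
    plan = []
    for path, size in file_sizes.items():
        n_splits = max(1, math.ceil(size / max_size))
        for i in range(n_splits):
            # (path, split index, total splits) -- hypothetical layout.
            plan.append((path, i, n_splits))
    return plan

# Example: a 1 GB file with a 256 MB cap becomes four read tasks.
plan = plan_splits({"part.0.parquet": 1_000_000_000}, 256_000_000)
assert len(plan) == 4
```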

Background

The legacy read_parquet infrastructure is obviously a mess, and I'd like to avoid the need to keep maintaining it (in both dask and rapids). The logic in ReadParquetPyarrowFS is slightly arrow-specific, but is already very close to what I was planning to do in rapids. The only missing feature preventing it from supporting real-world use cases is splitting oversized files (a need we run into a lot in the wild).

@rjzamora rjzamora added the enhancement label Sep 21, 2024
@rjzamora rjzamora self-assigned this Sep 21, 2024
@phofl (Collaborator) commented Sep 23, 2024

I have a pretty strong preference against adding keywords that interact counterintuitively with other options. One of the main complexity drivers of the old read_parquet implementation is the plethora of options that disable each other, so I don't want to repeat that here.

@rjzamora (Member, Author) commented

> I have a pretty strong preference against adding keywords that interact counterintuitively with other options.

Yeah, that makes sense. What do you think is the best way to deal with oversized files? I don't think the `aggregate_files` argument is necessary, but the lack of support for splitting large files is a pretty serious blocker right now. Would a `"dataframe.parquet.maximum-partition-size"` config be more palatable?

@rjzamora rjzamora changed the title from "[WIP] Support blocksize and aggregate_files options in ReadParquetPyarrowFS" to "[WIP] Support blocksize in ReadParquetPyarrowFS" on Sep 23, 2024
@rjzamora rjzamora changed the title from "[WIP] Support blocksize in ReadParquetPyarrowFS" to "[WIP] Support file splitting in ReadParquetPyarrowFS" on Sep 23, 2024
@rjzamora (Member, Author) commented

Update: I moved away from using `blocksize` and `aggregate_files` in favor of the `"dataframe.parquet.minimum-partition-size"` (existing) and `"dataframe.parquet.maximum-partition-size"` (new) configs.
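
A hedged usage sketch under this approach: both thresholds are plain configuration, so `read_parquet` itself needs no new keywords. The config keys are the ones named in this PR; the byte values and the dataset path are illustrative only:

```python
import dask
import dask.dataframe as dd

# Apply both thresholds via configuration rather than keywords.
with dask.config.set({
    "dataframe.parquet.minimum-partition-size": 75_000_000,   # fuse smaller files
    "dataframe.parquet.maximum-partition-size": 256_000_000,  # split larger files
}):
    df = dd.read_parquet("s3://bucket/dataset/")  # illustrative path
```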
