[WIP] Support file splitting in ReadParquetPyarrowFS
#1139
base: main
Conversation
I do have a pretty strong preference on not adding keywords that interact with other things counterintuitively. One of the main complexity drivers of the old `read_parquet` implementation is that there is a plethora of options that disable each other, so I don't want to repeat this here.
Yeah, that makes sense. What do you think is the best way to deal with oversized files? I don't think the …
Update: I moved away from using …
Proposed Changes (Revised Sep 23, 2024)
Adds optimization-time ("tune up") support for large-file splitting.
I like how dask-expr currently "squashes" small files together at optimization time. This PR simply expands the functionality of `_tune_up` to split (rather than fuse) oversized parquet files. The splitting behavior is controlled by a new `"dataframe.parquet.maximum-partition-size"` config (to complement the existing `"dataframe.parquet.minimum-partition-size"` config that is already used to control fusion).
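For concreteness, a minimal configuration sketch (the threshold values, dataset path, and `filesystem="arrow"` routing below are illustrative assumptions, not code from this PR): the new option would sit alongside the existing fusion threshold in the dask config.

```python
import dask
import dask.dataframe as dd

# Illustrative values only: "minimum-partition-size" already drives small-file
# fusion at optimization time; "maximum-partition-size" is the new config
# proposed here to drive splitting of oversized files.
with dask.config.set({
    "dataframe.parquet.minimum-partition-size": 128 * 2**20,  # fuse inputs smaller than ~128 MiB
    "dataframe.parquet.maximum-partition-size": 512 * 2**20,  # split files larger than ~512 MiB
}):
    # filesystem="arrow" selects the pyarrow-filesystem reader (ReadParquetPyarrowFS);
    # the bucket/path is hypothetical.
    df = dd.read_parquet("s3://bucket/dataset/", filesystem="arrow")
```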
Background
The legacy `read_parquet` infrastructure is obviously a mess, and I'd like to avoid the need to keep maintaining it (in both dask and rapids). The logic in `ReadParquetPyarrowFS` is slightly arrow-specific, but is already very close to what I was planning to do in rapids. The only missing feature preventing it from supporting real-world use cases is the lack of support for splitting oversized files (a need we definitely run into a lot in the wild).