
[WIP] Support file splitting in ReadParquetPyarrowFS #1139

Draft · rjzamora wants to merge 10 commits into main

Conversation

@rjzamora (Member) commented Sep 21, 2024

Proposed Changes (Revised Sep 23, 2024)

Adds optimization-time ("tune up") support for large-file splitting.

I like how dask-expr currently "squashes" small files together at optimization time. This PR simply expands `_tune_up` to split (rather than fuse) oversized parquet files. The splitting behavior is controlled by a new `"dataframe.parquet.maximum-partition-size"` config, complementing the existing `"dataframe.parquet.minimum-partition-size"` config that already controls fusion.
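
For illustration, here is a minimal sketch of the kind of size-based planning `_tune_up` could perform at optimization time. The helper `plan_splits` and its output layout are hypothetical, not the actual dask-expr code; the real implementation would map each piece to a contiguous range of row groups within the file:

```python
import math

def plan_splits(file_sizes, max_size):
    # For each file, emit one read task per chunk: files at or under
    # `max_size` map to a single task, while larger files are divided
    # into ceil(size / max_size) pieces.
    plan = []
    for path, size in file_sizes.items():
        n_splits = max(1, math.ceil(size / max_size))
        for i in range(n_splits):
            # (path, split index, total splits) -- hypothetical layout.
            plan.append((path, i, n_splits))
    return plan

# Example: a 1 GB file with a 256 MB cap becomes four read tasks.
plan = plan_splits({"part.0.parquet": 1_000_000_000}, 256_000_000)
assert len(plan) == 4
```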

Background

The legacy read_parquet infrastructure is obviously a mess, and I'd like to avoid the need to keep maintaining it (in both dask and rapids). The logic in ReadParquetPyarrowFS is slightly arrow-specific, but is already very close to what I was planning to do in rapids. The only missing feature preventing it from supporting real-world use cases is splitting oversized files (a need we run into a lot in the wild).

@rjzamora rjzamora added the enhancement label Sep 21, 2024
@rjzamora rjzamora self-assigned this Sep 21, 2024
@phofl (Collaborator) commented Sep 23, 2024

I have a pretty strong preference against adding keywords that interact counterintuitively with other options. One of the main complexity drivers of the old read_parquet implementation is the plethora of options that disable each other, so I don't want to repeat that here.

@rjzamora (Member, Author) commented

> I have a pretty strong preference against adding keywords that interact counterintuitively with other options.

Yeah, that makes sense. What do you think is the best way to deal with oversized files? I don't think the `aggregate_files` argument is necessary, but the lack of support for splitting large files is a pretty serious blocker right now. Would a `"dataframe.parquet.maximum-partition-size"` config be more palatable?

@rjzamora rjzamora changed the title from "[WIP] Support blocksize and aggregate_files options in ReadParquetPyarrowFS" to "[WIP] Support blocksize in ReadParquetPyarrowFS" on Sep 23, 2024
@rjzamora rjzamora changed the title from "[WIP] Support blocksize in ReadParquetPyarrowFS" to "[WIP] Support file splitting in ReadParquetPyarrowFS" on Sep 23, 2024
@rjzamora (Member, Author) commented

Update: I moved away from using `blocksize` and `aggregate_files` in favor of the `"dataframe.parquet.minimum-partition-size"` (existing) and `"dataframe.parquet.maximum-partition-size"` (new) configs.
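
A hedged usage sketch under this approach: both thresholds are plain configuration, so `read_parquet` itself needs no new keywords. The config keys are the ones named in this PR; the byte values and the dataset path are illustrative only:

```python
import dask
import dask.dataframe as dd

# Apply both thresholds via configuration rather than keywords.
with dask.config.set({
    "dataframe.parquet.minimum-partition-size": 75_000_000,   # fuse smaller files
    "dataframe.parquet.maximum-partition-size": 256_000_000,  # split larger files
}):
    df = dd.read_parquet("s3://bucket/dataset/")  # illustrative path
```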
