Commit
feat(mito): parquet memtable reader (#4967)
* wip: row group reader base
* wip: memtable row group reader
* Refactor MemtableRowGroupReader to streamline data fetching
  - Added early return when fetch_ranges is empty to optimize performance.
  - Replaced inline chunk data assignment with a call to `assign_dense_chunk` for cleaner code.
* wip: row group reader
* wip: reuse RowGroupReader
* wip: bulk part reader
* Enhance BulkPart Iteration with Filtering
  - Introduced `RangeBase` to `BulkIterContext` for improved filter handling.
  - Implemented filter application in `BulkPartIter` to prune batches based on predicates.
  - Updated `SimpleFilterContext::new_opt` to be public for broader access.
* chore: add prune test
* fix: clippy
* fix: introduce prune reader for memtable and add more prune test
* Enhance BulkPart read method to return Option<BoxedBatchIterator>
  - Modified `BulkPart::read` to return `Option<BoxedBatchIterator>` to handle cases where no row groups are selected.
  - Added logic to return `None` when all row groups are filtered out.
  - Updated tests to handle the new return type and added a test case to verify behavior when no row groups match the pr
* refactor/separate-paraquet-reader: Add helper function to parse parquet metadata and integrate it into BulkPartEncoder
* refactor/separate-paraquet-reader: Change BulkPartEncoder row_group_size from Option to usize and update tests
* refactor/separate-paraquet-reader: Add context module for bulk memtable iteration and refactor part reading
  • Introduce context module to encapsulate context for bulk memtable iteration.
  • Refactor BulkPart to use BulkIterContextRef for reading operations.
  • Remove redundant code in BulkPart by centralizing context creation and row group pruning logic in the new context module.
  • Create new file context.rs with structures and logic for handling iteration context.
  • Adjust part_reader.rs and row_group_reader.rs to reference the new BulkIterContextRef.
* refactor/separate-paraquet-reader: Refactor RowGroupReader traits and implementations in memtable and parquet reader modules
  • Rename RowGroupReaderVirtual to RowGroupReaderContext for clarity.
  • Replace BulkPartVirt with direct usage of BulkIterContextRef in MemtableRowGroupReader.
  • Simplify MemtableRowGroupReaderBuilder by directly passing context instead of creating a BulkPartVirt instance.
  • Update RowGroupReaderBase to use context field instead of virt, reflecting the trait renaming and usage.
  • Modify FileRangeVirt to FileRangeContextRef and adjust implementations accordingly.
* refactor/separate-paraquet-reader: Refactor column page reader creation and remove unused code
  • Centralize creation of SerializedPageReader in RowGroupBase::column_reader method.
  • Remove unused RowGroupCachedReader and related code from MemtableRowGroupPageFetcher.
  • Eliminate redundant error handling for invalid column index in multiple places.
* chore: rebase main and resolve conflicts
* fix: some comments
* chore: resolve conflicts
* chore: resolve conflicts
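
As a rough illustration of the `BulkPart::read` change listed above (returning `Option<BoxedBatchIterator>` and yielding `None` when every row group is filtered out), here is a minimal Rust sketch. Everything in it is a simplified, hypothetical stand-in: `Batch`, `RowGroup`, the timestamp-range predicate, and `sample_part` are assumptions made for the example, not the actual mito types or signatures, and the real iterator may well yield fallible items and prune with richer predicates.

```rust
use std::ops::RangeInclusive;

/// Hypothetical, simplified batch: only a timestamp column.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    timestamps: Vec<i64>,
}

/// Hypothetical row group with min/max timestamp statistics (assumed shape).
struct RowGroup {
    min_ts: i64,
    max_ts: i64,
    batch: Batch,
}

/// Simplified alias; the real alias is assumed to differ (e.g. fallible items).
type BoxedBatchIterator = Box<dyn Iterator<Item = Batch>>;

/// Hypothetical stand-in for an encoded bulk part.
struct BulkPart {
    row_groups: Vec<RowGroup>,
}

impl BulkPart {
    /// Prunes row groups by their statistics and returns `None` when
    /// nothing is left to read, so callers can skip the part entirely.
    fn read(&self, ts_range: RangeInclusive<i64>) -> Option<BoxedBatchIterator> {
        let selected: Vec<Batch> = self
            .row_groups
            .iter()
            // Keep a row group only if its [min_ts, max_ts] overlaps the query range.
            .filter(|rg| rg.max_ts >= *ts_range.start() && rg.min_ts <= *ts_range.end())
            .map(|rg| rg.batch.clone())
            .collect();

        // Early return: no row group matched, so there is no iterator to build.
        if selected.is_empty() {
            return None;
        }
        Some(Box::new(selected.into_iter()))
    }
}

/// Builds a small part with two row groups covering [0, 9] and [20, 29].
fn sample_part() -> BulkPart {
    BulkPart {
        row_groups: vec![
            RowGroup { min_ts: 0, max_ts: 9, batch: Batch { timestamps: vec![0, 5, 9] } },
            RowGroup { min_ts: 20, max_ts: 29, batch: Batch { timestamps: vec![20, 25, 29] } },
        ],
    }
}
```

With this shape, a caller can write `if let Some(iter) = part.read(5..=12) { ... }` and avoid constructing a reader for parts that cannot contain matching rows.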