Reuse metadata from deltalake when reading parquet #22

Open
j-bennet opened this issue Jun 1, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@j-bennet
Collaborator

j-bennet commented Jun 1, 2023

In dask-deltatable, when calling dd.read_parquet, perhaps we can reuse the metadata already preserved in delta json, instead of collecting it from parquet files all over again.

Here:

df = dd.read_parquet(dt.file_uris(), **kwargs)

It looks like dd.read_parquet will have to go through the parquet files to read the metadata, but the DeltaTable should have all that info already.

@jrbourbeau jrbourbeau changed the title Reese metadata from deltalake when reading parquet Reuse metadata from deltalake when reading parquet Jun 1, 2023
@jrbourbeau
Member

Good catch @j-bennet. I think adding dataset={"schema": dt.schema().to_pyarrow()} as a keyword to this read_parquet call

df = dd.read_parquet(dt.file_uris(), **kwargs)

should do the trick. Though it'd be nice if someone could confirm this is the case.
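For anyone who wants to try this, here is a minimal sketch of the suggestion above. The helper names `with_delta_schema` and `read_delta_as_dask` are made up for illustration and are not part of dask-deltatable. Whether passing the schema actually stops `dd.read_parquet` from touching the parquet footers is still unconfirmed, as noted.

```python
def with_delta_schema(schema, **kwargs):
    """Return read_parquet kwargs with `schema` injected under `dataset`.

    Hypothetical helper: merges into any `dataset` dict the caller already
    passed, and lets a caller-supplied schema take precedence.
    """
    dataset = dict(kwargs.pop("dataset", None) or {})
    dataset.setdefault("schema", schema)
    return {**kwargs, "dataset": dataset}


def read_delta_as_dask(path, **kwargs):
    """Hypothetical wrapper around the read_parquet call from this issue."""
    # Imported lazily so the helper above works without dask/deltalake installed.
    import dask.dataframe as dd
    from deltalake import DeltaTable

    dt = DeltaTable(path)
    # Schema comes from the Delta log, as suggested in the comment above.
    schema = dt.schema().to_pyarrow()
    return dd.read_parquet(dt.file_uris(), **with_delta_schema(schema, **kwargs))
```

Merging into an existing `dataset` dict, rather than overwriting it, keeps any options a user already passes through `**kwargs`.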

@jacobtomlinson jacobtomlinson added the enhancement New feature or request label Jun 5, 2023
@j-bennet
Collaborator Author

j-bennet commented Jun 9, 2023

> I think adding dataset={"schema": dt.schema().to_pyarrow()} as a keyword to this read_parquet call should do the trick. Though it'd be nice if someone could confirm this is the case.

I think the Delta log also contains column stats, so maybe we can avoid gathering those as well.
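To illustrate what those stats look like: in the Delta transaction log, each `add` action can carry an optional `stats` field, which is itself a JSON-encoded string with `numRecords`, `minValues`, `maxValues`, and `nullCount`. Below is a rough sketch of pulling them out using only the standard library. The sample log line and the function name `stats_from_log_line` are invented for illustration, and how (or whether) these could be handed to dask in place of the parquet statistics is an open question.

```python
import json

# Made-up example of a single line from a _delta_log/*.json commit file.
SAMPLE_LOG_LINE = json.dumps({
    "add": {
        "path": "part-00000.parquet",
        "size": 1024,
        # Note: `stats` is stored as a JSON string, not a nested object.
        "stats": json.dumps({
            "numRecords": 3,
            "minValues": {"id": 1},
            "maxValues": {"id": 3},
            "nullCount": {"id": 0},
        }),
    }
})


def stats_from_log_line(line):
    """Return (path, stats dict) for an `add` action, or None otherwise."""
    action = json.loads(line)
    add = action.get("add")
    # Non-add actions (commitInfo, metaData, ...) and adds without stats are skipped.
    if add is None or not add.get("stats"):
        return None
    return add["path"], json.loads(add["stats"])


path, stats = stats_from_log_line(SAMPLE_LOG_LINE)
print(path, stats["minValues"], stats["maxValues"])
```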
