
Pandas DataFrame index not preserved with pandas.DeltaTableDataset #431

KrzysztofDoboszInpost opened this issue Nov 15, 2023 · 6 comments
Labels: Community (Issue/PR opened by the open-source community)

@KrzysztofDoboszInpost

Description

A pandas DataFrame saved and loaded using pandas.DeltaTableDataset differs from the original DataFrame: it has an additional column __index_level_0__. This is the index saved as a column, but it is not interpreted as such on load.

I tried to read the stored file manually:

  • pandas.read_parquet loads the index correctly,
  • deltalake.DeltaTable returns the index as a regular column (see the sketch below).
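
A minimal round-trip that shows the difference, assuming the table is written with deltalake.write_deltalake (the path and data below are made up):

import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"value": [1, 2, 3]}, index=["a", "b", "c"])
write_deltalake("tmp/repro_table", df)

# pandas restores the index from the parquet metadata
# (pyarrow skips the underscore-prefixed _delta_log folder when scanning the directory)
pd.read_parquet("tmp/repro_table")         # index: ['a', 'b', 'c']

# deltalake exposes the index as a plain column instead
DeltaTable("tmp/repro_table").to_pandas()  # columns: ['value', '__index_level_0__']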

Since the dataset is based on deltalake, maybe it will suffice to add this line to _load()?

delta_table = delta_table.set_index("__index_level_0__")

Context

How has this bug affected you? What were you trying to accomplish?
I'm trying to use pandas.DeltaTableDataset locally and databricks.ManagedTableDataset in the pipeline deployed to Databricks.

Steps to Reproduce

  1. Create two nodes sharing a dataset of type pandas.DeltaTableDataset. Pass a pd.DataFrame between them.
  2. In the second node, observe the additional column.

Expected Result

The DataFrame after save & load should be identical to the one before the operations.

Actual Result

The DataFrame after save & load has an additional column __index_level_0__, and the index is a default RangeIndex.

Your Environment

kedro==0.18.14
kedro-datasets==1.8.0
deltalake==0.13.0
pyarrow==14.0.1

  • Python version used (python -V): 3.10.10
  • Operating system and version: Windows 10
@datajoely (Contributor)

So we use the Deltalake Python library, which doesn't seem to have an explicit option for retaining this.

We typically prefer not to modify the underlying API, and keeping a 1:1 relationship between Kedro's load_args / save_args and the library feels quite sensible here, even though this is easy to remedy in the node or by subclassing the dataset. I'd be interested to see what the other maintainers think.

@KrzysztofDoboszInpost (Author) commented Nov 16, 2023

Actually, there might be more than one index level, and at some point deltalake might start handling pandas_metadata, so:

index_cols = delta_table.columns[delta_table.columns.str.match(r"__index_level_\d+__")].tolist()
if index_cols:
    delta_table = delta_table.set_index(index_cols)
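
Packaged as a self-contained helper (the function name is my own, and it assumes _load has already converted the DeltaTable to a pandas DataFrame):

import pandas as pd

def restore_pandas_index(df: pd.DataFrame) -> pd.DataFrame:
    # Columns named __index_level_N__ are unnamed pandas index levels
    # serialised by pyarrow; turn them back into the index.
    index_cols = df.columns[df.columns.str.match(r"__index_level_\d+__")].tolist()
    if index_cols:
        df = df.set_index(index_cols)
        df.index.names = [None] * len(index_cols)  # drop the synthetic names
    return df

Used e.g. as restore_pandas_index(DeltaTable(path).to_pandas()).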

@datajoely (Contributor)

The fundamental point is that indices are a pandas concept; Spark and, more recently, Polars decided not to mirror it. There is an argument that the current behaviour is justified when using the Delta format. Again, very keen to see what others think here.

@KrzysztofDoboszInpost (Author)

I changed my mind. I needed a dataset to use as a local counterpart of databricks.ManagedTableDataset (or actually something derived from it), and that dataset ignores the pandas index completely.

To get consistent behaviour between the datasets, let's drop the index when saving the data:

data = pa.Table.from_pandas(data, preserve_index=False)

This line at the start of _save would do the magic, and also fix #610.
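
For illustration, a quick sketch of what preserve_index=False changes (the data is made up):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"value": [1, 2, 3]}, index=["a", "b", "c"])

pa.Table.from_pandas(df).column_names
# ['value', '__index_level_0__']  -> what gets written today

pa.Table.from_pandas(df, preserve_index=False).column_names
# ['value']                       -> index dropped, matching ManagedTableDataset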

@noklam added the Community (Issue/PR opened by the open-source community) label on Mar 29, 2024
@astrojuanlu (Member)

If I understand correctly, @KrzysztofDoboszInpost, you no longer want what you stated in the first comment (#431 (comment))?

It's not very clear to me where your suggestion in #431 (comment) should be implemented and what it would achieve.

@KrzysztofDoboszInpost (Author)

Re. the first comment: correct.
Re. the last comment: I wanted to leverage Kedro's superpower of using different dataset definitions per environment. Specifically, I wanted to use databricks.ManagedTableDataset on prod and develop locally by downloading data in Delta format from the data lake and reading it with pandas.DeltaTableDataset. In time, it turned out that databricks.ManagedTableDataset doesn't store the pandas index, so I proposed that pandas.DeltaTableDataset do the same by dropping the index before saving. Anyway, this is just my use case; I'm not sure it justifies the change.

BTW: at some point it turned out that Deltalake cannot read a Delta table saved by Databricks (protocol version mismatch), so I moved back to parquet.
