
Pandas DataFrame index not preserved with pandas.DeltaTableDataset #431

KrzysztofDoboszInpost opened this issue Nov 15, 2023 · 6 comments
Labels: Community (Issue/PR opened by the open-source community)

@KrzysztofDoboszInpost

Description

A pandas DataFrame saved and loaded using pandas.DeltaTableDataset differs from the original DataFrame: it has an additional column __index_level_0__. This is the index saved as a column, but it is not interpreted as such on load.

I tried to read the stored file manually:

  • pandas.read_parquet loads the index correctly,
  • deltalake.DeltaTable returns the index as a regular column (see the sketch below).
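
A minimal round-trip that shows the difference, assuming the table is written with deltalake.write_deltalake (the path and data below are made up):

import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"value": [1, 2, 3]}, index=["a", "b", "c"])
write_deltalake("tmp/repro_table", df)

# pandas restores the index from the parquet metadata
# (pyarrow skips the underscore-prefixed _delta_log folder when scanning the directory)
pd.read_parquet("tmp/repro_table")         # index: ['a', 'b', 'c']

# deltalake exposes the index as a plain column instead
DeltaTable("tmp/repro_table").to_pandas()  # columns: ['value', '__index_level_0__']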

Since the dataset is based on deltalake, maybe it will suffice to add this line to _load()?

delta_table = delta_table.set_index("__index_level_0__")

Context

How has this bug affected you? What were you trying to accomplish?
I'm trying to use pandas.DeltaTableDataset locally and databricks.ManagedTableDataset in the pipeline deployed to Databricks.

Steps to Reproduce

  1. Create two nodes sharing a dataset of type pandas.DeltaTableDataset. Pass a pd.DataFrame between them.
  2. In the second node, observe the additional column.

Expected Result

The DataFrame after save & load should be identical to the one before the operations.

Actual Result

The DataFrame after save & load has an additional column __index_level_0__, and the index is a default RangeIndex.

Your Environment

kedro==0.18.14
kedro-datasets==1.8.0
deltalake==0.13.0
pyarrow==14.0.1

  • Python version used (python -V): 3.10.10
  • Operating system and version: Windows 10
@datajoely (Contributor)

So we use the Deltalake Python library, which doesn't seem to have an explicit option for retaining this.

We typically prefer not to modify the underlying API, and keeping a 1:1 relationship between Kedro's load_args / save_args and the library feels quite sensible here, even though this is easy to remedy in the node or by subclassing the dataset. I'd be interested to see what the other maintainers think.

@KrzysztofDoboszInpost (Author) commented Nov 16, 2023

Actually, there might be more than one index level, and at some point deltalake might start handling pandas_metadata, so:

index_cols = delta_table.columns[delta_table.columns.str.match(r"__index_level_\d+__")].tolist()
if index_cols:
    delta_table = delta_table.set_index(index_cols)
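
Packaged as a self-contained helper (the function name is my own, and it assumes _load has already converted the DeltaTable to a pandas DataFrame):

import pandas as pd

def restore_pandas_index(df: pd.DataFrame) -> pd.DataFrame:
    # Columns named __index_level_N__ are unnamed pandas index levels
    # serialised by pyarrow; turn them back into the index.
    index_cols = df.columns[df.columns.str.match(r"__index_level_\d+__")].tolist()
    if index_cols:
        df = df.set_index(index_cols)
        df.index.names = [None] * len(index_cols)  # drop the synthetic names
    return df

Used e.g. as restore_pandas_index(DeltaTable(path).to_pandas()).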

@datajoely (Contributor)

The fundamental point is that indices are a pandas concept; Spark and, more recently, Polars decided not to mirror it. There is an argument that the current behaviour is justified when using the Delta format. Again, very keen to see what others think here.

@KrzysztofDoboszInpost (Author)

I changed my mind. I needed a dataset to use as a local counterpart of databricks.ManagedTableDataset (or actually something derived from it), and that dataset ignores the pandas index completely.

To get consistent behaviour between the datasets, let's drop the index when saving the data:

data = pa.Table.from_pandas(data, preserve_index=False)

This line at the start of _save would do the magic, and also fix #610.
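
For illustration, a quick sketch of what preserve_index=False changes (the data is made up):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"value": [1, 2, 3]}, index=["a", "b", "c"])

pa.Table.from_pandas(df).column_names
# ['value', '__index_level_0__']  -> what gets written today

pa.Table.from_pandas(df, preserve_index=False).column_names
# ['value']                       -> index dropped, matching ManagedTableDataset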

@noklam added the Community (Issue/PR opened by the open-source community) label on Mar 29, 2024
@astrojuanlu (Member)

If I understand correctly, @KrzysztofDoboszInpost, you no longer want what you stated in the first comment (#431 (comment))?

It's not very clear to me where your suggestion in #431 (comment) should be implemented and what it would achieve.

@KrzysztofDoboszInpost (Author)

Re. the first comment: correct.
Re. the last comment: I wanted to leverage Kedro's superpower of using different dataset definitions per environment. Specifically, I wanted to use databricks.ManagedTableDataset on prod and develop locally by downloading data in Delta format from the data lake and reading it with pandas.DeltaTableDataset. In time, it turned out that databricks.ManagedTableDataset doesn't store the pandas index, so I proposed that pandas.DeltaTableDataset do the same by dropping the index before saving. Anyway, this is just my use case; I'm not sure it justifies the change.

BTW: at some point it turned out that Deltalake cannot read a Delta table saved by Databricks (protocol version mismatch), so I moved back to parquet.
