Pandas DataFrame index not preserved with pandas.DeltaTableDataset #431
Comments
So we use the Deltalake Python library, which doesn't seem to have an explicit way of retaining this. We typically don't like to modify the underlying API and prefer to retain a 1:1 relationship with Kedro and
Actually, there might be more than one index level and at some point
The fundamental point is that indices are a pandas concept; Spark and, more recently, Polars decided not to mirror it. There is an argument that the current behaviour is justified when using the Delta format. Again, very keen to see what others think here.
I changed my mind. I needed a dataset that I'd use as a local counterpart of databricks.ManagedTableDataset (or actually something derived from this one), and that dataset ignores pandas index completely. To get consistent behaviour between the datasets, let's drop index on saving the data:
This line at the start of |
If I understand correctly, @KrzysztofDoboszInpost, you no longer want what you described in the first comment #431 (comment)? It's not very clear to me where your suggestion #431 (comment) should be implemented and what it would achieve.
Regarding the first comment: correct. BTW: at some point it turned out that Deltalake cannot read the Delta format saved by Databricks (protocol version mismatch), so I moved back to Parquet.
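The two options raised in this thread (restoring the index on load, or dropping it on save) can be sketched roughly as below. This is only an illustration in plain pandas: `restore_index` and `drop_index` are hypothetical helper names, not part of kedro-datasets or deltalake, and the `__index_level_` prefix is assumed from the column name reported in this issue.

```python
import pandas as pd

# Prefix assumed from the column name reported in this issue.
INDEX_PREFIX = "__index_level_"


def restore_index(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: turn stored index column(s) back into the index."""
    index_cols = [c for c in df.columns if str(c).startswith(INDEX_PREFIX)]
    if not index_cols:
        return df
    restored = df.set_index(index_cols)
    # The placeholder names stood in for unnamed index levels, so clear them.
    restored.index.names = [None] * len(index_cols)
    return restored


def drop_index(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: discard the index before saving."""
    return df.reset_index(drop=True)


# A frame shaped like the one described in the issue after save and load.
loaded = pd.DataFrame({"a": [1, 2], "__index_level_0__": [10, 20]})
restored = restore_index(loaded)

# The alternative: drop a non-default index before the data is written.
original = pd.DataFrame({"a": [1, 2]}, index=[10, 20])
dropped = drop_index(original)
```

Restoring handles several index levels because every column with the prefix is collected; dropping is the simpler choice if the goal is consistency with a dataset that ignores the pandas index.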
Description
A pandas DataFrame saved and loaded using pandas.DeltaTableDataset differs from the original DataFrame: it has an additional column, __index_level_0__. This is an Index saved as a column, but it's not interpreted as such.
I tried to read the stored file manually: pandas.read_parquet loads the index correctly, deltalake.DeltaTable returns the index as a column.
Since the dataset is based on deltalake, maybe it will suffice to add this line to _load()?
Context
How has this bug affected you? What were you trying to accomplish?
I'm trying to use pandas.DeltaTableDataset locally and databricks.ManagedTableDataset in the pipeline deployed to Databricks.
Steps to Reproduce
Expected Result
The DataFrame after save and load should be identical to the one before the operations.
Actual Result
The DataFrame after save and load has an additional column, __index_level_0__, and the index is the default RangeIndex.
Your Environment
kedro==0.18.14
kedro-datasets==1.8.0
deltalake==0.13.0
pyarrow==14.0.1
Python (python -V): 3.10.10