[Versioning]: Explore Kedro + Iceberg for versioning #4241
Comments
This is slightly unscientific, but I trust the vibes in the industry enough to say Iceberg will clearly be the winner in the long term. Plus people saying things like this: In my opinion this is a situation where we should really go all in on the technology rather than be super agnostic / one-size-fits-all. I'd love a future for Kedro where, without much configuration, persisted data defaults to this model.
@datajoely I actually took a stab at this a while ago. My experience is that Delta has more mature support than Iceberg in the Python ecosystem at the moment; for example, the integration of ibis with Iceberg is suboptimal. So I think Delta is going to perform better for anything database-related; AFAIK Iceberg always loads things into memory first. One thing to note is that these "versioning" mechanisms are not as effective as we want. For example, an incremental change of adding 1 row results in a complete rewrite in the current Kedro dataset with Delta as well. For high-level versioning, they work very well with dataframe/table formats. The main challenge I see is how to unify "versioning" in Kedro: Kedro uses a customisable timestamp, while Delta uses an incremental version number (0, 1, xxx) or a timestamp. Iceberg probably uses something similar, but I haven't checked.
Delta is 100% more mature, but Iceberg is the horse to back. This is the thread I was trying to find earlier: https://x.com/sean_lynch/status/1845500735842390276 I also don't think we should be wedded to that timestamp decision. It was made a long time ago and has a non-trivial risk of collision. If we were doing it again, we'd be better off using a ULID...
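To make the collision point concrete: two Kedro-style timestamp versions generated within the same second are identical, while a ULID combines a millisecond timestamp with 80 bits of randomness and stays lexicographically sortable. A minimal, dependency-free sketch (a real project would use a library such as python-ulid; this hand-rolled version is only for illustration):

```python
import os
import time

# Crockford base32 alphabet used by ULID (no I, L, O, U)
_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def _encode(value: int, length: int) -> str:
    """Encode an integer as `length` base32 characters, big-endian."""
    chars = []
    for _ in range(length):
        chars.append(_ALPHABET[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


def ulid() -> str:
    """26-char ULID: 48-bit millisecond timestamp + 80 bits of randomness.

    The timestamp prefix makes IDs sortable by creation time; the random
    suffix makes same-millisecond collisions astronomically unlikely.
    """
    timestamp_ms = int(time.time() * 1000)
    randomness = int.from_bytes(os.urandom(10), "big")
    return _encode(timestamp_ms, 10) + _encode(randomness, 16)
```

Because the time component is the leading 10 characters, plain string ordering of ULIDs matches creation order at millisecond granularity, which is exactly the property a versioned dataset path needs.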
^ To be more specific, I was referring mainly to the Python bindings, i.e. PyIceberg and delta-rs (Python). Iceberg itself is fairly mature, especially around catalogs etc., but the Python binding seems to be lagging behind a little bit.
Any chance I can take this ticket or work together on it? I explored this a little a while ago, and it would be a great opportunity to continue.
I agree with @datajoely that Iceberg is the horse to back, at least from an API perspective. PyIceberg is maturing (it has moved significantly in the past couple of years). Realistically, I don't think Kedro should dictate whether you use Iceberg or Delta (or Hudi); that is a user choice, just like whether to use Spark or Polars. This is where unified APIs will ideally make implementation easier.
So I'm actually being bullish and saying we should pick one of these when it comes to our idea of versioned data. We simply don't have capacity to integrate everywhere properly. |
Super cool application of these concepts |
I'm with @deepyaman on this one. There should be a layer in Kedro that is format-agnostic. We can be more opinionated in a higher layer. |
What's clear though is that Apache Iceberg's REST catalog API has won for sure kedro-org/kedro-devrel#141 (comment)
I just want to warn against the noble pursuit of generalisation when there are times to pick a winner; I'd much rather pick a horse and do it well.
@ElenaKhaustova I have left some questions at the end since it's not a PR yet. https://noklam.github.io/blog/posts/pyiceberg/2024-11-18-PyIcebergDataset.html

# Questions
- What does it mean when we say "if we can use Iceberg to map a single version number to code, parameters, and I/O data within Kedro and how it aligns with Kedro's workflow"? Versioning code & parameters sounds more like versioning artifacts.
- How do we efficiently version data? `overwrite` is a complete rewrite. For a SQL engine this is implemented by the engine, which utilises APIs like `append` and `replace`. With pandas/polars it is unclear if this is possible. (It may be possible with something like `ibis`.)
- Incremental pipeline (and incremental data)
- Versioning non-table types, i.e. parameters, code(?). Iceberg supports only three formats out of the box: Apache Parquet, Apache ORC, and Apache Avro. Parquet is the first-class citizen and the only format people use in practice.
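The `overwrite`-vs-`append` question above can be illustrated with a toy model (not PyIceberg's or Delta's actual API) of snapshot-based table versioning: each new version is just a list of data files, so an append reuses every existing file and writes one new one, while an overwrite rewrites everything:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    version: int
    files: tuple  # data files this snapshot references


class TinyTable:
    """Toy model of snapshot-based table versioning (Iceberg/Delta style)."""

    def __init__(self):
        self.snapshots = []
        self._counter = 0

    def _new_file(self):
        self._counter += 1
        return f"part-{self._counter:04d}.parquet"

    def append(self):
        """Incremental write: reuse every existing file, add one new one."""
        prev = self.snapshots[-1].files if self.snapshots else ()
        self.snapshots.append(Snapshot(len(self.snapshots), prev + (self._new_file(),)))

    def overwrite(self):
        """Full rewrite: the new snapshot references only fresh files."""
        self.snapshots.append(Snapshot(len(self.snapshots), (self._new_file(),)))
```

Under this model, adding one row via `append` costs one new file, whereas the current Kedro dataset behaviour described above corresponds to `overwrite` on every save; time travel is then just reading the file list of an older snapshot.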
From the versioning research (https://miro.com/app/board/uXjVK9U8mVo=/?share_link_id=24726044039) pain points and summary, we concluded that users mention two major problems: versioning and experiment tracking. At first, we decided to focus on versioning. There, the main user pain point was not versioning a specific artifact, which current Kedro versioning already allows (though not in an optimal way), but being able to retrieve a whole experiment/run: travelling back in time with your code and data and checking out a specific version for the whole Kedro project, not just an individual artifact. Please see the Kedro + DVC example for a better understanding: #4239 (comment) It's clear we can easily version artifacts (tabular data), but what about versioning catalogs/projects (more high-level entities) and non-tabular data? So the main questions are:
My view:
I'm willing to bet >95% of use cases fall into this.
Now that I've seen how elegant the DVC integration can be, maybe that's the right paradigm?
Description
At the current stage, by versioning we mean mapping a single version number to the corresponding versions of parameters, I/O data, and code, so that one is able to retrieve a full project state, including data, at any point in time.
The goal is to check if we can use Iceberg to map a single version number to code, parameters, and I/O data within Kedro and how it aligns with Kedro’s workflow.
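One way to read this goal is a run-level manifest that pins, under a single run id, the git commit, a hash of the resolved parameters, and the Iceberg snapshot id recorded for each dataset after it is saved. A minimal sketch, where the function name, manifest fields, and snapshot-id capture are all assumptions rather than existing Kedro API:

```python
import hashlib
import json


def run_manifest(run_id, git_sha, parameters, dataset_snapshots):
    """Pin code, parameters, and data versions under one run id.

    `dataset_snapshots` maps dataset name -> the (hypothetical) Iceberg
    snapshot id captured right after each dataset was saved.
    """
    # Canonical JSON so that logically equal parameter dicts hash identically.
    params_blob = json.dumps(parameters, sort_keys=True).encode()
    return {
        "run_id": run_id,
        "code": git_sha,
        "parameters_sha256": hashlib.sha256(params_blob).hexdigest(),
        "datasets": dict(dataset_snapshots),
    }
```

Restoring a run would then mean checking out `code`, re-resolving parameters against `parameters_sha256`, and reading each dataset at its recorded snapshot id, which is the "checkout the whole project" behaviour described above.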
As a result, we expect a working example of a Kedro project used with Iceberg for versioning and some assumptions on:
Context
#4199
Market research