feat(datasets): Created table_args to pass to create_table, create_view, and table methods #909
…o avoid breaking changes Signed-off-by: Mark Druffel <mark.druffel@gmail.com>
Signed-off-by: Mark Druffel <mark.druffel@gmail.com>
47331ff to ef3712e
Just leaving initial comments; happy to review later once it's ready.
     def save(self, data: ir.Table) -> None:
         if self._table_name is None:
             raise DatasetError("Must provide `table_name` for materialization.")

         writer = getattr(self.connection, f"create_{self._materialized}")
    -    writer(self._table_name, data, **self._save_args)
    +    writer(self._table_name, data, **self._table_args)
Is this right? I think the table args should only apply to the `table` call, but haven't looked into it deeply before commenting now.
@deepyaman Sorry, this is a little confusing, so I'm adding a bit more context.
This PR
The `table` method takes the `database` argument, but the `create_table` and `create_view` methods both take the `database` and `overwrite` arguments. The `overwrite` argument is already in `save_args`, but I'm assuming `save_args` will be removed from `TableDataset` in version 6. To avoid breaking changes, but also to minimize change between this release and version 6, I just added the new parameter (`database`) to `table_args` and left the old parameters alone.
To avoid breaking changes but still allow `create_table` and `create_view` arguments to flow through, I combined `_save_args` and `_table_args` here.
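A minimal sketch of what "combining `_save_args` and `_table_args`" could look like at the call site. This is illustrative only, not the code in this PR: `FakeConnection` is a hypothetical stand-in for an Ibis backend, the free function `save` mirrors the dataset method shown in the diff, and the choice that `table_args` wins on a key collision is an assumption.

```python
from typing import Any


class FakeConnection:
    """Hypothetical stand-in for an Ibis backend; it only records calls."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str, dict[str, Any]]] = []

    def create_table(self, name: str, obj: Any = None, **kwargs: Any) -> None:
        self.calls.append(("create_table", name, kwargs))

    def create_view(self, name: str, obj: Any = None, **kwargs: Any) -> None:
        self.calls.append(("create_view", name, kwargs))


def save(
    connection: Any,
    table_name: str,
    data: Any,
    materialized: str,
    save_args: dict[str, Any],
    table_args: dict[str, Any],
) -> None:
    # Pick create_table or create_view, as in the diff above.
    writer = getattr(connection, f"create_{materialized}")
    # Merge both dicts; on a key collision table_args wins (an assumption).
    merged = {**save_args, **table_args}
    writer(table_name, data, **merged)


con = FakeConnection()
save(
    con,
    table_name="tracks",
    data=object(),
    materialized="view",
    save_args={"overwrite": True},
    table_args={"database": "spotify.silver"},
)
```

With this shape, existing catalogs that only set `save_args` keep working, while `database` can arrive through `table_args`.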
Version 6
I am assuming that `save_args` and `load_args` will be dropped from `TableDataset` in version 6. In that change, I'd assume the arguments still used from `load_args` and `save_args` would be added to `table_args`. To make `TableDataset` and `FileDataset` look and feel similar, we could consider adding a commensurate `file_args`. I've not used 5.1 enough yet to say with certainty, but I can't think of a reason a user would want different values in `load_args` than in `save_args` now that it's split from `TableDataset` (i.e. the `filepath`, `file_type`, `sep`, etc. would be the same for load and save). I may be totally overlooking some things though 🤷‍♂️
    bronze_tracks:
      type: ibis.FileDataset # use `to_<file_format>` (write) & `read_<file_format>` (read)
      connection:
        backend: pyspark
      file_args:
        filepath: hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv
        file_format: csv
        materialized: view
        overwrite: True
        table_name: tracks # `to_<file_format>` in ibis has no database parameter so there's no ability to write to a specific catalog / db schema atm, `to_<file_format>` just writes to w/e is active
        sep: ","
    silver_tracks:
      type: ibis.TableDataset # would use `create_<materialized>` (write) & `table` (read)
      connection:
        backend: pyspark
      table_args:
        name: tracks
        database: spotify.silver
        overwrite: True
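As a rough illustration of how a single `table_args` mapping like the one proposed above could feed both the read call (`table`) and the write call (`create_<materialized>`), here is a small sketch. The helper `split_table_args` and the `LOAD_KEYS` set are hypothetical names, not part of kedro-datasets; the only fact taken from the discussion is that `table` accepts `database` but not `overwrite`.

```python
from typing import Any

# Keys forwarded to the read call; per the discussion, `table` takes
# `database` but not `overwrite`. This set is an illustrative assumption.
LOAD_KEYS = {"database"}


def split_table_args(
    table_args: dict[str, Any],
) -> tuple[str, dict[str, Any], dict[str, Any]]:
    """Split one mapping into (name, kwargs for table, kwargs for create_*)."""
    args = dict(table_args)  # copy so the caller's mapping is not mutated
    name = args.pop("name")
    load_kwargs = {k: v for k, v in args.items() if k in LOAD_KEYS}
    save_kwargs = args  # everything except the name goes to the writer
    return name, load_kwargs, save_kwargs


name, load_kwargs, save_kwargs = split_table_args(
    {"name": "tracks", "database": "spotify.silver", "overwrite": True}
)
```

Here `load_kwargs` would be passed as `connection.table(name, **load_kwargs)` and `save_kwargs` as `connection.create_table(name, data, **save_kwargs)`, so `overwrite` never reaches the read path.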
Signed-off-by: Mark Druffel <mark.druffel@gmail.com>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Mark Druffel <mark.druffel@gmail.com>
Signed-off-by: Mark Druffel <mark.druffel@gmail.com>
…ark-druffel/kedro-plugins into fix/datasets/ibis-TableDataset
@deepyaman I changed this to ready for review, but I'm failing a bunch of steps. I tried to follow the guidelines, but when I run the
Aside from the failing checks, I tested this version of table_dataset.py on a DuckDB pipeline, a PySpark pipeline, and a PySpark pipeline on Databricks, and it seems to be working. My only open question relates to my musing above about the expected format of
@jakepenzak For visibility
Signed-off-by: Mark Druffel <mark.druffel@gmail.com>
Sorry, I saw this yesterday and started drafting an apology. 🙈
I will review it later today. 🤞
…On Wed, Nov 13, 2024, 6:16 AM Merel Theisen ***@***.***> wrote:
@merelcht <https://github.com/merelcht> requested your review on: #909 feat(datasets): Created table_args to pass to create_table, create_view, and table methods.
No worries @deepyaman, really appreciate your help! Let me know what I can do to support; I'm just trying to make sure the YAML changes I'm introducing make sense and to figure out how to get through the PR process :) Regarding my issues with
For testing, unfortunately I don't think the tests will work on my personal machine because I'm on an old processor that doesn't support
@mark-druffel Actually, putting aside the issues with local development, if you look at the CI failure on
Looks good on the whole, but one comment re how `database` is handled.
Let me know if I can help with any of the technical aspects of resolving merge conflicts, adding tests, etc.!
    if table_args is not None:
        save_args["database"] = table_args.get("database", None)
This feels a bit magical to me. It's not really consistent with the docstring, either, which says that arguments will be passed to `create_{materialized}`; in reality, the user needs to know that just `database` will be passed.
I personally would recommend one of two approaches. One is to not do anything special here; the user can pass `database` in `save_args` and `database` in `table_args`, and, while it may feel duplicative, at least it's explicit. The other approach is to make an explicit `database` keyword for the dataset, and likely raise an error if `database` is specified in `save_args` and/or `table_args` when it is also passed explicitly.
@mark-druffel does this make sense, and do you have a preference?
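The second approach could be sketched roughly as follows. This is not the implementation that landed in the PR: `resolve_database` is a hypothetical helper, and `ValueError` stands in for whatever exception the dataset would actually raise (kedro's `DatasetError` would be the natural choice, but is not imported here to keep the sketch self-contained).

```python
from typing import Any, Optional


def resolve_database(
    database: Optional[str] = None,
    save_args: Optional[dict[str, Any]] = None,
    table_args: Optional[dict[str, Any]] = None,
) -> Optional[str]:
    """Return the explicit `database`, rejecting conflicting duplicates.

    If `database` is passed explicitly and also appears in `save_args` or
    `table_args`, raise instead of silently choosing one of the values.
    """
    if database is not None:
        for label, args in (("save_args", save_args), ("table_args", table_args)):
            if args and "database" in args:
                raise ValueError(
                    f"`database` was passed explicitly and in `{label}`; "
                    "specify it in only one place."
                )
    return database


# Explicit keyword only: accepted.
resolved = resolve_database("spotify.silver", save_args={"overwrite": True})

# Explicit keyword plus a duplicate in table_args: rejected.
try:
    resolve_database("spotify.silver", table_args={"database": "other"})
    conflict_detected = False
except ValueError:
    conflict_detected = True
```

Failing loudly keeps the behaviour consistent with the docstring: nothing is copied between argument dicts behind the user's back.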
@deepyaman As discussed yesterday, I've moved `database` to the top level. I'm trying to push the changes, but I'm getting blocked by pre-commit now that I have it set up properly.
When it ran, it changed a bunch of files I never touched. I staged those as well (not sure if I should've), but my commit still failed because of Black. I've also run Black manually on the file I changed to try to fix the formatting. Any suggestions on how I can get this working properly? 😬
Based on the screenshot, it's only reformatting one file. Maybe you can do a `git diff` to see what's changed? You can also just add that change, and I can take a look.
Also happy to help debug the workflow on a quick call, if that would help!
…On Fri, Nov 15, 2024, 3:48 PM Mark Druffel ***@***.***> wrote:
In kedro-datasets/kedro_datasets/ibis/table_dataset.py:
image.png (view on web) <https://github.com/user-attachments/assets/94b397cc-7263-4eaf-871f-0405a5cc59ee>