
feat(datasets): Created table_args to pass to create_table, create_view, and table methods #909

Open · wants to merge 10 commits into base: main
Conversation


@mark-druffel mark-druffel commented Oct 25, 2024

Description

Development notes

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

@mark-druffel mark-druffel changed the title Created table_args to pass to create_table, create_view, and table methods Fix(datasets): Created table_args to pass to create_table, create_view, and table methods Oct 25, 2024
@mark-druffel mark-druffel changed the title Fix(datasets): Created table_args to pass to create_table, create_view, and table methods fix(datasets): Created table_args to pass to create_table, create_view, and table methods Oct 25, 2024
…o avoid breaking changes

Signed-off-by: Mark Druffel <mark.druffel@gmail.com>
Signed-off-by: Mark Druffel <mark.druffel@gmail.com>
@deepyaman deepyaman (Member) left a comment:

Just leaving initial comments; happy to review later once it's ready.

kedro-datasets/RELEASE.md (outdated, resolved)

def save(self, data: ir.Table) -> None:
    if self._table_name is None:
        raise DatasetError("Must provide `table_name` for materialization.")

    writer = getattr(self.connection, f"create_{self._materialized}")
-   writer(self._table_name, data, **self._save_args)
+   writer(self._table_name, data, **self._table_args)
@deepyaman (Member) commented:

Is this right? I think the table args should only apply to the table call, but haven't looked into it deeply before commenting now.

@mark-druffel mark-druffel (Author) replied Oct 28, 2024:

@deepyaman Sorry, this is a little confusing, so I'm just adding a bit more context.

This PR

The table method takes the database argument, but the create_table and create_view methods take both database and overwrite. The overwrite argument is already in save_args, and I'm assuming save_args will be removed from TableDataset in version 6. To avoid breaking changes, while also minimizing churn between this release and version 6, I just added the new parameter (database) to table_args and left the old parameters alone.

To avoid breaking changes but still allow create_table and create_view arguments to flow through, I combined _save_args and _table_args here.

Version 6

I am assuming that save_args & load_args will be dropped from TableDataset in version 6. In that change, I'd assume the arguments still used from load_args and save_args would be added to table_args. To make TableDataset and FileDataset look / feel similar, we could consider adding a commensurate file_args. I've not used 5.1 enough yet to say with certainty, but I can't think of a reason a user would want different values in load_args than in save_args now that FileDataset is split from TableDataset (i.e. the filepath, file_format, sep, etc. would be the same for load and save). I may be totally overlooking some things though 🤷‍♂️

bronze_tracks:
  type: ibis.FileDataset # use `to_<file_format>` (write) & `read_<file_format>` (read)
  connection:
    backend: pyspark
  file_args:
    filepath: hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv
    file_format: csv
    materialized: view
    overwrite: True
    table_name: tracks # `to_<file_format>` in ibis has no database parameter, so there's no way to write to a specific catalog / db schema at the moment; it just writes to whatever is active
    sep: "," 

silver_tracks:
  type: ibis.TableDataset # would use `create_<materialized>` (write) & `table` (read)
  connection:
    backend: pyspark
  table_args:
    name: tracks
    database: spotify.silver
    overwrite: True
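
To make the mechanics concrete, here is a minimal sketch of the dispatch being described, assuming an ibis backend whose create_table / create_view / table methods accept database (and overwrite for the writers). The class and attribute names below are illustrative, not the actual kedro-datasets implementation:

import ibis.expr.types as ir


class _TableDatasetSketch:
    """Illustrative only: routes table_args into the matching ibis call."""

    def __init__(self, table_name: str, connection, table_args: dict | None = None):
        self.connection = connection  # an ibis backend, e.g. ibis.duckdb.connect(...)
        self._table_name = table_name
        self._table_args = dict(table_args or {})
        # "materialized" selects create_table vs create_view on save;
        # remaining keys (database, overwrite, ...) are forwarded as-is.
        self._materialized = self._table_args.pop("materialized", "table")

    def load(self) -> ir.Table:
        # `table` only understands `database`, so forward just that key.
        return self.connection.table(
            self._table_name, database=self._table_args.get("database")
        )

    def save(self, data: ir.Table) -> None:
        writer = getattr(self.connection, f"create_{self._materialized}")
        writer(self._table_name, data, **self._table_args)

In the silver_tracks entry above, name would presumably map to the table name, while database and overwrite flow straight through to create_table.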

@mark-druffel mark-druffel changed the title fix(datasets): Created table_args to pass to create_table, create_view, and table methods feat(datasets): Created table_args to pass to create_table, create_view, and table methods Oct 28, 2024
Signed-off-by: Mark Druffel <mark.druffel@gmail.com>
@mark-druffel mark-druffel deleted the fix/datasets/ibis-TableDataset branch October 28, 2024 19:39
@deepyaman deepyaman reopened this Oct 28, 2024
deepyaman and others added 5 commits October 28, 2024 16:58
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Mark Druffel <mark.druffel@gmail.com>
Signed-off-by: Mark Druffel <mark.druffel@gmail.com>
@mark-druffel mark-druffel marked this pull request as ready for review November 1, 2024 22:37
@mark-druffel (Author)

@deepyaman I changed this to ready for review, but I'm failing a bunch of checks. I tried to follow the guidelines, but when I run the make test targets they all fail with a 'No rule' error. Any chance you can take a look and give me a bit of guidance? Sorry, just not sure where to go from here 😬


Aside from the failing checks, I tested this version of table_dataset.py on a duckdb pipeline, a pyspark pipeline, and a pyspark pipeline on Databricks, and it seems to be working. My only open question relates to my musing above about the expected format of TableDataset and FileDataset.

@mark-druffel (Author)

@jakepenzak For visibility

@deepyaman (Member)

deepyaman commented Nov 13, 2024 via email

@mark-druffel (Author)

mark-druffel commented Nov 13, 2024

Sorry, I saw this yesterday and started drafting an apology. 🙈 I will review it later today. 🤞

On Wed, Nov 13, 2024, 6:16 AM Merel Theisen wrote: @merelcht requested your review on #909: feat(datasets): Created table_args to pass to create_table, create_view, and table methods.

No worries @deepyaman, really appreciate your help! Let me know what I can do to support, just trying to make sure the yaml changes I'm introducing make sense and figure out how to get through the PR process :)

Regarding my issues with make, I was able to figure out my initial question, but I still ran into some errors when linting and testing. Not sure what went wrong, but for linting I get errors related to some other datasets. Perhaps from rebasing 🤷

For testing, unfortunately I don't think the tests will work on my personal machine because I'm on an old processor that doesn't support AVX2. I know that causes issues when running polars; I'm not sure if that's the root cause, but the steps seem to fail because of a Python illegal instruction error in execnet...

@deepyaman (Member)

@mark-druffel Actually, putting aside the issues with local development: if you look at the CI failure on kedro-datasets / unit-tests, you'll see that the main issue is missing unit-test coverage for lines 150 and 155 in kedro_datasets/ibis/table_dataset.py:

kedro_datasets/ibis/table_dataset.py                            66      2    97%   150, 155
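
For what it's worth, a test roughly like the following should exercise that path, since passing any table_args reaches the database-forwarding branch quoted further down. This is only a sketch; the constructor arguments and fixture shape are assumptions modelled on the documented connection config, not the actual test suite:

import ibis
import pytest

from kedro_datasets.ibis import TableDataset


@pytest.fixture
def dummy_table():
    # Small in-memory ibis table to round-trip through the dataset.
    return ibis.memtable({"col1": [1, 2, 3], "col2": [4, 5, 6]})


def test_save_with_table_args(tmp_path, dummy_table):
    # Assumed constructor shape: table_name, a duckdb connection config,
    # and the table_args introduced in this PR.
    dataset = TableDataset(
        table_name="test_table",
        connection={"backend": "duckdb", "database": str(tmp_path / "test.db")},
        table_args={"materialized": "table", "overwrite": True},
    )
    dataset.save(dummy_table)
    reloaded = dataset.load()
    assert reloaded.execute().shape == (3, 2)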

@deepyaman deepyaman (Member) left a comment:

Looks good on the whole, but one comment re how database is handled.

Let me know if I can help with any of the technical aspects of resolving merge conflicts, adding tests, etc.!

Comment on lines +149 to +150

if table_args is not None:
    save_args["database"] = table_args.get("database", None)
@deepyaman (Member) commented:

This feels a bit magical to me. It's not really consistent with the docstring, either, which says that arguments will be passed to create_{materialized}; in reality, the user needs to know that just database will be passed.

I personally would recommend one of two approaches. One is to not do anything special here; the user can pass database in save_args and database in table_args, and, while it may feel duplicative, at least it's explicit. The other is to make database an explicit keyword argument for the dataset, and likely raise an error if database is also specified in save_args and/or table_args.

@mark-druffel does this make sense, and do you have a preference?
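
For the second option, the constructor-level guard could look roughly like this. It is only a sketch; the helper name, parameters, and error message are made up for illustration, and DatasetError is simply the exception the dataset already raises elsewhere:

from kedro.io.core import DatasetError


def _resolve_database(database, save_args, table_args):
    """Prefer the explicit keyword and refuse ambiguous duplicates."""
    in_save = (save_args or {}).get("database")
    in_table = (table_args or {}).get("database")
    if database is not None and (in_save is not None or in_table is not None):
        raise DatasetError(
            "Pass `database` either as a top-level argument or inside "
            "`save_args`/`table_args`, not both."
        )
    return database if database is not None else (in_save or in_table)

The dataset could then pass the resolved value to both the table() call on load and the create_* call on save.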

@mark-druffel (Author) replied:

@deepyaman As discussed yesterday, I've moved database to the top level. I'm trying to push the changes, but I'm getting blocked by pre-commit now that I have it set up properly.

When it ran, it changed a bunch of files I never touched. I staged those as well (not sure if I should've), but my commit still failed because of Black. I've also run black manually on the file I changed to try to fix its formatting. Any suggestions on how I can get this working properly? 😬


@deepyaman (Member)

deepyaman commented Nov 15, 2024 via email
