Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: better way of interacting with databases #72

Merged
merged 15 commits into from
Aug 22, 2024
Merged

feat: better way of interacting with databases #72

merged 15 commits into from
Aug 22, 2024

Conversation

sg-s
Copy link
Collaborator

@sg-s sg-s commented Aug 15, 2024

problem description

a typical user flow is

  • get some data from a database. data is then exstant as a pandas dataframe
  • do some computation on it
  • maybe write results to a new column in the dataframe
  • write those results back to the DO database (sync back)

right now, to do so, we have to worry about API calls where we have to laboriously pass database ID, column IDs, etc.

proposed solution -- ux

1. user creates a df using pandas-like syntax

Screenshot 2024-08-19 at 9 52 42 PM

2. user modifies some data

the dataframe warns user that there are unsynced changes

Screenshot 2024-08-19 at 9 53 05 PM

3. user uses pandas-like to_deeporigin df method to write changes back to DO

  • only updated data is sent back
  • the df intelligently keeps track of modified data
  • all operations resulting in modifying DO databases trigger a print statement
Screenshot 2024-08-19 at 9 53 16 PM

4. user views df again

  • now we see that there is no warning about unsynced data/local changes
Screenshot 2024-08-19 at 9 53 34 PM

5. user turns on auto_sync to live dangerously

  • a message is shown indicating that changes will be automatically written when viewing df
Screenshot 2024-08-19 at 9 53 47 PM

6. user makes some changes

  • changes automatically written
  • this triggers a print statement
  • no warning about local unsaved data
Screenshot 2024-08-19 at 9 54 03 PM

technical implementation

we create a new class that subclasses a pandas dataframe, and write special methods to talk to the data hub

changes

  • api.get_dataframe() stores database info in dataframe attrs
  • dataframe contains information about columns
  • dataframe.sync works
  • support for intelligently creating new columns won't do this PR
  • updating a column writes only those values to that DB
  • updating a single cell writes only that value to the DB
  • documentation describing how to use this
  • pretty printing of databases includes a link to the database on the web UI
  • auto_sync should be turned off when running in the lambda to preserve existing behavior

changes post review

  • when auto_sync is False, df keeps track of modified columns so that we can selectively update those
  • auto_sync is False by default
  • printing df in jupyter shows last modified row and by whom
  • printing df in jupyter tells me if there are modifications that need to be written back to DO
  • printing df in jupyter tells me if auto_sync is enabled
  • docs updated with new behavior
  • docs updated with screenshots
  • every operation of to_deeporigin will print a message to console

@sg-s sg-s self-assigned this Aug 15, 2024
@sg-s sg-s marked this pull request as ready for review August 16, 2024 13:54
@sg-s sg-s requested a review from a team as a code owner August 16, 2024 13:54
@sg-s sg-s requested a review from a team as a code owner August 16, 2024 16:41
pyproject.toml Show resolved Hide resolved
docs/ref/data-hub/types.md Show resolved Hide resolved
mkdocs.yaml Show resolved Hide resolved
src/data_hub/api.py Show resolved Hide resolved
src/data_hub/api.py Outdated Show resolved Hide resolved
src/data_hub/api.py Show resolved Hide resolved
src/data_hub/api.py Show resolved Hide resolved
src/data_hub/api.py Show resolved Hide resolved
src/utils.py Show resolved Hide resolved
Copy link
Contributor

@kennyjwilli kennyjwilli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SO FREAKING COOL!

src/utils.py Outdated Show resolved Hide resolved
src/data_hub/dataframe.py Show resolved Hide resolved
src/data_hub/dataframe.py Show resolved Hide resolved
Copy link

@akash-guru akash-guru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

Copy link
Member

@jonrkarr jonrkarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I live the simpler UX!

Auto-saving

Instead of auto-saving changes back to Deep Origin, I think we should give users a method to easily save changes back to Deep Origin. To fit with the typical DataFrame UX, we could add a to_deeporigin method.

  • I think auto-save would make it too easy for users to unintentionally overwrite their data.
  • I think auto-save would be surprising to users.
  • By making auto-saving the default, it makes it a little harder for users to get a data frame that they can transiently manipulate. For this use case, users would have to go an extra step, rather than making users go an extra step to sync their changes back to deeporigin.

Generalizing to enable additional use cases

By making to_deeporigin more explicit, we could also build on this idea further to enable additional use cases. See below.

  • Use to_deeporigin to create a new database.
    1. User creates a DataFrame
    2. User calls to_deeporigin to sync the DataFrame to Deep Origin. The user must supply arguments for the database ID, name, and row ID prefix. to_deeporigin then attaches this metadata to the DataFrame.
  • Use to_deeporigin to create a new rows

Documentation

I think we should expand the documentation for this a little to outline how various scenarios are handled (or not). I don't think we need to handle all of the use cases right now. Rather, we could simply make it clear what is presently supported and what isn't presently supported so users aren't surprised.

  • Creating new databases
  • Creating new columns
  • Creating new rows
  • Editing the names of databases
  • Editing the columns
  • Deleting columns
  • Deleting rows

mkdocs.yaml Outdated Show resolved Hide resolved
src/utils.py Outdated Show resolved Hide resolved
src/data_hub/dataframe.py Outdated Show resolved Hide resolved
@sg-s sg-s marked this pull request as draft August 19, 2024 16:42
@sg-s sg-s marked this pull request as ready for review August 20, 2024 12:50
@sg-s sg-s requested a review from jonrkarr August 20, 2024 12:50
Copy link
Member

@jonrkarr jonrkarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left several minor suggests for clarifying the documentation

docs/data-hub/dataframes.md Show resolved Hide resolved
docs/data-hub/dataframes.md Outdated Show resolved Hide resolved
docs/data-hub/dataframes.md Outdated Show resolved Hide resolved
docs/data-hub/dataframes.md Outdated Show resolved Hide resolved
src/data_hub/dataframe.py Outdated Show resolved Hide resolved
src/data_hub/dataframe.py Outdated Show resolved Hide resolved
src/data_hub/dataframe.py Outdated Show resolved Hide resolved
src/data_hub/dataframe.py Outdated Show resolved Hide resolved
src/data_hub/dataframe.py Outdated Show resolved Hide resolved
src/data_hub/dataframe.py Outdated Show resolved Hide resolved
@sg-s sg-s enabled auto-merge (squash) August 22, 2024 13:00
@sg-s sg-s merged commit 224dde9 into main Aug 22, 2024
11 checks passed
@sg-s sg-s deleted the database-class branch August 22, 2024 13:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants