feat: better way of interacting with databases #72

sg-s · 2024-08-15T19:45:47Z

problem description

a typical user flow is

get some data from a database. the data then exists as a pandas dataframe
do some computation on it
maybe write results to a new column in the dataframe
write those results back to the DO database (sync back)

right now, to do so, we have to worry about API calls where we have to laboriously pass database ID, column IDs, etc.

proposed solution -- ux

1. user creates a df using pandas-like syntax

2. user modifies some data

the dataframe warns user that there are unsynced changes

3. user uses pandas-like `to_deeporigin` df method to write changes back to DO

only updated data is sent back
the df intelligently keeps track of modified data
all operations resulting in modifying DO databases trigger a print statement

4. user views df again

now we see that there is no warning about unsynced data/local changes

5. user turns on auto_sync to live dangerously

a message is shown indicating that changes will be automatically written when viewing df

6. user makes some changes

changes automatically written
this triggers a print statement
no warning about local unsaved data

technical implementation

we create a new class that subclasses a pandas dataframe, and write special methods to talk to the data hub
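A minimal sketch of what such a subclass could look like, under stated assumptions: the class name `SyncedFrame`, the `_modified_columns` attribute, and `pending_columns` are hypothetical and are not taken from this PR. `to_deeporigin` and `auto_sync` are named in the description, but the body here only builds and prints a payload instead of talking to the data hub. `_metadata` and `_constructor` are the standard pandas hooks for subclassing.

```python
import pandas as pd


class SyncedFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass that remembers which columns changed."""

    # Names listed in _metadata are carried over to frames derived from this one.
    _metadata = ["auto_sync", "_modified_columns"]

    # Class-level defaults; instances overwrite them on first change.
    auto_sync = False
    _modified_columns = frozenset()

    @property
    def _constructor(self):
        # Make pandas operations return SyncedFrame instead of a plain DataFrame.
        return SyncedFrame

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        # Record plain column names and lists of column names; other key
        # types (boolean masks, slices) are ignored in this sketch.
        if isinstance(key, str):
            names = [key]
        elif isinstance(key, list):
            names = [k for k in key if isinstance(k, str)]
        else:
            names = []
        self._modified_columns = frozenset(self._modified_columns or ()) | frozenset(names)
        if self.auto_sync and names:
            self.to_deeporigin()

    def pending_columns(self):
        """Return the set of column names with unsynced local changes."""
        return set(self._modified_columns or ())

    def to_deeporigin(self):
        """Build the payload for modified columns only, then clear the marks.

        A real implementation would send the payload to the data hub here;
        this sketch only prints and returns it.
        """
        names = sorted(self._modified_columns or ())
        payload = {name: self[name].tolist() for name in names}
        print(f"writing {len(payload)} column(s) to Deep Origin: {names}")
        self._modified_columns = frozenset()
        return payload


df = SyncedFrame({"a": [1, 2], "b": [3, 4]})
df["c"] = df["a"] + df["b"]   # marks column "c" as modified
payload = df.to_deeporigin()  # only column "c" is included
```

Tracking changes in `__setitem__` only catches assignments of the form `df[col] = ...`; edits through `.loc` or `.at` would need separate handling, which this sketch leaves out.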

changes

api.get_dataframe() stores database info in dataframe attrs
dataframe contains information about columns
dataframe.sync works
~~support for intelligently creating new columns~~ won't do in this PR
updating a column writes only those values to that DB
updating a single cell writes only that value to the DB
documentation describing how to use this
pretty printing of databases includes a link to the database on the web UI
auto_sync should be turned off when running in the lambda to preserve existing behavior
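The "write only what changed" items above can be illustrated with a small cell-level diff. This is a sketch, not the code in this PR: the function name `changed_cells` and the row-ID-keyed dict layout are assumptions chosen to keep the example free of any data hub calls.

```python
def changed_cells(original, current):
    """Return (row_id, column, new_value) for every cell that differs.

    Both arguments map row IDs to {column: value} dicts. Rows or columns that
    exist only in `current` count as changed; deletions are not reported.
    """
    changes = []
    for row_id, row in current.items():
        old_row = original.get(row_id, {})
        for column, value in row.items():
            if column not in old_row or old_row[column] != value:
                changes.append((row_id, column, value))
    return changes


# Hypothetical snapshot taken at load time, and the locally edited state.
original = {"row-1": {"mass": 1.0, "name": "a"}, "row-2": {"mass": 2.0, "name": "b"}}
current = {"row-1": {"mass": 1.0, "name": "a"}, "row-2": {"mass": 2.5, "name": "b"}}

print(changed_cells(original, current))  # [('row-2', 'mass', 2.5)]
```

Only the returned triples would need to be sent back, so editing a single cell results in a single value being written.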

changes post review

when auto_sync is False, df keeps track of modified columns so that we can selectively update those
auto_sync is False by default
printing df in jupyter shows last modified row and by whom
printing df in jupyter tells me if there are modifications that need to be written back to DO
printing df in jupyter tells me if auto_sync is enabled
docs updated with new behavior
docs updated with screenshots
every operation of to_deeporigin will print a message to console
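The status messages listed above could be produced by a helper along these lines. This is only a sketch: the function name `sync_status` and the exact wording are assumptions, not what this PR prints. A DataFrame subclass could include such a string in its `_repr_html_` output, which is the hook Jupyter uses for rich display.

```python
def sync_status(modified_columns, auto_sync):
    """Return a one-line status message for display above the dataframe."""
    if auto_sync:
        return "auto_sync is enabled: local changes are written to Deep Origin automatically."
    if modified_columns:
        names = ", ".join(sorted(modified_columns))
        return (
            "This dataframe has local changes not yet written to Deep Origin "
            f"(columns: {names}). Call to_deeporigin() to save them."
        )
    return "No unsynced local changes."


print(sync_status({"mass"}, auto_sync=False))
```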

pyproject.toml

docs/ref/data-hub/types.md

mkdocs.yaml

src/data_hub/api.py

src/utils.py

kennyjwilli

SO FREAKING COOL!

src/utils.py

src/data_hub/dataframe.py

akash-guru

Looks great!

jonrkarr

I love the simpler UX!

Auto-saving

Instead of auto-saving changes back to Deep Origin, I think we should give users a method to easily save changes back to Deep Origin. To fit with the typical DataFrame UX, we could add a to_deeporigin method.

I think auto-save would make it too easy for users to unintentionally overwrite their data.
I think auto-save would be surprising to users.
Making auto-save the default makes it a little harder for users to get a data frame that they can transiently manipulate. For that use case, users would have to take an extra step. I would rather make users take an extra step to sync their changes back to Deep Origin.

Generalizing to enable additional use cases

By making to_deeporigin more explicit, we could also build on this idea further to enable additional use cases. See below.

Use to_deeporigin to create a new database.
1. User creates a DataFrame
2. User calls to_deeporigin to sync the DataFrame to Deep Origin. The user must supply arguments for the database ID, name, and row ID prefix. to_deeporigin then attaches this metadata to the DataFrame.
Use to_deeporigin to create new rows

Documentation

I think we should expand the documentation for this a little to outline how various scenarios are handled (or not). I don't think we need to handle all of the use cases right now. Rather, we could simply make it clear what is presently supported and what isn't presently supported so users aren't surprised.

Creating new databases
Creating new columns
Creating new rows
Editing the names of databases
Editing the columns
Deleting columns
Deleting rows

docs/ref/data-hub/types.md

mkdocs.yaml

docs/data-hub/dataframes.md

src/data_hub/api.py

src/utils.py

src/data_hub/dataframe.py

jonrkarr

Left several minor suggestions for clarifying the documentation

docs/data-hub/dataframes.md

src/data_hub/dataframe.py

feat: better way of interacting with databases

673dc7d

sg-s self-assigned this Aug 15, 2024

feat: automatic syncing of dataframes to databases

09e398e

sg-s marked this pull request as ready for review August 16, 2024 13:54

sg-s requested a review from a team as a code owner August 16, 2024 13:54

sg-s added 4 commits August 16, 2024 09:54

Merge branch 'main' into database-class

da11c50

fix: added required docstrings

61200b2

feat: introduced a short-circuit for a no-op

7d2b81c

feat: dataframes now show link and created time

5408fc5

sg-s requested a review from a team as a code owner August 16, 2024 16:41

sg-s added 2 commits August 16, 2024 13:17

fix: turned off auto-sync on lambda

cd5a547

feat: docs for new dataframe feature

56957f0

sg-s requested review from a team as code owners August 16, 2024 19:27

sg-s had a problem deploying to docs August 16, 2024 19:27 — with GitHub Actions Failure

fix: fixed a broken link in docs

cfdadf2

sg-s temporarily deployed to docs August 16, 2024 19:30 — with GitHub Actions Inactive

sg-s commented Aug 16, 2024

kennyjwilli approved these changes Aug 16, 2024

src/utils.py (outdated, resolved)

src/data_hub/dataframe.py (resolved)

src/data_hub/dataframe.py (resolved)

akash-guru reviewed Aug 16, 2024

jonrkarr requested changes Aug 17, 2024
