feat: better way of interacting with databases #72

Merged
merged 15 commits into from
Aug 22, 2024
10 changes: 10 additions & 0 deletions docs/changelog.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
# Changelog

### `deeporigin v2.1.0`


- dropped support for column keys
- added the [deeporigin.DataFrame][src.data_hub.dataframe.DataFrame] class, which:
    - is a drop-in replacement for pandas.DataFrame
    - supports all pandas methods
    - supports automatic syncing with a Deep Origin database
79 changes: 79 additions & 0 deletions docs/data-hub/dataframes.md
@@ -0,0 +1,79 @@
# Using Deep Origin DataFrames

This page describes how to use Deep Origin DataFrames, the primary objects you will use to interact with databases on Deep Origin. It covers:

- fetching data from a Deep Origin database
- modifying data locally
- writing data back to Deep Origin


!!! question "What is a Deep Origin DataFrame?"
    A Deep Origin DataFrame is a subclass of a [pandas.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) that is backed by a Deep Origin database. Because it is a subclass of a pandas DataFrame, all pandas DataFrame methods work on Deep Origin DataFrames.
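As a rough illustration of why subclassing keeps pandas behavior intact, the sketch below defines a hypothetical `TaggedFrame` subclass (it is not part of `deeporigin`) and overrides `_constructor`, so that results of pandas operations keep the subclass type. It assumes only that pandas is installed.

```python
import pandas as pd


class TaggedFrame(pd.DataFrame):
    """Hypothetical pandas subclass, used only for illustration."""

    @property
    def _constructor(self):
        # Tell pandas to build the results of operations as TaggedFrame
        return TaggedFrame


df = TaggedFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# Ordinary pandas operations are inherited unchanged
subset = df[df["x"] > 1]
total = subset["y"].sum()

print(type(subset).__name__, total)
```

Without the `_constructor` override, operations that build a new frame could hand back a plain `pandas.DataFrame` and lose any behavior the subclass adds.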

## Create a DataFrame

Create a DataFrame using:

```python
from deeporigin.data_hub import api
df = api.get_dataframe("database-id")
```

In an interactive web-based environment such as JupyterLab, you should be able to view the DataFrame using:

```py
df
```

which should show you something like this:

![DataFrame](../images/dataframe.png)


!!! tip "Information in the DataFrame"
    In addition to the information you would find in the rows and columns of a pandas DataFrame, a Deep Origin DataFrame also contains metadata about the underlying database. The view above also shows:

    - The name of the database on Deep Origin.
    - A link to the database on Deep Origin.
    - When the database was created.
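The `get_dataframe` function in `src/data_hub/api.py` stores this metadata in the DataFrame's `attrs` dictionary, under keys such as `id` and `metadata`. The sketch below mimics that layout on a plain pandas DataFrame. Every value is a made-up placeholder, and the date format is the one that `_repr_html_` parses.

```python
from datetime import datetime

import pandas as pd

df = pd.DataFrame({"y": [1.0, 2.0]}, index=["row-1", "row-2"])

# Placeholder metadata, laid out the way get_dataframe populates attrs
df.attrs["id"] = "database-id"
df.attrs["metadata"] = {
    "name": "Example database",
    "hid": "example-db",
    "dateCreated": "2024-08-22 12:00:00.000000",
}

name = df.attrs["metadata"]["name"]
created = datetime.strptime(
    df.attrs["metadata"]["dateCreated"], "%Y-%m-%d %H:%M:%S.%f"
)

print(f"{name} (id: {df.attrs['id']}), created {created:%Y-%m-%d}")
```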

## Modify data in the DataFrame

Because a Deep Origin DataFrame is a subclass of a pandas DataFrame, all pandas DataFrame methods work on Deep Origin DataFrames. The examples below modify an entire column and a single cell.



### Modify entire columns

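To modify entire columns, use native pandas syntax. With `auto_sync` on, the column must already exist in the database, because `sync` raises `NotImplementedError` when it finds no metadata for a column: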

```python
df["y^2"] = df["y"] ** 2
```
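Column assignment can trigger a write because the subclass overrides `__setitem__`. The following is a minimal sketch of that pattern, not the `deeporigin` implementation: `SyncingFrame` and `sync_log` are hypothetical names, and appending to a list stands in for the call to the remote database.

```python
import pandas as pd

sync_log = []  # stand-in for writes to a remote database


class SyncingFrame(pd.DataFrame):
    """Hypothetical subclass that records column assignments."""

    auto_sync: bool = False

    @property
    def _constructor(self):
        return SyncingFrame

    def __setitem__(self, key, value):
        # Let pandas perform the local assignment first
        super().__setitem__(key, value)
        # Then record the change, where a real class would write remotely
        if self.auto_sync:
            sync_log.append(key)


df = SyncingFrame({"y": [1, 2, 3]})
df.auto_sync = True
df["y^2"] = df["y"] ** 2

print(sync_log)
```

Calling the pandas method before recording the change means a failed assignment never reaches the remote side.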

### Modify data in a single cell

To modify data in a single cell, use native pandas syntax (the `at` indexer):

```python
df.at["sgs-1", "y"] = 10
```
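Single-cell writes go through the `at` property rather than `__setitem__`, so they need their own hook. Below is a sketch of one way to intercept them, assuming a hypothetical `CellIndexer` wrapper and a `cell_log` list in place of a remote write. It uses the public `.loc` accessor, whereas the class in this PR calls pandas' private `_get_value` and `_set_value`.

```python
import pandas as pd

cell_log = []  # stand-in for single-cell writes to a remote database


class CellIndexer:
    """Hypothetical wrapper that intercepts single-cell access."""

    def __init__(self, frame):
        self.frame = frame

    def __getitem__(self, key):
        row, column = key
        return self.frame.loc[row, column]

    def __setitem__(self, key, value):
        row, column = key
        if self.frame.loc[row, column] == value:
            return  # value is unchanged, so there is nothing to record
        self.frame.loc[row, column] = value
        cell_log.append((row, column, value))


class CellFrame(pd.DataFrame):
    """Hypothetical subclass whose `at` returns the wrapper above."""

    @property
    def _constructor(self):
        return CellFrame

    @property
    def at(self):
        return CellIndexer(self)


df = CellFrame({"y": [1, 2]}, index=["sgs-1", "sgs-2"])
df.at["sgs-1", "y"] = 10
df.at["sgs-1", "y"] = 10  # same value again: skipped

print(cell_log)
```

Skipping writes that do not change the value avoids redundant calls to the database.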


## Write data back to Deep Origin

!!! success "Nothing to do here!"
    A Deep Origin DataFrame automatically syncs back to the Deep Origin database. Making the changes above also changes the source database.


### Manually syncing

Automatic syncing occurs when the `auto_sync` attribute of a Deep Origin DataFrame is set to `True`. Turn `auto_sync` off if you want to decide when, or whether, local changes are written back to the source database.

To manually sync changes, use the `sync` method:

```python
df.sync()
```

Read more about the sync method [here](../ref/data-hub/types.md#src.data_hub.dataframe.DataFrame.sync).
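To make the difference between automatic and manual syncing concrete, here is a self-contained sketch. `ManualFrame` and the `remote` dictionary are hypothetical stand-ins for a Deep Origin DataFrame and its source database, and the `sync` method only imitates the documented defaults (all columns and all rows when none are given).

```python
from typing import Optional

import pandas as pd

remote = {}  # stand-in for the source database: {(row, column): value}


class ManualFrame(pd.DataFrame):
    """Hypothetical subclass; `sync` here only writes to a local dict."""

    auto_sync: bool = False

    @property
    def _constructor(self):
        return ManualFrame

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        if self.auto_sync:
            self.sync(columns=[key])

    def sync(self, *, columns: Optional[list] = None, rows: Optional[list] = None):
        # Default to every column and every row
        columns = list(self.columns) if columns is None else columns
        rows = list(self.index) if rows is None else rows
        for column in columns:
            for row in rows:
                remote[(row, column)] = self.loc[row, column]


df = ManualFrame({"y": [1, 2]}, index=["sgs-1", "sgs-2"])

df["y"] = df["y"] * 10  # auto_sync is False, so nothing is written yet
written_before_sync = dict(remote)

df.sync()  # write all local changes in one step
print(len(written_before_sync), len(remote))
```

With `auto_sync` off, any number of local edits can be made before a single `sync` call writes them.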
Binary file added docs/images/dataframe.png
13 changes: 11 additions & 2 deletions docs/ref/data-hub/types.md
@@ -1,6 +1,15 @@
# Types and constants
# Classes and Constants

This page lists some classes, types and constants used in this library.

::: src.data_hub.dataframe.DataFrame
    options:
      members:
        - sync
        - auto_sync
      filters:
        - "!^_"

This page lists some types and constants used in this library.

::: src.utils
options:
5 changes: 4 additions & 1 deletion mkdocs.yaml
@@ -44,6 +44,8 @@ nav:
- Authenticate: how-to/auth.md
- Data hub:
- data-hub/index.md
- Deep Origin DataFrames:
- Tutorial: data-hub/dataframes.md
- How to:
- Create objects: how-to/data-hub/create.md
- Delete objects: how-to/data-hub/delete.md
@@ -54,12 +56,13 @@ nav:
- API reference:
- High-level API: ref/data-hub/high-level-api.md
- Low-level API: ref/data-hub/low-level-api.md
- Types and constants: ref/data-hub/types.md
- Classes & Constants: ref/data-hub/types.md
- Compute hub:
- compute-hub/index.md
- How to:
- Install variables and secrets: how-to/variables.md
- Get info about your workstation: how-to/workstation-info.md
- Changelog: changelog.md
- Support: https://www.support.deeporigin.com/servicedesk/customer/portals


1 change: 1 addition & 0 deletions pyproject.toml
@@ -28,6 +28,7 @@ dependencies = [
"filetype",
"httpx",
"deeporigin-data-sdk==0.1.0a7",
"humanize",
]
dynamic = ["version"]

26 changes: 14 additions & 12 deletions src/data_hub/api.py
@@ -861,14 +861,22 @@ def get_dataframe(

# this import is here because we don't want to
# import pandas unless we actually use this function
import pandas as pd
from deeporigin.data_hub.dataframe import DataFrame

df = pd.DataFrame(data)
df = DataFrame(data)
df.attrs["file_ids"] = list(set(file_ids))
df.attrs["reference_ids"] = list(set(reference_ids))
df.attrs["id"] = database_id
df.attrs["metadata"] = dict(db_row)

return _type_and_cleanup_dataframe(df, columns)
df = _type_and_cleanup_dataframe(df, columns)

# if this code is running the lambda, we do not
# turn on auto_sync by default because we don't want
# side effects from code that the LLM writes
if os.environ.get("AWS_LAMBDA_FUNCTION_NAME") is None:
df.auto_sync = True
return df

else:
# rename keys
@@ -994,7 +1002,7 @@ def _row_to_dict(row, *, use_file_names: bool = True):

@beartype
def _type_and_cleanup_dataframe(
df, # pd.Dataframe, not typed to avoid pandas import
df, # Dataframe, not typed to avoid pandas import
columns: list[dict],
):
"""Internal function to type and clean a pandas dataframe
@@ -1164,10 +1172,7 @@ def get_row_data(
column_name_mapper = dict()
column_cardinality_mapper = dict()
for col in parent_response.cols:
if use_column_keys:
column_name_mapper[col["id"]] = col["key"]
else:
column_name_mapper[col["id"]] = col["name"]
column_name_mapper[col["id"]] = col["name"]
column_cardinality_mapper[col["id"]] = col["cardinality"]

# now use this to construct the required dictionary
@@ -1176,10 +1181,7 @@ def get_row_data(
for col in parent_response.cols:
if "systemType" in col.keys() and col["systemType"] == "bodyDocument":
continue
if use_column_keys:
row_data[col["key"]] = None
else:
row_data[col["name"]] = None
row_data[col["name"]] = None
if not hasattr(response, "fields"):
return row_data
for field in response.fields:
159 changes: 159 additions & 0 deletions src/data_hub/dataframe.py
@@ -0,0 +1,159 @@
"""
This module defines a class called DataFrame that is a drop-in
replacement for a pandas DataFrame, but also allows automatic
updating of Deep Origin databases.
"""

from datetime import datetime
from typing import Optional

import humanize
import pandas as pd
from deeporigin.data_hub import api
from deeporigin.utils import construct_resource_url


class DataFrame(pd.DataFrame):
    """A subclass of pandas DataFrame that allows for automatic updates to Deep Origin databases. This can be used as a drop-in replacement for a pandas DataFrame, and should support all methods a pandas DataFrame supports.

    The primary method of creating an object of this type is to use the [api.get_dataframe][src.data_hub.api.get_dataframe] function.
    """

    auto_sync: bool = False
    """When True, changes made to the dataframe will be automatically synced to the Deep Origin database this dataframe represents."""

    class AtIndexer:
        """this class override is used to intercept calls to the at indexer of a pandas dataframe"""

        def __init__(self, obj):
            self.obj = obj

        def __getitem__(self, key):
            """intercept for the get operation"""

            return self.obj._get_value(*key)

        def __setitem__(self, key, value):
            """intercept for the set operation"""

            old_value = self.obj._get_value(*key)
            if value == old_value:
                # noop
                return

            rows = [key[0]]
            columns = [key[1]]

            # Perform the actual setting operation
            self.obj._set_value(*key, value)

            # now update the DB. note that self is an AtIndexer
            # object, so we need to index into the pandas object
            if self.obj.auto_sync:
                self.obj.sync(columns=columns, rows=rows)

    @property
    def at(self):
        """Override the `at` property to return an AtIndexer"""
        return self.AtIndexer(self)

    def __setitem__(self, key, value):
        """Override the __setitem__ method to update the Deep Origin database when changes are made to the local
        dataframe"""

        # first, call the pandas method
        super().__setitem__(key, value)

        # now, update the Deep Origin database with the changes
        # we just made
        if self.auto_sync:
            self.sync(columns=[key])

    def _repr_html_(self):
        """method override to customize printing in a Jupyter notebook"""

        name = self.attrs["metadata"]["name"]
        url = construct_resource_url(
            name=name,
            row_type="database",
        )

        # Convert the string to a datetime object
        date_str = self.attrs["metadata"]["dateCreated"]
        date_obj = datetime.strptime(date_str, "%Y-%m-%d %H:%M:%S.%f")

        now = datetime.now()

        # Calculate the difference
        time_diff = now - date_obj

        # Convert the time difference into "x time ago" format
        time_ago = humanize.naturaltime(time_diff)

        header = f'<h4>{name} <a href = "{url}">🔗</a></h4>'
        txt = f'<p style="font-size: 12px; color: #808080;">Created {time_ago}.</p>'
        df_html = super()._repr_html_()
        return header + txt + df_html

    def __repr__(self):
        """method override to customize printing in an interactive session"""

        header = f'{self.attrs["metadata"]["hid"]}\n'
        df_representation = super().__repr__()
        return header + df_representation

    def sync(
        self,
        *,
        columns: Optional[list] = None,
        rows: Optional[list] = None,
    ):
        """Manually synchronize data in the dataframe to the underlying Deep Origin database.

        !!! tip "Deep Origin DataFrames automatically synchronize"
            Typically, you do not need to manually synchronize. If the `auto_sync` attribute of the dataframe is set to `True`, the dataframe will automatically synchronize when changes are made to the dataframe.


        Args:
            columns (list, optional): The columns of the dataframe to update. Defaults to None.
            rows (list, optional): The rows to update. Defaults to None. When None, all rows in the relevant columns are updated.

        """

        if columns is None:
            columns = self.columns

        for column in columns:
            if column in ["Validation Status", "ID"]:
                continue

            column_metadata = [
                col for col in self.attrs["metadata"]["cols"] if col["name"] == column
            ]

            if len(column_metadata) == 0:
                raise NotImplementedError(
                    "Column metadata not found. This is likely because it's a new column"
                )

            column_metadata = column_metadata[0]

            if column_metadata["type"] == "file":
                continue

            if rows is None:
                # we're updating a whole column
                rows = list(self.index)

            api.set_data_in_cells(
                values=self[column],
                row_ids=rows,
                column_id=column,
                database_id=self.attrs["id"],
            )

    @property
    def _constructor(self):
        """this method overrides the _constructor property to return a DataFrame and is required for compatibility with a pandas DataFrame"""

        return DataFrame