deeporiginbio · sg-s · Aug 22, 2024 · Aug 15, 2024 · Aug 16, 2024 · Aug 16, 2024
@@ -0,0 +1,10 @@
+# Changelog
+
+### `deeporigin v2.1.0` 
+
+
+- dropped support for column keys
+- added the [deeporigin.DataFrame][src.data_hub.dataframe.DataFrame] class, which:
+    - is a drop-in replacement for pandas.DataFrame
+    - supports all pandas methods
+    - supports automatic syncing with a Deep Origin database
@@ -0,0 +1,79 @@
+# Using Deep Origin DataFrames
+
+This page describes how to use Deep Origin DataFrames, which are the primary object you will use to interact with databases on Deep Origin. This page will cover:
+
+- fetching data from a Deep Origin database
+- modifying data locally
+- writing data back to Deep Origin
+
+
+!!! question "What is a Deep Origin DataFrame?"
+    A Deep Origin DataFrame is a subclass of a [pandas.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) that is backed by a Deep Origin database. Because it is a subclass of a pandas DataFrame, all pandas DataFrame methods work on Deep Origin DataFrames. 
+
+## Create a DataFrame
+
+Create a DataFrame using:
+
+```python
+from deeporigin.data_hub import api
+df = api.get_dataframe("database-id")
+```
+
+In an interactive web-based environment such as JupyterLab, you should be able to view the DataFrame using:
+
+```py
+df
+```
+
+which should show you something like this:
+
+![DataFrame](../images/dataframe.png)
+
+
+!!! tip "Information in the DataFrame"
+    In addition to the information you would find in the rows and columns of a pandas DataFrame, a Deep Origin DataFrame also contains metadata about the underlying database. In the view above we also see:
+
+    - The name of the database on Deep Origin.
+    - A link to the database on Deep Origin.
+    - When the database was created.
+
+## Modify data in the DataFrame
+
+Because a Deep Origin DataFrame is a subclass of a pandas DataFrame, all pandas DataFrame methods work on Deep Origin DataFrames. The examples below modify an entire column and then a single cell.
+
+
+
+### Modify entire columns
+
+To modify entire columns, use native pandas syntax:
+
+```python
+df["y^2"] = df["y"] ** 2
+```
+
+### Modify data in a single cell
+
+To modify data in a single cell, use native pandas syntax (the `at` accessor):
+
+```python
+df.at["sgs-1", "y"] = 10
+```
+
+
+## Write data back to Deep Origin
+
+!!! success "Nothing to do here!"
+    A Deep Origin DataFrame automatically syncs back to the Deep Origin database. Making the changes above also makes changes in the source database.
+
+
+### Manually syncing
+
+Automatic syncing occurs when the `auto_sync` attribute of a Deep Origin DataFrame is set to `True`. Set `auto_sync` to `False` if you want to control when changes are written back, for example to make several local edits before updating the source database.
+
+To manually sync changes, use the `sync` method:
+
+```python
+df.sync()
+```
+
+Read more about the sync method [here](../ref/data-hub/types.md#src.data_hub.dataframe.DataFrame.sync). 
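
The manual-sync workflow described in this tutorial can be sketched without a live database. The `RecordingFrame` class below is a hypothetical stand-in, not part of `deeporigin`: it mimics only the `auto_sync` flag and the `sync` method, and it records which columns would be written instead of contacting Deep Origin.

```python
class RecordingFrame:
    """Hypothetical stand-in for a Deep Origin DataFrame (not the real class)."""

    def __init__(self, data):
        self.data = {name: list(values) for name, values in data.items()}
        self.auto_sync = True
        self.sync_calls = []  # one entry per simulated write to the database

    def __setitem__(self, column, values):
        self.data[column] = list(values)
        if self.auto_sync:
            self.sync(columns=[column])

    def sync(self, *, columns=None):
        # the real method writes to Deep Origin; here we only record the request
        self.sync_calls.append(list(self.data) if columns is None else list(columns))


df = RecordingFrame({"y": [1, 2, 3]})

# with auto_sync on, every assignment triggers its own write
df["y^2"] = [v**2 for v in df.data["y"]]

# with auto_sync off, edits stay local until sync() is called
df.auto_sync = False
df["y^3"] = [v**3 for v in df.data["y"]]
df["y^4"] = [v**4 for v in df.data["y"]]
df.sync()

print(df.sync_calls)
```

In this sketch the first assignment produces one write for `y^2`, while the two later assignments are flushed together by the single `sync()` call.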
diff --git a/docs/images/dataframe.png b/docs/images/dataframe.png
@@ -1,6 +1,15 @@
-# Types and constants
+# Classes and Constants
+
+This page lists some classes, types and constants used in this library.
+
+::: src.data_hub.dataframe.DataFrame
+    options:
+      members: 
+        - sync
+        - auto_sync
+      filters:
+        - "!^_"
 
-This page lists some types and constants used in this library.
 
 ::: src.utils
     options:

@@ -44,6 +44,8 @@ nav:
   - Authenticate: how-to/auth.md
 - Data hub: 
   - data-hub/index.md
+  - Deep Origin DataFrames:
+    - Tutorial: data-hub/dataframes.md
   - How to:
     - Create objects: how-to/data-hub/create.md
     - Delete objects: how-to/data-hub/delete.md
@@ -54,12 +56,13 @@ nav:
   - API reference:
     - High-level API: ref/data-hub/high-level-api.md
     - Low-level API: ref/data-hub/low-level-api.md
-    - Types and constants: ref/data-hub/types.md
+    - Classes & Constants: ref/data-hub/types.md
 - Compute hub: 
   - compute-hub/index.md
   - How to:
     - Install variables and secrets: how-to/variables.md
     - Get info about your workstation: how-to/workstation-info.md
+- Changelog: changelog.md
 - Support: https://www.support.deeporigin.com/servicedesk/customer/portals
 
 

@@ -28,6 +28,7 @@ dependencies = [
     "filetype",
     "httpx",
     "deeporigin-data-sdk==0.1.0a7",
+    "humanize",
 ]
 dynamic = ["version"]
 

@@ -861,14 +861,22 @@ def get_dataframe(
 
         # this import is here because we don't want to
         # import pandas unless we actually use this function
-        import pandas as pd
+        from deeporigin.data_hub.dataframe import DataFrame
 
-        df = pd.DataFrame(data)
+        df = DataFrame(data)
         df.attrs["file_ids"] = list(set(file_ids))
         df.attrs["reference_ids"] = list(set(reference_ids))
         df.attrs["id"] = database_id
+        df.attrs["metadata"] = dict(db_row)
 
-        return _type_and_cleanup_dataframe(df, columns)
+        df = _type_and_cleanup_dataframe(df, columns)
+
+        # if this code is running in AWS Lambda, we do not
+        # turn on auto_sync by default because we don't want
+        # side effects from code that the LLM writes
+        if os.environ.get("AWS_LAMBDA_FUNCTION_NAME") is None:
+            df.auto_sync = True
+        return df
 
     else:
         # rename keys
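
The environment check in the hunk above can be isolated into a small helper for testing. This is a minimal sketch, assuming a hypothetical function name `default_auto_sync` that does not exist in the library; it relies only on the fact that AWS Lambda sets `AWS_LAMBDA_FUNCTION_NAME` in the function's environment.

```python
import os


def default_auto_sync(environ=os.environ) -> bool:
    """Return True unless the code appears to run inside AWS Lambda.

    Hypothetical helper for illustration; the diff performs this check inline.
    """
    # AWS Lambda sets AWS_LAMBDA_FUNCTION_NAME in the function's environment
    return environ.get("AWS_LAMBDA_FUNCTION_NAME") is None


# pass a plain dict to exercise both branches without touching the real environment
print(default_auto_sync({}))
print(default_auto_sync({"AWS_LAMBDA_FUNCTION_NAME": "my-function"}))
```

Accepting the environment as a parameter keeps the check easy to unit-test; whether that is worth the extra function is a design choice for the authors.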
@@ -994,7 +1002,7 @@ def _row_to_dict(row, *, use_file_names: bool = True):
 
 @beartype
 def _type_and_cleanup_dataframe(
-    df,  # pd.Dataframe, not typed to avoid pandas import
+    df,  # DataFrame, not typed to avoid pandas import
     columns: list[dict],
 ):
     """Internal function to type and clean a pandas dataframe
@@ -1164,10 +1172,7 @@ def get_row_data(
     column_name_mapper = dict()
     column_cardinality_mapper = dict()
     for col in parent_response.cols:
-        if use_column_keys:
-            column_name_mapper[col["id"]] = col["key"]
-        else:
-            column_name_mapper[col["id"]] = col["name"]
+        column_name_mapper[col["id"]] = col["name"]
         column_cardinality_mapper[col["id"]] = col["cardinality"]
 
     # now use this to construct the required dictionary
@@ -1176,10 +1181,7 @@ def get_row_data(
     for col in parent_response.cols:
         if "systemType" in col.keys() and col["systemType"] == "bodyDocument":
             continue
-        if use_column_keys:
-            row_data[col["key"]] = None
-        else:
-            row_data[col["name"]] = None
+        row_data[col["name"]] = None
     if not hasattr(response, "fields"):
         return row_data
     for field in response.fields:
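
With column keys removed, the mapping logic in `get_row_data` depends only on column names. The sketch below reproduces that logic on hypothetical column metadata; the IDs, names, and cardinality values are made up for illustration and only the dictionary keys (`id`, `name`, `cardinality`, `systemType`) come from the diff.

```python
# hypothetical column metadata, shaped like the `cols` entries used in the diff
cols = [
    {"id": "col-1", "name": "y", "cardinality": "one"},
    {"id": "col-2", "name": "Notes", "cardinality": "one", "systemType": "bodyDocument"},
    {"id": "col-3", "name": "Files", "cardinality": "many"},
]

# map column IDs to display names and cardinalities
column_name_mapper = {col["id"]: col["name"] for col in cols}
column_cardinality_mapper = {col["id"]: col["cardinality"] for col in cols}

# start every non-document column at None so columns without a field still appear
row_data = {
    col["name"]: None for col in cols if col.get("systemType") != "bodyDocument"
}

print(column_name_mapper)
print(row_data)
```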

@@ -0,0 +1,159 @@
+"""
+This module defines a class called DataFrame that is a drop-in
+replacement for a pandas DataFrame, but also allows automatic
+updating of Deep Origin databases.
+"""
+
+from datetime import datetime
+from typing import Optional
+
+import humanize
+import pandas as pd
+from deeporigin.data_hub import api
+from deeporigin.utils import construct_resource_url
+
+
+class DataFrame(pd.DataFrame):
+    """A subclass of pandas DataFrame that allows for automatic updates to Deep Origin databases. This can be used as a drop-in replacement for a pandas DataFrame, and should support all methods a pandas DataFrame supports.
+
+    The primary method of creating an object of this type is to use the [api.get_dataframe][src.data_hub.api.get_dataframe] function.
+    """
+
+    auto_sync: bool = False
+    """When True, changes made to the dataframe will be automatically synced to the Deep Origin database this dataframe represents."""
+
+    class AtIndexer:
+        """this class override is used to intercept calls to at indexer of a pandas dataframe"""
+
+        def __init__(self, obj):
+            self.obj = obj
+
+        def __getitem__(self, key):
+            """intercept for the get operation"""
+
+            return self.obj._get_value(*key)
+
+        def __setitem__(self, key, value):
+            """intercept for the set operation"""
+
+            old_value = self.obj._get_value(*key)
+            if value == old_value:
+                # noop
+                return
+
+            rows = [key[0]]
+            columns = [key[1]]
+
+            # Perform the actual setting operation
+            self.obj._set_value(*key, value)
+
+            # now update the DB. note that self is an AtIndexer
+            # object, so we need to index into the pandas object
+            if self.obj.auto_sync:
+                self.obj.sync(columns=columns, rows=rows)
+
+    @property
+    def at(self):
+        """Override the `at` property to return an AtIndexer"""
+        return self.AtIndexer(self)
+
+    def __setitem__(self, key, value):
+        """Override the __setitem__ method to update the Deep Origin database when changes are made to the local
+        dataframe"""
+
+        # first, call the pandas method
+        super().__setitem__(key, value)
+
+        # now, update the Deep Origin database with the changes
+        # we just made
+        if self.auto_sync:
+            self.sync(columns=[key])
+
+    def _repr_html_(self):
+        """method override to customize printing in a Jupyter notebook"""
+
+        name = self.attrs["metadata"]["name"]
+        url = construct_resource_url(
+            name=name,
+            row_type="database",
+        )
+
+        # Convert the string to a datetime object
+        date_str = self.attrs["metadata"]["dateCreated"]
+        date_obj = datetime.strptime(date_str, "%Y-%m-%d %H:%M:%S.%f")
+
+        now = datetime.now()
+
+        # Calculate the difference
+        time_diff = now - date_obj
+
+        # Convert the time difference into "x time ago" format
+        time_ago = humanize.naturaltime(time_diff)
+
+        header = f'<h4>{name} <a href = "{url}">🔗</a></h4>'
+        txt = f'<p style="font-size: 12px; color: #808080;">Created {time_ago}.</p>'
+        df_html = super()._repr_html_()
+        return header + txt + df_html
+
+    def __repr__(self):
+        """method override to customize printing in an interactive session"""
+
+        header = f'{self.attrs["metadata"]["hid"]}\n'
+        df_representation = super().__repr__()
+        return header + df_representation
+
+    def sync(
+        self,
+        *,
+        columns: Optional[list] = None,
+        rows: Optional[list] = None,
+    ):
+        """Manually synchronize data in the dataframe to the underlying Deep Origin database.
+
+        !!! tip "Deep Origin DataFrames automatically synchronize"
+            Typically, you do not need to manually synchronize. If the `auto_sync` attribute of the dataframe is set to `True`, the dataframe will automatically synchronize when changes are made to the dataframe.
+
+
+        Args:
+            columns (list, optional): The columns of the dataframe to update. Defaults to None.
+            rows (list, optional): The rows to update. Defaults to None. When None, all rows in the relevant columns are updated.
+
+        """
+
+        if columns is None:
+            columns = self.columns
+
+        for column in columns:
+            if column in ["Validation Status", "ID"]:
+                continue
+
+            column_metadata = [
+                col for col in self.attrs["metadata"]["cols"] if col["name"] == column
+            ]
+
+            if len(column_metadata) == 0:
+                raise NotImplementedError(
+                    "Column metadata not found. This is likely because it's a new column"
+                )
+
+            column_metadata = column_metadata[0]
+
+            if column_metadata["type"] == "file":
+                continue
+
+            if rows is None:
+                # we're updating a whole column
+                rows = list(self.index)
+
+            api.set_data_in_cells(
+                values=self.loc[rows, column],
+                row_ids=rows,
+                column_id=column,
+                database_id=self.attrs["id"],
+            )
+
+    @property
+    def _constructor(self):
+        """this method overrides the _constructor property to return a DataFrame and is required for compatibility with a pandas DataFrame"""
+
+        return DataFrame
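
The `at` interception used by the `DataFrame` class can be illustrated without pandas or a network connection. This is a sketch under stated assumptions: `Table` is a hypothetical stand-in that stores cells in nested dictionaries, and its `sync` method only records what would be sent rather than calling the Deep Origin API.

```python
class AtIndexer:
    """Proxy returned by Table.at; forwards reads and intercepts writes."""

    def __init__(self, table):
        self.table = table

    def __getitem__(self, key):
        row, column = key
        return self.table.cells[column][row]

    def __setitem__(self, key, value):
        row, column = key
        if self.table.cells[column][row] == value:
            return  # unchanged value: skip both the write and the sync
        self.table.cells[column][row] = value
        if self.table.auto_sync:
            self.table.sync(columns=[column], rows=[row])


class Table:
    """Hypothetical stand-in for the DataFrame subclass (no pandas, no network)."""

    def __init__(self, cells, auto_sync=True):
        self.cells = cells  # {column name: {row id: value}}
        self.auto_sync = auto_sync
        self.synced = []  # (columns, rows) pairs that would be sent to the database

    @property
    def at(self):
        # a fresh proxy per access, mirroring the `at` property in the diff
        return AtIndexer(self)

    def sync(self, *, columns, rows):
        self.synced.append((columns, rows))


table = Table({"y": {"sgs-1": 3, "sgs-2": 4}})
table.at["sgs-1", "y"] = 10  # changed: written and synced
table.at["sgs-2", "y"] = 4  # unchanged: no sync
print(table.at["sgs-1", "y"], table.synced)
```

Only the first assignment reaches `sync`, and it is scoped to one row and one column, which is the behavior the `AtIndexer.__setitem__` override in the diff aims for.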