diff --git a/docs/changelog.md b/docs/changelog.md new file mode 100644 index 0000000..c857fb6 --- /dev/null +++ b/docs/changelog.md @@ -0,0 +1,10 @@ +# Changelog + +### `deeporigin v2.1.0` + + +- dropped support for column keys +- added the [deeporigin.DataFrame][src.data_hub.dataframe.DataFrame] class, which: + - is a drop-in replacement for pandas.DataFrame + - supports all pandas methods + - supports automatic syncing with a Deep Origin database \ No newline at end of file diff --git a/docs/data-hub/dataframes.md b/docs/data-hub/dataframes.md new file mode 100644 index 0000000..4e06342 --- /dev/null +++ b/docs/data-hub/dataframes.md @@ -0,0 +1,138 @@ +# Using Deep Origin DataFrames + +This page describes how to use a Deep Origin DataFrame, the primary object you will use to interact with a database on Deep Origin. This page covers: + +- fetching data from a Deep Origin database +- modifying data locally +- writing data back to Deep Origin + + +!!! question "What is a Deep Origin DataFrame?" + A Deep Origin DataFrame is a subclass of a [pandas.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) that is created from a Deep Origin database, and can easily write data back to it. Because it is a subclass of a pandas DataFrame, all pandas DataFrame methods work on Deep Origin DataFrames. + +## Create a DataFrame + +Create a DataFrame using: + +```python +from deeporigin import DataFrame +df = DataFrame.from_deeporigin("database-id") +``` + +In JupyterLab, you can view the DataFrame using: + +```py +df +``` + +which should show you something like this: + +![DataFrame](../images/dataframe-0.png) + + +!!! tip "Information in the DataFrame" + In addition to the information you would find in the rows and columns of a pandas DataFrame, a Deep Origin DataFrame also contains metadata about the underlying database. In the view above we also see: + + - The name of the database on Deep Origin. 
- A link to the database on Deep Origin. + - When the database was created. + - Information about the last edit made to the database. + + +## Modify data in the DataFrame + +Because a Deep Origin DataFrame is a subclass of a pandas DataFrame, all pandas DataFrame methods work on Deep Origin DataFrames. In the examples below, we modify the values in an entire column, or in a single cell. + + +=== "Modify entire columns" + + To modify entire columns, use native pandas syntax: + + ```python + df["y^2"] = df["y"] ** 2 + ``` + +=== "Modify data in a single cell" + + To modify data in a single cell, use native pandas syntax (the `at` accessor): + + ```python + df.at["sgs-1", "y"] = 10 + ``` + +In either case, when we view the DataFrame once more using `df`, we see the updated data, together with a warning telling us that we have local changes that haven't been written back to Deep Origin. + + +![DataFrame with warning telling us about local changes that have not been written back to Deep Origin](../images/dataframe-1.png) + +## Write data back to Deep Origin + +!!! warning "Work in progress" + Writing data back to Deep Origin from a Deep Origin DataFrame is still a work in progress. The following functionality is not yet supported; to perform these operations, use the API directly instead. + + - Updating values of cells that contain files + - Updating values of cells that contain references + - Uploading files + - Modifying or deleting existing columns. To delete a column, use the GUI or the API and then use the `from_deeporigin` method. + - Creating new columns. To insert data into a new column, create a new column using the GUI or the API and then use the `from_deeporigin` method. + - Deleting rows + - Creating new databases + +### Using the `to_deeporigin` method + +Local changes in the dataframe can be written back to Deep Origin using the `to_deeporigin` method: + +```python +df.to_deeporigin() + +# ✔︎ Wrote 9 rows in y^2 to Deep Origin database. 
+``` + +The `to_deeporigin` method writes data that have been modified in the local dataframe back to the corresponding Deep Origin Database. + +!!! tip "Intelligent writing" + - Deep Origin DataFrames keep track of local changes, and only write columns back that have been modified locally. + - Every call of `to_deeporigin` will generate a print statement describing the changes that have been written back to Deep Origin. + - Because a Deep Origin DataFrame corresponds to a database on Deep Origin, there is no need to specify the database name in the `to_deeporigin` method. + +If we now view the dataframe once more using `df`, we see the following: + +![DataFrame](../images/dataframe-2.png) + +Note that the warning about local changes that have not been written back to Deep Origin has disappeared, because the changes **have** been written back to Deep Origin. + +### Automatic writing to Deep Origin + +All Deep Origin DataFrames have an attribute called `auto_sync` that determines if local changes are written automatically to Deep Origin. By default, `auto_sync` is set to `False`, requiring you to call the `to_deeporigin` method to write changes back to Deep Origin. + +To enable automatic syncing, set the `auto_sync` attribute to `True`: + +```python +df.auto_sync = True +df +``` + +![DataFrame](../images/dataframe-3.png) + +Note that the dataframe now displays a message indicating that local changes will be written back to Deep Origin. + +Making any change to the dataframe now triggers a write back to the Deep Origin database. + +```python +df["y^2"] = df["y"] * 0.99 +df + +# ✔︎ Wrote 9 rows in y^2 to Deep Origin database. +``` + +!!! danger "Use `auto_sync` with caution" + Turning on `auto_sync` on dataframes can be dangerous. + - Changes made to the local database are written to a Deep Origin database automatically, and no confirmation is asked for. + - This can cause data loss. 
- Every change made to the dataframe is written immediately, so modifying the local dataframe multiple times leads to multiple writes to the Deep Origin database. + + + +## Reference + +Read more about the `to_deeporigin` method [here](../ref/data-hub/types.md#src.data_hub.dataframe.DataFrame.sync). \ No newline at end of file diff --git a/docs/how-to/auth.md b/docs/how-to/auth.md index d7f4222..b219292 100644 --- a/docs/how-to/auth.md +++ b/docs/how-to/auth.md @@ -1,5 +1,8 @@ # Sign into Deep Origin +!!! tip "Configure if running locally" + If you're running this code on your local computer (outside of a Deep Origin Workstation), make sure to [configure](../configure.md#on-your-local-computer) it first. + To use most of the functionality of the CLI or Python client, you must first run one of the following commands to sign into Deep Origin. === "CLI" diff --git a/docs/images/dataframe-0.png b/docs/images/dataframe-0.png new file mode 100644 index 0000000..cd33244 Binary files /dev/null and b/docs/images/dataframe-0.png differ diff --git a/docs/images/dataframe-1.png b/docs/images/dataframe-1.png new file mode 100644 index 0000000..692ed4e Binary files /dev/null and b/docs/images/dataframe-1.png differ diff --git a/docs/images/dataframe-2.png b/docs/images/dataframe-2.png new file mode 100644 index 0000000..d5a9a73 Binary files /dev/null and b/docs/images/dataframe-2.png differ diff --git a/docs/images/dataframe-3.png b/docs/images/dataframe-3.png new file mode 100644 index 0000000..5ad2b14 Binary files /dev/null and b/docs/images/dataframe-3.png differ diff --git a/docs/ref/data-hub/types.md b/docs/ref/data-hub/types.md index 16cf814..e8cdb9b 100644 --- a/docs/ref/data-hub/types.md +++ b/docs/ref/data-hub/types.md @@ -1,6 +1,16 @@ -# Types and constants +# Classes and constants + +This page lists some classes, types and constants used in this library. 
+ +::: src.data_hub.dataframe.DataFrame + options: + members: + - auto_sync + - from_deeporigin + - to_deeporigin + filters: + - "!^_" -This page lists some types and constants used in this library. ::: src.utils options: diff --git a/mkdocs.yaml b/mkdocs.yaml index 76eb1b0..543055d 100644 --- a/mkdocs.yaml +++ b/mkdocs.yaml @@ -44,6 +44,8 @@ nav: - Authenticate: how-to/auth.md - Data hub: - data-hub/index.md + - Deep Origin DataFrames: + - Tutorial: data-hub/dataframes.md - How to: - Create objects: how-to/data-hub/create.md - Delete objects: how-to/data-hub/delete.md @@ -54,12 +56,13 @@ nav: - API reference: - High-level API: ref/data-hub/high-level-api.md - Low-level API: ref/data-hub/low-level-api.md - - Types and constants: ref/data-hub/types.md + - Classes & constants: ref/data-hub/types.md - Compute hub: - compute-hub/index.md - How to: - Install variables and secrets: how-to/variables.md - Get info about your workstation: how-to/workstation-info.md +- Changelog: changelog.md - Support: https://www.support.deeporigin.com/servicedesk/customer/portals diff --git a/pyproject.toml b/pyproject.toml index c6ea666..ea8beeb 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -28,6 +28,7 @@ dependencies = [ "filetype", "httpx", "deeporigin-data-sdk==0.1.0a7", + "humanize", ] dynamic = ["version"] diff --git a/src/__init__.py b/src/__init__.py index 0e3d6b8..e8cd300 100644 --- a/src/__init__.py +++ b/src/__init__.py @@ -3,9 +3,9 @@ import subprocess from pathlib import Path -__all__ = [ - "__version__", -] +from deeporigin.data_hub.dataframe import DataFrame + +__all__ = ["__version__", "DataFrame"] SRC_DIR = pathlib.Path(__file__).parent diff --git a/src/data_hub/api.py b/src/data_hub/api.py index 6fc4a61..7b45fde 100644 --- a/src/data_hub/api.py +++ b/src/data_hub/api.py @@ -24,9 +24,10 @@ DataType, DatabaseReturnType, IDFormat, - RowType, + ObjectType, _parse_params_from_url, download_sync, + find_last_updated_row, ) @@ -143,7 +144,7 @@ def create_database( def 
list_rows( *, parent_id: Optional[str] = None, - row_type: RowType = None, + row_type: ObjectType = None, parent_is_root: Optional[bool] = None, client=None, ) -> list: @@ -861,14 +862,22 @@ def get_dataframe( # this import is here because we don't want to # import pandas unless we actually use this function - import pandas as pd + from deeporigin.data_hub.dataframe import DataFrame - df = pd.DataFrame(data) + df = DataFrame(data) df.attrs["file_ids"] = list(set(file_ids)) df.attrs["reference_ids"] = list(set(reference_ids)) df.attrs["id"] = database_id + df.attrs["metadata"] = dict(db_row) - return _type_and_cleanup_dataframe(df, columns) + df = _type_and_cleanup_dataframe(df, columns) + + # find last updated row for pretty printing + df.attrs["last_updated_row"] = find_last_updated_row(rows) + + df._deep_origin_out_of_sync = False + df._modified_columns = set() + return df else: # rename keys @@ -994,7 +1003,7 @@ def _row_to_dict(row, *, use_file_names: bool = True): @beartype def _type_and_cleanup_dataframe( - df, # pd.Dataframe, not typed to avoid pandas import + df, # Dataframe, not typed to avoid pandas import columns: list[dict], ): """Internal function to type and clean a pandas dataframe @@ -1164,10 +1173,7 @@ def get_row_data( column_name_mapper = dict() column_cardinality_mapper = dict() for col in parent_response.cols: - if use_column_keys: - column_name_mapper[col["id"]] = col["key"] - else: - column_name_mapper[col["id"]] = col["name"] + column_name_mapper[col["id"]] = col["name"] column_cardinality_mapper[col["id"]] = col["cardinality"] # now use this to construct the required dictionary @@ -1176,10 +1182,7 @@ def get_row_data( for col in parent_response.cols: if "systemType" in col.keys() and col["systemType"] == "bodyDocument": continue - if use_column_keys: - row_data[col["key"]] = None - else: - row_data[col["name"]] = None + row_data[col["name"]] = None if not hasattr(response, "fields"): return row_data for field in response.fields: diff --git 
a/src/data_hub/dataframe.py b/src/data_hub/dataframe.py new file mode 100644 index 0000000..77e28d5 --- /dev/null +++ b/src/data_hub/dataframe.py @@ -0,0 +1,219 @@ +""" +This module defines a class called DataFrame that is a drop-in +replacement for a pandas DataFrame, but also allows for easy +updating of Deep Origin databases. +""" + +from datetime import datetime, timezone +from typing import Optional + +import humanize +import pandas as pd +from deeporigin.data_hub import api +from deeporigin.utils import ( + DatabaseReturnType, + IDFormat, + construct_resource_url, +) + + +class DataFrame(pd.DataFrame): + """A subclass of pandas DataFrame that allows for easy updating of a Deep Origin database. This can be used as a drop-in replacement for a pandas DataFrame, and should support all methods a pandas DataFrame supports. + + The primary method of creating an object of this type is to use the [from_deeporigin][src.data_hub.dataframe.DataFrame.from_deeporigin] class method. + """ + + auto_sync: bool = False + """When `True`, changes made to the dataframe will be automatically synced to the Deep Origin database this dataframe represents.""" + + _modified_columns: set = set() + """if data is modified in a dataframe, and auto_sync is False, this list will contain the columns that have been modified so that the Deep Origin database can be updated. 
If empty, the Deep Origin database will not be updated, and the dataframe matches the Deep Origin database at the time of creation.""" + + class AtIndexer: + """this class override is used to intercept calls to the at indexer of a pandas dataframe""" + + def __init__(self, obj): + self.obj = obj + + def __getitem__(self, key): + """intercept for the get operation""" + + return self.obj._get_value(*key) + + def __setitem__(self, key, value): + """intercept for the set operation""" + + old_value = self.obj._get_value(*key) + if value == old_value: + # noop + return + + rows = [key[0]] + columns = [key[1]] + + # Perform the actual setting operation + self.obj._set_value(*key, value) + + # now update the DB. note that self is an AtIndexer + # object, so we need to index into the pandas object + if self.obj.auto_sync: + self.obj.to_deeporigin(columns=columns, rows=rows) + else: + self.obj._modified_columns.add(key[1]) + + @property + def at(self): + """Override the `at` property to return an AtIndexer""" + return self.AtIndexer(self) + + def __setitem__(self, key, value): + """Override the __setitem__ method to update the Deep Origin database when changes are made to the local + dataframe""" + + # first, call the pandas method + super().__setitem__(key, value) + + # now, update the Deep Origin database with the changes + # we just made + if self.auto_sync: + self.to_deeporigin(columns=[key]) + else: + self._modified_columns.add(key) + + def _repr_html_(self): + """method override to customize printing in a Jupyter notebook""" + + name = self.attrs["metadata"]["name"] + url = construct_resource_url( + name=name, + row_type="database", + ) + + # Convert the string to a datetime object + date_str = self.attrs["metadata"]["dateCreated"] + date_obj = datetime.strptime(date_str, "%Y-%m-%d %H:%M:%S.%f").replace( + tzinfo=timezone.utc + ) + + now = datetime.now(timezone.utc) + + # Convert the time difference into "x time ago" format + created_time_ago = 
humanize.naturaltime(now - date_obj) + + date_str = self.attrs["last_updated_row"].date_updated + date_obj = datetime.strptime(date_str, "%Y-%m-%d %H:%M:%S.%f").replace( + tzinfo=timezone.utc + ) + edited_time_ago = humanize.naturaltime(now - date_obj) + + header = f'<h4>{name} <a href="{url}">🔗</a></h4>' + txt = f'<p>Created {created_time_ago}. Row {self.attrs["last_updated_row"].hid} was last edited {edited_time_ago}' + try: + txt += ( + " by " + + self.attrs["last_updated_row"].edited_by_user_drn.split("|")[1] + + ".</p>" + ) + except Exception: + txt += ".</p>" + + if self._modified_columns: + txt += '<p>⚠️ This dataframe contains changes that have not been written back to the Deep Origin database.</p>' + elif self.auto_sync: + txt += '<p>🧬 This dataframe will automatically write changes made to it back to Deep Origin.</p>
' + df_html = super()._repr_html_() + return header + txt + df_html + + def __repr__(self): + """method override to customize printing in an interactive session""" + + header = f'{self.attrs["metadata"]["hid"]}\n' + df_representation = super().__repr__() + return header + df_representation + + @classmethod + def from_deeporigin( + cls, + database_id: str, + *, + use_file_names: bool = True, + reference_format: IDFormat = "human-id", + return_type: DatabaseReturnType = "dataframe", + client=None, + ): + """Create a local Deep Origin DataFrame from a Deep Origin database. + + Args: + database_id (str): The ID of the Deep Origin database. + use_file_names (bool, optional): Whether to use the file names in the Deep Origin database. Defaults to True. + reference_format (IDFormat, optional): The format of the IDs in the Deep Origin database. Defaults to "human-id". + return_type (DatabaseReturnType, optional): The type of return value. Defaults to "dataframe". + + """ + + return api.get_dataframe( + database_id=database_id, + use_file_names=use_file_names, + reference_format=reference_format, + return_type=return_type, + client=client, + ) + + def to_deeporigin( + self, + *, + columns: Optional[list] = None, + rows: Optional[list] = None, + ): + """Write data in dataframe to Deep Origin + + !!! tip "Deep Origin DataFrames can automatically synchronize" + To automatically save changes to local DataFrames to Deep Origin databases, set the `auto_sync` attribute of the dataframe to `True`. + + + Args: + columns (list, optional): The columns of the dataframe to update. When None, all modified columns are updated. + rows (list, optional): The rows to update. Defaults to None. When None, all rows in the relevant columns are updated. 
+ + """ + + if columns is None: + columns = self._modified_columns.copy() + + for column in columns: + if column in ["Validation Status", "ID"]: + continue + + column_metadata = [ + col for col in self.attrs["metadata"]["cols"] if col["name"] == column + ] + + if len(column_metadata) == 0: + raise NotImplementedError( + "Column metadata not found. This is likely because it's a new column" + ) + + column_metadata = column_metadata[0] + + if column_metadata["type"] == "file": + continue + + if rows is None: + # we're updating a whole column + rows = list(self.index) + + api.set_data_in_cells( + values=self[column], + row_ids=rows, + column_id=column, + database_id=self.attrs["id"], + ) + + print(f"✔︎ Wrote {len(rows)} rows in {column} to Deep Origin database.") + self._modified_columns.discard(column) + + @property + def _constructor(self): + """this method overrides the _constructor property to return a DataFrame and is required for compatibility with a pandas DataFrame""" + + return DataFrame diff --git a/src/utils.py b/src/utils.py index 737a80f..c7825ea 100644 --- a/src/utils.py +++ b/src/utils.py @@ -1,8 +1,11 @@ +"""This module contains utility functions that are used internally by the Python Client and the CLI""" + import json import os import shutil from dataclasses import dataclass -from typing import Literal, Union +from datetime import datetime +from typing import List, Literal, TypeVar, Union from urllib.parse import parse_qs, urljoin, urlparse import requests @@ -16,11 +19,13 @@ ] -RowType = Literal["row", "database", "workspace"] -"""Type of a row""" +T = TypeVar("T") + +ObjectType = Literal["row", "database", "workspace"] +"""Type of a row. In Deep Origin, a row can be a database row, a database or a workspace""" FileStatus = Literal["ready", "archived"] -"""Status of a file""" +"""Status of a file. 
Ready files are ready to be used, downloaded, and operated on.""" DataType = Literal[ "integer", @@ -34,7 +39,7 @@ "float", "boolean", ] -"""Type of a column""" +"""Type of a column in a Deep Origin database. See [this page in the documentation](https://docs.deeporigin.io/docs/os/data-hub/databases/columns) for more information.""" DATAFRAME_ATTRIBUTE_KEYS = { "file_ids", @@ -44,12 +49,13 @@ Cardinality = Literal["one", "many"] +"""The cardinality defines whether a cell in a database can contain one or multiple objects""" IDFormat = Literal["human-id", "system-id"] """Format of an ID""" DatabaseReturnType = Literal["dataframe", "dict"] -"""Return type of a database""" +"""Return type for [api.get_dataframe][src.data_hub.api.get_dataframe]""" @dataclass @@ -63,6 +69,50 @@ class PREFIXES: FOLDER = "_workspace" +@beartype +def construct_resource_url( + *, + name: str, + row_type: ObjectType, +) -> str: + """Constructs the URL for a resource + + Args: + name (str): name of the resource + row_type (ObjectType): type of the resource + + Returns: + str: URL for the resource + """ + + env = get_value()["env"] + org = get_value()["organization_id"] + if env == "prod": + url = f"https://os.deeporigin.io/org/{org}/data/{row_type}/{name}" + else: + url = f"https://os.{env}.deeporigin.io/org/{org}/data/{row_type}/{name}" + + return url + + +@beartype +def find_last_updated_row(rows: List[T]) -> T: + """utility function to find the most recently updated row and return that object""" + + most_recent_date = None + most_recent_row = rows[0] + + # Iterate over the list of objects + for row in rows: + current_date = datetime.strptime(row.date_updated, "%Y-%m-%d %H:%M:%S.%f") + + if most_recent_date is None or current_date > most_recent_date: + most_recent_date = current_date + most_recent_row = row + + return most_recent_row + + @beartype def _print_tree(tree: dict, offset: int = 0) -> None: """Helper function to pretty print a tree"""
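Reviewer note: the `DataFrame` subclass in this diff relies on two pandas extension hooks — a `_constructor` override so pandas operations keep returning the subclass, and a `__setitem__` intercept that records which columns were modified locally. A minimal standalone sketch of that pattern, using only pandas (`TrackedFrame` is a hypothetical name for illustration, not part of the deeporigin package):

```python
import pandas as pd


class TrackedFrame(pd.DataFrame):
    """Minimal sketch of the pattern used by deeporigin.DataFrame:
    record locally modified columns so they can be synced later."""

    # tell pandas this attribute is metadata, not a column
    _metadata = ["_modified_columns"]

    @property
    def _constructor(self):
        # ensures slicing/copying returns a TrackedFrame, not a plain DataFrame
        return TrackedFrame

    def __setitem__(self, key, value):
        # first apply the normal pandas assignment
        super().__setitem__(key, value)
        # then record the column as modified (lazily create the tracking set)
        if getattr(self, "_modified_columns", None) is None:
            self._modified_columns = set()
        self._modified_columns.add(key)


df = TrackedFrame({"y": [1, 2, 3]})
df["y^2"] = df["y"] ** 2

print(df._modified_columns)      # the columns a sync would need to write
print(type(df[["y"]]).__name__)  # slicing preserves the subclass
```

Listing `_modified_columns` in `_metadata` keeps pandas from treating the attribute as a column and silences the attribute-assignment warning; the diff's class-level `_modified_columns: set = set()` achieves a similar effect but shares one set across instances, which is worth flagging in review.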