
refactor(datasets): deprecate "DataSet" type names (#328)
* refactor(datasets): deprecate "DataSet" type names (api)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (biosequence)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (dask)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (databricks)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (email)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (geopandas)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (holoviews)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (json)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (matplotlib)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (networkx)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (pandas)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (pandas.csv_dataset)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (pandas.deltatable_dataset)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (pandas.excel_dataset)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (pandas.feather_dataset)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (pandas.gbq_dataset)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (pandas.generic_dataset)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (pandas.hdf_dataset)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (pandas.json_dataset)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (pandas.parquet_dataset)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (pandas.sql_dataset)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (pandas.xml_dataset)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (pickle)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (pillow)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (plotly)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (polars)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (redis)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (snowflake)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (spark)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (svmlight)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (tensorflow)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (text)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (tracking)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (video)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* refactor(datasets): deprecate "DataSet" type names (yaml)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* chore(datasets): ignore TensorFlow coverage issues

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

---------

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
deepyaman committed Sep 20, 2023
1 parent 4fe0e3e commit 7b3ac6c
Showing 114 changed files with 4,593 additions and 3,232 deletions.
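Every module touched here applies the same shim: the old class name stays importable through a module-level `__getattr__` hook (PEP 562) that returns the renamed class and emits a `DeprecationWarning`. A minimal, self-contained sketch of that pattern — `ExampleDataSet`/`ExampleDataset` are placeholder names, not classes from this repository:

    import warnings


    class ExampleDataset:
        """Stand-in for a renamed dataset class such as ``APIDataset``."""


    _DEPRECATED_CLASSES = {
        "ExampleDataSet": ExampleDataset,  # old spelling -> new class
    }


    def __getattr__(name):
        # Module-level __getattr__ (PEP 562): only called when normal
        # attribute lookup on the module fails.
        if name in _DEPRECATED_CLASSES:
            alias = _DEPRECATED_CLASSES[name]
            warnings.warn(
                f"{name!r} has been renamed to {alias.__name__!r}, "
                f"and the alias will be removed in Kedro-Datasets 2.0.0",
                DeprecationWarning,
                stacklevel=2,
            )
            return alias
        raise AttributeError(f"module {__name__!r} has no attribute {name!r}")

Because the hook only fires when normal lookup fails, code that already uses the new names never touches it.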
43 changes: 43 additions & 0 deletions kedro-datasets/docs/source/kedro_datasets.rst
@@ -12,47 +12,90 @@ kedro_datasets
    :template: autosummary/class.rst
 
    kedro_datasets.api.APIDataSet
+   kedro_datasets.api.APIDataset
    kedro_datasets.biosequence.BioSequenceDataSet
+   kedro_datasets.biosequence.BioSequenceDataset
    kedro_datasets.dask.ParquetDataSet
+   kedro_datasets.dask.ParquetDataset
    kedro_datasets.databricks.ManagedTableDataSet
+   kedro_datasets.databricks.ManagedTableDataset
    kedro_datasets.email.EmailMessageDataSet
+   kedro_datasets.email.EmailMessageDataset
    kedro_datasets.geopandas.GeoJSONDataSet
+   kedro_datasets.geopandas.GeoJSONDataset
    kedro_datasets.holoviews.HoloviewsWriter
    kedro_datasets.json.JSONDataSet
+   kedro_datasets.json.JSONDataset
    kedro_datasets.matplotlib.MatplotlibWriter
    kedro_datasets.networkx.GMLDataSet
+   kedro_datasets.networkx.GMLDataset
    kedro_datasets.networkx.GraphMLDataSet
+   kedro_datasets.networkx.GraphMLDataset
    kedro_datasets.networkx.JSONDataSet
+   kedro_datasets.networkx.JSONDataset
    kedro_datasets.pandas.CSVDataSet
+   kedro_datasets.pandas.CSVDataset
    kedro_datasets.pandas.DeltaTableDataSet
+   kedro_datasets.pandas.DeltaTableDataset
    kedro_datasets.pandas.ExcelDataSet
+   kedro_datasets.pandas.ExcelDataset
    kedro_datasets.pandas.FeatherDataSet
+   kedro_datasets.pandas.FeatherDataset
    kedro_datasets.pandas.GBQQueryDataSet
+   kedro_datasets.pandas.GBQQueryDataset
    kedro_datasets.pandas.GBQTableDataSet
+   kedro_datasets.pandas.GBQTableDataset
    kedro_datasets.pandas.GenericDataSet
+   kedro_datasets.pandas.GenericDataset
    kedro_datasets.pandas.HDFDataSet
+   kedro_datasets.pandas.HDFDataset
    kedro_datasets.pandas.JSONDataSet
+   kedro_datasets.pandas.JSONDataset
    kedro_datasets.pandas.ParquetDataSet
+   kedro_datasets.pandas.ParquetDataset
    kedro_datasets.pandas.SQLQueryDataSet
+   kedro_datasets.pandas.SQLQueryDataset
    kedro_datasets.pandas.SQLTableDataSet
+   kedro_datasets.pandas.SQLTableDataset
    kedro_datasets.pandas.XMLDataSet
+   kedro_datasets.pandas.XMLDataset
    kedro_datasets.pickle.PickleDataSet
+   kedro_datasets.pickle.PickleDataset
    kedro_datasets.pillow.ImageDataSet
+   kedro_datasets.pillow.ImageDataset
    kedro_datasets.plotly.JSONDataSet
+   kedro_datasets.plotly.JSONDataset
    kedro_datasets.plotly.PlotlyDataSet
+   kedro_datasets.plotly.PlotlyDataset
    kedro_datasets.polars.CSVDataSet
+   kedro_datasets.polars.CSVDataset
    kedro_datasets.polars.GenericDataSet
+   kedro_datasets.polars.GenericDataset
    kedro_datasets.redis.PickleDataSet
+   kedro_datasets.redis.PickleDataset
    kedro_datasets.snowflake.SnowparkTableDataSet
+   kedro_datasets.snowflake.SnowparkTableDataset
    kedro_datasets.spark.DeltaTableDataSet
+   kedro_datasets.spark.DeltaTableDataset
    kedro_datasets.spark.SparkDataSet
+   kedro_datasets.spark.SparkDataset
    kedro_datasets.spark.SparkHiveDataSet
+   kedro_datasets.spark.SparkHiveDataset
    kedro_datasets.spark.SparkJDBCDataSet
+   kedro_datasets.spark.SparkJDBCDataset
    kedro_datasets.spark.SparkStreamingDataSet
+   kedro_datasets.spark.SparkStreamingDataset
    kedro_datasets.svmlight.SVMLightDataSet
+   kedro_datasets.svmlight.SVMLightDataset
    kedro_datasets.tensorflow.TensorFlowModelDataSet
+   kedro_datasets.tensorflow.TensorFlowModelDataset
    kedro_datasets.text.TextDataSet
+   kedro_datasets.text.TextDataset
    kedro_datasets.tracking.JSONDataSet
+   kedro_datasets.tracking.JSONDataset
    kedro_datasets.tracking.MetricsDataSet
+   kedro_datasets.tracking.MetricsDataset
    kedro_datasets.video.VideoDataSet
+   kedro_datasets.video.VideoDataset
    kedro_datasets.yaml.YAMLDataSet
+   kedro_datasets.yaml.YAMLDataset
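Both spellings resolve to the same class object for now, so existing imports and catalogs keep working while they warn. A hypothetical before/after using the pandas CSV dataset from the listing above:

    # Deprecated spelling: still importable in this release, but warns.
    from kedro_datasets.pandas import CSVDataSet

    # New spelling going forward.
    from kedro_datasets.pandas import CSVDataset

    # The old name is an alias for the new class, not a separate type.
    assert CSVDataSet is CSVDataset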
9 changes: 6 additions & 3 deletions kedro-datasets/kedro_datasets/api/__init__.py
@@ -1,14 +1,17 @@
-"""``APIDataSet`` loads the data from HTTP(S) APIs
+"""``APIDataset`` loads the data from HTTP(S) APIs
 and returns them into either as string or json Dict.
 It uses the python requests library: https://requests.readthedocs.io/en/latest/
 """
+from __future__ import annotations
+
 from typing import Any
 
 import lazy_loader as lazy
 
 # https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
-APIDataSet: Any
+APIDataSet: type[APIDataset]
+APIDataset: Any
 
 __getattr__, __dir__, __all__ = lazy.attach(
-    __name__, submod_attrs={"api_dataset": ["APIDataSet"]}
+    __name__, submod_attrs={"api_dataset": ["APIDataSet", "APIDataset"]}
 )
105 changes: 63 additions & 42 deletions kedro-datasets/kedro_datasets/api/api_dataset.py
@@ -1,20 +1,20 @@
-"""``APIDataSet`` loads the data from HTTP(S) APIs.
+"""``APIDataset`` loads the data from HTTP(S) APIs.
 It uses the python requests library: https://requests.readthedocs.io/en/latest/
 """
 import json as json_  # make pylint happy
+import warnings
 from copy import deepcopy
 from typing import Any, Dict, List, Tuple, Union
 
 import requests
 from requests import Session, sessions
 from requests.auth import AuthBase
 
-from .._io import AbstractDataset as AbstractDataSet
-from .._io import DatasetError as DataSetError
+from kedro_datasets._io import AbstractDataset, DatasetError
 
 
-class APIDataSet(AbstractDataSet[None, requests.Response]):
-    """``APIDataSet`` loads/saves data from/to HTTP(S) APIs.
+class APIDataset(AbstractDataset[None, requests.Response]):
+    """``APIDataset`` loads/saves data from/to HTTP(S) APIs.
     It uses the python requests library: https://requests.readthedocs.io/en/latest/
 
     Example usage for the `YAML API <https://kedro.readthedocs.io/en/stable/data/\
@@ -23,7 +23,7 @@ class APIDataSet(AbstractDataSet[None, requests.Response]):
     .. code-block:: yaml
 
         usda:
-          type: api.APIDataSet
+          type: api.APIDataset
           url: https://quickstats.nass.usda.gov
           params:
             key: SOME_TOKEN,
@@ -33,39 +33,42 @@ class APIDataSet(AbstractDataSet[None, requests.Response]):
             agg_level_desc: STATE,
             year: 2000
 
-    Example usage for the `Python API <https://kedro.readthedocs.io/en/stable/data/\
-    advanced_data_catalog_usage.html>`_: ::
+    Example usage for the
+    `Python API <https://kedro.readthedocs.io/en/stable/data/\
+    advanced_data_catalog_usage.html>`_:
+    ::
 
-        >>> from kedro_datasets.api import APIDataSet
+        >>> from kedro_datasets.api import APIDataset
         >>>
         >>>
-        >>> data_set = APIDataSet(
-        >>>     url="https://quickstats.nass.usda.gov",
-        >>>     load_args={
-        >>>         "params": {
-        >>>             "key": "SOME_TOKEN",
-        >>>             "format": "JSON",
-        >>>             "commodity_desc": "CORN",
-        >>>             "statisticcat_des": "YIELD",
-        >>>             "agg_level_desc": "STATE",
-        >>>             "year": 2000
-        >>>         }
-        >>>     },
-        >>>     credentials=("username", "password")
-        >>> )
-        >>> data = data_set.load()
+        >>> dataset = APIDataset(
+        ...     url="https://quickstats.nass.usda.gov",
+        ...     load_args={
+        ...         "params": {
+        ...             "key": "SOME_TOKEN",
+        ...             "format": "JSON",
+        ...             "commodity_desc": "CORN",
+        ...             "statisticcat_des": "YIELD",
+        ...             "agg_level_desc": "STATE",
+        ...             "year": 2000
+        ...         }
+        ...     },
+        ...     credentials=("username", "password")
+        ... )
+        >>> data = dataset.load()
 
-    ``APIDataSet`` can also be used to save output on a remote server using HTTP(S)
-    methods. ::
+    ``APIDataset`` can also be used to save output on a remote server using HTTP(S)
+    methods.
+    ::
 
         >>> example_table = '{"col1":["val1", "val2"], "col2":["val3", "val4"]}'
-        >>> data_set = APIDataSet(
-                method = "POST",
-                url = "url_of_remote_server",
-                save_args = {"chunk_size":1}
-            )
-        >>> data_set.save(example_table)
+        >>>
+        >>> dataset = APIDataset(
+        ...     method = "POST",
+        ...     url = "url_of_remote_server",
+        ...     save_args = {"chunk_size":1}
+        ... )
+        >>> dataset.save(example_table)
 
     On initialisation, we can specify all the necessary parameters in the save args
     dictionary. The default HTTP(S) method is POST but PUT is also supported. Two
@@ -74,7 +77,7 @@ class APIDataSet(AbstractDataSet[None, requests.Response]):
     used if the input of save method is a list. It will divide the request into chunks
     of size `chunk_size`. For example, here we will send two requests each containing
     one row of our example DataFrame.
-    If the data passed to the save method is not a list, ``APIDataSet`` will check if it
+    If the data passed to the save method is not a list, ``APIDataset`` will check if it
     can be loaded as JSON. If true, it will send the data unchanged in a single request.
     Otherwise, the ``_save`` method will try to dump the data in JSON format and execute
     the request.
@@ -99,7 +102,7 @@ def __init__(
         credentials: Union[Tuple[str, str], List[str], AuthBase] = None,
         metadata: Dict[str, Any] = None,
     ) -> None:
-        """Creates a new instance of ``APIDataSet`` to fetch data from an API endpoint.
+        """Creates a new instance of ``APIDataset`` to fetch data from an API endpoint.
 
         Args:
             url: The API URL endpoint.
@@ -179,9 +182,9 @@ def _execute_request(self, session: Session) -> requests.Response:
             response = session.request(**self._request_args)
             response.raise_for_status()
         except requests.exceptions.HTTPError as exc:
-            raise DataSetError("Failed to fetch data", exc) from exc
+            raise DatasetError("Failed to fetch data", exc) from exc
         except OSError as exc:
-            raise DataSetError("Failed to connect to the remote server") from exc
+            raise DatasetError("Failed to connect to the remote server") from exc
 
         return response
 
@@ -190,7 +193,7 @@ def _load(self) -> requests.Response:
             with sessions.Session() as session:
                 return self._execute_request(session)
 
-        raise DataSetError("Only GET method is supported for load")
+        raise DatasetError("Only GET method is supported for load")
 
     def _execute_save_with_chunks(
         self,
@@ -214,10 +217,10 @@ def _execute_save_request(self, json_data: Any) -> requests.Response:
             response = requests.request(**self._request_args)
             response.raise_for_status()
         except requests.exceptions.HTTPError as exc:
-            raise DataSetError("Failed to send data", exc) from exc
+            raise DatasetError("Failed to send data", exc) from exc
 
         except OSError as exc:
-            raise DataSetError("Failed to connect to the remote server") from exc
+            raise DatasetError("Failed to connect to the remote server") from exc
         return response
 
     def _save(self, data: Any) -> requests.Response:
@@ -227,9 +230,27 @@ def _save(self, data: Any) -> requests.Response:
 
             return self._execute_save_request(json_data=data)
 
-        raise DataSetError("Use PUT or POST methods for save")
+        raise DatasetError("Use PUT or POST methods for save")
 
     def _exists(self) -> bool:
         with sessions.Session() as session:
             response = self._execute_request(session)
             return response.ok
+
+
+_DEPRECATED_CLASSES = {
+    "APIDataSet": APIDataset,
+}
+
+
+def __getattr__(name):
+    if name in _DEPRECATED_CLASSES:
+        alias = _DEPRECATED_CLASSES[name]
+        warnings.warn(
+            f"{repr(name)} has been renamed to {repr(alias.__name__)}, "
+            f"and the alias will be removed in Kedro-Datasets 2.0.0",
+            DeprecationWarning,
+            stacklevel=2,
+        )
+        return alias
+    raise AttributeError(f"module {repr(__name__)} has no attribute {repr(name)}")
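A quick way to observe this shim in action — a sketch assuming a fresh interpreter with this version of kedro-datasets (and requests) installed:

    import warnings

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        from kedro_datasets.api import APIDataSet  # deprecated spelling

    assert any(issubclass(w.category, DeprecationWarning) for w in caught)
    print(caught[0].message)
    # expected, per the f-string above: 'APIDataSet' has been renamed to
    # 'APIDataset', and the alias will be removed in Kedro-Datasets 2.0.0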
10 changes: 7 additions & 3 deletions kedro-datasets/kedro_datasets/biosequence/__init__.py
@@ -1,11 +1,15 @@
-"""``AbstractDataSet`` implementation to read/write from/to a sequence file."""
+"""``AbstractDataset`` implementation to read/write from/to a sequence file."""
+from __future__ import annotations
+
 from typing import Any
 
 import lazy_loader as lazy
 
 # https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
-BioSequenceDataSet: Any
+BioSequenceDataSet: type[BioSequenceDataset]
+BioSequenceDataset: Any
 
 __getattr__, __dir__, __all__ = lazy.attach(
-    __name__, submod_attrs={"biosequence_dataset": ["BioSequenceDataSet"]}
+    __name__,
+    submod_attrs={"biosequence_dataset": ["BioSequenceDataSet", "BioSequenceDataset"]},
 )
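The package `__init__` keeps using `lazy_loader`: as the diff shows, `lazy.attach` returns a `(__getattr__, __dir__, __all__)` triple, so the submodule is imported only on first attribute access, while the `type[BioSequenceDataset]` annotation (safe to write before the class exists thanks to `from __future__ import annotations`) points static type checkers from the old name at the new class. A sketch of the same wiring for a hypothetical package `mypkg`:

    # mypkg/__init__.py -- hypothetical package wired like kedro_datasets.biosequence
    from __future__ import annotations

    from typing import Any

    import lazy_loader as lazy

    # Annotations only: nothing is imported until first attribute access.
    ExampleDataSet: type[ExampleDataset]  # deprecated alias, for type checkers
    ExampleDataset: Any

    __getattr__, __dir__, __all__ = lazy.attach(
        __name__,
        submod_attrs={"example_dataset": ["ExampleDataSet", "ExampleDataset"]},
    )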
43 changes: 31 additions & 12 deletions kedro-datasets/kedro_datasets/biosequence/biosequence_dataset.py
@@ -1,6 +1,7 @@
-"""BioSequenceDataSet loads and saves data to/from bio-sequence objects to
+"""BioSequenceDataset loads and saves data to/from bio-sequence objects to
 file.
 """
+import warnings
 from copy import deepcopy
 from pathlib import PurePosixPath
 from typing import Any, Dict, List
@@ -9,29 +10,29 @@
 from Bio import SeqIO
 from kedro.io.core import get_filepath_str, get_protocol_and_path
 
-from .._io import AbstractDataset as AbstractDataSet
+from kedro_datasets._io import AbstractDataset
 
 
-class BioSequenceDataSet(AbstractDataSet[List, List]):
-    r"""``BioSequenceDataSet`` loads and saves data to a sequence file.
+class BioSequenceDataset(AbstractDataset[List, List]):
+    r"""``BioSequenceDataset`` loads and saves data to a sequence file.
 
     Example:
     ::
 
-        >>> from kedro_datasets.biosequence import BioSequenceDataSet
+        >>> from kedro_datasets.biosequence import BioSequenceDataset
         >>> from io import StringIO
        >>> from Bio import SeqIO
         >>>
         >>> data = ">Alpha\nACCGGATGTA\n>Beta\nAGGCTCGGTTA\n"
         >>> raw_data = []
         >>> for record in SeqIO.parse(StringIO(data), "fasta"):
-        >>>     raw_data.append(record)
+        ...     raw_data.append(record)
         >>>
-        >>> data_set = BioSequenceDataSet(filepath="ls_orchid.fasta",
-        >>>                               load_args={"format": "fasta"},
-        >>>                               save_args={"format": "fasta"})
-        >>> data_set.save(raw_data)
-        >>> sequence_list = data_set.load()
+        >>> dataset = BioSequenceDataset(filepath="ls_orchid.fasta",
+        ...                              load_args={"format": "fasta"},
+        ...                              save_args={"format": "fasta"})
+        >>> dataset.save(raw_data)
+        >>> sequence_list = dataset.load()
         >>>
         >>> assert raw_data[0].id == sequence_list[0].id
         >>> assert raw_data[0].seq == sequence_list[0].seq
@@ -52,7 +53,7 @@ def __init__(
         metadata: Dict[str, Any] = None,
     ) -> None:
         """
-        Creates a new instance of ``BioSequenceDataSet`` pointing
+        Creates a new instance of ``BioSequenceDataset`` pointing
         to a concrete filepath.
 
         Args:
@@ -137,3 +138,21 @@ def invalidate_cache(self) -> None:
         """Invalidate underlying filesystem caches."""
         filepath = get_filepath_str(self._filepath, self._protocol)
         self._fs.invalidate_cache(filepath)
+
+
+_DEPRECATED_CLASSES = {
+    "BioSequenceDataSet": BioSequenceDataset,
+}
+
+
+def __getattr__(name):
+    if name in _DEPRECATED_CLASSES:
+        alias = _DEPRECATED_CLASSES[name]
+        warnings.warn(
+            f"{repr(name)} has been renamed to {repr(alias.__name__)}, "
+            f"and the alias will be removed in Kedro-Datasets 2.0.0",
+            DeprecationWarning,
+            stacklevel=2,
+        )
+        return alias
+    raise AttributeError(f"module {repr(__name__)} has no attribute {repr(name)}")
