Merge pull request #66 from Sage-Bionetworks/bwmac/IBCDPE-409/implement_cli_2

[IBCDPE-409] Implement `agora-data-tools` CLI
BWMac authored Mar 28, 2023
2 parents f2d3464 + b014e6d commit 3ac7c75
Showing 20 changed files with 1,090 additions and 94 deletions.
12 changes: 7 additions & 5 deletions .github/workflows/dev.yml
@@ -16,7 +16,11 @@ jobs:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: [3.7, 3.8, 3.9]
        python-version:
          - "3.7"
          - "3.8"
          - "3.9"
          - "3.10"
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python ${{ matrix.python-version }}
@@ -26,9 +30,8 @@ jobs:
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install pytest pytest-cov
          pip install .
          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
          pip install pytest pytest-cov
      - name: Test with pytest
        run: |
          pytest tests/ --cov=agoradatatools --cov-report=html
@@ -51,6 +54,5 @@ jobs:
        with:
          python-version: "3.9"
      - run: pip install -U setuptools
      - run: pip install -r ./requirements.txt
      - run: pip install .
      - run: python ./agoradatatools/process.py test_config.yaml --authtoken ${{secrets.SYNAPSE_PAT}}
      - run: adt test_config.yaml -t ${{secrets.SYNAPSE_PAT}}
3 changes: 1 addition & 2 deletions .gitignore
@@ -91,8 +91,7 @@ ipython_config.py
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
Pipfile.lock
Pipfile


# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
2 changes: 2 additions & 0 deletions Dockerfile
@@ -2,6 +2,8 @@ FROM python:3.9-slim-buster

RUN apt-get update && \
    apt-get install -y procps && \
    apt-get install -y gcc && \
    apt-get install -y g++ && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /agora-data-tools
16 changes: 16 additions & 0 deletions Pipfile
@@ -0,0 +1,16 @@
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
agoradatatools = {editable = true, path = "."}

[dev-packages]
agoradatatools = {editable = true, path = "."}

[requires]
python_version = "3.9.13"

[pipenv]
allow_prereleases = true
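
Note that the Pipfile installs `agoradatatools` itself as an editable package (`path = "."`), so `pipenv install` followed by `pipenv shell` yields an environment in which the `adt` console script picks up local source changes immediately.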
988 changes: 988 additions & 0 deletions Pipfile.lock

Large diffs are not rendered by default.

66 changes: 30 additions & 36 deletions README.md
@@ -13,28 +13,28 @@
A place for Agora's ETL, data testing, and data analysis

This configuration-driven data pipeline uses a config file - which is easy for
engineers, analysts, and project managers to understand - to drive the entire ETL process. The code in `/agoradatatools` uses
engineers, analysts, and project managers to understand - to drive the entire ETL process. The code in `src/agoradatatools` uses
parameters defined in a config file to determine which kinds of extraction and transformations a particular
dataset needs to go through before the resulting data is serialized as json files that can be loaded into Agora's data repository.

In the spirit of importing datasets with the minimum number of transformations, one can simply add a dataset to the config file
and run the scripts.
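
For readers new to the pattern, here is a minimal sketch of what config-driven ETL dispatch can look like. The `datasets`/`transforms` schema and the transform registry below are hypothetical illustrations, not the actual `agoradatatools` config format or implementation:

```python
import json

import pandas as pd
import yaml

# Hypothetical transform registry; the real package defines its
# transformations in agoradatatools.etl.transform.
TRANSFORMS = {
    "rename_columns": lambda df, columns: df.rename(columns=columns),
    "dropna": lambda df: df.dropna(),
}

def run_pipeline(config_path: str) -> None:
    """Extract, transform, and serialize each dataset listed in the config."""
    with open(config_path) as f:
        config = yaml.safe_load(f)
    for dataset in config["datasets"]:  # hypothetical config schema
        df = pd.read_csv(dataset["source"])  # the real ETL extracts from Synapse
        for step in dataset.get("transforms", []):
            df = TRANSFORMS[step["name"]](df, **step.get("args", {}))
        with open(dataset["destination"], "w") as out:
            json.dump(df.to_dict(orient="records"), out)
```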

This `/agoradatatools` implementation was influenced by the "Modern Config Driven ELT Framework for Building a
This `src/agoradatatools` implementation was influenced by the "Modern Config Driven ELT Framework for Building a
Data Lake" talk given at the Data + AI Summit of 2021.

Python notebooks that describe the custom logic for various datasets are located in `/data_analysis/notebooks`.

## Running the pipeline
The json files generated by `/agoradatatools` are written to folders in the [Agora Synapse project](https://www.synapse.org/#!Synapse:syn11850457/files/) by default,
The json files generated by `src/agoradatatools` are written to folders in the [Agora Synapse project](https://www.synapse.org/#!Synapse:syn11850457/files/) by default,
although you can modify the destination Synapse folder in the [config file](#config).

Note that running the pipeline does _not_ automatically update the Agora database in any environment. Ingestion of generated json files
into the Agora databases is handled by [agora-data-manager](https://github.com/Sage-Bionetworks/agora-data-manager/).

You can run the pipeline in any of the following ways:
1. [Nextflow Tower](#nextflow-tower) is the simplest, but least flexible, way to run the pipeline; it does not require Synapse permissions, creating a Synapse PAT, or setting up the Synapse Python client.
2. [Locally](#locally) requires installing Python, obtaining the required Synapse permissions, creating a Synapse PAT, and setting up the Synapse Python client.
2. [Locally](#locally) requires installing Python and Pipenv, obtaining the required Synapse permissions, creating a Synapse PAT, and setting up the Synapse Python client.
3. [Docker](#docker) requires installing Docker, obtaining the required Synapse permissions, and creating a Synapse PAT.

When running the pipeline, you must specify the config file that will be used. There are two config files that are checked into this repo:
@@ -50,59 +50,53 @@ This pipeline can be executed without any local installation, permissions, or cr

The instructions to trigger the workflow can be found at [Sage-Bionetworks-Workflows/nf-agora](https://github.com/Sage-Bionetworks-Workflows/nf-agora)

### Configuring Synapse Credentials

1. Obtain download access to all required source files in Synapse, including accepting the terms of use on the AD Knowledge Portal backend [here](https://www.synapse.org/#!Synapse:syn5550378). If you see a green unlocked lock icon, then you should be good to go.
2. Obtain write access to the destination Synapse project, e.g. [Agora Synapse project](https://www.synapse.org/#!Synapse:syn11850457/files/)
3. Create a Synapse personal access token (PAT)
4. [Set up](https://help.synapse.org/docs/Client-Configuration.1985446156.html) your Synapse Python client locally

Your configured Synapse credentials can be used to run this package both locally and using Docker, as outlined below.
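
For reference, the same PAT can also be used directly from Python via the Synapse client; a minimal sketch, with the placeholder token standing in for your own:

```python
import synapseclient

# Authenticate with a personal access token rather than username/password.
syn = synapseclient.Synapse()
syn.login(authToken="<your PAT>")
```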

### Locally
Perform the following one-time steps to set up your local environment and to obtain the required Synapse permissions:

1. Due to the nature of Python, you will want to set up your python environment with [conda](https://www.anaconda.com/products/distribution) or [pyenv](https://github.com/pyenv/pyenv). You will want to create a virtual environment to do your work.
* conda - please follow instructions [here](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) to manage environments
* pyenv - you will want to use [virtualenv](https://virtualenv.pypa.io/en/latest/) to manage your python environment
1. If you have not already, [install](https://cloud.google.com/python/docs/setup) a supported version of Python. This package supports all versions >=3.7 and <3.11. Make sure that Python and `pip` are installed correctly and have been added to your PATH by running `python3 --version` and `pip3 --version`. If your installation was successful, your terminal will return the versions of Python and `pip` that you installed.

2. Install the package locally using conda or pyenv, depending on which you chose:
2. Install `pipenv` by running `pip install pipenv`.

* conda
```bash
conda create -n agora python=3.9
conda activate agora
pip install .
pip install -r requirements.txt
```
* pyenv + virtualenv
3. Install `git`, if you have not already done so, using [these instructions](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)

4. Clone this GitHub repository to your local machine: open your terminal, navigate to the directory where you want the repository to be cloned, and run `git clone https://github.com/Sage-Bionetworks/agora-data-tools.git`. After cloning is complete, navigate into the newly created `agora-data-tools` directory.

5. Install `agoradatatools` locally using pipenv:

* pipenv
```bash
pyenv install -v 3.9.13
pyenv global 3.9.13
python -m venv env
source env/bin/activate
python3 -m pip install .
python3 -m pip -r requirements.txt
pipenv install
pipenv shell
```

3. Obtain download access to all required source files in Synapse, including accepting the terms of use on the AD Knowledge Portal backend [here](https://www.synapse.org/#!Synapse:syn5550378). If you see a green unlocked lock icon, then you should be good to go.
4. Obtain write access to the destination Synapse project, e.g. [Agora Synapse project](https://www.synapse.org/#!Synapse:syn11850457/files/)
5. Create a Synapse personal access token (PAT)
6. [Set up](https://help.synapse.org/docs/Client-Configuration.1985446156.html) your Synapse Python client locally

Once you have completed the setup steps outlined above, execute the pipeline by running `process.py` and providing the desired [config file](#config) as an argument. The following example command will execute the pipeline using ```test_config.yaml```:
6. You can check whether the package was installed correctly by running `adt --help` in the terminal. If it returns instructions about how to use the CLI, installation was successful and you can run the pipeline by providing the desired [config file](#config) as an argument. The following example command will execute the pipeline using ```test_config.yaml```:

```bash
python ./agoradatatools/process.py test_config.yaml
adt test_config.yaml
```

### Docker

There is a publicly available [DockerHub repository](https://hub.docker.com/r/sagebionetworks/agora-data-tools) automatically build via DockerHub. That said, you may want to develop using Docker locally on a feature branch.
There is a publicly available [DockerHub repository](https://hub.docker.com/r/sagebionetworks/agora-data-tools) automatically built via DockerHub. That said, you may want to develop using Docker locally on a feature branch.

If you don't want to deal with Python paths and dependencies, you can use Docker to run the pipeline. Perform the following one-time steps to set up your docker environment and to obtain the required Synapse permissions:
If you don't want to deal with Python paths and dependencies, you can use Docker to run the pipeline. Perform the following one-time step to set up your docker environment and to obtain the required Synapse permissions:
1. Install [Docker](https://docs.docker.com/get-docker/).
2. Obtain download access to all required source files in Synapse, including accepting the terms of use on the AD Knowledge Portal backend [here](https://www.synapse.org/#!Synapse:syn5550378). If you see a green unlocked lock icon, then you should be good to go.
3. Obtain write access to the destination Synapse project, e.g. [Agora Synapse project](https://www.synapse.org/#!Synapse:syn11850457/files/)
4. Create a Synapse personal access token (PAT)
Once you have completed the one-time setup steps outlined above, execute the pipeline by running the following command and providing your PAT and the desired [config file](#config) as an argument. The following example command will execute the pipeline in Docker using ```test_config.yaml```:
Once you have completed the one-time setup step outlined above, execute the pipeline by running the following commands and providing your PAT and the desired [config file](#config) as arguments. The following example will execute the pipeline in Docker using ```test_config.yaml```:
```
# This creates a local docker image
docker build -t agora-data-tools .
docker run -e SYNAPSE_AUTH_TOKEN=<your PAT> agora-data-tools python ./agoradatatools/process.py test_config.yaml
docker run -e SYNAPSE_AUTH_TOKEN=<your PAT> agora-data-tools adt test_config.yaml
```
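Because the new `--token`/`-t` option falls back to the `$SYNAPSE_AUTH_TOKEN` environment variable (per the option's help text in `process.py` below), passing the PAT via `-e SYNAPSE_AUTH_TOKEN=<your PAT>` is sufficient and no explicit `-t` flag is needed in the Docker invocation.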
## Testing Github Workflow
3 changes: 3 additions & 0 deletions pyproject.toml
@@ -0,0 +1,3 @@
[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta:__legacy__"
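
This minimal `pyproject.toml` only declares the build backend; the legacy setuptools backend defers all package metadata to `setup.cfg`, which is how `pip install .` keeps working now that `requirements.txt` is gone.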
7 changes: 0 additions & 7 deletions requirements.txt

This file was deleted.

24 changes: 17 additions & 7 deletions setup.cfg
@@ -21,20 +21,30 @@ classifiers =
    Programming Language :: Python :: 3.7
    Programming Language :: Python :: 3.8
    Programming Language :: Python :: 3.9
    Programming Language :: Python :: 3.10
    Topic :: Scientific/Engineering
project_urls =
    Bug Tracker = https://github.com/Sage-Bionetworks/agora-data-tools/issues
    Source Code = https://github.com/Sage-Bionetworks/agora-data-tools

[options]
package_dir =
    = src
packages = find:
install_requires =
    pandas~=1.2.4
    numpy~=1.21.0
    pytest~=6.2.4
    synapseclient~=2.6.0
    PyYAML~=5.4.1
    pyarrow~=3.0.0
python_requires = >=3.7, <3.10
    pandas==1.2.4
    numpy~=1.21
    pytest~=7.2
    setuptools~=67.0.0
    synapseclient~=2.7.0
    PyYAML~=6.0
    pyarrow~=11.0
    typer~=0.7.0
python_requires = >=3.7, <3.11
include_package_data = True
zip_safe = False
[options.packages.find]
where = src
[options.entry_points]
console_scripts =
    adt = agoradatatools.process:app
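
The `console_scripts` entry point is what turns `adt` into a shell command: on install, pip generates a small wrapper executable roughly equivalent to the following sketch.

```python
# Rough equivalent of the wrapper pip generates for the `adt` entry point.
import sys

from agoradatatools.process import app

if __name__ == "__main__":
    sys.exit(app())
```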
File renamed without changes.
4 changes: 4 additions & 0 deletions src/agoradatatools/__main__.py
@@ -0,0 +1,4 @@
if __name__ == "__main__":
    from process import app

    app(prog_name="adt")
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
48 changes: 22 additions & 26 deletions agoradatatools/process.py → src/agoradatatools/process.py
@@ -1,13 +1,13 @@
import argparse
import os

from pandas import DataFrame
import synapseclient
from pandas import DataFrame
from typer import Argument, Option, Typer

import agoradatatools.etl.extract as extract
import agoradatatools.etl.transform as transform
import agoradatatools.etl.load as load
import agoradatatools.etl.transform as transform
import agoradatatools.etl.utils as utils

from agoradatatools.errors import ADTDataProcessingError


@@ -165,31 +165,27 @@ def process_all_files(config_path: str = None, syn=None):
)


def build_parser():
    """Builds the argument parser and returns the result.
app = Typer()

    Returns:
        argparse.ArgumentParser: argument parser for agora data processing
    """
    parser = argparse.ArgumentParser(description="Agora data processing")
    parser.add_argument(
        "configpath",
        help="Agora processing yaml configuration",
    )
    parser.add_argument(
        "-a",
        "--authtoken",
        help="Synapse PAT",
    )
    return parser

input_path_arg = Argument(..., help="Path to configuration file for processing run")
synapse_auth_opt = Option(
    None,
    "--token",
    "-t",
    help="Synapse authentication token. Defaults to environment variable $SYNAPSE_AUTH_TOKEN via syn.login() functionality",
    show_default=False,
)


def main():
    parser = build_parser()
    args = parser.parse_args()
    syn = utils._login_to_synapse(token=args.authtoken)
    process_all_files(config_path=args.configpath, syn=syn)
@app.command()
def process(
    config_path: str = input_path_arg,
    auth_token: str = synapse_auth_opt,
):
    syn = utils._login_to_synapse(token=auth_token)
    process_all_files(config_path=config_path, syn=syn)


if __name__ == "__main__":
    main()
    app()
11 changes: 0 additions & 11 deletions tests/test_process.py
@@ -234,14 +234,3 @@ def test_process_all_files_full(self, syn):
        staging_path="./staging",
        filename="data_manifest.csv",
    )


def test_build_parser():
    with patch.object(
        argparse,
        "ArgumentParser",
        return_value=argparse.ArgumentParser(),
    ) as patch_build_parser:
        parser = process.build_parser()
        patch_build_parser.assert_called_once_with(description="Agora data processing")
        assert parser == argparse.ArgumentParser()
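
The deleted `test_build_parser` has no counterpart in this diff; Typer apps are usually exercised through Click's test runner instead. A sketch of what a replacement test could look like (not part of this commit):

```python
from typer.testing import CliRunner

from agoradatatools.process import app

def test_cli_help():
    """The CLI should print usage information and exit cleanly."""
    runner = CliRunner()
    result = runner.invoke(app, ["--help"])
    assert result.exit_code == 0
```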
