Merge pull request #66 from Sage-Bionetworks/bwmac/IBCDPE-409/implement_cli_2

[IBCDPE-409] Implement `agora-data-tools` CLI
BWMac authored Mar 28, 2023
2 parents f2d3464 + b014e6d commit 3ac7c75
Showing 20 changed files with 1,090 additions and 94 deletions.
12 changes: 7 additions & 5 deletions .github/workflows/dev.yml
@@ -16,7 +16,11 @@ jobs:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: [3.7, 3.8, 3.9]
        python-version:
          - "3.7"
          - "3.8"
          - "3.9"
          - "3.10"
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python ${{ matrix.python-version }}
@@ -26,9 +30,8 @@ jobs:
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install pytest pytest-cov
          pip install .
          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
          pip install pytest pytest-cov
      - name: Test with pytest
        run: |
          pytest tests/ --cov=agoradatatools --cov-report=html
@@ -51,6 +54,5 @@ jobs:
        with:
          python-version: "3.9"
      - run: pip install -U setuptools
      - run: pip install -r ./requirements.txt
      - run: pip install .
      - run: python ./agoradatatools/process.py test_config.yaml --authtoken ${{secrets.SYNAPSE_PAT}}
      - run: adt test_config.yaml -t ${{secrets.SYNAPSE_PAT}}
3 changes: 1 addition & 2 deletions .gitignore
@@ -91,8 +91,7 @@ ipython_config.py
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
Pipfile.lock
Pipfile


# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
2 changes: 2 additions & 0 deletions Dockerfile
@@ -2,6 +2,8 @@ FROM python:3.9-slim-buster

RUN apt-get update && \
    apt-get install -y procps && \
    apt-get install -y gcc && \
    apt-get install -y g++ && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /agora-data-tools
16 changes: 16 additions & 0 deletions Pipfile
@@ -0,0 +1,16 @@
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
agoradatatools = {editable = true, path = "."}

[dev-packages]
agoradatatools = {editable = true, path = "."}

[requires]
python_version = "3.9.13"

[pipenv]
allow_prereleases = true
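
Note that the Pipfile installs `agoradatatools` itself as an editable package (`path = "."`), so `pipenv install` followed by `pipenv shell` yields an environment in which the `adt` console script picks up local source changes immediately.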
988 changes: 988 additions & 0 deletions Pipfile.lock

Large diffs are not rendered by default.

66 changes: 30 additions & 36 deletions README.md
@@ -13,28 +13,28 @@
A place for Agora's ETL, data testing, and data analysis

This configuration-driven data pipeline uses a config file - which is easy for
engineers, analysts, and project managers to understand - to drive the entire ETL process. The code in `/agoradatatools` uses
engineers, analysts, and project managers to understand - to drive the entire ETL process. The code in `src/agoradatatools` uses
parameters defined in a config file to determine which kinds of extraction and transformations a particular
dataset needs to go through before the resulting data is serialized as json files that can be loaded into Agora's data repository.

In the spirit of importing datasets with the minimum number of transformations, one can simply add a dataset to the config file
and run the scripts.
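
For readers new to the pattern, here is a minimal sketch of what config-driven ETL dispatch can look like. The `datasets`/`transforms` schema and the transform registry below are hypothetical illustrations, not the actual `agoradatatools` config format or implementation:

```python
import json

import pandas as pd
import yaml

# Hypothetical transform registry; the real package defines its
# transformations in agoradatatools.etl.transform.
TRANSFORMS = {
    "rename_columns": lambda df, columns: df.rename(columns=columns),
    "dropna": lambda df: df.dropna(),
}

def run_pipeline(config_path: str) -> None:
    """Extract, transform, and serialize each dataset listed in the config."""
    with open(config_path) as f:
        config = yaml.safe_load(f)
    for dataset in config["datasets"]:  # hypothetical config schema
        df = pd.read_csv(dataset["source"])  # the real ETL extracts from Synapse
        for step in dataset.get("transforms", []):
            df = TRANSFORMS[step["name"]](df, **step.get("args", {}))
        with open(dataset["destination"], "w") as out:
            json.dump(df.to_dict(orient="records"), out)
```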

This `/agoradatatools` implementation was influenced by the "Modern Config Driven ELT Framework for Building a
This `src/agoradatatools` implementation was influenced by the "Modern Config Driven ELT Framework for Building a
Data Lake" talk given at the Data + AI Summit of 2021.

Python notebooks that describe the custom logic for various datasets are located in `/data_analysis/notebooks`.

## Running the pipeline
The json files generated by `/agoradatatools` are written to folders in the [Agora Synapse project](https://www.synapse.org/#!Synapse:syn11850457/files/) by default,
The json files generated by `src/agoradatatools` are written to folders in the [Agora Synapse project](https://www.synapse.org/#!Synapse:syn11850457/files/) by default,
although you can modify the destination Synapse folder in the [config file](#config).

Note that running the pipeline does _not_ automatically update the Agora database in any environment. Ingestion of generated json files
into the Agora databases is handled by [agora-data-manager](https://github.com/Sage-Bionetworks/agora-data-manager/).

You can run the pipeline in any of the following ways:
1. [Nextflow Tower](#nextflow-tower) is the simplest, but least flexible, way to run the pipeline; it does not require Synapse permissions, creating a Synapse PAT, or setting up the Synapse Python client.
2. [Locally](#locally) requires installing Python, obtaining the required Synapse permissions, creating a Synapse PAT, and setting up the Synapse Python client.
2. [Locally](#locally) requires installing Python and Pipenv, obtaining the required Synapse permissions, creating a Synapse PAT, and setting up the Synapse Python client.
3. [Docker](#docker) requires installing Docker, obtaining the required Synapse permissions, and creating a Synapse PAT.

When running the pipeline, you must specify the config file that will be used. There are two config files that are checked into this repo:
@@ -50,59 +50,53 @@ This pipeline can be executed without any local installation, permissions, or cr

The instructions to trigger the workflow can be found at [Sage-Bionetworks-Workflows/nf-agora](https://github.com/Sage-Bionetworks-Workflows/nf-agora)

### Configuring Synapse Credentials

1. Obtain download access to all required source files in Synapse, including accepting the terms of use on the AD Knowledge Portal backend [here](https://www.synapse.org/#!Synapse:syn5550378). If you see a green unlocked lock icon, then you should be good to go.
2. Obtain write access to the destination Synapse project, e.g. [Agora Synapse project](https://www.synapse.org/#!Synapse:syn11850457/files/)
3. Create a Synapse personal access token (PAT)
4. [Set up](https://help.synapse.org/docs/Client-Configuration.1985446156.html) your Synapse Python client locally

Your configured Synapse credentials can be used to run this package both locally and using Docker, as outlined below.
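
For reference, the same PAT can also be used directly from Python via the Synapse client; a minimal sketch, with the placeholder token standing in for your own:

```python
import synapseclient

# Authenticate with a personal access token rather than username/password.
syn = synapseclient.Synapse()
syn.login(authToken="<your PAT>")
```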

### Locally
Perform the following one-time steps to set up your local environment and to obtain the required Synapse permissions:

1. Due to the nature of Python, you will want to set up your python environment with [conda](https://www.anaconda.com/products/distribution) or [pyenv](https://github.com/pyenv/pyenv). You will want to create a virtual environment to do your work.
* conda - please follow instructions [here](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) to manage environments
* pyenv - you will want to use [virtualenv](https://virtualenv.pypa.io/en/latest/) to manage your python environment
1. If you have not already, [install](https://cloud.google.com/python/docs/setup) a supported version of Python. This package supports all versions >=3.7 and <3.11. Make sure that Python and `pip` are installed correctly and have been added to your PATH by running `python3 --version` and `pip3 --version`. If your installation was successful, your terminal will return the versions of Python and `pip` that you installed.

2. Install the package locally using conda or pyenv, depending on which you chose:
2. Install `pipenv` by running `pip install pipenv`.

* conda
```bash
conda create -n agora python=3.9
conda activate agora
pip install .
pip install -r requirements.txt
```
* pyenv + virtualenv
3. Install `git`, if you have not already done so, using [these instructions](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)

4. Clone this GitHub repository to your local machine: open your terminal, navigate to the directory where you want the repository to be cloned, and run `git clone https://github.com/Sage-Bionetworks/agora-data-tools.git`. After cloning is complete, navigate into the newly created `agora-data-tools` directory.

5. Install `agoradatatools` locally using pipenv:

* pipenv
```bash
pyenv install -v 3.9.13
pyenv global 3.9.13
python -m venv env
source env/bin/activate
python3 -m pip install .
python3 -m pip -r requirements.txt
pipenv install
pipenv shell
```

3. Obtain download access to all required source files in Synapse, including accepting the terms of use on the AD Knowledge Portal backend [here](https://www.synapse.org/#!Synapse:syn5550378). If you see a green unlocked lock icon, then you should be good to go.
4. Obtain write access to the destination Synapse project, e.g. [Agora Synapse project](https://www.synapse.org/#!Synapse:syn11850457/files/)
5. Create a Synapse personal access token (PAT)
6. [Set up](https://help.synapse.org/docs/Client-Configuration.1985446156.html) your Synapse Python client locally

Once you have completed the setup steps outlined above, execute the pipeline by running `process.py` and providing the desired [config file](#config) as an argument. The following example command will execute the pipeline using ```test_config.yaml```:
6. You can check whether the package was installed correctly by running `adt --help` in the terminal. If it returns instructions about how to use the CLI, installation was successful and you can run the pipeline by providing the desired [config file](#config) as an argument. The following example command will execute the pipeline using ```test_config.yaml```:

```bash
python ./agoradatatools/process.py test_config.yaml
adt test_config.yaml
```

### Docker

There is a publicly available [DockerHub repository](https://hub.docker.com/r/sagebionetworks/agora-data-tools) automatically build via DockerHub. That said, you may want to develop using Docker locally on a feature branch.
There is a publicly available [DockerHub repository](https://hub.docker.com/r/sagebionetworks/agora-data-tools) automatically built via DockerHub. That said, you may want to develop using Docker locally on a feature branch.

If you don't want to deal with Python paths and dependencies, you can use Docker to run the pipeline. Perform the following one-time steps to set up your docker environment and to obtain the required Synapse permissions:
If you don't want to deal with Python paths and dependencies, you can use Docker to run the pipeline. Perform the following one-time step to set up your docker environment and to obtain the required Synapse permissions:
1. Install [Docker](https://docs.docker.com/get-docker/).
2. Obtain download access to all required source files in Synapse, including accepting the terms of use on the AD Knowledge Portal backend [here](https://www.synapse.org/#!Synapse:syn5550378). If you see a green unlocked lock icon, then you should be good to go.
3. Obtain write access to the destination Synapse project, e.g. [Agora Synapse project](https://www.synapse.org/#!Synapse:syn11850457/files/)
4. Create a Synapse personal access token (PAT)
Once you have completed the one-time setup steps outlined above, execute the pipeline by running the following command and providing your PAT and the desired [config file](#config) as an argument. The following example command will execute the pipeline in Docker using ```test_config.yaml```:
Once you have completed the one-time setup step outlined above, execute the pipeline by running the following commands and providing your PAT and the desired [config file](#config) as arguments. The following example will execute the pipeline in Docker using ```test_config.yaml```:
```
# This creates a local docker image
docker build -t agora-data-tools .
docker run -e SYNAPSE_AUTH_TOKEN=<your PAT> agora-data-tools python ./agoradatatools/process.py test_config.yaml
docker run -e SYNAPSE_AUTH_TOKEN=<your PAT> agora-data-tools adt test_config.yaml
```
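Because the new `--token`/`-t` option falls back to the `$SYNAPSE_AUTH_TOKEN` environment variable (per the option's help text in `process.py` below), passing the PAT via `-e SYNAPSE_AUTH_TOKEN=<your PAT>` is sufficient and no explicit `-t` flag is needed in the Docker invocation.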
## Testing Github Workflow
3 changes: 3 additions & 0 deletions pyproject.toml
@@ -0,0 +1,3 @@
[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta:__legacy__"
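
This minimal `pyproject.toml` only declares the build backend; the legacy setuptools backend defers all package metadata to `setup.cfg`, which is how `pip install .` keeps working now that `requirements.txt` is gone.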
7 changes: 0 additions & 7 deletions requirements.txt

This file was deleted.

24 changes: 17 additions & 7 deletions setup.cfg
@@ -21,20 +21,30 @@ classifiers =
    Programming Language :: Python :: 3.7
    Programming Language :: Python :: 3.8
    Programming Language :: Python :: 3.9
    Programming Language :: Python :: 3.10
    Topic :: Scientific/Engineering
project_urls =
    Bug Tracker = https://github.com/Sage-Bionetworks/agora-data-tools/issues
    Source Code = https://github.com/Sage-Bionetworks/agora-data-tools

[options]
package_dir =
    = src
packages = find:
install_requires =
    pandas~=1.2.4
    numpy~=1.21.0
    pytest~=6.2.4
    synapseclient~=2.6.0
    PyYAML~=5.4.1
    pyarrow~=3.0.0
python_requires = >=3.7, <3.10
    pandas==1.2.4
    numpy~=1.21
    pytest~=7.2
    setuptools~=67.0.0
    synapseclient~=2.7.0
    PyYAML~=6.0
    pyarrow~=11.0
    typer~=0.7.0
python_requires = >=3.7, <3.11
include_package_data = True
zip_safe = False
[options.packages.find]
where = src
[options.entry_points]
console_scripts =
    adt = agoradatatools.process:app
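
The `console_scripts` entry point is what turns `adt` into a shell command: on install, pip generates a small wrapper executable roughly equivalent to the following sketch.

```python
# Rough equivalent of the wrapper pip generates for the `adt` entry point.
import sys

from agoradatatools.process import app

if __name__ == "__main__":
    sys.exit(app())
```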
File renamed without changes.
4 changes: 4 additions & 0 deletions src/agoradatatools/__main__.py
@@ -0,0 +1,4 @@
if __name__ == "__main__":
    from process import app

    app(prog_name="adt")
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
48 changes: 22 additions & 26 deletions agoradatatools/process.py → src/agoradatatools/process.py
@@ -1,13 +1,13 @@
import argparse
import os

from pandas import DataFrame
import synapseclient
from pandas import DataFrame
from typer import Argument, Option, Typer

import agoradatatools.etl.extract as extract
import agoradatatools.etl.transform as transform
import agoradatatools.etl.load as load
import agoradatatools.etl.transform as transform
import agoradatatools.etl.utils as utils

from agoradatatools.errors import ADTDataProcessingError


@@ -165,31 +165,27 @@ def process_all_files(config_path: str = None, syn=None):
)


def build_parser():
    """Builds the argument parser and returns the result.
app = Typer()

    Returns:
        argparse.ArgumentParser: argument parser for agora data processing
    """
    parser = argparse.ArgumentParser(description="Agora data processing")
    parser.add_argument(
        "configpath",
        help="Agora processing yaml configuration",
    )
    parser.add_argument(
        "-a",
        "--authtoken",
        help="Synapse PAT",
    )
    return parser

input_path_arg = Argument(..., help="Path to configuration file for processing run")
synapse_auth_opt = Option(
    None,
    "--token",
    "-t",
    help="Synapse authentication token. Defaults to environment variable $SYNAPSE_AUTH_TOKEN via syn.login() functionality",
    show_default=False,
)


def main():
    parser = build_parser()
    args = parser.parse_args()
    syn = utils._login_to_synapse(token=args.authtoken)
    process_all_files(config_path=args.configpath, syn=syn)
@app.command()
def process(
    config_path: str = input_path_arg,
    auth_token: str = synapse_auth_opt,
):
    syn = utils._login_to_synapse(token=auth_token)
    process_all_files(config_path=config_path, syn=syn)


if __name__ == "__main__":
    main()
    app()
11 changes: 0 additions & 11 deletions tests/test_process.py
@@ -234,14 +234,3 @@ def test_process_all_files_full(self, syn):
        staging_path="./staging",
        filename="data_manifest.csv",
    )


def test_build_parser():
    with patch.object(
        argparse,
        "ArgumentParser",
        return_value=argparse.ArgumentParser(),
    ) as patch_build_parser:
        parser = process.build_parser()
        patch_build_parser.assert_called_once_with(description="Agora data processing")
        assert parser == argparse.ArgumentParser()
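
The deleted `test_build_parser` has no counterpart in this diff; Typer apps are usually exercised through Click's test runner instead. A sketch of what a replacement test could look like (not part of this commit):

```python
from typer.testing import CliRunner

from agoradatatools.process import app

def test_cli_help():
    """The CLI should print usage information and exit cleanly."""
    runner = CliRunner()
    result = runner.invoke(app, ["--help"])
    assert result.exit_code == 0
```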
