Skip to content

Commit

Permalink
Merge pull request #37 from bento-platform/more-logging
Browse files Browse the repository at this point in the history
Default deduplicate to True and add more logging + testing
  • Loading branch information
davidlougheed authored Feb 9, 2023
2 parents 8330f6c + ccc0a4c commit 4332554
Show file tree
Hide file tree
Showing 16 changed files with 122 additions and 48 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/build.yml
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ jobs:
uses: actions/checkout@v3

- name: Run Bento build action
uses: bento-platform/bento_build_action@v0.9.3
uses: bento-platform/bento_build_action@v0.10.1
with:
registry: ghcr.io
registry-username: ${{ github.actor }}
Expand Down
2 changes: 1 addition & 1 deletion Dockerfile
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
FROM ghcr.io/bento-platform/bento_base_image:python-debian-2022.12.06
FROM ghcr.io/bento-platform/bento_base_image:python-debian-2023.02.09

# TODO: change USER
USER root
Expand Down
27 changes: 24 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,11 +6,12 @@

A proof of concept based on [GA4GH's DRS specifications](https://ga4gh.github.io/data-repository-service-schemas/preview/release/drs-1.0.0/docs/).
This flask application offers an interface to query files in such
a fashion: "drs://some-domain/some-ID".
a fashion: `drs://some-domain/some-ID`.

For storing the files, two methods are currently supported : in the current filesystem
or inside a MinIO instance (which is a s3-like software).


## TODO / Future considerations

- Ingesting is either through the command line or by the endpoint of the same name
Expand All @@ -19,6 +20,7 @@ or inside a MinIO instance (which is a s3-like software).
- Consider how to be aware of http vs https depending on the deployment setup
(in singularity, docker, as is).


## Configuration

At the root of this project there is a sample dotenv file (.env-sample). These can be
Expand All @@ -29,6 +31,7 @@ provide the missing values.
cp .env-sample .env
```


## Running in Development

Development dependencies are described in `requirements.txt` and can be
Expand Down Expand Up @@ -57,6 +60,7 @@ The Flask development server can be run with the following command:
FLASK_DEBUG=True flask run
```


## Running Tests

To run all tests and calculate coverage, run the following command:
Expand All @@ -69,6 +73,7 @@ Tox is configured to run both pytest and flake8, you may want to uncomment
the second line of tox.ini (envlist = ...) so as to run these commands
for multiple versions of Python.


## Deploying

In production, the service should be deployed using a WSGI service like
Expand All @@ -78,6 +83,7 @@ With uWSGI you should point to chord_drs.app:application, the wsgi.py file
at the root of the project is there to simplify executing the commands (such
as "ingest")


## API

##### GET a single object
Expand All @@ -86,6 +92,8 @@ as "ingest")

`/ga4gh/drs/v1/objects/<string:object_id>`

Returns a standard GA4GH record for the object.

##### GET search

`/search`
Expand All @@ -105,13 +113,26 @@ partial match `/search?fuzzy_name=1001`

e.g. POST body

```bash
```json
{
"path": "examples/P-1001.hc.g.vcf.gz"
"path": "examples/P-1001.hc.g.vcf.gz"
}
```

This will automatically deduplicate with existing DRS objects if the file matches.

To ingest and force-create a duplicate record, provide the `deduplicate` parameter, set to `false`:

```json
{
"path": "examples/P-1001.hc.g.vcf.gz",
"deduplicate": false
}
```


##### GET service info

`/service-info`

Returns a GA4GH+Bento-formatted service info response.
2 changes: 1 addition & 1 deletion chord_drs/backend.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,5 +23,5 @@ def get_backend() -> Optional[Backend]:
return g.backend


def close_backend(_e=None):
def close_backend(_e=None) -> None:
g.pop("backend", None)
4 changes: 2 additions & 2 deletions chord_drs/commands.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ def create_drs_bundle(location: str, parent: Optional[DrsBundle] = None) -> DrsB
return bundle


def create_drs_object(location: str, parent: Optional[DrsBundle] = None):
def create_drs_object(location: str, parent: Optional[DrsBundle] = None) -> None:
drs_object = DrsObject(location=location)

if parent:
Expand All @@ -47,7 +47,7 @@ def create_drs_object(location: str, parent: Optional[DrsBundle] = None):
@click.command("ingest")
@click.argument("source")
@with_appcontext
def ingest(source: str):
def ingest(source: str) -> None:
"""
When provided with a file or a directory, this command will add these
to our list of objects, to be served by the application.
Expand Down
3 changes: 1 addition & 2 deletions chord_drs/models.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,6 @@

from chord_drs.backend import get_backend
from chord_drs.backends.minio import MinioBackend
from chord_drs.constants import SERVICE_NAME
from chord_drs.db import db
from chord_drs.utils import drs_file_checksum

Expand Down Expand Up @@ -77,7 +76,7 @@ def __init__(self, *args, **kwargs):
try:
current_location = backend.save(location, new_filename)
except Exception as e:
current_app.logger.error(f"[{SERVICE_NAME}] Encountered exception during DRS object creation: {e}")
current_app.logger.error(f"Encountered exception during DRS object creation: {e}")
# TODO: implement more specific exception handling
raise Exception("Well if the file is not saved... we can't do squat")

Expand Down
2 changes: 1 addition & 1 deletion chord_drs/package.cfg
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
[package]
name = chord_drs
version = 0.8.0
version = 0.9.0
authors = Simon Chénard, David Lougheed
author_emails = simon.chenard2@mcgill.ca, david.lougheed@mail.mcgill.ca
70 changes: 44 additions & 26 deletions chord_drs/routes.py
Original file line number Diff line number Diff line change
Expand Up @@ -94,7 +94,7 @@ def build_object_json(drs_object: DrsObject, inside_container: bool = False) ->
default_access_method = {
"access_url": {
"url": url_for("drs_service.object_download", object_id=drs_object.id, _external=True)
# No headers means that auth will have to be obtained via some
# No headers --> auth will have to be obtained via some
# out-of-band method, or the object's contents are public. This
# will depend on how the service is deployed.
},
Expand Down Expand Up @@ -176,6 +176,8 @@ def service_info():
res_branch_str = res_branch.decode().strip()
info["git_branch"] = res_branch_str
info["bento"]["gitBranch"] = res_branch_str
if res_commit := subprocess.check_output(["git", "rev-parse", "HEAD"]):
info["bento"]["gitCommit"] = res_commit.decode().strip()

except Exception as e:
except_name = type(e).__name__
Expand All @@ -187,30 +189,30 @@ def service_info():
@drs_service.route("/objects/<string:object_id>", methods=["GET"])
@drs_service.route("/ga4gh/drs/v1/objects/<string:object_id>", methods=["GET"])
def object_info(object_id: str):
drs_bundle = DrsBundle.query.filter_by(id=object_id).first()
drs_object = None
drs_bundle: Optional[DrsBundle] = DrsBundle.query.filter_by(id=object_id).first()
drs_object: Optional[DrsObject] = None

if not drs_bundle: # Only try hitting the database for an object if no bundle was found
drs_object = DrsObject.query.filter_by(id=object_id).first()

if not drs_object:
return flask_errors.flask_not_found_error("No object found for this ID")

# Are we inside the bento singularity container? if so, provide local accessmethod
inside_container = request.headers.get("X-CHORD-Internal", "0") == "1"

# Log X-CHORD-Internal header
current_app.logger.info(
f"[{SERVICE_NAME}] object_info X-CHORD-Internal: {request.headers.get('X-CHORD-Internal', 'not set')}"
)
current_app.logger.info(f"object_info X-CHORD-Internal: {request.headers.get('X-CHORD-Internal', 'not set')}")

# Are we inside the bento singularity container? if so, provide local access method
inside_container = request.headers.get("X-CHORD-Internal", "0") == "1"

# The requester can specify object internal path to be added to the response
use_internal_path = strtobool(request.args.get("internal_path", ""))

include_internal_path = inside_container or use_internal_path

if drs_bundle:
response = build_bundle_json(drs_bundle, inside_container=(inside_container or use_internal_path))
response = build_bundle_json(drs_bundle, inside_container=include_internal_path)
else:
response = build_object_json(drs_object, inside_container=(inside_container or use_internal_path))
response = build_object_json(drs_object, inside_container=include_internal_path)

return jsonify(response)

Expand All @@ -237,6 +239,8 @@ def object_search():

@drs_service.route("/objects/<string:object_id>/download", methods=["GET"])
def object_download(object_id):
logger = current_app.logger

try:
drs_object = DrsObject.query.filter_by(id=object_id).one()
except NoResultFound:
Expand All @@ -256,23 +260,27 @@ def object_download(object_id):
download_name=drs_object.name,
)

current_app.logger.debug(f"Found Range header: {range_header}")
logger.debug(f"Found Range header: {range_header}")
range_err = f"Malformatted range header: expected bytes=X-Y or bytes=X-, got {range_header}"

rh_split = range_header.split("=")
if len(rh_split) != 2 or rh_split[0] != "bytes":
err = f"Malformatted range header: expected bytes=X-Y or bytes=X-, got {range_header}"
current_app.logger.error(err)
return flask_errors.flask_bad_request_error(err)
logger.error(range_err)
return flask_errors.flask_bad_request_error(range_err)

byte_range = rh_split[1].strip().split("-")
current_app.logger.debug(f"Retrieving byte range {byte_range}")
logger.debug(f"Retrieving byte range {byte_range}")

start = int(byte_range[0])
end = int(byte_range[1]) if byte_range[1] else None
try:
start = int(byte_range[0])
end = int(byte_range[1]) if byte_range[1] else None
except (IndexError, ValueError):
logger.error(range_err)
return flask_errors.flask_bad_request_error(range_err)

if end is not None and end < start:
err = f"Invalid range header: end cannot be less than start (start={start}, end={end})"
current_app.logger.error(err)
logger.error(err)
return flask_errors.flask_bad_request_error(err)

def generate_bytes():
Expand All @@ -282,7 +290,7 @@ def generate_bytes():

# Then, read in either CHUNK_SIZE byte segments or however many bytes are left to send, whichever is
# left. This avoids filling memory with the contents of large files.
byte_offset = start
byte_offset: int = start
while True:
# Add a 1 to the amount to read if it's below chunk size, because the last coordinate is inclusive.
data = fh2.read(min(CHUNK_SIZE, (end + 1 - byte_offset) if end is not None else CHUNK_SIZE))
Expand Down Expand Up @@ -319,30 +327,40 @@ def generate_bytes():

@drs_service.route("/private/ingest", methods=["POST"])
def object_ingest():
logger = current_app.logger
data = request.json or {}

obj_path: str = data.get("path")

if not obj_path or not isinstance(obj_path, str):
logger.error(f"Missing or invalid path parameter in JSON request: {obj_path}")
return flask_errors.flask_bad_request_error("Missing or invalid path parameter in JSON request")

# TODO: Should this always be the case?
deduplicate: bool = data.get("deduplicate", False)

drs_object: Optional[DrsObject] = None
deduplicate: bool = data.get("deduplicate", True) # Change for v0.9: default to True

if deduplicate:
# Get checksum of original file, and query database for objects that match
checksum = drs_file_checksum(obj_path)

try:
checksum = drs_file_checksum(obj_path)
except FileNotFoundError:
err = f"File not found at path {obj_path}"
logger.error(err)
return flask_errors.flask_bad_request_error(err)

drs_object = DrsObject.query.filter_by(checksum=checksum).first()
if drs_object:
logger.info(f"Found duplicate DRS object via checksum (will deduplicate): {drs_object}")

if not drs_object:
try:
drs_object = DrsObject(location=obj_path)

db.session.add(drs_object)
db.session.commit()
logger.info(f"Added DRS object: {drs_object}")
except Exception as e: # TODO: More specific handling
current_app.logger.error(f"[{SERVICE_NAME}] Encountered exception during ingest: {e}")
logger.error(f"Encountered exception during ingest: {e}")
return flask_errors.flask_bad_request_error("Error while creating the object")

response = build_object_json(drs_object)
Expand Down
4 changes: 3 additions & 1 deletion chord_drs/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,14 @@

__all__ = ["drs_file_checksum"]

CHUNK_SIZE = 16 * 1024


def drs_file_checksum(path: str) -> str:
hash_obj = sha256()

with open(path, "rb") as f:
for chunk in iter(lambda: f.read(4096), b""):
while chunk := f.read(CHUNK_SIZE):
hash_obj.update(chunk)

return hash_obj.hexdigest()
2 changes: 1 addition & 1 deletion dev.Dockerfile
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
FROM ghcr.io/bento-platform/bento_base_image:python-debian-2022.12.06
FROM ghcr.io/bento-platform/bento_base_image:python-debian-2023.02.09

# TODO: change USER
USER root
Expand Down
3 changes: 3 additions & 0 deletions entrypoint.dev.bash
Original file line number Diff line number Diff line change
@@ -1,5 +1,8 @@
#!/bin/bash

# Set .gitconfig for development
/set_gitconfig.bash

export FLASK_ENV=development
export FLASK_APP=chord_drs.app:application

Expand Down
9 changes: 5 additions & 4 deletions requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -66,16 +66,17 @@ pytz==2022.7.1
PyYAML==5.4.1
redis==3.5.3
requests==2.28.2
responses==0.13.4
responses==0.22.0
s3transfer==0.5.0
six==1.16.0
SQLAlchemy==1.4.46
toml==0.10.2
tomli==2.0.1
tox==4.3.5
tox==4.4.4
types-toml==0.10.8.3
urllib3==1.26.14
virtualenv==20.17.1
virtualenv==20.18.0
Werkzeug==2.2.2
xmltodict==0.12.0
yarl==1.8.2
zipp==3.11.0
zipp==3.12.1
1 change: 1 addition & 0 deletions tests/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@
NON_EXISTENT_DUMMY_FILE = os.path.join(BASEDIR, "potato")
DUMMY_FILE = os.path.join(BASEDIR, "tests", "dummy_file.txt")
DUMMY_DIRECTORY = os.path.join(APP_DIR, "migrations")
EMPTY_FILE = os.path.join(BASEDIR, "tests", "empty_file.txt")


@pytest.fixture
Expand Down
Empty file added tests/empty_file.txt
Empty file.
10 changes: 10 additions & 0 deletions tests/test_checksum.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
from .conftest import DUMMY_FILE, EMPTY_FILE
from chord_drs.utils import drs_file_checksum


def test_sha256_checksum():
# see https://crypto.stackexchange.com/questions/26133/sha-256-hash-of-null-input
assert drs_file_checksum(EMPTY_FILE) == "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

# file with content
assert drs_file_checksum(DUMMY_FILE) == "ca5170c51e4d4e68d4c39832489ea9ad8e275c9f46e0c195c86aaf61ee2ce3d8"
Loading

0 comments on commit 4332554

Please sign in to comment.