Merge pull request #325 from genomic-medicine-sweden/324-improve-the-…
…pipeline-description

324 improve the pipeline description
mhkc authored Jun 20, 2024
2 parents 31d902a + d85ee67 commit 25c92eb
Showing 26 changed files with 649 additions and 210 deletions.
32 changes: 32 additions & 0 deletions .readthedocs.yaml
@@ -0,0 +1,32 @@
# .readthedocs.yaml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# Set the OS, Python version and other tools you might need
build:
  os: ubuntu-22.04
  tools:
    python: "3.12"
    # You can also specify other tool versions:
    # nodejs: "19"
    # rust: "1.64"
    # golang: "1.19"

# Build documentation in the "docs/" directory with Sphinx
sphinx:
  configuration: docs/source/conf.py

# Optionally build your docs in additional formats such as PDF and ePub
# formats:
#   - pdf
#   - epub

# Optional but recommended, declare the Python requirements required
# to build your documentation
# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
python:
  install:
    - requirements: docs/requirements.txt
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -15,22 +15,24 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Added bam and bai to bonsai input for `staphylococcus_aureus`, `escherichia_coli` & `klebsiella_pneumoniae`
- Added `bamDir` and `vcfDir` to config params
- Added running `bwa_mem` only when the profile is not `mycobacterium_tuberculosis`
- Automatically publish the pipeline documentation to Read the Docs.

### Fixed

- ShigaPass URL fixed
- Fixed qc channel regarding `mycobacterium_tuberculosis`
- Fixed bwa output file bug and stub
- Fixed getting some software versions

### Changed

- Fixed output format for tbprofiler
- Removed `samtools_sort_ref` from configs
- Changed `--symlink_dir` arg for prp
- Updated bonsai-prp to v0.9.2
- Updated the pipeline documentation.
- Removed sudo from make (deprecated)
- Updated NGP config for new hardware
- Refined readme instructions

## [0.7.0]

19 changes: 19 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,19 @@
# Contributing to Jasen

This is a guide on how to contribute to Jasen. You can contribute to this project by reporting bugs, helping to implement new types of analysis, or improving the pipeline documentation.

## Documentation changes

Propose changes to the documentation by opening a pull request.

Edit the documentation by editing the source files in the [docs](docs/) directory. The [README](docs/source/README.md) describes how to build and preview the documentation on your local machine.

## Bug reports

Submitting a bug report is one of the simplest and most useful ways to help the project.

Report a bug by creating a new issue (use the *New issue* button) on the [issues page](https://github.com/genomic-medicine-sweden/jasen/issues). A good bug report should include a description of the error (with the error message) and steps on how to reproduce the error.

## New feature, analysis, or tool

We welcome suggestions on new types of analysis, bacterial species or other features to add to the pipeline. Before contributing these, please create a feature proposal on the [issues page](https://github.com/genomic-medicine-sweden/jasen/issues) so it can be discussed.
223 changes: 16 additions & 207 deletions README.md
@@ -9,223 +9,32 @@ _Just Another System for Epityping NGS data_
>[!WARNING]
>**JASEN is in beta stage and the results are unverified. Until there is an official 1.0 release, there is no guarantee that the pipeline will execute, that the output format will stay consistent, or that the results are accurate.**
Jasen produces results for epidemiological and surveillance purposes.
Jasen has been developed for a small set of microbiota (primarily MRSA), but will likely work with any bacteria with a stable cgMLST scheme.
Jasen produces results for antibiotic resistance and virulence prediction and epidemiological typing for surveillance purposes. The pipeline is developed in collaboration with several Swedish hospitals. The development was funded by [Genomic Medicine Sweden](https://genomicmedicine.se/).

## Requirements

* [Singularity](https://docs.sylabs.io/guides/3.0/user-guide/installation.html#install-on-windows-or-mac)
* [JRE 8 - 21](https://www.java.com/en/download/manual.jsp)
* Nextflow (`curl -s https://get.nextflow.io | bash`)

### Recommended

* Conda
* Singularity Remote Login

## Usage

### Simple self-test

```
nextflow run main.nf -profile staphylococcus_aureus -config configs/nextflow.base.config --csv assets/test_data/samplelist.csv
```

#### Usage arguments

| Argument type | Options | Required |
| ------------- | -------------------------------------- | -------- |
| -profile | **staphylococcus_aureus**, escherichia_coli, klebsiella_pneumoniae, mycobacterium_tuberculosis| True |
| -config | **configs/nextflow.base.config**, configs/nextflow.dev.config, configs/nextflow.hopper.config, configs/nextflow.ngp.config| True |
| -entry | bacterial_default | True |
| --output | User specified directory | False |
| -resume | Not applicable | False |


### Input file format

```csv
id,platform,read1,read2
p1,illumina,assets/test_data/sequencing_data/saureus_10k/saureus_large_R1_001.fastq.gz,assets/test_data/sequencing_data/saureus_10k/saureus_large_R2_001.fastq.gz
```

### Update databases

#### Update MLST database

```
bash /path/to/jasen/assets/mlst_db/update_mlst_db.sh
```
The pipeline currently supports a small set of microbiota, and support for each is at a different stage of development. See the documentation for information on the supported analyses for each species and what each development status means.

| Species | Development status |
|------------------------------|--------------------|
| *Staphylococcus aureus*      | Draft              |
| *Escherichia coli* | Draft |
| *Mycobacterium tuberculosis* | Draft |

## Installation

### Copy code locally

```
git clone --recurse-submodules --single-branch --branch master https://github.com/genomic-medicine-sweden/jasen.git && cd jasen
```

### Create singularity images.

The main Makefile attempts to build and download the containers
(that is, when running `make install` in the main repo
folder).

```
cd container
make
```


### Download references and databases using singularity.

First, make sure you are in the `container` folder. Then run the `make` commands:

```
cd ..
make install
make check
```

Any errors produced during this step will hinder pipeline execution in
unexpected ways.

## Configuration

### Nextflow configuration
Source: `configs/nextflow.base.config`

* Edit the `root` parameter in `configs/nextflow.base.config`
* Edit the `krakenDb`, `workDir` and `outdir` parameters in `configs/nextflow.base.config`
* Edit the `runOptions` in `configs/nextflow.base.config` in order to mount directories to your run

When analysing Nanopore data:
* Edit the `ext.args` for Flye: specify genome size for the organism of interest with flag `--genome-size`
* Edit the `ext.seqmethod` for Flye depending on the input data
* Edit the `ext.args` for Medaka: specify the model with flag `-m`. Currently it is set to `r941_min_sup_g507`, but one should always set it based on how the data was produced. More about choosing the right model can be found [here](https://github.com/nanoporetech/medaka#models).

### Test data configuration
Source: `assets/test_data/samplelist.csv`

* Edit the read1 and read2 columns in `assets/test_data/samplelist.csv`

### Temporary directories configuration
Source: `~/.bashrc`

* Add the export line to `~/.bashrc`
* Change `SINGULARITY_TMPDIR` to `APPTAINER_TMPDIR` if you are using apptainer

```
export SINGULARITY_TMPDIR="/tmp" #or equivalent filepath to tmp dir
```

### Database configuration

#### Kraken database configuration
Choose between Kraken DB (64GB [Highly recommended]) or MiniKraken DB (8GB).
Or customize [your own](https://benlangmead.github.io/aws-indexes/k2).

##### Download standard Kraken database

```
wget -O /path/to/kraken_db/krakenstd.tar.gz https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20230314.tar.gz
tar -xf /path/to/kraken_db/krakenstd.tar.gz
```

##### (Alternatively) Download miniKraken database

```
wget -O /path/to/kraken_db/krakenmini.tar.gz https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20230314.tar.gz
tar -xf /path/to/kraken_db/krakenmini.tar.gz
```

#### Create TBProfiler database

##### Install jasentool

```
git clone git@github.com:ryanjameskennedy/jasentool.git && cd jasentool
pip install .
```

##### Create input csv that is used as tbdb input (composed of FoHM, WHO & tbdb variants)

```
jasentool converge --output_dir /path/to/jasen/assets/tbdb
```

##### Create tbdb (ensure tb-profiler is installed)

```
cd /path/to/jasen/assets/tbdb
tb-profiler create_db --prefix converged_who_fohm_tbdb
tb-profiler load_library converged_who_fohm_tbdb
```

##### Bgzip and index gms TBProfiler db

```
bgzip -c converged_who_fohm_tbdb.bed > /path/to/jasen/assets/tbprofiler_dbs/bed/converged_who_fohm_tbdb.bed.gz
tabix -p bed /path/to/jasen/assets/tbprofiler_dbs/bed/converged_who_fohm_tbdb.bed.gz
```


## Component Breakdown

### QC

* [Kraken2](https://ccb.jhu.edu/software/kraken2/): Species detection.
* [Bracken](https://ccb.jhu.edu/software/bracken/): Combined with Kraken2 for species detection.
* [bwa mem](https://github.com/lh3/bwa): Maps reads to cgMLST loci (demarcated by bed file) in order to estimate genome coverage. Low levels of intra-species contamination or erroneous mapping are removed using bwa and by filtering away the heterozygous mapped bases.
* [interquartile range](https://en.wikipedia.org/wiki/Interquartile_range): Calculates evenness of coverage.

### Assembly

* [SPAdes](http://cab.spbu.ru/software/spades/): De novo assembly for Ion Torrent.
* [SKESA](https://www.ridom.de/seqsphere/ug/v60/SKESA_Assembler.html): De novo assembly for Illumina.
* [QUAST](http://cab.spbu.ru/software/quast/): Extracts QC data (De novo assembly parameters) from the assembly.
* [Flye](https://github.com/fenderglass/Flye/tree/flye): De novo assembly for Oxford Nanopore Technologies (ONT).
* [Medaka](https://github.com/nanoporetech/medaka): Creates consensus sequences from ONT data.

### Epidemiological typing

* [chewBBACA](https://github.com/B-UMMI/chewBBACA/wiki): Calculates cgMLST of extracted alleles decided by schema. Number of missing loci is calculated and used as a QC parameter.
* [cgmlst.net](https://www.cgmlst.org/ncs/schema/141106/): The cgMLST reference schema.
* [mlst](https://github.com/tseemann/mlst): Calculates traditional 7-locus MLST.

#### Supported profiles:

* `staphylococcus_aureus`
* `escherichia_coli`

#### Future profiles that will be supported:

* `klebsiella_pneumoniae`
* `mycobacterium_tuberculosis`

### Virulence and resistance markers
See the documentation for installation instructions.

* [resfinder](https://bitbucket.org/genomicepidemiology/resfinder/src/master/): Detects antimicrobial resistance genes as well as environmental and chemical resistance genes.
* [pointfinder](https://bitbucket.org/genomicepidemiology/pointfinder/src/master/): Combines with resfinder to detect variants.
* [virulencefinder](https://bitbucket.org/genomicepidemiology/virulencefinder/src/master/): Detects virulence genes.
* [amrfinderplus](https://github.com/ncbi/amr/wiki/Running-AMRFinderPlus): Detects antimicrobial resistance genes as well as environmental, chemical resistance and virulence genes.
* [resfinder_db](https://bitbucket.org/genomicepidemiology/resfinder_db/src/master/): Resfinder database.
* [pointfinder_db](https://bitbucket.org/genomicepidemiology/pointfinder_db/src/master/): Pointfinder database.
* [virulencefinder_db](https://bitbucket.org/genomicepidemiology/virulencefinder_db/src/master/): Virulencefinder database.
### Tips

### Relatedness
* You can use [Bonsai](https://github.com/Clinical-Genomics-Lund/cgviz) to visualise jasen outputs.

* [sourmash](https://github.com/sourmash-bio/sourmash): Determine relatedness between samples.
## Documentation

## Report and visualisation
The documentation is available for the latest stable release.

* [Bonsai](https://github.com/Clinical-Genomics-Lund/cgviz): Visualises jasen outputs.
* [grapetree](https://github.com/achtman-lab/GrapeTree): Visualises phylogenetic relationships using cgMLST data.
## Contributing

## Frequent issues / Tips
Contributions to the pipeline are more than welcome. See the [CONTRIBUTING](CONTRIBUTING.md) file for details.

* Always run the latest versions of the bioinformatics software.
* Verify that you have execution permission for Jasen's `*.sif` images.
* Old Singularity versions may sporadically produce the error `FATAL: could not open image jasen/container/*.sif: image format not recognized!`
## License

Jasen is released under the GPLv3 license.
54 changes: 54 additions & 0 deletions bin/concat_sw_versions.py
@@ -0,0 +1,54 @@
#!/usr/bin/env python
"""Concatenate software versions."""

import click
import yaml
from yaml import Loader
import pandas as pd
from pathlib import Path


def get_versions(version_obj: dict[str, dict]) -> dict[str, dict]:
    """Return the software entries of one versions file, tagged with its workflow name."""
    workflow_name = list(version_obj.keys())[0].split(":")[-1]
    raw_softwares = list(version_obj.values())[0]
    # add workflow name to the list of all softwares
    softwares = {}
    for sw, version_info in raw_softwares.items():
        version_info["workflow"] = workflow_name
        softwares[sw] = version_info
        # keep the container only if it looks like a URL
        if "http" not in version_info["container"]:
            version_info["container"] = None
    return softwares


@click.command()
@click.option("-o", "--output", type=click.File("w"), help="Path to write output file to.")
@click.argument("version_files", nargs=-1)
def cli(output, version_files):
    """Concatenate the versions of softwares."""

    all_versions = {}
    for file in version_files:
        with open(file) as vfile:
            sw_version = get_versions(yaml.load(vfile, Loader=Loader))
            # combine new sw versions with existing sw versions
            all_versions = {**all_versions, **sw_version}

    # convert version dict to csv tables
    df = (
        pd.DataFrame.from_dict(all_versions, orient="index")
        .drop("workflow", axis=1)
        .fillna("-")
    )
    df.index.name = "software"
    df.reset_index(inplace=True)
    df.sort_values("software", inplace=True)
    df.columns = [col.capitalize() for col in df.columns]
    # export to csv
    df.to_csv(output, index=False)
    click.secho(f"Wrote output file: {output.name}", fg="green")


if __name__ == "__main__":
    cli()
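
Review note: the sketch below illustrates what `get_versions` appears to expect and return. The input layout (a single top-level key of the form `WORKFLOW:process` mapping software names to a dict with `version` and `container`) is inferred from the code, not confirmed by the pipeline; the key name, tool names, versions, and URL are all hypothetical. The function body is repeated so the example runs without `click`, `pandas`, or `yaml`.

```python
def get_versions(version_obj: dict[str, dict]) -> dict[str, dict]:
    """Same logic as bin/concat_sw_versions.py, repeated to keep the sketch self-contained."""
    # the workflow name is the last ":"-separated part of the first top-level key
    workflow_name = list(version_obj.keys())[0].split(":")[-1]
    raw_softwares = list(version_obj.values())[0]
    softwares = {}
    for sw, version_info in raw_softwares.items():
        version_info["workflow"] = workflow_name
        softwares[sw] = version_info
        # containers that are not URLs are blanked out
        if "http" not in version_info["container"]:
            version_info["container"] = None
    return softwares


# Hypothetical content of one parsed versions file (assumed layout).
example = {
    "CALL_BACTERIAL_BASE:bwa_mem": {
        "bwa": {"version": "0.7.17", "container": "https://example.org/bwa.sif"},
        "samtools": {"version": "1.17", "container": "local_image.sif"},
    }
}

result = get_versions(example)
for name, info in result.items():
    print(name, info)
```

Because `cli` merges files with `{**all_versions, **sw_version}`, a tool reported by several version files keeps only the entry from the last file that mentions it.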
20 changes: 20 additions & 0 deletions docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
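
Review note: as a rough illustration of the catch-all target, the sketch below builds the command line that `make <target>` would hand to Sphinx with the defaults from this Makefile. The helper `sphinx_command` is hypothetical and only assembles the argument list; it does not run Sphinx.

```python
import shlex


def sphinx_command(
    target: str,
    sphinxopts: str = "",
    o: str = "",
    sphinxbuild: str = "sphinx-build",
    sourcedir: str = "source",
    builddir: str = "build",
) -> list[str]:
    """Mirror the Makefile recipe: $(SPHINXBUILD) -M <target> SOURCEDIR BUILDDIR $(SPHINXOPTS) $(O)."""
    return [
        sphinxbuild,
        "-M",
        target,
        sourcedir,
        builddir,
        *shlex.split(sphinxopts),  # extra options, split the way a shell would
        *shlex.split(o),
    ]


# `make html` with the default variables
print(sphinx_command("html"))
# `make html O=-W` passes -W (treat warnings as errors) through to sphinx-build
print(sphinx_command("html", o="-W"))
```

Running `make` with no target hits the `help` rule first, which is the same command with `help` in place of the builder name.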