Merge pull request #325 from genomic-medicine-sweden/324-improve-the-…

…pipeline-description 324 improve the pipeline description
genomic-medicine-sweden · Jun 20, 2024 · 25c92eb · 25c92eb
2 parents 31d902a + d85ee67
commit 25c92eb
Showing 26 changed files with 649 additions and 210 deletions.
diff --git a/.readthedocs.yaml b/.readthedocs.yaml
@@ -0,0 +1,32 @@
+# .readthedocs.yaml
+# Read the Docs configuration file
+# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
+
+# Required
+version: 2
+
+# Set the OS, Python version and other tools you might need
+build:
+  os: ubuntu-22.04
+  tools:
+    python: "3.12"
+    # You can also specify other tool versions:
+    # nodejs: "19"
+    # rust: "1.64"
+    # golang: "1.19"
+
+# Build documentation in the "docs/" directory with Sphinx
+sphinx:
+  configuration: docs/source/conf.py
+
+# Optionally build your docs in additional formats such as PDF and ePub
+# formats:
+#    - pdf
+#    - epub
+
+# Optional but recommended, declare the Python requirements required
+# to build your documentation
+# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
+python:
+   install:
+   - requirements: docs/requirements.txt
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -15,22 +15,24 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Added bam and bai to bonsai input for `staphylococcus_aureus`, `escherichia_coli` & `klebsiella_pneumoniae`
 - Added `bamDir` and `vcfDir` to config params
 - Added running `bwa_mem` only when the profile is not `mycobacterium_tuberculosis`
+- Automatically publish the pipeline documentation to Read the Docs.
 
 ### Fixed
 
 - ShigaPass URL fixed
 - Fixed qc channel regarding `mycobacterium_tuberculosis`
 - Fixed bwa output file bug and stub
+- Fixed getting some software versions
 
 ### Changed
 
 - Fixed output format for tbprofiler
 - Removed `samtools_sort_ref` from configs
 - Changed `--symlink_dir` arg for prp
 - Updated bonsai-prp to v0.9.2
+- Updated the pipeline documentation.
 - Removed sudo from make (deprecated)
 - Updated NGP config for new hardware
-- Refined readme instructions
 
 ## [0.7.0]
 

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,19 @@
+# Contributing to Jasen
+
+This is a guide on how to contribute to Jasen. You can contribute to this project by reporting bugs, helping to implement new types of analysis, or improving the pipeline documentation.
+
+## Documentation changes
+
+Propose changes to the documentation by opening a pull request.
+
+Edit the documentation by editing the source files in the [docs](docs/) directory. The [README](docs/source/README.md) describes how to build and preview the documentation on your local machine.
+
+## Bug reports
+
+Submitting a bug report is one of the simplest and most useful ways to help the project.
+
+Report a bug by creating a new issue (use the *New issue* button) on the [issues page](https://github.com/genomic-medicine-sweden/jasen/issues). A good bug report should include a description of the error (with the error message) and steps on how to reproduce the error.
+
+## New feature, analysis, or tool
+
+We welcome suggestions on new types of analysis, bacterial species or other features to add to the pipeline. Before contributing these, please create a feature proposal on the [issues page](https://github.com/genomic-medicine-sweden/jasen/issues) so it can be discussed.
diff --git a/README.md b/README.md
@@ -9,223 +9,32 @@ _Just Another System for Epityping NGS data_
 >[!WARNING]
 >**JASEN is in beta stage and the results are unverified. There is no guarantee that the pipeline can execute, output format consistency, or that it produces accurate results until there is an official 1.0 release.**
 
-Jasen produces results for epidemiological and surveillance purposes.
-Jasen has been developed for a small set of microbiota (primarily MRSA), but will likely work with any bacteria with a stable cgMLST scheme.
+Jasen produces results for antibiotic resistance and virulence prediction and epidemiological typing for surveillance purposes. The pipeline is developed in collaboration with several Swedish hospitals. The development was funded by [Genomic Medicine Sweden](https://genomicmedicine.se/).
 
-## Requirements
-
-* [Singularity](https://docs.sylabs.io/guides/3.0/user-guide/installation.html#install-on-windows-or-mac)
-* [JRE 8 - 21](https://www.java.com/en/download/manual.jsp)
-* Nextflow (`curl -s https://get.nextflow.io | bash`)
-
-### Recommended
-
-* Conda
-* Singularity Remote Login
-
-## Usage
-
-### Simple self-test
-
-```
-nextflow run main.nf -profile staphylococcus_aureus -config configs/nextflow.base.config --csv assets/test_data/samplelist.csv
-```
-
-#### Usage arguments
-
-| Argument type | Options                                | Required |
-| ------------- | -------------------------------------- | -------- |
-| -profile      | **staphylococcus_aureus**, escherichia_coli, klebsiella_pneumoniae, mycobacterium_tuberculosis| True     |
-| -config       | **configs/nextflow.base.config**, configs/nextflow.dev.config, configs/nextflow.hopper.config, configs/nextflow.ngp.config| True     |
-| -entry        | bacterial_default                      | True     |
-| --output      | User specified directory                         | False    |
-| -resume       | Not applicable                                     | False    |
-
-
-### Input file format 
-
-```csv
-id,platform,read1,read2
-p1,illumina,assets/test_data/sequencing_data/saureus_10k/saureus_large_R1_001.fastq.gz,assets/test_data/sequencing_data/saureus_10k/saureus_large_R2_001.fastq.gz
-```
-
-### Update databases
-
-#### Update MLST database
-
-```
-bash /path/to/jasen/assets/mlst_db/update_mlst_db.sh
-```
+The pipeline currently supports a small set of microbiota, and support for the individual species is in different stages of development. See the documentation for information on the supported analyses for each species and what each development status means.
 
+| Species                      | Development status |
+|------------------------------|--------------------|
+| *Staphylococcus aureus*      | Draft              |
+| *Escherichia coli*           | Draft              |
+| *Mycobacterium tuberculosis* | Draft              |
 
 ## Installation
 
-### Copy code locally
-
-```
-git clone --recurse-submodules --single-branch --branch master  https://github.com/genomic-medicine-sweden/jasen.git && cd jasen
-```
-
-### Create singularity images. 
-
-The containers will be attempted to be built and downloaded as part of
-the main Makefile (that is, when running `make install` in the main repo
-folder).
-
-```
-cd container
-make
-```
-
-
-### Download references and databases using singularity. 
-
-First, make sure you stand in the `container` folder. Then run the `make` commands:
-
-```
-cd ..
-make install
-make check
-```
-
-Any errors produced during this step will hinder pipeline execution in
-unexpected ways.
-
-## Configuration
-
-### Nextflow configuration
-Source: `configs/nextflow.base.config`
-
-* Edit the `root` parameter in `configs/nextflow.base.config`
-* Edit the `krakenDb`, `workDir` and `outdir` parameters in `configs/nextflow.base.config`
-* Edit the `runOptions` in `configs/nextflow.base.config` in order to mount directories to your run
-
-When analysing Nanopore data:
-* Edit the `ext.args` for Flye: specify genome size for the organism of interest with flag `--genome-size`
-* Edit the `ext.seqmethod`for Flye depending on the input data
-* Edit the `ext.args` for Medaka: specify the model with flag `-m`. Currently it is set to `r941_min_sup_g507`, but one should always set it based on how the data was produced. More about choosing the right model can be found [here](https://github.com/nanoporetech/medaka#models).
-
-### Test data configuration
-Source: `assets/test_data/samplelist.csv`
-
-* Edit the read1 and read2 columns in `assets/test_data/samplelist.csv`
-
-### Temporary directories configuration
-Source: `~/.bashrc`
-
-* Add the export line to `~/.bashrc`
-* Change `SINGULARITY_TMPDIR` to `APPTAINER_TMPDIR` if you are using apptainer
-
-```
-export SINGULARITY_TMPDIR="/tmp" #or equivalent filepath to tmp dir
-```
-
-### Database configuration
-
-#### Kraken database configuration
-Choose between Kraken DB (64GB [Highly recommended]) or MiniKraken DB (8GB).
-Or customize [your own](https://benlangmead.github.io/aws-indexes/k2).
-
-##### Download standard Kraken database
-
-```
-wget -O /path/to/kraken_db/krakenstd.tar.gz https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20230314.tar.gz
-tar -xf /path/to/kraken_db/krakenstd.tar.gz
-```
-
-##### (Alternatively) Download miniKraken database
-
-```
-wget -O /path/to/kraken_db/krakenmini.tar.gz https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20230314.tar.gz
-tar -xf /path/to/kraken_db/krakenmini.tar.gz
-```
-
-#### Create TBProfiler database
-
-##### Install jasentool
-
-```
-git clone git@github.com:ryanjameskennedy/jasentool.git && cd jasentool
-pip install .
-```
-
-##### Create input csv that is used as tbdb input (composed of FoHM, WHO & tbdb variants)
-
-```
-jasentool converge --output_dir /path/to/jasen/assets/tbdb
-```
-
-##### Create tbdb (ensure tb-profiler is installed)
-
-```
-cd /path/to/jasen/assets/tbdb
-tb-profiler create_db --prefix converged_who_fohm_tbdb
-tb-profiler load_library converged_who_fohm_tbdb
-```
-
-##### Bgzip and index gms TBProfiler db
-
-```
-bgzip -c converged_who_fohm_tbdb.bed > /path/to/jasen/assets/tbprofiler_dbs/bed/converged_who_fohm_tbdb.bed.gz
-tabix -p bed /path/to/jasen/assets/tbprofiler_dbs/bed/converged_who_fohm_tbdb.bed.gz
-```
-
-
-## Component Breakdown
-
-### QC
-
-* [Kraken2](https://ccb.jhu.edu/software/kraken2/): Species detection.
-* [Bracken](https://ccb.jhu.edu/software/bracken/): Combined with Kraken2 for species detection.
-* [bwa mem](https://github.com/lh3/bwa): Maps reads to cgMLST loci (demarcated by bed file) in order to estimate genome coverage. Low levels of Intra-species contamination or erroneous mapping is removed using bwa and filtering away the heterozygous mapped bases.
-* [interquartile range](https://en.wikipedia.org/wiki/Interquartile_range): Calculates evenness of coverage.
-
-### Assembly
-
-* [SPAdes](http://cab.spbu.ru/software/spades/): De novo assembly for Ion Torrent.
-* [SKESA](https://www.ridom.de/seqsphere/ug/v60/SKESA_Assembler.html): De novo assembly for Illumina.
-* [QUAST](http://cab.spbu.ru/software/quast/): Extracts QC data (De novo assembly parameters) from the assembly.
-* [Flye](https://github.com/fenderglass/Flye/tree/flye): De novo assembly for Oxford Nanopore Technologies (ONT).
-* [Medaka](https://github.com/nanoporetech/medaka): Creates consensus sequences from ONT data.
-
-### Epidemiological typing
-
-* [chewBBACA](https://github.com/B-UMMI/chewBBACA/wiki): Calculates cgMLST of extracted alleles decided by schema. Number of missing loci is calculated and used as a QC parameter.
-* [cgmlst.net](https://www.cgmlst.org/ncs/schema/141106/): The cgMLST reference schema.
-* [mlst](https://github.com/tseemann/mlst): Caculates traditional 7-locus MLST.
-
-#### Supported profiles:
-
-* `staphylococcus_aureus`
-* `escherichia_coli`
-
-#### Future profiles that will be supported:
-
-* `klebsiella_pneumoniae`
-* `mycobacterium_tuberculosis`
-
-### Virulence and resistance markers
+See the documentation for installation instructions.
 
-* [resfinder](https://bitbucket.org/genomicepidemiology/resfinder/src/master/): Detects antimicrobial resistance genes as well as environmental and chemical resistance genes.
-* [pointfinder](https://bitbucket.org/genomicepidemiology/pointfinder/src/master/): Combines with resfinder to detect variants.
-* [virulencefinder](https://bitbucket.org/genomicepidemiology/virulencefinder/src/master/): Detects virulence genes.
-* [amrfinderplus](https://github.com/ncbi/amr/wiki/Running-AMRFinderPlus): Detects antimicrobial resistance genes as well as environmental, chemical resistance and virulence genes.
-* [resfinder_db](https://bitbucket.org/genomicepidemiology/resfinder_db/src/master/): Resfinder database.
-* [pointfinder_db](https://bitbucket.org/genomicepidemiology/pointfinder_db/src/master/): Pointfinder database.
-* [virulencefinder_db](https://bitbucket.org/genomicepidemiology/virulencefinder_db/src/master/): Virulencefinder database.
+### Tips
 
-### Relatedness
+* You can use [Bonsai](https://github.com/Clinical-Genomics-Lund/cgviz) to visualise jasen outputs.
 
-* [sourmash](https://github.com/sourmash-bio/sourmash): Determine relatedness between samples.
+## Documentation
 
-## Report and visualisation
+The documentation is available for the latest stable release.
 
-* [Bonsai](https://github.com/Clinical-Genomics-Lund/cgviz): Visualises jasen outputs.
-* [graptetree](https://github.com/achtman-lab/GrapeTree): Visualise phylogenetic relationship using cgmlst data.
+## Contributing
 
-## Frequent issues / Tips
+Contributions to the pipeline are more than welcome. See the [CONTRIBUTING](CONTRIBUTING.md) file for details.
 
-* Always run the latest versions of the bioinformatical software.
-* Verify you have execution permission for jasens `*.sif` images.
-* Old Singularity versions may sporadically produce the error `FATAL: could not open image jasen/container/*.sif: image format not recognized!`
+## License
 
+Jasen is released under the GPLv3 license.
diff --git a/bin/concat_sw_versions.py b/bin/concat_sw_versions.py
@@ -0,0 +1,54 @@
+#!/usr/bin/env python
+"""Concatenate software versions."""
+
+import click
+import yaml
+from yaml import Loader
+import pandas as pd
+from pathlib import Path
+
+
+def get_versions(version_obj: dict[str, dict]) -> dict[str, dict]:
+    """Flatten one parsed versions file into a mapping of software name to version info."""
+    workflow_name = list(version_obj.keys())[0].split(":")[-1]
+    raw_softwares = list(version_obj.values())[0]
+    # add workflow name to the list of all softwares
+    softwares = {}
+    for sw, version_info in raw_softwares.items():
+        version_info["workflow"] = workflow_name
+        softwares[sw] = version_info
+        # keep the container only if it looks like a URL; otherwise unset it
+        if "http" not in version_info["container"]:
+            version_info["container"] = None
+    return softwares
+
+
+@click.command()
+@click.option("-o", "--output", type=click.File("w"), help="Path to write output file to.")
+@click.argument("version_files", nargs=-1)
+def cli(output, version_files):
+    """Concatenate the versions of softwares."""
+
+    all_versions = {}
+    for file in version_files:
+        with open(file) as vfile:
+            sw_version = get_versions(yaml.load(vfile, Loader=Loader))
+            # combine new sw versions with existing sw versions
+            all_versions = {**all_versions, **sw_version}
+
+    # convert version dict to csv tables
+    df = (pd.DataFrame
+        .from_dict(all_versions, orient="index")
+        .drop("workflow", axis=1)
+        .fillna("-")
+    )
+    df.index.name = "software"
+    df.reset_index(inplace=True)
+    df.sort_values("software", inplace=True)
+    df.columns = [col.capitalize() for col in df.columns]
+    # export to csv
+    df.to_csv(output, index=False)
+    click.secho(f"Wrote output file: {output.name}", fg="green")
+
+
+if __name__ == "__main__":
+    cli()
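
To illustrate what `get_versions` in `bin/concat_sw_versions.py` does to one parsed versions file, here is a minimal sketch without the `click`, `yaml`, and `pandas` dependencies. The input shape, the workflow key `JASEN:assembly`, and the names `tool_a` and `tool_b` are assumptions inferred from how the function indexes its argument; they are not taken from real pipeline output.

```python
# Sketch of the flattening step in bin/concat_sw_versions.py.
# The example input below is hypothetical; only its shape is inferred
# from how get_versions() reads its argument.
def get_versions(version_obj):
    """Flatten one parsed versions file into {software: info}."""
    # the workflow name is the last ":"-separated part of the top-level key
    workflow_name = list(version_obj.keys())[0].split(":")[-1]
    raw_softwares = list(version_obj.values())[0]
    softwares = {}
    for sw, version_info in raw_softwares.items():
        version_info["workflow"] = workflow_name
        # keep the container only when it looks like a URL
        if "http" not in version_info["container"]:
            version_info["container"] = None
        softwares[sw] = version_info
    return softwares


example = {
    "JASEN:assembly": {
        "tool_a": {"version": "1.0", "container": "https://example.org/tool_a.sif"},
        "tool_b": {"version": "2.3", "container": "local_image"},
    }
}
result = get_versions(example)
print(result["tool_a"]["workflow"])
print(result["tool_b"]["container"])
```

In the script itself, the per-file results are merged with `{**all_versions, **sw_version}`, so a software name that appears in several files keeps the entry from the file processed last.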
diff --git a/docs/Makefile b/docs/Makefile
@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+SPHINXOPTS    ?=
+SPHINXBUILD   ?= sphinx-build
+SOURCEDIR     = source
+BUILDDIR      = build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)