Initial commit

scbirlab · Sep 19, 2024 · 857d886 · 857d886
commit 857d886
Show file tree

Hide file tree

Showing 10 changed files with 1,175 additions and 0 deletions.
diff --git a/.github/workflows/nf-test.yml b/.github/workflows/nf-test.yml
@@ -0,0 +1,33 @@
+
+name: Nextflow test
+
+on: [push]
+
+jobs:
+  test:
+
+    runs-on: ubuntu-latest
+    defaults:
+      run:
+        shell: bash -el {0}
+
+    steps:
+    - uses: easimon/maximize-build-space@master
+      with:
+        root-reserve-mb: 512
+        swap-size-mb: 1024
+        remove-dotnet: 'true'
+        remove-android: 'true'
+        remove-haskell: 'true'
+        remove-codeql: 'true'
+        remove-docker-images: 'true'
+    - uses: actions/checkout@v3
+    - uses: conda-incubator/setup-miniconda@v2
+      with:
+        python-version: 3.11
+        channels: conda-forge,bioconda
+        channel-priority: true
+    - uses: nf-core/setup-nextflow@v1
+    - name: Test Nextflow pipeline
+      run: |
+          nextflow run ${GITHUB_WORKSPACE} -c test/nextflow.config -profile gh -stub
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,8 @@
+.nextflow*
+/outputs/
+/work/
+/test/outputs
+/test/work
+/test/*.html
+~$*.xlsx
+*.html
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) [year] [fullname]
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,194 @@
+# Nanopore variant calling pipeline
+
+![GitHub Workflow Status (with branch)](https://img.shields.io/github/actions/workflow/status/scbirlab/nf-ggi/nf-test.yml)
+[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A523.10.0-23aa62.svg)](https://www.nextflow.io/)
+[![run with conda](https://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
+
+**scbirlab/nf-ont-call-variants** is a Nextflow pipeline to call variants from Nanopore FASTQ files from bacterial clones relative to a wildtype control.
+
+The pipeline broadly recapitualtes, where possible, the GATK best practices for germline short variant calling. 
+
+**Table of contents**
+
+- [Processing steps](#processing-steps)
+- [Requirements](#requirements)
+- [Quick start](#quick-start)
+- [Inputs](#inputs)
+- [Outputs](#outputs)
+- [Issues, problems, suggestions](#issues-problems-suggestions)
+- [Further help](#further-help)
+
+## Processing steps
+
+For each sample:
+
+1. Quality Trim reads using `cutadapt`. 
+2. Map to genome FASTA using `minimap2`.
+3. Mark duplicates with `picard MarkDuplicates`.
+4. Call variants with `Clair3`.
+
+Then merge resulting GVCFs using GATK `CombineGVCFs`. With the combined variant calls:
+
+1. Joint genoptype with GATK `GenotypeGVCFs`.
+2. Filter variants using GATK `VariantFiltration`.
+3. Annotate variant effects using `snpEff`.
+4. Filter out variants where all samples are identical to the wildtype control, which [is assumed to be the `sample_id` which is alphabetically last](#sample-sheet).
+5. Write to output TSV.
+
+### Other steps
+
+1. Get FASTQ quality metrics with `fastqc`.
+2. Generate alignment statistics and plots with `samtools stats`.
+2. Map to genome FASTA using `bowtie2` because `minimap2` logs are not compatible with `multiqc`. This way, some kind of alignment metrics are possible.
+3. Compile the logs of processing steps into an HTML report with `multiqc`.
+
+## Requirements
+
+### Software
+
+You need to have Nextflow and `conda` installed on your system.
+
+#### First time using Nextflow?
+
+If you're at the Crick or your shared cluster has it already installed, try:
+
+```bash
+module load Nextflow
+```
+
+Otherwise, if it's your first time using Nextflow on your system, you can install it using `conda`:
+
+```bash
+conda install -c bioconda nextflow 
+```
+
+You may need to set the `NXF_HOME` environment variable. For example,
+
+```bash
+mkdir -p ~/.nextflow
+export NXF_HOME=~/.nextflow
+```
+
+To make this a permanent change, you can do something like the following:
+
+```bash
+mkdir -p ~/.nextflow
+echo "export NXF_HOME=~/.nextflow" >> ~/.bash_profile
+source ~/.bash_profile
+```
+
+### Reference genome
+
+You also need the genome FASTA for the bacteria you are sequencing. These can be obtained from [NCBI Nucleotide](https://www.ncbi.nlm.nih.gov/nuccore/):
+
+1. Search for your strain of interest, and open its main page
+2. On the right-hand side, click `Customize view`, then `Customize` and check `Show sequence`. Finally, click `Update view`. You may have to wait a few minute while the sequence downloads.
+3. Click `Send to: > Complete record > File > FASTA > Create file`
+4. Save the files to a path which you provide as `--genome_fasta` [below](#inputs).
+
+## Quick start
+
+Make a [sample sheet (see below)](#sample-sheet) and, optionally, a [`nextflow.config` file](#inputs) in the directory where you want the pipeline to run. Then run Nextflow.
+
+```bash 
+nextflow run scbirlab/nf-ont-call-variants
+```
+
+Each time you run the pipeline after the first time, Nextflow will use a locally-cached version which will not be automatically updated. If you want to ensure that you're using the very latest version of the pipeline, use the `-latest` flag.
+
+```bash 
+nextflow run scbirlab/nf-ont-call-variants -latest
+```
+If you want to run a particular tagged version of the pipeline, such as `v0.0.1`, you can do so using
+
+```bash 
+nextflow run scbirlab/nf-ont-call-variants -r v0.0.2
+```
+
+For help, use `nextflow run scbirlab/nf-ont-call-variants --help`.
+
+The first time you run the pipeline on your system, the software dependencies in `environment.yml` will be installed. This may take several minutes.
+
+## Inputs
+
+The following parameters are required:
+
+- `sample_sheet`: path to a CSV with information about the samples and FASTQ files to be processed
+- `genome_fasta`: path to reference genome FASTA
+- `snpeff_database`: name of snpEff database to use for annotation. This should be derived from the same assembly as `genome_fasta`. Database names often end in the assembly name, such as `gca_000015005`, which you can check matches that of your `genome_fasta`
+
+The following parameters have default values which can be overridden if necessary.
+
+- `inputs = "inputs"` : The folder containing your inputs.
+- `trim_qual = 10` : For `cutadapt`, the minimum Phred score for trimming 3' calls
+- `min_length = 10` : For `cutadapt`, the minimum trimmed length of a read. Shorter reads will be discarded
+- `gatk_image = "docker://broadinstitute/gatk:latest"` : Which GATK4 image to use
+- `snpeff_url = "https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip"` : Where to download snpEff from
+- `clair3_image = "docker://hkubal/clair3:latest"` : Which Clair3 image to use
+
+The parameters can be provided either in the `nextflow.config` file or on the `nextflow run` command.
+
+Here is an example of the `nextflow.config` file:
+
+```nextflow
+params {
+    sample_sheet = "/path/to/sample-sheet.csv"
+    inputs = "/path/to/inputs"
+    genome_fasta = "/path/to/1_ASM28329v1_genomic.fna"
+    snpeff_database = "Mycobacterium_smegmatis_str_mc2_155_gca_000283295"
+
+}
+```
+
+Alternatively, you can provide the parameters on the command line:
+
+```bash
+nextflow run scbirlab/nf-ont-call-variants \
+    --sample_sheet /path/to/sample-sheet.csv \
+    --inputs /path/to/inputs
+    --genome_fasta /path/to/1_ASM28329v1_genomic.fna \
+    --snpeff_database Mycobacterium_smegmatis_str_mc2_155_gca_000283295
+``` 
+
+### Sample sheet
+
+The sample sheet is a CSV file providing information about which FASTQ files belong to which sample.
+
+The file must have a header with the column names below, and one line per sample to be processed.
+
+- `sample_id`: the unique name of the sample. The wildtype must be **named so that it is alphabetically last**
+- `reads`: path (relative to `inputs` option above) to compressed FASTQ files derived from Nanopore sequencing
+
+Here is an example of the sample sheet:
+
+| sample_id | reads                |
+| --------- | -------------------- | 
+| wt        | raw_reads.fastq.gz   | 
+| mut1      | raw_reads.fastq.gz   | 
+
+## Outputs
+
+Outputs are saved in the same directory as `sample_sheet`. They are organised under three directories:
+
+- `processed`: FASTQ files and logs resulting from alignments
+- `tables`: tables, plots, and VCF files corresponding to variant calls
+- `multiqc`: HTML report on processing steps
+
+## Issues, problems, suggestions
+
+Add to the [issue tracker](https://www.github.com/scbirlab/nf-ont-call-variants/issues).
+
+## Further help
+
+Here are the help pages of the software used by this pipeline.
+
+- [fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
+- [multiqc](https://multiqc.info/)
+- [nextflow](https://www.nextflow.io/docs/latest/index.html)
+- [cutadapt](https://cutadapt.readthedocs.io/en/stable/index.html)
+- [minimap2](https://lh3.github.io/minimap2/minimap2.html)
+- [bowtie2](https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml)
+- [samtools](http://www.htslib.org/doc/samtools.html)
+- [GATK](https://gatk.broadinstitute.org/hc/en-us)
+- [Picard](https://broadinstitute.github.io/picard/)
+- [snpEff](https://pcingola.github.io/SnpEff/)
diff --git a/bin/metabolism/connect-metabolites.py b/bin/metabolism/connect-metabolites.py
@@ -0,0 +1,121 @@
+"""Use RDKit to clean reactants and products, and identify shared and similar metabolites between pairs."""
+
+from typing import Iterable, List, Optional, Union
+from functools import partial
+from itertools import product
+import sys
+
+from carabiner import print_err
+from carabiner.cast import cast
+import pandas as pd
+import numpy as np
+from rdkit import RDLogger    
+from rdkit.Chem import AddHs, rdFingerprintGenerator, MolFromSmiles, MolToInchi, MolToSmiles
+from rdkit.Chem.SaltRemover import SaltRemover
+from rdkit.DataStructs import TanimotoSimilarity
+from tqdm.auto import tqdm
+
+RDLogger.DisableLog('rdApp.*')     
+
+removers = (
+    SaltRemover(defnData="[H,Na,K,Mg,Ca,Mn,Fe,F,Cl,Br,O,S]").StripMol,
+    partial(SaltRemover().StripMol, dontRemoveEverything=True),
+    AddHs,
+)
+
+def _clean_smiles_list(smiles=str) -> str:
+    mol = MolFromSmiles(smiles)
+    for remover in removers:
+        mol = remover(mol)
+    return MolToSmiles(mol)
+
+
+def clean_metabolites(table: pd.DataFrame, 
+                      cols=Iterable[str]) -> pd.DataFrame:
+    return table.assign(**{col: table[col].apply(_clean_smiles_list) for col in cols})
+
+
+def _shared(a_b):
+    a, b = (map(MolFromSmiles, a_or_b) for a_or_b in a_b)
+    b = set(map(MolToInchi, b))
+    return any(MolToInchi(item) in b for item in a)
+
+
+def _have_shared_chemical(df: pd.DataFrame, a: str, b: str):
+    return df[[a, b]].progress_apply(_shared, axis=1)
+
+mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
+
+def _fingerprinter(a):
+    return mfpgen.GetFingerprint(MolFromSmiles(a))
+
+def _tanimoto(a_b):
+    a, b = a_b
+    return max(TanimotoSimilarity(*map(_fingerprinter, items)) for items in product(*a_b))
+
+def _max_similarity(df: pd.DataFrame, a: str, b: str):
+    return df[[a, b]].progress_apply(_tanimoto, axis=1)
+
+
+def _smiles_splitter(df: pd.DataFrame, col: str, char: str = ".") -> List[str]:
+    return df[col].str.split(char)
+
+
+def connect_metabolites(table: pd.DataFrame,
+                        cols: Iterable[str],
+                        id_col: Union[str, Iterable[str]],
+                        table2: Optional[pd.DataFrame] = None) -> pd.DataFrame:
+
+    if table2 is None:
+        table2 = table.copy()
+    if isinstance(id_col, str):
+        id_col = cast(id_col, to=list)
+    cols = cast(cols, to=list)
+
+    all_cols = id_col + cols
+    all_cols2 = [f"{col}_2" for col in all_cols]
+    cols2 = [f"{col}_2" for col in cols]
+
+    table = table[all_cols].assign(**{f"{col}_split": partial(_smiles_splitter, col=col) for col in cols})
+    table2 = (table2[all_cols]
+              .rename(columns=dict(zip(all_cols, all_cols2)))
+              .assign(**{f"{col}_split": partial(_smiles_splitter, col=col) for col in cols2}))
+    table = table.merge(table2, how='cross').query(f"{id_col[0]} < {id_col[0]}_2")
+
+    index_labels = tuple("_".join(items) for items in product(["r1", "p1"], ["r2", "p2"]))
+    print_err(f"By shared chemicals...")
+    table = table.assign(**{f"connection_{index_labels[i]}": partial(_have_shared_chemical, a=f"{a}_split", b=f"{b}_split") 
+                            for i, (a, b) in enumerate(product(cols, cols2))})
+    print_err(f"By maximum similarity...")
+    table = table.assign(**{f"max_similarity_{index_labels[i]}": partial(_max_similarity, a=f"{a}_split", b=f"{b}_split") 
+                            for i, (a, b) in enumerate(product(cols, cols2))})
+
+    return table[[col for col in table if not col.endswith("_split")]]
+
+
+def main() -> None:
+
+    tqdm.pandas()
+
+    print_err(f"Reading table from STDIN:")
+    table = pd.read_csv(sys.stdin, sep='\t').assign(_id=lambda x: x["rhea_reaction_id"].astype(str).str.cat(x["uniprot_id"], sep=":"))
+    print_err(table)
+    id_table = table[["_id", "rhea_reaction_id", "uniprot_id"]].copy()
+
+    print_err(f"Cleaning metabolites...")
+    table = clean_metabolites(table, cols=("reactants", "products"))
+    print_err(f"Connecting metabolites...")
+    table = (connect_metabolites(table, cols=("reactants", "products"), id_col=["_id"])
+             .merge(id_table, how='left')
+             .merge(id_table.rename(columns={col: f"{col}_2" for col in id_table}), how='left'))
+    cols_to_rename = ("reactants", "products", "uniprot_id", "rhea_reaction_iduniprot_id")
+    table = (table
+             .drop(columns=["_id", "_id_2"])
+             .rename(columns=dict(zip(cols_to_rename, (f"{col}_1" for col in cols_to_rename)))))
+    table[reversed(table.columns)].to_csv(sys.stdout, sep='\t', index=False)
+
+    return None
+
+
+if __name__ == '__main__':
+    main()