-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
0 parents
commit 857d886
Showing
10 changed files
with
1,175 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
|
||
name: Nextflow test | ||
|
||
on: [push] | ||
|
||
jobs: | ||
test: | ||
|
||
runs-on: ubuntu-latest | ||
defaults: | ||
run: | ||
shell: bash -el {0} | ||
|
||
steps: | ||
- uses: easimon/maximize-build-space@master | ||
with: | ||
root-reserve-mb: 512 | ||
swap-size-mb: 1024 | ||
remove-dotnet: 'true' | ||
remove-android: 'true' | ||
remove-haskell: 'true' | ||
remove-codeql: 'true' | ||
remove-docker-images: 'true' | ||
- uses: actions/checkout@v3 | ||
- uses: conda-incubator/setup-miniconda@v2 | ||
with: | ||
python-version: 3.11 | ||
channels: conda-forge,bioconda | ||
channel-priority: true | ||
- uses: nf-core/setup-nextflow@v1 | ||
- name: Test Nextflow pipeline | ||
run: | | ||
nextflow run ${GITHUB_WORKSPACE} -c test/nextflow.config -profile gh -stub |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
.nextflow* | ||
/outputs/ | ||
/work/ | ||
/test/outputs | ||
/test/work | ||
/test/*.html | ||
~$*.xlsx | ||
*.html |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
MIT License | ||
|
||
Copyright (c) [year] [fullname] | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in all | ||
copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
SOFTWARE. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,194 @@ | ||
# Nanopore variant calling pipeline | ||
|
||
![GitHub Workflow Status (with branch)](https://img.shields.io/github/actions/workflow/status/scbirlab/nf-ggi/nf-test.yml) | ||
[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A523.10.0-23aa62.svg)](https://www.nextflow.io/) | ||
[![run with conda](https://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/) | ||
|
||
**scbirlab/nf-ont-call-variants** is a Nextflow pipeline to call variants from Nanopore FASTQ files from bacterial clones relative to a wildtype control. | ||
|
||
The pipeline broadly recapitualtes, where possible, the GATK best practices for germline short variant calling. | ||
|
||
**Table of contents** | ||
|
||
- [Processing steps](#processing-steps) | ||
- [Requirements](#requirements) | ||
- [Quick start](#quick-start) | ||
- [Inputs](#inputs) | ||
- [Outputs](#outputs) | ||
- [Issues, problems, suggestions](#issues-problems-suggestions) | ||
- [Further help](#further-help) | ||
|
||
## Processing steps | ||
|
||
For each sample: | ||
|
||
1. Quality Trim reads using `cutadapt`. | ||
2. Map to genome FASTA using `minimap2`. | ||
3. Mark duplicates with `picard MarkDuplicates`. | ||
4. Call variants with `Clair3`. | ||
|
||
Then merge resulting GVCFs using GATK `CombineGVCFs`. With the combined variant calls: | ||
|
||
1. Joint genoptype with GATK `GenotypeGVCFs`. | ||
2. Filter variants using GATK `VariantFiltration`. | ||
3. Annotate variant effects using `snpEff`. | ||
4. Filter out variants where all samples are identical to the wildtype control, which [is assumed to be the `sample_id` which is alphabetically last](#sample-sheet). | ||
5. Write to output TSV. | ||
|
||
### Other steps | ||
|
||
1. Get FASTQ quality metrics with `fastqc`. | ||
2. Generate alignment statistics and plots with `samtools stats`. | ||
2. Map to genome FASTA using `bowtie2` because `minimap2` logs are not compatible with `multiqc`. This way, some kind of alignment metrics are possible. | ||
3. Compile the logs of processing steps into an HTML report with `multiqc`. | ||
|
||
## Requirements | ||
|
||
### Software | ||
|
||
You need to have Nextflow and `conda` installed on your system. | ||
|
||
#### First time using Nextflow? | ||
|
||
If you're at the Crick or your shared cluster has it already installed, try: | ||
|
||
```bash | ||
module load Nextflow | ||
``` | ||
|
||
Otherwise, if it's your first time using Nextflow on your system, you can install it using `conda`: | ||
|
||
```bash | ||
conda install -c bioconda nextflow | ||
``` | ||
|
||
You may need to set the `NXF_HOME` environment variable. For example, | ||
|
||
```bash | ||
mkdir -p ~/.nextflow | ||
export NXF_HOME=~/.nextflow | ||
``` | ||
|
||
To make this a permanent change, you can do something like the following: | ||
|
||
```bash | ||
mkdir -p ~/.nextflow | ||
echo "export NXF_HOME=~/.nextflow" >> ~/.bash_profile | ||
source ~/.bash_profile | ||
``` | ||
|
||
### Reference genome | ||
|
||
You also need the genome FASTA for the bacteria you are sequencing. These can be obtained from [NCBI Nucleotide](https://www.ncbi.nlm.nih.gov/nuccore/): | ||
|
||
1. Search for your strain of interest, and open its main page | ||
2. On the right-hand side, click `Customize view`, then `Customize` and check `Show sequence`. Finally, click `Update view`. You may have to wait a few minute while the sequence downloads. | ||
3. Click `Send to: > Complete record > File > FASTA > Create file` | ||
4. Save the files to a path which you provide as `--genome_fasta` [below](#inputs). | ||
|
||
## Quick start | ||
|
||
Make a [sample sheet (see below)](#sample-sheet) and, optionally, a [`nextflow.config` file](#inputs) in the directory where you want the pipeline to run. Then run Nextflow. | ||
|
||
```bash | ||
nextflow run scbirlab/nf-ont-call-variants | ||
``` | ||
|
||
Each time you run the pipeline after the first time, Nextflow will use a locally-cached version which will not be automatically updated. If you want to ensure that you're using the very latest version of the pipeline, use the `-latest` flag. | ||
|
||
```bash | ||
nextflow run scbirlab/nf-ont-call-variants -latest | ||
``` | ||
If you want to run a particular tagged version of the pipeline, such as `v0.0.1`, you can do so using | ||
|
||
```bash | ||
nextflow run scbirlab/nf-ont-call-variants -r v0.0.2 | ||
``` | ||
|
||
For help, use `nextflow run scbirlab/nf-ont-call-variants --help`. | ||
|
||
The first time you run the pipeline on your system, the software dependencies in `environment.yml` will be installed. This may take several minutes. | ||
|
||
## Inputs | ||
|
||
The following parameters are required: | ||
|
||
- `sample_sheet`: path to a CSV with information about the samples and FASTQ files to be processed | ||
- `genome_fasta`: path to reference genome FASTA | ||
- `snpeff_database`: name of snpEff database to use for annotation. This should be derived from the same assembly as `genome_fasta`. Database names often end in the assembly name, such as `gca_000015005`, which you can check matches that of your `genome_fasta` | ||
|
||
The following parameters have default values which can be overridden if necessary. | ||
|
||
- `inputs = "inputs"` : The folder containing your inputs. | ||
- `trim_qual = 10` : For `cutadapt`, the minimum Phred score for trimming 3' calls | ||
- `min_length = 10` : For `cutadapt`, the minimum trimmed length of a read. Shorter reads will be discarded | ||
- `gatk_image = "docker://broadinstitute/gatk:latest"` : Which GATK4 image to use | ||
- `snpeff_url = "https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip"` : Where to download snpEff from | ||
- `clair3_image = "docker://hkubal/clair3:latest"` : Which Clair3 image to use | ||
|
||
The parameters can be provided either in the `nextflow.config` file or on the `nextflow run` command. | ||
|
||
Here is an example of the `nextflow.config` file: | ||
|
||
```nextflow | ||
params { | ||
sample_sheet = "/path/to/sample-sheet.csv" | ||
inputs = "/path/to/inputs" | ||
genome_fasta = "/path/to/1_ASM28329v1_genomic.fna" | ||
snpeff_database = "Mycobacterium_smegmatis_str_mc2_155_gca_000283295" | ||
} | ||
``` | ||
|
||
Alternatively, you can provide the parameters on the command line: | ||
|
||
```bash | ||
nextflow run scbirlab/nf-ont-call-variants \ | ||
--sample_sheet /path/to/sample-sheet.csv \ | ||
--inputs /path/to/inputs | ||
--genome_fasta /path/to/1_ASM28329v1_genomic.fna \ | ||
--snpeff_database Mycobacterium_smegmatis_str_mc2_155_gca_000283295 | ||
``` | ||
|
||
### Sample sheet | ||
|
||
The sample sheet is a CSV file providing information about which FASTQ files belong to which sample. | ||
|
||
The file must have a header with the column names below, and one line per sample to be processed. | ||
|
||
- `sample_id`: the unique name of the sample. The wildtype must be **named so that it is alphabetically last** | ||
- `reads`: path (relative to `inputs` option above) to compressed FASTQ files derived from Nanopore sequencing | ||
|
||
Here is an example of the sample sheet: | ||
|
||
| sample_id | reads | | ||
| --------- | -------------------- | | ||
| wt | raw_reads.fastq.gz | | ||
| mut1 | raw_reads.fastq.gz | | ||
|
||
## Outputs | ||
|
||
Outputs are saved in the same directory as `sample_sheet`. They are organised under three directories: | ||
|
||
- `processed`: FASTQ files and logs resulting from alignments | ||
- `tables`: tables, plots, and VCF files corresponding to variant calls | ||
- `multiqc`: HTML report on processing steps | ||
|
||
## Issues, problems, suggestions | ||
|
||
Add to the [issue tracker](https://www.github.com/scbirlab/nf-ont-call-variants/issues). | ||
|
||
## Further help | ||
|
||
Here are the help pages of the software used by this pipeline. | ||
|
||
- [fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) | ||
- [multiqc](https://multiqc.info/) | ||
- [nextflow](https://www.nextflow.io/docs/latest/index.html) | ||
- [cutadapt](https://cutadapt.readthedocs.io/en/stable/index.html) | ||
- [minimap2](https://lh3.github.io/minimap2/minimap2.html) | ||
- [bowtie2](https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml) | ||
- [samtools](http://www.htslib.org/doc/samtools.html) | ||
- [GATK](https://gatk.broadinstitute.org/hc/en-us) | ||
- [Picard](https://broadinstitute.github.io/picard/) | ||
- [snpEff](https://pcingola.github.io/SnpEff/) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,121 @@ | ||
"""Use RDKit to clean reactants and products, and identify shared and similar metabolites between pairs.""" | ||
|
||
from typing import Iterable, List, Optional, Union | ||
from functools import partial | ||
from itertools import product | ||
import sys | ||
|
||
from carabiner import print_err | ||
from carabiner.cast import cast | ||
import pandas as pd | ||
import numpy as np | ||
from rdkit import RDLogger | ||
from rdkit.Chem import AddHs, rdFingerprintGenerator, MolFromSmiles, MolToInchi, MolToSmiles | ||
from rdkit.Chem.SaltRemover import SaltRemover | ||
from rdkit.DataStructs import TanimotoSimilarity | ||
from tqdm.auto import tqdm | ||
|
||
RDLogger.DisableLog('rdApp.*') | ||
|
||
removers = ( | ||
SaltRemover(defnData="[H,Na,K,Mg,Ca,Mn,Fe,F,Cl,Br,O,S]").StripMol, | ||
partial(SaltRemover().StripMol, dontRemoveEverything=True), | ||
AddHs, | ||
) | ||
|
||
def _clean_smiles_list(smiles=str) -> str: | ||
mol = MolFromSmiles(smiles) | ||
for remover in removers: | ||
mol = remover(mol) | ||
return MolToSmiles(mol) | ||
|
||
|
||
def clean_metabolites(table: pd.DataFrame, | ||
cols=Iterable[str]) -> pd.DataFrame: | ||
return table.assign(**{col: table[col].apply(_clean_smiles_list) for col in cols}) | ||
|
||
|
||
def _shared(a_b): | ||
a, b = (map(MolFromSmiles, a_or_b) for a_or_b in a_b) | ||
b = set(map(MolToInchi, b)) | ||
return any(MolToInchi(item) in b for item in a) | ||
|
||
|
||
def _have_shared_chemical(df: pd.DataFrame, a: str, b: str): | ||
return df[[a, b]].progress_apply(_shared, axis=1) | ||
|
||
mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048) | ||
|
||
def _fingerprinter(a): | ||
return mfpgen.GetFingerprint(MolFromSmiles(a)) | ||
|
||
def _tanimoto(a_b): | ||
a, b = a_b | ||
return max(TanimotoSimilarity(*map(_fingerprinter, items)) for items in product(*a_b)) | ||
|
||
def _max_similarity(df: pd.DataFrame, a: str, b: str): | ||
return df[[a, b]].progress_apply(_tanimoto, axis=1) | ||
|
||
|
||
def _smiles_splitter(df: pd.DataFrame, col: str, char: str = ".") -> List[str]: | ||
return df[col].str.split(char) | ||
|
||
|
||
def connect_metabolites(table: pd.DataFrame, | ||
cols: Iterable[str], | ||
id_col: Union[str, Iterable[str]], | ||
table2: Optional[pd.DataFrame] = None) -> pd.DataFrame: | ||
|
||
if table2 is None: | ||
table2 = table.copy() | ||
if isinstance(id_col, str): | ||
id_col = cast(id_col, to=list) | ||
cols = cast(cols, to=list) | ||
|
||
all_cols = id_col + cols | ||
all_cols2 = [f"{col}_2" for col in all_cols] | ||
cols2 = [f"{col}_2" for col in cols] | ||
|
||
table = table[all_cols].assign(**{f"{col}_split": partial(_smiles_splitter, col=col) for col in cols}) | ||
table2 = (table2[all_cols] | ||
.rename(columns=dict(zip(all_cols, all_cols2))) | ||
.assign(**{f"{col}_split": partial(_smiles_splitter, col=col) for col in cols2})) | ||
table = table.merge(table2, how='cross').query(f"{id_col[0]} < {id_col[0]}_2") | ||
|
||
index_labels = tuple("_".join(items) for items in product(["r1", "p1"], ["r2", "p2"])) | ||
print_err(f"By shared chemicals...") | ||
table = table.assign(**{f"connection_{index_labels[i]}": partial(_have_shared_chemical, a=f"{a}_split", b=f"{b}_split") | ||
for i, (a, b) in enumerate(product(cols, cols2))}) | ||
print_err(f"By maximum similarity...") | ||
table = table.assign(**{f"max_similarity_{index_labels[i]}": partial(_max_similarity, a=f"{a}_split", b=f"{b}_split") | ||
for i, (a, b) in enumerate(product(cols, cols2))}) | ||
|
||
return table[[col for col in table if not col.endswith("_split")]] | ||
|
||
|
||
def main() -> None: | ||
|
||
tqdm.pandas() | ||
|
||
print_err(f"Reading table from STDIN:") | ||
table = pd.read_csv(sys.stdin, sep='\t').assign(_id=lambda x: x["rhea_reaction_id"].astype(str).str.cat(x["uniprot_id"], sep=":")) | ||
print_err(table) | ||
id_table = table[["_id", "rhea_reaction_id", "uniprot_id"]].copy() | ||
|
||
print_err(f"Cleaning metabolites...") | ||
table = clean_metabolites(table, cols=("reactants", "products")) | ||
print_err(f"Connecting metabolites...") | ||
table = (connect_metabolites(table, cols=("reactants", "products"), id_col=["_id"]) | ||
.merge(id_table, how='left') | ||
.merge(id_table.rename(columns={col: f"{col}_2" for col in id_table}), how='left')) | ||
cols_to_rename = ("reactants", "products", "uniprot_id", "rhea_reaction_iduniprot_id") | ||
table = (table | ||
.drop(columns=["_id", "_id_2"]) | ||
.rename(columns=dict(zip(cols_to_rename, (f"{col}_1" for col in cols_to_rename))))) | ||
table[reversed(table.columns)].to_csv(sys.stdout, sep='\t', index=False) | ||
|
||
return None | ||
|
||
|
||
if __name__ == '__main__': | ||
main() |
Oops, something went wrong.