MarkerScan Pipeline

Pipeline to determine the species composition of your sample, and separate and assemble every component.

Installation
Config file
Visual overview of MarkerScan pipeline
Workflow details

Installation

Singularity installation

The prefered way of installation is via the provided singularity container as this will ensure there are no software incompatibilities. After installation of singularity, you can pull a docker image of the latest version of the code and convert it into a singularity container:

singularity pull docker://emvcaest/markerscan:latest

Next you need to bind all required directories of your local machine (see Config file) to the image

export SINGULARITY_BIND="$DIR"

The image can now be run as follows:

singularity run markerscan_latest.sif snakemake --cores $threads --use-conda --conda-prefix /opt/conda/ -s /MarkerScanPipeline/Snakefile --configfile $configfile

From source

Please clone this directory to your location of choice.

git clone https://github.com/CobiontID/MarkerScan.git

Download snakemake, e.g. via conda

conda install -c bioconda snakemake

# activate the Conda environment
conda activate snakemake
snakemake --configfile $configfile --cores $threads --use-conda --conda-prefix $condaprefix -s $pipelinedir/Snakefile

Config file

To run the pipeline a yaml file containing all external parameters needs to be created, an example is shown below.

reads: zipped fasta read file
genome: unzipped fasta file
shortname: e.g. ilBlaLact1, this will be used in output file names
sci_name: e.g. Blastobasis lacticolella, this full scientific name name needs to be present in NCBI taxonomy with exact spelling 
workingdirectory: folder to store all output files
datadir: central folder to store output which can be re-used across multiple pipeline runs
full: 0|1 (run only the SSU detection steps, or complete the full pipeline)

Visual overview of MarkerScan pipeline

flowchart

A[Genome] --> B(Scan for SSU)
B --> C(Extract SSU)
C --> D(Classify SSU)
D --> E(Get families)
E --> F(Download public genomes families)
F --> H(Classify reads - kraken)
G[Reads] --> H
H --> I(Extract reads)
E --> I
I -->J(Map reads)
I -->K(Independent assembly)
A-->J
J-->L(Busco)
F-->M
J-->M(Whole genome alignment)
K-->O(Busco)
F-->P
K-->P(Whole genome alignment)
M-->N(Report)
L-->N(Report)
O-->N(Report)
P-->N(Report)

Workflow details

Download steps

Download SILVA DB (https://ftp.arb-silva.de/current/ARB_files/) into {datadir}/silva (re-run, when new version)
Download all refseq organellar sequences from https://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/ and https://ftp.ncbi.nlm.nih.gov/refseq/release/plastid/ and store in {datadir}/organelles (re-run, when older than 180 days)
Download all genbank organellar sequences for apicomplexans (common contaminant, but sequence information is rare) via e-utils and store in {datadir}/apicomplexa (re-run, when older than 180 days)
Download NCBI taxonomy (both names.dmp / nodes.dmp and nucl_wgs.accession2taxid/nucl_gb.accession2taxid) (re-run, when older than 180 days)

Workflow steps

Run nhmmer with SSU_Prok_Euk_Microsporidia.hmm across the assembly and coordinates of matches can be found in {shortname}.SSU.readsinfo
The SSU loci are extracted and collapsed with 99% nucleotide identity and stored in {shortname}.SSU.reduced.fa
Classify SSU regions using SILVA. Taxonomy per sequence is found in {shortname}.SSU.reduced.SILVA.tax
Determine the species composition of sample and for which families the procedure continues, output in {workingdirectory}/genera
Download genomes for the closest relatives of the target species available. Next, this fasta file is split and masked using duskmasker. Outputfile: relatives/kraken.relatives.masked.ffn.
Download all available genomes (refseq if bacterial, all if eukaryotic) for the detected families and store in {datadir}/genera (re-run, when older than 180 days).
All fasta files of the detected cobiont families are combined in kraken.tax.masked.ffn.
A custom kraken database consisting out of kraken.tax.masked.ffn and relatives/kraken.relatives.masked.ffn is created: krakendb/
Kraken2 is run. Outputfiles are kraken.output and kraken.report
All reads are mapped to the draft assembly: AllReadsGenome.paf

The following part of the pipeline will be done for every detected family based on the composition of the sample.

Reads are extracted per bin. {family}/kraken.fa
Kraken reads are mapped to draft assembly. Fully aligned contigs {family}/{family}.ctgs
Run Busco on these contigs: {family}/busco/
Based on the downloaded genomes of this family, homology search using nucmer is performed on these contigs:{family}/{family}_vs_contigs.overview.txt
Combine these results and define certain set of reads which are deemed to belong to this {family}. {family}/{family}.final_reads.fa --> concatenated across families in final_reads_removal.fa and corresponding assembled sequence in final_assembly.fa.

Moreover, also a re-assembly is done.

Reads of draft contigs which are not fully aligned are added to the kraken reads: {family}/{family}.reads2assemble.fa
Assembly is done using hifiasm: {family}/hifiasm/
Busco is run twice, both on the reads as on the novel assembly: {family}/buscoReads and {family}/buscoAssembly
Nucmer against re-assembled contigs: {family}/{family}_vs_hifiasm.overview.txt
Map reads to re-assembled contigs: {family}/{family}.re-assembly_reads.fa --> concatenated across families in re-assembly_reads.fa

Combine all results and generate report file {shortname}.report.pdf and {shortname}.json.

Name		Name	Last commit message	Last commit date
Latest commit History 202 Commits
.github/workflows		.github/workflows
envs		envs
scripts		scripts
src/docker		src/docker
AcariSSU_SILVA_pr2_curated.nhr		AcariSSU_SILVA_pr2_curated.nhr
AcariSSU_SILVA_pr2_curated.nin		AcariSSU_SILVA_pr2_curated.nin
AcariSSU_SILVA_pr2_curated.nsq		AcariSSU_SILVA_pr2_curated.nsq
MicrosporidiaSSU_NCBI.nhr		MicrosporidiaSSU_NCBI.nhr
MicrosporidiaSSU_NCBI.nin		MicrosporidiaSSU_NCBI.nin
MicrosporidiaSSU_NCBI.nsq		MicrosporidiaSSU_NCBI.nsq
README.md		README.md
SSU_Prok_Euk_Microsporidia.hmm		SSU_Prok_Euk_Microsporidia.hmm
SSU_Prok_Euk_Microsporidia.hmm.h3f		SSU_Prok_Euk_Microsporidia.hmm.h3f
SSU_Prok_Euk_Microsporidia.hmm.h3i		SSU_Prok_Euk_Microsporidia.hmm.h3i
SSU_Prok_Euk_Microsporidia.hmm.h3m		SSU_Prok_Euk_Microsporidia.hmm.h3m
SSU_Prok_Euk_Microsporidia.hmm.h3p		SSU_Prok_Euk_Microsporidia.hmm.h3p
Snakefile		Snakefile
create-envs.smk		create-envs.smk
run_snakemake.sh		run_snakemake.sh
submit_snakemake.sh		submit_snakemake.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MarkerScan Pipeline

Installation

Singularity installation

From source

Config file

Visual overview of MarkerScan pipeline

Workflow details

Download steps

Workflow steps

About

Releases

Packages

Contributors 2

Languages

CobiontID/MarkerScan

Folders and files

Latest commit

History

Repository files navigation

MarkerScan Pipeline

Installation

Singularity installation

From source

Config file

Visual overview of MarkerScan pipeline

Workflow details

Download steps

Workflow steps

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages