BactFlow Logo

The BactFlow logo was designed by DALLE :)

BactFlow

Introduction

BactFlow is a workflow for assembling bacterial genomes from single-isolate and metagenomic sequencing reads generated on Oxford Nanopore Technologies (ONT) and Illumina platforms. It is written in Nextflow DSL 2 and reads the generic outputs of the Guppy and Dorado basecallers.

Requirements

Installation

  1. Clone the repository
   git clone https://github.com/yourusername/bactflow.git

   cd bactflow
   mamba env create -f config.yml

   cd ..
   mkdir -p ~/.nextflow/assets/farhadm1990
   cp -r bactflow ~/.nextflow/assets/farhadm1990/

   nextflow run bactflow --help
  2. Alternatively, pull the repository with the nextflow pull command
   nextflow pull farhadm1990/bactflow

   nextflow run bactflow
  3. Set up the conda environment with BactFlow's internal function envSetUp(): on the first launch, you can run the following command to create a conda environment called bactflow and install the required packages in it.
   nextflow run bactflow -r main --setup_only true

NOTE: the option --setup_only is false by default, which means that the envSetUp() function is invoked automatically before the downstream processes run.
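After a setup-only run, it can help to confirm that the bactflow environment actually exists before starting a full analysis. The following is a minimal sketch, assuming conda is on the PATH and that `conda env list` prints the environment name in the first column; `has_conda_env` is a hypothetical helper and not part of BactFlow.

```shell
# Hypothetical helper (not part of BactFlow): reads `conda env list` output
# on stdin and succeeds if an environment with the given name is listed.
has_conda_env() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Only query conda when it is actually installed.
if command -v conda >/dev/null 2>&1; then
    if conda env list | has_conda_env bactflow; then
        echo "bactflow environment found"
    else
        echo "bactflow environment missing; run: nextflow run bactflow -r main --setup_only true"
    fi
fi
```

Matching on the first column only keeps the check from being fooled by an unrelated environment whose path happens to contain the word bactflow.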

Usage

Run the Nextflow workflow directly from GitHub with the following command:

nextflow run bactflow --help -r main

Usage: nextflow run bactflow [options]

Options:
    --setup_only            #Only runs envSetUp(); default false.
    --fastq_dir             #Absolute path to the fastq_pass directory (required).
    --concat_reads          #Default true; concatenates the 4000-read fastq chunks written by the ONT basecaller into one fastq file. Set it to false if the reads are already concatenated.
    --extension             #String; extension of the basecalled fastq files; default '.fastq.gz'.
    --cpus                  #Number of available cpus; default 1.
    --coverage_filter       #If true, normalizes all genomes to a given coverage; default false.
    --coverage              #Target coverage. Only if '--coverage_filter true'; default 50.
    --genome_size           #Genome size for coverage normalization. Only if '--coverage_filter true'; default 6.
    --out_dir               #Output directory for the final results; default "genebrosh_output".
    --tensor_batch          #Medaka tensorflow batch size. Lower it for low-coverage genomes. Default 200.
    --nanofilter            #Filters reads by length and quality; default true.
    --min_length            #If '--nanofilter true', removes reads shorter than this length (default 1000).
    --min_quality           #If '--nanofilter true', removes reads below this quality (default 16 for R10.4.1 flowcells).
    --medaka_polish         #If true, polishes the assembled genomes with medaka; default false.
    --basecaller_model      #Basecaller model for medaka polishing step. 'r1041_e82_400bps_hac_v4.2.0'
    --genome_extension      #Required if '--checkm_lineag_check true'; default fasta.
    --run_flye              #If true, runs the Flye assembler; default true.
    --circle_genome         #If true, fixes the start of the genome to an arbitrary gene, e.g. dnaA; default false.
    --run_unicycler         #If true, runs the Unicycler hybrid assembler; default false.
    --run_megahit           #If true, runs the Megahit assembler; default false.
    --run_spades            #If true, runs the Spades assembler; default false.
    --tax_class             #If true, runs GTDBtk taxonomic classification; default false.
    --prok_annot            #If true, runs gene annotation with Prokka; default false.
    --run_checkm            #If true, runs the CheckM lineage and phylogenetic tree workflow.
    --checkm_db             #Absolute path to the CheckM database.
    --gtdbtk_data_path      #Absolute path to the GTDBtk database.
    --run_quast             #Post-assembly statistics with QUAST; default true.
    --genome_dir            #Path to already assembled genomes, used only to run post-assembly tasks, e.g. taxonomic classification, gene annotation, QUAST or CheckM.
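As an illustration of how these options combine, the sketch below prints (but does not execute) a command line for a typical ONT run: read filtering, Flye assembly, medaka polishing and QUAST statistics. `bactflow_cmd` is a hypothetical helper, the paths are placeholders, and the chosen combination of flags is only one plausible example rather than a recommended configuration.

```shell
# Hypothetical helper: prints (does not execute) a BactFlow command line.
# Arguments: fastq directory, output directory, CPU count (defaults to 1).
bactflow_cmd() {
    printf 'nextflow run bactflow -r main --fastq_dir %s --out_dir %s --cpus %s' \
        "$1" "$2" "${3:-1}"
    # Filter reads, assemble with Flye, polish with medaka, then run QUAST.
    printf ' --nanofilter true --min_length 1000 --min_quality 16'
    printf ' --run_flye true --medaka_polish true --run_quast true\n'
}

# Example paths are placeholders; replace them with your own directories.
bactflow_cmd /data/run1/fastq_pass /data/run1/bactflow_out 8
```

Printing the command first makes it easy to review the flags before launching a long run. Paths containing spaces would need quoting before the printed line is pasted into a shell.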

Dependencies

You can run the taxonomic classification workflow by setting --tax_class true. Alternatively, you can perform taxonomic classification on already-assembled genomes by adding --run_flye false. In this case you must provide the workflow with --genome_dir, a directory containing the assembled genomes, and a genome --extension fasta or fa. GTDBtk depends on an external database, which can be downloaded and extracted as follows:

wget https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/auxillary_files/gtdbtk_package/full_package/gtdbtk_data.tar.gz

tar xvzf gtdbtk_data.tar.gz

The full path of the extracted database must be provided either in the pipeline's nextflow.config file or on the command line when running the pipeline, for example --gtdbtk_data_path /home/databases/gtdbtk_db/release220.
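For the taxonomy-only mode described above, a small pre-flight check can catch a wrong database or genome path before Nextflow starts. This is a sketch under stated assumptions: `tax_only_cmd` is a hypothetical helper, it only prints the command, and the genome extension option is left out and should be appended as required. Whether --fastq_dir is still needed in this mode is not stated here, so it is omitted.

```shell
# Hypothetical helper: checks that both directories exist, then prints
# (does not execute) a taxonomy-only BactFlow command line.
# Arguments: genome directory, extracted GTDBtk database directory.
tax_only_cmd() {
    genome_dir="$1"
    gtdbtk_db="$2"
    for dir in "$genome_dir" "$gtdbtk_db"; do
        if [ ! -d "$dir" ]; then
            echo "directory not found: $dir" >&2
            return 1
        fi
    done
    # Skip assembly and classify the genomes that are already assembled.
    printf 'nextflow run bactflow -r main --run_flye false --tax_class true'
    printf ' --genome_dir %s --gtdbtk_data_path %s\n' "$genome_dir" "$gtdbtk_db"
}
```

Calling `tax_only_cmd /home/genomes /home/databases/gtdbtk_db/release220` (the first path is a placeholder) prints the command if both directories exist and returns a non-zero status otherwise.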

Generic outputs

Report maker

Pangenome

Fig 1. Pangenome of assembled genomes extracted from bactflow.


Phylogenetic tree

Fig 2. Phylogenetic tree of assembled genomes based on their ANI.

Potential issues

On the first launch of the program, the process envSetUp is invoked and creates a conda environment called bactflow. In some conda installations, the following error may be raised:

miniconda3/envs/bactflow/etc/conda/deactivate.d/libxml2_deactivate.sh: line 3: xml_catalog_files_libxml2: unbound variable

Solution

Open the file, edit it as follows, and rerun bactflow:

#!/bin/sh

if [ -n "${xml_catalog_files_libxml2:-}" ]; then
    export XML_CATALOG_FILES="${xml_catalog_files_libxml2}"
else
    unset XML_CATALOG_FILES
fi
unset xml_catalog_files_libxml2
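The "unbound variable" message is what bash reports when an unset variable is expanded while the nounset option (`set -u`) is active, which suggests the deactivate script is being sourced under that option. The sketch below uses a throwaway variable, `demo_var`, to show why the `${var:-}` form in the edited script avoids the error; the function names are hypothetical.

```shell
# Demonstrates why the ${var:-} form avoids the error under `set -u`.
# Each check runs in a subshell so the nounset option does not leak out.

# Plain expansion of an unset variable: the subshell aborts with a
# non-zero status and an "unbound variable" style message (silenced here).
plain_expansion() {
    ( set -u; unset demo_var; [ -n "$demo_var" ] ) 2>/dev/null
}

# Defaulted expansion: an unset variable expands to the empty string instead,
# so the test simply takes the "unset" branch.
defaulted_expansion() {
    ( set -u; unset demo_var
      if [ -n "${demo_var:-}" ]; then echo "set"; else echo "unset"; fi )
}

if plain_expansion; then echo "plain: ok"; else echo "plain: failed"; fi
defaulted_expansion
```

The same pattern applies to any activation or deactivation hook that may be sourced with nounset enabled.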