Pipeliner upgraded for Nextflow DSL2 modules
$ git clone https://github.com/montilab/pipeliner-2 --branch dev
Workflows are built using Nextflow. Nextflow runs on any POSIX-compatible system (Linux, macOS, etc.) and requires Bash and Java 8 (or later) to be installed. Download the latest version of Nextflow compatible with DSL2:
- Make sure Java 8 or later is installed on your computer by using the command:
java -version
- Download the Nextflow executable. This repository uses Nextflow version 20.10.0.5430.
curl -s https://get.nextflow.io | bash
- Add nextflow to your `$PATH` in your `.bash_profile` (or system-equivalent file):
export PATH=$PATH:/path/to/nextflow
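You can confirm the installation worked by printing the version:
nextflow -version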
To make use of this repository you'll need a basic understanding of how processes and channels work in Nextflow, as well as familiarity with the new syntax for defining modular workflows in Nextflow DSL2. Also note that many of the modules contain Groovy code snippets.
A Nextflow workflow is a series of dependent processes that execute pre-defined scripts. Before DSL2, we wrote pipelines that shared script templates to reuse software tools common across pipelines. With DSL2, it is now possible to write processes as modules (similar to functions). This allows us to build modular pipelines and greatly reduces the friction of modifying or creating new pipelines from existing components.
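As a quick refresher, a DSL2 module is just a file containing one or more process definitions that can be imported into any workflow. Here is a minimal sketch (the process name and script are hypothetical, not a module from this repository):
#!/usr/bin/env nextflow
nextflow.enable.dsl=2

// A process declares its inputs/outputs and wraps a script
process COUNT_READS {
    input:
    tuple val(id), path(reads)

    output:
    stdout

    script:
    """
    zcat ${reads} | wc -l
    """
}

// Saved as ./modules/COUNT_READS.nf, it could be imported with:
// include { COUNT_READS } from './modules/COUNT_READS'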
A couple of important notes about the beginning of your workflow script:
- You must have `nextflow.enable.dsl=2` at the beginning to enable DSL2.
- Your modules will be parameterized with the `params` variable, so check the modules you're using to see what parameters they expect to be defined.
- The `params` variable must come before the module imports, because modules are parameterized as they're imported.
- Here I am importing them all, but you only need to import the modules you're making use of.
#!/usr/bin/env nextflow
VERSION="1.0"
nextflow.enable.dsl=2
params.paired = false
params.wd = "path/to/pipeliner-2"
params.outdir = "${params.wd}/results"
params.fasta = "${params.wd}/data/genomes/genome_reference.fa"
params.gtf = "${params.wd}/data/genomes/genome_annotation.gtf"
params.index = "${params.wd}/results/hisat/index/part"
include { FASTQC } from './modules/FASTQC' params(params)
include { TRIM_GALORE } from './modules/TRIM_GALORE' params(params)
include { HISAT_INDEX } from './modules/HISAT' params(params)
include { HISAT_MAPPING } from './modules/HISAT' params(params)
include { FEATURE_COUNTS } from './modules/QUANT' params(params)
include { FEATURE_COUNTS_MATRIX } from './modules/QUANT' params(params)
include { ESET } from './modules/QUANT' params(params)
include { MULTIQC } from './modules/MULTIQC' params(params)
Here are some examples of how you might use these modules.
You'll probably already have an index built, but in case you need to build one, you can now do it independently of your workflow.
workflow {
HISAT_INDEX( params.fasta, params.gtf )
}
We defined a helper function for reading in single-end or paired-end reads. All you have to do is specify the files you want to read in and make sure `params.paired = true`, and your reads will be properly formatted for the downstream modules.
First we pass the reads to `TRIM_GALORE`, which trims off the adapters and performs some quality-control checks. The first output is the trimmed reads, so we can pass `TRIM_GALORE.out[0]` to `HISAT_MAPPING`. Check your modules to see what the inputs/outputs are so you can modify this sequence if necessary.
Note that we pass `FEATURE_COUNTS.out[0].collect()` to `FEATURE_COUNTS_MATRIX`. The `.collect()` operator gathers the counts for every sample so that they are all passed at once and processed together rather than independently in parallel.
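For intuition, `.collect()` gathers every item a channel emits into a single list emission. A toy sketch, unrelated to the pipeline:
workflow {
    Channel
        .of( 1, 2, 3 )
        .collect()
        .view()  // prints a single item: [1, 2, 3]
}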
def load_reads(path, paired) {
    if (paired) {
        // Emits [sample_id, [read_1, read_2]] tuples
        Channel
            .fromFilePairs( path )
            .set {reads}
    } else {
        // Strip the _1/_2 suffix from the filename to mimic
        // the [sample_id, [reads]] structure of fromFilePairs
        Channel
            .fromPath( path )
            .map { [it.getName().split("\\_1|\\_2", 2)[0], [it]] }
            .set {reads}
    }
}
workflow {
load_reads("${params.wd}/data/rnaseq/reads/*_{1,2}.fq.gz", params.paired)
TRIM_GALORE( reads )
HISAT_MAPPING( TRIM_GALORE.out[0] )
FEATURE_COUNTS( HISAT_MAPPING.out[0], params.gtf )
FEATURE_COUNTS_MATRIX( FEATURE_COUNTS.out[0].collect() )
ESET( FEATURE_COUNTS_MATRIX.out[0] )
MULTIQC( ESET.out[0] )
}
If the reads were single-end, we only have to change the wildcard and ensure `params.paired = false`.
load_reads("${params.wd}/data/rnaseq/reads/*_1.fq.gz", params.paired)
What if we already mapped the reads and have the BAMs? Rather than writing this conditional logic into a pipeline explicitly, with a modular architecture we can simply change which modules we use.
def load_bams(path) {
    // Strip the .bam extension to emit [sample_id, [bam]] tuples
    Channel
        .fromPath( path )
        .map { [it.getName().split("\\.bam", 0)[0], [it]] }
        .set {bams}
}
workflow {
load_bams("${params.wd}/data/rnaseq/bams/*.bam")
FEATURE_COUNTS( bams, params.gtf )
FEATURE_COUNTS_MATRIX( FEATURE_COUNTS.out[0].collect() )
ESET( FEATURE_COUNTS_MATRIX.out[0] )
}
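To set up the environment, clone the repository and create a conda environment from the included environment file: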
cd /path/to/folder/
git clone https://github.com/montilab/pipeliner-2 --branch dev
# pipeliner-2 will be at /path/to/folder/pipeliner-2
conda env create -n pipeliner-2 -f pipeliner-2/pipeliner.yml
conda activate pipeliner-2
Or, if you want to specify a destination:
# MLAB=/restricted/projectnb/montilab-p
conda env create -f pipeliner-2/pipeliner-2.yml -p $MLAB/tools/pipeliner-2/envs
conda activate $MLAB/tools/pipeliner-2/envs
module load nextflow
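With the environment active and nextflow available, you can run whichever workflow script you assembled above (main.nf here is a placeholder name, not a file guaranteed by the repository):
nextflow run main.nf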