Nextflow pipeline to process Nanopore POD5 files from multiple bacterial samples into a gene
The pipeline performs duplex basecalling, which requires that v10.4 chemistry was used for sequencing.
- Merge all POD5 files (in case they have already been demultiplexed) into a single file.
- Basecall the merged POD5 file using guppy duplex basecalling, producing a .fastq.gz file.
- Demultiplex the .fastq.gz file based on the sample_sheet barcodes using guppy, according to the barcode_kit parameter.
Then for each demultiplexed sample:
- Trim reads to adapters using cutadapt. Reads from sbRNA-seq will be flanked by sequences containing cell barcodes and UMIs, so this step trims any extra sequences on either side of these flanking sequences.
- Extract cell barcodes and UMIs using umitools.
- Map to genome FASTA using minimap2.
- Deduplicate mapped reads using umitools.
- Count deduplicated reads per gene using featureCounts.
- Count deduplicated reads per gene per cell using umitools.
- Get FASTQ quality metrics with fastqc.
- Map to genome FASTA using bowtie2, because minimap2 logs are not compatible with multiqc. This makes at least some alignment metrics available in the report.
- Compile the logs of processing steps into an HTML report with multiqc.
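To illustrate what deduplicating and counting "per gene per cell" means, here is a minimal Python sketch. It is not the pipeline's implementation: it assumes reads have already been reduced to hypothetical (cell barcode, UMI, gene) tuples, and it collapses only exactly identical UMIs, whereas umitools can also account for sequencing errors.

```python
from collections import defaultdict


def count_per_cell_gene(reads):
    """Count unique UMIs for each (cell, gene) pair.

    `reads` is an iterable of (cell_barcode, umi, gene) tuples, a
    hypothetical in-memory stand-in for mapped, annotated reads.
    """
    umis = defaultdict(set)
    for cell, umi, gene in reads:
        # Reads sharing cell, gene and UMI are treated as one molecule.
        umis[(cell, gene)].add(umi)
    return {key: len(seen) for key, seen in umis.items()}


# Made-up example data; gene names are placeholders.
reads = [
    ("cellA", "AAAA", "rpoB"),
    ("cellA", "AAAA", "rpoB"),  # duplicate of the read above
    ("cellA", "CCCC", "rpoB"),
    ("cellB", "AAAA", "rpoB"),
    ("cellB", "GGGG", "gyrA"),
]
counts = count_per_cell_gene(reads)
for (cell, gene), n in sorted(counts.items()):
    print(cell, gene, n)
```

Summing the values for one gene across all cells would give a per-gene total, analogous in spirit to the per-gene count step.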
You need to have Nextflow and either conda or mamba installed on your system. If possible, use mamba because it will be faster.
You will also need the guppy basecaller from Oxford Nanopore. It can be downloaded from their community site. Once it is installed, provide its location through guppy_path, which is a required parameter of the pipeline.
You also need the genome FASTA and GFF annotations for the bacteria you are sequencing. These can be obtained from NCBI Nucleotide:
- Search for your strain of interest, and open its main page.
- On the right-hand side, click Customize view, then Customize and check Show sequence. Finally, click Update view. You may have to wait a few minutes while the sequence downloads.
- Click Send to: > Complete record > File > [FASTA or GFF3] > Create file.
- Save the files to directories which you provide as parameters below.
If it's your first time using Nextflow on your system, you can install it using conda:
conda install -c bioconda nextflow
You may need to set the NXF_HOME environment variable. For example:
mkdir -p ~/.nextflow
export NXF_HOME=~/.nextflow
To make this a permanent change, you can do something like the following:
mkdir -p ~/.nextflow
echo "export NXF_HOME=~/.nextflow" >> ~/.bash_profile
source ~/.bash_profile
Make a sample sheet (see below) and, optionally, a nextflow.config file in the directory where you want the pipeline to run. Then run Nextflow.
nextflow run scbirlab/nf-ont-sbrnaseq
After the first run, Nextflow uses a locally cached version of the pipeline, which is not updated automatically. If you want to ensure that you're using the very latest version of the pipeline, use the -latest flag.
nextflow run scbirlab/nf-ont-sbrnaseq -latest
If you want to run a particular tagged version of the pipeline, such as v0.0.1, you can do so using:
nextflow run scbirlab/nf-ont-sbrnaseq -r v0.0.1
For help, use nextflow run scbirlab/nf-ont-sbrnaseq --help.
The first time you run the pipeline on your system, the software dependencies in environment.yml will be installed. This may take several minutes.
The following parameters are required:
- sample_sheet: path to a CSV with information about the samples and FASTQ files to be processed
- data_dir: path to root directory to find POD5 files (using pattern <data_dir>/pod5_*/*/*.pod5)
- genome_fasta_dir: path to directory containing genome FASTA files (for mapping)
- genome_gff_dir: path to directory containing genome GFF files (for feature counting)
- guppy_path: path to Guppy (ont-guppy directory) provided by Oxford Nanopore
- barcode_kit: SKU of the barcoding kit used, e.g. SQK-NBD114-24 for the 24-barcode ligation kit
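Before launching a run, it can help to check which files the data_dir pattern would pick up. The Python sketch below applies the same glob pattern with pathlib to a temporary directory tree. The directory and file names (pod5_pass, barcode01, and so on) are hypothetical, and it assumes that pathlib's glob matching agrees with the pipeline's for this simple pattern.

```python
import tempfile
from pathlib import Path


def find_pod5(data_dir):
    """Return files matching <data_dir>/pod5_*/*/*.pod5, sorted."""
    return sorted(Path(data_dir).glob("pod5_*/*/*.pod5"))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout, for illustration only.
    layout = [
        "pod5_pass/barcode01/a.pod5",  # matches the pattern
        "pod5_fail/barcode02/b.pod5",  # matches the pattern
        "pod5_pass/c.pod5",            # one directory level too shallow
        "other/barcode01/d.pod5",      # top-level directory lacks the pod5_ prefix
    ]
    for rel in layout:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    found = [p.relative_to(root).as_posix() for p in find_pod5(root)]

print(found)
```

Only files exactly two directory levels below a directory whose name starts with pod5_ are found, so POD5 files placed elsewhere under data_dir would be missed.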
The following parameters have default values, which can be overridden if necessary:
- model = "dna_r10.4.1_e8.2_400bps": For guppy, the basecalling model to use.
- trim_qual = 10: For cutadapt, the minimum Phred score for trimming 3' calls
- min_length = 11: For cutadapt, the minimum trimmed length of a read. Shorter reads will be discarded
- umitools_error = 6: For umitools, the number of errors allowed to correct cell barcodes
- strand = 1: For featureCounts, the strandedness of RNA-seq. 1 for forward, 2 for reverse.
- ann_type = 'gene': For featureCounts, features from GFF column 3 to use for counting
- label = 'Name': For featureCounts, one or more (comma-separated) fields from column 9 of GFF for labeling counts
The parameters can be provided either in the nextflow.config file or on the nextflow run command.
Here is an example of the nextflow.config file:
params {
    sample_sheet = "/path/to/sample_sheet.csv"
    data_dir = "/path/to/pod5/root"
    guppy_path = "/path/to/ont-guppy"
    genome_fasta_dir = "/path/to/fastas"
    genome_gff_dir = "/path/to/gffs"
    barcode_kit = "SQK-NBD114-24"
}
Alternatively, you can provide these on the command line:
nextflow run scbirlab/nf-ont-sbrnaseq \
--sample_sheet /path/to/sample_sheet.csv \
--data_dir /path/to/pod5/root \
--guppy_path /path/to/ont-guppy \
--genome_fasta_dir /path/to/fastas \
--genome_gff_dir /path/to/gffs \
--barcode_kit SQK-NBD114-24
The sample sheet is a CSV file providing information about which demultiplexed FASTQ files belong to which sample, which genome each sample should be mapped to, and the UMI and cell barcode scheme for each sample.
The file must have a header with the column names below, and one line per sample to be processed.
- barcode_id: the ONT barcode name from the barcoding kit you used
- sample_id: the unique name of the sample
- genome_id: the name of the genome to map to. Each entry must match the name of one file (apart from the extension) in genome_fasta_dir and genome_gff_dir
- n_cells: maximum number of uniquely barcoded cells in the sample (used by umi_tools). This can be more than the number of reads, in which case the number of reads is taken instead.
- adapter: the adapter sequence to trim to, in cutadapt format. Sequences on either side will be removed, but the adapters themselves will be retained.
- umi: the cell barcode and UMI pattern in umitools regex format
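A quick check of the sample sheet before a run can catch common mistakes. The sketch below is a hypothetical helper, not part of the pipeline: it verifies that the required columns are present, that each sample_id is unique, that n_cells is a whole number, and that each genome_id has a matching file name. It assumes you have already collected the extension-free file names from genome_fasta_dir and genome_gff_dir into sets.

```python
import csv
import io

REQUIRED = ["barcode_id", "sample_id", "genome_id", "n_cells", "adapter", "umi"]


def check_sample_sheet(handle, fasta_names, gff_names):
    """Return a list of human-readable problems found in a sample sheet.

    `fasta_names` and `gff_names` are sets of genome file names without
    their extensions (assumed to be gathered by the caller).
    """
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    seen = set()
    # The header is line 1, so the first sample is on line 2.
    for line_no, row in enumerate(reader, start=2):
        if row["sample_id"] in seen:
            problems.append(f"line {line_no}: duplicate sample_id {row['sample_id']}")
        seen.add(row["sample_id"])
        if not row["n_cells"].isdigit():
            problems.append(f"line {line_no}: n_cells is not a whole number")
        for kind, names in (("FASTA", fasta_names), ("GFF", gff_names)):
            if row["genome_id"] not in names:
                problems.append(f"line {line_no}: no {kind} file for genome_id {row['genome_id']}")
    return problems


# Made-up sheet with a repeated sample_id and an unknown genome on line 3.
sheet = io.StringIO(
    "barcode_id,sample_id,genome_id,n_cells,adapter,umi\n"
    "barcode01,Eco1,EcoMG1655-NC_000913.3,885000,AGACAGN{6}G{3}...N{7}AGATCG,^(?P<cell_1>.{6}).+$\n"
    "barcode02,Eco1,Unknown,885000,AGACAGN{6}G{3}...N{7}AGATCG,^(?P<cell_1>.{6}).+$\n"
)
genomes = {"EcoMG1655-NC_000913.3"}
problems = check_sample_sheet(sheet, genomes, genomes)
for problem in problems:
    print(problem)
```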
Here is an example of the sample sheet:
barcode_id | sample_id | genome_id | n_cells | adapter | umi |
---|---|---|---|---|---|
barcode01 | Eco1 | EcoMG1655-NC_000913.3 | 885000 | AGACAGN{6}G{3}...N{7}AGATCG | ^(?P<discard_1>.{6})(?P<cell_1>.{6})(?P<discard_2>.{3}).+(?P<cell_3>.{7})(?P<discard_4>.{6})$ |
barcode02 | Eco2 | EcoMG1655-NC_000913.3 | 885000 | AGACAGN{6}G{3}...N{7}AGATCG | ^(?P<discard_1>.{6})(?P<cell_1>.{6})(?P<discard_2>.{3}).+(?P<cell_3>.{7})(?P<discard_4>.{6})$ |
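To see what the umi pattern in the example captures, the Python sketch below applies it to a made-up read. The pattern uses Python regex syntax, so it can be tested directly with the re module. The layout appears to mirror the example adapter: 6 discarded bases (AGACAG), a 6-base cell barcode (N{6}), 3 discarded bases (G{3}), then the transcript, a 7-base cell barcode (N{7}) and 6 discarded bases (AGATCG). The read sequence and the barcode values are invented for illustration. In umitools regex mode, groups named cell_* are joined to form the cell barcode and discard_* groups are dropped.

```python
import re

# The umi pattern from the example sample sheet, split over two lines.
UMI_PATTERN = re.compile(
    r"^(?P<discard_1>.{6})(?P<cell_1>.{6})(?P<discard_2>.{3})"
    r".+(?P<cell_3>.{7})(?P<discard_4>.{6})$"
)

# Hypothetical trimmed read:
# 5' flank + cell barcode part 1 + GGG + transcript + cell barcode part 2 + 3' flank
read = "AGACAG" + "ACGTAC" + "GGG" + "TTGACCGGTAAC" + "CATGCAT" + "AGATCG"

match = UMI_PATTERN.match(read)
# Concatenate the cell_* groups in order.
cell_barcode = match.group("cell_1") + match.group("cell_3")
# Everything between the 5' and 3' groups is the transcript sequence.
insert = read[match.end("discard_2"):match.start("cell_3")]

print(cell_barcode)  # ACGTACCATGCAT
print(insert)        # TTGACCGGTAAC
```

Testing a pattern this way against a few real reads is a cheap check that the group lengths agree with the adapter column before running the whole pipeline.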
Outputs are saved in the same directory as sample_sheet. They are organised under four directories:
- fastq: Demultiplexed FASTQ files.
- processed: FASTQ files and logs resulting from trimming and UMI extraction
- counts: tables and BAM files corresponding to cell $\times$ gene counts
- multiqc: HTML report on processing steps
Add to the issue tracker.
Here are the help pages of the software used by this pipeline.