scbirlab/nf-crispriseq is a Nextflow pipeline to count and annotate guide RNAs in demultiplexed FASTQ files, optionally with UMIs, and optionally modelling fitness changes.
Per genome:
- If no guide RNAs are provided, generate all possible guide RNAs from the provided genome
- For each input set of guide RNAs, map to the reference genome and GFF to get genomic feature annotations
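As a rough illustration of what "generate all possible guide RNAs" could mean, the sketch below enumerates candidate guides that sit immediately upstream of a PAM on the forward strand. The function name `enumerate_guides`, the 20-nt guide length, and the forward-strand-only search are assumptions made for this example; they are not taken from the pipeline's own implementation.

```python
import re

# IUPAC nucleotide codes translated to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def enumerate_guides(genome, pam="NGG", guide_length=20):
    """Yield (start, guide, pam_sequence) for forward-strand PAM sites.

    Hypothetical helper: guide_length=20 and forward-strand-only
    searching are assumptions for illustration.
    """
    pam_regex = "".join(IUPAC[base] for base in pam.upper())
    # A lookahead is used so that overlapping candidate sites are all found.
    pattern = re.compile(rf"(?=([ACGT]{{{guide_length}}})({pam_regex}))")
    for match in pattern.finditer(genome.upper()):
        yield match.start(), match.group(1), match.group(2)


# Toy sequence with a single NGG site after the first 20 nucleotides.
toy_genome = "ACGT" * 5 + "AGG" + "TTTT"
hits = list(enumerate_guides(toy_genome))
```

A real search would also scan the reverse strand and handle ambiguous bases in the genome.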
Per FASTQ file:
- Filter and trim reads to adapters using cutadapt. This ensures reads used downstream have the expected features and are trimmed so that the features are in predictable places.
- (Optionally) Extract UMIs using umitools extract.
- Find guide RNA matches using cutadapt.
- Count UMIs (if using) and reads per guide RNA using umitools count_tab.
- Plot histograms and correlations of UMI and read count distributions.
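To make the difference between read counts and UMI counts concrete, here is a minimal sketch that tallies both per guide. It counts exact distinct UMIs only, which is a simplification; the function name and the example records are made up for illustration and this is not how umitools count_tab is implemented.

```python
from collections import defaultdict


def count_reads_and_umis(records):
    """Count reads and distinct UMIs per guide.

    records: iterable of (guide_name, umi) pairs, one pair per read.
    Naive sketch: UMIs are compared exactly, with no error correction.
    """
    reads = defaultdict(int)
    umis = defaultdict(set)
    for guide, umi in records:
        reads[guide] += 1
        umis[guide].add(umi)
    return {
        guide: {"read_count": reads[guide], "umi_count": len(umis[guide])}
        for guide in reads
    }


# Hypothetical reads: guide001 has three reads but only two distinct UMIs,
# so one read is likely a PCR duplicate.
example_records = [
    ("guide001", "AAAACCCC"),
    ("guide001", "AAAACCCC"),
    ("guide001", "GGGGTTTT"),
    ("guide002", "ACGTACGT"),
]
counts = count_reads_and_umis(example_records)
```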
Optionally [work in progress]:
- If the data are from a time-course, calculate fitness per guide RNA, per condition.
- Get FASTQ quality metrics with fastqc.
- Compile the logs of processing steps into an HTML report with multiqc.
You need to have Nextflow and conda installed on your system.
If you're at the Crick or your shared cluster has it already installed, try:
module load Nextflow
Otherwise, if it's your first time using Nextflow on your system, you can install it using conda:
conda install -c bioconda nextflow
You may need to set the NXF_HOME environment variable. For example,
mkdir -p ~/.nextflow
export NXF_HOME=~/.nextflow
To make this a permanent change, you can do something like the following:
mkdir -p ~/.nextflow
echo "export NXF_HOME=~/.nextflow" >> ~/.bash_profile
source ~/.bash_profile
Make a sample sheet (see below) and, optionally, a nextflow.config file in the directory where you want the pipeline to run. Then run Nextflow.
nextflow run scbirlab/nf-crispriseq
Each time you run the pipeline after the first time, Nextflow will use a locally cached version, which will not be automatically updated. To ensure that you're running a specific version of the pipeline, use the -r <version> flag. For example,
nextflow run scbirlab/nf-crispriseq -r v0.0.1
A list of versions can be found by running nextflow info scbirlab/nf-crispriseq.
For help, use nextflow run scbirlab/nf-crispriseq --help.
The first time you run the pipeline on your system, the software dependencies in environment.yml will be installed. This can take around 10 minutes.
If your run is unexpectedly interrupted, you can restart from the last completed step using the -resume flag.
nextflow run scbirlab/nf-crispriseq -resume
The following parameters are required:
- sample_sheet: path to a CSV containing sample IDs matched with FASTQ filenames, references, and adapter sequences
- fastq_dir: path to directory containing the FASTQ files (optionally GZIPped)
- inputs: path to directory containing files referenced in the sample_sheet, such as lists of guide RNAs.
The following parameters have default values that can be overridden if necessary.
- output = "outputs": path to directory to put output files
- sample_names = "sample_id": column of sample_sheet to take as a sample identifier. Use "Run" for SRA table inputs.
- use_umis = false: Whether the reads include UMIs
- from_sra = false: Whether the FASTQ files should be pulled from the SRA instead of provided as local files
- guides = true: Whether the name of a CSV of guide sequences is provided in the sample_sheet
- name_column = "Name": If using a CSV of guide RNA sequences (guides = true), the column containing the name of each guide
- sequence_column = "guide_sequence": If using a CSV of guide RNA sequences (guides = true), the column containing the sequence of each guide
- rc = false: Whether to reverse complement the guide sequences before mapping.
- trim_qual = 10: For cutadapt, the minimum Phred score for trimming 3' calls
- min_length = 105: For cutadapt, the minimum trimmed length of a read. Shorter reads will be discarded.
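For readers unsure what the rc option changes, the following sketch shows what reverse complementing a guide sequence means. The helper name `reverse_complement` is hypothetical and the code is only an illustration of the operation, not the pipeline's own code.

```python
# Translation table mapping each base to its complement (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


# Applying the operation twice gives back the original sequence.
guide = "TCGACTGAGCTGAAAGAAT"
flipped = reverse_complement(guide)
round_trip = reverse_complement(flipped)
```

Setting rc = true is useful when the guide sequences in your table are written on the opposite strand to the one that appears in the reads.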
The parameters can be provided either in the nextflow.config file or on the nextflow run command.
Here is an example of the nextflow.config file:
params {
    sample_sheet = "/path/to/sample-sheet.csv"
    fastq_dir = "/path/to/fastqs"
    inputs = "/path/to/inputs"
    // Optional
    rc = true
    guides = true
    trim_qual = 15
    min_length = 90
}
Alternatively, you can provide these on the command line:
nextflow run scbirlab/nf-crispriseq -r v0.0.1 \
    --sample_sheet /path/to/sample_sheet.csv \
    --fastq_dir /path/to/fastqs \
    --inputs /path/to/inputs \
    --rc --guides \
    --trim_qual 15 --min_length 90
The sample sheet is a CSV file providing information about each sample: which FASTQ files belong
to it, the reference genome accession number, adapters to be trimmed off, (optionally) the UMI
scheme, (optionally) the name of a table of known guide RNAs, and (optionally) experimental conditions if calculating fitness.
The file must have a header with the column names below, and one line per sample to be processed.
- sample_id: the unique name of the sample
- genome: the NCBI assembly accession number for the organism that the guide RNAs are targeting. This number starts with "GCF_" or "GCA_".
- pam: the name (e.g. "Spy" or "Sth1") or sequence (e.g. "NGG" or "NGRVAN") of the dCas9 PAM used in the experiment
- scaffold: the name of the sgRNA scaffold ("PerturbSeq" or "Sth1") used in the experiment
- adapter_5prime: the 5' adapter on the forward read to trim to, in cutadapt format. Sequence to the left will be removed, but the adapters themselves will be retained.
- adapter_3prime: the 3' adapter on the forward read to trim to, in cutadapt format. Sequence matching the adapter and everything to the right will be removed.
If you have set from_sra = false (the default):
- reads: the search glob to find FASTQ files for each sample in fastq_dir (see config). The pipeline will look for files matching <fastq_dir>/*<dir>*, and the glob should match only the forward read if you had paired-end sequencing.
Otherwise, with from_sra = true:
- Run: the SRA Run ID
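Before launching a run, it can help to check which files a reads glob would pick up. The sketch below assumes that the value in the reads column is wrapped in wildcards on both sides and matched against filenames in fastq_dir; the filenames and the helper `matching_fastqs` are invented for this example.

```python
from fnmatch import fnmatchcase


def matching_fastqs(filenames, reads_glob):
    """Return the filenames matched by *<reads_glob>* (case-sensitive).

    Assumption for illustration: the glob from the sample sheet is
    surrounded by wildcards before matching.
    """
    pattern = f"*{reads_glob}*"
    return [name for name in filenames if fnmatchcase(name, pattern)]


# Hypothetical contents of fastq_dir for a paired-end run.
fastq_files = [
    "FAU6865A42_S1_L001_R1_001.fastq.gz",
    "FAU6865A42_S1_L001_R2_001.fastq.gz",
    "FAU6865A43_S2_L001_R1_001.fastq.gz",
]
matched = matching_fastqs(fastq_files, "FAU6865A42_*_R1")
```

Including the read designation (here _R1) in the glob is what keeps the reverse reads out of the match.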
If you have set use_umis = true:
- umi_pattern: the cell barcode and UMI pattern in umitools regex format for the forward read
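To see how a pattern of this kind carves up a read, the sketch below applies a regex with named groups to a made-up read using Python's standard re module. In umitools regex mode, groups named umi_<n> are taken as the UMI and groups named discard_<n> are dropped, with the rest of the read kept; the read sequence here is invented, and the code only imitates that splitting rather than calling umitools.

```python
import re

# Pattern of the same form as the umi_pattern column:
# 8 nt of UMI, then 86 nt to discard, then the rest of the read.
UMI_PATTERN = r"^(?P<umi_1>.{8})(?P<discard_1>.{86}).+$"

# Invented read: 8-nt UMI + 86-nt constant region + 20-nt remainder.
read = "ACGTACGT" + "A" * 86 + "TTTTGGGGCCCCAAAATTTT"

match = re.match(UMI_PATTERN, read)
if match is None:
    raise ValueError("Read is too short for the UMI pattern")

umi = match.group("umi_1")                 # would be moved to the read name
discarded = match.group("discard_1")       # would be removed from the read
remaining = read[match.end("discard_1"):]  # what is left for guide matching
```

Reads that do not match the pattern at all (for example, reads shorter than 95 nt here) would have no UMI to extract.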
If you have set guides = true (the default):
- guides_filename: the name of a file in the inputs directory containing guide sequences.
Here is an example of the sample sheet:
sample_id | genome | reads | guides_filename | pam | scaffold | adapter_5prime | adapter_3prime | umi_pattern |
---|---|---|---|---|---|---|---|---|
lib001 | GCA_003076915.1 | FAU6865A42_*_R1 | guides.csv | Spy | PerturbSeq | ^N{8}TCGACTGAGCTGAAAGAAT | GTTTAAGAGCTATGCTGG | ^(?P<umi_1>.{8})(?P<discard_1>.{86}).+$ |
lib002 | GCA_003076915.1 | FAU6865A43_*_R1 | guides.csv | Spy | PerturbSeq | ^N{8}TCGACTGAGCTGAAAGAAT | GTTTAAGAGCTATGCTGG | ^(?P<umi_1>.{8})(?P<discard_1>.{86}).+$ |
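A quick pre-flight check of the sample sheet header can catch missing columns before the pipeline starts. This is a minimal sketch based on the column descriptions above; the function `missing_columns` and its keyword arguments are hypothetical and simply mirror the from_sra, use_umis and guides parameters.

```python
import csv
import io

# Columns described above as always required.
BASE_COLUMNS = [
    "sample_id", "genome", "pam", "scaffold",
    "adapter_5prime", "adapter_3prime",
]


def missing_columns(csv_text, from_sra=False, use_umis=False, guides=True):
    """Return the expected sample sheet columns that are absent from the header."""
    required = list(BASE_COLUMNS)
    required.append("Run" if from_sra else "reads")
    if use_umis:
        required.append("umi_pattern")
    if guides:
        required.append("guides_filename")
    header = next(csv.reader(io.StringIO(csv_text)))
    return [column for column in required if column not in header]


# Header only; a real sheet would have one further line per sample.
sheet = "sample_id,genome,reads,guides_filename,pam,scaffold,adapter_5prime,adapter_3prime\n"
problems = missing_columns(sheet, use_umis=True)
```

In this example the sheet lacks umi_pattern, which only matters because use_umis is switched on.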
And here is an example of the guides_filename (guides.csv in this example):
Name | guide_sequence |
---|---|
guide001 | TCGACTGAGCTGAAAGAAT |
guide002 | GTTTAAGAGCTATGCTGGT |
It is also possible to provide the guides as a fasta file:
>guide001
TCGACTGAGCTGAAAGAAT
>guide002
GTTTAAGAGCTATGCTGGT
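The two guide formats carry the same information, so converting between them is straightforward. The sketch below turns a guides CSV into FASTA text; the function name is made up, and its default column names follow the name_column and sequence_column defaults listed in the parameters.

```python
import csv
import io


def guides_csv_to_fasta(csv_text, name_column="Name", sequence_column="guide_sequence"):
    """Convert a guides CSV (as text) into FASTA-formatted text."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return "".join(
        f">{row[name_column]}\n{row[sequence_column]}\n" for row in rows
    )


guides_csv = (
    "Name,guide_sequence\n"
    "guide001,TCGACTGAGCTGAAAGAAT\n"
    "guide002,GTTTAAGAGCTATGCTGGT\n"
)
fasta_text = guides_csv_to_fasta(guides_csv)
```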
You don't need to provide gene annotation information, because the pipeline will map these guides back to the genome and annotate the features for you.
Outputs are saved in the output directory defined in the config file. They are organised under four directories:
- processed: FASTQ files and logs resulting from trimming and UMI extraction
- mapped: FASTQ files and logs resulting from mapping features
- counts: tables and plots relating to UMI and read counts
- multiqc: HTML report on processing steps
Add to the issue tracker.
Here are the help pages of the software used by this pipeline.