NeoFlow: a proteogenomics pipeline for neoantigen discovery
NeoFlow includes four modules:
- Variant annotation and customized database construction: neoflow_db.nf
- Variant peptide identification: neoflow_msms.nf
- HLA typing: neoflow_hlatyping.nf
- Neoantigen prediction: neoflow_neoantigen.nf
NeoFlow supports both label-free and iTRAQ/TMT data.
- Download NeoFlow:
git clone https://github.com/bzhanglab/neoflow
- Install Docker (>=19.03).
- Install Nextflow. More information can be found on the Nextflow "Get started" page.
- Install ANNOVAR by following the instructions at http://annovar.openbioinformatics.org/en/latest/.
- Install netMHCpan 4.0 by following the instructions at http://www.cbs.dtu.dk/services/doc/netMHCpan-4.0.readme. Please set TMPDIR to /tmp in the file netMHCpan-4.0/netMHCpan, as shown below:
# determine where to store temporary files (must be writable to all users)
if ( ${?TMPDIR} == 0 ) then
setenv TMPDIR /tmp
endif
- Install nvidia-docker (>=2.2.2) for AutoRT by following the instructions at https://github.com/NVIDIA/nvidia-docker. This step is optional; it is only required for RT-based validation of novel peptide identifications using AutoRT.
All other tools used by NeoFlow have been dockerized and will be installed automatically the first time NeoFlow is run on a computer.
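Before the first run, it can help to confirm that the manually installed prerequisites are reachable. The following is a minimal sketch, not part of NeoFlow: it only checks that executables named docker and nextflow are on PATH (both names are assumptions; a local install may use ./nextflow instead) and does not verify versions.

```python
import shutil

# Assumed executable names; adjust if Nextflow is invoked as ./nextflow.
REQUIRED_TOOLS = ("docker", "nextflow")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be located on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("Docker and Nextflow were both found on PATH.")
```

The `which` argument exists only so the lookup can be replaced in tests; ANNOVAR and netMHCpan are passed to NeoFlow as folders, so they are not checked here.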
$ nextflow run neoflow_db.nf --help
N E X T F L O W ~ version 19.10.0
Launching `neoflow_db.nf` [irreverent_faggin] - revision: 741bf1a931
=========================================
neoflow => variant annotation and customized database construction
=========================================
Usage:
nextflow run neoflow_db.nf
Arguments:
--vcf_file A txt file contains VCF file(s)
--annovar_dir ANNOVAR folder
--protocol The parameter of "protocol" for ANNOVAR, default is "refGene"
--ref_dir ANNOVAR annotation data folder
--ref_ver The genome version, hg19 or hg38, default is "hg19"
--out_dir Output folder, default is "./output"
--cpu The number of CPUs
--help Print help message
The input file for parameter --vcf_file is a tab-delimited text file which contains the path of the variant file(s). Each variant file can be in VCF format or in a simple text-based format (ANNOVAR input format). The format of the input file for --vcf_file is shown below:
experiment | sample | file | file_type |
---|---|---|---|
TMT01 | T1 | T1_somatic.vcf;T1_rna.vcf | somatic;rna |
TMT01 | T2 | T2_somatic.vcf;T2_rna.vcf | somatic;rna |
TMT02 | T3 | T3_somatic.vcf;T3_rna.vcf | somatic;rna |
TMT02 | T4 | T4_somatic.vcf;T4_rna.vcf | somatic;rna |
The experiment column holds the name of the label-free, TMT or iTRAQ experiment, and the sample column holds the sample name. For iTRAQ or TMT data, samples from the same iTRAQ or TMT experiment should share the same experiment name. For label-free data, each sample should have a different experiment name. All variant files for the same sample (for example, a somatic variant VCF file and a VCF file of variants called from RNA-Seq data) should be listed in the same row, in the file column, separated by ";". The file_type column gives the variant type of each VCF file in the file column, in the same order. Please note that all variant files should be under the folder where you run NeoFlow. We recommend providing an absolute path for each variant file in the input file for --vcf_file.
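The input file for --vcf_file can also be generated by a script. The sketch below is an illustration rather than a NeoFlow utility: the helper names manifest_row and manifest_text are hypothetical, and it assumes the file starts with the header row shown in the table above. Each file is converted to an absolute path, and the file and file_type entries are written in matching order.

```python
import csv
import io
import os

COLUMNS = ["experiment", "sample", "file", "file_type"]


def manifest_row(experiment, sample, files_by_type):
    """Build one row; files_by_type maps a file_type (e.g. "somatic") to a path."""
    if not files_by_type:
        raise ValueError(f"sample {sample!r} has no variant files")
    types = list(files_by_type)
    # Absolute paths, joined in the same order as their file types.
    paths = [os.path.abspath(files_by_type[t]) for t in types]
    return {
        "experiment": experiment,
        "sample": sample,
        "file": ";".join(paths),
        "file_type": ";".join(types),
    }


def manifest_text(rows):
    """Render rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    rows = [
        manifest_row("TMT01", "T1", {"somatic": "T1_somatic.vcf", "rna": "T1_rna.vcf"}),
        manifest_row("TMT01", "T2", {"somatic": "T2_somatic.vcf", "rna": "T2_rna.vcf"}),
    ]
    print(manifest_text(rows), end="")
```

For label-free data, the same helper would be called with a distinct experiment name for every sample.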
The ANNOVAR annotation data (--ref_dir) can be downloaded by following the instructions at http://annovar.openbioinformatics.org/en/latest/user-guide/download/.
The output files of neoflow_db.nf include customized protein databases in FASTA format for each experiment and variant annotation result files for each sample.
nextflow run neoflow_db.nf --ref_dir /data/tools/annovar/humandb_hg19/ \
--vcf_file example_data/test_vcf_files.tsv \
--annovar_dir /data/tools/annovar/ \
--ref_ver hg19 \
--out_dir output
Please update the inputs for parameters --ref_dir and --annovar_dir before running the above example. The input file for --vcf_file can be downloaded from the example data prepared for testing. After downloading the example data, unzip it; all the testing data are then available in the example_data folder.
The above example runs in less than 5 minutes on a Linux server with 40 cores.
Please note that the customized database generated in the first step will be used in this step.
$ ./nextflow run neoflow_msms.nf --help
N E X T F L O W ~ version 19.10.0
Launching `neoflow_msms.nf` [drunk_nobel] - revision: 6d58fb19bd
=========================================
neoflow => Variant peptide identification
=========================================
Usage:
nextflow run neoflow-msms.nf
MS/MS searching arguments:
--db The customized protein database (target + decoy sequences) in FASTA format which is generated by neoflow_db.nf
--ms MS/MS data in MGF format
--msms_para_file Parameter file for MS/MS searching
--out_dir Output folder, default is "./"
--prefix The prefix of output files
--search_engine The search engine used for MS/MS searching, comet=Comet, msgf=MS-GF+ or xtandem=X!Tandem
PepQuery arguments:
--pv_enzyme Enzyme used for protein digestion. 0:Non enzyme, 1:Trypsin (default), 2:Trypsin (no P rule), 3:Arg-C, 4:Arg-C (no P rule), 5:Arg-N, 6:Glu-C, 7:Lys-C
--pv_c The max missed cleavages, default is 2
--pv_tol Precursor ion m/z tolerance, default is 10
--pv_tolu The unit of --tol, ppm or Da. Default is ppm
--pv_itol The error window for fragment ion, default is 0.5
--pv_fixmod Fixed modification. The format is like : 1,2,3. Different modification is represented by different number
--pv_varmod Variable modification. The format is the same with --fixMod;
--pv_refdb Reference protein database
AutoRT parameters:
--rt_validation Perform RT based validation
--help Print help message
The output files of neoflow_msms.nf include raw MS/MS search identification files, FDR estimation result files at both the PSM and peptide levels, and PepQuery validation result files.
nextflow run neoflow_msms.nf --ms example_data/mgf/ \
--msms_para_file example_data/comet_parameter.txt \
--search_engine comet \
--db output/customized_database/neoflow_crc_target_decoy.fasta \
--out_dir output \
--pv_refdb output/customized_database/ref.fasta \
--pv_tol 20 \
--pv_itol 0.05
The input files for --ms and --msms_para_file can be downloaded from the example data prepared for testing.
The variant peptide identification result is written to output/novel_peptide_identification/novel_peptides_psm_pepquery.tsv.
The above example runs in less than 15 minutes on a Linux server with 40 cores.
$ ./nextflow run neoflow_hlatyping.nf --help
N E X T F L O W ~ version 19.10.0
Launching `neoflow_hlatyping.nf` [spontaneous_hawking] - revision: 5fd970e701
=========================================
neoflow => HLA typing
=========================================
Usage:
nextflow run neoflow_hlatyping.nf
Arguments:
--reads Reads data in fastq.gz or fastq format. For example, "*_{1,2}.fq.gz"
--hla_ref_dir HLA reference folder
--seqtype Read type, dna or rna. Default is dna.
--singleEnd Single end or not, default is false (pair end reads)
--cpu The number of CPUs, default is 6.
--out_dir Output folder, default is "./"
--help Print help message
The output of neoflow_hlatyping.nf is a text file containing the HLA alleles for a sample. This file is generated by OptiType.
nextflow run neoflow_hlatyping.nf --hla_ref_dir example_data/hla_reference \
--reads "example_data/dna/*_{1,2}.fastq.gz" \
--out_dir output/ \
--cpu 40
The input files for --hla_ref_dir and --reads can be downloaded from the example data prepared for testing.
The HLA typing result is written to output/hla_type/sample1/sample1_result.tsv.
The above example runs in less than 10 minutes on a Linux server with 40 cores.
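To inspect the typed alleles programmatically, the HLA typing result can be read as a tab-delimited table. The sketch below assumes the usual OptiType layout, with one result row and allele columns named A1, A2, B1, B2, C1 and C2; this layout is not described in this README, so check it against your own result file. The "HLA-" prefix is added to match the HLA_type values in the neoantigen prediction output.

```python
import csv

# Assumed OptiType column names for the two alleles at each class I locus.
ALLELE_COLUMNS = ("A1", "A2", "B1", "B2", "C1", "C2")


def read_hla_alleles(lines):
    """Return the unique alleles from the first result row, e.g. "HLA-A*01:01".

    `lines` is any iterable of text lines, such as an open file handle.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    row = next(reader, None)
    if row is None:
        return []
    alleles = []
    for column in ALLELE_COLUMNS:
        value = (row.get(column) or "").strip()
        if value and "HLA-" + value not in alleles:
            # Homozygous loci appear twice in the file; keep one copy.
            alleles.append("HLA-" + value)
    return alleles


if __name__ == "__main__":
    path = "output/hla_type/sample1/sample1_result.tsv"
    with open(path, newline="") as handle:
        print(read_hla_alleles(handle))
```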
Please note that the results generated in steps 1-3 will be used in this step.
$ ./nextflow run neoflow_neoantigen.nf --help
N E X T F L O W ~ version 19.10.0
Launching `neoflow_neoantigen.nf` [mighty_roentgen] - revision: e4261baca3
=========================================
neoflow => Neoantigen prediction
=========================================
Usage:
nextflow run neoflow_neoantigen.nf
Arguments:
--var_db Variant (somatic) database in fasta format generated by neoflow_db.nf
--var_info_file Variant (somatic) information in txt format generated by neoflow_db.nf
--ref_db Reference (known) protein database
--hla_type HLA typing result in txt format generated by Optitype
--netmhcpan_dir NetMHCpan 4.0 folder
--var_pep_file Variant peptide identification result generated by neoflow_msms.nf, optional.
--var_pep_info Variant information in txt format for customized database used for variant peptide identification
--prefix The prefix of output files
--out_dir Output directory
--cpu The number of CPUs
--help Print help message
The output of neoflow_neoantigen.nf is a TSV file containing the neoantigen prediction results, as shown below:
Variant_ID | Chr | Start | End | Ref | Alt | Variant_Type | Variant_Function | Gene | mRNA | Neoepitope | Variant_Start | Variant_End | AA_before | AA_after | HLA_type | netMHCpan_binding_affinity_nM | netMHCpan_precentail_rank | protein_var_evidence_pep |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
VAR\|NM_002536\|10054 | chrX | 48418659 | 48418659 | G | A | nonsynonymous SNV | protein-altering | TBC1D25 | NM_002536 | TGFGGHRG | 1 | 1 | A | T | HLA-A*01:01 | 44216.6 | 88.5537 | - |
VAR\|NM_002536\|10054 | chrX | 48418659 | 48418659 | G | A | nonsynonymous SNV | protein-altering | TBC1D25 | NM_002536 | TGFGGHRG | 1 | 1 | A | T | HLA-C*07:01 | 43330 | 73.7774 | - |
VAR\|NM_002536\|10054 | chrX | 48418659 | 48418659 | G | A | nonsynonymous SNV | protein-altering | TBC1D25 | NM_002536 | TGFGGHRG | 1 | 1 | A | T | HLA-B*08:01 | 35925.8 | 70.8561 | - |
VAR\|NM_001348265\|10055 | chrX | 48418659 | 48418659 | G | A | nonsynonymous SNV | protein-altering | TBC1D25 | NM_001348265 | TGFGGHRG | 1 | 1 | A | T | HLA-A*01:01 | 44216.6 | 88.5537 | - |
VAR\|NM_001348265\|10055 | chrX | 48418659 | 48418659 | G | A | nonsynonymous SNV | protein-altering | TBC1D25 | NM_001348265 | TGFGGHRG | 1 | 1 | A | T | HLA-C*07:01 | 43330 | 73.7774 | - |
Column description for the above table:
Variant_ID: variant ID defined by neoflow
Chr: variant chromosome
Start: start position on genome
End: end position on genome
Ref: reference base
Alt: alternative base
Variant_Type: variant type annotated by ANNOVAR
Variant_Function: variant function annotated by ANNOVAR
Gene: gene ID
mRNA: mRNA ID
Neoepitope: neoepitope peptide
Variant_Start: variant start position on neoepitope peptide
Variant_End: variant end position on neoepitope peptide
AA_before: reference amino acid
AA_after: alternative amino acid
HLA_type: HLA type
netMHCpan_binding_affinity_nM: MHC-peptide binding affinity from NetMHCpan 4.0. The lower the value, the higher the binding affinity between MHC and neoepitope peptide.
netMHCpan_precentail_rank: MHC-peptide binding affinity rank from NetMHCpan 4.0
protein_var_evidence_pep: variant peptide. "-" means no variant peptide identified covers the mutation site.
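Because lower netMHCpan_binding_affinity_nM values mean stronger binding, the prediction table can be narrowed down with a short script. This is a minimal sketch and not part of NeoFlow: the function name select_candidates is hypothetical, and the 500 nM cutoff is an illustrative assumption rather than a NeoFlow default. It relies only on the column names described above.

```python
import csv


def select_candidates(lines, max_affinity_nm=500.0, require_peptide_evidence=False):
    """Return rows at or below the affinity cutoff, strongest binders first.

    `lines` is any iterable of tab-delimited text lines that starts with the
    header row, such as an open file handle.
    """
    selected = []
    for row in csv.DictReader(lines, delimiter="\t"):
        affinity = float(row["netMHCpan_binding_affinity_nM"])
        # "-" means no identified variant peptide covers the mutation site.
        has_evidence = row["protein_var_evidence_pep"].strip() != "-"
        if affinity > max_affinity_nm:
            continue
        if require_peptide_evidence and not has_evidence:
            continue
        selected.append(row)
    selected.sort(key=lambda r: float(r["netMHCpan_binding_affinity_nM"]))
    return selected
```

With the default cutoff, none of the five example rows above would be kept, since their affinities are all above 35000 nM.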
nextflow run neoflow_neoantigen.nf --prefix sample1 \
--hla_type output/hla_type/sample1/sample1_result.tsv \
--var_db output/customized_database/sample1-somatic-var.fasta \
--var_info_file output/customized_database/sample1-somatic-varInfo.txt \
--out_dir output/ \
--netmhcpan_dir /data/tools/netMHCpan-4.0/ \
--cpu 40 \
--ref_db output/customized_database/ref.fasta \
--var_pep_file output/novel_peptide_identification/novel_peptides_psm_pepquery.tsv \
--var_pep_info output/customized_database/neoflow_crc_anno-varInfo.txt
Please update the input for parameter --netmhcpan_dir before running the above example.
The neoantigen prediction result is written to output/neoantigen_prediction/sample1_neoepitope_filtered_by_reference_add_variant_protein_evidence.tsv.
The above example runs in less than 30 minutes on a Linux server with 40 cores.
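When several samples are processed, the neoantigen prediction command can be assembled from the outputs of the earlier steps. The sketch below is illustrative only: the function name is hypothetical, and the per-sample file name patterns are inferred from the sample1 example above, so they may not hold for every dataset. It builds the argument list without running anything.

```python
import posixpath
import shlex


def neoantigen_command(sample, out_dir, netmhcpan_dir, cpu=40,
                       var_pep_file=None, var_pep_info=None):
    """Return the argument list for neoflow_neoantigen.nf for one sample."""
    db_dir = posixpath.join(out_dir, "customized_database")
    args = [
        "nextflow", "run", "neoflow_neoantigen.nf",
        "--prefix", sample,
        # File name patterns below are assumptions based on the example.
        "--hla_type", posixpath.join(out_dir, "hla_type", sample, f"{sample}_result.tsv"),
        "--var_db", posixpath.join(db_dir, f"{sample}-somatic-var.fasta"),
        "--var_info_file", posixpath.join(db_dir, f"{sample}-somatic-varInfo.txt"),
        "--ref_db", posixpath.join(db_dir, "ref.fasta"),
        "--netmhcpan_dir", netmhcpan_dir,
        "--out_dir", out_dir,
        "--cpu", str(cpu),
    ]
    # Variant peptide evidence is optional, so add it only when provided.
    if var_pep_file is not None:
        args += ["--var_pep_file", var_pep_file]
    if var_pep_info is not None:
        args += ["--var_pep_info", var_pep_info]
    return args


if __name__ == "__main__":
    command = neoantigen_command("sample1", "output", "/data/tools/netMHCpan-4.0/")
    print(shlex.join(command))
```

The returned list can be passed to subprocess.run, or printed with shlex.join and reviewed before it is executed.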
The test data used for the above examples can be downloaded by clicking test data.
Wen, B., Li, K., Zhang, Y. et al. Cancer neoantigen prioritization through sensitive and reliable proteogenomics analysis. Nature Communications 11, 1759 (2020). https://doi.org/10.1038/s41467-020-15456-w