Skip to content

Commit

Permalink
Much better docs; minor bug fix
Browse files Browse the repository at this point in the history
  • Loading branch information
ckandoth committed Sep 27, 2014
1 parent 5b21653 commit dd2657c
Show file tree
Hide file tree
Showing 2 changed files with 34 additions and 17 deletions.
45 changes: 31 additions & 14 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,8 +2,8 @@ vcf2maf
=======

To convert a [VCF](http://samtools.github.io/hts-specs/) into a [MAF](https://wiki.nci.nih.gov/x/eJaPAQ), each variant must be mapped to only one of all possible gene transcripts/isoforms that it might affect. This selection of a single effect per variant, is often subjective. So this project is an attempt to make the selection criteria smarter, reproducible, and more configurable. And the default criteria must lean towards best practices. Per the current defaults, a single affected transcript is selected per variant, as follows:
1. Sort effects first by transcript [biotype priority](https://github.com/ckandoth/vcf2maf/blob/master/vcf2maf.pl#L75), then by [effect severity](https://github.com/ckandoth/vcf2maf/blob/master/vcf2maf.pl#L17), and finally by transcript length
2. Pick the gene on the top of the list (worst-affected), and choose it's [canonical](http://www.ensembl.org/Help/Glossary?id=346) transcript (if you used VEP, that relies on [CCDS](http://www.ncbi.nlm.nih.gov/CCDS/))
1. Sort effects first by transcript [biotype priority](https://github.com/ckandoth/vcf2maf/blob/master/vcf2maf.pl#L81), then by [effect severity](https://github.com/ckandoth/vcf2maf/blob/master/vcf2maf.pl#L24), and finally by decreasing transcript length
2. Pick the gene on the top of the list (worst-affected), and choose it's [canonical](http://www.ensembl.org/Help/Glossary?id=346) transcript (VEP picks the longest [CCDS](http://www.ncbi.nlm.nih.gov/CCDS/) isoform)
3. If the gene has no canonical transcript tagged (if you used snpEff), choose its longest transcript instead

Quick start
Expand All @@ -20,50 +20,67 @@ If you don't have [VEP](http://useast.ensembl.org/info/docs/tools/vep/index.html

perl vcf2maf.pl --input-vcf data/test.vcf --output-maf data/test.maf

If you'd rather use snpEff, there's an option for that, but make sure you ran snpEff with the `-sequenceOntology` option. In older versions of snpEff, this option was incorrectly spelled as `-sequenceOntolgy`:
If you'd rather use snpEff, there's an option for that:

perl vcf2maf.pl --input-vcf data/test.vcf --output-maf data/test.snpeff.maf --use-snpeff

If you already have a VCF annotated with either VEP or snpEff, you can use those directly:
If you already have a VCF annotated with either VEP or snpEff, you can use those directly. You should have ran VEP with at least these options: `--everything --check_existing --total_length --allele_number --xref_refseq`. And for snpEff use these options: `-hgvs -sequenceOntology`. In older versions of snpEff, `-sequenceOntology` was incorrectly spelled `-sequenceOntolgy`. Feed your VEP/snpEff annotated VCFs into vcf2maf as follows:

perl vcf2maf.pl --input-vep data/test.vep.vcf --output-maf data/test.maf
perl vcf2maf.pl --input-snpeff data/test.snpeff.vcf --output-maf data/test.maf

To fill columns 16 and 17 of the output MAF with tumor/normal sample IDs, and to parse out genotypes and allele counts from the corresponding tumor/normal genotype columns in the VCF:
To fill columns 16 and 17 of the output MAF with tumor/normal sample IDs, and to parse out genotypes and allele counts from matched genotype columns in the VCF, use options `--tumor-id` and `--normal-id`. Skip option `--normal-id` if you didn't have a matched normal:

perl vcf2maf.pl --input-vcf data/test.vcf --output-maf data/test.maf --tumor-id WD1309 --normal-id NB1308

VCFs from variant callers like [VarScan](http://varscan.sourceforge.net/somatic-calling.html#somatic-output) use hardcoded sample IDs TUMOR/NORMAL in the genotype columns of the VCF. To have this script correctly parse the correct genotype columns, while still printing the proper IDs in the output MAF:

perl vcf2maf.pl --input-vcf data/test.vcf --output-maf data/test.maf --tumor-id WD1309 --normal-id NB1308 --vcf-tumor-id TUMOR --vcf-normal-id NORMAL
perl vcf2maf.pl --input-vcf data/test_varscan.vcf --output-maf data/test_varscan.maf --tumor-id WD1309 --normal-id NB1308 --vcf-tumor-id TUMOR --vcf-normal-id NORMAL

If you have VEP in a different folder like `/opt/vep`, and cached in `/srv/vep`, there are options available to point the script there. Similar options available for snpEff too:

perl vcf2maf.pl --input-vcf data/test.vcf --output-maf data/test.maf --vep-path /opt/vep --vep-data /srv/vep
perl vcf2maf.pl --input-vcf data/test.vcf --output-maf data/test.maf --snpeff-path /opt/snpEff --snpeff-data /srv/snpEff/data --use-snpeff
perl vcf2maf.pl --input-vcf data/test.vcf --output-maf data/test.maf --snpeff-path /opt/snpEff --snpeff-data /opt/snpEff/data --use-snpeff

Install VEP
-----------

Ensembl's VEP ([Variant Effect Predictor](http://useast.ensembl.org/info/docs/tools/vep/index.html)) is popular for how it selects a single "canonical transcript" per gene as [detailed here](http://useast.ensembl.org/Help/Glossary?id=346), its CLIA-compliant [HGVS variant format](http://www.hgvs.org/mutnomen/recs.html), and [Sequence Ontology nomenclature](http://useast.ensembl.org/info/genome/variation/predicted_data.html#consequences) for variant effects. It's download-able as a Perl script, so make sure you have [Perl installed](http://www.perl.org/get.html).

On Ubuntu/Debian, sudoers can install VEP's pre-requisite packages like this:

sudo apt-get install -y curl rsync tabix libarchive-extract-perl libarchive-zip-perl libwww-perl libcgi-pm-perl libdbi-perl libdbd-mysql-perl

On CentOS/Redhat/Fedora, here is how to install the equivalent packages:

sudo yum -y install curl rsync tabix perl-Archive-Extract perl-Archive-Zip perl-libwww-perl perl-CGI perl-DBI perl-DBD-mysql perl-Time-HiRes

Download the v76 release of VEP into your home directory:

cd ~/
curl -LO https://github.com/Ensembl/ensembl-tools/archive/release/76.tar.gz
tar -zxf 76.tar.gz --starting-file variant_effect_predictor --transform='s|.*/|vep/|g'
cd vep

To significantly speed up the step where VEP looks up known variants in COSMIC/dbSNP/etc., it's strongly recommended to have [tabix](https://github.com/samtools/tabix) installed, or available in your `$PATH`. Install it with `apt-get install tabix` on Ubuntu/Debian or `yum install tabix` on CentOS/Redhat/Fedora.
Download and unpack VEP's offline cache for GRCh37 and GRCh38 into `~/.vep` (default path that VEP looks in, but can be user-specified):

rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-76/variation/VEP/homo_sapiens_vep_76_GRCh{37,38}.tar.gz ~/.vep/
cat ~/.vep/*.tar.gz | tar -izxf - -C ~/.vep

Install the Ensembl v76 API and download the reference FASTAs for GRCh37 and GRCh38:

cd ~/vep
perl INSTALL.pl --AUTO af --SPECIES homo_sapiens --ASSEMBLY GRCh37
perl INSTALL.pl --AUTO af --SPECIES homo_sapiens --ASSEMBLY GRCh38

Import the Ensembl v76 (Gencode v20) cache for GRCh37 and GRCh38 (writes to `~/.vep` by default). it is packaged for If you were not able to set up tabix, then skip argument `--CONVERT` below:
Convert the offline cache for use with tabix, that significantly speeds up the lookup of known variants:

perl INSTALL.pl --AUTO acf --SPECIES homo_sapiens --ASSEMBLY GRCh37 --CONVERT
perl INSTALL.pl --AUTO acf --SPECIES homo_sapiens --ASSEMBLY GRCh38 --CONVERT
perl convert_cache.pl --species homo_sapiens --version 76_GRCh37
perl convert_cache.pl --species homo_sapiens --version 76_GRCh38

Test running VEP in offline mode, on the provided sample GRCh38 VCF:
Test running VEP in offline mode, on the provided sample GRCh37 and GRCh38 VCFs:

perl variant_effect_predictor.pl --offline --gencode_basic --everything --total_length --allele_number --no_escape --check_existing --xref_refseq --fasta ~/.vep --assembly GRCh38 --input_file example_GRCh38.vcf --output_file example_GRCh38.vep.txt
perl variant_effect_predictor.pl --offline --gencode_basic --everything --total_length --allele_number --no_escape --check_existing --xref_refseq --fasta ~/.vep/homo_sapiens/76_GRCh37 --assembly GRCh37 --input_file example_GRCh37.vcf --output_file example_GRCh37.vep.txt
perl variant_effect_predictor.pl --offline --gencode_basic --everything --total_length --allele_number --no_escape --check_existing --xref_refseq --fasta ~/.vep/homo_sapiens/76_GRCh38 --assembly GRCh38 --input_file example_GRCh38.vcf --output_file example_GRCh38.vep.txt

Install snpEff
--------------
Expand Down
6 changes: 3 additions & 3 deletions vcf2maf.pl
Original file line number Diff line number Diff line change
Expand Up @@ -237,7 +237,7 @@ sub GetBiotypePriority {
}

# Contruct VEP command using our chosen defaults and run it
my $vep_cmd = "perl -I /ifs/e63data/schultzlab/opt/perl5/lib/perl5 $vep_path/variant_effect_predictor.pl --fork 8 --offline --no_stats --everything --check_existing --total_length --allele_number --no_escape --gencode_basic --xref_refseq --assembly $ncbi_build --dir $vep_data --fasta $vep_data/homo_sapiens/76_$ncbi_build --vcf --input_file $input_vcf --output_file $vep_anno";
my $vep_cmd = "perl $vep_path/variant_effect_predictor.pl --fork 4 --offline --no_stats --everything --check_existing --total_length --allele_number --no_escape --gencode_basic --xref_refseq --assembly $ncbi_build --dir $vep_data --fasta $vep_data/homo_sapiens/76_$ncbi_build --vcf --input_file $input_vcf --output_file $vep_anno";

system( $vep_cmd ) == 0 or die "ERROR: Failed to run the VEP annotator!\nCommand: $vep_cmd\n";
( -s $vep_anno ) or warn "WARNING: VEP-annotated VCF file is missing or empty!\nPath: $vep_anno\n";
Expand Down Expand Up @@ -718,12 +718,12 @@ sub GetVariantClassification {
=head1 NAME
vcf2maf.pl - Map effects of variants in a given VCF, and report them in a MAF
vcf2maf.pl - Map effects of variants in a given single-sample VCF, and report them in a MAF
=head1 SYNOPSIS
perl vcf2maf.pl --help
perl vcf2maf.pl --input-vcf test.vcf --output-maf test.maf --tumor-id WD4086 --normal-id NB4086
perl vcf2maf.pl --input-vcf WD4086.vcf --output-maf WD4086.maf --tumor-id WD4086 --normal-id NB4086
=head1 OPTIONS
Expand Down

0 comments on commit dd2657c

Please sign in to comment.