Much better docs; minor bug fix

mskcc · Sep 27, 2014 · dd2657c · dd2657c
1 parent 5b21653
commit dd2657c
Show file tree

Hide file tree

Showing 2 changed files with 34 additions and 17 deletions.
diff --git a/README.md b/README.md
@@ -2,8 +2,8 @@ vcf2maf
 =======
 
 To convert a [VCF](http://samtools.github.io/hts-specs/) into a [MAF](https://wiki.nci.nih.gov/x/eJaPAQ), each variant must be mapped to only one of all possible gene transcripts/isoforms that it might affect. This selection of a single effect per variant, is often subjective. So this project is an attempt to make the selection criteria smarter, reproducible, and more configurable. And the default criteria must lean towards best practices. Per the current defaults, a single affected transcript is selected per variant, as follows:
- 1. Sort effects first by transcript [biotype priority](https://github.com/ckandoth/vcf2maf/blob/master/vcf2maf.pl#L75), then by [effect severity](https://github.com/ckandoth/vcf2maf/blob/master/vcf2maf.pl#L17), and finally by transcript length
- 2. Pick the gene on the top of the list (worst-affected), and choose it's [canonical](http://www.ensembl.org/Help/Glossary?id=346) transcript (if you used VEP, that relies on [CCDS](http://www.ncbi.nlm.nih.gov/CCDS/))
+ 1. Sort effects first by transcript [biotype priority](https://github.com/ckandoth/vcf2maf/blob/master/vcf2maf.pl#L81), then by [effect severity](https://github.com/ckandoth/vcf2maf/blob/master/vcf2maf.pl#L24), and finally by decreasing transcript length
+ 2. Pick the gene on the top of the list (worst-affected), and choose it's [canonical](http://www.ensembl.org/Help/Glossary?id=346) transcript (VEP picks the longest [CCDS](http://www.ncbi.nlm.nih.gov/CCDS/) isoform)
  3. If the gene has no canonical transcript tagged (if you used snpEff), choose its longest transcript instead
 
 Quick start
@@ -20,50 +20,67 @@ If you don't have [VEP](http://useast.ensembl.org/info/docs/tools/vep/index.html
 
     perl vcf2maf.pl --input-vcf data/test.vcf --output-maf data/test.maf
 
-If you'd rather use snpEff, there's an option for that, but make sure you ran snpEff with the `-sequenceOntology` option. In older versions of snpEff, this option was incorrectly spelled as `-sequenceOntolgy`:
+If you'd rather use snpEff, there's an option for that:
 
     perl vcf2maf.pl --input-vcf data/test.vcf --output-maf data/test.snpeff.maf --use-snpeff
 
-If you already have a VCF annotated with either VEP or snpEff, you can use those directly:
+If you already have a VCF annotated with either VEP or snpEff, you can use those directly. You should have ran VEP with at least these options: `--everything --check_existing --total_length --allele_number --xref_refseq`. And for snpEff use these options: `-hgvs -sequenceOntology`. In older versions of snpEff, `-sequenceOntology` was incorrectly spelled `-sequenceOntolgy`. Feed your VEP/snpEff annotated VCFs into vcf2maf as follows:
 
     perl vcf2maf.pl --input-vep data/test.vep.vcf --output-maf data/test.maf
     perl vcf2maf.pl --input-snpeff data/test.snpeff.vcf --output-maf data/test.maf
 
-To fill columns 16 and 17 of the output MAF with tumor/normal sample IDs, and to parse out genotypes and allele counts from the corresponding tumor/normal genotype columns in the VCF:
+To fill columns 16 and 17 of the output MAF with tumor/normal sample IDs, and to parse out genotypes and allele counts from matched genotype columns in the VCF, use options `--tumor-id` and `--normal-id`. Skip option `--normal-id` if you didn't have a matched normal:
 
     perl vcf2maf.pl --input-vcf data/test.vcf --output-maf data/test.maf --tumor-id WD1309 --normal-id NB1308
 
 VCFs from variant callers like [VarScan](http://varscan.sourceforge.net/somatic-calling.html#somatic-output) use hardcoded sample IDs TUMOR/NORMAL in the genotype columns of the VCF. To have this script correctly parse the correct genotype columns, while still printing the proper IDs in the output MAF:
 
-    perl vcf2maf.pl --input-vcf data/test.vcf --output-maf data/test.maf --tumor-id WD1309 --normal-id NB1308 --vcf-tumor-id TUMOR --vcf-normal-id NORMAL
+    perl vcf2maf.pl --input-vcf data/test_varscan.vcf --output-maf data/test_varscan.maf --tumor-id WD1309 --normal-id NB1308 --vcf-tumor-id TUMOR --vcf-normal-id NORMAL
 
 If you have VEP in a different folder like `/opt/vep`, and cached in `/srv/vep`, there are options available to point the script there. Similar options available for snpEff too:
 
     perl vcf2maf.pl --input-vcf data/test.vcf --output-maf data/test.maf --vep-path /opt/vep --vep-data /srv/vep
-    perl vcf2maf.pl --input-vcf data/test.vcf --output-maf data/test.maf --snpeff-path /opt/snpEff --snpeff-data /srv/snpEff/data --use-snpeff
+    perl vcf2maf.pl --input-vcf data/test.vcf --output-maf data/test.maf --snpeff-path /opt/snpEff --snpeff-data /opt/snpEff/data --use-snpeff
 
 Install VEP
 -----------
 
 Ensembl's VEP ([Variant Effect Predictor](http://useast.ensembl.org/info/docs/tools/vep/index.html)) is popular for how it selects a single "canonical transcript" per gene as [detailed here](http://useast.ensembl.org/Help/Glossary?id=346), its CLIA-compliant [HGVS variant format](http://www.hgvs.org/mutnomen/recs.html), and [Sequence Ontology nomenclature](http://useast.ensembl.org/info/genome/variation/predicted_data.html#consequences) for variant effects. It's download-able as a Perl script, so make sure you have [Perl installed](http://www.perl.org/get.html).
 
+On Ubuntu/Debian, sudoers can install VEP's pre-requisite packages like this:
+
+    sudo apt-get install -y curl rsync tabix libarchive-extract-perl libarchive-zip-perl libwww-perl libcgi-pm-perl libdbi-perl libdbd-mysql-perl
+
+On CentOS/Redhat/Fedora, here is how to install the equivalent packages:
+
+    sudo yum -y install curl rsync tabix perl-Archive-Extract perl-Archive-Zip perl-libwww-perl perl-CGI perl-DBI perl-DBD-mysql perl-Time-HiRes
+
 Download the v76 release of VEP into your home directory:
 
     cd ~/
     curl -LO https://github.com/Ensembl/ensembl-tools/archive/release/76.tar.gz
     tar -zxf 76.tar.gz --starting-file variant_effect_predictor --transform='s|.*/|vep/|g'
-    cd vep
 
-To significantly speed up the step where VEP looks up known variants in COSMIC/dbSNP/etc., it's strongly recommended to have [tabix](https://github.com/samtools/tabix) installed, or available in your `$PATH`. Install it with `apt-get install tabix` on Ubuntu/Debian or `yum install tabix` on CentOS/Redhat/Fedora.
+Download and unpack VEP's offline cache for GRCh37 and GRCh38 into `~/.vep` (default path that VEP looks in, but can be user-specified):
+
+    rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-76/variation/VEP/homo_sapiens_vep_76_GRCh{37,38}.tar.gz ~/.vep/
+    cat ~/.vep/*.tar.gz | tar -izxf - -C ~/.vep
+
+Install the Ensembl v76 API and download the reference FASTAs for GRCh37 and GRCh38:
+
+    cd ~/vep
+    perl INSTALL.pl --AUTO af --SPECIES homo_sapiens --ASSEMBLY GRCh37
+    perl INSTALL.pl --AUTO af --SPECIES homo_sapiens --ASSEMBLY GRCh38
 
-Import the Ensembl v76 (Gencode v20) cache for GRCh37 and GRCh38 (writes to `~/.vep` by default). it is packaged for  If you were not able to set up tabix, then skip argument `--CONVERT` below:
+Convert the offline cache for use with tabix, that significantly speeds up the lookup of known variants:
 
-    perl INSTALL.pl --AUTO acf --SPECIES homo_sapiens --ASSEMBLY GRCh37 --CONVERT
-    perl INSTALL.pl --AUTO acf --SPECIES homo_sapiens --ASSEMBLY GRCh38 --CONVERT
+    perl convert_cache.pl --species homo_sapiens --version 76_GRCh37
+    perl convert_cache.pl --species homo_sapiens --version 76_GRCh38
 
-Test running VEP in offline mode, on the provided sample GRCh38 VCF:
+Test running VEP in offline mode, on the provided sample GRCh37 and GRCh38 VCFs:
 
-    perl variant_effect_predictor.pl --offline --gencode_basic --everything --total_length --allele_number --no_escape --check_existing --xref_refseq --fasta ~/.vep --assembly GRCh38 --input_file example_GRCh38.vcf --output_file example_GRCh38.vep.txt
+    perl variant_effect_predictor.pl --offline --gencode_basic --everything --total_length --allele_number --no_escape --check_existing --xref_refseq --fasta ~/.vep/homo_sapiens/76_GRCh37 --assembly GRCh37 --input_file example_GRCh37.vcf --output_file example_GRCh37.vep.txt
+    perl variant_effect_predictor.pl --offline --gencode_basic --everything --total_length --allele_number --no_escape --check_existing --xref_refseq --fasta ~/.vep/homo_sapiens/76_GRCh38 --assembly GRCh38 --input_file example_GRCh38.vcf --output_file example_GRCh38.vep.txt
 
 Install snpEff
 --------------

diff --git a/vcf2maf.pl b/vcf2maf.pl
@@ -237,7 +237,7 @@ sub GetBiotypePriority {
             }
 
             # Contruct VEP command using our chosen defaults and run it
-            my $vep_cmd = "perl -I /ifs/e63data/schultzlab/opt/perl5/lib/perl5 $vep_path/variant_effect_predictor.pl --fork 8 --offline --no_stats --everything --check_existing --total_length --allele_number --no_escape --gencode_basic --xref_refseq --assembly $ncbi_build --dir $vep_data --fasta $vep_data/homo_sapiens/76_$ncbi_build --vcf --input_file $input_vcf --output_file $vep_anno";
+            my $vep_cmd = "perl $vep_path/variant_effect_predictor.pl --fork 4 --offline --no_stats --everything --check_existing --total_length --allele_number --no_escape --gencode_basic --xref_refseq --assembly $ncbi_build --dir $vep_data --fasta $vep_data/homo_sapiens/76_$ncbi_build --vcf --input_file $input_vcf --output_file $vep_anno";
 
             system( $vep_cmd ) == 0 or die "ERROR: Failed to run the VEP annotator!\nCommand: $vep_cmd\n";
             ( -s $vep_anno ) or warn "WARNING: VEP-annotated VCF file is missing or empty!\nPath: $vep_anno\n";
@@ -718,12 +718,12 @@ sub GetVariantClassification {
 
 =head1 NAME
 
- vcf2maf.pl - Map effects of variants in a given VCF, and report them in a MAF
+ vcf2maf.pl - Map effects of variants in a given single-sample VCF, and report them in a MAF
 
 =head1 SYNOPSIS
 
  perl vcf2maf.pl --help
- perl vcf2maf.pl --input-vcf test.vcf --output-maf test.maf --tumor-id WD4086 --normal-id NB4086
+ perl vcf2maf.pl --input-vcf WD4086.vcf --output-maf WD4086.maf --tumor-id WD4086 --normal-id NB4086
 
 =head1 OPTIONS