bcbio validation workflows in Common Workflow Language (CWL) generated using bcbio's CWL support. These aim to be compatible with multiple CWL runners.
giab-joint
-- Run a germline GATK4 joint variant calling workflow using three diverse Genome in a Bottle samples from different sequencing technologies:
- NA12878 (Caucasian daughter): 65x NovaSeq TruSeq Nano
- NA24385 (Ashkenazi Jewish son): 50x HiSeq x10 dataset from 10x genomics
- NA24631 (Chinese son): 55x Illumina HiSeq 2500 2x250bp
The original whole genome inputs were subset to exome regions and chr20, preserving the parallelization and other challenges of whole genome analysis while avoiding long runtimes.
The workflow calls each sample independently using GATK HaplotypeCaller, generating a gVCF per sample as input to a combined joint callset.
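The per-sample gVCF calling and joint genotyping steps can be sketched with plain GATK4 commands. This is only an illustration of the general approach, not necessarily the exact commands the bcbio CWL runs. The reference name (`hg38.fa`), the BAM file names and the `chr20` region are hypothetical placeholders. The script prints the commands by default and only executes them when `RUN=1` is set.

```shell
# Dry-run sketch of per-sample gVCF calling followed by joint genotyping.
# REF, REGION and the BAM/VCF file names are hypothetical placeholders.
REF=${REF:-hg38.fa}
REGION=${REGION:-chr20}
SAMPLES="NA12878 NA24385 NA24631"

# Print each command by default; set RUN=1 to execute it instead.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

joint_call() {
  gvcf_args=""
  for s in $SAMPLES; do
    # 1. Call each sample independently in gVCF mode.
    run gatk HaplotypeCaller -R "$REF" -I "$s.bam" -L "$REGION" \
      -ERC GVCF -O "$s.g.vcf.gz"
    gvcf_args="$gvcf_args -V $s.g.vcf.gz"
  done
  # 2. Merge the per-sample gVCFs (word splitting of gvcf_args is intended).
  run gatk CombineGVCFs -R "$REF" $gvcf_args -O combined.g.vcf.gz
  # 3. Jointly genotype the merged gVCF into a single callset.
  run gatk GenotypeGVCFs -R "$REF" -V combined.g.vcf.gz -O joint.vcf.gz
}

joint_call
```

For larger cohorts GATK also offers GenomicsDBImport as an alternative to CombineGVCFs; the simpler merge is used here to keep the sketch short.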
giab-chm
-- This validation expands on the Genome in a Bottle datasets to include a synthetic diploid input derived from two haploid cell lines from Complete Hydatidiform Moles (CHM). This truth set uses PacBio sequencing and is orthogonal to the Genome in a Bottle truth sets, helping identify short read and Illumina-specific bias in those inputs.
We subset the initial read input from ERA596361 to exome regions and chromosome 20. Truth set variants (full.38.vcf.gz) and confidence regions (full.38.bed.gz) are from the version 0.5 20180222 tarball.
Genome in a Bottle exome samples for variant calling:
- NA12878 exome -- From Illumina Basespace Public datasets, 200x NovaSeq S2 using the Illumina TruSeq DNA Library Prep for Enrichment with IDT xGen Exome Research Panel v1.0, trimmed to 2x101 for analysis.
- NA24385 exome -- from Oslo University Hospital's contribution to the Genome in a Bottle Project: 150x Illumina HiSeq 2500 with 150bp paired-end reads, Agilent SureSelect Human All Exon V5 kit
- NA24631 exome -- also from Oslo University Hospital's GiaB contribution, 100x Illumina HiSeq 2500 sequencing.
somatic-giab-mix
-- Somatic tumor/normal variant calling on a sequenced mixture dataset of two Genome in a Bottle samples simulating a lower frequency set of calls. It has a 90x tumor genome consisting of 30% NA12878 (tumor) and 70% NA24385 (germline) and a 30x normal genome of NA24385. Variants unique to NA12878 act as somatic variants at 15% and 30% allele frequency. Since this has well resolved truth sets and is non-synthetic, it provides a minimum baseline for calling lower frequency somatic variants. The HiSeq x10 sequenced data also has some 3' quality issues which help identify false positive and long runtime issues in different calling methods.
The dataset is subset to chr20 and exome regions, similar to the giab-joint and giab-chm examples above.
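The 15% and 30% figures are consistent with simple mixture arithmetic: a variant private to NA12878 contributes alleles only from the 30% of the tumor mixture that is NA12878. The sketch below works this out under the assumptions of a diploid genome with no copy number changes and ideal mixing. The `expected_af` helper is a hypothetical name used only for illustration.

```shell
# Expected allele fraction of a variant private to one sample in a mixture.
# $1: fraction of the mixture contributed by that sample (0.30 for NA12878)
# $2: alternate allele copies in that sample: 1 = heterozygous, 2 = homozygous
# Assumes a diploid genome and no copy number changes.
expected_af() {
  awk -v frac="$1" -v copies="$2" 'BEGIN { printf "%.2f\n", frac * copies / 2 }'
}

expected_af 0.30 1   # heterozygous in NA12878
expected_af 0.30 2   # homozygous in NA12878
```

Observed frequencies in real data will scatter around these values with sampling depth and alignment bias.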
somatic-lowfreq
-- Low frequency variant detection assessment with more difficult features: tumor-only, FFPE and UMI tagged. Input datasets are from the smcounter2 low frequency UMI-based variant caller and Pisces tumor-only variant caller papers:
- RAS-panel (pisces-ras) -- Tumor-only FFPE sample inputs with validated mutations in KRAS/NRAS, from the PISCES supplemental material. This contains 319 samples with single mutations. Not available as of 2021/11.
- NA12878-dilution (pisces-titr) -- Tumor-only low frequency (8%, 12%, 16%) dilutions of NA12878 and NA12877 from the PISCES supplemental material.
- N13532 (smcounter2-umi) -- ~4000x UMI tagged 228-gene panel with 0.5% low frequency SNPs and indels from a NA12878/NA24385 mixture of Genome in a Bottle samples. Truth set and input data URLs are available in smcounter2's supplemental material.
- N0261 (smcounter2-umi) -- ~3500x UMI tagged 0.5% low frequency variants with an emphasis on 269 heterozygous indels, using the same NA12878/NA24385 mixture as N13532.
- M0253 (smcounter2-umi) -- ~5000x QIAseq Actionable Solid Tumor Panel with UMIs for detection of ~0.5% low frequency SNPs from a mixture of Horizon Dx’s Tru-Q 7 reference standard with cancer specific mutations. Truth set and input data URLs are available in smcounter2's supplemental material.
wget -c ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR752/000/SRR7526730/SRR7526730_1.fastq.gz
wget -c ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR752/000/SRR7526730/SRR7526730_2.fastq.gz
NA24385-sv
-- Structural variant calling on the Genome in a Bottle NA24385 (HG002) Ashkenazi sample, compared against the v0.5.0 combined validation set. Uses the 50x HiSeq x10 dataset from 10x genomics from the giab validations, subset to chr20 and exome regions.
HCC2218-sv
-- Somatic structural variant calling on exome sequencing from a breast carcinoma cell line.
The truth set and tumor and normal BAM files are from validations done with
Illumina's Canvas CNV caller.
NA12878-chr20
-- A germline variant calling workflow running a single chromosome subset of the Genome in a Bottle NA12878 sample from Illumina's Platinum Genomes. It runs alignment, variant calling with multiple methods (GATK4 HaplotypeCaller, FreeBayes, Platypus, samtools) and quality control, validating outputs against reference standards. It runs in ~4 hours on an 8 core machine, using ~50GB of disk space (35GB for inputs, 15GB for analyses).
The NA12878 chr20 example is part of the GA4GH-DREAM tool execution challenge and is meant to be easily run on multiple CWL runners. The Synapse bcbio workflow directory mirrors the CWL in this repository and also contains the biological data to run the workflow.
The data for this run is self-contained within Synapse:
synapse get -r syn9725771
pgp
-- Variant, HLA and structural variant calling on Personal Genome Project genomes using the Arvados public cloud.
SGDP-recall-CGC
-- Germline recalling on the Cancer Genomics Cloud using the public Simons Genome Diversity Project Dataset.
Each validation contains a ready to run Common Workflow Language description of the workflow, along with a description of samples pointing to local files from the downloaded biodata directory. If you're an experienced user of an analysis platform this is all you need to run an analysis.
If you'd like to explore more or are not sure where to get started, bcbio-vm contains wrappers and automated tools to help with regenerating CWL from input YAML files and running CWL on multiple platforms. The bcbio CWL documentation contains details about installing and running on each platform. In brief, install bcbio-vm with:
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh -b -p ~/install/bcbio-vm/anaconda
~/install/bcbio-vm/anaconda/bin/conda install --yes -c conda-forge -c bioconda bcbio-nextgen-vm
export PATH=~/install/bcbio-vm/anaconda/bin:$PATH
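After updating `PATH`, a quick check can confirm which tools are visible in the current shell. This is a small sketch: `check_tool` is a hypothetical helper, `bcbio_vm.py` is the bcbio-vm command line entry point, and `cwltool` and `toil-cwl-runner` are example CWL runners that may or may not be present in your installation.

```shell
# Report whether a command is available on the current PATH.
# Returns 0 when found and 1 when missing.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1 -> $(command -v "$1")"
  else
    echo "missing: $1"
    return 1
  fi
}

# The runner names below are examples; adjust to the runners you plan to use.
for tool in bcbio_vm.py cwltool toil-cwl-runner; do
  check_tool "$tool" || true
done
```

A missing runner is not necessarily an error, since each validation only needs the runner you choose to execute it with.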
If you're able to use Docker, then the bcbio-vm wrappers are all you need to install, as the CWL workflows will download Docker containers with the bcbio code and third party tools. Genome and other input data are retrieved separately in the next step, so are also not required at this point. If you're on an HPC or other system without Docker, you can also run CWL workflows with an externally installed bcbio. Use the bcbio automated installation, which will install the tools via bioconda:
wget https://raw.github.com/bcbio/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
python bcbio_nextgen_install.py ~/install/bcbio --tooldir=~/install/bcbio --nodata --isolate
export PATH=~/install/bcbio/bin:$PATH
The GA4GH-DREAM Workflow Execution Challenge hosts the data for these challenges on Synapse in the bcbio biodata folder.
Use the Synapse Python client to download the data. You can install this yourself via pip or conda:
conda install -c conda-forge -c bioconda synapseclient
or it's included with a bcbio-vm installation.
Downloading requires a free account on Synapse to obtain access. First log in to Synapse with your credentials:
synapse login --remember-me
Then use the download_data.sh shell script link in each validation project to get only the data for that run. If you prefer, you can also retrieve the full biodata folder with sample data for multiple validations with:
mkdir biodata
cd biodata
synapse get -r syn10468187
The individual tool run directories contain starter shell scripts for different CWL-enabled tools like Toil and rabix bunny. Runs with these can use either a local bcbio installation of tools or bcbio Docker containers.
You can adjust bcbio_system.yaml in each directory to fit a pipeline run to your current system. This includes options for changing cores, memory usage and locations of input files. After changing it, use run_generate_cwl.sh to create a new CWL matching your input specifications.
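As a rough illustration of the kind of change involved, the sketch below prints a `resources` stanza in the style bcbio system configuration files generally use, sized to the current machine. The `system_resources` helper and the `3G` default are hypothetical. The exact keys in each validation's bcbio_system.yaml may differ, so compare against the file in the directory before editing it.

```shell
# Print a resources stanza sized to the current machine.
# $1: cores to use (defaults to the detected CPU count, or 1 if unknown)
# $2: memory value (defaults to 3G, an arbitrary example; bcbio generally
#     treats this setting as memory per core)
system_resources() {
  cores=${1:-$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)}
  memory=${2:-3G}
  cat <<EOF
resources:
  default:
    cores: ${cores}
    memory: ${memory}
EOF
}

system_resources 8 4G
```

The printed stanza is meant to be merged by hand into bcbio_system.yaml, followed by a run of run_generate_cwl.sh so the CWL reflects the new values.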
All CWL workflows, scripts and documentation are freely available for all uses under the MIT license.