Tumor-normal variant calling workflow using HiFi reads

PacificBiosciences/HiFi-somatic-WDL

HiFi somatic WDL

A tumor-normal variant-calling workflow for HiFi data

See our tech note here on somatic variant detection with HiFi sequencing.


Installation and Dependencies

The workflow is written in WDL (Workflow Description Language) version 1.0. It depends on miniwdl and Singularity (version 3 or later). miniwdl can be installed using Bioconda. The workflow can also be run using Cromwell (see here for instructions; tested on version 86).
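
Before launching the workflow, it can help to confirm that both dependencies are available on the PATH. The following is a minimal Python sketch and is not part of the workflow itself. The helper name `find_missing_tools` is hypothetical, and the sketch assumes the executables are named `miniwdl` and `singularity`; it does not check versions.

```python
import shutil


def find_missing_tools(tools=("miniwdl", "singularity")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("miniwdl and singularity were both found on PATH.")
```

An empty list means both executables were located; any name in the list still needs to be installed or added to the PATH.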

Usage

A step-by-step tutorial demonstrating usage and an FAQ can be found here.

Important outputs from workflow

The workflow generates the following results (a non-exhaustive list) in the $OUTDIR/_LAST/out folder. Please refer to output for a more detailed description of the outputs. An example of the final HTML report from the COLO829 dataset can be found here (right-click to save the file, then double-click to open it in a web browser).

| Folder | Types of results |
| --- | --- |
| AnnotatedSeverusSV | Severus structural variants annotated with AnnotSV (TSV, see AnnotSV README) |
| Annotated*SV_intogen | Structural variants annotated with AnnotSV (TSV) overlapping with the Compendium of Cancer Genes (IntOGen May 23) |
| small_variant_vcf_annotated | DeepSomatic or ClairS SNV/INDEL annotated with VEP (VCF, single entry per variant with --pick, see VEP documentation) |
| small_variant_tsv_annotated | VEP annotation for SNV/INDEL in TSV format |
| small_variant_tsv_CCG | SNV/INDEL that are in the Compendium of Cancer Genes (IntOGen May 23) |
| mutsig_SNV_profile | Mutational profile plot (MutationalPatterns) |
| mutsig_SNV | Mutational signature in TSV format (MutationalPatterns) |
| normal_germline_small_variant_vcf_annotated | ClairS/Clair3 germline SNV/INDEL (in normal sample) annotated with VEP (optional, see input JSON parameters) |
| DMR_annotated | Differentially methylated regions annotated with genes/introns/promoters etc. (TSV) |
| DMR_results | Raw differentially methylated regions from DSS (unannotated, TSV) |
| DMR_annotated_CCG | Annotated DMRs (>50 CpG sites) overlapping with the Compendium of Cancer Genes (IntOGen May 23) |
| mosdepth_normal_summary | Depth of coverage of normal (TXT) |
| mosdepth_tumor_summary | Depth of coverage of tumor (TXT) |
| normal_bams_phased | Phased normal BAM file (HiPhase) |
| tumor_bams_hiphase | Phased tumor BAM file (HiPhase) |
| tumor_bams_longphase | Phased tumor BAM file (optional; uses LongPhase for phasing, see input JSON parameters) |
| overall_(tumor\|normal)_alignment_stats | Overall alignment statistics (mapped %) |
| per_alignment_(tumor\|normal)_stats | Statistics (accuracy/n_mismatches/length) for each alignment |
| aligned_RL_summary_(tumor\|normal) | Aligned read length N50, mean and median |
| normal_germline_small_variant_vcf | Germline variants in normal (VCF) |
| tumor_germline_small_variant_vcf | Germline variants in tumor (VCF) |
| pileup_(normal/tumor)_bed | Summarized 5mC probability in normal and tumor (BED, see pb-CpG-tools for format description) |
| cnvkit_cns_with_major_minor_CN | Copy number segments adjusted with purity and ploidy estimate, see cnvkit_output for raw CNVKit result (BED) |
| Severus_filtered_vcf | Severus structural variants (filtered with control VCF, with simple annotation based on svpack) |
| small_variant_vcf | DeepSomatic or ClairS SNV/INDEL (unannotated VCF) |
| Purple_outputs | Purity and ploidy estimate + allele-specific copy number calls from HMFtools suite |
| chord_hrd_prediction | Homologous recombination deficiency (HRD) prediction using CHORD |
| report | HTML report summarizing the results. It can be opened in any modern web browser. The report is only generated if all steps in the pipeline are carried out (e.g. small variant calling, SV annotation). |
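
To see which of these results a given run actually produced, the output folder can be walked programmatically. The sketch below is illustrative only: the helper name `list_outputs` is hypothetical, and it assumes that each result appears as a subfolder of $OUTDIR/_LAST/out (possibly through symbolic links, which are followed).

```python
import os
from pathlib import Path


def list_outputs(outdir):
    """Map each subfolder of <outdir>/_LAST/out to the sorted files beneath it."""
    out_root = Path(outdir) / "_LAST" / "out"
    results = {}
    if not out_root.is_dir():
        # Nothing to report if the run folder does not exist.
        return results
    for entry in sorted(out_root.iterdir()):
        if not entry.is_dir():
            continue
        files = []
        # followlinks=True so that symlinked result folders are also walked.
        for dirpath, _dirnames, filenames in os.walk(entry, followlinks=True):
            for name in filenames:
                files.append(os.path.relpath(os.path.join(dirpath, name), entry))
        results[entry.name] = sorted(files)
    return results


if __name__ == "__main__":
    # OUTDIR falls back to the current directory if it is not set.
    for folder, files in list_outputs(os.environ.get("OUTDIR", ".")).items():
        print(f"{folder}: {len(files)} file(s)")
```

A folder from the table that is missing from the returned mapping usually means the corresponding optional step was not enabled for that run.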

Demo datasets

PacBio provides data for two cancer cell lines sequenced on Revio systems:

  1. COLO829 (60X tumor, 60X normal): https://downloads.pacbcloud.com/public/revio/2023Q2/COLO829
  2. HCC1395 (60X tumor, 40X normal): https://downloads.pacbcloud.com/public/revio/2023Q2/HCC1395/

More datasets and benchmarking results can be found on the Severus GitHub page and in the DeepSomatic preprint.

References

References of tools used

The following are the references for the tools used in the workflow. Please cite them if you use the workflow. The list may not be exhaustive; we welcome suggestions for additional references.

  1. Zheng, Z. et al. ClairS: a deep-learning method for long-read somatic small variant calling. 2023.08.17.553778 Preprint at https://doi.org/10.1101/2023.08.17.553778 (2023).
  2. English, A. C., Menon, V. K., Gibbs, R. A., Metcalf, G. A. & Sedlazeck, F. J. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biology 23, 271 (2022).
  3. Wang, T. et al. The Human Pangenome Project: a global resource to map genomic diversity. Nature 604, 437–446 (2022).
  4. Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).
  5. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
  6. Sedlazeck, F. J. et al. Accurate detection of complex structural variations using single molecule sequencing. Nat Methods 15, 461–468 (2018).
  7. Pedersen, B. S. & Quinlan, A. R. Mosdepth: quick coverage calculation for genomes and exomes. Bioinformatics 34, 867–868 (2018).
  8. McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biology 17, 122 (2016).
  9. Park, Y. & Wu, H. Differential methylation analysis for BS-seq data under general experimental design. Bioinformatics 32, 1446–1453 (2016).
  10. Talevich, E., Shain, A. H., Botton, T. & Bastian, B. C. CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing. PLoS Comput Biol 12, e1004873 (2016).
  11. Martínez-Jiménez, F. et al. A compendium of mutational cancer driver genes. Nat Rev Cancer 20, 555–572 (2020). https://www.intogen.org
  12. Manders, F. et al. MutationalPatterns: the one stop shop for the analysis of mutational processes. BMC Genomics 23, 134 (2022).
  13. Lin, J.-H., Chen, L.-C., Yu, S.-C. & Huang, Y.-T. LongPhase: an ultra-fast chromosome-scale phasing algorithm for small and large variants. Bioinformatics 38, 1816–1822 (2022).
  14. HMFtools suite (Amber, Cobalt and Purple): https://github.com/hartwigmedical/hmftools/tree/master.
  15. Park, J. et al. DeepSomatic: Accurate somatic small variant discovery for multiple sequencing technologies. 2024.08.16.608331 Preprint at https://doi.org/10.1101/2024.08.16.608331 (2024).
  16. Elrick, H. et al. SAVANA: reliable analysis of somatic structural variants and copy number aberrations in clinical samples using long-read sequencing. 2024.07.25.604944 Preprint at https://doi.org/10.1101/2024.07.25.604944 (2024).
  17. Nguyen, L., Martens, J. W. M., Van Hoeck, A. & Cuppen, E. Pan-cancer landscape of homologous recombination deficiency. Nat Commun 11, 5584 (2020).
  18. Keskus, A. et al. Severus: accurate detection and characterization of somatic structural variation in tumor genomes using long reads. 2024.03.22.24304756 Preprint at https://doi.org/10.1101/2024.03.22.24304756 (2024).

Tools versions

Tools used in the workflow
| Tool | Version | Purpose | Container |
| --- | --- | --- | --- |
| pbmm2 | 1.14.99 | Alignment of HiFi reads | quay.io |
| pbtk | 3.1.0 | Merging HiFi reads | quay.io |
| samtools | 1.17 | Various tasks manipulating BAM files | quay.io |
| VEP | 110.1 | Annotation of small variants | docker |
| AnnotSV | 3.4.2 | Annotation of structural variants | quay.io |
| DSS | 2.48.0 | Differential methylation | self-hosted on quay.io |
| annotatr | 1.26.0 | Annotation of differentially methylated regions (DMR) | self-hosted on quay.io |
| ClairS | 0.3.0 | Somatic SNV and INDEL caller | docker |
| bcftools | 1.17 | Manipulation of VCF | quay.io |
| CNVKit | 0.9.10 | Copy number segmentation | quay.io |
| Truvari | 4.0.0 | Filtering of control structural variants (deprecated; svpack is used instead) | quay.io |
| bedtools | 2.31.0 | Splitting genome intervals for parallelization | quay.io |
| mosdepth | 0.3.4 | Calculating depth of coverage | quay.io |
| pb-CpG-tools | 2.3.1 | Summarizing 5mC probability | quay.io |
| HiPhase | 1.4.5 | Diploid phasing using germline variants | quay.io |
| slivar | 0.3.0 | Selecting/filtering variants from VCF | quay.io |
| Severus | 1.2 | Structural variants | quay.io |
| seqkit | 2.5.1 | Aligned BAM statistics | quay.io |
| csvtk | 0.27.2 | Aligned BAM statistics summary and other CSV/TSV operations | quay.io |
| IntOGen | May 31 2023 | Compendium of Cancer Genes for annotation | self-hosted on quay.io |
| MutationalPatterns | 3.10.0 | Mutational signatures based on SNV | quay.io |
| LongPhase | v1.5.2 | Optional phasing tool | quay.io |
| Amber | v4.0 | BAF segmentation (HMFtools suite) | self-hosted on quay.io |
| Cobalt | v1.16.0 | Log ratio segmentation (HMFtools suite) | self-hosted on quay.io |
| Purple | v4.0 | Purity and ploidy estimate, somatic CNV (HMFtools suite) | self-hosted on quay.io |
| DeepSomatic | v1.7.0 | Somatic SNV/INDEL caller | docker |
| CHORD | v2.0.0 | HRD prediction | docker |
| SAVANA | v1.2.3 | Structural variant and copy number variant caller | quay.io |

Change logs

  • v0.8.1:

    • Moved the BND square-bracket annotation in the VCF to the INFO field to prevent AnnotSV from harmonizing the BND format (harmonization does not work well with long-read SVs).
    • Fixed a bug preventing skip_align from working properly.
    • Added option to produce SAVANA output. This is experimental and can be enabled with hifisomatic.run_savana in the input JSON.
      • Note that the SAVANA output is currently not annotated or used for any further downstream processing.
    • Updated Severus to 1.2.0.
    • Now runs bcftools norm on small variants before annotation.
    • Updated HiPhase to 1.4.5.
    • Updated reference list.
  • v0.8:

    • Updated DeepSomatic to v1.7.0. This resulted in a significant improvement in INDEL recall. See the benchmark in the DeepSomatic preprint for more comparisons.
    • Updated AnnotSV to 3.4.2. Please update AnnotSV cache by following the instructions in the step-by-step tutorial here.
    • Updated pbmm2 to 1.14.99 (With -A2 option for better alignment of some complex SV with short supplementary segments, e.g. truthset_41 in COLO829).
    • Updated Severus to version 1.1.
      • Note that truthset_19 in COLO829 becomes a false negative (FN) with Severus 1.1. See issue here.
    • Updated the report format to make it easier to read.
    • Resource bundle now uses germline SVs called with Severus instead of the previous Sniffles2 SV set. Please update resource bundle from Zenodo.
    • Simplified kinetics stripping directly in pbmm2.
    • Modified Amber to use the same pcf gamma as Cobalt. It was previously using a value of 100, while Cobalt was using 1000. This change will make the segmentation more consistent between Amber and Cobalt and should improve purity/ploidy estimates.
    • Better logic with merging BAMs in the workflow (No more redundant merging when n_bam=1).
    • Suppressed a warning that caused Cobalt to fail on a CIGAR error in the BAM file. This is a known issue.
    • Incorporated pull request from here for Cromwell on Azure (not tested).
  • v0.7:

    • Updated DeepSomatic to v1.6.1.
    • Pipeline now calls DeepSomatic in chunks.
  • v0.6.2:

    • Added experimental CHORD HRD (Homologous Recombination Deficiency) prediction. See here for details.
    • Renamed some variables (legacy Sniffles parameters).
    • Swapped the CNVkit visualization for Purple in the report.
    • Made changes in WDL for compatibility with Cromwell.
    • Allowed specifying min/max purity/ploidy for Purple (in the task WDL).
  • v0.6.1:

    • Updated documentation and benchmark.
    • Small bugfix in tabix indexing of VCFs.
    • Added HTML report for summary metrics.
    • Fixed a bug where germline VCFs were not output when using DeepSomatic.
    • Fixed a bug introduced in svpack filtering where entries with SVLEN=0 were filtered out (they should not be).
  • v0.6:

    • Added DeepSomatic 1.6.0 (Experimental and disabled by default. Enable with hifisomatic.use_deepsomatic in input JSON). Note that DeepSomatic is computationally expensive compared to ClairS so we recommend disabling it if computational resources are limited. See benchmark here for comparison between ClairS and DeepSomatic.
    • Now uses svpack to filter for control SVs (previously truvari) and provides simple annotation in the filtered VCF.
    • Switched to using samtools to strip kinetics.
  • v0.5:

    • Updated Cobalt to 1.16.0. It now counts read depth correctly. See here for details.
    • Containers are now on pacbio quay.io.
    • SV calling now only uses Severus.
    • Individual tasks now output the version number in stdout.
  • v0.4:

    • Added purity, ploidy and somatic CNV estimation with Amber, Cobalt and Purple.
      • Note that Cobalt doesn't count the read depth from long reads correctly, which affects the segmentation accuracy. However, purity and ploidy estimation appears to be robust.
    • CNVKit segmentation results are now recalled with the purity and ploidy estimates from Purple.
    • Severus release now updated with Bioconda container.
    • Fixed an issue when call_smallvariants is set to false (Issue #1).
  • v0.3:

    • Added IntOGen filtering of SV/SNV/INDEL/DMR.
    • Added mutational signature analysis.
    • Added germline small variants annotation with VEP (optional).
    • Added Longphase as an optional phasing tool.
    • Better documentation of output in output.
  • v0.2:

    • Downgraded to WDL 1.0 for better compatibility.
    • Added runtime attributes to tasks for future support on cloud (not tested yet).
  • v0.1: Initial release.
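
Several entries above refer to switches in the input JSON, such as hifisomatic.run_savana and hifisomatic.use_deepsomatic. The following Python sketch shows one way to set them before a run. It is an illustration, not the workflow's own tooling: the helper name `set_workflow_flags` is hypothetical, and the sketch assumes these switches take boolean values and that the rest of the inputs should be left untouched.

```python
import json


def set_workflow_flags(inputs, **flags):
    """Return a copy of `inputs` with each hifisomatic.<flag> key set."""
    updated = dict(inputs)
    for name, value in flags.items():
        # Keys follow the "hifisomatic.<option>" pattern used in the changelog.
        updated[f"hifisomatic.{name}"] = value
    return updated


if __name__ == "__main__":
    # Placeholder starting point; a real inputs file holds many more keys.
    inputs = {}
    inputs = set_workflow_flags(inputs, run_savana=True, use_deepsomatic=True)
    print(json.dumps(inputs, indent=2))
```

In practice the starting dictionary would be loaded from an existing inputs file with `json.load` and written back with `json.dump` after the flags are set.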

DISCLAIMER

TO THE GREATEST EXTENT PERMITTED BY APPLICABLE LAW, THIS WEBSITE AND ITS CONTENT, INCLUDING ALL SOFTWARE, SOFTWARE CODE, SITE-RELATED SERVICES, AND DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. ALL WARRANTIES ARE REJECTED AND DISCLAIMED. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THE FOREGOING. PACBIO IS NOT OBLIGATED TO PROVIDE ANY SUPPORT FOR ANY OF THE FOREGOING, AND ANY SUPPORT PACBIO DOES PROVIDE IS SIMILARLY PROVIDED WITHOUT REPRESENTATION OR WARRANTY OF ANY KIND. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A REPRESENTATION OR WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACBIO.
