Pipelines and tools for WES and WGS data processing
- Create submission script for rsync of .fastq files from RDS to HPC
- Creates submission script for bwa_gatk.sh
- Creates submission script for rsync of bwa_katk.sh results from HPC to RDS. Script will remove -t folder after rsync results.
- Limit of parallel job to run concurrently with -p option. Default is 1. I.e. job will run sequentially.
- Option to define path for the result folder with -o option.
- Option to define path for storing the fastq processed with -kfq option. This is useful for converting bam to fastq.gz for reference.
- Script support for fastq to vcf and bam to fastq to vcf flow.
- Index the reference FASTA sequence for use with BWA
- Align FASTQ reads with BWA-MEM to hg38 with ATL-contigs
- Add read group information, preprocess to make a clean BAM and call variants
- Create unmapped uBAM- This step uses Picard tools RevertSam function to revert sam file produced during alignment to previous state. We clear read attributes that have prevented alignments (XA: Alternative hits; format: (chr,pos,CIGAR,NM;) & XS:Suboptimal alignment score)
- Add read group information to uBAM
- Merge uBAM with aligned BAMi) Performed with Picard MergeBamAlignment. Needs to have FASTA dictionary file (.dic) in same directory as reference sequence.ii) Set UNMAP_CONTAMINANT_READS (Boolean) to “true” which enables detection of reads originating from foreign organisms (e.g. bacterial DNA in a non-bacterial sample), and unmap + label those reads accordingly. iii) When UNMAP_CONTAMINANT_READS is set, then we must setMIN_UNCLIPPED_BASES (Integer), default = 32
7. Base Recalibrator Report & 8. Perform Base Quality Score Recalibration (BQSR) - Currently uncommented as previous studies by Tian et al 2016 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5048557/) found BQSR had virtually negligible effect on INDEL calling and generally reduced sensitivity for SNP calling and reduced the SNP calling sensitivity but improved the precision when the coverage is insufficient. However, in regions of high divergence (e.g., the HLA region), BQSR reduced the sensitivity of callers. There is currently not a good argument to support its use if we are using alt-aware alignment. # - This step attempts performs recalibration of base score qualities using machine learning. Read Group information is used to identify the same sample/ run therefore this must be performed prior to changing of read group information.
- Flag duplicate reads- The step flags duplicate reads. OPTICAL_DUPLICATE_PIXEL_DISTANCE; The maximum offset between two duplicate clusters in order to consider them optical duplicates. The default is appropriate for unpatterned versions of the Illumina platform. For the patterned flowcell models, 2500 is moreappropriate. For other platforms and models, users should experiment to find what works best. Default value: 100. This option can be set to 'null' to clear the default value.
- Coordinate sort, fix NM and UQ tags and index for clean BAMi) Picard.SortSam; Sort BAM file using coordinatesii) Picard. SetNmAndUqTags; - NM:i: Edit distance to the reference, including ambiguous bases but excluding clipping. i.e. mismatches to reference- UQ:i: Phred likelihood of the segment, conditional on the mapping being correct. i.e. quality score of the mapping- NOTE; Here we are using picard 2.7.1. With picard 2.10.6 and higher command is SetNmMDAndUqTags however seems to have piping issues at the moment https://gatkforums.broadinstitute.org/gatk/discussion/10104/picard-2-10-7-fails-pipelining-sortsam-and-setnmanduqtags
- Call SNP and indel variants in reference confidence (ERC) mode per sample using HaplotypeCaller
- ANNOVAR .vcf file annotioans with currently implemented annnotations that include; refGene,cytoBand,exac03,exac03nontcga,exac03nonpsych,avsnp147,dbnsfp33a,dbscsnv11,cosmic70,esp6500siv2_ea,esp6500siv2_aa,esp6500siv2_all,gnomad_exome,gnomad_genome,AFR.sites.2015_08,ALL.sites.2015_08,AMR.sites.2015_08,mcap,revel,clinvar_20170130 NOTE: The hg38_avsnp147.txt annotations contained irregular lines. Lines were removed with following script;
RUN SCRIPT
~/scripts/annovar.sh CDS0024.20K_ /project/RDS-SMS-FFbigdata-RW/Genetics/Genomes/whole_genome_sequencing_2012/FASTQ/bwagatk_test/test_271117