Fix detect optical duplicates in TGA workflow #1291

mathiasbio · 2023-10-20T10:20:13Z

Need

Related to this issue: #1282

Optical duplicates have not been detected or tracked in the TGA workflow. The UMI tags in the read header interfere with how Picard MarkDuplicates extracts the flowcell tile information, which is essential for detecting optical duplicates:

Example: @A00689:80:HMNMHDSXX:4:1101:28989:1094:UMI_TCGTC_AGGTT

This is not an issue for WGS, as we don't have UMIs there. In the UMI workflow, we assign the extracted UMIs as tags in the BAM file before read consensus calling.
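
To illustrate why the suffix matters, here is a minimal Python sketch. It assumes the behaviour described in the Picard documentation for the default READ_NAME_REGEX (the last three ':'-separated fields are read as numeric tile, x and y). The function name `tile_x_y` is made up for this example, and it is a simplification, not Picard's actual code.

```python
def tile_x_y(read_name):
    """Return (tile, x, y) from the last three ':'-separated fields, or None.

    Simplified stand-in for the default read-name parsing; not Picard code.
    """
    fields = read_name.split(":")
    if len(fields) < 3:
        return None
    try:
        tile, x, y = (int(f) for f in fields[-3:])
    except ValueError:
        # A non-numeric field (such as a UMI suffix) means no coordinates.
        return None
    return tile, x, y


plain = "A00689:80:HMNMHDSXX:4:1101:28989:1094"
with_umi = plain + ":UMI_TCGTC_AGGTT"

print(tile_x_y(plain))     # coordinates can be parsed
print(tile_x_y(with_umi))  # the UMI suffix shifts the fields, parsing fails
```

With the UMI appended, the last three fields become x, y and the UMI string, so no tile coordinates can be read and no optical duplicates can be called.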

Suggested approach

Try adding the UMIs as tags to the BAM file, similar to the UMI workflow. This should leave the read names unmodified, so Dedup should be able to work with its default settings.

This could later also allow us to use the UMI information during dedup, after updating Sentieon (Update Sentieon #1250).
This feature was added here: https://support.sentieon.com/manual/appendix/releasenotes/#release-202112-06
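
A rough sketch of the idea in plain Python, working on the read name only. The helper name `split_umi` is hypothetical, and the choice of the `RX` tag with hyphen-joined UMIs follows the SAM optional-fields convention; the tag that Sentieon or our UMI workflow actually writes is not stated here, so treat that as an assumption.

```python
def split_umi(read_name, marker="UMI_"):
    """Split a trailing ':UMI_<seq>_<seq>' field off a read name.

    Returns (clean_read_name, tag_string). tag_string is None when the
    read name carries no UMI field. The RX:Z tag is an assumption.
    """
    head, sep, last = read_name.rpartition(":")
    if not sep or not last.startswith(marker):
        return read_name, None
    # Join the two UMI halves with a hyphen, as the SAM spec recommends.
    umi = last[len(marker):].replace("_", "-")
    return head, "RX:Z:" + umi


name, tag = split_umi("A00689:80:HMNMHDSXX:4:1101:28989:1094:UMI_TCGTC_AGGTT")
print(name)
print(tag)
```

The cleaned read name ends in tile:x:y again, so default read-name parsing should work, while the UMI is still available for downstream use.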

Considered alternatives

Modify --READ_NAME_REGEX, if Sentieon Dedup has such an option like Picard MarkDuplicates does, to change which part of the read header the tool looks at for flowcell tile info. https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard-
Completely remove the UMIs from these reads instead of adding them to the header. However, if we don't deliver raw fastqs to the customer, this information is lost for researchers who could previously extract it from the read name in the final BAM file and run their own custom UMI pipeline.
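
For the first alternative, a regex along these lines could skip the UMI field. This is only a sketch tested with Python's `re`; the pattern is my own guess at what would fit our read names (four leading fields, then tile, x and y as the three capture groups, then an optional UMI field), and whether Sentieon Dedup accepts a custom read name regex at all is still the open question.

```python
import re

# Hypothetical pattern: tile, x, y are the three capture groups;
# a trailing ':UMI_...' field is allowed but ignored.
UMI_TOLERANT = re.compile(
    r"^(?:[^:]*:){4}([0-9]+):([0-9]+):([0-9]+)(?::UMI_[A-Z_]+)?$"
)


def tile_x_y_with_regex(read_name):
    """Return (tile, x, y) as ints if the pattern matches, else None."""
    match = UMI_TOLERANT.match(read_name)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


print(tile_x_y_with_regex("A00689:80:HMNMHDSXX:4:1101:28989:1094:UMI_TCGTC_AGGTT"))
print(tile_x_y_with_regex("A00689:80:HMNMHDSXX:4:1101:28989:1094"))
```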

Requests/suggestions/bugs solved by the feature

Record optical duplicates in QC report #1282

Can be closed when

Optical duplicates are detected, as evidenced by non-zero values in the READ_PAIR_OPTICAL_DUPLICATES column of the dedup.metrics file. Example TGA table from the current state below, where the optical duplicates value is 0:

LIBRARY	UNPAIRED_READS_EXAMINED	READ_PAIRS_EXAMINED	SECONDARY_OR_SUPPLEMENTARY_RDS	UNMAPPED_READS	UNPAIRED_READ_DUPLICATES	READ_PAIR_DUPLICATES	READ_PAIR_OPTICAL_DUPLICATES	PERCENT_DUPLICATION	ESTIMATED_LIBRARY_SIZE
Unknown Library	7776	40365947	661574	10128	7047	14367911	0	0,355994	42248922
Unknown Library	6150	25823111	540829	8460	5402	6595770	0	0,255495	41515496
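
The closing condition could be checked automatically with something like the sketch below. It assumes the metrics table is tab-separated, starts at the line beginning with `LIBRARY` and ends at the first blank line; the function names are made up for illustration. The sample text is the first row of the table above.

```python
import csv

COLUMN = "READ_PAIR_OPTICAL_DUPLICATES"


def optical_duplicate_counts(metrics_text):
    """Return the optical duplicate count for every row of the metrics table."""
    lines = metrics_text.splitlines()
    # Skip any comment lines before the header row.
    start = next(i for i, line in enumerate(lines) if line.startswith("LIBRARY\t"))
    table = []
    for line in lines[start:]:
        if not line.strip():
            break  # a blank line ends the table
        table.append(line)
    return [int(row[COLUMN]) for row in csv.DictReader(table, delimiter="\t")]


def has_optical_duplicates(metrics_text):
    """True if at least one library reports a non-zero optical duplicate count."""
    return any(count > 0 for count in optical_duplicate_counts(metrics_text))


HEADER = "\t".join([
    "LIBRARY", "UNPAIRED_READS_EXAMINED", "READ_PAIRS_EXAMINED",
    "SECONDARY_OR_SUPPLEMENTARY_RDS", "UNMAPPED_READS",
    "UNPAIRED_READ_DUPLICATES", "READ_PAIR_DUPLICATES",
    "READ_PAIR_OPTICAL_DUPLICATES", "PERCENT_DUPLICATION",
    "ESTIMATED_LIBRARY_SIZE",
])
BEFORE = HEADER + "\n" + (
    "Unknown Library\t7776\t40365947\t661574\t10128\t7047\t14367911\t0\t0,355994\t42248922"
)

print(has_optical_duplicates(BEFORE))
```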

Infrastructure changes

Create/Link issues in cg or servers needed for this feature

Blockers

Anything preventing this from happening?


mathiasbio · 2023-10-23T13:02:16Z

In the current version of the PR (#1292):

With the solution where fastp UMI extraction is replaced by Sentieon UMI extract, Sentieon dedup now finds optical duplicates in the same T/N TGA case mentioned previously. At first glance, this solution appears to work.

LIBRARY	UNPAIRED_READS_EXAMINED	READ_PAIRS_EXAMINED	SECONDARY_OR_SUPPLEMENTARY_RDS	UNMAPPED_READS	UNPAIRED_READ_DUPLICATES	READ_PAIR_DUPLICATES	READ_PAIR_OPTICAL_DUPLICATES	PERCENT_DUPLICATION	ESTIMATED_LIBRARY_SIZE
Unknown Library	7948	40364554	661654	24424	7170	14367482	302442	0,355997	42727135
Unknown Library	6107	25821601	540918	21701	5338	6595169	200229	0,255486	42373405
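
For a sense of scale, the share of duplicate read pairs that are flagged as optical can be computed directly from the two rows above. This is just arithmetic on the reported values; the function name is made up for the example.

```python
def optical_share(read_pair_duplicates, optical_duplicates):
    """Fraction of duplicate read pairs that are flagged as optical."""
    return optical_duplicates / read_pair_duplicates


# (READ_PAIR_DUPLICATES, READ_PAIR_OPTICAL_DUPLICATES) from the table above
rows = [(14367482, 302442), (6595169, 200229)]

for duplicates, optical in rows:
    print(f"{optical_share(duplicates, optical):.2%}")
```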

mathiasbio · 2023-10-23T16:48:05Z

Status update on the PR for adding UMI tags to the BAM file.

Current workflow for TGA in Balsamic (DEVELOP)

fastp trims UMIs and adds them to the read header
fastp quality and adapter trims reads (using --detect_adapter_for_pe)
alignment per lane
mark duplicates; dedup fails to detect optical duplicates

Workflow for TGA in Balsamic (this branch: #1292) to fix optical duplicate detection:

sentieon umi extract extracts UMIs and adds them as tags to the read header
fastp quality and adapter trims reads (without --detect_adapter_for_pe)
alignment per lane
mark duplicates; dedup detects optical duplicates

The setting --detect_adapter_for_pe did not work with the interleaved fastqs produced by sentieon umi extract. I read up on the setting in the fastp documentation and learned that it is off by default for paired-end data, because the tool instead relies on detecting the overlap between the read pairs. (https://github.com/OpenGene/fastp#adapters)

Still, to make sure that we get similar results from the adapter trimming, I compared the two approaches, grouped according to the different strategies. (Sorry for the cut-off names.)

As the stats are very similar, I don't see an issue with switching to this adapter-trimming method in order to implement the fix for optical duplicate detection.

mathiasbio added the Feature New feature label Oct 20, 2023

mathiasbio mentioned this issue Oct 20, 2023

[User Story] Record optical duplicates in QC report #1282

Open

5 tasks

mathiasbio self-assigned this Oct 20, 2023

mathiasbio mentioned this issue Oct 20, 2023

fix: TGA optical dup detection #1292

Closed

3 tasks

mathiasbio linked a pull request Oct 20, 2023 that will close this issue

fix: TGA optical dup detection #1292

Closed

3 tasks

pbiology modified the milestones: TBD, Release 14 Oct 24, 2023

pbiology modified the milestones: Release 14, Release 15 Nov 7, 2023

mathiasbio modified the milestones: Release 15, Release 16 Feb 8, 2024

mathiasbio linked a pull request Aug 15, 2024 that will close this issue

feat: deduplicate with UMIs #1358

Open

56 tasks

mathiasbio removed a link to a pull request Aug 15, 2024

fix: TGA optical dup detection #1292

Closed

3 tasks

mathiasbio modified the milestones: Release 17, Release 16 Aug 15, 2024
