Fix detect optical duplicates in TGA workflow #1291

Open
mathiasbio opened this issue Oct 20, 2023 · 2 comments · May be fixed by #1358

@mathiasbio

Need

Related to this issue: #1282

Optical duplicates are not detected or tracked in the TGA workflow because the UMI tags appended to the read header interfere with how Picard MarkDuplicates extracts the flowcell tile information, which is essential for detecting optical duplicates:

Example: @A00689:80:HMNMHDSXX:4:1101:28989:1094:UMI_TCGTC_AGGTT

This is not an issue for WGS, since we don't have UMIs there. In the UMI workflow we already assign the extracted UMIs as tags in the BAM file before read consensus calling.
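To illustrate why the appended UMI breaks coordinate extraction, here is a minimal Python sketch. It assumes the default behaviour described in the Picard MarkDuplicates documentation, where the last three colon-separated fields of the read name are read as numeric tile, x and y values. `parse_tile_xy` is a hypothetical helper written for this illustration; it is not code from Picard or Sentieon, and Sentieon Dedup may parse names differently.

```python
def parse_tile_xy(read_name):
    """Return (tile, x, y) from the last three ':' fields, or None if they are not numeric."""
    fields = read_name.lstrip("@").split(":")
    if len(fields) < 3:
        return None
    try:
        tile, x, y = (int(field) for field in fields[-3:])
    except ValueError:
        # A non-numeric field (e.g. an appended UMI) makes the coordinates unreadable.
        return None
    return tile, x, y


plain = "@A00689:80:HMNMHDSXX:4:1101:28989:1094"
with_umi = "@A00689:80:HMNMHDSXX:4:1101:28989:1094:UMI_TCGTC_AGGTT"

print(parse_tile_xy(plain))     # (1101, 28989, 1094)
print(parse_tile_xy(with_umi))  # None
```

With the UMI suffix, the last three fields are `28989`, `1094` and `UMI_TCGTC_AGGTT`, so no tile coordinates can be recovered under this assumption.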

Suggested approach

Add the UMIs as tags in the BAM file, as in the UMI workflow. This should leave the read names unmodified, so Dedup should work with its default settings.
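A rough sketch of the idea in Python, operating on read-name strings only. The `UMI_` marker is taken from the example read name above; the helper name `split_umi_from_name` and the use of the `RX` tag (the SAM specification's tag for UMI sequences, with a hyphen between dual UMIs) are assumptions for illustration. The issue does not state which tag the UMI workflow actually writes.

```python
def split_umi_from_name(read_name, marker="UMI_"):
    """Split a trailing ':UMI_<seq>[_<seq>]' suffix off a read name.

    Returns (clean_name, umi), where umi is None if no suffix is present.
    """
    head, sep, tail = read_name.rpartition(":" + marker)
    if not sep:
        return read_name, None
    # Join dual UMIs with a hyphen, as the SAM tag specification recommends for RX.
    return head, tail.replace("_", "-")


name, umi = split_umi_from_name("A00689:80:HMNMHDSXX:4:1101:28989:1094:UMI_TCGTC_AGGTT")
print(name)            # A00689:80:HMNMHDSXX:4:1101:28989:1094
print(f"RX:Z:{umi}")   # RX:Z:TCGTC-AGGTT (assumed tag name)
```

The cleaned name again ends in numeric tile, x and y fields, while the UMI is still available to downstream users through the tag.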

Considered alternatives

  • Modify --READ_NAME_REGEX, if Sentieon Dedup supports it as Picard MarkDuplicates does, to change which part of the read header the tool parses for flowcell tile info. https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard-
  • Remove the UMIs from these reads completely instead of adding them to the header. However, if we don't deliver raw fastqs to the customer, this information is lost for researchers, who could previously extract it from the read name in the final BAM file and run their own custom UMI pipeline.
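For the first alternative, a pattern along the following lines might work. This is only a sketch tested with Python's `re` module: Picard expects a regular expression with three capture groups (tile, x, y) in that order, but whether Sentieon Dedup accepts an equivalent option is exactly the open question, and the pattern itself is a hypothetical one written for the read-name layout in the example above.

```python
import re

# Four leading fields (instrument, run, flowcell, lane), then tile, x, y,
# then an optional ':UMI_...' suffix that is ignored.
READ_NAME_PATTERN = re.compile(
    r"^(?:[^:]+:){4}([0-9]+):([0-9]+):([0-9]+)(?::UMI_[ACGTN_]+)?$"
)


def tile_xy_with_regex(read_name):
    """Return (tile, x, y) as ints, or None if the name does not match."""
    match = READ_NAME_PATTERN.match(read_name)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


print(tile_xy_with_regex("A00689:80:HMNMHDSXX:4:1101:28989:1094:UMI_TCGTC_AGGTT"))
# (1101, 28989, 1094)
```

The pattern uses only constructs that Java regular expressions also support, but it would need to be verified against the actual tool before use.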

Requests/suggestions/bugs solved by the feature

Record optical duplicates in QC report #1282

Can be closed when

Optical duplicates are detected, as evidenced by non-zero values in the READ_PAIR_OPTICAL_DUPLICATES column of the dedup.metrics file. Example TGA table below; note the value 0 for optical duplicates:

| LIBRARY | UNPAIRED_READS_EXAMINED | READ_PAIRS_EXAMINED | SECONDARY_OR_SUPPLEMENTARY_RDS | UNMAPPED_READS | UNPAIRED_READ_DUPLICATES | READ_PAIR_DUPLICATES | READ_PAIR_OPTICAL_DUPLICATES | PERCENT_DUPLICATION | ESTIMATED_LIBRARY_SIZE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Unknown Library | 7776 | 40365947 | 661574 | 10128 | 7047 | 14367911 | 0 | 0,355994 | 42248922 |
| Unknown Library | 6150 | 25823111 | 540829 | 8460 | 5402 | 6595770 | 0 | 0,255495 | 42515496 |
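The closing criterion could be checked automatically with something like the sketch below. It assumes the dedup.metrics file follows the Picard-style layout (comment lines starting with `#`, then one tab-separated header row followed by data rows, ended by a blank line); `optical_duplicate_counts` is a hypothetical helper, and the example input is abbreviated to three columns with values from the table above.

```python
def optical_duplicate_counts(metrics_text):
    """Return READ_PAIR_OPTICAL_DUPLICATES for each library row in a metrics text."""
    header = None
    counts = []
    for line in metrics_text.splitlines():
        if line.startswith("#") or not line.strip():
            if header is not None:
                break  # end of the metrics table (e.g. a histogram section follows)
            continue
        fields = line.split("\t")
        if header is None:
            header = fields
            continue
        row = dict(zip(header, fields))
        counts.append(int(row["READ_PAIR_OPTICAL_DUPLICATES"]))
    return counts


example = "\n".join([
    "## METRICS CLASS (comment line, content assumed)",
    "LIBRARY\tREAD_PAIR_DUPLICATES\tREAD_PAIR_OPTICAL_DUPLICATES",
    "Unknown Library\t14367911\t0",
    "Unknown Library\t6595770\t0",
])
counts = optical_duplicate_counts(example)
print(counts)                       # [0, 0]
print(all(c > 0 for c in counts))   # False: optical duplicates are not detected
```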

Infrastructure changes

Create/Link issues in cg or servers needed for this feature

Blockers

Anything preventing this from happening?

@mathiasbio mathiasbio added the Feature New feature label Oct 20, 2023
@mathiasbio mathiasbio self-assigned this Oct 20, 2023
@mathiasbio mathiasbio linked a pull request Oct 20, 2023 that will close this issue
3 tasks
@mathiasbio

In the current version of PR: #1292

With the solution where UMI extraction by fastp is replaced by Sentieon UMI extract, Sentieon Dedup reports optical duplicates for the same T/N TGA case mentioned previously. At first glance, this solution appears to work.

| LIBRARY | UNPAIRED_READS_EXAMINED | READ_PAIRS_EXAMINED | SECONDARY_OR_SUPPLEMENTARY_RDS | UNMAPPED_READS | UNPAIRED_READ_DUPLICATES | READ_PAIR_DUPLICATES | READ_PAIR_OPTICAL_DUPLICATES | PERCENT_DUPLICATION | ESTIMATED_LIBRARY_SIZE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Unknown Library | 7948 | 40364554 | 661654 | 24424 | 7170 | 14367482 | 302442 | 0,355997 | 42727135 |
| Unknown Library | 6107 | 25821601 | 540918 | 21701 | 5338 | 6595169 | 200229 | 0,255486 | 42373405 |
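For a sense of scale, the share of duplicate read pairs now flagged as optical can be computed directly from the two rows above. This is plain arithmetic on the reported values; `optical_fraction` is just an illustrative helper name.

```python
def optical_fraction(read_pair_duplicates, optical_duplicates):
    """Fraction of duplicate read pairs that were classified as optical."""
    return optical_duplicates / read_pair_duplicates


# (READ_PAIR_DUPLICATES, READ_PAIR_OPTICAL_DUPLICATES) for the two libraries
for duplicates, optical in [(14367482, 302442), (6595169, 200229)]:
    share = 100 * optical_fraction(duplicates, optical)
    print(f"{share:.1f}% of duplicate pairs are optical")
# 2.1% of duplicate pairs are optical
# 3.0% of duplicate pairs are optical
```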

@mathiasbio

mathiasbio commented Oct 23, 2023

Update on the status in the PR for adding UMI tags to bamfile.

Current workflow for TGA in Balsamic (DEVELOP)

  1. fastp trims the UMIs and adds them to the read header
  2. fastp quality-trims and adapter-trims the reads (using --detect_adapter_for_pe )
  3. alignment per lane
  4. mark duplicates; dedup fails to detect optical duplicates

Workflow for TGA in Balsamic (this branch: #1292) to fix optical duplicate detection:

  1. sentieon umi extract extracts the UMIs and adds them as tags to the read header
  2. fastp quality-trims and adapter-trims the reads (without --detect_adapter_for_pe )
  3. alignment per lane
  4. mark duplicates; dedup detects optical duplicates

The setting --detect_adapter_for_pe did not work with the interleaved fastqs produced by sentieon umi extract, so I read up on the setting in the fastp documentation and learned that the default for paired end is to not use this, because the tool instead relies on detecting the overlap of the read pairs. (https://github.com/OpenGene/fastp#adapters)
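As a toy illustration of overlap-based adapter detection for paired-end reads, consider the sketch below. It is a deliberately simplified version that requires an exact match and is not fastp's actual algorithm, which tolerates mismatches; the function names and the `min_overlap` parameter are assumptions made for this example. The idea is that when the insert is shorter than the read length, read 1 and the reverse complement of read 2 share the insert, and everything beyond it is adapter.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def trim_by_overlap(r1, r2, min_overlap=10):
    """Trim adapter read-through from a read pair using exact pair overlap.

    If the first n bases of r1 equal the last n bases of revcomp(r2),
    the insert is n bases long and both reads are cut to that length.
    """
    rc2 = revcomp(r2)
    longest = min(len(r1), len(rc2)) - 1
    for insert_len in range(longest, min_overlap - 1, -1):
        if r1[:insert_len] == rc2[-insert_len:]:
            return r1[:insert_len], r2[:insert_len]
    return r1, r2  # no read-through detected, leave reads unchanged


insert = "ACGTTGCAAGGCTTAACGGA"
adapter = "AGATCG"  # arbitrary example bases standing in for adapter sequence
r1 = insert + adapter
r2 = revcomp(insert) + adapter
print(trim_by_overlap(r1, r2) == (insert, revcomp(insert)))  # True
```

Because the adapter is inferred from the pair overlap alone, no adapter sequence has to be supplied or detected separately, which is consistent with the flag not being needed for paired-end data.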

To make sure that we still get similar results from the adapter trimming, I compared the two approaches, grouped by strategy. (Sorry for the cut-off names.)

[Figure: bases_by_strategy]

[Figure: reads_by_strategy]

As the stats are very similar, I don't see an issue with switching to this adapter-trimming method to implement the fix for optical duplicate detection.

@pbiology pbiology modified the milestones: TBD, Release 14 Oct 24, 2023
@pbiology pbiology modified the milestones: Release 14, Release 15 Nov 7, 2023
@mathiasbio mathiasbio modified the milestones: Release 15, Release 16 Feb 8, 2024
@mathiasbio mathiasbio linked a pull request Aug 15, 2024 that will close this issue
56 tasks
@mathiasbio mathiasbio removed a link to a pull request Aug 15, 2024
3 tasks
@mathiasbio mathiasbio modified the milestones: Release 17, Release 16 Aug 15, 2024
Projects
Status: In Testing