Fix detect optical duplicates in TGA workflow #1291
Comments
In the current version of PR #1292, where UMI extraction with fastp is replaced by Sentieon UMI extract, Sentieon Dedup now finds optical duplicates in the same T/N TGA case mentioned previously. At first glance, this solution appears to work.
Update on the status of the PR for adding UMI tags to the BAM file.
Current workflow for TGA in Balsamic (DEVELOP):
Workflow for TGA in Balsamic (this branch: #1292) to fix optical duplicate detection:
The setting
Still, to make sure that we get similar results from the adapter trimming, I compared the two approaches, grouped according to the different strategies (sorry for the cut-off names). As the stats are very similar, I don't see an issue with switching to this adapter method to implement the fix for optical duplicate detection.
Need
Related to this issue: #1282
Optical duplicates have not been detected or tracked in the TGA workflow. The cause is the UMI tag appended to the read header, which interferes with how Picard MarkDuplicates extracts the flowcell tile information that is essential for detecting optical duplicates:
Example:
@A00689:80:HMNMHDSXX:4:1101:28989:1094:UMI_TCGTC_AGGTT
This is not an issue for WGS, where we don't have UMIs, and in the UMI workflow the extracted UMIs are assigned as tags in the BAM file before read consensus calling.
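The read-name problem described above can be illustrated with a small Python sketch. It is a simplified stand-in, not Picard's or Sentieon's actual code: it assumes the default parsing splits the read name on `:`, only accepts names with 5 or 7 fields, and reads the last three fields as tile, x and y. The function name `parse_tile_xy` and the UMI-free read name are hypothetical, the latter derived from the example above by stripping the suffix.

```python
def parse_tile_xy(read_name):
    """Return (tile, x, y) from an Illumina-style read name, or None.

    Simplified sketch: assumes only 5- or 7-field colon-separated names
    are parseable and that the last three fields are tile, x and y.
    """
    fields = read_name.lstrip("@").split(":")
    if len(fields) not in (5, 7):
        # An appended UMI adds an eighth field, so parsing is refused.
        return None
    try:
        tile, x, y = (int(value) for value in fields[-3:])
    except ValueError:
        # The last three fields are not all numeric.
        return None
    return tile, x, y


# Hypothetical UMI-free name: the example above with the suffix removed.
plain = "@A00689:80:HMNMHDSXX:4:1101:28989:1094"
with_umi = "@A00689:80:HMNMHDSXX:4:1101:28989:1094:UMI_TCGTC_AGGTT"

print(parse_tile_xy(plain))     # (1101, 28989, 1094)
print(parse_tile_xy(with_umi))  # None
```

Under these assumptions the UMI suffix turns a 7-field name into an 8-field one, so no tile or coordinate information is recovered and no read pair can be flagged as an optical duplicate.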
Suggested approach
Add the UMIs as tags in the BAM file, as in the UMI workflow. This should leave the read names unmodified, so Dedup should work with its default settings.
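As a rough illustration of the idea, the sketch below splits a read name of the form shown in the example into the original name and a UMI tag string. It is not how Balsamic or Sentieon implement this. The helper `split_umi`, the `:UMI_` marker handling, and the choice of the `RX` tag (the SAM-specification tag for UMI sequences, with multiple UMIs joined by a hyphen) are assumptions made for the example.

```python
def split_umi(read_name, marker=":UMI_"):
    """Split a read name into (clean_name, umi_tag).

    Hypothetical helper: assumes the UMI is appended to the read name
    after ':UMI_' with the two UMI halves separated by '_'.
    Returns (read_name, None) when no UMI suffix is present.
    """
    name, sep, umi = read_name.partition(marker)
    if not sep:
        return read_name, None
    # Assumption: store the UMI in an RX:Z tag, halves joined by '-'.
    return name, "RX:Z:" + umi.replace("_", "-")


name, tag = split_umi("@A00689:80:HMNMHDSXX:4:1101:28989:1094:UMI_TCGTC_AGGTT")
print(name)  # @A00689:80:HMNMHDSXX:4:1101:28989:1094
print(tag)   # RX:Z:TCGTC-AGGTT
```

With the UMI carried as a tag, the read name keeps its standard seven colon-separated fields, which is what default tile parsing expects.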
Considered alternatives
Use --READ_NAME_REGEX, if it exists for Sentieon Dedup as it does for Picard MarkDuplicates, to change which part of the read header the tool parses for flowcell tile info: https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard-
Requests/suggestions/bugs solved by the feature
Record optical duplicates in QC report #1282
Can be closed when
Optical duplicates are detected, as evidenced by values in the optical duplicates column of the dedup.metrics file. Example TGA table below, where the value for optical duplicates is currently 0:
Infrastructure changes
Create/Link issues in cg or servers needed for this feature
Blockers
Anything preventing this from happening?