-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: deduplicate with UMIs #1358
base: develop
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## develop #1358 +/- ##
========================================
Coverage 99.48% 99.48%
========================================
Files 40 40
Lines 1932 1957 +25
========================================
+ Hits 1922 1947 +25
Misses 10 10
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Quality Gate passedThe SonarCloud Quality Gate passed, but some issues were introduced. 8 New issues |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codewise looks great, as well as the results! 💯 Amazing work and fantastic documentation of the process! 🤩
Thanks for the reviews! As discussed I'll not merge this however until I have fully tested the replace VarDict PR, and before this PR is entirely complete I'll still need to add some documentation 🙏 |
Quality Gate passedIssues Measures |
This PR:
Makes some significant changes to the TGA workflow by aiming to use the UMIs during the deduplication stage. This is a feature that only exists on the later versions of Sentieon.
The PR is tested on Sentieon version: sentieon-genomics-202308.03 which has this feature.
As this new feature was implemented it required some further changes to the workflows which ended up becoming a quite large refactoring. But in terms of effects on the quality of the analysis the only relevant changes are probably:
Some more info regarding using the UMIs in the dedup step. It could rescue a significant number of reads wrongfully discarded as duplicates with a purely position-based approach, and could probably replace the UMI workflow for many use-cases as error-correction with 3,1,1 approach may not be necessary to avoid false positives in the range of VAFs commonly looked for in the samples previously run in the UMI workflow (0.5 - 2% VAF)
For more info see user story: #1361
Further necessary features before merging this:
Extra information:
Extra note:
Blocked by:
This update requires updating Sentieon as the current version in production:
sentieon-genomics-202010.02
Does not contain the consensus option for LocusCollector which is the step prior to Dedup:
Current prod:
Latest Sentieon version:
Issues to consider:
After implementing this, we are no longer able to retrieve % Duplicates and Optical Duplicates info from the ".metrics" file from dedup. The values are "0". How can this be fixed? I have notified Sentieon about this issue and they say that they will add more useful statistics to the report in the next version.
At the moment I have added standard non UMI Sentieon Dedup just to get the PERCENT DUPLICATES stats. But is it worth it?
UPDATE: In the latest version of Sentieon 202308.03 they claim to have added support for % Duplicates from the --consensus mode from Dedup, so maybe this extra rule can be removed!
Comparing the percent duplicates between using Dedup with or without the UMI consensus options shows that they are a bit different, which makes sense as we're now using UMIs to identify duplicates as well as position. With roughly 4% less duplicates seen in this example case while using UMIs. Meaning that we're should be both getting a better calculation of duplicates which will be good for future evaluations of lab-methods, and at the same time saving more reads.
With this new version of sentieon we can safely remove the extra Sentieon dedup!
Are overlapping mates in read-pairs maintained as Pairs or Singletons after UMI collapse (as in the other Sentieon UMI collapse tool)? I checked and they are singletons when the mates are overlapping.
I discussed with Sentieon and colleagues and it should be fine: #1361 But it would be good to add more SVs from TGA to the validation set and make sure we can find them.
Graph overview
TGA T+N
TGA T+N (3,1,1 UMI workflow)
TGA T+N (QC)
WGS T+N
TGA CNVkit PON
Added
Changed
Removed
Documentation
Tests
Feature Tests
Aggregate QC metric comparison to v15.0.0
Sheet with QC metric comparison:
https://docs.google.com/spreadsheets/d/1phztyRT1VS8ud6bf0JWwFo5PB2yWB0YgFXz_ZnXSF7U/edit?gid=0#gid=0
Notes on QC metrics:
Pipeline Integrity Tests
.hk
file)Clinical Genomics Stockholm
Documentation
User Changes
Infrastructure Changes
Checklist
Important
Ensure that all checkboxes below are ticked before merging.
For Developers
For Reviewers
conditions where applicable, with satisfactory results.