Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[User Story] Record optical duplicates in QC report #1282

Open
2 of 5 tasks
pbiology opened this issue Oct 11, 2023 · 9 comments · May be fixed by #1358
Open
2 of 5 tasks

[User Story] Record optical duplicates in QC report #1282

pbiology opened this issue Oct 11, 2023 · 9 comments · May be fixed by #1358
Assignees
Labels
Effort Medium Gain Medium Urgency Medium User-Story A User-Story describing new functionality
Milestone

Comments

@pbiology
Copy link
Contributor

pbiology commented Oct 11, 2023

Need

As the lab head of unit I want to be able to trend the optical duplicates in samples processed by BALSAMIC so that we can see if the levels change when we change methods an instruments in the lab. Most recently due to the new NovaSeq X.

Suggested approach

Start collecting the optical duplicates per case and record in multiQC. This should be done both for panels, WES and WGS.

Considered alternatives

None

Deviation

None

System requirements assessed

  • Yes, I have reviewed the system requirements
  • No

Requirements affected by this story

N/A

Risk assessment needed

  • Needed
  • Not needed

Risk assessment

Gathering new metrics doesn't change the analysis in any way so no risk.

SOUPs

N/A

Can be closed when

  • BALSAMIC MultiQC reports includes optical duplicates

Blockers

Fix detect optical duplicates in TGA workflow #1291

Anything else?

  • Do we need to include something new in which documents to save?
@pbiology pbiology added this to the Release 14 milestone Oct 11, 2023
@mathiasbio
Copy link
Contributor

mathiasbio commented Oct 18, 2023

I think collecting this statistic already existed for TGA-cases (panel and exome) which was using picard markdups, but was lacking for WGS because the sentieon dedup report wasn't picked up by multiqc, however in the development version after creating more unified alignment workflows where the sentieon dedup tool is used for dedup in all workflows (and where the report was modified to be acceptable by multiqc) the same duplication stats which includes the optical duplicates is reported by all workflows. However! For some reason no optical duplicates are detected in TGA...

Below are the top 2 examples from TGA tumor and normal sample, and bottom an example WGS test sample run on the develop branch.

LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED SECONDARY_OR_SUPPLEMENTARY_RDS UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown Library 7776 40365947 661574 10128 7047 14367911 0 0,355994 42248922
Unknown Library 6150 25823111 540829 8460 5402 6595770 0 0,255495 41515496
Unknown Library 231373 347284974 3329457 817093 84083 31283787 2697729 0,090172 1960368099

These values are also included in the multiqc_data.json including some small calculations that multiqc has done on the stats. Here's the example from the WGS case from multiqc_data.json

LIBRARY: Unknown Library,
UNPAIRED_READS_EXAMINED: 231373
READ_PAIRS_EXAMINED: 347284974
SECONDARY_OR_SUPPLEMENTARY_RDS: 3329457
UNMAPPED_READS: 817093
UNPAIRED_READ_DUPLICATES: 84083
READ_PAIR_DUPLICATES: 31283787
READ_PAIR_OPTICAL_DUPLICATES: 2697729
PERCENT_DUPLICATION: 0,090172
ESTIMATED_LIBRARY_SIZE: 1960368099
READS_IN_DUPLICATE_PAIRS: 62567574
READS_IN_UNIQUE_PAIRS: 632002374
READS_IN_UNIQUE_UNPAIRED: 147290
READS_IN_DUPLICATE_PAIRS_OPTICAL: 5395458
READS_IN_DUPLICATE_PAIRS_NONOPTICAL: 57172116
READS_IN_DUPLICATE_UNPAIRED: 84083
READS_UNMAPPED: 817093

@mathiasbio
Copy link
Contributor

There are no percent optical duplicates however, but it can be calculated quite simply from these values. Is that sufficient to close this issue? @pbiology

@mathiasbio
Copy link
Contributor

I will investigate a little why there are no optical dups in TGA first however...

@mathiasbio
Copy link
Contributor

mathiasbio commented Oct 18, 2023

Ok...I think I see the issue...for TGA we modify the headers of the reads in the fastq by adding the extracted UMI-sequence, which creates problems for how Picard MarkDuplicates extracts the flowcell tile information.

https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard-

--READ_NAME_REGEX <String>    MarkDuplicates can use the tile and cluster positions to estimate the rate of optical
                              duplication in addition to the dominant source of duplication, PCR, to provide a more
                              accurate estimation of library size. By default (with no READ_NAME_REGEX specified),
                              MarkDuplicates will attempt to extract coordinates using a split on ':' (see Note below). 
                              Set READ_NAME_REGEX to 'null' to disable optical duplicate detection. Note that without
                              optical duplicate counts, library size estimation will be less accurate. If the read name
                              does not follow a standard Illumina colon-separation convention, but does contain tile and
                              x,y coordinates, a regular expression can be specified to extract three variables:
                              tile/region, x coordinate and y coordinate from a read name. The regular expression must
                              contain three capture groups for the three variables, in order. It must match the entire
                              read name.   e.g. if field names were separated by semi-colon (';') this example regex
                              could be specified      (?:.*;)?([0-9]+)[^;]*;([0-9]+)[^;]*;([0-9]+)[^;]*$ Note that if no
                              READ_NAME_REGEX is specified, the read name is split on ':'.   For 5 element names, the
                              3rd, 4th and 5th elements are assumed to be tile, x and y values.   For 7 element names
                              (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values. 
                              Default value: <optimized capture of last three ':' separated fields as numeric values>. 

So it seems that the problem is slightly worse than not just tracking the optical duplicates, we have probably not detected and marked them ever since the UMIs were added to the TGA.

@pbiology
Copy link
Contributor Author

There are no percent optical duplicates however, but it can be calculated quite simply from these values. Is that sufficient to close this issue? @pbiology

I would say so? Perhaps @karlnyr has some other input?

@mathiasbio
Copy link
Contributor

I think it would be good to try to fix the MarkDuplicates issue for Optical Duplicate detection in TGA as well. Hopefully it's as easy as adding a new --READ_NAME_REGEX to Sentieon Dedup for the TGA cases to adjust for the UMI-sequences in the readheader

@mathiasbio
Copy link
Contributor

Blocked by this: #1291

@pbiology pbiology added Effort Medium Gain Medium Urgency Medium User-Story A User-Story describing new functionality and removed Feature New feature Needs Refinement labels Jan 3, 2024
@pbiology
Copy link
Contributor Author

pbiology commented Jan 3, 2024

Updated the issues to the User Story format

@ivadym ivadym changed the title Record optical duplicates in QC report [User Story] Record optical duplicates in QC report Jan 3, 2024
@mathiasbio mathiasbio modified the milestones: Release 15, Release 16 Feb 8, 2024
@mathiasbio
Copy link
Contributor

This issue will be solved by PR #1358

@mathiasbio mathiasbio modified the milestones: Release 17, Release 16 Aug 15, 2024
@mathiasbio mathiasbio linked a pull request Aug 15, 2024 that will close this issue
56 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Effort Medium Gain Medium Urgency Medium User-Story A User-Story describing new functionality
Projects
Status: In Testing
Development

Successfully merging a pull request may close this issue.

2 participants