[User Story] Record optical duplicates in QC report #1282

pbiology · 2023-10-11T09:33:29Z

Need

As head of unit for the lab, I want to be able to trend the optical duplicates in samples processed by BALSAMIC so that we can see whether the levels change when we change methods and instruments in the lab, most recently with the new NovaSeq X.

Suggested approach

Start collecting the optical duplicates per case and record them in MultiQC. This should be done for panels, WES and WGS alike.

Considered alternatives

None

Deviation

None

System requirements assessed

Yes, I have reviewed the system requirements
No

Requirements affected by this story

N/A

Risk assessment needed

Needed
Not needed

Risk assessment

Gathering new metrics doesn't change the analysis in any way so no risk.

SOUPs

N/A

Can be closed when

BALSAMIC MultiQC reports include optical duplicates

Blockers

Fix detect optical duplicates in TGA workflow #1291

Anything else?

Do we need to add anything new to the documents to save?

mathiasbio · 2023-10-18T10:47:22Z

I think this statistic was already collected for TGA cases (panel and exome), which used Picard MarkDuplicates, but was missing for WGS because the Sentieon dedup report wasn't picked up by MultiQC. In the development version the alignment workflows are more unified: the Sentieon dedup tool is used for dedup in all workflows, and its report was modified to be accepted by MultiQC, so the same duplication stats, including the optical duplicates, are reported by all workflows. However, for some reason no optical duplicates are detected in TGA...

Below, the top two rows are from a TGA tumor and normal sample, and the bottom row is a WGS test sample run on the develop branch.

LIBRARY	UNPAIRED_READS_EXAMINED	READ_PAIRS_EXAMINED	SECONDARY_OR_SUPPLEMENTARY_RDS	UNMAPPED_READS	UNPAIRED_READ_DUPLICATES	READ_PAIR_DUPLICATES	READ_PAIR_OPTICAL_DUPLICATES	PERCENT_DUPLICATION	ESTIMATED_LIBRARY_SIZE
Unknown Library	7776	40365947	661574	10128	7047	14367911	0	0,355994	42248922
Unknown Library	6150	25823111	540829	8460	5402	6595770	0	0,255495	41515496
Unknown Library	231373	347284974	3329457	817093	84083	31283787	2697729	0,090172	1960368099

These values are also included in multiqc_data.json, along with some small calculations that MultiQC has done on the stats. Here is the WGS case from multiqc_data.json:

LIBRARY:	Unknown Library,
UNPAIRED_READS_EXAMINED:	231373
READ_PAIRS_EXAMINED:	347284974
SECONDARY_OR_SUPPLEMENTARY_RDS:	3329457
UNMAPPED_READS:	817093
UNPAIRED_READ_DUPLICATES:	84083
READ_PAIR_DUPLICATES:	31283787
READ_PAIR_OPTICAL_DUPLICATES:	2697729
PERCENT_DUPLICATION:	0,090172
ESTIMATED_LIBRARY_SIZE:	1960368099
READS_IN_DUPLICATE_PAIRS:	62567574
READS_IN_UNIQUE_PAIRS:	632002374
READS_IN_UNIQUE_UNPAIRED:	147290
READS_IN_DUPLICATE_PAIRS_OPTICAL:	5395458
READS_IN_DUPLICATE_PAIRS_NONOPTICAL:	57172116
READS_IN_DUPLICATE_UNPAIRED:	84083
READS_UNMAPPED:	817093
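The derived fields appear to follow from the raw pair-level counts by simple arithmetic. The sketch below reproduces them for the WGS sample above. The formulas are inferred from the numbers in this comment, not taken from the MultiQC source, and the function name `derive_read_counts` is made up for illustration.

```python
# Raw duplication metrics for the WGS sample, copied from the report above.
raw = {
    "UNPAIRED_READS_EXAMINED": 231373,
    "READ_PAIRS_EXAMINED": 347284974,
    "UNPAIRED_READ_DUPLICATES": 84083,
    "READ_PAIR_DUPLICATES": 31283787,
    "READ_PAIR_OPTICAL_DUPLICATES": 2697729,
}


def derive_read_counts(m):
    """Derive per-read counts from pair-level counts (inferred formulas, not official)."""
    dup_pairs = 2 * m["READ_PAIR_DUPLICATES"]          # two reads per duplicate pair
    optical = 2 * m["READ_PAIR_OPTICAL_DUPLICATES"]    # two reads per optical pair
    return {
        "READS_IN_DUPLICATE_PAIRS": dup_pairs,
        "READS_IN_UNIQUE_PAIRS": 2 * (m["READ_PAIRS_EXAMINED"] - m["READ_PAIR_DUPLICATES"]),
        "READS_IN_UNIQUE_UNPAIRED": m["UNPAIRED_READS_EXAMINED"] - m["UNPAIRED_READ_DUPLICATES"],
        "READS_IN_DUPLICATE_PAIRS_OPTICAL": optical,
        "READS_IN_DUPLICATE_PAIRS_NONOPTICAL": dup_pairs - optical,
        "READS_IN_DUPLICATE_UNPAIRED": m["UNPAIRED_READ_DUPLICATES"],
    }


derived = derive_read_counts(raw)
for key, value in derived.items():
    print(f"{key}:\t{value}")
```

All six derived values match the ones listed in multiqc_data.json for this sample.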

mathiasbio · 2023-10-18T10:57:36Z

There are no percent optical duplicates however, but it can be calculated quite simply from these values. Is that sufficient to close this issue? @pbiology
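A minimal sketch of that calculation, assuming the percentage is defined as optical duplicate pairs over examined read pairs. The choice of denominator and the function name `percent_optical_duplicates` are assumptions, not something MultiQC or BALSAMIC defines.

```python
def percent_optical_duplicates(read_pair_optical_duplicates, read_pairs_examined):
    """Return optical duplicate pairs as a percentage of examined read pairs."""
    if read_pairs_examined == 0:
        # Avoid division by zero for empty samples.
        return 0.0
    return 100.0 * read_pair_optical_duplicates / read_pairs_examined


# Values taken from the duplication metrics posted above.
wgs = percent_optical_duplicates(2697729, 347284974)
tga_tumor = percent_optical_duplicates(0, 40365947)

print(f"WGS: {wgs:.2f}%  TGA tumor: {tga_tumor:.2f}%")
```

A different denominator (for example all duplicate pairs) would answer a different question, so the definition should be agreed on before the value is trended.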

mathiasbio · 2023-10-18T10:58:23Z

First, however, I will investigate why there are no optical dups in TGA...

mathiasbio · 2023-10-18T11:10:13Z

Ok... I think I see the issue. For TGA we modify the read headers in the FASTQ by adding the extracted UMI sequence, which interferes with how Picard MarkDuplicates extracts the flowcell tile information.

https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard-

--READ_NAME_REGEX <String>    MarkDuplicates can use the tile and cluster positions to estimate the rate of optical
                              duplication in addition to the dominant source of duplication, PCR, to provide a more
                              accurate estimation of library size. By default (with no READ_NAME_REGEX specified),
                              MarkDuplicates will attempt to extract coordinates using a split on ':' (see Note below). 
                              Set READ_NAME_REGEX to 'null' to disable optical duplicate detection. Note that without
                              optical duplicate counts, library size estimation will be less accurate. If the read name
                              does not follow a standard Illumina colon-separation convention, but does contain tile and
                              x,y coordinates, a regular expression can be specified to extract three variables:
                              tile/region, x coordinate and y coordinate from a read name. The regular expression must
                              contain three capture groups for the three variables, in order. It must match the entire
                              read name.   e.g. if field names were separated by semi-colon (';') this example regex
                              could be specified      (?:.*;)?([0-9]+)[^;]*;([0-9]+)[^;]*;([0-9]+)[^;]*$ Note that if no
                              READ_NAME_REGEX is specified, the read name is split on ':'.   For 5 element names, the
                              3rd, 4th and 5th elements are assumed to be tile, x and y values.   For 7 element names
                              (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values. 
                              Default value: <optimized capture of last three ':' separated fields as numeric values>.

So it seems the problem is slightly worse than just not tracking the optical duplicates: we have probably not detected or marked them at all since the UMIs were added to the TGA workflow.
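To illustrate the failure mode, here is a rough Python sketch of the colon-split rule as described in the documentation quoted above. It is not Picard's actual implementation, and the read names and the way the UMI is appended are hypothetical, since the exact header format used in TGA is not shown in this thread.

```python
def tile_xy_from_read_name(read_name):
    """Apply the quoted colon-split rule; return (tile, x, y) or None."""
    fields = read_name.split(":")
    if len(fields) == 5:
        candidates = fields[2:5]   # 3rd, 4th and 5th elements
    elif len(fields) == 7:
        candidates = fields[4:7]   # 5th, 6th and 7th elements (CASAVA 1.8)
    else:
        return None
    if not all(f.isdigit() for f in candidates):
        return None
    return tuple(int(f) for f in candidates)


# Hypothetical read names, for illustration only.
standard = "A00187:123:HXXXXXXXX:1:1101:1234:5678"
umi_colon = standard + ":ACGTACGT"       # UMI added as an extra ':' field
umi_underscore = standard + "_ACGTACGT"  # UMI glued onto the y coordinate

for name in (standard, umi_colon, umi_underscore):
    print(name, "->", tile_xy_from_read_name(name))
```

In this sketch both UMI variants lose their coordinates: the first no longer has 5 or 7 fields, and in the second the last field is no longer numeric.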

pbiology · 2023-10-18T13:01:25Z

> There are no percent optical duplicates however, but it can be calculated quite simply from these values. Is that sufficient to close this issue? @pbiology

I would say so? Perhaps @karlnyr has some other input?

mathiasbio · 2023-10-18T13:12:36Z

I think it would be good to fix the MarkDuplicates issue with optical duplicate detection in TGA as well. Hopefully it's as easy as adding a new --READ_NAME_REGEX to Sentieon Dedup for the TGA cases to account for the UMI sequences in the read header.
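As a sketch of what such a regex could look like: three capture groups (tile, x, y) and a match over the entire read name, as the Picard documentation requires. The pattern and the read names are hypothetical and assume the UMI is appended after the y coordinate with ':' or '_'. It is checked here only with Python's `re`, which differs in places from the Java regex engine Picard uses, and whether Sentieon Dedup accepts such an option is not confirmed in this thread.

```python
import re

# Hypothetical pattern: skip four leading fields, capture tile, x, y,
# then allow an optional UMI suffix separated by ':' or '_'.
READ_NAME_REGEX = r"(?:[^:]+:){4}([0-9]+):([0-9]+):([0-9]+)(?:[:_][ACGTN]+)?"


def extract_tile_xy(read_name, pattern=READ_NAME_REGEX):
    """Return (tile, x, y) if the pattern matches the entire read name, else None."""
    match = re.fullmatch(pattern, read_name)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


# Hypothetical read names, for illustration only.
names = [
    "A00187:123:HXXXXXXXX:1:1101:1234:5678",
    "A00187:123:HXXXXXXXX:1:1101:1234:5678:ACGTACGT",
    "A00187:123:HXXXXXXXX:1:1101:1234:5678_ACGTACGT",
]
for name in names:
    print(name, "->", extract_tile_xy(name))
```

Making the UMI suffix optional keeps the same pattern usable for reads without a UMI, but the real header format should be checked before settling on a pattern.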

mathiasbio · 2023-12-12T12:58:33Z

Blocked by this: #1291

pbiology · 2024-01-03T08:39:16Z

Updated the issues to the User Story format

mathiasbio · 2024-08-15T09:00:53Z

This issue will be solved by PR #1358

pbiology added Feature New feature Needs Refinement labels Oct 11, 2023

pbiology added this to the Release 14 milestone Oct 11, 2023

mathiasbio self-assigned this Oct 20, 2023

This was referenced Oct 20, 2023

Fix detect optical duplicates in TGA workflow #1291

Open

fix: TGA optical dup detection #1292

Closed

pbiology modified the milestones: Release 14, Release 15 Nov 7, 2023

pbiology added Effort Medium Gain Medium Urgency Medium User-Story A User-Story describing new functionality and removed Feature New feature Needs Refinement labels Jan 3, 2024

ivadym changed the title ~~Record optical duplicates in QC report~~ [User Story] Record optical duplicates in QC report Jan 3, 2024

mathiasbio modified the milestones: Release 15, Release 16 Feb 8, 2024

mathiasbio modified the milestones: Release 17, Release 16 Aug 15, 2024

mathiasbio linked a pull request Aug 15, 2024 that will close this issue

feat: deduplicate with UMIs #1358

Open

56 tasks
