UMI sequencing optimisation #1337
Regarding the likelihood of fragment conflicts for 6-base UMIs, I did some simulations counting the number of conflicting DNA fragments, i.e. DNA fragments that randomly get cut from the exact same position in the target area of the panel (sharing the same start and end positions) and also get the same UMI barcode. Here's the code:
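(The code block itself isn't preserved in this export; below is a minimal sketch of the idea, assuming fragments are drawn uniformly from a single target region and UMIs are uniform random 6-mers. The region size and fragment-length range are placeholder values, not the panel's real numbers.)

```python
import random
from collections import Counter

def simulate_conflicts(n_molecules, region_size=500, frag_min=100, frag_max=200, umi_length=6):
    """Count fragments that share start, end AND UMI with another fragment."""
    bases = "ACGT"
    seen = Counter()
    for _ in range(n_molecules):
        start = random.randint(0, region_size - frag_max)
        length = random.randint(frag_min, frag_max)
        umi = "".join(random.choice(bases) for _ in range(umi_length))
        seen[(start, start + length, umi)] += 1
    # A key observed k times means k - 1 fragments would wrongly be merged
    # into an existing UMI group.
    return sum(count - 1 for count in seen.values() if count > 1)
```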
And here's how I run it:
I run the simulation 10 times, inputting:
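(The exact inputs aren't shown here either; the run would look something along these lines, reusing the placeholder region parameters from the sketch above and the molecule counts discussed below:)

```python
for n_molecules in (50_000, 2_000):
    conflicts = [simulate_conflicts(n_molecules) for _ in range(10)]
    print(n_molecules, "molecules, mean conflicts:", sum(conflicts) / len(conflicts))
```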
When I run this with 50k molecules, which is way more than we would aim to use, I get an average of 38 conflicting fragments, meaning these would have been merged into the same UMI group. If I use a more reasonable number like 2k molecules, the average number of conflicting fragments is consistently 0. These results suggest that we are not at risk of getting a substantial number of conflicting DNA fragments using these UMIs.
I don't know if this is relevant at all! I have not talked to the lab yet so I don't know if this is something that has already been thought out and optimised. I have only looked at some results from customers that ordered UMI analysis and seen that they have not received very usable coverage results, with poor sensitivity for detecting low-frequency variants as a result.
Ticket example (https://clinical-scilifelab.supportsystem.com/scp/tickets.php?id=62959)
Looking at the % duplicates for their example samples, the results made sense to me, and it seemed they could be substantially improved if the samples were sequenced in a way that produces more duplicates. So out of curiosity I did some simple simulations in Python over the input nanograms of DNA, the number of PCR cycles, and the total sequencing coverage aimed for.
I have probably missed something in this simulation, and probably you can get to similar results with some simple math! But just in case it is useful I'll write the results here.
The code:
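(Again the actual code block isn't visible here; a minimal sketch of the first step, assuming roughly 3.3 pg of DNA per haploid human genome, i.e. about 300 copies of any given target region per ng of input:)

```python
PG_PER_HAPLOID_GENOME = 3.3  # approximate mass of one haploid human genome, in picograms

def molecules_in_region(input_ng):
    """Number of haploid genome copies, and hence copies of any single
    target region, contained in the given nanograms of input DNA."""
    return int(input_ng * 1000 / PG_PER_HAPLOID_GENOME)
```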
Now I have the total number of molecules in this region given the original ng of DNA.
Then, given the number of molecules in the region, the number of PCR cycles, and the total sequencing coverage aimed for, I extract the percent duplicates, the coverage after UMI collapse, etc...
And I do this 10 times each (`for r in range(10):`) to get a sense of the randomness.
I get these from the function below:
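(The function isn't preserved in this export; my best reconstruction of the idea, assuming perfect PCR efficiency, uniform random sampling of reads without replacement, and the 3-reads-per-UMI-group requirement discussed further down:)

```python
import random
from collections import Counter

def simulate_sequencing(n_molecules, pcr_cycles, total_coverage, min_reads_per_group=3):
    """Amplify each original molecule by PCR, sample reads at the aimed-for
    coverage, and report the percent duplicates and the coverage after UMI collapse."""
    copies_per_molecule = 2 ** pcr_cycles          # perfect PCR efficiency assumed
    pool_size = n_molecules * copies_per_molecule  # every amplified copy gets its own index

    # "Sequencing" = drawing total_coverage reads from the pool without replacement;
    # integer division maps each sampled copy back to its original molecule (UMI group).
    read_indices = random.sample(range(pool_size), min(total_coverage, pool_size))
    reads_per_molecule = Counter(i // copies_per_molecule for i in read_indices)

    percent_duplicates = 100 * (1 - len(reads_per_molecule) / len(read_indices))
    # Only UMI groups with enough reads survive consensus collapsing.
    collapsed_coverage = sum(1 for n in reads_per_molecule.values() if n >= min_reads_per_group)
    return percent_duplicates, collapsed_coverage
```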
Example from running this:
I loop through:
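(The exact parameter grid isn't listed here; as an illustration, a sweep over the input amounts and coverages mentioned below, with 10 repeats of each combination, could look like this. The choice of 5 PCR cycles matches the later plots, and `molecules_in_region` / `simulate_sequencing` refer to the sketches above.)

```python
results = []
for input_ng in (6, 16):
    for total_coverage in (2_500, 5_000, 7_500, 10_000, 20_000):
        for r in range(10):
            pct_dup, collapsed = simulate_sequencing(
                molecules_in_region(input_ng), pcr_cycles=5, total_coverage=total_coverage
            )
            results.append((input_ng, total_coverage, pct_dup, collapsed))
```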
Here's the total aimed-for coverage on the Y-axis and the total coverage after UMI collapse on the X-axis, with colors depending on the percent duplicates. All of these results make sense to me.
Regardless of sequencing coverage, you want to aim for a percent duplicates of around 70%. This comes from the requirement of needing 3 reads per UMI group: optimally, 1/3 of reads are unique and 2/3 (about 67%) are duplicates, i.e. 2 duplicates for each unique read.
If you have more duplicates than that you're wasting more reads than necessary per UMI group, and getting fewer UMI-collapsed reads as a result.
Here's another view of the same thing, with the percent duplicates on the y-axis.
The percent duplicates depends, of course, on the amount of input DNA (ng) and the total coverage you're sequencing at.
So if you only sequence at 2500X and have 16 ng of input, you can't expect to sequence the same molecule very often... on the other hand, if you sequence very deeply at 20 000X and have a low input of 6 ng, you can expect a lot of duplicates.
Since the optimal percent duplicates is around 70%, we can also see that the optimal amount of input DNA differs depending on the depth of sequencing we're aiming for. If we want to sequence at 5000X we should use a low amount of DNA, around 6 ng, otherwise we won't get enough duplicates.
If, however, we want to reach a higher final UMI-collapsed coverage, we should also add more input DNA to avoid over-sequencing the same molecules too much!
Let's get to some more practical examples:
It seems that a lot of the samples customers order with UMI analysis are aiming for a coverage of around 5000X, and that they want to detect variants down to around 0.005 VAF, meaning 0.5% of reads would support the variant.
Is it theoretically possible to reach a coverage after UMI collapse that would let us reliably detect these variants if we sequence to 5000X? Well, if we got to 1000X after collapse, we would expect 1000 * 0.005 = 5 supporting reads, which should be sufficient, but perhaps not entirely reliable as there is some randomness involved here...
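(As a quick sanity check on that randomness, assuming the supporting reads are binomially distributed and ignoring sequencing errors, you can compute how often you would see at least some minimum number of supporting reads; the threshold of 3 below is only an example, not an actual caller setting:)

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance of seeing at least 3 variant-supporting reads at 1000X
# collapsed coverage for a true VAF of 0.5%.
print(prob_at_least(3, 1000, 0.005))
```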
Can we get to 1000X after UMI collapse though?
Here I have only looked at the results if we sequence to 5000X, and it seems possible if we use a low amount of input DNA in this case.
A small note on the number of PCR cycles: maybe I should exclude this parameter from the simulation. It only matters here because of the way the simulation is done. For instance, a low number of PCR amplification cycles looks beneficial, but that's only because it removes the randomness effect by "filling the pool of molecules" with roughly the same amount that will then be randomly withdrawn by "sequencing".
Here's the result at 5000X coverage if I limit the simulation to only 5 PCR cycles:
Here's the comparison against the actual data from the ticket mentioned earlier: https://clinical-scilifelab.supportsystem.com/scp/tickets.php?id=62959
The stats don't perfectly agree, but they're not that far off at least... I don't know what the input ng was, however, so I can't compare that at the moment. I also realise now that these samples were run on the GMSmyeloid panel, not the lymphoid one, so I would need to adjust for that panel size to make the simulation comparable.
Here's the same for the myeloid panel size (approximately half the size of the lymphoid panel), though I guess the size doesn't matter in the end since I'm only looking at one region...:
For the lymphoid panel, just to show how the optimal input ng changes depending on the depth of sequencing:
The same for 7500X and 10000X: