UMI sequencing optimisation #1337
Regarding the likelihood of fragment conflicts for 6-base UMIs, I did some simulations counting the number of conflicting DNA fragments, i.e. DNA fragments that randomly get cut from the exact same position in the target area of the panel (sharing the same start and end positions) and also get the same UMI barcode. Here's the code:
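(The code block itself isn't preserved in this export; below is a minimal sketch of the idea, assuming fragments are drawn uniformly from a single target region and UMIs are uniform random 6-mers. The region size and fragment-length range are placeholder values, not the panel's real numbers.)

```python
import random
from collections import Counter

def simulate_conflicts(n_molecules, region_size=500, frag_min=100, frag_max=200, umi_length=6):
    """Count fragments that share start, end AND UMI with another fragment."""
    bases = "ACGT"
    seen = Counter()
    for _ in range(n_molecules):
        start = random.randint(0, region_size - frag_max)
        length = random.randint(frag_min, frag_max)
        umi = "".join(random.choice(bases) for _ in range(umi_length))
        seen[(start, start + length, umi)] += 1
    # A key observed k times means k - 1 fragments would wrongly be merged
    # into an existing UMI group.
    return sum(count - 1 for count in seen.values() if count > 1)
```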
And here's how I run it:
I run the simulation 10 times, inputting:
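(The exact inputs aren't shown here either; the run would look something along these lines, reusing the placeholder region parameters from the sketch above and the molecule counts discussed below:)

```python
for n_molecules in (50_000, 2_000):
    conflicts = [simulate_conflicts(n_molecules) for _ in range(10)]
    print(n_molecules, "molecules, mean conflicts:", sum(conflicts) / len(conflicts))
```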
When I run this with 50k molecules, which is way more than we would aim to use, I get an average of 38 conflicting fragments, meaning these would have been merged into the same UMI group. If I use a more reasonable number like 2k molecules, the average number of conflicting fragments is consistently 0. These results suggest that we are not at risk of getting a substantial number of conflicting DNA fragments using these UMIs.
I don't know if this is relevant at all! I have not talked to the lab yet so I don't know if this is something that has already been thought out and optimised. I have only looked at some results from customers that ordered UMI analysis and seen that they have not received very usable coverage results, with poor sensitivity for detecting low-frequency variants as a result.
Ticket example (https://clinical-scilifelab.supportsystem.com/scp/tickets.php?id=62959)
Looking at the % duplicates for their example samples, the results made sense to me, and it seemed they could be substantially improved if the samples were sequenced in a way that produces more duplicates. So out of curiosity I did some simple simulations in Python over the input nanograms of DNA, the number of PCR cycles, and the total sequencing coverage aimed for.
I have probably missed something in this simulation, and probably you can get to similar results with some simple math! But just in case it is useful I'll write the results here.
The code:
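(Again the actual code block isn't visible here; a minimal sketch of the first step, assuming roughly 3.3 pg of DNA per haploid human genome, i.e. about 300 copies of any given target region per ng of input:)

```python
PG_PER_HAPLOID_GENOME = 3.3  # approximate mass of one haploid human genome, in picograms

def molecules_in_region(input_ng):
    """Number of haploid genome copies, and hence copies of any single
    target region, contained in the given nanograms of input DNA."""
    return int(input_ng * 1000 / PG_PER_HAPLOID_GENOME)
```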
Now I have the total number of molecules in this region given the original ng of DNA.
Then, given the number of molecules in the region, the number of PCR cycles, and the total sequencing coverage aimed for, I extract the percent duplicates, the coverage after UMI collapse, etc...
And I do this 10 times each (`for r in range(10):`) to get a sense of the randomness.
I get these from the function below:
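(The function isn't preserved in this export; my best reconstruction of the idea, assuming perfect PCR efficiency, uniform random sampling of reads without replacement, and the 3-reads-per-UMI-group requirement discussed further down:)

```python
import random
from collections import Counter

def simulate_sequencing(n_molecules, pcr_cycles, total_coverage, min_reads_per_group=3):
    """Amplify each original molecule by PCR, sample reads at the aimed-for
    coverage, and report the percent duplicates and the coverage after UMI collapse."""
    copies_per_molecule = 2 ** pcr_cycles          # perfect PCR efficiency assumed
    pool_size = n_molecules * copies_per_molecule  # every amplified copy gets its own index

    # "Sequencing" = drawing total_coverage reads from the pool without replacement;
    # integer division maps each sampled copy back to its original molecule (UMI group).
    read_indices = random.sample(range(pool_size), min(total_coverage, pool_size))
    reads_per_molecule = Counter(i // copies_per_molecule for i in read_indices)

    percent_duplicates = 100 * (1 - len(reads_per_molecule) / len(read_indices))
    # Only UMI groups with enough reads survive consensus collapsing.
    collapsed_coverage = sum(1 for n in reads_per_molecule.values() if n >= min_reads_per_group)
    return percent_duplicates, collapsed_coverage
```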
Example from running this:
I loop through:
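(The exact parameter grid isn't listed here; as an illustration, a sweep over the input amounts and coverages mentioned below, with 10 repeats of each combination, could look like this. The choice of 5 PCR cycles matches the later plots, and `molecules_in_region` / `simulate_sequencing` refer to the sketches above.)

```python
results = []
for input_ng in (6, 16):
    for total_coverage in (2_500, 5_000, 7_500, 10_000, 20_000):
        for r in range(10):
            pct_dup, collapsed = simulate_sequencing(
                molecules_in_region(input_ng), pcr_cycles=5, total_coverage=total_coverage
            )
            results.append((input_ng, total_coverage, pct_dup, collapsed))
```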
Here's the total aimed-for coverage on the Y-axis and the total coverage after UMI collapse on the X-axis, with colors depending on the percent duplicates. All of these results make sense to me.
Regardless of sequencing coverage, you want to aim for a percent duplicates of around 70%. This comes from the requirement of needing 3 reads per UMI group: optimally, 1/3 of reads are unique and 2/3 (about 67%) are duplicates, i.e. 2 duplicates for each unique read.
If you have more duplicates than that you're wasting more reads than necessary per UMI group, and getting fewer UMI-collapsed reads as a result.
Here's another view of the same thing, with the percent duplicates on the y-axis.
The percent duplicates depends, of course, on the amount of input DNA (ng) and the total coverage you're sequencing at.
So if you only sequence at 2500X and have 16 ng of input, you can't expect to sequence the same molecule very often... on the other hand, if you sequence very deeply at 20 000X and have a low input of 6 ng, you can expect a lot of duplicates.
Since the optimal percent duplicates is around 70%, we can also see that the optimal amount of input DNA differs depending on the depth of sequencing we're aiming for. If we want to sequence at 5000X we should use a low amount of DNA, around 6 ng, otherwise we won't get enough duplicates.
If, however, we want to reach a higher final UMI-collapsed coverage, we should also add more input DNA to avoid over-sequencing the same molecules too much!
Let's get to some more practical examples:
It seems that a lot of the samples customers order with UMI analysis are aiming for a coverage of around 5000X, and that they want to detect variants down to around 0.005 VAF, meaning 0.5% of reads would support the variant.
Is it theoretically possible to reach a coverage after UMI collapse that would let us reliably detect these variants if we sequence to 5000X? Well, if we got to 1000X after collapse, we would expect 1000 * 0.005 = 5 supporting reads, which should be sufficient, but perhaps not entirely reliable as there is some randomness involved here...
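(As a quick sanity check on that randomness, assuming the supporting reads are binomially distributed and ignoring sequencing errors, you can compute how often you would see at least some minimum number of supporting reads; the threshold of 3 below is only an example, not an actual caller setting:)

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance of seeing at least 3 variant-supporting reads at 1000X
# collapsed coverage for a true VAF of 0.5%.
print(prob_at_least(3, 1000, 0.005))
```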
Can we get to 1000X after UMI collapse though?
Here I have only looked at the results if we sequence to 5000X, and it seems possible if we use a low amount of input DNA in this case.
A small note on the number of PCR cycles: maybe I should exclude this parameter from the simulation. It only matters here because of the way the simulation is done. For instance, a low number of PCR amplification cycles looks beneficial, but that's only because it removes the randomness effect by "filling the pool of molecules" with roughly the same amount that will then be randomly withdrawn by "sequencing".
Here's the result at 5000X coverage if I limit the simulation to only 5 PCR cycles:
Here's the comparison against the actual data from the ticket mentioned earlier: https://clinical-scilifelab.supportsystem.com/scp/tickets.php?id=62959
The stats don't perfectly agree, but they're not that far off at least... I don't know what the input ng was, however, so I can't compare that at the moment. I also realise now that these samples were run on the GMSmyeloid panel, not the lymphoid one, so I would need to adjust for that panel size to make the simulation comparable.
Here's the same for the myeloid panel size (approximately half the size of the lymphoid panel), though I guess the size doesn't matter in the end since I'm only looking at one region...:
For the lymphoid panel, just to show how the optimal input ng changes depending on the depth of sequencing:
The same for 7500X and 10000X: