Inconsistencies and optimization regarding obs2partition! #19

nossleinad · 2024-04-19T13:21:47Z

Incosistencies

dest.scaling = zeros(length(seq)) not occuring for CodonPartition but for other DiscretePartitions.
For CodonPartition, we error if sequence length does not match partition, however for other DiscretePartitions we don't (a warning is commented out).

Optimization

If we decide that we want to enforce that sequence length matches with partition, some optimizations can be made, since we can now do in-place modifications. Here is my suggestion for such an optimization for CodonPartition (the idea is applicable to other DiscretePartitions) :

function obs2partition!(dest::CodonPartition, seq::String; code = universal_code)
    problem_codons = String[]
    if mod(length(seq), 3) != 0
        error("Codon sequences must be divisible by 3")
    end
    if length(seq) / 3 != dest.sites
        error("Sequence length does not match partition")
    end

    @views for j in axes(dest.state, 2)
        c = seq[3*(j-1)+1:3*(j-1)+3]
        cod_ind = get(code.string2sense, c, -1)
        if cod_ind == -1
            fill!(dest.state[:, j], 1.0)
            push!(problem_codons, c)
        else
            fill!(dest.state[:, j], 0.0)
            dest.state[cod_ind, j] = 1.0
        end
    end
    return countmap(problem_codons)
end

And here is a benchmark of this suggestion

#Simulate a sequence
function sim_seq(n)
    nucleotides = collect(MolecularEvolution.nucstring)
    return join([rand(nucleotides) for _ = 1:(3*n)])
end

n = 100_000
seq = sim_seq(n)
cpart = CodonPartition(n)
@benchmark obs2partition!(cpart, seq)

Main

BenchmarkTools.Trial: 387 samples with 1 evaluation.
 Range (min … max):  11.382 ms …  15.878 ms  ┊ GC (min … max): 0.00% … 10.13%
 Time  (median):     12.950 ms               ┊ GC (median):    9.26%
 Time  (mean ± σ):   12.944 ms ± 542.715 μs  ┊ GC (mean ± σ):  8.53% ±  2.75%

                              ▄▄▇█ ▃▁▄▂▃                        
  ▃▁▃▃▄▃▂▃▃▄▂▄▁▁▁▁▃▁▁▃▁▃▃▇▅▄▄▆████▇█████▇▇▆▄▃▄▄▃▂▃▂▃▄▁▂▃▁▁▁▁▁▂ ▃
  11.4 ms         Histogram: frequency by time         14.3 ms <

 Memory estimate: 54.29 MiB, allocs estimate: 200012.

Suggestion

BenchmarkTools.Trial: 1461 samples with 1 evaluation.
Range (min … max):  3.287 ms …  4.815 ms  ┊ GC (min … max): 0.00% … 28.54%
Time  (median):     3.412 ms              ┊ GC (median):    0.00%
Time  (mean ± σ):   3.420 ms ± 79.042 μs  ┊ GC (mean ± σ):  0.13% ±  1.57%

                       ▄▅█▇▅▃▃▃▁                             
 ▂▁▁▁▂▁▁▁▁▁▁▂▁▁▁▂▂▂▄▃▄▅█████████▇▅▅▄▃▃▂▂▂▂▂▂▁▁▂▂▂▁▁▁▁▂▁▁▁▂▂ ▃
 3.29 ms        Histogram: frequency by time        3.56 ms <

Memory estimate: 230.77 KiB, allocs estimate: 4624.

The text was updated successfully, but these errors were encountered:

murrellb · 2024-04-25T08:06:02Z

Good. This seems like an improvement on performance, and preventing possible bugs. If there is a use case where the sequence length doesn't match the partition, then users should populate their own partition anyway.

Fix issue #19

nossleinad mentioned this issue Apr 26, 2024

Fix issue #19 #20

Merged

murrellb added a commit that referenced this issue May 16, 2024

Merge pull request #20 from nossleinad/fix_obs2partition!

6d25159

Fix issue #19

nossleinad closed this as completed May 22, 2024

nossleinad reopened this May 22, 2024

nossleinad closed this as completed May 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistencies and optimization regarding obs2partition! #19

Inconsistencies and optimization regarding obs2partition! #19

nossleinad commented Apr 19, 2024

murrellb commented Apr 25, 2024

Inconsistencies and optimization regarding obs2partition! #19

Inconsistencies and optimization regarding obs2partition! #19

Comments

nossleinad commented Apr 19, 2024

Incosistencies

Optimization

And here is a benchmark of this suggestion

murrellb commented Apr 25, 2024