Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistencies and optimization regarding obs2partition! #19

Closed
nossleinad opened this issue Apr 19, 2024 · 1 comment
Closed

Inconsistencies and optimization regarding obs2partition! #19

nossleinad opened this issue Apr 19, 2024 · 1 comment

Comments

@nossleinad
Copy link
Collaborator

Incosistencies

Optimization

If we decide that we want to enforce that sequence length matches with partition, some optimizations can be made, since we can now do in-place modifications. Here is my suggestion for such an optimization for CodonPartition (the idea is applicable to other DiscretePartitions) :

function obs2partition!(dest::CodonPartition, seq::String; code = universal_code)
    problem_codons = String[]
    if mod(length(seq), 3) != 0
        error("Codon sequences must be divisible by 3")
    end
    if length(seq) / 3 != dest.sites
        error("Sequence length does not match partition")
    end

    @views for j in axes(dest.state, 2)
        c = seq[3*(j-1)+1:3*(j-1)+3]
        cod_ind = get(code.string2sense, c, -1)
        if cod_ind == -1
            fill!(dest.state[:, j], 1.0)
            push!(problem_codons, c)
        else
            fill!(dest.state[:, j], 0.0)
            dest.state[cod_ind, j] = 1.0
        end
    end
    return countmap(problem_codons)
end

And here is a benchmark of this suggestion

#Simulate a sequence
function sim_seq(n)
    nucleotides = collect(MolecularEvolution.nucstring)
    return join([rand(nucleotides) for _ = 1:(3*n)])
end

n = 100_000
seq = sim_seq(n)
cpart = CodonPartition(n)
@benchmark obs2partition!(cpart, seq)

Main

BenchmarkTools.Trial: 387 samples with 1 evaluation.
 Range (min … max):  11.382 ms …  15.878 ms  ┊ GC (min … max): 0.00% … 10.13%
 Time  (median):     12.950 ms               ┊ GC (median):    9.26%
 Time  (mean ± σ):   12.944 ms ± 542.715 μs  ┊ GC (mean ± σ):  8.53% ±  2.75%

                              ▄▄▇█ ▃▁▄▂▃                        
  ▃▁▃▃▄▃▂▃▃▄▂▄▁▁▁▁▃▁▁▃▁▃▃▇▅▄▄▆████▇█████▇▇▆▄▃▄▄▃▂▃▂▃▄▁▂▃▁▁▁▁▁▂ ▃
  11.4 ms         Histogram: frequency by time         14.3 ms <

 Memory estimate: 54.29 MiB, allocs estimate: 200012.

Suggestion

BenchmarkTools.Trial: 1461 samples with 1 evaluation.
Range (min … max):  3.287 ms …  4.815 ms  ┊ GC (min … max): 0.00% … 28.54%
Time  (median):     3.412 ms              ┊ GC (median):    0.00%
Time  (mean ± σ):   3.420 ms ± 79.042 μs  ┊ GC (mean ± σ):  0.13% ±  1.57%

                       ▄▅█▇▅▃▃▃▁                             
 ▂▁▁▁▂▁▁▁▁▁▁▂▁▁▁▂▂▂▄▃▄▅█████████▇▅▅▄▃▃▂▂▂▂▂▂▁▁▂▂▂▁▁▁▁▂▁▁▁▂▂ ▃
 3.29 ms        Histogram: frequency by time        3.56 ms <

Memory estimate: 230.77 KiB, allocs estimate: 4624.
@murrellb
Copy link
Member

Good. This seems like an improvement on performance, and preventing possible bugs. If there is a use case where the sequence length doesn't match the partition, then users should populate their own partition anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants