Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: calculate cds coverage #1514

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

ivan-aksamentov
Copy link
Member

@ivan-aksamentov ivan-aksamentov commented Jul 19, 2024

Resolves: #1513

  • add a new column/field "cdsCoverage" to the output files
  • add CDS coverage table in the tooltip of the "Cov." column in Nextclade Web

Copy link

vercel bot commented Jul 19, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated (UTC)
nextclade ✅ Ready (Inspect) Visit Preview Jul 19, 2024 4:45pm

Comment on lines +240 to +261
let cds_coverage = translation
.cdses()
.map(|cds| {
let ref_peptide_len = ref_translation.get_cds(&cds.name)?.seq.len();
let num_aligned_aa = cds.alignment_ranges.iter().map(Range::len).sum::<usize>();
let num_unknown_aa = unknown_aa_ranges
.iter()
.filter(|r| r.cds_name == cds.name)
.map(|r| r.length)
.sum();
let total_covered_aa = num_aligned_aa.saturating_sub(num_unknown_aa);

let coverage_aa = if ref_peptide_len == 0 {
0.0
} else {
total_covered_aa as f64 / ref_peptide_len as f64
};

Ok((cds.name.clone(), coverage_aa))
})
.collect::<Result<BTreeMap<String, f64>, Report>>()?;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relevant algo code

@ivan-aksamentov
Copy link
Member Author

Right now missing genes are not shown in the CSV/TSV output and in the Web UI. This was the easiest and aligns with how mutations are shown etc.

Numbers are not truncated in CSV/TSV. It's not human-readable much anyways.

@rneher
Copy link
Member

rneher commented Jul 21, 2024

the algorithm makes sense to me and I think works as intended.

HIV is a good test case (frame shifts, multi-segment CDS, gaps, insertions, etc):

https://nextclade-git-feat-aa-coverage-nextstrain.vercel.app/?dataset-name=community/neherlab/hiv-1/hxb2&input-fasta=example

Currently it the calculation is (aligned amino acids + gaps)/reference_length where numerator and denominator are summed over all segments. Since insertions are not counted, this is always between 0 and 1. Gaps are part of the alignment and I would argue that a protein with a (functional) deletion is as complete at the reference one. So I think the treatment of gaps and insertions is sensible.

notably, frameshifted regions or XXXXX don't count towards the coverage, which makes sense for the intended use case (finding sequences that are worth analyzing).

If we weren't using "Coverage" for the corresponding number for nucleotides, I would probably argue for "Completeness".

regarding precision. I don't feel strongly. Digits past 1+log10 of the reference CDS length are not meaningful. But since most genes that are fully covered are just output as 1, this doesn't lead to a huge expansion of the file.

@j23414 j23414 self-assigned this Sep 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add coverage per CDS to output
3 participants