feat: calculate cds coverage #1514

ivan-aksamentov · 2024-07-19T16:37:28Z

Resolves: #1513

add a new column/field "cdsCoverage" to the output files
add CDS coverage table in the tooltip of the "Cov." column in Nextclade Web

vercel · 2024-07-19T16:37:33Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Updated (UTC)
nextclade	✅ Ready (Inspect)	Visit Preview	Jul 19, 2024 4:45pm

ivan-aksamentov · 2024-07-19T16:39:07Z

packages/nextclade/src/run/nextclade_run_one.rs

+    let cds_coverage = translation
+      .cdses()
+      .map(|cds| {
+        let ref_peptide_len = ref_translation.get_cds(&cds.name)?.seq.len();
+        let num_aligned_aa = cds.alignment_ranges.iter().map(Range::len).sum::<usize>();
+        let num_unknown_aa = unknown_aa_ranges
+          .iter()
+          .filter(|r| r.cds_name == cds.name)
+          .map(|r| r.length)
+          .sum();
+        let total_covered_aa = num_aligned_aa.saturating_sub(num_unknown_aa);
+
+        let coverage_aa = if ref_peptide_len == 0 {
+          0.0
+        } else {
+          total_covered_aa as f64 / ref_peptide_len as f64
+        };
+
+        Ok((cds.name.clone(), coverage_aa))
+      })
+      .collect::<Result<BTreeMap<String, f64>, Report>>()?;
+


Relevant algo code

ivan-aksamentov · 2024-07-19T16:47:01Z

Right now missing genes are not shown in the CSV/TSV output and in the Web UI. This was the easiest and aligns with how mutations are shown etc.

Numbers are not truncated in CSV/TSV. It's not human-readable much anyways.

rneher · 2024-07-21T15:39:25Z

the algorithm makes sense to me and I think works as intended.

HIV is a good test case (frame shifts, multi-segment CDS, gaps, insertions, etc):

https://nextclade-git-feat-aa-coverage-nextstrain.vercel.app/?dataset-name=community/neherlab/hiv-1/hxb2&input-fasta=example

Currently it the calculation is (aligned amino acids + gaps)/reference_length where numerator and denominator are summed over all segments. Since insertions are not counted, this is always between 0 and 1. Gaps are part of the alignment and I would argue that a protein with a (functional) deletion is as complete at the reference one. So I think the treatment of gaps and insertions is sensible.

notably, frameshifted regions or XXXXX don't count towards the coverage, which makes sense for the intended use case (finding sequences that are worth analyzing).

If we weren't using "Coverage" for the corresponding number for nucleotides, I would probably argue for "Completeness".

regarding precision. I don't feel strongly. Digits past 1+log10 of the reference CDS length are not meaningful. But since most genes that are fully covered are just output as 1, this doesn't lead to a huge expansion of the file.

ivan-aksamentov added 2 commits July 19, 2024 18:36

feat: calculate cds coverage

e5d2508

feat(web): display coverage in results table

846730c

ivan-aksamentov commented Jul 19, 2024

View reviewed changes

ivan-aksamentov mentioned this pull request Jul 19, 2024

Add coverage per CDS to output #1513

Open

vercel bot deployed to Preview July 19, 2024 16:45 View deployment

j23414 self-assigned this Sep 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: calculate cds coverage #1514

feat: calculate cds coverage #1514

ivan-aksamentov commented Jul 19, 2024 •

edited

Loading

vercel bot commented Jul 19, 2024 •

edited

Loading

ivan-aksamentov Jul 19, 2024

ivan-aksamentov commented Jul 19, 2024

rneher commented Jul 21, 2024

feat: calculate cds coverage #1514

Are you sure you want to change the base?

feat: calculate cds coverage #1514

Conversation

ivan-aksamentov commented Jul 19, 2024 • edited Loading

vercel bot commented Jul 19, 2024 • edited Loading

ivan-aksamentov Jul 19, 2024

Choose a reason for hiding this comment

ivan-aksamentov commented Jul 19, 2024

rneher commented Jul 21, 2024

ivan-aksamentov commented Jul 19, 2024 •

edited

Loading

vercel bot commented Jul 19, 2024 •

edited

Loading