ingest: Standardize steps for adding gene coverage to metadata #50

joverlee521 · 2024-07-10T20:17:27Z

Related to https://github.com/nextstrain/private/issues/102

It seems like a common pattern for sequencing efforts to focus on specific genes instead of the full genome. It would be helpful for ingest to annotate each record's gene coverage so that users can explore the data by gene.

This was previously done by @j23414 in dengue with nextstrain/dengue#36.

We can add these as standardized steps to the ingest template, but one hiccup is that they require running sequences through Nextclade. This is easy if a Nextclade dataset already exists, but not as straightforward if users need to create a Nextclade dataset from scratch.

The minimal Nextclade dataset files for annotating gene coverage:

- reference FASTA
- genome annotation GFF file
- pathogen.json

The main stumbling block is figuring out which reference to use (currently ingest does not require a reference) and creating the GFF file. It seems like we should have a comprehensive guide on how to get past these blockers in the template as well.
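
For illustration, here is a minimal sketch of what "gene coverage" could mean once a sequence has been aligned to the reference: the fraction of positions inside each annotated gene that are neither gaps nor `N`. This is an assumption about the definition, not a description of how Nextclade or the dengue workflow computes it, and the function name and the gene coordinates below are hypothetical.

```python
from typing import Dict, Tuple

# Characters treated as "not covered": alignment gaps and ambiguous N.
MISSING = set("-Nn")


def gene_coverage(aligned_seq: str, genes: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Return the fraction of covered positions per gene.

    `aligned_seq` is a sequence aligned to the reference coordinate system.
    `genes` maps a gene name to 1-based inclusive (start, end) coordinates,
    as they would appear in a GFF file.
    """
    coverage = {}
    for name, (start, end) in genes.items():
        region = aligned_seq[start - 1:end]  # convert to a 0-based slice
        if not region:
            coverage[name] = 0.0
            continue
        covered = sum(1 for base in region if base not in MISSING)
        # Divide by the annotated gene length so truncated sequences
        # count their missing tail as uncovered.
        coverage[name] = round(covered / (end - start + 1), 3)
    return coverage


# Hypothetical two-gene annotation on a 12-base reference.
genes = {"geneA": (1, 6), "geneB": (7, 12)}
print(gene_coverage("ACGTACACGNNN", genes))
```

Each value could then be written to a metadata column per gene, which is the kind of annotation this issue proposes standardizing.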

genehack · 2024-07-11T18:19:12Z

A simple form of a flowchart for "figure out the reference" would be something like, "Is there a RefSeq entry? If so, use that. If not, do a literature search or consult an expert in the field." (I realize that's not great but I do think this is one of those areas where you kinda actually need to know something about what you're trying to do?)

As for constructing a GFF, there are tools that we could point to? Presumably the most common starting point is going to be a GenBank file; if somebody is trying to start with a completely unannotated FASTA as the reference sequence, again, they're probably going to need more specialized support than we want to provide?
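
As a rough sketch of what constructing a GFF involves once feature coordinates are known (for example, read off a GenBank record), the following writes GFF3 lines using only the standard library. The `Feature` type, the `to_gff3` function, and the `REF1` sequence ID are hypothetical. Which feature types and attributes Nextclade actually expects is an assumption here and should be checked against its documentation.

```python
from typing import Iterable, List, NamedTuple


class Feature(NamedTuple):
    """A single annotated region with 1-based inclusive coordinates."""
    name: str
    start: int
    end: int
    strand: str = "+"
    type: str = "CDS"


def to_gff3(seqid: str, features: Iterable[Feature], source: str = "example") -> str:
    """Format features as GFF3 text (nine tab-separated columns per line)."""
    lines: List[str] = ["##gff-version 3"]
    for f in features:
        if f.start < 1 or f.end < f.start:
            raise ValueError(f"invalid coordinates for {f.name}: {f.start}-{f.end}")
        phase = "0" if f.type == "CDS" else "."  # phase only applies to CDS features
        columns = [
            seqid, source, f.type,
            str(f.start), str(f.end),
            ".",            # score: not applicable
            f.strand, phase,
            f"ID={f.name};Name={f.name}",
        ]
        lines.append("\t".join(columns))
    return "\n".join(lines) + "\n"


# Hypothetical single-gene annotation.
print(to_gff3("REF1", [Feature("E", 10, 30)]), end="")
```

A real GenBank file carries more structure (joined locations, gene/CDS hierarchy, qualifiers) than this sketch handles, which is where a dedicated conversion tool helps.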

joverlee521 · 2024-07-11T23:21:07Z

> As for constructing a GFF, there are tools that we could point to?

For sure! Richard has a script for generating the GFF from a GenBank accession, but I haven't personally tried it.
https://github.com/nextstrain/nextclade_dataset_template/blob/sanitize_gff/generate_from_genbank.py

ivan-aksamentov · 2024-07-19T13:15:13Z

Just a quick clarification/precision: Nextclade technically does not require a GFF annotation - it can run with just reference fasta and a very minimal (almost empty) pathogen.json. Though, of course, without annotation it would not know anything about CDSes and amino acid things.

One idea for allowing faster bootstrapping of projects relying on Nextclade is to also not require annotations by default, where possible. This will end up with less useful analysis, but might encourage new learners and simplify their first steps. Will likely increase complexity of workflows though.
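
To make "a very minimal (almost empty) pathogen.json" concrete, here is a sketch that generates one. The field names (`schemaVersion`, `files.reference`, `files.pathogenJson`) and the version string are assumptions based on the Nextclade v3 dataset format and should be verified against the Nextclade documentation. The function name is hypothetical.

```python
import json


def minimal_pathogen_json(reference_filename: str = "reference.fasta") -> str:
    """Return the text of a bare-bones pathogen.json (assumed v3 layout)."""
    config = {
        "schemaVersion": "3.0.0",  # assumed schema version
        "files": {
            "reference": reference_filename,
            "pathogenJson": "pathogen.json",
        },
    }
    return json.dumps(config, indent=2)


print(minimal_pathogen_json())
```

Under the same assumption, getting gene/CDS coverage would additionally need the GFF annotation to be present in the dataset.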

joverlee521 · 2024-07-19T16:33:55Z

Thanks for the clarification @ivan-aksamentov! I guess I didn't mean a minimum Nextclade dataset, but the minimum files needed to get the gene/CDS coverage, which does require a GFF annotation. I've updated the language above to be explicit.

joverlee521 added the enhancement New feature or request label Jul 10, 2024

j23414 mentioned this issue Jul 15, 2024

Adds rest of the genes for gene coverage columns nextstrain/dengue#79

Merged

1 task

joverlee521 mentioned this issue Jul 19, 2024

Add coverage per CDS to output nextstrain/nextclade#1513

Open

joverlee521 mentioned this issue Jul 26, 2024

ingest: How to handle segmented viruses #59

Open
