ingest: Standardize steps for adding gene coverage to metadata #50

Open
joverlee521 opened this issue Jul 10, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@joverlee521
Contributor

joverlee521 commented Jul 10, 2024

Related to https://github.com/nextstrain/private/issues/102

It seems like a common pattern for sequencing efforts to focus on specific genes instead of the full genome. It would be helpful for ingest to annotate each record's gene coverage to explore the data.

This was previously done by @j23414 in dengue with nextstrain/dengue#36.

We can add these as standardized steps to the ingest template, but one hiccup is that this requires running sequences through Nextclade. This is easy if a Nextclade dataset already exists, but not as straightforward if users need to create a Nextclade dataset from scratch.

The minimal Nextclade dataset files for annotating gene coverage are:

  1. reference FASTA
  2. genome annotation GFF file
  3. pathogen.json

The main stumbling blocks are figuring out which reference to use (currently ingest does not require a reference) and creating the GFF file. It seems like the template should also include a comprehensive guide on how to get past these blockers.
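
To make "gene coverage" concrete, here is a minimal sketch of one way it could be computed once a sequence has been aligned to the reference. It is not the approach used in nextstrain/dengue#36, and it makes several assumptions: the aligned sequence is in reference coordinates, gene coordinates are 1-based and inclusive (as in GFF), and only `N` and `-` count as missing. The function name and the toy data are hypothetical.

```python
from typing import Dict, Tuple

# Assumption: only gaps and fully ambiguous bases count as "not covered".
MISSING = set("-Nn")


def gene_coverage(
    aligned_seq: str, genes: Dict[str, Tuple[int, int]]
) -> Dict[str, float]:
    """Return the fraction of covered positions for each gene.

    `aligned_seq` is expected to be in reference coordinates.
    `genes` maps a gene name to (start, end), 1-based and inclusive.
    """
    coverage = {}
    for name, (start, end) in genes.items():
        # Convert the 1-based inclusive range to a Python slice.
        region = aligned_seq[start - 1:end]
        length = end - start + 1
        covered = sum(1 for base in region if base not in MISSING)
        # Positions past the end of the sequence count as uncovered.
        coverage[name] = round(covered / length, 3)
    return coverage


# Hypothetical toy example: a 20-base "genome" with two 10-base genes.
genes = {"geneA": (1, 10), "geneB": (11, 20)}
aligned = "ACGTACGTAC" + "NNNNN" + "ACGTA"
result = gene_coverage(aligned, genes)
print(result)
```

In a real workflow the gene coordinates would come from the GFF file and the aligned sequence from Nextclade's output, which is why both the reference and the annotation are needed.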

@joverlee521 joverlee521 added the enhancement New feature or request label Jul 10, 2024
@genehack
Contributor

A simple form of a flowchart for "figure out the reference" would be something like, "Is there a RefSeq entry? If so, use that. If not, do a literature search or consult an expert in the field." (I realize that's not great but I do think this is one of those areas where you kinda actually need to know something about what you're trying to do?)

As for constructing a GFF, there are tools that we could point to? Presumably the most common starting point is going to be a GenBank file; if somebody is trying to start with a completely unannotated FASTA as the reference sequence, again, they're probably going to need more specialized support than we want to provide?

@joverlee521
Contributor Author

As for constructing a GFF, there are tools that we could point to?

For sure! Richard has a script for generating the GFF from a GenBank accession, but I haven't personally tried it.
https://github.com/nextstrain/nextclade_dataset_template/blob/sanitize_gff/generate_from_genbank.py
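
For illustration only, and not taken from the script linked above: a GFF3 file is a tab-separated table with nine columns, so a small annotation can also be written by hand or with a few lines of code. The sketch below assumes gene names and coordinates are already known (for example, read off a GenBank record). The `Feature` type, `to_gff3` function, and example values are hypothetical, and the exact feature types and attributes Nextclade expects should be checked against its documentation.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    strand: str = "+"


def to_gff3(seqid: str, features: List[Feature], source: str = "manual") -> str:
    """Format features as GFF3 text: one gene row plus one child CDS row each."""
    lines = ["##gff-version 3"]
    for f in features:
        if f.start < 1 or f.start > f.end:
            raise ValueError(f"invalid coordinates for {f.name}")
        start, end = str(f.start), str(f.end)
        # Columns: seqid, source, type, start, end, score, strand, phase, attributes
        gene = [seqid, source, "gene", start, end, ".", f.strand, ".",
                f"ID=gene-{f.name};Name={f.name}"]
        cds = [seqid, source, "CDS", start, end, ".", f.strand, "0",
               f"ID=cds-{f.name};Name={f.name};Parent=gene-{f.name}"]
        lines.append("\t".join(gene))
        lines.append("\t".join(cds))
    return "\n".join(lines) + "\n"


# Hypothetical example: one gene on a placeholder reference ID.
gff_text = to_gff3("ref1", [Feature("geneA", 1, 30)])
print(gff_text)
```

The `seqid` in the first column should match the sequence name in the reference FASTA.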

@ivan-aksamentov
Member

Just a quick clarification/precision: Nextclade technically does not require a GFF annotation. It can run with just a reference FASTA and a very minimal (almost empty) pathogen.json. Though, of course, without an annotation it would not know anything about CDSes and amino acid things.
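
As a rough sketch of what such a near-empty pathogen.json might look like, the snippet below builds one in Python. The field names (`schemaVersion`, `files.reference`, `files.pathogenJson`, `files.genomeAnnotation`) and the file names are assumptions based on the Nextclade v3 dataset format and should be verified against the Nextclade documentation. The helper function is hypothetical.

```python
import json
from typing import Optional


def minimal_pathogen_json(
    reference: str = "reference.fasta", annotation: Optional[str] = None
) -> dict:
    """Build a minimal pathogen.json structure (assumed field names)."""
    files = {"reference": reference, "pathogenJson": "pathogen.json"}
    if annotation is not None:
        # Only needed when CDS-level output (e.g. gene coverage) is wanted.
        files["genomeAnnotation"] = annotation
    return {"schemaVersion": "3.0.0", "files": files}


without_annotation = minimal_pathogen_json()
with_annotation = minimal_pathogen_json(annotation="genome_annotation.gff3")
print(json.dumps(with_annotation, indent=2))
```

The only difference between the two variants is the annotation entry, which is the piece required for gene/CDS coverage.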

One idea for allowing faster bootstrapping of projects relying on Nextclade is to also not require annotations by default, where possible. This will end up with less useful analysis, but might encourage new learners and simplify their first steps. Will likely increase complexity of workflows though.

@joverlee521
Contributor Author

Thanks for the clarification @ivan-aksamentov! I guess I didn't mean a minimum Nextclade dataset, but the minimum files needed to get the gene/CDS coverage, which does require a GFF annotation. I've updated the language above to be explicit.


3 participants