Include vaccine strains #23
This commit adds strain and date annotations for 5 vaccine strains that all descend from the Edmonston isolate collected in 1954. The Parks et al. paper describes these well. I purposely chose not to include location for these, as I wanted the gray dot in the Auspice tree to make them look a bit different from wild-type isolates. This also includes strain, date and location for the Edmonston WT strain.
There's not enough genome data to warrant including month in the subsampling group-by. Including month also caused subsampling to drop a number of older samples that are annotated only by year. I noticed this when trying to include the 1954 Edmonston-related vaccine strains, which were being filtered out by the previous "country year month" group-by.
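To illustrate why a "country year month" group-by loses year-only samples, here is a minimal Python sketch. It does not call augur; it only imitates the idea that a record cannot be assigned to a month group when its month is unknown. The record names, the `group_key` and `group_records` helpers, and the "XX" placeholder convention for unknown date parts are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical records; "XX" marks an unknown month or day.
records = [
    {"strain": "hypothetical_1954_vaccine", "country": "USA", "date": "1954-XX-XX"},
    {"strain": "hypothetical_2019_a", "country": "USA", "date": "2019-03-14"},
    {"strain": "hypothetical_2019_b", "country": "USA", "date": "2019-03-20"},
]


def group_key(record, group_by):
    """Return the group key for a record, or None if a required date part is unknown."""
    year, month, _day = record["date"].split("-")
    key = []
    for column in group_by:
        if column == "year":
            if "X" in year:
                return None
            key.append(year)
        elif column == "month":
            # A month group needs both a known year and a known month.
            if "X" in year or "X" in month:
                return None
            key.append(f"{year}-{month}")
        else:
            key.append(record.get(column, ""))
    return tuple(key)


def group_records(records, group_by):
    """Split records into groups and a list of strains that could not be grouped."""
    groups = defaultdict(list)
    skipped = []
    for record in records:
        key = group_key(record, group_by)
        if key is None:
            skipped.append(record["strain"])
        else:
            groups[key].append(record["strain"])
    return dict(groups), skipped


with_month, skipped_with_month = group_records(records, ["country", "year", "month"])
without_month, skipped_without_month = group_records(records, ["country", "year"])
```

In this sketch the year-only 1954 record lands in `skipped_with_month` but gets its own `("USA", "1954")` group once month is removed from the group-by, which mirrors the behavior described above.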
Strain name is often not included in GenBank, or is not very helpful, but it is still good to surface as metadata for the modal. I particularly wanted this for the 1954 Edmonston-related vaccine strains. People know these by their strain names, certainly not their GenBank accessions.
This swaps to using --metadata-columns in augur export to surface strain, division and location.
export:
  metadata_columns: "strain division location"
That sounds good to surface the strain names now. Eventually we should be able to pull more strain names from GenBank, after NCBI Datasets starts pulling the "strain" field, which is where most measles strain names are reported on GenBank (currently we are getting strain names from GenBank's "isolate" field, which NCBI Datasets does pull). NCBI says this is planned for sometime this year. This would also enable us to recover dates for some samples that have empty dates, since dates are part of the strain name.
Excellent! Thanks for the context.
Explicitly add vaccine strains to genome tree and N450 tree, following up on #23. These strains currently end up in the trees due to our subsampling parameters and lack of other sequences from 1954, but this commit explicitly adds them.
Full genomes for Edmonston-related vaccine strains were present in the ingest dataset, but weren't making it to the final genome or N450 results because they were filtered out for lack of date metadata. This PR surfaces these vaccine strains by:
- Using a "country year" group-by so that samples with just year metadata make it into the final build.
- Adding strain coloring to provide proper descriptions of these 6 samples.

I've just run the entire pipeline locally and everything works as expected.
Results from running this branch are viewable at: