-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create method for getting CDS overlap #216
Comments
@korikuzma Do you want or need to be able to support reference from specific sources? Does it matter if the reference sequence comes from RefSeq or Ensembl? |
@egchristensen We mainly use and test RefSeq accessions. There are some issues when using Ensebml since UTA does not store versions at the moment (see this related issue) |
@korikuzma So in a perfect world, all annotation providers would use Sequence Ontology (SO) terms to describe sequence features, but NCBI hasn't officially adopted SO. Instead, they use the INSDC Feature Table definitions which do differ from SO in some ways (though they should largely mirror each other). More detail here. Some questions:
|
I want to say subset but I defer to @austinant or @wesleygoar on this one.
At the moment, we only support GRCh37 and GRCh38.
I believe the latter. @wesleygoar @austinant feel free to override me on any of these |
@egchristensen I think it would be awesome to support SO annotations in the future. I think for the sake of this manuscript we are trying to get out the door I want to keep it fairly high level with base pair overlap and CDS overlap since that's a very pervasive and well known parameter. I think it would be great to work together to figure out which SO terms would make sense to start supporting for future updates. |
I see. So, our initial list of sequence features from the sequence annotation group can be found in our working document. Do you think you would need anything beyond what we have listed here @korikuzma @wesleygoar @austinant? There are still some definitions that we need to flesh out, but we already have SO terms for pretty much all of these so we're not starting from scratch. |
Hmm interesting. In his original analysis Austin including base pair overlap, gene overlap, and CDS overlap. I think we could certainly work on supporting more of these SO terms you linked to from the SA group @egchristensen. I can speak with some of our clinical directors and ask them what features would be interesting for overlap analysis. |
@egchristensen I think it would be cool if we could involve Michael Baudis as well for determining which sequence features would be useful for CNV matching strategies. |
@wesleygoar, Once we decide what structure we want to use to handle the annotations imported from RefSeq then we should be able to handle any type of sequence feature you want to support. They're all tied to genomic coordinates so extending support to additional sequence features should just be a matter of identifying the right term as defined in the INSDC feature table (since you're just using RefSeq). Sorry for not clarifying this sooner, but the reason that I'm asking is that we have our wish list of terms that need to be in our v1.0 SA model so if you don't need anything more than what's in the working doc then we should be good to go from a terminological standpoint. Also... just double checking, we just need to be able to bring in reference annotations / gene models, right? We don't need to bring in any other information from RefSeq? We have a few user stories for the SA model so I'm curious if any of them apply here:
|
@korikuzma Do you anticipate pulling annotations in from RefSeq as GFF3 files or through some other method? If you're looking at GFF3 (tab-delimited) then we'll probably have to consider how to parse it. If we're only looking to identify whether or not a variant may fall within a CDS or an exon, you maybe be able to identify it using GFF3 column 3 "type" and avoid column 9 a.k.a. "the junk drawer" entirely (GFF3 spec here). But if you're looking at pulling in a gene name (located within the "id" attribute in GFF column 9) or any specific transcript ID, then we'll have to parse through column 9. If you prefer to work with more structure from the get go, some of what's going on in BioPython may be relevant. Their solution is to import the GFF file into an SQLite database that you can then run queries on. I'm not sure if it'd be better than just running grep on the GFF3 file though... what do you think? |
@egchristensen I'm just adding @austinant 's work to cool-seq-tool. His implementation had used GFF files. We plan on using https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/current/MANE.GRCh38.v1.3.refseq_genomic.gff.gz. He already has work for parsing this file. Later, we're going to switch to using UTA rather than GFF files. For the purpose of getting something out quick for the manuscript, we're just re-using @austinant 's work with minor changes. |
I see... so it seems like what's left here then is just writing the code to pull regions that overlap with a variant, right? Do you think some of the stuff going on with intersects at BedTools is relevant here @korikuzma or do you think that it's too different? I wonder if you think it'd be worthwhile to explore applying their approach to BED files to GFF3? The multiIntersectBed class in Bedtools itself isn't written in Python but there is a Python wrapper. We can rope in someone from the Quinlan lab if need be. I'm not sure the best way I can be helpful here. Please let me know if I'm stating things you've already considered. If there's anything specific you're looking from the SA group, please let me know. |
Hey all. Just catching up here (h/t @korikuzma). @egchristensen what we could use your help with is a JSON Schema document from SA for representing an exon, CDS, etc. Does something like that exist? |
We don't have a proper JSON schema yet, but I'll see what I can put together this week (unless you need it sooner than the end of the week). @emiliorighi and I got started on one a while back, but it doesn't really resemble the current state of modeling so we're kind of starting from scratch. I'll keep you posted. |
There is a JSON version of the sequence ontology if that's helpful to @korikuzma in the meantime, but it won't say anything about RefSeq-specific data structures or metadata @ahwagner. https://raw.githubusercontent.com/The-Sequence-Ontology/SO-Ontologies/master/Ontology_Files/so.json |
I'm changing output to {
"BRAF": [
{
"exon_start": 0,
"exon_stop": 100,
"overlap_start": 90,
"overlap_stop": 100
},
{
"exon_start": 220,
"exon_stop": 580,
"overlap_start": 220,
"overlap_stop": 580
}
]
},
... # other genes @wesleygoar how should we handle sorting? Keep default (as is in GFF) or asc/desc exon_start? |
Initial work can be found here I need to add tests (@austinant provided cases, ty):
@wesleygoar let me know if you think there should be any other cases I should cover. |
@korikuzma I think default as in the GFF is fine unless @ahwagner thinks differently. |
I think all of these sound good to me. I will think about it a little more in the morning. |
We added method in #217 for what we needed in the variation-normalizer manuscript, so closing this issue. |
Work from @austinant in the variation-normalizer manuscript notebook will be migrated to cool-seq-tool.
Given chromosome/assembly (or alt_ac), start position, and end position for a variant, we want to find the features (gene/exons) that overlap. We need to return the HGNC gene symbols, exon start, exon stop, overlap start, and overlap stop. Something like below:
This will need to be done for both:
We need to do the 0.1.x for manuscript purposes.
The text was updated successfully, but these errors were encountered: