Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create method for getting CDS overlap #216

Closed
2 tasks
korikuzma opened this issue Oct 24, 2023 · 20 comments
Closed
2 tasks

Create method for getting CDS overlap #216

korikuzma opened this issue Oct 24, 2023 · 20 comments
Assignees
Labels
enhancement New feature or request priority:high High priority

Comments

@korikuzma
Copy link
Member

korikuzma commented Oct 24, 2023

Work from @austinant in the variation-normalizer manuscript notebook will be migrated to cool-seq-tool.

Given chromosome/assembly (or alt_ac), start position, and end position for a variant, we want to find the features (gene/exons) that overlap. We need to return the HGNC gene symbols, exon start, exon stop, overlap start, and overlap stop. Something like below:

{
  "genes": [
    {
      "BRAF": [
        {
          "exon_number": 0,
          "exon_start": 0,
          "exon_end": 100,
          "overlap_start": 90,
          "overlap_end": 100
        },
        {
          "exon_number": 1,
          "exon_start": 220,
          "exon_end": 580,
          "overlap_start": 220,
          "overlap_end": 580
        }
      ]
    }
  ]
}

This will need to be done for both:

  • 0.1.x version
  • 0.3.x version
    We need to do the 0.1.x for manuscript purposes.
@korikuzma korikuzma added enhancement New feature or request priority:high High priority labels Oct 24, 2023
@korikuzma korikuzma self-assigned this Oct 24, 2023
@egchristensen
Copy link

egchristensen commented Oct 24, 2023

@korikuzma Do you want or need to be able to support reference from specific sources? Does it matter if the reference sequence comes from RefSeq or Ensembl?

@korikuzma
Copy link
Member Author

@korikuzma Do you want or need to be able to support reference from specific sources? Does it matter if the reference sequence comes from RefSeq or Ensembl?

@egchristensen We mainly use and test RefSeq accessions. There are some issues when using Ensebml since UTA does not store versions at the moment (see this related issue)

@egchristensen
Copy link

@korikuzma So in a perfect world, all annotation providers would use Sequence Ontology (SO) terms to describe sequence features, but NCBI hasn't officially adopted SO. Instead, they use the INSDC Feature Table definitions which do differ from SO in some ways (though they should largely mirror each other). More detail here.

Some questions:

  • Is there a subset of sequence features we can limit ourselves to here or do we need to be able to handle any/all sequence features in use at RefSeq?
  • Do we need to be able to support both current and previous builds of the human genome?
  • Do we need to access any knowledge/annotations/sequence beyond a curated reference sequence? Or will it be enough to just pull sequence features/regions contained in the reference annotation?

@korikuzma
Copy link
Member Author

Is there a subset of sequence features we can limit ourselves to here or do we need to be able to handle any/all sequence features in use at RefSeq?

I want to say subset but I defer to @austinant or @wesleygoar on this one.

Do we need to be able to support both current and previous builds of the human genome?

At the moment, we only support GRCh37 and GRCh38.

Do we need to access any knowledge/annotations/sequence beyond a curated reference sequence? Or will it be enough to just pull sequence features/regions contained in the reference annotation?

I believe the latter.

@wesleygoar @austinant feel free to override me on any of these

@wesleygoar
Copy link

@egchristensen I think it would be awesome to support SO annotations in the future. I think for the sake of this manuscript we are trying to get out the door I want to keep it fairly high level with base pair overlap and CDS overlap since that's a very pervasive and well known parameter. I think it would be great to work together to figure out which SO terms would make sense to start supporting for future updates.

@egchristensen
Copy link

egchristensen commented Oct 25, 2023

I see. So, our initial list of sequence features from the sequence annotation group can be found in our working document. Do you think you would need anything beyond what we have listed here @korikuzma @wesleygoar @austinant? There are still some definitions that we need to flesh out, but we already have SO terms for pretty much all of these so we're not starting from scratch.

@wesleygoar
Copy link

Hmm interesting. In his original analysis Austin including base pair overlap, gene overlap, and CDS overlap. I think we could certainly work on supporting more of these SO terms you linked to from the SA group @egchristensen. I can speak with some of our clinical directors and ask them what features would be interesting for overlap analysis.

@wesleygoar
Copy link

@egchristensen I think it would be cool if we could involve Michael Baudis as well for determining which sequence features would be useful for CNV matching strategies.

@egchristensen
Copy link

@wesleygoar, Once we decide what structure we want to use to handle the annotations imported from RefSeq then we should be able to handle any type of sequence feature you want to support. They're all tied to genomic coordinates so extending support to additional sequence features should just be a matter of identifying the right term as defined in the INSDC feature table (since you're just using RefSeq). Sorry for not clarifying this sooner, but the reason that I'm asking is that we have our wish list of terms that need to be in our v1.0 SA model so if you don't need anything more than what's in the working doc then we should be good to go from a terminological standpoint.

Also... just double checking, we just need to be able to bring in reference annotations / gene models, right? We don't need to bring in any other information from RefSeq? We have a few user stories for the SA model so I'm curious if any of them apply here:

  • More detailed annotation of sequence features
  • Metadata for streamlined retrieval of source data
  • Consistent representation of errors and imperfections
  • Annotation origin (experimental vs computational)
  • Clinical significance
  • Unambiguous transcript identification

@egchristensen
Copy link

@korikuzma Do you anticipate pulling annotations in from RefSeq as GFF3 files or through some other method?

If you're looking at GFF3 (tab-delimited) then we'll probably have to consider how to parse it. If we're only looking to identify whether or not a variant may fall within a CDS or an exon, you maybe be able to identify it using GFF3 column 3 "type" and avoid column 9 a.k.a. "the junk drawer" entirely (GFF3 spec here). But if you're looking at pulling in a gene name (located within the "id" attribute in GFF column 9) or any specific transcript ID, then we'll have to parse through column 9.

If you prefer to work with more structure from the get go, some of what's going on in BioPython may be relevant. Their solution is to import the GFF file into an SQLite database that you can then run queries on. I'm not sure if it'd be better than just running grep on the GFF3 file though... what do you think?

@korikuzma
Copy link
Member Author

@korikuzma Do you anticipate pulling annotations in from RefSeq as GFF3 files or through some other method?

If you're looking at GFF3 (tab-delimited) then we'll probably have to consider how to parse it. If we're only looking to identify whether or not a variant may fall within a CDS or an exon, you maybe be able to identify it using GFF3 column 3 "type" and avoid column 9 a.k.a. "the junk drawer" entirely (GFF3 spec here). But if you're looking at pulling in a gene name (located within the "id" attribute in GFF column 9) or any specific transcript ID, then we'll have to parse through column 9.

If you prefer to work with more structure from the get go, some of what's going on in BioPython may be relevant. Their solution is to import the GFF file into an SQLite database that you can then run queries on. I'm not sure if it'd be better than just running grep on the GFF3 file though... what do you think?

@egchristensen I'm just adding @austinant 's work to cool-seq-tool. His implementation had used GFF files. We plan on using https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/current/MANE.GRCh38.v1.3.refseq_genomic.gff.gz. He already has work for parsing this file.

Later, we're going to switch to using UTA rather than GFF files. For the purpose of getting something out quick for the manuscript, we're just re-using @austinant 's work with minor changes.

@egchristensen
Copy link

I see... so it seems like what's left here then is just writing the code to pull regions that overlap with a variant, right? Do you think some of the stuff going on with intersects at BedTools is relevant here @korikuzma or do you think that it's too different? I wonder if you think it'd be worthwhile to explore applying their approach to BED files to GFF3? The multiIntersectBed class in Bedtools itself isn't written in Python but there is a Python wrapper. We can rope in someone from the Quinlan lab if need be.

I'm not sure the best way I can be helpful here. Please let me know if I'm stating things you've already considered. If there's anything specific you're looking from the SA group, please let me know.

@ahwagner
Copy link
Member

Hey all. Just catching up here (h/t @korikuzma).

@egchristensen what we could use your help with is a JSON Schema document from SA for representing an exon, CDS, etc. Does something like that exist?

@egchristensen
Copy link

We don't have a proper JSON schema yet, but I'll see what I can put together this week (unless you need it sooner than the end of the week). @emiliorighi and I got started on one a while back, but it doesn't really resemble the current state of modeling so we're kind of starting from scratch. I'll keep you posted.

@egchristensen
Copy link

There is a JSON version of the sequence ontology if that's helpful to @korikuzma in the meantime, but it won't say anything about RefSeq-specific data structures or metadata @ahwagner. https://raw.githubusercontent.com/The-Sequence-Ontology/SO-Ontologies/master/Ontology_Files/so.json

@korikuzma
Copy link
Member Author

I'm changing output to

{
  "BRAF": [
    {
      "exon_start": 0,
      "exon_stop": 100,
      "overlap_start": 90,
      "overlap_stop": 100
    },
    {
      "exon_start": 220,
      "exon_stop": 580,
      "overlap_start": 220,
      "overlap_stop": 580
    }
  ]
},
... # other genes

@wesleygoar how should we handle sorting? Keep default (as is in GFF) or asc/desc exon_start?

@korikuzma
Copy link
Member Author

Initial work can be found here

I need to add tests (@austinant provided cases, ty):

  1. variant fully contains exon
  2. variant is fully contained within exon
  3. variant partially overlaps with exon, from the exon's left side
  4. variant partially overlaps with exon, from the exon's right side
  5. cases where variant overlaps with multiple exons, combinations of these things happening
  6. overlap with exons in multiple genes?
  7. alt chromosome accessions/whatever the non 1..22,X,Y chromosome values are

@wesleygoar let me know if you think there should be any other cases I should cover.

@wesleygoar
Copy link

I'm changing output to

{
  "BRAF": [
    {
      "exon_start": 0,
      "exon_stop": 100,
      "overlap_start": 90,
      "overlap_stop": 100
    },
    {
      "exon_start": 220,
      "exon_stop": 580,
      "overlap_start": 220,
      "overlap_stop": 580
    }
  ]
},
... # other genes

@wesleygoar how should we handle sorting? Keep default (as is in GFF) or asc/desc exon_start?

@korikuzma I think default as in the GFF is fine unless @ahwagner thinks differently.

@wesleygoar
Copy link

Initial work can be found here

I need to add tests (@austinant provided cases, ty):

  1. variant fully contains exon
  2. variant is fully contained within exon
  3. variant partially overlaps with exon, from the exon's left side
  4. variant partially overlaps with exon, from the exon's right side
  5. cases where variant overlaps with multiple exons, combinations of these things happening
  6. overlap with exons in multiple genes?
  7. alt chromosome accessions/whatever the non 1..22,X,Y chromosome values are

@wesleygoar let me know if you think there should be any other cases I should cover.

I think all of these sound good to me. I will think about it a little more in the morning.

@korikuzma
Copy link
Member Author

We added method in #217 for what we needed in the variation-normalizer manuscript, so closing this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request priority:high High priority
Projects
None yet
Development

No branches or pull requests

4 participants