Create method for getting CDS overlap #216

korikuzma · 2023-10-24T12:00:45Z

Work from @austinant in the variation-normalizer manuscript notebook will be migrated to cool-seq-tool.

Given chromosome/assembly (or alt_ac), start position, and end position for a variant, we want to find the features (gene/exons) that overlap. We need to return the HGNC gene symbols, exon start, exon stop, overlap start, and overlap stop. Something like below:

{
  "genes": [
    {
      "BRAF": [
        {
          "exon_number": 0,
          "exon_start": 0,
          "exon_end": 100,
          "overlap_start": 90,
          "overlap_end": 100
        },
        {
          "exon_number": 1,
          "exon_start": 220,
          "exon_end": 580,
          "overlap_start": 220,
          "overlap_end": 580
        }
      ]
    }
  ]
}

This will need to be done for both:

0.1.x version
0.3.x version
We need to do the 0.1.x for manuscript purposes.

egchristensen · 2023-10-24T22:08:46Z

@korikuzma Do you want or need to be able to support reference from specific sources? Does it matter if the reference sequence comes from RefSeq or Ensembl?

korikuzma · 2023-10-24T22:11:56Z

@korikuzma Do you want or need to be able to support reference from specific sources? Does it matter if the reference sequence comes from RefSeq or Ensembl?

@egchristensen We mainly use and test RefSeq accessions. There are some issues when using Ensebml since UTA does not store versions at the moment (see this related issue)

egchristensen · 2023-10-24T23:55:38Z

@korikuzma So in a perfect world, all annotation providers would use Sequence Ontology (SO) terms to describe sequence features, but NCBI hasn't officially adopted SO. Instead, they use the INSDC Feature Table definitions which do differ from SO in some ways (though they should largely mirror each other). More detail here.

Some questions:

Is there a subset of sequence features we can limit ourselves to here or do we need to be able to handle any/all sequence features in use at RefSeq?
Do we need to be able to support both current and previous builds of the human genome?
Do we need to access any knowledge/annotations/sequence beyond a curated reference sequence? Or will it be enough to just pull sequence features/regions contained in the reference annotation?

korikuzma · 2023-10-25T12:44:44Z

Is there a subset of sequence features we can limit ourselves to here or do we need to be able to handle any/all sequence features in use at RefSeq?

I want to say subset but I defer to @austinant or @wesleygoar on this one.

Do we need to be able to support both current and previous builds of the human genome?

At the moment, we only support GRCh37 and GRCh38.

Do we need to access any knowledge/annotations/sequence beyond a curated reference sequence? Or will it be enough to just pull sequence features/regions contained in the reference annotation?

I believe the latter.

@wesleygoar @austinant feel free to override me on any of these

wesleygoar · 2023-10-25T13:26:45Z

@egchristensen I think it would be awesome to support SO annotations in the future. I think for the sake of this manuscript we are trying to get out the door I want to keep it fairly high level with base pair overlap and CDS overlap since that's a very pervasive and well known parameter. I think it would be great to work together to figure out which SO terms would make sense to start supporting for future updates.

egchristensen · 2023-10-25T14:45:25Z

I see. So, our initial list of sequence features from the sequence annotation group can be found in our working document. Do you think you would need anything beyond what we have listed here @korikuzma @wesleygoar @austinant? There are still some definitions that we need to flesh out, but we already have SO terms for pretty much all of these so we're not starting from scratch.

wesleygoar · 2023-10-26T12:26:36Z

Hmm interesting. In his original analysis Austin including base pair overlap, gene overlap, and CDS overlap. I think we could certainly work on supporting more of these SO terms you linked to from the SA group @egchristensen. I can speak with some of our clinical directors and ask them what features would be interesting for overlap analysis.

wesleygoar · 2023-10-26T12:29:03Z

@egchristensen I think it would be cool if we could involve Michael Baudis as well for determining which sequence features would be useful for CNV matching strategies.

egchristensen · 2023-10-27T15:07:29Z

@wesleygoar, Once we decide what structure we want to use to handle the annotations imported from RefSeq then we should be able to handle any type of sequence feature you want to support. They're all tied to genomic coordinates so extending support to additional sequence features should just be a matter of identifying the right term as defined in the INSDC feature table (since you're just using RefSeq). Sorry for not clarifying this sooner, but the reason that I'm asking is that we have our wish list of terms that need to be in our v1.0 SA model so if you don't need anything more than what's in the working doc then we should be good to go from a terminological standpoint.

Also... just double checking, we just need to be able to bring in reference annotations / gene models, right? We don't need to bring in any other information from RefSeq? We have a few user stories for the SA model so I'm curious if any of them apply here:

More detailed annotation of sequence features
Metadata for streamlined retrieval of source data
Consistent representation of errors and imperfections
Annotation origin (experimental vs computational)
Clinical significance
Unambiguous transcript identification

egchristensen · 2023-10-27T15:40:48Z

@korikuzma Do you anticipate pulling annotations in from RefSeq as GFF3 files or through some other method?

If you're looking at GFF3 (tab-delimited) then we'll probably have to consider how to parse it. If we're only looking to identify whether or not a variant may fall within a CDS or an exon, you maybe be able to identify it using GFF3 column 3 "type" and avoid column 9 a.k.a. "the junk drawer" entirely (GFF3 spec here). But if you're looking at pulling in a gene name (located within the "id" attribute in GFF column 9) or any specific transcript ID, then we'll have to parse through column 9.

If you prefer to work with more structure from the get go, some of what's going on in BioPython may be relevant. Their solution is to import the GFF file into an SQLite database that you can then run queries on. I'm not sure if it'd be better than just running grep on the GFF3 file though... what do you think?

korikuzma · 2023-10-27T15:44:56Z

@korikuzma Do you anticipate pulling annotations in from RefSeq as GFF3 files or through some other method?

If you're looking at GFF3 (tab-delimited) then we'll probably have to consider how to parse it. If we're only looking to identify whether or not a variant may fall within a CDS or an exon, you maybe be able to identify it using GFF3 column 3 "type" and avoid column 9 a.k.a. "the junk drawer" entirely (GFF3 spec here). But if you're looking at pulling in a gene name (located within the "id" attribute in GFF column 9) or any specific transcript ID, then we'll have to parse through column 9.

If you prefer to work with more structure from the get go, some of what's going on in BioPython may be relevant. Their solution is to import the GFF file into an SQLite database that you can then run queries on. I'm not sure if it'd be better than just running grep on the GFF3 file though... what do you think?

@egchristensen I'm just adding @austinant 's work to cool-seq-tool. His implementation had used GFF files. We plan on using https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/current/MANE.GRCh38.v1.3.refseq_genomic.gff.gz. He already has work for parsing this file.

Later, we're going to switch to using UTA rather than GFF files. For the purpose of getting something out quick for the manuscript, we're just re-using @austinant 's work with minor changes.

egchristensen · 2023-10-27T16:55:39Z

I see... so it seems like what's left here then is just writing the code to pull regions that overlap with a variant, right? Do you think some of the stuff going on with intersects at BedTools is relevant here @korikuzma or do you think that it's too different? I wonder if you think it'd be worthwhile to explore applying their approach to BED files to GFF3? The multiIntersectBed class in Bedtools itself isn't written in Python but there is a Python wrapper. We can rope in someone from the Quinlan lab if need be.

I'm not sure the best way I can be helpful here. Please let me know if I'm stating things you've already considered. If there's anything specific you're looking from the SA group, please let me know.

ahwagner · 2023-10-30T14:16:06Z

Hey all. Just catching up here (h/t @korikuzma).

@egchristensen what we could use your help with is a JSON Schema document from SA for representing an exon, CDS, etc. Does something like that exist?

egchristensen · 2023-10-30T16:31:23Z

We don't have a proper JSON schema yet, but I'll see what I can put together this week (unless you need it sooner than the end of the week). @emiliorighi and I got started on one a while back, but it doesn't really resemble the current state of modeling so we're kind of starting from scratch. I'll keep you posted.

egchristensen · 2023-10-30T16:35:12Z

There is a JSON version of the sequence ontology if that's helpful to @korikuzma in the meantime, but it won't say anything about RefSeq-specific data structures or metadata @ahwagner. https://raw.githubusercontent.com/The-Sequence-Ontology/SO-Ontologies/master/Ontology_Files/so.json

korikuzma · 2023-11-01T21:12:44Z

I'm changing output to

{
  "BRAF": [
    {
      "exon_start": 0,
      "exon_stop": 100,
      "overlap_start": 90,
      "overlap_stop": 100
    },
    {
      "exon_start": 220,
      "exon_stop": 580,
      "overlap_start": 220,
      "overlap_stop": 580
    }
  ]
},
... # other genes

@wesleygoar how should we handle sorting? Keep default (as is in GFF) or asc/desc exon_start?

korikuzma · 2023-11-01T21:16:03Z

Initial work can be found here

I need to add tests (@austinant provided cases, ty):

variant fully contains exon
variant is fully contained within exon
variant partially overlaps with exon, from the exon's left side
variant partially overlaps with exon, from the exon's right side
cases where variant overlaps with multiple exons, combinations of these things happening
overlap with exons in multiple genes?
alt chromosome accessions/whatever the non 1..22,X,Y chromosome values are

@wesleygoar let me know if you think there should be any other cases I should cover.

wesleygoar · 2023-11-02T00:18:20Z

I'm changing output to

{
  "BRAF": [
    {
      "exon_start": 0,
      "exon_stop": 100,
      "overlap_start": 90,
      "overlap_stop": 100
    },
    {
      "exon_start": 220,
      "exon_stop": 580,
      "overlap_start": 220,
      "overlap_stop": 580
    }
  ]
},
... # other genes

@wesleygoar how should we handle sorting? Keep default (as is in GFF) or asc/desc exon_start?

@korikuzma I think default as in the GFF is fine unless @ahwagner thinks differently.

wesleygoar · 2023-11-02T00:20:47Z

Initial work can be found here

I need to add tests (@austinant provided cases, ty):

variant fully contains exon

variant is fully contained within exon

variant partially overlaps with exon, from the exon's left side

variant partially overlaps with exon, from the exon's right side

cases where variant overlaps with multiple exons, combinations of these things happening

overlap with exons in multiple genes?

alt chromosome accessions/whatever the non 1..22,X,Y chromosome values are

@wesleygoar let me know if you think there should be any other cases I should cover.

I think all of these sound good to me. I will think about it a little more in the morning.

korikuzma · 2023-12-27T18:35:26Z

We added method in #217 for what we needed in the variation-normalizer manuscript, so closing this issue.

korikuzma added enhancement New feature or request priority:high High priority labels Oct 24, 2023

korikuzma self-assigned this Oct 24, 2023

korikuzma mentioned this issue Nov 3, 2023

feat: get cds overlap given chromosome, start, + stop #217

Merged

korikuzma closed this as completed Dec 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create method for getting CDS overlap #216

Create method for getting CDS overlap #216

korikuzma commented Oct 24, 2023 •

edited

Loading

egchristensen commented Oct 24, 2023 •

edited

Loading

korikuzma commented Oct 24, 2023

egchristensen commented Oct 24, 2023

korikuzma commented Oct 25, 2023

wesleygoar commented Oct 25, 2023

egchristensen commented Oct 25, 2023 •

edited

Loading

wesleygoar commented Oct 26, 2023

wesleygoar commented Oct 26, 2023

egchristensen commented Oct 27, 2023

egchristensen commented Oct 27, 2023

korikuzma commented Oct 27, 2023

egchristensen commented Oct 27, 2023

ahwagner commented Oct 30, 2023

egchristensen commented Oct 30, 2023

egchristensen commented Oct 30, 2023

korikuzma commented Nov 1, 2023

korikuzma commented Nov 1, 2023

wesleygoar commented Nov 2, 2023

wesleygoar commented Nov 2, 2023

korikuzma commented Dec 27, 2023

Create method for getting CDS overlap #216

Create method for getting CDS overlap #216

Comments

korikuzma commented Oct 24, 2023 • edited Loading

egchristensen commented Oct 24, 2023 • edited Loading

korikuzma commented Oct 24, 2023

egchristensen commented Oct 24, 2023

korikuzma commented Oct 25, 2023

wesleygoar commented Oct 25, 2023

egchristensen commented Oct 25, 2023 • edited Loading

wesleygoar commented Oct 26, 2023

wesleygoar commented Oct 26, 2023

egchristensen commented Oct 27, 2023

egchristensen commented Oct 27, 2023

korikuzma commented Oct 27, 2023

egchristensen commented Oct 27, 2023

ahwagner commented Oct 30, 2023

egchristensen commented Oct 30, 2023

egchristensen commented Oct 30, 2023

korikuzma commented Nov 1, 2023

korikuzma commented Nov 1, 2023

wesleygoar commented Nov 2, 2023

wesleygoar commented Nov 2, 2023

korikuzma commented Dec 27, 2023

korikuzma commented Oct 24, 2023 •

edited

Loading

egchristensen commented Oct 24, 2023 •

edited

Loading

egchristensen commented Oct 25, 2023 •

edited

Loading