Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HGVS using ENSEMBL ID is not working #621

Closed
alex97751 opened this issue Aug 15, 2021 · 2 comments
Closed

HGVS using ENSEMBL ID is not working #621

alex97751 opened this issue Aug 15, 2021 · 2 comments

Comments

@alex97751
Copy link

How come parsing a variant in HGVS notation with an ENSEMBL Identifier (EBI) does not work when the HGVS notation explicitly states that they can be used as a reference sequence in HGVS?

"reference sequences must come from data sources that provide stable and permanent identifiers, e.g. RefSeq (NCBI) and Ensembl (EBI)"

The HGVS documentation even recommends the use of Ensembl transcripts.

Recommended Reference Sequences types are:
-RefSeq sequences with the prefixes NC_, NT_, NW_,NG_, NM_, NR_ or NP_
-LRG sequences with the prefixes LRG_#, LRG_#t#, LRG_#p#,
-Ensembl transcript (ENST) and protein (ENSP) which are not identified by Ensembl as being incomplete

HGVS Reference Sequences

It first has to be converted from ENSEMBL (ENST..) to a RefSeq ID (NM.. ) in order to work. Even tools such as Mutalyzer or VariantValidator are unable to validate or even parse them.

import hgvs.parser
import hgvs.validator
import hgvs.dataproviders.uta

def VariantHGVS(variant_HGVS):
    v = parseVariant(variant_HGVS)
    print("Is valid: " + str(validate(v)))
    return

def parseVariant(variant):
    hp = hgvs.parser.Parser()
    var_g = hp.parse_hgvs_variant(variant)

    return var_g

def validate(var_g):
    hdp = hgvs.dataproviders.uta.connect()
    vr = hgvs.validator.Validator(hdp)
    try:
        return vr.validate(var_g)
    except hgvs.exceptions.HGVSError as e:
        return False


VariantHGVS("NM_007313.2:c.1001C>T")  # TRUE
VariantHGVS("ENST00000372348.2:c.1001C>T")  # FALSE

Just a short code snippet that explains what i mean. First we run the code on a variant with a RefSeq ID (NM_007313.2:c.1001C>T) and then we run the same code with the corresponding ENS ID (ENST00000372348.2:c.1001C>T).

The mapping should be correct as it has been done by using a local copy of the ENSEMBLEDB (ensembldb.ensembl.org). You can also use their RESP API to check for yourself: rest.ensembl.org

What am i missing here? Maybe someone can help me out :)

@reece
Copy link
Member

reece commented Aug 19, 2021

There are several reasons this doesn't work.

The first is that the ensembl data in UTA are very stale. That transcript isn't there. The reasons for this are in biocommons/uta#233.

You capture an exception but just return False. When you do that, you're burying the message that likely would have explained what's happening. At the very least, print or log the exception before ignoring it.

You may want to see this old talk about the origins of UTA and some of the dark corners of the false equivalence of some transcripts.

@reece reece closed this as completed Aug 19, 2021
@davmlaw
Copy link
Contributor

davmlaw commented Mar 21, 2023

Hi, as well as the transcript, the sequence is not in SeqRepo:

HGVSDataNotAvailableError: Failed to fetch ENST00000372348.2 from bioutils.seqfetcher (network fetching) ('Ensembl API only provides ENST00000372348 version (9), requested: 2')

To resolve it, you can use the cdot library/data to provide the transcript mapping coordinates and sequence

Install via python3 -m pip install cdot

import hgvs.parser
import hgvs.validator

from cdot.hgvs.dataproviders import RESTDataProvider, FastaSeqFetcher

# see how to get fasta: https://github.com/SACGF/cdot/wiki/FastaSeqFetcher
seqfetcher = FastaSeqFetcher("/data/annotation/fasta/GCF_000001405.25_GRCh37.p13_genomic.fna.gz")
hdp = RESTDataProvider(seqfetcher=seqfetcher)
hp = hgvs.parser.Parser()
var_c = hp.parse_hgvs_variant("ENST00000372348.2:c.1001C>T")
vr = hgvs.validator.Validator(hdp)
valid = vr.validate(var_c)
print(f"Valid: {valid}")


from hgvs.assemblymapper import AssemblyMapper
am = AssemblyMapper(hdp,
                    assembly_name='GRCh37',
                    alt_aln_method='splign', replace_reference=True)
am.c_to_g(var_c)
print(f"Resolves to {am.c_to_g(var_c)}")

Output:

Valid: True
Resolves to NC_000009.11:g.133748283C>T

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants