reorder id_prefixes for small molecule, fixes #1375 #1376

Merged
merged 1 commit into from
Sep 5, 2023


sierra-moxon
Member

No description provided.

@cbizon
Collaborator

cbizon commented Aug 11, 2023

It's OK with me. I seem to recall @vdancik having some reasons for preferring PubChem, though. Maybe just that it covers more items?

@sierra-moxon sierra-moxon merged commit e055c2c into master Sep 5, 2023
4 checks passed
gaurav added a commit to TranslatorSRI/Babel that referenced this pull request Mar 28, 2024
We are running into issues where a conflated clique can't find the identifiers associated with it because they are not primary IDs. For example, PUBCHEM.COMPOUND:962 was the primary ID for water, but the Biolink model changes in 3.6.0 (specifically biolink/biolink-model#1376 and biolink/biolink-model#1398) changed the order of prefixes within chemicals, so many chemicals previously identified by PUBCHEM.COMPOUND are now identified with CHEBI identifiers instead. Water, for instance, is now CHEBI:15377, since it is primarily a biolink:SmallMolecule, which now [has CHEBI as the most preferred prefix](https://biolink.github.io/biolink-model/SmallMolecule/#valid-id-prefixes).

This is a problem because our current drug conflation algorithm primarily connects PUBCHEM.COMPOUND, RXCUI, CHEBI and UMLS identifiers, and PUBCHEM.COMPOUND is now unlikely to be a clique leader at all.
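
To illustrate why the prefix reorder changes the clique leader, here is a minimal sketch of leader selection by prefix order. The function names (`curie_prefix`, `pick_clique_leader`) and the two-item prefix lists are assumptions made for illustration; they are not Babel's actual code, and the real Biolink prefix lists are longer.

```python
def curie_prefix(curie: str) -> str:
    """Return the part of a CURIE before the first colon."""
    return curie.split(":", 1)[0]


def pick_clique_leader(clique, prefix_order):
    """Pick the identifier whose prefix comes earliest in prefix_order.

    Identifiers with unlisted prefixes rank last; ties keep input order.
    """
    rank = {prefix: i for i, prefix in enumerate(prefix_order)}
    return min(clique, key=lambda c: rank.get(curie_prefix(c), len(rank)))


# The two identifiers for water mentioned above.
water = ["PUBCHEM.COMPOUND:962", "CHEBI:15377"]

# Illustrative orderings only (hypothetical, shortened lists).
old_order = ["PUBCHEM.COMPOUND", "CHEBI"]
new_order = ["CHEBI", "PUBCHEM.COMPOUND"]

print(pick_clique_leader(water, old_order))  # PUBCHEM.COMPOUND:962
print(pick_clique_leader(water, new_order))  # CHEBI:15377
```

With the new ordering, any clique that contains a CHEBI identifier is led by it, so a conflation keyed on PUBCHEM.COMPOUND identifiers no longer matches the clique ID.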

This PR modifies the drug conflation algorithm so that it:
1. Goes through all the chemical compendia cliques and loads ID -> preferred ID and preferred ID -> Biolink type maps into memory.
2. Runs the algorithm as it is currently written, which results in a set of PUBCHEM.COMPOUND, RXCUI, CHEBI and UMLS identifier mappings (as well as some other identifiers: UNII, DRUGBANK and one CHEMBL.COMPOUND).
3. Adds normalization to the glomming step by adding (subject, normalized_subject) and (object, normalized_object) for any non-normalized identifiers in the pairs to be glommed. (This is necessary because in some cases different conflations will be merged because they share the same normalized CURIE.)
4. Normalizes identifiers before writing them out, skipping any identifiers that can't be normalized. We can therefore guarantee that every CURIE in the conflation file will be a clique ID in both NodeNorm and NameRes.
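
Steps 1, 3 and 4 can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the clique layout (a dict with `type` and an `identifiers` list whose first entry is the preferred ID), all function names, and the `RXCUI:EXAMPLE1` and `UMLS:EXAMPLE2` identifiers are made up for the example, and the glomming is done with a plain union-find.

```python
def load_preferred_ids(cliques):
    """Step 1: build ID -> preferred ID and preferred ID -> type maps."""
    preferred_id, type_of = {}, {}
    for clique in cliques:
        leader = clique["identifiers"][0]  # assumed: first entry is the leader
        type_of[leader] = clique["type"]
        for curie in clique["identifiers"]:
            preferred_id[curie] = leader
    return preferred_id, type_of


def add_normalized_pairs(pairs, preferred_id):
    """Step 3: add (id, normalized_id) for each non-normalized ID in a pair."""
    extended = list(pairs)
    for subject, obj in pairs:
        for curie in (subject, obj):
            normalized = preferred_id.get(curie)
            if normalized is not None and normalized != curie:
                extended.append((curie, normalized))
    return extended


def glom(pairs):
    """Merge pairs into disjoint sets of connected identifiers (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for curie in list(parent):
        groups.setdefault(find(curie), set()).add(curie)
    return list(groups.values())


def normalize_group(group, preferred_id):
    """Step 4: replace IDs by their preferred ID, dropping unknown ones."""
    return sorted({preferred_id[c] for c in group if c in preferred_id})


# Hypothetical data: only the two water identifiers come from the text above.
cliques = [
    {"type": "biolink:SmallMolecule",
     "identifiers": ["CHEBI:15377", "PUBCHEM.COMPOUND:962"]},
    {"type": "biolink:Drug", "identifiers": ["RXCUI:EXAMPLE1"]},
]
pairs = [
    ("PUBCHEM.COMPOUND:962", "RXCUI:EXAMPLE1"),
    ("CHEBI:15377", "UMLS:EXAMPLE2"),  # UMLS ID is in no clique
]

preferred_id, type_of = load_preferred_ids(cliques)
conflations = [
    normalize_group(group, preferred_id)
    for group in glom(add_normalized_pairs(pairs, preferred_id))
]
print(conflations)  # [['CHEBI:15377', 'RXCUI:EXAMPLE1']]
```

Without the extra (PUBCHEM.COMPOUND:962, CHEBI:15377) pair added in step 3, the two input pairs would glom into two separate conflations even though they describe the same clique. The unnormalizable `UMLS:EXAMPLE2` is dropped in step 4, so only clique IDs are written out.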

This PR also renames `get_curie_suffix(curie)` to `get_numerical_curie_suffix(curie)` to make it clearer exactly what it does.