reorder id_prefixes for small molecule, fixes #1375 #1376

sierra-moxon · 2023-08-04T18:38:40Z

No description provided.

cbizon · 2023-08-11T19:55:59Z

It's OK with me. I seem to recall @vdancik having some reasons for preferring PubChem, though. Maybe just that it covers more items?

We are running into issues where a conflated clique can't find the identifiers associated with it because they are not primary IDs. For example, PUBCHEM.COMPOUND:962 was the primary ID for water, but after the Biolink model changes in 3.6.0 (specifically biolink/biolink-model#1376 and biolink/biolink-model#1398) the order of prefixes within chemicals has changed, and many chemicals previously identified by PUBCHEM.COMPOUND are now identified with CHEBI identifiers instead. For instance, water is now CHEBI:15377, since it is primarily a biolink:SmallMolecule, which now [has CHEBI as the most preferred prefix](https://biolink.github.io/biolink-model/SmallMolecule/#valid-id-prefixes).

This is a problem because our current drug conflation algorithm primarily connects PUBCHEM.COMPOUND, RXCUI, CHEBI and UMLS identifiers, and PUBCHEM.COMPOUND is now unlikely to be a clique leader at all.

This PR modifies the drug conflation algorithm so that it:

1. Goes through all the chemical compendia cliques and loads ID -> preferred ID and preferred ID -> Biolink type maps into memory.
2. Runs the algorithm as it is currently written, which results in a set of PUBCHEM.COMPOUND, RXCUI, CHEBI and UMLS identifier mappings (as well as some other identifiers: UNII, DRUGBANK and one CHEMBL.COMPOUND).
3. Adds normalization to the glomming step by adding (subject, normalized_subject) and (object, normalized_object) for any non-normalized identifiers in the pairs to be glommed. (This is necessary because in some cases different conflations will be merged by sharing the same normalized CURIE.)
4. Normalizes identifiers before writing them out, skipping any identifiers that can't be normalized. We can therefore guarantee that every CURIE in the conflation file will be a clique ID in both NodeNorm and NameRes.

This PR also renames `get_curie_suffix(curie)` to `get_numerical_curie_suffix(curie)` to make it clearer exactly what it does.
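The four steps described above can be sketched roughly as follows. This is a minimal illustration rather than the Babel implementation: the function names (`build_maps`, `add_normalized_pairs`, `glom`, `normalize_group`), the clique layout (a dict with a `type` and an `identifiers` list whose first entry is the clique leader), and the `EXAMPLE:` CURIEs are all assumptions made for the example. Only CHEBI:15377 and PUBCHEM.COMPOUND:962 come from the description.

```python
from collections import defaultdict


def build_maps(cliques):
    """Step 1: build ID -> preferred ID and preferred ID -> Biolink type maps.

    Assumes (hypothetically) that each clique is a dict whose "identifiers"
    list starts with the preferred (clique leader) CURIE.
    """
    preferred_id = {}
    preferred_type = {}
    for clique in cliques:
        leader = clique["identifiers"][0]
        preferred_type[leader] = clique["type"]
        for curie in clique["identifiers"]:
            preferred_id[curie] = leader
    return preferred_id, preferred_type


def add_normalized_pairs(pairs, preferred_id):
    """Step 3: for every non-normalized CURIE in a pair, add (curie, leader)."""
    extended = list(pairs)
    for subject, obj in pairs:
        for curie in (subject, obj):
            leader = preferred_id.get(curie)
            if leader is not None and leader != curie:
                extended.append((curie, leader))
    return extended


def glom(pairs):
    """Merge pairs into connected groups using a small union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for curie in list(parent):
        groups[find(curie)].add(curie)
    return list(groups.values())


def normalize_group(group, preferred_id):
    """Step 4: keep only clique leaders, dropping CURIEs that can't be normalized."""
    return sorted({preferred_id[c] for c in group if c in preferred_id})


# Hypothetical data: the water clique from the description plus a made-up clique.
cliques = [
    {"type": "biolink:SmallMolecule",
     "identifiers": ["CHEBI:15377", "PUBCHEM.COMPOUND:962"]},
    {"type": "biolink:Drug", "identifiers": ["EXAMPLE:A", "EXAMPLE:A2"]},
]
# Step 2 stand-in: pairs as the existing algorithm might emit them.
pairs = [("PUBCHEM.COMPOUND:962", "EXAMPLE:A2"), ("EXAMPLE:UNKNOWN", "EXAMPLE:A")]

preferred_id, preferred_type = build_maps(cliques)
groups = glom(add_normalized_pairs(pairs, preferred_id))
conflations = [normalize_group(g, preferred_id) for g in groups]
```

In this toy input, the two pairs share no CURIE directly, so glomming them as-is would yield two separate groups. Adding the (CURIE, leader) pairs links them through `EXAMPLE:A`, and the final normalization leaves only clique leaders, dropping `EXAMPLE:UNKNOWN` because it has no clique.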

reorder id_prefixes for small molecule

3f93514

sierra-moxon requested review from vdancik and cbizon August 4, 2023 18:38

cbizon approved these changes Aug 11, 2023

vdancik approved these changes Sep 4, 2023

sierra-moxon merged commit e055c2c into master Sep 5, 2023
4 checks passed

gaurav mentioned this pull request Mar 22, 2024

Normalize DrugChemical conflation IDs TranslatorSRI/Babel#250

Merged
