Give better chemical labels to returned responses #462

Open
jh111 opened this issue Aug 8, 2023 · 22 comments
Assignees
Labels
Hammerhead (Sprint 6) - due Oct 4 in CI This ticket will be fixed in CI by the end of Hammerhead (Sprint 6) (Oct 4) response labels task around getting better labels to show UI - display confusion on or overlooking information

Comments

@jh111

jh111 commented Aug 8, 2023

Search "What drug may treat Multiple Sclerosis".
https://ui.test.transltr.io/results?l=Multiple%20Sclerosis&i=MONDO:0005301&t=0&q=bf9d0342-0966-4cec-8122-8d87187b1ef3

One of the answers that comes up is Monoclonal antibody an100226.

This is the early name/number for natalizumab. It will be much more helpful for users to have this normalized to the current name, natalizumab.

Options:

  1. A search in PubChem for "Monoclonal antibody an100226" brings up natalizumab and a list of depositor-supplied synonyms: https://pubchem.ncbi.nlm.nih.gov/substance/481101759
  2. The evidence for "treats" includes many papers from PubMed that clearly have natalizumab in the title. Perhaps there's a way to get SemMedDB results to provide natalizumab as an answer, or to map SemMedDB answers.
@sierra-moxon
Member

@gaurav - is this something for Name Resolver?

@sandrine-m sandrine-m added the UI - display confusion on or overlooking information label Aug 14, 2023
@sandrine-m

Tagging the ace team David and Gaurav.

@sandrine-m sandrine-m added this to the D: Fall - 2023 milestone Aug 14, 2023
@gprice1129
Collaborator

It isn't clear what the UI team can do about this issue. Is the idea of a "canonical name" available in the attribute server? @newgene

@sandrine-m

sandrine-m commented Sep 27, 2023

I think Jenn means that the returned results were not normalized properly. I used Jenn's PK to load the results back using the ARAX CI UI (note this is an "old" query and the ARAs are failing validation):
image
I found that BTE was responsible for this result:
image

I RETESTED on test today and the unusual name is still popping up:
image
and appears twice (once with the MeSH ID and once with the UMLS ID). Apparently both BTE and RTX-KG2 are returning that result.

I looked up monoclonal antibody AN100226 in the RENCI Name Resolver and found that the two identifiers get properly pooled together.

Natalizumab is among the synonyms but is not the label. I do not know what the rule is for deciding the drug label, but my guess is that it is decided at the NodeNorm stage, so is this a NodeNorm issue?

EDIT:
So there are 2 issues here I think:

  • @gprice1129 : there are 2 results for the same compound. Is it the UI that normalizes/groups the same compound together?
  • @gaurav : Got feedback from Chris B.: NodeNorm is where the labels get decided, and the team is discussing a possible update in the future.

@gprice1129
Collaborator

@sandrine-m the UI does not do any normalization; we use the normalization the ARS provides. The ARS relies on the Node Normalizer, so most likely it is an issue with that service. @gaurav @cbizon

@sandrine-muller-research
Collaborator

From conversation through slack:
@cbizon : The label is probably coming from nodenorm, which is where we are choosing the 'best' label. We currently have an approach that has not always been well received.
@gaurav Gaurav Vaidya (SRI):
I've added "Investigate strategies for improved preferred labels for cliques." to our priorities. I know we have some tickets with individual examples we can start working on, but if people have ideas about improving this at scale -- if a particular chemical provider has really good labels, say -- please let us know!

@sandrine-muller-research sandrine-muller-research changed the title Translator should map early drug numbers to final names. Give better labels to returned responses Oct 4, 2023
@sandrine-muller-research sandrine-muller-research added the response labels task around getting better labels to show label Oct 4, 2023
@sandrine-muller-research sandrine-muller-research changed the title Give better labels to returned responses Give better chemical labels to returned responses Oct 4, 2023
@gprice1129
Collaborator

I think we should move away from tickets with open ended definitions of success. "Give better labels" is way too broad and basically can never be finished. It would be better to create tickets with a finite set of items that should be corrected.

@jh111
Author

jh111 commented Feb 2, 2024

@gprice1129 If I understand correctly, I think what you're pointing out is that we can't implement this until we define what output is expected, and whether it's possible to do it.

  • What is the relative importance of improving this user experience? Can we decide now, or do we need time to get user input?
  • How much do users care whether the name is familiar to them? Does it need to be on top, or would it be OK to click to get a list of synonyms?
  • Is there one canonical name for biomedical chemicals that are already used as drug ingredients, or do different users have different expectations?
  • What is the technical feasibility of improving this user experience? How will we test it?

@sandrine-muller

sandrine-muller commented Feb 2, 2024

Re: deciding on the label in NodeNorm: my understanding was that sometimes the label NodeNorm chooses is not the one users prefer. Although this issue cannot be fixed right away (it is a long-term issue, perhaps needing some user surveys, as Jenn points out), I started a test asset sheet for testing chemical names, based on a few searches I made using the system. Please note that this sheet was made back in November 2023, I think, so the system may have changed since then. The MolePro team was particularly interested in looking at chemical labels chosen differently by MolePro and NodeNorm, to see how we can improve our system.

@jh111 jh111 added needs review this ticket needs a broad group of people to review and assign next steps because it crosses teams and removed needs review this ticket needs a broad group of people to review and assign next steps because it crosses teams labels Feb 2, 2024
@gprice1129
Collaborator

gprice1129 commented Feb 2, 2024

@jh111 Having a definition for "better chemical labels" would definitely be a good idea. However, even if we had a perfect definition for "chemical label", it's still unclear when the ticket can be closed: are we talking about all the chemical labels in the system right now, or all of them for all time? In my opinion it would be better if we constrained tickets of this nature to some finite set of chemical labels so whoever is working on it can have a clear goal.

@jh111
Author

jh111 commented Feb 2, 2024

I have put on a better title, to reflect the problem/opportunity with the experience for specific users, and the fact that different users might want different names. There are several different technical options for how this could be addressed.

For the INN, I think the RxNorm ingredient would be a fine level of detail. For example, inFLIXimab, as opposed to inFLIXimab-abda. I don't think we need to use the uppercase (which is designed for prescription safety).
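
To make the "ingredient level, without tall-man lettering" idea concrete, here is a rough Python sketch. It is only a string heuristic, not an RxNorm lookup: the function name `ingredient_level_name` is made up for illustration, and it assumes the suffix to drop is a hyphen followed by exactly four lowercase letters, as in inFLIXimab-abda.

```python
import re

# Assumption: the suffix to drop is "-" plus exactly four lowercase letters
# at the end of the name (e.g. "-abda").
_FOUR_LETTER_SUFFIX = re.compile(r"-[a-z]{4}$")


def ingredient_level_name(name: str) -> str:
    """Drop a trailing four-letter suffix and undo tall-man capitalization."""
    without_suffix = _FOUR_LETTER_SUFFIX.sub("", name.strip())
    return without_suffix.lower()


examples = ["inFLIXimab-abda", "inFLIXimab", "natalizumab"]
normalized = [ingredient_level_name(n) for n in examples]
print(normalized)
```

A pattern like this would also strip legitimate name parts that happen to end in a hyphen plus four letters, so a real implementation would more likely map names to RxNorm ingredient concepts than rely on a regex.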

@jh111 jh111 added the needs review this ticket needs a broad group of people to review and assign next steps because it crosses teams label Feb 2, 2024
@Genomewide

I think this is a NodeNorm issue. We display whatever the canonical name is. So, @gaurav can you tell us what the rules are for this? Then maybe @jh111 can see if there are examples that are not optimized, and whether optimizing those would break other terms? So, the rubric could change. However, I don't think this is a UI issue.

@cbizon
Collaborator

cbizon commented Feb 15, 2024

Another example of suboptimal labeling is using the name "Activated Charcoal" for carbon:

https://nodenorm.test.transltr.io/1.4/get_normalized_nodes?curie=PUBCHEM.COMPOUND%3A5462310&conflate=true&drug_chemical_conflate=false&description=false
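
For anyone who wants to reproduce this lookup from a script, here is a minimal Python sketch that rebuilds the URL above from its query parameters. The base URL and parameter names come from the link itself; the helper names `nodenorm_url` and `preferred_label` are hypothetical, and the response shape assumed in `preferred_label` (a mapping from CURIE to an object with an `id.label` field) is an assumption to check against the NodeNorm documentation.

```python
from urllib.parse import urlencode

BASE_URL = "https://nodenorm.test.transltr.io/1.4/get_normalized_nodes"


def nodenorm_url(curie, conflate=True, drug_chemical_conflate=False, description=False):
    """Build a get_normalized_nodes URL with the same parameters as the link above."""
    params = {
        "curie": curie,
        "conflate": str(conflate).lower(),
        "drug_chemical_conflate": str(drug_chemical_conflate).lower(),
        "description": str(description).lower(),
    }
    return f"{BASE_URL}?{urlencode(params)}"


def preferred_label(response, curie):
    """Read the chosen label from a decoded JSON response (assumed shape)."""
    entry = response.get(curie) or {}
    return (entry.get("id") or {}).get("label")


url = nodenorm_url("PUBCHEM.COMPOUND:5462310")
print(url)
```

The sketch only builds the URL and reads a label from an already-decoded response; fetching is left out so it runs without network access.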

The rule that's being applied is to get the name from each source and then rank them by the same source priority as used in biolink to pick which curie is the best one.
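
A minimal sketch of that rule as I read it, in Python. This is not the NodeNorm code: the function name `pick_label`, the placeholder prefixes, and the priority list are all made up for illustration, and the real ordering comes from Biolink.

```python
def pick_label(labels_by_curie, prefix_priority):
    """Return the label supplied by the highest-priority prefix that has one."""
    rank = {prefix: i for i, prefix in enumerate(prefix_priority)}
    best = None  # (rank, label) of the best candidate seen so far
    for curie, label in labels_by_curie.items():
        if not label:
            continue  # a source with no name cannot win
        prefix = curie.split(":", 1)[0]
        score = rank.get(prefix, len(rank))  # unknown prefixes rank last
        if best is None or score < best[0]:
            best = (score, label)
    return best[1] if best else None


# Placeholder prefixes and labels, not real sources.
ILLUSTRATIVE_PRIORITY = ["PREFIX_A", "PREFIX_B", "PREFIX_C"]
clique_labels = {
    "PREFIX_C:3": "label from C",
    "PREFIX_B:2": "label from B",
    "PREFIX_A:1": "",  # top-priority source, but it has no name
}
chosen = pick_label(clique_labels, ILLUSTRATIVE_PRIORITY)
print(chosen)
```

Under a rule like this, an awkward name from a high-priority source always beats a friendlier name from a lower-priority one, which matches what we see with "Activated Charcoal".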

@sandrine-muller

sandrine-muller commented Feb 16, 2024

When you say source, do you mean original sources or each team within Translator? Would it then be useful to collect the name that each source provides and learn a rule (= a set of weights) that best predicts user preference (= the desired result in the test asset sheet)? The idea being that some sources have more user-friendly naming strategies than others (= higher weights).
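
To illustrate what I mean, here is a very rough Python sketch. Everything in it is hypothetical: the source names `SOURCE_A` and `SOURCE_B`, which source supplies which label, and the choice of "fraction of test assets where the source's label matches the desired label" as the weight. A real version would read the test asset sheet and could use a proper learning method.

```python
from collections import defaultdict


def learn_source_weights(assets):
    """Weight each source by how often its label matches the desired label."""
    hits = defaultdict(int)
    seen = defaultdict(int)
    for asset in assets:
        desired = asset["desired"].casefold()
        for source, label in asset["labels"].items():
            seen[source] += 1
            if label.casefold() == desired:
                hits[source] += 1
    return {source: hits[source] / seen[source] for source in seen}


def pick_label(labels_by_source, weights):
    """Pick the label from the source with the highest learned weight."""
    source, label = max(labels_by_source.items(), key=lambda kv: weights.get(kv[0], 0.0))
    return label


# Hypothetical test assets; the assignment of labels to sources is invented.
assets = [
    {"desired": "natalizumab",
     "labels": {"SOURCE_A": "Monoclonal antibody an100226", "SOURCE_B": "natalizumab"}},
    {"desired": "carbon",
     "labels": {"SOURCE_A": "CHARCOAL, ACTIVATED", "SOURCE_B": "carbon"}},
    {"desired": "infliximab",
     "labels": {"SOURCE_A": "infliximab", "SOURCE_B": "inFLIXimab-abda"}},
]
weights = learn_source_weights(assets)
print(weights)
print(pick_label({"SOURCE_A": "name from A", "SOURCE_B": "name from B"}, weights))
```

With these made-up assets, `SOURCE_B` matches the desired label more often than `SOURCE_A`, so its names would be preferred for new cliques.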

@sierra-moxon sierra-moxon removed needs review this ticket needs a broad group of people to review and assign next steps because it crosses teams group1 labels Mar 1, 2024
@gaurav gaurav added Guppy (Sprint 5) - due Aug 23 in CI This ticket will be fixed in CI by the end of Guppy (Sprint 5) (Aug 23) Hammerhead (Sprint 6) - due Oct 4 in CI This ticket will be fixed in CI by the end of Hammerhead (Sprint 6) (Oct 4) and removed Guppy (Sprint 5) - due Aug 23 in CI This ticket will be fixed in CI by the end of Guppy (Sprint 5) (Aug 23) labels Jul 12, 2024
@gaurav

gaurav commented Jul 12, 2024

To deal with the simpler issue first, CHEBI:27594 "CHARCOAL, ACTIVATED" still has the wrong label (should be "carbon"). This is because we prefer CHEMBL.COMPOUND labels over others. I think I've seen other examples of CHEMBL labels being suboptimal; I wonder if we should promote CHEBI above it and see if that improves this situation (it should definitely fix this bug). I'm going to look for other reports of this before deciding whether to try this.
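
As a rough illustration of the check I have in mind before changing the order, here is a Python sketch that compares the labels picked under two prefix orderings and reports which cliques change. It is not Babel code: the function names and the second clique are hypothetical, and the assumption that the CHEBI label for this clique is "carbon" should be verified against the data.

```python
def pick_label(labels_by_prefix, priority):
    """Return the first non-empty label in priority order, or None."""
    for prefix in priority:
        label = labels_by_prefix.get(prefix)
        if label:
            return label
    return None


def changed_labels(cliques, old_priority, new_priority):
    """Map clique id -> (old label, new label) for cliques whose label changes."""
    changes = {}
    for clique_id, labels in cliques.items():
        before = pick_label(labels, old_priority)
        after = pick_label(labels, new_priority)
        if before != after:
            changes[clique_id] = (before, after)
    return changes


OLD_ORDER = ["CHEMBL.COMPOUND", "CHEBI"]  # current: CHEMBL labels preferred
NEW_ORDER = ["CHEBI", "CHEMBL.COMPOUND"]  # proposal: promote CHEBI

cliques = {
    # Assumption: CHEBI supplies "carbon" for this clique.
    "CHEBI:27594": {"CHEMBL.COMPOUND": "CHARCOAL, ACTIVATED", "CHEBI": "carbon"},
    # Hypothetical clique with only a CHEMBL label; it should be unaffected.
    "EXAMPLE:only-chembl": {"CHEMBL.COMPOUND": "some chembl name"},
}
changes = changed_labels(cliques, OLD_ORDER, NEW_ORDER)
print(changes)
```

Running something like this over all chemical cliques would show how many labels the reordering touches, and a sample of those could be reviewed by hand before committing to the change.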

Now for the more complex issue: UMLS:C0665297 is present twice in NodeNorm Test -- once in a UMLS-only Protein clique, and once in a UMLS+MESH ChemicalEntity clique. These should really be merged into a single clique, but proteins and chemicals are currently produced by independent modules, so there isn't any way to merge those cliques given how NodeNorm is currently architected.

  • I don't think fixing this is reasonable to do within this round of Translator funding, as we'll need to rethink how Babel works.
  • However, I would like to see how often this happens, which shouldn't take too much effort. I'm going to schedule that part of this work for Hammerhead, but I'll see if I can do it any sooner than that.
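
A minimal sketch of the "how often does this happen" count, in Python. The clique ids, the dictionary layout, and the `MESH:EXAMPLE` placeholder are hypothetical; the real check would iterate over the Babel output files.

```python
from collections import defaultdict


def curies_in_multiple_cliques(cliques):
    """Return {curie: sorted clique ids} for identifiers found in more than one clique."""
    membership = defaultdict(set)
    for clique_id, clique in cliques.items():
        for curie in clique["members"]:
            membership[curie].add(clique_id)
    return {curie: sorted(ids) for curie, ids in membership.items() if len(ids) > 1}


# Layout and ids are illustrative; MESH:EXAMPLE stands in for the real MeSH id.
cliques = {
    "protein-clique": {"type": "Protein", "members": ["UMLS:C0665297"]},
    "chemical-clique": {"type": "ChemicalEntity",
                        "members": ["UMLS:C0665297", "MESH:EXAMPLE"]},
}
duplicates = curies_in_multiple_cliques(cliques)
print(duplicates)
print(len(duplicates), "identifier(s) appear in more than one clique")
```

Grouping the result by the pair of clique types involved (here Protein and ChemicalEntity) would show which module boundaries cause most of the duplication.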

@Genomewide

Is there a way we can gather all of the examples together to look at the flavors we are talking about?
"Charcoal, activated" is wrong for different reasons than "A synthetic peptide of 20 amino acids, comprising D-Phe, Pro, Arg, Pro, Gly, Gly, Gly, Gly, Asn, Gly, Asp, Phe, Glu, Glu, Ile, Pro, Glu, Glu, Tyr, and Leu in sequence. A congener of hirudin (a naturally occurring drug found in the saliva of the medicinal leech), it a specific and reversible inhibitor of thrombin, and is used as an anticoagulant."

@gaurav Do you have a dart board or a stress ball (or some other place) where you keep all of our complaints? I would be interested in seeing how to break these down and then look at some examples from each group.

@sandrine-muller-research
Collaborator

@Genomewide I started this sheet on my side (to become perhaps a set of tests in future for @gaurav ) it does not contain all examples and surely Gaurav has a lot more

@Genomewide

How do I find what to put for MolePro? I added asset # 25

@sandrine-muller-research
Collaborator

sandrine-muller-research commented Jul 25, 2024

Thank you for adding a row to the sheet.
Here is how you can query MolePro, where you put ["CID:75007581"] as the input (MolePro internally has a different set of CURIEs).
However, MolePro does not know about this ID (we are tracking down why at the moment) but does know about collagenase. To query by name, use the "by_name" endpoint.
I do see on the PubChem page that it was modified at the beginning of July (2024-07-20), so that is perhaps a change of ID. We are investigating. I'll keep you posted.

@gaurav

gaurav commented Jul 26, 2024

@Genomewide I started this sheet on my side (to become perhaps a set of tests in future for @gaurav ) it does not contain all examples and surely Gaurav has a lot more

Thanks, @sandrine-muller-research! My list is actually much shorter :) I'll start moving your entries over in Hammerhead.

@sstemann sstemann removed this from the D: Fall - 2023 milestone Aug 6, 2024
@colleenXu

Just putting these here in case people are unaware of other convos:

@sandrine-muller-research
Collaborator

Thank you Colleen!
Putting this query here as it has a good number of extremely long names. I will need to see whether we have better chemical names, and update the test asset sheet. Will come back to this.
