Reduce DrugChemical for load into SAPBERT #330

Open. Wants to merge 19 commits into base: load-drugchemical-into-duckdb

@gaurav (Collaborator) commented Aug 7, 2024

This PR produces a DrugChemicalSmaller file so that we can load it into SAPBERT. It includes three changes:

  1. I've tweaked the order of the prefix boosts so that we pick slightly better names.
  2. A new demote_labels_longer_than config setting (currently set to 15) filters out any label longer than that length, as long as at least one label of that length or shorter is available.
  3. We now generate a SAPBERT training file called DrugChemicalConflatedSmaller.txt.gz, which only includes a clique from DrugChemicalConflated if its preferred label is shorter than DRUG_CHEMICAL_SMALLER_MAX_LABEL_LENGTH.
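
As a rough illustration of the rule in item 2, here is a minimal sketch. It is not the code in this PR: the function name `demote_long_labels` is hypothetical, and it assumes that label length is measured in characters and that labels arrive as a plain list of strings.

```python
def demote_long_labels(labels, demote_labels_longer_than=15):
    """Drop labels longer than the threshold, unless no shorter label exists.

    Hypothetical helper: assumes length is counted in characters.
    """
    # Labels at or under the threshold are always kept.
    short_enough = [label for label in labels
                    if len(label) <= demote_labels_longer_than]
    # If nothing fits, fall back to the full list so that the clique
    # still has at least one label to choose from.
    return short_enough if short_enough else list(labels)


if __name__ == "__main__":
    print(demote_long_labels(["acetylsalicylic acid", "aspirin"]))
    print(demote_long_labels(["acetylsalicylic acid"]))
```

The fallback branch is what "as long as at least one label equal to or less than that size is available" amounts to: a clique whose labels are all long keeps them rather than ending up with no label.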

DRUG_CHEMICAL_SMALLER_MAX_LABEL_LENGTH can be used to control how big DrugChemicalConflatedSmaller.txt is:

| DRUG_CHEMICAL_SMALLER_MAX_LABEL_LENGTH | Training rows | Unique CURIEs |
| --- | --- | --- |
| 50 | 25,187,771 | 19,835,134 |
| 40 | 15,450,212 | 10,571,449 |
| 30 | 9,620,711 | 6,056,510 |
| 15 | 4,803,665 | 3,808,855 |

DRUG_CHEMICAL_SMALLER_MAX_LABEL_LENGTH=30 seems like a reasonable setting right now.
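
For readers who want to see what the clique filter amounts to, here is a minimal sketch. It is not the implementation in this PR: `filter_cliques` and the dictionary keys `preferred_label` and `curies` are hypothetical, since the real clique record format is not shown here. The strict `<` follows the "shorter than" wording above.

```python
def filter_cliques(cliques, max_label_length=30):
    """Yield only cliques whose preferred label is shorter than the limit.

    Hypothetical record shape: {"preferred_label": str, "curies": [str, ...]}.
    """
    for clique in cliques:
        # "Shorter than" is read as a strict comparison here.
        if len(clique["preferred_label"]) < max_label_length:
            yield clique


if __name__ == "__main__":
    example = [
        {"preferred_label": "aspirin", "curies": ["EX:1", "EX:2"]},
        {"preferred_label": "x" * 45, "curies": ["EX:3"]},
    ]
    kept = list(filter_cliques(example, max_label_length=30))
    print(len(kept), "clique(s) kept")
```

Lowering `max_label_length` keeps fewer cliques, which is why both the training row count and the unique CURIE count in the table shrink as the setting drops.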

Closes #313.

Review comment on src/babel_utils.py (outdated, resolved)
@gaurav gaurav requested a review from cbizon September 30, 2024 20:01
@gaurav gaurav marked this pull request as ready for review September 30, 2024 20:02