Curation

A carefully selected subset of the projects in the survey is presented here in the style of an awesome list.

Biomedical knowledge graphs

Biomedical Data Translator – Publication (2022), Website, Code, API, Demo
- Content:
  - A collection of harmonized APIs
- Scope:
  - "integrated data from over 250 knowledge sources, each exposed via open application programming interfaces (APIs)"
  - "a diverse community of nearly 200 basic and clinical scientists, informaticians, ontologists, software developers, and practicing clinicians distributed over 11 teams and 28 institutions to form the Biomedical Data Translator Consortium"
- Goals:
  - "integrate as many datasets as possible, using a ‘knowledge graph’–based architecture, and allow them to be cross-queried and reasoned over by translational researchers"
  - "integrating existing biomedical data sets and “translating” those data into insights intended to augment human reasoning and accelerate translational science"
  - "promote serendipitous discovery and augment human reasoning in a variety of disease spaces"
  - "federate autonomous reasoning agents and knowledge providers within a distributed system for answering translational questions"
- Sub-projects that construct knowledge graphs:
  - ROBOKOP – Publication (2019), Code, Data
  - RTX-KG2 – Publication (2022), Code, Data
Bioteque – Publication (2022), Website, Code, Data
- Content:
  - 450,000 nodes of 12 types
  - 30 million edges of 67 types
  - Extracted from 150 data sources
  - Provided as triples in multiple TSV files
- Scope:
  - "a resource of unprecedented size and scope that contains pre-calculated embeddings derived from a gigantic heterogeneous network"
  - "Bioteque embeddings retain the information contained in the large biological network"
- Goals:
  - "make biomedical knowledge embeddings available to the broad scientific community"
  - "evaluate, characterize and predict a wide set of experimental observations"
  - "assessment of high-throughput protein-protein interactome data"
  - "prediction of drug response and new repurposing opportunities"
CKG – Publication (2023), Website, Code, Data
- Full name: Clinical Knowledge Graph
- Content:
  - 20 million nodes
  - 220 million edges
  - Extracted from 26 databases, 10 ontologies, 7 million publications
  - Provided as Neo4j graph database
- Scope:
  - "prior knowledge, experimental data and de-identified clinical patient information"
  - "harmonization of proteomics with other omics data while integrating the relevant biomedical databases and text extracted from scientific publications"
- Goals:
  - "inform clinical decision-making"
  - "reveal candidate markers of prognosis and/or treatment"
  - "generate new hypotheses that ultimately translate into clinically actionable results"
  - "clinically meaningful queries and advanced statistical analyses"
  - "liver disease biomarker discovery"
  - "multi-proteomics data integration for cancer biomarker discovery and validation"
  - "prioritize treatment options for chemorefractory cases"
HALD – Publication (2023), Website, Code, Data
- Full name: Human Aging and Longevity Dataset
- Content:
  - 12,227 nodes of 10 types
  - 115,522 edges of various types
  - Extracted from 339,918 biomedical articles in PubMed
  - Provided as triples with additional information in multiple JSON and CSV files
- Scope:
  - "a text mining-based human aging and longevity dataset of the biomedical knowledge graph from all published literature related to human aging and longevity in PubMed"
- Goals:
  - "precision gerontology and geroscience analyses"
  - "provide predictions regarding the individuals’ lifespan under various treatment scenarios"
  - "devise novel, biologically-driven therapeutic and preventive strategies that address fundamental aging mechanisms"
Monarch KG – Publication (2024), Website, Code, Data
- Naming explanation: "The name ’Monarch Initiative’ was chosen because it is a community effort to create paths for diverse data to be put to use for disease discovery, not unlike the navigation routes that a monarch butterfly would take."
- Content:
  - 862,115 nodes of 88 types
  - 11,412,471 edges of 23 types
  - Extracted from 33 biomedical resources and biomedical ontologies and "updated with the latest data from each source once a month"
  - Provided in various formats such as SQLite, Neo4j, RDF, KGX
- Scope:
  - "Monarch App includes an ETL platform for ingesting, harmonizing, and serving diverse life science data relating genes, phenotypes, and diseases into a semantic KG for use in various downstream applications"
  - "Monarch KG integrates gene, disease, and phenotype data"
  - "Monarch Assistant, which will combine the ability of LLMs to answer questions in plain language with Monarch’s extensive KG and analysis algorithms"
- Goals:
  - "learn different things about the relationship between genotype and phenotype from different organisms"
  - "collect, integrate, and make a broad compendium of species and sources computable"
OREGANO – Publication (2023), Code, Data
- Content:
  - 88,937 nodes of 11 types
  - 824,231 edges of 19 types
  - Extracted from various drug, protein and phenotype databases
  - Provided as triples in a TSV file
- Scope:
  - "a holistically constructed knowledge graph using the broadest possible features and drug characteristics"
  - "integration of natural compounds (i.e. herbal and plant remedies)"
  - "incorporating together disease and drug information and natural compounds"
- Goals:
  - "computational drug repositioning"
  - "generate hypotheses (molecule/drug - target links) through link prediction"
  - "from the available data, determine whether a drug is potentially capable of binding to a new target"
  - "identify possible repositionable molecules using machine learning (or more specifically deep learning) algorithms"
PharMeBINet – Publication (2022), Website, Code, Data
- Full name: Pharmacological Medical Biochemical Network
- Content:
  - 2,869,407 nodes of 66 types
  - 15,883,653 edges of 208 types
  - Extracted from 48 data sources
  - Provided as Neo4j graph database and GraphML file
- Scope:
  - "heterogeneous information on drugs, ADRs, genes, proteins, gene variants, and diseases"
- Goals:
  - "analysis of ADRs [Adverse Drug Reactions]"
  - "analysis of possible existing connections between gene variants and drugs"
PrimeKG – Publication (2023), Website, Code, Data
- Full name: Precision Medicine Knowledge Graph
- Content:
  - 129,375 nodes of 10 types
  - 4,050,249 edges of 30 types
  - Extracted from 20 data sources
  - Provided as triples in a CSV file
- Scope:
  - "ten major biological scales, including disease-associated protein perturbations, biological processes and pathways, anatomical and phenotypic scales, and the entire range of approved drugs with their therapeutic action"
  - "improves on coverage of diseases, both rare and common, by one-to-two orders of magnitude compared to existing knowledge graphs"
- Goals:
  - "support research in precision medicine"
  - "linking biomedical knowledge to patient-level health information"
  - "personalized diagnostic strategies and targeted treatments"
  - "providing a holistic and multimodal view of diseases"
SPOKE – Publication (2023), Website, Code, API
- Full name: Scalable Precision Medicine Open Knowledge Engine
- Content:
  - 27,056,367 nodes of 21 types
  - 53,264,489 edges of 55 types
  - Extracted from 41 databases
  - Provided as a REST API that accepts graph queries, but "not available as a bulk download"
- Scope:
  - "ranging from molecular and cellular biology to pharmacology and clinical practice"
  - "focuses on experimentally determined information"
  - "computational predictions and text mining from the literature are not currently prioritized"
- Goals:
  - "applications relevant to precision medicine"
  - "provide insights into the understanding of diseases, discovering of drugs and proactively improving personal health"
  - "drug repurposing"
  - "disease prediction and interpretation of transcriptomic data"
  - "predict diagnosis"
  - "predict biomedical outcomes in a biologically meaningful manner"
SynLethKG – Publication (2021), Website, Code, Data
- Full name: Synthetic Lethality Knowledge Graph
- Content:
  - 54,012 nodes of 11 types
  - 2,231,921 edges of 24 types
  - Extracted from SynLethDB and various gene, drug and compound databases
  - Provided as triples in a CSV file
- Scope:
  - "genes, compounds, diseases, biological processes and 24 kinds of relationships that could be pertinent to SL"
- Goals:
  - "identify SL gene pairs"
  - "discovery of anti-cancer drug targets"
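
Several of the graphs listed above (for example OREGANO, PrimeKG and SynLethKG) are provided as triples in TSV or CSV files, and their entries report counts of nodes and edges per type. The following is a minimal sketch of how such counts could be derived from a triple file. The column names (`head`, `head_type`, `relation`, `tail`, `tail_type`) and the sample rows are hypothetical and do not match the schema of any particular project, so they would need to be adapted to the actual file.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; real files use project-specific column names.
SAMPLE_CSV = """head,head_type,relation,tail,tail_type
DrugA,drug,targets,GeneX,gene
DrugA,drug,treats,DiseaseY,disease
GeneX,gene,associated_with,DiseaseY,disease
"""


def summarize_triples(handle, delimiter=","):
    """Count distinct nodes per node type and edges per relation type."""
    nodes = {}              # node id -> node type (deduplicates repeated nodes)
    edge_types = Counter()  # relation name -> number of edges
    for row in csv.DictReader(handle, delimiter=delimiter):
        nodes[row["head"]] = row["head_type"]
        nodes[row["tail"]] = row["tail_type"]
        edge_types[row["relation"]] += 1
    node_types = Counter(nodes.values())
    return node_types, edge_types


node_types, edge_types = summarize_triples(io.StringIO(SAMPLE_CSV))
print(f"{sum(node_types.values())} nodes of {len(node_types)} types")
print(f"{sum(edge_types.values())} edges of {len(edge_types)} types")
```

For a TSV file, the same function can be called with `delimiter="\t"` on an opened file handle.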

Tools

BioCypher – Publication (2023), Website, GitHub, PyPI
- Scope:
  - "a Python library that provides a low-code access point to data processing and ontology manipulation"
  - "a modular architecture that maximizes reuse of data and code in three ways: input, ontology and output"
  - "adhere to FAIR (Findable, Accessible, Interoperable and Reusable) and TRUST (Transparency, Responsibility, User focus, Sustainability and Technology) principles"
- Goals:
  - "make the process of creating a biomedical knowledge graph easier than ever, but still flexible and transparent"
  - "abstracting the KG build process as a combination of modular input adapters"
  - "provides easy access to state-of-the-art KGs to the average biomedical researcher"
  - "creating a more interoperable biomedical research community"
KGX – Website, GitHub, PyPI
- Scope:
  - "a Python library and set of command line utilities"
  - "The core datamodel is a Property Graph (PG), represented internally in Python using a networkx MultiDiGraph model."
- Goals:
  - "exchanging Knowledge Graphs (KGs) that conform to or are aligned to the Biolink Model"
  - "provide validation, to ensure the KGs are conformant to the Biolink Model"
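
To illustrate the property graph idea mentioned for KGX, here is a small sketch of a directed multigraph whose nodes and edges carry attribute dictionaries. It is written with the standard library only and is not KGX's or networkx's API; the class `PropertyMultiDiGraph`, the `EX:` identifiers and the chosen attribute names are assumptions made for illustration.

```python
from collections import defaultdict


class PropertyMultiDiGraph:
    """Tiny stand-in for a property graph: attributed nodes, parallel directed edges."""

    def __init__(self):
        self.nodes = {}                 # node id -> attribute dict
        self.edges = defaultdict(list)  # (subject, object) -> list of attribute dicts

    def add_node(self, node_id, **attrs):
        # Create the node if needed and merge in any new attributes.
        self.nodes.setdefault(node_id, {}).update(attrs)

    def add_edge(self, subject, obj, **attrs):
        # Endpoints are created on demand; each call adds one more parallel edge.
        self.add_node(subject)
        self.add_node(obj)
        self.edges[(subject, obj)].append(attrs)

    def number_of_edges(self):
        return sum(len(edge_list) for edge_list in self.edges.values())


graph = PropertyMultiDiGraph()
# Hypothetical identifiers; the categories and predicates follow Biolink-style naming.
graph.add_node("EX:gene1", category="biolink:Gene", name="example gene")
graph.add_node("EX:disease1", category="biolink:Disease", name="example disease")
graph.add_edge("EX:gene1", "EX:disease1", predicate="biolink:related_to")
graph.add_edge("EX:gene1", "EX:disease1", predicate="biolink:gene_associated_with_condition")

print(len(graph.nodes), "nodes,", graph.number_of_edges(), "edges")
```

The two edges share the same endpoints but differ in their predicate, which is the situation a multigraph model is meant to accommodate.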

Databases

Collections

Ontologies and controlled vocabularies

Collections
- BioPortal
- Ontology Lookup Service
Biolink Model – Publication (2022), Website, Code
- Scope:
  - "a unified data model that bridges across multiple ontologies, schemas, and data models"
  - "a map for bringing together data from different sources under one unified model, and as a bridge between ontological domains"
- Goals:
  - "supported easier integration and interoperability of biomedical KGs"
  - "supports translation, integration, and harmonization across knowledge sources"

File formats

KGX (.json, .jsonl, .tsv, .ttl) – Website
Neo4j (.dump) – Website, Wikipedia
Resource Description Framework (RDF) – Website, Wikipedia
- Turtle (.ttl) – Website, Wikipedia
- N-Triples (.nt) – Website, Wikipedia
- Notation3 (.n3) – Website, Wikipedia
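
As a rough illustration of the simplest RDF serialization listed above, the sketch below writes (subject, predicate, object) triples as N-Triples lines. It assumes that all three terms are IRIs without characters that need escaping; literals, blank nodes and escaping are not handled. The function name `to_ntriples` and the `https://example.org/` IRIs are hypothetical.

```python
def to_ntriples(triples):
    """Serialize (subject, predicate, object) IRI triples as N-Triples text."""
    lines = []
    for subject, predicate, obj in triples:
        # One statement per line: three terms in angle brackets, then " ."
        lines.append(f"<{subject}> <{predicate}> <{obj}> .")
    return "\n".join(lines) + "\n"


# Hypothetical IRIs used only for demonstration.
example_triples = [
    ("https://example.org/drugA", "https://example.org/targets", "https://example.org/geneX"),
    ("https://example.org/drugA", "https://example.org/treats", "https://example.org/diseaseY"),
]

ntriples_text = to_ntriples(example_triples)
print(ntriples_text, end="")
```

Because N-Triples is a subset of Turtle, output of this shape can also be read by Turtle parsers.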
