Curation

A carefully selected subset of the projects in the survey is presented here in the style of an awesome list.

Biomedical knowledge graphs

  • Biomedical Data Translator: Publication (2022), Website, Code, API, Demo

    • Content:
      • A collection of harmonized APIs
    • Scope:
      • "integrated data from over 250 knowledge sources, each exposed via open application programming interfaces (APIs)"
      • "a diverse community of nearly 200 basic and clinical scientists, informaticians, ontologists, software developers, and practicing clinicians distributed over 11 teams and 28 institutions to form the Biomedical Data Translator Consortium"
    • Goals:
      • "integrate as many datasets as possible, using a ‘knowledge graph’–based architecture, and allow them to be cross-queried and reasoned over by translational researchers"
      • "integrating existing biomedical data sets and “translating” those data into insights intended to augment human reasoning and accelerate translational science"
      • "promote serendipitous discovery and augment human reasoning in a variety of disease spaces"
      • "federate autonomous reasoning agents and knowledge providers within a distributed system for answering translational questions"
    • Sub-projects that construct knowledge graphs:
  • Bioteque: Publication (2022), Website, Code, Data

    • Content:
      • 450,000 nodes of 12 types
      • 30 million edges of 67 types
      • Extracted from 150 data sources
      • Provided as triples in multiple TSV files
    • Scope:
      • "a resource of unprecedented size and scope that contains pre-calculated embeddings derived from a gigantic heterogeneous network"
      • "Bioteque embeddings retain the information contained in the large biological network"
    • Goals:
      • "make biomedical knowledge embeddings available to the broad scientific community"
      • "evaluate, characterize and predict a wide set of experimental observations"
      • "assessment of high-throughput protein-protein interactome data"
      • "prediction of drug response and new repurposing opportunities"
  • CKG: Publication (2023), Website, Code, Data

    • Full name: Clinical Knowledge Graph
    • Content:
      • 20 million nodes
      • 220 million edges
      • Extracted from 26 databases, 10 ontologies and 7 million publications
      • Provided as Neo4j graph database
    • Scope:
      • "prior knowledge, experimental data and de-identified clinical patient information"
      • "harmonization of proteomics with other omics data while integrating the relevant biomedical databases and text extracted from scientific publications"
    • Goals:
      • "inform clinical decision-making"
      • "reveal candidate markers of prognosis and/or treatment"
      • "generate new hypotheses that ultimately translate into clinically actionable results"
      • "clinically meaningful queries and advanced statistical analyses"
      • "liver disease biomarker discovery"
      • "multi-proteomics data integration for cancer biomarker discovery and validation"
      • "prioritize treatment options for chemorefractory cases"
  • HALD: Publication (2023), Website, Code, Data

    • Full name: Human Aging and Longevity Dataset
    • Content:
      • 12,227 nodes of 10 types
      • 115,522 edges of various types
      • Extracted from 339,918 biomedical articles in PubMed
      • Provided as triples with additional information in multiple JSON and CSV files
    • Scope:
      • "a text mining-based human aging and longevity dataset of the biomedical knowledge graph from all published literature related to human aging and longevity in PubMed"
    • Goals:
      • "precision gerontology and geroscience analyses"
      • "provide predictions regarding the individuals’ lifespan under various treatment scenarios"
      • "devise novel, biologically-driven therapeutic and preventive strategies that address fundamental aging mechanisms"
  • Monarch KG: Publication (2024), Website, Code, Data

    • Naming explanation: "The name ’Monarch Initiative’ was chosen because it is a community effort to create paths for diverse data to be put to use for disease discovery, not unlike the navigation routes that a monarch butterfly would take."
    • Content:
      • 862,115 nodes of 88 types
      • 11,412,471 edges of 23 types
      • Extracted from 33 biomedical resources and biomedical ontologies and "updated with the latest data from each source once a month"
      • Provided in various formats such as SQLite, Neo4j, RDF and KGX
    • Scope:
      • "Monarch App includes an ETL platform for ingesting, harmonizing, and serving diverse life science data relating genes, phenotypes, and diseases into a semantic KG for use in various downstream applications"
      • "Monarch KG integrates gene, disease, and phenotype data"
      • "Monarch Assistant, which will combine the ability of LLMs to answer questions in plain language with Monarch’s extensive KG and analysis algorithms"
    • Goals:
      • "learn different things about the relationship between genotype and phenotype from different organisms"
      • "collect, integrate, and make a broad compendium of species and sources computable"
  • OREGANO: Publication (2023), Code, Data

    • Content:
      • 88,937 nodes of 11 types
      • 824,231 edges of 19 types
      • Extracted from various drug, protein and phenotype databases
      • Provided as triples in a TSV file
    • Scope:
      • "a holistically constructed knowledge graph using the broadest possible features and drug characteristics"
      • "integration of natural compounds (i.e. herbal and plant remedies)"
      • "incorporating together disease and drug information and natural compounds"
    • Goals:
      • "computational drug repositioning"
      • "generate hypotheses (molecule/drug - target links) through link prediction"
      • "from the available data, determine whether a drug is potentially capable of binding to a new target"
      • "identify possible repositionable molecules using machine learning (or more specifically deep learning) algorithms"
  • PharMeBINet: Publication (2022), Website, Code, Data

    • Full name: Pharmacological Medical Biochemical Network
    • Content:
      • 2,869,407 nodes of 66 types
      • 15,883,653 edges of 208 types
      • Extracted from 48 data sources
      • Provided as Neo4j graph database and GraphML file
    • Scope:
      • "heterogeneous information on drugs, ADRs, genes, proteins, gene variants, and diseases"
    • Goals:
      • "analysis of ADRs [Adverse Drug Reactions]"
      • "analysis of possible existing connections between gene variants and drugs"
  • PrimeKG: Publication (2023), Website, Code, Data

    • Full name: Precision Medicine Knowledge Graph
    • Content:
      • 129,375 nodes of 10 types
      • 4,050,249 edges of 30 types
      • Extracted from 20 data sources
      • Provided as triples in a CSV file
    • Scope:
      • "ten major biological scales, including disease-associated protein perturbations, biological processes and pathways, anatomical and phenotypic scales, and the entire range of approved drugs with their therapeutic action"
      • "improves on coverage of diseases, both rare and common, by one-to-two orders of magnitude compared to existing knowledge graphs"
    • Goals:
      • "support research in precision medicine"
      • "linking biomedical knowledge to patient-level health information"
      • "personalized diagnostic strategies and targeted treatments"
      • "providing a holistic and multimodal view of diseases"
  • SPOKE: Publication (2023), Website, Code, API

    • Full name: Scalable Precision Medicine Open Knowledge Engine
    • Content:
      • 27,056,367 nodes of 21 types
      • 53,264,489 edges of 55 types
      • Extracted from 41 databases
      • Provided as a REST API that accepts graph queries, but "not available as a bulk download"
    • Scope:
      • "ranging from molecular and cellular biology to pharmacology and clinical practice"
      • "focuses on experimentally determined information"
      • "computational predictions and text mining from the literature are not currently prioritized"
    • Goals:
      • "applications relevant to precision medicine"
      • "provide insights into the understanding of diseases, discovering of drugs and proactively improving personal health"
      • "drug repurposing"
      • "disease prediction and interpretation of transcriptomic data"
      • "predict diagnosis"
      • "predict biomedical outcomes in a biologically meaningful manner"
  • SynLethKG: Publication (2021), Website, Code, Data

    • Full name: Synthetic Lethality Knowledge Graph
    • Content:
      • 54,012 nodes of 11 types
      • 2,231,921 edges of 24 types
      • Extracted from SynLethDB and various gene, drug and compound databases
      • Provided as triples in a CSV file
    • Scope:
      • "genes, compounds, diseases, biological processes and 24 kinds of relationships that could be pertinent to SL"
    • Goals:
      • "identify SL gene pairs"
      • "discovery of anti-cancer drug targets"
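Several of the graphs above (Bioteque, OREGANO, PrimeKG, SynLethKG) are provided as triples in TSV or CSV files. As a rough illustration of how the "nodes of N types" and "edges of M types" figures in the Content entries could be checked, the sketch below reads a triple file and counts distinct nodes per node type and edges per edge type. The column names (`subject`, `subject_type`, `predicate`, `object`, `object_type`) and the sample rows are hypothetical; each project uses its own file layout, so the names would need to be adapted.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; real files use project-specific columns and identifiers.
SAMPLE_TSV = (
    "subject\tsubject_type\tpredicate\tobject\tobject_type\n"
    "drug:1\tDrug\ttargets\tgene:1\tGene\n"
    "drug:1\tDrug\ttreats\tdisease:1\tDisease\n"
    "gene:1\tGene\tassociated_with\tdisease:1\tDisease\n"
)


def summarize_triples(handle, delimiter="\t"):
    """Count distinct nodes per node type and edges per edge type."""
    node_types = {}          # node id -> node type (deduplicates repeated nodes)
    edge_counts = Counter()  # predicate -> number of edges
    for row in csv.DictReader(handle, delimiter=delimiter):
        node_types[row["subject"]] = row["subject_type"]
        node_types[row["object"]] = row["object_type"]
        edge_counts[row["predicate"]] += 1
    node_counts = Counter(node_types.values())
    return node_counts, edge_counts


node_counts, edge_counts = summarize_triples(io.StringIO(SAMPLE_TSV))
print(dict(node_counts))
print(dict(edge_counts))
```

For a real file, the `io.StringIO` handle would be replaced by `open(path, newline="")`, with `delimiter=","` for CSV distributions.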

Tools

  • BioCypher: Publication (2023), Website, GitHub, PyPI

    • Scope:
      • "a Python library that provides a low-code access point to data processing and ontology manipulation"
      • "a modular architecture that maximizes reuse of data and code in three ways: input, ontology and output"
      • "adhere to FAIR (Findable, Accessible, Interoperable and Reusable) and TRUST (Transparency, Responsibility, User focus, Sustainability and Technology) principles"
    • Goals:
      • "make the process of creating a biomedical knowledge graph easier than ever, but still flexible and transparent"
      • "abstracting the KG build process as a combination of modular input adapters"
      • "provides easy access to state-of-the-art KGs to the average biomedical researcher"
      • "creating a more interoperable biomedical research community"
  • KGX: Website, GitHub, PyPI

    • Scope:
      • "a Python library and set of command line utilities"
      • "The core datamodel is a Property Graph (PG), represented internally in Python using a networkx MultiDiGraph model."
    • Goals:
      • "exchanging Knowledge Graphs (KGs) that conform to or are aligned to the Biolink Model"
      • "provide validation, to ensure the KGs are conformant to the Biolink Model"
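The KGX entry describes its core data model as a property graph held in a networkx MultiDiGraph. To make that idea concrete without depending on networkx or KGX, here is a minimal standard-library stand-in: a directed multigraph in which nodes and edges carry property dictionaries and several parallel edges may connect the same pair of nodes. The class `TinyPropertyGraph`, its methods, and the `EX:` identifiers are hypothetical and are not part of KGX; the `biolink:` category and predicate values are only illustrative.

```python
from collections import defaultdict


class TinyPropertyGraph:
    """Toy directed multigraph with properties on nodes and edges."""

    def __init__(self):
        self.nodes = {}                 # node id -> property dict
        self.edges = defaultdict(dict)  # (source, target) -> {edge key: property dict}

    def add_node(self, node_id, **props):
        # Merge new properties into any existing ones for this node.
        self.nodes.setdefault(node_id, {}).update(props)

    def add_edge(self, source, target, key, **props):
        # Endpoints are created on demand; parallel edges differ by key.
        self.add_node(source)
        self.add_node(target)
        self.edges[(source, target)][key] = props

    def edges_between(self, source, target):
        # Direction matters: (a, b) and (b, a) are separate edge sets.
        return self.edges.get((source, target), {})


g = TinyPropertyGraph()
g.add_node("EX:gene1", name="example gene", category="biolink:Gene")
g.add_node("EX:disease1", name="example disease", category="biolink:Disease")
g.add_edge("EX:gene1", "EX:disease1", key="e1", predicate="biolink:related_to")
g.add_edge("EX:gene1", "EX:disease1", key="e2",
           predicate="biolink:gene_associated_with_condition")

print(len(g.edges_between("EX:gene1", "EX:disease1")))
```

The multigraph property matters for knowledge graphs because the same two entities are often linked by several statements with different predicates or provenance, which a simple directed graph would collapse into one edge.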

Databases

Ontologies and controlled vocabularies

  • Collections

  • Biolink Model: Publication (2022), Website, Code

    • Scope:
      • "a unified data model that bridges across multiple ontologies, schemas, and data models"
      • "a map for bringing together data from different sources under one unified model, and as a bridge between ontological domains"
    • Goals:
      • "supported easier integration and interoperability of biomedical KGs"
      • "supports translation, integration, and harmonization across knowledge sources"

File formats