Text and data mining

th0mp50n edited this page Sep 18, 2015 · 55 revisions

  • Textmining phenotypes (Rob H, Hongyan Wu, Yas; interested: S Kawashima, Tudor, R Vos, Mark W, S Kumagai, MichelD, Joe Miyamoto)
    • Drug interactions and phenotypes
    • complex phenotypes
    • environment-phenotype interactions
    • host-pathogen interactions and phenotypes for infectious diseases
    • microbial phenotypes, environments, genetic information (add to MicrobeDB?)
    • Create an X Phenotype Ontology, X ∈ {frog, chicken, mosquito}
  • Natural language processing, literature annotation and QA system
    • Interoperability of text mining/annotation resources (Tudor, interested: Takatomo, E. Bolton, Takeru Nakazato, Raoul, MarkT)
      • Goal
        • to collect annotations to Biomedical Literature
        • to align them
        • to make them publicly accessible
          • REST API
          • SPARQL endpoint
        • to find and develop applications
      • Resources
        • PubTator REST API to become compatible with PubAnnotation Annotation Server API (Wei)
        • Phenotype annotation to be registered to PubAnnotation (Tudor)
        • TextAE to be used in Patient Archive and Orphanet Knowledge Management (Tudor)
        • DisGeNET annotation to be registered to PubAnnotation (Núria, Tudor)
        • Europe PMC - PubAnnotation Interoperability (Jee-Hyub)
        • Nanotea (Alex)
        • PubAnnotation - an open repository of literature annotation
    • Knowledge Graph Annotator for human curation (MarkT, Jee-Hyub, interested: E. Bolton)
      • Finished: bug fixes and working nanopub store interface
      • New feature ideas (many thanks to Nuria and Erick):
        • more authentication options, e.g. LinkedIn, Twitter, Scopus, ResearcherID
        • options to store annotation (nanopub) in different locations
        • register type of evidence with ECO ontology, or
        • register source of evidence with PubAnnotation URL
      • In progress: RDFa bookmarklet to connect to the annotator from any html page
    • (Graph-based) data analytics on top of integrated text mining data sources (disease - gene - phenotype - chemical entities - species) (Tudor, interested: E. Bolton, Atsuko, Joe Miyamoto)
    • QA over LOD
      • Bio2RDF+UniProt setup for LODQA (Michel, Hongyan, Jin-Dong, Jerven, interested: E. Bolton)

Day 1&2

Text mining annotation interoperability (Text mining annotation alignment and analysis)

  • Motivation
    • Various annotation projects sharing the same target, PubMed and PMC.
    • They are maintained in silos.
  • Goal
    • To collect annotations to literature and align them
    • To establish interoperability between text annotation resources
    • To make them publicly accessible through dereferenceable URIs, REST API, RDF and SPARQL endpoints
    • To find and develop applications
  • Participants
    • Wei, Tudor, Núria, Mark T., Jee-Hyub, Kevin, Alex G, Jin-Dong ...
  • Started integration of several text mined data sets:
    • DisGeNET (diseases + genes)
    • PubTator (diseases + genes + variants + species)
    • HPO PubMed annotations
    • DisGeRel? (diseases + genes)
  • PubAnnotation to provide alignment and storage of annotation resources
  • API-level interoperability
    • PubTator is interoperable with PubAnnotation at API level.
    • Phenotype CR
  • Issues
    • How to represent document level annotation.
    • How to align concept labels (ontology alignment problem).
    • How to do the quality assessment.
  • Goals:
    • Naive comparison of text mined concepts - for quality purposes
    • Detection of gaps between mined data and curated domain knowledge with focus on disease - gene - phenotype associations (e.g., inferring new phenotypes for rare disorders)
    • Clustering of genes based on phenotypes and / or phenotypes based on genes - and perhaps going forward towards BPs | MFs, etc ... (integration with GO)
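
The naive comparison of text mined concepts listed in the goals above could start from a simple per-document set overlap. The following is a minimal sketch, not the method used at the hackathon. It assumes each resource has already been reduced to a mapping from PMID to a set of concept identifiers normalized to a shared vocabulary. The function name, the PMIDs and the sample identifiers are hypothetical.

```python
def jaccard_by_document(source_a, source_b):
    """Return {pmid: Jaccard overlap} for documents annotated by both sources."""
    scores = {}
    # Only documents present in both resources can be compared.
    for pmid in source_a.keys() & source_b.keys():
        a, b = source_a[pmid], source_b[pmid]
        union = a | b
        # Two empty annotation sets are treated as fully agreeing.
        scores[pmid] = len(a & b) / len(union) if union else 1.0
    return scores


# Hypothetical sample data: PMID -> set of concept identifiers.
resource_a = {"1001": {"HP:0001250", "HP:0001263"}, "1002": {"HP:0000252"}}
resource_b = {"1001": {"HP:0001250"}, "1003": {"HP:0004322"}}

overlap = jaccard_by_document(resource_a, resource_b)
print(overlap)
```

A low overlap for a document does not say which resource is wrong. It only flags the document for closer inspection, and it presupposes that the concept label alignment issue noted above has been handled first.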

Day 4

API-level interoperability (cont.)
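
As an illustration of what API-level interoperability between PubTator and PubAnnotation involves at the data level, the sketch below converts one record in the tab-delimited PubTator text format into a PubAnnotation-style JSON document. It is a sketch under assumptions, not the implementation worked on here. It assumes that PubTator offsets count characters over the title and abstract joined by a single separator character, and that the concept identifier (falling back to the entity type) is a suitable value for `obj`. The function name and the sample record are hypothetical.

```python
import json


def pubtator_to_pubannotation(record):
    """Convert one PubTator-format record into a PubAnnotation-style dict."""
    pmid, title, abstract, denotations = None, "", "", []
    for line in record.strip().splitlines():
        parts = line.split("|", 2)
        if len(parts) == 3 and parts[1] in ("t", "a"):
            # Title ("t") and abstract ("a") lines carry the document text.
            pmid = parts[0]
            if parts[1] == "t":
                title = parts[2]
            else:
                abstract = parts[2]
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # skip lines that are not mention-level annotations
        # Prefer the concept identifier; fall back to the entity type.
        concept = fields[5] if len(fields) > 5 and fields[5] else fields[4]
        denotations.append({
            "id": f"T{len(denotations) + 1}",
            "span": {"begin": int(fields[1]), "end": int(fields[2])},
            "obj": concept,
        })
    # Assumption: one separator character between title and abstract.
    text = f"{title} {abstract}" if abstract else title
    return {"sourcedb": "PubMed", "sourceid": pmid,
            "text": text, "denotations": denotations}


# Hypothetical sample record for illustration only.
sample = (
    "1001|t|BRCA1 and breast cancer\n"
    "1001|a|Mutations in BRCA1 increase risk.\n"
    "1001\t0\t5\tBRCA1\tGene\t672\n"
    "1001\t10\t23\tbreast cancer\tDisease\tMESH:D001943\n"
    "1001\t37\t42\tBRCA1\tGene\t672\n"
)

doc = pubtator_to_pubannotation(sample)
print(json.dumps(doc, indent=2))
```

Checking that every span, applied to `text`, yields the original mention string is a cheap way to detect offset drift before depositing converted annotations.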

Final Wrap-up

Literature annotation interoperability

Data interoperability

  • Annotation data set deposited
    • PubmedHPO
    • DisGeNET
  • Large data set interchange API to be developed.

API level interoperability

Question-Answering

  • New version of LODQA released: http://lodqa.org
    • unstable at the moment but will be improved.
  • Setup for Bio2RDF and UniProt has begun.
    • not yet finished.

Phenotype data analytics

  • Processed NCBI disease - gene corpus and the HPO annotation corpus - ~6.5 million abstracts
  • Results:
    • Gene & GO BP annotations for 4,722 diseases
    • HPO TF-IDF annotations for 4,691 diseases
    • 7,666 HPO terms
    • 10,438 GO BP terms
  • Integrated: a large HP - GO BP association matrix based on additive TF-IDF scores
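
One way to read "additive TF-IDF scores" is sketched below. This is an assumption about the scoring, not a description of the code written at the hackathon. For every disease annotated with both HPO and GO BP terms, each (HP term, GO BP term) pair receives the sum of the two per-disease TF-IDF weights, and these contributions are accumulated over all diseases. The function name, disease keys and weights are hypothetical.

```python
from collections import defaultdict


def build_association_matrix(hpo_scores, go_scores):
    """Accumulate HP x GO BP association scores over shared diseases.

    Both arguments map disease -> {term: TF-IDF weight}.
    """
    matrix = defaultdict(lambda: defaultdict(float))
    # Only diseases with both HPO and GO BP annotations contribute.
    for disease in hpo_scores.keys() & go_scores.keys():
        for hp_term, hp_weight in hpo_scores[disease].items():
            for go_term, go_weight in go_scores[disease].items():
                # Assumed "additive" scoring: sum the two weights.
                matrix[hp_term][go_term] += hp_weight + go_weight
    return {hp_term: dict(row) for hp_term, row in matrix.items()}


# Hypothetical per-disease TF-IDF weights for illustration only.
hpo_scores = {
    "D1": {"HP:0001250": 0.5},
    "D2": {"HP:0001250": 0.25, "HP:0000252": 1.0},
    "D3": {"HP:0004322": 0.75},  # no GO BP annotations, so ignored
}
go_scores = {
    "D1": {"GO:0007268": 0.5},
    "D2": {"GO:0007268": 0.25},
}

matrix = build_association_matrix(hpo_scores, go_scores)
print(matrix)
```

At the scale reported above (7,666 HPO terms by 10,438 GO BP terms) a sparse matrix representation would likely be preferable to nested dictionaries.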

Textmining phenotypes

Knowledge graph annotation

  • Knowledge Graph Annotator (contact MarkT; input from Erick, Nuria, Jee-Hyub, Wei, Jin-Dong, Gang, MarkW, Michel, ... -> Thank you!)
    • purpose: service for human curation directly on knowledge graph
      • UI
      • machine readable annotations
    • new name! Open Reusable Knowledge graph Annotator: “ORKA”
    • result 1: UI integration: partial implementation of Bookmarklet
      • extract RDFa triple statements (e.g. Virtuoso Facet Viewer)
      • display for user selection -> send to ORKA API
      • todo: better human readable labels
      • todo: look at alternative content representation
    • result 2: better back-end accessibility: nanopub store interface
    • result 3: bug fixes, but..
    • future work
      • evidence to point to PubAnnotation span (identifier)
      • query PubAnnotation for evidence suggestions
      • previous text-mining results in PubAnnotation <=> training, validation