Text and data mining

Textmining phenotypes (Rob H, Hongyan Wu, Yas; interested: S Kawashima, Tudor, R Vos, Mark W, S Kumagai, MichelD, Joe Miyamoto)
- Drug interactions and phenotypes
- complex phenotypes
- environment-phenotype interactions
- host-pathogen interactions and phenotypes for infectious diseases
- microbial phenotypes, environments, genetic information (add to MicrobeDB?)
- Create a X Phenotype Ontology, X \in { frog, chicken, mosquito }
Natural language processing, literature annotation and QA system
- Interoperability of text mining/annotation resources (Tudor, interested: Takatomo, E. Bolton, Takeru Nakazato, Raoul, MarkT)
  - Goal
    - to collect annotations to Biomedical Literature
    - to align them
    - to make them publicly accessible
      - REST API
      - SPARQL endpoint
    - to find and develop applications
  - Resources
    - PubTator REST API to become compatible with PubAnnotation Annotation Server API (Wei)
    - Phenotype annotation to be registered to PubAnnotation (Tudor)
    - TextAE to be used in Patient Archive and Orphanet Knowledge Management (Tudor)
    - DisGeNet annotation to be registered to PubAnnotation (Núria, Tudor)
    - EuropeanPMC - PubAnnotation Interoperability (Jee-Hyub)
    - Nanotea (Alex)
    - PubAnnotation - an open repository of literature annotation
- Knowledge Graph Annotator for human curation (MarkT, Jee-Hyub, interested: E. Bolton)
  - Finished: bug fixes and working nanopub store interface
  - New feature ideas (many thanks to Nuria and Erick):
    - more authentication options, e.g. LinkedIn, Twitter, Scopus, ResearchID
    - options to store annotation (nanopub) in different locations
    - register type of evidence with ECO ontology, or
    - register source of evidence with PubAnnotation URL
  - In progress: RDFa bookmarklet to connect to the annotator from any html page
- (Graph-based) data analytics on top of integrated text mining data sources (disease - gene - phenotype - chemical entities - species) (Tudor, interested: E. Bolton, Atsuko, Joe Miyamoto)
- QA over LOD
  - Bio2RDF+UniProt setup for LODQA (Michel, Hongyan, Jin-Dong, Jerven, interested: E. Bolton)

Day 1&2

Text mining annotation interoperability (Text mining annotation alignment and analysis)

Motivation
- Various annotation projects share the same targets, PubMed and PMC.
- They are maintained in silos.
Goal
- To collect annotations to literature and align them
- To establish interoperability between text annotation resources
- To make them publicly accessible through dereferenceable URIs, REST API, RDF and SPARQL endpoints
- To find and develop applications
Participants
- Wei, Tudor, Núria, Mark T., Jee-Hyub, Kevin, Alex G, Jin-Dong ...
Started integration of several text mined data sets:
- DisGeNET (diseases + genes)
- PubTator (diseases + genes + variants + species)
- HPO Pubmed annotations
- DisGeRel? (diseases + genes)
PubAnnotation to provide alignment and storage of annotation resources
API-level interoperability
- PubTator is interoperable with PubAnnotation at API level.
- Phenotype CR
Issues
- How to represent document-level annotations.
- How to align concept labels (ontology alignment problem).
- How to assess annotation quality.
Goals:
- Naive comparison of text-mined concepts, for quality purposes
- Detection of gaps between mined data and curated domain knowledge, with a focus on disease - gene - phenotype associations (e.g., inferring new phenotypes for rare disorders)
- Clustering of genes based on phenotypes and/or phenotypes based on genes, and perhaps moving on to BPs, MFs, etc. (integration with GO)

Day 4

API-level interoperability (cont.)

Manually curated MeSH term annotation
- Integrated the manually curated MeSH-PMID annotations into PubAnnotation.
  - E-utilities are used to access the title/abstract and MeSH terms.
    - http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/PubAnnotationByMeSH.cgi
    - http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/PubAnnotationByPubTator.cgi
  - Disease and Chemical are recognized as two individual categories; all other terms are represented by the general category "MESH".
  - Gang Fu used SPARQL to extract MeSH synonyms for string matching.
  - Exact match with simple pre- and post-processing.
- To-Do: separate genes and species from the general category as two individual categories.
- Ex. http://textae.pubannotation.org/editor.html?mode=edit&target=http://pubannotation.org/projects/Test_MeSH/docs/sourcedb/PubMed/sourceid/9916105/annotations.json
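
The exact-match step described above can be sketched roughly as follows. This is a minimal illustration, not the code used at the hackathon. The function name `exact_match`, the toy synonym table, case-insensitive matching as the "pre-processing" and longest-match overlap removal as the "post-processing" are all assumptions. The output is laid out in the PubAnnotation JSON style (`text` plus `denotations`, each with a `span` and an `obj`).

```python
import re


def exact_match(text, synonyms):
    """Annotate `text` with exact, case-insensitive matches of dictionary terms.

    `synonyms` maps a term string to its category (e.g. "Disease").
    Longer terms are tried first, and a match is dropped if it overlaps
    a span that was already accepted (assumed post-processing rule).
    """
    accepted = []
    for term in sorted(synonyms, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for m in pattern.finditer(text):
            overlaps = any(
                m.start() < d["span"]["end"] and d["span"]["begin"] < m.end()
                for d in accepted
            )
            if not overlaps:
                accepted.append(
                    {"span": {"begin": m.start(), "end": m.end()},
                     "obj": synonyms[term]}
                )
    # Order denotations by position and give them sequential ids.
    accepted.sort(key=lambda d: d["span"]["begin"])
    for i, d in enumerate(accepted, start=1):
        d["id"] = f"T{i}"
    return {"text": text, "denotations": accepted}


# Hypothetical synonym table; real entries would come from MeSH.
SYNONYMS = {
    "aspirin": "Chemical",
    "myocardial infarction": "Disease",
    "infarction": "Disease",
}
result = exact_match("Aspirin reduces the risk of myocardial infarction.", SYNONYMS)
```

Here the shorter term "infarction" is suppressed because it overlaps the longer match "myocardial infarction", so `result["denotations"]` holds two entries.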

Final Wrap-up

Literature annotation interoperability

Data interoperability

Annotation data set deposited
- PubmedHPO
- DisGeNet
Large data set interchange API to be developed.

API level interoperability

PubAnnotation API for annotation interoperability
- curl -d text="example text" URL_of_annotation_web_service
- curl -d sourcedb="PubMed" -d sourceid="123456" URL_of_annotation_web_service
PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/PubAnnotationByPubTator.cgi
MESH: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/PubAnnotationByMeSH.cgi
PhenoCR: http://phenocr.bio-lark.org:5555/pubannotation
Demo: http://pubannotation.org/projects/Test_MeSH/docs/sourcedb/PubMed/sourceid/17175308
Others
- Enju dependency parser
- Sentence segmenter
- PubDictionaries
Repository of PubAnnotation-compatible tools to be developed
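
The two request forms listed under the PubAnnotation API above (annotate a piece of text, or annotate a document identified by `sourcedb` and `sourceid`) could be issued from a script along these lines. This is a sketch under stated assumptions: `build_annotation_request` is a hypothetical helper, `SERVICE_URL` is a placeholder standing in for `URL_of_annotation_web_service`, and the fields are form-encoded here, whereas `curl -d` sends the string as given. The request objects are only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_annotation_request(service_url, text=None, sourcedb=None, sourceid=None):
    """Build a POST request for an annotation web service.

    Pass either `text`, or both `sourcedb` and `sourceid`.
    """
    if text is not None:
        fields = {"text": text}
    elif sourcedb is not None and sourceid is not None:
        fields = {"sourcedb": sourcedb, "sourceid": sourceid}
    else:
        raise ValueError("give either text, or both sourcedb and sourceid")
    body = urlencode(fields).encode("utf-8")
    return Request(service_url, data=body, method="POST")


# Placeholder URL (assumption); substitute one of the service URLs listed above.
SERVICE_URL = "http://example.org/annotator"

by_text = build_annotation_request(SERVICE_URL, text="example text")
by_id = build_annotation_request(SERVICE_URL, sourcedb="PubMed", sourceid="123456")
```

Passing either request to `urllib.request.urlopen` would send it; the response format depends on the service.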

PubDictionaries - PubAnnotation application

A BioPortal dictionary is generated from the BioPortal endpoint. (labing_properties_in_bioportal)
A sample text is annotated using the dictionary.
- http://pubannotation.org/projects/system-maintenance/docs/sourcedb/PMC/sourceid/13914/divs
10,000 PubMed articles are being annotated.
- http://pubannotation.org/projects/BioPortalExp (Tudor)

Question-Answering

New version of LODQA released: http://lodqa.org
- unstable at the moment but will be improved.
Setup for Bio2RDF and UniProt has begun.
- not yet finished.

Phenotype data analytics

Processed the NCBI disease - gene corpus and the HPO annotation corpus (~6.5 million abstracts)
Results:
- Gene & GO BP annotations for 4,722 diseases
- HPO TF-IDF annotations for 4,691 diseases
- 7,666 HPO terms
- 10,438 GO BP terms
- Integrated: a large HP - GO BP association matrix based on additive TF-IDF scores
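
The notes do not spell out how the "additive TF-IDF scores" were combined, so the following is only one possible reading, offered as a sketch. It assumes that each disease carries term counts for HPO and for GO BP, that TF-IDF is computed as raw count times log(N / document frequency), and that the score of an (HP, GO BP) pair is the sum, over diseases annotated with both, of the two TF-IDF values. All names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def tfidf(counts):
    """Turn {disease: {term: count}} into {disease: {term: tf-idf score}}."""
    n_docs = len(counts)
    doc_freq = defaultdict(int)
    for terms in counts.values():
        for term in terms:
            doc_freq[term] += 1
    return {
        disease: {t: n * math.log(n_docs / doc_freq[t]) for t, n in terms.items()}
        for disease, terms in counts.items()
    }


def association_matrix(hp_scores, go_scores):
    """Sum hp + go TF-IDF scores over diseases that have both kinds of annotation."""
    matrix = defaultdict(float)
    for disease in hp_scores.keys() & go_scores.keys():
        for hp_term, hp_score in hp_scores[disease].items():
            for go_term, go_score in go_scores[disease].items():
                matrix[(hp_term, go_term)] += hp_score + go_score
    return dict(matrix)


# Toy counts with placeholder identifiers (not real HPO or GO ids).
hp_counts = {"disease_1": {"hp_a": 2}, "disease_2": {"hp_b": 1}}
go_counts = {"disease_1": {"go_x": 1}, "disease_2": {"go_x": 3}}

matrix = association_matrix(tfidf(hp_counts), tfidf(go_counts))
```

In this toy data `go_x` occurs in every disease, so its IDF is zero and each cell is driven by the HPO side alone.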

Textmining phenotypes

Drug labels and human/mouse/rat phenotypes: first attempt at http://aber-owl.net/aber-owl/diseasephenotypes/drugs/

Knowledge graph annotation

Knowledge Graph Annotator (contact MarkT; input from Erick, Nuria, Jee-Hyub, Wei, Jin-Dong, Gang, MarkW, Michel, ... -> Thank you!)
- purpose: service for human curation directly on knowledge graph
  - UI
  - machine readable annotations
- new name! Open Reusable Knowledge graph Annotator: “ORKA”
- result 1: UI integration: partial implementation of Bookmarklet
  - extract RDFa triple statements (e.g. Virtuoso Facet Viewer)
  - display for user selection -> send to ORKA API
  - todo: better human readable labels
  - todo: look at alternative content representation
- result 2: better back-end accessibility: nanopub store interface
- result 3: bug fixes, but..
- future work
  - evidence to point to PubAnnotation span (identifier)
  - query PubAnnotation for evidence suggestions
  - previous text-mining results in PubAnnotation <=> training, validation
