Text and data mining
th0mp50n edited this page Sep 18, 2015 · 55 revisions
- Text mining phenotypes (Rob H, Hongyan Wu, Yas; interested: S Kawashima, Tudor, R Vos, Mark W, S Kumagai, MichelD, Joe Miyamoto)
- Drug interactions and phenotypes
- complex phenotypes
- environment-phenotype interactions
- host-pathogen interactions and phenotypes for infectious diseases
- microbial phenotypes, environments, genetic information (add to MicrobeDB?)
- Create an X Phenotype Ontology, for X in {frog, chicken, mosquito}
- Natural language processing, literature annotation and QA system
- Interoperability of text mining/annotation resources (Tudor, interested: Takatomo, E. Bolton, Takeru Nakazato, Raoul, MarkT)
- Goal
- to collect annotations of biomedical literature
- to align them
- to make them publicly accessible
- REST API
- SPARQL endpoint
- to find and develop applications
- Resources
- PubTator REST API to become compatible with PubAnnotation Annotation Server API (Wei)
- Phenotype annotation to be registered to PubAnnotation (Tudor)
- TextAE to be used in Patient Archive and Orphanet Knowledge Management (Tudor)
- DisGeNet annotation to be registered to PubAnnotation (Núria, Tudor)
- EuropeanPMC - PubAnnotation Interoperability (Jee-Hyub)
- Nanotea (Alex)
- PubAnnotation - an open repository of literature annotation
- Goal
- Knowledge Graph Annotator for human curation (MarkT, Jee-Hyub, interested: E. Bolton)
- Finished: bug fixes and working nanopub store interface
- New feature ideas (many thanks to Nuria and Erick):
- more authentication options, e.g. LinkedIn, Twitter, Scopus, ResearcherID
- options to store annotation (nanopub) in different locations
- register type of evidence with ECO ontology, or
- register source of evidence with PubAnnotation URL
- In progress: RDFa bookmarklet to connect to the annotator from any html page
- (Graph-based) data analytics on top of integrated text mining data sources (disease - gene - phenotype - chemical entities - species) (Tudor, interested: E. Bolton, Atsuko, Joe Miyamoto)
- QA over LOD
- Bio2RDF+UniProt setup for LODQA (Michel, Hongyan, Jin-Dong, Jerven, interested: E. Bolton)
- Interoperability of text mining/annotation resources (Tudor, interested: Takatomo, E. Bolton, Takeru Nakazato, Raoul, MarkT)
- Motivation
- Various annotation projects share the same targets, PubMed and PMC.
- They are maintained in silos.
- Goal
- To collect annotations to literature and align them
- To establish interoperability between text annotation resources
- To make them publicly accessible through dereferenceable URIs, REST API, RDF and SPARQL endpoints
- To find and develop applications
- Participants
- Wei, Tudor, Núria, Mark T., Jee-Hyub, Kevin, Alex G, Jin-Dong ...
- Started integration of several text mined data sets:
- DisGeNET (diseases + genes)
- PubTator (diseases + genes + variants + species)
- HPO Pubmed annotations
- DisGeRel? (diseases + genes)
- PubAnnotation to provide alignment and storage of annotation resources
- API-level interoperability
- PubTator is interoperable with PubAnnotation at API level.
- Phenotype CR
- Issues
- How to represent document-level annotations.
- How to align concept labels (ontology alignment problem).
- How to assess annotation quality.
- Goals:
- Naive comparison of text-mined concepts, for quality purposes
- Detection of gaps between mined data and curated domain knowledge, with a focus on disease-gene-phenotype associations (e.g., inferring new phenotypes for rare disorders)
- Clustering of genes based on phenotypes and/or phenotypes based on genes, and perhaps moving on to BPs, MFs, etc. (integration with GO)
- Manually curated MeSH term annotation
- Integrated the manually curated MeSH-PMID annotations into PubAnnotation.
- E-utilities are used for accessing the title/abstract and MeSH terms.
- Disease and Chemical are recognized as two individual categories; all other terms are represented by the general category "MESH".
- Gang Fu used SPARQL to extract MeSH synonyms for string matching.
- Exact match with simple pre- and post-processing.
- To-do: separate genes and species from the general category into two individual categories.
- Ex. http://textae.pubannotation.org/editor.html?mode=edit&target=http://pubannotation.org/projects/Test_MeSH/docs/sourcedb/PubMed/sourceid/9916105/annotations.json
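The exact-match step described above can be sketched as follows. This is a minimal illustration, not the code used at the hackathon: the normalization rules (lowercasing, whitespace collapsing), the nested-match filter, the function names, and the toy synonym table are all assumptions. The output mimics the span/obj shape of PubAnnotation denotations.

```python
import re


def normalize(term):
    """Pre-processing (assumed): lowercase and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", term.strip().lower())


def exact_match(text, synonyms):
    """Find exact, case-insensitive matches of dictionary synonyms in text.

    synonyms maps a surface form to a concept identifier.
    Returns a list of {"span": {"begin", "end"}, "obj"} dicts.
    """
    denotations = []
    for surface, concept_id in synonyms.items():
        tokens = normalize(surface).split(" ")
        # Allow any whitespace between the tokens of a multi-word synonym.
        pattern = r"\b" + r"\s+".join(re.escape(t) for t in tokens) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            denotations.append(
                {"span": {"begin": m.start(), "end": m.end()}, "obj": concept_id}
            )

    # Post-processing (assumed): drop matches nested inside a longer match.
    denotations.sort(key=lambda d: (d["span"]["begin"], -d["span"]["end"]))
    kept = []
    for d in denotations:
        if kept and d["span"]["end"] <= kept[-1]["span"]["end"]:
            continue
        kept.append(d)
    return kept


# Toy synonym table for illustration only.
synonyms = {
    "Breast Neoplasms": "D001943",
    "Neoplasms": "D009369",
    "Aspirin": "D001241",
}
for d in exact_match("Aspirin use and breast neoplasms risk.", synonyms):
    print(d)
```

Keeping only the longest match is one possible post-processing choice; a real pipeline might instead keep all overlapping matches and let a curator decide.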
- Annotation data set deposited
- PubmedHPO
- DisGeNet
- Large data set interchange API to be developed.
- PubAnnotation API for annotation interoperability
- curl -d text="example text" URL_of_annotation_web_service
- curl -d sourcedb="PubMed" -d sourceid="123456" URL_of_annotation_web_service
- PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/PubAnnotationByPubTator.cgi
- MESH: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/PubAnnotationByMeSH.cgi
- PhenoCR: http://phenocr.bio-lark.org:5555/pubannotation
- Demo: http://pubannotation.org/projects/Test_MeSH/docs/sourcedb/PubMed/sourceid/17175308
- Others
- Enju dependency parser
- Sentence segmenter
- PubDictionaries
- Repository of PubAnnotation-compatible tools to be developed
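The two curl calls above translate into a form-encoded POST request. The sketch below only builds the request and does not send it. The function name and the placeholder service URL are hypothetical, and the fields are URL-encoded here, which plain `curl -d` does not do on its own.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_annotation_request(service_url, text=None, sourcedb=None, sourceid=None):
    """Build a POST request for an annotation web service.

    Pass either raw text, or a (sourcedb, sourceid) pair that identifies
    a document, mirroring the two curl invocations in the notes.
    """
    if text is not None:
        fields = {"text": text}
    elif sourcedb is not None and sourceid is not None:
        fields = {"sourcedb": sourcedb, "sourceid": sourceid}
    else:
        raise ValueError("provide either text, or both sourcedb and sourceid")
    body = urlencode(fields).encode("utf-8")
    return Request(service_url, data=body, method="POST")


# Hypothetical placeholder; substitute the URL of a real annotation service.
SERVICE_URL = "http://example.org/annotation_service"

req = build_annotation_request(SERVICE_URL, sourcedb="PubMed", sourceid="123456")
print(req.get_method(), req.full_url, req.data)
```

To actually call a service, the request could be passed to `urllib.request.urlopen`; the shape of the response depends on the service and is not assumed here.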
PubDictionaries - PubAnnotation application
- The BioPortal dictionary is generated from the BioPortal endpoint. (labing_properties_in_bioportal)
- Dictionary URL: http://pubdictionaries.org/dictionaries/BioPortal
- Lookup API: http://pubdictionaries.org/mapping/term_to_id?dictionaries=BioPortal&output_format=simple&threshold=0.6&top_n=0
- Text annotation API: http://pubdictionaries.org/mapping/text_annotation?dictionaries=BioPortal&matching_method=approximate&max_tokens=6&min_tokens=1&threshold=0.6&top_n=0
- A sample text has been annotated using the dictionary
- Annotation of 10,000 PubMed articles is in progress
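The query parameters in the text annotation URL above can be inspected and adjusted programmatically. The sketch below uses only the parameters that appear in that URL; the helper name `with_params` is hypothetical, and what each parameter means on the server side is not assumed.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TEXT_ANNOTATION_URL = (
    "http://pubdictionaries.org/mapping/text_annotation"
    "?dictionaries=BioPortal&matching_method=approximate"
    "&max_tokens=6&min_tokens=1&threshold=0.6&top_n=0"
)


def with_params(url, **overrides):
    """Return url with the given query parameters replaced or added."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    params.update(overrides)
    return urlunsplit(parts._replace(query=urlencode(params)))


# Show the current settings, then build a variant with a stricter threshold.
print(dict(parse_qsl(urlsplit(TEXT_ANNOTATION_URL).query)))
print(with_params(TEXT_ANNOTATION_URL, threshold="0.8"))
```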
- New version of LODQA released: http://lodqa.org
- unstable at the moment but will be improved.
- Setup for Bio2RDF and UniProt has begun.
- not yet finished.
- Processed the NCBI disease-gene corpus and the HPO annotation corpus (~6.5 million abstracts)
- Results:
- Gene & GO BP annotations for 4,722 diseases
- HPO TF-IDF annotations for 4,691 diseases
- 7,666 HPO terms
- 10,438 GO BP terms
- Integrated: a large HP - GO BP association matrix based on additive TF-IDF scores
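The notes do not say how the TF-IDF scores were combined into the HP - GO BP matrix. The sketch below shows one plausible reading, stated as an assumption: each disease is treated as a document, TF-IDF is computed per term, and for every disease annotated with both an HP term and a GO BP term the two scores are added into the corresponding matrix cell. All names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def tf_idf(counts):
    """counts: {disease: {term: count}} -> {disease: {term: tf-idf score}}."""
    n_diseases = len(counts)
    doc_freq = defaultdict(int)
    for terms in counts.values():
        for term in terms:
            doc_freq[term] += 1
    scores = {}
    for disease, terms in counts.items():
        total = sum(terms.values())
        scores[disease] = {
            term: (count / total) * math.log(n_diseases / doc_freq[term])
            for term, count in terms.items()
        }
    return scores


def association_matrix(hp_scores, go_scores):
    """Additive combination (assumed): sum hp + go scores over shared diseases."""
    matrix = defaultdict(float)
    for disease in hp_scores.keys() & go_scores.keys():
        for hp, s_hp in hp_scores[disease].items():
            for go, s_go in go_scores[disease].items():
                matrix[(hp, go)] += s_hp + s_go
    return dict(matrix)


# Toy term counts per disease, for illustration only.
hp_counts = {"disease_1": {"hp_a": 3, "hp_b": 1}, "disease_2": {"hp_b": 2}}
go_counts = {"disease_1": {"go_x": 2}, "disease_2": {"go_x": 1, "go_y": 1}}

matrix = association_matrix(tf_idf(hp_counts), tf_idf(go_counts))
for pair, score in sorted(matrix.items()):
    print(pair, round(score, 4))
```

Other combinations (for example multiplying the two scores, or normalizing by the number of shared diseases) would give a different matrix; the additive form is used here only because the notes call the scores additive.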
- Drug labels and human/mouse/rat phenotypes: first attempt at http://aber-owl.net/aber-owl/diseasephenotypes/drugs/
- Knowledge Graph Annotator (contact MarkT; input from Erick, Nuria, Jee-Hyub, Wei, Jin-Dong, Gang, MarkW, Michel, ... -> Thank you!)
- purpose: service for human curation directly on knowledge graph
- UI
- machine readable annotations
- new name! Open Reusable Knowledge graph Annotator: “ORKA”
- result 1: UI integration: partial implementation of Bookmarklet
- extract RDFa triple statements (e.g. Virtuoso Facet Viewer)
- display for user selection -> send to ORKA API
- todo: better human readable labels
- todo: look at alternative content representation
- result 2: better back-end accessibility: nanopub store interface
- result 3: bug fixes, but..
- future work
- evidence to point to PubAnnotation span (identifier)
- query PubAnnotation for evidence suggestions
- previous text-mining results in PubAnnotation <=> training, validation