RDF LOD
Standardization of RDF data, metadata, ontologies and provenance
- Provenance vocabularies (Arto Bendiken, S Kawashima, T Katayama)
Making existing data available from SPARQL (Jerven Bolleman, Arto Bendiken, Shin Kawano, Kieron Taylor, Joe Miyamoto, Raoul)
- VCF file as SPARQL database (VCF to RDF Mapping)
- On-demand / on-the-fly conversion into triples, without a bulk conversion step (look at BGZF & tabix for indexed access to VCF)
- Gene Expression Analyses/generic datasets
- SPARQL endpoint for Tara Ocean data, link to MicrobeDB
- Create phenotype RDF data (mouse, rat, cell) (T. Takatsuki; interested: S. Kawashima, R. Vos, Rob H., S. Kumagai)
- FAIR - Findable Accessible Interoperable Reusable Data (MarkW and MichelD - group leads)
- FAIR Principles - revisited (http://datafairport.org ; https://www.force11.org/group/fairgroup/fairprinciples) (MarkW, MarkT, MichelD, interested: Erick, Nick) (potential new document describing the principles https://docs.google.com/document/d/1XEW76g3cLqOBmgQZGZOxWB0I1rwQrJSixpWL4UallTc/edit)
- Server and Client for Triple Pattern Fragments (in Perl + other languages) (MarkW) (http://www.hydra-cg.com/spec/latest/triple-pattern-fragments/) - a way to expose non-RDF data sources as triples; a way to push SPARQL query resolution to the client, rather than the server. (Client-side algorithm here: http://ceur-ws.org/Vol-1272/paper_10.pdf)
- NBDC RDF portal [Akio, Shuichi, Toshiaki]
- NBDC NikkajiRDF [Issaku Yamada (Glycomics), Toshiaki Tokimatsu (KNApSAcK), Akira Kinjo (PDBj), Gang Fu, Evan Bolton (PubChem), Kouji Kozaki (Hozo), and Tatsuya Kushida (Nikkaji)]
Participants: Michel Dumontier, Mark Wilkinson, Mark Thompson, Nick Juty
Progress:
- Discussed and refined all FAIR principles (see document)
Next steps:
- replace data with meta(data), metadata, or data
- compare to published FAIR principles
- initiate discussion with FAIR principles stewards and Barend Mons
Progress:
- Prepared a complete first draft of the FAIR principles (see document)
Next steps:
- continue discussions with FAIR principles stewards and Barend Mons
Pre-Hackathon:
- implementation of FAIR for non-RDF data sources
- prototype used the EU Huntington Disease Network exemplar dataset (part of the RD-Connect project)
- Constraints: rare-disease data are EXTREMELY sensitive - highly identifiable. Therefore, we need extremely fine-grained access control over not only the data, but also the metadata
- Prototype solution uses Linked Data Platform (W3C)
- REST interface - First URL returns repository-level metadata, and a list of URLs representing meta-records
- REST interface - meta-record URLs return metadata about individual records, and (maybe) links to the actual records data
- everything is 100% under the control of the data owner. Everything is Linked Data.
- FINDABLE: Everything is identified by a URL that resolves to RDF
- ACCESSIBLE: URLs resolve, and contain information about license and access protocol
- INTEROPERABLE: RDF is nicely formatted Linked Data with rich link-outs to other data following open ontologies.
- RE-USABLE: Metadata is as "maximal" as the data provider can provide; all retrievals have licensing information.
So, prior to the hackathon, we had a FAIR solution for METADATA over any kind of repository (RDF repositories or other!)
But... we still want a FAIR solution for DATA within those repositories. For this, we think that Triple Pattern Fragments is a good solution. The idea is that a repository responds to requests for very simple fragments of data that match certain patterns of ?s ?p ?o.
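To make that concrete, here is a minimal sketch (the fragment URL, namespace, and class are hypothetical, not part of the prototype): a single TPF request carries at most a subject, predicate, and object parameter, so the server only has to answer one basic triple pattern, while the client decomposes a full SPARQL query into many such requests and joins the results itself.

```sparql
# A hypothetical fragment request (URIs are illustrative only):
#   http://example.org/fragments?predicate=rdf%3Atype&object=ex%3APatient
# On the server this amounts to answering a single basic triple pattern:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex:  <http://example.org/>
SELECT ?s
WHERE { ?s rdf:type ex:Patient }
# The response carries the matching triples plus hypermedia controls and an
# estimated total count, so the client can page through fragments and do the
# joins itself.
```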
Progress:
- discussions with Ruben Verborgh, Kjetil Kjernsmo, and Patrick Hochstenbach regarding what has already been done, and what needs to be done.
- Client-side has a reasonably good implementation
- Server-side depends on pre-existing Triplestore. My use-case presumes that we start from something other than a triplestore
- Advised to extend RDF::Trine::Store with a new type (e.g. CSV) and implement the get_statements method to dynamically generate triples.
- PROBLEM: the "smart" way to do this would be to use a tool like Tarql, however Tarql currently won't build via maven :-P
- hacking a custom "solution" for the moment
- Solved the problem with maven --> now have a functional Tarql. Tarql (https://github.com/tarql) is "SPARQL for Tables" - a way to convert tabular data into RDF using SPARQL CONSTRUCT. It works nicely on my CSV files!
- for CSV entries with spaces, you need to use the SPARQL 'ENCODE_FOR_URI' string function to get the %20 (and the same for any other unusual characters) - a sketch of such a mapping appears after this progress list
- I want to use Plack::App::RDF::LinkedData to serve my newly transformed CSV data
- This library requires an RDF::Trine::Store object, but I want to dynamically transform my data, so none of the existing Trine::Stores are useful to me (they all require a pre-existing datastore)
- I created a new RDF::Trine::Store::CSV object. NOTE: I will probably change the namespace for these modules because they don't implement the full range of RDF::Trine::Store functionality - they are read-only, for example...
- I originally created this in Moose, but that was a waste of time because (for some odd reason) RDF::Trine::Store child objects are not created by calling ->new on RDF::Trine::Store::CHILD, but rather by calling ->new on RDF::Trine::Store("CHILD"). Therefore my CHILD Moose object's constructor was never called, and all of the lovely Mooseyness was lost. So... now it just re-implements the required subroutines from RDF::Trine::Store.
- RDF::Trine::Store::CSV implements the two methods that are required by RDF::LinkedData - "get_statements" and "count_statements". These both have output that is dynamically generated from an IPC call to Tarql
- According to the Triple Pattern Fragments spec, the incoming URL has three parameters - subject, predicate, and object (thing?subject=this;predicate=that)
- Problem: It appears that RDF::LinkedData (or the Plack App) isn't correctly parsing the incoming URL - everything is passed to my subroutine in a single parameter - $subject
- I have contacted the authors to ask for their advice.
- For the moment I am parsing the URL parameters in my own code, so that I can move forward.
- the triple-pattern matching is working, and I return an iterator to the RDF::LinkedData code. It seems that the triple-counting and the iteration over the triples (without redundancy) are working properly.
- moved the Triple Pattern Fragments server from localhost to my public server so that I could test it using Ruben's node.js-based client.
- My output is rejected by the TPF client as invalid. :-P So... I got a bit more granular, using the s/p/o patterned URLs to query my server, and compared the output to that from DBpedia.
- when I compare my output to the output from DBpedia's TPF server, it is very different! :-( Mine lacks the various control elements and metadata elements that I had expected to be added by the Plack TPF server.
- I have contacted the authors for advice.
- Michel has combined all Bio2RDF data into a single endpoint. To compensate for this, I need to revise all of the (32,000+) OpenLifeData SADI services.
- I am re-querying the endpoints now - one of the consequences of having the data in the same endpoint is that it is easier to discover the precise-type-relations between one endpoint and another (they used to be e.g. 'uniprot:Resource' but now will be something more specific like 'uniprot:Gene')
- work in-progress...
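Returning to the Tarql mapping promised earlier in this list: a Tarql mapping file is an ordinary SPARQL 1.1 CONSTRUCT query in which each CSV column header becomes a variable. The sketch below is illustrative only - the column names and namespaces are invented, not the actual dataset mapping - and shows the ENCODE_FOR_URI trick for cells containing spaces or other unusual characters.

```sparql
PREFIX ex:      <http://example.org/vocab/>
PREFIX patient: <http://example.org/patient/>

CONSTRUCT {
  ?uri a ex:Record ;
       ex:diagnosis ?diagnosis ;
       ex:site      ?site .
}
WHERE {
  # ?patient_id, ?diagnosis and ?site are bound per row from the CSV column headers.
  # ENCODE_FOR_URI percent-encodes spaces (%20) and other unusual characters,
  # keeping the generated subject URIs valid.
  BIND (URI(CONCAT(STR(patient:), ENCODE_FOR_URI(?patient_id))) AS ?uri)
}
```

Invoked roughly as `tarql mapping.sparql records.csv` (file names hypothetical) to stream the resulting triples; this is the kind of call that the get_statements/count_statements IPC wraps.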
Jee-hyub Kim and Kieron Taylor developed queries of the form:
```sparql
PREFIX obo: <http://purl.obolibrary.org/obo/>
...
SELECT ?prefix ?exact ?postfix ?section ?source
WHERE {
  ?annotation oa:hasBody ?xref .
  ?annotation oa:hasTarget ?target .
  ?target oa:hasSource ?source .
  ?target oa:hasSelector ?selector .
  ?target dcterms:isPartOf ?section .
  ?selector oa:prefix ?prefix .
  ?selector oa:exact ?exact .
  ?selector oa:postfix ?postfix .
  VALUES ?xref { <http://purl.uniprot.org/uniparc/UPI0000DA7DCA> ... }
}
```
Kieron wishes to establish the extent of crossover between the two resources.
- Script running all Ensembl IDs against PubMed discovered 115 matches
- Made a module to transform Ensembl "xrefs" into LOD URIs
- Queries still running against PubMed, hopefully more hits to come.
- Next step: try a federated query with Ensembl RDF (a rough sketch follows below)
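A rough sketch of what such a federated query could look like, under assumptions: the Ensembl endpoint URL, the use of rdfs:label, and the exact annotation shape are guesses, not a tested query. It joins the text-mining annotation pattern from the query above with gene information pulled from Ensembl RDF via SERVICE.

```sparql
PREFIX oa:   <http://www.w3.org/ns/oa#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?source ?exact ?label
WHERE {
  # Text-mined annotations whose body (?xref) is an Ensembl feature URI,
  # following the same shape as the query above
  ?annotation oa:hasBody   ?xref ;
              oa:hasTarget ?target .
  ?target     oa:hasSource ?source ;
              oa:hasSelector ?selector .
  ?selector   oa:exact ?exact .

  # Fetch a label for the same URI from the Ensembl RDF endpoint
  # (endpoint URL and predicate are assumptions)
  SERVICE <https://www.ebi.ac.uk/rdf/services/ensembl/sparql> {
    ?xref rdfs:label ?label .
  }
}
LIMIT 100
```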
Raoul, Jerven, Arto, Kieron, others
- Brainstorming methods to avoid creating an RDF dump of the entire Ensembl Variation database
- SPARQL over VCF - Functional, but multiple reads on the file per query make it ultra-slow without indexing. VCF does not contain all information Ensembl provides.
- SPARQL over SQL - No prototype, SQL schema very complex, needs Ensembl Variation experts.
- SPARQL over Variation Graph (ga4gh) - No annotation, but this can be fetched from other sources, interesting.
- Dump subset of RDF - all rsIDs with simple allele information is more manageable than the full schema. Still useful in conjunction with the core Ensembl RDF for location awareness. (A hypothetical sketch of querying such a subset follows this list.)
- Dump ALL RDF - Possible, but very very big.
Conclusion? Not sure. Demand high, difficulty also high.
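For the "dump a subset" option, one possible minimal shape - entirely hypothetical, since no schema has been agreed - is a resource per rsID carrying only its alleles, which could then be joined with the core Ensembl RDF for locations. The namespace and predicates below are placeholders.

```sparql
PREFIX ex:   <http://example.org/variation/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Query one variant from such a minimal subset: an rsID label plus its
# reference and alternate alleles, joinable with core Ensembl RDF for location.
SELECT ?variant ?ref ?alt
WHERE {
  ?variant rdfs:label         "rs699" ;
           ex:referenceAllele ?ref ;
           ex:alternateAllele ?alt .
}
```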
Making example queries for the NBDC endpoint.
- no significant progress from yesterday. Discussions with the authors of the Triple Pattern Fragments/Plack LDF server code are ongoing.
- new OpenLifeData (OLD) endpoint indexed and the config file for SADI written.
- just waiting for the SPARQL endpoint to come up, then I will re-register these SADI services.