
RDF / LOD (data)

Standardization of RDF data, metadata, ontologies and provenance

Day 1

FAIR

Participants: Michel Dumontier, Mark Wilkinson, Mark Thompson, Nick Juty

Progress:

  • Discussed and refined all FAIR principles (see document)

Next steps:

  • replace each occurrence of "data" with "meta(data)", "metadata", or "data", as appropriate
  • compare to published FAIR principles
  • initiate discussion with FAIR principles stewards and Barend Mons

Day 2

FAIR

Progress:

  • Prepared a complete first draft of the FAIR principles (see document)

Next steps:

  • continue discussions with FAIR principles stewards and Barend Mons

Server and Client for Triple Pattern Fragments (Perl) - Mark

Pre-Hackathon:

  • implementation of FAIR for non-RDF data sources
  • prototype used the EU Huntington Disease Network exemplar dataset (part of RD Connect project)
  • Constraints: rare disease data are EXTREMELY sensitive - highly identifiable. Therefore, we need extremely fine-grained access control over not only the data, but also the metadata
  • Prototype solution uses Linked Data Platform (W3C)
    • REST interface - the first URL returns repository-level metadata and a list of URLs representing meta-records (see the sketch after this list)
    • REST interface - meta-record URLs return metadata about individual records and (possibly) links to the actual record data
    • everything is 100% under the control of the data owner. Everything is Linked Data.
    • FINDABLE: Everything is identified by a URL that resolves to RDF
    • ACCESSIBLE: URLs resolve, and contain information about the license and access protocol
    • INTEROPERABLE: RDF is nicely formatted Linked Data with rich link-outs to other data following open ontologies.
    • RE-USABLE: Metadata is as "maximal" as the data provider can supply; all retrievals carry licensing information
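
To make the shape of that REST interface concrete, here is a minimal sketch of retrieving the repository-level metadata and listing the meta-record URLs, written against LWP::UserAgent and RDF::Trine. The repository URL and the use of ldp:contains for the containment links are assumptions for illustration, not details of the actual prototype.

    #!/usr/bin/env perl
    # Minimal sketch of walking the LDP-style metadata interface described above:
    # fetch the repository URL, then list the meta-record URLs it links to.
    # The repository URL and the ldp:contains predicate are assumptions.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use RDF::Trine;

    my $repo_url = 'http://example.org/fair/repository';   # hypothetical entry point

    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get($repo_url, 'Accept' => 'text/turtle');
    die 'Could not retrieve repository metadata: ', $res->status_line
        unless $res->is_success;

    # Parse the repository-level metadata into an in-memory model.
    my $model  = RDF::Trine::Model->temporary_model;
    my $parser = RDF::Trine::Parser->new('turtle');
    $parser->parse_into_model($repo_url, $res->decoded_content, $model);

    # List the URLs of the meta-records contained in the repository.
    my $contains = RDF::Trine::Node::Resource->new('http://www.w3.org/ns/ldp#contains');
    my $iter     = $model->get_statements(undef, $contains, undef);
    while (my $st = $iter->next) {
        print 'meta-record: ', $st->object->uri_value, "\n";
    }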

So, prior to the hackathon, we had a FAIR solution for METADATA over any kind of repository (RDF repositories or other!)

But... we still want a FAIR solution for DATA within those repositories. For this, we think that Triple Pattern Fragments is a good solution. The idea is that a repository responds to requests for very simple fragments of the data - those triples that match a given pattern of ?s ?p ?o.
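
As a rough illustration of what such a request looks like from the client side, the sketch below asks a hypothetical fragment URL for all triples matching the pattern ?s rdf:type ?o. A real TPF client discovers the fragment URL and its parameter names from the server's hypermedia controls rather than hard-coding them.

    #!/usr/bin/env perl
    # Rough sketch of requesting one fragment from a Triple Pattern Fragments
    # server: all triples matching the pattern  ?s rdf:type ?o.
    # The fragment URL is hypothetical.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use URI;

    my $fragment = URI->new('http://example.org/fragments/dataset');   # hypothetical
    $fragment->query_form(
        predicate => 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
        # subject and object are left unbound, i.e. they act as ?s and ?o
    );

    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get($fragment, 'Accept' => 'text/turtle');
    die 'Fragment request failed: ', $res->status_line unless $res->is_success;

    # The response holds the matching data triples plus metadata (e.g. an
    # estimated total count) and controls for paging to further fragments.
    print $res->decoded_content;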

Progress:

  • discussions with Ruben Verborgh, Kjetil Kjernsmo, and Patrick Hochstenbach regarding what has already been done, and what needs to be done.
  • Client-side has a reasonably good implementation
  • The server side depends on a pre-existing triplestore; my use case presumes that we start from something other than a triplestore
  • Advised to extend RDF::Trine::Store with a new type (e.g. CSV) and implement the get_statements method to dynamically generate triples.
  • PROBLEM: the "smart" way to do this would be to use a tool like Tarql; however, Tarql currently won't build via Maven :-P
  • hacking a custom "solution" for the moment (a rough sketch of the idea follows below)
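
For reference, the gist of that custom "solution" is sketched below: read the CSV with Text::CSV and turn each row into RDF::Trine statements on the fly, in the shape that get_statements is expected to return. The column layout, namespace, and predicate are made up for illustration.

    # Sketch: generate triples on the fly from a CSV file, without Tarql.
    # The column layout ("id", "name"), base URI, and predicate are illustrative.
    use strict;
    use warnings;
    use Text::CSV;
    use RDF::Trine;
    use RDF::Trine::Iterator::Graph;

    my $base = 'http://example.org/record/';   # hypothetical namespace
    my $name = RDF::Trine::Node::Resource->new('http://xmlns.com/foaf/0.1/name');

    my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
    open my $fh, '<:encoding(utf8)', 'records.csv' or die "records.csv: $!";
    $csv->getline($fh);   # discard the header row

    my @statements;
    while (my $row = $csv->getline($fh)) {
        my ($id, $label) = @$row;
        push @statements, RDF::Trine::Statement->new(
            RDF::Trine::Node::Resource->new($base . $id),
            $name,
            RDF::Trine::Node::Literal->new($label),
        );
    }
    close $fh;

    # The same shape of result that RDF::Trine::Store::get_statements returns.
    my $iterator = RDF::Trine::Iterator::Graph->new(\@statements);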

Day 3

Server and Client for Triple Pattern Fragments (Perl) - Mark

  • Solved the problem with Maven --> now have a functional Tarql. Tarql (https://github.com/tarql) is "SPARQL for Tables" - a way to convert tabular data into RDF using SPARQL CONSTRUCT. It works nicely on my CSV files!
    • for CSV entries with spaces, you need to use the SPARQL 'ENCODE_FOR_URI' string function to get the %20 (and the same for any other unusual characters)
  • I want to use Plack::App::RDF::LinkedData to serve my newly transformed CSV data
    • This library requires an RDF::Trine::Store object, but I want to dynamically transform my data, so none of the existing Trine::Stores are useful to me (they all require a pre-existing datastore)
    • I created a new RDF::Trine::Store::CSV object. NOTE: I will probably change the namespace for these modules because they don't implement the full range of RDF::Trine::Store functionality - they are read-only, for example...
      • I originally created this in Moose, but that was a waste of time because (for some odd reason) RDF::Trine::Store child objects are not created by calling ->new on RDF::Trine::Store::CHILD, but rather by calling ->new on RDF::Trine::Store("CHILD"). Therefore my CHILD Moose object's constructor was never called, and all of the lovely Mooseyness was lost. So... now it just re-implements the required subroutines from RDF::Trine::Store.
    • RDF::Trine::Store::CSV implements the two methods that are required by RDF::LinkedData - "get_statements" and "count_statements". Both produce output that is dynamically generated from an IPC call to Tarql (see the sketch after this list)
    • According to the Triple Pattern Fragments spec, the incoming URL has three parameters - subject, predicate, and object (e.g. thing?subject=this;predicate=that)
    • Problem: It appears that RDF::LinkedData (or the Plack App) isn't correctly parsing the incoming URL - everything is passed to my subroutine in a single parameter - $subject
      • I have contacted the authors to ask for their advice.
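
A condensed sketch of how such a read-only store can hang together is shown below. It is not the actual module from the hackathon: the constructor, file names, and the Tarql invocation are placeholders, and a fuller implementation would also treat RDF::Trine::Node::Variable arguments as unbound and avoid re-running Tarql on every call.

    package RDF::Trine::Store::CSV;
    # Sketch of a read-only store that answers triple-pattern queries by shelling
    # out to Tarql and filtering the resulting triples in memory.
    use strict;
    use warnings;
    use RDF::Trine;
    use RDF::Trine::Iterator::Graph;

    sub new {
        my ($class, %args) = @_;
        # $args{csv} and $args{mapping} name the CSV file and the Tarql
        # (SPARQL CONSTRUCT) mapping used to convert it to RDF. As noted above,
        # real RDF::Trine wiring instantiates stores via RDF::Trine::Store instead.
        return bless { %args }, $class;
    }

    # Run Tarql over the CSV and parse its N-Triples output into statements.
    sub _all_statements {
        my $self = shift;
        open my $tarql, '-|', 'tarql', '--ntriples', $self->{mapping}, $self->{csv}
            or die "cannot run tarql: $!";
        my $data = do { local $/; <$tarql> };
        close $tarql;

        my @statements;
        my $parser = RDF::Trine::Parser->new('ntriples');
        $parser->parse('http://example.org/', $data, sub { push @statements, shift });
        return @statements;
    }

    # Return a graph iterator over statements matching the (possibly unbound) pattern.
    sub get_statements {
        my ($self, $s, $p, $o) = @_;
        my @matches = grep {
            (!defined $s || $_->subject->equal($s)) &&
            (!defined $p || $_->predicate->equal($p)) &&
            (!defined $o || $_->object->equal($o))
        } $self->_all_statements;
        return RDF::Trine::Iterator::Graph->new(\@matches);
    }

    # In this naive sketch, the count is just the size of the matching set.
    sub count_statements {
        my ($self, $s, $p, $o) = @_;
        my @matches = $self->get_statements($s, $p, $o)->get_all;
        return scalar @matches;
    }

    1;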

Day 4

Server for Triple Pattern Fragments (Perl) - Mark

  • For the moment I am parsing the URL parameters in my own code, so that I can move forward (see the sketch after this list).
  • the triple-pattern matching is working, and I return an iterator to the RDF::LinkedData code. It seems that the triple counting and the (non-redundant) iteration over the triples are both working properly.
  • moved the Triple Pattern Fragments server from localhost to my public server so that I could test it using Ruben's node.js-based client.
    • My output is rejected by the TPF client as invalid. :-P So... I got a bit more granular, using the s/p/o-patterned URLs to query my server, and compared the output to that from DBpedia.
  • when I compare my output to the output from DBpedia's TPF server, it is very different! :-( Mine lacks the various control elements and metadata elements that I had expected to be added by the Plack TPF server.
  • I have contacted the authors for advice.
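
The interim workaround looks roughly like the sketch below: pull subject, predicate, and object out of the query string with Plack::Request and turn them into RDF::Trine nodes before calling get_statements. The quoting convention for literals and the treatment of ?var placeholders are assumptions, not something the TPF spec mandates.

    # Sketch of the interim workaround: parse ?subject=...&predicate=...&object=...
    # from the PSGI environment myself and build RDF::Trine nodes from the values.
    use strict;
    use warnings;
    use Plack::Request;
    use RDF::Trine;

    sub pattern_from_request {
        my ($env)  = @_;
        my $req    = Plack::Request->new($env);
        my $params = $req->query_parameters;   # Hash::MultiValue of the query string

        my @nodes;
        for my $key (qw(subject predicate object)) {
            my $value = $params->get($key);
            if (!defined $value || $value eq '' || $value =~ /^\?/) {
                push @nodes, undef;                                    # unbound position
            } elsif ($value =~ /^"(.*)"$/) {
                push @nodes, RDF::Trine::Node::Literal->new($1);       # quoted literal
            } else {
                push @nodes, RDF::Trine::Node::Resource->new($value);  # assume a URI
            }
        }
        return @nodes;   # ($subject, $predicate, $object), each a node or undef
    }

The returned triple of nodes can then be handed straight to get_statements / count_statements on the store.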

Revising the OpenLifeData2SADI Services

  • Michel has combined all Bio2RDF data into a single endpoint. To compensate for this, I need to revise all of the (32,000+) OpenLifeData SADI services.
    • I am re-querying the endpoints now - one of the consequences of having all the data in the same endpoint is that it is easier to discover the precise type relations between one dataset and another (they used to be e.g. 'uniprot:Resource', but will now be something more specific like 'uniprot:Gene'); a sketch of this kind of re-indexing query follows below
    • work in-progress...
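
To give a flavour of the re-indexing, a sketch of the kind of query involved is shown below, run through RDF::Query::Client. The endpoint URL is a placeholder, and the actual indexing queries behind the 32,000+ services are considerably more involved.

    #!/usr/bin/env perl
    # Sketch: for each predicate in the combined endpoint, list the specific
    # types of its subjects and objects (e.g. uniprot:Gene rather than
    # uniprot:Resource). The endpoint URL is a placeholder.
    use strict;
    use warnings;
    use RDF::Query::Client;

    my $endpoint = 'http://example.org/sparql';   # placeholder for the combined endpoint

    my $sparql = q{
      SELECT DISTINCT ?predicate ?subjectType ?objectType
      WHERE {
        ?s ?predicate ?o .
        ?s a ?subjectType .
        ?o a ?objectType .
      }
      LIMIT 100
    };

    my $query    = RDF::Query::Client->new($sparql);
    my $iterator = $query->execute($endpoint) or die "SPARQL query failed\n";

    while (my $row = $iterator->next) {
        printf "%s : %s -> %s\n",
            $row->{predicate}->uri_value,
            $row->{subjectType}->uri_value,
            $row->{objectType}->uri_value;
    }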

Making Connections between Ensembl and PubMedCentral

  • Jee-hyub Kim and Kieron Taylor developed queries of the form:

    PREFIX obo: <http://purl.obolibrary.org/obo/>
    ...
    SELECT ?prefix ?exact ?postfix ?section ?source
    WHERE {
      ?annotation oa:hasBody ?xref .
      ?annotation oa:hasTarget ?target .
      ?target oa:hasSource ?source .
      ?target oa:hasSelector ?selector .
      ?target dcterms:isPartOf ?section .
      ?selector oa:prefix ?prefix .
      ?selector oa:exact ?exact .
      ?selector oa:postfix ?postfix .
      VALUES ?xref { <http://purl.uniprot.org/uniparc/UPI0000DA7DCA> ... }
    }

Kieron wishes to establish the extent of crossover between the two resources.

  • Script running all Ensembl IDs against PubMed discovered 115 matches
  • Made a module to transform Ensembl "xrefs" into LOD URIs (see the sketch after this list)
  • Queries still running against PubMed, hopefully more hits to come.
  • Next step: try a federated query with the Ensembl RDF
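
The mapping module boils down to something like the sketch below. The module name, database list, and URI patterns shown are illustrative, not the module's actual lookup table; the UniParc pattern matches the purl.uniprot.org URI used in the query above.

    package Ensembl::XrefToLOD;   # hypothetical module name
    # Sketch of the xref-to-LOD-URI idea: map an Ensembl external reference
    # (source database name plus accession) onto a Linked Data URI.
    use strict;
    use warnings;

    # Illustrative patterns only; the real table covers many more sources.
    my %uri_pattern = (
        'UniProt' => 'http://purl.uniprot.org/uniprot/%s',
        'UniParc' => 'http://purl.uniprot.org/uniparc/%s',
        'ChEMBL'  => 'http://identifiers.org/chembl.compound/%s',
    );

    # Returns a LOD URI for an xref, or undef if the source database is unknown.
    # e.g. xref_to_uri('UniParc', 'UPI0000DA7DCA') gives the UniParc URI seen
    # in the PubMedCentral query above.
    sub xref_to_uri {
        my ($dbname, $accession) = @_;
        my $pattern = $uri_pattern{$dbname} or return;
        return sprintf $pattern, $accession;
    }

    1;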

Ensembl Variation data via SPARQL

Raoul, Jerven, Arto, Kieron, others

  • Brainstormed methods that avoid creating an RDF dump of the entire Ensembl Variation dataset
  • SPARQL over VCF - Functional, but multiple reads of the file per query make it ultra-slow without indexing. VCF does not contain all the information Ensembl provides.
  • SPARQL over SQL - No prototype, SQL schema very complex, needs Ensembl Variation experts.
  • SPARQL over Variation Graph (ga4gh) - No annotation, but this can be fetched from other sources, interesting.
  • Dump a subset of RDF - all rsIDs with simple allele information would be more manageable than the full schema. Still useful in conjunction with the core Ensembl RDF for location awareness.
  • Dump ALL RDF - Possible, but very very big.

Conclusion? Not sure. Demand high, difficulty also high.

wwPDB/RDF SPARQL examples (AR Kinjo)

Making example queries for the NBDC endpoint.

Day 5

Server for Triple Pattern Fragments (Perl) - Mark

  • no significant progress from yesterday. Discussions with the authors of the Triple Pattern Fragments / Plack LDF server code are ongoing.

OpenLifeData2SADI (Mark & Michel)

  • new OpenLifeData (OLD) endpoint indexed and a config file for SADI written.
    • just waiting for the SPARQL endpoint to come up, and then I will re-register these SADI services.