Does SPARQL TSV Results make sense? #48

Open
chrdebru opened this issue Mar 11, 2024 · 5 comments
Labels
documentation, help wanted, question, working-group

Comments

@chrdebru
Contributor

chrdebru commented Mar 11, 2024

Neither the documentation nor the test cases provide such examples (the same holds for XML and JSON results, by the way). But I question the usefulness of rml:SPARQL_RESULT_TSV. Taking the example of 0003, we would have the following TSV:

?person	?name	?age
<http://example.org/0>	"Monica Geller"	"33"
<http://example.org/1>	"Rachel Green"	"34"
<http://example.org/2>	"Joey Tribbiani"	"35"
<http://example.org/3>	"Chandler Bing"	"36"
<http://example.org/4>	"Ross Geller"	"37"

How should we iterate over those? We cannot treat them as regular TSV. The angle brackets should be removed from IRIs. Literals should be "cast" to their datatypes. And I have no idea what to do with blank node identifiers. Is it possible the group thought that the TSV output would be the same as CSV output, but with tabs?
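To illustrate the point, here is a minimal Python sketch that reads a shortened copy of the TSV above as if it were ordinary tab-separated text. The variable name `sparql_tsv` and the use of plain `str.split` are assumptions made for illustration only; nothing here is prescribed by RML.

```python
# Two rows of the SPARQL TSV result above, read as ordinary tab-separated text.
sparql_tsv = (
    "?person\t?name\t?age\n"
    '<http://example.org/0>\t"Monica Geller"\t"33"\n'
    '<http://example.org/1>\t"Rachel Green"\t"34"\n'
)

lines = sparql_tsv.splitlines()
header = lines[0].split("\t")
rows = [dict(zip(header, line.split("\t"))) for line in lines[1:]]

# The keys keep their question marks and the values keep their RDF term syntax
# (angle brackets around IRIs, double quotes around literals).
print(header)
print(rows[0])
```

A reference would therefore have to be written as `?name`, and the value it yields would still carry its quotes or angle brackets.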

Same question for JSON and XML representations of SPARQL queries: do they have bespoke iterations (i.e., not the same iterations as for "regular" JSON or XML files), or would iterating over them require a second iterator?

@DylanVanAssche
Collaborator

I think this is again the same problem that @pmaria mentioned and wanted to 'document' in a Note: kg-construct/rml-core#113

Basically, we would need to properly define a better reference formulation here. formats:SPARQL_Results_TSV defines the format, not how to iterate over it. We would need something like rml:SPARQLSelectTSV. In the Note we would then define what an RML processor should do to iterate over the results:

  • Row-basis
  • Remove angle brackets
  • Blank nodes: I would suggest just reading them as blank nodes; the engine can reuse them or create new blank nodes when outputting the data.
  • Cast Literals to their data types
  • ...

Same for the others.
If you need multiple iterators, I think that is an RML Fields matter: there you can have nested iterators, even with mixed data formats such as JSON in CSV.
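The iteration steps listed above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: `Term`, `decode_term`, and `iter_solutions` are hypothetical names, "casting" is reduced to attaching a datatype IRI, and escape sequences inside literals are not handled.

```python
import re
from typing import Dict, Iterator, NamedTuple, Optional

XSD = "http://www.w3.org/2001/XMLSchema#"

class Term(NamedTuple):
    kind: str                      # "iri", "bnode", or "literal"
    value: str
    datatype: Optional[str] = None
    lang: Optional[str] = None

# "lexical form", optionally followed by @lang or ^^<datatype IRI>
LITERAL = re.compile(
    r'^"(?P<lex>.*)"(?:@(?P<lang>[A-Za-z0-9-]+)|\^\^<(?P<dt>[^>]*)>)?$', re.S
)

def decode_term(token: str) -> Optional[Term]:
    if token == "":
        return None                                     # unbound variable
    if token.startswith("<") and token.endswith(">"):
        return Term("iri", token[1:-1])                 # remove angle brackets
    if token.startswith("_:"):
        return Term("bnode", token[2:])                 # keep as a blank node
    match = LITERAL.match(token)
    if match:
        lang = match["lang"]
        datatype = match["dt"] or (None if lang else XSD + "string")
        return Term("literal", match["lex"], datatype, lang)
    if re.fullmatch(r"[+-]?\d+", token):
        return Term("literal", token, XSD + "integer")  # bare integer shorthand
    return Term("literal", token)                       # anything else: left untyped in this sketch

def iter_solutions(tsv_text: str) -> Iterator[Dict[str, Optional[Term]]]:
    lines = tsv_text.splitlines()
    variables = [name.lstrip("?") for name in lines[0].split("\t")]
    for line in lines[1:]:                              # row basis: one solution per line
        yield dict(zip(variables, map(decode_term, line.split("\t"))))
```

With this shape, a reference such as `person` would resolve to a typed term instead of a raw string, which is where a Note would have to pin down the exact behaviour.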

@DylanVanAssche added the documentation, help wanted, and question labels Mar 12, 2024
@chrdebru
Contributor Author

chrdebru commented Mar 12, 2024

I see. I believe we need test cases, as only CSV is covered and CSV boils down to iterating CSV documents. The other formats have quirks. I disagree with the use of blank node (BN) identifiers. One query can generate _:b1 for a BN, and a query against another dataset can generate _:b1 as well. Reusing these identifiers (which refer to different things when they reside in different graphs) would lead to problems. It would also become engine-dependent (rdflib vs. Apache Jena vs. RDF4J, ...).
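To make the collision concrete, here is a minimal Python sketch of one possible way around it: scoping each blank node label to the result set it came from. The class `BlankNodeScope` and the source identifiers `"endpoint-A"` and `"endpoint-B"` are hypothetical; this is not how any particular engine behaves.

```python
import itertools

class BlankNodeScope:
    """Maps result-set-local blank node labels to labels unique across sources."""

    def __init__(self):
        self._counter = itertools.count()
        self._labels = {}

    def relabel(self, source_id: str, label: str) -> str:
        # The same label from two different sources must not be merged,
        # so the lookup key combines the source with the local label.
        key = (source_id, label)
        if key not in self._labels:
            self._labels[key] = f"b{next(self._counter)}"
        return "_:" + self._labels[key]

scope = BlankNodeScope()
a = scope.relabel("endpoint-A", "b1")
b = scope.relabel("endpoint-B", "b1")
print(a, b)  # two distinct labels, although both inputs were "b1"
```

Within one result set the mapping stays stable, so repeated occurrences of the same label still denote the same node.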

SPARQL stipulates that you should at least support CSV and XML (among others); in other words, we could technically limit it to two: one with data type information and CSV for easier processing.

@andimou

andimou commented Mar 17, 2024

formats:SPARQL_Results_TSV defines the format, not how to iterate upon them

A reference formulation specifies which grammar one can use to access the data of a logical source, not the format. Does rml:SPARQL_RESULT_TSV aim to indicate that the results should be in SPARQL TSV results format or that the data need to be accessed as a SPARQL result of TSV format (whatever that means?)?

How should we iterate over those?

@chrdebru do you want us to include a description of a reference formulation that indicates the iteration pattern to be per row?

We cannot treat them as regular TSV.

Why not?

I have no idea what to do with blank node identifiers

@chrdebru could you please clarify this?

Is it possible the group thought that the TSV output would be the same as CSV output, but with tabs?

Isn't the delimiter possible to be specified as a CSVW description of the result?

Same question for JSON and XML representations of SPARQL queries: do they have bespoke iterations (i.e., not the same iterations as for "regular" JSON or XML files), or would iterating over them require a second iterator?

@chrdebru I do not understand this, what would the first iterator be?

@chrdebru
Contributor Author

I'm using the following data and query as an example:

<http://example.org/0> <http://xmlns.com/foaf/0.1/age> 33 .
<http://example.org/0> <http://xmlns.com/foaf/0.1/name> "Monica Geller" .
<http://example.org/1> <http://xmlns.com/foaf/0.1/age> 34 .
<http://example.org/1> <http://xmlns.com/foaf/0.1/name> "Rachel Green" .

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?person (STR(?person) AS ?person2) ?name ?age WHERE {
        ?person foaf:name ?name .
        ?person foaf:age ?age .
    } 

CSV:

  • variables have no question marks
  • the distinction between IRIs that denote resources and IRIs that appear as literals is lost

person,person2,name,age
http://example.org/1,http://example.org/1,Rachel Green,34
http://example.org/0,http://example.org/0,Monica Geller,33
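
The loss of information can be checked with a few lines of Python. This is a sketch that uses only the standard `csv` module on the result above; the variable names are chosen for illustration.

```python
import csv
import io

sparql_csv = (
    "person,person2,name,age\n"
    "http://example.org/1,http://example.org/1,Rachel Green,34\n"
    "http://example.org/0,http://example.org/0,Monica Geller,33\n"
)

rows = list(csv.DictReader(io.StringIO(sparql_csv)))
first = rows[0]

# ?person was an IRI and ?person2 a string literal in the query result,
# but both arrive as the same plain string.
print(first["person"] == first["person2"])
# The integer datatype of ?age is gone as well: the value is just text.
print(type(first["age"]))
```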

TSV:

  • variable names are written with question marks (as in SPARQL)
  • resources are also wrapped in angle brackets

?person	?person2	?name	?age
<http://example.org/1>	"http://example.org/1"	"Rachel Green"	34
<http://example.org/0>	"http://example.org/0"	"Monica Geller"	33

JSON and XML:

  • variable names have no question marks
  • we iterate over solutions

<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="person"/>
    <variable name="person2"/>
    <variable name="name"/>
    <variable name="age"/>
  </head>
  <results>
    <result>
      <binding name="person">
        <uri>http://example.org/1</uri>
      </binding>
      <binding name="person2">
        <literal>http://example.org/1</literal>
      </binding>
      <binding name="name">
        <literal>Rachel Green</literal>
      </binding>
      <binding name="age">
        <literal datatype="http://www.w3.org/2001/XMLSchema#integer">34</literal>
      </binding>
    </result>
    <result>
      <binding name="person">
        <uri>http://example.org/0</uri>
      </binding>
      <binding name="person2">
        <literal>http://example.org/0</literal>
      </binding>
      <binding name="name">
        <literal>Monica Geller</literal>
      </binding>
      <binding name="age">
        <literal datatype="http://www.w3.org/2001/XMLSchema#integer">33</literal>
      </binding>
    </result>
  </results>
</sparql>

{ "head": {
    "vars": [ "person" , "person2" , "name" , "age" ]
  } ,
  "results": {
    "bindings": [
      { 
        "person": { "type": "uri" , "value": "http://example.org/1" } ,
        "person2": { "type": "literal" , "value": "http://example.org/1" } ,
        "name": { "type": "literal" , "value": "Rachel Green" } ,
        "age": { "type": "literal" , "datatype": "http://www.w3.org/2001/XMLSchema#integer" , "value": "34" }
      } ,
      { 
        "person": { "type": "uri" , "value": "http://example.org/0" } ,
        "person2": { "type": "literal" , "value": "http://example.org/0" } ,
        "name": { "type": "literal" , "value": "Monica Geller" } ,
        "age": { "type": "literal" , "datatype": "http://www.w3.org/2001/XMLSchema#integer" , "value": "33" }
      }
    ]
  }
}

@chrdebru
Contributor Author

formats:SPARQL_Results_TSV defines the format, not how to iterate upon them

A reference formulation specifies which grammar one can use to access the data of a logical source, not the format. Does rml:SPARQL_RESULT_TSV aim to indicate that the results should be in SPARQL TSV results format or that the data need to be accessed as a SPARQL result of TSV format (whatever that means?)?

So we are iterating over solutions then, right? What is then the point of having those formats if we know that all SPARQL implementations must support XML and CSV (at least)? So would rml:referenceFormulation rml:SPARQL_RESULT_SET not be sufficient?

The following are details that are not relevant anymore if the above answer is "yes."

We cannot treat them as regular TSV.

Why not?

The TSV representation of SPARQL results prescribes how terms are encoded (e.g., the angle brackets). The variable names also have question marks. Should references use name or ?name? The former is used in XML, JSON, and CSV, the latter in TSV. I would find it weird to have to rewrite all references when switching from TSV to CSV.
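
One conceivable way to keep references stable across formats is to normalise the header before resolving references. The helper `normalize_variable` below is a hypothetical name used only for this sketch in Python.

```python
def normalize_variable(name: str) -> str:
    """Return the bare variable name, dropping a leading '?' if present."""
    return name[1:] if name.startswith("?") else name

csv_header = ["person", "person2", "name", "age"]
tsv_header = ["?person", "?person2", "?name", "?age"]

# After normalisation, the same reference ("name") works for both headers.
print([normalize_variable(v) for v in tsv_header] == csv_header)
```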

When you retrieve ?person, do you want to retrieve <http://example.org/1> as a value, or do you want to process the TSV file as a SPARQL Resultset serialization and thus remove the < and > from <http://example.org/1> before returning the value?

I have no idea what to do with blank node identifiers

@chrdebru could you please clarify this?

With CSV, we cannot distinguish between blank node identifiers and literals (same as with IRIs).

person,person2,name,age
b0,b0,Foo Bar,22

Is it possible the group thought that the TSV output would be the same as CSV output, but with tabs?

Isn't the delimiter possible to be specified as a CSVW description of the result?

No. I'm talking about TSV of SPARQL result sets, which must use tabs.

Same question for JSON and XML representations of SPARQL queries: do they have bespoke iterations (i.e., not the same iterations as for "regular" JSON or XML files), or would iterating over them require a second iterator?

@chrdebru I do not understand this, what would the first iterator be?

The iterator for SPARQL queries is the SPARQL query, so one iterates over the result set. The problem is that I believe the community thought that CSV result sets can be processed as regular CSV files. This is true, but there are unfortunate corner cases. However, we iterate over solutions in a result set (which are dictionaries), not over a CSV file. TSV (in a more constrained way), JSON, and XML carry much more information (explicit datatypes, resource types, etc.). TSV also uses a different variable naming convention.

For CSV and TSV, each line corresponds to one iteration. The returned JSON and XML documents, however, need a different iterator: e.g., $.results.bindings[*], followed by person.value to obtain the value. But we cannot provide two such iterators.
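
As an illustration of the JSON case, the following Python sketch walks the bindings array and then reads the value member of a binding. It uses a shortened copy of the JSON result shown earlier; the function name `iter_json_solutions` is an assumption made for this example.

```python
import json

# Shortened version of the JSON result shown earlier in this thread.
sparql_json = """
{
  "head": { "vars": ["person", "name", "age"] },
  "results": {
    "bindings": [
      {
        "person": { "type": "uri", "value": "http://example.org/1" },
        "name": { "type": "literal", "value": "Rachel Green" },
        "age": { "type": "literal",
                 "datatype": "http://www.w3.org/2001/XMLSchema#integer",
                 "value": "34" }
      }
    ]
  }
}
"""

def iter_json_solutions(document: dict):
    """Yield one dict per solution, i.e. per element of results.bindings."""
    variables = document["head"]["vars"]
    for binding in document["results"]["bindings"]:
        # Unbound variables are simply absent from a binding object.
        yield {var: binding[var] for var in variables if var in binding}

for solution in iter_json_solutions(json.loads(sparql_json)):
    # A reference such as "person" then resolves to the term's "value" member.
    print(solution["person"]["type"], solution["person"]["value"])
```

The two steps (select a solution, then select a term's value) are what a single generic JSONPath iterator plus a plain reference cannot express without knowing the SPARQL results structure.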

As such, I am questioning the added value of rml:SPARQL_RESULT_XXX. We know that SPARQL implementations should at least return CSV and XML. If we use XML, we have everything we need to iterate over the solutions.
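
A minimal sketch of such an XML-based iteration, using Python's standard `xml.etree.ElementTree` on a shortened copy of the XML result above. The function name `iter_xml_solutions` and the tuple layout (kind, value, datatype) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"sr": "http://www.w3.org/2005/sparql-results#"}

# Shortened version of the XML result shown earlier in this thread.
sparql_xml = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="person"/>
    <variable name="age"/>
  </head>
  <results>
    <result>
      <binding name="person"><uri>http://example.org/1</uri></binding>
      <binding name="age">
        <literal datatype="http://www.w3.org/2001/XMLSchema#integer">34</literal>
      </binding>
    </result>
  </results>
</sparql>"""

def iter_xml_solutions(xml_text: str):
    """Yield one dict per <result>, mapping variable name to (kind, value, datatype)."""
    root = ET.fromstring(xml_text)
    for result in root.findall("sr:results/sr:result", NS):
        solution = {}
        for binding in result.findall("sr:binding", NS):
            term = binding[0]                      # <uri>, <literal>, or <bnode>
            kind = term.tag.split("}", 1)[1]       # strip the namespace from the tag
            solution[binding.get("name")] = (kind, term.text, term.get("datatype"))
        yield solution

for solution in iter_xml_solutions(sparql_xml):
    print(solution)
```

Term kind and datatype are available per binding, so no format-specific decoding of the value string is needed.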

The test cases for SPARQL queries are a bit naïve as they only look at CSV without corner cases (e.g., there are no IRIs in the result set).

3 participants