Validation in linked multi-graph #152

svenschneider · 2022-08-10T10:30:14Z

Hi everyone,

I am trying to validate a data graph that is composed of multiple named graphs with links between the named graphs. However, pySHACL does not seem to be able to follow the links. The following two files exemplify this problem.

First, the Turtle shape graph (shape.ttl). It defines an ex:Foo node shape which expects the values of the ex:ref path to be instances of class ex:Bar.

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .

ex:Foo
    a sh:NodeShape, rdfs:Class ;
    sh:property [
        sh:path ex:ref ;
        sh:class ex:Bar
    ] .

Next, the TriG data graph (data.trig). It defines two graphs (ex:g1 and ex:g2) where the second graph contains an ex:Foo with a link to an ex:Bar (defined in the first graph).

@prefix ex: <http://example.org/> .

ex:g1 { ex:bar a ex:Bar }
ex:g2 { ex:foo a ex:Foo ;
               ex:ref ex:bar }

When I now run pyshacl on those files (pyshacl -s shape.ttl data.trig) I get the following output:

Validation Report
Conforms: False
Results (1):
Constraint Violation in ClassConstraintComponent (http://www.w3.org/ns/shacl#ClassConstraintComponent):
Severity: sh:Violation
Source Shape: [ sh:class ex:Bar ; sh:path ex:ref ]
Focus Node: ex:foo
Value Node: ex:bar
Result Path: ex:ref
Message: Value does not have class ex:Bar

I would expect this graph to validate without violations. This is for example the case on the SHACL playground.

I think the reason for the validation error is the following line:

pySHACL/pyshacl/validate.py

Line 259 in a9f5192

for g in named_graphs:

Hence, I wonder (i) whether it would be possible to iterate over the whole target graph; and (ii) what the rationale behind iterating over the named graphs individually is.

Thanks for developing the library!

Best regards
Sven


ashleysommer · 2022-08-10T11:32:30Z

Hi @svenschneider
Thanks for opening discussion around this.
(Some nomenclature for the purposes of this discussion: an RDFLib "Graph" is a structure that represents a single flat graph, and an RDFLib "Dataset" is a structure that represents a Graph containing one or more named graphs).

You are correct, the reason you are seeing your error is due to the line you linked, that is, PySHACL iterates across each named Graph in the Dataset and validates each separately.
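To make the failure mode concrete, here is a minimal, library-free sketch of the difference between checking each named graph in isolation and checking the union of all graphs. It does not use PySHACL or RDFLib: the dict-of-sets model and the helper `class_violations` are assumptions made purely for illustration, and they only mimic the sh:class check from the example shape.

```python
# Toy model (an assumption for illustration, not PySHACL's code):
# a dataset is a dict mapping graph names to sets of (s, p, o) triples.
dataset = {
    "ex:g1": {("ex:bar", "rdf:type", "ex:Bar")},
    "ex:g2": {
        ("ex:foo", "rdf:type", "ex:Foo"),
        ("ex:foo", "ex:ref", "ex:bar"),
    },
}


def class_violations(triples):
    """Return ex:ref values of ex:Foo nodes that are not typed ex:Bar in `triples`."""
    foos = {s for (s, p, o) in triples if p == "rdf:type" and o == "ex:Foo"}
    return sorted(
        o
        for (s, p, o) in triples
        if s in foos and p == "ex:ref" and (o, "rdf:type", "ex:Bar") not in triples
    )


# Per-graph checking: each named graph is checked with no view of the others.
per_graph = {name: class_violations(triples) for name, triples in dataset.items()}

# Union checking: all triples are merged before the check runs.
union = class_violations(set().union(*dataset.values()))

print(per_graph)
print(union)
```

In this toy model the per-graph check reports ex:bar as a violation inside ex:g2, because the type statement lives in ex:g1, while the union check reports nothing.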

what the rationale behind iterating over the named graphs individually is?

It is essentially for historical and backward-compatibility reasons. Early versions of PySHACL did not support Datasets that contained named graphs. They could operate on flat Graphs only (i.e., only N3, NT, Turtle, and RDF/XML files).
When we added support for JSON-LD files, we found they usually contain a named graph, so we modified PySHACL to work on Datasets that contain one or more named graphs, and that included allowing TRIG files too.

The easiest and least-disruptive way to enable this functionality in PySHACL at the time was to simply iterate over all of the named Graphs in the Dataset and validate them individually. This behavior suited the example datasets we were using because they did not have links between the graphs as your example does.

One hurdle to a better implementation is how RDFLib handles named Graphs in Datasets. When performing a lookup on a Dataset, you usually need to specify the identifier of the graph you're querying. So in your example, when validating node ex:foo and checking whether the ex:ref value ex:bar is an instance of class ex:Bar, PySHACL would need to somehow know to query ex:bar rdf:type ?c from within the ex:g1 graph, but it cannot do that because there is nothing telling the validator which named graph to query. So keeping validation isolated to each named graph and assuming no links between them was the easiest way to implement the feature.

There is a feature in RDFLib called "default_union" that can be enabled on Dataset objects that does allow the application to execute queries across all named graphs in a dataset at once. If PySHACL used that feature, then your example would work as you expect. This is not used by PySHACL because it is a nonstandard operating mode for RDFLib. It is not enabled by default in RDFLib, and if PySHACL enabled that feature, it could cause unexpected behaviour for users who expect RDFLib dataset queries to operate in the "normal" manner. Additionally, I believe last time I experimented with enabling that feature, it caused some W3C SHACL Test suite tests to no longer pass, but I don't recall the specifics.

possible to validate over the whole target graph

I have been thinking of implementing an optional operating mode for PySHACL, something like "union-graph mode", that would force "default_union" enabled on the target Dataset, and would run the validator once on the whole Dataset rather than running the validator individually over each named graph. If users find this mode to be useful and convenient, then it may become the default operating mode for an eventual "backward-incompatible" v1.0 release.

svenschneider · 2022-08-10T13:09:45Z

Hi @ashleysommer ,

thanks for that elaborate and quick reply! I can now see why it's implemented as is. Additionally, I have been able to reproduce the problem you mention with respect to the queries in an RDFLib Dataset.

At the moment, as a workaround, I flatten the whole data graph before passing it to PySHACL. For now that works for my setup, but I don't know how well that approach generalizes or which problems it could introduce.

As for RDFLib's "default_union" feature, instead of "forcing" that on the input Dataset object you could perhaps check if it is enabled and only then execute the queries on the whole Dataset? The consequences could be that (i) this results in unexpected SHACL behaviour for users of the library; and (ii) there is (yet) another special case to be handled in the code. Thus, presumably your suggestion with the explicit processing mode seems like a nicer solution.

One more observation: upon re-reading the SHACL standard it seems that validating an RDF dataset is out of scope for SHACL and will hence remain implementation-specific. In particular, here it says

A data graph is one of the inputs to the SHACL processor for validation. SHACL processors treat it as a general RDF graph and makes no assumption about its nature. For example, it can be an in-memory graph or a named graph from an RDF dataset or a SPARQL endpoint.

At the very least it remains vague on what should happen when you provide an RDF dataset (instead of an RDF graph) to a SHACL processor.

ashleysommer · 2022-08-10T22:53:22Z

As for RDFLib's "default_union" feature, instead of "forcing" that on the input Dataset object you could perhaps check if it is enabled and only then execute the queries on the whole Dataset? The consequences could be that (i) this results in unexpected SHACL behaviour for users of the library; and (ii) there is (yet) another special case to be handled in the code. Thus, presumably your suggestion with the explicit processing mode seems like a nicer solution.

One more observation: upon re-reading the SHACL standard it seems that validating an RDF dataset is out of scope for SHACL and will hence remain implementation-specific.

Yep, those are two paragraphs I did think about including in my previous response. You're right, I have thought about checking if "default_union" is already enabled on the input Dataset at runtime, and if it is, then use the alternate validation behaviour. But as you said, that is introducing yet another alternate operating mode for PySHACL that users may accidentally trigger and introduce unexpected behaviour. Also, that particular feature would not be able to be utilised by the PySHACL CLI tool, because that constructs the graphs from parsing files, so it cannot be determined whether "default_union" should be on or off.

And you brought up another point that I should have mentioned in my previous comment, regarding the wording of the SHACL Spec. It does remain vague on how to approach this problem, and that is the reason that early versions of PySHACL intentionally only operated on single graphs.

ashleysommer mentioned this issue Mar 31, 2023

Using SPARQL constraints with the keyword graph on nquads #175

Open
