Warn user about absence of prefix declarations in SPARQL constraints #272

bretelerjmw · 2024-10-25T10:45:00Z

Hi there, thanks for the awesome tool.

I recently learned about prefix declarations for SPARQL constraints and I was surprised that in my project, pyshacl happily validated my SPARQL constraints as expected without me having made any such prefix declarations.

It seems that pyshacl automatically consumes the prefixes found in some turtle files, even if they have not been "declared the SHACL way". Based on my testing, I believe it consumes prefixes for both data graphs and shacl graphs, but not for ontology graphs.

While this behavior contributes to pyshacl being a smoothly usable tool, I worry that it could also give users the impression that they are following the standard properly when in fact they are not. I could see several ways to address this issue. In any case I think it would be helpful if the tool warned the user when it encounters a SPARQL constraint that lacks any prefix declarations.

I'm happy to provide more details/examples if that helps to look into this further.

ashleysommer · 2024-10-25T11:07:29Z

Hi @bretelerjmw

Thanks for the report.
You're right that this is not communicated, and it's weird that PySHACL seems to just work even if you don't add the prefixes.
This is something that has caught me out personally as a user of PySHACL, as well as the developer.

PySHACL doesn't do anything special to allow using the prefixes from the Turtle files; that is actually a side effect of RDFLib. PySHACL uses the Python RDFLib library for all of the file parsing and graph manipulation, as well as the RDFLib SPARQL engine for SPARQL constraints. It is a peculiar side effect that RDFLib caches prefixes from the parsed Turtle file and then reuses them when the data graph is passed to the SPARQL engine.

In regards to communicating this, emitting a warning when no prefixes are defined seems like a good idea (though it is entirely possible for a SPARQL constraint to use a query that does not require any prefixes).

A more correct, but less user-friendly solution would be to strip out all of the rdflib cached prefixes in the graph before passing the graph to the SPARQL engine. This will cause many users' SPARQL constraints to stop working, but will force the use of SPARQL constraint prefix declarations.

bretelerjmw · 2024-10-25T12:15:21Z

> (though it is entirely possible for a SPARQL constraint to use a query that does not require any prefixes)

I thought about this, but I would suspect that outside of toy examples, that makes up a rather small portion of the use cases for pyshacl, because prefixes just add so much efficiency and readability when writing SPARQL queries. Or am I missing something? Maybe this doesn't hold for automated SPARQL generation, but then, would users at that level of automation be affected by the warning?

Another compromise option could be to search for prefix usage in the SPARQL string with regex and emit the warning only if some prefix is found. But the search should not apply to comments in the SPARQL string. I'm not 100% sure what the regex should be, but probably something along the lines of (?<!#)[^\n]*:.
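To make the compromise above concrete, here is a rough sketch using only the standard library. It is a heuristic, not a SPARQL parser: it blanks out comments, IRIs in angle brackets, and simple quoted strings, then looks for prefixed names in what remains. The function names `prefixes_used` and `warn_if_undeclared` are hypothetical and not part of pyshacl. One caveat on the regex suggested above: the lookbehind in `(?<!#)[^\n]*:` only inspects the single character before the match start, so a line such as `# foo:bar` would still match, which is why this sketch removes comments first.

```python
import re
import warnings

# Spans that may contain ':' or '#' without being prefixed names.
_IRI = r"<[^<>\s]*>"
_STRING = r"\"(?:[^\"\\\n]|\\.)*\"|'(?:[^'\\\n]|\\.)*'"
_COMMENT = r"#[^\n]*"
_SKIP = re.compile(f"{_IRI}|{_STRING}|{_COMMENT}")

# A prefixed name: optional prefix label, a colon, then the start of a local name.
_PREFIXED_NAME = re.compile(r"(?<![\w:])([A-Za-z][\w-]*)?:(?=\w)")


def prefixes_used(sparql):
    """Return the set of prefix labels that appear outside comments, IRIs and strings."""
    code_only = _SKIP.sub(" ", sparql)
    return {m.group(1) or "" for m in _PREFIXED_NAME.finditer(code_only)}


def warn_if_undeclared(sparql, declared):
    """Warn about prefixes used in the query that are not in the declared set."""
    missing = prefixes_used(sparql) - set(declared)
    if missing:
        warnings.warn(
            f"SPARQL constraint uses prefixes without a declaration: {sorted(missing)}"
        )
    return missing
```

A query that uses no prefixed names yields an empty set, so it would never trigger the warning. Multi-line string literals and unusual operator spacing are not handled, so a real implementation would probably want a proper tokenizer.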

frank-fzi · 2024-10-28T12:28:06Z

@bretelerjmw thanks for starting the discussion.

> In regards to communicating this, emitting a warning when no prefixes are defined seems like a good idea (though it is entirely possible for a SPARQL constraint to use a query that does not require any prefixes).
>
> A more correct, but less user-friendly solution would be to strip out all of the rdflib cached prefixes in the graph before passing the graph to the SPARQL engine. This will cause many users' SPARQL constraints to stop working, but will force the use of SPARQL constraint prefix declarations.

Personally, I appreciate that rdflib (and thus pySHACL) is able to reuse the prefixes for the SPARQL engine, as they are bound to the graph instance anyway. Defining these prefixes redundantly for the SPARQL queries may result in inconsistent prefix declarations. Therefore, as a developer, I do not expect any warnings for missing prefix declarations in the SPARQL constraint, as long as these prefixes are provided with the graph instance, either by parsing from a file or by explicitly binding prefix declarations to the graph. If there are use cases which require an explicit prefix declaration for each SPARQL constraint, these checks should be carried out outside pySHACL, rather than polluting the logging of use cases where prefix declarations are simply reused.

bretelerjmw · 2024-10-28T12:59:18Z

@frank-fzi sounds like we agree on matters of usability. In fact, I think even the editor of the SHACL standard would agree with us there - he said the following about the matter in an earlier discussion:

> this is a common pain point for SHACL users that unfortunately was required because RDF graphs do not have a concept of prefixes (which are an aspect of the serialization only), yet SHACL is serialization-agnostic.

A key point for me is that we are dealing with a standard, so in addition to the experience of a single development team there needs to be some predictability in a wider context. The current pyshacl behavior is more lenient than the standard, without telling the user that that is the case. So a developer might think all the tests pass, call it a day, and then somebody else who is not using rdflib but is relying on the interoperability of the data would run into trouble.

How do you weigh that factor of supporting the developer in adhering to the standards?

frank-fzi · 2024-10-28T13:27:47Z

Well, the standard just tells that "shapes graph may include declarations of namespace prefixes", therefore I consider them as optional. Moreover, it's up to the SHACL-SPARQL processor to "produce lines such as PREFIX ex: <http://example.com/ns#>. The SHACL-SPARQL processor MUST produce a failure if the resulting query string cannot be parsed into a valid SPARQL 1.1 query." From my understanding, this is exactly how pySHACL currently behaves. If a valid SPARQL 1.1 query can be derived from the shapes graph, no warning or failure is raised. Also, I agree with Holger Knublauch that prefixes are an aspect of the serialization. When SPARQL constraints are serialized, these prefixes are mostly declared for the shapes graph as a whole, not separately for each query string. Otherwise, we may end up with varying namespace bindings for the same prefix within a single shapes graph which would become very difficult to maintain.

bretelerjmw · 2024-10-28T14:56:31Z

Thank you for including the references to the SHACL standard! I will try to clarify what I mean. I cannot quite tell if we have a misunderstanding or a disagreement; in the latter case I might let this be my last note on the matter.

> the standard just tells that "shapes graph may include declarations of namespace prefixes"

The referenced paragraph also describes the specific syntax required for declaring the prefixes in SHACL.

> Moreover, it's up to the SHACL-SPARQL processor to "produce lines such as PREFIX ex: <http://example.com/ns#>. The SHACL-SPARQL processor MUST produce a failure if the resulting query string cannot be parsed into a valid SPARQL 1.1 query."

This, too, is based on processing of prefix mappings per the required syntax: "A SHACL processor collects a set of prefix mappings as the union of all individual prefix mappings that are values of the SPARQL property path sh:prefixes/owl:imports*/sh:declare of the SPARQL-based constraint or validator."
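As an illustration of the processing described in that quote, here is a small standard-library sketch. It assumes the declarations have already been read from the shapes graph as plain (prefix, namespace) pairs; the function names are hypothetical, and this is not how pySHACL implements it. My reading of the same section of the standard is that a shape is ill-formed when the collected mappings give one prefix several namespaces, so the sketch raises an error in that case.

```python
# In Turtle, one declaration in the SHACL syntax looks roughly like:
#   sh:declare [ sh:prefix "ex" ;
#                sh:namespace "http://example.com/ns#"^^xsd:anyURI ] .
# Here each such node is modeled as a plain (prefix, namespace) tuple.


def collect_prefix_mappings(declarations):
    """Build the union of prefix mappings, rejecting conflicting namespaces."""
    mappings = {}
    for prefix, namespace in declarations:
        # setdefault keeps the first namespace seen for a prefix.
        if mappings.setdefault(prefix, namespace) != namespace:
            raise ValueError(
                f"prefix {prefix!r} is declared with more than one namespace"
            )
    return mappings


def prepend_prefixes(query, mappings):
    """Put one PREFIX line per mapping in front of the query string."""
    header = "\n".join(
        f"PREFIX {prefix}: <{namespace}>"
        for prefix, namespace in sorted(mappings.items())
    )
    return f"{header}\n{query}" if header else query
```

With this approach only the declared mappings reach the query, so a prefix that exists solely in a Turtle `@prefix` line would make the resulting query fail to parse.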

> From my understanding, this is exactly how pySHACL currently behaves.

pySHACL does more than only this: it also, through rdflib defaults, collects the prefix mappings coming from serializations of both the shapes graph and the data graph (note that the data graph is outside the shapes graph under any interpretation!). These serialization-derived prefix mappings do not use the syntax described above so they do not fall within the behavior described by the SHACL standard.

> Otherwise, we may end up with varying namespace bindings for the same prefix within a single shapes graph which would become very difficult to maintain.

I might not fully understand this concern. Fortunately it hasn't been a problem in our project. But if it is a problem in general, then recalling Knublauch's comment, it might just be an unavoidable pain point of SHACL. If anything, wouldn't it clarify matters if pyshacl followed the SHACL standard more strictly, instead of allowing additional prefix mappings to slip in from some serializations?
