
Invalid RDF - datatype issue #110

Open
samiscoding opened this issue Apr 4, 2023 · 7 comments


@samiscoding

Hi there,

We noticed that the engine does not check whether the datatype given in a mapping is valid for the actual data value, and so generates invalid RDF triples. For example, a triple is generated as:

:xxx ds:studyMaximumAge "74 Years"^^xsd:int.

This seems to be a bug: the engine doesn't detect that “74 Years” is not a valid xsd:int, and generates invalid RDF.
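For context, the lexical space of xsd:int only admits an optionally signed decimal integer that fits in 32 bits, so a value like `74 Years` falls outside it. A minimal sketch of such a lexical check in plain Java (a hypothetical helper for illustration, not part of CARML or RDF4J):

```java
// Minimal sketch of xsd:int lexical validation (hypothetical helper,
// not part of CARML or RDF4J).
public class XsdIntCheck {

    // An xsd:int literal must be an optionally signed decimal integer
    // that fits in 32 bits; anything else is lexically invalid.
    public static boolean isValidXsdInt(String lexical) {
        try {
            Integer.parseInt(lexical);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidXsdInt("74"));       // prints true
        System.out.println(isValidXsdInt("74 Years")); // prints false
    }
}
```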


pmaria commented Apr 4, 2023

@samiscoding thanks for reporting the issue.

Validation is possible in CARML, but the switch to enable it is currently not exposed via the CLI.
I think I'll change it so that validation is turned on by default, and provide an optional switch to turn it off for anyone who wants to forego validation.

@jetztgradnet

Hey Pano,
I'm working with @samiscoding on this topic. How would one enable this when using Carml as a library? We don't use the CLI for the actual conversion process anyway, so we could also enable this on the mapper directly if you give us some pointers.
Thanks!


pmaria commented May 8, 2023

@jetztgradnet

You can do this by building the mapper with the following builder option:

```java
import org.eclipse.rdf4j.model.impl.ValidatingValueFactory;

RdfRmlMapper.builder()
    .valueFactorySupplier(ValidatingValueFactory::new)
```

I'm still contemplating whether it should be the default, but I'll update this issue when I decide.

@jetztgradnet

Hey @pmaria,
thanks for the hint. As I understand it, ValidatingValueFactory will fail when invalid data is provided. What would that mean for the conversion? Would it abort, or simply not create that one single statement?
Aborting the whole process would be rather inconvenient when converting a large amount of data. (I have encountered that: after parsing and ingesting multiple GB of data / some billions of statements for a couple of hours, a single invalid literal caused the whole thing to abort, requiring searching for the culprit, fixing it, and restarting the whole process...)

Ideally there would be multiple options:

  • no validation (the current behavior), as the default. This will possibly create invalid RDF data.
  • validation active, failing the whole conversion process for any encountered invalid data
  • validation active, skipping invalid data. This will mean some data might be lost in the RDF representation.
  • validation active, converting invalid data to string literal to at least preserve the original data in RDF, just not with the intended datatype. This can then e.g. be caught using SHACL validation later.

For any failed validation, an indication should appear in the logs of why and where it failed (with the failing literal value, the expected datatype, and if possible also an input file and line number/position hint) so that unclean data can be identified and fixed.
For some use cases it should also be possible to suppress this logging when it is not required, as it might lead to a lot of log spam.
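To make the proposal concrete, the options above could be modeled as a handling strategy, roughly along these lines (a sketch with invented names, not CARML's actual API):

```java
// Hypothetical sketch of the handling strategies proposed above
// (invented names; not CARML's actual API).
public class InvalidDatatypePolicy {

    enum Strategy { NO_VALIDATION, FAIL, SKIP, FALLBACK_TO_STRING }

    static final String XSD_INT = "http://www.w3.org/2001/XMLSchema#int";
    static final String XSD_STRING = "http://www.w3.org/2001/XMLSchema#string";

    // Renders "lexical"^^<datatype>; returns null when the value is skipped.
    // Whether the lexical form is valid is decided by the caller here, to
    // keep the sketch independent of any concrete validator.
    static String handle(String lexical, String datatype, boolean valid, Strategy strategy) {
        if (valid || strategy == Strategy.NO_VALIDATION) {
            return "\"" + lexical + "\"^^<" + datatype + ">";
        }
        switch (strategy) {
            case FAIL:
                throw new IllegalArgumentException(
                        "Invalid lexical form '" + lexical + "' for datatype " + datatype);
            case SKIP:
                return null; // drop the statement, losing the value
            case FALLBACK_TO_STRING:
            default:
                return "\"" + lexical + "\"^^<" + XSD_STRING + ">";
        }
    }

    public static void main(String[] args) {
        System.out.println(handle("74 Years", XSD_INT, false, Strategy.FALLBACK_TO_STRING));
    }
}
```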

What do you think?


pmaria commented May 10, 2023

Hi @jetztgradnet

> Hey @pmaria, thanks for the hint. As I understand it, ValidatingValueFactory will fail when invalid data is provided. What would that mean for the conversion? Would it abort, or simply not create that one single statement? Aborting the whole process would be rather inconvenient when converting a large amount of data. (I have encountered that: after parsing and ingesting multiple GB of data / some billions of statements for a couple of hours, a single invalid literal caused the whole thing to abort, requiring searching for the culprit, fixing it, and restarting the whole process...)

Right now it would abort. I can see how that can be problematic.

> Ideally there would be multiple options:
>
>   • no validation (the current behavior), as the default. This will possibly create invalid RDF data.
>   • validation active, failing the whole conversion process for any encountered invalid data
>   • validation active, skipping invalid data. This will mean some data might be lost in the RDF representation.

Agreed with above.

>   • validation active, converting invalid data to a string literal to at least preserve the original data in RDF, just not with the intended datatype. This can then e.g. be caught using SHACL validation later.

For this one there is also the option of doing some value coercion based on the XSD datatype IRI, like in R2RML, which I'm currently looking at.
It could also be made possible to register custom coercers for other datatypes. Would that be something you would find useful?
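To illustrate what coercion to a canonical lexical form means here, a sketch of R2RML-style canonicalization for a couple of XSD datatypes (illustration only, not CARML's implementation):

```java
// Sketch of R2RML-style canonical lexical form mapping for a couple of
// XSD datatypes (illustration only, not CARML's implementation; real
// xsd:integer is unbounded, a 32-bit parse is used here for brevity).
public class CanonicalForm {

    // Produce the canonical lexical form for a handful of datatypes;
    // throws if the value is not in the datatype's lexical space.
    static String canonicalize(String lexical, String xsdLocalName) {
        switch (xsdLocalName) {
            case "integer":
            case "int":
                // "042" and "+42" both canonicalize to "42"
                return String.valueOf(Integer.parseInt(lexical));
            case "boolean":
                // "1"/"0" canonicalize to "true"/"false"
                if (lexical.equals("1") || lexical.equals("true")) return "true";
                if (lexical.equals("0") || lexical.equals("false")) return "false";
                throw new IllegalArgumentException("not a valid xsd:boolean: " + lexical);
            default:
                return lexical;
        }
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("042", "int")); // prints 42
        System.out.println(canonicalize("1", "boolean")); // prints true
    }
}
```

Note that coercion cleans up values that are *in* the lexical space but not in canonical form; it cannot rescue a value like `74 Years` that is outside the lexical space entirely.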

> For any failed validation, an indication should appear in the logs of why and where it failed (with the failing literal value, the expected datatype, and if possible also an input file and line number/position hint) so that unclean data can be identified and fixed. For some use cases it should also be possible to suppress this logging when it is not required, as it might lead to a lot of log spam.

Agreed.

I'll open some issues to track these features. Let me know if you have any comments.


jetztgradnet commented May 10, 2023

>   • validation active, converting invalid data to a string literal to at least preserve the original data in RDF, just not with the intended datatype. This can then e.g. be caught using SHACL validation later.

> For this one there is also the option of doing some value coercion based on the XSD datatype IRI, like in R2RML, which I'm currently looking at. It could also be made possible to register custom coercers for other datatypes. Would that be something you would find useful?

Not sure what you mean by value coercion in this context?

To give an example of what I mean with converting invalid data to string literal:
let's say we have some JSON data with a date field. In most cases it holds a properly formatted timestamp which basically matches the format expected by xsd:dateTime, so looking at some sample data we naively create an RML rule that produces an RDF literal with datatype xsd:dateTime.
Now assume there are some date values which do not conform to that expected date format, but are e.g. May 1st 1970. When not using validation in the RML mapping, this would produce the literal "May 1st 1970"^^xsd:dateTime, which is invalid and may fail at any point later in the processing pipeline, e.g. when ingesting it into a database.

The last variant I sketched above would be to simply change the datatype on the fly to xsd:string when validation fails. That way the value is preserved and the RDF is valid, even though it might not comply with the expected datatype for the predicate in the statement.

The same could be considered when expecting an integer value, e.g. for an age property, but sometimes encountering a value of 12 years instead of just 12, which again makes an invalid literal when represented as "12 years"^^xsd:int.

What would coercion do in these cases?

From an implementation point of view, all of this could be handled by a ValueFactory, so I can easily create something like this for my application, but it might still be interesting for other users of carml.
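A skeleton of that kind of delegating factory might look as follows; a simplified interface stands in for RDF4J's ValueFactory here, and all names are invented for illustration:

```java
// Skeleton of a delegating "lenient" literal factory, as suggested above.
// A simplified interface stands in for RDF4J's ValueFactory; all names
// are invented for illustration only.
public class LenientLiteralFactory {

    interface LiteralFactory {
        String createLiteral(String lexical, String datatype);
    }

    static final String XSD_INT = "http://www.w3.org/2001/XMLSchema#int";
    static final String XSD_STRING = "http://www.w3.org/2001/XMLSchema#string";

    // A strict factory: rejects lexical forms that are invalid for xsd:int.
    static final LiteralFactory STRICT = (lexical, datatype) -> {
        if (XSD_INT.equals(datatype) && !lexical.matches("[+-]?\\d+")) {
            throw new IllegalArgumentException(
                    "'" + lexical + "' is not a valid " + datatype);
        }
        return "\"" + lexical + "\"^^<" + datatype + ">";
    };

    // Wraps a validating factory; on failure, retries with xsd:string so
    // the original value is preserved as valid RDF.
    static LiteralFactory fallbackToString(LiteralFactory validating) {
        return (lexical, datatype) -> {
            try {
                return validating.createLiteral(lexical, datatype);
            } catch (IllegalArgumentException e) {
                // This is the place to log the failing value and datatype.
                return validating.createLiteral(lexical, XSD_STRING);
            }
        };
    }

    public static void main(String[] args) {
        LiteralFactory lenient = fallbackToString(STRICT);
        System.out.println(lenient.createLiteral("12", XSD_INT));
        System.out.println(lenient.createLiteral("12 years", XSD_INT));
    }
}
```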

The only tricky part might be getting the location (file/stream, line number/position) for proper reporting, as this is not exposed to the ValueFactory. This might need to be handled outside.


pmaria commented May 10, 2023

> Not sure what you mean by value coercion in this context?

I meant something like R2RML's canonical RDF lexical form.

> To give an example of what I mean with converting invalid data to a string literal: let's say we have some JSON data with a date field. In most cases it holds a properly formatted timestamp which basically matches the format expected by xsd:dateTime, so looking at some sample data we naively create an RML rule that produces an RDF literal with datatype xsd:dateTime. Now assume there are some date values which do not conform to that expected date format, but are e.g. May 1st 1970. When not using validation in the RML mapping, this would produce the literal "May 1st 1970"^^xsd:dateTime, which is invalid and may fail at any point later in the processing pipeline, e.g. when ingesting it into a database.

> The last variant I sketched above would be to simply change the datatype on the fly to xsd:string when validation fails. That way the value is preserved and the RDF is valid, even though it might not comply with the expected datatype for the predicate in the statement.

Ah right, coercion would not make sense here.

> The same could be considered when expecting an integer value, e.g. for an age property, but sometimes encountering a value of 12 years instead of just 12, which again makes an invalid literal when represented as "12 years"^^xsd:int.
>
> What would coercion do in these cases?

Same as above, this is indeed not something you could solve with coercion.

> From an implementation point of view, all of this could be handled by a ValueFactory, so I can easily create something like this for my application, but it might still be interesting for other users of carml.

Right, so basically this would mean being able to define a strategy on how to handle values that are invalid for a certain datatype.

> The only tricky part might be getting the location (file/stream, line number/position) for proper reporting, as this is not exposed to the ValueFactory. This might need to be handled outside.

Hmm, yeah, this is not something that can be done reliably; it would depend on the source and possibly the parser. It might be possible to incorporate some context on a best-effort basis, if a source/parser supports this.
