SPDX and Datatypes #464

Closed
davaya opened this issue Aug 1, 2023 · 34 comments
Labels
model Something about the abstract model RDF/OWL/SHACL

@davaya
Contributor

davaya commented Aug 1, 2023

Most of the 8/1 tech meeting discussed the use of "Blank Nodes" (generally said in the literature to be harmful) and of Datatypes to define Element content, specifically within the context of the SPDX model. It has been generally accepted that Element is the only root class defined by SPDX, but that view was challenged by the assertion that Datatypes (e.g., CreationInfo, ExternalReference, Hash, PositiveIntegerRange) should also be modeled as RDF objects with their own globally unique identifiers.

Use Case:
The Snippet element has a lineRange property of type PositiveIntegerRange. PositiveIntegerRange is a Datatype with two properties, begin and end.

Alternatives:
If Snippet had begin and end properties there would be no global ID assigned to those properties; the identity of a Datatype depends only on its value, not the value of a global ID. The PositiveIntegerRange Datatype should have the identical semantics - a Snippet element with a PositiveIntegerRange property should not result in a global ID being assigned to instances of that property, with the attendant assignment complexity and serialization bulk - a global ID would be significantly larger than the property value itself.

The proposed alternatives are:

  1. do not assign global IDs to Datatype instances
  2. explicitly redesign Datatype instances to contain global ID properties
  3. implicitly assign global ID properties to Datatype instances (via Skolemization)

For alternative 1:
RDF Schema says

rdfs:Literal is an instance of rdfs:Class. rdfs:Literal is a subclass of rdfs:Resource.

and

Each instance of rdfs:Datatype is a subclass of rdfs:Literal.

Because Datatype is a subclass of Literal, and Literal instances don't have global IDs, Datatype instances also don't have global IDs.
(Datatype definitions do have recognized Datatype IRIs similar to all other class definitions in the SPDX model). Redefining "Datatype" to mean data instances that are identified by something other than their values, as proposed in alternatives 2 and 3, is not a valid option. Whether IDs are created explicitly or implicitly, something identified by ID is not a Datatype.
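
The difference between identity by value and identity by ID argued for above can be illustrated with a minimal Python sketch. This is not SPDX tooling: the class layout, the `spdx_id` field name, and the `https://example.org/...` IRIs are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PositiveIntegerRange:
    """Value type: identity is the (begin, end) pair and nothing else."""
    begin: int
    end: int

    def __post_init__(self):
        if self.begin < 1 or self.end < self.begin:
            raise ValueError("expected 1 <= begin <= end")


@dataclass(eq=False)
class Snippet:
    """Element-like type: identity is the globally unique ID."""
    spdx_id: str                      # hypothetical IRI, illustration only
    line_range: PositiveIntegerRange

    def __eq__(self, other):
        return isinstance(other, Snippet) and self.spdx_id == other.spdx_id

    def __hash__(self):
        return hash(self.spdx_id)


a = Snippet("https://example.org/snippet/1", PositiveIntegerRange(1, 10))
b = Snippet("https://example.org/snippet/2", PositiveIntegerRange(1, 10))

same_range = a.line_range == b.line_range   # compared by value
same_snippet = a == b                       # compared by ID
```

Here the two ranges compare equal because they hold the same values, while the two snippets do not because their IDs differ.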

Defining lexical values for a Datatype's RDF serialization is done in XSD. The L2V mapping for other serializations of the RDF model would be done as appropriate for the serialization format (e.g., JSON Schema for JSON-LD-serialized Datatypes). That is a serialization issue, not a restriction imposed by the RDF model.

Grokking the difference between logical values and lexical forms is the key to understanding the difference between the model and serialized data.

@aamedina

aamedina commented Aug 1, 2023

RDF literals represent concrete values like strings, numbers, dates, etc. The important thing to remember is that literals in RDF represent concrete values and not resources.

In RDF, datatypes are utilized to specify the type of data that is allowed within a literal. They provide an opportunity to structure information as a literal instead of representing it as an RDF resource with a separate URI. For example, the xsd:dateTime datatype allows you to represent a point in time with a specific format string literal.

It appears there may be some confusion between RDF resources (which have URIs as global IDs) and literals (which do not). Typically, datatype instances are not assigned global IDs because they are not resources, but rather values. In RDF, datatypes provide interpretation for the lexical form of a literal, not changing the essence of a literal.

Blank nodes in RDF are utilized to denote resources that lack a globally identifiable identifier. Skolemization, on the other hand, is a process by which we replace these blank nodes with globally unique identifiers. This process is not applicable to literals or datatypes, as it's not about turning literals into resources but about turning locally identifiable resources (blank nodes) into globally identifiable ones.
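
As a rough illustration of Skolemization as described above, here is a minimal Python sketch over a toy triple encoding. The encoding is an assumption: blank nodes are strings starting with `_:`, IRIs are other strings, and literals are non-string Python values. The `https://example.org` authority is hypothetical; the `/.well-known/genid/` path follows the convention suggested in RDF 1.1 Concepts.

```python
import uuid


def is_blank(term):
    """In this toy encoding, blank nodes are strings such as '_:r1'."""
    return isinstance(term, str) and term.startswith("_:")


def skolemize(triples, authority="https://example.org"):
    """Replace each blank node label with a freshly minted Skolem IRI."""
    mapping = {}

    def swap(term):
        if not is_blank(term):
            return term               # IRIs and literals pass through unchanged
        if term not in mapping:
            mapping[term] = f"{authority}/.well-known/genid/{uuid.uuid4()}"
        return mapping[term]

    return [(swap(s), p, swap(o)) for s, p, o in triples]


triples = [
    ("https://example.org/snippet/1", "lineRange", "_:r1"),
    ("_:r1", "begin", 1),
    ("_:r1", "end", 10),
]
skolemized = skolemize(triples)
```

The literal values `1` and `10` are left alone; only the blank node standing for the range is given a global identifier.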

You mentioned that 'Each instance of rdfs:Datatype is a subclass of rdfs:Literal' and seem to interpret this as meaning that instances of a datatype are a kind of literal. However, rdfs:Datatype is not intended to be instantiated as individual literals. Rather, it is used to define classes of literals that share a common lexical space and a datatype IRI.

RDF resources represent entities in the world, and RDF literals represent concrete values. This distinction is foundational to the RDF data model.

@goneall
Member

goneall commented Aug 1, 2023

Thanks @davaya for opening the issue and thanks @aamedina for the clear description of RDF.

Using the RDF description above, PositiveIntegerRange would be a resource with 2 properties - begin and end both of which are datatypes.

On one of our tech calls a few months back, we made a decision that we would want any resource which is of type Element to be referenceable across different contexts (different stores, graphs, serializations) which requires them to have a URI type ID which is globally unique and assigned when the resource is "minted". Note that this applies to all subclasses of Element (e.g. File, Package, Relationship, ...). We did not put the same requirement on resources of other SPDX resource types (e.g. PositiveIntegerRange, CreationInfo, Hash).

On some of the calls, we have referred to these non-Element resources as "data" which probably leads to the confusion. We also referred to them as "structs" to draw an analogy to the C programming language constructs - which isn't exactly accurate either. I believe in RDF, anything that has properties associated with it is a resource and resources have ID's. The ID must be unique within their context. Since the Element context is global - it must have a URI. In the serialization team, we all assumed that non-Elements can (and many times should) be blank nodes whereas @sbarnum believes they must also have a URI - hence our discussion on the tech call.

Going forward, I would recommend we reserve the term "Datatype" for describing the type of data which may be assigned to a literal and not use the term "Datatype" to describe resource types - even resource types for non-Element resource types.

To distinguish Elements from non-Elements I suggest we use just those terms - Element and non-Element resource types.

@goneall
Member

goneall commented Aug 1, 2023

Re-wording @davaya's proposal using my proposed "RDFSpeak" definitions above, the options would be:

  1. do not assign global IDs to non-Element resources
  2. explicitly redesign non-Element classes to require URI ID types - this would involve adding an ID property to our UML models
  3. implicitly assign ID's which are unique within the context and, when needed, translate to globally unique ID's (via Skolemization)

In RDF, since all resources must have some form of ID, whether assigned or anonymous, 1 can not be implemented within the RDF spec, so we are left with the last 2. My vote is for implicitly assigning ID's (this is done in most RDF implementations when you have an implied blank node).

@goneall
Member

goneall commented Aug 1, 2023

One last comment - since we want to serialize in formats that don't fully represent RDF (e.g. YAML), we have the option of not assigning ID's at all.

This looks like 1 above, but if and when we convert to an RDF graph, we are actually implementing number 3.

One of the reasons I really like 3 is that simple serialization folks tend to really dislike creating globally unique IDs for every complex type (intentionally not using the term data type).

@aamedina

aamedina commented Aug 1, 2023

Thanks @davaya for opening the issue and thanks @aamedina for the clear description of RDF.

Using the RDF description above, PositiveIntegerRange would be a resource with 2 properties - begin and end both of which are datatypes.

On one of our tech calls a few months back, we made a decision that we would want any resource which is of type Element to be referenceable across different contexts (different stores, graphs, serializations) which requires them to have a URI type ID which is globally unique and assigned when the resource is "minted". Note that this applies to all subclasses of Element (e.g. File, Package, Relationship, ...). We did not put the same requirement on resources of other SPDX resource types (e.g. PositiveIntegerRange, CreationInfo, Hash).

On some of the calls, we have referred to these non-Element resources as "data" which probably leads to the confusion. We also referred to them as "structs" to draw an analogy to the C programming language constructs - which isn't exactly accurate either. I believe in RDF, anything that has properties associated with it is a resource and resources have ID's. The ID must be unique within their context. Since the Element context is global - it must have a URI. In the serialization team, we all assumed that non-Elements can (and many times should) be blank nodes whereas @sbarnum believes they must also have a URI - hence our discussion on the tech call.

Going forward, I would recommend we reserve the term "Datatype" for describing the type of data which may be assigned to a literal and not use the term "Datatype" to describe resource types - even resource types for non-Element resource types.

To distinguish Elements from non-Elements I suggest we use just those terms - Element and non-Element resource types.

Ah, so that explains why the issue says PositiveIntegerRange is an rdfs:Datatype when it is declared as an owl:Class (and sh:NodeShape) in the model. A distinction was drawn in the class model (and taken too literally) between Elements and types which are components of a certain Element (with semantics) but which aren't Elements themselves.

It doesn't sound like there is a problem to me.

@davaya
Contributor Author

davaya commented Aug 1, 2023

@goneall:

In RDF, since all resources must have some form of ID, whether assigned or anonymous, 1 can not be implemented within the RDF spec, so we are left with the last 2. My vote is for implicitly assigning ID's (this is done in most RDF implementations when you have an implied blank node).

Going back to the Snippet example, is begin a resource or not?

Snippet is a resource and has an @id or spdxId.
begin is not, in my understanding, a resource; it is a literal that is instantiated with a particular value (say 1) in a particular instance of Snippet.
end is not a resource, it is a literal value (say 10).
And PositiveIntegerRange, being a literal composed of two literals, is not a resource and doesn't have an instance ID, it is a literal that has a single value (say, 1-10) within a single instance of Snippet, and the same (1-10) or different value (3-4) within another instance of Snippet.

@aamedina

aamedina commented Aug 1, 2023

If users are consuming the RDF model they should be able to do whatever they need including skolemizing. I do not see why the specification needs to think about any of that downstream complexity. The most important question is whether or not the model is correct and useful to structure software bill of materials for implementation.

I personally aim to align the SPDX 3 model with D3FEND's OWL ontology for automating SSVC and what I require is a model that follows best practices for linked data. The graphical nature of RDF makes it ideal for logic based querying of SBOMs with rich metadata over competing SBOM standards like CycloneDX. This is a critical advantage for SPDX. The RDF nature enables automated reasoning across large graphs of SBOMs with any tool that understands RDF. I don't have to map the SBOM specification into a graphical model: I can rely on the RDF specification.

If they are consuming a serialization of the RDF model (JSON Schema) the implementation should follow the JSON Schema which is published. Same for the YAML spec.

One serialization shouldn't be concerned with the semantics of the canonical model unless they are depending on its semantics.

@aamedina

aamedina commented Aug 1, 2023

@goneall:

In RDF, since all resources must have some form of ID, whether assigned or anonymous, 1 can not be implemented within the RDF spec, so we are left with the last 2. My vote is for implicitly assigning ID's (this is done in most RDF implementations when you have an implied blank node).

Going back to the Snippet example, is begin a resource or not?

Snippet is a resource and has an @id or spdxId. begin is not, in my understanding, a resource; it is a literal that is instantiated with a particular value in a particular instance of Snippet. end is not a resource, it is a literal. And PositiveIntegerRange, being a literal composed of two literals, is not a resource and doesn't have an instance ID, it is a literal that has a single value (say, 1-10) within a single instance of Snippet, and the same (1-10) or different value (3-4) within another instance of Snippet.

In the current model begin and end are not resources. The PositiveIntegerRange is the resource. It is not a literal. That is the blank node.

@davaya
Contributor Author

davaya commented Aug 1, 2023

@aamedina:

The PositiveIntegerRange is the resource. It is not a literal.

Then in the current model PositiveIntegerRange should be changed. PositiveIntegerRange is not a resource and should not be modeled as a resource, and modeling it in a way contrary to its meaning as a literal is incorrect.

It is semantically a literal with the identical meaning as a string "1-10" or "1:10" or "[1,10]" or "[1..10]" or "{begin:1,end:10}" or "begin=1, end=10" or any other lexical form corresponding to the literal value of a range with begin and end. That's what Lexical-to-Value mapping means.
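
A Lexical-to-Value mapping of this kind could be sketched in Python as follows. This is only an illustration: the `parse_range` helper is hypothetical and deliberately permissive (it simply pulls two integers out of the string), whereas a real serialization spec would define exactly one lexical form.

```python
import re
from typing import NamedTuple


class PositiveIntegerRange(NamedTuple):
    """The logical value: a pair of positive integers."""
    begin: int
    end: int


def parse_range(lexical: str) -> PositiveIntegerRange:
    """Map a lexical form to its value by extracting exactly two integers."""
    numbers = [int(n) for n in re.findall(r"\d+", lexical)]
    if len(numbers) != 2:
        raise ValueError(f"expected two integers in {lexical!r}")
    begin, end = numbers
    if begin < 1 or end < begin:
        raise ValueError("expected 1 <= begin <= end")
    return PositiveIntegerRange(begin, end)


forms = ["1-10", "1:10", "[1,10]", "[1..10]", "{begin:1,end:10}", "begin=1, end=10"]
values = {parse_range(form) for form in forms}
```

All six lexical forms collapse to the single value `PositiveIntegerRange(begin=1, end=10)`, which is the point being made about the value being independent of how it is written down.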

@aamedina

aamedina commented Aug 1, 2023

@aamedina:

The PositiveIntegerRange is the resource. It is not a literal.

Then in the current model PositiveIntegerRange should be changed. PositiveIntegerRange is not a resource and should not be modeled as a resource, and modeling it in a way contrary to its meaning as a literal is incorrect.

It is semantically a literal with the identical meaning as a string "1-10" or "1:10" or "[1,10]" or "{begin:1,end:10}" or "begin=1, end=10" or any other lexical form corresponding to the literal value of a range with begin and end. That's what Lexical-to-Value mapping means.

Parsing all of those lexical variants in JSON for example is going to be a nightmare for implementations.

Furthermore have you considered the advantages to being able to query RDF graphs in SPARQL and express logical constraints on spdx-core:begin and spdx-core:end values when inferencing?

@davaya
Contributor Author

davaya commented Aug 2, 2023

I didn't say parse all of them, and I intentionally didn't include the JSON string that JSON serialization would use because the logical value of Datatypes is independent of serialization. Datatypes are not modeled as resources and in JSON serialization they would be JSON data validated by (exactly one) JSON schema just as in RDF serialization they would be XML data validated by XSD.

The logical inferencing and constraints have the same power as if the Snippet model file had begin and end datatypes, because the logical value of Snippet is identical either way. Querying for Snippet/begin < 10 is the same thing as querying for Snippet/lineRange/begin < 10. Snippet is the resource that is modeled and constrained and queried.
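
The equivalence claimed here can be illustrated with plain Python dictionaries standing in for Snippet instances. This is a sketch, not SPARQL or SHACL: the snippet IDs and the flattened property names `lineRangeBegin` and `lineRangeEnd` are assumptions for illustration, and it does not by itself settle how the RDF model should be written.

```python
# Snippets with a grouped lineRange value.
nested = [
    {"id": "s1", "lineRange": {"begin": 3, "end": 4}},
    {"id": "s2", "lineRange": {"begin": 12, "end": 20}},
]

# The same snippets with begin and end attached directly.
flat = [
    {"id": "s1", "lineRangeBegin": 3, "lineRangeEnd": 4},
    {"id": "s2", "lineRangeBegin": 12, "lineRangeEnd": 20},
]

# "Snippet/lineRange/begin < 10" versus "Snippet/lineRangeBegin < 10".
nested_hits = [s["id"] for s in nested if s["lineRange"]["begin"] < 10]
flat_hits = [s["id"] for s in flat if s["lineRangeBegin"] < 10]
```

Both filters select the same snippet; only the path used to reach the value differs.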

@aamedina

aamedina commented Aug 2, 2023

I didn't say parse all of them, and I intentionally didn't include the JSON string that JSON serialization would use because the logical value of Datatypes is independent of serialization. Datatypes are not modeled as resources and in JSON serialization they would be JSON data validated by (exactly one) JSON schema just as in RDF serialization they would be XML data validated by XSD.

The logical inferencing and constraints have the same power as if the Snippet model file had begin and end datatypes, because the logical value of Snippet is identical either way. Querying for Snippet/begin < 10 is the same thing as querying for Snippet/lineRange/begin < 10. Snippet is the resource that is modeled and constrained and queried.

I am not following.

@davaya
Contributor Author

davaya commented Aug 2, 2023

Pretend that the model Software/Snippet.md was changed from:

## Properties

- byteRange
  - type: /Core/PositiveIntegerRange
  - minCount: 0
  - maxCount: 1
- lineRange
  - type: /Core/PositiveIntegerRange
  - minCount: 0
  - maxCount: 1
- snippetFromFile
  - type: File
  - minCount: 1
  - maxCount: 1

to

## Properties

- byteRange
  - type: /Core/PositiveIntegerRange
  - minCount: 0
  - maxCount: 1
- lineRangeBegin
  - type: /Core/PositiveInteger
  - minCount: 0
  - maxCount: 1
- lineRangeEnd
  - type: /Core/PositiveInteger
  - minCount: 0
  - maxCount: 1
- snippetFromFile
  - type: File
  - minCount: 1
  - maxCount: 1

What is the difference in the kind of constraints and queries that you could perform on Snippet byte ranges vs what you could perform on Snippet line ranges?

The purpose of defining datatypes like CreationInfo and PositiveIntegerRange is to make the elements that use them cleaner and easier to read, not to change the semantics of those elements. Those datatypes are logical reusable groupings of related datatypes, created so the whole grouping doesn't have to be repeated everywhere it's used. Element is less cluttered to look at with CreationInfo than with the seven creation-related properties. De-cluttering is the purpose.

@aamedina

aamedina commented Aug 2, 2023

Pretend that the model Software/Snippet.md was changed from:


## Properties

- byteRange
  - type: /Core/PositiveIntegerRange
  - minCount: 0
  - maxCount: 1
- lineRange
  - type: /Core/PositiveIntegerRange
  - minCount: 0
  - maxCount: 1
- snippetFromFile
  - type: File
  - minCount: 1
  - maxCount: 1

to

## Properties

- byteRange
  - type: /Core/PositiveIntegerRange
  - minCount: 0
  - maxCount: 1
- lineRangeBegin
  - type: /Core/PositiveInteger
  - minCount: 0
  - maxCount: 1
- lineRangeEnd
  - type: /Core/PositiveInteger
  - minCount: 0
  - maxCount: 1
- snippetFromFile
  - type: File
  - minCount: 1
  - maxCount: 1

What is the difference in the kind of constraints and queries that you could perform on Snippet byte ranges vs what you could perform on Snippet line ranges?

Thank you for sharing that to clarify. It also showcases why having unambiguous models as the source can help communicate semantics that are simple to intuit but hard to express in language. Lean on standardized notations whenever possible to increase communication efficiency.

Originally you suggested treating the PositiveIntegerRange as a compound literal with a 'begin' and 'end' that would be lexically constrained in the model by a rdfs:Datatype.

However, by modeling it you partially resolved the issue by attaching those properties to the snippet itself. That could work. There is no need for a new Core profile datatype; you can use xsd:positiveInteger.

@davaya
Contributor Author

davaya commented Aug 2, 2023

Originally you suggested treating the PositiveIntegerRange as a compound literal with a 'begin' and 'end' that would be lexically constrained in the model by a rdfs:Datatype.

I believe I am still saying that - rdfs:Datatype is defined by the Lexical-to-Value mapping for a literal (e.g. PositiveIntegerRange) with a lexical space that is defined by the SPDX serialization spec to be JSON strings or XML strings with start and end values.

Yes, we've had the discussion of DateTime datatypes ad nauseam as well. Pretend that we needed a non-predefined xsd:superPositiveInteger with minInclusive 2 instead of 1 :-). Then we'd need a Core/SuperPositiveInteger to define it in SPDX. It is unnecessary but it also doesn't hurt anything for Core to define PositiveInteger.

@aamedina

aamedina commented Aug 2, 2023

Originally you suggested treating the PositiveIntegerRange as a compound literal with a 'begin' and 'end' that would be lexically constrained in the model by a rdfs:Datatype.

I believe I am still saying that - rdfs:Datatype is defined by the Lexical-to-Value mapping for a literal (e.g. PositiveIntegerRange) with a lexical space that is defined by the SPDX serialization spec to be JSON strings or XML strings with start and end values.

Yes, we've had the discussion of DateTime datatypes ad nauseam as well. Pretend that we needed a non-predefined xsd:superPositiveInteger with minInclusive 2 instead of 1 :-). Then we'd need a Core/SuperPositiveInteger to define it in SPDX.

Then all of the feedback I have given still applies. Your suggested model was more correct. This isn't a datatype.

@davaya
Contributor Author

davaya commented Aug 2, 2023

Now I'm not following. The Snippet/byteRange is a datatype and not a resource. The pretend Snippet/lineRange is also a datatype and not a resource. They behave identically, as literal values that exist within the Snippet resource.

@aamedina

aamedina commented Aug 2, 2023

The only compound literal values beyond the xsd data types that I am aware of being used in practice are language-tagged literals. It also looks like the byteRange is going to have the same issue. I just don't see why this is a problem.

@aamedina

aamedina commented Aug 2, 2023

Now I'm not following. The Snippet/byteRange is a datatype and not a resource. The pretend Snippet/lineRange is also a datatype and not a resource. They behave identically, as literal values that exist within the Snippet resource.

So, byteRangeBegin and byteRangeEnd?

@davaya
Contributor Author

davaya commented Aug 2, 2023

It is a problem because, as William points out, adding globally unique IDs to two tiny integers will bloat up the serialized data, whether those IDs are inserted stealthily or explicitly defined in the model files.

It is a problem because creating graph nodes just for the sake of creating graph nodes feels like cargo cult religion - people do it just because people used to do it, not because we understand why it's done or whether it accomplishes anything we want to accomplish.

What we want to accomplish is for Snippet elements (resources) to have byte and line ranges with start and end literal values. That is the goal. The goal isn't to create a bunch of new resources with graph nodes and globally unique IDs.

In my pretend model file, Snippet/byteRange/begin is a literal and Snippet/byteRange/end is a literal. Snippet/byteRange is a PositiveIntegerRange literal that groups begin and end literals.

Those literals have the identical purpose as the pretend Snippet/lineRangeBegin literal and Snippet/lineRangeEnd literal. There is no difference in how byte ranges and line ranges function to identify what the Snippet refers to.

@aamedina

aamedina commented Aug 2, 2023

Is this what they are referring to? https://www.w3.org/TR/rdf-canon/#ex-ca-unique-hashes

@davaya
Contributor Author

davaya commented Aug 2, 2023

Definitely not. That discusses blank nodes, and I'm explicitly not discussing blank nodes, because literals and blank nodes are mutually exclusive.

I'm referring to Datatypes and Literals and Lexical (serialized string) to (model) Value mapping.

@aamedina

aamedina commented Aug 2, 2023

Yes, I know, we disagree fundamentally. I’m trying to understand the original issue.

@davaya
Contributor Author

davaya commented Aug 2, 2023

As I understand it, one issue is that we have since the beginning understood that Elements are the graph nodes modeled by SPDX v3.

Sean now says that he didn't agree to that way back when, and both Sean and Gary have researched the literature to find all the reasons that blank nodes are bad. I accept that blank nodes are bad, but don't think that has anything to do with the logical model.

Serialization can assign short identifiers that are NOT globally unique (e.g., "c1", "c2", "c3") and CANNOT be used outside a LOCAL chunk of serialized data (a single document). That idea is seductively similar to blank nodes, but falling for it conflates logical values with serialized data. Serialized short identifiers do not exist anywhere in the nodes of the logical value graph, they exist only in payloads/documents.

So confusing the logical model with serialized data, and confusing blank nodes with serialized data compaction may have contributed to the disagreement over whether Elements are the only SPDX 3 resources. The larger disagreement is what I object to - ontologists believe that everything, even two little integers, should be a resource and a graph node. I, being an engineer, believe that only Elements should be resources, and that the SPDX RDF model should take advantage of RDF datatypes to not turn literals into resources when there is no compelling reason to do so.

The disadvantages are engineering considerations: serialized data bloat, graph node implementation bloat, and conceptual understanding bloat. SPDX can do everything it needs to do by modeling only Elements, and it has hooks to refer to external ontologies that can be as fine-grained as their creators want to make them.

@aamedina

aamedina commented Aug 2, 2023

I’m an engineer as well and I’m actually using RDF models, including SPDX 2.3, to implement real world SBOMs with datomic for VEX. This isn’t a hypothetical issue. I’m actually installing your model into a datomic database and would like to continue to use it to query SBOMs.

https://github.com/aamedina/ssvc/blob/c178fd9246345a6f9c7e352f4e07ed82c0e78f93/dev/dev.clj#L101

@davaya
Contributor Author

davaya commented Aug 2, 2023

Interesting - I see https://docs.datomic.com/cloud/schema/schema-reference.html#composite-tuples as being somewhat akin to compound datatypes. Datomic uses tuples just as database keys, but the same concept would apply to a named group of values used outside of joins.

@maxhbr maxhbr added model Something about the abstract model RDF/OWL/SHACL labels Aug 2, 2023
@davaya
Contributor Author

davaya commented Aug 2, 2023

@goneall

In RDF, since all resources must have some form of ID, whether assigned or anonymous, 1 can not be implemented within the RDF spec, so we are left with the last 2. My vote is for implicitly assigning ID's (this is done in most RDF implementations when you have an implied blank node).

Resources must have IDs. Datatypes are not resources and do not have IDs.

  1) is choosing to make PositiveIntegerRange a datatype where instances have only values (as it currently exists in the model)
  2) and 3) are choosing to make PositiveIntegerRange a resource where instances must have individual IDs.

The RDF spec defines Datatypes. Some RDF software may not properly support defining the lexical (serialized) form of RDF Datatypes. Not too long ago some people were saying it was impossible for RDF to implement the DateTime datatype. Now it is implemented. PositiveIntegerRange and the other datatypes should be moved to the Datatypes directory to make it clear that they are not resources.

For SPDX v2, how do you implement datatypes like Hash in Example1?

FileChecksum: SHA1: 20291a81ef065ff891b537b64d4fdccaf6f5ac02

There are no Skolemized IDs hiding in those files, just an algorithm and a value, which the serialization spec for Hash would define as "SHA1: 20291a81ef065ff891b537b64d4fdccaf6f5ac02" (straight from the example), {"sha1": "20291a81ef065ff891b537b64d4fdccaf6f5ac02"} (the obvious JSON choice :-), or {"algorithm": "sha1", "hashValue": "20291a81ef065ff891b537b64d4fdccaf6f5ac02"} (the ugly JSON choice that everyone else prefers).

We need to compare serialized values of a Snippet for 1, 2, and 3 to decide whether we want datatypes or resources.
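
One way to start such a comparison is to serialize the same Snippet with and without an ID on its range and measure the result. The sketch below is a rough illustration only: the property names and the `https://example.org/...` IRIs are hypothetical, and the actual byte counts depend entirely on the ID scheme chosen, so no representative numbers are implied.

```python
import json

# Alternative 1: the range is an inline value with no ID of its own.
inline = {
    "@id": "https://example.org/snippet/1",
    "type": "Snippet",
    "lineRange": {"begin": 1, "end": 10},
}

# Alternative 2: the range is a resource carrying its own global ID.
with_global_id = {
    "@id": "https://example.org/snippet/1",
    "type": "Snippet",
    "lineRange": {
        "@id": "https://example.org/range/0f8fad5b-d9cb-469f-a165-70867728950e",
        "begin": 1,
        "end": 10,
    },
}


def compact_size(document):
    """Length of the compact JSON encoding, in characters."""
    return len(json.dumps(document, separators=(",", ":")))


sizes = {
    "inline": compact_size(inline),
    "global_id": compact_size(with_global_id),
}
```

Alternative 3 would look like the first form on the wire if IDs stay implicit, and like the second once Skolem IRIs are materialized.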

@aamedina

aamedina commented Aug 2, 2023

In SPDX 2 there is a spdx:Checksum class with two direct properties: spdx:checksumValue and spdx:algorithm.

From an example SPDX RDF document:

<spdx:Checksum>
<spdx:checksumValue>d6a770ba38583ed4bb4525bd96e50461655d2759</spdx:checksumValue>
<spdx:algorithm rdf:resource="http://spdx.org/rdf/terms#checksumAlgorithm_sha1"/>
</spdx:Checksum>

@davaya
Contributor Author

davaya commented Aug 2, 2023

Great! What does an RDF or JSON-LD instance of Checksum look like?

@aamedina

aamedina commented Aug 2, 2023

That would require a JSON-LD context to map the keys.

{
  "@context": {
    "spdx": "http://spdx.org/rdf/terms#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  },
  "@type": "spdx:Checksum",
  "spdx:checksumValue": "d6a770ba38583ed4bb4525bd96e50461655d2759",
  "spdx:algorithm": {
    "@id": "http://spdx.org/rdf/terms#checksumAlgorithm_sha1"
  }
}

This is how it looks as EDN (extensible data notation, used in Clojure/Datomic):

{:rdf/type :spdx/Checksum
 :spdx/checksumValue "d6a770ba38583ed4bb4525bd96e50461655d2759"
 :spdx/algorithm :spdx/checksumAlgorithm_sha1}

Which is equivalent to 3 triples:

[?e :rdf/type :spdx/Checksum]
[?e :spdx/checksumValue "d6a770ba38583ed4bb4525bd96e50461655d2759"]
[?e :spdx/algorithm :spdx/checksumAlgorithm_sha1]

The ?e is the blank node (the subject). It is an entity without identity that is a component of another entity with identity. However, it is still an entity.
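
How an object without an `@id` turns into triples with a blank node subject can be sketched in Python. This is not a JSON-LD processor: the `to_triples` helper is hypothetical, it does not expand prefixes through the context, and the `_:b0` labeling scheme is an assumption for illustration.

```python
from itertools import count


def to_triples(node, labels):
    """Flatten a JSON-LD-like dict into (subject, predicate, object) triples.

    A node without "@id" gets a generated blank node label as its subject.
    """
    subject = node.get("@id") or f"_:b{next(labels)}"
    triples = []
    for key, value in node.items():
        if key in ("@id", "@context"):
            continue
        predicate = "rdf:type" if key == "@type" else key
        if isinstance(value, dict) and set(value) != {"@id"}:
            # Nested object: recurse, then link parent to child.
            child, child_triples = to_triples(value, labels)
            triples.append((subject, predicate, child))
            triples.extend(child_triples)
        elif isinstance(value, dict):
            # {"@id": ...} is a reference to an existing resource.
            triples.append((subject, predicate, value["@id"]))
        else:
            triples.append((subject, predicate, value))
    return subject, triples


checksum = {
    "@context": {"spdx": "http://spdx.org/rdf/terms#"},
    "@type": "spdx:Checksum",
    "spdx:checksumValue": "d6a770ba38583ed4bb4525bd96e50461655d2759",
    "spdx:algorithm": {"@id": "http://spdx.org/rdf/terms#checksumAlgorithm_sha1"},
}
subject, triples = to_triples(checksum, count())
```

The checksum has no `@id`, so all three triples share the generated subject `_:b0`, which plays the role of `?e` above.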

@goneall
Member

goneall commented Aug 2, 2023

Agree with @aamedina's descriptions above - thanks for the responses.

Just a couple more notes (probably more detail than needed for this discussion):

  • In SPDX 2.X, we typically use blank nodes / anonymous ID's for checksums - in JSON-LD, this typically just "inlines" the above JSON object
  • In the SPDX java tools, we go through some effort to de-duplicate checksums. We just reference the same anonymous ID similar to the creationInfo examples created by @armintaenzertng - this resulted in some significant reduction in size for some SPDX documents (a major complaint for RDF serializations of SPDX).
  • I've toyed with the idea of creating a URI by combining a prefix + unique string for the algorithm + hash value since this would be a globally unique URI for any hash. The strings could be quite large, but not unusually large from my experience. The intent would (definitely) not be to put some semantic meaning into the URI, but rather have a better algorithm for de-duplicating hashes.
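
The prefix + algorithm + hash value idea in the last bullet might look like the following Python sketch. The `https://example.org/hash/` prefix is hypothetical, and lowercasing the value assumes hex-encoded hashes where case carries no meaning; neither is taken from the SPDX tools.

```python
from urllib.parse import quote

HASH_PREFIX = "https://example.org/hash/"   # hypothetical prefix, illustration only


def hash_uri(algorithm, value):
    """Derive a URI from the algorithm name and hash value alone."""
    algorithm_part = quote(algorithm.lower(), safe="")
    value_part = quote(value.lower(), safe="")   # assumes hex-encoded values
    return f"{HASH_PREFIX}{algorithm_part}/{value_part}"


def dedupe(checksums):
    """Keep one entry per derived URI."""
    unique = {}
    for algorithm, value in checksums:
        unique.setdefault(hash_uri(algorithm, value), (algorithm.lower(), value.lower()))
    return unique


checksums = [
    ("SHA1", "d6a770ba38583ed4bb4525bd96e50461655d2759"),
    ("sha1", "D6A770BA38583ED4BB4525BD96E50461655D2759"),
    ("sha1", "20291a81ef065ff891b537b64d4fdccaf6f5ac02"),
]
unique = dedupe(checksums)
```

Because the URI is a pure function of the algorithm and value, equal hashes always map to the same URI, which is what makes it usable as a de-duplication key.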

@davaya
Contributor Author

davaya commented Aug 7, 2023

@aamedina

The ?e is the blank node (the subject). It is an entity without identity that is a component of another entity with identity. However, it is still an entity.

An entity without an identity means an entity without an identifier. In other words, it is a datatype, which means:

  • RDF can define datatypes
  • PositiveIntegerRange (and Hash and CreationInfo and ExternalReference) can be defined as datatypes
  • Instances of the PositiveIntegerRange entity do not have an @id other than that of the Element they are included in

correct?

@davaya
Contributor Author

davaya commented Aug 7, 2023

@goneall

I've toyed with the idea of creating a URI by combining a prefix + unique string for the algorithm + hash value since this would be a globally unique URI for any hash.

Except that you don't need a globally unique ID to de-duplicate hash values within a document, you just need a document-unique ID, i.e., "h1", "h2", "h3" ...
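
Document-unique IDs of this kind could be assigned as in the Python sketch below. The `assign_local_ids` helper and the `(algorithm, value)` tuple representation are assumptions for illustration; the IDs are meaningful only inside the one document being serialized.

```python
def assign_local_ids(checksums, prefix="h"):
    """Give each distinct checksum a short ID in order of first appearance."""
    ids = {}
    for checksum in checksums:
        if checksum not in ids:
            ids[checksum] = f"{prefix}{len(ids) + 1}"
    return ids


checksums = [
    ("sha1", "d6a770ba38583ed4bb4525bd96e50461655d2759"),
    ("sha1", "20291a81ef065ff891b537b64d4fdccaf6f5ac02"),
    ("sha1", "d6a770ba38583ed4bb4525bd96e50461655d2759"),   # repeat of the first
]
ids = assign_local_ids(checksums)
references = [ids[checksum] for checksum in checksums]
```

The repeated checksum reuses the ID of its first occurrence, so the document stores each distinct hash once and refers to it by a short local label.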

And if #431 is merged there should be no duplicate hashes to de-duplicate.

@goneall goneall added this to the 3.0 milestone Aug 12, 2023
@goneall
Member

goneall commented Apr 3, 2024

I believe we have resolved how we handle datatypes and anonymous classes - closing this issue

@goneall goneall closed this as completed Apr 3, 2024

4 participants