Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create super-types for relationships #4724

Open
cmgrote opened this issue Feb 23, 2021 · 25 comments
Open

Create super-types for relationships #4724

cmgrote opened this issue Feb 23, 2021 · 25 comments
Assignees
Labels
consumability Makes the software easier to use or understand. Includes docs, error messages, logs, API definitions functionality-call To discuss and agree during weekly functionality call(s) optimization Non-breaking, but could be more efficient performance Performance & Scalability/reliability related pinned Keep open (do not time out)

Comments

@cmgrote
Copy link
Member

cmgrote commented Feb 23, 2021

At the moment none of the RelationshipDefs have a super type defined, and I was wondering if there was some rationale behind this: for example, would it mean that the ends of the relationship are fixed by the supertype (?)

I raise it because there is significant duplication in the properties of certain relationships, particular in the area of GlossaryTerms, where all of these relationships have almost exactly the same properties defined (description, expression, status, steward, and source):

  • Antonym
  • ISARelationship
  • PreferredTerm
  • RelatedTerm
  • ReplacementTerm
  • SemanticAssignment (has one additional property confidence beyond those above)
  • Synonym
  • TermHASARelationship (though this is missing expression compared to the others)
  • TermISATypeOFRelationship (also missing expression compared to the others)
  • TermTYPEDBYRelationship (also missing expression)
  • Translation
  • UsedInContext
  • ValidValue

(There are ~10 more that have steward defined, and a few other overlaps.)

Aside from the question of consistency (see minor inconsistencies above), there is also the question of the performance impact of this. Per #4647 we agreed that we should not have a single "namespace" for all properties; however, in most (if not all?) of the examples above, I would expect that these 5 properties all have the same meaning and representation.

In other words, that they'd be relationship subtypes of a more abstract relationship type like GlossaryTermRelationship.

The impact of having so many identically-named properties across types is that when a consumer runs a search through our find... methods there is no mandatory qualification of the type against which they want results. For an underlying repository (both the graph repository and the Crux repository), this means that we need to do an OR-based query across ~20 different underlying properties (treating each type as a separate property) to find all that match -- combined with the fact that since many of these properties are string values, a consumer could be sending their request with an arbitrary regular expression -- and this therefore ends up being very expensive (slow).

If we could use a supertype, at least the property would only be registered against a single type in the underlying repositories, and then we could do a single query for the property rather than what ends up being the equivalent of ~20 distinct queries whose results are all combined.

@cmgrote cmgrote added question Not a bug, asking a question functionality-call To discuss and agree during weekly functionality call(s) labels Feb 23, 2021
@cmgrote
Copy link
Member Author

cmgrote commented Feb 23, 2021

To illustrate, this query on the findRelationshipsByProperty method:

{
    "class": "PropertyMatchFindRequest",
    "matchProperties": {
        "class": "InstanceProperties",
        "instanceProperties": {
            "steward": {
                "class": "PrimitivePropertyValue",
                "instancePropertyCategory": "PRIMITIVE",
                "primitiveDefCategory": "OM_PRIMITIVE_TYPE_STRING",
                "primitiveValue": "someone"
            }
        }
    },
    "pageSize": 2,
    "matchCriteria": "ALL"
}

currently must turns into something like this query on the back-end:

{
 :find [e],
 :where [(or [(text-search :relationshipProperties/Antonym.steward.value "/someone/") [[e _]]]
             [(text-search :relationshipProperties/DataClassAssignment.steward.value "/someone/") [[e _]]]
             [(text-search :relationshipProperties/ForeignKey.steward.value "/someone/") [[e _]]]
             [(text-search :relationshipProperties/ISARelationship.steward.value "/someone/") [[e _]]]
             [(text-search :relationshipProperties/KnownDuplicateLink.steward.value "/someone/") [[e _]]]
             [(text-search :relationshipProperties/LibraryCategoryReference.steward.value "/someone/") [[e _]]]
             [(text-search :relationshipProperties/LibraryTermReference.steward.value "/someone/") [[e _]]]
             [(text-search :relationshipProperties/PreferredTerm.steward.value "/someone/") [[e _]]]
             [(text-search :relationshipProperties/ReferenceValueAssignment.steward.value "/someone/") [[e _]]]
             [(text-search :relationshipProperties/RelatedTerm.steward.value "/someone/") [[e _]]]
             [(text-search :relationshipProperties/ReplacementTerm.steward.value "/someone/") [[e _]]]
             [(text-search :relationshipProperties/SemanticAssignment.steward.value "/someone/") [[e _]]]
             [(text-search :relationshipProperties/Synonym.steward.value "/someone/") [[e _]]]
             [(text-search :relationshipProperties/TermHASARelationship.steward.value "/someone/") [[e _]]]
             [(text-search :relationshipProperties/TermISATypeOFRelationship.steward.value "/someone/") [[e _]]]
             [(text-search :relationshipProperties/TermTYPEDBYRelationship.steward.value "/someone/") [[e _]]]
             [(text-search :relationshipProperties/Translation.steward.value "/someone/") [[e _]]]
             [(text-search :relationshipProperties/UsedInContext.steward.value "/someone/") [[e _]]]
             [(text-search :relationshipProperties/ValidValue.steward.value "/someone/") [[e _]]]
             [(text-search :relationshipProperties/ValidValuesMapping.steward.value "/someone/") [[e _]]])
        [e :type.category 8]],
 :order-by [[e :asc]]
}

... when it could be as simple as this (if there were a supertype with these common properties):

{
 :find [e],
 :where [(text-search :relationshipProperties/GlossaryTermRelationship.steward.value "/someone/") [[e _]]]
        [e :type.category 8],
 :order-by [[e :asc]]
}```

@cmgrote
Copy link
Member Author

cmgrote commented Feb 23, 2021

Actually I have similar questions / concerns, though maybe somewhat different, with regards to the name and displayName properties that are distinctly defined across ~30-40 different EntityDefs. Given it would be optional anyway, it would seem to me that anything we must assign a qualifiedName (every Referenceable, since it's mandatory) we might reasonably expect could (since it would be optional) be assigned a name for display purposes?

That way we'd have greater consistency across the types, and a vastly reduced search space for the underlying repositories when someone is searching for something based just on its name.

@davidradl
Copy link
Member

@cmgrote some thoughts on this. I can see that finds for relationship properties would be simpler if there were super types. Early on we decided to not have super types for relationships (back in the Altas days). Many repositories do not have support for top level relationships, requiring them to support inheritance as well would seem like a very high barrier to overcome. Another consideration is that an OMAS could expose inherited relationships. In fact subject area does this with the relationships extending Line. I don't think it exposes a find on a Line - you can raise this as an enhancement if you like.

@cmgrote cmgrote self-assigned this Feb 25, 2021
@cmgrote
Copy link
Member Author

cmgrote commented Feb 25, 2021

Discussed on 25/02 and we all seem to agree this would be a good idea, but we all expect that it will cause various unforeseen breakages -- so needs to be introduced early in a release cycle for thorough testing.

Ideal approach to attempt initially:

  • Replace the existing type (rather than patching), so we don't end up with deprecated attributes at the lowest-level that are actually inherited (and should not be deprecated) from a higher level
  • Ensure that relationship types that have a supertype have ends that only ever narrow (further subtype) their parent's end types

Note: this is different from an OO-interpretation of "supertype" and "subtype" -- where a subtype would need to be capable of doing everything (and more) than its supertype, these constructions would make it such that the subtype will do less (be more restrictive than) the supertype.

@cmgrote cmgrote removed functionality-call To discuss and agree during weekly functionality call(s) question Not a bug, asking a question labels Feb 26, 2021
@cmgrote cmgrote changed the title Rationale for relationships never having a supertype? Create super-types for relationships Mar 1, 2021
@cmgrote cmgrote added consumability Makes the software easier to use or understand. Includes docs, error messages, logs, API definitions optimization Non-breaking, but could be more efficient performance Performance & Scalability/reliability related labels Mar 1, 2021
@planetf1
Copy link
Member

planetf1 commented Mar 3, 2021

In the call we discussed setting up a feature branch to ease integration testing, to avoid breakage in master whilst we
work through all the consequential changes

I propose to create a new feature branch 'egeria-feature-4724' but open to other name suggestions.. (4724 being linked to this id). Or we could do 'egeria-feature-superrel' or similar. Best not to be too long. cc: @mandy-chessell @davidradl @grahamwallis . I'll check the build pipelines are ok, and propose to do this tomorrow after the status call unless there is followup discussion here/keenness to get started sooner

@cmgrote
Copy link
Member Author

cmgrote commented Mar 3, 2021

I think that should be fine -- we need to do the usual cut-over of the OpenMetadataTypesArchive class post-release for master first, anyway (see #4849).

@planetf1
Copy link
Member

planetf1 commented Mar 3, 2021

Ok - add a comment here when ready and I'll setup & check

@cmgrote
Copy link
Member Author

cmgrote commented Mar 5, 2021

@planetf1 ok, I think we're ready for the new branch now -- I'd be fine with feature-relationship-supertypes as a name (not sure why everything needs to be prefixed with egeria- as it seems a bit redundant?)

@planetf1
Copy link
Member

planetf1 commented Mar 5, 2021

@cmgrote just convention really -- the original release started as egeria-release-xx and we took it from there. I started with egeria-feature-xx when we looked at OLS and we discussed/agreed about the feature proposal there as per . It's good to have a release or feature to help with wildcards in the github actions, but agreed egeria is redundant and is only there historically.

A few formatting errors - but this was the documented process: https://github.com/odpi/egeria/blob/master/developer-resources/Feature-Branch.md

I do think egeria-feature-relationship-supertypes is quite cumbersome. Technically I have little concerns about changing it - beyond ALSO updating the other cicd files (before the branch), which means a little more testing & also should gather more support for a proposal to change the policy & update the guidelines accordingly?

@cmgrote
Copy link
Member Author

cmgrote commented Mar 5, 2021

Fine for me to gather more support and update the process accordingly... If we want to limit the length of the names, it's a bit odd to me to waste 7-15 characters in that limited length every time by forcing every name to begin with the same 7-15 character string 🤷

In fact, taking this a step further, if we use the first x- to give the branch some decipherable meaning, then I'd make this part the short bit (e.g. just f- for a feature branch) rather than restricting the actually-meaningfully-distinguishing characteristics of the branch itself the part that gets squashed.

So rather than something like egeria-feature-superrel which forces me to me to ponder why we're mis-spelling the word "squirrel" and what this branch could possibly be about, it would make far more sense to me to call it f-relationship-supertypes (which is about the same number of characters overall, but far more descriptive).

@planetf1
Copy link
Member

planetf1 commented Mar 8, 2021

I suggest we discuss and agree with the team on the call tomorrow.
Technically we just want a prefix
Reduction of length was purely a human interaction concern
I agree egeria-* is redundant (but let's not change existing - ok for new)
For anyone landing on our repo 'f-' may be a little crypic

@cmgrote cmgrote added the functionality-call To discuss and agree during weekly functionality call(s) label Mar 8, 2021
@cmgrote
Copy link
Member Author

cmgrote commented Mar 8, 2021

To discuss and agree on 09/03/2021 call:

  • Naming of the feature branches
  • Bubbling up of name / displayName (see 3rd comment in this stream)

@mandy-chessell
Copy link
Contributor

mandy-chessell commented Mar 9, 2021

On the question of name vs displayName - in most cases I can think of, the distinction between name and display name is important and we make use of it. I would be cautious about loosing the distinction.

Typically "name" is used for technical assets - you can see it occurring frequently in Area 0. The name is typically the technical name used in the digital landscape. It is extracted automatically and is not what someone has chosen to be a meaningful display name. In these entities the qualified name is manufactured by the metadata extraction process often based on the name.

The displayable (human authored) names for these entities are added later using a GlossaryTerm linked with the SupplementaryProperties relationship. So for example, in the Asset Manager OMAS interface, Assets have both technical names and business names. The technical names go into the Asset entity and the business names go in the attached GlossaryTerm entity. Using two entities for this is important since the Asset entity is often read-only to the open metadata ecosystem.

The displayName property typically appears in metadata elements that are human authored - such as glossary terms, governance zone definitions, ... The author may or may not control the qualified name, but they can choose the display name.

It would be interesting to understand if we see inconsistencies in the use of these two property names in the types.

@mandy-chessell
Copy link
Contributor

I did a quick review of the types and while most use of name vs displayName is appropriate there are some notable exceptions:

  • SchemaElement uses displayName rather than name (0501)
  • ProcessVariable uses displayName - it could be argued that it needs name too? (0525)
  • Port uses displayName rather than name (0217)

@cmgrote
Copy link
Member Author

cmgrote commented Mar 9, 2021

I also just did some quick counts (rough numbers that may be off by a few each):

  • name is defined on 24 different type defs
  • displayName is defined on 34 different type defs
  • description is defined on more than 100 separate type defs (!)

@cmgrote
Copy link
Member Author

cmgrote commented Mar 9, 2021

I like the distinction between technical and non-technical, but could this not simply be a characteristic of the instance type in which the property has a value? An area 0 type with a name populated is a "technical name" because area 0 has technical types, while a glossary type with a name populated would be a "human authored" one since the glossary is human-authored (?)

Taking the example where we have both (Asset Manager OMAS): the name from the technical Asset is then clearly a "technical name" (because the Asset is technical) while the name from its associated GlossaryTerm is a "business name" (since the GlossaryTerm is human-authored).

(Then we don't even have the risk of things becoming inconsistent per the examples above with technical things using displayName...)

@cmgrote
Copy link
Member Author

cmgrote commented Mar 9, 2021

On the branch naming point, agreed on 09/03 call to use feature- as the prefix going forward.

cmgrote added a commit to cmgrote/egeria that referenced this issue Mar 9, 2021
Signed-off-by: Christopher Grote <chris@thegrotes.net>
@mandy-chessell
Copy link
Contributor

I think it is more helpful to maintain the different property names since the values have different expectations on their content. I like the idea of updating SchemaElement and Port to use name and deprecate displayName - and add name to ProcessVariables.

@cmgrote
Copy link
Member Author

cmgrote commented Mar 9, 2021

What about introducing 2 new entity typedefs in-between that distinguish between these "technical" and "human-authored" types?

For example (purely theoretical names for the types):

  • Technical, which has Referenceable as a supertype, with properties: name
  • HumanAuthored which has Referenceable as a supertype, with properties: displayName, description, (steward?), etc
  • Asset, SchemaElement, etc have Technical as their supertype
  • Glossary, GlossaryTerm, etc have HumanAuthored as their supertype

This would:

  1. make the distinction between name and displayName more obvious
  2. enforce consistency in their usage by any subtypes
  3. significantly reduce the current "duplication" of these properties being defined across each type individually

@mandy-chessell
Copy link
Contributor

I think it works for the Technical Entity - but not for the human authored because we have human authored entities that do not have display name

cmgrote added a commit that referenced this issue Mar 9, 2021
#4724 Changes feature branch naming guidance as agreed
@mandy-chessell
Copy link
Contributor

The way the model is today is that there are these sorts of supertypes for each major grouping of entity types. Also there are entities that do not inherit from referenceable that have name or display name.

@planetf1
Copy link
Member

planetf1 commented Mar 9, 2021

Have added PR #4887 to update the cicd actions for egeria - sometimes the scripts have assumptions which can lead to clashes of docker images, snapshots etc since we have multiple branches with same pom version etc. fortunately we were ok with the current definitions. But they are needed to allow the branch checks to be done. (if we want those tests)

@planetf1
Copy link
Member

planetf1 commented Mar 9, 2021

I've also updated the release patterns (annoyingly we have a revert branch - so it's an addition) and I think the pipelines as above are working. Should be ok to use now as long as #4890 builds ok. The checks did result in master versions of docker and website being overwritten but the next master merge overwrote so should be good now. (those actions may benefit from additional verification, they currently have assumptions)

cmgrote added a commit to cmgrote/egeria that referenced this issue Mar 10, 2021
Signed-off-by: Christopher Grote <chris@thegrotes.net>
@github-actions
Copy link

github-actions bot commented May 9, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the no-issue-activity Issues automatically marked as stale because they have not had recent activity. label May 9, 2021
@mandy-chessell mandy-chessell removed the no-issue-activity Issues automatically marked as stale because they have not had recent activity. label May 25, 2021
@github-actions
Copy link

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the no-issue-activity Issues automatically marked as stale because they have not had recent activity. label Jul 25, 2021
@mandy-chessell mandy-chessell added pinned Keep open (do not time out) and removed no-issue-activity Issues automatically marked as stale because they have not had recent activity. labels Aug 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
consumability Makes the software easier to use or understand. Includes docs, error messages, logs, API definitions functionality-call To discuss and agree during weekly functionality call(s) optimization Non-breaking, but could be more efficient performance Performance & Scalability/reliability related pinned Keep open (do not time out)
Projects
None yet
Development

No branches or pull requests

4 participants