In this blog I will explore the design and implementation of "Health Of The Data" by designing and building a "Data Health As Service", which derives the health of data from its various contributors using graph inferencing. We will also see how to calculate the health of data mathematically. Let's start our journey by understanding a modern data platform architecture.
The picture below shows a modern data platform architecture:
The above architecture shows the key components/services of the data platform. In this blog, we will focus on the "Data Health Service" and the components/services required for it to function.
The prime goal of data in a data platform is to enable data-driven decisions through reports. At a broad level, the quality of the reports built on that data depends on:
- Quality of data
- Timeliness of data
- Data lineage
- ...
Since the quality of the data depends on various factors, I would like to propose a service/component in the design called "Data Health As Service". This service will derive the health of the data from the following key components:
- Data Quality as service
- ETL pipeline metadata service (pipeline context metadata)
- ...
Hence, in a nutshell, "Data Health As Service" will provide an "Index Of Readiness" of the data, as shown below:
As the saying goes, "Data is Gold". A good jeweller will tell you to double and triple check the quality before buying gold. The quality of gold is measured in carats, and the carat value governs its price: 24-carat gold is more expensive than 22-carat gold, and so on. The same applies to data. The "index of readiness of data" is calculated as:
Index Of Readiness = 1 / (sum of scores of metrics influencing data quality + score(data lineage) + score(data integrity))
If the index of readiness is close to zero, the data is healthy: the higher the component scores, the larger the denominator and the closer the index gets to zero.
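To make the formula concrete, here is a minimal sketch of the calculation. The component names, the example score values, and the assumption that higher scores mean better quality are purely illustrative and not part of any real scoring scheme:

```java
public class IndexOfReadinessExample {

    // Index of readiness as defined above: the reciprocal of the summed component scores
    static double indexOfReadiness(double dataQualityScore,
                                   double dataLineageScore,
                                   double dataIntegrityScore) {
        return 1.0 / (dataQualityScore + dataLineageScore + dataIntegrityScore);
    }

    public static void main(String[] args) {
        // High component scores -> 1 / (40 + 35 + 25) = 0.01 -> close to zero, i.e. healthy
        System.out.println(indexOfReadiness(40, 35, 25));
        // Low component scores -> 1 / (4 + 3 + 3) = 0.1 -> further from zero, i.e. less healthy
        System.out.println(indexOfReadiness(4, 3, 3));
    }
}
```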
So, mathematically, we can derive the health of the data using the "Index Of Readiness". In real-world scenarios, however, it is hard to obtain all the parameters that affect the health of the data.
This raises the million-dollar question: how can we derive a "data health sense" from the metadata without relying purely on mathematics?
Graph databases are powerful at deducing connections between data, no matter the type of data. Hence, I would like to propose the use of a graph database to derive the health of the data. How would the graph model look to support such a service?
The following diagram shows the design of the graph model for data health deduction:
The starting point is the "Data Health Service". Its responsibility is to enroll the data contributors that contribute to the final report/outcome, DQ metrics, etc.
The contributor node is responsible for capturing information that affects the final report. An example of a contributor is an ETL pipeline.
This is the final node in the graph, representing the outcome. A typical example is a cube or a batch report.
The diagram below shows the class diagram for the data health entities:
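As a minimal sketch, assuming Spring Data Neo4j with Neo4j-OGM (the default stack for Spring Boot 2.3.x), the contributor and report nodes could be modelled along these lines. The class, field, and relationship names here are illustrative assumptions, not the actual implementation, and each class would live in its own source file:

```java
import java.util.HashSet;
import java.util.Set;

import org.neo4j.ogm.annotation.GeneratedValue;
import org.neo4j.ogm.annotation.Id;
import org.neo4j.ogm.annotation.NodeEntity;
import org.neo4j.ogm.annotation.Relationship;

// Contributor.java - a graph node capturing information that affects the final report,
// e.g. an ETL pipeline or a DQ metric source (illustrative model only)
@NodeEntity
public class Contributor {

    @Id
    @GeneratedValue
    private Long id;

    private String name;

    // A contributor feeds one or more final outcomes (cubes, batch reports, ...)
    @Relationship(type = "CONTRIBUTES_TO")
    private Set<Report> reports = new HashSet<>();

    // getters and setters omitted for brevity
}

// Report.java - the final node in the graph, e.g. a cube or a batch report
@NodeEntity
public class Report {

    @Id
    @GeneratedValue
    private Long id;

    private String name;

    // getters and setters omitted for brevity
}
```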
The graph database output is shown below:
Using the graph depiction above, it is very easy to deduce the health of the data: if any link from a contributor to the report is missing, the data is immediately flagged as unhealthy. This is a great way to find the health of data, since the graph-based approach allows health-influencing metrics to be added iteratively.
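For illustration, the rule "a missing contributor link means unhealthy data" could be expressed as a single Cypher query behind a Spring Data Neo4j repository. This is only a sketch under the assumptions made in the entity sketch above (node labels Contributor/Report and the CONTRIBUTES_TO relationship); the repository and method names are hypothetical:

```java
import org.springframework.data.neo4j.annotation.Query;
import org.springframework.data.neo4j.repository.Neo4jRepository;
import org.springframework.data.repository.query.Param;

// Hypothetical repository: a report is healthy only if every enrolled contributor
// has a CONTRIBUTES_TO link to it
public interface ReportRepository extends Neo4jRepository<Report, Long> {

    @Query("MATCH (c:Contributor) WITH count(c) AS enrolled "
         + "OPTIONAL MATCH (:Contributor)-[l:CONTRIBUTES_TO]->(:Report {name: $reportName}) "
         + "RETURN count(l) = enrolled AS healthy")
    boolean isHealthy(@Param("reportName") String reportName);
}
```

Because new contributors are just additional nodes and relationships, new health-influencing metrics can be enrolled later without changing this deduction logic.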
"Data Health As Service" is built using:
- Java 14
- Spring Boot 2.3.3
- Neo4j
- Spring Data Neo4j
Below are the steps to run this service (example commands follow the list):
- Run the Neo4j Docker container
- Run the DataHealthServiceApplication
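As a concrete illustration of those two steps (the Neo4j image tag, ports, and credentials below are assumptions; adjust them to your environment, and use the Maven wrapper or your IDE to start the application):

```bash
# Start a local Neo4j instance (HTTP on 7474, Bolt on 7687)
docker run --name neo4j -p 7474:7474 -p 7687:7687 -e NEO4J_AUTH=neo4j/secret neo4j:4.1

# Start the Spring Boot application from the project root
./mvnw spring-boot:run
```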
For further reference, please consider the following sections:
- Official Apache Maven documentation
- Spring Boot Maven Plugin Reference Guide
- Create an OCI image
- Spring Web
- Spring Data Neo4j
The following guides illustrate how to use some features concretely: