In this blog I will explore the design and implementation of "Health Of The Data" by designing and building a "Data Health As Service", which derives the health of data from its various contributors using graph inferencing. We will also see how to calculate the health of data mathematically. Let's start our journey by understanding a modern data platform architecture.
The picture below shows a modern data platform architecture:
The above architecture shows the key components/services of the data platform. In this blog, we will focus on the "Data Health Service" and the components/services required for it to function.
The prime goal of data in a data platform is to enable data-driven decisions through reports. At a broad level, the quality of the reports built on that data depends on:
- Quality of data
- Timeliness of data
- Data lineage
- ...
Since the quality of the data depends on various factors, I would like to propose a service/component in the design called "Data Health As Service". This service will derive the health of the data from the following key components:
- Data Quality as service
- ETL pipeline metadata service (pipeline context metadata)
- ...
Hence, in a nutshell, "Data Health As Service" will provide an "Index Of Readiness" of the data, as shown below:
As the saying goes, "Data is Gold". A good jeweller will tell you to double and triple check the quality before buying gold. The quality of gold is measured in carats, and the carat value governs its price: 24-carat gold is more expensive than 22-carat gold, and so on. The same applies to data. The "index of readiness of data" is calculated as:
Index Of Readiness = 1 / (sum of scores of metrics influencing data quality + score(data lineage) + score(data integrity))
If the index of readiness is close to zero, the data is healthy: the higher the component scores, the larger the denominator and the closer the index gets to zero.
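To make the formula concrete, here is a minimal sketch of the calculation. The component names, the example score values, and the assumption that higher scores mean better quality are purely illustrative and not part of any real scoring scheme:

```java
public class IndexOfReadinessExample {

    // Index of readiness as defined above: the reciprocal of the summed component scores
    static double indexOfReadiness(double dataQualityScore,
                                   double dataLineageScore,
                                   double dataIntegrityScore) {
        return 1.0 / (dataQualityScore + dataLineageScore + dataIntegrityScore);
    }

    public static void main(String[] args) {
        // High component scores -> 1 / (40 + 35 + 25) = 0.01 -> close to zero, i.e. healthy
        System.out.println(indexOfReadiness(40, 35, 25));
        // Low component scores -> 1 / (4 + 3 + 3) = 0.1 -> further from zero, i.e. less healthy
        System.out.println(indexOfReadiness(4, 3, 3));
    }
}
```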
So, mathematically, we can derive the health of the data using the "Index Of Readiness". In real-world scenarios, however, it is hard to obtain all the parameters that affect the health of the data.
This raises the million-dollar question: how can we derive a "data health sense" from the metadata without relying purely on mathematics?
Graph databases are powerful at deducing connections between data, no matter the type of data. Hence, I would like to propose the use of a graph database to derive the health of the data. How would the graph model look to support such a service?
The following diagram shows the design of the graph model for data health deduction:
The starting point is the "Data Health Service". Its responsibility is to enroll the data contributors that contribute to the final report/outcome, DQ metrics, etc.
The contributor node is responsible for capturing information that affects the final report. An example of a contributor is an ETL pipeline.
This is the final node in the graph, representing the outcome. A typical example is a cube or a batch report.
The diagram below shows the class diagram for the data health entities:
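As a minimal sketch, assuming Spring Data Neo4j with Neo4j-OGM (the default stack for Spring Boot 2.3.x), the contributor and report nodes could be modelled along these lines. The class, field, and relationship names here are illustrative assumptions, not the actual implementation, and each class would live in its own source file:

```java
import java.util.HashSet;
import java.util.Set;

import org.neo4j.ogm.annotation.GeneratedValue;
import org.neo4j.ogm.annotation.Id;
import org.neo4j.ogm.annotation.NodeEntity;
import org.neo4j.ogm.annotation.Relationship;

// Contributor.java - a graph node capturing information that affects the final report,
// e.g. an ETL pipeline or a DQ metric source (illustrative model only)
@NodeEntity
public class Contributor {

    @Id
    @GeneratedValue
    private Long id;

    private String name;

    // A contributor feeds one or more final outcomes (cubes, batch reports, ...)
    @Relationship(type = "CONTRIBUTES_TO")
    private Set<Report> reports = new HashSet<>();

    // getters and setters omitted for brevity
}

// Report.java - the final node in the graph, e.g. a cube or a batch report
@NodeEntity
public class Report {

    @Id
    @GeneratedValue
    private Long id;

    private String name;

    // getters and setters omitted for brevity
}
```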
The graph database output is shown below:
Using the graph depiction above, it is very easy to deduce the health of the data: if any link from a contributor to the report is missing, the data is immediately flagged as unhealthy. This is a great way to find the health of data, since the graph-based approach allows health-influencing metrics to be added iteratively.
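For illustration, the rule "a missing contributor link means unhealthy data" could be expressed as a single Cypher query behind a Spring Data Neo4j repository. This is only a sketch under the assumptions made in the entity sketch above (node labels Contributor/Report and the CONTRIBUTES_TO relationship); the repository and method names are hypothetical:

```java
import org.springframework.data.neo4j.annotation.Query;
import org.springframework.data.neo4j.repository.Neo4jRepository;
import org.springframework.data.repository.query.Param;

// Hypothetical repository: a report is healthy only if every enrolled contributor
// has a CONTRIBUTES_TO link to it
public interface ReportRepository extends Neo4jRepository<Report, Long> {

    @Query("MATCH (c:Contributor) WITH count(c) AS enrolled "
         + "OPTIONAL MATCH (:Contributor)-[l:CONTRIBUTES_TO]->(:Report {name: $reportName}) "
         + "RETURN count(l) = enrolled AS healthy")
    boolean isHealthy(@Param("reportName") String reportName);
}
```

Because new contributors are just additional nodes and relationships, new health-influencing metrics can be enrolled later without changing this deduction logic.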
"Data Health As Service" is built using:
- Java 14
- Spring Boot 2.3.3
- Neo4j
- Spring Data Neo4j
Below are the steps to run this service (example commands follow the list):
- Run the Neo4j Docker container
- Run the DataHealthServiceApplication
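As a concrete illustration of those two steps (the Neo4j image tag, ports, and credentials below are assumptions; adjust them to your environment, and use the Maven wrapper or your IDE to start the application):

```bash
# Start a local Neo4j instance (HTTP on 7474, Bolt on 7687)
docker run --name neo4j -p 7474:7474 -p 7687:7687 -e NEO4J_AUTH=neo4j/secret neo4j:4.1

# Start the Spring Boot application from the project root
./mvnw spring-boot:run
```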
For further reference, please consider the following sections:
- Official Apache Maven documentation
- Spring Boot Maven Plugin Reference Guide
- Create an OCI image
- Spring Web
- Spring Data Neo4j
The following guides illustrate how to use some features concretely: