Skip to content

Commit

Permalink
Merge branch 'current' into 1-8-init
Browse files Browse the repository at this point in the history
  • Loading branch information
matthewshaver authored Feb 8, 2024
2 parents 8c2c3d5 + a353f9a commit 29faad7
Showing 1 changed file with 14 additions and 14 deletions.
28 changes: 14 additions & 14 deletions website/blog/2023-01-24-aggregating-test-failures.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,9 +16,9 @@ Testing the quality of data in your warehouse is an important aspect in any matu

<!--truncate-->

At [Tempus](https://www.tempus.com/), a precision medicine company specializing in oncology, high quality data is a necessary component for high quality clinical models. With roughly 1,000 dbt models, nearly a hundred data sources, and a dozen different data quality stakeholders, producing a framework that allows stakeholders to take action on test failures is challenging. Without an actionable framework, data quality tests can backfire — in early 2022, we had nearly a thousand tests, hundreds of which failed on a daily basis yet were wholly ignored.
Producing a data quality framework that allows stakeholders to take action on test failures is challenging. Without an actionable framework, data quality tests can backfire — one failing test becomes two becomes ten and suddenly you have too many test failures to act on any of them.

Recently, we overhauled our testing framework. We cut the number of tests down to 200, creating a more mature framework that includes metadata and emphasizes actionability. Our system for managing data quality is a three step process, described below:
Recently, we overhauled our testing framework. We cut the number of tests down by 80% to create a more mature framework that includes metadata and emphasizes actionability. Our system for managing data quality is a three step process, described below:

1. Leveraging the contextual knowledge of stakeholders, writing specific, high quality data tests, perpetuating test failure results into aliased models for easy access.
1. Aggregating test failure results using Jinja macros and pre-configured metadata to pull together high level summary tables.
Expand All @@ -37,35 +37,35 @@ Data Integrity tests (Generic Tests)  are simple — they’re tests akin to a
```yaml
version: 2
models:
- name: patient
- name: customer
columns:
- name: id
description: Unique ID associated with the record
tests:
- unique:
alias: patient__id__unique
alias: id__unique
- not_null:
alias: patient__id__not_null
alias: id__not_null
```
<center><i>Example Data Integrity Tests in a YAML file — the alias argument is an important piece that will be touched on later.</i></center><br />
Context Driven Tests are more complex and look a lot more like models. Essentially, they’re data models that select bad data or records we don’t want, defined as SQL files that live in the `dbt/tests` directory. An example is shown below —

```sql
{{ config(
tags=['check_birth_date_in_range', 'patient'],
alias='ad_hoc__check_birth-date_in_range'
tags=['check_purchase_date_in_range', 'customer'],
alias='ad_hoc__check_purchase_date_in_range
)
}}
SELECT
id,
birth_date
purchase_date
FROM
{{ ref('patient') }}
WHERE birth_date < '1900-01-01'
{{ ref('customer') }}
WHERE purchase_date < '1900-01-01'
```
<center><i>The above test selects all patients with a birth date before 1900, due to data rules we have about maximum patient age.</i></center><br />
<center><i>The above test selects all customers who have made a purchase before 1900. The idea is that any customer that exists before 1900 probably isn't real.</i></center><br />

Importantly, we leverage [Test Aliasing](https://docs.getdbt.com/reference/resource-configs/alias) to ensure that our tests all follow a standard and predictable naming convention; our naming convention for Data Integrity tests is *table_name_ _column_name__test_name*, and our naming convention for Context Driven Tests is *ad_hoc__test_name*. Finally, to ensure all of our tests can then be aggregated, we modify the `dbt_project.yml` file  and [set the `store_failures` tag to ‘TRUE’](https://docs.getdbt.com/reference/resource-configs/store_failures), thus persisting test failures into SQL tables.

Expand All @@ -86,15 +86,15 @@ After defining our metadata Seed file, we begin the process of aggregating our d
incremental_strategy = 'merge',
unique_key='row_key',
full_refresh=false,
tags=['dq_test_warning_failures','clinical_mart', 'data_health']
tags=['dq_test_warning_failures','customer_mart', 'data_health']
)
}}
WITH failures as (
SELECT
count(*) as test_failures,
_TABLE_SUFFIX as table_suffix,
FROM {{ var('clinical_mart_schema') }}_dbt_test__audit.`*`
FROM {{ var('customer_mart_schema') }}_dbt_test__audit.`*`
GROUP BY _TABLE_SUFFIX
),

Expand Down Expand Up @@ -131,4 +131,4 @@ With our finalized data quality base table, there are many other options for cle

First, we create views on top of the base table that filter down by test owner. We strongly believe that test noise is the biggest risk towards the success of a quality framework. Creating specific views is like giving each team a magnifying glass that lets them zoom into only the tests they care about. We also have a dashboard, currently in Google Looker Studio, that shows historical test failures with a suite of filters to let users magnify high severity tests and constructs machine-composed example queries for users to select failing records. When a test fails, a business analyst can copy and paste a query from the dashboard and get all the relevant information.

As with any framework, it’s always a work in progress — we still encounter issues with noise in our tests, and still struggle to wrangle our users to care when a test fails. However, we’ve found that this data framework works exceptionally well at enabling data users to create and deploy their own tests. All they need to do is submit a pull request with SQL code that flags bad data, and write one line of metadata.
As with any framework, it’s always a work in progress — we still encounter issues with noise in our tests, and still struggle to wrangle our users to care when a test fails. However, we’ve found that this data framework works exceptionally well at enabling data users to create and deploy their own tests. All they need to do is submit a pull request with SQL code that flags bad data, and write one line of metadata.

0 comments on commit 29faad7

Please sign in to comment.