
aivanzhang/data_checks


Data Checks


Create, schedule, and deploy data quality checks.

Questions and feedback: https://calendly.com/ivanzhang0/demo-with-ivan

See Examples for sample checks such as anomaly and PII detection.

See Docs for a comprehensive list of all available features.

Overview

Existing data observability solutions are painfully static. data_checks provides a dynamic data observability framework that lets you reuse existing Python code, write new Python code, or both, to define data quality checks that can then be scheduled and monitored. Inspired by Python's unittest, data_checks lets you write data quality checks as easily as you would write unit tests for your code.
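To make the unittest analogy concrete, here is a minimal, self-contained sketch of the pattern: rules are plain methods whose names start with `rule_`, and a rule fails when an assertion inside it fails. `SketchCheck`, `run_all`, and `OrdersCheck` are hypothetical names invented for this illustration; this is not the data_checks implementation, only the general idea it borrows from unittest.

```python
import time


class SketchCheck:
    """Hypothetical stand-in base class; NOT the data_checks implementation."""

    def run_all(self):
        # Collect every method whose name starts with "rule_", much like
        # unittest collects methods whose names start with "test".
        names = sorted(n for n in dir(self) if n.startswith("rule_"))
        results = {}
        for name in names:
            start = time.perf_counter()
            try:
                getattr(self, name)()
                results[name] = ("passed", None)
            except AssertionError as exc:
                # A failed assertion marks the rule as failed; other rules still run.
                results[name] = ("failed", str(exc))
            elapsed = time.perf_counter() - start
            print(f"{name}: {results[name][0]} ({elapsed:.6f}s)")
        return results


class OrdersCheck(SketchCheck):
    # Illustrative in-memory data; a real check would query a data source.
    orders = [{"id": 1, "total": 20.0}, {"id": 2, "total": -5.0}]

    def rule_ids_are_unique(self):
        ids = [o["id"] for o in self.orders]
        assert len(ids) == len(set(ids)), "Duplicate order ids"

    def rule_totals_are_non_negative(self):
        bad = [o["id"] for o in self.orders if o["total"] < 0]
        assert not bad, f"Negative totals for orders: {bad}"


results = OrdersCheck().run_all()
```

In this sketch one rule passes and one fails, and the failure does not stop the remaining rules from running.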

Quickstart

1) Installation

Install the latest version of data_checks using pip:

pip install pydata-checks

2) Start a Data Check project

Initialize a new data_checks project by running the init command from your project directory (for example, /Users/USERNAME/Desktop/PROJECT_NAME):

python -m data_checks.init

This starts a series of prompts that guide you through initializing a new data_checks project. For example:

$ python -m data_checks.init
Enter the relative file path of the directory where suites will be stored: my_first_data_checks_project/suites
Directory '/Users/USERNAME/Desktop/PROJECT_NAME/my_first_data_checks_project/suites' does not exist.
Would you like to create it? [y/n]: y
Enter the relative file path of the directory where checks will be stored: my_first_data_checks_project/checks
Directory '/Users/USERNAME/Desktop/PROJECT_NAME/my_first_data_checks_project/checks' does not exist.
Would you like to create it? [y/n]: y
Enter the default CRON schedule: * * * * *
Enter the database URL: database_url
Enter the alerting endpoint URL:
check_settings.py generated.
my_first_data_check.py generated.

This creates the following structure inside your project directory:

PROJECT_NAME
├── my_first_data_checks_project
│   ├── __init__.py
│   ├── checks
│   │   ├── __init__.py
│   │   └── my_first_data_check.py
│   └── suites
│       └── __init__.py
└── check_settings.py
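The default CRON schedule entered in the prompts above (`* * * * *`) uses the standard five-field cron layout, in which an asterisk in every field means "every minute". The sketch below only names the five fields so that a schedule is easier to read; `describe_cron` is a hypothetical helper written for this example, and whether data_checks accepts extensions beyond the five standard fields is not covered here.

```python
# Standard five-field cron layout, in order.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression):
    """Split a five-field cron expression into named fields (hypothetical helper)."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"Expected 5 fields, got {len(parts)}: {expression!r}")
    return dict(zip(CRON_FIELDS, parts))


every_minute = describe_cron("* * * * *")  # the value used in the prompts above
daily_at_six = describe_cron("0 6 * * *")  # minute 0 of hour 6, every day
print(every_minute)
print(daily_at_six)
```

For a real project, a less frequent schedule such as `0 6 * * *` (once a day at 06:00) is usually a more practical default than every minute.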

3) Set the CHECK_SETTINGS_MODULE environment variable to point to the check_settings.py file

export CHECK_SETTINGS_MODULE=check_settings
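The value is a Python module name (`check_settings`), not a file path, so the module has to be importable from wherever you run the commands, for example from the directory that contains check_settings.py. The sketch below shows one common way a tool can resolve such a variable, using only the standard library. It is an assumption about the general pattern, not a description of how data_checks loads its settings; `load_settings`, `demo_check_settings`, and `DEFAULT_SCHEDULE` are hypothetical names.

```python
import importlib
import os
import sys
import types


def load_settings(env_var="CHECK_SETTINGS_MODULE"):
    """Import the module named by env_var (hypothetical helper)."""
    module_name = os.environ.get(env_var)
    if not module_name:
        raise RuntimeError(f"{env_var} is not set")
    # import_module takes a dotted module name, not a path ending in .py
    return importlib.import_module(module_name)


# Demo: register an in-memory stand-in module so the example runs anywhere.
demo = types.ModuleType("demo_check_settings")
demo.DEFAULT_SCHEDULE = "* * * * *"  # hypothetical setting name
sys.modules["demo_check_settings"] = demo
os.environ["CHECK_SETTINGS_MODULE"] = "demo_check_settings"

settings = load_settings()
print(settings.DEFAULT_SCHEDULE)
```

If the variable is unset, or the named module cannot be found on the import path, a loader like this fails early with a clear error.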

4) Run the autogenerated data check

python -m data_checks.do.run_check MyFirstDataCheck

Output:

[1/1 checks] MyFirstDataCheck
	[1/2 Rules] rule_my_first_failed_rule
This rule failed
DataCheckException(severity=1.0, exception=This rule failed, metadata={'rule': 'rule_my_first_failed_rule', 'params': {'args': (), 'kwargs': {}}})
	[2/2 Rules] rule_my_first_successful_rule
		rule_my_first_successful_rule took 0.0 seconds

5) Modify the autogenerated data check

Open the my_first_data_checks_project/checks/my_first_data_check.py file and customize the data check to your liking. For instance, you can make rule_my_first_failed_rule always pass by removing the exception:

from data_checks.classes.data_check import DataCheck


class MyFirstDataCheck(DataCheck):
    ...

    def rule_my_first_failed_rule(self):
        # This rule will now succeed
        assert True, "This rule now succeeds"

    ...
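A more realistic rule asserts something about actual data. The sketch below assumes, as the autogenerated rule suggests, that a rule fails when an assertion inside it fails. To stay runnable without the package installed, `UsersCheckSketch` is a plain class; in a real project it would subclass DataCheck as in the snippet above. The class name, the sample rows, and the rule name are all hypothetical.

```python
class UsersCheckSketch:
    """Plain stand-in class; a real check would subclass DataCheck."""

    # Hypothetical sample rows standing in for a query result.
    rows = [
        {"id": 1, "email": "a@example.com"},
        {"id": 2, "email": None},
        {"id": 3, "email": "c@example.com"},
    ]

    def rule_email_not_null(self):
        # Collect offending ids so the failure message is actionable.
        missing = [row["id"] for row in self.rows if row["email"] is None]
        assert not missing, f"{len(missing)} row(s) have no email: ids {missing}"


check = UsersCheckSketch()
try:
    check.rule_email_not_null()
    outcome = "passed"
except AssertionError as exc:
    outcome = f"failed: {exc}"
print(outcome)
```

Putting the offending ids in the assertion message makes the failure easier to act on when it shows up in the check output.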

Rerun the data check:

python -m data_checks.do.run_check MyFirstDataCheck

Output:

[1/1 checks] MyFirstDataCheck
	[1/2 Rules] rule_my_first_successful_rule
		rule_my_first_successful_rule took 9.5367431640625e-07 seconds
	[2/2 Rules] rule_my_first_failed_rule
		rule_my_first_failed_rule took 9.5367431640625e-07 seconds

🎉 Congrats! 🎉 You've created and executed your first data check! See the documentation for more information on how to write more advanced checks and suites, and on other features like scheduling and alerting.

