Create, schedule, and deploy data quality checks.
Questions and feedback: https://calendly.com/ivanzhang0/demo-with-ivan
See Examples for examples like anomaly and PII detection.
See Docs for a comprehensive list of all available features.
Existing data observability solutions are painfully static. data_checks provides a dynamic data observability framework that lets you reuse existing Python code, or write new Python code, to define data quality checks that can then be easily scheduled and monitored. Inspired by Python's unittest, data_checks lets you write data quality checks as easily as you would write unit tests for your code.
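To make the unittest analogy concrete, here is a minimal, self-contained sketch of the pattern: a class groups related rules, each rule is a method whose name starts with rule_, and a rule fails when an assertion fails. ToyCheck, OrdersCheck, and the sample rows are hypothetical stand-ins written for this illustration. They are not data_checks' implementation and deliberately do not import the library.

```python
# Toy stand-in for illustration only; this is NOT data_checks' implementation.
class ToyCheck:
    """Runs every method whose name starts with `rule_`, unittest-style."""

    def run_all(self):
        results = {}
        for name in sorted(dir(self)):
            if not name.startswith("rule_"):
                continue
            try:
                getattr(self, name)()
                results[name] = "passed"
            except AssertionError as exc:
                results[name] = f"failed: {exc}"
        return results


class OrdersCheck(ToyCheck):
    # Hypothetical in-memory rows standing in for a real table.
    orders = [{"id": 1, "amount": 20.0}, {"id": 2, "amount": -5.0}]

    def rule_ids_are_unique(self):
        ids = [row["id"] for row in self.orders]
        assert len(ids) == len(set(ids)), "duplicate order ids"

    def rule_amounts_are_non_negative(self):
        bad = [row["id"] for row in self.orders if row["amount"] < 0]
        assert not bad, f"negative amounts for ids {bad}"


results = OrdersCheck().run_all()
for rule, outcome in results.items():
    print(rule, "->", outcome)
```

With the sample rows above, the uniqueness rule passes and the non-negative rule fails, since order 2 has a negative amount.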
Install the latest version of data_checks using pip:
pip install pydata-checks
Initialize a new data_checks project by running the init command from your project directory (/Users/USERNAME/Desktop/PROJECT_NAME):
python -m data_checks.init
This will start a series of prompts that will guide you through the process of initializing a new data_checks project. For example:
$ python -m data_checks.init
Enter the relative file path of the directory where suites will be stored: my_first_data_checks_project/suites
Directory '/Users/USERNAME/Desktop/PROJECT_NAME/my_first_data_checks_project/suites' does not exist.
Would you like to create it? [y/n]: y
Enter the relative file path of the directory where checks will be stored: my_first_data_checks_project/checks
Directory '/Users/USERNAME/Desktop/PROJECT_NAME/my_first_data_checks_project/checks' does not exist.
Would you like to create it? [y/n]: y
Enter the default CRON schedule: * * * * *
Enter the database URL: database_url
Enter the alerting endpoint URL:
check_settings.py generated.
my_first_data_check.py generated.
This will create a new directory with the following structure:
PROJECT_NAME
├── my_first_data_checks_project
│   ├── __init__.py
│   ├── checks
│   │   ├── __init__.py
│   │   └── my_first_data_check.py
│   └── suites
│       └── __init__.py
└── check_settings.py
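The default schedule entered during initialization, * * * * *, is a standard five-field cron expression that fires every minute, which is probably more frequent than most checks need. The sketch below only names the five fields so the schedule is easier to adjust. describe_cron is a hypothetical helper written for this illustration, not part of data_checks.

```python
# Field order of a standard five-field cron expression.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression):
    """Map each field of a five-field cron expression to its name."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


every_minute = describe_cron("* * * * *")
nightly = describe_cron("0 2 * * *")  # 02:00 every day
print(every_minute)
print(nightly)
```

A schedule such as 0 2 * * * would run once a day at 02:00 instead of every minute.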
Set the settings module and run your first data check:
export CHECK_SETTINGS_MODULE=check_settings
python -m data_checks.do.run_check MyFirstDataCheck
Output:
[1/1 checks] MyFirstDataCheck
[1/2 Rules] rule_my_first_failed_rule
This rule failed
DataCheckException(severity=1.0, exception=This rule failed, metadata={'rule': 'rule_my_first_failed_rule', 'params': {'args': (), 'kwargs': {}}})
[2/2 Rules] rule_my_first_successful_rule
rule_my_first_successful_rule took 0.0 seconds
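The CHECK_SETTINGS_MODULE variable exported before the run names the settings module to load. How data_checks resolves it internally is not shown here; the sketch below is an assumption about the general pattern (an environment variable holding a dotted module path that is imported at runtime). load_settings, demo_check_settings, and DEFAULT_SCHEDULE are hypothetical names used only for this illustration.

```python
import importlib
import os
import sys
import types


def load_settings(env_var="CHECK_SETTINGS_MODULE"):
    """Import the module whose dotted path is stored in env_var."""
    module_path = os.environ.get(env_var)
    if not module_path:
        raise RuntimeError(f"{env_var} is not set")
    return importlib.import_module(module_path)


# Register a stand-in module so the sketch runs without a real check_settings.py.
fake = types.ModuleType("demo_check_settings")
fake.DEFAULT_SCHEDULE = "* * * * *"  # hypothetical setting name
sys.modules["demo_check_settings"] = fake

os.environ["CHECK_SETTINGS_MODULE"] = "demo_check_settings"
settings = load_settings()
print(settings.DEFAULT_SCHEDULE)
```

Under this pattern, forgetting the export would surface as an explicit error rather than a silent fallback, which is why the variable is set before running the check.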
Open my_first_data_checks_project/checks/my_first_data_check.py and customize the data check to your liking. For instance, you can make rule_my_first_failed_rule always pass by removing the exception:
from data_checks.classes.data_check import DataCheck


class MyFirstDataCheck(DataCheck):
    ...

    def rule_my_first_failed_rule(self):
        # This rule will now succeed
        assert True, "This rule now succeeds"

    ...
Rerun the data check:
python -m data_checks.do.run_check MyFirstDataCheck
Output:
[1/1 checks] MyFirstDataCheck
[1/2 Rules] rule_my_first_successful_rule
rule_my_first_successful_rule took 9.5367431640625e-07 seconds
[2/2 Rules] rule_my_first_failed_rule
rule_my_first_failed_rule took 9.5367431640625e-07 seconds
🎉 Congrats! 🎉 You've created and executed your first data check! See the documentation for more information on writing more advanced checks and suites, and on other features like scheduling and alerting.