Data Quality Gate is a Terraform module that enables data engineers and data QA professionals to effortlessly set up the Provectus DataQA solution within their infrastructure in a single click. It is AWS-based and built on the solid foundation of Great Expectations, YData Profiling (ex. Pandas Profiling), and Allure.
The main engine, based on Great Expectations (GX), is used to profile, generate suites, and run tests.
The mapping from the GX format into the Allure Test Report tool is executed,
The existing metadata and metrics are aggregated and pushed down the pipeline.
- AWS Lambda Runtime: Utilizes Python 3.9.
- AWS Step Functions Pipeline: Incorporates the entire DataQA cycle, including profiling, test generation, and reporting.
- Notifications and Reporting: Offers support for Slack and Jira notifications and reporting.
- AWS SNS: Outputs message bus, allowing for seamless integration with existing data pipelines.
- Web Reports Delivery: Provides report delivery via Nginx for company-specific VPN/IP settings.
- AWS DynamoDB and Athena Integration: Enables the construction of AWS QuickSight or Grafana dashboards.
- Configuration Management: Provides a flexible method for managing configurations of underlying technologies like Allure and Great Expectations.
module "data_qa" {
source = "github.com/provectus/data-quality-gate"
data_test_storage_bucket_name = "my-data-settings-dev"
s3_source_data_bucket = "my-data-bucket"
environment = "example"
project = "my-project"
allure_report_image_uri = "xxxxxxxxxxxx.dkr.ecr.xx-xxxx-x.amazonaws.com/dqg-allure_report:latest"
data_test_image_uri = "xxxxxxxxxxxx.dkr.ecr.xx-xxxx-x.amazonaws.com/dqg-data_test:latest"
push_report_image_uri = "xxxxxxxxxxxx.dkr.ecr.xx-xxxx-x.amazonaws.com/dqg-push_reportt:latest"
data_reports_notification_settings = {
channel = "DataReportSlackChannelName"
webhook_url = "https://hooks.slack.com/services/xxxxxxxxxxxxxxx"
}
lambda_private_subnet_ids = ["private_subnet_id"]
lambda_security_group_ids = ["security_group_id"]
reports_vpc_id = "some_vpc_id"
reports_subnet_id = "subnet_id"
reports_whitelist_ips = ["0.0.0.0/0"]
}
The tool can be used as a standard Terraform module, with deployment examples provided in the examples
directory.
- Data-QA-Basic - Creates a DataQA module that builds AWS infrastructure.
See the functions for further details.
This solution is completely free because it is open source. However, if you want to integrate it into a live/production environment, there will be associated costs due to its cloud-based nature. These costs can be divided into two parts: the required infrastructure (which you may already have in place, such as VPCs and subnets) and the AWS services necessary for data quality implementation.
Note: All the information provided below has been calculated using the maximum score strategy.
AWS Service | Approximate monthly cost | Description |
---|---|---|
AWS S3 and DynamoDB endpoints | - | There is no extra charge for gateway-type endpoints. You only pay for the usage of S3 and DynamoDB itself. |
AWS Interface VPC endpoints(secrets manager, monitoring, sns) | 3 endpoints * (30 days * 24 hours * 0.01 rate) = 21.6 USD | Interface endpoints charged by hour. 1 hour = $0.01 |
AWS ECRs (allure, data_test, reports, notifications) | 7 versions * (865mb + 432mb + 380mb) => 11.3gb * 0.1 rate per gb month= 1.13 USD | allure image size = 865mb, data_test image size = 432mb, reports image size = 380mb, notifications image size = 160mb. For the purpose of our calculations, let's assume we are storing 7 versions of each image. |
AWS QuickSight | $7.3 aprx rate per user * 5 = 36.4 USD | Let's assume you have a team consisting of 5 individuals who are interested in the QuickSight data quality dashboard. They frequently check for changes, typically 2-3 times per day. |
Monthly total is
For most of the services used by Data Quality, AWS offers a free-tier supply. Additionally, the costs for these services are typically just a fraction of a cent. To provide further clarity, below you can find a basic cost formula and a few usage examples with cost estimations.
We are going to count:
- number of AWS Lambda runs
- number of AWS StepFunction transitions
- web reports AWS EC2 instance running(720 hrs per month)
Description | Formula |
---|---|
number of AWS Lambda runs for each | (number of data sources * number of changes * work_days_month) * lambda specific rate(depends on lambda duration and memory used) |
number of AWS StepFunction transitions | number of lambda runs * 2 |
Let's say we have 1000 data sources and half of them changed every day. Number of runs formula for any lambda is (1000 data sources * 0.5 changed * 30 days)
AWS Service | Number of runs | Price |
---|---|---|
AWS Lambda AllureReport | 15000 | $8.33 |
AWS Lambda DataTest | 15000 | $67.28 |
AWS Lambda Reports | 15000 | $2.08 |
AWS StepFunctions | 15000 | $0.65 |
AWS EC2 Reports S3 Gateway | 720 hrs | $7.25 |
Monthly total: 85.59 US$
Let's say we have 10000 data sources and 70% of them changed every day. Number of runs formula for any lambda is (10000 data sources * 0.7 changes * 30 days)
AWS Service | Number of runs | Price |
---|---|---|
AWS Lambda AllureReport | 210k | $203.33 |
AWS Lambda DataTest | 210k | $1028.57 |
AWS Lambda Reports | 210k | $115.83 |
AWS StepFunctions | 210k | $10.40 |
AWS EC2 Reports S3 Gateway | 720 hrs | $7.25 |
Monthly total: 1 365.38 US$
Let's say we have 30000 data sources and all of them changed every day. Number of runs formula for any lambda is (30000 data sources * 1 changes * 30 days)
AWS Service | Number of runs | Price |
---|---|---|
AWS Lambda AllureReport | 900k | $893.34 |
AWS Lambda DataTest | 900k | $4430.06 |
AWS Lambda Reports | 900k | $518.33 |
AWS StepFunctions | 900k | $44.90 |
AWS EC2 Reports S3 Gateway | 720 hrs | $7.25 |
Monthly total: 5 893.88 US$
Price per changed data source: 0.006 US$
Apache 2 Licensed. See LICENSE for full details.