

# Data testing


Data testing is a specialized type of testing that emphasizes the evaluation of data processed by systems, encompassing aspects such as data transformations, data integrity rules, and data processing logic. Its purpose is to evaluate various attributes of data to identify data quality issues, such as duplication, missing data, or errors. By performing data testing, organizations can establish a foundation of reliable and trustworthy data for their systems, which in turn enables informed decision-making, efficient business operations, and positive customer experiences.

**Topics**
+ [Indicators for data testing](indicators-for-data-testing.md)
+ [Anti-patterns for data testing](anti-patterns-for-data-testing.md)
+ [Metrics for data testing](metrics-for-data-testing.md)

# Indicators for data testing


Validate the integrity, accuracy, and consistency of data processes to help ensure that data operations, from input to storage and retrieval, maintain quality and reliability standards.

**Topics**
+ [[QA.DT.1] Ensure data integrity and accuracy with data quality tests](qa.dt.1-ensure-data-integrity-and-accuracy-with-data-quality-tests.md)
+ [[QA.DT.2] Enhance understanding of data through data profiling](qa.dt.2-enhance-understanding-of-data-through-data-profiling.md)
+ [[QA.DT.3] Validate data processing rules with data logic tests](qa.dt.3-validate-data-processing-rules-with-data-logic-tests.md)
+ [[QA.DT.4] Detect and mitigate data issues with anomaly detection](qa.dt.4-detect-and-mitigate-data-issues-with-anomaly-detection.md)
+ [[QA.DT.5] Utilize incremental metrics computation](qa.dt.5-utilize-incremental-metrics-computation.md)

# [QA.DT.1] Ensure data integrity and accuracy with data quality tests

 **Category:** RECOMMENDED 

 Data quality tests assess the accuracy, consistency, and overall quality of the data used within the application or system. These tests typically involve validating data against predefined rules and checking for duplicate or missing data to ensure the dataset remains reliable. While data quality testing might not fall under the traditional definitions of functional or non-functional testing, it's still an essential aspect of ensuring that an application or system functions correctly, as the quality of data can significantly impact the overall performance, user experience, and reliability of the software. 

We recommend data quality tests because they enable rapid software delivery and continuous improvement of data-driven systems. Using data quality tests, teams can spend more of their time focusing on how data should appear rather than continually checking it for accuracy, streamlining the development and deployment process. Data quality tools let you calculate data quality metrics on your dataset, define and verify data quality constraints, and be informed about changes in the data distribution. Instead of implementing checks and verification algorithms on your own, you can focus on describing how your data should look.
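The idea of describing how data should look, rather than hand-rolling verification code, can be sketched in plain Python (this is an illustrative example, not the Deequ or AWS Glue Data Quality API; the records and constraint names are hypothetical):

```python
# Illustrative sketch of declarative data quality constraints: describe how
# the data should look, then evaluate every constraint against the dataset.
records = [
    {"order_id": 1, "amount": 120.0, "status": "shipped"},
    {"order_id": 2, "amount": 75.5,  "status": "pending"},
    {"order_id": 2, "amount": 75.5,  "status": "pending"},  # duplicate row
    {"order_id": 3, "amount": None,  "status": "shipped"},  # missing value
]

constraints = {
    "amount is complete": lambda rows: all(r["amount"] is not None for r in rows),
    "order_id is unique": lambda rows: len({r["order_id"] for r in rows}) == len(rows),
    "status is valid": lambda rows: all(r["status"] in {"pending", "shipped"} for r in rows),
}

# Evaluate all constraints and collect the failures for reporting.
results = {name: check(records) for name, check in constraints.items()}
failed = [name for name, passed in results.items() if not passed]
```

In a pipeline, a non-empty `failed` list would typically fail the build or quarantine the batch; tools such as Deequ apply the same pattern at Spark scale.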

**Related information:**
+  [Getting started with AWS Glue Data Quality from the AWS Glue Data Catalog](https://aws.amazon.com/blogs/big-data/getting-started-with-aws-glue-data-quality-from-the-aws-glue-data-catalog/) 
+  [Deequ - Unit Tests for Data](https://github.com/awslabs/deequ) 
+  [Test data quality at scale with Deequ](https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/) 
+  [How to Architect Data Quality on the AWS Cloud](https://aws.amazon.com/blogs/industries/how-to-architect-data-quality-on-the-aws-cloud/) 

# [QA.DT.2] Enhance understanding of data through data profiling

 **Category:** OPTIONAL 

Use data profiling tools to examine, analyze, and understand the data, including its content, structure, and relationships, to identify issues such as inconsistencies, outliers, and missing values. By performing data profiling, teams can gain deeper insights into the characteristics and quality of their data, enabling them to make informed decisions about data management, data governance, and data integration strategies. This data is often used to enable or improve other types of data testing.

To integrate data profiling into a DevOps environment, consider automating the process using data profiling tools such as [AWS Glue DataBrew](https://aws.amazon.com/glue/features/databrew/), open-source tools, or custom scripts that analyze data regularly. Incorporate the profiling results into your data management, governance, and integration strategies, allowing your team to proactively address data quality issues and maintain consistent data standards throughout the development lifecycle.
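A profiling pass boils down to computing summary statistics per column. As a minimal sketch in plain Python (not the DataBrew or Deequ API; the sample rows are hypothetical):

```python
# Illustrative data profiling sketch: compute per-column statistics that
# surface completeness, cardinality, and range issues in a dataset.
from collections import Counter

def profile_column(rows, column):
    values = [r.get(column) for r in rows]
    present = [v for v in values if v is not None]
    numeric = [v for v in present if isinstance(v, (int, float))]
    return {
        "completeness": len(present) / len(values) if values else 0.0,
        "distinct": len(set(present)),
        "most_common": Counter(present).most_common(1),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
    }

rows = [
    {"city": "Seattle", "temp": 12},
    {"city": "Seattle", "temp": 15},
    {"city": "Austin",  "temp": None},  # missing reading
]
profile = profile_column(rows, "temp")
```

Running such a profile on a schedule and diffing the results against the previous run is one lightweight way to feed profiling into the automated workflow described above.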

**Related information:**
+  [Build an automatic data profiling and reporting solution with Amazon EMR, AWS Glue, and Amazon QuickSight](https://aws.amazon.com/blogs/big-data/build-an-automatic-data-profiling-and-reporting-solution-with-amazon-emr-aws-glue-and-amazon-quicksight/) 
+  [Test data quality at scale with Deequ](https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/) 
+  [Deequ single column profiling](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/data_profiling_example.md) 
+  [AWS Glue DataBrew](https://aws.amazon.com/glue/features/databrew/) 

# [QA.DT.3] Validate data processing rules with data logic tests

 **Category:** OPTIONAL 

 Data logic tests verify the accuracy and reliability of data processing and transformation within your application, ensuring that it functions as intended. 

 Establish test cases for data processing workflows and transformation functions, confirming that expected outcomes are achieved. Use version control systems to track changes in data logic and collaborate effectively with team members. Implement automated data logic tests in development and staging environments, which can be triggered by code commits or scheduled intervals, to proactively identify and fix issues before they reach production environments. 
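A data logic test is, in essence, a unit test for a transformation: given known input, assert the expected output. A minimal Python sketch (the `normalize_currency` transformation is hypothetical):

```python
# Illustrative data logic test: verify that a transformation function
# produces the expected output for known inputs, including pass-through fields.
def normalize_currency(record):
    """Hypothetical transformation: convert cents to dollars and tag the unit."""
    return {**record, "amount": record["amount_cents"] / 100, "unit": "USD"}

def test_normalize_currency():
    out = normalize_currency({"order_id": 7, "amount_cents": 1999})
    assert out["amount"] == 19.99
    assert out["unit"] == "USD"
    assert out["order_id"] == 7  # unrelated fields pass through unchanged

test_normalize_currency()
```

Wired into a CI pipeline, such tests run on every commit to the transformation code, catching logic regressions before they reach production data.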

**Related information:**
+  [Test data quality at scale with Deequ](https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/) 
+  [Deequ automatic suggestion of constraints](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/constraint_suggestion_example.md) 

# [QA.DT.4] Detect and mitigate data issues with anomaly detection

 **Category:** OPTIONAL 

 Data anomaly detection is a specialized form of anomaly detection which focuses on identifying unusual patterns or behaviors in data quality metrics that may indicate data quality issues. 

 Consider integrating machine learning algorithms and statistical methods into your data quality monitoring processes. Use tools that can detect and address data anomalies in real-time and incorporate them into your development and deployment workflows. This enables automated assessment of the accuracy and reliability of data processing and analysis, enhancing the overall performance of your applications and systems. 
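One simple statistical method is to compare the latest value of a data quality metric against its recent history and flag large deviations. A minimal sketch in plain Python (not the Deequ or Lookout for Metrics API; the row counts are hypothetical):

```python
# Illustrative anomaly detection sketch: flag a metric value that deviates
# from its recent history by more than a threshold number of standard deviations.
from statistics import mean, stdev

def is_anomalous(history, latest, threshold=3.0):
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu  # flat history: any change is anomalous
    return abs(latest - mu) / sigma > threshold

# Daily row counts of an ingested table (hypothetical history).
row_counts = [1000, 1020, 980, 1010, 995, 1005]
```

With this history, `is_anomalous(row_counts, 200)` flags a sudden data loss while `is_anomalous(row_counts, 1008)` does not; production tools typically add seasonality handling and adaptive thresholds on top of the same idea.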

**Related information:**
+  [What Is Anomaly Detection?](https://aws.amazon.com/what-is/anomaly-detection/) 
+  [Test data quality at scale with Deequ](https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/) 
+  [Deequ anomaly detection](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/anomaly_detection_example.md) 
+  [Amazon Lookout for Metrics](https://aws.amazon.com/lookout-for-metrics/) 
+  [Introducing Amazon Lookout for Metrics: An anomaly detection service to proactively monitor the health of your business](https://aws.amazon.com/blogs/machine-learning/introducing-amazon-lookout-for-metrics-an-anomaly-detection-service-to-proactively-monitor-the-health-of-your-business/) 
+  [Amazon QuickSight: ML-powered anomaly detection for outliers](https://docs.aws.amazon.com/quicksight/latest/user/anomaly-detection-function.html) 
+  [Amazon Kinesis: Detecting Data Anomalies on a Stream](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/app-anomaly-detection.html) 

# [QA.DT.5] Utilize incremental metrics computation

 **Category:** OPTIONAL 

 Incremental metrics computation allows teams to efficiently monitor and maintain data quality without needing to recompute metrics on the entire dataset every time data is updated. Use this method to significantly reduce computational resources and time spent on data quality testing, allowing for more agile and responsive data management practices.  

 Start by identifying the specific data quality metrics that are essential for your system. This could include metrics related to accuracy, completeness, timeliness, and consistency. Depending on your dataset's size and complexity, select a tool or framework that supports incremental computation. Some modern data processing tools, such as [Apache Spark](https://spark.apache.org/) and [Deequ](https://github.com/awslabs/deequ), provide built-in support for incremental computations. 

 Segment your data into logical partitions, often based on time, such as daily or hourly partitions. As new data is added, it becomes a new partition. Automate the computation process by setting up triggers that initiate the metric computation whenever new data is added or an existing partition is updated. 

 Continuously monitor the updated metrics to help ensure they reflect the true state of your data. Periodically validate the results of the incremental metrics computation against a full computation to ensure accuracy. As you get more familiar with the process, look for ways to optimize the computation to save even more on computational resources. This could involve refining your partitions or improving the computation logic. 
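The partitioning approach above relies on keeping a small mergeable state per partition, so that dataset-wide metrics are combined from states instead of rescanning all data. A minimal sketch in plain Python (not Deequ's stateful API; the partition values are hypothetical):

```python
# Illustrative incremental metrics sketch: compute a mergeable state
# (row count, null count, sum) per partition, then derive dataset-wide
# metrics by merging states instead of rescanning every partition.
def partition_state(values):
    present = [v for v in values if v is not None]
    return {"rows": len(values), "nulls": len(values) - len(present), "sum": sum(present)}

def merge(a, b):
    return {k: a[k] + b[k] for k in a}

def metrics(state):
    rows, nulls = state["rows"], state["nulls"]
    return {
        "completeness": (rows - nulls) / rows if rows else 0.0,
        "mean": state["sum"] / (rows - nulls) if rows > nulls else None,
    }

monday = partition_state([10, 20, None])
tuesday = partition_state([30, 40])   # new partition: only this is computed
total = merge(monday, tuesday)        # dataset metrics without a full rescan
```

Metrics whose state merges this way (counts, sums, min/max, sketches) suit incremental computation; exact medians do not, which is one reason to validate incremental results against a periodic full computation.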

**Related information:**
+  [Deequ stateful metrics computation](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/algebraic_states_example.md) 

# Anti-patterns for data testing
+  **Test data drift:** Testing in environments that do not mirror production datasets can result in exercising outdated data schemas, different configurations, or data that is not representative of real-world conditions. Tests that pass in a non-representative environment might fail in production, leading to undetected data issues. Ensure that testing environments mirror production as closely as possible, both in terms of configuration and the nature of the data. Regularly update testing environment datasets to reflect changes in production. 

# Metrics for data testing
+  **Data test coverage:** The percentage of your dataset or data processing logic covered by tests. Data test coverage gives an overview of potential untested or under-tested areas of the application. High coverage helps ensure a comprehensive evaluation, while low coverage can indicate blind spots in testing. Improve this metric by prioritizing areas with lower coverage, using automation and tools to enhance coverage, and regularly reviewing test strategies to help ensure they align with recent changes in the dataset or processing logic. Measure this metric by calculating the ratio of code or data elements covered by tests to the total lines of code or data elements in the application. 
+  **Test case run time**: The duration taken to run a test case or a suite of test cases. Increasing duration may highlight bottlenecks in the test process or performance issues emerging in the software under test. Improve this metric by optimizing test scripts and the order they run in, enhancing testing infrastructure, and running tests in parallel. Measure this metric by calculating the timestamp difference between the start and end of test case execution. 
+  **Data quality score**: The combined quality of data in a system, encompassing facets such as consistency, completeness, correctness, accuracy, validity, and timeliness. Derive the data quality score by individually assessing each facet. Then aggregate and normalize them into a single metric, typically ranging from 0 to 100, with higher scores indicating better data quality. The specific method for aggregating the scores may vary depending on the organization's data quality framework and the relative importance assigned to each facet. Consider factors like the uniformity of data values (consistency), the presence or absence of missing values (completeness), the degree of data accuracy relative to real-world entities (correctness and accuracy), the adherence of the data to predefined rules (validity), and the currency and relevance of the data (timeliness). 
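The data quality score aggregation can be illustrated with a short sketch (the facet values and weights below are hypothetical; each facet is assumed already normalized to the 0-1 range):

```python
# Illustrative data quality score: aggregate normalized facet scores with
# weights into a single 0-100 value. Facet values and weights are hypothetical.
facets = {"consistency": 0.95, "completeness": 0.90, "correctness": 0.99,
          "validity": 0.97, "timeliness": 0.85}
weights = {"consistency": 0.20, "completeness": 0.25, "correctness": 0.25,
           "validity": 0.15, "timeliness": 0.15}

# Weighted average of facets, scaled to a 0-100 score.
score = 100 * sum(facets[f] * weights[f] for f in facets) / sum(weights.values())
```

The weights encode the relative importance your organization assigns to each facet; tracking the score over time is usually more informative than any single snapshot.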