

# Indicators for data testing


Validate the integrity, accuracy, and consistency of data processes to help ensure that data operations, from input to storage and retrieval, maintain quality and reliability standards.

**Topics**
+ [[QA.DT.1] Ensure data integrity and accuracy with data quality tests](qa.dt.1-ensure-data-integrity-and-accuracy-with-data-quality-tests.md)
+ [[QA.DT.2] Enhance understanding of data through data profiling](qa.dt.2-enhance-understanding-of-data-through-data-profiling.md)
+ [[QA.DT.3] Validate data processing rules with data logic tests](qa.dt.3-validate-data-processing-rules-with-data-logic-tests.md)
+ [[QA.DT.4] Detect and mitigate data issues with anomaly detection](qa.dt.4-detect-and-mitigate-data-issues-with-anomaly-detection.md)
+ [[QA.DT.5] Utilize incremental metrics computation](qa.dt.5-utilize-incremental-metrics-computation.md)

# [QA.DT.1] Ensure data integrity and accuracy with data quality tests

 **Category:** RECOMMENDED 

 Data quality tests assess the accuracy, consistency, and overall quality of the data used within the application or system. These tests typically involve validating data against predefined rules and checking for duplicate or missing data to ensure the dataset remains reliable. While data quality testing might not fall under the traditional definitions of functional or non-functional testing, it's still an essential aspect of ensuring that an application or system functions correctly, as the quality of data can significantly impact the overall performance, user experience, and reliability of the software. 

 We recommend data quality tests because they enable rapid software delivery and continuous improvement of data-driven systems. With data quality tests in place, teams can spend more time describing how their data should look and less time continually checking it for accuracy, streamlining the development and deployment process. Define and verify data quality constraints, calculate data quality metrics on your dataset, and get notified about changes in the data distribution, instead of implementing checks and verification algorithms on your own. 
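As a minimal sketch of this declarative style, the checks below validate a small list-of-dicts dataset against constraints on completeness, uniqueness, and value ranges. Tools such as Deequ or AWS Glue Data Quality apply the same idea at scale; all column names and rules here are illustrative.

```python
def check_completeness(rows, column):
    """Fraction of rows where the column is present and non-null."""
    non_null = sum(1 for r in rows if r.get(column) is not None)
    return non_null / len(rows) if rows else 0.0

def check_uniqueness(rows, column):
    """True if no two rows share a value in the column."""
    values = [r.get(column) for r in rows]
    return len(values) == len(set(values))

def run_checks(rows):
    """Describe how the data *should* look; report which constraints hold."""
    return {
        "id_complete": check_completeness(rows, "id") == 1.0,
        "id_unique": check_uniqueness(rows, "id"),
        "age_in_range": all(0 <= r["age"] <= 130
                            for r in rows if r.get("age") is not None),
    }

orders = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 58},
    {"id": 2, "age": 19},   # duplicate id: uniqueness constraint should fail
]
print(run_checks(orders))
```

Running the checks flags the duplicate `id` while the completeness and range constraints pass, so the failing constraint points directly at the data issue.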

**Related information:**
+  [Getting started with AWS Glue Data Quality from the AWS Glue Data Catalog](https://aws.amazon.com/blogs/big-data/getting-started-with-aws-glue-data-quality-from-the-aws-glue-data-catalog/) 
+  [Deequ - Unit Tests for Data](https://github.com/awslabs/deequ) 
+  [Test data quality at scale with Deequ](https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/) 
+  [How to Architect Data Quality on the AWS Cloud](https://aws.amazon.com/blogs/industries/how-to-architect-data-quality-on-the-aws-cloud/) 

# [QA.DT.2] Enhance understanding of data through data profiling

 **Category:** OPTIONAL 

 Use data profiling tools to examine, analyze, and understand the data including its content, structure, and relationships to identify issues such as inconsistencies, outliers, and missing values. By performing data profiling, teams can gain deeper insights into the characteristics and quality of their data, enabling them to make informed decisions about data management, data governance, and data integration strategies. This data is often used to enable or improve other types of data testing. 

 To integrate data profiling into a DevOps environment, consider automating the process using data profiling tools such as [AWS Glue DataBrew](https://aws.amazon.com/glue/features/databrew/), open-source tools, or custom scripts that analyze data regularly. Incorporate the profiling results into your data management, governance, and integration strategies, allowing your team to proactively address data quality issues and maintain consistent data standards throughout the development lifecycle. 
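The custom-script option can be as simple as the sketch below, which summarizes one column's content, completeness, and distribution over a list-of-dicts dataset. Tools such as AWS Glue DataBrew or Deequ produce far richer profiles; the column names here are illustrative.

```python
from collections import Counter
import statistics

def profile_column(rows, column):
    """Summarize content, structure, and completeness of one column."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    profile = {
        "count": len(values),
        "missing": len(values) - len(non_null),   # missing values
        "distinct": len(set(non_null)),
    }
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    if non_null and len(numeric) == len(non_null):
        # Numeric column: report range and mean to surface outliers.
        profile.update({"min": min(numeric), "max": max(numeric),
                        "mean": statistics.mean(numeric)})
    else:
        # Categorical column: report the most frequent values.
        profile["top_values"] = Counter(non_null).most_common(3)
    return profile

customers = [{"age": 34}, {"age": 58}, {"age": None}, {"age": 19}]
print(profile_column(customers, "age"))
```

Feeding profiles like this into scheduled jobs lets the team watch for drifting ranges or growing missing-value counts between runs.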

**Related information:**
+  [Build an automatic data profiling and reporting solution with Amazon EMR, AWS Glue, and Amazon QuickSight](https://aws.amazon.com/blogs/big-data/build-an-automatic-data-profiling-and-reporting-solution-with-amazon-emr-aws-glue-and-amazon-quicksight/) 
+  [Test data quality at scale with Deequ](https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/) 
+  [Deequ single column profiling](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/data_profiling_example.md) 
+  [AWS Glue DataBrew](https://aws.amazon.com/glue/features/databrew/) 

# [QA.DT.3] Validate data processing rules with data logic tests

 **Category:** OPTIONAL 

 Data logic tests verify the accuracy and reliability of data processing and transformation within your application, ensuring that it functions as intended. 

 Establish test cases for data processing workflows and transformation functions, confirming that expected outcomes are achieved. Use version control systems to track changes in data logic and collaborate effectively with team members. Implement automated data logic tests in development and staging environments, which can be triggered by code commits or scheduled intervals, to proactively identify and fix issues before they reach production environments. 
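A data logic test can look like an ordinary unit test over a transformation function, as in this sketch. The transformation itself (a hypothetical order normalizer) is illustrative; the point is pinning expected outputs to known inputs so regressions surface on every commit.

```python
def normalize_order(raw):
    """Hypothetical transformation: clean fields and compute a derived total."""
    return {
        "customer": raw["customer"].strip().lower(),
        "total": round(raw["quantity"] * raw["unit_price"], 2),
    }

def test_normalize_order():
    """Data logic test: known input must yield the expected outcome."""
    result = normalize_order(
        {"customer": "  Alice ", "quantity": 3, "unit_price": 1.5})
    assert result == {"customer": "alice", "total": 4.5}

test_normalize_order()
```

Wired into a CI pipeline, tests like this run on each code commit, so a change to the transformation logic fails fast in development or staging rather than in production.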

**Related information:**
+  [Test data quality at scale with Deequ](https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/) 
+  [Deequ automatic suggestion of constraints](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/constraint_suggestion_example.md) 

# [QA.DT.4] Detect and mitigate data issues with anomaly detection

 **Category:** OPTIONAL 

 Data anomaly detection is a specialized form of anomaly detection which focuses on identifying unusual patterns or behaviors in data quality metrics that may indicate data quality issues. 

 Consider integrating machine learning algorithms and statistical methods into your data quality monitoring processes. Use tools that can detect and address data anomalies in real-time and incorporate them into your development and deployment workflows. This enables automated assessment of the accuracy and reliability of data processing and analysis, enhancing the overall performance of your applications and systems. 
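As one of the simplest statistical methods, the sketch below flags values of a data quality metric (here, hypothetical daily row counts) that deviate strongly from the historical mean. Real detectors, ML-based or otherwise, are more robust; with short histories a single outlier inflates the standard deviation, which is why a modest threshold is used here.

```python
import statistics

def detect_anomalies(history, threshold=2.0):
    """Return indices of metric values far from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return []  # constant series: nothing deviates
    return [i for i, v in enumerate(history)
            if abs(v - mean) / stdev > threshold]

daily_row_counts = [1000, 1020, 980, 1010, 990, 120, 1005]
print(detect_anomalies(daily_row_counts))  # → [5], the day the pipeline dropped rows
```

Triggering such a check after each pipeline run turns a silent data drop into an alert before downstream consumers are affected.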

**Related information:**
+  [What Is Anomaly Detection?](https://aws.amazon.com/what-is/anomaly-detection/) 
+  [Test data quality at scale with Deequ](https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/) 
+  [Deequ anomaly detection](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/anomaly_detection_example.md) 
+  [Amazon Lookout for Metrics](https://aws.amazon.com/lookout-for-metrics/) 
+  [Introducing Amazon Lookout for Metrics: An anomaly detection service to proactively monitor the health of your business](https://aws.amazon.com/blogs/machine-learning/introducing-amazon-lookout-for-metrics-an-anomaly-detection-service-to-proactively-monitor-the-health-of-your-business/) 
+  [Amazon QuickSight: ML-powered anomaly detection for outliers](https://docs.aws.amazon.com/quicksight/latest/user/anomaly-detection-function.html) 
+  [Amazon Kinesis: Detecting Data Anomalies on a Stream](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/app-anomaly-detection.html) 

# [QA.DT.5] Utilize incremental metrics computation

 **Category:** OPTIONAL 

 Incremental metrics computation allows teams to efficiently monitor and maintain data quality without needing to recompute metrics on the entire dataset every time data is updated. Use this method to significantly reduce computational resources and time spent on data quality testing, allowing for more agile and responsive data management practices.  

 Start by identifying the specific data quality metrics that are essential for your system. This could include metrics related to accuracy, completeness, timeliness, and consistency. Depending on your dataset's size and complexity, select a tool or framework that supports incremental computation. Some modern data processing tools, such as [Apache Spark](https://spark.apache.org/) and [Deequ](https://github.com/awslabs/deequ), provide built-in support for incremental computations. 

 Segment your data into logical partitions, often based on time, such as daily or hourly partitions. As new data is added, it becomes a new partition. Automate the computation process by setting up triggers that initiate the metric computation whenever new data is added or an existing partition is updated. 

 Continuously monitor the updated metrics to help ensure they reflect the true state of your data. Periodically validate the results of the incremental metrics computation against a full computation to ensure accuracy. As you get more familiar with the process, look for ways to optimize the computation to save even more on computational resources. This could involve refining your partitions or improving the computation logic. 
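The partition-and-merge workflow above can be sketched as follows: each partition yields a small mergeable state (count, null count, sum), and dataset-level metrics are derived from merged states without rescanning raw data. Deequ's algebraic states follow the same principle; the column name and metrics here are illustrative.

```python
def partition_state(rows):
    """Compute a mergeable state for one partition (e.g. one day of data)."""
    values = [r.get("amount") for r in rows]
    non_null = [v for v in values if v is not None]
    return {"count": len(values),
            "nulls": len(values) - len(non_null),
            "sum": sum(non_null)}

def merge_states(a, b):
    """Combine two partition states without touching the raw data again."""
    return {k: a[k] + b[k] for k in a}

def metrics(state):
    """Derive dataset-level quality metrics from a merged state."""
    completeness = 1 - state["nulls"] / state["count"] if state["count"] else 0.0
    filled = state["count"] - state["nulls"]
    return {"completeness": completeness,
            "mean": state["sum"] / filled if filled else None}

# When a new partition arrives, only its state is computed, then merged.
day1 = partition_state([{"amount": 10}, {"amount": None}])
day2 = partition_state([{"amount": 30}, {"amount": 20}])
print(metrics(merge_states(day1, day2)))
```

The periodic validation step then amounts to comparing `metrics(merged_state)` against a full recomputation over all partitions; the two should agree exactly for algebraic metrics like these.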

**Related information:**
+  [Deequ stateful metrics computation](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/algebraic_states_example.md) 