

# 1 – Monitor the health of the analytics application workload

 **How do you measure the health of your analytics workload?** Data analytics workloads often involve multiple systems and process steps working in coordination. It is imperative that you monitor not only individual components but also the interaction of dependent processes to ensure a healthy data analytics workload. 


|  **ID**  |  **Priority**  |  **Best practice**  | 
| --- | --- | --- | 
|  ☐ BP 1.1   |  Required  |  Validate the data quality of source systems before transferring data for analytics.  | 
|  ☐ BP 1.2   |  Required  |  Monitor operational metrics of data processing jobs and the availability of source data. | 

 For more details, refer to the following information: 
+ AWS Big Data Blog: [Monitor data pipelines in a serverless data lake](https://aws.amazon.com/blogs/big-data/monitor-data-pipelines-in-a-serverless-data-lake/)
+ AWS Compute Blog: [Monitoring and troubleshooting serverless data analytics applications](https://aws.amazon.com/blogs/compute/monitoring-and-troubleshooting-serverless-data-analytics-applications/)
+ AWS Big Data Blog: [Building a serverless data quality and analysis framework with Deequ and AWS Glue](https://aws.amazon.com/blogs/big-data/building-a-serverless-data-quality-and-analysis-framework-with-deequ-and-aws-glue/)

# Best practice 1.1 – Validate the data quality of source systems before transferring data for analytics

Data quality can have an intrinsic impact on the success or failure of your organization's data analytics projects. To avoid committing significant resources to processing potentially poor-quality data, your organization should understand the quality of the source data and monitor changes to data quality throughout the data pipeline.

Data source validation can often be performed quickly on a subset of the latest data range to look for data defects. Such defects include missing values, anomalous data, or wrong data types, any of which could cause the analytics job to fail or to complete with inaccurate results.
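As an illustration, such a spot check can be sketched in plain Python against a small sample of recent records. The field names and defect rules below are assumptions for the sake of the example, not part of any particular source system; in practice the sample would come from a query limited to the latest partition or date range.

```python
from datetime import datetime

# Hypothetical sample of the most recent source records.
sample = [
    {"order_id": "1001", "amount": "19.99", "order_date": "2024-05-01"},
    {"order_id": "1002", "amount": "",      "order_date": "2024-05-01"},
    {"order_id": "1003", "amount": "oops",  "order_date": "2024-05-02"},
]

def validate_sample(rows):
    """Return (row index, description) pairs for defects found in a sample."""
    defects = []
    for i, row in enumerate(rows):
        # Missing value check.
        if not row.get("amount"):
            defects.append((i, "missing value: amount"))
        else:
            # Wrong-type check: amount must parse as a number.
            try:
                float(row["amount"])
            except ValueError:
                defects.append((i, "wrong type: amount is not numeric"))
        # Wrong-type check: order_date must be an ISO date.
        try:
            datetime.strptime(row["order_date"], "%Y-%m-%d")
        except ValueError:
            defects.append((i, "wrong type: order_date is not a date"))
    return defects

defects = validate_sample(sample)
```

Running the check on a sample rather than the full dataset keeps the validation fast enough to run before every transfer.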

For more details, refer to the following information:
+ AWS Blog: [How to Architect Data Quality on the AWS Cloud](https://aws.amazon.com/blogs/industries/how-to-architect-data-quality-on-the-aws-cloud/)
+ AWS Blog: [Getting started with AWS Glue Data Quality from the AWS Glue Data Catalog](https://aws.amazon.com/blogs/big-data/getting-started-with-aws-glue-data-quality-from-the-aws-glue-data-catalog/)

## Suggestion 1.1.1 – Implement data quality validation mechanisms

 The critical attributes of data quality that should be measured and tracked through your environment are completeness, accuracy, and uniqueness. Validating and measuring your data quality using metrics is important to build trust in your data, which increases data adoption throughout your organization.
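A minimal sketch of how these three attributes might be computed for a batch of records follows. The field names and the validity rule are hypothetical; in an AWS pipeline these checks would more typically be expressed as AWS Glue Data Quality rules rather than hand-rolled code.

```python
def quality_metrics(rows, key, required, valid):
    """Compute simple data quality metrics for a list of records.

    completeness: share of rows where every required field is non-empty
    uniqueness:   share of distinct values in the key column
    accuracy:     share of rows passing a caller-supplied validity check
    """
    n = len(rows)
    complete = sum(all(r.get(f) not in (None, "") for f in required) for r in rows)
    unique = len({r[key] for r in rows})
    accurate = sum(valid(r) for r in rows)
    return {
        "completeness": complete / n,
        "uniqueness": unique / n,
        "accuracy": accurate / n,
    }

rows = [
    {"id": "a", "amount": "10"},
    {"id": "a", "amount": "20"},   # duplicate key
    {"id": "b", "amount": ""},     # missing value
    {"id": "c", "amount": "-5"},   # fails the validity rule below
]
metrics = quality_metrics(
    rows, key="id", required=["id", "amount"],
    # Illustrative accuracy rule: amount must be a non-negative integer.
    valid=lambda r: r["amount"].lstrip("-").isdigit() and float(r["amount"]) >= 0,
)
```

Publishing these numbers alongside each dataset gives consumers an objective basis for trusting (or questioning) the data.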

 For more details, refer to the following information: 
+ AWS Big Data Blog: [Set up advanced rules to validate quality of multiple datasets with AWS Glue Data Quality](https://aws.amazon.com/blogs/big-data/set-up-advanced-rules-to-validate-quality-of-multiple-datasets-with-aws-glue-data-quality/)
+ AWS Big Data Blog: [Getting started with AWS Glue Data Quality for ETL Pipelines](https://aws.amazon.com/blogs/big-data/getting-started-with-aws-glue-data-quality-for-etl-pipelines/)
+ AWS Big Data Blog: [Set up alerts and orchestrate data quality rules with AWS Glue Data Quality](https://aws.amazon.com/blogs/big-data/set-up-alerts-and-orchestrate-data-quality-rules-with-aws-glue-data-quality/)
+ AWS Big Data Blog: [Enforce customized data quality rules in AWS Glue DataBrew](https://aws.amazon.com/blogs/big-data/enforce-customized-data-quality-rules-in-aws-glue-databrew/)
+ AWS Big Data Blog: [Build a data quality score card using AWS Glue DataBrew, Amazon Athena, and Amazon QuickSight](https://aws.amazon.com/blogs/big-data/build-a-data-quality-score-card-using-aws-glue-databrew-amazon-athena-and-amazon-quicksight/)

## Suggestion 1.1.2 – Notify stakeholders and use business logic to determine how to remediate data that is not valid

Alerts and notifications play a crucial role in maintaining data quality because they facilitate prompt and efficient responses to any data quality issues that may arise within a dataset. By establishing and configuring alerts and notifications, you can actively monitor data quality and receive timely alerts when data quality issues are identified. This proactive approach helps mitigate the risk of making decisions based on inaccurate information. 

It's often most efficient to impute missing values, but in other cases it's better to block processing until the data quality issue can be resolved at the source.
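One way to express such business logic is a small remediation policy. The sketch below assumes a single numeric `amount` field and an illustrative 5% missing-value threshold; both are placeholders for whatever rule your stakeholders agree on.

```python
def remediate(rows, null_threshold=0.05):
    """Decide whether to impute missing values or block the pipeline.

    Illustrative business rule: if fewer than null_threshold of rows are
    missing 'amount', impute with the mean of observed values; otherwise
    stop and escalate, since the defect likely sits in the source system.
    """
    missing = [r for r in rows if r["amount"] is None]
    ratio = len(missing) / len(rows)
    if ratio >= null_threshold:
        # In a real pipeline this branch would also notify stakeholders
        # and halt downstream jobs.
        return "blocked", rows
    observed = [r["amount"] for r in rows if r["amount"] is not None]
    mean = sum(observed) / len(observed)
    for r in missing:
        r["amount"] = mean  # impute with the mean of observed values
    return "imputed", rows

# A defect rate below the threshold is imputed in place.
status, out = remediate(
    [{"amount": 10}, {"amount": 20}, {"amount": None}], null_threshold=0.5
)
```

The threshold makes the trade-off explicit: small, routine gaps are repaired automatically, while a large defect stops the pipeline before bad data propagates.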

## Suggestion 1.1.3 – Score and share the quality of your datasets

To improve ongoing trust in data quality and adoption of your organization's datasets, consider creating a data quality scorecard that relevant teams can access, advertising the quality score of your datasets and any known issues with the data. This information can be incorporated into your Data Catalog.
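A scorecard of this kind can be as simple as a per-dataset table of metric values rolled up into a single score. The sketch below assumes metrics already computed elsewhere and normalized to the range 0–1; the dataset names and figures are illustrative.

```python
def score_card(datasets):
    """Build a simple per-dataset quality score card.

    Each entry maps a dataset name to metric values in [0, 1]; the score
    is the mean of the metrics, scaled to 0-100. The resulting table can
    be published alongside each dataset's Data Catalog entry.
    """
    card = {}
    for name, metrics in datasets.items():
        score = round(100 * sum(metrics.values()) / len(metrics))
        card[name] = {"score": score, **metrics}
    return card

card = score_card({
    "orders": {"completeness": 0.98, "uniqueness": 1.0, "accuracy": 0.96},
    "clicks": {"completeness": 0.80, "uniqueness": 0.70, "accuracy": 0.90},
})
```

A single headline score makes it easy for consumers to compare datasets at a glance, while the underlying metrics remain available for teams that need the detail.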

# Best practice 1.2 – Monitor operational metrics of data processing jobs and the availability of source data

Data processing pipelines often consist of multiple steps that must run in sequence to produce the desired datasets and meet business deadlines. Monitoring each job in the pipeline is key to ensuring operational excellence. Monitor the operational metrics of the jobs themselves, as well as the availability of source data and whether the expected results are produced.

For example, if your pipeline runs on a fixed schedule and there is no new source data to process, the pipeline may still appear healthy because it runs without failures. Similarly, if the pipeline is triggered by the arrival of new source data and no new data arrives, it can appear healthy simply because there are no failed runs to alert on.

## Suggestion 1.2.1 – Alert when new data has not arrived or become available within the expected time

You should monitor the time when new data arrives or becomes available, and alert when too much time has passed since the last occurrence. Even if the jobs in your data processing pipeline run flawlessly, the quality of the results depends on the quality and availability of the source data.

In a complex data pipeline, it can also be necessary to monitor that each stage produces results within an expected time frame, since delays affect downstream stages.
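The core check is straightforward to sketch: compare the timestamp of the most recent arrival against an expected delivery interval. In a real pipeline the timestamp would come from, for example, the newest object under a source prefix, and the alert would be routed to an operator channel; both are assumed away here.

```python
from datetime import datetime, timedelta, timezone

def data_is_late(last_arrival, expected_interval, now=None):
    """Return True when too much time has passed since new data arrived.

    last_arrival would typically come from object metadata, such as the
    last-modified timestamp of the newest object under a source prefix.
    """
    now = now or datetime.now(timezone.utc)
    return now - last_arrival > expected_interval

# Example: data last arrived three hours ago, but hourly delivery is expected.
late = data_is_late(
    datetime.now(timezone.utc) - timedelta(hours=3),
    expected_interval=timedelta(hours=1),
)
```

The expected interval should reflect the source's actual delivery cadence, with some slack so normal variation doesn't page an operator.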

## Suggestion 1.2.2 – Alert when data processing jobs don’t complete on time or don’t produce results

You should monitor the running time of data processing jobs and alert when too much time has passed since the last completed run. You should also alert if a job does not produce a result. With monitoring and alerts you can discover jobs that fail, and also jobs that fail silently by not producing results. 

The expected completion time should be based on the normal running time of the job, with some margin. The margin is needed because the running time of data processing jobs depends on the amount of data they process. Jobs that start as a result of new data becoming available also don't have a set starting time, which should be factored into the margin.
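One way to derive such a margin is from the job's own history, for example the mean running time plus a few standard deviations. The figures below are illustrative, and the three-standard-deviation margin is an assumption to be tuned per job.

```python
from statistics import mean, stdev

def completion_deadline(runtimes_minutes, margin_stdevs=3):
    """Derive an alert threshold from historical job running times.

    The deadline is the mean runtime plus a margin of a few standard
    deviations, to absorb normal variation in input data volume.
    """
    return mean(runtimes_minutes) + margin_stdevs * stdev(runtimes_minutes)

# Recent runs took 50-60 minutes; alert if the next run exceeds the deadline.
deadline = completion_deadline([50, 55, 60, 52, 58])
```

Recomputing the deadline periodically from a rolling window of recent runs lets the threshold track gradual growth in data volume instead of going stale.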

For very long-running jobs, it can also be necessary to monitor the start time and alert when too much time has passed since the last start. Waiting until the expected completion time to discover a failure can sometimes introduce too much delay.