

# 6 – Design resilience for analytics workloads
<a name="design-principle-6"></a>

 How do you design analytics workloads to withstand and mitigate failures? 


|   **ID**   |   **Priority**   |   **Best practice**   | 
| --- | --- | --- | 
|  ☐ BP 6.1   |  Required  |  Create an illustration of data flow dependencies.  | 
|  ☐ BP 6.2   |  Required  |  Monitor analytics systems to detect analytics or extract, transform, and load (ETL) job failures.  | 
|  ☐ BP 6.3   |  Required  |  Notify stakeholders about analytics or ETL job failures.  | 
|  ☐ BP 6.4   |  Recommended  |  Automate the recovery of analytics and ETL job failures.  | 
|  ☐ BP 6.5   |  Recommended  |  Build a disaster recovery (DR) plan for the analytics infrastructure and the data.  | 

 For more details, refer to the following documentation: 
+  AWS Glue Developer Guide: [Running and Monitoring AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/monitor-glue.html) 
+  AWS Glue Developer Guide: [Monitoring with Amazon CloudWatch](https://docs.aws.amazon.com/glue/latest/dg/monitor-cloudwatch.html) 
+  AWS Glue Developer Guide: [Monitoring AWS Glue Using Amazon CloudWatch Metrics](https://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) 
+  AWS Prescriptive Guidance – Patterns: [Orchestrate an ETL pipeline with validation, transformation, and partitioning using AWS Step Functions](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/orchestrate-an-etl-pipeline-with-validation-transformation-and-partitioning-using-aws-step-functions.html) 
+  AWS Support Knowledge Center: [How can I use a Lambda function to receive SNS alerts when an AWS Glue job fails a retry?](https://aws.amazon.com/premiumsupport/knowledge-center/glue-job-fail-retry-lambda-sns-alerts/) 
+  AWS Glue Developer Guide: [Repairing and Resuming a Workflow Run](https://docs.aws.amazon.com/glue/latest/dg/resuming-workflow.html) 

# Best practice 6.1 – Create an illustration of data flow dependencies
<a name="best-practice-6.1-create-an-illustration-of-data-flow-dependencies."></a>

 Work with business stakeholders to create a visual illustration of the data pipeline. Identify the systems that interact with each dependency. The key architecture components that are expected to be captured are data acquisition, ingestion, data transformation, data processing, data storage, data protection and governance, and data consumption. All system dependencies need owners. Agree within your organization who owns which dependency. 

# Best practice 6.2 – Monitor analytics systems to detect analytics or extract, transform, and load (ETL) job failures
<a name="best-practice-6.2---monitor-analytics-systems-to-detect-analytics-or-etl-job-failures."></a>

 Detect extract, transform, and load (ETL) and analytics job failures as soon as possible. Pinpointing where and how the error occurred is critical for notifications and corrective actions. 

## Suggestion 6.2.1 – Monitor and track job errors from different levels, including infrastructure, ETL workflow, and ETL application code
<a name="suggestion-6.2.1---monitor-and-track-job-errors-from-different-levels"></a>

 Failures can occur at all levels of the analytics system. Each task in the analytics workload should be instrumented to provide metrics indicating the health of the task. Monitor the emitted metrics and raise alarms if any components fail. Create dashboards to visualize metrics and govern access to them. 

 For more details, refer to the following: 
+ [ Visualize data warehouse metrics: Query and visualize Amazon Redshift operational metrics using the Amazon Redshift plugin for Grafana ](https://aws.amazon.com/blogs/big-data/query-and-visualize-amazon-redshift-operational-metrics-using-the-amazon-redshift-plugin-for-grafana/)
+ [ Visualize Amazon EMR metrics: Monitor Amazon EMR on Amazon EKS with Amazon Managed Prometheus and Amazon Managed Grafana ](https://aws.amazon.com/blogs/mt/monitoring-amazon-emr-on-eks-with-amazon-managed-prometheus-and-amazon-managed-grafana/)
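As one concrete sketch of alarming on component failure, the following builds the parameters for a CloudWatch alarm on a Glue job's failed-task metric. The job name and SNS topic ARN are placeholders, and the metric is only emitted when job metrics are enabled for the job.

```python
# Hypothetical sketch: parameters for a CloudWatch alarm that fires when a
# Glue job reports failed tasks. Job name and topic ARN are placeholders.
def failed_task_alarm_params(job_name, sns_topic_arn):
    """Return kwargs for boto3 CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"{job_name}-failed-tasks",
        "Namespace": "Glue",
        "MetricName": "glue.driver.aggregate.numFailedTasks",
        "Dimensions": [
            {"Name": "JobName", "Value": job_name},
            {"Name": "JobRunId", "Value": "ALL"},
            {"Name": "Type", "Value": "count"},
        ],
        "Statistic": "Sum",
        "Period": 300,               # evaluate over 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # notify the alerting topic
    }

# Usage: boto3.client("cloudwatch").put_metric_alarm(**failed_task_alarm_params(...))
```

The same pattern extends to other Glue CloudWatch metrics, such as executor or memory metrics, by swapping the metric name and statistic.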

## Suggestion 6.2.2 – Establish end-to-end monitoring for the complete analytics and ETL pipeline
<a name="suggestion-6.2.2---establish-end-to-end-monitoring-for-the-complete-analytics-and-etl-pipeline."></a>

 End-to-end monitoring allows tracking the flow of data as it passes through the analytics system. In many cases, data processing might be dependent on application logic, such as sampling a subset of data from a data stream to check accuracy. Properly identifying and monitoring the end-to-end flow of data allows detecting at which step the analytics and ETL job fails. 
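One way to detect at which step a pipeline fails is to route Glue job state-change events through EventBridge. The sketch below shows an event pattern matching failed or timed-out runs, plus a minimal local matcher for illustration; the matcher logic is a simplification of EventBridge's full pattern semantics.

```python
# Sketch: an EventBridge event pattern for failed Glue job runs, and a
# minimal illustrative matcher. Real routing would attach this pattern to
# an EventBridge rule targeting an alerting or recovery workflow.
FAILURE_PATTERN = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"state": ["FAILED", "TIMEOUT"]},
}

def matches(event, pattern=FAILURE_PATTERN):
    """Return True if the event satisfies every field of the pattern."""
    for key, allowed in pattern.items():
        value = event.get(key)
        if isinstance(allowed, dict):
            # Nested pattern: recurse into the corresponding sub-object.
            if not isinstance(value, dict) or not matches(value, allowed):
                return False
        elif value not in allowed:
            return False
    return True
```

Attaching one rule per pipeline stage makes the failing step visible directly from the event stream, without parsing job logs.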

## Suggestion 6.2.3 – Determine what data was processed when the job failed
<a name="suggestions-6.2.3---determine-what-data-was-processed-when-the-job-failed."></a>

 Failures in data processing systems can cause data integrity or data quality issues. Determine what data was being processed at the time of failure and perform quality checks of both the input and output data. If possible, roll back the committed data and restart your job. 

 For more details, see AWS Glue: [Overview of Data Quality in AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/workflows_overview.html). 

## Suggestion 6.2.4 – Classify the severity of the job failures based on the type of failure and the business impact
<a name="suggestions-6.2.4---classify-the-severity-of-the-job-failures-based-on-the-type-of-failure-and-the-business-impact."></a>

 Classifying the severity of different job failures helps you prioritize remediation and guides the notification requirements for key stakeholders. Agree on classifications based on each job's importance and the impact its failure has on meeting internal and external SLAs. 
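A classification scheme agreed with stakeholders can be encoded directly so that alerting and routing logic stays consistent. The sketch below is illustrative only: the failure categories and severity levels are assumptions, not a standard.

```python
# Illustrative sketch: map a failure type and its SLA impact to a severity
# level. The categories and levels are assumptions to be agreed with
# business stakeholders.
def classify_failure(failure_type, breaches_external_sla, breaches_internal_sla):
    if failure_type == "data_corruption" or breaches_external_sla:
        return "critical"   # page on-call and notify stakeholders immediately
    if breaches_internal_sla:
        return "high"       # remediate within the same business day
    if failure_type == "transient":
        return "low"        # retried automatically; log only
    return "medium"
```

Keeping this mapping in one place means the notification system (best practice 6.3) and the recovery automation (best practice 6.4) share the same definition of severity.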

# Best practice 6.3 – Notify stakeholders about analytics or ETL job failures
<a name="best-practice-6.3---notify-stakeholders-about-analytics-or-etl-job-failures."></a>

 Analytics and ETL job failures can impact the SLAs for delivering the data on time for downstream analytics workloads. Failures might cause data quality or data integrity issues as well. Notify all stakeholders about a job failure as soon as possible so that any needed remediation actions can be taken. Stakeholders may include IT operations, help desk, data sources, analytics, and downstream workloads. 

 For more details, see AWS Well-Architected: [Design your Workload to Withstand Component Failures](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/design-your-workload-to-withstand-component-failures.html) 

## Suggestion 6.3.1 – Establish automated notifications to predefined recipients
<a name="suggestions-6.3.1---establish-automated-notifications-to-predefined-recipients."></a>

 Use services such as Amazon Simple Notification Service (Amazon SNS) to send automated emails, SMS alerts, or both in the event of failure. Store the alert logs in an immutable data store for future reference. 
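A minimal sketch of such a notification, with the SNS client injected so the formatting can be tested separately; in production the client would be `boto3.client("sns")`, and the topic ARN and message fields are placeholders.

```python
# Hedged sketch: publish a structured failure alert to an SNS topic.
# Topic ARN and message fields are placeholders.
import json
from datetime import datetime, timezone

def notify_failure(sns_client, topic_arn, job_name, error_message):
    message = {
        "job": job_name,
        "error": error_message,
        "detected_at": datetime.now(timezone.utc).isoformat(),
    }
    return sns_client.publish(
        TopicArn=topic_arn,
        Subject=f"ETL job failed: {job_name}"[:100],  # SNS subjects max 100 chars
        Message=json.dumps(message),  # structured body for downstream parsing
    )
```

Publishing a JSON body rather than free text makes it straightforward to archive alerts in an immutable store and parse them later.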

## Suggestion 6.3.2 – Do not include sensitive data in notifications
<a name="suggestions-6.3.2-do-not-include-sensitive-data-in-notifications."></a>

 Automated alerts often include information that is useful for troubleshooting the failure. Ensure that personally identifiable information (PII) and other sensitive data, such as personal, medical, or financial details, is not shared in failure notifications. 

For more details, see AWS Glue: [Detect and process sensitive data](https://docs.aws.amazon.com/glue/latest/dg/detect-PII.html).
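As a last line of defense, alert text can be scrubbed before sending. The patterns below are examples only and will not catch every case; real detection should rely on a managed capability such as Glue's sensitive-data detection.

```python
# Minimal sketch: scrub a few common sensitive patterns from alert text.
# Example patterns only, not a complete PII detector.
import re

REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),   # email addresses
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),       # US SSN shape
    (re.compile(r"\b(?:\d[ -]*?){13,16}\b"), "[CARD]"),    # card-like digit runs
]

def redact(text):
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```

Applying this to every outgoing notification keeps a stray record value in an error message from leaking into email or SMS channels.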

## Suggestion 6.3.3 – Integrate the analytics job failure notification solution with the enterprise operation management system
<a name="suggestions-6.3.3---integrate-the-analytics-job-failure-notification-solution-with-the-enterprise-operation-management-system."></a>

 Where possible, integrate automated notifications into existing operations management tools. For example, an operations support ticket can be automatically filed in the event of a failure. That same ticket can automatically be resolved if the analytics system recovers on retry. 

## Suggestion 6.3.4 – Notify IT operations and help desk teams of any ETL job failures
<a name="suggestions-6.3.4-notify-it-operations-and-help-desk-teams-of-any-etl-job-failures."></a>

 Normally, the IT operations team should be the first contact for production workload failures. The IT operations team troubleshoots and attempts to recover the failed job, if possible. It is also helpful to notify the IT help desk of system failures that have an end user impact. These can include issues with the data warehouse used by the business intelligence (BI) analysts. 

 

## Suggestion 6.3.5 – Notify downstream systems of data freshness
<a name="suggestion-6.3.5"></a>

 Monitor data update times so that downstream processes and applications know when data becomes stale. Stale data can lead to misreporting, because the values reported no longer reflect the current state. 
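A freshness check can be as simple as comparing a dataset's last successful update time against an agreed threshold; the 24-hour default below is a placeholder to be set per dataset.

```python
# Sketch: flag a dataset as stale when its last successful update is older
# than an agreed freshness threshold (placeholder default: 24 hours).
from datetime import datetime, timedelta, timezone

def is_stale(last_updated, max_age=timedelta(hours=24)):
    return datetime.now(timezone.utc) - last_updated > max_age
```

Downstream systems can call such a check before consuming data and surface a warning, rather than silently reporting on outdated values.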

# Best practice 6.4 – Automate the recovery of analytics and ETL job failures
<a name="best-practice-6.4-automate-the-recovery-of-analytics-and-etl-job-failures."></a>

 Many factors can cause analytics and ETL jobs to fail. Some job failures can be recovered using automated recovery solutions; others require manual intervention. Designing and implementing an automated recovery solution can help reduce the impact of job failures and streamline IT operations. 

## Suggestion 6.4.1 – Discover recovery procedures that work for multiple failure types
<a name="suggestions-6.4.1---determine-what-types-of-job-failures-can-be-recovered-safely-using-an-automated-recovery-solution."></a>

 Configure automatic retries to handle intermittent network disruptions. Configure managed scaling to ensure that there are sufficient resources available for jobs to complete within specific time limits. 

## Suggestion 6.4.2 – Limit the number of automatic reruns and create log entries for the automatic recovery attempts and results
<a name="suggestions-6.4.2---limit-the-number-of-automatic-reruns-and-create-log-entries-for-the-automatic-recovery-attempts-and-results."></a>

 Track the number of reruns an automated recovery process has attempted. Limit the number of reruns to avoid wasting resources on attempts that cannot succeed. Log each recovery attempt and its outcome to identify failure trends and drive future improvements. 
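The bounded-retry pattern can be sketched as below. In a real pipeline the orchestrator (for example, Step Functions retry policies or a Glue job's maximum-retries setting) would enforce this; the cap of three attempts is a placeholder.

```python
# Illustrative sketch: rerun a job callable with a hard cap on attempts and
# a log entry per attempt. The attempt cap is a placeholder.
import logging

logger = logging.getLogger("etl-recovery")

def run_with_retries(job, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            result = job()
            logger.info("attempt %d succeeded", attempt)
            return result
        except Exception as exc:
            # Record every failed attempt so trends can be analyzed later.
            logger.warning("attempt %d failed: %s", attempt, exc)
    raise RuntimeError(f"job failed after {max_attempts} attempts")
```

Raising once the cap is reached, instead of retrying indefinitely, is what keeps a persistent failure from consuming resources and masking the underlying problem.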

## Suggestion 6.4.3 – Design the job recovery solution based on the delivery SLA
<a name="suggestion-6.4.3---design-the-job-recovery-solution-based-on-the-delivery-sla."></a>

 Build systems that can meet SLA requirements even if jobs must be retried or manually recovered. Consider the service-level agreements of the different services that you use, and monitor the performance of your jobs against your organization’s internal SLAs. 

 

## Suggestion 6.4.4 – Consider idempotency when designing ETL jobs
<a name="suggestion-6.4.4"></a>

 To avoid unexpected outcomes, such as duplicated or stale data, when automatically rerunning pipelines, enforce idempotency where possible. Idempotent ETL jobs can be rerun and produce the same result or outcome. Two strategies to achieve this are the overwrite method (for example, Spark's overwrite save mode) and the delete-write method (deleting existing data before writing it, to ensure that no duplicates or stale data remain), although deletion should be applied with caution. 
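The delete-write pattern can be illustrated against an in-memory "table" keyed by partition: rows for the partition are replaced wholesale, so a rerun produces the same final state as a single run.

```python
# Sketch of the delete-write idempotency pattern against an in-memory
# store keyed by partition. A real job would replace a table partition
# (for example, via Spark's overwrite save mode) rather than a dict entry.
def write_partition(table, partition, rows):
    table[partition] = list(rows)   # delete-then-write: discard any prior rows
    return table

store = {}
write_partition(store, "2024-01-01", [{"id": 1}, {"id": 2}])
write_partition(store, "2024-01-01", [{"id": 1}, {"id": 2}])  # rerun: same state
```

An append-style write, by contrast, would leave four rows after the rerun, which is exactly the duplication an automated recovery must avoid.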

# Best practice 6.5 – Build a disaster recovery (DR) plan for the analytics infrastructure and the data
<a name="best-practice-6.5---build-a-disaster-recovery-dr-plan-for-the-analytics-infrastructure-and-the-data."></a>

 Discuss with business stakeholders to understand the maximum tolerable amount of data loss (recovery point objective, or RPO) and the maximum tolerable duration of service loss (recovery time objective, or RTO). 

## Suggestion 6.5.1 – Confirm the business requirement of the disaster recovery (DR) plan
<a name="suggestion-6.5.1-confirm-the-business-requirement-of-the-disaster-recovery-dr-plan."></a>

 Agree with the business stakeholders on the internal and external SLAs for your analytics processes. For example, not all business reports are business critical, so it's important that your DR plans are aligned with the severity of the outage. 

## Suggestion 6.5.2 – Design the disaster recovery (DR) solution for each layer of the solution
<a name="suggestion-6.5.2---design-the-disaster-recovery-dr-solution-for-each-layer-of-the-solution."></a>

 Review the architecture for your data and analytics pipeline and select the DR pattern that meets your DR requirements, working backwards from the most important information that must be saved in the event of a DR scenario, to the least important. 

## Suggestion 6.5.3 – Implement and test your backup solution based on the RPO and RTO
<a name="suggestion-6.5.3---implement-and-test-your-backup-solution-based-on-the-rpo-and-rto."></a>

 Implement backup solutions to reduce data loss. Test your backups to ensure they perform correctly by periodically restoring the data and validating the results. 
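A restore test needs an objective pass/fail criterion. One simple sketch, with all names placeholders, compares row counts and a content checksum between the source and the restored copy; for large datasets a sampled or per-partition comparison would be more practical.

```python
# Hedged sketch: validate a restored backup by comparing row counts and a
# content checksum against the source. Names are placeholders; large
# datasets would use sampled or per-partition checks instead.
import hashlib
import json

def checksum(rows):
    """Order-independent checksum over a list of JSON-serializable rows."""
    canonical = json.dumps(sorted(rows, key=json.dumps), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def validate_restore(source_rows, restored_rows):
    return (len(source_rows) == len(restored_rows)
            and checksum(source_rows) == checksum(restored_rows))
```

Running such a validation on a schedule, and alerting on failure, turns the backup from an assumption into a measured control against the agreed RPO.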