

# Change management
<a name="a-change-management"></a>

**Topics**
+ [REL 6  How do you monitor workload resources?](rel-06.md)
+ [REL 7  How do you design your workload to adapt to changes in demand?](rel-07.md)
+ [REL 8  How do you implement change?](rel-08.md)

# REL 6  How do you monitor workload resources?
<a name="rel-06"></a>

Logs and metrics are powerful tools to gain insight into the health of your workload. You can configure your workload to monitor logs and metrics and send notifications when thresholds are crossed or significant events occur. Monitoring enables your workload to recognize when low-performance thresholds are crossed or failures occur, so it can recover automatically in response.

**Topics**
+ [REL06-BP01 Monitor all components for the workload (Generation)](rel_monitor_aws_resources_monitor_resources.md)
+ [REL06-BP02 Define and calculate metrics (Aggregation)](rel_monitor_aws_resources_notification_aggregation.md)
+ [REL06-BP03 Send notifications (Real-time processing and alarming)](rel_monitor_aws_resources_notification_monitor.md)
+ [REL06-BP04 Automate responses (Real-time processing and alarming)](rel_monitor_aws_resources_automate_response_monitor.md)
+ [REL06-BP05 Analytics](rel_monitor_aws_resources_storage_analytics.md)
+ [REL06-BP06 Conduct reviews regularly](rel_monitor_aws_resources_review_monitoring.md)
+ [REL06-BP07 Monitor end-to-end tracing of requests through your system](rel_monitor_aws_resources_end_to_end.md)

# REL06-BP01 Monitor all components for the workload (Generation)
<a name="rel_monitor_aws_resources_monitor_resources"></a>

 Monitor the components of the workload with Amazon CloudWatch or third-party tools. Monitor AWS services with AWS Health Dashboard. 

 All components of your workload should be monitored, including the front-end, business logic, and storage tiers. Define key metrics, describe how to extract them from logs (if necessary), and set thresholds for triggering corresponding alarm events. Ensure metrics are relevant to the key performance indicators (KPIs) of your workload, and use metrics and logs to identify early warning signs of service degradation. For example, a metric related to business outcomes, such as the number of orders successfully processed per minute, can indicate workload issues faster than a technical metric, such as CPU utilization. Use AWS Health Dashboard for a personalized view into the performance and availability of the AWS services underlying your AWS resources. 
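
A business metric like orders processed per minute can be published as a custom CloudWatch metric. The sketch below, assuming a hypothetical `OrdersProcessedPerMinute` metric name and `MyWorkload` namespace, builds the payload; the actual publish call (shown in comments) requires boto3 and AWS credentials:

```python
import datetime

def build_order_metric(orders_processed, timestamp=None):
    """Build one MetricData entry for CloudWatch put_metric_data."""
    return {
        "MetricName": "OrdersProcessedPerMinute",  # hypothetical business KPI
        "Timestamp": timestamp or datetime.datetime.now(datetime.timezone.utc),
        "Value": float(orders_processed),
        "Unit": "Count",
    }

# Publishing would then look like (requires boto3 and configured credentials):
# import boto3
# boto3.client("cloudwatch").put_metric_data(
#     Namespace="MyWorkload",  # hypothetical namespace
#     MetricData=[build_order_metric(orders_processed=120)],
# )
```

Emitting the metric from the code path that completes an order keeps it closely tied to the business outcome, rather than inferring it from infrastructure metrics.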

 Monitoring in the cloud offers new opportunities. Most cloud providers have developed customizable hooks and can deliver insights to help you monitor multiple layers of your workload. AWS services such as Amazon CloudWatch apply statistical and machine learning algorithms to continually analyze metrics of systems and applications, determine normal baselines, and surface anomalies with minimal user intervention. Anomaly detection algorithms account for the seasonality and trend changes of metrics. 

 AWS makes an abundance of monitoring and log information available for consumption, which you can use to define workload-specific metrics and change-in-demand processes, and to adopt machine learning techniques regardless of your ML expertise. 

 In addition, monitor all of your external endpoints to ensure that they are independent of your base implementation. This active monitoring can be done with synthetic transactions (sometimes referred to as *user canaries*, but not to be confused with canary deployments), which periodically run a number of common tasks matching actions performed by clients of the workload. Keep these tasks short in duration, and be sure not to overload your workload during testing. Amazon CloudWatch Synthetics enables you to [create synthetic canaries](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) to monitor your endpoints and APIs. You can also combine synthetic canary data with the AWS X-Ray console to pinpoint which canaries are experiencing errors, faults, or throttling for the selected time frame. 

 **Desired Outcome:** 

 Collect and use critical metrics from all components of the workload to ensure workload reliability and optimal user experience. Detecting that a workload is not achieving business outcomes allows you to quickly declare a disaster and recover from an incident. 

 **Common anti-patterns:** 
+  Only monitoring external interfaces to your workload. 
+  Not generating any workload-specific metrics and only relying on metrics provided to you by the AWS services your workload uses. 
+  Only using technical metrics in your workload and not monitoring any metrics related to non-technical KPIs the workload contributes to. 
+  Relying on production traffic and simple health checks to monitor and evaluate workload state. 

 **Benefits of establishing this best practice:** Monitoring at all tiers in your workload enables you to more rapidly anticipate and resolve problems in the components that comprise the workload. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

1.  **Enable logging where available.** Monitoring data should be obtained from all components of the workload. Turn on additional logging, such as S3 Access Logs, and enable your workload to log workload-specific data. Collect metrics for CPU, network I/O, and disk I/O averages from services such as Amazon ECS, Amazon EKS, Amazon EC2, Elastic Load Balancing, AWS Auto Scaling, and Amazon EMR. See [AWS Services That Publish CloudWatch Metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html) for a list of AWS services that publish metrics to CloudWatch. 

1.  **Review all default metrics and explore any data collection gaps.** Every service generates default metrics. Collecting default metrics allows you to better understand the dependencies between workload components, and how component reliability and performance affect the workload. You can also create and [publish your own metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html) to CloudWatch using the AWS CLI or an API. 

1.  **Evaluate all the metrics to decide which ones to alert on for each AWS service in your workload.** You may choose to select a subset of metrics that have a major impact on workload reliability. Focusing on critical metrics and thresholds allows you to refine the number of [alerts](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) and can help minimize false positives. 

1.  **Define alerts and the recovery process for your workload after the alert is triggered.** Defining alerts allows you to quickly notify, escalate, and follow the steps necessary to recover from an incident and meet your prescribed Recovery Time Objective (RTO). You can use [CloudWatch alarm actions](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarms-and-actions) to invoke automated workflows and initiate recovery procedures based on defined thresholds. 

1.  **Explore the use of synthetic transactions to collect relevant data about your workload's state.** Synthetic monitoring follows the same routes and performs the same actions as a customer, which makes it possible for you to continually verify your customer experience even when you don't have any customer traffic on your workloads. By using [synthetic transactions](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html), you can discover issues before your customers do. 
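
In production you would typically implement this as a CloudWatch Synthetics canary; as a minimal illustration of the idea, the sketch below (endpoint URL and latency threshold are placeholders) runs one synthetic transaction and classifies the result:

```python
import time
import urllib.request

def evaluate(status_code, latency_ms, max_latency_ms=1000):
    """Classify one synthetic transaction result as 'ok' or 'alarm'."""
    if status_code != 200 or latency_ms > max_latency_ms:
        return "alarm"
    return "ok"

def run_canary(url, timeout=5, max_latency_ms=1000):
    """Execute one synthetic transaction against an external endpoint."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except Exception:
        status = 0  # treat network failures as a failed check
    latency_ms = (time.monotonic() - start) * 1000
    return evaluate(status, latency_ms, max_latency_ms)

# run_canary("https://example.com/health")  # hypothetical endpoint
```

Scheduling such a check on a fixed interval, and alarming on consecutive "alarm" results, approximates what a managed canary does for you.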

## Resources
<a name="resources"></a>

 **Related best practices:** 
+ [REL11-BP03 Automate healing on all layers](rel_withstand_component_failures_auto_healing_system.md)

 **Related documents:** 
+  [Getting started with your AWS Health Dashboard – Your account health](https://docs.aws.amazon.com/health/latest/ug/getting-started-health-dashboard.html) 
+  [AWS Services That Publish CloudWatch Metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html) 
+  [Access Logs for Your Network Load Balancer](https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-access-logs.html) 
+  [Access logs for your application load balancer](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html) 
+  [Accessing Amazon CloudWatch Logs for AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/monitoring-functions-logs.html) 
+  [Amazon S3 Server Access Logging](https://docs.aws.amazon.com/AmazonS3/latest/dev/ServerLogs.html) 
+  [Enable Access Logs for Your Classic Load Balancer](https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/enable-access-logs.html) 
+  [Exporting log data to Amazon S3](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/S3Export.html) 
+  [Install the CloudWatch agent on an Amazon EC2 instance](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-on-EC2-Instance.html) 
+  [Publishing Custom Metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html) 
+  [Using Amazon CloudWatch Dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) 
+  [Using Amazon CloudWatch Metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html) 
+  [Using Canaries (Amazon CloudWatch Synthetics)](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) 
+  [What are Amazon CloudWatch Logs?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html) 

 **User guides:** 
+  [Creating a trail](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-create-a-trail-using-the-console-first-time.html) 
+  [Monitoring memory and disk metrics for Amazon EC2 Linux instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/mon-scripts.html) 
+  [Using CloudWatch Logs with container instances](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_cloudwatch_logs.html) 
+  [VPC Flow Logs](https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/flow-logs.html) 
+  [What is Amazon DevOps Guru?](https://docs.aws.amazon.com/devops-guru/latest/userguide/welcome.html) 
+  [What is AWS X-Ray?](https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html) 

 **Related blogs:** 
+  [Debugging with Amazon CloudWatch Synthetics and AWS X-Ray](https://aws.amazon.com/blogs/devops/debugging-with-amazon-cloudwatch-synthetics-and-aws-x-ray/) 

 **Related examples and workshops:** 
+  [AWS Well-Architected Labs: Operational Excellence - Dependency Monitoring](https://wellarchitectedlabs.com/operational-excellence/100_labs/100_dependency_monitoring/) 
+  [The Amazon Builders' Library: Instrumenting distributed systems for operational visibility](https://aws.amazon.com/builders-library/instrumenting-distributed-systems-for-operational-visibility/) 
+  [Observability workshop](https://catalog.workshops.aws/observability/en-US) 

# REL06-BP02 Define and calculate metrics (Aggregation)
<a name="rel_monitor_aws_resources_notification_aggregation"></a>

 Store log data and apply filters where necessary to calculate metrics, such as counts of a specific log event, or latency calculated from log event timestamps. 

 Amazon CloudWatch and Amazon S3 serve as the primary aggregation and storage layers. For some services, such as AWS Auto Scaling and Elastic Load Balancing, default metrics are provided for CPU load or average request latency across a cluster or instance. For services that stream log events, such as VPC Flow Logs and AWS CloudTrail, event data is forwarded to CloudWatch Logs, and you need to define and apply metric filters to extract metrics from the event data. This gives you time series data, which can serve as inputs to CloudWatch alarms that you define to trigger alerts. 
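
To make the aggregation step concrete, the sketch below emulates locally what a CloudWatch Logs metric filter does: it matches a pattern against raw log lines and turns them into numeric datapoints. The log line format is illustrative, not a real service's:

```python
import re
import statistics

# Pattern mirroring a metric filter: extract a status code and a latency
# value from structured log lines (format is hypothetical).
LOG_PATTERN = re.compile(r"status=(?P<status>\d{3}) latency_ms=(?P<latency>\d+)")

def aggregate(log_lines):
    """Turn raw log lines into metric-style datapoints: a 5xx error count
    and an average latency derived from log event fields."""
    errors, latencies = 0, []
    for line in log_lines:
        m = LOG_PATTERN.search(line)
        if not m:
            continue  # non-matching lines are ignored, as a filter would
        if m.group("status").startswith("5"):
            errors += 1
        latencies.append(int(m.group("latency")))
    return {
        "Error5xxCount": errors,
        "AvgLatencyMs": statistics.mean(latencies) if latencies else 0,
    }
```

In CloudWatch itself, the same extraction is expressed declaratively as a metric filter on a log group, and the resulting metric feeds alarms directly.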

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Define and calculate metrics (Aggregation). Store log data and apply filters where necessary to calculate metrics, such as counts of a specific log event, or latency calculated from log event timestamps. 
  +  Metric filters define the terms and patterns to look for in log data as it is sent to CloudWatch Logs. CloudWatch Logs uses these metric filters to turn log data into numerical CloudWatch metrics that you can graph or set an alarm on. 
    +  [Searching and Filtering Log Data](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html) 
  +  Use a trusted third party to aggregate logs. 
    +  Follow the instructions of the third party. Most third-party products integrate with CloudWatch and Amazon S3. 
  +  Some AWS services can publish logs directly to Amazon S3. If your main requirement for logs is storage in Amazon S3, you can easily have the service producing the logs send them directly to Amazon S3 without setting up additional infrastructure. 
    +  [Sending Logs Directly to Amazon S3](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Sending-Logs-Directly-To-S3.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon CloudWatch Logs Insights Sample Queries](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax-examples.html) 
+  [Debugging with Amazon CloudWatch Synthetics and AWS X-Ray](https://aws.amazon.com/blogs/devops/debugging-with-amazon-cloudwatch-synthetics-and-aws-x-ray/) 
+  [One Observability Workshop](https://observability.workshop.aws/) 
+  [Searching and Filtering Log Data](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html) 
+  [Sending Logs Directly to Amazon S3](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Sending-Logs-Directly-To-S3.html) 
+  [The Amazon Builders' Library: Instrumenting distributed systems for operational visibility](https://aws.amazon.com/builders-library/instrumenting-distributed-systems-for-operational-visibility/) 

# REL06-BP03 Send notifications (Real-time processing and alarming)
<a name="rel_monitor_aws_resources_notification_monitor"></a>

 Send notifications when significant events occur, so that the organizations and people that need to know are informed. 

 Alerts can be sent to Amazon Simple Notification Service (Amazon SNS) topics, and then pushed to any number of subscribers. For example, Amazon SNS can forward alerts to an email alias so that technical staff can respond. 
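
As a sketch of wiring an alarm to an SNS topic, the helper below builds the parameters for `put_metric_alarm`; the metric name, namespace, and topic ARN are hypothetical placeholders:

```python
def build_alarm(metric_name, namespace, threshold, topic_arn):
    """Build keyword arguments for cloudwatch.put_metric_alarm with an
    SNS topic as the alarm action."""
    return {
        "AlarmName": f"{metric_name}-high",
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Average",
        "Period": 60,             # evaluate one-minute datapoints
        "EvaluationPeriods": 3,   # require 3 breaching periods to reduce noise
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # SNS fans out to email, SQS, Lambda, etc.
    }

# boto3.client("cloudwatch").put_metric_alarm(**build_alarm(
#     "Error5xxCount", "MyWorkload", 5,
#     "arn:aws:sns:us-east-1:111122223333:ops-alerts"))  # hypothetical ARN
```

Requiring several consecutive breaching periods before alarming is one way to avoid the "threshold too low, too many notifications" anti-pattern described above.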

 **Common anti-patterns:** 
+  Configuring alarms with thresholds that are too low, causing too many notifications to be sent. 
+  Not archiving alarms for future exploration. 

 **Benefits of establishing this best practice:** Notifications on events (even those that can be responded to and automatically resolved) allow you to have a record of events and potentially address them in a different manner in the future. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Perform real-time processing and alarming so that the organizations and people that need to know receive notifications when significant events occur. 
  +  Amazon CloudWatch dashboards are customizable home pages in the CloudWatch console that you can use to monitor your resources in a single view, even those resources that are spread across different Regions. 
    +  [Using Amazon CloudWatch Dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) 
  +  Create an alarm when the metric surpasses a limit. 
    +  [Using Amazon CloudWatch Alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [One Observability Workshop](https://observability.workshop.aws/) 
+  [The Amazon Builders' Library: Instrumenting distributed systems for operational visibility](https://aws.amazon.com/builders-library/instrumenting-distributed-systems-for-operational-visibility/) 
+  [Using Amazon CloudWatch Alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) 
+  [Using Amazon CloudWatch Dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) 
+  [Using Amazon CloudWatch Metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html) 

# REL06-BP04 Automate responses (Real-time processing and alarming)
<a name="rel_monitor_aws_resources_automate_response_monitor"></a>

 Use automation to take action when an event is detected, for example, to replace failed components. 

 Alerts can trigger AWS Auto Scaling events, so that clusters react to changes in demand. Alerts can be sent to Amazon Simple Queue Service (Amazon SQS), which can serve as an integration point for third-party ticket systems. AWS Lambda can also subscribe to alerts, providing users an asynchronous serverless model that reacts to change dynamically. AWS Config continually monitors and records your AWS resource configurations, and can trigger [AWS Systems Manager Automation](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html) to remediate issues. 
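
As one example of the Lambda pattern, the handler below reacts to a CloudWatch alarm delivered through an SNS subscription. The field names follow the standard SNS record and alarm message shape; the remediation action itself is a hypothetical placeholder:

```python
import json

def handler(event, context=None):
    """Minimal AWS Lambda handler for SNS-delivered CloudWatch alarms.
    Parses each record's alarm message and returns the remediation
    actions that would be taken for alarms entering the ALARM state."""
    actions = []
    for record in event.get("Records", []):
        alarm = json.loads(record["Sns"]["Message"])
        if alarm.get("NewStateValue") == "ALARM":
            # Hypothetical remediation hook: restart, scale, or open a ticket.
            actions.append(f"remediate:{alarm['AlarmName']}")
    return actions
```

Keeping the handler's parsing logic pure, as here, makes it easy to unit test against sample events before connecting it to a live subscription.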

 Amazon DevOps Guru can automatically monitor application resources for anomalous behavior and deliver targeted recommendations to speed up problem identification and remediation times. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Use Amazon DevOps Guru to perform automated actions. Amazon DevOps Guru can automatically monitor application resources for anomalous behavior and deliver targeted recommendations to speed up problem identification and remediation times. 
  +  [What is Amazon DevOps Guru?](https://docs.aws.amazon.com/devops-guru/latest/userguide/welcome.html) 
+  Use AWS Systems Manager to perform automated actions. AWS Config continually monitors and records your AWS resource configurations, and can trigger AWS Systems Manager Automation to remediate issues. 
  +  [AWS Systems Manager Automation](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html) 
    +  Create and use Systems Manager Automation documents. These define the actions that Systems Manager performs on your managed instances and other AWS resources when an automation process runs. 
    +  [Working with Automation Documents (Playbooks)](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-documents.html) 
+  Amazon CloudWatch sends alarm state change events to Amazon EventBridge. Create EventBridge rules to automate responses. 
  +  [Creating an EventBridge Rule That Triggers on an Event from an AWS Resource](https://docs.aws.amazon.com/eventbridge/latest/userguide/create-eventbridge-rule.html) 
+  Create and execute a plan to automate responses. 
  +  Inventory all your alert response procedures. You must plan your alert responses before you rank the tasks. 
  +  Inventory all the tasks with specific actions that must be taken. Most of these actions are documented in runbooks. You must also have playbooks for alerts of unexpected events. 
  +  Examine the runbooks and playbooks for all automatable actions. In general, if an action can be defined, it most likely can be automated. 
  +  Rank the error-prone or time-consuming activities first. It is most beneficial to remove sources of errors and reduce time to resolution. 
  +  Establish a plan to complete automation. Maintain an active plan to automate and update the automation. 
  +  Examine manual requirements for opportunities for automation. Challenge your manual process for opportunities to automate. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [AWS Systems Manager Automation](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html) 
+  [Creating an EventBridge Rule That Triggers on an Event from an AWS Resource](https://docs.aws.amazon.com/eventbridge/latest/userguide/create-eventbridge-rule.html) 
+  [One Observability Workshop](https://observability.workshop.aws/) 
+  [The Amazon Builders' Library: Instrumenting distributed systems for operational visibility](https://aws.amazon.com/builders-library/instrumenting-distributed-systems-for-operational-visibility/) 
+  [What is Amazon DevOps Guru?](https://docs.aws.amazon.com/devops-guru/latest/userguide/welcome.html) 
+  [Working with Automation Documents (Playbooks)](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-documents.html) 

# REL06-BP05 Analytics
<a name="rel_monitor_aws_resources_storage_analytics"></a>

 Collect log files and metrics histories and analyze these for broader trends and workload insights. 

 Amazon CloudWatch Logs Insights supports a [simple yet powerful query language](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html) that you can use to analyze log data. Amazon CloudWatch Logs also supports subscriptions that allow data to flow seamlessly to Amazon S3, where you can use Amazon Athena to query the data. Athena supports queries on a large array of formats; see [Supported SerDes and Data Formats](https://docs.aws.amazon.com/athena/latest/ug/supported-format.html) in the Amazon Athena User Guide for more information. For analysis of very large log file sets, you can use an Amazon EMR cluster to run petabyte-scale analyses. 

 There are a number of tools provided by AWS Partners and third parties that allow for aggregation, processing, storage, and analytics. These tools include New Relic, Splunk, Loggly, Logstash, CloudHealth, and Nagios. However, the generation of system and application logs is unique to each cloud provider, and often unique to each service. 

 An often-overlooked part of the monitoring process is data management. You need to determine the retention requirements for monitoring data, and then apply lifecycle policies accordingly. Amazon S3 supports lifecycle management at the S3 bucket level, and lifecycle rules can be applied differently to different paths in the bucket. Toward the end of the lifecycle, you can transition data to Amazon S3 Glacier for long-term storage, and then expire it after the retention period is reached. The S3 Intelligent-Tiering storage class is designed to optimize costs by automatically moving data to the most cost-effective access tier, without performance impact or operational overhead. 
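
The transition-then-expire lifecycle can be sketched as a rule builder; the prefix, bucket name, and retention periods below are illustrative, and the commented boto3 call shows how the rule would be applied:

```python
def build_lifecycle_rule(prefix, glacier_after_days=90, expire_after_days=365):
    """Build one S3 lifecycle rule: transition monitoring data under a
    prefix to S3 Glacier, then expire it after the retention period."""
    return {
        "ID": f"monitoring-retention-{prefix.strip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": glacier_after_days, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": expire_after_days},
    }

# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-log-bucket",  # hypothetical bucket
#     LifecycleConfiguration={"Rules": [build_lifecycle_rule("logs/alb/")]})
```

Separate rules per prefix let you keep, for example, access logs for a year while expiring verbose debug logs much sooner.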

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  CloudWatch Logs Insights enables you to interactively search and analyze your log data in Amazon CloudWatch Logs. 
  +  [Analyzing Log Data with CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html) 
  +  [Amazon CloudWatch Logs Insights Sample Queries](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax-examples.html) 
+  Use Amazon CloudWatch Logs to send logs to Amazon S3, where you can use Amazon Athena to query the data. 
  +  [How do I analyze my Amazon S3 server access logs using Athena?](https://aws.amazon.com/premiumsupport/knowledge-center/analyze-logs-athena/) 
    +  Create an S3 lifecycle policy for your server access logs bucket. Configure the lifecycle policy to periodically remove log files. Doing so reduces the amount of data that Athena analyzes for each query. 
      +  [How Do I Create a Lifecycle Policy for an S3 Bucket?](https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon CloudWatch Logs Insights Sample Queries](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax-examples.html) 
+  [Analyzing Log Data with CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html) 
+  [Debugging with Amazon CloudWatch Synthetics and AWS X-Ray](https://aws.amazon.com/blogs/devops/debugging-with-amazon-cloudwatch-synthetics-and-aws-x-ray/) 
+  [How Do I Create a Lifecycle Policy for an S3 Bucket?](https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html) 
+  [How do I analyze my Amazon S3 server access logs using Athena?](https://aws.amazon.com/premiumsupport/knowledge-center/analyze-logs-athena/) 
+  [One Observability Workshop](https://observability.workshop.aws/) 
+  [The Amazon Builders' Library: Instrumenting distributed systems for operational visibility](https://aws.amazon.com/builders-library/instrumenting-distributed-systems-for-operational-visibility/) 

# REL06-BP06 Conduct reviews regularly
<a name="rel_monitor_aws_resources_review_monitoring"></a>

 Frequently review how workload monitoring is implemented and update it based on significant events and changes. 

 Effective monitoring is driven by key business metrics. Ensure these metrics are accommodated in your workload as business priorities change. 

 Auditing your monitoring helps ensure that you know when an application is meeting its availability goals. Root cause analysis requires the ability to discover what happened when failures occur. AWS provides services that allow you to track the state of your services during an incident: 
+  **Amazon CloudWatch Logs:** You can store your logs in this service and inspect their contents. 
+  **Amazon CloudWatch Logs Insights:** A fully managed service that enables you to analyze massive volumes of logs in seconds, with fast, interactive queries and visualizations. 
+  **AWS Config:** You can see what AWS infrastructure was in use at different points in time. 
+  **AWS CloudTrail:** You can see which AWS APIs were invoked at what time and by what principal. 

 At AWS, we conduct a weekly meeting to [review operational performance](https://docs.aws.amazon.com/wellarchitected/latest/operational-readiness-reviews/wa-operational-readiness-reviews.html) and to share learnings between teams. Because there are so many teams in AWS, we created [The Wheel](https://aws.amazon.com/blogs/opensource/the-wheel/) to randomly pick a workload to review. Establishing a regular cadence for operational performance reviews and knowledge sharing enhances your ability to achieve higher performance from your operational teams. 

 **Common anti-patterns:** 
+  Collecting only default metrics. 
+  Setting a monitoring strategy and never reviewing it. 
+  Not discussing monitoring when major changes are deployed. 

 **Benefits of establishing this best practice:** Regularly reviewing your monitoring enables the anticipation of potential problems, instead of reacting to notifications when an anticipated problem actually occurs. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Create multiple dashboards for the workload. You must have a top-level dashboard that contains the key business metrics, as well as the technical metrics you have identified to be the most relevant to the projected health of the workload as usage varies. You should also have dashboards for various application tiers and dependencies that can be inspected. 
  +  [Using Amazon CloudWatch Dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) 
+  Schedule and conduct regular reviews of the workload dashboards. Conduct regular inspection of the dashboards. You may have different cadences for the depth at which you inspect. 
  +  Inspect for trends in the metrics. Compare the metric values to historic values to see if there are trends that may indicate something needs investigation. Examples include increasing latency, a decreasing rate of a primary business function, and increasing failure responses. 
  +  Inspect for outliers/anomalies in your metrics. Averages or medians can mask outliers and anomalies. Look at the highest and lowest values during the time frame and investigate the causes of extreme scores. As you continue to eliminate these causes, lowering your definition of extreme allows you to continue to improve the consistency of your workload performance. 
  +  Look for sharp changes in behavior. An immediate change in the quantity or direction of a metric may indicate a change in the application or in external factors, and you may need to add metrics to track them. 
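
The outlier inspection described above can be automated with a simple statistical screen. A minimal sketch, assuming a mean-and-standard-deviation definition of "extreme" (other definitions, such as percentile-based ones, work equally well):

```python
import statistics

def find_outliers(values, k=3.0):
    """Flag datapoints more than k standard deviations from the mean;
    averages and medians can mask exactly these extremes."""
    if len(values) < 2:
        return []
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []  # perfectly flat series has no outliers
    return [v for v in values if abs(v - mean) > k * stdev]
```

Lowering `k` over time, as the causes of extreme values are eliminated, matches the practice of tightening your definition of "extreme" to keep improving consistency.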

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon CloudWatch Logs Insights Sample Queries](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax-examples.html) 
+  [Debugging with Amazon CloudWatch Synthetics and AWS X-Ray](https://aws.amazon.com/blogs/devops/debugging-with-amazon-cloudwatch-synthetics-and-aws-x-ray/) 
+  [One Observability Workshop](https://observability.workshop.aws/) 
+  [The Amazon Builders' Library: Instrumenting distributed systems for operational visibility](https://aws.amazon.com/builders-library/instrumenting-distributed-systems-for-operational-visibility/) 
+  [Using Amazon CloudWatch Dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) 

# REL06-BP07 Monitor end-to-end tracing of requests through your system
<a name="rel_monitor_aws_resources_end_to_end"></a>

 Use AWS X-Ray or third-party tools so that developers can more easily analyze and debug distributed systems and understand how their applications and underlying services are performing. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Monitor end-to-end tracing of requests through your system. AWS X-Ray is a service that collects data about requests that your application serves, and provides tools you can use to view, filter, and gain insights into that data to identify issues and opportunities for optimization. For any traced request to your application, you can see detailed information not only about the request and response, but also about calls that your application makes to downstream AWS resources, microservices, databases, and web APIs. 
  +  [What is AWS X-Ray?](https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html) 
  +  [Debugging with Amazon CloudWatch Synthetics and AWS X-Ray](https://aws.amazon.com/blogs/devops/debugging-with-amazon-cloudwatch-synthetics-and-aws-x-ray/) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Debugging with Amazon CloudWatch Synthetics and AWS X-Ray](https://aws.amazon.com/blogs/devops/debugging-with-amazon-cloudwatch-synthetics-and-aws-x-ray/) 
+  [One Observability Workshop](https://observability.workshop.aws/) 
+  [The Amazon Builders' Library: Instrumenting distributed systems for operational visibility](https://aws.amazon.com/builders-library/instrumenting-distributed-systems-for-operational-visibility/) 
+  [Using Canaries (Amazon CloudWatch Synthetics)](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) 
+  [What is AWS X-Ray?](https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html) 

# REL 7  How do you design your workload to adapt to changes in demand?
<a name="rel-07"></a>

A scalable workload provides elasticity to add or remove resources automatically so that they closely match the current demand at any given point in time.

**Topics**
+ [REL07-BP01 Use automation when obtaining or scaling resources](rel_adapt_to_changes_autoscale_adapt.md)
+ [REL07-BP02 Obtain resources upon detection of impairment to a workload](rel_adapt_to_changes_reactive_adapt_auto.md)
+ [REL07-BP03 Obtain resources upon detection that more resources are needed for a workload](rel_adapt_to_changes_proactive_adapt_auto.md)
+ [REL07-BP04 Load test your workload](rel_adapt_to_changes_load_tested_adapt.md)

# REL07-BP01 Use automation when obtaining or scaling resources
<a name="rel_adapt_to_changes_autoscale_adapt"></a>

 When replacing impaired resources or scaling your workload, automate the process by using managed AWS services, such as Amazon S3 and AWS Auto Scaling. You can also use third-party tools and AWS SDKs to automate scaling. 

 Managed AWS services include Amazon S3, Amazon CloudFront, AWS Auto Scaling, AWS Lambda, Amazon DynamoDB, AWS Fargate, and Amazon Route 53. 

 AWS Auto Scaling lets you detect and replace impaired instances. It also lets you build scaling plans for resources including [Amazon EC2](https://aws.amazon.com/ec2/) instances and Spot Fleets, [Amazon ECS](https://aws.amazon.com/ecs/) tasks, [Amazon DynamoDB](https://aws.amazon.com/dynamodb/) tables and indexes, and [Amazon Aurora](https://aws.amazon.com/aurora/) Replicas. 

 When scaling EC2 instances, ensure that you use multiple Availability Zones (preferably at least three) and add or remove capacity to maintain balance across these Availability Zones. ECS tasks or Kubernetes pods (when using Amazon Elastic Kubernetes Service) should also be distributed across multiple Availability Zones. 

 When using AWS Lambda, functions scale automatically. Every time an event notification is received for your function, AWS Lambda quickly locates free capacity within its compute fleet and runs your code, up to the configured concurrency. Ensure that the necessary concurrency is configured on the specific Lambda function and in your Service Quotas. 

 Amazon S3 automatically scales to handle high request rates. For example, your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket. There are no limits to the number of prefixes in a bucket. You can increase your read or write performance by parallelizing requests across prefixes. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second. 
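The prefix arithmetic above is a simple multiplication, sketched here with the published per-prefix baselines (the prefix count of 10 is just the example from the text):

```python
# Estimate aggregate S3 request throughput when spreading keys across prefixes.
# Baseline per-prefix request rates published by Amazon S3.
GET_PER_PREFIX = 5500   # GET/HEAD requests per second per prefix
PUT_PER_PREFIX = 3500   # PUT/COPY/POST/DELETE requests per second per prefix

def aggregate_throughput(prefixes: int, per_prefix_rate: int) -> int:
    """Aggregate request rate scales linearly with the number of prefixes."""
    return prefixes * per_prefix_rate

# 10 prefixes parallelize reads to 55,000 GET requests per second.
print(aggregate_throughput(10, GET_PER_PREFIX))  # 55000
```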

 Configure and use Amazon CloudFront or a trusted content delivery network (CDN). A CDN can provide faster end-user response times and can serve requests for content from cache, therefore reducing the need to scale your workload. 

 **Common anti-patterns:** 
+  Implementing Auto Scaling groups for automated healing, but not implementing elasticity. 
+  Relying solely on reactive automatic scaling to respond to large, sudden increases in traffic. 
+  Deploying highly stateful applications, eliminating the option of elasticity. 

 **Benefits of establishing this best practice:** Automation removes the potential for manual error in deploying and decommissioning resources. It also removes the risk of cost overruns and denial of service caused by slow responses to deployment or decommissioning needs. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Configure and use AWS Auto Scaling. This monitors your applications and automatically adjusts capacity to maintain steady, predictable performance at the lowest possible cost. Using AWS Auto Scaling, you can set up application scaling for multiple resources across multiple services. 
  +  [What is AWS Auto Scaling?](https://docs.aws.amazon.com/autoscaling/plans/userguide/what-is-aws-auto-scaling.html) 
    +  Configure Auto Scaling on your Amazon EC2 instances and Spot Fleets, Amazon ECS tasks, Amazon DynamoDB tables and indexes, Amazon Aurora Replicas, and AWS Marketplace appliances as applicable. 
      +  [Managing throughput capacity automatically with DynamoDB Auto Scaling](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/AutoScaling.html) 
        +  Use service API operations to specify the alarms, scaling policies, warm up times, and cool down times. 
+  Use Elastic Load Balancing. Load balancers can distribute load by path or by network connectivity. 
  +  [What is Elastic Load Balancing?](https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/what-is-load-balancing.html) 
    +  Application Load Balancers can distribute load by path. 
      +  [What is an Application Load Balancer?](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/introduction.html) 
        +  Configure an Application Load Balancer to distribute traffic to different workloads based on the path under the domain name. 
        +  Application Load Balancers can be used to distribute loads in a manner that integrates with AWS Auto Scaling to manage demand. 
          +  [Using a load balancer with an Auto Scaling group](https://docs.aws.amazon.com/autoscaling/ec2/userguide/autoscaling-load-balancer.html) 
    +  Network Load Balancers can distribute load by connection. 
      +  [What is a Network Load Balancer?](https://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html) 
        +  Configure a Network Load Balancer to distribute traffic to different workloads using TCP, or to have a constant set of IP addresses for your workload. 
        +  Network Load Balancers can be used to distribute loads in a manner that integrates with AWS Auto Scaling to manage demand. 
+  Use a highly available DNS provider. DNS names allow your users to enter names instead of IP addresses to access your workloads, and distribute this information to a defined scope, usually globally for users of the workload. 
  +  Use Amazon Route 53 or a trusted DNS provider. 
    +  [What is Amazon Route 53?](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/Welcome.html) 
  +  Use Route 53 to manage your CloudFront distributions and load balancers. 
    +  Determine the domains and subdomains you are going to manage. 
    +  Create appropriate record sets using ALIAS or CNAME records. 
      +  [Working with records](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/rrsets-working-with.html) 
+  Use the AWS global network to optimize the path from your users to your applications. AWS Global Accelerator continually monitors the health of your application endpoints and redirects traffic to healthy endpoints in less than 30 seconds. 
  +  AWS Global Accelerator is a service that improves the availability and performance of your applications with local or global users. It provides static IP addresses that act as a fixed entry point to your application endpoints in a single or multiple AWS Regions, such as your Application Load Balancers, Network Load Balancers or Amazon EC2 instances. 
    +  [What Is AWS Global Accelerator?](https://docs.aws.amazon.com/global-accelerator/latest/dg/what-is-global-accelerator.html) 
+  Configure and use Amazon CloudFront or a trusted content delivery network (CDN). A content delivery network can provide faster end-user response times and can serve requests for content that may cause unnecessary scaling of your workloads. 
  +  [What is Amazon CloudFront?](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Introduction.html) 
    +  Configure Amazon CloudFront distributions for your workloads, or use a third-party CDN. 
      +  You can limit access to your workloads so that they are only accessible from CloudFront by using the IP ranges for CloudFront in your endpoint security groups or access policies. 
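As an illustration of the first step above, the following sketch builds the request parameters for a target tracking scaling policy on an EC2 Auto Scaling group. The group name and the 70% CPU target are illustrative values, and the boto3 call is shown only as a comment:

```python
# Sketch: a target tracking scaling policy for an EC2 Auto Scaling group,
# expressed as the request parameters for the PutScalingPolicy API.
# The group name "web-asg" and the 70% CPU target are illustrative values.
policy_params = {
    "AutoScalingGroupName": "web-asg",
    "PolicyName": "cpu-target-tracking",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingConfiguration": {
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 70.0,  # keep average CPU near 70%, like a thermostat
    },
}

# With credentials configured, this would be submitted via boto3:
#   boto3.client("autoscaling").put_scaling_policy(**policy_params)
print(policy_params["PolicyType"])
```

Target tracking is usually the simplest starting point because AWS Auto Scaling computes the scale-out and scale-in adjustments for you; step and scheduled policies can be layered on where the traffic pattern demands it.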

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [APN Partner: partners that can help you create automated compute solutions](https://aws.amazon.com/partners/find/results/?facets=%27Product%20:%20Compute%27) 
+  [AWS Auto Scaling: How Scaling Plans Work](https://docs.aws.amazon.com/autoscaling/plans/userguide/how-it-works.html) 
+  [AWS Marketplace: products that can be used with auto scaling](https://aws.amazon.com/marketplace/search/results?searchTerms=Auto+Scaling) 
+  [Managing Throughput Capacity Automatically with DynamoDB Auto Scaling](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/AutoScaling.html) 
+  [Using a load balancer with an Auto Scaling group](https://docs.aws.amazon.com/autoscaling/ec2/userguide/autoscaling-load-balancer.html) 
+  [What Is AWS Global Accelerator?](https://docs.aws.amazon.com/global-accelerator/latest/dg/what-is-global-accelerator.html) 
+  [What Is Amazon EC2 Auto Scaling?](https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html) 
+  [What is AWS Auto Scaling?](https://docs.aws.amazon.com/autoscaling/plans/userguide/what-is-aws-auto-scaling.html) 
+  [What is Amazon CloudFront?](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Introduction.html?ref=wellarchitected) 
+  [What is Amazon Route 53?](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/Welcome.html) 
+  [What is Elastic Load Balancing?](https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/what-is-load-balancing.html) 
+  [What is a Network Load Balancer?](https://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html) 
+  [What is an Application Load Balancer?](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/introduction.html) 
+  [Working with records](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/rrsets-working-with.html) 

# REL07-BP02 Obtain resources upon detection of impairment to a workload
<a name="rel_adapt_to_changes_reactive_adapt_auto"></a>

 Scale resources reactively when necessary if availability is impacted, to restore workload availability. 

 You first must configure health checks and the criteria on these checks to indicate when availability is impacted by lack of resources. Then either notify the appropriate personnel to manually scale the resource, or trigger automation to automatically scale it. 

 You can adjust the scale of your workload manually, for example, by changing the number of EC2 instances in an Auto Scaling group or modifying the throughput of a DynamoDB table through the AWS Management Console or AWS CLI. However, use automation whenever possible (refer to **Use automation when obtaining or scaling resources**). 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Obtain resources upon detection of impairment to a workload. Scale resources reactively when necessary if availability is impacted, to restore workload availability. 
  +  Use scaling plans, which are the core component of AWS Auto Scaling, to configure a set of instructions for scaling your resources. If you work with AWS CloudFormation or add tags to AWS resources, you can set up scaling plans for different sets of resources, per application. AWS Auto Scaling provides recommendations for scaling strategies customized to each resource. After you create your scaling plan, AWS Auto Scaling combines dynamic scaling and predictive scaling methods together to support your scaling strategy. 
    +  [AWS Auto Scaling: How Scaling Plans Work](https://docs.aws.amazon.com/autoscaling/plans/userguide/how-it-works.html) 
  +  Amazon EC2 Auto Scaling helps you ensure that you have the correct number of Amazon EC2 instances available to handle the load for your application. You create collections of EC2 instances, called Auto Scaling groups. You can specify the minimum number of instances in each Auto Scaling group, and Amazon EC2 Auto Scaling ensures that your group never goes below this size. You can specify the maximum number of instances in each Auto Scaling group, and Amazon EC2 Auto Scaling ensures that your group never goes above this size. 
    +  [What Is Amazon EC2 Auto Scaling?](https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html) 
  +  Amazon DynamoDB auto scaling uses the AWS Application Auto Scaling service to dynamically adjust provisioned throughput capacity on your behalf, in response to actual traffic patterns. This enables a table or a global secondary index to increase its provisioned read and write capacity to handle sudden increases in traffic, without throttling. 
    +  [Managing Throughput Capacity Automatically with DynamoDB Auto Scaling](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/AutoScaling.html) 
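The DynamoDB auto scaling described above can be sketched as two Application Auto Scaling API requests: one to register the table's read capacity as a scalable target, and one to attach a target tracking policy. The table name, capacity bounds, and target value are illustrative:

```python
# Sketch: registering a DynamoDB table with Application Auto Scaling so read
# capacity follows actual traffic. Table name and bounds are illustrative.
scalable_target = {
    "ServiceNamespace": "dynamodb",
    "ResourceId": "table/orders",
    "ScalableDimension": "dynamodb:table:ReadCapacityUnits",
    "MinCapacity": 5,
    "MaxCapacity": 500,
}
scaling_policy = {
    "PolicyName": "orders-read-tracking",
    "ServiceNamespace": "dynamodb",
    "ResourceId": "table/orders",
    "ScalableDimension": "dynamodb:table:ReadCapacityUnits",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
        },
        "TargetValue": 70.0,  # keep consumed/provisioned read capacity near 70%
    },
}
# With credentials configured:
#   aas = boto3.client("application-autoscaling")
#   aas.register_scalable_target(**scalable_target)
#   aas.put_scaling_policy(**scaling_policy)
```

A matching pair of requests with `dynamodb:table:WriteCapacityUnits` would cover write capacity; global secondary indexes are registered the same way with an index-scoped `ResourceId`.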

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [APN Partner: partners that can help you create automated compute solutions](https://aws.amazon.com/partners/find/results/?facets=%27Product%20:%20Compute%27) 
+  [AWS Auto Scaling: How Scaling Plans Work](https://docs.aws.amazon.com/autoscaling/plans/userguide/how-it-works.html) 
+  [AWS Marketplace: products that can be used with auto scaling](https://aws.amazon.com/marketplace/search/results?searchTerms=Auto+Scaling) 
+  [Managing Throughput Capacity Automatically with DynamoDB Auto Scaling](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/AutoScaling.html) 
+  [What Is Amazon EC2 Auto Scaling?](https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html) 

# REL07-BP03 Obtain resources upon detection that more resources are needed for a workload
<a name="rel_adapt_to_changes_proactive_adapt_auto"></a>

 Scale resources proactively to meet demand and avoid availability impact. 

 Many AWS services automatically scale to meet demand. If using Amazon EC2 instances or Amazon ECS clusters, you can configure automatic scaling of these to occur based on usage metrics that correspond to demand for your workload. For Amazon EC2, average CPU utilization, load balancer request count, or network bandwidth can be used to scale out (or scale in) EC2 instances. For Amazon ECS, average CPU utilization, load balancer request count, and memory utilization can be used to scale out (or scale in) ECS tasks. Using Target Auto Scaling on AWS, the autoscaler acts like a household thermostat, adding or removing resources to maintain the target value (for example, 70% CPU utilization) that you specify. 

 Amazon EC2 Auto Scaling can also do [Predictive Auto Scaling](https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-predictive-scaling.html), which uses machine learning to analyze each resource's historical workload and regularly forecasts the future load. 

 Little’s Law helps you calculate how many compute instances (EC2 instances, concurrent Lambda functions, and so on) you need. 

 *L* = *λW* 

 L = number of instances (or mean concurrency in the system) 

 λ = mean rate at which requests arrive (req/sec) 

 W = mean time that each request spends in the system (sec) 

 For example, at 100 rps, if each request takes 0.5 seconds to process, you will need 50 instances to keep up with demand. 
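The worked example above can be checked with a short calculation. This is a sketch; rounding up to whole instances is an assumption, since fractional instances cannot be provisioned:

```python
import math

def required_instances(arrival_rate: float, service_time: float) -> int:
    """Little's Law: L = lambda * W.

    arrival_rate: mean request arrival rate (requests/second)
    service_time: mean time each request spends in the system (seconds)
    Returns the mean concurrency, rounded up to whole instances.
    """
    return math.ceil(arrival_rate * service_time)

# At 100 requests/second with 0.5 s per request, 50 instances are needed.
print(required_instances(100, 0.5))  # 50
```

In practice you would add headroom above this mean concurrency, because Little's Law describes steady state and real arrival rates are bursty.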

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Obtain resources upon detection that more resources are needed for a workload. Scale resources proactively to meet demand and avoid availability impact. 
  +  Calculate how many compute resources you will need (compute concurrency) to handle a given request rate. 
    +  [Telling Stories About Little's Law](https://brooker.co.za/blog/2018/06/20/littles-law.html) 
  +  When you have a historical pattern for usage, set up scheduled scaling for Amazon EC2 auto scaling. 
    +  [Scheduled Scaling for Amazon EC2 Auto Scaling](https://docs.aws.amazon.com/autoscaling/ec2/userguide/schedule_time.html) 
  +  Use AWS predictive scaling. 
    +  [Predictive scaling for Amazon EC2 Auto Scaling](https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-predictive-scaling.html) 
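The scheduled scaling step above can be sketched as the parameters for a scheduled action on an Auto Scaling group. The group name, cron expression, and sizes are illustrative values for a weekday-morning ramp-up:

```python
# Sketch: a scheduled scaling action for Amazon EC2 Auto Scaling, expressed as
# parameters for the PutScheduledUpdateGroupAction API. All values are
# illustrative for a workload whose traffic rises on weekday mornings.
scheduled_action = {
    "AutoScalingGroupName": "web-asg",
    "ScheduledActionName": "weekday-morning-scale-out",
    "Recurrence": "0 8 * * MON-FRI",  # 08:00 UTC on weekdays
    "MinSize": 4,
    "MaxSize": 12,
    "DesiredCapacity": 8,
}
# With credentials configured:
#   boto3.client("autoscaling").put_scheduled_update_group_action(**scheduled_action)
```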

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [AWS Auto Scaling: How Scaling Plans Work](https://docs.aws.amazon.com/autoscaling/plans/userguide/how-it-works.html) 
+  [AWS Marketplace: products that can be used with auto scaling](https://aws.amazon.com/marketplace/search/results?searchTerms=Auto+Scaling) 
+  [Managing Throughput Capacity Automatically with DynamoDB Auto Scaling](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/AutoScaling.html) 
+  [Predictive Scaling for EC2, Powered by Machine Learning](https://aws.amazon.com/blogs/aws/new-predictive-scaling-for-ec2-powered-by-machine-learning/) 
+  [Scheduled Scaling for Amazon EC2 Auto Scaling](https://docs.aws.amazon.com/autoscaling/ec2/userguide/schedule_time.html) 
+  [Telling Stories About Little's Law](https://brooker.co.za/blog/2018/06/20/littles-law.html) 
+  [What Is Amazon EC2 Auto Scaling?](https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html) 

# REL07-BP04 Load test your workload
<a name="rel_adapt_to_changes_load_tested_adapt"></a>

 Adopt a load testing methodology to measure if scaling activity meets workload requirements. 

 It’s important to perform sustained load testing. Load tests should discover the breaking point and test the performance of your workload. AWS makes it easy to set up temporary testing environments that model the scale of your production workload. In the cloud, you can create a production-scale test environment on demand, complete your testing, and then decommission the resources. Because you only pay for the test environment when it's running, you can simulate your live environment for a fraction of the cost of testing on premises. 

 Load testing in production should also be considered as part of game days where the production system is stressed, during hours of lower customer usage, with all personnel on hand to interpret results and address any problems that arise. 

 **Common anti-patterns:** 
+  Performing load testing on deployments that are not the same configuration as your production. 
+  Performing load testing only on individual pieces of your workload, and not on the entire workload. 
+  Performing load testing with a subset of requests and not a representative set of actual requests. 
+  Performing load testing to a small safety factor above expected load. 

 **Benefits of establishing this best practice:** You know which components in your architecture fail under load, and can identify which metrics to watch to indicate that you are approaching that load in time to address the problem, preventing the impact of that failure. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Perform load testing to identify which aspect of your workload indicates that you must add or remove capacity. Load testing should have representative traffic similar to what you receive in production. Increase the load while watching the metrics you have instrumented to determine which metric indicates when you must add or remove resources. 
  +  [Distributed Load Testing on AWS: simulate thousands of connected users](https://aws.amazon.com/solutions/distributed-load-testing-on-aws/) 
    +  Identify the mix of requests. You may have varied mixes of requests, so you should look at various time frames when identifying the mix of traffic. 
    +  Implement a load driver. You can use custom code, open source, or commercial software to implement a load driver. 
    +  Load test initially using small capacity. You see some immediate effects by driving load onto a small capacity, possibly as small as one instance or container. 
    +  Load test against larger capacity. The effects will be different on a distributed load, so you must test against an environment as close to production as possible. 
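A custom load driver along the lines described above can be sketched in a few lines. The stub handler with its 10 ms sleep is a placeholder for real requests to your workload, and the concurrency and request counts are illustrative:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(i: int) -> float:
    """Stand-in for a call to the system under test; returns latency in seconds."""
    start = time.perf_counter()
    time.sleep(0.01)  # simulate 10 ms of work; replace with a real HTTP call
    return time.perf_counter() - start

def run_load(concurrency: int, total_requests: int) -> list:
    """Drive total_requests requests at the given concurrency; collect latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(handle_request, range(total_requests)))

latencies = run_load(concurrency=5, total_requests=50)
print(f"p50 latency: {sorted(latencies)[len(latencies) // 2] * 1000:.1f} ms")
```

While stepping the concurrency up, watch the same metrics you have instrumented in production so you learn which one degrades first.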

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Distributed Load Testing on AWS: simulate thousands of connected users](https://aws.amazon.com/solutions/distributed-load-testing-on-aws/) 

# REL 8  How do you implement change?
<a name="rel-08"></a>

Controlled changes are necessary to deploy new functionality and to ensure that the workloads and the operating environment are running known software and can be patched or replaced in a predictable manner. If changes are uncontrolled, it is difficult to predict their effect or to address issues that arise from them. 

**Topics**
+ [REL08-BP01 Use runbooks for standard activities such as deployment](rel_tracking_change_management_planned_changemgmt.md)
+ [REL08-BP02 Integrate functional testing as part of your deployment](rel_tracking_change_management_functional_testing.md)
+ [REL08-BP03 Integrate resiliency testing as part of your deployment](rel_tracking_change_management_resiliency_testing.md)
+ [REL08-BP04 Deploy using immutable infrastructure](rel_tracking_change_management_immutable_infrastructure.md)
+ [REL08-BP05 Deploy changes with automation](rel_tracking_change_management_automated_changemgmt.md)

# REL08-BP01 Use runbooks for standard activities such as deployment
<a name="rel_tracking_change_management_planned_changemgmt"></a>

 Runbooks are the predefined procedures to achieve specific outcomes. Use runbooks to perform standard activities, whether done manually or automatically. Examples include deploying a workload, patching a workload, or making DNS modifications. 

 For example, put processes in place to [ensure rollback safety during deployments](https://aws.amazon.com/builders-library/ensuring-rollback-safety-during-deployments). Ensuring that you can roll back a deployment without any disruption for your customers is critical in making a service reliable. 

 For runbook procedures, start with a valid, effective manual process, implement it in code, and trigger it to run automatically where appropriate. 

 Even for sophisticated workloads that are highly automated, runbooks are still useful for [running game days](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/test-reliability.html#GameDays) or meeting rigorous reporting and auditing requirements. 

 Note that playbooks are used in response to specific incidents, and runbooks are used to achieve specific outcomes. Often, runbooks are for routine activities, while playbooks are used for responding to non-routine events. 

 **Common anti-patterns:** 
+  Performing unplanned changes to configuration in production. 
+  Skipping steps in your plan to deploy faster, resulting in a failed deployment. 
+  Making changes without testing the reversal of the change. 

 **Benefits of establishing this best practice:** Effective change planning increases your ability to successfully execute the change because you are aware of all the systems impacted. Validating your change in test environments increases your confidence. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Enable consistent and prompt responses to well understood events by documenting procedures in runbooks. 
  +  [AWS Well-Architected Framework: Concepts: Runbook](https://wa.aws.amazon.com/wat.concept.runbook.en.html) 
+  Use the principle of infrastructure as code to define your infrastructure. By using AWS CloudFormation (or a trusted third party) to define your infrastructure, you can use version control software to version and track changes. 
  +  Use AWS CloudFormation (or a trusted third-party provider) to define your infrastructure. 
    +  [What is AWS CloudFormation?](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) 
  +  Create templates that are singular and decoupled, using good software design principles. 
    +  Determine the permissions, templates, and responsible parties for implementation. 
      + [ Controlling access with AWS Identity and Access Management](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-iam-template.html)
    +  Use source control, like AWS CodeCommit or a trusted third-party tool, for version control. 
      +  [What is AWS CodeCommit?](https://docs.aws.amazon.com/codecommit/latest/userguide/welcome.html) 
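A minimal, single-purpose template of the kind described above can be sketched as follows. The single bucket resource and its logical ID are illustrative; the point is that the template is small, decoupled, and lives in version control:

```python
import json

# Sketch: a minimal CloudFormation template defining one decoupled resource.
# The logical ID "ArtifactBucket" is illustrative; real templates would be
# kept in source control and deployed via CreateStack/UpdateStack.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Minimal single-purpose template kept under version control",
    "Resources": {
        "ArtifactBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {"VersioningConfiguration": {"Status": "Enabled"}},
        }
    },
}
template_body = json.dumps(template, indent=2)
# With credentials configured:
#   boto3.client("cloudformation").create_stack(
#       StackName="artifacts", TemplateBody=template_body)
```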

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [APN Partner: partners that can help you create automated deployment solutions](https://aws.amazon.com/partners/find/results/?keyword=devops) 
+  [AWS Marketplace: products that can be used to automate your deployments](https://aws.amazon.com/marketplace/search/results?searchTerms=DevOps) 
+  [AWS Well-Architected Framework: Concepts: Runbook](https://wa.aws.amazon.com/wat.concept.runbook.en.html) 
+  [What is AWS CloudFormation?](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) 
+  [What is AWS CodeCommit?](https://docs.aws.amazon.com/codecommit/latest/userguide/welcome.html) 

   **Related examples:** 
+  [Automating operations with Playbooks and Runbooks](https://wellarchitectedlabs.com/operational-excellence/200_labs/200_automating_operations_with_playbooks_and_runbooks/) 

# REL08-BP02 Integrate functional testing as part of your deployment
<a name="rel_tracking_change_management_functional_testing"></a>

 Functional tests are run as part of automated deployment. If success criteria are not met, the pipeline is halted or rolled back. 

 These tests are run in a pre-production environment, which is staged prior to production in the pipeline. Ideally, this is done as part of a deployment pipeline. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Integrate functional testing as part of your deployment. Functional tests are run as part of automated deployment. If success criteria are not met, the pipeline is halted or rolled back. 
  +  Invoke AWS CodeBuild during the ‘Test Action’ of your software release pipelines modeled in AWS CodePipeline. This capability enables you to easily run a variety of tests against your code, such as unit tests, static code analysis, and integration tests. 
    +  [AWS CodePipeline Adds Support for Unit and Custom Integration Testing with AWS CodeBuild](https://aws.amazon.com/about-aws/whats-new/2017/03/aws-codepipeline-adds-support-for-unit-testing/) 
  +  Use AWS Marketplace solutions for executing automated tests as part of your software delivery pipeline. 
    +  [Software test automation](https://aws.amazon.com/marketplace/solutions/devops/software-test-automation) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [AWS CodePipeline Adds Support for Unit and Custom Integration Testing with AWS CodeBuild](https://aws.amazon.com/about-aws/whats-new/2017/03/aws-codepipeline-adds-support-for-unit-testing/) 
+  [Software test automation](https://aws.amazon.com/marketplace/solutions/devops/software-test-automation) 
+  [What Is AWS CodePipeline?](https://docs.aws.amazon.com/codepipeline/latest/userguide/welcome.html) 

# REL08-BP03 Integrate resiliency testing as part of your deployment
<a name="rel_tracking_change_management_resiliency_testing"></a>

 Resiliency tests (using the [principles of chaos engineering](https://principlesofchaos.org/)) are run as part of the automated deployment pipeline in a pre-production environment. 

 These tests are staged and run in the pipeline in a pre-production environment. They should also be run in production as part of [game days](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/test-reliability.html#GameDays). 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Integrate resiliency testing as part of your deployment. Use Chaos Engineering, the discipline of experimenting on a workload to build confidence in the workload’s capability to withstand turbulent conditions in production. 
  +  Resiliency tests inject faults or resource degradation to assess that your workload responds with its designed resilience. 
    +  [Well-Architected lab: Level 300: Testing for Resiliency of EC2 RDS and S3](https://wellarchitectedlabs.com/Reliability/300_Testing_for_Resiliency_of_EC2_RDS_and_S3/README.html) 
  +  These tests can be run regularly in pre-production environments in automated deployment pipelines. 
  +  They should also be run in production, as part of scheduled game days. 
  +  Using Chaos Engineering principles, propose hypotheses about how your workload will perform under various impairments, then test your hypotheses using resiliency testing. 
    +  [Principles of Chaos Engineering](https://principlesofchaos.org/) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Principles of Chaos Engineering](https://principlesofchaos.org/) 
+  [What is AWS Fault Injection Simulator?](https://docs.aws.amazon.com/fis/latest/userguide/what-is.html) 

 **Related examples:** 
+  [Well-Architected lab: Level 300: Testing for Resiliency of EC2 RDS and S3](https://wellarchitectedlabs.com/Reliability/300_Testing_for_Resiliency_of_EC2_RDS_and_S3/README.html) 

# REL08-BP04 Deploy using immutable infrastructure
<a name="rel_tracking_change_management_immutable_infrastructure"></a>

 Immutable infrastructure is a model that mandates that no updates, security patches, or configuration changes happen in-place on production workloads. When a change is needed, the architecture is built onto new infrastructure and deployed into production. 

 The most common implementation of the immutable infrastructure paradigm is the ***immutable server***: if a server needs an update or a fix, new servers are deployed instead of updating the ones already in use. Instead of logging into the server over SSH and updating the software in place, every change to the application starts with a push to the code repository, for example `git push`. Because in-place changes are not allowed in immutable infrastructure, you can always be certain of the state of the deployed system. Immutable infrastructures are inherently more consistent, reliable, and predictable, and they simplify many aspects of software development and operations. 
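A toy model of the immutable-server idea (not a real provisioning tool; the class and method names are illustrative): every deployment replaces the fleet with servers built from a new versioned image, and rollback is simply redeploying the previous image.

```python
class ImmutableFleet:
    """Servers are never modified in place; every change ships as a new image."""

    def __init__(self):
        self.deployed_versions = []

    @property
    def current(self):
        return self.deployed_versions[-1] if self.deployed_versions else None

    def deploy(self, image_version):
        # Launch new servers from image_version, then retire the old ones.
        self.deployed_versions.append(image_version)

    def rollback(self):
        # Roll back by returning to the previous known-good image.
        if len(self.deployed_versions) > 1:
            self.deployed_versions.pop()
        return self.current

fleet = ImmutableFleet()
fleet.deploy("app-image-v1")
fleet.deploy("app-image-v2")   # a fix ships as a new image, not an in-place patch
print(fleet.rollback())        # back on "app-image-v1"
```

Because no server is ever patched in place, the fleet's state is always exactly one known image version.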

 Use a canary or blue/green deployment when deploying applications in immutable infrastructures. 

 [Canary deployment](https://martinfowler.com/bliki/CanaryRelease.html) is the practice of directing a small number of your customers to the new version, usually running on a single service instance (the canary). You then deeply scrutinize any behavior changes or errors that are generated. You can remove traffic from the canary if you encounter critical problems and send the users back to the previous version. If the deployment is successful, you can continue to deploy at your desired velocity, while monitoring the changes for errors, until you are fully deployed. AWS CodeDeploy can be configured with a deployment configuration that enables a canary deployment. 
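The canary flow can be sketched as a simple control loop (the error rates and threshold here are hypothetical examples, not a CodeDeploy configuration):

```python
def run_canary(canary_error_rates, threshold=0.05):
    """Watch the canary's error rate over successive monitoring intervals.

    Returns "rollback" (send users back to the previous version) on a
    critical problem, or "promote" (continue deploying) if all looks healthy.
    """
    for observed in canary_error_rates:
        if observed > threshold:
            # Critical problem: remove traffic from the canary.
            return "rollback"
    # No regressions observed: deploy to the rest of the fleet.
    return "promote"

print(run_canary([0.01, 0.02, 0.01]))  # promote
print(run_canary([0.01, 0.09]))        # rollback
```

In practice, the "error rate" would be one or more of the key metrics you already alarm on, and the promotion step would be gated by those alarms rather than a single threshold.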

 [Blue/green deployment](https://martinfowler.com/bliki/BlueGreenDeployment.html) is similar to canary deployment, except that a full fleet of the application is deployed in parallel. You alternate your deployments across the two stacks (blue and green). Again, you can send traffic to the new version and fall back to the old version if you see problems with the deployment. Commonly, all traffic is switched at once; however, you can also route a fraction of traffic to each version to dial up adoption of the new version, using the weighted DNS routing capabilities of Amazon Route 53. AWS CodeDeploy and AWS Elastic Beanstalk can be configured with a deployment configuration that enables a blue/green deployment. 
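Weighted routing dials up adoption using relative weights, as in Route 53: each version receives traffic in proportion to its weight divided by the sum of all weights. A small sketch of that arithmetic (the ramp schedule is an example, not a prescribed sequence):

```python
def traffic_share(weight_green, weight_blue):
    """Fraction of requests routed to the green stack under weighted routing:
    weight / (sum of all weights)."""
    total = weight_green + weight_blue
    return weight_green / total if total else 0.0

# Dial up the green (new) stack in steps while monitoring for errors.
for w_green, w_blue in [(10, 90), (50, 50), (100, 0)]:
    print(f"green receives {traffic_share(w_green, w_blue):.0%} of traffic")
```

Setting the green weight to 0 at any step sends all traffic back to blue, which is what makes the fallback fast.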

![\[Diagram showing blue/green deployment with AWS Elastic Beanstalk and Amazon Route 53\]](http://docs.aws.amazon.com/wellarchitected/2022-03-31/framework/images/blue-green-deployment.png)


 Benefits of immutable infrastructure: 
+  **Reduction in configuration drift:** By frequently replacing servers from a known, version-controlled base configuration, the infrastructure is **reset** to a known state, avoiding configuration drift. 
+  **Simplified deployments**: Deployments are simplified because they don’t need to support upgrades. Upgrades are just new deployments. 
+  **Reliable atomic deployments:** Deployments either complete successfully, or nothing changes. This builds trust in the deployment process. 
+  **Safer deployments with fast rollback and recovery processes:** Deployments are safer because the previous working version is not changed. You can roll back to it if errors are detected. 
+  **Consistent testing and debugging environments:** Because all servers use the same image, there are no differences between environments. One build is deployed to multiple environments, which prevents inconsistencies and simplifies testing and debugging. 
+  **Increased scalability:** Because servers are built from a consistent, repeatable base image, automatic scaling is trivial. 
+  **Simplified toolchain:** The toolchain is simplified because you can eliminate the configuration management tools that manage production software upgrades. No extra tools or agents are installed on servers. Changes are made to the base image, tested, and rolled out. 
+  **Increased security:** By denying all changes to servers, you can disable SSH on instances and remove keys. This reduces the attack surface, improving your organization’s security posture. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Deploy using immutable infrastructure. Immutable infrastructure is a model in which no updates, security patches, or configuration changes happen *in-place* on production systems. If any change is needed, a new version of the architecture is built and deployed into production. 
  +  [Overview of a Blue/Green Deployment](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html#welcome-deployment-overview-blue-green) 
  +  [Deploying Serverless Applications Gradually](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/automating-updates-to-serverless-apps.html) 
  +  [Immutable Infrastructure: Reliability, consistency and confidence through immutability](https://medium.com/@adhorn/immutable-infrastructure-21f6613e7a23) 
  +  [CanaryRelease](https://martinfowler.com/bliki/CanaryRelease.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [CanaryRelease](https://martinfowler.com/bliki/CanaryRelease.html) 
+  [Deploying Serverless Applications Gradually](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/automating-updates-to-serverless-apps.html) 
+  [Immutable Infrastructure: Reliability, consistency and confidence through immutability](https://medium.com/@adhorn/immutable-infrastructure-21f6613e7a23) 
+  [Overview of a Blue/Green Deployment](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html#welcome-deployment-overview-blue-green) 
+  [The Amazon Builders' Library: Ensuring rollback safety during deployments](https://aws.amazon.com/builders-library/ensuring-rollback-safety-during-deployments) 

# REL08-BP05 Deploy changes with automation
<a name="rel_tracking_change_management_automated_changemgmt"></a>

 Deployments and patching are automated to eliminate the negative impact of human error from manual changes. 

 Making changes to production systems is one of the largest risk areas for many organizations. We consider deployments a first-class problem to be solved alongside the business problems that the software addresses. Today, this means the use of automation wherever practical in operations, including testing and deploying changes, adding or removing capacity, and migrating data. AWS CodePipeline lets you manage the steps required to release your workload. This includes a deployment stage using AWS CodeDeploy to automate deployment of application code to Amazon EC2 instances, on-premises instances, serverless Lambda functions, or Amazon ECS services. 

**Recommendation**  
 Although conventional wisdom suggests that you keep humans in the loop for the most difficult operational procedures, we suggest that you automate the most difficult procedures for that very reason. 

 **Common anti-patterns:** 
+  Manually performing changes. 
+  Skipping steps in your automation through emergency workflows. 
+  Not following your plans. 

 **Benefits of establishing this best practice:** Using automation to deploy all changes removes the potential for human error and lets you test changes before they reach production, ensuring that your plans are complete. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Automate your deployment pipeline. Deployment pipelines allow you to invoke automated testing and detection of anomalies, and either halt the pipeline at a certain step before production deployment, or automatically roll back a change. 
  +  [The Amazon Builders' Library: Ensuring rollback safety during deployments](https://aws.amazon.com/builders-library/ensuring-rollback-safety-during-deployments) 
  +  [The Amazon Builders' Library: Going faster with continuous delivery](https://aws.amazon.com/builders-library/going-faster-with-continuous-delivery/) 
  +  Use AWS CodePipeline (or a trusted third-party product) to define and run your pipelines. 
    +  [What is AWS CodePipeline?](https://docs.aws.amazon.com/codepipeline/latest/userguide/welcome.html) 
  +  Configure the pipeline to start when a change is committed to your code repository. 
  +  Use Amazon Simple Notification Service (Amazon SNS) and Amazon Simple Email Service (Amazon SES) to send notifications about problems in the pipeline, or integrate with a team chat tool, such as Amazon Chime. 
    +  [What is Amazon Simple Notification Service?](https://docs.aws.amazon.com/sns/latest/dg/welcome.html) 
    +  [What is Amazon SES?](https://docs.aws.amazon.com/ses/latest/DeveloperGuide/Welcome.html) 
    +  [What is Amazon Chime?](https://docs.aws.amazon.com/chime/latest/ug/what-is-chime.html) 
    +  [Automate chat messages with webhooks](https://docs.aws.amazon.com/chime/latest/ug/webhooks.html) 
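To make the halt-or-roll-back behavior concrete, here is a minimal pipeline runner (the stage names and checks are illustrative, not CodePipeline's API):

```python
def run_pipeline(stages, checks):
    """Advance through stages in order, invoking an automated check after each.
    Halt at the first failed check, before the change reaches later stages."""
    for stage in stages:
        if not checks[stage]():
            # Anomaly detected: halt here and trigger the rollback path.
            return f"halted at {stage}: rolling back"
    return "released to production"

stages = ["build", "test", "pre-production", "production"]
checks = {stage: (lambda: True) for stage in stages}
print(run_pipeline(stages, checks))       # released to production
checks["pre-production"] = lambda: False  # an automated test detects an anomaly
print(run_pipeline(stages, checks))       # halted at pre-production: rolling back
```

The key property is that a failed check anywhere before production prevents all later stages from running, so a bad change never needs a human to notice it first.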

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [APN Partner: partners that can help you create automated deployment solutions](https://aws.amazon.com/partners/find/results/?keyword=devops) 
+  [AWS Marketplace: products that can be used to automate your deployments](https://aws.amazon.com/marketplace/search/results?searchTerms=DevOps) 
+  [Automate chat messages with webhooks.](https://docs.aws.amazon.com/chime/latest/ug/webhooks.html) 
+  [The Amazon Builders' Library: Ensuring rollback safety during deployments](https://aws.amazon.com/builders-library/ensuring-rollback-safety-during-deployments) 
+  [The Amazon Builders' Library: Going faster with continuous delivery](https://aws.amazon.com/builders-library/going-faster-with-continuous-delivery/) 
+  [What Is AWS CodePipeline?](https://docs.aws.amazon.com/codepipeline/latest/userguide/welcome.html) 
+  [What Is CodeDeploy?](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html) 
+  [AWS Systems Manager Patch Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-patch.html) 
+  [What is Amazon SES?](https://docs.aws.amazon.com/ses/latest/DeveloperGuide/Welcome.html) 
+  [What is Amazon Simple Notification Service?](https://docs.aws.amazon.com/sns/latest/dg/welcome.html) 

 **Related videos:** 
+  [AWS Summit 2019: CI/CD on AWS](https://youtu.be/tQcF6SqWCoY) 