

# Operate
<a name="a-operate"></a>

**Topics**
+ [

# OPS 8  How do you understand the health of your workload?
](ops-08.md)
+ [

# OPS 9  How do you understand the health of your operations?
](ops-09.md)
+ [

# OPS 10  How do you manage workload and operations events?
](ops-10.md)

# OPS 8  How do you understand the health of your workload?
<a name="ops-08"></a>

 Define, capture, and analyze workload metrics to gain visibility to workload events so that you can take appropriate action. 

**Topics**
+ [

# OPS08-BP01 Identify key performance indicators
](ops_workload_health_define_workload_kpis.md)
+ [

# OPS08-BP02 Define workload metrics
](ops_workload_health_design_workload_metrics.md)
+ [

# OPS08-BP03 Collect and analyze workload metrics
](ops_workload_health_collect_analyze_workload_metrics.md)
+ [

# OPS08-BP04 Establish workload metrics baselines
](ops_workload_health_workload_metric_baselines.md)
+ [

# OPS08-BP05 Learn expected patterns of activity for workload
](ops_workload_health_learn_workload_usage_patterns.md)
+ [

# OPS08-BP06 Alert when workload outcomes are at risk
](ops_workload_health_workload_outcome_alerts.md)
+ [

# OPS08-BP07 Alert when workload anomalies are detected
](ops_workload_health_workload_anomaly_alerts.md)
+ [

# OPS08-BP08 Validate the achievement of outcomes and the effectiveness of KPIs and metrics
](ops_workload_health_biz_level_view_workload.md)

# OPS08-BP01 Identify key performance indicators
<a name="ops_workload_health_define_workload_kpis"></a>

 Identify key performance indicators (KPIs) based on desired business outcomes (for example, order rate, customer retention rate, and profit versus operating expense) and customer outcomes (for example, customer satisfaction). Evaluate KPIs to determine workload success. 

 **Common anti-patterns:** 
+  You are asked by business leadership how successful a workload has been serving business needs but have no frame of reference to determine success. 
+  You are unable to determine if the commercial off-the-shelf application you operate for your organization is cost-effective. 

 **Benefits of establishing this best practice:** By identifying key performance indicators you enable achieving business outcomes as the test of the health and success of your workload. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Identify key performance indicators: Identify key performance indicators (KPIs) based on desired business and customer outcomes. Evaluate KPIs to determine workload success. 

# OPS08-BP02 Define workload metrics
<a name="ops_workload_health_design_workload_metrics"></a>

 Define workload metrics to measure the achievement of KPIs (for example, abandoned shopping carts, orders placed, cost, price, and allocated workload expense). Define workload metrics to measure the health of the workload (for example, interface response time, error rate, requests made, requests completed, and utilization). Evaluate metrics to determine if the workload is achieving desired outcomes, and to understand the health of the workload. 

 You should send log data to a service such as CloudWatch Logs, and generate metrics from observations of necessary log content. 

 CloudWatch has specialized features such as [Amazon CloudWatch Insights for .NET and SQL Server](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/appinsights-what-is.html) and [Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html) that can assist you by identifying and setting up key metrics, logs, and alarms across your specifically supported application resources and technology stack. 

 **Common anti-patterns:** 
+  You have defined standard metrics, not associated to any KPIs or tailored to any workload. 
+  You have errors in your metrics calculations that will yield invalid results. 
+  You don't have any metrics defined for your workload. 
+  You only measure for availability. 

 **Benefits of establishing this best practice:** By defining and evaluating workload metrics you can determine the health of your workload and measure the achievement of business outcomes. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Define workload metrics: Define workload metrics to measure the achievement of KPIs. Define workload metrics to measure the health of the workload and its individual components. Evaluate metrics to determine if the workload is achieving desired outcomes, and to understand the health of the workload. 
  +  [Publish custom metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html) 
  +  [Searching and filtering log data](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html) 
  +  [Amazon CloudWatch metrics and dimensions reference](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon CloudWatch metrics and dimensions reference](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html) 
+  [Publish custom metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html) 
+  [Searching and filtering log data](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html) 

# OPS08-BP03 Collect and analyze workload metrics
<a name="ops_workload_health_collect_analyze_workload_metrics"></a>

 Perform regular proactive reviews of metrics to identify trends and determine where appropriate responses are needed. 

 You should aggregate log data from your application, workload components, services, and API calls to a service such as CloudWatch Logs. Generate metrics from observations of necessary log content to enable insight into the performance of operations activities. 

 On AWS, you can analyze workload metrics and identify operational issues using the machine learning capabilities of [Amazon DevOps Guru](https://docs.aws.amazon.com/devops-guru/latest/userguide/welcome.html). AWS DevOps Guru provides notification of operational issues with [targeted and proactive](https://docs.aws.amazon.com/devops-guru/latest/userguide/view-insights.html) recommendations to resolve issues and maintain application health. 

 In the AWS Shared Responsibility Model, portions of monitoring are delivered to you through the [AWS Health Dashboard](https://aws.amazon.com/premiumsupport/technology/personal-health-dashboard/). This dashboard provides alerts and remediation guidance when AWS is experiencing events that might affect you. Customers with Business and Enterprise Support subscriptions also get access to the [AWS Health API](https://docs.aws.amazon.com/health/latest/ug/getting-started-api.html), enabling integration to their event management systems. 

 On AWS, you can [export your log data to Amazon S3](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/S3Export.html) or [send logs directly](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Sending-Logs-Directly-To-S3.html) to [Amazon S3](https://aws.amazon.com/s3/) for long-term storage. Using [AWS Glue](https://aws.amazon.com/glue/), you can discover and prepare your log data in Amazon S3 for analytics, storing associated metadata in the [AWS Glue Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html). [Amazon Athena](https://aws.amazon.com/athena/), through its native integration with AWS Glue, can then be used to analyze your log data, querying it using standard SQL. Using a business intelligence tool like [Quick](https://aws.amazon.com/quicksight/) you can visualize, explore, and analyze your data. 

 An alternative [solution](https://aws.amazon.com/solutions/centralized-logging/) would be to use the [Amazon OpenSearch Service](https://aws.amazon.com/elasticsearch-service/) and [OpenSearch Dashboards](https://aws.amazon.com/elasticsearch-service/the-elk-stack/kibana/) to collect, analyze, and display logs on AWS across multiple accounts and AWS Regions. 

 **Common anti-patterns:** 
+  You are asked by the network design team for current network bandwidth utilization rates. You provide the current metrics, network utilization is at 35%. They reduce circuit capacity as a cost savings measure causing widespread connectivity issues as your point-in-time measurement did not reflect the trend in utilization rates. 
+  Your router has failed. It has been logging non-critical memory errors with greater and greater frequency up until its complete failure. You did not detect this trend and as a result did not replace the faulty memory before the router caused a service interruption. 

 **Benefits of establishing this best practice:** By collecting and analyzing your workload metrics you gain understanding of the health of your workload and can gain insight to trends that may have an impact on your workload or the achievement of your business outcomes. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Collect and analyze workload metrics: Perform regular proactive reviews of metrics to identify trends and determine where appropriate responses are needed. 
  +  [Using Amazon CloudWatch metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html) 
  +  [Amazon CloudWatch metrics and dimensions reference](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html) 
  +  [Collect metrics and logs from Amazon EC2 instances and on-premises servers with the CloudWatch Agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon Athena](https://aws.amazon.com/athena/) 
+  [Amazon CloudWatch metrics and dimensions reference](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html) 
+  [Amazon DevOps Guru](https://docs.aws.amazon.com/devops-guru/latest/userguide/welcome.html) 
+  [AWS Glue](https://aws.amazon.com/glue/) 
+  [AWSAWS Glue Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html) 
+  [Amazon OpenSearch Service](https://aws.amazon.com/elasticsearch-service/) 
+  [AWS Health Dashboard](https://aws.amazon.com/premiumsupport/technology/personal-health-dashboard/) 
+  [Quick](https://aws.amazon.com/quicksight/) 
+  [Collect metrics and logs from Amazon EC2 instances and on-premises servers with the CloudWatch Agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html) 
+  [Using Amazon CloudWatch metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html) 

# OPS08-BP04 Establish workload metrics baselines
<a name="ops_workload_health_workload_metric_baselines"></a>

 Establish baselines for metrics to provide expected values as the basis for comparison and identification of under- and over-performing components. Identify thresholds for improvement, investigation, and intervention. 

 **Common anti-patterns:** 
+  A server is running at 95% CPU utilization you are asked if that is good or bad. CPU utilization on that server has not been baselined so you have no idea if that is good or bad. 

 **Benefits of establishing this best practice:** By defining baseline metric values you are able to evaluate current metric values, and metric trends, to determine if action is required. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Establish baselines for workload metrics: Establish baselines for workload metrics to provide expected values as the basis for comparison. 
  +  [Creating Amazon CloudWatch Alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Creating Amazon CloudWatch Alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) 

# OPS08-BP05 Learn expected patterns of activity for workload
<a name="ops_workload_health_learn_workload_usage_patterns"></a>

 Establish patterns of workload activity to identify anomalous behavior so that you can respond appropriately if required. 

 CloudWatch through the [CloudWatch Anomaly Detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) feature applies statistical and machine learning algorithms to generate a range of expected values that represent normal metric behavior. 

 [Amazon DevOps Guru](https://docs.aws.amazon.com/devops-guru/latest/userguide/welcome.html) can be used to identify anomalous behavior through event correlation, log analysis, and applying machine learning to analyze your workload telemetry. When unexpected behaviors are detected, it provides the [related metrics and events](https://docs.aws.amazon.com/devops-guru/latest/userguide/understanding-insights-console.html) with recommendations to address the behavior. 

 **Common anti-patterns:** 
+  You are reviewing network utilization logs and see that network utilization increased between 11:30am and 1:30pm and then again at 4:30pm through 6:00pm. You are unaware if this should be considered normal or not. 
+  Your web servers reboot every night at 3:00am. You are unaware if this is an expected behavior. 

 **Benefits of establishing this best practice:** By learning patterns of behavior you can recognize unexpected behavior and take action if necessary. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Learn expected patterns of activity for workload: Establish patterns of workload activity to determine when behavior is outside of the expected values so that you can respond appropriately if required. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon DevOps Guru](https://docs.aws.amazon.com/devops-guru/latest/userguide/welcome.html) 
+  [CloudWatch Anomaly Detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) 

# OPS08-BP06 Alert when workload outcomes are at risk
<a name="ops_workload_health_workload_outcome_alerts"></a>

 Raise an alert when workload outcomes are at risk so that you can respond appropriately if necessary. 

 Ideally, you have previously identified a metric threshold that you are able to alarm upon or an event that you can use to trigger an automated response. 

 On AWS, you can use [Amazon CloudWatch Synthetics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) to create canary scripts to monitor your endpoints and APIs by performing the same actions as your customers. The telemetry generated and the [insight gained](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries_Details.html) can enable you to identify issues before your customers are impacted. 

 You can also use [CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html) to interactively search and analyze your log data using a purpose-built query language. CloudWatch Logs Insights automatically [discovers fields in logs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_AnalyzeLogData-discoverable-fields.html) from AWS services, and custom log events in JSON. It scales with your log volume and query complexity and gives you answers in seconds, helping you to search for the contributing factors of an incident. 

 **Common anti-patterns:** 
+  You have no network connectivity. No one is aware. No one is trying to identify why or taking action to restore connectivity. 
+  Following a patch, your persistent instances have become unavailable, disrupting users. Your users have opened support cases. No one has been notified. No one is taking action. 

 **Benefits of establishing this best practice:** By identifying that business outcomes are at risk and alerting for action to be taken you have the opportunity to prevent or mitigate the impact of an incident. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Alert when workload outcomes are at risk: Raise an alert when workload outcomes are at risk so that you can respond appropriately if required. 
  +  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 
  +  [Creating Amazon CloudWatch Alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) 
  +  [Invoking Lambda functions using Amazon SNS notifications](https://docs.aws.amazon.com/sns/latest/dg/sns-lambda.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon CloudWatch Synthetics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) 
+  [CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html) 
+  [Creating Amazon CloudWatch Alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) 
+  [Invoking Lambda functions using Amazon SNS notifications](https://docs.aws.amazon.com/sns/latest/dg/sns-lambda.html) 
+  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 

# OPS08-BP07 Alert when workload anomalies are detected
<a name="ops_workload_health_workload_anomaly_alerts"></a>

 Raise an alert when workload anomalies are detected so that you can respond appropriately if necessary. 

 Your analysis of your workload metrics over time may establish patterns of behavior that you can quantify sufficiently to define an event or raise an alarm in response. 

 Once trained, the [CloudWatch Anomaly Detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) feature can be used to [alarm](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create_Anomaly_Detection_Alarm.html) on detected anomalies or can provide overlaid expected values onto a [graph](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/graph_a_metric.html#create-metric-graph) of metric data for ongoing comparison. 

 **Common anti-patterns:** 
+  Your retail website sales have increased suddenly and dramatically. No one is aware. No one is trying to identify what led to this surge. No one is taking action to ensure quality customer experiences under the additional load. 
+  Following the application of a patch, your persistent servers are rebooting frequently, disrupting users. Your servers typically reboot up to three times but not more. No one is aware. No one is trying to identify why this is happening. 

 **Benefits of establishing this best practice:** By understanding patterns of workload behavior, you can identify unexpected behavior and take action if necessary. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Alert when workload anomalies are detected: Raise an alert when workload anomalies are detected so that you can respond appropriately if required. 
  +  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 
  +  [Creating Amazon CloudWatch Alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) 
  +  [Invoking Lambda functions using Amazon SNS notifications](https://docs.aws.amazon.com/sns/latest/dg/sns-lambda.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Creating Amazon CloudWatch Alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) 
+  [CloudWatch Anomaly Detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) 
+  [Invoking Lambda functions using Amazon SNS notifications](https://docs.aws.amazon.com/sns/latest/dg/sns-lambda.html) 
+  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 

# OPS08-BP08 Validate the achievement of outcomes and the effectiveness of KPIs and metrics
<a name="ops_workload_health_biz_level_view_workload"></a>

 Create a business-level view of your workload operations to help you determine if you are satisfying needs and to identify areas that need improvement to reach business goals. Validate the effectiveness of KPIs and metrics and revise them if necessary. 

 AWS also has support for third-party log analysis systems and business intelligence tools through the AWS service APIs and SDKs (for example, Grafana, Kibana, and Logstash). 

 **Common anti-patterns:** 
+  Page response time has never been considered a contributor to customer satisfaction. You have never established a metric or threshold for page response time. Your customers are complaining about slowness. 
+  You have not been achieving your minimum response time goals. In an effort to improve response time, you have scaled up your application servers. You are now exceeding response time goals by a significant margin and also have significant unused capacity you are paying for. 

 **Benefits of establishing this best practice:** By reviewing and revising KPIs and metrics, you understand how your workload supports the achievement of your business outcomes and can identify where improvement is needed to reach business goals. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Validate the achievement of outcomes and the effectiveness of KPIs and metrics: Create a business level view of your workload operations to help you determine if you are satisfying needs and to identify areas that need improvement to reach business goals. Validate the effectiveness of KPIs and metrics and revise them if necessary. 
  +  [Using Amazon CloudWatch dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) 
  +  [What is log analytics?](https://aws.amazon.com/log-analytics/) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Using Amazon CloudWatch dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) 
+  [What is log analytics?](https://aws.amazon.com/log-analytics/) 

# OPS 9  How do you understand the health of your operations?
<a name="ops-09"></a>

 Define, capture, and analyze operations metrics to gain visibility to operations events so that you can take appropriate action. 

**Topics**
+ [

# OPS09-BP01 Identify key performance indicators
](ops_operations_health_define_ops_kpis.md)
+ [

# OPS09-BP02 Define operations metrics
](ops_operations_health_design_ops_metrics.md)
+ [

# OPS09-BP03 Collect and analyze operations metrics
](ops_operations_health_collect_analyze_ops_metrics.md)
+ [

# OPS09-BP04 Establish operations metrics baselines
](ops_operations_health_ops_metric_baselines.md)
+ [

# OPS09-BP05 Learn the expected patterns of activity for operations
](ops_operations_health_learn_ops_usage_patterns.md)
+ [

# OPS09-BP06 Alert when operations outcomes are at risk
](ops_operations_health_ops_outcome_alerts.md)
+ [

# OPS09-BP07 Alert when operations anomalies are detected
](ops_operations_health_ops_anomaly_alerts.md)
+ [

# OPS09-BP08 Validate the achievement of outcomes and the effectiveness of KPIs and metrics
](ops_operations_health_biz_level_view_ops.md)

# OPS09-BP01 Identify key performance indicators
<a name="ops_operations_health_define_ops_kpis"></a>

 Identify key performance indicators (KPIs) based on desired business outcomes (for example, new features delivered) and customer outcomes (for example, customer support cases). Evaluate KPIs to determine operations success. 

 **Common anti-patterns:** 
+  You are asked by business leadership how successful operations is at accomplishing business goals but have no frame of reference to determine success. 
+  You are unable to determine if your maintenance windows have an impact on business outcomes. 

 **Benefits of establishing this best practice:** By identifying key performance indicators you enable achieving business outcomes as the test of the health and success of your operations. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Identify key performance indicators: Identify key performance indicators (KPIs) based on desired business and customer outcomes. Evaluate KPIs to determine operations success. 

# OPS09-BP02 Define operations metrics
<a name="ops_operations_health_design_ops_metrics"></a>

 Define operations metrics to measure the achievement of KPIs (for example, successful deployments, and failed deployments). Define operations metrics to measure the health of operations activities (for example, mean time to detect an incident (MTTD), and mean time to recovery (MTTR) from an incident). Evaluate metrics to determine if operations are achieving desired outcomes, and to understand the health of your operations activities. 

 **Common anti-patterns:** 
+  Your operations metrics are based on what the team thinks is reasonable. 
+  You have errors in your metrics calculations that will yield incorrect results. 
+  You don't have any metrics defined for your operations activities. 

 **Benefits of establishing this best practice:** By defining and evaluating operations metrics you can determine the health of your operations activities and measure the achievement of business outcomes. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Define operations metrics: Define operations metrics to measure the achievement of KPIs. Define operations metrics to measure the health of operations and its activities. Evaluate metrics to determine if operations are achieving desired outcomes, and to understand the health of the operations. 
  +  [Publish custom metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html) 
  +  [Searching and filtering log data](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html) 
  +  [Amazon CloudWatch metrics and dimensions reference](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [AWS Answers: Centralized Logging](https://aws.amazon.com/answers/logging/centralized-logging/) 
+  [Amazon CloudWatch metrics and dimensions reference](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html) 
+  [Detect and React to Changes in Pipeline State with Amazon CloudWatch Events](https://docs.aws.amazon.com/codepipeline/latest/userguide/detect-state-changes-cloudwatch-events.html) 
+  [Publish custom metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html) 
+  [Searching and filtering log data](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html) 

 **Related videos:** 
+  Build a Monitoring Plan 

# OPS09-BP03 Collect and analyze operations metrics
<a name="ops_operations_health_collect_analyze_ops_metrics"></a>

 Perform regular, proactive reviews of metrics to identify trends and determine where appropriate responses are needed. 

 You should aggregate log data from the execution of your operations activities and operations API calls, into a service such as CloudWatch Logs. Generate metrics from observations of necessary log content to gain insight into the performance of operations activities. 

 On AWS, you can [export your log data to Amazon S3](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/S3Export.html) or [send logs directly](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Sending-Logs-Directly-To-S3.html) to [Amazon S3](https://aws.amazon.com/s3/) for long-term storage. Using [AWS Glue](https://aws.amazon.com/glue/), you can discover and prepare your log data in Amazon S3 for analytics, storing associated metadata in the [AWSAWS Glue Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html). [Amazon Athena](https://aws.amazon.com/athena/), through its native integration with AWS Glue, can then be used to analyze your log data, querying it using standard SQL. Using a business intelligence tool like [Quick](https://aws.amazon.com/quicksight/) you can visualize, explore, and analyze your data. 

 **Common anti-patterns:** 
+  Consistent delivery of new features is considered a key performance indicator. You have no method to measure how frequently deployments occur. 
+  You log deployments, rolled back deployments, patches, and rolled back patches to track you operations activities, but no one reviews the metrics. 
+  You have a recovery time objective to restore a lost database within fifteen minutes that was defined when the system was deployed and had no users. You now have ten thousand users and have been operating for two years. A recent restore took over two hours. This was not recorded and no one is aware. 

 **Benefits of establishing this best practice:** By collecting and analyzing your operations metrics, you gain understanding of the health of your operations and can gain insight to trends that have may an impact on your operations or the achievement of your business outcomes. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Collect and analyze operations metrics: Perform regular proactive reviews of metrics to identify trends and determine where appropriate responses are needed. 
  +  [Using Amazon CloudWatch metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html) 
  +  [Amazon CloudWatch metrics and dimensions reference](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html) 
  +  [Collect metrics and logs from Amazon EC2 instances and on-premises servers with the CloudWatch Agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon Athena](https://aws.amazon.com/athena/) 
+  [Amazon CloudWatch metrics and dimensions reference](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html) 
+  [Quick](https://aws.amazon.com/quicksight/) 
+  [AWS Glue](https://aws.amazon.com/glue/) 
+  [AWSAWS Glue Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html) 
+  [Collect metrics and logs from Amazon EC2 instances and on-premises servers with the CloudWatch Agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html) 
+  [Using Amazon CloudWatch metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html) 

# OPS09-BP04 Establish operations metrics baselines
<a name="ops_operations_health_ops_metric_baselines"></a>

 Establish baselines for metrics to provide expected values as the basis for comparison and identification of under and over performing operations activities. 

 **Common anti-patterns:** 
+  You have been asked what the expected time to deploy is. You have not measured how long it takes to deploy and can not determine expected times. 
+  You have been asked what how long it takes to recover from an issue with the application servers. You have no information about time to recovery from first customer contact. You have no information about time to recovery from first identification of an issue through monitoring. 
+  You have been asked how many support personnel are required over the weekend. You have no idea how many support cases are typical over a weekend and can not provide an estimate. 
+  You have a recovery time objective to restore lost databases within fifteen minutes that was defined when the system was deployed and had no users. You now have ten thousand users and have been operating for two years. You have no information on how the time to restore has changed for your database. 

 **Benefits of establishing this best practice:** By defining baseline metric values you are able to evaluate current metric values, and metric trends, to determine if action is required. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Learn expected patterns of activity for operations: Establish patterns of operations activity to determine when behavior is outside of the expected values so that you can respond appropriately if required. 

# OPS09-BP05 Learn the expected patterns of activity for operations
<a name="ops_operations_health_learn_ops_usage_patterns"></a>

 Establish patterns of operations activities to identify anomalous activity so that you can respond appropriately if necessary. 

 **Common anti-patterns:** 
+  Your deployment failure rate has increased substantially recently. You address each of the failures independently. You do not realize that the failures correspond to deployments by a new employee who is unfamiliar with the deployment management system. 

 **Benefits of establishing this best practice:** By learning patterns of behavior, you can recognize unexpected behavior and take action if necessary. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Learn expected patterns of activity for operations: Establish patterns of operations activity to determine when behavior is outside of the expected values so that you can respond appropriately if required. 

# OPS09-BP06 Alert when operations outcomes are at risk
<a name="ops_operations_health_ops_outcome_alerts"></a>

 Whenever operations outcomes are at risk, an alert must be raised and acted upon. Operations outcomes are any activity that supports a workload in production. This includes everything from deploying new versions of applications to recovering from an outage. Operations outcomes must be treated with the same importance as business outcomes. 

Software teams should identify key operations metrics and activities and build alerts for them. Alerts must be timely and actionable. If an alert is raised, a reference to a corresponding runbook or playbook should be included. Alerts without a corresponding action can lead to alert fatigue.

 **Desired outcome:** When operations activities are at risk, alerts are sent to drive action. The alerts contain context on why an alert is being raised and point to a playbook to investigate or a runbook to mitigate. Where possible, runbooks are automated and notifications are sent. 

 **Common anti-patterns:** 
+ You are investigating an incident and support cases are being filed. The support cases are breaching the service level agreement (SLA) but no alerts are being raised. 
+ A deployment to production scheduled for midnight is delayed due to last-minute code changes. No alert is raised and the deployment hangs.
+ A production outage occurs but no alerts are sent.
+  Your deployment time consistently runs behind estimates. No action is taken to investigate. 

 **Benefits of establishing this best practice:** 
+  Alerting when operations outcomes are at risk boosts your ability to support your workload by staying ahead of issues. 
+  Business outcomes are improved due to healthy operations outcomes. 
+  Detection and remediation of operations issues are improved. 
+  Overall operational health is increased. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Operations outcomes must be defined before you can alert on them. Start by defining what operations activities are most important to your organization. Is it deploying to production in under two hours or responding to a support case within a set amount of time? Your organization must define key operations activities and how they are measured so that they can be monitored, improved, and alerted on. You need a central location where workload and operations telemetry is stored and analyzed. The same mechanism should be able to raise an alert when an operations outcome is at risk. 

 **Customer example** 

 A CloudWatch alarm was triggered during a routine deployment at AnyCompany Retail. The lead time for deployment was breached. Amazon EventBridge created an OpsItem in AWS Systems Manager OpsCenter. The Cloud Operations team used a playbook to investigate the issue and identified that a schema change was taking longer than expected. They alerted the on-call developer and continued monitoring the deployment. Once the deployment was complete, the Cloud Operations team resolved the OpsItem. The team will analyze the incident during a postmortem. 

## Implementation steps
<a name="implementation-steps"></a>

1. If you have not identified operations KPIs, metrics, and activities, work on implementing the preceding best practices to this question (OPS09-BP01 to OPS09-BP05). 
   +  Support customers with [Enterprise Support](https://aws.amazon.com/premiumsupport/plans/enterprise/) can request the [Operations KPI Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/#Operational_Workshops_and_Deep_Dives) from their Technical Account Manager. This collaborative workshop helps you define operations KPIs and metrics aligned to business goals, provided at no additional cost. Contact your Technical Account Manager to learn more. 

1.  Once you have operations activities, KPIs, and metrics established, configure alerts in your observability platform. Alerts should have an action associated to them, like a playbook or runbook. Alerts without an action should be avoided. 

1.  Over time, you should evaluate your operations metrics, KPIs, and activities to identify areas of improvement. Capture feedback in runbooks and playbooks from operators to identify areas for improvement in responding to alerts. 

1.  Alerts should include a mechanism to flag them as a false-positive. This should lead to a review of the metric thresholds. 

 **Level of effort for the implementation plan:** Medium. There are several best practices that must be in place before implementing this best practice. Once operations activities have been identified and operations KPIs established, alerts should be established. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS02-BP03 Operations activities have identified owners responsible for their performance](ops_ops_model_def_activity_owners.md): Every operation activity and outcome should have an identified owner that's responsible. This is who should be alerted when outcomes are at risk. 
+  [OPS03-BP02 Team members are empowered to take action when outcomes are at risk](ops_org_culture_team_emp_take_action.md): When alerts are raised, your team should have agency to act to remedy the issue. 
+  [OPS09-BP01 Identify key performance indicators](ops_operations_health_define_ops_kpis.md): Alerting on operations outcomes starts with identify operations KPIs. 
+  [OPS09-BP02 Define operations metrics](ops_operations_health_design_ops_metrics.md): Establish this best practice before you start generating alerts. 
+  [OPS09-BP03 Collect and analyze operations metrics](ops_operations_health_collect_analyze_ops_metrics.md): Centrally collecting operations metrics is required to build alerts. 
+  [OPS09-BP04 Establish operations metrics baselines](ops_operations_health_ops_metric_baselines.md): Operations metrics baselines provide the ability to tune alerts and avoid alert fatigue. 
+  [OPS09-BP05 Learn the expected patterns of activity for operations](ops_operations_health_learn_ops_usage_patterns.md): You can improve the accuracy of your alerts by understanding the activity patterns for operations events. 
+  [OPS09-BP08 Validate the achievement of outcomes and the effectiveness of KPIs and metrics](ops_operations_health_biz_level_view_ops.md): Evaluate the achievement of operations outcomes to ensure that your KPIs and metrics are valid. 
+  [OPS10-BP02 Have a process per alert](ops_event_response_process_per_alert.md): Every alert should have an associated runbook or playbook and provide context for the person being alerted. 
+  [OPS11-BP02 Perform post-incident analysis](ops_evolve_ops_perform_rca_process.md): Conduct a post-incident analysis after the alert to identify areas for improvement. 

 **Related documents:** 
+  [AWS Deployment Pipelines Reference Architecture: Application Pipeline Architecture](https://pipelines.devops.aws.dev/application-pipeline/) 

 **Related videos:** 
+  [Aggregate and Resolve Operational Issues Using AWS Systems Manager OpsCenter](https://www.youtube.com/watch?v=r6ilQdxLcqY) 
+  [Integrate AWS Systems Manager OpsCenter with Amazon CloudWatch Alarms](https://www.youtube.com/watch?v=Gpc7a5kVakI) 
+  [Integrate Your Data Sources into AWS Systems Manager OpsCenter Using Amazon EventBridge](https://www.youtube.com/watch?v=Xmmu5mMsq3c) 

 **Related examples:** 
+  [Automate remediation actions for Amazon EC2 notifications and beyond using Amazon EC2 Systems Manager Automation and AWS Health](https://aws.amazon.com/blogs/mt/automate-remediation-actions-for-amazon-ec2-notifications-and-beyond-using-ec2-systems-manager-automation-and-aws-health/) 
+  [AWS Management and Governance Tools Workshop - Operations 2022](https://mng.workshop.aws/operations-2022.html) 
+  [Ingesting, analyzing, and visualizing metrics with DevOps Monitoring Dashboard on AWS](https://docs.aws.amazon.com/solutions/latest/devops-monitoring-dashboard-on-aws/welcome.html) 

 **Related services:** 
+  [Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html) 
+  [Support Proactive Services - Operations KPI Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/#Operational_Workshops_and_Deep_Dives) 
+  [AWS Systems Manager OpsCenter](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter.html) 
+  [CloudWatch Events](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 

# OPS09-BP07 Alert when operations anomalies are detected
<a name="ops_operations_health_ops_anomaly_alerts"></a>

 Raise an alert when operations anomalies are detected so that you can respond appropriately if necessary. 

 Your analysis of your operations metrics over time may established patterns of behavior that you can quantify sufficiently to define an event or raise an alarm in response. 

 Once trained, the [CloudWatch Anomaly Detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) feature can be used to [alarm](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create_Anomaly_Detection_Alarm.html) on detected anomalies or can provide overlaid expected values onto a [graph](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/graph_a_metric.html#create-metric-graph) of metric data for ongoing comparison. 

 [Amazon DevOps Guru](https://docs.aws.amazon.com/devops-guru/latest/userguide/welcome.html) can be used to identify anomalous behavior through event correlation, log analysis, and applying machine learning to analyze your workload telemetry. The [insights](https://docs.aws.amazon.com/devops-guru/latest/userguide/understanding-insights-console.html) gained are presented with the relevant data and recommendations. 

 **Common anti-patterns:** 
+  You are applying a patch to your fleet of instances. You tested the patch successfully in the test environment. The patch is failing for a large percentage of instances in your fleet. You do nothing. 
+  You note that there are deployments starting Friday end of day. Your organization has predefined maintenance windows on Tuesdays and Thursdays. You do nothing. 

 **Benefits of establishing this best practice:** By understanding patterns of operations behavior you can identify unexpected behavior and take action if necessary. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Alert when operations anomalies are detected: Raise an alert when operations anomalies are detected so that you can respond appropriately if required. 
  +  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 
  +  [Creating Amazon CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) 
  +  [Invoking Lambda functions using Amazon SNS notifications](https://docs.aws.amazon.com/sns/latest/dg/sns-lambda.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon DevOps Guru](https://docs.aws.amazon.com/devops-guru/latest/userguide/welcome.html) 
+  [CloudWatch Anomaly Detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) 
+  [Creating Amazon CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) 
+  [Detect and React to Changes in Pipeline State with Amazon CloudWatch Events](https://docs.aws.amazon.com/codepipeline/latest/userguide/detect-state-changes-cloudwatch-events.html) 
+  [Invoking Lambda functions using Amazon SNS notifications](https://docs.aws.amazon.com/sns/latest/dg/sns-lambda.html) 
+  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 

# OPS09-BP08 Validate the achievement of outcomes and the effectiveness of KPIs and metrics
<a name="ops_operations_health_biz_level_view_ops"></a>

 Create a business-level view of your operations activities to help you determine if you are satisfying needs and to identify areas that need improvement to reach business goals. Validate the effectiveness of KPIs and metrics and revise them if necessary. 

 AWS also has support for third-party log analysis systems and business intelligence tools through the AWS service APIs and SDKs (for example, Grafana, Kibana, and Logstash). 

 **Common anti-patterns:** 
+  The frequency of your deployments has increased with the growth in number of development teams. Your defined expected number of deployments is once per week. You have been regularly deploying daily. When their is an issue with your deployment system, and deployments are not possible, it goes undetected for days. 
+  When your business previously provided support only during core business hours from Monday to Friday. You established a next business day response time goal for incidents. You have recently started offering 24x7 support coverage with a two hour response time goal. Your overnight staff are overwhelmed and customers are unhappy. There is no indication that there are issues with incident response times because you are reporting against a next business day target. 

 **Benefits of establishing this best practice:** By reviewing and revising KPIs and metrics, you understand how your workload supports the achievement of your business outcomes and can identify where improvement is needed to reach business goals. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Validate the achievement of outcomes and the effectiveness of KPIs and metrics: Create a business level view of your operations activities to help you determine if you are satisfying needs and to identify areas that need improvement to reach business goals. Validate the effectiveness of KPIs and metrics and revise them if necessary. 
  +  [Using Amazon CloudWatch dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) 
  +  [What is log analytics?](https://aws.amazon.com/log-analytics/) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Using Amazon CloudWatch dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) 
+  [What is log analytics?](https://aws.amazon.com/log-analytics/) 

# OPS 10  How do you manage workload and operations events?
<a name="ops-10"></a>

 Prepare and validate procedures for responding to events to minimize their disruption to your workload. 

**Topics**
+ [

# OPS10-BP01 Use a process for event, incident, and problem management
](ops_event_response_event_incident_problem_process.md)
+ [

# OPS10-BP02 Have a process per alert
](ops_event_response_process_per_alert.md)
+ [

# OPS10-BP03 Prioritize operational events based on business impact
](ops_event_response_prioritize_events.md)
+ [

# OPS10-BP04 Define escalation paths
](ops_event_response_define_escalation_paths.md)
+ [

# OPS10-BP05 Enable push notifications
](ops_event_response_push_notify.md)
+ [

# OPS10-BP06 Communicate status through dashboards
](ops_event_response_dashboards.md)
+ [

# OPS10-BP07 Automate responses to events
](ops_event_response_auto_event_response.md)

# OPS10-BP01 Use a process for event, incident, and problem management
<a name="ops_event_response_event_incident_problem_process"></a>

Your organization has processes to handle events, incidents, and problems. *Events* are things that occur in your workload but may not need intervention. *Incidents* are events that require intervention. *Problems* are recurring events that require intervention or cannot be resolved. You need processes to mitigate the impact of these events on your business and make sure that you respond appropriately.

When incidents and problems happen to your workload, you need processes to handle them. How will you communicate the status of the event with stakeholders? Who oversees leading the response? What are the tools that you use to mitigate the event? These are examples of some of the questions you need answer to have a solid response process. 

Processes must be documented in a central location and available to anyone involved in your workload. If you don’t have a central wiki or document store, a version control repository can be used. You’ll keep these plans up to date as your processes evolve. 

Problems are candidates for automation. These events take time away from your ability to innovate. Start with building a repeatable process to mitigate the problem. Over time, focus on automating the mitigation or fixing the underlying issue. This frees up time to devote to making improvements in your workload. 

**Desired outcome:** Your organization has a process to handle events, incidents, and problems. These processes are documented and stored in a central location. They are updated as processes change. 

**Common anti-patterns:** 
+  An incident happens on the weekend and the on-call engineer doesn’t know what to do. 
+  A customer sends you an email that the application is down. You reboot the server to fix it. This happens frequently. 
+  There is an incident with multiple teams working independently to try to solve it. 
+  Deployments happen in your workload without being recorded. 

 **Benefits of establishing this best practice:** 
+  You have an audit trail of events in your workload. 
+  Your time to recover from an incident is decreased. 
+  Team members can resolve incidents and problems in a consistent manner. 
+  There is a more consolidated effort when investigating an incident. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

Implementing this best practice means you are tracking workload events. You have processes to handle incidents and problems. The processes are documented, shared, and updated frequently. Problems are identified, prioritized, and fixed. 

 **Customer example** 

AnyCompany Retail has a portion of their internal wiki devoted to processes for event, incident, and problem management. All events are sent to [Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html). Problems are identified as OpsItems in [AWS Systems Manager OpsCenter](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter.html) and prioritized to fix, reducing undifferentiated labor. As processes change, they’re updated in their internal wiki. They use [AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html) to manage incidents and coordinate mitigation efforts. 

## Implementation steps
<a name="implementation-steps"></a>

1.  Events 
   +  Track events that happen in your workload, even if no human intervention is required. 
   +  Work with workload stakeholders to develop a list of events that should be tracked. Some examples are completed deployments or successful patching. 
   +  You can use services like [Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html) or [Amazon Simple Notification Service](https://docs.aws.amazon.com/sns/latest/dg/welcome.html) to generate custom events for tracking. 

1.  Incidents 
   +  Start by defining the communication plan for incidents. What stakeholders must be informed? How will you keep them in the loop? Who oversees coordinating efforts? We recommend standing up an internal chat channel for communication and coordination. 
   +  Define escalation paths for the teams that support your workload, especially if the team doesn’t have an on-call rotation. Based on your support level, you can also file a case with Support. 
   +  Create a playbook to investigate the incident. This should include the communication plan and detailed investigation steps. Include checking the [AWS Health Dashboard](https://docs.aws.amazon.com/health/latest/ug/what-is-aws-health.html) in your investigation. 
   +  Document your incident response plan. Communicate the incident management plan so internal and external customers understand the rules of engagement and what is expected of them. Train your team members on how to use it. 
   +  Customers can use [Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html) to set up and manage their incident response plan. 
   +  Enterprise Support customers can request the [Incident Management Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/#Operational_Workshops_and_Deep_Dives) from their Technical Account Manager. This guided workshop tests your existing incident response plan and helps you identify areas for improvement. 

1.  Problems 
   +  Problems must be identified and tracked in your ITSM system. 
   +  Identify all known problems and prioritize them by effort to fix and impact to workload.   
![\[Action priority matrix for prioritizing problems.\]](http://docs.aws.amazon.com/wellarchitected/2022-03-31/framework/images/impact-effort-chart.png)
   +  Solve problems that are high impact and low effort first. Once those are solved, move on to problems to that fall into the low impact low effort quadrant. 
   +  You can use [AWS Systems Manager OpsCenter](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter.html) to identify these problems, attach runbooks to them, and track them. 

**Level of effort for the implementation plan:** Medium. You need both a process and tools to implement this best practice. Document your processes and make them available to anyone associated with the workload. Update them frequently. You have a process for managing problems and mitigating them or fixing them. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS07-BP03 Use runbooks to perform procedures](ops_ready_to_support_use_runbooks.md): Known problems need an associated runbook so that mitigation efforts are consistent.
+  [OPS07-BP04 Use playbooks to investigate issues](ops_ready_to_support_use_playbooks.md): Incidents must be investigated using playbooks. 
+  [OPS11-BP02 Perform post-incident analysis](ops_evolve_ops_perform_rca_process.md): Always conduct a postmortem after you recover from an incident. 

 **Related documents:** 
+  [Atlassian - Incident management in the age of DevOps](https://www.atlassian.com/incident-management/devops) 
+  [AWS Security Incident Response Guide](https://docs.aws.amazon.com/whitepapers/latest/aws-security-incident-response-guide/welcome.html) 
+  [Incident Management in the Age of DevOps and SRE](https://www.infoq.com/presentations/incident-management-devops-sre/) 
+  [PagerDuty - What is Incident Management?](https://www.pagerduty.com/resources/learn/what-is-incident-management/) 

 **Related videos:** 
+  [AWS re:Invent 2020: Incident management in a distributed organization](https://www.youtube.com/watch?v=tyS1YDhMVos) 
+  [AWS re:Invent 2021 - Building next-gen applications with event-driven architectures](https://www.youtube.com/watch?v=U5GZNt0iMZY) 
+  [AWS Supports You \$1 Exploring the Incident Management Tabletop Exercise](https://www.youtube.com/watch?v=0m8sGDx-pRM) 
+  [AWS Systems Manager Incident Manager - AWS Virtual Workshops](https://www.youtube.com/watch?v=KNOc0DxuBSY) 
+  [AWS What's Next ft. Incident Manager \$1 AWS Events](https://www.youtube.com/watch?v=uZL-z7cII3k) 

 **Related examples:** 
+  [AWS Management and Governance Tools Workshop - OpsCenter](https://mng.workshop.aws/ssm/capability_hands-on_labs/opscenter.html) 
+  [AWS Proactive Services – Incident Management Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/#Operational_Workshops_and_Deep_Dives) 
+  [Building an event-driven application with Amazon EventBridge](https://aws.amazon.com/blogs/compute/building-an-event-driven-application-with-amazon-eventbridge/) 
+  [Building event-driven architectures on AWS](https://catalog.us-east-1.prod.workshops.aws/workshops/63320e83-6abc-493d-83d8-f822584fb3cb/en-US/) 

 **Related services:** 
+  [Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html) 
+  [Amazon SNS](https://docs.aws.amazon.com/sns/latest/dg/welcome.html) 
+  [AWS Health Dashboard](https://docs.aws.amazon.com/health/latest/ug/what-is-aws-health.html) 
+  [AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html) 
+  [AWS Systems Manager OpsCenter](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter.html) 

# OPS10-BP02 Have a process per alert
<a name="ops_event_response_process_per_alert"></a>

 Have a well-defined response (runbook or playbook), with a specifically identified owner, for any event for which you raise an alert. This ensures effective and prompt responses to operations events and prevents actionable events from being obscured by less valuable notifications. 

 **Common anti-patterns:** 
+  Your monitoring system presents you a stream of approved connections along with other messages. The volume of messages is so large that you miss periodic error messages that require your intervention. 
+  You receive an alert that the website is down. There is no defined process for when this happens. You are forced to take an ad hoc approach to diagnose and resolve the issue. Developing this process as you go extends the time to recovery. 

 **Benefits of establishing this best practice:** By alerting only when action is required, you prevent low value alerts from concealing high value alerts. By having a process for every actionable alert, you enable a consistent and prompt response to events in your environment. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Process per alert: Any event for which you raise an alert should have a well-defined response (runbook or playbook) with a specifically identified owner (for example, individual, team, or role) accountable for successful completion. Performance of the response may be automated or conducted by another team but the owner is accountable for ensuring the process delivers the expected outcomes. By having these processes, you ensure effective and prompt responses to operations events and you can prevent actionable events from being obscured by less valuable notifications. For example, automatic scaling might be applied to scale a web front end, but the operations team might be accountable to ensure that the automatic scaling rules and limits are appropriate for workload needs. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon CloudWatch Features](https://aws.amazon.com/cloudwatch/features/) 
+  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 

 **Related videos:** 
+  [Build a Monitoring Plan](https://www.youtube.com/watch?v=OMmiGETJpfU) 

# OPS10-BP03 Prioritize operational events based on business impact
<a name="ops_event_response_prioritize_events"></a>

 Ensure that when multiple events require intervention, those that are most significant to the business are addressed first. Impacts can include loss of life or injury, financial loss, or damage to reputation or trust. 

 **Common anti-patterns:** 
+  You receive a support request to add a printer configuration for a user. While working on the issue, you receive a support request stating that your retail site is down. After completing the printer configuration for your user, you start work on the website issue. 
+  You get notified that both your retail website and your payroll system are down. You don't know which one should get priority. 

 **Benefits of establishing this best practice:** Prioritizing responses to the incidents with the greatest impact on the business enables your management of that impact. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Prioritize operational events based on business impact: Ensure that when multiple events require intervention, those that are most significant to the business are addressed first. Impacts can include loss of life or injury, financial loss, regulatory violations, or damage to reputation or trust. 

# OPS10-BP04 Define escalation paths
<a name="ops_event_response_define_escalation_paths"></a>

 Define escalation paths in your runbooks and playbooks, including what triggers escalation, and procedures for escalation. Specifically identify owners for each action to ensure effective and prompt responses to operations events. 

 Identify when a human decision is required before an action is taken. Work with decision makers to have that decision made in advance, and the action preapproved, so that MTTR is not extended waiting for a response. 

 **Common anti-patterns:** 
+  Your retail site is down. You don't understand the runbook for recovering the site. You start calling colleagues hoping that someone will be able to help you. 
+  You receive a support case for an unreachable application. You don't have permissions to administer the system. You don't know who does. You attempt to contact the system owner that opened the case and there is no response. You have no contacts for the system and your colleagues are not familiar with it. 

 **Benefits of establishing this best practice:** By defining escalations, triggers for escalation, and procedures for escalation you enable the systematic addition of resources to an incident at an appropriate rate for the impact. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Define escalation paths: Define escalation paths in your runbooks and playbooks, including what triggers escalation, and procedures for escalation. For example, escalation of an issue from support engineers to senior support engineers when runbooks cannot resolve the issue, or when a predefined period of time has elapsed. Another example of an appropriate escalation path is from senior support engineers to the development team for a workload when the playbooks are unable to identify a path to remediation, or when a predefined period of time has elapsed. Specifically identify owners for each action to ensure effective and prompt responses to operations events. Escalations can include third parties. For example, a network connectivity provider or a software vendor. Escalations can include identified authorized decision makers for impacted systems. 

# OPS10-BP05 Enable push notifications
<a name="ops_event_response_push_notify"></a>

 Communicate directly with your users (for example, with email or SMS) when the services they use are impacted, and again when the services return to normal operating conditions, to enable users to take appropriate action. 

 **Common anti-patterns:** 
+  Your application is experiencing a distributed denial of service incident and has been unresponsive for days. There is no error message. You have not sent a notification email. You have not sent text notifications. You have not shared information on social media. You customers are frustrated and looking for other vendors who can support them. 
+  On Monday, your application had issues following a patch and was down for a couple of hours. On Tuesday, your application had issues following a code deployment and was unreliable for a couple of hours. On Wednesday, your application had issues following a code deployment to mitigate a security vulnerability associated to the failed patch and was unavailable for a couple of hours. On Thursday, your frustrated customers started looking for another vendor who could support them. 
+  Your application is going to be down for maintenance this weekend. You don't inform your customers. Some of your customers had scheduled activities involving the use of your application. They are very frustrated upon discovery that your application is not available. 

 **Benefits of establishing this best practice:** By defining notifications, triggers for notifications, and procedures for notifications you enable your customer to be informed and respond when issues with your workload impact them. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Enable push notifications: Communicate directly with your users (for example, with email or SMS) when the services they use are impacted, and when the services return to normal operating conditions, to enable users to take appropriate action. 
  +  [Amazon SES features](https://aws.amazon.com/ses/details/) 
  +  [What is Amazon SES?](https://docs.aws.amazon.com/ses/latest/DeveloperGuide/Welcome.html) 
  +  [Set up Amazon SNS notifications](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/US_SetupSNS.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon SES features](https://aws.amazon.com/ses/details/) 
+  [Set up Amazon SNS notifications](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/US_SetupSNS.html) 
+  [What is Amazon SES?](https://docs.aws.amazon.com/ses/latest/DeveloperGuide/Welcome.html) 

# OPS10-BP06 Communicate status through dashboards
<a name="ops_event_response_dashboards"></a>

 Provide dashboards tailored to their target audiences (for example, internal technical teams, leadership, and customers) to communicate the current operating status of the business and provide metrics of interest. 

 You can create dashboards using [Amazon CloudWatch Dashboards](https://aws.amazon.com/blogs/aws/cloudwatch-dashboards-create-use-customized-metrics-views/) on customizable home pages in the CloudWatch console. Using business intelligence services such as [Quick](https://aws.amazon.com/quicksight/) you can create and publish interactive dashboards of your workload and operational health (for example, order rates, connected users, and transaction times). Create Dashboards that present system and business-level views of your metrics. 

 **Common anti-patterns:** 
+  Upon request, you run a report on the current utilization of your application for management. 
+  During an incident, you are contacted every twenty minutes by a concerned system owner wanting to know if it is fixed yet. 

 **Benefits of establishing this best practice:** By creating dashboards, you enable self-service access to information enabling your customers to inform themselves and determine if they need to take action. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Communicate status through dashboards: Provide dashboards tailored to their target audiences (for example, internal technical teams, leadership, and customers) to communicate the current operating status of the business and provide metrics of interest. Providing a self-service option for status information reduces the disruption of fielding requests for status by the operations team. Examples include Amazon CloudWatch dashboards, and AWS Health Dashboard. 
  +  [CloudWatch dashboards create and use customized metrics views](https://aws.amazon.com/blogs/aws/cloudwatch-dashboards-create-use-customized-metrics-views/) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Quick](https://aws.amazon.com/quicksight/) 
+  [CloudWatch dashboards create and use customized metrics views](https://aws.amazon.com/blogs/aws/cloudwatch-dashboards-create-use-customized-metrics-views/) 

# OPS10-BP07 Automate responses to events
<a name="ops_event_response_auto_event_response"></a>

 Automate responses to events to reduce errors caused by manual processes, and to ensure prompt and consistent responses. 

 There are multiple ways to automate runbook and playbook actions on AWS. To respond to an event from a state change in your AWS resources, or from your own custom events, you should create [CloudWatch Events rules](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) to trigger responses through CloudWatch targets (for example, Lambda functions, Amazon Simple Notification Service (Amazon SNS) topics, Amazon ECS tasks, and AWS Systems Manager Automation). 

 To respond to a metric that crosses a threshold for a resource (for example, wait time), you should create [CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) to perform one or more actions using Amazon EC2 actions, Auto Scaling actions, or to send a notification to an Amazon SNS topic. If you need to perform custom actions in response to an alarm, invoke Lambda through an Amazon SNS notification. Use Amazon SNS to publish event notifications and escalation messages to keep people informed. 

 AWS also supports third-party systems through the AWS service APIs and SDKs. There are a number of monitoring tools provided by AWS Partners and third parties that allow for monitoring, notifications, and responses. Some of these tools include New Relic, Splunk, Loggly, SumoLogic, and Datadog. 

 You should keep critical manual procedures available for use when automated procedures fail 

 **Common anti-patterns:** 
+  A developer checks in their code. This event could have been used to start a build and then perform testing but instead nothing happens. 
+  Your application logs a specific error before it stops working. The procedure to restart the application is well understood and could be scripted. You could use the log event to invoke a script and restart the application. Instead, when the error happens at 3am Sunday morning, you are woken up as the on-call resource responsible to fix the system. 

 **Benefits of establishing this best practice:** By using automated responses to events, you reduce the time to respond and limit the introduction of errors from manual activities. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Automate responses to events: Automate responses to events to reduce errors caused by manual processes, and to ensure prompt and consistent responses. 
  +  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 
  +  [Creating a CloudWatch Events rule that triggers on an event](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Create-CloudWatch-Events-Rule.html) 
  +  [Creating a CloudWatch Events rule that triggers on an AWS API call using AWS CloudTrail](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Create-CloudWatch-Events-CloudTrail-Rule.html) 
  +  [CloudWatch Events event examples from supported services](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon CloudWatch Features](https://aws.amazon.com/cloudwatch/features/) 
+  [CloudWatch Events event examples from supported services](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html) 
+  [Creating a CloudWatch Events rule that triggers on an AWS API call using AWS CloudTrail](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Create-CloudWatch-Events-CloudTrail-Rule.html) 
+  [Creating a CloudWatch Events rule that triggers on an event](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Create-CloudWatch-Events-Rule.html) 
+  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 

 **Related videos:** 
+  [Build a Monitoring Plan](https://www.youtube.com/watch?v=OMmiGETJpfU) 

 **Related examples:** 