

# OPS 9  How do you understand the health of your operations?
<a name="ops-09"></a>

 Define, capture, and analyze operations metrics to gain visibility to operations events so that you can take appropriate action. 

**Topics**
+ [OPS09-BP01 Identify key performance indicators](ops_operations_health_define_ops_kpis.md)
+ [OPS09-BP02 Define operations metrics](ops_operations_health_design_ops_metrics.md)
+ [OPS09-BP03 Collect and analyze operations metrics](ops_operations_health_collect_analyze_ops_metrics.md)
+ [OPS09-BP04 Establish operations metrics baselines](ops_operations_health_ops_metric_baselines.md)
+ [OPS09-BP05 Learn the expected patterns of activity for operations](ops_operations_health_learn_ops_usage_patterns.md)
+ [OPS09-BP06 Alert when operations outcomes are at risk](ops_operations_health_ops_outcome_alerts.md)
+ [OPS09-BP07 Alert when operations anomalies are detected](ops_operations_health_ops_anomaly_alerts.md)
+ [OPS09-BP08 Validate the achievement of outcomes and the effectiveness of KPIs and metrics](ops_operations_health_biz_level_view_ops.md)

# OPS09-BP01 Identify key performance indicators
<a name="ops_operations_health_define_ops_kpis"></a>

 Identify key performance indicators (KPIs) based on desired business outcomes (for example, new features delivered) and customer outcomes (for example, customer support cases). Evaluate KPIs to determine operations success. 

 **Common anti-patterns:** 
+  You are asked by business leadership how successful operations is at accomplishing business goals but have no frame of reference to determine success. 
+  You are unable to determine if your maintenance windows have an impact on business outcomes. 

 **Benefits of establishing this best practice:** By identifying key performance indicators you enable achieving business outcomes as the test of the health and success of your operations. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Identify key performance indicators: Identify key performance indicators (KPIs) based on desired business and customer outcomes. Evaluate KPIs to determine operations success. 

# OPS09-BP02 Define operations metrics
<a name="ops_operations_health_design_ops_metrics"></a>

 Define operations metrics to measure the achievement of KPIs (for example, successful deployments, and failed deployments). Define operations metrics to measure the health of operations activities (for example, mean time to detect an incident (MTTD), and mean time to recovery (MTTR) from an incident). Evaluate metrics to determine if operations are achieving desired outcomes, and to understand the health of your operations activities. 

 **Common anti-patterns:** 
+  Your operations metrics are based on what the team thinks is reasonable. 
+  You have errors in your metrics calculations that will yield incorrect results. 
+  You don't have any metrics defined for your operations activities. 

 **Benefits of establishing this best practice:** By defining and evaluating operations metrics you can determine the health of your operations activities and measure the achievement of business outcomes. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Define operations metrics: Define operations metrics to measure the achievement of KPIs. Define operations metrics to measure the health of operations and its activities. Evaluate metrics to determine if operations are achieving desired outcomes, and to understand the health of the operations. 
  +  [Publish custom metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html) 
  +  [Searching and filtering log data](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html) 
  +  [Amazon CloudWatch metrics and dimensions reference](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [AWS Answers: Centralized Logging](https://aws.amazon.com/answers/logging/centralized-logging/) 
+  [Amazon CloudWatch metrics and dimensions reference](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html) 
+  [Detect and React to Changes in Pipeline State with Amazon CloudWatch Events](https://docs.aws.amazon.com/codepipeline/latest/userguide/detect-state-changes-cloudwatch-events.html) 
+  [Publish custom metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html) 
+  [Searching and filtering log data](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html) 

 **Related videos:** 
+  Build a Monitoring Plan 

# OPS09-BP03 Collect and analyze operations metrics
<a name="ops_operations_health_collect_analyze_ops_metrics"></a>

 Perform regular, proactive reviews of metrics to identify trends and determine where appropriate responses are needed. 

 You should aggregate log data from the execution of your operations activities and operations API calls, into a service such as CloudWatch Logs. Generate metrics from observations of necessary log content to gain insight into the performance of operations activities. 

 On AWS, you can [export your log data to Amazon S3](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/S3Export.html) or [send logs directly](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Sending-Logs-Directly-To-S3.html) to [Amazon S3](https://aws.amazon.com/s3/) for long-term storage. Using [AWS Glue](https://aws.amazon.com/glue/), you can discover and prepare your log data in Amazon S3 for analytics, storing associated metadata in the [AWSAWS Glue Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html). [Amazon Athena](https://aws.amazon.com/athena/), through its native integration with AWS Glue, can then be used to analyze your log data, querying it using standard SQL. Using a business intelligence tool like [Quick](https://aws.amazon.com/quicksight/) you can visualize, explore, and analyze your data. 

 **Common anti-patterns:** 
+  Consistent delivery of new features is considered a key performance indicator. You have no method to measure how frequently deployments occur. 
+  You log deployments, rolled back deployments, patches, and rolled back patches to track you operations activities, but no one reviews the metrics. 
+  You have a recovery time objective to restore a lost database within fifteen minutes that was defined when the system was deployed and had no users. You now have ten thousand users and have been operating for two years. A recent restore took over two hours. This was not recorded and no one is aware. 

 **Benefits of establishing this best practice:** By collecting and analyzing your operations metrics, you gain understanding of the health of your operations and can gain insight to trends that have may an impact on your operations or the achievement of your business outcomes. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Collect and analyze operations metrics: Perform regular proactive reviews of metrics to identify trends and determine where appropriate responses are needed. 
  +  [Using Amazon CloudWatch metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html) 
  +  [Amazon CloudWatch metrics and dimensions reference](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html) 
  +  [Collect metrics and logs from Amazon EC2 instances and on-premises servers with the CloudWatch Agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon Athena](https://aws.amazon.com/athena/) 
+  [Amazon CloudWatch metrics and dimensions reference](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html) 
+  [Quick](https://aws.amazon.com/quicksight/) 
+  [AWS Glue](https://aws.amazon.com/glue/) 
+  [AWSAWS Glue Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html) 
+  [Collect metrics and logs from Amazon EC2 instances and on-premises servers with the CloudWatch Agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html) 
+  [Using Amazon CloudWatch metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html) 

# OPS09-BP04 Establish operations metrics baselines
<a name="ops_operations_health_ops_metric_baselines"></a>

 Establish baselines for metrics to provide expected values as the basis for comparison and identification of under and over performing operations activities. 

 **Common anti-patterns:** 
+  You have been asked what the expected time to deploy is. You have not measured how long it takes to deploy and can not determine expected times. 
+  You have been asked what how long it takes to recover from an issue with the application servers. You have no information about time to recovery from first customer contact. You have no information about time to recovery from first identification of an issue through monitoring. 
+  You have been asked how many support personnel are required over the weekend. You have no idea how many support cases are typical over a weekend and can not provide an estimate. 
+  You have a recovery time objective to restore lost databases within fifteen minutes that was defined when the system was deployed and had no users. You now have ten thousand users and have been operating for two years. You have no information on how the time to restore has changed for your database. 

 **Benefits of establishing this best practice:** By defining baseline metric values you are able to evaluate current metric values, and metric trends, to determine if action is required. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Learn expected patterns of activity for operations: Establish patterns of operations activity to determine when behavior is outside of the expected values so that you can respond appropriately if required. 

# OPS09-BP05 Learn the expected patterns of activity for operations
<a name="ops_operations_health_learn_ops_usage_patterns"></a>

 Establish patterns of operations activities to identify anomalous activity so that you can respond appropriately if necessary. 

 **Common anti-patterns:** 
+  Your deployment failure rate has increased substantially recently. You address each of the failures independently. You do not realize that the failures correspond to deployments by a new employee who is unfamiliar with the deployment management system. 

 **Benefits of establishing this best practice:** By learning patterns of behavior, you can recognize unexpected behavior and take action if necessary. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Learn expected patterns of activity for operations: Establish patterns of operations activity to determine when behavior is outside of the expected values so that you can respond appropriately if required. 

# OPS09-BP06 Alert when operations outcomes are at risk
<a name="ops_operations_health_ops_outcome_alerts"></a>

 Whenever operations outcomes are at risk, an alert must be raised and acted upon. Operations outcomes are any activity that supports a workload in production. This includes everything from deploying new versions of applications to recovering from an outage. Operations outcomes must be treated with the same importance as business outcomes. 

Software teams should identify key operations metrics and activities and build alerts for them. Alerts must be timely and actionable. If an alert is raised, a reference to a corresponding runbook or playbook should be included. Alerts without a corresponding action can lead to alert fatigue.

 **Desired outcome:** When operations activities are at risk, alerts are sent to drive action. The alerts contain context on why an alert is being raised and point to a playbook to investigate or a runbook to mitigate. Where possible, runbooks are automated and notifications are sent. 

 **Common anti-patterns:** 
+ You are investigating an incident and support cases are being filed. The support cases are breaching the service level agreement (SLA) but no alerts are being raised. 
+ A deployment to production scheduled for midnight is delayed due to last-minute code changes. No alert is raised and the deployment hangs.
+ A production outage occurs but no alerts are sent.
+  Your deployment time consistently runs behind estimates. No action is taken to investigate. 

 **Benefits of establishing this best practice:** 
+  Alerting when operations outcomes are at risk boosts your ability to support your workload by staying ahead of issues. 
+  Business outcomes are improved due to healthy operations outcomes. 
+  Detection and remediation of operations issues are improved. 
+  Overall operational health is increased. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Operations outcomes must be defined before you can alert on them. Start by defining what operations activities are most important to your organization. Is it deploying to production in under two hours or responding to a support case within a set amount of time? Your organization must define key operations activities and how they are measured so that they can be monitored, improved, and alerted on. You need a central location where workload and operations telemetry is stored and analyzed. The same mechanism should be able to raise an alert when an operations outcome is at risk. 

 **Customer example** 

 A CloudWatch alarm was triggered during a routine deployment at AnyCompany Retail. The lead time for deployment was breached. Amazon EventBridge created an OpsItem in AWS Systems Manager OpsCenter. The Cloud Operations team used a playbook to investigate the issue and identified that a schema change was taking longer than expected. They alerted the on-call developer and continued monitoring the deployment. Once the deployment was complete, the Cloud Operations team resolved the OpsItem. The team will analyze the incident during a postmortem. 

## Implementation steps
<a name="implementation-steps"></a>

1. If you have not identified operations KPIs, metrics, and activities, work on implementing the preceding best practices to this question (OPS09-BP01 to OPS09-BP05). 
   +  Support customers with [Enterprise Support](https://aws.amazon.com/premiumsupport/plans/enterprise/) can request the [Operations KPI Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/#Operational_Workshops_and_Deep_Dives) from their Technical Account Manager. This collaborative workshop helps you define operations KPIs and metrics aligned to business goals, provided at no additional cost. Contact your Technical Account Manager to learn more. 

1.  Once you have operations activities, KPIs, and metrics established, configure alerts in your observability platform. Alerts should have an action associated to them, like a playbook or runbook. Alerts without an action should be avoided. 

1.  Over time, you should evaluate your operations metrics, KPIs, and activities to identify areas of improvement. Capture feedback in runbooks and playbooks from operators to identify areas for improvement in responding to alerts. 

1.  Alerts should include a mechanism to flag them as a false-positive. This should lead to a review of the metric thresholds. 

 **Level of effort for the implementation plan:** Medium. There are several best practices that must be in place before implementing this best practice. Once operations activities have been identified and operations KPIs established, alerts should be established. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS02-BP03 Operations activities have identified owners responsible for their performance](ops_ops_model_def_activity_owners.md): Every operation activity and outcome should have an identified owner that's responsible. This is who should be alerted when outcomes are at risk. 
+  [OPS03-BP02 Team members are empowered to take action when outcomes are at risk](ops_org_culture_team_emp_take_action.md): When alerts are raised, your team should have agency to act to remedy the issue. 
+  [OPS09-BP01 Identify key performance indicators](ops_operations_health_define_ops_kpis.md): Alerting on operations outcomes starts with identify operations KPIs. 
+  [OPS09-BP02 Define operations metrics](ops_operations_health_design_ops_metrics.md): Establish this best practice before you start generating alerts. 
+  [OPS09-BP03 Collect and analyze operations metrics](ops_operations_health_collect_analyze_ops_metrics.md): Centrally collecting operations metrics is required to build alerts. 
+  [OPS09-BP04 Establish operations metrics baselines](ops_operations_health_ops_metric_baselines.md): Operations metrics baselines provide the ability to tune alerts and avoid alert fatigue. 
+  [OPS09-BP05 Learn the expected patterns of activity for operations](ops_operations_health_learn_ops_usage_patterns.md): You can improve the accuracy of your alerts by understanding the activity patterns for operations events. 
+  [OPS09-BP08 Validate the achievement of outcomes and the effectiveness of KPIs and metrics](ops_operations_health_biz_level_view_ops.md): Evaluate the achievement of operations outcomes to ensure that your KPIs and metrics are valid. 
+  [OPS10-BP02 Have a process per alert](ops_event_response_process_per_alert.md): Every alert should have an associated runbook or playbook and provide context for the person being alerted. 
+  [OPS11-BP02 Perform post-incident analysis](ops_evolve_ops_perform_rca_process.md): Conduct a post-incident analysis after the alert to identify areas for improvement. 

 **Related documents:** 
+  [AWS Deployment Pipelines Reference Architecture: Application Pipeline Architecture](https://pipelines.devops.aws.dev/application-pipeline/) 

 **Related videos:** 
+  [Aggregate and Resolve Operational Issues Using AWS Systems Manager OpsCenter](https://www.youtube.com/watch?v=r6ilQdxLcqY) 
+  [Integrate AWS Systems Manager OpsCenter with Amazon CloudWatch Alarms](https://www.youtube.com/watch?v=Gpc7a5kVakI) 
+  [Integrate Your Data Sources into AWS Systems Manager OpsCenter Using Amazon EventBridge](https://www.youtube.com/watch?v=Xmmu5mMsq3c) 

 **Related examples:** 
+  [Automate remediation actions for Amazon EC2 notifications and beyond using Amazon EC2 Systems Manager Automation and AWS Health](https://aws.amazon.com/blogs/mt/automate-remediation-actions-for-amazon-ec2-notifications-and-beyond-using-ec2-systems-manager-automation-and-aws-health/) 
+  [AWS Management and Governance Tools Workshop - Operations 2022](https://mng.workshop.aws/operations-2022.html) 
+  [Ingesting, analyzing, and visualizing metrics with DevOps Monitoring Dashboard on AWS](https://docs.aws.amazon.com/solutions/latest/devops-monitoring-dashboard-on-aws/welcome.html) 

 **Related services:** 
+  [Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html) 
+  [Support Proactive Services - Operations KPI Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/#Operational_Workshops_and_Deep_Dives) 
+  [AWS Systems Manager OpsCenter](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter.html) 
+  [CloudWatch Events](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 

# OPS09-BP07 Alert when operations anomalies are detected
<a name="ops_operations_health_ops_anomaly_alerts"></a>

 Raise an alert when operations anomalies are detected so that you can respond appropriately if necessary. 

 Your analysis of your operations metrics over time may established patterns of behavior that you can quantify sufficiently to define an event or raise an alarm in response. 

 Once trained, the [CloudWatch Anomaly Detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) feature can be used to [alarm](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create_Anomaly_Detection_Alarm.html) on detected anomalies or can provide overlaid expected values onto a [graph](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/graph_a_metric.html#create-metric-graph) of metric data for ongoing comparison. 

 [Amazon DevOps Guru](https://docs.aws.amazon.com/devops-guru/latest/userguide/welcome.html) can be used to identify anomalous behavior through event correlation, log analysis, and applying machine learning to analyze your workload telemetry. The [insights](https://docs.aws.amazon.com/devops-guru/latest/userguide/understanding-insights-console.html) gained are presented with the relevant data and recommendations. 

 **Common anti-patterns:** 
+  You are applying a patch to your fleet of instances. You tested the patch successfully in the test environment. The patch is failing for a large percentage of instances in your fleet. You do nothing. 
+  You note that there are deployments starting Friday end of day. Your organization has predefined maintenance windows on Tuesdays and Thursdays. You do nothing. 

 **Benefits of establishing this best practice:** By understanding patterns of operations behavior you can identify unexpected behavior and take action if necessary. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Alert when operations anomalies are detected: Raise an alert when operations anomalies are detected so that you can respond appropriately if required. 
  +  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 
  +  [Creating Amazon CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) 
  +  [Invoking Lambda functions using Amazon SNS notifications](https://docs.aws.amazon.com/sns/latest/dg/sns-lambda.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon DevOps Guru](https://docs.aws.amazon.com/devops-guru/latest/userguide/welcome.html) 
+  [CloudWatch Anomaly Detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) 
+  [Creating Amazon CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) 
+  [Detect and React to Changes in Pipeline State with Amazon CloudWatch Events](https://docs.aws.amazon.com/codepipeline/latest/userguide/detect-state-changes-cloudwatch-events.html) 
+  [Invoking Lambda functions using Amazon SNS notifications](https://docs.aws.amazon.com/sns/latest/dg/sns-lambda.html) 
+  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 

# OPS09-BP08 Validate the achievement of outcomes and the effectiveness of KPIs and metrics
<a name="ops_operations_health_biz_level_view_ops"></a>

 Create a business-level view of your operations activities to help you determine if you are satisfying needs and to identify areas that need improvement to reach business goals. Validate the effectiveness of KPIs and metrics and revise them if necessary. 

 AWS also has support for third-party log analysis systems and business intelligence tools through the AWS service APIs and SDKs (for example, Grafana, Kibana, and Logstash). 

 **Common anti-patterns:** 
+  The frequency of your deployments has increased with the growth in number of development teams. Your defined expected number of deployments is once per week. You have been regularly deploying daily. When their is an issue with your deployment system, and deployments are not possible, it goes undetected for days. 
+  When your business previously provided support only during core business hours from Monday to Friday. You established a next business day response time goal for incidents. You have recently started offering 24x7 support coverage with a two hour response time goal. Your overnight staff are overwhelmed and customers are unhappy. There is no indication that there are issues with incident response times because you are reporting against a next business day target. 

 **Benefits of establishing this best practice:** By reviewing and revising KPIs and metrics, you understand how your workload supports the achievement of your business outcomes and can identify where improvement is needed to reach business goals. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Validate the achievement of outcomes and the effectiveness of KPIs and metrics: Create a business level view of your operations activities to help you determine if you are satisfying needs and to identify areas that need improvement to reach business goals. Validate the effectiveness of KPIs and metrics and revise them if necessary. 
  +  [Using Amazon CloudWatch dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) 
  +  [What is log analytics?](https://aws.amazon.com/log-analytics/) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Using Amazon CloudWatch dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) 
+  [What is log analytics?](https://aws.amazon.com/log-analytics/) 