

# Define and configure alarms in Incident Detection and Response
<a name="idr-gs-alarms"></a>

AWS works with you to define metrics and alarms that provide visibility into the performance of your applications and their underlying AWS infrastructure. When you define and configure alarm thresholds, make sure that your alarms meet the following criteria:
+ Alarms only enter the "Alarm" state when there is critical impact to the monitored workload (loss of revenue or degraded customer experience that significantly reduces performance) that requires immediate operator attention.
+ Alarms must engage your specified resolvers for the workload at the same time as, or before, they engage the incident management team. Incident management engineers should collaborate with your specified resolvers in the mitigation process, not serve as first-line responders who then escalate to you.
+ Alarm thresholds and durations must be set so that every time an alarm fires, an investigation is warranted. If an alarm flaps between the "Alarm" and "OK" states, then sufficient impact is occurring to warrant operator response and attention.

**Types of alarms**:
+ Alarms that portray the level of business impact and pass relevant information for simple fault detection.
+ Amazon CloudWatch canaries. For more information, see [Canaries and X-Ray tracing](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries_tracing.html), and [X-Ray](https://aws.amazon.com/xray/).
+ Aggregate alarming (monitoring of dependencies)

The following table provides example alarms, all using the CloudWatch monitoring system.



| Metric name / Alarm threshold | Alarm ARN or resource ID | If this alarm fires | If engaged, cut a Premium Support Case for these services | 
| --- | --- | --- | --- | 
| API errors / Number of errors >= 10 for 10 datapoints | arn:aws:cloudwatch:us-west-2:000000000000:alarm:E2MPmimLambda-Errors | Ticket cut to database administrator (DBA) team | Lambda, API Gateway | 
| ServiceUnavailable (HTTP status code 503) / Number of errors >= 3 for 10 datapoints (different clients) in a 5-minute window | arn:aws:cloudwatch:us-west-2:xxxxx:alarm:httperrorcode503 | Ticket cut to Service team | Lambda, API Gateway | 
| ThrottlingException (HTTP status code 400) / Number of errors >= 3 for 10 datapoints (different clients) in a 5-minute window | arn:aws:cloudwatch:us-west-2:xxxxx:alarm:httperrorcode400 | Ticket cut to Service team | EC2, Amazon Aurora | 
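
As a rough illustration, the first row of the table could be expressed as keyword arguments for the CloudWatch `put_metric_alarm` API. This is a sketch under assumptions: it assumes that the alarm watches the Lambda `Errors` metric, that the period is 60 seconds (the table doesn't state one), and that missing data is treated as not breaching. The function name `my-function` is hypothetical.

```python
def lambda_errors_alarm(function_name: str, alarm_name: str) -> dict:
    """Build put_metric_alarm keyword arguments for an error-count alarm."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",  # assumption: the row refers to Lambda errors
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,  # assumption: the table doesn't state a period
        "EvaluationPeriods": 10,
        "DatapointsToAlarm": 10,  # "for 10 datapoints"
        "Threshold": 10.0,  # "number of errors >= 10"
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # assumption
    }


# "my-function" is a hypothetical function name.
params = lambda_errors_alarm("my-function", "E2MPmimLambda-Errors")

# With AWS credentials configured, the alarm could then be created with:
#   import boto3
#   boto3.client("cloudwatch", region_name="us-west-2").put_metric_alarm(**params)
```

Building the arguments separately from the API call makes it possible to review the threshold and datapoint settings before anything is created in the account.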

For more details, see [AWS Incident Detection and Response monitoring and observability](observe-idr.md).

If you prefer to use automation tools to onboard alarms, the Incident Detection and Response Command Line Interface (CLI) helps you deploy and onboard your alarms. For more details, see [AWS Incident Detection and Response CLI](idr-cli.md).

**Key outputs:**
+ Definition and configuration of alarms on your workloads.
+ Completion of the alarm details on the onboarding questionnaire.

**Topics**
+ [Create CloudWatch alarms](idr-alarms-fit-purpose.md)
+ [Build CloudWatch alarms with CloudFormation templates](idr-create-alarms-with-cfn.md)
+ [Example use cases for CloudWatch alarms](idr-ex-alarm-use-cases.md)

# Create CloudWatch alarms that fit your business needs in Incident Detection and Response
<a name="idr-alarms-fit-purpose"></a>

When you create Amazon CloudWatch alarms, there are several steps that you can take to make sure your alarms best fit your business needs.

**Note**  
For examples of recommended CloudWatch alarms for AWS services to onboard to Incident Detection and Response, see [Incident Detection and Response Alarm Best Practices on AWS re:Post](https://repost.aws/selections/KP6FA7iQgVSVeSNq1jAcjwxg/incident-detection-and-response-idr).

## Review your proposed CloudWatch alarms
<a name="idr-review-alarms"></a>

Review your proposed alarms to make sure that they only enter the "Alarm" state when there is critical impact to the monitored workload (loss of revenue or degraded customer experience that significantly reduces performance). For example, do you consider this alarm critical enough that you must react immediately if it goes into the "Alarm" state?

The following are suggested metrics that might represent critical business impact, such as affecting your end users' experience with an application:
+ **CloudFront:** For more information, see [Viewing CloudFront and edge function metrics](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/viewing-cloudfront-metrics.html).
+ **Application Load Balancers:** It's a best practice that you create the following alarms for Application Load Balancers, if possible:
  + HTTPCode_ELB_5XX_Count
  + HTTPCode_Target_5XX_Count

  The preceding alarms allow you to monitor responses from targets that are behind the Application Load Balancer, or behind other resources. This makes it easier to identify the source of 5XX errors. For more information, see [CloudWatch metrics for your Application Load Balancer](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-cloudwatch-metrics.html).
+ **Amazon API Gateway:** If you use WebSocket API in Elastic Beanstalk, then consider using the following metrics:
  + Integration error rates (filtered to 5XX errors)
  + Integration latency
  + Execution errors

  For more information, see [Monitoring WebSocket API execution with CloudWatch metrics](https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-websocket-api-logging.html).
+ **Amazon Route 53:** Monitor the **EndPointUnhealthyENICount** metric. This metric is the number of elastic network interfaces in the **Auto-recovering** status. This status indicates attempts by the resolver to recover one or more of the Amazon Virtual Private Cloud network interfaces that are associated with the endpoint (specified by **EndpointId**). In the recovery process, the endpoint functions with limited capacity. The endpoint can't process DNS queries until it's fully recovered. For more information, see [Monitoring Amazon Route 53 Resolver endpoints with Amazon CloudWatch](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/monitoring-resolver-with-cloudwatch.html).

## Validate your alarm configurations
<a name="idr-validate-alarm-config"></a>

After you confirm that your proposed alarms fit your business needs, validate the configuration and history of the alarms:
+ Validate the **Threshold** for the metric to enter the "Alarm" state against the metric's graph trend.
+ Validate the **Period** used for polling data points. Polling data points every 60 seconds assists in early incident detection.
+ Validate the **DatapointsToAlarm** configuration. In most cases, it's a best practice to set this to 3 out of 3 or 5 out of 5. In an incident, the alarm triggers after 3 minutes with 60-second metrics and 3 out of 3 **DatapointsToAlarm**, or after 5 minutes with 60-second metrics and 5 out of 5 **DatapointsToAlarm**. Use this combination to eliminate noisy alarms.
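
The detection times in the preceding list follow from multiplying the period by the number of data points that must breach. The following sketch shows that arithmetic. The function name is illustrative, and the result is a lower bound that ignores any delay in metric delivery.

```python
def minimum_seconds_to_alarm(
    period_seconds: int, datapoints_to_alarm: int, evaluation_periods: int
) -> int:
    """Return the shortest time in which sustained impact can trigger the alarm."""
    if not 1 <= datapoints_to_alarm <= evaluation_periods:
        raise ValueError("datapoints_to_alarm must be between 1 and evaluation_periods")
    # Every breaching data point covers one full period.
    return period_seconds * datapoints_to_alarm


three_of_three = minimum_seconds_to_alarm(60, 3, 3)  # 180 seconds, or 3 minutes
five_of_five = minimum_seconds_to_alarm(60, 5, 5)  # 300 seconds, or 5 minutes
```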

**Note**  
The preceding recommendations might vary depending on how you use a service. Each AWS service operates differently within a workload. And, the same service might operate differently when used in multiple places. You must be sure that you understand how your workload utilizes the resources that feed the alarm, as well as the upstream and downstream effects.

## Validate how your alarms handle missing data
<a name="idr-validate-missing-data"></a>

Some metric sources don't send data to CloudWatch at regular intervals. For these metrics, it's a best practice to treat missing data as **notBreaching**. For more information, see [Configuring how CloudWatch alarms treat missing data](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarms-and-missing-data) and [Avoiding premature transitions to alarm state](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#CloudWatch-alarms-avoiding-premature-transition).

For example, if a metric monitors an error rate and there are no errors, then the metric reports no data (nil data points). If you configure the alarm to treat missing data as **missing**, then a single breaching data point followed by two nil data points causes the alarm to enter the "Alarm" state (for 3 out of 3 data points). This is because the missing data configuration evaluates the last known data point in the evaluation period.

In cases where metrics monitor an error rate, in the absence of service degradation you can assume that no data is a good thing. It's a best practice to treat missing data as **notBreaching** so that missing data is treated as "OK" and the alarm doesn't enter the "Alarm" state on a single data point.
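
The effect of the missing data setting can be sketched with a simplified model. The following example covers only the **breaching** and **notBreaching** treatments, where each missing data point is counted as a fixed value. It doesn't reproduce the full CloudWatch evaluation logic, and it doesn't model the **missing** or **ignore** treatments. The function names and sample values are illustrative.

```python
from typing import Optional, Sequence


def breaching_count(
    datapoints: Sequence[Optional[float]], threshold: float, treat_missing_data: str
) -> int:
    """Count breaching data points, filling gaps per the missing data treatment."""
    if treat_missing_data not in ("breaching", "notBreaching"):
        raise ValueError("this sketch models only 'breaching' and 'notBreaching'")
    count = 0
    for value in datapoints:
        if value is None:
            # A missing data point counts as a breach only under "breaching".
            if treat_missing_data == "breaching":
                count += 1
        elif value >= threshold:
            count += 1
    return count


def would_alarm(
    datapoints: Sequence[Optional[float]],
    threshold: float,
    datapoints_to_alarm: int,
    treat_missing_data: str,
) -> bool:
    """Return True if enough data points in the window count as breaching."""
    return breaching_count(datapoints, threshold, treat_missing_data) >= datapoints_to_alarm


# One breaching error-rate data point followed by two missing data points.
window = [12.0, None, None]
alarm_if_not_breaching = would_alarm(window, 10.0, 3, "notBreaching")  # False
alarm_if_breaching = would_alarm(window, 10.0, 3, "breaching")  # True
```

In this model, a single error spike followed by silence doesn't satisfy 3 out of 3 when missing data is treated as **notBreaching**.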

## Review the history of each alarm
<a name="idr-review-alarm-history"></a>

If an alarm's history shows that it frequently enters the "Alarm" state and then recovers quickly, then the alarm is likely to become a source of noise for you. Make sure that you tune the alarm to prevent noise or false alarms.
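
One way to spot this pattern is to count how many "Alarm" episodes ended quickly. The following sketch works on a hand-written list of state changes. In practice, the list might be assembled from the state updates that the CloudWatch `describe_alarm_history` API returns. The function name and the five-minute cutoff are illustrative assumptions, not recommended values.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def short_alarm_episodes(
    history: List[Tuple[datetime, str]], max_duration: timedelta
) -> int:
    """Count ALARM episodes that left the ALARM state within max_duration.

    history is a chronological list of (timestamp, new_state) pairs.
    """
    count = 0
    alarm_started = None
    for timestamp, state in history:
        if state == "ALARM":
            if alarm_started is None:
                alarm_started = timestamp
        elif alarm_started is not None:
            if timestamp - alarm_started <= max_duration:
                count += 1
            alarm_started = None
    return count


start = datetime(2024, 1, 1)
sample_history = [
    (start, "OK"),
    (start + timedelta(minutes=10), "ALARM"),
    (start + timedelta(minutes=12), "OK"),  # 2-minute episode
    (start + timedelta(minutes=60), "ALARM"),
    (start + timedelta(minutes=120), "OK"),  # 60-minute episode
    (start + timedelta(minutes=130), "ALARM"),
    (start + timedelta(minutes=133), "OK"),  # 3-minute episode
]
flapping = short_alarm_episodes(sample_history, timedelta(minutes=5))  # 2
```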

## Validate metrics for underlying resources
<a name="idr-validate-underlying-resources"></a>

Make sure that your metrics look at valid underlying resources and use the correct statistics. If an alarm is configured to review invalid resource names, then the alarm might not be able to track the underlying data. This might cause the alarm to enter the "Alarm" state.

## Create composite alarms
<a name="idr-create-composite-alarms"></a>

If you provide Incident Detection and Response operations with a large number of alarms for onboarding, you might be asked to create composite alarms. Composite alarms reduce the total number of alarms that need to be onboarded.
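
A composite alarm is defined by a rule expression that refers to other alarms. The following sketch builds a rule that is true when any of the listed alarms is in the "Alarm" state, and wraps it in keyword arguments for the CloudWatch `put_composite_alarm` API. The alarm names are hypothetical.

```python
from typing import Dict, Sequence


def any_in_alarm_rule(alarm_names: Sequence[str]) -> str:
    """Build a rule that is true when at least one named alarm is in ALARM."""
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return " OR ".join(f'ALARM("{name}")' for name in alarm_names)


def composite_alarm_request(name: str, alarm_names: Sequence[str]) -> Dict[str, str]:
    """Build put_composite_alarm keyword arguments."""
    return {"AlarmName": name, "AlarmRule": any_in_alarm_rule(alarm_names)}


# Hypothetical child alarm names.
request = composite_alarm_request(
    "checkout-critical-impact", ["checkout-5xx-errors", "checkout-latency"]
)
# request["AlarmRule"] is:
#   ALARM("checkout-5xx-errors") OR ALARM("checkout-latency")
```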

# Build CloudWatch alarms in Incident Detection and Response with CloudFormation templates
<a name="idr-create-alarms-with-cfn"></a>

To accelerate onboarding to AWS Incident Detection and Response, and to reduce the effort needed to build alarms, AWS provides you with CloudFormation templates. These templates include optimized alarm settings for commonly onboarded services, such as Application Load Balancer, Network Load Balancer, and Amazon CloudFront.

**Build CloudWatch alarms with CloudFormation templates**

1. Download a template using the provided links:    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/IDR/latest/userguide/idr-create-alarms-with-cfn.html)

1. Review the downloaded JSON file to make sure that it meets your organization's operation and security processes.

1. Create a CloudFormation stack:
**Note**  
The following steps use the standard CloudFormation stack creation process. For detailed steps, see [Creating a stack on the CloudFormation console](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-create-stack.html).

   1. Open the AWS CloudFormation console at [https://console.aws.amazon.com/cloudformation](https://console.aws.amazon.com/cloudformation/).

   1. Choose **Create stack**.

   1. Choose **Template is ready**, and then upload the template file from your local folder.

      The following is an example of the **Create stack** screen.  
![\[Create stack upload template file example\]](http://docs.aws.amazon.com/IDR/latest/userguide/images/create-cfn-stack1.png)

   1. Choose **Next**.

   1. Enter the following required information:
      + **AlarmNameConfig** and **AlarmDescriptionConfig**: Enter a name and description for your alarm.
      + **ThresholdConfig**: Revise the threshold value to meet your application's requirements.
      + **DistributionIDConfig**: Make sure that the distribution ID points to the correct resources in the account that you're creating the CloudFormation stack in.

   1. Choose **Next**.

   1. Review the default values in the **PeriodConfig**, **EvalutionPeriodConfig**, and **DatapointsToAlarmConfig** fields. It's a best practice to use the default values for these fields. You can make adjustments, if needed, to meet your application's requirements.

   1. Optionally enter tags and SNS notification information as needed. It's a best practice to turn on **Termination protection** to prevent accidental deletion of the alarm. To turn on termination protection, select the **Activated** radio button, as shown in the following example:  
![\[Create stack activate termination protection example\]](http://docs.aws.amazon.com/IDR/latest/userguide/images/create-cfn-stack2.png)

   1. Choose **Next**.

   1. Review your stack settings, and then choose **Create stack**.

   1. After you create the stack, you see the alarm listed in the Amazon CloudWatch **Alarm** list, as shown in the following example:  
![\[Example CloudWatch alarm list\]](http://docs.aws.amazon.com/IDR/latest/userguide/images/create-cfn-stack3.png)

1. After you create all of your alarms in the correct account and AWS Region, notify your Technical Account Manager (TAM). The AWS Incident Detection and Response team reviews the status of your new alarms, and then continues your onboarding.
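
If you prefer to script the stack creation instead of using the console, the same inputs can be passed to the CloudFormation `create_stack` API. The following sketch only builds the request. The parameter names come from the preceding steps, while the stack name, parameter values, and template body are hypothetical placeholders.

```python
from typing import Any, Dict


def stack_request(
    stack_name: str, template_body: str, parameters: Dict[str, Any]
) -> Dict[str, Any]:
    """Build create_stack keyword arguments with termination protection on."""
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": str(value)}
            for key, value in parameters.items()
        ],
        "EnableTerminationProtection": True,
    }


# Placeholder values; replace them with the downloaded template and your settings.
request = stack_request(
    "idr-cloudfront-alarm",
    template_body="{}",  # contents of the downloaded JSON template
    parameters={
        "AlarmNameConfig": "cloudfront-5xx-error-rate",
        "AlarmDescriptionConfig": "Critical 5XX error rate",
        "ThresholdConfig": 5,
        "DistributionIDConfig": "EXAMPLEDISTID",
    },
)

# With AWS credentials configured, the stack could then be created with:
#   import boto3
#   boto3.client("cloudformation").create_stack(**request)
```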

# Example use cases for CloudWatch alarms in Incident Detection and Response
<a name="idr-ex-alarm-use-cases"></a>

The following use cases provide examples of how you can use Amazon CloudWatch alarms in Incident Detection and Response. These examples demonstrate how CloudWatch alarms can be configured to monitor key metrics and thresholds across various AWS services, enabling you to identify and respond to potential issues that could impact the availability and performance of your applications and workloads.

## Example Use Case A: Application Load Balancer
<a name="use-case-alb"></a>

You can create the following CloudWatch alarm that signals potential workload impact. To do this, you create a metric math expression that alarms when the percentage of successful connections drops below a certain threshold. For the available CloudWatch metrics, see [CloudWatch metrics for your Application Load Balancer](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-cloudwatch-metrics.html).

**Metric:** `HTTPCode_Target_2XX_Count;HTTPCode_Target_3XX_Count;HTTPCode_Target_4XX_Count;HTTPCode_Target_5XX_Count. (m1+m2)/(m1+m2+m3+m4)*100 m1 = HTTP Code 2xx || m2 = HTTP Code 3xx || m3 = HTTP Code 4xx || m4 = HTTP Code 5xx`

**NameSpace:** AWS/ApplicationELB

**ComparisonOperator(Threshold):** Less than x (x = customer’s threshold).

**Period:** 60 seconds

**DatapointsToAlarm:** 3 out of 3

**Missing data treatment:** Treat missing data as [breaching](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarms-and-missing-data).

**Statistic:** Sum

The following diagram shows the flow for Use Case A:

![\[Example use case for Application Load Balancer\]](http://docs.aws.amazon.com/IDR/latest/userguide/images/UseCaseAALB.png)
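
The settings for Use Case A could be expressed as keyword arguments for the CloudWatch `put_metric_alarm` API, using metric math queries. This is a sketch under assumptions: the load balancer dimension value, alarm name, and threshold are hypothetical, and the query IDs `m1` through `m4` and `e1` follow the naming in the metric description.

```python
from typing import Any, Dict, List


def success_rate_queries(load_balancer: str, period: int = 60) -> List[Dict[str, Any]]:
    """Build metric math queries for the percentage of 2XX and 3XX responses."""
    metric_names = {
        "m1": "HTTPCode_Target_2XX_Count",
        "m2": "HTTPCode_Target_3XX_Count",
        "m3": "HTTPCode_Target_4XX_Count",
        "m4": "HTTPCode_Target_5XX_Count",
    }
    queries: List[Dict[str, Any]] = [
        {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": name,
                    "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,  # input to the expression only
        }
        for query_id, name in metric_names.items()
    ]
    queries.append(
        {
            "Id": "e1",
            "Expression": "(m1+m2)/(m1+m2+m3+m4)*100",
            "Label": "SuccessRate",
            "ReturnData": True,  # the alarm evaluates this expression
        }
    )
    return queries


def success_rate_alarm(alarm_name: str, load_balancer: str, threshold: float) -> Dict[str, Any]:
    """Build put_metric_alarm keyword arguments for Use Case A."""
    return {
        "AlarmName": alarm_name,
        "Metrics": success_rate_queries(load_balancer),
        "ComparisonOperator": "LessThanThreshold",
        "Threshold": threshold,
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 3,
        "TreatMissingData": "breaching",
    }


# Hypothetical alarm name, load balancer dimension value, and threshold.
alarm = success_rate_alarm("alb-success-rate", "app/my-alb/0123456789abcdef", 95.0)
```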


## Example Use Case B: Amazon API Gateway
<a name="use-case-apigateway"></a>

You can create the following CloudWatch alarm that signals potential workload impact. To do this, you create a composite alarm that enters the "Alarm" state when there is high latency or a high average number of 4XX errors in API Gateway. For the available metrics, see [Amazon API Gateway dimensions and metrics](https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-metrics-and-dimensions.html).

**Metric:** `compositeAlarmAPI Gateway (ALARM(error4XXMetricApiGatewayAlarm)) OR (ALARM(latencyMetricApiGatewayAlarm))`

**NameSpace:** AWS/ApiGateway

**ComparisonOperator(Threshold):** Greater than (x or y customer's thresholds)

**Period:** 60 seconds

**DatapointsToAlarm:** 1 out of 1

**Missing data treatment:** Treat missing data as [not breaching](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarms-and-missing-data).

**Statistic:**

The following diagram shows the flow for Use Case B:

![\[Example use case for API Gateway\]](http://docs.aws.amazon.com/IDR/latest/userguide/images/UseCaseBAPIGW.png)


## Example Use Case C: Amazon Route 53
<a name="use-case-route53"></a>

You can monitor your resources by creating Route 53 health checks that use CloudWatch to collect and process raw data into readable, near real-time metrics. You can then use these CloudWatch metrics to create an alarm that signals potential workload impact when a metric breaches the established threshold. For the available CloudWatch metrics, see [CloudWatch metrics for Route 53 health checks](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/monitoring-cloudwatch.html#cloudwatch-metrics).

**Metric:**`R53-HC-Success`

**NameSpace:** AWS/Route53

**Threshold HealthCheckStatus:** HealthCheckStatus < x for 3 datapoints within 3 minutes (where x is the customer's threshold)

**Period:** 1 minute

**DatapointsToAlarm:** 3 out of 3

**Missing data treatment:** Treat missing data as [breaching](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarms-and-missing-data).

**Statistic:** Minimum

The following diagram shows the flow for Use Case C:

![\[Example use case for Route 53\]](http://docs.aws.amazon.com/IDR/latest/userguide/images/UseCaseCR53.png)
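
The settings for Use Case C could be expressed as keyword arguments for the CloudWatch `put_metric_alarm` API. This is a sketch under assumptions: the health check ID is hypothetical, and the default threshold of 1 assumes that `HealthCheckStatus` reports 1 for healthy and 0 for unhealthy.

```python
from typing import Any, Dict


def health_check_alarm(
    alarm_name: str, health_check_id: str, threshold: float = 1.0
) -> Dict[str, Any]:
    """Build put_metric_alarm keyword arguments for a Route 53 health check."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/Route53",
        "MetricName": "HealthCheckStatus",
        "Dimensions": [{"Name": "HealthCheckId", "Value": health_check_id}],
        "Statistic": "Minimum",
        "Period": 60,
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 3,
        "Threshold": threshold,  # x, the customer's threshold
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",
    }


# Hypothetical health check ID.
alarm = health_check_alarm("R53-HC-Success", "11111111-2222-3333-4444-555555555555")
```

Route 53 publishes health check metrics in the US East (N. Virginia) Region, so the alarm would typically be created there.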


## Example Use Case D: Monitor a workload with a custom app
<a name="use-case-custom-app"></a>

It's critical that you take the time to define an appropriate health check in this scenario. If you only verify that an application's port is open, then you haven't verified that the application is working. Additionally, making a call to the home page of an application is not necessarily the correct way to determine if the app is working. For instance, if an application depends on both a database and Amazon Simple Storage Service (Amazon S3), then the health check must validate all of the elements. One way to do that is to create a monitoring webpage, such as **/monitor**. The monitoring webpage makes a call to the database to make sure that it can connect and get data. And, the monitoring webpage makes a call to Amazon S3. Then, you point the health check on the load balancer to the **/monitor** page.

The following diagram shows the flow for Use Case D:

![\[Example use case for monitoring with a custom app\]](http://docs.aws.amazon.com/IDR/latest/userguide/images/CustomAlarm.png)
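
The logic behind a **/monitor** page can be sketched as a function that runs one check per dependency and reports an unhealthy status if any check fails. The check functions in this example are stand-ins. A real page would query the database and call Amazon S3, and the web framework that serves the page is left out.

```python
from typing import Callable, Dict, Tuple


def monitor(checks: Dict[str, Callable[[], None]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check and return an HTTP status with details.

    Each check raises an exception when its dependency is unavailable.
    """
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as error:  # any failure marks the dependency unhealthy
            results[name] = f"failed: {error}"
    healthy = all(result == "ok" for result in results.values())
    # 503 tells the load balancer health check that the target is unhealthy.
    return (200 if healthy else 503), results


def check_database() -> None:
    """Stand-in for a query that connects to the database and reads a row."""


def check_s3() -> None:
    """Stand-in for a call that reads an object from Amazon S3."""
    raise ConnectionError("bucket unreachable")


status, details = monitor({"database": check_database, "s3": check_s3})
# status is 503 because the S3 stand-in raises an error.
```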
