

# Alarm evaluation
<a name="alarm-evaluation"></a>

## Metric alarm states
<a name="alarm-states"></a>

A metric alarm has the following possible states:
+ `OK` – The metric or expression is within the defined threshold.
+ `ALARM` – The metric or expression is outside of the defined threshold.
+ `INSUFFICIENT_DATA` – The alarm has just started, the metric is not available, or not enough data is available for the metric to determine the alarm state.

## Alarm evaluation state
<a name="alarm-evaluation-state"></a>

In addition to the alarm state, each alarm has an evaluation state that provides information about the alarm evaluation process. The following states may occur:
+ `PARTIAL_DATA` – Indicates that not all of the available data could be retrieved because of quota limitations. For more information, see [How partial data is handled](cloudwatch-metrics-insights-alarms-partial-data.md).
+ `EVALUATION_ERROR` – Indicates configuration errors in the alarm setup that require review and correction. Refer to the `StateReason` field of the alarm for more details.
+ `EVALUATION_FAILURE` – Indicates a temporary CloudWatch issue. We recommend manual monitoring until the issue is resolved.

You can view the evaluation state in the alarm details in the console, or by using the `describe-alarms` CLI command or `DescribeAlarms` API.
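As a quick illustration, the following sketch uses the AWS SDK for Python (Boto3) to read the evaluation state of an alarm. The alarm name is hypothetical, and the `EvaluationState` field may be absent when no evaluation issue is present, so it is read defensively.

```
import boto3

cloudwatch = boto3.client("cloudwatch")

# Look up the alarm and print its state, plus its evaluation state if one
# is reported (for example, PARTIAL_DATA).
response = cloudwatch.describe_alarms(AlarmNames=["my-alarm-name"])
for alarm in response["MetricAlarms"]:
    print(alarm["AlarmName"], alarm["StateValue"],
          alarm.get("EvaluationState", "no evaluation state reported"))
```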

## Alarm evaluation settings
<a name="alarm-evaluation-settings"></a>

When you create an alarm, you specify three settings to enable CloudWatch to evaluate when to change the alarm state:
+ **Period** is the length of time to use to evaluate the metric or expression to create each individual data point for an alarm. It is expressed in seconds.
+ **Evaluation Periods** is the number of the most recent periods, or data points, to evaluate when determining alarm state.
+ **Datapoints to Alarm** is the number of data points within the Evaluation Periods that must be breaching to cause the alarm to go to the `ALARM` state. The breaching data points don't have to be consecutive, but they must all be within the last number of data points equal to **Evaluation Periods**.

For any period of one minute or longer, an alarm is evaluated every minute and the evaluation is based on the window of time defined by the **Period** and **Evaluation Periods**. For example, if the **Period** is 5 minutes (300 seconds) and **Evaluation Periods** is 1, then at the end of minute 5 the alarm evaluates based on data from minutes 1 to 5. Then at the end of minute 6, the alarm is evaluated based on the data from minutes 2 to 6.

If the alarm period is 10 seconds, 20 seconds, or 30 seconds, the alarm is evaluated every 10 seconds. For more information, see [High-resolution alarms](#high-resolution-alarms).

If the number of evaluation periods multiplied by the length of each evaluation period exceeds one day, the alarm is evaluated once per hour. For more details about how these multi-day alarms are evaluated, see [Example of evaluating a multi-day alarm](#evaluate-multiday-alarm).

In the following figure, the alarm threshold for a metric alarm is set to three units. Both **Evaluation Periods** and **Datapoints to Alarm** are 3. That is, when all existing data points in the most recent three consecutive periods are above the threshold, the alarm goes to `ALARM` state. In the figure, this happens in the third through fifth time periods. At period six, the value dips below the threshold, so one of the periods being evaluated is not breaching, and the alarm state changes back to `OK`. During the ninth time period, the threshold is breached again, but for only one period. Consequently, the alarm state remains `OK`.

![Alarm threshold trigger alarm](http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/images/alarm_graph.png)


When you configure **Evaluation Periods** and **Datapoints to Alarm** as different values, you're setting an "M out of N" alarm, where **Datapoints to Alarm** is "M" and **Evaluation Periods** is "N". The evaluation interval is the number of evaluation periods multiplied by the period length. For example, if you configure 4 out of 5 data points with a period of 1 minute, the evaluation interval is 5 minutes. If you configure 3 out of 3 data points with a period of 10 minutes, the evaluation interval is 30 minutes.
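As an illustration, the following Boto3 sketch creates the 4 out of 5 alarm from the example above. The alarm name, instance ID, and threshold are hypothetical.

```
import boto3

cloudwatch = boto3.client("cloudwatch")

# "M out of N" alarm: 4 breaching data points out of the last 5 one-minute
# periods put the alarm into ALARM state (evaluation interval: 5 minutes).
cloudwatch.put_metric_alarm(
    AlarmName="high-cpu-4-of-5",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-1234567890abcdef0"}],
    Statistic="Average",
    Period=60,                 # Period: 1 minute per data point
    EvaluationPeriods=5,       # N
    DatapointsToAlarm=4,       # M
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
)
```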

**Note**  
If data points are missing soon after you create an alarm, and the metric was being reported to CloudWatch before you created the alarm, CloudWatch retrieves the most recent data points from before the alarm was created when evaluating the alarm.

## High-resolution alarms
<a name="high-resolution-alarms"></a>

If you set an alarm on a high-resolution metric, you can specify a high-resolution alarm with a period of 10 seconds, 20 seconds, or 30 seconds. There is a higher charge for high-resolution alarms. For more information about high-resolution metrics, see [Publish custom metrics](publishingMetrics.md).
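A minimal Boto3 sketch of a high-resolution alarm follows. It assumes a hypothetical custom metric, `Latency` in the `MyApp` namespace, that is published at high resolution; the alarm name and threshold are also hypothetical.

```
import boto3

cloudwatch = boto3.client("cloudwatch")

# High-resolution alarm: a 10-second period is valid only for metrics
# published at high resolution.
cloudwatch.put_metric_alarm(
    AlarmName="high-res-latency-alarm",
    Namespace="MyApp",
    MetricName="Latency",
    Statistic="Average",
    Period=10,                 # 10, 20, or 30 seconds for high resolution
    EvaluationPeriods=3,
    Threshold=250.0,
    ComparisonOperator="GreaterThanThreshold",
)
```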

## Example of evaluating a multi-day alarm
<a name="evaluate-multiday-alarm"></a>

An alarm is a multi-day alarm if the number of evaluation periods multiplied by the length of each evaluation period exceeds one day. Multi-day alarms are evaluated once per hour. When multi-day alarms are evaluated, CloudWatch takes into account only the metrics up to the current hour at the :00 minute when evaluating.

For example, consider an alarm that monitors a job that runs every 3 days at 10:00.

1. At 10:02, the job fails.

1. At 10:03, the alarm evaluates and stays in `OK` state, because the evaluation considers data only up to 10:00.

1. At 11:03, the alarm considers data up to 11:00 and goes into `ALARM` state.

1. At 11:43, you correct the error and the job now runs successfully.

1. At 12:03, the alarm evaluates again, sees the successful job, and returns to `OK` state.

# Configuring how CloudWatch alarms treat missing data
<a name="alarms-and-missing-data"></a>

Sometimes, not every expected data point for a metric gets reported to CloudWatch. For example, this can happen when a connection is lost, a server goes down, or when a metric reports data only intermittently by design.

CloudWatch enables you to specify how to treat missing data points when evaluating an alarm. This helps you to configure your alarm so that it goes to `ALARM` state only when appropriate for the type of data being monitored. You can avoid false positives when missing data doesn't indicate a problem.

Similar to how each alarm is always in one of three states, each specific data point reported to CloudWatch falls under one of three categories:
+ Not breaching (within the threshold)
+ Breaching (violating the threshold)
+ Missing

For each alarm, you can specify CloudWatch to treat missing data points as any of the following:
+ `notBreaching` – Missing data points are treated as "good" and within the threshold
+ `breaching` – Missing data points are treated as "bad" and breaching the threshold
+ `ignore` – The current alarm state is maintained
+ `missing` – If all data points in the alarm evaluation range are missing, the alarm transitions to `INSUFFICIENT_DATA`.

The best choice depends on the type of metric and the purpose of the alarm. For example, if you are creating an application rollback alarm using a metric that continually reports data, you might want to treat missing data points as breaching, because it might indicate that something is wrong. But for a metric that generates data points only when an error occurs, such as `ThrottledRequests` in Amazon DynamoDB, you would want to treat missing data as `notBreaching`. The default behavior is `missing`.
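For example, the following Boto3 sketch sets the missing data treatment for an alarm on the DynamoDB `ThrottledRequests` metric described above. The alarm name and table name are hypothetical.

```
import boto3

cloudwatch = boto3.client("cloudwatch")

# For a sparse metric such as ThrottledRequests, treat missing data as
# notBreaching so the absence of errors keeps the alarm in OK state.
cloudwatch.put_metric_alarm(
    AlarmName="ddb-throttled-requests",
    Namespace="AWS/DynamoDB",
    MetricName="ThrottledRequests",
    Dimensions=[{"Name": "TableName", "Value": "my-table"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",   # notBreaching | breaching | ignore | missing
)
```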

**Important**  
Alarms configured on Amazon EC2 metrics can temporarily enter the `INSUFFICIENT_DATA` state if there are missing metric data points. This is rare, but can happen when the metric reporting is interrupted, even when the Amazon EC2 instance is healthy. For alarms on Amazon EC2 metrics that are configured to take stop, terminate, reboot, or recover actions, we recommend that you configure those alarms to treat missing data as `missing`, and to have these alarms trigger only when in the `ALARM` state.

Choosing the best option for your alarm prevents unnecessary and misleading alarm condition changes, and also more accurately indicates the health of your system.

**Important**  
Alarms that evaluate metrics in the `AWS/DynamoDB` namespace default to ignoring missing data: when an `AWS/DynamoDB` metric has missing data, alarms that evaluate that metric remain in their current state. You can override this default by choosing a different option for how the alarm treats missing data.

## How alarm state is evaluated when data is missing
<a name="alarms-evaluating-missing-data"></a>

Whenever an alarm evaluates whether to change state, CloudWatch attempts to retrieve a higher number of data points than the number specified as **Evaluation Periods**. The exact number of data points it attempts to retrieve depends on the length of the alarm period and whether it is based on a metric with standard resolution or high resolution. The time frame of the data points that it attempts to retrieve is the *evaluation range*.

Once CloudWatch retrieves these data points, the following happens:
+ If no data points in the evaluation range are missing, CloudWatch evaluates the alarm based on the most recent data points collected. The number of data points evaluated is equal to the **Evaluation Periods** for the alarm. The extra data points from farther back in the evaluation range are not needed and are ignored.
+ If some data points in the evaluation range are missing, but the total number of existing data points that were successfully retrieved from the evaluation range is equal to or more than the alarm's **Evaluation Periods**, CloudWatch evaluates the alarm state based on the most recent real data points that were successfully retrieved, including the necessary extra data points from farther back in the evaluation range. In this case, the value you set for how to treat missing data is not needed and is ignored.
+ If some data points in the evaluation range are missing, and the number of actual data points that were retrieved is lower than the alarm's number of **Evaluation Periods**, CloudWatch fills in the missing data points with the result you specified for how to treat missing data, and then evaluates the alarm. However, all real data points in the evaluation range are included in the evaluation. CloudWatch fills in missing data points only as often as necessary.

**Note**  
A particular case of this behavior is that CloudWatch alarms might repeatedly re-evaluate the last set of data points for a period of time after the metric has stopped flowing. This re-evaluation might cause the alarm to change state and re-execute actions, if it had changed state immediately prior to the metric stream stopping. To mitigate this behavior, use shorter periods.

The following tables illustrate examples of the alarm evaluation behavior. In the first table, **Datapoints to Alarm** and **Evaluation Periods** are both 3. CloudWatch retrieves the 5 most recent data points when evaluating the alarm, in case some of the most recent 3 data points are missing, so the evaluation range for the alarm is 5.

Column 1 shows the 5 most recent data points, because the evaluation range is 5. These data points are shown with the most recent data point on the right. 0 is a non-breaching data point, X is a breaching data point, and - is a missing data point.

Column 2 shows how many of the 3 necessary data points are missing. Even though the most recent 5 data points are evaluated, only 3 (the setting for **Evaluation Periods**) are necessary to evaluate the alarm state. The number of data points in Column 2 is the number of data points that must be "filled in", using the setting for how missing data is being treated. 

In columns 3-6, the column headers are the possible values for how to treat missing data. The rows in these columns show the alarm state that is set for each of these possible ways to treat missing data.


| Data points | # of data points that must be filled | MISSING | IGNORE | BREACHING | NOT BREACHING | 
| --- | --- | --- | --- | --- | --- | 
|  0 - X - X  |  0  |  `OK`  |  `OK`  |  `OK`  |  `OK`  | 
|  0 - - - -  |  2  |  `OK`  |  `OK`  |  `OK`  |  `OK`  | 
|  - - - - -  |  3  |  `INSUFFICIENT_DATA`  |  Retain current state  |  `ALARM`  |  `OK`  | 
|  0 X X - X  |  0  |  `ALARM`  |  `ALARM`  |  `ALARM`  |  `ALARM`  | 
|  - - X - -   |  2  |  `ALARM`  |  Retain current state  |  `ALARM`  |  `OK`  | 

In the second row of the preceding table, the alarm stays `OK` even if missing data is treated as breaching, because the one existing data point is not breaching, and this is evaluated along with two missing data points which are treated as breaching. The next time this alarm is evaluated, if the data is still missing it will go to `ALARM`, as that non-breaching data point will no longer be in the evaluation range.

The third row, where all five of the most recent data points are missing, illustrates how the various settings for how to treat missing data affect the alarm state. If missing data points are considered breaching, the alarm goes into `ALARM` state, while if they are considered not breaching, then the alarm goes into `OK` state. If missing data points are ignored, the alarm retains the current state it had before the missing data points. And if missing data points are just considered as missing, then the alarm does not have enough recent real data to make an evaluation, and goes into `INSUFFICIENT_DATA` state.

In the fourth row, the alarm goes to `ALARM` state in all cases because the three most recent data points are breaching, and the alarm's **Evaluation Periods** and **Datapoints to Alarm** are both set to 3. In this case, the missing data point is ignored and the setting for how to evaluate missing data is not needed, because there are 3 real data points to evaluate.

Row 5 represents a special case of alarm evaluation called *premature alarm state*. For more information, see [Avoiding premature transitions to alarm state](#CloudWatch-alarms-avoiding-premature-transition).

In the next table, the **Period** is again set to 5 minutes, and **Datapoints to Alarm** is only 2 while **Evaluation Periods** is 3. This is a 2 out of 3, M out of N alarm.

The evaluation range is 5. This is the maximum number of recent data points that are retrieved and can be used in case some data points are missing.


| Data points | # of missing data points | MISSING | IGNORE | BREACHING | NOT BREACHING | 
| --- | --- | --- | --- | --- | --- | 
|  0 - X - X  |  0  |  `ALARM`  |  `ALARM`  |  `ALARM`  |  `ALARM`  | 
|  0 0 X 0 X  |  0  |  `ALARM`  |  `ALARM`  |  `ALARM`  |  `ALARM`  | 
|  0 - X - -  |  1  |  `OK`  |  `OK`  |  `ALARM`  |  `OK`  | 
|  - - - - 0  |  2  |  `OK`  |  `OK`  |  `ALARM`  |  `OK`  | 
|  - - - X -  |  2  |  `ALARM`  |  Retain current state  |  `ALARM`  |  `OK`  | 

In rows 1 and 2, the alarm always goes to ALARM state because 2 of the 3 most recent data points are breaching. In row 2, the two oldest data points in the evaluation range are not needed because none of the 3 most recent data points are missing, so these two older data points are ignored.

In rows 3 and 4, the alarm goes to ALARM state only if missing data is treated as breaching, in which case the two most recent missing data points are both treated as breaching. In row 4, these two missing data points that are treated as breaching provide the two necessary breaching data points to trigger the ALARM state.

Row 5 represents a special case of alarm evaluation called *premature alarm state*. For more information, see the following section.

### Avoiding premature transitions to alarm state
<a name="CloudWatch-alarms-avoiding-premature-transition"></a>

CloudWatch alarm evaluation includes logic to try to avoid false alarms, where the alarm goes into ALARM state prematurely when data is intermittent. The examples shown in row 5 of the tables in the previous section illustrate this logic. In those rows, and in the following examples, the **Evaluation Periods** is 3 and the evaluation range is 5 data points. **Datapoints to Alarm** is 3, except for the M out of N example, where **Datapoints to Alarm** is 2.

Suppose an alarm's most recent data is `- - - - X`, with four missing data points and then a breaching data point as the most recent data point. Because the next data point may be non-breaching, the alarm does not go immediately into ALARM state when the data is either `- - - - X` or `- - - X -` and **Datapoints to Alarm** is 3. This way, false positives are avoided when the next data point is non-breaching and causes the data to be `- - - X O` or `- - X - O`.

However, if the last few data points are `- - X - -`, the alarm goes into ALARM state even if missing data points are treated as missing. This is because alarms are designed to always go into ALARM state when the oldest available breaching data point within the last **Evaluation Periods** data points is at least as old as the value of **Datapoints to Alarm**, and all other more recent data points are breaching or missing. In this case, the alarm goes into ALARM state even if the total number of data points available is lower than M (**Datapoints to Alarm**).

This alarm logic applies to M out of N alarms as well. If the oldest breaching data point during the evaluation range is at least as old as the value of **Datapoints to Alarm**, and all of the more recent data points are either breaching or missing, the alarm goes into ALARM state no matter the value of M (**Datapoints to Alarm**).

## Missing Data in CloudWatch Metrics Insights alarms
<a name="mi-missing-data-treatment"></a>

**Alarms based on Metrics Insights queries that aggregate to a single time series**

The missing data scenarios and their effects on alarm evaluation are the same as for a standard metric alarm with the configured missing data treatment. See [Configuring how CloudWatch alarms treat missing data](#alarms-and-missing-data).

**Alarms based on Metrics Insights queries that produce multiple time series**

Missing data scenarios for Metrics Insights alarms occur when:
+ Individual data points within a time series are not present.
+ One or more time series disappear when the alarm evaluates multiple time series.
+ No time series are retrieved by the query.

Missing data scenarios affect the alarm evaluation in the following manner:
+ Within a single time series, the configured missing data treatment is applied to the individual missing data points. For example, if 3 data points were queried for the time series but only 1 was received, the other 2 data points follow the configured missing data treatment.
+ If a time series is no longer retrieved by the query, it transitions to `OK` regardless of the missing data treatment. Alarm actions associated with the `OK` transition at the contributor level are executed, and the `StateReason` specifies that the contributor was not found, with the message "No data was returned for this contributor". The state of the alarm depends on the states of the other contributors that were retrieved by the query.
+ At the alarm level, if the query returns an empty result (no time series at all), the missing data treatment is applied. For example, if the missing data treatment is set to `BREACHING`, the alarm transitions to `ALARM`.
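A Boto3 sketch of a Metrics Insights alarm follows. The alarm name and threshold are hypothetical, and the query aggregates `AWS/EC2` CPU into a single time series.

```
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on a Metrics Insights query. The missing data treatment applies to
# individual data points within the time series, and (for an empty query
# result) at the alarm level.
cloudwatch.put_metric_alarm(
    AlarmName="fleet-avg-cpu",
    Metrics=[{
        "Id": "q1",
        "Expression": 'SELECT AVG(CPUUtilization) FROM SCHEMA("AWS/EC2", InstanceId)',
        "Period": 300,
        "ReturnData": True,
    }],
    EvaluationPeriods=3,
    DatapointsToAlarm=3,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",
)
```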

# How partial data is handled
<a name="cloudwatch-metrics-insights-alarms-partial-data"></a>

## How partial data from a Metrics Insights query is evaluated
<a name="cloudwatch-metrics-insights-query-evaluation"></a>

If the Metrics Insights query used for the alarm matches more than 10,000 metrics, the alarm is evaluated based on the first 10,000 metrics that the query finds. This means that the alarm is being evaluated on partial data.

You can use the following methods to find whether a Metrics Insights alarm is currently evaluating its alarm state based on partial data: 
+ In the console, if you choose an alarm to see the **Details** page, the message **Evaluation warning: Not evaluating all data** appears on that page.
+ You see the value `PARTIAL_DATA` in the `EvaluationState` field when you use the [describe-alarms](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/cloudwatch/describe-alarms.html) AWS CLI command or the [DescribeAlarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_DescribeAlarms.html) API.

Alarms also publish events to Amazon EventBridge when they go into the partial data state, so you can create an EventBridge rule to watch for these events. In these events, the `evaluationState` field has the value `PARTIAL_DATA`. The following is an example.

```
{
    "version": "0",
    "id": "12345678-3bf9-6a09-dc46-12345EXAMPLE",
    "detail-type": "CloudWatch Alarm State Change",
    "source": "aws.cloudwatch",
    "account": "123456789012",
    "time": "2022-11-08T11:26:05Z",
    "region": "us-east-1",
    "resources": [
        "arn:aws:cloudwatch:us-east-1:123456789012:alarm:my-alarm-name"
    ],
    "detail": {
        "alarmName": "my-alarm-name",
        "state": {
            "value": "ALARM",
            "reason": "Threshold Crossed: 3 out of the last 3 datapoints [20000.0 (08/11/22 11:25:00), 20000.0 (08/11/22 11:24:00), 20000.0 (08/11/22 11:23:00)] were greater than the threshold (0.0) (minimum 1 datapoint for OK -> ALARM transition).",
            "reasonData": "{\"version\":\"1.0\",\"queryDate\":\"2022-11-08T11:26:05.399+0000\",\"startDate\":\"2022-11-08T11:23:00.000+0000\",\"period\":60,\"recentDatapoints\":[20000.0,20000.0,20000.0],\"threshold\":0.0,\"evaluatedDatapoints\":[{\"timestamp\":\"2022-11-08T11:25:00.000+0000\",\"value\":20000.0}]}",
            "timestamp": "2022-11-08T11:26:05.401+0000",
            "evaluationState": "PARTIAL_DATA"
        },
        "previousState": {
            "value": "INSUFFICIENT_DATA",
            "reason": "Unchecked: Initial alarm creation",
            "timestamp": "2022-11-08T11:25:51.227+0000"
        },
        "configuration": {
            "metrics": [
                {
                    "id": "m2",
                    "expression": "SELECT SUM(PartialDataTestMetric) FROM partial_data_test",
                    "returnData": true,
                    "period": 60
                }
            ]
        }
    }
}
```
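To watch for these events, you could create a rule whose event pattern matches on the `evaluationState` field, as in the following Boto3 sketch. The rule name is hypothetical, and the pattern mirrors the example event above.

```
import json
import boto3

events = boto3.client("events")

# Rule that matches CloudWatch alarm state-change events whose evaluation
# state is PARTIAL_DATA. Attach a target (such as an SNS topic) separately.
events.put_rule(
    Name="cloudwatch-partial-data-alarms",
    EventPattern=json.dumps({
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {"state": {"evaluationState": ["PARTIAL_DATA"]}},
    }),
)
```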

If the query for the alarm includes a GROUP BY statement that initially returns more than 500 time series, the alarm is evaluated based on the first 500 time series that the query finds. However, if you use an ORDER BY clause, then all the time series that the query finds are sorted, and the 500 that have the highest or lowest values according to your ORDER BY clause are used to evaluate the alarm. 
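For example, a query along the following lines (the metric and schema are illustrative) sorts all matching time series by their average value before the 500-series limit is applied, so the highest-CPU instances are the series that get evaluated.

```
# Metrics Insights query string for an alarm; ORDER BY makes the 500-series
# cutoff deterministic instead of taking an arbitrary first 500.
query = (
    'SELECT AVG(CPUUtilization) '
    'FROM SCHEMA("AWS/EC2", InstanceId) '
    'GROUP BY InstanceId '
    'ORDER BY AVG() DESC'
)
```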

## How partial data from a multi data source alarm is evaluated
<a name="multi-data-source-partial-data"></a>

If the Lambda function returns partial data:
+ The alarm continues to be evaluated on the data points that are returned.
+ You can use the following methods to find whether an alarm on a Lambda function is currently evaluating its alarm state based on partial data:
  + In the console, choose an alarm and open its **Details** page. If the message **Evaluation warning: Not evaluating all data** appears on that page, the alarm is evaluating on partial data.
  + If you see the value `PARTIAL_DATA` in the `EvaluationState` field when you use the `describe-alarms` AWS CLI command or the `DescribeAlarms` API, the alarm is evaluating on partial data.
+ An alarm also publishes events to Amazon EventBridge when it goes into the partial data state.

# Percentile-based alarms and low data samples
<a name="percentiles-with-low-samples"></a>

When you set a percentile as the statistic for an alarm, you can specify what to do when there is not enough data for a good statistical assessment. You can choose to have the alarm evaluate the statistic anyway and possibly change the alarm state. Or, you can have the alarm ignore the metric while the sample size is low, and wait to evaluate it until there is enough data to be statistically significant.

For percentiles between 0.5 (inclusive) and 1.00 (exclusive), this setting is used when there are fewer than 10/(1-percentile) data points during the evaluation period. For example, this setting would be used if there were fewer than 1000 samples for an alarm on a p99 percentile. For percentiles between 0 and 0.5 (exclusive), the setting is used when there are fewer than 10/percentile data points.
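The cutoff arithmetic can be sketched as follows; in the `PutMetricAlarm` API, the choice itself is made with the `EvaluateLowSampleCountPercentile` parameter (`evaluate` or `ignore`).

```
def min_samples(percentile: float) -> float:
    """Sample count below which the low-sample-count setting takes effect."""
    # Percentiles from 0.5 (inclusive) to 1.0 (exclusive): 10 / (1 - p).
    # Percentiles from 0 to 0.5 (exclusive): 10 / p.
    return 10 / (1 - percentile) if percentile >= 0.5 else 10 / percentile

print(min_samples(0.99))  # 1000.0 -- a p99 alarm needs at least 1000 samples
print(min_samples(0.10))  # 100.0  -- a p10 alarm needs at least 100 samples
```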