

# CloudWatch events and metrics from Amazon EMR
<a name="emr-manage-cluster-cloudwatch"></a>

Use events and metrics to track the activity and health of an Amazon EMR cluster. Events are useful for monitoring a specific occurrence within a cluster - for example, when a cluster changes state from starting to running. Metrics are useful for monitoring a specific value - for example, the percentage of available disk space that HDFS is using within a cluster.

For more information about CloudWatch Events, see the [Amazon CloudWatch Events User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/). For more information about CloudWatch metrics, see [Using Amazon CloudWatch metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html) and [Creating Amazon CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) in the *Amazon CloudWatch User Guide*.

**Topics**
+ [Monitoring Amazon EMR metrics with CloudWatch](UsingEMR_ViewingMetrics.md)
+ [Monitoring Amazon EMR events with CloudWatch](emr-manage-cloudwatch-events.md)
+ [Responding to CloudWatch events from Amazon EMR](emr-events-response.md)

# Monitoring Amazon EMR metrics with CloudWatch
<a name="UsingEMR_ViewingMetrics"></a>

Amazon EMR automatically collects metrics for every cluster and pushes them to CloudWatch at five-minute intervals. This interval is not configurable. There is no charge for the Amazon EMR metrics reported in CloudWatch. These five-minute datapoints are archived for 63 days, after which the data is discarded.

## How do I use Amazon EMR metrics?
<a name="UsingEMR_ViewingMetrics_HowDoI"></a>

The following table shows common uses for metrics reported by Amazon EMR. These are suggestions to get you started, not a comprehensive list. For a complete list of metrics reported by Amazon EMR, see [Metrics reported by Amazon EMR in CloudWatch](#UsingEMR_ViewingMetrics_MetricsReported). 



| How do I? | Relevant metrics | 
| --- | --- | 
| Track the progress of my cluster | Look at the MapTasksRunning, MapTasksRemaining, ReduceTasksRunning, and ReduceTasksRemaining metrics.  | 
| Detect clusters that are idle | The IsIdle metric tracks whether a cluster is live, but not currently running tasks. You can set an alarm to fire when the cluster has been idle for a given period of time, such as thirty minutes.  | 
| Detect when a node runs out of storage | The MRUnhealthyNodes metric tracks when one or more core or task nodes run out of local disk storage and transition to an UNHEALTHY YARN state. For example, core or task nodes are running low on disk space and will not be able to run tasks. | 
| Detect when a cluster runs out of storage | The HDFSUtilization metric monitors the percentage of the cluster's combined HDFS capacity that is in use. High HDFS utilization may affect jobs and cluster health, and can indicate that you need to resize the cluster to add more core nodes.  | 
| Detect when a cluster is running at reduced capacity | The MRLostNodes metric tracks when one or more core or task nodes are unable to communicate with the master node. For example, the core or task node is unreachable by the master node. | 

For more information, see [Amazon EMR cluster terminates with NO\_SLAVE\_LEFT and core nodes FAILED\_BY\_MASTER](emr-cluster-NO_SLAVE_LEFT-FAILED_BY_MASTER.md) and [AWSSupport-AnalyzeEMRLogs](https://docs.aws.amazon.com//systems-manager-automation-runbooks/latest/userguide/automation-awssupport-analyzeemrlogs.html). 
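
The idle-cluster guidance above (alarm only after the cluster has been idle for a sustained window, such as thirty minutes) can be sketched as a simple check over recent datapoints. The helper below is a hypothetical illustration, not part of any AWS SDK; it assumes one `IsIdle` sample per five-minute check, as described later in the metrics table:

```python
def should_alarm_on_idle(is_idle_samples, checks_required=6):
    """Return True when the trailing `checks_required` five-minute
    IsIdle samples are all 1 (six checks = thirty minutes idle)."""
    if len(is_idle_samples) < checks_required:
        return False
    return all(v == 1 for v in is_idle_samples[-checks_required:])

# Five trailing 1s is only 25 minutes of idleness; six is a full half hour.
print(should_alarm_on_idle([0, 1, 1, 1, 1, 1]))     # False
print(should_alarm_on_idle([0, 1, 1, 1, 1, 1, 1]))  # True
```

Requiring multiple consecutive checks avoids the false positives described in the `IsIdle` metric entry, where a value of 1 only means the cluster was idle at the moment it was checked.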

## Access CloudWatch metrics for Amazon EMR
<a name="UsingEMR_ViewingMetrics_Access"></a>

You can view the metrics that Amazon EMR reports to CloudWatch using the Amazon EMR console or the CloudWatch console. You can also retrieve metrics using the CloudWatch CLI command `[mon-get-stats](https://docs.aws.amazon.com/AmazonCloudWatch/latest/cli/cli-mon-get-stats.html)` or the CloudWatch `[GetMetricStatistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html)` API. For more information about viewing or retrieving metrics for Amazon EMR using CloudWatch, see the [Amazon CloudWatch User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/).
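
As a concrete illustration of retrieving metrics programmatically, the following sketch builds the parameter set for a `GetMetricStatistics` request using the `AWS/ElasticMapReduce` namespace and the `JobFlowId` dimension described later in this section. The helper name and the three-hour window are illustrative choices; the commented boto3 call requires AWS credentials, so only the parameter construction runs here.

```python
from datetime import datetime, timedelta, timezone

def hdfs_utilization_request(cluster_id, hours=3):
    """Build GetMetricStatistics parameters for a cluster's
    HDFSUtilization metric over the last `hours` hours."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "HDFSUtilization",
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,  # Amazon EMR emits datapoints every five minutes
        "Statistics": ["Average"],
    }

params = hdfs_utilization_request("j-XXXXXXXXXXXXX")  # placeholder cluster ID
# To fetch real data (requires AWS credentials):
# import boto3
# stats = boto3.client("cloudwatch").get_metric_statistics(**params)
```

A `Period` of 300 seconds matches the five-minute granularity at which Amazon EMR reports metrics; requesting a shorter period does not produce more datapoints.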

------
#### [ Console ]

**To view metrics with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose the cluster that you want to view metrics for. This opens the cluster details page.

1. Select the **Monitoring** tab on the cluster details page. Choose any one of the **Cluster status**, **Node status**, or **Inputs and outputs** options to load the reports about the progress and health of the cluster. 

1. After you choose a metric to view, you can enlarge each graph. To filter the time frame of your graph, select a prefilled option or choose **Custom**.

------

## Metrics reported by Amazon EMR in CloudWatch
<a name="UsingEMR_ViewingMetrics_MetricsReported"></a>

The following tables list the metrics that Amazon EMR reports in the console and pushes to CloudWatch.

### Amazon EMR metrics
<a name="emr-metrics-reported"></a>

Amazon EMR sends data for several metrics to CloudWatch. All Amazon EMR clusters automatically send metrics at five-minute intervals. Metrics are archived for 63 days; after that period, the data is discarded. 

The `AWS/ElasticMapReduce` namespace includes the following metrics.

**Note**  
Amazon EMR pulls metrics from a cluster. If a cluster becomes unreachable, no metrics are reported until the cluster becomes available again.

The following metrics are available for clusters running Hadoop 2.x versions.


| Metric | Description | 
| --- | --- | 
| Cluster Status | 
| IsIdle  | Indicates that a cluster is no longer performing work, but is still alive and accruing charges. It is set to 1 if no tasks are running and no jobs are running, and set to 0 otherwise. This value is checked at five-minute intervals and a value of 1 indicates only that the cluster was idle when checked, not that it was idle for the entire five minutes. To avoid false positives, you should raise an alarm when this value has been 1 for more than one consecutive 5-minute check. For example, you might raise an alarm on this value if it has been 1 for thirty minutes or longer. Use case: Monitor cluster performance Units: *Boolean*  | 
| ContainerAllocated  | The number of resource containers allocated by the ResourceManager. Use case: Monitor cluster progress Units: *Count*  | 
| ContainerReserved  | The number of containers reserved. Use case: Monitor cluster progress Units: *Count*  | 
| ContainerPending  | The number of containers in the queue that have not yet been allocated. Use case: Monitor cluster progress Units: *Count*  | 
| ContainerPendingRatio  | The ratio of pending containers to containers allocated (ContainerPendingRatio = ContainerPending / ContainerAllocated). If ContainerAllocated = 0, then ContainerPendingRatio = ContainerPending. The value of ContainerPendingRatio represents a number, not a percentage. This value is useful for scaling cluster resources based on container allocation behavior. Units: *Count*  | 
| AppsCompleted  | The number of applications submitted to YARN that have completed. Use case: Monitor cluster progress Units: *Count*  | 
| AppsFailed  | The number of applications submitted to YARN that have failed to complete. Use case: Monitor cluster progress, Monitor cluster health Units: *Count*  | 
| AppsKilled  | The number of applications submitted to YARN that have been killed. Use case: Monitor cluster progress, Monitor cluster health Units: *Count*  | 
| AppsPending  | The number of applications submitted to YARN that are in a pending state. Use case: Monitor cluster progress Units: *Count*  | 
| AppsRunning  | The number of applications submitted to YARN that are running. Use case: Monitor cluster progress Units: *Count*  | 
| AppsSubmitted  | The number of applications submitted to YARN. Use case: Monitor cluster progress Units: *Count*  | 
| Node Status | 
| CoreNodesRunning  | The number of core nodes working. Data points for this metric are reported only when a corresponding instance group exists. Use case: Monitor cluster health Units: *Count*  | 
| CoreNodesPending  | The number of core nodes waiting to be assigned. All of the core nodes requested may not be immediately available; this metric reports the pending requests. Data points for this metric are reported only when a corresponding instance group exists. Use case: Monitor cluster health Units: *Count*  | 
| LiveDataNodes  | The percentage of data nodes that are receiving work from Hadoop. Use case: Monitor cluster health Units: *Percent*  | 
| MRTotalNodes  | The number of nodes presently available to MapReduce jobs. Equivalent to YARN metric `mapred.resourcemanager.TotalNodes`. Use case: Monitor cluster progress Units: *Count* Note: MRTotalNodes only counts currently active nodes in the system. YARN automatically removes terminated nodes from this count and stops tracking them, so they are not considered in the MRTotalNodes metric.  | 
| MRActiveNodes  | The number of nodes presently running MapReduce tasks or jobs. Equivalent to YARN metric `mapred.resourcemanager.NoOfActiveNodes`. Use case: Monitor cluster progress Units: *Count*  | 
| MRLostNodes  | The number of nodes allocated to MapReduce that have been marked in a LOST state. Equivalent to YARN metric `mapred.resourcemanager.NoOfLostNodes`. Use case: Monitor cluster health, Monitor cluster progress Units: *Count*  | 
| MRUnhealthyNodes  | The number of nodes available to MapReduce jobs marked in an UNHEALTHY state. Equivalent to YARN metric `mapred.resourcemanager.NoOfUnhealthyNodes`. Use case: Monitor cluster progress Units: *Count*  | 
| MRDecommissionedNodes  | The number of nodes allocated to MapReduce applications that have been marked in a DECOMMISSIONED state. Equivalent to YARN metric `mapred.resourcemanager.NoOfDecommissionedNodes`. Use case: Monitor cluster health, Monitor cluster progress Units: *Count*  | 
| MRRebootedNodes  | The number of nodes available to MapReduce that have been rebooted and marked in a REBOOTED state. Equivalent to YARN metric `mapred.resourcemanager.NoOfRebootedNodes`. Use case: Monitor cluster health, Monitor cluster progress Units: *Count*  | 
| MultiMasterInstanceGroupNodesRunning  | The number of running master nodes. Use case: Monitor master node failure and replacement Units: *Count*  | 
| MultiMasterInstanceGroupNodesRunningPercentage  | The percentage of master nodes that are running, relative to the requested master node instance count.  Use case: Monitor master node failure and replacement Units: *Percent*  | 
| MultiMasterInstanceGroupNodesRequested  | The number of requested master nodes.  Use case: Monitor master node failure and replacement Units: *Count*  | 
| IO | 
| S3BytesWritten  | The number of bytes written to Amazon S3. This metric aggregates MapReduce jobs only, and does not apply for other workloads on Amazon EMR.  Use case: Analyze cluster performance, Monitor cluster progress Units: *Count*  | 
| S3BytesRead  | The number of bytes read from Amazon S3. This metric aggregates MapReduce jobs only, and does not apply for other workloads on Amazon EMR.  Use case: Analyze cluster performance, Monitor cluster progress Units: *Count*  | 
| HDFSUtilization  | The percentage of HDFS storage currently used. Use case: Analyze cluster performance Units: *Percent*  | 
| HDFSBytesRead  | The number of bytes read from HDFS. This metric aggregates MapReduce jobs only, and does not apply for other workloads on Amazon EMR. Use case: Analyze cluster performance, Monitor cluster progress Units: *Count*  | 
| HDFSBytesWritten  | The number of bytes written to HDFS. This metric aggregates MapReduce jobs only, and does not apply for other workloads on Amazon EMR. Use case: Analyze cluster performance, Monitor cluster progress Units: *Count*  | 
| MissingBlocks  | The number of blocks in which HDFS has no replicas. These might be corrupt blocks. Use case: Monitor cluster health Units: *Count*  | 
| CorruptBlocks  | The number of blocks that HDFS reports as corrupted. Use case: Monitor cluster health Units: *Count*  | 
| TotalLoad  | The total number of concurrent data transfers. Use case: Monitor cluster health Units: *Count*  | 
| MemoryTotalMB  | The total amount of memory in the cluster. Use case: Monitor cluster progress Units: *Count*  | 
| MemoryReservedMB  | The amount of memory reserved. Use case: Monitor cluster progress Units: *Count*  | 
| MemoryAvailableMB  | The amount of memory available to be allocated. Use case: Monitor cluster progress Units: *Count*  | 
| YARNMemoryAvailablePercentage  | The percentage of remaining memory available to YARN (YARNMemoryAvailablePercentage = MemoryAvailableMB / MemoryTotalMB). This value is useful for scaling cluster resources based on YARN memory usage. Units: *Percent*  | 
| MemoryAllocatedMB  | The amount of memory allocated to the cluster. Use case: Monitor cluster progress Units: *Count*  | 
| PendingDeletionBlocks  | The number of blocks marked for deletion. Use case: Monitor cluster progress, Monitor cluster health Units: *Count*  | 
| UnderReplicatedBlocks  | The number of blocks that need to be replicated one or more times. Use case: Monitor cluster progress, Monitor cluster health Units: *Count*  | 
| DfsPendingReplicationBlocks  | The status of block replication: blocks being replicated, age of replication requests, and unsuccessful replication requests. Use case: Monitor cluster progress, Monitor cluster health Units: *Count*  | 
| CapacityRemainingGB  | The amount of remaining HDFS disk capacity.  Use case: Monitor cluster progress, Monitor cluster health Units: *Count*  | 
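
The two ratio formulas in the table above (`ContainerPendingRatio` and `YARNMemoryAvailablePercentage`) can be computed directly from the underlying metrics. This is a small sketch of those calculations, with hypothetical values:

```python
def container_pending_ratio(pending, allocated):
    """ContainerPendingRatio = ContainerPending / ContainerAllocated,
    falling back to ContainerPending when nothing is allocated."""
    return pending if allocated == 0 else pending / allocated

def yarn_memory_available_percentage(available_mb, total_mb):
    """YARNMemoryAvailablePercentage = MemoryAvailableMB / MemoryTotalMB,
    expressed as a percentage."""
    return 100.0 * available_mb / total_mb

print(container_pending_ratio(10, 4))                 # 2.5
print(container_pending_ratio(10, 0))                 # 10
print(yarn_memory_available_percentage(4096, 16384))  # 25.0
```

Both values are commonly used as scaling triggers: a high pending-container ratio or a low available-memory percentage suggests the cluster needs more capacity.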

The following are Hadoop 1 metrics:


| Metric | Description | 
| --- | --- | 
| Cluster Status | 
| IsIdle  | Indicates that a cluster is no longer performing work, but is still alive and accruing charges. It is set to 1 if no tasks are running and no jobs are running, and set to 0 otherwise. This value is checked at five-minute intervals and a value of 1 indicates only that the cluster was idle when checked, not that it was idle for the entire five minutes. To avoid false positives, you should raise an alarm when this value has been 1 for more than one consecutive 5-minute check. For example, you might raise an alarm on this value if it has been 1 for thirty minutes or longer. Use case: Monitor cluster performance Units: *Boolean*  | 
| JobsRunning  | The number of jobs in the cluster that are currently running. Use case: Monitor cluster health Units: *Count*  | 
| JobsFailed  | The number of jobs in the cluster that have failed. Use case: Monitor cluster health Units: *Count*  | 
| Map/Reduce | 
| MapTasksRunning  | The number of running map tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs are generated. Use case: Monitor cluster progress Units: *Count*  | 
| MapTasksRemaining  | The number of remaining map tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs are generated. A remaining map task is one that is not in any of the following states: Running, Killed, or Completed. Use case: Monitor cluster progress Units: *Count*  | 
| MapSlotsOpen  | The unused map task capacity. This is calculated as the maximum number of map tasks for a given cluster, less the total number of map tasks currently running in that cluster. Use case: Analyze cluster performance Units: *Count*  | 
| RemainingMapTasksPerSlot  | The ratio of the total map tasks remaining to the total map slots available in the cluster. Use case: Analyze cluster performance Units: *Ratio*  | 
| ReduceTasksRunning  | The number of running reduce tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs are generated. Use case: Monitor cluster progress Units: *Count*  | 
| ReduceTasksRemaining  | The number of remaining reduce tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs are generated. Use case: Monitor cluster progress Units: *Count*  | 
| ReduceSlotsOpen  | Unused reduce task capacity. This is calculated as the maximum reduce task capacity for a given cluster, less the number of reduce tasks currently running in that cluster. Use case: Analyze cluster performance Units: *Count*  | 
| Node Status | 
| CoreNodesRunning  | The number of core nodes working. Data points for this metric are reported only when a corresponding instance group exists. Use case: Monitor cluster health Units: *Count*  | 
| CoreNodesPending  | The number of core nodes waiting to be assigned. All of the core nodes requested may not be immediately available; this metric reports the pending requests. Data points for this metric are reported only when a corresponding instance group exists. Use case: Monitor cluster health Units: *Count*  | 
| LiveDataNodes  | The percentage of data nodes that are receiving work from Hadoop. Use case: Monitor cluster health Units: *Percent*  | 
| TaskNodesRunning  | The number of task nodes working. Data points for this metric are reported only when a corresponding instance group exists. Use case: Monitor cluster health Units: *Count*  | 
| TaskNodesPending  | The number of task nodes waiting to be assigned. All of the task nodes requested may not be immediately available; this metric reports the pending requests. Data points for this metric are reported only when a corresponding instance group exists. Use case: Monitor cluster health Units: *Count*  | 
| LiveTaskTrackers  | The percentage of task trackers that are functional. Use case: Monitor cluster health Units: *Percent*  | 
| IO | 
| S3BytesWritten  | The number of bytes written to Amazon S3. This metric aggregates MapReduce jobs only, and does not apply for other workloads on Amazon EMR. Use case: Analyze cluster performance, Monitor cluster progress Units: *Count*  | 
| S3BytesRead  | The number of bytes read from Amazon S3. This metric aggregates MapReduce jobs only, and does not apply for other workloads on Amazon EMR. Use case: Analyze cluster performance, Monitor cluster progress Units: *Count*  | 
| HDFSUtilization  | The percentage of HDFS storage currently used. Use case: Analyze cluster performance Units: *Percent*  | 
| HDFSBytesRead  | The number of bytes read from HDFS. Use case: Analyze cluster performance, Monitor cluster progress Units: *Count*  | 
| HDFSBytesWritten  | The number of bytes written to HDFS. Use case: Analyze cluster performance, Monitor cluster progress Units: *Count*  | 
| MissingBlocks  | The number of blocks in which HDFS has no replicas. These might be corrupt blocks. Use case: Monitor cluster health Units: *Count*  | 
| TotalLoad  | The current, total number of readers and writers reported by all DataNodes in a cluster. Use case: Diagnose the degree to which high I/O might be contributing to poor job execution performance. Worker nodes running the DataNode daemon must also perform map and reduce tasks. Persistently high TotalLoad values over time can indicate that high I/O might be a contributing factor to poor performance. Occasional spikes in this value are typical and do not usually indicate a problem. Units: *Count*  | 

#### Cluster capacity metrics
<a name="emr-metrics-managed-scaling"></a>

The following metrics indicate the current or target capacities of a cluster. These metrics are only available when managed scaling or auto-termination is enabled. 

For clusters composed of instance fleets, the cluster capacity metrics are measured in `Units`. For clusters composed of instance groups, the cluster capacity metrics are measured in `Nodes` or `VCPU` based on the unit type used in the managed scaling policy. For more information, see [Using EMR-managed scaling](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-scaling.html) in the *Amazon EMR Management Guide*.


| Metric | Description | 
| --- | --- | 
| TotalUnitsRequested / TotalNodesRequested / TotalVCPURequested | The target total number of units/nodes/vCPUs in a cluster as determined by managed scaling. Units: *Count*  | 
| TotalUnitsRunning / TotalNodesRunning / TotalVCPURunning | The current total number of units/nodes/vCPUs available in a running cluster. When a cluster resize is requested, this metric is updated after the new instances are added to or removed from the cluster. Units: *Count*  | 
| CoreUnitsRequested / CoreNodesRequested / CoreVCPURequested | The target number of CORE units/nodes/vCPUs in a cluster as determined by managed scaling. Units: *Count*  | 
| CoreUnitsRunning / CoreNodesRunning / CoreVCPURunning | The current number of CORE units/nodes/vCPUs running in a cluster. Units: *Count*  | 
| TaskUnitsRequested / TaskNodesRequested / TaskVCPURequested | The target number of TASK units/nodes/vCPUs in a cluster as determined by managed scaling. Units: *Count*  | 
| TaskUnitsRunning / TaskNodesRunning / TaskVCPURunning | The current number of TASK units/nodes/vCPUs running in a cluster. Units: *Count*  | 

Amazon EMR emits the following metrics at a one-minute granularity when you enable auto-termination using an auto-termination policy. Some metrics are only available for Amazon EMR versions 6.4.0 and later. To learn more about auto-termination, see [Using an auto-termination policy for Amazon EMR cluster cleanup](emr-auto-termination-policy.md).



| Metric | Description | 
| --- | --- | 
| TotalNotebookKernels | The total number of running and idle notebook kernels on the cluster. This metric is only available for Amazon EMR versions 6.4.0 and later. | 
| AutoTerminationIsClusterIdle | Indicates whether the cluster is in use. A value of **0** indicates that the cluster is in active use. A value of **1** indicates that the cluster is idle. Amazon EMR checks for continuous cluster idleness (`AutoTerminationIsClusterIdle` = 1). When a cluster's idle time equals the `IdleTimeout` value in your auto-termination policy, Amazon EMR terminates the cluster.  | 
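
Because `AutoTerminationIsClusterIdle` is emitted at one-minute granularity, the auto-termination decision can be modeled as a count of consecutive idle samples. The helper below is an illustrative sketch of that logic, not actual Amazon EMR code:

```python
def terminates_at(samples, idle_timeout_minutes):
    """Given one-minute AutoTerminationIsClusterIdle samples (0 or 1),
    return the index at which continuous idleness first reaches
    IdleTimeout, or None if it never does."""
    consecutive = 0
    for i, value in enumerate(samples):
        consecutive = consecutive + 1 if value == 1 else 0
        if consecutive >= idle_timeout_minutes:
            return i
    return None

# The idle streak resets whenever the cluster becomes active again.
print(terminates_at([1, 1, 0, 1, 1, 1], idle_timeout_minutes=3))  # 5
print(terminates_at([1, 0, 1, 0], idle_timeout_minutes=2))        # None
```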

### Dimensions for Amazon EMR metrics
<a name="emr-metrics-dimensions"></a>

Amazon EMR data can be filtered using any of the dimensions in the following table. 


| Dimension  | Description  | 
| --- | --- | 
| JobFlowId | The same as cluster ID, which is the unique identifier of a cluster in the form j-XXXXXXXXXXXXX. Find this value by clicking on the cluster in the Amazon EMR console.  | 

# Monitoring Amazon EMR events with CloudWatch
<a name="emr-manage-cloudwatch-events"></a>

Amazon EMR tracks events and keeps information about them for up to seven days in the Amazon EMR console. Amazon EMR records events when there is a change in the state of clusters, instance groups, instance fleets, automatic scaling policies, or steps. Events capture the date and time the event occurred, details about the affected elements, and other critical data points.

The following table lists Amazon EMR events, along with the state or state change that the event indicates, the severity of the event, event type, event code, and event messages. Amazon EMR represents events as JSON objects and automatically sends them to an event stream. The JSON object is important when you set up rules for event processing using CloudWatch Events because rules seek to match patterns in the JSON object. For more information, see [Events and event patterns](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/CloudWatchEventsandEventPatterns.html) and [Amazon EMR events](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#emr_event_type) in the *Amazon CloudWatch Events User Guide*.
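
Conceptually, a CloudWatch Events rule matches an event when every field in the rule's pattern is present in the event's JSON with one of the listed values. The following is a simplified sketch of that matching behavior; the event body and pattern shown are illustrative, modeled on the documented Amazon EMR event format rather than copied from a real event:

```python
def matches(pattern, event):
    """Simplified CloudWatch Events matching: every pattern key must
    exist in the event, and the event's value must be one of the
    pattern's listed values (nested objects match recursively)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True

rule = {"source": ["aws.emr"],
        "detail": {"state": ["TERMINATED_WITH_ERRORS"]}}
event = {"source": "aws.emr",
         "detail-type": "EMR Cluster State Change",
         "detail": {"clusterId": "j-XXXXXXXXXXXXX",
                    "state": "TERMINATED_WITH_ERRORS"}}
print(matches(rule, event))  # True
```

The real matching engine supports additional operators, but this captures the core idea: fields not mentioned in the pattern (such as `detail-type` above) are ignored, so a rule can be as broad or as narrow as you need.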

**Note**  
Amazon EMR periodically emits events with the event code **EC2 provisioning - Insufficient Instance Capacity**. These events occur when your Amazon EMR cluster encounters an insufficient capacity error from Amazon EC2 for your instance fleet or instance group during a cluster creation or resize operation. An event might not include all of the instance types and Availability Zones that you provided, because Amazon EMR only includes the instance types and Availability Zones in which it attempted to provision capacity since the last insufficient capacity event was emitted. For information on how to respond to these events, see [Responding to Amazon EMR cluster insufficient instance capacity events](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-events-response-insuff-capacity.html).

## Cluster start events
<a name="emr-cloudwatch-cluster-events"></a>


| State or state change | Severity | Event type | Event code | Message | 
| --- | --- | --- | --- | --- | 
| CREATING | WARN | EMR instance fleet provisioning | EC2 provisioning - Insufficient Instance Capacity | We are not able to create your Amazon EMR cluster ClusterId (ClusterName) for Instance Fleet InstanceFleetID. Amazon EC2 has insufficient Spot capacity for Instance type [Instancetype1, Instancetype2] and insufficient On-Demand capacity for Instance type [Instancetype3, Instancetype4] in Availability Zone [AvailabilityZone1, AvailabilityZone2]. See the [documentation](emr-EC2_INSUFFICIENT_CAPACITY-error.md) for more information on how to respond to this event. | 
| CREATING | WARN | EMR instance group provisioning | EC2 provisioning - Insufficient Instance Capacity | We are not able to create your Amazon EMR cluster ClusterId (ClusterName) for Instance Group InstanceGroupID. Amazon EC2 has insufficient Spot capacity for Instance type [Instancetype1, Instancetype2] and insufficient On-Demand capacity for Instance type [Instancetype3, Instancetype4] in Availability Zone [AvailabilityZone1, AvailabilityZone2]. See the [documentation](emr-EC2_INSUFFICIENT_CAPACITY-error.md) for more information on how to respond to this event. | 
| CREATING | WARN | EMR instance fleet provisioning | EC2 provisioning - Insufficient Free Addresses In Subnet | We can’t create the Amazon EMR cluster ClusterId (ClusterName) that you requested for instance fleet InstanceFleetID because the specified subnet [Subnet1, Subnet2] doesn't contain enough free private IP addresses to fulfill your request. Use the DescribeSubnets operation to see how many IP addresses are available (unused) in your subnet. For information on how to respond to this event, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html) | 
| CREATING | WARN | EMR instance group provisioning | EC2 provisioning - Insufficient Free Addresses In Subnet | We can’t create the Amazon EMR cluster ClusterId (ClusterName) that you requested for instance group InstanceGroupID because the specified subnet [Subnet1, Subnet2] doesn't contain enough free private IP addresses to fulfill your request. Use the DescribeSubnets operation to see how many IP addresses are available (unused) in your subnet. For information on how to respond to this event, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html) | 
| CREATING  | WARN  | EMR instance fleet provisioning  | EC2 Provisioning – vCPU Limit Exceeded  | The provision of instance fleet InstanceFleetID in the Amazon EMR cluster ClusterId (ClusterName) is delayed because you've reached the limit on the number of vCPUs (virtual processing units) assigned to the running instances in your account (accountId). For more information, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html).  | 
| CREATING  | WARN  | EMR instance group provisioning  | EC2 Provisioning – vCPU Limit Exceeded  | The provision of instance group InstanceGroupID in the Amazon EMR cluster ClusterId (ClusterName) is delayed because you've reached the limit on the number of vCPUs (virtual processing units) assigned to the running instances in your account (accountId). For more information, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html).  | 
| CREATING  | WARN  | EMR instance fleet provisioning  | EC2 Provisioning – Spot Instance Count Limit Exceeded  | The provision of instance fleet InstanceFleetID in the Amazon EMR cluster ClusterID (ClusterName) is delayed because you've reached the limit on the number of Spot Instances that you can launch in your account (accountId). For more information, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html).  | 
| CREATING  | WARN  | EMR instance group provisioning  | EC2 Provisioning – Spot Instance Count Limit Exceeded  | The provision of instance group InstanceGroupID in the Amazon EMR cluster ClusterID (ClusterName) is delayed because you've reached the limit on the number of Spot Instances that you can launch in your account (accountId). For more information, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html).  | 
| CREATING  | WARN  | EMR instance fleet provisioning  | EC2 Provisioning – Instance Limit Exceeded  | The provision of instance fleet InstanceFleetID in the Amazon EMR cluster ClusterId (ClusterName) is delayed because you've reached the limit on the number of instances you can run concurrently in your account (accountID). For more information on Amazon EC2 service limits, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html).  | 
| CREATING  | WARN  | EMR instance group provisioning  | EC2 Provisioning – Instance Limit Exceeded  | The provision of instance group InstanceGroupID in the Amazon EMR cluster ClusterId (ClusterName) is delayed because you've reached the limit on the number of instances you can run concurrently in your account (accountID). For more information on Amazon EC2 service limits, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html).  | 
| STARTING  | INFO  | EMR cluster state change  | *none*  | Amazon EMR cluster `ClusterId (ClusterName)` was requested at `Time` and is being created.  | 
| STARTING  | INFO  | EMR cluster state change  | *none*  |  Applies only to clusters with the instance fleets configuration and multiple Availability Zones selected within Amazon EC2.  Amazon EMR cluster `ClusterId (ClusterName)` is being created in zone (`AvailabilityZoneID`), which was chosen from the specified Availability Zone options.  | 
| STARTING  | INFO  | EMR cluster state change  | *none*  | Amazon EMR cluster `ClusterId (ClusterName)` began running steps at `Time`.  | 
| WAITING  | INFO  | EMR cluster state change  | *none*  | Amazon EMR cluster `ClusterId (ClusterName)` was created at `Time` and is ready for use. - or -  Amazon EMR cluster `ClusterId (ClusterName)` finished running all pending steps at `Time`.  A cluster in the `WAITING` state may still be processing jobs.   | 

**Note**  
Events with the event code `EC2 provisioning - Insufficient Instance Capacity` are emitted periodically when your EMR cluster encounters an insufficient capacity error from Amazon EC2 for your instance fleet or instance group during cluster creation or a resize operation. For information on how to respond to these events, see [Responding to Amazon EMR cluster insufficient instance capacity events](emr-events-response-insuff-capacity.md).

## Cluster termination events
<a name="emr-cloudwatch-cluster-termination-events"></a>


| State or state change | Severity | Event type | Event code | Message | 
| --- | --- | --- | --- | --- | 
| TERMINATED  | The severity depends on the reason for the state change, as shown in the following: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-cloudwatch-events.html)  | EMR cluster state change  | *none*  | Amazon EMR Cluster `ClusterId (ClusterName)` has terminated at `Time` with a reason of `StateChangeReason:Code`.  | 
| TERMINATED\_WITH\_ERRORS  | CRITICAL  | EMR cluster state change  | *none*  | Amazon EMR Cluster `ClusterId (ClusterName)` has terminated with errors at `Time` with a reason of `StateChangeReason:Code`.  | 

## Instance fleet state-change events
<a name="emr-cloudwatch-instance-fleet-events"></a>

**Note**  
The instance fleets configuration is available only in Amazon EMR releases 4.8.0 and later, excluding 5.0.0 and 5.0.3.



| State or state change | Severity | Event type | Event code | Message | 
| --- | --- | --- | --- | --- | 
| From `PROVISIONING` to `WAITING`  | INFO  |  | none | Provisioning for instance fleet `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` is complete. Provisioning started at `Time` and took `Num` minutes. The instance fleet now has On-Demand capacity of `Num` and Spot capacity of `Num`. Target On-Demand capacity was `Num`, and target Spot capacity was `Num`.  | 
| From `WAITING` to `RESIZING`  | INFO  |  | none | A resize for instance fleet `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` started at `Time`. The instance fleet is resizing from an On-Demand capacity of `Num` to a target of `Num`, and from a Spot capacity of `Num` to a target of `Num`.  | 
| From `RESIZING` to `WAITING`  | INFO  |  | none | The resizing operation for instance fleet `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` is complete. The resize started at `Time` and took `Num` minutes. The instance fleet now has On-Demand capacity of `Num` and Spot capacity of `Num`. Target On-Demand capacity was `Num` and target Spot capacity was `Num`.  | 
| From `RESIZING` to `WAITING`  | INFO  |  | none | The resizing operation for instance fleet `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` has reached the timeout and stopped. The resize started at `Time` and stopped after `Num` minutes. The instance fleet now has On-Demand capacity of `Num` and Spot capacity of `Num`. Target On-Demand capacity was `Num` and target Spot capacity was `Num`.  | 
| SUSPENDED  | ERROR  |  | none | Instance fleet `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` was arrested at `Time` for the following reason: `ReasonDesc`.  | 
| RESIZING  | WARNING  |  | none | The resizing operation for instance fleet `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` is stuck for the following reason: `ReasonDesc`.  | 
| `WAITING` or `RUNNING`  | INFO  |  | none | The resizing operation for instance fleet `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` couldn't complete while Amazon EMR added Spot capacity in Availability Zone `AvailabilityZone`. We've cancelled your request to provision additional Spot capacity. For recommended actions, check [Availability Zone flexibility for an Amazon EMR cluster](emr-flexibility.md) and try again.  | 
| `WAITING` or `RUNNING`  | INFO  |  | none | A resizing operation for instance fleet `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` was initiated by `Entity` at `Time`.  | 

## Instance fleet reconfiguration events
<a name="emr-cloudwatch-instance-fleet-events-reconfig"></a>



| State or state change | Severity | Message | 
| --- | --- | --- | 
| Instance Fleet Reconfiguration Requested  | INFO  | A user has requested to reconfigure the instance fleet `InstanceFleetID` in Amazon EMR cluster `ClusterId` (`ClusterName`).  | 
| Instance Fleet Reconfiguration Start  | INFO  | Amazon EMR has started a reconfiguration of the instance fleet `InstanceFleetID` in the Amazon EMR cluster `ClusterId` (`ClusterName`) at `Time`.  | 
| Instance Fleet Reconfiguration Completed  | INFO  | Amazon EMR has finished reconfiguring instance fleet `InstanceFleetID` in the Amazon EMR cluster `ClusterId` (`ClusterName`).  | 
| Instance Fleet Reconfiguration Failed  | WARNING  | Amazon EMR failed to reconfigure the instance fleet `InstanceFleetID` in the Amazon EMR cluster `ClusterId` (`ClusterName`) at `Time`. The reconfiguration failed because `Reason`.  | 
| Instance Fleet Reconfiguration Reversion Start  | INFO  | Amazon EMR is reverting the instance fleet `InstanceFleetID` in the Amazon EMR cluster `ClusterId` (`ClusterName`) to the previous successful configuration.  | 
| Instance Fleet Reconfiguration Reversion Completed  | INFO  | Amazon EMR finished reverting the instance fleet `InstanceFleetID` in the Amazon EMR cluster `ClusterId` (`ClusterName`) to the previous successful configuration.  | 
| Instance Fleet Reconfiguration Reversion Failed  | CRITICAL  | Amazon EMR couldn't revert the instance fleet `InstanceFleetID` in the Amazon EMR cluster `ClusterId` (`ClusterName`) to the previously successful configuration at `Time`. The reconfiguration reversion failed because of `Reason`.  | 
| Instance Fleet Reconfiguration Reversion Blocked  | INFO  | Amazon EMR temporarily blocked the instance fleet `InstanceFleetID` in the Amazon EMR cluster `ClusterId` (`ClusterName`) at `Time` because the instance fleet is in the `State` state.  | 

## Instance fleet resize events
<a name="emr-cloudwatch-instance-fleet-resize-events"></a>



| Event type | Severity | Event code | Message | 
| --- | --- | --- | --- | 
| EMR instance fleet resize   | ERROR | Spot Provisioning timeout  | The Resize operation for Instance Fleet `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` was not able to complete while acquiring Spot capacity in AZ `AvailabilityZone`. We have now cancelled your request and stopped trying to provision any additional Spot capacity and the Instance Fleet has provisioned Spot capacity of `num`. Target Spot capacity was `num`. For more information and recommended actions, please check the documentation page [here](emr-flexibility.md) and retry again.  | 
| EMR instance fleet resize   | ERROR | On-Demand Provisioning timeout  | The Resize operation for Instance Fleet `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` was not able to complete while acquiring On-Demand capacity in AZ `AvailabilityZone`. We have now cancelled your request and stopped trying to provision any additional On-Demand capacity and the Instance Fleet has provisioned On-Demand capacity of `num`. Target On-Demand capacity was `num`. For more information and recommended actions, please check the documentation page [here](emr-flexibility.md) and retry again.  | 
| EMR instance fleet resize   | WARNING | EC2 provisioning - Insufficient Instance Capacity | We are not able to complete the resize operation for Instance Fleet `InstanceFleetID` in EMR cluster `ClusterId (ClusterName)` as Amazon EC2 has insufficient Spot capacity for Instance types `[Instancetype1, Instancetype2]` and insufficient On-Demand capacity for Instance types `[Instancetype3, Instancetype4]` in Availability Zone `[AvailabilityZone1]`. So far, the instance fleet has provisioned On-Demand capacity of `num` and target On-Demand capacity was `num`. Provisioned Spot capacity is `num` and target Spot capacity was `num`. Check here [documentation](emr-EC2_INSUFFICIENT_CAPACITY-error.md) for more information on how to respond to this event.  | 
| EMR instance fleet resize   | WARNING | Spot Provisioning Timeout - Continuing Resize  | We're still provisioning Spot capacity for the Instance Fleet resize operation that initiated at `time` for instance fleet ID `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` for `[Instancetype1, Instancetype2]` in AZ `AvailabilityZone`. For the previous resize operation that initiated at `time`, the timeout period expired, so Amazon EMR stopped provisioning Spot capacity after adding `num` of the requested `num` instances to your instance fleet. For more information, please check the documentation page [here](emr-flexibility.md). | 
| EMR instance fleet resize   | WARNING | On-Demand Provisioning Timeout - Continuing Resize  | We're still provisioning On-Demand capacity for the Instance Fleet resize operation that initiated at `time` for instance fleet ID `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` for `[Instancetype1, Instancetype2]` in AZ `AvailabilityZone`. For the previous resize operation that initiated at `time`, the timeout period expired, so Amazon EMR stopped provisioning On-Demand capacity after adding `num` of the requested `num` instances to your instance fleet. For more information, please check the documentation page [here](emr-flexibility.md). | 
| EMR instance fleet resize   | WARNING | EC2 Provisioning - Insufficient Free Address in Subnet  | We can't complete the resize operation for instance fleet InstanceFleetID in Amazon EMR cluster ClusterId (ClusterName) because the specified subnet [Subnet1, Subnet2] doesn't contain enough free private IP addresses to fulfill your request. Use the DescribeSubnets operation to view how many IP addresses are available (unused) in your subnet. For information on how to respond to this event, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html). | 
| EMR instance fleet resize   | WARNING | EC2 Provisioning - vCPU Limit Exceeded  | The resize of instance fleet InstanceFleetID in the Amazon EMR cluster ClusterName is delayed because you've reached the limit on the number of vCPUs (virtual processing units) assigned to the running instances in your account (accountId). For more information, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html). | 
| EMR instance fleet resize  | WARNING | EC2 Provisioning - Spot Instance Count Limit Exceeded  | The provision of instance fleet InstanceFleetID in the Amazon EMR cluster ClusterID (ClusterName) is delayed because you've reached the limit on the number of Spot Instances that you can launch in your account (accountId). For more information, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html).  | 
| EMR instance fleet resize   | WARNING | EC2 Provisioning - Instance Limit Exceeded  | The provision of instance fleet InstanceFleetID in the Amazon EMR cluster ClusterID (ClusterName) is delayed because you've reached the limit on the number of On-Demand Instances you can run in your account (accountId). For more information, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html).  | 

**Note**  
The provisioning timeout events are emitted when Amazon EMR stops provisioning Spot or On-Demand capacity for the fleet after the timeout expires. For information on how to respond to these events, see [Responding to Amazon EMR cluster instance fleet resize timeout events](emr-events-response-timeout-events.md).

## Instance group events
<a name="emr-cloudwatch-instance-group-events"></a>



| Event type | Severity | Event code | Message | 
| --- | --- | --- | --- | 
| From `RESIZING` to `RUNNING`  | INFO  | none | The resizing operation for instance group `InstanceGroupID` in Amazon EMR cluster `ClusterId (ClusterName)` is complete. It now has an instance count of `Num`. The resize started at `Time` and took `Num` minutes to complete.  | 
| From `RUNNING` to `RESIZING`  | INFO  | none | A resize for instance group `InstanceGroupID` in Amazon EMR cluster `ClusterId (ClusterName)` started at `Time`. It is resizing from an instance count of `Num` to `Num`.  | 
| SUSPENDED  | ERROR  | none | Instance group `InstanceGroupID` in Amazon EMR cluster `ClusterId (ClusterName)` was arrested at `Time` for the following reason: `ReasonDesc`.  | 
| RESIZING  | WARNING  | none | The resizing operation for instance group `InstanceGroupID` in Amazon EMR cluster `ClusterId (ClusterName)` is stuck for the following reason: `ReasonDesc`.  | 
| EMR instance group resize   | WARNING | EC2 provisioning - Insufficient Instance Capacity | We are not able to complete the resize operation that started at `time` for Instance Group `InstanceGroupID` in EMR cluster `ClusterId (ClusterName)` as Amazon EC2 has insufficient `Spot/On Demand` capacity for Instance type `[Instancetype]` in Availability Zone `[AvailabilityZone1]`. So far, the instance group has a running instance count of `num` and requested instance count was `num`. Check here [documentation](emr-EC2_INSUFFICIENT_CAPACITY-error.md) for more information on how to respond to this event.  | 
| EMR instance group resize   | WARNING | EC2 Provisioning - Insufficient Free Address in Subnet  | We can't complete the resize operation for instance group InstanceGroupID in Amazon EMR cluster ClusterId (ClusterName) because the specified subnet [Subnet1, Subnet2] doesn't contain enough free private IP addresses to fulfill your request. Use the DescribeSubnets operation to view how many IP addresses are available (unused) in your subnet. For information on how to respond to this event, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html). | 
| EMR instance group resize   | WARNING | EC2 Provisioning - vCPU Limit Exceeded  | The resize of instance group InstanceGroupID in the Amazon EMR cluster ClusterName is delayed because you've reached the limit on the number of vCPUs (virtual processing units) assigned to the running instances in your account (accountId). For more information, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html). | 
| EMR instance group resize   | WARNING | EC2 Provisioning - Spot Instance Count Limit Exceeded  | The provision of instance group InstanceGroupID in the Amazon EMR cluster ClusterID (ClusterName) is delayed because you've reached the limit on the number of Spot Instances that you can launch in your account (accountId). For more information, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html).  | 
| EMR instance group resize   | WARNING | EC2 Provisioning - Instance Limit Exceeded  | The provision of instance group InstanceGroupID in the Amazon EMR cluster ClusterID (ClusterName) is delayed because you've reached the limit on the number of On-Demand Instances you can run in your account (accountId). For more information, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html).  | 
| From `RUNNING` to `RESIZING`  | INFO  | none | A resize for instance group `InstanceGroupID` in Amazon EMR cluster `ClusterId (ClusterName)` was initiated by `Entity` at `Time`.  | 

**Note**  
With Amazon EMR version 5.21.0 and later, you can override cluster configurations and specify additional configuration classifications for each instance group in a running cluster. You do this by using the Amazon EMR console, the AWS Command Line Interface (AWS CLI), or the AWS SDK. For more information, see [Supplying a Configuration for an Instance Group in a Running Cluster](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps-running-cluster.html).

The following table lists Amazon EMR events for the reconfiguration operation, along with the state or state change that the event indicates, the severity of the event, and event messages. 



| State or state change | Severity | Message | 
| --- | --- | --- | 
| RUNNING  | INFO  | A reconfiguration for instance group `InstanceGroupID` in the Amazon EMR cluster `ClusterId (ClusterName)` was initiated by user at `Time`. Version of requested configuration is `Num`.  | 
| From `RECONFIGURING` to `RUNNING` | INFO  | The reconfiguration operation for instance group `InstanceGroupID` in the Amazon EMR cluster `ClusterId (ClusterName)` is complete. The reconfiguration started at `Time` and took `Num` minutes to complete. Current configuration version is `Num`.  | 
| From `RUNNING` to `RECONFIGURING`  | INFO  | A reconfiguration for instance group `InstanceGroupID` in the Amazon EMR cluster `ClusterId (ClusterName)` started at `Time`. It is reconfiguring from version number `Num` to version number `Num`.  | 
| RESIZING  | INFO  | Reconfiguring operation towards configuration version `Num` for instance group `InstanceGroupID` in the Amazon EMR cluster `ClusterId (ClusterName)` is temporarily blocked at `Time` because instance group is in `State`.  | 
| RECONFIGURING  | INFO  | Resizing operation towards instance count `Num` for instance group `InstanceGroupID` in the Amazon EMR cluster `ClusterId (ClusterName)` is temporarily blocked at `Time` because the instance group is in `State`. | 
| RECONFIGURING  | WARNING  | The reconfiguration operation for instance group `InstanceGroupID` in the Amazon EMR cluster `ClusterId (ClusterName)` failed at `Time` and took `Num` minutes to fail. Failed configuration version is `Num`.   | 
| RECONFIGURING  | INFO  | Configurations are reverting to the previous successful version number `Num` for instance group `InstanceGroupID` in the Amazon EMR cluster `ClusterId (ClusterName)` at `Time`. New configuration version is `Num`.   | 
| From `RECONFIGURING` to `RUNNING` | INFO  | Configurations were successfully reverted to the previous successful version `Num` for instance group `InstanceGroupID` in the Amazon EMR cluster `ClusterId (ClusterName)` at `Time`. New configuration version is `Num`.  | 
| From `RECONFIGURING` to `SUSPENDED`  | CRITICAL  | Failed to revert to the previous successful version `Num` for Instance group `InstanceGroupID` in the Amazon EMR cluster `ClusterId (ClusterName)` at `Time`.  | 

## Automatic scaling policy events
<a name="emr-cloudwatch-autoscale-events"></a>



| State or state change | Severity | Message | 
| --- | --- | --- | 
| PENDING  | INFO  | An Auto Scaling policy was added to instance group `InstanceGroupID` in Amazon EMR cluster `ClusterId (ClusterName)` at `Time`. The policy is pending attachment. - or -  The Auto Scaling policy for instance group `InstanceGroupID` in Amazon EMR cluster `ClusterId (ClusterName)` was updated at `Time`. The policy is pending attachment.  | 
| ATTACHED  | INFO  | The Auto Scaling policy for instance group `InstanceGroupID` in Amazon EMR cluster `ClusterId (ClusterName)` was attached at `Time`.  | 
| `DETACHED`  | INFO  | The Auto Scaling policy for instance group `InstanceGroupID` in Amazon EMR cluster `ClusterId (ClusterName)` was detached at `Time`.  | 
| FAILED  | ERROR  | The Auto Scaling policy for instance group `InstanceGroupID` in Amazon EMR cluster `ClusterId (ClusterName)` could not attach and failed at `Time`. - or -  The Auto Scaling policy for instance group `InstanceGroupID` in Amazon EMR cluster `ClusterId (ClusterName)` could not detach and failed at `Time`.  | 

## Step events
<a name="emr-cloudwatch-step-events"></a>



| State or state change | Severity | Message | 
| --- | --- | --- | 
| PENDING  | INFO  | Step `StepID (StepName)` was added to Amazon EMR cluster `ClusterId (ClusterName)` at `Time` and is pending execution.   | 
| CANCEL\_PENDING  | WARN  | Step `StepID (StepName)` in Amazon EMR cluster `ClusterId (ClusterName)` was cancelled at `Time` and is pending cancellation.   | 
| RUNNING  | INFO  | Step `StepID (StepName)` in Amazon EMR cluster `ClusterId (ClusterName)` started running at `Time`.   | 
| COMPLETED  | INFO  | Step `StepID (StepName)` in Amazon EMR cluster `ClusterId (ClusterName)` completed execution at `Time`. The step started running at `Time` and took `Num` minutes to complete.  | 
| CANCELLED  | WARN  | Cancellation request has succeeded for cluster step `StepID (StepName)` in Amazon EMR cluster `ClusterId (ClusterName)` at `Time`, and the step is now cancelled.   | 
| FAILED  | ERROR  | Step `StepID (StepName)` in Amazon EMR cluster `ClusterId (ClusterName)` failed at `Time`.  | 

## Unhealthy node replacement events
<a name="emr-cloudwatch-unhealthy-node-replacement-events"></a>


| Event type | Severity | Event code | Message | 
| --- | --- | --- | --- | 
| Amazon EMR unhealthy node replacement | INFO | Unhealthy core node detected | Amazon EMR has identified that core instance `[instanceID (InstanceName)]` in `InstanceGroup/Fleet` in the Amazon EMR cluster `clusterID (ClusterName)` is `UNHEALTHY`. Amazon EMR will attempt to recover or gracefully replace the `UNHEALTHY` instance.  | 
| Amazon EMR unhealthy node replacement | INFO | Core node unhealthy - replacement disabled | Amazon EMR has identified that core instance `[instanceID (InstanceName)]` in `InstanceGroup/Fleet` in the Amazon EMR cluster `(clusterID) (ClusterName)` is `UNHEALTHY`. Turn on graceful unhealthy core node replacement in your cluster to let Amazon EMR gracefully replace the `UNHEALTHY` instances in the event that they can’t be recovered.  | 
| Amazon EMR unhealthy node replacement | WARN | Unhealthy core node not replaced | Amazon EMR can't replace your `UNHEALTHY` core instance `[instanceID (InstanceName)]` in `InstanceGroup/Fleet` in the Amazon EMR cluster `clusterID (ClusterName)` because of *reason*. The reason why Amazon EMR can't replace your core node depends on your scenario. For example, Amazon EMR can't remove a node if doing so would leave the cluster without any remaining core nodes.  | 
| Amazon EMR unhealthy node replacement | INFO | Unhealthy core node recovered | Amazon EMR has recovered your `UNHEALTHY` core instance `[instanceID (InstanceName)]` in `InstanceGroup/Fleet` in the Amazon EMR cluster `clusterID (ClusterName)`.  | 

For more information about unhealthy node replacement, see [Replacing unhealthy nodes](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-node-replacement.html).

## Viewing events with the Amazon EMR console
<a name="emr-events-console"></a>

For each cluster, you can view a simple list of events in the details pane, which lists events in descending order of occurrence. You can also view all events for all clusters in a Region, also in descending order of occurrence.

If you don't want a user to see all cluster events for a Region, add a statement that denies permission (`"Effect": "Deny"`) for the `elasticmapreduce:ViewEventsFromAllClustersInConsole` action to a policy that is attached to the user. 
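For example, a deny statement along these lines blocks the all-clusters events view. This is a minimal sketch; combine it with the rest of your policy and scope it to your account's conventions.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAllClustersEventsView",
      "Effect": "Deny",
      "Action": "elasticmapreduce:ViewEventsFromAllClustersInConsole",
      "Resource": "*"
    }
  ]
}
```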

**To view events for all clusters in a Region with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Events**.

**To view events for a particular cluster with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose a cluster.

1. To view all of your events, select the **Events** tab on the cluster details page.

# Responding to CloudWatch events from Amazon EMR
<a name="emr-events-response"></a>

This section describes various ways that you can respond to actionable events that Amazon EMR emits as [CloudWatch event messages](emr-manage-cloudwatch-events.md). Ways you can respond to events include creating rules, setting alarms, and other responses. The sections that follow include links to procedures and recommended responses to common events.

**Topics**
+ [Creating rules for Amazon EMR events with CloudWatch](emr-events-cloudwatch-console.md)
+ [Setting alarms on CloudWatch metrics from Amazon EMR](UsingEMR_ViewingMetrics_Alarm.md)
+ [Responding to Amazon EMR cluster insufficient instance capacity events](emr-events-response-insuff-capacity.md)
+ [Responding to Amazon EMR cluster instance fleet resize timeout events](emr-events-response-timeout-events.md)

# Creating rules for Amazon EMR events with CloudWatch
<a name="emr-events-cloudwatch-console"></a>

Amazon EMR automatically sends events to a CloudWatch event stream. You can create rules that match events according to a specified pattern, and route the events to targets to take action, such as sending an email notification. Patterns are matched against the event JSON object. For more information about Amazon EMR event details, see [Amazon EMR events](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#emr_event_type) in the *Amazon CloudWatch Events User Guide*.

For information about setting up CloudWatch event rules, see [Creating a CloudWatch rule that triggers on an event](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Create-CloudWatch-Events-Rule.html).
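As an illustration, an event pattern like the following matches the insufficient-capacity instance group provisioning events described earlier in this guide. This is a sketch; verify the `detail-type` and `eventCode` values against the event reference linked above before using it in a rule.

```json
{
  "source": ["aws.emr"],
  "detail-type": ["EMR Instance Group Provisioning"],
  "detail": {
    "eventCode": ["EC2 provisioning - Insufficient Instance Capacity"]
  }
}
```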

# Setting alarms on CloudWatch metrics from Amazon EMR
<a name="UsingEMR_ViewingMetrics_Alarm"></a>

Amazon EMR pushes metrics to Amazon CloudWatch. In response, you can use CloudWatch to set alarms on your Amazon EMR metrics. For example, you can configure an alarm in CloudWatch to send you an email any time the HDFS utilization rises above 80%. For detailed instructions, see [Create or edit a CloudWatch alarm](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ConsoleAlarms.html) in the *Amazon CloudWatch User Guide*. 
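The 80% HDFS example can be sketched with boto3. The metric name `HDFSUtilization`, namespace `AWS/ElasticMapReduce`, and dimension name `JobFlowId` are the values Amazon EMR reports to CloudWatch; the cluster ID and SNS topic ARN below are placeholders you would replace with your own.

```python
def build_hdfs_utilization_alarm(cluster_id, topic_arn):
    """Build put_metric_alarm parameters for an alarm that fires when
    HDFSUtilization stays above 80% for two consecutive 5-minute periods.
    Pass the result to boto3.client("cloudwatch").put_metric_alarm(**params).
    """
    return {
        "AlarmName": f"emr-{cluster_id}-hdfs-utilization",
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "HDFSUtilization",
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 300,  # Amazon EMR emits metrics every five minutes
        "EvaluationPeriods": 2,
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # e.g. an SNS topic that emails you
    }


if __name__ == "__main__":
    # Placeholder cluster ID and topic ARN for illustration only
    params = build_hdfs_utilization_alarm(
        "j-XXXXXXXXXXXXX",
        "arn:aws:sns:us-east-1:111122223333:emr-alerts",
    )
    print(params["AlarmName"])
```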

# Responding to Amazon EMR cluster insufficient instance capacity events
<a name="emr-events-response-insuff-capacity"></a>

## Overview
<a name="emr-events-response-insuff-capacity-overview"></a>

Amazon EMR clusters return the event code `EC2 provisioning - Insufficient Instance Capacity` when the selected Availability Zone doesn't have enough capacity to fulfill your cluster start or resize request. The event is emitted periodically for both instance groups and instance fleets if Amazon EMR repeatedly encounters insufficient capacity exceptions and can't fulfill your provisioning request during a cluster start or cluster resize operation.

This page describes how you can best respond to this event type when it occurs for your EMR cluster.

## Recommended response to an insufficient capacity event
<a name="emr-events-response-insuff-capacity-rec"></a>

We recommend that you respond to an insufficient-capacity event in one of the following ways:
+ Wait for capacity to recover. Capacity shifts frequently, so an insufficient capacity exception can recover on its own. Your clusters will start or finish resizing as soon as Amazon EC2 capacity becomes available.
+ Alternatively, you can terminate your cluster, modify your instance type configurations, and create a new cluster with the updated cluster configuration request. For more information, see [Availability Zone flexibility for an Amazon EMR cluster](emr-flexibility.md).

You can also set up rules or automated responses to an insufficient capacity event, as described in the next section.

## Automated recovery from an insufficient capacity event
<a name="emr-events-response-insuff-capacity-ex"></a>

You can build automation in response to Amazon EMR events such as the ones with event code `EC2 provisioning - Insufficient Instance Capacity`. For example, the following AWS Lambda function terminates an EMR cluster with an instance group that uses On-Demand instances, and then creates a new EMR cluster with an instance group that contains different instance types than the original request.

The following conditions trigger the automated process to occur:
+ The insufficient capacity event has been emitted for primary or core nodes for more than 20 minutes.
+ The cluster is not in a **READY** or **WAITING** state. For more information about EMR cluster states, see [Understanding the cluster lifecycle](emr-overview.md#emr-overview-cluster-lifecycle).

**Note**  
When you build an automated process for an insufficient capacity exception, consider that the insufficient capacity event is recoverable. Capacity often shifts, and your clusters will resume the resize or start operation as soon as Amazon EC2 capacity becomes available.
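For reference, the function below inspects an event shaped roughly like the following. This is an illustrative sketch showing only the fields the code reads, not a complete event payload.

```json
{
  "source": "aws.emr",
  "detail-type": "EMR Instance Group Provisioning",
  "detail": {
    "eventCode": "EC2 provisioning - Insufficient Instance Capacity",
    "instanceGroupType": "CORE"
  }
}
```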

**Example function to respond to insufficient capacity event**  

```
# Lambda function using Python 3.10; the handler is lambda_function.lambda_handler
# Note: the function's IAM role requires permission to call Amazon EMR APIs

import json
import boto3
import datetime
from datetime import timezone

INSUFFICIENT_CAPACITY_EXCEPTION_DETAIL_TYPE = "EMR Instance Group Provisioning"
INSUFFICIENT_CAPACITY_EXCEPTION_EVENT_CODE = (
    "EC2 provisioning - Insufficient Instance Capacity"
)
ALLOWED_INSTANCE_TYPES_TO_USE = [
    "m5.xlarge",
    "c5.xlarge",
    "m5.4xlarge",
    "m5.2xlarge",
    "t3.xlarge",
]
CLUSTER_START_ACCEPTABLE_STATES = ["WAITING", "RUNNING"]
CLUSTER_START_SLA = 20

CLIENT = boto3.client("emr", region_name="us-east-1")

# checks if the incoming event is 'EMR Instance Group Provisioning' with eventCode 'EC2 provisioning - Insufficient Instance Capacity'
def is_insufficient_capacity_event(event):
    if not event["detail"]:
        return False
    else:
        return (
            event["detail-type"] == INSUFFICIENT_CAPACITY_EXCEPTION_DETAIL_TYPE
            and event["detail"]["eventCode"]
            == INSUFFICIENT_CAPACITY_EXCEPTION_EVENT_CODE
        )


# checks if the cluster is eligible for termination
def is_cluster_eligible_for_termination(event, describeClusterResponse):
    # instanceGroupType could be CORE, MASTER OR TASK
    instanceGroupType = event["detail"]["instanceGroupType"]
    clusterCreationTime = describeClusterResponse["Cluster"]["Status"]["Timeline"][
        "CreationDateTime"
    ]
    clusterState = describeClusterResponse["Cluster"]["Status"]["State"]

    now = datetime.datetime.now()
    now = now.replace(tzinfo=timezone.utc)
    isClusterStartSlaBreached = clusterCreationTime < now - datetime.timedelta(
        minutes=CLUSTER_START_SLA
    )

    # Check if the instance group receiving the insufficient capacity exception is CORE or PRIMARY (MASTER),
    # and more than 20 minutes have passed since cluster creation without the cluster
    # reaching the RUNNING or WAITING state
    return (
        instanceGroupType in ("CORE", "MASTER")
        and isClusterStartSlaBreached
        and clusterState not in CLUSTER_START_ACCEPTABLE_STATES
    )


# Choose an instance type from the allowed list, excluding the exempt value
def choice_excluding(exempt):
    for i in ALLOWED_INSTANCE_TYPES_TO_USE:
        if i != exempt:
            return i


# Create a new cluster using a different instance type.
def create_cluster(event):
    # instanceGroupType could be CORE, MASTER OR TASK
    instanceGroupType = event["detail"]["instanceGroupType"]

    # The following two lines assume that the customer who created the cluster knows which instance types were used in the original request
    instanceTypesFromOriginalRequestMaster = "m5.xlarge"
    instanceTypesFromOriginalRequestCore = "m5.xlarge"

    # Select new instance types to include in the new createCluster request
    instanceTypeForMaster = (
        instanceTypesFromOriginalRequestMaster
        if instanceGroupType != "MASTER"
        else choice_excluding(instanceTypesFromOriginalRequestMaster)
    )
    instanceTypeForCore = (
        instanceTypesFromOriginalRequestCore
        if instanceGroupType != "CORE"
        else choice_excluding(instanceTypesFromOriginalRequestCore)
    )

    print("Starting to create cluster...")
    instances = {
        "InstanceGroups": [
            {
                "InstanceRole": "MASTER",
                "InstanceCount": 1,
                "InstanceType": instanceTypeForMaster,
                "Market": "ON_DEMAND",
                "Name": "Master",
            },
            {
                "InstanceRole": "CORE",
                "InstanceCount": 1,
                "InstanceType": instanceTypeForCore,
                "Market": "ON_DEMAND",
                "Name": "Core",
            },
        ]
    }
    response = CLIENT.run_job_flow(
        Name="Test Cluster",
        Instances=instances,
        VisibleToAllUsers=True,
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        ReleaseLabel="emr-6.10.0",
    )

    return response["JobFlowId"]


# Terminate the cluster using the clusterId received in the event
def terminate_cluster(event):
    print("Trying to terminate cluster, clusterId: " + event["detail"]["clusterId"])
    response = CLIENT.terminate_job_flows(JobFlowIds=[event["detail"]["clusterId"]])
    print(f"Terminate cluster response: {response}")


def describe_cluster(event):
    response = CLIENT.describe_cluster(ClusterId=event["detail"]["clusterId"])
    return response


def lambda_handler(event, context):
    if is_insufficient_capacity_event(event):
        print(
            "Received insufficient capacity event for instanceGroup, clusterId: "
            + event["detail"]["clusterId"]
        )

        describeClusterResponse = describe_cluster(event)

        shouldTerminateCluster = is_cluster_eligible_for_termination(
            event, describeClusterResponse
        )
        if shouldTerminateCluster:
            terminate_cluster(event)

            clusterId = create_cluster(event)
            print("Created a new cluster, clusterId: " + clusterId)
        else:
            print(
                "Cluster is not eligible for termination, clusterId: "
                + event["detail"]["clusterId"]
            )

    else:
        print("Received event is not insufficient capacity event, skipping")
```
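
The function above reads only a few fields from the incoming event. The following is a minimal, hypothetical sketch of a matching payload with a small local mirror of the matching logic; the cluster ID and the exact set of fields shown are illustrative, not a complete Amazon EMR event.

```
# Hypothetical minimal payload shape; the handler above only inspects
# "detail-type", "detail.eventCode", "detail.clusterId", and
# "detail.instanceGroupType".
sample_event = {
    "detail-type": "EMR Instance Group Provisioning",
    "detail": {
        "eventCode": "EC2 provisioning - Insufficient Instance Capacity",
        "clusterId": "j-EXAMPLE1234567",  # hypothetical cluster ID
        "instanceGroupType": "CORE",
    },
}


# Local mirror of the handler's matching check, for inspection without AWS calls
def matches_insufficient_capacity(event):
    detail = event.get("detail")
    return bool(detail) and (
        event.get("detail-type") == "EMR Instance Group Provisioning"
        and detail.get("eventCode")
        == "EC2 provisioning - Insufficient Instance Capacity"
    )


print(matches_insufficient_capacity(sample_event))
```

You can use a payload like this as a test event in the Lambda console to exercise the function end to end.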

# Responding to Amazon EMR cluster instance fleet resize timeout events
<a name="emr-events-response-timeout-events"></a>

## Overview
<a name="emr-events-response-timeout-events-overview"></a>

Amazon EMR clusters emit [events](emr-manage-cloudwatch-events.md#emr-cloudwatch-instance-fleet-resize-events) while executing resize operations for instance fleet clusters. Amazon EMR emits a provisioning timeout event when it stops provisioning Spot or On-Demand capacity for a fleet because the timeout has expired. You can configure the timeout duration as part of the [resize specifications](https://docs.aws.amazon.com/emr/latest/APIReference/API_InstanceFleetResizingSpecifications.html) for the instance fleet. When the same instance fleet has consecutive resizes, Amazon EMR emits the `Spot provisioning timeout - continuing resize` or `On-Demand provisioning timeout - continuing resize` event when the timeout for the current resize operation expires, and then starts provisioning capacity for the fleet's next resize operation.
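
As a sketch of how the timeout duration might be expressed, the following builds the resize specification structure locally, under the assumption that it is passed in an instance fleet modify request; the fleet ID and capacity values are hypothetical placeholders.

```
# Sketch: resize specifications with a 30-minute provisioning timeout for
# both Spot and On-Demand capacity, built locally (no AWS call is made).
resize_specifications = {
    "SpotResizeSpecification": {"TimeoutDurationMinutes": 30},
    "OnDemandResizeSpecification": {"TimeoutDurationMinutes": 30},
}

# Hypothetical modify-fleet request body that would carry these specifications;
# "if-EXAMPLE1234567" is a placeholder fleet ID, and 100 is an example target.
instance_fleet_modify_config = {
    "InstanceFleetId": "if-EXAMPLE1234567",
    "TargetSpotCapacity": 100,
    "ResizeSpecifications": resize_specifications,
}

print(instance_fleet_modify_config["ResizeSpecifications"])
```

A structure like this could then be supplied to the EMR API when modifying the fleet; see the resize specifications API reference linked above for the authoritative field definitions.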

## Responding to instance fleet resize timeout events
<a name="emr-events-response-timeout-events-rec"></a>

We recommend that you respond to a provisioning timeout event in one of the following ways:
+ Revisit the [resize specifications](https://docs.aws.amazon.com/emr/latest/APIReference/API_InstanceFleetResizingSpecifications.html) and retry the resize operation. Because capacity shifts frequently, your cluster resizes successfully as soon as Amazon EC2 capacity becomes available. We recommend that you configure lower timeout durations for jobs with stricter SLAs.
+ Alternatively, you can either:
  + Launch a new cluster with diversified instance types based on the [best practices for instance and Availability Zone flexibility](emr-flexibility.md#emr-flexibility-types), or
  + Launch a cluster with On-Demand capacity.
+ For the provisioning timeout - continuing resize event, you can also wait for the resize operations to be processed. Amazon EMR continues to sequentially process the resize operations triggered for the fleet, respecting the configured resize specifications.

You can also set up rules or automated responses to this event as described in the next section.
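
One way to set up such a rule is an EventBridge event pattern that matches these events. The following sketch builds a candidate pattern locally; the exact `detail-type` and `eventCode` strings are assumptions based on the event names in this section, so verify them against the events your clusters actually emit before creating a rule.

```
import json

# Candidate EventBridge event pattern for the provisioning timeout events
# discussed above. The detail-type and eventCode values are assumptions
# taken from the event names in this section.
event_pattern = {
    "source": ["aws.emr"],
    "detail-type": ["EMR Instance Fleet Resize"],
    "detail": {
        "eventCode": [
            "Spot Provisioning timeout",
            "On-Demand provisioning timeout",
        ]
    },
}

# A rule with this pattern could target a Lambda function, for example via
# the boto3 events client: put_rule(Name=..., EventPattern=json.dumps(event_pattern))
print(json.dumps(event_pattern, indent=2))
```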

## Automated recovery from a provisioning timeout event
<a name="emr-events-response-timeout-events-ex"></a>

You can build automation in response to Amazon EMR events with the `Spot Provisioning timeout` event code. For example, the following AWS Lambda function shuts down an EMR cluster with an instance fleet that uses Spot Instances for task nodes, and then creates a new EMR cluster with an instance fleet that contains more diversified instance types than the original request. In this example, the `Spot Provisioning timeout` event emitted for task nodes triggers the Lambda function.

**Example function to respond to `Spot Provisioning timeout` event**  

```
# Lambda code for Python 3.10; the handler is lambda_function.lambda_handler
# Note: the associated IAM role requires permission to call Amazon EMR
 
import boto3
 
SPOT_PROVISIONING_TIMEOUT_EXCEPTION_DETAIL_TYPE = "EMR Instance Fleet Resize"
SPOT_PROVISIONING_TIMEOUT_EXCEPTION_EVENT_CODE = "Spot Provisioning timeout"
 
CLIENT = boto3.client("emr", region_name="us-east-1")
 
# checks if the incoming event is 'EMR Instance Fleet Resize' with eventCode 'Spot Provisioning timeout'
def is_spot_provisioning_timeout_event(event):
    if not event["detail"]:
        return False
    else:
        return (
            event["detail-type"] == SPOT_PROVISIONING_TIMEOUT_EXCEPTION_DETAIL_TYPE
            and event["detail"]["eventCode"]
            == SPOT_PROVISIONING_TIMEOUT_EXCEPTION_EVENT_CODE
        )
 
 
# checks if the cluster is eligible for termination
def is_cluster_eligible_for_termination(event, describeClusterResponse):
    # instanceFleetType could be CORE, MASTER OR TASK
    instanceFleetType = event["detail"]["instanceFleetType"]
 
    # Check if the instance fleet receiving the Spot provisioning timeout event is TASK
    return instanceFleetType == "TASK"
 
 
# Create a new cluster using different instance types.
def create_cluster(event):
    # instanceFleetType could be CORE, MASTER OR TASK
    instanceFleetType = event["detail"]["instanceFleetType"]
 
    # The following two lines assume that the customer who created the cluster knows which instance types were used in the original request
    instanceTypesFromOriginalRequestMaster = "m5.xlarge"
    instanceTypesFromOriginalRequestCore = "m5.xlarge"
   
    # select new instance types to include in the new createCluster request
    instanceTypesForTask = [
        "m5.xlarge",
        "m5.2xlarge",
        "m5.4xlarge",
        "m5.8xlarge",
        "m5.12xlarge"
    ]
    
    print("Starting to create cluster...")
    instances = {
        "InstanceFleets": [
            {
                "InstanceFleetType": "MASTER",
                "TargetOnDemandCapacity": 1,
                "TargetSpotCapacity": 0,
                "InstanceTypeConfigs": [
                    {
                        "InstanceType": instanceTypesFromOriginalRequestMaster,
                        "WeightedCapacity": 1,
                    }
                ],
            },
            {
                "InstanceFleetType": "CORE",
                "TargetOnDemandCapacity": 1,
                "TargetSpotCapacity": 0,
                "InstanceTypeConfigs": [
                    {
                        "InstanceType": instanceTypesFromOriginalRequestCore,
                        "WeightedCapacity": 1,
                    }
                ],
            },
            {
                "InstanceFleetType": "TASK",
                "TargetOnDemandCapacity": 0,
                "TargetSpotCapacity": 100,
                "LaunchSpecifications": {},
                "InstanceTypeConfigs": [
                    {"InstanceType": instanceTypesForTask[0], "WeightedCapacity": 1},
                    {"InstanceType": instanceTypesForTask[1], "WeightedCapacity": 2},
                    {"InstanceType": instanceTypesForTask[2], "WeightedCapacity": 4},
                    {"InstanceType": instanceTypesForTask[3], "WeightedCapacity": 8},
                    {"InstanceType": instanceTypesForTask[4], "WeightedCapacity": 12},
                ],
                "ResizeSpecifications": {
                    "SpotResizeSpecification": {"TimeoutDurationMinutes": 30}
                },
            },
        ]
    }
    response = CLIENT.run_job_flow(
        Name="Test Cluster",
        Instances=instances,
        VisibleToAllUsers=True,
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        ReleaseLabel="emr-6.10.0",
    )
 
    return response["JobFlowId"]
 
 
# Terminate the cluster using the clusterId received in the event
def terminate_cluster(event):
    print("Trying to terminate cluster, clusterId: " + event["detail"]["clusterId"])
    response = CLIENT.terminate_job_flows(JobFlowIds=[event["detail"]["clusterId"]])
    print(f"Terminate cluster response: {response}")
 
 
def describe_cluster(event):
    response = CLIENT.describe_cluster(ClusterId=event["detail"]["clusterId"])
    return response
 
 
def lambda_handler(event, context):
    if is_spot_provisioning_timeout_event(event):
        print(
            "Received spot provisioning timeout event for instanceFleet, clusterId: "
            + event["detail"]["clusterId"]
        )
 
        describeClusterResponse = describe_cluster(event)
 
        shouldTerminateCluster = is_cluster_eligible_for_termination(
            event, describeClusterResponse
        )
        if shouldTerminateCluster:
            terminate_cluster(event)
 
            clusterId = create_cluster(event)
            print("Created a new cluster, clusterId: " + clusterId)
        else:
            print(
                "Cluster is not eligible for termination, clusterId: "
                + event["detail"]["clusterId"]
            )
 
    else:
        print("Received event is not spot provisioning timeout event, skipping")
```