

# View and monitor an Amazon EMR cluster as it performs work
<a name="emr-manage-view"></a>

Amazon EMR provides several tools you can use to gather information about your cluster. You can access information about the cluster from the console, from the AWS CLI, or programmatically. The standard Hadoop web interfaces and log files are available on the primary node. You can also use monitoring services such as CloudWatch and Ganglia to track the performance of your cluster. 

Application history is also available from the console through the persistent application UIs for the Spark History Server, starting with Amazon EMR 5.25.0. With Amazon EMR 6.x, persistent YARN timeline server and Tez user interfaces are also available. These services are hosted off-cluster, so you can access application history for 30 days after the cluster terminates, without the need for an SSH connection or web proxy. See [View application history](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-cluster-application-history.html).

**Topics**
+ [View Amazon EMR cluster status and details](emr-manage-view-clusters.md)
+ [Enhanced step debugging with Amazon EMR](emr-enhanced-step-debugging.md)
+ [View Amazon EMR application history](emr-cluster-application-history.md)
+ [View Amazon EMR log files](emr-manage-view-web-log-files.md)
+ [View cluster instances in Amazon EC2](UsingEMR_Tagging.md)
+ [CloudWatch events and metrics from Amazon EMR](emr-manage-cluster-cloudwatch.md)
+ [View cluster application metrics using Ganglia with Amazon EMR](ViewingGangliaMetrics.md)
+ [Logging Amazon EMR API calls using AWS CloudTrail](logging-using-cloudtrail.md)
+ [EMR Observability Best Practices](emr-metrics-observability.md)

# View Amazon EMR cluster status and details
<a name="emr-manage-view-clusters"></a>

After you create a cluster, you can monitor its status and get detailed information about its execution and any errors that occur, even after the cluster has terminated. Amazon EMR saves metadata about terminated clusters for your reference for two months, after which the metadata is deleted. You can't delete clusters from the cluster history, but you can use the **Filter** in the AWS Management Console, or options with the `list-clusters` command in the AWS CLI, to focus on the clusters that you care about.

You can access application history stored on-cluster for one week from the time it is recorded, regardless of whether the cluster is running or terminated. In addition, persistent application user interfaces store application history off-cluster for 30 days after a cluster terminates. See [View application history](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-cluster-application-history.html).

For more information about cluster states, such as Waiting and Running, see [Understanding the cluster lifecycle](emr-overview.md#emr-overview-cluster-lifecycle).

## View cluster details using the AWS Management Console
<a name="emr-view-cluster-console"></a>

The **Clusters** list in the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr) shows all the clusters in your account and AWS Region, including terminated clusters. The list shows the following for each cluster: the **Name** and **ID**, the **Status** and **Status details**, the **Creation time**, the **Elapsed time** that the cluster was running, and the **Normalized instance hours** accrued for all EC2 instances in the cluster. This list is the starting point for monitoring the status of your clusters, and it's designed so that you can drill down into each cluster's details for analysis and troubleshooting.

------
#### [ Console ]

**To view cluster information with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and select the cluster that you want to view.

1. Use the **Summary** panel to view the basics of your cluster configuration, such as cluster status, the open-source applications that Amazon EMR installed on the cluster, and the version of Amazon EMR that you used to create the cluster. Use each tab below the Summary to view information as described in the following table.

------

## View cluster details using the AWS CLI
<a name="view-cluser-cli"></a>

The following examples demonstrate how to retrieve cluster details using the AWS CLI. For more information about available commands, see the [AWS CLI Command Reference for Amazon EMR](https://docs.aws.amazon.com/cli/latest/reference/emr). You can use the [describe-cluster](https://docs.aws.amazon.com/cli/latest/reference/emr/describe-cluster.html) command to view cluster-level details including status, hardware and software configuration, VPC settings, bootstrap actions, instance groups, and so on. For more information about cluster states, see [Understanding the cluster lifecycle](emr-overview.md#emr-overview-cluster-lifecycle). The following example demonstrates using the `describe-cluster` command, followed by examples of the [list-clusters](https://docs.aws.amazon.com/cli/latest/reference/emr/list-clusters.html) command.

**Example Viewing cluster status**  
To use the `describe-cluster` command, you need the cluster ID. This example demonstrates using `list-clusters` to get a list of clusters created within a certain date range, and then using one of the returned cluster IDs to view more information about an individual cluster's status.  
The following command describes cluster *j-1K48XXXXXXHCB*, which you replace with your cluster ID.  

```
aws emr describe-cluster --cluster-id j-1K48XXXXXXHCB
```
The output of your command is similar to the following:  

```
{
    "Cluster": {
        "Status": {
            "Timeline": {
                "ReadyDateTime": 1438281058.061, 
                "CreationDateTime": 1438280702.498
            }, 
            "State": "WAITING", 
            "StateChangeReason": {
                "Message": "Waiting for steps to run"
            }
        }, 
        "Ec2InstanceAttributes": {
            "EmrManagedMasterSecurityGroup": "sg-cXXXXX0", 
            "IamInstanceProfile": "EMR_EC2_DefaultRole", 
            "Ec2KeyName": "myKey", 
            "Ec2AvailabilityZone": "us-east-1c", 
            "EmrManagedSlaveSecurityGroup": "sg-example"
        }, 
        "Name": "Development Cluster", 
        "ServiceRole": "EMR_DefaultRole", 
        "Tags": [], 
        "TerminationProtected": false, 
        "ReleaseLabel": "emr-4.0.0", 
        "NormalizedInstanceHours": 16, 
        "InstanceGroups": [
            {
                "RequestedInstanceCount": 1, 
                "Status": {
                    "Timeline": {
                        "ReadyDateTime": 1438281058.101, 
                        "CreationDateTime": 1438280702.499
                    }, 
                    "State": "RUNNING", 
                    "StateChangeReason": {
                        "Message": ""
                    }
                }, 
                "Name": "CORE", 
                "InstanceGroupType": "CORE", 
                "Id": "ig-2EEXAMPLEXXP", 
                "Configurations": [], 
                "InstanceType": "m5.xlarge", 
                "Market": "ON_DEMAND", 
                "RunningInstanceCount": 1
            }, 
            {
                "RequestedInstanceCount": 1, 
                "Status": {
                    "Timeline": {
                        "ReadyDateTime": 1438281023.879, 
                        "CreationDateTime": 1438280702.499
                    }, 
                    "State": "RUNNING", 
                    "StateChangeReason": {
                        "Message": ""
                    }
                }, 
                "Name": "MASTER", 
                "InstanceGroupType": "MASTER", 
                "Id": "ig-2A1234567XP", 
                "Configurations": [], 
                "InstanceType": "m5.xlarge", 
                "Market": "ON_DEMAND", 
                "RunningInstanceCount": 1
            }
        ], 
        "Applications": [
            {
                "Version": "1.0.0", 
                "Name": "Hive"
            }, 
            {
                "Version": "2.6.0", 
                "Name": "Hadoop"
            }, 
            {
                "Version": "0.14.0", 
                "Name": "Pig"
            }, 
            {
                "Version": "1.4.1", 
                "Name": "Spark"
            }
        ], 
        "BootstrapActions": [], 
        "MasterPublicDnsName": "ec2-X-X-X-X.compute-1.amazonaws.com", 
        "AutoTerminate": false, 
        "Id": "j-jobFlowID", 
        "Configurations": [
            {
                "Properties": {
                    "hadoop.security.groups.cache.secs": "250"
                }, 
                "Classification": "core-site"
            }, 
            {
                "Properties": {
                    "mapreduce.tasktracker.reduce.tasks.maximum": "5", 
                    "mapred.tasktracker.map.tasks.maximum": "2", 
                    "mapreduce.map.sort.spill.percent": "90"
                }, 
                "Classification": "mapred-site"
            }, 
            {
                "Properties": {
                    "hive.join.emit.interval": "1000", 
                    "hive.merge.mapfiles": "true"
                }, 
                "Classification": "hive-site"
            }
        ]
    }
}
```
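The `Timeline` values in this output are Unix epoch timestamps in seconds. If you want to work with the output programmatically rather than read it by eye, a short script can pull out the fields you care about. The following sketch (the `summarize_cluster` helper and the trimmed sample JSON are illustrative, not part of the EMR API) extracts the cluster state and how long provisioning took:

```
import json

def summarize_cluster(describe_cluster_output: str) -> dict:
    """Summarize `aws emr describe-cluster` JSON output."""
    cluster = json.loads(describe_cluster_output)["Cluster"]
    status = cluster["Status"]
    timeline = status["Timeline"]
    # Timeline values are Unix epoch seconds (floats).
    created = timeline["CreationDateTime"]
    ready = timeline.get("ReadyDateTime")  # absent until the cluster is ready
    return {
        "Name": cluster["Name"],
        "State": status["State"],
        "ProvisioningSeconds": round(ready - created, 1) if ready else None,
    }

# Trimmed version of the sample output shown above:
sample = json.dumps({"Cluster": {
    "Name": "Development Cluster",
    "Status": {"State": "WAITING",
               "Timeline": {"CreationDateTime": 1438280702.498,
                            "ReadyDateTime": 1438281058.061}}}})
print(summarize_cluster(sample))
```

You could pipe the real command output into a script like this, for example with `aws emr describe-cluster --cluster-id j-1K48XXXXXXHCB | python summarize.py`.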

**Example Listing clusters by creation date**  
To retrieve clusters created within a specific date range, use the `list-clusters` command with the `--created-after` and `--created-before` parameters.  
The following command lists all clusters created between October 09, 2019 and October 12, 2019.  

```
aws emr list-clusters --created-after 2019-10-09T00:12:00 --created-before 2019-10-12T00:12:00
```

**Example Listing clusters by state**  
To list clusters by state, use the `list-clusters` command with the `--cluster-states` parameter. Valid cluster states include: STARTING, BOOTSTRAPPING, RUNNING, WAITING, TERMINATING, TERMINATED, and TERMINATED\_WITH\_ERRORS.   

```
aws emr list-clusters --cluster-states TERMINATED
```
You can also use the following shortcut parameters to list all clusters in the specified states:  
+ `--active` filters clusters in the STARTING, BOOTSTRAPPING, RUNNING, WAITING, or TERMINATING states.
+ `--terminated` filters clusters in the TERMINATED state.
+ `--failed` filters clusters in the TERMINATED\_WITH\_ERRORS state.
The following commands return the same result.  

```
aws emr list-clusters --cluster-states TERMINATED
```

```
aws emr list-clusters --terminated
```
For more information about cluster states, see [Understanding the cluster lifecycle](emr-overview.md#emr-overview-cluster-lifecycle).
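The state filters above can also be applied after the fact, to JSON you have already retrieved with `list-clusters`. The following sketch (the `clusters_in_states` helper and the sample entries are hypothetical) filters a `Clusters` list by state:

```
import json

def clusters_in_states(list_clusters_output: str, states: set) -> list:
    """Return (Id, Name) pairs for clusters whose State is in `states`."""
    clusters = json.loads(list_clusters_output)["Clusters"]
    return [(c["Id"], c["Name"])
            for c in clusters if c["Status"]["State"] in states]

# Hypothetical, trimmed output from `aws emr list-clusters`:
sample = json.dumps({"Clusters": [
    {"Id": "j-1K48XXXXXXHCB", "Name": "Development Cluster",
     "Status": {"State": "WAITING"}},
    {"Id": "j-2AXXXXXXGAPLF", "Name": "Nightly ETL",
     "Status": {"State": "TERMINATED"}},
]})
print(clusters_in_states(sample, {"TERMINATED", "TERMINATED_WITH_ERRORS"}))
```

For simple projections like this you can also stay within the CLI by using the global `--query` option, for example `aws emr list-clusters --terminated --query "Clusters[].[Id,Name]"`.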

# Enhanced step debugging with Amazon EMR
<a name="emr-enhanced-step-debugging"></a>

If an Amazon EMR step fails and you submitted your work using the Step API operation with an AMI of version 5.x or later, Amazon EMR can in some cases identify and return the root cause of the step failure through the API, along with the name of the relevant log file and a portion of the application stack trace. For example, the following failures can be identified: 
+ A common Hadoop error such as the output directory already exists, the input directory does not exist, or an application runs out of memory.
+ Java errors such as an application that was compiled with an incompatible version of Java or run with a main class that is not found.
+ An issue accessing objects stored in Amazon S3.

This information is available using the [DescribeStep](https://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_DescribeStep.html) and [ListSteps](https://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_ListSteps.html) API operations, in the [FailureDetails](https://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_FailureDetails.html) field of the [StepSummary](https://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_StepSummary.html) that those operations return. To access the FailureDetails information, use the AWS CLI, console, or AWS SDK.

------
#### [ Console ]

The new Amazon EMR console doesn't offer step debugging. However, you can view cluster termination details with the following steps.

**To view failure details with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then select the cluster that you want to view.

1. Note the **Status** value in the **Summary** section of the cluster details page. If the status is **Terminated with errors**, hover over the text to view cluster failure details.

------
#### [ CLI ]

**To view failure details with the AWS CLI**
+ To get failure details for a step with the AWS CLI, use the `describe-step` command.

  ```
  aws emr describe-step --cluster-id j-1K48XXXXXHCB --step-id s-3QM0XXXXXM1W
  ```

  The output will look similar to the following:

  ```
  {
    "Step": {
      "Status": {
        "FailureDetails": {
          "LogFile": "s3://amzn-s3-demo-bucket/logs/j-1K48XXXXXHCB/steps/s-3QM0XXXXXM1W/stderr.gz",
          "Message": "org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3://amzn-s3-demo-bucket/logs/beta already exists",
          "Reason": "Output directory already exists."
        },
        "Timeline": {
          "EndDateTime": 1469034209.143,
          "CreationDateTime": 1469033847.105,
          "StartDateTime": 1469034202.881
        },
        "State": "FAILED",
        "StateChangeReason": {}
      },
      "Config": {
        "Args": [
          "wordcount",
          "s3://amzn-s3-demo-bucket/input/input.txt",
          "s3://amzn-s3-demo-bucket/logs/beta"
        ],
        "Jar": "s3://amzn-s3-demo-bucket/jars/hadoop-mapreduce-examples-2.7.2-amzn-1.jar",
        "Properties": {}
      },
      "Id": "s-3QM0XXXXXM1W",
      "ActionOnFailure": "CONTINUE",
      "Name": "ExampleJob"
    }
  }
  ```

------
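When you automate step monitoring, the `FailureDetails` block is the part of this output worth surfacing. The following sketch (the `failure_summary` helper and trimmed sample are hypothetical, but the field names match the output above) builds a one-line triage message from `describe-step` JSON:

```
import json

def failure_summary(describe_step_output: str) -> str:
    """Build a one-line triage message from `aws emr describe-step` JSON."""
    step = json.loads(describe_step_output)["Step"]
    status = step["Status"]
    if status["State"] != "FAILED":
        return f"{step['Name']}: {status['State']}"
    # FailureDetails is only present when EMR could identify a root cause.
    details = status.get("FailureDetails", {})
    reason = details.get("Reason", "unknown reason")
    log = details.get("LogFile", "no log file reported")
    return f"{step['Name']} FAILED: {reason} (see {log})"

# Hypothetical, trimmed version of the output shown above:
sample = json.dumps({"Step": {
    "Name": "ExampleJob",
    "Status": {"State": "FAILED",
               "FailureDetails": {
                   "Reason": "Output directory already exists.",
                   "LogFile": "s3://amzn-s3-demo-bucket/logs/stderr.gz"}}}})
print(failure_summary(sample))
```

Because `FailureDetails` is optional, the sketch falls back to a generic message when Amazon EMR couldn't determine a root cause.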

# View Amazon EMR application history
<a name="emr-cluster-application-history"></a>

You can view Spark History Server and YARN timeline service application details on the cluster's detail page in the console. Amazon EMR application history makes it easier for you to troubleshoot and analyze active jobs and job history. 

**Note**  
To augment the security for the off-console applications that you might use with Amazon EMR, the application hosting domains are registered in the Public Suffix List (PSL). Examples of these hosting domains include the following: `emrstudio-prod.us-east-1.amazonaws.com`, `emrnotebooks-prod.us-east-1.amazonaws.com`, `emrappui-prod.us-east-1.amazonaws.com`. For further security, if you ever need to set sensitive cookies in the default domain name, we recommend that you use cookies with a `__Host-` prefix. This helps to defend your domain against cross-site request forgery (CSRF) attempts. For more information, see the [https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#cookie_prefixes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#cookie_prefixes) page in the *Mozilla Developer Network*. 

The **Application user interfaces** section of the **Applications** tab provides several viewing options, depending on the cluster status and the applications you installed on the cluster.
+ [Off-cluster access to persistent application user interfaces](https://docs.aws.amazon.com/emr/latest/ManagementGuide/app-history-spark-UI.html) – Starting with Amazon EMR version 5.25.0, persistent application user interface links are available for the Spark UI and Spark History Server. With Amazon EMR version 5.30.1 and later, the Tez UI and the YARN timeline server also have persistent application user interfaces. The YARN timeline server and Tez UI are open-source applications that provide metrics for active and terminated clusters. The Spark user interface provides details about scheduler stages and tasks, RDD sizes and memory usage, environmental information, and the running executors. Persistent application UIs run off-cluster, so cluster information and logs are available for 30 days after an application terminates. Unlike on-cluster application user interfaces, persistent application UIs don't require you to set up a web proxy through an SSH connection.
+ [On-cluster application user interfaces](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html) – A variety of application history user interfaces can run on a cluster. On-cluster user interfaces are hosted on the master node and require you to set up an SSH connection to the web server. On-cluster application user interfaces keep application history for one week after an application terminates. For more information and instructions on setting up an SSH tunnel, see [View web interfaces hosted on Amazon EMR clusters](emr-web-interfaces.md).

  With the exception of the Spark History Server, YARN timeline server, and Hive applications, on-cluster application history can only be viewed while the cluster is running.

# View persistent application user interfaces in Amazon EMR
<a name="app-history-spark-UI"></a>

Starting with Amazon EMR version 5.25.0, you can connect to the persistent Spark History Server application details hosted off-cluster using the cluster **Summary** page or the **Application user interfaces** tab in the console. Tez UI and YARN timeline server persistent application interfaces are available starting with Amazon EMR version 5.30.1. One-click link access to persistent application history provides the following benefits: 
+ You can quickly analyze and troubleshoot active jobs and job history without setting up a web proxy through an SSH connection.
+ You can access application history and relevant log files for active and terminated clusters. The logs are available for 30 days after the application ends. 

Navigate to your cluster details in the console and select the **Applications** tab. After your cluster has launched, select the application UI that you want. The application UI opens in a new browser tab. For more information, see [Monitoring and instrumentation](https://spark.apache.org/docs/latest/monitoring.html).

You can view YARN container logs through the links on the Spark history server, YARN timeline server, and Tez UI. 

**Note**  
To access YARN container logs from the Spark history server, YARN timeline server, and Tez UI, you must enable logging to Amazon S3 for your cluster. If you don't enable logging, the links to YARN container logs won't work. 

## Logs collection
<a name="app-history-spark-UI-event-logs"></a>

To enable one-click access to persistent application user interfaces, Amazon EMR collects two types of logs: 
+ **Application event logs** are collected into an EMR system bucket. The event logs are encrypted at rest using Server-Side Encryption with Amazon S3 Managed Keys (SSE-S3). If you use a private subnet for your cluster, make sure to include the correct system bucket ARNs in the resource list of the Amazon S3 policy for the private subnet. For more information, see [Minimum Amazon S3 policy for private subnet](https://docs.aws.amazon.com/emr/latest/ManagementGuide/private-subnet-iampolicy.html).
+ **YARN container logs** are collected into an Amazon S3 bucket that you own. You must enable logging for your cluster to access YARN container logs. For more information, see [Configure cluster logging and debugging](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging.html).

If you need to disable this feature for privacy reasons, you can stop the daemon by using a bootstrap script when you create a cluster, as the following example demonstrates.

```
aws emr create-cluster --name "Stop Application UI Support" --release-label emr-7.12.0 \
--applications Name=Hadoop Name=Spark --ec2-attributes KeyName=<myEMRKeyPairName> \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=TASK,InstanceCount=1,InstanceType=m3.xlarge \
--use-default-roles --bootstrap-actions Path=s3://region.elasticmapreduce/bootstrap-actions/run-if,Args=["instance.isMaster=true","echo Stop Application UI | sudo tee /etc/apppusher/run-apppusher; sudo systemctl stop apppusher || exit 0"]
```

After you run this bootstrap script, Amazon EMR will not collect any Spark History Server or YARN timeline server event logs into the EMR system bucket. No application history information will be available on the **Application user interfaces** tab, and you will lose access to all application user interfaces from the console.

## Large Spark event log files
<a name="app-history-spark-UI-large-event-logs"></a>

In some cases, long-running Spark jobs, such as Spark streaming jobs, and large jobs, such as Spark SQL queries, can generate large event logs. Large event logs can quickly use up disk space on compute instances and can cause `OutOfMemory` errors when you load persistent UIs. To avoid these issues, we recommend that you turn on the Spark event log rolling and compaction feature, which is available on Amazon EMR versions emr-6.1.0 and later. For more details about rolling and compaction, see [Applying compaction on rolling event log files](https://spark.apache.org/docs/latest/monitoring.html#applying-compaction-on-rolling-event-log-files) in the Spark documentation.

To activate the Spark event log rolling and compaction feature, turn on the following Spark configuration settings.
+ `spark.eventLog.rolling.enabled` – Turns on event log rolling based on size. This setting is deactivated by default.
+ `spark.eventLog.rolling.maxFileSize` – When rolling is activated, specifies the maximum size of the event log file before it rolls over. The default is 128 MB.
+ `spark.history.fs.eventLog.rolling.maxFilesToRetain` – Specifies the maximum number of non-compacted event log files to retain. By default, all event log files are retained. Set to a lower number to compact older event logs. The lowest value is 1.

Note that compaction attempts to exclude events that point to outdated data, such as the following. If compaction discards events, you no longer see them in the Spark History Server UI.
+ Events for finished jobs, and related stage and task events.
+ Events for terminated executors.
+ Events for completed SQL executions, and related job, stage, and task events.

**To launch a cluster with rolling and compaction enabled**

1. Create a `spark-configuration.json` file with the following configuration.

   ```
   [
      {
        "Classification": "spark-defaults",
           "Properties": {
              "spark.eventLog.rolling.enabled": "true",
              "spark.history.fs.eventLog.rolling.maxFilesToRetain": "1"
           }
      }
   ]
   ```

1. Create your cluster with the Spark rolling compaction configuration as follows.

   ```
   aws emr create-cluster \
   --release-label emr-6.6.0 \
   --instance-type m4.large \
   --instance-count 2 \
   --use-default-roles \
   --configurations file://spark-configuration.json
   ```
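One pitfall with configuration files like this: the EMR `Configurations` API expects every property value to be a JSON string (for example `"true"` rather than a bare boolean `true`). A small pre-flight check, sketched below with a hypothetical `non_string_properties` helper, can catch type mistakes before you call `create-cluster`:

```
import json

def non_string_properties(config_json: str) -> list:
    """Return (classification, key) pairs whose values are not strings."""
    bad = []
    for entry in json.loads(config_json):
        for key, value in entry.get("Properties", {}).items():
            if not isinstance(value, str):
                bad.append((entry["Classification"], key))
    return bad

# String values pass the check; a bare JSON boolean is flagged.
good = ('[{"Classification": "spark-defaults", '
        '"Properties": {"spark.eventLog.rolling.enabled": "true"}}]')
flawed = ('[{"Classification": "spark-defaults", '
          '"Properties": {"spark.eventLog.rolling.enabled": true}}]')
print(non_string_properties(good))
print(non_string_properties(flawed))
```

Running this over `spark-configuration.json` before cluster creation gives an empty list when every value is a string, and names the offending keys otherwise.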

## Permissions for viewing persistent application user interfaces
<a name="app-history-spark-UI-permissions"></a>

The following sample shows the role permissions required to access persistent application user interfaces. For clusters with runtime roles enabled, this policy only allows users to access applications submitted with the same user identity and runtime role.

------
#### [ JSON ]

****  

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:CreatePersistentAppUI",
        "elasticmapreduce:DescribePersistentAppUI"
      ],
      "Resource": [
        "arn:aws:elasticmapreduce:*:123456789012:cluster/clusterId"
      ],
      "Sid": "AllowELASTICMAPREDUCECreatepersistentappui"
    },
    {
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:GetPersistentAppUIPresignedURL"
      ],
      "Resource": [
        "arn:aws:elasticmapreduce:*:123456789012:cluster/clusterId",
        "arn:aws:elasticmapreduce:*:123456789012:persistent-app-ui/*"
      ],
      "Condition": {
        "StringEqualsIfExists": {
          "elasticmapreduce:ExecutionRoleArn": [
            "arn:aws:iam::123456789012:role/executionRoleArn"
          ]
        }
      },
      "Sid": "AllowELASTICMAPREDUCEGetpersistentappuipresignedurl"
    }
  ]
}
```

------

The following sample shows the role permissions required to remove the restrictions on viewing applications in the persistent application user interfaces for clusters with runtime roles enabled.

------
#### [ JSON ]

****  

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:CreatePersistentAppUI",
        "elasticmapreduce:DescribePersistentAppUI",
        "elasticmapreduce:AccessAllEventLogs"
      ],
      "Resource": [
        "arn:aws:elasticmapreduce:us-east-1:123456789012:cluster/j-XXXXXXXXXXXXX"
      ],
      "Sid": "AllowELASTICMAPREDUCECreatepersistentappui"
    },
    {
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:GetPersistentAppUIPresignedURL"
      ],
      "Resource": [
        "arn:aws:elasticmapreduce:us-east-1:123456789012:cluster/j-XXXXXXXXXXXXX",
        "arn:aws:elasticmapreduce:us-east-1:123456789012:persistent-app-ui/*"
      ],
      "Condition": {
        "StringEqualsIfExists": {
          "elasticmapreduce:ExecutionRoleArn": [
            "arn:aws:iam::123456789012:role/YourExecutionRoleName"
          ]
        }
      },
      "Sid": "AllowELASTICMAPREDUCEGetpersistentappuipresignedurl"
    }
  ]
}
```

------

## Considerations and limitations
<a name="app-history-spark-UI-limitations"></a>

One-click access to persistent application user interfaces currently has the following limitations.
+ There is a delay of at least two minutes before application details show up on the Spark History Server UI.
+ This feature works only when the event log directory for the application is in HDFS. By default, Amazon EMR stores event logs in a directory in HDFS. If you change the default directory to a different file system, such as Amazon S3, this feature doesn't work. 
+ This feature is currently not available for EMR clusters with multiple master nodes or for EMR clusters integrated with AWS Lake Formation. 
+ To enable one-click access to persistent application user interfaces, you must have permissions for the `CreatePersistentAppUI`, `DescribePersistentAppUI`, and `GetPersistentAppUIPresignedURL` actions for Amazon EMR. If you deny an IAM principal's permission to these actions, it takes approximately five minutes for the permission change to propagate.
+ If a cluster has runtime roles enabled, users who access the Spark History Server from the persistent application UI can only view Spark jobs that were submitted with a runtime role.
+ If a cluster has runtime roles enabled, each user can only access applications submitted with the same user identity and runtime role.
+ The `AccessAllEventLogs` action for Amazon EMR is necessary to view all applications in persistent application user interfaces for clusters with runtime roles enabled.
+ If you reconfigure applications in a running cluster, the application history is not available through the application UI. 
+ For each AWS account, the default limit for active application UIs is 200.
+ In the following AWS Regions, you can access application UIs from the console with Amazon EMR 6.14.0 and higher: 
  + Asia Pacific (Jakarta) (ap-southeast-3)
  + Europe (Spain) (eu-south-2)
  + Asia Pacific (Melbourne) (ap-southeast-4)
  + Israel (Tel Aviv) (il-central-1)
  + Middle East (UAE) (me-central-1)
+ In the following AWS Regions, you can access application UIs from the console with Amazon EMR 5.25.0 and higher: 
  + US East (N. Virginia) (us-east-1)
  + US West (Oregon) (us-west-2)
  + Asia Pacific (Mumbai) (ap-south-1)
  + Asia Pacific (Seoul) (ap-northeast-2)
  + Asia Pacific (Singapore) (ap-southeast-1)
  + Asia Pacific (Sydney) (ap-southeast-2)
  + Asia Pacific (Tokyo) (ap-northeast-1)
  + Canada (Central) (ca-central-1)
  + South America (São Paulo) (sa-east-1)
  + Europe (Frankfurt) (eu-central-1)
  + Europe (Ireland) (eu-west-1)
  + Europe (London) (eu-west-2)
  + Europe (Paris) (eu-west-3)
  + Europe (Stockholm) (eu-north-1)
  + China (Beijing) (cn-north-1)
  + China (Ningxia) (cn-northwest-1)

# View a high-level application history in Amazon EMR
<a name="app-history-summary"></a>

**Note**  
We recommend that you use the persistent application user interface for an improved user experience that retains application history for up to 30 days. The high-level application history described on this page isn't available in the new Amazon EMR console ([https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr)). For more information, see [View persistent application user interfaces in Amazon EMR](app-history-spark-UI.md).

With Amazon EMR releases 5.8.0 to 5.36.0 and 6.x releases up to 6.8.0, you can view a high-level application history from the **Application user interfaces** tab in the old Amazon EMR console. An Amazon EMR **Application user interface** keeps the summary of application history for 7 days after an application has completed. 

## Considerations and limitations
<a name="app-history-limitations"></a>

Consider the following limitations when you use the **Application user interfaces** tab in the old Amazon EMR console.
+ You can only access the high-level application history feature when using Amazon EMR releases 5.8.0 to 5.36.0 and 6.x releases up to 6.8.0. Effective January 23, 2023, Amazon EMR will discontinue high-level application history for all versions. If you use Amazon EMR version 5.25.0 or higher, we recommend that you use the persistent application user interface instead.
+ The high-level application history feature does not support Spark Streaming applications.
+ One-click access to persistent application user interfaces is currently not available for Amazon EMR clusters with multiple master nodes or for Amazon EMR clusters integrated with AWS Lake Formation.

## Example: View a high-level application history
<a name="app-history-example"></a>

The following sequence demonstrates a drill-down through a Spark or YARN application into job details using the **Application user interfaces** tab on the cluster details page of the old console. 

To view cluster details, select a cluster **Name** from the **Clusters** list. To view information about YARN container logs, you must enable logging for your cluster. For more information, see [Configure cluster logging and debugging](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging.html). For Spark application history, the information provided in the summary table is only a subset of the information available through the Spark history server UI.

In the **Application user interfaces** tab under **High-level application history**, you can expand a row to show the diagnostic summary for a Spark application or select an **Application ID** link to view details about a different application.

![\[Application user interfaces tab showing persistent and on-cluster UIs, with YARN application history.\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/app-history-app.png)


When you select an **Application ID** link, the UI changes to show the **YARN application** details for that application. In the **Jobs** tab of **YARN application** details, you can choose the **Description** link for a job to display details for that job.

![\[YARN application details showing job history with completed Spark tasks and their statuses.\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/app-history-job-1.png)


On the job details page, you can expand information about individual job stages, and then select the **Description** link to see stage details.

![\[EMR cluster interface showing persistent and on-cluster application UIs, with job details and stages.\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/app-history-job-2.png)


On the stage details page, you can view key metrics for stage tasks and executors. You can also view task and executor logs using the **View logs** links.

![\[Application history page showing task metrics, executor details, and log access links for a Spark job.\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/app-history-job-3.png)


# View Amazon EMR log files
<a name="emr-manage-view-web-log-files"></a>

 Amazon EMR and Hadoop both produce log files that report status on the cluster. By default, these are written to the primary node in the `/mnt/var/log/` directory. Depending on how you configured your cluster when you launched it, these logs may also be archived to Amazon S3 and may be viewable through the graphical debugging tool. 

 There are many types of logs written to the primary node. Amazon EMR writes step, bootstrap action, and instance state logs. Apache Hadoop writes logs to report the processing of jobs, tasks, and task attempts. Hadoop also records logs of its daemons. For more information about the logs written by Hadoop, go to [http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html). 

## View log files on the primary node
<a name="emr-manage-view-web-log-files-master-node"></a>

The following table lists some of the log files you'll find on the primary node.


| Location | Description | 
| --- | --- | 
|  /emr/instance-controller/log/bootstrap-actions  | Logs written during the processing of the bootstrap actions. | 
|  /mnt/var/log/hadoop-state-pusher  | Logs written by the Hadoop state pusher process. | 
|  /emr/instance-controller/log  | Instance controller logs. | 
|  /emr/instance-state  | Instance state logs. These contain information about the CPU, memory state, and garbage collector threads of the node. | 
|  /emr/service-nanny  | Logs written by the service nanny process. | 
|  /mnt/var/log/*application*  | Logs specific to an application such as Hadoop, Spark, or Hive. | 
|  /mnt/var/log/hadoop/steps/*N*  | Step logs that contain information about the processing of the step. The value of *N* indicates the stepId assigned by Amazon EMR. For example, a cluster has two steps: `s-1234ABCDEFGH` and `s-5678IJKLMNOP`. The first step is located in `/mnt/var/log/hadoop/steps/s-1234ABCDEFGH/` and the second step in `/mnt/var/log/hadoop/steps/s-5678IJKLMNOP/`.  The step logs written by Amazon EMR are as follows.  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html)  | 

**To view log files on the primary node**

1.  Use SSH to connect to the primary node as described in [Connect to the Amazon EMR cluster primary node using SSH](emr-connect-master-node-ssh.md). 

1.  Navigate to the directory that contains the log file information you wish to view. The preceding table gives a list of the types of log files that are available and where you will find them. The following example shows the command for navigating to the step log with an ID, `s-1234ABCDEFGH`. 

   ```
   cd /mnt/var/log/hadoop/steps/s-1234ABCDEFGH/
   ```

1. Use a file viewer of your choice to view the log file. The following example uses the Linux `less` command to view the `controller` log file.

   ```
   less controller
   ```
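
When a step fails, it is often faster to search the step's whole log directory than to open each file. The following is a minimal sketch; the step ID is the example one from the table above, and `STEP_LOG_DIR` is a convenience variable introduced here for illustration, not an Amazon EMR setting.

```shell
# Directory containing the step's logs on the primary node.
# Substitute your own step ID for the example s-1234ABCDEFGH.
STEP_LOG_DIR="${STEP_LOG_DIR:-/mnt/var/log/hadoop/steps/s-1234ABCDEFGH}"

# Print the name of every log file that mentions "error", case-insensitive.
grep -ril "error" "$STEP_LOG_DIR" || echo "no errors found in $STEP_LOG_DIR"
```

You can run the same pattern against the application log directories under `/mnt/var/log/` to search application logs.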

## View log files archived to Amazon S3
<a name="emr-manage-view-web-log-files-s3"></a>

By default, Amazon EMR clusters launched using the console automatically archive log files to Amazon S3. You can specify your own log path, or you can allow the console to automatically generate a log path for you. For clusters launched using the CLI or API, you must configure Amazon S3 log archiving manually. 

 When Amazon EMR is configured to archive log files to Amazon S3, it stores the files in the S3 location you specified, in the /*cluster-id*/ folder, where *cluster-id* is the cluster ID. 

The following table lists some of the log files you'll find on Amazon S3.


| Location | Description | 
| --- | --- | 
|  /*cluster-id*/node/  | Node logs, including bootstrap action, instance state, and application logs for the node. The logs for each node are stored in a folder labeled with the identifier of the EC2 instance of that node. | 
|  /*cluster-id*/node/*instance-id*/*application*  | The logs created by each application or daemon associated with an application. For example, the Hive server log is located at `cluster-id/node/instance-id/hive/hive-server.log`. | 
|  /*cluster-id*/steps/*step-id*/  | Step logs that contain information about the processing of the step. The value of *step-id* indicates the step ID assigned by Amazon EMR. For example, a cluster has two steps: `s-1234ABCDEFGH` and `s-5678IJKLMNOP`. The logs for the first step are located in `/cluster-id/steps/s-1234ABCDEFGH/` and the logs for the second step in `/cluster-id/steps/s-5678IJKLMNOP/`.  The step logs written by Amazon EMR are as follows.  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html)  | 
|  /*cluster-id*/containers  |  Application container logs. The logs for each YARN application are stored in these locations.  | 
|  /*cluster-id*/hadoop-mapreduce/  | The logs that contain information about configuration details and job history of MapReduce jobs.  | 

**To view log files archived to Amazon S3 with the Amazon S3 console**

1. Sign in to the AWS Management Console and open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/).

1. Open the S3 bucket specified when you configured the cluster to archive log files in Amazon S3. 

1. Navigate to the log file containing the information to display. The preceding table gives a list of the types of log files that are available and where you will find them. 

1. Download the log file object to view it. For instructions, see [Downloading an object](https://docs.aws.amazon.com/AmazonS3/latest/userguide/download-objects.html).
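
If you prefer the AWS CLI, you can list and download the same archived logs directly. This is a sketch only: the bucket name `amzn-s3-demo-bucket`, the cluster ID, and the step ID below are placeholders -- substitute the log URI that you configured for your cluster.

```shell
# List the logs archived for one step of the example cluster.
aws s3 ls s3://amzn-s3-demo-bucket/j-XXXXXXXXXXXXX/steps/s-1234ABCDEFGH/

# Download all of that step's logs to a local directory for inspection.
aws s3 sync s3://amzn-s3-demo-bucket/j-XXXXXXXXXXXXX/steps/s-1234ABCDEFGH/ ./step-logs/
```

Archived log files are typically stored compressed, so you may need to decompress them (for example, with `gunzip`) before viewing.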

# View cluster instances in Amazon EC2
<a name="UsingEMR_Tagging"></a>

 To help you manage your resources, Amazon EC2 allows you to assign metadata to resources in the form of tags. Each Amazon EC2 tag consists of a key and a value. Tags allow you to categorize your Amazon EC2 resources in different ways: for example, by purpose, owner, or environment. 

 You can search and filter resources based on the tags. The tags that you assign to resources through your AWS account are available only to you. Other accounts that share the same resource can't view your tags. 

Amazon EMR automatically tags each EC2 instance that it launches with key-value pairs. The keys identify the cluster and the instance group to which the instance belongs. This makes it easy to filter your EC2 instances to show, for example, only those instances that belong to a particular cluster, or to show all of the currently running instances in the instance group for the task. This is especially useful if you run several clusters concurrently or manage large numbers of EC2 instances.

These are the predefined key-value pairs that Amazon EMR assigns:


| Key | Value | Value definition | 
| --- | --- | --- | 
| aws:elasticmapreduce:job-flow-id |  `job-flow-identifier`  | The ID of the cluster that the instance is provisioned for. It appears in the format `j-XXXXXXXXXXXXX` and can be up to 256 characters long. | 
| aws:elasticmapreduce:instance-group-role |  `group-role`  | The type of instance group, entered as one of the following values: `master`, `core`, or `task`. | 

 You can view and filter on the tags that Amazon EMR adds. For more information, see [Using tags](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html) in the *Amazon EC2 User Guide*. Because the tags set by Amazon EMR are system tags and cannot be edited or deleted, the sections on displaying and filtering tags are the most relevant. 
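
As a sketch of how you might use these tags with the AWS CLI (the cluster ID below is a placeholder), the following filters your EC2 instances down to the ones that belong to one cluster, and then to that cluster's task nodes:

```shell
# All EC2 instance IDs that belong to cluster j-XXXXXXXXXXXXX.
aws ec2 describe-instances \
    --filters "Name=tag:aws:elasticmapreduce:job-flow-id,Values=j-XXXXXXXXXXXXX" \
    --query "Reservations[].Instances[].InstanceId" --output text

# Only the task nodes of that cluster.
aws ec2 describe-instances \
    --filters "Name=tag:aws:elasticmapreduce:job-flow-id,Values=j-XXXXXXXXXXXXX" \
              "Name=tag:aws:elasticmapreduce:instance-group-role,Values=task" \
    --query "Reservations[].Instances[].InstanceId" --output text
```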

**Note**  
 Amazon EMR adds tags to the EC2 instance when its status updates to **Running**. If latency occurs between the time that the EC2 instance is provisioned and the time that its status is set to **Running**, the tags that Amazon EMR sets will appear once the instance starts. If you don't see the tags, wait for a few minutes and refresh the view.

# CloudWatch events and metrics from Amazon EMR
<a name="emr-manage-cluster-cloudwatch"></a>

Use events and metrics to track the activity and health of an Amazon EMR cluster. Events are useful for monitoring a specific occurrence within a cluster - for example, when a cluster changes state from starting to running. Metrics are useful to monitor a specific value - for example, the percentage of available disk space that HDFS is using within a cluster.

For more information about CloudWatch Events, see the [Amazon CloudWatch Events User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/). For more information about CloudWatch metrics, see [Using Amazon CloudWatch metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html) and [Creating Amazon CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) in the *Amazon CloudWatch User Guide*.

**Topics**
+ [Monitoring Amazon EMR metrics with CloudWatch](UsingEMR_ViewingMetrics.md)
+ [Monitoring Amazon EMR events with CloudWatch](emr-manage-cloudwatch-events.md)
+ [Responding to CloudWatch events from Amazon EMR](emr-events-response.md)

# Monitoring Amazon EMR metrics with CloudWatch
<a name="UsingEMR_ViewingMetrics"></a>

Amazon EMR automatically collects metrics for every cluster and pushes them to CloudWatch at five-minute intervals. This interval is not configurable. There is no charge for the Amazon EMR metrics reported in CloudWatch. These five-minute datapoints are archived for 63 days, after which the data is discarded. 

## How do I use Amazon EMR metrics?
<a name="UsingEMR_ViewingMetrics_HowDoI"></a>

The following table shows common uses for metrics reported by Amazon EMR. These are suggestions to get you started, not a comprehensive list. For a complete list of metrics reported by Amazon EMR, see [Metrics reported by Amazon EMR in CloudWatch](#UsingEMR_ViewingMetrics_MetricsReported). 



| How do I? | Relevant metrics | 
| --- | --- | 
| Track the progress of my cluster | Look at the RunningMapTasks, RemainingMapTasks, RunningReduceTasks, and RemainingReduceTasks metrics.  | 
| Detect clusters that are idle | The IsIdle metric tracks whether a cluster is live, but not currently running tasks. You can set an alarm to fire when the cluster has been idle for a given period of time, such as thirty minutes.  | 
| Detect when a node runs out of storage | The MRUnhealthyNodes metric tracks when one or more core or task nodes run out of local disk storage and transition to an UNHEALTHY YARN state. For example, core or task nodes are running low on disk space and will not be able to run tasks. | 
| Detect when a cluster runs out of storage | The HDFSUtilization metric monitors how much of the cluster's combined HDFS capacity is in use. High HDFS utilization can affect jobs and cluster health, and may require resizing the cluster to add more core nodes.  | 
| Detect when a cluster is running at reduced capacity | The MRLostNodes metric tracks when one or more core or task nodes are unable to communicate with the master node. For example, a core or task node is unreachable from the master node. | 

For more information, see [Amazon EMR cluster terminates with NO\_SLAVE\_LEFT and core nodes FAILED\_BY\_MASTER](emr-cluster-NO_SLAVE_LEFT-FAILED_BY_MASTER.md) and [AWSSupport-AnalyzeEMRLogs](https://docs.aws.amazon.com/systems-manager-automation-runbooks/latest/userguide/automation-awssupport-analyzeemrlogs.html). 
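
The idle-cluster check suggested above can be automated with a CloudWatch alarm. The following sketch fires when IsIdle has been 1 for six consecutive five-minute periods (thirty minutes); the cluster ID and the SNS topic ARN are placeholders.

```shell
aws cloudwatch put-metric-alarm \
    --alarm-name "emr-idle-cluster" \
    --namespace "AWS/ElasticMapReduce" \
    --metric-name IsIdle \
    --dimensions Name=JobFlowId,Value=j-XXXXXXXXXXXXX \
    --statistic Average \
    --period 300 \
    --evaluation-periods 6 \
    --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:my-emr-alerts
```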

## Access CloudWatch metrics for Amazon EMR
<a name="UsingEMR_ViewingMetrics_Access"></a>

You can view the metrics that Amazon EMR reports to CloudWatch using the Amazon EMR console or the CloudWatch console. You can also retrieve metrics using the CloudWatch CLI command [`mon-get-stats`](https://docs.aws.amazon.com/AmazonCloudWatch/latest/cli/cli-mon-get-stats.html) or the CloudWatch [`GetMetricStatistics`](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html) API. For more information about viewing or retrieving metrics for Amazon EMR using CloudWatch, see the [Amazon CloudWatch User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/).
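
For example, you could retrieve an hour of HDFSUtilization datapoints with the current AWS CLI equivalent of the `GetMetricStatistics` API. This is a sketch: the cluster ID is a placeholder, and you would replace the timestamps with your own window.

```shell
aws cloudwatch get-metric-statistics \
    --namespace "AWS/ElasticMapReduce" \
    --metric-name HDFSUtilization \
    --dimensions Name=JobFlowId,Value=j-XXXXXXXXXXXXX \
    --start-time 2023-01-01T00:00:00Z \
    --end-time 2023-01-01T01:00:00Z \
    --period 300 \
    --statistics Average
```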

------
#### [ Console ]

**To view metrics with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose the cluster that you want to view metrics for. This opens the cluster details page.

1. Select the **Monitoring** tab on the cluster details page. Choose any one of the **Cluster status**, **Node status**, or **Inputs and outputs** options to load the reports about the progress and health of the cluster. 

1. After you choose a metric to view, you can enlarge each graph. To filter the time frame of your graph, select a prefilled option or choose **Custom**.

------

## Metrics reported by Amazon EMR in CloudWatch
<a name="UsingEMR_ViewingMetrics_MetricsReported"></a>

The following tables list the metrics that Amazon EMR reports in the console and pushes to CloudWatch.

### Amazon EMR metrics
<a name="emr-metrics-reported"></a>

Amazon EMR sends data for several metrics to CloudWatch. All Amazon EMR clusters automatically send metrics at five-minute intervals. Metrics are archived for 63 days; after that period, the data is discarded. 

The `AWS/ElasticMapReduce` namespace includes the following metrics.

**Note**  
Amazon EMR pulls metrics from a cluster. If a cluster becomes unreachable, no metrics are reported until the cluster becomes available again.

The following metrics are available for clusters running Hadoop 2.x versions.


| Metric | Description | 
| --- | --- | 
| Cluster Status | 
| IsIdle  | Indicates that a cluster is no longer performing work, but is still alive and accruing charges. It is set to 1 if no tasks are running and no jobs are running, and set to 0 otherwise. This value is checked at five-minute intervals and a value of 1 indicates only that the cluster was idle when checked, not that it was idle for the entire five minutes. To avoid false positives, you should raise an alarm when this value has been 1 for more than one consecutive 5-minute check. For example, you might raise an alarm on this value if it has been 1 for thirty minutes or longer. Use case: Monitor cluster performance Units: *Boolean*  | 
| ContainerAllocated  | The number of resource containers allocated by the ResourceManager. Use case: Monitor cluster progress Units: *Count*  | 
| ContainerReserved  | The number of containers reserved. Use case: Monitor cluster progress Units: *Count*  | 
| ContainerPending  | The number of containers in the queue that have not yet been allocated. Use case: Monitor cluster progress Units: *Count*  | 
| ContainerPendingRatio  | The ratio of pending containers to containers allocated (ContainerPendingRatio = ContainerPending / ContainerAllocated). If ContainerAllocated = 0, then ContainerPendingRatio = ContainerPending. The value of ContainerPendingRatio represents a number, not a percentage. This value is useful for scaling cluster resources based on container allocation behavior. Units: *Count*  | 
| AppsCompleted  | The number of applications submitted to YARN that have completed. Use case: Monitor cluster progress Units: *Count*  | 
| AppsFailed  | The number of applications submitted to YARN that have failed to complete. Use case: Monitor cluster progress, Monitor cluster health Units: *Count*  | 
| AppsKilled  | The number of applications submitted to YARN that have been killed. Use case: Monitor cluster progress, Monitor cluster health Units: *Count*  | 
| AppsPending  | The number of applications submitted to YARN that are in a pending state. Use case: Monitor cluster progress Units: *Count*  | 
| AppsRunning  | The number of applications submitted to YARN that are running. Use case: Monitor cluster progress Units: *Count*  | 
| AppsSubmitted  | The number of applications submitted to YARN. Use case: Monitor cluster progress Units: *Count*  | 
| Node Status | 
| CoreNodesRunning  | The number of core nodes working. Data points for this metric are reported only when a corresponding instance group exists. Use case: Monitor cluster health Units: *Count*  | 
| CoreNodesPending  | The number of core nodes waiting to be assigned. All of the core nodes requested may not be immediately available; this metric reports the pending requests. Data points for this metric are reported only when a corresponding instance group exists. Use case: Monitor cluster health Units: *Count*  | 
| LiveDataNodes  | The percentage of data nodes that are receiving work from Hadoop. Use case: Monitor cluster health Units: *Percent*  | 
| MRTotalNodes  | The number of nodes presently available to MapReduce jobs. Equivalent to YARN metric `mapred.resourcemanager.TotalNodes`. Use case: Monitor cluster progress Units: *Count* Note: MRTotalNodes only counts currently active nodes in the system. YARN automatically removes terminated nodes from this count and stops tracking them, so they are not considered in the MRTotalNodes metric.  | 
| MRActiveNodes  | The number of nodes presently running MapReduce tasks or jobs. Equivalent to YARN metric `mapred.resourcemanager.NoOfActiveNodes`. Use case: Monitor cluster progress Units: *Count*  | 
| MRLostNodes  | The number of nodes allocated to MapReduce that have been marked in a LOST state. Equivalent to YARN metric `mapred.resourcemanager.NoOfLostNodes`. Use case: Monitor cluster health, Monitor cluster progress Units: *Count*  | 
| MRUnhealthyNodes  | The number of nodes available to MapReduce jobs marked in an UNHEALTHY state. Equivalent to YARN metric `mapred.resourcemanager.NoOfUnhealthyNodes`. Use case: Monitor cluster progress Units: *Count*  | 
| MRDecommissionedNodes  | The number of nodes allocated to MapReduce applications that have been marked in a DECOMMISSIONED state. Equivalent to YARN metric `mapred.resourcemanager.NoOfDecommissionedNodes`. Use case: Monitor cluster health, Monitor cluster progress Units: *Count*  | 
| MRRebootedNodes  | The number of nodes available to MapReduce that have been rebooted and marked in a REBOOTED state. Equivalent to YARN metric `mapred.resourcemanager.NoOfRebootedNodes`. Use case: Monitor cluster health, Monitor cluster progress Units: *Count*  | 
| MultiMasterInstanceGroupNodesRunning  | The number of running master nodes. Use case: Monitor master node failure and replacement Units: *Count*  | 
| MultiMasterInstanceGroupNodesRunningPercentage  | The percentage of master nodes that are running over the requested master node instance count.  Use case: Monitor master node failure and replacement Units: *Percent*  | 
| MultiMasterInstanceGroupNodesRequested  | The number of requested master nodes.  Use case: Monitor master node failure and replacement Units: *Count*  | 
| IO | 
| S3BytesWritten  | The number of bytes written to Amazon S3. This metric aggregates MapReduce jobs only, and does not apply for other workloads on Amazon EMR.  Use case: Analyze cluster performance, Monitor cluster progress Units: *Count*  | 
| S3BytesRead  | The number of bytes read from Amazon S3. This metric aggregates MapReduce jobs only, and does not apply for other workloads on Amazon EMR.  Use case: Analyze cluster performance, Monitor cluster progress Units: *Count*  | 
| HDFSUtilization  | The percentage of HDFS storage currently used. Use case: Analyze cluster performance Units: *Percent*  | 
| HDFSBytesRead  | The number of bytes read from HDFS. This metric aggregates MapReduce jobs only, and does not apply for other workloads on Amazon EMR. Use case: Analyze cluster performance, Monitor cluster progress Units: *Count*  | 
| HDFSBytesWritten  | The number of bytes written to HDFS. This metric aggregates MapReduce jobs only, and does not apply for other workloads on Amazon EMR. Use case: Analyze cluster performance, Monitor cluster progress Units: *Count*  | 
| MissingBlocks  | The number of blocks in which HDFS has no replicas. These might be corrupt blocks. Use case: Monitor cluster health Units: *Count*  | 
| CorruptBlocks  | The number of blocks that HDFS reports as corrupted. Use case: Monitor cluster health Units: *Count*  | 
| TotalLoad  | The total number of concurrent data transfers. Use case: Monitor cluster health Units: *Count*  | 
| MemoryTotalMB  | The total amount of memory in the cluster. Use case: Monitor cluster progress Units: *Count*  | 
| MemoryReservedMB  | The amount of memory reserved. Use case: Monitor cluster progress Units: *Count*  | 
| MemoryAvailableMB  | The amount of memory available to be allocated. Use case: Monitor cluster progress Units: *Count*  | 
| YARNMemoryAvailablePercentage  | The percentage of remaining memory available to YARN (YARNMemoryAvailablePercentage = MemoryAvailableMB / MemoryTotalMB). This value is useful for scaling cluster resources based on YARN memory usage. Units: *Percent*  | 
| MemoryAllocatedMB  | The amount of memory allocated to the cluster. Use case: Monitor cluster progress Units: *Count*  | 
| PendingDeletionBlocks  | The number of blocks marked for deletion. Use case: Monitor cluster progress, Monitor cluster health Units: *Count*  | 
| UnderReplicatedBlocks  | The number of blocks that need to be replicated one or more times. Use case: Monitor cluster progress, Monitor cluster health Units: *Count*  | 
| DfsPendingReplicationBlocks  | The status of block replication: blocks being replicated, age of replication requests, and unsuccessful replication requests. Use case: Monitor cluster progress, Monitor cluster health Units: *Count*  | 
| CapacityRemainingGB  | The amount of remaining HDFS disk capacity.  Use case: Monitor cluster progress, Monitor cluster health Units: *Count*  | 
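
The two derived metrics in the preceding table, ContainerPendingRatio and YARNMemoryAvailablePercentage, follow simple formulas that are easy to reproduce when you sanity-check a scaling rule or alarm threshold. A minimal sketch, using made-up sample values:

```shell
# Sample values -- substitute datapoints retrieved from CloudWatch.
container_pending=40
container_allocated=8
memory_available_mb=6144
memory_total_mb=24576

# ContainerPendingRatio = ContainerPending / ContainerAllocated,
# falling back to ContainerPending when nothing is allocated.
if [ "$container_allocated" -eq 0 ]; then
    pending_ratio=$container_pending
else
    pending_ratio=$(awk "BEGIN { print $container_pending / $container_allocated }")
fi

# YARNMemoryAvailablePercentage = MemoryAvailableMB / MemoryTotalMB, as a percent.
yarn_mem_pct=$(awk "BEGIN { print 100 * $memory_available_mb / $memory_total_mb }")

echo "ContainerPendingRatio=$pending_ratio YARNMemoryAvailablePercentage=$yarn_mem_pct"
```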

The following are Hadoop 1 metrics:


| Metric | Description | 
| --- | --- | 
| Cluster Status | 
| IsIdle  | Indicates that a cluster is no longer performing work, but is still alive and accruing charges. It is set to 1 if no tasks are running and no jobs are running, and set to 0 otherwise. This value is checked at five-minute intervals and a value of 1 indicates only that the cluster was idle when checked, not that it was idle for the entire five minutes. To avoid false positives, you should raise an alarm when this value has been 1 for more than one consecutive 5-minute check. For example, you might raise an alarm on this value if it has been 1 for thirty minutes or longer. Use case: Monitor cluster performance Units: *Boolean*  | 
| JobsRunning  | The number of jobs in the cluster that are currently running. Use case: Monitor cluster health Units: *Count*  | 
| JobsFailed  | The number of jobs in the cluster that have failed. Use case: Monitor cluster health Units: *Count*  | 
| Map/Reduce | 
| MapTasksRunning  | The number of running map tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs are generated. Use case: Monitor cluster progress Units: *Count*  | 
| MapTasksRemaining  | The number of remaining map tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs are generated. A remaining map task is one that is not in any of the following states: Running, Killed, or Completed. Use case: Monitor cluster progress Units: *Count*  | 
| MapSlotsOpen  | The unused map task capacity. This is calculated as the maximum number of map tasks for a given cluster, less the total number of map tasks currently running in that cluster. Use case: Analyze cluster performance Units: *Count*  | 
| RemainingMapTasksPerSlot  | The ratio of the total map tasks remaining to the total map slots available in the cluster. Use case: Analyze cluster performance Units: *Ratio*  | 
| ReduceTasksRunning  | The number of running reduce tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs are generated. Use case: Monitor cluster progress Units: *Count*  | 
| ReduceTasksRemaining  | The number of remaining reduce tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs are generated. Use case: Monitor cluster progress Units: *Count*  | 
| ReduceSlotsOpen  | Unused reduce task capacity. This is calculated as the maximum reduce task capacity for a given cluster, less the number of reduce tasks currently running in that cluster. Use case: Analyze cluster performance Units: *Count*  | 
| Node Status | 
| CoreNodesRunning  | The number of core nodes working. Data points for this metric are reported only when a corresponding instance group exists. Use case: Monitor cluster health Units: *Count*  | 
| CoreNodesPending  | The number of core nodes waiting to be assigned. All of the core nodes requested may not be immediately available; this metric reports the pending requests. Data points for this metric are reported only when a corresponding instance group exists. Use case: Monitor cluster health Units: *Count*  | 
| LiveDataNodes  | The percentage of data nodes that are receiving work from Hadoop. Use case: Monitor cluster health Units: *Percent*  | 
| TaskNodesRunning  | The number of task nodes working. Data points for this metric are reported only when a corresponding instance group exists. Use case: Monitor cluster health Units: *Count*  | 
| TaskNodesPending  | The number of task nodes waiting to be assigned. All of the task nodes requested may not be immediately available; this metric reports the pending requests. Data points for this metric are reported only when a corresponding instance group exists. Use case: Monitor cluster health Units: *Count*  | 
| LiveTaskTrackers  | The percentage of task trackers that are functional. Use case: Monitor cluster health Units: *Percent*  | 
| IO | 
| S3BytesWritten  | The number of bytes written to Amazon S3. This metric aggregates MapReduce jobs only, and does not apply for other workloads on Amazon EMR. Use case: Analyze cluster performance, Monitor cluster progress Units: *Count*  | 
| S3BytesRead  | The number of bytes read from Amazon S3. This metric aggregates MapReduce jobs only, and does not apply for other workloads on Amazon EMR. Use case: Analyze cluster performance, Monitor cluster progress Units: *Count*  | 
| HDFSUtilization  | The percentage of HDFS storage currently used. Use case: Analyze cluster performance Units: *Percent*  | 
| HDFSBytesRead  | The number of bytes read from HDFS. Use case: Analyze cluster performance, Monitor cluster progress Units: *Count*  | 
| HDFSBytesWritten  | The number of bytes written to HDFS. Use case: Analyze cluster performance, Monitor cluster progress Units: *Count*  | 
| MissingBlocks  | The number of blocks in which HDFS has no replicas. These might be corrupt blocks. Use case: Monitor cluster health Units: *Count*  | 
| TotalLoad  | The current, total number of readers and writers reported by all DataNodes in a cluster. Use case: Diagnose the degree to which high I/O might be contributing to poor job execution performance. Worker nodes running the DataNode daemon must also perform map and reduce tasks. Persistently high TotalLoad values over time can indicate that high I/O might be a contributing factor to poor performance. Occasional spikes in this value are typical and do not usually indicate a problem. Units: *Count*  | 

#### Cluster capacity metrics
<a name="emr-metrics-managed-scaling"></a>

The following metrics indicate the current or target capacities of a cluster. These metrics are only available when managed scaling or auto-termination is enabled. 

For clusters composed of instance fleets, the cluster capacity metrics are measured in `Units`. For clusters composed of instance groups, the cluster capacity metrics are measured in `Nodes` or `VCPU` based on the unit type used in the managed scaling policy. For more information, see [Using EMR-managed scaling](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-scaling.html) in the *Amazon EMR Management Guide*.


| Metric | Description | 
| --- | --- | 
| [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html) | The target total number of units/nodes/vCPUs in a cluster as determined by managed scaling. Units: *Count*  | 
| [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html)  | The current total number of units/nodes/vCPUs available in a running cluster. When a cluster resize is requested, this metric will be updated after the new instances are added or removed from the cluster. Units: *Count*  | 
| [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html)  | The target number of CORE units/nodes/vCPUs in a cluster as determined by managed scaling. Units: *Count*  | 
| [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html)  | The current number of CORE units/nodes/vCPUs running in a cluster. Units: *Count*  | 
| [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html)  | The target number of TASK units/nodes/vCPUs in a cluster as determined by managed scaling. Units: *Count*  | 
| [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html)  | The current number of TASK units/nodes/vCPUs running in a cluster. Units: *Count*  | 

Amazon EMR emits the following metrics at a one-minute granularity when you enable auto-termination using an auto-termination policy. Some metrics are only available for Amazon EMR versions 6.4.0 and later. To learn more about auto-termination, see [Using an auto-termination policy for Amazon EMR cluster cleanup](emr-auto-termination-policy.md).



| Metric | Description | 
| --- | --- | 
| TotalNotebookKernels | The total number of running and idle notebook kernels on the cluster. This metric is only available for Amazon EMR versions 6.4.0 and later. | 
| AutoTerminationIsClusterIdle | Indicates whether the cluster is in use. A value of **0** indicates that the cluster is in active use by one of the following components: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html). A value of **1** indicates that the cluster is idle. Amazon EMR checks for continuous cluster idleness (`AutoTerminationIsClusterIdle` = 1). When a cluster's idle time equals the `IdleTimeout` value in your auto-termination policy, Amazon EMR terminates the cluster.  | 
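
The idle-timeout check that drives auto-termination can be pictured with a short sketch. This is a hypothetical illustration, not an Amazon EMR API: it treats each one-minute `AutoTerminationIsClusterIdle` sample as 1 (idle) or 0 (in use) and terminates once the consecutive idle time reaches the policy's `IdleTimeout`.

```python
# Hypothetical sketch of the auto-termination decision; not an Amazon EMR API.
# idle_samples: one-minute AutoTerminationIsClusterIdle values (1 = idle, 0 = in use).
# idle_timeout_minutes: mirrors the IdleTimeout value in the auto-termination policy.
def should_terminate(idle_samples, idle_timeout_minutes):
    consecutive_idle = 0
    for sample in idle_samples:
        # Any activity resets the idle clock, so only an unbroken idle run counts.
        consecutive_idle = consecutive_idle + 1 if sample == 1 else 0
        if consecutive_idle >= idle_timeout_minutes:
            return True
    return False

print(should_terminate([1, 1, 0, 1, 1, 1], idle_timeout_minutes=3))  # True
print(should_terminate([1, 1, 0, 1, 1], idle_timeout_minutes=3))     # False
```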

### Dimensions for Amazon EMR metrics
<a name="emr-metrics-dimensions"></a>

Amazon EMR data can be filtered using any of the dimensions in the following table. 


| Dimension  | Description  | 
| --- | --- | 
| JobFlowId | The same as the cluster ID, which is the unique identifier of a cluster in the form j-XXXXXXXXXXXXX. Find this value by choosing the cluster in the Amazon EMR console.  | 
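
As an illustration, a CloudWatch query scoped to a single cluster through the `JobFlowId` dimension might be parameterized as in the following sketch. The cluster ID is a made-up example; in real use you would pass this dictionary to boto3's `cloudwatch.get_metric_statistics(**params)`.

```python
from datetime import datetime, timedelta, timezone

# Sketch: CloudWatch query parameters filtered to one cluster by JobFlowId.
# The cluster ID below is a made-up example value.
now = datetime.now(timezone.utc)
params = {
    "Namespace": "AWS/ElasticMapReduce",
    "MetricName": "IsIdle",
    "Dimensions": [{"Name": "JobFlowId", "Value": "j-1ABCDEF2GHIJK"}],
    "StartTime": now - timedelta(hours=1),   # last hour of data points
    "EndTime": now,
    "Period": 300,                           # five-minute buckets
    "Statistics": ["Average"],
}
print(params["Dimensions"])  # [{'Name': 'JobFlowId', 'Value': 'j-1ABCDEF2GHIJK'}]
```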

# Monitoring Amazon EMR events with CloudWatch
<a name="emr-manage-cloudwatch-events"></a>

Amazon EMR tracks events and keeps information about them for up to seven days in the Amazon EMR console. Amazon EMR records events when there is a change in the state of clusters, instance groups, instance fleets, automatic scaling policies, or steps. Events capture the date and time the event occurred, details about the affected elements, and other critical data points.

The following table lists Amazon EMR events, along with the state or state change that the event indicates, the severity of the event, event type, event code, and event messages. Amazon EMR represents events as JSON objects and automatically sends them to an event stream. The JSON object is important when you set up rules for event processing using CloudWatch Events because rules seek to match patterns in the JSON object. For more information, see [Events and event patterns](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/CloudWatchEventsandEventPatterns.html) and [Amazon EMR events](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#emr_event_type) in the *Amazon CloudWatch Events User Guide*.
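
The pattern-matching behavior those rules rely on can be sketched in a few lines. This simplified matcher is not the CloudWatch Events implementation, but it captures the core rule: for each key in the pattern, the event's value must appear in the pattern's list of allowed values, recursing into nested objects. The event shape below is abbreviated.

```python
import json

# Simplified sketch of CloudWatch Events pattern matching; not the real
# implementation. Each pattern key maps to a list of allowed values, or to a
# nested pattern object for nested event fields.
def pattern_matches(event, pattern):
    for key, allowed in pattern.items():
        if isinstance(allowed, dict):                    # nested field
            if not (isinstance(event.get(key), dict)
                    and pattern_matches(event[key], allowed)):
                return False
        elif event.get(key) not in allowed:              # list of allowed values
            return False
    return True

# Abbreviated Amazon EMR event and a rule that matches failed or cancelled steps.
event = json.loads('{"source": "aws.emr", '
                   '"detail-type": "EMR Step Status Change", '
                   '"detail": {"state": "FAILED"}}')
rule = {"source": ["aws.emr"],
        "detail-type": ["EMR Step Status Change"],
        "detail": {"state": ["FAILED", "CANCELLED"]}}
print(pattern_matches(event, rule))  # True
```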

**Note**  
Amazon EMR periodically emits events with the event code **EC2 provisioning - Insufficient Instance Capacity**. These events occur when your Amazon EMR cluster encounters an insufficient capacity error from Amazon EC2 for your instance fleet or instance group during a cluster creation or resize operation. An event might not include all the instance types and Availability Zones that you provided, because Amazon EMR only includes the instance types and Availability Zones in which it attempted to provision capacity since the last Insufficient Instance Capacity event was emitted. For information on how to respond to these events, see [Responding to Amazon EMR cluster insufficient instance capacity events](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-events-response-insuff-capacity.html).

## Cluster start events
<a name="emr-cloudwatch-cluster-events"></a>


| State or state change | Severity | Event type | Event code | Message | 
| --- | --- | --- | --- | --- | 
| CREATING | WARN | EMR instance fleet provisioning | EC2 provisioning - Insufficient Instance Capacity | We are not able to create your Amazon EMR cluster ClusterId (ClusterName) for Instance Fleet InstanceFleetID. Amazon EC2 has insufficient Spot capacity for Instance type [Instancetype1, Instancetype2] and insufficient On-Demand capacity for Instance type [Instancetype3, Instancetype4] in Availability Zone [AvailabilityZone1, AvailabilityZone2]. See the [documentation](emr-EC2_INSUFFICIENT_CAPACITY-error.md) for more information on how to respond to this event. | 
| CREATING | WARN | EMR instance group provisioning | EC2 provisioning - Insufficient Instance Capacity | We are not able to create your Amazon EMR cluster ClusterId (ClusterName) for Instance Group InstanceGroupID. Amazon EC2 has insufficient Spot capacity for Instance type [Instancetype1, Instancetype2] and insufficient On-Demand capacity for Instance type [Instancetype3, Instancetype4] in Availability Zone [AvailabilityZone1, AvailabilityZone2]. See the [documentation](emr-EC2_INSUFFICIENT_CAPACITY-error.md) for more information on how to respond to this event. | 
| CREATING | WARN | EMR instance fleet provisioning | EC2 provisioning - Insufficient Free Addresses In Subnet | We can’t create the Amazon EMR cluster ClusterId (ClusterName) that you requested for instance fleet InstanceFleetID because the specified subnet [Subnet1, Subnet2] doesn't contain enough free private IP addresses to fulfill your request. Use the DescribeSubnets operation to see how many IP addresses are available (unused) in your subnet. For information on how to respond to this event, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html) | 
| CREATING | WARN | EMR instance group provisioning | EC2 provisioning - Insufficient Free Addresses In Subnet | We can’t create the Amazon EMR cluster ClusterId (ClusterName) that you requested for instance group InstanceGroupID because the specified subnet [Subnet1, Subnet2] doesn't contain enough free private IP addresses to fulfill your request. Use the DescribeSubnets operation to see how many IP addresses are available (unused) in your subnet. For information on how to respond to this event, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html) | 
| CREATING  | WARN  | EMR instance fleet provisioning  | EC2 Provisioning – vCPU Limit Exceeded  | The provision of InstanceFleetID in the Amazon EMR cluster ClusterId (ClusterName) is delayed because you've reached the limit on the number of vCPUs (virtual processing units) assigned to the running instances in your account (accountId). For more information, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html).  | 
| CREATING  | WARN  | EMR instance group provisioning  | EC2 Provisioning – vCPU Limit Exceeded  | The provision of instance group InstanceGroupID in the Amazon EMR cluster ClusterId is delayed because you've reached the limit on the number of vCPUs (virtual processing units) assigned to the running instances in your account (accountId). For more information, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html).  | 
| CREATING  | WARN  | EMR instance fleet provisioning  | EC2 Provisioning – Spot Instance Count Limit Exceeded  | The provision of instance fleet InstanceFleetID in the Amazon EMR cluster ClusterID (ClusterName) is delayed because you've reached the limit on the number of Spot Instances that you can launch in your account (accountId). For more information, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html).  | 
| CREATING  | WARN  | EMR instance group provisioning  | EC2 Provisioning – Spot Instance Count Limit Exceeded  | The provision of instance group InstanceGroupID in the Amazon EMR cluster ClusterID (ClusterName) is delayed because you've reached the limit on the number of Spot Instances that you can launch in your account (accountId). For more information, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html).  | 
| CREATING  | WARN  | EMR instance fleet provisioning  | EC2 Provisioning – Instance Limit Exceeded  | The provision of instance fleet InstanceFleetID in the Amazon EMR cluster ClusterId (ClusterName) is delayed because you've reached the limit on the number of instances you can run concurrently in your account (accountID). For more information on Amazon EC2 service limits, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html).  | 
| CREATING  | WARN  | EMR instance group provisioning  | EC2 Provisioning – Instance Limit Exceeded  | The provision of instance group InstanceGroupID in the Amazon EMR cluster ClusterId (ClusterName) is delayed because you've reached the limit on the number of instances you can run concurrently in your account (accountID). For more information on Amazon EC2 service limits, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html).  | 
| STARTING  | INFO  | EMR cluster state change  | *none*  | Amazon EMR cluster `ClusterId (ClusterName)` was requested at `Time` and is being created.  | 
| STARTING  | INFO  | EMR cluster state change  | *none*  |  Applies only to clusters with the instance fleets configuration and multiple Availability Zones selected within Amazon EC2.  Amazon EMR cluster `ClusterId (ClusterName)` is being created in zone (`AvailabilityZoneID`), which was chosen from the specified Availability Zone options.  | 
| STARTING  | INFO  | EMR cluster state change  | *none*  | Amazon EMR cluster `ClusterId (ClusterName)` began running steps at `Time`.  | 
| WAITING  | INFO  | EMR cluster state change  | *none*  | Amazon EMR cluster `ClusterId (ClusterName)` was created at `Time` and is ready for use. - or -  Amazon EMR cluster `ClusterId (ClusterName)` finished running all pending steps at `Time`.  A cluster in the `WAITING` state may still be processing jobs.   | 

**Note**  
The events with event code `EC2 provisioning - Insufficient Instance Capacity` periodically emit when your EMR cluster encounters an insufficient capacity error from Amazon EC2 for your instance fleet or instance group during cluster creation or resize operation. For information on how to respond to these events, see [Responding to Amazon EMR cluster insufficient instance capacity events](emr-events-response-insuff-capacity.md).

## Cluster termination events
<a name="emr-cloudwatch-cluster-termination-events"></a>


| State or state change | Severity | Event type | Event code | Message | 
| --- | --- | --- | --- | --- | 
| TERMINATED  | The severity depends on the reason for the state change, as shown in the following: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-cloudwatch-events.html)  | EMR cluster state change  | *none*  | Amazon EMR Cluster `ClusterId (ClusterName)` has terminated at `Time` with a reason of `StateChangeReason:Code`.  | 
| TERMINATED_WITH_ERRORS  | CRITICAL  | EMR cluster state change  | *none*  | Amazon EMR Cluster `ClusterId (ClusterName)` has terminated with errors at `Time` with a reason of `StateChangeReason:Code`.  | 

## Instance fleet state-change events
<a name="emr-cloudwatch-instance-fleet-events"></a>

**Note**  
The instance fleets configuration is available only in Amazon EMR releases 4.8.0 and later, excluding 5.0.0 and 5.0.3.



| State or state change | Severity | Event type | Event code | Message | 
| --- | --- | --- | --- | --- | 
| From `PROVISIONING` to `WAITING`  | INFO  |  | none | Provisioning for instance fleet `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` is complete. Provisioning started at `Time` and took `Num` minutes. The instance fleet now has On-Demand capacity of `Num` and Spot capacity of `Num`. Target On-Demand capacity was `Num`, and target Spot capacity was `Num`.  | 
| From `WAITING` to `RESIZING`  | INFO  |  | none | A resize for instance fleet `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` started at `Time`. The instance fleet is resizing from an On-Demand capacity of `Num` to a target of `Num`, and from a Spot capacity of `Num` to a target of `Num`.  | 
| From `RESIZING` to `WAITING`  | INFO  |  | none | The resizing operation for instance fleet `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` is complete. The resize started at `Time` and took `Num` minutes. The instance fleet now has On-Demand capacity of `Num` and Spot capacity of `Num`. Target On-Demand capacity was `Num` and target Spot capacity was `Num`.  | 
| From `RESIZING` to `WAITING`  | INFO  |  | none | The resizing operation for instance fleet `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` has reached the timeout and stopped. The resize started at `Time` and stopped after `Num` minutes. The instance fleet now has On-Demand capacity of `Num` and Spot capacity of `Num`. Target On-Demand capacity was `Num` and target Spot capacity was `Num`.  | 
| SUSPENDED  | ERROR  |  | none | Instance fleet `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` was arrested at `Time` for the following reason: `ReasonDesc`.  | 
| RESIZING  | WARNING  |  | none | The resizing operation for instance fleet `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` is stuck for the following reason: `ReasonDesc`.  | 
| `WAITING` or `RUNNING`  | INFO  |  | none | The resizing operation for instance fleet `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` couldn't complete while Amazon EMR added Spot capacity in availability zone `AvailabilityZone`. We've cancelled your request to provision additional Spot capacity. For recommended actions, check [Availability Zone flexibility for an Amazon EMR cluster](emr-flexibility.md) and try again.  | 
| `WAITING` or `RUNNING`  | INFO  |  | none | A resizing operation for instance fleet `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` was initiated by `Entity` at `Time`.  | 

## Instance fleet reconfiguration events
<a name="emr-cloudwatch-instance-fleet-events-reconfig"></a>



| State or state change | Severity | Message | 
| --- | --- | --- | 
| Instance Fleet Reconfiguration Requested  | INFO  | A user has requested to reconfigure the instance fleet `InstanceFleetID` in Amazon EMR cluster `ClusterId` (`ClusterName`).  | 
| Instance Fleet Reconfiguration Start  | INFO  | Amazon EMR has started a reconfiguration of the instance fleet `InstanceFleetID` in the Amazon EMR cluster `ClusterId` (`ClusterName`) at `Time`.  | 
| Instance Fleet Reconfiguration Completed  | INFO  | Amazon EMR has finished reconfiguring instance fleet `InstanceFleetID` in the Amazon EMR cluster `ClusterId` (`ClusterName`).  | 
| Instance Fleet Reconfiguration Failed  | WARNING  | Amazon EMR failed to reconfigure the instance fleet `InstanceFleetID` in the Amazon EMR cluster `ClusterId` (`ClusterName`) at `Time`. The reconfiguration failed because `Reason`.  | 
| Instance Fleet Reconfiguration Reversion Start  | INFO  | Amazon EMR is reverting the instance fleet `InstanceFleetID` in the Amazon EMR cluster `ClusterId` (`ClusterName`) to the previous successful configuration.  | 
| Instance Fleet Reconfiguration Reversion Completed  | INFO  | Amazon EMR finished reverting the instance fleet `InstanceFleetID` in the Amazon EMR cluster `ClusterId` (`ClusterName`) to the previous successful configuration.  | 
| Instance Fleet Reconfiguration Reversion Failed  | CRITICAL  | Amazon EMR couldn't revert the instance fleet `InstanceFleetID` in the Amazon EMR cluster `ClusterId` (`ClusterName`) to the previously successful configuration at `Time`. The reconfiguration reversion failed because of `Reason`.  | 
| Instance Fleet Reconfiguration Reversion Blocked  | INFO  | Amazon EMR temporarily blocked the instance fleet `InstanceFleetID` in the Amazon EMR cluster `ClusterId` (`ClusterName`) at `Time` because the instance fleet is in the `State` state.  | 

## Instance fleet resize events
<a name="emr-cloudwatch-instance-fleet-resize-events"></a>



| Event type | Severity | Event code | Message | 
| --- | --- | --- | --- | 
| EMR instance fleet resize   | ERROR | Spot Provisioning timeout  | The Resize operation for Instance Fleet `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` was not able to complete while acquiring Spot capacity in AZ `AvailabilityZone`. We have now cancelled your request and stopped trying to provision any additional Spot capacity, and the Instance Fleet has provisioned Spot capacity of `num`. Target Spot capacity was `num`. For more information and recommended actions, please check the documentation page [here](emr-flexibility.md) and retry.  | 
| EMR instance fleet resize   | ERROR | On-Demand Provisioning timeout  | The Resize operation for Instance Fleet `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` was not able to complete while acquiring On-Demand capacity in AZ `AvailabilityZone`. We have now cancelled your request and stopped trying to provision any additional On-Demand capacity, and the Instance Fleet has provisioned On-Demand capacity of `num`. Target On-Demand capacity was `num`. For more information and recommended actions, please check the documentation page [here](emr-flexibility.md) and retry.  | 
| EMR instance fleet resize   | WARNING | EC2 provisioning - Insufficient Instance Capacity | We are not able to complete the resize operation for Instance Fleet `InstanceFleetID` in EMR cluster `ClusterId (ClusterName)` as Amazon EC2 has insufficient Spot capacity for Instance types `[Instancetype1, Instancetype2]` and insufficient On-Demand capacity for Instance types `[Instancetype3, Instancetype4]` in Availability Zone `[AvailabilityZone1]`. So far, the instance fleet has provisioned On-Demand capacity of `num` and target On-Demand capacity was `num`. Provisioned Spot capacity is `num` and target Spot capacity was `num`. See the [documentation](emr-EC2_INSUFFICIENT_CAPACITY-error.md) for more information on how to respond to this event.  | 
| EMR instance fleet resize   | WARNING | Spot Provisioning Timeout - Continuing Resize  | We're still provisioning Spot capacity for the Instance Fleet resize operation that was initiated at `time` for instance fleet ID `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` for `[Instancetype1, Instancetype2]` in AZ `AvailabilityZone`. For the previous resize operation that was initiated at `time`, the timeout period expired, so Amazon EMR stopped provisioning Spot capacity after adding `num` of the requested `num` instances to your instance fleet. For more information, please check the documentation page [here](emr-flexibility.md). | 
| EMR instance fleet resize   | WARNING | On-Demand Provisioning Timeout - Continuing Resize  | We're still provisioning On-Demand capacity for the Instance Fleet resize operation that was initiated at `time` for instance fleet ID `InstanceFleetID` in Amazon EMR cluster `ClusterId (ClusterName)` for `[Instancetype1, Instancetype2]` in AZ `AvailabilityZone`. For the previous resize operation that was initiated at `time`, the timeout period expired, so Amazon EMR stopped provisioning On-Demand capacity after adding `num` of the requested `num` instances to your instance fleet. For more information, please check the documentation page [here](emr-flexibility.md). | 
| EMR instance fleet resize   | WARNING | EC2 Provisioning - Insufficient Free Address in Subnet  | We can't complete the resize operation for instance fleet InstanceFleetID in Amazon EMR cluster ClusterId (ClusterName) because the specified subnet [Subnet1, Subnet2] doesn't contain enough free private IP addresses to fulfill your request. Use the DescribeSubnets operation to view how many IP addresses are available (unused) in your subnet. For information on how to respond to this event, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html). | 
| EMR instance fleet resize   | WARNING | EC2 Provisioning - vCPU Limit Exceeded  | The resize of instance fleet InstanceFleetID in the Amazon EMR cluster ClusterName is delayed because you've reached the limit on the number of vCPUs (virtual processing units) assigned to the running instances in your account (accountId). For more information, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html). | 
| EMR instance fleet resize  | WARNING | EC2 Provisioning - Spot Instance Count Limit Exceeded  | The provision of instance fleet InstanceFleetID in the Amazon EMR cluster ClusterID (ClusterName) is delayed because you've reached the limit on the number of Spot Instances that you can launch in your account (accountId). For more information, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html).  | 
| EMR instance fleet resize   | WARNING | EC2 Provisioning - Instance Limit Exceeded  | The provision of instance fleet InstanceFleetID in the Amazon EMR cluster ClusterID (ClusterName) is delayed because you've reached the limit on the number of On-Demand instances you can run in your account (accountId). For more information, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html).  | 

**Note**  
The provisioning timeout events are emitted when Amazon EMR stops provisioning Spot or On-Demand capacity for the fleet after the timeout expires. For information on how to respond to these events, see [Responding to Amazon EMR cluster instance fleet resize timeout events](emr-events-response-timeout-events.md).

## Instance group events
<a name="emr-cloudwatch-instance-group-events"></a>



| Event type | Severity | Event code | Message | 
| --- | --- | --- | --- | 
| From `RESIZING` to `RUNNING`  | INFO  | none | The resizing operation for instance group `InstanceGroupID` in Amazon EMR cluster `ClusterId (ClusterName)` is complete. It now has an instance count of `Num`. The resize started at `Time` and took `Num` minutes to complete.  | 
| From `RUNNING` to `RESIZING`  | INFO  | none | A resize for instance group `InstanceGroupID` in Amazon EMR cluster `ClusterId (ClusterName)` started at `Time`. It is resizing from an instance count of `Num` to `Num`.  | 
| SUSPENDED  | ERROR  | none | Instance group `InstanceGroupID` in Amazon EMR cluster `ClusterId (ClusterName)` was arrested at `Time` for the following reason: `ReasonDesc`.  | 
| RESIZING  | WARNING  | none | The resizing operation for instance group `InstanceGroupID` in Amazon EMR cluster `ClusterId (ClusterName)` is stuck for the following reason: `ReasonDesc`.  | 
| EMR instance group resize   | WARNING | EC2 provisioning - Insufficient Instance Capacity | We are not able to complete the resize operation that started at `time` for Instance Group `InstanceGroupID` in EMR cluster `ClusterId (ClusterName)` as Amazon EC2 has insufficient `Spot/On Demand` capacity for Instance type `[Instancetype]` in Availability Zone `[AvailabilityZone1]`. So far, the instance group has a running instance count of `num` and requested instance count was `num`. See the [documentation](emr-EC2_INSUFFICIENT_CAPACITY-error.md) for more information on how to respond to this event.  | 
| EMR instance group resize   | WARNING | EC2 Provisioning - Insufficient Free Address in Subnet  | We can't complete the resize operation for instance group InstanceGroupID in Amazon EMR cluster ClusterId (ClusterName) because the specified subnet [Subnet1, Subnet2] doesn't contain enough free private IP addresses to fulfill your request. Use the DescribeSubnets operation to view how many IP addresses are available (unused) in your subnet. For information on how to respond to this event, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html). | 
| EMR instance group resize   | WARNING | EC2 Provisioning - vCPU Limit Exceeded  | The resize of instance group InstanceGroupID in the Amazon EMR cluster ClusterName is delayed because you've reached the limit on the number of vCPUs (virtual processing units) assigned to the running instances in your account (accountId). For more information, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html). | 
| EMR instance group resize   | WARNING | EC2 Provisioning - Spot Instance Count Limit Exceeded  | The provision of instance group InstanceGroupID in the Amazon EMR cluster ClusterID (ClusterName) is delayed because you've reached the limit on the number of Spot Instances that you can launch in your account (accountId). For more information, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html).  | 
| EMR instance group resize   | WARNING | EC2 Provisioning - Instance Limit Exceeded  | The provision of instance group InstanceGroupID in the Amazon EMR cluster ClusterID (ClusterName) is delayed because you've reached the limit on the number of On-Demand instances you can run in your account (accountId). For more information, see [Error codes for the Amazon EC2 API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html).  | 
| From `RUNNING` to `RESIZING`  | INFO  | none | A resize for instance group `InstanceGroupID` in Amazon EMR cluster `ClusterId (ClusterName)` was initiated by `Entity` at `Time`.  | 

**Note**  
With Amazon EMR version 5.21.0 and later, you can override cluster configurations and specify additional configuration classifications for each instance group in a running cluster. You do this by using the Amazon EMR console, the AWS Command Line Interface (AWS CLI), or the AWS SDK. For more information, see [Supplying a Configuration for an Instance Group in a Running Cluster](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps-running-cluster.html).
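
As an illustration, the payload for such a reconfiguration request might be shaped like the following sketch. The instance group ID and property values are made-up examples; in real use, a list like this would be passed as `--instance-groups` to `aws emr modify-instance-groups`.

```python
import json

# Sketch of a reconfiguration payload for a running instance group
# (Amazon EMR 5.21.0 and later). The group ID and property values are made up.
instance_groups = [{
    "InstanceGroupId": "ig-1XXXXXXXXXXXX",
    "Configurations": [{
        "Classification": "yarn-site",
        "Properties": {"yarn.nodemanager.vmem-check-enabled": "false"},
    }],
}]
print(json.dumps(instance_groups, indent=2))
```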

The following table lists Amazon EMR events for the reconfiguration operation, along with the state or state change that the event indicates, the severity of the event, and event messages. 



| State or state change | Severity | Message | 
| --- | --- | --- | 
| RUNNING  | INFO  | A reconfiguration for instance group `InstanceGroupID` in the Amazon EMR cluster `ClusterId (ClusterName)` was initiated by user at `Time`. Version of requested configuration is `Num`.  | 
| From `RECONFIGURING` to `RUNNING` | INFO  | The reconfiguration operation for instance group `InstanceGroupID` in the Amazon EMR cluster `ClusterId (ClusterName)` is complete. The reconfiguration started at `Time` and took `Num` minutes to complete. Current configuration version is `Num`.  | 
| From `RUNNING` to `RECONFIGURING`  | INFO  | A reconfiguration for instance group `InstanceGroupID` in the Amazon EMR cluster `ClusterId (ClusterName)` started at `Time`. It is reconfiguring from version number `Num` to version number `Num`.  | 
| RESIZING  | INFO  | Reconfiguring operation towards configuration version `Num` for instance group `InstanceGroupID` in the Amazon EMR cluster `ClusterId (ClusterName)` is temporarily blocked at `Time` because instance group is in `State`.  | 
| RECONFIGURING  | INFO  | Resizing operation towards instance count `Num` for instance group `InstanceGroupID` in the Amazon EMR cluster `ClusterId (ClusterName)` is temporarily blocked at `Time` because the instance group is in `State`. | 
| RECONFIGURING  | WARNING  | The reconfiguration operation for instance group `InstanceGroupID` in the Amazon EMR cluster `ClusterId (ClusterName)` failed at `Time` and took `Num` minutes to fail. Failed configuration version is `Num`.   | 
| RECONFIGURING  | INFO  | Configurations are reverting to the previous successful version number `Num` for instance group `InstanceGroupID` in the Amazon EMR cluster `ClusterId (ClusterName)` at `Time`. New configuration version is `Num`.   | 
| From `RECONFIGURING` to `RUNNING` | INFO  | Configurations were successfully reverted to the previous successful version `Num` for instance group `InstanceGroupID` in the Amazon EMR cluster `ClusterId (ClusterName)` at `Time`. New configuration version is `Num`.  | 
| From `RECONFIGURING` to `SUSPENDED`  | CRITICAL  | Failed to revert to the previous successful version `Num` for Instance group `InstanceGroupID` in the Amazon EMR cluster `ClusterId (ClusterName)` at `Time`.  | 

## Automatic scaling policy events
<a name="emr-cloudwatch-autoscale-events"></a>



| State or state change | Severity | Message | 
| --- | --- | --- | 
| PENDING  | INFO  | An Auto Scaling policy was added to instance group `InstanceGroupID` in Amazon EMR cluster `ClusterId (ClusterName)` at `Time`. The policy is pending attachment. - or -  The Auto Scaling policy for instance group `InstanceGroupID` in Amazon EMR cluster `ClusterId (ClusterName)` was updated at `Time`. The policy is pending attachment.  | 
| ATTACHED  | INFO  | The Auto Scaling policy for instance group `InstanceGroupID` in Amazon EMR cluster `ClusterId (ClusterName)` was attached at `Time`.  | 
| `DETACHED`  | INFO  | The Auto Scaling policy for instance group `InstanceGroupID` in Amazon EMR cluster `ClusterId (ClusterName)` was detached at `Time`.  | 
| FAILED  | ERROR  | The Auto Scaling policy for instance group `InstanceGroupID` in Amazon EMR cluster `ClusterId (ClusterName)` could not attach and failed at `Time`. - or -  The Auto Scaling policy for instance group `InstanceGroupID` in Amazon EMR cluster `ClusterId (ClusterName)` could not detach and failed at `Time`.  | 

## Step events
<a name="emr-cloudwatch-step-events"></a>



| State or state change | Severity | Message | 
| --- | --- | --- | 
| PENDING  | INFO  | Step `StepID (StepName)` was added to Amazon EMR cluster `ClusterId (ClusterName)` at `Time` and is pending execution.   | 
| CANCEL_PENDING  | WARN  | Step `StepID (StepName)` in Amazon EMR cluster `ClusterId (ClusterName)` was cancelled at `Time` and is pending cancellation.   | 
| RUNNING  | INFO  | Step `StepID (StepName)` in Amazon EMR cluster `ClusterId (ClusterName)` started running at `Time`.   | 
| COMPLETED  | INFO  | Step `StepID (StepName)` in Amazon EMR cluster `ClusterId (ClusterName)` completed execution at `Time`. The step started running at `Time` and took `Num` minutes to complete.  | 
| CANCELLED  | WARN  | Cancellation request has succeeded for cluster step `StepID (StepName)` in Amazon EMR cluster `ClusterId (ClusterName)` at `Time`, and the step is now cancelled.   | 
| FAILED  | ERROR  | Step `StepID (StepName)` in Amazon EMR cluster `ClusterId (ClusterName)` failed at `Time`.  | 

## Unhealthy node replacement events
<a name="emr-cloudwatch-unhealthy-node-replacement-events"></a>


| Event type | Severity | Event code | Message | 
| --- | --- | --- | --- | 
| Amazon EMR unhealthy node replacement | INFO | Unhealthy core node detected | Amazon EMR has identified that core instance `[instanceID (InstanceName)]` in `InstanceGroup/Fleet` in the Amazon EMR cluster `clusterID (ClusterName)` is `UNHEALTHY`. Amazon EMR will attempt to recover or gracefully replace the `UNHEALTHY` instance.  | 
| Amazon EMR unhealthy node replacement | INFO | Core node unhealthy - replacement disabled | Amazon EMR has identified that core instance `[instanceID (InstanceName)]` in `InstanceGroup/Fleet` in the Amazon EMR cluster `(clusterID) (ClusterName)` is `UNHEALTHY`. Turn on graceful unhealthy core node replacement in your cluster to let Amazon EMR gracefully replace the `UNHEALTHY` instances in the event that they can’t be recovered.  | 
| Amazon EMR unhealthy node replacement | WARN | Unhealthy core node not replaced | Amazon EMR can't replace your `UNHEALTHY` core instance `[instanceID (InstanceName)]` in `InstanceGroup/Fleet` in the Amazon EMR cluster `clusterID (ClusterName)` because of *reason*. The reason that Amazon EMR can't replace your core node depends on your scenario. For example, Amazon EMR can't remove a node if the cluster would have no remaining core nodes.  | 
| Amazon EMR unhealthy node replacement | INFO | Unhealthy core node recovered | Amazon EMR has recovered your `UNHEALTHY` core instance `[instanceID (InstanceName)]` in `InstanceGroup/Fleet` in the Amazon EMR cluster `clusterID (ClusterName)`.  | 

For more information about unhealthy node replacement, see [Replacing unhealthy nodes](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-node-replacement.html).

## Viewing events with the Amazon EMR console
<a name="emr-events-console"></a>

For each cluster, you can view a simple list of events in the details pane, which lists events in descending order of occurrence. You can also view all events for all clusters in an AWS Region in descending order of occurrence.

If you don't want a user to see all cluster events for a Region, add a statement that denies permission (`"Effect": "Deny"`) for the `elasticmapreduce:ViewEventsFromAllClustersInConsole` action to a policy that is attached to the user. 
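For instance, such a deny statement could look like the following sketch. The policy is built here in Python for illustration; the broad `"Resource": "*"` scope is an assumption you should adjust to your own policies.

```python
import json

# Sketch of an IAM policy statement that denies the console
# "view events from all clusters" action described above.
def build_deny_events_statement():
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": "elasticmapreduce:ViewEventsFromAllClustersInConsole",
                # Assumption: deny across all resources; scope as needed
                "Resource": "*",
            }
        ],
    }

print(json.dumps(build_deny_events_statement(), indent=2))
```

Attach the resulting JSON to the user or role through IAM as you would any other inline or managed policy.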

**To view events for all clusters in a Region with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Events**.

**To view events for a particular cluster with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose a cluster.

1. To view all of your events, select the **Events** tab on the cluster details page.

# Responding to CloudWatch events from Amazon EMR
<a name="emr-events-response"></a>

This section describes various ways that you can respond to actionable events that Amazon EMR emits as [CloudWatch event messages](emr-manage-cloudwatch-events.md), such as creating rules and setting alarms. The sections that follow include links to procedures and recommended responses to common events.

**Topics**
+ [Creating rules for Amazon EMR events with CloudWatch](emr-events-cloudwatch-console.md)
+ [Setting alarms on CloudWatch metrics from Amazon EMR](UsingEMR_ViewingMetrics_Alarm.md)
+ [Responding to Amazon EMR cluster insufficient instance capacity events](emr-events-response-insuff-capacity.md)
+ [Responding to Amazon EMR cluster instance fleet resize timeout events](emr-events-response-timeout-events.md)

# Creating rules for Amazon EMR events with CloudWatch
<a name="emr-events-cloudwatch-console"></a>

Amazon EMR automatically sends events to a CloudWatch event stream. You can create rules that match events according to a specified pattern, and route the events to targets to take action, such as sending an email notification. Patterns are matched against the event JSON object. For more information about Amazon EMR event details, see [Amazon EMR events](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#emr_event_type) in the *Amazon CloudWatch Events User Guide*.

For information about setting up CloudWatch event rules, see [Creating a CloudWatch rule that triggers on an event](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Create-CloudWatch-Events-Rule.html).
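As a sketch, a rule that matches failed EMR steps might use an event pattern like the one below. The rule name is a placeholder, and the `put_rule` call is shown commented out because it requires AWS credentials and EventBridge permissions.

```python
import json

# Sketch: an event pattern that matches EMR step state changes
# where the step has transitioned to FAILED.
def failed_step_event_pattern():
    return {
        "source": ["aws.emr"],
        "detail-type": ["EMR Step Status Change"],
        "detail": {"state": ["FAILED"]},
    }

# To create the rule (requires credentials; rule name is a placeholder):
# import boto3
# events = boto3.client("events")
# events.put_rule(
#     Name="emr-step-failed",
#     EventPattern=json.dumps(failed_step_event_pattern()),
#     State="ENABLED",
# )

print(json.dumps(failed_step_event_pattern()))
```

You can then add a target, such as an SNS topic or Lambda function, to the rule to act on matching events.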

# Setting alarms on CloudWatch metrics from Amazon EMR
<a name="UsingEMR_ViewingMetrics_Alarm"></a>

Amazon EMR pushes metrics to Amazon CloudWatch. In response, you can use CloudWatch to set alarms on your Amazon EMR metrics. For example, you can configure an alarm in CloudWatch to send you an email any time the HDFS utilization rises above 80%. For detailed instructions, see [Create or edit a CloudWatch alarm](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ConsoleAlarms.html) in the *Amazon CloudWatch User Guide*. 
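A minimal sketch of that 80% HDFS utilization alarm follows. The cluster ID and SNS topic ARN are placeholders, and the five-minute period and `Average` statistic are assumptions you can tune.

```python
# Sketch of the put_metric_alarm parameters for the 80% HDFS
# utilization example above.
def hdfs_utilization_alarm_params(cluster_id, sns_topic_arn):
    return {
        "AlarmName": f"HDFSUtilization-{cluster_id}",
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "HDFSUtilization",
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "Statistic": "Average",       # assumption: average over the period
        "Period": 300,                # assumption: 5-minute evaluation period
        "EvaluationPeriods": 1,
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }

# To create the alarm (requires credentials; IDs are placeholders):
# import boto3
# cloudwatch = boto3.client("cloudwatch")
# cloudwatch.put_metric_alarm(
#     **hdfs_utilization_alarm_params(
#         "j-XXXXXXXXXXXXX", "arn:aws:sns:us-east-1:111122223333:emr-alerts"
#     )
# )
```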

# Responding to Amazon EMR cluster insufficient instance capacity events
<a name="emr-events-response-insuff-capacity"></a>

## Overview
<a name="emr-events-response-insuff-capacity-overview"></a>

Amazon EMR clusters return the event code `EC2 provisioning - Insufficient Instance Capacity` when the selected Availability Zone doesn't have enough capacity to fulfill your cluster start or resize request. Amazon EMR emits the event periodically, for both instance groups and instance fleets, when it repeatedly encounters insufficient capacity exceptions and can't fulfill your provisioning request for a cluster start or cluster resize operation.

This page describes how you can best respond to this event type when it occurs for your EMR cluster.

## Recommended response to an insufficient capacity event
<a name="emr-events-response-insuff-capacity-rec"></a>

We recommend that you respond to an insufficient-capacity event in one of the following ways:
+ Wait for capacity to recover. Capacity shifts frequently, so an insufficient capacity exception can recover on its own. Your clusters will start or finish resizing as soon as Amazon EC2 capacity becomes available.
+ Alternatively, you can terminate your cluster, modify your instance type configurations, and create a new cluster with the updated cluster configuration request. For more information, see [Availability Zone flexibility for an Amazon EMR cluster](emr-flexibility.md).

You can also set up rules or automated responses to an insufficient capacity event, as described in the next section.

## Automated recovery from an insufficient capacity event
<a name="emr-events-response-insuff-capacity-ex"></a>

You can build automation in response to Amazon EMR events such as the ones with event code `EC2 provisioning - Insufficient Instance Capacity`. For example, the following AWS Lambda function terminates an EMR cluster with an instance group that uses On-Demand instances, and then creates a new EMR cluster with an instance group that contains different instance types than the original request.

The following conditions trigger the automated process to occur:
+ Amazon EMR has emitted the insufficient capacity event for primary or core nodes for more than 20 minutes.
+ The cluster is not in a **READY** or **WAITING** state. For more information about EMR cluster states, see [Understanding the cluster lifecycle](emr-overview.md#emr-overview-cluster-lifecycle).

**Note**  
When you build an automated process for an insufficient capacity exception, you should consider that the insufficient capacity event is recoverable. Capacity often shifts and your clusters will resume the resize or start operation as soon as Amazon EC2 capacity becomes available.

**Example function to respond to insufficient capacity event**  

```
# Lambda code with Python 3.10; the handler is lambda_function.lambda_handler
# Note: the related IAM role requires permission to use Amazon EMR

import json
import boto3
import datetime
from datetime import timezone

INSUFFICIENT_CAPACITY_EXCEPTION_DETAIL_TYPE = "EMR Instance Group Provisioning"
INSUFFICIENT_CAPACITY_EXCEPTION_EVENT_CODE = (
    "EC2 provisioning - Insufficient Instance Capacity"
)
ALLOWED_INSTANCE_TYPES_TO_USE = [
    "m5.xlarge",
    "c5.xlarge",
    "m5.4xlarge",
    "m5.2xlarge",
    "t3.xlarge",
]
CLUSTER_START_ACCEPTABLE_STATES = ["WAITING", "RUNNING"]
CLUSTER_START_SLA = 20

CLIENT = boto3.client("emr", region_name="us-east-1")

# checks if the incoming event is 'EMR Instance Group Provisioning' with eventCode 'EC2 provisioning - Insufficient Instance Capacity'
def is_insufficient_capacity_event(event):
    if not event["detail"]:
        return False
    else:
        return (
            event["detail-type"] == INSUFFICIENT_CAPACITY_EXCEPTION_DETAIL_TYPE
            and event["detail"]["eventCode"]
            == INSUFFICIENT_CAPACITY_EXCEPTION_EVENT_CODE
        )


# checks if the cluster is eligible for termination
def is_cluster_eligible_for_termination(event, describeClusterResponse):
    # instanceGroupType could be CORE, MASTER OR TASK
    instanceGroupType = event["detail"]["instanceGroupType"]
    clusterCreationTime = describeClusterResponse["Cluster"]["Status"]["Timeline"][
        "CreationDateTime"
    ]
    clusterState = describeClusterResponse["Cluster"]["Status"]["State"]

    now = datetime.datetime.now()
    now = now.replace(tzinfo=timezone.utc)
    isClusterStartSlaBreached = clusterCreationTime < now - datetime.timedelta(
        minutes=CLUSTER_START_SLA
    )

    # Check if the instance group receiving the insufficient capacity exception is CORE or PRIMARY (MASTER),
    # and it's been more than 20 minutes since the cluster was created but the cluster state is not yet RUNNING or WAITING
    if (
        (instanceGroupType == "CORE" or instanceGroupType == "MASTER")
        and isClusterStartSlaBreached
        and clusterState not in CLUSTER_START_ACCEPTABLE_STATES
    ):
        return True
    else:
        return False


# Choose item from the list except the exempt value
def choice_excluding(exempt):
    for i in ALLOWED_INSTANCE_TYPES_TO_USE:
        if i != exempt:
            return i


# Create a new cluster by choosing different InstanceType.
def create_cluster(event):
    # instanceGroupType could be CORE, MASTER, or TASK
    instanceGroupType = event["detail"]["instanceGroupType"]

    # The following two lines assume that the customer who created the cluster knows which instance types were used in the original request
    instanceTypesFromOriginalRequestMaster = "m5.xlarge"
    instanceTypesFromOriginalRequestCore = "m5.xlarge"

    # Select new instance types to include in the new createCluster request
    instanceTypeForMaster = (
        instanceTypesFromOriginalRequestMaster
        if instanceGroupType != "MASTER"
        else choice_excluding(instanceTypesFromOriginalRequestMaster)
    )
    instanceTypeForCore = (
        instanceTypesFromOriginalRequestCore
        if instanceGroupType != "CORE"
        else choice_excluding(instanceTypesFromOriginalRequestCore)
    )

    print("Starting to create cluster...")
    instances = {
        "InstanceGroups": [
            {
                "InstanceRole": "MASTER",
                "InstanceCount": 1,
                "InstanceType": instanceTypeForMaster,
                "Market": "ON_DEMAND",
                "Name": "Master",
            },
            {
                "InstanceRole": "CORE",
                "InstanceCount": 1,
                "InstanceType": instanceTypeForCore,
                "Market": "ON_DEMAND",
                "Name": "Core",
            },
        ]
    }
    response = CLIENT.run_job_flow(
        Name="Test Cluster",
        Instances=instances,
        VisibleToAllUsers=True,
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        ReleaseLabel="emr-6.10.0",
    )

    return response["JobFlowId"]


# Terminate the cluster using the clusterId received in the event
def terminate_cluster(event):
    print("Trying to terminate cluster, clusterId: " + event["detail"]["clusterId"])
    response = CLIENT.terminate_job_flows(JobFlowIds=[event["detail"]["clusterId"]])
    print(f"Terminate cluster response: {response}")


def describe_cluster(event):
    response = CLIENT.describe_cluster(ClusterId=event["detail"]["clusterId"])
    return response


def lambda_handler(event, context):
    if is_insufficient_capacity_event(event):
        print(
            "Received insufficient capacity event for instanceGroup, clusterId: "
            + event["detail"]["clusterId"]
        )

        describeClusterResponse = describe_cluster(event)

        shouldTerminateCluster = is_cluster_eligible_for_termination(
            event, describeClusterResponse
        )
        if shouldTerminateCluster:
            terminate_cluster(event)

            clusterId = create_cluster(event)
            print("Created a new cluster, clusterId: " + clusterId)
        else:
            print(
                "Cluster is not eligible for termination, clusterId: "
                + event["detail"]["clusterId"]
            )

    else:
        print("Received event is not insufficient capacity event, skipping")
```

# Responding to Amazon EMR cluster instance fleet resize timeout events
<a name="emr-events-response-timeout-events"></a>

## Overview
<a name="emr-events-response-timeout-events-overview"></a>

Amazon EMR clusters emit [events](emr-manage-cloudwatch-events.md#emr-cloudwatch-instance-fleet-resize-events) while executing resize operations for instance fleet clusters. Amazon EMR emits a provisioning timeout event when it stops provisioning Spot or On-Demand capacity for the fleet after the timeout expires. You can configure the timeout duration as part of the [resize specifications](https://docs.aws.amazon.com/emr/latest/APIReference/API_InstanceFleetResizingSpecifications.html) for the instance fleets. In scenarios of consecutive resizes for the same instance fleet, Amazon EMR emits the `Spot provisioning timeout - continuing resize` or `On-Demand provisioning timeout - continuing resize` event when the timeout for the current resize operation expires. It then starts provisioning capacity for the fleet's next resize operation.

## Responding to instance fleet resize timeout events
<a name="emr-events-response-timeout-events-rec"></a>

We recommend that you respond to a provisioning timeout event in one of the following ways:
+ Revisit the [resize specifications](https://docs.aws.amazon.com/emr/latest/APIReference/API_InstanceFleetResizingSpecifications.html) and retry the resize operation. Capacity shifts frequently, so your clusters will resize successfully as soon as Amazon EC2 capacity becomes available. For jobs that require stricter SLAs, we recommend configuring lower timeout durations.
+ Alternatively, you can either:
  + Launch a new cluster with diversified instance types based on the [best practices for instance and Availability Zone flexibility](emr-flexibility.md#emr-flexibility-types) or
  + Launch a cluster with On-Demand capacity
+ For the provisioning timeout - continuing resize event, you can additionally wait for resize operations to be processed. Amazon EMR will continue to sequentially process the resize operations triggered for the fleet, respecting the configured resize specifications.
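For example, a retried resize with a shorter Spot timeout could be sketched as follows. The cluster ID, fleet ID, target capacity, and timeout value are placeholder assumptions.

```python
# Sketch: build a ModifyInstanceFleet request that retries a resize
# with a shorter Spot provisioning timeout, per the recommendation above.
def resize_request(cluster_id, fleet_id, target_spot, timeout_minutes):
    return {
        "ClusterId": cluster_id,
        "InstanceFleet": {
            "InstanceFleetId": fleet_id,
            "TargetSpotCapacity": target_spot,
            "ResizeSpecifications": {
                "SpotResizeSpecification": {
                    "TimeoutDurationMinutes": timeout_minutes
                }
            },
        },
    }

# To submit the resize (requires credentials; IDs are placeholders):
# import boto3
# emr = boto3.client("emr")
# emr.modify_instance_fleet(**resize_request("j-XXXXXXXXXXXXX", "if-XXXXXXXX", 100, 30))
```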

You can also set up rules or automated responses to this event as described in the next section.

## Automated recovery from a provisioning timeout event
<a name="emr-events-response-timeout-events-ex"></a>

You can build automation in response to Amazon EMR events with the `Spot Provisioning timeout` event code. For example, the following AWS Lambda function shuts down an EMR cluster with an instance fleet that uses Spot instances for Task nodes, and then creates a new EMR cluster with an instance fleet that contains more diversified instance types than the original request. In this example, the `Spot Provisioning timeout` event emitted for task nodes will trigger the execution of the Lambda function.

**Example function to respond to `Spot Provisioning timeout` event**  

```
# Lambda code with Python 3.10; the handler is lambda_function.lambda_handler
# Note: the related IAM role requires permission to use Amazon EMR
 
import json
import boto3
import datetime
from datetime import timezone
 
SPOT_PROVISIONING_TIMEOUT_EXCEPTION_DETAIL_TYPE = "EMR Instance Fleet Resize"
SPOT_PROVISIONING_TIMEOUT_EXCEPTION_EVENT_CODE = (
    "Spot Provisioning timeout"
)
 
CLIENT = boto3.client("emr", region_name="us-east-1")
 
# checks if the incoming event is 'EMR Instance Fleet Resize' with eventCode 'Spot Provisioning timeout'
def is_spot_provisioning_timeout_event(event):
    if not event["detail"]:
        return False
    else:
        return (
            event["detail-type"] == SPOT_PROVISIONING_TIMEOUT_EXCEPTION_DETAIL_TYPE
            and event["detail"]["eventCode"]
            == SPOT_PROVISIONING_TIMEOUT_EXCEPTION_EVENT_CODE
        )
 
 
# checks if the cluster is eligible for termination
def is_cluster_eligible_for_termination(event, describeClusterResponse):
    # instanceFleetType could be CORE, MASTER OR TASK
    instanceFleetType = event["detail"]["instanceFleetType"]
 
    # Check if instance fleet receiving Spot provisioning timeout event is TASK
    if (instanceFleetType == "TASK"):
        return True
    else:
        return False
 
 
# create a new cluster by choosing different InstanceType.
def create_cluster(event):
    # instanceFleetType could be CORE, MASTER, or TASK
    instanceFleetType = event["detail"]["instanceFleetType"]
 
    # the following two lines assume that the customer who created the cluster knows which instance types were used in the original request
    instanceTypesFromOriginalRequestMaster = "m5.xlarge"
    instanceTypesFromOriginalRequestCore = "m5.xlarge"
   
    # select new instance types to include in the new createCluster request
    instanceTypesForTask = [
        "m5.xlarge",
        "m5.2xlarge",
        "m5.4xlarge",
        "m5.8xlarge",
        "m5.12xlarge"
    ]
    
    print("Starting to create cluster...")
    instances = {
        "InstanceFleets": [
            {
                "InstanceFleetType":"MASTER",
                "TargetOnDemandCapacity":1,
                "TargetSpotCapacity":0,
                "InstanceTypeConfigs":[
                    {
                        'InstanceType': instanceTypesFromOriginalRequestMaster,
                        "WeightedCapacity":1,
                    }
                ]
            },
            {
                "InstanceFleetType":"CORE",
                "TargetOnDemandCapacity":1,
                "TargetSpotCapacity":0,
                "InstanceTypeConfigs":[
                    {
                        'InstanceType': instanceTypesFromOriginalRequestCore,
                        "WeightedCapacity":1,
                    }
                ]
            },
            {
                "InstanceFleetType":"TASK",
                "TargetOnDemandCapacity":0,
                "TargetSpotCapacity":100,
                "LaunchSpecifications":{},
                "InstanceTypeConfigs":[
                    {
                        'InstanceType': instanceTypesForTask[0],
                        "WeightedCapacity":1,
                    },
                    {
                        'InstanceType': instanceTypesForTask[1],
                        "WeightedCapacity":2,
                    },
                    {
                        'InstanceType': instanceTypesForTask[2],
                        "WeightedCapacity":4,
                    },
                    {
                        'InstanceType': instanceTypesForTask[3],
                        "WeightedCapacity":8,
                    },
                    {
                        'InstanceType': instanceTypesForTask[4],
                        "WeightedCapacity":12,
                    }
                ],
                "ResizeSpecifications": {
                    "SpotResizeSpecification": {
                        "TimeoutDurationMinutes": 30
                    }
                }
            }
        ]
    }
    response = CLIENT.run_job_flow(
        Name="Test Cluster",
        Instances=instances,
        VisibleToAllUsers=True,
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        ReleaseLabel="emr-6.10.0",
    )
 
    return response["JobFlowId"]
 
 
# terminate the cluster using the clusterId received in the event
def terminate_cluster(event):
    print("Trying to terminate cluster, clusterId: " + event["detail"]["clusterId"])
    response = CLIENT.terminate_job_flows(JobFlowIds=[event["detail"]["clusterId"]])
    print(f"Terminate cluster response: {response}")
 
 
def describe_cluster(event):
    response = CLIENT.describe_cluster(ClusterId=event["detail"]["clusterId"])
    return response
 
 
def lambda_handler(event, context):
    if is_spot_provisioning_timeout_event(event):
        print(
            "Received spot provisioning timeout event for instanceFleet, clusterId: "
            + event["detail"]["clusterId"]
        )
 
        describeClusterResponse = describe_cluster(event)
 
        shouldTerminateCluster = is_cluster_eligible_for_termination(
            event, describeClusterResponse
        )
        if shouldTerminateCluster:
            terminate_cluster(event)
 
            clusterId = create_cluster(event)
            print("Created a new cluster, clusterId: " + clusterId)
        else:
            print(
                "Cluster is not eligible for termination, clusterId: "
                + event["detail"]["clusterId"]
            )
 
    else:
        print("Received event is not spot provisioning timeout event, skipping")
```

# View cluster application metrics using Ganglia with Amazon EMR
<a name="ViewingGangliaMetrics"></a>

Ganglia is available with Amazon EMR releases 4.2 through 6.15. Ganglia is an open-source, scalable, distributed system designed to monitor clusters and grids while minimizing the impact on their performance. When you enable Ganglia on your cluster, you can generate reports and view the performance of the cluster as a whole, as well as inspect the performance of individual node instances. Ganglia is also configured to ingest and visualize Hadoop and Spark metrics. For more information, see [Ganglia](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-ganglia.html) in the *Amazon EMR Release Guide*.

# Logging AWS EMR API calls using AWS CloudTrail
<a name="logging-using-cloudtrail"></a>

AWS EMR is integrated with [AWS CloudTrail](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html), a service that provides a record of actions taken by a user, role, or an AWS service. CloudTrail captures all API calls for AWS EMR as events. The calls captured include calls from the AWS EMR console and code calls to the AWS EMR API operations. Using the information collected by CloudTrail, you can determine the request that was made to AWS EMR, the IP address from which the request was made, when it was made, and additional details.

Every event or log entry contains information about who generated the request. The identity information helps you determine the following:
+ Whether the request was made with root user or user credentials.
+ Whether the request was made on behalf of an IAM Identity Center user.
+ Whether the request was made with temporary security credentials for a role or federated user.
+ Whether the request was made by another AWS service.

CloudTrail is active in your AWS account when you create the account and you automatically have access to the CloudTrail **Event history**. The CloudTrail **Event history** provides a viewable, searchable, downloadable, and immutable record of the past 90 days of recorded management events in an AWS Region. For more information, see [Working with CloudTrail Event history](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/view-cloudtrail-events.html) in the *AWS CloudTrail User Guide*. There are no CloudTrail charges for viewing the **Event history**.

For an ongoing record of events in your AWS account past 90 days, create a trail or a [CloudTrail Lake](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-lake.html) event data store.

**CloudTrail trails**  
A *trail* enables CloudTrail to deliver log files to an Amazon S3 bucket. All trails created using the AWS Management Console are multi-Region. You can create a single-Region or a multi-Region trail by using the AWS CLI. Creating a multi-Region trail is recommended because you capture activity in all AWS Regions in your account. If you create a single-Region trail, you can view only the events logged in the trail's AWS Region. For more information about trails, see [Creating a trail for your AWS account](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-create-and-update-a-trail.html) and [Creating a trail for an organization](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/creating-trail-organization.html) in the *AWS CloudTrail User Guide*.  
You can deliver one copy of your ongoing management events to your Amazon S3 bucket at no charge from CloudTrail by creating a trail, however, there are Amazon S3 storage charges. For more information about CloudTrail pricing, see [AWS CloudTrail Pricing](https://aws.amazon.com/cloudtrail/pricing/). For information about Amazon S3 pricing, see [Amazon S3 Pricing](https://aws.amazon.com/s3/pricing/).

**CloudTrail Lake event data stores**  
*CloudTrail Lake* lets you run SQL-based queries on your events. CloudTrail Lake converts existing events in row-based JSON format to [Apache ORC](https://orc.apache.org/) format. ORC is a columnar storage format that is optimized for fast retrieval of data. Events are aggregated into *event data stores*, which are immutable collections of events based on criteria that you select by applying [advanced event selectors](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-lake-concepts.html#adv-event-selectors). The selectors that you apply to an event data store control which events persist and are available for you to query. For more information about CloudTrail Lake, see [Working with AWS CloudTrail Lake](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-lake.html) in the *AWS CloudTrail User Guide*.  
CloudTrail Lake event data stores and queries incur costs. When you create an event data store, you choose the [pricing option](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-lake-manage-costs.html#cloudtrail-lake-manage-costs-pricing-option) you want to use for the event data store. The pricing option determines the cost for ingesting and storing events, and the default and maximum retention period for the event data store. For more information about CloudTrail pricing, see [AWS CloudTrail Pricing](https://aws.amazon.com/cloudtrail/pricing/).

## AWS EMR data events in CloudTrail
<a name="cloudtrail-data-events"></a>

[Data events](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/logging-data-events-with-cloudtrail.html#logging-data-events) provide information about the resource operations performed on or in a resource (for example, reading or writing to an Amazon S3 object). These are also known as data plane operations. Data events are often high-volume activities. By default, CloudTrail doesn’t log data events. The CloudTrail **Event history** doesn't record data events.

Additional charges apply for data events. For more information about CloudTrail pricing, see [AWS CloudTrail Pricing](https://aws.amazon.com/cloudtrail/pricing/).

You can log data events for the AWS EMR resource types by using the CloudTrail console, AWS CLI, or CloudTrail API operations. For more information about how to log data events, see [Logging data events with the AWS Management Console](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/logging-data-events-with-cloudtrail.html#logging-data-events-console) and [Logging data events with the AWS Command Line Interface](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/logging-data-events-with-cloudtrail.html#creating-data-event-selectors-with-the-AWS-CLI) in the *AWS CloudTrail User Guide*.

The following table lists the AWS EMR resource types for which you can log data events. The **Data event type (console)** column shows the value to choose from the **Data event type** list on the CloudTrail console. The **resources.type value** column shows the `resources.type` value, which you would specify when configuring advanced event selectors using the AWS CLI or CloudTrail APIs. The **Data APIs logged to CloudTrail** column shows the API calls logged to CloudTrail for the resource type.

For more information about these API operations, see [Amazon EMR WAL (EMRWAL) CLI reference](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emrwalcli-ref.html). Amazon EMR logs some Data API operations to CloudTrail that are HBase system operations that you never call directly. These operations aren't in the EMRWAL CLI reference.


| Data event type (console) | resources.type value | Data APIs logged to CloudTrail | 
| --- | --- | --- | 
| Amazon EMR write-ahead log workspace |  AWS::EMRWAL::Workspace  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/logging-using-cloudtrail.html)  | 

You can configure advanced event selectors to filter on the `eventName`, `readOnly`, and `resources.ARN` fields to log only those events that are important to you. For more information about these fields, see [https://docs.aws.amazon.com/awscloudtrail/latest/APIReference/API_AdvancedFieldSelector.html](https://docs.aws.amazon.com/awscloudtrail/latest/APIReference/API_AdvancedFieldSelector.html) in the *AWS CloudTrail API Reference*.
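As a sketch, an advanced event selector that captures only these EMR WAL data events might look like the following. The selector name and trail name are placeholders.

```python
# Sketch: an advanced event selector that logs only data events
# for EMR write-ahead log workspaces.
def emrwal_data_event_selector():
    return {
        "Name": "Log EMR WAL workspace data events",  # placeholder name
        "FieldSelectors": [
            {"Field": "eventCategory", "Equals": ["Data"]},
            {"Field": "resources.type", "Equals": ["AWS::EMRWAL::Workspace"]},
        ],
    }

# To apply the selector to a trail (requires credentials; trail name is a placeholder):
# import boto3
# cloudtrail = boto3.client("cloudtrail")
# cloudtrail.put_event_selectors(
#     TrailName="my-trail",
#     AdvancedEventSelectors=[emrwal_data_event_selector()],
# )
```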

## AWS EMR management events in CloudTrail
<a name="cloudtrail-management-events"></a>

[Management events](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/logging-management-events-with-cloudtrail.html#logging-management-events) provide information about management operations that are performed on resources in your AWS account. These are also known as control plane operations. By default, CloudTrail logs management events.

AWS EMR logs all AWS EMR control plane operations as management events. For a list of the AWS EMR control plane operations that AWS EMR logs to CloudTrail, see the [AWS EMR API Reference](https://docs.aws.amazon.com/emr/latest/APIReference/Welcome.html).

## AWS EMR event examples
<a name="cloudtrail-event-examples"></a>

An event represents a single request from any source and includes information about the requested API operation, the date and time of the operation, request parameters, and so on. CloudTrail log files aren't an ordered stack trace of the public API calls, so events don't appear in any specific order.

The following example shows a CloudTrail log entry that demonstrates the **RunJobFlow** action.

```
{
	"Records": [
	{
         "eventVersion":"1.01",
         "userIdentity":{
            "type":"IAMUser",
            "principalId":"EX_PRINCIPAL_ID",
            "arn":"arn:aws:iam::123456789012:user/temporary-user-xx-7M",
            "accountId":"123456789012",
            "userName":"temporary-user-xx-7M"
         },
         "eventTime":"2018-03-31T17:59:21Z",
         "eventSource":"elasticmapreduce.amazonaws.com",
         "eventName":"RunJobFlow",
         "awsRegion":"us-west-2",
         "sourceIPAddress":"192.0.2.1",
         "userAgent":"aws-sdk-java/unknown-version Linux/xx Java_HotSpot(TM)_64-Bit_Server_VM/xx",
         "requestParameters":{
            "tags":[
               {
                  "value":"prod",
                  "key":"domain"
               },
               {
                  "value":"us-west-2",
                  "key":"realm"
               },
               {
                  "value":"VERIFICATION",
                  "key":"executionType"
               }
            ],
            "instances":{
               "slaveInstanceType":"m5.xlarge",
               "ec2KeyName":"emr-integtest",
               "instanceCount":1,
               "masterInstanceType":"m5.xlarge",
               "keepJobFlowAliveWhenNoSteps":true,
               "terminationProtected":false
            },
            "visibleToAllUsers":false,
            "name":"MyCluster",
            "ReleaseLabel":"emr-5.16.0"
         },
         "responseElements":{
            "jobFlowId":"j-2WDJCGEG4E6AJ"
         },
         "requestID":"2f482daf-b8fe-11e3-89e7-75a3d0e071c5",
         "eventID":"b348a38d-f744-4097-8b2a-e68c9b424698"
      },
      ...additional entries
   ]
}
```

For information about CloudTrail record contents, see [CloudTrail record contents](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-event-reference-record-contents.html) in the *AWS CloudTrail User Guide*.
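When processing downloaded CloudTrail log files, you can filter for Amazon EMR events by matching on the `eventSource` field. The following sketch parses a minimal log excerpt modeled on the `RunJobFlow` entry above and extracts the cluster (job flow) ID from `responseElements`; the excerpt is abbreviated, not a complete CloudTrail record.

```python
import json

# Minimal CloudTrail log excerpt modeled on the RunJobFlow entry above.
log = json.loads("""
{
  "Records": [
    {
      "eventSource": "elasticmapreduce.amazonaws.com",
      "eventName": "RunJobFlow",
      "eventTime": "2018-03-31T17:59:21Z",
      "responseElements": {"jobFlowId": "j-2WDJCGEG4E6AJ"}
    }
  ]
}
""")

# Keep only Amazon EMR events and pull out the cluster (job flow) ID.
emr_events = [
    (r["eventName"], r.get("responseElements", {}).get("jobFlowId"))
    for r in log["Records"]
    if r["eventSource"] == "elasticmapreduce.amazonaws.com"
]

print(emr_events)  # [('RunJobFlow', 'j-2WDJCGEG4E6AJ')]
```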

# EMR Observability Best Practices
<a name="emr-metrics-observability"></a>

Observability for Amazon EMR clusters rests on Amazon CloudWatch as the primary monitoring service, complemented by EMR Studio and third-party tools such as Prometheus and Grafana for enhanced visibility. In this document, we explore specific aspects of cluster observability:

1. *[Spark observability](https://github.com/aws/aws-emr-best-practices/blob/main/website/docs/bestpractices/Applications/Spark/observability.md)* (GitHub) – Describes the three options that Amazon EMR provides for viewing the Spark user interface.

1. *[Spark troubleshooting](https://github.com/aws/aws-emr-best-practices/blob/main/website/docs/bestpractices/Applications/Spark/troubleshooting.md)* (GitHub) – Resolutions for common Spark errors.

1. *[EMR Cluster monitoring](https://aws.github.io/aws-emr-best-practices/docs/bestpractices/Observability/best_practices/)* (GitHub) – Monitoring cluster performance.

1. *[Troubleshooting EMR](https://github.com/aws/aws-emr-best-practices/blob/main/website/docs/bestpractices/Troubleshooting/Troubleshooting%20EMR.md)* (GitHub) – Identify, diagnose, and resolve common EMR cluster problems.

1. *[Cost optimization](https://github.com/aws/aws-emr-best-practices/blob/main/website/docs/bestpractices/Cost%20Optimizations/best_practices.md)* (GitHub) – Best practices for running cost-effective EMR workloads.

## Performance optimization tools for Apache Spark applications
<a name="performance-optimization"></a>

1. The [EMR Advisor](https://github.com/aws-samples/aws-emr-advisor) tool analyzes Spark event logs to provide tailored recommendations for optimizing EMR cluster configurations, enhancing performance, and reducing costs. By leveraging historical data, it suggests suitable executor sizes and infrastructure settings, enabling more efficient resource utilization and improved overall cluster performance.

1. The [Amazon CodeGuru Profiler for Spark](https://github.com/amzn/amazon-codeguru-profiler-for-spark) tool helps developers identify performance bottlenecks and inefficiencies in their Spark applications by collecting and analyzing runtime data. The tool integrates with existing Spark applications with minimal setup, and provides detailed insights through the AWS Management Console about CPU usage, memory patterns, and performance hotspots.