

# Plan, configure and launch Amazon EMR clusters


This section explains configuration options and instructions for planning, configuring, and launching clusters using Amazon EMR. Before you launch a cluster, you make choices about your system based on the data that you're processing and your requirements for cost, speed, capacity, availability, security, and manageability. Your choices include: 
+ What region to run a cluster in, where and how to store data, and how to output results. See [Configure Amazon EMR cluster location and data storage](emr-cluster-location-data-storage.md).
+ Whether you are running Amazon EMR clusters on Outposts or Local Zones. See [EMR clusters on AWS Outposts](emr-plan-outposts.md) or [EMR clusters on AWS Local Zones](emr-plan-localzones.md).
+ Whether a cluster is long-running or transient, and what software it runs. See [Configuring an Amazon EMR cluster to continue or terminate after step execution](emr-plan-longrunning-transient.md) and [Configure applications when you launch your Amazon EMR cluster](emr-plan-software.md).
+ Whether a cluster has a single primary node or three primary nodes. See [Plan and configure primary nodes in your Amazon EMR cluster](emr-plan-ha.md).
+ The hardware and networking options that optimize cost, performance, and availability for your application. See [Configure Amazon EMR cluster hardware and networking](emr-plan-instances.md).
+ How to set up clusters so you can manage them more easily, and monitor activity, performance, and health. See [Configure Amazon EMR cluster logging and debugging](emr-plan-debugging.md) and [Tag and categorize Amazon EMR cluster resources](emr-plan-tags.md).
+ How to authenticate and authorize access to cluster resources, and how to encrypt data. See [Security in Amazon EMR](emr-security.md).
+ How to integrate with other software and services. See [Drivers and third-party application integration on Amazon EMR](emr-plan-third-party.md).

# Launch an Amazon EMR cluster quickly


Follow these steps to launch an Amazon EMR cluster in just a few minutes.

**To quickly launch a cluster with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. On the **Create Cluster** page, enter or select values for the provided fields. The persistent summary panel displays a real-time view of your currently selected cluster options. Select a heading in the summary panel to navigate to the corresponding section and make adjustments. Your cluster name can't contain the characters <, >, $, |, or ` (backtick). You must complete all required configurations before you can choose **Create cluster**.

1. Choose **Create cluster** to accept the configuration as shown. 

1. The cluster details page opens. Find the cluster **Status** next to the cluster name. The status should change from **Starting** to **Running** to **Waiting** during the cluster creation process. You might need to choose the refresh icon on the upper right or refresh your browser to receive updates.

   When the status changes to **Waiting**, your cluster is up, running, and ready to accept steps and SSH connections.
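
For repeatable launches, the same quick-start can be scripted. The following sketch assembles a minimal request in the shape that the boto3 EMR client's `run_job_flow` method accepts; the cluster name, log-bucket URI, and instance sizing are illustrative placeholders, not values this guide prescribes.

```python
# Sketch: build a minimal RunJobFlow request (boto3 run_job_flow kwargs).
# All concrete values below (name, log URI, instance type/count) are
# hypothetical placeholders -- substitute your own.
def build_run_job_flow_request(name, release_label, log_uri,
                               instance_type, instance_count):
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "LogUri": log_uri,
        "Instances": {
            "InstanceGroups": [
                # One primary (master) node ...
                {"InstanceRole": "MASTER", "InstanceType": instance_type,
                 "InstanceCount": 1},
                # ... and the remaining instances as core nodes.
                {"InstanceRole": "CORE", "InstanceType": instance_type,
                 "InstanceCount": instance_count - 1},
            ],
            # Keep the cluster in the Waiting state after steps finish.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

request = build_run_job_flow_request(
    "Test cluster", "emr-7.12.0", "s3://amzn-s3-demo-bucket1/logs/",
    "m5.xlarge", 3)
# To launch: boto3.client("emr").run_job_flow(**request)
```

Because `KeepJobFlowAliveWhenNoSteps` is set, a cluster launched from this request stays in the **Waiting** state after startup, matching the final step above.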

# Configure Amazon EMR cluster location and data storage


This section describes how to configure the Region for a cluster, the file systems available when you use Amazon EMR, and how to use them. It also covers how to prepare or upload input data if necessary, and how to prepare an output location for log files and any output data files that you configure.

**Topics**
+ [Choose an AWS Region for your Amazon EMR cluster](emr-plan-region.md)
+ [Working with storage and file systems with Amazon EMR](emr-plan-file-systems.md)
+ [Prepare input data for processing with Amazon EMR](emr-plan-input.md)
+ [Configure a location for Amazon EMR cluster output](emr-plan-output.md)

# Choose an AWS Region for your Amazon EMR cluster


Amazon Web Services runs on servers in data centers around the world. Data centers are organized by geographical Region. When you launch an Amazon EMR cluster, you must specify a Region. You might choose a Region to reduce latency, minimize costs, or address regulatory requirements. For the list of Regions and endpoints supported by Amazon EMR, see [Regions and endpoints](https://docs.aws.amazon.com/general/latest/gr/rande.html#emr_region) in the *Amazon Web Services General Reference*. 

For best performance, you should launch the cluster in the same Region as your data. For example, if the Amazon S3 bucket storing your input data is in the US West (Oregon) Region, you should launch your cluster in the US West (Oregon) Region to avoid cross-Region data transfer fees. If you use an Amazon S3 bucket to receive the output of the cluster, you would also want to create it in the US West (Oregon) Region. 

If you plan to associate an Amazon EC2 key pair with the cluster (required for using SSH to log on to the master node), the key pair must be created in the same Region as the cluster. Similarly, the security groups that Amazon EMR creates to manage the cluster are created in the same Region as the cluster. 

If you signed up for an AWS account on or after May 17, 2017, the default Region when you access a resource from the AWS Management Console is US East (Ohio) (us-east-2); for older accounts, the default Region is either US West (Oregon) (us-west-2) or US East (N. Virginia) (us-east-1). For more information, see [Regions and Endpoints](https://docs.aws.amazon.com/general/latest/gr/rande.html). 

Some AWS features are available only in limited Regions. For example, Cluster Compute instances are available only in the US East (N. Virginia) Region, and the Asia Pacific (Sydney) Region supports only Hadoop 1.0.3 and later. When choosing a Region, check that it supports the features you want to use.

For best performance, use the same Region for all of the AWS resources that you use with the cluster. For a list of Amazon EMR Regions, see [AWS Regions and endpoints](https://docs.aws.amazon.com/general/latest/gr/rande.html#emr_region) in the *Amazon Web Services General Reference*.

## Choose a Region with the console


Your default Region is displayed to the left of your account information on the navigation bar. To switch Regions, choose the Region dropdown menu and select a new option.

## Specify a Region with the AWS CLI


Specify a default Region in the AWS CLI using either the **aws configure** command or the `AWS_DEFAULT_REGION` environment variable. For more information, see [Configuring the AWS Region](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html#cli-installing-specifying-region) in the *AWS Command Line Interface User Guide*.

## Choose a Region with an SDK or the API


To choose a Region using an SDK, configure your application to use that Region's endpoint. If you are creating a client application using an AWS SDK, you can change the client endpoint by calling `setEndpoint`, as shown in the following example:

```
client.setEndpoint("elasticmapreduce.us-west-2.amazonaws.com");
```

After your application has specified a Region by setting the endpoint, you can set the Availability Zone for your cluster's EC2 instances. Availability Zones are distinct geographical locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. A Region contains one or more Availability Zones. To optimize performance and reduce latency, all resources should be located in the same Availability Zone as the cluster that uses them. 
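
As a sketch of the endpoint convention in the example above (an assumption based on the legacy `elasticmapreduce.<region>.amazonaws.com` naming; newer SDKs typically derive the endpoint from a configured Region instead):

```python
# Build a legacy region-specific Amazon EMR endpoint string.
# The naming pattern is assumed from the setEndpoint example above.
def emr_endpoint(region):
    return f"elasticmapreduce.{region}.amazonaws.com"

# client.setEndpoint(emr_endpoint("us-west-2")) would target US West (Oregon).
assert emr_endpoint("us-west-2") == "elasticmapreduce.us-west-2.amazonaws.com"
```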

# Working with storage and file systems with Amazon EMR

Amazon EMR and Hadoop provide a variety of file systems that you can use when processing cluster steps. You specify which file system to use by the prefix of the URI used to access the data. For example, `s3://amzn-s3-demo-bucket1/path` references an Amazon S3 bucket using S3A (as of the EMR-7.10.0 release). The following table lists the available file systems, with recommendations about when it's best to use each one.

Amazon EMR and Hadoop typically use two or more of the following file systems when processing a cluster. HDFS and S3A are the two main file systems used with Amazon EMR.

**Important**  
Beginning with Amazon EMR release 5.22.0, Amazon EMR uses AWS Signature Version 4 exclusively to authenticate requests to Amazon S3. Earlier Amazon EMR releases use AWS Signature Version 2 in some cases, unless the release notes indicate that Signature Version 4 is used exclusively. For more information, see [Authenticating Requests (AWS Signature Version 4)](https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html) and [Authenticating Requests (AWS Signature Version 2)](https://docs.aws.amazon.com/AmazonS3/latest/API/auth-request-sig-v2.html) in the *Amazon Simple Storage Service Developer Guide*.


| File system | Prefix | Description | 
| --- | --- | --- | 
| HDFS | hdfs:// (or no prefix) |  HDFS is a distributed, scalable, and portable file system for Hadoop. An advantage of HDFS is data awareness between the Hadoop cluster nodes managing the cluster and the Hadoop cluster nodes managing the individual steps. For more information, see the [Hadoop documentation](http://hadoop.apache.org/docs/stable).  HDFS is used by the master and core nodes. One advantage is that it's fast; a disadvantage is that it's ephemeral storage that is reclaimed when the cluster ends. It's best used for caching the results produced by intermediate job-flow steps.   | 
| S3A | s3://, s3a://, s3n:// |  The Hadoop S3A file system is an open-source S3 connector that enables Apache Hadoop and its ecosystem to interact directly with Amazon S3 storage. It allows users to read and write data to S3 buckets using Hadoop-compatible file operations, providing seamless integration between Hadoop applications and cloud storage. Prior to EMR-7.10.0, Amazon EMR used EMRFS for the *s3://* and *s3n://* schemes.   | 
| local file system |  |  The local file system refers to a locally connected disk. When a Hadoop cluster is created, each node is created from an EC2 instance that comes with a preconfigured block of preattached disk storage called an *instance store*. Data on instance store volumes persists only during the life of its EC2 instance. Instance store volumes are ideal for storing temporary data that is continually changing, such as buffers, caches, scratch data, and other temporary content. For more information, see [Amazon EC2 instance storage](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html). The local file system is used by HDFS, but Python also runs from the local file system and you can choose to store additional application files on instance store volumes.  | 
| (Legacy) Amazon S3 block file system | s3bfs:// |  The Amazon S3 block file system is a legacy file storage system. We strongly discourage its use because it can trigger a race condition that might cause your cluster to fail. However, it might be required by legacy applications.   | 

## Access file systems


You specify which file system to use by the prefix of the uniform resource identifier (URI) used to access the data. The following procedures illustrate how to reference several different types of file systems. 

**To access a local HDFS**
+ Specify the `hdfs:///` prefix in the URI. Amazon EMR resolves paths that do not specify a prefix in the URI to the local HDFS. For example, both of the following URIs would resolve to the same location in HDFS. 

  ```
  hdfs:///path-to-data
  /path-to-data
  ```

**To access a remote HDFS**
+ Include the IP address of the master node in the URI, as shown in the following examples. 

  ```
  hdfs://master-ip-address/path-to-data
  master-ip-address/path-to-data
  ```

**To access Amazon S3**
+ Use the `s3://` prefix.

  ```
  s3://bucket-name/path-to-file-in-bucket
  ```

**To access the Amazon S3 block file system**
+ Use only for legacy applications that require the Amazon S3 block file system. To access or store data with this file system, use the `s3bfs://` prefix in the URI. 

  The Amazon S3 block file system is a legacy file system that was used to support uploads to Amazon S3 that were larger than 5 GB. With the multipart upload functionality Amazon EMR provides through the AWS Java SDK, you can upload large files to the Amazon S3 native file system, and the Amazon S3 block file system is deprecated. For more information about multipart upload for EMR, see [Configure multipart upload for Amazon S3](#Config_Multipart). For more information about S3 object-size and part-size limits, see [Amazon S3 multipart upload limits](https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html) in the *Amazon Simple Storage Service User Guide*.
**Warning**  
Because this legacy file system can create race conditions that can corrupt the file system, you should avoid this format and use EMRFS instead. 

  ```
  s3bfs://bucket-name/path-to-file-in-bucket
  ```
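
The prefix-based dispatch described in this section can be sketched as a simple lookup. The mapping below follows the file-system table above, assuming an EMR-7.10.0 or later release where `s3://`, `s3a://`, and `s3n://` all resolve to the S3A connector:

```python
from urllib.parse import urlparse

# URI scheme -> file system, per the table above (EMR-7.10.0+ assumed).
SCHEME_TO_FILESYSTEM = {
    "hdfs": "HDFS",
    "": "HDFS",  # paths with no prefix resolve to the local HDFS
    "s3": "S3A",
    "s3a": "S3A",
    "s3n": "S3A",
    "file": "local file system",
    "s3bfs": "legacy Amazon S3 block file system",  # avoid; legacy apps only
}

def filesystem_for(uri):
    """Return the file system Amazon EMR would use for this URI."""
    return SCHEME_TO_FILESYSTEM[urlparse(uri).scheme]

assert filesystem_for("hdfs:///path-to-data") == "HDFS"
assert filesystem_for("/path-to-data") == "HDFS"
assert filesystem_for("s3://amzn-s3-demo-bucket1/path") == "S3A"
```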

# Prepare input data for processing with Amazon EMR


Most clusters load input data and then process that data. To load data, it must be in a location that the cluster can access and in a format that the cluster can process. The most common scenario is to upload input data into Amazon S3. Amazon EMR provides tools for your cluster to import or read data from Amazon S3.

The default input format in Hadoop is text files, though you can customize Hadoop and use tools to import data stored in other formats. 

**Topics**
+ [Types of input Amazon EMR can accept](emr-plan-input-accept.md)
+ [Different ways to get data into Amazon EMR](emr-plan-get-data-in.md)

# Types of input Amazon EMR can accept


The default input format for a cluster is text files with each line separated by a newline (\n) character, which is the input format most commonly used. 
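
As a minimal illustration of that default, line-oriented format (mirroring what Hadoop's `TextInputFormat` does with plain text, one record per line):

```python
# Split newline-delimited text into records, one per line, skipping the
# empty trailing entry that a final newline would otherwise produce.
def split_records(text):
    return [line for line in text.split("\n") if line]

assert split_records("a,1\nb,2\nc,3\n") == ["a,1", "b,2", "c,3"]
```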

If your input data is in a format other than the default text files, you can use the Hadoop interface `InputFormat` to specify other input types. You can even create a subclass of the `FileInputFormat` class to handle custom data types. For more information, see [http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/InputFormat.html](http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/InputFormat.html). 

If you are using Hive, you can use a serializer/deserializer (SerDe) to read data in from a given format into HDFS. For more information, see [https://cwiki.apache.org/confluence/display/Hive/SerDe](https://cwiki.apache.org/confluence/display/Hive/SerDe). 

# Different ways to get data into Amazon EMR


Amazon EMR provides several ways to get data onto a cluster. The most common way is to upload the data to Amazon S3 and use the built-in features of Amazon EMR to load the data onto your cluster. You can also use the DistributedCache feature of Hadoop to transfer files from a distributed file system to the local file system. The implementation of Hive provided by Amazon EMR (Hive version 0.7.1.1 and later) includes functionality that you can use to import and export data between DynamoDB and an Amazon EMR cluster. If you have large amounts of on-premises data to process, you might find AWS Direct Connect useful. 

**Topics**
+ [Upload data to Amazon S3](emr-plan-upload-s3.md)
+ [Upload data with AWS DataSync](emr-plan-upload-datasync.md)
+ [Import files with distributed cache with Amazon EMR](emr-plan-input-distributed-cache.md)
+ [Detecting and processing compressed files with Amazon EMR](HowtoProcessGzippedFiles.md)
+ [Import DynamoDB data into Hive with Amazon EMR](emr-plan-input-dynamodb.md)
+ [Connect to data with AWS Direct Connect from Amazon EMR](emr-plan-input-directconnect.md)
+ [Upload large amounts of data for Amazon EMR with AWS Snowball Edge](emr-plan-input-snowball.md)

# Upload data to Amazon S3


For information on how to upload objects to Amazon S3, see [Add an object to your bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/PuttingAnObjectInABucket.html) in the *Amazon Simple Storage Service User Guide*. For more information about using Amazon S3 with Hadoop, see [http://wiki.apache.org/hadoop/AmazonS3](http://wiki.apache.org/hadoop/AmazonS3). 

**Topics**
+ [Create and configure an Amazon S3 bucket](#create-s3-bucket-input)
+ [Configure multipart upload for Amazon S3](#Config_Multipart)
+ [Best practices](#emr-bucket-bestpractices)
+ [Upload data to Amazon S3 Express One Zone](emr-express-one-zone.md)

## Create and configure an Amazon S3 bucket

Amazon EMR uses the AWS SDK for Java with Amazon S3 to store input data, log files, and output data. Amazon S3 refers to these storage locations as *buckets*. Buckets have certain restrictions and limitations to conform with Amazon S3 and DNS requirements. For more information, see [Bucket restrictions and limitations](https://docs.aws.amazon.com/AmazonS3/latest/userguide/BucketRestrictions.html) in the *Amazon Simple Storage Service User Guide*.
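
As a rough sketch of the naming side of those restrictions, the check below covers only the general DNS-style rules (3-63 characters; lowercase letters, digits, hyphens, and dots; starting and ending with a letter or digit). It is not the full rule set; for example, it does not reject names formatted like IP addresses.

```python
import re

# General-purpose S3 bucket name pattern: 3-63 chars, lowercase
# letters/digits/dots/hyphens, alphanumeric at both ends.
# Partial sketch only -- see the bucket restrictions page for all rules.
BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")

def is_plausible_bucket_name(name):
    return BUCKET_NAME.fullmatch(name) is not None

assert is_plausible_bucket_name("amzn-s3-demo-bucket1")
assert not is_plausible_bucket_name("Uppercase-Bucket")  # uppercase not allowed
```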

This section shows you how to use the Amazon S3 console to create an Amazon S3 bucket and then set permissions for it. You can also create and set permissions for a bucket using the Amazon S3 API or the AWS CLI, or use curl with a wrapper that passes the appropriate authentication parameters to Amazon S3.

See the following resources:
+ To create a bucket using the console, see [Create a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket.html) in the *Amazon S3 User Guide*.
+ To create and work with buckets using the AWS CLI, see [Using high-level S3 commands with the AWS Command Line Interface](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-s3-commands.html) in the *Amazon S3 User Guide*.
+ To create a bucket using an SDK, see [Examples of creating a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-get-location-example.html) in the *Amazon Simple Storage Service User Guide*.
+ To work with buckets using curl, see [Amazon S3 authentication tool for curl](https://aws.amazon.com/code/amazon-s3-authentication-tool-for-curl/).
+ For more information on specifying Region-specific buckets, see [Accessing a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingBucket.html#access-bucket-intro) in the *Amazon Simple Storage Service User Guide*.
+ To work with buckets using Amazon S3 Access Points, see [Using a bucket-style alias for your access point](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points-alias.html) in the *Amazon S3 User Guide*. You can use an Amazon S3 Access Point alias in place of the bucket name for both existing and new applications, including Spark, Hive, Presto, and others.

**Note**  
If you enable logging for a bucket, it enables only bucket access logs, not Amazon EMR cluster logs. 

During bucket creation or after, you can set the appropriate permissions to access the bucket depending on your application. Typically, you give yourself (the owner) read and write access and give authenticated users read access.

Required Amazon S3 buckets must exist before you can create a cluster. You must upload any required scripts or data referenced in the cluster to Amazon S3. 

## Configure multipart upload for Amazon S3

Amazon EMR supports Amazon S3 multipart upload through the AWS SDK for Java. Multipart upload lets you upload a single object as a set of parts. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles the parts and creates the object.

For more information, see [Multipart upload overview](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html) in the *Amazon Simple Storage Service User Guide*.

In addition, Amazon EMR offers properties that allow you to more precisely control the clean-up of failed multipart upload parts.

The following table describes the Amazon EMR configuration properties for multipart upload. You can configure these using the `core-site` configuration classification. For more information, see [Configure applications](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/configure-apps.html) in the *Amazon EMR Release Guide*.


| Configuration parameter name | Default value | Description | 
| --- | --- | --- | 
| fs.s3n.multipart.uploads.enabled | true | A Boolean type that indicates whether to enable multipart uploads. When EMRFS consistent view is enabled, multipart uploads are enabled by default and setting this value to false is ignored. | 
| fs.s3n.multipart.uploads.split.size | 134217728 | Specifies the maximum size of a part, in bytes, before EMRFS starts a new part upload when multipart uploads are enabled. The minimum value is `5242880` (5 MB); if a lesser value is specified, `5242880` is used. The maximum is `5368709120` (5 GB); if a greater value is specified, `5368709120` is used. If EMRFS client-side encryption and the Amazon S3 optimized committer are both disabled, this value also controls the maximum size that a data file can reach before EMRFS uses a multipart upload rather than a `PutObject` request to upload the file. | 
| fs.s3n.ssl.enabled | true | A Boolean type that indicates whether to use HTTPS (`true`) or HTTP (`false`) to communicate with Amazon S3. | 
| fs.s3.buckets.create.enabled | false | A Boolean type that indicates whether a bucket should be created if it does not exist. Setting to false causes an exception on CreateBucket operations. | 
| fs.s3.multipart.clean.enabled | false | A Boolean type that indicates whether to enable background periodic clean-up of incomplete multipart uploads. | 
| fs.s3.multipart.clean.age.threshold | 604800 | A long type that specifies the minimum age of a multipart upload, in seconds, before it is considered for cleanup. The default is one week. | 
| fs.s3.multipart.clean.jitter.max | 10000 | An integer type that specifies the maximum amount of random jitter delay, in seconds, added to the 15-minute fixed delay before scheduling the next round of clean-up. | 
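
The clamping behavior that `fs.s3n.multipart.uploads.split.size` describes can be sketched with a hypothetical helper; the bounds come from the table above:

```python
MIN_PART_SIZE = 5_242_880       # 5 MB floor for the split size
MAX_PART_SIZE = 5_368_709_120   # 5 GB ceiling for the split size

def effective_split_size(configured_bytes):
    """Clamp a configured fs.s3n.multipart.uploads.split.size to EMRFS bounds."""
    return max(MIN_PART_SIZE, min(configured_bytes, MAX_PART_SIZE))

assert effective_split_size(134_217_728) == 134_217_728   # default kept as-is
assert effective_split_size(1_000_000) == MIN_PART_SIZE   # too small: raised
assert effective_split_size(10**12) == MAX_PART_SIZE      # too large: lowered
```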

### Disable multipart uploads


------
#### [ Console ]

**To disable multipart uploads with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. Under **Software settings**, enter the following configuration: `classification=core-site,properties=[fs.s3n.multipart.uploads.enabled=false]`.

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.

------
#### [ CLI ]

**To disable multipart upload using the AWS CLI**

This procedure explains how to disable multipart upload using the AWS CLI. To disable multipart upload, use the `create-cluster` command with the `--configurations` parameter. 

1. Create a file, `myConfig.json`, with the following contents and save it in the same directory where you run the command:

   ```
   [
     {
       "Classification": "core-site",
       "Properties": {
         "fs.s3n.multipart.uploads.enabled": "false"
       }
     }
   ]
   ```

1. Type the following command and replace *myKey* with the name of your EC2 key pair.
**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

   ```
   aws emr create-cluster --name "Test cluster" \
   --release-label emr-7.12.0 --applications Name=Hive Name=Pig \
   --use-default-roles --ec2-attributes KeyName=myKey --instance-type m5.xlarge \
   --instance-count 3 --configurations file://myConfig.json
   ```

------
#### [ API ]

**To disable multipart upload using the API**
+ For information on using Amazon S3 multipart uploads programmatically, see [Using the AWS SDK for Java for multipart upload](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMPDotJavaAPI.html) in the *Amazon Simple Storage Service User Guide*.

  For more information about the AWS SDK for Java, see [AWS SDK for Java](https://aws.amazon.com/sdkforjava/).

------

## Best practices


The following are recommendations for using Amazon S3 buckets with EMR clusters.

### Enable versioning


Versioning is a recommended configuration for your Amazon S3 bucket. By enabling versioning, you ensure that even if data is unintentionally deleted or overwritten, it can be recovered. For more information, see [Using versioning](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html) in the *Amazon Simple Storage Service User Guide*.

### Clean up failed multipart uploads


EMR cluster components use multipart uploads via the AWS SDK for Java with Amazon S3 APIs to write log files and output data to Amazon S3 by default. For information about changing properties related to this configuration using Amazon EMR, see [Configure multipart upload for Amazon S3](#Config_Multipart). Sometimes the upload of a large file can result in an incomplete Amazon S3 multipart upload. When a multipart upload is unable to complete successfully, the in-progress multipart upload continues to occupy your bucket and incurs storage charges. We recommend the following options to avoid excessive file storage:
+ For buckets that you use with Amazon EMR, use a lifecycle configuration rule in Amazon S3 to remove incomplete multipart uploads three days after the upload initiation date. Lifecycle configuration rules allow you to control the storage class and lifetime of objects. For more information, see [Object lifecycle management](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) and [Aborting incomplete multipart uploads using a bucket lifecycle policy](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html#mpu-abort-incomplete-mpu-lifecycle-config).
+ Enable Amazon EMR's multipart cleanup feature by setting `fs.s3.multipart.clean.enabled` to `true` and tuning other cleanup parameters. This feature is useful at high volume, at large scale, and with clusters that have limited uptime. In this case, the `DaysAfterInitiation` parameter of a lifecycle configuration rule may be too long, even if set to its minimum, causing spikes in Amazon S3 storage. Amazon EMR's multipart cleanup allows more precise control. For more information, see [Configure multipart upload for Amazon S3](#Config_Multipart). 
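
The first option above can be sketched as follows. The rule body uses the shape that the S3 `PutBucketLifecycleConfiguration` API expects; the rule ID and bucket name are placeholders:

```python
# Lifecycle rule that aborts incomplete multipart uploads after 3 days.
abort_rule = {
    "ID": "abort-incomplete-multipart-uploads",   # placeholder rule ID
    "Status": "Enabled",
    "Filter": {"Prefix": ""},                     # apply to the whole bucket
    "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 3},
}
lifecycle_configuration = {"Rules": [abort_rule]}

# To apply with boto3 (not run here):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="amzn-s3-demo-bucket1",
#     LifecycleConfiguration=lifecycle_configuration)
```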

### Manage version markers


We recommend that you enable a lifecycle configuration rule in Amazon S3 to remove expired object delete markers for versioned buckets that you use with Amazon EMR. When you delete an object in a versioned bucket, a delete marker is created. If all previous versions of the object subsequently expire, an expired object delete marker is left in the bucket. While you are not charged for delete markers, removing expired markers can improve the performance of LIST requests. For more information, see [Lifecycle configuration for a bucket with versioning](https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-bucket-with-versioning.html) in the *Amazon Simple Storage Service User Guide*.

### Performance best practices


Depending on your workloads, specific types of usage of EMR clusters and applications on those clusters can result in a high number of requests against a bucket. For more information, see [Request rate and performance considerations](https://docs.aws.amazon.com/AmazonS3/latest/userguide/request-rate-perf-considerations.html) in the *Amazon Simple Storage Service User Guide*. 

# Upload data to Amazon S3 Express One Zone

## Overview


With Amazon EMR 6.15.0 and higher, you can use Amazon EMR with Apache Spark in conjunction with the [Amazon S3 Express One Zone](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-express-one-zone.html) storage class for improved performance on your Spark jobs. Amazon EMR releases 7.2.0 and higher also support HBase, Flink, and Hive, so you can also benefit from S3 Express One Zone if you use these applications. *S3 Express One Zone* is an S3 storage class for applications that frequently access data with hundreds of thousands of requests per second. At the time of its release, S3 Express One Zone delivers the lowest latency and highest performance cloud object storage in Amazon S3. 

## Prerequisites

+ **S3 Express One Zone permissions** – When S3 Express One Zone initially performs an action like `GET`, `LIST`, or `PUT` on an S3 object, the storage class calls `CreateSession` on your behalf. Your IAM policy must allow the `s3express:CreateSession` permission so that the S3A connector can invoke the `CreateSession` API. For an example policy with this permission, see [Getting started with Amazon S3 Express One Zone](#emr-express-one-zone-start).
+ **S3A connector** – To configure your Spark cluster to access data from an Amazon S3 bucket that uses the S3 Express One Zone storage class, you must use the Apache Hadoop connector S3A. To use the connector, ensure all S3 URIs use the `s3a` scheme. If they don’t, you can change the filesystem implementation that you use for `s3` and `s3n` schemes.

To change the `s3` scheme, specify the following cluster configurations: 

```
[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
      "fs.AbstractFileSystem.s3.impl": "org.apache.hadoop.fs.s3a.S3A"
    }
  }
]
```

To change the `s3n` scheme, specify the following cluster configurations: 

```
[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.s3n.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
      "fs.AbstractFileSystem.s3n.impl": "org.apache.hadoop.fs.s3a.S3A"
    }
  }
]
```

## Getting started with Amazon S3 Express One Zone

**Topics**
+ [Create a permission policy](#emr-express-one-zone-permissions)
+ [Create and configure your cluster](#emr-express-one-zone-create)
+ [Configurations overview](#emr-express-one-zone-configs)

### Create a permission policy


Before you can create a cluster that uses Amazon S3 Express One Zone, you must create an IAM policy to attach to the Amazon EC2 instance profile for the cluster. The policy must have permissions to access the S3 Express One Zone storage class. The following example policy shows how to grant the required permission. After you create the policy, attach the policy to the instance profile role that you use to create your EMR cluster, as described in the [Create and configure your cluster](#emr-express-one-zone-create) section.

------
#### [ JSON ]


```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3express:*:123456789012:bucket/example-s3-bucket"
      ],
      "Action": [
        "s3express:CreateSession"
      ],
      "Sid": "AllowS3ExpressCreateSession"
    }
  ]
}
```

------
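Because only the account ID and bucket name vary between deployments, you can also generate this policy programmatically. The following Python sketch does so; the account ID and bucket name passed in at the bottom are placeholders, and the `Sid` is an arbitrary identifier:

```python
import json

def s3express_session_policy(account_id: str, bucket: str) -> str:
    """Build the IAM policy that allows S3A to call s3express:CreateSession."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowS3ExpressCreateSession",
                "Effect": "Allow",
                "Action": ["s3express:CreateSession"],
                "Resource": [f"arn:aws:s3express:*:{account_id}:bucket/{bucket}"],
            }
        ],
    }
    return json.dumps(policy, indent=2)

# Placeholder values for illustration only; substitute your own.
print(s3express_session_policy("123456789012", "example-s3-bucket"))
```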

### Create and configure your cluster


Next, create a cluster that runs Spark, HBase, Flink, or Hive with S3 Express One Zone. The following steps provide a high-level overview of how to create a cluster in the AWS Management Console:

1. Navigate to the Amazon EMR console and select **Clusters** from the sidebar. Then choose **Create cluster**.

1. If you use Spark, select Amazon EMR release `emr-6.15.0` or higher. If you use HBase, Flink, or Hive, select `emr-7.2.0` or higher.

1. Select the applications that you want to include on your cluster, such as Spark, HBase, or Flink.

1. To enable Amazon S3 Express One Zone, enter a configuration similar to the following example in the **Software settings** section. The configurations and recommended values are described in the [Configurations overview](#emr-express-one-zone-configs) section that follows this procedure.

   ```
   [
     {
       "Classification": "core-site",
       "Properties": {
         "fs.s3a.aws.credentials.provider": "software.amazon.awssdk.auth.credentials.InstanceProfileCredentialsProvider",
         "fs.s3a.change.detection.mode": "none",
         "fs.s3a.endpoint.region": "aa-example-1",
         "fs.s3a.select.enabled": "false"
       }
     },
     {
       "Classification": "spark-defaults",
       "Properties": {
         "spark.sql.sources.fastS3PartitionDiscovery.enabled": "false"
       }
     }
   ]
   ```

1. In the **EC2 instance profile for Amazon EMR** section, choose an existing role that has the policy from the [Create a permission policy](#emr-express-one-zone-permissions) section attached.

1. Configure the rest of your cluster settings as appropriate for your application, and then select **Create cluster**.
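The console steps above can also be scripted. The following Python sketch assembles the request parameters that you might pass to the EMR `RunJobFlow` API (for example, through an AWS SDK); the cluster name, Region value, and role names are placeholders:

```python
# Configurations from step 4; "aa-example-1" is a placeholder Region.
configurations = [
    {
        "Classification": "core-site",
        "Properties": {
            "fs.s3a.aws.credentials.provider": "software.amazon.awssdk.auth.credentials.InstanceProfileCredentialsProvider",
            "fs.s3a.change.detection.mode": "none",
            "fs.s3a.endpoint.region": "aa-example-1",
            "fs.s3a.select.enabled": "false",
        },
    },
    {
        "Classification": "spark-defaults",
        "Properties": {"spark.sql.sources.fastS3PartitionDiscovery.enabled": "false"},
    },
]

# Request parameters for RunJobFlow; names and roles are placeholders.
cluster_request = {
    "Name": "s3express-spark-cluster",
    "ReleaseLabel": "emr-6.15.0",  # use emr-7.2.0 or higher for HBase, Flink, or Hive
    "Applications": [{"Name": "Spark"}],
    "Configurations": configurations,
    "JobFlowRole": "EMR_EC2_S3Express_InstanceProfile",  # role with the CreateSession policy
    "ServiceRole": "EMR_DefaultRole",
    "Instances": {
        "InstanceCount": 3,
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
    },
}
```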

### Configurations overview


The following tables describe the configurations and suggested values that you should specify when you set up a cluster that uses S3 Express One Zone with Amazon EMR, as described in the [Create and configure your cluster](#emr-express-one-zone-create) section.

**S3A configurations**


| Parameter | Default value | Suggested value | Explanation | 
| --- | --- | --- | --- | 
|  `fs.s3a.aws.credentials.provider`  |  If not specified, uses `AWSCredentialProviderList` in the following order: `TemporaryAWSCredentialsProvider`, `SimpleAWSCredentialsProvider`, `EnvironmentVariableCredentialsProvider`, `IAMInstanceCredentialsProvider`.  |  <pre>software.amazon.awssdk.auth.credentials.InstanceProfileCredentialsProvider</pre>  |  The Amazon EMR instance profile role should have the policy that allows the S3A filesystem to call `s3express:CreateSession`. Other credential providers also work if they have the S3 Express One Zone permissions.  | 
|  `fs.s3a.endpoint.region`  |  null  |  The AWS Region where you created the bucket.  |  Region resolution logic doesn't work with S3 Express One Zone storage class.  | 
|  `fs.s3a.select.enabled`  |  `true`  |  `false`  |  Amazon S3 `select` is not supported with S3 Express One Zone storage class.  | 
|  `fs.s3a.change.detection.mode`  |  `server`  |  `none`  |  Change detection in S3A relies on MD5-based ETags. The S3 Express One Zone storage class doesn't support MD5 checksums.  | 

**Spark configurations**


| Parameter | Default value | Suggested value | Explanation | 
| --- | --- | --- | --- | 
|  `spark.sql.sources.fastS3PartitionDiscovery.enabled`  |  `true`  |  `false`  |  The internal optimization uses an S3 API parameter that the S3 Express One Zone storage class doesn't support.  | 

**Hive configurations**


| Parameter | Default value | Suggested value | Explanation | 
| --- | --- | --- | --- | 
|  `hive.exec.fast.s3.partition.discovery.enabled`  |  `true`  |  `false`  |  The internal optimization uses an S3 API parameter that the S3 Express One Zone storage class doesn't support.  | 

## Considerations


Consider the following when you integrate Apache Spark on Amazon EMR with the S3 Express One Zone storage class:
+ The S3A connector is required to use S3 Express One Zone with Amazon EMR. S3A is the only connector that implements the features required to interact with the S3 Express One Zone storage class. For steps to set up the connector, see [Prerequisites](#emr-express-one-zone-prereqs).
+ The Amazon S3 Express One Zone storage class supports SSE-S3 and SSE-KMS encryption. For more information, see [Server-side encryption with Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-express-data-protection.html#s3-express-ecnryption).
+ The Amazon S3 Express One Zone storage class does not support writes with the S3A `FileOutputCommitter`. Writes with the S3A `FileOutputCommitter` on S3 Express One Zone buckets result in an error: *InvalidStorageClass: The storage class you specified is not valid*.
+ Amazon S3 Express One Zone is supported with Amazon EMR releases 6.15.0 and higher on EMR on EC2. Additionally, it's supported on Amazon EMR releases 7.2.0 and higher on Amazon EMR on EKS and on Amazon EMR Serverless.

# Upload data with AWS DataSync


AWS DataSync is an online data transfer service that simplifies, automates, and accelerates the process of moving data between your on-premises storage and AWS storage services or between AWS storage services. DataSync supports a variety of on-premises storage systems such as Hadoop Distributed File System (HDFS), NAS file servers, and self-managed object storage.

The most common way to get data onto a cluster is to upload the data to Amazon S3 and use the built-in features of Amazon EMR to load the data onto your cluster.

DataSync can help you accomplish the following tasks:
+ Replicate HDFS on your Hadoop cluster to Amazon S3 for business continuity
+ Copy HDFS to Amazon S3 to populate your data lakes
+ Transfer data between your Hadoop cluster's HDFS and Amazon S3 for analysis and processing

To upload data to your S3 bucket, you first deploy one or more DataSync agents in the same network as your on-premises storage. An *agent* is a virtual machine (VM) that is used to read data from or write data to a self-managed location. You then activate your agents in the AWS account and AWS Region where your S3 bucket is located.

After your agent is activated, you create a source location for your on-premises storage, a destination location for your S3 bucket, and a task. A *task* is a set of two locations (source and destination) and a set of default options that you use to control the behavior of the task.

Finally, you run your DataSync task to transfer data from the source to the destination. 

For more information, see [Getting started with AWS DataSync](https://docs.aws.amazon.com/datasync/latest/userguide/getting-started.html).

# Import files with distributed cache with Amazon EMR


DistributedCache is a Hadoop feature that can boost efficiency when a map or a reduce task needs access to common data. If your cluster depends on existing applications or binaries that are not installed when the cluster is created, you can use DistributedCache to import these files. This feature lets a cluster node read the imported files from its local file system, instead of retrieving the files from other cluster nodes. 

For more information, go to [http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/filecache/DistributedCache.html](http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/filecache/DistributedCache.html).

You invoke DistributedCache when you create the cluster. The files are cached just before the Hadoop job starts, and the files remain cached for the duration of the job. You can cache files stored on any Hadoop-compatible file system, for example HDFS or Amazon S3. The default size of the file cache is 10 GB. To change the size of the cache, reconfigure the Hadoop parameter `local.cache.size` with a bootstrap action. For more information, see [Create bootstrap actions to install additional software with an Amazon EMR cluster](emr-plan-bootstrap.md).
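For example, the cache size could be raised with the `configure-hadoop` bootstrap action that appears later in this guide. The following Python sketch shows a bootstrap-action specification in the shape that the `RunJobFlow` API accepts; the action name and the 20 GB value are illustrative:

```python
# Bootstrap-action specification that overrides the DistributedCache size.
# The name and the 21474836480 (20 GB) value are illustrative only.
bootstrap_action = {
    "Name": "Increase DistributedCache size",
    "ScriptBootstrapAction": {
        "Path": "s3://elasticmapreduce/bootstrap-actions/configure-hadoop",
        "Args": ["-s", "local.cache.size=21474836480"],  # 20 GB in bytes
    },
}

print(bootstrap_action["ScriptBootstrapAction"]["Args"])
```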

**Topics**
+ [Supported file types](#emr-dev-supported-file-types)
+ [Location of cached files](#locationofcache)
+ [Access cached files from streaming applications](#cachemapper)
+ [Specify distributed cache files when you create a cluster](#cacheinconsole)

## Supported file types


DistributedCache allows both single files and archives. Individual files are cached as read only. Executables and binary files have execution permissions set.

Archives are one or more files packaged using a utility such as `tar` or `zip`. DistributedCache passes the archive to each core node and decompresses it as part of caching. DistributedCache supports the following archive formats:
+ zip
+ tgz
+ tar.gz
+ tar
+ jar

## Location of cached files


DistributedCache copies files to core nodes only. If there are no core nodes in the cluster, DistributedCache copies the files to the primary node.

DistributedCache associates the cache files to the current working directory of the mapper and reducer using symlinks. A symlink is an alias to a file location, not the actual file location. The value of the parameter, `yarn.nodemanager.local-dirs` in `yarn-site.xml`, specifies the location of temporary files. Amazon EMR sets this parameter to `/mnt/mapred`, or some variation based on instance type and EMR version. For example, a setting may have `/mnt/mapred` and `/mnt1/mapred` because the instance type has two ephemeral volumes. Cache files are located in a subdirectory of the temporary file location at `/mnt/mapred/taskTracker/archive`. 

If you cache a single file, DistributedCache puts the file in the `archive` directory. If you cache an archive, DistributedCache decompresses the file and creates a subdirectory in the `archive` directory with the same name as the archive file. The individual files are located in the new subdirectory.

You can use DistributedCache only when using Streaming.

## Access cached files from streaming applications


To access the cached files from your mapper or reducer applications, make sure that you have added the current working directory (./) into your application path and referenced the cached files as though they are present in the current working directory.
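For example, a Python streaming mapper can open a cached file by its cache name alone. The following sketch uses the cache name `sample_dataset_cached.dat` from the `-cacheFile` examples later in this topic; the file is created locally here only to simulate the symlink that DistributedCache places in the task's working directory:

```python
# Simulate the cached file locally; on the cluster, DistributedCache
# symlinks the cached file into the task's working directory.
with open("sample_dataset_cached.dat", "w") as f:
    f.write("apple\tfruit\ncarrot\tvegetable\n")

def load_lookup(cache_name="sample_dataset_cached.dat"):
    """Read a key<TAB>value file by its cache name, relative to the cwd."""
    lookup = {}
    with open(cache_name) as fh:  # relative path: the cwd symlink
        for line in fh:
            key, _, value = line.rstrip("\n").partition("\t")
            lookup[key] = value
    return lookup

print(load_lookup()["apple"])  # fruit
```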

## Specify distributed cache files when you create a cluster


You can use the AWS Management Console and the AWS CLI to create clusters that use Distributed Cache. 

------
#### [ Console ]

**To specify distributed cache files with the new console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. Under **Steps**, choose **Add step**. This opens the **Add step** dialog. In the **Arguments** field, include the files and archives to save to the cache. The size of the file (or total size of the files in an archive file) must be less than the allocated cache size.

   If you want to add an individual file to the distributed cache, specify `-cacheFile`, followed by the name and location of the file, the pound sign (#), and the name you want to give the file when it's placed in the local cache. The following example demonstrates how to add an individual file to the distributed cache.

   ```
   -cacheFile \
   s3://amzn-s3-demo-bucket/file-name#cache-file-name
   ```

   If you want to add an archive file to the distributed cache, enter `-cacheArchive` followed by the location of the files in Amazon S3, the pound sign (#), and then the name you want to give the collection of files in the local cache. The following example demonstrates how to add an archive file to the distributed cache.

   ```
   -cacheArchive \
   s3://amzn-s3-demo-bucket/archive-name#cache-archive-name
   ```

   Enter appropriate values in the other dialog fields. Options differ depending on the step type. To add your step and exit the dialog, choose **Add step**.

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.

------
#### [ CLI ]

**To specify distributed cache files with the AWS CLI**
+ To submit a Streaming step when a cluster is created, type the `create-cluster` command with the `--steps` parameter. To specify distributed cache files using the AWS CLI, specify the appropriate arguments when submitting a Streaming step. 

  If you want to add an individual file to the distributed cache, specify `-cacheFile`, followed by the name and location of the file, the pound sign (#), and the name you want to give the file when it's placed in the local cache. 

  If you want to add an archive file to the distributed cache, enter `-cacheArchive` followed by the location of the files in Amazon S3, the pound sign (#), and then the name you want to give the collection of files in the local cache. Example 2 following this procedure demonstrates how to add an archive file to the distributed cache.

  For more information on using Amazon EMR commands in the AWS CLI, see [https://docs.aws.amazon.com/cli/latest/reference/emr](https://docs.aws.amazon.com/cli/latest/reference/emr).

**Example 1**  
Type the following command to launch a cluster and submit a Streaming step that uses `-cacheFile` to add one file, `sample_dataset_cached.dat`, to the cache.   

```
aws emr create-cluster --name "Test cluster" --release-label emr-4.0.0 --applications Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey --instance-type m5.xlarge --instance-count 3 --steps Type=STREAMING,Name="Streaming program",ActionOnFailure=CONTINUE,Args=["--files","s3://my_bucket/my_mapper.py s3://my_bucket/my_reducer.py","-mapper","my_mapper.py","-reducer","my_reducer.py","-input","s3://my_bucket/my_input","-output","s3://my_bucket/my_output","-cacheFile","s3://my_bucket/sample_dataset.dat#sample_dataset_cached.dat"]
```
When you specify the instance count without using the `--instance-groups` parameter, a single primary node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.  
If you have not previously created the default EMR service role and EC2 instance profile, type `aws emr create-default-roles` to create them before typing the `create-cluster` subcommand.

**Example 2**  
The following command shows the creation of a streaming cluster and uses `-cacheArchive` to add an archive of files to the cache.   

```
aws emr create-cluster --name "Test cluster" --release-label emr-4.0.0 --applications Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey --instance-type m5.xlarge --instance-count 3 --steps Type=STREAMING,Name="Streaming program",ActionOnFailure=CONTINUE,Args=["--files","s3://my_bucket/my_mapper.py s3://my_bucket/my_reducer.py","-mapper","my_mapper.py","-reducer","my_reducer.py","-input","s3://my_bucket/my_input","-output","s3://my_bucket/my_output","-cacheArchive","s3://my_bucket/sample_dataset.tgz#sample_dataset_cached"]
```
When you specify the instance count without using the `--instance-groups` parameter, a single primary node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.  
If you have not previously created the default EMR service role and EC2 instance profile, type `aws emr create-default-roles` to create them before typing the `create-cluster` subcommand.

------

# Detecting and processing compressed files with Amazon EMR


Hadoop checks the file extension to detect compressed files. The compression types supported by Hadoop are: gzip, bzip2, and LZO. You do not need to take any additional action to extract files using these types of compression; Hadoop handles it for you.
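The detection behavior can be sketched as a simple extension lookup. The following Python illustration reflects the common file extensions for these codecs; it is an illustrative model, not Hadoop's actual implementation:

```python
# Map common compressed-file extensions to the codec that handles them.
CODEC_BY_EXTENSION = {
    ".gz": "gzip",
    ".bz2": "bzip2",
    ".lzo": "lzo",
}

def detect_codec(path: str):
    """Return the codec name for a path, or None if it isn't compressed."""
    for ext, codec in CODEC_BY_EXTENSION.items():
        if path.endswith(ext):
            return codec
    return None

print(detect_codec("s3://amzn-s3-demo-bucket/input/logs.gz"))  # gzip
```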

To index LZO files, you can use the hadoop-lzo library which can be downloaded from [https://github.com/kevinweil/hadoop-lzo](https://github.com/kevinweil/hadoop-lzo). Note that because this is a third-party library, Amazon EMR does not offer developer support on how to use this tool. For usage information, see [the hadoop-lzo readme file.](https://github.com/kevinweil/hadoop-lzo/blob/master/README.md) 

# Import DynamoDB data into Hive with Amazon EMR


The implementation of Hive provided by Amazon EMR includes functionality that you can use to import and export data between DynamoDB and an Amazon EMR cluster. This is useful if your input data is stored in DynamoDB. For more information, see [Export, import, query, and join tables in DynamoDB using Amazon EMR](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/EMRforDynamoDB.html). 

# Connect to data with AWS Direct Connect from Amazon EMR


Direct Connect is a service you can use to establish a private dedicated network connection to Amazon Web Services from your data center, office, or colocation environment. If you have large amounts of input data, using Direct Connect may reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections. For more information see the [Direct Connect User Guide](https://docs.aws.amazon.com/directconnect/latest/UserGuide/). 

# Upload large amounts of data for Amazon EMR with AWS Snowball Edge


AWS Snowball Edge is a service you can use to transfer large amounts of data between Amazon Simple Storage Service (Amazon S3) and your onsite data storage location at faster-than-internet speeds. Snowball Edge supports two job types: import jobs and export jobs. Import jobs transfer data from an on-premises source to an Amazon S3 bucket. Export jobs transfer data from an Amazon S3 bucket to an on-premises destination. For both job types, Snowball Edge devices secure and protect your data while regional shipping carriers transport the devices between Amazon S3 and your onsite data storage location. Snowball Edge devices are physically rugged and protected by AWS Key Management Service (AWS KMS). For more information, see the [AWS Snowball Edge Developer Guide](https://docs.aws.amazon.com/snowball/latest/developer-guide/).

# Configure a location for Amazon EMR cluster output


The most common output format of an Amazon EMR cluster is text files, either compressed or uncompressed. Typically, these are written to an Amazon S3 bucket. This bucket must be created before you launch the cluster. You specify the S3 bucket as the output location when you launch the cluster. 

For more information, see the following topics:

**Topics**
+ [Create and configure an Amazon S3 bucket](#create-s3-bucket-output)
+ [What formats can Amazon EMR return?](emr-plan-output-formats.md)
+ [How to write data to an Amazon S3 bucket you don't own with Amazon EMR](emr-s3-acls.md)
+ [Ways to compress the output of your Amazon EMR cluster](emr-plan-output-compression.md)

## Create and configure an Amazon S3 bucket


Amazon EMR uses Amazon S3 to store input data, log files, and output data. Amazon S3 refers to these storage locations as *buckets*. Buckets have certain restrictions and limitations to conform with Amazon S3 and DNS requirements. For more information, see [Bucket Restrictions and Limitations](https://docs.aws.amazon.com/AmazonS3/latest/userguide/BucketRestrictions.html) in the *Amazon Simple Storage Service User Guide*.

To create an Amazon S3 bucket, follow the instructions on the [Creating a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html) page in the *Amazon Simple Storage Service User Guide*.

**Note**  
 If you enable logging in the **Create a Bucket** wizard, it enables only bucket access logs, not cluster logs. 

**Note**  
For more information on specifying Region-specific buckets, refer to [Buckets and Regions](https://docs.aws.amazon.com/AmazonS3/latest/dev/LocationSelection.html) in the *Amazon Simple Storage Service Developer Guide* and [ Available Region Endpoints for the AWS SDKs ](https://aws.amazon.com/articles/available-region-endpoints-for-the-aws-sdks/).

 After you create your bucket you can set the appropriate permissions on it. Typically, you give yourself (the owner) read and write access. We strongly recommend that you follow [Security Best Practices for Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/security-best-practices.html) when configuring your bucket. 

 Required Amazon S3 buckets must exist before you can create a cluster. You must upload any required scripts or data referenced in the cluster to Amazon S3. The following table describes example data, scripts, and log file locations. 


| Information | Example Location on Amazon S3 | 
| --- | --- | 
| script or program |  s3://amzn-s3-demo-bucket1/script/MapperScript.py  | 
| log files |  s3://amzn-s3-demo-bucket1/logs  | 
| input data |  s3://amzn-s3-demo-bucket1/input  | 
| output data |  s3://amzn-s3-demo-bucket1/output  | 

# What formats can Amazon EMR return?


The default output format for a cluster is text, with key-value pairs written to individual lines of the text files. This is the most commonly used output format. 

 If your output data needs to be written in a format other than the default text files, you can use the Hadoop interface `OutputFormat` to specify other output types. You can even create a subclass of the `FileOutputFormat` class to handle custom data types. For more information, see [http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/OutputFormat.html](http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/OutputFormat.html). 

 If you are launching a Hive cluster, you can use a serializer/deserializer (SerDe) to output data from HDFS to a given format. For more information, see [https://cwiki.apache.org/confluence/display/Hive/SerDe](https://cwiki.apache.org/confluence/display/Hive/SerDe). 

# How to write data to an Amazon S3 bucket you don't own with Amazon EMR


 When you write a file to an Amazon Simple Storage Service (Amazon S3) bucket, by default, you are the only one able to read that file. The assumption is that you will write files to your own buckets, and this default setting protects the privacy of your files. 

 However, if you are running a cluster, and you want the output to write to the Amazon S3 bucket of another AWS user, and you want that other AWS user to be able to read that output, you must do two things: 
+  Have the other AWS user grant you write permissions for their Amazon S3 bucket. The cluster you launch runs under your AWS credentials, so any clusters you launch will also be able to write to that other AWS user's bucket. 
+  Set read permissions for the other AWS user on the files that you or the cluster write to the Amazon S3 bucket. The easiest way to set these read permissions is to use canned access control lists (ACLs), a set of pre-defined access policies defined by Amazon S3. 

 For information about how the other AWS user can grant you permissions to write files to the other user's Amazon S3 bucket, see [Editing bucket permissions](https://docs.aws.amazon.com/AmazonS3/latest/userguide/EditingBucketPermissions.html) in the *Amazon Simple Storage Service User Guide*. 

 For your cluster to use canned ACLs when it writes files to Amazon S3, set the `fs.s3.canned.acl` cluster configuration option to the canned ACL to use. The following table lists the currently defined canned ACLs. 


| Canned ACL | Description | 
| --- | --- | 
| AuthenticatedRead | Specifies that the owner is granted Permission.FullControl and the GroupGrantee.AuthenticatedUsers group grantee is granted Permission.Read access. | 
| BucketOwnerFullControl | Specifies that the owner of the bucket is granted Permission.FullControl. The owner of the bucket is not necessarily the same as the owner of the object. | 
| BucketOwnerRead | Specifies that the owner of the bucket is granted Permission.Read. The owner of the bucket is not necessarily the same as the owner of the object. | 
| LogDeliveryWrite | Specifies that the owner is granted Permission.FullControl and the GroupGrantee.LogDelivery group grantee is granted Permission.Write access, so that access logs can be delivered. | 
| Private | Specifies that the owner is granted Permission.FullControl. | 
| PublicRead | Specifies that the owner is granted Permission.FullControl and the GroupGrantee.AllUsers group grantee is granted Permission.Read access. | 
| PublicReadWrite | Specifies that the owner is granted Permission.FullControl and the GroupGrantee.AllUsers group grantee is granted Permission.Read and Permission.Write access. | 

 There are many ways to set the cluster configuration options, depending on the type of cluster you are running. The following procedures show how to set the option for common cases. 

**To write files using canned ACLs in Hive**
+ From the Hive command prompt, set the `fs.s3.canned.acl` configuration option to the canned ACL that you want the cluster to set on files it writes to Amazon S3. To access the Hive command prompt, connect to the primary node using SSH and type `hive` at the command prompt. For more information, see [Connect to the Amazon EMR cluster primary node using SSH](emr-connect-master-node-ssh.md). 

   The following example sets the `fs.s3.canned.acl` configuration option to `BucketOwnerFullControl`, which gives the owner of the Amazon S3 bucket complete control over the file. Note that the set command is case sensitive and contains no quotation marks or spaces. 

  ```
  hive> set fs.s3.canned.acl=BucketOwnerFullControl;   
  create table acl (n int) location 's3://amzn-s3-demo-bucket/acl/'; 
  insert overwrite table acl select count(*) from acl;
  ```

   The last two lines of the example create a table that is stored in Amazon S3 and write data to the table. 

**To write files using canned ACLs in Pig**
+ From the Pig command prompt, set the `fs.s3.canned.acl` configuration option to the canned ACL that you want the cluster to set on files it writes to Amazon S3. To access the Pig command prompt, connect to the primary node using SSH and type `pig` at the command prompt. For more information, see [Connect to the Amazon EMR cluster primary node using SSH](emr-connect-master-node-ssh.md). 

   The following example sets the `fs.s3.canned.acl` configuration option to `BucketOwnerFullControl`, which gives the owner of the Amazon S3 bucket complete control over the file. Note that the set command includes one space before the canned ACL name and contains no quotation marks. 

  ```
  pig> set fs.s3.canned.acl BucketOwnerFullControl; 
  store some data into 's3://amzn-s3-demo-bucket/pig/acl';
  ```

**To write files using canned ACLs in a custom JAR**
+ Set the `fs.s3.canned.acl` configuration option using Hadoop with the `-D` flag, as shown in the following example. 

  ```
  hadoop jar hadoop-examples.jar wordcount \
    -Dfs.s3.canned.acl=BucketOwnerFullControl s3://amzn-s3-demo-bucket/input s3://amzn-s3-demo-bucket/output
  ```

# Ways to compress the output of your Amazon EMR cluster


There are different ways to compress output that results from data processing. The compression tools you use depend on properties of your data. Compression can improve performance when you transfer large amounts of data.

## Output data compression


This compresses the output of your Hadoop job. If you are using `TextOutputFormat`, the result is a gzipped text file. If you are writing to SequenceFiles, the result is a SequenceFile that is compressed internally. You can enable output compression by setting the configuration setting `mapred.output.compress` to `true`. 

If you are running a streaming job, you can enable output compression by passing the following argument to the streaming job. 

```
-jobconf mapred.output.compress=true
```

 You can also use a bootstrap action to automatically compress all job outputs. Here is how to do that with the Ruby client. 

```
--bootstrap-actions s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-s,mapred.output.compress=true"
```

Finally, if you are writing a custom JAR, you can enable output compression with the following line when creating your job. 

```
FileOutputFormat.setCompressOutput(conf, true);
```

## Intermediate data compression


If your job shuffles a significant amount of data from the mappers to the reducers, you can see a performance improvement by enabling intermediate compression. The map output is compressed, and it is decompressed when it arrives on the core node. The configuration setting is `mapred.compress.map.output`. You can enable this similarly to output compression. 

 When writing a Custom Jar, use the following command: 

```
conf.setCompressMapOutput(true);
```

## Using the Snappy library with Amazon EMR


Snappy is a compression and decompression library that is optimized for speed. It is available on Amazon EMR AMIs version 2.0 and later and is used as the default for intermediate compression. For more information about Snappy, go to [http://code.google.com/p/snappy/](http://code.google.com/p/snappy/). 

# Plan and configure primary nodes in your Amazon EMR cluster


When you launch an Amazon EMR cluster, you can choose to have one or three primary nodes in your cluster. High availability for *instance fleets* is supported with Amazon EMR releases 5.36.1, 5.36.2, 6.8.1, 6.9.1, 6.10.1, 6.11.1, 6.12.0, and higher. For *instance groups*, high availability is supported with Amazon EMR releases 5.23.0 and higher. To further improve cluster availability, Amazon EMR can use Amazon EC2 placement groups to ensure that primary nodes are placed on distinct underlying hardware. For more information, see [Amazon EMR integration with EC2 placement groups](emr-plan-ha-placementgroup.md).

An Amazon EMR cluster with multiple primary nodes provides the following benefits:
+ The primary node is no longer a single point of failure. If one of the primary nodes fails, the cluster uses the other two primary nodes and runs without interruption. In the meantime, Amazon EMR automatically replaces the failed primary node with a new one that is provisioned with the same configuration and bootstrap actions. 
+ Amazon EMR enables the Hadoop high-availability features of HDFS NameNode and YARN ResourceManager and supports high availability for a few other open source applications.

  For more information about how an Amazon EMR cluster with multiple primary nodes supports open source applications and other Amazon EMR features, see [Features that support high availability in an Amazon EMR cluster and how they work with open-source applications](emr-plan-ha-applications.md).

**Note**  
The cluster can reside only in one Availability Zone or subnet.

This section provides information about supported applications and features of an Amazon EMR cluster with multiple primary nodes as well as the configuration details, best practices, and considerations for launching the cluster.

**Topics**
+ [Features that support high availability in an Amazon EMR cluster and how they work with open-source applications](emr-plan-ha-applications.md)
+ [Launch an Amazon EMR Cluster with multiple primary nodes](emr-plan-ha-launch.md)
+ [Amazon EMR integration with EC2 placement groups](emr-plan-ha-placementgroup.md)
+ [Considerations and best practices when you create an Amazon EMR cluster with multiple primary nodes](emr-plan-ha-considerations.md)

# Features that support high availability in an Amazon EMR cluster and how they work with open-source applications


This topic provides information about the Hadoop high-availability features of HDFS NameNode and YARN ResourceManager in an Amazon EMR cluster, and how the high-availability features work with open source applications and other Amazon EMR features.

## High-availability HDFS


An Amazon EMR cluster with multiple primary nodes enables the HDFS NameNode high availability feature in Hadoop. For more information, see [HDFS high availability](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html).

In an Amazon EMR cluster, two or more separate nodes are configured as NameNodes. One NameNode is in an `active` state and the others are in a `standby` state. If the node with `active` NameNode fails, Amazon EMR starts an automatic HDFS failover process. A node with `standby` NameNode becomes `active` and takes over all client operations in the cluster. Amazon EMR replaces the failed node with a new one, which then rejoins as a `standby`.

**Note**  
In Amazon EMR versions 5.23.0 through 5.36.2, only two of the three primary nodes run HDFS NameNode.  
In Amazon EMR versions 6.x and higher, all three of the primary nodes run HDFS NameNode.

If you need to find out which NameNode is `active`, you can use SSH to connect to any primary node in the cluster and run the following command:

```
hdfs haadmin -getAllServiceState
```

The output lists the nodes where NameNode is installed and their status. For example:

```
ip-##-#-#-##1.ec2.internal:8020 active
ip-##-#-#-##2.ec2.internal:8020 standby
ip-##-#-#-##3.ec2.internal:8020 standby
```
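If you need to select the `active` NameNode programmatically, you can parse this output. The following Python sketch is illustrative and not part of Amazon EMR; the hostnames in the sample output are hypothetical.

```python
# Illustrative sketch: find the active NameNode from the output of
# `hdfs haadmin -getAllServiceState`. Each output line has the form
# "<host:port> <state>"; exactly one node reports the "active" state.

def find_active_namenode(haadmin_output: str) -> str:
    """Return the host:port of the NameNode reported as active."""
    for line in haadmin_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == "active":
            return parts[0]
    raise RuntimeError("no active NameNode found")

# Hypothetical sample output; real hostnames will differ.
sample = """\
ip-10-0-0-11.ec2.internal:8020 active
ip-10-0-0-12.ec2.internal:8020 standby
ip-10-0-0-13.ec2.internal:8020 standby"""

print(find_active_namenode(sample))  # ip-10-0-0-11.ec2.internal:8020
```

In practice you would capture the command's output over SSH (for example, with `subprocess.run`) rather than using a hard-coded sample.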

## High-availability YARN ResourceManager


An Amazon EMR cluster with multiple primary nodes enables the YARN ResourceManager high availability feature in Hadoop. For more information, see [ResourceManager high availability](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html).

In an Amazon EMR cluster with multiple primary nodes, YARN ResourceManager runs on all three primary nodes. One ResourceManager is in `active` state, and the other two are in `standby` state. If the primary node with `active` ResourceManager fails, Amazon EMR starts an automatic failover process. A primary node with a `standby` ResourceManager takes over all operations. Amazon EMR replaces the failed primary node with a new one, which then rejoins the ResourceManager quorum as a `standby`.

You can connect to `http://master-public-dns-name:8088/cluster` on any primary node, which automatically redirects you to the `active` resource manager. To find out which resource manager is `active`, use SSH to connect to any primary node in the cluster. Then run the following command to get a list of the three primary nodes and their status:

```
yarn rmadmin -getAllServiceState
```
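The output format is similar to the HDFS NameNode example. For example (illustrative hostnames; the port shown assumes the default ResourceManager admin port):

```
ip-##-#-#-##1.ec2.internal:8033 active
ip-##-#-#-##2.ec2.internal:8033 standby
ip-##-#-#-##3.ec2.internal:8033 standby
```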

## Supported applications in an Amazon EMR Cluster with multiple primary nodes


You can install and run the following applications on an Amazon EMR cluster with multiple primary nodes. For each application, the primary node failover process varies. 


| Application | Availability during primary node failover | Notes | 
| --- | --- | --- | 
| Flink | Availability not affected by primary node failover | Flink jobs on Amazon EMR run as YARN applications. Flink JobManagers run as YARN ApplicationMasters on core nodes, so the JobManager is not affected by the primary node failover process.  If you use Amazon EMR version 5.27.0 or earlier, the JobManager is a single point of failure. When the JobManager fails, it loses all job state and does not resume the running jobs. You can enable JobManager high availability by configuring the application attempt count, checkpointing, and ZooKeeper as state storage for Flink. For more information, see [Configuring Flink on an Amazon EMR Cluster with multiple primary nodes](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/flink-configure.html#flink-multi-master). Beginning with Amazon EMR version 5.28.0, no manual configuration is needed to enable JobManager high availability. | 
| Ganglia | Availability not affected by primary node failover | Ganglia is available on all primary nodes, so Ganglia can continue to run during the primary node failover process. | 
| Hadoop | High availability |  HDFS NameNode and YARN ResourceManager automatically fail over to the standby node when the active primary node fails.  | 
| HBase |  High availability  | HBase automatically fails over to the standby node when the active primary node fails.  If you are connecting to HBase through a REST or Thrift server, you must switch to a different primary node when the active primary node fails. | 
| HCatalog |  Availability not affected by primary node failover  | HCatalog is built upon Hive metastore, which exists outside of the cluster. HCatalog remains available during the primary node failover process. | 
| JupyterHub | High availability |  JupyterHub is installed on all three primary instances. It is highly recommended to configure notebook persistence to prevent notebook loss upon primary node failure. For more information, see [Configuring persistence for notebooks in Amazon S3](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-s3.html).  | 
| Livy | High availability |  Livy is installed on all three primary nodes. When the active primary node fails, you lose access to the current Livy session and need to create a new Livy session on a different primary node or on the new replacement node.   | 
| Mahout |  Availability not affected by primary node failover  | Since Mahout has no daemon, it is not affected by the primary node failover process. | 
| MXNet |  Availability not affected by primary node failover  | Since MXNet has no daemon, it is not affected by the primary node failover process. | 
| Phoenix |  High availability   | The Phoenix QueryServer runs on only one of the three primary nodes. Phoenix on all three primary nodes is configured to connect to that QueryServer. You can find the private IP address of the Phoenix QueryServer in the `/etc/phoenix/conf/phoenix-env.sh` file. | 
| Pig |  Availability not affected by primary node failover  | Since Pig has no daemon, it is not affected by the primary node failover process. | 
| Spark | High availability | All Spark applications run in YARN containers and can react to primary node failover in the same way as high-availability YARN features. | 
| Sqoop | High availability | By default, sqoop-job and sqoop-metastore store data (job descriptions) on the local disk of the primary node that runs the command. If you want to save metastore data on an external database, see the Apache Sqoop documentation. | 
| Tez |  High availability  | Since Tez containers run on YARN, Tez behaves the same way as YARN during the primary node failover process. | 
| TensorFlow |  Availability not affected by primary node failover  |  Since TensorFlow has no daemon, it is not affected by the primary node failover process. | 
| Zeppelin |  High availability  | Zeppelin is installed on all three primary nodes. Zeppelin stores notes and interpreter configurations in HDFS by default to prevent data loss. Interpreter sessions are completely isolated across the three primary nodes, and session data is lost when a primary node fails. We recommend that you do not modify the same note concurrently on different primary nodes. | 
| ZooKeeper | High availability |  ZooKeeper is the foundation of the HDFS automatic failover feature. ZooKeeper provides a highly available service for maintaining coordination data, notifying clients of changes in that data, and monitoring clients for failures. For more information, see [HDFS automatic failover](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html#Automatic_Failover).  | 

To run the following applications in an Amazon EMR cluster with multiple primary nodes, you must configure an external database. The external database exists outside the cluster and makes data persistent during the primary node failover process. For the following applications, the service components will automatically recover during the primary node failover process, but active jobs may fail and need to be retried.


| Application | Availability during primary node failover | Notes | 
| --- | --- | --- | 
| Hive | High availability for service components only |  An external metastore for Hive is required. This must be a MySQL external metastore, as PostgreSQL is not supported for multi-master clusters. For more information, see [Configuring an external metastore for Hive](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-metastore-external-hive.html).  | 
| Hue | High availability for service components only |  An external database for Hue is required. For more information, see [Using Hue with a remote database in Amazon RDS](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/hue-rds.html).  | 
| Oozie |  High availability for service components only  | An external database for Oozie is required. For more information, see [Using Oozie with a remote database in Amazon RDS](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/oozie-rds.html). Oozie-server and oozie-client are installed on all three primary nodes. The oozie-clients are configured to connect to the correct oozie-server by default. | 
| PrestoDB or PrestoSQL/Trino |  High availability for service components only  | An external Hive metastore for PrestoDB (PrestoSQL on Amazon EMR 6.1.0-6.3.0 or Trino on Amazon EMR 6.4.0 and later) is required. You can use [Presto with the AWS Glue Data Catalog](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto-glue.html) or [use an external MySQL database for Hive](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-external.html).  The Presto CLI is installed on all three primary nodes so you can use it to access the Presto Coordinator from any of the primary nodes. The Presto Coordinator is installed on only one primary node. You can find the DNS name of the primary node where the Presto Coordinator is installed by calling the Amazon EMR `describe-cluster` API and reading the returned value of the `MasterPublicDnsName` field in the response.  | 

**Note**  
When a primary node fails, your Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) client terminates its connection to the primary node. You can connect to any of the remaining primary nodes to continue your work, because the Hive metastore daemon runs on all primary nodes. Or you can wait for the failed primary node to be replaced.
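As the Presto row in the table above notes, the coordinator's DNS name is returned in the `MasterPublicDnsName` field of the `describe-cluster` response. The following Python sketch reads that field from a hypothetical, heavily truncated response; a real response contains many more fields.

```python
import json

# Hypothetical, truncated `aws emr describe-cluster` response,
# shown here only to illustrate where the field lives.
response_json = """
{
  "Cluster": {
    "Id": "j-3KVTXXXXXX7UG",
    "MasterPublicDnsName": "ec2-203-0-113-10.compute-1.amazonaws.com"
  }
}
"""

# The coordinator host is the value of Cluster.MasterPublicDnsName.
coordinator_host = json.loads(response_json)["Cluster"]["MasterPublicDnsName"]
print(coordinator_host)  # ec2-203-0-113-10.compute-1.amazonaws.com
```

With the AWS CLI you could extract the same field directly, for example with `--query 'Cluster.MasterPublicDnsName'`.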

## How Amazon EMR features work in a cluster with multiple primary nodes


### Connecting to primary nodes using SSH


You can connect to any of the three primary nodes in an Amazon EMR cluster using SSH in the same way you connect to a single primary node. For more information, see [Connect to the primary node using SSH](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node-ssh.html).

If a primary node fails, your SSH connection to that primary node ends. To continue your work, you can connect to one of the other two primary nodes. Alternatively, you can access the new primary node after Amazon EMR replaces the failed one with a new one.

**Note**  
The private IP address for the replacement primary node remains the same as the previous one. The public IP address for the replacement primary node may change. You can retrieve the new IP addresses in the console or by using the `describe-cluster` command in the AWS CLI.  
Depending on your Amazon EMR release, NameNode runs on only two, or on all three, of the primary nodes. However, you can run `hdfs` CLI commands and operate jobs that access HDFS from all three primary nodes.

### Working with steps in an Amazon EMR Cluster with multiple primary nodes


You can submit steps to an Amazon EMR cluster with multiple primary nodes in the same way you work with steps in a cluster with a single primary node. For more information, see [Submit work to a cluster](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-work-with-steps.html). 

The following are considerations for working with steps in an Amazon EMR cluster with multiple primary nodes:
+ If a primary node fails, the steps that are running on that primary node are marked as FAILED. Any data that was written locally is lost. However, the FAILED status may not reflect the real state of the steps.
+ If a running step has started a YARN application when the primary node fails, the step can continue and succeed due to the automatic failover of the primary node.
+ It is recommended that you check the status of steps by referring to the output of the jobs. For example, MapReduce jobs write a `_SUCCESS` file to the output directory when a job completes successfully.
+ It is recommended that you set the `ActionOnFailure` parameter to `CONTINUE` or `CANCEL_AND_WAIT` instead of `TERMINATE_JOB_FLOW` or `TERMINATE_CLUSTER`.
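The `_SUCCESS` marker check described above can be sketched as follows. This is an illustrative helper, not part of Amazon EMR; the example uses a temporary directory standing in for a job output path.

```python
import tempfile
from pathlib import Path

def job_succeeded(output_dir: str) -> bool:
    """MapReduce-style check: a _SUCCESS marker file in the output
    directory indicates the job completed successfully."""
    return (Path(output_dir) / "_SUCCESS").is_file()

# Demonstration with a local temporary directory.
with tempfile.TemporaryDirectory() as d:
    print(job_succeeded(d))          # False: no marker yet
    (Path(d) / "_SUCCESS").touch()   # simulate a completed job
    print(job_succeeded(d))          # True after the marker is written
```

For output stored in Amazon S3, the same idea applies: check for the `_SUCCESS` object under the job's output prefix rather than relying on the step's reported status.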

### Automatic termination protection


Amazon EMR automatically enables termination protection for all clusters with multiple primary nodes, and overrides any step execution settings that you supply when you create the cluster. You can disable termination protection after the cluster has been launched. See [Configuring termination protection for running clusters](UsingEMR_TerminationProtection.md#emr-termination-protection-running-cluster). To shut down a cluster with multiple primary nodes, you must first modify the cluster attributes to disable termination protection. For instructions, see [Terminate an Amazon EMR Cluster with multiple primary nodes](emr-plan-ha-launch.md#emr-plan-ha-launch-terminate).

For more information about termination protection, see [Using termination protection to protect your Amazon EMR clusters from accidental shut down](UsingEMR_TerminationProtection.md).

### Unsupported features in an Amazon EMR Cluster with multiple primary nodes


The following Amazon EMR features are currently not available in an Amazon EMR cluster with multiple primary nodes:
+ EMR Notebooks
+ One-click access to persistent Spark history server
+ Persistent application user interfaces
+ One-click access to persistent application user interfaces is currently not available for Amazon EMR clusters with multiple primary nodes or for Amazon EMR clusters integrated with AWS Lake Formation.
+ Runtime role-based access control. For more information, see [Additional considerations](emr-steps-runtime-roles.md#emr-steps-runtime-roles-considerations) in [Runtime roles for Amazon EMR steps](emr-steps-runtime-roles.md).
+ Amazon EMR integration with AWS IAM Identity Center (trusted identity propagation). For more information, see [Integrate Amazon EMR with AWS IAM Identity Center](emr-idc.md).

**Note**  
 To use Kerberos authentication in your cluster, you must configure an external KDC.  
Beginning with Amazon EMR version 5.27.0, you can configure HDFS Transparent encryption on an Amazon EMR cluster with multiple primary nodes. For more information, see [Transparent encryption in HDFS on Amazon EMR](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-encryption-tdehdfs.html).

# Launch an Amazon EMR Cluster with multiple primary nodes


This topic provides configuration details and examples for launching an Amazon EMR cluster with multiple primary nodes.

**Note**  
Amazon EMR automatically enables termination protection for all clusters that have multiple primary nodes, and overrides any auto-termination settings that you supply when you create the cluster. To shut down a cluster with multiple primary nodes, you must first modify the cluster attributes to disable termination protection. For instructions, see [Terminate an Amazon EMR Cluster with multiple primary nodes](#emr-plan-ha-launch-terminate).

## Prerequisites

+ You can launch an Amazon EMR cluster with multiple primary nodes in both public and private VPC subnets. **EC2-Classic** is not supported. To launch an Amazon EMR cluster with multiple primary nodes in a public subnet, you must enable the instances in this subnet to receive a public IP address by selecting **Auto-assign IPv4** in the console or running the following command. Replace *22XXXX01* with your subnet ID.

  ```
  aws ec2 modify-subnet-attribute --subnet-id subnet-22XXXX01 --map-public-ip-on-launch
  ```
+ To run Hive, Hue, or Oozie on an Amazon EMR cluster with multiple primary nodes, you must create an external metastore. For more information, see [Configuring an external metastore for Hive](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-metastore-external-hive.html), [Using Hue with a remote database in Amazon RDS](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/hue-rds.html), or [Apache Oozie](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-oozie.html).
+ To use Kerberos authentication in your cluster, you must configure an external KDC. For more information, see [Configuring Kerberos on Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-kerberos-configure.html).

## Launch an Amazon EMR Cluster with multiple primary nodes


You can launch a cluster with multiple primary nodes when you use instance groups or instance fleets. When you use *instance groups* with multiple primary nodes, you must specify an instance count value of `3` for the primary node instance group. When you use *instance fleets* with multiple primary nodes, you must specify the `TargetOnDemandCapacity` of `3`, `TargetSpotCapacity` of `0` for the primary instance fleet, and `WeightedCapacity` of `1` for each instance type that you configure for the primary fleet. 

 The following examples demonstrate how to launch the cluster using the default AMI or a custom AMI with both instance groups and instance fleets:

**Note**  
You must specify the subnet ID when you launch an Amazon EMR cluster with multiple primary nodes using the AWS CLI. Replace *22XXXX01* and *22XXXX02* with your subnet IDs in the following examples.

------
#### [ Default AMI, instance groups ]

**Example – Launching an Amazon EMR instance group cluster with multiple primary nodes using a default AMI**  

```
aws emr create-cluster \
--name "ha-cluster" \
--release-label emr-6.15.0 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=3,InstanceType=m5.xlarge InstanceGroupType=CORE,InstanceCount=4,InstanceType=m5.xlarge \
--ec2-attributes KeyName=ec2_key_pair_name,InstanceProfile=EMR_EC2_DefaultRole,SubnetId=subnet-22XXXX01 \
--service-role EMR_DefaultRole \
--applications Name=Hadoop Name=Spark
```

------
#### [ Default AMI, instance fleets ]

**Example – Launching an Amazon EMR instance fleet cluster with multiple primary nodes using a default AMI**  

```
aws emr create-cluster \
--name "ha-cluster" \
--release-label emr-6.15.0 \
--instance-fleets '[
    {
        "InstanceFleetType": "MASTER",
        "TargetOnDemandCapacity": 3,
        "TargetSpotCapacity": 0,
        "LaunchSpecifications": {
            "OnDemandSpecification": {
                "AllocationStrategy": "lowest-price"
            }
        },
        "InstanceTypeConfigs": [
            {
                "WeightedCapacity": 1,
                "BidPriceAsPercentageOfOnDemandPrice": 100,
                "InstanceType": "m5.xlarge"
            },
            {
                "WeightedCapacity": 1,
                "BidPriceAsPercentageOfOnDemandPrice": 100,
                "InstanceType": "m5.2xlarge"
            },
            {
                "WeightedCapacity": 1,
                "BidPriceAsPercentageOfOnDemandPrice": 100,
                "InstanceType": "m5.4xlarge"
            }
        ],
        "Name": "Master - 1"
    },
    {
        "InstanceFleetType": "CORE",
        "TargetOnDemandCapacity": 5,
        "TargetSpotCapacity": 0,
        "LaunchSpecifications": {
            "OnDemandSpecification": {
                "AllocationStrategy": "lowest-price"
            }
        },
        "InstanceTypeConfigs": [
            {
                "WeightedCapacity": 1,
                "BidPriceAsPercentageOfOnDemandPrice": 100,
                "InstanceType": "m5.xlarge"
            },
            {
                "WeightedCapacity": 2,
                "BidPriceAsPercentageOfOnDemandPrice": 100,
                "InstanceType": "m5.2xlarge"
            },
            {
                "WeightedCapacity": 4,
                "BidPriceAsPercentageOfOnDemandPrice": 100,
                "InstanceType": "m5.4xlarge"
            }
        ],
        "Name": "Core - 2"
    }
]' \
 --ec2-attributes '{"InstanceProfile":"EMR_EC2_DefaultRole","SubnetIds":["subnet-22XXXX01", "subnet-22XXXX02"]}' \
--service-role EMR_DefaultRole \
--applications Name=Hadoop Name=Spark
```

------
#### [ Custom AMI, instance groups ]

**Example – Launching an Amazon EMR instance group cluster with multiple primary nodes using a custom AMI**  

```
aws emr create-cluster \
--name "custom-ami-ha-cluster" \
--release-label emr-6.15.0 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=3,InstanceType=m5.xlarge InstanceGroupType=CORE,InstanceCount=4,InstanceType=m5.xlarge \
--ec2-attributes KeyName=ec2_key_pair_name,InstanceProfile=EMR_EC2_DefaultRole,SubnetId=subnet-22XXXX01 \
--service-role EMR_DefaultRole \
--applications Name=Hadoop Name=Spark \
--custom-ami-id ami-MyAmiID
```

------
#### [ Custom AMI, instance fleets ]

**Example – Launching an Amazon EMR instance fleet cluster with multiple primary nodes using a custom AMI**  

```
aws emr create-cluster \
--name "ha-cluster" \
--release-label emr-6.15.0 \
--instance-fleets '[
    {
        "InstanceFleetType": "MASTER",
        "TargetOnDemandCapacity": 3,
        "TargetSpotCapacity": 0,
        "LaunchSpecifications": {
            "OnDemandSpecification": {
                "AllocationStrategy": "lowest-price"
            }
        },
        "InstanceTypeConfigs": [
            {
                "WeightedCapacity": 1,
                "BidPriceAsPercentageOfOnDemandPrice": 100,
                "InstanceType": "m5.xlarge"
            },
            {
                "WeightedCapacity": 1,
                "BidPriceAsPercentageOfOnDemandPrice": 100,
                "InstanceType": "m5.2xlarge"
            },
            {
                "WeightedCapacity": 1,
                "BidPriceAsPercentageOfOnDemandPrice": 100,
                "InstanceType": "m5.4xlarge"
            }
        ],
        "Name": "Master - 1"
    },
    {
        "InstanceFleetType": "CORE",
        "TargetOnDemandCapacity": 5,
        "TargetSpotCapacity": 0,
        "LaunchSpecifications": {
            "OnDemandSpecification": {
                "AllocationStrategy": "lowest-price"
            }
        },
        "InstanceTypeConfigs": [
            {
                "WeightedCapacity": 1,
                "BidPriceAsPercentageOfOnDemandPrice": 100,
                "InstanceType": "m5.xlarge"
            },
            {
                "WeightedCapacity": 2,
                "BidPriceAsPercentageOfOnDemandPrice": 100,
                "InstanceType": "m5.2xlarge"
            },
            {
                "WeightedCapacity": 4,
                "BidPriceAsPercentageOfOnDemandPrice": 100,
                "InstanceType": "m5.4xlarge"
            }
        ],
        "Name": "Core - 2"
    }
]' \
--ec2-attributes '{"InstanceProfile":"EMR_EC2_DefaultRole","SubnetIds":["subnet-22XXXX01", "subnet-22XXXX02"]}' \
--service-role EMR_DefaultRole \
--applications Name=Hadoop Name=Spark \
--custom-ami-id ami-MyAmiID
```

------

## Terminate an Amazon EMR Cluster with multiple primary nodes


To terminate an Amazon EMR cluster with multiple primary nodes, you must disable termination protection before terminating the cluster, as the following example demonstrates. Replace *j-3KVTXXXXXX7UG* with your cluster ID.

```
aws emr modify-cluster-attributes --cluster-id j-3KVTXXXXXX7UG --no-termination-protected
aws emr terminate-clusters --cluster-ids j-3KVTXXXXXX7UG
```

# Amazon EMR integration with EC2 placement groups


When you launch an Amazon EMR multiple primary node cluster on Amazon EC2, you have the option to use placement group strategies to specify how you want the primary node instances deployed to protect against hardware failure.

Placement group strategies are supported starting with Amazon EMR version 5.23.0 as an option for multiple primary node clusters. Currently, only primary node types are supported by the placement group strategy, and the `SPREAD` strategy is applied to those primary nodes. The `SPREAD` strategy places a small group of instances across separate underlying hardware to guard against the loss of multiple primary nodes in the event of a hardware failure. Note that an instance launch request could fail if there is insufficient unique hardware to fulfill the request. For more information about EC2 placement strategies and limitations, see [Placement groups](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html) in the *EC2 User Guide for Linux Instances*.

There is an initial Amazon EC2 limit of 500 placement group strategy-enabled clusters that can be launched per AWS Region. Contact AWS Support to request an increase in the number of allowed placement groups. You can identify the EC2 placement groups that Amazon EMR creates by tracking the key-value pair that Amazon EMR associates with the Amazon EMR placement group strategy. For more information about EC2 cluster instance tags, see [View cluster instances in Amazon EC2](UsingEMR_Tagging.md).

## Attach the placement group managed policy to the Amazon EMR role


The placement group strategy requires a managed policy called `AmazonElasticMapReducePlacementGroupPolicy`, which allows Amazon EMR to create, delete, and describe placement groups on Amazon EC2. You must attach `AmazonElasticMapReducePlacementGroupPolicy` to the service role for Amazon EMR before you launch an Amazon EMR cluster with multiple primary nodes. 

You can alternatively attach the `AmazonEMRServicePolicy_v2` managed policy to the Amazon EMR service role instead of the placement group managed policy. `AmazonEMRServicePolicy_v2` allows the same access to placement groups on Amazon EC2 as the `AmazonElasticMapReducePlacementGroupPolicy`. For more information, see [Service role for Amazon EMR (EMR role)](emr-iam-role.md).

The `AmazonElasticMapReducePlacementGroupPolicy` managed policy is the following JSON text that is created and administered by Amazon EMR.

**Note**  
Because the `AmazonElasticMapReducePlacementGroupPolicy` managed policy is updated automatically, the policy shown here may be out-of-date. Use the AWS Management Console to view the current policy.

------
#### [ JSON ]

****  

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Resource": [
        "*"
      ],
      "Effect": "Allow",
      "Action": [
        "ec2:DeletePlacementGroup",
        "ec2:DescribePlacementGroups"
      ],
      "Sid": "AllowEC2Deleteplacementgroup"
    },
    {
      "Resource": [
        "arn:aws:ec2:*:*:placement-group/pg-*"
      ],
      "Effect": "Allow",
      "Action": [
        "ec2:CreatePlacementGroup"
      ],
      "Sid": "AllowEC2Createplacementgroup"
    }
  ]
}
```

------

## Launch an Amazon EMR cluster with multiple primary nodes using placement group strategy


To launch an Amazon EMR cluster that has multiple primary nodes with a placement group strategy, attach the placement group managed policy `AmazonElasticMapReducePlacementGroupPolicy` to the Amazon EMR role. For more information, see [Attach the placement group managed policy to the Amazon EMR role](#emr-plan-ha-launch-pg-policy).

Every time you use this role to start an Amazon EMR cluster with multiple primary nodes, Amazon EMR attempts to launch a cluster with `SPREAD` strategy applied to its primary nodes. If you use a role that does not have the placement group managed policy `AmazonElasticMapReducePlacementGroupPolicy` attached to it, Amazon EMR attempts to launch an Amazon EMR cluster that has multiple primary nodes without a placement group strategy.

If you launch an Amazon EMR cluster that has multiple primary nodes with the `placement-group-configs` parameter using the Amazon EMR API or CLI, Amazon EMR launches the cluster only if the Amazon EMR role has the placement group managed policy `AmazonElasticMapReducePlacementGroupPolicy` attached. If the Amazon EMR role does not have the policy attached, the launch of the Amazon EMR cluster with multiple primary nodes fails.

------
#### [ Amazon EMR API ]

**Example – Use a placement group strategy to launch an instance group cluster with multiple primary nodes from the Amazon EMR API**  
When you use the RunJobFlow action to create an Amazon EMR cluster with multiple primary nodes, set the `PlacementGroupConfigs` property to the following. Currently, the `MASTER` instance role automatically uses `SPREAD` as the placement group strategy.  

```
{
   "Name":"ha-cluster",
   "PlacementGroupConfigs":[
      {
         "InstanceRole":"MASTER"
      }
   ],
   "ReleaseLabel": "emr-6.15.0",
   "Instances":{
      "Ec2SubnetId":"subnet-22XXXX01",
      "Ec2KeyName":"ec2_key_pair_name",
      "InstanceGroups":[
         {
            "InstanceCount":3,
            "InstanceRole":"MASTER",
            "InstanceType":"m5.xlarge"
         },
         {
            "InstanceCount":4,
            "InstanceRole":"CORE",
            "InstanceType":"m5.xlarge"
         }
      ]
   },
   "JobFlowRole":"EMR_EC2_DefaultRole",
   "ServiceRole":"EMR_DefaultRole"
}
```
+ Replace *ha-cluster* with the name of your high-availability cluster.
+ Replace *subnet-22XXXX01* with your subnet ID.
+ Replace *ec2\_key\_pair\_name* with the name of your EC2 key pair for this cluster. The EC2 key pair is optional and only required if you want to use SSH to access your cluster.

------
#### [ AWS CLI ]

**Example – Use a placement group strategy to launch an instance fleet cluster with multiple primary nodes from the AWS Command Line Interface**  
When you use the `create-cluster` command to create an Amazon EMR cluster with multiple primary nodes, set the `--placement-group-configs` parameter as shown in the following example. Currently, the `MASTER` instance role automatically uses `SPREAD` as the placement group strategy.  

```
aws emr create-cluster \
--name "ha-cluster" \
--placement-group-configs InstanceRole=MASTER \
--release-label emr-6.15.0 \
--instance-fleets '[
    {
        "InstanceFleetType": "MASTER",
        "TargetOnDemandCapacity": 3,
        "TargetSpotCapacity": 0,
        "LaunchSpecifications": {
            "OnDemandSpecification": {
                "AllocationStrategy": "lowest-price"
            }
        },
        "InstanceTypeConfigs": [
            {
                "WeightedCapacity": 1,
                "BidPriceAsPercentageOfOnDemandPrice": 100,
                "InstanceType": "m5.xlarge"
            },
            {
                "WeightedCapacity": 1,
                "BidPriceAsPercentageOfOnDemandPrice": 100,
                "InstanceType": "m5.2xlarge"
            },
            {
                "WeightedCapacity": 1,
                "BidPriceAsPercentageOfOnDemandPrice": 100,
                "InstanceType": "m5.4xlarge"
            }
        ],
        "Name": "Master - 1"
    },
    {
        "InstanceFleetType": "CORE",
        "TargetOnDemandCapacity": 5,
        "TargetSpotCapacity": 0,
        "LaunchSpecifications": {
            "OnDemandSpecification": {
                "AllocationStrategy": "lowest-price"
            }
        },
        "InstanceTypeConfigs": [
            {
                "WeightedCapacity": 1,
                "BidPriceAsPercentageOfOnDemandPrice": 100,
                "InstanceType": "m5.xlarge"
            },
            {
                "WeightedCapacity": 2,
                "BidPriceAsPercentageOfOnDemandPrice": 100,
                "InstanceType": "m5.2xlarge"
            },
            {
                "WeightedCapacity": 4,
                "BidPriceAsPercentageOfOnDemandPrice": 100,
                "InstanceType": "m5.4xlarge"
            }
        ],
        "Name": "Core - 2"
    }
]' \
--ec2-attributes '{
    "KeyName": "ec2_key_pair_name",
    "InstanceProfile": "EMR_EC2_DefaultRole",
    "SubnetIds": [
        "subnet-22XXXX01",
        "subnet-22XXXX02"
    ]
}' \
--service-role EMR_DefaultRole \
--applications Name=Hadoop Name=Spark
```
+ Replace *ha-cluster* with the name of your high-availability cluster.
+ Replace *ec2_key_pair_name* with the name of your EC2 key pair for this cluster. The EC2 key pair is optional; you only need it if you want to use SSH to access your cluster.
+ Replace *subnet-22XXXX01* and *subnet-22XXXX02* with your subnet IDs.

------

## Launch a cluster with multiple primary nodes without a placement group strategy


For a cluster with multiple primary nodes to launch primary nodes without a placement group strategy, do one of the following:
+ Remove the placement group managed policy `AmazonElasticMapReducePlacementGroupPolicy` from the Amazon EMR role, or
+ Launch the cluster with multiple primary nodes with the `placement-group-configs` parameter using the Amazon EMR API or CLI, choosing `NONE` as the placement group strategy.

------
#### [ Amazon EMR API ]

**Example – Launch a cluster with multiple primary nodes without a placement group strategy using the Amazon EMR API**  
When you use the RunJobFlow action to create a cluster with multiple primary nodes, set the `PlacementGroupConfigs` property to the following.  

```
{
   "Name":"ha-cluster",
   "PlacementGroupConfigs":[
      {
         "InstanceRole":"MASTER",
         "PlacementStrategy":"NONE"
      }
   ],
   "ReleaseLabel":"emr-5.30.1",
   "Instances":{
      "ec2SubnetId":"subnet-22XXXX01",
      "ec2KeyName":"ec2_key_pair_name",
      "InstanceGroups":[
         {
            "InstanceCount":3,
            "InstanceRole":"MASTER",
            "InstanceType":"m5.xlarge"
         },
         {
            "InstanceCount":4,
            "InstanceRole":"CORE",
            "InstanceType":"m5.xlarge"
         }
      ]
   },
   "JobFlowRole":"EMR_EC2_DefaultRole",
   "ServiceRole":"EMR_DefaultRole"
}
```
+ Replace *ha-cluster* with the name of your high-availability cluster.
+ Replace *subnet-22XXXX01* with your subnet ID.
+ Replace *ec2_key_pair_name* with the name of your EC2 key pair for this cluster. The EC2 key pair is optional; you only need it if you want to use SSH to access your cluster.

------
#### [ AWS CLI ]

**Example – Launch a cluster with multiple primary nodes without a placement group strategy using the AWS CLI**  
When you use the create-cluster command to create a cluster with multiple primary nodes, set the `placement-group-configs` parameter as follows.  

```
aws emr create-cluster \
--name "ha-cluster" \
--placement-group-configs InstanceRole=MASTER,PlacementStrategy=NONE \
--release-label emr-5.30.1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=3,InstanceType=m5.xlarge InstanceGroupType=CORE,InstanceCount=4,InstanceType=m5.xlarge \
--ec2-attributes KeyName=ec2_key_pair_name,InstanceProfile=EMR_EC2_DefaultRole,SubnetId=subnet-22XXXX01 \
--service-role EMR_DefaultRole \
--applications Name=Hadoop Name=Spark
```
+ Replace *ha-cluster* with the name of your high-availability cluster.
+ Replace *subnet-22XXXX01* with your subnet ID.
+ Replace *ec2_key_pair_name* with the name of your EC2 key pair for this cluster. The EC2 key pair is optional; you only need it if you want to use SSH to access your cluster.

------

## Checking the placement group strategy configuration attached to a cluster with multiple primary nodes


You can use the Amazon EMR `describe-cluster` command to see the placement group strategy configuration attached to a cluster with multiple primary nodes.

**Example**  

```
aws emr describe-cluster --cluster-id "j-xxxxx"
{
   "Cluster":{
      "Id":"j-xxxxx",
      ...
      ...
      "PlacementGroups":[
         {
            "InstanceRole":"MASTER",
            "PlacementStrategy":"SPREAD"
         }
      ]
   }
}
```
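
If you only need the placement group information, you can filter the output with the AWS CLI `--query` option, which takes a JMESPath expression. The following is a sketch; replace *j-xxxxx* with your cluster ID.

```
aws emr describe-cluster --cluster-id "j-xxxxx" \
    --query 'Cluster.PlacementGroups'
```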

# Considerations and best practices when you create an Amazon EMR cluster with multiple primary nodes

Consider the following when you create an Amazon EMR cluster with multiple primary nodes:

**Important**  
To launch high-availability EMR clusters with multiple primary nodes, we strongly recommend that you use the latest Amazon EMR release. This ensures that you get the highest level of resiliency and stability for your high-availability clusters.
+ High availability for *instance fleets* is supported with Amazon EMR releases 5.36.1, 5.36.2, 6.8.1, 6.9.1, 6.10.1, 6.11.1, 6.12.0, and higher. For *instance groups*, high availability is supported with Amazon EMR releases 5.23.0 and higher. To learn more, see [About Amazon EMR Releases](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-components.html).
+ On high-availability clusters, Amazon EMR supports only the launch of primary nodes with On-Demand Instances. This ensures the highest availability for your cluster.
+ You can still specify multiple instance types for the primary instance fleet, but all primary nodes of a high-availability cluster launch with the same instance type, including replacements for unhealthy primary nodes.
+ To continue operations, a high-availability cluster with multiple primary nodes requires two out of three primary nodes to be healthy. As a result, if any two primary nodes fail simultaneously, your EMR cluster will fail.
+ All EMR clusters, including high-availability clusters, are launched in a single Availability Zone. Therefore, they can't tolerate Availability Zone failures. In the case of an Availability Zone outage, you lose access to the cluster.
+ If you use a custom service role or policy when you launch a cluster with an instance fleet, add the `ec2:DescribeInstanceTypeOfferings` permission so that Amazon EMR can filter out unsupported Availability Zones (AZs). When Amazon EMR filters out the AZs that don't support any of the instance types for primary nodes, it prevents cluster launches from failing because of unsupported primary instance types. For more information, see [Instance type not supported](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-INSTANCE_TYPE_NOT_SUPPORTED-error.html).
+ Amazon EMR doesn't guarantee high availability for open-source applications other than the ones that are specified in [Supported applications in an Amazon EMR Cluster with multiple primary nodes](emr-plan-ha-applications.md#emr-plan-ha-applications-list).
+ In Amazon EMR releases 5.23.0 through 5.36.2, only two of the three primary nodes for an instance group cluster run HDFS NameNode.
+ In Amazon EMR releases 6.x and higher, all three of the primary nodes for an instance group run HDFS NameNode.
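
For the custom service role consideration above, a policy statement like the following sketch grants the `ec2:DescribeInstanceTypeOfferings` permission. This is a describe-only action, but review the statement against your own security requirements before you use it.

```
{
    "Effect": "Allow",
    "Action": "ec2:DescribeInstanceTypeOfferings",
    "Resource": "*"
}
```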

Considerations for configuring subnet:
+ An Amazon EMR cluster with multiple primary nodes can reside only in one Availability Zone or subnet. Amazon EMR cannot replace a failed primary node if the subnet is fully utilized or oversubscribed in the event of a failover. To avoid this scenario, it is recommended that you dedicate an entire subnet to an Amazon EMR cluster. In addition, make sure that there are enough private IP addresses available in the subnet.

Considerations for configuring core nodes:
+ To ensure that the core nodes are also highly available, we recommend that you launch at least four core nodes. If you decide to launch a smaller cluster with three or fewer core nodes, set the `dfs.replication` parameter to at least `2` so that HDFS has sufficient replication. For more information, see [HDFS configuration](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hdfs-config.html).

**Warning**  
Setting `dfs.replication` to 1 on clusters with fewer than four nodes can lead to HDFS data loss if a single node goes down. We recommend that you use a cluster with at least four core nodes for production workloads.
Amazon EMR won't allow clusters to scale core nodes below the `dfs.replication` value. For example, if `dfs.replication = 2`, the minimum number of core nodes is 2.
When you use managed scaling or automatic scaling, or when you manually resize your cluster, we recommend that you set `dfs.replication` to 2 or higher.
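
For example, you can set the replication factor with a configuration classification when you create the cluster. The following is a sketch; adjust the value for your cluster size.

```
[
  {
    "Classification": "hdfs-site",
    "Properties": {
      "dfs.replication": "2"
    }
  }
]
```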

Considerations for setting alarms on metrics:
+ Amazon EMR doesn't provide application-specific metrics about HDFS or YARN. We recommend that you set up alarms to monitor the primary node instance count. Configure the alarms using the following Amazon CloudWatch metrics: `MultiMasterInstanceGroupNodesRunning`, `MultiMasterInstanceGroupNodesRunningPercentage`, or `MultiMasterInstanceGroupNodesRequested`. CloudWatch notifies you in the case of primary node failure and replacement. 
  + If the `MultiMasterInstanceGroupNodesRunningPercentage` is lower than 100% and greater than 50%, the cluster may have lost a primary node. In this situation, Amazon EMR attempts to replace a primary node. 
  + If the `MultiMasterInstanceGroupNodesRunningPercentage` drops below 50%, two primary nodes may have failed. In this situation, the quorum is lost and the cluster can't be recovered. You must manually migrate data off of this cluster.

  For more information, see [Setting alarms on metrics](https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html#UsingEMR_ViewingMetrics_Alarm).
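
  For example, the following AWS CLI sketch creates an alarm that fires when the running percentage drops below 100%. The alarm name, cluster ID, and SNS topic ARN are placeholders; replace them with your own values.

  ```
  aws cloudwatch put-metric-alarm \
      --alarm-name "emr-primary-node-lost" \
      --namespace "AWS/ElasticMapReduce" \
      --metric-name MultiMasterInstanceGroupNodesRunningPercentage \
      --dimensions Name=JobFlowId,Value=j-xxxxx \
      --statistic Average \
      --period 300 \
      --evaluation-periods 1 \
      --threshold 100 \
      --comparison-operator LessThanThreshold \
      --alarm-actions arn:aws:sns:us-east-1:111122223333:my-emr-alerts
  ```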

# EMR clusters on AWS Outposts


Beginning with Amazon EMR 5.28.0, you can create and run EMR clusters on AWS Outposts. AWS Outposts enables native AWS services, infrastructure, and operating models in on-premises facilities. In AWS Outposts environments, you can use the same AWS APIs, tools, and infrastructure that you use in the AWS Cloud. Amazon EMR on AWS Outposts is ideal for low latency workloads that need to be run in close proximity to on-premises data and applications. For more information about AWS Outposts, see [AWS Outposts User Guide](https://docs.aws.amazon.com/outposts/latest/userguide/). 

## Prerequisites


 The following are the prerequisites for using Amazon EMR on AWS Outposts:
+ You must have installed and configured AWS Outposts in your on-premises data center.
+ You must have a reliable network connection between your Outpost environment and an AWS Region.
+ You must have sufficient capacity for Amazon EMR supported instance types available in your Outpost.

## Limitations


The following are the limitations of using Amazon EMR on AWS Outposts:
+ On-Demand Instances are the only supported option for Amazon EC2 instances. Spot Instances are not available for Amazon EMR on AWS Outposts.
+ If you need additional Amazon EBS storage volumes, only General Purpose SSD (GP2) is supported. 
+ When you use AWS Outposts with Amazon EMR releases 5.28 through 6.x, you can only use S3 buckets that store objects in an AWS Region that you specify. With Amazon EMR 7.0.0 and higher, Amazon EMR on AWS Outposts is also supported with the S3A filesystem client, prefix `s3a://`.
+ Only the following instance types are supported by Amazon EMR on AWS Outposts:    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-outposts.html)

## Network connectivity considerations

+ If network connectivity between your Outpost and its AWS Region is lost, your clusters will continue to run. However, you cannot create new clusters or take new actions on existing clusters until connectivity is restored. In case of instance failures, the instance will not be automatically replaced. Additionally, actions such as adding steps to a running cluster, checking step execution status, and sending CloudWatch metrics and events will be delayed. 
+ We recommend that you provide reliable and highly available network connectivity between your Outpost and the AWS Region. If network connectivity between your Outpost and its AWS Region is lost for more than a few hours, clusters with termination protection enabled will continue to run, and clusters with termination protection disabled may be terminated. 
+ If network connectivity will be impacted due to routine maintenance, we recommend that you proactively enable termination protection. More generally, a connectivity interruption means that any external dependencies that are not local to the Outpost or customer network will not be accessible. This includes Amazon S3, DynamoDB used with EMRFS consistency view, and Amazon RDS if an in-Region instance is used for an Amazon EMR cluster with multiple primary nodes.
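
If you anticipate a connectivity interruption, such as routine maintenance, you can enable termination protection on a running cluster with the AWS CLI. Replace *j-xxxxx* with your cluster ID.

```
aws emr modify-cluster-attributes \
    --cluster-id j-xxxxx \
    --termination-protected
```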

## Creating an Amazon EMR cluster on AWS Outposts


Creating an Amazon EMR cluster on AWS Outposts is similar to creating an Amazon EMR cluster in the AWS Cloud. When you create an Amazon EMR cluster on AWS Outposts, you must specify an Amazon EC2 subnet associated with your Outpost.

An Amazon VPC can span all of the Availability Zones in an AWS Region. AWS Outposts are extensions of Availability Zones, and you can extend an Amazon VPC in an account to span multiple Availability Zones and associated Outpost locations. When you configure your Outpost, you associate a subnet with it to extend your Regional VPC environment to your on-premises facility. Outpost instances and related services appear as part of your Regional VPC, similar to an Availability Zone with associated subnets. For information, see [AWS Outposts User Guide](https://docs.aws.amazon.com/outposts/latest/userguide/).

To create a new Amazon EMR cluster on AWS Outposts with the AWS Management Console, specify an Amazon EC2 subnet that is associated with your Outpost.

------
#### [ Console ]

**To create a cluster on AWS Outposts with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. Under **Cluster configuration**, select **Instance groups** or **Instance fleets**. Then, choose an instance type from the **Choose EC2 instance type** dropdown menu or select **Actions** and choose **Add EBS volumes**. Amazon EMR on AWS Outposts supports limited Amazon EBS volume and instance types.

1. Under **Networking**, select an EC2 subnet with an Outpost ID in this format: op-123456789.

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.

------
#### [ CLI ]

**To create a cluster on AWS Outposts with the AWS CLI**
+ To create a new Amazon EMR cluster on AWS Outposts with the AWS CLI, specify an EC2 subnet that is associated with your Outpost, as in the following example. Replace *subnet-22XXXX01* with your own Amazon EC2 subnet ID.

  ```
  aws emr create-cluster \
  --name "Outpost cluster" \
  --release-label emr-7.12.0 \
  --applications Name=Spark \
  --ec2-attributes KeyName=myKey,SubnetId=subnet-22XXXX01 \
  --instance-type m5.xlarge --instance-count 3 --use-default-roles
  ```

------

# EMR clusters on AWS Local Zones


Beginning with Amazon EMR version 5.28.0, you can create and run Amazon EMR clusters on an AWS Local Zones subnet as a logical extension of an AWS Region that supports Local Zones. A Local Zone enables Amazon EMR features and a subset of AWS services, like compute and storage services, to be located closer to users to provide very low latency access to applications running locally. For a list of available Local Zones, see [AWS Local Zones](https://aws.amazon.com/about-aws/global-infrastructure/localzones/). For information about accessing available AWS Local Zones, see [Regions, Availability Zones, and local zones](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html).

## Supported instance types


 The following instance types are available for Amazon EMR clusters on Local Zones. Instance type availability may vary by Region.


| Instance class | Instance types | 
| --- | --- | 
| General purpose | m5.xlarge, m5.2xlarge, m5.4xlarge, m5.12xlarge, m5.24xlarge, m5d.xlarge, m5d.2xlarge, m5d.4xlarge, m5d.12xlarge, m5d.24xlarge | 
| Compute-optimized | c5.xlarge, c5.2xlarge, c5.4xlarge, c5.9xlarge, c5.18xlarge, c5d.xlarge, c5d.2xlarge, c5d.4xlarge, c5d.9xlarge, c5d.18xlarge | 
| Memory-optimized | r5.xlarge, r5.2xlarge, r5.4xlarge, r5.12xlarge, r5d.xlarge, r5d.2xlarge, r5d.4xlarge, r5d.12xlarge, r5d.24xlarge | 
| Storage-optimized | i3en.xlarge, i3en.2xlarge, i3en.3xlarge, i3en.6xlarge, i3en.12xlarge, i3en.24xlarge | 

## Creating an Amazon EMR cluster on Local Zones


Create an Amazon EMR cluster on AWS Local Zones by launching the Amazon EMR cluster into an Amazon VPC subnet that is associated with a Local Zone. You can access the cluster using the Local Zone name, such as *us-west-2-lax-1a*, in the US West (Oregon) console.

Local Zones don't currently support Amazon EMR Notebooks or connections directly to Amazon EMR using interface VPC endpoint (AWS PrivateLink).

------
#### [ Console ]

**To create a cluster on a Local Zone with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. Under **Networking**, select an EC2 subnet that is associated with a Local Zone, for example *subnet-123abc* in *us-west-2-lax-1a*.

1. Choose an instance type or add Amazon EBS storage volumes for uniform instance groups or instance fleets.

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.

------
#### [ CLI ]

**To create a cluster on a Local Zone with the AWS CLI**
+ Use the create-cluster command, along with the SubnetId for the Local Zone as shown in the following example. Replace subnet-22XXXX1234567 with the Local Zone SubnetId and replace other options as necessary. For more information, see [https://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html](https://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html).

  ```
  aws emr create-cluster \
  --name "Local Zones cluster" \
  --release-label emr-5.29.0 \
  --applications Name=Spark \
  --ec2-attributes KeyName=myKey,SubnetId=subnet-22XXXX1234567 \
  --instance-type m5.xlarge --instance-count 3 --use-default-roles
  ```

------

# Configure Docker for use with Amazon EMR clusters


Amazon EMR 6.x supports Hadoop 3, which allows the YARN NodeManager to launch containers either directly on the Amazon EMR cluster or inside a Docker container. Docker containers provide custom execution environments in which application code runs. The custom execution environment is isolated from the execution environment of the YARN NodeManager and other applications.

Docker containers can include special libraries used by the application and they can provide different versions of native tools and libraries, such as R and Python. You can use familiar Docker tooling to define libraries and runtime dependencies for your applications.

Amazon EMR 6.x clusters are configured by default to allow YARN applications, such as Spark, to run using Docker containers. To customize your container configuration, edit the Docker support options defined in the `yarn-site.xml` and `container-executor.cfg` files available in the `/etc/hadoop/conf` directory. For details about each configuration option and how it is used, see [Launching applications using Docker containers](https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/DockerContainers.html). 

You can choose to use Docker when you submit a job. Use the following variables to specify the Docker runtime and Docker image.
+ `YARN_CONTAINER_RUNTIME_TYPE=docker`
+ `YARN_CONTAINER_RUNTIME_DOCKER_IMAGE={DOCKER_IMAGE_NAME}`
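
For example, a Spark job can pass these variables for both the application master and the executors when you submit it. The following is a sketch; the image URI and script name are placeholders.

```
DOCKER_IMAGE=123456789123.dkr.ecr.us-east-1.amazonaws.com/my-repo:pyspark-latest

spark-submit --master yarn --deploy-mode cluster \
    --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
    --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE \
    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE \
    main.py
```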

When you use Docker containers to run your YARN applications, YARN downloads the Docker image that you specify when you submit your job. For YARN to resolve this Docker image, it must be configured with a Docker registry. The configuration options for a Docker registry depend on whether you deploy the cluster using a public or private subnet.

## Docker registries


A Docker registry is a storage and distribution system for Docker images. For Amazon EMR, we recommend that you use Amazon ECR, which is a fully managed Docker container registry that allows you to create your own custom images and host them in a highly available and scalable architecture.

**Deployment considerations**

Docker registries require network access from each host in the cluster. This is because each host downloads images from the Docker registry when your YARN application is running on the cluster. These network connectivity requirements may limit your choice of Docker registry, depending on whether you deploy your Amazon EMR cluster into a public or private subnet. 

**Public subnet**

When EMR clusters are deployed in a public subnet, the nodes running YARN NodeManager can directly access any registry available over the internet.

**Private subnet**

When EMR clusters are deployed in a private subnet, the nodes running YARN NodeManager don't have direct access to the internet. Docker images can be hosted in Amazon ECR and accessed through AWS PrivateLink.

For more information about how to use AWS PrivateLink to allow access to Amazon ECR in a private subnet scenario, see [Setting up AWS PrivateLink for Amazon ECS, and Amazon ECR](https://aws.amazon.com/blogs/compute/setting-up-aws-privatelink-for-amazon-ecs-and-amazon-ecr/).

## Configuring Docker registries


To use Docker registries with Amazon EMR, you must configure Docker to trust the specific registry that you want to use to resolve Docker images. The default trust registries are local (private) and centos. To use other public repositories or Amazon ECR, you can override `docker.trusted.registries` settings in `/etc/hadoop/conf/container-executor.cfg` using the EMR Classification API with the `container-executor` classification key.

The following example shows how to configure the cluster to trust both a public repository, named `your-public-repo`, and an ECR registry endpoint, `123456789123.dkr.ecr.us-east-1.amazonaws.com`. If you use ECR, replace this endpoint with your specific ECR endpoint.

```
[
  {
    "Classification": "container-executor",
    "Configurations": [
        {
            "Classification": "docker",
            "Properties": {
                "docker.trusted.registries": "local,centos,your-public-repo,123456789123.dkr.ecr.us-east-1.amazonaws.com",
                "docker.privileged-containers.registries": "local,centos,your-public-repo,123456789123.dkr.ecr.us-east-1.amazonaws.com"
            }
        }
    ]
  }
]
```

To launch an Amazon EMR 6.0.0 cluster with this configuration using the AWS Command Line Interface (AWS CLI), create a file named `container-executor.json` with the contents of the preceding `container-executor` JSON configuration. Then, use the following commands to launch the cluster.

```
export KEYPAIR=<Name of your Amazon EC2 key-pair>
export SUBNET_ID=<ID of the subnet to which to deploy the cluster>
export INSTANCE_TYPE=<Name of the instance type to use>
export REGION=<Region to which to deploy the cluster>

aws emr create-cluster \
    --name "EMR-6.0.0" \
    --region $REGION \
    --release-label emr-6.0.0 \
    --applications Name=Hadoop Name=Spark \
    --service-role EMR_DefaultRole \
    --ec2-attributes KeyName=$KEYPAIR,InstanceProfile=EMR_EC2_DefaultRole,SubnetId=$SUBNET_ID \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=$INSTANCE_TYPE InstanceGroupType=CORE,InstanceCount=2,InstanceType=$INSTANCE_TYPE \
    --configuration file://container-executor.json
```

## Configuring YARN to access Amazon ECR on EMR 6.0.0 and earlier


If you're new to Amazon ECR, follow the instructions in [Getting started with Amazon ECR](https://docs.aws.amazon.com/AmazonECR/latest/userguide/ECR_GetStarted.html) and verify that you have access to Amazon ECR from each instance in your Amazon EMR cluster.

On EMR 6.0.0 and earlier, to access Amazon ECR using the Docker command, you must first generate credentials. To verify that YARN can access images from Amazon ECR, use the container environment variable `YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG` to pass a reference to the credentials that you generated.

Run the following command on one of the core nodes to get the login line for your ECR account.

```
aws ecr get-login --region us-east-1 --no-include-email
```

The `get-login` command generates the correct Docker CLI command to run to create credentials. Copy and run the output from `get-login`.

```
sudo docker login -u AWS -p <password> https://<account-id>.dkr.ecr.us-east-1.amazonaws.com
```

This command generates a `config.json` file in the `/root/.docker` folder. Copy this file to HDFS so that jobs submitted to the cluster can use it to authenticate to Amazon ECR.

Run the commands below to copy the `config.json` file to your home directory.

```
mkdir -p ~/.docker
sudo cp /root/.docker/config.json ~/.docker/config.json
sudo chmod 644 ~/.docker/config.json
```

Run the commands below to put the config.json in HDFS so it may be used by jobs running on the cluster.

```
hadoop fs -put ~/.docker/config.json /user/hadoop/
```

YARN can access ECR as a Docker image registry and pull containers during job execution.

After configuring Docker registries and YARN, you can run YARN applications using Docker containers. For more information, see [Run Spark applications with Docker using Amazon EMR 6.0.0](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-docker.html).

In EMR 6.1.0 and later, you don't have to manually set up authentication to Amazon ECR. If an Amazon ECR registry is detected in the `container-executor` classification key, the Amazon ECR auto-authentication feature activates, and YARN handles the authentication process when you submit a Spark job with an ECR image. You can confirm whether automatic authentication is enabled by checking `yarn.nodemanager.runtime.linux.docker.ecr-auto-authentication.enabled` in `yarn-site.xml`. Amazon EMR sets this property to `true` automatically when `docker.trusted.registries` contains an Amazon ECR registry URL.

**Prerequisites for using automatic authentication to Amazon ECR**
+ EMR version 6.1.0 or later
+ The Amazon ECR registry included in the configuration is in the same Region as the cluster
+ An IAM role with permissions to get an authorization token and pull any image

Refer to [Setting up with Amazon ECR](https://docs.aws.amazon.com/AmazonECR/latest/userguide/get-set-up-for-amazon-ecr.html) for more information.

**How to enable automatic authentication**

Follow [Configuring Docker registries](#emr-docker-hub) to set an Amazon ECR registry as a trusted registry, and make sure that the Amazon ECR repository and the cluster are in the same Region.

To enable this feature even when the ECR registry is not set in the trusted registry, use the configuration classification to set `yarn.nodemanager.runtime.linux.docker.ecr-auto-authentication.enabled` to `true`.
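
For example, the following configuration classification sketch sets the property explicitly when you create the cluster:

```
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.runtime.linux.docker.ecr-auto-authentication.enabled": "true"
    }
  }
]
```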

**How to disable automatic authentication**

By default, automatic authentication is disabled if no Amazon ECR registry is detected in the trusted registry.

To disable automatic authentication, even when the Amazon ECR registry is set in the trusted registry, use the configuration classification to set `yarn.nodemanager.runtime.linux.docker.ecr-auto-authentication.enabled` to `false`.

**How to check if automatic authentication is enabled on a cluster**

On the primary node, use a text editor such as `vi` to view the contents of the file: `vi /etc/hadoop/conf.empty/yarn-site.xml`. Check the value of `yarn.nodemanager.runtime.linux.docker.ecr-auto-authentication.enabled`.

# Control Amazon EMR cluster termination


This section describes your options for shutting down Amazon EMR clusters. It covers auto-termination and termination protection, and how they interact with other Amazon EMR features.

You can shut down an Amazon EMR cluster in the following ways:
+ **Termination after last step execution** - Create a transient cluster that shuts down after all steps complete.
+ **Auto-termination (after idle)** - Create a cluster with an auto-termination policy that shuts down after a specified idle time. For more information, see [Using an auto-termination policy for Amazon EMR cluster cleanup](emr-auto-termination-policy.md).
+ **Manual termination** - Create a long-running cluster that continues to run until you terminate it deliberately. For information about how to terminate a cluster manually, see [Terminate an Amazon EMR cluster in the starting, running, or waiting states](UsingEMR_TerminateJobFlow.md).

You can also set termination protection on a cluster to avoid shutting down EC2 instances by accident or error.

When Amazon EMR shuts down your cluster, all Amazon EC2 instances in the cluster shut down. Data in the instance store and EBS volumes is no longer available and is not recoverable. Understanding and managing cluster termination is critical to developing a strategy for preserving data by writing to Amazon S3 while balancing cost. 
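
For example, you can attach an auto-termination policy to a running cluster with the AWS CLI. `IdleTimeout` is specified in seconds; replace *j-xxxxx* with your cluster ID.

```
aws emr put-auto-termination-policy \
    --cluster-id j-xxxxx \
    --auto-termination-policy IdleTimeout=3600
```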

**Topics**
+ [Configuring an Amazon EMR cluster to continue or terminate after step execution](emr-plan-longrunning-transient.md)
+ [Using an auto-termination policy for Amazon EMR cluster cleanup](emr-auto-termination-policy.md)
+ [Using termination protection to protect your Amazon EMR clusters from accidental shut down](UsingEMR_TerminationProtection.md)

# Configuring an Amazon EMR cluster to continue or terminate after step execution


This topic explains the differences between using a long-running cluster and creating a transient cluster that shuts down after the last step runs. It also covers how to configure step execution for a cluster.

## Create a long-running cluster


By default, clusters that you create with the console or the AWS CLI are long-running. Long-running clusters continue to run, accept work, and accrue charges until you take action to shut them down.

A long-running cluster is effective in the following situations:
+ When you need to interactively or automatically query data.
+ When you need to interact with big data applications hosted on the cluster on an ongoing basis.
+ When you periodically process a data set so large or so frequently that it is inefficient to launch new clusters and load data each time.

You can also set termination protection on a long-running cluster to avoid shutting down EC2 instances by accident or error. For more information, see [Using termination protection to protect your Amazon EMR clusters from accidental shut down](UsingEMR_TerminationProtection.md).

**Note**  
Amazon EMR automatically enables termination protection for all clusters with multiple primary nodes, and overrides any step execution settings that you supply when you create the cluster. You can disable termination protection after the cluster has been launched. See [Configuring termination protection for running clusters](UsingEMR_TerminationProtection.md#emr-termination-protection-running-cluster). To shut down a cluster with multiple primary nodes, you must first modify the cluster attributes to disable termination protection. For instructions, see [Terminate an Amazon EMR Cluster with multiple primary nodes](emr-plan-ha-launch.md#emr-plan-ha-launch-terminate).

## Configure a cluster to terminate after step execution

When you configure termination after step execution, the cluster starts, runs bootstrap actions, and then runs the steps that you specify. As soon as the last step completes, Amazon EMR terminates the cluster's Amazon EC2 instances. Clusters that you launch with the Amazon EMR API have step execution enabled by default.

Termination after step execution is effective for clusters that perform a periodic processing task, such as a daily data processing run. Step execution also helps you ensure that you are billed only for the time required to process your data. For more information about steps, see [Submit work to an Amazon EMR cluster](emr-work-with-steps.md).

------
#### [ Console ]

**To turn on termination after step execution with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. Under **Steps**, choose **Add step**. In the **Add step** dialog, enter appropriate field values. Options differ depending on the step type. To add your step and exit the dialog, choose **Add step**.

1. Under **Cluster termination**, select the **Terminate cluster after last step completes** check box.

1. Choose any other options that apply to your cluster.

1. To launch your cluster, choose **Create cluster**.

------
#### [ AWS CLI ]

**To turn on termination after step execution with the AWS CLI**
+ Specify the `--auto-terminate` parameter when you use the `create-cluster` command to create a transient cluster.

  The following example demonstrates how to use the `--auto-terminate` parameter. You can type the following command and replace *myKey* with the name of your EC2 key pair.
**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

  ```
  aws emr create-cluster --name "Test cluster" --release-label emr-7.12.0 \
  --applications Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey \
  --steps Type=PIG,Name="Pig Program",ActionOnFailure=CONTINUE,\
  Args=[-f,s3://amzn-s3-demo-bucket/scripts/pigscript.pig,-p,\
  INPUT=s3://amzn-s3-demo-bucket/inputdata/,-p,OUTPUT=s3://amzn-s3-demo-bucket/outputdata/] \
  --instance-type m5.xlarge --instance-count 3 --auto-terminate
  ```

------
#### [ API ]

**To turn on termination after step execution with the Amazon EMR API when you launch a cluster**

1. When you use the [RunJobFlow](https://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_RunJobFlow.html) action to create a cluster, set the [KeepJobFlowAliveWhenNoSteps](https://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_JobFlowInstancesConfig.html#EMR-Type-JobFlowInstancesConfig-KeepJobFlowAliveWhenNoSteps) property to `false`.

1. To change this setting after the cluster launches, use the `SetKeepJobFlowAliveWhenNoSteps` action.
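
   From the command line, the same post-launch change can be sketched with `modify-cluster-attributes`, assuming your AWS CLI version exposes the `--auto-terminate`/`--no-auto-terminate` flags that map to the `SetKeepJobFlowAliveWhenNoSteps` action. The cluster ID below is a placeholder.

   ```shell
   # Sketch: make a running cluster terminate once its step queue drains
   # (equivalent to setting KeepJobFlowAliveWhenNoSteps to false).
   aws emr modify-cluster-attributes \
     --cluster-id j-2AXXXXXXGAPLF \
     --auto-terminate

   # Sketch: revert the cluster to long-running behavior.
   aws emr modify-cluster-attributes \
     --cluster-id j-2AXXXXXXGAPLF \
     --no-auto-terminate
   ```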

------

# Using an auto-termination policy for Amazon EMR cluster cleanup

An auto-termination policy lets you orchestrate cluster cleanup without the need to monitor and manually terminate unused clusters. When you add an auto-termination policy to a cluster, you specify the amount of idle time after which the cluster should automatically shut down. 

Depending on release version, Amazon EMR uses different criteria to mark a cluster as idle. The following table outlines how Amazon EMR determines cluster idleness.


****  

| When you use ... | A cluster is considered idle when ... | 
| --- | --- | 
| Amazon EMR versions 5.34.0 and later, and 6.4.0 and later |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-auto-termination-policy.html)  | 
| Amazon EMR versions 5.30.0 - 5.33.0 and 6.1.0 - 6.3.0 |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-auto-termination-policy.html)  Amazon EMR marks a cluster as idle and may automatically terminate the cluster even if you have an active Python3 kernel. This is because executing a Python3 kernel does not submit a Spark job on the cluster. To use auto-termination with a Python3 kernel, we recommend that you use Amazon EMR version 6.4.0 or later.   | 

**Note**  
Amazon EMR versions 6.4.0 and later support an on-cluster file for detecting activity on the primary node: `/emr/metricscollector/isbusy`. When you use a cluster to run shell scripts or non-YARN applications, you can periodically touch or update `isbusy` to tell Amazon EMR that the cluster is not idle.
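
For example, a long-running shell job could refresh the marker on a schedule. The following sketch wraps the touch in a small helper; the five-minute cron interval shown in the comment is an assumption, not a documented requirement.

```shell
# mark_busy: update the activity marker so Amazon EMR's idle detection
# (EMR 6.4.0 and later) treats the cluster as busy. The default path is
# the documented marker file on the primary node.
mark_busy() {
  touch "${1:-/emr/metricscollector/isbusy}"
}

# Example cron entry (hypothetical schedule) to refresh the marker
# every 5 minutes while a long-running shell script is active:
#   */5 * * * * touch /emr/metricscollector/isbusy
```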

You can attach an auto-termination policy when you create a cluster, or add a policy to an existing cluster. To change or disable auto-termination, you can update or remove the policy.

## Considerations


Consider the following features and limitations before using an auto-termination policy:
+ In the following AWS Regions, Amazon EMR auto-termination is available with Amazon EMR 6.14.0 and higher: 
  + Asia Pacific (Taipei) (ap-east-2)
  + Asia Pacific (Melbourne) (ap-southeast-4)
  + Asia Pacific (Malaysia) (ap-southeast-5)
  + Asia Pacific (New Zealand) (ap-southeast-6)
  + Asia Pacific (Thailand) (ap-southeast-7)
  + Canada West (Calgary) (ca-west-1)
  + Europe (Spain) (eu-south-2)
  + Mexico (Central) (mx-central-1)
+ In the following AWS Regions, Amazon EMR auto-termination is available with Amazon EMR 5.30.0 and 6.1.0 and higher:
  + US East (N. Virginia) (us-east-1)
  + US East (Ohio) (us-east-2)
  + US West (Oregon) (us-west-2)
  + US West (N. California) (us-west-1)
  + Africa (Cape Town) (af-south-1)
  + Asia Pacific (Hong Kong) (ap-east-1)
  + Asia Pacific (Mumbai) (ap-south-1)
  + Asia Pacific (Hyderabad) (ap-south-2)
  + Asia Pacific (Seoul) (ap-northeast-2)
  + Asia Pacific (Osaka) (ap-northeast-3)
  + Asia Pacific (Singapore) (ap-southeast-1)
  + Asia Pacific (Sydney) (ap-southeast-2)
  + Asia Pacific (Jakarta) (ap-southeast-3)
  + Asia Pacific (Tokyo) (ap-northeast-1)
  + Canada (Central) (ca-central-1)
  + South America (São Paulo) (sa-east-1)
  + Europe (Frankfurt) (eu-central-1)
  + Europe (Zurich) (eu-central-2)
  + Europe (Ireland) (eu-west-1)
  + Europe (London) (eu-west-2)
  + Europe (Milan) (eu-south-1)
  + Europe (Paris) (eu-west-3)
  + Europe (Stockholm) (eu-north-1)
  + Israel (Tel Aviv) (il-central-1)
  + Middle East (UAE) (me-central-1)
  + China (Beijing) (cn-north-1)
  + China (Ningxia) (cn-northwest-1)
  + AWS GovCloud (US-East) (us-gov-east-1)
  + AWS GovCloud (US-West) (us-gov-west-1)
+ Idle timeout defaults to 60 minutes (one hour) when you don't specify an amount. You can specify a minimum idle timeout of one minute, and a maximum idle timeout of 7 days.
+ With Amazon EMR versions 6.4.0 and later, auto-termination is enabled by default when you create a new cluster with the Amazon EMR console.
+ Amazon EMR publishes high-resolution Amazon CloudWatch metrics when you enable auto-termination for a cluster. You can use these metrics to track cluster activity and idleness. For more information, see [Cluster capacity metrics](UsingEMR_ViewingMetrics.md#emr-metrics-managed-scaling).
+ Auto-termination is not supported when you use non-YARN based applications such as Presto, Trino, or HBase.
+ To use auto-termination, the metrics-collector process must be able to connect to the public API endpoint for auto-termination in API Gateway. If you use a private DNS name with Amazon Virtual Private Cloud, auto-termination won't function properly. To ensure that auto-termination works, we recommend that you take one of the following actions:
  + Remove the API Gateway interface VPC endpoint from your Amazon VPC.
  + Follow the instructions in [Why do I get an HTTP 403 Forbidden error when connecting to my API Gateway APIs from a VPC?](https://aws.amazon.com/premiumsupport/knowledge-center/api-gateway-vpc-connections/) to disable the private DNS name setting.
  + Launch your cluster in a private subnet instead. For more information, see the topic on [Private subnets](emr-clusters-in-a-vpc.md#emr-vpc-private-subnet).
+ (EMR 5.30.0 and later) If you remove the default **Allow All** outbound rule to 0.0.0.0/0 for the primary security group, you must add a rule that allows outbound TCP connectivity to your security group for service access on port 9443. Your security group for service access must also allow inbound TCP traffic on port 9443 from the primary security group. For more information about configuring security groups, see [Amazon EMR-managed security group for the primary instance (private subnets)](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-man-sec-groups.html#emr-sg-elasticmapreduce-master-private).

## Permissions to use auto-termination


Before you can apply and manage auto-termination policies for Amazon EMR, you need to attach the permissions that are listed in the following example IAM permissions policy to the IAM resources that manage your EMR cluster.

```
{
  "Version": "2012-10-17",
  "Statement": {
      "Sid": "AllowAutoTerminationPolicyActions",
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:PutAutoTerminationPolicy",
        "elasticmapreduce:GetAutoTerminationPolicy",
        "elasticmapreduce:RemoveAutoTerminationPolicy"
      ],
      "Resource": "<your-resources>"
    }
}
```
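
As a sketch, you might attach these permissions as an inline policy to the IAM role that manages your clusters. The role name and policy name below are hypothetical; `policy.json` holds the JSON document shown above.

```shell
# Sketch: attach the auto-termination permissions as an inline policy
# (role and policy names are hypothetical placeholders).
aws iam put-role-policy \
  --role-name MyEmrAdminRole \
  --policy-name AllowAutoTerminationPolicyActions \
  --policy-document file://policy.json
```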

## Attach, update, or remove an auto-termination policy


This section includes instructions to help you attach, update, or remove an auto-termination policy from an Amazon EMR cluster. Before you work with auto-termination policies, make sure you have the necessary IAM permissions. See [Permissions to use auto-termination](#emr-auto-termination-permissions).

------
#### [ Console ]

**To attach an auto-termination policy when you create a cluster with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. Under **Cluster termination**, select **Terminate cluster after idle time**. 

1. Specify the number of idle hours and minutes that can elapse before the cluster auto-terminates. The default idle time is 1 hour.

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.

**To attach, update, or remove an auto-termination policy on a running cluster with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and select the cluster that you want to update.

1. On the **Properties** tab of the cluster details page, find **Cluster termination** and select **Edit**.

1. Select or clear **Enable auto-termination** to turn the feature on or off. If you turn on auto-termination, specify the number of idle hours and minutes that can elapse before the cluster auto-terminates. Then select **Save changes** to confirm.

------
#### [ AWS CLI ]

**Before you start**

Before you work with auto-termination policies, we recommend that you update to the latest version of the AWS CLI. For instructions, see [Installing, updating, and uninstalling the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html).

**To attach or update an auto-termination policy using the AWS CLI**
+ You can use the `aws emr put-auto-termination-policy` command to attach or update an auto-termination policy on a cluster. 

  The following example specifies 3600 seconds for *IdleTimeout*. If you don't specify *IdleTimeout*, the value defaults to one hour. 

  ```
  aws emr put-auto-termination-policy \
  --cluster-id <your-cluster-id> \
  --auto-termination-policy IdleTimeout=3600
  ```
**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

  You can also specify a value for `--auto-termination-policy` when you use the `aws emr create-cluster` command. For more information on using Amazon EMR commands in the AWS CLI, see the [AWS CLI Command Reference](https://docs.aws.amazon.com/cli/latest/reference/emr).
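
  A minimal sketch of attaching the policy at creation time might look like the following. The cluster name, applications, key pair, and instance settings are placeholders modeled on the other examples in this guide.

  ```shell
  # Sketch: create a cluster that terminates after two hours (7200 seconds)
  # of idle time. Replace myKey with the name of your EC2 key pair.
  aws emr create-cluster --name "AutoTerminatingCluster" --release-label emr-7.12.0 \
    --applications Name=Hadoop Name=Hive \
    --use-default-roles --ec2-attributes KeyName=myKey \
    --instance-type m5.xlarge --instance-count 3 \
    --auto-termination-policy IdleTimeout=7200
  ```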

**To remove an auto-termination policy with the AWS CLI**
+ Use the `aws emr remove-auto-termination-policy` command to remove an auto-termination policy from a cluster. For more information on using Amazon EMR commands in the AWS CLI, see the [AWS CLI Command Reference](https://docs.aws.amazon.com/cli/latest/reference/emr).

  ```
  aws emr remove-auto-termination-policy --cluster-id <your-cluster-id>
  ```

------

# Using termination protection to protect your Amazon EMR clusters from accidental shut down


Termination protection protects your clusters from accidental termination, which can be especially useful for long-running clusters that process critical workloads. When termination protection is enabled on a long-running cluster, you can still terminate the cluster, but you must explicitly remove termination protection from the cluster first. This helps ensure that EC2 instances are not shut down by accident or error. You can enable termination protection when you create a cluster, and you can change the setting on a running cluster.

With termination protection enabled, the `TerminateJobFlows` action in the Amazon EMR API does not work. Users cannot terminate the cluster using this API or the `terminate-clusters` command from the AWS CLI. The API returns an error, and the CLI exits with a non-zero return code. When you use the Amazon EMR console to terminate a cluster, you are prompted with an extra step to turn termination protection off.

**Warning**  
Termination protection does not guarantee that data is retained in the event of a human error or a workaround—for example, if a reboot command is issued from the command line while connected to the instance using SSH, if an application or script running on the instance issues a reboot command, or if the Amazon EC2 or Amazon EMR API is used to disable termination protection. This is true as well if you're running Amazon EMR releases 7.1 and higher and an instance becomes unhealthy and unrecoverable. Even with termination protection enabled, data saved to instance storage, including HDFS data, can be lost. Write data output to Amazon S3 locations and create backup strategies as appropriate for your business continuity requirements.

Termination protection does not affect your ability to scale cluster resources using any of the following actions:
+ Resizing a cluster manually with the AWS Management Console or AWS CLI. For more information, see [Manually resize a running Amazon EMR cluster](emr-manage-resize.md).
+ Removing instances from a core or task instance group using a scale-in policy with automatic scaling. For more information, see [Using automatic scaling with a custom policy for instance groups in Amazon EMR](emr-automatic-scaling.md).
+ Removing instances from an instance fleet by reducing target capacity. For more information, see [Instance fleet options](emr-instance-fleet.md#emr-instance-fleet-options).

## Termination protection and Amazon EC2


The termination protection setting in an Amazon EMR cluster corresponds with the `DisableApiTermination` attribute for all Amazon EC2 instances in the cluster. For example, if you enable termination protection in an EMR cluster, Amazon EMR automatically sets `DisableApiTermination` to true for all EC2 instances within the EMR cluster. The same applies if you disable termination protection. Amazon EMR automatically sets `DisableApiTermination` to false for all EC2 instances within the EMR cluster. If you terminate or scale down a cluster from Amazon EMR and the Amazon EC2 settings conflict for an EC2 instance, Amazon EMR prioritizes the Amazon EMR setting over the `DisableApiStop` and `DisableApiTermination` settings in Amazon EC2 and continues to terminate the EC2 instance. 

For example, you can use the Amazon EC2 console to enable termination protection on an Amazon EC2 instance in an EMR cluster with termination protection disabled. If you terminate or scale down the cluster with the Amazon EMR console, the AWS CLI, or the Amazon EMR API, Amazon EMR overrides the `DisableApiTermination` setting, sets it to false, and terminates the instance along with other instances.

You can also use the Amazon EC2 console to enable stop protection on an Amazon EC2 instance in an EMR cluster with termination protection disabled. If you terminate or scale down the cluster, Amazon EMR sets `DisableApiStop` to false in Amazon EC2 and terminates the instance along with other instances.

Amazon EMR overrides the `DisableApiStop` setting only when you terminate or scale down a cluster. When you enable or disable termination protection in an EMR cluster, Amazon EMR doesn’t change the `DisableApiStop` setting for any of the EC2 instances in the respective EMR cluster.

**Important**  
If you create an instance as part of an Amazon EMR cluster with termination protection, and you use the Amazon EC2 API or AWS CLI commands to modify the instance so that `DisableApiTermination` is `false`, and then the Amazon EC2 API or AWS CLI commands run the `TerminateInstances` operation, the Amazon EC2 instance terminates.
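
To verify how these settings interact for a specific instance, you can inspect the EC2-level attribute directly. The instance ID below is a placeholder.

```shell
# Sketch: check whether EC2-level termination protection is set on one
# of the cluster's instances (instance ID is a placeholder).
aws ec2 describe-instance-attribute \
  --instance-id i-1234567890abcdef0 \
  --attribute disableApiTermination

# Sketch: clear the attribute so a TerminateInstances call can succeed.
aws ec2 modify-instance-attribute \
  --instance-id i-1234567890abcdef0 \
  --no-disable-api-termination
```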

## Termination protection and unhealthy YARN nodes


Amazon EMR periodically checks the Apache Hadoop YARN status of nodes running on core and task Amazon EC2 instances in a cluster. The health status is reported by the [NodeManager health checker service](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManager.html#Health_checker_service). If a node reports `UNHEALTHY`, the Amazon EMR instance controller adds the node to a denylist and does not allocate YARN containers to it until it becomes healthy again. Depending on the statuses of termination protection, unhealthy node replacement, and Amazon EMR release version, Amazon EMR will either [replace the unhealthy instance or stop allocating YARN containers to the instance](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-node-replacement.html).

## Termination protection and termination after step execution


When you enable termination after step execution and *also* enable termination protection, Amazon EMR ignores the termination protection.

When you submit steps to a cluster, you can set the `ActionOnFailure` property to determine what happens if the step can't complete execution because of an error. The possible values for this setting are `TERMINATE_CLUSTER` (`TERMINATE_JOB_FLOW` with earlier versions), `CANCEL_AND_WAIT`, and `CONTINUE`. For more information, see [Submit work to an Amazon EMR cluster](emr-work-with-steps.md).
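
As an illustration, a step's failure behavior can be sketched with the `add-steps` command. The cluster ID, step name, and S3 script path below are placeholders.

```shell
# Sketch: submit a step that cancels remaining steps (but keeps the
# cluster running) if it fails. Cluster ID and S3 path are placeholders.
aws emr add-steps \
  --cluster-id j-2AXXXXXXGAPLF \
  --steps Type=CUSTOM_JAR,Name="MyStep",ActionOnFailure=CANCEL_AND_WAIT,\
Jar=command-runner.jar,Args=["spark-submit","s3://amzn-s3-demo-bucket/scripts/job.py"]
```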

If a step that is configured with `ActionOnFailure` set to `CANCEL_AND_WAIT` fails while termination after step execution is enabled, the cluster terminates without running any subsequent steps.

If a step fails that is configured with `ActionOnFailure` set to `TERMINATE_CLUSTER`, use the table of settings below to determine the outcome.

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_TerminationProtection.html)

## Termination protection and Spot Instances


Amazon EMR termination protection does not prevent an Amazon EC2 Spot Instance from terminating when the Spot price rises above the maximum Spot price.

## Configuring termination protection when you launch a cluster


You can enable or disable termination protection when you launch a cluster using the console, the AWS CLI, or the API. 

For single-node clusters, default termination protection settings are as follows:
+ Launching a cluster with the Amazon EMR console — termination protection is **disabled** by default.
+ Launching a cluster with the AWS CLI `aws emr create-cluster` command — termination protection is **disabled** unless you specify `--termination-protected`.
+ Launching a cluster with the Amazon EMR API [RunJobFlow](https://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_RunJobFlow) action — termination protection is **disabled** unless you set the `TerminationProtected` Boolean value to `true`.

For high-availability clusters, default termination protection settings are as follows:
+ Launching a cluster with the Amazon EMR console — termination protection is **enabled** by default.
+ Launching a cluster with the AWS CLI `aws emr create-cluster` command — termination protection is **disabled** unless you specify `--termination-protected`.
+ Launching a cluster with the Amazon EMR API [RunJobFlow](https://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_RunJobFlow) action — termination protection is **disabled** unless you set the `TerminationProtected` Boolean value to `true`.

------
#### [ Console ]

**To turn termination protection on or off when you create a cluster with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. For **EMR release version**, choose **emr-6.6.0** or later.

1. Under **Cluster termination and node replacement**, make sure that **Use termination protection** is pre-selected, or clear the selection to turn it off. 

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.

------
#### [ AWS CLI ]

**To turn termination protection on or off when you create a cluster using the AWS CLI**
+ With the AWS CLI, you can launch a cluster with termination protection enabled with the `create-cluster` command with the `--termination-protected` parameter. Termination protection is disabled by default.

  The following example creates a cluster with termination protection enabled:
**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

  ```
  aws emr create-cluster --name "TerminationProtectedCluster" --release-label emr-7.12.0 \
  --applications Name=Hadoop Name=Hive Name=Pig \
  --use-default-roles --ec2-attributes KeyName=myKey --instance-type m5.xlarge \
  --instance-count 3 --termination-protected
  ```

  For more information about using Amazon EMR commands in the AWS CLI, see [https://docs.aws.amazon.com/cli/latest/reference/emr](https://docs.aws.amazon.com/cli/latest/reference/emr).

------

## Configuring termination protection for running clusters


You can configure termination protection for a running cluster with the console or the AWS CLI. 

------
#### [ Console ]

**To turn termination protection on or off for a running cluster with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and select the cluster that you want to update.

1. On the **Properties** tab on the cluster details page, find **Cluster termination** and select **Edit**.

1. Select or clear the **Use termination protection** check box to turn the feature on or off. Then select **Save changes** to confirm.

------
#### [ AWS CLI ]

**To turn termination protection on or off for a running cluster using the AWS CLI**
+ To enable termination protection on a running cluster with the AWS CLI, use the `modify-cluster-attributes` command with the `--termination-protected` parameter. To disable it, use the `--no-termination-protected` parameter.

  The following example enables termination protection on the cluster with ID *j-3KVTXXXXXX7UG*:

  ```
  aws emr modify-cluster-attributes --cluster-id j-3KVTXXXXXX7UG --termination-protected
  ```

  The following example disables termination protection on the same cluster:

  ```
  aws emr modify-cluster-attributes --cluster-id j-3KVTXXXXXX7UG --no-termination-protected
  ```

------

# Replacing unhealthy nodes with Amazon EMR

Amazon EMR periodically uses the [NodeManager health checker service](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManager.html#Health_checker_service) in Apache Hadoop to monitor the statuses of core nodes in your Amazon EMR on Amazon EC2 clusters. If a node is not functioning optimally, the node is marked as unhealthy and the health checker reports that node to the Amazon EMR controller. The Amazon EMR controller adds the node to a deny list, preventing the node from receiving new YARN applications until the status of the node improves. 

**Note**  
A common reason for a node to be unhealthy is that it is out of disk space. For more information about when a core node is almost out of disk space, the following **re:Post Knowledge Center** article is helpful: [Why is the core node in my Amazon EMR cluster running out of disk space?](https://repost.aws/knowledge-center/core-node-emr-cluster-disk-space) 

**Note**  
Hadoop provides the ability to run customized node health checks. For more detail, see [NodeManager](https://hadoop.apache.org/docs/r3.3.2/hadoop-yarn/hadoop-yarn-site/NodeManager.html) in the Apache Hadoop documentation.

You can choose whether Amazon EMR should terminate unhealthy nodes or keep them in the cluster. If you turn off unhealthy node replacement, unhealthy nodes stay in the deny list and continue to count toward cluster capacity. You can still connect to your Amazon EC2 core instance for configuration and recovery, so you can resize your cluster if you want to add capacity. For more information about how node replacement and termination work, see [Using termination protection](https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_TerminationProtection.html).

If unhealthy node replacement is turned on, Amazon EMR terminates an unhealthy core node and provisions a new instance, based on the number of instances in the instance group, or based on the target capacity for instance fleets. If any nodes are unhealthy for more than 45 minutes, Amazon EMR will [gracefully replace the nodes](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-scaledown-behavior.html#emr-scaledown-terminate-task). If graceful decommissioning for a node doesn't complete within one hour, the node is forcefully terminated, unless terminating it brings the cluster below replication factor or HDFS capacity constraints.

**Important**  
Note that the time it takes before a node is gracefully decommissioned or terminated can be subject to change.  
While unhealthy node replacement significantly mitigates the possibility for data loss, it doesn't eliminate the risk entirely. HDFS data can be permanently lost during the graceful replacement of an unhealthy core instance. We recommend that you always back up your data.

For more information about identifying unhealthy nodes and recovery, see [Resource errors](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-troubleshoot-error-resource.html). For additional best practices that help you maintain cluster health, see the documentation for the resource error [Amazon EMR cluster terminates with NO\_SLAVE\_LEFT and core nodes FAILED\_BY\_MASTER](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-cluster-NO_SLAVE_LEFT-FAILED_BY_MASTER.html).

Amazon EMR publishes Amazon CloudWatch Events for unhealthy node replacement, so you can keep track of what's happening with your unhealthy core instances. For more information, see [unhealthy node replacement events](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-cloudwatch-events.html#emr-cloudwatch-unhealthy-node-replacement-events).
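
To be notified of these events, you might route them through an EventBridge rule. The rule name below is hypothetical, and the broad `source` filter is an assumption; consult the linked events documentation for the precise pattern to match unhealthy node replacement events.

```shell
# Sketch: match Amazon EMR events with an EventBridge rule (rule name is
# hypothetical; narrow the pattern per the events documentation).
aws events put-rule \
  --name emr-unhealthy-node-events \
  --event-pattern '{"source":["aws.emr"]}'
```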

## Default node replacement and termination protection settings


Unhealthy node replacement is available for all Amazon EMR releases, but the default settings depend on the release label you choose. You can change any of these settings by configuring unhealthy node replacement when creating a new cluster or by going to cluster configuration at any time.

If you're creating a single-node cluster or high-availability cluster that is running Amazon EMR release 7.0 or lower, the default setting of unhealthy node replacement is dependent on termination protection:
+ Enabling termination protection **disables** unhealthy node replacement.
+ Disabling termination protection **enables** unhealthy node replacement.

## Configuring unhealthy node replacement when you launch a cluster


You can enable or disable unhealthy node replacement when you launch a cluster using the console, the AWS CLI, or the API.

The default unhealthy node replacement setting depends on how you launch the cluster:
+ Amazon EMR console — unhealthy node replacement is **enabled** by default.
+ AWS CLI `aws emr create-cluster` — unhealthy node replacement is **enabled** by default unless you specify `--no-unhealthy-node-replacement`.
+ Amazon EMR [RunJobFlow API command](https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html) — unhealthy node replacement is **enabled** by default unless you set the `UnhealthyNodeReplacement` Boolean value to `false`.

------
#### [ Console ]

**To turn unhealthy node replacement on or off when you create a cluster with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. For **EMR release version**, choose the Amazon EMR release label you want.

1. Under **Cluster termination and node replacement**, make sure that **Unhealthy node replacement (recommended)** is pre-selected, or clear the selection to turn it off. 

1. Choose any other options that apply to your cluster.

1. To launch your cluster, choose **Create cluster**.

------
#### [ AWS CLI ]

**To turn unhealthy node replacement on or off when you create a cluster using the AWS CLI**
+ With the AWS CLI, use the `create-cluster` command with the `--unhealthy-node-replacement` parameter to launch a cluster with unhealthy node replacement enabled. Unhealthy node replacement is on by default.

  The following example creates a cluster with unhealthy node replacement enabled:
**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace them with a caret (^).

  ```
  aws emr create-cluster --name "SampleCluster" --release-label emr-7.12.0 \
  --applications Name=Hadoop Name=Hive Name=Pig \
  --use-default-roles --ec2-attributes KeyName=myKey --instance-type m5.xlarge \
  --instance-count 3 --unhealthy-node-replacement
  ```

  For more information about using Amazon EMR commands in the AWS CLI, see [Amazon EMR AWS CLI commands](https://docs.aws.amazon.com//cli/latest/reference/emr).

------

## Configuring unhealthy node replacement in a running cluster


You can turn unhealthy node replacement on or off for a running cluster using the console, the AWS CLI, or the API.

------
#### [ Console ]

**To turn unhealthy node replacement on or off for a running cluster with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and select the cluster that you want to update.

1. On the **Properties** tab on the cluster details page, find **Cluster termination and node replacement** and select **Edit**.

1. Select or clear the **Unhealthy node replacement** check box to turn the feature on or off. Then select **Save changes** to confirm.

------
#### [ AWS CLI ]

**To turn unhealthy node replacement on or off for a running cluster using the AWS CLI**
+ To turn on unhealthy node replacement on a running cluster with the AWS CLI, use the `modify-cluster-attributes` command with the `--unhealthy-node-replacement` parameter. To disable it, use the `--no-unhealthy-node-replacement` parameter.

  The following example turns on unhealthy node replacement on the cluster with ID *j-3KVTXXXXXX7UG*:

  ```
  aws emr modify-cluster-attributes --cluster-id j-3KVTXXXXXX7UG --unhealthy-node-replacement
  ```

  The following example turns off unhealthy node replacement on the same cluster:

  ```
  aws emr modify-cluster-attributes --cluster-id j-3KVTXXXXXX7UG --no-unhealthy-node-replacement
  ```

------

# Working with Amazon Linux AMIs in Amazon EMR

## Amazon Linux Amazon Machine Images (AMIs)

Amazon EMR uses an Amazon Linux Amazon Machine Image (AMI) to initialize Amazon EC2 instances when you create and launch a cluster. The AMI contains the Amazon Linux operating system, other software, and the configurations required for each instance to host your cluster applications.

By default, when you create a cluster, Amazon EMR uses a default Amazon Linux (AL) AMI that is created specifically for the Amazon EMR release version you use. For more information about the default Amazon Linux AMI, see [Using the default Amazon Linux AMI for Amazon EMR](emr-default-ami.md). When you use Amazon EMR 5.7.0 or higher, you can choose to specify a custom Amazon Linux AMI instead of the default Amazon Linux AMI for Amazon EMR. A custom AMI allows you to encrypt the root device volume and to customize applications and configurations as an alternative to using bootstrap actions. You can specify a custom AMI for each instance type in the instance group or instance fleet configuration of an Amazon EMR cluster. Multiple custom AMI support gives you the flexibility to use more than one architecture type in a cluster. See [Using a custom AMI to provide more flexibility for Amazon EMR cluster configuration](emr-custom-ami.md).

Amazon EMR automatically attaches an Amazon EBS General Purpose SSD volume as the root device for all AMIs. EBS-backed AMIs enhance performance. For more information about Amazon Linux AMIs, see [Amazon Machine Images (AMI)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html). For more information about instance storage for Amazon EMR instances, see [Instance storage options and behavior in Amazon EMR](emr-plan-storage.md).

**Topics**
+ [Amazon Linux Amazon Machine Images (AMIs)](#emr-default-ami-overview)
+ [Using the default Amazon Linux AMI for Amazon EMR](emr-default-ami.md)
+ [Using a custom AMI to provide more flexibility for Amazon EMR cluster configuration](emr-custom-ami.md)
+ [Change the Amazon Linux release when you create an EMR cluster](emr-custom-ami-change-al-release.md)
+ [Customizing the Amazon EBS root device volume](emr-custom-ami-root-volume-size.md)

# Using the default Amazon Linux AMI for Amazon EMR

Each Amazon EMR release version uses a default Amazon Linux AMI for Amazon EMR unless you specify a custom AMI. Starting with the Amazon EMR 5.36, 6.6, and 7.0 releases, Amazon EMR automatically applies the latest Amazon Linux release (AL2 for EMR 5.x and 6.x, AL2023 for EMR 7.x) to the default Amazon EMR AMI.

## Automatic Amazon Linux updates for Amazon EMR releases

When you launch a cluster with *the latest patch release* of Amazon EMR 7.0 or higher, 6.6 or higher, or 5.36 or higher, Amazon EMR uses the latest Amazon Linux release for the default Amazon EMR AMI. For example:
+ Where there is an `x.x.0` and an `x.x.1` release, the `x.x.0` release stops getting AMI updates when `x.x.1` launches.
+ Similarly, `x.x.1` stops getting AMI updates when `x.x.2` launches.
+ Later, when `x.y.0` releases, `x.x.[latest]` continues to receive AMI updates alongside `x.y.[latest]`.
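
As an illustration, the patch-release rule above can be modeled in a few lines of Python. The helper name is hypothetical; this only models the documented behavior (within each minor line, only the highest patch release keeps receiving AMI updates):

```python
# Model of the documented AMI-update rule: for each x.y minor line, only the
# latest patch release remains eligible for default AMI updates.
def releases_receiving_ami_updates(release_labels):
    latest_patch = {}
    for label in release_labels:
        # "emr-6.8.1" -> major=6, minor=8, patch=1
        major, minor, patch = (int(p) for p in label.removeprefix("emr-").split("."))
        key = (major, minor)
        if key not in latest_patch or patch > latest_patch[key]:
            latest_patch[key] = patch
    return {f"emr-{major}.{minor}.{patch}"
            for (major, minor), patch in latest_patch.items()}

print(releases_receiving_ami_updates(["emr-6.8.0", "emr-6.8.1", "emr-6.9.0"]))
# Both emr-6.8.1 and emr-6.9.0 remain eligible; emr-6.8.0 does not.
```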

To see whether you're using the latest patch release, as denoted by the number after the second decimal point (for example, the `1` in `6.8.1`), refer to the available releases in the [Amazon EMR Release Guide](https://docs.aws.amazon.com/emr/latest/ReleaseGuide), check the **Amazon EMR release** dropdown when you create a cluster in the console, or use the [ListReleaseLabels](https://docs.aws.amazon.com/emr/latest/APIReference/API_ListReleaseLabels.html) API or the [list-release-labels](https://docs.aws.amazon.com/cli/latest/reference/emr/list-release-labels.html) CLI action. To get updates when we launch a new Amazon EMR release, subscribe to the RSS feed on the [What's new?](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-whatsnew.html) page in the Release Guide.

Alternatively, you can launch your cluster with the Amazon Linux version that the Amazon EMR release first shipped with. For information on how to specify the Amazon Linux release for your cluster, see [Change the Amazon Linux release when you create an EMR cluster](emr-custom-ami-change-al-release.md).

## Default Amazon Linux versions

**Topics**
+ [Default AMIs for Amazon EMR 7.0 and higher](#emr-default-ami-7)
+ [Default AMIs for Amazon EMR 6.6 and higher](#emr-default-ami-6)
+ [Default AMIs for Amazon EMR 5.x](#emr-default-ami-5)

### Default AMIs for Amazon EMR 7.0 and higher


The following table lists Amazon Linux information for the latest patch version of Amazon EMR releases 7.0 and higher.


| OsReleaseLabel (AL version) | AL kernel version | Available date | AWS Regions | 
| --- | --- | --- | --- | 
| 2023.10.20260302.1 | 6.1.163-186.299.amzn2023 | March 13, 2026 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.10.20260216.1 | 6.1.161-183.298.amzn2023 | February 25, 2026 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.10.20260120.4 | 6.1.159-182.297.amzn2023 | February 18, 2026 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.9.20251208.0 | 6.1.158-180.294.amzn2023 | January 13, 2026 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.9.20251117.1 | 6.1.158-178.288.amzn2023 | December 16, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.9.20251027.0 | 6.1.156-177.286.amzn2023 | November 10, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.9.20250929.0 | 6.1.153-175.280.amzn2023 | October 13, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.8.20250818.0 | 6.1.147-172.266.amzn2023 | September 17, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.8.20250808.1 | 6.1.147-172.266.amzn2023 | August 28, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.8.20250721.2 | 6.1.144-170.251.amzn2023 | August 14, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.7.20250623.1 | 6.1.141-155.222.amzn2023 | July 21, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.7.20250609.0 | 6.1.140-154.222.amzn2023 | July 14, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.7.20250527.1 | 6.1.134-152.225.amzn2023 | June 19, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.7.20250512.0 | 6.1.134-152.225.amzn2023 | June 04, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.7.20250428.1 | 6.1.134-150.224.amzn2023 | May 23, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.7.20250414.0 | 6.1.132-147.221.amzn2023 | May 12, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.7.20250331.0 | 6.1.131-143.221.amzn2023 | April 18, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.6.20250303.0 | 6.1.129-138.220.amzn2023 | March 27, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.6.20250218.2 | 6.1.128-136.201.amzn2023 | March 08, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.6.20250211.0 | 6.1.127-135.201.amzn2023 | February 26, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.6.20250123.4 | 6.1.124-134.200.amzn2023 | January 27, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.6.20250115.0 | 6.1.119-129.201.amzn2023 | January 23, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.6.20241121.0 | 6.1.115-126.197.amzn2023 | December 12, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.5.20241031.0 | 6.1.112-124.190.amzn2023 | November 15, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.5.20241001.1  | 6.1.109-118.189.amzn2023 | October 4, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.5.20240819.0 | 6.1.102-111.182.amzn2023 | August 20, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.5.20240730.0 | 6.1.97-104.177.amzn2023 | August 2, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.5.20240722.0 | 6.1.97-104.177.amzn2023 | July 24, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.5.20240708.0 | 6.1.96-102.177.amzn2023 | July 23, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.3.20240304.0 | 6.1.79-99.164.amzn2023 | March 12, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.3.20240219.0 | 6.1.77-99.164.amzn2023 | March 1, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.3.20240205.0 | 6.1.75-99.163.amzn2023 | February 19, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.3.20240122.0 | 6.1.72-96.166.amzn2023 | February 5, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.3.20240108.0 | 6.1.72-96.166.amzn2023 | January 24, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2023.3.20231211.4 | 6.1.66-91.160.amzn2023 | December 19, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 

### Default AMIs for Amazon EMR 6.6 and higher


The following table lists Amazon Linux information for the latest patch version of Amazon EMR releases 6.6.x and higher.


| OsReleaseLabel (AL version) | AL kernel version | Available date | AWS Regions | 
| --- | --- | --- | --- | 
| 2.0.20260302.0 | 4.14.355-280.714.amzn2 | March 13, 2026 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20260216.0 | 4.14.355-280.714.amzn2 | February 25, 2026 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20260120.1 | 4.14.355-280.713.amzn2 | February 18, 2026 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20251208.0 | 4.14.355-280.710.amzn2 | January 13, 2026 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20251121.0 | 4.14.355-280.708.amzn2 | December 16, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20251027.1 | 4.14.355-280.706.amzn2 | November 10, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250929.2 | 4.14.355-280.695.amzn2 | October 13, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250818.2 | 4.14.355-280.672.amzn2 | September 17, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250808.1 | 4.14.355-280.664.amzn2 | August 28, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250610.0 | 4.14.355-277.647.amzn2 | July 14, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250527.1 | 4.14.355-277.647.amzn2 | June 19, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250512.0 | 4.14.355-277.643.amzn2 | June 04, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250428.0 | 4.14.355-276.639.amzn2 | May 23, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250414.0 | 4.14.355-276.618.amzn2 | May 12, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250321.0 | 4.14.355 | April 09, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250305.0 | 4.14.355 | March 18, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250220.0 | 4.14.355 | March 08, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250201.0 | 4.14.355 | February 28, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250123.4 | 4.14.355 | January 27, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250116.0 | 4.14.355 | January 23, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20241217.0 | 4.14.355 | January 8, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20241001.0 | 4.14.352 | October 4, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20240816.0 | 4.14.350 | August 21, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20240809.0 | 4.14.349 | August 20, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20240719.0 | 4.14.348 | July 25, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20240709.1 | 4.14.348 | July 23, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20240223.0 | 4.14.336 | March 8, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20240131.0 | 4.14.336 | February 14, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20240124.0 | 4.14.336 | February 7, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20240109.0 | 4.14.334 | January 24, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20231218.0 | 4.14.330 | January 2, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20231206.0 | 4.14.330 | December 22, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20231116.0 | 4.14.328 | December 11, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20231101.0 | 4.14.327 | November 17, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20231020.1 | 4.14.326 | November 07, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20231012.1 | 4.14.326 | October 26, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20230926.0 | 4.14.322 | October 19, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20230906.0 | 4.14.322 | October 04, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20230822.0 | 4.14.322 | August 30, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20230808.0 | 4.14.320 | August 24, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20230727.0 | 4.14.320 | August 14, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20230719.0 | 4.14.320 | August 2, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20230628.0 | 4.14.318 | July 12, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20230612.0 | 4.14.314 | June 23, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20230504.1 | 4.14.313 | May 16, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20230418.0 | 4.14.311 | May 3, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20230404.1 | 4.14.311 | April 18, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20230404.0 | 4.14.311 | April 10, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20230320.0 | 4.14.309 | March 30, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20230307.0 | 4.14.305 | March 15, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20230207.0 | 4.14.304 | March 3, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20230119.1 | 4.14.301 | February 9, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20221210.1 | 4.14.301 | January 12, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20221103.3 | 4.14.296 | December 5, 2022 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20221004.0 | 4.14.294 | November 2, 2022 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20220912.1 | 4.14.291 | October 7, 2022 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20220805.0 | 4.14.287 | August 30, 2022 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20220719.0 | 4.14.287 | August 10, 2022 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20220426.0 | 4.14.281 | June 10, 2022 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20220406.1 | 4.14.275 | May 2, 2022 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 

### Default AMIs for Amazon EMR 5.x


The following table lists Amazon Linux information for the latest patch version of Amazon EMR 5.x releases 5.36 and higher.


| OsReleaseLabel (AL version) | AL kernel version | Available date | AWS Regions | 
| --- | --- | --- | --- | 
| 2.0.20260302.0 | 4.14.355-280.714.amzn2 | March 13, 2026 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20260216.0 | 4.14.355-280.714.amzn2 | February 25, 2026 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20260120.1 | 4.14.355-280.713.amzn2 | February 18, 2026 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20251208.0 | 4.14.355-280.710.amzn2 | January 13, 2026 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20251121.0 | 4.14.355-280.708.amzn2 | December 16, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20251027.1 | 4.14.355-280.706.amzn2 | November 10, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250929.2 | 4.14.355-280.695.amzn2 | October 13, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250818.2 | 4.14.355-280.672.amzn2 | September 17, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250808.1 | 4.14.355-280.664.amzn2 | August 28, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250428.0 | 4.14.355-276.639.amzn2 | May 23, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250321.0 | 4.14.355 | April 09, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250305.0 | 4.14.355 | March 18, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250220.0 | 4.14.355 | March 08, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250201.0 | 4.14.355 | February 28, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250123.4 | 4.14.355 | January 27, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20250116.0 | 4.14.355 | January 23, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20241217.0 | 4.14.355 | January 8, 2025 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20240816.0 | 4.14.350 | August 21, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20240809.0 | 4.14.349 | August 20, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20240719.0 | 4.14.348 | July 25, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20240709.1 | 4.14.348 | July 23, 2024 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html)  | 
| 2.0.20230504.1 | 4.14.313 | May 16, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20230418.0 | 4.14.311 | May 3, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20230404.1 | 4.14.311 | April 18, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20230404.0 | 4.14.311 | April 10, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20230320.0 | 4.14.309 | March 30, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20230307.0 | 4.14.305 | March 15, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20230207.0 | 4.14.304 | March 3, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20221210.1 | 4.14.301 | January 12, 2023 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20221103.3 | 4.14.296 | December 5, 2022 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20221004.0 | 4.14.294 | November 2, 2022 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20220912.1 | 4.14.291 | October 7, 2022 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20220719.0 | 4.14.287 | August 10, 2022 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 
| 2.0.20220426.0 | 4.14.281 | June 14, 2022 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-default-ami.html) | 

## Software update considerations

Take note of the following default software update behaviors:

**Amazon EMR 7.x — Amazon Linux 2023**

Amazon EMR releases 7.0 and higher run on Amazon Linux 2023 (AL2023). Your clusters only contain security updates that were available in the version of AL2023 AMI that you chose when you created them. At launch time, EMR cluster instances will not install the latest security updates from the enabled package repositories. To receive the latest security updates, we recommend that you periodically recreate your cluster. For more information on AL2023, see [Updating Amazon Linux 2023](https://docs.aws.amazon.com/linux/al2023/ug/updating.html) in the *Amazon Linux 2023 User Guide*.

**Amazon EMR 5.36 and higher, and 6.6 and higher — Amazon Linux 2**

Amazon EMR releases 5.36 and higher, and 6.6 and higher, run on Amazon Linux 2 (AL2). Your clusters only contain security updates that were available in the version of AL2 AMI that you chose when you created them. At launch time, EMR cluster instances will not install the latest security updates from the enabled package repositories. To receive the latest security updates, we recommend that you periodically recreate your cluster. For more information on AL2, see [Manage software on your AL2 instance](https://docs.aws.amazon.com//linux/al2/ug/managing-software.html) in the *Amazon Linux 2 User Guide*.
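To confirm which Amazon Linux release and kernel a running cluster instance actually uses, you can connect to the instance with SSH and inspect the OS release files. This is a minimal sketch; the exact output varies by AMI version.

```shell
# On the cluster instance (over SSH): report the Amazon Linux release and kernel.
cat /etc/system-release   # for example, an Amazon Linux 2 or Amazon Linux 2023 release string
uname -r                  # the running kernel version, such as 4.14.x or 6.1.x
```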

**Amazon EMR 3.x, 4.x, 5.0.0 to 5.35.0, and 6.0.0 to 6.5.0 — Amazon Linux AMI locked to Amazon EMR release version**

For Amazon EMR releases 3.x, 4.x, 5.0.0 to 5.35.0, and 6.0.0 to 6.5.0, the default AMI is based on the most up-to-date Amazon Linux AMI available at the time of the Amazon EMR release. The AMI is tested for compatibility with the big data applications and Amazon EMR features included in that release version.

When an Amazon EC2 instance boots for the first time in a cluster that is based on the default Amazon Linux (AL) or AL2 AMI for Amazon EMR, it checks for software updates that apply to the release version in the enabled package repositories for AL and Amazon EMR. As with other Amazon EC2 instances that run AL or AL2 AMIs, critical and important security updates from these repositories are automatically installed.

Also note that, in your networking configuration, you must allow HTTP and HTTPS egress to Amazon Linux repositories in Amazon S3. Otherwise, security updates will fail. For more information, see [Amazon Linux - Package repository](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/amazon-linux-ami-basics.html#package-repository) in the *Amazon EC2 User Guide*. By default, other software packages and kernel updates that require a reboot, including NVIDIA and CUDA, are excluded from the automatic download at first boot.

**Important**  
Amazon EMR releases 3.x, 4.x, 5.0.0 to 5.35.0, and 6.0.0 to 6.5.0 use the default Amazon Linux behavior, and do not automatically download and install important and critical kernel updates that require a reboot. This is the same behavior as other Amazon EC2 instances that run the default AL or AL2 AMIs. If new Amazon Linux software updates that require a reboot (such as kernel, NVIDIA, and CUDA updates) become available after an Amazon EMR release becomes available, EMR cluster instances that run the default AMI do not automatically download and install those updates. To get kernel updates, you can use the [latest Amazon Linux AMI](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/finding-an-ami.html) as described in [Using a custom AMI to provide more flexibility for Amazon EMR cluster configuration](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-custom-ami.html).

**The cluster launches with or without updates**

Be aware that if software updates can't be installed because package repositories are unreachable at first cluster boot, the cluster instance still completes its launch. For instance, repositories may be unreachable because S3 is temporarily unavailable, or you might have VPC or firewall rules configured to block access.

**Don't run `sudo yum update`**

When you connect to a cluster instance using SSH, the first few lines of screen output provide a link to the release notes for the Amazon Linux AMI that the instance uses, a notice of the most recent Amazon Linux AMI version, a notice of the number of packages available for update from the enabled repositories, and a directive to run `sudo yum update`.

**Important**  
We strongly recommend that you do not run `sudo yum update` on cluster instances, either while connected with SSH or when you use a bootstrap action. This might cause incompatibilities because all packages are installed indiscriminately.
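Instead of a blanket update, you can review the available security advisories and update only the packages you determine that you need. The following is a sketch; `MyPackage` is a placeholder for whatever package an advisory names.

```shell
# List security advisories available from the enabled repositories; installs nothing.
yum updateinfo list security

# Update only the single package that an advisory names, rather than every package.
sudo yum update MyPackage
```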

## Software update best practices

**Best practices for managing software updates**
+ If you use a lower release version of Amazon EMR, consider and test a migration to the latest release before updating software packages.
+ If you migrate to a higher release version or you upgrade software packages, test the implementation in a non-production environment first. The option to clone clusters with the Amazon EMR console is helpful for this.
+ Evaluate software updates for your applications and for your version of Amazon Linux AMI on an individual basis. Only test and install packages in production environments that you determine to be absolutely necessary for your security posture, application functionality, or performance.
+ Watch the [Amazon Linux Security Center](https://alas.aws.amazon.com/) for updates.
+ Avoid installing packages by connecting to individual cluster instances using SSH. Instead, use a bootstrap action to install and update packages on all cluster instances as necessary. This requires that you terminate a cluster and relaunch it. For more information, see [Create bootstrap actions to install additional software with an Amazon EMR cluster](emr-plan-bootstrap.md).
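Following the last point, a bootstrap action is a script stored in Amazon S3 that Amazon EMR runs on every instance during provisioning, before applications start. The script name, bucket, and package below are hypothetical placeholders; this is a sketch, not a definitive implementation.

```shell
#!/bin/bash
# install-packages.sh — hypothetical bootstrap action; upload to S3 before cluster launch.
# Runs on every cluster instance during provisioning, before applications start.
set -euxo pipefail

# Install only the specific package you determined is necessary.
sudo yum install -y MySoftwarePackage
```

You would then attach the script when you launch the cluster, for example with `--bootstrap-actions Path=s3://amzn-s3-demo-bucket/install-packages.sh` on the `aws emr create-cluster` command.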

# Using a custom AMI to provide more flexibility for Amazon EMR cluster configuration


When you use Amazon EMR 5.7.0 or higher, you can choose to specify a custom Amazon Linux AMI instead of the default Amazon Linux AMI for Amazon EMR. A custom AMI is useful if you want to do the following:
+ Pre-install applications and perform other customizations instead of using bootstrap actions. This can improve cluster start time and streamline the startup workflow. For more information and an example, see [Creating a custom Amazon Linux AMI from a preconfigured instance](#emr-custom-ami-preconfigure).
+ Implement more sophisticated cluster and node configurations than bootstrap actions allow.
+ Encrypt the EBS root device volumes (boot volumes) of EC2 instances in your cluster if you are using an Amazon EMR version lower than 5.24.0. As with the default AMI, the minimum root volume size for a custom AMI is 10 GiB for Amazon EMR releases 6.9 and lower, and 15 GiB for Amazon EMR releases 6.10 and higher. For more information, see [Creating a custom AMI with an encrypted Amazon EBS root device volume](#emr-custom-ami-encrypted).
**Note**  
Beginning with Amazon EMR version 5.24.0, you can use a security configuration option to encrypt EBS root device and storage volumes when you specify AWS KMS as your key provider. For more information, see [Local disk encryption](emr-data-encryption-options.md#emr-encryption-localdisk).

A custom AMI must exist in the same AWS Region where you create the cluster. It should also match the EC2 instance architecture. For example, an m5.xlarge instance has an x86_64 architecture. Therefore, to provision an m5.xlarge using a custom AMI, your custom AMI should also have x86_64 architecture. Similarly, to provision an m6g.xlarge instance, which has arm64 architecture, your custom AMI should have arm64 architecture. For more information about identifying a Linux AMI for your instance type, see [Find a Linux AMI](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/finding-an-ami.html) in the *Amazon EC2 User Guide*.

**Important**  
EMR clusters that run Amazon Linux or Amazon Linux 2 Amazon Machine Images (AMIs) use default Amazon Linux behavior, and do not automatically download and install important and critical kernel updates that require a reboot. This is the same behavior as other Amazon EC2 instances that run the default Amazon Linux AMI. If new Amazon Linux software updates that require a reboot (such as kernel, NVIDIA, and CUDA updates) become available after an Amazon EMR release becomes available, EMR cluster instances that run the default AMI do not automatically download and install those updates. To get kernel updates, you can [customize your Amazon EMR AMI](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-custom-ami.html) to [use the latest Amazon Linux AMI](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/finding-an-ami.html).

## Creating a custom Amazon Linux AMI from a preconfigured instance


The basic steps for pre-installing software and performing other configurations to create a custom Amazon Linux AMI for Amazon EMR are as follows:
+ Launch an instance from the base Amazon Linux AMI.
+ Connect to the instance to install software and perform other customizations.
+ Create a new image (AMI snapshot) of the instance you configured.

After you create the image based on your customized instance, you can copy that image to an encrypted target as described in [Creating a custom AMI with an encrypted Amazon EBS root device volume](#emr-custom-ami-encrypted).

### Tutorial: Creating an AMI from an instance with custom software installed


**To launch an EC2 instance based on the most recent Amazon Linux AMI**

1. Use the AWS CLI to run the following command, which creates an instance from an existing AMI. Replace `MyKeyName` with the key pair you use to connect to the instance and *MyAmiId* with the ID of an appropriate Amazon Linux AMI. For the most recent AMI IDs, see [Amazon Linux AMI](https://aws.amazon.com/amazon-linux-ami/).
**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

   ```
   aws ec2 run-instances --image-id MyAmiID \
   --count 1 --instance-type m5.xlarge \
   --key-name MyKeyName --region us-west-2
   ```

   The `InstanceId` output value is used as `MyInstanceId` in the next step.

1. Run the following command:

   ```
   aws ec2 describe-instances --instance-ids MyInstanceId
   ```

   The `PublicDnsName` output value is used to connect to the instance in the next step.
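If you prefer not to scan the full JSON output, a `--query` filter can return just the DNS name. This sketch assumes the instance ID maps to a single reservation with a single instance.

```shell
aws ec2 describe-instances --instance-ids MyInstanceId \
  --query 'Reservations[0].Instances[0].PublicDnsName' --output text
```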

**To connect to the instance and install software**

1. Use an SSH connection that lets you run shell commands on your Linux instance. For more information, see [Connecting to your Linux instance using SSH](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html) in the *Amazon EC2 User Guide*.

1. Perform any required customizations. For example:

   ```
   sudo yum install MySoftwarePackage
   sudo pip install MySoftwarePackage
   ```

**To create a snapshot from your customized image**
+ After you customize the instance, use the `create-image` command to create an AMI from the instance.

  ```
  aws ec2 create-image --no-dry-run --instance-id MyInstanceId --name MyEmrCustomAmi
  ```

  The `imageID` output value is used when you launch the cluster or create an encrypted snapshot. For more information, see [Use a single custom AMI in an EMR cluster](#single-custom-ami) and [Creating a custom AMI with an encrypted Amazon EBS root device volume](#emr-custom-ami-encrypted).
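The new AMI is not usable until its state is `available`. Before you launch a cluster with it, you can block until registration finishes by using the standard EC2 waiter; replace *MyAmiId* with the `imageID` value from the `create-image` output.

```shell
# Returns when the AMI reaches the "available" state (or fails after the waiter times out).
aws ec2 wait image-available --image-ids MyAmiId --region us-west-2
```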

## How to use a custom AMI in an Amazon EMR cluster


You can use a custom AMI to provision an Amazon EMR cluster in two ways:
+ Use a single custom AMI for all the EC2 instances in the cluster.
+ Use different custom AMIs for the different EC2 instance types used in the cluster.

You can use only one of the two options when provisioning an EMR cluster, and you cannot change it once the cluster has started.


**Considerations for using single versus multiple custom AMIs in an Amazon EMR cluster**  

| Consideration | Single custom AMI | Multiple custom AMIs | 
| --- | --- | --- | 
|  Use both x86 and Graviton2 processors with custom AMIs in the same cluster  |   ![\[No\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/icon-no.png) Not supported  |   ![\[Yes\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/icon-yes.png) Supported  | 
|  AMI customization varies across instance types  |   ![\[No\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/icon-no.png) Not supported  |   ![\[Yes\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/icon-yes.png) Supported  | 
|  Change custom AMIs when adding new task instance groups/fleets to a running cluster. Note: you cannot change the custom AMI of existing instance groups/fleets.  |   ![\[No\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/icon-no.png) Not supported  |   ![\[Yes\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/icon-yes.png) Supported  | 
|  Use AWS Console to start a cluster  |   ![\[Yes\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/icon-yes.png) Supported  |   ![\[No\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/icon-no.png) Not supported  | 
|  Use AWS CloudFormation to start a cluster  |   ![\[Yes\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/icon-yes.png) Supported  |   ![\[Yes\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/icon-yes.png) Supported  | 

## Use a single custom AMI in an EMR cluster


To specify a custom AMI ID when you create a cluster, use one of the following:
+ AWS Management Console
+ AWS CLI
+ Amazon EMR SDK
+ Amazon EMR API [RunJobFlow](https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html)
+ AWS CloudFormation (see the `CustomAmiID` property in [Cluster InstanceGroupConfig](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-elasticmapreduce-cluster-instancegroupconfig.html), [Cluster InstanceTypeConfig](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-elasticmapreduce-cluster-instancetypeconfig.html), [Resource InstanceGroupConfig](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-emr-instancegroupconfig.html), or [Resource InstanceFleetConfig-InstanceTypeConfig](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-elasticmapreduce-instancefleetconfig-instancetypeconfig.html))

------
#### [ Amazon EMR console ]

**To specify a single custom AMI from the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. Under **Name and applications**, find **Operating system options**. Choose **Custom AMI**, and enter your AMI ID in the **Custom AMI** field. 

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.

------
#### [ AWS CLI ]

**To specify a single custom AMI with the AWS CLI**
+ Use the `--custom-ami-id` parameter to specify the AMI ID when you run the `aws emr [create-cluster](https://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html)` command.

  The following example specifies a cluster that uses a single custom AMI with a 20 GiB boot volume. For more information, see [Customizing the Amazon EBS root device volume](emr-custom-ami-root-volume-size.md).
**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

  ```
  aws emr create-cluster --name "Cluster with My Custom AMI" \
  --custom-ami-id MyAmiID --ebs-root-volume-size 20 \
  --release-label emr-5.7.0 --use-default-roles \
  --instance-count 2 --instance-type m5.xlarge
  ```

------

## Use multiple custom AMIs in an Amazon EMR cluster


To create a cluster using multiple custom AMIs, use one of the following: 
+ AWS CLI version 1.20.21 or higher
+ AWS SDK
+ Amazon EMR [RunJobFlow](https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html) in the *Amazon EMR API Reference*
+ AWS CloudFormation (see the `CustomAmiID` property in [Cluster InstanceGroupConfig](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-elasticmapreduce-cluster-instancegroupconfig.html), [Cluster InstanceTypeConfig](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-elasticmapreduce-cluster-instancetypeconfig.html), [Resource InstanceGroupConfig](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-emr-instancegroupconfig.html), or [Resource InstanceFleetConfig-InstanceTypeConfig](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-elasticmapreduce-instancefleetconfig-instancetypeconfig.html))

The AWS Management Console currently does not support creating a cluster using multiple custom AMIs.

**Example - Use the AWS CLI to create an instance group cluster using multiple custom AMIs**  
Using the AWS CLI version 1.20.21 or higher, you can assign a single custom AMI to the entire cluster, or you can assign a different custom AMI to each instance group in your cluster.  
The following example shows a uniform instance group cluster created with two instance types (m5.xlarge and m6g.xlarge) across node types (primary, core, task). Each node type has its own custom AMI. The example illustrates several features of the multiple custom AMI configuration:   
+ There is no custom AMI assigned at the cluster level. This is to avoid conflicts between the multiple custom AMIs and a single custom AMI, which would cause the cluster launch to fail.
+ The cluster can have multiple custom AMIs across primary, core, and individual task nodes. This allows individual AMI customizations, such as pre-installed applications, sophisticated cluster configurations, and encrypted Amazon EBS root device volumes.
+ The instance group core node can have only one instance type and corresponding custom AMI. Similarly, the primary node can have only one instance type and corresponding custom AMI. 
+ The cluster can have multiple task nodes.

```
aws emr create-cluster --instance-groups 
InstanceGroupType=PRIMARY,InstanceType=m5.xlarge,InstanceCount=1,CustomAmiId=ami-123456 
InstanceGroupType=CORE,InstanceType=m5.xlarge,InstanceCount=1,CustomAmiId=ami-234567
InstanceGroupType=TASK,InstanceType=m6g.xlarge,InstanceCount=1,CustomAmiId=ami-345678
InstanceGroupType=TASK,InstanceType=m5.xlarge,InstanceCount=1,CustomAmiId=ami-456789
```

**Example - Use the AWS CLI version 1.20.21 or higher to add a task node to a running instance group cluster with multiple instance types and multiple custom AMIs**  
Using the AWS CLI version 1.20.21 or higher, you can add multiple custom AMIs to an instance group that you add to a running cluster. The `CustomAmiId` argument can be used with the `add-instance-groups` command as shown in the following example. Notice that the same custom AMI ID (ami-123456) is used on more than one node type.   

```
aws emr create-cluster --instance-groups 
InstanceGroupType=PRIMARY,InstanceType=m5.xlarge,InstanceCount=1,CustomAmiId=ami-123456 
InstanceGroupType=CORE,InstanceType=m5.xlarge,InstanceCount=1,CustomAmiId=ami-123456
InstanceGroupType=TASK,InstanceType=m5.xlarge,InstanceCount=1,CustomAmiId=ami-234567

{
    "ClusterId": "j-123456",
    ...
}

aws emr add-instance-groups --cluster-id j-123456 --instance-groups InstanceGroupType=TASK,InstanceType=m6g.xlarge,InstanceCount=1,CustomAmiId=ami-345678
```

**Example - Use the AWS CLI version 1.20.21 or higher to create an instance fleet cluster with multiple custom AMIs, multiple instance types, an On-Demand primary node, On-Demand core nodes, and Spot task nodes**  

```
aws emr create-cluster --instance-fleets 
InstanceFleetType=PRIMARY,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge, CustomAmiId=ami-123456}'] 
InstanceFleetType=CORE,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge,CustomAmiId=ami-234567},{InstanceType=m6g.xlarge, CustomAmiId=ami-345678}']
InstanceFleetType=TASK,TargetSpotCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge,CustomAmiId=ami-456789},{InstanceType=m6g.xlarge, CustomAmiId=ami-567890}']
```

**Example - Use the AWS CLI version 1.20.21 or higher to add task nodes to a running cluster with multiple instance types and multiple custom AMIs**  

```
aws emr create-cluster --instance-fleets 
InstanceFleetType=PRIMARY,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge, CustomAmiId=ami-123456}'] 
InstanceFleetType=CORE,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge,CustomAmiId=ami-234567},{InstanceType=m6g.xlarge, CustomAmiId=ami-345678}']

{
    "ClusterId": "j-123456",
    ...
}

aws emr add-instance-fleet --cluster-id j-123456 --instance-fleet 
InstanceFleetType=TASK,TargetSpotCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge,CustomAmiId=ami-234567},{InstanceType=m6g.xlarge, CustomAmiId=ami-345678}']
```

## Managing AMI package repository updates


On first boot, by default, Amazon Linux AMIs connect to package repositories to install security updates before other services start. Depending on your requirements, you may choose to disable these updates when you specify a custom AMI for Amazon EMR. The option to disable this feature is available only when you use a custom AMI. By default, Amazon Linux kernel updates and other software packages that require a reboot are not updated. Note that your networking configuration must allow HTTP and HTTPS egress to Amazon Linux repositories in Amazon S3; otherwise, security updates will not succeed.

**Warning**  
We strongly recommend that you choose to update all installed packages on reboot when you specify a custom AMI. Choosing not to update packages creates additional security risks.

With the AWS Management Console, you can select the option to disable updates when you choose **Custom AMI**. 

With the AWS CLI, you can specify `--repo-upgrade-on-boot NONE` along with `--custom-ami-id` when using the **create-cluster** command. 

With the Amazon EMR API, you can specify `NONE` for the [RepoUpgradeOnBoot](https://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_RunJobFlow.html#EMR-RunJobFlow-request-RepoUpgradeOnBoot) parameter.
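Putting the CLI option together with the earlier single-AMI example, a launch that skips the first-boot package updates might look like the following sketch. Review the warning above before disabling updates.

```shell
aws emr create-cluster --name "Cluster with My Custom AMI, no boot updates" \
--custom-ami-id MyAmiID --repo-upgrade-on-boot NONE \
--release-label emr-5.7.0 --use-default-roles \
--instance-count 2 --instance-type m5.xlarge
```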

## Creating a custom AMI with an encrypted Amazon EBS root device volume


To encrypt the Amazon EBS root device volume of an Amazon Linux AMI for Amazon EMR, copy a snapshot image from an unencrypted AMI to an encrypted target. For information about creating encrypted EBS volumes, see [Amazon EBS encryption](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSEncryption.html) in the *Amazon EC2 User Guide*. The source AMI for the snapshot can be the base Amazon Linux AMI, or you can copy a snapshot from an AMI derived from the base Amazon Linux AMI that you customized. 

**Note**  
Beginning with Amazon EMR version 5.24.0, you can use a security configuration option to encrypt EBS root device and storage volumes when you specify AWS KMS as your key provider. For more information, see [Local disk encryption](emr-data-encryption-options.md#emr-encryption-localdisk).

You can use an external key provider or an AWS KMS key to encrypt the EBS root volume. The service role that Amazon EMR uses (usually the default `EMR_DefaultRole`) must be allowed to encrypt and decrypt the volume, at minimum, for Amazon EMR to create a cluster with the AMI. When using AWS KMS as the key provider, this means that the following actions must be allowed:
+ `kms:Encrypt`
+ `kms:Decrypt`
+ `kms:ReEncrypt*`
+ `kms:CreateGrant`
+ `kms:GenerateDataKeyWithoutPlaintext`
+ `kms:DescribeKey`

The simplest way to do this is to add the role as a key user as described in the following tutorial. The following example policy statement is provided if you need to customize role policies.

------
#### [ JSON ]

****  

```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Sid": "EmrDiskEncryptionPolicy",
      "Effect": "Allow",
      "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:ReEncrypt*",
        "kms:CreateGrant",
        "kms:GenerateDataKeyWithoutPlaintext",
        "kms:DescribeKey"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
```

------

### Tutorial: Creating a custom AMI with an encrypted root device volume using a KMS key


The first step in this example is to find the ARN of a KMS key or create a new one. For more information about creating keys, see [Creating keys](https://docs.aws.amazon.com/kms/latest/developerguide/create-keys.html) in the *AWS Key Management Service Developer Guide*. The following procedure shows you how to add the default service role, `EMR_DefaultRole`, as a key user to the key policy. Write down the **ARN** value for the key as you create or edit it. You use the ARN later, when you create the AMI.

**To add the service role for Amazon EC2 to the list of encryption key users with the console**

1. Sign in to the AWS Management Console and open the AWS Key Management Service (AWS KMS) console at [https://console.aws.amazon.com/kms](https://console.aws.amazon.com/kms).

1. To change the AWS Region, use the Region selector in the upper-right corner of the page.

1. Choose the alias of the KMS key to use.

1. On the key details page under **Key Users**, choose **Add**.

1. In the **Attach** dialog box, choose the Amazon EMR service role. The name of the default role is `EMR_DefaultRole`.

1. Choose **Attach**.

**To create an encrypted AMI with the AWS CLI**
+ Use the `aws ec2 copy-image` command from the AWS CLI to create an AMI with an encrypted EBS root device volume and the key that you modified. Replace the `--kms-key-id` value specified with the full ARN of the key that you created or modified earlier.
**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

  ```
  aws ec2 copy-image --source-image-id MyAmiId \
  --source-region us-west-2 --name MyEncryptedEMRAmi \
  --encrypted --kms-key-id arn:aws:kms:us-west-2:12345678910:key/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  ```

The output of the command provides the ID of the AMI that you created, which you can specify when you create a cluster. For more information, see [Use a single custom AMI in an EMR cluster](#single-custom-ami). You can also choose to customize this AMI by installing software and performing other configurations. For more information, see [Creating a custom Amazon Linux AMI from a preconfigured instance](#emr-custom-ami-preconfigure).
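To confirm that the copy produced an encrypted root volume, you can inspect the image's block device mapping. This is a sketch; it assumes the first mapping is the root device, which is typical for a single-volume Amazon Linux AMI.

```shell
aws ec2 describe-images --image-ids MyEncryptedAmiId --region us-west-2 \
  --query 'Images[0].BlockDeviceMappings[0].Ebs.Encrypted'
```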

## Best practices and considerations


When you create a custom AMI for Amazon EMR, consider the following:
+ The Amazon EMR 7.x series are based on Amazon Linux 2023. For these Amazon EMR versions, you need to use images based on Amazon Linux 2023 for custom AMIs. To find a base custom AMI, see [Finding a Linux AMI](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/finding-an-ami.html). 
+ The Amazon Linux 2023 AMI should be based on Linux kernel 6.1; using an Amazon Linux 2023 AMI based on Linux kernel version 6.12 can result in cluster provisioning failures.
+ For Amazon EMR versions lower than 7.x, Amazon Linux 2023 AMIs are not supported.
+ Amazon EMR 5.30.0 and higher, and the Amazon EMR 6.x series are based on Amazon Linux 2. For these Amazon EMR versions, you need to use images based on Amazon Linux 2 for custom AMIs. To find a base custom AMI, see [Finding a Linux AMI](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/finding-an-ami.html).
+ For Amazon EMR versions lower than 5.30.0 and 6.x, Amazon Linux 2 AMIs are not supported.
+ You must use a 64-bit Amazon Linux AMI. A 32-bit AMI is not supported.
+ Amazon Linux AMIs with multiple Amazon EBS volumes are not supported.
+ Base your customization on the most recent EBS-backed [Amazon Linux AMI](https://aws.amazon.com/amazon-linux-ami/). For a list of Amazon Linux AMIs and corresponding AMI IDs, see [Amazon Linux AMI](https://aws.amazon.com/amazon-linux-ami/).
+ Do not copy a snapshot of an existing Amazon EMR instance to create a custom AMI. This causes errors.
+ Only the HVM virtualization type and instances compatible with Amazon EMR are supported. Be sure to select the HVM image and an instance type compatible with Amazon EMR as you go through the AMI customization process. For compatible instances and virtualization types, see [Supported instance types with Amazon EMR](emr-supported-instance-types.md).
+ Your service role must have launch permissions on the AMI, so either the AMI must be public, or you must be the owner of the AMI or have it shared with you by the owner.
+ Creating users on the AMI with the same name as applications causes errors (for example, `hadoop`, `hdfs`, `yarn`, or `spark`).
+ The contents of `/tmp`, `/var`, and `/emr` (if they exist on the AMI) are moved to `/mnt/tmp`, `/mnt/var`, and `/mnt/emr` respectively during startup. Files are preserved, but if there is a large amount of data, startup may take longer than expected.
+ If you use a custom Amazon Linux AMI based on an Amazon Linux AMI with a creation date of 2018-08-11, the Oozie server fails to start. If you use Oozie, create a custom AMI based on an Amazon Linux AMI ID with a different creation date. You can use the following AWS CLI command to return a list of Image IDs for all HVM Amazon Linux AMIs with a 2018.03 version, along with the release date, so that you can choose an appropriate Amazon Linux AMI as your base. Replace MyRegion with your Region identifier, such as us-west-2.

  ```
  aws ec2 --region MyRegion describe-images --owner amazon --query 'Images[?Name!=`null`]|[?starts_with(Name, `amzn-ami-hvm-2018.03`) == `true`].[CreationDate,ImageId,Name]' --output text | sort -rk1
  ```
+ If you use a VPC with a non-standard domain name and AmazonProvidedDNS, do not use the `rotate` option in the operating system's DNS configuration.
+ If you create a custom AMI from an instance with the Amazon EC2 Systems Manager (SSM) agent enabled, the agent can cause a provisioning error on the cluster. To avoid this, disable the SSM agent on the Amazon EC2 instance before you use that instance to create the custom AMI and launch your EMR cluster. 

For more information, see [Creating an Amazon EBS-backed Linux AMI](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/creating-an-ami-ebs.html) in the *Amazon EC2 User Guide*.

# Change the Amazon Linux release when you create an EMR cluster

When you launch a cluster using Amazon EMR 6.6.0 or higher, it automatically uses the latest Amazon Linux 2 release that has been validated for the default Amazon EMR AMI. You can specify a different Amazon Linux release for your cluster with the Amazon EMR console or the AWS CLI. 

------
#### [ Amazon EMR console ]

**To change the Amazon Linux release when you create a cluster from the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. For **EMR version**, choose **emr-6.6.0** or higher. 

1. Under **Operating system options**, choose **Amazon Linux release**, clear the **Automatically apply latest Amazon Linux updates** check box, and choose the **Amazon Linux release** that you want.

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.

------
#### [ AWS CLI ]

**To change the Amazon Linux release when you create a cluster with the AWS CLI**
+ Use the `--os-release-label` parameter to specify the **Amazon Linux Release** when you run the `aws emr` [create-cluster](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/emr/create-cluster.html) command.

  ```
  aws emr create-cluster --name "Cluster with Different Amazon Linux Release" \
  --os-release-label 2.0.20210312.1 \
  --release-label emr-6.6.0 --use-default-roles \
  --instance-count 2 --instance-type m5.xlarge
  ```

------

# Customizing the Amazon EBS root device volume

You can set your volume type and other attributes, depending on your use case and cost requirements. You can accept default values or make customizations.

## EBS root volume defaults

With Amazon EMR 4.x and higher, you can specify the root volume size when you create a cluster. With Amazon EMR releases 6.15.0 and higher, you can also specify the root volume IOPS and throughput. The attributes apply only to the Amazon EBS root device volume, and apply to all instances in the cluster. The attributes don’t apply to storage volumes, which you specify separately for each instance type when you create your cluster.
+ The default root volume size is 15 GiB in Amazon EMR 6.10.0 and higher. Earlier releases have a default root volume size of 10 GiB. You can adjust this up to 100 GiB.
+ The default root volume IOPS is 3000. You can adjust this up to 16000.
+ The default root volume throughput is 125 MiB/s. You can adjust this up to 1000 MiB/s.

**Note**  
The root volume IOPS can't exceed 500 times the root volume size in GiB (a 1:500 size-to-IOPS ratio), and the root volume throughput in MiB/s can't exceed 0.25 times the IOPS (a 1:0.25 IOPS-to-throughput ratio).
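For example, under these limits a 20 GiB root volume supports at most 20 × 500 = 10,000 IOPS, and 4,000 provisioned IOPS support at most 4,000 × 0.25 = 1,000 MiB/s of throughput. A quick validity check for a planned configuration might look like the following sketch (the values are hypothetical examples):

```
awk -v size=20 -v iops=4000 -v tput=1000 'BEGIN {
  # size:IOPS limit is 1:500; IOPS:throughput limit is 1:0.25
  ok = (iops <= size * 500) && (tput <= iops * 0.25)
  print (ok ? "valid" : "invalid")
}'
# prints "valid"
```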

For more information about Amazon EBS, see [Amazon EC2 root device volume](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/RootDeviceStorage.html).

## Root device volume type with the default AMI

When you use the default AMI, the root device volume type is determined by the Amazon EMR release that you use.
+ With Amazon EMR releases 6.15.0 and higher, Amazon EMR attaches **General Purpose SSD (gp3)** as the root device volume type.
+ With Amazon EMR releases lower than 6.15.0, Amazon EMR attaches **General Purpose SSD (gp2)** as the root device volume type.

## Root device volume type with the custom AMI

A custom AMI might have different root device volume types. Amazon EMR always uses your custom AMI volume type.
+ With Amazon EMR releases 6.15.0 and higher, you can configure root volume size, IOPS, and throughput for your custom AMI, provided that these attributes are applicable to the custom AMI volume type.
+ With Amazon EMR releases lower than 6.15.0, you can only configure the root volume size for your custom AMI.

If you do not configure root volume size, IOPS, or throughput when you create your cluster, Amazon EMR uses the values from the custom AMI if applicable. If you decide to configure these values when you create your cluster, Amazon EMR uses the values that you specify as long as the values are compatible with and supported by the custom AMI root volume. For more information, see [Using a custom AMI to provide more flexibility for Amazon EMR cluster configuration](emr-custom-ami.md).

## Root device volume size pricing

The cost of the EBS root device volume is pro-rated by the hour, based on the monthly EBS charges for that volume type in the Region where the cluster runs. The same is true of storage volumes. Charges are in GB, but you specify the size of the root volume in GiB, so you might want to consider this in your estimates (1 GB is 0.931323 GiB).

General Purpose SSD gp2 and gp3 are billed differently. To estimate the charges associated with EBS root device volumes in your cluster, use the following formulas:

**General Purpose SSD gp2**  
Cost for **gp2** includes only the EBS volume size in GB.  

```
($EBS size in GB/month) * 0.931323 / 30 / 24 * EMR_EBSRootVolumesizeInGiB * InstanceCount
```
For example, take a cluster that has a primary node and a core node and uses the base Amazon Linux AMI with the default 10 GiB root device volume. If the EBS cost in the Region is USD \$0.10/GB/month, that works out to approximately \$0.00129 per instance per hour, and \$0.00258 per hour for the cluster (\$0.10/GB/month multiplied by 0.931323, divided by 30 days, divided by 24 hours, multiplied by 10 GiB, multiplied by 2 cluster instances).
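The gp2 formula can also be scripted, for example with `awk`; the price, size, and instance count below are the example values from this section:

```
awk -v price=0.10 -v size=10 -v n=2 'BEGIN {
  # (EBS price per GB-month) * 0.931323 / 30 days / 24 hours * GiB * instances
  hourly = price * 0.931323 / 30 / 24 * size * n
  printf "cluster cost: $%.6f/hour\n", hourly
}'
# prints "cluster cost: $0.002587/hour"
```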

**General Purpose SSD gp3**  
Cost for **gp3** includes EBS volume size in GB, IOPS above 3000 (3000 IOPS free), and throughput above 125 MB/s (125 MB/s free).  

```
($EBS size in GB/month) * 0.931323 / 30 / 24 * EMR_EBSRootVolumesizeInGiB * InstanceCount
+
($EBS IOPS/Month)/30/24* (EMR_EBSRootVolumeIops - 3000) * InstanceCount
+
($EBS throughput/Month)/30/24* (EMR_EBSRootVolumeThroughputInMb/s - 125) * InstanceCount
```
For example, take a cluster that has a primary node and a core node and uses the base Amazon Linux AMI, with the default 15 GiB root device volume size, 4000 IOPS, and 140 MiB/s throughput. If the EBS cost in the Region is USD \$0.10/GB/month, \$0.005 per provisioned IOPS per month over 3000, and \$0.040 per provisioned MB/s per month over 125, that works out to approximately \$0.009293 per instance per hour, and \$0.018586 per hour for the cluster.

## Specifying custom root device volume settings

**Note**  
The root volume IOPS can't exceed 500 times the root volume size in GiB (a 1:500 size-to-IOPS ratio), and the root volume throughput in MiB/s can't exceed 0.25 times the IOPS (a 1:0.25 IOPS-to-throughput ratio).

------
#### [ Console ]

**To specify Amazon EBS root device volume attributes from the Amazon EMR console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. Select Amazon EMR release 6.15.0 or higher.

1. Under **Cluster configuration**, navigate to the **EBS root volume** section and enter a value for any of the attributes that you want to configure.

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.

------
#### [ CLI ]

**To specify Amazon EBS root device volume attributes with the AWS CLI**
+ Use the `--ebs-root-volume-size`, `--ebs-root-volume-iops`, and `--ebs-root-volume-throughput` parameters of the [create-cluster](https://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html) command, as shown in the following example.
**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace them with a caret (^).

  ```
  aws emr create-cluster --release-label emr-6.15.0 \
  --ebs-root-volume-size 20 \
  --ebs-root-volume-iops 3000 \
  --ebs-root-volume-throughput 135 \
  --instance-groups InstanceGroupType=MASTER,\
  InstanceCount=1,InstanceType=m5.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge
  ```

------

# Configure applications when you launch your Amazon EMR cluster


When you select a software release, Amazon EMR uses an Amazon Machine Image (AMI) with Amazon Linux to install the software that you choose when you launch your cluster, such as Hadoop, Spark, and Hive. Amazon EMR provides new releases on a regular basis, adding new features, new applications, and general updates. We recommend that you use the latest release to launch your cluster whenever possible. The latest release is the default option when you launch a cluster from the console. 

For more information about Amazon EMR releases and versions of software available with each release, go to the [Amazon EMR Release Guide](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/). For more information about how to edit the default configurations of applications and software installed on your cluster, go to [Configuring applications](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html) in the Amazon EMR Release Guide. Some versions of the open-source Hadoop and Spark ecosystem components that are included in Amazon EMR releases have patches and improvements, which are documented in the [Amazon EMR Release Guide](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/).

 In addition to the standard software and applications that are available for installation on your cluster, you can use bootstrap actions to install custom software. Bootstrap actions are scripts that run on the instances when your cluster is launched, and that run on new nodes that are added to your cluster when they are created. Bootstrap actions are also useful to invoke AWS CLI commands on each node to copy objects from Amazon S3 to each node in your cluster. 

**Note**  
 Bootstrap actions are used differently in Amazon EMR release 4.x and later. For more information about these differences from Amazon EMR AMI versions 2.x and 3.x, go to [Differences introduced in 4.x](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-differences.html) in the Amazon EMR Release Guide. 

# Create bootstrap actions to install additional software with an Amazon EMR cluster

You can use a *bootstrap action* to install additional software or customize the configuration of cluster instances. Bootstrap actions are scripts that run on the cluster instances after Amazon EMR launches each instance using the Amazon Linux Amazon Machine Image (AMI). Bootstrap actions run before Amazon EMR installs the applications that you specify when you create the cluster and before cluster nodes begin processing data. If you add nodes to a running cluster, bootstrap actions also run on those nodes in the same way. You can create custom bootstrap actions and specify them when you create your cluster. 

Most predefined bootstrap actions for Amazon EMR AMI versions 2.x and 3.x are not supported in Amazon EMR releases 4.x. For example, `configure-Hadoop` and `configure-daemons` are not supported in Amazon EMR release 4.x. Instead, Amazon EMR release 4.x natively provides this functionality. For more information about how to migrate bootstrap actions from Amazon EMR AMI versions 2.x and 3.x to Amazon EMR release 4.x, go to [ Customizing cluster and application configuration with earlier AMI versions of Amazon EMR](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-3x-customizeappconfig.html) in the Amazon EMR Release Guide.

## Bootstrap action basics


Bootstrap actions run as the `hadoop` user by default. To run a bootstrap action with root privileges, use `sudo`. 

All Amazon EMR management interfaces support bootstrap actions. You can specify up to 16 bootstrap actions per cluster by providing multiple `bootstrap-actions` parameters from the console, AWS CLI, or API. 

From the Amazon EMR console, you can optionally specify a bootstrap action while creating a cluster.

When you use the CLI, you can pass references to bootstrap action scripts to Amazon EMR by adding the `--bootstrap-actions` parameter when you create the cluster using the `create-cluster` command.

```
--bootstrap-actions Path="s3://amzn-s3-demo-bucket/filename",Args=[arg1,arg2]
```

If the bootstrap action returns a nonzero error code, Amazon EMR treats it as a failure and terminates the instance. If too many instances fail their bootstrap actions, then Amazon EMR terminates the cluster. If just a few instances fail, Amazon EMR attempts to reallocate the failed instances and continue. Use the cluster `lastStateChangeReason` error code to identify failures caused by a bootstrap action.

## Conditionally run a bootstrap action


To run a bootstrap action only on the master node, you can use a custom bootstrap action with logic that determines whether the node is the master.

```
#!/bin/bash
if grep isMaster /mnt/var/lib/info/instance.json | grep false;
then        
    echo "This is not master node, do nothing, exiting"
    exit 0
fi
echo "This is master, continuing to execute script"
# continue with code logic for master node below
```

The following output will print from a core node.

```
This is not master node, do nothing, exiting
```

The following output will print from the master node.

```
This is master, continuing to execute script
```

To use this logic, upload your bootstrap action, including the above code, to your Amazon S3 bucket. On the AWS CLI, add the `--bootstrap-actions` parameter to the `aws emr create-cluster` command and specify your bootstrap script location as the value of `Path`. 
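The same check can be factored into a reusable helper. This is a sketch: `INSTANCE_JSON` defaults to the path that Amazon EMR writes on each node, and the grep pattern assumes the default JSON formatting of that file (`"isMaster": true` on the primary node):

```
#!/bin/bash
# Sketch: reusable master-node check against EMR's instance.json.
# The grep pattern assumes the file's default formatting.
INSTANCE_JSON="${INSTANCE_JSON:-/mnt/var/lib/info/instance.json}"

is_master() {
  grep -q '"isMaster": true' "$INSTANCE_JSON"
}

if is_master; then
  echo "running master-only setup"
else
  echo "skipping: not the master node"
fi
```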

## Shutdown actions


A bootstrap action script can create one or more shutdown actions by writing scripts to the `/mnt/var/lib/instance-controller/public/shutdown-actions/` directory. When a cluster is terminated, all the scripts in this directory are executed in parallel. Each script must run and complete within 60 seconds. 

Shutdown action scripts are not guaranteed to run if the node terminates with an error. 

**Note**  
When you use Amazon EMR versions 4.0 and later, you must manually create the `/mnt/var/lib/instance-controller/public/shutdown-actions/` directory on the master node; it doesn't exist by default. After you create it, scripts in the directory run before shutdown as described. For more information about connecting to the master node to create directories, see [Connect to the Amazon EMR cluster primary node using SSH](emr-connect-master-node-ssh.md).
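Putting this together, a bootstrap action might create the directory and register a shutdown action like the following sketch. The script name and log path are hypothetical, and on a cluster node creating this directory requires `sudo`:

```
#!/bin/bash
# Sketch: a bootstrap action that registers a shutdown action script.
# SHUTDOWN_DIR defaults to the EMR shutdown-actions directory.
SHUTDOWN_DIR="${SHUTDOWN_DIR:-/mnt/var/lib/instance-controller/public/shutdown-actions}"
sudo mkdir -p "$SHUTDOWN_DIR"
sudo tee "$SHUTDOWN_DIR/flush-logs.sh" > /dev/null <<'EOF'
#!/bin/bash
# Each shutdown action must complete within 60 seconds
echo "cluster terminating at $(date)" >> /tmp/shutdown.log
EOF
sudo chmod +x "$SHUTDOWN_DIR/flush-logs.sh"
```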

## Use custom bootstrap actions


You can create a custom script to perform a customized bootstrap action. Any of the Amazon EMR interfaces can reference a custom bootstrap action.

**Note**  
For the best performance, we recommend that you store custom bootstrap actions, scripts, and other files that you want to use with Amazon EMR in an Amazon S3 bucket that is in the same AWS Region as your cluster.

**Topics**
+ [Add custom bootstrap actions](#custom-bootstrap)
+ [Use a custom bootstrap action to copy an object from Amazon S3 to each node](#CustomBootstrapCopyS3Object)

### Add custom bootstrap actions


------
#### [ Console ]

**To create a cluster with a bootstrap action with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. Under **Bootstrap actions**, choose **Add** to specify a name, script location, and optional arguments for your action. Select **Add bootstrap action**.

1. Optionally, add more bootstrap actions.

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.

------
#### [ CLI ]

**To create a cluster with a custom bootstrap action with the AWS CLI**

When using the AWS CLI to include a bootstrap action, specify the `Path` and `Args` as a comma-separated list. The following example doesn't use an arguments list.
+ To launch a cluster with a custom bootstrap action, type the following command, replacing *myKey* with the name of your EC2 key pair. Include `--bootstrap-actions` as a parameter and specify your bootstrap script location as the value of `Path`.
  + Linux, UNIX, and Mac OS X users:

    ```
    aws emr create-cluster --name "Test cluster" --release-label emr-4.0.0 \
    --use-default-roles --ec2-attributes KeyName=myKey \
    --applications Name=Hive Name=Pig \
    --instance-count 3 --instance-type m5.xlarge \
    --bootstrap-actions Path="s3://elasticmapreduce/bootstrap-actions/download.sh"
    ```
  + Windows users:

    ```
    aws emr create-cluster --name "Test cluster" --release-label emr-4.2.0 --use-default-roles --ec2-attributes KeyName=myKey --applications Name=Hive Name=Pig --instance-count 3 --instance-type m5.xlarge --bootstrap-actions Path="s3://elasticmapreduce/bootstrap-actions/download.sh"
    ```

  When you specify the instance count without using the `--instance-groups` parameter, a single primary node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.
**Note**  
If you have not previously created the default Amazon EMR service role and EC2 instance profile, type `aws emr create-default-roles` to create them before typing the `create-cluster` subcommand.

  For more information on using Amazon EMR commands in the AWS CLI, see [https://docs.aws.amazon.com/cli/latest/reference/emr](https://docs.aws.amazon.com/cli/latest/reference/emr).

------

### Use a custom bootstrap action to copy an object from Amazon S3 to each node


You can use a bootstrap action to copy objects from Amazon S3 to each node in a cluster before your applications are installed. The AWS CLI is installed on each node of a cluster, so your bootstrap action can call AWS CLI commands.

The following example demonstrates a simple bootstrap action script that copies a file, `myfile.jar`, from Amazon S3 to a local folder, `/mnt1/myfolder`, on each cluster node. The script is saved to Amazon S3 with the file name `copymyfile.sh` with the following contents.

```
#!/bin/bash
aws s3 cp s3://amzn-s3-demo-bucket/myfilefolder/myfile.jar /mnt1/myfolder
```

When you launch the cluster, you specify the script. The following AWS CLI example demonstrates this:

```
aws emr create-cluster --name "Test cluster" --release-label emr-7.12.0 \
--use-default-roles --ec2-attributes KeyName=myKey \
--applications Name=Hive Name=Pig \
--instance-count 3 --instance-type m5.xlarge \
--bootstrap-actions Path="s3://amzn-s3-demo-bucket/myscriptfolder/copymyfile.sh"
```

# Configure Amazon EMR cluster hardware and networking


An important consideration when you create an Amazon EMR cluster is how you configure Amazon EC2 instances and network options. This chapter covers the following options, and then ties them all together with [best practices and guidelines](emr-plan-instances-guidelines.md).
+ **Node types** – Amazon EC2 instances in an EMR cluster are organized into *node types*. There are three: *primary nodes*, *core nodes*, and *task nodes*. Each node type performs a set of roles defined by the distributed applications that you install on the cluster. During a Hadoop MapReduce or Spark job, for example, components on core and task nodes process data, transfer output to Amazon S3 or HDFS, and provide status metadata back to the primary node. With a single-node cluster, all components run on the primary node. For more information, see [Understand node types in Amazon EMR: primary, core, and task nodes](emr-master-core-task-nodes.md).
+ **EC2 instances** – When you create a cluster, you make choices about the Amazon EC2 instances that each type of node will run on. The EC2 instance type determines the processing and storage profile of the node. The choice of Amazon EC2 instance for your nodes is important because it determines the performance profile of individual node types in your cluster. For more information, see [Configure Amazon EC2 instance types for use with Amazon EMR](emr-plan-ec2-instances.md).
+ **Networking** – You can launch your Amazon EMR cluster into a VPC using a public subnet, private subnet, or a shared subnet. Your networking configuration determines how customers and services can connect to clusters to perform work, how clusters connect to data stores and other AWS resources, and the options you have for controlling traffic on those connections. For more information, see [Configure networking in a VPC for Amazon EMR](emr-plan-vpc-subnet.md).
+ **Instance grouping** – The collection of EC2 instances that host each node type is called either an *instance fleet* or a *uniform instance group*. The instance grouping configuration is a choice you make when you create a cluster. This choice determines how you can add nodes to your cluster while it is running. The configuration applies to all node types. It can't be changed later. For more information, see [Create an Amazon EMR cluster with instance fleets or uniform instance groups](emr-instance-group-configuration.md).
**Note**  
The instance fleets configuration is available only in Amazon EMR releases 4.8.0 and later, excluding 5.0.0 and 5.0.3.

# Understand node types in Amazon EMR: primary, core, and task nodes

Use this section to understand how Amazon EMR uses each of these node types and as a foundation for cluster capacity planning.

## Primary node


The primary node manages the cluster and typically runs primary components of distributed applications. For example, the primary node runs the YARN ResourceManager service to manage resources for applications. It also runs the HDFS NameNode service, tracks the status of jobs submitted to the cluster, and monitors the health of the instance groups.

To monitor the progress of a cluster and interact directly with applications, you can connect to the primary node over SSH as the Hadoop user. For more information, see [Connect to the Amazon EMR cluster primary node using SSH](emr-connect-master-node-ssh.md). Connecting to the primary node allows you to access directories and files, such as Hadoop log files, directly. For more information, see [View Amazon EMR log files](emr-manage-view-web-log-files.md). You can also view user interfaces that applications publish as websites running on the primary node. For more information, see [View web interfaces hosted on Amazon EMR clusters](emr-web-interfaces.md). 

**Note**  
With Amazon EMR 5.23.0 and later, you can launch a cluster with three primary nodes to support high availability of applications like YARN Resource Manager, HDFS NameNode, Spark, Hive, and Ganglia. The primary node is no longer a potential single point of failure with this feature. If one of the primary nodes fails, Amazon EMR automatically fails over to a standby primary node and replaces the failed primary node with a new one with the same configuration and bootstrap actions. For more information, see [Plan and Configure Primary Nodes](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-ha.html).

## Core nodes


Core nodes are managed by the primary node. Core nodes run the Data Node daemon to coordinate data storage as part of the Hadoop Distributed File System (HDFS). They also run the Task Tracker daemon and perform other parallel computation tasks on data that installed applications require. For example, a core node runs YARN NodeManager daemons, Hadoop MapReduce tasks, and Spark executors.

There is only one core instance group or instance fleet per cluster, but there can be multiple nodes running on multiple Amazon EC2 instances in the instance group or instance fleet. With instance groups, you can add and remove Amazon EC2 instances while the cluster is running. You can also set up automatic scaling to add instances based on the value of a metric. For more information about adding and removing Amazon EC2 instances with the instance groups configuration, see [Use Amazon EMR cluster scaling to adjust for changing workloads](emr-scale-on-demand.md).

With instance fleets, you can effectively add and remove instances by modifying the instance fleet's *target capacities* for On-Demand and Spot accordingly. For more information about target capacities, see [Instance fleet options](emr-instance-fleet.md#emr-instance-fleet-options).

**Warning**  
Removing HDFS daemons from a running core node or terminating core nodes risks data loss. Use caution when configuring core nodes to use Spot Instances. For more information, see [When should you use Spot Instances?](emr-plan-instances-guidelines.md#emr-plan-spot-instances).

## Task nodes


You can use task nodes to add power to perform parallel computation tasks on data, such as Hadoop MapReduce tasks and Spark executors. Task nodes don't run the Data Node daemon, nor do they store data in HDFS. As with core nodes, you can add task nodes to a cluster by adding Amazon EC2 instances to an existing uniform instance group or by modifying target capacities for a task instance fleet.

With the uniform instance group configuration, you can have up to a total of 48 task instance groups. The ability to add instance groups in this way allows you to mix Amazon EC2 instance types and pricing options, such as On-Demand Instances and Spot Instances. This gives you flexibility to respond to workload requirements in a cost-effective way.

With the instance fleet configuration, the ability to mix instance types and purchasing options is built in, so there is only one task instance fleet.

Because Spot Instances are often used to run task nodes, Amazon EMR has default functionality for scheduling YARN jobs so that running jobs do not fail when task nodes running on Spot Instances are terminated. Amazon EMR does this by allowing application master processes to run only on core nodes. The application master process controls running jobs and needs to stay alive for the life of the job.

Amazon EMR release 5.19.0 and later uses the built-in [YARN node labels](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeLabel.html) feature to achieve this. (Earlier versions used a code patch). Properties in the `yarn-site` and `capacity-scheduler` configuration classifications are configured by default so that the YARN capacity-scheduler and fair-scheduler take advantage of node labels. Amazon EMR automatically labels core nodes with the `CORE` label, and sets properties so that application masters are scheduled only on nodes with the CORE label. Manually modifying related properties in the yarn-site and capacity-scheduler configuration classifications, or directly in associated XML files, could break this feature or modify this functionality.

Beginning with the Amazon EMR 6.x release series, the YARN node labels feature is disabled by default, so application primary processes can run on both core and task nodes. You can enable the YARN node labels feature by configuring the following properties: 
+ `yarn.node-labels.enabled: true`
+ `yarn.node-labels.am.default-node-label-expression: 'CORE'`
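These properties are typically supplied through the `yarn-site` configuration classification. The following is a sketch of passing them at cluster creation; the release label, application, and instance values are example placeholders:

```
aws emr create-cluster --release-label emr-6.10.0 --use-default-roles \
--applications Name=Spark \
--instance-count 3 --instance-type m5.xlarge \
--configurations '[{"Classification":"yarn-site","Properties":{
  "yarn.node-labels.enabled":"true",
  "yarn.node-labels.am.default-node-label-expression":"CORE"}}]'
```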

Starting with the Amazon EMR 7.x release series, Amazon EMR assigns YARN node labels to instances by their market type, such as On-Demand or Spot. You can enable node labels and restrict application processes to `ON_DEMAND` nodes by configuring the following properties:

```
yarn.node-labels.enabled: true
yarn.node-labels.am.default-node-label-expression: 'ON_DEMAND'
```

If you're using Amazon EMR 7.0 or higher, you can restrict application processes to nodes with the `CORE` label by using the following configuration:

```
yarn.node-labels.enabled: true
yarn.node-labels.am.default-node-label-expression: 'CORE'
```

For Amazon EMR releases 7.2 and higher, if your cluster uses managed scaling with node labels, Amazon EMR tries to scale the cluster independently, based on application process demand and executor demand.

For example, if you use Amazon EMR releases 7.2 or higher and restrict application processes to `ON_DEMAND` nodes, managed scaling scales up `ON_DEMAND` nodes if application process demand increases. Similarly, if you restrict application processes to `CORE` nodes, managed scaling scales up `CORE` nodes if application process demand increases.

For information about specific properties, see [Amazon EMR settings to prevent job failure because of task node Spot Instance termination](emr-plan-instances-guidelines.md#emr-plan-spot-YARN).

# Configure Amazon EC2 instance types for use with Amazon EMR


EC2 instances come in different configurations known as *instance types*. Instance types have different CPU, input/output, and storage capacities. In addition to the instance type, you can choose different purchasing options for Amazon EC2 instances. You can specify different instance types and purchasing options within uniform instance groups or instance fleets. For more information, see [Create an Amazon EMR cluster with instance fleets or uniform instance groups](emr-instance-group-configuration.md). For guidance about choosing instance types and purchasing options for your application, see [Configuring Amazon EMR cluster instance types and best practices for Spot instances](emr-plan-instances-guidelines.md).

**Important**  
When you choose an instance type using the AWS Management Console, the number of **vCPU** shown for each **Instance type** is the number of YARN vcores for that instance type, not the number of EC2 vCPUs for that instance type. For more information on the number of vCPUs for each instance type, see [Amazon EC2 Instance Types](https://aws.amazon.com/ec2/instance-types/).

**Topics**
+ [Supported instance types with Amazon EMR](emr-supported-instance-types.md)
+ [Configure networking in a VPC for Amazon EMR](emr-plan-vpc-subnet.md)
+ [Create an Amazon EMR cluster with instance fleets or uniform instance groups](emr-instance-group-configuration.md)

# Supported instance types with Amazon EMR


This section describes the instance types that Amazon EMR supports, organized by AWS Region. To learn more about instance types, see [Amazon EC2 instances](https://aws.amazon.com/ec2/instance-types/) and [Amazon Linux AMI instance type matrix](https://aws.amazon.com/amazon-linux-ami/instance-type-matrix/).

Not all instance types are available in all Regions, and availability is subject to capacity and demand in the specified Region and Availability Zone. An instance's Availability Zone is determined by the subnet you use to launch your cluster. 

## Considerations


Consider the following when you choose instance types for your Amazon EMR cluster.

**Important**  
When you choose an instance type using the AWS Management Console, the number of **vCPU** shown for each **Instance type** is the number of YARN vcores for that instance type, not the number of EC2 vCPUs for that instance type. For more information on the number of vCPUs for each instance type, see [Amazon EC2 Instance Types](https://aws.amazon.com/ec2/instance-types/).
+ If you create a cluster using an instance type that is not available in the specified Region and Availability Zone, your cluster may fail to provision or may be stuck provisioning. For information about instance availability, see the [Amazon EMR pricing page](https://aws.amazon.com/emr/pricing) or see the [Supported instance types by AWS Region](#emr-instance-types-by-region) tables on this page.
+ Beginning with Amazon EMR release version 5.13.0, all instances use HVM virtualization and EBS-backed storage for root volumes. When using Amazon EMR release versions earlier than 5.13.0, some previous generation instances use PVM virtualization. For more information, see [Linux AMI virtualization types](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/virtualization_types.html).
+ Because of a lack of hardware support and default settings that can lead to underutilization of memory and cores, we don't recommend the instance types `c7a`, `c7i`, `m7i`, `m7i-flex`, `r7a`, `r7i`, `r7iz`, `i4i.12xlarge`, or `i4i.24xlarge` with Amazon EMR releases lower than 5.36.1 or 6.10.0. On those releases, these instance types might perform worse and won't deliver the expected benefits of newer instance generations, such as `c7i` over `c6i`. For optimal resource utilization and performance with these instance types, run Amazon EMR 5.36.1 and higher or 6.10.0 and higher.
+ Some instance types support enhanced networking. For more information, see [Enhanced Networking on Linux](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html).
+ NVIDIA and CUDA drivers are installed on GPU instance types by default.

## Supported instance types by AWS Region


The following tables list the Amazon EC2 instance types that Amazon EMR supports, organized by AWS Region. The tables also list the earliest Amazon EMR releases in the 5.x, 6.x, and 7.x series that support each instance type.

### US East (N. Virginia) - us-east-1


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### US East (Ohio) - us-east-2


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### US West (N. California) - us-west-1


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### US West (Oregon) - us-west-2


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### AWS GovCloud (US-West) - us-gov-west-1


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### AWS GovCloud (US-East) - us-gov-east-1


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Africa (Cape Town) - af-south-1


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Hong Kong) - ap-east-1


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Jakarta) - ap-southeast-3


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Melbourne) - ap-southeast-4


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Malaysia) - ap-southeast-5


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Mumbai) - ap-south-1


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Hyderabad) - ap-south-2


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Osaka) - ap-northeast-3


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Seoul) - ap-northeast-2


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Singapore) - ap-southeast-1


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Sydney) - ap-southeast-2


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Tokyo) - ap-northeast-1


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Canada (Central) - ca-central-1


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Canada West (Calgary) - ca-west-1


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### China (Ningxia) - cn-northwest-1


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### China (Beijing) - cn-north-1


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Europe (Frankfurt) - eu-central-1


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Europe (Zurich) - eu-central-2


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Europe (Ireland) - eu-west-1


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Europe (London) - eu-west-2


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Europe (Milan) - eu-south-1


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Europe (Spain) - eu-south-2


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Europe (Paris) - eu-west-3


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Europe (Stockholm) - eu-north-1


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Israel (Tel Aviv) - il-central-1


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Middle East (Bahrain) - me-south-1


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Middle East (UAE) - me-central-1


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### South America (São Paulo) - sa-east-1


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Thailand) - ap-southeast-7


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Mexico (Central) - mx-central-1


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Taipei) - ap-east-2


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (New Zealand) - ap-southeast-6


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

## Previous generation instances


Amazon EMR supports previous generation instances for applications that are optimized for them and have not yet been upgraded. For more information about these instance types and upgrade paths, see [Previous Generation Instances](https://aws.amazon.com/ec2/previous-generation). 


| Instance class | Instance types | 
| --- | --- | 
|  General Purpose  |  m1.small¹, m1.medium¹, m1.large¹, m1.xlarge¹, m3.xlarge¹, m3.2xlarge¹, m4.large, m4.xlarge, m4.2xlarge, m4.4xlarge, m4.10xlarge, m4.16xlarge  | 
|  Compute Optimized  |  c1.medium¹ ², c1.xlarge¹, c3.xlarge¹, c3.2xlarge¹, c3.4xlarge¹, c3.8xlarge¹, c4.large, c4.xlarge, c4.2xlarge, c4.4xlarge, c4.8xlarge  | 
|  Memory Optimized  |  m2.xlarge¹, m2.2xlarge¹, m2.4xlarge¹, r3.xlarge, r3.2xlarge, r3.4xlarge, r3.8xlarge, r4.xlarge, r4.2xlarge, r4.4xlarge, r4.8xlarge, r4.16xlarge  | 
|  Storage Optimized  |  d2.xlarge, d2.2xlarge, d2.4xlarge, d2.8xlarge, i2.xlarge, i2.2xlarge, i2.4xlarge, i2.8xlarge  | 

¹ Uses PVM virtualization AMI with Amazon EMR release versions earlier than 5.13.0. For more information, see [Linux AMI Virtualization Types](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/virtualization_types.html).

² Not supported in release version 5.15.0.

# Instance purchasing options in Amazon EMR


When you set up a cluster, you choose a purchasing option for Amazon EC2 instances. You can choose On-Demand Instances, Spot Instances, or both. Prices vary based on the instance type and Region. The Amazon EMR price is in addition to the Amazon EC2 price (the price for the underlying servers) and Amazon EBS price (if attaching Amazon EBS volumes). For current pricing, see [Amazon EMR Pricing](https://aws.amazon.com/emr/pricing).

Your choice to use instance groups or instance fleets in your cluster determines how you can change instance purchasing options while a cluster is running. If you choose uniform instance groups, you can only specify the purchasing option for an instance group when you create it, and the instance type and purchasing option apply to all Amazon EC2 instances in each instance group. If you choose instance fleets, you can change purchasing options after you create the instance fleet, and you can mix purchasing options to fulfill a target capacity that you specify. For more information about these configurations, see [Create an Amazon EMR cluster with instance fleets or uniform instance groups](emr-instance-group-configuration.md).

## On-Demand Instances


With On-Demand Instances, you pay for compute capacity by the second. Optionally, you can have these On-Demand Instances use Reserved Instance or Dedicated Instance purchasing options. With Reserved Instances, you make a one-time payment for an instance to reserve capacity. Dedicated Instances are physically isolated at the host hardware level from instances that belong to other AWS accounts. For more information about purchasing options, see [Instance Purchasing Options](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-purchasing-options.html) in the *Amazon EC2 User Guide*.

### Using Reserved Instances


To use Reserved Instances in Amazon EMR, you use Amazon EC2 to purchase the Reserved Instance and specify the parameters of the reservation, including the scope of the reservation as applying to either a Region or an Availability Zone. For more information, see [Amazon EC2 Reserved Instances](https://aws.amazon.com/ec2/reserved-instances/) and [Buying Reserved Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ri-market-concepts-buying.html) in the *Amazon EC2 User Guide*. After you purchase a Reserved Instance, if all of the following conditions are true, Amazon EMR uses the Reserved Instance when a cluster launches:
+ An On-Demand Instance is specified in the cluster configuration that matches the Reserved Instance specification.
+ The cluster is launched within the scope of the instance reservation (the Availability Zone or Region).
+ The Reserved Instance capacity is still available.

For example, let's say you purchase one `m5.xlarge` Reserved Instance with the instance reservation scoped to the US-East Region. You then launch an Amazon EMR cluster in US-East that uses two `m5.xlarge` instances. The first instance is billed at the Reserved Instance rate and the other is billed at the On-Demand rate. Reserved Instance capacity is used before any On-Demand Instances are created.
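The three conditions above can be expressed as a small predicate. This is an illustrative sketch, not an Amazon EMR API; the function and parameter names are hypothetical:

```python
def reserved_instance_applies(cluster_instance_type, cluster_az, cluster_region,
                              ri_instance_type, ri_scope, ri_capacity_available):
    """Return True if a Reserved Instance would be billed for one of the
    cluster's On-Demand Instances, per the three conditions above."""
    # The cluster specifies an On-Demand Instance matching the RI type.
    if cluster_instance_type != ri_instance_type:
        return False
    # The cluster launches within the reservation's scope; ri_scope is
    # either a Region name or an Availability Zone name.
    if ri_scope not in (cluster_region, cluster_az):
        return False
    # The Reserved Instance capacity is still available.
    return ri_capacity_available > 0
```

In the `m5.xlarge` example, the first instance satisfies all three conditions; the second finds no remaining reserved capacity and is billed at the On-Demand rate.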

### Using Dedicated Instances


To use Dedicated Instances, you purchase Dedicated Instances using Amazon EC2 and then create a VPC with the **Dedicated** tenancy attribute. Within Amazon EMR, you then specify that a cluster should launch in this VPC. Any On-Demand Instances in the cluster that match the Dedicated Instance specification use available Dedicated Instances when the cluster launches.

**Note**  
Amazon EMR does not support setting the `dedicated` attribute on individual instances.

## Spot Instances


Spot Instances in Amazon EMR provide an option for you to purchase Amazon EC2 instance capacity at a reduced cost as compared to On-Demand purchasing. The disadvantage of using Spot Instances is that instances may terminate if Spot capacity becomes unavailable for the instance type you are running. For more information about when using Spot Instances may be appropriate for your application, see [When should you use Spot Instances?](emr-plan-instances-guidelines.md#emr-plan-spot-instances).

When Amazon EC2 has unused capacity, it offers EC2 instances at a reduced cost, called the *Spot price*. This price fluctuates based on availability and demand, and is established by Region and Availability Zone. When you choose Spot Instances, you specify the maximum Spot price that you're willing to pay for each EC2 instance type. When the Spot price in the cluster's Availability Zone is below the maximum Spot price specified for that instance type, the instances launch. While instances run, you're charged the current Spot price, *not* your maximum Spot price.
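A toy simulation of this billing behavior (the function name and the hourly granularity are illustrative simplifications; actual Spot billing granularity differs):

```python
def simulate_spot_billing(prices_by_hour, max_spot_price):
    """Accumulate charges at the current Spot price each hour until the
    price rises above the maximum, at which point instances terminate."""
    total = 0.0
    for price in prices_by_hour:
        if price > max_spot_price:
            break  # Spot price exceeded your maximum: instances terminate
        total += price  # charged the current price, not your maximum
    return total
```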

**Note**  
Spot Instances with a defined duration (also known as Spot blocks) are no longer available to new customers from July 1, 2021. For customers who have previously used the feature, we will continue to support Spot Instances with a defined duration until December 31, 2022.

For current pricing, see [Amazon EC2 Spot Instances Pricing](https://aws.amazon.com/ec2/spot/pricing/). For more information, see [Spot Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html) in the *Amazon EC2 User Guide*. When you create and configure a cluster, you specify network options that ultimately determine the Availability Zone where your cluster launches. For more information, see [Configure networking in a VPC for Amazon EMR](emr-plan-vpc-subnet.md). 

**Tip**  
You can see the real-time Spot price in the console when you hover over the information tooltip next to the **Spot** purchasing option when you create a cluster using **Advanced Options**. The prices for each Availability Zone in the selected Region are displayed. The lowest prices are in the green-colored rows. Because of fluctuating Spot prices between Availability Zones, selecting the Availability Zone with the lowest initial price might not result in the lowest price for the life of the cluster. For optimal results, study the history of Availability Zone pricing before choosing. For more information, see [Spot Instance Pricing History](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances-history.html) in the *Amazon EC2 User Guide*.

Spot Instance options depend on whether you use uniform instance groups or instance fleets in your cluster configuration.

**Spot Instances in uniform instance groups**  
When you use Spot Instances in a uniform instance group, all instances in the group must be Spot Instances. You specify a single subnet or Availability Zone for the cluster. For each instance group, you specify a single instance type and a maximum Spot price. Spot Instances of that type launch when the Spot price in the cluster's Region and Availability Zone is below the maximum Spot price, and terminate when the Spot price rises above it. You set the maximum Spot price only when you configure an instance group; it can't be changed later. For more information, see [Create an Amazon EMR cluster with instance fleets or uniform instance groups](emr-instance-group-configuration.md).

**Spot Instances in instance fleets**  
When you use the instance fleets configuration, additional options give you more control over how Spot Instances launch and terminate. Instance fleets use a different provisioning method than uniform instance groups: you establish a *target capacity* for Spot Instances (and On-Demand Instances) and up to five instance types. You can specify a *weighted capacity* for each instance type, or use the instance type's vCPU (YARN vcores) count as its weighted capacity. When an instance of a given type is provisioned, its weighted capacity counts toward the target. Amazon EMR provisions instances with both purchasing options until the target capacity for each target is fulfilled. You can also define a range of Availability Zones for Amazon EMR to choose from when launching instances, and provide additional Spot options for each fleet, such as a provisioning timeout. For more information, see [Planning and configuring instance fleets for your Amazon EMR cluster](emr-instance-fleet.md).

# Instance storage options and behavior in Amazon EMR


## Overview


Instance store and Amazon EBS volumes provide storage for HDFS data, and for buffers, caches, scratch data, and other temporary content that some applications might "spill" to the local file system.

Amazon EBS works differently within Amazon EMR than it does with regular Amazon EC2 instances. Amazon EBS volumes attached to Amazon EMR clusters are ephemeral: the volumes are deleted upon cluster and instance termination (for example, when shrinking instance groups), so you shouldn't expect data to persist. Although the data is ephemeral, it is possible that data in HDFS could be replicated depending on the number and specialization of nodes in the cluster. When you add Amazon EBS storage volumes, these are mounted as additional volumes. They are not a part of the boot volume. YARN is configured to use all the additional volumes, but you are responsible for allocating the additional volumes as local storage (for local log files, for example).

## Considerations


Keep in mind these additional considerations when you use Amazon EBS with EMR clusters:
+ You can't snapshot an Amazon EBS volume and then restore it within Amazon EMR. To create reusable custom configurations, use a custom AMI (available in Amazon EMR version 5.7.0 and later). For more information, see [Using a custom AMI to provide more flexibility for Amazon EMR cluster configuration](emr-custom-ami.md).
+ An encrypted Amazon EBS root device volume is supported only when using a custom AMI. For more information, see [Creating a custom AMI with an encrypted Amazon EBS root device volume](emr-custom-ami.md#emr-custom-ami-encrypted). 
+ Tags that you apply using the Amazon EMR API are also applied to EBS volumes.
+ There is a limit of 25 volumes per instance.
+ The Amazon EBS volumes on core nodes cannot be smaller than 5 GB.
+ Amazon EBS has a fixed limit of 2,500 EBS volumes per instance launch request. This limit also applies to Amazon EMR on EC2 clusters. We recommend that you launch clusters with a total number of EBS volumes within this limit, and then scale up the cluster manually or with Amazon EMR managed scaling as needed. To learn more about the EBS volume limit, see [Service quotas](https://docs.aws.amazon.com/general/latest/gr/ebs-service.html#limits_ebs:~:text=Amazon%20EBS%20has,exceeding%20the%20limit.).
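The limits in this list can be checked before launch with a small validation helper (an illustrative sketch; `validate_ebs_config` is not part of any AWS SDK):

```python
def validate_ebs_config(volumes_per_instance, core_volume_size_gb,
                        total_volumes_in_request):
    """Return a list of violations of the EBS limits listed above."""
    errors = []
    if volumes_per_instance > 25:
        errors.append("at most 25 EBS volumes per instance")
    if core_volume_size_gb < 5:
        errors.append("core-node EBS volumes must be at least 5 GB")
    if total_volumes_in_request > 2500:
        errors.append("at most 2,500 EBS volumes per launch request")
    return errors
```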

## Default Amazon EBS storage for instances

For EC2 instances that have EBS-only storage, Amazon EMR allocates Amazon EBS gp2 or gp3 storage volumes to instances. When you create a cluster with Amazon EMR releases 5.22.0 and higher, the default amount of Amazon EBS storage increases relative to the size of the instance.

We split any increased storage across multiple volumes. This gives increased IOPS performance and, in turn, increased performance for some standardized workloads. If you want to use a different Amazon EBS instance storage configuration, you can specify this when you create an EMR cluster or add nodes to an existing cluster. You can use Amazon EBS gp2 or gp3 volumes as root volumes, and add gp2 or gp3 volumes as additional volumes. For more information, see [Specifying additional EBS storage volumes](#emr-plan-storage-additional-ebs-volumes).

The following table identifies the default number of Amazon EBS gp2 storage volumes, sizes, and total sizes per instance type. For information about gp2 volumes compared to gp3, see [Comparing Amazon EBS volume types gp2 and gp3](emr-plan-storage-compare-volume-types.md).


**Default Amazon EBS gp2 storage volumes and size by instance type for Amazon EMR 5.22.0 and higher**  

| Instance size | Number of volumes | Volume size (GiB) | Total size (GiB) | 
| --- | --- | --- | --- | 
|  \*.large  |  1  |  32  |  32  | 
|  \*.xlarge  |  2  |  32  |  64  | 
|  \*.2xlarge  |  4  |  32  |  128  | 
|  \*.4xlarge  |  4  |  64  |  256  | 
|  \*.8xlarge  |  4  |  128  |  512  | 
|  \*.9xlarge  |  4  |  144  |  576  | 
|  \*.10xlarge  |  4  |  160  |  640  | 
|  \*.12xlarge  |  4  |  192  |  768  | 
|  \*.16xlarge  |  4  |  256  |  1024  | 
|  \*.18xlarge  |  4  |  288  |  1152  | 
|  \*.24xlarge  |  4  |  384  |  1536  | 
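The table above maps directly to a lookup keyed on the instance size suffix, which is handy when estimating a cluster's default local storage. This is a sketch under the assumption that the instance type is one of the EBS-only sizes the table covers:

```python
# (number of volumes, per-volume size in GiB) by instance size suffix,
# per the default-storage table for Amazon EMR 5.22.0 and higher
DEFAULT_GP_VOLUMES = {
    "large": (1, 32),     "xlarge": (2, 32),    "2xlarge": (4, 32),
    "4xlarge": (4, 64),   "8xlarge": (4, 128),  "9xlarge": (4, 144),
    "10xlarge": (4, 160), "12xlarge": (4, 192), "16xlarge": (4, 256),
    "18xlarge": (4, 288), "24xlarge": (4, 384),
}

def default_ebs_storage(instance_type):
    """Total default EBS storage (GiB) for an EBS-only instance type."""
    size = instance_type.split(".", 1)[1]
    count, per_volume_gib = DEFAULT_GP_VOLUMES[size]
    return count * per_volume_gib
```

For example, `default_ebs_storage("m5.xlarge")` returns 64 GiB (two 32 GiB volumes).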

## Default Amazon EBS root volume for instances

With Amazon EMR releases 6.15 and higher, Amazon EMR automatically attaches an Amazon EBS General Purpose SSD (gp3) as the root device for its AMIs to enhance performance. With earlier releases, Amazon EMR attaches EBS General Purpose SSD (gp2) as the root device.


|  | 6.15 and higher | 6.14 and lower | 
| --- | --- | --- | 
| Default root volume type |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html) | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html) | 
| Default size |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html)  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html)  | 
| Default IOPS |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html)  |   | 
| Default throughput |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html)  |   | 

For information on how to customize the Amazon EBS root device volume, see [Specifying additional EBS storage volumes](#emr-plan-storage-additional-ebs-volumes).

## Specifying additional EBS storage volumes


When you configure instance types in Amazon EMR, you can specify additional EBS volumes to add capacity beyond the instance store (if present) and the default EBS volume. Amazon EBS provides the following volume types: General Purpose (SSD), Provisioned IOPS (SSD), Throughput Optimized (HDD), Cold (HDD), and Magnetic. They differ in performance characteristics and price, so you can tailor your storage to the analytic and business needs of your applications. For example, some applications might need to spill to disk while others can safely work in-memory or with Amazon S3.

You can attach Amazon EBS volumes to instances only at cluster startup or when you add an extra task node instance group. If an instance in an Amazon EMR cluster fails, both the instance and its attached Amazon EBS volumes are replaced. Consequently, if you manually detach an Amazon EBS volume, Amazon EMR treats that as a failure and replaces the instance along with its instance store (if applicable) and EBS volumes.

Amazon EMR doesn’t allow you to modify your volume type from gp2 to gp3 for an existing EMR cluster. To use gp3 for your workloads, launch a new EMR cluster. In addition, we don't recommend that you update the throughput and IOPS on a cluster that is in use or that is being provisioned, because Amazon EMR uses the throughput and IOPS values you specify at cluster launch time for any new instance that it adds during cluster scale-up. For more information, see [Comparing Amazon EBS volume types gp2 and gp3](emr-plan-storage-compare-volume-types.md) and [Selecting IOPS and throughput when migrating to gp3 Amazon EBS volume types](emr-plan-storage-gp3-migration-selection.md).

**Important**  
To use a gp3 volume with your EMR cluster, you must launch a new cluster.

# Comparing Amazon EBS volume types gp2 and gp3


Here is a comparison of the performance and cost of gp2 and gp3 volumes in the US East (N. Virginia) Region. For the most up-to-date information, see the [Amazon EBS General Purpose Volumes](https://aws.amazon.com/ebs/general-purpose/) product page and the [Amazon EBS Pricing Page](https://aws.amazon.com/ebs/pricing/).


| Volume type | gp3 | gp2 | 
| --- | --- | --- | 
| Volume size | 1 GiB – 16 TiB | 1 GiB – 16 TiB | 
| Default/Baseline IOPS | 3000 | 3 IOPS/GiB (minimum 100 IOPS) to a maximum of 16,000 IOPS. Volumes smaller than 1 TiB can also burst up to 3,000 IOPS. | 
| Max IOPS/volume | 16,000 | 16,000 | 
| Default/Baseline throughput | 125 MiB/s | Throughput limit is between 128 MiB/s and 250 MiB/s, depending on the volume size. | 
| Max throughput/volume | 1,000 MiB/s | 250 MiB/s | 
| Price | \$0.08/GiB-month; 3,000 IOPS free and \$0.005/provisioned IOPS-month over 3,000; 125 MiB/s free and \$0.04/provisioned MiB/s-month over 125 MiB/s | \$0.10/GiB-month | 
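As a back-of-the-envelope illustration of the prices in this table (US East (N. Virginia) list prices at the time of writing: \$0.10/GiB-month for gp2; for gp3, \$0.08/GiB-month plus \$0.005 per IOPS-month over 3,000 and \$0.04 per MiB/s-month over 125), here is a sketch with illustrative function names:

```python
def gp3_monthly_cost(size_gib, iops=3000, throughput_mibps=125):
    """gp3: $0.08/GiB-month, plus $0.005 per provisioned IOPS-month over
    the free 3,000 IOPS, plus $0.04 per MiB/s-month over the free 125 MiB/s."""
    return (size_gib * 0.08
            + max(iops - 3000, 0) * 0.005
            + max(throughput_mibps - 125, 0) * 0.04)

def gp2_monthly_cost(size_gib):
    """gp2: a flat $0.10/GiB-month."""
    return size_gib * 0.10
```

For example, a 500 GiB gp3 volume provisioned at 4,000 IOPS and 200 MiB/s costs about \$48/month, versus \$50/month for a 500 GiB gp2 volume.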

# Selecting IOPS and throughput when migrating to gp3 Amazon EBS volume types


When provisioning a gp2 volume, you must size the volume to get the proportional IOPS and throughput. With gp3, you don't have to provision a bigger volume to get higher performance: you choose size and performance independently, according to application need. Selecting the right size and the right performance parameters (IOPS, throughput) can maximize cost reduction without affecting performance.

Here is a table to help you select gp3 configuration options:


| Volume size | IOPS | Throughput | 
| --- | --- | --- | 
| 1–170 GiB | 3000 | 125 MiB/s | 
| 170–334 GiB | 3000 | 125 MiB/s if the chosen EC2 instance type supports 125 MiB/s or less; otherwise, provision higher according to usage, up to a maximum of 250 MiB/s¹. | 
| 334–1000 GiB | 3000 | 125 MiB/s if the chosen EC2 instance type supports 125 MiB/s or less; otherwise, provision higher according to usage, up to a maximum of 250 MiB/s¹. | 
| 1000+ GiB | Match the gp2 IOPS (size in GiB x 3), or the maximum IOPS driven by the current gp2 volume | 125 MiB/s if the chosen EC2 instance type supports 125 MiB/s or less; otherwise, provision higher according to usage, up to a maximum of 250 MiB/s¹. | 

¹ gp3 can provide throughput up to 1,000 MiB/s. Because gp2 provides a maximum of 250 MiB/s of throughput, you may not need to go beyond that limit when you use gp3. gp3 volumes deliver a consistent baseline throughput of 125 MiB/s, which is included in the price of storage. You can provision additional throughput (up to a maximum of 1,000 MiB/s) for an additional cost at a ratio of 0.25 MiB/s per provisioned IOPS. Maximum throughput can be provisioned at 4,000 IOPS or higher (4,000 IOPS × 0.25 MiB/s per IOPS = 1,000 MiB/s).
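The throughput-to-IOPS ratio can be sketched as follows, using the 1,000 MiB/s per-volume maximum from the gp2/gp3 comparison table (the function name is illustrative):

```python
def gp3_max_provisionable_throughput(iops):
    """Maximum gp3 throughput (MiB/s) you can provision for a given IOPS
    setting: 0.25 MiB/s per provisioned IOPS, capped at the per-volume
    maximum of 1,000 MiB/s."""
    return min(iops * 0.25, 1000.0)
```

At 3,000 IOPS you can provision up to 750 MiB/s; from 4,000 IOPS upward, the 1,000 MiB/s per-volume cap applies.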

# Configure networking in a VPC for Amazon EMR


Most clusters launch into a virtual network using Amazon Virtual Private Cloud (Amazon VPC). A VPC is a virtual network that is logically isolated within your AWS account. You can configure aspects such as private IP address ranges, subnets, route tables, and network gateways. For more information, see the [Amazon VPC User Guide](https://docs.aws.amazon.com/vpc/latest/userguide/).

VPC offers the following capabilities:
+ **Processing sensitive data**

  Launching a cluster into a VPC is similar to launching the cluster into a private network with additional tools, such as routing tables and network ACLs, to define who has access to the network. If you are processing sensitive data in your cluster, you may want the additional access control that launching your cluster into a VPC provides. Furthermore, you can choose to launch your resources into a private subnet where none of those resources has direct internet connectivity.
+ **Accessing resources on an internal network**

  If your data source is located in a private network, it may be impractical or undesirable to upload that data to AWS for import into Amazon EMR, either because of the amount of data to transfer or because of the sensitive nature of the data. Instead, you can launch the cluster into a VPC and connect your data center to your VPC through a VPN connection, enabling the cluster to access resources on your internal network. For example, if you have an Oracle database in your data center, launching your cluster into a VPC connected to that network by VPN makes it possible for the cluster to access the Oracle database. 

**Public and private subnets**  
You can launch Amazon EMR clusters in both public and private VPC subnets. This means you do not need internet connectivity to run an Amazon EMR cluster; however, you may need to configure network address translation (NAT) and VPN gateways to access services or resources located outside of the VPC, for example in a corporate intranet or public AWS service endpoints like AWS Key Management Service.

**Important**  
Amazon EMR supports launching clusters in private subnets only in release version 4.2.0 and later.

For more information about Amazon VPC, see the [Amazon VPC User Guide](https://docs.aws.amazon.com/vpc/latest/userguide/).

**Topics**
+ [Amazon VPC options when you launch a cluster](emr-clusters-in-a-vpc.md)
+ [Set up a VPC to host Amazon EMR clusters](emr-vpc-host-job-flows.md)
+ [Launch clusters into a VPC with Amazon EMR](emr-vpc-launching-job-flows.md)
+ [Sample policies for private subnets that access Amazon S3](private-subnet-iampolicy.md)
+ [More resources for learning about VPCs](#emr-resources-about-vpcs)

# Amazon VPC options when you launch a cluster




When you launch an Amazon EMR cluster within a VPC, you can launch it into a public, private, or shared subnet. There are slight but notable differences in configuration, depending on the subnet type you choose for a cluster.

## Public subnets


EMR clusters in a public subnet require a connected internet gateway, because clusters must be able to reach AWS services and the Amazon EMR service. If a service, such as Amazon S3, supports VPC endpoints, you can access that service through an endpoint instead of reaching a public endpoint through an internet gateway. Additionally, Amazon EMR cannot communicate with clusters in public subnets through a network address translation (NAT) device. An internet gateway is required for this purpose, but you can still use a NAT instance or gateway for other traffic in more complex scenarios.

All instances in a cluster connect to Amazon S3 through either a VPC endpoint or an internet gateway. Other AWS services that do not currently support VPC endpoints are reachable only through an internet gateway.

If you have additional AWS resources that you do not want connected to the internet gateway, you can launch those components in a private subnet that you create within your VPC. 

Clusters running in a public subnet use two security groups: one for the primary node and another for core and task nodes. For more information, see [Control network traffic with security groups for your Amazon EMR cluster](emr-security-groups.md).

The following diagram shows how an Amazon EMR cluster runs in a VPC using a public subnet. The cluster is able to connect to other AWS resources, such as Amazon S3 buckets, through the internet gateway.

![\[Cluster on a VPC\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/vpc_default_v3a.png)


The following diagram shows how to set up a VPC so that a cluster in the VPC can access resources in your own network, such as an Oracle database.

![\[Set up a VPC and cluster to access local VPN resources\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/vpc_withVPN_v3a.png)


## Private subnets


A private subnet lets you launch AWS resources without requiring the subnet to have an attached internet gateway. Amazon EMR supports launching clusters in private subnets with release versions 4.2.0 or later.

**Note**  
When you set up an Amazon EMR cluster in a private subnet, we recommend that you also set up [VPC endpoints for Amazon S3](https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-s3.html). If your EMR cluster is in a private subnet without VPC endpoints for Amazon S3, you will incur additional NAT gateway charges that are associated with S3 traffic because the traffic between your EMR cluster and S3 will not stay within your VPC.
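Gateway endpoint service names are Region-specific. As a small sketch (the helper function below is illustrative, not part of any AWS SDK), you can derive the service name to pass to a command such as `aws ec2 create-vpc-endpoint`:

```python
def s3_gateway_endpoint_service_name(region: str) -> str:
    """Build the Region-specific service name for an S3 gateway endpoint.

    S3 gateway endpoint service names follow the pattern
    com.amazonaws.<region>.s3.
    """
    return f"com.amazonaws.{region}.s3"

print(s3_gateway_endpoint_service_name("us-east-1"))  # com.amazonaws.us-east-1.s3
```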

Private subnets differ from public subnets in the following ways:
+ To access AWS services that do not provide a VPC endpoint, you still must use a NAT instance or an internet gateway.
+ At a minimum, you must provide a route to the Amazon EMR service logs bucket and Amazon Linux repository in Amazon S3. For more information, see [Sample policies for private subnets that access Amazon S3](private-subnet-iampolicy.md).
+ If you use EMRFS features, you need to have an Amazon S3 VPC endpoint and a route from your private subnet to DynamoDB.
+ Debugging only works if you provide a route from your private subnet to a public Amazon SQS endpoint.
+ Creating a private subnet configuration with a NAT instance or gateway in a public subnet is only supported using the AWS Management Console. The easiest way to add and configure NAT instances and Amazon S3 VPC endpoints for Amazon EMR clusters is to use the **VPC Subnets List** page in the Amazon EMR console. To configure NAT gateways, see [NAT Gateways](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html) in the *Amazon VPC User Guide*.
+ You cannot change a subnet with an existing Amazon EMR cluster from public to private or vice versa. To locate an Amazon EMR cluster within a private subnet, the cluster must be started in that private subnet. 

Amazon EMR creates and uses different default security groups for the clusters in a private subnet: ElasticMapReduce-Master-Private, ElasticMapReduce-Slave-Private, and ElasticMapReduce-ServiceAccess. For more information, see [Control network traffic with security groups for your Amazon EMR cluster](emr-security-groups.md).

For a complete listing of the security groups attached to your cluster, choose **Security groups for Primary** and **Security groups for Core & Task** on the Amazon EMR console **Cluster Details** page.

The following image shows how an Amazon EMR cluster is configured within a private subnet. The only communication outside the subnet is to Amazon EMR. 

![\[Launch an Amazon EMR cluster in a private subnet\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/vpc_with_private_subnet_v3a.png)


The following image shows a sample configuration for an Amazon EMR cluster within a private subnet connected to a NAT instance that is residing in a public subnet.

![\[Private subnet with NAT\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/vpc_private_subnet_nat_v3a.png)


## Shared subnets


VPC sharing allows customers to share subnets with other AWS accounts within the same AWS Organization. You can launch Amazon EMR clusters into both public shared and private shared subnets, with the following caveats.

The subnet owner must share a subnet with you before you can launch an Amazon EMR cluster into it. However, shared subnets can later be unshared. For more information, see [Working with Shared VPCs](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-sharing.html). If a cluster is launched into a shared subnet and that subnet is then unshared, the following behaviors occur, depending on the state of the cluster when the subnet is unshared.
+ Subnet is unshared *before* the cluster is successfully launched - If the owner stops sharing the Amazon VPC or subnet while the participant is launching a cluster, the cluster could fail to start or be partially initialized without provisioning all requested instances. 
+ Subnet is unshared *after* the cluster is successfully launched - When the owner stops sharing a subnet or Amazon VPC with the participant, the participant's clusters will not be able to resize to add new instances or to replace unhealthy instances.

When you launch an Amazon EMR cluster, multiple security groups are created. In a shared subnet, the subnet participant controls these security groups. The subnet owner can see these security groups but cannot perform any actions on them. If the subnet owner wants to remove or modify the security group, the participant that created the security group must take the action.

## Control VPC permissions with IAM


By default, all users can see all of the subnets for the account, and any user can launch a cluster in any subnet. 

When you launch a cluster into a VPC, you can use AWS Identity and Access Management (IAM) to control access to clusters and restrict actions using policies, just as you would with clusters launched into EC2-Classic. For more information about IAM, see the [IAM User Guide](https://docs.aws.amazon.com/IAM/latest/UserGuide/). 

You can also use IAM to control who can create and administer subnets. For example, you can create an IAM role to administer subnets, and a second role that can launch clusters but cannot modify Amazon VPC settings. For more information about administering policies and actions in Amazon EC2 and Amazon VPC, see [IAM Policies for Amazon EC2](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-policies-for-amazon-ec2.html) in the *Amazon EC2 User Guide*. 

# Set up a VPC to host Amazon EMR clusters


Before you can launch clusters in a VPC, you must create a VPC and a subnet. For public subnets, you must create an internet gateway and attach it to the subnet. The following instructions describe how to create a VPC capable of hosting Amazon EMR clusters. 

**To create a VPC with subnets for an Amazon EMR cluster**

1. Open the Amazon VPC console at [https://console.aws.amazon.com/vpc/](https://console.aws.amazon.com/vpc/).

1. On the top-right of the page, choose the [AWS Region](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html) for your VPC.

1. Choose **Create VPC**.

1. On the **VPC settings** page, choose **VPC and more**.

1. Under **Name tag auto-generation**, enable **Auto-generate** and enter a name for your VPC. This helps you to identify the VPC and subnet in the Amazon VPC console after you've created them.

1. In the **IPv4 CIDR block** field, enter a private IP address space for your VPC. Using private address space helps ensure proper DNS hostname resolution; otherwise, you may experience Amazon EMR cluster failures. Private IP address space includes the following ranges: 
   + 10.0.0.0 - 10.255.255.255
   + 172.16.0.0 - 172.31.255.255
   + 192.168.0.0 - 192.168.255.255
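As a quick sanity check (a sketch, not an AWS tool), you can verify that a candidate CIDR block falls entirely within one of the private ranges above by using Python's standard `ipaddress` module:

```python
import ipaddress

# The three RFC 1918 private ranges listed above, in CIDR notation.
RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def is_rfc1918(cidr: str) -> bool:
    """True if every address in the CIDR block lies in private address space."""
    net = ipaddress.ip_network(cidr)
    return any(net.subnet_of(block) for block in RFC1918)

print(is_rfc1918("10.0.0.0/16"))    # True
print(is_rfc1918("172.32.0.0/16"))  # False (just outside 172.16.0.0/12)
```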

1. Under **Number of Availability Zones (AZs)**, choose the number of Availability Zones you want to launch your subnets in.

1. Under **Number of public subnets**, choose a single public subnet to add to your VPC. If the data used by the cluster is available on the internet (for example, in Amazon S3 or Amazon RDS), you only need to use a public subnet and don't need to add a private subnet.

1. Under **Number of private subnets**, choose the number of private subnets you want to add to your VPC. Select one or more if the data for your application is stored in your own network (for example, in an Oracle database). For a cluster in a private subnet, all Amazon EC2 instances must at minimum have a route to Amazon EMR through the elastic network interface. In the console, this is automatically configured for you.

1. Under **NAT gateways**, optionally choose to add NAT gateways. They are only necessary if you have private subnets that need to communicate with the internet.

1. Under **VPC endpoints**, optionally choose to add endpoints for Amazon S3 to your subnets.

1. Verify that **Enable DNS hostnames** and **Enable DNS resolution** are checked. For more information, see [Using DNS with your VPC](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html).

1. Choose **Create VPC**.

1. A status window shows the work in progress. When the work completes, choose **View VPC** to navigate to the **Your VPCs** page, which displays your default VPC and the VPC that you just created. The VPC that you created is a nondefault VPC, therefore the **Default VPC** column displays **No**. 

1. If you want to associate your VPC with a DNS entry that does not include a domain name, navigate to **DHCP option sets**, choose **Create DHCP options set**, and omit a domain name. After you create your option set, navigate to your new VPC, choose **Edit DHCP options set** under the **Actions** menu, and select the new option set. You cannot edit the domain name using the console after the DNS option set has been created. 

   It is a best practice with Hadoop and related applications to ensure resolution of the fully qualified domain name (FQDN) for nodes. To ensure proper DNS resolution, configure a VPC that includes a DHCP options set whose parameters are set to the following values:
   + **domain-name** = **ec2.internal**

     Use **ec2.internal** if your Region is US East (N. Virginia). For other Regions, use *region-name***.compute.internal**. For example, in `us-west-2`, use **us-west-2.compute.internal**. For the AWS GovCloud (US-West) Region, use **us-gov-west-1.compute.internal**.
   + **domain-name-servers** = **AmazonProvidedDNS**

   For more information, see [DHCP options sets](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_DHCP_Options.html) in the *Amazon VPC User Guide*.

1. After the VPC is created, go to the **Subnets** page and note the **Subnet ID** of one of the subnets of your new VPC. You use this information when you launch the Amazon EMR cluster into the VPC.
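The DHCP **domain-name** rule described in the procedure above can be sketched as a small helper (illustrative only; the function name is an assumption, not an AWS API):

```python
def dhcp_domain_name(region: str) -> str:
    """Return the DHCP options set domain-name value for a Region.

    US East (N. Virginia) uses ec2.internal; all other Regions,
    including AWS GovCloud (US-West), use <region>.compute.internal.
    """
    if region == "us-east-1":
        return "ec2.internal"
    return f"{region}.compute.internal"

print(dhcp_domain_name("us-east-1"))  # ec2.internal
print(dhcp_domain_name("us-west-2"))  # us-west-2.compute.internal
```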

# Launch clusters into a VPC with Amazon EMR


After you have a subnet that is configured to host Amazon EMR clusters, launch the cluster in that subnet by specifying the associated subnet identifier when creating the cluster.

**Note**  
Amazon EMR supports private subnets in release versions 4.2.0 and later.

When the cluster is launched, Amazon EMR adds security groups based on whether the cluster is launching into VPC private or public subnets. All security groups allow ingress at port 8443 to communicate to the Amazon EMR service, but IP address ranges vary for public and private subnets. Amazon EMR manages all of these security groups, and may need to add additional IP addresses to the AWS range over time. For more information, see [Control network traffic with security groups for your Amazon EMR cluster](emr-security-groups.md).

To manage the cluster on a VPC, Amazon EMR attaches a network device to the primary node and manages it through this device. You can view this device using the Amazon EC2 API action [DescribeInstances](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/ApiReference-query-DescribeInstances.html). If you modify this device in any way, the cluster may fail.

------
#### [ Console ]

**To launch a cluster into a VPC with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. Under **Networking**, go to the **Virtual private cloud (VPC)** field. Enter the name of your VPC or choose **Browse** to select your VPC. Alternatively, choose **Create VPC** to create a VPC that you can use for your cluster.

1. Choose any other options that apply to your cluster.

1. To launch your cluster, choose **Create cluster**.

------
#### [ AWS CLI ]

**To launch a cluster into a VPC with the AWS CLI**
**Note**  
The AWS CLI does not provide a way to create a NAT instance automatically and connect it to your private subnet. However, to create an S3 endpoint in your subnet, you can use the Amazon VPC CLI commands. Use the console to create NAT instances and launch clusters in a private subnet.

After your VPC is configured, you can launch Amazon EMR clusters in it by using the `create-cluster` subcommand with the `--ec2-attributes` parameter. Use the `--ec2-attributes` parameter to specify the VPC subnet for your cluster.
+ To create a cluster in a specific subnet, type the following command, replace *myKey* with the name of your Amazon EC2 key pair, and replace *77XXXX03* with your subnet ID.

  ```
  aws emr create-cluster --name "Test cluster" --release-label emr-4.2.0 --applications Name=Hadoop Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey,SubnetId=subnet-77XXXX03 --instance-type m5.xlarge --instance-count 3
  ```

  When you specify the instance count without using the `--instance-groups` parameter, a single primary node is launched, and the remaining instances are launched as core nodes. All nodes use the instance type specified in the command.
**Note**  
If you have not previously created the default Amazon EMR service role and EC2 instance profile, type `aws emr create-default-roles` to create them before typing the `create-cluster` subcommand.
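The node split that `--instance-count` implies can be sketched as follows (the helper name is illustrative):

```python
def node_split(instance_count: int) -> dict:
    """When --instance-count is given without --instance-groups, Amazon EMR
    launches one primary node and the remaining instances as core nodes."""
    return {"primary": 1, "core": instance_count - 1}

print(node_split(3))  # {'primary': 1, 'core': 2}
```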

------

## Ensuring available IP addresses for an EMR cluster on EC2


To ensure that a subnet with enough free IP addresses is available when you launch a cluster, Amazon EMR checks IP availability during EC2 subnet selection. The creation process uses a subnet with enough IP addresses to launch the core, primary, and task nodes as required, even if only the core nodes are created initially. During creation, Amazon EMR calculates the number of IP addresses required to launch primary and task nodes separately from the number needed to launch core nodes, and automatically determines the minimum number of primary and task instances required.

**Important**  
If no subnets in the VPC have enough available IPs to accommodate essential nodes, an error is returned and the cluster isn't created.

In most deployments, there is a time difference between the launches of core, primary, and task nodes. It is also possible for multiple clusters to share a subnet. In these cases, IP address availability can fluctuate, and a subsequent task-node launch, for instance, can be limited by the number of available IP addresses.
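As a rough pre-launch check, you can compute how many addresses a subnet CIDR offers. This is a sketch only: it does not account for addresses already in use by other instances or clusters in the subnet. It assumes the standard AWS rule that five IP addresses are reserved in every subnet (the first four and the last address).

```python
import ipaddress

# AWS reserves the first four addresses (network, VPC router, DNS,
# future use) and the last address (broadcast) in each subnet.
AWS_RESERVED_PER_SUBNET = 5

def usable_ips(subnet_cidr: str) -> int:
    """Maximum number of IP addresses available to instances in a subnet."""
    return ipaddress.ip_network(subnet_cidr).num_addresses - AWS_RESERVED_PER_SUBNET

print(usable_ips("10.0.1.0/26"))  # 59
```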

# Sample policies for private subnets that access Amazon S3


For private subnets, at a minimum you must give Amazon EMR access to the Amazon Linux repositories. These private subnet policies are part of the VPC endpoint policies for accessing Amazon S3.

With Amazon EMR 5.25.0 or later, to enable one-click access to the persistent Spark History Server, you must allow Amazon EMR to access the system bucket that collects Spark event logs. If you enable logging, provide PUT permissions on the following bucket: 

```
aws157-logs-${AWS::Region}/*
```

For more information, see [One-click access to persistent Spark History Server](https://docs.aws.amazon.com/emr/latest/ManagementGuide/app-history-spark-UI.html).

It is up to you to determine the policy restrictions that meet your business needs. The following example policy provides permissions to access Amazon Linux repositories and the Amazon EMR system bucket for collecting Spark event logs. It shows a few sample resource names for the buckets. 

For more information about using IAM policies with Amazon VPC endpoints, see [Endpoint policies for Amazon S3](https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-s3.html#vpc-endpoints-policies-s3).

The following policy example contains sample resources in the us-east-1 region.

------
#### [ JSON ]


```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Sid": "AmazonLinuxAMIRepositoryAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::packages.us-east-1.amazonaws.com/*",
        "arn:aws:s3:::repo.us-east-1.amazonaws.com/*"
      ]
    },
    {
      "Sid": "EnableApplicationHistory",
      "Effect": "Allow",
      "Action": [
        "s3:Put*",
        "s3:Get*",
        "s3:Create*",
        "s3:Abort*",
        "s3:List*"
      ],
      "Resource": [
        "arn:aws:s3:::prod.us-east-1.appinfo.src/*"
      ]
    }
  ]
}
```

------

The following example policy provides the permissions required to access Amazon Linux 2 repositories in the us-east-1 region.

```
{
  "Statement": [
    {
      "Sid": "AmazonLinux2AMIRepositoryAccess",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": [
        "arn:aws:s3:::amazonlinux.us-east-1.amazonaws.com/*",
        "arn:aws:s3:::amazonlinux-2-repos-us-east-1/*"
      ]
    }
  ]
}
```

The following example policy provides the permissions required to access Amazon Linux 2023 repositories in the us-east-1 region.

```
{
  "Statement": [
    {
      "Sid": "AmazonLinux2023AMIRepositoryAccess",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": [
        "arn:aws:s3:::al2023-repos-us-east-1-de612dc2/*"
      ]
    }
  ]
}
```
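Before attaching an endpoint policy, it can help to sanity-check its structure. The following sketch (the function name is illustrative, and this is not a substitute for IAM's own validation) parses a policy document and flags missing fields:

```python
import json

def policy_problems(policy_text: str) -> list:
    """Return a list of basic structural problems in an endpoint policy
    document; an empty list means the checks passed."""
    try:
        policy = json.loads(policy_text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]
    statements = policy.get("Statement")
    if not isinstance(statements, list) or not statements:
        return ["missing or empty Statement list"]
    problems = []
    for i, stmt in enumerate(statements):
        # Every statement needs an effect, an action, and a resource.
        for key in ("Effect", "Action", "Resource"):
            if key not in stmt:
                problems.append(f"statement {i} is missing {key}")
    return problems
```

For example, `policy_problems('{"Statement": [{"Effect": "Allow"}]}')` reports the missing `Action` and `Resource` keys.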

## Available regions


The following table lists the buckets by Region, including the Amazon Resource Name (ARN) for the repository buckets and the ARN for the `appinfo.src` bucket. An ARN is a string that uniquely identifies an AWS resource.


| Region | Repository buckets | AppInfo bucket | 
| --- | --- | --- | 
| US East (Ohio) | "arn:aws:s3:::packages.us-east-2.amazonaws.com/","arn:aws:s3:::repo.us-east-2.amazonaws.com/","arn:aws:s3:::repo.us-east-2.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.us-east-2.appinfo.src/\$1" | 
| US East (N. Virginia) | "arn:aws:s3:::packages.us-east-1.amazonaws.com/","arn:aws:s3:::repo.us-east-1.amazonaws.com/","arn:aws:s3:::repo.us-east-1.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.us-east-1.appinfo.src/\$1" | 
| US West (N. California) | "arn:aws:s3:::packages.us-west-1.amazonaws.com/","arn:aws:s3:::repo.us-west-1.amazonaws.com/","arn:aws:s3:::repo.us-west-1.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.us-west-1.appinfo.src/\$1" | 
| US West (Oregon) | "arn:aws:s3:::packages.us-west-2.amazonaws.com/","arn:aws:s3:::repo.us-west-2.amazonaws.com/","arn:aws:s3:::repo.us-west-2.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.us-west-2.appinfo.src/\$1" | 
| Africa (Cape Town) | "arn:aws:s3:::packages.af-south-1.amazonaws.com/","arn:aws:s3:::repo.af-south-1.amazonaws.com/","arn:aws:s3:::repo.af-south-1.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.af-south-1.appinfo.src/\$1" | 
| Asia Pacific (Hong Kong) | "arn:aws:s3:::packages.ap-east-1.amazonaws.com/","arn:aws:s3:::repo.ap-east-1.amazonaws.com/","arn:aws:s3:::repo.ap-east-1.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.ap-east-1.appinfo.src/\$1" | 
| Asia Pacific (Hyderabad) | "arn:aws:s3:::packages.ap-south-2.amazonaws.com/","arn:aws:s3:::repo.ap-south-2.amazonaws.com/","arn:aws:s3:::repo.ap-south-2.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.ap-south-2.appinfo.src/\$1" | 
| Asia Pacific (Jakarta) | "arn:aws:s3:::packages.ap-southeast-3.amazonaws.com/","arn:aws:s3:::repo.ap-southeast-3.amazonaws.com/","arn:aws:s3:::repo.ap-southeast-3.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.ap-southeast-3.appinfo.src/\$1" | 
| Asia Pacific (Malaysia) | "arn:aws:s3:::packages.ap-southeast-5.amazonaws.com/","arn:aws:s3:::repo.ap-southeast-5.amazonaws.com/","arn:aws:s3:::repo.ap-southeast-5.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.ap-southeast-5.appinfo.src/\$1" | 
| Asia Pacific (Melbourne) | "arn:aws:s3:::packages.ap-southeast-4.amazonaws.com/","arn:aws:s3:::repo.ap-southeast-4.amazonaws.com/","arn:aws:s3:::repo.ap-southeast-4.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.ap-southeast-4.appinfo.src/\$1" | 
| Asia Pacific (Mumbai) | "arn:aws:s3:::packages.ap-south-1.amazonaws.com/","arn:aws:s3:::repo.ap-south-1.amazonaws.com/","arn:aws:s3:::repo.ap-south-1.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.ap-south-1.appinfo.src/\$1" | 
| Asia Pacific (Osaka) | "arn:aws:s3:::packages.ap-northeast-3.amazonaws.com/","arn:aws:s3:::repo.ap-northeast-3.amazonaws.com/","arn:aws:s3:::repo.ap-northeast-3.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.ap-northeast-3.appinfo.src/\$1" | 
| Asia Pacific (Seoul) | "arn:aws:s3:::packages.ap-northeast-2.amazonaws.com/","arn:aws:s3:::repo.ap-northeast-2.amazonaws.com/","arn:aws:s3:::repo.ap-northeast-2.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.ap-northeast-2.appinfo.src/\$1" | 
| Asia Pacific (Singapore) | "arn:aws:s3:::packages.ap-southeast-1.amazonaws.com/","arn:aws:s3:::repo.ap-southeast-1.amazonaws.com/","arn:aws:s3:::repo.ap-southeast-1.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.ap-southeast-1.appinfo.src/\$1" | 
| Asia Pacific (Sydney) | "arn:aws:s3:::packages.ap-southeast-2.amazonaws.com/","arn:aws:s3:::repo.ap-southeast-2.amazonaws.com/","arn:aws:s3:::repo.ap-southeast-2.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.ap-southeast-2.appinfo.src/\$1" | 
| Asia Pacific (Tokyo) | "arn:aws:s3:::packages.ap-northeast-1.amazonaws.com/","arn:aws:s3:::repo.ap-northeast-1.amazonaws.com/","arn:aws:s3:::repo.ap-northeast-1.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.ap-northeast-1.appinfo.src/\$1" | 
| Canada (Central) | "arn:aws:s3:::packages.ca-central-1.amazonaws.com/","arn:aws:s3:::repo.ca-central-1.amazonaws.com/","arn:aws:s3:::repo.ca-central-1.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.ca-central-1.appinfo.src/\$1" | 
| Canada West (Calgary) | "arn:aws:s3:::packages.ca-west-1.amazonaws.com/","arn:aws:s3:::repo.ca-west-1.amazonaws.com/","arn:aws:s3:::repo.ca-west-1.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.ca-west-1.appinfo.src/\$1" | 
| Europe (Frankfurt) | "arn:aws:s3:::packages.eu-central-1.amazonaws.com/","arn:aws:s3:::repo.eu-central-1.amazonaws.com/","arn:aws:s3:::repo.eu-central-1.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.eu-central-1.appinfo.src/\$1" | 
| Europe (Ireland) | "arn:aws:s3:::packages.eu-west-1.amazonaws.com/","arn:aws:s3:::repo.eu-west-1.amazonaws.com/","arn:aws:s3:::repo.eu-west-1.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.eu-west-1.appinfo.src/\$1" | 
| Europe (London) | "arn:aws:s3:::packages.eu-west-2.amazonaws.com/","arn:aws:s3:::repo.eu-west-2.amazonaws.com/","arn:aws:s3:::repo.eu-west-2.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.eu-west-2.appinfo.src/\$1" | 
| Europe (Milan) | "arn:aws:s3:::packages.eu-south-1.amazonaws.com/","arn:aws:s3:::repo.eu-south-1.amazonaws.com/","arn:aws:s3:::repo.eu-south-1.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.eu-south-1.appinfo.src/\$1" | 
| Europe (Paris) | "arn:aws:s3:::packages.eu-west-3.amazonaws.com/","arn:aws:s3:::repo.eu-west-3.amazonaws.com/","arn:aws:s3:::repo.eu-west-3.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.eu-west-3.appinfo.src/\$1" | 
| Europe (Spain) | "arn:aws:s3:::packages.eu-south-2.amazonaws.com/","arn:aws:s3:::repo.eu-south-2.amazonaws.com/","arn:aws:s3:::repo.eu-south-2.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.eu-south-2.appinfo.src/\$1" | 
| Europe (Stockholm) | "arn:aws:s3:::packages.eu-north-1.amazonaws.com/","arn:aws:s3:::repo.eu-north-1.amazonaws.com/","arn:aws:s3:::repo.eu-north-1.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.eu-north-1.appinfo.src/\$1" | 
| Europe (Zurich) | "arn:aws:s3:::packages.eu-central-2.amazonaws.com/","arn:aws:s3:::repo.eu-central-2.amazonaws.com/","arn:aws:s3:::repo.eu-central-2.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.eu-central-2.appinfo.src/\$1" | 
| Israel (Tel Aviv) | "arn:aws:s3:::packages.il-central-1.amazonaws.com/","arn:aws:s3:::repo.il-central-1.amazonaws.com/","arn:aws:s3:::repo.il-central-1.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.il-central-1.appinfo.src/\$1" | 
| Middle East (Bahrain) | "arn:aws:s3:::packages.me-south-1.amazonaws.com/","arn:aws:s3:::repo.me-south-1.amazonaws.com/","arn:aws:s3:::repo.me-south-1.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.me-south-1.appinfo.src/\$1" | 
| Middle East (UAE) | "arn:aws:s3:::packages.me-central-1.amazonaws.com/","arn:aws:s3:::repo.me-central-1.amazonaws.com/","arn:aws:s3:::repo.me-central-1.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.me-central-1.appinfo.src/\$1" | 
| South America (São Paulo) | "arn:aws:s3:::packages.sa-east-1.amazonaws.com/","arn:aws:s3:::repo.sa-east-1.amazonaws.com/","arn:aws:s3:::repo.sa-east-1.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.sa-east-1.appinfo.src/\$1" | 
| AWS GovCloud (US-East) | "arn:aws:s3:::packages.us-gov-east-1.amazonaws.com/","arn:aws:s3:::repo.us-gov-east-1.amazonaws.com/","arn:aws:s3:::repo.us-gov-east-1.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.us-gov-east-1.appinfo.src/\$1" | 
| AWS GovCloud (US-West) | "arn:aws:s3:::packages.us-gov-west-1.amazonaws.com/","arn:aws:s3:::repo.us-gov-west-1.amazonaws.com/","arn:aws:s3:::repo.us-gov-west-1.emr.amazonaws.com/\$1" | "arn:aws:s3:::prod.us-gov-west-1.appinfo.src/\$1" | 

## More resources for learning about VPCs


Use the following topics to learn more about VPCs and subnets.
+ Private Subnets in a VPC
  + [Scenario 2: VPC with Public and Private Subnets (NAT)](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Scenario2.html)
  + [NAT Instances](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_NAT_Instance.html)
  + [High Availability for Amazon VPC NAT Instances: An Example](https://aws.amazon.com/articles/2781451301784570)
+ Public Subnets in a VPC
  + [Scenario 1: VPC with a Single Public Subnet](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Scenario1.html)
+ General VPC Information
  + [Amazon VPC User Guide](https://docs.aws.amazon.com/vpc/latest/userguide/)
  + [VPC Peering](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-peering.html)
  + [Using Elastic Network Interfaces with Your VPC](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_ElasticNetworkInterfaces.html)
  + [Securely connect to Linux instances running in a private VPC](https://blogs.aws.amazon.com/security/post/Tx3N8GFK85UN1G6/Securely-connect-to-Linux-instances-running-in-a-private-Amazon-VPC)

# Create an Amazon EMR cluster with instance fleets or uniform instance groups

When you create a cluster and specify the configuration of the primary node, core nodes, and task nodes, you have two configuration options: *instance fleets* or *uniform instance groups*. The option you choose applies to all nodes for the lifetime of the cluster, and instance fleets and instance groups cannot coexist in a cluster. The instance fleets configuration is available in Amazon EMR releases 4.8.0 and later, excluding 5.0.x versions. 

You can use the Amazon EMR console, the AWS CLI, or the Amazon EMR API to create clusters with either configuration. When you use the `create-cluster` command from the AWS CLI, you use either the `--instance-fleets` parameters to create the cluster using instance fleets or, alternatively, you use the `--instance-groups` parameters to create it using uniform instance groups.

The same is true using the Amazon EMR API. You use either the `InstanceGroups` configuration to specify an array of `InstanceGroupConfig` objects, or you use the `InstanceFleets` configuration to specify an array of `InstanceFleetConfig` objects.
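As a minimal sketch, the two CLI styles differ only in the instance configuration parameter. The release label, instance types, and counts below are placeholders, and the example assumes the default EMR roles already exist (created with `aws emr create-default-roles`):

```shell
# Sketch: the same cluster expressed with uniform instance groups (top)
# and with instance fleets (bottom). All values are placeholders.
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge

aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=2,InstanceTypeConfigs=['{InstanceType=m5.xlarge}']
```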

In the new Amazon EMR console, you can choose to use either instance groups or instance fleets when you create a cluster, and you have the option to use Spot Instances with each. With the old Amazon EMR console, if you use the default **Quick Options** settings when you create your cluster, Amazon EMR applies the uniform instance groups configuration to the cluster and uses On-Demand Instances. To use Spot Instances with uniform instance groups, or to configure instance fleets and other customizations, choose **Advanced Options**.

## Instance fleets


The instance fleets configuration offers the widest variety of provisioning options for Amazon EC2 instances. Each node type has a single instance fleet, and using a task instance fleet is optional. You can specify up to five EC2 instance types per fleet, or 30 EC2 instance types per fleet when you create a cluster using the AWS CLI or Amazon EMR API and an [allocation strategy](emr-instance-fleet.md#emr-instance-fleet-allocation-strategy) for On-Demand and Spot Instances. For the core and task instance fleets, you assign a *target capacity* for On-Demand Instances, and another for Spot Instances. Amazon EMR chooses any mix of the specified instance types to fulfill the target capacities, provisioning both On-Demand and Spot Instances.

For the primary node type, Amazon EMR chooses a single instance type from your list of instances, and you specify whether it's provisioned as an On-Demand or Spot Instance. Instance fleets also provide additional options for Spot Instance and On-Demand purchases. Spot Instance options include a timeout that specifies an action to take if Spot capacity can't be provisioned, and a preferred allocation strategy (capacity-optimized) for launching Spot Instance fleets. On-Demand Instance fleets can also be launched using the allocation strategy (lowest-price) option. If you use a service role that is not the EMR default service role, or use an EMR managed policy in your service role, you need to add additional permissions to the custom cluster service role to enable the allocation strategy option. For more information, see [Service role for Amazon EMR (EMR role)](emr-iam-role.md).

For more information about configuring instance fleets, see [Planning and configuring instance fleets for your Amazon EMR cluster](emr-instance-fleet.md).

## Uniform instance groups

Uniform instance groups offer a simpler setup than instance fleets. Each Amazon EMR cluster can include up to 50 instance groups: one primary instance group that contains one Amazon EC2 instance, a core instance group that contains one or more EC2 instances, and up to 48 optional task instance groups. Each core and task instance group can contain any number of Amazon EC2 instances. You can scale each instance group by adding and removing Amazon EC2 instances manually, or you can set up automatic scaling. For information about adding and removing instances, see [Use Amazon EMR cluster scaling to adjust for changing workloads](emr-scale-on-demand.md).

For more information about configuring uniform instance groups, see [Configure uniform instance groups for your Amazon EMR cluster](emr-uniform-instance-group.md). 

## Working with instance fleets and instance groups

**Topics**
+ [Instance fleets](#emr-plan-instance-fleets)
+ [Uniform instance groups](#emr-plan-instance-groups)
+ [Working with instance fleets and instance groups](#emr-plan-instance-topics)
+ [Planning and configuring instance fleets for your Amazon EMR cluster](emr-instance-fleet.md)
+ [Reconfiguring instance fleets for your Amazon EMR cluster](instance-fleet-reconfiguration.md)
+ [Use capacity reservations with instance fleets in Amazon EMR](on-demand-capacity-reservations.md)
+ [Configure uniform instance groups for your Amazon EMR cluster](emr-uniform-instance-group.md)
+ [Availability Zone flexibility for an Amazon EMR cluster](emr-flexibility.md)
+ [Configuring Amazon EMR cluster instance types and best practices for Spot instances](emr-plan-instances-guidelines.md)

# Planning and configuring instance fleets for your Amazon EMR cluster


**Note**  
The instance fleets configuration is available only in Amazon EMR releases 4.8.0 and later, excluding 5.0.0 and 5.0.3.

The instance fleet configuration for Amazon EMR clusters lets you select a wide variety of provisioning options for Amazon EC2 instances, and helps you develop a flexible and elastic resourcing strategy for each node type in your cluster. 

In an instance fleet configuration, you specify a *target capacity* for [On-Demand Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-on-demand-instances.html) and [Spot Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html) within each fleet. When the cluster launches, Amazon EMR provisions instances until the targets are fulfilled. When Amazon EC2 reclaims a Spot Instance in a running cluster because of a price increase or instance failure, Amazon EMR tries to replace the instance with any of the instance types that you specify. This makes it easier to regain capacity during a spike in Spot pricing. 

You can specify a maximum of five Amazon EC2 instance types per fleet for Amazon EMR to use when fulfilling the targets, or a maximum of 30 Amazon EC2 instance types per fleet when you create a cluster using the AWS CLI or Amazon EMR API and an [allocation strategy](#emr-instance-fleet-allocation-strategy) for On-Demand and Spot Instances. 

You can also select multiple subnets for different Availability Zones. When Amazon EMR launches the cluster, it looks across those subnets to find the instances and purchasing options you specify. If Amazon EMR detects an AWS large-scale event in one or more of the Availability Zones, Amazon EMR automatically attempts to route traffic away from the impacted Availability Zones and tries to launch new clusters that you create in alternate Availability Zones according to your selections. Note that cluster Availability Zone selection happens only at cluster creation. Existing cluster nodes are not automatically re-launched in a new Availability Zone in the event of an Availability Zone outage.

## **Considerations for working with instance fleets**


Consider the following items when you use instance fleets with Amazon EMR.
+ You can have one instance fleet, and only one, per node type (primary, core, task). You can specify up to five Amazon EC2 instance types for each fleet on the AWS Management Console (or a maximum of 30 types per instance fleet when you create a cluster using the AWS CLI or Amazon EMR API and an [Allocation strategy for instance fleets](#emr-instance-fleet-allocation-strategy)). 
+ Amazon EMR chooses any or all of the specified Amazon EC2 instance types to provision with both Spot and On-Demand purchasing options.
+ You can establish target capacities for Spot and On-Demand Instances for the core fleet and task fleet. Each Amazon EC2 instance counts toward the targets by its number of vCPUs or by a generic unit that you assign to the instance type. Amazon EMR provisions instances until each target capacity is completely fulfilled. For the primary fleet, the target is always one.
+ You can choose one subnet (Availability Zone) or a range. If you choose a range, Amazon EMR provisions capacity in the Availability Zone that is the best fit.
+ When you specify a target capacity for Spot Instances:
  + For each instance type, specify a maximum Spot price. Amazon EMR provisions Spot Instances if the Spot price is below the maximum Spot price. You pay the Spot price, not necessarily the maximum Spot price.
  + For each fleet, define a timeout period for provisioning Spot Instances. If Amazon EMR can't provision Spot capacity before the timeout expires, you can choose to terminate the cluster or to switch to provisioning On-Demand capacity instead. The timeout applies only when provisioning a cluster, not when resizing one. If the timeout period ends during a cluster resize, unfulfilled Spot requests are canceled without switching to On-Demand capacity. 
+ For each fleet, you can specify one of the following allocation strategies for your Spot Instances: price-capacity optimized, capacity-optimized, capacity-optimized-prioritized, lowest-price, or diversified across all pools.
+ For each fleet, you can apply the following allocation strategies for your On-Demand Instances: the lowest-price strategy or the prioritized strategy.
+ For each fleet with On-Demand Instances, you can choose to apply capacity reservation options.
+ If you use allocation strategy for instance fleets, the following considerations apply when you choose subnets for your EMR cluster:
  + When Amazon EMR provisions a cluster with a task fleet, it filters out subnets that lack enough available IP addresses to provision all instances of the requested EMR cluster. This includes IP addresses required for the primary, core, and task instance fleets during cluster launch. Amazon EMR then leverages its allocation strategy to determine the instance pool, based on instance type and remaining subnets with sufficient IP addresses, to launch the cluster.
  + If Amazon EMR cannot launch the whole cluster due to insufficient available IP addresses, it will attempt to identify subnets with enough free IP addresses to launch the essential (core and primary) instance fleets. In such scenarios, your task instance fleet will go into a suspended state, rather than terminating the cluster with an error.
  + If none of the specified subnets contain enough IP addresses to provision the essential core and primary instance fleets, the cluster launch fails with a **VALIDATION_ERROR**. This triggers a **CRITICAL** severity cluster termination event, notifying you that the cluster cannot be launched. To prevent this issue, we recommend increasing the number of IP addresses in your subnets.
+ If you run Amazon EMR release **emr-7.7.0** and above, and you use allocation strategy for instance fleets, you can scale the cluster up to 4000 EC2 instances and 14000 EBS volumes per instance fleet. For release versions below **emr-7.7.0**, the cluster can be scaled up only to 2000 EC2 instances and 7000 EBS volumes per instance fleet.
+ When you launch On-Demand Instances, you can use open or targeted capacity reservations for primary, core, and task nodes in your accounts. You might see insufficient capacity with On-Demand Instances with allocation strategy for instance fleets. We recommend that you specify multiple instance types to diversify and reduce the chance of experiencing insufficient capacity. For more information, see [Use capacity reservations with instance fleets in Amazon EMR](on-demand-capacity-reservations.md).
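
The Spot timeout and allocation strategy behavior described above is expressed in a fleet's `LaunchSpecifications`. The following is a hedged CLI sketch; the release label, instance types, and capacities are placeholders:

```shell
# Sketch: core fleet that requests Spot capacity with the
# price-capacity-optimized strategy, waits up to 30 minutes, and then
# falls back to On-Demand capacity. All values are illustrative.
aws emr create-cluster --release-label emr-6.10.0 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge}'] \
    InstanceFleetType=CORE,TargetSpotCapacity=4,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge}','{InstanceType=m5.2xlarge}'],\
LaunchSpecifications={SpotSpecification='{TimeoutDurationMinutes=30,TimeoutAction=SWITCH_TO_ON_DEMAND,AllocationStrategy=price-capacity-optimized}'}
```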

## Instance fleet options


Use the following guidelines to understand instance fleet options.

**Topics**
+ [Setting target capacities](#emr-fleet-capacity)
+ [Launch options](#emr-fleet-spot-options)
+ [Multiple subnet (Availability Zones) options](#emr-multiple-subnet-options)
+ [Master node configuration](#emr-master-node-configuration)

### **Setting target capacities**


Specify the target capacities that you want for the core fleet and task fleet. These targets determine the number of On-Demand Instances and Spot Instances that Amazon EMR provisions. When you specify an instance type, you decide how much each instance counts toward the target. When an On-Demand Instance is provisioned, it counts toward the On-Demand target; the same is true for Spot Instances. Unlike the core and task fleets, the primary fleet always consists of one instance, so its target capacity is always one. 

When you use the console, the vCPUs of the Amazon EC2 instance type are used as the count for target capacities by default. You can change this to **Generic units**, and then specify the count for each EC2 instance type. When you use the AWS CLI, you manually assign generic units for each instance type. 
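For example, with the AWS CLI you might assign generic units through the `WeightedCapacity` field of each instance type; the types, weights, and target below are placeholders:

```shell
# Sketch: each m5.xlarge counts 4 units and each m5.2xlarge counts 8 units
# toward a Spot target capacity of 16 units. All values are placeholders.
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge}'] \
    InstanceFleetType=CORE,TargetSpotCapacity=16,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,WeightedCapacity=4}','{InstanceType=m5.2xlarge,WeightedCapacity=8}']
```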

**Important**  
When you choose an instance type using the AWS Management Console, the number of **vCPU** shown for each **Instance type** is the number of YARN vcores for that instance type, not the number of EC2 vCPUs for that instance type. For more information on the number of vCPUs for each instance type, see [Amazon EC2 Instance Types](https://aws.amazon.com/ec2/instance-types/).

For each fleet, you specify up to five Amazon EC2 instance types. If you use an [Allocation strategy for instance fleets](#emr-instance-fleet-allocation-strategy) and create a cluster using the AWS CLI or the Amazon EMR API, you can specify up to 30 EC2 instance types per instance fleet. Amazon EMR chooses any combination of these EC2 instance types to fulfill your target capacities. Because Amazon EMR wants to fill target capacity completely, an overage might happen. For example, if there are two unfulfilled units, and Amazon EMR can only provision an instance with a count of five units, the instance still gets provisioned, meaning that the target capacity is exceeded by three units. 

If you reduce the target capacity to resize a running cluster, Amazon EMR attempts to complete application tasks and terminates instances to meet the new target. For more information, see [Terminate at task completion](emr-scaledown-behavior.md#emr-scaledown-terminate-task).

### **Launch options**


For Spot Instances, you can specify a **Maximum Spot price** for each instance type in a fleet. You can set this price either as a percentage of the On-Demand price, or as a specific dollar amount. Amazon EMR provisions Spot Instances if the current Spot price in an Availability Zone is below your maximum Spot price. You pay the Spot price, not necessarily the maximum Spot price.

**Note**  
Spot Instances with a defined duration (also known as Spot blocks) are no longer available to new customers from July 1, 2021. For customers who have previously used the feature, we will continue to support Spot Instances with a defined duration until December 31, 2022.

With Amazon EMR 5.12.1 and later, you have the option to launch Spot and On-Demand Instance fleets with optimized capacity allocation. You can set this allocation strategy option in the old AWS Management Console or with the `RunJobFlow` API; you can't customize the allocation strategy in the new console. Using the allocation strategy option requires additional service role permissions. If you use the default Amazon EMR service role and managed policy ([`EMR_DefaultRole`](emr-iam-role.md) and `AmazonEMRServicePolicy_v2`) for the cluster, the permissions for the allocation strategy option are already included. If you're not using the default Amazon EMR service role and managed policy, you must add the required permissions to use this option. See [Service role for Amazon EMR (EMR role)](emr-iam-role.md).

For more information about Spot Instances, see [Spot Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html) in the Amazon EC2 User Guide. For more information about On-Demand Instances, see [On-Demand Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-on-demand-instances.html) in the Amazon EC2 User Guide.

If you choose to launch On-Demand Instance fleets with the lowest-price allocation strategy, you can also use capacity reservations. You can set capacity reservation options with the Amazon EMR `RunJobFlow` API. Capacity reservations require additional service role permissions, which you must add to use these options. See [Required IAM permissions for an allocation strategy](#create-cluster-allocation-policy). Note that you can't customize capacity reservations in the new console.

### **Multiple subnet (Availability Zones) options**


When you use instance fleets, you can specify multiple Amazon EC2 subnets within a VPC, each corresponding to a different Availability Zone. If you use EC2-Classic, you specify Availability Zones explicitly. Amazon EMR identifies the best Availability Zone to launch instances according to your fleet specifications. Instances are always provisioned in only one Availability Zone. You can select private subnets or public subnets, but you can't mix the two, and the subnets you specify must be within the same VPC.

### **Master node configuration**


Because the primary instance fleet is only a single instance, its configuration is slightly different from core and task instance fleets. You only select either On-Demand or Spot for the primary instance fleet because it consists of only one instance. If you use the console to create the instance fleet, the target capacity for the purchasing option you select is set to 1. If you use the AWS CLI, always set either `TargetSpotCapacity` or `TargetOnDemandCapacity` to 1 as appropriate. You can still choose up to five instance types for the primary instance fleet (or a maximum of 30 when you use the allocation strategy option for On-Demand or Spot Instances). However, unlike core and task instance fleets, where Amazon EMR might provision multiple instances of different types, Amazon EMR selects a single instance type to provision for the primary instance fleet.

## Allocation strategy for instance fleets

With Amazon EMR versions 5.12.1 and later, you can use the allocation strategy option with On-Demand and Spot Instances for each cluster node. When you create a cluster using the AWS CLI, Amazon EMR API, or Amazon EMR console with an allocation strategy, you can specify up to 30 Amazon EC2 instance types per fleet. With the default Amazon EMR cluster instance fleet configuration, you can have up to 5 instance types per fleet. We recommend that you use the allocation strategy option for faster cluster provisioning, more accurate Spot Instance allocation, and fewer Spot Instance interruptions.

**Topics**
+ [Allocation strategy with On-Demand Instances](#emr-instance-fleet-allocation-strategy-od)
+ [Allocation strategy with Spot Instances](#emr-instance-fleet-allocation-strategy-spot)
+ [Allocation strategy permissions](#emr-instance-fleet-allocation-strategy-permissions)
+ [Required IAM permissions for an allocation strategy](#create-cluster-allocation-policy)

### Allocation strategy with On-Demand Instances

The following allocation strategies are available for your On-Demand Instances:

`lowest-price` **(default)**  
The lowest-price allocation strategy launches On-Demand Instances from the lowest-priced pool that has available capacity. If the lowest-priced pool doesn't have available capacity, the On-Demand Instances come from the next lowest-priced pool that has available capacity.

`prioritized`  
The prioritized allocation strategy lets you specify a priority value for each instance type for your instance fleet. Amazon EMR launches your On-Demand Instances that have the highest priority. If you use this strategy, you must configure the priority for at least one instance type. If you don't configure the priority value for an instance type, Amazon EMR assigns the lowest priority to that instance type. Each instance fleet (primary, core, or task) in a cluster can have a different priority value for a given instance type.

**Note**  
If you use the **capacity-optimized-prioritized** Spot allocation strategy, Amazon EMR applies the same priorities to both your On-Demand Instances and Spot Instances when you set priorities.
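
As a sketch, you might set per-type priorities with the `Priority` field in `InstanceTypeConfigs` and select the strategy in the fleet's launch specifications. The values below are illustrative, and this example assumes that a lower `Priority` value means higher priority:

```shell
# Sketch: On-Demand core fleet using the prioritized allocation strategy.
# Assumes lower Priority values are tried first; all values are placeholders.
aws emr create-cluster --release-label emr-6.10.0 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=2,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,Priority=0}','{InstanceType=m5.2xlarge,Priority=1}'],\
LaunchSpecifications={OnDemandSpecification='{AllocationStrategy=prioritized}'}
```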

### Allocation strategy with Spot Instances

For *Spot Instances*, you can choose one of the following allocation strategies:

**`price-capacity-optimized` (recommended)**  
The price-capacity optimized allocation strategy launches Spot Instances from the Spot Instance pools that have the highest available capacity and the lowest price for the number of instances that are launching. As a result, this strategy typically has a higher chance of getting Spot capacity and delivers lower interruption rates. This is the default strategy for Amazon EMR releases 6.10.0 and higher.

**`capacity-optimized`**  
The capacity-optimized allocation strategy launches Spot Instances into the most available pools with the lowest chance of interruption in the near term. This is a good option for workloads that might have a higher cost of interruption associated with work that gets restarted. This is the default strategy for Amazon EMR releases 6.9.0 and lower.

**`capacity-optimized-prioritized`**  
The capacity-optimized-prioritized allocation strategy lets you specify a priority value for each instance type in your instance fleet. Amazon EMR optimizes for capacity first, but honors instance type priorities on a best-effort basis, for example, when honoring a priority doesn't significantly affect the fleet's ability to provision optimal capacity. We recommend this option for workloads that need to minimize disruption but still prefer certain instance types. If you use this strategy, you must configure the priority for at least one instance type. If you don't configure a priority for an instance type, Amazon EMR assigns that instance type the lowest priority. Each instance fleet (primary, core, or task) in a cluster can have a different priority value for a given instance type.  
If you use the **prioritized** On-Demand allocation strategy, Amazon EMR applies the same priority value to both your On-Demand and Spot Instances when you set priorities.

**`diversified`**  
With the diversified allocation strategy, Amazon EC2 distributes Spot Instances across all Spot capacity pools.

**`lowest-price`**  
The lowest-price allocation strategy launches Spot Instances from the lowest priced pool that has available capacity. If the lowest-priced pool doesn't have available capacity, the Spot Instances come from the next lowest priced pool that has available capacity. If a pool runs out of capacity before it fulfills your requested capacity, the Amazon EC2 fleet draws from the next lowest priced pool to continue to fulfill your request. To ensure that your desired capacity is met, you might receive Spot Instances from several pools. Because this strategy only considers instance price, and does not consider capacity availability, it might lead to high interruption rates.

### Allocation strategy permissions

The allocation strategy option requires several IAM permissions that are automatically included in the default Amazon EMR service role and Amazon EMR managed policy (`EMR_DefaultRole` and `AmazonEMRServicePolicy_v2`). If you use a custom service role or managed policy for your cluster, you must add these permissions before you create the cluster. For more information, see [Required IAM permissions for an allocation strategy](#create-cluster-allocation-policy).

Optional On-Demand Capacity Reservations (ODCRs) are available when you use the On-Demand allocation strategy option. Capacity reservation options let you specify a preference for using reserved capacity first for Amazon EMR clusters. You can use this to ensure that your critical workloads use the capacity you have already reserved using open or targeted ODCRs. For non-critical workloads, the capacity reservation preferences let you specify whether reserved capacity should be consumed.

Capacity reservations can only be used by instances that match their attributes (instance type, platform, and Availability Zone). By default, open capacity reservations are automatically used by Amazon EMR when provisioning On-Demand Instances that match the instance attributes. If you don't have any running instances that match the attributes of the capacity reservations, they remain unused until you launch an instance matching their attributes. If you don't want to use any capacity reservations when launching your cluster, you must set capacity reservation preference to **none** in launch options.

However, you can also target a capacity reservation for specific workflows. This enables you to explicitly control which instances are allowed to run in that reserved capacity. For more information about On-Demand capacity reservations, see [Use capacity reservations with instance fleets in Amazon EMR](on-demand-capacity-reservations.md).
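
With the CLI or the `RunJobFlow` API, these capacity reservation preferences appear in the fleet's On-Demand launch specification. The following is a hedged sketch; the release label, instance types, and capacities are placeholders:

```shell
# Sketch: On-Demand core fleet that consumes matching open capacity
# reservations first. Field names follow the RunJobFlow API shapes;
# all values are placeholders.
cat > fleet-config.json <<'EOF'
[
  {
    "InstanceFleetType": "MASTER",
    "TargetOnDemandCapacity": 1,
    "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}]
  },
  {
    "InstanceFleetType": "CORE",
    "TargetOnDemandCapacity": 2,
    "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
    "LaunchSpecifications": {
      "OnDemandSpecification": {
        "AllocationStrategy": "lowest-price",
        "CapacityReservationOptions": {
          "UsageStrategy": "use-capacity-reservations-first",
          "CapacityReservationPreference": "open"
        }
      }
    }
  }
]
EOF

aws emr create-cluster --release-label emr-6.10.0 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets file://fleet-config.json
```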

### Required IAM permissions for an allocation strategy

Your [Service role for Amazon EMR (EMR role)](emr-iam-role.md) requires additional permissions to create a cluster that uses the allocation strategy option for On-Demand or Spot Instance fleets.

We automatically include these permissions in the default Amazon EMR service role [`EMR_DefaultRole`](emr-iam-role.md) and the Amazon EMR managed policy [`AmazonEMRServicePolicy_v2`](emr-managed-iam-policies.md).

If you use a custom service role or managed policy for your cluster, you must add the following permissions:

------
#### [ JSON ]


```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DeleteLaunchTemplate",
        "ec2:CreateLaunchTemplate",
        "ec2:DescribeLaunchTemplates",
        "ec2:CreateLaunchTemplateVersion",
        "ec2:CreateFleet"
      ],
      "Resource": [
        "*"
      ],
      "Sid": "AllowEC2Deletelaunchtemplate"
    }
  ]
}
```

------

The following service role permissions are required to create a cluster that uses open or targeted capacity reservations. You must include these permissions in addition to the permissions required for using the allocation strategy option.

**Example: Policy document for service role capacity reservations**  
To use open capacity reservations, you must include the following additional permissions.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeCapacityReservations",
        "ec2:DescribeLaunchTemplateVersions",
        "ec2:DeleteLaunchTemplateVersions"
      ],
      "Resource": [
        "*"
      ],
      "Sid": "AllowEC2Describecapacityreservations"
    }
  ]
}
```

**Example: Policy document for targeted capacity reservations**  
To use targeted capacity reservations, you must include the following additional permissions.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeCapacityReservations",
        "ec2:DescribeLaunchTemplateVersions",
        "ec2:DeleteLaunchTemplateVersions",
        "resource-groups:ListGroupResources"
      ],
      "Resource": [
        "*"
      ],
      "Sid": "AllowEC2Describecapacityreservations"
    }
  ]
}
```

## Configure instance fleets for your cluster

------
#### [ Console ]

**To create a cluster with instance fleets with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and choose **Create cluster**.

1. Under **Cluster configuration**, choose **Instance fleets**.

1. For each **Node group**, select **Add instance type** and choose up to five instance types for the primary and core instance fleets, and up to 15 instance types for task instance fleets. Amazon EMR might provision any mix of these instance types when it launches the cluster.

1. Under each node group type, choose the **Actions** dropdown menu next to each instance to change these settings:  
**Add EBS volumes**  
Specify EBS volumes to attach to the instance type after Amazon EMR provisions it.  
**Edit weighted capacity**  
For the core node group, change this value to any number of units that fits your applications. The number of YARN vCores for each fleet instance type is used as the default weighted capacity units. You can't edit weighted capacity for the primary node.  
**Edit maximum Spot price**  
Specify a maximum Spot price for each instance type in a fleet. You can set this price either as a percentage of the On-Demand price, or as a specific dollar amount. If the current Spot price in an Availability Zone is below your maximum Spot price, Amazon EMR provisions Spot Instances. You pay the Spot price, not necessarily the maximum Spot price.

1. Optionally, to add security groups for your nodes, expand **EC2 security groups (firewall)** in the **Networking** section and select your security group for each node type.

1. Optionally, select the check box next to **Apply allocation strategy** if you want to use the allocation strategy option, and select the allocation strategy that you want to specify for the Spot Instances. You shouldn't select this option if your Amazon EMR service role doesn't have the required permissions. For more information, see [Allocation strategy for instance fleets](#emr-instance-fleet-allocation-strategy).

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.

------
#### [ AWS CLI ]

To create and launch a cluster with instance fleets with the AWS CLI, follow these guidelines:
+ To create and launch a cluster with instance fleets, use the `create-cluster` command along with the `--instance-fleets` parameter.
+ To get configuration details about the instance fleets in a cluster, use the `list-instance-fleets` command.
+ To add multiple custom Amazon Linux AMIs to a cluster you’re creating, use the `CustomAmiId` option with each `InstanceType` specification. You can configure instance fleet nodes with multiple instance types and multiple custom AMIs to fit your requirements. See [Examples: Creating a cluster with the instance fleets configuration](#create-cluster-instance-fleet-cli). 
+ To make changes to the target capacity for an instance fleet, use the `modify-instance-fleet` command.
+ To add a task instance fleet to a cluster that doesn't already have one, use the `add-instance-fleet` command.
+ To add multiple custom AMIs to the task instance fleet, use the `CustomAmiId` argument with the `add-instance-fleet` command. See [Examples: Creating a cluster with the instance fleets configuration](#create-cluster-instance-fleet-cli).
+ To use the allocation strategy option when creating an instance fleet, update the service role to include the example policy document in the following section.
+ To use the capacity reservations options when creating an instance fleet with On-Demand allocation strategy, update the service role to include the example policy document in the following section.
+ Permissions for instance fleets are automatically included in the default Amazon EMR service role and managed policy (`EMR_DefaultRole` and `AmazonEMRServicePolicy_v2`). If you use a custom service role or custom managed policy for your cluster, you must add the new permissions for allocation strategy in the following section.

------

## Examples: Creating a cluster with the instance fleets configuration

The following examples demonstrate `create-cluster` commands with a variety of options that you can combine.

**Note**  
If you have not previously created the default Amazon EMR service role and EC2 instance profile, use `aws emr create-default-roles` to create them before using the `create-cluster` command.

**Example: On-Demand primary, On-Demand core with single instance type, default VPC**  

```
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge}']
```

**Example: Spot primary, Spot core with single instance type, default VPC**  

```
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetSpotCapacity=1,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5}'] \
    InstanceFleetType=CORE,TargetSpotCapacity=1,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5}']
```

**Example: On-Demand primary, mixed core with single instance type, single EC2 subnet**  

```
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,SubnetIds=['subnet-ab12345c'] \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=2,TargetSpotCapacity=6,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5,WeightedCapacity=2}']
```

**Example: On-Demand primary, Spot core with multiple weighted instance types, timeout for Spot, range of EC2 subnets**  

```
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,SubnetIds=['subnet-ab12345c','subnet-de67890f'] \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge}'] \
    InstanceFleetType=CORE,TargetSpotCapacity=11,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5,WeightedCapacity=3}',\
'{InstanceType=m4.2xlarge,BidPrice=0.9,WeightedCapacity=5}'],\
LaunchSpecifications={SpotSpecification='{TimeoutDurationMinutes=120,TimeoutAction=SWITCH_TO_ON_DEMAND}'}
```

**Example: On-Demand primary, mixed core and task with multiple weighted instance types, timeout for core Spot Instances, range of EC2 subnets**  

```
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,SubnetIds=['subnet-ab12345c','subnet-de67890f'] \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=8,TargetSpotCapacity=6,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5,WeightedCapacity=3}',\
'{InstanceType=m4.2xlarge,BidPrice=0.9,WeightedCapacity=5}'],\
LaunchSpecifications={SpotSpecification='{TimeoutDurationMinutes=120,TimeoutAction=SWITCH_TO_ON_DEMAND}'} \
    InstanceFleetType=TASK,TargetOnDemandCapacity=3,TargetSpotCapacity=3,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5,WeightedCapacity=3}']
```

**Example: Spot primary, no core or task, Amazon EBS configuration, default VPC**  

```
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetSpotCapacity=1,\
LaunchSpecifications={SpotSpecification='{TimeoutDurationMinutes=60,TimeoutAction=TERMINATE_CLUSTER}'},\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5,\
EbsConfiguration={EbsOptimized=true,EbsBlockDeviceConfigs=[{VolumeSpecification={VolumeType=gp2,\
SizeInGB=100}},{VolumeSpecification={VolumeType=io1,SizeInGB=100,Iops=100},VolumesPerInstance=4}]}}']
```

**Example: Multiple custom AMIs, multiple instance types, On-Demand primary, On-Demand core**  

```
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,CustomAmiId=ami-123456},{InstanceType=m6g.xlarge,CustomAmiId=ami-234567}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=1,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,CustomAmiId=ami-123456},{InstanceType=m6g.xlarge,CustomAmiId=ami-234567}']
```

**Example: Add a task instance fleet to a running cluster with multiple instance types and multiple custom AMIs**  

```
aws emr add-instance-fleet --cluster-id j-123456 \
  --instance-fleet \
    InstanceFleetType=TASK,TargetSpotCapacity=1,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,CustomAmiId=ami-123456}',\
'{InstanceType=m6g.xlarge,CustomAmiId=ami-234567}']
```

**Example: Use a JSON configuration file**  
You can configure instance fleet parameters in a JSON file, and then reference the JSON file as the sole parameter for instance fleets. For example, the following command references a JSON configuration file, `my-fleet-config.json`:  

```
aws emr create-cluster --release-label emr-5.30.0 --service-role EMR_DefaultRole \
--ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
--instance-fleets file://my-fleet-config.json
```
The *my-fleet-config.json* file specifies primary, core, and task instance fleets as shown in the following example. The core instance fleet specifies a maximum Spot price as a percentage of the On-Demand price (`BidPriceAsPercentageOfOnDemandPrice`), while the primary and task instance fleets specify a maximum Spot price (`BidPrice`) as a string in USD.  

```
[
    {
        "Name": "Masterfleet",
        "InstanceFleetType": "MASTER",
        "TargetSpotCapacity": 1,
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": 120,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND"
            }
        },
        "InstanceTypeConfigs": [
            {
                "InstanceType": "m5.xlarge",
                "BidPrice": "0.89"
            }
        ]
    },
    {
        "Name": "Corefleet",
        "InstanceFleetType": "CORE",
        "TargetSpotCapacity": 1,
        "TargetOnDemandCapacity": 1,
        "LaunchSpecifications": {
          "OnDemandSpecification": {
            "AllocationStrategy": "lowest-price",
            "CapacityReservationOptions": 
            {
                "UsageStrategy": "use-capacity-reservations-first",
                "CapacityReservationResourceGroupArn": "String"
            }
        },
            "SpotSpecification": {
                "AllocationStrategy": "capacity-optimized",
                "TimeoutDurationMinutes": 120,
                "TimeoutAction": "TERMINATE_CLUSTER"
            }
        },
        "InstanceTypeConfigs": [
            {
                "InstanceType": "m5.xlarge",
                "BidPriceAsPercentageOfOnDemandPrice": 100
            }
        ]
    },
    {
        "Name": "Taskfleet",
        "InstanceFleetType": "TASK",
        "TargetSpotCapacity": 1,
        "LaunchSpecifications": {
          "OnDemandSpecification": {
            "AllocationStrategy": "lowest-price",
            "CapacityReservationOptions": 
            {
                "CapacityReservationPreference": "none"
            }
        },
            "SpotSpecification": {
                "TimeoutDurationMinutes": 120,
                "TimeoutAction": "TERMINATE_CLUSTER"
            }
        },
        "InstanceTypeConfigs": [
            {
                "InstanceType": "m5.xlarge",
                "BidPrice": "0.89"
            }
        ]
    }
]
```

## Modify target capacities for an instance fleet

Use the `modify-instance-fleet` command to specify new target capacities for an instance fleet. You must specify the cluster ID and the instance fleet ID. Use the `list-instance-fleets` command to retrieve instance fleet IDs.

```
aws emr modify-instance-fleet --cluster-id <cluster-id> \
  --instance-fleet \
    InstanceFleetId='<instance-fleet-id>',TargetOnDemandCapacity=1,TargetSpotCapacity=1
```

## Add a task instance fleet to a cluster

If a cluster has only primary and core instance fleets, you can use the `add-instance-fleet` command to add a task instance fleet. You can use this command only to add task instance fleets.

```
aws emr add-instance-fleet --cluster-id <cluster-id> \
  --instance-fleet \
    InstanceFleetType=TASK,TargetSpotCapacity=1,\
LaunchSpecifications={SpotSpecification='{TimeoutDurationMinutes=20,TimeoutAction=TERMINATE_CLUSTER}'},\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5}']
```

## Get configuration details of instance fleets in a cluster

Use the `list-instance-fleets` command to get configuration details of the instance fleets in a cluster. The command takes a cluster ID as input. The following example demonstrates the command and its output for a cluster that contains a primary instance fleet and a core instance fleet. For full response syntax, see [ListInstanceFleets](https://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_ListInstanceFleets.html) in the *Amazon EMR API Reference*.

```
aws emr list-instance-fleets --cluster-id <cluster-id>
```

```
{
    "InstanceFleets": [
        {
            "Status": {
                "Timeline": {
                    "ReadyDateTime": 1488759094.637,
                    "CreationDateTime": 1488758719.817
                },
                "State": "RUNNING",
                "StateChangeReason": {
                    "Message": ""
                }
            },
            "ProvisionedSpotCapacity": 6,
            "Name": "CORE",
            "InstanceFleetType": "CORE",
            "LaunchSpecifications": {
                "SpotSpecification": {
                    "TimeoutDurationMinutes": 60,
                    "TimeoutAction": "TERMINATE_CLUSTER"
                }
            },
            "ProvisionedOnDemandCapacity": 2,
            "InstanceTypeSpecifications": [
                {
                    "BidPrice": "0.5",
                    "InstanceType": "m5.xlarge",
                    "WeightedCapacity": 2
                }
            ],
            "Id": "if-1ABC2DEFGHIJ3"
        },
        {
            "Status": {
                "Timeline": {
                    "ReadyDateTime": 1488759058.598,
                    "CreationDateTime": 1488758719.811
                },
                "State": "RUNNING",
                "StateChangeReason": {
                    "Message": ""
                }
            },
            "ProvisionedSpotCapacity": 0,
            "Name": "MASTER",
            "InstanceFleetType": "MASTER",
            "ProvisionedOnDemandCapacity": 1,
            "InstanceTypeSpecifications": [
                {
                    "BidPriceAsPercentageOfOnDemandPrice": 100.0,
                    "InstanceType": "m5.xlarge",
                    "WeightedCapacity": 1
                }
            ],
           "Id": "if-2ABC4DEFGHIJ4"
        }
    ]
}
```

# Reconfiguring instance fleets for your Amazon EMR cluster


With Amazon EMR version 5.21.0 and later, you can reconfigure cluster applications and specify additional configuration classifications for each instance fleet in a running cluster. To do so, use the AWS Command Line Interface (AWS CLI) or the AWS SDK.

You can track the state of an instance fleet by viewing CloudWatch events. For more information, see [Instance fleet reconfiguration events](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-cloudwatch-events.html#emr-cloudwatch-instance-fleet-events-reconfig).

**Note**  
You can only override the cluster Configurations object specified during cluster creation. For more information about Configurations objects, see [RunJobFlow request syntax](https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html#API_RunJobFlow_RequestSyntax).

When you submit a reconfiguration request using the Amazon EMR console, the AWS Command Line Interface (AWS CLI), or the AWS SDK, Amazon EMR checks the existing on-cluster configuration file. If there are differences between the existing configuration and the file that you supply, Amazon EMR initiates reconfiguration actions, restarts some applications, and resets any manually modified configurations, such as configurations that you have modified while connected to your cluster using SSH, to the cluster defaults for the specified instance fleet.

## Reconfiguration behaviors


Reconfiguration overwrites on-cluster configuration with the newly submitted configuration set, and can overwrite configuration changes made outside of the reconfiguration API.

Amazon EMR follows a rolling process to reconfigure instances in the task and core instance fleets. Only a percentage of the instances for a single instance type are modified and restarted at a time. If your instance fleet has multiple instance type configurations, the instance types are reconfigured in parallel.

Reconfigurations are declared at the [InstanceTypeConfig](https://docs.aws.amazon.com/emr/latest/APIReference/API_InstanceTypeConfig.html) level. For a visual example, refer to [Reconfigure an instance fleet](#instance-fleet-reconfiguration-cli-sdk). You can submit reconfiguration requests that contain updated configuration settings for one or more instance types within a single request. You must include all instance types that are part of your instance fleet in the modify request; instance types with populated configuration fields undergo reconfiguration, while the other `InstanceTypeConfig` entries in the fleet remain unchanged. A reconfiguration is considered successful only when all instances of the specified instance types complete reconfiguration. If any instance fails to reconfigure, the entire instance fleet automatically reverts to its last known stable configuration.
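
As a sketch of what an `InstanceTypeConfig`-level request can look like (the fleet ID and values here are illustrative, and launch-time `InstanceTypeConfig` fields are elided), the following `ModifyInstanceFleet` payload lists both instance types in a fleet but populates `Configurations` only for `m5.xlarge`, so only instances of that type are reconfigured:

```
{
    "InstanceFleetId": "if-1xxxxxxx9",
    "InstanceTypeConfigs": [
        {
            "InstanceType": "m5.xlarge",
            "Configurations": [
                {
                    "Classification": "yarn-site",
                    "Properties": {
                        "yarn.nodemanager.vmem-check-enabled": "true"
                    }
                }
            ]
        },
        {
            "InstanceType": "r5.xlarge"
        }
    ]
}
```

In a real request, also include any other `InstanceTypeConfig` fields that you specified at launch so that they aren't overwritten.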

## Limitations


When you reconfigure an instance fleet in a running cluster, consider the following limitations:
+ Non-YARN applications can fail during restart or cause cluster issues, especially if they aren't configured properly. Clusters approaching maximum memory and CPU usage may run into issues after the restart process. This is especially true for the primary instance fleet. Consult the [Troubleshoot instance fleet reconfiguration](#instance-fleet-reconfiguration-troubleshooting) section.
+ Resize and reconfiguration operations do not happen in parallel. A reconfiguration request waits for an ongoing resize to complete, and vice versa.
+ After reconfiguring an instance fleet, Amazon EMR restarts the applications to allow the new configurations to take effect. Job failure or other unexpected application behavior might occur if the applications are in use during reconfiguration.
+ If a reconfiguration for any instance type config under an instance fleet fails, Amazon EMR reverts the configuration parameters to the previous working version for the entire instance fleet, emits events, and updates state details. If the reversion process also fails, you must submit a new `ModifyInstanceFleet` request to recover the instance fleet from the `ARRESTED` state. Reversion failures result in [Instance fleet reconfiguration events](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-cloudwatch-events.html#emr-cloudwatch-instance-fleet-events-reconfig) and state changes.
+ Reconfiguration requests for Phoenix configuration classifications are only supported in Amazon EMR version 5.23.0 and later, and are not supported in Amazon EMR version 5.21.0 or 5.22.0.
+ Reconfiguration requests for HBase configuration classifications are only supported in Amazon EMR version 5.30.0 and later, and are not supported in Amazon EMR versions 5.23.0 through 5.29.0.
+ Reconfiguring hdfs-encryption-zones classification or any of the Hadoop KMS configuration classifications is not supported on an Amazon EMR cluster with multiple primary nodes.
+ Amazon EMR currently doesn't support certain reconfiguration requests for the YARN capacity scheduler that require restarting the YARN ResourceManager. For example, you cannot completely remove a queue.
+ When YARN needs to restart, all running YARN jobs are typically terminated and lost. This might cause data processing delays. To run YARN jobs during a YARN restart, you can either create an Amazon EMR cluster with multiple primary nodes or set `yarn.resourcemanager.recovery.enabled` to `true` in your `yarn-site` configuration classification. For more information about using multiple primary nodes, see [High availability YARN ResourceManager](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-ha-applications.html#emr-plan-ha-applications-YARN).
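
For example, a minimal `yarn-site` configuration classification that turns on ResourceManager recovery looks like the following sketch, which you can supply at cluster creation or in a reconfiguration request:

```
[
    {
        "Classification": "yarn-site",
        "Properties": {
            "yarn.resourcemanager.recovery.enabled": "true"
        }
    }
]
```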

## Reconfigure an instance fleet


------
#### [ Using the AWS CLI ]

Use the `modify-instance-fleet` command to specify a new configuration for an instance fleet in a running cluster.

**Note**  
In the following examples, replace **j-2AL4XXXXXX5T9** with your cluster ID, and replace **if-1xxxxxxx9** with your instance fleet ID.

**Example – Replace a configuration for an instance fleet**

**Warning**  
Specify all `InstanceTypeConfig` fields that you used at launch. Not including fields can result in overwriting specifications you declared at launch. Refer to [InstanceTypeConfig](https://docs.aws.amazon.com/emr/latest/APIReference/API_InstanceTypeConfig.html) for a list.

The following example references a configuration JSON file called instanceFleet.json to edit the YARN NodeManager disk health checker properties for an instance fleet.

**Instance Fleet Modification JSON**

1. Prepare your configuration classification, and save it as instanceFleet.json in the same directory where you will run the command.

   ```
   {
       "InstanceFleetId":"if-1xxxxxxx9",
       "InstanceTypeConfigs": [
               {
                   "InstanceType": "m5.xlarge",
                  other InstanceTypeConfig fields
                   "Configurations": [
                       {
                           "Classification": "yarn-site",
                           "Properties": {
                               "yarn.nodemanager.disk-health-checker.enable":"true",
                               "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage":"100.0"
                           }
                       }
                   ]
               },
               {
                   "InstanceType": "r5.xlarge",
                  other InstanceTypeConfig fields
                   "Configurations": [
                       {
                           "Classification": "yarn-site",
                           "Properties": {
                               "yarn.nodemanager.disk-health-checker.enable":"false",
                               "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage":"70.0"
                           }
                       }
                   ]
               }
           ]
   }
   ```

1. Run the following command.

   ```
   aws emr modify-instance-fleet \
   --cluster-id j-2AL4XXXXXX5T9 \
   --region us-west-2 \
   --instance-fleet instanceFleet.json
   ```

**Example – Add a configuration to an instance fleet**

If you want to add a configuration to an instance type, you must include all previously specified configurations for that instance type in your new `ModifyInstanceFleet` request. Otherwise, the previously specified configurations are removed.

The following example adds a property for the YARN NodeManager virtual memory checker. The configuration also includes previously specified values for the YARN NodeManager disk health checker so that the values won't be overwritten.

1. Prepare the following contents in instanceFleet.json and save it in the same directory where you will run the command.

   ```
   {
       "InstanceFleetId":"if-1xxxxxxx9",
       "InstanceTypeConfigs": [
               {
                   "InstanceType": "m5.xlarge",
                   other InstanceTypeConfig fields
                   "Configurations": [
                       {
                           "Classification": "yarn-site",
                           "Properties": {
                               "yarn.nodemanager.disk-health-checker.enable":"true",
                               "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage":"100.0",
                               "yarn.nodemanager.vmem-check-enabled":"true",
                               "yarn.nodemanager.vmem-pmem-ratio":"3.0"
                           }
                       }
                   ]
               },
               {
                   "InstanceType": "r5.xlarge",
                   other InstanceTypeConfig fields
                   "Configurations": [
                       {
                           "Classification": "yarn-site",
                           "Properties": {
                               "yarn.nodemanager.disk-health-checker.enable":"false",
                               "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage":"70.0"
                           }
                       }
                   ]
               }
           ]      
   }
   ```

1. Run the following command.

   ```
   aws emr modify-instance-fleet \
   --cluster-id j-2AL4XXXXXX5T9 \
   --region us-west-2 \
   --instance-fleet instanceFleet.json
   ```

------
#### [ Using the Java SDK ]

**Note**  
In the following examples, replace **j-2AL4XXXXXX5T9** with your cluster ID, and replace **if-1xxxxxxx9** with your instance fleet ID.

The following code snippet provides a new configuration for an instance fleet using the AWS SDK for Java.

```
AWSCredentials credentials = new BasicAWSCredentials("access-key", "secret-key");
AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(credentials);

Map<String,String> hiveProperties = new HashMap<String,String>();
hiveProperties.put("hive.join.emit.interval","1000");
hiveProperties.put("hive.merge.mapfiles","true");
        
Configuration newConfiguration = new Configuration()
    .withClassification("hive-site")
    .withProperties(hiveProperties);
    
List<InstanceTypeConfig> instanceTypeConfigList = new ArrayList<>();

// currentInstanceTypeConfigList holds the fleet's existing instance type
// configurations, for example retrieved with the ListInstanceFleets API.
for (InstanceTypeConfig instanceTypeConfig : currentInstanceTypeConfigList) {
    instanceTypeConfigList.add(new InstanceTypeConfig()
        .withInstanceType(instanceTypeConfig.getInstanceType())
        .withBidPrice(instanceTypeConfig.getBidPrice())
        .withWeightedCapacity(instanceTypeConfig.getWeightedCapacity())
        .withConfigurations(newConfiguration)
    );
}

InstanceFleetModifyConfig instanceFleetModifyConfig = new InstanceFleetModifyConfig()
    .withInstanceFleetId("if-1xxxxxxx9")
    .withInstanceTypeConfigs(instanceTypeConfigList);
    
ModifyInstanceFleetRequest modifyInstanceFleetRequest = new ModifyInstanceFleetRequest()
    .withInstanceFleet(instanceFleetModifyConfig)
    .withClusterId("j-2AL4XXXXXX5T9");

emr.modifyInstanceFleet(modifyInstanceFleetRequest);
```

------

## Troubleshoot instance fleet reconfiguration


If the reconfiguration process for any instance type within an instance fleet fails, Amazon EMR reverts the in-progress reconfiguration and logs a failure message using an Amazon CloudWatch Events event. The event provides a brief summary of the reconfiguration failure. It lists the instances for which reconfiguration has failed and the corresponding failure messages. The following is an example failure message.

`Amazon EMR couldn't revert the instance fleet if-1xxxxxxx9 in the Amazon EMR cluster j-2AL4XXXXXX5T9 (ExampleClusterName) to the previously successful configuration at 2021-01-01 00:00 UTC. The reconfiguration reversion failed because of Instance i-xxxxxxx1, i-xxxxxxx2, i-xxxxxxx3 failed with message "This is an example failure message"...`

### To access node provisioning logs


Use SSH to connect to the node on which reconfiguration has failed. For instructions, see [Connect to your Linux instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/connect-to-linux-instance.html) in the *Amazon EC2 User Guide*. 

------
#### [ Accessing logs by connecting to a node ]

1. Navigate to the following directory, which contains the node provisioning log files.

   ```
   /mnt/var/log/provision-node/
   ```

1. Open the reports subdirectory and search for the node provisioning report for your reconfiguration. The reports directory organizes logs by reconfiguration version number, universally unique identifier (UUID), Amazon EC2 instance IP address, and timestamp. Each report is a compressed YAML file that contains detailed information about the reconfiguration process. The following is an example report file name and path.

   ```
   /reports/2/ca598xxx-cxxx-4xxx-bxxx-6dbxxxxxxxxx/ip-10-73-xxx-xxx.ec2.internal/202104061715.yaml.gz
   ```

1. You can examine a report using a file viewer like zless, as in the following example.

   ```
   zless 202104061715.yaml.gz
   ```

------
#### [ Accessing logs using Amazon S3 ]

Sign in to the AWS Management Console and open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/). Open the Amazon S3 bucket that you specified when you configured the cluster to archive log files.

1. Navigate to the following folder, which contains the node provisioning log files:

   ```
   amzn-s3-demo-bucket/elasticmapreduce/cluster id/node/instance id/provision-node/
   ```

1. Open the reports folder and search for the node provisioning report for your reconfiguration. The reports folder organizes logs by reconfiguration version number, universally unique identifier (UUID), Amazon EC2 instance IP address, and timestamp. Each report is a compressed YAML file that contains detailed information about the reconfiguration process. The following is an example report file name and path.

   ```
   /reports/2/ca598xxx-cxxx-4xxx-bxxx-6dbxxxxxxxxx/ip-10-73-xxx-xxx.ec2.internal/202104061715.yaml.gz
   ```

To view a log file, you can download it from Amazon S3 to your local machine as a text file. For instructions, see [Downloading an object](https://docs.aws.amazon.com/AmazonS3/latest/userguide/download-objects.html).

------

Each log file contains a detailed provisioning report for the associated reconfiguration. To find error message information, search for the `err` log level in a report. Report format depends on the version of Amazon EMR on your cluster. Amazon EMR release versions 5.32.0 and 6.2.0 and later use the following format: 

```
- level: err
  message: 'Example detailed error message.'
  source: Puppet
  tags:
  - err
  time: '2021-01-01 00:00:00.000000 +00:00'
  file: 
  line:
```
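
To scan a report for error entries without opening it in a pager, you can filter the compressed file with `zcat` and `grep`. The following is a minimal sketch that builds a sample report inline so the commands are self-contained; the file name `sample-report.yaml.gz` and its contents are illustrative, so substitute your own report.

```
# Create a tiny sample report for illustration. In practice, use a report
# from /mnt/var/log/provision-node/reports/ or your S3 log bucket.
printf -- "- level: err\n  message: 'Example detailed error message.'\n  source: Puppet\n" | gzip -c > sample-report.yaml.gz

# Show error-level entries with surrounding context lines.
zcat sample-report.yaml.gz | grep -B 1 -A 5 "level: err"
```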

# Use capacity reservations with instance fleets in Amazon EMR


To launch On-Demand Instance fleets with capacity reservation options, attach the additional service role permissions that are required to use capacity reservation options. Because capacity reservation options must be used together with an On-Demand allocation strategy, you must also include the permissions required for allocation strategy in your service role and managed policy. For more information, see [Required IAM permissions for an allocation strategy](emr-instance-fleet.md#create-cluster-allocation-policy).

Amazon EMR supports both open and targeted capacity reservations. The following topics show instance fleet configurations that you can use with the `RunJobFlow` action or the `create-cluster` command to launch instance fleets using On-Demand Capacity Reservations. 

## Use open capacity reservations on a best-effort basis


If the cluster's On-Demand Instances match the attributes of open capacity reservations (instance type, platform, tenancy, and Availability Zone) available in your account, the capacity reservations are applied automatically. However, use of your capacity reservations is not guaranteed. To provision the cluster, Amazon EMR evaluates all the instance pools specified in the launch request and uses the one with the lowest price that has sufficient capacity to launch all the requested core nodes. Available open capacity reservations that match the instance pool are applied automatically. If available open capacity reservations do not match the instance pool, they remain unused.

Once the core nodes are provisioned, the Availability Zone is selected and fixed. Amazon EMR provisions task nodes into instance pools, starting with the lowest-priced ones first, in the selected Availability Zone until all the task nodes are provisioned. Available open capacity reservations that match the instance pools are applied automatically.

The following are use cases of Amazon EMR capacity allocation logic for using open capacity reservations on a best-effort basis.

**Example 1: Lowest-price instance pool in launch request has available open capacity reservations**

In this case, Amazon EMR launches capacity in the lowest-price instance pool with On-Demand Instances. Your available open capacity reservations in that instance pool are used automatically.


| On-Demand allocation strategy | lowest-price | 
| --- |--- |
| Requested capacity | 100 | 

| Instance type | c5.xlarge | m5.xlarge | r5.xlarge | 
| --- |--- |--- |--- |
| Available open capacity reservations | 150 | 100 | 100 | 
| On-Demand price | \$1 | \$2 | \$3 | 
| Instances provisioned | 100 | - | - | 
| Open capacity reservations used | 100 | - | - | 
| Remaining open capacity reservations | 50 | 100 | 100 | 

After the instance fleet is launched, you can run [describe-capacity-reservations](https://docs.aws.amazon.com/cli/latest/reference/ec2/describe-capacity-reservations.html) to see how many unused capacity reservations remain.

**Example 2: Lowest-price instance pool in launch request does not have available open capacity reservations**

In this case, Amazon EMR launches capacity in the lowest-price instance pool with On-Demand Instances. However, your open capacity reservations remain unused.


|  |  | 
| --- | --- |
| On-Demand Strategy | lowest-price | 
| Requested Capacity | 100 | 

| Instance Type | c5.xlarge | m5.xlarge | r5.xlarge | 
| --- | --- | --- | --- |
| Available open capacity reservations | - | - | 100 | 
| On-Demand price | $1 | $2 | $3 | 
| Instances provisioned | 100 | - | - | 
| Open capacity reservations used | - | - | - | 
| Remaining available open capacity reservations | - | - | 100 | 

**Configure Instance Fleets to use open capacity reservations on a best-effort basis**

When you use the `RunJobFlow` action to create an instance fleet-based cluster, set the On-Demand allocation strategy to `lowest-price` and set `CapacityReservationPreference` in `CapacityReservationOptions` to `open`. If you leave this field blank, Amazon EMR defaults the On-Demand Instances' capacity reservation preference to `open`.

```
"LaunchSpecifications": {
    "OnDemandSpecification": {
        "AllocationStrategy": "lowest-price",
        "CapacityReservationOptions": {
            "CapacityReservationPreference": "open"
        }
    }
}
```

You can also use the Amazon EMR CLI to create an instance fleet-based cluster using open capacity reservations.

```
aws emr create-cluster \
	--name 'open-ODCR-cluster' \
	--release-label emr-5.30.0 \
	--service-role EMR_DefaultRole \
	--ec2-attributes SubnetId=subnet-22XXXX01,InstanceProfile=EMR_EC2_DefaultRole \
	--instance-fleets InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=c4.xlarge}'] \
	  InstanceFleetType=CORE,TargetOnDemandCapacity=100,InstanceTypeConfigs=['{InstanceType=c5.xlarge},{InstanceType=m5.xlarge},{InstanceType=r5.xlarge}'],\
	  LaunchSpecifications={OnDemandSpecification='{AllocationStrategy=lowest-price,CapacityReservationOptions={CapacityReservationPreference=open}}'}
```

In this example:
+ Replace `open-ODCR-cluster` with the name of your cluster that uses open capacity reservations.
+ Replace `subnet-22XXXX01` with your subnet ID.

## Use open capacity reservations first


You can choose to override the lowest-price allocation strategy and prioritize using available open capacity reservations first while provisioning an Amazon EMR cluster. In this case, Amazon EMR evaluates all the instance pools with capacity reservations specified in the launch request and uses the one with the lowest price that has sufficient capacity to launch all the requested core nodes. If none of the instance pools with capacity reservations have sufficient capacity for the requested core nodes, Amazon EMR falls back to the best-effort case described in the previous topic. That is, Amazon EMR re-evaluates all the instance pools specified in the launch request and uses the one with the lowest price that has sufficient capacity to launch all the requested core nodes. Available open capacity reservations that match the instance pool are applied automatically. If available open capacity reservations do not match the instance pool, they remain unused. 

Once the core nodes are provisioned, the Availability Zone is selected and fixed. Amazon EMR provisions task nodes into instance pools with capacity reservations, starting with the lowest-priced ones first, in the selected Availability Zone until all the task nodes are provisioned. Amazon EMR uses the open capacity reservations available across each instance pool in the selected Availability Zone first and, only if required, uses the lowest-price strategy to provision any remaining task nodes.
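The task-node pass described above can be sketched similarly. This is a hypothetical simplification (the pool data and the `allocate_task_nodes` helper are illustrative, not Amazon EMR's real implementation):

```python
# Simplified model of task-node allocation with use-capacity-reservations-first:
# fill from pools holding open reservations (cheapest first), then fall back
# to lowest-price On-Demand capacity for the remainder.
def allocate_task_nodes(pools, requested):
    plan, remaining = [], requested
    # Pass 1: consume open capacity reservations, lowest price first.
    for pool in sorted(pools, key=lambda p: p["price"]):
        take = min(pool.get("open_reservations", 0), remaining)
        if take:
            plan.append((pool["instance_type"], take, "reservation"))
            remaining -= take
    # Pass 2: lowest-price On-Demand capacity for whatever is left.
    for pool in sorted(pools, key=lambda p: p["price"]):
        if remaining == 0:
            break
        plan.append((pool["instance_type"], remaining, "on-demand"))
        remaining = 0
    return plan

# Hypothetical pools: 60 reserved r5.xlarge instances are used first, and the
# remaining 40 task nodes land on the cheaper c5.xlarge On-Demand pool.
pools = [
    {"instance_type": "c5.xlarge", "price": 1.0, "open_reservations": 0},
    {"instance_type": "r5.xlarge", "price": 3.0, "open_reservations": 60},
]
plan = allocate_task_nodes(pools, requested=100)
```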

The following examples show how the Amazon EMR capacity allocation logic uses open capacity reservations first.

**Example 1: Instance pool with available open capacity reservations in launch request has sufficient capacity for core nodes**

In this case, Amazon EMR launches capacity in the instance pool with available open capacity reservations regardless of instance pool price. As a result, your open capacity reservations are used whenever possible, until all core nodes are provisioned.


|  |  | 
| --- | --- |
| On-Demand Strategy | lowest-price | 
| Requested Capacity | 100 | 
| Usage Strategy | use-capacity-reservations-first | 

| Instance Type | c5.xlarge | m5.xlarge | r5.xlarge | 
| --- | --- | --- | --- |
| Available open capacity reservations | - | - | 150 | 
| On-Demand price | $1 | $2 | $3 | 
| Instances provisioned | - | - | 100 | 
| Open capacity reservations used | - | - | 100 | 
| Remaining available open capacity reservations | - | - | 50 | 

**Example 2: Instance pool with available open capacity reservations in launch request does not have sufficient capacity for core nodes**

In this case, Amazon EMR falls back to launching core nodes using the lowest-price strategy, with a best-effort attempt to use capacity reservations.


|  |  | 
| --- | --- |
| On-Demand Strategy | lowest-price | 
| Requested Capacity | 100 | 
| Usage Strategy | use-capacity-reservations-first | 

| Instance Type | c5.xlarge | m5.xlarge | r5.xlarge | 
| --- | --- | --- | --- |
| Available open capacity reservations | 10 | 50 | 50 | 
| On-Demand price | $1 | $2 | $3 | 
| Instances provisioned | 100 | - | - | 
| Open capacity reservations used | 10 | - | - | 
| Remaining available open capacity reservations | - | 50 | 50 | 

After the instance fleet is launched, you can run the [describe-capacity-reservations](https://docs.aws.amazon.com/cli/latest/reference/ec2/describe-capacity-reservations.html) command to see how many unused capacity reservations remain.

**Configure Instance Fleets to use open capacity reservations first**

When you use the `RunJobFlow` action to create an instance fleet-based cluster, set the On-Demand allocation strategy to `lowest-price` and `UsageStrategy` for `CapacityReservationOptions` to `use-capacity-reservations-first`.

```
"LaunchSpecifications": {
    "OnDemandSpecification": {
        "AllocationStrategy": "lowest-price",
        "CapacityReservationOptions": {
            "UsageStrategy": "use-capacity-reservations-first"
        }
    }
}
```

You can also use the Amazon EMR CLI to create an instance fleet-based cluster that uses capacity reservations first.

```
aws emr create-cluster \
  --name 'use-CR-first-cluster' \
  --release-label emr-5.30.0 \
  --service-role EMR_DefaultRole \
  --ec2-attributes SubnetId=subnet-22XXXX01,InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=c4.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=100,InstanceTypeConfigs=['{InstanceType=c5.xlarge},{InstanceType=m5.xlarge},{InstanceType=r5.xlarge}'],\
LaunchSpecifications={OnDemandSpecification='{AllocationStrategy=lowest-price,CapacityReservationOptions={UsageStrategy=use-capacity-reservations-first}}'}
```

In this example:
+ Replace `use-CR-first-cluster` with the name of your cluster that uses open capacity reservations.
+ Replace `subnet-22XXXX01` with your subnet ID.

## Use targeted capacity reservations first


When you provision an Amazon EMR cluster, you can choose to override the lowest-price allocation strategy and prioritize using available targeted capacity reservations first. In this case, Amazon EMR evaluates all the instance pools with targeted capacity reservations specified in the launch request and picks the one with the lowest price that has sufficient capacity to launch all the requested core nodes. If none of the instance pools with targeted capacity reservations have sufficient capacity for core nodes, Amazon EMR falls back to the best-effort case described earlier. That is, Amazon EMR re-evaluates all the instance pools specified in the launch request and selects the one with the lowest price that has sufficient capacity to launch all the requested core nodes. Available open capacity reservations that match the instance pool are applied automatically. However, targeted capacity reservations remain unused.

Once the core nodes are provisioned, the Availability Zone is selected and fixed. Amazon EMR provisions task nodes into instance pools with targeted capacity reservations, starting with the lowest-priced ones first, in the selected Availability Zone until all the task nodes are provisioned. Amazon EMR tries to use the targeted capacity reservations available across each instance pool in the selected Availability Zone first. Then, only if required, Amazon EMR uses the lowest-price strategy to provision any remaining task nodes.

The following examples show how the Amazon EMR capacity allocation logic uses targeted capacity reservations first.

**Example 1: Instance pool with available targeted capacity reservations in launch request has sufficient capacity for core nodes**

In this case, Amazon EMR launches capacity in the instance pool with available targeted capacity reservations regardless of instance pool price. As a result, your targeted capacity reservations are used whenever possible until all core nodes are provisioned.


|  |  | 
| --- | --- |
| On-Demand Strategy | lowest-price | 
| Usage Strategy | use-capacity-reservations-first | 
| Requested Capacity | 100 | 

| Instance Type | c5.xlarge | m5.xlarge | r5.xlarge | 
| --- | --- | --- | --- |
| Available targeted capacity reservations | - | - | 150 | 
| On-Demand price | $1 | $2 | $3 | 
| Instances provisioned | - | - | 100 | 
| Targeted capacity reservations used | - | - | 100 | 
| Remaining available targeted capacity reservations | - | - | 50 | 

**Example 2: Instance pool with available targeted capacity reservations in launch request does not have sufficient capacity for core nodes**

In this case, Amazon EMR falls back to launching core nodes using the lowest-price strategy, with a best-effort attempt to use capacity reservations.

|  |  | 
| --- | --- |
| On-Demand Strategy | lowest-price | 
| Requested Capacity | 100 | 
| Usage Strategy | use-capacity-reservations-first | 

| Instance Type | c5.xlarge | m5.xlarge | r5.xlarge | 
| --- | --- | --- | --- |
| Available targeted capacity reservations | 10 | 50 | 50 | 
| On-Demand price | $1 | $2 | $3 | 
| Instances provisioned | 100 | - | - | 
| Targeted capacity reservations used | 10 | - | - | 
| Remaining available targeted capacity reservations | - | 50 | 50 | 

After the instance fleet is launched, you can run the [describe-capacity-reservations](https://docs.aws.amazon.com/cli/latest/reference/ec2/describe-capacity-reservations.html) command to see how many unused capacity reservations remain.

**Configure Instance Fleets to use targeted capacity reservations first**

When you use the `RunJobFlow` action to create an instance fleet-based cluster, set the On-Demand allocation strategy to `lowest-price`, set `UsageStrategy` in `CapacityReservationOptions` to `use-capacity-reservations-first`, and set `CapacityReservationResourceGroupArn` in `CapacityReservationOptions` to your resource group ARN. For more information, see [Work with capacity reservations](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/capacity-reservations-using.html) in the *Amazon EC2 User Guide*.

```
"LaunchSpecifications": {
    "OnDemandSpecification": {
        "AllocationStrategy": "lowest-price",
        "CapacityReservationOptions": {
            "UsageStrategy": "use-capacity-reservations-first",
            "CapacityReservationResourceGroupArn": "arn:aws:resource-groups:sa-east-1:123456789012:group/MyCRGroup"
        }
    }
}
```

Replace `arn:aws:resource-groups:sa-east-1:123456789012:group/MyCRGroup` with your resource group ARN.

You can also use the Amazon EMR CLI to create an instance fleet-based cluster using targeted capacity reservations.

```
aws emr create-cluster \
  --name 'targeted-CR-cluster' \
  --release-label emr-5.30.0 \
  --service-role EMR_DefaultRole \
  --ec2-attributes SubnetId=subnet-22XXXX01,InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=c4.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=100,\
InstanceTypeConfigs=['{InstanceType=c5.xlarge},{InstanceType=m5.xlarge},{InstanceType=r5.xlarge}'],\
LaunchSpecifications={OnDemandSpecification='{AllocationStrategy=lowest-price,CapacityReservationOptions={UsageStrategy=use-capacity-reservations-first,CapacityReservationResourceGroupArn=arn:aws:resource-groups:sa-east-1:123456789012:group/MyCRGroup}}'}
```

In this example:
+ Replace `targeted-CR-cluster` with the name of your cluster that uses targeted capacity reservations.
+ Replace `subnet-22XXXX01` with your subnet ID.
+ Replace `arn:aws:resource-groups:sa-east-1:123456789012:group/MyCRGroup` with your resource group ARN.

## Avoid using available open capacity reservations


**Example**  
If you want to avoid unexpectedly using any of your open capacity reservations when launching an Amazon EMR cluster, set the On-Demand allocation strategy to `lowest-price` and set `CapacityReservationPreference` in `CapacityReservationOptions` to `none`. Otherwise, Amazon EMR defaults the On-Demand Instances' capacity reservation preference to `open` and tries to use available open capacity reservations on a best-effort basis.  

```
"LaunchSpecifications": {
    "OnDemandSpecification": {
        "AllocationStrategy": "lowest-price",
        "CapacityReservationOptions": {
            "CapacityReservationPreference": "none"
        }
    }
}
```
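For reference, the same launch specification can be built as a plain Python dictionary, for example to pass to an SDK call such as boto3's `run_job_flow`. This is a sketch that reuses the field names from the JSON fragment above; verify them against the EMR API before use.

```python
# Launch specification that opts out of open capacity reservations.
# Field names follow the JSON fragment above; treat this as a sketch.
launch_specifications = {
    "OnDemandSpecification": {
        "AllocationStrategy": "lowest-price",
        "CapacityReservationOptions": {
            "CapacityReservationPreference": "none"
        },
    }
}
```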
You can also use the Amazon EMR CLI to create an instance fleet-based cluster without using any open capacity reservations.  

```
aws emr create-cluster \
  --name 'none-CR-cluster' \
  --release-label emr-5.30.0 \
  --service-role EMR_DefaultRole \
  --ec2-attributes SubnetId=subnet-22XXXX01,InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=c4.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=100,InstanceTypeConfigs=['{InstanceType=c5.xlarge},{InstanceType=m5.xlarge},{InstanceType=r5.xlarge}'],\
LaunchSpecifications={OnDemandSpecification='{AllocationStrategy=lowest-price,CapacityReservationOptions={CapacityReservationPreference=none}}'}
```
In this example:
+ Replace `none-CR-cluster` with the name of your cluster that does not use any open capacity reservations.
+ Replace `subnet-22XXXX01` with your subnet ID.

## Scenarios for using capacity reservations


You can benefit from using capacity reservations in the following scenarios.

**Scenario 1: Rotate a long-running cluster using capacity reservations**  
When rotating a long running cluster, you might have strict requirements on the instance types and Availability Zones for the new instances you provision. With capacity reservations, you can use capacity assurance to complete the cluster rotation without interruptions.

![\[Cluster rotation using available capacity reservations\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/odcr-longrunning-cluster-diagram.png)


**Scenario 2: Provision successive short-lived clusters using capacity reservations**  
You can also use capacity reservations to provision a group of successive, short-lived clusters for individual workloads so that when you terminate a cluster, the next cluster can use the capacity reservations. You can use targeted capacity reservations to ensure that only the intended clusters use the capacity reservations.

![\[Short-lived cluster provisioning that uses available capacity reservations\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/odcr-short-cluster-diagram.png)


# Configure uniform instance groups for your Amazon EMR cluster


With the instance groups configuration, each node type (primary, core, or task) consists of the same instance type and the same purchasing option for instances: On-Demand or Spot. You specify these settings when you create an instance group. They can't be changed later. You can, however, add instances of the same type and purchasing option to core and task instance groups. You can also remove instances.

If the cluster's On-Demand Instances match the attributes of open capacity reservations (instance type, platform, tenancy, and Availability Zone) available in your account, the capacity reservations are applied automatically. You can use open capacity reservations for primary, core, and task nodes. However, you cannot use targeted capacity reservations or prevent instances from launching into open capacity reservations with matching attributes when you provision clusters using instance groups. If you want to use targeted capacity reservations or prevent instances from launching into open capacity reservations, use Instance Fleets instead. For more information, see [Use capacity reservations with instance fleets in Amazon EMR](on-demand-capacity-reservations.md).
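The attribute matching that decides whether an open reservation applies can be pictured as a simple comparison across the four attributes listed above. A minimal sketch, with hypothetical data and a hypothetical `reservation_matches` helper:

```python
# An open capacity reservation applies to an instance only when all four
# attributes match. Data values below are illustrative examples.
def reservation_matches(instance, reservation):
    keys = ("instance_type", "platform", "tenancy", "availability_zone")
    return all(instance[k] == reservation[k] for k in keys)

instance = {"instance_type": "m5.xlarge", "platform": "Linux/UNIX",
            "tenancy": "default", "availability_zone": "us-east-1a"}
reservation = {"instance_type": "m5.xlarge", "platform": "Linux/UNIX",
               "tenancy": "default", "availability_zone": "us-east-1b"}
applies = reservation_matches(instance, reservation)  # False: different AZ
```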

To add different instance types after a cluster is created, you can add additional task instance groups. You can choose different instance types and purchasing options for each instance group. For more information, see [Use Amazon EMR cluster scaling to adjust for changing workloads](emr-scale-on-demand.md).

When launching instances, the On-Demand Instance's capacity reservation preference defaults to `open`, which enables it to run in any open capacity reservation that has matching attributes (instance type, platform, Availability Zone). For more information about On-Demand Capacity Reservations, see [Use capacity reservations with instance fleets in Amazon EMR](on-demand-capacity-reservations.md).

This section covers creating a cluster with uniform instance groups. For more information about modifying an existing instance group by adding or removing instances manually or with automatic scaling, see [Manage Amazon EMR clusters](emr-manage.md).

## Use the console to configure uniform instance groups


------
#### [ Console ]

**To create a cluster with instance groups with the new console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and choose **Create cluster**.

1. Under **Cluster configuration**, choose **Instance groups**.

1. Under **Node groups**, there is a section for each type of node group. For the primary node group, select the **Use multiple primary nodes** check box if you want to have three primary nodes. Select the **Use Spot purchasing option** check box if you want to use Spot purchasing.

1. For the primary and core node groups, select **Add instance type** and choose up to five instance types. For the task group, select **Add instance type** and choose up to fifteen instance types. Amazon EMR might provision any mix of these instance types when it launches the cluster.

1. Under each node group type, choose the **Actions** dropdown menu next to each instance to change these settings:  
**Add EBS volumes**  
Specify EBS volumes to attach to the instance type after Amazon EMR provisions it.  
**Edit maximum Spot price**  
Specify a maximum Spot price for each instance type in a fleet. You can set this price either as a percentage of the On-Demand price, or as a specific dollar amount. If the current Spot price in an Availability Zone is below your maximum Spot price, Amazon EMR provisions Spot Instances. You pay the Spot price, not necessarily the maximum Spot price.

1. Optionally, expand **Node configuration** to enter a JSON configuration or to load JSON from Amazon S3.

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.

------

## Use the AWS CLI to create a cluster with uniform instance groups


To specify the instance groups configuration for a cluster using the AWS CLI, use the `create-cluster` command along with the `--instance-groups` parameter. Amazon EMR assumes the On-Demand Instance option unless you specify the `BidPrice` argument for an instance group. For examples of `create-cluster` commands that launch uniform instance groups with On-Demand Instances and a variety of cluster options, type `aws emr create-cluster help` at the command line, or see [create-cluster](https://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html) in the *AWS CLI Command Reference*.

You can use the AWS CLI to create uniform instance groups in a cluster that use Spot Instances. The offered Spot price depends on the Availability Zone. When you use the CLI or API, you can specify the Availability Zone either with the `AvailabilityZone` argument (if you're using an EC2-Classic network) or the `SubnetId` argument of the `--ec2-attributes` parameter. The Availability Zone or subnet that you select applies to the cluster, so it's used for all instance groups. If you don't specify an Availability Zone or subnet explicitly, Amazon EMR selects the Availability Zone with the lowest Spot price when it launches the cluster.
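As a minimal sketch of that default behavior, the zone with the lowest current Spot price wins (the prices below are hypothetical):

```python
# Hypothetical current Spot prices per Availability Zone; when no AZ or
# subnet is specified, the cheapest zone is selected at launch.
spot_prices = {"us-east-1a": 0.035, "us-east-1b": 0.031, "us-east-1c": 0.042}
chosen_az = min(spot_prices, key=spot_prices.get)
```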

The following example demonstrates a `create-cluster` command that creates primary, core, and two task instance groups that all use Spot Instances. Replace *myKey* with the name of your Amazon EC2 key pair. 

**Note**  
Linux line continuation characters (\) are included for readability. You can remove them or keep them in Linux commands. For Windows, remove them or replace them with a caret (^).

```
aws emr create-cluster --name "MySpotCluster" \
  --release-label emr-7.12.0 \
  --use-default-roles \
  --ec2-attributes KeyName=myKey \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m5.xlarge,InstanceCount=1,BidPrice=0.25 \
    InstanceGroupType=CORE,InstanceType=m5.xlarge,InstanceCount=2,BidPrice=0.03 \
    InstanceGroupType=TASK,InstanceType=m5.xlarge,InstanceCount=4,BidPrice=0.03 \
    InstanceGroupType=TASK,InstanceType=m5.xlarge,InstanceCount=2,BidPrice=0.04
```

Using the CLI, you can create uniform instance group clusters that specify a unique custom AMI for each instance type in the instance group. This allows you to use different instance architectures in the same instance group. Each instance type must use a custom AMI with a matching architecture. For example, you would configure an m5.xlarge instance type with an x86_64 architecture custom AMI, and an m6g.xlarge instance type with a corresponding aarch64 (ARM) architecture custom AMI. 

The following example shows a uniform instance group cluster created with two instance types, each with its own custom AMI. Notice that the custom AMIs are specified only at the instance type level, not at the cluster level. This is to avoid conflicts between the instance type AMIs and an AMI at the cluster level, which would cause the cluster launch to fail. 

```
aws emr create-cluster \
  --release-label emr-5.30.0 \
  --service-role EMR_DefaultRole \
  --ec2-attributes SubnetId=subnet-22XXXX01,InstanceProfile=EMR_EC2_DefaultRole \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m5.xlarge,InstanceCount=1,CustomAmiId=ami-123456 \
    InstanceGroupType=CORE,InstanceType=m6g.xlarge,InstanceCount=1,CustomAmiId=ami-234567
```

You can add multiple custom AMIs to an instance group that you add to a running cluster. The `CustomAmiId` argument can be used with the `add-instance-groups` command as shown in the following example.

```
aws emr add-instance-groups --cluster-id j-123456 \
  --instance-groups \
    InstanceGroupType=Task,InstanceType=m5.xlarge,InstanceCount=1,CustomAmiId=ami-123456
```

## Use the Java SDK to create an instance group


You instantiate an `InstanceGroupConfig` object that specifies the configuration of an instance group for a cluster. To use Spot Instances, you set the `withBidPrice` and `withMarket` properties on the `InstanceGroupConfig` object. The following code shows how to define primary, core, and task instance groups that run Spot Instances.

```
InstanceGroupConfig instanceGroupConfigMaster = new InstanceGroupConfig()
	.withInstanceCount(1)
	.withInstanceRole("MASTER")
	.withInstanceType("m4.large")
	.withMarket("SPOT")
	.withBidPrice("0.25"); 
	
InstanceGroupConfig instanceGroupConfigCore = new InstanceGroupConfig()
	.withInstanceCount(4)
	.withInstanceRole("CORE")
	.withInstanceType("m4.large")
	.withMarket("SPOT")
	.withBidPrice("0.03");
	
InstanceGroupConfig instanceGroupConfigTask = new InstanceGroupConfig()
	.withInstanceCount(2)
	.withInstanceRole("TASK")
	.withInstanceType("m4.large")
	.withMarket("SPOT")
	.withBidPrice("0.10");
```

# Availability Zone flexibility for an Amazon EMR cluster

Each AWS Region has multiple, isolated locations known as Availability Zones. When you launch an instance, you can optionally specify an Availability Zone (AZ) in the AWS Region that you use. [Availability Zone flexibility](#emr-flexibility-az) is the distribution of instances across multiple AZs. If one instance fails, you can design your application so that an instance in another AZ can handle requests. For more information on Availability Zones, see [Regions and Zones](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-availability-zones) in the *Amazon EC2 User Guide*.

[Instance flexibility](#emr-flexibility-types) is the use of multiple instance types to satisfy capacity requirements. When you express flexibility with instances, you can use aggregate capacity across instance sizes, families, and generations. Greater flexibility improves the chance to find and allocate your required amount of compute capacity when compared with a cluster that uses a single instance type.

Instance and Availability Zone flexibility reduces [insufficient capacity errors (ICE)](emr-events-response-insuff-capacity.md) and Spot interruptions when compared to a cluster with a single instance type or AZ. Use the best practices covered here to determine which instances to diversify after you know the initial instance family and size. This approach maximizes availability to Amazon EC2 capacity pools with minimal performance and cost variance.

## Being flexible about Availability Zones

We recommend that you configure all Availability Zones for use in your virtual private cloud (VPC) and that you select them for your EMR cluster. Clusters must exist in only one Availability Zone, but with Amazon EMR instance fleets, you can select multiple subnets for different Availability Zones. When Amazon EMR launches the cluster, it looks across those subnets to find the instances and purchasing options that you specify. When you provision an EMR cluster for multiple subnets, your cluster can access a deeper Amazon EC2 capacity pool when compared to clusters in a single subnet. 

If you must prioritize a certain number of Availability Zones for use in your virtual private cloud (VPC) for your EMR cluster, you can leverage the Spot placement score capability with Amazon EC2. With Spot placement scoring, you specify the compute requirements for your Spot Instances, then EC2 returns the top ten AWS Regions or Availability Zones scored on a scale from 1 to 10. A score of 10 indicates that your Spot request is highly likely to succeed; a score of 1 indicates that your Spot request is not likely to succeed. For more information on how to use Spot placement scoring, see [Spot placement score](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-placement-score.html) in the *Amazon EC2 User Guide*.

## Being flexible about instance types

Instance flexibility is the use of multiple instance types to satisfy capacity requirements. Instance flexibility benefits both Amazon EC2 Spot and On-Demand Instance usage. With Spot Instances, instance flexibility lets Amazon EC2 launch instances from deeper capacity pools using real-time capacity data. It also predicts which instances are most available. This offers fewer interruptions and can reduce the overall cost of a workload. With On-Demand Instances, instance flexibility reduces insufficient capacity errors (ICE) when total capacity provisions across a greater number of instance pools.

For **Instance Group** clusters, you can specify up to 50 EC2 instance types. For **Instance Fleets** with allocation strategy, you can specify up to 30 EC2 instance types for each primary, core, and task node group. A broader range of instances improves the benefits of instance flexibility. 

### Expressing instance flexibility

Consider the following best practices to express instance flexibility for your application.

**Topics**
+ [Determine instance family and size](#emr-flexibility-express-size)
+ [Include additional instances](#emr-flexibility-express-include)

#### Determine instance family and size

Amazon EMR supports several instance types for different use cases. These instance types are listed in the [Supported instance types with Amazon EMR](emr-supported-instance-types.md) documentation. Each instance type belongs to an instance family that describes what application the type is optimized for.

For new workloads, you should benchmark with instance types in the general purpose family, such as `m5`. Then, monitor the OS and YARN metrics from Ganglia and Amazon CloudWatch to determine system bottlenecks at peak load. Bottlenecks include CPU, memory, storage, and I/O operations. After you identify the bottlenecks, choose compute optimized, memory optimized, storage optimized, or another appropriate instance family for your instance types. For more details, see the [Determine right infrastructure for your Spark workloads](https://github.com/aws/aws-emr-best-practices/blob/main/website/docs/bestpractices/Applications/Spark/best_practices.md#bp-512-----determine-right-infrastructure-for-your-spark-workloads) page in the Amazon EMR best practices guide on GitHub. 

Next, identify the smallest YARN container or Spark executor that your application requires. This is the smallest instance size that fits the container and the minimum instance size for the cluster. Use this metric to determine instances that you can further diversify with. A smaller minimum instance size allows for more instance flexibility.

For maximum instance flexibility, you should leverage as many instance types as possible. We recommend that you diversify with instances that have similar hardware specifications; this maximizes access to EC2 capacity pools with minimal cost and performance variance. Diversify across sizes first, then prioritize AWS Graviton and previous-generation instances. As a general rule, try to be flexible across at least 15 instance types for each workload. We recommend that you start with general purpose, compute optimized, or memory optimized instances, because these instance types provide the greatest flexibility. 
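The sizing step above (mapping the smallest executor to a minimum instance size) can be sketched as follows. The instance specs and the `minimum_instance` helper are illustrative, and real sizing must also leave headroom for the OS and YARN overhead:

```python
# Illustrative m5 size ladder as (name, vCPU, memory in GiB); not an
# authoritative spec list. Sizes are ordered smallest to largest.
M5_SIZES = [
    ("m5.large", 2, 8), ("m5.xlarge", 4, 16),
    ("m5.2xlarge", 8, 32), ("m5.4xlarge", 16, 64),
]

def minimum_instance(executor_vcpu, executor_mem_gib, sizes=M5_SIZES):
    """Return the smallest size that fits one executor/container.
    In practice, reserve extra capacity for the OS and YARN overhead."""
    for name, vcpu, mem in sizes:
        if vcpu >= executor_vcpu and mem >= executor_mem_gib:
            return name
    return None

# A 4 vCPU / 16 GiB executor first fits on m5.xlarge, which becomes the
# minimum instance size for the cluster.
smallest = minimum_instance(4, 16)
```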

#### Include additional instances

For maximum diversity, include additional instance types. Prioritize instance size, Graviton, and generation flexibility first. This allows access to additional EC2 capacity pools with similar cost and performance profiles. If you need further flexibility due to ICE or spot interruptions, consider variant and family flexibility. Each approach has tradeoffs that depend on your use case and requirements. 
+ **Size flexibility** – First, diversify with instances of different sizes within the same family. Instances within the same family provide the same cost and performance, but can launch a different number of containers on each host. For example, if the minimum executor size that you need is 2 vCPU and 8 GB of memory, the minimum instance size is `m5.xlarge`. For size flexibility, include `m5.xlarge`, `m5.2xlarge`, `m5.4xlarge`, `m5.8xlarge`, `m5.12xlarge`, `m5.16xlarge`, and `m5.24xlarge`.
+ **Graviton flexibility** – In addition to size, you can diversify with Graviton instances. Graviton instances are powered by AWS Graviton2 processors that deliver the best price performance for cloud workloads in Amazon EC2. For example, with the minimum instance size of `m5.xlarge`, you can include `m6g.xlarge`, `m6g.2xlarge`, `m6g.4xlarge`, `m6g.8xlarge`, and `m6g.16xlarge` for Graviton flexibility.
+ **Generation flexibility** – Similar to Graviton and size flexibility, instances in previous generation families share the same hardware specifications. This results in a similar cost and performance profile with an increase in the total accessible Amazon EC2 pool. For generation flexibility, include `m4.xlarge`, `m4.2xlarge`, `m4.10xlarge`, and `m4.16xlarge`.
+ **Family and variant flexibility**
  + **Capacity** – To optimize for capacity, we recommend instance flexibility across instance families. Common instances from different instance families have deeper instance pools that can assist with meeting capacity requirements. However, instances from different families will have different vCPU to memory ratios. This results in under-utilization if the expected application container is sized for a different instance. For example, with `m5.xlarge`, include compute-optimized instances such as `c5` or memory-optimized instances such as `r5` for instance family flexibility.
  + **Cost** – To optimize for cost, we recommend instance flexibility across variants. These instances have the same memory and vCPU ratio as the initial instance. The tradeoff with variant flexibility is that these instances have smaller capacity pools which might result in limited additional capacity or higher Spot interruptions. With `m5.xlarge` for example, include AMD-based instances (`m5a`), SSD-based instances (`m5d`) or network-optimized instances (`m5n`) for instance variant flexibility.
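Put together, the diversification guidance above maps naturally onto an instance fleet request. The following is an illustrative (not prescriptive) `--instance-fleets` JSON fragment for a Spot task fleet; the fleet name, target capacity, weights, and the exact instance type list are assumptions that you would tune for your own workload:

```json
[
  {
    "Name": "TaskFleet",
    "InstanceFleetType": "TASK",
    "TargetOnDemandCapacity": 0,
    "TargetSpotCapacity": 16,
    "InstanceTypeConfigs": [
      { "InstanceType": "m5.xlarge",  "WeightedCapacity": 4 },
      { "InstanceType": "m5.2xlarge", "WeightedCapacity": 8 },
      { "InstanceType": "m6g.xlarge", "WeightedCapacity": 4 },
      { "InstanceType": "m4.xlarge",  "WeightedCapacity": 4 },
      { "InstanceType": "m5a.xlarge", "WeightedCapacity": 4 }
    ]
  }
]
```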

# Configuring Amazon EMR cluster instance types and best practices for Spot instances

Use the guidance in this section to help you determine the instance types, purchasing options, and amount of storage to provision for each node type in an EMR cluster.

## What instance type should you use?


There are several ways to add Amazon EC2 instances to a cluster. The method you should choose depends on whether you use the instance groups configuration or the instance fleets configuration for the cluster.
+ **Instance Groups**
  + Manually add instances of the same type to existing core and task instance groups.
  + Manually add a task instance group, which can use a different instance type.
  + Set up automatic scaling in Amazon EMR for an instance group, adding and removing instances automatically based on the value of an Amazon CloudWatch metric that you specify. For more information, see [Use Amazon EMR cluster scaling to adjust for changing workloads](emr-scale-on-demand.md).
+ **Instance Fleets**
  + Add a single task instance fleet.
  + Change the target capacity for On-Demand and Spot Instances for existing core and task instance fleets. For more information, see [Planning and configuring instance fleets for your Amazon EMR cluster](emr-instance-fleet.md).

One way to plan the instances of your cluster is to run a test cluster with a representative sample set of data and monitor the utilization of the nodes in the cluster. For more information, see [View and monitor an Amazon EMR cluster as it performs work](emr-manage-view.md). Another way is to calculate the capacity of the instances you are considering and compare that value against the size of your data.

In general, the primary node type, which assigns tasks, doesn't require an EC2 instance with much processing power; Amazon EC2 instances for the core node type, which process tasks and store data in HDFS, need both processing power and storage capacity; Amazon EC2 instances for the task node type, which don't store data, need only processing power. For guidelines about available Amazon EC2 instances and their configuration, see [Configure Amazon EC2 instance types for use with Amazon EMR](emr-plan-ec2-instances.md).

 The following guidelines apply to most Amazon EMR clusters. 
+ There is a vCPU limit for the total number of On-Demand Amazon EC2 instances that you can run in an AWS account per AWS Region. For more information about the vCPU limit and how to request a limit increase for your account, see [On-Demand Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-on-demand-instances.html) in the *Amazon EC2 User Guide for Linux Instances*. 
+ The primary node does not typically have large computational requirements. For clusters with a large number of nodes, or for clusters with applications that are specifically deployed on the primary node (such as JupyterHub or Hue), a larger primary node may be required and can help improve cluster performance. For example, consider using an `m5.xlarge` instance for small clusters (50 or fewer nodes), and increasing to a larger instance type for larger clusters.
+ The computational needs of the core and task nodes depend on the type of processing your application performs. Many jobs can be run on general purpose instance types, which offer balanced performance in terms of CPU, disk space, and input/output. Computation-intensive clusters may benefit from running on High CPU instances, which have proportionally more CPU than RAM. Database and memory-caching applications may benefit from running on High Memory instances. Network-intensive and CPU-intensive applications like parsing, NLP, and machine learning may benefit from running on cluster compute instances, which provide proportionally high CPU resources and increased network performance.
+ If different phases of your cluster have different capacity needs, you can start with a small number of core nodes and increase or decrease the number of task nodes to meet your job flow's varying capacity requirements. 
+ The amount of data you can process depends on the capacity of your core nodes and the size of your data as input, during processing, and as output. The input, intermediate, and output datasets all reside on the cluster during processing. 

## When should you use Spot Instances?


When you launch a cluster in Amazon EMR, you can choose to launch primary, core, or task instances on Spot Instances. Because each type of instance group plays a different role in the cluster, there are implications of launching each node type on Spot Instances. You can't change an instance purchasing option while a cluster is running. To change from On-Demand to Spot Instances or vice versa, for the primary and core nodes, you must terminate the cluster and launch a new one. For task nodes, you can launch a new task instance group or instance fleet, and remove the old one.
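For example, to shift task capacity from On-Demand to Spot on a running instance groups cluster, you can add a new Spot task instance group (and later shrink the old one to zero). The sketch below builds a request for the EMR `AddInstanceGroups` API; the cluster ID is a placeholder, and the instance type and count are illustrative:

```python
# Request body for the EMR AddInstanceGroups API.
# "j-XXXXXXXXXXXXX" is a placeholder cluster ID; the instance
# type and count are illustrative, not recommendations.
request = {
    "JobFlowId": "j-XXXXXXXXXXXXX",
    "InstanceGroups": [
        {
            "Name": "spot-task-group",
            "Market": "SPOT",        # SPOT or ON_DEMAND
            "InstanceRole": "TASK",  # only task groups can be added to a running cluster
            "InstanceType": "m5.xlarge",
            "InstanceCount": 4,
        }
    ],
}

# With credentials configured, you could submit the request with boto3:
# import boto3
# emr = boto3.client("emr")
# emr.add_instance_groups(**request)
```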

**Topics**
+ [Amazon EMR settings to prevent job failure because of task node Spot Instance termination](#emr-plan-spot-YARN)
+ [Primary node on a Spot Instance](#emr-dev-master-instance-group-spot)
+ [Core nodes on Spot Instances](#emr-dev-core-instance-group-spot)
+ [Task nodes on Spot Instances](#emr-dev-task-instance-group-spot)
+ [Instance configurations for application scenarios](#emr-plan-spot-scenarios)

### Amazon EMR settings to prevent job failure because of task node Spot Instance termination


Because Spot Instances are often used to run task nodes, Amazon EMR has default functionality for scheduling YARN jobs so that running jobs do not fail when task nodes running on Spot Instances are terminated. Amazon EMR does this by allowing application master processes to run only on core nodes. The application master process controls running jobs and needs to stay alive for the life of the job.

Amazon EMR release 5.19.0 and later uses the built-in [YARN node labels](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeLabel.html) feature to achieve this. (Earlier versions used a code patch). Properties in the `yarn-site` and `capacity-scheduler` configuration classifications are configured by default so that the YARN capacity-scheduler and fair-scheduler take advantage of node labels. Amazon EMR automatically labels core nodes with the `CORE` label, and sets properties so that application masters are scheduled only on nodes with the CORE label. Manually modifying related properties in the yarn-site and capacity-scheduler configuration classifications, or directly in associated XML files, could break this feature or modify this functionality.

Amazon EMR configures the following properties and values by default. Use caution when configuring these properties.

**Note**  
Beginning with the Amazon EMR 6.x release series, the YARN node labels feature is disabled by default. Application master processes can run on both core and task nodes by default. You can enable the YARN node labels feature by configuring the following properties:   
`yarn.node-labels.enabled: true`
`yarn.node-labels.am.default-node-label-expression: 'CORE'`
+ **yarn-site (yarn-site.xml) On All Nodes**
  + `yarn.node-labels.enabled: true`
  + `yarn.node-labels.am.default-node-label-expression: 'CORE'`
  + `yarn.node-labels.fs-store.root-dir: '/apps/yarn/nodelabels'`
  + `yarn.node-labels.configuration-type: 'distributed'`
+ **yarn-site (yarn-site.xml) On Primary And Core Nodes**
  + `yarn.nodemanager.node-labels.provider: 'config'`
  + `yarn.nodemanager.node-labels.provider.configured-node-partition: 'CORE'`
+ **capacity-scheduler (capacity-scheduler.xml) On All Nodes**
  + `yarn.scheduler.capacity.root.accessible-node-labels: '*'`
  + `yarn.scheduler.capacity.root.accessible-node-labels.CORE.capacity: 100`
  + `yarn.scheduler.capacity.root.default.accessible-node-labels: '*'`
  + `yarn.scheduler.capacity.root.default.accessible-node-labels.CORE.capacity: 100`
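On Amazon EMR 6.x releases, where the feature is disabled by default, the two properties from the note above can be supplied as a `yarn-site` configuration classification at cluster launch. A sketch of the JSON that you might pass to `--configurations`:

```json
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.node-labels.enabled": "true",
      "yarn.node-labels.am.default-node-label-expression": "CORE"
    }
  }
]
```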

### Primary node on a Spot Instance


The primary node controls and directs the cluster. When it terminates, the cluster ends, so you should only launch the primary node as a Spot Instance if you are running a cluster where sudden termination is acceptable. This might be the case if you are testing a new application, have a cluster that periodically persists data to an external store such as Amazon S3, or are running a cluster where cost is more important than ensuring the cluster's completion. 

When you launch the primary instance group as a Spot Instance, the cluster does not start until that Spot Instance request is fulfilled. This is something to consider when selecting your maximum Spot price.

You can only add a Spot Instance primary node when you launch the cluster. You can't add or remove primary nodes from a running cluster. 

Typically, you would only run the primary node as a Spot Instance if you are running the entire cluster (all instance groups) as Spot Instances. 

### Core nodes on Spot Instances


Core nodes process data and store information using HDFS. Terminating a core instance risks data loss. For this reason, you should only run core nodes on Spot Instances when partial HDFS data loss is tolerable.

When you launch the core instance group as Spot Instances, Amazon EMR waits until it can provision all of the requested core instances before launching the instance group. In other words, if you request six Amazon EC2 instances, and only five are available at or below your maximum Spot price, the instance group won't launch. Amazon EMR continues to wait until all six Amazon EC2 instances are available or until you terminate the cluster. You can change the number of Spot Instances in a core instance group to add capacity to a running cluster. For more information about working with instance groups, and how Spot Instances work with instance fleets, see [Create an Amazon EMR cluster with instance fleets or uniform instance groups](emr-instance-group-configuration.md).

### Task nodes on Spot Instances


The task nodes process data but do not hold persistent data in HDFS. If they terminate because the Spot price has risen above your maximum Spot price, no data is lost and the effect on your cluster is minimal.

When you launch one or more task instance groups as Spot Instances, Amazon EMR provisions as many task nodes as it can, using your maximum Spot price. This means that if you request a task instance group with six nodes, and only five Spot Instances are available at or below your maximum Spot price, Amazon EMR launches the instance group with five nodes, adding the sixth later if possible. 

Launching task instance groups as Spot Instances is a strategic way to expand the capacity of your cluster while minimizing costs. If you launch your primary and core instance groups as On-Demand Instances, their capacity is guaranteed for the run of the cluster. You can add task instances to your task instance groups as needed, to handle peak traffic or speed up data processing. 

You can add or remove task nodes using the console, AWS CLI, or API. You can also add additional task groups, but you cannot remove a task group after it is created. 

### Instance configurations for application scenarios


The following table is a quick reference to node type purchasing options and configurations that are usually appropriate for various application scenarios. Choose the link to view more information about each scenario type.


| Application scenario | Primary node purchasing option | Core nodes purchasing option | Task nodes purchasing option | 
| --- | --- | --- | --- | 
| [Long-running clusters and data warehouses](#emr-dev-when-use-spot-data-warehouses) | On-Demand | On-Demand or instance-fleet mix | Spot or instance-fleet mix | 
| [Cost-driven workloads](#emr-dev-when-use-spot-cost-driven) | Spot | Spot | Spot | 
| [Data-critical workloads](#emr-dev-when-use-spot-data-critical) | On-Demand | On-Demand | Spot or instance-fleet mix | 
| [Application testing](#emr-dev-when-use-spot-application-testing) | Spot | Spot | Spot | 

 There are several scenarios in which Spot Instances are useful for running an Amazon EMR cluster. 

#### Long-running clusters and data warehouses


If you are running a persistent Amazon EMR cluster that has a predictable variation in computational capacity, such as a data warehouse, you can handle peak demand at lower cost with Spot Instances. You can launch your primary and core instance groups as On-Demand Instances to handle the normal capacity and launch the task instance group as Spot Instances to handle your peak load requirements.

#### Cost-driven workloads


If you are running transient clusters for which lower cost is more important than the time to completion, and losing partial work is acceptable, you can run the entire cluster (primary, core, and task instance groups) as Spot Instances to benefit from the largest cost savings.

#### Data-critical workloads


If you are running a cluster for which lower cost is more important than time to completion, but losing partial work is not acceptable, launch the primary and core instance groups as On-Demand Instances and supplement with one or more task instance groups of Spot Instances. Running the primary and core instance groups as On-Demand Instances ensures that your data is persisted in HDFS and that the cluster is protected from termination due to Spot market fluctuations, while providing cost savings that accrue from running the task instance groups as Spot Instances.

#### Application testing


When you are testing a new application in order to prepare it for launch in a production environment, you can run the entire cluster (primary, core, and task instance groups) as Spot Instances to reduce your testing costs.

## Calculating the required HDFS capacity of a cluster


 The amount of HDFS storage available to your cluster depends on the following factors:
+ The number of Amazon EC2 instances used for core nodes.
+ The capacity of the Amazon EC2 instance store for the instance type used. For more information on instance store volumes, see [Amazon EC2 instance store](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html) in the *Amazon EC2 User Guide*.
+ The number and size of Amazon EBS volumes attached to core nodes.
+ A replication factor, which accounts for how each data block is stored in HDFS for RAID-like redundancy. By default, the replication factor is three for a cluster of 10 or more core nodes, two for a cluster of 4-9 core nodes, and one for a cluster of three or fewer nodes.

To calculate the HDFS capacity of a cluster, for each core node, add the instance store volume capacity to the Amazon EBS storage capacity (if used). Multiply the result by the number of core nodes, and then divide the total by the replication factor based on the number of core nodes. For example, a cluster with 10 core nodes of type i2.xlarge, which have 800 GB of instance storage without any attached Amazon EBS volumes, has a total of approximately 2,666 GB available for HDFS (10 nodes x 800 GB ÷ 3 replication factor).
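The calculation above can be sketched in a few lines of Python, using the default replication factors listed earlier:

```python
def default_replication(core_nodes):
    """Default HDFS replication factor by core node count."""
    if core_nodes >= 10:
        return 3
    if core_nodes >= 4:
        return 2
    return 1

def hdfs_capacity_gb(core_nodes, instance_store_gb, ebs_gb=0):
    """Approximate usable HDFS capacity in GB for a cluster."""
    raw = core_nodes * (instance_store_gb + ebs_gb)
    return raw / default_replication(core_nodes)

# 10 i2.xlarge core nodes with 800 GB of instance storage each:
print(int(hdfs_capacity_gb(10, 800)))  # approximately 2,666 GB
```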

 If the calculated HDFS capacity value is smaller than your data, you can increase the amount of HDFS storage in the following ways: 
+ Creating a cluster with additional Amazon EBS volumes or adding instance groups with attached Amazon EBS volumes to an existing cluster
+ Adding more core nodes
+ Choosing an Amazon EC2 instance type with greater storage capacity
+ Using data compression
+ Changing the Hadoop configuration settings to reduce the replication factor

Reducing the replication factor should be used with caution as it reduces the redundancy of HDFS data and the ability of the cluster to recover from lost or corrupted HDFS blocks. 
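If you do choose to lower the replication factor, one way is the `hdfs-site` configuration classification at cluster launch. The sketch below sets it to 2, an illustrative value, with the redundancy tradeoff noted above:

```json
[
  {
    "Classification": "hdfs-site",
    "Properties": {
      "dfs.replication": "2"
    }
  }
]
```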

# Configure Amazon EMR cluster logging and debugging


One of the things to decide as you plan your cluster is how much debugging support you want to make available. When you are first developing your data processing application, we recommend testing the application on a cluster processing a small, but representative, subset of your data. When you do this, you will likely want to take advantage of all the debugging tools that Amazon EMR offers, such as archiving log files to Amazon S3. 

When you've finished development and put your data processing application into full production, you may choose to scale back debugging. Doing so can save you the cost of storing log file archives in Amazon S3 and reduce processing load on the cluster as it no longer needs to write state to Amazon S3. The trade off, of course, is that if something goes wrong, you'll have fewer tools available to investigate the issue. 

## Default log files


By default, each cluster writes log files on all nodes. These are written to the `/mnt/var/log/` directory. You can access them by using SSH to connect to any of the nodes as described in [Connect to the Amazon EMR cluster primary node using SSH](emr-connect-master-node-ssh.md). Amazon EMR collects certain system and application logs generated by Amazon EMR daemons and other Amazon EMR processes to ensure effective service operations.

**Note**  
If you use Amazon EMR release 6.8.0 or earlier, log files are not saved to Amazon S3 during cluster termination, so you can't access the log files once the nodes terminate. Amazon EMR releases 6.9.0 and later archive logs to Amazon S3 during cluster scale-down, so log files generated on the cluster persist even after the node is terminated.

You do not need to enable anything to have log files written on all nodes. This is the default behavior of Amazon EMR and Hadoop. 

Amazon EMR captures three categories of logs for S3 logging:
+ **System logs**: EMR daemon logs
+ **Application logs**: Framework logs from Hadoop, Spark, Hive, and other applications running on the cluster
+ **Persistent UI logs**: Logs required for persistent application UIs such as Spark History Server and Tez UI

On the local file system, a cluster generates several types of log files in `/mnt/var/log`, including:
+ **Step logs** — These logs are generated by the Amazon EMR service and contain information about the cluster and the results of each step. The log files are stored in the `/mnt/var/log/hadoop/steps/` directory on the primary node. Each step logs its results in a separate numbered subdirectory: `/mnt/var/log/hadoop/steps/s-stepId1/` for the first step, `/mnt/var/log/hadoop/steps/s-stepId2/` for the second step, and so on. The 13-character step identifiers (for example, stepId1 and stepId2) are unique to a cluster.
+ **Hadoop and YARN component logs** — The logs for components associated with both Apache YARN and MapReduce, for example, are contained in separate folders in `/mnt/var/log` on all nodes. The log file locations for the Hadoop components under `/mnt/var/log` are as follows: hadoop-hdfs, hadoop-mapreduce, hadoop-httpfs, and hadoop-yarn. The hadoop-state-pusher directory is for the output of the Hadoop state pusher process. 
+ **Bootstrap action logs** — If your job uses bootstrap actions, the results of those actions are logged. The log files are stored in `/mnt/var/log/bootstrap-actions/` on all nodes. Each bootstrap action logs its results in a separate numbered subdirectory: `/mnt/var/log/bootstrap-actions/1/` for the first bootstrap action, `/mnt/var/log/bootstrap-actions/2/` for the second bootstrap action, and so on. 
+ **Instance state logs** — These logs provide information about the CPU, memory state, and garbage collector threads of the node. The log files are stored in `/mnt/var/log/instance-state/` on all nodes. 

## Archive log files to Amazon S3


**Note**  
You cannot currently use log aggregation to Amazon S3 with the `yarn logs` utility.

Amazon EMR releases 6.9.0 and later archive logs to Amazon S3 during cluster scale-down, so log files generated on the cluster persist even after the node is terminated. This behavior is enabled automatically, so you don't need to do anything to turn it on. For Amazon EMR releases 6.8.0 and earlier, you can configure a cluster to periodically archive the log files stored on all nodes to Amazon S3. This ensures that the log files are available after the cluster terminates, whether through normal shutdown or due to an error. Amazon EMR archives the log files to Amazon S3 at 5-minute intervals. 

To have the log files archived to Amazon S3 for Amazon EMR releases 6.8.0 and earlier, you must enable this feature when you launch the cluster. You can do this using the console, the CLI, or the API. By default, clusters launched using the console have log archiving enabled. For clusters launched using the CLI or API, logging to Amazon S3 must be manually enabled.

------
#### [ Console ]

**To archive log files to Amazon S3 with the new console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. Under **Cluster logs**, select the **Publish cluster-specific logs to Amazon S3** check box. 

1. In the **Amazon S3 location** field, type (or browse to) an Amazon S3 path to store your logs. If you type the name of a folder that doesn't exist in the bucket, Amazon S3 creates it.

   When you set this value, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and Amazon EC2 terminates the instances hosting the cluster. These logs are useful for troubleshooting purposes. For more information, see [View log files](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html).

1. Optionally, select the **Encrypt cluster-specific logs** check box. Then, select an AWS KMS key from the list, enter a key ARN, or create a new key. This option is only available with Amazon EMR version 5.30.0 and later, excluding version 6.0.0. To use this option, add permission to AWS KMS for your EC2 instance profile and Amazon EMR role. For more information, see [To encrypt log files stored in Amazon S3 with an AWS KMS customer managed key](#emr-log-encryption).

1. Choose any other options that apply to your cluster.

1. To launch your cluster, choose **Create cluster**.

------
#### [ CLI ]

**To archive log files to Amazon S3 with the AWS CLI**

To archive log files to Amazon S3 using the AWS CLI, type the `create-cluster` command and specify the Amazon S3 log path using the `--log-uri` parameter. 

1. To archive log files to Amazon S3, type the following command and replace *myKey* with the name of your EC2 key pair.

   ```
   aws emr create-cluster --name "Test cluster" --release-label emr-7.12.0 --log-uri s3://DOC-EXAMPLE-BUCKET/logs --applications Name=Hadoop Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey --instance-type m5.xlarge --instance-count 3
   ```

1. When you specify the instance count without using the `--instance-groups` parameter, a single primary node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.
**Note**  
If you have not previously created the default Amazon EMR service role and EC2 instance profile, enter `aws emr create-default-roles` to create them before typing the `create-cluster` subcommand.

------

### To encrypt log files stored in Amazon S3 with an AWS KMS customer managed key


With Amazon EMR version 5.30.0 and later (except Amazon EMR 6.0.0), you can encrypt log files stored in Amazon S3 with an AWS KMS customer managed key. To enable this option in the console, follow the steps in [Archive log files to Amazon S3](#emr-plan-debugging-logs-archive). Your Amazon EC2 instance profile and your Amazon EMR role must meet the following prerequisites: 
+ The Amazon EC2 instance profile used for your cluster must have permission to use `kms:GenerateDataKey`.
+ The Amazon EMR role used for your cluster must have permission to use `kms:DescribeKey`.
+ The Amazon EC2 instance profile and Amazon EMR role must be added to the list of key users for the specified AWS KMS customer managed key, as the following steps demonstrate:

  1. Open the AWS Key Management Service (AWS KMS) console at [https://console.aws.amazon.com/kms](https://console.aws.amazon.com/kms).

  1. To change the AWS Region, use the Region selector in the upper-right corner of the page.

  1. Select the alias of the KMS key to modify.

  1. On the key details page under **Key Users**, choose **Add**.

  1. In the **Add key users** dialog box, select your Amazon EC2 instance profile and Amazon EMR role.

  1. Choose **Add**.
+ You must also configure the KMS key to allow the `persistentappui.elasticmapreduce.amazonaws.com` and `elasticmapreduce.amazonaws.com` service principals to use `kms:GenerateDataKey`, `kms:GenerateDataKeyWithoutPlaintext`, and `kms:Decrypt`. This allows EMR to read and write logs encrypted with the KMS key to managed storage. The user IAM role must have permission to use `kms:GenerateDataKey` and `kms:Decrypt`.

  ```
  {
     "Sid": "Allow User Role to use KMS key",
     "Effect": "Allow",
     "Principal": {
          "AWS": "User Role"
      },
      "Action": [
          "kms:Decrypt", 
          "kms:GenerateDataKey"
     ],
      "Resource": "*",
      "Condition": {
          "StringLike": {
              "kms:EncryptionContext:aws:elasticmapreduce:clusterId": "j-*",
             "kms:ViaService": "elasticmapreduce.region.amazonaws.com"
         }
      }
  },
  {
      "Sid": "Allow Persistent APP UI to validate KMS key for write",
      "Effect": "Allow",
      "Principal":{
          "Service": [
              "elasticmapreduce.amazonaws.com"
          ]
       },
       "Action": [
         "kms:GenerateDataKeyWithoutPlaintext"
        ],
       "Resource": "*",
       "Condition": {
          "StringLike": {
              "aws:SourceArn": "arn:aws:elasticmapreduce:region:account:cluster/j-*",
              "kms:EncryptionContext:aws:elasticmapreduce:clusterId": "j-*"
          }
       }
  },
  {
      "Sid": "Allow Persistent APP UI to Write/Read Logs",
      "Effect": "Allow",
      "Principal":{
          "Service": [
              "persistentappui.elasticmapreduce.amazonaws.com",
              "elasticmapreduce.amazonaws.com"
          ]
       },
       "Action": [
         "kms:Decrypt",
         "kms:GenerateDataKey"
       ],
       "Resource": "*",
       "Condition": {
          "StringLike": {
              "aws:SourceArn": "arn:aws:elasticmapreduce:region:account:cluster/j-*",
              "kms:EncryptionContext:aws:elasticmapreduce:clusterId": "j-*",
              "kms:ViaService": "s3.region.amazonaws.com"
          }
       }
  }
  ```

  As a security best practice, we recommend that you add the `kms:EncryptionContext` and `aws:SourceArn` conditions. These conditions help ensure the key is only used by Amazon EMR on EC2 and only used for logs generated from jobs running in a specific cluster.

For more information, see [IAM service roles used by Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-service-roles.html) and [Using key policies](https://docs.aws.amazon.com/kms/latest/developerguide/key-policies.html#key-policy-default-allow-users) in the *AWS Key Management Service Developer Guide*.

### To aggregate logs in Amazon S3 using the AWS CLI

**Note**  
You cannot currently use log aggregation with the `yarn logs` utility. You can only use aggregation supported by this procedure.

Log aggregation (Hadoop 2.x) compiles logs from all containers for an individual application into a single file. To enable log aggregation to Amazon S3 using the AWS CLI, you supply a configuration classification at cluster launch that enables log aggregation and specifies the bucket to store the logs.
+ To enable log aggregation, create a configuration file called `myConfig.json` that contains the following:

  ```
  [
    {
      "Classification": "yarn-site",
      "Properties": {
        "yarn.log-aggregation-enable": "true",
        "yarn.log-aggregation.retain-seconds": "-1",
        "yarn.nodemanager.remote-app-log-dir": "s3://DOC-EXAMPLE-BUCKET/logs"
      }
    }
  ]
  ```

  Type the following command and replace *`myKey`* with the name of your EC2 key pair. You can additionally replace any of the example values with your own configurations.

  ```
  aws emr create-cluster --name "Test cluster" \
  --release-label emr-7.12.0 \
  --applications Name=Hadoop \
  --use-default-roles \
  --ec2-attributes KeyName=myKey \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --configurations file://./myConfig.json
  ```

  When you specify the instance count without using the `--instance-groups` parameter, a single primary node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.
**Note**  
If you have not previously created the default EMR service role and EC2 instance profile, run `aws emr create-default-roles` to create them before running the `create-cluster` subcommand.

For more information on using Amazon EMR commands in the AWS CLI, see [AWS CLI Command Reference](https://docs.aws.amazon.com/cli/latest/reference/emr).

### Amazon EMR self diagnostics and troubleshooting tools

The [AWSSupport-AnalyzeEMRLogs](https://docs.aws.amazon.com/systems-manager-automation-runbooks/latest/userguide/automation-awssupport-analyzeemrlogs.html) runbook helps identify errors while running a job on an Amazon EMR cluster. The runbook analyzes a list of defined logs on the file system and looks for a list of predefined keywords. These log entries are used to create Amazon CloudWatch Events events so that you can take any needed actions based on the events. Optionally, the runbook publishes log entries to the Amazon CloudWatch Logs log group of your choosing.

The [AWSSupport-DiagnoseEMRLogsWithAthena](https://docs.aws.amazon.com/systems-manager-automation-runbooks/latest/userguide/awssupport-diagnose-emr-logs-with-athena.html) runbook helps diagnose Amazon EMR logs on Amazon S3 using Amazon Athena in integration with the AWS Glue Data Catalog. Amazon Athena is used to query the Amazon EMR log files for containers, node logs, or both, with optional parameters for specific date ranges or keyword-based searches. The runbook provides a list of all errors and frequently occurring exceptions found in the Amazon EMR cluster logs, along with the corresponding S3 log locations. It also provides a summary of unique known exceptions matched in the Amazon EMR logs, along with recommended resolutions and Knowledge Center and re:Post articles to help with troubleshooting.

## Log locations


The following list includes all log types and their locations in Amazon S3. You can use these for troubleshooting Amazon EMR issues.

**Step logs**  
`s3://DOC-EXAMPLE-LOG-BUCKET/<cluster-id>/steps/<step-id>/`

**Application logs**  
`s3://DOC-EXAMPLE-LOG-BUCKET/<cluster-id>/containers/`  
This location includes container `stderr` and `stdout`, `directory.info`, `prelaunch.out`, and `launch_container.sh` logs.

**Resource manager logs**  
`s3://DOC-EXAMPLE-LOG-BUCKET/<cluster-id>/node/<leader-instance-id>/applications/hadoop-yarn/`

**Hadoop HDFS**  
`s3://DOC-EXAMPLE-LOG-BUCKET/<cluster-id>/node/<all-instance-id>/applications/hadoop-hdfs/`  
This location includes NameNode, DataNode, and YARN TimelineServer logs.

**Node manager logs**  
`s3://DOC-EXAMPLE-LOG-BUCKET/<cluster-id>/node/<all-instance-id>/applications/hadoop-yarn/`

**Instance-state logs**  
`s3://DOC-EXAMPLE-LOG-BUCKET/<cluster-id>/node/<all-instance-id>/daemons/instance-state/`

**Amazon EMR provisioning logs**  
`s3://DOC-EXAMPLE-LOG-BUCKET/<cluster-id>/node/<leader-instance-id>/provision-node/*`

**Hive logs**  
`s3://DOC-EXAMPLE-LOG-BUCKET/<cluster-id>/node/<leader-instance-id>/applications/hive/*`  
+ To find Hive logs on your cluster, remove the asterisk (`*`) and append `/var/log/hive/` to the path above.
+ To find HiveServer2 logs, remove the asterisk (`*`) and append `/var/log/hive/hiveserver2.log` to the path above.
+ To find HiveCLI logs, remove the asterisk (`*`) and append `/var/log/hive/user/hadoop/hive.log` to the path above.
+ To find Hive Metastore Server logs, remove the asterisk (`*`) and append `/var/log/hive/user/hive/hive.log` to the path above.
If your failure is in the primary or task node of your Tez application, provide logs of the appropriate Hadoop container.
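The locations above follow a predictable `s3://<log-bucket>/<cluster-id>/...` layout, so it can be convenient to compute them programmatically when troubleshooting. The following is an illustrative Python helper (not part of Amazon EMR) that builds a prefix for a few of the log types listed above; the bucket, cluster ID, and instance ID are placeholders you substitute with your own values.

```python
def emr_log_uri(log_bucket, cluster_id, log_type, instance_id=None):
    """Build the S3 prefix for a given EMR log type.

    Hypothetical helper mirroring the layout in the list above; not an
    Amazon EMR API. Node-level log types require an instance_id.
    """
    prefixes = {
        "steps": f"{cluster_id}/steps/",
        "containers": f"{cluster_id}/containers/",
        "resource-manager": f"{cluster_id}/node/{instance_id}/applications/hadoop-yarn/",
        "hdfs": f"{cluster_id}/node/{instance_id}/applications/hadoop-hdfs/",
        "instance-state": f"{cluster_id}/node/{instance_id}/daemons/instance-state/",
        "hive": f"{cluster_id}/node/{instance_id}/applications/hive/",
    }
    return f"s3://{log_bucket}/{prefixes[log_type]}"

print(emr_log_uri("DOC-EXAMPLE-LOG-BUCKET", "j-12345678", "hive",
                  instance_id="i-0abc123"))
# s3://DOC-EXAMPLE-LOG-BUCKET/j-12345678/node/i-0abc123/applications/hive/
```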

## Control S3 logging behavior (Amazon EMR 7.13.0 and later)


Starting with Amazon EMR 7.13.0, you can control upload behavior through the S3LoggingConfiguration feature. This allows you to specify different upload policies for different log types: system logs, application logs, and persistent UI logs.

### Upload policies


For each log type, you can specify one of the following upload policies. Unspecified log types will default to standard behavior (emr-managed):

**emr-managed (Default)**  
Standard behavior. Logs are uploaded to Amazon S3 as configured in your LogUri, with certain logs retained by the service for operational support and troubleshooting purposes.

**on-customer-s3only**  
Customer-managed storage only. Logs are uploaded only to the customer-specified S3 bucket. This requires you to specify a LogUri when creating the cluster. Persistent-ui-logs cannot have on-customer-s3only policy. Allowed policies for persistent-ui-logs are emr-managed and disabled.

**disabled**  
No S3 upload for this log type.

### Configuration examples


You can configure S3 logging when creating a new Amazon EMR cluster through the AWS CLI or AWS SDKs. The configuration is specified through the MonitoringConfiguration parameter.

**Example: Default behavior**  
If you don't specify S3LoggingConfiguration, all log types default to emr-managed behavior:

```
aws emr create-cluster \
--name "MyCluster" \
--release-label emr-7.13.0 \
--instance-type m5.xlarge \
--instance-count 3 \
--log-uri s3://my-bucket/logs/ \
--use-default-roles
```

**Example: Custom S3 logging configuration**  
This example shows how to configure different upload policies for each log type:

```
aws emr create-cluster \
--name "MyCluster" \
--release-label emr-7.13.0 \
--instance-type m5.xlarge \
--instance-count 3 \
--log-uri s3://my-bucket/logs/ \
--use-default-roles \
--monitoring-configuration '{
    "S3LoggingConfiguration": {
        "LogTypeUploadPolicy": {
            "application-logs": "on-customer-s3only",
            "persistent-ui-logs": "disabled"
        }
    }
}'
```

This configuration uploads application logs only to the customer S3 bucket, and disables persistent UI log uploads completely. The unspecified log type (system logs) follows default (emr-managed) behavior.

### Considerations

+ S3 logging configuration can only be set at cluster creation time and cannot be modified for running clusters.
+ Persistent-ui-logs cannot have on-customer-s3only policy. Allowed policies for persistent-ui-logs are emr-managed and disabled.
+ **LogUri Requirement**: When using on-customer-s3only policy for system-logs or application-logs, you must specify a LogUri parameter. Without LogUri, the cluster creation will fail.
+ **Default Behavior**: If S3LoggingConfiguration is not specified, all log types default to emr-managed behavior.

# Tag and categorize Amazon EMR cluster resources


It can be convenient to categorize your AWS resources in different ways; for example, by purpose, owner, or environment. You can achieve this in Amazon EMR by assigning custom metadata to your Amazon EMR clusters using tags. A tag consists of a key and a value, both of which you define. For Amazon EMR, the cluster is the resource level that you can tag. For example, you could define a set of tags for your account's clusters that helps you track each cluster's owner or identify a production cluster versus a testing cluster. We recommend that you create a consistent set of tags to meet your organization's requirements.

When you add a tag to an Amazon EMR cluster, the tag is also propagated to each active Amazon EC2 instance associated with the cluster. Similarly, when you remove a tag from an Amazon EMR cluster, that tag is removed from each associated active Amazon EC2 instance. 

**Important**  
Use the Amazon EMR console or CLI to manage tags on Amazon EC2 instances that are part of a cluster instead of the Amazon EC2 console or CLI, because changes that you make in Amazon EC2 do not synchronize back to the Amazon EMR tagging system.

You can identify an Amazon EC2 instance that is part of an Amazon EMR cluster by looking for the following system tags. In this example, *CORE* is the value for the instance group role and *j-12345678* is an example job flow (cluster) identifier value:
+ aws:elasticmapreduce:instance-group-role=*CORE*
+ aws:elasticmapreduce:job-flow-id=*j-12345678*
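For example, when scanning EC2 instances programmatically, you could test for these system tags to tell cluster instances apart from standalone instances. The following is an illustrative Python helper, assuming the tags have already been fetched as a key-to-value dictionary:

```python
def emr_cluster_membership(ec2_tags):
    """Return (cluster_id, group_role) if the tag set marks an EC2
    instance as part of an Amazon EMR cluster, else None.

    Illustrative helper based on the system tags shown above; ec2_tags
    is a plain dict of tag keys to values.
    """
    cluster_id = ec2_tags.get("aws:elasticmapreduce:job-flow-id")
    role = ec2_tags.get("aws:elasticmapreduce:instance-group-role")
    if cluster_id:
        return cluster_id, role
    return None

# An instance carrying both system tags belongs to cluster j-12345678.
print(emr_cluster_membership({
    "aws:elasticmapreduce:job-flow-id": "j-12345678",
    "aws:elasticmapreduce:instance-group-role": "CORE",
}))
# ('j-12345678', 'CORE')
```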

**Note**  
Amazon EMR and Amazon EC2 interpret your tags as a string of characters with no semantic meaning.

You can work with tags using the AWS Management Console, the CLI, and the API.

You can add tags when you create a new Amazon EMR cluster, and you can add, edit, or remove tags from a running Amazon EMR cluster. Editing a tag is a concept that applies to the Amazon EMR console; with the CLI and API, you edit a tag by removing the old tag and adding a new one. You can edit tag keys and values, and you can remove tags from a resource at any time while a cluster is running. However, you cannot add, edit, or remove tags from a terminated cluster or from terminated instances that were previously associated with a cluster that is still active. In addition, you can set a tag's value to the empty string, but you can't set a tag's value to null.

If you're using AWS Identity and Access Management (IAM) with your Amazon EC2 instances for resource-based permissions by tag, your IAM policies are applied to tags that Amazon EMR propagates to a cluster's Amazon EC2 instances. For Amazon EMR tags to propagate to your Amazon EC2 instances, your IAM policy for Amazon EC2 needs to allow permissions to call the Amazon EC2 CreateTags and DeleteTags APIs. Also, propagated tags can affect your Amazon EC2 resource-based permissions. Tags propagated to Amazon EC2 can be read as conditions in your IAM policy, just like other Amazon EC2 tags. Keep your IAM policy in mind when adding tags to your Amazon EMR clusters to avoid users having incorrect permissions for a cluster. To avoid problems, make sure that your IAM policies do not include conditions on tags that you also plan to use on your Amazon EMR clusters. For more information, see [Controlling access to Amazon EC2 resources](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/UsingIAM.html). 

# Restrictions that apply to tagging resources in Amazon EMR


The following basic restrictions apply to tags:
+ Restrictions that apply to Amazon EC2 resources apply to Amazon EMR as well. For more information, see [https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html#tag-restrictions](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html#tag-restrictions).
+ Do not use the `aws:` prefix in tag names and values because it is reserved for AWS use. In addition, you cannot edit or delete tag names or values with this prefix.
+ You cannot change or edit tags on a terminated cluster.
+ A tag value can be an empty string, but not null. In addition, a tag key cannot be an empty string.
+ Keys and values can contain any alphabetic character in any language, any numeric character, white spaces, invisible separators, and the following symbols: + - = . _ : / @ 

For more information about tagging using the AWS Management Console, see [Working with tags in the console](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html#Using_Tags_Console) in the *Amazon EC2 User Guide*. For more information about tagging using the Amazon EC2 API or command line, see [API and CLI overview](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html#Using_Tags_CLI) in the *Amazon EC2 User Guide*.

# Tag Amazon EMR resources for billing


You can use tags for organizing your AWS bill to reflect your own cost structure. To do this, sign up to get your AWS account bill with tag key values included. You can then organize your billing information by tag key values, to see the cost of your combined resources. Although Amazon EMR and Amazon EC2 have different billing statements, the tags on each cluster are also placed on each associated instance so you can use tags to link related Amazon EMR and Amazon EC2 costs.

For example, you can tag several resources with a specific application name, and then organize your billing information to see the total cost of that application across several services. For more information, see [Cost allocation and tagging](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/allocation.html) in the *AWS Billing User Guide*. 

# Add tags to an Amazon EMR cluster


You can add tags to a cluster when you create it. 

------
#### [ Console ]

**To add tags when you create a cluster with the new console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. Under **Tags**, choose **Add new tag**. Specify a tag in the **Key** field. Optionally, specify a tag in the **Value** field.

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.

------
#### [ AWS CLI ]

**To add tags when you create a cluster with the AWS CLI**

The following example demonstrates how to add a tag to a new cluster using the AWS CLI. To add tags when you create a cluster, type the `create-cluster` subcommand with the `--tags` parameter. 
+ To add a tag named *costCenter* with key value *marketing* when you create a cluster, type the following command and replace *myKey* with the name of your EC2 key pair.

  ```
  aws emr create-cluster --name "Test cluster" \
  --release-label emr-4.0.0 \
  --applications Name=Hadoop Name=Hive Name=Pig \
  --tags "costCenter=marketing" \
  --use-default-roles \
  --ec2-attributes KeyName=myKey \
  --instance-type m5.xlarge \
  --instance-count 3
  ```

  When you specify the instance count without using the `--instance-groups` parameter, a single primary node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.
**Note**  
If you have not previously created the default EMR service role and EC2 instance profile, run `aws emr create-default-roles` to create them before running the `create-cluster` subcommand.

  For more information on using Amazon EMR commands in the AWS CLI, see [https://docs.aws.amazon.com/cli/latest/reference/emr](https://docs.aws.amazon.com/cli/latest/reference/emr).
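The `key=value` strings accepted by `--tags` correspond to the Key/Value pairs in the underlying API. The following is a small Python sketch (a hypothetical helper, not part of the AWS CLI) showing that mapping, including the rule that a tag value may be an empty string:

```python
def parse_cli_tags(tag_args):
    """Convert CLI-style "key=value" strings (as passed to --tags) into
    the Key/Value structures the Amazon EMR API expects.

    A string with no "=" becomes a tag with an empty-string value,
    consistent with the rule that values may be empty but not null.
    """
    tags = []
    for arg in tag_args:
        key, _sep, value = arg.partition("=")
        tags.append({"Key": key, "Value": value})
    return tags

print(parse_cli_tags(["costCenter=marketing"]))
# [{'Key': 'costCenter', 'Value': 'marketing'}]
```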

------

You can also add tags to an existing cluster.

------
#### [ Console ]

**To add tags to an existing cluster with the new console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and select the cluster that you want to update.

1. On the **Tags** tab on the cluster details page, select **Manage tags**. Specify a tag in the **Key** field. Optionally, specify a tag in the **Value** field.

1. Select **Save changes**. The **Tags** tab updates with the new number of tags that you have on your cluster. For example, if you now have two tags, the label of your tab is **Tags (2)**.

------
#### [ AWS CLI ]

**To add tags to a running cluster with the AWS CLI**
+ Enter the `add-tags` subcommand with the `--tags` parameter to assign tags to the cluster ID. You can find the cluster ID using the console or the `list-clusters` command. The `add-tags` subcommand currently accepts only one resource ID.

  For example, to add two tags to a running cluster, one with a key named *costCenter* and a value of *marketing* and another with a key named *other* and a value of *accounting*, enter the following command and replace *j-KT4XXXXXXXX1NM* with your cluster ID. 

  ```
  aws emr add-tags --resource-id j-KT4XXXXXXXX1NM --tags "costCenter=marketing" "other=accounting"
  ```

  Note that when tags are added using the AWS CLI, there's no output from the command. For more information on using Amazon EMR commands in the AWS CLI, see [https://docs.aws.amazon.com/cli/latest/reference/emr](https://docs.aws.amazon.com/cli/latest/reference/emr).

------

# View tags on an Amazon EMR cluster


If you want to see all tags associated with a cluster, you can view them with the console or the AWS CLI.

------
#### [ Console ]

**To view tags on a cluster with the new console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and select the cluster that you want to update.

1. To view all of your tags, select the **Tags** tab on the cluster details page.

------
#### [ AWS CLI ]

**To view tags on a cluster with the AWS CLI**

To view the tags on a cluster using the AWS CLI, type the `describe-cluster` subcommand with the `--query` parameter. 
+ To view a cluster's tags, type the following command and replace *j-KT4XXXXXXXX1NM* with your cluster ID.

  ```
  aws emr describe-cluster --cluster-id j-KT4XXXXXXXX1NM --query Cluster.Tags
  ```

  The output displays all the tag information about the cluster similar to the following:

  ```
  [
      {
          "Value": "accounting",
          "Key": "other"
      },
      {
          "Value": "marketing",
          "Key": "costCenter"
      }
  ]
  ```

  For more information on using Amazon EMR commands in the AWS CLI, see [https://docs.aws.amazon.com/cli/latest/reference/emr](https://docs.aws.amazon.com/cli/latest/reference/emr).
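The `Cluster.Tags` output is a list of Key/Value objects, which is often more convenient as a dictionary when scripting. The following is an illustrative Python snippet (a hypothetical helper, not an AWS API) that flattens such output for lookups:

```python
def tags_to_dict(tag_list):
    """Flatten describe-cluster Tags output (a list of Key/Value
    objects) into a plain dictionary for easy lookups."""
    return {tag["Key"]: tag["Value"] for tag in tag_list}

# Sample output shaped like the describe-cluster example above.
cluster_tags = [
    {"Key": "other", "Value": "accounting"},
    {"Key": "costCenter", "Value": "marketing"},
]
print(tags_to_dict(cluster_tags)["costCenter"])
# marketing
```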

------

# Remove tags from an Amazon EMR cluster


If you no longer need a tag, you can remove it from the cluster. 

------
#### [ Console ]

**To remove tags on a cluster with the new console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and select the cluster that you want to update.

1. On the **Tags** tab on the cluster details page, select **Manage tags**.

1. Choose **Remove** for each key-value pair that you want to remove.

1. Choose **Save changes**.

------
#### [ AWS CLI ]

**To remove tags on a cluster with the AWS CLI**

Type the `remove-tags` subcommand with the `--tag-keys` parameter. When removing a tag, only the key name is required.
+ To remove a tag from a cluster, type the following command and replace *j-KT4XXXXXXXX1NM* with your cluster ID.

  ```
  aws emr remove-tags --resource-id j-KT4XXXXXXXX1NM --tag-keys "costCenter"
  ```
**Note**  
You cannot currently remove multiple tags using a single command.

  For more information on using Amazon EMR commands in the AWS CLI, see [https://docs.aws.amazon.com/cli/latest/reference/emr](https://docs.aws.amazon.com/cli/latest/reference/emr).

------

# Drivers and third-party application integration on Amazon EMR


 You can run several popular big-data applications on Amazon EMR with utility pricing. This means you pay a nominal additional hourly fee for the third-party application while your cluster is running. It allows you to use the application without having to purchase an annual license. The following sections describe some of the tools you can use with EMR.

**Topics**
+ [Use business intelligence tools with Amazon EMR](emr-bi-tools.md)

# Use business intelligence tools with Amazon EMR


You can use popular business intelligence tools like Microsoft Excel, MicroStrategy, QlikView, and Tableau with Amazon EMR to explore and visualize your data. Many of these tools require an ODBC (Open Database Connectivity) or JDBC (Java Database Connectivity) driver. To download and install the latest drivers, see [http://awssupportdatasvcs.com/bootstrap-actions/Simba/latest/](http://awssupportdatasvcs.com/bootstrap-actions/Simba/latest/).

To find older versions of drivers, see [http://awssupportdatasvcs.com/bootstrap-actions/Simba/](http://awssupportdatasvcs.com/bootstrap-actions/Simba/).