

# Configure Amazon EMR cluster hardware and networking
<a name="emr-plan-instances"></a>

An important consideration when you create an Amazon EMR cluster is how you configure Amazon EC2 instances and network options. This chapter covers the following options, and then ties them all together with [best practices and guidelines](emr-plan-instances-guidelines.md).
+ **Node types** – Amazon EC2 instances in an EMR cluster are organized into *node types*. There are three: *primary nodes*, *core nodes*, and *task nodes*. Each node type performs a set of roles defined by the distributed applications that you install on the cluster. During a Hadoop MapReduce or Spark job, for example, components on core and task nodes process data, transfer output to Amazon S3 or HDFS, and provide status metadata back to the primary node. With a single-node cluster, all components run on the primary node. For more information, see [Understand node types in Amazon EMR: primary, core, and task nodes](emr-master-core-task-nodes.md).
+ **EC2 instances** – When you create a cluster, you choose the Amazon EC2 instances that each type of node runs on. The instance type determines the processing and storage profile of the node, and therefore the performance profile of each node type in your cluster. For more information, see [Configure Amazon EC2 instance types for use with Amazon EMR](emr-plan-ec2-instances.md).
+ **Networking** – You can launch your Amazon EMR cluster into a VPC using a public subnet, private subnet, or a shared subnet. Your networking configuration determines how customers and services can connect to clusters to perform work, how clusters connect to data stores and other AWS resources, and the options you have for controlling traffic on those connections. For more information, see [Configure networking in a VPC for Amazon EMR](emr-plan-vpc-subnet.md).
+ **Instance grouping** – The collection of EC2 instances that host each node type is called either an *instance fleet* or a *uniform instance group*. The instance grouping configuration is a choice you make when you create a cluster. This choice determines how you can add nodes to your cluster while it is running. The configuration applies to all node types. It can't be changed later. For more information, see [Create an Amazon EMR cluster with instance fleets or uniform instance groups](emr-instance-group-configuration.md).
**Note**  
The instance fleets configuration is available only in Amazon EMR releases 4.8.0 and later, excluding 5.0.0 and 5.0.3.
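To make the grouping concrete, the following sketch shows how the node types might map onto the `InstanceGroups` structure of the EMR API, built here as a plain Python object rather than a live request. The instance types, counts, and names are hypothetical, not a recommendation:

```python
# Hedged sketch of the instance-group portion of an EMR cluster request
# (in boto3, this list is passed inside the Instances parameter of
# run_job_flow). All values below are placeholders.
instance_groups = [
    {
        "Name": "Primary",
        "InstanceRole": "MASTER",   # the API uses MASTER for the primary node
        "InstanceType": "m5.xlarge",
        "InstanceCount": 1,
    },
    {
        "Name": "Core",
        "InstanceRole": "CORE",
        "InstanceType": "m5.xlarge",
        "InstanceCount": 2,
        "Market": "ON_DEMAND",      # core nodes hold HDFS data
    },
    {
        "Name": "Task",
        "InstanceRole": "TASK",
        "InstanceType": "m5.xlarge",
        "InstanceCount": 2,
        "Market": "SPOT",           # task nodes are a common fit for Spot
    },
]
```

Note that the primary node keeps the `MASTER` role name in the API for backward compatibility, even though the documentation now calls it the primary node.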

# Understand node types in Amazon EMR: primary, core, and task nodes
<a name="emr-master-core-task-nodes"></a>

Use this section to understand how Amazon EMR uses each of these node types and as a foundation for cluster capacity planning.

## Primary node
<a name="emr-plan-master"></a>

The primary node manages the cluster and typically runs primary components of distributed applications. For example, the primary node runs the YARN ResourceManager service to manage resources for applications. It also runs the HDFS NameNode service, tracks the status of jobs submitted to the cluster, and monitors the health of the instance groups.

To monitor the progress of a cluster and interact directly with applications, you can connect to the primary node over SSH as the Hadoop user. For more information, see [Connect to the Amazon EMR cluster primary node using SSH](emr-connect-master-node-ssh.md). Connecting to the primary node allows you to access directories and files, such as Hadoop log files, directly. For more information, see [View Amazon EMR log files](emr-manage-view-web-log-files.md). You can also view user interfaces that applications publish as websites running on the primary node. For more information, see [View web interfaces hosted on Amazon EMR clusters](emr-web-interfaces.md). 

**Note**  
With Amazon EMR 5.23.0 and later, you can launch a cluster with three primary nodes to support high availability of applications like YARN Resource Manager, HDFS NameNode, Spark, Hive, and Ganglia. The primary node is no longer a potential single point of failure with this feature. If one of the primary nodes fails, Amazon EMR automatically fails over to a standby primary node and replaces the failed primary node with a new one with the same configuration and bootstrap actions. For more information, see [Plan and Configure Primary Nodes](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-ha.html).

## Core nodes
<a name="emr-plan-core"></a>

Core nodes are managed by the primary node. Core nodes run the Data Node daemon to coordinate data storage as part of the Hadoop Distributed File System (HDFS). They also run the Task Tracker daemon and perform other parallel computation tasks on data that installed applications require. For example, a core node runs YARN NodeManager daemons, Hadoop MapReduce tasks, and Spark executors.

There is only one core instance group or instance fleet per cluster, but there can be multiple nodes running on multiple Amazon EC2 instances in the instance group or instance fleet. With instance groups, you can add and remove Amazon EC2 instances while the cluster is running. You can also set up automatic scaling to add instances based on the value of a metric. For more information about adding and removing Amazon EC2 instances with the instance groups configuration, see [Use Amazon EMR cluster scaling to adjust for changing workloads](emr-scale-on-demand.md).

With instance fleets, you can effectively add and remove instances by modifying the instance fleet's *target capacities* for On-Demand and Spot accordingly. For more information about target capacities, see [Instance fleet options](emr-instance-fleet.md#emr-instance-fleet-options).
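As a rough sketch of what such a modification looks like at the API level, the following builds the parameter shape for an EMR `ModifyInstanceFleet` request (in boto3, `emr.modify_instance_fleet(**params)`). The cluster ID, fleet ID, and capacity numbers are placeholders:

```python
# Hedged sketch: building the parameters that resize an instance
# fleet's On-Demand and Spot target capacities. No AWS call is made
# here; in boto3 you would pass this dict to modify_instance_fleet.
def modify_fleet_params(cluster_id, fleet_id, target_on_demand, target_spot):
    """Return the request shape for changing a fleet's target capacities."""
    return {
        "ClusterId": cluster_id,
        "InstanceFleet": {
            "InstanceFleetId": fleet_id,
            "TargetOnDemandCapacity": target_on_demand,
            "TargetSpotCapacity": target_spot,
        },
    }

# Example: grow the fleet to 2 On-Demand units and 6 Spot units.
params = modify_fleet_params("j-EXAMPLE12345", "if-EXAMPLE12345", 2, 6)
```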

**Warning**  
Removing HDFS daemons from a running core node or terminating core nodes risks data loss. Use caution when configuring core nodes to use Spot Instances. For more information, see [When should you use Spot Instances?](emr-plan-instances-guidelines.md#emr-plan-spot-instances).

## Task nodes
<a name="emr-plan-task"></a>

You can use task nodes to add power to perform parallel computation tasks on data, such as Hadoop MapReduce tasks and Spark executors. Task nodes don't run the Data Node daemon, nor do they store data in HDFS. As with core nodes, you can add task nodes to a cluster by adding Amazon EC2 instances to an existing uniform instance group or by modifying target capacities for a task instance fleet.

With the uniform instance group configuration, you can have up to a total of 48 task instance groups. The ability to add instance groups in this way allows you to mix Amazon EC2 instance types and pricing options, such as On-Demand Instances and Spot Instances. This gives you flexibility to respond to workload requirements in a cost-effective way.

With the instance fleet configuration, the ability to mix instance types and purchasing options is built in, so there is only one task instance fleet.

Because Spot Instances are often used to run task nodes, Amazon EMR has default functionality for scheduling YARN jobs so that running jobs do not fail when task nodes running on Spot Instances are terminated. Amazon EMR does this by allowing application master processes to run only on core nodes. The application master process controls running jobs and needs to stay alive for the life of the job.

Amazon EMR release 5.19.0 and later uses the built-in [YARN node labels](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeLabel.html) feature to achieve this. (Earlier versions used a code patch.) Properties in the `yarn-site` and `capacity-scheduler` configuration classifications are configured by default so that the YARN capacity-scheduler and fair-scheduler take advantage of node labels. Amazon EMR automatically labels core nodes with the `CORE` label, and sets properties so that application masters are scheduled only on nodes with the `CORE` label. Manually modifying related properties in the `yarn-site` and `capacity-scheduler` configuration classifications, or directly in associated XML files, can break or disable this feature.

Beginning with the Amazon EMR 6.x release series, the YARN node labels feature is disabled by default. Application primary processes can run on both core and task nodes by default. You can enable the YARN node labels feature by configuring the following properties:
+ `yarn.node-labels.enabled: true`
+ `yarn.node-labels.am.default-node-label-expression: 'CORE'`
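As an illustration (the exact JSON is not shown in this section), these properties could be supplied through the `yarn-site` configuration classification. The sketch below builds the standard `Configurations` structure as a Python object; it is the same shape you would pass as `--configurations` to the CLI or as the `Configurations` API parameter:

```python
# Hedged sketch of an EMR configuration classification that sets the
# two yarn-site properties above. The structure is built as a plain
# Python object and serialized to the JSON the EMR API expects.
import json

configurations = [
    {
        "Classification": "yarn-site",
        "Properties": {
            "yarn.node-labels.enabled": "true",
            "yarn.node-labels.am.default-node-label-expression": "CORE",
        },
    }
]

print(json.dumps(configurations, indent=2))
```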

Starting with the Amazon EMR 7.x release series, Amazon EMR assigns YARN node labels to instances by their market type, such as On-Demand or Spot. You can enable node labels and restrict application primary processes to `ON_DEMAND` nodes by configuring the following properties:

```
yarn.node-labels.enabled: true
yarn.node-labels.am.default-node-label-expression: 'ON_DEMAND'
```

If you're using Amazon EMR 7.0 or higher, you can restrict application primary processes to nodes with the `CORE` label using the following configuration:

```
yarn.node-labels.enabled: true
yarn.node-labels.am.default-node-label-expression: 'CORE'
```

For Amazon EMR releases 7.2 and higher, if your cluster uses managed scaling with node labels, Amazon EMR scales the cluster based on application primary process demand and executor demand independently.

For example, if you use Amazon EMR release 7.2 or higher and restrict application primary processes to `ON_DEMAND` nodes, managed scaling scales up `ON_DEMAND` nodes when application primary process demand increases. Similarly, if you restrict application primary processes to `CORE` nodes, managed scaling scales up `CORE` nodes when demand increases.

For information about specific properties, see [Amazon EMR settings to prevent job failure because of task node Spot Instance termination](emr-plan-instances-guidelines.md#emr-plan-spot-YARN).

# Configure Amazon EC2 instance types for use with Amazon EMR
<a name="emr-plan-ec2-instances"></a>

EC2 instances come in different configurations known as *instance types*. Instance types have different CPU, input/output, and storage capacities. In addition to the instance type, you can choose different purchasing options for Amazon EC2 instances. You can specify different instance types and purchasing options within uniform instance groups or instance fleets. For more information, see [Create an Amazon EMR cluster with instance fleets or uniform instance groups](emr-instance-group-configuration.md). For guidance about choosing instance types and purchasing options for your application, see [Configuring Amazon EMR cluster instance types and best practices for Spot instances](emr-plan-instances-guidelines.md).

**Important**  
When you choose an instance type using the AWS Management Console, the number of **vCPU** shown for each **Instance type** is the number of YARN vcores for that instance type, not the number of EC2 vCPUs for that instance type. For more information on the number of vCPUs for each instance type, see [Amazon EC2 Instance Types](https://aws.amazon.com/ec2/instance-types/).

**Topics**
+ [Supported instance types with Amazon EMR](emr-supported-instance-types.md)
+ [Configure networking in a VPC for Amazon EMR](emr-plan-vpc-subnet.md)
+ [Create an Amazon EMR cluster with instance fleets or uniform instance groups](emr-instance-group-configuration.md)

# Supported instance types with Amazon EMR
<a name="emr-supported-instance-types"></a>

This section describes the instance types that Amazon EMR supports, organized by AWS Region. To learn more about instance types, see [Amazon EC2 instances](https://aws.amazon.com/ec2/instance-types/) and [Amazon Linux AMI instance type matrix](https://aws.amazon.com/amazon-linux-ami/instance-type-matrix/).

Not all instance types are available in all Regions, and availability is subject to capacity and demand in the specified Region and Availability Zone. An instance's Availability Zone is determined by the subnet you use to launch your cluster.

## Considerations
<a name="emr-supported-instance-types-considerations"></a>

Consider the following when you choose instance types for your Amazon EMR cluster.

**Important**  
When you choose an instance type using the AWS Management Console, the number of **vCPU** shown for each **Instance type** is the number of YARN vcores for that instance type, not the number of EC2 vCPUs for that instance type. For more information on the number of vCPUs for each instance type, see [Amazon EC2 Instance Types](https://aws.amazon.com/ec2/instance-types/).
+ If you create a cluster using an instance type that is not available in the specified Region and Availability Zone, your cluster may fail to provision or may be stuck provisioning. For information about instance availability, see the [Amazon EMR pricing page](https://aws.amazon.com/emr/pricing) or see the [Supported instance types by AWS Region](#emr-instance-types-by-region) tables on this page.
+ Beginning with Amazon EMR release version 5.13.0, all instances use HVM virtualization and EBS-backed storage for root volumes. When using Amazon EMR release versions earlier than 5.13.0, some previous generation instances use PVM virtualization. For more information, see [Linux AMI virtualization types](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/virtualization_types.html).
+ Because of a lack of hardware support and default settings that can lead to underutilization of memory and cores, we don't recommend using the instance types `c7a`, `c7i`, `m7i`, `m7i-flex`, `r7a`, `r7i`, `r7iz`, `i4i.12xlarge`, or `i4i.24xlarge` with Amazon EMR releases lower than 5.36.1 and 6.10.0. On those releases, you might experience lower performance and you won't see the expected benefits of newer instance types, such as `c7i` over `c6i`. For optimal resource utilization and performance with these instance types, run release 5.36.1 or higher, or 6.10.0 or higher.
+ Some instance types support enhanced networking. For more information, see [Enhanced Networking on Linux](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html).
+ NVIDIA and CUDA drivers are installed on GPU instance types by default.

## Supported instance types by AWS Region
<a name="emr-instance-types-by-region"></a>

The following tables list the Amazon EC2 instance types that Amazon EMR supports, organized by AWS Region. The tables also list the earliest Amazon EMR releases in the 5.x, 6.x, and 7.x series that support each instance type.

### US East (N. Virginia) - us-east-1
<a name="us-east-1-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### US East (Ohio) - us-east-2
<a name="us-east-2-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### US West (N. California) - us-west-1
<a name="us-west-1-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### US West (Oregon) - us-west-2
<a name="us-west-2-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### AWS GovCloud (US-West) - us-gov-west-1
<a name="us-gov-west-1-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### AWS GovCloud (US-East) - us-gov-east-1
<a name="us-gov-east-1-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Africa (Cape Town) - af-south-1
<a name="af-south-1-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Hong Kong) - ap-east-1
<a name="ap-east-1-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Jakarta) - ap-southeast-3
<a name="ap-southeast-3-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Melbourne) - ap-southeast-4
<a name="ap-southeast-4-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Malaysia) - ap-southeast-5
<a name="ap-southeast-5-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Mumbai) - ap-south-1
<a name="ap-south-1-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Hyderabad) - ap-south-2
<a name="ap-south-2-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Osaka) - ap-northeast-3
<a name="ap-northeast-3-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Seoul) - ap-northeast-2
<a name="ap-northeast-2-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Singapore) - ap-southeast-1
<a name="ap-southeast-1-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Sydney) - ap-southeast-2
<a name="ap-southeast-2-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Tokyo) - ap-northeast-1
<a name="ap-northeast-1-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Canada (Central) - ca-central-1
<a name="ca-central-1-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Canada West (Calgary) - ca-west-1
<a name="ca-west-1-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### China (Ningxia) - cn-northwest-1
<a name="cn-northwest-1-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### China (Beijing) - cn-north-1
<a name="cn-north-1-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Europe (Frankfurt) - eu-central-1
<a name="eu-central-1-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Europe (Zurich) - eu-central-2
<a name="eu-central-2-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Europe (Ireland) - eu-west-1
<a name="eu-west-1-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Europe (London) - eu-west-2
<a name="eu-west-2-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Europe (Milan) - eu-south-1
<a name="eu-south-1-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Europe (Spain) - eu-south-2
<a name="eu-south-2-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Europe (Paris) - eu-west-3
<a name="eu-west-3-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Europe (Stockholm) - eu-north-1
<a name="eu-north-1-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Israel (Tel Aviv) - il-central-1
<a name="il-central-1-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Middle East (Bahrain) - me-south-1
<a name="me-south-1-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Middle East (UAE) - me-central-1
<a name="me-central-1-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### South America (São Paulo) - sa-east-1
<a name="sa-east-1-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Thailand) - ap-southeast-7
<a name="ap-southeast-7-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Mexico (Central) - mx-central-1
<a name="mx-central-1-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (Taipei) - ap-east-2
<a name="ap-east-2-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

### Asia Pacific (New Zealand) - ap-southeast-6
<a name="ap-southeast-6-supported-instances"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)

## Previous generation instances
<a name="emr-supported-instance-types-previous-generation"></a>

Amazon EMR supports previous generation instances for applications that are optimized for these instances and have not yet been upgraded. For more information about these instance types and upgrade paths, see [Previous Generation Instances](https://aws.amazon.com/ec2/previous-generation).


| Instance class | Instance types | 
| --- | --- | 
|  General Purpose  |  m1.small¹, m1.medium¹, m1.large¹, m1.xlarge¹, m3.xlarge¹, m3.2xlarge¹, m4.large, m4.xlarge, m4.2xlarge, m4.4xlarge, m4.10xlarge, m4.16xlarge  | 
|  Compute Optimized  |  c1.medium¹ ², c1.xlarge¹, c3.xlarge¹, c3.2xlarge¹, c3.4xlarge¹, c3.8xlarge¹, c4.large, c4.xlarge, c4.2xlarge, c4.4xlarge, c4.8xlarge  | 
|  Memory Optimized  |  m2.xlarge¹, m2.2xlarge¹, m2.4xlarge¹, r3.xlarge, r3.2xlarge, r3.4xlarge, r3.8xlarge, r4.xlarge, r4.2xlarge, r4.4xlarge, r4.8xlarge, r4.16xlarge  | 
|  Storage Optimized  |  d2.xlarge, d2.2xlarge, d2.4xlarge, d2.8xlarge, i2.xlarge, i2.2xlarge, i2.4xlarge, i2.8xlarge  | 

¹ Uses PVM virtualization AMI with Amazon EMR release versions earlier than 5.13.0. For more information, see [Linux AMI Virtualization Types](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/virtualization_types.html).

² Not supported in release version 5.15.0.

# Instance purchasing options in Amazon EMR
<a name="emr-instance-purchasing-options"></a>

When you set up a cluster, you choose a purchasing option for Amazon EC2 instances. You can choose On-Demand Instances, Spot Instances, or both. Prices vary based on the instance type and Region. The Amazon EMR price is in addition to the Amazon EC2 price (the price for the underlying servers) and Amazon EBS price (if attaching Amazon EBS volumes). For current pricing, see [Amazon EMR Pricing](https://aws.amazon.com/emr/pricing).

Your choice to use instance groups or instance fleets in your cluster determines how you can change instance purchasing options while a cluster is running. If you choose uniform instance groups, you can only specify the purchasing option for an instance group when you create it, and the instance type and purchasing option apply to all Amazon EC2 instances in each instance group. If you choose instance fleets, you can change purchasing options after you create the instance fleet, and you can mix purchasing options to fulfill a target capacity that you specify. For more information about these configurations, see [Create an Amazon EMR cluster with instance fleets or uniform instance groups](emr-instance-group-configuration.md).

## On-Demand Instances
<a name="emr-instances-on-demand"></a>

With On-Demand Instances, you pay for compute capacity by the second. Optionally, you can have these On-Demand Instances use Reserved Instance or Dedicated Instance purchasing options. With Reserved Instances, you make a one-time payment for an instance to reserve capacity. Dedicated Instances are physically isolated at the host hardware level from instances that belong to other AWS accounts. For more information about purchasing options, see [Instance Purchasing Options](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-purchasing-options.html) in the *Amazon EC2 User Guide*.

### Using Reserved Instances
<a name="emr-instances-reserved"></a>

To use Reserved Instances in Amazon EMR, you use Amazon EC2 to purchase the Reserved Instance and specify the parameters of the reservation, including the scope of the reservation as applying to either a Region or an Availability Zone. For more information, see [Amazon EC2 Reserved Instances](https://aws.amazon.com/ec2/reserved-instances/) and [Buying Reserved Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ri-market-concepts-buying.html) in the *Amazon EC2 User Guide*. After you purchase a Reserved Instance, if all of the following conditions are true, Amazon EMR uses the Reserved Instance when a cluster launches:
+ An On-Demand Instance is specified in the cluster configuration that matches the Reserved Instance specification.
+ The cluster is launched within the scope of the instance reservation (the Availability Zone or Region).
+ The Reserved Instance capacity is still available.

For example, let's say you purchase one `m5.xlarge` Reserved Instance with the instance reservation scoped to the US-East Region. You then launch an Amazon EMR cluster in US-East that uses two `m5.xlarge` instances. The first instance is billed at the Reserved Instance rate and the other is billed at the On-Demand rate. Reserved Instance capacity is used before any On-Demand Instances are created.

### Using Dedicated Instances
<a name="emr-dedicated-instances"></a>

To use Dedicated Instances, you purchase Dedicated Instances using Amazon EC2 and then create a VPC with the **Dedicated** tenancy attribute. Within Amazon EMR, you then specify that a cluster should launch in this VPC. Any On-Demand Instances in the cluster that match the Dedicated Instance specification use available Dedicated Instances when the cluster launches.

**Note**  
Amazon EMR does not support setting the `dedicated` attribute on individual instances.

## Spot Instances
<a name="emr-spot-instances"></a>

Spot Instances in Amazon EMR provide an option for you to purchase Amazon EC2 instance capacity at a reduced cost as compared to On-Demand purchasing. The disadvantage of using Spot Instances is that instances may terminate if Spot capacity becomes unavailable for the instance type you are running. For more information about when using Spot Instances may be appropriate for your application, see [When should you use Spot Instances?](emr-plan-instances-guidelines.md#emr-plan-spot-instances).

When Amazon EC2 has unused capacity, it offers EC2 instances at a reduced cost, called the *Spot price*. This price fluctuates based on availability and demand, and is established by Region and Availability Zone. When you choose Spot Instances, you specify the maximum Spot price that you're willing to pay for each EC2 instance type. When the Spot price in the cluster's Availability Zone is below the maximum Spot price specified for that instance type, the instances launch. While instances run, you're charged the current Spot price, *not* your maximum Spot price.
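The launch-and-billing rule above can be sketched as a small function. This is illustrative only; real Spot behavior also involves capacity availability and interruption notices, not just price:

```python
# Illustrative sketch of the Spot pricing rule: instances launch while
# the current Spot price is below your maximum, and you are billed the
# current market price rather than the maximum you specified.
def spot_billed_rate(current_spot_price, max_price):
    """Return the hourly rate you pay, or None if no launch occurs."""
    if current_spot_price > max_price:
        return None  # no launch (or running instances are reclaimed)
    return current_spot_price  # billed at the market price, not your max

# Hypothetical prices: market at $0.035/hr, your maximum at $0.10/hr.
assert spot_billed_rate(0.035, 0.10) == 0.035
assert spot_billed_rate(0.12, 0.10) is None
```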

**Note**  
Spot Instances with a defined duration (also known as Spot blocks) are no longer available to new customers from July 1, 2021. For customers who have previously used the feature, we will continue to support Spot Instances with a defined duration until December 31, 2022.

For current pricing, see [Amazon EC2 Spot Instances Pricing](https://aws.amazon.com/ec2/spot/pricing/). For more information, see [Spot Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html) in the *Amazon EC2 User Guide*. When you create and configure a cluster, you specify network options that ultimately determine the Availability Zone where your cluster launches. For more information, see [Configure networking in a VPC for Amazon EMR](emr-plan-vpc-subnet.md). 

**Tip**  
You can see the real-time Spot price in the console when you hover over the information tooltip next to the **Spot** purchasing option when you create a cluster using **Advanced Options**. The prices for each Availability Zone in the selected Region are displayed. The lowest prices are in the green-colored rows. Because of fluctuating Spot prices between Availability Zones, selecting the Availability Zone with the lowest initial price might not result in the lowest price for the life of the cluster. For optimal results, study the history of Availability Zone pricing before choosing. For more information, see [Spot Instance Pricing History](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances-history.html) in the *Amazon EC2 User Guide*.

Spot Instance options depend on whether you use uniform instance groups or instance fleets in your cluster configuration.

****Spot Instances in uniform instance groups****  
When you use Spot Instances in a uniform instance group, all instances in the instance group must be Spot Instances. You specify a single subnet or Availability Zone for the cluster. For each instance group, you specify a single instance type and a maximum Spot price. Spot Instances of that type launch when the Spot price in the cluster's Region and Availability Zone is below the maximum Spot price, and terminate if the Spot price rises above your maximum Spot price. You set the maximum Spot price only when you configure an instance group; it can't be changed later. For more information, see [Create an Amazon EMR cluster with instance fleets or uniform instance groups](emr-instance-group-configuration.md).
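
With the AWS CLI, an instance group configuration can be supplied as JSON (for example, with `--instance-groups file://instancegroups.json`). The following sketch shows a Spot task instance group; the group name, instance type, count, and maximum Spot price are placeholders, not recommendations:

```
[
  {
    "Name": "SpotTaskGroup",
    "InstanceRole": "TASK",
    "InstanceType": "m5.xlarge",
    "InstanceCount": 4,
    "Market": "SPOT",
    "BidPrice": "0.10"
  }
]
```

`Market` and `BidPrice` are the API fields that mark the group as Spot and set the maximum Spot price; omitting `BidPrice` caps the price at the On-Demand rate.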

****Spot Instances in instance fleets****  
When you use the instance fleets configuration, additional options give you more control over how Spot Instances launch and terminate. Fundamentally, instance fleets use a different method than uniform instance groups to launch instances. With instance fleets, you establish a *target capacity* for Spot Instances (and On-Demand Instances) and specify up to five instance types. You can also specify a *weighted capacity* for each instance type, or use the instance type's vCPU count (YARN vcores) as its weighted capacity. This weighted capacity counts toward the target capacity when an instance of that type is provisioned. Amazon EMR provisions instances with both purchasing options until the target capacity for each purchasing option is fulfilled. In addition, you can define a range of Availability Zones for Amazon EMR to choose from when launching instances, and you can provide additional Spot options for each fleet, including a provisioning timeout. For more information, see [Planning and configuring instance fleets for your Amazon EMR cluster](emr-instance-fleet.md).
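
With the AWS CLI, an instance fleet configuration can likewise be supplied as JSON (for example, `--instance-fleets file://fleetconfig.json`). The following sketch shows a core fleet that targets both On-Demand and Spot capacity across two weighted instance types, with a Spot provisioning timeout; the fleet name, types, and capacity values are illustrative:

```
[
  {
    "Name": "CoreFleet",
    "InstanceFleetType": "CORE",
    "TargetOnDemandCapacity": 2,
    "TargetSpotCapacity": 6,
    "InstanceTypeConfigs": [
      { "InstanceType": "m5.xlarge",  "WeightedCapacity": 1 },
      { "InstanceType": "m5.2xlarge", "WeightedCapacity": 2 }
    ],
    "LaunchSpecifications": {
      "SpotSpecification": {
        "TimeoutDurationMinutes": 20,
        "TimeoutAction": "SWITCH_TO_ON_DEMAND"
      }
    }
  }
]
```

Here each provisioned `m5.2xlarge` counts as 2 units toward the fleet's target capacity, and if Spot capacity can't be provisioned within 20 minutes, the fleet falls back to On-Demand Instances.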

# Instance storage options and behavior in Amazon EMR
<a name="emr-plan-storage"></a>

## Overview
<a name="emr-plan-storage-ebs-storage-overview"></a>

Instance store and Amazon EBS volume storage are used for HDFS data and for buffers, caches, scratch data, and other temporary content that some applications might "spill" to the local file system.

Amazon EBS works differently within Amazon EMR than it does with regular Amazon EC2 instances. Amazon EBS volumes attached to Amazon EMR clusters are ephemeral: the volumes are deleted when the cluster or instance terminates (for example, when you shrink instance groups), so you shouldn't expect data to persist. Although the data is ephemeral, data in HDFS might be replicated, depending on the number and specialization of nodes in the cluster. When you add Amazon EBS storage volumes, they are mounted as additional volumes; they are not part of the boot volume. YARN is configured to use all the additional volumes, but you are responsible for allocating the additional volumes for local storage (for local log files, for example).

## Considerations
<a name="emr-plan-storage-ebs-storage-considerations"></a>

Keep in mind these additional considerations when you use Amazon EBS with EMR clusters:
+ You can't snapshot an Amazon EBS volume and then restore it within Amazon EMR. To create reusable custom configurations, use a custom AMI (available in Amazon EMR version 5.7.0 and later). For more information, see [Using a custom AMI to provide more flexibility for Amazon EMR cluster configuration](emr-custom-ami.md).
+ An encrypted Amazon EBS root device volume is supported only when using a custom AMI. For more information, see [Creating a custom AMI with an encrypted Amazon EBS root device volume](emr-custom-ami.md#emr-custom-ami-encrypted). 
+ If you apply tags using the Amazon EMR API, those operations are applied to EBS volumes.
+ There is a limit of 25 volumes per instance.
+ The Amazon EBS volumes on core nodes cannot be less than 5 GB.
+ Amazon EBS has a fixed limit of 2,500 EBS volumes per instance launch request. This limit also applies to Amazon EMR on EC2 clusters. We recommend that you launch clusters with the total number of EBS volumes within this limit, and then scale up the cluster manually or with Amazon EMR managed scaling as needed. To learn more about the EBS volume limit, see [Service quotas](https://docs.aws.amazon.com/general/latest/gr/ebs-service.html#limits_ebs:~:text=Amazon%20EBS%20has,exceeding%20the%20limit.).

## Default Amazon EBS storage for instances
<a name="emr-plan-storage-ebs-storage-default"></a>

For EC2 instances that have EBS-only storage, Amazon EMR allocates Amazon EBS gp2 or gp3 storage volumes to instances. When you create a cluster with Amazon EMR releases 5.22.0 and higher, the default amount of Amazon EBS storage increases relative to the size of the instance.

We split any increased storage across multiple volumes. This gives increased IOPS performance and, in turn, increased performance for some standardized workloads. If you want to use a different Amazon EBS instance storage configuration, you can specify this when you create an EMR cluster or add nodes to an existing cluster. You can use Amazon EBS gp2 or gp3 volumes as root volumes, and add gp2 or gp3 volumes as additional volumes. For more information, see [Specifying additional EBS storage volumes](#emr-plan-storage-additional-ebs-volumes).

The following table identifies the default number of Amazon EBS gp2 storage volumes, sizes, and total sizes per instance type. For information about gp2 volumes compared to gp3, see [Comparing Amazon EBS volume types gp2 and gp3](emr-plan-storage-compare-volume-types.md).


**Default Amazon EBS gp2 storage volumes and size by instance type for Amazon EMR 5.22.0 and higher**  

| Instance size | Number of volumes | Volume size (GiB) | Total size (GiB) | 
| --- | --- | --- | --- | 
|  \*.large  |  1  |  32  |  32  | 
|  \*.xlarge  |  2  |  32  |  64  | 
|  \*.2xlarge  |  4  |  32  |  128  | 
|  \*.4xlarge  |  4  |  64  |  256  | 
|  \*.8xlarge  |  4  |  128  |  512  | 
|  \*.9xlarge  |  4  |  144  |  576  | 
|  \*.10xlarge  |  4  |  160  |  640  | 
|  \*.12xlarge  |  4  |  192  |  768  | 
|  \*.16xlarge  |  4  |  256  |  1024  | 
|  \*.18xlarge  |  4  |  288  |  1152  | 
|  \*.24xlarge  |  4  |  384  |  1536  | 

## Default Amazon EBS root volume for instances
<a name="emr-plan-storage-ebs-root-volume"></a>

With Amazon EMR releases 6.15 and higher, Amazon EMR automatically attaches an Amazon EBS General Purpose SSD (gp3) as the root device for its AMIs to enhance performance. With earlier releases, Amazon EMR attaches EBS General Purpose SSD (gp2) as the root device.


|  | 6.15 and higher | 6.14 and lower | 
| --- | --- | --- | 
| Default root volume type |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html) | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html) | 
| Default size |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html)  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html)  | 
| Default IOPS |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html)  |   | 
| Default throughput |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html)  |   | 

For information on how to customize the Amazon EBS root device volume, see [Specifying additional EBS storage volumes](#emr-plan-storage-additional-ebs-volumes).

## Specifying additional EBS storage volumes
<a name="emr-plan-storage-additional-ebs-volumes"></a>

When you configure instance types in Amazon EMR, you can specify additional EBS volumes to add capacity beyond the instance store (if present) and the default EBS volume. Amazon EBS provides the following volume types: General Purpose (SSD), Provisioned IOPS (SSD), Throughput Optimized (HDD), Cold (HDD), and Magnetic. They differ in performance characteristics and price, so you can tailor your storage to the analytic and business needs of your applications. For example, some applications might need to spill to disk while others can safely work in-memory or with Amazon S3.

You can attach Amazon EBS volumes to instances only at cluster startup or when you add an extra task node instance group. If an instance in an Amazon EMR cluster fails, both the instance and its attached Amazon EBS volumes are replaced. Consequently, if you manually detach an Amazon EBS volume, Amazon EMR treats that as a failure and replaces both the instance storage (if applicable) and the attached volumes.

Amazon EMR doesn’t allow you to modify your volume type from gp2 to gp3 for an existing EMR cluster. To use gp3 for your workloads, launch a new EMR cluster. In addition, we don't recommend that you update the throughput and IOPS on a cluster that is in use or that is being provisioned, because Amazon EMR uses the throughput and IOPS values you specify at cluster launch time for any new instance that it adds during cluster scale-up. For more information, see [Comparing Amazon EBS volume types gp2 and gp3](emr-plan-storage-compare-volume-types.md) and [Selecting IOPS and throughput when migrating to gp3 Amazon EBS volume types](emr-plan-storage-gp3-migration-selection.md).

**Important**  
To use a gp3 volume with your EMR cluster, you must launch a new cluster.
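
With the AWS CLI, additional volumes are declared per instance group or instance type through an `EbsConfiguration` block (for example, inside an `--instance-groups` or `--instance-fleets` JSON file). A sketch follows; the volume size, IOPS, throughput, and volume count are placeholders you should size for your workload:

```
{
  "EbsConfiguration": {
    "EbsOptimized": true,
    "EbsBlockDeviceConfigs": [
      {
        "VolumeSpecification": {
          "VolumeType": "gp3",
          "SizeInGB": 200,
          "Iops": 3000,
          "Throughput": 125
        },
        "VolumesPerInstance": 2
      }
    ]
  }
}
```

`VolumesPerInstance` attaches the same volume specification multiple times to each instance in the group, which is how Amazon EMR spreads storage across volumes for higher aggregate IOPS.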

# Comparing Amazon EBS volume types gp2 and gp3
<a name="emr-plan-storage-compare-volume-types"></a>

Here is a comparison of cost between gp2 and gp3 volumes in the US East (N. Virginia) Region. For the most up to date information, see the [Amazon EBS General Purpose Volumes](https://aws.amazon.com/ebs/general-purpose/) product page and the [Amazon EBS Pricing Page](https://aws.amazon.com/ebs/pricing/).


| Volume type | gp3 | gp2 | 
| --- | --- | --- | 
| Volume size | 1 GiB – 16 TiB | 1 GiB – 16 TiB | 
| Default/Baseline IOPS | 3,000 | 3 IOPS/GiB (minimum 100 IOPS) to a maximum of 16,000 IOPS. Volumes smaller than 1 TiB can also burst up to 3,000 IOPS. | 
| Max IOPS/volume | 16,000 | 16,000 | 
| Default/Baseline throughput | 125 MiB/s | Throughput limit is between 128 MiB/s and 250 MiB/s, depending on the volume size. | 
| Max throughput/volume | 1,000 MiB/s | 250 MiB/s | 
| Price | \$0.08/GiB-month; 3,000 IOPS free and \$0.005/provisioned IOPS-month over 3,000; 125 MiB/s free and \$0.04/provisioned MiB/s-month over 125 MiB/s | \$0.10/GiB-month | 
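
The gp2 baseline IOPS rule in the table (3 IOPS per GiB, with a floor of 100 and a cap of 16,000) can be sketched as a small shell function; the function name is illustrative and not part of any AWS tooling:

```shell
# gp2 baseline IOPS: 3 IOPS per GiB, minimum 100, maximum 16,000.
# (Volumes smaller than 1 TiB can additionally burst to 3,000 IOPS.)
gp2_baseline_iops() {
  local size_gib=$1
  local iops=$(( size_gib * 3 ))
  if [ "$iops" -lt 100 ]; then iops=100; fi
  if [ "$iops" -gt 16000 ]; then iops=16000; fi
  echo "$iops"
}

gp2_baseline_iops 10     # small volume: the 100-IOPS floor applies
gp2_baseline_iops 500    # 500 GiB x 3 IOPS/GiB
gp2_baseline_iops 6000   # the 16,000-IOPS cap applies
```

This is why small gp2 volumes can be IOPS-starved after their burst credits run out, while a gp3 volume of any size starts at 3,000 IOPS.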

# Selecting IOPS and throughput when migrating to gp3 Amazon EBS volume types
<a name="emr-plan-storage-gp3-migration-selection"></a>

When provisioning a gp2 volume, you must choose a volume size large enough to get the IOPS and throughput you need, because gp2 performance scales with volume size. With gp3, you don't have to provision a bigger volume to get higher performance. You can choose your desired size and performance according to application need. Selecting the right size and the right performance parameters (IOPS, throughput) can provide maximum cost reduction without affecting performance.

Here is a table to help you select gp3 configuration options:


| Volume size | IOPS | Throughput | 
| --- | --- | --- | 
| 1–170 GiB | 3,000 | 125 MiB/s | 
| 170–334 GiB | 3,000 | 125 MiB/s if the chosen EC2 instance type supports 125 MiB/s or less; otherwise provision higher as needed, up to 250 MiB/s\*. | 
| 334–1,000 GiB | 3,000 | 125 MiB/s if the chosen EC2 instance type supports 125 MiB/s or less; otherwise provision higher as needed, up to 250 MiB/s\*. | 
| 1,000+ GiB | Match gp2 IOPS (size in GiB × 3), or the maximum IOPS driven by the current gp2 volume | 125 MiB/s if the chosen EC2 instance type supports 125 MiB/s or less; otherwise provision higher as needed, up to 250 MiB/s\*. | 

\*gp3 can provide throughput up to 1,000 MiB/s. Because gp2 provides a maximum of 250 MiB/s of throughput, you may not need to go beyond that limit when you use gp3. gp3 volumes deliver a consistent baseline throughput of 125 MiB/s, which is included with the price of storage. You can provision additional throughput (up to a maximum of 1,000 MiB/s) for an additional cost at a ratio of 0.25 MiB/s per provisioned IOPS. Maximum throughput can be provisioned at 4,000 IOPS or higher (4,000 IOPS × 0.25 MiB/s per IOPS = 1,000 MiB/s).
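
The 0.25 MiB/s-per-IOPS ratio means the throughput ceiling you can provision follows directly from the IOPS you provision. A small sketch (the helper name is illustrative only):

```shell
# Maximum gp3 throughput (MiB/s) that can be provisioned for a given IOPS
# value, at the documented ratio of 0.25 MiB/s per IOPS (i.e., IOPS / 4).
gp3_max_throughput_mib_s() {
  local iops=$1
  echo $(( iops / 4 ))
}

gp3_max_throughput_mib_s 3000   # at the 3,000-IOPS baseline
gp3_max_throughput_mib_s 4000   # reaches the 1,000 MiB/s per-volume maximum
```

So to provision the full 1,000 MiB/s of gp3 throughput, you must also provision at least 4,000 IOPS.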

# Configure networking in a VPC for Amazon EMR
<a name="emr-plan-vpc-subnet"></a>

Most clusters launch into a virtual network using Amazon Virtual Private Cloud (Amazon VPC). A VPC is a virtual network that is logically isolated within your AWS account. You can configure aspects such as private IP address ranges, subnets, routing tables, and network gateways. For more information, see the [Amazon VPC User Guide](https://docs.aws.amazon.com/vpc/latest/userguide/).

VPC offers the following capabilities:
+ **Processing sensitive data**

  Launching a cluster into a VPC is similar to launching the cluster into a private network with additional tools, such as routing tables and network ACLs, to define who has access to the network. If you are processing sensitive data in your cluster, you may want the additional access control that launching your cluster into a VPC provides. Furthermore, you can choose to launch your resources into a private subnet where none of those resources has direct internet connectivity.
+ **Accessing resources on an internal network**

  If your data source is located in a private network, it may be impractical or undesirable to upload that data to AWS for import into Amazon EMR, either because of the amount of data to transfer or because of the sensitive nature of the data. Instead, you can launch the cluster into a VPC and connect your data center to your VPC through a VPN connection, enabling the cluster to access resources on your internal network. For example, if you have an Oracle database in your data center, launching your cluster into a VPC connected to that network by VPN makes it possible for the cluster to access the Oracle database. 

****Public and private subnets****  
You can launch Amazon EMR clusters in both public and private VPC subnets. This means that you do not need internet connectivity to run an Amazon EMR cluster. However, you may need to configure network address translation (NAT) and VPN gateways to access services or resources located outside the VPC, such as a corporate intranet or public AWS service endpoints like AWS Key Management Service.

**Important**  
Amazon EMR only supports launching clusters in private subnets in release version 4.2 and later.

For more information about Amazon VPC, see the [Amazon VPC User Guide](https://docs.aws.amazon.com/vpc/latest/userguide/).

**Topics**
+ [Amazon VPC options when you launch a cluster](emr-clusters-in-a-vpc.md)
+ [Set up a VPC to host Amazon EMR clusters](emr-vpc-host-job-flows.md)
+ [Launch clusters into a VPC with Amazon EMR](emr-vpc-launching-job-flows.md)
+ [Sample policies for private subnets that access Amazon S3](private-subnet-iampolicy.md)
+ [More resources for learning about VPCs](#emr-resources-about-vpcs)

# Amazon VPC options when you launch a cluster
<a name="emr-clusters-in-a-vpc"></a>



When you launch an Amazon EMR cluster within a VPC, you can launch it into a public, private, or shared subnet. There are slight but notable differences in configuration, depending on the subnet type you choose for a cluster.

## Public subnets
<a name="emr-vpc-public-subnet"></a>

EMR clusters in a public subnet require a connected internet gateway, because Amazon EMR clusters must access AWS services and the Amazon EMR service. If a service, such as Amazon S3, provides the ability to create a VPC endpoint, you can access that service through the endpoint instead of reaching a public endpoint through an internet gateway. Additionally, Amazon EMR cannot communicate with clusters in public subnets through a network address translation (NAT) device; an internet gateway is required for this purpose, but you can still use a NAT instance or gateway for other traffic in more complex scenarios.

All instances in a cluster connect to Amazon S3 through either a VPC endpoint or the internet gateway. Other AWS services that do not currently support VPC endpoints can be reached only through the internet gateway.

If you have additional AWS resources that you do not want connected to the internet gateway, you can launch those components in a private subnet that you create within your VPC. 

Clusters running in a public subnet use two security groups: one for the primary node and another for core and task nodes. For more information, see [Control network traffic with security groups for your Amazon EMR cluster](emr-security-groups.md).

The following diagram shows how an Amazon EMR cluster runs in a VPC using a public subnet. The cluster is able to connect to other AWS resources, such as Amazon S3 buckets, through the internet gateway.

![\[Cluster on a VPC\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/vpc_default_v3a.png)


The following diagram shows how to set up a VPC so that a cluster in the VPC can access resources in your own network, such as an Oracle database.

![\[Set up a VPC and cluster to access local VPN resources\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/vpc_withVPN_v3a.png)


## Private subnets
<a name="emr-vpc-private-subnet"></a>

A private subnet lets you launch AWS resources without requiring the subnet to have an attached internet gateway. Amazon EMR supports launching clusters in private subnets with release versions 4.2.0 or later.

**Note**  
When you set up an Amazon EMR cluster in a private subnet, we recommend that you also set up [VPC endpoints for Amazon S3](https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-s3.html). If your EMR cluster is in a private subnet without VPC endpoints for Amazon S3, you will incur additional NAT gateway charges that are associated with S3 traffic because the traffic between your EMR cluster and S3 will not stay within your VPC.
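
One way to add the recommended gateway endpoint is with the AWS CLI; in this sketch, the VPC ID, route table ID, and Region are placeholders you must replace with your own values:

```
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0abc123 \
    --vpc-endpoint-type Gateway \
    --service-name com.amazonaws.us-east-1.s3 \
    --route-table-ids rtb-0abc123
```

A gateway endpoint keeps traffic between the cluster and Amazon S3 inside your VPC, so it avoids the NAT gateway data-processing charges mentioned above.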

Private subnets differ from public subnets in the following ways:
+ To access AWS services that do not provide a VPC endpoint, you still must use a NAT instance or an internet gateway.
+ At a minimum, you must provide a route to the Amazon EMR service logs bucket and Amazon Linux repository in Amazon S3. For more information, see [Sample policies for private subnets that access Amazon S3](private-subnet-iampolicy.md)
+ If you use EMRFS features, you need to have an Amazon S3 VPC endpoint and a route from your private subnet to DynamoDB.
+ Debugging only works if you provide a route from your private subnet to a public Amazon SQS endpoint.
+ Creating a private subnet configuration with a NAT instance or gateway in a public subnet is only supported using the AWS Management Console. The easiest way to add and configure NAT instances and Amazon S3 VPC endpoints for Amazon EMR clusters is to use the **VPC Subnets List** page in the Amazon EMR console. To configure NAT gateways, see [NAT Gateways](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html) in the *Amazon VPC User Guide*.
+ You cannot change a subnet with an existing Amazon EMR cluster from public to private or vice versa. To locate an Amazon EMR cluster within a private subnet, the cluster must be started in that private subnet. 

Amazon EMR creates and uses different default security groups for the clusters in a private subnet: ElasticMapReduce-Master-Private, ElasticMapReduce-Slave-Private, and ElasticMapReduce-ServiceAccess. For more information, see [Control network traffic with security groups for your Amazon EMR cluster](emr-security-groups.md).

For a complete listing of the security group rules for your cluster, choose **Security groups for Primary** and **Security groups for Core & Task** on the Amazon EMR console **Cluster Details** page.

The following image shows how an Amazon EMR cluster is configured within a private subnet. The only communication outside the subnet is to Amazon EMR. 

![\[Launch an Amazon EMR cluster in a private subnet\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/vpc_with_private_subnet_v3a.png)


The following image shows a sample configuration for an Amazon EMR cluster within a private subnet connected to a NAT instance that is residing in a public subnet.

![\[Private subnet with NAT\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/vpc_private_subnet_nat_v3a.png)


## Shared subnets
<a name="emr-vpc-shared-subnet"></a>

VPC sharing allows customers to share subnets with other AWS accounts within the same AWS Organization. You can launch Amazon EMR clusters into both public shared and private shared subnets, with the following caveats.

The subnet owner must share a subnet with you before you can launch an Amazon EMR cluster into it. However, shared subnets can later be unshared. For more information, see [Working with Shared VPCs](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-sharing.html). When a cluster is launched into a shared subnet and that shared subnet is then unshared, you can observe specific behaviors based on the state of the Amazon EMR cluster when the subnet is unshared.
+ Subnet is unshared *before* the cluster is successfully launched - If the owner stops sharing the Amazon VPC or subnet while the participant is launching a cluster, the cluster could fail to start or be partially initialized without provisioning all requested instances. 
+ Subnet is unshared *after* the cluster is successfully launched - When the owner stops sharing a subnet or Amazon VPC with the participant, the participant's clusters will not be able to resize to add new instances or to replace unhealthy instances.

When you launch an Amazon EMR cluster, multiple security groups are created. In a shared subnet, the subnet participant controls these security groups. The subnet owner can see these security groups but cannot perform any actions on them. If the subnet owner wants to remove or modify the security group, the participant that created the security group must take the action.

## Control VPC permissions with IAM
<a name="emr-iam-on-vpc"></a>

By default, all users can see all of the subnets for the account, and any user can launch a cluster in any subnet. 

When you launch a cluster into a VPC, you can use AWS Identity and Access Management (IAM) to control access to clusters and restrict actions using policies, just as you would with clusters launched into Amazon EC2 Classic. For more information about IAM, see [IAM User Guide](https://docs.aws.amazon.com/IAM/latest/UserGuide/). 

You can also use IAM to control who can create and administer subnets. For example, you can create an IAM role to administer subnets, and a second role that can launch clusters but cannot modify Amazon VPC settings. For more information about administering policies and actions in Amazon EC2 and Amazon VPC, see [IAM Policies for Amazon EC2](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-policies-for-amazon-ec2.html) in the *Amazon EC2 User Guide*. 

# Set up a VPC to host Amazon EMR clusters
<a name="emr-vpc-host-job-flows"></a>

Before you can launch clusters in a VPC, you must create a VPC and a subnet. For public subnets, you must create an internet gateway and attach it to the subnet. The following instructions describe how to create a VPC capable of hosting Amazon EMR clusters. 

**To create a VPC with subnets for an Amazon EMR cluster**

1. Open the Amazon VPC console at [https://console.aws.amazon.com/vpc/](https://console.aws.amazon.com/vpc/).

1. On the top-right of the page, choose the [AWS Region](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html) for your VPC.

1. Choose **Create VPC**.

1. On the **VPC settings** page, choose **VPC and more**.

1. Under **Name tag auto-generation**, enable **Auto-generate** and enter a name for your VPC. This helps you to identify the VPC and subnet in the Amazon VPC console after you've created them.

1. In the **IPv4 CIDR block** field, enter a private IP address range for your VPC. Using a private range helps ensure proper DNS hostname resolution; otherwise, you may experience Amazon EMR cluster failures. The private IP address ranges are: 
   + 10.0.0.0 - 10.255.255.255
   + 172.16.0.0 - 172.31.255.255
   + 192.168.0.0 - 192.168.255.255

1. Under **Number of Availability Zones (AZs)**, choose the number of Availability Zones you want to launch your subnets in.

1. Under **Number of public subnets**, choose the number of public subnets to add to your VPC. If the data used by the cluster is available on the internet (for example, in Amazon S3 or Amazon RDS), a single public subnet is sufficient and you don't need to add a private subnet.

1. Under **Number of private subnets**, choose the number of private subnets you want to add to your VPC. Select one or more if the data for your application is stored in your own network (for example, in an Oracle database). For a cluster in a private subnet, all Amazon EC2 instances must at minimum have a route to Amazon EMR through the elastic network interface. In the console, this is automatically configured for you.

1. Under **NAT gateways**, optionally choose to add NAT gateways. They are only necessary if you have private subnets that need to communicate with the internet.

1. Under **VPC endpoints**, optionally choose to add endpoints for Amazon S3 to your subnets.

1. Verify that **Enable DNS hostnames** and **Enable DNS resolution** are selected. For more information, see [Using DNS with your VPC](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html).

1. Choose **Create VPC**.

1. A status window shows the work in progress. When the work completes, choose **View VPC** to navigate to the **Your VPCs** page, which displays your default VPC and the VPC that you just created. The VPC that you created is a nondefault VPC, therefore the **Default VPC** column displays **No**. 

1. If you want to associate your VPC with a DNS entry that does not include a domain name, navigate to **DHCP option sets**, choose **Create DHCP options set**, and omit a domain name. After you create your option set, navigate to your new VPC, choose **Edit DHCP options set** under the **Actions** menu, and select the new option set. You cannot edit the domain name using the console after the DNS option set has been created. 

   It is a best practice with Hadoop and related applications to ensure resolution of the fully qualified domain name (FQDN) for nodes. To ensure proper DNS resolution, configure a VPC that includes a DHCP options set whose parameters are set to the following values:
   + **domain-name** = **ec2.internal**

     Use **ec2.internal** if your Region is US East (N. Virginia). For other Regions, use *region-name***.compute.internal**. For examples in `us-west-2`, use **us-west-2.compute.internal**. For the AWS GovCloud (US-West) Region, use **us-gov-west-1.compute.internal**.
   + **domain-name-servers** = **AmazonProvidedDNS**

   For more information, see [DHCP options sets](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_DHCP_Options.html) in the *Amazon VPC User Guide*.

1. After the VPC is created, go to the **Subnets** page and note the **Subnet ID** of one of the subnets of your new VPC. You use this information when you launch the Amazon EMR cluster into the VPC.
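
The DHCP options set described in the steps above can also be created and associated with the AWS CLI; the option set ID and VPC ID below are placeholders, and the domain name should match your Region:

```
aws ec2 create-dhcp-options \
    --dhcp-configurations \
        "Key=domain-name,Values=us-west-2.compute.internal" \
        "Key=domain-name-servers,Values=AmazonProvidedDNS"

aws ec2 associate-dhcp-options \
    --dhcp-options-id dopt-0abc123 \
    --vpc-id vpc-0abc123
```

The first command returns the `DhcpOptionsId` that you pass to the second command.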

# Launch clusters into a VPC with Amazon EMR
<a name="emr-vpc-launching-job-flows"></a>

After you have a subnet that is configured to host Amazon EMR clusters, launch the cluster in that subnet by specifying the associated subnet identifier when creating the cluster.

**Note**  
Amazon EMR supports private subnets in release versions 4.2 and above.

When the cluster is launched, Amazon EMR adds security groups based on whether the cluster is launching into VPC private or public subnets. All security groups allow ingress at port 8443 to communicate to the Amazon EMR service, but IP address ranges vary for public and private subnets. Amazon EMR manages all of these security groups, and may need to add additional IP addresses to the AWS range over time. For more information, see [Control network traffic with security groups for your Amazon EMR cluster](emr-security-groups.md).

To manage the cluster in a VPC, Amazon EMR attaches a network device to the primary node and manages the cluster through this device. You can view this device using the Amazon EC2 API action [https://docs.aws.amazon.com/AWSEC2/latest/APIReference/ApiReference-query-DescribeInstances.html](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/ApiReference-query-DescribeInstances.html). If you modify this device in any way, the cluster may fail.

------
#### [ Console ]

**To launch a cluster into a VPC with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. Under **Networking**, go to the **Virtual private cloud (VPC)** field. Enter the name of your VPC or choose **Browse** to select your VPC. Alternatively, choose **Create VPC** to create a VPC that you can use for your cluster.

1. Choose any other options that apply to your cluster.

1. To launch your cluster, choose **Create cluster**.

------
#### [ AWS CLI ]

**To launch a cluster into a VPC with the AWS CLI**
**Note**  
The AWS CLI does not provide a way to create a NAT instance automatically and connect it to your private subnet. However, you can use the Amazon VPC CLI commands to create an S3 endpoint in your subnet. Use the console to create NAT instances and launch clusters in a private subnet.

After your VPC is configured, you can launch Amazon EMR clusters in it by using the `create-cluster` subcommand with the `--ec2-attributes` parameter. Use the `--ec2-attributes` parameter to specify the VPC subnet for your cluster.
+ To create a cluster in a specific subnet, type the following command, replace *myKey* with the name of your Amazon EC2 key pair, and replace *77XXXX03* with your subnet ID.

  ```
  aws emr create-cluster --name "Test cluster" --release-label emr-4.2.0 --applications Name=Hadoop Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey,SubnetId=subnet-77XXXX03 --instance-type m5.xlarge --instance-count 3
  ```

  When you specify the instance count without using the `--instance-groups` parameter, a single primary node is launched, and the remaining instances are launched as core nodes. All nodes use the instance type specified in the command.
**Note**  
If you have not previously created the default Amazon EMR service role and EC2 instance profile, type `aws emr create-default-roles` to create them before typing the `create-cluster` subcommand.

------

## Ensuring available IP addresses for an EMR cluster on EC2
<a name="emr-vpc-launching-job-flows-ip-availability"></a>

To ensure that a subnet with enough free IP addresses is available when you launch a cluster, EC2 subnet selection checks IP availability. The creation process uses a subnet with the necessary number of IP addresses to launch the core, primary, and task nodes the cluster requires, even if only core nodes are created initially. Amazon EMR checks the number of IP addresses required to launch primary and task nodes during creation, and separately calculates the number of IP addresses needed to launch core nodes. The minimum number of primary and task instances required is determined automatically by Amazon EMR.

**Important**  
If no subnets in the VPC have enough available IPs to accommodate essential nodes, an error is returned and the cluster isn't created.

In most deployments, there is a time difference between the launches of core, primary, and task nodes. Additionally, multiple clusters can share a subnet. In these cases, IP address availability can fluctuate, and later launches, such as task node launches, can be limited by the IP addresses available at that time.
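The arithmetic behind the check is straightforward: AWS reserves five IP addresses in every subnet, and the remainder must cover both the addresses already in use and the nodes being launched. A rough sketch, with a hypothetical in-use count:

```python
import ipaddress

# AWS reserves the first four addresses and the last address in every
# subnet, so they are never available to instances.
AWS_RESERVED_PER_SUBNET = 5

def subnet_has_capacity(cidr, addresses_in_use, nodes_to_launch):
    """Return True if the subnet can still fit nodes_to_launch instances."""
    total = ipaddress.ip_network(cidr).num_addresses
    usable = total - AWS_RESERVED_PER_SUBNET
    return usable - addresses_in_use >= nodes_to_launch

# A /28 has 16 addresses, 11 of them usable after the AWS reservation.
# With 4 already in use, 7 nodes fit but 8 do not.
print(subnet_has_capacity("10.0.1.0/28", addresses_in_use=4, nodes_to_launch=7))  # True
print(subnet_has_capacity("10.0.1.0/28", addresses_in_use=4, nodes_to_launch=8))  # False
```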

# Sample policies for private subnets that access Amazon S3
<a name="private-subnet-iampolicy"></a>

For private subnets, at a minimum you must allow Amazon EMR to access the Amazon Linux repositories. This private subnet policy is part of the VPC endpoint policy for accessing Amazon S3.

With Amazon EMR 5.25.0 or later, to enable one-click access to persistent Spark history server, you must allow Amazon EMR to access the system bucket that collects Spark event logs. If you enable logging, provide PUT permissions to the following bucket: 

```
aws157-logs-${AWS::Region}/*
```

For more information, see [One-click access to persistent Spark History Server](https://docs.aws.amazon.com/emr/latest/ManagementGuide/app-history-spark-UI.html).

It is up to you to determine the policy restrictions that meet your business needs. The following example policy provides permissions to access the Amazon Linux repositories and the Amazon EMR system bucket that collects Spark event logs, and shows sample resource names for the buckets.

For more information about using IAM policies with Amazon VPC endpoints, see [Endpoint policies for Amazon S3](https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-s3.html#vpc-endpoints-policies-s3).

The following policy example contains sample resources in the us-east-1 region.

------
#### [ JSON ]

****  

```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Sid": "AmazonLinuxAMIRepositoryAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::packages.us-east-1.amazonaws.com/*",
        "arn:aws:s3:::repo.us-east-1.amazonaws.com/*"
      ]
    },
    {
      "Sid": "EnableApplicationHistory",
      "Effect": "Allow",
      "Action": [
        "s3:Put*",
        "s3:Get*",
        "s3:Create*",
        "s3:Abort*",
        "s3:List*"
      ],
      "Resource": [
        "arn:aws:s3:::prod.us-east-1.appinfo.src/*"
      ]
    }
  ]
}
```

------

The following example policy provides the permissions required to access Amazon Linux 2 repositories in the us-east-1 region.

```
{
   "Statement": [
       {
           "Sid": "AmazonLinux2AMIRepositoryAccess",
           "Effect": "Allow",
           "Principal": "*",
           "Action": "s3:GetObject",
           "Resource": [
           	"arn:aws:s3:::amazonlinux.us-east-1.amazonaws.com/*",
           	"arn:aws:s3:::amazonlinux-2-repos-us-east-1/*"
           ]
       }
   ]
}
```

The following example policy provides the permissions required to access Amazon Linux 2023 repositories in the us-east-1 region.

```
{
    "Statement": [
        {
            "Sid": "AmazonLinux2023AMIRepositoryAccess",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": [
                "arn:aws:s3:::al2023-repos-us-east-1-de612dc2/*"
            ]
        }
    ]
}
```
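Because the repository bucket names differ only by region (plus a per-region suffix for the Amazon Linux 2023 bucket, `de612dc2` in us-east-1), a policy statement for any region can be assembled programmatically. A minimal sketch; the statement covers only the plain regional repository buckets shown above:

```python
import json

# Sketch: build a minimal S3 endpoint policy statement for the regional
# Amazon Linux 2 and Amazon Linux 2023 repository buckets. The AL2023
# bucket name carries a per-region suffix (de612dc2 in us-east-1), so it
# is passed in explicitly rather than derived.
def linux_repo_statement(region, al2023_suffix):
    return {
        "Sid": "AmazonLinuxRepositoryAccess",
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": [
            f"arn:aws:s3:::amazonlinux-2-repos-{region}/*",
            f"arn:aws:s3:::al2023-repos-{region}-{al2023_suffix}/*",
        ],
    }

policy = {
    "Version": "2012-10-17",
    "Statement": [linux_repo_statement("us-east-1", "de612dc2")],
}
print(json.dumps(policy, indent=2))
```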

## Available regions
<a name="private-subnet-iampolicy-regions"></a>

The following table lists the buckets by region. Each row includes the Amazon Resource Names (ARNs) for the repository buckets and the ARN for the `appinfo.src` bucket. An ARN is a string that uniquely identifies an AWS resource.


| Region | Repository buckets | AppInfo bucket | 
| --- | --- | --- | 
| US East (Ohio) | "arn:aws:s3:::packages.us-east-2.amazonaws.com/\*","arn:aws:s3:::repo.us-east-2.amazonaws.com/\*","arn:aws:s3:::repo.us-east-2.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.us-east-2.appinfo.src/\*" | 
| US East (N. Virginia) | "arn:aws:s3:::packages.us-east-1.amazonaws.com/\*","arn:aws:s3:::repo.us-east-1.amazonaws.com/\*","arn:aws:s3:::repo.us-east-1.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.us-east-1.appinfo.src/\*" | 
| US West (N. California) | "arn:aws:s3:::packages.us-west-1.amazonaws.com/\*","arn:aws:s3:::repo.us-west-1.amazonaws.com/\*","arn:aws:s3:::repo.us-west-1.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.us-west-1.appinfo.src/\*" | 
| US West (Oregon) | "arn:aws:s3:::packages.us-west-2.amazonaws.com/\*","arn:aws:s3:::repo.us-west-2.amazonaws.com/\*","arn:aws:s3:::repo.us-west-2.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.us-west-2.appinfo.src/\*" | 
| Africa (Cape Town) | "arn:aws:s3:::packages.af-south-1.amazonaws.com/\*","arn:aws:s3:::repo.af-south-1.amazonaws.com/\*","arn:aws:s3:::repo.af-south-1.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.af-south-1.appinfo.src/\*" | 
| Asia Pacific (Hong Kong) | "arn:aws:s3:::packages.ap-east-1.amazonaws.com/\*","arn:aws:s3:::repo.ap-east-1.amazonaws.com/\*","arn:aws:s3:::repo.ap-east-1.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.ap-east-1.appinfo.src/\*" | 
| Asia Pacific (Hyderabad) | "arn:aws:s3:::packages.ap-south-2.amazonaws.com/\*","arn:aws:s3:::repo.ap-south-2.amazonaws.com/\*","arn:aws:s3:::repo.ap-south-2.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.ap-south-2.appinfo.src/\*" | 
| Asia Pacific (Jakarta) | "arn:aws:s3:::packages.ap-southeast-3.amazonaws.com/\*","arn:aws:s3:::repo.ap-southeast-3.amazonaws.com/\*","arn:aws:s3:::repo.ap-southeast-3.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.ap-southeast-3.appinfo.src/\*" | 
| Asia Pacific (Malaysia) | "arn:aws:s3:::packages.ap-southeast-5.amazonaws.com/\*","arn:aws:s3:::repo.ap-southeast-5.amazonaws.com/\*","arn:aws:s3:::repo.ap-southeast-5.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.ap-southeast-5.appinfo.src/\*" | 
| Asia Pacific (Melbourne) | "arn:aws:s3:::packages.ap-southeast-4.amazonaws.com/\*","arn:aws:s3:::repo.ap-southeast-4.amazonaws.com/\*","arn:aws:s3:::repo.ap-southeast-4.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.ap-southeast-4.appinfo.src/\*" | 
| Asia Pacific (Mumbai) | "arn:aws:s3:::packages.ap-south-1.amazonaws.com/\*","arn:aws:s3:::repo.ap-south-1.amazonaws.com/\*","arn:aws:s3:::repo.ap-south-1.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.ap-south-1.appinfo.src/\*" | 
| Asia Pacific (Osaka) | "arn:aws:s3:::packages.ap-northeast-3.amazonaws.com/\*","arn:aws:s3:::repo.ap-northeast-3.amazonaws.com/\*","arn:aws:s3:::repo.ap-northeast-3.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.ap-northeast-3.appinfo.src/\*" | 
| Asia Pacific (Seoul) | "arn:aws:s3:::packages.ap-northeast-2.amazonaws.com/\*","arn:aws:s3:::repo.ap-northeast-2.amazonaws.com/\*","arn:aws:s3:::repo.ap-northeast-2.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.ap-northeast-2.appinfo.src/\*" | 
| Asia Pacific (Singapore) | "arn:aws:s3:::packages.ap-southeast-1.amazonaws.com/\*","arn:aws:s3:::repo.ap-southeast-1.amazonaws.com/\*","arn:aws:s3:::repo.ap-southeast-1.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.ap-southeast-1.appinfo.src/\*" | 
| Asia Pacific (Sydney) | "arn:aws:s3:::packages.ap-southeast-2.amazonaws.com/\*","arn:aws:s3:::repo.ap-southeast-2.amazonaws.com/\*","arn:aws:s3:::repo.ap-southeast-2.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.ap-southeast-2.appinfo.src/\*" | 
| Asia Pacific (Tokyo) | "arn:aws:s3:::packages.ap-northeast-1.amazonaws.com/\*","arn:aws:s3:::repo.ap-northeast-1.amazonaws.com/\*","arn:aws:s3:::repo.ap-northeast-1.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.ap-northeast-1.appinfo.src/\*" | 
| Canada (Central) | "arn:aws:s3:::packages.ca-central-1.amazonaws.com/\*","arn:aws:s3:::repo.ca-central-1.amazonaws.com/\*","arn:aws:s3:::repo.ca-central-1.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.ca-central-1.appinfo.src/\*" | 
| Canada West (Calgary) | "arn:aws:s3:::packages.ca-west-1.amazonaws.com/\*","arn:aws:s3:::repo.ca-west-1.amazonaws.com/\*","arn:aws:s3:::repo.ca-west-1.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.ca-west-1.appinfo.src/\*" | 
| Europe (Frankfurt) | "arn:aws:s3:::packages.eu-central-1.amazonaws.com/\*","arn:aws:s3:::repo.eu-central-1.amazonaws.com/\*","arn:aws:s3:::repo.eu-central-1.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.eu-central-1.appinfo.src/\*" | 
| Europe (Ireland) | "arn:aws:s3:::packages.eu-west-1.amazonaws.com/\*","arn:aws:s3:::repo.eu-west-1.amazonaws.com/\*","arn:aws:s3:::repo.eu-west-1.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.eu-west-1.appinfo.src/\*" | 
| Europe (London) | "arn:aws:s3:::packages.eu-west-2.amazonaws.com/\*","arn:aws:s3:::repo.eu-west-2.amazonaws.com/\*","arn:aws:s3:::repo.eu-west-2.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.eu-west-2.appinfo.src/\*" | 
| Europe (Milan) | "arn:aws:s3:::packages.eu-south-1.amazonaws.com/\*","arn:aws:s3:::repo.eu-south-1.amazonaws.com/\*","arn:aws:s3:::repo.eu-south-1.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.eu-south-1.appinfo.src/\*" | 
| Europe (Paris) | "arn:aws:s3:::packages.eu-west-3.amazonaws.com/\*","arn:aws:s3:::repo.eu-west-3.amazonaws.com/\*","arn:aws:s3:::repo.eu-west-3.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.eu-west-3.appinfo.src/\*" | 
| Europe (Spain) | "arn:aws:s3:::packages.eu-south-2.amazonaws.com/\*","arn:aws:s3:::repo.eu-south-2.amazonaws.com/\*","arn:aws:s3:::repo.eu-south-2.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.eu-south-2.appinfo.src/\*" | 
| Europe (Stockholm) | "arn:aws:s3:::packages.eu-north-1.amazonaws.com/\*","arn:aws:s3:::repo.eu-north-1.amazonaws.com/\*","arn:aws:s3:::repo.eu-north-1.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.eu-north-1.appinfo.src/\*" | 
| Europe (Zurich) | "arn:aws:s3:::packages.eu-central-2.amazonaws.com/\*","arn:aws:s3:::repo.eu-central-2.amazonaws.com/\*","arn:aws:s3:::repo.eu-central-2.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.eu-central-2.appinfo.src/\*" | 
| Israel (Tel Aviv) | "arn:aws:s3:::packages.il-central-1.amazonaws.com/\*","arn:aws:s3:::repo.il-central-1.amazonaws.com/\*","arn:aws:s3:::repo.il-central-1.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.il-central-1.appinfo.src/\*" | 
| Middle East (Bahrain) | "arn:aws:s3:::packages.me-south-1.amazonaws.com/\*","arn:aws:s3:::repo.me-south-1.amazonaws.com/\*","arn:aws:s3:::repo.me-south-1.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.me-south-1.appinfo.src/\*" | 
| Middle East (UAE) | "arn:aws:s3:::packages.me-central-1.amazonaws.com/\*","arn:aws:s3:::repo.me-central-1.amazonaws.com/\*","arn:aws:s3:::repo.me-central-1.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.me-central-1.appinfo.src/\*" | 
| South America (São Paulo) | "arn:aws:s3:::packages.sa-east-1.amazonaws.com/\*","arn:aws:s3:::repo.sa-east-1.amazonaws.com/\*","arn:aws:s3:::repo.sa-east-1.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.sa-east-1.appinfo.src/\*" | 
| AWS GovCloud (US-East) | "arn:aws:s3:::packages.us-gov-east-1.amazonaws.com/\*","arn:aws:s3:::repo.us-gov-east-1.amazonaws.com/\*","arn:aws:s3:::repo.us-gov-east-1.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.us-gov-east-1.appinfo.src/\*" | 
| AWS GovCloud (US-West) | "arn:aws:s3:::packages.us-gov-west-1.amazonaws.com/\*","arn:aws:s3:::repo.us-gov-west-1.amazonaws.com/\*","arn:aws:s3:::repo.us-gov-west-1.emr.amazonaws.com/\*" | "arn:aws:s3:::prod.us-gov-west-1.appinfo.src/\*" | 
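The AppInfo bucket names in the table follow one pattern per region code, so the ARN for any region can be derived. A sketch; partition handling for AWS GovCloud regions is omitted for brevity:

```python
# Derive the appinfo.src bucket ARN from a standard AWS region code.
# The pattern matches the table above: prod.<region>.appinfo.src.
def appinfo_bucket_arn(region):
    return f"arn:aws:s3:::prod.{region}.appinfo.src/*"

print(appinfo_bucket_arn("eu-west-1"))
# arn:aws:s3:::prod.eu-west-1.appinfo.src/*
```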

## More resources for learning about VPCs
<a name="emr-resources-about-vpcs"></a>

Use the following topics to learn more about VPCs and subnets.
+ Private Subnets in a VPC
  + [Scenario 2: VPC with Public and Private Subnets (NAT)](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Scenario2.html)
  + [NAT Instances](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_NAT_Instance.html)
  + [High Availability for Amazon VPC NAT Instances: An Example](https://aws.amazon.com/articles/2781451301784570)
+ Public Subnets in a VPC
  + [Scenario 1: VPC with a Single Public Subnet](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Scenario1.html)
+ General VPC Information
  + [Amazon VPC User Guide](https://docs.aws.amazon.com/vpc/latest/userguide/)
  + [VPC Peering](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-peering.html)
  + [Using Elastic Network Interfaces with Your VPC](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_ElasticNetworkInterfaces.html)
  + [Securely connect to Linux instances running in a private VPC](https://blogs.aws.amazon.com/security/post/Tx3N8GFK85UN1G6/Securely-connect-to-Linux-instances-running-in-a-private-Amazon-VPC)

# Create an Amazon EMR cluster with instance fleets or uniform instance groups
<a name="emr-instance-group-configuration"></a>

When you create a cluster and specify the configuration of the primary node, core nodes, and task nodes, you have two configuration options: *instance fleets* or *uniform instance groups*. The configuration option you choose applies to all nodes and for the lifetime of the cluster; instance fleets and instance groups cannot coexist in a cluster. The instance fleets configuration is available in Amazon EMR releases 4.8.0 and later, excluding 5.0.x versions. 

You can use the Amazon EMR console, the AWS CLI, or the Amazon EMR API to create clusters with either configuration. When you use the `create-cluster` command from the AWS CLI, you use either the `--instance-fleets` parameters to create the cluster using instance fleets or, alternatively, you use the `--instance-groups` parameters to create it using uniform instance groups.

The same is true using the Amazon EMR API. You use either the `InstanceGroups` configuration to specify an array of `InstanceGroupConfig` objects, or you use the `InstanceFleets` configuration to specify an array of `InstanceFleetConfig` objects.
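In request terms, the two configurations correspond to different array shapes passed under the `Instances` parameter. A minimal sketch of both shapes, with illustrative instance types and capacities, plus the mutual-exclusion rule:

```python
# Sketch of the two mutually exclusive shapes the RunJobFlow Instances
# parameter can take. Instance types and counts here are illustrative.
instance_groups = [
    {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
    {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
]

instance_fleets = [
    {"InstanceFleetType": "MASTER",
     "TargetOnDemandCapacity": 1,
     "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}]},
    {"InstanceFleetType": "CORE",
     "TargetOnDemandCapacity": 2,
     "TargetSpotCapacity": 2,
     "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"},
                             {"InstanceType": "m4.xlarge"}]},
]

def validate_instances(instances):
    # A cluster uses one configuration or the other, never both.
    if "InstanceGroups" in instances and "InstanceFleets" in instances:
        raise ValueError("Specify InstanceGroups or InstanceFleets, not both")
    return instances

validate_instances({"InstanceFleets": instance_fleets})
```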

In the new Amazon EMR console, you can choose to use either instance groups or instance fleets when you create a cluster, and you have the option to use Spot Instances with each. With the old Amazon EMR console, if you use the default **Quick Options** settings when you create your cluster, Amazon EMR applies the uniform instance groups configuration to the cluster and uses On-Demand Instances. To use Spot Instances with uniform instance groups, or to configure instance fleets and other customizations, choose **Advanced Options**.

## Instance fleets
<a name="emr-plan-instance-fleets"></a>

The instance fleets configuration offers the widest variety of provisioning options for Amazon EC2 instances. Each node type has a single instance fleet, and using a task instance fleet is optional. You can specify up to five EC2 instance types per fleet, or up to 30 EC2 instance types per fleet when you create a cluster using the AWS CLI or Amazon EMR API with an [allocation strategy](emr-instance-fleet.md#emr-instance-fleet-allocation-strategy) for On-Demand and Spot Instances. For the core and task instance fleets, you assign one *target capacity* for On-Demand Instances and another for Spot Instances. Amazon EMR chooses any mix of the specified instance types to fulfill the target capacities, provisioning both On-Demand and Spot Instances.

For the primary node type, Amazon EMR chooses a single instance type from your list of instances, and you specify whether it's provisioned as an On-Demand or Spot Instance. Instance fleets also provide additional options for Spot Instance and On-Demand purchases. Spot Instance options include a timeout that specifies an action to take if Spot capacity can't be provisioned, and a preferred allocation strategy (capacity-optimized) for launching Spot Instance fleets. On-Demand Instance fleets can also be launched using the allocation strategy (lowest-price) option. If you use a service role that is not the EMR default service role, or use an EMR managed policy in your service role, you need to add additional permissions to the custom cluster service role to enable the allocation strategy option. For more information, see [Service role for Amazon EMR (EMR role)](emr-iam-role.md).

For more information about configuring instance fleets, see [Planning and configuring instance fleets for your Amazon EMR cluster](emr-instance-fleet.md).

## Uniform instance groups
<a name="emr-plan-instance-groups"></a>

Uniform instance groups offer a simpler setup than instance fleets. Each Amazon EMR cluster can include up to 50 instance groups: one primary instance group that contains one Amazon EC2 instance, a core instance group that contains one or more EC2 instances, and up to 48 optional task instance groups. Each core and task instance group can contain any number of Amazon EC2 instances. You can scale each instance group by adding and removing Amazon EC2 instances manually, or you can set up automatic scaling. For information about adding and removing instances, see [Use Amazon EMR cluster scaling to adjust for changing workloads](emr-scale-on-demand.md).

For more information about configuring uniform instance groups, see [Configure uniform instance groups for your Amazon EMR cluster](emr-uniform-instance-group.md). 

## Working with instance fleets and instance groups
<a name="emr-plan-instance-topics"></a>

**Topics**
+ [Instance fleets](#emr-plan-instance-fleets)
+ [Uniform instance groups](#emr-plan-instance-groups)
+ [Working with instance fleets and instance groups](#emr-plan-instance-topics)
+ [Planning and configuring instance fleets for your Amazon EMR cluster](emr-instance-fleet.md)
+ [Reconfiguring instance fleets for your Amazon EMR cluster](instance-fleet-reconfiguration.md)
+ [Use capacity reservations with instance fleets in Amazon EMR](on-demand-capacity-reservations.md)
+ [Configure uniform instance groups for your Amazon EMR cluster](emr-uniform-instance-group.md)
+ [Availability Zone flexibility for an Amazon EMR cluster](emr-flexibility.md)
+ [Configuring Amazon EMR cluster instance types and best practices for Spot instances](emr-plan-instances-guidelines.md)

# Planning and configuring instance fleets for your Amazon EMR cluster
<a name="emr-instance-fleet"></a>

**Note**  
The instance fleets configuration is available only in Amazon EMR releases 4.8.0 and later, excluding 5.0.0 and 5.0.3.

The instance fleet configuration for Amazon EMR clusters lets you select a wide variety of provisioning options for Amazon EC2 instances, and helps you develop a flexible and elastic resourcing strategy for each node type in your cluster. 

In an instance fleet configuration, you specify a *target capacity* for [On-Demand Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-on-demand-instances.html) and [Spot Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html) within each fleet. When the cluster launches, Amazon EMR provisions instances until the targets are fulfilled. When Amazon EC2 reclaims a Spot Instance in a running cluster because of a price increase or instance failure, Amazon EMR tries to replace the instance with any of the instance types that you specify. This makes it easier to regain capacity during a spike in Spot pricing. 

You can specify a maximum of five Amazon EC2 instance types per fleet for Amazon EMR to use when fulfilling the targets, or a maximum of 30 Amazon EC2 instance types per fleet when you create a cluster using the AWS CLI or Amazon EMR API and an [allocation strategy](#emr-instance-fleet-allocation-strategy) for On-Demand and Spot Instances. 

You can also select multiple subnets for different Availability Zones. When Amazon EMR launches the cluster, it looks across those subnets to find the instances and purchasing options you specify. If Amazon EMR detects an AWS large-scale event in one or more of the Availability Zones, Amazon EMR automatically attempts to route traffic away from the impacted Availability Zones and tries to launch new clusters that you create in alternate Availability Zones according to your selections. Note that cluster Availability Zone selection happens only at cluster creation. Existing cluster nodes are not automatically re-launched in a new Availability Zone in the event of an Availability Zone outage.

## **Considerations for working with instance fleets**
<a name="emr-key-feature-summary"></a>

Consider the following items when you use instance fleets with Amazon EMR.
+ You can have one instance fleet, and only one, per node type (primary, core, task). You can specify up to five Amazon EC2 instance types for each fleet on the AWS Management Console (or a maximum of 30 types per instance fleet when you create a cluster using the AWS CLI or Amazon EMR API and an [Allocation strategy for instance fleets](#emr-instance-fleet-allocation-strategy)). 
+ Amazon EMR chooses any or all of the specified Amazon EC2 instance types to provision with both Spot and On-Demand purchasing options.
+ You can establish target capacities for Spot and On-Demand Instances for the core fleet and task fleet. Targets are counted in vCPUs, or in generic units that you assign to each Amazon EC2 instance type. Amazon EMR provisions instances until each target capacity is completely fulfilled. For the primary fleet, the target is always one.
+ You can choose one subnet (Availability Zone) or a range. If you choose a range, Amazon EMR provisions capacity in the Availability Zone that is the best fit.
+ When you specify a target capacity for Spot Instances:
  + For each instance type, specify a maximum Spot price. Amazon EMR provisions Spot Instances if the Spot price is below the maximum Spot price. You pay the Spot price, not necessarily the maximum Spot price.
  + For each fleet, define a timeout period for provisioning Spot Instances. If Amazon EMR can't provision Spot capacity within that period, you can have it terminate the cluster or switch to provisioning On-Demand capacity instead. The timeout applies only when provisioning a cluster, not when resizing one. If the timeout period ends during a cluster resize, unfulfilled Spot requests are canceled without switching to On-Demand capacity. 
+ For each fleet, you can specify one of the following allocation strategies for your Spot Instances: price-capacity optimized, capacity-optimized, capacity-optimized-prioritized, lowest-price, or diversified across all pools.
+ For each fleet, you can apply the following allocation strategies for your On-Demand Instances: the lowest-price strategy or the prioritized strategy.
+ For each fleet with On-Demand Instances, you can choose to apply capacity reservation options.
+ If you use allocation strategy for instance fleets, the following considerations apply when you choose subnets for your EMR cluster:
  + When Amazon EMR provisions a cluster with a task fleet, it filters out subnets that lack enough available IP addresses to provision all instances of the requested EMR cluster. This includes IP addresses required for the primary, core, and task instance fleets during cluster launch. Amazon EMR then leverages its allocation strategy to determine the instance pool, based on instance type and remaining subnets with sufficient IP addresses, to launch the cluster.
  + If Amazon EMR cannot launch the whole cluster due to insufficient available IP addresses, it will attempt to identify subnets with enough free IP addresses to launch the essential (core and primary) instance fleets. In such scenarios, your task instance fleet will go into a suspended state, rather than terminating the cluster with an error.
  + If none of the specified subnets contain enough IP addresses to provision the essential core and primary instance fleets, the cluster launch fails with a **VALIDATION\_ERROR**. This triggers a **CRITICAL** severity cluster termination event, notifying you that the cluster cannot be launched. To prevent this issue, we recommend increasing the number of available IP addresses in your subnets.
+ If you run Amazon EMR release **emr-7.7.0** and above, and you use allocation strategy for instance fleets, you can scale the cluster up to 4000 EC2 instances and 14000 EBS volumes per instance fleet. For release versions below **emr-7.7.0**, the cluster can be scaled up only to 2000 EC2 instances and 7000 EBS volumes per instance fleet.
+ When you launch On-Demand Instances, you can use open or targeted capacity reservations for primary, core, and task nodes in your accounts. You might see insufficient capacity with On-Demand Instances with allocation strategy for instance fleets. We recommend that you specify multiple instance types to diversify and reduce the chance of experiencing insufficient capacity. For more information, see [Use capacity reservations with instance fleets in Amazon EMR](on-demand-capacity-reservations.md).

## Instance fleet options
<a name="emr-instance-fleet-options"></a>

Use the following guidelines to understand instance fleet options.

**Topics**
+ [**Setting target capacities**](#emr-fleet-capacity)
+ [**Launch options**](#emr-fleet-spot-options)
+ [**Multiple subnet (Availability Zones) options**](#emr-multiple-subnet-options)
+ [**Master node configuration**](#emr-master-node-configuration)

### **Setting target capacities**
<a name="emr-fleet-capacity"></a>

Specify the target capacities that you want for the core fleet and task fleet; these determine the number of On-Demand Instances and Spot Instances that Amazon EMR provisions. When you specify an instance type, you decide how much each instance counts toward the target. When an On-Demand Instance is provisioned, it counts toward the On-Demand target; the same is true for Spot Instances. Unlike core and task fleets, the primary fleet always consists of one instance, so its target capacity is always one. 

When you use the console, the vCPUs of the Amazon EC2 instance type are used as the count for target capacities by default. You can change this to **Generic units**, and then specify the count for each EC2 instance type. When you use the AWS CLI, you manually assign generic units for each instance type. 

**Important**  
When you choose an instance type using the AWS Management Console, the number of **vCPU** shown for each **Instance type** is the number of YARN vcores for that instance type, not the number of EC2 vCPUs for that instance type. For more information on the number of vCPUs for each instance type, see [Amazon EC2 Instance Types](https://aws.amazon.com/ec2/instance-types/).

For each fleet, you specify up to five Amazon EC2 instance types. If you use an [Allocation strategy for instance fleets](#emr-instance-fleet-allocation-strategy) and create a cluster using the AWS CLI or the Amazon EMR API, you can specify up to 30 EC2 instance types per instance fleet. Amazon EMR chooses any combination of these EC2 instance types to fulfill your target capacities. Because Amazon EMR wants to fill target capacity completely, an overage might happen. For example, if there are two unfulfilled units, and Amazon EMR can only provision an instance with a count of five units, the instance still gets provisioned, meaning that the target capacity is exceeded by three units. 

If you reduce the target capacity to resize a running cluster, Amazon EMR attempts to complete application tasks and terminates instances to meet the new target. For more information, see [Terminate at task completion](emr-scaledown-behavior.md#emr-scaledown-terminate-task).
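The overage behavior can be sketched as a greedy fill: Amazon EMR keeps provisioning until the target is met, so the last instance can push the total past the target. The 5-unit weight below is hypothetical:

```python
# Greedy sketch of how a target capacity can be overshot. Provisioning
# continues while the target is unfulfilled, even when the next instance
# would exceed it.
def provision(target_units, units_per_instance):
    launched = []
    while sum(launched) < target_units:
        launched.append(units_per_instance)
    return launched

# Target of 12 units with only 5-unit instances: after two instances,
# 2 units remain unfulfilled, a third 5-unit instance is still
# provisioned, and the target is exceeded by 3.
launched = provision(12, 5)
print(sum(launched), sum(launched) - 12)  # prints "15 3"
```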

### **Launch options**
<a name="emr-fleet-spot-options"></a>

For Spot Instances, you can specify a **Maximum Spot price** for each instance type in a fleet. You can set this price either as a percentage of the On-Demand price, or as a specific dollar amount. Amazon EMR provisions Spot Instances if the current Spot price in an Availability Zone is below your maximum Spot price. You pay the Spot price, not necessarily the maximum Spot price.

**Note**  
Spot Instances with a defined duration (also known as Spot blocks) are no longer available to new customers from July 1, 2021. For customers who have previously used the feature, we will continue to support Spot Instances with a defined duration until December 31, 2022.

With Amazon EMR 5.12.1 and later, you have the option to launch Spot and On-Demand Instance fleets with optimized capacity allocation. You can set this allocation strategy option in the old AWS Management Console or with the `RunJobFlow` API; you can't customize the allocation strategy in the new console. Using the allocation strategy option requires additional service role permissions. If you use the default Amazon EMR service role and managed policy ([`EMR_DefaultRole`](emr-iam-role.md) and `AmazonEMRServicePolicy_v2`) for the cluster, the permissions for the allocation strategy option are already included. If you don't use the default Amazon EMR service role and managed policy, you must add the permissions to use this option. See [Service role for Amazon EMR (EMR role)](emr-iam-role.md).

For more information about Spot Instances, see [Spot Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html) in the Amazon EC2 User Guide. For more information about On-Demand Instances, see [On-Demand Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-on-demand-instances.html) in the Amazon EC2 User Guide.

If you choose to launch On-Demand Instance fleets with the lowest-price allocation strategy, you have the option to use capacity reservations. You can set capacity reservation options with the Amazon EMR `RunJobFlow` API. Capacity reservations require additional service role permissions, which you must add to use these options. See [Required IAM permissions for an allocation strategy](#create-cluster-allocation-policy). Note that you can't customize capacity reservations in the new console.

### **Multiple subnet (Availability Zones) options**
<a name="emr-multiple-subnet-options"></a>

When you use instance fleets, you can specify multiple Amazon EC2 subnets within a VPC, each corresponding to a different Availability Zone. If you use EC2-Classic, you specify Availability Zones explicitly. Amazon EMR identifies the best Availability Zone in which to launch instances, according to your fleet specifications. Instances are always provisioned in a single Availability Zone. You can select private subnets or public subnets, but you can't mix the two, and the subnets you specify must be in the same VPC.

### **Primary node configuration**
<a name="emr-master-node-configuration"></a>

Because the primary instance fleet consists of a single instance, its configuration differs slightly from core and task instance fleets. You select either On-Demand or Spot for the primary instance fleet. If you use the console to create the instance fleet, the target capacity for the purchasing option you select is set to 1. If you use the AWS CLI, always set either `TargetSpotCapacity` or `TargetOnDemandCapacity` to 1, as appropriate. You can still choose up to five instance types for the primary instance fleet (or up to 30 when you use the allocation strategy option for On-Demand or Spot Instances). However, unlike core and task instance fleets, where Amazon EMR might provision multiple instances of different types, Amazon EMR selects a single instance type to provision for the primary instance fleet.

## Allocation strategy for instance fleets
<a name="emr-instance-fleet-allocation-strategy"></a>

With Amazon EMR versions 5.12.1 and later, you can use the allocation strategy option with On-Demand and Spot Instances for each cluster node. When you create a cluster using the AWS CLI, Amazon EMR API, or Amazon EMR console with an allocation strategy, you can specify up to 30 Amazon EC2 instance types per fleet. With the default Amazon EMR cluster instance fleet configuration, you can have up to 5 instance types per fleet. We recommend that you use the allocation strategy option for faster cluster provisioning, more accurate Spot Instance allocation, and fewer Spot Instance interruptions.

**Topics**
+ [Allocation strategy with On-Demand Instances](#emr-instance-fleet-allocation-strategy-od)
+ [Allocation strategy with Spot Instances](#emr-instance-fleet-allocation-strategy-spot)
+ [Allocation strategy permissions](#emr-instance-fleet-allocation-strategy-permissions)
+ [Required IAM permissions for an allocation strategy](#create-cluster-allocation-policy)

### Allocation strategy with On-Demand Instances
<a name="emr-instance-fleet-allocation-strategy-od"></a>

The following allocation strategies are available for your On-Demand Instances:

**`lowest-price` (default)**  
The lowest-price allocation strategy launches On-Demand Instances from the lowest-priced pool that has available capacity. If the lowest-priced pool doesn't have available capacity, the On-Demand Instances come from the next lowest-priced pool that has available capacity.

`prioritized`  
The prioritized allocation strategy lets you specify a priority value for each instance type in your instance fleet. Amazon EMR launches On-Demand Instances from the instance type with the highest priority first. If you use this strategy, you must configure the priority for at least one instance type. If you don't configure a priority value for an instance type, Amazon EMR assigns that instance type the lowest priority. Each instance fleet (primary, core, or task) in a cluster can have a different priority value for a given instance type.

**Note**  
If you use the **capacity-optimized-prioritized** Spot allocation strategy, Amazon EMR applies the same priorities to both your On-Demand and Spot Instances when you set priorities.
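As a hedged sketch of how priorities might be supplied with the AWS CLI, assuming the `Priority` field on `InstanceTypeConfig` (where 0 is the highest priority) and the `prioritized` value for the On-Demand allocation strategy; the instance types and release label are placeholders:

```
# Sketch: prioritized On-Demand allocation for a core fleet.
# Priority 0 is the highest; values and instance types are placeholders.
aws emr create-cluster --release-label emr-6.10.0 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=2,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,Priority=0}','{InstanceType=m4.xlarge,Priority=1}'],\
LaunchSpecifications={OnDemandSpecification='{AllocationStrategy=prioritized}'}
```

With this sketch, Amazon EMR would attempt to fill the core fleet with `m5.xlarge` first, falling back to `m4.xlarge` only if needed.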

### Allocation strategy with Spot Instances
<a name="emr-instance-fleet-allocation-strategy-spot"></a>

For *Spot Instances*, you can choose one of the following allocation strategies:

**`price-capacity-optimized`** (recommended)  
The price-capacity-optimized allocation strategy launches Spot Instances from the Spot Instance pools that have the highest available capacity and the lowest price for the number of instances that are launching. As a result, this strategy typically has a higher chance of getting Spot capacity and delivers lower interruption rates. This is the default strategy for Amazon EMR releases 6.10.0 and higher.

**`capacity-optimized`**  
The capacity-optimized allocation strategy launches Spot Instances into the most available pools with the lowest chance of interruption in the near term. This is a good option for workloads that might have a higher cost of interruption associated with work that gets restarted. This is the default strategy for Amazon EMR releases 6.9.0 and lower.

**`capacity-optimized-prioritized`**  
The capacity-optimized-prioritized allocation strategy lets you specify a priority value for each instance type in your instance fleet. Amazon EMR optimizes for capacity first, but honors instance type priorities on a best-effort basis, for example when honoring priorities doesn't significantly affect the fleet's ability to provision optimal capacity. We recommend this option for workloads that need to minimize disruption but also prefer certain instance types. If you use this strategy, you must configure the priority for at least one instance type. If you don't configure a priority for an instance type, Amazon EMR assigns that instance type the lowest priority. Each instance fleet (primary, core, or task) in a cluster can have a different priority value for a given instance type.  
If you use the **prioritized** On-Demand allocation strategy, Amazon EMR applies the same priority value to both your On-Demand and Spot Instances when you set priorities.

**`diversified`**  
With the diversified allocation strategy, Amazon EC2 distributes Spot Instances across all Spot capacity pools.

**`lowest-price`**  
The lowest-price allocation strategy launches Spot Instances from the lowest-priced pool that has available capacity. If the lowest-priced pool doesn't have available capacity, the Spot Instances come from the next lowest-priced pool that has available capacity. If a pool runs out of capacity before it fulfills your requested capacity, the Amazon EC2 fleet draws from the next lowest-priced pool to continue to fulfill your request. To ensure that your desired capacity is met, you might receive Spot Instances from several pools. Because this strategy considers only instance price and not capacity availability, it might lead to high interruption rates.
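To request one of these strategies with the AWS CLI, you can set `AllocationStrategy` inside `SpotSpecification`. The following is a sketch with placeholder subnet, instance types, and capacities, not a definitive invocation:

```
# Sketch: price-capacity-optimized Spot allocation for a task fleet.
# Subnet ID, instance types, and capacities are placeholders.
aws emr create-cluster --release-label emr-6.10.0 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,SubnetIds=['subnet-ab12345c'] \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge}'] \
    InstanceFleetType=TASK,TargetSpotCapacity=4,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge}','{InstanceType=m5a.xlarge}'],\
LaunchSpecifications={SpotSpecification='{AllocationStrategy=price-capacity-optimized,TimeoutDurationMinutes=60,TimeoutAction=TERMINATE_CLUSTER}'}
```

Specifying several interchangeable instance types, as above, gives the strategy more Spot pools to choose from.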

### Allocation strategy permissions
<a name="emr-instance-fleet-allocation-strategy-permissions"></a>

The allocation strategy option requires several IAM permissions that are automatically included in the default Amazon EMR service role and Amazon EMR managed policy (`EMR_DefaultRole` and `AmazonEMRServicePolicy_v2`). If you use a custom service role or managed policy for your cluster, you must add these permissions before you create the cluster. For more information, see [Required IAM permissions for an allocation strategy](#create-cluster-allocation-policy).

Optional On-Demand Capacity Reservations (ODCRs) are available when you use the On-Demand allocation strategy option. Capacity reservation options let you specify a preference for using reserved capacity first for Amazon EMR clusters. You can use this to ensure that your critical workloads use the capacity you have already reserved using open or targeted ODCRs. For non-critical workloads, the capacity reservation preferences let you specify whether reserved capacity should be consumed.

Capacity reservations can only be used by instances that match their attributes (instance type, platform, and Availability Zone). By default, open capacity reservations are automatically used by Amazon EMR when provisioning On-Demand Instances that match the instance attributes. If you don't have any running instances that match the attributes of the capacity reservations, they remain unused until you launch an instance matching their attributes. If you don't want to use any capacity reservations when launching your cluster, you must set capacity reservation preference to **none** in launch options.

However, you can also target a capacity reservation for specific workflows. This enables you to explicitly control which instances are allowed to run in that reserved capacity. For more information about On-Demand capacity reservations, see [Use capacity reservations with instance fleets in Amazon EMR](on-demand-capacity-reservations.md).
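For illustration, an `OnDemandSpecification` that prefers reserved capacity from a targeted resource group might look like the following sketch; the resource group ARN is a placeholder:

```
"OnDemandSpecification": {
    "AllocationStrategy": "lowest-price",
    "CapacityReservationOptions": {
        "UsageStrategy": "use-capacity-reservations-first",
        "CapacityReservationResourceGroupArn": "arn:aws:resource-groups:us-east-1:123456789012:group/my-odcr-group"
    }
}
```

To opt out of reservations instead, you would replace `CapacityReservationOptions` with `"CapacityReservationPreference": "none"`, as shown in the JSON configuration file example later on this page.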

### Required IAM permissions for an allocation strategy
<a name="create-cluster-allocation-policy"></a>

Your [Service role for Amazon EMR (EMR role)](emr-iam-role.md) requires additional permissions to create a cluster that uses the allocation strategy option for On-Demand or Spot Instance fleets.

We automatically include these permissions in the default Amazon EMR service role [`EMR_DefaultRole`](emr-iam-role.md) and the Amazon EMR managed policy [`AmazonEMRServicePolicy_v2`](emr-managed-iam-policies.md).

If you use a custom service role or managed policy for your cluster, you must add the following permissions:

------
#### [ JSON ]


```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DeleteLaunchTemplate",
        "ec2:CreateLaunchTemplate",
        "ec2:DescribeLaunchTemplates",
        "ec2:CreateLaunchTemplateVersion",
        "ec2:CreateFleet"
      ],
      "Resource": [
        "*"
      ],
      "Sid": "AllowEC2Deletelaunchtemplate"
    }
  ]
}
```

------

The following service role permissions are required to create a cluster that uses open or targeted capacity reservations. You must include these permissions in addition to the permissions required for using the allocation strategy option.

**Example Policy document for service role capacity reservations**  
To use open capacity reservations, you must include the following additional permissions.    

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeCapacityReservations",
        "ec2:DescribeLaunchTemplateVersions",
        "ec2:DeleteLaunchTemplateVersions"
      ],
      "Resource": [
        "*"
      ],
      "Sid": "AllowEC2Describecapacityreservations"
    }
  ]
}
```

**Example**  
To use targeted capacity reservations, you must include the following additional permissions.    

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeCapacityReservations",
        "ec2:DescribeLaunchTemplateVersions",
        "ec2:DeleteLaunchTemplateVersions",
        "resource-groups:ListGroupResources"
      ],
      "Resource": [
        "*"
      ],
      "Sid": "AllowEC2Describecapacityreservations"
    }
  ]
}
```

## Configure instance fleets for your cluster
<a name="emr-instance-fleet-console"></a>

------
#### [ Console ]

**To create a cluster with instance fleets with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and choose **Create cluster**.

1. Under **Cluster configuration**, choose **Instance fleets**.

1. For each **Node group**, select **Add instance type** and choose up to 5 instance types for primary and core instance fleets, and up to 15 instance types for task instance fleets. Amazon EMR might provision any mix of these instance types when it launches the cluster.

1. Under each node group type, choose the **Actions** dropdown menu next to each instance to change these settings:  
**Add EBS volumes**  
Specify EBS volumes to attach to the instance type after Amazon EMR provisions it.  
**Edit weighted capacity**  
For the core node group, change this value to any number of units that fits your applications. The number of YARN vCores for each fleet instance type is used as the default weighted capacity units. You can't edit weighted capacity for the primary node.  
**Edit maximum Spot price**  
Specify a maximum Spot price for each instance type in a fleet. You can set this price either as a percentage of the On-Demand price, or as a specific dollar amount. If the current Spot price in an Availability Zone is below your maximum Spot price, Amazon EMR provisions Spot Instances. You pay the Spot price, not necessarily the maximum Spot price.

1. Optionally, to add security groups for your nodes, expand **EC2 security groups (firewall)** in the **Networking** section and select your security group for each node type.

1. Optionally, select the check box next to **Apply allocation strategy** if you want to use the allocation strategy option, and select the allocation strategy that you want to specify for the Spot Instances. You shouldn't select this option if your Amazon EMR service role doesn't have the required permissions. For more information, see [Allocation strategy for instance fleets](#emr-instance-fleet-allocation-strategy).

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.
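The **Edit weighted capacity** setting determines how many units each provisioned instance contributes toward the fleet's target capacity. The arithmetic can be sketched as follows; the weights and instance mix are hypothetical, not a recommendation:

```
# Hypothetical core fleet with TargetSpotCapacity=11 and two weighted types.
target=11     # target capacity in units
m5_weight=3   # units counted per m5.xlarge instance
m4_weight=5   # units counted per m4.2xlarge instance

# One mix that satisfies the target exactly: 2 m5.xlarge + 1 m4.2xlarge.
provisioned=$(( 2 * m5_weight + 1 * m4_weight ))
echo "provisioned units: $provisioned"

# Amazon EMR provisions instances until the target is met or exceeded,
# so a fleet can overshoot: 4 m5.xlarge = 12 units also satisfies 11.
overshoot=$(( 4 * m5_weight ))
echo "overshoot units: $overshoot"
```

Because provisioning stops when the target is met or exceeded, larger weights can cause the fleet to overshoot the target by up to one instance's weight.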

------
#### [ AWS CLI ]

To create and launch a cluster with instance fleets with the AWS CLI, follow these guidelines:
+ To create and launch a cluster with instance fleets, use the `create-cluster` command along with `--instance-fleet` parameters.
+ To get configuration details about the instance fleets in a cluster, use the `list-instance-fleets` command.
+ To add multiple custom Amazon Linux AMIs to a cluster you’re creating, use the `CustomAmiId` option with each `InstanceType` specification. You can configure instance fleet nodes with multiple instance types and multiple custom AMIs to fit your requirements. See [Examples: Creating a cluster with the instance fleets configuration](#create-cluster-instance-fleet-cli). 
+ To make changes to the target capacity for an instance fleet, use the `modify-instance-fleet` command.
+ To add a task instance fleet to a cluster that doesn't already have one, use the `add-instance-fleet` command.
+ To add multiple custom AMIs to the task instance fleet, use the `CustomAmiId` argument with the `add-instance-fleet` command. See [Examples: Creating a cluster with the instance fleets configuration](#create-cluster-instance-fleet-cli).
+ To use the allocation strategy option when creating an instance fleet, update the service role to include the example policy document in the following section.
+ To use the capacity reservations options when creating an instance fleet with On-Demand allocation strategy, update the service role to include the example policy document in the following section.
+ The permissions that instance fleets require are automatically included in the default Amazon EMR service role and managed policy (`EMR_DefaultRole` and `AmazonEMRServicePolicy_v2`). If you are using a custom service role or custom managed policy for your cluster, you must add the new permissions for allocation strategy in the following section.

------

## Examples: Creating a cluster with the instance fleets configuration
<a name="create-cluster-instance-fleet-cli"></a>

The following examples demonstrate `create-cluster` commands with a variety of options that you can combine.

**Note**  
If you have not previously created the default Amazon EMR service role and EC2 instance profile, use `aws emr create-default-roles` to create them before using the `create-cluster` command.

**Example: On-Demand primary, On-Demand core with single instance type, default VPC**  

```
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge}']
```

**Example: Spot primary, Spot core with single instance type, default VPC**  

```
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetSpotCapacity=1,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5}'] \
    InstanceFleetType=CORE,TargetSpotCapacity=1,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5}']
```

**Example: On-Demand primary, mixed core with single instance type, single EC2 subnet**  

```
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,SubnetIds=['subnet-ab12345c'] \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=2,TargetSpotCapacity=6,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5,WeightedCapacity=2}']
```

**Example: On-Demand primary, Spot core with multiple weighted instance types, timeout for Spot, range of EC2 subnets**  

```
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,SubnetIds=['subnet-ab12345c','subnet-de67890f'] \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge}'] \
    InstanceFleetType=CORE,TargetSpotCapacity=11,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5,WeightedCapacity=3}',\
'{InstanceType=m4.2xlarge,BidPrice=0.9,WeightedCapacity=5}'],\
LaunchSpecifications={SpotSpecification='{TimeoutDurationMinutes=120,TimeoutAction=SWITCH_TO_ON_DEMAND}'}
```

**Example: On-Demand primary, mixed core and task with multiple weighted instance types, timeout for core Spot Instances, range of EC2 subnets**  

```
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,SubnetIds=['subnet-ab12345c','subnet-de67890f'] \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=8,TargetSpotCapacity=6,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5,WeightedCapacity=3}',\
'{InstanceType=m4.2xlarge,BidPrice=0.9,WeightedCapacity=5}'],\
LaunchSpecifications={SpotSpecification='{TimeoutDurationMinutes=120,TimeoutAction=SWITCH_TO_ON_DEMAND}'} \
    InstanceFleetType=TASK,TargetOnDemandCapacity=3,TargetSpotCapacity=3,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5,WeightedCapacity=3}']
```

**Example: Spot primary, no core or task, Amazon EBS configuration, default VPC**  

```
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetSpotCapacity=1,\
LaunchSpecifications={SpotSpecification='{TimeoutDurationMinutes=60,TimeoutAction=TERMINATE_CLUSTER}'},\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5,\
EbsConfiguration={EbsOptimized=true,EbsBlockDeviceConfigs=[{VolumeSpecification={VolumeType=gp2,\
SizeInGB=100}},{VolumeSpecification={VolumeType=io1,SizeInGB=100,Iops=100},VolumesPerInstance=4}]}}']
```

**Example: Multiple custom AMIs, multiple instance types, On-Demand primary, On-Demand core**  

```
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,CustomAmiId=ami-123456}','{InstanceType=m6g.xlarge,CustomAmiId=ami-234567}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=1,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,CustomAmiId=ami-123456}','{InstanceType=m6g.xlarge,CustomAmiId=ami-234567}']
```

**Example: Add a task node to a running cluster with multiple instance types and multiple custom AMIs**  

```
aws emr add-instance-fleet --cluster-id j-123456 \
  --instance-fleet \
    InstanceFleetType=TASK,TargetSpotCapacity=1,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,CustomAmiId=ami-123456}',\
'{InstanceType=m6g.xlarge,CustomAmiId=ami-234567}']
```

**Example: Use a JSON configuration file**  
You can configure instance fleet parameters in a JSON file, and then reference the JSON file as the sole parameter for instance fleets. For example, the following command references a JSON configuration file, `my-fleet-config.json`:  

```
aws emr create-cluster --release-label emr-5.30.0 --service-role EMR_DefaultRole \
--ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
--instance-fleets file://my-fleet-config.json
```
The *my-fleet-config.json* file specifies primary, core, and task instance fleets as shown in the following example. The primary and task instance fleets specify a maximum Spot price (`BidPrice`) as a string in USD, while the core instance fleet specifies the maximum Spot price as a percentage of the On-Demand price (`BidPriceAsPercentageOfOnDemandPrice`).  

```
[
    {
        "Name": "Masterfleet",
        "InstanceFleetType": "MASTER",
        "TargetSpotCapacity": 1,
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": 120,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND"
            }
        },
        "InstanceTypeConfigs": [
            {
                "InstanceType": "m5.xlarge",
                "BidPrice": "0.89"
            }
        ]
    },
    {
        "Name": "Corefleet",
        "InstanceFleetType": "CORE",
        "TargetSpotCapacity": 1,
        "TargetOnDemandCapacity": 1,
        "LaunchSpecifications": {
            "OnDemandSpecification": {
                "AllocationStrategy": "lowest-price",
                "CapacityReservationOptions": {
                    "UsageStrategy": "use-capacity-reservations-first",
                    "CapacityReservationResourceGroupArn": "String"
                }
            },
            "SpotSpecification": {
                "AllocationStrategy": "capacity-optimized",
                "TimeoutDurationMinutes": 120,
                "TimeoutAction": "TERMINATE_CLUSTER"
            }
        },
        "InstanceTypeConfigs": [
            {
                "InstanceType": "m5.xlarge",
                "BidPriceAsPercentageOfOnDemandPrice": 100
            }
        ]
    },
    {
        "Name": "Taskfleet",
        "InstanceFleetType": "TASK",
        "TargetSpotCapacity": 1,
        "LaunchSpecifications": {
            "OnDemandSpecification": {
                "AllocationStrategy": "lowest-price",
                "CapacityReservationOptions": {
                    "CapacityReservationPreference": "none"
                }
            },
            "SpotSpecification": {
                "TimeoutDurationMinutes": 120,
                "TimeoutAction": "TERMINATE_CLUSTER"
            }
        },
        "InstanceTypeConfigs": [
            {
                "InstanceType": "m5.xlarge",
                "BidPrice": "0.89"
            }
        ]
    }
]
```

## Modify target capacities for an instance fleet
<a name="emr-fleet-modify-target-cli"></a>

Use the `modify-instance-fleet` command to specify new target capacities for an instance fleet. You must specify the cluster ID and the instance fleet ID. Use the `list-instance-fleets` command to retrieve instance fleet IDs.

```
aws emr modify-instance-fleet --cluster-id <cluster-id> \
  --instance-fleet \
    InstanceFleetId='<instance-fleet-id>',TargetOnDemandCapacity=1,TargetSpotCapacity=1
```

## Add a task instance fleet to a cluster
<a name="emr-task-instance-fleet"></a>

If a cluster has only primary and core instance fleets, you can use the `add-instance-fleet` command to add a task instance fleet. You can only use this to add task instance fleets.

```
aws emr add-instance-fleet --cluster-id <cluster-id> 
  --instance-fleet \
    InstanceFleetType=TASK,TargetSpotCapacity=1,\
LaunchSpecifications={SpotSpecification='{TimeoutDurationMinutes=20,TimeoutAction=TERMINATE_CLUSTER}'},\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5}']
```

## Get configuration details of instance fleets in a cluster
<a name="emr-instance-fleet-get-configuration"></a>

Use the `list-instance-fleets` command to get configuration details of the instance fleets in a cluster. The command takes a cluster ID as input. The following example demonstrates the command and its output for a cluster that contains a primary instance fleet and a core instance fleet. For full response syntax, see [ListInstanceFleets](https://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_ListInstanceFleets.html) in the *Amazon EMR API Reference.*

```
aws emr list-instance-fleets --cluster-id <cluster-id>
```

```
{
    "InstanceFleets": [
        {
            "Status": {
                "Timeline": {
                    "ReadyDateTime": 1488759094.637,
                    "CreationDateTime": 1488758719.817
                },
                "State": "RUNNING",
                "StateChangeReason": {
                    "Message": ""
                }
            },
            "ProvisionedSpotCapacity": 6,
            "Name": "CORE",
            "InstanceFleetType": "CORE",
            "LaunchSpecifications": {
                "SpotSpecification": {
                    "TimeoutDurationMinutes": 60,
                    "TimeoutAction": "TERMINATE_CLUSTER"
                }
            },
            "ProvisionedOnDemandCapacity": 2,
            "InstanceTypeSpecifications": [
                {
                    "BidPrice": "0.5",
                    "InstanceType": "m5.xlarge",
                    "WeightedCapacity": 2
                }
            ],
            "Id": "if-1ABC2DEFGHIJ3"
        },
        {
            "Status": {
                "Timeline": {
                    "ReadyDateTime": 1488759058.598,
                    "CreationDateTime": 1488758719.811
                },
                "State": "RUNNING",
                "StateChangeReason": {
                    "Message": ""
                }
            },
            "ProvisionedSpotCapacity": 0,
            "Name": "MASTER",
            "InstanceFleetType": "MASTER",
            "ProvisionedOnDemandCapacity": 1,
            "InstanceTypeSpecifications": [
                {
                    "BidPriceAsPercentageOfOnDemandPrice": 100.0,
                    "InstanceType": "m5.xlarge",
                    "WeightedCapacity": 1
                }
            ],
            "Id": "if-2ABC4DEFGHIJ4"
        }
    ]
}
```

# Reconfiguring instance fleets for your Amazon EMR cluster
<a name="instance-fleet-reconfiguration"></a>

With Amazon EMR version 5.21.0 and later, you can reconfigure cluster applications and specify additional configuration classifications for each instance fleet in a running cluster. To do so, you can use the AWS Command Line Interface (AWS CLI), or the AWS SDK.

You can track the state of an instance fleet by viewing CloudWatch events. For more information, see [Instance fleet reconfiguration events](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-cloudwatch-events.html#emr-cloudwatch-instance-fleet-events-reconfig).

**Note**  
You can only override the cluster Configurations object specified during cluster creation. For more information about Configurations objects, see [RunJobFlow request syntax](https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html#API_RunJobFlow_RequestSyntax).

When you submit a reconfiguration request using the Amazon EMR console, the AWS Command Line interface (AWS CLI), or the AWS SDK, Amazon EMR checks the existing on-cluster configuration file. If there are differences between the existing configuration and the file that you supply, Amazon EMR initiates reconfiguration actions, restarts some applications, and resets any manually modified configurations, such as configurations that you have modified while connected to your cluster using SSH, to the cluster defaults for the specified instance fleet.

## Reconfiguration behaviors
<a name="instance-fleet-reconfiguration-behaviors"></a>

Reconfiguration overwrites on-cluster configuration with the newly submitted configuration set, and can overwrite configuration changes made outside of the reconfiguration API.

Amazon EMR follows a rolling process to reconfigure instances in the task and core instance fleets. Only a percentage of the instances for a single instance type are modified and restarted at a time. If your instance fleet has multiple instance type configurations, they are reconfigured in parallel.

Reconfigurations are declared at the [InstanceTypeConfig](https://docs.aws.amazon.com/emr/latest/APIReference/API_InstanceTypeConfig.html) level. For a visual example, refer to [Reconfigure an instance fleet](#instance-fleet-reconfiguration-cli-sdk). You can submit reconfiguration requests that contain updated configuration settings for one or more instance types within a single request. You must include all instance types that are part of your instance fleet in the modify request; however, only the instance types with populated configuration fields undergo reconfiguration, while the other `InstanceTypeConfig` entries in the fleet remain unchanged. A reconfiguration is considered successful only when all instances of the specified instance types complete reconfiguration. If any instance fails to reconfigure, the entire instance fleet automatically reverts to its last known stable configuration.
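As a sketch of the request shape (placeholder cluster and fleet IDs, and a placeholder `yarn-site` property; confirm the exact fields against the `ModifyInstanceFleet` API reference), a reconfiguration that updates one instance type while leaving another unchanged might look like:

```
# Sketch: write the reconfiguration request to a file, then submit it.
# Only m5.xlarge carries Configurations, so only it is reconfigured.
cat > fleet-reconfig.json <<'EOF'
{
  "InstanceFleetId": "if-1xxxxxxx9",
  "InstanceTypeConfigs": [
    {
      "InstanceType": "m5.xlarge",
      "Configurations": [
        {
          "Classification": "yarn-site",
          "Properties": {
            "yarn.nodemanager.vmem-check-enabled": "false"
          }
        }
      ]
    },
    {
      "InstanceType": "m4.xlarge"
    }
  ]
}
EOF
aws emr modify-instance-fleet --cluster-id j-2AL4XXXXXX5T9 \
  --instance-fleet file://fleet-reconfig.json
```

The `m4.xlarge` entry is included only to satisfy the requirement that all instance types in the fleet appear in the request; its empty configuration leaves it unchanged.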

## Limitations
<a name="instance-fleet-reconfiguration-limitations"></a>

When you reconfigure an instance fleet in a running cluster, consider the following limitations:
+ Non-YARN applications can fail during restart or cause cluster issues, especially if the applications aren't configured properly. Clusters approaching maximum memory and CPU usage may run into issues after the restart process. This is especially true for the primary instance fleet. Consult the [Troubleshoot instance fleet reconfiguration](#instance-fleet-reconfiguration-troubleshooting) section.
+ Resize and reconfiguration operations do not happen in parallel. A reconfiguration request waits for an ongoing resize to complete, and vice versa.
+ After reconfiguring an instance fleet, Amazon EMR restarts the applications to allow the new configurations to take effect. Job failure or other unexpected application behavior might occur if the applications are in use during reconfiguration.
+ If a reconfiguration for any instance type config under an instance fleet fails, Amazon EMR reverts the configuration parameters to the previous working version for the entire instance fleet, and emits events and updates state details. If the reversion process also fails, you must submit a new `ModifyInstanceFleet` request to recover the instance fleet from the `ARRESTED` state. Reversion failures result in [Instance fleet reconfiguration events](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-cloudwatch-events.html#emr-cloudwatch-instance-fleet-events-reconfig) and state changes. 
+ Reconfiguration requests for Phoenix configuration classifications are only supported in Amazon EMR version 5.23.0 and later, and are not supported in Amazon EMR version 5.21.0 or 5.22.0.
+ Reconfiguration requests for HBase configuration classifications are only supported in Amazon EMR version 5.30.0 and later, and are not supported in Amazon EMR versions 5.23.0 through 5.29.0.
+ Reconfiguring the `hdfs-encryption-zones` classification or any of the Hadoop KMS configuration classifications is not supported on an Amazon EMR cluster with multiple primary nodes.
+ Amazon EMR currently doesn't support certain reconfiguration requests for the YARN capacity scheduler that require restarting the YARN ResourceManager. For example, you cannot completely remove a queue.
+ When YARN needs to restart, all running YARN jobs are typically terminated and lost. This might cause data processing delays. To keep YARN jobs running across a YARN restart, you can either create an Amazon EMR cluster with multiple primary nodes or set `yarn.resourcemanager.recovery.enabled` to `true` in your `yarn-site` configuration classification. For more information about using multiple primary nodes, see [High availability YARN ResourceManager](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-ha-applications.html#emr-plan-ha-applications-YARN).
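For reference, a `yarn-site` classification that enables ResourceManager recovery might look like the following (a configuration sketch; verify behavior against your Amazon EMR release):

```json
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.resourcemanager.recovery.enabled": "true"
    }
  }
]
```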

## Reconfigure an instance fleet
<a name="instance-fleet-reconfiguration-cli-sdk"></a>

------
#### [ Using the AWS CLI ]

Use the `modify-instance-fleet` command to specify a new configuration for an instance fleet in a running cluster.

**Note**  
In the following examples, replace **j-2AL4XXXXXX5T9** with your cluster ID, and replace **if-1xxxxxxx9** with your instance fleet ID.

**Example – Replace a configuration for an instance fleet**

**Warning**  
Specify all `InstanceTypeConfig` fields that you used at launch. Not including fields can result in overwriting specifications you declared at launch. Refer to [InstanceTypeConfig](https://docs.aws.amazon.com/emr/latest/APIReference/API_InstanceTypeConfig.html) for a list.

The following example references a configuration JSON file called instanceFleet.json to edit the property of the YARN NodeManager disk health checker for an instance fleet.

**Instance Fleet Modification JSON**

1. Prepare your configuration classification, and save it as instanceFleet.json in the same directory where you will run the command.

   ```
   {
       "InstanceFleetId":"if-1xxxxxxx9",
       "InstanceTypeConfigs": [
               {
                   "InstanceType": "m5.xlarge",
                  other InstanceTypeConfig fields
                   "Configurations": [
                       {
                           "Classification": "yarn-site",
                           "Properties": {
                               "yarn.nodemanager.disk-health-checker.enable":"true",
                               "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage":"100.0"
                           }
                       }
                   ]
               },
               {
                   "InstanceType": "r5.xlarge",
                  other InstanceTypeConfig fields
                   "Configurations": [
                       {
                           "Classification": "yarn-site",
                           "Properties": {
                               "yarn.nodemanager.disk-health-checker.enable":"false",
                               "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage":"70.0"
                           }
                       }
                   ]
               }
            ]
    }
    ```

1. Run the following command.

   ```
   aws emr modify-instance-fleet \
   --cluster-id j-2AL4XXXXXX5T9 \
   --region us-west-2 \
   --instance-fleet instanceFleet.json
   ```

**Example – Add a configuration to an instance fleet**

If you want to add a configuration to an instance type, you must include all previously specified configurations for that instance type in your new `ModifyInstanceFleet` request. Otherwise, the previously specified configurations are removed.

The following example adds a property for the YARN NodeManager virtual memory checker. The configuration also includes previously specified values for the YARN NodeManager disk health checker so that the values won't be overwritten.
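The replace-not-merge behavior can be sketched in Python (illustrative only; Amazon EMR applies the `Properties` map you send for a classification, rather than merging it with what is already on the cluster):

```python
# Sketch of the replace semantics: to "add" a property to a classification,
# resend the previously specified properties alongside the new ones.

existing = {
    "yarn.nodemanager.disk-health-checker.enable": "true",
    "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage": "100.0",
}
new = {
    "yarn.nodemanager.vmem-check-enabled": "true",
    "yarn.nodemanager.vmem-pmem-ratio": "3.0",
}

# Correct request payload: previous and new properties combined in one map.
# Sending only `new` would drop the disk health checker settings.
combined = {**existing, **new}
print(len(combined))  # 4 properties survive
```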

1. Prepare the following contents in instanceFleet.json and save it in the same directory where you will run the command.

   ```
   {
       "InstanceFleetId":"if-1xxxxxxx9",
       "InstanceTypeConfigs": [
               {
                   "InstanceType": "m5.xlarge",
                   other InstanceTypeConfig fields
                   "Configurations": [
                       {
                           "Classification": "yarn-site",
                           "Properties": {
                               "yarn.nodemanager.disk-health-checker.enable":"true",
                               "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage":"100.0",
                               "yarn.nodemanager.vmem-check-enabled":"true",
                               "yarn.nodemanager.vmem-pmem-ratio":"3.0"
                           }
                       }
                   ]
               },
               {
                   "InstanceType": "r5.xlarge",
                   other InstanceTypeConfig fields
                   "Configurations": [
                       {
                           "Classification": "yarn-site",
                           "Properties": {
                               "yarn.nodemanager.disk-health-checker.enable":"false",
                               "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage":"70.0"
                           }
                       }
                   ]
               }
           ]      
   }
   ```

1. Run the following command.

   ```
   aws emr modify-instance-fleet \
   --cluster-id j-2AL4XXXXXX5T9 \
   --region us-west-2 \
   --instance-fleet instanceFleet.json
   ```

------
#### [ Using the Java SDK ]

**Note**  
In the following examples, replace **j-2AL4XXXXXX5T9** with your cluster ID, and replace **if-1xxxxxxx9** with your instance fleet ID.

The following code snippet provides a new configuration for an instance fleet using the AWS SDK for Java.

```
AWSCredentials credentials = new BasicAWSCredentials("access-key", "secret-key");
AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(credentials);

Map<String,String> hiveProperties = new HashMap<String,String>();
hiveProperties.put("hive.join.emit.interval","1000");
hiveProperties.put("hive.merge.mapfiles","true");
        
Configuration newConfiguration = new Configuration()
    .withClassification("hive-site")
    .withProperties(hiveProperties);
    
List<InstanceTypeConfig> instanceTypeConfigList = new ArrayList<>();

// currentInstanceTypeConfigList holds the fleet's existing InstanceTypeConfig
// entries, for example as returned by the ListInstanceFleets API.
for (InstanceTypeConfig instanceTypeConfig : currentInstanceTypeConfigList) {
    instanceTypeConfigList.add(new InstanceTypeConfig()
        .withInstanceType(instanceTypeConfig.getInstanceType())
        .withBidPrice(instanceTypeConfig.getBidPrice())
        .withWeightedCapacity(instanceTypeConfig.getWeightedCapacity())
        .withConfigurations(newConfiguration)
    );
}

InstanceFleetModifyConfig instanceFleetModifyConfig = new InstanceFleetModifyConfig()
    .withInstanceFleetId("if-1xxxxxxx9")
    .withInstanceTypeConfigs(instanceTypeConfigList);
    
ModifyInstanceFleetRequest modifyInstanceFleetRequest = new ModifyInstanceFleetRequest()
    .withInstanceFleet(instanceFleetModifyConfig)
    .withClusterId("j-2AL4XXXXXX5T9");

emr.modifyInstanceFleet(modifyInstanceFleetRequest);
```

------

## Troubleshoot instance fleet reconfiguration
<a name="instance-fleet-reconfiguration-troubleshooting"></a>

If the reconfiguration process for any instance type within an instance fleet fails, Amazon EMR reverts the in-progress reconfiguration and logs the failure with an Amazon CloudWatch event. The event provides a brief summary of the reconfiguration failure: it lists the instances for which reconfiguration failed, along with the corresponding failure messages. The following is an example failure message.

`Amazon EMR couldn't revert the instance fleet if-1xxxxxxx9 in the Amazon EMR cluster j-2AL4XXXXXX5T9 (ExampleClusterName) to the previously successful configuration at 2021-01-01 00:00 UTC. The reconfiguration reversion failed because of Instance i-xxxxxxx1, i-xxxxxxx2, i-xxxxxxx3 failed with message "This is an example failure message"...`

### To access node provisioning logs
<a name="instance-fleet-reconfiguration-troubleshooting-connect-node"></a>

Use SSH to connect to the node on which reconfiguration has failed. For instructions, see [Connect to your Linux instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/connect-to-linux-instance.html) in the *Amazon EC2 User Guide*.

------
#### [ Accessing logs by connecting to a node ]

1. Navigate to the following directory, which contains the node provisioning log files.

   ```
   /mnt/var/log/provision-node/
   ```

1. Open the reports subdirectory and search for the node provisioning report for your reconfiguration. The reports directory organizes logs by reconfiguration version number, universally unique identifier (UUID), Amazon EC2 instance IP address, and timestamp. Each report is a compressed YAML file that contains detailed information about the reconfiguration process. The following is an example report file name and path.

   ```
   /reports/2/ca598xxx-cxxx-4xxx-bxxx-6dbxxxxxxxxx/ip-10-73-xxx-xxx.ec2.internal/202104061715.yaml.gz
   ```

1. You can examine a report using a file viewer like zless, as in the following example.

   ```
   zless 202104061715.yaml.gz
   ```

------
#### [ Accessing logs using Amazon S3 ]

1. Sign in to the AWS Management Console, and open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/). Open the Amazon S3 bucket that you specified when you configured the cluster to archive log files.

1. Navigate to the following folder, which contains the node provisioning log files:

   ```
   amzn-s3-demo-bucket/elasticmapreduce/cluster id/node/instance id/provision-node/
   ```

1. Open the reports folder and search for the node provisioning report for your reconfiguration. The reports folder organizes logs by reconfiguration version number, universally unique identifier (UUID), Amazon EC2 instance IP address, and timestamp. Each report is a compressed YAML file that contains detailed information about the reconfiguration process. The following is an example report file name and path.

   ```
   /reports/2/ca598xxx-cxxx-4xxx-bxxx-6dbxxxxxxxxx/ip-10-73-xxx-xxx.ec2.internal/202104061715.yaml.gz
   ```

To view a log file, you can download it from Amazon S3 to your local machine as a text file. For instructions, see [Downloading an object](https://docs.aws.amazon.com/AmazonS3/latest/userguide/download-objects.html).

------

Each log file contains a detailed provisioning report for the associated reconfiguration. To find error message information, you can search a report for the `err` log level. The report format depends on the Amazon EMR release version on your cluster. Amazon EMR release versions 5.32.0 and 6.2.0 and later use the following format:

```
- level: err
  message: 'Example detailed error message.'
  source: Puppet
  tags:
  - err
  time: '2021-01-01 00:00:00.000000 +00:00'
  file: 
  line:
```
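To pull just the error entries out of a report, a small Python sketch like the following can help. It matches the `level`/`message` lines shown above with plain string handling rather than a YAML parser; adjust it if your release uses a different report format.

```python
# Sketch: extract error messages from a decompressed provisioning report.
# For a real report, decompress first, e.g.:
#   import gzip
#   report_text = gzip.open("202104061715.yaml.gz", "rt").read()

report_text = """\
- level: info
  message: 'Applied configuration.'
- level: err
  message: 'Example detailed error message.'
  source: Puppet
"""

def error_messages(text):
    errors, capture = [], False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("- level:"):
            # Start of a new entry; only capture entries at the err level.
            capture = stripped.endswith("err")
        elif capture and stripped.startswith("message:"):
            errors.append(stripped[len("message:"):].strip().strip("'"))
    return errors

print(error_messages(report_text))  # ['Example detailed error message.']
```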

# Use capacity reservations with instance fleets in Amazon EMR
<a name="on-demand-capacity-reservations"></a>

To launch On-Demand Instance fleets with capacity reservation options, attach the additional service role permissions that are required to use capacity reservation options. Because capacity reservation options must be used together with an On-Demand allocation strategy, you must also include the permissions required for allocation strategies in your service role and managed policy. For more information, see [Required IAM permissions for an allocation strategy](emr-instance-fleet.md#create-cluster-allocation-policy).

Amazon EMR supports both open and targeted capacity reservations. The following topics show instance fleet configurations that you can use with the `RunJobFlow` action or the `create-cluster` command to launch instance fleets with On-Demand Capacity Reservations.

## Use open capacity reservations on a best-effort basis
<a name="on-demand-capacity-reservations-best-effort"></a>

If the cluster's On-Demand Instances match the attributes of open capacity reservations (instance type, platform, tenancy, and Availability Zone) available in your account, the capacity reservations are applied automatically. However, it is not guaranteed that your capacity reservations will be used. To provision the cluster, Amazon EMR evaluates all the instance pools specified in the launch request and uses the one with the lowest price that has sufficient capacity to launch all the requested core nodes. Available open capacity reservations that match the instance pool are applied automatically. If available open capacity reservations do not match the instance pool, they remain unused.

Once the core nodes are provisioned, the Availability Zone is selected and fixed. Amazon EMR provisions task nodes into instance pools, starting with the lowest-priced ones first, in the selected Availability Zone until all the task nodes are provisioned. Available open capacity reservations that match the instance pools are applied automatically.
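The core-node selection described above can be sketched as a simplified Python model (illustrative only, not the actual Amazon EMR allocator):

```python
# Simplified model of best-effort allocation: pick the lowest-priced pool
# that can satisfy the request, then apply any matching open capacity
# reservations to the instances launched in that pool.

def provision(pools, requested):
    """pools: list of dicts with 'type', 'price', 'capacity', 'open_cr'."""
    eligible = [p for p in pools if p["capacity"] >= requested]
    pool = min(eligible, key=lambda p: p["price"])
    cr_used = min(pool["open_cr"], requested)  # reservations apply automatically
    return pool["type"], cr_used

# Numbers mirror Example 1 below: the lowest-priced pool (c5.xlarge) wins,
# and 100 of its 150 open reservations are consumed.
pools = [
    {"type": "c5.xlarge", "price": 1.0, "capacity": 200, "open_cr": 150},
    {"type": "m5.xlarge", "price": 2.0, "capacity": 200, "open_cr": 100},
    {"type": "r5.xlarge", "price": 3.0, "capacity": 200, "open_cr": 100},
]
print(provision(pools, 100))  # ('c5.xlarge', 100)
```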

The following are use cases of Amazon EMR capacity allocation logic for using open capacity reservations on a best-effort basis.

**Example 1: Lowest-price instance pool in launch request has available open capacity reservations**

In this case, Amazon EMR launches capacity in the lowest-price instance pool with On-Demand Instances. Your available open capacity reservations in that instance pool are used automatically.


| On-Demand strategy | lowest-price |
| --- | --- |
| Requested capacity | 100 |

| Instance type | c5.xlarge | m5.xlarge | r5.xlarge |
| --- | --- | --- | --- |
| Available open capacity reservations | 150 | 100 | 100 |
| On-Demand price | \$ | \$\$ | \$\$\$ |
| Instances provisioned | 100 | - | - |
| Open capacity reservations used | 100 | - | - |
| Remaining open capacity reservations | 50 | 100 | 100 |

After the instance fleet is launched, you can run [describe-capacity-reservations](https://docs.aws.amazon.com/cli/latest/reference/ec2/describe-capacity-reservations.html) to see how many unused capacity reservations remain.

**Example 2: Lowest-price instance pool in launch request does not have available open capacity reservations**

In this case, Amazon EMR launches capacity in the lowest-price instance pool with On-Demand Instances. However, your open capacity reservations remain unused.


| On-Demand strategy | lowest-price |
| --- | --- |
| Requested capacity | 100 |

| Instance type | c5.xlarge | m5.xlarge | r5.xlarge |
| --- | --- | --- | --- |
| Available open capacity reservations | - | - | 100 |
| On-Demand price | \$ | \$\$ | \$\$\$ |
| Instances provisioned | 100 | - | - |
| Open capacity reservations used | - | - | - |
| Remaining open capacity reservations | - | - | 100 |

**Configure Instance Fleets to use open capacity reservations on a best-effort basis**

When you use the `RunJobFlow` action to create an instance fleet-based cluster, set the On-Demand allocation strategy to `lowest-price` and set `CapacityReservationPreference` in `CapacityReservationOptions` to `open`. Alternatively, if you leave this field blank, Amazon EMR defaults the On-Demand Instances' capacity reservation preference to `open`.

```
"LaunchSpecifications": 
    {"OnDemandSpecification": {
        "AllocationStrategy": "lowest-price",
        "CapacityReservationOptions":
         {
            "CapacityReservationPreference": "open"
         }
        }
    }
```

You can also use the Amazon EMR CLI to create an instance fleet-based cluster using open capacity reservations.

```
aws emr create-cluster \
	--name 'open-ODCR-cluster' \
	--release-label emr-5.30.0 \
	--service-role EMR_DefaultRole \
	--ec2-attributes SubnetId=subnet-22XXXX01,InstanceProfile=EMR_EC2_DefaultRole \
	--instance-fleets InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=c4.xlarge}'] \
	  InstanceFleetType=CORE,TargetOnDemandCapacity=100,InstanceTypeConfigs=['{InstanceType=c5.xlarge},{InstanceType=m5.xlarge},{InstanceType=r5.xlarge}'],\
	  LaunchSpecifications={OnDemandSpecification='{AllocationStrategy=lowest-price,CapacityReservationOptions={CapacityReservationPreference=open}}'}
```

In the preceding command:
+ `open-ODCR-cluster` is replaced with the name of the cluster using open capacity reservations.
+ `subnet-22XXXX01` is replaced with the subnet ID.

## Use open capacity reservations first
<a name="on-demand-capacity-reservations-first"></a>

You can choose to override the lowest-price allocation strategy and prioritize using available open capacity reservations first while provisioning an Amazon EMR cluster. In this case, Amazon EMR evaluates all the instance pools with capacity reservations specified in the launch request and uses the one with the lowest price that has sufficient capacity to launch all the requested core nodes. If none of the instance pools with capacity reservations have sufficient capacity for the requested core nodes, Amazon EMR falls back to the best-effort case described in the previous topic. That is, Amazon EMR re-evaluates all the instance pools specified in the launch request and uses the one with the lowest price that has sufficient capacity to launch all the requested core nodes. Available open capacity reservations that match the instance pool are applied automatically. If available open capacity reservations do not match the instance pool, they remain unused. 

Once the core nodes are provisioned, the Availability Zone is selected and fixed. Amazon EMR provisions task nodes into instance pools with capacity reservations, starting with the lowest-priced ones first, in the selected Availability Zone until all the task nodes are provisioned. Amazon EMR first uses the available open capacity reservations across each instance pool in the selected Availability Zone, and only if required, uses the lowest-price strategy to provision any remaining task nodes.
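A simplified Python model of the reservations-first core-node selection described above (illustrative only, not the actual Amazon EMR allocator; for brevity it treats "sufficient capacity" as reservations covering the full request):

```python
# Simplified model of use-capacity-reservations-first: prefer the
# lowest-priced pool whose open reservations can cover the request; if none
# can, fall back to the plain lowest-price choice with best-effort CR use.

def provision_cr_first(pools, requested):
    with_cr = [p for p in pools if p["open_cr"] >= requested]
    if with_cr:
        pool = min(with_cr, key=lambda p: p["price"])
        return pool["type"], requested
    pool = min(pools, key=lambda p: p["price"])           # fallback: lowest price
    return pool["type"], min(pool["open_cr"], requested)  # best-effort CR use

# Numbers mirror Example 1 below: only r5.xlarge has enough open
# reservations, so it is chosen despite being the most expensive pool.
pools = [
    {"type": "c5.xlarge", "price": 1.0, "open_cr": 0},
    {"type": "m5.xlarge", "price": 2.0, "open_cr": 0},
    {"type": "r5.xlarge", "price": 3.0, "open_cr": 150},
]
print(provision_cr_first(pools, 100))  # ('r5.xlarge', 100)
```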

The following are use cases of Amazon EMR capacity allocation logic for using open capacity reservations first.

**Example 1: Instance pool with available open capacity reservations in launch request has sufficient capacity for core nodes**

In this case, Amazon EMR launches capacity in the instance pool with available open capacity reservations regardless of instance pool price. As a result, your open capacity reservations are used whenever possible, until all core nodes are provisioned.


| On-Demand strategy | lowest-price |
| --- | --- |
| Usage strategy | use-capacity-reservations-first |
| Requested capacity | 100 |

| Instance type | c5.xlarge | m5.xlarge | r5.xlarge |
| --- | --- | --- | --- |
| Available open capacity reservations | - | - | 150 |
| On-Demand price | \$ | \$\$ | \$\$\$ |
| Instances provisioned | - | - | 100 |
| Open capacity reservations used | - | - | 100 |
| Remaining open capacity reservations | - | - | 50 |

**Example 2: Instance pool with available open capacity reservations in launch request does not have sufficient capacity for core nodes**

In this case, Amazon EMR falls back to launching core nodes using lowest-price strategy with a best-effort to use capacity reservations.


| On-Demand strategy | lowest-price |
| --- | --- |
| Usage strategy | use-capacity-reservations-first |
| Requested capacity | 100 |

| Instance type | c5.xlarge | m5.xlarge | r5.xlarge |
| --- | --- | --- | --- |
| Available open capacity reservations | 10 | 50 | 50 |
| On-Demand price | \$ | \$\$ | \$\$\$ |
| Instances provisioned | 100 | - | - |
| Open capacity reservations used | 10 | - | - |
| Remaining open capacity reservations | - | 50 | 50 |

After the instance fleet is launched, you can run [describe-capacity-reservations](https://docs.aws.amazon.com/cli/latest/reference/ec2/describe-capacity-reservations.html) to see how many unused capacity reservations remain.

**Configure Instance Fleets to use open capacity reservations first**

When you use the `RunJobFlow` action to create an instance fleet-based cluster, set the On-Demand allocation strategy to `lowest-price` and `UsageStrategy` for `CapacityReservationOptions` to `use-capacity-reservations-first`.

```
"LaunchSpecifications": 
    {"OnDemandSpecification": {
        "AllocationStrategy": "lowest-price",
        "CapacityReservationOptions":
         {
            "UsageStrategy": "use-capacity-reservations-first"
         }
       }
    }
```

You can also use the Amazon EMR CLI to create an instance-fleet based cluster using capacity reservations first.

```
aws emr create-cluster \
  --name 'use-CR-first-cluster' \
  --release-label emr-5.30.0 \
  --service-role EMR_DefaultRole \
  --ec2-attributes SubnetId=subnet-22XXXX01,InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=c4.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=100,InstanceTypeConfigs=['{InstanceType=c5.xlarge},{InstanceType=m5.xlarge},{InstanceType=r5.xlarge}'],\
LaunchSpecifications={OnDemandSpecification='{AllocationStrategy=lowest-price,CapacityReservationOptions={UsageStrategy=use-capacity-reservations-first}}'}
```

In the preceding command:
+ `use-CR-first-cluster` is replaced with the name of the cluster using open capacity reservations.
+ `subnet-22XXXX01` is replaced with the subnet ID.

## Use targeted capacity reservations first
<a name="on-demand-capacity-reservations-targeted"></a>

When you provision an Amazon EMR cluster, you can choose to override the lowest-price allocation strategy and prioritize using available targeted capacity reservations first. In this case, Amazon EMR evaluates all the instance pools with targeted capacity reservations specified in the launch request and picks the one with the lowest price that has sufficient capacity to launch all the requested core nodes. If none of the instance pools with targeted capacity reservations have sufficient capacity for core nodes, Amazon EMR falls back to the best-effort case described earlier. That is, Amazon EMR re-evaluates all the instance pools specified in the launch request and selects the one with the lowest price that has sufficient capacity to launch all the requested core nodes. Available open capacity reservations that match the instance pool are applied automatically. However, targeted capacity reservations remain unused.

Once the core nodes are provisioned, the Availability Zone is selected and fixed. Amazon EMR provisions task nodes into instance pools with targeted capacity reservations, starting with the lowest-priced ones first, in the selected Availability Zone until all the task nodes are provisioned. Amazon EMR first tries to use the available targeted capacity reservations across each instance pool in the selected Availability Zone. Then, only if required, Amazon EMR uses the lowest-price strategy to provision any remaining task nodes.

The following are use cases of Amazon EMR capacity allocation logic for using targeted capacity reservations first.

**Example 1: Instance pool with available targeted capacity reservations in launch request has sufficient capacity for core nodes**

In this case, Amazon EMR launches capacity in the instance pool with available targeted capacity reservations regardless of instance pool price. As a result, your targeted capacity reservations are used whenever possible until all core nodes are provisioned.


| On-Demand strategy | lowest-price |
| --- | --- |
| Usage strategy | use-capacity-reservations-first |
| Requested capacity | 100 |

| Instance type | c5.xlarge | m5.xlarge | r5.xlarge |
| --- | --- | --- | --- |
| Available targeted capacity reservations | - | - | 150 |
| On-Demand price | \$ | \$\$ | \$\$\$ |
| Instances provisioned | - | - | 100 |
| Targeted capacity reservations used | - | - | 100 |
| Remaining targeted capacity reservations | - | - | 50 |

**Example 2: Instance pool with available targeted capacity reservations in launch request does not have sufficient capacity for core nodes**

In this case, Amazon EMR falls back to launching core nodes using the lowest-price strategy, with a best-effort use of capacity reservations.

| On-Demand strategy | lowest-price |
| --- | --- |
| Usage strategy | use-capacity-reservations-first |
| Requested capacity | 100 |

| Instance type | c5.xlarge | m5.xlarge | r5.xlarge |
| --- | --- | --- | --- |
| Available targeted capacity reservations | 10 | 50 | 50 |
| On-Demand price | \$ | \$\$ | \$\$\$ |
| Instances provisioned | 100 | - | - |
| Targeted capacity reservations used | 10 | - | - |
| Remaining targeted capacity reservations | - | 50 | 50 |

After the instance fleet is launched, you can run [describe-capacity-reservations](https://docs.aws.amazon.com/cli/latest/reference/ec2/describe-capacity-reservations.html) to see how many unused capacity reservations remain.

**Configure Instance Fleets to use targeted capacity reservations first**

When you use the `RunJobFlow` action to create an instance-fleet based cluster, set the On-Demand allocation strategy to `lowest-price`, `UsageStrategy` for `CapacityReservationOptions` to `use-capacity-reservations-first`, and `CapacityReservationResourceGroupArn` for `CapacityReservationOptions` to `<your resource group ARN>`. For more information, see [Work with capacity reservations](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/capacity-reservations-using.html) in the *Amazon EC2 User Guide*.

```
"LaunchSpecifications": 
    {"OnDemandSpecification": {
        "AllocationStrategy": "lowest-price",
        "CapacityReservationOptions":
         {
            "UsageStrategy": "use-capacity-reservations-first",
            "CapacityReservationResourceGroupArn": "arn:aws:resource-groups:sa-east-1:123456789012:group/MyCRGroup"
         }
       }
    }
```

Where `arn:aws:resource-groups:sa-east-1:123456789012:group/MyCRGroup` is replaced with your resource group ARN.

You can also use the Amazon EMR CLI to create an instance fleet-based cluster using targeted capacity reservations.

```
aws emr create-cluster \
  --name 'targeted-CR-cluster' \
  --release-label emr-5.30.0 \
  --service-role EMR_DefaultRole \
  --ec2-attributes SubnetId=subnet-22XXXX01,InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=c4.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=100,\
InstanceTypeConfigs=['{InstanceType=c5.xlarge},{InstanceType=m5.xlarge},{InstanceType=r5.xlarge}'],\
LaunchSpecifications={OnDemandSpecification='{AllocationStrategy=lowest-price,CapacityReservationOptions={UsageStrategy=use-capacity-reservations-first,CapacityReservationResourceGroupArn=arn:aws:resource-groups:sa-east-1:123456789012:group/MyCRGroup}}'}
```

In the preceding command:
+ `targeted-CR-cluster` is replaced with the name of your cluster using targeted capacity reservations.
+ `subnet-22XXXX01` is replaced with the subnet ID.
+ `arn:aws:resource-groups:sa-east-1:123456789012:group/MyCRGroup` is replaced with your resource group ARN.

## Avoid using available open capacity reservations
<a name="on-demand-capacity-reservations-avoiding"></a>

**Example**  
If you want to avoid unexpectedly using any of your open capacity reservations when launching an Amazon EMR cluster, set the On-Demand allocation strategy to `lowest-price` and set `CapacityReservationPreference` in `CapacityReservationOptions` to `none`. Otherwise, Amazon EMR defaults the On-Demand Instances' capacity reservation preference to `open` and tries to use available open capacity reservations on a best-effort basis.  

```
"LaunchSpecifications": 
    {"OnDemandSpecification": {
        "AllocationStrategy": "lowest-price",
        "CapacityReservationOptions":
         {
            "CapacityReservationPreference": "none"
         }
       }
    }
```
You can also use the Amazon EMR CLI to create an instance fleet-based cluster without using any open capacity reservations.  

```
aws emr create-cluster \
  --name 'none-CR-cluster' \
  --release-label emr-5.30.0 \
  --service-role EMR_DefaultRole \
  --ec2-attributes SubnetId=subnet-22XXXX01,InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=c4.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=100,InstanceTypeConfigs=['{InstanceType=c5.xlarge},{InstanceType=m5.xlarge},{InstanceType=r5.xlarge}'],\
LaunchSpecifications={OnDemandSpecification='{AllocationStrategy=lowest-price,CapacityReservationOptions={CapacityReservationPreference=none}}'}
```
In the preceding command:  
+ `none-CR-cluster` is replaced with the name of your cluster that is not using any open capacity reservations.
+ `subnet-22XXXX01` is replaced with the subnet ID.

## Scenarios for using capacity reservations
<a name="on-demand-capacity-reservations-scenarios"></a>

You can benefit from using capacity reservations in the following scenarios.

**Scenario 1: Rotate a long-running cluster using capacity reservations**  
When rotating a long-running cluster, you might have strict requirements on the instance types and Availability Zones for the new instances you provision. With capacity reservations, you can use capacity assurance to complete the cluster rotation without interruptions.

![\[Cluster rotation using available capacity reservations\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/odcr-longrunning-cluster-diagram.png)


**Scenario 2: Provision successive short-lived clusters using capacity reservations**  
You can also use capacity reservations to provision a group of successive, short-lived clusters for individual workloads so that when you terminate a cluster, the next cluster can use the capacity reservations. You can use targeted capacity reservations to ensure that only the intended clusters use the capacity reservations.

![\[Short-lived cluster provisioning that uses available capacity reservations\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/odcr-short-cluster-diagram.png)


# Configure uniform instance groups for your Amazon EMR cluster
<a name="emr-uniform-instance-group"></a>

With the instance groups configuration, each node type (primary, core, or task) consists of the same instance type and the same purchasing option for instances: On-Demand or Spot. You specify these settings when you create an instance group. They can't be changed later. You can, however, add instances of the same type and purchasing option to core and task instance groups. You can also remove instances.

If the cluster's On-Demand Instances match the attributes of open capacity reservations (instance type, platform, tenancy, and Availability Zone) available in your account, the capacity reservations are applied automatically. You can use open capacity reservations for primary, core, and task nodes. However, you cannot use targeted capacity reservations or prevent instances from launching into open capacity reservations with matching attributes when you provision clusters using instance groups. If you want to use targeted capacity reservations or prevent instances from launching into open capacity reservations, use instance fleets instead. For more information, see [Use capacity reservations with instance fleets in Amazon EMR](on-demand-capacity-reservations.md).

To add different instance types after a cluster is created, you can add additional task instance groups. You can choose different instance types and purchasing options for each instance group. For more information, see [Use Amazon EMR cluster scaling to adjust for changing workloads](emr-scale-on-demand.md).

When launching instances, the On-Demand Instance's capacity reservation preference defaults to `open`, which enables it to run in any open capacity reservation that has matching attributes (instance type, platform, Availability Zone). For more information about On-Demand Capacity Reservations, see [Use capacity reservations with instance fleets in Amazon EMR](on-demand-capacity-reservations.md).

This section covers creating a cluster with uniform instance groups. For more information about modifying an existing instance group by adding or removing instances manually or with automatic scaling, see [Manage Amazon EMR clusters](emr-manage.md).

## Use the console to configure uniform instance groups
<a name="emr-uniform-instance-group-console"></a>

------
#### [ Console ]

**To create a cluster with uniform instance groups using the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and choose **Create cluster**.

1. Under **Cluster configuration**, choose **Instance groups**.

1. Under **Node groups**, there is a section for each node group type. For the primary node group, select the **Use multiple primary nodes** check box if you want three primary nodes. Select the **Use Spot purchasing option** check box if you want to use Spot purchasing.

1. For the primary and core node groups, select **Add instance type** and choose up to five instance types. For the task group, select **Add instance type** and choose up to 15 instance types. Amazon EMR might provision any mix of these instance types when it launches the cluster.

1. Under each node group type, choose the **Actions** dropdown menu next to each instance type to change these settings:  
**Add EBS volumes**  
Specify EBS volumes to attach to the instance type after Amazon EMR provisions it.  
**Edit maximum Spot price**  
Specify a maximum Spot price for each instance type in the node group. You can set this price either as a percentage of the On-Demand price, or as a specific dollar amount. If the current Spot price in an Availability Zone is below your maximum Spot price, Amazon EMR provisions Spot Instances. You pay the Spot price, not necessarily your maximum Spot price.

1. Optionally, expand **Node configuration** to enter a JSON configuration or to load JSON from Amazon S3.

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.

------

## Use the AWS CLI to create a cluster with uniform instance groups
<a name="emr-uniform-instance-group-cli"></a>

To specify the instance groups configuration for a cluster using the AWS CLI, use the `create-cluster` command along with the `--instance-groups` parameter. Amazon EMR assumes the On-Demand Instance option unless you specify the `BidPrice` argument for an instance group. For examples of `create-cluster` commands that launch uniform instance groups with On-Demand Instances and a variety of cluster options, type `aws emr create-cluster help` at the command line, or see [create-cluster](https://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html) in the *AWS CLI Command Reference*.

You can use the AWS CLI to create uniform instance groups in a cluster that use Spot Instances. The offered Spot price depends on the Availability Zone. When you use the CLI or API, you can specify the Availability Zone either with the `AvailabilityZone` argument (if you're using an EC2-Classic network) or the `SubnetId` argument of the `--ec2-attributes` parameter. The Availability Zone or subnet that you select applies to the cluster, so it's used for all instance groups. If you don't specify an Availability Zone or subnet explicitly, Amazon EMR selects the Availability Zone with the lowest Spot price when it launches the cluster.

The following example demonstrates a `create-cluster` command that creates primary, core, and two task instance groups that all use Spot Instances. Replace *myKey* with the name of your Amazon EC2 key pair. 

**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace them with a caret (^).

```
aws emr create-cluster --name "MySpotCluster" \
  --release-label emr-7.12.0 \
  --use-default-roles \
  --ec2-attributes KeyName=myKey \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m5.xlarge,InstanceCount=1,BidPrice=0.25 \
    InstanceGroupType=CORE,InstanceType=m5.xlarge,InstanceCount=2,BidPrice=0.03 \
    InstanceGroupType=TASK,InstanceType=m5.xlarge,InstanceCount=4,BidPrice=0.03 \
    InstanceGroupType=TASK,InstanceType=m5.xlarge,InstanceCount=2,BidPrice=0.04
```

Using the CLI, you can create uniform instance group clusters that specify a unique custom AMI for each instance type in the instance group. This lets you use different instance architectures in the same instance group. Each instance type must use a custom AMI with a matching architecture. For example, you would configure an m5.xlarge instance type with an x86\_64 architecture custom AMI, and an m6g.xlarge instance type with a corresponding aarch64 (ARM) architecture custom AMI. 

The following example shows a uniform instance group cluster created with two instance types, each with its own custom AMI. Notice that the custom AMIs are specified only at the instance type level, not at the cluster level. This is to avoid conflicts between the instance type AMIs and an AMI at the cluster level, which would cause the cluster launch to fail. 

```
aws emr create-cluster \
  --release-label emr-5.30.0 \
  --service-role EMR_DefaultRole \
  --ec2-attributes SubnetId=subnet-22XXXX01,InstanceProfile=EMR_EC2_DefaultRole \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m5.xlarge,InstanceCount=1,CustomAmiId=ami-123456 \
    InstanceGroupType=CORE,InstanceType=m6g.xlarge,InstanceCount=1,CustomAmiId=ami-234567
```

You can add multiple custom AMIs to an instance group that you add to a running cluster. The `CustomAmiId` argument can be used with the `add-instance-groups` command as shown in the following example.

```
aws emr add-instance-groups --cluster-id j-123456 \
  --instance-groups \
    InstanceGroupType=TASK,InstanceType=m5.xlarge,InstanceCount=1,CustomAmiId=ami-123456
```

## Use the Java SDK to create an instance group
<a name="emr-instance-group-sdk"></a>

You instantiate an `InstanceGroupConfig` object that specifies the configuration of an instance group for a cluster. To use Spot Instances, you set the `withBidPrice` and `withMarket` properties on the `InstanceGroupConfig` object. The following code shows how to define primary, core, and task instance groups that run Spot Instances.

```
InstanceGroupConfig instanceGroupConfigMaster = new InstanceGroupConfig()
	.withInstanceCount(1)
	.withInstanceRole("MASTER")
	.withInstanceType("m4.large")
	.withMarket("SPOT")
	.withBidPrice("0.25"); 
	
InstanceGroupConfig instanceGroupConfigCore = new InstanceGroupConfig()
	.withInstanceCount(4)
	.withInstanceRole("CORE")
	.withInstanceType("m4.large")
	.withMarket("SPOT")
	.withBidPrice("0.03");
	
InstanceGroupConfig instanceGroupConfigTask = new InstanceGroupConfig()
	.withInstanceCount(2)
	.withInstanceRole("TASK")
	.withInstanceType("m4.large")
	.withMarket("SPOT")
	.withBidPrice("0.10");
```

# Availability Zone flexibility for an Amazon EMR cluster
<a name="emr-flexibility"></a>

Each AWS Region has multiple, isolated locations known as Availability Zones. When you launch an instance, you can optionally specify an Availability Zone (AZ) in the AWS Region that you use. [Availability Zone flexibility](#emr-flexibility-az) is the distribution of instances across multiple AZs. If one instance fails, you can design your application so that an instance in another AZ can handle requests. For more information about Availability Zones, see [Regions and Zones](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-availability-zones) in the *Amazon EC2 User Guide*.

[Instance flexibility](#emr-flexibility-types) is the use of multiple instance types to satisfy capacity requirements. When you express flexibility with instances, you can use aggregate capacity across instance sizes, families, and generations. Greater flexibility improves the chance to find and allocate your required amount of compute capacity when compared with a cluster that uses a single instance type.

Instance and Availability Zone flexibility reduce [insufficient capacity errors (ICE)](emr-events-response-insuff-capacity.md) and Spot interruptions when compared to a cluster with a single instance type or AZ. Use the best practices covered here to determine which instances to diversify with after you know the initial instance family and size. This approach maximizes access to Amazon EC2 capacity pools with minimal performance and cost variance.

## Being flexible about Availability Zones
<a name="emr-flexibility-az"></a>

We recommend that you configure all Availability Zones for use in your virtual private cloud (VPC) and that you select them for your EMR cluster. Clusters must exist in only one Availability Zone, but with Amazon EMR instance fleets, you can select multiple subnets for different Availability Zones. When Amazon EMR launches the cluster, it looks across those subnets to find the instances and purchasing options that you specify. When you provision an EMR cluster for multiple subnets, your cluster can access a deeper Amazon EC2 capacity pool when compared to clusters in a single subnet. 

If you must prioritize a certain number of Availability Zones in your virtual private cloud (VPC) for your EMR cluster, you can use the Spot placement score capability of Amazon EC2. With Spot placement scores, you specify the compute requirements for your Spot Instances, and then EC2 returns the top ten AWS Regions or Availability Zones, scored on a scale from 1 to 10. A score of 10 indicates that your Spot request is highly likely to succeed; a score of 1 indicates that it is not likely to succeed. For more information about how to use Spot placement scores, see [Spot placement score](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-placement-score.html) in the *Amazon EC2 User Guide*.

## Being flexible about instance types
<a name="emr-flexibility-types"></a>

Instance flexibility is the use of multiple instance types to satisfy capacity requirements. Instance flexibility benefits both Amazon EC2 Spot and On-Demand Instance usage. With Spot Instances, instance flexibility lets Amazon EC2 launch instances from deeper capacity pools using real-time capacity data, and predict which instances are most available. This results in fewer interruptions and can reduce the overall cost of a workload. With On-Demand Instances, instance flexibility reduces insufficient capacity errors (ICE) because the total capacity is provisioned across a greater number of instance pools.

For **instance group** clusters, you can specify up to 50 EC2 instance types. For **instance fleets** with an allocation strategy, you can specify up to 30 EC2 instance types for each of the primary, core, and task fleets. A broader range of instances increases the benefits of instance flexibility. 

### Expressing instance flexibility
<a name="emr-flexibility-express"></a>

Consider the following best practices to express instance flexibility for your application.

**Topics**
+ [Determine instance family and size](#emr-flexibility-express-size)
+ [Include additional instances](#emr-flexibility-express-include)

#### Determine instance family and size
<a name="emr-flexibility-express-size"></a>

Amazon EMR supports several instance types for different use cases. These instance types are listed in the [Supported instance types with Amazon EMR](emr-supported-instance-types.md) documentation. Each instance type belongs to an instance family that describes what application the type is optimized for.

For new workloads, you should benchmark with instance types in the general purpose family, such as `m5` or `c5`. Then, monitor the OS and YARN metrics from Ganglia and Amazon CloudWatch to determine system bottlenecks at peak load. Bottlenecks include CPU, memory, storage, and I/O operations. After you identify the bottlenecks, choose compute optimized, memory optimized, storage optimized, or another appropriate instance family for your instance types. For more details, see the [Determine right infrastructure for your Spark workloads](https://github.com/aws/aws-emr-best-practices/blob/main/website/docs/bestpractices/Applications/Spark/best_practices.md#bp-512-----determine-right-infrastructure-for-your-spark-workloads) page in the Amazon EMR best practices guide on GitHub. 

Next, identify the smallest YARN container or Spark executor that your application requires. The smallest instance type that fits this container is the minimum instance size for the cluster. Use this minimum to determine which instances you can diversify with; a smaller minimum instance size allows for more instance flexibility.
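The sizing step above amounts to a lookup: given the smallest executor your application needs, pick the smallest instance type whose vCPU and memory both fit it. The following sketch illustrates the idea; the instance table is a small hypothetical subset, not an authoritative list of EMR-supported types, and it ignores YARN and OS overhead that you would account for in practice.

```python
# Hypothetical subset of general purpose instance sizes: (vCPU, memory in GiB).
# Real specifications are in the Amazon EC2 instance type documentation.
M5_SIZES = {
    "m5.xlarge": (4, 16),
    "m5.2xlarge": (8, 32),
    "m5.4xlarge": (16, 64),
}

def minimum_instance(required_vcpu, required_mem_gib, sizes=M5_SIZES):
    """Return the smallest instance type that fits one executor, or None."""
    fitting = [
        (vcpu, mem, name)
        for name, (vcpu, mem) in sizes.items()
        if vcpu >= required_vcpu and mem >= required_mem_gib
    ]
    # Tuples sort by vCPU first, so min() picks the smallest fitting size.
    return min(fitting)[2] if fitting else None

# A 2 vCPU / 8 GiB executor fits on an m5.xlarge, the minimum size for the cluster.
print(minimum_instance(2, 8))  # → m5.xlarge
```

This minimum size then becomes the starting point for the diversification guidance that follows.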

For maximum instance flexibility, use as many instance types as possible. We recommend that you diversify with instances that have similar hardware specifications, which maximizes access to EC2 capacity pools with minimal cost and performance variance. Diversify across sizes first, then prioritize AWS Graviton and previous-generation instances. As a general rule, try to be flexible across at least 15 instance types for each workload. We recommend that you start with general purpose, compute optimized, or memory optimized instances, because these instance types provide the greatest flexibility. 

#### Include additional instances
<a name="emr-flexibility-express-include"></a>

For maximum diversity, include additional instance types. Prioritize size, Graviton, and generation flexibility first. This allows access to additional EC2 capacity pools with similar cost and performance profiles. If you need further flexibility due to ICE or Spot interruptions, consider variant and family flexibility. Each approach has tradeoffs that depend on your use case and requirements. 
+ **Size flexibility** – First, diversify with instances of different sizes within the same family. Instances within the same family provide the same per-unit cost and performance, but can launch a different number of containers on each host. For example, if the minimum executor size that you need is 2 vCPUs and 8 GiB of memory, the minimum instance size is `m5.xlarge`. For size flexibility, include `m5.xlarge`, `m5.2xlarge`, `m5.4xlarge`, `m5.8xlarge`, `m5.12xlarge`, `m5.16xlarge`, and `m5.24xlarge`.
+ **Graviton flexibility** – In addition to size, you can diversify with Graviton instances. Graviton instances are powered by AWS Graviton2 processors that deliver the best price performance for cloud workloads in Amazon EC2. For example, with the minimum instance size of `m5.xlarge`, you can include `m6g.xlarge`, `m6g.2xlarge`, `m6g.4xlarge`, `m6g.8xlarge`, and `m6g.16xlarge` for Graviton flexibility.
+ **Generation flexibility** – Similar to Graviton and size flexibility, instances in previous generation families share the same hardware specifications. This results in a similar cost and performance profile with an increase in the total accessible Amazon EC2 pool. For generation flexibility, include `m4.xlarge`, `m4.2xlarge`, `m4.10xlarge`, and `m4.16xlarge`.
+ **Family and variant flexibility**
  + **Capacity** – To optimize for capacity, we recommend instance flexibility across instance families. Common instances from different instance families have deeper instance pools that can assist with meeting capacity requirements. However, instances from different families will have different vCPU to memory ratios. This results in under-utilization if the expected application container is sized for a different instance. For example, with `m5.xlarge`, include compute-optimized instances such as `c5` or memory-optimized instances such as `r5` for instance family flexibility.
  + **Cost** – To optimize for cost, we recommend instance flexibility across variants. These instances have the same memory and vCPU ratio as the initial instance. The tradeoff with variant flexibility is that these instances have smaller capacity pools which might result in limited additional capacity or higher Spot interruptions. With `m5.xlarge` for example, include AMD-based instances (`m5a`), SSD-based instances (`m5d`) or network-optimized instances (`m5n`) for instance variant flexibility.
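The prioritization in the list above (sizes within the family first, then Graviton, then a previous generation, then variants) can be sketched as a simple ordered list builder. The family names come from the examples in this section; treat the function as an illustration of the ordering, not a provisioning tool, and substitute the sizes and families that fit your own minimum instance size.

```python
def diversified_candidates(base_family="m5", sizes=("xlarge", "2xlarge", "4xlarge")):
    """Build an ordered candidate list: same-family sizes first, then
    Graviton (m6g), previous generation (m4), and variants (m5a, m5d, m5n)."""
    candidates = [f"{base_family}.{s}" for s in sizes]   # size flexibility
    candidates += [f"m6g.{s}" for s in sizes]            # Graviton flexibility
    candidates += [f"m4.{s}" for s in sizes]             # generation flexibility
    for variant in ("m5a", "m5d", "m5n"):                # variant flexibility
        candidates += [f"{variant}.{s}" for s in sizes]
    return candidates

pool = diversified_candidates()
print(len(pool))  # → 18, already above the recommended 15 instance types
```

Even with only three sizes per family, the four flexibility dimensions quickly exceed the recommended minimum of 15 instance types per workload.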

# Configuring Amazon EMR cluster instance types and best practices for Spot instances
<a name="emr-plan-instances-guidelines"></a>

Use the guidance in this section to help you determine the instance types, purchasing options, and amount of storage to provision for each node type in an EMR cluster.

## What instance type should you use?
<a name="emr-instance-group-size"></a>

There are several ways to add Amazon EC2 instances to a cluster. The method you should choose depends on whether you use the instance groups configuration or the instance fleets configuration for the cluster.
+ **Instance Groups**
  + Manually add instances of the same type to existing core and task instance groups.
  + Manually add a task instance group, which can use a different instance type.
  + Set up automatic scaling in Amazon EMR for an instance group, adding and removing instances automatically based on the value of an Amazon CloudWatch metric that you specify. For more information, see [Use Amazon EMR cluster scaling to adjust for changing workloads](emr-scale-on-demand.md).
+ **Instance Fleets**
  + Add a single task instance fleet.
  + Change the target capacity for On-Demand and Spot Instances for existing core and task instance fleets. For more information, see [Planning and configuring instance fleets for your Amazon EMR cluster](emr-instance-fleet.md).

One way to plan the instances of your cluster is to run a test cluster with a representative sample set of data and monitor the utilization of the nodes in the cluster. For more information, see [View and monitor an Amazon EMR cluster as it performs work](emr-manage-view.md). Another way is to calculate the capacity of the instances you are considering and compare that value against the size of your data.

In general, the primary node type, which assigns tasks, doesn't require an EC2 instance with much processing power; Amazon EC2 instances for the core node type, which process tasks and store data in HDFS, need both processing power and storage capacity; Amazon EC2 instances for the task node type, which don't store data, need only processing power. For guidelines about available Amazon EC2 instances and their configuration, see [Configure Amazon EC2 instance types for use with Amazon EMR](emr-plan-ec2-instances.md).

 The following guidelines apply to most Amazon EMR clusters. 
+ There is a vCPU limit for the total number of On-Demand Amazon EC2 instances that you can run in an AWS account per AWS Region. For more information about the vCPU limit and how to request a limit increase for your account, see [On-Demand Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-on-demand-instances.html) in the *Amazon EC2 User Guide*. 
+ The primary node does not typically have large computational requirements. For clusters with a large number of nodes, or for clusters with applications that are specifically deployed on the primary node (JupyterHub, Hue, etc.), a larger primary node may be required and can help improve cluster performance. For example, consider using an m5.xlarge instance for small clusters (50 or fewer nodes), and increasing to a larger instance type for larger clusters.
+ The computational needs of the core and task nodes depend on the type of processing your application performs. Many jobs can be run on general purpose instance types, which offer balanced performance in terms of CPU, disk space, and input/output. Computation-intensive clusters may benefit from running on High CPU instances, which have proportionally more CPU than RAM. Database and memory-caching applications may benefit from running on High Memory instances. Network-intensive and CPU-intensive applications like parsing, NLP, and machine learning may benefit from running on cluster compute instances, which provide proportionally high CPU resources and increased network performance.
+ If different phases of your cluster have different capacity needs, you can start with a small number of core nodes and increase or decrease the number of task nodes to meet your job flow's varying capacity requirements. 
+ The amount of data you can process depends on the capacity of your core nodes and the size of your data as input, during processing, and as output. The input, intermediate, and output datasets all reside on the cluster during processing. 

## When should you use Spot Instances?
<a name="emr-plan-spot-instances"></a>

When you launch a cluster in Amazon EMR, you can choose to launch primary, core, or task instances on Spot Instances. Because each type of instance group plays a different role in the cluster, there are implications of launching each node type on Spot Instances. You can't change an instance purchasing option while a cluster is running. To change from On-Demand to Spot Instances or vice versa, for the primary and core nodes, you must terminate the cluster and launch a new one. For task nodes, you can launch a new task instance group or instance fleet, and remove the old one.

**Topics**
+ [Amazon EMR settings to prevent job failure because of task node Spot Instance termination](#emr-plan-spot-YARN)
+ [Primary node on a Spot Instance](#emr-dev-master-instance-group-spot)
+ [Core nodes on Spot Instances](#emr-dev-core-instance-group-spot)
+ [Task nodes on Spot Instances](#emr-dev-task-instance-group-spot)
+ [Instance configurations for application scenarios](#emr-plan-spot-scenarios)

### Amazon EMR settings to prevent job failure because of task node Spot Instance termination
<a name="emr-plan-spot-YARN"></a>

Because Spot Instances are often used to run task nodes, Amazon EMR has default functionality for scheduling YARN jobs so that running jobs do not fail when task nodes running on Spot Instances are terminated. Amazon EMR does this by allowing application master processes to run only on core nodes. The application master process controls running jobs and needs to stay alive for the life of the job.

Amazon EMR release 5.19.0 and later uses the built-in [YARN node labels](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeLabel.html) feature to achieve this. (Earlier versions used a code patch.) Properties in the `yarn-site` and `capacity-scheduler` configuration classifications are configured by default so that the YARN capacity scheduler and fair scheduler take advantage of node labels. Amazon EMR automatically labels core nodes with the `CORE` label, and sets properties so that application masters are scheduled only on nodes with that label. Manually modifying related properties in the `yarn-site` and `capacity-scheduler` configuration classifications, or directly in associated XML files, can break or change this functionality.

Amazon EMR configures the following properties and values by default. Use caution when configuring these properties.

**Note**  
Beginning with the Amazon EMR 6.x release series, the YARN node labels feature is disabled by default, and application master processes can run on both core and task nodes. You can enable the YARN node labels feature by configuring the following properties:   
`yarn.node-labels.enabled: true`
`yarn.node-labels.am.default-node-label-expression: 'CORE'`
+ **yarn-site (yarn-site.xml) On All Nodes**
  + `yarn.node-labels.enabled: true`
  + `yarn.node-labels.am.default-node-label-expression: 'CORE'`
  + `yarn.node-labels.fs-store.root-dir: '/apps/yarn/nodelabels'`
  + `yarn.node-labels.configuration-type: 'distributed'`
+ **yarn-site (yarn-site.xml) On Primary And Core Nodes**
  + `yarn.nodemanager.node-labels.provider: 'config'`
  + `yarn.nodemanager.node-labels.provider.configured-node-partition: 'CORE'`
+ **capacity-scheduler (capacity-scheduler.xml) On All Nodes**
  + `yarn.scheduler.capacity.root.accessible-node-labels: '*'`
  + `yarn.scheduler.capacity.root.accessible-node-labels.CORE.capacity: 100`
  + `yarn.scheduler.capacity.root.default.accessible-node-labels: '*'`
  + `yarn.scheduler.capacity.root.default.accessible-node-labels.CORE.capacity: 100`
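As a sketch of re-enabling node labels on an Amazon EMR 6.x cluster (per the note above), you could supply the two properties in a `yarn-site` configuration classification at cluster creation, for example with the `--configurations` option of `create-cluster`:

```json
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.node-labels.enabled": "true",
      "yarn.node-labels.am.default-node-label-expression": "CORE"
    }
  }
]
```

As noted above, use caution when changing these defaults, because they interact with the other `yarn-site` and `capacity-scheduler` properties that Amazon EMR sets.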

### Primary node on a Spot Instance
<a name="emr-dev-master-instance-group-spot"></a>

The primary node controls and directs the cluster. When it terminates, the cluster ends, so you should only launch the primary node as a Spot Instance if you are running a cluster where sudden termination is acceptable. This might be the case if you are testing a new application, have a cluster that periodically persists data to an external store such as Amazon S3, or are running a cluster where cost is more important than ensuring the cluster's completion. 

When you launch the primary instance group as a Spot Instance, the cluster does not start until that Spot Instance request is fulfilled. This is something to consider when selecting your maximum Spot price.

You can only add a Spot Instance primary node when you launch the cluster. You can't add or remove primary nodes from a running cluster. 

Typically, you would only run the primary node as a Spot Instance if you are running the entire cluster (all instance groups) as Spot Instances. 

### Core nodes on Spot Instances
<a name="emr-dev-core-instance-group-spot"></a>

Core nodes process data and store information using HDFS. Terminating a core instance risks data loss. For this reason, you should only run core nodes on Spot Instances when partial HDFS data loss is tolerable.

When you launch the core instance group as Spot Instances, Amazon EMR waits until it can provision all of the requested core instances before launching the instance group. In other words, if you request six Amazon EC2 instances, and only five are available at or below your maximum Spot price, the instance group won't launch. Amazon EMR continues to wait until all six Amazon EC2 instances are available or until you terminate the cluster. You can change the number of Spot Instances in a core instance group to add capacity to a running cluster. For more information about working with instance groups, and how Spot Instances work with instance fleets, see [Create an Amazon EMR cluster with instance fleets or uniform instance groups](emr-instance-group-configuration.md).

### Task nodes on Spot Instances
<a name="emr-dev-task-instance-group-spot"></a>

The task nodes process data but do not hold persistent data in HDFS. If they terminate because the Spot price has risen above your maximum Spot price, no data is lost and the effect on your cluster is minimal.

When you launch one or more task instance groups as Spot Instances, Amazon EMR provisions as many task nodes as it can, using your maximum Spot price. This means that if you request a task instance group with six nodes, and only five Spot Instances are available at or below your maximum Spot price, Amazon EMR launches the instance group with five nodes, adding the sixth later if possible. 

Launching task instance groups as Spot Instances is a strategic way to expand the capacity of your cluster while minimizing costs. If you launch your primary and core instance groups as On-Demand Instances, their capacity is guaranteed for the run of the cluster. You can add task instances to your task instance groups as needed, to handle peak traffic or speed up data processing. 

You can add or remove task nodes using the console, AWS CLI, or API. You can also add additional task groups, but you cannot remove a task group after it is created. 

### Instance configurations for application scenarios
<a name="emr-plan-spot-scenarios"></a>

The following table is a quick reference to node type purchasing options and configurations that are usually appropriate for various application scenarios. Choose the link to view more information about each scenario type.


| Application scenario | Primary node purchasing option | Core nodes purchasing option | Task nodes purchasing option | 
| --- | --- | --- | --- | 
| [Long-running clusters and data warehouses](#emr-dev-when-use-spot-data-warehouses) | On-Demand | On-Demand or instance-fleet mix | Spot or instance-fleet mix | 
| [Cost-driven workloads](#emr-dev-when-use-spot-cost-driven) | Spot | Spot | Spot | 
| [Data-critical workloads](#emr-dev-when-use-spot-data-critical) | On-Demand | On-Demand | Spot or instance-fleet mix | 
| [Application testing](#emr-dev-when-use-spot-application-testing) | Spot | Spot | Spot | 

 There are several scenarios in which Spot Instances are useful for running an Amazon EMR cluster. 

#### Long-running clusters and data warehouses
<a name="emr-dev-when-use-spot-data-warehouses"></a>

If you are running a persistent Amazon EMR cluster that has a predictable variation in computational capacity, such as a data warehouse, you can handle peak demand at lower cost with Spot Instances. You can launch your primary and core instance groups as On-Demand Instances to handle the normal capacity and launch the task instance group as Spot Instances to handle your peak load requirements.

#### Cost-driven workloads
<a name="emr-dev-when-use-spot-cost-driven"></a>

If you are running transient clusters for which lower cost is more important than the time to completion, and losing partial work is acceptable, you can run the entire cluster (primary, core, and task instance groups) as Spot Instances to benefit from the largest cost savings.

#### Data-critical workloads
<a name="emr-dev-when-use-spot-data-critical"></a>

If you are running a cluster for which lower cost is more important than time to completion, but losing partial work is not acceptable, launch the primary and core instance groups as On-Demand Instances and supplement with one or more task instance groups of Spot Instances. Running the primary and core instance groups as On-Demand Instances ensures that your data is persisted in HDFS and that the cluster is protected from termination due to Spot market fluctuations, while providing cost savings that accrue from running the task instance groups as Spot Instances.

#### Application testing
<a name="emr-dev-when-use-spot-application-testing"></a>

When you are testing a new application in order to prepare it for launch in a production environment, you can run the entire cluster (primary, core, and task instance groups) as Spot Instances to reduce your testing costs.

## Calculating the required HDFS capacity of a cluster
<a name="emr-plan-instances-hdfs"></a>

 The amount of HDFS storage available to your cluster depends on the following factors:
+ The number of Amazon EC2 instances used for core nodes.
+ The capacity of the Amazon EC2 instance store for the instance type used. For more information about instance store volumes, see [Amazon EC2 instance store](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html) in the *Amazon EC2 User Guide*.
+ The number and size of Amazon EBS volumes attached to core nodes.
+ A replication factor, which accounts for how each data block is stored in HDFS for redundancy. By default, the replication factor is three for a cluster of 10 or more core nodes, two for a cluster of 4-9 core nodes, and one for a cluster of 3 or fewer core nodes.

To calculate the HDFS capacity of a cluster, for each core node, add the instance store volume capacity to the Amazon EBS storage capacity (if used). Multiply the result by the number of core nodes, and then divide the total by the replication factor based on the number of core nodes. For example, a cluster with 10 core nodes of type i2.xlarge, which have 800 GB of instance storage without any attached Amazon EBS volumes, has a total of approximately 2,666 GB available for HDFS (10 nodes x 800 GB ÷ 3 replication factor).
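The calculation above can be checked with a few lines of arithmetic. The replication-factor defaults follow the rules listed earlier in this section; storage figures are per core node, and the result is an approximation that ignores filesystem overhead.

```python
def default_replication_factor(core_nodes):
    """Default HDFS replication factor based on the number of core nodes."""
    if core_nodes >= 10:
        return 3
    if core_nodes >= 4:
        return 2
    return 1

def hdfs_capacity_gb(core_nodes, instance_store_gb, ebs_gb=0):
    """Approximate usable HDFS capacity in GB for a cluster."""
    raw = core_nodes * (instance_store_gb + ebs_gb)
    return raw / default_replication_factor(core_nodes)

# 10 core nodes x 800 GB instance storage, replication factor 3 ≈ 2,666 GB
print(int(hdfs_capacity_gb(10, 800)))  # → 2666
```

The same function shows why adding a tenth core node can appear to shrink per-node capacity: the default replication factor steps up from two to three at 10 core nodes.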

 If the calculated HDFS capacity value is smaller than your data, you can increase the amount of HDFS storage in the following ways: 
+ Creating a cluster with additional Amazon EBS volumes or adding instance groups with attached Amazon EBS volumes to an existing cluster
+ Adding more core nodes
+ Choosing an Amazon EC2 instance type with greater storage capacity
+ Using data compression
+ Changing the Hadoop configuration settings to reduce the replication factor

Use caution when reducing the replication factor, because doing so reduces the redundancy of HDFS data and the ability of the cluster to recover from lost or corrupted HDFS blocks. 
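As a sketch of the last option, you could lower the replication factor with an `hdfs-site` configuration classification supplied at cluster creation. The value `2` here is only an example; weigh any reduction against the redundancy caution above.

```json
[
  {
    "Classification": "hdfs-site",
    "Properties": {
      "dfs.replication": "2"
    }
  }
]
```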