

# EMR File System (EMRFS)
<a name="emr-fs"></a>

**Note**  
Starting from the EMR 7.10.0 release, the S3A filesystem has replaced EMRFS as the default EMR S3 connector.

The EMR File System (EMRFS) is an implementation of HDFS that all Amazon EMR clusters use for reading and writing regular files from Amazon EMR directly to Amazon S3. EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop while also providing features like data encryption. 

Data encryption allows you to encrypt objects that EMRFS writes to Amazon S3, and enables EMRFS to work with encrypted objects in Amazon S3. If you're using Amazon EMR release version 4.8.0 or later, you can use security configurations to set up encryption for EMRFS objects in Amazon S3, along with other encryption settings. For more information, see [Encryption options](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-data-encryption-options). If you use an earlier release version of Amazon EMR, you can manually configure encryption settings. For more information, see [Specifying Amazon S3 encryption using EMRFS properties](emr-emrfs-encryption.md).

Amazon S3 offers strong read-after-write consistency for all GET, PUT, and LIST operations across all AWS Regions. This means that what you write using EMRFS is what you'll read from Amazon S3, with no impact on performance. For more information, see [Amazon S3 data consistency model](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html#ConsistencyModel).

When using Amazon EMR release version 5.10.0 or later, you can use different IAM roles for EMRFS requests to Amazon S3 based on cluster users, groups, or the location of EMRFS data in Amazon S3. For more information, see [Configure IAM roles for EMRFS requests to Amazon S3](https://docs.aws.amazon.com//emr/latest/ManagementGuide/emr-emrfs-iam-roles).

**Warning**  
Before turning on speculative execution for Amazon EMR clusters that run Apache Spark jobs, review the following information.  
EMRFS includes the EMRFS S3-optimized committer, an OutputCommitter implementation that is optimized for writing files to Amazon S3 when using EMRFS. If you turn on the Apache Spark speculative execution feature with applications that write data to Amazon S3 and do not use the EMRFS S3-optimized committer, you may encounter data correctness issues described in [SPARK-10063](https://issues.apache.org/jira/browse/SPARK-10063). This can occur if you are using Amazon EMR versions earlier than Amazon EMR release 5.19, or if you are writing files to Amazon S3 with formats such as ORC and CSV. These formats aren't supported by the EMRFS S3-optimized committer. For a complete list of requirements for using the EMRFS S3-optimized committer, see [Requirements for the EMRFS S3-optimized committer](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-committer-reqs.html).  
EMRFS direct write is typically used when the EMRFS S3-optimized committer is not supported, such as when writing the following:  
+ An output format other than Parquet, such as ORC or text.
+ Hadoop files using the Spark RDD API.
+ Parquet using Hive SerDe. See [Hive metastore Parquet table conversion](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#hive-metastore-parquet-table-conversion).

EMRFS direct write is not used in the following scenarios:  
+ When the EMRFS S3-optimized committer is enabled. See [Requirements for the EMRFS S3-optimized committer](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-committer-reqs.html).
+ When writing dynamic partitions with partitionOverwriteMode set to dynamic.
+ When writing to custom partition locations, such as locations that do not conform to the Hive default partition location convention.
+ When using file systems other than EMRFS, such as writing to HDFS or using the S3A file system.
To determine whether your application uses direct write in Amazon EMR 5.14.0 or later, enable Spark INFO logging. If a log line containing the text "Direct Write: ENABLED" is present in either Spark driver logs or Spark executor container logs, then your Spark application wrote using direct write.  
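
One hedged way to scan for that marker, assuming the driver and executor container logs have been collected to a local directory (the `LOG_DIR` path below is a placeholder for wherever you stage your logs):

```shell
# Placeholder log directory; point LOG_DIR at your collected Spark
# driver/executor container logs (for example, logs synced down from S3).
LOG_DIR=${LOG_DIR:-/var/log/spark}

# List any log files that contain the direct-write marker.
grep -rl "Direct Write: ENABLED" "$LOG_DIR" 2>/dev/null \
  || echo "No direct-write marker found under $LOG_DIR"
```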
By default, speculative execution is turned `OFF` on Amazon EMR clusters. We highly recommend that you do not turn on speculative execution if both of the following conditions are true:  
+ You are writing data to Amazon S3.
+ Data is written in a format other than Apache Parquet, or in Apache Parquet format without the EMRFS S3-optimized committer.
If you turn on Spark speculative execution and write data to Amazon S3 using EMRFS direct write, you may experience intermittent data loss. When you write data to HDFS, or write data in Parquet using the EMRFS S3-optimized committer, Amazon EMR does not use direct write and this issue does not occur.  
If you need to write data in formats that use EMRFS direct write from Spark to Amazon S3 and use speculative execution, we recommend writing to HDFS and then transferring output files to Amazon S3 using S3DistCP.
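
As a sketch of that workaround, a final step could copy the HDFS output to Amazon S3 with S3DistCp; the cluster ID, HDFS path, and bucket name below are placeholders:

```shell
# Hypothetical cluster ID, HDFS output directory, and S3 destination;
# replace all three with your own values.
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Name=CopyOutputToS3,Jar=command-runner.jar,Args=[s3-dist-cp,--src,hdfs:///user/output,--dest,s3://amzn-s3-demo-bucket/output]'
```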

**Topics**
+ [Consistent view](emr-plan-consistent-view.md)
+ [Authorizing access to EMRFS data in Amazon S3](emr-plan-credentialsprovider.md)
+ [Managing the default AWS Security Token Service endpoint](emr-emrfs-sts-endpoint.md)
+ [Specifying Amazon S3 encryption using EMRFS properties](emr-emrfs-encryption.md)

# Consistent view
<a name="emr-plan-consistent-view"></a>

**Warning**  
On June 1, 2023, EMRFS consistent view will reach end of standard support for future Amazon EMR releases. EMRFS consistent view will continue to work for existing releases.

With the release of Amazon S3 strong read-after-write consistency on December 1, 2020, you no longer need to use EMRFS consistent view (EMRFS CV) with your Amazon EMR clusters. EMRFS CV is an optional feature that allows Amazon EMR clusters to check for list and read-after-write consistency for Amazon S3 objects. When you create a cluster and EMRFS CV is turned on, Amazon EMR creates an Amazon DynamoDB database to store object metadata that it uses to track list and read-after-write consistency for S3 objects. You can now turn off EMRFS CV and delete the DynamoDB database that it uses so that you don't accrue additional costs. The following procedures explain how to check for the CV feature, turn it off, and delete the DynamoDB database that the feature uses.

**To check if you're using the EMRFS CV feature**

1. In the Amazon EMR console, select your cluster and navigate to the **Configurations** tab. If your cluster has the following configuration, it uses EMRFS CV.

   ```
   Classification=emrfs-site,Property=fs.s3.consistent,Value=true
   ```

1. Alternatively, use the AWS CLI to describe your cluster with the [`describe-cluster` API](https://docs.aws.amazon.com/cli/latest/reference/emr/describe-cluster.html). If the output contains `fs.s3.consistent: true`, your cluster uses EMRFS CV.
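
   For example (the cluster ID is a placeholder):

   ```shell
   # Hypothetical cluster ID; replace j-XXXXX with your own.
   aws emr describe-cluster --cluster-id j-XXXXX | grep "fs.s3.consistent"
   ```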

**To turn off EMRFS CV on your Amazon EMR clusters**

To turn off the EMRFS CV feature, use one of the following three options. Test these options in a development environment before applying them to your production environments.

1. 

**To stop your existing cluster and start a new cluster without EMRFS CV options.**

   1. Before you stop your cluster, ensure that you back up your data and notify your users.

   1. To stop your cluster, follow the instructions in [Terminate a cluster](https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_TerminateJobFlow.html).

   1. If you use the Amazon EMR console to create a new cluster, navigate to **Advanced Options**. In the **Edit software settings** section, clear the option that turns on EMRFS CV. If the check box for **EMRFS consistent view** is available, keep it unselected.

   1. If you use AWS CLI to create a new cluster with the [`create-cluster` API](https://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html), don't use the `--emrfs` option, which turns on EMRFS CV.

   1. If you use an SDK or CloudFormation to create a new cluster, don't use any of the configurations listed in [Configure consistent view](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emrfs-configure-consistent-view.html).

1. 

**To clone a cluster and remove EMRFS CV**

   1. In the Amazon EMR console, choose the cluster that uses EMRFS CV.

   1. At the top of the **Cluster Details** page, choose **Clone**.

   1. Choose **Previous** and navigate to **Step 1: Software and Steps**.

   1. In **Edit software settings**, remove EMRFS CV. In **Edit configuration**, delete the following configurations in the `emrfs-site` classification. If you're loading JSON from an S3 bucket, you must modify your S3 object.

      ```
      [
          {
              "classification": "emrfs-site",
              "properties": {
                  "fs.s3.consistent.retryPeriodSeconds": "10",
                  "fs.s3.consistent": "true",
                  "fs.s3.consistent.retryCount": "5",
                  "fs.s3.consistent.metadata.tableName": "EmrFSMetadata"
              }
          }
      ]
      ```

1. 

**To remove EMRFS CV from a cluster that uses instance groups**

   1. Use the following command to check if a single EMR cluster uses the DynamoDB table that is associated with EMRFS CV, or if multiple clusters share the table. The table name is specified in `fs.s3.consistent.metadata.tableName`, as described in [Configure consistent view](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emrfs-configure-consistent-view.html). The default table name used by EMRFS CV is `EmrFSMetadata`.

      ```
      aws emr describe-cluster --cluster-id j-XXXXX | grep fs.s3.consistent.metadata.tableName
      ```

   1. If your cluster doesn't share your DynamoDB database with another cluster, use the following command to reconfigure the cluster and deactivate EMRFS CV. For more information, see [Reconfigure an instance group in a running cluster](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps-running-cluster.html).

      ```
      aws emr modify-instance-groups --cli-input-json file://disable-emrfs-1.json
      ```

      This command opens the file you want to modify. Modify the file with the following configurations.

      ```
      {
      	"ClusterId": "j-xxxx",
      	"InstanceGroups": [
      		{
      			"InstanceGroupId": "ig-xxxx",
      			"Configurations": [
      				{
      					"Classification": "emrfs-site",
      					"Properties": {
      						"fs.s3.consistent": "false"
      					},
      					"Configurations": []
      				}
      			]
      		}
      	]
      }
      ```

   1. If your cluster shares the DynamoDB table with another cluster, turn off EMRFS CV on all of the clusters at a time when no cluster is modifying objects in the shared S3 location.

**To delete Amazon DynamoDB resources associated with EMRFS CV**

After you remove EMRFS CV from your Amazon EMR clusters, delete the DynamoDB resources associated with EMRFS CV. Until you do so, you continue to incur DynamoDB charges associated with EMRFS CV.

1. Check the CloudWatch metrics for your DynamoDB table and confirm that the table isn't used by any clusters.

1. Delete the DynamoDB table.

   ```
   aws dynamodb delete-table --table-name <your-table-name>
   ```
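
The idle check in step 1 can be sketched with the AWS CLI; the table name, dates, and period below are example values that assume the default EMRFS table:

```shell
# Assumes the default EMRFS table name; adjust the table, time window,
# and period for your environment. Zero consumed capacity over the
# window suggests no cluster is using the table.
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ConsumedReadCapacityUnits \
  --dimensions Name=TableName,Value=EmrFSMetadata \
  --start-time 2023-01-01T00:00:00Z --end-time 2023-01-02T00:00:00Z \
  --period 3600 --statistics Sum
```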

**To delete Amazon SQS resources associated with EMRFS CV**

1. If you configured your cluster to push inconsistency notifications to Amazon SQS, you can delete all SQS queues.

1. Find the Amazon SQS queue name specified in `fs.s3.consistent.notification.SQS.queueName`, as described in [Configure consistent view](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emrfs-configure-consistent-view.html). The default queue name format is `EMRFS-Inconsistency-<j-cluster ID>`.

   ```
   aws sqs list-queues | grep 'EMRFS-Inconsistency'
   aws sqs delete-queue --queue-url <your-queue-url>
   ```

**To stop using the EMRFS CLI**
+ The [EMRFS CLI](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emrfs-cli-reference.html) manages the metadata that EMRFS CV generates. As standard support for EMRFS CV reaches its end in future releases of Amazon EMR, support for the EMRFS CLI will also reach its end. 

**Topics**
+ [Enable consistent view](enable-consistent-view.md)
+ [Understanding how EMRFS consistent view tracks objects in Amazon S3](emrfs-files-tracked.md)
+ [Retry logic](emrfs-retry-logic.md)
+ [EMRFS consistent view metadata](emrfs-metadata.md)
+ [Configure consistency notifications for CloudWatch and Amazon SQS](emrfs-configure-sqs-cw.md)
+ [Configure consistent view](emrfs-configure-consistent-view.md)
+ [EMRFS CLI Command Reference](emrfs-cli-reference.md)

# Enable consistent view
<a name="enable-consistent-view"></a>

You can enable Amazon S3 server-side encryption or consistent view for EMRFS using the AWS Management Console, AWS CLI, or the `emrfs-site` configuration classification.<a name="enable-emr-fs-console"></a>

**To configure consistent view using the console**

1. Navigate to the new Amazon EMR console and select **Switch to the old console** from the side navigation. For more information on what to expect when you switch to the old console, see [Using the old console](https://docs.aws.amazon.com/emr/latest/ManagementGuide/whats-new-in-console.html#console-opt-in).

1. Choose **Create cluster**, **Go to advanced options**.

1. Choose settings for **Step 1: Software and Steps** and **Step 2: Hardware**. 

1. For **Step 3: General Cluster Settings**, under **Additional Options**, choose **EMRFS consistent view**.

1. For **EMRFS Metadata store**, type the name of your metadata store. The default value is **EmrFSMetadata**. If the EmrFSMetadata table does not exist, it is created for you in DynamoDB.
**Note**  
Amazon EMR does not automatically remove the EMRFS metadata from DynamoDB when the cluster is terminated.

1. For **Number of retries**, type an integer value. If an inconsistency is detected, EMRFS tries to call Amazon S3 this number of times. The default value is **5**. 

1. For **Retry period (in seconds)**, type an integer value. This is the amount of time that EMRFS waits between retry attempts. The default value is **10**.
**Note**  
Subsequent retries use an exponential backoff. 

**To launch a cluster with consistent view enabled using the AWS CLI**

We recommend that you install the current version of AWS CLI. To download the latest release, see [https://aws.amazon.com/cli/](https://aws.amazon.com/cli/).
+ 
**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

  ```
  aws emr create-cluster --instance-type m5.xlarge --instance-count 3 --emrfs Consistent=true \
  --release-label emr-7.12.0 --ec2-attributes KeyName=myKey
  ```

**To check if consistent view is enabled using the AWS Management Console**
+ To check whether consistent view is enabled in the console, navigate to the **Cluster List** and select your cluster name to view **Cluster Details**. The "EMRFS consistent view" field has a value of `Enabled` or `Disabled`.

**To check if consistent view is enabled by examining the `emrfs-site.xml` file**
+ You can check if consistency is enabled by inspecting the `emrfs-site.xml` configuration file on the master node of the cluster. If the Boolean value for `fs.s3.consistent` is set to `true` then consistent view is enabled for file system operations involving Amazon S3.
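
  For example, run the following on the master node; the configuration file path below is the typical location, which may vary by release:

  ```shell
  # Typical emrfs-site.xml location on an EMR master node (may vary by release).
  CONF=${CONF:-/usr/share/aws/emr/emrfs/conf/emrfs-site.xml}

  # Print the fs.s3.consistent property and the value line that follows it.
  grep -A 1 "fs.s3.consistent<" "$CONF" 2>/dev/null \
    || echo "fs.s3.consistent not found (consistent view likely disabled)"
  ```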

# Understanding how EMRFS consistent view tracks objects in Amazon S3
<a name="emrfs-files-tracked"></a>

EMRFS creates a consistent view of objects in Amazon S3 by adding information about those objects to the EMRFS metadata. EMRFS adds these listings to its metadata when:
+  An object is written by EMRFS during the course of an Amazon EMR job.
+  An object is synced with or imported to EMRFS metadata by using the EMRFS CLI.

Objects read by EMRFS are not automatically added to the metadata. When EMRFS deletes an object, a listing still remains in the metadata with a deleted state until that listing is purged using the EMRFS CLI. To learn more about the CLI, see [EMRFS CLI Command Reference](emrfs-cli-reference.md). For more information about purging listings in the EMRFS metadata, see [EMRFS consistent view metadata](emrfs-metadata.md).

For every Amazon S3 operation, EMRFS checks the metadata for information about the set of objects in consistent view. If EMRFS finds that Amazon S3 is inconsistent during one of these operations, it retries the operation according to parameters defined in `emrfs-site` configuration properties. After EMRFS exhausts the retries, it either throws a `ConsistencyException` or logs the exception and continues the workflow. For more information about retry logic, see [Retry logic](emrfs-retry-logic.md). You can find `ConsistencyExceptions` in your logs, for example:
+  listStatus: No Amazon S3 object for metadata item `/S3_bucket/dir/object`
+  getFileStatus: Key `dir/file` is present in metadata but not Amazon S3

If you delete an object directly from Amazon S3 that EMRFS consistent view tracks, EMRFS treats that object as inconsistent because it is still listed in the metadata as present in Amazon S3. If your metadata becomes out of sync with the objects EMRFS tracks in Amazon S3, you can use the **sync** sub-command of the EMRFS CLI to reset the metadata so that it reflects Amazon S3. To discover discrepancies between the metadata and Amazon S3, use the **diff** sub-command. Finally, EMRFS only has a consistent view of the objects referenced in the metadata; there can be other objects in the same Amazon S3 path that are not being tracked. When EMRFS lists the objects in an Amazon S3 path, it returns the superset of the objects being tracked in the metadata and those in that Amazon S3 path.
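
On the master node, those sub-commands look like the following; the bucket and path are placeholders:

```shell
# Run on the cluster's master node; replace the S3 path with a tracked location.
emrfs diff s3://amzn-s3-demo-bucket/data/   # report metadata vs. S3 discrepancies
emrfs sync s3://amzn-s3-demo-bucket/data/   # reset metadata to reflect S3
```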

# Retry logic
<a name="emrfs-retry-logic"></a>

EMRFS tries to verify list consistency for objects tracked in its metadata for a specific number of retries. The default is 5. If the number of retries is exceeded, the originating job returns a failure unless `fs.s3.consistent.throwExceptionOnInconsistency` is set to `false`, in which case it only logs the objects tracked as inconsistent. EMRFS uses an exponential backoff retry policy by default, but you can also set it to a fixed policy. Users may also want to retry for a certain period of time before proceeding with the rest of their job without throwing an exception. They can achieve this by setting `fs.s3.consistent.throwExceptionOnInconsistency` to `false`, `fs.s3.consistent.retryPolicyType` to `fixed`, and `fs.s3.consistent.retryPeriodSeconds` to the desired value. The following example creates a cluster with consistency enabled, which logs inconsistencies and sets a fixed retry interval of 10 seconds:

**Example Setting retry period to a fixed amount**  

```
aws emr create-cluster --release-label emr-7.12.0 \
--instance-type m5.xlarge --instance-count 1 \
--emrfs Consistent=true,Args=[fs.s3.consistent.throwExceptionOnInconsistency=false,fs.s3.consistent.retryPolicyType=fixed,fs.s3.consistent.retryPeriodSeconds=10] --ec2-attributes KeyName=myKey
```

**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

For more information, see [Consistent view](emr-plan-consistent-view.md).

## EMRFS configurations for IMDS get region calls
<a name="randomized-exponential-backoff-retry"></a>

EMRFS relies on the instance metadata service (IMDS) to get the instance Region and the Amazon S3, DynamoDB, and AWS KMS endpoints. However, IMDS has a limit on how many requests it can handle, and requests that exceed that limit fail. This IMDS limit can cause EMRFS to fail to initialize, which causes the query or command to fail. You can use the following randomized exponential backoff retry mechanism and fallback Region configuration properties in `emrfs-site.xml` to address the scenario where all retries fail.

```
<property>
    <name>fs.s3.region.retryCount</name>
    <value>3</value>
    <description>
    Maximum retries that would be attempted to get AWS region.
    </description>
</property>
<property>
    <name>fs.s3.region.retryPeriodSeconds</name>
    <value>3</value>
    <description>
    Base sleep time in second for each get-region retry.
    </description>
</property>
<property>
    <name>fs.s3.region.fallback</name>
    <value>us-east-1</value>
    <description>
    Fallback to this region after maximum retries for getting AWS region have been reached.
    </description>
</property>
```

# EMRFS consistent view metadata
<a name="emrfs-metadata"></a>

EMRFS consistent view uses a DynamoDB table to track objects in Amazon S3 that have been synced with or created by EMRFS. The metadata is used to track all operations (read, write, update, and copy); no actual content is stored in it. This metadata is used to validate whether the objects or metadata received from Amazon S3 matches what is expected. This confirmation gives EMRFS the ability to check list consistency and read-after-write consistency for new objects EMRFS writes to Amazon S3 or objects synced with EMRFS. Multiple clusters can share the same metadata.

**How to add entries to metadata**  
You can use the `sync` or `import` subcommands to add entries to metadata. `sync` reflects the state of the Amazon S3 objects in a path, while `import` is used strictly to add new entries to the metadata. For more information, see [EMRFS CLI Command Reference](emrfs-cli-reference.md).

**How to check differences between metadata and objects in Amazon S3**  
To check for differences between the metadata and Amazon S3, use the `diff` subcommand of the EMRFS CLI. For more information, see [EMRFS CLI Command Reference](emrfs-cli-reference.md).

**How to know if metadata operations are being throttled**  
EMRFS sets default throughput capacity limits on the metadata for its read and write operations at 500 and 100 units, respectively. Large numbers of objects or buckets may cause operations to exceed this capacity, at which point DynamoDB throttles operations. For example, an application may cause EMRFS to throw a `ProvisionedThroughputExceededException` if you perform an operation that exceeds these capacity limits. Upon throttling, the EMRFS CLI tool attempts to retry writing to the DynamoDB table using [exponential backoff](https://docs.aws.amazon.com/general/latest/gr/api-retries.html) until the operation finishes or it reaches the maximum retry value for writing objects from Amazon EMR to Amazon S3. 

You can configure your own throughput capacity limits. However, DynamoDB has strict partition limits of 3000 read capacity units (RCUs) and 1000 write capacity units (WCUs) per second for read and write operations. To avoid `sync` failures caused by throttling, we recommend you limit throughput for read operations to fewer than 3000 RCUs and write operations to fewer than 1000 WCUs. For instructions on setting custom throughput capacity limits, see [Configure consistent view](emrfs-configure-consistent-view.md).

You can also view Amazon CloudWatch metrics for your EMRFS metadata in the DynamoDB console where you can see the number of throttled read and write requests. If you do have a non-zero value for throttled requests, your application may potentially benefit from increasing allocated throughput capacity for read or write operations. You may also realize a performance benefit if you see that your operations are approaching the maximum allocated throughput capacity in reads or writes for an extended period of time.

**Throughput characteristics for notable EMRFS operations**  
The default for read and write operations is 500 and 100 throughput capacity units, respectively. The following performance characteristics give you an idea of what throughput certain operations require. These tests were performed using a single-node `m3.large` cluster. All operations were single threaded. Performance differs greatly based on particular application characteristics, and it may take experimentation to optimize file system operations.


| Operation  | Average read-per-second  | Average write-per-second  | 
| --- | --- | --- | 
| create (object) | 26.79 |  6.70 | 
| delete (object) | 10.79 |  10.79 | 
| delete (directory containing 1000 objects) | 21.79 | 338.40  | 
|  getFileStatus (object) | 34.70 | 0  | 
| getFileStatus (directory) | 19.96 | 0 | 
| listStatus (directory containing 1 object) | 43.31 | 0 | 
| listStatus (directory containing 10 objects) | 44.34 | 0 | 
| listStatus (directory containing 100 objects) | 84.44 | 0 | 
| listStatus (directory containing 1,000 objects) | 308.81 | 0 | 
| listStatus (directory containing 10,000 objects) | 416.05 | 0 | 
| listStatus (directory containing 100,000 objects) | 823.56 | 0 | 
| listStatus (directory containing 1M objects) | 882.36 | 0 | 
| mkdir (continuous for 120 seconds)  | 24.18 | 4.03 | 
| mkdir | 12.59 | 0 | 
| rename (object) | 19.53 | 4.88 | 
| rename (directory containing 1000 objects) | 23.22 | 339.34 | 

**To submit a step that purges old data from your metadata store**  
You may wish to remove particular entries in the DynamoDB-based metadata, which can help reduce storage costs associated with the table. You can manually or programmatically purge particular entries by using the EMRFS CLI `delete` subcommand. However, if you delete entries from the metadata, EMRFS no longer makes any checks for consistency.

Programmatically purging after the completion of a job can be done by submitting a final step to your cluster, which executes a command on the EMRFS CLI. For instance, type the following command to submit a step to your cluster to delete all entries older than two days.

```
aws emr add-steps --cluster-id j-2AL4XXXXXX5T9 --steps Name="emrfsCLI",Jar="command-runner.jar",Args=["emrfs","delete","--time","2","--time-unit","days"]
{
    "StepIds": [
        "s-B12345678902"
    ]
}
```

Use the `StepId` value returned to check the logs for the result of the operation.

# Configure consistency notifications for CloudWatch and Amazon SQS
<a name="emrfs-configure-sqs-cw"></a>

You can enable CloudWatch metrics and Amazon SQS messages in EMRFS for Amazon S3 eventual consistency issues. 

**CloudWatch**  
When CloudWatch metrics are enabled, a metric named **Inconsistency** is pushed each time a `FileSystem` API call fails due to Amazon S3 eventual consistency. 

**To view CloudWatch metrics for Amazon S3 eventual consistency issues**

To view the **Inconsistency** metric in the CloudWatch console, select the EMRFS metrics and then select a **JobFlowId**/**Metric Name** pair. For example: `j-162XXXXXXM2CU ListStatus`, `j-162XXXXXXM2CU GetFileStatus`, and so on.

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the **Dashboard**, in the **Metrics** section, choose **EMRFS**. 

1. In the **Job Flow Metrics** pane, select one or more **JobFlowId**/**Metric Name** pairs. A graphical representation of the metrics appears in the window below.

**Amazon SQS**  
When Amazon SQS notifications are enabled, an Amazon SQS queue with the name `EMRFS-Inconsistency-<jobFlowId>` is created when EMRFS is initialized. Amazon SQS messages are pushed into the queue when a `FileSystem` API call fails due to Amazon S3 eventual consistency. The message contains information such as JobFlowId, API, a list of inconsistent paths, a stack trace, and so on. Messages can be read using the Amazon SQS console or using the EMRFS `read-sqs` command.

**To manage Amazon SQS messages for Amazon S3 eventual consistency issues**

Amazon SQS messages for Amazon S3 eventual consistency issues can be read using the EMRFS CLI. To read messages from an EMRFS Amazon SQS queue, type the `read-sqs` command and specify an output location on the master node's local file system for the resulting output file. 

You can also delete an EMRFS Amazon SQS queue using the `delete-sqs` command.

1. To read messages from an Amazon SQS queue, type the following command. Replace *queuename* with the name of the Amazon SQS queue that you configured and replace */path/filename* with the path to the output file:

   ```
   emrfs read-sqs --queue-name queuename --output-file /path/filename
   ```

   For example, to read and output Amazon SQS messages from the default queue, type:

   ```
   emrfs read-sqs --queue-name EMRFS-Inconsistency-j-162XXXXXXM2CU --output-file /path/filename
   ```
**Note**  
You can also use the `-q` and `-o` shortcuts instead of `--queue-name` and `--output-file` respectively.

1. To delete an Amazon SQS queue, type the following command:

   ```
   emrfs delete-sqs --queue-name queuename
   ```

   For example, to delete the default queue, type:

   ```
   emrfs delete-sqs --queue-name EMRFS-Inconsistency-j-162XXXXXXM2CU
   ```
**Note**  
You can also use the `-q` shortcut instead of `--queue-name`.

# Configure consistent view
<a name="emrfs-configure-consistent-view"></a>

You can configure additional settings for consistent view by providing them as `emrfs-site` configuration properties. For example, you can choose a different default DynamoDB throughput by supplying the following arguments to the CLI `--emrfs` option, by using the `emrfs-site` configuration classification (Amazon EMR release version 4.x and later only), or by using a bootstrap action to configure the `emrfs-site.xml` file on the master node:

**Example Changing default metadata read and write values at cluster launch**  

```
aws emr create-cluster --release-label emr-7.12.0 --instance-type m5.xlarge \
--emrfs Consistent=true,Args=[fs.s3.consistent.metadata.read.capacity=600,\
fs.s3.consistent.metadata.write.capacity=300] --ec2-attributes KeyName=myKey
```

Alternatively, use the following configuration file and save it locally or in Amazon S3:

```
[
    {
        "Classification": "emrfs-site",
        "Properties": {
            "fs.s3.consistent.metadata.read.capacity": "600",
            "fs.s3.consistent.metadata.write.capacity": "300"
        }
    }
]
```

Use the configuration you created with the following syntax:

```
aws emr create-cluster --release-label emr-7.12.0 --applications Name=Hive \
--instance-type m5.xlarge --instance-count 2 --configurations file://./myConfig.json
```

**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

The following options can be set using configurations or AWS CLI `--emrfs` arguments. For information about those arguments, see the [AWS CLI Command Reference](https://docs.aws.amazon.com/cli/latest/reference/).


**`emrfs-site.xml` Properties for consistent view**  

| Property  | Default value | Description  | 
| --- | --- | --- | 
| fs.s3.consistent | false |  When set to **true**, this property configures EMRFS to use DynamoDB to provide consistency.  | 
| fs.s3.consistent.retryPolicyType | exponential | This property identifies the policy to use when retrying for consistency issues. Options include: exponential, fixed, or none. | 
| fs.s3.consistent.retryPeriodSeconds | 1 | This property sets the length of time to wait between consistency retry attempts. | 
| fs.s3.consistent.retryCount | 10 | This property sets the maximum number of retries when inconsistency is detected. | 
| fs.s3.consistent.throwExceptionOnInconsistency | true | This property determines whether to throw or log a consistency exception. When set to true, a ConsistencyException is thrown. | 
| fs.s3.consistent.metadata.autoCreate | true | When set to true, this property enables automatic creation of metadata tables. | 
| fs.s3.consistent.metadata.etag.verification.enabled | true | Starting with Amazon EMR 5.29.0, this property is enabled by default. When enabled, EMRFS uses S3 ETags to verify that objects being read are the latest available version. This feature is helpful for read-after-update use cases in which files on S3 are overwritten while retaining the same name. This ETag verification capability currently does not work with S3 Select. | 
| fs.s3.consistent.metadata.tableName | EmrFSMetadata | This property specifies the name of the metadata table in DynamoDB. | 
| fs.s3.consistent.metadata.read.capacity | 500 | This property specifies the DynamoDB read capacity to provision when the metadata table is created. | 
| fs.s3.consistent.metadata.write.capacity | 100 | This property specifies the DynamoDB write capacity to provision when the metadata table is created. | 
| fs.s3.consistent.fastList | true | When set to true, this property uses multiple threads to list a directory (when necessary). Consistency must be enabled in order to use this property. | 
| fs.s3.consistent.fastList.prefetchMetadata | false | When set to true, this property enables metadata prefetching for directories containing more than 20,000 items. | 
| fs.s3.consistent.notification.CloudWatch | false | When set to true, CloudWatch metrics are enabled for FileSystem API calls that fail due to Amazon S3 eventual consistency issues. | 
| fs.s3.consistent.notification.SQS | false | When set to true, eventual consistency notifications are pushed to an Amazon SQS queue. | 
| fs.s3.consistent.notification.SQS.queueName | EMRFS-Inconsistency-<jobFlowId> | Changing this property allows you to specify your own SQS queue name for messages regarding Amazon S3 eventual consistency issues. | 
| fs.s3.consistent.notification.SQS.customMsg | none | This property allows you to specify custom information included in SQS messages regarding Amazon S3 eventual consistency issues. If a value is not specified for this property, the corresponding field in the message is empty.  | 
| fs.s3.consistent.dynamodb.endpoint | none | This property allows you to specify a custom DynamoDB endpoint for your consistent view metadata. | 
| fs.s3.useRequesterPaysHeader | false | When set to true, this property allows Amazon S3 requests to buckets with the Requester Pays option enabled.  | 

# EMRFS CLI Command Reference
<a name="emrfs-cli-reference"></a>

The EMRFS CLI is installed by default on all cluster master nodes created using Amazon EMR release version 3.2.1 or later. You can use the EMRFS CLI to manage the metadata for consistent view. 

**Note**  
The **emrfs** command is only supported with VT100 terminal emulation. However, it may work with other terminal emulator modes.

## emrfs top-level command
<a name="emrfs-top-level"></a>

The **emrfs** top-level command supports the following structure.

```
emrfs [describe-metadata | set-metadata-capacity | delete-metadata | create-metadata | \
list-metadata-stores | diff | delete | sync | import ] [options] [arguments]
```

Specify [options], with or without [arguments] as described in the following table. For [options] specific to sub-commands (`describe-metadata`, `set-metadata-capacity`, etc.), see each sub-command below.


**[Options] for emrfs**  

| Option  | Description  | Required  | 
| --- | --- | --- | 
|  `-a AWS_ACCESS_KEY_ID \| --access-key AWS_ACCESS_KEY_ID`  |  The AWS access key you use to write objects to Amazon S3 and to create or access a metadata store in DynamoDB. By default, *AWS\_ACCESS\_KEY\_ID* is set to the access key used to create the cluster.  |  No  | 
|  `-s AWS_SECRET_ACCESS_KEY \| --secret-key AWS_SECRET_ACCESS_KEY`  |  The AWS secret key associated with the access key you use to write objects to Amazon S3 and to create or access a metadata store in DynamoDB. By default, *AWS\_SECRET\_ACCESS\_KEY* is set to the secret key associated with the access key used to create the cluster.  |  No  | 
|  `-v \| --verbose`  |  Makes output verbose.  |  No  | 
|  `-h \| --help`  |  Displays the help message for the `emrfs` command with a usage statement.  |  No  | 

## emrfs describe-metadata sub-command
<a name="emrfs-describe-metadata"></a>


**[Options] for emrfs describe-metadata**  

| Option  | Description  | Required  | 
| --- | --- | --- | 
|  `-m METADATA_NAME \| --metadata-name METADATA_NAME`  |  *METADATA\_NAME* is the name of the DynamoDB metadata table. If the *METADATA\_NAME* argument is not supplied, the default value is `EmrFSMetadata`.  |  No  | 

**Example: emrfs describe-metadata**  
The following example describes the default metadata table.  

```
$ emrfs describe-metadata
EmrFSMetadata
  read-capacity: 400
  write-capacity: 100
  status: ACTIVE
  approximate-item-count (6 hour delay): 12
```

## emrfs set-metadata-capacity sub-command
<a name="emrfs-set-metadata-capacity"></a>


**[Options] for emrfs set-metadata-capacity**  

| Option  | Description  | Required  | 
| --- | --- | --- | 
|  `-m METADATA_NAME \| --metadata-name METADATA_NAME`  |  *METADATA\_NAME* is the name of the DynamoDB metadata table. If the *METADATA\_NAME* argument is not supplied, the default value is `EmrFSMetadata`.  |  No  | 
|  `-r READ_CAPACITY \| --read-capacity READ_CAPACITY`  |  The requested read throughput capacity for the metadata table. If the *READ\_CAPACITY* argument is not supplied, the default value is `400`.  |  No  | 
|  `-w WRITE_CAPACITY \| --write-capacity WRITE_CAPACITY`  |  The requested write throughput capacity for the metadata table. If the *WRITE\_CAPACITY* argument is not supplied, the default value is `100`.  |  No  | 

**Example: emrfs set-metadata-capacity**  
The following example sets the read throughput capacity to `600` and the write capacity to `150` for a metadata table named `EmrMetadataAlt`.  

```
$ emrfs set-metadata-capacity --metadata-name EmrMetadataAlt --read-capacity 600 --write-capacity 150
  read-capacity: 400
  write-capacity: 100
  status: UPDATING
  approximate-item-count (6 hour delay): 0
```

## emrfs delete-metadata sub-command
<a name="emrfs-delete-metadata"></a>


**[Options] for emrfs delete-metadata**  

| Option  | Description  | Required  | 
| --- | --- | --- | 
|  `-m METADATA_NAME \| --metadata-name METADATA_NAME`  |  *METADATA\_NAME* is the name of the DynamoDB metadata table. If the *METADATA\_NAME* argument is not supplied, the default value is `EmrFSMetadata`.  |  No  | 

**Example: emrfs delete-metadata**  
The following example deletes the default metadata table.  

```
$ emrfs delete-metadata
```

## emrfs create-metadata sub-command
<a name="emrfs-create-metadata"></a>


**[Options] for emrfs create-metadata**  

| Option  | Description  | Required  | 
| --- | --- | --- | 
|  `-m METADATA_NAME \| --metadata-name METADATA_NAME`  |  *METADATA\_NAME* is the name of the DynamoDB metadata table. If the *METADATA\_NAME* argument is not supplied, the default value is `EmrFSMetadata`.  |  No  | 
|  `-r READ_CAPACITY \| --read-capacity READ_CAPACITY`  |  The requested read throughput capacity for the metadata table. If the *READ\_CAPACITY* argument is not supplied, the default value is `400`.  |  No  | 
|  `-w WRITE_CAPACITY \| --write-capacity WRITE_CAPACITY`  |  The requested write throughput capacity for the metadata table. If the *WRITE\_CAPACITY* argument is not supplied, the default value is `100`.  |  No  | 

**Example: emrfs create-metadata**  
The following example creates a metadata table named `EmrFSMetadataAlt`.  

```
$ emrfs create-metadata -m EmrFSMetadataAlt
Creating metadata: EmrFSMetadataAlt
EmrFSMetadataAlt
  read-capacity: 400
  write-capacity: 100
  status: ACTIVE
  approximate-item-count (6 hour delay): 0
```

## emrfs list-metadata-stores sub-command
<a name="emrfs-list-metadata-stores"></a>

The **emrfs list-metadata-stores** sub-command has no [options]. 

**Example: emrfs list-metadata-stores**  
The following example lists your metadata tables.  

```
$ emrfs list-metadata-stores
  EmrFSMetadata
```

## emrfs diff sub-command
<a name="emrfs-diff"></a>


**[Options] for emrfs diff**  

| Option  | Description  | Required  | 
| --- | --- | --- | 
|  `-m METADATA_NAME \| --metadata-name METADATA_NAME`  |  *METADATA\_NAME* is the name of the DynamoDB metadata table. If the *METADATA\_NAME* argument is not supplied, the default value is `EmrFSMetadata`.  |  No  | 
|  *s3://s3Path*  |  The path to the Amazon S3 bucket to compare with the metadata table. Buckets sync recursively.  |  Yes  | 

**Example: emrfs diff**  
The following example compares the default metadata table to an Amazon S3 bucket.  

```
$ emrfs diff s3://elasticmapreduce/samples/cloudfront
BOTH | MANIFEST ONLY | S3 ONLY
DIR elasticmapreduce/samples/cloudfront
DIR elasticmapreduce/samples/cloudfront/code/
DIR elasticmapreduce/samples/cloudfront/input/
DIR elasticmapreduce/samples/cloudfront/logprocessor.jar
DIR elasticmapreduce/samples/cloudfront/input/XABCD12345678.2009-05-05-14.WxYz1234
DIR elasticmapreduce/samples/cloudfront/input/XABCD12345678.2009-05-05-15.WxYz1234
DIR elasticmapreduce/samples/cloudfront/input/XABCD12345678.2009-05-05-16.WxYz1234
DIR elasticmapreduce/samples/cloudfront/input/XABCD12345678.2009-05-05-17.WxYz1234
DIR elasticmapreduce/samples/cloudfront/input/XABCD12345678.2009-05-05-18.WxYz1234
DIR elasticmapreduce/samples/cloudfront/input/XABCD12345678.2009-05-05-19.WxYz1234
DIR elasticmapreduce/samples/cloudfront/input/XABCD12345678.2009-05-05-20.WxYz1234
DIR elasticmapreduce/samples/cloudfront/code/cloudfront-loganalyzer.tgz
```

## emrfs delete sub-command
<a name="emrfs-delete"></a>


**[Options] for emrfs delete**  

|  Option  |  Description  |  Required  | 
| --- | --- | --- | 
|  `-m METADATA_NAME \| --metadata-name METADATA_NAME`  |  *METADATA\_NAME* is the name of the DynamoDB metadata table. If the *METADATA\_NAME* argument is not supplied, the default value is `EmrFSMetadata`.  |  No  | 
|  *s3://s3Path*  |  The path to the Amazon S3 bucket you are tracking for consistent view. Buckets sync recursively.  |  Yes  | 
|  `-t TIME \| --time TIME`  |  The expiration time (interpreted using the time unit argument). All metadata entries older than the *TIME* argument are deleted for the specified bucket.  |  | 
|  `-u UNIT \| --time-unit UNIT`  |  The measure used to interpret the time argument (nanoseconds, microseconds, milliseconds, seconds, minutes, hours, or days). If no argument is specified, the default value is `days`.  |  | 
|  `--read-consumption READ_CONSUMPTION`  |  The requested amount of available read throughput used for the **delete** operation. If the *READ\_CONSUMPTION* argument is not specified, the default value is `400`.  |  No  | 
|  `--write-consumption WRITE_CONSUMPTION`  |  The requested amount of available write throughput used for the **delete** operation. If the *WRITE\_CONSUMPTION* argument is not specified, the default value is `100`.  |  No  | 

**Example: emrfs delete**  
The following example removes all objects in an Amazon S3 bucket from the tracking metadata for consistent view.  

```
$ emrfs delete s3://elasticmapreduce/samples/cloudfront
entries deleted: 11
```

## emrfs import sub-command
<a name="emrfs-import"></a>


**[Options] for emrfs import**  

| Option  | Description  | Required  | 
| --- | --- | --- | 
|  `-m METADATA_NAME \| --metadata-name METADATA_NAME`  |  *METADATA\_NAME* is the name of the DynamoDB metadata table. If the *METADATA\_NAME* argument is not supplied, the default value is `EmrFSMetadata`.  |  No  | 
|  *s3://s3Path*  |  The path to the Amazon S3 bucket you are tracking for consistent view. Buckets sync recursively.  |  Yes  | 
|  `--read-consumption READ_CONSUMPTION`  |  The requested amount of available read throughput used for the **import** operation. If the *READ\_CONSUMPTION* argument is not specified, the default value is `400`.  |  No  | 
|  `--write-consumption WRITE_CONSUMPTION`  |  The requested amount of available write throughput used for the **import** operation. If the *WRITE\_CONSUMPTION* argument is not specified, the default value is `100`.  |  No  | 

**Example: emrfs import**  
The following example imports all objects in an Amazon S3 bucket into the tracking metadata for consistent view. All unknown keys are ignored.  

```
$ emrfs import s3://elasticmapreduce/samples/cloudfront
```

## emrfs sync sub-command
<a name="emrfs-sync"></a>


**[Options] for emrfs sync**  

| Option  | Description  | Required  | 
| --- | --- | --- | 
|  `-m METADATA_NAME \| --metadata-name METADATA_NAME`  |  *METADATA\_NAME* is the name of the DynamoDB metadata table. If the *METADATA\_NAME* argument is not supplied, the default value is `EmrFSMetadata`.  |  No  | 
|  *s3://s3Path*  |  The path to the Amazon S3 bucket you are tracking for consistent view. Buckets sync recursively.  |  Yes  | 
|  `--read-consumption READ_CONSUMPTION`  |  The requested amount of available read throughput used for the **sync** operation. If the *READ\_CONSUMPTION* argument is not specified, the default value is `400`.  |  No  | 
|  `--write-consumption WRITE_CONSUMPTION`  |  The requested amount of available write throughput used for the **sync** operation. If the *WRITE\_CONSUMPTION* argument is not specified, the default value is `100`.  |  No  | 

**Example: emrfs sync**  
The following example syncs all objects in an Amazon S3 bucket with the tracking metadata for consistent view. All unknown keys are deleted.  

```
$ emrfs sync s3://elasticmapreduce/samples/cloudfront
Synching samples/cloudfront                                       0 added | 0 updated | 0 removed | 0 unchanged
Synching samples/cloudfront/code/                                 1 added | 0 updated | 0 removed | 0 unchanged
Synching samples/cloudfront/                                      2 added | 0 updated | 0 removed | 0 unchanged
Synching samples/cloudfront/input/                                9 added | 0 updated | 0 removed | 0 unchanged
Done synching s3://elasticmapreduce/samples/cloudfront            9 added | 0 updated | 1 removed | 0 unchanged
creating 3 folder key(s)
folders written: 3
```

## emrfs read-sqs sub-command
<a name="emrfs-read-sqs"></a>


**[Options] for emrfs read-sqs**  

| Option  | Description  | Required  | 
| --- | --- | --- | 
|  `-q QUEUE_NAME \| --queue-name QUEUE_NAME`  |  *QUEUE\_NAME* is the name of the Amazon SQS queue configured in `emrfs-site.xml`. The default value is **EMRFS-Inconsistency-<jobFlowId>**.  |  Yes  | 
|  `-o OUTPUT_FILE \| --output-file OUTPUT_FILE`  |  *OUTPUT\_FILE* is the path to the output file on the master node's local file system. Messages read from the queue are written to this file.   |  Yes  | 
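
**Example: emrfs read-sqs**  
The following illustrative command reads messages from the inconsistency queue into a local file. The queue name (with a sample job flow ID) and the output path are assumptions; substitute your own values.  

```
$ emrfs read-sqs -q EMRFS-Inconsistency-j-2AB1CDE2FGH3I -o /tmp/emrfs-inconsistency-messages.txt
```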

## emrfs delete-sqs sub-command
<a name="emrfs-delete-sqs"></a>


**[Options] for emrfs delete-sqs**  

| Option  | Description  | Required  | 
| --- | --- | --- | 
|  `-q QUEUE_NAME \| --queue-name QUEUE_NAME`  |  *QUEUE\_NAME* is the name of the Amazon SQS queue configured in `emrfs-site.xml`. The default value is **EMRFS-Inconsistency-<jobFlowId>**.  |  Yes  | 
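
**Example: emrfs delete-sqs**  
The following illustrative command deletes the inconsistency queue. The queue name (with a sample job flow ID) is an assumption; substitute your own.  

```
$ emrfs delete-sqs -q EMRFS-Inconsistency-j-2AB1CDE2FGH3I
```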

## Submitting EMRFS CLI commands as steps
<a name="emrfs-submit-steps-as-cli"></a>

The following example shows how to use the `emrfs` utility on the master node by using the AWS CLI or API and `command-runner.jar` to run the `emrfs` command as a step. The example uses the AWS SDK for Python (Boto3) to add a step to a cluster that adds objects in an Amazon S3 bucket to the default EMRFS metadata table.

```
import boto3
from botocore.exceptions import ClientError


def add_emrfs_step(command, bucket_url, cluster_id, emr_client):
    """
    Add an EMRFS command as a job flow step to an existing cluster.

    :param command: The EMRFS command to run.
    :param bucket_url: The URL of a bucket that contains tracking metadata.
    :param cluster_id: The ID of the cluster to update.
    :param emr_client: The Boto3 Amazon EMR client object.
    :return: The ID of the added job flow step. Status can be tracked by calling
             the emr_client.describe_step() function.
    """
    job_flow_step = {
        "Name": "Example EMRFS Command Step",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["/usr/bin/emrfs", command, bucket_url],
        },
    }

    try:
        response = emr_client.add_job_flow_steps(
            JobFlowId=cluster_id, Steps=[job_flow_step]
        )
        step_id = response["StepIds"][0]
        print(f"Added step {step_id} to cluster {cluster_id}.")
    except ClientError:
        print(f"Couldn't add a step to cluster {cluster_id}.")
        raise
    else:
        return step_id


def usage_demo():
    emr_client = boto3.client("emr")
    # Assumes the first waiting cluster has EMRFS enabled and has created metadata
    # with the default name of 'EmrFSMetadata'.
    cluster = emr_client.list_clusters(ClusterStates=["WAITING"])["Clusters"][0]
    add_emrfs_step(
        "sync", "s3://elasticmapreduce/samples/cloudfront", cluster["Id"], emr_client
    )


if __name__ == "__main__":
    usage_demo()
```

You can use the `step_id` value returned to check the logs for the result of the operation.
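For example, you can poll the step until it reaches a terminal state and only then inspect the logs. The following sketch assumes you still have the Boto3 Amazon EMR client and the cluster and step IDs from the previous example; `wait_for_step` is an illustrative helper, not part of the EMRFS CLI or the EMR API.

```python
import time


def wait_for_step(emr_client, cluster_id, step_id, poll_seconds=30, timeout_seconds=600):
    """Poll describe_step until the step reaches a terminal state and return that state."""
    waited = 0
    while waited <= timeout_seconds:
        response = emr_client.describe_step(ClusterId=cluster_id, StepId=step_id)
        state = response["Step"]["Status"]["State"]
        if state in ("COMPLETED", "FAILED", "CANCELLED"):
            return state
        time.sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError(f"Step {step_id} did not finish in {timeout_seconds} seconds.")
```

If the final state is `FAILED`, the step's stderr log under the cluster's log URI usually contains the output of the `emrfs` command.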

# Authorizing access to EMRFS data in Amazon S3
<a name="emr-plan-credentialsprovider"></a>

By default, the EMR role for EC2 determines the permissions for accessing EMRFS data in Amazon S3. The IAM policies that are attached to this role apply regardless of the user or group making the request through EMRFS. The default is `EMR_EC2_DefaultRole`. For more information, see [Service role for cluster EC2 instances (EC2 instance profile)](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-role-for-ec2.html).

Beginning with Amazon EMR release version 5.10.0, you can use a security configuration to specify IAM roles for EMRFS. This allows you to customize permissions for EMRFS requests to Amazon S3 for clusters that have multiple users. You can specify different IAM roles for different users and groups, and for different Amazon S3 bucket locations based on the prefix in Amazon S3. When EMRFS makes a request to Amazon S3 that matches users, groups, or the locations that you specify, the cluster uses the corresponding role that you specify instead of the EMR role for EC2. For more information, see [Configure IAM roles for EMRFS requests to Amazon S3](https://docs.aws.amazon.com//emr/latest/ManagementGuide/emr-emrfs-iam-roles).

Alternatively, if your Amazon EMR solution has demands beyond what IAM roles for EMRFS provides, you can define a custom credentials provider class, which allows you to customize access to EMRFS data in Amazon S3.

## Creating a custom credentials provider for EMRFS data in Amazon S3
<a name="emr-create-credentialsprovider"></a>

To create a custom credentials provider, you implement the [AWSCredentialsProvider](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/AWSCredentialsProvider.html) and the Hadoop [Configurable](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/conf/Configurable.html) interfaces.

For a detailed explanation of this approach, see [Securely analyze data from another AWS account with EMRFS](https://aws.amazon.com/blogs/big-data/securely-analyze-data-from-another-aws-account-with-emrfs) in the AWS Big Data blog. The blog post includes a tutorial that walks you through the process end-to-end, from creating IAM roles to launching the cluster. It also provides a Java code example that implements the custom credential provider class.

The basic steps are as follows:

**To specify a custom credentials provider**

1. Create a custom credentials provider class compiled as a JAR file.

1. Run a script as a bootstrap action to copy the custom credentials provider JAR file to the `/usr/share/aws/emr/emrfs/auxlib` location on the cluster's master node. For more information about bootstrap actions, see [(Optional) Create bootstrap actions to install additional software](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html).

1. Customize the `emrfs-site` classification to specify the class that you implement in the JAR file. For more information about specifying configuration objects to customize applications, see [Configuring applications](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html) in the *Amazon EMR Release Guide*.

   The following example demonstrates a `create-cluster` command that launches a Hive cluster with common configuration parameters, and also includes:
   + A bootstrap action that runs the script, `copy_jar_file.sh`, which is saved to `amzn-s3-demo-bucket` in Amazon S3.
   + An `emrfs-site` classification that specifies a custom credentials provider defined in the JAR file as `MyAWSCredentialsProviderWithUri`.
**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

   ```
   aws emr create-cluster --applications Name=Hive \
   --bootstrap-actions '[{"Path":"s3://amzn-s3-demo-bucket/copy_jar_file.sh","Name":"Custom action"}]' \
   --ec2-attributes '{"KeyName":"MyKeyPair","InstanceProfile":"EMR_EC2_DefaultRole",\
   "SubnetId":"subnet-xxxxxxxx","EmrManagedSlaveSecurityGroup":"sg-xxxxxxxx",\
   "EmrManagedMasterSecurityGroup":"sg-xxxxxxxx"}' \
   --service-role EMR_DefaultRole_V2 --enable-debugging --release-label emr-7.12.0 \
   --log-uri 's3n://amzn-s3-demo-bucket/' --name 'test-awscredentialsprovider-emrfs' \
   --instance-type=m5.xlarge --instance-count 3  \
   --configurations '[{"Classification":"emrfs-site",\
   "Properties":{"fs.s3.customAWSCredentialsProvider":"MyAWSCredentialsProviderWithUri"},\
   "Configurations":[]}]'
   ```

# Managing the default AWS Security Token Service endpoint
<a name="emr-emrfs-sts-endpoint"></a>

EMRFS uses the AWS Security Token Service (STS) to retrieve temporary security credentials in order to access your AWS resources. Earlier Amazon EMR release versions send all AWS STS requests to a single global endpoint at `https://sts.amazonaws.com`. Amazon EMR release versions 5.31.0 and 6.1.0 and later make requests to Regional AWS STS endpoints instead. This reduces latency and improves session token validity. For more information about AWS STS endpoints, see [Managing AWS STS in an AWS Region](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_enable-regions.html) in the *AWS Identity and Access Management User Guide*.

When you use Amazon EMR release versions 5.31.0 and 6.1.0 and later, you can override the default AWS STS endpoint. To do so, you must change the `fs.s3.sts.endpoint` property in your `emrfs-site` configuration.

The following AWS CLI example sets the default AWS STS endpoint used by EMRFS to the global endpoint.

```
aws emr create-cluster --release-label <emr-5.33.0> --instance-type m5.xlarge \
--emrfs Args=[fs.s3.sts.endpoint=https://sts.amazonaws.com]
```

**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

Alternatively, you can create a JSON configuration file using the following example, and specify it using the `--configurations` argument of `emr create-cluster`. For more information about using `--configurations`, see the [*AWS CLI Command Reference*](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/emr/create-cluster.html).

```
[
  {
    "classification": "emrfs-site",
    "properties": {
      "fs.s3.sts.endpoint": "https://sts.amazonaws.com"
    }
  }
]
```

# Specifying Amazon S3 encryption using EMRFS properties
<a name="emr-emrfs-encryption"></a>

**Important**  
Beginning with Amazon EMR release version 4.8.0, you can use security configurations to apply encryption settings more easily and with more options. We recommend using security configurations. For information, see [Configure data encryption](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-create-security-configuration.html#emr-security-configuration-encryption). The console instructions described in this section are available for release versions earlier than 4.8.0. If you use the AWS CLI to configure Amazon S3 encryption both in the cluster configuration and in a security configuration in subsequent versions, the security configuration overrides the cluster configuration.

When you create a cluster, you can specify server-side encryption (SSE) or client-side encryption (CSE) for EMRFS data in Amazon S3 using the console or using `emrfs-site` classification properties through the AWS CLI or EMR SDK. Amazon S3 SSE and CSE are mutually exclusive; you can choose either but not both.

For AWS CLI instructions, see the appropriate section for your encryption type below.

**To specify EMRFS encryption options using the AWS Management Console**

1. Navigate to the new Amazon EMR console and select **Switch to the old console** from the side navigation. For more information on what to expect when you switch to the old console, see [Using the old console](https://docs.aws.amazon.com/emr/latest/ManagementGuide/whats-new-in-console.html#console-opt-in).

1. Choose **Create cluster**, **Go to advanced options**.

1. Choose a **Release** of 4.7.2 or earlier.

1. Choose other options for **Software and Steps** as appropriate for your application, and then choose **Next**.

1. Choose settings in the **Hardware** and **General Cluster Settings** panes as appropriate for your application.

1. On the **Security** pane, under **Authentication and encryption**, select the **S3 Encryption (with EMRFS)** option to use.
**Note**  
**S3 server-side encryption with KMS Key Management** (SSE-KMS) is not available when using Amazon EMR release version 4.4 or earlier.
   + If you choose an option that uses **AWS Key Management**, choose an **AWS KMS Key ID**. For more information, see [Using AWS KMS keys for EMRFS encryption](#emr-emrfs-awskms).
   + If you choose **S3 client-side encryption with custom materials provider**, provide the **Class name** and the **JAR location**. For more information, see [Amazon S3 client-side encryption](emr-emrfs-encryption-cse.md).

1. Choose other options as appropriate for your application and then choose **Create Cluster**.

## Using AWS KMS keys for EMRFS encryption
<a name="emr-emrfs-awskms"></a>

The AWS KMS encryption key must be created in the same Region as your Amazon EMR cluster instance and the Amazon S3 buckets used with EMRFS. If the key that you specify is in a different account from the one that you use to configure a cluster, you must specify the key using its ARN.

The role for the Amazon EC2 instance profile must have permissions to use the KMS key you specify. The default role for the instance profile in Amazon EMR is `EMR_EC2_DefaultRole`. If you use a different role for the instance profile, or you use IAM roles for EMRFS requests to Amazon S3, make sure that each role is added as a key user as appropriate. This gives the role permissions to use the KMS key. For more information, see [Using Key Policies](https://docs.aws.amazon.com/kms/latest/developerguide/key-policies.html#key-policy-default-allow-users) in the *AWS Key Management Service Developer Guide* and [Configure IAM roles for EMRFS requests to Amazon S3](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-emrfs-iam-roles.html).

You can use the AWS Management Console to add your instance profile or EC2 instance profile to the list of key users for the specified KMS key, or you can use the AWS CLI or an AWS SDK to attach an appropriate key policy.
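As a sketch of the CLI/SDK route, the following Boto3 snippet appends a key-user statement to a key's default key policy. The `add_key_user` helper and the statement contents are illustrative assumptions (the actions shown are the standard key-user actions from the AWS KMS default key policy); adjust them to your own policy requirements before use.

```python
import json


def add_key_user(kms_client, key_id, role_arn):
    """Append an illustrative key-user statement for role_arn to the key's
    default policy. kms_client is a Boto3 KMS client (boto3.client("kms"))."""
    policy = json.loads(
        kms_client.get_key_policy(KeyId=key_id, PolicyName="default")["Policy"]
    )
    policy["Statement"].append(
        {
            "Sid": "AllowEmrInstanceProfileUseOfKey",  # illustrative Sid
            "Effect": "Allow",
            "Principal": {"AWS": role_arn},
            # Standard key-user actions from the AWS KMS default key policy.
            "Action": [
                "kms:Encrypt",
                "kms:Decrypt",
                "kms:ReEncrypt*",
                "kms:GenerateDataKey*",
                "kms:DescribeKey",
            ],
            "Resource": "*",
        }
    )
    kms_client.put_key_policy(
        KeyId=key_id, PolicyName="default", Policy=json.dumps(policy)
    )
```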

Note that Amazon EMR supports only [symmetric KMS keys](https://docs.aws.amazon.com/kms/latest/developerguide/concepts.html#symmetric-cmks). You cannot use an [asymmetric KMS key](https://docs.aws.amazon.com/kms/latest/developerguide/symmetric-asymmetric.html#asymmetric-cmks) to encrypt data at rest in an Amazon EMR cluster. For help determining whether a KMS key is symmetric or asymmetric, see [Identifying symmetric and asymmetric KMS keys](https://docs.aws.amazon.com/kms/latest/developerguide/find-symm-asymm.html).
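You can also check a key's type programmatically. The following sketch assumes a Boto3 KMS client; `is_symmetric_key` is an illustrative helper that inspects the `KeySpec` field returned by `DescribeKey`.

```python
def is_symmetric_key(kms_client, key_id):
    """Return True if the KMS key is symmetric (KeySpec SYMMETRIC_DEFAULT).

    kms_client is a Boto3 KMS client (boto3.client("kms")).
    """
    metadata = kms_client.describe_key(KeyId=key_id)["KeyMetadata"]
    return metadata.get("KeySpec") == "SYMMETRIC_DEFAULT"
```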

The procedure below describes how to add the default Amazon EMR instance profile, `EMR_EC2_DefaultRole`, as a *key user* using the AWS Management Console. It assumes that you have already created a KMS key. To create a new KMS key, see [Creating Keys](https://docs.aws.amazon.com/kms/latest/developerguide/create-keys.html) in the *AWS Key Management Service Developer Guide*.

**To add the EC2 instance profile for Amazon EMR to the list of encryption key users**

1. Sign in to the AWS Management Console and open the AWS Key Management Service (AWS KMS) console at [https://console.aws.amazon.com/kms](https://console.aws.amazon.com/kms).

1. To change the AWS Region, use the Region selector in the upper-right corner of the page.

1. Select the alias of the KMS key to modify.

1. On the key details page under **Key Users**, choose **Add**.

1. In the **Add key users** dialog box, select the appropriate role. The name of the default role is `EMR_EC2_DefaultRole`.

1. Choose **Add**.

## Amazon S3 server-side encryption
<a name="emr-emrfs-encryption-sse"></a>

All Amazon S3 buckets have encryption configured by default, and all new objects that are uploaded to an S3 bucket are automatically encrypted at rest. Amazon S3 encrypts data at the object level as it writes the data to disk and decrypts the data when it is accessed. For more information about SSE, see [Protecting data using server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html) in the *Amazon Simple Storage Service User Guide*.

You can choose between two different key management systems when you specify SSE in Amazon EMR: 
+ **SSE-S3** – Amazon S3 manages keys for you.
+ **SSE-KMS** – You use an AWS KMS key set up with policies suitable for Amazon EMR. For more information about key requirements for Amazon EMR, see [Using AWS KMS keys for encryption](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-encryption-enable.html#emr-awskms-keys).

SSE with customer-provided keys (SSE-C) is not available for use with Amazon EMR.

**To create a cluster with SSE-S3 enabled using the AWS CLI**
+ Type the following command:

  ```
  aws emr create-cluster --release-label emr-4.7.2 or earlier \
  --instance-count 3 --instance-type m5.xlarge --emrfs Encryption=ServerSide
  ```

You can also enable SSE-S3 by setting the `fs.s3.enableServerSideEncryption` property to `true` in `emrfs-site` properties. See the example for SSE-KMS below and omit the property for Key ID.
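
For example, a minimal configuration JSON that enables SSE-S3 (server-side encryption with no KMS key ID) might look like the following:

```
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.enableServerSideEncryption": "true"
    }
  }
]
```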

**To create a cluster with SSE-KMS enabled using the AWS CLI**
**Note**  
SSE-KMS is available only in Amazon EMR release version 4.5.0 and later.
+ Type the following AWS CLI command to create a cluster with SSE-KMS, replacing *emr-4.7.2* with a release label of 4.7.2 or earlier and *keyId* with an AWS KMS key ID or ARN, for example, *a4567b8-9900-12ab-1234-123a45678901*:

  ```
  aws emr create-cluster --release-label emr-4.7.2 --instance-count 3 \
  --instance-type m5.xlarge --use-default-roles \
  --emrfs Encryption=ServerSide,Args=[fs.s3.serverSideEncryption.kms.keyId=keyId]
  ```

  **--OR--**

  Type the following AWS CLI command using the `emrfs-site` classification, providing a configuration JSON file with contents similar to the `myConfig.json` example below:

  ```
  aws emr create-cluster --release-label emr-4.7.2 --instance-count 3 \
  --instance-type m5.xlarge --applications Name=Hadoop \
  --configurations file://myConfig.json --use-default-roles
  ```

  Example contents of **myConfig.json**:

  ```
  [
    {
      "Classification":"emrfs-site",
      "Properties": {
         "fs.s3.enableServerSideEncryption": "true",
         "fs.s3.serverSideEncryption.kms.keyId":"a4567b8-9900-12ab-1234-123a45678901"
      }
    }
  ]
  ```

### Configuration properties for SSE-S3 and SSE-KMS
<a name="emr-emrfs-encryption-site-sse-properties"></a>

These properties can be configured using the `emrfs-site` configuration classification. SSE-KMS is available only in Amazon EMR release version 4.5.0 and later.


| Property  | Default value | Description  | 
| --- | --- | --- | 
| fs.s3.enableServerSideEncryption | false |  When set to **true**, objects stored in Amazon S3 are encrypted using server-side encryption. If no key is specified, SSE-S3 is used.  | 
| fs.s3.serverSideEncryption.kms.keyId | n/a |  Specifies an AWS KMS key ID or ARN. If a key is specified, SSE-KMS is used.  | 

# Amazon S3 client-side encryption
<a name="emr-emrfs-encryption-cse"></a>

With Amazon S3 client-side encryption, the Amazon S3 encryption and decryption takes place in the EMRFS client on your cluster. Objects are encrypted before being uploaded to Amazon S3 and decrypted after they are downloaded. The provider you specify supplies the encryption key that the client uses. The client can use keys provided by AWS KMS (CSE-KMS) or a custom Java class that provides the client-side root key (CSE-C). The encryption specifics are slightly different between CSE-KMS and CSE-C, depending on the specified provider and the metadata of the object being decrypted or encrypted. For more information about these differences, see [Protecting data using client-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingClientSideEncryption.html) in the *Amazon Simple Storage Service User Guide*.

**Note**  
Amazon S3 CSE only ensures that EMRFS data exchanged with Amazon S3 is encrypted; not all data on cluster instance volumes is encrypted. Furthermore, because Hue does not use EMRFS, objects that the Hue S3 File Browser writes to Amazon S3 are not encrypted.

**To specify CSE-KMS for EMRFS data in Amazon S3 using the AWS CLI**
+ Type the following command, replacing *MyKMSKeyId* with the Key ID or ARN of the KMS key to use:

  ```
  aws emr create-cluster --release-label emr-4.7.2 \
  --emrfs Encryption=ClientSide,ProviderType=KMS,KMSKeyId=MyKMSKeyId
  ```

## Creating a custom key provider
<a name="emr-emrfs-create-cse-key"></a>

Depending on the type of encryption you use, your custom key provider must implement a different EncryptionMaterialsProvider interface. Both interfaces are available in the AWS SDK for Java version 1.11.0 and later.
+ To implement Amazon S3 encryption, use the [com.amazonaws.services.s3.model.EncryptionMaterialsProvider interface](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/EncryptionMaterialsProvider.html).
+ To implement local disk encryption, use the [com.amazonaws.services.elasticmapreduce.spi.security.EncryptionMaterialsProvider interface](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/spi/security/EncryptionMaterialsProvider.html).

You can use any strategy to provide encryption materials for the implementation. For example, you might choose to provide static encryption materials or integrate with a more complex key management system.

If you’re using Amazon S3 encryption, you must use the encryption algorithm **AES/GCM/NoPadding** for custom encryption materials.

If you’re using local disk encryption, the encryption algorithm to use for custom encryption materials varies by EMR release. For Amazon EMR 7.0.0 and lower, you must use **AES/GCM/NoPadding**. For Amazon EMR 7.1.0 and higher, you must use **AES**.

The EncryptionMaterialsProvider class gets encryption materials by encryption context. Amazon EMR populates encryption context information at runtime to help the caller determine the correct encryption materials to return.

**Example: Using a custom key provider for Amazon S3 encryption with EMRFS**  
When Amazon EMR fetches the encryption materials from the EncryptionMaterialsProvider class to perform encryption, EMRFS optionally populates the materialsDescription argument with two fields: the Amazon S3 URI for the object and the JobFlowId of the cluster, which can be used by the EncryptionMaterialsProvider class to return encryption materials selectively.  
For example, the provider might return different keys for different Amazon S3 URI prefixes. Note that it is the description of the returned encryption materials, rather than the materialsDescription value that EMRFS generates and passes to the provider, that is eventually stored with the Amazon S3 object. When decrypting an Amazon S3 object, that encryption materials description is passed to the EncryptionMaterialsProvider class so that it can, again, selectively return the matching key to decrypt the object.  
An EncryptionMaterialsProvider reference implementation is provided below. Another custom provider, [EMRFSRSAEncryptionMaterialsProvider](https://github.com/awslabs/emr-sample-apps/tree/master/emrfs-plugins/EMRFSRSAEncryptionMaterialsProvider), is available from GitHub.   

```
import com.amazonaws.services.s3.model.EncryptionMaterials;
import com.amazonaws.services.s3.model.EncryptionMaterialsProvider;
import com.amazonaws.services.s3.model.KMSEncryptionMaterials;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;

import java.util.Map;

/**
 * Provides KMSEncryptionMaterials according to Configuration
 */
public class MyEncryptionMaterialsProviders implements EncryptionMaterialsProvider, Configurable {
  private Configuration conf;
  private String kmsKeyId;
  private EncryptionMaterials encryptionMaterials;

  private void init() {
    this.kmsKeyId = conf.get("my.kms.key.id");
    this.encryptionMaterials = new KMSEncryptionMaterials(kmsKeyId);
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    init();
  }

  @Override
  public Configuration getConf() {
    return this.conf;
  }

  @Override
  public void refresh() {

  }

  @Override
  public EncryptionMaterials getEncryptionMaterials(Map<String, String> materialsDescription) {
    return this.encryptionMaterials;
  }

  @Override
  public EncryptionMaterials getEncryptionMaterials() {
    return this.encryptionMaterials;
  }
}
```
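
To make the prefix-based key selection described above concrete, the following is a hypothetical, self-contained sketch of the selection logic such a provider might apply inside `getEncryptionMaterials(materialsDescription)`. The class name, prefix map, and key IDs are illustrative only; a real provider would wrap the chosen key ID in `KMSEncryptionMaterials` and read the S3 URI from the materialsDescription that EMRFS populates:

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Hypothetical sketch: choose a KMS key ID by the longest Amazon S3 URI
 * prefix that matches the object being encrypted. A real provider would
 * return EncryptionMaterials from getEncryptionMaterials(materialsDescription);
 * this class isolates just the selection logic.
 */
public class PrefixKeySelector {
    // Map of S3 URI prefix -> KMS key ID (example values, not real keys).
    private final TreeMap<String, String> keysByPrefix = new TreeMap<>();
    private final String defaultKeyId;

    public PrefixKeySelector(Map<String, String> prefixToKey, String defaultKeyId) {
        this.keysByPrefix.putAll(prefixToKey);
        this.defaultKeyId = defaultKeyId;
    }

    /** Returns the key ID for the longest matching prefix, or the default key ID. */
    public String keyFor(String s3Uri) {
        String best = defaultKeyId;
        int bestLen = -1;
        for (Map.Entry<String, String> e : keysByPrefix.entrySet()) {
            if (s3Uri.startsWith(e.getKey()) && e.getKey().length() > bestLen) {
                best = e.getValue();
                bestLen = e.getKey().length();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        PrefixKeySelector sel = new PrefixKeySelector(
            Map.of("s3://amzn-s3-demo-bucket/finance/", "key-finance",
                   "s3://amzn-s3-demo-bucket/", "key-general"),
            "key-default");
        System.out.println(sel.keyFor("s3://amzn-s3-demo-bucket/finance/q1.csv"));
        System.out.println(sel.keyFor("s3://amzn-s3-demo-bucket/logs/app.log"));
        System.out.println(sel.keyFor("s3://other-bucket/data"));
    }
}
```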

## Specifying a custom materials provider using the AWS CLI
<a name="emr-emrfs-encryption-cse-custom-cli"></a>

To use the AWS CLI, pass the `Encryption`, `ProviderType`, `CustomProviderClass`, and `CustomProviderLocation` arguments to the `emrfs` option.

```
aws emr create-cluster --instance-type m5.xlarge --release-label emr-4.7.2 \
--emrfs Encryption=ClientSide,ProviderType=Custom,CustomProviderLocation=s3://amzn-s3-demo-bucket/myfolder/provider.jar,CustomProviderClass=classname
```

Setting `Encryption` to `ClientSide` enables client-side encryption, `CustomProviderClass` is the name of your `EncryptionMaterialsProvider` object, and `CustomProviderLocation` is the local or Amazon S3 location from which Amazon EMR copies `CustomProviderClass` to each node in the cluster and places it in the classpath.

## Specifying a custom materials provider using an SDK
<a name="emr-emrfs-encryption-cse-custom-sdk"></a>

To use an SDK, you can set the property `fs.s3.cse.encryptionMaterialsProvider.uri` so that the custom `EncryptionMaterialsProvider` class that you store in Amazon S3 is downloaded to each node in your cluster. You configure this in the `emrfs-site` classification, along with enabling CSE and specifying the full class name of the custom provider.

For example, in the AWS SDK for Java using RunJobFlowRequest, your code might look like the following:

```
<snip>
Map<String, String> emrfsProperties = new HashMap<String, String>();
emrfsProperties.put("fs.s3.cse.encryptionMaterialsProvider.uri", "s3://amzn-s3-demo-bucket/MyCustomEncryptionMaterialsProvider.jar");
emrfsProperties.put("fs.s3.cse.enabled", "true");
emrfsProperties.put("fs.s3.consistent", "true");
emrfsProperties.put("fs.s3.cse.encryptionMaterialsProvider", "full.class.name.of.EncryptionMaterialsProvider");

Configuration myEmrfsConfig = new Configuration()
    .withClassification("emrfs-site")
    .withProperties(emrfsProperties);

RunJobFlowRequest request = new RunJobFlowRequest()
    .withName("Custom EncryptionMaterialsProvider")
    .withReleaseLabel("emr-7.12.0")
    .withApplications(myApp)
    .withConfigurations(myEmrfsConfig)
    .withServiceRole("EMR_DefaultRole_V2")
    .withJobFlowRole("EMR_EC2_DefaultRole")
    .withLogUri("s3://myLogUri/")
    .withInstances(new JobFlowInstancesConfig()
        .withEc2KeyName("myEc2Key")
        .withInstanceCount(2)
        .withKeepJobFlowAliveWhenNoSteps(true)
        .withMasterInstanceType("m5.xlarge")
        .withSlaveInstanceType("m5.xlarge")
    );

RunJobFlowResult result = emr.runJobFlow(request);
</snip>
```

## Custom EncryptionMaterialsProvider with arguments
<a name="emr-emrfs-encryption-custommaterials"></a>

You may need to pass arguments directly to the provider. To do this, you can use the `emrfs-site` configuration classification with custom arguments defined as properties. An example configuration is shown below, which is saved as a file, `myConfig.json`:

```
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "myProvider.arg1": "value1",
      "myProvider.arg2": "value2"
    }
  }
]
```

Using the `create-cluster` command from the AWS CLI, you can use the `--configurations` option to specify the file as shown below:

```
aws emr create-cluster --release-label emr-7.12.0 --instance-type m5.xlarge --instance-count 2 \
--configurations file://myConfig.json \
--emrfs Encryption=ClientSide,CustomProviderLocation=s3://amzn-s3-demo-bucket/myfolder/myprovider.jar,CustomProviderClass=classname
```
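
Inside the provider, those custom properties arrive through the Hadoop `Configuration` passed to `setConf`. The following is a hypothetical, self-contained sketch of that lookup; Hadoop's `Configuration` is modeled as a plain `Map` so the logic is runnable on its own, and the property names match the `myConfig.json` example above:

```java
import java.util.Map;

/**
 * Hypothetical sketch of how a custom EncryptionMaterialsProvider might read
 * the myProvider.arg1/arg2 properties shown above. A real provider would call
 * conf.get("myProvider.arg1") on the Hadoop Configuration passed to
 * setConf(Configuration); a Map stands in for Configuration here.
 */
public class ProviderArgs {
    private String arg1;
    private String arg2;

    // Stand-in for setConf(Configuration conf).
    public void setConf(Map<String, String> conf) {
        this.arg1 = conf.getOrDefault("myProvider.arg1", "default1");
        this.arg2 = conf.getOrDefault("myProvider.arg2", "default2");
    }

    public String getArg1() { return arg1; }
    public String getArg2() { return arg2; }

    public static void main(String[] args) {
        ProviderArgs p = new ProviderArgs();
        p.setConf(Map.of("myProvider.arg1", "value1", "myProvider.arg2", "value2"));
        System.out.println(p.getArg1() + " " + p.getArg2());
    }
}
```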

## Configuring EMRFS S3EC V2 support
<a name="emr-emrfs-encryption-cse-s3v2"></a>

S3 Java SDK releases 1.11.837 and later support encryption client version 2 (S3EC V2), which includes various security enhancements. For more information, see the S3 blog post [Updates to the Amazon S3 encryption client](https://aws.amazon.com/blogs/developer/updates-to-the-amazon-s3-encryption-client/). Also refer to [Amazon S3 encryption client migration](https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/s3-encryption-migration.html) in the *AWS SDK for Java Developer Guide*.

Encryption client V1 is still available in the SDK for backward compatibility. By default, EMRFS uses S3EC V1 to encrypt and decrypt S3 objects when CSE is enabled.

S3 objects encrypted with S3EC V2 cannot be decrypted by EMRFS on EMR clusters with release versions that predate S3EC V2 support (emr-5.30.1 and earlier, and emr-6.1.0 and earlier).

**Example Configure EMRFS to use S3EC V2**  
To configure EMRFS to use S3EC V2, add the following configuration:  

```
{
  "Classification": "emrfs-site",
  "Properties": {
    "fs.s3.cse.encryptionV2.enabled": "true"
  }
}
```

## `emrfs-site.xml` Properties for Amazon S3 client-side encryption
<a name="emr-emrfs-cse-config"></a>


| Property  | Default value | Description  | 
| --- | --- | --- | 
| fs.s3.cse.enabled | false |  When set to **true**, EMRFS objects stored in Amazon S3 are encrypted using client-side encryption.  | 
| fs.s3.cse.encryptionV2.enabled | false |  When set to `true`, EMRFS uses S3 encryption client Version 2 to encrypt and decrypt objects on S3. Available for EMR version 5.31.0 and later.  | 
| fs.s3.cse.encryptionMaterialsProvider.uri | N/A | Applies when using custom encryption materials. The Amazon S3 URI where the JAR with the EncryptionMaterialsProvider is located. When you provide this URI, Amazon EMR automatically downloads the JAR to all nodes in the cluster. | 
| fs.s3.cse.encryptionMaterialsProvider | N/A |  The `EncryptionMaterialsProvider` class path used with client-side encryption. When using CSE-KMS, specify `com.amazon.ws.emr.hadoop.fs.cse.KMSEncryptionMaterialsProvider`.  | 
| fs.s3.cse.materialsDescription.enabled | false |  When set to `true`, populates the materialsDescription of encrypted objects with the Amazon S3 URI for the object and the JobFlowId. Set to `true` when using custom encryption materials.  | 
| fs.s3.cse.kms.keyId | N/A |  Applies when using CSE-KMS. The value of the KeyId, ARN, or alias of the KMS key used for encryption.  | 
| fs.s3.cse.cryptoStorageMode | ObjectMetadata  |  The Amazon S3 storage mode. By default, the description of the encryption information is stored in the object metadata. You can also store the description in an instruction file. Valid values are ObjectMetadata and InstructionFile. For more information, see [Client-side data encryption with the AWS SDK for Java and Amazon S3](https://aws.amazon.com/articles/client-side-data-encryption-with-the-aws-sdk-for-java-and-amazon-s3/).  | 