

# Consistent view


**Warning**  
On June 1, 2023, EMRFS consistent view will reach end of standard support for future Amazon EMR releases. EMRFS consistent view will continue to work for existing releases.

With the release of Amazon S3 strong read-after-write consistency on December 1, 2020, you no longer need to use EMRFS consistent view (EMRFS CV) with your Amazon EMR clusters. EMRFS CV is an optional feature that allows Amazon EMR clusters to check for list and read-after-write consistency for Amazon S3 objects. When you create a cluster and EMRFS CV is turned on, Amazon EMR creates an Amazon DynamoDB database to store object metadata that it uses to track list and read-after-write consistency for S3 objects. You can now turn off EMRFS CV and delete the DynamoDB database that it uses so that you don't accrue additional costs. The following procedures explain how to check for the CV feature, turn it off, and delete the DynamoDB database that the feature uses.

**To check if you're using the EMRFS CV feature**

1. In the Amazon EMR console, navigate to the **Configuration** tab for your cluster. If your cluster has the following configuration, it uses EMRFS CV.

   ```
   Classification=emrfs-site,Property=fs.s3.consistent,Value=true
   ```

1. Alternatively, use the AWS CLI to describe your cluster with the [`describe-cluster` API](https://docs.aws.amazon.com/cli/latest/reference/emr/describe-cluster.html). If the output contains `fs.s3.consistent: true`, your cluster uses EMRFS CV.
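The `describe-cluster` check can also be scripted. The following sketch (the helper name and trimmed sample payload are hypothetical) scans a cluster's configurations for the `emrfs-site` setting:

```python
def uses_consistent_view(cluster: dict) -> bool:
    """Return True if any emrfs-site classification sets fs.s3.consistent to true."""
    for conf in cluster.get("Cluster", {}).get("Configurations", []):
        if conf.get("Classification") == "emrfs-site":
            props = conf.get("Properties", {})
            if props.get("fs.s3.consistent", "false").lower() == "true":
                return True
    return False

# Trimmed example of a describe-cluster payload
sample = {"Cluster": {"Configurations": [
    {"Classification": "emrfs-site",
     "Properties": {"fs.s3.consistent": "true"}}]}}
print(uses_consistent_view(sample))  # prints True
```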

**To turn off EMRFS CV on your Amazon EMR clusters**

To turn off the EMRFS CV feature, use one of the following three options. You should test these options in your testing environment before applying them to your production environments.

1. **To stop your existing cluster and start a new cluster without EMRFS CV options**

   1. Before you stop your cluster, ensure that you back up your data and notify your users.

   1. To stop your cluster, follow the instructions in [Terminate a cluster](https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_TerminateJobFlow.html).

   1. If you use the Amazon EMR console to create a new cluster, navigate to **Advanced Options**. In the **Edit software settings** section, deselect the option to turn on EMRFS CV. If the check box for **EMRFS consistent view** is available, keep it unchecked.

   1. If you use the AWS CLI to create a new cluster with the [`create-cluster` API](https://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html), don't use the `--emrfs` option, which turns on EMRFS CV.

   1. If you use an SDK or CloudFormation to create a new cluster, don't use any of the configurations listed in [Configure consistent view](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emrfs-configure-consistent-view.html).

1. **To clone a cluster and remove EMRFS CV**

   1. In the Amazon EMR console, choose the cluster that uses EMRFS CV.

   1. At the top of the **Cluster Details** page, choose **Clone**.

   1. Choose **Previous** and navigate to **Step 1: Software and Steps**.

   1. In **Edit software settings**, remove EMRFS CV. In **Edit configuration**, delete the following configurations in the `emrfs-site` classification. If you're loading JSON from an S3 bucket, you must modify your S3 object.

      ```
      [
          {
              "Classification": "emrfs-site",
              "Properties": {
                  "fs.s3.consistent.retryPeriodSeconds": "10",
                  "fs.s3.consistent": "true",
                  "fs.s3.consistent.retryCount": "5",
                  "fs.s3.consistent.metadata.tableName": "EmrFSMetadata"
              }
          }
      ]
      ```

1. **To remove EMRFS CV from a cluster that uses instance groups**

   1. Use the following command to check if a single EMR cluster uses the DynamoDB table that is associated with EMRFS CV, or if multiple clusters share the table. The table name is specified in `fs.s3.consistent.metadata.tableName`, as described in [Configure consistent view](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emrfs-configure-consistent-view.html). The default table name used by EMRFS CV is `EmrFSMetadata`.

      ```
      aws emr describe-cluster --cluster-id j-XXXXX | grep fs.s3.consistent.metadata.tableName
      ```

   1. If your cluster doesn't share its DynamoDB table with another cluster, use the following command to reconfigure the cluster and deactivate EMRFS CV. For more information, see [Reconfigure an instance group in a running cluster](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps-running-cluster.html).

      ```
      aws emr modify-instance-groups --cli-input-json file://disable-emrfs-1.json
      ```

      The command reads its input from the referenced file. Before you run it, create `disable-emrfs-1.json` with the following configurations, substituting your cluster ID and instance group ID.

      ```
      {
      	"ClusterId": "j-xxxx",
      	"InstanceGroups": [
      		{
      			"InstanceGroupId": "ig-xxxx",
      			"Configurations": [
      				{
      					"Classification": "emrfs-site",
      					"Properties": {
      						"fs.s3.consistent": "false"
      					},
      					"Configurations": []
      				}
      			]
      		}
      	]
      }
      ```

   1. If your cluster shares the DynamoDB table with another cluster, turn off EMRFS CV on all clusters at a time when no clusters modify any objects in the shared S3 location.
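If you need to turn off EMRFS CV on several instance groups or clusters, generating the `--cli-input-json` payload programmatically avoids hand-editing JSON. This is a sketch; the function name is hypothetical and the IDs are placeholders:

```python
import json

def disable_cv_payload(cluster_id, instance_group_ids):
    """Build the modify-instance-groups payload that sets
    fs.s3.consistent to false on each instance group."""
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {
                "InstanceGroupId": ig,
                "Configurations": [
                    {
                        "Classification": "emrfs-site",
                        "Properties": {"fs.s3.consistent": "false"},
                        "Configurations": [],
                    }
                ],
            }
            for ig in instance_group_ids
        ],
    }

# Write the file that `aws emr modify-instance-groups --cli-input-json` reads
with open("disable-emrfs-1.json", "w") as f:
    json.dump(disable_cv_payload("j-XXXXX", ["ig-XXXXX"]), f, indent=2)
```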

**To delete Amazon DynamoDB resources associated with EMRFS CV**

After you remove EMRFS CV from your Amazon EMR clusters, delete the DynamoDB resources associated with EMRFS CV. Until you do so, you continue to incur DynamoDB charges associated with EMRFS CV.

1. Check the CloudWatch metrics for your DynamoDB table and confirm that the table isn't used by any clusters.

1. Delete the DynamoDB table.

   ```
   aws dynamodb delete-table --table-name <your-table-name>
   ```

**To delete Amazon SQS resources associated with EMRFS CV**

1. If you configured your cluster to push inconsistency notifications to Amazon SQS, you can delete all SQS queues.

1. Find the Amazon SQS queue name specified in `fs.s3.consistent.notification.SQS.queueName`, as described in [Configure consistent view](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emrfs-configure-consistent-view.html). The default queue name format is `EMRFS-Inconsistency-<j-cluster ID>`.

   ```
   aws sqs list-queues | grep 'EMRFS-Inconsistency'
   aws sqs delete-queue --queue-url <your-queue-url>
   ```
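If you script the cleanup instead of using grep, queues that follow the default naming convention can be selected like this (a sketch; the helper name and sample account/queue values are made up):

```python
def emrfs_cv_queue_urls(queue_urls, prefix="EMRFS-Inconsistency-"):
    """Select queue URLs whose final path segment matches the default
    EMRFS CV queue naming pattern."""
    return [u for u in queue_urls if u.rsplit("/", 1)[-1].startswith(prefix)]

# Example list-queues output (hypothetical account and queue names)
urls = [
    "https://sqs.us-east-1.amazonaws.com/111122223333/EMRFS-Inconsistency-j-162XXXXXXM2CU",
    "https://sqs.us-east-1.amazonaws.com/111122223333/my-app-queue",
]
print(emrfs_cv_queue_urls(urls))  # prints only the EMRFS-Inconsistency queue URL
```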

**To stop using the EMRFS CLI**
+ The [EMRFS CLI](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emrfs-cli-reference.html) manages the metadata that EMRFS CV generates. As standard support for EMRFS CV reaches its end in future releases of Amazon EMR, support for the EMRFS CLI will also reach its end. 

**Topics**
+ [Enable consistent view](enable-consistent-view.md)
+ [Understanding how EMRFS consistent view tracks objects in Amazon S3](emrfs-files-tracked.md)
+ [Retry logic](emrfs-retry-logic.md)
+ [EMRFS consistent view metadata](emrfs-metadata.md)
+ [Configure consistency notifications for CloudWatch and Amazon SQS](emrfs-configure-sqs-cw.md)
+ [Configure consistent view](emrfs-configure-consistent-view.md)
+ [EMRFS CLI Command Reference](emrfs-cli-reference.md)

# Enable consistent view


You can enable Amazon S3 server-side encryption or consistent view for EMRFS using the AWS Management Console, AWS CLI, or the `emrfs-site` configuration classification.<a name="enable-emr-fs-console"></a>

**To configure consistent view using the console**

1. Navigate to the new Amazon EMR console and select **Switch to the old console** from the side navigation. For more information on what to expect when you switch to the old console, see [Using the old console](https://docs.aws.amazon.com/emr/latest/ManagementGuide/whats-new-in-console.html#console-opt-in).

1. Choose **Create cluster**, **Go to advanced options**.

1. Choose settings for **Step 1: Software and Steps** and **Step 2: Hardware**. 

1. For **Step 3: General Cluster Settings**, under **Additional Options**, choose **EMRFS consistent view**.

1. For **EMRFS Metadata store**, type the name of your metadata store. The default value is **EmrFSMetadata**. If the EmrFSMetadata table does not exist, it is created for you in DynamoDB.
**Note**  
Amazon EMR does not automatically remove the EMRFS metadata from DynamoDB when the cluster is terminated.

1. For **Number of retries**, type an integer value. If an inconsistency is detected, EMRFS tries to call Amazon S3 this number of times. The default value is **5**. 

1. For **Retry period (in seconds)**, type an integer value. This is the amount of time that EMRFS waits between retry attempts. The default value is **10**.
**Note**  
Subsequent retries use an exponential backoff. 

**To launch a cluster with consistent view enabled using the AWS CLI**

We recommend that you install the current version of AWS CLI. To download the latest release, see [https://aws.amazon.com/cli/](https://aws.amazon.com/cli/).
+ 
**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace them with a caret (^).

  ```
  aws emr create-cluster --instance-type m5.xlarge --instance-count 3 --emrfs Consistent=true \
  --release-label emr-7.12.0 --ec2-attributes KeyName=myKey
  ```

**To check if consistent view is enabled using the AWS Management Console**
+ To check whether consistent view is enabled in the console, navigate to the **Cluster List** and select your cluster name to view **Cluster Details**. The "EMRFS consistent view" field has a value of `Enabled` or `Disabled`.

**To check if consistent view is enabled by examining the `emrfs-site.xml` file**
+ You can check if consistency is enabled by inspecting the `emrfs-site.xml` configuration file on the master node of the cluster. If the Boolean value for `fs.s3.consistent` is set to `true` then consistent view is enabled for file system operations involving Amazon S3.

# Understanding how EMRFS consistent view tracks objects in Amazon S3


EMRFS creates a consistent view of objects in Amazon S3 by adding information about those objects to the EMRFS metadata. EMRFS adds these listings to its metadata when:
+  An object is written by EMRFS during the course of an Amazon EMR job.
+  An object is synced with or imported to EMRFS metadata by using the EMRFS CLI.

Objects read by EMRFS are not automatically added to the metadata. When EMRFS deletes an object, a listing still remains in the metadata with a deleted state until that listing is purged using the EMRFS CLI. To learn more about the CLI, see [EMRFS CLI Command Reference](emrfs-cli-reference.md). For more information about purging listings in the EMRFS metadata, see [EMRFS consistent view metadata](emrfs-metadata.md).

For every Amazon S3 operation, EMRFS checks the metadata for information about the set of objects in consistent view. If EMRFS finds that Amazon S3 is inconsistent during one of these operations, it retries the operation according to parameters defined in `emrfs-site` configuration properties. After EMRFS exhausts the retries, it either throws a `ConsistencyException` or logs the exception and continues the workflow. For more information about retry logic, see [Retry logic](emrfs-retry-logic.md). You can find `ConsistencyExceptions` in your logs, for example:
+  listStatus: No Amazon S3 object for metadata item `/S3_bucket/dir/object`
+  getFileStatus: Key `dir/file` is present in metadata but not Amazon S3

If you delete an object directly from Amazon S3 that EMRFS consistent view tracks, EMRFS treats that object as inconsistent because it is still listed in the metadata as present in Amazon S3. If your metadata becomes out of sync with the objects EMRFS tracks in Amazon S3, you can use the **sync** sub-command of the EMRFS CLI to reset the metadata so that it reflects Amazon S3. To discover discrepancies between the metadata and Amazon S3, use the **diff** sub-command. Finally, EMRFS only has a consistent view of the objects referenced in the metadata; there can be other objects in the same Amazon S3 path that are not being tracked. When EMRFS lists the objects in an Amazon S3 path, it returns the superset of the objects being tracked in the metadata and those in that Amazon S3 path.
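The superset listing behavior can be illustrated with a small model. This is a simplified sketch, not EMRFS internals; the assumption that deleted-state entries are excluded from listings is the author's simplification:

```python
def consistent_list(metadata_entries, s3_listing):
    """Model of an EMRFS listing: the union of objects tracked in metadata
    (excluding entries marked deleted) and objects actually present under
    the Amazon S3 path."""
    tracked = {key for key, state in metadata_entries.items() if state != "deleted"}
    return sorted(tracked | set(s3_listing))

metadata = {"dir/part-00000": "present", "dir/part-00001": "deleted"}
s3 = ["dir/part-00000", "dir/_SUCCESS"]  # _SUCCESS was never tracked by EMRFS
print(consistent_list(metadata, s3))  # prints ['dir/_SUCCESS', 'dir/part-00000']
```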

# Retry logic


EMRFS tries to verify list consistency for objects tracked in its metadata for a specific number of retries. The default is 5. If the number of retries is exceeded, the originating job returns a failure unless `fs.s3.consistent.throwExceptionOnInconsistency` is set to `false`, in which case it only logs the objects tracked as inconsistent. EMRFS uses an exponential backoff retry policy by default, but you can also set it to a fixed policy. Users may also want to retry for a certain period of time before proceeding with the rest of their job without throwing an exception. They can achieve this by setting `fs.s3.consistent.throwExceptionOnInconsistency` to `false`, `fs.s3.consistent.retryPolicyType` to `fixed`, and `fs.s3.consistent.retryPeriodSeconds` to the desired value. The following example creates a cluster with consistency enabled, which logs inconsistencies and sets a fixed retry interval of 10 seconds:

**Example Setting retry period to a fixed amount**  

```
aws emr create-cluster --release-label emr-7.12.0 \
--instance-type m5.xlarge --instance-count 1 \
--emrfs Consistent=true,Args=[fs.s3.consistent.throwExceptionOnInconsistency=false,fs.s3.consistent.retryPolicyType=fixed,fs.s3.consistent.retryPeriodSeconds=10] \
--ec2-attributes KeyName=myKey
```

**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace them with a caret (^).

For more information, see [Consistent view](emr-plan-consistent-view.md).
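The difference between the fixed and exponential policies can be modeled as a wait schedule. This is a sketch; the plain doubling shown here is a simplifying assumption, since the actual exponential backoff EMRFS uses is not specified in detail above:

```python
def wait_schedule(policy, period_seconds, retry_count):
    """Per-retry wait times implied by the emrfs-site retry settings
    (simplified model; no randomization)."""
    if policy == "fixed":
        return [period_seconds] * retry_count
    if policy == "exponential":
        return [period_seconds * (2 ** i) for i in range(retry_count)]
    return []  # policy "none": no retries, fail immediately

print(wait_schedule("fixed", 10, 5))       # prints [10, 10, 10, 10, 10]
print(wait_schedule("exponential", 1, 5))  # prints [1, 2, 4, 8, 16]
```

With the fixed policy from the example above, the job waits at most `retryPeriodSeconds * retryCount` seconds before logging the inconsistency and moving on.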

## EMRFS configurations for IMDS get region calls


EMRFS relies on the instance metadata service (IMDS) to get the instance Region and the Amazon S3, DynamoDB, and AWS KMS endpoints. However, IMDS has a limit on how many requests it can handle, and requests that exceed that limit fail. This limit can cause EMRFS to fail to initialize and the query or command to fail. You can use the following configuration properties in `emrfs-site.xml` to set a randomized exponential backoff retry mechanism and a fallback Region for the scenario where all retries fail.

```
<property>
    <name>fs.s3.region.retryCount</name>
    <value>3</value>
    <description>
    Maximum number of retries to attempt when getting the AWS Region.
    </description>
</property>
<property>
    <name>fs.s3.region.retryPeriodSeconds</name>
    <value>3</value>
    <description>
    Base sleep time in seconds for each get-region retry.
    </description>
</property>
<property>
    <name>fs.s3.region.fallback</name>
    <value>us-east-1</value>
    <description>
    Fallback to this region after maximum retries for getting AWS region have been reached.
    </description>
</property>
```
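The retry-then-fallback behavior these properties describe can be sketched as follows. The helper is hypothetical; `fetch` stands in for the actual IMDS call, and the sleep is omitted:

```python
import random

def resolve_region(fetch, retry_count=3, base_seconds=3, fallback="us-east-1"):
    """Call `fetch` (standing in for the IMDS get-region request) up to
    retry_count times with a randomized, exponentially growing backoff;
    return the fallback Region if every attempt fails."""
    for attempt in range(retry_count):
        try:
            return fetch()
        except Exception:
            # A real client would sleep here; the sketch only computes the delay.
            delay = random.uniform(0, base_seconds * (2 ** attempt))
    return fallback

def always_throttled():
    raise RuntimeError("IMDS request limit exceeded")

print(resolve_region(always_throttled))     # prints us-east-1
print(resolve_region(lambda: "eu-west-1"))  # prints eu-west-1
```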

# EMRFS consistent view metadata


EMRFS consistent view uses a DynamoDB table to track objects in Amazon S3 that have been synced with or created by EMRFS. The metadata is used to track all operations (read, write, update, and copy); no actual content is stored in it. This metadata is used to validate whether the objects or metadata received from Amazon S3 match what is expected. This confirmation gives EMRFS the ability to check list consistency and read-after-write consistency for new objects EMRFS writes to Amazon S3 or objects synced with EMRFS. Multiple clusters can share the same metadata.

**How to add entries to metadata**  
You can use the `sync` or `import` subcommands to add entries to metadata. `sync` reflects the state of the Amazon S3 objects in a path, while `import` is used strictly to add new entries to the metadata. For more information, see [EMRFS CLI Command Reference](emrfs-cli-reference.md).

**How to check differences between metadata and objects in Amazon S3**  
To check for differences between the metadata and Amazon S3, use the `diff` subcommand of the EMRFS CLI. For more information, see [EMRFS CLI Command Reference](emrfs-cli-reference.md).

**How to know if metadata operations are being throttled**  
EMRFS sets default throughput capacity limits on the metadata for its read and write operations at 500 and 100 units, respectively. Large numbers of objects or buckets may cause operations to exceed this capacity, at which point DynamoDB throttles operations. For example, EMRFS may throw a `ProvisionedThroughputExceededException` if an operation exceeds these capacity limits. Upon throttling, the EMRFS CLI tool attempts to retry writing to the DynamoDB table using [exponential backoff](https://docs.aws.amazon.com/general/latest/gr/api-retries.html) until the operation finishes or it reaches the maximum retry value for writing objects from Amazon EMR to Amazon S3. 

You can configure your own throughput capacity limits. However, DynamoDB has strict partition limits of 3000 read capacity units (RCUs) and 1000 write capacity units (WCUs) per second for read and write operations. To avoid `sync` failures caused by throttling, we recommend you limit throughput for read operations to fewer than 3000 RCUs and write operations to fewer than 1000 WCUs. For instructions on setting custom throughput capacity limits, see [Configure consistent view](emrfs-configure-consistent-view.md).
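The recommendation above amounts to a simple bounds check, sketched here (the helper name is hypothetical):

```python
# DynamoDB per-partition limits cited above
MAX_RCU, MAX_WCU = 3000, 1000

def capacity_ok(read_capacity, write_capacity):
    """Return True if the requested throughput stays under the DynamoDB
    per-partition limits, reducing the risk of throttled sync operations."""
    return read_capacity < MAX_RCU and write_capacity < MAX_WCU

print(capacity_ok(600, 300))   # prints True
print(capacity_ok(3000, 100))  # prints False
```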

You can also view Amazon CloudWatch metrics for your EMRFS metadata in the DynamoDB console where you can see the number of throttled read and write requests. If you do have a non-zero value for throttled requests, your application may potentially benefit from increasing allocated throughput capacity for read or write operations. You may also realize a performance benefit if you see that your operations are approaching the maximum allocated throughput capacity in reads or writes for an extended period of time.

**Throughput characteristics for notable EMRFS operations**  
The default for read and write operations is 400 and 100 throughput capacity units, respectively. The following performance characteristics give you an idea of what throughput is required for certain operations. These tests were performed using a single-node `m3.large` cluster. All operations were single threaded. Performance differs greatly based on particular application characteristics and it may take experimentation to optimize file system operations.


| Operation  | Average read-per-second  | Average write-per-second  | 
| --- | --- | --- | 
| create (object) | 26.79 |  6.70 | 
| delete (object) | 10.79 |  10.79 | 
| delete (directory containing 1000 objects) | 21.79 | 338.40  | 
|  getFileStatus (object) | 34.70 | 0  | 
| getFileStatus (directory) | 19.96 | 0 | 
| listStatus (directory containing 1 object) | 43.31 | 0 | 
| listStatus (directory containing 10 objects) | 44.34 | 0 | 
| listStatus (directory containing 100 objects) | 84.44 | 0 | 
| listStatus (directory containing 1,000 objects) | 308.81 | 0 | 
| listStatus (directory containing 10,000 objects) | 416.05 | 0 | 
| listStatus (directory containing 100,000 objects) | 823.56 | 0 | 
| listStatus (directory containing 1M objects) | 882.36 | 0 | 
| mkdir (continuous for 120 seconds)  | 24.18 | 4.03 | 
| mkdir | 12.59 | 0 | 
| rename (object) | 19.53 | 4.88 | 
| rename (directory containing 1000 objects) | 23.22 | 339.34 | 

**To submit a step that purges old data from your metadata store**  
Users may wish to remove particular entries in the DynamoDB-based metadata. This can help reduce storage costs associated with the table. Users have the ability to manually or programmatically purge particular entries by using the EMRFS CLI `delete` subcommand. However, if you delete entries from the metadata, EMRFS no longer makes any checks for consistency.

Programmatically purging after the completion of a job can be done by submitting a final step to your cluster, which executes a command on the EMRFS CLI. For instance, type the following command to submit a step to your cluster to delete all entries older than two days.

```
aws emr add-steps --cluster-id j-2AL4XXXXXX5T9 --steps Name="emrfsCLI",Jar="command-runner.jar",Args=["emrfs","delete","--time","2","--time-unit","days"]
{
    "StepIds": [
        "s-B12345678902"
    ]
}
```

Use the StepId value returned to check the logs for the result of the operation.
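If you build the step programmatically, the purge command above corresponds to an argument list like the following (a sketch; the helper name is hypothetical):

```python
def purge_step_args(age, unit="days"):
    """Argument list for a command-runner.jar step that deletes metadata
    entries older than the given age via the EMRFS CLI."""
    return ["emrfs", "delete", "--time", str(age), "--time-unit", unit]

print(purge_step_args(2))
# prints ['emrfs', 'delete', '--time', '2', '--time-unit', 'days']
```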

# Configure consistency notifications for CloudWatch and Amazon SQS


You can enable CloudWatch metrics and Amazon SQS messages in EMRFS for Amazon S3 eventual consistency issues. 

**CloudWatch**  
When CloudWatch metrics are enabled, a metric named **Inconsistency** is pushed each time a `FileSystem` API call fails due to Amazon S3 eventual consistency. 

**To view CloudWatch metrics for Amazon S3 eventual consistency issues**

To view the **Inconsistency** metric in the CloudWatch console, select the EMRFS metrics and then select a **JobFlowId**/**Metric Name** pair. For example: `j-162XXXXXXM2CU ListStatus`, `j-162XXXXXXM2CU GetFileStatus`, and so on.

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the **Dashboard**, in the **Metrics** section, choose **EMRFS**. 

1. In the **Job Flow Metrics** pane, select one or more **JobFlowId**/**Metric Name** pairs. A graphical representation of the metrics appears in the window below.

**Amazon SQS**  
When Amazon SQS notifications are enabled, an Amazon SQS queue with the name `EMRFS-Inconsistency-<jobFlowId>` is created when EMRFS is initialized. Amazon SQS messages are pushed into the queue when a `FileSystem` API call fails due to Amazon S3 eventual consistency. The message contains information such as JobFlowId, API, a list of inconsistent paths, a stack trace, and so on. Messages can be read using the Amazon SQS console or using the EMRFS `read-sqs` command.

**To manage Amazon SQS messages for Amazon S3 eventual consistency issues**

Amazon SQS messages for Amazon S3 eventual consistency issues can be read using the EMRFS CLI. To read messages from an EMRFS Amazon SQS queue, type the `read-sqs` command and specify an output location on the master node's local file system for the resulting output file. 

You can also delete an EMRFS Amazon SQS queue using the `delete-sqs` command.

1. To read messages from an Amazon SQS queue, type the following command. Replace *queuename* with the name of the Amazon SQS queue that you configured and replace */path/filename* with the path to the output file:

   ```
   emrfs read-sqs --queue-name queuename --output-file /path/filename
   ```

   For example, to read and output Amazon SQS messages from the default queue, type:

   ```
   emrfs read-sqs --queue-name EMRFS-Inconsistency-j-162XXXXXXM2CU --output-file /path/filename
   ```
**Note**  
You can also use the `-q` and `-o` shortcuts instead of `--queue-name` and `--output-file` respectively.

1. To delete an Amazon SQS queue, type the following command:

   ```
   emrfs delete-sqs --queue-name queuename
   ```

   For example, to delete the default queue, type:

   ```
   emrfs delete-sqs --queue-name EMRFS-Inconsistency-j-162XXXXXXM2CU
   ```
**Note**  
You can also use the `-q` shortcut instead of `--queue-name`.

# Configure consistent view


You can configure additional settings for consistent view using `emrfs-site` configuration properties. For example, you can choose a different default DynamoDB throughput by supplying the following arguments to the CLI `--emrfs` option, by using the `emrfs-site` configuration classification (Amazon EMR release version 4.x and later only), or by using a bootstrap action to configure the `emrfs-site.xml` file on the master node:

**Example Changing default metadata read and write values at cluster launch**  

```
aws emr create-cluster --release-label emr-7.12.0 --instance-type m5.xlarge \
--emrfs Consistent=true,Args=[fs.s3.consistent.metadata.read.capacity=600,\
fs.s3.consistent.metadata.write.capacity=300] --ec2-attributes KeyName=myKey
```

Alternatively, use the following configuration file and save it locally or in Amazon S3:

```
[
    {
      "Classification": "emrfs-site",
      "Properties": {
        "fs.s3.consistent.metadata.read.capacity": "600",
        "fs.s3.consistent.metadata.write.capacity": "300"
      }
    }
 ]
```

Use the configuration you created with the following syntax:

```
aws emr create-cluster --release-label emr-7.12.0 --applications Name=Hive \
--instance-type m5.xlarge --instance-count 2 --configurations file://./myConfig.json
```

**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace them with a caret (^).
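If you generate `myConfig.json` programmatically rather than writing it by hand, a sketch like the following produces the same classification JSON and round-trip-checks that it parses (the capacity values are the example values from above):

```python
import json

# Same classification shown above; capacities are example values.
config = [
    {
        "Classification": "emrfs-site",
        "Properties": {
            "fs.s3.consistent.metadata.read.capacity": "600",
            "fs.s3.consistent.metadata.write.capacity": "300",
        },
    }
]

with open("myConfig.json", "w") as f:
    json.dump(config, f, indent=2)

# Round-trip check that the file parses as valid configuration JSON
with open("myConfig.json") as f:
    loaded = json.load(f)
print(loaded[0]["Classification"])  # prints emrfs-site
```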

The following options can be set using configurations or AWS CLI `--emrfs` arguments. For information about those arguments, see the [AWS CLI Command Reference](https://docs.aws.amazon.com/cli/latest/reference/).


**`emrfs-site.xml` Properties for consistent view**  

| Property  | Default value | Description  | 
| --- | --- | --- | 
| fs.s3.consistent | false |  When set to **true**, this property configures EMRFS to use DynamoDB to provide consistency.  | 
| fs.s3.consistent.retryPolicyType | exponential | This property identifies the policy to use when retrying for consistency issues. Options include: exponential, fixed, or none. | 
| fs.s3.consistent.retryPeriodSeconds | 1 | This property sets the length of time to wait between consistency retry attempts. | 
| fs.s3.consistent.retryCount | 10 | This property sets the maximum number of retries when inconsistency is detected. | 
| fs.s3.consistent.throwExceptionOnInconsistency | true | This property determines whether to throw or log a consistency exception. When set to true, a ConsistencyException is thrown. | 
| fs.s3.consistent.metadata.autoCreate | true | When set to true, this property enables automatic creation of metadata tables. | 
| fs.s3.consistent.metadata.etag.verification.enabled | true | With Amazon EMR 5.29.0, this property is enabled by default. When enabled, EMRFS uses S3 ETags to verify that objects being read are the latest available version. This feature is helpful for read-after-update use cases in which files on S3 are being overwritten while retaining the same name. This ETag verification capability currently does not work with S3 Select. | 
| fs.s3.consistent.metadata.tableName | EmrFSMetadata | This property specifies the name of the metadata table in DynamoDB. | 
| fs.s3.consistent.metadata.read.capacity | 500 | This property specifies the DynamoDB read capacity to provision when the metadata table is created. | 
| fs.s3.consistent.metadata.write.capacity | 100 | This property specifies the DynamoDB write capacity to provision when the metadata table is created. | 
| fs.s3.consistent.fastList | true | When set to true, this property uses multiple threads to list a directory (when necessary). Consistency must be enabled in order to use this property. | 
| fs.s3.consistent.fastList.prefetchMetadata | false | When set to true, this property enables metadata prefetching for directories containing more than 20,000 items. | 
| fs.s3.consistent.notification.CloudWatch | false | When set to true, CloudWatch metrics are enabled for FileSystem API calls that fail due to Amazon S3 eventual consistency issues. | 
| fs.s3.consistent.notification.SQS | false | When set to true, eventual consistency notifications are pushed to an Amazon SQS queue. | 
| fs.s3.consistent.notification.SQS.queueName | EMRFS-Inconsistency-<jobFlowId> | Changing this property allows you to specify your own SQS queue name for messages regarding Amazon S3 eventual consistency issues. | 
| fs.s3.consistent.notification.SQS.customMsg | none | This property allows you to specify custom information included in SQS messages regarding Amazon S3 eventual consistency issues. If a value is not specified for this property, the corresponding field in the message is empty.  | 
| fs.s3.consistent.dynamodb.endpoint | none | This property allows you to specify a custom DynamoDB endpoint for your consistent view metadata. | 
| fs.s3.useRequesterPaysHeader | false | When set to true, this property allows Amazon S3 requests to buckets with the request payer option enabled.  | 

# EMRFS CLI Command Reference


The EMRFS CLI is installed by default on all cluster master nodes created using Amazon EMR release version 3.2.1 or later. You can use the EMRFS CLI to manage the metadata for consistent view. 

**Note**  
The **emrfs** command is only supported with VT100 terminal emulation. However, it may work with other terminal emulator modes.

## emrfs top-level command


The **emrfs** top-level command supports the following structure.

```
emrfs [describe-metadata | set-metadata-capacity | delete-metadata | create-metadata | \
list-metadata-stores | diff | delete | sync | import ] [options] [arguments]
```

Specify [options], with or without [arguments] as described in the following table. For [options] specific to sub-commands (`describe-metadata`, `set-metadata-capacity`, etc.), see each sub-command below.


**[Options] for emrfs**  

| Option  | Description  | Required  | 
| --- | --- | --- | 
|  `-a AWS_ACCESS_KEY_ID \| --access-key AWS_ACCESS_KEY_ID`  |  The AWS access key you use to write objects to Amazon S3 and to create or access a metadata store in DynamoDB. By default, *AWS\_ACCESS\_KEY\_ID* is set to the access key used to create the cluster.  |  No  | 
|  `-s AWS_SECRET_ACCESS_KEY \| --secret-key AWS_SECRET_ACCESS_KEY`  |  The AWS secret key associated with the access key you use to write objects to Amazon S3 and to create or access a metadata store in DynamoDB. By default, *AWS\_SECRET\_ACCESS\_KEY* is set to the secret key associated with the access key used to create the cluster.  |  No  | 
|  `-v \| --verbose`  |  Makes output verbose.  |  No  | 
|  `-h \| --help`  |  Displays the help message for the `emrfs` command with a usage statement.  |  No  | 

## emrfs describe-metadata sub-command



**[Options] for emrfs describe-metadata**  

| Option  | Description  | Required  | 
| --- | --- | --- | 
|  `-m METADATA_NAME \| --metadata-name METADATA_NAME`  |  *METADATA\_NAME* is the name of the DynamoDB metadata table. If the *METADATA\_NAME* argument is not supplied, the default value is `EmrFSMetadata`.  |  No  | 

**Example emrfs describe-metadata example**  <a name="emrfs-describe-metadata"></a>
The following example describes the default metadata table.  

```
$ emrfs describe-metadata
EmrFSMetadata
  read-capacity: 400
  write-capacity: 100
  status: ACTIVE
  approximate-item-count (6 hour delay): 12
```

## emrfs set-metadata-capacity sub-command



**[Options] for emrfs set-metadata-capacity**  

| Option  | Description  | Required  | 
| --- | --- | --- | 
|  `-m METADATA_NAME \| --metadata-name METADATA_NAME`  |  *METADATA\_NAME* is the name of the DynamoDB metadata table. If the *METADATA\_NAME* argument is not supplied, the default value is `EmrFSMetadata`.  |  No  | 
|  `-r READ_CAPACITY \| --read-capacity READ_CAPACITY`  |  The requested read throughput capacity for the metadata table. If the *READ\_CAPACITY* argument is not supplied, the default value is `400`.  |  No  | 
|  `-w WRITE_CAPACITY \| --write-capacity WRITE_CAPACITY`  |  The requested write throughput capacity for the metadata table. If the *WRITE\_CAPACITY* argument is not supplied, the default value is `100`.  |  No  | 

**Example: emrfs set-metadata-capacity**  
The following example sets the read throughput capacity to `600` and the write capacity to `150` for a metadata table named `EmrMetadataAlt`.  

```
$ emrfs set-metadata-capacity --metadata-name EmrMetadataAlt --read-capacity 600 --write-capacity 150
  read-capacity: 400
  write-capacity: 100
  status: UPDATING
  approximate-item-count (6 hour delay): 0
```

## emrfs delete-metadata sub-command



**[Options] for emrfs delete-metadata**  

| Option  | Description  | Required  | 
| --- | --- | --- | 
|  `-m METADATA_NAME \| --metadata-name METADATA_NAME`  |  *METADATA\_NAME* is the name of the DynamoDB metadata table. If the *METADATA\_NAME* argument is not supplied, the default value is `EmrFSMetadata`.  |  No  | 

**Example: emrfs delete-metadata**  
The following example deletes the default metadata table.  

```
$ emrfs delete-metadata
```

## emrfs create-metadata sub-command



**[Options] for emrfs create-metadata**  

| Option  | Description  | Required  | 
| --- | --- | --- | 
|  `-m METADATA_NAME \| --metadata-name METADATA_NAME`  |  *METADATA\_NAME* is the name of the DynamoDB metadata table. If the *METADATA\_NAME* argument is not supplied, the default value is `EmrFSMetadata`.  |  No  | 
|  `-r READ_CAPACITY \| --read-capacity READ_CAPACITY`  |  The requested read throughput capacity for the metadata table. If the *READ\_CAPACITY* argument is not supplied, the default value is `400`.  |  No  | 
|  `-w WRITE_CAPACITY \| --write-capacity WRITE_CAPACITY`  |  The requested write throughput capacity for the metadata table. If the *WRITE\_CAPACITY* argument is not supplied, the default value is `100`.  |  No  | 

**Example: emrfs create-metadata**  
The following example creates a metadata table named `EmrFSMetadataAlt`.  

```
$ emrfs create-metadata -m EmrFSMetadataAlt
Creating metadata: EmrFSMetadataAlt
EmrFSMetadataAlt
  read-capacity: 400
  write-capacity: 100
  status: ACTIVE
  approximate-item-count (6 hour delay): 0
```

## emrfs list-metadata-stores sub-command


The **emrfs list-metadata-stores** sub-command has no [options]. 

**Example: emrfs list-metadata-stores**  
The following example lists your metadata tables.  

```
$ emrfs list-metadata-stores
  EmrFSMetadata
```

## emrfs diff sub-command



**[Options] for emrfs diff**  

| Option  | Description  | Required  | 
| --- | --- | --- | 
|  `-m METADATA_NAME \| --metadata-name METADATA_NAME`  |  *METADATA\_NAME* is the name of the DynamoDB metadata table. If the *METADATA\_NAME* argument is not supplied, the default value is `EmrFSMetadata`.  |  No  | 
|  *s3://s3Path*  |  The path to the Amazon S3 bucket to compare with the metadata table. Buckets sync recursively.  |  Yes  | 

**Example: emrfs diff**  
The following example compares the default metadata table to an Amazon S3 bucket.  

```
$ emrfs diff s3://elasticmapreduce/samples/cloudfront
BOTH | MANIFEST ONLY | S3 ONLY
DIR elasticmapreduce/samples/cloudfront
DIR elasticmapreduce/samples/cloudfront/code/
DIR elasticmapreduce/samples/cloudfront/input/
DIR elasticmapreduce/samples/cloudfront/logprocessor.jar
DIR elasticmapreduce/samples/cloudfront/input/XABCD12345678.2009-05-05-14.WxYz1234
DIR elasticmapreduce/samples/cloudfront/input/XABCD12345678.2009-05-05-15.WxYz1234
DIR elasticmapreduce/samples/cloudfront/input/XABCD12345678.2009-05-05-16.WxYz1234
DIR elasticmapreduce/samples/cloudfront/input/XABCD12345678.2009-05-05-17.WxYz1234
DIR elasticmapreduce/samples/cloudfront/input/XABCD12345678.2009-05-05-18.WxYz1234
DIR elasticmapreduce/samples/cloudfront/input/XABCD12345678.2009-05-05-19.WxYz1234
DIR elasticmapreduce/samples/cloudfront/input/XABCD12345678.2009-05-05-20.WxYz1234
DIR elasticmapreduce/samples/cloudfront/code/cloudfront-loganalyzer.tgz
```

## emrfs delete sub-command



**[Options] for emrfs delete**  

|  Option  |  Description  |  Required  | 
| --- | --- | --- | 
|  `-m METADATA_NAME \| --metadata-name METADATA_NAME`  |  *METADATA\_NAME* is the name of the DynamoDB metadata table. If the *METADATA\_NAME* argument is not supplied, the default value is `EmrFSMetadata`.  |  No  | 
|  *s3://s3Path*  |  The path to the Amazon S3 bucket you are tracking for consistent view. Buckets sync recursively.  |  Yes  | 
|  `-t TIME \| --time TIME`  |  The expiration time (interpreted using the time unit argument). All metadata entries older than the *TIME* argument are deleted for the specified bucket.  |  No  | 
|  `-u UNIT \| --time-unit UNIT`  |  The measure used to interpret the time argument (nanoseconds, microseconds, milliseconds, seconds, minutes, hours, or days). If no argument is specified, the default value is `days`.  |  No  | 
|  `--read-consumption READ_CONSUMPTION`  |  The requested amount of available read throughput used for the **delete** operation. If the *READ\_CONSUMPTION* argument is not specified, the default value is `400`.  |  No  | 
|  `--write-consumption WRITE_CONSUMPTION`  |  The requested amount of available write throughput used for the **delete** operation. If the *WRITE\_CONSUMPTION* argument is not specified, the default value is `100`.  |  No  | 

**Example: emrfs delete**  
The following example removes all objects in an Amazon S3 bucket from the tracking metadata for consistent view.  

```
$ emrfs delete s3://elasticmapreduce/samples/cloudfront
entries deleted: 11
```
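The `-t`/`-u` pair describes an age threshold: entries whose timestamps fall before `now - TIME * UNIT` are removed. A minimal local sketch of that arithmetic follows; `expiration_cutoff` is a hypothetical helper for reasoning about the cutoff, not a call into the `emrfs` CLI.

```python
import datetime

# Multiplier from each supported --time-unit value to seconds.
_UNIT_SECONDS = {
    "nanoseconds": 1e-9,
    "microseconds": 1e-6,
    "milliseconds": 1e-3,
    "seconds": 1,
    "minutes": 60,
    "hours": 3600,
    "days": 86400,  # default unit when -u is omitted
}


def expiration_cutoff(time_value, unit="days", now=None):
    """Return the UTC datetime before which metadata entries would be
    considered expired by `emrfs delete -t TIME -u UNIT`.

    A sketch of the documented age threshold, assuming entries are
    compared against `now` minus the requested interval.
    """
    now = now or datetime.datetime.now(datetime.timezone.utc)
    return now - datetime.timedelta(seconds=time_value * _UNIT_SECONDS[unit])
```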

## emrfs import sub-command



**[Options] for emrfs import**  

| Option  | Description  | Required  | 
| --- | --- | --- | 
|  `-m METADATA_NAME \| --metadata-name METADATA_NAME`  |  *METADATA\_NAME* is the name of the DynamoDB metadata table. If the *METADATA\_NAME* argument is not supplied, the default value is `EmrFSMetadata`.  |  No  | 
|  *s3://s3Path*  |  The path to the Amazon S3 bucket you are tracking for consistent view. Buckets sync recursively.  |  Yes  | 
|  `--read-consumption READ_CONSUMPTION`  |  The requested amount of available read throughput used for the **import** operation. If the *READ\_CONSUMPTION* argument is not specified, the default value is `400`.  |  No  | 
|  `--write-consumption WRITE_CONSUMPTION`  |  The requested amount of available write throughput used for the **import** operation. If the *WRITE\_CONSUMPTION* argument is not specified, the default value is `100`.  |  No  | 

**Example: emrfs import**  
The following example imports all objects in an Amazon S3 bucket into the tracking metadata for consistent view. All unknown keys are ignored.  

```
$ emrfs import s3://elasticmapreduce/samples/cloudfront
```

## emrfs sync sub-command



**[Options] for emrfs sync**  

| Option  | Description  | Required  | 
| --- | --- | --- | 
|  `-m METADATA_NAME \| --metadata-name METADATA_NAME`  |  *METADATA\_NAME* is the name of the DynamoDB metadata table. If the *METADATA\_NAME* argument is not supplied, the default value is `EmrFSMetadata`.  |  No  | 
|  *s3://s3Path*  |  The path to the Amazon S3 bucket you are tracking for consistent view. Buckets sync recursively.  |  Yes  | 
|  `--read-consumption READ_CONSUMPTION`  |  The requested amount of available read throughput used for the **sync** operation. If the *READ\_CONSUMPTION* argument is not specified, the default value is `400`.  |  No  | 
|  `--write-consumption WRITE_CONSUMPTION`  |  The requested amount of available write throughput used for the **sync** operation. If the *WRITE\_CONSUMPTION* argument is not specified, the default value is `100`.  |  No  | 

**Example: emrfs sync**  
The following example syncs the tracking metadata for consistent view with all objects in an Amazon S3 bucket. All unknown keys are deleted.  

```
$ emrfs sync s3://elasticmapreduce/samples/cloudfront
Synching samples/cloudfront                                       0 added | 0 updated | 0 removed | 0 unchanged
Synching samples/cloudfront/code/                                 1 added | 0 updated | 0 removed | 0 unchanged
Synching samples/cloudfront/                                      2 added | 0 updated | 0 removed | 0 unchanged
Synching samples/cloudfront/input/                                9 added | 0 updated | 0 removed | 0 unchanged
Done synching s3://elasticmapreduce/samples/cloudfront            9 added | 0 updated | 1 removed | 0 unchanged
creating 3 folder key(s)
folders written: 3
```
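Each line of the sync output above ends with a `added | updated | removed | unchanged` counter block, which is convenient to extract when automating syncs. The following is a small sketch assuming that exact tail format; `parse_sync_line` is a hypothetical helper, not part of the `emrfs` utility.

```python
import re

# Matches the counter tail shown in the `emrfs sync` sample output,
# e.g. "9 added | 0 updated | 1 removed | 0 unchanged".
_COUNTERS = re.compile(
    r"(\d+) added \| (\d+) updated \| (\d+) removed \| (\d+) unchanged"
)


def parse_sync_line(line):
    """Extract the per-prefix counters from one line of `emrfs sync`
    output, or return None for lines without a counter block."""
    match = _COUNTERS.search(line)
    if not match:
        return None
    keys = ("added", "updated", "removed", "unchanged")
    return dict(zip(keys, map(int, match.groups())))
```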

## emrfs read-sqs sub-command



**[Options] for emrfs read-sqs**  

| Option  | Description  | Required  | 
| --- | --- | --- | 
|  `-q QUEUE_NAME \| --queue-name QUEUE_NAME`  |  *QUEUE\_NAME* is the name of the Amazon SQS queue configured in `emrfs-site.xml`. The default value is **EMRFS-Inconsistency-<jobFlowId>**.  |  Yes  | 
|  `-o OUTPUT_FILE \| --output-file OUTPUT_FILE`  |  *OUTPUT\_FILE* is the path to the output file on the master node's local file system. Messages read from the queue are written to this file.  |  Yes  | 

## emrfs delete-sqs sub-command



**[Options] for emrfs delete-sqs**  

| Option  | Description  | Required  | 
| --- | --- | --- | 
|  `-q QUEUE_NAME \| --queue-name QUEUE_NAME`  |  *QUEUE\_NAME* is the name of the Amazon SQS queue configured in `emrfs-site.xml`. The default value is **EMRFS-Inconsistency-<jobFlowId>**.  |  Yes  | 

## Submitting EMRFS CLI commands as steps


The following example shows how to run the `emrfs` utility on the master node as a step, using the AWS CLI or API with `command-runner.jar`. The example uses the AWS SDK for Python (Boto3) to add a step to a cluster that adds objects in an Amazon S3 bucket to the default EMRFS metadata table.

```
import boto3
from botocore.exceptions import ClientError


def add_emrfs_step(command, bucket_url, cluster_id, emr_client):
    """
    Add an EMRFS command as a job flow step to an existing cluster.

    :param command: The EMRFS command to run.
    :param bucket_url: The URL of a bucket that contains tracking metadata.
    :param cluster_id: The ID of the cluster to update.
    :param emr_client: The Boto3 Amazon EMR client object.
    :return: The ID of the added job flow step. Status can be tracked by calling
             the emr_client.describe_step() function.
    """
    job_flow_step = {
        "Name": "Example EMRFS Command Step",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["/usr/bin/emrfs", command, bucket_url],
        },
    }

    try:
        response = emr_client.add_job_flow_steps(
            JobFlowId=cluster_id, Steps=[job_flow_step]
        )
        step_id = response["StepIds"][0]
        print(f"Added step {step_id} to cluster {cluster_id}.")
    except ClientError:
        print(f"Couldn't add a step to cluster {cluster_id}.")
        raise
    else:
        return step_id


def usage_demo():
    emr_client = boto3.client("emr")
    # Assumes the first waiting cluster has EMRFS enabled and has created metadata
    # with the default name of 'EmrFSMetadata'.
    cluster = emr_client.list_clusters(ClusterStates=["WAITING"])["Clusters"][0]
    add_emrfs_step(
        "sync", "s3://elasticmapreduce/samples/cloudfront", cluster["Id"], emr_client
    )


if __name__ == "__main__":
    usage_demo()
```

You can use the returned `step_id` value to check the logs for the result of the operation.
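Because the step definition is plain data, the `HadoopJarStep` dictionary from the example above can be factored into a helper and unit-tested without contacting Amazon EMR. The sketch below generalizes the example to any `emrfs` sub-command; `emrfs_step` is a hypothetical helper name, and the step shape mirrors the `command-runner.jar` invocation shown above.

```python
def emrfs_step(command, bucket_url, name=None):
    """Build the Steps entry that runs an EMRFS sub-command through
    command-runner.jar, matching the boto3 example above.

    :param command: The EMRFS sub-command to run, such as "sync",
                    "import", or "delete".
    :param bucket_url: The s3:// URL passed to the sub-command.
    :param name: Optional display name for the step.
    :return: A dict suitable for the Steps parameter of
             add_job_flow_steps.
    """
    return {
        "Name": name or f"EMRFS {command} step",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["/usr/bin/emrfs", command, bucket_url],
        },
    }
```

With this helper, the call in `add_emrfs_step` reduces to passing `emrfs_step("sync", bucket_url)` to `add_job_flow_steps`.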