

# Configure Amazon EMR cluster location and data storage
<a name="emr-cluster-location-data-storage"></a>

This section describes how to configure the Region for a cluster, the different file systems available when you use Amazon EMR, and how to use them. It also covers how to prepare or upload input data to Amazon EMR if necessary, and how to prepare an output location for log files and any output data files that you configure.

**Topics**
+ [Choose an AWS Region for your Amazon EMR cluster](emr-plan-region.md)
+ [Working with storage and file systems with Amazon EMR](emr-plan-file-systems.md)
+ [Prepare input data for processing with Amazon EMR](emr-plan-input.md)
+ [Configure a location for Amazon EMR cluster output](emr-plan-output.md)

# Choose an AWS Region for your Amazon EMR cluster
<a name="emr-plan-region"></a>

Amazon Web Services runs on servers in data centers around the world. Data centers are organized by geographical Region. When you launch an Amazon EMR cluster, you must specify a Region. You might choose a Region to reduce latency, minimize costs, or address regulatory requirements. For the list of Regions and endpoints supported by Amazon EMR, see [Regions and endpoints](https://docs.aws.amazon.com/general/latest/gr/rande.html#emr_region) in the *Amazon Web Services General Reference*. 

For best performance, you should launch the cluster in the same Region as your data. For example, if the Amazon S3 bucket storing your input data is in the US West (Oregon) Region, you should launch your cluster in the US West (Oregon) Region to avoid cross-Region data transfer fees. If you use an Amazon S3 bucket to receive the output of the cluster, you would also want to create it in the US West (Oregon) Region. 

If you plan to associate an Amazon EC2 key pair with the cluster (required for using SSH to log on to the master node), the key pair must be created in the same Region as the cluster. Similarly, the security groups that Amazon EMR creates to manage the cluster are created in the same Region as the cluster. 

If you signed up for an AWS account on or after May 17, 2017, the default Region when you access a resource from the AWS Management Console is US East (Ohio) (us-east-2); for older accounts, the default Region is either US West (Oregon) (us-west-2) or US East (N. Virginia) (us-east-1). For more information, see [Regions and Endpoints](https://docs.aws.amazon.com/general/latest/gr/rande.html). 

Some AWS features are available only in limited Regions. For example, Cluster Compute instances are available only in the US East (N. Virginia) Region, and the Asia Pacific (Sydney) Region supports only Hadoop 1.0.3 and later. When choosing a Region, check that it supports the features you want to use.

For best performance, use the same Region for all of the AWS resources that you use with the cluster. For a list of Amazon EMR Regions, see [AWS Regions and endpoints](https://docs.aws.amazon.com/general/latest/gr/rande.html#emr_region) in the *Amazon Web Services General Reference*.

## Choose a Region with the console
<a name="emr-dev-specify-region-console"></a>

Your default Region is displayed to the left of your account information on the navigation bar. To switch Regions in both the new and old consoles, choose the Region dropdown menu and select a new option.

## Specify a Region with the AWS CLI
<a name="emr-dev-specify-region-cli"></a>

Specify a default Region in the AWS CLI using either the **aws configure** command or the `AWS_DEFAULT_REGION` environment variable. For more information, see [Configuring the AWS Region](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html#cli-installing-specifying-region) in the *AWS Command Line Interface User Guide*.
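
As a rough sketch (not the AWS CLI's actual implementation), the precedence among an explicit `--region` option, the `AWS_DEFAULT_REGION` environment variable, and a configured default could be modeled like this; `resolve_region` is a hypothetical helper:

```python
import os

def resolve_region(cli_region=None, configured_default=None):
    """Simplified sketch of region precedence: an explicit --region
    option wins, then the AWS_DEFAULT_REGION environment variable,
    then the configured default (from `aws configure`)."""
    if cli_region:
        return cli_region
    env_region = os.environ.get("AWS_DEFAULT_REGION")
    if env_region:
        return env_region
    return configured_default

os.environ["AWS_DEFAULT_REGION"] = "us-west-2"
print(resolve_region())                        # environment variable applies
print(resolve_region(cli_region="us-east-1"))  # explicit option wins
```

See the linked *AWS Command Line Interface User Guide* topic for the authoritative resolution order.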

## Choose a Region with an SDK or the API
<a name="emr-dev-specify-region-api"></a>

To choose a Region using an SDK, configure your application to use that Region's endpoint. If you are creating a client application using an AWS SDK, you can change the client endpoint by calling `setEndpoint`, as shown in the following example:

```
client.setEndpoint("elasticmapreduce.us-west-2.amazonaws.com");
```

After your application has specified a Region by setting the endpoint, you can set the Availability Zone for your cluster's EC2 instances. Availability Zones are distinct geographical locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. A Region contains one or more Availability Zones. To optimize performance and reduce latency, all resources should be located in the same Availability Zone as the cluster that uses them. 

# Working with storage and file systems with Amazon EMR
<a name="emr-plan-file-systems"></a>

Amazon EMR and Hadoop provide a variety of file systems that you can use when processing cluster steps. You specify which file system to use by the prefix of the URI used to access the data. For example, `s3://amzn-s3-demo-bucket1/path` references an Amazon S3 bucket using S3A (beginning with Amazon EMR release 7.10.0). The following table lists the available file systems, with recommendations about when it's best to use each one.

Amazon EMR and Hadoop typically use two or more of the following file systems when processing a cluster. HDFS and S3A are the two main file systems used with Amazon EMR.

**Important**  
Beginning with Amazon EMR release 5.22.0, Amazon EMR uses AWS Signature Version 4 exclusively to authenticate requests to Amazon S3. Earlier Amazon EMR releases use AWS Signature Version 2 in some cases, unless the release notes indicate that Signature Version 4 is used exclusively. For more information, see [Authenticating Requests (AWS Signature Version 4)](https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html) and [Authenticating Requests (AWS Signature Version 2)](https://docs.aws.amazon.com/AmazonS3/latest/API/auth-request-sig-v2.html) in the *Amazon Simple Storage Service Developer Guide*.


| File system | Prefix | Description | 
| --- | --- | --- | 
| HDFS | hdfs:// (or no prefix) |  HDFS is a distributed, scalable, and portable file system for Hadoop. An advantage of HDFS is data awareness between the Hadoop cluster nodes managing the clusters and the Hadoop cluster nodes managing the individual steps. For more information, see [Hadoop documentation](http://hadoop.apache.org/docs/stable).  HDFS is used by the master and core nodes. One advantage is that it's fast; a disadvantage is that it's ephemeral storage which is reclaimed when the cluster ends. It's best used for caching the results produced by intermediate job-flow steps.   | 
| S3A | s3://, s3a://, s3n:// |  The Hadoop S3A file system is an open-source S3 connector that enables Apache Hadoop and its ecosystem to interact directly with Amazon S3 storage. It allows users to read and write data to S3 buckets using Hadoop-compatible file operations, providing a seamless integration between Hadoop applications and cloud storage. Prior to EMR-7.10.0, Amazon EMR used EMRFS for the *s3://* and *s3n://* schemes.   | 
| local file system |  |  The local file system refers to a locally connected disk. When a Hadoop cluster is created, each node is created from an EC2 instance that comes with a preconfigured block of preattached disk storage called an *instance store*. Data on instance store volumes persists only during the life of its EC2 instance. Instance store volumes are ideal for storing temporary data that is continually changing, such as buffers, caches, scratch data, and other temporary content. For more information, see [Amazon EC2 instance storage](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html). The local file system is used by HDFS, but Python also runs from the local file system and you can choose to store additional application files on instance store volumes.  | 
| (Legacy) Amazon S3 block file system | s3bfs:// |  The Amazon S3 block file system is a legacy file storage system. We strongly discourage the use of this system.  We recommend that you do not use this file system because it can trigger a race condition that might cause your cluster to fail. However, it might be required by legacy applications.   | 

## Access file systems
<a name="emr-dev-access-file-systems"></a>

You specify which file system to use by the prefix of the uniform resource identifier (URI) used to access the data. The following procedures illustrate how to reference several different types of file systems. 
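
As an illustration (using a hypothetical helper, not an Amazon EMR API), file system selection depends only on the URI scheme:

```python
from urllib.parse import urlparse

def file_system_for(uri):
    """Map a URI to the file system Amazon EMR would use, by scheme
    alone. A path with no scheme (or hdfs://) resolves to HDFS;
    s3://, s3a://, and s3n:// resolve to S3A on current releases;
    s3bfs:// is the legacy S3 block file system."""
    scheme = urlparse(uri).scheme
    if scheme in ("", "hdfs"):
        return "HDFS"
    if scheme in ("s3", "s3a", "s3n"):
        return "S3A"
    if scheme == "s3bfs":
        return "S3 block file system (legacy)"
    return "unknown"

print(file_system_for("/path-to-data"))                   # HDFS
print(file_system_for("s3://amzn-s3-demo-bucket1/path"))  # S3A
```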

**To access a local HDFS**
+ Specify the `hdfs:///` prefix in the URI. Amazon EMR resolves paths that do not specify a prefix in the URI to the local HDFS. For example, both of the following URIs would resolve to the same location in HDFS. 

  ```
  hdfs:///path-to-data

  /path-to-data
  ```

**To access a remote HDFS**
+ Include the IP address of the master node in the URI, as shown in the following examples. 

  ```
  hdfs://master-ip-address/path-to-data

  master-ip-address/path-to-data
  ```

**To access Amazon S3**
+ Use the `s3://` prefix.

  ```
  s3://bucket-name/path-to-file-in-bucket
  ```

**To access the Amazon S3 block file system**
+ Use only for legacy applications that require the Amazon S3 block file system. To access or store data with this file system, use the `s3bfs://` prefix in the URI. 

  The Amazon S3 block file system is a legacy file system that was used to support uploads to Amazon S3 that were larger than 5 GB. With the multipart upload functionality Amazon EMR provides through the AWS SDK for Java, you can upload large files to the Amazon S3 native file system, and the Amazon S3 block file system is deprecated. For more information about multipart upload for EMR, see [Configure multipart upload for Amazon S3](#Config_Multipart). For more information about S3 object-size and part-size limits, see [Amazon S3 multipart upload limits](https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html) in the *Amazon Simple Storage Service User Guide*.
**Warning**  
Because this legacy file system can create race conditions that can corrupt the file system, you should avoid this format and use EMRFS instead. 

  ```
  s3bfs://bucket-name/path-to-file-in-bucket
  ```

# Prepare input data for processing with Amazon EMR
<a name="emr-plan-input"></a>

Most clusters load input data and then process that data. To load data, it must be in a location that the cluster can access and in a format the cluster can process. The most common scenario is to upload input data to Amazon S3. Amazon EMR provides tools for your cluster to import or read data from Amazon S3.

The default input format in Hadoop is text files, though you can customize Hadoop and use tools to import data stored in other formats. 

**Topics**
+ [Types of input Amazon EMR can accept](emr-plan-input-accept.md)
+ [Different ways to get data into Amazon EMR](emr-plan-get-data-in.md)

# Types of input Amazon EMR can accept
<a name="emr-plan-input-accept"></a>

The default input format for a cluster is text files with each line separated by a newline (`\n`) character, which is the input format most commonly used. 

If your input data is in a format other than the default text files, you can use the Hadoop interface `InputFormat` to specify other input types. You can even create a subclass of the `FileInputFormat` class to handle custom data types. For more information, see [http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/InputFormat.html](http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/InputFormat.html). 

If you are using Hive, you can use a serializer/deserializer (SerDe) to read data in from a given format into HDFS. For more information, see [https://cwiki.apache.org/confluence/display/Hive/SerDe](https://cwiki.apache.org/confluence/display/Hive/SerDe). 

# Different ways to get data into Amazon EMR
<a name="emr-plan-get-data-in"></a>

Amazon EMR provides several ways to get data onto a cluster. The most common way is to upload the data to Amazon S3 and use the built-in features of Amazon EMR to load the data onto your cluster. You can also use the DistributedCache feature of Hadoop to transfer files from a distributed file system to the local file system. The implementation of Hive provided by Amazon EMR (Hive version 0.7.1.1 and later) includes functionality that you can use to import and export data between DynamoDB and an Amazon EMR cluster. If you have large amounts of on-premises data to process, you may find the Direct Connect service useful. 

**Topics**
+ [Upload data to Amazon S3](emr-plan-upload-s3.md)
+ [Upload data with AWS DataSync](emr-plan-upload-datasync.md)
+ [Import files with distributed cache with Amazon EMR](emr-plan-input-distributed-cache.md)
+ [Detecting and processing compressed files with Amazon EMR](HowtoProcessGzippedFiles.md)
+ [Import DynamoDB data into Hive with Amazon EMR](emr-plan-input-dynamodb.md)
+ [Connect to data with AWS Direct Connect from Amazon EMR](emr-plan-input-directconnect.md)
+ [Upload large amounts of data for Amazon EMR with AWS Snowball Edge](emr-plan-input-snowball.md)

# Upload data to Amazon S3
<a name="emr-plan-upload-s3"></a>

For information on how to upload objects to Amazon S3, see [Add an object to your bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/PuttingAnObjectInABucket.html) in the *Amazon Simple Storage Service User Guide*. For more information about using Amazon S3 with Hadoop, see [http://wiki.apache.org/hadoop/AmazonS3](http://wiki.apache.org/hadoop/AmazonS3). 

**Topics**
+ [Create and configure an Amazon S3 bucket](#create-s3-bucket-input)
+ [Configure multipart upload for Amazon S3](#Config_Multipart)
+ [Best practices](#emr-bucket-bestpractices)
+ [Upload data to Amazon S3 Express One Zone](emr-express-one-zone.md)

## Create and configure an Amazon S3 bucket
<a name="create-s3-bucket-input"></a>

Amazon EMR uses the AWS SDK for Java with Amazon S3 to store input data, log files, and output data. Amazon S3 refers to these storage locations as *buckets*. Buckets have certain restrictions and limitations to conform with Amazon S3 and DNS requirements. For more information, see [Bucket restrictions and limitations](https://docs.aws.amazon.com/AmazonS3/latest/userguide/BucketRestrictions.html) in the *Amazon Simple Storage Service User Guide*.
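
As a quick illustration of a few of those naming rules, the following sketch checks a simplified subset (it is not the complete set of Amazon S3 restrictions; `is_valid_bucket_name` is a hypothetical helper):

```python
import re

def is_valid_bucket_name(name):
    """Simplified check of a few Amazon S3 bucket naming rules:
    3-63 characters; lowercase letters, digits, dots, and hyphens;
    must start and end with a letter or digit; must not be formatted
    like an IP address. See the S3 documentation for the full rules."""
    if not 3 <= len(name) <= 63:
        return False
    if not re.fullmatch(r"[a-z0-9][a-z0-9.\-]*[a-z0-9]", name):
        return False
    if re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", name):
        return False
    return True

print(is_valid_bucket_name("amzn-s3-demo-bucket1"))  # True
print(is_valid_bucket_name("Bad_Bucket"))            # False
```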

This section shows you how to use the AWS Management Console to create and then set permissions for an Amazon S3 bucket. You can also create and set permissions for an Amazon S3 bucket using the Amazon S3 API or the AWS CLI, or use curl with the appropriate authentication parameters for Amazon S3.

See the following resources:
+ To create a bucket using the console, see [Create a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket.html) in the *Amazon S3 User Guide*.
+ To create and work with buckets using the AWS CLI, see [Using high-level S3 commands with the AWS Command Line Interface](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-s3-commands.html) in the *Amazon S3 User Guide*.
+ To create a bucket using an SDK, see [Examples of creating a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-get-location-example.html) in the *Amazon Simple Storage Service User Guide*.
+ To work with buckets using curl, see [Amazon S3 authentication tool for curl](https://aws.amazon.com/code/amazon-s3-authentication-tool-for-curl/).
+ For more information on specifying Region-specific buckets, see [Accessing a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingBucket.html#access-bucket-intro) in the *Amazon Simple Storage Service User Guide*.
+ To work with buckets using Amazon S3 Access Points, see [Using a bucket-style alias for your access point](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points-alias.html) in the *Amazon S3 User Guide*. You can easily use Amazon S3 Access Points with the Amazon S3 Access Point Alias instead of the Amazon S3 bucket name. You can use the Amazon S3 Access Point Alias for both existing and new applications, including Spark, Hive, Presto and others.

**Note**  
If you enable logging for a bucket, it enables only bucket access logs, not Amazon EMR cluster logs. 

During bucket creation or after, you can set the appropriate permissions to access the bucket depending on your application. Typically, you give yourself (the owner) read and write access and give authenticated users read access.

Required Amazon S3 buckets must exist before you can create a cluster. You must upload any required scripts or data referenced in the cluster to Amazon S3. 

## Configure multipart upload for Amazon S3
<a name="Config_Multipart"></a>

Amazon EMR supports Amazon S3 multipart upload through the AWS SDK for Java. Multipart upload lets you upload a single object as a set of parts. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles the parts and creates the object.

For more information, see [Multipart upload overview](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html) in the *Amazon Simple Storage Service User Guide*.
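
Conceptually, multipart upload splits an object into independently transferable parts that Amazon S3 reassembles on completion. The following is a local sketch of that idea, not a call to the S3 API:

```python
def split_into_parts(data, part_size):
    """Split an object's bytes into fixed-size parts, as multipart
    upload does; parts can be sent independently and in any order,
    and a failed part can be retransmitted on its own."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]

def complete_upload(parts):
    """Reassemble parts in order, as Amazon S3 does when the
    multipart upload completes."""
    return b"".join(parts)

data = b"x" * 1000
parts = split_into_parts(data, 300)
print(len(parts))  # 4 parts: 300 + 300 + 300 + 100 bytes
assert complete_upload(parts) == data
```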

In addition, Amazon EMR offers properties that allow you to more precisely control the clean-up of failed multipart upload parts.

The following table describes the Amazon EMR configuration properties for multipart upload. You can configure these using the `core-site` configuration classification. For more information, see [Configure applications](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/configure-apps.html) in the *Amazon EMR Release Guide*.


| Configuration parameter name | Default value | Description | 
| --- | --- | --- | 
| fs.s3n.multipart.uploads.enabled | true | A Boolean type that indicates whether to enable multipart uploads. When EMRFS consistent view is enabled, multipart uploads are enabled by default and setting this value to false is ignored. | 
| fs.s3n.multipart.uploads.split.size | 134217728 | Specifies the maximum size of a part, in bytes, before EMRFS starts a new part upload when multipart uploads are enabled. The minimum value is `5242880` (5 MB). If a lesser value is specified, `5242880` is used. The maximum is `5368709120` (5 GB). If a greater value is specified, `5368709120` is used. If EMRFS client-side encryption is disabled and the Amazon S3 optimized committer is also disabled, this value also controls the maximum size that a data file can grow to before EMRFS uses multipart uploads rather than a `PutObject` request to upload the file. | 
| fs.s3n.ssl.enabled | true | A Boolean type that indicates whether to use HTTP or HTTPS. | 
| fs.s3.buckets.create.enabled | false | A Boolean type that indicates whether a bucket should be created if it does not exist. Setting to false causes an exception on CreateBucket operations. | 
| fs.s3.multipart.clean.enabled | false | A Boolean type that indicates whether to enable background periodic clean-up of incomplete multipart uploads. | 
| fs.s3.multipart.clean.age.threshold | 604800 | A long type that specifies the minimum age of a multipart upload, in seconds, before it is considered for cleanup. The default is one week. | 
| fs.s3.multipart.clean.jitter.max | 10000 | An integer type that specifies the maximum amount of random jitter delay in seconds added to the 15-minute fixed delay before scheduling next round of clean-up. | 
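
The part-size bounds described for `fs.s3n.multipart.uploads.split.size` amount to a simple clamp, sketched below with a hypothetical helper:

```python
MIN_PART_SIZE = 5_242_880      # 5 MB, the documented minimum
MAX_PART_SIZE = 5_368_709_120  # 5 GB, the documented maximum

def effective_split_size(configured):
    """Clamp a configured split size to the documented range,
    mirroring how out-of-range values are replaced."""
    return max(MIN_PART_SIZE, min(configured, MAX_PART_SIZE))

print(effective_split_size(1_000_000))     # below minimum -> 5242880
print(effective_split_size(134_217_728))   # the default, unchanged
print(effective_split_size(10**12))        # above maximum -> 5368709120
```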

### Disable multipart uploads
<a name="emr-dev-multipart-upload"></a>

------
#### [ Console ]

**To disable multipart uploads with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. Under **Software settings**, enter the following configuration: `classification=core-site,properties=[fs.s3n.multipart.uploads.enabled=false]`.

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.

------
#### [ CLI ]

**To disable multipart upload using the AWS CLI**

This procedure explains how to disable multipart upload using the AWS CLI. To disable multipart upload, use the `create-cluster` command with the `--configurations` parameter. 

1. Create a file, `myConfig.json`, with the following contents and save it in the same directory where you run the command:

   ```
   [
     {
       "Classification": "core-site",
       "Properties": {
         "fs.s3n.multipart.uploads.enabled": "false"
       }
     }
   ]
   ```

1. Type the following command and replace *myKey* with the name of your EC2 key pair.
**Note**  
Linux line continuation characters (\\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

   ```
   aws emr create-cluster --name "Test cluster" \
   --release-label emr-7.12.0 --applications Name=Hive Name=Pig \
   --use-default-roles --ec2-attributes KeyName=myKey --instance-type m5.xlarge \
   --instance-count 3 --configurations file://myConfig.json
   ```

------
#### [ API ]

**To disable multipart upload using the API**
+ For information on using Amazon S3 multipart uploads programmatically, see [Using the AWS SDK for Java for multipart upload](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMPDotJavaAPI.html) in the *Amazon Simple Storage Service User Guide*.

  For more information about the AWS SDK for Java, see [AWS SDK for Java](https://aws.amazon.com/sdkforjava/).

------

## Best practices
<a name="emr-bucket-bestpractices"></a>

The following are recommendations for using Amazon S3 buckets with EMR clusters.

### Enable versioning
<a name="emr-enable-versioning"></a>

Versioning is a recommended configuration for your Amazon S3 bucket. By enabling versioning, you ensure that even if data is unintentionally deleted or overwritten, it can be recovered. For more information, see [Using versioning](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html) in the *Amazon Simple Storage Service User Guide*.

### Clean up failed multipart uploads
<a name="emr-multipart-cleanup"></a>

EMR cluster components use multipart uploads via the AWS SDK for Java with Amazon S3 APIs to write log files and output data to Amazon S3 by default. For information about changing properties related to this configuration using Amazon EMR, see [Configure multipart upload for Amazon S3](#Config_Multipart). Sometimes the upload of a large file can result in an incomplete Amazon S3 multipart upload. When a multipart upload is unable to complete successfully, the in-progress multipart upload continues to occupy your bucket and incurs storage charges. We recommend the following options to avoid excessive file storage:
+ For buckets that you use with Amazon EMR, use a lifecycle configuration rule in Amazon S3 to remove incomplete multipart uploads three days after the upload initiation date. Lifecycle configuration rules allow you to control the storage class and lifetime of objects. For more information, see [Object lifecycle management](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html), and [Aborting incomplete multipart uploads using a bucket lifecycle policy](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html#mpu-abort-incomplete-mpu-lifecycle-config).
+ Enable Amazon EMR's multipart cleanup feature by setting `fs.s3.multipart.clean.enabled` to `true` and tuning other cleanup parameters. This feature is useful at high volume, large scale, and with clusters that have limited uptime. In this case, the `DaysAfterInitiation` parameter of a lifecycle configuration rule may be too long, even if set to its minimum, causing spikes in Amazon S3 storage. Amazon EMR's multipart cleanup allows more precise control. For more information, see [Configure multipart upload for Amazon S3](#Config_Multipart). 
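
For reference, a lifecycle rule that aborts incomplete multipart uploads three days after initiation might look like the following (a sketch of the Amazon S3 lifecycle configuration format; the rule ID is arbitrary):

```
{
  "Rules": [
    {
      "ID": "abort-incomplete-multipart-uploads",
      "Status": "Enabled",
      "Filter": {},
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 3
      }
    }
  ]
}
```

You could apply such a rule with the `aws s3api put-bucket-lifecycle-configuration` command.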

### Manage version markers
<a name="w2aac28c11c17c11b7c11b9"></a>

We recommend that you enable a lifecycle configuration rule in Amazon S3 to remove expired object delete markers for versioned buckets that you use with Amazon EMR. When you delete an object in a versioned bucket, Amazon S3 creates a delete marker. If all previous versions of the object subsequently expire, an expired object delete marker is left in the bucket. While you are not charged for delete markers, removing expired markers can improve the performance of LIST requests. For more information, see [Lifecycle configuration for a bucket with versioning](https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-bucket-with-versioning.html) in the *Amazon Simple Storage Service User Guide*.

### Performance best practices
<a name="w2aac28c11c17c11b7c11c11"></a>

Depending on your workloads, specific types of usage of EMR clusters and applications on those clusters can result in a high number of requests against a bucket. For more information, see [Request rate and performance considerations](https://docs.aws.amazon.com/AmazonS3/latest/userguide/request-rate-perf-considerations.html) in the *Amazon Simple Storage Service User Guide*. 

# Upload data to Amazon S3 Express One Zone
<a name="emr-express-one-zone"></a>

## Overview
<a name="emr-express-one-zone-overview"></a>

With Amazon EMR 6.15.0 and higher, you can use Amazon EMR with Apache Spark in conjunction with the [Amazon S3 Express One Zone](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-express-one-zone.html) storage class for improved performance on your Spark jobs. Amazon EMR releases 7.2.0 and higher also support HBase, Flink, and Hive, so you can also benefit from S3 Express One Zone if you use these applications. *S3 Express One Zone* is an S3 storage class for applications that frequently access data with hundreds of thousands of requests per second. At the time of its release, S3 Express One Zone delivers the lowest latency and highest performance cloud object storage in Amazon S3. 

## Prerequisites
<a name="emr-express-one-zone-prereqs"></a>
+ **S3 Express One Zone permissions** – When S3 Express One Zone initially performs an action like `GET`, `LIST`, or `PUT` on an S3 object, the storage class calls `CreateSession` on your behalf. Your IAM policy must allow the `s3express:CreateSession` permission so that the S3A connector can invoke the `CreateSession` API. For an example policy with this permission, see [Getting started with Amazon S3 Express One Zone](#emr-express-one-zone-start).
+ **S3A connector** – To configure your Spark cluster to access data from an Amazon S3 bucket that uses the S3 Express One Zone storage class, you must use the Apache Hadoop connector S3A. To use the connector, ensure all S3 URIs use the `s3a` scheme. If they don’t, you can change the filesystem implementation that you use for `s3` and `s3n` schemes.

To change the `s3` scheme, specify the following cluster configurations: 

```
[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
      "fs.AbstractFileSystem.s3.impl": "org.apache.hadoop.fs.s3a.S3A"
    }
  }
]
```

To change the `s3n` scheme, specify the following cluster configurations: 

```
[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.s3n.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
      "fs.AbstractFileSystem.s3n.impl": "org.apache.hadoop.fs.s3a.S3A"
    }
  }
]
```

## Getting started with Amazon S3 Express One Zone
<a name="emr-express-one-zone-start"></a>

**Topics**
+ [Create a permission policy](#emr-express-one-zone-permissions)
+ [Create and configure your cluster](#emr-express-one-zone-create)
+ [Configurations overview](#emr-express-one-zone-configs)

### Create a permission policy
<a name="emr-express-one-zone-permissions"></a>

Before you can create a cluster that uses Amazon S3 Express One Zone, you must create an IAM policy to attach to the Amazon EC2 instance profile for the cluster. The policy must have permissions to access the S3 Express One Zone storage class. The following example policy shows how to grant the required permission. After you create the policy, attach the policy to the instance profile role that you use to create your EMR cluster, as described in the [Create and configure your cluster](#emr-express-one-zone-create) section.

------
#### [ JSON ]


```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3express:*:123456789012:bucket/example-s3-bucket"
      ],
      "Action": [
        "s3express:CreateSession"
      ],
      "Sid": "AllowS3EXPRESSCreatesession"
    }
  ]
}
```

------

### Create and configure your cluster
<a name="emr-express-one-zone-create"></a>

Next, create a cluster that runs Spark, HBase, Flink, or Hive with S3 Express One Zone. The following steps describe a high-level overview to create a cluster in the AWS Management Console:

1. Navigate to the Amazon EMR console and select **Clusters** from the sidebar. Then choose **Create cluster**.

1. If you use Spark, select Amazon EMR release `emr-6.15.0` or higher. If you use HBase, Flink, or Hive, select `emr-7.2.0` or higher.

1. Select the applications that you want to include on your cluster, such as Spark, HBase, or Flink.

1. To enable Amazon S3 Express One Zone, enter a configuration similar to the following example in the **Software settings** section. The configurations and recommended values are described in the [Configurations overview](#emr-express-one-zone-configs) section that follows this procedure.

   ```
   [
     {
       "Classification": "core-site",
       "Properties": {
         "fs.s3a.aws.credentials.provider": "software.amazon.awssdk.auth.credentials.InstanceProfileCredentialsProvider",
         "fs.s3a.change.detection.mode": "none",
         "fs.s3a.endpoint.region": "aa-example-1",
         "fs.s3a.select.enabled": "false"
       }
     },
     {
       "Classification": "spark-defaults",
       "Properties": {
         "spark.sql.sources.fastS3PartitionDiscovery.enabled": "false"
       }
     }
   ]
   ```

1. In the **EC2 instance profile for Amazon EMR** section, choose to use an existing role, and use a role with the policy attached that you created in the [Create a permission policy](#emr-express-one-zone-permissions) section above.

1. Configure the rest of your cluster settings as appropriate for your application, and then select **Create cluster**.

### Configurations overview
<a name="emr-express-one-zone-configs"></a>

The following tables describe the configurations and suggested values that you should specify when you set up a cluster that uses S3 Express One Zone with Amazon EMR, as described in the [Create and configure your cluster](#emr-express-one-zone-create) section.

**S3A configurations**


| Parameter | Default value | Suggested value | Explanation | 
| --- | --- | --- | --- | 
|  `fs.s3a.aws.credentials.provider`  |  If not specified, uses `AWSCredentialProviderList` in the following order: `TemporaryAWSCredentialsProvider`, `SimpleAWSCredentialsProvider`, `EnvironmentVariableCredentialsProvider`, `IAMInstanceCredentialsProvider`.  |  <pre>software.amazon.awssdk.auth.credentials.InstanceProfileCredentialsProvider</pre>  |  The Amazon EMR instance profile role should have the policy that allows the S3A filesystem to call `s3express:CreateSession`. Other credential providers also work if they have the S3 Express One Zone permissions.  | 
|  `fs.s3a.endpoint.region`  |  null  |  The AWS Region where you created the bucket.  |  Region resolution logic doesn't work with S3 Express One Zone storage class.  | 
|  `fs.s3a.select.enabled`  |  `true`  |  `false`  |  Amazon S3 `select` is not supported with S3 Express One Zone storage class.  | 
|  `fs.s3a.change.detection.mode`  |  `server`  |  `none`  |  Change detection in S3A works by checking MD5-based ETags. The S3 Express One Zone storage class doesn't support MD5 checksums.  | 

**Spark configurations**


| Parameter | Default value | Suggested value | Explanation | 
| --- | --- | --- | --- | 
|  `spark.sql.sources.fastS3PartitionDiscovery.enabled`  |  `true`  |  `false`  |  The internal optimization uses an S3 API parameter that the S3 Express One Zone storage class doesn't support.  | 

**Hive configurations**


| Parameter | Default value | Suggested value | Explanation | 
| --- | --- | --- | --- | 
|  `hive.exec.fast.s3.partition.discovery.enabled`  |  `true`  |  `false`  |  The internal optimization uses an S3 API parameter that the S3 Express One Zone storage class doesn't support.  | 

## Considerations
<a name="emr-express-one-zone-considerations"></a>

Consider the following when you integrate Apache Spark on Amazon EMR with the S3 Express One Zone storage class:
+ The S3A connector is required to use S3 Express One Zone with Amazon EMR. Only S3A has the features and storage classes that are required to interact with S3 Express One Zone. For steps to set up the connector, see [Prerequisites](#emr-express-one-zone-prereqs).
+ The Amazon S3 Express One Zone storage class supports SSE-S3 and SSE-KMS encryption. For more information, see [Server-side encryption with Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-express-data-protection.html#s3-express-ecnryption).
+ The Amazon S3 Express One Zone storage class does not support writes with the S3A `FileOutputCommitter`. Writes with the S3A `FileOutputCommitter` on S3 Express One Zone buckets result in an error: *InvalidStorageClass: The storage class you specified is not valid*.
+ Amazon S3 Express One Zone is supported with Amazon EMR releases 6.15.0 and higher on EMR on EC2. Additionally, it's supported on Amazon EMR releases 7.2.0 and higher on Amazon EMR on EKS and on Amazon EMR Serverless.

# Upload data with AWS DataSync
<a name="emr-plan-upload-datasync"></a>

AWS DataSync is an online data transfer service that simplifies, automates, and accelerates the process of moving data between your on-premises storage and AWS storage services or between AWS storage services. DataSync supports a variety of on-premises storage systems such as Hadoop Distributed File System (HDFS), NAS file servers, and self-managed object storage.

The most common way to get data onto a cluster is to upload the data to Amazon S3 and use the built-in features of Amazon EMR to load the data onto your cluster.

DataSync can help you accomplish the following tasks:
+ Replicate HDFS on your Hadoop cluster to Amazon S3 for business continuity
+ Copy HDFS to Amazon S3 to populate your data lakes
+ Transfer data between your Hadoop cluster's HDFS and Amazon S3 for analysis and processing

To upload data to your S3 bucket, you first deploy one or more DataSync agents in the same network as your on-premises storage. An *agent* is a virtual machine (VM) that is used to read data from or write data to a self-managed location. You then activate your agents in the AWS account and AWS Region where your S3 bucket is located.

After your agent is activated, you create a source location for your on-premises storage, a destination location for your S3 bucket, and a task. A *task* is a set of two locations (source and destination) and a set of default options that you use to control the behavior of the task.

Finally, you run your DataSync task to transfer data from the source to the destination. 

For more information, see [Getting started with AWS DataSync](https://docs.aws.amazon.com/datasync/latest/userguide/getting-started.html).
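The agent, location, and task workflow above can also be scripted. The following is a hypothetical sketch using boto3's DataSync client; every ARN, hostname, and role name is a placeholder, and the function is illustrative only and is not executed here.

```python
# Hypothetical sketch of the DataSync agent/location/task workflow using
# boto3. All ARNs, hostnames, and role names are placeholders.
def replicate_hdfs_to_s3():
    import boto3  # imported inside so the sketch loads without boto3 installed

    ds = boto3.client("datasync")

    # Source: the on-premises HDFS, reachable through an activated agent.
    source = ds.create_location_hdfs(
        NameNodes=[{"Hostname": "namenode.example.internal", "Port": 8020}],
        AuthenticationType="SIMPLE",
        SimpleUser="hadoop",
        AgentArns=["arn:aws:datasync:aa-example-1:111122223333:agent/agent-example"],
    )

    # Destination: the S3 bucket, accessed through an IAM role.
    dest = ds.create_location_s3(
        S3BucketArn="arn:aws:s3:::amzn-s3-demo-bucket",
        S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/DataSyncS3Role"},
    )

    # A task pairs the two locations; starting an execution runs the transfer.
    task = ds.create_task(
        SourceLocationArn=source["LocationArn"],
        DestinationLocationArn=dest["LocationArn"],
        Name="hdfs-to-s3",
    )
    ds.start_task_execution(TaskArn=task["TaskArn"])
```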

# Import files with distributed cache with Amazon EMR
<a name="emr-plan-input-distributed-cache"></a>

DistributedCache is a Hadoop feature that can boost efficiency when a map or a reduce task needs access to common data. If your cluster depends on existing applications or binaries that are not installed when the cluster is created, you can use DistributedCache to import these files. This feature lets a cluster node read the imported files from its local file system, instead of retrieving the files from other cluster nodes. 

For more information, go to [http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/filecache/DistributedCache.html](http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/filecache/DistributedCache.html).

You invoke DistributedCache when you create the cluster. The files are cached just before starting the Hadoop job, and the files remain cached for the duration of the job. You can cache files stored on any Hadoop-compatible file system, for example HDFS or Amazon S3. The default size of the file cache is 10 GB. To change the size of the cache, reconfigure the Hadoop parameter `local.cache.size` using a bootstrap action. For more information, see [Create bootstrap actions to install additional software with an Amazon EMR cluster](emr-plan-bootstrap.md).

**Topics**
+ [Supported file types](#emr-dev-supported-file-types)
+ [Location of cached files](#locationofcache)
+ [Access cached files from streaming applications](#cachemapper)
+ [Specify distributed cache files from the console and AWS CLI](#cacheinconsole)

## Supported file types
<a name="emr-dev-supported-file-types"></a>

DistributedCache allows both single files and archives. Individual files are cached as read only. Executables and binary files have execution permissions set.

Archives are one or more files packaged using a utility, such as `gzip`. DistributedCache passes the compressed files to each core node and decompresses the archive as part of caching. DistributedCache supports the following compression formats:
+ zip
+ tgz
+ tar.gz
+ tar
+ jar

## Location of cached files
<a name="locationofcache"></a>

DistributedCache copies files to core nodes only. If there are no core nodes in the cluster, DistributedCache copies the files to the primary node.

DistributedCache associates the cache files to the current working directory of the mapper and reducer using symlinks. A symlink is an alias to a file location, not the actual file location. The value of the `yarn.nodemanager.local-dirs` parameter in `yarn-site.xml` specifies the location of temporary files. Amazon EMR sets this parameter to `/mnt/mapred`, or some variation based on instance type and EMR version. For example, a setting may have `/mnt/mapred` and `/mnt1/mapred` because the instance type has two ephemeral volumes. Cache files are located in a subdirectory of the temporary file location at `/mnt/mapred/taskTracker/archive`. 

If you cache a single file, DistributedCache puts the file in the `archive` directory. If you cache an archive, DistributedCache decompresses the file and creates a subdirectory in `/archive` with the same name as the archive file. The individual files are located in the new subdirectory.

You can use DistributedCache only when using Streaming.

## Access cached files from streaming applications
<a name="cachemapper"></a>

To access the cached files from your mapper or reducer applications, make sure that you have added the current working directory (./) into your application path and referenced the cached files as though they are present in the current working directory.
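As an illustration, the following is a hypothetical Python streaming mapper that reads a cached lookup table from the current working directory. The name `lookup_cached.txt` is an assumed alias set with `-cacheFile`, not a file Amazon EMR provides.

```python
#!/usr/bin/env python3
# Hypothetical Hadoop streaming mapper. It reads a lookup table that
# DistributedCache symlinked into the working directory; the name
# "lookup_cached.txt" is an assumed alias set with -cacheFile.
import os
import sys

def load_lookup(path="./lookup_cached.txt"):
    # Read the cached file exactly as if it were a local file.
    with open(path) as f:
        return {line.strip() for line in f}

def run_mapper(lines, lookup):
    # Emit a count for each input record found in the lookup table.
    for line in lines:
        key = line.strip()
        if key in lookup:
            yield f"{key}\t1"

if __name__ == "__main__" and not sys.stdin.isatty():
    lookup = load_lookup() if os.path.exists("./lookup_cached.txt") else set()
    for record in run_mapper(sys.stdin, lookup):
        print(record)
```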

## Specify distributed cache files from the console and AWS CLI
<a name="cacheinconsole"></a>

You can use the AWS Management Console and the AWS CLI to create clusters that use Distributed Cache. 

------
#### [ Console ]

**To specify distributed cache files with the new console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. Under **Steps**, choose **Add step**. This opens the **Add step** dialog. In the **Arguments** field, include the files and archives to save to the cache. The size of the file (or total size of the files in an archive file) must be less than the allocated cache size.

   If you want to add an individual file to the distributed cache, specify `-cacheFile`, followed by the name and location of the file, the pound sign (#), and the name you want to give the file when it's placed in the local cache. The following example demonstrates how to add an individual file to the distributed cache.

   ```
   -cacheFile \
   s3://amzn-s3-demo-bucket/file-name#cache-file-name
   ```

   If you want to add an archive file to the distributed cache, enter `-cacheArchive` followed by the location of the files in Amazon S3, the pound sign (#), and then the name you want to give the collection of files in the local cache. The following example demonstrates how to add an archive file to the distributed cache.

   ```
   -cacheArchive \
   s3://amzn-s3-demo-bucket/archive-name#cache-archive-name
   ```

   Enter appropriate values in the other dialog fields. Options differ depending on the step type. To add your step and exit the dialog, choose **Add step**.

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.

------
#### [ CLI ]

**To specify distributed cache files with the AWS CLI**
+ To submit a Streaming step when a cluster is created, type the `create-cluster` command with the `--steps` parameter. To specify distributed cache files using the AWS CLI, specify the appropriate arguments when submitting a Streaming step. 

  If you want to add an individual file to the distributed cache, specify `-cacheFile`, followed by the name and location of the file, the pound sign (#), and the name you want to give the file when it's placed in the local cache. 

  If you want to add an archive file to the distributed cache, enter `-cacheArchive` followed by the location of the files in Amazon S3, the pound sign (#), and then the name you want to give the collection of files in the local cache. The examples that follow demonstrate both options.

  For more information on using Amazon EMR commands in the AWS CLI, see [https://docs.aws.amazon.com/cli/latest/reference/emr](https://docs.aws.amazon.com/cli/latest/reference/emr).

**Example 1**  
Type the following command to launch a cluster and submit a Streaming step that uses `-cacheFile` to add one file, `sample_dataset_cached.dat`, to the cache.   

```
aws emr create-cluster --name "Test cluster" --release-label emr-4.0.0 --applications Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey --instance-type m5.xlarge --instance-count 3 --steps Type=STREAMING,Name="Streaming program",ActionOnFailure=CONTINUE,Args=["--files","s3://my_bucket/my_mapper.py s3://my_bucket/my_reducer.py","-mapper","my_mapper.py","-reducer","my_reducer.py","-input","s3://my_bucket/my_input","-output","s3://my_bucket/my_output","-cacheFile","s3://my_bucket/sample_dataset.dat#sample_dataset_cached.dat"]
```
When you specify the instance count without using the `--instance-groups` parameter, a single primary node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.  
If you have not previously created the default EMR service role and EC2 instance profile, type `aws emr create-default-roles` to create them before typing the `create-cluster` subcommand.

**Example 2**  
The following command shows the creation of a streaming cluster and uses `-cacheArchive` to add an archive of files to the cache.   

```
aws emr create-cluster --name "Test cluster" --release-label emr-4.0.0 --applications Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey --instance-type m5.xlarge --instance-count 3 --steps Type=STREAMING,Name="Streaming program",ActionOnFailure=CONTINUE,Args=["--files","s3://my_bucket/my_mapper.py s3://my_bucket/my_reducer.py","-mapper","my_mapper.py","-reducer","my_reducer.py","-input","s3://my_bucket/my_input","-output","s3://my_bucket/my_output","-cacheArchive","s3://my_bucket/sample_dataset.tgz#sample_dataset_cached"]
```
When you specify the instance count without using the `--instance-groups` parameter, a single primary node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.  
If you have not previously created the default EMR service role and EC2 instance profile, type `aws emr create-default-roles` to create them before typing the `create-cluster` subcommand.

------
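The `source#alias` convention that `-cacheFile` and `-cacheArchive` use can be sketched as follows. This is a hypothetical helper for illustration, not part of any Amazon EMR or Hadoop API.

```python
# Hypothetical helper illustrating the source#alias convention used by
# -cacheFile and -cacheArchive: everything before the pound sign is the
# Amazon S3 location, and everything after it is the local cache name.
def split_cache_arg(arg: str):
    source, sep, alias = arg.partition("#")
    if not sep:
        # Assume that with no fragment, the cached copy keeps the
        # original file name.
        alias = source.rsplit("/", 1)[-1]
    return source, alias
```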

# Detecting and processing compressed files with Amazon EMR
<a name="HowtoProcessGzippedFiles"></a>

Hadoop checks the file extension to detect compressed files. The compression types that Hadoop supports are gzip, bzip2, and LZO. You do not need to take any additional action to extract files using these types of compression; Hadoop handles it for you.
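Extension-based detection can be sketched as follows. This is a hypothetical helper that mirrors the behavior described above, not Hadoop's actual implementation.

```python
# Hypothetical sketch of suffix-based codec detection, mirroring how
# Hadoop chooses a decompressor from the input file's extension.
CODEC_BY_SUFFIX = {".gz": "gzip", ".bz2": "bzip2", ".lzo": "lzo"}

def codec_for(path: str):
    # Return the codec name implied by the file extension, or None.
    for suffix, codec in CODEC_BY_SUFFIX.items():
        if path.endswith(suffix):
            return codec
    return None  # treated as uncompressed
```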

To index LZO files, you can use the hadoop-lzo library which can be downloaded from [https://github.com/kevinweil/hadoop-lzo](https://github.com/kevinweil/hadoop-lzo). Note that because this is a third-party library, Amazon EMR does not offer developer support on how to use this tool. For usage information, see [the hadoop-lzo readme file.](https://github.com/kevinweil/hadoop-lzo/blob/master/README.md) 

# Import DynamoDB data into Hive with Amazon EMR
<a name="emr-plan-input-dynamodb"></a>

The implementation of Hive provided by Amazon EMR includes functionality that you can use to import and export data between DynamoDB and an Amazon EMR cluster. This is useful if your input data is stored in DynamoDB. For more information, see [Export, import, query, and join tables in DynamoDB using Amazon EMR](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/EMRforDynamoDB.html). 

# Connect to data with AWS Direct Connect from Amazon EMR
<a name="emr-plan-input-directconnect"></a>

Direct Connect is a service you can use to establish a private dedicated network connection to Amazon Web Services from your data center, office, or colocation environment. If you have large amounts of input data, using Direct Connect may reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections. For more information, see the [Direct Connect User Guide](https://docs.aws.amazon.com/directconnect/latest/UserGuide/). 

# Upload large amounts of data for Amazon EMR with AWS Snowball Edge
<a name="emr-plan-input-snowball"></a>

AWS Snowball Edge is a service you can use to transfer large amounts of data between Amazon Simple Storage Service (Amazon S3) and your onsite data storage location at faster-than-internet speeds. Snowball Edge supports two job types: import jobs and export jobs. Import jobs involve a data transfer from an on-premises source to an Amazon S3 bucket. Export jobs involve a data transfer from an Amazon S3 bucket to an on-premises source. For both job types, Snowball Edge devices secure and protect your data while regional shipping carriers transport them between Amazon S3 and your onsite data storage location. Snowball Edge devices are physically rugged and protected by the AWS Key Management Service (AWS KMS). For more information, see the [AWS Snowball Edge Developer Guide](https://docs.aws.amazon.com/snowball/latest/developer-guide/).

# Configure a location for Amazon EMR cluster output
<a name="emr-plan-output"></a>

 The most common output format of an Amazon EMR cluster is as text files, either compressed or uncompressed. Typically, these are written to an Amazon S3 bucket. This bucket must be created before you launch the cluster. You specify the S3 bucket as the output location when you launch the cluster. 

For more information, see the following topics:

**Topics**
+ [Create and configure an Amazon S3 bucket](#create-s3-bucket-output)
+ [What formats can Amazon EMR return?](emr-plan-output-formats.md)
+ [How to write data to an Amazon S3 bucket you don't own with Amazon EMR](emr-s3-acls.md)
+ [Ways to compress the output of your Amazon EMR cluster](emr-plan-output-compression.md)

## Create and configure an Amazon S3 bucket
<a name="create-s3-bucket-output"></a>

Amazon EMR uses Amazon S3 to store input data, log files, and output data. Amazon S3 refers to these storage locations as *buckets*. Buckets have certain restrictions and limitations to conform with Amazon S3 and DNS requirements. For more information, go to [Bucket Restrictions and Limitations](https://docs.aws.amazon.com/AmazonS3/latest/userguide/BucketRestrictions.html) in the *Amazon Simple Storage Service User Guide*.

To create an Amazon S3 bucket, follow the instructions on the [Creating a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html) page in the *Amazon Simple Storage Service User Guide*.

**Note**  
 If you enable logging in the **Create a Bucket** wizard, it enables only bucket access logs, not cluster logs. 

**Note**  
For more information on specifying Region-specific buckets, refer to [Buckets and Regions](https://docs.aws.amazon.com/AmazonS3/latest/dev/LocationSelection.html) in the *Amazon Simple Storage Service Developer Guide* and [ Available Region Endpoints for the AWS SDKs ](https://aws.amazon.com/articles/available-region-endpoints-for-the-aws-sdks/).

 After you create your bucket you can set the appropriate permissions on it. Typically, you give yourself (the owner) read and write access. We strongly recommend that you follow [Security Best Practices for Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/security-best-practices.html) when configuring your bucket. 

 Required Amazon S3 buckets must exist before you can create a cluster. You must upload any required scripts or data referenced in the cluster to Amazon S3. The following table describes example data, scripts, and log file locations. 


| Information | Example Location on Amazon S3 | 
| --- | --- | 
| script or program |  s3://amzn-s3-demo-bucket1/script/MapperScript.py  | 
| log files |  s3://amzn-s3-demo-bucket1/logs  | 
| input data |  s3://amzn-s3-demo-bucket1/input  | 
| output data |  s3://amzn-s3-demo-bucket1/output  | 

# What formats can Amazon EMR return?
<a name="emr-plan-output-formats"></a>

 The default output format for a cluster is text files, with key-value pairs written on individual lines. This is the output format most commonly used. 

 If your output data needs to be written in a format other than the default text files, you can use the Hadoop interface `OutputFormat` to specify other output types. You can even create a subclass of the `FileOutputFormat` class to handle custom data types. For more information, see [http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/OutputFormat.html](http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/OutputFormat.html). 

 If you are launching a Hive cluster, you can use a serializer/deserializer (SerDe) to output data from HDFS to a given format. For more information, see [https://cwiki.apache.org/confluence/display/Hive/SerDe](https://cwiki.apache.org/confluence/display/Hive/SerDe). 

# How to write data to an Amazon S3 bucket you don't own with Amazon EMR
<a name="emr-s3-acls"></a>

 When you write a file to an Amazon Simple Storage Service (Amazon S3) bucket, by default, you are the only one able to read that file. The assumption is that you will write files to your own buckets, and this default setting protects the privacy of your files. 

 However, if you are running a cluster, and you want the output to write to the Amazon S3 bucket of another AWS user, and you want that other AWS user to be able to read that output, you must do two things: 
+  Have the other AWS user grant you write permissions for their Amazon S3 bucket. The cluster you launch runs under your AWS credentials, so any clusters you launch will also be able to write to that other AWS user's bucket. 
+  Set read permissions for the other AWS user on the files that you or the cluster write to the Amazon S3 bucket. The easiest way to set these read permissions is to use canned access control lists (ACLs), a set of pre-defined access policies defined by Amazon S3. 

 For information about how the other AWS user can grant you permissions to write files to the other user's Amazon S3 bucket, see [Editing bucket permissions](https://docs.aws.amazon.com/AmazonS3/latest/userguide/EditingBucketPermissions.html) in the *Amazon Simple Storage Service User Guide*. 

 For your cluster to use canned ACLs when it writes files to Amazon S3, set the `fs.s3.canned.acl` cluster configuration option to the canned ACL to use. The following table lists the currently defined canned ACLs. 


| Canned ACL | Description | 
| --- | --- | 
| AuthenticatedRead | Specifies that the owner is granted Permission.FullControl and the GroupGrantee.AuthenticatedUsers group grantee is granted Permission.Read access. | 
| BucketOwnerFullControl | Specifies that the owner of the bucket is granted Permission.FullControl. The owner of the bucket is not necessarily the same as the owner of the object. | 
| BucketOwnerRead | Specifies that the owner of the bucket is granted Permission.Read. The owner of the bucket is not necessarily the same as the owner of the object. | 
| LogDeliveryWrite | Specifies that the owner is granted Permission.FullControl and the GroupGrantee.LogDelivery group grantee is granted Permission.Write access, so that access logs can be delivered. | 
| Private | Specifies that the owner is granted Permission.FullControl. | 
| PublicRead | Specifies that the owner is granted Permission.FullControl and the GroupGrantee.AllUsers group grantee is granted Permission.Read access. | 
| PublicReadWrite | Specifies that the owner is granted Permission.FullControl and the GroupGrantee.AllUsers group grantee is granted Permission.Read and Permission.Write access. | 
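The names in the table are the CamelCase values that `fs.s3.canned.acl` expects, while the underlying Amazon S3 REST API (the `x-amz-acl` header) and the AWS CLI spell the same canned ACLs in kebab-case. The following reference sketch records that standard Amazon S3 correspondence; it is not an EMR-specific API.

```python
# Standard Amazon S3 canned ACLs: the CamelCase names used by
# fs.s3.canned.acl alongside the kebab-case values that the S3 REST API
# (x-amz-acl header) and AWS CLI use for the same policies.
CANNED_ACLS = {
    "AuthenticatedRead": "authenticated-read",
    "BucketOwnerFullControl": "bucket-owner-full-control",
    "BucketOwnerRead": "bucket-owner-read",
    "LogDeliveryWrite": "log-delivery-write",
    "Private": "private",
    "PublicRead": "public-read",
    "PublicReadWrite": "public-read-write",
}
```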

 There are many ways to set the cluster configuration options, depending on the type of cluster you are running. The following procedures show how to set the option for common cases. 

**To write files using canned ACLs in Hive**
+  From the Hive command prompt, set the `fs.s3.canned.acl` configuration option to the canned ACL that you want the cluster to set on files it writes to Amazon S3. To access the Hive command prompt, connect to the primary node using SSH, and type `hive` at the command prompt. For more information, see [Connect to the Amazon EMR cluster primary node using SSH](emr-connect-master-node-ssh.md). 

   The following example sets the `fs.s3.canned.acl` configuration option to `BucketOwnerFullControl`, which gives the owner of the Amazon S3 bucket complete control over the file. Note that the set command is case sensitive and contains no quotation marks or spaces. 

  ```
  hive> set fs.s3.canned.acl=BucketOwnerFullControl;   
  create table acl (n int) location 's3://amzn-s3-demo-bucket/acl/'; 
  insert overwrite table acl select count(*) from acl;
  ```

   The last two lines of the example create a table that is stored in Amazon S3 and write data to the table. 

**To write files using canned ACLs in Pig**
+  From the Pig command prompt, set the `fs.s3.canned.acl` configuration option to the canned ACL that you want the cluster to set on files it writes to Amazon S3. To access the Pig command prompt, connect to the primary node using SSH, and type `pig` at the command prompt. For more information, see [Connect to the Amazon EMR cluster primary node using SSH](emr-connect-master-node-ssh.md). 

   The following example sets the `fs.s3.canned.acl` configuration option to `BucketOwnerFullControl`, which gives the owner of the Amazon S3 bucket complete control over the file. Note that the set command includes one space before the canned ACL name and contains no quotation marks. 

  ```
  pig> set fs.s3.canned.acl BucketOwnerFullControl; 
  store some data into 's3://amzn-s3-demo-bucket/pig/acl';
  ```

**To write files using canned ACLs in a custom JAR**
+  Set the `fs.s3.canned.acl` configuration option using the Hadoop `-D` flag, as shown in the following example. 

  ```
  hadoop jar hadoop-examples.jar wordcount \
  -Dfs.s3.canned.acl=BucketOwnerFullControl s3://amzn-s3-demo-bucket/input s3://amzn-s3-demo-bucket/output
  ```

# Ways to compress the output of your Amazon EMR cluster
<a name="emr-plan-output-compression"></a>

There are different ways to compress output that results from data processing. The compression tools you use depend on properties of your data. Compression can improve performance when you transfer large amounts of data.

## Output data compression
<a name="HadoopOutputDataCompression"></a>

 Output data compression compresses the output of your Hadoop job. If you use `TextOutputFormat`, the result is a gzipped text file. If you write to SequenceFiles, the result is a SequenceFile that is compressed internally. You can enable this by setting the configuration setting `mapred.output.compress` to `true`. 

 If you are running a streaming job, you can enable this by passing the following argument to the streaming job. 

```
-jobconf mapred.output.compress=true
```

 You can also use a bootstrap action to automatically compress all job outputs. Here is how to do that with the Ruby client. 

```
--bootstrap-actions s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-s,mapred.output.compress=true"
```

 Finally, if you are writing a custom JAR, you can enable output compression with the following line when creating your job. 

```
FileOutputFormat.setCompressOutput(conf, true);
```

## Intermediate data compression
<a name="HadoopIntermediateDataCompression"></a>

 If your job shuffles a significant amount of data from the mappers to the reducers, you can see a performance improvement by enabling intermediate compression. The map output is compressed, and it is decompressed when it arrives on the core node. The configuration setting is `mapred.compress.map.output`. You can enable this similarly to output compression. 

 When writing a Custom Jar, use the following command: 

```
conf.setCompressMapOutput(true);
```

## Using the Snappy library with Amazon EMR
<a name="emr-using-snappy"></a>

Snappy is a compression and decompression library that is optimized for speed. It is available on Amazon EMR AMIs version 2.0 and later and is used as the default for intermediate compression. For more information about Snappy, go to [http://code.google.com/p/snappy/](http://code.google.com/p/snappy/). 