

# Upload data to Amazon S3
<a name="emr-plan-upload-s3"></a>

For information on how to upload objects to Amazon S3, see [Add an object to your bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/PuttingAnObjectInABucket.html) in the *Amazon Simple Storage Service User Guide*. For more information about using Amazon S3 with Hadoop, see [http://wiki.apache.org/hadoop/AmazonS3](http://wiki.apache.org/hadoop/AmazonS3). 

**Topics**
+ [Create and configure an Amazon S3 bucket](#create-s3-bucket-input)
+ [Configure multipart upload for Amazon S3](#Config_Multipart)
+ [Best practices](#emr-bucket-bestpractices)
+ [Upload data to Amazon S3 Express One Zone](emr-express-one-zone.md)

## Create and configure an Amazon S3 bucket
<a name="create-s3-bucket-input"></a>

Amazon EMR uses the AWS SDK for Java with Amazon S3 to store input data, log files, and output data. Amazon S3 refers to these storage locations as *buckets*. Buckets have certain restrictions and limitations to conform with Amazon S3 and DNS requirements. For more information, see [Bucket restrictions and limitations](https://docs.aws.amazon.com/AmazonS3/latest/userguide/BucketRestrictions.html) in the *Amazon Simple Storage Service User Guide*.

This section shows you how to use the Amazon S3 console to create an Amazon S3 bucket and then set permissions for it. You can also create and set permissions for a bucket using the Amazon S3 API or the AWS CLI, or use curl with a tool that generates the appropriate Amazon S3 authentication parameters.

See the following resources:
+ To create a bucket using the console, see [Create a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket.html) in the *Amazon S3 User Guide*.
+ To create and work with buckets using the AWS CLI, see [Using high-level S3 commands with the AWS Command Line Interface](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-s3-commands.html) in the *Amazon S3 User Guide*.
+ To create a bucket using an SDK, see [Examples of creating a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-get-location-example.html) in the *Amazon Simple Storage Service User Guide*.
+ To work with buckets using curl, see [Amazon S3 authentication tool for curl](https://aws.amazon.com/code/amazon-s3-authentication-tool-for-curl/).
+ For more information on specifying Region-specific buckets, see [Accessing a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingBucket.html#access-bucket-intro) in the *Amazon Simple Storage Service User Guide*.
+ To work with buckets using Amazon S3 Access Points, see [Using a bucket-style alias for your access point](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points-alias.html) in the *Amazon S3 User Guide*. You can use an Amazon S3 Access Point alias in place of the bucket name in both existing and new applications, including Spark, Hive, Presto, and others.

**Note**  
If you enable logging for a bucket, it enables only bucket access logs, not Amazon EMR cluster logs. 

During bucket creation or after, you can set the appropriate permissions to access the bucket depending on your application. Typically, you give yourself (the owner) read and write access and give authenticated users read access.

Any Amazon S3 buckets that a cluster requires must exist before you create the cluster, and you must upload any scripts or data that the cluster references to Amazon S3. 

## Configure multipart upload for Amazon S3
<a name="Config_Multipart"></a>

Amazon EMR supports Amazon S3 multipart upload through the AWS SDK for Java. Multipart upload lets you upload a single object as a set of parts. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles the parts and creates the object.

For more information, see [Multipart upload overview](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html) in the *Amazon Simple Storage Service User Guide*.

In addition, Amazon EMR offers properties that allow you to more precisely control the clean-up of failed multipart upload parts.

The following table describes the Amazon EMR configuration properties for multipart upload. You can configure these using the `core-site` configuration classification. For more information, see [Configure applications](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/configure-apps.html) in the *Amazon EMR Release Guide*.


| Configuration parameter name | Default value | Description | 
| --- | --- | --- | 
| fs.s3n.multipart.uploads.enabled | true | A Boolean type that indicates whether to enable multipart uploads. When EMRFS consistent view is enabled, multipart uploads are enabled by default and setting this value to false is ignored. | 
| fs.s3n.multipart.uploads.split.size | 134217728 | Specifies the maximum size of a part, in bytes, before EMRFS starts a new part upload when multipart uploads are enabled. The minimum value is `5242880` (5 MB); if a lesser value is specified, `5242880` is used. The maximum is `5368709120` (5 GB); if a greater value is specified, `5368709120` is used. If both EMRFS client-side encryption and the Amazon S3 Optimized Committer are disabled, this value also controls the maximum size that a data file can reach before EMRFS uses a multipart upload rather than a `PutObject` request to upload the file. | 
| fs.s3n.ssl.enabled | true | A Boolean type that indicates whether to use HTTP or HTTPS.  | 
| fs.s3.buckets.create.enabled | false | A Boolean type that indicates whether a bucket should be created if it does not exist. Setting to false causes an exception on CreateBucket operations. | 
| fs.s3.multipart.clean.enabled | false | A Boolean type that indicates whether to enable background periodic clean-up of incomplete multipart uploads. | 
| fs.s3.multipart.clean.age.threshold | 604800 | A long type that specifies the minimum age of a multipart upload, in seconds, before it is considered for cleanup. The default is one week. | 
| fs.s3.multipart.clean.jitter.max | 10000 | An integer type that specifies the maximum amount of random jitter delay, in seconds, added to the 15-minute fixed delay before scheduling the next round of clean-up. | 
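
As an illustration, the cleanup-related properties in the preceding table can be set together with a single `core-site` classification. The threshold and jitter values below are illustrative, not recommendations:

```
[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.s3.multipart.clean.enabled": "true",
      "fs.s3.multipart.clean.age.threshold": "86400",
      "fs.s3.multipart.clean.jitter.max": "5000"
    }
  }
]
```

This example enables background cleanup and lowers the age threshold to one day (86400 seconds).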

### Disable multipart uploads
<a name="emr-dev-multipart-upload"></a>

------
#### [ Console ]

**To disable multipart uploads with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. Under **Software settings**, enter the following configuration: `classification=core-site,properties=[fs.s3n.multipart.uploads.enabled=false]`.

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.

------
#### [ CLI ]

**To disable multipart upload using the AWS CLI**

This procedure explains how to disable multipart upload using the AWS CLI. To disable multipart upload, use the `create-cluster` command with the `--configurations` parameter. 

1. Create a file, `myConfig.json`, with the following contents and save it in the same directory where you run the command:

   ```
   [
     {
       "Classification": "core-site",
       "Properties": {
         "fs.s3n.multipart.uploads.enabled": "false"
       }
     }
   ]
   ```

1. Type the following command and replace *myKey* with the name of your EC2 key pair.
**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

   ```
   aws emr create-cluster --name "Test cluster" \
   --release-label emr-7.12.0 --applications Name=Hive Name=Pig \
   --use-default-roles --ec2-attributes KeyName=myKey --instance-type m5.xlarge \
   --instance-count 3 --configurations file://myConfig.json
   ```

------
#### [ API ]

**To disable multipart upload using the API**
+ For information on using Amazon S3 multipart uploads programmatically, see [Using the AWS SDK for Java for multipart upload](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMPDotJavaAPI.html) in the *Amazon Simple Storage Service User Guide*.

  For more information about the AWS SDK for Java, see [AWS SDK for Java](https://aws.amazon.com/sdkforjava/).

------

## Best practices
<a name="emr-bucket-bestpractices"></a>

The following are recommendations for using Amazon S3 buckets with EMR clusters.

### Enable versioning
<a name="emr-enable-versioning"></a>

Versioning is a recommended configuration for your Amazon S3 bucket. By enabling versioning, you ensure that data can be recovered even if it is unintentionally deleted or overwritten. For more information, see [Using versioning](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html) in the *Amazon Simple Storage Service User Guide*.
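
For example, you can enable versioning with the AWS CLI. The bucket name is a placeholder, and the commands assume that the AWS CLI is installed and configured:

```
# Versioning configuration to apply to the bucket
cat > versioning.json <<'EOF'
{
  "Status": "Enabled"
}
EOF
# Sanity-check the JSON locally:
python3 -m json.tool versioning.json > /dev/null && echo "versioning.json OK"
# Then apply it to your bucket:
# aws s3api put-bucket-versioning \
#   --bucket amzn-s3-demo-bucket \
#   --versioning-configuration file://versioning.json
```

You can also pass the configuration inline as `--versioning-configuration Status=Enabled`.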

### Clean up failed multipart uploads
<a name="emr-multipart-cleanup"></a>

By default, EMR cluster components use multipart uploads via the AWS SDK for Java to write log files and output data to Amazon S3. For information about changing the related properties in Amazon EMR, see [Configure multipart upload for Amazon S3](#Config_Multipart). Sometimes the upload of a large file results in an incomplete Amazon S3 multipart upload. When a multipart upload can't complete successfully, the in-progress upload continues to occupy your bucket and incurs storage charges. We recommend the following options to avoid excessive file storage:
+ For buckets that you use with Amazon EMR, use a lifecycle configuration rule in Amazon S3 to remove incomplete multipart uploads three days after the upload initiation date. Lifecycle configuration rules allow you to control the storage class and lifetime of objects. For more information, see [Object lifecycle management](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html), and [Aborting incomplete multipart uploads using a bucket lifecycle policy](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html#mpu-abort-incomplete-mpu-lifecycle-config).
+ Enable Amazon EMR's multipart cleanup feature by setting `fs.s3.multipart.clean.enabled` to `true` and tuning the other cleanup parameters. This feature is useful at high volume, at large scale, and with clusters that have limited uptime. In those cases, the `DaysAfterInitiation` parameter of a lifecycle configuration rule may be too long, even at its minimum, causing spikes in Amazon S3 storage. Amazon EMR's multipart cleanup allows more precise control. For more information, see [Configure multipart upload for Amazon S3](#Config_Multipart). 
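
As a sketch of the lifecycle-rule option above, the following aborts incomplete multipart uploads three days after initiation. The bucket name is a placeholder, and the commands assume that the AWS CLI is installed and configured:

```
# Lifecycle rule that aborts incomplete multipart uploads after 3 days
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "AbortIncompleteMultipartUploads",
      "Status": "Enabled",
      "Filter": {},
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 3
      }
    }
  ]
}
EOF
# Sanity-check the JSON locally:
python3 -m json.tool lifecycle.json > /dev/null && echo "lifecycle.json OK"
# Then apply the rule to your bucket:
# aws s3api put-bucket-lifecycle-configuration \
#   --bucket amzn-s3-demo-bucket \
#   --lifecycle-configuration file://lifecycle.json
```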

### Manage version markers
<a name="w2aac28c11c17c11b7c11b9"></a>

We recommend that you enable a lifecycle configuration rule in Amazon S3 to remove expired object delete markers for versioned buckets that you use with Amazon EMR. When you delete an object in a versioned bucket, a delete marker is created. If all previous versions of the object subsequently expire, an expired object delete marker is left in the bucket. Although you are not charged for delete markers, removing expired markers can improve the performance of `LIST` requests. For more information, see [Lifecycle configuration for a bucket with versioning](https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-bucket-with-versioning.html) in the *Amazon Simple Storage Service User Guide*.
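
A sketch of such a lifecycle configuration follows. The rule applies to the whole bucket, and the 30-day noncurrent-version expiration is an illustrative value:

```
{
  "Rules": [
    {
      "ID": "RemoveExpiredDeleteMarkers",
      "Status": "Enabled",
      "Filter": {},
      "Expiration": {
        "ExpiredObjectDeleteMarker": true
      },
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 30
      }
    }
  ]
}
```

After all noncurrent versions of an object expire, the expired object delete marker that remains is removed as well.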

### Performance best practices
<a name="w2aac28c11c17c11b7c11c11"></a>

Depending on your workloads, specific types of usage of EMR clusters and applications on those clusters can result in a high number of requests against a bucket. For more information, see [Request rate and performance considerations](https://docs.aws.amazon.com/AmazonS3/latest/userguide/request-rate-perf-considerations.html) in the *Amazon Simple Storage Service User Guide*. 

# Upload data to Amazon S3 Express One Zone
<a name="emr-express-one-zone"></a>

## Overview
<a name="emr-express-one-zone-overview"></a>

With Amazon EMR 6.15.0 and higher, you can use Amazon EMR with Apache Spark in conjunction with the [Amazon S3 Express One Zone](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-express-one-zone.html) storage class for improved performance on your Spark jobs. Amazon EMR releases 7.2.0 and higher also support HBase, Flink, and Hive, so you can also benefit from S3 Express One Zone if you use these applications. *S3 Express One Zone* is an S3 storage class for applications that frequently access data, with hundreds of thousands of requests per second. At the time of its release, S3 Express One Zone delivered the lowest-latency, highest-performance cloud object storage in Amazon S3. 

## Prerequisites
<a name="emr-express-one-zone-prereqs"></a>
+ **S3 Express One Zone permissions** – When S3 Express One Zone initially performs an action like `GET`, `LIST`, or `PUT` on an S3 object, the storage class calls `CreateSession` on your behalf. Your IAM policy must allow the `s3express:CreateSession` permission so that the S3A connector can invoke the `CreateSession` API. For an example policy with this permission, see [Getting started with Amazon S3 Express One Zone](#emr-express-one-zone-start).
+ **S3A connector** – To configure your Spark cluster to access data from an Amazon S3 bucket that uses the S3 Express One Zone storage class, you must use the Apache Hadoop S3A connector. To use the connector, ensure that all of your S3 URIs use the `s3a` scheme. If they use the `s3` or `s3n` scheme instead, you can change the filesystem implementation that those schemes resolve to.

To change the `s3` scheme, specify the following cluster configurations: 

```
[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
      "fs.AbstractFileSystem.s3.impl": "org.apache.hadoop.fs.s3a.S3A"
    }
  }
]
```

To change the `s3n` scheme, specify the following cluster configurations: 

```
[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.s3n.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
      "fs.AbstractFileSystem.s3n.impl": "org.apache.hadoop.fs.s3a.S3A"
    }
  }
]
```

## Getting started with Amazon S3 Express One Zone
<a name="emr-express-one-zone-start"></a>

**Topics**
+ [Create a permission policy](#emr-express-one-zone-permissions)
+ [Create and configure your cluster](#emr-express-one-zone-create)
+ [Configurations overview](#emr-express-one-zone-configs)

### Create a permission policy
<a name="emr-express-one-zone-permissions"></a>

Before you can create a cluster that uses Amazon S3 Express One Zone, you must create an IAM policy to attach to the Amazon EC2 instance profile for the cluster. The policy must have permissions to access the S3 Express One Zone storage class. The following example policy shows how to grant the required permission. After you create the policy, attach the policy to the instance profile role that you use to create your EMR cluster, as described in the [Create and configure your cluster](#emr-express-one-zone-create) section.

------
#### [ JSON ]


```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3express:*:123456789012:bucket/example-s3-bucket"
      ],
      "Action": [
        "s3express:CreateSession"
      ],
      "Sid": "AllowS3ExpressCreateSession"
    }
  ]
}
```

------

### Create and configure your cluster
<a name="emr-express-one-zone-create"></a>

Next, create a cluster that runs Spark, HBase, Flink, or Hive with S3 Express One Zone. The following steps provide a high-level overview of creating a cluster in the AWS Management Console:

1. Navigate to the Amazon EMR console and select **Clusters** from the sidebar. Then choose **Create cluster**.

1. If you use Spark, select Amazon EMR release `emr-6.15.0` or higher. If you use HBase, Flink, or Hive, select `emr-7.2.0` or higher.

1. Select the applications that you want to include on your cluster, such as Spark, HBase, or Flink.

1. To enable Amazon S3 Express One Zone, enter a configuration similar to the following example in the **Software settings** section. The configurations and recommended values are described in the [Configurations overview](#emr-express-one-zone-configs) section that follows this procedure.

   ```
   [
     {
       "Classification": "core-site",
       "Properties": {
         "fs.s3a.aws.credentials.provider": "software.amazon.awssdk.auth.credentials.InstanceProfileCredentialsProvider",
         "fs.s3a.change.detection.mode": "none",
         "fs.s3a.endpoint.region": "aa-example-1",
         "fs.s3a.select.enabled": "false"
       }
     },
     {
       "Classification": "spark-defaults",
       "Properties": {
         "spark.sql.sources.fastS3PartitionDiscovery.enabled": "false"
       }
     }
   ]
   ```

1. In the **EC2 instance profile for Amazon EMR** section, choose to use an existing role, and select a role that has the policy from the [Create a permission policy](#emr-express-one-zone-permissions) section attached.

1. Configure the rest of your cluster settings as appropriate for your application, and then select **Create cluster**.
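
The console steps above can also be scripted with the AWS CLI, following the same pattern as the earlier `create-cluster` example. The release label, applications, instance settings, and key pair name are placeholders to adjust for your workload:

```
# Save the S3 Express One Zone configurations to a file
cat > s3express-config.json <<'EOF'
[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.s3a.aws.credentials.provider": "software.amazon.awssdk.auth.credentials.InstanceProfileCredentialsProvider",
      "fs.s3a.change.detection.mode": "none",
      "fs.s3a.endpoint.region": "aa-example-1",
      "fs.s3a.select.enabled": "false"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.sql.sources.fastS3PartitionDiscovery.enabled": "false"
    }
  }
]
EOF
# Sanity-check the JSON locally:
python3 -m json.tool s3express-config.json > /dev/null && echo "s3express-config.json OK"
# Then create the cluster:
# aws emr create-cluster --name "S3 Express One Zone cluster" \
#   --release-label emr-7.2.0 --applications Name=Spark \
#   --use-default-roles --ec2-attributes KeyName=myKey \
#   --instance-type m5.xlarge --instance-count 3 \
#   --configurations file://s3express-config.json
```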

### Configurations overview
<a name="emr-express-one-zone-configs"></a>

The following tables describe the configurations and suggested values that you should specify when you set up a cluster that uses S3 Express One Zone with Amazon EMR, as described in the [Create and configure your cluster](#emr-express-one-zone-create) section.

**S3A configurations**


| Parameter | Default value | Suggested value | Explanation | 
| --- | --- | --- | --- | 
|  `fs.s3a.aws.credentials.provider`  |  If not specified, uses `AWSCredentialProviderList` in the following order: `TemporaryAWSCredentialsProvider`, `SimpleAWSCredentialsProvider`, `EnvironmentVariableCredentialsProvider`, `IAMInstanceCredentialsProvider`.  |  <pre>software.amazon.awssdk.auth.credentials.InstanceProfileCredentialsProvider</pre>  |  The Amazon EMR instance profile role should have a policy that allows the S3A filesystem to call `s3express:CreateSession`. Other credential providers also work if they have the S3 Express One Zone permissions.  | 
|  `fs.s3a.endpoint.region`  |  `null`  |  The AWS Region where you created the bucket.  |  Region resolution logic doesn't work with the S3 Express One Zone storage class.  | 
|  `fs.s3a.select.enabled`  |  `true`  |  `false`  |  Amazon S3 Select is not supported with the S3 Express One Zone storage class.  | 
|  `fs.s3a.change.detection.mode`  |  `server`  |  `none`  |  S3A change detection checks MD5-based ETags, and the S3 Express One Zone storage class doesn't support MD5 checksums.  | 

**Spark configurations**


| Parameter | Default value | Suggested value | Explanation | 
| --- | --- | --- | --- | 
|  `spark.sql.sources.fastS3PartitionDiscovery.enabled`  |  `true`  |  `false`  |  The internal optimization uses an S3 API parameter that the S3 Express One Zone storage class doesn't support.  | 

**Hive configurations**


| Parameter | Default value | Suggested value | Explanation | 
| --- | --- | --- | --- | 
|  `hive.exec.fast.s3.partition.discovery.enabled`  |  `true`  |  `false`  |  The internal optimization uses an S3 API parameter that the S3 Express One Zone storage class doesn't support.  | 

## Considerations
<a name="emr-express-one-zone-considerations"></a>

Consider the following when you integrate Apache Spark on Amazon EMR with the S3 Express One Zone storage class:
+ The S3A connector is required to use S3 Express One Zone with Amazon EMR. Only the S3A connector implements the features required to interact with the S3 Express One Zone storage class. For steps to set up the connector, see [Prerequisites](#emr-express-one-zone-prereqs).
+ The Amazon S3 Express One Zone storage class supports SSE-S3 and SSE-KMS encryption. For more information, see [Server-side encryption with Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-express-data-protection.html#s3-express-ecnryption).
+ The Amazon S3 Express One Zone storage class does not support writes with the S3A `FileOutputCommitter`. Writes with the S3A `FileOutputCommitter` on S3 Express One Zone buckets result in an error: *InvalidStorageClass: The storage class you specified is not valid*.
+ Amazon S3 Express One Zone is supported with Amazon EMR releases 6.15.0 and higher on EMR on EC2. Additionally, it's supported on Amazon EMR releases 7.2.0 and higher on Amazon EMR on EKS and on Amazon EMR Serverless.