# S3A file system
<a name="emr-s3a-file"></a>

This section covers commit protocols for Spark running on Amazon EMR when using the S3A filesystem.

# S3A MagicV2 Committer
<a name="s3a-magicv2-committer"></a>

With the EMR-6.15.0 release, Amazon EMR introduces a new S3A committer type known as the MagicV2 committer.

The MagicV2 Committer represents an enhanced implementation of the open-source [MagicCommitter](https://javadoc.io/static/org.apache.hadoop/hadoop-aws/3.4.0/org/apache/hadoop/fs/s3a/commit/magic/MagicS3GuardCommitter.html), specifically designed to optimize file writing to Amazon S3 through the S3A filesystem. Like its predecessor, it leverages Amazon S3's multipart upload capabilities to eliminate the traditional list and rename operations typically associated with job and task commit phases.

Compared to the original MagicCommitter, the MagicV2 committer demonstrates superior performance by writing files to the job's output location during the task commit phase, rather than the job commit phase. This approach enables distributed file writing and eliminates the need for temporary commit metadata storage on Amazon S3, resulting in improved cost-effectiveness. Furthermore, the MagicV2 committer provides enhanced flexibility by allowing file path overwrites across multiple threads during the commit process.

## Enable the MagicV2 Committer
<a name="s3a-magicv2-committer-enable"></a>

To enable the MagicV2 committer, pass the following configuration in your job configuration, or set the properties through the core-site configuration. For more information, see [Configure applications](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html).

```
mapreduce.outputcommitter.factory.scheme.s3a=org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
fs.s3a.committer.magic.enabled=true
fs.s3a.committer.name=magicv2
fs.s3a.committer.magic.track.commits.in.memory.enabled=true
```
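If you set these properties cluster-wide through the core-site configuration, the configuration JSON described in [Configure applications](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html) might look like the following sketch:

```json
[
  {
    "Classification": "core-site",
    "Properties": {
      "mapreduce.outputcommitter.factory.scheme.s3a": "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
      "fs.s3a.committer.magic.enabled": "true",
      "fs.s3a.committer.name": "magicv2",
      "fs.s3a.committer.magic.track.commits.in.memory.enabled": "true"
    }
  }
]
```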

For workloads that must overwrite an existing directory before committing or writing new files, set the following additional configuration along with the configuration above.

```
fs.s3a.committer.magic.overwrite.and.commit=true
fs.s3a.committer.magic.delete.directory.threads=number of threads
```

The default value for `fs.s3a.committer.magic.delete.directory.threads` is `20`. Tune this parameter for better performance when a large number of directories must be overwritten. This overwrite capability is available only in EMR-7.2.0 and later.

## Considerations
<a name="considerations"></a>
+ If the Java Virtual Machine (JVM) crashes or is killed while tasks are running and writing data to Amazon S3, incomplete multipart uploads are more likely to be left behind. For this reason, when you use the MagicV2 committer, be sure to follow the best practices for managing failed multipart uploads. For more information, see the [Best practices for working with Amazon S3 buckets](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-upload-s3.html#emr-bucket-bestpractices) section in the Amazon EMR Management Guide.
+ If a job fails, any files committed by the successful tasks will still be visible in the destination path. In such cases, the user will need to manually clean up the committed files before re-running the job on the same destination path.
+ The MagicV2 committer consumes a small amount of memory for each file written by a task attempt until the task is committed or aborted. In most jobs, the amount of memory consumed is negligible. However, when a single executor process handles a large number of tasks concurrently, this can create significant memory pressure, and the container or executor might run out of memory (OOM). Increasing the container or executor memory should resolve this issue.
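For Spark workloads, increasing executor memory typically means raising the following settings in `spark-defaults` or the job submission. The values below are illustrative placeholders, not recommendations; tune them for your workload:

```
spark.executor.memory          8g
spark.executor.memoryOverhead  1g
```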

# Migration Guide: EMRFS to S3A Filesystem
<a name="emr-s3a-migrate"></a>

Starting with the EMR-7.10.0 release, the S3A filesystem is the default Amazon S3 connector for EMR clusters for all S3 file schemes, including the following:
+ **s3://**
+ **s3n://**
+ **s3a://**

This change applies across all EMR deployments, including Amazon EMR on EC2, Amazon EMR on EKS, and EMR Serverless.

If you want to continue using EMRFS, add the following properties to the `core-site.xml` configuration file:

```
<property>
  <name>fs.s3.impl</name>
  <value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>
<property>
  <name>fs.s3n.impl</name>
  <value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>
```

## Migration of Existing EMRFS Configurations to S3A Configurations
<a name="emr-s3a-migration-of-existing-emrfs-configurations"></a>

**Note**  
Amazon EMR automatically maps EMRFS configurations to their S3A equivalents when an S3A configuration is undefined but the corresponding EMRFS configuration is present. This automatic mapping also extends to bucket-level configurations. For example, when you set the bucket-specific EMRFS encryption setting `fs.s3.bucket.amzn-s3-demo-bucket1.serverSideEncryption.kms.keyId` to `XYZ`, the system automatically maps it to the equivalent S3A configuration by setting `fs.s3a.encryption.key` to `XYZ` for the bucket amzn-s3-demo-bucket1.

The following predefined set of EMRFS configurations is automatically translated to its corresponding S3A configuration equivalents. Any configurations currently applied through cluster or job overrides transition to the S3A filesystem without requiring additional manual changes.

By default, this configuration mapping feature is automatically activated. Users who wish to disable this automatic translation can do so by adding the following property to the core-site.xml configuration file.

```
<property>
  <name>fs.s3a.emrfs.compatibility.enable</name>
  <value>false</value>
</property>
```

**Note**  
The encryption key mapping from EMRFS (`fs.s3.serverSideEncryption.kms.keyId` or `fs.s3.cse.kms.keyId`) to S3A (`fs.s3a.encryption.key`) occurs only when either SSE-KMS or CSE-KMS encryption is enabled on either file system.


**EMRFS to S3A Configuration Mapping**  

| EMRFS Configuration Name | S3A Configuration Name | 
| --- | --- | 
| fs.s3.aimd.adjustWindow | fs.s3a.aimd.adjustWindow | 
| fs.s3.aimd.enabled | fs.s3a.aimd.enabled | 
| fs.s3.aimd.increaseIncrement | fs.s3a.aimd.increaseIncrement | 
| fs.s3.aimd.initialRate | fs.s3a.aimd.initialRate | 
| fs.s3.aimd.maxAttempts | fs.s3a.aimd.maxAttempts | 
| fs.s3.aimd.minRate | fs.s3a.aimd.minRate | 
| fs.s3.aimd.reductionFactor | fs.s3a.aimd.reductionFactor | 
| fs.s3.sts.endpoint | fs.s3a.assumed.role.sts.endpoint | 
| fs.s3.sts.sessionDurationSeconds | fs.s3a.assumed.role.session.duration | 
| fs.s3.authorization.roleMapping | fs.s3a.authorization.roleMapping | 
| fs.s3.authorization.ugi.groupName.enabled | fs.s3a.authorization.ugi.groupName.enabled | 
| fs.s3.credentialsResolverClass | fs.s3a.credentials.resolver | 
| fs.s3n.multipart.uploads.enabled | fs.s3a.multipart.uploads.enabled | 
| fs.s3n.multipart.uploads.split.size | fs.s3a.multipart.size | 
| fs.s3.serverSideEncryption.kms.customEncryptionContext | fs.s3a.encryption.context | 
| fs.s3.enableServerSideEncryption | fs.s3a.encryption.algorithm | 
| fs.s3.serverSideEncryption.kms.keyId / fs.s3.cse.kms.keyId | fs.s3a.encryption.key | 
| fs.s3.cse.kms.region | fs.s3a.encryption.cse.kms.region | 
| fs.s3.authorization.audit.enabled | fs.s3a.authorization.audit.enabled | 
| fs.s3.buckets.create.enabled | fs.s3a.bucket.probe | 
| fs.s3.delete.maxBatchSize | fs.s3a.bulk.delete.page.size | 
| fs.s3.filestatus.metadata.enabled | fs.s3a.metadata.cache.enabled | 
| fs.s3.maxConnections | fs.s3a.connection.maximum | 
| fs.s3.maxRetries | fs.s3a.retry.limit | 
| fs.s3.metadata.cache.expiration.seconds | fs.s3a.metadata.cache.expiration.seconds | 
| fs.s3.buffer.dir | fs.s3a.buffer.dir | 
| fs.s3.canned.acl | fs.s3a.acl.default | 
| fs.s3.positionedRead.optimization.enabled | fs.s3a.positionedRead.optimization.enabled | 
| fs.s3.readFullyIntoBuffers.optimization.enabled | fs.s3a.readFullyIntoBuffers.optimization.enabled | 
| fs.s3.signerType | fs.s3a.signing-algorithm | 
| fs.s3.storageClass | fs.s3a.create.storage.class | 
| fs.s3.threadpool.maxSize | fs.s3a.threads.max | 
| fs.s3.useRequesterPaysHeader | fs.s3a.requester.pays.enabled | 
| fs.s3n.block.size | fs.s3a.block.size | 
| fs.s3n.endpoint | fs.s3a.endpoint | 
| fs.s3n.ssl.enabled | fs.s3a.connection.ssl.enabled | 
| fs.s3.open.acceptsFileStatus | fs.s3a.open.acceptsFileStatus | 
| fs.s3.connection.maxIdleMilliSeconds | fs.s3a.connection.idle.time | 
| fs.s3.s3AccessGrants.enabled | fs.s3a.access.grants.enabled | 
| fs.s3.s3AccessGrants.fallbackToIAM | fs.s3a.access.grants.fallback.to.iam | 
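The translation rules described above can be sketched in a few lines of Python. This is not Amazon EMR's actual implementation; it is a simplified illustration (using a handful of entries from the table) of the two rules: a mapped S3A key is filled in only when it is undefined and the EMRFS key is present, and bucket-level EMRFS keys map to bucket-level S3A keys.

```python
# A few entries from the EMRFS-to-S3A mapping table above.
EMRFS_TO_S3A = {
    "fs.s3.maxConnections": "fs.s3a.connection.maximum",
    "fs.s3.maxRetries": "fs.s3a.retry.limit",
    "fs.s3.threadpool.maxSize": "fs.s3a.threads.max",
    "fs.s3.serverSideEncryption.kms.keyId": "fs.s3a.encryption.key",
}

BUCKET_PREFIX = "fs.s3.bucket."


def map_emrfs_to_s3a(conf: dict) -> dict:
    """Return a copy of conf with S3A equivalents filled in where missing."""
    out = dict(conf)
    for key, value in conf.items():
        if key.startswith(BUCKET_PREFIX):
            # fs.s3.bucket.<name>.<emrfs-suffix> -> fs.s3a.bucket.<name>.<s3a-suffix>
            bucket, _, suffix = key[len(BUCKET_PREFIX):].partition(".")
            s3a_key = EMRFS_TO_S3A.get("fs.s3." + suffix)
            if s3a_key is None:
                continue
            target = "fs.s3a.bucket.%s.%s" % (bucket, s3a_key[len("fs.s3a."):])
        else:
            target = EMRFS_TO_S3A.get(key)
        # Only fill in the S3A key when it is not already defined.
        if target and target not in out:
            out[target] = value
    return out


conf = {
    "fs.s3.maxConnections": "200",
    "fs.s3.bucket.amzn-s3-demo-bucket1.serverSideEncryption.kms.keyId": "XYZ",
}
mapped = map_emrfs_to_s3a(conf)
print(mapped["fs.s3a.connection.maximum"])                          # 200
print(mapped["fs.s3a.bucket.amzn-s3-demo-bucket1.encryption.key"])  # XYZ
```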

### Considerations and Limitations
<a name="emr-s3a-migration-considerations-and-limitations"></a>
+ All EMR engines (Spark, MapReduce, Flink, Tez, Hive, and so on) use S3A as the default S3 connector, except for the Trino and Presto engines.
+ EMR S3A does not support integration with EMR Ranger. Consider migrating to AWS Lake Formation.
+ AWS Lake Formation with RecordServer is not supported for EMR Spark with S3A. Consider using Spark native fine-grained access control (FGAC).
+ Amazon S3 Select is not supported.
+ The option to periodically clean up incomplete multipart uploads (MPUs) is not available with S3A. Consider configuring an [S3 bucket lifecycle policy](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) to clean up dangling MPUs.
+ To migrate from EMRFS to S3A while using S3 CSE-CUSTOM encryption, the custom key provider must be rewritten from the [EMRFSRSAEncryptionMaterialsProvider](https://github.com/awslabs/emr-sample-apps/tree/master/emrfs-plugins/EMRFSRSAEncryptionMaterialsProvider) interface to the [Keyring interface](https://docs.aws.amazon.com/encryption-sdk/latest/developer-guide/choose-keyring.html). For more information, see setting up S3A [CSE-CUSTOM](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-s3a-cse-custom.html).
+ Amazon S3 directories created using EMRFS are marked with a `_$folder$` suffix, while directories created using the S3A file system end with a `/` suffix, which is consistent with directories created through the Amazon S3 console.
+ To use a custom S3 credential provider, set the S3A configuration property `fs.s3a.aws.credentials.provider` with the same credential provider class that was previously used in the EMRFS configuration `fs.s3.customAWSCredentialsProvider`.
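For example, a `core-site.xml` entry for a custom credential provider might look like the following. The class name `com.example.MyCredentialsProvider` is a hypothetical placeholder; substitute the provider class you previously set in `fs.s3.customAWSCredentialsProvider`:

```
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>com.example.MyCredentialsProvider</value>
</property>
```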

### Unsupported EMRFS Configurations
<a name="emr-s3a-migration-unsupported"></a>

The following EMRFS configurations are unsupported or obsolete, so no direct mapping is provided to S3A configuration counterparts. These configurations are not automatically translated or carried over during the migration to the S3A filesystem.


**Unsupported EMRFS Configurations and Reasons**  
<a name="unsupported-emrfs-configs"></a>[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-s3a-migrate.html)