

# Using Amazon EMR on EKS with AWS Lake Formation for fine-grained access control
<a name="security_iam_fgac-lf"></a>

With Amazon EMR releases 7.7 and higher, you can use AWS Lake Formation to apply fine-grained access controls on AWS Glue Data Catalog tables that are backed by Amazon S3 buckets. This capability lets you configure table, row, column, and cell-level access controls for read queries within your Amazon EMR on EKS Spark jobs.

**Topics**
+ [How Amazon EMR on EKS works with AWS Lake Formation](security_iam_fgac-lf-works.md)
+ [Enable Lake Formation with Amazon EMR on EKS](security_iam_fgac-lf-enable.md)
+ [Considerations and limitations](security_iam_fgac-considerations.md)
+ [Troubleshooting](security_iam_fgac-troubleshooting.md)

# How Amazon EMR on EKS works with AWS Lake Formation
<a name="security_iam_fgac-lf-works"></a>

Using Amazon EMR on EKS with Lake Formation enforces Lake Formation permission controls on each Spark job that Amazon EMR on EKS runs. Amazon EMR on EKS uses [Spark resource profiles](https://spark.apache.org/docs/latest/api/java/org/apache/spark/resource/ResourceProfile.html) to split job execution into two profiles. The user profile runs user-supplied code, while the system profile enforces Lake Formation policies. Each Lake Formation-enabled job uses two Spark drivers, one for the user profile and another for the system profile. For more information, see [What is AWS Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html).

The following is a high-level overview of how Amazon EMR on EKS gets access to data protected by Lake Formation security policies.

![\[Job security by means of Lake Formation\]](http://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/images/fgac_diagram_eks_spark.png)


The following steps describe this process:

1. A user submits a Spark job to an AWS Lake Formation-enabled Amazon EMR on EKS virtual cluster.

1. The Amazon EMR on EKS service sets up the user driver and runs the job in the user profile. The user driver runs a lean version of Spark that can't launch tasks, request executors, or access Amazon S3 or the AWS Glue Data Catalog. It only builds a job plan.

1. The Amazon EMR on EKS service sets up a second driver, called the system driver, and runs it in the system profile with a privileged identity. Amazon EKS sets up an encrypted TLS channel between the two drivers for communication. The user driver uses the channel to send the job plan to the system driver. The system driver doesn't run user-submitted code. It runs full Spark and communicates with Amazon S3 and the Data Catalog for data access. It requests executors and compiles the job plan into a sequence of execution stages.

1. The Amazon EMR on EKS service then runs the stages on executors. User code in any stage runs exclusively on user profile executors.

1. Stages that read data from Data Catalog tables protected by Lake Formation, or that apply security filters, are delegated to system executors.

# Enable Lake Formation with Amazon EMR on EKS
<a name="security_iam_fgac-lf-enable"></a>

With Amazon EMR releases 7.7 and higher, you can use AWS Lake Formation to apply fine-grained access controls on Data Catalog tables that are backed by Amazon S3. This capability lets you configure table, row, column, and cell-level access controls for read queries within your Amazon EMR on EKS Spark jobs.

This section covers how to create a security configuration and set up Lake Formation to work with Amazon EMR. It also describes how to create a virtual cluster with the security configuration that you created for Lake Formation. Complete these steps in sequence.

## Step 1: Set up Lake Formation-based column, row, or cell-level permissions
<a name="security_iam_fgac-lf-enable-permissions"></a>

First, to apply row- and column-level permissions with Lake Formation, the data lake administrator for Lake Formation must set the **LakeFormationAuthorizedCaller** session tag. Lake Formation uses this session tag to authorize callers and provide access to the data lake.

Navigate to the AWS Lake Formation console and select **Application integration settings** from the **Administration** section in the sidebar. Then, select **Allow external engines to filter data in Amazon S3 locations registered with Lake Formation**. Add the **AWS account IDs** where the Spark jobs will run, and the **Session tag values**.

![\[Application integration settings\]](http://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/images/application_integration_settings_fgac.png)


Note that the **LakeFormationAuthorizedCaller** session tag value that you set here is the value that you pass later in the IAM role policies (Step 3) and in the security configuration (Step 4).
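If you prefer to script this configuration instead of using the console, the same settings can be applied with the AWS CLI through the `put-data-lake-settings` operation. The following is a sketch; the account ID and session tag value are placeholders. Because `put-data-lake-settings` replaces all data lake settings, retrieve and edit the existing settings rather than writing a minimal document.

```shell
# Fetch the current settings so existing values (such as data lake admins)
# are preserved.
aws lakeformation get-data-lake-settings --query DataLakeSettings > settings.json

# Edit settings.json to add the external filtering settings, for example:
#   "AllowExternalDataFiltering": true,
#   "ExternalDataFilteringAllowList": [{"DataLakePrincipalIdentifier": "111122223333"}],
#   "AuthorizedSessionTagValueList": ["EMR on EKS Engine"]

# Write the updated settings back.
aws lakeformation put-data-lake-settings --data-lake-settings file://settings.json
```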

## Step 2: Set up EKS RBAC permissions
<a name="security_iam_fgac-lf-enable-rbac"></a>

Second, you set up permissions for role-based access control.

### Provide EKS Cluster Permissions to the Amazon EMR on EKS service
<a name="security_iam_fgac-lf-enable-rbac-cluster"></a>

The Amazon EMR on EKS service must have EKS cluster role permissions so that it can create the cross-namespace permissions that the system driver needs to launch user executors in the user namespace.

**Create Cluster Role**

The following `ClusterRole` grants the Amazon EMR on EKS service the permissions it needs on the Kubernetes resources that it manages.

```
vim emr-containers-cluster-role.yaml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: emr-containers
rules:
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["serviceaccounts", "services", "configmaps", "events", "pods", "pods/log"]
    verbs: ["get", "list", "watch", "describe", "create", "edit", "delete", "deletecollection", "annotate", "patch", "label"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["create", "patch", "delete", "watch"]
  - apiGroups: ["apps"]
    resources: ["statefulsets", "deployments"]
    verbs: ["get", "list", "watch", "describe", "create", "edit", "delete", "annotate", "patch", "label"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "describe", "create", "edit", "delete", "annotate", "patch", "label"]
  - apiGroups: ["extensions", "networking.k8s.io"]
    resources: ["ingresses"]
    verbs: ["get", "list", "watch", "describe", "create", "edit", "delete", "annotate", "patch", "label"]
  - apiGroups: ["rbac.authorization.k8s.io"]
    resources: ["clusterroles","clusterrolebindings","roles", "rolebindings"]
    verbs: ["get", "list", "watch", "describe", "create", "edit", "delete", "deletecollection", "annotate", "patch", "label"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "describe", "create", "edit", "delete",  "deletecollection", "annotate", "patch", "label"]
  - apiGroups: ["kyverno.io"]
    resources: ["clusterpolicies"]
    verbs: ["create", "delete"]
---
```

```
kubectl apply -f emr-containers-cluster-role.yaml
```

**Create Cluster Role Bindings**

```
vim emr-containers-cluster-role-binding.yaml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: emr-containers
subjects:
- kind: User
  name: emr-containers
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: emr-containers
  apiGroup: rbac.authorization.k8s.io
---
```

```
kubectl apply -f emr-containers-cluster-role-binding.yaml
```

### Provide Namespace access to the Amazon EMR on EKS service
<a name="security_iam_fgac-lf-enable-rbac-namespace"></a>

Create two Kubernetes namespaces, one for the user driver and executors, and another for the system driver and executors, and allow the Amazon EMR on EKS service to submit jobs in both namespaces. Follow the existing guide to provide access for each namespace, which is available at [Enable cluster access using `aws-auth`](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/setting-up-cluster-access.html#setting-up-cluster-access-aws-auth).
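For example, assuming illustrative namespace and cluster names, you can create the two namespaces with `kubectl` and apply the access mapping from the linked guide with `eksctl`:

```shell
# Create one namespace for the user profile and one for the system profile.
kubectl create namespace lf-user-ns
kubectl create namespace lf-system-ns

# Allow the Amazon EMR on EKS service to manage resources in each namespace.
eksctl create iamidentitymapping \
    --cluster my-eks-cluster \
    --namespace lf-user-ns \
    --service-name "emr-containers"
eksctl create iamidentitymapping \
    --cluster my-eks-cluster \
    --namespace lf-system-ns \
    --service-name "emr-containers"
```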

## Step 3: Set up IAM roles for user and system profile components
<a name="security_iam_fgac-lf-system-profile-configure"></a>

Third, you set up roles for specific components. A Lake Formation-enabled Spark job has two components: user and system. The user driver and executors run in the user namespace and are tied to the **JobExecutionRole** that is passed in the `StartJobRun` API. The system driver and executors run in the system namespace and are tied to the **QueryEngine** role.

### Configure Query Engine role
<a name="security_iam_fgac-lf-system-profile-configure-query"></a>

The **QueryEngine** role is tied to the system space components and must have permission to assume the **JobExecutionRole** with the **LakeFormationAuthorizedCaller** session tag. The IAM permissions policy of the query engine role is the following:

------
#### [ JSON ]


```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AssumeJobRoleWithSessionTagAccessForSystemDriver",
      "Effect": "Allow",
      "Action": [
        "sts:AssumeRole",
        "sts:TagSession"
      ],
      "Resource": [
        "arn:aws:iam::*:role/JobExecutionRole"
      ],
      "Condition": {
        "StringLike": {
          "aws:RequestTag/LakeFormationAuthorizedCaller": "EMR on EKS Engine"
        }
      }
    },
    {
      "Sid": "AssumeJobRoleWithSessionTagAccessForSystemExecutor",
      "Effect": "Allow",
      "Action": [
        "sts:AssumeRole"
      ],
      "Resource": [
        "arn:aws:iam::*:role/JobExecutionRole"
      ]
    },
    {
      "Sid": "CreateCertificateAccessForTLS",
      "Effect": "Allow",
      "Action": [
        "emr-containers:CreateCertificate"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
```

------

Configure the trust policy of the query engine role to trust the Kubernetes system namespace.

```
aws emr-containers update-role-trust-policy \
    --cluster-name eks-cluster-name \
    --namespace system-namespace \
    --role-name query-engine-role-name
```

For more information, see [Updating the role trust policy](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/setting-up-trust-policy.html).

### Configure the Job Execution Role
<a name="security_iam_fgac-lf-system-profile-job"></a>

Lake Formation permissions control access to AWS Glue Data Catalog resources, Amazon S3 locations, and the underlying data at those locations. IAM permissions control access to the Lake Formation and AWS Glue APIs and resources. Although you might have the Lake Formation permission to access a table in the Data Catalog (SELECT), your operation fails if you don’t have the IAM permission on the `glue:Get*` API operations.

The **JobExecutionRole** must include the following statements in its IAM permissions policy.

------
#### [ JSON ]


```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GlueCatalogAccess",
      "Effect": "Allow",
      "Action": [
        "glue:Get*",
        "glue:Create*",
        "glue:Update*"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Sid": "LakeFormationAccess",
      "Effect": "Allow",
      "Action": [
        "lakeformation:GetDataAccess"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Sid": "CreateCertificateAccessForTLS",
      "Effect": "Allow",
      "Action": [
        "emr-containers:CreateCertificate"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
```

------

IAM Trust Policy for **JobExecutionRole**:

------
#### [ JSON ]


```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TrustQueryEngineRoleForSystemDriver",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::account-id:role/QueryEngineRole"
      },
      "Action": [
        "sts:AssumeRole",
        "sts:TagSession"
      ],
      "Condition": {
        "StringLike": {
          "aws:RequestTag/LakeFormationAuthorizedCaller": "EMR on EKS Engine"
        }
      }
    },
    {
      "Sid": "TrustQueryEngineRoleForSystemExecutor",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::account-id:role/QueryEngineRole"
      },
      "Action": [
        "sts:AssumeRole"
      ]
    }
  ]
}
```

------

Configure the trust policy of the job execution role to trust the Kubernetes user namespace:

```
aws emr-containers update-role-trust-policy \
    --cluster-name eks-cluster-name \
    --namespace user-namespace \
    --role-name job-execution-role-name
```

For more information, see [Update the trust policy of the job execution role](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/setting-up-trust-policy.html).

## Step 4: Set up the security configuration
<a name="security_iam_fgac-lf-security-config"></a>

To run a Lake Formation-enabled job, you must create a security configuration.

```
aws emr-containers create-security-configuration \
    --name 'security-configuration-name' \
    --security-configuration '{
        "authorizationConfiguration": {
            "lakeFormationConfiguration": {
                "authorizedSessionTagValue": "SessionTag configured in LakeFormation",
                "secureNamespaceInfo": {
                    "clusterId": "eks-cluster-name",
                    "namespace": "system-namespace-name"
                },
                "queryEngineRoleArn": "query-engine-IAM-role-ARN"
            }
        }
    }'
```

Ensure that the session tag that you pass in the **authorizedSessionTagValue** field is one that Lake Formation authorizes. Set the value to the one configured in Lake Formation in [Step 1: Set up Lake Formation-based column, row, or cell-level permissions](#security_iam_fgac-lf-enable-permissions).

## Step 5: Create a virtual cluster
<a name="security_iam_fgac-lf-virtual-cluster"></a>

Create an Amazon EMR on EKS virtual cluster with the security configuration.

```
aws emr-containers create-virtual-cluster \
--name my-lf-enabled-vc \
--container-provider '{
    "id": "eks-cluster",
    "type": "EKS",
    "info": {
        "eksInfo": {
            "namespace": "user-namespace"
        }
    }
}' \
--security-configuration-id security-configuration-id
```

Ensure that you pass the security configuration ID from the previous step, so that the Lake Formation authorization configuration is applied to all jobs running on the virtual cluster. For more information, see [Register the Amazon EKS cluster with Amazon EMR]().

## Step 6: Submit a job in the FGAC-enabled virtual cluster
<a name="security_iam_fgac-enabled-cluster"></a>

The job submission process is the same for Lake Formation and non-Lake Formation jobs. For more information, see [Submit a job run with `StartJobRun`](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks-jobs-submit.html).

The Spark driver, executor, and event logs of the system driver are stored in an AWS service account's Amazon S3 bucket for debugging. We recommend configuring a customer managed KMS key in the job run to encrypt all logs stored in the AWS service bucket. For more information about enabling log encryption, see [Encrypting Amazon EMR on EKS logs](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/security_iam_fgac-logging-kms.html).
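For illustration, a Lake Formation-enabled job is submitted exactly like any other job run. The following sketch uses placeholder values for the virtual cluster ID, execution role ARN, release label, and Amazon S3 paths:

```shell
aws emr-containers start-job-run \
    --virtual-cluster-id virtual-cluster-id \
    --name lf-enabled-job \
    --execution-role-arn arn:aws:iam::111122223333:role/JobExecutionRole \
    --release-label emr-7.7.0-latest \
    --job-driver '{
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://amzn-s3-demo-bucket/scripts/my-script.py"
        }
    }' \
    --configuration-overrides '{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://amzn-s3-demo-bucket/logs/"
            }
        }
    }'
```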

# Considerations and limitations
<a name="security_iam_fgac-considerations"></a>

Note the following considerations and limitations when you use Lake Formation with Amazon EMR on EKS:
+ Amazon EMR on EKS supports fine-grained access control through Lake Formation only for the Apache Hive, Apache Iceberg, Apache Hudi, and Delta Lake table formats. Apache Hive formats include Parquet, ORC, and xSV.
+ `DynamicResourceAllocation` is enabled by default, and you can't turn it off for Lake Formation jobs. Because the default value of the `spark.dynamicAllocation.maxExecutors` configuration is infinity, configure an appropriate value based on your workload.
+ By default, `spark.dynamicAllocation.preallocateExecutors` is enabled in Amazon EMR Spark, which can cause excessive container churn when `spark.dynamicAllocation.initialExecutors` and `spark.dynamicAllocation.minExecutors` are not set. For recommended configurations to manage executor preallocation, see the [Performance](best-practices.md#performance) section in the [Amazon EMR on EKS best practices guides](best-practices.md).
+ Lake Formation-enabled jobs don't support the use of custom EMR on EKS images for the system driver and system executors.
+ You can only use Lake Formation with Spark jobs.
+ EMR on EKS with Lake Formation only supports a single Spark session throughout a job.
+ EMR on EKS with Lake Formation only supports cross-account table queries shared through resource links.
+ The following aren't supported:
  + Resilient distributed datasets (RDD)
  + Spark streaming
  + Write with Lake Formation granted permissions
  + Access control for nested columns
+ EMR on EKS blocks functionalities that might undermine the complete isolation of system driver, including the following:
  + UDTs, HiveUDFs, and any user-defined function that involves custom classes
  + Custom data sources
  + Supply of additional JARs for Spark extensions, connectors, or metastores
  + `ANALYZE TABLE` command
+ To enforce access controls, `EXPLAIN PLAN` and DDL operations such as `DESCRIBE TABLE` don't expose restricted information.
+ Amazon EMR on EKS restricts access to system driver Spark logs on Lake Formation-enabled jobs. Because the system driver runs with elevated access, events and logs that the system driver generates can include sensitive information. To prevent unauthorized users or code from accessing this sensitive data, EMR on EKS disables access to system driver logs. For troubleshooting, contact AWS Support.
+ If you registered a table location with Lake Formation, data access goes through the credentials of the role registered with Lake Formation, regardless of the IAM permissions of the EMR on EKS job execution role. If the role registered with the table location is misconfigured, jobs fail even when the job execution role has S3 IAM permissions to the table location.
+ Writing to a Lake Formation table uses IAM permission rather than Lake Formation granted permissions. If your job execution role has the necessary S3 permissions, you can use it to run write operations.

The following are considerations and limitations when using Apache Iceberg:
+ You can only use Apache Iceberg with the session catalog, not with arbitrarily named catalogs.
+ Iceberg tables that are registered in Lake Formation only support the metadata tables `history`, `metadata_log_entries`, `snapshots`, `files`, `manifests`, and `refs`. Amazon EMR hides the columns that might have sensitive data, such as `partitions`, `path`, and `summaries`. This limitation doesn't apply to Iceberg tables that aren't registered in Lake Formation.
+ Tables that you don't register in Lake Formation support all Iceberg stored procedures. The `register_table` and `migrate` procedures aren't supported for any tables.
+ We recommend that you use Iceberg DataFrameWriterV2 instead of V1.

For more information, see [Understanding Amazon EMR on EKS concepts and terminology](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks-concepts.html) and [Enable cluster access for Amazon EMR on EKS](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/setting-up-cluster-access.html).

## Disclaimer for data administrators
<a name="security_iam_fgac-considerations-data-admin"></a>

**Note**  
When you grant access to Lake Formation resources to an IAM role for EMR on EKS, you must ensure the EMR cluster administrator or operator is a trusted administrator. This is particularly relevant for Lake Formation resources that are shared across multiple organizations and AWS accounts.

## Responsibilities for EKS administrators
<a name="security_iam_fgac-considerations-responsibilities"></a>
+ The `System` namespace should be protected. No user, resource, entity, or tool should have any Kubernetes RBAC permissions on the Kubernetes resources in the `System` namespace.
+ No user, resource, or entity except the EMR on EKS service should have `CREATE` access to `Pod`, `ConfigMap`, and `Secret` resources in the `User` namespace.
+ `System` drivers and `System` executors contain sensitive data, so Spark events, Spark driver logs, and Spark executor logs in the `System` namespace should not be forwarded to external log storage systems.

# Troubleshooting
<a name="security_iam_fgac-troubleshooting"></a>

## Logging
<a name="security_iam_fgac-troubleshooting-logging"></a>

EMR on EKS uses Spark resource profiles to split job execution. Amazon EMR on EKS uses the user profile to run the code that you supplied, while the system profile enforces Lake Formation policies. You can access the logs for the containers that run under the user profile by configuring the `StartJobRun` request with [MonitoringConfiguration](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks-jobs-s3.html).

## Spark History Server
<a name="security_iam_fgac-troubleshooting-spark-history"></a>

The Spark History Server has all Spark events generated from the user profile and redacted events generated from the system driver. You can see all of the containers from both the user and system drivers on the **Executors** tab. However, log links are available only for the user profile.

## Job failed with insufficient Lake Formation permissions
<a name="security_iam_fgac-troubleshooting-job-failed"></a>

Make sure that your job execution role has the permissions to run `SELECT` and `DESCRIBE` on the table that you are accessing.
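If the permissions are missing, a Lake Formation data lake administrator can grant them with the AWS CLI. The database, table, and role names below are placeholders:

```shell
aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:role/JobExecutionRole \
    --permissions "SELECT" "DESCRIBE" \
    --resource '{
        "Table": {
            "DatabaseName": "my_database",
            "Name": "my_table"
        }
    }'
```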

## Job with RDD execution failed
<a name="security_iam_fgac-troubleshooting-RDD"></a>

EMR on EKS currently doesn't support resilient distributed dataset (RDD) operations on Lake Formation-enabled jobs.

## Unable to access data files in Amazon S3
<a name="security_iam_fgac-troubleshooting-unable-access"></a>

Make sure you have registered the location of the data lake in Lake Formation.
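If the location isn't registered, you can register it with the AWS CLI. The bucket path below is a placeholder, and `--use-service-linked-role` assumes that you want Lake Formation's service-linked role to access the location:

```shell
aws lakeformation register-resource \
    --resource-arn arn:aws:s3:::amzn-s3-demo-bucket/data-lake-path \
    --use-service-linked-role
```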

## Security validation exception
<a name="security_iam_fgac-troubleshooting-validation"></a>

EMR on EKS detected a security validation error. Contact AWS support for assistance.

## Sharing AWS Glue Data Catalog and tables across accounts
<a name="security_iam_fgac-troubleshooting-across"></a>

You can share databases and tables across accounts and still use Lake Formation. For more information, see [Cross-account data sharing in Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/cross-account-permissions.html) and [How do I share AWS Glue Data Catalog and tables cross-account using AWS Lake Formation?](https://repost.aws/knowledge-center/glue-lake-formation-cross-account).

## Iceberg job throws an initialization error when the AWS Region isn't set
<a name="security_iam_fgac-troubleshooting-init-error"></a>

The error message is the following:

```
25/02/25 13:33:19 ERROR SparkFGACExceptionSanitizer: Client received error with id = b921f9e6-f655-491f-b8bd-b2842cdc20c7, 
reason = IllegalArgumentException, message = Cannot initialize 
LakeFormationAwsClientFactory, please set client.region to a valid aws region
```

Make sure that the Spark configuration `spark.sql.catalog.catalog_name.client.region` is set to a valid AWS Region.
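For example, assuming a hypothetical Iceberg catalog named `my_catalog`, the Region can be set in the `sparkSubmitParameters` of the `StartJobRun` request:

```shell
--conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog.client.region=us-east-1
```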

## Iceberg job throws SparkUnsupportedOperationException
<a name="security_iam_fgac-troubleshooting-unsupported-error"></a>

The error message is the following:

```
25/02/25 13:53:15 ERROR SparkFGACExceptionSanitizer: Client received error with id = 921fef42-0800-448b-bef5-d283d1278ce0, 
reason = SparkUnsupportedOperationException, message = Either glue.id or glue.account-id is set with non-default account. 
Cross account access with fine-grained access control is only supported with AWS Resource Access Manager.
```

Make sure that the Spark configuration `spark.sql.catalog.catalog_name.glue.account-id` is set to a valid account ID.

## Iceberg job fails with "403 Access Denied" during a MERGE operation
<a name="security_iam_fgac-troubleshooting-merge-s3fileio-error"></a>

The error message is the following:

```
software.amazon.awssdk.services.s3.model.S3Exception: Access Denied (Service: S3, Status Code: 403, 
...
	at software.amazon.awssdk.services.s3.DefaultS3Client.deleteObject(DefaultS3Client.java:3365)
	at org.apache.iceberg.aws.s3.S3FileIO.deleteFile(S3FileIO.java:162)
	at org.apache.iceberg.io.FileIO.deleteFile(FileIO.java:86)
	at org.apache.iceberg.io.RollingFileWriter.closeCurrentWriter(RollingFileWriter.java:129)
```

Disable Amazon S3 delete operations in Spark by adding the following property: `--conf spark.sql.catalog.s3-table-name.s3.delete-enabled=false`.