

# Using AWS Glue with AWS Lake Formation for fine-grained access control
<a name="security-lf-enable"></a>

## Overview
<a name="security-lf-enable-overview"></a>

With AWS Glue version 5.0 and higher, you can leverage AWS Lake Formation to apply fine-grained access controls on Data Catalog tables that are backed by S3. This capability lets you configure table, row, column, and cell level access controls for read queries within your AWS Glue for Apache Spark jobs. See the following sections to learn more about Lake Formation and how to use it with AWS Glue.

`GlueContext`-based table-level access control with AWS Lake Formation permissions, which was supported in AWS Glue 4.0 and earlier, is not supported in AWS Glue 5.0. In AWS Glue 5.0, use the new Spark-native fine-grained access control (FGAC) instead. Note the following details:
+ If you need fine-grained access control (FGAC) for row-, column-, or cell-level access, you must migrate from `GlueContext`/Glue DynamicFrame in AWS Glue 4.0 and earlier to Spark DataFrame in AWS Glue 5.0. For examples, see [Migrating from GlueContext/Glue DynamicFrame to Spark DataFrame](security-lf-migration-spark-dataframes.md).
+ If you need full table access control (FTA), you can use FTA with DynamicFrames in AWS Glue 5.0. You can also migrate to the native Spark approach for additional capabilities, such as Resilient Distributed Datasets (RDDs), custom libraries, and user-defined functions (UDFs), with AWS Lake Formation tables. For examples, see [Migrating from AWS Glue 4.0 to AWS Glue 5.0](https://docs.aws.amazon.com/glue/latest/dg/migrating-version-50.html).
+ If you don't need FGAC, no migration to Spark DataFrame is necessary, and `GlueContext` features such as job bookmarks and pushdown predicates continue to work.
+ Jobs with FGAC require a minimum of 4 workers: one user driver, one system driver, one system executor, and one standby user executor.

Using AWS Glue with AWS Lake Formation incurs additional charges.

## How AWS Glue works with AWS Lake Formation
<a name="security-lf-enable-how-it-works"></a>

Using AWS Glue with Lake Formation lets you enforce a layer of permissions on each Spark job, applying Lake Formation permissions control when AWS Glue runs the job. AWS Glue uses [Spark resource profiles](https://spark.apache.org/docs/latest/api/java/org/apache/spark/resource/ResourceProfile.html) to create two profiles to effectively execute jobs. The user profile executes user-supplied code, while the system profile enforces Lake Formation policies. For more information, see [What is AWS Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html) and [Considerations and limitations](https://docs.aws.amazon.com/glue/latest/dg/security-lf-enable-considerations.html).

The following is a high-level overview of how AWS Glue gets access to data protected by Lake Formation security policies.

![\[The diagram shows how fine-grained access control works with the AWS Glue StartJobRun API.\]](http://docs.aws.amazon.com/glue/latest/dg/images/glue-50-fgac-start-job-run-api-diagram.png)


1. A user calls the `StartJobRun` API on an AWS Lake Formation-enabled AWS Glue job.

1. AWS Glue sends the job to a user driver and runs the job in the user profile. The user driver runs a lean version of Spark that cannot launch tasks, request executors, or access Amazon S3 or the AWS Glue Data Catalog. It builds a job plan.

1. AWS Glue sets up a second driver called the system driver and runs it in the system profile (with a privileged identity). AWS Glue sets up an encrypted TLS channel between the two drivers for communication. The user driver uses the channel to send the job plan to the system driver. The system driver does not run user-submitted code. It runs full Spark and communicates with Amazon S3 and the Data Catalog for data access. It requests executors and compiles the job plan into a sequence of execution stages.

1. AWS Glue then runs the stages on executors in the user or system profile. User code in any stage runs exclusively on user profile executors.

1. Stages that read data from Data Catalog tables protected by AWS Lake Formation or those that apply security filters are delegated to system executors.

## Minimum worker requirement
<a name="security-lf-enable-permissions"></a>

A Lake Formation-enabled job in AWS Glue requires a minimum of 4 workers: one user driver, one system driver, one system executor, and one standby user executor. This is higher than the minimum of 2 workers required for standard AWS Glue jobs.

A Lake Formation-enabled job in AWS Glue utilizes two Spark drivers—one for the system profile and another for the user profile. Similarly, the executors are also divided into two profiles:
+ System executors: handle tasks where Lake Formation data filters are applied.
+ User executors: are requested by the system driver as needed.

Because Spark jobs are lazy in nature, AWS Glue reserves 10% of the total workers (a minimum of 1), after deducting the two drivers, for user executors.

All Lake Formation-enabled jobs have auto-scaling enabled, meaning the user executors will only start when needed.
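The reservation described above can be sketched in plain Python. This is a simplified illustration of the stated rules, not AWS Glue's actual allocation code:

```python
import math

def allocate_workers(total_workers: int) -> dict:
    """Sketch of how a Lake Formation-enabled job divides its workers.

    Simplified illustration of the rules described above; the actual
    AWS Glue allocation logic may differ.
    """
    if total_workers < 4:
        raise ValueError("Lake Formation-enabled jobs need at least 4 workers")
    # Deduct the user driver and the system driver.
    remaining = total_workers - 2
    # Reserve 10% of the remaining workers (minimum of 1) for user executors.
    user_executors = max(1, math.ceil(remaining * 0.10))
    system_executors = remaining - user_executors
    return {
        "user_driver": 1,
        "system_driver": 1,
        "user_executors": user_executors,
        "system_executors": system_executors,
    }

print(allocate_workers(20))
```

For a 20-worker job, this yields 2 user executors and up to 16 system executors, which matches the example worker allocation shown later in this topic.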

For an example configuration, see [Considerations and limitations](https://docs.aws.amazon.com/glue/latest/dg/security-lf-enable-considerations.html).

## Job runtime role IAM permissions
<a name="security-lf-enable-permissions"></a>

Lake Formation permissions control access to AWS Glue Data Catalog resources, Amazon S3 locations, and the underlying data at those locations. IAM permissions control access to the Lake Formation and AWS Glue APIs and resources. Although you might have the Lake Formation permission to access a table in the Data Catalog (SELECT), your operation fails if you don’t have the IAM permission on the `glue:Get*` API operation. 

The following example policy shows how to provide IAM permissions to access a script in Amazon S3, upload logs to Amazon S3, call AWS Glue API operations, and access Lake Formation.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ScriptAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-bucket",
        "arn:aws:s3:::amzn-s3-demo-bucket/scripts/*"
      ]
    },
    {
      "Sid": "LoggingAccess",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-bucket/logs/*"
      ]
    },
    {
      "Sid": "GlueCatalogAccess",
      "Effect": "Allow",
      "Action": [
        "glue:Get*",
        "glue:Create*",
        "glue:Update*"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Sid": "LakeFormationAccess",
      "Effect": "Allow",
      "Action": [
        "lakeformation:GetDataAccess"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
```


## Setting up Lake Formation permissions for job runtime role
<a name="security-lf-enable-set-up-grants-for-role"></a>

First, register the location of your Hive table with Lake Formation. Then create permissions for your job runtime role on your desired table. For more details about Lake Formation, see [ What is AWS Lake Formation?](https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html) in the *AWS Lake Formation Developer Guide*.

After you set up the Lake Formation permissions, you can submit Spark jobs on AWS Glue.
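As a sketch, these two steps map onto the Lake Formation `RegisterResource` and `GrantPermissions` API operations. The bucket, role ARN, database, and table names below are hypothetical placeholders:

```python
# Hypothetical placeholders -- replace with your own values.
DATA_LOCATION_ARN = "arn:aws:s3:::amzn-s3-demo-bucket/data"
JOB_RUNTIME_ROLE_ARN = "arn:aws:iam::111122223333:role/GlueJobRuntimeRole"

# Step 1: register the Amazon S3 location of the table with Lake Formation.
register_request = {
    "ResourceArn": DATA_LOCATION_ARN,
    "UseServiceLinkedRole": True,
}

# Step 2: grant the job runtime role SELECT and DESCRIBE on the table.
grant_request = {
    "Principal": {"DataLakePrincipalIdentifier": JOB_RUNTIME_ROLE_ARN},
    "Resource": {"Table": {"DatabaseName": "my_database", "Name": "my_table"}},
    "Permissions": ["SELECT", "DESCRIBE"],
}

# With AWS credentials configured, the requests map directly onto Boto3 calls:
# import boto3
# lf = boto3.client("lakeformation")
# lf.register_resource(**register_request)
# lf.grant_permissions(**grant_request)
```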

## Submitting a job run
<a name="security-lf-enable-submit-job"></a>

After you finish setting up the Lake Formation grants, you can submit Spark jobs on AWS Glue. To run Iceberg jobs, you must provide the following Spark configurations. To configure them through AWS Glue job parameters, set the following parameter:
+ Key:

  ```
  --conf
  ```
+ Value:

  ```
  spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
    --conf spark.sql.catalog.spark_catalog.warehouse=<S3_DATA_LOCATION>
    --conf spark.sql.catalog.spark_catalog.glue.account-id=<ACCOUNT_ID>
    --conf spark.sql.catalog.spark_catalog.client.region=<REGION>
    --conf spark.sql.catalog.spark_catalog.glue.endpoint=https://glue.<REGION>.amazonaws.com
  ```
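For example, these parameters can be passed programmatically when starting a job run. The job name and the `<...>` values below are hypothetical placeholders, and the Boto3 call is shown commented out:

```python
# Sketch of starting a Lake Formation-enabled Iceberg job run.
# The job name and configuration values are hypothetical placeholders.
SPARK_CONF = " ".join([
    "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog",
    "--conf spark.sql.catalog.spark_catalog.warehouse=<S3_DATA_LOCATION>",
    "--conf spark.sql.catalog.spark_catalog.glue.account-id=<ACCOUNT_ID>",
    "--conf spark.sql.catalog.spark_catalog.client.region=<REGION>",
    "--conf spark.sql.catalog.spark_catalog.glue.endpoint=https://glue.<REGION>.amazonaws.com",
])

arguments = {
    "--enable-lakeformation-fine-grained-access": "true",
    "--conf": SPARK_CONF,
}

# With credentials configured, pass the arguments to StartJobRun:
# import boto3
# glue = boto3.client("glue")
# glue.start_job_run(JobName="my-lf-enabled-job", Arguments=arguments)
```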

## Using an Interactive Session
<a name="security-lf-using-interactive-session"></a>

After you finish setting up the AWS Lake Formation grants, you can use interactive sessions on AWS Glue. You must provide the following Spark configurations through the `%%configure` magic before executing any code.

```
%%configure
{
    "--enable-lakeformation-fine-grained-access": "true",
    "--conf": "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog --conf spark.sql.catalog.spark_catalog.warehouse=<S3_DATA_LOCATION> --conf spark.sql.catalog.spark_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.spark_catalog.client.region=<REGION> --conf spark.sql.catalog.spark_catalog.glue.account-id=<ACCOUNT_ID> --conf spark.sql.catalog.spark_catalog.glue.endpoint=https://glue.<REGION>.amazonaws.com"
}
```

## FGAC for AWS Glue 5.0 Notebook or interactive sessions
<a name="security-lf-fgac"></a>

To enable fine-grained access control (FGAC) in AWS Glue, you must specify the Spark configurations required for Lake Formation as part of the `%%configure` magic before you run the first cell.

Specifying the configurations later, for example through `SparkSession.builder.config(...).getOrCreate()`, is not sufficient. This is a change from the AWS Glue 4.0 behavior.

## Open-table format support
<a name="security-lf-enable-open-table-format-support"></a>

AWS Glue version 5.0 or later includes support for fine-grained access control based on Lake Formation. AWS Glue supports Hive and Iceberg table types. The following table describes all of the supported operations.

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/security-lf-enable.html)

# Migrating from GlueContext/Glue DynamicFrame to Spark DataFrame
<a name="security-lf-migration-spark-dataframes"></a>

The following are Python and Scala examples of migrating `GlueContext`/Glue `DynamicFrame` in Glue 4.0 to Spark `DataFrame` in Glue 5.0.

**Python**  
Before:

```
escaped_table_name = '`<dbname>`.`<table_name>`'

additional_options = {
  "query": f'select * from {escaped_table_name} WHERE column1 = 1 AND column7 = 7'
}

# DynamicFrame example
dataset = glueContext.create_data_frame_from_catalog(
    database="<dbname>",
    table_name=escaped_table_name, 
    additional_options=additional_options)
```

After:

```
table_identifier = '`<catalogname>`.`<dbname>`.`<table_name>`'  # catalogname is optional

# DataFrame example
dataset = spark.sql(f'select * from {table_identifier} WHERE column1 = 1 AND column7 = 7')
```

**Scala**  
Before:

```
val escapedTableName = "`<dbname>`.`<table_name>`"

val additionalOptions = JsonOptions(Map(
    "query" -> s"select * from $escapedTableName WHERE column1 = 1 AND column7 = 7"
    )
)

// DynamicFrame example
val datasource0 = glueContext.getCatalogSource(
    database="<dbname>", 
    tableName=escapedTableName, 
    additionalOptions=additionalOptions).getDataFrame()
```

After:

```
val tableIdentifier = "`<catalogname>`.`<dbname>`.`<table_name>`" //catalogname is optional

// DataFrame example
val datasource0 = spark.sql(s"select * from $tableIdentifier WHERE column1 = 1 AND column7 = 7")
```

# Considerations and limitations
<a name="security-lf-enable-considerations"></a>

Note the following considerations and limitations when you use Lake Formation with AWS Glue.

+ AWS Glue with Lake Formation is available in all supported AWS Regions except AWS GovCloud (US-East) and AWS GovCloud (US-West).
+ AWS Glue supports fine-grained access control via Lake Formation only for Apache Hive and Apache Iceberg tables. Apache Hive formats include Parquet, ORC, and CSV. 
+ You can only use Lake Formation with Spark jobs.
+ AWS Glue with Lake Formation only supports a single Spark session throughout a job.
+ When Lake Formation is enabled, AWS Glue requires a greater number of workers because it needs one system driver, system executors, one user driver, and optionally user executors (required when your job uses UDFs or `spark.createDataFrame`).
+ AWS Glue with Lake Formation only supports cross-account table queries shared through resource links. The resource link must have the same name as the resource in the source account.
+ To enable fine-grained access control for AWS Glue jobs, pass the `--enable-lakeformation-fine-grained-access` job parameter.
+ You can configure your AWS Glue jobs to work with the AWS Glue multi-catalog hierarchy. For information on the configuration parameters to use with the AWS Glue `StartJobRun` API, see [Working with AWS Glue multi-catalog hierarchy on EMR Serverless](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/external-metastore-glue-multi.html).
+ The following aren't supported:
  + Resilient distributed datasets (RDD)
  + Spark streaming
  + Write with Lake Formation granted permissions
  + Access control for nested columns
+ AWS Glue blocks functionality that might undermine the complete isolation of the system driver, including the following:
  + UDTs, Hive UDFs, and any user-defined functions that involve custom classes
  + Custom data sources
  + Supplying additional JARs for Spark extensions, connectors, or metastores
  + `ANALYZE TABLE` command
+ To enforce access controls, `EXPLAIN PLAN` and DDL operations such as `DESCRIBE TABLE` don't expose restricted information.
+ AWS Glue restricts access to system driver Spark logs on Lake Formation-enabled applications. Because the system driver runs with more access, events and logs that the system driver generates can include sensitive information. To prevent unauthorized users or code from accessing this sensitive data, AWS Glue disables access to system driver logs. For troubleshooting, contact AWS Support.
+ If you registered a table location with Lake Formation, data access goes through the credentials registered with Lake Formation, regardless of the IAM permissions of the AWS Glue job runtime role. If you misconfigure the role registered with the table location, jobs submitted with a role that has S3 IAM permissions to the table location will fail.
+ Writing to a Lake Formation table uses IAM permission rather than Lake Formation granted permissions. If your job runtime role has the necessary S3 permissions, you can use it to run write operations.

The following are considerations and limitations when using Apache Iceberg:
+ You can only use Apache Iceberg with the session catalog, not with arbitrarily named catalogs.
+ Iceberg tables that are registered in Lake Formation only support the metadata tables `history`, `metadata_log_entries`, `snapshots`, `files`, `manifests`, and `refs`. AWS Glue hides the columns that might have sensitive data, such as `partitions`, `path`, and `summaries`. This limitation doesn't apply to Iceberg tables that aren't registered in Lake Formation.
+ Tables that you don't register in Lake Formation support all Iceberg stored procedures. The `register_table` and `migrate` procedures aren't supported for any tables.
+ We recommend that you use Iceberg DataFrameWriterV2 instead of V1.

## Example worker allocation
<a name="security-lf-considerations-worker-allocation"></a>

For a job configured with the following parameters:

```
--enable-lakeformation-fine-grained-access=true  
--number-of-workers=20
```

The worker allocation would be:
+ One worker for the user driver.
+ One worker for the system driver.
+ 10% of the remaining 18 workers (that is, 2 workers) reserved for the user executors.
+ Up to 16 workers allocated for system executors.

With auto-scaling enabled, the user executors can utilize any of the unallocated capacity from the system executors if needed.

## Controlling user executor allocation
<a name="security-lf-considerations-user-exec-allocation"></a>

You can adjust the reservation percentage for user executors using the following configuration:

```
--conf spark.dynamicAllocation.maxExecutorsRatio=<value between 0 and 1>
```

This configuration allows fine-tuned control over how many user executors are reserved relative to the total available capacity.

# Troubleshooting
<a name="security-lf-troubleshooting"></a>

See the following sections for troubleshooting solutions.

## Logging
<a name="security-lf-troubleshooting-logging"></a>

AWS Glue uses Spark resource profiles to split job execution. The user profile runs the code that you supplied, while the system profile enforces Lake Formation policies. You can access the logs only for the tasks that ran in the user profile.

## Live UI and Spark History Server
<a name="security-lf-troubleshooting-live-ui"></a>

The Live UI and the Spark History Server have all Spark events generated from the user profile and redacted events generated from the system driver.

You can see all of the tasks from both the user and system drivers in the **Executors** tab. However, log links are available only for the user profile. Also, some information, such as the number of output records, is redacted from the Live UI.

## Job failed with insufficient Lake Formation permissions
<a name="security-lf-troubleshooting-insufficient-lf-permissions"></a>

Make sure that your job runtime role has the permissions to run SELECT and DESCRIBE on the table that you are accessing.

## Job with RDD execution failed
<a name="security-lf-troubleshooting-rdd-execution"></a>

AWS Glue currently doesn't support resilient distributed dataset (RDD) operations on Lake Formation-enabled jobs.

## Unable to access data files in Amazon S3
<a name="security-lf-troubleshooting-s3-access-failure"></a>

Make sure you have registered the location of the data lake in Lake Formation.

## Security validation exception
<a name="security-lf-troubleshooting-security-validation"></a>

AWS Glue detected a security validation error. Contact AWS support for assistance.

## Sharing AWS Glue Data Catalog and tables across accounts
<a name="security-lf-troubleshooting-cross-account"></a>

You can share databases and tables across accounts and still use Lake Formation. For more information, see [Cross-account data sharing in Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/cross-account-permissions.html) and [How do I share the AWS Glue Data Catalog and tables cross-account using Lake Formation?](https://repost.aws/knowledge-center/glue-lake-formation-cross-account).