# 在 AWS Glue ETL 任务中使用数据湖框架
<a name="aws-glue-programming-etl-datalake-native-frameworks"></a>

开源数据湖框架简化了对存储在 Amazon S3 上的数据湖中的文件的增量数据处理。AWS Glue 3.0 及更高版本支持以下开源数据湖框架：
+ Apache Hudi
+ Linux Foundation Delta Lake
+ Apache Iceberg

我们为这些框架提供原生支持，以便您可以以交易一致的方式读取和写入存储在 Amazon S3 中的数据。无需安装单独的连接器或完成额外的配置步骤即可在 AWS Glue ETL 任务中使用这些框架。

通过 AWS Glue Data Catalog 管理数据集时，您可以使用 AWS Glue 方法读取和写入 Spark DataFrames 数据湖表。也可以使用 Spark DataFrame API 读取和写入 Amazon S3 数据。

在这段视频中，您可以了解 Apache Hudi、Apache Iceberg 和 Delta Lake 工作原理的基础知识。您将看到如何在数据湖中插入、更新和删除数据，以及每个框架的工作原理。

[![AWS Videos](http://img.youtube.com/vi/https://www.youtube.com/embed/fryfx0Zg7KA/0.jpg)](http://www.youtube.com/watch?v=https://www.youtube.com/embed/fryfx0Zg7KA)


**Topics**
+ [限制](aws-glue-programming-etl-datalake-native-frameworks-limitations.md)
+ [在 AWS Glue 中使用 Hudi 框架](aws-glue-programming-etl-format-hudi.md)
+ [在 AWS Glue 中使用 Delta Lake 框架](aws-glue-programming-etl-format-delta-lake.md)
+ [在 AWS Glue 中使用 Iceberg 框架](aws-glue-programming-etl-format-iceberg.md)

# 限制
<a name="aws-glue-programming-etl-datalake-native-frameworks-limitations"></a>

在数据湖框架与 AWS Glue 配合使用之前，请考虑以下限制。
+ 以下 AWS Glue `GlueContext` DynamicFrame 方法不支持读取和写入数据湖框架表。请改用 `GlueContext` DataFrame 方法或 Spark DataFrame API。
  + `create_dynamic_frame.from_catalog`
  + `write_dynamic_frame.from_catalog`
  + `getDynamicFrame`
  + `writeDynamicFrame`
+ 以下 `GlueContext` DataFrame 方法支持 Lake Formation 权限控制：
  + `create_data_frame.from_catalog`
  + `write_data_frame.from_catalog`
  + `getDataFrame`
  + `writeDataFrame`
+ 不支持[对小文件进行分组](grouping-input-files.md)。
+ 不支持[作业书签](monitor-continuations.md)。
+ Apache Hudi 0.10.1 for AWS Glue 3.0 不支持 Read (MoR) 表上的 Hudi Merge。
+ `ALTER TABLE … RENAME TO` 不适用于 Apache Iceberg 0.13.1 for AWS Glue 3.0。

## 有关由 Lake Formation 权限管理的数据湖格式表的限制
<a name="w2aac67c11c24c11c31c17b7"></a>

数据湖格式通过 Lake Formation 权限与 AWS Glue ETL 集成。不支持使用 `create_dynamic_frame` 创建 DynamicFrame。有关更多信息，请参阅以下示例：
+ [示例：读取和写入具有 Lake Formation 权限控制的 Iceberg 表](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-iceberg.html#aws-glue-programming-etl-format-iceberg-read-write-lake-formation-tables)
+ [示例：读取和写入具有 Lake Formation 权限控制的 Hudi 表](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html#aws-glue-programming-etl-format-hudi-read-write-lake-formation-tables)
+ [示例：读取和写入具有 Lake Formation 权限控制的 Delta Lake 表](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-delta-lake.html#aws-glue-programming-etl-format-delta-lake-read-write-lake-formation-tables)

**注意**  
仅在 AWS Glue 版本 4.0 中支持通过适用于 Apache Hudi、Apache Iceberg 和 Delta Lake 的 Lake Formation 权限与 AWS Glue ETL 集成。

Apache Iceberg 通过 Lake Formation 权限与 AWS Glue ETL 集成的效果最好。它支持几乎所有操作，包括支持 SQL。

Hudi 支持除管理操作之外的大多数基本操作。这是因为这些选项通常通过写入 DataFrame 来完成，并通过 `additional_options` 指定。由于不支持 SparkSQL，因此需要使用 AWS Glue API 来为您的操作创建 DataFrame。

Delta Lake 仅支持读取、附加和覆盖表数据。Delta Lake 需要使用自己的库才能执行更新等各种任务。

由 Lake Formation 权限管理的 Iceberg 表不支持以下功能。
+ 使用 ETL AWS Glue 进行压缩
+ 通过 AWS Glue ETL 支持 Spark SQL

由 Lake Formation 权限管理的 Hudi 表存在以下限制：
+ 移除孤立文件

由 Lake Formation 权限管理的 Delta Lake 表存在以下限制：
+ 除在 Delta Lake 表中插入和读取数据的所有其他功能。

# 在 AWS Glue 中使用 Hudi 框架
<a name="aws-glue-programming-etl-format-hudi"></a>

AWS Glue 3.0 及更高版本支持数据湖的 Apache Hudi 框架。Hudi 是一个开源数据湖存储框架，简化增量数据处理和数据管道开发。本主题涵盖了在 Hudi 表中传输或存储数据时，在 AWS Glue 中使用数据的可用功能。要了解有关 Hudi 的更多信息，请参阅 [Apache Hudi 官方文档](https://hudi.apache.org/docs/overview/)。

您可以使用 AWS Glue 对 Amazon S3 中的 Hudi 表执行读写操作，也可以使用 AWS Glue 数据目录处理 Hudi 表。还支持其他操作，包括插入、更新和所有 [Apache Spark 操作](https://hudi.apache.org/docs/quick-start-guide/)。

**注意**  
AWS Glue 5.0 中的 Apache Hudi 0.15.0 实现会在内部回退 [HUDI-7001](https://github.com/apache/hudi/pull/9936) 变更。如果记录密钥由单个字段组成，则不会出现与复杂键生成相关的回归问题。不过，这种行为与 OSS Apache Hudi 0.15.0 不同。  
Apache Hudi 0.10.1 for AWS Glue 3.0 不支持 Read（MoR）表上的 Hudi Merge。

下表列出了 AWS Glue 每个版本中包含的 Hudi 版本。


****  

| AWS Glue 版本 | 支持的 Hudi 版本 | 
| --- | --- | 
| 5.1 | 1.0.2 | 
| 5.0 | 0.15.0 | 
| 4.0 | 0.12.1 | 
| 3.0 | 0.10.1 | 

要了解有关 AWS Glue 支持的数据湖框架的更多信息，请参阅[在 AWS Glue ETL 任务中使用数据湖框架](aws-glue-programming-etl-datalake-native-frameworks.md)。

## 启用 Hudi
<a name="aws-glue-programming-etl-format-hudi-enable"></a>

要为 AWS Glue 启用 Hudi，请完成以下任务：
+ 指定 `hudi` 作为 `--datalake-formats` 作业参数的值。有关更多信息，请参阅 [在 AWS Glue 作业中使用作业参数](aws-glue-programming-etl-glue-arguments.md)。
+ `--conf` 为 Glue 作业创建一个名为 AWS 的密钥，并将其设置为以下值。或者，您可以在脚本中使用 `SparkConf` 设置以下配置。这些设置有助于 Apache Spark 正确处理 Hudi 表。

  ```
  spark.serializer=org.apache.spark.serializer.KryoSerializer
  ```
+ AWS Glue 4.0 默认为 Hudi 表启用了 Lake Formation 权限支持。无需额外配置即可读取/写入注册到 Lake Formation 的 Hudi 表。AWS Glue 作业 IAM 角色必须具有 SELECT 权限才能读取已注册的 Hudi 表。AWS Glue 作业 IAM 角色必须具有 SUPER 权限才能写入已注册的 Hudi 表。要了解有关管理 Lake Formation 权限的更多信息，请参阅 [Granting and revoking permissions on Data Catalog resources](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-catalog-permissions.html)。

**使用不同的 Hudi 版本**

要使用 AWS Glue 不支持的 Hudi 版本，请使用 `--extra-jars` 作业参数指定您自己的 Hudi JAR 文件。请勿使用 `hudi` 作为 `--datalake-formats` 作业参数的值。如果使用 AWS Glue 5.0 或更高版本，则必须设置 `--user-jars-first true` 作业参数。

## 示例：将 Hudi 表写入 Amazon S3 并将其注册到 AWS Glue 数据目录中
<a name="aws-glue-programming-etl-format-hudi-write"></a>

以下示例脚本脚本演示了如何将 Hudi 表写入 Amazon S3，并将该表注册到 AWS Glue 数据目录。该示例使用 Hudi [Hive 同步工具](https://hudi.apache.org/docs/syncing_metastore/)来注册该表。

**注意**  
此示例要求您设置 `--enable-glue-datacatalog` 作业参数，才能将 AWS Glue Data Catalog 用作 Apache Spark Hive 元存储。要了解更多信息，请参阅[在 AWS Glue 作业中使用作业参数](aws-glue-programming-etl-glue-arguments.md)。

------
#### [ Python ]

```
# Example: Create a Hudi table from a DataFrame 
# and register the table to Glue Data Catalog

additional_options={
    "hoodie.table.name": "<your_table_name>",
    "hoodie.database.name": "<your_database_name>",
    "hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "<your_recordkey_field>",
    "hoodie.datasource.write.precombine.field": "<your_precombine_field>",
    "hoodie.datasource.write.partitionpath.field": "<your_partitionkey_field>",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": "<your_database_name>",
    "hoodie.datasource.hive_sync.table": "<your_table_name>",
    "hoodie.datasource.hive_sync.partition_fields": "<your_partitionkey_field>",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.hive_sync.mode": "hms",
    "path": "s3://<s3Path/>"
}

dataFrame.write.format("hudi") \
    .options(**additional_options) \
    .mode("overwrite") \
    .save()
```

------
#### [ Scala ]

```
// Example: Example: Create a Hudi table from a DataFrame
// and register the table to Glue Data Catalog

val additionalOptions = Map(
  "hoodie.table.name" -> "<your_table_name>",
  "hoodie.database.name" -> "<your_database_name>",
  "hoodie.datasource.write.storage.type" -> "COPY_ON_WRITE",
  "hoodie.datasource.write.operation" -> "upsert",
  "hoodie.datasource.write.recordkey.field" -> "<your_recordkey_field>",
  "hoodie.datasource.write.precombine.field" -> "<your_precombine_field>",
  "hoodie.datasource.write.partitionpath.field" -> "<your_partitionkey_field>",
  "hoodie.datasource.write.hive_style_partitioning" -> "true",
  "hoodie.datasource.hive_sync.enable" -> "true",
  "hoodie.datasource.hive_sync.database" -> "<your_database_name>",
  "hoodie.datasource.hive_sync.table" -> "<your_table_name>",
  "hoodie.datasource.hive_sync.partition_fields" -> "<your_partitionkey_field>",
  "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
  "hoodie.datasource.hive_sync.use_jdbc" -> "false",
  "hoodie.datasource.hive_sync.mode" -> "hms",
  "path" -> "s3://<s3Path/>")

dataFrame.write.format("hudi")
  .options(additionalOptions)
  .mode("append")
  .save()
```

------

## 示例：使用 AWS Glue Data Catalog 从 Amazon S3 读取 Hudi 表
<a name="aws-glue-programming-etl-format-hudi-read"></a>

此示例从 Amazon S3 读取您在 [示例：将 Hudi 表写入 Amazon S3 并将其注册到 AWS Glue 数据目录中](#aws-glue-programming-etl-format-hudi-write) 中创建的 Hudi 表。

**注意**  
此示例要求您设置 `--enable-glue-datacatalog` 任务参数，才能将 AWS Glue Data Catalog 用作 Apache Spark Hive 元存储。要了解更多信息，请参阅[在 AWS Glue 作业中使用作业参数](aws-glue-programming-etl-glue-arguments.md)。

------
#### [ Python ]

在本示例中，使用 `GlueContext.create\$1data\$1frame.from\$1catalog()` 方法。

```
# Example: Read a Hudi table from Glue Data Catalog

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

dataFrame = glueContext.create_data_frame.from_catalog(
    database = "<your_database_name>",
    table_name = "<your_table_name>"
)
```

------
#### [ Scala ]

在本示例中，使用 [getCatalogSource](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSource) 方法。

```
// Example: Read a Hudi table from Glue Data Catalog

import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    
    val dataFrame = glueContext.getCatalogSource(
      database = "<your_database_name>",
      tableName = "<your_table_name>"
    ).getDataFrame()
  }
}
```

------

## 示例：更新 `DataFrame` 并将其插入到 Amazon S3 的 Hudi 表中
<a name="aws-glue-programming-etl-format-hudi-update-insert"></a>

此示例使用 AWS Glue Data Catalog 将 DataFrame 插入到您在 [示例：将 Hudi 表写入 Amazon S3 并将其注册到 AWS Glue 数据目录中](#aws-glue-programming-etl-format-hudi-write) 中创建的 Hudi 表中。

**注意**  
此示例要求您设置 `--enable-glue-datacatalog` 任务参数，才能将 AWS Glue Data Catalog 用作 Apache Spark Hive 元存储。要了解更多信息，请参阅[在 AWS Glue 作业中使用作业参数](aws-glue-programming-etl-glue-arguments.md)。

------
#### [ Python ]

在本示例中，使用 `GlueContext.write\$1data\$1frame.from\$1catalog()` 方法。

```
# Example: Upsert a Hudi table from Glue Data Catalog

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

glueContext.write_data_frame.from_catalog(
    frame = dataFrame,
    database = "<your_database_name>",
    table_name = "<your_table_name>",
    additional_options={
        "hoodie.table.name": "<your_table_name>",
        "hoodie.database.name": "<your_database_name>",
        "hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": "<your_recordkey_field>",
        "hoodie.datasource.write.precombine.field": "<your_precombine_field>",
        "hoodie.datasource.write.partitionpath.field": "<your_partitionkey_field>",
        "hoodie.datasource.write.hive_style_partitioning": "true",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.database": "<your_database_name>",
        "hoodie.datasource.hive_sync.table": "<your_table_name>",
        "hoodie.datasource.hive_sync.partition_fields": "<your_partitionkey_field>",
        "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.mode": "hms"
    }
)
```

------
#### [ Scala ]

在本示例中，使用 [getCatalogSink](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSink) 方法。

```
// Example: Upsert a Hudi table from Glue Data Catalog

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions
import org.apacke.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    glueContext.getCatalogSink("<your_database_name>", "<your_table_name>",
      additionalOptions = JsonOptions(Map(
        "hoodie.table.name" -> "<your_table_name>",
        "hoodie.database.name" -> "<your_database_name>",
        "hoodie.datasource.write.storage.type" -> "COPY_ON_WRITE",
        "hoodie.datasource.write.operation" -> "upsert",
        "hoodie.datasource.write.recordkey.field" -> "<your_recordkey_field>",
        "hoodie.datasource.write.precombine.field" -> "<your_precombine_field>",
        "hoodie.datasource.write.partitionpath.field" -> "<your_partitionkey_field>",
        "hoodie.datasource.write.hive_style_partitioning" -> "true",
        "hoodie.datasource.hive_sync.enable" -> "true",
        "hoodie.datasource.hive_sync.database" -> "<your_database_name>",
        "hoodie.datasource.hive_sync.table" -> "<your_table_name>",
        "hoodie.datasource.hive_sync.partition_fields" -> "<your_partitionkey_field>",
        "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.use_jdbc" -> "false",
        "hoodie.datasource.hive_sync.mode" -> "hms"
      )))
      .writeDataFrame(dataFrame, glueContext)
  }
}
```

------

## 示例：使用 Spark 从 Amazon S3 读取 Hudi 表
<a name="aws-glue-programming-etl-format-hudi-read-spark"></a>

此示例使用 Spark DataFrame API 从 Amazon S3 读取 Hudi 表。

------
#### [ Python ]

```
# Example: Read a Hudi table from S3 using a Spark DataFrame

dataFrame = spark.read.format("hudi").load("s3://<s3path/>")
```

------
#### [ Scala ]

```
// Example: Read a Hudi table from S3 using a Spark DataFrame

val dataFrame = spark.read.format("hudi").load("s3://<s3path/>")
```

------

## 示例：使用 Spark 向 Amazon S3 写入 Hudi 表
<a name="aws-glue-programming-etl-format-hudi-write-spark"></a>

示例：使用 Spark 向 Amazon S3 写入 Hudi 表

------
#### [ Python ]

```
# Example: Write a Hudi table to S3 using a Spark DataFrame

dataFrame.write.format("hudi") \
    .options(**additional_options) \
    .mode("overwrite") \
    .save("s3://<s3Path/>)
```

------
#### [ Scala ]

```
// Example: Write a Hudi table to S3 using a Spark DataFrame

dataFrame.write.format("hudi")
  .options(additionalOptions)
  .mode("overwrite")
  .save("s3://<s3path/>")
```

------

## 示例：读取和写入具有 Lake Formation 权限控制的 Hudi 表
<a name="aws-glue-programming-etl-format-hudi-read-write-lake-formation-tables"></a>

此示例将读取和写入一个具有 Lake Formation 权限控制的 Hudi 表。

1. 创建一个 Hudi 表并将其注册到 Lake Formation。

   1. 要启用 Lake Formation 权限控制，您首先需要将表的 Amazon S3 路径注册到 Lake Formation。有关更多信息，请参阅 [Registering an Amazon S3 location](https://docs.aws.amazon.com/lake-formation/latest/dg/register-location.html)（注册 Amazon S3 位置）。您可以通过 Lake Formation 控制台或使用 AWS CLI 进行注册：

      ```
      aws lakeformation register-resource --resource-arn arn:aws:s3:::<s3-bucket>/<s3-folder> --use-service-linked-role --region <REGION>
      ```

      注册了 Amazon S3 位置后，对于任何指向该位置（或其任何子位置）的 AWS Glue 表，`GetTable` 调用中的 `IsRegisteredWithLakeFormation` 参数都将返回值 true。

   1. 创建一个指向通过 Spark dataframe API 注册的 Amazon S3 路径的 Hudi 表：

      ```
      hudi_options = {
          'hoodie.table.name': table_name,
          'hoodie.database.name': database_name,
          'hoodie.datasource.write.storage.type': 'COPY_ON_WRITE',
          'hoodie.datasource.write.recordkey.field': 'product_id',
          'hoodie.datasource.write.table.name': table_name,
          'hoodie.datasource.write.operation': 'upsert',
          'hoodie.datasource.write.precombine.field': 'updated_at',
          'hoodie.datasource.write.hive_style_partitioning': 'true',
          'hoodie.upsert.shuffle.parallelism': 2,
          'hoodie.insert.shuffle.parallelism': 2,
          'path': <S3_TABLE_LOCATION>,
          'hoodie.datasource.hive_sync.enable': 'true',
          'hoodie.datasource.hive_sync.database': database_name,
          'hoodie.datasource.hive_sync.table': table_name,
          'hoodie.datasource.hive_sync.use_jdbc': 'false',
          'hoodie.datasource.hive_sync.mode': 'hms'
      }
      
      df_products.write.format("hudi")  \
          .options(**hudi_options)  \
          .mode("overwrite")  \
          .save()
      ```

1. 向 AWS Glue 作业 IAM 角色授予 Lake Formation 权限。您可以通过 Lake Formation 控制台授予权限，也可以使用 AWS CLI 授予权限。有关更多信息，请参阅 [Granting table permissions using the Lake Formation console and the named resource method](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-table-permissions.html)。

1.  读取注册到 Lake Formation 的 Hudi 表。代码与读取未注册的 Hudi 表相同。请注意，AWS Glue 作业 IAM 角色需要具有 SELECT 权限才能成功读取。

   ```
    val dataFrame = glueContext.getCatalogSource(
         database = "<your_database_name>",
         tableName = "<your_table_name>"
       ).getDataFrame()
   ```

1. 写入注册到 Lake Formation 的 Hudi 表。代码与写入未注册的 Hudi 表相同。请注意，AWS Glue 作业 IAM 角色需要具有 SUPER 权限才能成功写入。

   ```
   glueContext.getCatalogSink("<your_database_name>", "<your_table_name>",
         additionalOptions = JsonOptions(Map(
           "hoodie.table.name" -> "<your_table_name>",
           "hoodie.database.name" -> "<your_database_name>",
           "hoodie.datasource.write.storage.type" -> "COPY_ON_WRITE",
           "hoodie.datasource.write.operation" -> "<write_operation>",
           "hoodie.datasource.write.recordkey.field" -> "<your_recordkey_field>",
           "hoodie.datasource.write.precombine.field" -> "<your_precombine_field>",
           "hoodie.datasource.write.partitionpath.field" -> "<your_partitionkey_field>",
           "hoodie.datasource.write.hive_style_partitioning" -> "true",
           "hoodie.datasource.hive_sync.enable" -> "true",
           "hoodie.datasource.hive_sync.database" -> "<your_database_name>",
           "hoodie.datasource.hive_sync.table" -> "<your_table_name>",
           "hoodie.datasource.hive_sync.partition_fields" -> "<your_partitionkey_field>",
           "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
           "hoodie.datasource.hive_sync.use_jdbc" -> "false",
           "hoodie.datasource.hive_sync.mode" -> "hms"
         )))
         .writeDataFrame(dataFrame, glueContext)
   ```

# 在 AWS Glue 中使用 Delta Lake 框架
<a name="aws-glue-programming-etl-format-delta-lake"></a>

AWS Glue 3.0 及更高版本支持 Linux Foundation Delta Lake 框架。Delta Lake 是一个开源数据湖存储框架，可帮助您执行 ACID 交易、扩展元数据处理以及统一流式和批处理数据处理。本主题涵盖了在 Delta Lake 表中传输或存储数据时，在 AWS Glue 中使用数据的可用功能。要了解有关 Delta Lake 的更多信息，请参阅 [Delta Lake 官方文档](https://docs.delta.io/latest/delta-intro.html)。

您可以使用 AWS Glue 对 Amazon S3 中的 Delta Lake 表执行读写操作，也可以使用 AWS Glue 数据目录处理 Delta Lake 表。还支持插入、更新和[表批量读取和写入](https://docs.delta.io/0.7.0/api/python/index.html)等其他操作。使用 Delta Lake 表时，也可以选择使用 Delta Lake Python 库中的方法，例如 `DeltaTable.forPath`。有关 Delta Lake Python 库的更多信息，请参阅 Delta Lake 的 Python 文档页面。

下表列出了 AWS Glue 每个版本中包含的 Delta Lake 版本。


****  

| AWS Glue 版本 | 支持的 Delta Lake 版本 | 
| --- | --- | 
| 5.1 | 3.3.2 | 
| 5.0 | 3.3.0 | 
| 4.0 | 2.1.0 | 
| 3.0 | 1.0.0 | 

要了解有关 AWS Glue 支持的数据湖框架的更多信息，请参阅[在 AWS Glue ETL 任务中使用数据湖框架](aws-glue-programming-etl-datalake-native-frameworks.md)。

## 为 AWS Glue 启用 Delta Lake
<a name="aws-glue-programming-etl-format-delta-lake-enable"></a>

要为 AWS Glue 启用 Delta Lake，请完成以下任务：
+ 指定 `delta` 作为 `--datalake-formats` 作业参数的值。有关更多信息，请参阅 [在 AWS Glue 作业中使用作业参数](aws-glue-programming-etl-glue-arguments.md)。
+ `--conf` 为 Glue 作业创建一个名为 AWS 的密钥，并将其设置为以下值。或者，您可以在脚本中使用 `SparkConf` 设置以下配置。这些设置有助于 Apache Spark 正确处理 Delta Lake 表。

  ```
  spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
  ```
+ AWS Glue 4.0 默认为 Delta 表启用了 Lake Formation 权限支持。无需额外配置即可读取/写入注册到 Lake Formation 的 Delta 表。AWS Glue 作业 IAM 角色必须具有 SELECT 权限才能读取已注册的 Delta 表。AWS Glue 作业 IAM 角色必须具有 SUPER 权限才能写入已注册的 Delta 表。要了解有关管理 Lake Formation 权限的更多信息，请参阅 [Granting and revoking permissions on Data Catalog resources](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-catalog-permissions.html)。

**使用不同的 Delta Lake 版本**

要使用 AWS Glue 不支持的 Delta Lake 版本，请使用 `--extra-jars` 作业参数指定您自己的 Delta Lake JAR 文件。请勿包含 `delta` 作为 `--datalake-formats` 作业参数的值。如果使用 AWS Glue 5.0 或更高版本，则必须设置 `--user-jars-first true` 作业参数。要在这种情况下使用 Delta Lake Python 库，必须使用 `--extra-py-files` 作业参数指定库 JAR 文件。Python 库打包在 Delta Lake JAR 文件中。

## 示例：将 Delta Lake 表写入 Amazon S3，并将其注册到 AWS Glue 数据目录
<a name="aws-glue-programming-etl-format-delta-lake-write"></a>

以下 AWS Glue ETL 脚本演示了如何将 Delta Lake 表写入 Amazon S3，并将该表注册到 AWS Glue 数据目录。

------
#### [ Python ]

```
# Example: Create a Delta Lake table from a DataFrame 
# and register the table to Glue Data Catalog

additional_options = {
    "path": "s3://<s3Path>"
}
dataFrame.write \
    .format("delta") \
    .options(**additional_options) \
    .mode("append") \
    .partitionBy("<your_partitionkey_field>") \
    .saveAsTable("<your_database_name>.<your_table_name>")
```

------
#### [ Scala ]

```
// Example: Example: Create a Delta Lake table from a DataFrame
// and register the table to Glue Data Catalog

val additional_options = Map(
  "path" -> "s3://<s3Path>"
)
dataFrame.write.format("delta")
  .options(additional_options)
  .mode("append")
  .partitionBy("<your_partitionkey_field>")
  .saveAsTable("<your_database_name>.<your_table_name>")
```

------

## 示例：使用 AWS Glue 数据目录从 Amazon S3 读取 Delta Lake 表
<a name="aws-glue-programming-etl-format-delta-lake-read"></a>

以下 AWS Glue ETL 脚本读取您在 [示例：将 Delta Lake 表写入 Amazon S3，并将其注册到 AWS Glue 数据目录](#aws-glue-programming-etl-format-delta-lake-write) 中创建的 Delta Lake 表。

------
#### [ Python ]

在本示例中，使用 [create\$1data\$1frame.from\$1catalog](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create-dataframe-from-catalog) 方法。

```
# Example: Read a Delta Lake table from Glue Data Catalog

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

df = glueContext.create_data_frame.from_catalog(
    database="<your_database_name>",
    table_name="<your_table_name>",
    additional_options=additional_options
)
```

------
#### [ Scala ]

在本示例中，使用 [getCatalogSource](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSource) 方法。

```
// Example: Read a Delta Lake table from Glue Data Catalog

import com.amazonaws.services.glue.GlueContext
import org.apacke.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val df = glueContext.getCatalogSource("<your_database_name>", "<your_table_name>",
      additionalOptions = additionalOptions)
      .getDataFrame()
  }
}
```

------

## 示例：使用 AWS Glue 数据目录在 Amazon S3 中将 `DataFrame` 插入 Delta Lake 表
<a name="aws-glue-programming-etl-format-delta-lake-insert"></a>

此示例将数据插入您在 [示例：将 Delta Lake 表写入 Amazon S3，并将其注册到 AWS Glue 数据目录](#aws-glue-programming-etl-format-delta-lake-write) 中创建的 Delta Lake 表。

**注意**  
此示例要求您设置 `--enable-glue-datacatalog` 任务参数，才能将 AWS Glue Data Catalog 用作 Apache Spark Hive 元存储。要了解更多信息，请参阅[在 AWS Glue 作业中使用作业参数](aws-glue-programming-etl-glue-arguments.md)。

------
#### [ Python ]

在本示例中，使用 [write\$1data\$1frame.from\$1catalog](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-write_data_frame_from_catalog) 方法。

```
# Example: Insert into a Delta Lake table in S3 using Glue Data Catalog

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

glueContext.write_data_frame.from_catalog(
    frame=dataFrame,
    database="<your_database_name>",
    table_name="<your_table_name>",
    additional_options=additional_options
)
```

------
#### [ Scala ]

在本示例中，使用 [getCatalogSink](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSink) 方法。

```
// Example: Insert into a Delta Lake table in S3 using Glue Data Catalog

import com.amazonaws.services.glue.GlueContext
import org.apacke.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    glueContext.getCatalogSink("<your_database_name>", "<your_table_name>",
      additionalOptions = additionalOptions)
      .writeDataFrame(dataFrame, glueContext)
  }
}
```

------

## 示例：使用 Spark API 从 Amazon S3 读取 Delta Lake 表
<a name="aws-glue-programming-etl-format-delta_lake-read-spark"></a>

此示例使用 Spark API 从 Amazon S3 读取 Delta Lake 表。

------
#### [ Python ]

```
# Example: Read a Delta Lake table from S3 using a Spark DataFrame

dataFrame = spark.read.format("delta").load("s3://<s3path/>")
```

------
#### [ Scala ]

```
// Example: Read a Delta Lake table from S3 using a Spark DataFrame

val dataFrame = spark.read.format("delta").load("s3://<s3path/>")
```

------

## 示例：使用 Spark 向 Amazon S3 写入 Delta Lake 表
<a name="aws-glue-programming-etl-format-delta_lake-write-spark"></a>

此示例使用 Spark 向 Amazon S3 写入 Delta Lake 表。

------
#### [ Python ]

```
# Example: Write a Delta Lake table to S3 using a Spark DataFrame

dataFrame.write.format("delta") \
    .options(**additional_options) \
    .mode("overwrite") \
    .partitionBy("<your_partitionkey_field>")
    .save("s3://<s3Path>")
```

------
#### [ Scala ]

```
// Example: Write a Delta Lake table to S3 using a Spark DataFrame

dataFrame.write.format("delta")
  .options(additionalOptions)
  .mode("overwrite")
  .partitionBy("<your_partitionkey_field>")
  .save("s3://<s3path/>")
```

------

## 示例：读取和写入具有 Lake Formation 权限控制的 Delta Lake 表
<a name="aws-glue-programming-etl-format-delta-lake-read-write-lake-formation-tables"></a>

此示例将读取和写入一个具有 Lake Formation 权限控制的 Delta Lake 表。

1. 创建一个 Delta 表并将其注册到 Lake Formation

   1. 要启用 Lake Formation 权限控制，您首先需要将表的 Amazon S3 路径注册到 Lake Formation。有关更多信息，请参阅 [Registering an Amazon S3 location](https://docs.aws.amazon.com/lake-formation/latest/dg/register-location.html)（注册 Amazon S3 位置）。您可以通过 Lake Formation 控制台或使用 AWS CLI 进行注册：

      ```
      aws lakeformation register-resource --resource-arn arn:aws:s3:::<s3-bucket>/<s3-folder> --use-service-linked-role --region <REGION>
      ```

      注册了 Amazon S3 位置后，对于任何指向该位置（或其任何子位置）的 AWS Glue 表，`GetTable` 调用中的 `IsRegisteredWithLakeFormation` 参数都将返回值 true。

   1. 创建一个指向通过 Spark 注册的 Amazon S3 路径的 Delta 表：
**注意**  
以下示例属于 Python 示例。

      ```
      dataFrame.write \
      	.format("delta") \
      	.mode("overwrite") \
      	.partitionBy("<your_partitionkey_field>") \
      	.save("s3://<the_s3_path>")
      ```

      将数据写入 Amazon S3 后，使用 AWS Glue 爬网程序创建新的 Delta 目录表。有关更多信息，请参阅 [Introducing native Delta Lake table support with AWS Glue crawlers](https://aws.amazon.com/blogs/big-data/introducing-native-delta-lake-table-support-with-aws-glue-crawlers/)。

      您也可以通过 AWS Glue `CreateTable` API 手动创建表。

1. 向 AWS Glue 作业 IAM 角色授予 Lake Formation 权限。您可以通过 Lake Formation 控制台授予权限，也可以使用 AWS CLI 授予权限。有关更多信息，请参阅 [Granting table permissions using the Lake Formation console and the named resource method](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-table-permissions.html)。

1.  读取注册到 Lake Formation 的 Delta 表。代码与读取未注册的 Delta 表相同。请注意，AWS Glue 作业 IAM 角色需要具有 SELECT 权限才能成功读取。

   ```
   # Example: Read a Delta Lake table from Glue Data Catalog
   
   df = glueContext.create_data_frame.from_catalog(
       database="<your_database_name>",
       table_name="<your_table_name>",
       additional_options=additional_options
   )
   ```

1. 写入注册到 Lake Formation 的 Delta 表。代码与写入未注册的 Delta 表相同。请注意，AWS Glue 作业 IAM 角色需要具有 SUPER 权限才能成功写入。

   默认情况下，AWS Glue 会将 `Append` 作为 saveMode 使用。您可以通过设置 `additional_options` 中的 saveMode 选项来对其进行更改。要了解 Delta 表中对 saveMode 的支持，请参阅 [Write to a table](https://docs.delta.io/latest/delta-batch.html#write-to-a-table)。

   ```
   glueContext.write_data_frame.from_catalog(
       frame=dataFrame,
       database="<your_database_name>",
       table_name="<your_table_name>",
       additional_options=additional_options
   )
   ```

# 在 AWS Glue 中使用 Iceberg 框架
<a name="aws-glue-programming-etl-format-iceberg"></a>

AWS Glue 3.0 及更高版本支持数据湖的 Apache Iceberg 框架。Iceberg 提供了一种高性能的表格式，其工作原理与 SQL 表类似。本主题涵盖了在 Iceberg 表中传输或存储数据时，在 AWS Glue 中使用数据的可用功能。要了解有关 Iceberg 的更多信息，请参阅 [Apache Iceberg 官方文档](https://iceberg.apache.org/docs/latest/)。

您可以使用 AWS Glue 对 Amazon S3 中的 Iceberg 表执行读写操作，也可以使用 AWS Glue 数据目录处理 Iceberg 表。还支持其他操作，包括插入和所有 [Spark 查询](https://iceberg.apache.org/docs/latest/spark-queries/) [Spark 写入](https://iceberg.apache.org/docs/latest/spark-writes/)。Iceberg 表不支持更新。

**注意**  
`ALTER TABLE … RENAME TO` 不适用于 Apache Iceberg 0.13.1 for AWS Glue 3.0。

下表列出了 AWS Glue 每个版本中包含的 Iceberg 版本。


****  

| AWS Glue 版本 | 支持 Iceberg 版本 | 
| --- | --- | 
| 5.1 | 1.10.0 | 
| 5.0 | 1.7.1 | 
| 4.0 | 1.0.0 | 
| 3.0 | 0.13.1 | 

要了解有关 AWS Glue 支持的数据湖框架的更多信息，请参阅[在 AWS Glue ETL 任务中使用数据湖框架](aws-glue-programming-etl-datalake-native-frameworks.md)。

## 启用 Iceberg 框架
<a name="aws-glue-programming-etl-format-iceberg-enable"></a>

要启用 Iceberg for AWS Glue，请完成以下任务：
+ 指定 `iceberg` 作为 `--datalake-formats` 作业参数的值。有关更多信息，请参阅 [在 AWS Glue 作业中使用作业参数](aws-glue-programming-etl-glue-arguments.md)。
+ `--conf` 为 Glue 作业创建一个名为 AWS 的密钥，并将其设置为以下值。或者，您可以在脚本中使用 `SparkConf` 设置以下配置。这些设置有助于 Apache Spark 正确处理 Iceberg 表。

  ```
  spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions 
  --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog 
  --conf spark.sql.catalog.glue_catalog.warehouse=s3://<your-warehouse-dir>/ 
  --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog 
  --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
  ```

  如果正在读取或写入注册到 Lake Formation 的 Iceberg 表，请按照 AWS Glue 5.0 及更高版本中 [将 AWS Glue 与 AWS Lake Formation 结合使用以进行精细访问控制](security-lf-enable.md) 中的指南进行操作。在 AWS Glue 4.0 中，添加以下配置来启用 Lake Formation 支持。

  ```
  --conf spark.sql.catalog.glue_catalog.glue.lakeformation-enabled=true
  --conf spark.sql.catalog.glue_catalog.glue.id=<table-catalog-id>
  ```

  如果您将 AWS Glue 3.0 与 Iceberg 0.13.1 一起使用，则必须设置以下附加配置才能使用 Amazon DynamoDB 锁定管理器来确保原子交易。AWSGlue 4.0 或更高版本默认使用乐观锁。有关更多信息，请参阅 Apache Iceberg 官方文档中的 [Iceberg AWS 集成](https://iceberg.apache.org/docs/latest/aws/#dynamodb-lock-manager)。

  ```
  --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager 
  --conf spark.sql.catalog.glue_catalog.lock.table=<your-dynamodb-table-name>
  ```

**使用不同的 Iceberg 版本**

要使用 AWS Glue 不支持的 Iceberg 版本，请使用 `--extra-jars` 作业参数指定您自己的 Iceberg JAR 文件。请勿包含 `iceberg` 作为 `--datalake-formats` 参数的值。如果使用 AWS Glue 5.0 或更高版本，则必须设置 `--user-jars-first true` 作业参数。

**为 Iceberg 表启用加密**

**注意**  
Iceberg 表有自己的用于启用服务器端加密的机制。除了 AWS Glue 的安全配置外，您还应该启用此配置。

要在 Iceberg 表上启用服务器端加密，请查看 [Iceberg 文档](https://iceberg.apache.org/docs/latest/aws/#s3-server-side-encryption)中的指南。

**为 Iceberg 跨区域表访问添加 Spark 配置**

要通过 AWS Glue Data Catalog 和 AWS Lake Formation 为 Iceberg 跨区域表访问添加额外的 Spark 配置，请按照以下步骤操作：

1. 创建[多区域接入点](https://docs.aws.amazon.com/AmazonS3/latest/userguide/multi-region-access-point-create-examples.html)。

1. 设置以下 Spark 属性：

   ```
   -----
       --conf spark.sql.catalog.my_catalog.s3.use-arn-region-enabled=true \
       --conf spark.sql.catalog.{CATALOG}.s3.access-points.bucket1", "arn:aws:s3::<account-id>:accesspoint/<mrap-id>.mrap \
       --conf spark.sql.catalog.{CATALOG}.s3.access-points.bucket2", "arn:aws:s3::<account-id>:accesspoint/<mrap-id>.mrap
   -----
   ```

## 示例：将 Iceberg 表写入 Amazon S3 并将其注册到 AWS Glue 数据目录
<a name="aws-glue-programming-etl-format-iceberg-write"></a>

此示例脚本演示了如何将 Iceberg 表写入 Amazon S3。该示例使用 [IcebergAWS 集成](https://iceberg.apache.org/docs/latest/aws/)将表注册到 AWS Glue 数据目录。

------
#### [ Python ]

```
# Example: Create an Iceberg table from a DataFrame 
# and register the table to Glue Data Catalog

dataFrame.createOrReplaceTempView("tmp_<your_table_name>")

query = f"""
CREATE TABLE glue_catalog.<your_database_name>.<your_table_name>
USING iceberg
TBLPROPERTIES ("format-version"="2")
AS SELECT * FROM tmp_<your_table_name>
"""
spark.sql(query)
```

------
#### [ Scala ]

```
// Example: Example: Create an Iceberg table from a DataFrame
// and register the table to Glue Data Catalog

dataFrame.createOrReplaceTempView("tmp_<your_table_name>")

val query = """CREATE TABLE glue_catalog.<your_database_name>.<your_table_name>
USING iceberg
TBLPROPERTIES ("format-version"="2")
AS SELECT * FROM tmp_<your_table_name>
"""
spark.sql(query)
```

------

或者，您可以使用 Spark 方法将 Iceberg 表写入 Amazon S3 和 Data Catalog。

先决条件：您需要预置目录以供 Iceberg 库使用。使用 AWS Glue Data Catalog 时，AWS Glue 让这一切变得简单明了。AWS Glue Data Catalog 已预先配置为供 Spark 库作为 `glue_catalog` 使用。Data Catalog 表由 *databaseName* 和 *tableName* 标识。有关 AWS Glue Data Catalog 的更多信息，请参阅 [AWS Glue 中的数据发现和编目](catalog-and-crawler.md)。

如果您不使用 AWS Glue Data Catalog ，则需要通过 Spark API 配置目录。有关更多信息，请参阅 Iceberg 文档中的 [Spark Configuration](https://iceberg.apache.org/docs/latest/spark-configuration/)。

此示例使用 Spark 从将 Iceberg 表写入 Amazon S3 和 Data Catalog 中。

------
#### [ Python ]

```
# Example: Write an Iceberg table to S3 on the Glue Data Catalog

# Create (equivalent to CREATE TABLE AS SELECT)
dataFrame.writeTo("glue_catalog.databaseName.tableName") \
    .tableProperty("format-version", "2") \
    .create()

# Append (equivalent to INSERT INTO)
dataFrame.writeTo("glue_catalog.databaseName.tableName") \
    .tableProperty("format-version", "2") \
    .append()
```

------
#### [ Scala ]

```
// Example: Write an Iceberg table to S3 on the Glue Data Catalog

// Create (equivalent to CREATE TABLE AS SELECT)
dataFrame.writeTo("glue_catalog.databaseName.tableName")
    .tableProperty("format-version", "2")
    .create()

// Append (equivalent to INSERT INTO)
dataFrame.writeTo("glue_catalog.databaseName.tableName")
    .tableProperty("format-version", "2")
    .append()
```

------

## 示例：使用 AWS Glue 数据目录从 Amazon S3 读取 Iceberg 表
<a name="aws-glue-programming-etl-format-iceberg-read"></a>

此示例读取您在 [示例：将 Iceberg 表写入 Amazon S3 并将其注册到 AWS Glue 数据目录](#aws-glue-programming-etl-format-iceberg-write) 中创建的 Iceberg 表。

------
#### [ Python ]

在本示例中，使用 `GlueContext.create\$1data\$1frame.from\$1catalog()` 方法。

```
# Example: Read an Iceberg table from Glue Data Catalog

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

df = glueContext.create_data_frame.from_catalog(
    database="<your_database_name>",
    table_name="<your_table_name>",
    additional_options=additional_options
)
```

------
#### [ Scala ]

在本示例中，使用 [getCatalogSource](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSource) 方法。

```
// Example: Read an Iceberg table from Glue Data Catalog

import com.amazonaws.services.glue.GlueContext
import org.apacke.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val df = glueContext.getCatalogSource("<your_database_name>", "<your_table_name>",
      additionalOptions = additionalOptions)
      .getDataFrame()
  }
}
```

------

## 示例：使用 AWS Glue 数据目录在 Amazon S3 将 `DataFrame` 插入 Iceberg 表
<a name="aws-glue-programming-etl-format-iceberg-insert"></a>

此示例将数据插入您在 [示例：将 Iceberg 表写入 Amazon S3 并将其注册到 AWS Glue 数据目录](#aws-glue-programming-etl-format-iceberg-write) 中创建的 Iceberg 表。

**注意**  
此示例要求您设置 `--enable-glue-datacatalog` 任务参数，才能将 AWS Glue Data Catalog 用作 Apache Spark Hive 元存储。要了解更多信息，请参阅[在 AWS Glue 作业中使用作业参数](aws-glue-programming-etl-glue-arguments.md)。

------
#### [ Python ]

在本示例中，使用 `GlueContext.write\$1data\$1frame.from\$1catalog()` 方法。

```
# Example: Insert into an Iceberg table from Glue Data Catalog

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

glueContext.write_data_frame.from_catalog(
    frame=dataFrame,
    database="<your_database_name>",
    table_name="<your_table_name>",
    additional_options=additional_options
)
```

------
#### [ Scala ]

在本示例中，使用 [getCatalogSink](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSink) 方法。

```
// Example: Insert into an Iceberg table from Glue Data Catalog

import com.amazonaws.services.glue.GlueContext
import org.apacke.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    glueContext.getCatalogSink("<your_database_name>", "<your_table_name>",
      additionalOptions = additionalOptions)
      .writeDataFrame(dataFrame, glueContext)
  }
}
```

------

## 示例：使用 Spark 从 Amazon S3 读取 Iceberg 表
<a name="aws-glue-programming-etl-format-iceberg-read-spark"></a>

先决条件：您需要预置目录以供 Iceberg 库使用。使用 AWS Glue Data Catalog 时，AWS Glue 让这一切变得简单明了。AWS Glue Data Catalog 已预先配置为供 Spark 库作为 `glue_catalog` 使用。Data Catalog 表由 *databaseName* 和 *tableName* 标识。有关 AWS Glue Data Catalog 的更多信息，请参阅 [AWS Glue 中的数据发现和编目](catalog-and-crawler.md)。

如果您不使用 AWS Glue Data Catalog ，则需要通过 Spark API 配置目录。有关更多信息，请参阅 Iceberg 文档中的 [Spark Configuration](https://iceberg.apache.org/docs/latest/spark-configuration/)。

此示例使用 Spark 从 Data Catalog 读取 Amazon S3 中的 Iceberg 表。

------
#### [ Python ]

```
# Example: Read an Iceberg table on S3 as a DataFrame from the Glue Data Catalog

dataFrame = spark.read.format("iceberg").load("glue_catalog.databaseName.tableName")
```

------
#### [ Scala ]

```
// Example: Read an Iceberg table on S3 as a DataFrame from the Glue Data Catalog

val dataFrame = spark.read.format("iceberg").load("glue_catalog.databaseName.tableName")
```

------

## 示例：读取和写入具有 Lake Formation 权限控制的 Iceberg 表
<a name="aws-glue-programming-etl-format-iceberg-read-write-lake-formation-tables"></a>

此示例将读取和写入一个具有 Lake Formation 权限控制的 Iceberg 表。

**注意**  
此示例仅适用于 AWS Glue 4.0。在 AWS Glue 5.0 及更高版本中，请按照 [将 AWS Glue 与 AWS Lake Formation 结合使用以进行精细访问控制](security-lf-enable.md) 中的指南进行操作。

1. 创建一个 Iceberg 表并将其注册到 Lake Formation：

   1. 要启用 Lake Formation 权限控制，您首先需要将表的 Amazon S3 路径注册到 Lake Formation。有关更多信息，请参阅 [Registering an Amazon S3 location](https://docs.aws.amazon.com/lake-formation/latest/dg/register-location.html)（注册 Amazon S3 位置）。您可以通过 Lake Formation 控制台或使用 AWS CLI 进行注册：

      ```
      aws lakeformation register-resource --resource-arn arn:aws:s3:::<s3-bucket>/<s3-folder> --use-service-linked-role --region <REGION>
      ```

      注册了 Amazon S3 位置后，对于任何指向该位置（或其任何子位置）的 AWS Glue 表，`GetTable` 调用中的 `IsRegisteredWithLakeFormation` 参数都将返回值 true。

   1. 创建一个指向通过 Spark SQL 注册的路径的 Iceberg 表：
**注意**  
以下示例属于 Python 示例。

      ```
      dataFrame.createOrReplaceTempView("tmp_<your_table_name>")
      
      query = f"""
      CREATE TABLE glue_catalog.<your_database_name>.<your_table_name>
      USING iceberg
      AS SELECT * FROM tmp_<your_table_name>
      """
      spark.sql(query)
      ```

      您也可以通过 AWS Glue `CreateTable` API 手动创建表。有关更多信息，请参阅 [Creating Apache Iceberg tables](https://docs.aws.amazon.com/lake-formation/latest/dg/creating-iceberg-tables.html)。
**注意**  
该 `UpdateTable` API 目前不支持 Iceberg 表格式作为操作的输入。

1. 向作业 IAM 角色授予 Lake Formation 权限。您可以通过 Lake Formation 控制台授予权限，也可以使用 AWS CLI 授予权限。有关更多信息，请参阅 https://docs.aws.amazon.com/lake-formation/latest/dg/granting-table-permissions.html

1. 读取注册到 Lake Formation 的 Iceberg 表。代码与读取未注册的 Iceberg 表相同。请注意，您的 AWS Glue 作业 IAM 角色需要具有 SELECT 权限才能成功读取。

   ```
   # Example: Read an Iceberg table from the AWS Glue Data Catalog
   from awsglue.context import GlueContextfrom pyspark.context import SparkContext
   
   sc = SparkContext()
   glueContext = GlueContext(sc)
   
   df = glueContext.create_data_frame.from_catalog(
       database="<your_database_name>",
       table_name="<your_table_name>",
       additional_options=additional_options
   )
   ```

1. 写入注册到 Lake Formation 的 Iceberg 表。代码与写入未注册的 Iceberg 表相同。请注意，您的 AWS Glue 作业 IAM 角色需要具有 SUPER 权限才能成功写入。

   ```
   glueContext.write_data_frame.from_catalog(
       frame=dataFrame,
       database="<your_database_name>",
       table_name="<your_table_name>",
       additional_options=additional_options
   )
   ```