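# Data format options for inputs and outputs in AWS Glue for Spark
<a name="aws-glue-programming-etl-format"></a>

These pages describe feature support and configuration parameters for the data formats supported by AWS Glue for Spark. See the following sections for how this information is used and when it applies.

## Feature support across data formats in AWS Glue
<a name="aws-glue-programming-etl-format-features"></a>

Each data format may support different AWS Glue features. Whether your data format supports the following common features depends on its type. Refer to the documentation for your data format to understand how to use these features to meet your requirements.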


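| Feature | Description | 
| --- |--- |
| Read | AWS Glue can recognize and interpret this data format without additional resources, such as connectors. | 
| Write | AWS Glue can write data in this format without additional resources. You can include third-party libraries in your job and use standard Apache Spark functions to write data, as you would in other Spark environments. For more information about libraries, see [Using Python libraries with AWS Glue](aws-glue-programming-python-libraries.md). | 
| Streaming read | AWS Glue can recognize and interpret this data format from an Apache Kafka, Amazon Managed Streaming for Apache Kafka, or Amazon Kinesis message stream. We expect streams to present data in a consistent format, so they are read in as DataFrames. | 
| Group small files | AWS Glue can group files together to batch the work sent to each node when performing AWS Glue transforms. This can significantly improve performance for workloads involving large numbers of small files. For more information, see [Reading input files in larger groups](grouping-input-files.md). | 
| Job bookmarks | AWS Glue can use job bookmarks to track the progress of transforms that perform the same work on the same dataset across job runs. This can improve performance for workloads where only data added since the last job run needs to be processed. For more information, see [Tracking processed data using job bookmarks](monitor-continuations.md). | 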

## Parameters used to interact with data formats in AWS Glue
<a name="aws-glue-programming-etl-format-parameters"></a>

Certain AWS Glue connection types support multiple `format` types. For these, you specify information about your data format with a `format_options` object when using methods like `GlueContext.write_dynamic_frame.from_options`.
+ `s3` – For more information, see Connection types and options for ETL in AWS Glue: [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). You can also view the documentation for the methods that support this connection type: [create_dynamic_frame_from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) and [write_dynamic_frame_from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_options) in Python and the corresponding Scala methods [def getSourceWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat) and [def getSinkWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSinkWithFormat).
+ `kinesis` – For more information, see Connection types and options for ETL in AWS Glue: [Kinesis connection parameters](aws-glue-programming-etl-connect-kinesis-home.md#aws-glue-programming-etl-connect-kinesis). You can also view the documentation for the method that supports this connection type: [create_data_frame_from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create-dataframe-from-options) and the corresponding Scala method [def createDataFrameFromOptions](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-createDataFrameFromOptions).
+ `kafka` – For more information, see Connection types and options for ETL in AWS Glue: [Kafka connection parameters](aws-glue-programming-etl-connect-kafka-home.md#aws-glue-programming-etl-connect-kafka). You can also view the documentation for the method that supports this connection type: [create_data_frame_from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create-dataframe-from-options) and the corresponding Scala method [def createDataFrameFromOptions](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-createDataFrameFromOptions).

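Some connection types do not require `format_options`. For example, in normal use, a JDBC connection to a relational database retrieves data in a consistent, tabular data format. Therefore, reading from a JDBC connection does not require `format_options`.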
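The distinction above can be sketched in plain Python. The helper `build_read_kwargs` is hypothetical (it is not part of the AWS Glue library), and the JDBC URL and table name are placeholders; the sketch only shows which keyword arguments a call to `create_dynamic_frame.from_options` would carry for a multi-format connection type versus one that needs no format information.

```python
def build_read_kwargs(connection_type, connection_options,
                      data_format=None, format_options=None):
    """Hypothetical helper: assemble keyword arguments for a from_options call."""
    kwargs = {
        "connection_type": connection_type,
        "connection_options": connection_options,
    }
    # Only connection types that support several formats need format details.
    if data_format is not None:
        kwargs["format"] = data_format
        kwargs["format_options"] = format_options or {}
    return kwargs


# S3 supports multiple formats, so the format and its options are spelled out.
s3_csv_kwargs = build_read_kwargs(
    "s3", {"paths": ["s3://s3path"]}, "csv", {"withHeader": True}
)

# A JDBC source returns tabular data, so no format information is supplied.
# The URL and table name below are placeholders.
jdbc_kwargs = build_read_kwargs(
    "mysql", {"url": "jdbc:mysql://host:3306/db", "dbtable": "example_table"}
)
```

In a Glue job, either dictionary could then be unpacked into the reader, for example `glueContext.create_dynamic_frame.from_options(**s3_csv_kwargs)`.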

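Some methods for reading and writing data in AWS Glue do not require `format_options`. One example is `GlueContext.create_dynamic_frame.from_catalog` used with AWS Glue crawlers. Crawlers determine the shape of your data. When you use a crawler, an AWS Glue classifier examines your data to decide how to represent your data format. It then stores a representation of your data in the AWS Glue Data Catalog, which an AWS Glue ETL script can use to retrieve your data with the `GlueContext.create_dynamic_frame.from_catalog` method. Crawlers remove the need to manually specify information about your data format.

For jobs that access AWS Lake Formation governed tables, AWS Glue supports reading and writing all formats supported by Lake Formation governed tables. For the current list of formats supported by AWS Lake Formation governed tables, see [Notes and restrictions for governed tables](https://docs.aws.amazon.com/lake-formation/latest/dg/governed-table-restrictions.html) in the *AWS Lake Formation Developer Guide*.

**Note**  
For writing Apache Parquet, AWS Glue ETL supports writing to a governed table only when you specify the option for the custom Parquet writer type that is optimized for DynamicFrames. When writing to a governed table with the `parquet` format, add the key `useGlueParquetWriter` with a value of `true` in the table parameters.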

**Topics**
+ [Feature support across data formats in AWS Glue](#aws-glue-programming-etl-format-features)
+ [Parameters used to interact with data formats in AWS Glue](#aws-glue-programming-etl-format-parameters)
+ [Using the CSV format in AWS Glue](aws-glue-programming-etl-format-csv-home.md)
+ [Using the Parquet format in AWS Glue](aws-glue-programming-etl-format-parquet-home.md)
+ [Using the XML format in AWS Glue](aws-glue-programming-etl-format-xml-home.md)
+ [Using the Avro format in AWS Glue](aws-glue-programming-etl-format-avro-home.md)
+ [Using the grokLog format in AWS Glue](aws-glue-programming-etl-format-grokLog-home.md)
+ [Using the Ion format in AWS Glue](aws-glue-programming-etl-format-ion-home.md)
+ [Using the JSON format in AWS Glue](aws-glue-programming-etl-format-json-home.md)
+ [Using the ORC format in AWS Glue](aws-glue-programming-etl-format-orc-home.md)
+ [Using data lake frameworks with AWS Glue ETL jobs](aws-glue-programming-etl-datalake-native-frameworks.md)
+ [Shared configuration reference](#aws-glue-programming-etl-format-shared-reference)

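# Using the CSV format in AWS Glue
<a name="aws-glue-programming-etl-format-csv-home"></a>

AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the CSV data format, this document introduces the features available for using your data in AWS Glue.

AWS Glue supports the comma-separated value (CSV) format. This format is a minimal, row-based data format. CSV files often do not strictly conform to a standard, but you can refer to [RFC 4180](https://tools.ietf.org/html/rfc4180) and [RFC 7111](https://tools.ietf.org/html/rfc7111) for more information.

You can use AWS Glue to read CSV files from Amazon S3 and from streaming sources, and to write CSV files to Amazon S3. You can read and write `bzip` and `gzip` archives containing CSV files from S3. You configure compression behavior on the [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3) instead of in the configuration discussed on this page.

The following table shows which common AWS Glue features support the CSV format option.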


| Read | Write | Streaming read | Group small files | Job bookmarks | 
| --- | --- | --- | --- | --- | 
| Supported | Supported | Supported | Supported | Supported | 

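## Example: Read CSV files or folders from S3
<a name="aws-glue-programming-etl-format-csv-read"></a>

**Prerequisites:** You need the S3 paths (`s3path`) to the CSV files or folders that you want to read.

**Configuration:** In your function options, specify `format="csv"`. In your `connection_options`, use the `paths` key to specify `s3path`. You can configure how the reader interacts with S3 in `connection_options`. For details, see Connection types and options for ETL in AWS Glue: [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). You can configure how the reader interprets CSV files in your `format_options`. For details, see [CSV configuration reference](#aws-glue-programming-etl-format-csv-reference).

The following AWS Glue ETL script shows the process of reading CSV files or folders from S3.

We provide a custom CSV reader with performance optimizations for common workflows through the `optimizePerformance` configuration key. To determine whether this reader is right for your workload, see [Optimize read performance with the vectorized SIMD CSV reader](#aws-glue-programming-etl-format-simd-csv-reader).

------
#### [ Python ]

For this example, use the [create_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) method.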

```
# Example: Read CSV from S3
# For demonstration, we read a CSV with a header row, so we set the withHeader option.
# Consider whether optimizePerformance is right for your workflow.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="csv",
    format_options={
        "withHeader": True,
        # "optimizePerformance": True,
    },
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
dataFrame = spark.read\
    .format("csv")\
    .option("header", "true")\
    .load("s3://s3path")
```

------
#### [ Scala ]

For this example, use the [getSourceWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat) operation.

```
// Example: Read CSV from S3
// For demonstration, we read a CSV with a header row, so we set the withHeader option.
// Consider whether optimizePerformance is right for your workflow.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    
    val dynamicFrame = glueContext.getSourceWithFormat(
      formatOptions=JsonOptions("""{"withHeader": true}"""),
      connectionType="s3",
      format="csv",
      options=JsonOptions("""{"paths": ["s3://s3path"], "recurse": true}""")
    ).getDynamicFrame()
  }
}
```

You can also use DataFrames in a script (`org.apache.spark.sql.DataFrame`).

```
// Assumes `spark` is a SparkSession, for example glueContext.getSparkSession
val dataFrame = spark.read
  .option("header", "true")
  .format("csv")
  .load("s3://s3path")
```

------

## Example: Write CSV files and folders to S3
<a name="aws-glue-programming-etl-format-csv-write"></a>

**Prerequisites:** You need an initialized DataFrame (`dataFrame`) or DynamicFrame (`dynamicFrame`). You also need your expected S3 output path, `s3path`.

**Configuration:** In your function options, specify `format="csv"`. In your `connection_options`, use the `path` key to specify `s3path`, as the example scripts in this section do. You can configure how the writer interacts with S3 in `connection_options`. For details, see Connection types and options for ETL in AWS Glue: [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). You can configure how your operation writes the contents of your files in `format_options`. For details, see [CSV configuration reference](#aws-glue-programming-etl-format-csv-reference). The following AWS Glue ETL script shows the process of writing CSV files and folders to S3.

------
#### [ Python ]

For this example, use the [write_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_options) method.

```
# Example: Write CSV to S3
# For demonstration, customize how string values are written: set quoteChar to -1 so values are not quoted.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    connection_options={"path": "s3://s3path"},
    format="csv",
    format_options={
        "quoteChar": -1,
    },
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
dataFrame.write\
    .format("csv")\
    .option("quote", None)\
    .mode("append")\
    .save("s3://s3path")
```

------
#### [ Scala ]

For this example, use the [getSinkWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSinkWithFormat) method.

```
// Example: Write CSV to S3
// For demonstration, customize how string values are written: set quoteChar to -1 so values are not quoted.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    
    glueContext.getSinkWithFormat(
        connectionType="s3",
        options=JsonOptions("""{"path": "s3://s3path"}"""),
        format="csv",
        formatOptions=JsonOptions("""{"quoteChar": -1}""")
    ).writeDynamicFrame(dynamicFrame)
  }
}
```

You can also use DataFrames in a script (`org.apache.spark.sql.DataFrame`).

```
dataFrame.write
    .format("csv")
    .option("quote", null)
    .mode("Append")
    .save("s3://s3path")
```

------

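## CSV configuration reference
<a name="aws-glue-programming-etl-format-csv-reference"></a>

You can use the following `format_options` wherever AWS Glue libraries specify `format="csv"`:
+ `separator` – Specifies the delimiter character. The default is a comma, but any other character can be specified.
  + **Type:** Text, **Default:** `","`
+ `escaper` – Specifies a character to use for escaping. This option is used only when reading CSV files, not when writing. If enabled, the character that immediately follows is used as-is, except for a small set of well-known escapes (`\n`, `\r`, `\t`, and `\0`).
  + **Type:** Text, **Default:** none
+ `quoteChar` – Specifies the character to use for quoting. The default is a double quote. Set this to `-1` to turn off quoting entirely.
  + **Type:** Text, **Default:** `'"'`
+ `multiLine` – Specifies whether a single record can span multiple lines. This can occur when a field contains a quoted newline character. You must set this option to `True` if any record spans multiple lines. Enabling `multiLine` might decrease performance because it requires more cautious file splitting while parsing.
  + **Type:** Boolean, **Default:** `false`
+ `withHeader` – Specifies whether to treat the first line as a header. This option can be used in the `DynamicFrameReader` class.
  + **Type:** Boolean, **Default:** `false`
+ `writeHeader` – Specifies whether to write the header to output. This option can be used in the `DynamicFrameWriter` class.
  + **Type:** Boolean, **Default:** `true`
+ `skipFirst` – Specifies whether to skip the first data line.
  + **Type:** Boolean, **Default:** `false`
+ `optimizePerformance` – Specifies whether to use the advanced SIMD CSV reader along with Apache Arrow based columnar memory formats. Available only in AWS Glue 3.0+.
  + **Type:** Boolean, **Default:** `false`
+ `strictCheckForQuoting` – When writing CSV files, AWS Glue may add quotes to values it interprets as strings. This prevents ambiguity in what is written out. To save time when deciding what to write, AWS Glue may quote in certain situations where quotes are not necessary. Enabling the strict check performs a more intensive computation and quotes only when strictly necessary. Available only in AWS Glue 3.0+.
  + **Type:** Boolean, **Default:** `false`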
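As a rough illustration of how the documented defaults combine with the options you pass, the following plain-Python sketch merges overrides onto the defaults listed above and rejects misspelled keys. The names `CSV_FORMAT_DEFAULTS` and `resolve_csv_format_options` are hypothetical and are not part of the AWS Glue library; AWS Glue applies its own defaults internally.

```python
# Defaults copied from the CSV configuration reference above.
CSV_FORMAT_DEFAULTS = {
    "separator": ",",
    "escaper": None,  # documented default: none
    "quoteChar": '"',
    "multiLine": False,
    "withHeader": False,
    "writeHeader": True,
    "skipFirst": False,
    "optimizePerformance": False,
    "strictCheckForQuoting": False,
}


def resolve_csv_format_options(overrides):
    """Hypothetical helper: return the effective options for a set of overrides."""
    unknown = set(overrides) - set(CSV_FORMAT_DEFAULTS)
    if unknown:
        # Catch typos such as "withheader" before they reach a job run.
        raise ValueError(f"Unknown CSV format_options: {sorted(unknown)}")
    return {**CSV_FORMAT_DEFAULTS, **overrides}


effective = resolve_csv_format_options({"withHeader": True, "separator": "|"})
```

Only the overrides (here `{"withHeader": True, "separator": "|"}`) would need to be passed as `format_options`; the merged dictionary is useful mainly for reviewing what a job will effectively use.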

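## Optimize read performance with the vectorized SIMD CSV reader
<a name="aws-glue-programming-etl-format-simd-csv-reader"></a>

AWS Glue version 3.0 adds an optimized CSV reader that can significantly speed up overall job performance compared to row-based CSV readers.

The optimized reader:
+ Uses CPU SIMD instructions to read from disk
+ Immediately writes records to memory in a columnar format (Apache Arrow)
+ Divides the records into batches

This saves processing time later, when records would otherwise be batched or converted to a columnar format. Examples include changing schemas or retrieving data by column.

To use the optimized reader, set `"optimizePerformance"` to `true` in the `format_options` or in the table properties.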

```
dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3", 
    connection_options = {"paths": ["s3://s3path"]}, 
    format = "csv", 
    format_options={
        "optimizePerformance": True, 
        "separator": ","
        }, 
    transformation_ctx = "datasource0")
```

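**Limitations for the vectorized CSV reader**  
Note the following limitations of the vectorized CSV reader:
+ It does not support the `multiLine` and `escaper` format options. It uses the default `escaper` of the double quote character `'"'`. When these options are set, AWS Glue automatically falls back to the row-based CSV reader.
+ It does not support creating a DynamicFrame with [ChoiceType](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-types.html#aws-glue-api-crawler-pyspark-extensions-types-awsglue-choicetype).
+ It does not support creating a DynamicFrame with [error records](https://docs.aws.amazon.com/glue/latest/dg/glue-etl-scala-apis-glue-dynamicframe-class.html#glue-etl-scala-apis-glue-dynamicframe-class-defs-errorsAsDynamicFrame).
+ It does not support reading CSV files with multibyte characters, such as Japanese or Chinese characters.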
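The fallback rule in the first limitation can be checked before a job runs. The sketch below is plain Python with a hypothetical function name; it assumes that "set" means `multiLine` is true or `escaper` has a value, which is one reading of the rule above rather than documented behavior.

```python
def uses_vectorized_reader(format_options):
    """Hypothetical check: would these options keep the vectorized CSV reader?

    Assumption: the fallback triggers when multiLine is true or when
    escaper has any value.
    """
    if not format_options.get("optimizePerformance", False):
        return False  # the optimized reader was never requested
    if format_options.get("multiLine", False):
        return False  # falls back to the row-based reader
    if format_options.get("escaper") is not None:
        return False  # falls back to the row-based reader
    return True


fast = uses_vectorized_reader({"optimizePerformance": True, "separator": ","})
slow = uses_vectorized_reader({"optimizePerformance": True, "multiLine": True})
```

A check like this does not cover the other limitations (ChoiceType, error records, multibyte characters), which depend on the data rather than on the options.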

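# Using the Parquet format in AWS Glue
<a name="aws-glue-programming-etl-format-parquet-home"></a>

AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the Parquet data format, this document introduces the features available for using your data in AWS Glue.

AWS Glue supports the Parquet format. This format is a performance-oriented, column-based data format. For an introduction to the format by the standard authority, see [Apache Parquet Documentation Overview](https://parquet.apache.org/docs/overview/).

You can use AWS Glue to read Parquet files from Amazon S3 and from streaming sources, and to write Parquet files to Amazon S3. You can read and write `bzip` and `gzip` archives containing Parquet files from S3. You configure compression behavior on the [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3) instead of in the configuration discussed on this page.

The following table shows which common AWS Glue features support the Parquet format option.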


| Read | Write | Streaming read | Group small files | Job bookmarks | 
| --- | --- | --- | --- | --- | 
| Supported | Supported | Supported | Unsupported | Supported\* | 

\* Supported in AWS Glue version 1.0+

## Example: Read Parquet files or folders from S3
<a name="aws-glue-programming-etl-format-parquet-read"></a>

**Prerequisites:** You need the S3 paths (`s3path`) to the Parquet files or folders that you want to read.

**Configuration:** In your function options, specify `format="parquet"`. In your `connection_options`, use the `paths` key to specify `s3path`.

You can configure how the reader interacts with S3 in `connection_options`. For details, see Connection types and options for ETL in AWS Glue: [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3).

You can configure how the reader interprets Parquet files in your `format_options`. For details, see [Parquet configuration reference](#aws-glue-programming-etl-format-parquet-reference).

The following AWS Glue ETL script shows the process of reading Parquet files or folders from S3:

------
#### [ Python ]

For this example, use the [create_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) method.

```
# Example: Read Parquet from S3

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3", 
    connection_options = {"paths": ["s3://s3path/"]}, 
    format = "parquet"
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
dataFrame = spark.read.parquet("s3://s3path/")
```

------
#### [ Scala ]

For this example, use the [getSourceWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat) method.

```
// Example: Read Parquet from S3

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    
    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType="s3",
      format="parquet",
      options=JsonOptions("""{"paths": ["s3://s3path"]}""")
    ).getDynamicFrame()
  }
}
```

You can also use DataFrames in a script (`org.apache.spark.sql.DataFrame`).

```
spark.read.parquet("s3://s3path/")
```

------

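## Example: Write Parquet files and folders to S3
<a name="aws-glue-programming-etl-format-parquet-write"></a>

**Prerequisites:** You need an initialized DataFrame (`dataFrame`) or DynamicFrame (`dynamicFrame`). You also need your expected S3 output path, `s3path`.

**Configuration:** In your function options, specify `format="parquet"`. In your `connection_options`, use the `path` key to specify `s3path`, as the example scripts in this section do.

You can further alter how the writer interacts with S3 in `connection_options`. For details, see Connection types and options for ETL in AWS Glue: [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). You can configure how your operation writes the contents of your files in `format_options`. For details, see [Parquet configuration reference](#aws-glue-programming-etl-format-parquet-reference).

The following AWS Glue ETL script shows the process of writing Parquet files and folders to S3.

We provide a custom Parquet writer with performance optimizations for DynamicFrames through the `useGlueParquetWriter` configuration key. To determine whether this writer is right for your workload, see [Glue Parquet writer](#aws-glue-programming-etl-format-glue-parquet-writer).

------
#### [ Python ]

For this example, use the [write_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_options) method.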

```
# Example: Write Parquet to S3
# Consider whether useGlueParquetWriter is right for your workflow.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    format="parquet",
    connection_options={
        "path": "s3://s3path",
    },
    format_options={
        # "useGlueParquetWriter": True,
    },
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
dataFrame.write.parquet("s3://s3path/")
```

------
#### [ Scala ]

For this example, use the [getSinkWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSinkWithFormat) method.

```
// Example: Write Parquet to S3
// Consider whether useGlueParquetWriter is right for your workflow.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    
    glueContext.getSinkWithFormat(
        connectionType="s3",
        options=JsonOptions("""{"path": "s3://s3path"}"""),
        format="parquet"
    ).writeDynamicFrame(dynamicFrame)
  }
}
```

You can also use DataFrames in a script (`org.apache.spark.sql.DataFrame`).

```
dataFrame.write.parquet("s3://s3path/")
```

------

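## Parquet configuration reference
<a name="aws-glue-programming-etl-format-parquet-reference"></a>

You can use the following `format_options` wherever AWS Glue libraries specify `format="parquet"`:
+ `useGlueParquetWriter` – Specifies the use of a custom Parquet writer that has performance optimizations for DynamicFrame workflows. For usage details, see [Glue Parquet writer](#aws-glue-programming-etl-format-glue-parquet-writer).
  + **Type:** Boolean, **Default:** `false`
+ `compression` – Specifies the compression codec used. Values are fully compatible with `org.apache.parquet.hadoop.metadata.CompressionCodecName`.
  + **Type:** Enumerated text, **Default:** `"snappy"`
  + Values: `"uncompressed"`, `"snappy"`, `"gzip"`, and `"lzo"`
+ `blockSize` – Specifies the size in bytes of a row group being buffered in memory. You can use this to tune performance. The size should divide exactly into a whole number of megabytes.
  + **Type:** Numerical, **Default:** `134217728`
  + The default value is equal to 128 MB.
+ `pageSize` – Specifies the size in bytes of a page. You can use this to tune performance. A page is the smallest unit that must be read fully to access a single record.
  + **Type:** Numerical, **Default:** `1048576`
  + The default value is equal to 1 MB.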
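Because `blockSize` and `pageSize` are given in bytes, it can help to derive them from megabyte counts rather than typing the numbers by hand. The following plain-Python sketch reproduces the documented defaults; `mib_to_bytes` is a hypothetical helper, and it assumes that "MB" in this reference means 1024 × 1024 bytes, which is what the default values imply.

```python
MIB = 1024 * 1024  # bytes per megabyte, as implied by the documented defaults


def mib_to_bytes(mebibytes):
    """Hypothetical helper: convert a whole number of megabytes to bytes."""
    return mebibytes * MIB


# Equivalent to the documented defaults, written out explicitly.
parquet_format_options = {
    "compression": "snappy",
    "blockSize": mib_to_bytes(128),
    "pageSize": mib_to_bytes(1),
}
```

Deriving the value this way also guarantees that `blockSize` divides exactly into a whole number of megabytes, as the reference recommends.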

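**Note**  
In addition, any options that are accepted by the underlying SparkSQL code can be passed to this format by way of the `connection_options` map parameter. For example, you can set a Spark configuration such as [mergeSchema](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging) for the AWS Glue Spark reader to merge the schema for all files.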

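## Optimize write performance with the AWS Glue Parquet writer
<a name="aws-glue-programming-etl-format-glue-parquet-writer"></a>

**Note**  
The AWS Glue Parquet writer has historically been accessed through the `glueparquet` format type. This access pattern is no longer recommended. Instead, use the `parquet` type with `useGlueParquetWriter` enabled.

The AWS Glue Parquet writer has performance enhancements that allow faster Parquet file writes. The traditional writer computes a schema before writing. The Parquet format does not store the schema in a quickly retrievable fashion, so this can take some time. With the AWS Glue Parquet writer, a precomputed schema is not required. The writer computes and modifies the schema dynamically as data comes in.

Note the following limitations when you specify `useGlueParquetWriter`:
+ The writer supports only schema evolution (such as adding or removing columns), not changing column types, such as with `ResolveChoice`.
+ The writer does not support writing empty DataFrames, for example, to write a schema-only file. When integrating with the AWS Glue Data Catalog by setting `enableUpdateCatalog=True`, attempting to write an empty DataFrame does not update the Data Catalog. This results in a table without a schema being created in the Data Catalog.

If your transform is not affected by these limitations, turning on the AWS Glue Parquet writer should increase performance.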

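# Using the XML format in AWS Glue
<a name="aws-glue-programming-etl-format-xml-home"></a>

AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the XML data format, this document introduces the features available for using your data in AWS Glue.

AWS Glue supports the XML format. This format represents highly configurable, rigidly defined data structures that are not row or column based. XML is highly standardized. For an introduction to the format by the standard authority, see [XML Essentials](https://www.w3.org/standards/xml/core).

You can use AWS Glue to read XML files from Amazon S3, as well as `bzip` and `gzip` archives containing XML files. You configure compression behavior on the [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3) instead of in the configuration discussed on this page.

The following table shows which common AWS Glue features support the XML format option.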


| Read | Write | Streaming read | Group small files | Job bookmarks | 
| --- | --- | --- | --- | --- | 
| Supported | Unsupported | Unsupported | Supported | Supported | 

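## Example: Read XML from S3
<a name="aws-glue-programming-etl-format-xml-read"></a>

The XML reader takes an XML tag name. It examines elements with that tag within its input to infer a schema, and populates a DynamicFrame with corresponding values. The AWS Glue XML functionality behaves similarly to the [XML Data Source for Apache Spark](https://github.com/databricks/spark-xml). You might gain insight into basic behavior by comparing this reader to that project's documentation.

**Prerequisites:** You need the S3 paths (`s3path`) to the XML files or folders that you want to read, and some information about your XML file. You also need the tag of the XML element that you want to read, `xmlTag`.

**Configuration:** In your function options, specify `format="xml"`. In your `connection_options`, use the `paths` key to specify `s3path`. You can further configure how the reader interacts with S3 in `connection_options`. For details, see Connection types and options for ETL in AWS Glue: [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). In your `format_options`, use the `rowTag` key to specify `xmlTag`. You can further configure how the reader interprets XML files in your `format_options`. For details, see [XML configuration reference](#aws-glue-programming-etl-format-xml-reference).

The following AWS Glue ETL script shows the process of reading XML files or folders from S3.

------
#### [ Python ]

For this example, use the [create_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) method.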

```
# Example: Read XML from S3
# Set the rowTag option to configure the reader.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="xml",
    format_options={"rowTag": "xmlTag"},
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
dataFrame = spark.read\
    .format("xml")\
    .option("rowTag", "xmlTag")\
    .load("s3://s3path")
```

------
#### [ Scala ]

For this example, use the [getSourceWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat) operation.

```
// Example: Read XML from S3
// Set the rowTag option to configure the reader.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(SparkContext.getOrCreate())
    val sparkSession: SparkSession = glueContext.getSparkSession

    val dynamicFrame = glueContext.getSourceWithFormat(
      formatOptions=JsonOptions("""{"rowTag": "xmlTag"}"""), 
      connectionType="s3", 
      format="xml", 
      options=JsonOptions("""{"paths": ["s3://s3path"], "recurse": true}""")
    ).getDynamicFrame()
  }
}
```

You can also use DataFrames in a script (`org.apache.spark.sql.DataFrame`).

```
// Uses the SparkSession obtained from glueContext.getSparkSession
val dataFrame = sparkSession.read
  .option("rowTag", "xmlTag")
  .format("xml")
  .load("s3://s3path")
```

------

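## XML configuration reference
<a name="aws-glue-programming-etl-format-xml-reference"></a>

You can use the following `format_options` wherever AWS Glue libraries specify `format="xml"`:
+ `rowTag` – Specifies the XML tag in the file to treat as a row. Row tags cannot be self-closing.
  + **Type:** Text, **Required**
+ `encoding` – Specifies the character encoding. It can be the name or alias of a [Charset](https://docs.oracle.com/javase/8/docs/api/java/nio/charset/Charset.html) supported by our runtime environment. We do not make specific guarantees around encoding support, but major encodings should work.
  + **Type:** Text, **Default:** `"UTF-8"`
+ `excludeAttribute` – Specifies whether to exclude attributes in elements.
  + **Type:** Boolean, **Default:** `false`
+ `treatEmptyValuesAsNulls` – Specifies whether to treat white space as a null value.
  + **Type:** Boolean, **Default:** `false`
+ `attributePrefix` – A prefix for attributes that differentiates them from child element text. This prefix is used for field names.
  + **Type:** Text, **Default:** `"_"`
+ `valueTag` – The tag used for a value when an element has attributes but no child elements.
  + **Type:** Text, **Default:** `"_VALUE"`
+ `ignoreSurroundingSpaces` – Specifies whether the white space that surrounds values should be ignored.
  + **Type:** Boolean, **Default:** `false`
+ `withSchema` – Contains the expected schema, for situations where you want to override the inferred schema. If you do not use this option, AWS Glue infers the schema from the XML data.
  + **Type:** Text, **Default:** Not applicable
  + The value should be a JSON object that represents a `StructType`.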
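To make `rowTag`, `attributePrefix`, `valueTag`, and `excludeAttribute` more concrete, the following sketch mimics their described effect with Python's standard `xml.etree.ElementTree`. It is an approximation for illustration only and is not the AWS Glue reader: the function names and sample document are hypothetical, every value stays a string (no type inference), and repeated child tags simply overwrite each other.

```python
import xml.etree.ElementTree as ET


def element_to_value(elem, attribute_prefix="_", value_tag="_VALUE",
                     exclude_attribute=False):
    """Convert one element to a dict or string, spark-xml style (approximation)."""
    fields = {}
    if not exclude_attribute:
        for name, value in elem.attrib.items():
            fields[attribute_prefix + name] = value
    children = list(elem)
    if not children:
        text = elem.text or ""
        if not fields:
            return text  # plain leaf element: just its text
        fields[value_tag] = text  # attributes but no children: text goes under valueTag
        return fields
    for child in children:
        fields[child.tag] = element_to_value(
            child, attribute_prefix, value_tag, exclude_attribute
        )
    return fields


def rows_from_xml(xml_text, row_tag, **options):
    """Treat every element named row_tag as one row."""
    root = ET.fromstring(xml_text)
    return [element_to_value(elem, **options) for elem in root.iter(row_tag)]


SAMPLE = ('<books><book id="1"><title lang="en">Glue</title>'
          '<pages>10</pages></book></books>')
rows = rows_from_xml(SAMPLE, "book")
rows_without_attributes = rows_from_xml(SAMPLE, "book", exclude_attribute=True)
```

With the defaults, the `id` attribute becomes the field `_id`, and the `title` element, which has an attribute but no children, becomes a nested record holding `_lang` and `_VALUE`.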

## Manually specify the XML schema
<a name="aws-glue-programming-etl-format-xml-withschema"></a>

**Manual XML schema example**

This example uses the `withSchema` format option to specify the schema for XML data.

```
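import json

from awsglue.context import GlueContext
from awsglue.gluetypes import *
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Expected schema, including a nested struct and a field that may be int or string.
schema = StructType([ 
  Field("id", IntegerType()),
  Field("name", StringType()),
  Field("nested", StructType([
    Field("x", IntegerType()),
    Field("y", StringType()),
    Field("z", ChoiceType([IntegerType(), StringType()]))
  ]))
])

# withSchema expects the schema serialized as a JSON string.
datasource0 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3", 
    connection_options={"paths": ["s3://xml_bucket/someprefix"]},
    format="xml", 
    format_options={"withSchema": json.dumps(schema.jsonValue())},
    transformation_ctx = ""
)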
```

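# Using the Avro format in AWS Glue
<a name="aws-glue-programming-etl-format-avro-home"></a>

AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the Avro data format, this document introduces the features available for using your data in AWS Glue.

AWS Glue supports the Avro format. This format is a performance-oriented, row-based data format. For an introduction to the format by the standard authority, see [Apache Avro 1.8.2 Documentation](https://avro.apache.org/docs/1.8.2/).

You can use AWS Glue to read Avro files from Amazon S3 and from streaming sources, and to write Avro files to Amazon S3. You can read and write `bzip2` and `gzip` archives containing Avro files from S3. Additionally, you can write `deflate`, `snappy`, and `xz` archives containing Avro files. You configure compression behavior on the [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3) instead of in the configuration discussed on this page.

The following table shows which common AWS Glue features support the Avro format option.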


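| Read | Write | Streaming read | Group small files | Job bookmarks | 
| --- | --- | --- | --- | --- | 
| Supported | Supported | Supported\* | Unsupported | Supported | 

\* Supported with restrictions. For more information, see [Notes and restrictions for Avro streaming sources](add-job-streaming.md#streaming-avro-notes).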

## Example: Read Avro files or folders from S3
<a name="aws-glue-programming-etl-format-avro-read"></a>

**Prerequisites:** You need the S3 paths (`s3path`) to the Avro files or folders that you want to read.

**Configuration:** In your function options, specify `format="avro"`. In your `connection_options`, use the `paths` key to specify `s3path`. You can configure how the reader interacts with S3 in `connection_options`. For details, see Connection types and options for ETL in AWS Glue: [Amazon S3 connection option reference](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). You can configure how the reader interprets Avro files in your `format_options`. For details, see [Avro configuration reference](#aws-glue-programming-etl-format-avro-reference).

The following AWS Glue ETL script shows the process of reading Avro files or folders from S3:

------
#### [ Python ]

For this example, use the [create_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) method.

```
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="avro"
)
```

------
#### [ Scala ]

For this example, use the [getSourceWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat) operation.

```
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType="s3",
      format="avro",
      options=JsonOptions("""{"paths": ["s3://s3path"]}""")
    ).getDynamicFrame()
  }
}
```

------

## Example: Write Avro files and folders to Amazon S3
<a name="aws-glue-programming-etl-format-avro-write"></a>

**Prerequisites:** You need an initialized DataFrame (`dataFrame`) or DynamicFrame (`dynamicFrame`). You also need your expected S3 output path, `s3path`.

**Configuration:** In your function options, specify `format="avro"`. In your `connection_options`, use the `path` key to specify `s3path`, as the example scripts in this section do. You can further alter how the writer interacts with S3 in `connection_options`. For details, see Connection types and options for ETL in AWS Glue: [Amazon S3 connection option reference](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). You can configure how the writer produces Avro files in your `format_options`. For details, see [Avro configuration reference](#aws-glue-programming-etl-format-avro-reference).

The following AWS Glue ETL script shows the process of writing Avro files or folders to Amazon S3.

------
#### [ Python ]

For this example, use the [write_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_options) method.

```
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    format="avro",
    connection_options={
        "path": "s3://s3path"
    }
)
```

------
#### [ Scala ]

For this example, use the [getSinkWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSinkWithFormat) method.

```
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    glueContext.getSinkWithFormat(
      connectionType="s3",
      options=JsonOptions("""{"path": "s3://s3path"}"""),
      format="avro"
    ).writeDynamicFrame(dynamicFrame)
  }
}
```

------

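## Avro configuration reference
<a name="aws-glue-programming-etl-format-avro-reference"></a>

You can use the following `format_options` values wherever AWS Glue libraries specify `format="avro"`:
+ `version` – Specifies the version of the Apache Avro reader/writer format to support. The default is `"1.7"`. You can specify `format_options={"version": "1.8"}` to enable reading and writing of Avro logical types. For more information, see the [Apache Avro 1.7.7 Specification](https://avro.apache.org/docs/1.7.7/spec.html) and the [Apache Avro 1.8.2 Specification](https://avro.apache.org/docs/1.8.2/spec.html).

  The Apache Avro 1.8 connector supports the following logical type conversions:

For the reader: this table shows the conversion between Avro data types (logical type and Avro primitive type) and AWS Glue `DynamicFrame` data types for Avro reader 1.7 and 1.8.


| Avro data type: logical type | Avro data type: Avro primitive type | AWS Glue `DynamicFrame` data type: Avro reader 1.7 | AWS Glue `DynamicFrame` data type: Avro reader 1.8 | 
| --- | --- | --- | --- | 
| Decimal | bytes | BINARY | Decimal | 
| Decimal | fixed | BINARY | Decimal | 
| Date | int | INT | Date | 
| Time (millisecond) | int | INT | INT | 
| Time (microsecond) | long | LONG | LONG | 
| Timestamp (millisecond) | long | LONG | Timestamp | 
| Timestamp (microsecond) | long | LONG | LONG | 
| Duration (not a logical type) | fixed of 12 | BINARY | BINARY | 

For the writer: this table shows the conversion between AWS Glue `DynamicFrame` data types and Avro data types for Avro writer 1.7 and 1.8.


| AWS Glue `DynamicFrame` data type | Avro data type: Avro writer 1.7 | Avro data type: Avro writer 1.8 | 
| --- | --- | --- | 
| Decimal | String | decimal | 
| Date | String | date | 
| Timestamp | String | timestamp-micros | 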
## Avro Spark DataFrame support
<a name="aws-glueprogramming-etl-format-avro-dataframe-support"></a>

To use Avro from the Spark DataFrame API, you need to install the Spark Avro plugin for the corresponding Spark version. The version of Spark available in your job is determined by your AWS Glue version. For more information about Spark versions, see [AWS Glue versions](release-notes.md). This plugin is maintained by Apache, and we do not make specific guarantees of support.

In AWS Glue 2.0 – Use version 2.4.3 of the Spark Avro plugin. You can find this JAR on Maven Central, see [org.apache.spark:spark-avro_2.12:2.4.3](https://search.maven.org/artifact/org.apache.spark/spark-avro_2.12/3.1.1/jar).

In AWS Glue 3.0 – Use version 3.1.1 of the Spark Avro plugin. You can find this JAR on Maven Central, see [org.apache.spark:spark-avro_2.12:3.1.1](https://search.maven.org/artifact/org.apache.spark/spark-avro_2.12/3.1.1/jar).

To include extra JARs in an AWS Glue ETL job, use the `--extra-jars` job parameter. For more information about job parameters, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md). You can also configure this parameter in the AWS Management Console.

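# Using the grokLog format in AWS Glue
<a name="aws-glue-programming-etl-format-grokLog-home"></a>

AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in a loosely structured plaintext format, this document introduces the features available for using your data in AWS Glue through Grok patterns.

AWS Glue supports using Grok patterns. Grok patterns are similar to regular expression capture groups. They recognize patterns of character sequences in a plaintext file and give them a type and purpose. In AWS Glue, their primary purpose is to read logs. For an introduction to Grok by its authors, see [Logstash Reference: Grok filter plugin](https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html).


| Read | Write | Streaming read | Group small files | Job bookmarks | 
| --- | --- | --- | --- | --- | 
| Supported | Not applicable | Supported | Supported | Unsupported | 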

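## grokLog configuration reference
<a name="aws-glue-programming-etl-format-groklog-reference"></a>

You can use the following `format_options` values with `format="grokLog"`:
+ `logFormat` – Specifies the Grok pattern that matches the log's format.
+ `customPatterns` – Specifies additional Grok patterns used here.
+ `MISSING` – Specifies the signal to use in identifying missing values. The default is `'-'`.
+ `LineCount` – Specifies the number of lines in each log record. The default is `'1'`, and currently only single-line records are supported.
+ `StrictMode` – A Boolean value that specifies whether strict mode is turned on. In strict mode, the reader doesn't do automatic type conversion or recovery. The default value is `"false"`.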
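
Because Grok patterns behave much like named regular expression capture groups, the idea behind `logFormat` and `MISSING` can be illustrated with Python's `re` module. This is only an analogy, not the AWS Glue reader: the log layout (client, method, byte count), the `LINE_PATTERN` regex, and the `parse_line` helper are all assumptions made for this sketch.

```python
import re

# Hypothetical single-line log layout: "<client> <METHOD> <bytes>".
# Each named group plays the role of one field captured by a Grok pattern.
LINE_PATTERN = re.compile(r"(?P<client>\S+) (?P<method>[A-Z]+) (?P<bytes>\S+)")


def parse_line(line, missing="-"):
    """Parse one log line; values equal to the missing signal become None."""
    match = LINE_PATTERN.fullmatch(line)
    if match is None:
        return None  # the line does not match the expected layout
    return {
        name: (None if value == missing else value)
        for name, value in match.groupdict().items()
    }


record = parse_line("10.0.0.1 GET 512")
record_with_gap = parse_line("10.0.0.1 GET -")
```

Here `record_with_gap["bytes"]` is `None` because the field holds the default missing-value signal `-`, which mirrors what the `MISSING` option describes.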

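# Using the Ion format in AWS Glue
<a name="aws-glue-programming-etl-format-ion-home"></a>

AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the Ion data format, this document introduces the features available for using your data in AWS Glue.

AWS Glue supports using the Ion format. This format represents data structures (that aren't row or column based) in interchangeable binary and plaintext representations. For an introduction to the format by its authors, see [Amazon Ion](https://amzn.github.io/ion-docs/). (For more information, see the [Amazon Ion Specification](https://amzn.github.io/ion-docs/spec.html).)

You can use AWS Glue to read Ion files from Amazon S3. You can read `bzip` and `gzip` archives containing Ion files from S3. You configure compression behavior on the [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3) instead of in the configuration discussed on this page.

The following table shows which common AWS Glue features support the Ion format option.


| Read | Write | Streaming read | Group small files | Job bookmarks | 
| --- | --- | --- | --- | --- | 
| Supported | Unsupported | Unsupported | Supported | Unsupported | 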

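## Example: Read Ion files and folders from S3
<a name="aws-glue-programming-etl-format-ion-read"></a>

**Prerequisites:** You will need the S3 path (`s3path`) to the Ion files or folders that you want to read.

**Configuration:** In your function options, specify `format="ion"`. In your `connection_options`, use the `paths` key to specify `s3path`. You can configure how the reader interacts with S3 in the `connection_options`. For details, see Connection types and options for ETL in AWS Glue: [Amazon S3 connection option reference](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3).

The following AWS Glue ETL script shows the process of reading Ion files or folders from S3: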

------
#### [ Python ]

For this example, use the [create_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) method.

```
# Example: Read ION from S3

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="ion"
)
```

------
#### [ Scala ]

For this example, use the [getSourceWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat) operation.

```
// Example: Read ION from S3

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType="s3",
      format="ion",
      options=JsonOptions("""{"paths": ["s3://s3path"], "recurse": true}""")
    ).getDynamicFrame()
  }
}
```

------

## Ion configuration reference
<a name="aws-glue-programming-etl-format-ion-reference"></a>

There are no `format_options` values for `format="ion"`.

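# Using the JSON format in AWS Glue
<a name="aws-glue-programming-etl-format-json-home"></a>

AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the JSON data format, this document introduces the features available for using your data in AWS Glue.

AWS Glue supports using the JSON format. This format represents data structures with a consistent shape but flexible contents that aren't row or column based. JSON is defined by parallel standards issued by several authorities, one of which is ECMA-404. For an introduction to the format by a commonly referenced source, see [Introducing JSON](https://www.json.org/).

You can use AWS Glue to read JSON files from Amazon S3, as well as `bzip` and `gzip` compressed JSON files. You configure compression behavior on the [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3) instead of in the configuration discussed on this page.


| Read | Write | Streaming read | Group small files | Job bookmarks | 
| --- | --- | --- | --- | --- | 
| Supported | Supported | Supported | Supported | Supported | 

## Example: Read JSON files or folders from S3
<a name="aws-glue-programming-etl-format-json-read"></a>

**Prerequisites:** You will need the S3 path (`s3path`) to the JSON files or folders that you want to read.

**Configuration:** In your function options, specify `format="json"`. In your `connection_options`, use the `paths` key to specify `s3path`. You can further alter how your read operation traverses S3 in the connection options; see [Amazon S3 connection option reference](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3) for details. You can configure how the reader interprets JSON files in your `format_options`. For details, see [JSON configuration reference](#aws-glue-programming-etl-format-json-reference).

The following AWS Glue ETL script shows the process of reading JSON files or folders from S3: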

------
#### [ Python ]

For this example, use the [create_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) method.

```
# Example: Read JSON from S3
# For show, we handle a nested JSON file that we can limit with the JsonPath parameter
# For show, we also handle a JSON where a single entry spans multiple lines
# Consider whether optimizePerformance is right for your workflow.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="json",
    format_options={
        "jsonPath": "$.id",
        "multiline": True,
        # "optimizePerformance": True, -> not compatible with jsonPath, multiline
    }
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
dataFrame = spark.read\
    .option("multiline", "true")\
    .json("s3://s3path")
```

------
#### [ Scala ]

For this example, use the [getSourceWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat) operation.

```
// Example: Read JSON from S3
// For show, we handle a nested JSON file that we can limit with the JsonPath parameter
// For show, we also handle a JSON where a single entry spans multiple lines
// Consider whether optimizePerformance is right for your workflow.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    val dynamicFrame = glueContext.getSourceWithFormat(
      formatOptions=JsonOptions("""{"jsonPath": "$.id", "multiline": true, "optimizePerformance":false}"""),
      connectionType="s3",
      format="json",
      options=JsonOptions("""{"paths": ["s3://s3path"], "recurse": true}""")
    ).getDynamicFrame()
  }
}
```

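You can also use DataFrames in a script (`org.apache.spark.sql.DataFrame`).

```
// Assumes `spark` is a SparkSession here, not the SparkContext from the previous example
val dataFrame = spark.read
    .option("multiline", "true")
    .json("s3://s3path")
```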

------

## Example: Write JSON files and folders to Amazon S3
<a name="aws-glue-programming-etl-format-json-write"></a>

**Prerequisites:** You will need an initialized DataFrame (`dataFrame`) or DynamicFrame (`dynamicFrame`). You will also need your expected S3 output path, `s3path`.

**Configuration:** In your function options, specify `format="json"`. In your `connection_options`, use the `path` key to specify `s3path`. You can further alter how the writer interacts with S3 in the `connection_options`. For details, see [Amazon S3 connection option reference](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). You can configure JSON-specific behavior in your `format_options`. For details, see [JSON configuration reference](#aws-glue-programming-etl-format-json-reference).

The following AWS Glue ETL script shows the process of writing JSON files or folders to S3:

------
#### [ Python ]

For this example, use the [write_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_options) method.

```
# Example: Write JSON to S3

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    connection_options={"path": "s3://s3path"},
    format="json"
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
dataFrame.write.json("s3://s3path/")
```

------
#### [ Scala ]

For this example, use the [getSinkWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSinkWithFormat) method.

```
// Example: Write JSON to S3

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    glueContext.getSinkWithFormat(
        connectionType="s3",
        options=JsonOptions("""{"path": "s3://s3path"}"""),
        format="json"
    ).writeDynamicFrame(dynamicFrame)
  }
}
```

You can also use DataFrames in a script (`org.apache.spark.sql.DataFrame`).

```
dataFrame.write.json("s3://s3path")
```

------

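## JSON configuration reference
<a name="aws-glue-programming-etl-format-json-reference"></a>

You can use the following `format_options` values with `format="json"`:
+ `jsonPath` – A [JsonPath](https://github.com/json-path/JsonPath) expression that identifies an object to be read into records. This is particularly useful when a file contains records nested inside an outer array. For example, the following JsonPath expression targets the `id` field of a JSON object.

  ```
  format="json", format_options={"jsonPath": "$.id"}
  ```
+ `multiline` – A Boolean value that specifies whether a single record can span multiple lines. This can occur when a field contains a quoted new-line character. You must set this option to `"true"` if any record spans multiple lines. The default value is `"false"`, which allows for more aggressive file-splitting during parsing.
+ `optimizePerformance` – A Boolean value that specifies whether to use the advanced SIMD JSON reader along with Apache Arrow based columnar memory formats. Only available in AWS Glue 3.0. Not compatible with `multiline` or `jsonPath`. Providing either of those options instructs AWS Glue to fall back to the standard reader.
+ `withSchema` – A String value that specifies a table schema in the format described in [Manually specify the XML schema](aws-glue-programming-etl-format-xml-home.md#aws-glue-programming-etl-format-xml-withschema). Only used with `optimizePerformance` when reading from non-Catalog connections.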
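
To see why `multiline` matters, consider what happens when a file is split on line boundaries before parsing. The sketch below uses only Python's standard `json` module and does not reproduce the AWS Glue reader; `parse_line_delimited` and the sample strings are hypothetical.

```python
import json


def parse_line_delimited(text):
    """Parse text as one JSON record per line, as a line-splitting reader would."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Two records, one per line: safe to split on line boundaries.
single_line = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'

# One record spread over several lines: each line alone is not valid JSON.
multi_line = '{\n  "id": 1,\n  "name": "a"\n}\n'

records = parse_line_delimited(single_line)

try:
    parse_line_delimited(multi_line)
    split_parse_succeeded = True
except json.JSONDecodeError:
    split_parse_succeeded = False

# Parsing the whole text at once handles the multi-line record.
whole_record = json.loads(multi_line)
```

Line-by-line parsing works for the first input and fails for the second, which is the situation where you would set `multiline` to `"true"` at the cost of less aggressive file splitting.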

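## Using the vectorized SIMD JSON reader with Apache Arrow columnar format
<a name="aws-glue-programming-etl-format-simd-json-reader"></a>

AWS Glue version 3.0 adds a vectorized reader for JSON data. It performs up to 2x faster than the standard reader under certain conditions. This reader comes with limitations that you should be aware of before use, documented in this section.

To use the optimized reader, set `"optimizePerformance"` to True in the `format_options` or table property. You also need to provide `withSchema` unless you are reading from the catalog. `withSchema` expects an input as described in [Manually specify the XML schema](aws-glue-programming-etl-format-xml-home.md#aws-glue-programming-etl-format-xml-withschema).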

```
# Read from S3 data source
glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="json",
    format_options={
        "optimizePerformance": True,
        "withSchema": SchemaString,
    },
)

# Read from catalog table
glueContext.create_dynamic_frame.from_catalog(
    database=database,
    table_name=table,
    additional_options={
        # The vectorized reader for JSON can read your schema from a catalog table property.
        "optimizePerformance": True,
    },
)
```

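For more information about building a *SchemaString* in the AWS Glue library, see [PySpark extension types](aws-glue-api-crawler-pyspark-extensions-types.md).

**Limitations for the vectorized JSON reader**  
Note the following limitations:
+ JSON elements with nested objects or array values are not supported. If provided, AWS Glue falls back to the standard reader.
+ A schema must be provided, either from the catalog or with the `withSchema` parameter.
+ Not compatible with `multiline` or `jsonPath`. Providing either of those options instructs AWS Glue to fall back to the standard reader.
+ Providing input records that do not match the input schema causes the reader to fail.
+ [Error records](https://docs.aws.amazon.com/glue/latest/dg/glue-etl-scala-apis-glue-dynamicframe-class.html#glue-etl-scala-apis-glue-dynamicframe-class-defs-errorsAsDynamicFrame) are not created.
+ JSON files with multi-byte characters (such as Japanese or Chinese characters) are not supported.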
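
Some of these limitations can be checked before a job runs. The helper below is a hypothetical pre-flight check written for this page, not an AWS Glue API: it inspects only the `format_options` dictionary and cannot detect data-dependent problems such as nested values, schema mismatches, or multi-byte characters.

```python
def vectorized_reader_issues(format_options, from_catalog=False):
    """Return reasons why these options would not use the vectorized JSON reader.

    An empty list means no option-level problem was found.
    """
    issues = []
    if not format_options.get("optimizePerformance"):
        issues.append("optimizePerformance is not enabled")
    # Either of these options makes AWS Glue fall back to the standard reader.
    for option in ("multiline", "jsonPath"):
        if format_options.get(option):
            issues.append(f"{option} is set, so the standard reader is used")
    # A schema must come from the catalog or from withSchema.
    if not from_catalog and not format_options.get("withSchema"):
        issues.append("withSchema is required when not reading from the catalog")
    return issues
```

Calling it with `{"optimizePerformance": True, "multiline": True}` reports both the fallback caused by `multiline` and the missing `withSchema`.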

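# Using the ORC format in AWS Glue
<a name="aws-glue-programming-etl-format-orc-home"></a>

AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the ORC data format, this document introduces the features available for using your data in AWS Glue.

AWS Glue supports using the ORC format. This format is a performance-oriented, column-based data format. For an introduction to the format by the standard authority, see [Apache Orc](https://orc.apache.org/docs/).

You can use AWS Glue to read ORC files from Amazon S3 and from streaming sources, as well as write ORC files to Amazon S3. You can read and write `bzip` and `gzip` archives containing ORC files from S3. You configure compression behavior on the [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3) instead of in the configuration discussed on this page.

The following table shows which common AWS Glue operations support the ORC format option.


| Read | Write | Streaming read | Group small files | Job bookmarks | 
| --- | --- | --- | --- | --- | 
| Supported | Supported | Supported | Unsupported | Supported\* | 

\* Supported in AWS Glue version 1.0+

## Example: Read ORC files or folders from S3
<a name="aws-glue-programming-etl-format-orc-read"></a>

**Prerequisites:** You will need the S3 path (`s3path`) to the ORC files or folders that you want to read.

**Configuration:** In your function options, specify `format="orc"`. In your `connection_options`, use the `paths` key to specify `s3path`. You can configure how the reader interacts with S3 in the `connection_options`. For details, see Connection types and options for ETL in AWS Glue: [Amazon S3 connection option reference](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3).

The following AWS Glue ETL script shows the process of reading ORC files or folders from S3: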

------
#### [ Python ]

For this example, use the [create_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) method.

```
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="orc"
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
# Assumes spark = glueContext.spark_session
dataFrame = spark.read\
    .orc("s3://s3path")
```

------
#### [ Scala ]

For this example, use the [getSourceWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat) operation.

```
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType="s3",
      format="orc",
      options=JsonOptions("""{"paths": ["s3://s3path"]}""")
    ).getDynamicFrame()
  }
}
```

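You can also use DataFrames in a script (`org.apache.spark.sql.DataFrame`).

```
// Assumes `spark` is a SparkSession here, not the SparkContext from the previous example
val dataFrame = spark.read
    .orc("s3://s3path")
```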

------

## Example: Write ORC files and folders to S3
<a name="aws-glue-programming-etl-format-orc-write"></a>

**Prerequisites:** You will need an initialized DataFrame (`dataFrame`) or DynamicFrame (`dynamicFrame`). You will also need your expected S3 output path, `s3path`.

**Configuration:** In your function options, specify `format="orc"`. In your connection options, use the `path` key to specify `s3path`. You can further alter how the writer interacts with S3 in the `connection_options`. For details, see [Amazon S3 connection option reference](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). The following code example shows the process:

------
#### [ Python ]

For this example, use the [write_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_options) method.

```
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    format="orc",
    connection_options={
        "path": "s3://s3path"
    }
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
dataFrame.write.orc("s3://s3path/")
```

------
#### [ Scala ]

For this example, use the [getSinkWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSinkWithFormat) method.

```
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    glueContext.getSinkWithFormat(
      connectionType="s3",
      options=JsonOptions("""{"path": "s3://s3path"}"""),
      format="orc"
    ).writeDynamicFrame(dynamicFrame)
  }
}
```

You can also use DataFrames in a script (`org.apache.spark.sql.DataFrame`).

```
dataFrame.write.orc("s3://s3path/")
```

------

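## ORC configuration reference
<a name="aws-glue-programming-etl-format-orc-reference"></a>

There are no `format_options` values for `format="orc"`. However, any options that are accepted by the underlying SparkSQL code can be passed to it by way of the `connection_options` map parameter.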

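# Using data lake frameworks with AWS Glue ETL jobs
<a name="aws-glue-programming-etl-datalake-native-frameworks"></a>

Open-source data lake frameworks simplify incremental data processing for files that you store in data lakes built on Amazon S3. AWS Glue 3.0 and later supports the following open-source data lake frameworks:
+ Apache Hudi
+ Linux Foundation Delta Lake
+ Apache Iceberg

We provide native support for these frameworks so that you can read and write data that you store in Amazon S3 in a transactionally consistent manner. There's no need to install a separate connector or complete extra configuration steps in order to use these frameworks in AWS Glue ETL jobs.

When you manage datasets through the AWS Glue Data Catalog, you can use AWS Glue methods to read and write data lake tables with Spark DataFrames. You can also read and write Amazon S3 data using the Spark DataFrame API.

In this video, you can learn the basics of how Apache Hudi, Apache Iceberg, and Delta Lake work. You'll see how to insert, update, and delete data in your data lake and how each of these frameworks works.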

[![AWS Videos](http://img.youtube.com/vi/fryfx0Zg7KA/0.jpg)](http://www.youtube.com/watch?v=fryfx0Zg7KA)


**Topics**
+ [Limitations](aws-glue-programming-etl-datalake-native-frameworks-limitations.md)
+ [Using the Hudi framework in AWS Glue](aws-glue-programming-etl-format-hudi.md)
+ [Using the Delta Lake framework in AWS Glue](aws-glue-programming-etl-format-delta-lake.md)
+ [Using the Iceberg framework in AWS Glue](aws-glue-programming-etl-format-iceberg.md)

# Limitations
<a name="aws-glue-programming-etl-datalake-native-frameworks-limitations"></a>

Consider the following limitations before you use data lake frameworks with AWS Glue.
+ The following AWS Glue `GlueContext` methods for DynamicFrame don't support reading and writing data lake framework tables. Use the `GlueContext` methods for DataFrame or the Spark DataFrame API instead.
  + `create_dynamic_frame.from_catalog`
  + `write_dynamic_frame.from_catalog`
  + `getDynamicFrame`
  + `writeDynamicFrame`
+ The following `GlueContext` methods for DataFrame are supported with Lake Formation permission control:
  + `create_data_frame.from_catalog`
  + `write_data_frame.from_catalog`
  + `getDataFrame`
  + `writeDataFrame`
+ [Grouping small files](grouping-input-files.md) isn't supported.
+ [Job bookmarks](monitor-continuations.md) aren't supported.
+ Apache Hudi 0.10.1 for AWS Glue 3.0 doesn't support Hudi Merge on Read (MoR) tables.
+ `ALTER TABLE … RENAME TO` isn't available for Apache Iceberg 0.13.1 for AWS Glue 3.0.

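## Limitations for data lake format tables managed by Lake Formation permissions
<a name="w2aac67c11c24c11c31c17b7"></a>

The data lake formats are integrated with AWS Glue ETL through Lake Formation permissions. Creating a DynamicFrame using `create_dynamic_frame` is not supported. For more information, see the following examples:
+ [Example: Read and write Iceberg table with Lake Formation permission control](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-iceberg.html#aws-glue-programming-etl-format-iceberg-read-write-lake-formation-tables)
+ [Example: Read and write Hudi table with Lake Formation permission control](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html#aws-glue-programming-etl-format-hudi-read-write-lake-formation-tables)
+ [Example: Read and write Delta Lake table with Lake Formation permission control](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-delta-lake.html#aws-glue-programming-etl-format-delta-lake-read-write-lake-formation-tables)

**Note**  
The integration with AWS Glue ETL through Lake Formation permissions for Apache Hudi, Apache Iceberg, and Delta Lake is supported only in AWS Glue version 4.0.

Apache Iceberg has the best integration with AWS Glue ETL through Lake Formation permissions. It supports almost all operations and includes SQL support.

Hudi supports most basic operations except for administrative operations. This is because these operations are generally done by writing DataFrames, with options specified through `additional_options`. Because SparkSQL isn't supported, you need to use AWS Glue APIs to create DataFrames for your operations.

Delta Lake supports only reading, appending, and overwriting table data. Delta Lake requires the use of its own libraries to perform other tasks, such as updates.

The following features are not available for Iceberg tables managed by Lake Formation permissions.
+ Compaction using AWS Glue ETL
+ Spark SQL support through AWS Glue ETL

The following are limitations of Hudi tables managed by Lake Formation permissions:
+ Removal of orphaned files

The following are limitations of Delta Lake tables managed by Lake Formation permissions:
+ All features other than inserting data into and reading data from Delta Lake tables.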

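# Using the Hudi framework in AWS Glue
<a name="aws-glue-programming-etl-format-hudi"></a>

AWS Glue 3.0 and later supports the Apache Hudi framework for data lakes. Hudi is an open-source data lake storage framework that simplifies incremental data processing and data pipeline development. This topic covers the features available for using your data in AWS Glue when you transport or store your data in a Hudi table. To learn more about Hudi, see the official [Apache Hudi documentation](https://hudi.apache.org/docs/overview/).

You can use AWS Glue to perform read and write operations on Hudi tables in Amazon S3, or work with Hudi tables using the AWS Glue Data Catalog. Additional operations, including insert, update, and all of the [Apache Spark operations](https://hudi.apache.org/docs/quick-start-guide/), are also supported.

**Note**  
The Apache Hudi 0.15.0 implementation in AWS Glue 5.0 internally reverts the [HUDI-7001](https://github.com/apache/hudi/pull/9936) change. When the record key consists of a single field, it does not exhibit the regression related to complex key generation. However, this behavior differs from OSS Apache Hudi 0.15.0.  
Apache Hudi 0.10.1 for AWS Glue 3.0 doesn't support Hudi Merge on Read (MoR) tables.

The following table lists the Hudi version that is included in each AWS Glue version.


| AWS Glue version | Supported Hudi version | 
| --- | --- | 
| 5.1 | 1.0.2 | 
| 5.0 | 0.15.0 | 
| 4.0 | 0.12.1 | 
| 3.0 | 0.10.1 | 

To learn more about the data lake frameworks that AWS Glue supports, see [Using data lake frameworks with AWS Glue ETL jobs](aws-glue-programming-etl-datalake-native-frameworks.md).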

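## Enabling Hudi
<a name="aws-glue-programming-etl-format-hudi-enable"></a>

To enable Hudi for AWS Glue, complete the following tasks:
+ Specify `hudi` as a value for the `--datalake-formats` job parameter. For more information, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).
+ Create a key named `--conf` for your AWS Glue job, and set it to the following value. Alternatively, you can set the following configuration using `SparkConf` in your script. These settings help Apache Spark correctly handle Hudi tables.

  ```
  spark.serializer=org.apache.spark.serializer.KryoSerializer
  ```
+ Lake Formation permission support for Hudi tables is enabled by default in AWS Glue 4.0. No additional configuration is needed to read from or write to Hudi tables registered with Lake Formation. To read a registered Hudi table, the AWS Glue job IAM role must have the SELECT permission. To write to a registered Hudi table, the AWS Glue job IAM role must have the SUPER permission. To learn more about managing Lake Formation permissions, see [Granting and revoking permissions on Data Catalog resources](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-catalog-permissions.html).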

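**Using a different Hudi version**

To use a version of Hudi that AWS Glue doesn't support, specify your own Hudi JAR files using the `--extra-jars` job parameter. Do not include `hudi` as a value for the `--datalake-formats` job parameter. If you use AWS Glue 5.0 or later, you must set the `--user-jars-first true` job parameter.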
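
The two setups described above (the bundled Hudi version versus your own JAR) can be summarized as a small helper that assembles job arguments. This is a sketch under stated assumptions: `hudi_job_arguments` is a hypothetical name, the JAR location is whatever S3 path you supply, and keeping the `--conf` serializer setting when you bring your own JAR is an assumption rather than something this page states.

```python
KRYO_CONF = "spark.serializer=org.apache.spark.serializer.KryoSerializer"


def hudi_job_arguments(glue_version, custom_hudi_jar=None):
    """Assemble job arguments for Hudi on a given AWS Glue version.

    custom_hudi_jar: S3 path to your own Hudi JAR, or None to use the
    Hudi version bundled with AWS Glue.
    """
    arguments = {"--conf": KRYO_CONF}  # assumed to apply in both setups
    if custom_hudi_jar is None:
        arguments["--datalake-formats"] = "hudi"
    else:
        # With your own JAR, do not set --datalake-formats to hudi.
        arguments["--extra-jars"] = custom_hudi_jar
        major_minor = tuple(int(part) for part in glue_version.split("."))
        if major_minor >= (5, 0):
            arguments["--user-jars-first"] = "true"
    return arguments
```

For instance, `hudi_job_arguments("4.0")` yields the `--datalake-formats` and `--conf` pair, while passing a JAR path with version `"5.0"` swaps in `--extra-jars` and adds `--user-jars-first`.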

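## Example: Write a Hudi table to Amazon S3 and register it in the AWS Glue Data Catalog
<a name="aws-glue-programming-etl-format-hudi-write"></a>

The following example script demonstrates how to write a Hudi table to Amazon S3 and register the table to the AWS Glue Data Catalog. The example uses the Hudi [Hive Sync tool](https://hudi.apache.org/docs/syncing_metastore/) to register the table.

**Note**  
This example requires you to set the `--enable-glue-datacatalog` job parameter in order to use the AWS Glue Data Catalog as an Apache Spark Hive metastore. To learn more, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).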

------
#### [ Python ]

```
# Example: Create a Hudi table from a DataFrame 
# and register the table to Glue Data Catalog

additional_options={
    "hoodie.table.name": "<your_table_name>",
    "hoodie.database.name": "<your_database_name>",
    "hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "<your_recordkey_field>",
    "hoodie.datasource.write.precombine.field": "<your_precombine_field>",
    "hoodie.datasource.write.partitionpath.field": "<your_partitionkey_field>",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": "<your_database_name>",
    "hoodie.datasource.hive_sync.table": "<your_table_name>",
    "hoodie.datasource.hive_sync.partition_fields": "<your_partitionkey_field>",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.hive_sync.mode": "hms",
    "path": "s3://<s3Path/>"
}

dataFrame.write.format("hudi") \
    .options(**additional_options) \
    .mode("overwrite") \
    .save()
```

------
#### [ Scala ]

```
// Example: Create a Hudi table from a DataFrame
// and register the table to Glue Data Catalog

val additionalOptions = Map(
  "hoodie.table.name" -> "<your_table_name>",
  "hoodie.database.name" -> "<your_database_name>",
  "hoodie.datasource.write.storage.type" -> "COPY_ON_WRITE",
  "hoodie.datasource.write.operation" -> "upsert",
  "hoodie.datasource.write.recordkey.field" -> "<your_recordkey_field>",
  "hoodie.datasource.write.precombine.field" -> "<your_precombine_field>",
  "hoodie.datasource.write.partitionpath.field" -> "<your_partitionkey_field>",
  "hoodie.datasource.write.hive_style_partitioning" -> "true",
  "hoodie.datasource.hive_sync.enable" -> "true",
  "hoodie.datasource.hive_sync.database" -> "<your_database_name>",
  "hoodie.datasource.hive_sync.table" -> "<your_table_name>",
  "hoodie.datasource.hive_sync.partition_fields" -> "<your_partitionkey_field>",
  "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
  "hoodie.datasource.hive_sync.use_jdbc" -> "false",
  "hoodie.datasource.hive_sync.mode" -> "hms",
  "path" -> "s3://<s3Path/>")

dataFrame.write.format("hudi")
  .options(additionalOptions)
  .mode("append")
  .save()
```

------
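
The same set of `hoodie.*` options recurs in several examples on this page, so it can be convenient to generate it from a handful of inputs. The function below is a convenience sketch, not part of AWS Glue or Hudi; `build_hudi_options` and its parameter names are hypothetical, while the option keys and fixed values are the ones used in the example above.

```python
def build_hudi_options(database, table, record_key, precombine_field,
                       partition_field, path, operation="upsert"):
    """Return the Hudi write and Hive sync options used in the example above."""
    return {
        "hoodie.table.name": table,
        "hoodie.database.name": database,
        "hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.hive_style_partitioning": "true",
        # Hive sync settings register the table in the Data Catalog.
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": table,
        "hoodie.datasource.hive_sync.partition_fields": partition_field,
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.mode": "hms",
        "path": path,
    }
```

The returned dictionary can be unpacked into `.options(**additional_options)` in the Python example. Keeping the table and database names in one place avoids a mismatch between the `hoodie.table.name` and `hive_sync.table` settings.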

## Example: Read a Hudi table from Amazon S3 using the AWS Glue Data Catalog
<a name="aws-glue-programming-etl-format-hudi-read"></a>

This example reads from Amazon S3 the Hudi table that you created in [Example: Write a Hudi table to Amazon S3 and register it in the AWS Glue Data Catalog](#aws-glue-programming-etl-format-hudi-write).

**Note**  
This example requires you to set the `--enable-glue-datacatalog` job parameter in order to use the AWS Glue Data Catalog as an Apache Spark Hive metastore. To learn more, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).

------
#### [ Python ]

For this example, use the `GlueContext.create_data_frame.from_catalog()` method.

```
# Example: Read a Hudi table from Glue Data Catalog

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

dataFrame = glueContext.create_data_frame.from_catalog(
    database = "<your_database_name>",
    table_name = "<your_table_name>"
)
```

------
#### [ Scala ]

For this example, use the [getCatalogSource](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSource) method.

```
// Example: Read a Hudi table from Glue Data Catalog

import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    
    val dataFrame = glueContext.getCatalogSource(
      database = "<your_database_name>",
      tableName = "<your_table_name>"
    ).getDataFrame()
  }
}
```

------

## Example: Update and insert a `DataFrame` into a Hudi table in Amazon S3
<a name="aws-glue-programming-etl-format-hudi-update-insert"></a>

This example uses the AWS Glue Data Catalog to insert a DataFrame into the Hudi table that you created in [Example: Write a Hudi table to Amazon S3 and register it in the AWS Glue Data Catalog](#aws-glue-programming-etl-format-hudi-write).

**Note**  
This example requires you to set the `--enable-glue-datacatalog` job parameter in order to use the AWS Glue Data Catalog as an Apache Spark Hive metastore. To learn more, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).

------
#### [ Python ]

For this example, use the `GlueContext.write_data_frame.from_catalog()` method.

```
# Example: Upsert a Hudi table from Glue Data Catalog

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

glueContext.write_data_frame.from_catalog(
    frame = dataFrame,
    database = "<your_database_name>",
    table_name = "<your_table_name>",
    additional_options={
        "hoodie.table.name": "<your_table_name>",
        "hoodie.database.name": "<your_database_name>",
        "hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": "<your_recordkey_field>",
        "hoodie.datasource.write.precombine.field": "<your_precombine_field>",
        "hoodie.datasource.write.partitionpath.field": "<your_partitionkey_field>",
        "hoodie.datasource.write.hive_style_partitioning": "true",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.database": "<your_database_name>",
        "hoodie.datasource.hive_sync.table": "<your_table_name>",
        "hoodie.datasource.hive_sync.partition_fields": "<your_partitionkey_field>",
        "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.mode": "hms"
    }
)
```

------
#### [ Scala ]

For this example, use the [getCatalogSink](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSink) method.

```
// Example: Upsert a Hudi table from Glue Data Catalog

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    glueContext.getCatalogSink("<your_database_name>", "<your_table_name>",
      additionalOptions = JsonOptions(Map(
        "hoodie.table.name" -> "<your_table_name>",
        "hoodie.database.name" -> "<your_database_name>",
        "hoodie.datasource.write.storage.type" -> "COPY_ON_WRITE",
        "hoodie.datasource.write.operation" -> "upsert",
        "hoodie.datasource.write.recordkey.field" -> "<your_recordkey_field>",
        "hoodie.datasource.write.precombine.field" -> "<your_precombine_field>",
        "hoodie.datasource.write.partitionpath.field" -> "<your_partitionkey_field>",
        "hoodie.datasource.write.hive_style_partitioning" -> "true",
        "hoodie.datasource.hive_sync.enable" -> "true",
        "hoodie.datasource.hive_sync.database" -> "<your_database_name>",
        "hoodie.datasource.hive_sync.table" -> "<your_table_name>",
        "hoodie.datasource.hive_sync.partition_fields" -> "<your_partitionkey_field>",
        "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.use_jdbc" -> "false",
        "hoodie.datasource.hive_sync.mode" -> "hms"
      )))
      .writeDataFrame(dataFrame, glueContext)
  }
}
```

------

## Example: Read a Hudi table from Amazon S3 using Spark
<a name="aws-glue-programming-etl-format-hudi-read-spark"></a>

This example reads a Hudi table from Amazon S3 using the Spark DataFrame API.

------
#### [ Python ]

```
# Example: Read a Hudi table from S3 using a Spark DataFrame

dataFrame = spark.read.format("hudi").load("s3://<s3path/>")
```

------
#### [ Scala ]

```
// Example: Read a Hudi table from S3 using a Spark DataFrame

val dataFrame = spark.read.format("hudi").load("s3://<s3path/>")
```

------

## Example: Write a Hudi table to Amazon S3 using Spark
<a name="aws-glue-programming-etl-format-hudi-write-spark"></a>

This example writes a Hudi table to Amazon S3 using the Spark DataFrame API.

------
#### [ Python ]

```
# Example: Write a Hudi table to S3 using a Spark DataFrame

dataFrame.write.format("hudi") \
    .options(**additional_options) \
    .mode("overwrite") \
    .save("s3://<s3Path/>")
```

------
#### [ Scala ]

```
// Example: Write a Hudi table to S3 using a Spark DataFrame

dataFrame.write.format("hudi")
  .options(additionalOptions)
  .mode("overwrite")
  .save("s3://<s3path/>")
```

------

## Example: Read and write Hudi table with Lake Formation permission control
<a name="aws-glue-programming-etl-format-hudi-read-write-lake-formation-tables"></a>

This example reads from and writes to a Hudi table with Lake Formation permission control.

1. Create a Hudi table and register it in Lake Formation.

   1. To enable Lake Formation permission control, you first need to register the table's Amazon S3 path with Lake Formation. For more information, see [Registering an Amazon S3 location](https://docs.aws.amazon.com/lake-formation/latest/dg/register-location.html). You can register it either from the Lake Formation console or by using the AWS CLI:

      ```
      aws lakeformation register-resource --resource-arn arn:aws:s3:::<s3-bucket>/<s3-folder> --use-service-linked-role --region <REGION>
      ```

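      After you register an Amazon S3 location, any AWS Glue table that points to the location (or any of its child locations) returns true for the `IsRegisteredWithLakeFormation` parameter in the `GetTable` call.

   1. Create a Hudi table that points to the registered Amazon S3 path through the Spark DataFrame API: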

      ```
      hudi_options = {
          'hoodie.table.name': table_name,
          'hoodie.database.name': database_name,
          'hoodie.datasource.write.storage.type': 'COPY_ON_WRITE',
          'hoodie.datasource.write.recordkey.field': 'product_id',
          'hoodie.datasource.write.table.name': table_name,
          'hoodie.datasource.write.operation': 'upsert',
          'hoodie.datasource.write.precombine.field': 'updated_at',
          'hoodie.datasource.write.hive_style_partitioning': 'true',
          'hoodie.upsert.shuffle.parallelism': 2,
          'hoodie.insert.shuffle.parallelism': 2,
          'path': <S3_TABLE_LOCATION>,
          'hoodie.datasource.hive_sync.enable': 'true',
          'hoodie.datasource.hive_sync.database': database_name,
          'hoodie.datasource.hive_sync.table': table_name,
          'hoodie.datasource.hive_sync.use_jdbc': 'false',
          'hoodie.datasource.hive_sync.mode': 'hms'
      }
      
      df_products.write.format("hudi")  \
          .options(**hudi_options)  \
          .mode("overwrite")  \
          .save()
      ```

1. Grant Lake Formation permissions to the AWS Glue job IAM role. You can grant permissions either from the Lake Formation console or by using the AWS CLI. For more information, see [Granting table permissions using the Lake Formation console and the named resource method](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-table-permissions.html).

1. Read the Hudi table registered with Lake Formation. The code is the same as reading a non-registered Hudi table. Note that the AWS Glue job IAM role needs the SELECT permission for the read to succeed.

   ```
    val dataFrame = glueContext.getCatalogSource(
         database = "<your_database_name>",
         tableName = "<your_table_name>"
       ).getDataFrame()
   ```

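1. Write to the Hudi table registered with Lake Formation. The code is the same as writing to a non-registered Hudi table. Note that the AWS Glue job IAM role needs the SUPER permission for the write to succeed.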

   ```
   glueContext.getCatalogSink("<your_database_name>", "<your_table_name>",
         additionalOptions = JsonOptions(Map(
           "hoodie.table.name" -> "<your_table_name>",
           "hoodie.database.name" -> "<your_database_name>",
           "hoodie.datasource.write.storage.type" -> "COPY_ON_WRITE",
           "hoodie.datasource.write.operation" -> "<write_operation>",
           "hoodie.datasource.write.recordkey.field" -> "<your_recordkey_field>",
           "hoodie.datasource.write.precombine.field" -> "<your_precombine_field>",
           "hoodie.datasource.write.partitionpath.field" -> "<your_partitionkey_field>",
           "hoodie.datasource.write.hive_style_partitioning" -> "true",
           "hoodie.datasource.hive_sync.enable" -> "true",
           "hoodie.datasource.hive_sync.database" -> "<your_database_name>",
           "hoodie.datasource.hive_sync.table" -> "<your_table_name>",
           "hoodie.datasource.hive_sync.partition_fields" -> "<your_partitionkey_field>",
           "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
           "hoodie.datasource.hive_sync.use_jdbc" -> "false",
           "hoodie.datasource.hive_sync.mode" -> "hms"
         )))
         .writeDataFrame(dataFrame, glueContext)
   ```
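
The registration check described in step 1 can be sketched in Python as follows. This is a minimal illustration, not part of the documented workflow: the helper name `is_registered_with_lake_formation` is hypothetical, and it assumes you pass in a boto3 Glue client (or any object with a compatible `get_table` method).

```python
# Hypothetical helper: report whether a Data Catalog table points to an
# Amazon S3 location that is registered with Lake Formation.
def is_registered_with_lake_formation(glue_client, database_name, table_name):
    # GetTable returns the table definition under the "Table" key.
    response = glue_client.get_table(DatabaseName=database_name, Name=table_name)
    # Treat a missing flag as "not registered".
    return bool(response["Table"].get("IsRegisteredWithLakeFormation", False))


# Usage inside a job or notebook (assumes boto3 is installed and
# credentials are configured):
# import boto3
# is_registered_with_lake_formation(
#     boto3.client("glue"), "<your_database_name>", "<your_table_name>"
# )
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before you run it against a real account.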

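# Using the Delta Lake framework in AWS Glue
<a name="aws-glue-programming-etl-format-delta-lake"></a>

AWS Glue 3.0 and later supports the Linux Foundation Delta Lake framework. Delta Lake is an open-source data lake storage framework that helps you perform ACID transactions, scale metadata handling, and unify streaming and batch data processing. This topic covers the features available for using your data in AWS Glue when you transport or store it in a Delta Lake table. To learn more about Delta Lake, see the [official Delta Lake documentation](https://docs.delta.io/latest/delta-intro.html).

You can use AWS Glue to perform read and write operations on Delta Lake tables in Amazon S3, or work with Delta Lake tables using the AWS Glue Data Catalog. Additional operations such as insert, update, and [table batch reads and writes](https://docs.delta.io/0.7.0/api/python/index.html) are also supported. When you use Delta Lake tables, you also have the option to use methods from the Delta Lake Python library, such as `DeltaTable.forPath`. For more information about the Delta Lake Python library, see Delta Lake's Python documentation.

The following table lists the version of Delta Lake included in each AWS Glue version.

| AWS Glue version | Supported Delta Lake version | 
| --- | --- | 
| 5.1 | 3.3.2 | 
| 5.0 | 3.3.0 | 
| 4.0 | 2.1.0 | 
| 3.0 | 1.0.0 | 

To learn more about the data lake frameworks that AWS Glue supports, see [Using data lake frameworks with AWS Glue ETL jobs](aws-glue-programming-etl-datalake-native-frameworks.md).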

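## Enabling Delta Lake for AWS Glue
<a name="aws-glue-programming-etl-format-delta-lake-enable"></a>

To enable Delta Lake for AWS Glue, complete the following tasks:
+ Specify `delta` as a value for the `--datalake-formats` job parameter. For more information, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).
+ Create a key named `--conf` for your AWS Glue job, and set it to the following value. Alternatively, you can set the following configuration using `SparkConf` in your script. These settings help Apache Spark correctly handle Delta Lake tables.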

  ```
  spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
  ```
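+ Lake Formation permission support for Delta tables is enabled by default in AWS Glue 4.0. No additional configuration is needed to read from or write to Delta tables registered with Lake Formation. To read a registered Delta table, the AWS Glue job IAM role must have the SELECT permission. To write to a registered Delta table, the AWS Glue job IAM role must have the SUPER permission. To learn more about managing Lake Formation permissions, see [Granting and revoking permissions on Data Catalog resources](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-catalog-permissions.html).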
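
Because the `--conf` key holds a single string, additional Spark settings are chained inside the value with ` --conf `. The following Python sketch shows one way to assemble that string from a dictionary. The helper name `build_conf_value` is hypothetical and used only for illustration; the settings themselves are the ones listed above.

```python
# Hypothetical helper: join Spark settings into the single value expected by
# the --conf job parameter, chaining extra settings with " --conf ".
def build_conf_value(settings):
    pairs = [f"{key}={value}" for key, value in settings.items()]
    return " --conf ".join(pairs)


# The three Delta Lake settings from the list above.
delta_settings = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    "spark.delta.logStore.class": "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore",
}

conf_value = build_conf_value(delta_settings)
print(conf_value)
```

The first setting carries no `--conf` prefix because the job parameter key itself supplies it.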

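**Using a different Delta Lake version**

To use a version of Delta Lake that AWS Glue doesn't support, specify your own Delta Lake JAR files using the `--extra-jars` job parameter. Do not include `delta` as a value for the `--datalake-formats` job parameter. If you use AWS Glue 5.0 or later, you must set the `--user-jars-first true` job parameter. To use the Delta Lake Python library in this case, you must specify the library JAR files using the `--extra-py-files` job parameter. The Python library comes packaged in the Delta Lake JAR files.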
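
As a rough sketch of how these rules combine, the following Python function collects the job parameters for a self-supplied Delta Lake JAR. The function name, its arguments, and the example S3 path are hypothetical; only the parameter names come from the paragraph above.

```python
# Hypothetical helper: job parameters for bringing your own Delta Lake JAR.
def custom_delta_job_arguments(glue_version, delta_jar_s3_path, use_python_library=False):
    arguments = {"--extra-jars": delta_jar_s3_path}
    if use_python_library:
        # The Python library is packaged inside the Delta Lake JAR file.
        arguments["--extra-py-files"] = delta_jar_s3_path
    # AWS Glue 5.0 or later must also set --user-jars-first.
    major_version = int(glue_version.split(".")[0])
    if major_version >= 5:
        arguments["--user-jars-first"] = "true"
    # Deliberately no "--datalake-formats" entry containing "delta".
    return arguments


# Example with a placeholder JAR location (assumption, not a real path).
job_arguments = custom_delta_job_arguments(
    "5.0", "s3://<your-bucket>/jars/delta.jar", use_python_library=True
)
print(job_arguments)
```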

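## Example: Write a Delta Lake table to Amazon S3 and register it to the AWS Glue Data Catalog
<a name="aws-glue-programming-etl-format-delta-lake-write"></a>

The following AWS Glue ETL script demonstrates how to write a Delta Lake table to Amazon S3 and register the table to the AWS Glue Data Catalog.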

------
#### [ Python ]

```
# Example: Create a Delta Lake table from a DataFrame 
# and register the table to Glue Data Catalog

additional_options = {
    "path": "s3://<s3Path>"
}
dataFrame.write \
    .format("delta") \
    .options(**additional_options) \
    .mode("append") \
    .partitionBy("<your_partitionkey_field>") \
    .saveAsTable("<your_database_name>.<your_table_name>")
```

------
#### [ Scala ]

```
// Example: Create a Delta Lake table from a DataFrame
// and register the table to Glue Data Catalog

val additional_options = Map(
  "path" -> "s3://<s3Path>"
)
dataFrame.write.format("delta")
  .options(additional_options)
  .mode("append")
  .partitionBy("<your_partitionkey_field>")
  .saveAsTable("<your_database_name>.<your_table_name>")
```

------

## Example: Read a Delta Lake table from Amazon S3 using the AWS Glue Data Catalog
<a name="aws-glue-programming-etl-format-delta-lake-read"></a>

The following AWS Glue ETL script reads the Delta Lake table that you created in [Example: Write a Delta Lake table to Amazon S3 and register it to the AWS Glue Data Catalog](#aws-glue-programming-etl-format-delta-lake-write).

------
#### [ Python ]

For this example, use the [create_data_frame.from_catalog](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create-dataframe-from-catalog) method.

```
# Example: Read a Delta Lake table from Glue Data Catalog

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

df = glueContext.create_data_frame.from_catalog(
    database="<your_database_name>",
    table_name="<your_table_name>",
    additional_options=additional_options
)
```

------
#### [ Scala ]

For this example, use the [getCatalogSource](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSource) method.

```
// Example: Read a Delta Lake table from Glue Data Catalog

import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val df = glueContext.getCatalogSource("<your_database_name>", "<your_table_name>",
      additionalOptions = additionalOptions)
      .getDataFrame()
  }
}
```

------

## Example: Insert a `DataFrame` into a Delta Lake table in Amazon S3 using the AWS Glue Data Catalog
<a name="aws-glue-programming-etl-format-delta-lake-insert"></a>

This example inserts data into the Delta Lake table that you created in [Example: Write a Delta Lake table to Amazon S3 and register it to the AWS Glue Data Catalog](#aws-glue-programming-etl-format-delta-lake-write).

**Note**  
This example requires you to set the `--enable-glue-datacatalog` job parameter in order to use the AWS Glue Data Catalog as an Apache Spark Hive metastore. To learn more, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).

------
#### [ Python ]

For this example, use the [write_data_frame.from_catalog](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-write_data_frame_from_catalog) method.

```
# Example: Insert into a Delta Lake table in S3 using Glue Data Catalog

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

glueContext.write_data_frame.from_catalog(
    frame=dataFrame,
    database="<your_database_name>",
    table_name="<your_table_name>",
    additional_options=additional_options
)
```

------
#### [ Scala ]

For this example, use the [getCatalogSink](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSink) method.

```
// Example: Insert into a Delta Lake table in S3 using Glue Data Catalog

import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    glueContext.getCatalogSink("<your_database_name>", "<your_table_name>",
      additionalOptions = additionalOptions)
      .writeDataFrame(dataFrame, glueContext)
  }
}
```

------

## Example: Read a Delta Lake table from Amazon S3 using the Spark API
<a name="aws-glue-programming-etl-format-delta_lake-read-spark"></a>

This example reads a Delta Lake table from Amazon S3 using the Spark API.

------
#### [ Python ]

```
# Example: Read a Delta Lake table from S3 using a Spark DataFrame

dataFrame = spark.read.format("delta").load("s3://<s3path/>")
```

------
#### [ Scala ]

```
// Example: Read a Delta Lake table from S3 using a Spark DataFrame

val dataFrame = spark.read.format("delta").load("s3://<s3path/>")
```

------

## Example: Write a Delta Lake table to Amazon S3 using Spark
<a name="aws-glue-programming-etl-format-delta_lake-write-spark"></a>

This example writes a Delta Lake table to Amazon S3 using Spark.

------
#### [ Python ]

```
# Example: Write a Delta Lake table to S3 using a Spark DataFrame

dataFrame.write.format("delta") \
    .options(**additional_options) \
    .mode("overwrite") \
    .partitionBy("<your_partitionkey_field>") \
    .save("s3://<s3Path>")
```

------
#### [ Scala ]

```
// Example: Write a Delta Lake table to S3 using a Spark DataFrame

dataFrame.write.format("delta")
  .options(additionalOptions)
  .mode("overwrite")
  .partitionBy("<your_partitionkey_field>")
  .save("s3://<s3path/>")
```

------

## Example: Read and write a Delta Lake table with Lake Formation permission control
<a name="aws-glue-programming-etl-format-delta-lake-read-write-lake-formation-tables"></a>

This example reads from and writes to a Delta Lake table with Lake Formation permission control.

1. Create a Delta table and register it with Lake Formation.

   1. To enable Lake Formation permission control, you first need to register the table's Amazon S3 path with Lake Formation. For more information, see [Registering an Amazon S3 location](https://docs.aws.amazon.com/lake-formation/latest/dg/register-location.html). You can register it either from the Lake Formation console or by using the AWS CLI:

      ```
      aws lakeformation register-resource --resource-arn arn:aws:s3:::<s3-bucket>/<s3-folder> --use-service-linked-role --region <REGION>
      ```

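      After you register an Amazon S3 location, any AWS Glue table that points to that location (or any of its child locations) returns `true` for the `IsRegisteredWithLakeFormation` parameter in the `GetTable` call.

   1. Create a Delta table that points to the registered Amazon S3 path through Spark. The following example is in Python: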

      ```
      dataFrame.write \
      	.format("delta") \
      	.mode("overwrite") \
      	.partitionBy("<your_partitionkey_field>") \
      	.save("s3://<the_s3_path>")
      ```

      After the data has been written to Amazon S3, use an AWS Glue crawler to create a new Delta catalog table. For more information, see [Introducing native Delta Lake table support with AWS Glue crawlers](https://aws.amazon.com/blogs/big-data/introducing-native-delta-lake-table-support-with-aws-glue-crawlers/).

      You can also create the table manually through the AWS Glue `CreateTable` API.

1. Grant Lake Formation permissions to the AWS Glue job IAM role. You can grant permissions either from the Lake Formation console or by using the AWS CLI. For more information, see [Granting table permissions using the Lake Formation console and the named resource method](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-table-permissions.html).

1. Read the Delta table registered with Lake Formation. The code is the same as reading a non-registered Delta table. Note that the AWS Glue job IAM role needs the SELECT permission for the read to succeed.

   ```
   # Example: Read a Delta Lake table from Glue Data Catalog
   
   df = glueContext.create_data_frame.from_catalog(
       database="<your_database_name>",
       table_name="<your_table_name>",
       additional_options=additional_options
   )
   ```

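1. Write to the Delta table registered with Lake Formation. The code is the same as writing to a non-registered Delta table. Note that the AWS Glue job IAM role needs the SUPER permission for the write to succeed.

   By default, AWS Glue uses `Append` as the saveMode. You can change it by setting the saveMode option in `additional_options`. For information about saveMode support in Delta tables, see [Write to a table](https://docs.delta.io/latest/delta-batch.html#write-to-a-table).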

   ```
   glueContext.write_data_frame.from_catalog(
       frame=dataFrame,
       database="<your_database_name>",
       table_name="<your_table_name>",
       additional_options=additional_options
   )
   ```
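
The saveMode override mentioned in the last step can be sketched as follows. This is an illustration under assumptions: the helper name `catalog_write_options` is hypothetical, and `"overwrite"` is only an assumed example value, so confirm the accepted values for your AWS Glue version before relying on it.

```python
# Hypothetical helper: build the additional_options dictionary for a catalog
# write. "saveMode" is the option named above; leaving it out keeps the
# default behavior (Append).
def catalog_write_options(save_mode=None):
    options = {}
    if save_mode is not None:
        options["saveMode"] = save_mode
    return options


# "overwrite" is an assumed example value.
additional_options = catalog_write_options("overwrite")
print(additional_options)

# Inside an AWS Glue job, the dictionary would then be passed along:
# glueContext.write_data_frame.from_catalog(
#     frame=dataFrame,
#     database="<your_database_name>",
#     table_name="<your_table_name>",
#     additional_options=additional_options,
# )
```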

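# Using the Iceberg framework in AWS Glue
<a name="aws-glue-programming-etl-format-iceberg"></a>

AWS Glue 3.0 and later supports the Apache Iceberg framework for data lakes. Iceberg provides a high-performance table format that works just like a SQL table. This topic covers the features available for using your data in AWS Glue when you transport or store it in an Iceberg table. To learn more about Iceberg, see the [official Apache Iceberg documentation](https://iceberg.apache.org/docs/latest/).

You can use AWS Glue to perform read and write operations on Iceberg tables in Amazon S3, or work with Iceberg tables using the AWS Glue Data Catalog. Additional operations, including insert and all [Spark queries](https://iceberg.apache.org/docs/latest/spark-queries/) and [Spark writes](https://iceberg.apache.org/docs/latest/spark-writes/), are also supported. Update is not supported for Iceberg tables.

**Note**  
`ALTER TABLE … RENAME TO` is not available for Apache Iceberg 0.13.1 for AWS Glue 3.0.

The following table lists the version of Iceberg included in each AWS Glue version.

| AWS Glue version | Supported Iceberg version | 
| --- | --- | 
| 5.1 | 1.10.0 | 
| 5.0 | 1.7.1 | 
| 4.0 | 1.0.0 | 
| 3.0 | 0.13.1 | 

To learn more about the data lake frameworks that AWS Glue supports, see [Using data lake frameworks with AWS Glue ETL jobs](aws-glue-programming-etl-datalake-native-frameworks.md).

## Enabling the Iceberg framework
<a name="aws-glue-programming-etl-format-iceberg-enable"></a>

To enable Iceberg for AWS Glue, complete the following tasks:
+ Specify `iceberg` as a value for the `--datalake-formats` job parameter. For more information, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).
+ Create a key named `--conf` for your AWS Glue job, and set it to the following value. Alternatively, you can set the following configuration using `SparkConf` in your script. These settings help Apache Spark correctly handle Iceberg tables.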

  ```
  spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions 
  --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog 
  --conf spark.sql.catalog.glue_catalog.warehouse=s3://<your-warehouse-dir>/ 
  --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog 
  --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
  ```

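  If you are reading from or writing to Iceberg tables that are registered with Lake Formation, in AWS Glue 5.0 and later follow the guidance in [Using AWS Glue with AWS Lake Formation for fine-grained access control](security-lf-enable.md). In AWS Glue 4.0, add the following configuration to enable Lake Formation support.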

  ```
  --conf spark.sql.catalog.glue_catalog.glue.lakeformation-enabled=true
  --conf spark.sql.catalog.glue_catalog.glue.id=<table-catalog-id>
  ```

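  If you use AWS Glue 3.0 with Iceberg 0.13.1, you must set the following additional configurations to use the Amazon DynamoDB lock manager to ensure atomic transactions. AWS Glue 4.0 or later uses optimistic locking by default. For more information, see [Iceberg AWS Integrations](https://iceberg.apache.org/docs/latest/aws/#dynamodb-lock-manager) in the official Apache Iceberg documentation.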

  ```
  --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager 
  --conf spark.sql.catalog.glue_catalog.lock.table=<your-dynamodb-table-name>
  ```
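
The version-dependent settings above can be combined into a single `--conf` value. The following Python sketch shows one possible way to do that. The function name and its arguments are hypothetical; the setting names are taken from the configuration blocks above, and the version checks mirror the text (Lake Formation settings for AWS Glue 4.0, DynamoDB lock manager for AWS Glue 3.0).

```python
# Hypothetical helper: assemble the --conf value for Iceberg on AWS Glue.
def iceberg_conf_value(warehouse, glue_version, *, lake_formation_catalog_id=None,
                       lock_table=None):
    catalog = "spark.sql.catalog.glue_catalog"
    settings = [
        ("spark.sql.extensions",
         "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"),
        (catalog, "org.apache.iceberg.spark.SparkCatalog"),
        (f"{catalog}.warehouse", warehouse),
        (f"{catalog}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog"),
        (f"{catalog}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO"),
    ]
    # AWS Glue 4.0: opt in to Lake Formation support when a catalog ID is given.
    if glue_version == "4.0" and lake_formation_catalog_id is not None:
        settings.append((f"{catalog}.glue.lakeformation-enabled", "true"))
        settings.append((f"{catalog}.glue.id", lake_formation_catalog_id))
    # AWS Glue 3.0 (Iceberg 0.13.1): DynamoDB lock manager for atomic transactions.
    if glue_version == "3.0":
        if lock_table is None:
            raise ValueError("AWS Glue 3.0 requires a DynamoDB lock table name")
        settings.append((f"{catalog}.lock-impl",
                         "org.apache.iceberg.aws.glue.DynamoLockManager"))
        settings.append((f"{catalog}.lock.table", lock_table))
    # The first setting has no prefix; the rest are chained with " --conf ".
    return " --conf ".join(f"{key}={value}" for key, value in settings)


print(iceberg_conf_value("s3://<your-warehouse-dir>/", "3.0",
                         lock_table="<your-dynamodb-table-name>"))
```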

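**Using a different Iceberg version**

To use a version of Iceberg that AWS Glue doesn't support, specify your own Iceberg JAR files using the `--extra-jars` job parameter. Do not include `iceberg` as a value for the `--datalake-formats` parameter. If you use AWS Glue 5.0 or later, you must set the `--user-jars-first true` job parameter.

**Enabling encryption for Iceberg tables**

**Note**  
Iceberg tables have their own mechanisms to enable server-side encryption. You should enable this configuration in addition to the AWS Glue security configuration.

To enable server-side encryption on Iceberg tables, review the guidance in the [Iceberg documentation](https://iceberg.apache.org/docs/latest/aws/#s3-server-side-encryption).

**Adding Spark configuration for Iceberg cross-Region table access**

To add additional Spark configuration for Iceberg cross-Region table access with the AWS Glue Data Catalog and AWS Lake Formation, follow these steps:

1. Create a [multi-Region access point](https://docs.aws.amazon.com/AmazonS3/latest/userguide/multi-region-access-point-create-examples.html).

1. Set the following Spark properties: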

   ```
       --conf spark.sql.catalog.{CATALOG}.s3.use-arn-region-enabled=true \
       --conf spark.sql.catalog.{CATALOG}.s3.access-points.bucket1=arn:aws:s3::<account-id>:accesspoint/<mrap-id>.mrap \
       --conf spark.sql.catalog.{CATALOG}.s3.access-points.bucket2=arn:aws:s3::<account-id>:accesspoint/<mrap-id>.mrap
   ```

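## Example: Write an Iceberg table to Amazon S3 and register it to the AWS Glue Data Catalog
<a name="aws-glue-programming-etl-format-iceberg-write"></a>

This example script demonstrates how to write an Iceberg table to Amazon S3. The example uses the [Iceberg AWS Integrations](https://iceberg.apache.org/docs/latest/aws/) to register the table to the AWS Glue Data Catalog.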

------
#### [ Python ]

```
# Example: Create an Iceberg table from a DataFrame 
# and register the table to Glue Data Catalog

dataFrame.createOrReplaceTempView("tmp_<your_table_name>")

query = f"""
CREATE TABLE glue_catalog.<your_database_name>.<your_table_name>
USING iceberg
TBLPROPERTIES ("format-version"="2")
AS SELECT * FROM tmp_<your_table_name>
"""
spark.sql(query)
```

------
#### [ Scala ]

```
// Example: Create an Iceberg table from a DataFrame
// and register the table to Glue Data Catalog

dataFrame.createOrReplaceTempView("tmp_<your_table_name>")

val query = """CREATE TABLE glue_catalog.<your_database_name>.<your_table_name>
USING iceberg
TBLPROPERTIES ("format-version"="2")
AS SELECT * FROM tmp_<your_table_name>
"""
spark.sql(query)
```

------

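Alternatively, you can write an Iceberg table to Amazon S3 and the Data Catalog using Spark methods.

Prerequisites: You need to provision a catalog for the Iceberg library to use. When you use the AWS Glue Data Catalog, AWS Glue makes this straightforward. The AWS Glue Data Catalog is pre-configured for use by the Spark libraries as `glue_catalog`. Data Catalog tables are identified by a *databaseName* and a *tableName*. For more information about the AWS Glue Data Catalog, see [Data discovery and cataloging in AWS Glue](catalog-and-crawler.md).

If you are not using the AWS Glue Data Catalog, you need to provision a catalog through the Spark APIs. For more information, see [Spark Configuration](https://iceberg.apache.org/docs/latest/spark-configuration/) in the Iceberg documentation.

This example writes an Iceberg table to Amazon S3 and the Data Catalog using Spark.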

------
#### [ Python ]

```
# Example: Write an Iceberg table to S3 on the Glue Data Catalog

# Create (equivalent to CREATE TABLE AS SELECT)
dataFrame.writeTo("glue_catalog.databaseName.tableName") \
    .tableProperty("format-version", "2") \
    .create()

# Append (equivalent to INSERT INTO)
dataFrame.writeTo("glue_catalog.databaseName.tableName") \
    .tableProperty("format-version", "2") \
    .append()
```

------
#### [ Scala ]

```
// Example: Write an Iceberg table to S3 on the Glue Data Catalog

// Create (equivalent to CREATE TABLE AS SELECT)
dataFrame.writeTo("glue_catalog.databaseName.tableName")
    .tableProperty("format-version", "2")
    .create()

// Append (equivalent to INSERT INTO)
dataFrame.writeTo("glue_catalog.databaseName.tableName")
    .tableProperty("format-version", "2")
    .append()
```

------

## Example: Read an Iceberg table from Amazon S3 using the AWS Glue Data Catalog
<a name="aws-glue-programming-etl-format-iceberg-read"></a>

This example reads the Iceberg table that you created in [Example: Write an Iceberg table to Amazon S3 and register it to the AWS Glue Data Catalog](#aws-glue-programming-etl-format-iceberg-write).

------
#### [ Python ]

For this example, use the `GlueContext.create_data_frame.from_catalog()` method.

```
# Example: Read an Iceberg table from Glue Data Catalog

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

df = glueContext.create_data_frame.from_catalog(
    database="<your_database_name>",
    table_name="<your_table_name>",
    additional_options=additional_options
)
```

------
#### [ Scala ]

For this example, use the [getCatalogSource](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSource) method.

```
// Example: Read an Iceberg table from Glue Data Catalog

import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val df = glueContext.getCatalogSource("<your_database_name>", "<your_table_name>",
      additionalOptions = additionalOptions)
      .getDataFrame()
  }
}
```

------

## Example: Insert a `DataFrame` into an Iceberg table in Amazon S3 using the AWS Glue Data Catalog
<a name="aws-glue-programming-etl-format-iceberg-insert"></a>

This example inserts data into the Iceberg table that you created in [Example: Write an Iceberg table to Amazon S3 and register it to the AWS Glue Data Catalog](#aws-glue-programming-etl-format-iceberg-write).

**Note**  
This example requires you to set the `--enable-glue-datacatalog` job parameter in order to use the AWS Glue Data Catalog as an Apache Spark Hive metastore. To learn more, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).

------
#### [ Python ]

For this example, use the `GlueContext.write_data_frame.from_catalog()` method.

```
# Example: Insert into an Iceberg table from Glue Data Catalog

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

glueContext.write_data_frame.from_catalog(
    frame=dataFrame,
    database="<your_database_name>",
    table_name="<your_table_name>",
    additional_options=additional_options
)
```

------
#### [ Scala ]

For this example, use the [getCatalogSink](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSink) method.

```
// Example: Insert into an Iceberg table from Glue Data Catalog

import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    glueContext.getCatalogSink("<your_database_name>", "<your_table_name>",
      additionalOptions = additionalOptions)
      .writeDataFrame(dataFrame, glueContext)
  }
}
```

------

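## Example: Read an Iceberg table from Amazon S3 using Spark
<a name="aws-glue-programming-etl-format-iceberg-read-spark"></a>

Prerequisites: You need to provision a catalog for the Iceberg library to use. When you use the AWS Glue Data Catalog, AWS Glue makes this straightforward. The AWS Glue Data Catalog is pre-configured for use by the Spark libraries as `glue_catalog`. Data Catalog tables are identified by a *databaseName* and a *tableName*. For more information about the AWS Glue Data Catalog, see [Data discovery and cataloging in AWS Glue](catalog-and-crawler.md).

If you are not using the AWS Glue Data Catalog, you need to provision a catalog through the Spark APIs. For more information, see [Spark Configuration](https://iceberg.apache.org/docs/latest/spark-configuration/) in the Iceberg documentation.

This example uses Spark to read an Iceberg table in Amazon S3 through the Data Catalog.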

------
#### [ Python ]

```
# Example: Read an Iceberg table on S3 as a DataFrame from the Glue Data Catalog

dataFrame = spark.read.format("iceberg").load("glue_catalog.databaseName.tableName")
```

------
#### [ Scala ]

```
// Example: Read an Iceberg table on S3 as a DataFrame from the Glue Data Catalog

val dataFrame = spark.read.format("iceberg").load("glue_catalog.databaseName.tableName")
```

------

## Example: Read and write an Iceberg table with Lake Formation permission control
<a name="aws-glue-programming-etl-format-iceberg-read-write-lake-formation-tables"></a>

This example reads from and writes to an Iceberg table with Lake Formation permission control.

**Note**  
This example works only in AWS Glue 4.0. In AWS Glue 5.0 and later, follow the guidance in [Using AWS Glue with AWS Lake Formation for fine-grained access control](security-lf-enable.md).

1. Create an Iceberg table and register it with Lake Formation:

   1. To enable Lake Formation permission control, you first need to register the table's Amazon S3 path with Lake Formation. For more information, see [Registering an Amazon S3 location](https://docs.aws.amazon.com/lake-formation/latest/dg/register-location.html). You can register it either from the Lake Formation console or by using the AWS CLI:

      ```
      aws lakeformation register-resource --resource-arn arn:aws:s3:::<s3-bucket>/<s3-folder> --use-service-linked-role --region <REGION>
      ```

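      After you register an Amazon S3 location, any AWS Glue table that points to that location (or any of its child locations) returns `true` for the `IsRegisteredWithLakeFormation` parameter in the `GetTable` call.

   1. Create an Iceberg table that points to the registered path through Spark SQL. The following example is in Python: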

      ```
      dataFrame.createOrReplaceTempView("tmp_<your_table_name>")
      
      query = f"""
      CREATE TABLE glue_catalog.<your_database_name>.<your_table_name>
      USING iceberg
      AS SELECT * FROM tmp_<your_table_name>
      """
      spark.sql(query)
      ```

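      You can also create the table manually through the AWS Glue `CreateTable` API. For more information, see [Creating Apache Iceberg tables](https://docs.aws.amazon.com/lake-formation/latest/dg/creating-iceberg-tables.html).

      Note that the `UpdateTable` API doesn't currently support the Iceberg table format as an input to the operation.

1. Grant Lake Formation permissions to the job IAM role. You can grant permissions either from the Lake Formation console or by using the AWS CLI. For more information, see [Granting table permissions using the Lake Formation console and the named resource method](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-table-permissions.html).

1. Read the Iceberg table registered with Lake Formation. The code is the same as reading a non-registered Iceberg table. Note that your AWS Glue job IAM role needs the SELECT permission for the read to succeed.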

   ```
   # Example: Read an Iceberg table from the AWS Glue Data Catalog
   from awsglue.context import GlueContext
   from pyspark.context import SparkContext
   
   sc = SparkContext()
   glueContext = GlueContext(sc)
   
   df = glueContext.create_data_frame.from_catalog(
       database="<your_database_name>",
       table_name="<your_table_name>",
       additional_options=additional_options
   )
   ```

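1. Write to the Iceberg table registered with Lake Formation. The code is the same as writing to a non-registered Iceberg table. Note that your AWS Glue job IAM role needs the SUPER permission for the write to succeed.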

   ```
   glueContext.write_data_frame.from_catalog(
       frame=dataFrame,
       database="<your_database_name>",
       table_name="<your_table_name>",
       additional_options=additional_options
   )
   ```

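## Shared configuration reference
<a name="aws-glue-programming-etl-format-shared-reference"></a>

You can use the following `format_options` values with any format type:
+ `attachFilename` - A string in the appropriate format to be used as a column name. If you provide this option, the name of the source file for the record is appended to the record. The parameter value is used as the column name.
+ `attachTimestamp` - A string in the appropriate format to be used as a column name. If you provide this option, the modification time of the source file for the record is appended to the record. The parameter value is used as the column name.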