

# Use non-Hive table formats in Athena for Spark
<a name="notebooks-spark-table-formats"></a>

**Note**  
This page applies to PySpark engine version 3 in Athena for Spark. For supported open table format versions, see [Amazon EMR 7.12](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-7120-release.html).

When you work with sessions and notebooks in Athena for Spark, you can use Linux Foundation Delta Lake, Apache Hudi, and Apache Iceberg tables, in addition to Apache Hive tables.

## Considerations and limitations
<a name="notebooks-spark-table-formats-considerations-and-limitations"></a>

When you use table formats other than Apache Hive with Athena for Spark, consider the following points:
+ Each notebook supports only one table format in addition to Apache Hive. To use multiple table formats in Athena for Spark, create a separate notebook for each table format. For information about creating notebooks in Athena for Spark, see [Step 7: Create your own notebook](notebooks-spark-getting-started.md#notebooks-spark-getting-started-creating-your-own-notebook).
+ The Delta Lake, Hudi, and Iceberg table formats have been tested on Athena for Spark by using AWS Glue as the metastore. You might be able to use other metastores, but such usage is not currently supported.
+ To use the additional table formats, override the default `spark_catalog` property, as indicated in the Athena console and in this documentation. These non-Hive catalogs can read Hive tables, in addition to their own table formats.

## Table versions
<a name="notebooks-spark-table-formats-versions"></a>

The following table shows supported non-Hive table versions in Amazon Athena for Apache Spark.


| Table format | Supported version | 
| --- | --- | 
| Apache Iceberg | 1.2.1 | 
| Apache Hudi | 0.13 | 
| Linux Foundation Delta Lake | 2.0.2 | 

In Athena for Spark, these table format `.jar` files and their dependencies are loaded onto the classpath for Spark drivers and executors.

For an *AWS Big Data Blog* post that shows how to work with Iceberg, Hudi, and Delta Lake table formats using Spark SQL in Amazon Athena notebooks, see [Use Amazon Athena with Spark SQL for your open-source transactional table formats](https://aws.amazon.com/blogs/big-data/use-amazon-athena-with-spark-sql-for-your-open-source-transactional-table-formats/).

**Topics**
+ [Considerations and limitations](#notebooks-spark-table-formats-considerations-and-limitations)
+ [Table versions](#notebooks-spark-table-formats-versions)
+ [Iceberg](notebooks-spark-table-formats-apache-iceberg.md)
+ [Hudi](notebooks-spark-table-formats-apache-hudi.md)
+ [Delta Lake](notebooks-spark-table-formats-linux-foundation-delta-lake.md)

# Use Apache Iceberg tables in Athena for Spark
<a name="notebooks-spark-table-formats-apache-iceberg"></a>

[Apache Iceberg](https://iceberg.apache.org/) is an open table format for large datasets in Amazon Simple Storage Service (Amazon S3). It provides you with fast query performance over large tables, atomic commits, concurrent writes, and SQL-compatible table evolution.

To use Apache Iceberg tables in Athena for Spark, configure the following Spark properties. These properties are configured for you by default in the Athena for Spark console when you choose Apache Iceberg as the table format. For steps, see [Step 4: Edit session details](notebooks-spark-getting-started.md#notebooks-spark-getting-started-editing-session-details) or [Step 7: Create your own notebook](notebooks-spark-getting-started.md#notebooks-spark-getting-started-creating-your-own-notebook).

```
"spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
"spark.sql.catalog.spark_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
"spark.sql.catalog.spark_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
"spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
```

The following procedure shows you how to use an Apache Iceberg table in an Athena for Spark notebook. Run each step in a new cell in the notebook.

**To use an Apache Iceberg table in Athena for Spark**

1. Define the constants to use in the notebook.

   ```
   DB_NAME = "NEW_DB_NAME"
   TABLE_NAME = "NEW_TABLE_NAME"
   TABLE_S3_LOCATION = "s3://amzn-s3-demo-bucket"
   ```

1. Create an Apache Spark [DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html).

   ```
   columns = ["language","users_count"]
   data = [("Golang", 3000)]
   df = spark.createDataFrame(data, columns)
   ```

1. Create a database.

   ```
   spark.sql("CREATE DATABASE {} LOCATION '{}'".format(DB_NAME, TABLE_S3_LOCATION))
   ```

1. Create an empty Apache Iceberg table.

   ```
   spark.sql("""
   CREATE TABLE {}.{} (
     language string,
     users_count int
   ) USING ICEBERG
   """.format(DB_NAME, TABLE_NAME))
   ```

1. Insert a row of data into the table.

   ```
   spark.sql("""INSERT INTO {}.{} VALUES ('Golang', 3000)""".format(DB_NAME, TABLE_NAME))
   ```

1. Confirm that you can query the new table.

   ```
   spark.sql("SELECT * FROM {}.{}".format(DB_NAME, TABLE_NAME)).show()
   ```
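
The Iceberg session extensions configured earlier also enable row-level `UPDATE` and `DELETE` statements through Spark SQL. The following is a minimal sketch in the same `.format()` style as the steps above; the `spark.sql` call is shown as a comment because it assumes the notebook session and the constants from step 1.

```python
DB_NAME = "NEW_DB_NAME"
TABLE_NAME = "NEW_TABLE_NAME"

# Row-level update made available by IcebergSparkSessionExtensions.
update_sql = "UPDATE {}.{} SET users_count = 4000 WHERE language = 'Golang'".format(
    DB_NAME, TABLE_NAME
)
# In a notebook cell: spark.sql(update_sql)
```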

For more information and examples on working with Spark DataFrames and Iceberg tables, see [Spark Queries](https://iceberg.apache.org/docs/latest/spark-queries/) in the Apache Iceberg documentation.

# Use Apache Hudi tables in Athena for Spark
<a name="notebooks-spark-table-formats-apache-hudi"></a>

[Apache Hudi](https://hudi.apache.org/) is an open-source data management framework that simplifies incremental data processing. It processes record-level insert, update, upsert, and delete actions with greater precision, which reduces overhead.

To use Apache Hudi tables in Athena for Spark, configure the following Spark properties. These properties are configured for you by default in the Athena for Spark console when you choose Apache Hudi as the table format. For steps, see [Step 4: Edit session details](notebooks-spark-getting-started.md#notebooks-spark-getting-started-editing-session-details) or [Step 7: Create your own notebook](notebooks-spark-getting-started.md#notebooks-spark-getting-started-creating-your-own-notebook).

```
"spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
```

The following procedure shows you how to use an Apache Hudi table in an Athena for Spark notebook. Run each step in a new cell in the notebook.

**To use an Apache Hudi table in Athena for Spark**

1. Define the constants to use in the notebook.

   ```
   DB_NAME = "NEW_DB_NAME"
   TABLE_NAME = "NEW_TABLE_NAME"
   TABLE_S3_LOCATION = "s3://amzn-s3-demo-bucket"
   ```

1. Create an Apache Spark [DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html).

   ```
   columns = ["language","users_count"]
   data = [("Golang", 3000)]
   df = spark.createDataFrame(data, columns)
   ```

1. Create a database.

   ```
   spark.sql("CREATE DATABASE {} LOCATION '{}'".format(DB_NAME, TABLE_S3_LOCATION))
   ```

1. Create an empty Apache Hudi table.

   ```
   spark.sql("""
   CREATE TABLE {}.{} (
     language string,
     users_count int
   ) USING HUDI
   TBLPROPERTIES (
     primaryKey = 'language',
     type = 'mor'
   )
   """.format(DB_NAME, TABLE_NAME))
   ```

1. Insert a row of data into the table.

   ```
   spark.sql("""INSERT INTO {}.{} VALUES ('Golang', 3000)""".format(DB_NAME, TABLE_NAME))
   ```

1. Confirm that you can query the new table.

   ```
   spark.sql("SELECT * FROM {}.{}".format(DB_NAME, TABLE_NAME)).show()
   ```
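
Because the table above declares `primaryKey = 'language'`, Hudi's Spark session extension also supports `MERGE INTO` for record-level upserts. The following is a minimal sketch under the same assumptions (notebook session, constants from step 1); the `spark.sql` call is shown as a comment.

```python
DB_NAME = "NEW_DB_NAME"
TABLE_NAME = "NEW_TABLE_NAME"

# Upsert keyed on the declared primaryKey column.
merge_sql = """
MERGE INTO {db}.{tbl} AS t
USING (SELECT 'Golang' AS language, 4000 AS users_count) AS s
ON t.language = s.language
WHEN MATCHED THEN UPDATE SET users_count = s.users_count
WHEN NOT MATCHED THEN INSERT *
""".format(db=DB_NAME, tbl=TABLE_NAME)
# In a notebook cell: spark.sql(merge_sql)
```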

# Use Linux Foundation Delta Lake tables in Athena for Spark
<a name="notebooks-spark-table-formats-linux-foundation-delta-lake"></a>

[Linux Foundation Delta Lake](https://delta.io/) is a table format that you can use for big data analytics. You can use Athena for Spark to read Delta Lake tables stored in Amazon S3 directly.

To use Delta Lake tables in Athena for Spark, configure the following Spark properties. These properties are configured for you by default in the Athena for Spark console when you choose Delta Lake as the table format. For steps, see [Step 4: Edit session details](notebooks-spark-getting-started.md#notebooks-spark-getting-started-editing-session-details) or [Step 7: Create your own notebook](notebooks-spark-getting-started.md#notebooks-spark-getting-started-creating-your-own-notebook).

```
"spark.sql.catalog.spark_catalog" : "org.apache.spark.sql.delta.catalog.DeltaCatalog", 
"spark.sql.extensions" : "io.delta.sql.DeltaSparkSessionExtension"
```

The following procedure shows you how to use a Delta Lake table in an Athena for Spark notebook. Run each step in a new cell in the notebook.

**To use a Delta Lake table in Athena for Spark**

1. Define the constants to use in the notebook.

   ```
   DB_NAME = "NEW_DB_NAME" 
   TABLE_NAME = "NEW_TABLE_NAME" 
   TABLE_S3_LOCATION = "s3://amzn-s3-demo-bucket"
   ```

1. Create an Apache Spark [DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html).

   ```
   columns = ["language","users_count"] 
   data = [("Golang", 3000)] 
   df = spark.createDataFrame(data, columns)
   ```

1. Create a database.

   ```
   spark.sql("CREATE DATABASE {} LOCATION '{}'".format(DB_NAME, TABLE_S3_LOCATION))
   ```

1. Create an empty Delta Lake table.

   ```
   spark.sql("""
   CREATE TABLE {}.{} ( 
     language string, 
     users_count int 
   ) USING DELTA 
   """.format(DB_NAME, TABLE_NAME))
   ```

1. Insert a row of data into the table.

   ```
   spark.sql("""INSERT INTO {}.{} VALUES ('Golang', 3000)""".format(DB_NAME, TABLE_NAME))
   ```

1. Confirm that you can query the new table.

   ```
   spark.sql("SELECT * FROM {}.{}".format(DB_NAME, TABLE_NAME)).show()
   ```
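
The DataFrame created in step 2 is not consumed by the SQL steps. As a sketch, assuming the notebook session and the `df` and constants defined earlier, you could append it to the Delta Lake table with the DataFrame writer API; the write call is shown as a comment.

```python
DB_NAME = "NEW_DB_NAME"
TABLE_NAME = "NEW_TABLE_NAME"

# Fully qualified table name for the DataFrame writer.
full_table = "{}.{}".format(DB_NAME, TABLE_NAME)
# In a notebook cell:
# df.write.format("delta").mode("append").saveAsTable(full_table)
```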