

# Populating and managing transactional tables
<a name="populate-otf"></a>

[Apache Iceberg](https://iceberg.apache.org/), [Apache Hudi](https://hudi.incubator.apache.org/), and Linux Foundation [Delta Lake](https://delta.io/) are open-source table formats designed for handling large-scale data analytics and data lake workloads in Apache Spark. 

You can populate Iceberg, Hudi, and Delta Lake tables in the AWS Glue Data Catalog using the following methods: 
+ AWS Glue crawler – AWS Glue crawlers can automatically discover and populate Iceberg, Hudi, and Delta Lake table metadata in the Data Catalog. For more information, see [Using crawlers to populate the Data Catalog](add-crawler.md).
+ AWS Glue ETL Jobs – You can create ETL jobs to write data to Iceberg, Hudi, and Delta Lake tables and populate their metadata in the Data Catalog. For more information, see [Using data lake frameworks with AWS Glue ETL jobs](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-datalake-native-frameworks.html).
+ AWS Glue console, AWS Lake Formation console, AWS CLI, or API – You can use the AWS Glue console, Lake Formation console, AWS CLI, or API to create and manage Iceberg table definitions in the Data Catalog.

**Topics**
+ [Creating Apache Iceberg tables](#creating-iceberg-tables)
+ [Optimizing Iceberg tables](table-optimizers.md)
+ [Optimizing query performance for Iceberg tables](iceberg-column-statistics.md)

## Creating Apache Iceberg tables
<a name="creating-iceberg-tables"></a>

You can create Apache Iceberg tables that use the Apache Parquet data format in the AWS Glue Data Catalog with data residing in Amazon S3. A table in the Data Catalog is the metadata definition that represents the data in a data store. By default, AWS Glue creates Iceberg v2 tables. For the difference between v1 and v2 tables, see [Format version changes](https://iceberg.apache.org/spec/#appendix-e-format-version-changes) in the Apache Iceberg documentation.

[Apache Iceberg](https://iceberg.apache.org/) is an open table format for very large analytic datasets. Iceberg supports schema evolution, meaning that users can add, rename, or remove columns from a data table without disrupting the underlying data. Iceberg also provides support for data versioning, which allows users to track changes to data over time. This enables the time travel feature, which allows users to access and query historical versions of data and analyze changes to the data between updates and deletes.

You can use the AWS Glue console, the Lake Formation console, or the `CreateTable` operation in the AWS Glue API to create an Iceberg table in the Data Catalog. For more information, see [CreateTable action (Python: create\_table)](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-CreateTable).

When you create an Iceberg table in the Data Catalog, you must specify the table format and metadata file path in Amazon S3 to be able to perform reads and writes.

 You can use Lake Formation to secure your Iceberg table using fine-grained access control permissions when you register the Amazon S3 data location with AWS Lake Formation. For source data in Amazon S3 and metadata that is not registered with Lake Formation, access is determined by IAM permissions policies for Amazon S3 and AWS Glue actions. For more information, see [Managing permissions](https://docs.aws.amazon.com/lake-formation/latest/dg/managing-permissions.html). 

**Note**  
Data Catalog doesn’t support creating partitions and adding Iceberg table properties.

### Prerequisites
<a name="iceberg-prerequisites"></a>

 To create Iceberg tables in the Data Catalog, and set up Lake Formation data access permissions, you need to complete the following requirements: 

1. **Permissions required to create Iceberg tables without the data registered with Lake Formation:**

   In addition to the permissions required to create a table in the Data Catalog, the table creator requires the following permissions:
   + `s3:PutObject` on resource arn:aws:s3:::{bucketName}
   + `s3:GetObject` on resource arn:aws:s3:::{bucketName}
   + `s3:DeleteObject` on resource arn:aws:s3:::{bucketName}

1. **Permissions required to create Iceberg tables with data registered with Lake Formation:**

   To use Lake Formation to manage and secure the data in your data lake, register your Amazon S3 location that has the data for tables with Lake Formation. This is so that Lake Formation can vend credentials to AWS analytical services such as Athena, Redshift Spectrum, and Amazon EMR to access data. For more information on registering an Amazon S3 location, see [Adding an Amazon S3 location to your data lake](https://docs.aws.amazon.com/lake-formation/latest/dg/register-data-lake.html). 

   A principal who reads and writes the underlying data that is registered with Lake Formation requires the following permissions:
   + `lakeformation:GetDataAccess`
   + `DATA_LOCATION_ACCESS`

     A principal who has data location permissions on a location also has location permissions on all child locations.

     For more information on data location permissions, see [Underlying data access control](https://docs.aws.amazon.com/lake-formation/latest/dg/access-control-underlying-data.html#data-location-permissions).

To enable compaction, the service needs to assume an IAM role that has permissions to update tables in the Data Catalog. For details, see [Table optimization prerequisites](optimization-prerequisites.md).

### Creating an Iceberg table
<a name="create-iceberg-table"></a>

You can create Iceberg v1 and v2 tables using the AWS Glue console, the Lake Formation console, or the AWS Command Line Interface as documented on this page. You can also create Iceberg tables using the AWS Glue crawler. For more information, see [Data Catalog and Crawlers](https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html) in the AWS Glue Developer Guide.

**To create an Iceberg table**

------
#### [ Console ]

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Under Data Catalog, choose **Tables**, and use the **Create table** button to specify the following attributes:
   + **Table name** – Enter a name for the table. If you’re using Athena to access tables, use these [naming tips](https://docs.aws.amazon.com/athena/latest/ug/tables-databases-columns-names.html) in the Amazon Athena User Guide.
   + **Database** – Choose an existing database or create a new one.
   + **Description** – The description of the table. You can write a description to help you understand the contents of the table.
   + **Table format** – For **Table format**, choose Apache Iceberg.
   + **Enable compaction** – Choose **Enable compaction** to compact small Amazon S3 objects in the table into larger objects.
   + **IAM role** – To run compaction, the service assumes an IAM role on your behalf. You can choose an IAM role using the drop-down. Ensure that the role has the permissions required to enable compaction.

     To learn more about the required permissions, see [Table optimization prerequisites](optimization-prerequisites.md).
   + **Location** – Specify the path to the folder in Amazon S3 that stores the metadata table. Iceberg needs a metadata file and location in the Data Catalog to be able to perform reads and writes.
   + **Schema** – Choose **Add columns** to add columns and data types of the columns. You have the option to create an empty table and update the schema later. Data Catalog supports Hive data types. For more information, see [Hive data types](https://cwiki.apache.org/confluence/plugins/servlet/mobile?contentId=27838462#content/view/27838462). 

      Iceberg allows you to evolve schema and partition after you create the table. You can use [Athena queries](https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-evolving-table-schema.html) to update the table schema and [Spark queries](https://iceberg.apache.org/docs/latest/spark-ddl/#alter-table-sql-extensions) for updating partitions. 

------
#### [ AWS CLI ]

```
aws glue create-table \
    --database-name iceberg-db \
    --region us-west-2 \
    --open-table-format-input '{
      "IcebergInput": { 
           "MetadataOperation": "CREATE",
           "Version": "2"
         }
      }' \
    --table-input '{"Name":"test-iceberg-input-demo",
            "TableType": "EXTERNAL_TABLE",
            "StorageDescriptor":{ 
               "Columns":[ 
                   {"Name":"col1", "Type":"int"}, 
                   {"Name":"col2", "Type":"int"}, 
                   {"Name":"col3", "Type":"string"}
                ], 
               "Location":"s3://DOC_EXAMPLE_BUCKET_ICEBERG/"
            }
        }'
```

------
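The CLI request above can also be sketched with the AWS SDK for Python (Boto3). This is a minimal sketch, assuming Boto3 is installed and AWS credentials are configured. The helper names `build_iceberg_table_request` and `create_iceberg_table` are hypothetical, and the database, table, and bucket values simply mirror the CLI example.

```python
def build_iceberg_table_request(database, table, location, columns):
    """Build keyword arguments for the Glue CreateTable operation (Iceberg v2)."""
    return {
        "DatabaseName": database,
        "TableInput": {
            "Name": table,
            "TableType": "EXTERNAL_TABLE",
            "StorageDescriptor": {
                "Columns": [{"Name": name, "Type": col_type} for name, col_type in columns],
                "Location": location,
            },
        },
        # Tells the Data Catalog to create the Iceberg metadata for the table.
        "OpenTableFormatInput": {
            "IcebergInput": {"MetadataOperation": "CREATE", "Version": "2"}
        },
    }


def create_iceberg_table(request, region="us-west-2"):
    """Send the request. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    glue = boto3.client("glue", region_name=region)
    return glue.create_table(**request)


request = build_iceberg_table_request(
    "iceberg-db",
    "test-iceberg-input-demo",
    "s3://DOC_EXAMPLE_BUCKET_ICEBERG/",
    [("col1", "int"), ("col2", "int"), ("col3", "string")],
)
# create_iceberg_table(request)  # uncomment to call AWS Glue
```

Keeping the request builder separate from the API call makes it possible to inspect or log the payload before anything is sent to AWS Glue.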

**Topics**
+ [Prerequisites](#iceberg-prerequisites)
+ [Creating an Iceberg table](#create-iceberg-table)

# Optimizing Iceberg tables
<a name="table-optimizers"></a>

AWS Glue supports multiple table optimization options to enhance the management and performance of Apache Iceberg tables used by the AWS analytical engines and ETL jobs. These optimizers provide efficient storage utilization, improved query performance, and effective data management. There are three types of table optimizers available in AWS Glue: 
+ **Compaction** – Data compaction compacts small data files to reduce storage usage and improve read performance. Data files are merged and rewritten to remove obsolete data and consolidate fragmented data into larger, more efficient files. You can configure compaction to run automatically. 

  Binpack is the default compaction strategy in Apache Iceberg. It combines smaller data files into larger ones for optimal performance. Compaction also supports sort and Z-order strategies that cluster similar data together. Sort organizes data based on specified columns, improving query performance for filtered operations. Z-order creates sorted datasets that enhance query performance when multiple columns are queried simultaneously. All three compaction strategies (binpack, sort, and Z-order) reduce the amount of data scanned by query engines, thereby lowering query processing costs.
+ **Snapshot retention** – Snapshots are timestamped versions of an Iceberg table. Snapshot retention configurations allow customers to enforce how long to retain snapshots and how many snapshots to retain. Configuring a snapshot retention optimizer can help manage storage overhead by removing older, unnecessary snapshots and their associated underlying files.
+ **Orphan file deletion** – Orphan files are files that are no longer referenced by the Iceberg table metadata. These files can accumulate over time, especially after operations like table deletions or failed ETL jobs. Enabling orphan file deletion allows AWS Glue to periodically identify and remove these unnecessary files, freeing up storage.
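To illustrate how a retention period and a minimum snapshot count can interact, the following is a minimal sketch, not the service's implementation. It assumes that a snapshot is eligible for removal only when it is older than the retention period and is not among the most recent snapshots to keep. The function name `snapshots_to_expire` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def snapshots_to_expire(snapshot_times, now, retention_days=5, min_to_keep=1):
    """Return snapshot timestamps that are older than the retention period,
    excluding the `min_to_keep` most recent snapshots (assumed semantics)."""
    cutoff = now - timedelta(days=retention_days)
    newest_first = sorted(snapshot_times, reverse=True)
    protected = set(newest_first[:min_to_keep])
    return [t for t in newest_first if t < cutoff and t not in protected]


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
snapshots = [now - timedelta(days=d) for d in (0, 2, 6, 9)]
expired = snapshots_to_expire(snapshots, now)
print([(now - t).days for t in expired])
```

With a 5-day retention period, only the snapshots taken 6 and 9 days ago qualify for removal in this sketch. Raising `min_to_keep` protects more of the recent history regardless of age.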

Catalog-level optimization configuration is available through the Lake Formation console and using the AWS Glue `UpdateCatalog` API operation. You can enable or disable compaction, snapshot retention, and orphan file deletion optimizers for individual Iceberg tables in the Data Catalog using the AWS Glue console, AWS CLI, or AWS Glue API operations. 

 The following video demonstrates how to configure optimizers for Iceberg tables in the Data Catalog. 

[![AWS Videos](http://img.youtube.com/vi/xOXE7AS-pNA/0.jpg)](http://www.youtube.com/watch?v=xOXE7AS-pNA)


**Topics**
+ [Table optimization prerequisites](optimization-prerequisites.md)
+ [Catalog-level table optimizers](catalog-level-optimizers.md)
+ [Compaction optimization](compaction-management.md)
+ [Snapshot retention optimization](snapshot-retention-management.md)
+ [Deleting orphan files](orphan-file-deletion.md)
+ [Viewing optimization details](view-optimization-status.md)
+ [Viewing Amazon CloudWatch metrics](view-optimization-metrics.md)
+ [Deleting an optimizer](delete-optimizer.md)
+ [Considerations and limitations](optimizer-notes.md)
+ [Supported Regions for table optimizers](regions-optimizers.md)

# Table optimization prerequisites
<a name="optimization-prerequisites"></a>

The table optimizer assumes the permissions of the AWS Identity and Access Management (IAM) role that you specify when you enable optimization options (compaction, snapshot retention, and orphan file deletion) for a table. You can either create a single role for all optimizers or create separate roles for each optimizer.

**Note**  
The orphan file deletion optimizer doesn't require the `glue:UpdateTable` or `s3:PutObject` permissions. The snapshot retention and compaction optimizers require the same set of permissions.

The IAM role must have the permissions to read data and update metadata in the Data Catalog. You can create an IAM role and attach the following inline policies:
+ Add the following inline policy that grants Amazon S3 read/write permissions on the location for data that is not registered with AWS Lake Formation. This policy also includes permissions to update the table in the Data Catalog, and to permit AWS Glue to add logs in Amazon CloudWatch logs and publish metrics. For source data in Amazon S3 that isn't registered with Lake Formation, access is determined by IAM permissions policies for Amazon S3 and AWS Glue actions. 

  In the following inline policies, replace `amzn-s3-demo-bucket` with your Amazon S3 bucket name, `111122223333` and `us-east-1` with a valid AWS account number and the Region of the Data Catalog, `<database-name>` with the name of your database, and `<table-name>` with the name of the table.

------
#### [ JSON ]


  ```
  {
      "Version":"2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Action": [
                  "s3:PutObject",
                  "s3:GetObject",
                  "s3:DeleteObject"
              ],
              "Resource": [
                  "arn:aws:s3:::amzn-s3-demo-bucket/*"
              ]
          },
          {
              "Effect": "Allow",
              "Action": [
                  "s3:ListBucket"
              ],
              "Resource": [
                  "arn:aws:s3:::amzn-s3-demo-bucket"
              ]
          },
          {
              "Effect": "Allow",
              "Action": [
                  "glue:UpdateTable",
                  "glue:GetTable"
              ],
              "Resource": [
                  "arn:aws:glue:us-east-1:111122223333:table/<database-name>/<table-name>",
                  "arn:aws:glue:us-east-1:111122223333:database/<database-name>",
                  "arn:aws:glue:us-east-1:111122223333:catalog"
              ]
          },
          {
              "Effect": "Allow",
              "Action": [
                  "logs:CreateLogGroup",
                  "logs:CreateLogStream",
                  "logs:PutLogEvents"
              ],
              "Resource": [
                  "arn:aws:logs:us-east-1:111122223333:log-group:/aws-glue/iceberg-compaction/logs:*",
                  "arn:aws:logs:us-east-1:111122223333:log-group:/aws-glue/iceberg-retention/logs:*",
                  "arn:aws:logs:us-east-1:111122223333:log-group:/aws-glue/iceberg-orphan-file-deletion/logs:*"
              ]
          }
      ]
  }
  ```

------
+ Use the following policy to enable compaction for data registered with Lake Formation. 

  If the optimization role doesn't have `IAM_ALLOWED_PRINCIPALS` group permissions granted on the table, the role requires Lake Formation `ALTER`, `DESCRIBE`, `INSERT` and `DELETE` permissions on the table. 

  For more information on registering an Amazon S3 bucket with Lake Formation, see [Adding an Amazon S3 location to your data lake](https://docs.aws.amazon.com/lake-formation/latest/dg/register-data-lake.html).

------
#### [ JSON ]


  ```
  {
    "Version":"2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "lakeformation:GetDataAccess"
        ],
        "Resource": "*"
      },
      {
        "Effect": "Allow",
        "Action": [
          "glue:UpdateTable",
          "glue:GetTable"
        ],
        "Resource": [
          "arn:aws:glue:us-east-1:111122223333:table/databaseName/tableName",
          "arn:aws:glue:us-east-1:111122223333:database/databaseName",
          "arn:aws:glue:us-east-1:111122223333:catalog"
        ]
      },
      {
        "Effect": "Allow",
        "Action": [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ],
        "Resource": [
          "arn:aws:logs:us-east-1:111122223333:log-group:/aws-glue/iceberg-compaction/logs:*",
          "arn:aws:logs:us-east-1:111122223333:log-group:/aws-glue/iceberg-retention/logs:*",
          "arn:aws:logs:us-east-1:111122223333:log-group:/aws-glue/iceberg-orphan-file-deletion/logs:*"
        ]
      }
    ]
  }
  ```

------
+ (Optional) To optimize Iceberg tables with data in Amazon S3 buckets encrypted using [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingKMSEncryption.html), the compaction role requires permissions to decrypt Amazon S3 objects and generate a new data key to write objects to the encrypted buckets. Add the following policy to the desired AWS KMS key. Only bucket-level encryption is supported.

  ```
  {
      "Effect": "Allow",
      "Principal": {
          "AWS": "arn:aws:iam::<aws-account-id>:role/<optimizer-role-name>"
      },
      "Action": [
          "kms:Decrypt",
          "kms:GenerateDataKey"
      ],
      "Resource": "*"
  }
  ```
+  (Optional) For data location registered with Lake Formation, the role used to register the location requires permissions to decrypt Amazon S3 objects and generate a new data key to write objects to the encrypted buckets. For more information, see [Registering an encrypted Amazon S3 location](https://docs.aws.amazon.com/lake-formation/latest/dg/register-encrypted.html). 
+ (Optional) If the AWS KMS key is stored in a different AWS account, you need to add the following permissions to the compaction role.

------
#### [ JSON ]


  ```
  {
    "Version":"2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "kms:Decrypt",
          "kms:GenerateDataKey"
        ],
        "Resource": [
          "arn:aws:kms:us-east-1:111122223333:key/key-id"
        ]
      }
    ]
  }
  ```

------
+  The role you use to run compaction must have the `iam:PassRole` permission on the role. 

------
#### [ JSON ]


  ```
  {
    "Version":"2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "iam:PassRole"
        ],
        "Resource": [
          "arn:aws:iam::111122223333:role/<optimizer-role-name>"
        ]
      }
    ]
  }
  ```

------
+ Add the following trust policy to the role so that the AWS Glue service can assume the IAM role to run the compaction process.

------
#### [ JSON ]


  ```
  {
    "Version":"2012-10-17",
    "Statement": [
      {
        "Sid": "",
        "Effect": "Allow",
        "Principal": {
          "Service": "glue.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }
    ]
  }
  ```

------
+ <a name="catalog-optimizer-requirement"></a> (Optional) To update the Data Catalog settings to enable catalog-level table optimizations, the IAM role used must have the `glue:UpdateCatalog` permission or AWS Lake Formation `ALTER CATALOG` permission on the root catalog. You can use `GetCatalog` API to verify the catalog properties. 
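The first inline policy in the list above (for data that is not registered with Lake Formation) differs between tables only in its bucket, account, Region, database, and table values. The following is a minimal sketch of generating it from those values. The function name `build_optimizer_policy` is hypothetical, and the example database and table names are placeholders.

```python
import json


def build_optimizer_policy(bucket, account_id, region, database, table):
    """Render the optimizer role's inline policy for one Iceberg table."""
    glue_arn = f"arn:aws:glue:{region}:{account_id}"
    logs_arn = f"arn:aws:logs:{region}:{account_id}:log-group:/aws-glue"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Read, write, and delete data files in the table location
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
            {   # List the bucket contents
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {   # Read and update the table definition in the Data Catalog
                "Effect": "Allow",
                "Action": ["glue:UpdateTable", "glue:GetTable"],
                "Resource": [
                    f"{glue_arn}:table/{database}/{table}",
                    f"{glue_arn}:database/{database}",
                    f"{glue_arn}:catalog",
                ],
            },
            {   # Write optimizer logs to CloudWatch Logs
                "Effect": "Allow",
                "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
                "Resource": [
                    f"{logs_arn}/iceberg-{name}/logs:*"
                    for name in ("compaction", "retention", "orphan-file-deletion")
                ],
            },
        ],
    }


policy = build_optimizer_policy(
    "amzn-s3-demo-bucket", "111122223333", "us-east-1", "my_database", "my_table"
)
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM policy editor. Review it against the policy shown in this section before attaching it to a role.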

# Catalog-level table optimizers
<a name="catalog-level-optimizers"></a>

With a one-time catalog configuration, you can set up automatic optimizers such as compaction, snapshot retention, and orphan file deletion for all new and updated Apache Iceberg tables in the AWS Glue Data Catalog. Catalog-level optimizer configurations allow you to apply consistent optimizer settings across all tables within a catalog, eliminating the need to configure optimizers individually for each table.

Data lake administrators can configure the table optimizers by selecting the default catalog in the Lake Formation console and enabling optimizers using the `Table optimization` option. When you create new tables or update existing tables in the Data Catalog, the Data Catalog automatically runs the table optimizations to reduce operational burden.

If you have configured optimization at the table level or if you have previously deleted the table optimization settings for a table, those table-specific settings take precedence over the default catalog settings for table optimization. If a configuration parameter is not defined at either the table or catalog level, the Iceberg table property value will be applied. This fallback applies to the snapshot retention and orphan file deletion optimizers.
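The precedence described above can be sketched as a simple lookup chain. This is an illustrative model only, not how the service is implemented. The function name `resolve_optimizer_setting` and the dictionary shapes are hypothetical, and the sketch does not model the case of previously deleted table settings.

```python
def resolve_optimizer_setting(name, table_config, catalog_config, iceberg_properties):
    """Return (source, value) for a setting, checking the table level first,
    then the catalog level, then the Iceberg table properties."""
    sources = (
        ("table", table_config),
        ("catalog", catalog_config),
        ("iceberg-property", iceberg_properties),
    )
    for source, settings in sources:
        if settings and name in settings:
            return source, settings[name]
    return None, None


catalog = {"snapshotRetentionPeriodInDays": 10}
table = {"snapshotRetentionPeriodInDays": 7}

print(resolve_optimizer_setting("snapshotRetentionPeriodInDays", table, catalog, {}))
print(resolve_optimizer_setting("snapshotRetentionPeriodInDays", {}, catalog, {}))
```

The first call resolves to the table-level value and the second falls back to the catalog-level value.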

When enabling catalog-level optimizers, consider the following:
+ When you configure optimization settings at the time of catalog creation and subsequently disable the optimizations through an Update Catalog request, the operation will cascade through all the tables within the catalog.
+ If you have already configured optimizers for a given table, then the disable operation at the catalog level will not impact this table.
+ When you disable optimizers at the catalog level, tables with existing optimizer configurations will maintain their specific settings and remain unaffected by the catalog-level change. However, tables without their own optimizer configurations will inherit the disabled state from the catalog level.
+ Since snapshot retention and orphan file deletion optimizers can be schedule-based, updates will introduce a random delay to the start of their schedule. This will cause each optimizer to start at slightly different times, spreading out the load and reducing the likelihood of exceeding service limits.
+ Catalog-level optimizer settings are not automatically inherited by tables when AWS Glue Data Catalog encryption is enabled. If your catalog has metadata encryption enabled, you must configure table optimizers individually for each table. To use catalog-level optimizer inheritance, metadata encryption must be disabled on the catalog.

**Topics**
+ [Enabling catalog-level automatic table optimization](enable-auto-table-optimizers.md)
+ [Viewing catalog-level optimizations](view-catalog-optimizations.md)
+ [Disabling catalog-level table optimization](disable-auto-table-optimizers.md)

# Enabling catalog-level automatic table optimization
<a name="enable-auto-table-optimizers"></a>

 You can enable the automatic table optimization for all new Apache Iceberg tables in the Data Catalog. After creating the table, you can also explicitly update the table optimization settings manually. 

 To update the Data Catalog settings to enable catalog-level table optimizations, the IAM role used must have the `glue:UpdateCatalog` permission on the root catalog. You can use `GetCatalog` API to verify the catalog properties. 

 For the Lake Formation managed tables, the IAM role selected during the catalog optimization configuration requires Lake Formation `ALTER`, `DESCRIBE`, `INSERT`, and `DELETE` permissions for any new tables or updated tables. 

## To enable catalog-level optimizers (console)
<a name="enable-catalog-optimizers-console"></a>

1. Open the Lake Formation console at [https://console.aws.amazon.com/lakeformation/](https://console.aws.amazon.com/lakeformation/).

1. In the navigation pane, choose **Data Catalog**.

1. Select the **Catalogs** tab.

1. Choose the account-level catalog.

1. On the **Table optimizations** tab, choose **Edit**. You can also choose **Edit optimizations** from **Actions**.  
![\[The screenshot shows the edit option to enable optimizations at the catalog-level.\]](http://docs.aws.amazon.com/glue/latest/dg/images/catalog-edit-optimizations.png)

1. On the **Table optimization** page, configure the following options:  
![\[The screenshot shows the optimization options at the catalog-level.\]](http://docs.aws.amazon.com/glue/latest/dg/images/catalog-optimization-options.png)

   1. Configure **Compaction** settings:
      + Enable/disable compaction.
      + Choose the IAM role that has the necessary permissions to run the optimizers.

        For more information on the permission requirements for the IAM role, see [Table optimization prerequisites](optimization-prerequisites.md).

   1. Configure **Snapshot retention** settings:
      + Enable/disable retention.
      + Set snapshot retention period in days - default is 5 days.
      + Set number of snapshots to retain - default is 1 snapshot.
      + Enable/disable cleaning of expired files.

   1. Configure **Orphan file deletion** settings:
      + Enable/disable orphan file deletion.
      + Set orphan file retention period in days - default is 3 days.

1. Choose **Save**.

## To enable catalog-level optimizers (AWS CLI)
<a name="catalog-auto-optimizers-cli"></a>

Use the following CLI command to update an existing catalog with optimizer settings:

**Example Update catalog with optimizer settings**  

```
aws glue update-catalog \
  --cli-input-json \
  '{
    "CatalogId": "111122223333",
    "CatalogInput": {
        "CatalogProperties": {
            "CustomProperties": {
                "ColumnStatistics.Enabled": "false",
                "ColumnStatistics.RoleArn": "arn:aws:iam::111122223333:role/service-role/stats-role-name"
            },
            "IcebergOptimizationProperties": {
                "RoleArn": "arn:aws:iam::111122223333:role/optimizer-role-name",
                "Compaction": {
                    "enabled": "true"
                },
                "Retention": {
                    "enabled": "true",
                    "snapshotRetentionPeriodInDays": "10",
                    "numberOfSnapshotsToRetain": "5",
                    "cleanExpiredFiles": "true"
                },
                "OrphanFileDeletion": {
                    "enabled": "true",
                    "orphanFileRetentionPeriodInDays": "3"
                }
            }
        }
    }
}'
```
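The request body in the preceding example can also be assembled programmatically. The following is a minimal sketch that builds the same `CatalogId` and `CatalogInput` structure. The function name `build_update_catalog_request` is hypothetical, the default values are taken from the console defaults listed earlier on this page, and the sketch only prints the JSON rather than calling AWS.

```python
import json


def build_update_catalog_request(catalog_id, role_arn, retention_days=5,
                                 snapshots_to_retain=1, orphan_retention_days=3):
    """Build a request body that enables all three catalog-level optimizers.
    Values are passed as strings, matching the CLI example."""
    return {
        "CatalogId": catalog_id,
        "CatalogInput": {
            "CatalogProperties": {
                "IcebergOptimizationProperties": {
                    "RoleArn": role_arn,
                    "Compaction": {"enabled": "true"},
                    "Retention": {
                        "enabled": "true",
                        "snapshotRetentionPeriodInDays": str(retention_days),
                        "numberOfSnapshotsToRetain": str(snapshots_to_retain),
                        "cleanExpiredFiles": "true",
                    },
                    "OrphanFileDeletion": {
                        "enabled": "true",
                        "orphanFileRetentionPeriodInDays": str(orphan_retention_days),
                    },
                }
            }
        },
    }


request = build_update_catalog_request(
    "111122223333", "arn:aws:iam::111122223333:role/optimizer-role-name"
)
print(json.dumps(request, indent=2))
```

Generating the body from a function keeps the role ARN and retention values in one place when the same settings are applied to several catalogs.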

If you encounter issues with catalog-level optimizers, check the following:
+ Ensure the IAM role has the correct permissions as outlined in the Prerequisites section.
+ Check CloudWatch logs for any error messages related to optimizer operations.

   For more information, see [View available metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/viewing_metrics_with_cloudwatch.html) in the *Amazon CloudWatch User Guide*. 
+ Verify that the catalog settings were successfully applied by checking the catalog configuration.
+ For table access failures, check the CloudWatch logs and EventBridge notifications for detailed error information.

# Viewing catalog-level optimizations
<a name="view-catalog-optimizations"></a>

When catalog-level table optimization is enabled, any time an Apache Iceberg table is created or updated with the `CreateTable` or `UpdateTable` API operations through the AWS Management Console, an SDK, or an AWS Glue crawler, an equivalent table-level setting is created for that table. 

 After you create or update a table, you can verify the table details to confirm the table optimization. The `Table optimization` shows the `Configuration source` property set as `Catalog`. 

![\[An Apache Iceberg table with the catalog-level optimization configuration applied.\]](http://docs.aws.amazon.com/glue/latest/dg/images/catalog-optimization-enabled.png)


# Disabling catalog-level table optimization
<a name="disable-auto-table-optimizers"></a>

You can disable table optimization for new tables using the AWS Lake Formation console or the `glue:UpdateCatalog` API. 

**To disable the table optimizations at the catalog level**

1. Open the Lake Formation console at [https://console.aws.amazon.com/lakeformation/](https://console.aws.amazon.com/lakeformation/).

1. On the left navigation bar, choose **Catalogs**.

1. On the **Catalog summary** page, choose **Edit** under **Table optimizations**.

1. On the **Edit optimization** page, clear the **Optimization options**.

1. Choose **Save**.

# Compaction optimization
<a name="compaction-management"></a>

Amazon S3 data lakes that use open table formats like Apache Iceberg store data as S3 objects. Having thousands of small Amazon S3 objects in a data lake table increases metadata overhead and affects read performance. AWS Glue Data Catalog provides managed compaction for Iceberg tables, compacting small objects into larger ones for better read performance by AWS analytics services like Amazon Athena and Amazon EMR, and AWS Glue ETL jobs. Data Catalog performs compaction without interfering with concurrent queries and supports compaction only for Parquet format tables. 

The table optimizer continuously monitors table partitions and kicks off the compaction process when the threshold is exceeded for the number of files and file sizes.

In the Data Catalog, the compaction process starts when a table or any of its partitions has more than 100 files, each smaller than 75% of the target file size. The target file size is defined by the `write.target-file-size-bytes` table property, which defaults to 512 MB if not explicitly set.
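As a worked example of this threshold, the following sketch checks whether a list of file sizes would qualify. It is one interpretation of the rule above, counting the files that are smaller than 75% of the target size, and it is not the service's implementation. The function and constant names are hypothetical.

```python
DEFAULT_TARGET_FILE_SIZE_BYTES = 512 * 1024 * 1024  # 512 MB default target
MB = 1024 * 1024


def is_compaction_candidate(file_sizes, target_bytes=DEFAULT_TARGET_FILE_SIZE_BYTES,
                            min_small_files=100, small_file_ratio=0.75):
    """Return True when more than `min_small_files` files are smaller than
    `small_file_ratio` of the target file size."""
    small_limit = small_file_ratio * target_bytes  # 384 MB with the defaults
    small_files = [size for size in file_sizes if size < small_limit]
    return len(small_files) > min_small_files


print(is_compaction_candidate([1 * MB] * 101))    # 101 small files
print(is_compaction_candidate([1 * MB] * 100))    # exactly 100 small files
print(is_compaction_candidate([400 * MB] * 101))  # files above the 384 MB limit
```

With the default 512 MB target, a file counts as small in this sketch when it is under 384 MB, so only the first call reports a compaction candidate.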

 For limitations, see [Supported formats and limitations for managed data compaction](optimizer-notes.md#compaction-notes). 

**Topics**
+ [Enabling compaction optimizer](enable-compaction.md)
+ [Disabling compaction optimizer](disable-compaction.md)

# Enabling compaction optimizer
<a name="enable-compaction"></a>

You can use the AWS Glue console, AWS CLI, or AWS API to enable compaction for your Apache Iceberg tables in the AWS Glue Data Catalog. For new tables, you can choose Apache Iceberg as the table format and enable compaction when you create the table. Compaction is disabled by default for new tables.

------
#### [ Console ]

**To enable compaction**

1.  Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/) and sign in as a data lake administrator, the table creator, or a user who has been granted the `glue:UpdateTable` and `lakeformation:GetDataAccess` permissions on the table. 

1. In the navigation pane, under **Data Catalog**, choose **Tables**.

1. On the **Tables** page, choose a table in open table format that you want to enable compaction for, then under **Actions** menu, choose **Optimization**, and then choose **Enable**.

   You can also enable compaction by selecting the **Table optimization** tab on the **Table details** page. Choose the **Table optimization** tab on the lower section of the page, and choose **Enable compaction**. 

   The **Enable optimization** option is also available when you create a new Iceberg table in the Data Catalog.

1. On the **Enable optimization** page, choose **Compaction** under **Optimization options**.  
![\[Apache Iceberg table details page with Enable compaction option.\]](http://docs.aws.amazon.com/glue/latest/dg/images/table-enable-compaction.png)

1. Next, select an IAM role from the drop down with the permissions shown in the [Table optimization prerequisites](optimization-prerequisites.md) section. 

   You can also choose **Create a new IAM role** option to create a custom role with the required permissions to run compaction.

    Follow the steps below to update an existing IAM role: 

   1.  To update the permissions policy for the IAM role, in the IAM console, go to the IAM role that is being used for running compaction. 

   1.  In the **Add permissions** section, choose **Create policy**. In the newly opened browser window, create a new policy to use with your role. 

   1. On the Create policy page, choose the `JSON` tab. Copy the JSON code shown in the Prerequisites into the policy editor field.

1. If you have security policy configurations where the Iceberg table optimizer needs to access Amazon S3 buckets from a specific Virtual Private Cloud (VPC), create an AWS Glue network connection or use an existing one.

   If you don't have an AWS Glue VPC connection set up already, create a new one by following the steps in the [Creating connections for connectors](https://docs.aws.amazon.com/glue/latest/dg/creating-connections.html) section using the AWS Glue console or the AWS CLI/SDK.

1. Choose a compaction strategy. The available options are:
   + **Binpack** – Binpack is the default compaction strategy in Apache Iceberg. It combines smaller data files into larger ones for optimal performance.
   + **Sort** – Sorting in Apache Iceberg is a data organization technique that clusters information within files based on specified columns, significantly improving query performance by reducing the number of files that need to be processed. You define the sort order in Iceberg's metadata using the sort-order field, and when multiple columns are specified, data is sorted in the sequence the columns appear in the sort order, ensuring records with similar values are stored together within files. The sorting compaction strategy takes the optimization further by sorting data across all files within a partition. 
   + **Z-order** – Z-ordering is a way to organize data when you need to sort by multiple columns with equal importance. Unlike traditional sorting that prioritizes one column over others, Z-ordering gives balanced weight to each column, helping your query engine read fewer files when searching for data.

     The technique works by weaving together the binary digits of values from different columns. For example, if you have the numbers 3 and 4 from two columns, Z-ordering first converts them to binary (3 becomes 011 and 4 becomes 100), then interleaves these digits to create a new value: 011010. This interleaving creates a pattern that keeps related data physically close together.

     Z-ordering is particularly effective for multi-dimensional queries. For example, a customer table Z-ordered by income, state, and zip code can deliver superior performance compared to hierarchical sorting when querying across multiple dimensions. This organization allows queries targeting specific combinations of income and geographic location to quickly locate relevant data while minimizing unnecessary file scans.

1. **Minimum input files** – The number of data files required in a partition before compaction is triggered.

1. **Delete files threshold** – The minimum number of delete operations required in a data file before it becomes eligible for compaction.

1. Choose **Enable optimization**.
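
The bit interleaving described for the **Z-order** strategy can be illustrated with a short Python sketch. This is only a minimal illustration of the idea for two non-negative integer columns; the function name `interleave_bits` and the fixed bit width are assumptions made for this example, and the sketch does not represent how Apache Iceberg or AWS Glue implements Z-ordering internally.

```python
def interleave_bits(x: int, y: int, width: int) -> str:
    """Weave the bits of x and y together, most significant bit first.

    The first column's bit leads each pair, so 3 (011) and 4 (100)
    become 01 10 10, written as 011010.
    """
    if x < 0 or y < 0:
        raise ValueError("values must be non-negative")
    if x >= (1 << width) or y >= (1 << width):
        raise ValueError("value does not fit in the given bit width")
    x_bits = format(x, f"0{width}b")  # zero-padded binary string
    y_bits = format(y, f"0{width}b")
    return "".join(a + b for a, b in zip(x_bits, y_bits))


# Reproduces the example from the text: 3 and 4 with three bits each.
print(interleave_bits(3, 4, 3))  # 011010
```

Sorting rows by the interleaved value, rather than by one column and then the other, is what keeps rows with similar values in both columns close together.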

------
#### [ AWS CLI ]

 The following example shows how to enable compaction. Replace the account ID with a valid AWS account ID. Replace the database name and table name with the names of your database and Iceberg table. Replace the `roleArn` value with the Amazon Resource Name (ARN) of the IAM role that has the required permissions to run compaction. You can replace the compaction strategy `sort` with another supported strategy, such as `z-order` or `binpack`, depending on your requirements.


```
aws glue create-table-optimizer \
  --catalog-id 123456789012 \
  --database-name iceberg_db \
  --table-name iceberg_table \
  --table-optimizer-configuration '{
    "roleArn": "arn:aws:iam::123456789012:role/optimizer_role",
    "enabled": true,
    "vpcConfiguration": {"glueConnectionName": "glue_connection_name"},
    "compactionConfiguration": {
      "icebergConfiguration": {"strategy": "sort"}
    }
  }' \
  --type compaction
```
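
Hand-writing JSON inside shell quotes is easy to get wrong, so one option is to generate the argument value with a small script. The following Python sketch builds the same configuration document as the preceding command and prints it as a shell-quoted string. The helper name `compaction_configuration` is hypothetical, and treating `vpcConfiguration` as optional is an assumption based on the console steps, where a network connection is needed only for VPC-restricted access.

```python
import json
import shlex

# Strategy names mentioned in this topic.
SUPPORTED_STRATEGIES = {"binpack", "sort", "z-order"}


def compaction_configuration(role_arn, strategy="binpack", glue_connection_name=None):
    """Build the document passed to --table-optimizer-configuration."""
    if strategy not in SUPPORTED_STRATEGIES:
        raise ValueError(f"unsupported strategy: {strategy}")
    config = {
        "roleArn": role_arn,
        "enabled": True,
        "compactionConfiguration": {
            "icebergConfiguration": {"strategy": strategy}
        },
    }
    if glue_connection_name:
        # Assumption: include this only when a VPC connection is required.
        config["vpcConfiguration"] = {"glueConnectionName": glue_connection_name}
    return config


config = compaction_configuration(
    "arn:aws:iam::123456789012:role/optimizer_role",
    strategy="sort",
    glue_connection_name="glue_connection_name",
)
# shlex.quote wraps the JSON so it can be pasted after the option name.
print(shlex.quote(json.dumps(config)))
```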

------
#### [ AWS API ]

Call [CreateTableOptimizer](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-table-optimizers.html#aws-glue-api-table-optimizers-CreateTableOptimizer) operation to enable compaction for a table.

------

After you enable compaction, the **Table optimization** tab shows the following compaction details once the compaction run is complete:

Start time  
The time at which the compaction process started within Data Catalog. The value is a timestamp in UTC time. 

End time  
The time at which the compaction process ended in Data Catalog. The value is a timestamp in UTC time. 

Status  
The status of the compaction run. Values are success or fail.

Files compacted  
Total number of files compacted.

Bytes compacted  
Total number of bytes compacted.

# Disabling compaction optimizer
<a name="disable-compaction"></a>

 You can disable automatic compaction for a particular Apache Iceberg table using the AWS Glue console, the AWS CLI, or the AWS API. 

------
#### [ Console ]

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. On the left navigation, under **Data Catalog**, choose **Tables**. 

1. From the tables list, choose the Iceberg table that you want to disable compaction for.

1. Choose the **Table optimization** tab on the lower section of the **Table details** page.

1. From **Actions**, choose **Disable**, and then choose **Compaction**.

1.  Choose **Disable compaction** on the confirmation message. You can re-enable compaction at a later time. 

    After you confirm, compaction is disabled and the compaction status for the table changes back to `Disabled`.

------
#### [ AWS CLI ]

In the following example, replace the account ID with a valid AWS account ID. Replace the database name and table name with the names of your database and Iceberg table. Replace the `roleArn` value with the Amazon Resource Name (ARN) of the IAM role that has the required permissions to run compaction.

```
aws glue update-table-optimizer \
  --catalog-id 123456789012 \
  --database-name iceberg_db \
  --table-name iceberg_table \
  --table-optimizer-configuration '{"roleArn":"arn:aws:iam::123456789012:role/optimizer_role", "enabled":false, "vpcConfiguration":{"glueConnectionName":"glue_connection_name"}}' \
  --type compaction
```

------
#### [ AWS API ]

Call [UpdateTableOptimizer](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-table-optimizers.html#aws-glue-api-table-optimizers-UpdateTableOptimizer) operation to disable compaction for a specific table.

------

# Snapshot retention optimization
<a name="snapshot-retention-management"></a>

The Apache Iceberg snapshot retention feature allows users to query historical data at specific points in time and revert unwanted modifications to their tables. In the AWS Glue Data Catalog, the snapshot retention configuration controls how long these snapshots (versions of the table data) are kept before they are expired and removed. This helps manage storage costs and metadata overhead by automatically removing older snapshots based on a configured retention period or a maximum number of snapshots to keep. 

You can configure the retention period in days and the maximum number of snapshots to retain for a table. AWS Glue removes snapshots that are older than the specified retention period from the table metadata, while keeping the most recent snapshots up to the configured limit. After removing old snapshots from the metadata, AWS Glue deletes the corresponding data and metadata files that are no longer referenced and unique to the expired snapshots. This allows time travel queries only up to the remaining retained snapshots, while reclaiming storage space used by expired snapshot data.
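
The interaction between the retention period and the number of snapshots to keep can be sketched in Python. This is a simplified model built only from the description above: a snapshot expires when it is older than the retention period and is not among the most recent snapshots that must be kept. The function name `snapshots_to_expire` is hypothetical, and the sketch ignores branch and tag retention policies, so it should not be read as the AWS Glue implementation.

```python
from datetime import datetime, timedelta, timezone


def snapshots_to_expire(snapshot_times, retention_days, snapshots_to_retain, now):
    """Return the snapshot timestamps that this simplified model would expire."""
    cutoff = now - timedelta(days=retention_days)
    newest_first = sorted(snapshot_times, reverse=True)
    # The most recent snapshots are kept even if they are past the cutoff.
    protected = set(newest_first[:snapshots_to_retain])
    return sorted(t for t in snapshot_times if t < cutoff and t not in protected)


utc = timezone.utc
snapshots = [
    datetime(2024, 1, 1, tzinfo=utc),
    datetime(2024, 1, 2, tzinfo=utc),
    datetime(2024, 1, 8, tzinfo=utc),
    datetime(2024, 1, 9, tzinfo=utc),
]
now = datetime(2024, 1, 10, tzinfo=utc)

# Retain for 5 days and keep at least 3 snapshots: only January 1 expires,
# because January 2 is protected as one of the three newest snapshots.
print(snapshots_to_expire(snapshots, 5, 3, now))
```

With the same data and only one snapshot to keep, both the January 1 and January 2 snapshots fall outside the five-day window and expire.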

**Topics**
+ [Enabling snapshot retention optimizer](enable-snapshot-retention.md)
+ [Updating snapshot retention optimizer](update-snapshot-retention.md)
+ [Disabling snapshot retention optimizer](disable-snapshot-retention.md)

# Enabling snapshot retention optimizer
<a name="enable-snapshot-retention"></a>

 You can use the AWS Glue console, the AWS CLI, or the AWS API to enable snapshot retention optimizers for your Apache Iceberg tables in the Data Catalog. For new tables, you can choose Apache Iceberg as the table format and enable the snapshot retention optimizer when you create the table. Snapshot retention is disabled by default for new tables.

------
#### [ Console ]

**To enable snapshot retention optimizer**

1.  Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/) and sign in as a data lake administrator, the table creator, or a user who has been granted the `glue:UpdateTable` and `lakeformation:GetDataAccess` permissions on the table. 

1. In the navigation pane, under **Data Catalog**, choose **Tables**.

1. On the **Tables** page, choose the Iceberg table that you want to enable the snapshot retention optimizer for. Then, from the **Actions** menu, choose **Enable** under **Optimization**.

   You can also enable optimization by selecting the table and opening the **Table details** page. Choose the **Table optimization** tab on the lower section of the page, and choose **Enable snapshot retention**. 

1. On the **Enable optimization** page, under **Optimization configuration**, you have two options: **Use default setting** or **Customize settings**. If you choose the default settings, AWS Glue uses the properties defined in the Iceberg table configuration to determine the snapshot retention period and the number of snapshots to retain. In the absence of this configuration, AWS Glue retains one snapshot for five days, and deletes files associated with the expired snapshots.

1.  Next, choose an IAM role that AWS Glue can assume on your behalf to run the optimizer. For details about the permissions required for the IAM role, see the [Table optimization prerequisites](optimization-prerequisites.md) section.

   Follow the steps below to update an existing IAM role: 

   1.  To update the permissions policy for the IAM role, in the IAM console, go to the IAM role that is being used for running the optimizer. 

   1.  In the **Add permissions** section, choose **Create policy**. In the newly opened browser window, create a new policy to use with your role. 

   1. On the **Create policy** page, choose the **JSON** tab. Copy the JSON code shown in the Prerequisites into the policy editor field.

1. If you prefer to set the values for the **Snapshot retention configuration** manually, choose **Customize settings**.   
![\[Apache Iceberg table details page with Enable retention>Customize settings option.\]](http://docs.aws.amazon.com/glue/latest/dg/images/table-enable-retention.png)

1. Select the **Apply the selected IAM role to the selected optimizers** check box to use a single IAM role for all of the optimizers that you enable.

1. If you have security policy configurations where the Iceberg table optimizer needs to access Amazon S3 buckets from a specific Virtual Private Cloud (VPC), create an AWS Glue network connection or use an existing one.

   If you don't have an AWS Glue VPC Connection set up already, create a new one by following the steps in the [Creating connections for connectors](https://docs.aws.amazon.com/glue/latest/dg/creating-connections.html) section using the AWS Glue console or the AWS CLI/SDK.

1. Next, under **Snapshot retention configuration**, either choose to use the values specified in the [Iceberg table configuration](https://iceberg.apache.org/docs/1.5.2/configuration/#table-behavior-properties), or specify custom values for the snapshot retention period (`history.expire.max-snapshot-age-ms`), the minimum number of snapshots to retain (`history.expire.min-snapshots-to-keep`), and the time in hours between consecutive snapshot deletion job runs.

1.  Choose **Delete associated files** to delete underlying files when the table optimizer deletes old snapshots from the table metadata.

    If you don't choose this option, when older snapshots are removed from the table metadata, their associated files will remain in the storage as orphaned files. 

1. Next, read the caution statement, and choose **I acknowledge** to proceed.
**Note**  
 In the Data Catalog, the snapshot retention optimizer honors the lifecycle that is controlled by branch and tag level retention policies. For more information, see [Branching and tagging](https://iceberg.apache.org/docs/latest/branching/#overview) section in the Iceberg documentation.

1. Review the configuration and choose **Enable optimization**.

   Wait a few minutes for the retention optimizer to run and expire old snapshots based on the configuration.

------
#### [ AWS CLI ]

 To enable snapshot retention for new Iceberg tables in AWS Glue, you need to create a table optimizer of type `retention` and set the `enabled` field to `true` in the `table-optimizer-configuration`. You can do this using the AWS CLI command `create-table-optimizer` or `update-table-optimizer`. Additionally, you need to specify the retention configuration fields like `snapshotRetentionPeriodInDays` and `numberOfSnapshotsToRetain` based on your requirements.

The following example shows how to enable the snapshot retention optimizer. Replace the account ID with a valid AWS account ID. Replace the database name and table name with the names of your database and Iceberg table. Replace the `roleArn` value with the Amazon Resource Name (ARN) of the IAM role that has the required permissions to run the snapshot retention optimizer. 

```
aws glue create-table-optimizer \
  --catalog-id 123456789012 \
  --database-name iceberg_db \
  --table-name iceberg_table \
  --table-optimizer-configuration '{"roleArn":"arn:aws:iam::123456789012:role/optimizer_role","enabled":true, "vpcConfiguration":{"glueConnectionName":"glue_connection_name"}, "retentionConfiguration":{"icebergConfiguration":{"snapshotRetentionPeriodInDays":7,"numberOfSnapshotsToRetain":3,"cleanExpiredFiles":true}}}' \
  --type retention
```

 This command creates a retention optimizer for the specified Iceberg table in the given catalog, database, and Region. The table-optimizer-configuration specifies the IAM role ARN to use, enables the optimizer, and sets the retention configuration. In this example, it retains snapshots for 7 days, keeps a minimum of 3 snapshots, and cleans expired files. 
+ `snapshotRetentionPeriodInDays` – The number of days to retain snapshots before expiring them. The default value is `5`. 
+ `numberOfSnapshotsToRetain` – The minimum number of snapshots to keep, even if they are older than the retention period. The default value is `1`. 
+ `cleanExpiredFiles` – A Boolean indicating whether to delete expired data files after expiring snapshots. The default value is `true`.

   When set to true, older snapshots are removed from table metadata, and their underlying files are deleted. If this parameter is set to false, older snapshots are removed from table metadata but their underlying files remain in the storage as orphan files. 
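
The way explicit values, Iceberg table properties, and the defaults relate to each other can be sketched as follows. This Python example assumes a precedence of explicit value first, then the table property, then the default (5 days and 1 snapshot), which is how this topic describes the default and custom settings. The function name `effective_retention` and the rounding down of milliseconds to whole days are assumptions made for illustration, not documented AWS Glue behavior.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000
DEFAULT_RETENTION_DAYS = 5
DEFAULT_SNAPSHOTS_TO_RETAIN = 1


def effective_retention(table_properties, retention_days=None, snapshots_to_retain=None):
    """Resolve (retention days, snapshots to retain) for a table.

    table_properties is a dict of Iceberg table properties (string values).
    """
    if retention_days is None:
        age_ms = table_properties.get("history.expire.max-snapshot-age-ms")
        if age_ms is not None:
            # Assumption: round down to whole days.
            retention_days = int(age_ms) // MS_PER_DAY
        else:
            retention_days = DEFAULT_RETENTION_DAYS
    if snapshots_to_retain is None:
        keep = table_properties.get("history.expire.min-snapshots-to-keep")
        snapshots_to_retain = int(keep) if keep is not None else DEFAULT_SNAPSHOTS_TO_RETAIN
    return retention_days, snapshots_to_retain


props = {
    "history.expire.max-snapshot-age-ms": "604800000",  # 7 days
    "history.expire.min-snapshots-to-keep": "3",
}
print(effective_retention({}))                        # falls back to the defaults
print(effective_retention(props))                     # uses the table properties
print(effective_retention(props, retention_days=10))  # explicit value wins
```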

------
#### [ AWS API ]

Call [CreateTableOptimizer](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-table-optimizers.html#aws-glue-api-table-optimizers-CreateTableOptimizer) operation to enable snapshot retention optimizer for a table.

------

After you enable the snapshot retention optimizer, the **Table optimization** tab shows the following details (after approximately 15-20 minutes):

Start time  
The time at which the snapshot retention optimizer started. The value is a timestamp in UTC time. 

Run time  
The time that the optimizer took to complete the task. 

Status  
The status of the optimizer run. Values are success or fail.

Data files deleted  
Total number of files deleted.

Manifest files deleted  
Total number of manifest files deleted.

Manifest lists deleted  
Total number of manifest lists deleted.

# Updating snapshot retention optimizer
<a name="update-snapshot-retention"></a>

 You can update the existing configuration of a snapshot retention optimizer for a particular Apache Iceberg table using the AWS Glue console, the AWS CLI, or the `UpdateTableOptimizer` API. 

------
#### [ Console ]

**To update snapshot retention configuration**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Choose **Data Catalog** and choose **Tables**. From the tables list, choose the Iceberg table whose snapshot retention optimizer configuration you want to update.

1. On the lower section of the **Table details** page, select the **Table optimization** tab, and then choose **Edit**. You can also choose **Edit** under **Optimization** from the **Actions** menu located on the top right corner of the page.

1.  On the **Edit optimization** page, make the desired changes. 

1.  Choose **Save**. 

------
#### [ AWS CLI ]

 To update a snapshot retention optimizer using the AWS CLI, you can use the following command: 

```
aws glue update-table-optimizer \
 --catalog-id 123456789012 \
 --database-name iceberg_db \
 --table-name iceberg_table \
 --table-optimizer-configuration '{"roleArn":"arn:aws:iam::123456789012:role/optimizer_role","enabled":true, "vpcConfiguration":{"glueConnectionName":"glue_connection_name"},"retentionConfiguration":{"icebergConfiguration":{"snapshotRetentionPeriodInDays":7,"numberOfSnapshotsToRetain":3,"cleanExpiredFiles":true}}}' \
 --type retention
```

 This command updates the retention configuration for the specified table in the given catalog, database, and Region. The key parameters are: 
+ `snapshotRetentionPeriodInDays` – The number of days to retain snapshots before expiring them. The default value is `5`. 
+ `numberOfSnapshotsToRetain` – The minimum number of snapshots to keep, even if they are older than the retention period. The default value is `1`. 
+ `cleanExpiredFiles` – A Boolean indicating whether to delete expired data files after expiring snapshots. The default value is `true`. 

   When set to true, older snapshots are removed from table metadata, and their underlying files are deleted. If this parameter is set to false, older snapshots are removed from table metadata but their underlying files remain in the storage as orphan files. 

------
#### [ API ]

To update a table optimizer, you can use the `UpdateTableOptimizer` API. This API allows you to update the configuration of an existing table optimizer for compaction, retention, or orphan file removal. The request parameters include:
+ catalogId (required): The ID of the catalog containing the table 
+  databaseName (required): The name of the database containing the table 
+  tableName (required): The name of the table 
+  type (required): The type of table optimizer (`compaction`, `retention`, or `orphan_file_deletion`) 
+  tableOptimizerConfiguration (required): The updated configuration for the table optimizer, including role ARN, enabled status, retention configuration, and orphan file removal configuration. 

------

# Disabling snapshot retention optimizer
<a name="disable-snapshot-retention"></a>

 You can disable the snapshot retention optimizer for a particular Apache Iceberg table using the AWS Glue console, the AWS CLI, or the AWS API. 

------
#### [ Console ]

**To disable snapshot retention**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Choose **Data Catalog** and choose **Tables**. From the tables list, choose the Iceberg table that you want to disable the snapshot retention optimizer for.

1. On the lower section of the **Table details** page, choose the **Table optimization** tab. Then, under **Actions**, choose **Disable**, **Snapshot retention**.

   You can also choose **Disable** under **Optimization** from the **Actions** menu located on the top right corner of the page.

1.  Choose **Disable** on the confirmation message. You can re-enable the snapshot retention optimizer at a later time. 

    After you confirm, the snapshot retention optimizer is disabled and the status for snapshot retention changes back to `Not enabled`.

------
#### [ AWS CLI ]

In the following example, replace the account ID with a valid AWS account ID. Replace the database name and table name with the names of your database and Iceberg table. Replace the `roleArn` value with the Amazon Resource Name (ARN) of the IAM role that has the required permissions to run the retention optimizer.

```
aws glue update-table-optimizer \
  --catalog-id 123456789012 \
  --database-name iceberg_db \
  --table-name iceberg_table \
  --table-optimizer-configuration '{"roleArn":"arn:aws:iam::123456789012:role/optimizer_role", "vpcConfiguration":{"glueConnectionName":"glue_connection_name"}, "enabled":false}' \
  --type retention
```

------
#### [ AWS API ]

Call [UpdateTableOptimizer](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-table-optimizers.html#aws-glue-api-table-optimizers-UpdateTableOptimizer) operation to disable the snapshot retention optimizer for a specific table.

------

# Deleting orphan files
<a name="orphan-file-deletion"></a>

 AWS Glue Data Catalog allows you to remove orphan files from your Iceberg tables. Orphan files are unreferenced files that exist in your Amazon S3 data source under the specified table location, are not tracked by the Iceberg table metadata, and are older than your configured age limit. These orphan files can accumulate over time due to failure in operations like compaction, partition drops, or table rewrites, and take up unnecessary storage space.

The orphan file deletion optimizer in AWS Glue scans the table metadata and the actual data files, identifies the orphan files, and deletes them to reclaim storage space. The optimizer only removes files created after the optimizer's creation date that also meet the configured deletion criteria. Files created before or on the optimizer creation date are never deleted.

**Orphan file deletion logic**

1. Date check – Compares the file creation date with the optimizer creation date. If the file is older than or the same age as the optimizer, the file is skipped.

1. Optimizer configuration check – If the file is newer than the optimizer creation date, evaluates the file against the configured age limit. The optimizer deletes the file if it matches the deletion criteria, and skips the file if it doesn't.

 You can initiate the orphan file deletion by creating an orphan file deletion table optimizer in the Data Catalog.
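
The two checks in the orphan file deletion logic can be expressed as a small Python function. This is a sketch of the decision as described in this topic, not the service implementation. The function name `should_delete_orphan` and its parameters are hypothetical, and treating "older than the configured age limit" as strictly greater than the retention period is an assumption.

```python
from datetime import datetime, timedelta, timezone


def should_delete_orphan(file_created, optimizer_created, retention_days, now,
                         is_referenced=False):
    """Decide whether a file qualifies for deletion under the described logic."""
    # Files tracked by the Iceberg table metadata are not orphans.
    if is_referenced:
        return False
    # Date check: files created on or before the optimizer creation date are skipped.
    if file_created <= optimizer_created:
        return False
    # Optimizer configuration check: the file must exceed the configured age limit.
    return now - file_created > timedelta(days=retention_days)


utc = timezone.utc
optimizer_created = datetime(2024, 1, 1, tzinfo=utc)
now = datetime(2024, 1, 20, tzinfo=utc)

# Created before the optimizer existed: skipped.
print(should_delete_orphan(datetime(2023, 12, 31, tzinfo=utc), optimizer_created, 3, now))
# Created after the optimizer and older than 3 days: deleted.
print(should_delete_orphan(datetime(2024, 1, 10, tzinfo=utc), optimizer_created, 3, now))
# Created one day ago: still within the retention period, so skipped.
print(should_delete_orphan(datetime(2024, 1, 19, tzinfo=utc), optimizer_created, 3, now))
```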

**Important**  
 By default, orphan file deletion evaluates files across your AWS Glue table location. While you can configure a sub-prefix to limit the scope of evaluation by using an API parameter, you must ensure that your table location doesn't contain files from other data sources or tables. If your table location overlaps with other data sources, the service might identify and delete unrelated files as orphans. 

**Topics**
+ [Enabling orphan file deletion](enable-orphan-file-deletion.md)
+ [Updating orphan file deletion optimizer](update-orphan-file-deletion.md)
+ [Disabling orphan file deletion](disable-orphan-file-deletion.md)

# Enabling orphan file deletion
<a name="enable-orphan-file-deletion"></a>

 You can use the AWS Glue console, the AWS CLI, or the AWS API to enable orphan file deletion for your Apache Iceberg tables in the Data Catalog. For new tables, you can choose Apache Iceberg as the table format and enable the orphan file deletion optimizer when you create the table. Orphan file deletion is disabled by default for new tables.

------
#### [ Console ]

**To enable orphan file deletion**

1.  Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/) and sign in as a data lake administrator, the table creator, or a user who has been granted the `glue:UpdateTable` and `lakeformation:GetDataAccess` permissions on the table. 

1. In the navigation pane, under **Data Catalog**, choose **Tables**.

1. On the **Tables** page, choose the Iceberg table that you want to enable orphan file deletion for.

   Choose the **Table optimization** tab on the lower section of the page, and choose **Enable**, **Orphan file deletion** from **Actions**. 

   You can also choose **Enable** under **Optimization** from the **Actions** menu located on the top right corner of the page.

1. On the **Enable optimization** page, choose **Orphan file deletion** under **Optimization options**.

1. If you choose to use **Default settings**, all orphan files will be deleted after 3 days. If you want to keep the orphan files for a specific number of days, choose **Customize settings**.

1. Next, choose an IAM role with the required permissions to delete orphan files.

1. If you have security policy configurations where the Iceberg table optimizer needs to access Amazon S3 buckets from a specific Virtual Private Cloud (VPC), create an AWS Glue network connection or use an existing one.

   If you don't have an AWS Glue VPC Connection set up already, create a new one by following the steps in the [Creating connections for connectors](https://docs.aws.amazon.com/glue/latest/dg/creating-connections.html) section using the AWS Glue console or the AWS CLI/SDK.

1. If you choose **Customize settings**, enter the number of days to retain the files before deletion under **Orphan file deletion configuration**. You can also specify the interval between two consecutive optimizer runs. The default value is 24 hours.

1. Choose **Enable optimization**.

------
#### [ AWS CLI ]

 To enable orphan file deletion for an Iceberg table in AWS Glue, you need to create a table optimizer of type `orphan_file_deletion` and set the `enabled` field to true. To create an orphan file deletion optimizer for an Iceberg table using the AWS CLI, you can use the following command:

```
aws glue create-table-optimizer \
 --catalog-id 123456789012 \
 --database-name iceberg_db \
 --table-name iceberg_table \
 --table-optimizer-configuration '{"roleArn":"arn:aws:iam::123456789012:role/optimizer_role","enabled":true, "vpcConfiguration":{"glueConnectionName":"glue_connection_name"}, "orphanFileDeletionConfiguration":{"icebergConfiguration":{"orphanFileRetentionPeriodInDays":3, "location":"S3 location"}}}' \
 --type orphan_file_deletion
```

 This command creates an orphan file deletion optimizer for the specified Iceberg table. The key parameters are:
+ `roleArn` – The ARN of the IAM role with permissions to access the S3 bucket and Glue resources.
+ `enabled` – Set to true to enable the optimizer.
+ `orphanFileRetentionPeriodInDays` – The number of days to retain orphan files before deleting them (minimum 1 day).
+ `type` – Set to `orphan_file_deletion` to create an orphan file deletion optimizer.

 After you create the table optimizer, it runs orphan file deletion periodically (once per day if left enabled). You can check the runs using the `list-table-optimizer-runs` command. The orphan file deletion job identifies and deletes files that are not tracked in the Iceberg metadata for the table.

------
#### [ API ]

Call [CreateTableOptimizer](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-table-optimizers.html#aws-glue-api-table-optimizers-CreateTableOptimizer) operation to create the orphan file deletion optimizer for a specific table.

------

# Updating orphan file deletion optimizer
<a name="update-orphan-file-deletion"></a>

 You can modify the configuration for the orphan file deletion optimizer, such as changing the retention period for orphan files or the IAM role used by the optimizer using AWS Glue console, AWS CLI, or the `UpdateTableOptimizer` operation. 

------
#### [ AWS Management Console ]

**To update the orphan file deletion optimizer**

1.  Choose **Data Catalog** and choose **Tables**. From the tables list, choose the table whose orphan file deletion optimizer configuration you want to update.

1. On the lower section of the **Table details** page, choose **Table optimization**, and then choose **Edit**. 

1.  On the **Edit optimization** page, make the desired changes. 

1.  Choose **Save**. 

------
#### [ AWS CLI ]

 You can use the `update-table-optimizer` command to update the orphan file deletion optimizer in AWS Glue. In the `icebergConfiguration` field of `orphanFileDeletionConfiguration`, you can specify an updated `orphanFileRetentionPeriodInDays` value to set the number of days to retain orphan files, and a `location` value to specify the Iceberg table location to delete orphan files from. 

```
aws glue update-table-optimizer \
 --catalog-id 123456789012 \
 --database-name iceberg_db \
 --table-name iceberg_table \
 --table-optimizer-configuration '{"roleArn":"arn:aws:iam::123456789012:role/optimizer_role","enabled":true, "vpcConfiguration":{"glueConnectionName":"glue_connection_name"},"orphanFileDeletionConfiguration":{"icebergConfiguration":{"orphanFileRetentionPeriodInDays":5}}}' \
 --type orphan_file_deletion
```

------
#### [ API ]

Call the [UpdateTableOptimizer](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-table-optimizers.html#aws-glue-api-table-optimizers-UpdateTableOptimizer) operation to update the orphan file deletion optimizer for a table.

------

 

# Disabling orphan file deletion
<a name="disable-orphan-file-deletion"></a>

 You can disable the orphan file deletion optimizer for a particular Apache Iceberg table using the AWS Glue console, the AWS CLI, or the AWS API. 

------
#### [ Console ]

**To disable orphan file deletion**

1. Choose **Data Catalog** and choose **Tables**. From the tables list, choose the Iceberg table that you want to disable the orphan file deletion optimizer for.

1. On the lower section of the **Table details** page, choose the **Table optimization** tab.

1. Choose **Actions**, and then choose **Disable**, **Orphan file deletion**.

   You can also choose **Disable** under **Optimization** from the **Actions** menu.

1.  Choose **Disable** on the confirmation message. You can re-enable the orphan file deletion optimizer at a later time. 

    After you confirm, the orphan file deletion optimizer is disabled and the status for orphan file deletion changes back to `Not enabled`.

------
#### [ AWS CLI ]

In the following example, replace the account ID with a valid AWS account ID. Replace the database name and table name with the names of your database and Iceberg table. Replace the `roleArn` value with the Amazon Resource Name (ARN) of the IAM role that has the required permissions to disable the optimizer.

```
aws glue update-table-optimizer \
  --catalog-id 123456789012 \
  --database-name iceberg_db \
  --table-name iceberg_table \
  --table-optimizer-configuration '{"roleArn":"arn:aws:iam::123456789012:role/optimizer_role", "enabled":false}' \
  --type orphan_file_deletion
```

------
#### [ API ]

Call the [UpdateTableOptimizer](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-table-optimizers.html#aws-glue-api-table-optimizers-UpdateTableOptimizer) operation to disable the orphan file deletion optimizer for a specific table.

------

# Viewing optimization details
<a name="view-optimization-status"></a>

You can view the optimization status for Apache Iceberg tables in the AWS Glue console, AWS CLI, or using AWS API operations. 

------
#### [ Console ]

**To view the optimization status for Iceberg tables (console)**
+ To view the optimization status for an Iceberg table on the AWS Glue console, choose the table from the **Tables** list under **Data Catalog**. Then, under **Table optimization**, choose **View all**.  
![\[Apache Iceberg table details page with Enable compaction option.\]](http://docs.aws.amazon.com/glue/latest/dg/images/table-list-compaction-status.png)

------
#### [ AWS CLI ]

You can view the optimization details using AWS CLI.

In the following examples, replace the account ID with a valid AWS account ID, and replace the database name and table name with the names of your database and Iceberg table. For `type`, provide an optimization type. Acceptable values are `compaction`, `retention`, and `orphan_file_deletion`.
+ **To get the last compaction run details for a table**

  ```
  aws glue get-table-optimizer \
    --catalog-id 123456789012 \
    --database-name iceberg_db \
    --table-name iceberg_table \
    --type compaction
  ```
+ Use the following example to retrieve the history of an optimizer for a specific table.

  ```
  aws glue list-table-optimizer-runs \
    --catalog-id 123456789012 \
    --database-name iceberg_db \
    --table-name iceberg_table \
    --type compaction
  ```
+ The following example shows how to retrieve the optimization run and configuration details for multiple optimizers. You can specify a maximum of 20 optimizers.

  ```
  aws glue batch-get-table-optimizer \
  --entries '[{"catalogId":"123456789012", "databaseName":"iceberg_db", "tableName":"iceberg_table", "type":"compaction"}]'
  ```

------
#### [ API ]
+ Use the `GetTableOptimizer` operation to retrieve the last run details of an optimizer.
+ Use the `ListTableOptimizerRuns` operation to retrieve the history of a given optimizer on a specific table.
+ Use the [BatchGetTableOptimizer](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-table-optimizers.html#aws-glue-api-table-optimizers-BatchGetTableOptimizer) operation to retrieve configuration details for multiple optimizers in your account. You can specify a maximum of 20 optimizers in a single API call.
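
Because a batch request accepts at most 20 optimizers, a caller that tracks more tables has to split its entries into batches. The following is a minimal sketch of that pattern, assuming a boto3 AWS Glue client is passed in as `glue_client`; the helper names are hypothetical, and the entry keys follow the JSON shape used in the AWS CLI example.

```python
MAX_BATCH_ENTRIES = 20  # maximum optimizers per BatchGetTableOptimizer call


def build_entries(catalog_id, tables, optimizer_types):
    """Create one entry per (database, table, optimizer type) combination."""
    return [
        {
            "catalogId": catalog_id,
            "databaseName": database_name,
            "tableName": table_name,
            "type": optimizer_type,
        }
        for database_name, table_name in tables
        for optimizer_type in optimizer_types
    ]


def chunk(entries, size=MAX_BATCH_ENTRIES):
    """Split the entries into lists of at most `size` items."""
    return [entries[i:i + size] for i in range(0, len(entries), size)]


def batch_get_all(glue_client, entries):
    """Call BatchGetTableOptimizer once per batch and merge the results."""
    optimizers = []
    for batch in chunk(entries):
        response = glue_client.batch_get_table_optimizer(Entries=batch)
        optimizers.extend(response.get("TableOptimizers", []))
    return optimizers
```

Passing `boto3.client("glue")` as `glue_client` would send the real requests; any object with a compatible `batch_get_table_optimizer` method works for dry runs.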

------

# Viewing Amazon CloudWatch metrics
<a name="view-optimization-metrics"></a>

After the table optimizers run successfully, the service creates Amazon CloudWatch metrics on the optimization job performance. To view them, open the CloudWatch console and choose **Metrics**, **All metrics**. You can filter metrics by namespace (for example, AWS Glue), table name, or database name.

 For more information, see [View available metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/viewing_metrics_with_cloudwatch.html) in the *Amazon CloudWatch User Guide*. 

**Compaction**
+ Number of bytes compacted
+ Number of files compacted
+ Number of DPUs allocated to the job
+ Duration of the job (hours)

**Snapshot retention**
+ Number of data files deleted
+ Number of manifest files deleted
+ Number of manifest lists deleted
+ Duration of the job (hours)

**Orphan file deletion**
+ Number of orphan files deleted
+ Duration of the job (hours)

# Deleting an optimizer
<a name="delete-optimizer"></a>

You can delete an optimizer and its associated metadata for a table using the AWS CLI or an AWS API operation.

Run the following AWS CLI command to delete the optimization history for a table. You need to specify the optimizer `type` along with the catalog ID, database name, and table name. The acceptable values for `type` are `compaction`, `retention`, and `orphan_file_deletion`.

```
aws glue delete-table-optimizer \
  --catalog-id 123456789012 \
  --database-name iceberg_db \
  --table-name iceberg_table \
  --type compaction
```

Use the `DeleteTableOptimizer` operation to delete an optimizer for a table.

# Considerations and limitations
<a name="optimizer-notes"></a>

 This section includes things to consider when using table optimizers within the AWS Glue Data Catalog. 

## Durability and correctness
<a name="durability-correctness"></a>

**S3 Table Locations:**

When multiple AWS Glue Data Catalog tables share the same Amazon S3 location and have optimizers enabled, the snapshot retention or orphan file deletion optimizer for one table may delete files that are still referenced by the other table. Ensure that each table with optimizers enabled has a unique Amazon S3 location that is not shared with any other table, including tables in different databases.

**S3 Lifecycle Expiry:**

Amazon S3 lifecycle expiration rules that apply to Iceberg table storage locations can delete manifest and data files that are still referenced by active snapshots. If your bucket has lifecycle expiration rules, ensure they exclude the Iceberg table storage path.

## Known issues
<a name="known-issues"></a>

The [Catalog-level table optimizers](https://docs.aws.amazon.com/glue/latest/dg/catalog-level-optimizers.html) documentation states that "tables without their own optimizer configurations will inherit the disabled state from the catalog level." There is a known issue where some tables without their own optimizer configuration may not correctly inherit the disabled state from the catalog-level configuration. Use the AWS Glue console and optimizer execution logs to verify which optimizers are currently enabled and running in your account, and disable any that you do not require.

## Supported formats and limitations for managed data compaction
<a name="compaction-notes"></a>

Data compaction supports a variety of data types and compression formats for reading and writing data, including reading data from encrypted tables.

**Concurrency Control:**

 Apache Iceberg supports optimistic concurrency control, allowing multiple writers to perform operations simultaneously. Conflicts are detected and resolved at commit time. When working with streaming pipelines, configure appropriate retry settings through table properties and compaction settings to handle concurrent writes effectively. For detailed guidance, refer to the AWS Big Data Blog on [managing concurrent writes in Iceberg tables](https://aws.amazon.com/blogs/big-data/manage-concurrent-write-conflicts-in-apache-iceberg-on-the-aws-glue-data-catalog/). 

**Compaction Retries:**

 When compaction operations fail four consecutive times, AWS Glue Data Catalog table optimization automatically suspends the optimizer to prevent unnecessary compute resource consumption. Before you resume compaction, investigate the logs to understand why compaction is repeatedly failing. To resume compaction optimization, re-enable the optimizer through the AWS Glue console or API.
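
To spot a table that is approaching this threshold, you could count the failed runs at the head of the run history. The sketch below is illustrative only: it assumes each run is a dictionary with `startTimestamp` and `eventType` keys (as in a `ListTableOptimizerRuns` response) and that `failed`, `starting`, and `in_progress` are the relevant `eventType` values. The helper names are hypothetical.

```python
SUSPEND_AFTER = 4  # consecutive failures before the optimizer is suspended
IN_FLIGHT = {"starting", "in_progress"}  # assumed values for unfinished runs


def consecutive_failures(runs):
    """Count failed runs at the head of the history, newest first."""
    newest_first = sorted(runs, key=lambda run: run["startTimestamp"], reverse=True)
    count = 0
    for run in newest_first:
        event = run.get("eventType")
        if event in IN_FLIGHT:
            continue  # an unfinished run neither extends nor ends the streak
        if event != "failed":
            break  # the first finished, non-failed run ends the streak
        count += 1
    return count


def is_likely_suspended(runs):
    """Return True when the failure streak has reached the threshold."""
    return consecutive_failures(runs) >= SUSPEND_AFTER
```

This only estimates the state from history; the optimizer configuration returned by the service remains the authoritative source for whether the optimizer is enabled.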

 **Data compaction supports:**
+ **Encryption** – Data compaction only supports default Amazon S3 encryption (SSE-S3) and server-side KMS encryption (SSE-KMS).
+ **Compaction strategies** – Binpack, sort, and Z-order sorting
+ You can run compaction from the account where Data Catalog resides when the Amazon S3 bucket that stores the underlying data is in another account. To do this, the compaction role requires access to the Amazon S3 bucket.

 **Data compaction currently doesn’t support:** 
+ **Compaction on cross-account tables** – You can't run compaction on cross-account tables.
+ **Compaction on cross-Region tables** – You can't run compaction on cross-Region tables.
+ **Enabling compaction on resource links**
+ **Tables in the Amazon S3 Express One Zone storage class** – You can't run compaction on Iceberg tables in the Amazon S3 Express One Zone storage class.
+ **Z-order compaction strategy doesn't support the following data types:**
  + Decimal
  + TimestampWithoutZone

## Considerations for snapshot retention and orphan file deletion optimizers
<a name="retention-notes"></a>

The following considerations apply to the snapshot retention and the orphan file deletion optimizers. 
+ The snapshot retention and orphan file deletion processes have a maximum limit of deleting 1,000,000 files per run. When deleting expired snapshots, if the number of eligible files for deletion surpasses 1,000,000, any remaining files beyond that threshold will continue to exist in the table storage as orphan files. 
+ Snapshots will be preserved by the snapshot retention optimizer only when both criteria are satisfied: the minimum number of snapshots to keep and the specified retention period.
+ The snapshot retention optimizer deletes expired snapshot metadata from Apache Iceberg, preventing time travel queries for expired snapshots and optionally deleting associated data files.
+ The orphan file deletion optimizer deletes orphaned data and metadata files that are no longer referenced by Iceberg metadata, provided that they were created before the start of the orphan file deletion retention period, counted back from the time of the optimizer run.
+ Apache Iceberg facilitates version control through branches and tags, which are named pointers to specific snapshot states. Each branch and tag follows its own independent lifecycle, governed by retention policies defined at their respective levels. The AWS Glue Data Catalog optimizers take these lifecycle policies into account, ensuring adherence to the specified retention rules. Branch-level and tag-level retention policies take precedence over the optimizer configurations.

   For more information, see [Branching and Tagging](https://iceberg.apache.org/docs/nightly/branching/) in the Apache Iceberg documentation.
+ The snapshot retention and orphan file deletion optimizers delete files that are eligible for cleanup according to the configured parameters. For more control over file deletion, enable S3 versioning and lifecycle policies on the appropriate buckets.

   For detailed instructions on setting up versioning and creating lifecycle rules, see [https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html).
+  For proper orphan file determination, ensure that the provided table location and any sub-paths don't overlap with or contain data from any other tables or data sources. If paths overlap, you risk unrecoverable data loss from unintended deletion of files. 
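
Before enabling these optimizers, it can help to check table locations for overlap. The following is a minimal sketch, assuming you have already collected a mapping of table names to Amazon S3 locations (for example, from the Data Catalog); the function names and sample paths are hypothetical.

```python
from itertools import combinations


def normalize(location):
    """Return the location with exactly one trailing slash.

    The trailing slash keeps s3://bucket/t1 from matching s3://bucket/t10.
    """
    return location.rstrip("/") + "/"


def find_overlapping_locations(table_locations):
    """Return pairs of table names whose locations are equal or nested."""
    overlaps = []
    for (name_a, loc_a), (name_b, loc_b) in combinations(sorted(table_locations.items()), 2):
        a, b = normalize(loc_a), normalize(loc_b)
        if a.startswith(b) or b.startswith(a):
            overlaps.append((name_a, name_b))
    return overlaps


if __name__ == "__main__":
    sample = {
        "iceberg_db.orders": "s3://amzn-s3-demo-bucket/warehouse/orders",
        "iceberg_db.orders_archive": "s3://amzn-s3-demo-bucket/warehouse/orders/archive/",
        "iceberg_db.orders10": "s3://amzn-s3-demo-bucket/warehouse/orders10/",
    }
    for pair in find_overlapping_locations(sample):
        print("Overlapping locations:", pair)
```

An empty result means that no location in the mapping equals or contains another. The check covers only the tables you pass in, so other data sources that write under the same prefix need to be reviewed separately.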

## Debugging OversizedAllocationException exception
<a name="debug-exception"></a>

To resolve an `OversizedAllocationException` exception:
+ Reduce the batch size of the vectorized reader and try again. The default batch size is 5000. The batch size is controlled by the `read.parquet.vectorization.batch-size` table property.
  + If the exception persists after you try several smaller batch sizes, turn off vectorization. Vectorization is controlled by the `read.parquet.vectorization.enabled` table property.
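
One way to apply these settings is through Iceberg table properties in Spark SQL. The sketch below only generates the `ALTER TABLE` statements. The halving strategy, the minimum of 500, the helper names, and the table identifier `glue_catalog.iceberg_db.iceberg_table` are assumptions for illustration, not values prescribed by AWS Glue.

```python
def batch_size_candidates(default=5000, minimum=500):
    """Halve the batch size repeatedly, starting below the default."""
    sizes = []
    size = default // 2
    while size >= minimum:
        sizes.append(size)
        size //= 2
    return sizes


def set_property_sql(table, key, value):
    """Build a Spark SQL statement that sets one Iceberg table property."""
    return f"ALTER TABLE {table} SET TBLPROPERTIES ('{key}'='{value}')"


TABLE = "glue_catalog.iceberg_db.iceberg_table"  # hypothetical identifier

# Statements to try in order: smaller batch sizes first.
reduce_batch_statements = [
    set_property_sql(TABLE, "read.parquet.vectorization.batch-size", size)
    for size in batch_size_candidates()
]

# Last resort: turn vectorized Parquet reads off for the table.
disable_vectorization_statement = set_property_sql(
    TABLE, "read.parquet.vectorization.enabled", "false"
)
```

Each statement could then be run from a Spark session that has the Iceberg catalog configured, for example with `spark.sql(statement)`, retrying the failing operation after each change.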

# Supported Regions for table optimizers
<a name="regions-optimizers"></a>

The table optimization features (compaction, snapshot retention, and orphan file deletion) for AWS Glue Data Catalog are available in the following AWS Regions:
+ Asia Pacific (Tokyo)
+ Asia Pacific (Seoul)
+ Asia Pacific (Mumbai)
+ Asia Pacific (Singapore)
+ Asia Pacific (Sydney)
+ Asia Pacific (Jakarta)
+ Canada (Central)
+ Europe (Ireland)
+ Europe (London)
+ Europe (Frankfurt)
+ Europe (Stockholm)
+ US East (N. Virginia)
+ US East (Ohio)
+ US West (Oregon)
+ South America (São Paulo)

# Optimizing query performance for Iceberg tables
<a name="iceberg-column-statistics"></a>

Apache Iceberg is a high-performance open table format for huge analytic datasets. AWS Glue supports calculating and updating the number of distinct values (NDVs) for each column in Iceberg tables. These statistics can facilitate better query optimization, data management, and performance efficiency for data engineers and scientists working with large-scale datasets.

AWS Glue estimates the number of distinct values in each column of the Iceberg table and stores them in [Puffin](https://iceberg.apache.org/puffin-spec/) files on Amazon S3 associated with Iceberg table snapshots. Puffin is an Iceberg file format designed to store metadata such as indexes, statistics, and sketches. Storing sketches in Puffin files tied to snapshots ensures transactional consistency and freshness of the NDV statistics.

You can configure the column statistics generation task using the AWS Glue console or the AWS CLI. When you initiate the process, AWS Glue starts a Spark job in the background and updates the AWS Glue table metadata in the Data Catalog. You can view column statistics using the AWS Glue console or the AWS CLI, or by calling the [GetColumnStatisticsForTable](https://docs.aws.amazon.com/glue/latest/webapi/API_GetColumnStatisticsForTable.html) API operation.

**Note**  
If you're using AWS Lake Formation permissions to control access to the table, the role assumed by the column statistics task requires full table access to generate statistics.

**Topics**
+ [Prerequisites for generating column statistics](iceberg-column-stats-prereqs.md)
+ [Generating column statistics for Iceberg tables](iceberg-generate-column-stats.md)
+ [See also](#see-also-iceberg-stats)

# Prerequisites for generating column statistics
<a name="iceberg-column-stats-prereqs"></a>

To generate or update column statistics for Iceberg tables, the statistics generation task assumes an AWS Identity and Access Management (IAM) role on your behalf. Based on the permissions granted to the role, the column statistics generation task can read the data from the Amazon S3 data store.

When you configure the column statistics generation task, AWS Glue allows you to create a role that includes the `AWSGlueServiceRole` AWS managed policy plus the required inline policy for the specified data source. 

If you specify an existing role for generating column statistics, ensure that it includes the `AWSGlueServiceRole` policy or equivalent (or a scoped down version of this policy), and the required inline policies.

For more information about the required permissions, see [Prerequisites for generating column statistics](column-stats-prereqs.md). 

# Generating column statistics for Iceberg tables
<a name="iceberg-generate-column-stats"></a>

Follow these steps to configure a schedule for generating statistics in the Data Catalog using the AWS Glue console or the AWS CLI, or run the **StartColumnStatisticsTaskRun** operation.

**To generate column statistics**

1. Sign in to the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). 

1. Choose **Tables** under Data Catalog.

1. Choose an Iceberg table from the list. 

1. Choose **Column statistics**, **Generate on demand**, under the **Actions** menu.

   You can also choose the **Generate statistics** button under the **Column statistics** tab in the lower section of the **Tables** page.

1. On the **Generate statistics** page, provide the statistics generation details. Follow steps 6-11 in the [Generating column statistics on a schedule](generate-column-stats.md) section to configure a schedule for statistics generation for Iceberg tables. 

   You can also choose to generate column statistics on demand by following the instructions in [Generating column statistics on demand](column-stats-on-demand.md).
**Note**  
The sampling option is not available for Iceberg tables.

   AWS Glue calculates the number of distinct values for each column of the Iceberg table and writes them to a new Puffin file committed to the specified snapshot ID in your Amazon S3 location.
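
If you prefer the **StartColumnStatisticsTaskRun** operation to the console, the request might look like the following. This is a minimal sketch, assuming a boto3 AWS Glue client is passed in as `glue_client`; the helper names are hypothetical. No sample size is set, because the sampling option is not available for Iceberg tables.

```python
def build_stats_request(database_name, table_name, role, column_names=None):
    """Build keyword arguments for StartColumnStatisticsTaskRun.

    Omitting column_names leaves the column list out of the request.
    """
    request = {
        "DatabaseName": database_name,
        "TableName": table_name,
        "Role": role,  # IAM role that the statistics task assumes
    }
    if column_names:
        request["ColumnNameList"] = list(column_names)
    return request


def start_stats_run(glue_client, **kwargs):
    """Start the task run and return its run ID."""
    response = glue_client.start_column_statistics_task_run(
        **build_stats_request(**kwargs)
    )
    return response["ColumnStatisticsTaskRunId"]
```

For example, `start_stats_run(boto3.client("glue"), database_name="iceberg_db", table_name="iceberg_table", role="arn:aws:iam::123456789012:role/stats_role")` would start a run for the table; the role ARN here is a placeholder.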

## See also
<a name="see-also-iceberg-stats"></a>
+ [Viewing column statistics](view-column-stats.md)
+ [Viewing column statistics task runs](view-stats-run.md)
+ [Stopping column statistics task run](stop-stats-run.md)
+ [Deleting column statistics](delete-column-stats.md)