

# Deleting orphan files
<a name="orphan-file-deletion"></a>

AWS Glue Data Catalog allows you to remove orphan files from your Iceberg tables. Orphan files are unreferenced files that exist in your Amazon S3 data source under the specified table location, are not tracked by the Iceberg table metadata, and are older than your configured age limit. Orphan files can accumulate over time when operations such as compaction, partition drops, or table rewrites fail, and they take up unnecessary storage space.

The orphan file deletion optimizer in AWS Glue scans the table metadata and the actual data files, identifies the orphan files, and deletes them to reclaim storage space. The optimizer only removes files created after the optimizer's creation date that also meet the configured deletion criteria. Files created before or on the optimizer creation date are never deleted.

**Orphan file deletion logic**

1. Date check – Compares the file creation date with the optimizer creation date. If the file was created on or before the optimizer creation date, the file is skipped.

1. Optimizer configuration check – If the file is newer than the optimizer creation date, evaluates the file against the configured age limit. The optimizer deletes the file if it matches the deletion criteria, and skips the file if it doesn't.
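The two checks above can be sketched as a small shell function. This is an illustration only, not the service's implementation: the function name `orphan_file_action`, the use of Unix epoch seconds, and the strict "older than" comparison against the retention period are assumptions made for this example. The sketch also assumes that the file is already known to be unreferenced by the Iceberg table metadata.

```shell
# Illustrative sketch only; not the AWS Glue implementation.
# Usage: orphan_file_action FILE_CREATED OPTIMIZER_CREATED NOW RETENTION_DAYS
# Times are Unix epoch seconds (an assumption made for this example).
# Prints "delete" or "skip" for a file that is already known to be unreferenced.
orphan_file_action() (
  file_created=$1
  optimizer_created=$2
  now=$3
  retention_days=$4

  # Date check: files created on or before the optimizer creation date are skipped.
  if [ "$file_created" -le "$optimizer_created" ]; then
    echo "skip"
    exit 0
  fi

  # Configuration check: delete only if the file is older than the retention period.
  age_seconds=$((now - file_created))
  if [ "$age_seconds" -gt $((retention_days * 86400)) ]; then
    echo "delete"
  else
    echo "skip"
  fi
)

# Optimizer created at t=200; file created at t=300 and now 4 days old; retention is 3 days.
orphan_file_action 300 200 $((300 + 4 * 86400)) 3
```

In this example the file is newer than the optimizer and older than the 3-day retention period, so the function prints `delete`.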

You can start orphan file deletion by creating an orphan file deletion table optimizer in the Data Catalog.

**Important**  
By default, orphan file deletion evaluates all files under your AWS Glue table location. Although you can use an API parameter to configure a sub-prefix that limits the scope of evaluation, you must ensure that your table location doesn't contain files from other data sources or tables. If your table location overlaps with other data sources, the service might identify and delete unrelated files as orphans.
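Before you enable the optimizer, it can help to compare the table location against the locations of your other tables or data sources. The following is a minimal sketch of such a check, assuming that you already have the two S3 locations as strings. The function name `s3_locations_overlap` and the bucket and prefix names are hypothetical placeholders, and the check only compares prefixes; it doesn't list any objects in Amazon S3.

```shell
# Illustrative helper (hypothetical name): reports whether two S3 locations overlap,
# that is, whether one is equal to or nested under the other.
s3_locations_overlap() (
  # Normalize to exactly one trailing slash so "t1" does not match "t10".
  a="${1%/}/"
  b="${2%/}/"
  case "$a" in "$b"*) echo "overlap"; exit 0 ;; esac
  case "$b" in "$a"*) echo "overlap"; exit 0 ;; esac
  echo "disjoint"
)

# A location nested under the table location overlaps with it.
s3_locations_overlap "s3://amzn-s3-demo-bucket/warehouse/t1" "s3://amzn-s3-demo-bucket/warehouse/t1/data"
# Sibling prefixes that only share leading characters do not overlap.
s3_locations_overlap "s3://amzn-s3-demo-bucket/warehouse/t1" "s3://amzn-s3-demo-bucket/warehouse/t10"
```

The first call prints `overlap` and the second prints `disjoint`. An `overlap` result between the table location and any unrelated data source means that the optimizer could treat that source's files as orphans.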

**Topics**
+ [Enabling orphan file deletion](enable-orphan-file-deletion.md)
+ [Updating orphan file deletion optimizer](update-orphan-file-deletion.md)
+ [Disabling orphan file deletion](disable-orphan-file-deletion.md)

# Enabling orphan file deletion
<a name="enable-orphan-file-deletion"></a>

You can use the AWS Glue console, AWS CLI, or AWS API to enable orphan file deletion for your Apache Iceberg tables in the Data Catalog. For new tables, you can choose Apache Iceberg as the table format and enable the orphan file deletion optimizer when you create the table. Orphan file deletion is disabled by default for new tables.

------
#### [ Console ]

**To enable orphan file deletion**

1.  Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/) and sign in as a data lake administrator, the table creator, or a user who has been granted the `glue:UpdateTable` and `lakeformation:GetDataAccess` permissions on the table. 

1. In the navigation pane, under **Data Catalog**, choose **Tables**.

1. On the **Tables** page, choose the Iceberg table for which you want to enable orphan file deletion.

   Choose the **Table optimization** tab in the lower section of the page, and then choose **Enable**, **Orphan file deletion** from **Actions**.

   You can also choose **Enable** under **Optimization** from the **Actions** menu located in the top right corner of the page.

1. On the **Enable optimization** page, choose **Orphan file deletion** under **Optimization options**.

1. If you choose to use **Default settings**, all orphan files will be deleted after 3 days. If you want to keep the orphan files for a specific number of days, choose **Customize settings**.

1. Next, choose an IAM role with the required permissions to delete orphan files.

1. If you have security policy configurations where the Iceberg table optimizer needs to access Amazon S3 buckets from a specific Virtual Private Cloud (VPC), create an AWS Glue network connection or use an existing one.

   If you don't have an AWS Glue VPC Connection set up already, create a new one by following the steps in the [Creating connections for connectors](https://docs.aws.amazon.com/glue/latest/dg/creating-connections.html) section using the AWS Glue console or the AWS CLI/SDK.

1. If you choose **Customize settings**, enter the number of days to retain the files before deletion under **Orphan file deletion configuration**. You can also specify the interval between two consecutive optimizer runs. The default value is 24 hours.

1. Choose **Enable optimization**.

------
#### [ AWS CLI ]

To enable orphan file deletion for an Iceberg table in AWS Glue, create a table optimizer of type `orphan_file_deletion` and set the `enabled` field to `true`. The following AWS CLI command creates an orphan file deletion optimizer for an Iceberg table:

```
aws glue create-table-optimizer \
 --catalog-id 123456789012 \
 --database-name iceberg_db \
 --table-name iceberg_table \
 --table-optimizer-configuration '{"roleArn":"arn:aws:iam::123456789012:role/optimizer_role","enabled":true,"vpcConfiguration":{"glueConnectionName":"glue_connection_name"},"orphanFileDeletionConfiguration":{"icebergConfiguration":{"orphanFileRetentionPeriodInDays":3,"location":"S3 location"}}}' \
 --type orphan_file_deletion
```

This command creates an orphan file deletion optimizer for the specified Iceberg table. The key parameters are:
+ `roleArn` – The ARN of the IAM role with permissions to access the S3 bucket and AWS Glue resources.
+ `enabled` – Set to `true` to enable the optimizer.
+ `orphanFileRetentionPeriodInDays` – The number of days to retain orphan files before deleting them (minimum 1 day).
+ `type` – Set to `orphan_file_deletion` to create an orphan file deletion optimizer.

After you create the table optimizer, it runs orphan file deletion periodically (once per day if left enabled). You can check the runs by using the `list-table-optimizer-runs` command. Each orphan file deletion run identifies and deletes files that are not tracked in the Iceberg metadata for the table.
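Inline JSON is easy to misquote in a shell command. As a hedged sketch, the configuration can instead be built from shell variables and passed to `--table-optimizer-configuration`. The function name `build_orphan_config` is hypothetical, the bucket and prefix are placeholders, the field names are the ones used in the command above, and `vpcConfiguration` is omitted for brevity. The function also rejects a retention period shorter than the 1-day minimum. The function does not escape its inputs, so it assumes that the values contain no double quotes or backslashes.

```shell
# Illustrative helper (hypothetical name): prints the JSON value for
# --table-optimizer-configuration. Does not call AWS.
# Usage: build_orphan_config ROLE_ARN RETENTION_DAYS S3_LOCATION
build_orphan_config() (
  role_arn=$1
  retention_days=$2
  location=$3

  # Reject non-numeric values and values below the 1-day minimum.
  if ! [ "$retention_days" -ge 1 ] 2>/dev/null; then
    echo "retention period must be an integer of at least 1 day" >&2
    exit 1
  fi

  printf '{"roleArn":"%s","enabled":true,"orphanFileDeletionConfiguration":{"icebergConfiguration":{"orphanFileRetentionPeriodInDays":%d,"location":"%s"}}}\n' \
    "$role_arn" "$retention_days" "$location"
)

# Print the configuration for a 3-day retention period (placeholder values).
build_orphan_config "arn:aws:iam::123456789012:role/optimizer_role" 3 "s3://amzn-s3-demo-bucket/iceberg_table/"

# Example use with the command shown earlier (not run here):
#   aws glue create-table-optimizer \
#     --catalog-id 123456789012 \
#     --database-name iceberg_db \
#     --table-name iceberg_table \
#     --table-optimizer-configuration "$(build_orphan_config "$ROLE_ARN" 3 "$LOCATION")" \
#     --type orphan_file_deletion
```

Building the JSON in one place keeps the quoting consistent between the create and update commands, which accept the same configuration structure.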

------
#### [ API ]

Call the [CreateTableOptimizer](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-table-optimizers.html#aws-glue-api-table-optimizers-CreateTableOptimizer) operation to create the orphan file deletion optimizer for a specific table.

------

# Updating orphan file deletion optimizer
<a name="update-orphan-file-deletion"></a>

You can modify the configuration of the orphan file deletion optimizer, such as the retention period for orphan files or the IAM role used by the optimizer, by using the AWS Glue console, the AWS CLI, or the `UpdateTableOptimizer` operation.

------
#### [ AWS Management Console ]

**To update the orphan file deletion optimizer**

1.  Choose **Data Catalog** and choose **Tables**. From the tables list, choose the table whose orphan file deletion optimizer configuration you want to update.

1. In the lower section of the **Table details** page, choose **Table optimization**, and then choose **Edit**.

1.  On the **Edit optimization** page, make the desired changes. 

1.  Choose **Save**. 

------
#### [ AWS CLI ]

To update the orphan file deletion optimizer in AWS Glue, use the `update-table-optimizer` command. In the `icebergConfiguration` field of `orphanFileDeletionConfiguration`, you can set `orphanFileRetentionPeriodInDays` to change the number of days to retain orphan files, and `location` to specify the Iceberg table location to delete orphan files from.

```
aws glue update-table-optimizer \
 --catalog-id 123456789012 \
 --database-name iceberg_db \
 --table-name iceberg_table \
 --table-optimizer-configuration '{"roleArn":"arn:aws:iam::123456789012:role/optimizer_role","enabled":true, "vpcConfiguration":{"glueConnectionName":"glue_connection_name"},"orphanFileDeletionConfiguration":{"icebergConfiguration":{"orphanFileRetentionPeriodInDays":5}}}' \
 --type orphan_file_deletion
```

------
#### [ API ]

Call the [UpdateTableOptimizer](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-table-optimizers.html#aws-glue-api-table-optimizers-UpdateTableOptimizer) operation to update the orphan file deletion optimizer for a table.

------

 

# Disabling orphan file deletion
<a name="disable-orphan-file-deletion"></a>

You can disable the orphan file deletion optimizer for a particular Apache Iceberg table by using the AWS Glue console, AWS CLI, or AWS API.

------
#### [ Console ]

**To disable orphan file deletion**

1. Choose **Data Catalog** and choose **Tables**. From the tables list, choose the Iceberg table for which you want to disable the orphan file deletion optimizer.

1. In the lower section of the **Table details** page, choose the **Table optimization** tab.

1. Choose **Actions**, and then choose **Disable**, **Orphan file deletion**.

   You can also choose **Disable** under **Optimization** from the **Actions** menu.

1.  Choose **Disable** on the confirmation message. You can re-enable the orphan file deletion optimizer at a later time.

    After you confirm, the orphan file deletion optimizer is disabled and the status for orphan file deletion changes back to `Not enabled`.

------
#### [ AWS CLI ]

In the following example, replace the account ID with a valid AWS account ID. Replace the database name and table name with the names of your database and Iceberg table. Replace the `roleArn` value with the Amazon Resource Name (ARN) of the IAM role that has the required permissions to disable the optimizer.

```
aws glue update-table-optimizer \
  --catalog-id 123456789012 \
  --database-name iceberg_db \
  --table-name iceberg_table \
  --table-optimizer-configuration '{"roleArn":"arn:aws:iam::123456789012:role/optimizer_role","enabled":false}' \
  --type orphan_file_deletion
```

------
#### [ API ]

Call the [UpdateTableOptimizer](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-table-optimizers.html#aws-glue-api-table-optimizers-UpdateTableOptimizer) operation to disable the orphan file deletion optimizer for a specific table.

------