

# Optimizing query performance for Iceberg tables
<a name="iceberg-column-statistics"></a>

Apache Iceberg is a high-performance open table format for huge analytic datasets. AWS Glue supports calculating and updating the number of distinct values (NDVs) for each column in Iceberg tables. These statistics can facilitate better query optimization, data management, and performance efficiency for data engineers and scientists working with large-scale datasets.

AWS Glue estimates the number of distinct values in each column of the Iceberg table and stores the estimates in [Puffin](https://iceberg.apache.org/puffin-spec/) files on Amazon S3 that are associated with Iceberg table snapshots. Puffin is an Iceberg file format designed to store metadata such as indexes, statistics, and sketches. Storing sketches in Puffin files tied to snapshots ensures transactional consistency and freshness of the NDV statistics.
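
To illustrate what a distinct-value sketch is, the following is a minimal K-minimum-values (KMV) sketch in Python. It is an illustration of the general idea only and is not the algorithm AWS Glue runs; the Puffin specification defines a blob type for Apache DataSketches theta sketches, which generalize the KMV approach. The class name `KMVSketch`, the default `k`, and the choice of SHA-256 as the hash are assumptions made for this example.

```python
import hashlib
import heapq


class KMVSketch:
    """Illustrative K-minimum-values sketch: retains the k smallest hashes seen."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # max-heap (values negated) of the k smallest hashes
        self._members = set()  # hashes currently retained, used to skip duplicates

    @staticmethod
    def _hash(value):
        # Map the value to a float between 0 and 1 with a stable hash.
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, value):
        h = self._hash(value)
        if h in self._members:
            return  # this value is already represented in the sketch
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # The new hash is smaller than the largest retained hash: swap them.
            evicted = -heapq.heapreplace(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            # Fewer than k distinct hashes seen, so the count is exact.
            return float(len(self._heap))
        kth_smallest = -self._heap[0]
        # If hashes are uniform, the k-th smallest sits near k / NDV.
        return (self.k - 1) / kth_smallest
```

The sketch uses a fixed amount of memory regardless of table size, which is why an estimate rather than an exact count is stored. A larger `k` trades memory for a tighter estimate.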

You can run the column statistics generation task using the AWS Glue console or the AWS CLI. When you initiate the process, AWS Glue starts a Spark job in the background and updates the AWS Glue table metadata in the Data Catalog. You can view column statistics using the AWS Glue console, the AWS CLI, or the [GetColumnStatisticsForTable](https://docs.aws.amazon.com/glue/latest/webapi/API_GetColumnStatisticsForTable.html) API operation.
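
As a sketch of reading the statistics programmatically, the following Python function calls `get_column_statistics_for_table` on a boto3 AWS Glue client and collects the distinct-value count for each column that reports one. The function name `distinct_value_counts` and its parameters are hypothetical; the client is passed in so that the function makes no assumption about how credentials or the Region are configured.

```python
# Statistics types whose data structure includes NumberOfDistinctValues.
# Boolean and binary statistics do not carry a distinct-value count.
_NDV_TYPES = ("Long", "Double", "Decimal", "Date", "String")


def distinct_value_counts(glue_client, database, table, columns):
    """Return {column_name: NDV} for columns that report a distinct-value count.

    glue_client is expected to be a boto3 Glue client, for example
    boto3.client("glue"). The other parameter names are illustrative.
    """
    response = glue_client.get_column_statistics_for_table(
        DatabaseName=database,
        TableName=table,
        ColumnNames=columns,
    )
    counts = {}
    for stat in response.get("ColumnStatisticsList", []):
        data = stat["StatisticsData"]
        for type_name in _NDV_TYPES:
            details = data.get(f"{type_name}ColumnStatisticsData")
            if details is not None and "NumberOfDistinctValues" in details:
                counts[stat["ColumnName"]] = details["NumberOfDistinctValues"]
    return counts
```

Columns for which no statistics exist are omitted from the result; the API reports them in the `Errors` field of the response, which this sketch does not inspect.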

**Note**  
If you're using AWS Lake Formation permissions to control access to the table, the role assumed by the column statistics task requires full table access to generate statistics.

**Topics**
+ [Prerequisites for generating column statistics](iceberg-column-stats-prereqs.md)
+ [Generating column statistics for Iceberg tables](iceberg-generate-column-stats.md)
+ [See also](#see-also-iceberg-stats)

# Prerequisites for generating column statistics
<a name="iceberg-column-stats-prereqs"></a>

To generate or update column statistics for Iceberg tables, the statistics generation task assumes an AWS Identity and Access Management (IAM) role on your behalf. Based on the permissions granted to the role, the column statistics generation task can read the data from the Amazon S3 data store.

When you configure the column statistics generation task, AWS Glue allows you to create a role that includes the `AWSGlueServiceRole` AWS managed policy plus the required inline policy for the specified data source. 

If you specify an existing role for generating column statistics, ensure that it includes the `AWSGlueServiceRole` policy or equivalent (or a scoped down version of this policy), and the required inline policies.

For more information about the required permissions, see [Prerequisites for generating column statistics](column-stats-prereqs.md). 
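
Before you reuse an existing role, you can check whether the managed policy is attached to it. The following Python sketch pages through `list_attached_role_policies` on a boto3 IAM client and looks for the `AWSGlueServiceRole` policy ARN. The function name `has_glue_service_policy` is hypothetical, and the ARN shown assumes the standard `aws` partition.

```python
# ARN of the AWS managed policy in the standard "aws" partition (assumption:
# other partitions use a different ARN prefix).
GLUE_SERVICE_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"


def has_glue_service_policy(iam_client, role_name):
    """Return True if AWSGlueServiceRole is attached to the given role.

    iam_client is expected to be a boto3 IAM client, for example
    boto3.client("iam").
    """
    kwargs = {"RoleName": role_name}
    while True:
        page = iam_client.list_attached_role_policies(**kwargs)
        for policy in page.get("AttachedPolicies", []):
            if policy["PolicyArn"] == GLUE_SERVICE_POLICY_ARN:
                return True
        if not page.get("IsTruncated"):
            return False
        # More results are available; continue from the returned marker.
        kwargs["Marker"] = page["Marker"]
```

A `False` result does not necessarily mean that the role is unusable, because the role might carry an equivalent or scoped-down policy. The check also does not inspect the required inline policies.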

# Generating column statistics for Iceberg tables
<a name="iceberg-generate-column-stats"></a>

Follow these steps to configure a schedule for generating statistics in the Data Catalog using the AWS Glue console, the AWS CLI, or the **StartColumnStatisticsTaskRun** operation.

**To generate column statistics**

1. Sign in to the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). 

1. Choose **Tables** under Data Catalog.

1. Choose an Iceberg table from the list. 

1. From the **Actions** menu, choose **Column statistics**, **Generate on demand**.

   You can also choose the **Generate statistics** button on the **Column statistics** tab in the lower section of the **Tables** page.

1. On the **Generate statistics** page, provide the statistics generation details. Follow steps 6-11 in the [Generating column statistics on a schedule](generate-column-stats.md) section to configure a schedule for statistics generation for Iceberg tables. 

   You can also generate column statistics on demand by following the instructions in [Generating column statistics on demand](column-stats-on-demand.md).

   **Note**  
   The sampling option is not available for Iceberg tables.

   AWS Glue calculates the number of distinct values for each column of the Iceberg table and writes them to a new Puffin file committed to the specified snapshot ID in your Amazon S3 location.
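
As an alternative to the console steps, the following Python sketch starts an on-demand run with the **StartColumnStatisticsTaskRun** operation through a boto3 AWS Glue client and polls until the run finishes. The function name `generate_column_statistics`, the polling interval, and the poll limit are assumptions for this example. The sample size parameter is left out because sampling is not available for Iceberg tables.

```python
import time

# Statuses after which a task run no longer changes.
_FINAL_STATUSES = ("SUCCEEDED", "FAILED", "STOPPED")


def generate_column_statistics(glue_client, database, table, role,
                               poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Start a column statistics task run and wait for it to finish.

    glue_client is expected to be a boto3 Glue client, for example
    boto3.client("glue"). role is the name or ARN of the IAM role that the
    task assumes. Returns the final task run description.
    """
    run_id = glue_client.start_column_statistics_task_run(
        DatabaseName=database,
        TableName=table,
        Role=role,
    )["ColumnStatisticsTaskRunId"]

    for _ in range(max_polls):
        run = glue_client.get_column_statistics_task_run(
            ColumnStatisticsTaskRunId=run_id
        )["ColumnStatisticsTaskRun"]
        if run["Status"] in _FINAL_STATUSES:
            return run
        sleep(poll_seconds)  # still STARTING or RUNNING

    raise TimeoutError(f"Task run {run_id} did not finish after {max_polls} polls")
```

The `sleep` parameter exists only so that the wait can be replaced in tests; in normal use, leave it at its default.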

## See also
<a name="see-also-iceberg-stats"></a>
+ [Viewing column statistics](view-column-stats.md)
+ [Viewing column statistics task runs](view-stats-run.md)
+ [Stopping column statistics task run](stop-stats-run.md)
+ [Deleting column statistics](delete-column-stats.md)