

# Managing the Data Catalog
<a name="manage-catalog"></a>

 The AWS Glue Data Catalog is a central metadata repository that stores structural and operational metadata for your Amazon S3 data sets. Managing the Data Catalog effectively is crucial for maintaining data quality, performance, security, and governance.

 By understanding and applying these Data Catalog management practices, you can ensure your metadata remains accurate, performant, secure, and well-governed as your data landscape evolves. 

This section covers the following aspects of Data Catalog management:
+ *Updating table schema and partitions*: As your data evolves, you may need to update the table schema or partition structure defined in the Data Catalog. For more information on how to make these updates programmatically using AWS Glue ETL jobs, see [Updating the schema, and adding new partitions in the Data Catalog using AWS Glue ETL jobs](update-from-job.md).
+ *Managing column statistics*: Accurate column statistics help optimize query plans and improve performance. For more information on how to generate, update, and manage column statistics, see [Optimizing query performance using column statistics](column-statistics.md). 
+ *Encrypting the Data Catalog*: To protect sensitive metadata, you can encrypt your Data Catalog using AWS Key Management Service (AWS KMS). This section explains how to enable and manage encryption for your Data Catalog.
+ *Securing the Data Catalog with AWS Lake Formation*: Lake Formation provides a comprehensive approach to data lake security and access control. You can use Lake Formation to secure and govern access to your Data Catalog and underlying data.

**Topics**
+ [Updating the schema, and adding new partitions in the Data Catalog using AWS Glue ETL jobs](update-from-job.md)
+ [Optimizing query performance using column statistics](column-statistics.md)
+ [Encrypting your Data Catalog](catalog-encryption.md)
+ [Securing your Data Catalog using Lake Formation](secure-catalog.md)
+ [Working with AWS Glue Data Catalog views in AWS Glue](catalog-views.md)

# Updating the schema, and adding new partitions in the Data Catalog using AWS Glue ETL jobs
<a name="update-from-job"></a>

Your extract, transform, and load (ETL) job might create new table partitions in the target data store. Your dataset schema can evolve and diverge from the AWS Glue Data Catalog schema over time. AWS Glue ETL jobs now provide several features that you can use within your ETL script to update your schema and partitions in the Data Catalog. These features allow you to see the results of your ETL work in the Data Catalog, without having to rerun the crawler.

## New partitions
<a name="update-from-job-partitions"></a>

If you want to view the new partitions in the AWS Glue Data Catalog, you can do one of the following:
+ When the job finishes, rerun the crawler, and view the new partitions on the console when the crawler finishes.
+ When the job finishes, view the new partitions on the console right away, without having to rerun the crawler. You can enable this feature by adding a few lines of code to your ETL script, as shown in the following examples. The code uses the `enableUpdateCatalog` argument to indicate that the Data Catalog is to be updated during the job run as the new partitions are created.

**Method 1**  
Pass `enableUpdateCatalog` and `partitionKeys` in an options argument.  

------
#### [ Python ]

```
additionalOptions = {"enableUpdateCatalog": True}
additionalOptions["partitionKeys"] = ["region", "year", "month", "day"]


sink = glueContext.write_dynamic_frame_from_catalog(frame=last_transform, database=<target_db_name>,
                                                    table_name=<target_table_name>, transformation_ctx="write_sink",
                                                    additional_options=additionalOptions)
```

------
#### [ Scala ]

```
val options = JsonOptions(Map(
    "path" -> <S3_output_path>, 
    "partitionKeys" -> Seq("region", "year", "month", "day"), 
    "enableUpdateCatalog" -> true))
val sink = glueContext.getCatalogSink(
    database = <target_db_name>, 
    tableName = <target_table_name>, 
    additionalOptions = options)
sink.writeDynamicFrame(df)
```

------

**Method 2**  
Pass `enableUpdateCatalog` and `partitionKeys` in `getSink()`, and call `setCatalogInfo()` on the `DataSink` object.  

------
#### [ Python ]

```
sink = glueContext.getSink(
    connection_type="s3", 
    path="<S3_output_path>",
    enableUpdateCatalog=True,
    partitionKeys=["region", "year", "month", "day"])
sink.setFormat("json")
sink.setCatalogInfo(catalogDatabase=<target_db_name>, catalogTableName=<target_table_name>)
sink.writeFrame(last_transform)
```

------
#### [ Scala ]

```
val options = JsonOptions(
   Map("path" -> <S3_output_path>, 
       "partitionKeys" -> Seq("region", "year", "month", "day"), 
       "enableUpdateCatalog" -> true))
val sink = glueContext.getSink("s3", options).withFormat("json")
sink.setCatalogInfo(<target_db_name>, <target_table_name>)
sink.writeDynamicFrame(df)
```

------

With these features, you can create new catalog tables, update existing tables with a modified schema, and add new table partitions in the Data Catalog from an AWS Glue ETL job itself, without rerunning the crawler.

## Updating table schema
<a name="update-from-job-updating-table-schema"></a>

If you want to overwrite the Data Catalog table's schema, you can do one of the following:
+ When the job finishes, rerun the crawler, and make sure your crawler is configured to update the table definition as well. When the crawler finishes, view the new partitions on the console along with any schema updates. For more information, see [Configuring a Crawler Using the API](https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-configure-changes-api).
+ When the job finishes, view the modified schema on the console right away, without having to rerun the crawler. You can enable this feature by adding a few lines of code to your ETL script, as shown in the following examples. The code uses `enableUpdateCatalog` set to true, and also `updateBehavior` set to `UPDATE_IN_DATABASE`, which indicates to overwrite the schema and add new partitions in the Data Catalog during the job run.

------
#### [ Python ]

```
additionalOptions = {
    "enableUpdateCatalog": True, 
    "updateBehavior": "UPDATE_IN_DATABASE"}
additionalOptions["partitionKeys"] = ["partition_key0", "partition_key1"]

sink = glueContext.write_dynamic_frame_from_catalog(frame=last_transform, database=<dst_db_name>,
    table_name=<dst_tbl_name>, transformation_ctx="write_sink",
    additional_options=additionalOptions)
job.commit()
```

------
#### [ Scala ]

```
val options = JsonOptions(Map(
    "path" -> outputPath, 
    "partitionKeys" -> Seq("partition_0", "partition_1"), 
    "enableUpdateCatalog" -> true))
val sink = glueContext.getCatalogSink(database = nameSpace, tableName = tableName, additionalOptions = options)
sink.writeDynamicFrame(df)
```

------

You can also set the `updateBehavior` value to `LOG` if you want to prevent your table schema from being overwritten, but still want to add the new partitions. The default value of `updateBehavior` is `UPDATE_IN_DATABASE`, so if you don’t explicitly define it, then the table schema will be overwritten.

If `enableUpdateCatalog` is not set to true, the ETL job will not update the table in the Data Catalog, regardless of the option selected for `updateBehavior`.
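The interplay of these two options can be captured in a small options builder. The following is a minimal sketch; the helper name is illustrative, and the commented sink call shows where the options would go in a Glue job script:

```
def catalog_update_options(partition_keys, update_behavior="LOG"):
    """Build additional_options for a catalog sink. With updateBehavior="LOG",
    new partitions are added but the existing table schema is left untouched."""
    return {
        "enableUpdateCatalog": True,
        "updateBehavior": update_behavior,
        "partitionKeys": list(partition_keys),
    }

# In a Glue job, pass the options to the catalog sink, for example:
# sink = glueContext.write_dynamic_frame_from_catalog(
#     frame=last_transform, database="<dst_db_name>",
#     table_name="<dst_tbl_name>", transformation_ctx="write_sink",
#     additional_options=catalog_update_options(["region", "year", "month", "day"]))
```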

## Creating new tables
<a name="update-from-job-creating-new-tables"></a>

You can also use the same options to create a new table in the Data Catalog. You can specify the database and new table name using `setCatalogInfo`.

------
#### [ Python ]

```
sink = glueContext.getSink(connection_type="s3", path="s3://path/to/data",
    enableUpdateCatalog=True, updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["partition_key0", "partition_key1"])
sink.setFormat("<format>")
sink.setCatalogInfo(catalogDatabase=<dst_db_name>, catalogTableName=<dst_tbl_name>)
sink.writeFrame(last_transform)
```

------
#### [ Scala ]

```
val options = JsonOptions(Map(
    "path" -> outputPath, 
    "partitionKeys" -> Seq("<partition_1>", "<partition_2>"), 
    "enableUpdateCatalog" -> true, 
    "updateBehavior" -> "UPDATE_IN_DATABASE"))
val sink = glueContext.getSink(connectionType = "s3", connectionOptions = options).withFormat("<format>")
sink.setCatalogInfo(catalogDatabase = "<dst_db_name>", catalogTableName = "<dst_tbl_name>")
sink.writeDynamicFrame(df)
```

------

## Restrictions
<a name="update-from-job-restrictions"></a>

Take note of the following restrictions:
+ Only Amazon Simple Storage Service (Amazon S3) targets are supported.
+ The `enableUpdateCatalog` feature is not supported for governed tables.
+ Only the following formats are supported: `json`, `csv`, `avro`, and `parquet`.
+ To create or update tables with the `parquet` classification, you must use the AWS Glue optimized parquet writer for DynamicFrames. This can be achieved with one of the following:
  + If you're updating an existing table in the catalog with the `parquet` classification, the table must have the `"useGlueParquetWriter"` table property set to `true` before you update it. You can set this property using the AWS Glue APIs/SDK, the console, or an Athena DDL statement.  
![\[Catalog table property edit field in AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/edit-table-property.png)

    Once the catalog table property is set, you can use the following snippet of code to update the catalog table with the new data:

    ```
    glueContext.write_dynamic_frame.from_catalog(
        frame=frameToWrite,
        database="dbName",
        table_name="tableName",
        additional_options={
            "enableUpdateCatalog": True,
            "updateBehavior": "UPDATE_IN_DATABASE"
        }
    )
    ```
  + If the table doesn't already exist in the catalog, you can use the `getSink()` method in your script with `connection_type="s3"` to add the table and its partitions to the catalog, along with writing the data to Amazon S3. Provide the appropriate `partitionKeys` and `compression` for your workflow.

    ```
    s3sink = glueContext.getSink(
        path="s3://bucket/folder/",
        connection_type="s3",
        updateBehavior="UPDATE_IN_DATABASE",
        partitionKeys=[],
        compression="snappy",
        enableUpdateCatalog=True
    )
        
    s3sink.setCatalogInfo(
        catalogDatabase="dbName", catalogTableName="tableName"
    )
        
    s3sink.setFormat("parquet", useGlueParquetWriter=True)
    s3sink.writeFrame(frameToWrite)
    ```
  + The `glueparquet` format value is a legacy method of enabling the AWS Glue parquet writer.
+ When the `updateBehavior` is set to `LOG`, new partitions will be added only if the `DynamicFrame` schema is equivalent to or contains a subset of the columns defined in the Data Catalog table's schema.
+ Schema updates are not supported for non-partitioned tables (tables that don't use the `partitionKeys` option).
+ The `partitionKeys` that you pass in your ETL script must match, in the same order, the partition keys in your Data Catalog table schema.
+ This feature does not support updating or creating tables whose schemas are nested (for example, arrays inside of structs).
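The partition-key ordering restriction above can be checked programmatically before writing. The following is a minimal sketch, assuming the table dict shape returned by the Glue `GetTable` operation (`glue.get_table()["Table"]`); the helper name is illustrative:

```
def partition_keys_match(script_keys, table):
    """Check that the partitionKeys passed in the ETL script equal, in order,
    the partition keys defined in the Data Catalog table schema."""
    catalog_keys = [key["Name"] for key in table.get("PartitionKeys", [])]
    return list(script_keys) == catalog_keys

# The table dict would come from the AWS SDK, for example:
# import boto3
# table = boto3.client("glue").get_table(
#     DatabaseName="<dst_db_name>", Name="<dst_tbl_name>")["Table"]
```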

For more information, see [Programming Spark scripts](aws-glue-programming.md).

# Working with MongoDB connections in ETL jobs
<a name="integrate-with-mongo-db"></a>

You can create a connection for MongoDB and then use that connection in your AWS Glue job. For more information, see [MongoDB connections](aws-glue-programming-etl-connect-mongodb-home.md) in the AWS Glue programming guide. The connection `url`, `username`, and `password` are stored in the MongoDB connection. You can specify other options in your ETL job script using the `additionalOptions` parameter of `glueContext.getCatalogSource`. These options include:
+ `database`: (Required) The MongoDB database to read from.
+ `collection`: (Required) The MongoDB collection to read from.

By placing the `database` and `collection` information inside the ETL job script, you can use the same connection in multiple jobs.

1. Create an AWS Glue Data Catalog connection for the MongoDB data source. See ["connectionType": "mongodb"](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-mongodb) for a description of the connection parameters. You can create the connection using the console, APIs or CLI.

1. Create a database in the AWS Glue Data Catalog to store the table definitions for your MongoDB data. See [Creating databases](define-database.md) for more information.

1. Create a crawler that crawls the data in the MongoDB using the information in the connection to connect to the MongoDB. The crawler creates the tables in the AWS Glue Data Catalog that describe the tables in the MongoDB database that you use in your job. See [Using crawlers to populate the Data Catalog](add-crawler.md) for more information.

1. Create a job with a custom script. You can create the job using the console, APIs or CLI. For more information, see [Adding Jobs in AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/add-job.html).

1. Choose the data targets for your job. The tables that represent the data target can be defined in your Data Catalog, or your job can create the target tables when it runs. You choose a target location when you author the job. If the target requires a connection, the connection is also referenced in your job. If your job requires multiple data targets, you can add them later by editing the script.

1. Customize the job-processing environment by providing arguments for your job and generated script. 

   Here is an example of creating a `DynamicFrame` from the MongoDB database based on the table structure defined in the Data Catalog. The code uses `additionalOptions` to provide the additional data source information:

------
#### [  Scala  ]

   ```
   val resultFrame: DynamicFrame = glueContext.getCatalogSource(
           database = catalogDB, 
           tableName = catalogTable, 
           additionalOptions = JsonOptions(Map("database" -> DATABASE_NAME, 
                   "collection" -> COLLECTION_NAME))
         ).getDynamicFrame()
   ```

------
#### [  Python  ]

   ```
   result_frame = glue_context.create_dynamic_frame_from_catalog(
           database = catalogDB,
           table_name = catalogTable,
           additional_options = {"database":"database_name", 
               "collection":"collection_name"})
   ```

------

1. Run the job, either on-demand or through a trigger.
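The on-demand run in the last step can be started with the AWS SDK for Python. The following is a minimal sketch; the job name and job-argument names are illustrative, not part of the connection:

```
def mongo_job_arguments(database, collection):
    """Build job arguments that carry the MongoDB database and collection
    into the ETL script (argument names here are illustrative)."""
    return {"--mongo_database": database, "--mongo_collection": collection}

# To start the job on demand with the AWS SDK (requires AWS credentials):
# import boto3
# boto3.client("glue").start_job_run(
#     JobName="mongo-etl-job",
#     Arguments=mongo_job_arguments("sales", "orders"))
```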

# Optimizing query performance using column statistics
<a name="column-statistics"></a>

You can compute column-level statistics for AWS Glue Data Catalog tables in data formats such as Parquet, ORC, JSON, ION, CSV, and XML without setting up additional data pipelines. Column statistics help you to understand data profiles by getting insights about values within a column. 

Data Catalog supports generating statistics for column values such as minimum value, maximum value, total null values, total distinct values, average length of values, and total occurrences of true values. AWS analytical services such as Amazon Redshift and Amazon Athena can use these column statistics to generate query execution plans, and choose the optimal plan that improves query performance.

There are three scenarios for generating column statistics: 

**Auto**  
AWS Glue supports automatic column statistics generation at the catalog level, so that it can automatically generate statistics for new tables in the AWS Glue Data Catalog.

**Scheduled**  
AWS Glue supports scheduling column statistics generation so that it can be run automatically on a recurring schedule.   
With scheduled statistics computation, the column statistics task updates the overall table-level statistics, such as min, max, and avg with the new statistics, providing query engines with accurate and up-to-date statistics to optimize query execution. 

**On-demand**  
Use this option to generate column statistics on-demand whenever needed. This is useful for ad-hoc analysis or when statistics need to be computed immediately. 

You can configure and run the column statistics generation task using the AWS Glue console, the AWS CLI, or AWS Glue API operations. When you initiate the process, AWS Glue starts a Spark job in the background and updates the AWS Glue table metadata in the Data Catalog. You can view column statistics using the AWS Glue console or the AWS CLI, or by calling the [GetColumnStatisticsForTable](https://docs.aws.amazon.com/glue/latest/webapi/API_GetColumnStatisticsForTable.html) API operation.
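As a sketch of reading statistics programmatically, the following helper indexes a `GetColumnStatisticsForTable` response by column name. The response shape follows the API reference; the helper name and the commented SDK call are illustrative:

```
def stats_by_column(response):
    """Index a GetColumnStatisticsForTable response by column name,
    returning each column's StatisticsData structure."""
    return {
        col["ColumnName"]: col["StatisticsData"]
        for col in response.get("ColumnStatisticsList", [])
    }

# In practice the response comes from the AWS SDK, for example:
# import boto3
# resp = boto3.client("glue").get_column_statistics_for_table(
#     DatabaseName="sales_db", TableName="orders",
#     ColumnNames=["order_id", "region"])
```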

**Note**  
If you're using Lake Formation permissions to control access to the table, the role assumed by the column statistics task requires full table access to generate statistics.

 The following video demonstrates how to enhance query performance using column statistics. 

[![AWS Videos](http://img.youtube.com/vi/zUHEXJdHUxs/0.jpg)](https://www.youtube.com/watch?v=zUHEXJdHUxs)


**Topics**
+ [Prerequisites for generating column statistics](column-stats-prereqs.md)
+ [Automatic column statistics generation](auto-column-stats-generation.md)
+ [Generating column statistics on a schedule](generate-column-stats.md)
+ [Generating column statistics on demand](column-stats-on-demand.md)
+ [Viewing column statistics](view-column-stats.md)
+ [Viewing column statistics task runs](view-stats-run.md)
+ [Stopping column statistics task run](stop-stats-run.md)
+ [Deleting column statistics](delete-column-stats.md)
+ [Considerations and limitations](column-stats-notes.md)

# Prerequisites for generating column statistics
<a name="column-stats-prereqs"></a>

To generate or update column statistics, the statistics generation task assumes an AWS Identity and Access Management (IAM) role on your behalf. Based on the permissions granted to the role, the column statistics generation task can read the data from the Amazon S3 data store.

When you configure the column statistics generation task, AWS Glue can create a role for you that includes the `AWSGlueServiceRole` AWS managed policy plus the required inline policy for the specified data source. Alternatively, you can create a role yourself, attach the permissions listed in the following policies, and specify that role for the column statistics generation task.

If you specify an existing role for generating column statistics, ensure that it includes the `AWSGlueServiceRole` policy or equivalent (or a scoped-down version of this policy), plus the required inline policies.

**Note**  
To generate statistics for tables managed by Lake Formation, the IAM role used to generate statistics requires full table access.

**To create an IAM role for generating column statistics**

1. To create an IAM role, see [Create an IAM role for AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html).

1. To update an existing role, in the IAM console, go to the IAM role that is being used by the generate column statistics process.

1. In the **Add permissions** section, choose **Attach policies**. In the newly opened browser window, choose `AWSGlueServiceRole` AWS managed policy.

1. You also need to include permissions to read data from the Amazon S3 data location.

   In the **Add permissions** section, choose **Create policy**. In the newly opened browser window, create a new policy to use with your role.

1. On the **Create policy** page, choose the **JSON** tab. Copy the following JSON code into the policy editor field.
**Note**  
In the following policies, replace the account ID with a valid AWS account ID, replace `region` with the Region of the table, and replace the bucket name with your Amazon S3 bucket name.

------
#### [ JSON ]

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "S3BucketAccess",
               "Effect": "Allow",
               "Action": [
                   "s3:ListBucket",
                   "s3:GetObject"
               ],
               "Resource": [
                   "arn:aws:s3:::amzn-s3-demo-bucket/*",
                   "arn:aws:s3:::amzn-s3-demo-bucket"
               ]
           }
       ]
   }
   ```

------

1. (Optional) If you're using Lake Formation permissions to provide access to your data, the IAM role requires `lakeformation:GetDataAccess` permissions.

------
#### [ JSON ]

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "LakeFormationDataAccess",
         "Effect": "Allow",
         "Action": "lakeformation:GetDataAccess",
         "Resource": [
           "*"
         ]
       }
     ]
   }
   ```

------

   If the Amazon S3 data location is registered with Lake Formation, and the IAM role assumed by the column statistics generation task doesn't have `IAM_ALLOWED_PRINCIPALS` group permissions granted on the table, the role requires Lake Formation `ALTER` and `DESCRIBE` permissions on the table. The role used for registering the Amazon S3 bucket requires Lake Formation `INSERT` and `DELETE` permissions on the table.

   If the Amazon S3 data location is not registered with Lake Formation, and the IAM role doesn't have `IAM_ALLOWED_PRINCIPALS` group permissions granted on the table, the role requires Lake Formation `ALTER`, `DESCRIBE`, `INSERT` and `DELETE` permissions on the table. 

1. If you've enabled the catalog-level `Automatic statistics generation` option, the IAM role must have the `glue:UpdateCatalog` permission or the Lake Formation `ALTER CATALOG` permission on the default Data Catalog. You can use the `GetCatalog` operation to verify the catalog properties. 

1. (Optional) The column statistics generation task that writes encrypted Amazon CloudWatch Logs requires the following permissions in the key policy.

------
#### [ JSON ]

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "CWLogsKmsPermissions",
         "Effect": "Allow",
         "Action": [
           "logs:CreateLogGroup",
           "logs:CreateLogStream",
           "logs:PutLogEvents",
           "logs:AssociateKmsKey"
         ],
         "Resource": [
           "arn:aws:logs:us-east-1:111122223333:log-group:/aws-glue:*"
         ]
       },
       {
         "Sid": "KmsPermissions",
         "Effect": "Allow",
         "Action": [
           "kms:GenerateDataKey",
           "kms:Decrypt",
           "kms:Encrypt"
         ],
         "Resource": [
           "arn:aws:kms:us-east-1:111122223333:key/arn of key used for ETL cloudwatch encryption"
         ],
         "Condition": {
           "StringEquals": {
             "kms:ViaService": [
               "glue.us-east-1.amazonaws.com"
             ]
           }
         }
       }
     ]
   }
   ```

------

1. The IAM identity that you use to configure the column statistics task must have the `iam:PassRole` permission on the role that the task assumes.

------
#### [ JSON ]

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Action": [
           "iam:PassRole"
         ],
         "Resource": [
           "arn:aws:iam::111122223333:role/columnstats-role-name"
         ]
       }
     ]
   }
   ```

------

1. When you create an IAM role for generating column statistics, that role must also have the following trust policy that enables the service to assume the role. 

------
#### [ JSON ]

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "TrustPolicy",
         "Effect": "Allow",
         "Principal": {
           "Service": "glue.amazonaws.com"
         },
         "Action": "sts:AssumeRole"
       }
     ]
   }
   ```

------

# Automatic column statistics generation
<a name="auto-column-stats-generation"></a>

Automatic generation of column statistics allows you to schedule and automatically compute statistics on new tables in the AWS Glue Data Catalog. When you enable automatic statistics generation, the Data Catalog discovers new tables with specific data formats such as Parquet, JSON, CSV, XML, ORC, ION, and Apache Iceberg, along with their individual bucket paths. With a one-time catalog configuration, the Data Catalog generates statistics for these tables.

Data lake administrators can configure statistics generation by selecting the default catalog in the Lake Formation console and enabling table statistics using the **Optimization configuration** option. When you create new tables or update existing tables in the Data Catalog, the Data Catalog collects the number of distinct values (NDVs) for Apache Iceberg tables, and additional statistics such as the number of nulls, maximum, minimum, and average length for other supported file formats, on a weekly basis.

If you have configured statistics generation at the table-level or if you have previously deleted the statistics generation settings for a table, those table-specific settings take precedence over the default catalog settings for automatic column statistics generation.

The automatic statistics generation task analyzes a 50% sample of the records in each table to calculate statistics. Automatic column statistics generation ensures that the Data Catalog maintains weekly metrics that query engines such as Amazon Athena and Amazon Redshift Spectrum can use for improved query performance and potential cost savings. It allows scheduling statistics generation using AWS Glue APIs or the console, providing an automated process without manual intervention.

**Topics**
+ [Enabling catalog-level automatic statistics generation](enable-auto-column-stats-generation.md)
+ [Viewing automated table-level settings](view-auto-column-stats-settings.md)
+ [Disabling catalog-level column statistics generation](disable-auto-column-stats-generation.md)

# Enabling catalog-level automatic statistics generation
<a name="enable-auto-column-stats-generation"></a>

You can enable automatic column statistics generation for all new Apache Iceberg tables and tables in non-OTF formats (Parquet, JSON, CSV, XML, ORC, ION) in the Data Catalog. After a table is created, you can also explicitly update its column statistics settings manually.

To update the Data Catalog settings to enable catalog-level automatic statistics generation, the IAM role used must have the `glue:UpdateCatalog` permission or the AWS Lake Formation `ALTER CATALOG` permission on the root catalog. You can use the `GetCatalog` API operation to verify the catalog properties.

------
#### [ AWS Management Console ]

**To enable the automatic column statistics generation at the account-level**

1. Open the Lake Formation console at [https://console.aws.amazon.com/lakeformation/](https://console.aws.amazon.com/lakeformation/).

1. On the left navigation bar, choose **Catalogs**.

1. On the **Catalog summary** page, choose **Edit** under **Optimization configuration**.   
![\[The screenshot shows the options available to generate column stats.\]](http://docs.aws.amazon.com/glue/latest/dg/images/edit-column-stats-auto.png)

1. On the **Table optimization configuration** page, choose the **Enable automatic statistics generation for the tables of the catalog** option.  
![\[The screenshot shows the options available to generate column stats.\]](http://docs.aws.amazon.com/glue/latest/dg/images/edit-optimization-option.jpg)

1. Choose an existing IAM role or create a new one that has the necessary permissions to run the column statistics task.

1. Choose **Submit**.

------
#### [ AWS CLI ]

You can also enable catalog-level statistics collection through the AWS CLI. To configure catalog-level statistics collection using the AWS CLI, run the following command:

```
aws glue update-catalog --cli-input-json '{
    "name": "123456789012",
    "catalogInput": {
        "description": "Updating root catalog with role arn",
        "catalogProperties": {
            "customProperties": {
                "ColumnStatistics.RoleArn": "arn:aws:iam::123456789012:role/service-role/AWSGlueServiceRole",
                "ColumnStatistics.Enabled": "true"
            }
        }
    }
}'
```

The preceding command calls the AWS Glue `UpdateCatalog` operation, which takes a `CatalogProperties` structure with the following key-value pairs for catalog-level statistics generation:
+ `ColumnStatistics.RoleArn` – The ARN of the IAM role to use for all tasks triggered for catalog-level statistics generation.
+ `ColumnStatistics.Enabled` – A Boolean indicating whether the catalog-level setting is enabled or disabled.

------
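The same two custom properties can be built programmatically before the `UpdateCatalog` call. The following is a minimal sketch mirroring the CLI example above; the helper name is illustrative, and key names follow that example:

```
def column_stats_properties(role_arn, enabled=True):
    """Build the customProperties map for catalog-level statistics generation:
    the role the tasks assume, and the on/off flag (as a string)."""
    return {
        "ColumnStatistics.RoleArn": role_arn,
        "ColumnStatistics.Enabled": "true" if enabled else "false",
    }

# Embed the result under catalogInput.catalogProperties.customProperties in
# the update-catalog payload shown in the CLI example above.
```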

# Viewing automated table-level settings
<a name="view-auto-column-stats-settings"></a>

 When catalog-level statistics collection is enabled, anytime an Apache Hive table or Apache Iceberg table is created or updated via the `CreateTable` or `UpdateTable` APIs through AWS Management Console, SDK, or AWS Glue crawler, an equivalent table level setting is created for that table. 

 Tables with automatic statistics generation enabled must have one of the following properties:
+ Use an `InputSerdeLibrary` that begins with `org.apache.hadoop`, with `TableType` equal to `EXTERNAL_TABLE`
+ Use an `InputSerdeLibrary` that begins with `com.amazon.ion`, with `TableType` equal to `EXTERNAL_TABLE`
+ Contain `table_type: "ICEBERG"` in the table's parameters structure

After you create or update a table, you can check the table details to confirm statistics generation. The **Statistics generation summary** shows the **Schedule** property set to `AUTO` and the **Statistics configuration** value set to `Inherited from catalog`. Tables with these settings are triggered automatically by AWS Glue internally.

![\[An image of a Hive table with catalog-level statistics collection has been applied and statistics have been collected.\]](http://docs.aws.amazon.com/glue/latest/dg/images/auto-stats-summary.png)


# Disabling catalog-level column statistics generation
<a name="disable-auto-column-stats-generation"></a>

 You can disable automatic column statistics generation for new tables using the AWS Lake Formation console, the `glue:UpdateCatalogSettings` API, or the `glue:DeleteColumnStatisticsTaskSettings` API. 

**To disable the automatic column statistics generation at the account-level**

1. Open the Lake Formation console at [https://console.aws.amazon.com/lakeformation/](https://console.aws.amazon.com/lakeformation/).

1. On the left navigation bar, choose **Catalogs**.

1. On the **Catalog summary** page, choose **Edit** under **Optimization configuration**. 

1. On the **Table optimization configuration** page, unselect the **Enable automatic statistics generation for the tables of the catalog** option.

1. Choose **Submit**.

# Generating column statistics on a schedule
<a name="generate-column-stats"></a>

Follow these steps to configure a schedule for generating column statistics in the AWS Glue Data Catalog using the AWS Glue console, the AWS CLI, or the [CreateColumnStatisticsTaskSettings](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html#aws-glue-api-crawler-column-statistics-CreateColumnStatisticsTaskSettings) operation.

------
#### [ Console ]

**To generate column statistics using the console**

1. Sign in to the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). 

1. Choose **Tables** under **Data Catalog**.

1. Choose a table from the list. 

1. Choose the **Column statistics** tab in the lower section of the **Tables** page.

1. Choose **Generate on schedule** under **Column statistics** from the **Actions** menu.

1. On the **Generate statistics on schedule** page, configure a recurring schedule for running the column statistics task by choosing the frequency and start time. You can choose the frequency to be hourly, daily, weekly, or define a cron expression to specify the schedule.

   A cron expression is a string that represents a schedule pattern. It consists of six fields separated by spaces: `<minute> <hour> <day of month> <month> <day of week> <year>`. For example, to run a task every day at midnight, the cron expression would be: `0 0 * * ? *`

   For more information, see [Cron expressions](https://docs.aws.amazon.com/glue/latest/dg/monitor-data-warehouse-schedule.html#CronExpressions).  
![\[The screenshot shows the options available to generate column stats.\]](http://docs.aws.amazon.com/glue/latest/dg/images/generate-column-stats-schedule.png)

1. Next, choose the column option to generate statistics.
   + **All columns** – Choose this option to generate statistics for all columns in the table.
   + **Selected columns** – Choose this option to generate statistics for specific columns. You can select the columns from the drop-down list.

1. Choose an existing IAM role, or create a new role, that has permissions to generate statistics. AWS Glue assumes this role to generate column statistics.

   A quicker approach is to let the AWS Glue console create a role for you. The role that it creates is specifically for generating column statistics, and includes the `AWSGlueServiceRole` AWS managed policy plus the required inline policy for the specified data source. 

   If you specify an existing role for generating column statistics, ensure that it includes the `AWSGlueServiceRole` policy or an equivalent (or a scoped-down version of this policy), plus the required inline policies. 

1. (Optional) Next, choose a security configuration to enable at-rest encryption for logs.

1. (Optional) Choose a sample size to generate statistics from only a specific percentage of rows in the table. The default is all rows. Use the up and down arrows to increase or decrease the percentage value. 

   We recommend including all rows in the table to compute accurate statistics. Use sample rows to generate column statistics only when approximate values are acceptable.

1. Choose **Generate statistics** to run the column statistics generation task.
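The schedule you configure in the procedure above can be expressed with the 6-field cron format. The following is a minimal sketch with illustrative schedules (the specific times are examples, not recommendations); it also sanity-checks that each expression has exactly six fields:

```shell
# Hedged examples of the 6-field cron format used for statistics schedules.
# Fields: <minute> <hour> <day of month> <month> <day of week> <year>
set -f  # disable globbing so '*' and '?' stay literal during word splitting

DAILY_MIDNIGHT='0 0 * * ? *'       # every day at 00:00 UTC
WEEKDAYS_6AM='0 6 ? * MON-FRI *'   # 06:00 UTC on weekdays
MONTHLY='30 2 1 * ? *'             # 02:30 UTC on the 1st of each month

# Sanity-check that each expression has exactly six fields.
for expr in "$DAILY_MIDNIGHT" "$WEEKDAYS_6AM" "$MONTHLY"; do
  set -- $expr
  echo "$expr has $# fields"
done
```

Each line of output reports six fields; an expression with a different field count is rejected by the scheduler.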

------
#### [ AWS CLI ]

You can use the following AWS CLI example to create a column statistics generation schedule. The `--database-name`, `--table-name`, and `--role` parameters are required; `--schedule`, `--column-name-list`, `--catalog-id`, `--sample-size`, and `--security-configuration` are optional.

```
aws glue create-column-statistics-task-settings \
 --database-name 'database_name' \
 --table-name 'table_name' \
 --role 'arn:aws:iam::123456789012:role/stats-role' \
 --schedule 'cron(0 0-5 14 * * ?)' \
 --column-name-list 'col-1' \
 --catalog-id '123456789012' \
 --sample-size '10.0' \
 --security-configuration 'test-security'
```

You can also generate column statistics by calling the [StartColumnStatisticsTaskRun](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html#aws-glue-api-crawler-column-statistics-StartColumnStatisticsTaskRun) operation.

------

# Managing the schedule for column statistics generation
<a name="manage-column-stats-schedule"></a>

You can manage scheduling operations such as updating, starting, stopping, and deleting schedules for column statistics generation in AWS Glue. You can use the AWS Glue console, the AWS CLI, or the [AWS Glue column statistics API operations](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html) to perform these tasks.

**Topics**
+ [Updating the column statistics generation schedule](#update-column-stats-shedule)
+ [Stopping the schedule for column statistics generation](#stop-column-stats-schedule)
+ [Resuming the schedule for column statistics generation](#resume-column-stats-schedule)
+ [Deleting column statistics generation schedule](#delete-column-stats-schedule)

## Updating the column statistics generation schedule
<a name="update-column-stats-shedule"></a>

You can update the schedule that triggers the column statistics generation task after it has been created. You can use the AWS Glue console, the AWS CLI, or the [UpdateColumnStatisticsTaskSettings](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html#aws-glue-api-crawler-column-statistics-UpdateColumnStatisticsTaskSettings) operation to update the schedule for a table. You can modify the parameters of an existing schedule, such as the schedule type (on demand or scheduled) and other optional parameters. 

------
#### [ AWS Management Console ]

**To update the settings for a column statistics generation task**

1. Sign in to the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Choose the table that you want to update from the tables list.

1. In the lower section of the table details page, choose **Column statistics**. 

1. Under **Actions**, choose **Edit** to update the schedule.

1. Make the desired changes to the schedule, and choose **Save**.

------
#### [ AWS CLI ]

 If you are not using AWS Glue's statistics generation feature in the console, you can update the schedule using the `update-column-statistics-task-settings` command. The following example shows how to update the column statistics task settings using the AWS CLI. 

```
aws glue update-column-statistics-task-settings \
 --database-name 'database_name' \
 --table-name 'table_name' \
 --role 'arn:aws:iam::123456789012:role/stats_role' \
 --schedule 'cron(0 0-5 16 * * ?)' \
 --column-name-list 'col-1' \
 --sample-size '20.0' \
 --catalog-id '123456789012' \
 --security-configuration 'test-security'
```

------

## Stopping the schedule for column statistics generation
<a name="stop-column-stats-schedule"></a>

 If you no longer need regularly refreshed statistics, you can stop the scheduled generation to save resources and costs. Pausing the schedule doesn't affect previously generated statistics, and you can resume the schedule at your convenience. 

------
#### [ AWS Management Console ]

**To stop the schedule for a column statistics generation task**

1. On the AWS Glue console, choose **Tables** under **Data Catalog**.

1. Select a table with column statistics.

1. On the **Table details** page, choose **Column statistics**.

1. Under **Actions**, choose **Scheduled generation**, **Pause**.

1. Choose **Pause** to confirm.

------
#### [ AWS CLI ]

To stop a column statistics task run schedule using the AWS CLI, you can use the following command: 

```
aws glue stop-column-statistics-task-run-schedule \
 --database-name 'database_name' \
 --table-name 'table_name'
```

Replace `database_name` and `table_name` with the actual names of the database and table for which you want to stop the column statistics task run schedule.

------

## Resuming the schedule for column statistics generation
<a name="resume-column-stats-schedule"></a>

 If you've paused the statistics generation schedule, AWS Glue allows you to resume the schedule at your convenience. You can resume the schedule using the AWS Glue console, AWS CLI, or the [StartColumnStatisticsTaskRunSchedule](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html#aws-glue-api-crawler-column-statistics-StartColumnStatisticsTaskRunSchedule) operation. 

------
#### [ AWS Management Console ]

**To resume the schedule for column statistics generation**

1. On the AWS Glue console, choose **Tables** under **Data Catalog**.

1. Select a table with column statistics.

1. On the **Table details** page, choose **Column statistics**.

1. Under **Actions**, choose **Scheduled generation**, and choose **Resume**.

1. Choose **Resume** to confirm.

------
#### [ AWS CLI ]

Replace `database_name` and `table_name` with the actual names of the database and table for which you want to resume the column statistics task run schedule.

```
aws glue start-column-statistics-task-run-schedule \
 --database-name 'database_name' \
 --table-name 'table_name'
```

------

## Deleting column statistics generation schedule
<a name="delete-column-stats-schedule"></a>

 While maintaining up-to-date statistics is generally recommended for optimal query performance, there are specific use cases where removing the automatic generation schedule might be beneficial.
+ If the data remains relatively static, the existing column statistics may remain accurate for an extended period, reducing the need for frequent updates. Deleting the schedule can prevent unnecessary resource consumption and overhead associated with regenerating statistics on unchanging data.
+ When you prefer manual control over statistics generation. By deleting the automatic schedule, administrators can selectively update column statistics at specific intervals or after significant data changes, aligning the process with their maintenance strategies and resource allocation needs. 

------
#### [ AWS Management Console ]

**To delete the schedule for column statistics generation**

1. On the AWS Glue console, choose **Tables** under **Data Catalog**.

1. Select a table with column statistics.

1. On the **Table details** page, choose **Column statistics**.

1. Under **Actions**, choose **Scheduled generation**, **Delete**.

1. Choose **Delete** to confirm.

------
#### [ AWS CLI ]

You can delete a column statistics schedule using the [DeleteColumnStatisticsTaskSettings](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html#aws-glue-api-crawler-column-statistics-DeleteColumnStatisticsTaskSettings) API operation or the AWS CLI. In the following example, replace `database_name` and `table_name` with the actual names of the database and table for which you want to delete the schedule.

```
aws glue delete-column-statistics-task-settings \
    --database-name 'database_name' \
    --table-name 'table_name'
```

------

# Generating column statistics on demand
<a name="column-stats-on-demand"></a>

You can run the column statistics task for AWS Glue Data Catalog tables on demand, without a set schedule. This option is useful for ad hoc analysis or when statistics need to be computed immediately.

Follow these steps to generate column statistics on demand for Data Catalog tables using the AWS Glue console or the AWS CLI.

------
#### [ AWS Management Console ]

**To generate column statistics using the console**

1. Sign in to the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). 

1. Choose **Tables** under **Data Catalog**.

1.  Choose a table from the list. 

1. Choose **Generate statistics** under the **Actions** menu.

   You can also choose **Generate**, **Generate on demand** under the **Column statistics** tab in the lower section of the **Table** page.

1. Follow steps 7 - 11 in the [Generating column statistics on a schedule](generate-column-stats.md) to generate column statistics for the table.

1. On the **Generate statistics** page, specify the following options:   
![\[The screenshot shows the options available to generate column stats.\]](http://docs.aws.amazon.com/glue/latest/dg/images/generate-column-stats.png)
   + **All columns** – Choose this option to generate statistics for all columns in the table.
   + **Selected columns** – Choose this option to generate statistics for specific columns. You can select the columns from the drop-down list.
   + **IAM role** – Choose **Create a new IAM role** to have AWS Glue create a role with the required permission policies to run the column statistics generation task. Choose **View permission details** to review the policy statement. You can also select an existing IAM role from the list. For more information about the required permissions, see [Prerequisites for generating column statistics](column-stats-prereqs.md).

     AWS Glue assumes the permissions of the role that you specify to generate statistics. 

     For more information about providing roles for AWS Glue, see [Identity-based policies for AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/security_iam_service-with-iam.html#security_iam_service-with-iam-id-based-policies).
   + (Optional) Next, choose a security configuration to enable at-rest encryption for logs.
   + **Sample rows** – Choose only a specific percent of rows from the table to generate statistics. The default is all rows. Use the up and down arrows to increase or decrease the percent value.
**Note**  
We recommend including all rows in the table to compute accurate statistics. Use sample rows to generate column statistics only when approximate values are acceptable.

   Choose **Generate statistics** to run the task.

------
#### [ AWS CLI ]

This command triggers a column statistics task run for the specified table. You need to provide the database name, the table name, and an IAM role with permissions to generate statistics. Optionally, you can provide column names and a sample size percentage for the statistics computation.

```
aws glue start-column-statistics-task-run \
    --database-name 'database_name' \
    --table-name 'table_name' \
    --role 'arn:aws:iam::123456789012:role/stats-role' \
    --column-name 'col1','col2' \
    --sample-size 10.0
```

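A task run started this way is asynchronous. The following is a hedged sketch of waiting for completion; the status names (`SUCCEEDED`, `FAILED`, `STOPPED`, `RUNNING`) and the `ColumnStatisticsTaskRunId` and `Status` output fields are assumptions to verify against the column statistics API reference before relying on them:

```shell
# Hedged sketch: decide when to stop polling an asynchronous statistics run.
# The status names below are assumptions based on typical AWS Glue task
# states; confirm them in the GetColumnStatisticsTaskRun API reference.
is_terminal() {
  case "$1" in
    SUCCEEDED|FAILED|STOPPED) return 0 ;;
    *) return 1 ;;
  esac
}

# Against a real account, a polling loop might look like this (commented out;
# field names and parameters are assumptions):
# RUN_ID=$(aws glue start-column-statistics-task-run \
#   --database-name 'database_name' --table-name 'table_name' \
#   --role 'arn:aws:iam::123456789012:role/stats-role' \
#   --query 'ColumnStatisticsTaskRunId' --output text)
# STATUS=RUNNING
# until is_terminal "$STATUS"; do
#   sleep 30
#   STATUS=$(aws glue get-column-statistics-task-run \
#     --column-statistics-task-run-id "$RUN_ID" \
#     --query 'ColumnStatisticsTaskRun.Status' --output text)
# done

is_terminal RUNNING && echo "RUNNING is terminal" || echo "RUNNING: keep polling"
is_terminal SUCCEEDED && echo "SUCCEEDED is terminal"
```

The helper keeps the loop from exiting on transient, non-terminal states.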

------

## Updating column statistics on demand
<a name="update-column-stats-on-demand"></a>

 Maintaining up-to-date column statistics is crucial for the query optimizer to generate efficient execution plans, ensuring improved query performance, reduced resource consumption, and better overall system performance. This process is particularly important after significant data changes, such as bulk loads or extensive modifications, which can render existing statistics obsolete. 

You need to explicitly run the **Generate statistics** task from the AWS Glue console to refresh the column statistics. The Data Catalog doesn't refresh the statistics automatically.

If you are not using AWS Glue's statistics generation feature in the console, you can manually update column statistics using the [UpdateColumnStatisticsForTable](https://docs.aws.amazon.com/glue/latest/webapi/API_UpdateColumnStatisticsForTable.html) API operation or AWS CLI. The following example shows how to update column statistics using AWS CLI.

```
aws glue update-column-statistics-for-table --cli-input-json file://input.json
```

The file `input.json` contains the statistics to write:

```
{
    "CatalogId": "111122223333",
    "DatabaseName": "database_name",
    "TableName": "table_name",
    "ColumnStatisticsList": [
        {
            "ColumnName": "col1",
            "ColumnType": "Boolean",
            "AnalyzedTime": "1970-01-01T00:00:00",
            "StatisticsData": {
                "Type": "BOOLEAN",
                "BooleanColumnStatisticsData": {
                    "NumberOfTrues": 5,
                    "NumberOfFalses": 5,
                    "NumberOfNulls": 0
                }
            }
        }
    ]
}
```

# Viewing column statistics
<a name="view-column-stats"></a>

After you generate statistics successfully, the Data Catalog stores this information so that the cost-based optimizers in Amazon Athena and Amazon Redshift can make optimal choices when running queries. The statistics vary based on the type of the column.

------
#### [ AWS Management Console ]

**To view column statistics for a table**
+ After running a column statistics task, the **Column statistics** tab on the **Table details** page shows the statistics for the table.   
![\[The screenshot shows columns generated from the most recent run.\]](http://docs.aws.amazon.com/glue/latest/dg/images/view-column-stats.png)

  The following statistics are available:
  + Column name: The column name used to generate statistics.
  + Last updated: The date and time when the statistics were generated.
  + Average length: The average length of values in the column.
  + Distinct values: The total number of distinct values in the column. AWS Glue estimates the number of distinct values in a column with 5% relative error.
  + Max value: The largest value in the column.
  + Min value: The smallest value in the column.
  + Max length: The length of the longest value in the column.
  + Null values: The total number of null values in the column.
  + True values: The total number of true values in the column.
  + False values: The total number of false values in the column.
  + numFiles: The total number of files in the table. This value is available under the **Advanced properties** tab.

------
#### [ AWS CLI ]

The following example shows how to retrieve column statistics using AWS CLI.

```
aws glue get-column-statistics-for-table \
    --database-name database_name \
    --table-name table_name \
    --column-names <column_name>
```

 You can also view the column statistics using the [GetColumnStatisticsForTable](https://docs.aws.amazon.com/glue/latest/webapi/API_GetColumnStatisticsForTable.html) API operation. 
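To pull a single number out of the response, you can filter the returned JSON. The following is a hedged sketch against a hypothetical sample response (trimmed to the fields used here; the shape follows the `ColumnStatisticsList` structure returned by the CLI, but verify field names against the API reference):

```shell
# Hypothetical sample response, trimmed for illustration.
RESPONSE='{"ColumnStatisticsList":[{"ColumnName":"col1","StatisticsData":{"Type":"LONG","LongColumnStatisticsData":{"NumberOfDistinctValues":42,"NumberOfNulls":0}}}]}'

# With the CLI you could filter server-side output using JMESPath, e.g.:
#   aws glue get-column-statistics-for-table ... \
#     --query 'ColumnStatisticsList[0].StatisticsData'
# Locally, python3 can extract the same field from the sample:
echo "$RESPONSE" | python3 -c '
import json, sys
doc = json.load(sys.stdin)
stats = doc["ColumnStatisticsList"][0]["StatisticsData"]
print(stats["LongColumnStatisticsData"]["NumberOfDistinctValues"])
'
```

This prints the distinct-value count (`42` in the sample), which is the statistic the query optimizers rely on most heavily.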

------

# Viewing column statistics task runs
<a name="view-stats-run"></a>

After you run a column statistics task, you can explore the task run details for a table using the AWS Glue console, the AWS CLI, or the [GetColumnStatisticsTaskRuns](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html#aws-glue-api-crawler-column-statistics-GetColumnStatisticsTaskRun) operation.

------
#### [ Console ]

**To view column statistics task run details**

1. On the AWS Glue console, choose **Tables** under **Data Catalog**.

1. Select a table with column statistics.

1. On the **Table details** page, choose **Column statistics**.

1. Choose **View runs**.

   You can see information about all runs associated with the specified table.  
![\[The screenshot shows the options available to generate column stats.\]](http://docs.aws.amazon.com/glue/latest/dg/images/view-column-stats-task-runs.png)

------
#### [ AWS CLI ]

In the following example, replace the values for `DatabaseName` and `TableName` with the actual database and table names.

```
aws glue get-column-statistics-task-runs --cli-input-json file://input.json
```

The file `input.json` contains:

```
{
    "DatabaseName": "database_name",
    "TableName": "table_name"
}
```
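The `file://` argument reads the request JSON from disk. A minimal sketch of preparing and checking that file before calling the service (the database and table names are placeholders):

```shell
# Write the request JSON that the --cli-input-json file:// argument reads.
cat > input.json <<'EOF'
{
    "DatabaseName": "database_name",
    "TableName": "table_name"
}
EOF

# Confirm the file parses as JSON before sending it to the service.
python3 -m json.tool input.json > /dev/null && echo "input.json is valid"

# Then, against a real account:
# aws glue get-column-statistics-task-runs --cli-input-json file://input.json
```

Validating locally first surfaces quoting or comma mistakes before the CLI reports a parse error.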

------

# Stopping column statistics task run
<a name="stop-stats-run"></a>

You can stop a column statistics task run for a table using AWS Glue console, AWS CLI or using [StopColumnStatisticsTaskRun](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html#aws-glue-api-crawler-column-statistics-StopColumnStatisticsTaskRun) operation.

------
#### [ Console ]

**To stop a column statistics task run**

1. On the AWS Glue console, choose **Tables** under **Data Catalog**.

1. Select the table for which a column statistics task run is in progress.

1. On the **Table details** page, choose **Column statistics**.

1. Choose **Stop**.

   If you stop the task before the run is complete, column statistics won't be generated for the table.

------
#### [ AWS CLI ]

In the following example, replace the values for `DatabaseName` and `TableName` with the actual database and table names.

```
aws glue stop-column-statistics-task-run --cli-input-json file://input.json
```

The file `input.json` contains:

```
{
    "DatabaseName": "database_name",
    "TableName": "table_name"
}
```

------

# Deleting column statistics
<a name="delete-column-stats"></a>

You can delete column statistics using the [DeleteColumnStatisticsForTable](https://docs.aws.amazon.com/glue/latest/webapi/API_DeleteColumnStatisticsForTable.html) API operation or AWS CLI. The following example shows how to delete column statistics using AWS Command Line Interface (AWS CLI).

```
aws glue delete-column-statistics-for-table \
    --database-name 'database_name' \
    --table-name 'table_name' \
    --column-name 'column_name'
```

# Considerations and limitations
<a name="column-stats-notes"></a>

The following considerations and limitations apply to generating column statistics.

**Considerations**
+ Using sampling to generate statistics reduces run time, but can generate inaccurate statistics.
+ Data Catalog doesn't store different versions of the statistics.
+ You can only run one statistics generation task at a time per table.
+ If a table is encrypted using a customer managed AWS KMS key registered with the Data Catalog, AWS Glue uses the same key to encrypt statistics.

**Column statistics task supports generating statistics:**
+ When the IAM role has full table permissions (IAM or Lake Formation).
+ When the IAM role has permissions on the table using Lake Formation hybrid access mode.

**Column statistics task doesn’t support generating statistics for:**
+ Tables with Lake Formation cell-based access control
+ Transactional data lakes - Linux Foundation Delta Lake, Apache Hudi
+ Tables in federated databases - Hive metastore, Amazon Redshift datashares
+ Nested columns, arrays, and struct data types
+ Tables that are shared with you from another account

# Encrypting your Data Catalog
<a name="catalog-encryption"></a>

 You can protect your metadata stored in the AWS Glue Data Catalog at rest using encryption keys managed by AWS Key Management Service (AWS KMS). You can enable encryption for a new Data Catalog by using the **Data Catalog settings**, and you can enable or disable encryption for an existing Data Catalog as needed. When encryption is enabled, AWS Glue encrypts all new metadata written to the catalog, while existing metadata remains unencrypted. 

For detailed information about encrypting your Data Catalog, see [Encrypting your Data Catalog](encrypt-glue-data-catalog.md).

# Securing your Data Catalog using Lake Formation
<a name="secure-catalog"></a>

 AWS Lake Formation is a service that makes it easier to set up a secure data lake in AWS. It provides a central place to create and securely manage your data lakes by defining fine-grained access control permissions. Lake Formation uses the Data Catalog to store and retrieve metadata about your data lake, such as table definitions, schema information, and data access control settings.

You can register the Amazon S3 data location of a metadata table or database with Lake Formation and use it to define metadata-level permissions on Data Catalog resources. You can also use Lake Formation to manage storage access permissions on the underlying data stored in Amazon S3 on behalf of integrated analytics engines.

For more information, see [What is AWS Lake Formation?](https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html) in the AWS Lake Formation Developer Guide.

# Working with AWS Glue Data Catalog views in AWS Glue
<a name="catalog-views"></a>

 You can create and manage views in the AWS Glue Data Catalog, commonly known as AWS Glue Data Catalog views. These views are useful because they support multiple SQL query engines, allowing you to access the same view across different AWS services, such as Amazon Athena, Amazon Redshift, and AWS Glue. You can use views based on Apache Iceberg, Apache Hudi, and Delta Lake. 

 By creating a view in the Data Catalog, you can use resource grants and tag-based access controls in AWS Lake Formation to grant access to it. Using this method of access control, you don't have to configure additional access to the tables referenced when creating the view. This method of granting permissions is called definer semantics, and these views are called definer views. For more information about access control in AWS Lake Formation, see [ Granting and revoking permissions on Data Catalog resources](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-catalog-permissions.html) in the AWS Lake Formation Developer Guide. 

 Data Catalog views are useful for the following use cases: 
+  **Granular access control** – You can create a view that restricts data access based on the permissions the user needs. For example, you can use views in the Data Catalog to prevent employees who don't work in the HR department from seeing personally identifiable information (PII). 
+  **Complete view definition** – By applying filters on your view in the Data Catalog, you ensure that data records available in the view are always complete. 
+  **Enhanced security** – The query definition used to create the view must be complete, making Data Catalog views less susceptible to SQL commands from malicious actors. 
+  **Simple data sharing** – Share data with other AWS accounts without moving data, using cross-account data sharing in AWS Lake Formation. 

## Creating a Data Catalog view
<a name="catalog-creating-view"></a>

 You can create Data Catalog views using the AWS CLI and AWS Glue ETL scripts using Spark SQL. The syntax for creating a Data Catalog view includes specifying the view type as `MULTI DIALECT` and the `SECURITY` predicate as `DEFINER`, indicating a definer view. 

 Example SQL statement to create a Data Catalog view: 

```
CREATE PROTECTED MULTI DIALECT VIEW database_name.catalog_view SECURITY DEFINER
AS SELECT order_date, sum(totalprice) AS price
FROM source_table
GROUP BY order_date;
```

 After creating a Data Catalog view, you can use an IAM role with the AWS Lake Formation `SELECT` permission on the view to query it from services like Amazon Athena, Amazon Redshift, or AWS Glue ETL jobs. You don't need to grant access to the underlying tables referenced in the view. 

 For more information on creating and configuring Data Catalog views, see [Building AWS Glue Data Catalog views](https://docs.aws.amazon.com/lake-formation/latest/dg/working-with-views.html) in the AWS Lake Formation Developer Guide. 

## Supported view operations
<a name="catalog-supported-view-operations"></a>

 The following command fragments show you various ways to work with Data Catalog views: 

 **CREATE VIEW** 

 Creates a data-catalog view. The following is a sample that shows creating a view from an existing table: 

```
CREATE PROTECTED MULTI DIALECT VIEW catalog_view 
SECURITY DEFINER AS SELECT * FROM my_catalog.my_database.source_table
```

 **ALTER VIEW** 

 Available syntax: 

```
ALTER VIEW view_name [FORCE] ADD DIALECT AS query
ALTER VIEW view_name [FORCE] UPDATE DIALECT AS query
ALTER VIEW view_name DROP DIALECT
```

 You can use the `FORCE ADD DIALECT` option to force update the schema and sub objects as per the new engine dialect. Note that doing this can result in query errors if you don't also use `FORCE` to update other engine dialects. The following shows a sample: 

```
ALTER VIEW catalog_view FORCE ADD DIALECT AS
SELECT order_date, sum(totalprice) AS price
FROM source_table
GROUP BY order_date;
```

 The following shows how to alter a view in order to update the dialect: 

```
ALTER VIEW catalog_view UPDATE DIALECT AS
SELECT count(*) FROM my_catalog.my_database.source_table;
```

 **DESCRIBE VIEW** 

 Available syntax for describing a view: 

 `SHOW COLUMNS {FROM|IN} view_name [{FROM|IN} database_name]` – If the user has the required AWS Glue and AWS Lake Formation permissions to describe the view, they can list the columns. The following shows a couple sample commands for showing columns: 

```
SHOW COLUMNS FROM my_database.catalog_view;
SHOW COLUMNS IN my_database.catalog_view;
```

 `DESCRIBE view_name` – If the user has the required AWS Glue and AWS Lake Formation permissions to describe the view, they can list the columns in the view along with its metadata. 

 **DROP VIEW** 

 Available syntax: 

```
DROP VIEW [ IF EXISTS ] view_name
```

 The following sample shows a `DROP` statement that tests if a view exists prior to dropping it: 

```
DROP VIEW IF EXISTS catalog_view;
```

 `SHOW CREATE VIEW view_name` – Shows the SQL statement that creates the specified view. The following is a sample that shows creating a data-catalog view: 

```
SHOW CREATE VIEW my_database.catalog_view;

CREATE PROTECTED MULTI DIALECT VIEW my_catalog.my_database.catalog_view (
  net_profit,
  customer_id,
  item_id,
  sold_date)
TBLPROPERTIES (
  'transient_lastDdlTime' = '1736267222')
SECURITY DEFINER AS SELECT * FROM
my_database.store_sales_partitioned_lf WHERE customer_id IN (SELECT customer_id from source_table limit 10)
```

 **SHOW VIEWS** 

 Lists all views in the catalog, including regular views, multi-dialect views (MDVs), and MDVs without a Spark dialect. The available syntax is the following: 

```
SHOW VIEWS [{ FROM | IN } database_name] [LIKE regex_pattern]
```

 The following shows a sample command to show views: 

```
SHOW VIEWS IN marketing_analytics LIKE 'catalog_view*';
```

 For more information about creating and configuring data-catalog views, see [ Building AWS Glue Data Catalog views ](https://docs.aws.amazon.com/lake-formation/latest/dg/working-with-views.html) in the AWS Lake Formation Developer Guide. 

## Querying a Data Catalog view
<a name="catalog-view-query"></a>

 After creating a Data Catalog view, you can query the view. The IAM role configured in your AWS Glue jobs must have the Lake Formation **SELECT** permission on the Data Catalog view. You don't need to grant access to the underlying tables referenced in the view. 

 Once you have everything set up, you can query your view. For example, you can run the following query to access a view. 

```
SELECT * from my_database.catalog_view LIMIT 10;
```

## Limitations
<a name="catalog-view-limitations"></a>

 Consider the following limitations when you use Data Catalog views. 
+  You can only create Data Catalog views with AWS Glue 5.0 and above. 
+  The Data Catalog view definer must have `SELECT` access to the underlying base tables accessed by the view. Creating the Data Catalog view fails if a specific base table has any Lake Formation filters imposed on the definer role. 
+  Base tables must not have the `IAMAllowedPrincipals` data lake permission in AWS Lake Formation. If present, the error **Multi Dialect views may only reference tables without IAMAllowedPrincipals permissions** occurs. 
+  The table's Amazon S3 location must be registered as an AWS Lake Formation data lake location. If the table isn't registered, the error `Multi Dialect views may only reference AWS Lake Formation managed tables` occurs. For information about how to register Amazon S3 locations in AWS Lake Formation, see [Registering an Amazon S3 location](https://docs.aws.amazon.com/lake-formation/latest/dg/register-data-lake.html) in the AWS Lake Formation Developer Guide. 
+  You can only create `PROTECTED` Data Catalog views. `UNPROTECTED` views aren't supported. 
+  You can't reference tables in another AWS account in a Data Catalog view definition. You also can't reference a table in the same account that's in a separate region. 
+  To share data across an account or region, the entire view must be shared cross account and cross region, using AWS Lake Formation resource links. 
+  User-defined functions (UDFs) aren't supported. 
+  You can't reference other views in Data Catalog views. 