

# Data discovery and cataloging in AWS Glue
<a name="catalog-and-crawler"></a>

The AWS Glue Data Catalog is a centralized repository that stores metadata about your organization's data sets. It acts as an index to the location, schema, and runtime metrics of your data sources. The metadata is stored in metadata tables, where each table represents a single data store. 

You can populate the Data Catalog using a crawler, which automatically scans your data sources and extracts metadata. A crawler can connect to data sources that are internal (AWS-based) and external to AWS. 

For more information about the supported data sources, see [Supported data sources for crawling](crawler-data-stores.md).

You can also create tables in the Data Catalog manually by defining the table structure, schema, and partitioning structure according to your specific requirements.

For more information about creating metadata tables manually, see [Defining metadata manually](populate-dg-manual.md).

You can use the information in the Data Catalog to create and monitor your ETL jobs. The Data Catalog also integrates with other AWS analytics services, providing a unified view of your data sources that makes them easier to manage and analyze.
+ Amazon Athena – Query your Amazon S3 data using SQL, with table metadata stored in the Data Catalog.
+ AWS Lake Formation – Centrally define and manage fine-grained data access policies and audit data access.
+ Amazon EMR – Access data sources defined in the Data Catalog for big data processing.
+ Amazon SageMaker AI – Quickly and confidently build, train, and deploy machine learning models.

**Key features of the Data Catalog**

The following are the key aspects of the Data Catalog. 

Metadata repository  
 The Data Catalog acts as a central metadata repository, storing information about the location, schema, and properties of your data sources. This metadata is organized into databases and tables, similar to a traditional relational database catalog. 

Automatic data discoverability  
 AWS Glue crawlers can automatically discover and catalog new or updated data sources, reducing the overhead of manual metadata management and ensuring that your Data Catalog remains up-to-date. By cataloging your data sources, the Data Catalog makes it easier for users and applications to discover and understand the available data assets within your organization, promoting data reuse and collaboration.  
The Data Catalog supports a wide range of data sources, including Amazon S3, Amazon RDS, Amazon Redshift, Apache Hive, and more. It can automatically infer and store metadata from these sources using AWS Glue crawlers.   
For more information, see [Using crawlers to populate the Data Catalog](add-crawler.md).

Schema management  
The Data Catalog automatically captures and manages the schema of your data sources, including schema inference, evolution, and versioning. You can update your schema and partitions in the Data Catalog using AWS Glue ETL jobs. 

Table optimization  
To improve read performance for AWS analytics services such as Amazon Athena and Amazon EMR, and for AWS Glue ETL jobs, the Data Catalog provides managed compaction (a process that compacts small Amazon S3 objects into larger objects) for Iceberg tables. You can use the AWS Glue console, AWS Lake Formation console, AWS CLI, or AWS API to enable or disable compaction for individual Iceberg tables in the Data Catalog.  
For more information, see [Optimizing Iceberg tables](table-optimizers.md).

Column statistics  
 You can compute column-level statistics for Data Catalog tables in data formats such as Parquet, ORC, JSON, ION, CSV, and XML without setting up additional data pipelines. Column statistics help you to understand data profiles by getting insights about values within a column. The Data Catalog supports generating statistics for column values such as minimum value, maximum value, total null values, total distinct values, average length of values, and total occurrences of true values.   
For more information, see [Optimizing query performance using column statistics](column-statistics.md).
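
As an illustration, the statistics listed above can be computed for a single column in a few lines of Python. This is a conceptual sketch of what each statistic means, not how the Data Catalog actually computes them:

```python
def column_statistics(values):
    """Sketch of the column statistics listed above; None represents a null."""
    non_null = [v for v in values if v is not None]
    return {
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        # Average length of the value's string representation.
        "avg_length": (sum(len(str(v)) for v in non_null) / len(non_null))
                      if non_null else 0.0,
        # Total occurrences of true values (meaningful for boolean columns).
        "true_count": sum(1 for v in non_null if v is True),
    }
```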

Data lineage  
The Data Catalog maintains a record of the transformations and operations performed on your data, providing data lineage information. This lineage information is valuable for auditing, compliance, and understanding the data's provenance.

Integration with other AWS services  
The Data Catalog seamlessly integrates with other AWS services, such as AWS Lake Formation, Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR. This integration allows you to query and analyze data across various data stores using a single, consistent metadata layer.

Security and access control  
AWS Glue integrates with AWS Lake Formation to support fine-grained access control for Data Catalog resources, allowing you to manage permissions and secure access to your data assets based on your organization's policies and requirements. AWS Glue integrates with AWS Key Management Service (AWS KMS) to encrypt metadata that's stored in the Data Catalog. 

Materialized views   
The Data Catalog supports Apache Iceberg materialized views, which are managed tables that store precomputed results of SQL queries and automatically refresh as underlying source data changes. Materialized views simplify data transformation pipelines and accelerate query performance by eliminating redundant computation.  
You can create materialized views using Apache Spark SQL in AWS Glue version 5.1 and later, Amazon EMR release 7.12.0 and later, and Amazon Athena. The Data Catalog automatically monitors source Apache Iceberg tables and refreshes materialized views using managed compute infrastructure. Spark engines across AWS Glue, Amazon EMR, and Amazon Athena can automatically rewrite queries to use materialized views when they provide better performance.  
Materialized views are stored as Apache Iceberg tables in Amazon S3 Tables buckets or Amazon S3 general purpose buckets within your account, making them accessible from multiple query engines. The Data Catalog manages all aspects of materialized view lifecycle, including automatic refresh scheduling, incremental updates, and metadata management.  
For more information, see Using materialized views with AWS Glue and Using materialized views with Amazon EMR.

**Topics**
+ [Populating the AWS Glue Data Catalog](populate-catalog-methods.md)
+ [Populating and managing transactional tables](populate-otf.md)
+ [Managing the Data Catalog](manage-catalog.md)
+ [Accessing the Data Catalog](access_catalog.md)
+ [AWS Glue Data Catalog best practices](best-practice-catalog.md)
+ [Monitoring Data Catalog usage metrics in Amazon CloudWatch](data-catalog-cloudwatch-metrics.md)
+ [AWS Glue Schema registry](schema-registry.md)

# Populating the AWS Glue Data Catalog
<a name="populate-catalog-methods"></a>

You can populate the AWS Glue Data Catalog using the following methods:
+ AWS Glue crawler – An AWS Glue crawler can automatically discover and catalog data sources like databases, data lakes, and streaming data. The crawlers are the most common and recommended method to populate the Data Catalog as they can automatically discover and infer metadata for a wide variety of data sources.
+  Manually adding metadata – You can manually define databases, tables, and connection details and add them to the Data Catalog using the AWS Glue console, Lake Formation console, AWS CLI, or AWS Glue APIs. Manual entry is useful when you want to catalog data sources that cannot be crawled. 
+ Integrating with other AWS services – You can populate the Data Catalog with metadata from services like AWS Lake Formation and Amazon Athena. These services can discover and register data sources in the Data Catalog. 
+  Populating from an existing metadata repository – If you have an existing metadata store like Apache Hive Metastore, you can use AWS Glue to import that metadata into the Data Catalog. For more information, see [Migration between the Hive Metastore and the AWS Glue Data Catalog](https://github.com/aws-samples/aws-glue-samples/tree/master/utilities/Hive_metastore_migration) on GitHub.

**Topics**
+ [Using crawlers to populate the Data Catalog](add-crawler.md)
+ [Defining metadata manually](populate-dg-manual.md)
+ [Integrating with Amazon S3 Tables](glue-federation-s3tables.md)
+ [Integrating with other AWS services](populate-dc-other-services.md)
+ [Data Catalog settings](console-data-catalog-settings.md)

# Using crawlers to populate the Data Catalog
<a name="add-crawler"></a>

You can use an AWS Glue crawler to populate the AWS Glue Data Catalog with databases and tables. This is the primary method used by most AWS Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. Extract, transform, and load (ETL) jobs that you define in AWS Glue use these Data Catalog tables as sources and targets. The ETL job reads from and writes to the data stores that are specified in the source and target Data Catalog tables.

## Workflow
<a name="crawler-workflow"></a>

The following workflow diagram shows how AWS Glue crawlers interact with data stores and other elements to populate the Data Catalog.

![Workflow showing how AWS Glue crawler populates the Data Catalog in 5 basic steps.](http://docs.aws.amazon.com/glue/latest/dg/images/PopulateCatalog-overview.png)


The following is the general workflow for how a crawler populates the AWS Glue Data Catalog:

1. A crawler runs any custom *classifiers* that you choose to infer the format and schema of your data. You provide the code for custom classifiers, and they run in the order that you specify.

   The first custom classifier to successfully recognize the structure of your data is used to create a schema. Custom classifiers lower in the list are skipped.

1. If no custom classifier matches your data's schema, built-in classifiers try to recognize your data's schema. An example of a built-in classifier is one that recognizes JSON.

1. The crawler connects to the data store. Some data stores require connection properties for crawler access.

1. The inferred schema is created for your data.

1. The crawler writes metadata to the Data Catalog. A table definition contains metadata about the data in your data store. The table is written to a database, which is a container of tables in the Data Catalog. Attributes of a table include classification, which is a label created by the classifier that inferred the table schema.
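
Steps 1 and 2 of this workflow can be sketched as follows. Here `custom_classifiers` and `builtin_classifiers` are hypothetical callables that return a schema when they recognize the data and `None` otherwise; this is an illustration of the ordering, not the AWS Glue API:

```python
def infer_schema(data, custom_classifiers, builtin_classifiers):
    """Sketch of the classifier ordering in steps 1-2 above."""
    # Step 1: custom classifiers run in the order you specify; the first
    # one to recognize the data is used, and the rest are skipped.
    for classifier in custom_classifiers:
        schema = classifier(data)
        if schema is not None:
            return schema
    # Step 2: if no custom classifier matches, the built-in classifiers
    # (for example, JSON, CSV, Apache Avro) try to recognize the data.
    for classifier in builtin_classifiers:
        schema = classifier(data)
        if schema is not None:
            return schema
    return None
```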

**Topics**
+ [Workflow](#crawler-workflow)
+ [How crawlers work](#crawler-running)
+ [How does a crawler determine when to create partitions?](#crawler-s3-folder-table-partition)
+ [Supported data sources for crawling](crawler-data-stores.md)
+ [Crawler prerequisites](crawler-prereqs.md)
+ [Defining and managing classifiers](add-classifier.md)
+ [Configuring a crawler](define-crawler.md)
+ [Scheduling a crawler](schedule-crawler.md)
+ [Viewing crawler results and details](console-crawlers-details.md)
+ [Customizing crawler behavior](crawler-configuration.md)
+ [Tutorial: Adding an AWS Glue crawler](tutorial-add-crawler.md)

## How crawlers work
<a name="crawler-running"></a>

When a crawler runs, it takes the following actions to interrogate a data store:
+ **Classifies data to determine the format, schema, and associated properties of the raw data** – You can configure the results of classification by creating a custom classifier.
+ **Groups data into tables or partitions** – Data is grouped based on crawler heuristics.
+ **Writes metadata to the Data Catalog** – You can configure how the crawler adds, updates, and deletes tables and partitions.

When you define a crawler, you choose one or more classifiers that evaluate the format of your data to infer a schema. When the crawler runs, the first classifier in your list to successfully recognize your data store is used to create a schema for your table. You can use built-in classifiers or define your own. You define your custom classifiers in a separate operation, before you define the crawlers. AWS Glue provides built-in classifiers to infer schemas from common files with formats that include JSON, CSV, and Apache Avro. For the current list of built-in classifiers in AWS Glue, see [Built-in classifiers](add-classifier.md#classifier-built-in). 

The metadata tables that a crawler creates are contained in the database that you specify when you define the crawler. If your crawler does not specify a database, your tables are placed in the default database. In addition, each table has a classification column that is filled in by the classifier that first successfully recognized the data store.

If the file that is crawled is compressed, the crawler must download it to process it. When a crawler runs, it interrogates files to determine their format and compression type and writes these properties into the Data Catalog. Some file formats (for example, Apache Parquet) enable you to compress parts of the file as it is written. For these files, the compressed data is an internal component of the file, and AWS Glue does not populate the `compressionType` property when it writes tables into the Data Catalog. In contrast, if an *entire file* is compressed by a compression algorithm (for example, gzip), then the `compressionType` property is populated when tables are written into the Data Catalog. 
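
The distinction above can be illustrated with a magic-number check on the first bytes of a file. This is an assumed heuristic for illustration only; AWS Glue's actual detection logic is not documented here:

```python
def infer_compression_type(first_bytes):
    """Sketch: whole-file compression is detectable from magic numbers and is
    recorded in compressionType; internally compressed formats (for example,
    Parquet) return None, so the property is not populated."""
    magic = {b"\x1f\x8b": "gzip", b"BZh": "bzip2", b"PK": "zip"}
    for prefix, name in magic.items():
        if first_bytes.startswith(prefix):
            return name
    # Not whole-file compressed (for example, Parquet with internal
    # compression): compressionType stays unset.
    return None
```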

The crawler generates the names for the tables that it creates. The names of the tables that are stored in the AWS Glue Data Catalog follow these rules:
+ Only alphanumeric characters and underscore (`_`) are allowed.
+ Any custom prefix cannot be longer than 64 characters.
+ The maximum length of the name cannot be longer than 128 characters. The crawler truncates generated names to fit within the limit.
+ If duplicate table names are encountered, the crawler adds a hash string suffix to the name.
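
The naming rules above can be sketched as a small Python helper. This is a hypothetical illustration, not the crawler's actual implementation (the real hash suffix algorithm is not documented):

```python
import hashlib
import re

MAX_NAME_LENGTH = 128

def generate_table_name(raw_name, existing_names, prefix=""):
    """Illustrative sketch of the documented table-naming rules."""
    # Only alphanumeric characters and underscore are allowed
    # (Data Catalog names are stored in lowercase).
    name = re.sub(r"[^A-Za-z0-9_]", "_", prefix + raw_name).lower()
    # Generated names are truncated to fit within the 128-character limit.
    name = name[:MAX_NAME_LENGTH]
    # On a duplicate, a hash string suffix is added to the name.
    if name in existing_names:
        suffix = hashlib.sha256(name.encode()).hexdigest()[:8]
        name = name[:MAX_NAME_LENGTH - len(suffix) - 1] + "_" + suffix
    return name
```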

If your crawler runs more than once, perhaps on a schedule, it looks for new or changed files or tables in your data store. The output of the crawler includes new tables and partitions found since a previous run.

## How does a crawler determine when to create partitions?
<a name="crawler-s3-folder-table-partition"></a>

When an AWS Glue crawler scans an Amazon S3 data store and detects multiple folders in a bucket, it determines the root of a table in the folder structure and which folders are partitions of the table. The name of the table is based on the Amazon S3 prefix or folder name. You provide an **Include path** that points to the folder level to crawl. When the majority of schemas at a folder level are similar, the crawler creates partitions of a table instead of separate tables. To direct the crawler to create separate tables, add each table's root folder as a separate data store when you define the crawler.

For example, consider the following Amazon S3 folder structure.

![Rectangles at multiple levels represent a folder hierarchy in Amazon S3. The top rectangle is labeled Sales. The rectangle below that is labeled year=2019. The two rectangles below that are labeled month=Jan and month=Feb. Each of those rectangles has two rectangles below it, labeled day=1 and day=2. All four "day" (bottom) rectangles have either two or four files under them. All rectangles and files are connected with lines.](http://docs.aws.amazon.com/glue/latest/dg/images/crawlers-s3-folders.png)


The paths to the four lowest level folders are the following:

```
s3://sales/year=2019/month=Jan/day=1
s3://sales/year=2019/month=Jan/day=2
s3://sales/year=2019/month=Feb/day=1
s3://sales/year=2019/month=Feb/day=2
```

Assume that the crawler target is set at `Sales`, and that all files in the `day=n` folders have the same format (for example, JSON, not encrypted) and the same or very similar schemas. The crawler creates a single table with four partitions, with partition keys `year`, `month`, and `day`.

In the next example, consider the following Amazon S3 structure:

```
s3://bucket01/folder1/table1/partition1/file.txt
s3://bucket01/folder1/table1/partition2/file.txt
s3://bucket01/folder1/table1/partition3/file.txt
s3://bucket01/folder1/table2/partition4/file.txt
s3://bucket01/folder1/table2/partition5/file.txt
```

If the schemas for files under `table1` and `table2` are similar, and a single data store is defined in the crawler with **Include path** `s3://bucket01/folder1/`, the crawler creates a single table with two partition key columns. The first partition key column contains `table1` and `table2`, and the second partition key column contains `partition1` through `partition3` for the `table1` partition and `partition4` and `partition5` for the `table2` partition. To create two separate tables, define the crawler with two data stores. In this example, define the first **Include path** as `s3://bucket01/folder1/table1/` and the second as `s3://bucket01/folder1/table2/`.
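
The Hive-style `name=value` folder convention from the first example can be parsed as shown in the following sketch. This is a hypothetical helper for illustration; the actual crawler heuristic also compares schemas across folders (majority-similar schemas become one partitioned table):

```python
def infer_partitions(object_keys, include_path):
    """Sketch: extract partition keys and values from name=value folders
    between the include path and the file name."""
    partitions = []
    for key in object_keys:
        # Folders between the include path and the file name are candidates.
        folders = key[len(include_path):].strip("/").split("/")[:-1]
        # name=value folders are treated as partition key/value pairs.
        partitions.append(dict(f.split("=", 1) for f in folders if "=" in f))
    partition_keys = sorted({k for p in partitions for k in p})
    return partition_keys, partitions
```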

**Note**  
In Amazon Athena, each table corresponds to an Amazon S3 prefix with all the objects in it. If objects have different schemas, Athena does not recognize different objects within the same prefix as separate tables. This can happen if a crawler creates multiple tables from the same Amazon S3 prefix, and might lead to queries in Athena that return zero results. For Athena to properly recognize and query tables, create the crawler with a separate **Include path** for each different table schema in the Amazon S3 folder structure. For more information, see [Best Practices When Using Athena with AWS Glue](https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html) and this [AWS Knowledge Center article](https://aws.amazon.com/premiumsupport/knowledge-center/athena-empty-results/).

# Supported data sources for crawling
<a name="crawler-data-stores"></a>

Crawlers can crawl the following file-based and table-based data stores.


| Access type that crawler uses | Data stores | 
| --- | --- | 
| Native client |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/crawler-data-stores.html)  | 
| JDBC |  Amazon Redshift Snowflake Within Amazon Relational Database Service (Amazon RDS) or external to Amazon RDS: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/crawler-data-stores.html)  | 
| MongoDB client |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/crawler-data-stores.html)  | 

**Note**  
Currently, AWS Glue does not support crawlers for data streams.

For JDBC, MongoDB, MongoDB Atlas, and Amazon DocumentDB (with MongoDB compatibility) data stores, you must specify an AWS Glue *connection* that the crawler can use to connect to the data store. For Amazon S3, you can optionally specify a connection of type Network. A connection is a Data Catalog object that stores connection information, such as credentials, URL, Amazon Virtual Private Cloud information, and more. For more information, see [Connecting to data](glue-connections.md).

The following are the versions of drivers supported by the crawler:


| Product | Crawler supported driver | 
| --- | --- | 
| PostgreSQL | 42.2.1 | 
| Amazon Aurora | Same as native crawler drivers | 
| MariaDB | 8.0.13 | 
| Microsoft SQL Server | 6.1.0 | 
| MySQL | 8.0.13 | 
| Oracle | 11.2.2 | 
| Amazon Redshift | 4.1 | 
| Snowflake | 3.13.20 | 
| MongoDB | 4.7.2 | 
| MongoDB Atlas | 4.7.2 | 

The following are notes about the various data stores.

**Amazon S3**  
You can choose to crawl a path in your account or in another account. If all the Amazon S3 files in a folder have the same schema, the crawler creates one table. Also, if the Amazon S3 object is partitioned, only one metadata table is created and partition information is added to the Data Catalog for that table.

**Amazon S3 and Amazon DynamoDB**  
Crawlers use an AWS Identity and Access Management (IAM) role for permission to access your data stores. *The role you pass to the crawler must have permission to access Amazon S3 paths and Amazon DynamoDB tables that are crawled*.

**Amazon DynamoDB**  
When defining a crawler using the AWS Glue console, you specify one DynamoDB table. If you're using the AWS Glue API, you can specify a list of tables. You can choose to crawl only a small sample of the data to reduce crawler run times.

**Delta Lake**  
For each Delta Lake data store, you specify how to create the Delta tables:  
+ **Create Native tables**: Allow integration with query engines that support querying of the Delta transaction log directly. For more information, see [Querying Delta Lake tables](https://docs.aws.amazon.com/athena/latest/ug/delta-lake-tables.html).
+ **Create Symlink tables**: Create a `_symlink_manifest` folder with manifest files partitioned by the partition keys, based on the specified configuration parameters.

**Iceberg**  
For each Iceberg data store, you specify an Amazon S3 path that contains the metadata for your Iceberg tables. If the crawler discovers Iceberg table metadata, it registers the metadata in the Data Catalog. You can set a schedule for the crawler to keep the tables updated.  
You can define these parameters for the data store:  
+ **Exclusions**: Allows you to skip certain folders.
+ **Maximum Traversal Depth**: Sets the depth limit the crawler can crawl in your Amazon S3 bucket. The default maximum traversal depth is 10 and the maximum depth you can set is 20.

**Hudi**  
For each Hudi data store, you specify an Amazon S3 path that contains the metadata for your Hudi tables. If the crawler discovers Hudi table metadata, it registers the metadata in the Data Catalog. You can set a schedule for the crawler to keep the tables updated.  
You can define these parameters for the data store:  
+ **Exclusions**: Allows you to skip certain folders.
+ **Maximum Traversal Depth**: Sets the depth limit the crawler can crawl in your Amazon S3 bucket. The default maximum traversal depth is 10 and the maximum depth you can set is 20.
Timestamp columns with `millis` as logical types will be interpreted as `bigint`, due to an incompatibility with Hudi 0.13.1 and timestamp types. A resolution may be provided in the upcoming Hudi release.
Hudi tables are categorized as follows, with specific implications for each:  
+ Copy on Write (CoW): Data is stored in a columnar format (Parquet), and each update creates a new version of files during a write.
+ Merge on Read (MoR): Data is stored using a combination of columnar (Parquet) and row-based (Avro) formats. Updates are logged to row-based delta files and are compacted as needed to create new versions of the columnar files.
With CoW datasets, each time there is an update to a record, the file that contains the record is rewritten with the updated values. With a MoR dataset, each time there is an update, Hudi writes only the row for the changed record. MoR is better suited for write- or change-heavy workloads with fewer reads. CoW is better suited for read-heavy workloads on data that change less frequently.  
Hudi provides three query types for accessing the data:  
+ Snapshot queries: Queries that see the latest snapshot of the table as of a given commit or compaction action. For MoR tables, snapshot queries expose the most recent state of the table by merging the base and delta files of the latest file slice at the time of the query.
+ Incremental queries: Queries only see new data written to the table, since a given commit/compaction. This effectively provides change streams to enable incremental data pipelines.
+ Read optimized queries: For MoR tables, queries see the latest data compacted. For CoW tables, queries see the latest data committed.
For Copy on Write tables, the crawler creates a single table in the Data Catalog with the ReadOptimized serde `org.apache.hudi.hadoop.HoodieParquetInputFormat`.  
For Merge on Read tables, the crawler creates two tables in the Data Catalog for the same table location:  
+ A table with suffix `_ro` which uses the ReadOptimized serde `org.apache.hudi.hadoop.HoodieParquetInputFormat`.
+ A table with suffix `_rt` which uses the RealTime Serde allowing for Snapshot queries: `org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat`.

**MongoDB and Amazon DocumentDB (with MongoDB compatibility)**  
MongoDB versions 3.2 and later are supported. You can choose to crawl only a small sample of the data to reduce crawler run times.

**Relational database**  
Authentication is with a database user name and password. Depending on the type of database engine, you can choose which objects are crawled, such as databases, schemas, and tables.

**Snowflake**  
The Snowflake JDBC crawler supports crawling tables, external tables, views, and materialized views. The materialized view definition is not populated.  
For Snowflake external tables, the crawler crawls a table only if it points to an Amazon S3 location. In addition to the table schema, the crawler also crawls the Amazon S3 location and file format, and outputs them as table parameters in the Data Catalog table. Note that the partition information of a partitioned external table is not populated.  
ETL is currently not supported for Data Catalog tables created using the Snowflake crawler.

# Crawler prerequisites
<a name="crawler-prereqs"></a>

The crawler assumes the permissions of the AWS Identity and Access Management (IAM) role that you specify when you define it. This IAM role must have permissions to extract data from your data store and write to the Data Catalog. The AWS Glue console lists only IAM roles that have an attached trust policy for the AWS Glue principal service. From the console, you can also create an IAM role with an IAM policy to access the Amazon S3 data stores that the crawler accesses. For more information about providing roles for AWS Glue, see [Identity-based policies for AWS Glue](security_iam_service-with-iam.md#security_iam_service-with-iam-id-based-policies).

**Note**  
When crawling a Delta Lake data store, you must have Read/Write permissions to the Amazon S3 location.

For your crawler, you can create a role and attach the following policies:
+ The `AWSGlueServiceRole` AWS managed policy, which grants the required permissions on the Data Catalog.
+ An inline policy that grants permissions on the data source.
+ An inline policy that grants `iam:PassRole` permission on the role.

A quicker approach is to let the AWS Glue console crawler wizard create a role for you. The role that it creates is specifically for the crawler, and includes the `AWSGlueServiceRole` AWS managed policy plus the required inline policy for the specified data source.

If you specify an existing role for a crawler, ensure that it includes the `AWSGlueServiceRole` policy or equivalent (or a scoped down version of this policy), plus the required inline policies. For example, for an Amazon S3 data store, the inline policy would at a minimum be the following: 


```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::bucket/object*"
      ]
    }
  ]
}
```

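
If you generate such minimal policies programmatically, a sketch might look like the following; `bucket` and `prefix` are placeholders for your own crawled path, and this only reproduces the S3 read statement shown above:

```python
import json

def minimal_s3_crawl_policy(bucket, prefix):
    """Build the minimal S3 read policy shown above for a crawled path."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            # Grant read access only to objects under the crawled prefix.
            "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
        }],
    }, indent=2)
```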

For an Amazon DynamoDB data store, the policy would at a minimum be the following: 


```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:DescribeTable",
        "dynamodb:Scan"
      ],
      "Resource": [
        "arn:aws:dynamodb:us-east-1:111122223333:table/table-name*"
      ]
    }
  ]
}
```


In addition, if the crawler reads AWS Key Management Service (AWS KMS) encrypted Amazon S3 data, then the IAM role must have decrypt permission on the AWS KMS key. For more information, see [Step 2: Create an IAM role for AWS Glue](create-an-iam-role.md).

# Defining and managing classifiers
<a name="add-classifier"></a>

A classifier reads the data in a data store. If it recognizes the format of the data, it generates a schema. The classifier also returns a certainty number to indicate how certain the format recognition was. 

AWS Glue provides a set of built-in classifiers, but you can also create custom classifiers. AWS Glue invokes custom classifiers first, in the order that you specify in your crawler definition. Depending on the results that are returned from custom classifiers, AWS Glue might also invoke built-in classifiers. If a classifier returns `certainty=1.0` during processing, it indicates that it's 100 percent certain that it can create the correct schema. AWS Glue then uses the output of that classifier. 

If no classifier returns `certainty=1.0`, AWS Glue uses the output of the classifier that has the highest certainty. If no classifier returns a certainty greater than `0.0`, AWS Glue returns the default classification string of `UNKNOWN`.
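
This selection rule can be sketched as follows, where `results` is a hypothetical list of `(classification, certainty)` pairs returned by the classifiers in invocation order:

```python
UNKNOWN = "UNKNOWN"

def choose_classification(results):
    """Sketch of the certainty rule described above."""
    best, best_certainty = UNKNOWN, 0.0
    for classification, certainty in results:
        if certainty == 1.0:
            # 100 percent certain: use this classifier's output immediately.
            return classification
        if certainty > best_certainty:
            best, best_certainty = classification, certainty
    # If no classifier returned a certainty greater than 0.0,
    # best is still the default classification string UNKNOWN.
    return best
```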

## When do I use a classifier?
<a name="classifier-when-used"></a>

You use classifiers when you crawl a data store to define metadata tables in the AWS Glue Data Catalog. You can set up your crawler with an ordered set of classifiers. When the crawler invokes a classifier, the classifier determines whether the data is recognized. If the classifier can't recognize the data or is not 100 percent certain, the crawler invokes the next classifier in the list to determine whether it can recognize the data. 

 For more information about creating a classifier using the AWS Glue console, see [Creating classifiers using the AWS Glue console](console-classifiers.md). 

## Custom classifiers
<a name="classifier-defining"></a>

The output of a classifier includes a string that indicates the file's classification or format (for example, `json`) and the schema of the file. For custom classifiers, you define the logic for creating the schema based on the type of classifier. Classifier types include defining schemas based on grok patterns, XML tags, and JSON paths.

If you change a classifier definition, any data that was previously crawled using the classifier is not reclassified. A crawler keeps track of previously crawled data. New data is classified with the updated classifier, which might result in an updated schema. If the schema of your data has evolved, update the classifier to account for any schema changes when your crawler runs. To reclassify data to correct an incorrect classifier, create a new crawler with the updated classifier. 

For more information about creating custom classifiers in AWS Glue, see [Writing custom classifiers for diverse data formats](custom-classifier.md).

**Note**  
If your data format is recognized by one of the built-in classifiers, you don't need to create a custom classifier.

## Built-in classifiers
<a name="classifier-built-in"></a>

 AWS Glue provides built-in classifiers for various formats, including JSON, CSV, web logs, and many database systems.

If AWS Glue doesn't find a custom classifier that fits the input data format with 100 percent certainty, it invokes the built-in classifiers in the order shown in the following table. The built-in classifiers return a result to indicate whether the format matches (`certainty=1.0`) or does not match (`certainty=0.0`). The first classifier that has `certainty=1.0` provides the classification string and schema for a metadata table in your Data Catalog.


| Classifier type | Classification string | Notes | 
| --- | --- | --- | 
| Apache Avro | avro | Reads the schema at the beginning of the file to determine format. | 
| Apache ORC | orc | Reads the file metadata to determine format. | 
| Apache Parquet | parquet | Reads the schema at the end of the file to determine format. | 
| JSON | json | Reads the beginning of the file to determine format. | 
| Binary JSON | bson | Reads the beginning of the file to determine format. | 
| XML | xml | Reads the beginning of the file to determine format. AWS Glue determines the table schema based on XML tags in the document.  For information about creating a custom XML classifier to specify rows in the document, see [Writing XML custom classifiers](custom-classifier.md#custom-classifier-xml).  | 
| Amazon Ion | ion | Reads the beginning of the file to determine format. | 
| Combined Apache log | combined_apache | Determines log formats through a grok pattern. | 
| Apache log | apache | Determines log formats through a grok pattern. | 
| Linux kernel log | linux_kernel | Determines log formats through a grok pattern. | 
| Microsoft log | microsoft_log | Determines log formats through a grok pattern. | 
| Ruby log | ruby_logger | Reads the beginning of the file to determine format. | 
| Squid 3.x log | squid | Reads the beginning of the file to determine format. | 
| Redis monitor log | redismonlog | Reads the beginning of the file to determine format. | 
| Redis log | redislog | Reads the beginning of the file to determine format. | 
| CSV | csv | Checks for the following delimiters: comma (,), pipe (\|), tab (\t), semicolon (;), and Ctrl-A (\u0001). Ctrl-A is the Unicode control character for Start Of Heading. | 
| Amazon Redshift | redshift | Uses JDBC connection to import metadata. | 
| MySQL | mysql | Uses JDBC connection to import metadata. | 
| PostgreSQL | postgresql | Uses JDBC connection to import metadata. | 
| Oracle database | oracle | Uses JDBC connection to import metadata. | 
| Microsoft SQL Server | sqlserver | Uses JDBC connection to import metadata. | 
| Amazon DynamoDB | dynamodb | Reads data from the DynamoDB table. | 

Files in the following compressed formats can be classified:
+ ZIP (supported for archives containing only a single file). Note that ZIP archives are not well supported by other AWS services because they can contain multiple files.
+ BZIP
+ GZIP
+ LZ4
+ Snappy (supported for both standard and Hadoop native Snappy formats)

### Built-in CSV classifier
<a name="classifier-builtin-rules"></a>

The built-in CSV classifier parses CSV file contents to determine the schema for an AWS Glue table. This classifier checks for the following delimiters:
+ Comma (,)
+ Pipe (|)
+ Tab (\t)
+ Semicolon (;)
+ Ctrl-A (\u0001)

  Ctrl-A is the Unicode control character for `Start Of Heading`.

To be classified as CSV, the table schema must have at least two columns and two rows of data. The CSV classifier uses a number of heuristics to determine whether a header is present in a given file. If the classifier can't determine a header from the first row of data, column headers are displayed as `col1`, `col2`, `col3`, and so on. The built-in CSV classifier determines whether to infer a header by evaluating the following characteristics of the file:
+ Every column in a potential header parses as a STRING data type.
+ Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.
+ Every column in a potential header must meet the AWS Glue `regex` requirements for a column name.
+ The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.
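
The header heuristics above can be sketched in a few lines. This is a simplified illustration under assumed rules (the column-name regex and numeric check stand in for the actual AWS Glue checks):

```python
import re

# Assumed stand-in for the AWS Glue column-name requirements.
COLUMN_NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def looks_numeric(value: str) -> bool:
    try:
        float(value)
        return True
    except ValueError:
        return False

def infer_header(rows: list) -> bool:
    """Apply the documented heuristics to the first row of a parsed CSV."""
    header, data = rows[0], rows[1:]
    # Every potential header cell must parse only as STRING.
    if any(looks_numeric(cell) for cell in header):
        return False
    # Except for the last column, every cell must be fewer than 150 characters.
    if any(len(cell) >= 150 for cell in header[:-1]):
        return False
    # Every cell must satisfy the column-name rule.
    if not all(COLUMN_NAME_RE.match(cell) for cell in header):
        return False
    # The header must differ from the data: at least one data cell
    # must parse as a non-STRING type.
    return any(looks_numeric(cell) for row in data for cell in row)

print(infer_header([["id", "name"], ["1", "Alaska"]]))  # True
print(infer_header([["a", "b"], ["x", "y"]]))           # False: all STRING rows
```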

**Note**  
If the built-in CSV classifier does not create your AWS Glue table as you want, you might be able to use one of the following alternatives:  
Change the column names in the Data Catalog, set the `SchemaChangePolicy` to LOG, and set the partition output configuration to `InheritFromTable` for future crawler runs.
Create a custom grok classifier to parse the data and assign the columns that you want.
The built-in CSV classifier creates tables referencing the `LazySimpleSerDe` as the serialization library, which is a good choice for type inference. However, if the CSV data contains quoted strings, edit the table definition and change the SerDe library to `OpenCSVSerDe`. Adjust any inferred types to STRING, set the `SchemaChangePolicy` to LOG, and set the partitions output configuration to `InheritFromTable` for future crawler runs. For more information about SerDe libraries, see [SerDe Reference](https://docs.aws.amazon.com/athena/latest/ug/serde-reference.html) in the Amazon Athena User Guide.
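
For reference, a sketch of the `SerdeInfo` portion of the table's storage descriptor after switching to the OpenCSV SerDe might look like the following. The parameter values shown are illustrative; adjust them to match your data:

```
"SerdeInfo": {
    "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
    "Parameters": {
        "separatorChar": ",",
        "quoteChar": "\"",
        "escapeChar": "\\"
    }
}
```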

# Writing custom classifiers for diverse data formats
<a name="custom-classifier"></a>

You can provide a custom classifier to classify your data in AWS Glue. You can create a custom classifier using a grok pattern, an XML tag, JavaScript Object Notation (JSON), or comma-separated values (CSV). An AWS Glue crawler calls a custom classifier. If the classifier recognizes the data, it returns the classification and schema of the data to the crawler. You might need to define a custom classifier if your data doesn't match any built-in classifiers, or if you want to customize the tables that are created by the crawler.

 For more information about creating a classifier using the AWS Glue console, see [Creating classifiers using the AWS Glue console](console-classifiers.md). 

AWS Glue runs custom classifiers before built-in classifiers, in the order you specify. When a crawler finds a classifier that matches the data, the classification string and schema are used in the definition of tables that are written to your AWS Glue Data Catalog.

**Topics**
+ [Writing grok custom classifiers](#custom-classifier-grok)
+ [Writing XML custom classifiers](#custom-classifier-xml)
+ [Writing JSON custom classifiers](#custom-classifier-json)
+ [Writing CSV custom classifiers](#custom-classifier-csv)

## Writing grok custom classifiers
<a name="custom-classifier-grok"></a>

Grok is a tool that is used to parse textual data given a matching pattern. A grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time. AWS Glue uses grok patterns to infer the schema of your data. When a grok pattern matches your data, AWS Glue uses the pattern to determine the structure of your data and map it into fields.

AWS Glue provides many built-in patterns, or you can define your own. You can create a grok pattern using built-in patterns and custom patterns in your custom classifier definition. You can tailor a grok pattern to classify custom text file formats.

**Note**  
AWS Glue grok custom classifiers use the `GrokSerDe` serialization library for tables created in the AWS Glue Data Catalog. If you are using the AWS Glue Data Catalog with Amazon Athena, Amazon EMR, or Redshift Spectrum, check the documentation about those services for information about support of the `GrokSerDe`. Currently, you might encounter problems querying tables created with the `GrokSerDe` from Amazon EMR and Redshift Spectrum.

The following is the basic syntax for the components of a grok pattern:

```
%{PATTERN:field-name}
```

Data that matches the named `PATTERN` is mapped to the `field-name` column in the schema, with a default data type of `string`. Optionally, the data type for the field can be cast to `byte`, `boolean`, `double`, `short`, `int`, `long`, or `float` in the resulting schema.

```
%{PATTERN:field-name:data-type}
```

For example, to cast a `num` field to an `int` data type, you can use this pattern: 

```
%{NUMBER:num:int}
```

Patterns can be composed of other patterns. For example, you can have a pattern for a `SYSLOG` timestamp that is defined by patterns for month, day of the month, and time (for example, `Feb 1 06:25:43`). For this data, you might define the following pattern:

```
SYSLOGTIMESTAMP %{MONTH} +%{MONTHDAY} %{TIME}
```

**Note**  
Grok patterns can process only one line at a time. Multiple-line patterns are not supported. Also, line breaks within a pattern are not supported.

### Custom values for grok classifier
<a name="classifier-values"></a>

When you define a grok classifier, you supply the following values to create the custom classifier.

**Name**  
Name of the classifier.

**Classification**  
The text string that is written to describe the format of the data that is classified; for example, `special-logs`.

**Grok pattern**  
The set of patterns that are applied to the data store to determine whether there is a match. These patterns are from AWS Glue [built-in patterns](#classifier-builtin-patterns) and any custom patterns that you define.  
The following is an example of a grok pattern:  

```
%{TIMESTAMP_ISO8601:timestamp} \[%{MESSAGEPREFIX:message_prefix}\] %{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}
```
When the data matches `TIMESTAMP_ISO8601`, a schema column `timestamp` is created. The behavior is similar for the other named patterns in the example.

**Custom patterns**  
Optional custom patterns that you define. You can reference these custom patterns in the grok pattern that is applied to your data. Each custom component pattern must be on a separate line. [Regular expression (regex)](http://en.wikipedia.org/wiki/Regular_expression) syntax is used to define the pattern.   
The following is an example of using custom patterns:  

```
CRAWLERLOGLEVEL (BENCHMARK|ERROR|WARN|INFO|TRACE)
MESSAGEPREFIX .*-.*-.*-.*-.*
```
The first custom named pattern, `CRAWLERLOGLEVEL`, is a match when the data matches one of the enumerated strings. The second custom pattern, `MESSAGEPREFIX`, tries to match a message prefix string.
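
To build intuition, the way a grok pattern expands into a regular expression can be sketched as follows. The pattern library entries here are simplified stand-ins for the built-in definitions, combined with the two custom patterns from the example above:

```python
import re

# Simplified stand-ins for the built-in patterns, plus the two custom
# patterns from the example; the real built-in definitions are longer.
LIBRARY = {
    "TIMESTAMP_ISO8601": r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(?::\d{2})?",
    "GREEDYDATA": r".*",
    "CRAWLERLOGLEVEL": r"(?:BENCHMARK|ERROR|WARN|INFO|TRACE)",
    "MESSAGEPREFIX": r".*-.*-.*-.*-.*",
}

def expand(grok: str) -> str:
    """Rewrite %{NAME:field} as a named regex group, recursively."""
    def repl(m: re.Match) -> str:
        name, _, field = m.group(1).partition(":")
        inner = expand(LIBRARY[name])
        return f"(?P<{field}>{inner})" if field else f"(?:{inner})"
    return re.sub(r"%\{([^}]+)\}", repl, grok)

grok = (r"%{TIMESTAMP_ISO8601:timestamp} \[%{MESSAGEPREFIX:message_prefix}\] "
        r"%{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}")
line = "2024-01-01T06:25:43 [ab-cd-ef-gh-ij] INFO : first crawl finished"
m = re.match(expand(grok), line)
print(m.group("loglevel"), m.group("message"))  # INFO first crawl finished
```

Each named pattern becomes a column in the inferred schema, which is why `%{TIMESTAMP_ISO8601:timestamp}` yields a `timestamp` field.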

AWS Glue keeps track of the creation time, last update time, and version of your classifier.

### Built-in patterns
<a name="classifier-builtin-patterns"></a>

AWS Glue provides many common patterns that you can use to build a custom classifier. You add a named pattern to the `grok pattern` in a classifier definition.

The following list consists of a line for each pattern. In each line, the pattern name is followed by its definition. [Regular expression (regex)](http://en.wikipedia.org/wiki/Regular_expression) syntax is used to define each pattern.

```
# AWS Glue built-in patterns
 USERNAME [a-zA-Z0-9._-]+
 USER %{USERNAME:UNWANTED}
 INT (?:[+-]?(?:[0-9]+))
 BASE10NUM (?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))
 NUMBER (?:%{BASE10NUM:UNWANTED})
 BASE16NUM (?<![0-9A-Fa-f])(?:[+-]?(?:0x)?(?:[0-9A-Fa-f]+))
 BASE16FLOAT \b(?<![0-9A-Fa-f.])(?:[+-]?(?:0x)?(?:(?:[0-9A-Fa-f]+(?:\.[0-9A-Fa-f]*)?)|(?:\.[0-9A-Fa-f]+)))\b
 BOOLEAN (?i)(true|false)
 
 POSINT \b(?:[1-9][0-9]*)\b
 NONNEGINT \b(?:[0-9]+)\b
 WORD \b\w+\b
 NOTSPACE \S+
 SPACE \s*
 DATA .*?
 GREEDYDATA .*
 #QUOTEDSTRING (?:(?<!\\)(?:"(?:\\.|[^\\"])*"|(?:'(?:\\.|[^\\'])*')|(?:`(?:\\.|[^\\`])*`)))
 QUOTEDSTRING (?>(?<!\\)(?>"(?>\\.|[^\\"]+)+"|""|(?>'(?>\\.|[^\\']+)+')|''|(?>`(?>\\.|[^\\`]+)+`)|``))
 UUID [A-Fa-f0-9]{8}-(?:[A-Fa-f0-9]{4}-){3}[A-Fa-f0-9]{12}
 
 # Networking
 MAC (?:%{CISCOMAC:UNWANTED}|%{WINDOWSMAC:UNWANTED}|%{COMMONMAC:UNWANTED})
 CISCOMAC (?:(?:[A-Fa-f0-9]{4}\.){2}[A-Fa-f0-9]{4})
 WINDOWSMAC (?:(?:[A-Fa-f0-9]{2}-){5}[A-Fa-f0-9]{2})
 COMMONMAC (?:(?:[A-Fa-f0-9]{2}:){5}[A-Fa-f0-9]{2})
 IPV6 ((([0-9A-Fa-f]{1,4}:){7}([0-9A-Fa-f]{1,4}|:))|(([0-9A-Fa-f]{1,4}:){6}(:[0-9A-Fa-f]{1,4}|((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){5}(((:[0-9A-Fa-f]{1,4}){1,2})|:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){4}(((:[0-9A-Fa-f]{1,4}){1,3})|((:[0-9A-Fa-f]{1,4})?:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){3}(((:[0-9A-Fa-f]{1,4}){1,4})|((:[0-9A-Fa-f]{1,4}){0,2}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){2}(((:[0-9A-Fa-f]{1,4}){1,5})|((:[0-9A-Fa-f]{1,4}){0,3}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){1}(((:[0-9A-Fa-f]{1,4}){1,6})|((:[0-9A-Fa-f]{1,4}){0,4}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(:(((:[0-9A-Fa-f]{1,4}){1,7})|((:[0-9A-Fa-f]{1,4}){0,5}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:)))(%.+)?
 IPV4 (?<![0-9])(?:(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2}))(?![0-9])
 IP (?:%{IPV6:UNWANTED}|%{IPV4:UNWANTED})
 HOSTNAME \b(?:[0-9A-Za-z][0-9A-Za-z-_]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-_]{0,62}))*(\.?|\b)
 HOST %{HOSTNAME:UNWANTED}
 IPORHOST (?:%{HOSTNAME:UNWANTED}|%{IP:UNWANTED})
 HOSTPORT (?:%{IPORHOST}:%{POSINT:PORT})
 
 # paths
 PATH (?:%{UNIXPATH}|%{WINPATH})
 UNIXPATH (?>/(?>[\w_%!$@:.,~-]+|\\.)*)+
 #UNIXPATH (?<![\w\/])(?:/[^\/\s?*]*)+
 TTY (?:/dev/(pts|tty([pq])?)(\w+)?/?(?:[0-9]+))
 WINPATH (?>[A-Za-z]+:|\\)(?:\\[^\\?*]*)+
 URIPROTO [A-Za-z]+(\+[A-Za-z+]+)?
 URIHOST %{IPORHOST}(?::%{POSINT:port})?
 # uripath comes loosely from RFC1738, but mostly from what Firefox
 # doesn't turn into %XX
 URIPATH (?:/[A-Za-z0-9$.+!*'(){},~:;=@#%_\-]*)+
 #URIPARAM \?(?:[A-Za-z0-9]+(?:=(?:[^&]*))?(?:&(?:[A-Za-z0-9]+(?:=(?:[^&]*))?)?)*)?
 URIPARAM \?[A-Za-z0-9$.+!*'|(){},~@#%&/=:;_?\-\[\]]*
 URIPATHPARAM %{URIPATH}(?:%{URIPARAM})?
 URI %{URIPROTO}://(?:%{USER}(?::[^@]*)?@)?(?:%{URIHOST})?(?:%{URIPATHPARAM})?
 
 # Months: January, Feb, 3, 03, 12, December
 MONTH \b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\b
 MONTHNUM (?:0?[1-9]|1[0-2])
 MONTHNUM2 (?:0[1-9]|1[0-2])
 MONTHDAY (?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])
 
 # Days: Monday, Tue, Thu, etc...
 DAY (?:Mon(?:day)?|Tue(?:sday)?|Wed(?:nesday)?|Thu(?:rsday)?|Fri(?:day)?|Sat(?:urday)?|Sun(?:day)?)
 
 # Years?
 YEAR (?>\d\d){1,2}
 # Time: HH:MM:SS
 #TIME \d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?
 # TIME %{POSINT<24}:%{POSINT<60}(?::%{POSINT<60}(?:\.%{POSINT})?)?
 HOUR (?:2[0123]|[01]?[0-9])
 MINUTE (?:[0-5][0-9])
 # '60' is a leap second in most time standards and thus is valid.
 SECOND (?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)
 TIME (?!<[0-9])%{HOUR}:%{MINUTE}(?::%{SECOND})(?![0-9])
 # datestamp is YYYY/MM/DD-HH:MM:SS.UUUU (or something like it)
 DATE_US %{MONTHNUM}[/-]%{MONTHDAY}[/-]%{YEAR}
 DATE_EU %{MONTHDAY}[./-]%{MONTHNUM}[./-]%{YEAR}
 DATESTAMP_US %{DATE_US}[- ]%{TIME}
 DATESTAMP_EU %{DATE_EU}[- ]%{TIME}
 ISO8601_TIMEZONE (?:Z|[+-]%{HOUR}(?::?%{MINUTE}))
 ISO8601_SECOND (?:%{SECOND}|60)
 TIMESTAMP_ISO8601 %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?
 TZ (?:[PMCE][SD]T|UTC)
 DATESTAMP_RFC822 %{DAY} %{MONTH} %{MONTHDAY} %{YEAR} %{TIME} %{TZ}
 DATESTAMP_RFC2822 %{DAY}, %{MONTHDAY} %{MONTH} %{YEAR} %{TIME} %{ISO8601_TIMEZONE}
 DATESTAMP_OTHER %{DAY} %{MONTH} %{MONTHDAY} %{TIME} %{TZ} %{YEAR}
 DATESTAMP_EVENTLOG %{YEAR}%{MONTHNUM2}%{MONTHDAY}%{HOUR}%{MINUTE}%{SECOND}
 CISCOTIMESTAMP %{MONTH} %{MONTHDAY} %{TIME}
 
 # Syslog Dates: Month Day HH:MM:SS
 SYSLOGTIMESTAMP %{MONTH} +%{MONTHDAY} %{TIME}
 PROG (?:[\w._/%-]+)
 SYSLOGPROG %{PROG:program}(?:\[%{POSINT:pid}\])?
 SYSLOGHOST %{IPORHOST}
 SYSLOGFACILITY <%{NONNEGINT:facility}.%{NONNEGINT:priority}>
 HTTPDATE %{MONTHDAY}/%{MONTH}/%{YEAR}:%{TIME} %{INT}
 
 # Shortcuts
 QS %{QUOTEDSTRING:UNWANTED}
 
 # Log formats
 SYSLOGBASE %{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} %{SYSLOGPROG}:
 
 MESSAGESLOG %{SYSLOGBASE} %{DATA}
 
 COMMONAPACHELOG %{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-)
 COMBINEDAPACHELOG %{COMMONAPACHELOG} %{QS:referrer} %{QS:agent}
 COMMONAPACHELOG_DATATYPED %{IPORHOST:clientip} %{USER:ident;boolean} %{USER:auth} \[%{HTTPDATE:timestamp;date;dd/MMM/yyyy:HH:mm:ss Z}\] "(?:%{WORD:verb;string} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion;float})?|%{DATA:rawrequest})" %{NUMBER:response;int} (?:%{NUMBER:bytes;long}|-)
 
 
 # Log Levels
 LOGLEVEL ([A|a]lert|ALERT|[T|t]race|TRACE|[D|d]ebug|DEBUG|[N|n]otice|NOTICE|[I|i]nfo|INFO|[W|w]arn?(?:ing)?|WARN?(?:ING)?|[E|e]rr?(?:or)?|ERR?(?:OR)?|[C|c]rit?(?:ical)?|CRIT?(?:ICAL)?|[F|f]atal|FATAL|[S|s]evere|SEVERE|EMERG(?:ENCY)?|[Ee]merg(?:ency)?)
```

## Writing XML custom classifiers
<a name="custom-classifier-xml"></a>

XML defines the structure of a document with the use of tags in the file. With an XML custom classifier, you can specify the tag name used to define a row.

### Custom classifier values for an XML classifier
<a name="classifier-values-xml"></a>

When you define an XML classifier, you supply the following values to AWS Glue to create the classifier. The classification field of this classifier is set to `xml`.

**Name**  
Name of the classifier.

**Row tag**  
The XML tag name that defines a table row in the XML document, without angle brackets `< >`. The name must comply with XML rules for a tag.  
The element containing the row data **cannot** be a self-closing empty element. For example, this empty element is **not** parsed by AWS Glue:  

```
            <row att1="xx" att2="yy" />  
```
 Empty elements can be written as follows:  

```
            <row att1="xx" att2="yy"> </row> 
```

AWS Glue keeps track of the creation time, last update time, and version of your classifier.

For example, suppose that you have the following XML file. To create an AWS Glue table that only contains columns for author and title, create a classifier in the AWS Glue console with **Row tag** as `AnyCompany`. Then add and run a crawler that uses this custom classifier.

```
<?xml version="1.0"?>
<catalog>
   <book id="bk101">
     <AnyCompany>
       <author>Rivera, Martha</author>
       <title>AnyCompany Developer Guide</title>
     </AnyCompany>
   </book>
   <book id="bk102">
     <AnyCompany>   
       <author>Stiles, John</author>
       <title>Style Guide for AnyCompany</title>
     </AnyCompany>
   </book>
</catalog>
```
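
The effect of **Row tag** can be sketched with Python's standard `xml.etree` module. This is a hypothetical illustration of the row selection, not the crawler's implementation: each element whose tag matches the row tag becomes one table row, and its child elements become the columns:

```python
import xml.etree.ElementTree as ET

XML = """<?xml version="1.0"?>
<catalog>
   <book id="bk101">
     <AnyCompany>
       <author>Rivera, Martha</author>
       <title>AnyCompany Developer Guide</title>
     </AnyCompany>
   </book>
   <book id="bk102">
     <AnyCompany>
       <author>Stiles, John</author>
       <title>Style Guide for AnyCompany</title>
     </AnyCompany>
   </book>
</catalog>"""

# Every element matching the row tag "AnyCompany" becomes a row;
# its child elements (author, title) become the columns.
rows = [
    {child.tag: child.text for child in row}
    for row in ET.fromstring(XML).iter("AnyCompany")
]
print(rows[0])  # {'author': 'Rivera, Martha', 'title': 'AnyCompany Developer Guide'}
```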

## Writing JSON custom classifiers
<a name="custom-classifier-json"></a>

JSON is a data-interchange format. It defines data structures with name-value pairs or an ordered list of values. With a JSON custom classifier, you can specify the JSON path to a data structure that is used to define the schema for your table.

### Custom classifier values in AWS Glue
<a name="classifier-values-json"></a>

When you define a JSON classifier, you supply the following values to AWS Glue to create the classifier. The classification field of this classifier is set to `json`.

**Name**  
Name of the classifier.

**JSON path**  
A JSON path that points to an object that is used to define a table schema. The JSON path can be written in dot notation or bracket notation. The following operators are supported:      
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html)

AWS Glue keeps track of the creation time, last update time, and version of your classifier.

**Example Using a JSON classifier to pull records from an array**  
Suppose that your JSON data is an array of records. For example, the first few lines of your file might look like the following:  

```
[
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:ak",
    "name": "Alaska"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:al\/cd:1",
    "name": "Alabama's 1st congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:al\/cd:2",
    "name": "Alabama's 2nd congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:al\/cd:3",
    "name": "Alabama's 3rd congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:al\/cd:4",
    "name": "Alabama's 4th congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:al\/cd:5",
    "name": "Alabama's 5th congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:al\/cd:6",
    "name": "Alabama's 6th congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:al\/cd:7",
    "name": "Alabama's 7th congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:ar\/cd:1",
    "name": "Arkansas's 1st congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:ar\/cd:2",
    "name": "Arkansas's 2nd congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:ar\/cd:3",
    "name": "Arkansas's 3rd congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:ar\/cd:4",
    "name": "Arkansas's 4th congressional district"
  }
]
```
When you run a crawler using the built-in JSON classifier, the entire file is used to define the schema. Because you don’t specify a JSON path, the crawler treats the data as one object, that is, just an array. For example, the schema might look like the following:  

```
root
|-- record: array
```
However, to create a schema that is based on each record in the JSON array, create a custom JSON classifier and specify the JSON path as `$[*]`. When you specify this JSON path, the classifier interrogates all 12 records in the array to determine the schema. The resulting schema contains separate fields for each object, similar to the following example:  

```
root
|-- type: string
|-- id: string
|-- name: string
```
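
The `$[*]` behavior can be sketched as follows: select every element of the top-level array, then union the fields seen across the selected records. This is a simplified illustration, not the classifier's implementation:

```python
import json

data = json.loads("""[
  {"type": "constituency", "id": "ocd-division/country:us/state:ak",
   "name": "Alaska"},
  {"type": "constituency", "id": "ocd-division/country:us/state:al/cd:1",
   "name": "Alabama's 1st congressional district"}
]""")

# "$[*]" selects every element of the top-level array; the schema is the
# union of fields (with their observed types) across the selected records.
records = data  # $[*]
schema = {}
for record in records:
    for key, value in record.items():
        schema.setdefault(key, type(value).__name__)
print(schema)  # {'type': 'str', 'id': 'str', 'name': 'str'}
```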

**Example Using a JSON classifier to examine only parts of a file**  
Suppose that your JSON data follows the pattern of the example JSON file `s3://awsglue-datasets/examples/us-legislators/all/areas.json` drawn from [http://everypolitician.org/](http://everypolitician.org/). Example objects in the JSON file look like the following:  

```
{
  "type": "constituency",
  "id": "ocd-division\/country:us\/state:ak",
  "name": "Alaska"
}
{
  "type": "constituency",
  "identifiers": [
    {
      "scheme": "dmoz",
      "identifier": "Regional\/North_America\/United_States\/Alaska\/"
    },
    {
      "scheme": "freebase",
      "identifier": "\/m\/0hjy"
    },
    {
      "scheme": "fips",
      "identifier": "US02"
    },
    {
      "scheme": "quora",
      "identifier": "Alaska-state"
    },
    {
      "scheme": "britannica",
      "identifier": "place\/Alaska"
    },
    {
      "scheme": "wikidata",
      "identifier": "Q797"
    }
  ],
  "other_names": [
    {
      "lang": "en",
      "note": "multilingual",
      "name": "Alaska"
    },
    {
      "lang": "fr",
      "note": "multilingual",
      "name": "Alaska"
    },
    {
      "lang": "nov",
      "note": "multilingual",
      "name": "Alaska"
    }
  ],
  "id": "ocd-division\/country:us\/state:ak",
  "name": "Alaska"
}
```
When you run a crawler using the built-in JSON classifier, the entire file is used to create the schema. You might end up with a schema like this:  

```
root
|-- type: string
|-- id: string
|-- name: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
```
However, to create a schema using just the "`id`" object, create a custom JSON classifier and specify the JSON path as `$.id`. Then the schema is based on only the "`id`" field:  

```
root
|-- record: string
```
The first few lines of data extracted with this schema look like this:  

```
{"record": "ocd-division/country:us/state:ak"}
{"record": "ocd-division/country:us/state:al/cd:1"}
{"record": "ocd-division/country:us/state:al/cd:2"}
{"record": "ocd-division/country:us/state:al/cd:3"}
{"record": "ocd-division/country:us/state:al/cd:4"}
{"record": "ocd-division/country:us/state:al/cd:5"}
{"record": "ocd-division/country:us/state:al/cd:6"}
{"record": "ocd-division/country:us/state:al/cd:7"}
{"record": "ocd-division/country:us/state:ar/cd:1"}
{"record": "ocd-division/country:us/state:ar/cd:2"}
{"record": "ocd-division/country:us/state:ar/cd:3"}
{"record": "ocd-division/country:us/state:ar/cd:4"}
{"record": "ocd-division/country:us/state:as"}
{"record": "ocd-division/country:us/state:az/cd:1"}
{"record": "ocd-division/country:us/state:az/cd:2"}
{"record": "ocd-division/country:us/state:az/cd:3"}
{"record": "ocd-division/country:us/state:az/cd:4"}
{"record": "ocd-division/country:us/state:az/cd:5"}
{"record": "ocd-division/country:us/state:az/cd:6"}
{"record": "ocd-division/country:us/state:az/cd:7"}
```
To create a schema based on a deeply nested object, such as "`identifier`," in the JSON file, you can create a custom JSON classifier and specify the JSON path as `$.identifiers[*].identifier`. Although the schema is similar to the previous example, it is based on a different object in the JSON file.   
The schema looks like the following:  

```
root
|-- record: string
```
Listing the first few lines of data from the table shows that the schema is based on the data in the "`identifier`" object:  

```
{"record": "Regional/North_America/United_States/Alaska/"}
{"record": "/m/0hjy"}
{"record": "US02"}
{"record": "5879092"}
{"record": "4001016-8"}
{"record": "destination/alaska"}
{"record": "1116270"}
{"record": "139487266"}
{"record": "n79018447"}
{"record": "01490999-8dec-4129-8254-eef6e80fadc3"}
{"record": "Alaska-state"}
{"record": "place/Alaska"}
{"record": "Q797"}
{"record": "Regional/North_America/United_States/Alabama/"}
{"record": "/m/0gyh"}
{"record": "US01"}
{"record": "4829764"}
{"record": "4084839-5"}
{"record": "161950"}
{"record": "131885589"}
```
To create a table based on another deeply nested object, such as the "`name`" field in the "`other_names`" array in the JSON file, you can create a custom JSON classifier and specify the JSON path as `$.other_names[*].name`. Although the schema is similar to the previous example, it is based on a different object in the JSON file. The schema looks like the following:  

```
root
|-- record: string
```
Listing the first few lines of data in the table shows that it is based on the data in the "`name`" object in the "`other_names`" array:  

```
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "Аляска"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "ألاسكا"}
{"record": "ܐܠܐܣܟܐ"}
{"record": "الاسكا"}
{"record": "Alaska"}
{"record": "Alyaska"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "Штат Аляска"}
{"record": "Аляска"}
{"record": "Alaska"}
{"record": "আলাস্কা"}
```
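
The path selections in the preceding examples (`$.id`, `$.identifiers[*].identifier`, `$.other_names[*].name`) can be sketched with a tiny evaluator that supports dotted fields and the `[*]` wildcard. This is a simplified, hypothetical illustration, not the full set of operators the classifier supports:

```python
import re

def select(path: str, record: dict) -> list:
    """Evaluate a dotted JSON path with optional [*] wildcards against one record."""
    current = [record]
    # Split "$.identifiers[*].identifier" into ["identifiers", "[*]", "identifier"].
    for step in re.findall(r"[^.\[\]$]+|\[\*\]", path):
        if step == "[*]":
            current = [item for arr in current for item in arr]  # flatten arrays
        else:
            current = [obj[step] for obj in current if step in obj]
    return current

record = {
    "id": "ocd-division/country:us/state:ak",
    "identifiers": [
        {"scheme": "dmoz",
         "identifier": "Regional/North_America/United_States/Alaska/"},
        {"scheme": "fips", "identifier": "US02"},
    ],
}
print(select("$.id", record))
# ['ocd-division/country:us/state:ak']
print(select("$.identifiers[*].identifier", record))
# ['Regional/North_America/United_States/Alaska/', 'US02']
```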

## Writing CSV custom classifiers
<a name="custom-classifier-csv"></a>

Custom CSV classifiers allow you to specify the data type for each column in the classifier definition. You enter each column's data type, separated by commas. By specifying data types, you can override the data types that the crawler infers and ensure that your data is classified appropriately.

You can set the SerDe for processing CSV in the classifier, which is applied in the Data Catalog.

When you create a custom classifier, you can also reuse the classifier for different crawlers.

Note that CSV files containing only headers (no data) are classified as UNKNOWN because not enough information is provided. If you specify `Has headings` in the **Column headings** option and provide the data types, AWS Glue can classify these files correctly.

You can use a custom CSV classifier to infer the schema of various types of CSV data. The custom attributes that you can provide for your classifier include delimiters, a CSV SerDe option, options about the header, and whether to perform certain validations on the data.

### Custom classifier values in AWS Glue
<a name="classifier-values-csv"></a>

When you define a CSV classifier, you provide the following values to AWS Glue to create the classifier. The classification field of this classifier is set to `csv`.

**Classifier name**  
Name of the classifier.

**CSV Serde**  
Sets the SerDe for processing CSV in the classifier, which will be applied in the Data Catalog. Options are Open CSV SerDe, Lazy Simple SerDe, and None. You can specify the None value when you want the crawler to do the detection.

**Column delimiter**  
A custom symbol to denote what separates each column entry in the row. Provide a Unicode character. If you cannot type your delimiter, you can copy and paste it. This works for printable characters, including those that your system does not display (typically shown as □).

**Quote symbol**  
A custom symbol to denote what combines content into a single column value. Must be different from the column delimiter. Provide a Unicode character. If you cannot type your quote symbol, you can copy and paste it. This works for printable characters, including those that your system does not display (typically shown as □).

**Column headings**  
Indicates the behavior for how column headings should be detected in the CSV file. If your custom CSV file has column headings, enter a comma-delimited list of the column headings.

**Processing options: Allow files with single column**  
Enables the processing of files that contain only one column.

**Processing options: Trim white space before identifying column values**  
Specifies whether to trim values before identifying the type of column values.

**Custom datatypes - *optional***  
Specifies custom data types for the columns in the CSV file, entered as a comma-separated list. Each custom data type must be one of the supported data types: "BINARY", "BOOLEAN", "DATE", "DECIMAL", "DOUBLE", "FLOAT", "INT", "LONG", "SHORT", "STRING", "TIMESTAMP". Unsupported data types display an error.
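
The effect of declaring column data types can be sketched as follows. This is a hedged illustration: the cast table covers only a few of the supported types, and the validation mirrors the "unsupported datatypes display an error" behavior described above:

```python
import csv
import io
from datetime import date

SUPPORTED = {"BINARY", "BOOLEAN", "DATE", "DECIMAL", "DOUBLE", "FLOAT",
             "INT", "LONG", "SHORT", "STRING", "TIMESTAMP"}

# Illustrative casts for a few of the supported types.
CASTS = {"INT": int, "LONG": int, "SHORT": int,
         "FLOAT": float, "DOUBLE": float,
         "BOOLEAN": lambda v: v.lower() == "true",
         "STRING": str, "DATE": date.fromisoformat}

def parse_with_types(text: str, declared: str) -> list:
    """Parse CSV text, casting each column to its declared data type."""
    types = [t.strip().upper() for t in declared.split(",")]
    unsupported = set(types) - SUPPORTED
    if unsupported:
        raise ValueError(f"unsupported datatypes: {sorted(unsupported)}")
    return [[CASTS[t](cell) for t, cell in zip(types, row)]
            for row in csv.reader(io.StringIO(text))]

rows = parse_with_types("1,Alaska,true\n2,Alabama,false\n", "INT,STRING,BOOLEAN")
print(rows)  # [[1, 'Alaska', True], [2, 'Alabama', False]]
```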

# Creating classifiers using the AWS Glue console
<a name="console-classifiers"></a>

A classifier determines the schema of your data. You can write a custom classifier and point to it from AWS Glue. 

## Creating classifiers
<a name="add-classifier-console"></a>

To add a classifier in the AWS Glue console, choose **Add classifier**. When you define a classifier, you supply values for the following:

**Classifier name**  
Provide a unique name for your classifier.

**Classifier type**  
Choose the type of classifier to create.

The console also displays **Last updated**, the last time each classifier was updated.

Depending on the type of classifier you choose, configure the following properties for your classifier:

------
#### [ Grok ]
+ **Classification** 

  Describe the format or type of data that is classified or provide a custom label. 
+ **Grok pattern** 

This is used to parse your data into a structured schema. The grok pattern is composed of named patterns that describe the format of your data store. You write this grok pattern using the named built-in patterns provided by AWS Glue and custom patterns that you write and include in the **Custom patterns** field. We suggest that you try your pattern using some sample data with a grok debugger, although the debugger results might not match the results from AWS Glue exactly. You can find grok debuggers on the web. The named built-in patterns provided by AWS Glue are generally compatible with grok patterns that are available on the web. 

  Build your grok pattern by iteratively adding named patterns and checking your results in a debugger. This activity gives you confidence that when the AWS Glue crawler runs your grok pattern, your data is parsed as you expect.
+ **Custom patterns** 

  For grok classifiers, these are optional building blocks for the **Grok pattern** that you write. When built-in patterns cannot parse your data, you might need to write a custom pattern. These custom patterns are defined in this field and referenced in the **Grok pattern** field. Each custom pattern is defined on a separate line. Just like the built-in patterns, it consists of a named pattern definition that uses [regular expression (regex)](http://en.wikipedia.org/wiki/Regular_expression) syntax. 

  For example, the following has the name `MESSAGEPREFIX` followed by a regular expression definition to apply to your data to determine whether it follows the pattern. 

  ```
  MESSAGEPREFIX .*-.*-.*-.*-.*
  ```
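
A quick way to see how such a pattern behaves is to expand it into a regular expression and test it against sample lines. The following sketch is illustrative only (it is not the AWS Glue grok engine); the pattern table and sample strings are hypothetical:

```python
import re

# Hypothetical pattern table; AWS Glue ships many more built-in patterns.
PATTERNS = {
    "MESSAGEPREFIX": r".*-.*-.*-.*-.*",  # the custom pattern defined above
}

def grok_to_regex(grok: str) -> str:
    """Expand each %{NAME:field} reference into a named regex group."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        return "(?P<{}>{})".format(field, PATTERNS[name])
    return re.sub(r"%\{(\w+):(\w+)\}", expand, grok)

regex = grok_to_regex("%{MESSAGEPREFIX:prefix}")
print(re.match(regex, "us-east-1-app-42") is not None)  # True: five dash-separated parts
print(re.match(regex, "no dashes here") is not None)    # False: too few dashes
```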

------
#### [ XML ]
+ **Row tag** 

  For XML classifiers, this is the name of the XML tag that defines a table row in the XML document. Type the name without angle brackets `< >`. The name must comply with XML rules for a tag.

  For more information, see [Writing XML custom classifiers](custom-classifier.md#custom-classifier-xml). 

------
#### [ JSON ]
+ **JSON path** 

  For JSON classifiers, this is the JSON path to the object, array, or value that defines a row of the table being created. Type the name in either dot or bracket JSON syntax using AWS Glue supported operators. 

  For more information, see the list of operators in [Writing JSON custom classifiers](custom-classifier.md#custom-classifier-json). 
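
To make the row-selection idea concrete, here is a minimal, hypothetical illustration of how a JSON path such as `$.records[*]` picks the objects that become table rows. The `select_rows` helper and the sample document are invented for this sketch and support only this one path form; they are not the AWS Glue evaluator:

```python
import json
import re

def select_rows(doc, path):
    """Return the list of objects selected by a JSON path (sketch only)."""
    if path == "$":
        return doc if isinstance(doc, list) else [doc]
    m = re.fullmatch(r"\$\.(\w+)\[\*\]", path)  # handles only $.key[*] here
    if not m:
        raise ValueError("unsupported path in this sketch: " + path)
    return doc[m.group(1)]

doc = json.loads('{"records": [{"id": 1}, {"id": 2}]}')
rows = select_rows(doc, "$.records[*]")
print(len(rows))  # 2 -- each element of "records" becomes one table row
```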

------
#### [ CSV ]
+ **Column delimiter** 

  A single character or symbol to denote what separates each column entry in the row. Choose the delimiter from the list, or choose `Other` to enter a custom delimiter.
+ **Quote symbol** 

  A single character or symbol to denote what combines content into a single column value. Must be different from the column delimiter. Choose the quote symbol from the list, or choose `Other` to enter a custom quote character.
+ **Column headings** 

  Indicates how column headings are detected in the CSV file. You can choose `Has headings`, `No headings`, or `Detect headings`. If your custom CSV file has column headings, enter a comma-delimited list of those headings. 
+ **Allow files with single column** 

  To be classified as CSV, the data must have at least two columns and two rows of data. Use this option to allow the processing of files that contain only one column.
+ **Trim whitespace before identifying column values** 

  This option specifies whether to trim values before identifying the type of column values.
+  **Custom datatype** 

   (Optional) Enter custom datatypes in a comma-delimited list. The supported datatypes are: `BINARY`, `BOOLEAN`, `DATE`, `DECIMAL`, `DOUBLE`, `FLOAT`, `INT`, `LONG`, `SHORT`, `STRING`, `TIMESTAMP`. 
+  **CSV Serde** 

   (Optional) A SerDe for processing CSV in the classifier, which is applied in the Data Catalog. Choose from `Open CSV SerDe`, `Lazy Simple SerDe`, or `None`. Specify `None` when you want the crawler to do the detection. 

------

For more information, see [Writing custom classifiers for diverse data formats](custom-classifier.md).

## Viewing classifiers
<a name="view-classifiers-console"></a>

To see a list of all the classifiers that you have created, open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/), and choose the **Classifiers** tab.

The list displays the following properties about each classifier:
+ **Classifier** – The classifier name. When you create a classifier, you must provide a name for it.
+ **Classification** – The classification type of tables inferred by this classifier.
+ **Last updated** – The last time this classifier was updated.

## Managing classifiers
<a name="manage-classifiers-console"></a>

From the **Classifiers** list in the AWS Glue console, you can add, edit, and delete classifiers. To see more details for a classifier, choose the classifier name in the list. Details include the information you defined when you created the classifier. 

# Configuring a crawler
<a name="define-crawler"></a>

A crawler accesses your data store, identifies metadata, and creates table definitions in the AWS Glue Data Catalog. The **Crawlers** pane in the AWS Glue console lists all the crawlers that you create. The list displays status and metrics from the last run of your crawler.

 This topic contains the step-by-step process of configuring a crawler, covering essential aspects such as setting up the crawler's parameters, defining the data sources to crawl, setting up security, and managing the crawled data. 

**Topics**
+ [Step 1: Set crawler properties](define-crawler-set-crawler-properties.md)
+ [Step 2: Choose data sources and classifiers](define-crawler-choose-data-sources.md)
+ [Step 3: Configure security settings](define-crawler-configure-security-settings.md)
+ [Step 4: Set output and scheduling](define-crawler-set-output-and-scheduling.md)
+ [Step 5: Review and create](define-crawler-review.md)

# Step 1: Set crawler properties
<a name="define-crawler-set-crawler-properties"></a>

**To configure a crawler**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). Choose **Crawlers** in the navigation pane.

1.  Choose **Create crawler**, and follow the instructions in the **Add crawler** wizard. The wizard guides you through the steps required to create a crawler. If you want to add custom classifiers to define the schema, see [Defining and managing classifiers](add-classifier.md). 

1.  Enter a name for your crawler and, optionally, a description. You can also tag your crawler with a **Tag key** and optional **Tag value**. After they are created, tag keys are read-only. Use tags on resources to help you organize and identify them. For more information, see [AWS tags in AWS Glue](monitor-tags.md).   
**Name**  
The name can contain letters (A–Z), numbers (0–9), hyphens (-), or underscores (_), and can be up to 255 characters long.  
**Description**  
Descriptions can be up to 2048 characters long.  
**Tags**  
Use tags to organize and identify your resources. For more information, see the following:   
   + [AWS tags in AWS Glue](monitor-tags.md)

# Step 2: Choose data sources and classifiers
<a name="define-crawler-choose-data-sources"></a>

Next, configure the data sources and classifiers for the crawler.

For more information about supported data sources, see [Supported data sources for crawling](crawler-data-stores.md).

**Data source configuration**  
For **Is your data already mapped to AWS Glue tables?**, choose **Not yet** or **Yes**. By default, **Not yet** is selected.   
The crawler can access data stores directly as the source of the crawl, or it can use existing tables in the Data Catalog as the source. If the crawler uses existing catalog tables, it crawls the data stores that are specified by those catalog tables.   
+ Not yet: Select one or more data sources to be crawled. A crawler can crawl multiple data stores of different types (Amazon S3, JDBC, and so on).

  You can configure only one data store at a time. After you have provided the connection information and include paths and exclude patterns, you then have the option of adding another data store.
+ Yes: Select existing tables from your AWS Glue Data Catalog. The catalog tables specify the data stores to crawl. The crawler can crawl only catalog tables in a single run; it can't mix in other source types.

  A common reason to specify a catalog table as the source is when you create the table manually (because you already know the structure of the data store) and you want a crawler to keep the table updated, including adding new partitions. For a discussion of other reasons, see [Updating manually created Data Catalog tables using crawlers](tables-described.md#update-manual-tables).

  When you specify existing tables as the crawler source type, the following conditions apply:
  + Database name is optional.
  + Only catalog tables that specify Amazon S3, Amazon DynamoDB, or Delta Lake data stores are permitted.
  + No new catalog tables are created when the crawler runs. Existing tables are updated as needed, including adding new partitions.
  + Deleted objects found in the data stores are ignored; no catalog tables are deleted. Instead, the crawler writes a log message. (`SchemaChangePolicy.DeleteBehavior=LOG`)
  + The crawler configuration option to create a single schema for each Amazon S3 path is enabled by default and cannot be disabled. (`TableGroupingPolicy`=`CombineCompatibleSchemas`) For more information, see [Creating a single schema for each Amazon S3 include path](crawler-grouping-policy.md).
  + You can't mix catalog tables as a source with any other source types (for example Amazon S3 or Amazon DynamoDB).
  
 To use Delta tables, first create a Delta table using Athena DDL or the AWS Glue API.   
 Using Athena, set the location to your Amazon S3 folder and the table type to 'DELTA'.   

```
CREATE EXTERNAL TABLE database_name.table_name
LOCATION 's3://bucket/folder/'
TBLPROPERTIES ('table_type' = 'DELTA')
```
 Using the AWS Glue API, specify the table type within the table parameters map. The table parameters must include the following key/value pair. For more information on how to create a table, see the [Boto3 documentation for create_table](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue/client/create_table.html).   

```
{
    "table_type":"delta"
}
```
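
For illustration, a fuller `TableInput` for the `CreateTable` API might look like the following sketch. The table, column, bucket, and database names are placeholders; the essential piece is the `table_type` entry in `Parameters`:

```python
# Sketch of a Glue CreateTable input for a Delta table. All names are
# placeholders; the crawler requires the "table_type": "delta" parameter.
table_input = {
    "Name": "my_delta_table",
    "StorageDescriptor": {
        "Columns": [{"Name": "id", "Type": "bigint"}],
        "Location": "s3://amzn-s3-demo-bucket/delta-folder/",
    },
    "Parameters": {"table_type": "delta"},
}

# With boto3, this would be passed as:
#   boto3.client("glue").create_table(
#       DatabaseName="database_name", TableInput=table_input)
print(table_input["Parameters"]["table_type"])  # delta
```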

**Data sources**  
Select or add the list of data sources to be scanned by the crawler.  
 (Optional) If you choose JDBC as the data source, you can use your own JDBC drivers by specifying the connection where the driver information is stored. 

**Include path**  
 When evaluating what to include or exclude in a crawl, a crawler starts by evaluating the required include path. For Amazon S3, MongoDB, MongoDB Atlas, Amazon DocumentDB (with MongoDB compatibility), and relational data stores, you must specify an include path.     
For an Amazon S3 data store  
Choose whether to specify a path in this account or in a different account, and then browse to choose an Amazon S3 path.  
For Amazon S3 data stores, include path syntax is `bucket-name/folder-name/file-name.ext`. To crawl all objects in a bucket, you specify just the bucket name in the include path. The exclude pattern is relative to the include path.  
For a Delta Lake data store  
Specify one or more Amazon S3 paths to Delta tables as s3://*bucket*/*prefix*/*object*.  
For an Iceberg or Hudi data store  
Specify one or more Amazon S3 paths that contain folders with Iceberg or Hudi table metadata as s3://*bucket*/*prefix*.  
For Iceberg and Hudi data stores, the Iceberg/Hudi folder may be located in a child folder of the root folder. The crawler will scan all folders underneath a path for a Hudi folder.  
For a JDBC data store  
Enter *<database>*/*<schema>*/*<table>* or *<database>*/*<table>*, depending on the database product. Oracle Database and MySQL don’t support schema in the path. You can substitute the percent (%) character for *<schema>* or *<table>*. For example, for an Oracle database with a system identifier (SID) of `orcl`, enter `orcl/%` to import all tables to which the user named in the connection has access.  
This field is case-sensitive.
 If you choose to bring in your own JDBC driver versions, AWS Glue crawlers consume resources in AWS Glue jobs and Amazon S3 buckets to ensure your provided drivers are run in your environment. The additional usage of resources will be reflected in your account. Drivers are limited to the properties described in [Adding an AWS Glue connection](https://docs.aws.amazon.com/glue/latest/dg/console-connections.html).   
For a MongoDB, MongoDB Atlas, or Amazon DocumentDB data store  
For MongoDB, MongoDB Atlas, and Amazon DocumentDB (with MongoDB compatibility), the syntax is `database/collection`.
For JDBC data stores, the syntax is either `database-name/schema-name/table-name` or `database-name/table-name`. The syntax depends on whether the database engine supports schemas within a database. For example, for database engines such as MySQL or Oracle, don't specify a `schema-name` in your include path. You can substitute the percent sign (`%`) for a schema or table in the include path to represent all schemas or all tables in a database. You cannot substitute the percent sign (`%`) for database in the include path. 

**Maximum traversal depth (for Iceberg or Hudi data stores only)**  
Defines the maximum depth of the Amazon S3 path that the crawler can traverse to discover the Iceberg or Hudi metadata folder in your Amazon S3 path. The purpose of this parameter is to limit the crawler run time. The default value is 10 and the maximum is 20.

**Exclude patterns**  
These enable you to exclude certain files or tables from the crawl. The exclude path is relative to the include path. For example, to exclude a table in your JDBC data store, type the table name in the exclude path.   
A crawler connects to a JDBC data store using an AWS Glue connection that contains a JDBC URI connection string. The crawler only has access to objects in the database engine using the JDBC user name and password in the AWS Glue connection. *The crawler can only create tables that it can access through the JDBC connection.* After the crawler accesses the database engine with the JDBC URI, the include path is used to determine which tables in the database engine are created in the Data Catalog. For example, with MySQL, if you specify an include path of `MyDatabase/%`, then all tables within `MyDatabase` are created in the Data Catalog. When accessing Amazon Redshift, if you specify an include path of `MyDatabase/%`, then all tables within all schemas for database `MyDatabase` are created in the Data Catalog. If you specify an include path of `MyDatabase/MySchema/%`, then all tables in database `MyDatabase` and schema `MySchema` are created.   
After you specify an include path, you can then exclude objects from the crawl that your include path would otherwise include by specifying one or more Unix-style `glob` exclude patterns. These patterns are applied to your include path to determine which objects are excluded. These patterns are also stored as a property of tables created by the crawler. AWS Glue PySpark extensions, such as `create_dynamic_frame.from_catalog`, read the table properties and exclude objects defined by the exclude pattern.   
AWS Glue supports the following `glob` patterns in the exclude pattern.       
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/define-crawler-choose-data-sources.html)
AWS Glue interprets `glob` exclude patterns as follows:  
+ The slash (`/`) character is the delimiter to separate Amazon S3 keys into a folder hierarchy.
+ The asterisk (`*`) character matches zero or more characters of a name component without crossing folder boundaries.
+ A double asterisk (`**`) matches zero or more characters crossing folder or schema boundaries.
+ The question mark (`?`) character matches exactly one character of a name component.
+ The backslash (`\`) character is used to escape characters that otherwise can be interpreted as special characters. The expression `\\` matches a single backslash, and `\{` matches a left brace.
+ Brackets `[ ]` create a bracket expression that matches a single character of a name component out of a set of characters. For example, `[abc]` matches `a`, `b`, or `c`. The hyphen (`-`) can be used to specify a range, so `[a-z]` specifies a range that matches from `a` through `z` (inclusive). These forms can be mixed, so [`abce-g`] matches `a`, `b`, `c`, `e`, `f`, or `g`. If the character after the bracket (`[`) is an exclamation point (`!`), the bracket expression is negated. For example, `[!a-c]` matches any character except `a`, `b`, or `c`.

  Within a bracket expression, the `*`, `?`, and `\` characters match themselves. The hyphen (`-`) character matches itself if it is the first character within the brackets, or if it's the first character after the `!` when you are negating.
+ Braces (`{ }`) enclose a group of subpatterns, where the group matches if any subpattern in the group matches. A comma (`,`) character is used to separate the subpatterns. Groups cannot be nested.
+ Leading period or dot characters in file names are treated as normal characters in match operations. For example, the `*` exclude pattern matches the file name `.hidden`.
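
As a way to see these rules in action, the following simplified translation of a glob pattern to a regular expression shows how `*`, `**`, and `?` differ. It is a sketch only (bracket and brace forms are omitted) and is not the crawler's actual matcher:

```python
import re

def glob_to_regex(pattern: str) -> str:
    """Translate a subset of the crawler's glob syntax to a regex (sketch)."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")     # "**" crosses folder boundaries
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")  # "*" stays within one path component
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")   # "?" matches exactly one character
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "".join(out) + r"\Z"

print(bool(re.match(glob_to_regex("*.csv"), "john.csv")))            # True
print(bool(re.match(glob_to_regex("*.csv"), "employees/john.csv")))  # False: "*" stops at "/"
print(bool(re.match(glob_to_regex("**.csv"), "employees/john.csv"))) # True: "**" crosses it
```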

**Example Amazon S3 exclude patterns**  
Each exclude pattern is evaluated against the include path. For example, suppose that you have the following Amazon S3 directory structure:  

```
/mybucket/myfolder/
   departments/
      finance.json
      market-us.json
      market-emea.json
      market-ap.json
   employees/
      hr.json
      john.csv
      jane.csv
      juan.txt
```
Given the include path `s3://mybucket/myfolder/`, the following are some sample results for exclude patterns:    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/define-crawler-choose-data-sources.html)

**Example Excluding a subset of Amazon S3 partitions**  
Suppose that your data is partitioned by day, so that each day in a year is in a separate Amazon S3 partition. For January 2015, there are 31 partitions. Now, to crawl data for only the first week of January, you must exclude all partitions except days 1 through 7:  

```
 2015/01/{[!0],0[8-9]}**, 2015/0[2-9]/**, 2015/1[0-2]/**    
```
Take a look at the parts of this glob pattern. The first part, ` 2015/01/{[!0],0[8-9]}**`, excludes all days that don't begin with a "0", in addition to day 08 and day 09, from month 01 in year 2015. Notice that "**" is used as the suffix to the day number pattern and crosses folder boundaries to lower-level folders. If "*" is used, lower folder levels are not excluded.  
The second part, ` 2015/0[2-9]/**`, excludes days in months 02 to 09, in year 2015.  
The third part, `2015/1[0-2]/**`, excludes days in months 10, 11, and 12, in year 2015.

**Example JDBC exclude patterns**  
Suppose that you are crawling a JDBC database with the following schema structure:  

```
MyDatabase/MySchema/
   HR_us
   HR_fr
   Employees_Table
   Finance
   Market_US_Table
   Market_EMEA_Table
   Market_AP_Table
```
Given the include path `MyDatabase/MySchema/%`, the following are some sample results for exclude patterns:    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/define-crawler-choose-data-sources.html)

**Additional crawler source parameters**  
Each source type requires a different set of additional parameters.

**Connection**  
Select or add an AWS Glue connection. For information about connections, see [Connecting to data](glue-connections.md).

**Additional metadata - optional (for JDBC data stores)**  
Select additional metadata properties for the crawler to crawl.  
+ Comments: Crawl associated table level and column level comments.
+ Raw types: Persist the raw datatypes of the table columns in additional metadata. As a default behavior, the crawler translates the raw datatypes to Hive-compatible types.

**JDBC Driver Class name - optional (for JDBC data stores)**  
 Type a custom JDBC driver class name for the crawler to connect to the data source:   
+ Postgres: org.postgresql.Driver
+ MySQL: com.mysql.jdbc.Driver, com.mysql.cj.jdbc.Driver
+ Redshift: com.amazon.redshift.jdbc.Driver, com.amazon.redshift.jdbc42.Driver
+ Oracle: oracle.jdbc.driver.OracleDriver
+ SQL Server: com.microsoft.sqlserver.jdbc.SQLServerDriver

**JDBC Driver S3 Path - optional (for JDBC data stores)**  
Choose an existing Amazon S3 path to a `.jar` file. This is where the `.jar` file will be stored when using a custom JDBC driver for the crawler to connect to the data source.

**Enable data sampling (for Amazon DynamoDB, MongoDB, MongoDB Atlas, and Amazon DocumentDB data stores only)**  
Select whether to crawl a data sample only. If not selected, the entire table is crawled. Scanning all the records can take a long time when the table is not a high-throughput table.

**Create tables for querying (for Delta Lake data stores only)**  
Select how you want to create the Delta Lake tables:  
+ Create Native tables: Allow integration with query engines that support querying of the Delta transaction log directly.
+ Create Symlink tables: Create a symlink manifest folder with manifest files partitioned by the partition keys, based on the specified configuration parameters.

**Scanning rate - optional (for DynamoDB data stores only)**  
Specify the percentage of the DynamoDB table read capacity units for the crawler to use. Read capacity units is a term defined by DynamoDB, and is a numeric value that acts as a rate limiter for the number of reads that can be performed on that table per second. Enter a value between 0.1 and 1.5. If not specified, the value defaults to 0.5 for provisioned tables and to 1/4 of the maximum configured capacity for on-demand tables. Note that only provisioned capacity mode should be used with AWS Glue crawlers.  
For DynamoDB data stores, set the provisioned capacity mode for processing reads and writes on your tables. The AWS Glue crawler should not be used with the on-demand capacity mode.
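
As a back-of-the-envelope sketch of what the scanning rate means, with hypothetical numbers (this is plain arithmetic, not an AWS API call):

```python
# Hypothetical provisioned table: the crawler's scan is throttled to roughly
# scanning-rate x provisioned read capacity units (RCUs).
provisioned_rcu = 100   # hypothetical table capacity
scanning_rate = 0.5     # default for provisioned tables
crawler_read_budget = provisioned_rcu * scanning_rate
print(crawler_read_budget)  # 50.0 RCUs per second available to the crawl
```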

**Network connection - optional (for Amazon S3, Delta, Iceberg, Hudi and Catalog target data stores)**  
Optionally include a Network connection to use with this Amazon S3 target. Note that each crawler is limited to one Network connection so any other Amazon S3 targets will also use the same connection (or none, if left blank).  
For information about connections, see [Connecting to data](glue-connections.md).

**Sample only a subset of files and Sample size (for Amazon S3 data stores only)**  
Specify the number of files in each leaf folder to be crawled when crawling sample files in a dataset. When this feature is turned on, instead of crawling all the files in this dataset, the crawler randomly selects some files in each leaf folder to crawl.   
The sampling crawler is best suited for customers who have previous knowledge about their data formats and know that schemas in their folders do not change. Turning on this feature will significantly reduce crawler runtime.  
A valid value is an integer between 1 and 249. If not specified, all the files are crawled.

**Subsequent crawler runs**  
This field is a global field that affects all Amazon S3 data sources.  
+ Crawl all sub-folders: Crawl all folders again with every subsequent crawl.
+ Crawl new sub-folders only: Only Amazon S3 folders that were added since the last crawl will be crawled. If the schemas are compatible, new partitions will be added to existing tables. For more information, see [Scheduling incremental crawls for adding new partitions](incremental-crawls.md).
+ Crawl based on events: Rely on Amazon S3 events to control what folders to crawl. For more information, see [Accelerating crawls using Amazon S3 event notifications](crawler-s3-event-notifications.md).

**Custom classifiers - optional**  
Define custom classifiers before defining crawlers. A classifier checks whether a given file is in a format the crawler can handle. If it is, the classifier creates a schema in the form of a `StructType` object that matches that data format.  
For more information, see [Defining and managing classifiers](add-classifier.md).

# Step 3: Configure security settings
<a name="define-crawler-configure-security-settings"></a>

**IAM role**  
The crawler assumes this role. It must have permissions similar to the AWS managed policy `AWSGlueServiceRole`. For Amazon S3 and DynamoDB sources, it must also have permissions to access the data store. If the crawler reads Amazon S3 data encrypted with AWS Key Management Service (AWS KMS), then the role must have decrypt permissions on the AWS KMS key.   
For an Amazon S3 data store, additional permissions attached to the role would be similar to the following:     

```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::bucket/object*"
      ]
    }
  ]
}
```
For an Amazon DynamoDB data store, additional permissions attached to the role would be similar to the following:     

```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:DescribeTable",
        "dynamodb:Scan"
      ],
      "Resource": [
        "arn:aws:dynamodb:*:111122223333:table/table-name*"
      ]
    }
  ]
}
```
 To add your own JDBC driver, additional permissions are required.   
+  Grant permissions for the following job actions: `CreateJob`, `DeleteJob`, `GetJob`, `GetJobRun`, `StartJobRun`. 
+  Grant permissions for Amazon S3 actions: `s3:DeleteObjects`, `s3:GetObject`, `s3:ListBucket`, `s3:PutObject`. 
**Note**  
The `s3:ListBucket` permission is not needed if the Amazon S3 bucket policy is disabled.
+  Grant service principal access to bucket/folder in the Amazon S3 policy. 
 Example Amazon S3 policy:     

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket/driver-parent-folder/driver.jar",
                "arn:aws:s3:::amzn-s3-demo-bucket"
            ]
        }
    ]
}
```
 AWS Glue creates the following folders (`_crawler` and `_glue_job_crawler`) at the same level as the JDBC driver in your Amazon S3 bucket. For example, if the driver path is `<s3-path/driver_folder/driver.jar>`, then the following folders will be created if they do not already exist:   
+  `<s3-path/driver_folder/_crawler>` 
+  `<s3-path/driver_folder/_glue_job_crawler>` 
 Optionally, you can add a security configuration to a crawler to specify at-rest encryption options.  
For more information, see [Step 2: Create an IAM role for AWS Glue](create-an-iam-role.md) and [Identity and access management for AWS Glue](security-iam.md).

**Lake Formation configuration - optional**  
Allow the crawler to use Lake Formation credentials for crawling the data source.  
Checking **Use Lake Formation credentials for crawling S3 data source** allows the crawler to use Lake Formation credentials for crawling the data source. If the data source belongs to another account, you must provide the registered account ID. Otherwise, the crawler crawls only those data sources associated with the account. This setting is applicable only to Amazon S3 and Data Catalog data sources.

**Security configuration - optional**  
Settings include security configurations. For more information, see the following:   
+ [Encrypting data written by AWS Glue](encryption-security-configuration.md)
Once a security configuration has been set on a crawler, you can change it, but you cannot remove it. To lower the level of security on a crawler, explicitly set the security feature to `DISABLED` within your configuration, or create a new crawler.

# Step 4: Set output and scheduling
<a name="define-crawler-set-output-and-scheduling"></a>

**Output configuration**  
Options include how the crawler should handle detected schema changes, deleted objects in the data store, and more. For more information, see [Customizing crawler behavior](crawler-configuration.md).

**Crawler schedule**  
You can run a crawler on demand or define a time-based schedule for your crawlers and jobs in AWS Glue. The definition of these schedules uses the Unix-like cron syntax. For more information, see [Scheduling a crawler](schedule-crawler.md).

# Step 5: Review and create
<a name="define-crawler-review"></a>

Review the crawler settings you configured, and create the crawler.

# Scheduling a crawler
<a name="schedule-crawler"></a>

You can run an AWS Glue crawler on demand or on a regular schedule. When you set up a crawler based on a schedule, you can specify certain constraints, such as the frequency of the crawler runs, which days of the week it runs, and at what time. You can create these custom schedules in *cron* format. For more information, see [cron](http://en.wikipedia.org/wiki/Cron) in Wikipedia.

When setting up a crawler schedule, you should consider the features and limitations of cron. For example, if you choose to run your crawler on day 31 each month, keep in mind that some months don't have 31 days.

**Topics**
+ [Create a crawler schedule](create-crawler-schedule.md)
+ [Create a schedule for an existing crawler](Update-crawler-schedule.md)

# Create a crawler schedule
<a name="create-crawler-schedule"></a>

You can create a schedule for the crawler using the AWS Glue console or AWS CLI.

------
#### [ AWS Management Console ]

1. Sign in to the AWS Management Console, and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). 

1. Choose **Crawlers** in the navigation pane.

1. Follow steps 1-3 in the [Configuring a crawler](define-crawler.md) section.

1. In [Step 4: Set output and scheduling](define-crawler-set-output-and-scheduling.md), choose a **Crawler schedule** to set the frequency of the run. You can choose to run the crawler hourly, daily, weekly, or monthly, or define a custom schedule using a cron expression.

   A cron expression is a string representing a schedule pattern, consisting of six fields separated by spaces: `<minute> <hour> <day of month> <month> <day of week> <year>`.

   For example, to run a task every day at midnight, the cron expression is `0 0 * * ? *`.

   For more information, see [Cron expressions](https://docs.aws.amazon.com/glue/latest/dg/monitor-data-warehouse-schedule.html#CronExpressions).

1. Review the crawler settings you configured, and create the crawler to run on a schedule.

------
#### [ AWS CLI ]

```
aws glue create-crawler \
 --name myCrawler \
 --role AWSGlueServiceRole-myCrawler \
 --targets '{"S3Targets": [{"Path": "s3://amzn-s3-demo-bucket/"}]}' \
 --schedule 'cron(15 12 * * ? *)'
```

------

For more information about using cron to schedule jobs and crawlers, see [Time-based schedules for jobs and crawlers](monitor-data-warehouse-schedule.md). 

# Create a schedule for an existing crawler
<a name="Update-crawler-schedule"></a>

Follow these steps to set up a recurring schedule for an existing crawler.

------
#### [ AWS Management Console ]

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). 

1. Choose **Crawlers** in the navigation pane.

1. Choose a crawler that you want to schedule from the available list.

1. Choose **Edit** from the **Actions** menu.

1. Scroll down to **Step 4: Set output and scheduling**, and choose **Edit**. 

1.  Update your crawler schedule under **Crawler schedule**. 

1. Choose **Update**.

------
#### [ AWS CLI ]

Use the following CLI command to update an existing crawler configuration:

```
aws glue update-crawler-schedule \
   --crawler-name myCrawler \
   --schedule 'cron(15 12 * * ? *)'
```

------

# Viewing crawler results and details
<a name="console-crawlers-details"></a>

 After the crawler runs successfully, it creates table definitions in the Data Catalog. Choose **Tables** in the navigation pane to see the tables that were created by your crawler in the database that you specified. 

 You can view information related to the crawler itself as follows:
+ The **Crawlers** page on the AWS Glue console displays the following properties for a crawler:    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/console-crawlers-details.html)
+  To view the history of a crawler, choose **Crawlers** in the navigation pane to see the crawlers you created. Choose a crawler from the list of available crawlers. You can view the crawler properties and view the crawler history in the **Crawler runs** tab. 

   The **Crawler runs** tab displays information about each time the crawler ran, including **Start time (UTC)**, **End time (UTC)**, **Duration**, **Status**, **DPU hours**, and **Table changes**. 

  The **Crawler runs** tab displays only the crawls that have occurred since the launch date of the crawler history feature, and retains up to 12 months of crawls. Older crawls are not returned.
+ To see additional information, choose a tab in the crawler details page. Each tab will display information related to the crawler. 
  +  **Schedule**: Any schedules created for the crawler will be visible here. 
  +  **Data sources**: All data sources scanned by the crawler will be visible here. 
  +  **Classifiers**: All classifiers assigned to the crawler will be visible here. 
  +  **Tags**: Any tags created and assigned to an AWS resource will be visible here. 

# Parameters set on Data Catalog tables by crawler
<a name="table-properties-crawler"></a>

 These table properties are set by AWS Glue crawlers. We expect users to consume the `classification` and `compressionType` properties. Other properties, including table size estimates, are used for internal calculations, and we do not guarantee their accuracy or applicability to customer use cases. Changing these parameters may alter the behavior of the crawler; we do not support this workflow. 


| Property key | Property value | 
| --- | --- | 
| UPDATED_BY_CRAWLER | Name of the crawler performing the update. | 
| connectionName | The name of the connection in the Data Catalog that the crawler used to connect to the data store. | 
| recordCount | Estimated count of records in the table, based on file sizes and headers. | 
| skip.header.line.count | Number of rows skipped for the header. Set on tables classified as CSV. | 
| CrawlerSchemaSerializerVersion | For internal use | 
| classification | Format of the data, inferred by the crawler. For more information about data formats supported by AWS Glue crawlers, see [Built-in classifiers](add-classifier.md#classifier-built-in). | 
| CrawlerSchemaDeserializerVersion | For internal use | 
| sizeKey | Combined size of files in table crawled. | 
| averageRecordSize | Average size of row in table, in bytes. | 
| compressionType | Type of compression used on the data in the table. For more information about compression types supported by AWS Glue crawlers, see [Built-in classifiers](add-classifier.md#classifier-built-in). | 
| typeOfData | `file`, `table` or `view`. | 
| objectCount | Number of objects under Amazon S3 path for table. | 

 These additional table properties are set by AWS Glue crawlers for Snowflake data stores. 


| Property key | Property value | 
| --- | --- | 
| aws:RawTableLastAltered | Records the last altered timestamp of the Snowflake table. | 
| ViewOriginalText | View SQL statement. | 
| ViewExpandedText | View SQL statement encoded in Base64 format. | 
| ExternalTable:S3Location | Amazon S3 location of the Snowflake external table. | 
| ExternalTable:FileFormat | Amazon S3 file format of the Snowflake external table. | 

 These additional table properties are set by AWS Glue crawlers for JDBC-type data stores such as Amazon Redshift, Microsoft SQL Server, MySQL, PostgreSQL, and Oracle. 


| Property key | Property value | 
| --- | --- | 
| aws:RawType | When a crawler stores data in the Data Catalog, it translates the data types to Hive-compatible types, which often loses information about the native data type. The crawler outputs the `aws:RawType` parameter to provide the native-level data type. | 
| aws:RawColumnComment | If a comment is associated with a column in the database, the crawler outputs the corresponding comment in the catalog table. The comment string is truncated to 255 bytes. Comments are not supported for Microsoft SQL Server.  | 
| aws:RawTableComment | If a comment is associated with a table in the database, the crawler outputs the corresponding comment in the catalog table. The comment string is truncated to 255 bytes. Comments are not supported for Microsoft SQL Server. | 

# Customizing crawler behavior
<a name="crawler-configuration"></a>

When you configure an AWS Glue crawler, you have several options for defining the behavior of your crawler.
+ **Incremental crawls** – You can configure a crawler to run incremental crawls to add only new partitions to the table schema. 
+ **Partition indexes** – A crawler creates partition indexes for Amazon S3 and Delta Lake targets by default to provide efficient lookup for specific partitions.
+ **Accelerate crawl time by using Amazon S3 events** – You can configure a crawler to use Amazon S3 events to identify the changes between two crawls by listing all the files from the subfolder that triggered the event instead of listing the full Amazon S3 or Data Catalog target.
+ **Handling schema changes** – You can prevent a crawler from making any schema changes to the existing schema. You can use the AWS Management Console or the AWS Glue API to configure how your crawler processes certain types of changes. 
+ **A single schema for multiple Amazon S3 paths** – You can configure a crawler to create a single schema for each S3 path if the data is compatible.
+ **Table location and partitioning levels** – The table level crawler option provides you the flexibility to tell the crawler where the tables are located, and how you want partitions created. 
+ **Table threshold** – You can specify the maximum number of tables the crawler is allowed to create by specifying a table threshold.
+ **AWS Lake Formation credentials** – You can configure a crawler to use Lake Formation credentials to access an Amazon S3 data store or a Data Catalog table with an underlying Amazon S3 location within the same AWS account or another AWS account. 

 For more information about using the AWS Glue console to add a crawler, see [Configuring a crawler](define-crawler.md). 

**Topics**
+ [Scheduling incremental crawls for adding new partitions](incremental-crawls.md)
+ [Generating partition indexes](crawler-configure-partition-indexes.md)
+ [Preventing a crawler from changing an existing schema](crawler-schema-changes-prevent.md)
+ [Creating a single schema for each Amazon S3 include path](crawler-grouping-policy.md)
+ [Specifying the table location and partitioning level](crawler-table-level.md)
+ [Specifying the maximum number of tables the crawler is allowed to create](crawler-maximum-number-of-tables.md)
+ [Configuring a crawler to use Lake Formation credentials](crawler-lf-integ.md)
+ [Accelerating crawls using Amazon S3 event notifications](crawler-s3-event-notifications.md)

# Scheduling incremental crawls for adding new partitions
<a name="incremental-crawls"></a>

You can configure an AWS Glue crawler to run incremental crawls to add only new partitions to the table schema. When the crawler runs for the first time, it performs a full crawl, processing the entire data source to record the complete schema and all existing partitions in the AWS Glue Data Catalog.

Subsequent crawls after the initial full crawl will be incremental, where the crawler identifies and adds only the new partitions that have been introduced since the previous crawl. This approach results in faster crawl times, as the crawler no longer needs to process the entire data source for each run, but instead focuses only on the new partitions. 

**Note**  
Incremental crawls don't detect modifications or deletions of existing partitions. This configuration is best suited for data sources with a stable schema. If a one-time major schema change occurs, it is advisable to temporarily set the crawler to perform a full crawl to capture the new schema accurately, and then switch back to incremental crawling mode. 

The following diagram shows that with the incremental crawl setting enabled, the crawler will only detect and add the newly added folder, month=March, to the catalog.

![\[The following diagram shows that files for the month of March have been added.\]](http://docs.aws.amazon.com/glue/latest/dg/images/crawlers-s3-folders-new.png)


Follow these steps to update your crawler to perform incremental crawls:

------
#### [ AWS Management Console ]

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Choose **Crawlers** under the **Data Catalog**.

1. Choose a crawler that you want to set up to crawl incrementally.

1. Choose **Edit**.

1. Choose **Step 2. Choose data sources and classifiers**.

1. Choose the data source that you want to incrementally crawl. 

1. Choose **Edit**.

1. Choose **Crawl new sub-folders only** under **Subsequent crawler runs**.

1. Choose **Update**.

To create a schedule for a crawler, see [Scheduling a crawler](schedule-crawler.md).

------
#### [ AWS CLI ]

```
aws glue update-crawler \
 --name myCrawler \
 --recrawl-policy RecrawlBehavior=CRAWL_NEW_FOLDERS_ONLY \
 --schema-change-policy UpdateBehavior=LOG,DeleteBehavior=LOG
```
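
The same recrawl policy can be set through an AWS SDK; the following is a minimal Python sketch mirroring the CLI call above (the crawler name is a placeholder). Note that incremental crawling requires both schema change behaviors to be `LOG`:

```python
# Arguments that mirror the CLI call above.
update_args = {
    "Name": "myCrawler",  # placeholder crawler name
    "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
    # Incremental crawls force update and delete behavior to LOG.
    "SchemaChangePolicy": {"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
}

# Hypothetical call -- requires AWS credentials and an existing crawler:
# import boto3
# boto3.client("glue").update_crawler(**update_args)
```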

------

**Notes and restrictions**  
When this option is turned on, you can't change the Amazon S3 target data stores when editing the crawler. The option also forces the update behavior and delete behavior of the crawler to `LOG`. This means that:
+ If the crawler discovers objects with incompatible schemas, it doesn't add the objects to the Data Catalog; instead, it records this detail in CloudWatch Logs.
+ The crawler doesn't update deleted objects in the Data Catalog.

# Generating partition indexes
<a name="crawler-configure-partition-indexes"></a>

The Data Catalog supports creating partition indexes to provide efficient lookup for specific partitions. For more information, see [Creating partition indexes](https://docs.aws.amazon.com/glue/latest/dg/partition-indexes.html). The AWS Glue crawler creates partition indexes for Amazon S3 and Delta Lake targets by default.

------
#### [ AWS Management Console ]

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Choose **Crawlers** under the **Data Catalog**.

1. When you define a crawler, the option **Create partition indexes automatically** is enabled by default under **Advanced options** on the **Set output and scheduling** page.

   To disable this option, clear the **Create partition indexes automatically** checkbox in the console. 

1. Complete the crawler configuration and choose **Create crawler**.

------
#### [ AWS CLI ]

 You can also disable this option by using the AWS CLI. Set `CreatePartitionIndex` to `false` in the `configuration` parameter. The default value is `true`.

```
aws glue update-crawler \
    --name myCrawler \
    --configuration '{"Version": 1.0, "CreatePartitionIndex": false }'
```

------

## Usage notes for partition indexes
<a name="crawler-configure-partition-indexes-usage-notes"></a>
+ Tables created by the crawler do not have the variable `partition_filtering.enabled` by default. For more information, see [AWS Glue partition indexing and filtering](https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#glue-best-practices-partition-index).
+ Creating partition indexes for encrypted partitions is not supported.

# Preventing a crawler from changing an existing schema
<a name="crawler-schema-changes-prevent"></a>

 You can prevent AWS Glue crawlers from making any schema changes to the Data Catalog when they run. By default, crawlers update the schema in the Data Catalog to match the data source being crawled. However, in some cases, you might want to prevent the crawler from modifying the existing schema, especially if you have transformed or cleaned the data and don't want the original schema to overwrite those changes.

 Follow these steps to configure your crawler not to overwrite the existing schema in a table definition. 

------
#### [  AWS Management Console  ]

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Choose **Crawlers** under the **Data Catalog**.

1. Choose a crawler from the list, and choose **Edit**.

1. Choose **Step 4: Set output and scheduling**.

1. Under **Advanced options**, choose **Add new columns only** or **Ignore the change and don't update the table in the Data Catalog**. 

1.  You can also set a configuration option to **Update all new and existing partitions with metadata from the table**. This sets partition schemas to inherit from the table. 

1. Choose **Update**.

------
#### [ AWS CLI ]

The following example shows how to configure a crawler to not change existing schema, only add new columns:

```
aws glue update-crawler \
  --name myCrawler \
  --configuration '{"Version": 1.0, "CrawlerOutput": {"Tables": {"AddOrUpdateBehavior": "MergeNewColumns"}}}'
```

The following example shows how to configure a crawler to not change the existing schema, and not add new columns:

```
aws glue update-crawler \
  --name myCrawler \
  --schema-change-policy UpdateBehavior=LOG \
  --configuration '{"Version": 1.0, "CrawlerOutput": {"Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }}}'
```

------
#### [ API ]

If you don't want a table schema to change at all when a crawler runs, set the schema change policy to `LOG`. 

When you configure the crawler using the API, set the following parameters:
+ Set the `UpdateBehavior` field in `SchemaChangePolicy` structure to `LOG`.
+  Set the `Configuration` field with a string representation of the following JSON object in the crawler API; for example: 

  ```
  {
     "Version": 1.0,
     "CrawlerOutput": {
        "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }
     }
  }
  ```
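
Because the `Configuration` field takes a string, hand-escaping the JSON is error-prone. One way to build that string safely, sketched here in Python, is to serialize a dict with `json.dumps`:

```python
import json

# Serialize a dict instead of hand-writing the escaped string.
configuration = json.dumps({
    "Version": 1.0,
    "CrawlerOutput": {
        "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}
    },
})

# Pass `configuration` as the Configuration field of the crawler API.
# It round-trips back to the same structure:
assert json.loads(configuration)["Version"] == 1.0
```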

------

# Creating a single schema for each Amazon S3 include path
<a name="crawler-grouping-policy"></a>

By default, when a crawler defines tables for data stored in Amazon S3, it considers both data compatibility and schema similarity. Data compatibility factors that it considers include whether the data is of the same format (for example, JSON), the same compression type (for example, GZIP), the structure of the Amazon S3 path, and other data attributes. Schema similarity is a measure of how closely the schemas of separate Amazon S3 objects are similar.

To help illustrate this option, suppose that you define a crawler with an include path `s3://amzn-s3-demo-bucket/table1/`. When the crawler runs, it finds two JSON files with the following characteristics:
+ **File 1** – `s3://amzn-s3-demo-bucket/table1/year=2017/data1.json`
  + *File content* – `{"A": 1, "B": 2}`
  + *Schema* – `A:int, B:int`
+ **File 2** – `s3://amzn-s3-demo-bucket/table1/year=2018/data2.json`
  + *File content* – `{"C": 3, "D": 4}`
  + *Schema* – `C: int, D: int`

By default, the crawler creates two tables, named `year_2017` and `year_2018`, because the schemas are not sufficiently similar. However, if the option **Create a single schema for each S3 path** is selected, and if the data is compatible, the crawler creates one table. The table has the schema `A:int,B:int,C:int,D:int` and `partitionKey` `year:string`.
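
The combined result in this example can be illustrated with a short sketch (an illustration of the merged outcome, not Glue's actual merging algorithm):

```python
# Schemas inferred from the two files (column name -> type).
schema_2017 = {"A": "int", "B": "int"}
schema_2018 = {"C": "int", "D": "int"}

# With "Create a single schema for each S3 path", compatible schemas
# are combined into one table schema; the year folder becomes a
# partition key rather than a table boundary.
combined = {**schema_2017, **schema_2018}
partition_keys = {"year": "string"}

assert combined == {"A": "int", "B": "int", "C": "int", "D": "int"}
```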

------
#### [ AWS Management Console ]

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Choose **Crawlers** under the **Data Catalog**.

1. When you configure a new crawler, under **Output and scheduling**, select the option **Create a single schema for each S3 path** under **Advanced options**. 

------
#### [ AWS CLI ]

You can configure a crawler to `CombineCompatibleSchemas` into a common table definition when possible. With this option, the crawler still considers data compatibility, but ignores the similarity of the specific schemas when evaluating Amazon S3 objects in the specified include path.

When you configure the crawler using the AWS CLI, set the following configuration option:

```
aws glue update-crawler \
   --name myCrawler \
   --configuration '{"Version": 1.0, "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas" }}'
```

------
#### [ API ]

When you configure the crawler using the API, set the following configuration option:

 Set the `Configuration` field with a string representation of the following JSON object in the crawler API; for example: 

```
{
   "Version": 1.0,
   "Grouping": {
      "TableGroupingPolicy": "CombineCompatibleSchemas" }
}
```

------

# Specifying the table location and partitioning level
<a name="crawler-table-level"></a>

By default, when a crawler defines tables for data stored in Amazon S3 the crawler attempts to merge schemas together, and create top-level tables (`year=2019`). In some cases, you may expect the crawler to create a table for the folder `month=Jan` but instead the crawler creates a partition since a sibling folder (`month=Mar`) was merged into the same table.

The table level crawler option provides you the flexibility to tell the crawler where the tables are located, and how you want partitions created. When you specify a **Table level**, the table is created at that absolute level from the Amazon S3 bucket.

![\[Crawler grouping with table level specified as level 2.\]](http://docs.aws.amazon.com/glue/latest/dg/images/crawler-table-level1.jpg)


 When configuring the crawler on the console, you can specify a value for the **Table level** crawler option. The value must be a positive integer that indicates the table location (the absolute level in the dataset). The level for the top level folder is 1. For example, for the path `mydataset/year/month/day/hour`, if the level is set to 3, the table is created at location `mydataset/year/month`. 
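
The effect of the table level on the resulting table location can be sketched as follows (a hypothetical helper for illustration, not part of the AWS Glue API):

```python
def table_location(path: str, table_level: int) -> str:
    """Return the prefix at which the table is created.
    The top-level folder is level 1, so level N keeps the first N folders."""
    folders = path.strip("/").split("/")
    return "/".join(folders[:table_level])

# For mydataset/year/month/day/hour with the level set to 3,
# the table is created at mydataset/year/month:
assert table_location("mydataset/year/month/day/hour", 3) == "mydataset/year/month"
```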

------
#### [ AWS Management Console ]

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Choose **Crawlers** under the **Data Catalog**.

1. When you configure a crawler, under **Output and scheduling**, choose **Table level** under **Advanced options**.

![\[Specifying a table level in the crawler configuration.\]](http://docs.aws.amazon.com/glue/latest/dg/images/crawler-configuration-console.png)


------
#### [ AWS CLI ]

When you configure the crawler using the AWS CLI, set the `configuration` parameter as shown in the example code: 

```
aws glue update-crawler \
  --name myCrawler \
  --configuration '{"Version": 1.0, "Grouping": { "TableLevelConfiguration": 2 }}'
```

------
#### [ API ]

When you configure the crawler using the API, set the `Configuration` field with a string representation of the following JSON object; for example: 

```
{
   "Version": 1.0,
   "Grouping": {
      "TableLevelConfiguration": 2
   }
}
```

------
#### [ CloudFormation ]

In this example, you set the **Table level** option available in the console within your CloudFormation template:

```
"Configuration": "{
    \"Version\":1.0,
    \"Grouping\":{\"TableLevelConfiguration\":2}
}"
```

------

# Specifying the maximum number of tables the crawler is allowed to create
<a name="crawler-maximum-number-of-tables"></a>

You can optionally specify the maximum number of tables the crawler is allowed to create by specifying a `TableThreshold` via the AWS Glue console or AWS CLI. If the number of tables detected by the crawler during its crawl is greater than this value, the crawl fails and no data is written to the Data Catalog.

This parameter is useful when the tables that would be detected and created by the crawler are many more than you expect. There can be multiple reasons for this, such as:
+ When using an AWS Glue job to populate your Amazon S3 locations, you can end up with empty files at the same level as a folder. When you run a crawler on such an Amazon S3 location, the crawler creates multiple tables because of the files and folders present at the same level.
+ If you do not configure `"TableGroupingPolicy": "CombineCompatibleSchemas"` you may end up with more tables than expected. 

You specify the `TableThreshold` as an integer value greater than 0. This value is configured on a per-crawler basis and is considered for every crawl. For example, suppose a crawler has the `TableThreshold` value set to 5. In each crawl, AWS Glue compares the number of tables detected with this table threshold value (5). If the number of tables detected is greater than 5, the crawl fails without writing to the Data Catalog; otherwise, AWS Glue writes the tables to the Data Catalog.
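
The threshold check can be sketched as follows (an illustration of the documented behavior, not Glue's internal code):

```python
def check_table_threshold(tables_detected: int, table_threshold: int) -> bool:
    """Return True if the crawl may write to the Data Catalog."""
    if tables_detected > table_threshold:
        # Mirrors the kind of error the crawler logs before failing.
        print(f"The number of tables detected by crawler: {tables_detected} "
              f"is greater than the table threshold value provided: "
              f"{table_threshold}. Failing crawler without writing to "
              "Data Catalog.")
        return False
    return True

assert check_table_threshold(4, 5) is True     # under the threshold: writes
assert check_table_threshold(29, 28) is False  # over the threshold: fails
```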

------
#### [ AWS Management Console ]

**To set `TableThreshold` using the AWS Management Console:**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. When configuring a crawler, in **Output and scheduling**, set the **Maximum table threshold** to the maximum number of tables the crawler is allowed to generate.  
![\[The Output and scheduling section of the AWS console showing the Maximum table threshold parameter.\]](http://docs.aws.amazon.com/glue/latest/dg/images/crawler-max-tables.png)

------
#### [ AWS CLI ]

To set `TableThreshold` using the AWS CLI:

```
aws glue update-crawler \
    --name myCrawler \
    --configuration '{"Version": 1.0, "CrawlerOutput": {"Tables": { "TableThreshold": 5 }}}'
```

------
#### [ API ]

To set `TableThreshold` using the API, set the `Configuration` field with a string representation of the following JSON object:

```
"{"Version":1.0,
"CrawlerOutput":
{"Tables":{"AddOrUpdateBehavior":"MergeNewColumns",
"TableThreshold":5}}}";
```

------

Error messages are logged to help you identify table paths and clean up your data. The following is an example log entry in your account when the crawler fails because the table count was greater than the table threshold value provided:

```
Table Threshold value = 28, Tables detected - 29
```

In CloudWatch, we log all table locations detected as an INFO message. An error is logged as the reason for the failure.

```
ERROR com.amazonaws.services.glue.customerLogs.CustomerLogService - CustomerLogService received CustomerFacingException with message 
The number of tables detected by crawler: 29 is greater than the table threshold value provided: 28. Failing crawler without writing to Data Catalog.
com.amazonaws.services.glue.exceptions.CustomerFacingInternalException: The number of tables detected by crawler: 29 is greater than the table threshold value provided: 28. 
Failing crawler without writing to Data Catalog.
```

# Configuring a crawler to use Lake Formation credentials
<a name="crawler-lf-integ"></a>

You can configure a crawler to use AWS Lake Formation credentials to access an Amazon S3 data store or a Data Catalog table with an underlying Amazon S3 location within the same AWS account or another AWS account. You can configure an existing Data Catalog table as a crawler's target, if the crawler and the Data Catalog table reside in the same account. Currently, only a single catalog target with a single catalog table is allowed when using a Data Catalog table as a crawler’s target.

**Note**  
When you are defining a Data Catalog table as a crawler target, make sure that the underlying location of the Data Catalog table is an Amazon S3 location. Crawlers that use Lake Formation credentials only support Data Catalog targets with underlying Amazon S3 locations.

## Setup required when the crawler and registered Amazon S3 location or Data Catalog table reside in the same account (in-account crawling)
<a name="in-account-crawling"></a>

To allow the crawler to access a data store or Data Catalog table by using Lake Formation credentials, you need to register the data location with Lake Formation. Also, the crawler's IAM role must have permissions to read the data from the destination where the Amazon S3 bucket is registered.

You can complete the following configuration steps using the AWS Management Console or AWS Command Line Interface (AWS CLI).

------
#### [ AWS Management Console ]

1. Before configuring a crawler to access the crawler source, register the data location of the data store or the Data Catalog with Lake Formation. In the Lake Formation console ([https://console.aws.amazon.com/lakeformation/](https://console.aws.amazon.com/lakeformation/)), register an Amazon S3 location as the root location of your data lake in the AWS account where the crawler is defined. For more information, see [Registering an Amazon S3 location](https://docs.aws.amazon.com/lake-formation/latest/dg/register-location.html).

1. Grant **Data location** permissions to the IAM role that's used for the crawler run so that the crawler can read the data from the destination in Lake Formation. For more information, see [Granting data location permissions (same account)](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-location-permissions-local.html).

1. Grant the crawler role access permissions (`Create`) to the database, which is specified as the output database. For more information, see [Granting database permissions using the Lake Formation console and the named resource method](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-database-permissions.html).

1. In the IAM console ([https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/)), create an IAM role for the crawler. Add the `lakeformation:GetDataAccess` policy to the role.

1. In the AWS Glue console ([https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/)), while configuring the crawler, select the option **Use Lake Formation credentials for crawling Amazon S3 data source**.
**Note**  
The accountId field is optional for in-account crawling.

------
#### [ AWS CLI ]

```
aws glue --profile demo create-crawler --debug --cli-input-json '{
    "Name": "prod-test-crawler",
    "Role": "arn:aws:iam::111122223333:role/service-role/AWSGlueServiceRole-prod-test-run-role",
    "DatabaseName": "prod-run-db",
    "Description": "",
    "Targets": {
    "S3Targets":[
                {
                 "Path": "s3://amzn-s3-demo-bucket"
                }
                ]
                },
   "SchemaChangePolicy": {
      "UpdateBehavior": "LOG",
      "DeleteBehavior": "LOG"
  },
  "RecrawlPolicy": {
    "RecrawlBehavior": "CRAWL_EVERYTHING"
  },
  "LineageConfiguration": {
    "CrawlerLineageSettings": "DISABLE"
  },
  "LakeFormationConfiguration": {
    "UseLakeFormationCredentials": true,
    "AccountId": "111122223333"
  },
  "Configuration": {
           "Version": 1.0,
           "CrawlerOutput": {
             "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" },
             "Tables": {"AddOrUpdateBehavior": "MergeNewColumns" }
           },
           "Grouping": { "TableGroupingPolicy": "CombineCompatibleSchemas" }
         },
  "CrawlerSecurityConfiguration": "",
  "Tags": {
    "KeyName": ""
  }
}'
```

------

## Setup required when the crawler and registered Amazon S3 location reside in different accounts (cross-account crawling)
<a name="cross-account-crawling"></a>

To allow the crawler to access a data store in a different account using Lake Formation credentials, you must first register the Amazon S3 data location with Lake Formation. Then, you grant data location permissions to the crawler's account by taking the following steps.

You can complete the following steps using the AWS Management Console or AWS CLI.

------
#### [ AWS Management Console ]

1. In the account where the Amazon S3 location is registered (account B):

   1. Register an Amazon S3 path with Lake Formation. For more information, see [Registering Amazon S3 location](https://docs.aws.amazon.com/lake-formation/latest/dg/register-location.html).

   1.  Grant **Data location** permissions to the account (account A) where the crawler will be run. For more information, see [Grant data location permissions](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-location-permissions-local.html). 

   1. Create an empty database in Lake Formation with the underlying location as the target Amazon S3 location. For more information, see [Creating a database](https://docs.aws.amazon.com/lake-formation/latest/dg/creating-database.html).

   1. Grant account A (the account where the crawler will be run) access to the database that you created in the previous step. For more information, see [Granting database permissions](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-database-permissions.html). 

1. In the account where the crawler is created and will be run (account A):

   1.  Using the AWS RAM console, accept the database that was shared from the external account (account B). For more information, see [Accepting a resource share invitation from AWS Resource Access Manager](https://docs.aws.amazon.com/lake-formation/latest/dg/accepting-ram-invite.html). 

   1.  Create an IAM role for the crawler. Add the `lakeformation:GetDataAccess` policy to the role.

   1.  In the Lake Formation console ([https://console.aws.amazon.com/lakeformation/](https://console.aws.amazon.com/lakeformation/)), grant **Data location** permissions on the target Amazon S3 location to the IAM role used for the crawler run so that the crawler can read the data from the destination in Lake Formation. For more information, see [Granting data location permissions](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-location-permissions-local.html). 

   1.  Create a resource link on the shared database. For more information, see [Create a resource link](https://docs.aws.amazon.com/lake-formation/latest/dg/create-resource-link-database.html). 

   1.  Grant the crawler role access permissions (`Create`) on the shared database and (`Describe`) the resource link. The resource link is specified in the output for the crawler. 

   1.  In the AWS Glue console ([https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/)), while configuring the crawler, select the option **Use Lake Formation credentials for crawling Amazon S3 data source**.

      For cross-account crawling, specify the AWS account ID where the target Amazon S3 location is registered with Lake Formation. For in-account crawling, the accountId field is optional.   
![\[IAM role selection and Lake Formation configuration options for AWS Glue crawler security settings.\]](http://docs.aws.amazon.com/glue/latest/dg/images/cross-account-crawler.png)

------
#### [ AWS CLI ]

```
aws glue --profile demo create-crawler --debug --cli-input-json '{
    "Name": "prod-test-crawler",
    "Role": "arn:aws:iam::111122223333:role/service-role/AWSGlueServiceRole-prod-test-run-role",
    "DatabaseName": "prod-run-db",
    "Description": "",
    "Targets": {
        "S3Targets": [
            {
                "Path": "s3://amzn-s3-demo-bucket"
            }
        ]
    },
    "SchemaChangePolicy": {
        "UpdateBehavior": "LOG",
        "DeleteBehavior": "LOG"
    },
    "RecrawlPolicy": {
        "RecrawlBehavior": "CRAWL_EVERYTHING"
    },
    "LineageConfiguration": {
        "CrawlerLineageSettings": "DISABLE"
    },
    "LakeFormationConfiguration": {
        "UseLakeFormationCredentials": true,
        "AccountId": "111111111111"
    },
    "Configuration": {
        "Version": 1.0,
        "CrawlerOutput": {
            "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" },
            "Tables": { "AddOrUpdateBehavior": "MergeNewColumns" }
        },
        "Grouping": { "TableGroupingPolicy": "CombineCompatibleSchemas" }
    },
    "CrawlerSecurityConfiguration": "",
    "Tags": {
        "KeyName": ""
    }
}'
```

------

**Note**  
A crawler using Lake Formation credentials is only supported for Amazon S3 and Data Catalog targets.
For targets using Lake Formation credential vending, the underlying Amazon S3 locations must belong to the same bucket. For example, customers can use multiple targets (s3://amzn-s3-demo-bucket1/folder1, s3://amzn-s3-demo-bucket1/folder2) as long as all target locations are under the same bucket (amzn-s3-demo-bucket1). Specifying different buckets (s3://amzn-s3-demo-bucket1/folder1, s3://amzn-s3-demo-bucket2/folder2) is not allowed.
Currently for Data Catalog target crawlers, only a single catalog target with a single catalog table is allowed.
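
The single-bucket constraint described in the note can be checked before you create the crawler. The following is a minimal sketch; the helper name is hypothetical.

```python
def same_bucket(s3_paths):
    """Return True if every s3:// path is under the same bucket, as
    required for targets using Lake Formation credential vending."""
    buckets = {p.split("://", 1)[-1].split("/", 1)[0] for p in s3_paths}
    return len(buckets) == 1

# Allowed: both targets live in amzn-s3-demo-bucket1.
print(same_bucket(["s3://amzn-s3-demo-bucket1/folder1",
                   "s3://amzn-s3-demo-bucket1/folder2"]))  # True
# Not allowed: targets span two buckets.
print(same_bucket(["s3://amzn-s3-demo-bucket1/folder1",
                   "s3://amzn-s3-demo-bucket2/folder2"]))  # False
```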

# Accelerating crawls using Amazon S3 event notifications
<a name="crawler-s3-event-notifications"></a>

Instead of listing all the objects in an Amazon S3 or Data Catalog target, you can configure the crawler to use Amazon S3 events to find changes. This feature improves recrawl time: the crawler uses Amazon S3 events to identify what changed between two crawls, and lists only the files in the subfolders that triggered events instead of listing the full Amazon S3 or Data Catalog target.

The first crawl lists all Amazon S3 objects from the target. After the first successful crawl, you can choose to recrawl manually or on a set schedule. On each recrawl, the crawler lists only the objects referenced by the queued events instead of listing all objects.

When the target is a Data Catalog table, the crawler updates the existing tables in the Data Catalog with changes (for example, extra partitions in a table).

The advantages of moving to an Amazon S3 event-based crawler are:
+ Faster recrawls, because the crawler doesn't list all the objects in the target; it lists only the specific folders where objects were added or deleted.
+ A lower overall crawl cost, because only the specific folders where objects were added or deleted are listed.

The Amazon S3 event crawl runs by consuming Amazon S3 events from the SQS queue based on the crawler schedule. There is no cost if there are no events in the queue. Amazon S3 events can be configured to go directly to the SQS queue, or, in cases where multiple consumers need the same event, through a combination of SNS and SQS. For more information, see [Setting up your account for Amazon S3 event notifications](#crawler-s3-event-notifications-setup).

After you create and configure the crawler in event mode, the first crawl runs in listing mode by performing a full listing of the Amazon S3 or Data Catalog target. After the first successful crawl, the following log confirms that the crawl is consuming Amazon S3 events: "The crawl is running by consuming Amazon S3 events."

If, after creating the Amazon S3 event crawler, you update crawler properties that can impact the crawl, the next crawl operates in list mode and the following log is added: "Crawl is not running in S3 event mode".

**Note**  
The maximum number of messages to consume is 100,000 messages per crawl.

## Considerations and limitations
<a name="s3event-crawler-limitations"></a>

The following considerations and limitations apply when you configure a crawler to use Amazon S3 event notifications to find any changes. 
+  **Important behavior with deleted partitions** 

  When using Amazon S3 event crawlers with Data Catalog tables:
  +  If you delete a partition using the `DeletePartition` API call, you must also delete all S3 objects under that partition, and select **All object removal events** when you configure your S3 event notifications. If deletion events are not configured, the crawler recreates the deleted partition during its next run. 
+ Only a single target is supported by the crawler, whether for Amazon S3 or Data Catalog targets.
+ SQS on private VPC is not supported.
+ Amazon S3 sampling is not supported.
+ The crawler target should be a folder for an Amazon S3 target, or one or more AWS Glue Data Catalog tables for a Data Catalog target.
+ The 'everything' path wildcard is not supported: `s3://%`
+ For a Data Catalog target, all catalog tables should point to the same Amazon S3 bucket for Amazon S3 event mode.
+ For a Data Catalog target, a catalog table should not point to an Amazon S3 location in the Delta Lake format (containing `_symlink` folders, or checking the catalog table's `InputFormat`).
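
The deleted-partition behavior above means a cleanup must remove both the catalog partition and the underlying Amazon S3 objects. The following minimal sketch derives the partition's Amazon S3 prefix, assuming the Hive-style `key=value` folder layout; the helper name is hypothetical, and the commented lines show where the accompanying `DeletePartition` and object-deletion calls would go.

```python
def partition_prefix(table_location, partition_keys, partition_values):
    """Derive the Hive-style S3 prefix for a partition, for example
    s3://bucket/Sales/ with keys [year, month] and values [2017, feb]
    becomes s3://bucket/Sales/year=2017/month=feb/."""
    base = table_location.rstrip("/")
    parts = [f"{k}={v}" for k, v in zip(partition_keys, partition_values)]
    return base + "/" + "/".join(parts) + "/"

prefix = partition_prefix("s3://amzn-s3-demo-bucket/Sales/",
                          ["year", "month", "day"], ["2017", "feb", "4"])
# After glue.delete_partition(DatabaseName=..., TableName=...,
#     PartitionValues=["2017", "feb", "4"]), every object under `prefix`
# must also be deleted, or the crawler recreates the partition on its
# next run (for example, via s3.list_objects_v2 and s3.delete_objects).
```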

**Topics**
+ [Considerations and limitations](#s3event-crawler-limitations)
+ [Setting up your account for Amazon S3 event notifications](#crawler-s3-event-notifications-setup)
+ [Setting up a crawler for Amazon S3 event notifications for an Amazon S3 target](crawler-s3-event-notifications-setup-console-s3-target.md)
+ [Setting up a crawler for Amazon S3 event notifications for a Data Catalog table](crawler-s3-event-notifications-setup-console-catalog-target.md)

## Setting up your account for Amazon S3 event notifications
<a name="crawler-s3-event-notifications-setup"></a>

Complete the following setup tasks. Note that the values in parentheses reference the configurable settings from the script.

1. You need to set up event notifications for your Amazon S3 bucket.

   For more information, see [Amazon S3 event notifications](https://docs.aws.amazon.com/AmazonS3/latest/userguide/EventNotifications.html).

1. To use the Amazon S3 event based crawler, enable event notification on the Amazon S3 bucket, filter the events on a prefix that is the same as the S3 target, and store them in SQS. You can set up SQS and event notification through the console by following the steps in [Walkthrough: Configuring a bucket for notifications](https://docs.aws.amazon.com/AmazonS3/latest/userguide/ways-to-add-notification-config-to-bucket.html).

1. Add the following SQS policy to the role used by the crawler. 

------
#### [ JSON ]

****  

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "VisualEditor0",
         "Effect": "Allow",
         "Action": [
           "sqs:DeleteMessage",
           "sqs:GetQueueUrl",
           "sqs:ListDeadLetterSourceQueues",
           "sqs:ReceiveMessage",
           "sqs:GetQueueAttributes",
           "sqs:ListQueueTags",
           "sqs:SetQueueAttributes",
           "sqs:PurgeQueue"
         ],
         "Resource": "arn:aws:sqs:us-east-1:111122223333:cfn-sqs-queue"
       }
     ]
   }
   ```

------
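
The bucket-side event notification (step 2 above) can also be configured programmatically. The following is a minimal sketch of the notification configuration that routes object create and remove events under a prefix to SQS; the helper name, bucket, prefix, and queue ARN are hypothetical, and the commented lines show where the boto3 call would apply it.

```python
def build_s3_event_config(queue_arn, prefix):
    """Build an S3 notification configuration that sends object create
    and remove events under the given prefix to an SQS queue. Remove
    events matter too: without them, the crawler cannot detect deletes."""
    return {
        "QueueConfigurations": [
            {
                "QueueArn": queue_arn,
                "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": prefix}]}
                },
            }
        ]
    }

config = build_s3_event_config(
    "arn:aws:sqs:us-east-1:111122223333:cfn-sqs-queue", "data/")
# Applying it requires AWS credentials, for example:
# import boto3
# boto3.client("s3").put_bucket_notification_configuration(
#     Bucket="amzn-s3-demo-bucket", NotificationConfiguration=config)
```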

# Setting up a crawler for Amazon S3 event notifications for an Amazon S3 target
<a name="crawler-s3-event-notifications-setup-console-s3-target"></a>

Follow these steps to set up a crawler for Amazon S3 event notifications for an Amazon S3 target using the AWS Management Console or AWS CLI.

------
#### [ AWS Management Console ]

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1.  Set your crawler properties. For more information, see [ Setting Crawler Configuration Options on the AWS Glue console ](https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-configure-changes-console). 

1.  In the section **Data source configuration**, you are asked *Is your data already mapped to AWS Glue tables?* 

    By default **Not yet** is already selected. Leave this as the default as you are using an Amazon S3 data source and the data is not already mapped to AWS Glue tables. 

1.  In the section **Data sources**, choose **Add a data source**.   
![\[Data source configuration interface with options to select or add data sources for crawling.\]](http://docs.aws.amazon.com/glue/latest/dg/images/crawler-s3-event-console1.png)

1.  In the **Add data source** modal, configure the Amazon S3 data source: 
   +  **Data source**: By default, Amazon S3 is selected. 
   +  **Network connection** (Optional): Choose **Add new connection**. 
   +  **Location of Amazon S3 data**: By default, **In this account** is selected. 
   +  **Amazon S3 path**: Specify the Amazon S3 path where folders and files are crawled. 
   +  **Subsequent crawler runs**: Choose **Crawl based on events** to use Amazon S3 event notifications for your crawler. 
   +  **Include SQS ARN**: Specify the data store parameters, including a valid SQS ARN. (For example, `arn:aws:sqs:region:account:sqs`.) 
   +  **Include dead-letter SQS ARN** (Optional): Specify a valid Amazon dead-letter SQS ARN. (For example, `arn:aws:sqs:region:account:deadLetterQueue`). 
   +  Choose **Add an Amazon S3 data source**.   
![\[Add data source dialog for S3, showing options for network connection and crawl settings.\]](http://docs.aws.amazon.com/glue/latest/dg/images/crawler-s3-event-console2.png)

------
#### [ AWS CLI ]

 The following is an example Amazon S3 AWS CLI call to configure a crawler to use event notifications to crawl an Amazon S3 target bucket. 

```
aws glue update-crawler \
    --name myCrawler \
    --recrawl-policy RecrawlBehavior=CRAWL_EVENT_MODE \
    --schema-change-policy UpdateBehavior=UPDATE_IN_DATABASE,DeleteBehavior=LOG \
    --targets '{"S3Targets":[{"Path":"s3://amzn-s3-demo-bucket/", "EventQueueArn": "arn:aws:sqs:us-east-1:012345678910:MyQueue"}]}'
```

------

# Setting up a crawler for Amazon S3 event notifications for a Data Catalog table
<a name="crawler-s3-event-notifications-setup-console-catalog-target"></a>

When you have a Data Catalog table, set up a crawler for Amazon S3 event notifications using the AWS Glue console:

1.  Set your crawler properties. For more information, see [ Setting Crawler Configuration Options on the AWS Glue console ](https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-configure-changes-console). 

1.  In the section **Data source configuration**, you are asked *Is your data already mapped to AWS Glue tables?* 

    Select **Yes** to select existing tables from your Data Catalog as your data source. 

1.  In the section **Glue tables**, choose **Add tables**.   
![\[Data source configuration interface with options to select existing Glue tables or add new ones.\]](http://docs.aws.amazon.com/glue/latest/dg/images/crawler-s3-event-console1-cat.png)

1.  In the **Add table** modal, configure the database and tables: 
   +  **Network connection** (Optional): Choose **Add new connection**. 
   +  **Database**: Select a database in the Data Catalog. 
   +  **Tables**: Select one or more tables from that database in the Data Catalog. 
   +  **Subsequent crawler runs**: Choose **Crawl based on events** to use Amazon S3 event notifications for your crawler. 
   +  **Include SQS ARN**: Specify the data store parameters, including a valid SQS ARN. (For example, `arn:aws:sqs:region:account:sqs`.) 
   +  **Include dead-letter SQS ARN** (Optional): Specify a valid Amazon dead-letter SQS ARN. (For example, `arn:aws:sqs:region:account:deadLetterQueue`). 
   +  Choose **Confirm**.   
![\[Add Glue tables dialog with network, database, tables, and crawler options.\]](http://docs.aws.amazon.com/glue/latest/dg/images/crawler-s3-event-console2-cat.png)

# Tutorial: Adding an AWS Glue crawler
<a name="tutorial-add-crawler"></a>

For this AWS Glue scenario, you're asked to analyze arrival data for major air carriers to calculate the popularity of departure airports month over month. You have flights data for the year 2016 in CSV format stored in Amazon S3. Before you transform and analyze your data, you catalog its metadata in the AWS Glue Data Catalog.

In this tutorial, let’s add a crawler that infers metadata from these flight logs in Amazon S3 and creates a table in your Data Catalog.

**Topics**
+ [Prerequisites](#tutorial-add-crawler-prerequisites)
+ [Step 1: Add a crawler](#tutorial-add-crawler-step1)
+ [Step 2: Run the crawler](#tutorial-add-crawler-step2)
+ [Step 3: View AWS Glue Data Catalog objects](#tutorial-add-crawler-step3)

## Prerequisites
<a name="tutorial-add-crawler-prerequisites"></a>

This tutorial assumes that you have an AWS account and access to AWS Glue.

## Step 1: Add a crawler
<a name="tutorial-add-crawler-step1"></a>

Use these steps to configure and run a crawler that extracts the metadata from a CSV file stored in Amazon S3.

**To create a crawler that reads files stored on Amazon S3**

1. On the AWS Glue service console, on the left-side menu, choose **Crawlers**.

1. On the Crawlers page, choose **Create crawler**. This starts a series of pages that prompt you for the crawler details.  
![\[The screenshot shows the crawler page. From here you can create a crawler or edit, duplicate, delete, view an existing crawler.\]](http://docs.aws.amazon.com/glue/latest/dg/images/crawlers-create_crawler.png)

1. In the Crawler name field, enter **Flights Data Crawler**, and choose **Next**.

   Crawlers invoke classifiers to infer the schema of your data. This tutorial uses the built-in classifier for CSV by default. 

1. For the crawler source type, choose **Data stores** and choose **Next**.

1. Now let's point the crawler to your data. On the **Add a data store** page, choose the Amazon S3 data store. This tutorial doesn't use a connection, so leave the **Connection** field blank if it's visible. 

   For the option **Crawl data in**, choose **Specified path in another account**. Then, for **Include path**, enter the path where the crawler can find the flights data, which is **s3://crawler-public-us-east-1/flight/2016/csv**. After you enter the path, the title of this field changes to **Include path**. Choose **Next**.

1. You can crawl multiple data stores with a single crawler. However, in this tutorial, we're using only a single data store, so choose **No**, and then choose **Next**.

1. The crawler needs permissions to access the data store and create objects in the AWS Glue Data Catalog. To configure these permissions, choose **Create an IAM role**. The IAM role name starts with `AWSGlueServiceRole-`, and in the field, you enter the last part of the role name. Enter **CrawlerTutorial**, and then choose **Next**. 
**Note**  
To create an IAM role, your AWS user must have `CreateRole`, `CreatePolicy`, and `AttachRolePolicy` permissions.

   The wizard creates an IAM role named `AWSGlueServiceRole-CrawlerTutorial`, attaches the AWS managed policy `AWSGlueServiceRole` to this role, and adds an inline policy that allows read access to the Amazon S3 location `s3://crawler-public-us-east-1/flight/2016/csv`.

1. Create a schedule for the crawler. For **Frequency**, choose **Run on demand**, and then choose **Next**. 

1. Crawlers create tables in your Data Catalog. Tables are contained in a database in the Data Catalog. First, choose **Add database** to create a database. In the pop-up window, enter **test-flights-db** for the database name, and then choose **Create**.

   Next, enter **flights** for **Prefix added to tables**. Use the default values for the rest of the options, and choose **Next**.

1. Verify the choices you made in the **Add crawler** wizard. If you see any mistakes, you can choose **Back** to return to previous pages and make changes.

   After you have reviewed the information, choose **Finish** to create the crawler.

## Step 2: Run the crawler
<a name="tutorial-add-crawler-step2"></a>

After you create a crawler, the wizard sends you to the Crawlers view page. Because you created the crawler with an on-demand schedule, you're given the option to run it.

**To run the crawler**

1. The banner near the top of this page lets you know that the crawler was created, and asks if you want to run it now. Choose **Run it now?** to run the crawler.

   The banner changes to show "Attempting to run" and "Running" messages for your crawler. After the crawler starts running, the banner disappears, and the crawler display is updated to show a status of Starting for your crawler. After a minute, you can click the Refresh icon to update the status of the crawler that is displayed in the table.

1. When the crawler completes, a new banner appears that describes the changes made by the crawler. You can choose the **test-flights-db** link to view the Data Catalog objects.

## Step 3: View AWS Glue Data Catalog objects
<a name="tutorial-add-crawler-step3"></a>

The crawler reads data at the source location and creates tables in the Data Catalog. A table is the metadata definition that represents your data, including its schema. The tables in the Data Catalog do not contain data. Instead, you use these tables as a source or target in a job definition.

**To view the Data Catalog objects created by the crawler**

1. In the left-side navigation, under **Data catalog**, choose **Databases**. Here you can view the `test-flights-db` database that is created by the crawler.

1. In the left-side navigation, under **Data catalog** and below **Databases**, choose **Tables**. Here you can view the `flightscsv` table created by the crawler. If you choose the table name, then you can view the table settings, parameters, and properties. Scrolling down in this view, you can view the schema, which is information about the columns and data types of the table.

1. If you choose **View partitions** on the table view page, you can see the partitions created for the data. The first column is the partition key.

# Defining metadata manually
<a name="populate-dg-manual"></a>

 The AWS Glue Data Catalog is a central repository that stores metadata about your data sources and data sets. While a crawler can automatically crawl and populate metadata for supported data sources, there are certain scenarios where you may need to define metadata manually in the Data Catalog: 
+ Unsupported data formats – If you have data sources that are not supported by the crawler, you need to manually define the metadata for those data sources in the Data Catalog.
+ Custom metadata requirements – The AWS Glue crawler infers metadata based on predefined rules and conventions. If you have specific metadata requirements that are not covered by the AWS Glue crawler inferred metadata, you can manually define the metadata to meet your needs 
+ Data governance and standardization – In some cases, you may want to have more control over the metadata definitions for data governance, compliance, or security reasons. Manually defining metadata allows you to ensure that the metadata adheres to your organization's standards and policies. 
+ Placeholder for future data ingestion – If you have data sources that are not immediately available or accessible, you can create empty schema tables as placeholders. Once the data sources become available, you can populate the tables with the actual data, while maintaining the predefined structure. 

 To define metadata manually, you can use the AWS Glue console, Lake Formation console, AWS Glue API, or the AWS Command Line Interface (AWS CLI). You can create databases, tables, and partitions, and specify metadata properties such as column names, data types, descriptions, and other attributes. 

# Creating databases
<a name="define-database"></a>

Databases are used to organize metadata tables in AWS Glue. When you define a table in the AWS Glue Data Catalog, you add it to a database. A table can be in only one database.

Your database can contain tables that define data from many different data stores. This data can include objects in Amazon Simple Storage Service (Amazon S3) and relational tables in Amazon Relational Database Service.

**Note**  
When you delete a database from the AWS Glue Data Catalog, all the tables in the database are also deleted.

 To view the list of databases, sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). Choose **Databases**, and then choose a database name in the list to view the details. 

 From the **Databases** tab in the AWS Glue console, you can add, edit, and delete databases:
+ To create a new database, choose **Add database** and provide a name and description. For compatibility with other metadata stores, such as Apache Hive, the name is folded to lowercase characters. 
**Note**  
If you plan to access the database from Amazon Athena, then provide a name with only alphanumeric and underscore characters. For more information, see [ Athena names](https://docs.aws.amazon.com/athena/latest/ug/tables-databases-columns-names.html#ate-table-database-and-column-names-allow-only-underscore-special-characters). 
+  To edit the description for a database, select the check box next to the database name and choose **Edit**. 
+  To delete a database, select the check box next to the database name and choose **Remove**. 
+  To display the list of tables contained in a database, choose the database name. The database properties display all tables in the database. 

To change the database that a crawler writes to, you must change the crawler definition. For more information, see [Using crawlers to populate the Data Catalog](add-crawler.md).
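
The lowercase-folding and Athena naming caveats above can be checked before you create a database. The following is a minimal sketch (the helper name is hypothetical); the commented line shows where the AWS Glue `CreateDatabase` API call would go.

```python
import re

def normalize_database_name(name):
    """Fold a proposed Data Catalog database name to lowercase (as the
    console does for Hive compatibility) and verify it is Athena-safe:
    only lowercase letters, digits, and underscores."""
    folded = name.lower()
    if not re.fullmatch(r"[a-z0-9_]+", folded):
        raise ValueError(
            f"{name!r} contains characters Athena does not allow; "
            "use only alphanumeric and underscore characters")
    return folded

db_name = normalize_database_name("Flights_DB")  # "flights_db"
# Creating it requires AWS credentials, for example:
# import boto3
# boto3.client("glue").create_database(DatabaseInput={"Name": db_name})
```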

## Database resource links
<a name="databases-resource-links"></a>


|  | 
| --- |
| The AWS Glue console was recently updated. The current version of the console does not support Database Resource Links. | 

The Data Catalog can also contain *resource links* to databases. A database resource link is a link to a local or shared database. Currently, you can create resource links only in AWS Lake Formation. After you create a resource link to a database, you can use the resource link name wherever you would use the database name. Along with databases that you own or that are shared with you, database resource links are returned by `glue:GetDatabases()` and appear as entries on the **Databases** page of the AWS Glue console.

The Data Catalog can also contain table resource links.

For more information about resource links, see [Creating Resource Links](https://docs.aws.amazon.com/lake-formation/latest/dg/creating-resource-links.html) in the *AWS Lake Formation Developer Guide*.

# Creating tables
<a name="tables-described"></a>

Even though running a crawler is the recommended method to take inventory of the data in your data stores, you can add metadata tables to the AWS Glue Data Catalog manually. This approach allows you to have more control over the metadata definitions and customize them according to your specific requirements.

You can also add tables to the Data Catalog manually in the following ways:
+ Use the AWS Glue console to manually create a table in the AWS Glue Data Catalog. For more information, see [Creating tables using the console](#console-tables).
+ Use the `CreateTable` operation in the [AWS Glue API](aws-glue-api.md) to create a table in the AWS Glue Data Catalog. For more information, see [CreateTable action (Python: create\_table)](aws-glue-api-catalog-tables.md#aws-glue-api-catalog-tables-CreateTable).
+ Use CloudFormation templates. For more information, see [AWS CloudFormation for AWS Glue](populate-with-cloudformation-templates.md).

When you define a table manually using the console or an API, you specify the table schema and the value of a classification field that indicates the type and format of the data in the data source. If a crawler creates the table, the data format and schema are determined by either a built-in classifier or a custom classifier. For more information about creating a table using the AWS Glue console, see [Creating tables using the console](#console-tables).
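
As a sketch of the API path, the following builds a `TableInput` for a CSV table in Amazon S3 with the classification field set, and shows (commented out) the boto3 `create_table` call that would register it. The helper name and column list are hypothetical; the SerDe and input/output format classes are the common Hive classes for delimited text.

```python
def build_csv_table_input(table_name, s3_location, columns):
    """Build a TableInput describing CSV data in Amazon S3, with the
    classification parameter set as a manually defined table requires."""
    return {
        "Name": table_name,
        "Parameters": {"classification": "csv"},
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    }

table_input = build_csv_table_input(
    "flightscsv", "s3://amzn-s3-demo-bucket/flights/",
    [("year", "int"), ("carrier", "string")])
# Registering it requires AWS credentials, for example:
# import boto3
# boto3.client("glue").create_table(
#     DatabaseName="test-flights-db", TableInput=table_input)
```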

**Topics**
+ [Table partitions](#tables-partition)
+ [Table resource links](#tables-resource-links)
+ [Creating tables using the console](#console-tables)
+ [Creating partition indexes](partition-indexes.md)
+ [Updating manually created Data Catalog tables using crawlers](#update-manual-tables)
+ [Data Catalog table properties](#table-properties)

## Table partitions
<a name="tables-partition"></a>

An AWS Glue table definition of an Amazon Simple Storage Service (Amazon S3) folder can describe a partitioned table. For example, to improve query performance, a partitioned table might separate monthly data into different files using the name of the month as a key. In AWS Glue, table definitions include the partitioning key of a table. When AWS Glue evaluates the data in Amazon S3 folders to catalog a table, it determines whether an individual table or a partitioned table is added. 

You can create partition indexes on a table to fetch a subset of the partitions instead of loading all the partitions in the table. For information about working with partition indexes, see [Creating partition indexes](partition-indexes.md).

All the following conditions must be true for AWS Glue to create a partitioned table for an Amazon S3 folder:
+ The schemas of the files are similar, as determined by AWS Glue.
+ The data format of the files is the same.
+ The compression format of the files is the same.

For example, you might own an Amazon S3 bucket named `my-app-bucket`, where you store both iOS and Android app sales data. The data is partitioned by year, month, and day. The data files for iOS and Android sales have the same schema, data format, and compression format. In the AWS Glue Data Catalog, the AWS Glue crawler creates one table definition with partitioning keys for year, month, and day. 

The following Amazon S3 listing of `my-app-bucket` shows some of the partitions. The `=` symbol is used to assign partition key values. 

```
   my-app-bucket/Sales/year=2010/month=feb/day=1/iOS.csv
   my-app-bucket/Sales/year=2010/month=feb/day=1/Android.csv
   my-app-bucket/Sales/year=2010/month=feb/day=2/iOS.csv
   my-app-bucket/Sales/year=2010/month=feb/day=2/Android.csv
   ...
   my-app-bucket/Sales/year=2017/month=feb/day=4/iOS.csv
   my-app-bucket/Sales/year=2017/month=feb/day=4/Android.csv
```
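
The `key=value` convention in the listing above can be parsed mechanically, which is useful for reasoning about how partition values are derived from object keys. The following is a minimal sketch; the function name is hypothetical.

```python
def parse_partition_values(object_key):
    """Extract Hive-style partition key/value pairs (path segments of
    the form key=value) from an S3 object key."""
    pairs = {}
    for segment in object_key.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            pairs[key] = value
    return pairs

# The first object in the listing above carries three partition values:
print(parse_partition_values("Sales/year=2010/month=feb/day=1/iOS.csv"))
# {'year': '2010', 'month': 'feb', 'day': '1'}
```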

## Table resource links
<a name="tables-resource-links"></a>


|  | 
| --- |
| The AWS Glue console was recently updated. The current version of the console does not support Table Resource Links. | 

The Data Catalog can also contain *resource links* to tables. A table resource link is a link to a local or shared table. Currently, you can create resource links only in AWS Lake Formation. After you create a resource link to a table, you can use the resource link name wherever you would use the table name. Along with tables that you own or that are shared with you, table resource links are returned by `glue:GetTables()` and appear as entries on the **Tables** page of the AWS Glue console.

The Data Catalog can also contain database resource links.

For more information about resource links, see [Creating Resource Links](https://docs.aws.amazon.com/lake-formation/latest/dg/creating-resource-links.html) in the *AWS Lake Formation Developer Guide*.

## Creating tables using the console
<a name="console-tables"></a>

A table in the AWS Glue Data Catalog is the metadata definition that represents the data in a data store. You create tables when you run a crawler, or you can create a table manually in the AWS Glue console. The **Tables** list in the AWS Glue console displays values of your table's metadata. You use table definitions to specify sources and targets when you create ETL (extract, transform, and load) jobs. 

**Note**  
With recent changes to the AWS Management Console, you may need to modify your existing IAM roles to have the [`SearchTables`](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-SearchTables) permission. For new roles, the `SearchTables` API permission is added by default.

To get started, sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). Choose the **Tables** tab, and use the **Add tables** button to create tables either with a crawler or by manually typing attributes. 

### Adding tables on the console
<a name="console-tables-add"></a>

To use a crawler to add tables, choose **Add tables**, **Add tables using a crawler**. Then follow the instructions in the **Add crawler** wizard. When the crawler runs, tables are added to the AWS Glue Data Catalog. For more information, see [Using crawlers to populate the Data Catalog](add-crawler.md).

If you know the attributes that are required to create an Amazon Simple Storage Service (Amazon S3) table definition in your Data Catalog, you can create it with the table wizard. Choose **Add tables**, **Add table manually**, and follow the instructions in the **Add table** wizard.

When adding a table manually through the console, consider the following:
+ If you plan to access the table from Amazon Athena, then provide a name with only alphanumeric and underscore characters. For more information, see [Athena names](https://docs.aws.amazon.com/athena/latest/ug/tables-databases-columns-names.html#ate-table-database-and-column-names-allow-only-underscore-special-characters).
+ The location of your source data must be an Amazon S3 path.
+ The data format of the data must match one of the listed formats in the wizard. The corresponding classification, SerDe, and other table properties are automatically populated based on the format chosen. You can define tables with the following formats:   
**Avro**  
Apache Avro JSON binary format.  
**CSV**  
Character separated values. You also specify the delimiter of either comma, pipe, semicolon, tab, or Ctrl-A.  
**JSON**  
JavaScript Object Notation.  
**XML**  
Extensible Markup Language format. Specify the XML tag that defines a row in the data. Columns are defined within row tags.  
**Parquet**  
Apache Parquet columnar storage.  
**ORC**  
Optimized Row Columnar (ORC) file format. A format designed to efficiently store Hive data.
+ You can define a partition key for the table.
+ Currently, partitioned tables that you create with the console cannot be used in ETL jobs.

### Table attributes
<a name="console-tables-attributes"></a>

The following are some important attributes of your table:

**Name**  
The name is determined when the table is created, and you can't change it. You refer to a table name in many AWS Glue operations.

**Database**  
The container object where your table resides. This object contains an organization of your tables that exists within the AWS Glue Data Catalog and might differ from an organization in your data store. When you delete a database, all tables contained in the database are also deleted from the Data Catalog. 

**Description**  
The description of the table. You can write a description to help you understand the contents of the table.

**Table format**  
Specify creating a standard AWS Glue table, or a table in Apache Iceberg format.  
The Data Catalog provides the following table optimization options to manage table storage and improve query performance for Iceberg tables.  
+ **Compaction** – Data files are merged and rewritten to remove obsolete data and consolidate fragmented data into larger, more efficient files.
+ **Snapshot retention** – Snapshots are timestamped versions of an Iceberg table. Snapshot retention configurations allow customers to enforce how long to retain snapshots and how many snapshots to retain. Configuring a snapshot retention optimizer can help manage storage overhead by removing older, unnecessary snapshots and their associated underlying files.
+ **Orphan file deletion** – Orphan files are files that are no longer referenced by the Iceberg table metadata. These files can accumulate over time, especially after operations like table deletions or failed ETL jobs. Enabling orphan file deletion allows AWS Glue to periodically identify and remove these unnecessary files, freeing up storage.
For more information, see [Optimizing Iceberg tables](table-optimizers.md).

**Optimization configuration**  
You can either use the default settings or customize the settings for enabling the table optimizers.

**IAM role**  
 To run the table optimizers, the service assumes an IAM role on your behalf. You can choose an IAM role using the drop-down. Ensure that the role has the permissions required to enable compaction.  
To learn more about the required permissions for the IAM role, see [Table optimization prerequisites](optimization-prerequisites.md).

**Location**  
The pointer to the location of the data in a data store that this table definition represents.

**Classification**  
A categorization value provided when the table was created. Typically, this is written when a crawler runs and specifies the format of the source data.

**Last updated**  
The time and date (UTC) that this table was updated in the Data Catalog.

**Date added**  
The time and date (UTC) that this table was added to the Data Catalog.

**Deprecated**  
If AWS Glue discovers that a table in the Data Catalog no longer exists in its original data store, it marks the table as deprecated in the Data Catalog. If you run a job that references a deprecated table, the job might fail. Edit jobs that reference deprecated tables to remove them as sources and targets. We recommend that you delete deprecated tables when they are no longer needed. 

**Connection**  
If AWS Glue requires a connection to your data store, the name of the connection is associated with the table.

### Viewing and managing table details
<a name="console-tables-details"></a>

To see the details of an existing table, choose the table name in the list, and then choose **Actions**, **View details**.

The table details include properties of your table and its schema. This view displays the schema of the table, including column names in the order defined for the table, data types, and key columns for partitions. If a column is a complex type, you can choose **View properties** to display details of the structure of that field, as shown in the following example:

```
{
"StorageDescriptor": 
    {
      "cols": {
         "FieldSchema": [
           {
             "name": "primary-1",
             "type": "CHAR",
             "comment": ""
           },
           {
             "name": "second ",
             "type": "STRING",
             "comment": ""
           }
         ]
      },
      "location": "s3://aws-logs-111122223333-us-east-1",
      "inputFormat": "",
      "outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
      "compressed": "false", 
      "numBuckets": "0",
      "SerDeInfo": {
           "name": "",
           "serializationLib": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
           "parameters": {
               "separatorChar": "|"
            }
      },
      "bucketCols": [],
      "sortCols": [],
      "parameters": {},
      "SkewedInfo": {},
      "storedAsSubDirectories": "false"
    },
    "parameters": {
       "classification": "csv"
    }
}
```

For more information about the properties of a table, such as `StorageDescriptor`, see [StorageDescriptor structure](aws-glue-api-catalog-tables.md#aws-glue-api-catalog-tables-StorageDescriptor).
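
As a rough illustration, the column names and types can be read out of the nested `StorageDescriptor` structure in such a response. The dictionary below is an abbreviated, hypothetical copy of the example above; this is a sketch, not an AWS Glue API call.

```python
# Abbreviated, hypothetical copy of the table-details example above.
details = {
    "StorageDescriptor": {
        "cols": {
            "FieldSchema": [
                {"name": "primary-1", "type": "CHAR", "comment": ""},
                {"name": "second ", "type": "STRING", "comment": ""},
            ]
        },
        "location": "s3://aws-logs-111122223333-us-east-1",
    }
}

# Pull column name -> type pairs out of the nested structure.
cols = details["StorageDescriptor"]["cols"]["FieldSchema"]
schema = {c["name"].strip(): c["type"] for c in cols}
print(schema)  # {'primary-1': 'CHAR', 'second': 'STRING'}
```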

To change the schema of a table, choose **Edit schema** to add and remove columns, change column names, and change data types.

 To compare different versions of a table, including its schema, choose **Compare versions** to see a side-by-side comparison of two versions of the schema for a table. For more information, see [Comparing table schema versions](#console-tables-schema-comparison). 

To display the files that make up an Amazon S3 partition, choose **View partition**. For Amazon S3 tables, the **Key** column displays the partition keys that are used to partition the table in the source data store. Partitioning is a way to divide a table into related parts based on the values of a key column, such as date, location, or department. For more information about partitions, search the internet for information about "hive partitioning."

**Note**  
To get step-by-step guidance for viewing the details of a table, see the **Explore table** tutorial in the console.

### Comparing table schema versions
<a name="console-tables-schema-comparison"></a>

 When you compare two versions of table schemas, you can compare nested row changes by expanding and collapsing nested rows, compare schemas of two versions side-by-side, and view table properties side-by-side. 

 To compare versions 

1.  From the AWS Glue console, choose **Tables**, then **Actions** and choose **Compare versions**.   
![\[The screenshot shows the Actions button when selected. The drop-down menu displays the Compare versions option.\]](http://docs.aws.amazon.com/glue/latest/dg/images/catalog-table-compare-versions.png)

1.  Choose a version to compare from the version drop-down menu. When comparing schemas, the Schema tab is highlighted in orange. 

1.  When you compare tables between two versions, the table schemas are presented to you on the left and right side of the screen. This enables you to determine changes visually by comparing the Column name, data type, key, and comment fields side-by-side. When there is a change, a colored icon displays the type of change that was made. 
   +  Deleted – A red icon indicates where a column was removed from a previous version of the table schema. 
   +  Edited or Moved – A blue icon indicates where a column was modified or moved in a newer version of the table schema. 
   +  Added – A green icon indicates where a column was added in a newer version of the table schema. 
   +  Nested changes – A yellow icon indicates where a nested column contains changes. Choose the column to expand it and view the columns that have been deleted, edited, moved, or added.   
![\[The screenshot shows the table schema comparison between two versions. On the left side is the older version. On the right side is the newer version. The delete icon is next to a column that was removed from the older version and is no longer in the newer version.\]](http://docs.aws.amazon.com/glue/latest/dg/images/catalog-table-version-comparison.png)

1.  Use the filter fields search bar to display fields that match the characters you enter. If you enter a column name in either table version, the filtered fields are displayed in both table versions to show you where the changes have occurred. 

1.  To compare properties, choose the **Properties** tab. 

1.  To stop comparing versions, choose **Stop comparing** to return to the list of tables. 

# Creating partition indexes
<a name="partition-indexes"></a>

Over time, a table can accumulate hundreds of thousands of partitions. The [GetPartitions API](https://docs.aws.amazon.com/glue/latest/webapi/API_GetPartitions.html) is used to fetch the partitions in the table. The API returns partitions that match the expression provided in the request.

Let's take a *sales\_data* table as an example, which is partitioned by the keys *Country*, *Category*, *Year*, *Month*, and *creationDate*. If you want to obtain sales data for all the items sold in the *Books* category in the year 2020 after *2020-08-15*, you have to make a `GetPartitions` request with the expression "Category = 'Books' and creationDate > '2020-08-15'" to the Data Catalog.

If no partition indexes are present on the table, AWS Glue loads all the partitions of the table, and then filters the loaded partitions using the query expression provided in the `GetPartitions` request. The query takes more time to run as the number of partitions increases on a table with no indexes. With an index, the `GetPartitions` query tries to fetch a subset of the partitions instead of loading all the partitions in the table.
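
The difference can be sketched with a toy example (this is an illustration, not the service implementation; the table and partition values are hypothetical). Without an index every partition is loaded and then filtered; an index whose first key is *Category* lets only the matching subset be fetched.

```python
from collections import defaultdict

# Hypothetical partitions of the sales_data table described above.
partitions = [
    {"Country": "US", "Category": "Books", "creationDate": "2020-09-01"},
    {"Country": "US", "Category": "Books", "creationDate": "2020-07-01"},
    {"Country": "US", "Category": "Shoes", "creationDate": "2020-09-15"},
]

def full_scan(parts):
    # No index: every partition is loaded, then filtered.
    return [p for p in parts
            if p["Category"] == "Books" and p["creationDate"] > "2020-08-15"]

# With an index keyed on Category, index items are grouped so that only
# the "Books" group is ever fetched.
index = defaultdict(list)
for p in partitions:
    index[p["Category"]].append(p)

def index_scan(idx):
    candidates = idx["Books"]  # subset fetch instead of a full load
    return [p for p in candidates if p["creationDate"] > "2020-08-15"]

assert full_scan(partitions) == index_scan(index)  # same result, fewer loads
```

ISO-formatted date strings compare lexicographically, which is why the string comparison on `creationDate` works in this sketch.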

**Topics**
+ [About partition indexes](#partition-index-1)
+ [Creating a table with partition indexes](#partition-index-creating-table)
+ [Adding a partition index to an existing table](#partition-index-existing-table)
+ [Describing partition indexes on a table](#partition-index-describing)
+ [Limitations on using partition indexes](#partition-index-limitations)
+ [Using indexes for an optimized GetPartitions call](#partition-index-getpartitions)
+ [Integration with engines](#partition-index-integration-engines)

## About partition indexes
<a name="partition-index-1"></a>

When you create a partition index, you specify a list of partition keys that already exist on a given table. A partition index is a sublist of the partition keys defined in the table. A partition index can be created on any permutation of the partition keys defined on the table. For the *sales\_data* table above, the possible indexes are (country, category, creationDate), (country, category, year), (country, category), (country), (category, country, year, month), and so on.

The Data Catalog will concatenate the partition values in the order provided at the time of index creation. The index is built consistently as partitions are added to the table. Indexes can be created for String (string, char, and varchar), Numeric (int, bigint, long, tinyint, and smallint), and Date (yyyy-MM-dd) column types. 

**Supported data types**
+ Date – A date in ISO format, such as `YYYY-MM-DD`. For example, date `2020-08-15`. The format uses hyphens (‐) to separate the year, month, and day. The permissible range for dates for indexing spans from `0000-01-01` to `9999-12-31`.
+ String – A string literal enclosed in single or double quotes. 
+ Char – Fixed length character data, with a specified length between 1 and 255, such as char(10).
+ Varchar – Variable length character data, with a specified length between 1 and 65535, such as varchar(10).
+ Numeric – int, bigint, long, tinyint, and smallint

Indexes on Numeric, String, and Date data types support =, >, >=, <, <= and between operators. The indexing solution currently only supports the `AND` logical operator. Sub-expressions with the operators "LIKE", "IN", "OR", and "NOT" are ignored in the expression for filtering using an index. Filtering for the ignored sub-expression is done on the partitions fetched after applying index filtering.

For each partition added to a table, there is a corresponding index item created. For a table with 'n' partitions, one partition index results in 'n' partition index items, and 'm' partition indexes on the same table result in 'm × n' partition index items. Each partition index item is charged according to the current AWS Glue pricing policy for Data Catalog storage. For details on storage object pricing, see [AWS Glue pricing](https://aws.amazon.com/glue/pricing/).
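
The storage arithmetic is simple enough to sketch:

```python
# For a table with n partitions and m partition indexes, the Data Catalog
# stores m * n partition index items, each billed as a storage object.
def index_item_count(num_partitions, num_indexes):
    return num_partitions * num_indexes

# e.g. 100,000 partitions and 3 indexes (the per-table maximum):
print(index_item_count(100_000, 3))  # 300000
```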

## Creating a table with partition indexes
<a name="partition-index-creating-table"></a>

You can create a partition index during table creation. The `CreateTable` request takes a list of [`PartitionIndex` objects](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-PartitionIndex) as an input. A maximum of 3 partition indexes can be created on a given table. Each partition index requires a name and a list of `partitionKeys` defined for the table. Created indexes on a table can be fetched using the [`GetPartitionIndexes` API](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-GetPartitionIndexes).
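
As a sketch, the index list takes roughly the following shape (the index and key names here are hypothetical, and the snippet only builds and validates the input locally; no API call is made):

```python
# Hypothetical PartitionIndex objects for a CreateTable request.
partition_indexes = [
    {"IndexName": "country-category-year",
     "Keys": ["Country", "Category", "Year"]},
    {"IndexName": "country-only", "Keys": ["Country"]},
]

# A table accepts at most 3 partition indexes; each one needs a name and
# a list of keys that are partition keys of the table.
assert len(partition_indexes) <= 3
for idx in partition_indexes:
    assert idx["IndexName"] and idx["Keys"]
print("request input is valid")
```

With boto3, a list like this would be supplied through the `PartitionIndexes` parameter of `create_table`, alongside `DatabaseName` and `TableInput`.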

## Adding a partition index to an existing table
<a name="partition-index-existing-table"></a>

To add a partition index to an existing table, use the `CreatePartitionIndex` operation. You can create one `PartitionIndex` per `CreatePartitionIndex` operation. Adding an index does not affect the availability of a table, as the table continues to be available while indexes are being created.

The index status for an added partition index is set to CREATING, and the creation of the index data starts. If index creation succeeds, the index status is updated to ACTIVE; if it fails, the index status is updated to FAILED. Index creation can fail for multiple reasons, and you can use the `GetPartitionIndexes` operation to retrieve the failure details. The possible failures are:
+ ENCRYPTED\_PARTITION\_ERROR — Index creation on a table with encrypted partitions is not supported.
+ INVALID\_PARTITION\_TYPE\_DATA\_ERROR — Observed when the `partitionKey` value is not a valid value for the corresponding `partitionKey` data type. For example: a `partitionKey` with the 'int' datatype has a value 'foo'.
+ MISSING\_PARTITION\_VALUE\_ERROR — Observed when the `partitionValue` for an `indexedKey` is not present. This can happen when a table is not partitioned consistently.
+ UNSUPPORTED\_PARTITION\_CHARACTER\_ERROR — Observed when the value for an indexed partition key contains the characters \\u0000, \\u0001, or \\u0002.
+ INTERNAL\_ERROR — An internal error occurred while indexes were being created. 

## Describing partition indexes on a table
<a name="partition-index-describing"></a>

To fetch the partition indexes created on a table, use the `GetPartitionIndexes` operation. The response returns all the indexes on the table, along with the current status of each index (the `IndexStatus`).

The `IndexStatus` for a partition index will be one of the following:
+ `CREATING` — The index is currently being created, and is not yet available for use.
+ `ACTIVE` — The index is ready for use. Requests can use the index to perform an optimized query.
+ `DELETING` — The index is currently being deleted, and can no longer be used. An index in the active state can be deleted using the `DeletePartitionIndex` request, which moves the status from ACTIVE to DELETING.
+ `FAILED` — The index creation on an existing table failed. Each table stores the last 10 failed indexes.

The possible state transitions for indexes created on an existing table are:
+ CREATING → ACTIVE → DELETING
+ CREATING → FAILED
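
The lifecycle above can be encoded as a small transition table (illustrative only):

```python
# Valid IndexStatus transitions for a partition index, as described above.
VALID_TRANSITIONS = {
    "CREATING": {"ACTIVE", "FAILED"},
    "ACTIVE": {"DELETING"},   # DeletePartitionIndex moves ACTIVE to DELETING
    "DELETING": set(),
    "FAILED": set(),          # failed indexes are terminal
}

def can_transition(current, target):
    return target in VALID_TRANSITIONS.get(current, set())

assert can_transition("CREATING", "ACTIVE")
assert can_transition("ACTIVE", "DELETING")
assert not can_transition("FAILED", "ACTIVE")
```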

## Limitations on using partition indexes
<a name="partition-index-limitations"></a>

Once you have created a partition index, note these changes to table and partition functionality:

**New partition creation (after Index Addition)**  
After a partition index is created on a table, all new partitions added to the table are validated against the data types of the indexed keys. The partition value of each indexed key is validated for data type format. If the data type check fails, the create partition operation fails. For the *sales\_data* table, if an index is created for the keys (category, year), where category is of type `string` and year is of type `int`, creating a new partition with a YEAR value of "foo" will fail.

After indexes are enabled, the addition of partitions with indexed key values containing the characters U+0000, U+0001, and U+0002 will start to fail.

**Table updates**  
Once a partition index is created on a table, you cannot modify the partition key names for existing partition keys, and you cannot change the type, or order, of keys which are registered with the index.

## Using indexes for an optimized GetPartitions call
<a name="partition-index-getpartitions"></a>

When you call `GetPartitions` on a table with an index, you can include an expression, and the Data Catalog uses a matching index when possible. The first key of the index must be present in the expression for the index to be used in filtering. Index optimization in filtering is applied as a best effort. The Data Catalog tries to use index optimization as much as possible, but with a missing index or an unsupported operator, it falls back to the existing implementation of loading all partitions. 

For the *sales\_data* table above, let's add the index [Country, Category, Year]. If "Country" is not passed in the expression, the registered index will not be able to filter partitions using indexes. You can add up to 3 indexes to support various query patterns.

Let's take some example expressions and see how indexes apply to them:


| Expressions | How index will be used | 
| --- | --- | 
|  Country = 'US'  |  Index will be used to filter partitions.  | 
|  Country = 'US' and Category = 'Shoes'  |  Index will be used to filter partitions.  | 
|  Category = 'Shoes'  |  Indexes will not be used as "country" is not provided in the expression. All partitions will be loaded to return a response.  | 
|  Country = 'US' and Category = 'Shoes' and Year > '2018'  |  Index will be used to filter partitions.  | 
|  Country = 'US' and Category = 'Shoes' and Year > '2018' and month = 2  |  Index will be used to fetch all partitions with country = "US" and category = "shoes" and year > 2018. Then, filtering on the month expression will be performed.  | 
|  Country = 'US' AND Category = 'Shoes' OR Year > '2018'  |  Indexes will not be used as an `OR` operator is present in the expression.  | 
|  Country = 'US' AND Category = 'Shoes' AND (Year = 2017 OR Year = '2018')  |  Index will be used to fetch all partitions with country = "US" and category = "shoes", and then filtering on the year expression will be performed.  | 
|  Country in ('US', 'UK') AND Category = 'Shoes'  |  Indexes will not be used for filtering as the `IN` operator is not supported currently.  | 
|  Country = 'US' AND Category in ('Shoes', 'Books')  |  Index will be used to fetch all partitions with country = "US", and then filtering on the Category expression will be performed.  | 
|  Country = 'US' AND Category in ('Shoes', 'Books') AND creationDate > '2023-09-01'  |  Index will be used to fetch all partitions with country = "US" and creationDate > '2023-09-01', and then filtering on the Category expression will be performed.  | 
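
The decision in the table above can be approximated in a few lines, assuming an index whose first key is *Country*. This is a simplified sketch for intuition, not the service's actual expression parser:

```python
import re

def index_usable(expression, first_index_key="Country"):
    # IN on the indexed key is not supported for index filtering.
    if re.search(rf"\b{first_index_key}\s+in\b", expression, re.IGNORECASE):
        return False
    # Drop parenthesized sub-expressions: OR/IN inside them only demotes
    # that clause to post-fetch filtering; it does not disable the index.
    top_level = re.sub(r"\([^()]*\)", "", expression)
    if re.search(r"\bOR\b", top_level, re.IGNORECASE):
        return False  # a top-level OR disables index use entirely
    # The index's first key must appear in the expression.
    return first_index_key.lower() in expression.lower()

assert index_usable("Country = 'US' and Category = 'Shoes'")
assert not index_usable("Category = 'Shoes'")
assert not index_usable("Country = 'US' AND Category = 'Shoes' OR Year > '2018'")
assert index_usable("Country = 'US' AND (Year = 2017 OR Year = '2018')")
assert not index_usable("Country in ('US', 'UK') AND Category = 'Shoes'")
```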

## Integration with engines
<a name="partition-index-integration-engines"></a>

Redshift Spectrum, Amazon EMR and AWS Glue ETL Spark DataFrames are able to utilize indexes for fetching partitions after indexes are in an ACTIVE state in AWS Glue. [Athena](https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#glue-best-practices-partition-index) and [AWS Glue ETL Dynamic frames](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html#aws-glue-programming-etl-partitions-cat-predicates) require you to follow extra steps to utilize indexes for query improvement.

### Enable partition filtering
<a name="enable-partition-filtering-athena"></a>

To enable partition filtering in Athena, you need to update the table properties as follows:

1. In the AWS Glue console, under **Data Catalog**, choose **Tables**.

1. Choose a table.

1. Under **Actions**, choose **Edit table**.

1. Under **Table properties**, add the following:
   + Key –`partition_filtering.enabled`
   + Value – `true`

1. Choose **Apply**.

Alternatively, you can set this parameter by running an [ALTER TABLE SET PROPERTIES](https://docs.aws.amazon.com/athena/latest/ug/alter-table-set-tblproperties.html) query in Athena.

```
ALTER TABLE partition_index.table_with_index
SET TBLPROPERTIES ('partition_filtering.enabled' = 'true')
```
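
The same property can also be set programmatically. The helper below only merges the property into a `TableInput`-shaped dictionary (the table name and existing parameters are hypothetical, and no AWS call is made here):

```python
# Hedged sketch: merge the property into a TableInput-shaped dict before
# an update-table call.
def with_partition_filtering(table_input):
    params = dict(table_input.get("Parameters", {}))
    params["partition_filtering.enabled"] = "true"
    return {**table_input, "Parameters": params}

updated = with_partition_filtering({
    "Name": "table_with_index",
    "Parameters": {"classification": "csv"},
})
print(updated["Parameters"])
# {'classification': 'csv', 'partition_filtering.enabled': 'true'}
```

With boto3, you would typically fetch the table with `get_table`, apply a merge like this, and pass the result as the `TableInput` argument of `update_table`.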

## Updating manually created Data Catalog tables using crawlers
<a name="update-manual-tables"></a>

You might want to create AWS Glue Data Catalog tables manually and then keep them updated with AWS Glue crawlers. Crawlers running on a schedule can add new partitions and update the tables with any schema changes. This also applies to tables migrated from an Apache Hive metastore.

To do this, when you define a crawler, instead of specifying one or more data stores as the source of a crawl, you specify one or more existing Data Catalog tables. The crawler then crawls the data stores specified by the catalog tables. In this case, no new tables are created; instead, your manually created tables are updated.

The following are other reasons why you might want to manually create catalog tables and specify catalog tables as the crawler source:
+ You want to choose the catalog table name and not rely on the catalog table naming algorithm.
+ You want to prevent new tables from being created in the case where files with a format that could disrupt partition detection are mistakenly saved in the data source path.

For more information, see [Step 2: Choose data sources and classifiers](define-crawler-choose-data-sources.md).

## Data Catalog table properties
<a name="table-properties"></a>

 Table properties, or parameters, as they are known in the AWS CLI, are unvalidated key and value strings. You can set your own properties on the table to support uses of the Data Catalog outside of AWS Glue, and other services that use the Data Catalog may do so as well. AWS Glue sets some table properties when running jobs or crawlers. Unless otherwise described, these properties are for internal use; we do not guarantee that they will continue to exist in their current form, or support product behavior if these properties are manually changed. 

 For more information about table properties set by AWS Glue crawlers, see [Parameters set on Data Catalog tables by crawler](table-properties-crawler.md). 

# Integrating with Amazon S3 Tables
<a name="glue-federation-s3tables"></a>

AWS Glue Data Catalog integration with Amazon S3 Tables allows you to discover, query, and join S3 Tables with data in Amazon S3 data lakes using a single catalog. When you integrate S3 Tables with the Data Catalog, the service creates a federated catalog structure that maps S3 Tables resources to AWS Glue catalog objects:
+ An S3 table bucket becomes a catalog in the Data Catalog
+ An S3 namespace becomes an AWS Glue database
+ An S3 table becomes an AWS Glue table

## Access controls
<a name="s3-tables-access-controls"></a>

The Data Catalog supports two access control modes for S3 Tables integration:
+ **IAM access control** – Uses IAM policies to control access to S3 Tables and the Data Catalog. In this approach, you need IAM permissions on both S3 Tables resources and Data Catalog objects to access resources.
+ **AWS Lake Formation access control** – Uses AWS Lake Formation grants in addition to AWS Glue IAM permissions to control access to S3 Tables through the Data Catalog. In this mode, principals require IAM permissions to interact with the Data Catalog, and AWS Lake Formation grants determine which catalog resources (databases, tables, columns, rows) the principal can access. This mode supports both coarse-grained access control (database-level and table-level grants) and fine-grained access control (column-level and row-level security). When a registered role is configured and credential vending is enabled, S3 Tables IAM permissions are not required for the principal, as AWS Lake Formation vends credentials on behalf of the principal using the registered role. AWS Lake Formation access control also supports credential vending for third-party analytics engines. For more information, see [Creating an S3 Tables catalog](https://docs.aws.amazon.com/lake-formation/latest/dg/create-s3-tables-catalog.html) in the *AWS Lake Formation Developer Guide*.

You can migrate between access control modes as your requirements evolve.

## Catalog hierarchy for auto-mounting
<a name="s3-tables-catalog-hierarchy"></a>

When you integrate S3 Tables with the Data Catalog using the Amazon S3 management console, the console creates a federated catalog called `s3tablescatalog` in the Data Catalog in your account in that AWS Region. This federated catalog serves as the parent catalog for all existing and future S3 table buckets in that account and Region. The integration maps Amazon S3 table bucket resources in the following hierarchy:
+ **Federated catalog** – `s3tablescatalog` (automatically created)
+ **Child catalogs** – Each S3 table bucket becomes a child catalog under `s3tablescatalog`
+ **Databases** – Each S3 namespace within a table bucket becomes a database
+ **Tables** – Each S3 table within a namespace becomes a table

For example, if you have an S3 table bucket named "analytics-bucket" with a namespace "sales" containing a table "transactions", the full path in the Data Catalog would be: `s3tablescatalog/analytics-bucket/sales/transactions`
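
The mapping in the example above can be sketched as a one-liner (illustrative only):

```python
# Four-part path for the same-account S3 Tables integration described above:
# s3tablescatalog / <table bucket> / <namespace> / <table>
def s3_tables_catalog_path(bucket, namespace, table):
    return f"s3tablescatalog/{bucket}/{namespace}/{table}"

print(s3_tables_catalog_path("analytics-bucket", "sales", "transactions"))
# s3tablescatalog/analytics-bucket/sales/transactions
```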

This four-part hierarchy applies to same-account scenarios where S3 Tables and the Data Catalog are in the same AWS account. For cross-account scenarios, you manually mount individual S3 table buckets in the Data Catalog, which creates a three-part hierarchy.

## Supported Regions
<a name="s3-tables-supported-regions"></a>

S3 Tables integration with the Data Catalog is available in the following AWS Regions:


| Region code | Region name | 
| --- | --- | 
| us-east-1 | US East (N. Virginia) | 
| us-east-2 | US East (Ohio) | 
| us-west-1 | US West (N. California) | 
| us-west-2 | US West (Oregon) | 
| af-south-1 | Africa (Cape Town) | 
| ap-east-1 | Asia Pacific (Hong Kong) | 
| ap-east-2 | Asia Pacific (Taipei) | 
| ap-northeast-1 | Asia Pacific (Tokyo) | 
| ap-northeast-2 | Asia Pacific (Seoul) | 
| ap-northeast-3 | Asia Pacific (Osaka) | 
| ap-south-1 | Asia Pacific (Mumbai) | 
| ap-south-2 | Asia Pacific (Hyderabad) | 
| ap-southeast-1 | Asia Pacific (Singapore) | 
| ap-southeast-2 | Asia Pacific (Sydney) | 
| ap-southeast-3 | Asia Pacific (Jakarta) | 
| ap-southeast-4 | Asia Pacific (Melbourne) | 
| ap-southeast-5 | Asia Pacific (Malaysia) | 
| ap-southeast-6 | Asia Pacific (New Zealand) | 
| ap-southeast-7 | Asia Pacific (Thailand) | 
| ca-central-1 | Canada (Central) | 
| ca-west-1 | Canada West (Calgary) | 
| eu-central-1 | Europe (Frankfurt) | 
| eu-central-2 | Europe (Zurich) | 
| eu-north-1 | Europe (Stockholm) | 
| eu-south-1 | Europe (Milan) | 
| eu-south-2 | Europe (Spain) | 
| eu-west-1 | Europe (Ireland) | 
| eu-west-2 | Europe (London) | 
| eu-west-3 | Europe (Paris) | 
| il-central-1 | Israel (Tel Aviv) | 
| mx-central-1 | Mexico (Central) | 
| sa-east-1 | South America (Sao Paulo) | 

**Topics**
+ [Access controls](#s3-tables-access-controls)
+ [Catalog hierarchy for auto-mounting](#s3-tables-catalog-hierarchy)
+ [Supported Regions](#s3-tables-supported-regions)
+ [Prerequisites](s3tables-catalog-prerequisites.md)
+ [Enabling S3 Tables integration with the Data Catalog](enable-s3-tables-catalog-integration.md)
+ [Adding databases and tables to the S3 Tables catalog](create-databases-tables-s3-catalog.md)
+ [Sharing S3 Tables catalog objects](share-s3-tables-catalog.md)
+ [Managing S3 Tables integration](manage-s3-tables-catalog-integration.md)

# Prerequisites
<a name="s3tables-catalog-prerequisites"></a>

Before you create a federated catalog for S3 Tables in the AWS Glue Data Catalog, ensure your IAM principal (user or role) has the required permissions.

## Required IAM permissions
<a name="s3tables-required-iam-permissions"></a>

Your IAM principal needs the following permissions to enable S3 Tables integration:

**AWS Glue permissions**:
+ `glue:CreateCatalog` – Required to create the `s3tablescatalog` federated catalog
+ `glue:GetCatalog` – Required to view catalog details
+ `glue:GetDatabase` – Required to view S3 namespaces as databases
+ `glue:GetTable` – Required to view S3 tables
+ `glue:PassConnection` – Grants the calling principal the right to delegate the `aws:s3tables` connection to the AWS Glue service

**S3 Tables permissions** (for IAM access control):
+ `s3tables:CreateTableBucket`
+ `s3tables:GetTableBucket`
+ `s3tables:CreateNamespace`
+ `s3tables:GetNamespace`
+ `s3tables:ListNamespaces`
+ `s3tables:CreateTable`
+ `s3tables:GetTable`
+ `s3tables:ListTables`
+ `s3tables:UpdateTableMetadataLocation`
+ `s3tables:GetTableMetadataLocation`
+ `s3tables:GetTableData`
+ `s3tables:PutTableData`

## IAM policy example
<a name="s3tables-iam-policy-example"></a>

The following IAM policy provides the minimum permissions required to enable S3 Tables integration with the Data Catalog in IAM mode:

```
{
  "Version": "2012-10-17",		 	 	 
  "Statement": [
    {
      "Sid": "GlueDataCatalogPermissions",
      "Effect": "Allow",
      "Action": [
        "glue:CreateCatalog",
        "glue:GetCatalog",
        "glue:GetDatabase",
        "glue:GetTable"
      ],
      "Resource": [
        "arn:aws:glue:region:account-id:catalog/s3tablescatalog",
        "arn:aws:glue:region:account-id:database/s3tablescatalog/*/*",
        "arn:aws:glue:region:account-id:table/s3tablescatalog/*/*/*"
      ]
    },
    {
      "Sid": "S3TablesDataAccessPermissions",
      "Effect": "Allow",
      "Action": [
        "s3tables:GetTableBucket",
        "s3tables:GetNamespace",
        "s3tables:GetTable",
        "s3tables:GetTableMetadataLocation",
        "s3tables:GetTableData"
      ],
      "Resource": [
        "arn:aws:s3tables:region:account-id:bucket/*",
        "arn:aws:s3tables:region:account-id:bucket/*/table/*"
      ]
    }
  ]
}
```

# Enabling S3 Tables integration with the Data Catalog
<a name="enable-s3-tables-catalog-integration"></a>

You can enable S3 Tables integration with the AWS Glue Data Catalog using the Amazon S3 management console or AWS CLI. When you enable the integration using the console, AWS creates a federated catalog named `s3tablescatalog` that automatically discovers and mounts all S3 table buckets in your AWS account and Region.

## Enable S3 Tables integration using the Amazon S3 management console
<a name="enable-s3-tables-console"></a>

1. Open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/).

1. In the left navigation pane, choose **Table buckets**.

1. Choose **Create table bucket**.

1. Enter a **Table bucket name** and make sure that the **Enable integration** checkbox is selected.

1. Choose **Create table bucket**.

Amazon S3 automatically integrates your table buckets in that Region. The first time that you integrate table buckets in any Region, Amazon S3 creates `s3tablescatalog` in the Data Catalog in that Region.

After the catalog is created, all S3 table buckets in your account and Region are automatically mounted as child catalogs. You can view the databases (namespaces) and tables by navigating to the catalog in the Data Catalog.

## Enable S3 Tables integration using AWS CLI
<a name="enable-s3-tables-cli"></a>

Use the `glue create-catalog` command to create the `s3tablescatalog` catalog.

```
aws glue create-catalog \
  --name "s3tablescatalog" \
  --catalog-input '{
    "Description": "Federated catalog for S3 Tables",
    "FederatedCatalog": {
      "Identifier": "arn:aws:s3tables:region:account-id:bucket/*",
      "ConnectionName": "aws:s3tables"
    },
    "CreateDatabaseDefaultPermissions": [{
      "Principal": {
        "DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"
      },
      "Permissions": ["ALL"]
    }],
    "CreateTableDefaultPermissions": [{
      "Principal": {
        "DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"
      },
      "Permissions": ["ALL"]
    }]
  }'
```

Replace *region* with your AWS Region and *account-id* with your AWS account ID.

## Verifying the integration
<a name="verify-s3-tables-integration"></a>

After creating the catalog, you can verify that S3 table buckets are mounted by listing the child catalogs:

```
aws glue get-catalogs \
  --parent-catalog-id s3tablescatalog
```

# Adding databases and tables to the S3 Tables catalog
<a name="create-databases-tables-s3-catalog"></a>

Ensure that you have the necessary permissions to list and create catalogs, databases, and tables in the Data Catalog in your Region, and that S3 Tables integration is enabled in your AWS account and Region.

## Adding a database to the S3 Tables catalog
<a name="add-database-s3-tables-catalog"></a>

### Adding a database (console)
<a name="add-database-s3-tables-console"></a>

1. Open the AWS Glue console at [https://console.aws.amazon.com/glue/home](https://console.aws.amazon.com/glue/home).

1. In the left navigation pane, choose **Databases**.

1. Choose **Add Database**.

1. Choose **Glue Database in S3 Tables Federated Catalog**.

1. Enter a unique name for the database.

1. Select the target catalog that maps to a table bucket in S3 Tables.

1. Choose **Create Database**.

### Adding a database (AWS CLI)
<a name="add-database-s3-tables-cli"></a>

```
aws glue create-database \
  --region region \
  --catalog-id "account-id:s3tablescatalog/my-catalog" \
  --database-input '{"Name": "my-database"}'
```
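The `--catalog-id` value is a composite identifier: your account ID, the literal `s3tablescatalog`, and the child catalog (table bucket) name. A small helper makes the format explicit; the function name is illustrative:

```python
def s3_tables_catalog_id(account_id: str, table_bucket: str) -> str:
    """Build the extended --catalog-id value for an S3 Tables child catalog."""
    return f"{account_id}:s3tablescatalog/{table_bucket}"

print(s3_tables_catalog_id("111122223333", "my-catalog"))
# 111122223333:s3tablescatalog/my-catalog
```

The same identifier format is used when creating tables in the catalog.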

## Adding a table to the S3 Tables catalog
<a name="add-table-s3-tables-catalog"></a>

### Adding a table (console)
<a name="add-table-s3-tables-console"></a>

1. Open the AWS Glue console at [https://console.aws.amazon.com/glue/home](https://console.aws.amazon.com/glue/home).

1. In the left navigation pane, choose **Tables**.

1. Select the appropriate S3 Tables catalog in the catalog dropdown.

1. Choose **Add Table**.

1. Enter a unique name for your table.

1. Confirm the correct S3 Tables catalog is selected in the catalog dropdown.

1. Select the database in the database dropdown.

1. Enter the table schema, either by entering JSON or by adding each column individually.

1. Choose **Create table**.

### Adding a table (AWS CLI)
<a name="add-table-s3-tables-cli"></a>

```
aws glue create-table \
  --region region \
  --catalog-id "account-id:s3tablescatalog/my-catalog" \
  --database-name "my-database" \
  --table-input '{
    "Name": "my-table",
    "Parameters": {
      "classification": "",
      "format": "ICEBERG"
    },
    "StorageDescriptor": {
      "Columns": [
        {"Name": "id", "Type": "int", "Parameters": {}},
        {"Name": "val", "Type": "string", "Parameters": {}}
      ]
    }
  }'
```

# Sharing S3 Tables catalog objects
<a name="share-s3-tables-catalog"></a>

When using IAM access control, you can share S3 Tables catalog objects with other users in the same account by using AWS Glue resource links. For cross-account sharing, you can share S3 table buckets with another AWS account, and an IAM role or user in the recipient account can then create an AWS Glue catalog object from the shared table bucket.

## Sharing within the same account using resource links
<a name="share-s3-tables-resource-links"></a>

Resource links allow you to create references to AWS Glue databases and tables in the `s3tablescatalog` that appear in your AWS Glue default catalog. This is useful for organizing data access or creating logical groupings of tables.

### Create a resource link (console)
<a name="share-s3-tables-resource-link-console"></a>

1. Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Catalogs**.

1. In the **Catalog** list, select **s3tablescatalog**.

1. Select the table you want to share from the `s3tablescatalog`.

1. Choose **Actions**, then choose **Create resource link**.

1. For **Resource link name**, enter a name for the resource link.

1. For **Target database**, select the database where you want to create the resource link.

1. (Optional) For **Description**, enter a description.

1. Choose **Create**.

The resource link appears in the target database and points to the original table in `s3tablescatalog`.

### Create resource links (AWS CLI)
<a name="share-s3-tables-resource-link-cli"></a>

Create a database resource link:

```
aws glue create-database \
  --database-input '{
    "Name": "sales_data_link",
    "TargetDatabase": {
      "CatalogId": "account-id:s3tablescatalog/analytics-bucket",
      "DatabaseName": "sales"
    }
  }'
```

Create a table resource link:

```
aws glue create-table \
  --database-name "my-database" \
  --table-input '{
    "Name": "sales_data_link",
    "TargetTable": {
      "CatalogId": "account-id:s3tablescatalog/analytics-bucket",
      "DatabaseName": "sales",
      "Name": "transactions"
    }
  }'
```
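Both resource-link commands follow the same pattern: the input document names the link and points a `TargetDatabase` or `TargetTable` field at the object in `s3tablescatalog`. The following sketch builds the two payloads; the helper names are illustrative:

```python
import json

def database_link_input(link_name: str, catalog_id: str, database: str) -> dict:
    """Payload for --database-input when creating a database resource link."""
    return {
        "Name": link_name,
        "TargetDatabase": {"CatalogId": catalog_id, "DatabaseName": database},
    }

def table_link_input(link_name: str, catalog_id: str, database: str, table: str) -> dict:
    """Payload for --table-input when creating a table resource link."""
    return {
        "Name": link_name,
        "TargetTable": {"CatalogId": catalog_id, "DatabaseName": database, "Name": table},
    }

payload = table_link_input(
    "sales_data_link", "account-id:s3tablescatalog/analytics-bucket",
    "sales", "transactions")
print(json.dumps(payload, indent=2))
```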

# Managing S3 Tables integration
<a name="manage-s3-tables-catalog-integration"></a>

## Enable AWS Lake Formation
<a name="manage-s3-tables-enable-lf"></a>

You can enable AWS Lake Formation for your S3 Tables catalog when you want to scale your data governance requirements. AWS Lake Formation provides database-style grants to manage fine-grained access, scale permissions using tag-based access, and grant permissions based on user attributes such as group associations to your tables in S3 Tables.

Go to the AWS Lake Formation management console to enable AWS Lake Formation for your S3 Tables catalog in AWS Glue. For more information, see [Creating an S3 Tables catalog](https://docs.aws.amazon.com/lake-formation/latest/dg/create-s3-tables-catalog.html) in the *AWS Lake Formation Developer Guide*.

## Delete S3 Tables integration
<a name="manage-s3-tables-delete-integration"></a>

You can delete S3 Tables integration by deleting the catalog integration in the Data Catalog. This operation only deletes the metadata in the Data Catalog and not the resources in S3 Tables.

Ensure that you have the necessary permissions to list, edit, and delete catalog objects in AWS Glue.

### Delete integration (console)
<a name="delete-s3-tables-console"></a>

1. Open the AWS Glue console at [https://console.aws.amazon.com/glue/home](https://console.aws.amazon.com/glue/home).

1. In the navigation pane, choose **Catalogs**.

1. In the **Catalog** list, select **s3tablescatalog**.

1. Choose **Delete**.

1. Confirm that deleting the catalog also deletes all associated catalog objects in the Data Catalog.

1. Choose **Delete**.

### Delete integration (AWS CLI)
<a name="delete-s3-tables-cli"></a>

```
aws glue delete-catalog \
  --region region \
  --catalog-id "s3tablescatalog"
```

# Integrating with other AWS services
<a name="populate-dc-other-services"></a>

 While you can use AWS Glue crawlers to populate the AWS Glue Data Catalog, there are several AWS services that can automatically integrate with and populate the catalog for you. The following sections provide more information about the specific use cases supported by AWS services that can populate the Data Catalog. 

**Topics**
+ [AWS Lake Formation](#lf-dc)
+ [Amazon Athena](#ate-dc)

## AWS Lake Formation
<a name="lf-dc"></a>

AWS Lake Formation is a service that makes it easier to set up a secure data lake in AWS. Lake Formation is built on AWS Glue, and the two services share the same AWS Glue Data Catalog. You can register your Amazon S3 data location with Lake Formation, and use the Lake Formation console to create databases and tables in the AWS Glue Data Catalog, define data access policies, and audit data access across your data lake from a central place. You can use Lake Formation fine-grained access control to manage your existing Data Catalog resources and Amazon S3 data locations.

With data registered with Lake Formation, you can securely share Data Catalog resources across IAM principals, AWS accounts, AWS organizations, and organizational units.

 For more information about creating Data Catalog resources using Lake Formation, see [Creating Data Catalog tables and databases](https://docs.aws.amazon.com/lake-formation/latest/dg/populating-catalog.html) in the AWS Lake Formation Developer Guide. 

## Amazon Athena
<a name="ate-dc"></a>

 Amazon Athena uses the Data Catalog to store and retrieve table metadata for the Amazon S3 data in your AWS account. The table metadata lets the Athena query engine know how to find, read, and process the data that you want to query.

 You can populate the AWS Glue Data Catalog by using Athena `CREATE TABLE` statements directly. You can manually define and populate the schema and partition metadata in the Data Catalog without needing to run a crawler. 

1. In the Athena console, create a database that will store the table metadata in the Data Catalog.

1. Use the `CREATE EXTERNAL TABLE` statement to define the schema of your data source.

1. Use the `PARTITIONED BY` clause to define any partition keys if your data is partitioned.

1. Use the `LOCATION` clause to specify the Amazon S3 path where your actual data files are stored. 

1. Run the `CREATE TABLE` statement.

    This query creates the table metadata in the Data Catalog based on your defined schema and partitions, without actually crawling the data. 

You can query the table in Athena, and it will use the metadata from the Data Catalog to access and query your data files in Amazon S3. 
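The steps above combine into a single DDL statement. The following example is illustrative only: the table name, columns, partition key, row format, and bucket path are placeholders, assuming CSV data partitioned by year:

```sql
-- Illustrative example; names and the S3 path are placeholders.
CREATE EXTERNAL TABLE sales (
  id     int,
  amount double
)
PARTITIONED BY (year string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://amzn-s3-demo-bucket/sales-data/';
```

For partitioned data, also run `MSCK REPAIR TABLE` or add partitions explicitly so that Athena can find the partition locations.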

 For more information, see [Creating databases and tables](https://docs.aws.amazon.com/athena/latest/ug/work-with-data.html) in the Amazon Athena User Guide. 

# Data Catalog settings
<a name="console-data-catalog-settings"></a>

The Data Catalog settings page contains options to set encryption and permissions for the Data Catalog in your account.

![\[The screenshot shows the Data Catalog settings modal.\]](http://docs.aws.amazon.com/glue/latest/dg/images/data_catalog_settings.png)


**To change the fine-grained access control of the Data Catalog**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1.  Choose an encryption option. 
   +  **Metadata encryption** – Select this check box to encrypt the metadata in your Data Catalog. Metadata is encrypted at rest using the AWS Key Management Service (AWS KMS) key that you specify. For more information, see [Encrypting your Data Catalog](encrypt-glue-data-catalog.md). 
   +  **Encrypt connection passwords** – Select this check box to encrypt passwords in the AWS Glue connection object when the connection is created or updated. Passwords are encrypted using the AWS KMS key that you specify. When passwords are returned, they are encrypted. This option is a global setting for all AWS Glue connections in the Data Catalog. If you clear this check box, previously encrypted passwords remain encrypted using the key that was used when they were created or updated. For more information about AWS Glue connections, see [Connecting to data](glue-connections.md). 

     When you enable this option, choose an AWS KMS key, or choose **Enter a key ARN** and provide the Amazon Resource Name (ARN) for the key. Enter the ARN in the form `arn:aws:kms:region:account-id:key/key-id`. You can also provide the ARN as a key alias, such as `arn:aws:kms:region:account-id:alias/alias-name`.
**Important**  
 If this option is selected, any user or role that creates or updates a connection must have `kms:Encrypt` permission on the specified KMS key. 

     For more information, see [Encrypting connection passwords](encrypt-connection-passwords.md).

1.  Choose **Settings**, and then in the **Permissions** editor, add the policy statement to change fine-grained access control of the Data Catalog for your account. Only one policy at a time can be attached to a Data Catalog. You can paste a JSON resource policy into this control. For more information, see [Resource-based policies within AWS Glue](security_iam_service-with-iam.md#security_iam_service-with-iam-resource-based-policies). 

1.  Choose **Save** to update your Data Catalog with any changes you made. 

 You can also use AWS Glue API operations to put, get, and delete resource policies. For more information, see [Security APIs in AWS Glue](aws-glue-api-jobs-security.md). 

# Populating and managing transactional tables
<a name="populate-otf"></a>

[Apache Iceberg](https://iceberg.apache.org/), [Apache Hudi](https://hudi.incubator.apache.org/), and Linux Foundation [Delta Lake](https://delta.io/) are open-source table formats designed for handling large-scale data analytics and data lake workloads in Apache Spark. 

You can populate Iceberg, Hudi, and Delta Lake tables in the AWS Glue Data Catalog using the following methods: 
+ AWS Glue crawlers – AWS Glue crawlers can automatically discover and populate Iceberg, Hudi, and Delta Lake table metadata in the Data Catalog. For more information, see [Using crawlers to populate the Data Catalog](add-crawler.md).
+ AWS Glue ETL Jobs – You can create ETL jobs to write data to Iceberg, Hudi, and Delta Lake tables and populate their metadata in the Data Catalog. For more information, see [Using data lake frameworks with AWS Glue ETL jobs](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-datalake-native-frameworks.html).
+ AWS Glue console, AWS Lake Formation console, AWS CLI, or API – You can use the AWS Glue console, the Lake Formation console, or the API to create and manage Iceberg table definitions in the Data Catalog.

**Topics**
+ [Creating Apache Iceberg tables](#creating-iceberg-tables)
+ [Optimizing Iceberg tables](table-optimizers.md)
+ [Optimizing query performance for Iceberg tables](iceberg-column-statistics.md)

## Creating Apache Iceberg tables
<a name="creating-iceberg-tables"></a>

You can create Apache Iceberg tables that use the Apache Parquet data format in the AWS Glue Data Catalog with data residing in Amazon S3. A table in the Data Catalog is the metadata definition that represents the data in a data store. By default, AWS Glue creates Iceberg v2 tables. For the difference between v1 and v2 tables, see [Format version changes](https://iceberg.apache.org/spec/#appendix-e-format-version-changes) in the Apache Iceberg documentation.

[Apache Iceberg](https://iceberg.apache.org/) is an open table format for very large analytic datasets. Iceberg allows for easy changes to your schema, also known as schema evolution, meaning that users can add, rename, or remove columns from a data table without disrupting the underlying data. Iceberg also provides support for data versioning, which allows users to track changes to data over time. This enables the time travel feature, which allows users to access and query historical versions of data and to analyze changes to the data between updates and deletes.

You can use the AWS Glue or Lake Formation console, or the `CreateTable` operation in the AWS Glue API, to create an Iceberg table in the Data Catalog. For more information, see [CreateTable action (Python: create_table)](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-CreateTable).

When you create an Iceberg table in the Data Catalog, you must specify the table format and metadata file path in Amazon S3 to be able to perform reads and writes.

 You can use Lake Formation to secure your Iceberg table using fine-grained access control permissions when you register the Amazon S3 data location with AWS Lake Formation. For source data in Amazon S3 and metadata that is not registered with Lake Formation, access is determined by IAM permissions policies for Amazon S3 and AWS Glue actions. For more information, see [Managing permissions](https://docs.aws.amazon.com/lake-formation/latest/dg/managing-permissions.html). 

**Note**  
The Data Catalog doesn’t support creating partitions or adding Iceberg table properties.

### Prerequisites
<a name="iceberg-prerequisites"></a>

To create Iceberg tables in the Data Catalog and set up Lake Formation data access permissions, complete the following requirements:

1. 

**Permissions required to create Iceberg tables without the data registered with Lake Formation.**

   In addition to the permissions required to create a table in the Data Catalog, the table creator requires the following permissions:
   + `s3:PutObject` on resource `arn:aws:s3:::bucketName`
   + `s3:GetObject` on resource `arn:aws:s3:::bucketName`
   + `s3:DeleteObject` on resource `arn:aws:s3:::bucketName`

1. 

**Permissions required to create Iceberg tables with data registered with Lake Formation:**

   To use Lake Formation to manage and secure the data in your data lake, register your Amazon S3 location that has the data for tables with Lake Formation. This is so that Lake Formation can vend credentials to AWS analytical services such as Athena, Redshift Spectrum, and Amazon EMR to access data. For more information on registering an Amazon S3 location, see [Adding an Amazon S3 location to your data lake](https://docs.aws.amazon.com/lake-formation/latest/dg/register-data-lake.html). 

   A principal who reads and writes the underlying data that is registered with Lake Formation requires the following permissions:
   + `lakeformation:GetDataAccess`
   + `DATA_LOCATION_ACCESS`

     A principal who has data location permissions on a location also has location permissions on all child locations.

      For more information on data location permissions, see [Underlying data access control](https://docs.aws.amazon.com/lake-formation/latest/dg/access-control-underlying-data.html#data-location-permissions).

To enable compaction, the service needs to assume an IAM role that has permissions to update tables in the Data Catalog. For details, see [Table optimization prerequisites](optimization-prerequisites.md).

### Creating an Iceberg table
<a name="create-iceberg-table"></a>

You can create Iceberg v1 and v2 tables using the AWS Glue console, the Lake Formation console, or the AWS Command Line Interface, as documented on this page. You can also create Iceberg tables using the AWS Glue crawler. For more information, see [Data Catalog and Crawlers](https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html) in the AWS Glue Developer Guide.

**To create an Iceberg table**

------
#### [ Console ]

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Under Data Catalog, choose **Tables**, and use the **Create table** button to specify the following attributes:
   + **Table name** – Enter a name for the table. If you’re using Athena to access tables, use these [naming tips](https://docs.aws.amazon.com/athena/latest/ug/tables-databases-columns-names.html) in the Amazon Athena User Guide.
   + **Database** – Choose an existing database or create a new one.
   + **Description** – Optionally, enter a description to help you understand the contents of the table.
   + **Table format** – For **Table format**, choose Apache Iceberg.
   + **Enable compaction** – Choose **Enable compaction** to compact small Amazon S3 objects in the table into larger objects.
   + **IAM role** – To run compaction, the service assumes an IAM role on your behalf. You can choose an IAM role using the drop-down. Ensure that the role has the permissions required to enable compaction.

     To learn more about the required permissions, see [Table optimization prerequisites](optimization-prerequisites.md).
   + **Location** – Specify the path to the folder in Amazon S3 that stores the table metadata. Iceberg needs a metadata file and a location in the Data Catalog to be able to perform reads and writes.
   + **Schema** – Choose **Add columns** to add columns and data types of the columns. You have the option to create an empty table and update the schema later. Data Catalog supports Hive data types. For more information, see [Hive data types](https://cwiki.apache.org/confluence/plugins/servlet/mobile?contentId=27838462#content/view/27838462). 

      Iceberg allows you to evolve schema and partition after you create the table. You can use [Athena queries](https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-evolving-table-schema.html) to update the table schema and [Spark queries](https://iceberg.apache.org/docs/latest/spark-ddl/#alter-table-sql-extensions) for updating partitions. 

------
#### [ AWS CLI ]

```
aws glue create-table \
    --database-name iceberg-db \
    --region us-west-2 \
    --open-table-format-input '{
      "IcebergInput": { 
           "MetadataOperation": "CREATE",
           "Version": "2"
         }
      }' \
    --table-input '{"Name":"test-iceberg-input-demo",
            "TableType": "EXTERNAL_TABLE",
            "StorageDescriptor":{ 
               "Columns":[ 
                   {"Name":"col1", "Type":"int"}, 
                   {"Name":"col2", "Type":"int"}, 
                   {"Name":"col3", "Type":"string"}
                ], 
               "Location":"s3://DOC_EXAMPLE_BUCKET_ICEBERG/"
            }
        }'
```

------

**Topics**
+ [Prerequisites](#iceberg-prerequisites)
+ [Creating an Iceberg table](#create-iceberg-table)

# Optimizing Iceberg tables
<a name="table-optimizers"></a>

AWS Glue supports multiple table optimization options to enhance the management and performance of Apache Iceberg tables used by AWS analytical engines and ETL jobs. These optimizers provide efficient storage utilization, improved query performance, and effective data management. There are three types of table optimizers available in AWS Glue: 
+ **Compaction** – Data compaction compacts small data files to reduce storage usage and improve read performance. Data files are merged and rewritten to remove obsolete data and consolidate fragmented data into larger, more efficient files. You can configure compaction to run automatically. 

  Binpack is the default compaction strategy in Apache Iceberg. It combines smaller data files into larger ones for optimal performance. Compaction also supports sort and Z-order strategies that cluster similar data together. Sort organizes data based on specified columns, improving query performance for filtered operations. Z-order creates sorted datasets that enhance query performance when multiple columns are queried simultaneously. All three compaction strategies (binpack, sort, and Z-order) reduce the amount of data scanned by query engines, thereby lowering query processing costs.
+ **Snapshot retention** – Snapshots are timestamped versions of an Iceberg table. Snapshot retention configurations allow customers to enforce how long to retain snapshots and how many snapshots to retain. Configuring a snapshot retention optimizer can help manage storage overhead by removing older, unnecessary snapshots and their associated underlying files.
+ **Orphan file deletion** – Orphan files are files that are no longer referenced by the Iceberg table metadata. These files can accumulate over time, especially after operations like table deletions or failed ETL jobs. Enabling orphan file deletion allows AWS Glue to periodically identify and remove these unnecessary files, freeing up storage.

Catalog-level optimization configuration is available through the Lake Formation console and using the AWS Glue `UpdateCatalog` API operation. You can enable or disable compaction, snapshot retention, and orphan file deletion optimizers for individual Iceberg tables in the Data Catalog using the AWS Glue console, AWS CLI, or AWS Glue API operations. 

 The following video demonstrates how to configure optimizers for Iceberg tables in the Data Catalog. 

[![AWS Videos](https://img.youtube.com/vi/xOXE7AS-pNA/0.jpg)](https://www.youtube.com/watch?v=xOXE7AS-pNA)


**Topics**
+ [Table optimization prerequisites](optimization-prerequisites.md)
+ [Catalog-level table optimizers](catalog-level-optimizers.md)
+ [Compaction optimization](compaction-management.md)
+ [Snapshot retention optimization](snapshot-retention-management.md)
+ [Deleting orphan files](orphan-file-deletion.md)
+ [Viewing optimization details](view-optimization-status.md)
+ [Viewing Amazon CloudWatch metrics](view-optimization-metrics.md)
+ [Deleting an optimizer](delete-optimizer.md)
+ [Considerations and limitations](optimizer-notes.md)
+ [Supported Regions for table optimizers](regions-optimizers.md)

# Table optimization prerequisites
<a name="optimization-prerequisites"></a>

The table optimizer assumes the permissions of the AWS Identity and Access Management (IAM) role that you specify when you enable optimization options (compaction, snapshot retention, and orphan file deletion) for a table. You can either create a single role for all optimizers or create separate roles for each optimizer.

**Note**  
The orphan file deletion optimizer doesn't require the `glue:updateTable` or `s3:putObject` permissions. The snapshot expiration and compaction optimizers require the same set of permissions.

The IAM role must have the permissions to read data and update metadata in the Data Catalog. You can create an IAM role and attach the following inline policies:
+ Add the following inline policy that grants Amazon S3 read/write permissions on the location for data that is not registered with AWS Lake Formation. This policy also includes permissions to update the table in the Data Catalog, and to permit AWS Glue to add logs in Amazon CloudWatch logs and publish metrics. For source data in Amazon S3 that isn't registered with Lake Formation, access is determined by IAM permissions policies for Amazon S3 and AWS Glue actions. 

  In the following inline policies, replace `amzn-s3-demo-bucket` with your Amazon S3 bucket name, `111122223333` and `us-east-1` with a valid AWS account number and the Region of the Data Catalog, and `<database-name>` and `<table-name>` with the names of your database and table.

------
#### [ JSON ]

****  

  ```
  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Action": [
                  "s3:PutObject",
                  "s3:GetObject",
                  "s3:DeleteObject"
              ],
              "Resource": [
                  "arn:aws:s3:::amzn-s3-demo-bucket/*"
              ]
          },
          {
              "Effect": "Allow",
              "Action": [
                  "s3:ListBucket"
              ],
              "Resource": [
                  "arn:aws:s3:::amzn-s3-demo-bucket"
              ]
          },
          {
              "Effect": "Allow",
              "Action": [
                  "glue:UpdateTable",
                  "glue:GetTable"
              ],
              "Resource": [
                  "arn:aws:glue:us-east-1:111122223333:table/<database-name>/<table-name>",
                  "arn:aws:glue:us-east-1:111122223333:database/<database-name>",
                  "arn:aws:glue:us-east-1:111122223333:catalog"
              ]
          },
          {
              "Effect": "Allow",
              "Action": [
                  "logs:CreateLogGroup",
                  "logs:CreateLogStream",
                  "logs:PutLogEvents"
              ],
              "Resource": [
                  "arn:aws:logs:us-east-1:111122223333:log-group:/aws-glue/iceberg-compaction/logs:*",
                  "arn:aws:logs:us-east-1:111122223333:log-group:/aws-glue/iceberg-retention/logs:*",
                  "arn:aws:logs:us-east-1:111122223333:log-group:/aws-glue/iceberg-orphan-file-deletion/logs:*"
              ]
          }
      ]
  }
  ```

------
+ Use the following policy to enable compaction for data registered with Lake Formation. 

  If the optimization role doesn't have `IAM_ALLOWED_PRINCIPALS` group permissions granted on the table, the role requires Lake Formation `ALTER`, `DESCRIBE`, `INSERT` and `DELETE` permissions on the table. 

  For more information on registering an Amazon S3 bucket with Lake Formation, see [Adding an Amazon S3 location to your data lake](https://docs.aws.amazon.com/lake-formation/latest/dg/register-data-lake.html).

------
#### [ JSON ]

****  

  ```
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "lakeformation:GetDataAccess"
        ],
        "Resource": "*"
      },
      {
        "Effect": "Allow",
        "Action": [
          "glue:UpdateTable",
          "glue:GetTable"
        ],
        "Resource": [
          "arn:aws:glue:us-east-1:111122223333:table/databaseName/tableName",
          "arn:aws:glue:us-east-1:111122223333:database/databaseName",
          "arn:aws:glue:us-east-1:111122223333:catalog"
        ]
      },
      {
        "Effect": "Allow",
        "Action": [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ],
        "Resource": [
          "arn:aws:logs:us-east-1:111122223333:log-group:/aws-glue/iceberg-compaction/logs:*",
          "arn:aws:logs:us-east-1:111122223333:log-group:/aws-glue/iceberg-retention/logs:*",
          "arn:aws:logs:us-east-1:111122223333:log-group:/aws-glue/iceberg-orphan-file-deletion/logs:*"
        ]
      }
    ]
  }
  ```

------
+ (Optional) To optimize Iceberg tables with data in Amazon S3 buckets encrypted using [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingKMSEncryption.html), the compaction role requires permissions to decrypt Amazon S3 objects and to generate a new data key to write objects to the encrypted buckets. Add the following policy to the desired AWS KMS key. Only bucket-level encryption is supported.

  ```
  {
      "Effect": "Allow",
      "Principal": {
          "AWS": "arn:aws:iam::<aws-account-id>:role/<optimizer-role-name>"
      },
      "Action": [
          "kms:Decrypt",
          "kms:GenerateDataKey"
      ],
      "Resource": "*"
  }
  ```
+  (Optional) For data location registered with Lake Formation, the role used to register the location requires permissions to decrypt Amazon S3 objects and generate a new data key to write objects to the encrypted buckets. For more information, see [Registering an encrypted Amazon S3 location](https://docs.aws.amazon.com/lake-formation/latest/dg/register-encrypted.html). 
+ (Optional) If the AWS KMS key is stored in a different AWS account, add the following permissions to the compaction role.

------
#### [ JSON ]

****  

  ```
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "kms:Decrypt",
          "kms:GenerateDataKey"
        ],
        "Resource": [
          "arn:aws:kms:us-east-1:111122223333:key/key-id"
        ]
      }
    ]
  }
  ```

------
+ The principal that configures compaction must have the `iam:PassRole` permission on the optimizer role. 

------
#### [ JSON ]

****  

  ```
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "iam:PassRole"
        ],
        "Resource": [
          "arn:aws:iam::111122223333:role/<optimizer-role-name>"
        ]
      }
    ]
  }
  ```

------
+ Add the following trust policy to the role so that the AWS Glue service can assume the IAM role to run the compaction process.

------
#### [ JSON ]

****  

  ```
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Sid": "",
        "Effect": "Allow",
        "Principal": {
          "Service": "glue.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }
    ]
  }
  ```

------
+ <a name="catalog-optimizer-requirement"></a> (Optional) To update the Data Catalog settings to enable catalog-level table optimization, the IAM role must have the `glue:UpdateCatalog` permission or the AWS Lake Formation `ALTER CATALOG` permission on the root catalog. You can use the `GetCatalog` API operation to verify the catalog properties. 

# Catalog-level table optimizers
<a name="catalog-level-optimizers"></a>

With a one-time catalog configuration, you can set up automatic optimizers such as compaction, snapshot retention, and orphan file deletion for all new and updated Apache Iceberg tables in the AWS Glue Data Catalog. Catalog-level optimizer configurations allow you to apply consistent optimizer settings across all tables within a catalog, eliminating the need to configure optimizers individually for each table.

Data lake administrators can configure the table optimizers by selecting the default catalog in the Lake Formation console and enabling optimizers using the `Table optimization` option. When you create new tables or update existing tables in the Data Catalog, the Data Catalog automatically runs the table optimizations to reduce operational burden.

If you have configured optimization at the table level, or if you previously deleted the table optimization settings for a table, those table-specific settings take precedence over the catalog-level defaults for table optimization. If a configuration parameter is not defined at either the table or the catalog level, the Iceberg table property value is applied. This behavior applies to the snapshot retention and orphan file deletion optimizers.
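The precedence described above can be sketched as a small resolver. This is an illustrative model with hypothetical names, not an AWS API:

```python
# Illustrative resolver (not an AWS API) for the precedence described above:
# table-level settings win over catalog-level defaults, and an Iceberg table
# property fills in anything left undefined at both levels.
def resolve_setting(name, table_config, catalog_config, iceberg_properties):
    """Return the effective value for one optimizer setting, or None."""
    for source in (table_config, catalog_config, iceberg_properties):
        if source and name in source:
            return source[name]
    return None

# Example: the retention period is defined only at the catalog level.
table_cfg = {}                                        # no table-level override
catalog_cfg = {"snapshotRetentionPeriodInDays": 10}   # catalog-level default
iceberg_props = {"snapshotRetentionPeriodInDays": 5}  # Iceberg table property

print(resolve_setting("snapshotRetentionPeriodInDays",
                      table_cfg, catalog_cfg, iceberg_props))  # 10
```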

When enabling catalog-level optimizers, consider the following:
+ If you configure optimization settings when you create the catalog and later disable the optimizations through an `UpdateCatalog` request, the change cascades through all the tables within the catalog.
+ If you have already configured optimizers for a given table, then the disable operation at the catalog level will not impact this table.
+ When you disable optimizers at the catalog level, tables with existing optimizer configurations will maintain their specific settings and remain unaffected by the catalog-level change. However, tables without their own optimizer configurations will inherit the disabled state from the catalog level.
+ Since snapshot retention and orphan file deletion optimizers can be schedule-based, updates will introduce a random delay to the start of their schedule. This will cause each optimizer to start at slightly different times, spreading out the load and reducing the likelihood of exceeding service limits.
+ Catalog-level optimizer settings are not automatically inherited by tables when AWS Glue Data Catalog encryption is enabled. If your catalog has metadata encryption enabled, you must configure table optimizers individually for each table. To use catalog-level optimizer inheritance, metadata encryption must be disabled on the catalog.

**Topics**
+ [Enabling catalog-level automatic table optimization](enable-auto-table-optimizers.md)
+ [Viewing catalog-level optimizations](view-catalog-optimizations.md)
+ [Disabling catalog-level table optimization](disable-auto-table-optimizers.md)

# Enabling catalog-level automatic table optimization
<a name="enable-auto-table-optimizers"></a>

 You can enable automatic table optimization for all new Apache Iceberg tables in the Data Catalog. After creating a table, you can also explicitly update its table optimization settings manually. 

 To update the Data Catalog settings to enable catalog-level table optimizations, the IAM role used must have the `glue:UpdateCatalog` permission on the root catalog. You can use the `GetCatalog` API to verify the catalog properties. 

 For Lake Formation managed tables, the IAM role selected during the catalog optimization configuration requires the Lake Formation `ALTER`, `DESCRIBE`, `INSERT`, and `DELETE` permissions on any new or updated tables. 

## To enable catalog-level optimizers (console)
<a name="enable-catalog-optimizers-console"></a>

1. Open the Lake Formation console at [https://console.aws.amazon.com/lakeformation/](https://console.aws.amazon.com/lakeformation/).

1. In the navigation pane, choose **Data Catalog**.

1. Select the **Catalogs** tab.

1. Choose the account-level catalog.

1. On the **Table optimizations** tab, choose **Edit**. You can also choose **Edit optimizations** from **Actions**.  
![\[The screenshot shows the edit option to enable optimizations at the catalog-level.\]](http://docs.aws.amazon.com/glue/latest/dg/images/catalog-edit-optimizations.png)

1. On the **Table optimization** page, configure the following options:  
![\[The screenshot shows the optimization options at the catalog-level.\]](http://docs.aws.amazon.com/glue/latest/dg/images/catalog-optimization-options.png)

   1. Configure **Compaction** settings:
      + Enable/disable compaction.
      + Choose the IAM role that has the necessary permissions to run the optimizers.

        For more information on the permission requirements for the IAM role, see [Table optimization prerequisites](optimization-prerequisites.md).

   1. Configure **Snapshot retention** settings:
      + Enable/disable retention.
      + Set snapshot retention period in days - default is 5 days.
      + Set number of snapshots to retain - default is 1 snapshot.
      + Enable/disable cleaning of expired files.

   1. Configure **Orphan file deletion** settings:
      + Enable/disable orphan file deletion.
      + Set orphan file retention period in days - default is 3 days.

1. Choose **Save**.

## To enable catalog-level optimizers (AWS CLI)
<a name="catalog-auto-optimizers-cli"></a>

Use the following CLI command to update an existing catalog with optimizer settings:

**Example Update catalog with optimizer settings**  

```
aws glue update-catalog \
  --catalog-id 111122223333 \
  --catalog-input \
  '{
    "CatalogProperties": {
        "CustomProperties": {
            "ColumnStatistics.Enabled": "false",
            "ColumnStatistics.RoleArn": "arn:aws:iam::111122223333:role/service-role/stats-role-name"
        },
        "IcebergOptimizationProperties": {
            "RoleArn": "arn:aws:iam::111122223333:role/optimizer-role-name",
            "Compaction": {
                "enabled": "true"
            },
            "Retention": {
                "enabled": "true",
                "snapshotRetentionPeriodInDays": "10",
                "numberOfSnapshotsToRetain": "5",
                "cleanExpiredFiles": "true"
            },
            "OrphanFileDeletion": {
                "enabled": "true",
                "orphanFileRetentionPeriodInDays": "3"
            }
        }
    }
}'
```

If you encounter issues with catalog-level optimizers, check the following:
+ Ensure the IAM role has the correct permissions as outlined in the Prerequisites section.
+ Check CloudWatch logs for any error messages related to optimizer operations.

   For more information, see [View available metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/viewing_metrics_with_cloudwatch.html) in the *Amazon CloudWatch User Guide*. 
+ Verify that the catalog settings were successfully applied by checking the catalog configuration.
+ For table access failures, check the CloudWatch logs and EventBridge notifications for detailed error information.

# Viewing catalog-level optimizations
<a name="view-catalog-optimizations"></a>

 When catalog-level table optimization is enabled, anytime an Apache Iceberg table is created or updated through the `CreateTable` or `UpdateTable` APIs using the AWS Management Console, an SDK, or the AWS Glue crawler, an equivalent table-level setting is created for that table. 

 After you create or update a table, you can verify the table details to confirm the table optimization. The **Table optimization** tab shows the **Configuration source** property set to `Catalog`. 

![\[An image of an Apache Iceberg table with catalog-level optimization configuration has  been applied.\]](http://docs.aws.amazon.com/glue/latest/dg/images/catalog-optimization-enabled.png)


# Disabling catalog-level table optimization
<a name="disable-auto-table-optimizers"></a>

 You can disable table optimization for new tables using the AWS Lake Formation console or the `glue:UpdateCatalog` API. 

**To disable the table optimizations at the catalog level**

1. Open the Lake Formation console at [https://console.aws.amazon.com/lakeformation/](https://console.aws.amazon.com/lakeformation/).

1. On the left navigation bar, choose **Catalogs**.

1. On the **Catalog summary** page, choose **Edit** under **Table optimizations**.

1. On the **Edit optimization** page, unselect the **Optimization options**.

1. Choose **Save**.

# Compaction optimization
<a name="compaction-management"></a>

 Amazon S3 data lakes that use open table formats like Apache Iceberg store data as S3 objects. Having thousands of small Amazon S3 objects in a data lake table increases metadata overhead and affects read performance. The AWS Glue Data Catalog provides managed compaction for Iceberg tables, compacting small objects into larger ones for better read performance by AWS analytics services like Amazon Athena and Amazon EMR, and by AWS Glue ETL jobs. The Data Catalog performs compaction without interfering with concurrent queries, and supports compaction only for tables in Parquet format. 

The table optimizer continuously monitors table partitions and starts the compaction process when the thresholds for the number of files and file sizes are exceeded.

In the Data Catalog, the compaction process starts when a table or any of its partitions have more than 100 files. Each file must be smaller than 75% of the target file size. The target file size is defined by the `write.target-file-size-bytes` table property, which defaults to 512 MB if not explicitly set.
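The trigger condition described above can be sketched in Python. This is an illustrative check with hypothetical names, not part of any AWS API:

```python
# Illustrative check (hypothetical names, not a Glue API) of the compaction
# trigger described above: compaction starts when a partition holds more than
# 100 files, each smaller than 75% of the target file size.
DEFAULT_TARGET_FILE_SIZE = 512 * 1024 * 1024  # write.target-file-size-bytes default

def needs_compaction(file_sizes, target_file_size=DEFAULT_TARGET_FILE_SIZE,
                     min_file_count=100, small_file_ratio=0.75):
    """Return True if the partition's small files exceed the trigger threshold."""
    threshold = small_file_ratio * target_file_size
    small_files = [size for size in file_sizes if size < threshold]
    return len(small_files) > min_file_count

# 150 files of 10 MB each are well under 75% of 512 MB, so compaction triggers.
print(needs_compaction([10 * 1024 * 1024] * 150))  # True
print(needs_compaction([10 * 1024 * 1024] * 50))   # False
```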

 For limitations, see [Supported formats and limitations for managed data compaction](optimizer-notes.md#compaction-notes). 

**Topics**
+ [Enabling compaction optimizer](enable-compaction.md)
+ [Disabling compaction optimizer](disable-compaction.md)

# Enabling compaction optimizer
<a name="enable-compaction"></a>

 You can use the AWS Glue console, AWS CLI, or AWS API to enable compaction for your Apache Iceberg tables in the AWS Glue Data Catalog. For new tables, you can choose Apache Iceberg as the table format and enable compaction when you create the table. Compaction is disabled by default for new tables.

------
#### [ Console ]

**To enable compaction**

1.  Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/) and sign in as a data lake administrator, the table creator, or a user who has been granted the `glue:UpdateTable` and `lakeformation:GetDataAccess` permissions on the table. 

1. In the navigation pane, under **Data Catalog**, choose **Tables**.

1. On the **Tables** page, choose a table in open table format that you want to enable compaction for, then under **Actions** menu, choose **Optimization**, and then choose **Enable**.

   You can also enable compaction from the **Table details** page. Choose the **Table optimization** tab on the lower section of the page, and then choose **Enable compaction**. 

   The **Enable optimization** option is also available when you create a new Iceberg table in the Data Catalog.

1. On the **Enable optimization** page, choose **Compaction** under **Optimization options**.  
![\[Apache Iceberg table details page with Enable compaction option.\]](http://docs.aws.amazon.com/glue/latest/dg/images/table-enable-compaction.png)

1. Next, select an IAM role from the drop-down list with the permissions shown in the [Table optimization prerequisites](optimization-prerequisites.md) section. 

   You can also choose **Create a new IAM role** option to create a custom role with the required permissions to run compaction.

    Follow the steps below to update an existing IAM role: 

   1.  To update the permissions policy for the IAM role, in the IAM console, go to the IAM role that is being used for running compaction. 

   1.  In the **Add permissions** section, choose **Create policy**. In the newly opened browser window, create a new policy to use with your role. 

   1. On the **Create policy** page, choose the **JSON** tab. Copy the JSON code shown in the prerequisites into the policy editor field.

1. If you have security policy configurations where the Iceberg table optimizer needs to access Amazon S3 buckets from a specific Virtual Private Cloud (VPC), create an AWS Glue network connection or use an existing one.

   If you don't have an AWS Glue VPC connection set up already, create a new one by following the steps in the [Creating connections for connectors](https://docs.aws.amazon.com/glue/latest/dg/creating-connections.html) section using the AWS Glue console or the AWS CLI/SDK.

1. Choose a compaction strategy. The available options are:
   + **Binpack** – Binpack is the default compaction strategy in Apache Iceberg. It combines smaller data files into larger ones for optimal performance.
   + **Sort** – Sorting in Apache Iceberg is a data organization technique that clusters information within files based on specified columns, significantly improving query performance by reducing the number of files that need to be processed. You define the sort order in Iceberg's metadata using the sort-order field, and when multiple columns are specified, data is sorted in the sequence the columns appear in the sort order, ensuring records with similar values are stored together within files. The sorting compaction strategy takes the optimization further by sorting data across all files within a partition. 
   + **Z-order** – Z-ordering is a way to organize data when you need to sort by multiple columns with equal importance. Unlike traditional sorting that prioritizes one column over others, Z-ordering gives balanced weight to each column, helping your query engine read fewer files when searching for data.

     The technique works by weaving together the binary digits of values from different columns. For example, if you have the numbers 3 and 4 from two columns, Z-ordering first converts them to binary (3 becomes 011 and 4 becomes 100), then interleaves these digits to create a new value: 011010. This interleaving creates a pattern that keeps related data physically close together.

     Z-ordering is particularly effective for multi-dimensional queries. For example, a customer table Z-ordered by income, state, and zip code can deliver superior performance compared to hierarchical sorting when querying across multiple dimensions. This organization allows queries targeting specific combinations of income and geographic location to quickly locate relevant data while minimizing unnecessary file scans.

1. For **Minimum input files**, specify the number of data files required in a partition before compaction is triggered.

1. For **Delete files threshold**, specify the minimum number of delete operations required on a data file before it becomes eligible for compaction.

1. Choose **Enable optimization**.
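   The bit interleaving behind the **Z-order** strategy described above can be illustrated with a short Python sketch. This is conceptual only; the function name is hypothetical, and it is not how Iceberg implements Z-ordering internally:

```python
# Conceptual sketch of Z-order bit interleaving (hypothetical helper, not an
# Iceberg or AWS API). The binary digits of the two values are woven
# together, most significant bit first.
def z_order_interleave(a, b, bits=3):
    """Interleave the binary digits of a and b into a single Z-order value."""
    result = 0
    for i in range(bits - 1, -1, -1):
        result = (result << 1) | ((a >> i) & 1)  # next bit of a
        result = (result << 1) | ((b >> i) & 1)  # next bit of b
    return result

# As in the example above: 3 (011) and 4 (100) interleave to 011010.
print(format(z_order_interleave(3, 4), "06b"))  # 011010
```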

------
#### [ AWS CLI ]

 The following example shows how to enable compaction. Replace the account ID with a valid AWS account ID. Replace the database name and table name with the names of an actual Iceberg table and its database. Replace the `roleArn` value with the Amazon Resource Name (ARN) of the IAM role that has the required permissions to run compaction. You can replace the `sort` compaction strategy with another supported strategy, such as `z-order` or `binpack`, depending on your requirements.

```
aws glue create-table-optimizer \
  --catalog-id 123456789012 \
  --database-name iceberg_db \
  --table-name iceberg_table \
  --table-optimizer-configuration '{
    "roleArn": "arn:aws:iam::123456789012:role/optimizer_role",
    "enabled": true,
    "vpcConfiguration": {"glueConnectionName": "glue_connection_name"},
    "compactionConfiguration": {
      "icebergConfiguration": {"strategy": "sort"}
    }
  }' \
--type compaction
```

------
#### [ AWS API ]

Call [CreateTableOptimizer](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-table-optimizers.html#aws-glue-api-table-optimizers-CreateTableOptimizer) operation to enable compaction for a table.

------

After you enable compaction, the **Table optimization** tab shows the following compaction details once the compaction run is complete:

Start time  
The time at which the compaction process started within Data Catalog. The value is a timestamp in UTC time. 

End time  
The time at which the compaction process ended in Data Catalog. The value is a timestamp in UTC time. 

Status  
The status of the compaction run. Valid values are `success` or `fail`.

Files compacted  
Total number of files compacted.

Bytes compacted  
Total number of bytes compacted.

# Disabling compaction optimizer
<a name="disable-compaction"></a>

 You can disable automatic compaction for a particular Apache Iceberg table using the AWS Glue console or the AWS CLI. 

------
#### [ Console ]

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. On the left navigation, under **Data Catalog**, choose **Tables**. 

1. From the tables list, choose the Iceberg table that you want to disable compaction for.

1. Choose the **Table optimization** tab on the lower section of the **Tables details** page.

1. From **Actions**, choose **Disable**, and then choose **Compaction**.

1.  Choose **Disable compaction** on the confirmation message. You can re-enable compaction at a later time. 

    After you confirm, compaction is disabled and the compaction status for the table returns to `Disabled`.

------
#### [ AWS CLI ]

In the following example, replace the account ID with a valid AWS account ID. Replace the database name and table name with the names of an actual Iceberg table and its database. Replace the `roleArn` value with the Amazon Resource Name (ARN) of the IAM role that has the required permissions to run compaction.

```
aws glue update-table-optimizer \
  --catalog-id 123456789012 \
  --database-name iceberg_db \
  --table-name iceberg_table \
  --table-optimizer-configuration '{"roleArn":"arn:aws:iam::123456789012:role/optimizer_role", "enabled":false, "vpcConfiguration":{"glueConnectionName":"glue_connection_name"}}' \
  --type compaction
```

------
#### [ AWS API ]

Call [UpdateTableOptimizer](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-table-optimizers.html#aws-glue-api-table-optimizers-UpdateTableOptimizer) operation to disable compaction for a specific table.

------

# Snapshot retention optimization
<a name="snapshot-retention-management"></a>

Apache Iceberg snapshot retention feature allows users to query historical data at specific points in time and revert unwanted modifications to their tables. In the AWS Glue Data Catalog, snapshot retention configuration controls how long these snapshots (versions of the table data) are kept before being expired and removed. This helps manage storage costs and metadata overhead by automatically removing older snapshots based on a configured retention period or maximum number of snapshots to keep. 

You can configure the retention period in days and the maximum number of snapshots to retain for a table. AWS Glue removes snapshots that are older than the specified retention period from the table metadata, while keeping the most recent snapshots up to the configured limit. After removing old snapshots from the metadata, AWS Glue deletes the corresponding data and metadata files that are no longer referenced and unique to the expired snapshots. This allows time travel queries only up to the remaining retained snapshots, while reclaiming storage space used by expired snapshot data.
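The retention rule described above can be sketched as follows. This is an illustrative model with hypothetical names, not the actual AWS Glue implementation:

```python
# Illustrative model (hypothetical names, not the Glue implementation) of the
# retention rule described above: expire snapshots older than the retention
# period, while always keeping the most recent snapshots up to the limit.
def snapshots_to_expire(ages_in_days, retention_days=5, min_to_retain=1):
    """Return the ages of snapshots eligible for expiry (defaults: 5 days, 1 kept)."""
    newest_first = sorted(ages_in_days)  # a smaller age means a newer snapshot
    return [age for index, age in enumerate(newest_first)
            if index >= min_to_retain and age > retention_days]

# Snapshots aged 1, 3, 6, and 9 days: the two older than 5 days expire.
print(snapshots_to_expire([9, 1, 6, 3]))  # [6, 9]
# Even if every snapshot is old, the newest one is always retained.
print(snapshots_to_expire([10, 20]))      # [20]
```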

**Topics**
+ [Enabling snapshot retention optimizer](enable-snapshot-retention.md)
+ [Updating snapshot retention optimizer](update-snapshot-retention.md)
+ [Disabling snapshot retention optimizer](disable-snapshot-retention.md)

# Enabling snapshot retention optimizer
<a name="enable-snapshot-retention"></a>

 You can use the AWS Glue console, AWS CLI, or AWS API to enable snapshot retention optimizers for your Apache Iceberg tables in the Data Catalog. For new tables, you can choose Apache Iceberg as the table format and enable the snapshot retention optimizer when you create the table. Snapshot retention is disabled by default for new tables.

------
#### [ Console ]

**To enable snapshot retention optimizer**

1.  Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/) and sign in as a data lake administrator, the table creator, or a user who has been granted the `glue:UpdateTable` and `lakeformation:GetDataAccess` permissions on the table. 

1. In the navigation pane, under **Data Catalog**, choose **Tables**.

1. On the **Tables** page, choose an Iceberg table that you want to enable snapshot retention optimizer for, then under **Actions** menu, choose **Enable** under **Optimization**.

   You can also enable optimization by selecting the table and opening the **Table details** page. Choose the **Table optimization** tab on the lower section of the page, and choose **Enable snapshot retention**. 

1. On the **Enable optimization** page, under **Optimization configuration**, you have two options: **Use default setting** or **Customize settings**. If you choose to use the default settings, AWS Glue uses the properties defined in the Iceberg table configuration to determine the snapshot retention period and the number of snapshots to be retained. In the absence of this configuration, AWS Glue retains one snapshot for five days, and deletes files associated with the expired snapshots.

1.  Next, choose an IAM role that AWS Glue can assume on your behalf to run the optimizer. For details about the permissions required for the IAM role, see the [Table optimization prerequisites](optimization-prerequisites.md) section.

   Follow the steps below to update an existing IAM role: 

   1.  To update the permissions policy for the IAM role, in the IAM console, go to the IAM role that is being used for running the optimizer. 

   1.  In the **Add permissions** section, choose **Create policy**. In the newly opened browser window, create a new policy to use with your role. 

   1. On the **Create policy** page, choose the **JSON** tab. Copy the JSON code shown in the prerequisites into the policy editor field.

1. If you prefer to set the values for the **Snapshot retention configuration** manually, choose **Customize settings**.   
![\[Apache Iceberg table details page with Enable retention>Customize settings option.\]](http://docs.aws.amazon.com/glue/latest/dg/images/table-enable-retention.png)

1. Select the **Apply the selected IAM role to the selected optimizers** option to use a single IAM role for all optimizers.

1. If you have security policy configurations where the Iceberg table optimizer needs to access Amazon S3 buckets from a specific Virtual Private Cloud (VPC), create an AWS Glue network connection or use an existing one.

   If you don't have an AWS Glue VPC Connection set up already, create a new one by following the steps in the [Creating connections for connectors](https://docs.aws.amazon.com/glue/latest/dg/creating-connections.html) section using the AWS Glue console or the AWS CLI/SDK.

1. Next, under **Snapshot retention configuration**, either choose to use the values specified in the [Iceberg table configuration](https://iceberg.apache.org/docs/1.5.2/configuration/#table-behavior-properties), or specify custom values for the snapshot retention period (`history.expire.max-snapshot-age-ms`), the minimum number of snapshots to retain (`history.expire.min-snapshots-to-keep`), and the time in hours between consecutive snapshot deletion job runs.

1.  Choose **Delete associated files** to delete underlying files when the table optimizer deletes old snapshots from the table metadata.

    If you don't choose this option, when older snapshots are removed from the table metadata, their associated files will remain in the storage as orphaned files. 

1. Next, read the caution statement, and choose **I acknowledge** to proceed.
**Note**  
 In the Data Catalog, the snapshot retention optimizer honors the lifecycle that is controlled by branch and tag level retention policies. For more information, see [Branching and tagging](https://iceberg.apache.org/docs/latest/branching/#overview) section in the Iceberg documentation.

1. Review the configuration and choose **Enable optimization**.

   Wait a few minutes for the retention optimizer to run and expire old snapshots based on the configuration.

------
#### [ AWS CLI ]

 To enable snapshot retention for new Iceberg tables in AWS Glue, you need to create a table optimizer of type `retention` and set the `enabled` field to `true` in the `table-optimizer-configuration`. You can do this using the AWS CLI command `create-table-optimizer` or `update-table-optimizer`. Additionally, you need to specify the retention configuration fields like `snapshotRetentionPeriodInDays` and `numberOfSnapshotsToRetain` based on your requirements.

The following example shows how to enable the snapshot retention optimizer. Replace the account ID with a valid AWS account ID. Replace the database name and table name with the names of an actual Iceberg table and its database. Replace the `roleArn` value with the Amazon Resource Name (ARN) of the IAM role that has the required permissions to run the snapshot retention optimizer. 

```
aws glue create-table-optimizer \
  --catalog-id 123456789012 \
  --database-name iceberg_db \
  --table-name iceberg_table \
  --table-optimizer-configuration '{"roleArn":"arn:aws:iam::123456789012:role/optimizer_role", "enabled":true, "vpcConfiguration":{"glueConnectionName":"glue_connection_name"}, "retentionConfiguration":{"icebergConfiguration":{"snapshotRetentionPeriodInDays":7,"numberOfSnapshotsToRetain":3,"cleanExpiredFiles":true}}}' \
  --type retention
```

 This command creates a retention optimizer for the specified Iceberg table in the given catalog, database, and Region. The table-optimizer-configuration specifies the IAM role ARN to use, enables the optimizer, and sets the retention configuration. In this example, it retains snapshots for 7 days, keeps a minimum of 3 snapshots, and cleans expired files. 
+ `snapshotRetentionPeriodInDays` – The number of days to retain snapshots before expiring them. The default value is `5`. 
+ `numberOfSnapshotsToRetain` – The minimum number of snapshots to keep, even if they are older than the retention period. The default value is `1`. 
+ `cleanExpiredFiles` – A Boolean indicating whether to delete expired data files after expiring snapshots. The default value is `true`.

   When set to `true`, older snapshots are removed from the table metadata and their underlying files are deleted. When set to `false`, older snapshots are removed from the table metadata, but their underlying files remain in storage as orphan files. 
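A short sketch can show how the documented defaults fill in omitted fields. The helper name is hypothetical; this is not part of the AWS CLI or SDK:

```python
# Hypothetical helper (not part of the AWS CLI or SDK) showing how the
# documented defaults apply when fields are omitted from icebergConfiguration.
RETENTION_DEFAULTS = {
    "snapshotRetentionPeriodInDays": 5,  # documented default
    "numberOfSnapshotsToRetain": 1,      # documented default
    "cleanExpiredFiles": True,           # documented default
}

def with_retention_defaults(iceberg_configuration):
    """Merge user-provided retention settings over the documented defaults."""
    return {**RETENTION_DEFAULTS, **iceberg_configuration}

print(with_retention_defaults({"snapshotRetentionPeriodInDays": 7}))
# {'snapshotRetentionPeriodInDays': 7, 'numberOfSnapshotsToRetain': 1, 'cleanExpiredFiles': True}
```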

------
#### [ AWS API ]

Call [CreateTableOptimizer](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-table-optimizers.html#aws-glue-api-table-optimizers-CreateTableOptimizer) operation to enable snapshot retention optimizer for a table.

------

After you enable snapshot retention, the **Table optimization** tab shows the following retention details (after approximately 15-20 minutes):

Start time  
The time at which the snapshot retention optimizer started. The value is a timestamp in UTC time. 

Run time  
The duration of the optimizer run.

Status  
The status of the optimizer run. Valid values are `success` or `fail`.

Data files deleted  
Total number of files deleted.

Manifest files deleted  
Total number of manifest files deleted.

Manifest lists deleted  
Total number of manifest lists deleted.

# Updating snapshot retention optimizer
<a name="update-snapshot-retention"></a>

 You can update the existing configuration of a snapshot retention optimizer for a particular Apache Iceberg table using the AWS Glue console, the AWS CLI, or the `UpdateTableOptimizer` API. 

------
#### [ Console ]

**To update snapshot retention configuration**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Choose **Data Catalog**, and then choose **Tables**. From the tables list, choose the Iceberg table that you want to update the snapshot retention optimizer configuration for.

1. On the lower section of the **Table details** page, select the **Table optimization** tab, and then choose **Edit**. You can also choose **Edit** under **Optimization** from the **Actions** menu located on the top right corner of the page.

1.  On the **Edit optimization** page, make the desired changes. 

1.  Choose **Save**. 

------
#### [ AWS CLI ]

 To update a snapshot retention optimizer using the AWS CLI, you can use the following command: 

```
aws glue update-table-optimizer \
 --catalog-id 123456789012 \
 --database-name iceberg_db \
 --table-name iceberg_table \
 --table-optimizer-configuration '{"roleArn":"arn:aws:iam::123456789012:role/optimizer_role", "enabled":true, "vpcConfiguration":{"glueConnectionName":"glue_connection_name"}, "retentionConfiguration":{"icebergConfiguration":{"snapshotRetentionPeriodInDays":7,"numberOfSnapshotsToRetain":3,"cleanExpiredFiles":true}}}' \
 --type retention
```

 This command updates the retention configuration for the specified table in the given catalog, database, and Region. The key parameters are: 
+ `snapshotRetentionPeriodInDays` – The number of days to retain snapshots before expiring them. The default value is `5`. 
+ `numberOfSnapshotsToRetain` – The minimum number of snapshots to keep, even if they are older than the retention period. The default value is `1`. 
+ `cleanExpiredFiles` – A Boolean indicating whether to delete expired data files after expiring snapshots. The default value is `true`. 

   When set to `true`, older snapshots are removed from the table metadata and their underlying files are deleted. When set to `false`, older snapshots are removed from the table metadata, but their underlying files remain in storage as orphan files. 

------
#### [ API ]

To update a table optimizer, you can use the `UpdateTableOptimizer` API. This API allows you to update the configuration of an existing table optimizer for compaction, retention, or orphan file deletion. The request parameters include:
+ `catalogId` – The ID of the catalog containing the table. 
+ `databaseName` – The name of the database containing the table. 
+ `tableName` – The name of the table. 
+ `type` – The type of table optimizer (`compaction`, `retention`, or `orphan_file_deletion`). 
+ `tableOptimizerConfiguration` – The updated configuration for the table optimizer, including the role ARN, enabled status, and the retention or orphan file removal configuration. 

------

# Disabling snapshot retention optimizer
<a name="disable-snapshot-retention"></a>

 You can disable the snapshot retention optimizer for a particular Apache Iceberg table using the AWS Glue console or the AWS CLI. 

------
#### [ Console ]

**To disable snapshot retention**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Choose **Data Catalog** and choose **Tables**. From the tables list, choose the Iceberg table for which you want to disable snapshot retention.

1. On the lower section of the **Table details** page, choose **Table optimization**, and then choose **Disable**, **Snapshot retention** under **Actions**.

   You can also choose **Disable** under **Optimization** from the **Actions** menu located on the top right corner of the page.

1. Choose **Disable** on the confirmation message. You can re-enable the snapshot retention optimizer at a later time.

    After you confirm, the snapshot retention optimizer is disabled and the status for snapshot retention returns to `Not enabled`.

------
#### [ AWS CLI ]

In the following example, replace the account ID with a valid AWS account ID, and replace the database name and table name with your actual Iceberg database and table names. Replace `roleArn` with the Amazon Resource Name (ARN) of the IAM role that has the required permissions to run the retention optimizer.

```
aws glue update-table-optimizer \
  --catalog-id 123456789012 \
  --database-name iceberg_db \
  --table-name iceberg_table \
  --table-optimizer-configuration '{"roleArn":"arn:aws:iam::123456789012:role/optimizer_role", "vpcConfiguration":{"glueConnectionName":"glue_connection_name"}, "enabled":false}' \
  --type retention
```

------
#### [ AWS API ]

Call the [UpdateTableOptimizer](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-table-optimizers.html#aws-glue-api-table-optimizers-UpdateTableOptimizer) operation to disable the snapshot retention optimizer for a specific table.

------

# Deleting orphan files
<a name="orphan-file-deletion"></a>

 The AWS Glue Data Catalog allows you to remove orphan files from your Iceberg tables. Orphan files are unreferenced files that exist in your Amazon S3 data source under the specified table location, are not tracked by the Iceberg table metadata, and are older than your configured age limit. These orphan files can accumulate over time due to failed operations such as compaction, partition drops, or table rewrites, and take up unnecessary storage space.

The orphan file deletion optimizer in AWS Glue scans the table metadata and the actual data files, identifies the orphan files, and deletes them to reclaim storage space. The optimizer only removes files created after the optimizer's creation date that also meet the configured deletion criteria. Files created before or on the optimizer creation date are never deleted.

**Orphan file deletion logic**

1. Date check – Compares the file creation date with the optimizer creation date. If the file is older than or equal to the optimizer creation date, the file is skipped.

1. Optimizer configuration check – If the file is newer than the optimizer creation date, the optimizer evaluates the file against the configured age limit. The optimizer deletes the file if it matches the deletion criteria, and skips the file if it doesn't.
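The two checks above can be sketched as a small decision function. This is an illustrative model of the documented logic, not the service's actual implementation; the dates and retention period are hypothetical.

```python
from datetime import datetime, timedelta

def orphan_file_action(file_created, optimizer_created, retention_days, now):
    """Return "skip" or "delete" for a candidate orphan file,
    following the two documented checks."""
    # Date check: files created on or before the optimizer creation
    # date are never deleted.
    if file_created <= optimizer_created:
        return "skip"
    # Age-limit check: delete only files older than the configured
    # retention period.
    if now - file_created >= timedelta(days=retention_days):
        return "delete"
    return "skip"

now = datetime(2025, 6, 10)
optimizer_created = datetime(2025, 6, 1)
# A file created before the optimizer is always kept; a file newer than
# the optimizer but past the 3-day age limit is deleted.
before_optimizer = orphan_file_action(datetime(2025, 5, 20), optimizer_created, 3, now)
past_age_limit = orphan_file_action(datetime(2025, 6, 5), optimizer_created, 3, now)
```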

 You can initiate the orphan file deletion by creating an orphan file deletion table optimizer in the Data Catalog.

**Important**  
 By default, orphan file deletion evaluates files across your AWS Glue table location. While you can limit the scope of evaluation to a sub-prefix by using an API parameter, you must ensure that your table location doesn't contain files from other data sources or tables. If your table location overlaps with other data sources, the service might identify and delete unrelated files as orphans. 

**Topics**
+ [Enabling orphan file deletion](enable-orphan-file-deletion.md)
+ [Updating orphan file deletion optimizer](update-orphan-file-deletion.md)
+ [Disabling orphan file deletion](disable-orphan-file-deletion.md)

# Enabling orphan file deletion
<a name="enable-orphan-file-deletion"></a>

 You can use the AWS Glue console, AWS CLI, or AWS API to enable orphan file deletion for your Apache Iceberg tables in the Data Catalog. For new tables, you can choose Apache Iceberg as the table format and enable the orphan file deletion optimizer when you create the table. Orphan file deletion is disabled by default for new tables.

------
#### [ Console ]

**To enable orphan file deletion**

1.  Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/) and sign in as a data lake administrator, the table creator, or a user who has been granted the `glue:UpdateTable` and `lakeformation:GetDataAccess` permissions on the table. 

1. In the navigation pane, under **Data Catalog**, choose **Tables**.

1. On the **Tables** page, choose the Iceberg table for which you want to enable orphan file deletion.

   Choose the **Table optimization** tab on the lower section of the page, and choose **Enable**, **Orphan file deletion** from **Actions**. 

   You can also choose **Enable** under **Optimization** from the **Actions** menu located on the top right corner of the page.

1. On the **Enable optimization** page, choose **Orphan file deletion** under **Optimization options**.

1. If you choose to use **Default settings**, all orphan files will be deleted after 3 days. If you want to keep the orphan files for a specific number of days, choose **Customize settings**.

1. Next, choose an IAM role with the required permissions to delete orphan files.

1. If you have security policy configurations where the Iceberg table optimizer needs to access Amazon S3 buckets from a specific Virtual Private Cloud (VPC), create an AWS Glue network connection or use an existing one.

   If you don't have an AWS Glue VPC Connection set up already, create a new one by following the steps in the [Creating connections for connectors](https://docs.aws.amazon.com/glue/latest/dg/creating-connections.html) section using the AWS Glue console or the AWS CLI/SDK.

1. If you choose **Customize settings**, enter the number of days to retain the files before deletion under **Orphan file deletion configuration**. You can also specify the interval between two consecutive optimizer runs. The default value is 24 hours.

1. Choose **Enable optimization**.

------
#### [ AWS CLI ]

 To enable orphan file deletion for an Iceberg table in AWS Glue, create a table optimizer of type `orphan_file_deletion` and set the `enabled` field to `true`. To create an orphan file deletion optimizer for an Iceberg table using the AWS CLI, you can use the following command:

```
aws glue create-table-optimizer \
 --catalog-id 123456789012 \
 --database-name iceberg_db \
 --table-name iceberg_table \
 --table-optimizer-configuration '{"roleArn":"arn:aws:iam::123456789012:role/optimizer_role","enabled":true, "vpcConfiguration":{"glueConnectionName":"glue_connection_name"}, "orphanFileDeletionConfiguration":{"icebergConfiguration":{"orphanFileRetentionPeriodInDays":3, "location":"<S3_location>"}}}' \
 --type orphan_file_deletion
```

 This command creates an orphan file deletion optimizer for the specified Iceberg table. The key parameters are:
+ roleArn – The ARN of the IAM role with permissions to access the S3 bucket and AWS Glue resources.
+ enabled – Set to `true` to enable the optimizer.
+ orphanFileRetentionPeriodInDays – The number of days to retain orphan files before deleting them (minimum 1 day).
+ type – Set to `orphan_file_deletion` to create an orphan file deletion optimizer.

 After you create the table optimizer, it runs orphan file deletion periodically (once per day if left enabled). You can check the runs using the `list-table-optimizer-runs` API. The orphan file deletion job identifies and deletes files that are not tracked in the Iceberg metadata for the table.

------
#### [ API ]

Call [CreateTableOptimizer](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-table-optimizers.html#aws-glue-api-table-optimizers-CreateTableOptimizer) operation to create the orphan file deletion optimizer for a specific table.

------

# Updating orphan file deletion optimizer
<a name="update-orphan-file-deletion"></a>

 You can modify the configuration for the orphan file deletion optimizer, such as changing the retention period for orphan files or the IAM role used by the optimizer using AWS Glue console, AWS CLI, or the `UpdateTableOptimizer` operation. 

------
#### [ AWS Management Console ]

**To update the orphan file deletion optimizer**

1. Choose **Data Catalog** and choose **Tables**. From the tables list, choose the table for which you want to update the orphan file deletion optimizer configuration.

1. On the lower section of the **Table details** page, choose **Table optimization**, and then choose **Edit**. 

1.  On the **Edit optimization** page, make the desired changes. 

1.  Choose **Save**. 

------
#### [ AWS CLI ]

 You can use the `update-table-optimizer` command to update the orphan file deletion optimizer in AWS Glue. This allows you to modify the `orphanFileDeletionConfiguration` in the `icebergConfiguration` field, where you can specify an updated `orphanFileRetentionPeriodInDays` to set the number of days to retain orphan files, or a `location` to specify the Iceberg table location to delete orphan files from. 

```
aws glue update-table-optimizer \
 --catalog-id 123456789012 \
 --database-name iceberg_db \
 --table-name iceberg_table \
 --table-optimizer-configuration '{"roleArn":"arn:aws:iam::123456789012:role/optimizer_role","enabled":true, "vpcConfiguration":{"glueConnectionName":"glue_connection_name"},"orphanFileDeletionConfiguration":{"icebergConfiguration":{"orphanFileRetentionPeriodInDays":5}}}' \
 --type orphan_file_deletion
```

------
#### [ API ]

Call the [UpdateTableOptimizer](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-table-optimizers.html#aws-glue-api-table-optimizers-UpdateTableOptimizer) operation to update the orphan file deletion optimizer for a table.

------

 

# Disabling orphan file deletion
<a name="disable-orphan-file-deletion"></a>

 You can disable the orphan file deletion optimizer for a particular Apache Iceberg table using the AWS Glue console or the AWS CLI. 

------
#### [ Console ]

**To disable orphan file deletion**

1. Choose **Data Catalog** and choose **Tables**. From the tables list, choose the Iceberg table for which you want to disable orphan file deletion.

1. On the lower section of the **Table details** page, choose the **Table optimization** tab.

1. Choose **Actions**, and then choose **Disable**, **Orphan file deletion**.

   You can also choose **Disable** under **Optimization** from the **Actions** menu.

1. Choose **Disable** on the confirmation message. You can re-enable the orphan file deletion optimizer at a later time.

    After you confirm, the orphan file deletion optimizer is disabled and the status for orphan file deletion returns to `Not enabled`.

------
#### [ AWS CLI ]

In the following example, replace the account ID with a valid AWS account ID, and replace the database name and table name with your actual Iceberg database and table names. Replace `roleArn` with the Amazon Resource Name (ARN) of the IAM role that has the required permissions to disable the optimizer.

```
aws glue update-table-optimizer \
  --catalog-id 123456789012 \
  --database-name iceberg_db \
  --table-name iceberg_table \
  --table-optimizer-configuration '{"roleArn":"arn:aws:iam::123456789012:role/optimizer_role", "enabled":false}' \
  --type orphan_file_deletion
```

------
#### [ API ]

Call the [UpdateTableOptimizer](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-table-optimizers.html#aws-glue-api-table-optimizers-UpdateTableOptimizer) operation to disable the orphan file deletion optimizer for a specific table.

------

# Viewing optimization details
<a name="view-optimization-status"></a>

You can view the optimization status for Apache Iceberg tables in the AWS Glue console, AWS CLI, or using AWS API operations. 

------
#### [ Console ]

**To view the optimization status for Iceberg tables (console)**
+ You can view the optimization status for Iceberg tables on the AWS Glue console by choosing an Iceberg table from the **Tables** list under **Data Catalog**. Under **Table optimization**, choose **View all**.  
![\[Apache Iceberg table details page with Enable compaction option.\]](http://docs.aws.amazon.com/glue/latest/dg/images/table-list-compaction-status.png)

------
#### [  AWS CLI  ]

You can view the optimization details using the AWS CLI.

In the following examples, replace the account ID with a valid AWS account ID, and replace the database name and table name with your actual Iceberg database and table names. For `type`, provide an optimization type. Acceptable values are `compaction`, `retention`, and `orphan_file_deletion`.
+ **To get the last compaction run details for a table**

  ```
  aws glue get-table-optimizer \
    --catalog-id 123456789012 \
    --database-name iceberg_db \
    --table-name iceberg_table \
    --type compaction
  ```
+ Use the following example to retrieve the history of an optimizer for a specific table.

  ```
  aws glue list-table-optimizer-runs \
    --catalog-id 123456789012 \
    --database-name iceberg_db \
    --table-name iceberg_table \
    --type compaction
  ```
+ The following example shows how to retrieve the optimization run and configuration details for multiple optimizers. You can specify a maximum of 20 optimizers.

  ```
  aws glue batch-get-table-optimizer \
  --entries '[{"catalogId":"123456789012", "databaseName":"iceberg_db", "tableName":"iceberg_table", "type":"compaction"}]'
  ```

------
#### [ API ]
+ Use the `GetTableOptimizer` operation to retrieve the last run details of an optimizer. 
+ Use the `ListTableOptimizerRuns` operation to retrieve the history of a given optimizer on a specific table. 
+ Use the [BatchGetTableOptimizer](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-table-optimizers.html#aws-glue-api-table-optimizers-BatchGetTableOptimizer) operation to retrieve configuration details for multiple optimizers in your account. You can specify a maximum of 20 optimizers in a single call. 
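Because `BatchGetTableOptimizer` accepts at most 20 entries per call, larger inventories need to be chunked. The following sketch shows one way to do that; the table names are hypothetical, and with boto3 each batch would be passed as `glue.batch_get_table_optimizer(Entries=batch)`.

```python
def chunk_optimizer_entries(entries, batch_size=20):
    """Split optimizer entries into batches that respect the
    20-optimizer limit of BatchGetTableOptimizer."""
    return [entries[i:i + batch_size]
            for i in range(0, len(entries), batch_size)]

# Hypothetical inventory of 45 compaction optimizers.
entries = [
    {"catalogId": "123456789012", "databaseName": "iceberg_db",
     "tableName": f"table_{i}", "type": "compaction"}
    for i in range(45)
]
batches = chunk_optimizer_entries(entries)
```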

------

# Viewing Amazon CloudWatch metrics
<a name="view-optimization-metrics"></a>

 After the table optimizers run successfully, the service creates Amazon CloudWatch metrics on the optimization job performance. In the CloudWatch console, choose **Metrics**, **All metrics**. You can filter metrics by a specific namespace (for example, AWS Glue), table name, or database name.

 For more information, see [View available metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/viewing_metrics_with_cloudwatch.html) in the *Amazon CloudWatch User Guide*. 

**Compaction**
+ Number of bytes compacted 
+ Number of files compacted
+ Number of DPU allocated to job 
+ Duration of job (Hours) 

**Snapshot retention**
+ Number of data files deleted 
+ Number of manifest files deleted
+ Number of Manifest lists deleted 
+ Duration of job (Hours)

**Orphan file deletion**
+ Number of orphan files deleted 
+ Duration of job (Hours) 

# Deleting an optimizer
<a name="delete-optimizer"></a>

You can delete an optimizer and associated metadata for the table using AWS CLI or AWS API operation.

Run the following AWS CLI command to delete the optimization history for a table. You need to specify the optimizer `type` along with the catalog ID, database name, and table name. The acceptable values are `compaction`, `retention`, and `orphan_file_deletion`.

```
aws glue delete-table-optimizer \
  --catalog-id 123456789012 \
  --database-name iceberg_db \
  --table-name iceberg_table \
  --type compaction
```

 Use the `DeleteTableOptimizer` operation to delete an optimizer for a table.

# Considerations and limitations
<a name="optimizer-notes"></a>

 This section includes things to consider when using table optimizers within the AWS Glue Data Catalog. 

## Durability and correctness
<a name="durability-correctness"></a>

**S3 Table Locations:**

When multiple AWS Glue Data Catalog tables share the same Amazon S3 location and have optimizers enabled, the snapshot retention or orphan file deletion optimizer for one table may delete files that are still referenced by the other table. Ensure that each table with optimizers enabled has a unique Amazon S3 location that is not shared with any other table, including tables in different databases.

**S3 Lifecycle Expiry:**

Amazon S3 lifecycle expiration rules that apply to Iceberg table storage locations can delete manifest and data files that are still referenced by active snapshots. If your bucket has lifecycle expiration rules, ensure they exclude the Iceberg table storage path.

## Known issues
<a name="known-issues"></a>

The [Catalog-level table optimizers](https://docs.aws.amazon.com/glue/latest/dg/catalog-level-optimizers.html) documentation states that "tables without their own optimizer configurations will inherit the disabled state from the catalog level." There is a known issue where some tables without their own optimizer configuration may not correctly inherit the disabled state from the catalog-level configuration. Use the AWS Glue console and optimizer execution logs to verify which optimizers are currently enabled and running in your account, and disable any that you do not require.

## Supported formats and limitations for managed data compaction
<a name="compaction-notes"></a>

Data compaction supports a variety of data types and compression formats for reading and writing data, including reading data from encrypted tables.

**Concurrency Control:**

 Apache Iceberg supports optimistic concurrency control, allowing multiple writers to perform operations simultaneously. Conflicts are detected and resolved at commit time. When working with streaming pipelines, configure appropriate retry settings through table properties and compaction settings to handle concurrent writes effectively. For detailed guidance, refer to the AWS Big Data Blog on [managing concurrent writes in Iceberg tables](https://aws.amazon.com/blogs/big-data/manage-concurrent-write-conflicts-in-apache-iceberg-on-the-aws-glue-data-catalog/). 

**Compaction Retries:**

 When compaction operations fail four consecutive times, AWS Glue catalog table optimization automatically suspends the optimizer to prevent unnecessary compute resource consumption. First investigate the logs and try to understand why compaction is repeatedly failing. To resume compaction optimization, you can re-enable the optimizer through the AWS Glue console or API. 

 **Data compaction supports:**
+ **Encryption** – Data compaction only supports default Amazon S3 encryption (SSE-S3) and server-side KMS encryption (SSE-KMS).
+ **Compaction strategies** – Binpack, sort, and Z-order sorting
+ You can run compaction from the account where Data Catalog resides when the Amazon S3 bucket that stores the underlying data is in another account. To do this, the compaction role requires access to the Amazon S3 bucket.

 **Data compaction currently doesn’t support:** 
+ **Compaction on cross-account tables** – You can't run compaction on cross-account tables.
+ **Compaction on cross-Region tables** – You can't run compaction on cross-Region tables.
+ **Enabling compaction on resource links**
+ **Tables in Amazon S3 Express One Zone storage class ** – You can't run compaction on Amazon S3 Express One Zone Iceberg Tables. 
+ **Z-order compaction strategy doesn't support the following data types :**
  + Decimal
  + TimestampWithoutZone

## Considerations for snapshot retention and orphan file deletion optimizers
<a name="retention-notes"></a>

The following considerations apply to the snapshot retention and the orphan file deletion optimizers. 
+ The snapshot retention and orphan file deletion processes have a maximum limit of deleting 1,000,000 files per run. When deleting expired snapshots, if the number of eligible files for deletion surpasses 1,000,000, any remaining files beyond that threshold will continue to exist in the table storage as orphan files. 
+ Snapshots will be preserved by the snapshot retention optimizer only when both criteria are satisfied: the minimum number of snapshots to keep and the specified retention period.
+ The snapshot retention optimizer deletes expired snapshot metadata from Apache Iceberg, preventing time travel queries for expired snapshots and optionally deleting associated data files.
+ The orphan file deletion optimizer deletes orphaned data and metadata files that are no longer referenced by Iceberg metadata and whose creation time is earlier than the configured retention period, measured from the time of the optimizer run.
+ Apache Iceberg facilitates version control through branches and tags, which are named pointers to specific snapshot states. Each branch and tag follows its own independent life-cycle, governed by retention policies defined at their respective levels. The AWS Glue Data Catalog optimizers take these life cycle policies into account, ensuring adherence to the specified retention rules. Branch and tag-level retention policies take precedence over the optimizer configurations. 

   For more information, see [Branching and Tagging](https://iceberg.apache.org/docs/nightly/branching/) in Apache Iceberg documentation. 
+ Snapshot retention and orphan file deletion optimizers will delete files eligible for clean-up as per configured parameters. Enhance your control over file deletion by implementing S3 versioning and life-cycle policies on the appropriate buckets.

   For detailed instructions on setting up versioning and creating life cycle rules, see [https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html). 
+  For proper orphan file determination, ensure that the provided table location and any sub-paths don't overlap with or contain data from any other tables or data sources. If paths overlap, you risk unrecoverable data loss from unintended deletion of files. 
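The combined snapshot retention criteria described above (keep at least the minimum number of snapshots, and expire only those older than the retention period) can be sketched as follows. This is an illustrative model, not the service's actual implementation; the dates and limits are hypothetical.

```python
from datetime import datetime, timedelta

def snapshots_to_expire(snapshot_times, retention_days, min_to_retain, now):
    """Return the snapshot timestamps eligible for expiry: a snapshot
    expires only if it is older than the retention period AND at least
    min_to_retain newer snapshots remain."""
    ordered = sorted(snapshot_times, reverse=True)  # newest first
    cutoff = now - timedelta(days=retention_days)
    return [ts for i, ts in enumerate(ordered)
            if i >= min_to_retain and ts < cutoff]

now = datetime(2025, 6, 10)
snapshots = [datetime(2025, 6, 1), datetime(2025, 6, 5),
             datetime(2025, 6, 9), datetime(2025, 6, 10)]
# With a 1-day retention period and a minimum of 2 snapshots, only the
# two oldest snapshots are eligible for expiry.
expired = snapshots_to_expire(snapshots, retention_days=1,
                              min_to_retain=2, now=now)
```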

## Debugging OversizedAllocationException exception
<a name="debug-exception"></a>

To resolve an `OversizedAllocationException` exception:
+ Reduce the batch size of the vectorized reader and check whether the issue is resolved. The default batch size is 5000. This is controlled by the `read.parquet.vectorization.batch-size` table property.
  + If this doesn’t work even after trying multiple batch sizes, turn off vectorization. This is controlled by the `read.parquet.vectorization.enabled` table property.

# Supported Regions for table optimizers
<a name="regions-optimizers"></a>

The table optimization features (compaction, snapshot retention, and orphan file deletion) for AWS Glue Data Catalog are available in the following AWS Regions:
+ Asia Pacific (Tokyo)
+ Asia Pacific (Seoul)
+ Asia Pacific (Mumbai)
+ Asia Pacific (Singapore)
+ Asia Pacific (Sydney)
+ Asia Pacific (Jakarta)
+ Canada (Central)
+ Europe (Ireland)
+ Europe (London)
+ Europe (Frankfurt)
+ Europe (Stockholm)
+ US East (N. Virginia)
+ US East (Ohio)
+ US West (Oregon)
+ South America (São Paulo)

# Optimizing query performance for Iceberg tables
<a name="iceberg-column-statistics"></a>

Apache Iceberg is a high-performance open table format for huge analytic datasets. AWS Glue supports calculating and updating number of distinct values (NDVs) for each column in Iceberg tables. These statistics can facilitate better query optimization, data management, and performance efficiency for data engineers and scientists working with large-scale datasets.

 AWS Glue estimates the number of distinct values in each column of the Iceberg table and stores them in [Puffin](https://iceberg.apache.org/puffin-spec/) files on Amazon S3 associated with Iceberg table snapshots. Puffin is an Iceberg file format designed to store metadata such as indexes, statistics, and sketches. Storing sketches in Puffin files tied to snapshots ensures transactional consistency and freshness of the NDV statistics.

You can configure and run the column statistics generation task using the AWS Glue console or the AWS CLI. When you initiate the process, AWS Glue starts a Spark job in the background and updates the AWS Glue table metadata in the Data Catalog. You can view column statistics using the AWS Glue console or the AWS CLI, or by calling the [GetColumnStatisticsForTable](https://docs.aws.amazon.com/glue/latest/webapi/API_GetColumnStatisticsForTable.html) API operation.

**Note**  
If you're using AWS Lake Formation permissions to control access to the table, the role assumed by the column statistics task requires full table access to generate statistics.

**Topics**
+ [Prerequisites for generating column statistics](iceberg-column-stats-prereqs.md)
+ [Generating column statistics for Iceberg tables](iceberg-generate-column-stats.md)
+ [See also](#see-also-iceberg-stats)

# Prerequisites for generating column statistics
<a name="iceberg-column-stats-prereqs"></a>

To generate or update column statistics for Iceberg tables, the statistics generation task assumes an AWS Identity and Access Management (IAM) role on your behalf. Based on the permissions granted to the role, the column statistics generation task can read the data from the Amazon S3 data store.

When you configure the column statistics generation task, AWS Glue allows you to create a role that includes the `AWSGlueServiceRole` AWS managed policy plus the required inline policy for the specified data source. 

If you specify an existing role for generating column statistics, ensure that it includes the `AWSGlueServiceRole` policy or equivalent (or a scoped down version of this policy), and the required inline policies.

For more information about the required permissions, see [Prerequisites for generating column statistics](column-stats-prereqs.md). 

# Generating column statistics for Iceberg tables
<a name="iceberg-generate-column-stats"></a>

Follow these steps to configure a schedule for generating statistics in the Data Catalog using the AWS Glue console, the AWS CLI, or the **StartColumnStatisticsTaskRun** operation.

**To generate column statistics**

1. Sign in to the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). 

1. Choose **Tables** under **Data Catalog**.

1. Choose an Iceberg table from the list. 

1. Choose **Column statistics**, **Generate on demand** under the **Actions** menu.

   You can also choose the **Generate statistics** button on the **Column statistics** tab in the lower section of the **Tables** page.

1. On the **Generate statistics** page, provide the statistics generation details. Follow steps 6-11 in the [Generating column statistics on a schedule](generate-column-stats.md) section to configure a schedule for statistics generation for Iceberg tables. 

   You can also choose to generate column statistics on demand by following the instructions in [Generating column statistics on demand](column-stats-on-demand.md).
**Note**  
Sampling option is not available for Iceberg tables.

   AWS Glue calculates the number of distinct values for each column of the Iceberg table and writes them to a new Puffin file committed to the specified snapshot ID in your Amazon S3 location.

## See also
<a name="see-also-iceberg-stats"></a>
+ [Viewing column statistics](view-column-stats.md)
+ [Viewing column statistics task runs](view-stats-run.md)
+ [Stopping column statistics task run](stop-stats-run.md)
+ [Deleting column statistics](delete-column-stats.md)

# Managing the Data Catalog
<a name="manage-catalog"></a>

 The AWS Glue Data Catalog is a central metadata repository that stores structural and operational metadata for your Amazon S3 data sets. Managing the Data Catalog effectively is crucial for maintaining data quality, performance, security, and governance.

 By understanding and applying these Data Catalog management practices, you can ensure your metadata remains accurate, performant, secure, and well-governed as your data landscape evolves. 

This section covers the following aspects of Data Catalog management:
+ *Updating table schema and partitions* – As your data evolves, you may need to update the table schema or partition structure defined in the Data Catalog. For more information on how to make these updates programmatically using AWS Glue ETL, see [Updating the schema, and adding new partitions in the Data Catalog using AWS Glue ETL jobs](update-from-job.md).
+ *Managing column statistics* – Accurate column statistics help optimize query plans and improve performance. For more information on how to generate, update, and manage column statistics, see [Optimizing query performance using column statistics](column-statistics.md). 
+ *Encrypting the Data Catalog* – To protect sensitive metadata, you can encrypt your Data Catalog using AWS Key Management Service (AWS KMS). This section explains how to enable and manage encryption for your Data Catalog. 
+ *Securing the Data Catalog with AWS Lake Formation* – Lake Formation provides a comprehensive approach to data lake security and access control. You can use Lake Formation to secure and govern access to your Data Catalog and underlying data. 

**Topics**
+ [Updating the schema, and adding new partitions in the Data Catalog using AWS Glue ETL jobs](update-from-job.md)
+ [Optimizing query performance using column statistics](column-statistics.md)
+ [Encrypting your Data Catalog](catalog-encryption.md)
+ [Securing your Data Catalog using Lake Formation](secure-catalog.md)
+ [Working with AWS Glue Data Catalog views in AWS Glue](catalog-views.md)

# Updating the schema, and adding new partitions in the Data Catalog using AWS Glue ETL jobs
<a name="update-from-job"></a>

Your extract, transform, and load (ETL) job might create new table partitions in the target data store. Your dataset schema can evolve and diverge from the AWS Glue Data Catalog schema over time. AWS Glue ETL jobs now provide several features that you can use within your ETL script to update your schema and partitions in the Data Catalog. These features allow you to see the results of your ETL work in the Data Catalog, without having to rerun the crawler.

## New partitions
<a name="update-from-job-partitions"></a>

If you want to view the new partitions in the AWS Glue Data Catalog, you can do one of the following:
+ When the job finishes, rerun the crawler, and view the new partitions on the console when the crawler finishes.
+ When the job finishes, view the new partitions on the console right away, without having to rerun the crawler. You can enable this feature by adding a few lines of code to your ETL script, as shown in the following examples. The code uses the `enableUpdateCatalog` argument to indicate that the Data Catalog is to be updated during the job run as the new partitions are created.

**Method 1**  
Pass `enableUpdateCatalog` and `partitionKeys` in an options argument.  

```
additionalOptions = {"enableUpdateCatalog": True}
additionalOptions["partitionKeys"] = ["region", "year", "month", "day"]


sink = glueContext.write_dynamic_frame_from_catalog(frame=last_transform, database=<target_db_name>,
                                                    table_name=<target_table_name>, transformation_ctx="write_sink",
                                                    additional_options=additionalOptions)
```

```
val options = JsonOptions(Map(
    "path" -> <S3_output_path>, 
    "partitionKeys" -> Seq("region", "year", "month", "day"), 
    "enableUpdateCatalog" -> true))
val sink = glueContext.getCatalogSink(
    database = <target_db_name>, 
    tableName = <target_table_name>, 
    additionalOptions = options)
sink.writeDynamicFrame(df)
```

**Method 2**  
Pass `enableUpdateCatalog` and `partitionKeys` in `getSink()`, and call `setCatalogInfo()` on the `DataSink` object.  

```
sink = glueContext.getSink(
    connection_type="s3", 
    path="<S3_output_path>",
    enableUpdateCatalog=True,
    partitionKeys=["region", "year", "month", "day"])
sink.setFormat("json")
sink.setCatalogInfo(catalogDatabase=<target_db_name>, catalogTableName=<target_table_name>)
sink.writeFrame(last_transform)
```

```
val options = JsonOptions(
   Map("path" -> <S3_output_path>, 
       "partitionKeys" -> Seq("region", "year", "month", "day"), 
       "enableUpdateCatalog" -> true))
val sink = glueContext.getSink("s3", options).withFormat("json")
sink.setCatalogInfo(<target_db_name>, <target_table_name>)
sink.writeDynamicFrame(df)
```

You can now create new catalog tables, update existing tables with a modified schema, and add new table partitions in the Data Catalog from an AWS Glue ETL job itself, without having to rerun crawlers.

## Updating table schema
<a name="update-from-job-updating-table-schema"></a>

If you want to overwrite the Data Catalog table’s schema, you can do one of the following:
+ When the job finishes, rerun the crawler, and make sure your crawler is also configured to update the table definition. When the crawler finishes, view the new partitions on the console along with any schema updates. For more information, see [Configuring a Crawler Using the API](https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-configure-changes-api).
+ When the job finishes, view the modified schema on the console right away, without having to rerun the crawler. You can enable this feature by adding a few lines of code to your ETL script, as shown in the following examples. The code sets `enableUpdateCatalog` to true and `updateBehavior` to `UPDATE_IN_DATABASE`, which indicates that the schema should be overwritten and new partitions added in the Data Catalog during the job run.

------
#### [ Python ]

```
additionalOptions = {
    "enableUpdateCatalog": True, 
    "updateBehavior": "UPDATE_IN_DATABASE"}
additionalOptions["partitionKeys"] = ["partition_key0", "partition_key1"]

sink = glueContext.write_dynamic_frame_from_catalog(frame=last_transform, database=<dst_db_name>,
    table_name=<dst_tbl_name>, transformation_ctx="write_sink",
    additional_options=additionalOptions)
job.commit()
```

------
#### [ Scala ]

```
val options = JsonOptions(Map(
    "path" -> outputPath, 
    "partitionKeys" -> Seq("partition_0", "partition_1"), 
    "enableUpdateCatalog" -> true))
val sink = glueContext.getCatalogSink(database = nameSpace, tableName = tableName, additionalOptions = options)
sink.writeDynamicFrame(df)
```

------

You can also set the `updateBehavior` value to `LOG` if you want to prevent your table schema from being overwritten, but still want to add the new partitions. The default value of `updateBehavior` is `UPDATE_IN_DATABASE`, so if you don’t explicitly define it, then the table schema will be overwritten.

If `enableUpdateCatalog` is not set to true, then regardless of the option selected for `updateBehavior`, the ETL job will not update the table in the Data Catalog.
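For example, to register new partitions while leaving the table schema untouched, you can build the following options dictionary and pass it as `additional_options` to `write_dynamic_frame_from_catalog` (a minimal sketch; the partition key names are placeholders for your own table's keys):

```
# Add new partitions to the Data Catalog during the job run, but only
# log (rather than apply) any schema changes. Partition key names here
# are placeholders.
log_only_options = {
    "enableUpdateCatalog": True,
    "updateBehavior": "LOG",
    "partitionKeys": ["partition_key0", "partition_key1"],
}
```

Passing `log_only_options` in place of the `additionalOptions` shown in the earlier examples keeps the existing Data Catalog schema intact while still registering the new partitions.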

## Creating new tables
<a name="update-from-job-creating-new-tables"></a>

You can also use the same options to create a new table in the Data Catalog. You can specify the database and new table name using `setCatalogInfo`.

------
#### [ Python ]

```
sink = glueContext.getSink(connection_type="s3", path="s3://path/to/data",
    enableUpdateCatalog=True, updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["partition_key0", "partition_key1"])
sink.setFormat("<format>")
sink.setCatalogInfo(catalogDatabase=<dst_db_name>, catalogTableName=<dst_tbl_name>)
sink.writeFrame(last_transform)
```

------
#### [ Scala ]

```
val options = JsonOptions(Map(
    "path" -> outputPath, 
    "partitionKeys" -> Seq("<partition_1>", "<partition_2>"), 
    "enableUpdateCatalog" -> true, 
    "updateBehavior" -> "UPDATE_IN_DATABASE"))
val sink = glueContext.getSink(connectionType = "s3", connectionOptions = options).withFormat("<format>")
sink.setCatalogInfo(catalogDatabase = "<dst_db_name>", catalogTableName = "<dst_tbl_name>")
sink.writeDynamicFrame(df)
```

------

## Restrictions
<a name="update-from-job-restrictions"></a>

Take note of the following restrictions:
+ Only Amazon Simple Storage Service (Amazon S3) targets are supported.
+ The `enableUpdateCatalog` feature is not supported for governed tables.
+ Only the following formats are supported: `json`, `csv`, `avro`, and `parquet`.
+ To create or update tables with the `parquet` classification, you must use the AWS Glue optimized parquet writer for DynamicFrames. This can be achieved with one of the following:
  + If you're updating an existing table in the catalog with the `parquet` classification, the table must have the `"useGlueParquetWriter"` table property set to `true` before you update it. You can set this property through the AWS Glue APIs/SDK, the console, or an Athena DDL statement.   
![\[Catalog table property edit field in AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/edit-table-property.png)

    Once the catalog table property is set, you can use the following snippet of code to update the catalog table with the new data:

    ```
    glueContext.write_dynamic_frame.from_catalog(
        frame=frameToWrite,
        database="dbName",
        table_name="tableName",
        additional_options={
            "enableUpdateCatalog": True,
            "updateBehavior": "UPDATE_IN_DATABASE"
        }
    )
    ```
  + If the table doesn't already exist in the catalog, you can use the `getSink()` method in your script with `connection_type="s3"` to add the table and its partitions to the catalog, along with writing the data to Amazon S3. Provide the appropriate `partitionKeys` and `compression` for your workflow.

    ```
    s3sink = glueContext.getSink(
        path="s3://bucket/folder/",
        connection_type="s3",
        updateBehavior="UPDATE_IN_DATABASE",
        partitionKeys=[],
        compression="snappy",
        enableUpdateCatalog=True
    )
        
    s3sink.setCatalogInfo(
        catalogDatabase="dbName", catalogTableName="tableName"
    )
        
    s3sink.setFormat("parquet", useGlueParquetWriter=True)
    s3sink.writeFrame(frameToWrite)
    ```
  + The `glueparquet` format value is a legacy method of enabling the AWS Glue parquet writer.
+ When the `updateBehavior` is set to `LOG`, new partitions will be added only if the `DynamicFrame` schema is equivalent to or contains a subset of the columns defined in the Data Catalog table's schema.
+ Schema updates are not supported for non-partitioned tables (not using the "partitionKeys" option).
+ Your `partitionKeys` must be equivalent, and in the same order, between the parameter passed in your ETL script and the `partitionKeys` in your Data Catalog table schema.
+ This feature does not currently support creating or updating tables whose schemas are nested (for example, arrays inside of structs).

For more information, see [Programming Spark scripts](aws-glue-programming.md).

# Working with MongoDB connections in ETL jobs
<a name="integrate-with-mongo-db"></a>

You can create a connection for MongoDB and then use that connection in your AWS Glue job. For more information, see [MongoDB connections](aws-glue-programming-etl-connect-mongodb-home.md) in the AWS Glue programming guide. The connection `url`, `username` and `password` are stored in the MongoDB connection. Other options can be specified in your ETL job script using the `additionalOptions` parameter of `glueContext.getCatalogSource`. The other options can include:
+ `database`: (Required) The MongoDB database to read from.
+ `collection`: (Required) The MongoDB collection to read from.

By placing the `database` and `collection` information inside the ETL job script, you can use the same connection in multiple jobs.

1. Create an AWS Glue Data Catalog connection for the MongoDB data source. See ["connectionType": "mongodb"](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-mongodb) for a description of the connection parameters. You can create the connection using the console, APIs or CLI.

1. Create a database in the AWS Glue Data Catalog to store the table definitions for your MongoDB data. See [Creating databases](define-database.md) for more information.

1. Create a crawler that crawls the data in the MongoDB using the information in the connection to connect to the MongoDB. The crawler creates the tables in the AWS Glue Data Catalog that describe the tables in the MongoDB database that you use in your job. See [Using crawlers to populate the Data Catalog](add-crawler.md) for more information.

1. Create a job with a custom script. You can create the job using the console, APIs or CLI. For more information, see [Adding Jobs in AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/add-job.html).

1. Choose the data targets for your job. The tables that represent the data target can be defined in your Data Catalog, or your job can create the target tables when it runs. You choose a target location when you author the job. If the target requires a connection, the connection is also referenced in your job. If your job requires multiple data targets, you can add them later by editing the script.

1. Customize the job-processing environment by providing arguments for your job and generated script. 

   Here is an example of creating a `DynamicFrame` from the MongoDB database based on the table structure defined in the Data Catalog. The code uses `additionalOptions` to provide the additional data source information:

------
#### [  Scala  ]

   ```
   val resultFrame: DynamicFrame = glueContext.getCatalogSource(
           database = catalogDB, 
           tableName = catalogTable, 
           additionalOptions = JsonOptions(Map("database" -> DATABASE_NAME, 
                   "collection" -> COLLECTION_NAME))
         ).getDynamicFrame()
   ```

------
#### [  Python  ]

   ```
   glue_context.create_dynamic_frame_from_catalog(
           database = catalogDB,
           table_name = catalogTable,
           additional_options = {"database":"database_name", 
               "collection":"collection_name"})
   ```

------

1. Run the job, either on-demand or through a trigger.

# Optimizing query performance using column statistics
<a name="column-statistics"></a>

You can compute column-level statistics for AWS Glue Data Catalog tables in data formats such as Parquet, ORC, JSON, ION, CSV, and XML without setting up additional data pipelines. Column statistics help you to understand data profiles by getting insights about values within a column. 

The Data Catalog supports generating statistics for column values such as minimum value, maximum value, total null values, total distinct values, average length of values, and total occurrences of true values. AWS analytical services such as Amazon Redshift and Amazon Athena can use these column statistics to generate query execution plans, and choose the optimal plan that improves query performance.

There are three scenarios for generating column statistics: 

 **Auto**   
AWS Glue supports automatic column statistics generation at the catalog-level so that it can automatically generate statistics for new tables in the AWS Glue Data Catalog. 

**Scheduled**  
AWS Glue supports scheduling column statistics generation so that it can be run automatically on a recurring schedule.   
With scheduled statistics computation, the column statistics task updates the overall table-level statistics, such as min, max, and avg with the new statistics, providing query engines with accurate and up-to-date statistics to optimize query execution. 

**On-demand**  
Use this option to generate column statistics on-demand whenever needed. This is useful for ad-hoc analysis or when statistics need to be computed immediately. 

You can configure and run the column statistics generation task using the AWS Glue console, the AWS CLI, or AWS Glue API operations. When you initiate the process, AWS Glue starts a Spark job in the background and updates the AWS Glue table metadata in the Data Catalog. You can view column statistics using the AWS Glue console or the AWS CLI, or by calling the [GetColumnStatisticsForTable](https://docs.aws.amazon.com/glue/latest/webapi/API_GetColumnStatisticsForTable.html) API operation.
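As a sketch of programmatic access, the following helper flattens the statistics returned by `GetColumnStatisticsForTable` for easier inspection. It is demonstrated here against a hand-built sample response; with boto3 you would instead obtain `response` from `glue_client.get_column_statistics_for_table(DatabaseName=..., TableName=..., ColumnNames=[...])`. The response shape below is an assumption modeled on the API's `ColumnStatistics` structure.

```
def summarize_column_stats(response):
    """Flatten a GetColumnStatisticsForTable-style response into
    {column_name: stats_payload} for easier inspection."""
    summary = {}
    for col in response.get("ColumnStatisticsList", []):
        stats = col["StatisticsData"]
        # StatisticsData holds a "Type" marker plus one typed payload,
        # for example "LongColumnStatisticsData" for integer columns.
        payload = {k: v for k, v in stats.items() if k != "Type"}
        summary[col["ColumnName"]] = next(iter(payload.values()))
    return summary

# Illustrative sample, shaped like the API's documented structure.
sample_response = {
    "ColumnStatisticsList": [
        {
            "ColumnName": "year",
            "ColumnType": "bigint",
            "StatisticsData": {
                "Type": "LONG",
                "LongColumnStatisticsData": {
                    "MinimumValue": 2019,
                    "MaximumValue": 2024,
                    "NumberOfNulls": 0,
                    "NumberOfDistinctValues": 6,
                },
            },
        }
    ]
}

stats = summarize_column_stats(sample_response)
print(stats["year"]["MaximumValue"])  # → 2024
```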

**Note**  
If you're using Lake Formation permissions to control access to the table, the role assumed by the column statistics task requires full table access to generate statistics.

 The following video demonstrates how to enhance query performance using column statistics. 

[![AWS Videos](http://img.youtube.com/vi/zUHEXJdHUxs/0.jpg)](https://www.youtube.com/watch?v=zUHEXJdHUxs)


**Topics**
+ [Prerequisites for generating column statistics](column-stats-prereqs.md)
+ [Automatic column statistics generation](auto-column-stats-generation.md)
+ [Generating column statistics on a schedule](generate-column-stats.md)
+ [Generating column statistics on demand](column-stats-on-demand.md)
+ [Viewing column statistics](view-column-stats.md)
+ [Viewing column statistics task runs](view-stats-run.md)
+ [Stopping column statistics task run](stop-stats-run.md)
+ [Deleting column statistics](delete-column-stats.md)
+ [Considerations and limitations](column-stats-notes.md)

# Prerequisites for generating column statistics
<a name="column-stats-prereqs"></a>

To generate or update column statistics, the statistics generation task assumes an AWS Identity and Access Management (IAM) role on your behalf. Based on the permissions granted to the role, the column statistics generation task can read the data from the Amazon S3 data store.

When you configure the column statistics generation task, AWS Glue allows you to create a role that includes the `AWSGlueServiceRole` AWS managed policy plus the required inline policy for the specified data source. 

If you specify an existing role for generating column statistics, ensure that it includes the `AWSGlueServiceRole` policy or an equivalent (or a scoped-down version of this policy), plus the required inline policies.

**Note**  
 To generate statistics for tables managed by Lake Formation, the IAM role used to generate statistics requires full table access. 

You can also create a role yourself, attach the permissions listed in the following policies, and add that role to the column statistics generation task.

**To create an IAM role for generating column statistics**

1. To create an IAM role, see [Create an IAM role for AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html).

1. To update an existing role, in the IAM console, go to the IAM role that is being used by the generate column statistics process.

1. In the **Add permissions** section, choose **Attach policies**. In the newly opened browser window, choose the `AWSGlueServiceRole` AWS managed policy.

1. You also need to include permissions to read data from the Amazon S3 data location.

   In the **Add permissions** section, choose **Create policy**. In the newly opened browser window, create a new policy to use with your role.

1. In the **Create policy** page, choose the **JSON** tab. Copy the following `JSON` code into the policy editor field.
**Note**  
In the following policies, replace the account ID with a valid AWS account ID, `region` with the AWS Region of the table, and `bucket-name` with your Amazon S3 bucket name.

------
#### [ JSON ]

****  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "S3BucketAccess",
               "Effect": "Allow",
               "Action": [
                   "s3:ListBucket",
                   "s3:GetObject"
               ],
               "Resource": [
                   "arn:aws:s3:::amzn-s3-demo-bucket/*",
                   "arn:aws:s3:::amzn-s3-demo-bucket"
               ]
           }
        ]
   }
   ```

------

1. (Optional) If you're using Lake Formation permissions to provide access to your data, the IAM role requires `lakeformation:GetDataAccess` permissions.

------
#### [ JSON ]

****  

   ```
   {
     "Version":"2012-10-17",		 	 	 
     "Statement": [
       {
         "Sid": "LakeFormationDataAccess",
         "Effect": "Allow",
         "Action": "lakeformation:GetDataAccess",
         "Resource": [
           "*"
         ]
       }
     ]
   }
   ```

------

    If the Amazon S3 data location is registered with Lake Formation, and the IAM role assumed by the column statistics generation task doesn't have `IAM_ALLOWED_PRINCIPALS` group permissions granted on the table, the role requires Lake Formation `ALTER` and `DESCRIBE` permissions on the table. The role used for registering the Amazon S3 bucket requires Lake Formation `INSERT` and `DELETE` permissions on the table. 

   If the Amazon S3 data location is not registered with Lake Formation, and the IAM role doesn't have `IAM_ALLOWED_PRINCIPALS` group permissions granted on the table, the role requires Lake Formation `ALTER`, `DESCRIBE`, `INSERT` and `DELETE` permissions on the table. 

1. If you've enabled the catalog-level `Automatic statistics generation` option, the IAM role must have the `glue:UpdateCatalog` permission or the Lake Formation `ALTER CATALOG` permission on the default Data Catalog. You can use the `GetCatalog` operation to verify the catalog properties. 

1. (Optional) The column statistics generation task that writes encrypted Amazon CloudWatch Logs requires the following permissions in the key policy.

------
#### [ JSON ]

****  

   ```
   {
     "Version":"2012-10-17",		 	 	 
     "Statement": [
       {
         "Sid": "CWLogsKmsPermissions",
         "Effect": "Allow",
         "Action": [
           "logs:CreateLogGroup",
           "logs:CreateLogStream",
           "logs:PutLogEvents",
           "logs:AssociateKmsKey"
         ],
         "Resource": [
           "arn:aws:logs:us-east-1:111122223333:log-group:/aws-glue:*"
         ]
       },
       {
         "Sid": "KmsPermissions",
         "Effect": "Allow",
         "Action": [
           "kms:GenerateDataKey",
           "kms:Decrypt",
           "kms:Encrypt"
         ],
         "Resource": [
           "arn:aws:kms:us-east-1:111122223333:key/arn of key used for ETL cloudwatch encryption"
         ],
         "Condition": {
           "StringEquals": {
             "kms:ViaService": [
               "glue.us-east-1.amazonaws.com"
             ]
           }
         }
       }
     ]
   }
   ```

------

1. The user or role that starts the column statistics task must have the `iam:PassRole` permission on the role that the task assumes.

------
#### [ JSON ]

****  

   ```
   {
     "Version":"2012-10-17",		 	 	 
     "Statement": [
       {
         "Effect": "Allow",
         "Action": [
           "iam:PassRole"
         ],
         "Resource": [
           "arn:aws:iam::111122223333:role/columnstats-role-name"
         ]
       }
     ]
   }
   ```

------

1. When you create an IAM role for generating column statistics, that role must also have the following trust policy that enables the service to assume the role. 

------
#### [ JSON ]

****  

   ```
   {
     "Version":"2012-10-17",		 	 	 
     "Statement": [
       {
         "Sid": "TrustPolicy",
         "Effect": "Allow",
         "Principal": {
           "Service": "glue.amazonaws.com"
         },
         "Action": "sts:AssumeRole"
       }
     ]
   }
   ```

------

# Automatic column statistics generation
<a name="auto-column-stats-generation"></a>

Automatic generation of column statistics allows you to schedule and automatically compute statistics on new tables in the AWS Glue Data Catalog. When you enable automatic statistics generation, the Data Catalog discovers new tables with specific data formats such as Parquet, JSON, CSV, XML, ORC, ION, and Apache Iceberg, along with their individual bucket paths. With a one-time catalog configuration, the Data Catalog generates statistics for these tables.

 Data lake administrators can configure the statistics generation by selecting the default catalog in the Lake Formation console, and enabling table statistics using the `Optimization configuration` option. When you create new tables or update existing tables in the Data Catalog, the Data Catalog collects the number of distinct values (NDVs) for Apache Iceberg tables, and additional statistics such as the number of nulls, maximum, minimum, and average length for other supported file formats on a weekly basis. 

If you have configured statistics generation at the table-level or if you have previously deleted the statistics generation settings for a table, those table-specific settings take precedence over the default catalog settings for automatic column statistics generation.

 Automatic statistics generation task analyzes 50% of records in the tables to calculate statistics. Automatic column statistics generation ensures that the Data Catalog maintains weekly metrics that can be used by query engines like Amazon Athena and Amazon Redshift Spectrum for improved query performance and potential cost savings. It allows scheduling statistics generation using AWS Glue APIs or the console, providing an automated process without manual intervention. 

**Topics**
+ [Enabling catalog-level automatic statistics generation](enable-auto-column-stats-generation.md)
+ [Viewing automated table-level settings](view-auto-column-stats-settings.md)
+ [Disabling catalog-level column statistics generation](disable-auto-column-stats-generation.md)

# Enabling catalog-level automatic statistics generation
<a name="enable-auto-column-stats-generation"></a>

You can enable automatic column statistics generation for all new Apache Iceberg tables and tables in non-OTF formats (Parquet, JSON, CSV, XML, ORC, ION) in the Data Catalog. After creating a table, you can also explicitly update its column statistics settings manually.

 To update the Data Catalog settings to enable catalog-level statistics generation, the IAM role used must have the `glue:UpdateCatalog` permission or the AWS Lake Formation `ALTER CATALOG` permission on the root catalog. You can use the `GetCatalog` API to verify the catalog properties. 

------
#### [ AWS Management Console ]

**To enable the automatic column statistics generation at the account-level**

1. Open the Lake Formation console at [https://console.aws.amazon.com/lakeformation/](https://console.aws.amazon.com/lakeformation/).

1. On the left navigation bar, choose **Catalogs**.

1. On the **Catalog summary** page, choose **Edit** under **Optimization configuration**.   
![\[The screenshot shows the options available to generate column stats.\]](http://docs.aws.amazon.com/glue/latest/dg/images/edit-column-stats-auto.png)

1. On the **Table optimization configuration** page, choose the **Enable automatic statistics generation for the tables of the catalog** option.  
![\[The screenshot shows the options available to generate column stats.\]](http://docs.aws.amazon.com/glue/latest/dg/images/edit-optimization-option.jpg)

1. Choose an existing IAM role or create a new one that has the necessary permissions to run the column statistics task.

1. Choose **Submit**.

------
#### [ AWS CLI ]

You can also enable catalog-level statistics collection through the AWS CLI by running the following command:

```
aws glue update-catalog --cli-input-json '{
    "name": "123456789012",
    "catalogInput": {
        "description": "Updating root catalog with role arn",
        "catalogProperties": {
            "customProperties": {
                "ColumnStatistics.RoleArn": "arn:aws:iam::"123456789012":role/service-role/AWSGlueServiceRole",
                "ColumnStatistics.Enabled": "true"
            }
        }
    }
}'
```

 The preceding command calls the AWS Glue `UpdateCatalog` operation, which takes a `CatalogProperties` structure with the following key-value pairs for catalog-level statistics generation: 
+ `ColumnStatistics.RoleArn` – The ARN of the IAM role to use for all tasks triggered for catalog-level statistics generation
+ `ColumnStatistics.Enabled` – A Boolean indicating whether the catalog-level setting is enabled or disabled
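The same configuration can be sketched as an `UpdateCatalog` payload built in Python (an assumption-laden sketch: the account ID and role name are placeholders, and the PascalCase key names follow the API shape rather than the CLI example above):

```
# Catalog-level statistics settings expressed as an UpdateCatalog
# payload. The account ID and role name are placeholders.
catalog_input = {
    "Description": "Updating root catalog with role arn",
    "CatalogProperties": {
        "CustomProperties": {
            "ColumnStatistics.RoleArn": (
                "arn:aws:iam::123456789012:role/service-role/AWSGlueServiceRole"
            ),
            # Both custom property values are strings, including the flag.
            "ColumnStatistics.Enabled": "true",
        }
    },
}
```

With boto3, this dictionary would be passed as the `CatalogInput` argument of the Glue client's `update_catalog` call.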

------

# Viewing automated table-level settings
<a name="view-auto-column-stats-settings"></a>

 When catalog-level statistics collection is enabled, whenever an Apache Hive or Apache Iceberg table is created or updated through the `CreateTable` or `UpdateTable` APIs (from the AWS Management Console, an SDK, or an AWS Glue crawler), an equivalent table-level setting is created for that table. 

 Tables with automatic statistics generation enabled must have one of the following properties:
+ Use an `InputSerdeLibrary` that begins with `org.apache.hadoop` and a `TableType` that equals `EXTERNAL_TABLE`
+ Use an `InputSerdeLibrary` that begins with `com.amazon.ion` and a `TableType` that equals `EXTERNAL_TABLE`
+ Contain `table_type`: `"ICEBERG"` in its parameters structure

 After you create or update a table, you can check the table details to confirm statistics generation. The `Statistics generation summary` shows the `Schedule` property set to `AUTO` and the `Statistics configuration` value as `Inherited from catalog`. Tables with these settings are processed automatically by AWS Glue. 

![\[An image of a Hive table with catalog-level statistics collection has been applied and statistics have been collected.\]](http://docs.aws.amazon.com/glue/latest/dg/images/auto-stats-summary.png)


# Disabling catalog-level column statistics generation
<a name="disable-auto-column-stats-generation"></a>

 You can disable automatic column statistics generation for new tables using the AWS Lake Formation console, the `glue:UpdateCatalogSettings` API, or the `glue:DeleteColumnStatisticsTaskSettings` API. 

**To disable the automatic column statistics generation at the account-level**

1. Open the Lake Formation console at [https://console.aws.amazon.com/lakeformation/](https://console.aws.amazon.com/lakeformation/).

1. On the left navigation bar, choose **Catalogs**.

1. On the **Catalog summary** page, choose **Edit** under **Optimization configuration**. 

1. On the **Table optimization configuration** page, unselect the **Enable automatic statistics generation for the tables of the catalog** option.

1. Choose **Submit**.

# Generating column statistics on a schedule
<a name="generate-column-stats"></a>

Follow these steps to configure a schedule for generating column statistics in the AWS Glue Data Catalog using the AWS Glue console, the AWS CLI, or the [CreateColumnStatisticsTaskSettings](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html#aws-glue-api-crawler-column-statistics-CreateColumnStatisticsTaskSettings) operation.

------
#### [ Console ]

**To generate column statistics using the console**

1. Sign in to the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). 

1. Choose **Tables** under **Data Catalog**.

1. Choose a table from the list. 

1. Choose the **Column statistics** tab in the lower section of the **Tables** page.

1. You can also choose **Generate on schedule** under **Column statistics** from **Actions**.

1. On the **Generate statistics on schedule** page, configure a recurring schedule for running the column statistics task by choosing the frequency and start time. You can choose the frequency to be hourly, daily, weekly, or define a cron expression to specify the schedule.

   A cron expression is a string representing a schedule pattern, consisting of six fields separated by spaces: `<minute> <hour> <day-of-month> <month> <day-of-week> <year>`. For example, to run a task every day at midnight, the cron expression would be `0 0 * * ? *`.

   For more information, see [Cron expressions](https://docs.aws.amazon.com/glue/latest/dg/monitor-data-warehouse-schedule.html#CronExpressions).  
![\[The screenshot shows the options available to generate column stats.\]](http://docs.aws.amazon.com/glue/latest/dg/images/generate-column-stats-schedule.png)

1. Next, choose the column option to generate statistics.
   + **All columns** – Choose this option to generate statistics for all columns in the table.
   + **Selected columns** – Choose this option to generate statistics for specific columns. You can select the columns from the drop-down list.

1. Choose an existing IAM role or create a new role that has permissions to generate statistics. AWS Glue assumes this role to generate column statistics.

   A quicker approach is to let the AWS Glue console create a role for you. The role that it creates is specifically for generating column statistics, and includes the `AWSGlueServiceRole` AWS managed policy plus the required inline policy for the specified data source. 

   If you specify an existing role for generating column statistics, ensure that it includes the `AWSGlueServiceRole` policy or equivalent (or a scoped down version of this policy), plus the required inline policies. 

1. (Optional) Next, choose a security configuration to enable at-rest encryption for logs.

1. (Optional) You can choose a sample size by indicating only a specific percent of rows from the table to generate statistics. The default is all rows. Use the up and down arrows to increase or decrease the percent value. 

   We recommend including all rows in the table to compute accurate statistics. Use sampled rows to generate column statistics only when approximate values are acceptable.

1. Choose **Generate statistics** to run the column statistics generation task.
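The six-field cron format used in these schedules can be sanity-checked with a small helper (a sketch that validates only the wrapper and field count, not every allowed value in each field):

```
def is_valid_glue_cron(expression):
    """Check that a schedule string has the cron(...) wrapper and
    exactly six space-separated fields:
    minute hour day-of-month month day-of-week year."""
    if not (expression.startswith("cron(") and expression.endswith(")")):
        return False
    fields = expression[5:-1].split()
    return len(fields) == 6

print(is_valid_glue_cron("cron(0 0 * * ? *)"))  # daily at midnight → True
print(is_valid_glue_cron("cron(0 0 * * ?)"))    # only five fields → False
```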

------
#### [ AWS CLI ]

You can use the following AWS CLI example to create a column statistics generation schedule. The `database-name`, `table-name`, and `role` parameters are required; the `schedule`, `column-name-list`, `catalog-id`, `sample-size`, and `security-configuration` parameters are optional.

```
aws glue create-column-statistics-task-settings \ 
 --database-name 'database_name' \ 
 --table-name table_name \ 
 --role 'arn:aws:iam::123456789012:role/stats-role' \ 
 --schedule 'cron(0 0-5 14 * * ?)' \ 
 --column-name-list 'col-1' \
 --catalog-id '123456789012' \
 --sample-size '10.0' \
 --security-configuration 'test-security'
```

You can also generate column statistics by calling the [StartColumnStatisticsTaskRun](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html#aws-glue-api-crawler-column-statistics-StartColumnStatisticsTaskRun) operation.

------

# Managing the schedule for column statistics generation
<a name="manage-column-stats-schedule"></a>

You can manage the scheduling operations such as updating, starting, stopping, and deleting schedules for the column statistics generation in AWS Glue. You can use AWS Glue console, AWS CLI, or [AWS Glue column statistics API operations](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html) to perform these tasks.

**Topics**
+ [Updating the column statistics generation schedule](#update-column-stats-shedule)
+ [Stopping the schedule for column statistics generation](#stop-column-stats-schedule)
+ [Resuming the schedule for column statistics generation](#resume-column-stats-schedule)
+ [Deleting the column statistics generation schedule](#delete-column-stats-schedule)

## Updating the column statistics generation schedule
<a name="update-column-stats-shedule"></a>

You can update the schedule that triggers the column statistics generation task after it has been created. You can use the AWS Glue console, AWS CLI, or the [UpdateColumnStatisticsTaskSettings](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html#aws-glue-api-crawler-column-statistics-UpdateColumnStatisticsTaskSettings) operation to update the schedule for a table. You can modify the parameters of an existing schedule, such as the schedule type (on-demand or scheduled) and other optional parameters. 

------
#### [ AWS Management Console ]

**To update the settings for a column statistics generation task**

1. Sign in to the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Choose the table that you want to update from the tables list.

1. In the lower section of the table details page, choose **Column statistics**. 

1. Under **Actions**, choose **Edit** to update the schedule.

1. Make the desired changes to the schedule, and choose **Save**.

------
#### [ AWS CLI ]

If you are not using the statistics generation feature in the AWS Glue console, you can update the schedule manually using the `update-column-statistics-task-settings` command. The following example shows how to update the schedule settings using the AWS CLI.

```
aws glue update-column-statistics-task-settings \
 --database-name 'database_name' \
 --table-name 'table_name' \
 --role 'arn:aws:iam::123456789012:role/stats_role' \
 --schedule 'cron(0 0-5 16 * * ?)' \
 --column-name-list 'col-1' \
 --sample-size 20.0 \
 --catalog-id '123456789012' \
 --security-configuration 'test-security'
```

------

## Stopping the schedule for column statistics generation
<a name="stop-column-stats-schedule"></a>

If you no longer need to refresh statistics on a schedule, you can stop the scheduled generation to save resources and costs. Pausing the schedule doesn't affect previously generated statistics. You can resume the schedule at your convenience.

------
#### [ AWS Management Console ]

**To stop the schedule for a column statistics generation task**

1. On the AWS Glue console, choose **Tables** under **Data Catalog**.

1. Select a table with column statistics.

1. On the **Table details** page, choose **Column statistics**.

1. Under **Actions**, choose **Scheduled generation**, **Pause**.

1. Choose **Pause** to confirm.

------
#### [ AWS CLI ]

To stop a column statistics task run schedule using the AWS CLI, you can use the following command: 

```
aws glue stop-column-statistics-task-run-schedule \
 --database-name 'database_name' \
 --table-name 'table_name'
```

Replace the `database_name` and the `table_name` with the actual names of the database and table for which you want to stop the column statistics task run schedule.

------

## Resuming the schedule for column statistics generation
<a name="resume-column-stats-schedule"></a>

 If you've paused the statistics generation schedule, AWS Glue allows you to resume the schedule at your convenience. You can resume the schedule using the AWS Glue console, AWS CLI, or the [StartColumnStatisticsTaskRunSchedule](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html#aws-glue-api-crawler-column-statistics-StartColumnStatisticsTaskRunSchedule) operation. 

------
#### [ AWS Management Console ]

**To resume the schedule for column statistics generation**

1. On the AWS Glue console, choose **Tables** under **Data Catalog**.

1. Select a table with column statistics.

1. On the **Table details** page, choose **Column statistics**.

1. Under **Actions**, choose **Scheduled generation**, and choose **Resume**.

1. Choose **Resume** to confirm.

------
#### [ AWS CLI ]

Replace `database_name` and `table_name` with the actual names of the database and table for which you want to resume the column statistics task run schedule.

```
aws glue start-column-statistics-task-run-schedule \
 --database-name 'database_name' \
 --table-name 'table_name'
```

------

## Deleting the column statistics generation schedule
<a name="delete-column-stats-schedule"></a>

 While maintaining up-to-date statistics is generally recommended for optimal query performance, there are specific use cases where removing the automatic generation schedule might be beneficial.
+ If the data remains relatively static, the existing column statistics may remain accurate for an extended period, reducing the need for frequent updates. Deleting the schedule can prevent unnecessary resource consumption and overhead associated with regenerating statistics on unchanging data.
+ When manual control over statistics generation is preferred, deleting the automatic schedule lets administrators selectively update column statistics at specific intervals or after significant data changes, aligning the process with their maintenance strategies and resource allocation needs. 

------
#### [ AWS Management Console ]

**To delete the schedule for column statistics generation**

1. On the AWS Glue console, choose **Tables** under **Data Catalog**.

1. Select a table with column statistics.

1. On the **Table details** page, choose **Column statistics**.

1. Under **Actions**, choose **Scheduled generation**, **Delete**.

1. Choose **Delete** to confirm.

------
#### [ AWS CLI ]

You can delete the column statistics schedule using the [DeleteColumnStatisticsTaskSettings](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html#aws-glue-api-crawler-column-statistics-DeleteColumnStatisticsTaskSettings) API operation or the AWS CLI. The following example shows how to delete the schedule using the AWS CLI. Replace `database_name` and `table_name` with the actual names of the database and table.

```
aws glue delete-column-statistics-task-settings \
    --database-name 'database_name' \
    --table-name 'table_name'
```

------

# Generating column statistics on demand
<a name="column-stats-on-demand"></a>

You can run the column statistics task for AWS Glue Data Catalog tables on demand, without a set schedule. This option is useful for ad hoc analysis or when statistics need to be computed immediately.

Follow these steps to generate column statistics on demand for Data Catalog tables using the AWS Glue console or AWS CLI.

------
#### [ AWS Management Console ]

**To generate column statistics using the console**

1. Sign in to the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). 

1. Choose **Tables** under **Data Catalog**.

1.  Choose a table from the list. 

1. Choose **Generate statistics** under the **Actions** menu.

   You can also choose **Generate**, **Generate on demand** under the **Column statistics** tab in the lower section of the **Table** page.

1. Follow steps 7-11 in [Generating column statistics on a schedule](generate-column-stats.md) to generate column statistics for the table.

1. On the **Generate statistics** page, specify the following options:   
![\[The screenshot shows the options available to generate column stats.\]](http://docs.aws.amazon.com/glue/latest/dg/images/generate-column-stats.png)
   + **All columns** – Choose this option to generate statistics for all columns in the table.
   + **Selected columns** – Choose this option to generate statistics for specific columns. You can select the columns from the drop-down list.
   + **IAM role** – Choose **Create a new IAM role** that has the required permission policies to run the column statistics generation task. Choose **View permission details** to review the policy statement. You can also select an existing IAM role from the list. For more information about the required permissions, see [Prerequisites for generating column statistics](column-stats-prereqs.md).

     AWS Glue assumes the permissions of the role that you specify to generate statistics. 

     For more information about providing roles for AWS Glue, see [Identity-based policies for AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/security_iam_service-with-iam.html#security_iam_service-with-iam-id-based-policies).
   + (Optional) Next, choose a security configuration to enable at-rest encryption for logs.
   + **Sample rows** – Choose only a specific percent of rows from the table to generate statistics. The default is all rows. Use the up and down arrows to increase or decrease the percent value.
**Note**  
We recommend including all rows in the table to compute accurate statistics. Use sample rows to generate column statistics only when approximate values are acceptable.

   Choose **Generate statistics** to run the task.

------
#### [ AWS CLI ]

This command triggers a column statistics task run for the specified table. You provide the database name, table name, and an IAM role with permissions to generate statistics, and can optionally provide column names and a sample-size percentage for the statistics computation.

```
aws glue start-column-statistics-task-run \
    --database-name 'database_name' \
    --table-name 'table_name' \
    --role 'arn:aws:iam::123456789012:role/stats-role' \
    --column-name 'col1','col2' \
    --sample-size 10.0
```

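If you script this from Python instead of the CLI, a minimal sketch using boto3 is shown below. The `start_column_statistics_task_run` method name and parameter names are assumed to mirror the CLI command; verify them against your installed boto3 version.

```python
def build_task_run_params(database, table, role_arn,
                          columns=None, sample_size=None):
    """Assemble keyword arguments for a StartColumnStatisticsTaskRun call.

    columns and sample_size are optional, matching the CLI example above.
    """
    params = {"DatabaseName": database, "TableName": table, "Role": role_arn}
    if columns:
        params["ColumnNameList"] = list(columns)
    if sample_size is not None:
        params["SampleSize"] = float(sample_size)
    return params

def start_stats_run(params):
    # Requires AWS credentials with the permissions described in the
    # prerequisites topic; the method name is assumed from the CLI command.
    import boto3
    glue = boto3.client("glue")
    return glue.start_column_statistics_task_run(**params)
```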

------

## Updating column statistics on demand
<a name="update-column-stats-on-demand"></a>

 Maintaining up-to-date column statistics is crucial for the query optimizer to generate efficient execution plans, ensuring improved query performance, reduced resource consumption, and better overall system performance. This process is particularly important after significant data changes, such as bulk loads or extensive modifications, which can render existing statistics obsolete. 

You need to explicitly run the **Generate statistics** task from the AWS Glue console to refresh the column statistics. Data Catalog doesn't automatically refresh the statistics.

If you are not using AWS Glue's statistics generation feature in the console, you can manually update column statistics using the [UpdateColumnStatisticsForTable](https://docs.aws.amazon.com/glue/latest/webapi/API_UpdateColumnStatisticsForTable.html) API operation or AWS CLI. The following example shows how to update column statistics using AWS CLI.

```
aws glue update-column-statistics-for-table --cli-input-json file://input.json

{
    "CatalogId": "111122223333",
    "DatabaseName": "database_name",
    "TableName": "table_name",
    "ColumnStatisticsList": [
        {
            "ColumnName": "col1",
            "ColumnType": "Boolean",
            "AnalyzedTime": "1970-01-01T00:00:00",
            "StatisticsData": {
                "Type": "BOOLEAN",
                "BooleanColumnStatisticsData": {
                    "NumberOfTrues": 5,
                    "NumberOfFalses": 5,
                    "NumberOfNulls": 0
                }
            }
        }
    ]
}
```
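Instead of hand-editing the JSON, you can generate the payload programmatically. The following Python sketch writes the Boolean-column payload shown above to a file named `input.json` (a hypothetical name) for use with `--cli-input-json file://input.json`; all values are placeholders.

```python
import json

def boolean_stats_payload(catalog_id, database, table, column,
                          trues, falses, nulls,
                          analyzed_time="1970-01-01T00:00:00"):
    """Build the UpdateColumnStatisticsForTable payload for a Boolean column."""
    return {
        "CatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "ColumnStatisticsList": [{
            "ColumnName": column,
            "ColumnType": "Boolean",
            "AnalyzedTime": analyzed_time,
            "StatisticsData": {
                "Type": "BOOLEAN",
                "BooleanColumnStatisticsData": {
                    "NumberOfTrues": trues,
                    "NumberOfFalses": falses,
                    "NumberOfNulls": nulls,
                },
            },
        }],
    }

payload = boolean_stats_payload("111122223333", "database_name",
                                "table_name", "col1", 5, 5, 0)
with open("input.json", "w") as f:
    json.dump(payload, f, indent=4)
```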

# Viewing column statistics
<a name="view-column-stats"></a>

After the statistics are generated successfully, the Data Catalog stores this information for the cost-based optimizers in Amazon Athena and Amazon Redshift to make optimal choices when running queries. The statistics vary based on the type of the column.

------
#### [ AWS Management Console ]

**To view column statistics for a table**
+ After you run a column statistics task, the **Column statistics** tab on the **Table details** page shows the statistics for the table.   
![\[The screenshot shows columns generated from the most recent run.\]](http://docs.aws.amazon.com/glue/latest/dg/images/view-column-stats.png)

  The following statistics are available:
  + Column name: The name of the column used to generate statistics.
  + Last updated: The date and time when the statistics were generated.
  + Average length: The average length of values in the column.
  + Distinct values: The total number of distinct values in the column, estimated with 5% relative error.
  + Max value: The largest value in the column.
  + Min value: The smallest value in the column.
  + Max length: The length of the longest value in the column.
  + Null values: The total number of null values in the column.
  + True values: The total number of true values in the column.
  + False values: The total number of false values in the column.
  + numFiles: The total number of files in the table. This value is available under the **Advanced properties** tab.

------
#### [ AWS CLI ]

The following example shows how to retrieve column statistics using AWS CLI.

```
aws glue get-column-statistics-for-table \
    --database-name database_name \
    --table-name table_name \
    --column-names 'column_name'
```

 You can also view the column statistics using the [GetColumnStatisticsForTable](https://docs.aws.amazon.com/glue/latest/webapi/API_GetColumnStatisticsForTable.html) API operation. 

------

# Viewing column statistics task runs
<a name="view-stats-run"></a>

After you run a column statistics task, you can explore the task run details for a table using the AWS Glue console, the AWS CLI, or the [GetColumnStatisticsTaskRuns](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html#aws-glue-api-crawler-column-statistics-GetColumnStatisticsTaskRun) operation.

------
#### [ Console ]

**To view column statistics task run details**

1. On the AWS Glue console, choose **Tables** under **Data Catalog**.

1. Select a table with column statistics.

1. On the **Table details** page, choose **Column statistics**.

1. Choose **View runs**.

   You can see information about all runs associated with the specified table.  
![\[The screenshot shows the column statistics task runs for the table.\]](http://docs.aws.amazon.com/glue/latest/dg/images/view-column-stats-task-runs.png)

------
#### [ AWS CLI ]

In the following example, replace values for `DatabaseName` and `TableName` with the actual database and table name.

```
aws glue get-column-statistics-task-runs --cli-input-json file://input.json
{
    "DatabaseName": "database_name",
    "TableName": "table_name"
}
```
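The `input.json` file contains only the database and table names. For example, a short script (names here are placeholders) can generate it:

```python
import json

# Write the input.json payload referenced by --cli-input-json above.
payload = {"DatabaseName": "database_name", "TableName": "table_name"}
with open("input.json", "w") as f:
    json.dump(payload, f, indent=4)
```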

------

# Stopping column statistics task run
<a name="stop-stats-run"></a>

You can stop a column statistics task run for a table using the AWS Glue console, the AWS CLI, or the [StopColumnStatisticsTaskRun](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html#aws-glue-api-crawler-column-statistics-StopColumnStatisticsTaskRun) operation.

------
#### [ Console ]

**To stop a column statistics task run**

1. On the AWS Glue console, choose **Tables** under **Data Catalog**.

1. Select the table for which a column statistics task run is in progress.

1. On the **Table details** page, choose **Column statistics**.

1. Choose **Stop**.

   If you stop the task before the run is complete, column statistics won't be generated for the table.

------
#### [ AWS CLI ]

In the following example, replace values for `DatabaseName` and `TableName` with the actual database and table name.

```
aws glue stop-column-statistics-task-run --cli-input-json file://input.json
{
    "DatabaseName": "database_name",
    "TableName": "table_name"
}
```

------

# Deleting column statistics
<a name="delete-column-stats"></a>

You can delete column statistics using the [DeleteColumnStatisticsForTable](https://docs.aws.amazon.com/glue/latest/webapi/API_DeleteColumnStatisticsForTable.html) API operation or AWS CLI. The following example shows how to delete column statistics using AWS Command Line Interface (AWS CLI).

```
aws glue delete-column-statistics-for-table \
    --database-name 'database_name' \
    --table-name 'table_name' \
    --column-name 'column_name'
```

# Considerations and limitations
<a name="column-stats-notes"></a>

The following considerations and limitations apply to generating column statistics.

**Considerations**
+ Using sampling to generate statistics reduces run time, but can generate inaccurate statistics.
+ Data Catalog doesn't store different versions of the statistics.
+ You can only run one statistics generation task at a time per table.
+ If a table is encrypted using a customer managed AWS KMS key registered with the Data Catalog, AWS Glue uses the same key to encrypt statistics.

**Column statistics task supports generating statistics:**
+ When the IAM role has full table permissions (IAM or Lake Formation).
+ When the IAM role has permissions on the table using Lake Formation hybrid access mode.

**Column statistics task doesn’t support generating statistics for:**
+ Tables with Lake Formation cell-based access control
+ Transactional data lakes - Linux Foundation Delta Lake and Apache Hudi
+ Tables in federated databases - Hive metastore and Amazon Redshift datashares
+ Nested columns, arrays, and struct data types
+ Tables that are shared with you from another account

# Encrypting your Data Catalog
<a name="catalog-encryption"></a>

You can protect your metadata stored in the AWS Glue Data Catalog at rest using encryption keys managed by AWS Key Management Service (AWS KMS). You can enable encryption for a new Data Catalog by using the **Data Catalog settings**, and you can enable or disable encryption for an existing Data Catalog as needed. When encryption is enabled, AWS Glue encrypts all new metadata written to the catalog, while existing metadata remains unencrypted.

For detailed information about encrypting your Data Catalog, see [Encrypting your Data Catalog](encrypt-glue-data-catalog.md).

# Securing your Data Catalog using Lake Formation
<a name="secure-catalog"></a>

AWS Lake Formation is a service that makes it easier to set up a secure data lake in AWS. It provides a central place to create and securely manage your data lakes by defining fine-grained access control permissions. Lake Formation uses the Data Catalog to store and retrieve metadata about your data lake, such as table definitions, schema information, and data access control settings.

You can register your Amazon S3 data location of the metadata table or database with Lake Formation and use it to define metadata-level permissions on the Data Catalog resources. You can also use Lake Formation to manage storage access permissions on the underlying data stored in Amazon S3 on behalf of integrated analytical engines.

For more information, see [What is AWS Lake Formation?](https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html).

# Working with AWS Glue Data Catalog views in AWS Glue
<a name="catalog-views"></a>

 You can create and manage views in the AWS Glue Data Catalog, commonly known as AWS Glue Data Catalog views. These views are useful because they support multiple SQL query engines, allowing you to access the same view across different AWS services, such as Amazon Athena, Amazon Redshift, and AWS Glue. You can use views based on Apache Iceberg, Apache Hudi, and Delta Lake. 

 By creating a view in the Data Catalog, you can use resource grants and tag-based access controls in AWS Lake Formation to grant access to it. Using this method of access control, you don't have to configure additional access to the tables referenced when creating the view. This method of granting permissions is called definer semantics, and these views are called definer views. For more information about access control in AWS Lake Formation, see [ Granting and revoking permissions on Data Catalog resources](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-catalog-permissions.html) in the AWS Lake Formation Developer Guide. 

 Data Catalog views are useful for the following use cases: 
+  **Granular access control** – You can create a view that restricts data access based on the permissions the user needs. For example, you can use views in the Data Catalog to prevent employees who don't work in the HR department from seeing personally identifiable information (PII). 
+  **Complete view definition** – By applying filters on your view in the Data Catalog, you ensure that data records available in the view are always complete. 
+  **Enhanced security** – The query definition used to create the view must be complete, making Data Catalog views less susceptible to SQL commands from malicious actors. 
+  **Simple data sharing** – Share data with other AWS accounts without moving data, using cross-account data sharing in AWS Lake Formation. 

## Creating a Data Catalog view
<a name="catalog-creating-view"></a>

You can create Data Catalog views using the AWS CLI and from AWS Glue ETL scripts that use Spark SQL. The syntax for creating a Data Catalog view includes specifying the view type as `MULTI DIALECT` and the `SECURITY` predicate as `DEFINER`, indicating a definer view. 

 Example SQL statement to create a Data Catalog view: 

```
CREATE PROTECTED MULTI DIALECT VIEW database_name.catalog_view SECURITY DEFINER
AS SELECT order_date, sum(totalprice) AS price
FROM source_table
GROUP BY order_date;
```

 After creating a Data Catalog view, you can use an IAM role with the AWS Lake Formation `SELECT` permission on the view to query it from services like Amazon Athena, Amazon Redshift, or AWS Glue ETL jobs. You don't need to grant access to the underlying tables referenced in the view. 

 For more information on creating and configuring Data Catalog views, see [Building AWS Glue Data Catalog views](https://docs.aws.amazon.com/lake-formation/latest/dg/working-with-views.html) in the AWS Lake Formation Developer Guide. 

## Supported view operations
<a name="catalog-supported-view-operations"></a>

 The following command fragments show you various ways to work with Data Catalog views: 

 **CREATE VIEW** 

Creates a Data Catalog view. The following sample shows creating a view from an existing table: 

```
CREATE PROTECTED MULTI DIALECT VIEW catalog_view 
SECURITY DEFINER AS SELECT * FROM my_catalog.my_database.source_table
```

 **ALTER VIEW** 

 Available syntax: 

```
ALTER VIEW view_name [FORCE] ADD DIALECT AS query
ALTER VIEW view_name [FORCE] UPDATE DIALECT AS query
ALTER VIEW view_name DROP DIALECT
```

 You can use the `FORCE ADD DIALECT` option to force update the schema and sub objects as per the new engine dialect. Note that doing this can result in query errors if you don't also use `FORCE` to update other engine dialects. The following shows a sample: 

```
ALTER VIEW catalog_view FORCE ADD DIALECT AS
SELECT order_date, sum(totalprice) AS price
FROM source_table
GROUP BY order_date;
```

 The following shows how to alter a view in order to update the dialect: 

```
ALTER VIEW catalog_view UPDATE DIALECT AS
SELECT count(*) FROM my_catalog.my_database.source_table;
```

 **DESCRIBE VIEW** 

 Available syntax for describing a view: 

 `SHOW COLUMNS {FROM|IN} view_name [{FROM|IN} database_name]` – If the user has the required AWS Glue and AWS Lake Formation permissions to describe the view, they can list the columns. The following shows a couple of sample commands for showing columns: 

```
SHOW COLUMNS FROM my_database.source_table;    
SHOW COLUMNS IN my_database.source_table;
```

 `DESCRIBE view_name` – If the user has the required AWS Glue and AWS Lake Formation permissions to describe the view, they can list the columns in the view along with its metadata. 

 **DROP VIEW** 

 Available syntax: 

```
DROP VIEW [ IF EXISTS ] view_name
```

 The following sample shows a `DROP` statement that tests if a view exists prior to dropping it: 

```
DROP VIEW IF EXISTS catalog_view;
```

 `SHOW CREATE VIEW view_name` – Shows the SQL statement that creates the specified view. The following sample shows the command and its output for a Data Catalog view: 

```
SHOW CREATE VIEW my_database.catalog_view;

CREATE PROTECTED MULTI DIALECT VIEW my_catalog.my_database.catalog_view (
  net_profit,
  customer_id,
  item_id,
  sold_date)
TBLPROPERTIES (
  'transient_lastDdlTime' = '1736267222')
SECURITY DEFINER AS SELECT * FROM
my_database.store_sales_partitioned_lf WHERE customer_id IN (SELECT customer_id from source_table limit 10)
```

 **SHOW VIEWS** 

 Lists all views in the catalog, such as regular views, multi-dialect views (MDVs), and MDVs without a Spark dialect. The available syntax is the following: 

```
SHOW VIEWS [{ FROM | IN } database_name] [LIKE regex_pattern]
```

 The following shows a sample command to show views: 

```
SHOW VIEWS IN marketing_analytics LIKE 'catalog_view*';
```

 For more information about creating and configuring Data Catalog views, see [Building AWS Glue Data Catalog views](https://docs.aws.amazon.com/lake-formation/latest/dg/working-with-views.html) in the AWS Lake Formation Developer Guide. 

## Querying a Data Catalog view
<a name="catalog-view-query"></a>

 After creating a Data Catalog view, you can query the view. The IAM role configured in your AWS Glue jobs must have the Lake Formation **SELECT** permission on the Data Catalog view. You don't need to grant access to the underlying tables referenced in the view. 

 Once you have everything set up, you can query your view. For example, you can run the following query to access a view. 

```
SELECT * from my_database.catalog_view LIMIT 10;
```

## Limitations
<a name="catalog-view-limitations"></a>

 Consider the following limitations when you use Data Catalog views. 
+  You can only create Data Catalog views with AWS Glue 5.0 and above. 
+  The Data Catalog view definer must have `SELECT` access to the underlying base tables accessed by the view. Creating the Data Catalog view fails if a specific base table has any Lake Formation filters imposed on the definer role. 
+  Base tables must not have the `IAMAllowedPrincipals` data lake permission in AWS Lake Formation. If present, the error `Multi Dialect views may only reference tables without IAMAllowedPrincipals permissions` occurs. 
+  The table's Amazon S3 location must be registered as an AWS Lake Formation data lake location. If the table isn't registered, the error `Multi Dialect views may only reference AWS Lake Formation managed tables` occurs. For information about how to register Amazon S3 locations in AWS Lake Formation, see [Registering an Amazon S3 location](https://docs.aws.amazon.com/lake-formation/latest/dg/register-data-lake.html) in the AWS Lake Formation Developer Guide. 
+  You can only create `PROTECTED` Data Catalog views. `UNPROTECTED` views aren't supported. 
+  You can't reference tables in another AWS account in a Data Catalog view definition. You also can't reference a table in the same account that's in a separate region. 
+  To share data across an account or region, the entire view must be shared cross account and cross region, using AWS Lake Formation resource links. 
+  User-defined functions (UDFs) aren't supported. 
+  You can't reference other views in Data Catalog views. 

# Accessing the Data Catalog
<a name="access_catalog"></a>

 You can use the AWS Glue Data Catalog (Data Catalog) to discover and understand your data. Data Catalog provides a consistent way to maintain schema definitions, data types, locations, and other metadata. You can access the Data Catalog using the following methods:
+ AWS Glue console – You can access and manage the Data Catalog through the AWS Glue console, a web-based user interface. The console allows you to browse and search for databases, tables, and their associated metadata, as well as create, update, and delete metadata definitions. 
+ AWS Glue crawler – Crawlers are programs that automatically scan your data sources and populate the Data Catalog with metadata. You can create and run crawlers to discover and catalog data from various sources such as Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon CloudWatch, and JDBC-compliant relational databases such as MySQL and PostgreSQL, as well as several non-AWS sources such as Snowflake and Google BigQuery.
+ AWS Glue APIs – You can access the Data Catalog programmatically using the AWS Glue APIs. These APIs allow you to interact with the Data Catalog programmatically, enabling automation and integration with other applications and services. 
+ AWS Command Line Interface (AWS CLI) – You can use the AWS CLI to access and manage the Data Catalog from the command line. The CLI provides commands for creating, updating, and deleting metadata definitions, as well as querying and retrieving metadata information. 
+ Integration with other AWS services – The Data Catalog integrates with various other AWS services, allowing you to access and utilize the metadata stored in the catalog. For example, you can use Amazon Athena to query data sources using the metadata in the Data Catalog, and use AWS Lake Formation to manage data access and governance for the Data Catalog resources. 

**Topics**
+ [Connecting to the Data Catalog using AWS Glue Iceberg REST endpoint](connect-glu-iceberg-rest.md)
+ [Connecting to the Data Catalog using AWS Glue Iceberg REST extension endpoint](connect-glue-iceberg-rest-ext.md)
+ [AWS Glue REST APIs for Apache Iceberg specifications](iceberg-rest-apis.md)
+ [Connecting to Data Catalog from a standalone Spark application](connect-gludc-spark.md)
+ [Data mapping between Amazon Redshift and Apache Iceberg](data-mapping-rs-iceberg.md)
+ [Considerations and limitations when using AWS Glue Iceberg REST Catalog APIs](limitation-glue-iceberg-rest-api.md)

# Connecting to the Data Catalog using AWS Glue Iceberg REST endpoint
<a name="connect-glu-iceberg-rest"></a>

 AWS Glue's Iceberg REST endpoint supports API operations specified in the Apache Iceberg REST specification. Using an Iceberg REST client, you can connect your application running on an analytics engine to the REST catalog hosted in the Data Catalog.

 The endpoint supports both Apache Iceberg table specifications, v1 and v2, and defaults to v2. When using the Iceberg table v1 specification, you must specify v1 in the API call. Using the API operations, you can access Iceberg tables stored in both Amazon S3 object storage and Amazon S3 Tables storage. 

**Endpoint configuration**

You can access the AWS Glue Iceberg REST catalog using the service endpoint. Refer to the [AWS Glue service endpoints reference guide](https://docs.aws.amazon.com/general/latest/gr/glue.html#glue_region) for the region-specific endpoint. For example, when connecting to AWS Glue in the us-east-1 Region, you need to configure the endpoint URI property as follows: 

```
Endpoint : https://glue.us-east-1.amazonaws.com/iceberg
```

**Additional configuration properties** – When using an Iceberg client to connect an analytics engine like Spark to the service endpoint, you must specify the following application configuration properties:

```
catalog_name = "mydatacatalog"
aws_account_id = "123456789012"
aws_region = "us-east-1"
spark = SparkSession.builder \
    ... \
    .config("spark.sql.defaultCatalog", catalog_name) \
    .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.type", "rest") \
    .config(f"spark.sql.catalog.{catalog_name}.uri", f"https://glue.{aws_region}.amazonaws.com/iceberg") \
    .config(f"spark.sql.catalog.{catalog_name}.warehouse", aws_account_id) \
    .config(f"spark.sql.catalog.{catalog_name}.rest.sigv4-enabled", "true") \
    .config(f"spark.sql.catalog.{catalog_name}.rest.signing-name", "glue") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .getOrCreate()
```

The AWS Glue Iceberg endpoint `https://glue.us-east-1.amazonaws.com/iceberg` supports the following Iceberg REST APIs:
+ GetConfig
+ ListNamespaces
+ CreateNamespace
+ LoadNamespaceMetadata
+ UpdateNamespaceProperties
+ DeleteNamespace
+ ListTables
+ CreateTable
+ LoadTable
+ TableExists
+ UpdateTable
+ DeleteTable

## Prefix and catalog path parameters
<a name="prefix-catalog-path-parameters"></a>

Iceberg REST catalog APIs have a free-form prefix in their request URLs. For example, the `ListNamespaces` API call uses the `GET /v1/{prefix}/namespaces` URL format. The AWS Glue prefix always follows the `/catalogs/{catalog}` structure to ensure that the REST path aligns with the AWS Glue multi-catalog hierarchy. The `{catalog}` path parameter is derived based on the following rules:


| **Access pattern** |  **Glue catalog ID Style**  |  **Prefix Style**  | **Example default catalog ID** |  **Example REST route**  | 
| --- | --- | --- | --- | --- | 
|  Access the default catalog in current account  | not required | : |  not applicable  |  GET /v1/catalogs/:/namespaces  | 
|  Access the default catalog in a specific account  | accountID | accountID | 111122223333 | GET /v1/catalogs/111122223333/namespaces | 
|  Access a nested catalog in current account  |  catalog1/catalog2  |  catalog1/catalog2  |  rmscatalog1:db1  |  GET /v1/catalogs/rmscatalog1:db1/namespaces  | 
|  Access a nested catalog in a specific account  |  accountId:catalog1/catalog2  |  accountId:catalog1/catalog2  |  123456789012/rmscatalog1:db1  |  GET /v1/catalogs/123456789012:rmscatalog1:db1/namespaces  | 

This catalog ID to prefix mapping is required only when you call the REST APIs directly. When you work with the AWS Glue Iceberg REST catalog APIs through an engine, you specify the AWS Glue catalog ID in the `warehouse` parameter of your Iceberg REST catalog API setting, or in the `glue.id` parameter of your AWS Glue extensions API setting. For example, see how you can use it with EMR Spark in [Use an Iceberg cluster with Spark](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-iceberg-use-spark-cluster.html).

## Namespace path parameter
<a name="ns-path-param"></a>

Namespaces in Iceberg REST catalog API paths can have multiple levels. However, AWS Glue supports only single-level namespaces. To access a namespace in a multi-level catalog hierarchy, connect to a multi-level catalog above the namespace and reference the namespace from there. This allows any query engine that supports the three-part `catalog.namespace.table` notation to access objects in AWS Glue's multi-level catalog hierarchy without the compatibility issues of multi-level namespaces.

# Connecting to the Data Catalog using AWS Glue Iceberg REST extension endpoint
<a name="connect-glue-iceberg-rest-ext"></a>

The AWS Glue Iceberg REST extension endpoint provides additional APIs that are not present in the Apache Iceberg REST specification, as well as server-side scan planning capabilities. These additional APIs are used when you access tables stored in Amazon Redshift managed storage. The endpoint is accessible from an application using the Apache Iceberg AWS Glue Data Catalog extensions. 

**Endpoint configuration** – A catalog with tables in Amazon Redshift managed storage is accessible using the service endpoint. Refer to the [AWS Glue service endpoints reference guide](https://docs.aws.amazon.com/general/latest/gr/glue.html#glue_region) for the Region-specific endpoint. For example, when connecting to AWS Glue in the us-east-1 Region, configure the endpoint URI property as follows:

```
Endpoint : https://glue.us-east-1.amazonaws.com/extensions
```

```
catalog_name = "myredshiftcatalog"
aws_account_id = "123456789012"
aws_region = "us-east-1"
spark = SparkSession.builder \
    .config("spark.sql.defaultCatalog", catalog_name) \
    .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.type", "glue") \
    .config(f"spark.sql.catalog.{catalog_name}.glue.id", f"{aws_account_id}:redshiftnamespacecatalog/redshiftdb") \
    .config("spark.sql.extensions","org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .getOrCreate()
```

# AWS Glue REST APIs for Apache Iceberg specifications
<a name="iceberg-rest-apis"></a>

This section contains specifications for the AWS Glue Iceberg REST catalog APIs and AWS Glue extension APIs, and considerations for using these APIs. 

API requests to the AWS Glue Data Catalog endpoints are authenticated using AWS Signature Version 4 (SigV4). See [AWS Signature Version 4 for API requests](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_sigv.html) to learn more about AWS SigV4.

When accessing the AWS Glue service endpoint and AWS Glue metadata, the application assumes an IAM role that requires the `glue:GetCatalog` IAM action. 

Access to the Data Catalog and its objects can be managed using IAM permissions, AWS Lake Formation permissions, or Lake Formation hybrid mode permissions. This applies to the default Data Catalog and its objects as well.

Federated catalogs in the Data Catalog have Lake Formation registered data locations. Lake Formation works with the Data Catalog to provide database-style permissions that manage user access to Data Catalog objects. 

To create, insert, or delete data in Lake Formation managed objects, you must set up specific permissions for the IAM user or role. 
+ CREATE_CATALOG – Required to create catalogs 
+ CREATE_DATABASE – Required to create databases
+ CREATE_TABLE – Required to create tables
+ DELETE – Required to delete data from a table
+ DESCRIBE – Required to read metadata 
+ DROP – Required to drop or delete a table or database
+ INSERT – Required to insert data into a table
+ SELECT – Required to select data from a table

For more information, see [Lake Formation permissions reference](https://docs.aws.amazon.com/lake-formation/latest/dg/lf-permissions-reference.html) in the AWS Lake Formation Developer Guide.

# GetConfig
<a name="get-config"></a>


**General information**  

|  |  | 
| --- |--- |
| Operation name | GetConfig | 
| Type |  Iceberg REST Catalog API  | 
| REST path |  GET /iceberg/v1/config  | 
| IAM action |  glue:GetCatalog  | 
| Lake Formation permissions | Not applicable | 
| CloudTrail event |  glue:GetCatalog  | 
| Open API definition | https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/open-api/rest-catalog-open-api.yaml#L67 | 

**Considerations and limitations**
+ The `warehouse` query parameter must be set to the AWS Glue catalog ID. If not set, the root catalog in the current account is used to return the response. For more information, see [Prefix and catalog path parameters](connect-glu-iceberg-rest.md#prefix-catalog-path-parameters).

# GetCatalog
<a name="get-catalog"></a>


**General information**  

|  |  | 
| --- |--- |
| Operation name | GetCatalog | 
| Type |  AWS Glue extension API  | 
| REST path |  GET /extensions/v1/catalogs/{catalog}  | 
| IAM action |  glue:GetCatalog  | 
| Lake Formation permissions | DESCRIBE | 
| CloudTrail event |  glue:GetCatalog  | 
| Open API definition | https://github.com/awslabs/glue-extensions-for-iceberg/blob/main/glue-extensions-api.yaml#L40 | 

**Considerations and limitations**
+ The catalog path parameter must follow the style described in the [Prefix and catalog path parameters](connect-glu-iceberg-rest.md#prefix-catalog-path-parameters) section.

# ListNamespaces
<a name="list-ns"></a>


**General information**  

|  |  | 
| --- |--- |
| Operation name | ListNamespaces | 
| Type |  Iceberg REST Catalog API  | 
| REST path |  GET /iceberg/v1/catalogs/{catalog}/namespaces  | 
| IAM action |  glue:GetDatabase  | 
| Lake Formation permissions | ALL, DESCRIBE, SELECT | 
| CloudTrail event |  glue:GetDatabase  | 
| Open API definition | https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/open-api/rest-catalog-open-api.yaml#L205 | 

**Considerations and limitations**
+ The catalog path parameter must follow the style described in the [Prefix and catalog path parameters](connect-glu-iceberg-rest.md#prefix-catalog-path-parameters) section.
+ Only namespaces at the next level are displayed. To list namespaces at deeper levels, specify the nested catalog ID in the catalog path parameter.

# CreateNamespace
<a name="create-ns"></a>


**General information**  

|  |  | 
| --- |--- |
| Operation name | CreateNamespace | 
| Type |  Iceberg REST Catalog API  | 
| REST path |  POST /iceberg/v1/catalogs/{catalog}/namespaces  | 
| IAM action |  glue:CreateDatabase  | 
| Lake Formation permissions | ALL, DESCRIBE, SELECT | 
| CloudTrail event |  glue:CreateDatabase  | 
| Open API definition | https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/open-api/rest-catalog-open-api.yaml#L256 | 

**Considerations and limitations**
+ The catalog path parameter must follow the style described in the [Prefix and catalog path parameters](connect-glu-iceberg-rest.md#prefix-catalog-path-parameters) section.
+ Only a single-level namespace can be created. To create a multi-level namespace, you must iteratively create each level and connect to that level using the catalog path parameter.
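Because each call creates only one level, building a multi-level hierarchy means issuing one `CreateNamespace` call per level, reconnecting one catalog level deeper each time. The following sketch derives that call sequence; the helper and the printed routes are an illustration of the rule above, not a real AWS client API:

```python
# Sketch: plan the per-level CreateNamespace calls for a multi-level hierarchy.
# Each call targets the catalog path built from the levels created so far,
# starting from ":" (the default catalog in the current account).
def namespace_creation_plan(levels):
    plan = []
    catalog_path = ":"
    for level in levels:
        plan.append((catalog_path, level))
        # the next call connects to the level just created
        catalog_path = level if catalog_path == ":" else f"{catalog_path}/{level}"
    return plan

for catalog, ns in namespace_creation_plan(["rmscatalog1", "db1"]):
    print(f"POST /iceberg/v1/catalogs/{catalog}/namespaces  namespace={ns}")
```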

# StartCreateNamespaceTransaction
<a name="start-create-ns-transaction"></a>


**General information**  

|  |  | 
| --- |--- |
| Operation name | StartCreateNamespaceTransaction | 
| Type |  AWS Glue extensions API  | 
| REST path |  POST /extensions/v1/catalogs/{catalog}/namespaces  | 
| IAM action |  glue:CreateDatabase  | 
| Lake Formation permissions | ALL, DESCRIBE, SELECT | 
| CloudTrail event |  glue:CreateDatabase  | 
| Open API definition | https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/open-api/rest-catalog-open-api.yaml#L256 | 

**Considerations and limitations**
+ The catalog path parameter must follow the style described in the [Prefix and catalog path parameters](connect-glu-iceberg-rest.md#prefix-catalog-path-parameters) section.
+ You can create only a single-level namespace. To create multi-level namespaces, you must iteratively create each level and connect to that level using the catalog path parameter.
+ The API is asynchronous, and returns a transaction ID that you can track using the `CheckTransactionStatus` API call.
+ You can call this API only if the `GetCatalog` API response contains the parameter `use-extensions=true`. 

## LoadNamespaceMetadata
<a name="load-ns-metadata"></a>


**General information**  

|  |  | 
| --- |--- |
| Operation name | LoadNamespaceMetadata | 
| Type |  Iceberg REST Catalog API  | 
| REST path |  GET /iceberg/v1/catalogs/{catalog}/namespaces/{ns}  | 
| IAM action |  glue:GetDatabase  | 
| Lake Formation permissions | ALL, DESCRIBE, SELECT | 
| CloudTrail event |  glue:GetDatabase  | 
| Open API definition | https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/open-api/rest-catalog-open-api.yaml#L302 | 

**Considerations and limitations**
+ The catalog path parameter must follow the style described in the [Prefix and catalog path parameters](connect-glu-iceberg-rest.md#prefix-catalog-path-parameters) section.
+ You can specify only a single-level namespace in the REST path parameter. For more information, see the [Namespace path parameter](connect-glu-iceberg-rest.md#ns-path-param) section.

## UpdateNamespaceProperties
<a name="w2aac20c29c16c21c13"></a>


**General information**  

|  |  | 
| --- |--- |
| Operation name | UpdateNamespaceProperties | 
| Type |  Iceberg REST Catalog API  | 
| REST path |  POST /iceberg/v1/catalogs/{catalog}/namespaces/{ns}/properties  | 
| IAM action |  glue:UpdateDatabase  | 
| Lake Formation permissions | ALL, ALTER | 
| CloudTrail event |  glue:UpdateDatabase  | 
| Open API definition | https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/open-api/rest-catalog-open-api.yaml#L400 | 

**Considerations and limitations**
+ The catalog path parameter must follow the style described in the [Prefix and catalog path parameters](connect-glu-iceberg-rest.md#prefix-catalog-path-parameters) section.
+ You can specify only a single-level namespace in the REST path parameter. For more information, see the [Namespace path parameter](connect-glu-iceberg-rest.md#ns-path-param) section.

# DeleteNamespace
<a name="delete-ns"></a>


**General information**  

|  |  | 
| --- |--- |
| Operation name | DeleteNamespace | 
| Type |  Iceberg REST Catalog API  | 
| REST path |  DELETE /iceberg/v1/catalogs/{catalog}/namespaces/{ns}  | 
| IAM action |  glue:DeleteDatabase  | 
| Lake Formation permissions | ALL, DROP | 
| CloudTrail event |  glue:DeleteDatabase  | 
| Open API definition | https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/open-api/rest-catalog-open-api.yaml#L365 | 

**Considerations and limitations**
+ The catalog path parameter must follow the style described in the [Prefix and catalog path parameters](connect-glu-iceberg-rest.md#prefix-catalog-path-parameters) section.
+ You can specify only a single-level namespace in the REST path parameter. For more information, see the [Namespace path parameter](connect-glu-iceberg-rest.md#ns-path-param) section.
+ If there are objects in the database, the operation fails.
+ The API is asynchronous, and returns a transaction ID that you can track using the `CheckTransactionStatus` API call.
+ You can use this API only if the `GetCatalog` API response indicates `use-extensions=true`. 

# StartDeleteNamespaceTransaction
<a name="start-delete-ns-transaction"></a>


**General information**  

|  |  | 
| --- |--- |
| Operation name | StartDeleteNamespaceTransaction | 
| Type |  AWS Glue extensions API  | 
| REST path |  DELETE /extensions/v1/catalogs/{catalog}/namespaces/{ns}  | 
| IAM action |  glue:DeleteDatabase  | 
| Lake Formation permissions | ALL, DROP | 
| CloudTrail event |  glue:DeleteDatabase  | 
| Open API definition | https://github.com/awslabs/glue-extensions-for-iceberg/blob/main/glue-extensions-api.yaml#L85 | 

**Considerations and limitations**
+ The catalog path parameter must follow the style described in the [Prefix and catalog path parameters](connect-glu-iceberg-rest.md#prefix-catalog-path-parameters) section.
+ You can specify only a single-level namespace in the REST path parameter. For more information, see the [Namespace path parameter](connect-glu-iceberg-rest.md#ns-path-param) section.
+ If there are objects in the database, the operation fails.
+ The API is asynchronous, and returns a transaction ID that you can track using the `CheckTransactionStatus` API call.
+ You can use this API only if the `GetCatalog` API response indicates `use-extensions=true`. 

# ListTables
<a name="list-tables"></a>


**General information**  

|  |  | 
| --- |--- |
| Operation name | ListTables | 
| Type |  Iceberg REST Catalog API  | 
| REST path |  GET /iceberg/v1/catalogs/{catalog}/namespaces/{ns}/tables  | 
| IAM action |  glue:GetTables  | 
| Lake Formation permissions | ALL, SELECT, DESCRIBE | 
| CloudTrail event |  glue:GetTables  | 
| Open API definition | https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/open-api/rest-catalog-open-api.yaml#L463 | 

**Considerations and limitations**
+ The catalog path parameter must follow the style described in the [Prefix and catalog path parameters](connect-glu-iceberg-rest.md#prefix-catalog-path-parameters) section.
+ You can specify only a single-level namespace in the REST path parameter. For more information, see the [Namespace path parameter](connect-glu-iceberg-rest.md#ns-path-param) section.
+ All tables, including non-Iceberg tables, are listed. To determine whether a table can be loaded as an Iceberg table, call the `LoadTable` operation.

# CreateTable
<a name="create-table"></a>


**General information**  

|  |  | 
| --- |--- |
| Operation name | CreateTable | 
| Type |  Iceberg REST Catalog API  | 
| REST path |  POST /iceberg/v1/catalogs/{catalog}/namespaces/{ns}/tables  | 
| IAM action |  glue:CreateTable  | 
| Lake Formation permissions | ALL, CREATE_TABLE | 
| CloudTrail event |  glue:CreateTable  | 
| Open API definition | https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/open-api/rest-catalog-open-api.yaml#L497 | 

**Considerations and limitations**
+ The catalog path parameter must follow the style described in the [Prefix and catalog path parameters](connect-glu-iceberg-rest.md#prefix-catalog-path-parameters) section.
+ You can specify only a single-level namespace in the REST path parameter. For more information, see the [Namespace path parameter](connect-glu-iceberg-rest.md#ns-path-param) section.
+ `CreateTable` with staging is not supported. If the `stageCreate` query parameter is specified, the operation fails. This means operations like `CREATE TABLE AS SELECT` are not supported; use a combination of `CREATE TABLE` and `INSERT INTO` as a workaround.
+ The `CreateTable` API operation doesn't support the option `state-create = TRUE`.

# StartCreateTableTransaction
<a name="start-create-table-transaction"></a>


**General information**  

|  |  | 
| --- |--- |
| Operation name | StartCreateTableTransaction | 
| Type |  AWS Glue extensions API  | 
| REST path |  POST /extensions/v1/catalogs/{catalog}/namespaces/{ns}/tables  | 
| IAM action |  glue:CreateTable  | 
| Lake Formation permissions | ALL, CREATE_TABLE | 
| CloudTrail event |  glue:CreateTable  | 
| Open API definition | https://github.com/awslabs/glue-extensions-for-iceberg/blob/main/glue-extensions-api.yaml#L107 | 

**Considerations and limitations**
+ The catalog path parameter must follow the style described in the [Prefix and catalog path parameters](connect-glu-iceberg-rest.md#prefix-catalog-path-parameters) section.
+ You can specify only a single-level namespace in the REST path parameter. For more information, see the [Namespace path parameter](connect-glu-iceberg-rest.md#ns-path-param) section.
+ `CreateTable` with staging is not supported. If the `stageCreate` query parameter is specified, the operation fails. This means operations like `CREATE TABLE AS SELECT` are not supported; use a combination of `CREATE TABLE` and `INSERT INTO` as a workaround.
+ The API is asynchronous, and returns a transaction ID that you can track using the `CheckTransactionStatus` API call.
+ You can use this API only if the `GetCatalog` API response indicates `use-extensions=true`. 

# LoadTable
<a name="load-table"></a>


**General information**  

|  |  | 
| --- |--- |
| Operation name | LoadTable | 
| Type |  Iceberg REST Catalog API  | 
| REST path |  GET /iceberg/v1/catalogs/{catalog}/namespaces/{ns}/tables/{table}  | 
| IAM action |  glue:GetTable  | 
| Lake Formation permissions | ALL, SELECT, DESCRIBE | 
| CloudTrail event |  glue:GetTable  | 
| Open API definition | https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/open-api/rest-catalog-open-api.yaml#L616 | 

**Considerations**
+ The catalog path parameter must follow the style described in the [Prefix and catalog path parameters](connect-glu-iceberg-rest.md#prefix-catalog-path-parameters) section.
+ You can specify only a single-level namespace in the REST path parameter. For more information, see the [Namespace path parameter](connect-glu-iceberg-rest.md#ns-path-param) section.

# ExtendedLoadTable
<a name="extended-load-table"></a>


**General information**  

|  |  | 
| --- |--- |
| Operation name | ExtendedLoadTable | 
| Type |  AWS Glue extensions API  | 
| REST path |  GET /extensions/v1/catalogs/{catalog}/namespaces/{ns}/tables/{table}  | 
| IAM action |  glue:GetTable  | 
| Lake Formation permissions | ALL, SELECT, DESCRIBE | 
| CloudTrail event |  glue:GetTable  | 
| Open API definition | https://github.com/awslabs/glue-extensions-for-iceberg/blob/main/glue-extensions-api.yaml#L134 | 

**Considerations**
+ The catalog path parameter must follow the style described in the [Prefix and catalog path parameters](connect-glu-iceberg-rest.md#prefix-catalog-path-parameters) section.
+ You can specify only a single-level namespace in the REST path parameter. For more information, see the [Namespace path parameter](connect-glu-iceberg-rest.md#ns-path-param) section.
+ Only `all` mode is supported for the snapshots query parameter.
+ Compared to the `LoadTable` API, the `ExtendedLoadTable` API differs in the following ways:
  + It doesn't strictly enforce that all fields are available.
  + It provides the following additional parameters in the `config` field of the response:   
**Additional parameters**    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/extended-load-table.html)

# PreplanTable
<a name="preplan-table"></a>


**General information**  

|  |  | 
| --- |--- |
| Operation name | PreplanTable | 
| Type |  AWS Glue extensions API  | 
| REST path |  POST /extensions/v1/catalogs/{catalog}/namespaces/{ns}/tables/{table}/preplan  | 
| IAM action |  glue:GetTable  | 
| Lake Formation permissions | ALL, SELECT, DESCRIBE | 
| CloudTrail event |  glue:GetTable  | 
| Open API definition | https://github.com/awslabs/glue-extensions-for-iceberg/blob/main/glue-extensions-api.yaml#L211 | 

**Considerations**
+ The catalog path parameter must follow the style described in the [Prefix and catalog path parameters](connect-glu-iceberg-rest.md#prefix-catalog-path-parameters) section.
+ You can specify only a single-level namespace in the REST path parameter. For more information, see the [Namespace path parameter](connect-glu-iceberg-rest.md#ns-path-param) section.
+ Callers of this API should always determine whether there are remaining results to fetch based on the page token. A response with an empty page item but a pagination token is possible if the server side is still processing but could not produce any results within the given response time.
+ You can use this API only if the `ExtendedLoadTable` API response contains `aws.server-side-capabilities.scan-planning=true`. 

# PlanTable
<a name="plan-table"></a>


**General information**  

|  |  | 
| --- |--- |
| Operation name | PlanTable | 
| Type |  AWS Glue extensions API  | 
| REST path |  POST /extensions/v1/catalogs/{catalog}/namespaces/{ns}/tables/{table}/plan  | 
| IAM action |  glue:GetTable  | 
| Lake Formation permissions | ALL, SELECT, DESCRIBE | 
| CloudTrail event |  glue:GetTable  | 
| Open API definition | https://github.com/awslabs/glue-extensions-for-iceberg/blob/main/glue-extensions-api.yaml#L243 | 

**Considerations**
+ The catalog path parameter must follow the style described in the [Prefix and catalog path parameters](connect-glu-iceberg-rest.md#prefix-catalog-path-parameters) section.
+ You can specify only a single-level namespace in the REST path parameter. For more information, see the [Namespace path parameter](connect-glu-iceberg-rest.md#ns-path-param) section.
+ Callers of this API should always determine whether there are remaining results to fetch based on the page token. A response with an empty page item but a pagination token is possible if the server side is still processing but could not produce any results within the given response time.
+ You can use this API only if the `ExtendedLoadTable` API response contains `aws.server-side-capabilities.scan-planning=true`. 
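The pagination consideration above means a client must loop on the pagination token rather than the item count, because the service can return an empty page with a token while planning is still in progress. A minimal sketch of that loop follows; `fetch_page` and the `next-token` key stand in for a signed `PlanTable` request and its response shape, and are illustrative:

```python
# Sketch: paginate PreplanTable/PlanTable results correctly. An empty page
# with a token means "still working, keep fetching" - only a missing token
# ends the loop.
def collect_plan_items(fetch_page):
    items, token = [], None
    while True:
        page = fetch_page(token)          # e.g. {"items": [...], "next-token": ...}
        items.extend(page.get("items", []))
        token = page.get("next-token")
        if not token:                     # no token: planning is complete
            return items

# Example with canned responses, including an empty-but-not-done page:
responses = iter([
    {"items": ["task-1"], "next-token": "t1"},
    {"items": [], "next-token": "t2"},    # server still planning
    {"items": ["task-2"]},                # no token: done
])
print(collect_plan_items(lambda token: next(responses)))  # ['task-1', 'task-2']
```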

# TableExists
<a name="table-exists"></a>


**General information**  

|  |  | 
| --- |--- |
| Operation name | TableExists | 
| Type |  Iceberg REST Catalog API  | 
| REST path |  HEAD /iceberg/v1/catalogs/{catalog}/namespaces/{ns}/tables/{table}  | 
| IAM action |  glue:GetTable  | 
| Lake Formation permissions | ALL, SELECT, DESCRIBE | 
| CloudTrail event |  glue:GetTable  | 
| Open API definition | https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/open-api/rest-catalog-open-api.yaml#L833 | 

**Considerations**
+ The catalog path parameter must follow the style described in the [Prefix and catalog path parameters](connect-glu-iceberg-rest.md#prefix-catalog-path-parameters) section.
+ You can specify only a single-level namespace in the REST path parameter. For more information, see the [Namespace path parameter](connect-glu-iceberg-rest.md#ns-path-param) section.

# UpdateTable
<a name="update-table"></a>


**General information**  

|  |  | 
| --- |--- |
| Operation name | UpdateTable | 
| Type |  Iceberg REST Catalog API  | 
| REST path |  POST /iceberg/v1/catalogs/{catalog}/namespaces/{ns}/tables/{table}  | 
| IAM action |  glue:UpdateTable  | 
| Lake Formation permissions | ALL, ALTER | 
| CloudTrail event |  glue:UpdateTable  | 
| Open API definition | https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/open-api/rest-catalog-open-api.yaml#L677 | 

**Considerations**
+ The catalog path parameter must follow the style described in the [Prefix and catalog path parameters](connect-glu-iceberg-rest.md#prefix-catalog-path-parameters) section.
+ You can specify only a single-level namespace in the REST path parameter. For more information, see the [Namespace path parameter](connect-glu-iceberg-rest.md#ns-path-param) section.

# StartUpdateTableTransaction
<a name="start-update-table-transaction"></a>


**General information**  

|  |  | 
| --- |--- |
| Operation name | StartUpdateTableTransaction | 
| Type | AWS Glue extension API | 
| REST path |  POST /extensions/v1/catalogs/{catalog}/namespaces/{ns}/tables/{table}  | 
| IAM action |  glue:UpdateTable  | 
| Lake Formation permissions |  ALL, ALTER  | 
| CloudTrail event |  glue:UpdateTable  | 
| Open API definition | https://github.com/awslabs/glue-extensions-for-iceberg/blob/main/glue-extensions-api.yaml#L154 | 

**Considerations**
+ The catalog path parameter must follow the style described in the [Prefix and catalog path parameters](connect-glu-iceberg-rest.md#prefix-catalog-path-parameters) section.
+ You can specify only a single-level namespace in the REST path parameter. For more information, see the [Namespace path parameter](connect-glu-iceberg-rest.md#ns-path-param) section.
+ The API is asynchronous, and returns a transaction ID that you can track using the `CheckTransactionStatus` API call.
+ A `RenameTable` operation can also be performed through this API. When that happens, the caller must also have the `glue:CreateTable` IAM permission or the Lake Formation CREATE_TABLE permission on the table that it is renamed to. 
+ You can use this API only if the `GetCatalog` API response indicates `use-extensions=true`. 

# DeleteTable
<a name="delete-table"></a>


**General information**  

|  |  | 
| --- |--- |
| Operation name | DeleteTable | 
| Type |  Iceberg REST Catalog API  | 
| REST path |  DELETE /iceberg/v1/catalogs/{catalog}/namespaces/{ns}/tables/{table}  | 
| IAM action |  glue:DeleteTable  | 
| Lake Formation permissions | ALL, DROP | 
| CloudTrail event |  glue:DeleteTable  | 
| Open API definition | https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/open-api/rest-catalog-open-api.yaml#L793 | 

**Considerations**
+ The catalog path parameter must follow the style described in the [Prefix and catalog path parameters](connect-glu-iceberg-rest.md#prefix-catalog-path-parameters) section.
+ You can specify only a single-level namespace in the REST path parameter. For more information, see the [Namespace path parameter](connect-glu-iceberg-rest.md#ns-path-param) section.
+ The `DeleteTable` API operation supports a purge option. For tables stored in Amazon S3, the operation does not delete table data, and it fails when `purge = TRUE`.

  For tables stored in Amazon Redshift managed storage, the operation deletes table data, similar to `DROP TABLE` behavior in Amazon Redshift. The operation fails when the table is stored in Amazon Redshift and `purge = FALSE`.
+ `purgeRequest=true` is not supported. 
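The purge rules above reduce to a small validity matrix, sketched here for the two storage types discussed; the function is an illustration of the rules, not an AWS API:

```python
# Sketch: which DeleteTable purge settings succeed, per the rules above.
# Amazon S3 tables: data is never deleted, so purge=True fails.
# Redshift managed storage (RMS) tables: data is always deleted, so purge=False fails.
def delete_table_allowed(storage: str, purge: bool) -> bool:
    if storage == "s3":
        return not purge
    if storage == "rms":
        return purge
    raise ValueError(f"unknown storage type: {storage}")

print(delete_table_allowed("s3", purge=False))   # True - metadata-only delete
print(delete_table_allowed("s3", purge=True))    # False - purge unsupported on S3
print(delete_table_allowed("rms", purge=True))   # True - behaves like DROP TABLE
```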

# StartDeleteTableTransaction
<a name="start-delete-table-transaction"></a>


**General information**  

|  |  | 
| --- |--- |
| Operation name | StartDeleteTableTransaction | 
| Type |  AWS Glue extensions API  | 
| REST path |  DELETE /extensions/v1/catalogs/{catalog}/namespaces/{ns}/tables/{table}  | 
| IAM action |  glue:DeleteTable  | 
| Lake Formation permissions | ALL, DROP | 
| CloudTrail event |  glue:DeleteTable  | 
| Open API definition | https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/open-api/rest-catalog-open-api.yaml#L793 | 

**Considerations**
+ The catalog path parameter must follow the style described in the [Prefix and catalog path parameters](connect-glu-iceberg-rest.md#prefix-catalog-path-parameters) section.
+ You can specify only a single-level namespace in the REST path parameter. For more information, see the [Namespace path parameter](connect-glu-iceberg-rest.md#ns-path-param) section.
+ `purgeRequest=false` is not supported. 
+  The API is asynchronous, and returns a transaction ID that can be tracked through `CheckTransactionStatus`. 

# CheckTransactionStatus
<a name="check-transaction-status"></a>


**General information**  

|  |  | 
| --- |--- |
| Operation name | CheckTransactionStatus | 
| Type |  AWS Glue extensions API  | 
| REST path |  POST /extensions/v1/transactions/status  | 
| IAM action |  The same permission as the action that initiates the transaction  | 
| Lake Formation permissions | The same permission as the action that initiates the transaction | 
| Open API definition | https://github.com/awslabs/glue-extensions-for-iceberg/blob/main/glue-extensions-api.yaml#L273 | 

**Considerations**
+ The catalog path parameter must follow the style described in the [Prefix and catalog path parameters](connect-glu-iceberg-rest.md#prefix-catalog-path-parameters) section.
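Several of the extension APIs above are asynchronous and return a transaction ID that a client then polls through `CheckTransactionStatus`. A minimal polling sketch follows; `check_status` stands in for a signed `POST /extensions/v1/transactions/status` request, and the status strings are illustrative:

```python
import time

# Sketch: poll an asynchronous transaction until it reaches a terminal state.
# `check_status` is a stand-in for a signed CheckTransactionStatus request;
# the status values used here are illustrative.
def wait_for_transaction(check_status, transaction_id, interval=0.0, timeout=60.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = check_status(transaction_id)
        if status in ("FINISHED", "FAILED"):
            return status
        time.sleep(interval)  # pause between polls
    raise TimeoutError(f"transaction {transaction_id} did not complete in {timeout}s")

# Example with a canned status sequence:
statuses = iter(["STARTED", "STARTED", "FINISHED"])
print(wait_for_transaction(lambda tx: next(statuses), "tx-123"))  # FINISHED
```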

# Connecting to Data Catalog from a standalone Spark application
<a name="connect-gludc-spark"></a>

You can connect to the Data Catalog from a standalone Spark application using an Apache Iceberg connector. 

1. Create an IAM role for the Spark application.

1. Connect to the AWS Glue Iceberg REST endpoint using the Iceberg connector.

   ```
   # configure your application. Refer to https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html for best practices on configuring environment variables.
   export AWS_ACCESS_KEY_ID=$(aws configure get appUser.aws_access_key_id)
   export AWS_SECRET_ACCESS_KEY=$(aws configure get appUser.aws_secret_access_key)
   export AWS_SESSION_TOKEN=$(aws configure get appUser.aws_session_token)
   
   export AWS_REGION=us-east-1
   export REGION=us-east-1
   export AWS_ACCOUNT_ID={your-aws-account-id}
   
   ~/spark-3.5.3-bin-hadoop3/bin/spark-shell \
       --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.0 \
       --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
       --conf "spark.sql.defaultCatalog=spark_catalog" \
       --conf "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog" \
       --conf "spark.sql.catalog.spark_catalog.type=rest" \
       --conf "spark.sql.catalog.spark_catalog.uri=https://glue.us-east-1.amazonaws.com/iceberg" \
       --conf "spark.sql.catalog.spark_catalog.warehouse=$AWS_ACCOUNT_ID" \
       --conf "spark.sql.catalog.spark_catalog.rest.sigv4-enabled=true" \
       --conf "spark.sql.catalog.spark_catalog.rest.signing-name=glue" \
       --conf "spark.sql.catalog.spark_catalog.rest.signing-region=us-east-1" \
       --conf "spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO" \
       --conf "spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
   ```

1. Query data in the Data Catalog.

   ```
   spark.sql("create database myicebergdb").show()
   spark.sql("""CREATE TABLE myicebergdb.mytbl (name string) USING iceberg location 's3://bucket_name/mytbl'""")
   spark.sql("insert into myicebergdb.mytbl values('demo') ").show()
   ```

# Data mapping between Amazon Redshift and Apache Iceberg
<a name="data-mapping-rs-iceberg"></a>

Amazon Redshift and Apache Iceberg support various data types. The following compatibility matrix outlines the support and limitations when mapping data between these two data systems. For more details on the data types supported in each system, see [Amazon Redshift Data Types](https://docs.aws.amazon.com/redshift/latest/dg/c_Supported_data_types.html) and [Apache Iceberg Table Specifications](https://iceberg.apache.org/spec/#primitive-types).


| Redshift data type | Aliases | Iceberg data type | 
| --- | --- | --- | 
| SMALLINT | INT2 | int | 
| INTEGER | INT, INT4 | int | 
| BIGINT | INT8 | long | 
| DECIMAL | NUMERIC | decimal | 
| REAL | FLOAT4 | float | 
| DOUBLE PRECISION | FLOAT8, FLOAT | double | 
| CHAR | CHARACTER, NCHAR | string | 
| VARCHAR | CHARACTER VARYING, NVARCHAR | string | 
| BPCHAR |  | string | 
| TEXT |  | string | 
| DATE |  | date | 
| TIME | TIME WITHOUT TIMEZONE | time | 
| TIME | TIME WITH TIMEZONE | Not supported | 
| TIMESTAMP | TIMESTAMP WITHOUT TIMEZONE | timestamp | 
| TIMESTAMPTZ | TIMESTAMP WITH TIMEZONE | timestamptz | 
| INTERVAL YEAR TO MONTH |  | Not supported | 
| INTERVAL DAY TO SECOND |  | Not supported | 
| BOOLEAN | BOOL | boolean | 
| HLLSKETCH |  | Not supported | 
| SUPER |  | Not supported | 
| VARBYTE | VARBINARY, BINARY VARYING | binary | 
| GEOMETRY |  | Not supported | 
| GEOGRAPHY |  | Not supported | 
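
The matrix above can be expressed as a simple lookup, sketched below with a few representative rows. The dictionary and function names are illustrative; a real integration would rely on the catalog's own type handling rather than a hand-rolled mapping.

```
# Abbreviated, illustrative mapping based on the matrix above.
# None marks a Redshift type with no supported Iceberg equivalent.
REDSHIFT_TO_ICEBERG = {
    "SMALLINT": "int",
    "INTEGER": "int",
    "BIGINT": "long",
    "DECIMAL": "decimal",
    "REAL": "float",
    "DOUBLE PRECISION": "double",
    "VARCHAR": "string",
    "DATE": "date",
    "BOOLEAN": "boolean",
    "VARBYTE": "binary",
    "SUPER": None,
    "GEOMETRY": None,
}

def to_iceberg_type(redshift_type: str) -> str:
    iceberg = REDSHIFT_TO_ICEBERG.get(redshift_type.upper())
    if iceberg is None:
        raise ValueError(f"No Iceberg mapping for Redshift type {redshift_type}")
    return iceberg
```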

# Considerations and limitations when using AWS Glue Iceberg REST Catalog APIs
<a name="limitation-glue-iceberg-rest-api"></a>

The following considerations and limitations apply to Apache Iceberg REST Catalog Data Definition Language (DDL) operations.

**Considerations**
+  **`RenameTable` API behavior** – The `RenameTable` operation is supported for tables in Amazon Redshift, but not for tables in Amazon S3. 
+  **DDL operations for namespaces and tables in Amazon Redshift** – Create, update, and delete operations for namespaces and tables in Amazon Redshift are asynchronous because they depend on when the Amazon Redshift managed workgroup is available and on whether a conflicting DDL or DML transaction is in progress, in which case the operation must wait for a lock before attempting to commit its changes. 

**Limitations**
+  View APIs in the Apache Iceberg REST specification are not supported in AWS Glue Iceberg REST Catalog. 

# AWS Glue Data Catalog best practices
<a name="best-practice-catalog"></a>

 This section covers best practices for effectively managing and utilizing the AWS Glue Data Catalog. It emphasizes practices such as efficient crawler usage, metadata organization, security, performance optimization, automation, data governance, and integration with other AWS services. 
+ **Use crawlers effectively** – Run crawlers regularly to keep the Data Catalog up-to-date with changes in your data sources. Use incremental crawls for frequently changing data sources to improve performance. Configure crawlers to automatically add new partitions or update schemas when changes are detected. 
+ **Organize and name metadata tables** – Establish a consistent naming convention for databases and tables in the Data Catalog. Group related data sources into logical databases or folders for better organization. Use descriptive names that convey the purpose and content of each table. 
+ **Manage schemas effectively** – Take advantage of the schema inference capabilities of AWS Glue crawlers. Review and update schema changes before applying them to avoid breaking downstream applications. Use schema evolution features to handle schema changes gracefully. 
+ **Secure the Data Catalog** – Enable data encryption at rest and in transit for the Data Catalog. Implement fine-grained access control policies to restrict access to sensitive data. Regularly audit and review Data Catalog permissions and activity logs. 
+ **Integrate with other AWS services** – Use the Data Catalog as a centralized metadata layer for services like Amazon Athena, Redshift Spectrum, and AWS Lake Formation. Leverage AWS Glue ETL jobs to transform and load data into various data stores while maintaining metadata in the Data Catalog. 
+  **Monitor and optimize performance** – Monitor the performance of crawlers and ETL jobs using Amazon CloudWatch metrics. Partition large datasets in the Data Catalog to improve query performance. Implement performance optimizations for frequently accessed metadata. 
+  **Stay updated with AWS Glue documentation and best practices** – Regularly check the AWS Glue documentation and AWS Glue resources for the latest updates, best practices, and recommendations. Attend AWS Glue webinars, workshops, and other events to learn from experts and stay informed about new features and capabilities. 

# Monitoring Data Catalog usage metrics in Amazon CloudWatch
<a name="data-catalog-cloudwatch-metrics"></a>

AWS Glue Data Catalog usage metrics are now available in Amazon CloudWatch, simplifying the monitoring and understanding of resource utilization in your Data Catalog. You have immediate visibility into your Data Catalog API usage for catalogs, databases, tables, partitions, and connections, making it easier to maintain oversight of your Data Catalog.

## Overview of Data Catalog metrics
<a name="data-catalog-metrics-overview"></a>

AWS Glue Data Catalog automatically publishes usage metrics to Amazon CloudWatch. With CloudWatch metrics integration, you can track critical performance indicators every minute, including:
+ Table requests
+ Partition indexes created
+ Connections updated
+ Statistics updated

These metrics help you identify bottlenecks, detect anomalies, and make data-driven decisions to improve overall data catalog reliability. You can also set up CloudWatch alarms to receive notifications when metrics exceed specified thresholds, allowing for proactive management of your deployment. 

## Adding metrics to your CloudWatch dashboard
<a name="glue-data-catalog-metrics-dashboard"></a>

You can create custom dashboards to monitor your AWS Glue Data Catalog resources and set up alarms to be notified of any unusual activity.

You can add Data Catalog metrics to your CloudWatch dashboard by following these steps:

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Metrics**.

1. Choose **All metrics**.

1. Choose **Usage > By AWS resource**.

1. Filter by **Glue** to see available metrics.

1. Select the metrics you want to add to your dashboard.

1. Add metrics for catalogs, databases, tables, partitions, and connections to your CloudWatch graph.  
![\[AWS Glue Data Catalog metrics in CloudWatch dashboard\]](http://docs.aws.amazon.com/glue/latest/dg/images/glue-cloudwatch-metrics.png)

You can configure custom alarms that trigger automatically when API usage exceeds your defined thresholds, helping you identify anomalies in your Data Catalog usage.

For detailed instructions on setting up alarms, see [Creating a Metrics Insights CloudWatch alarm](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-metrics-insights-alarm-create.html).
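
As an illustration, Data Catalog API usage can be charted or alarmed on with a Metrics Insights query like the following. The `AWS/Usage` namespace, `CallCount` metric, and dimension names follow CloudWatch's usage-metric conventions; verify the exact names against the metrics visible in your account.

```
-- Total AWS Glue API calls per resource over the selected time range.
SELECT SUM(CallCount)
FROM SCHEMA("AWS/Usage", Class, Resource, Service, Type)
WHERE Service = 'Glue'
GROUP BY Resource
```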

# AWS Glue Schema registry
<a name="schema-registry"></a>

**Note**  
AWS Glue Schema Registry is not supported in the following Regions in the AWS Glue console: Middle East (UAE).

The AWS Glue Schema registry allows you to centrally discover, control, and evolve data stream schemas. A *schema* defines the structure and format of a data record. With AWS Glue Schema registry, you can manage and enforce schemas on your data streaming applications using convenient integrations with Apache Kafka, [Amazon Managed Streaming for Apache Kafka](https://aws.amazon.com/msk/), [Amazon Kinesis Data Streams](https://aws.amazon.com/kinesis/data-streams/), [Amazon Managed Service for Apache Flink](https://aws.amazon.com/kinesis/data-analytics/), and [AWS Lambda](https://aws.amazon.com/lambda/).

The Schema registry supports the AVRO data format (v1.11.4); the JSON data format, with [JSON Schema format](https://json-schema.org/) for the schema (specifications Draft-04, Draft-06, and Draft-07) and JSON schema validation using the [Everit library](https://github.com/everit-org/json-schema); and Protocol Buffers (Protobuf) versions proto2 and proto3, without support for `extensions` or `groups`. It offers Java language support, with other data formats and languages to come. Supported features include compatibility, schema sourcing via metadata, auto-registration of schemas, IAM compatibility, and optional ZLIB compression to reduce storage and data transfer. The Schema registry is serverless and free to use.

Using a schema as a data format contract between producers and consumers leads to improved data governance, higher quality data, and enables data consumers to be resilient to compatible upstream changes.

The Schema registry allows disparate systems to share a schema for serialization and de-serialization. For example, assume you have a producer and consumer of data. The producer knows the schema when it publishes the data. The Schema Registry supplies a serializer and deserializer for certain systems such as Amazon MSK or Apache Kafka. 

 For more information, see [How the schema registry works](schema-registry-works.md).

**Topics**
+ [Schemas](#schema-registry-schemas)
+ [Registries](#schema-registry-registries)
+ [Schema versioning and compatibility](#schema-registry-compatibility)
+ [Open source Serde libraries](#schema-registry-serde-libraries)
+ [Quotas of the Schema Registry](#schema-registry-quotas)
+ [How the schema registry works](schema-registry-works.md)
+ [Getting started with schema registry](schema-registry-gs.md)

## Schemas
<a name="schema-registry-schemas"></a>

A *schema* defines the structure and format of a data record. A schema is a versioned specification for reliable data publication, consumption, or storage.

In this example Avro schema, the format and structure are defined by the layout and field names, and each field's format is defined by its data type (for example, `string` or `int`).

```
{
    "type": "record",
    "namespace": "ABC_Organization",
    "name": "Employee",
    "fields": [
        {
            "name": "Name",
            "type": "string"
        },
        {
            "name": "Age",
            "type": "int"
        },
        {
            "name": "address",
            "type": {
                "type": "record",
                "name": "addressRecord",
                "fields": [
                    {
                        "name": "street",
                        "type": "string"
                    },
                    {
                        "name": "zipcode",
                        "type": "int" 
                    }
                ]
            }
        }
    ]
}
```

In this example JSON schema draft-07 for JSON, the format is defined by the [JSON Schema organization](https://json-schema.org/).

```
{
	"$id": "https://example.com/person.schema.json",
	"$schema": "http://json-schema.org/draft-07/schema#",
	"title": "Person",
	"type": "object",
	"properties": {
		"firstName": {
			"type": "string",
			"description": "The person's first name."
		},
		"lastName": {
			"type": "string",
			"description": "The person's last name."
		},
		"age": {
			"description": "Age in years which must be equal to or greater than zero.",
			"type": "integer",
			"minimum": 0
		}
	}
}
```

In this example for Protobuf, the format is defined by [version 2 of the Protocol Buffers language (proto2)](https://developers.google.com/protocol-buffers/docs/reference/proto2-spec).

```
syntax = "proto2";

package tutorial;

option java_multiple_files = true;
option java_package = "com.example.tutorial.protos";
option java_outer_classname = "AddressBookProtos";

message Person {
  optional string name = 1;
  optional int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    optional string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phones = 4;
}

message AddressBook {
  repeated Person people = 1;
}
```

## Registries
<a name="schema-registry-registries"></a>

A *registry* is a logical container of schemas. Registries allow you to organize your schemas, as well as manage access control for your applications. A registry has an Amazon Resource Name (ARN) to allow you to organize and set different access permissions to schema operations within the registry.

You may use the default registry or create as many new registries as necessary.


**AWS Glue Schema Registry Hierarchy**  

|  | 
| --- |
|  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/schema-registry.html)  | 

## Schema versioning and compatibility
<a name="schema-registry-compatibility"></a>

Each schema can have multiple versions. Versioning is governed by a compatibility rule that is applied on a schema. Requests to register new schema versions are checked against this rule by the Schema Registry before they can succeed. 

A schema version that is marked as a checkpoint is used to determine the compatibility of registering new versions of a schema. When a schema is first created, the default checkpoint is the first version. As the schema evolves with more versions, you can use the CLI/SDK to move the checkpoint to a version of the schema that adheres to a set of constraints, using the `UpdateSchema` API. In the console, editing the schema definition or compatibility mode changes the checkpoint to the latest version by default. 

Compatibility modes allow you to control how schemas can or cannot evolve over time. These modes form the contract between applications producing and consuming data. When a new version of a schema is submitted to the registry, the compatibility rule applied to the schema name is used to determine if the new version can be accepted. There are eight compatibility modes: NONE, DISABLED, BACKWARD, BACKWARD_ALL, FORWARD, FORWARD_ALL, FULL, FULL_ALL.

In the Avro data format, fields may be optional or required. An optional field is one in which the `Type` includes null. Required fields do not have null as the `Type`.

In the Protobuf data format, fields can be optional (including repeated) or required in proto2 syntax, while all fields are optional (including repeated) in proto3 syntax. All compatibility rules are determined based on the understanding of the Protocol Buffers specifications as well as the guidance from the [Google Protocol Buffers documentation](https://developers.google.com/protocol-buffers/docs/overview#updating).
+ *NONE*: No compatibility mode applies. You can use this choice in development scenarios or if you do not know the compatibility modes that you want to apply to schemas. Any new version added will be accepted without undergoing a compatibility check.
+ *DISABLED*: This compatibility choice prevents versioning for a particular schema. No new versions can be added.
+ *BACKWARD*: This compatibility choice is recommended because it allows consumers to read both the current and the previous schema version. You can use this choice to check compatibility against the previous schema version when you delete fields or add optional fields. A typical use case for BACKWARD is when your application has been created for the most recent schema.

**AVRO**  
For example, assume you have a schema defined by first name (required), last name (required), email (required), and phone number (optional).

  If your next schema version removes the required email field, this would successfully register. BACKWARD compatibility requires consumers to be able to read the current and previous schema version. Your consumers will be able to read the new schema as the extra email field from old messages is ignored.

  If you have a proposed new schema version that adds a required field, for example, zip code, this would not successfully register with BACKWARD compatibility. Your consumers on the new version would not be able to read old messages before the schema change, as they are missing the required zip code field. However, if the zip code field was set as optional in the new schema, then the proposed version would successfully register as consumers can read the old schema without the optional zip code field.

**JSON**  
For example, assume you have a schema version defined by first name (optional), last name (optional), email (optional) and phone number (optional).

  If your next schema version adds the optional phone number property, this would successfully register as long as the original schema version does not allow any additional properties by setting the `additionalProperties` field to false. BACKWARD compatibility requires consumers to be able to read the current and previous schema version. Your consumers will be able to read data produced with the original schema where phone number property does not exist.

  If you have a proposed new schema version that adds the optional phone number property, this would not successfully register with BACKWARD compatibility when the original schema version sets the `additionalProperties` field to true, namely allowing any additional property. Your consumers on the new version would not be able to read old messages before the schema change, as they cannot read data with phone number property in a different type, for example string instead of number.

**PROTOBUF**  
For example, assume you have a schema version defined by a Message `Person` with `first name` (required), `last name` (required), `email` (required), and `phone number` (optional) fields under proto2 syntax.

  Similar to AVRO scenarios, if your next schema version removes the required `email` field, this would successfully register. BACKWARD compatibility requires consumers to be able to read the current and previous schema version. Your consumers will be able to read the new schema as the extra `email` field from old messages is ignored.

  If you have a proposed new schema version that adds a required field, for example, `zip code`, this would not successfully register with BACKWARD compatibility. Your consumers on the new version would not be able to read old messages before the schema change, as they are missing the required `zip code` field. However, if the `zip code` field was set as optional in the new schema, then the proposed version would successfully register as consumers can read the old schema without the optional `zip code` field.

  In case of a gRPC use case, adding new RPC service or RPC method is a backward compatible change. For example, assume you have a schema version defined by an RPC service `MyService` with two RPC methods `Foo` and `Bar`.

  If your next schema version adds a new RPC method called `Baz`, this would successfully register. Your consumers will be able to read data produced with the original schema according to BACKWARD compatibility since the newly added RPC method `Baz` is optional. 

  If you have a proposed new schema version that removes the existing RPC method `Foo`, this would not successfully register with BACKWARD compatibility. Your consumers on the new version would not be able to read old messages before the schema change, as they cannot understand and read data with the non-existent RPC method `Foo` in a gRPC application.
+ *BACKWARD_ALL*: This compatibility choice allows consumers to read both the current and all previous schema versions. You can use this choice to check compatibility against all previous schema versions when you delete fields or add optional fields.
+ *FORWARD*: This compatibility choice allows consumers to read both the current and the subsequent schema versions, but not necessarily later versions. You can use this choice to check compatibility against the last schema version when you add fields or delete optional fields. A typical use case for FORWARD is when your application has been created for a previous schema and should be able to process a more recent schema.

**AVRO**  
For example, assume you have a schema version defined by first name (required), last name (required), email (optional).

  If you have a new schema version that adds a required field, e.g. phone number, this would successfully register. FORWARD compatibility requires consumers to be able to read data produced with the new schema by using the previous version.

  If you have a proposed schema version that deletes the required first name field, this would not successfully register with FORWARD compatibility. Your consumers on the prior version would not be able to read the proposed schemas as they are missing the required first name field. However, if the first name field was originally optional, then the proposed new schema would successfully register as the consumers can read data based on the new schema that doesn’t have the optional first name field.

**JSON**  
For example, assume you have a schema version defined by first name (optional), last name (optional), email (optional) and phone number (optional).

  If you have a new schema version that removes the optional phone number property, this would successfully register as long as the new schema version does not allow any additional properties by setting the `additionalProperties` field to false. FORWARD compatibility requires consumers to be able to read data produced with the new schema by using the previous version.

  If you have a proposed schema version that deletes the optional phone number property, this would not successfully register with FORWARD compatibility when the new schema version sets the `additionalProperties` field to true, namely allowing any additional property. Your consumers on the prior version would not be able to read the proposed schemas as they could have phone number property in a different type, for example string instead of number.

**PROTOBUF**  
For example, assume you have a schema version defined by a Message `Person` with `first name` (required), `last name` (required), `email` (optional) fields under proto2 syntax.

  Similar to AVRO scenarios, if you have a new schema version that adds a required field, e.g. `phone number`, this would successfully register. FORWARD compatibility requires consumers to be able to read data produced with the new schema by using the previous version.

  If you have a proposed schema version that deletes the required `first name` field, this would not successfully register with FORWARD compatibility. Your consumers on the prior version would not be able to read the proposed schemas as they are missing the required `first name` field. However, if the `first name` field was originally optional, then the proposed new schema would successfully register as the consumers can read data based on the new schema that doesn’t have the optional `first name` field.

  In case of a gRPC use case, removing an RPC service or RPC method is a forward-compatible change. For example, assume you have a schema version defined by an RPC service `MyService` with two RPC methods `Foo` and `Bar`. 

  If your next schema version deletes the existing RPC method named `Foo`, this would successfully register according to FORWARD compatibility as the consumers can read data produced with the new schema by using the previous version. If you have a proposed new schema version that adds an RPC method `Baz`, this would not successfully register with FORWARD compatibility. Your consumers on the prior version would not be able to read the proposed schemas as they are missing the RPC method `Baz`.
+ *FORWARD_ALL*: This compatibility choice allows consumers to read data written by producers of any new registered schema. You can use this choice when you need to add fields or delete optional fields, and check compatibility against all previous schema versions.
+ *FULL*: This compatibility choice allows consumers to read data written by producers using the previous or next version of the schema, but not earlier or later versions. You can use this choice to check compatibility against the last schema version when you add or remove optional fields.
+ *FULL_ALL*: This compatibility choice allows consumers to read data written by producers using all previous schema versions. You can use this choice to check compatibility against all previous schema versions when you add or remove optional fields.
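
The BACKWARD examples above can be illustrated with a toy checker for flat Avro-style schemas: a new version may delete fields or add optional fields, but may not add required ones. This is a deliberate simplification for illustration, not the Schema Registry's actual compatibility algorithm.

```
# Toy BACKWARD-compatibility check for flat Avro-style field lists.
# In Avro, an optional field is one whose type union includes "null".

def is_optional(field):
    t = field["type"]
    return isinstance(t, list) and "null" in t

def backward_compatible(old_fields, new_fields):
    old_names = {f["name"] for f in old_fields}
    # Every field added in the new version must be optional;
    # deleting fields is always allowed under BACKWARD.
    return all(
        is_optional(f) for f in new_fields if f["name"] not in old_names
    )

v1 = [
    {"name": "first_name", "type": "string"},
    {"name": "email", "type": "string"},
]
# Deleting the required email field: accepted under BACKWARD.
v2 = [{"name": "first_name", "type": "string"}]
# Adding a required zip_code field: rejected under BACKWARD.
v3 = v1 + [{"name": "zip_code", "type": "int"}]
# Adding an optional zip_code field: accepted under BACKWARD.
v4 = v1 + [{"name": "zip_code", "type": ["null", "int"]}]
```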

## Open source Serde libraries
<a name="schema-registry-serde-libraries"></a>

AWS provides open-source Serde libraries as a framework for serializing and deserializing data. The open source design of these libraries allows common open-source applications and frameworks to support these libraries in their projects.

For more details on how the Serde libraries work, see [How the schema registry works](schema-registry-works.md).

## Quotas of the Schema Registry
<a name="schema-registry-quotas"></a>

Quotas, also referred to as limits in AWS, are the maximum values for the resources, actions, and items in your AWS account. The following are soft limits for the Schema Registry in AWS Glue.

**Schema version metadata key-value pairs**  
You can have up to 10 key-value pairs per SchemaVersion per AWS Region.

You can view or set the key-value metadata pairs using the [QuerySchemaVersionMetadata action (Python: query_schema_version_metadata)](aws-glue-api-schema-registry-api.md#aws-glue-api-schema-registry-api-QuerySchemaVersionMetadata) or [PutSchemaVersionMetadata action (Python: put_schema_version_metadata)](aws-glue-api-schema-registry-api.md#aws-glue-api-schema-registry-api-PutSchemaVersionMetadata) APIs.

The following are hard limits for the Schema Registry in AWS Glue.

**Registries**  
You can have up to 100 registries per AWS Region for this account.

**SchemaVersion**  
You can have up to 10,000 schema versions per AWS Region for this account.

Each new schema creates a new schema version, so you can theoretically have up to 10,000 schemas per AWS Region for this account, if each schema has only one version.

**Schema payloads**  
There is a size limit of 170 KB for schema payloads.

# How the schema registry works
<a name="schema-registry-works"></a>

This section describes how the serialization and deserialization processes in the Schema Registry work.

1. Register a schema: If the schema doesn’t already exist in the registry, the schema can be registered with a schema name equal to the name of the destination (e.g., test_topic, test_stream, prod_firehose) or the producer can provide a custom name for the schema. Producers can also add key-value pairs to the schema as metadata, such as source: msk_kafka_topic_A, or apply AWS tags to schemas on schema creation. Once a schema is registered, the Schema Registry returns the schema version ID to the serializer. If the schema exists but the serializer is using a new version that doesn’t exist, the Schema Registry checks the new version against the schema’s compatibility rule to ensure it is compatible before registering it as a new version.

   There are two methods of registering a schema: manual registration and auto-registration. You can register a schema manually via the AWS Glue console or CLI/SDK.

   When auto-registration is turned on in the serializer settings, automatic registration of the schema will be performed. If `REGISTRY_NAME` is not provided in the producer configurations, then auto-registration will register the new schema version under the default registry (default-registry). See [Installing SerDe Libraries](schema-registry-gs-serde.md) for information on specifying the auto-registration property.

1. Serializer validates data records against the schema: When the application producing data has registered its schema, the Schema Registry serializer validates the record being produced by the application is structured with the fields and data types matching a registered schema. If the schema of the record does not match a registered schema, the serializer will return an exception and the application will fail to deliver the record to the destination. 

   If no schema exists and if the schema name is not provided via the producer configurations, then the schema is created with the same name as the topic name (if Apache Kafka or Amazon MSK) or stream name (if Kinesis Data Streams).

   Every record has a schema definition and data. The schema definition is queried against the existing schemas and versions in the Schema Registry.

   By default, producers cache schema definitions and schema version IDs of registered schemas. If a record’s schema version definition does not match what’s available in cache, the producer will attempt to validate the schema with the Schema Registry. If the schema version is valid, then its version ID and definition will be cached locally on the producer.

   You can adjust the default cache period (24 hours) within the optional producer properties in step 3 of [Installing SerDe Libraries](schema-registry-gs-serde.md).

1. Serialize and deliver records: If the record complies with the schema, the serializer decorates each record with the schema version ID, serializes the record based on the data format selected (AVRO, JSON, Protobuf, or other formats coming soon), compresses the record (optional producer configuration), and delivers it to the destination.

1. Consumers deserialize the data: Consumers reading this data use the Schema Registry deserializer library that parses the schema version ID from the record payload.

1. Deserializer may request the schema from the Schema Registry: If this is the first time the deserializer has seen records with a particular schema version ID, using the schema version ID the deserializer will request the schema from the Schema Registry and cache the schema locally on the consumer. If the Schema Registry cannot deserialize the record, the consumer can log the data from the record and move on, or halt the application.

1. The deserializer uses the schema to deserialize the record: When the deserializer retrieves the schema from the Schema Registry, it decompresses the record (if the producer compressed it) and uses the schema to deserialize the record. The application then processes the record.
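
Conceptually, steps 1 and 2 above boil down to a cache-aside lookup keyed on the schema definition. The following sketch models that behavior in plain Java; `FakeRegistry`, `SchemaCacheSketch`, and `versionIdFor` are hypothetical names for illustration, not the actual SerDe library API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class SchemaCacheSketch {

    /** Hypothetical stand-in for the remote Schema Registry service. */
    public static class FakeRegistry {
        private final Map<String, UUID> versions = new HashMap<>();
        public int lookups = 0;

        public UUID getOrRegister(String schemaDefinition) {
            lookups++;
            return versions.computeIfAbsent(schemaDefinition, d -> UUID.randomUUID());
        }
    }

    private final FakeRegistry registry;
    private final Map<String, UUID> cache = new HashMap<>();

    public SchemaCacheSketch(FakeRegistry registry) {
        this.registry = registry;
    }

    /** Resolve a schema definition to its version ID, contacting the registry only on a cache miss. */
    public UUID versionIdFor(String schemaDefinition) {
        return cache.computeIfAbsent(schemaDefinition, registry::getOrRegister);
    }

    public static void main(String[] args) {
        FakeRegistry registry = new FakeRegistry();
        SchemaCacheSketch serializer = new SchemaCacheSketch(registry);

        String schema = "{\"type\": \"record\", \"name\": \"r1\"}";
        UUID first = serializer.versionIdFor(schema);   // cache miss: registry consulted
        UUID second = serializer.versionIdFor(schema);  // cache hit: no registry call

        // Same definition resolves to the same version ID; the registry saw one lookup.
        System.out.println(first.equals(second) && registry.lookups == 1);
    }
}
```

Unlike this sketch, the real producer cache also expires entries after the configured cache period (24 hours by default).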

**Note**  
Encryption: Your clients communicate with the Schema Registry via API calls which encrypt data in-transit using TLS encryption over HTTPS. Schemas stored in the Schema Registry are always encrypted at rest using a service-managed AWS Key Management Service (AWS KMS) key.

**Note**  
User Authorization: The Schema Registry supports identity-based IAM policies.

# Getting started with schema registry
<a name="schema-registry-gs"></a>

The following sections provide an overview and walk you through setting up and using Schema Registry. For information about schema registry concepts and components, see [AWS Glue Schema registry](schema-registry.md).

**Topics**
+ [Installing SerDe Libraries](schema-registry-gs-serde.md)
+ [Integrating with AWS Glue Schema Registry](schema-registry-integrations.md)
+ [Migration from a third-party schema registry to AWS Glue Schema Registry](schema-registry-integrations-migration.md)

# Installing SerDe Libraries
<a name="schema-registry-gs-serde"></a>

The SerDe libraries provide a framework for serializing and deserializing data. 

You will install the open source serializer for your applications producing data (collectively the "serializers"). The serializer handles serialization, compression, and the interaction with the Schema Registry. The serializer automatically extracts the schema from a record being written to a Schema Registry compatible destination, such as Amazon MSK. Likewise, you will install the open source deserializer on your applications consuming data.

# Java Implementation
<a name="schema-registry-gs-serde-java"></a>

**Note**  
Prerequisites: Before completing the following steps, you need a running Amazon Managed Streaming for Apache Kafka (Amazon MSK) or Apache Kafka cluster. Your producers and consumers must run on Java 8 or above.

To install the libraries on producers and consumers:

1. Inside both the producers’ and consumers’ pom.xml files, add the following dependency:

   ```
   <dependency>
       <groupId>software.amazon.glue</groupId>
       <artifactId>schema-registry-serde</artifactId>
       <version>1.1.5</version>
   </dependency>
   ```

   Alternatively, you can clone the [AWS Glue Schema Registry Github repository](https://github.com/awslabs/aws-glue-schema-registry).

1. Setup your producers with these required properties:

   ```
   props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName()); // Can replace StringSerializer.class.getName() with any other key serializer that you may use
   props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, GlueSchemaRegistryKafkaSerializer.class.getName());
   props.put(AWSSchemaRegistryConstants.AWS_REGION, "us-east-2");
   props.put(AWSSchemaRegistryConstants.DATA_FORMAT, "JSON"); // OR "AVRO"
   ```

   If there are no existing schemas, then auto-registration needs to be turned on (next step). If you already have a schema that you would like to apply, replace "my-schema" in the next step with your schema name. The "registry-name" must also be provided if schema auto-registration is off; if the schema is created under the "default-registry", the registry name can be omitted.

1. (Optional) Set any of these optional producer properties. For detailed property descriptions, see [the ReadMe file](https://github.com/awslabs/aws-glue-schema-registry/blob/master/README.md).

   ```
   props.put(AWSSchemaRegistryConstants.SCHEMA_AUTO_REGISTRATION_SETTING, "true"); // If not passed, uses "false"
   props.put(AWSSchemaRegistryConstants.SCHEMA_NAME, "my-schema"); // If not passed, uses transport name (topic name in case of Kafka, or stream name in case of Kinesis Data Streams)
   props.put(AWSSchemaRegistryConstants.REGISTRY_NAME, "my-registry"); // If not passed, uses "default-registry"
   props.put(AWSSchemaRegistryConstants.CACHE_TIME_TO_LIVE_MILLIS, "86400000"); // If not passed, uses 86400000 (24 Hours)
   props.put(AWSSchemaRegistryConstants.CACHE_SIZE, "10"); // default value is 200
   props.put(AWSSchemaRegistryConstants.COMPATIBILITY_SETTING, Compatibility.FULL); // Pass a compatibility mode. If not passed, uses Compatibility.BACKWARD
   props.put(AWSSchemaRegistryConstants.DESCRIPTION, "This registry is used for several purposes."); // If not passed, constructs a description
   props.put(AWSSchemaRegistryConstants.COMPRESSION_TYPE, AWSSchemaRegistryConstants.COMPRESSION.ZLIB); // If not passed, records are sent uncompressed
   ```

   Auto-registration registers the schema version under the default registry ("default-registry"). If a `SCHEMA_NAME` is not specified in the previous step, then the topic name is inferred as `SCHEMA_NAME`. 

   See [Schema versioning and compatibility](schema-registry.md#schema-registry-compatibility) for more information on compatibility modes.

1. Setup your consumers with these required properties:

   ```
   props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
   props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, GlueSchemaRegistryKafkaDeserializer.class.getName());
   props.put(AWSSchemaRegistryConstants.AWS_REGION, "us-east-2"); // Pass an AWS Region
   props.put(AWSSchemaRegistryConstants.AVRO_RECORD_TYPE, AvroRecordType.GENERIC_RECORD.getName()); // Only required for AVRO data format
   ```

1. (Optional) Set these optional consumer properties. For detailed property descriptions, see [the ReadMe file](https://github.com/awslabs/aws-glue-schema-registry/blob/master/README.md).

   ```
   props.put(AWSSchemaRegistryConstants.CACHE_TIME_TO_LIVE_MILLIS, "86400000"); // If not passed, uses 86400000
   props.put(AWSSchemaRegistryConstants.CACHE_SIZE, "10"); // default value is 200
   props.put(AWSSchemaRegistryConstants.SECONDARY_DESERIALIZER, "com.amazonaws.services.schemaregistry.deserializers.external.ThirdPartyDeserializer"); // For migration fall back scenario
   ```

# C# Implementation
<a name="schema-registry-gs-serde-csharp"></a>

**Note**  
Prerequisites: Before completing the following steps, you need a running Amazon Managed Streaming for Apache Kafka (Amazon MSK) or Apache Kafka cluster. Your producers and consumers must run on .NET 8.0 or above.

## Installation
<a name="schema-registry-gs-serde-csharp-install"></a>

For C# applications, install the AWS Glue Schema Registry SerDe NuGet package using one of the following methods:

**.NET CLI:**  
Use the following command to install the package:

```
dotnet add package Aws.Glue.SchemaRegistry --version 1.0.0-<rid>
```

where `<rid>` is one of `linux-x64`, `linux-musl-x64`, or `linux-arm64`, giving a version such as `1.0.0-linux-x64`

**PackageReference (in your .csproj file):**  
Add the following to your project file:

```
<PackageReference Include="Aws.Glue.SchemaRegistry" Version="1.0.0-<rid>" />
```

where `<rid>` is one of `linux-x64`, `linux-musl-x64`, or `linux-arm64`, giving a version such as `1.0.0-linux-x64`

## Configuration File Setup
<a name="schema-registry-gs-serde-csharp-config"></a>

Create a configuration properties file (e.g., `gsr-config.properties`) with the required settings:

**Minimal Configuration:**  
The following shows a minimal configuration example:

```
region=us-east-1
registry.name=default-registry
dataFormat=AVRO
schemaAutoRegistrationEnabled=true
```

## Using C# Glue Schema client library for Kafka SerDes
<a name="schema-registry-gs-serde-csharp-kafka"></a>

**Sample serializer usage:**  
The following example shows how to use the serializer:

```
private static readonly string PROTOBUF_CONFIG_PATH = "<PATH_TO_CONFIG_FILE>";
var protobufSerializer = new GlueSchemaRegistryKafkaSerializer(PROTOBUF_CONFIG_PATH);
var serialized = protobufSerializer.Serialize(message, message.Descriptor.FullName);
// send serialized bytes to Kafka using producer.Produce(serialized)
```

**Sample deserializer usage:**  
The following example shows how to use the deserializer:

```
private static readonly string PROTOBUF_CONFIG_PATH = "<PATH_TO_CONFIG_FILE>";
var dataConfig = new GlueSchemaRegistryDataFormatConfiguration(
    new Dictionary<string, dynamic>
    {
        {
            GlueSchemaRegistryConstants.ProtobufMessageDescriptor, message.Descriptor
        }
    }
);
var protobufDeserializer = new GlueSchemaRegistryKafkaDeserializer(PROTOBUF_CONFIG_PATH, dataConfig);

// read message from Kafka using serialized = consumer.Consume()
var deserializedObject = protobufDeserializer.Deserialize(message.Descriptor.FullName, serialized);
```

## Using C# Glue Schema client library with KafkaFlow for SerDes
<a name="schema-registry-gs-serde-csharp-kafkaflow"></a>

**Sample serializer usage:**  
The following example shows how to configure KafkaFlow with the serializer:

```
services.AddKafka(kafka => kafka
    .UseConsoleLog()
    .AddCluster(cluster => cluster
        .WithBrokers(new[] { "localhost:9092" })
        .AddProducer<CustomerProducer>(producer => producer
            .DefaultTopic("customer-events")
            .AddMiddlewares(m => m
                .AddSerializer<GlueSchemaRegistryKafkaFlowProtobufSerializer<Customer>>(
                    () => new GlueSchemaRegistryKafkaFlowProtobufSerializer<Customer>("config/gsr-config.properties")
                )
            )
        )
    )
);
```

**Sample deserializer usage:**  
The following example shows how to configure KafkaFlow with the deserializer:

```
.AddConsumer(consumer => consumer
    .Topic("customer-events")
    .WithGroupId("customer-group")
    .WithBufferSize(100)
    .WithWorkersCount(10)
    .AddMiddlewares(middlewares => middlewares
        .AddDeserializer<GlueSchemaRegistryKafkaFlowProtobufDeserializer<Customer>>(
            () => new GlueSchemaRegistryKafkaFlowProtobufDeserializer<Customer>("config/gsr-config.properties")
        )
        .AddTypedHandlers(h => h.AddHandler<CustomerHandler>())
    )
)
```

## Optional Producer Properties
<a name="schema-registry-gs-serde-csharp-optional"></a>

You can extend your configuration file with additional optional properties:

```
# Auto-registration (if not passed, uses "false")
schemaAutoRegistrationEnabled=true

# Schema name (if not passed, uses topic name)
schema.name=my-schema

# Registry name (if not passed, uses "default-registry")
registry.name=my-registry

# Cache settings
cacheTimeToLiveMillis=86400000
cacheSize=200

# Compatibility mode (if not passed, uses BACKWARD)
compatibility=FULL

# Registry description
description=This registry is used for several purposes.

# Compression (if not passed, records are sent uncompressed)
compressionType=ZLIB
```

## Supported Data Formats
<a name="schema-registry-gs-serde-supported-formats"></a>

Both the Java and C# implementations support the same data formats:
+ *AVRO*: Apache Avro binary format
+ *JSON*: JSON Schema format
+ *PROTOBUF*: Protocol Buffers format

## Notes
<a name="schema-registry-gs-serde-csharp-notes"></a>
+ To get started with the library, please visit [https://www.nuget.org/packages/AWS.Glue.SchemaRegistry](https://www.nuget.org/packages/AWS.Glue.SchemaRegistry)
+ Source code is available at: [https://github.com/awslabs/aws-glue-schema-registry](https://github.com/awslabs/aws-glue-schema-registry)

# Creating a registry
<a name="schema-registry-gs3"></a>

You may use the default registry or create as many new registries as necessary using the AWS Glue APIs or AWS Glue console.

**AWS Glue APIs**  
You can use these steps to perform this task using the AWS Glue APIs.

To use the AWS CLI for the AWS Glue Schema Registry APIs, make sure to update your AWS CLI to the latest version.

To add a new registry, use the [CreateRegistry action (Python: create_registry)](aws-glue-api-schema-registry-api.md#aws-glue-api-schema-registry-api-CreateRegistry) API. Specify `RegistryName` as the name of the registry to be created, with a maximum length of 255 characters, containing only letters, numbers, hyphens, underscores, dollar signs, or hash marks. 

Specify a `Description` as a string not more than 2048 bytes long, matching the [URI address multi-line string pattern](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-common.html#aws-glue-api-common-_string-patterns). 

Optionally, specify one or more `Tags` for your registry, as a map of key-value pairs.

```
aws glue create-registry --registry-name registryName1 --description description
```

When your registry is created it is assigned an Amazon Resource Name (ARN), which you can view in the `RegistryArn` of the API response. Now that you've created a registry, create one or more schemas for that registry.

**AWS Glue console**  
To add a new registry in the AWS Glue console:

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, under **Data catalog**, choose **Schema registries**.

1. Choose **Add registry**.

1. Enter a **Registry name** for the registry, consisting of letters, numbers, hyphens, or underscores. This name cannot be changed.

1. Enter a **Description** (optional) for the registry.

1. Optionally, apply one or more tags to your registry. Choose **Add new tag** and specify a **Tag key** and optionally a **Tag value**.

1. Choose **Add registry**.

![\[Example of a creating a registry.\]](http://docs.aws.amazon.com/glue/latest/dg/images/schema_reg_create_registry.png)


When your registry is created it is assigned an Amazon Resource Name (ARN), which you can view by choosing the registry from the list in **Schema registries**. Now that you've created a registry, create one or more schemas for that registry.

# Dealing with a specific record (JAVA POJO) for JSON
<a name="schema-registry-gs-json-java-pojo"></a>

You can use a plain old Java object (POJO) and pass the object as a record. This is similar to the notion of a specific record in AVRO. The [mbknor-jackson-jsonSchema](https://github.com/mbknor/mbknor-jackson-jsonSchema) library can generate a JSON schema for the POJO passed. This library can also inject additional information into the JSON schema.

The AWS Glue Schema Registry library uses the injected "className" field in the schema to provide a fully qualified class name. The "className" field is used by the deserializer to deserialize into an object of that class.

Example class:

```
@JsonSchemaDescription("This is a car")
@JsonSchemaTitle("Simple Car Schema")
@Builder
@AllArgsConstructor
@EqualsAndHashCode
// Fully qualified class name to be added to an additionally injected property
// called className for deserializer to determine which class to deserialize
// the bytes into
@JsonSchemaInject(
        strings = {@JsonSchemaString(path = "className",
                value = "com.amazonaws.services.schemaregistry.integrationtests.generators.Car")}
)
// List of annotations to help infer JSON Schema are defined by https://github.com/mbknor/mbknor-jackson-jsonSchema
public class Car {
    @JsonProperty(required = true)
    private String make;

    @JsonProperty(required = true)
    private String model;

    @JsonSchemaDefault("true")
    @JsonProperty
    public boolean used;

    @JsonSchemaInject(ints = {@JsonSchemaInt(path = "multipleOf", value = 1000)})
    @Max(200000)
    @JsonProperty
    private int miles;

    @Min(2000)
    @JsonProperty
    private int year;

    @JsonProperty
    private Date purchaseDate;

    @JsonProperty
    @JsonFormat(shape = JsonFormat.Shape.NUMBER)
    private Date listedDate;

    @JsonProperty
    private String[] owners;

    @JsonProperty
    private Collection<Float> serviceChecks;

    // Empty constructor is required by Jackson to deserialize bytes
    // into an Object of this class
    public Car() {}
}
```

# Creating a schema
<a name="schema-registry-gs4"></a>

You can create a schema using the AWS Glue APIs or the AWS Glue console. 

**AWS Glue APIs**  
You can use these steps to perform this task using the AWS Glue APIs.

To add a new schema, use the [CreateSchema action (Python: create_schema)](aws-glue-api-schema-registry-api.md#aws-glue-api-schema-registry-api-CreateSchema) API.

Specify a `RegistryId` structure to indicate a registry for the schema. Or, omit the `RegistryId` to use the default registry.

Specify a `SchemaName` consisting of letters, numbers, hyphens, or underscores, and a `DataFormat` of **AVRO**, **JSON**, or **PROTOBUF**. Once set on a schema, the `DataFormat` cannot be changed.

Specify a `Compatibility` mode:
+ *Backward (recommended)* — Consumer can read both current and previous version.
+ *Backward all* — Consumer can read current and all previous versions.
+ *Forward* — Consumer can read both current and subsequent version.
+ *Forward all* — Consumer can read both current and all subsequent versions.
+ *Full* — Combination of Backward and Forward.
+ *Full all* — Combination of Backward all and Forward all.
+ *None* — No compatibility checks are performed.
+ *Disabled* — Prevent any versioning for this schema.
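
The mode descriptions above can be read as a rule over (reader version, writer version) pairs. The sketch below is a toy model of those guarantees with hypothetical names, not part of the AWS Glue libraries; in practice, the registry enforces a compatibility mode as a check when a new schema version is registered, rather than at read time:

```java
public class CompatibilitySketch {

    public enum Mode { BACKWARD, BACKWARD_ALL, FORWARD, FORWARD_ALL, FULL, FULL_ALL, NONE }

    /**
     * Illustrative rule: can a consumer using schema version `reader` read a
     * record written with schema version `writer` under the given mode?
     */
    public static boolean canRead(Mode mode, int reader, int writer) {
        switch (mode) {
            case BACKWARD:     return writer == reader || writer == reader - 1; // current and previous
            case BACKWARD_ALL: return writer <= reader;                         // current and all previous
            case FORWARD:      return writer == reader || writer == reader + 1; // current and subsequent
            case FORWARD_ALL:  return writer >= reader;                         // current and all subsequent
            case FULL:         return Math.abs(writer - reader) <= 1;           // Backward + Forward
            case FULL_ALL:     return true;                                     // Backward all + Forward all
            case NONE:         return true;  // no compatibility guarantee is enforced
            default:           return false;
        }
    }

    public static void main(String[] args) {
        // BACKWARD: a consumer on version 3 can read versions 3 and 2, but not 1.
        System.out.println(canRead(Mode.BACKWARD, 3, 3));
        System.out.println(canRead(Mode.BACKWARD, 3, 2));
        System.out.println(canRead(Mode.BACKWARD, 3, 1));
    }
}
```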

Optionally, specify `Tags` for your schema. 

Specify a `SchemaDefinition` to define the schema in Avro, JSON, or Protobuf data format. See the examples.

For Avro data format:

```
aws glue create-schema --registry-id RegistryName="registryName1" --schema-name testschema --compatibility NONE --data-format AVRO --schema-definition "{\"type\": \"record\", \"name\": \"r1\", \"fields\": [ {\"name\": \"f1\", \"type\": \"int\"}, {\"name\": \"f2\", \"type\": \"string\"} ]}"
```

```
aws glue create-schema --registry-id RegistryArn="arn:aws:glue:us-east-2:901234567890:registry/registryName1" --schema-name testschema --compatibility NONE --data-format AVRO  --schema-definition "{\"type\": \"record\", \"name\": \"r1\", \"fields\": [ {\"name\": \"f1\", \"type\": \"int\"}, {\"name\": \"f2\", \"type\": \"string\"} ]}"
```

For JSON data format:

```
aws glue create-schema --registry-id RegistryName="registryName" --schema-name testSchemaJson --compatibility NONE --data-format JSON --schema-definition "{\"\$schema\": \"http://json-schema.org/draft-07/schema#\",\"type\":\"object\",\"properties\":{\"f1\":{\"type\":\"string\"}}}"
```

```
aws glue create-schema --registry-id RegistryArn="arn:aws:glue:us-east-2:901234567890:registry/registryName" --schema-name testSchemaJson --compatibility NONE --data-format JSON --schema-definition "{\"\$schema\": \"http://json-schema.org/draft-07/schema#\",\"type\":\"object\",\"properties\":{\"f1\":{\"type\":\"string\"}}}"
```

For Protobuf data format:

```
aws glue create-schema --registry-id RegistryName="registryName" --schema-name testSchemaProtobuf --compatibility NONE --data-format PROTOBUF --schema-definition "syntax = \"proto2\";package org.test;message Basic { optional int32 basic = 1;}"
```

```
aws glue create-schema --registry-id RegistryArn="arn:aws:glue:us-east-2:901234567890:registry/registryName" --schema-name testSchemaProtobuf --compatibility NONE --data-format PROTOBUF --schema-definition "syntax = \"proto2\";package org.test;message Basic { optional int32 basic = 1;}"
```

**AWS Glue console**  
To add a new schema using the AWS Glue console:

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, under **Data catalog**, choose **Schemas**.

1. Choose **Add schema**.

1. Enter a **Schema name**, consisting of letters, numbers, hyphens, underscores, dollar signs, or hash marks. This name cannot be changed.

1. Choose the **Registry** where the schema will be stored from the drop-down menu. The parent registry cannot be changed post-creation.

1. Choose the **Data format**, *Apache Avro* or *JSON*. This format applies to all versions of this schema.

1. Choose a **Compatibility mode**.
   + *Backward (recommended)* — receiver can read both current and previous versions.
   + *Backward All* — receiver can read current and all previous versions.
   + *Forward* — sender can write both current and previous versions.
   + *Forward All* — sender can write current and all previous versions.
   + *Full* — combination of Backward and Forward.
   + *Full All* — combination of Backward All and Forward All.
   + *None* — no compatibility checks performed.
   + *Disabled* — prevent any versioning for this schema.

1. Enter an optional **Description** for the schema of up to 250 characters.  
![\[Example of a creating a schema.\]](http://docs.aws.amazon.com/glue/latest/dg/images/schema_reg_create_schema.png)

1. Optionally, apply one or more tags to your schema. Choose **Add new tag** and specify a **Tag key** and optionally a **Tag value**.

1. In the **First schema version** box, enter or paste your initial schema.

   For Avro format, see [Working with Avro data format](#schema-registry-avro)

   For JSON format, see [Working with JSON data format](#schema-registry-json)

1. Optionally, choose **Add metadata** to add version metadata to annotate or classify your schema version.

1. Choose **Create schema and version**.

![\[Example of a creating a schema.\]](http://docs.aws.amazon.com/glue/latest/dg/images/schema_reg_create_schema2.png)


The schema is created and appears in the list under **Schemas**.

## Working with Avro data format
<a name="schema-registry-avro"></a>

Avro provides data serialization and data exchange services. Avro stores the data definition in JSON format, making it easy to read and interpret. The data itself is stored in binary format.

For information on defining an Apache Avro schema, see the [Apache Avro specification](http://avro.apache.org/docs/current/spec.html).
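
For example, the Avro schema definition passed on the `create-schema` command line in [Creating a schema](#schema-registry-gs4), written out as a standalone schema document:

```
{
  "type": "record",
  "name": "r1",
  "fields": [
    {"name": "f1", "type": "int"},
    {"name": "f2", "type": "string"}
  ]
}
```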

## Working with JSON data format
<a name="schema-registry-json"></a>

Data can be serialized in JSON format. The [JSON Schema specification](https://json-schema.org/) defines the standard for describing and validating the structure of JSON data.
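
For example, the JSON Schema definition used in the `create-schema` CLI examples in [Creating a schema](#schema-registry-gs4), pretty-printed:

```
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "f1": {"type": "string"}
  }
}
```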

# Updating a schema or registry
<a name="schema-registry-gs5"></a>

Once created, you can edit your schemas, schema versions, or registry.

## Updating a registry
<a name="schema-registry-gs5a"></a>

You can update a registry using the AWS Glue APIs or the AWS Glue console. The name of an existing registry cannot be edited, but you can edit its description.

**AWS Glue APIs**  
To update an existing registry, use the [UpdateRegistry action (Python: update_registry)](aws-glue-api-schema-registry-api.md#aws-glue-api-schema-registry-api-UpdateRegistry) API.

Specify a `RegistryId` structure to indicate the registry that you want to update. Pass a `Description` to change the description for a registry.

```
aws glue update-registry --description updatedDescription --registry-id RegistryArn="arn:aws:glue:us-east-2:901234567890:registry/registryName1"
```

**AWS Glue console**  
To update a registry using the AWS Glue console:

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, under **Data catalog**, choose **Schema registries**.

1. Choose a registry from the list of registries, by checking its box.

1. In the **Action** menu, choose **Edit registry**.

# Updating a schema
<a name="schema-registry-gs5b"></a>

You can update the description or compatibility setting for a schema.

To update an existing schema, use the [UpdateSchema action (Python: update_schema)](aws-glue-api-schema-registry-api.md#aws-glue-api-schema-registry-api-UpdateSchema) API.

Specify a `SchemaId` structure to indicate the schema that you want to update. You must provide either `VersionNumber` or `Compatibility`.

For example:

```
aws glue update-schema --description testDescription --schema-id SchemaName="testSchema1",RegistryName="registryName1" --schema-version-number LatestVersion=true --compatibility NONE
```

```
aws glue update-schema --description testDescription --schema-id SchemaArn="arn:aws:glue:us-east-2:901234567890:schema/registryName1/testSchema1" --schema-version-number LatestVersion=true --compatibility NONE
```

# Adding a schema version
<a name="schema-registry-gs5c"></a>

When you add a schema version, you will need to compare the versions to make sure the new schema will be accepted.

To add a new version to an existing schema, use the [RegisterSchemaVersion action (Python: register_schema_version)](aws-glue-api-schema-registry-api.md#aws-glue-api-schema-registry-api-RegisterSchemaVersion) API.

Specify a `SchemaId` structure to indicate the schema for which you want to add a version, and a `SchemaDefinition` to define the schema.

For example:

```
aws glue register-schema-version --schema-definition "{\"type\": \"record\", \"name\": \"r1\", \"fields\": [ {\"name\": \"f1\", \"type\": \"int\"}, {\"name\": \"f2\", \"type\": \"string\"} ]}" --schema-id SchemaArn="arn:aws:glue:us-east-1:901234567890:schema/registryName/testschema"
```

```
aws glue register-schema-version --schema-definition "{\"type\": \"record\", \"name\": \"r1\", \"fields\": [ {\"name\": \"f1\", \"type\": \"int\"}, {\"name\": \"f2\", \"type\": \"string\"} ]}" --schema-id SchemaName="testschema",RegistryName="testregistry"
```

**AWS Glue console**  
To add a new schema version using the AWS Glue console:

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, under **Data catalog**, choose **Schemas**.

1. Choose the schema from the list of schemas, by checking its box.

1. In the **Action** menu, choose **Register new version**.

1. In the **New version** box, enter or paste your new schema.

1. Choose **Compare with previous version** to see differences with the previous schema version.

1. Optionally, choose **Add metadata** to add version metadata to annotate or classify your schema version. Enter **Key** and optional **Value**.

1. Choose **Register version**.

![\[Adding a schema version.\]](http://docs.aws.amazon.com/glue/latest/dg/images/schema_reg_add_schema_version.png)


The new schema version appears in the list of versions. If the version changed the compatibility mode, the version is marked as a checkpoint.

## Example of a schema version comparison
<a name="schema-registry-gs5c1"></a>

When you choose to **Compare with previous version**, you will see the previous and new versions displayed together. Changed information will be highlighted as follows:
+ *Yellow*: indicates changed information.
+ *Green*: indicates content added in the latest version.
+ *Red*: indicates content removed in the latest version.

You can also compare against earlier versions.

![\[Example of a schema version comparison.\]](http://docs.aws.amazon.com/glue/latest/dg/images/schema_reg_version_comparison.png)


# Deleting a schema or registry
<a name="schema-registry-gs7"></a>

Deleting a schema, a schema version, or a registry are permanent actions that cannot be undone.

## Deleting a schema
<a name="schema-registry-gs7a"></a>

You may want to delete a schema when it will no longer be used within a registry. You can delete a schema using the AWS Management Console or the [DeleteSchema action (Python: delete_schema)](aws-glue-api-schema-registry-api.md#aws-glue-api-schema-registry-api-DeleteSchema) API.

Deleting one or more schemas is a permanent action that cannot be undone. Make sure that the schema or schemas are no longer needed.

To delete a schema from the registry, call the [DeleteSchema action (Python: delete_schema)](aws-glue-api-schema-registry-api.md#aws-glue-api-schema-registry-api-DeleteSchema) API, specifying the `SchemaId` structure to identify the schema.

For example:

```
aws glue delete-schema --schema-id SchemaArn="arn:aws:glue:us-east-2:901234567890:schema/registryName1/schemaname"
```

```
aws glue delete-schema --schema-id SchemaName="TestSchema6-deleteschemabyname",RegistryName="default-registry"
```

**AWS Glue console**  
To delete a schema from the AWS Glue console:

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, under **Data catalog**, choose **Schema registries**.

1. Choose the registry that contains your schema from the list of registries.

1. Choose one or more schemas from the list, by checking the boxes.

1. In the **Action** menu, choose **Delete schema**.

1. Enter the text **Delete** in the field to confirm deletion.

1. Choose **Delete**.

The schema(s) you specified are deleted from the registry.

## Deleting a schema version
<a name="schema-registry-gs7b"></a>

As schemas accumulate in the registry, you may want to delete unwanted schema versions using the AWS Management Console, or the [DeleteSchemaVersions action (Python: delete_schema_versions)](aws-glue-api-schema-registry-api.md#aws-glue-api-schema-registry-api-DeleteSchemaVersions) API. Deleting one or more schema versions is a permanent action that cannot be undone. Make sure that the schema versions are no longer needed.

When deleting schema versions, take note of the following constraints:
+ You cannot delete a check-pointed version.
+ The range of contiguous versions cannot be more than 25.
+ The latest schema version must not be in a pending state.

Specify the `SchemaId` structure to identify the schema, and specify `Versions` as a range of versions to delete. For more information on specifying a version or range of versions, see [DeleteSchemaVersions action (Python: delete_schema_versions)](aws-glue-api-schema-registry-api.md#aws-glue-api-schema-registry-api-DeleteSchemaVersions). The schema versions you specify are deleted from the registry.

Calling the [ListSchemaVersions action (Python: list_schema_versions)](aws-glue-api-schema-registry-api.md#aws-glue-api-schema-registry-api-ListSchemaVersions) API after this call will list the status of the deleted versions.

For example:

```
aws glue delete-schema-versions --schema-id SchemaName="TestSchema6",RegistryName="default-registry" --versions "1-1"
```

```
aws glue delete-schema-versions --schema-id SchemaArn="arn:aws:glue:us-east-2:901234567890:schema/default-registry/TestSchema6-NON-Existent" --versions "1-1"
```

**AWS Glue console**  
To delete schema versions from the AWS Glue console:

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, under **Data catalog**, choose **Schema registries**.

1. Choose the registry that contains your schema from the list of registries.

1. Choose one or more schemas from the list, by checking the boxes.

1. In the **Action** menu, choose **Delete schema version**.

1. Enter the text **Delete** in the field to confirm deletion.

1. Choose **Delete**.

The schema versions you specified are deleted from the registry.

# Deleting a registry
<a name="schema-registry-gs7c"></a>

You may want to delete a registry when the schemas it contains should no longer be organized under that registry. If you want to keep those schemas, reassign them to another registry before deleting.

Deleting one or more registries is a permanent action that cannot be undone. Make sure that the registry or registries are no longer needed.

The default registry can be deleted using the AWS CLI.

**AWS Glue API**  
To delete the entire registry including the schema and all of its versions, call the [DeleteRegistry action (Python: delete\$1registry)](aws-glue-api-schema-registry-api.md#aws-glue-api-schema-registry-api-DeleteRegistry) API. Specify a `RegistryId` structure to identify the registry.

For example:

```
aws glue delete-registry --registry-id RegistryArn="arn:aws:glue:us-east-2:901234567890:registry/registryName1"
```

```
aws glue delete-registry --registry-id RegistryName="TestRegistry-deletebyname"
```

To get the status of the delete operation, you can call the `GetRegistry` API after the asynchronous call.

**AWS Glue console**  
To delete a registry from the AWS Glue console:

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, under **Data catalog**, choose **Schema registries**.

1. Choose a registry from the list by checking its box.

1. In the **Action** menu, choose **Delete registry**.

1. Enter the text **Delete** in the field to confirm deletion.

1. Choose **Delete**.

The registries you selected are deleted from AWS Glue.

## IAM examples for serializers
<a name="schema-registry-gs1"></a>

**Note**  
AWS managed policies grant necessary permissions for common use cases. For information on using managed policies to manage the schema registry, see [AWS managed (predefined) policies for AWS Glue](security-iam-awsmanpol.md#access-policy-examples-aws-managed). 

For serializers, create a minimal policy similar to the following to allow the serializer to find the `schemaVersionId` for a given schema definition. Note that you must have read permissions on the registry in order to read the schemas in the registry. You can limit the registries that can be read by using the `Resource` clause.

Code example 13:

```
{
    "Sid" : "GetSchemaByDefinition",
    "Effect" : "Allow",
    "Action" : [
        "glue:GetSchemaByDefinition"
    ],
    "Resource" : ["arn:aws:glue:us-east-2:012345678:registry/registryname-1",
                  "arn:aws:glue:us-east-2:012345678:schema/registryname-1/schemaname-1",
                  "arn:aws:glue:us-east-2:012345678:schema/registryname-1/schemaname-2"
                 ]
}
```

Further, you can allow producers to create new schemas and versions by including the following additional actions. Note that you must be able to inspect the registry in order to add, remove, or evolve the schemas inside it. You can limit the registries that can be inspected by using the `Resource` clause.

Code example 14:

```
{
    "Sid" : "RegisterSchemaWithMetadata",
    "Effect" : "Allow",
    "Action" : [
        "glue:GetSchemaByDefinition",
        "glue:CreateSchema",
        "glue:RegisterSchemaVersion",
        "glue:PutSchemaVersionMetadata"
    ],
    "Resource" : ["arn:aws:glue:aws-region:123456789012:registry/registryname-1",
                  "arn:aws:glue:aws-region:123456789012:schema/registryname-1/schemaname-1",
                  "arn:aws:glue:aws-region:123456789012:schema/registryname-1/schemaname-2"
                 ]
}
```

## IAM examples for deserializers
<a name="schema-registry-gs1b"></a>

For deserializers (consumer side), create a policy similar to the following to allow the deserializer to fetch the schema from the Schema Registry for deserialization. Note that you must be able to inspect the registry in order to fetch the schemas inside it.

Code example 15:

```
{
    "Sid" : "GetSchemaVersion",
    "Effect" : "Allow",
    "Action" : [
        "glue:GetSchemaVersion"
    ],
    "Resource" : ["*"]
}
```

## Private connectivity using AWS PrivateLink
<a name="schema-registry-gs-private"></a>

You can use AWS PrivateLink to connect your data producer’s VPC to AWS Glue by defining an interface VPC endpoint for AWS Glue. When you use a VPC interface endpoint, communication between your VPC and AWS Glue is conducted entirely within the AWS network. For more information, see [Using AWS Glue with VPC Endpoints](https://docs.aws.amazon.com/glue/latest/dg/vpc-endpoint.html).

# Accessing Amazon CloudWatch metrics
<a name="schema-registry-gs-monitoring"></a>

Amazon CloudWatch metrics are available as part of the CloudWatch free tier. You can access these metrics in the CloudWatch console.

API-level metrics, each reported as Success and Latency:
+ CreateSchema
+ GetSchemaByDefinition
+ GetSchemaVersion
+ RegisterSchemaVersion
+ PutSchemaVersionMetadata

Resource-level metrics:
+ Registry.ThrottledByLimit
+ SchemaVersion.ThrottledByLimit
+ SchemaVersion.Size

# Sample CloudFormation template for schema registry
<a name="schema-registry-integrations-cfn"></a>

The following is a sample template for creating Schema Registry resources in CloudFormation. To create this stack in your account, copy the following template into a file named `SampleTemplate.yaml`, and run the following command:

```
aws cloudformation create-stack --stack-name ABCSchemaRegistryStack --template-body file://SampleTemplate.yaml
```

This example uses `AWS::Glue::Registry` to create a registry, `AWS::Glue::Schema` to create a schema, `AWS::Glue::SchemaVersion` to create a schema version, and `AWS::Glue::SchemaVersionMetadata` to populate schema version metadata. 

```
Description: "A sample CloudFormation template for creating Schema Registry resources."
Resources:
  ABCRegistry:
    Type: "AWS::Glue::Registry"
    Properties:
      Name: "ABCSchemaRegistry"
      Description: "ABC Corp. Schema Registry"
      Tags:
        Project: "Foo"
  ABCSchema:
    Type: "AWS::Glue::Schema"
    Properties:
      Registry:
        Arn: !Ref ABCRegistry
      Name: "TestSchema"
      Compatibility: "NONE"
      DataFormat: "AVRO"
      SchemaDefinition: >
        {"namespace":"foo.avro","type":"record","name":"user","fields":[{"name":"name","type":"string"},{"name":"favorite_number","type":"int"}]}
      Tags:
        Project: "Foo"
  SecondSchemaVersion:
    Type: "AWS::Glue::SchemaVersion"
    Properties:
      Schema:
        SchemaArn: !Ref ABCSchema
      SchemaDefinition: >
        {"namespace":"foo.avro","type":"record","name":"user","fields":[{"name":"status","type":"string", "default":"ON"}, {"name":"name","type":"string"},{"name":"favorite_number","type":"int"}]}
  FirstSchemaVersionMetadata:
    Type: "AWS::Glue::SchemaVersionMetadata"
    Properties:
      SchemaVersionId: !GetAtt ABCSchema.InitialSchemaVersionId
      Key: "Application"
      Value: "Kinesis"
  SecondSchemaVersionMetadata:
    Type: "AWS::Glue::SchemaVersionMetadata"
    Properties:
      SchemaVersionId: !Ref SecondSchemaVersion
      Key: "Application"
      Value: "Kinesis"
```

# Integrating with AWS Glue Schema Registry
<a name="schema-registry-integrations"></a>

These sections describe integrations with the AWS Glue Schema Registry. The examples in this section show a schema with the AVRO data format. For more examples, including schemas with the JSON data format, see the integration tests and ReadMe information in the [AWS Glue Schema Registry open source repository](https://github.com/awslabs/aws-glue-schema-registry).

**Topics**
+ [Use case: Connecting Schema Registry to Amazon MSK or Apache Kafka](#schema-registry-integrations-amazon-msk)
+ [Use case: Integrating Amazon Kinesis Data Streams with the AWS Glue Schema Registry](#schema-registry-integrations-kds)
+ [Use case: Amazon Managed Service for Apache Flink](#schema-registry-integrations-kinesis-data-analytics-apache-flink)
+ [Use Case: Integration with AWS Lambda](#schema-registry-integrations-aws-lambda)
+ [Use case: AWS Glue Data Catalog](#schema-registry-integrations-aws-glue-data-catalog)
+ [Use case: AWS Glue streaming](#schema-registry-integrations-aws-glue-streaming)
+ [Use case: Apache Kafka Streams](#schema-registry-integrations-apache-kafka-streams)

## Use case: Connecting Schema Registry to Amazon MSK or Apache Kafka
<a name="schema-registry-integrations-amazon-msk"></a>

Assuming you are writing data to an Apache Kafka topic, follow these steps to get started.

1. Create an Amazon Managed Streaming for Apache Kafka (Amazon MSK) or Apache Kafka cluster with at least one topic. If creating an Amazon MSK cluster, you can use the AWS Management Console. Follow these instructions: [Getting Started Using Amazon MSK](https://docs.aws.amazon.com/msk/latest/developerguide/getting-started.html) in the *Amazon Managed Streaming for Apache Kafka Developer Guide*.

1. Follow the [Installing SerDe Libraries](schema-registry-gs-serde.md) step above.

1. To create schema registries, schemas, or schema versions, follow the instructions under the [Getting started with schema registry](schema-registry-gs.md) section of this document.

1. Start your producers and consumers to use the Schema Registry to write and read records to and from the Amazon MSK or Apache Kafka topic. Example producer and consumer code can be found in [the ReadMe file](https://github.com/awslabs/aws-glue-schema-registry/blob/master/README.md) for the Serde libraries. The Schema Registry library on the producer will automatically serialize the record and decorate it with a schema version ID.

1. If the schema of this record has been provided, or if auto-registration is turned on, then the schema will have been registered in the Schema Registry.

1. The consumer reading from the Amazon MSK or Apache Kafka topic, using the AWS Glue Schema Registry library, will automatically look up the schema from the Schema Registry.
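Concretely, the "decoration" added by the producer-side library is a small binary header in front of each record's payload. As a hedged sketch (the layout below — one header version byte, one compression byte, and a 16-byte schema version UUID — reflects the open source SerDe library's current implementation and is not a stable public contract; the class and method names are illustrative only), peeling the schema version ID off a record looks roughly like this:

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Illustrative sketch of the header the SerDe library prepends to each record.
// Assumed layout (from the aws-glue-schema-registry open source code):
// [1 byte header version][1 byte compression flag][16 byte schema version UUID][payload]
public class GsrHeaderSketch {

    public static UUID readSchemaVersionId(byte[] encoded) {
        ByteBuffer buffer = ByteBuffer.wrap(encoded);
        byte headerVersion = buffer.get();   // currently 3 in the library's constants
        if (headerVersion != 3) {
            throw new IllegalArgumentException("Unexpected header version: " + headerVersion);
        }
        byte compression = buffer.get();     // 0 = none, 5 = zlib (assumed values)
        long mostSigBits = buffer.getLong();
        long leastSigBits = buffer.getLong();
        return new UUID(mostSigBits, leastSigBits);
    }

    public static void main(String[] args) {
        // Build a fake encoded record purely for illustration.
        UUID versionId = UUID.fromString("11111111-2222-3333-4444-555555555555");
        byte[] payload = {10, 20, 30};
        ByteBuffer buffer = ByteBuffer.allocate(18 + payload.length);
        buffer.put((byte) 3).put((byte) 0)
              .putLong(versionId.getMostSignificantBits())
              .putLong(versionId.getLeastSignificantBits())
              .put(payload);
        System.out.println(readSchemaVersionId(buffer.array()));
    }
}
```

In practice you never parse this header yourself; `GlueSchemaRegistryDeserializer` does it for you, as the examples later in this section show.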

## Use case: Integrating Amazon Kinesis Data Streams with the AWS Glue Schema Registry
<a name="schema-registry-integrations-kds"></a>

This integration requires that you have an existing Amazon Kinesis data stream. For more information, see [Getting Started with Amazon Kinesis Data Streams](https://docs.aws.amazon.com/streams/latest/dev/getting-started.html) in the *Amazon Kinesis Data Streams Developer Guide*.

There are two ways that you can interact with data in a Kinesis data stream.
+ Through the Kinesis Producer Library (KPL) and Kinesis Client Library (KCL) in Java. Multi-language support is not provided.
+ Through the `PutRecords`, `PutRecord`, and `GetRecords` Kinesis Data Streams APIs available in the AWS SDK for Java.

If you currently use the KPL/KCL libraries, we recommend continuing with that method. Updated KCL and KPL versions with the Schema Registry integrated are available, as shown in the examples. If you use the Kinesis Data Streams APIs directly, you can use the sample code to take advantage of the AWS Glue Schema Registry.

Schema Registry integration is only available with KPL v0.14.2 or later and with KCL v2.3 or later. Schema Registry integration with JSON data format is available with KPL v0.14.8 or later and with KCL v2.3.6 or later.
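Translated into build configuration, the version requirements above correspond to a `pom.xml` fragment like the following (the Maven coordinates shown are the libraries' published group and artifact IDs; the pinned versions are the JSON-format minimums, so adjust them for your data format):

```
<dependencies>
    <!-- KPL: 0.14.8 or later for JSON, 0.14.2 or later for AVRO -->
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>amazon-kinesis-producer</artifactId>
        <version>0.14.8</version>
    </dependency>
    <!-- KCL: 2.3.6 or later for JSON, 2.3 or later for AVRO -->
    <dependency>
        <groupId>software.amazon.kinesis</groupId>
        <artifactId>amazon-kinesis-client</artifactId>
        <version>2.3.6</version>
    </dependency>
</dependencies>
```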

### Interacting with Data Using Kinesis SDK V2
<a name="schema-registry-integrations-kds-sdk-v2"></a>

This section describes interacting with Kinesis using the Kinesis SDK V2.

```
// Example JSON record; you can construct an AVRO record as well
private static final JsonDataWithSchema record = JsonDataWithSchema.builder(schemaString, payloadString).build();
private static final DataFormat dataFormat = DataFormat.JSON;

//Configurations for Schema Registry
GlueSchemaRegistryConfiguration gsrConfig = new GlueSchemaRegistryConfiguration("us-east-1");

GlueSchemaRegistrySerializer glueSchemaRegistrySerializer =
        new GlueSchemaRegistrySerializerImpl(awsCredentialsProvider, gsrConfig);
GlueSchemaRegistryDataFormatSerializer dataFormatSerializer =
        new GlueSchemaRegistrySerializerFactory().getInstance(dataFormat, gsrConfig);

Schema gsrSchema =
        new Schema(dataFormatSerializer.getSchemaDefinition(record), dataFormat.name(), "MySchema");

byte[] serializedBytes = dataFormatSerializer.serialize(record);

byte[] gsrEncodedBytes = glueSchemaRegistrySerializer.encode(streamName, gsrSchema, serializedBytes);

PutRecordRequest putRecordRequest = PutRecordRequest.builder()
        .streamName(streamName)
        .partitionKey("partitionKey")
        .data(SdkBytes.fromByteArray(gsrEncodedBytes))
        .build();
shardId = kinesisClient.putRecord(putRecordRequest)
        .get()
        .shardId();

GlueSchemaRegistryDeserializer glueSchemaRegistryDeserializer = new GlueSchemaRegistryDeserializerImpl(awsCredentialsProvider, gsrConfig);

GlueSchemaRegistryDeserializerFactory glueSchemaRegistryDeserializerFactory = new GlueSchemaRegistryDeserializerFactory();
GlueSchemaRegistryDataFormatDeserializer gsrDataFormatDeserializer =
        glueSchemaRegistryDeserializerFactory.getInstance(dataFormat, gsrConfig);

GetShardIteratorRequest getShardIteratorRequest = GetShardIteratorRequest.builder()
        .streamName(streamName)
        .shardId(shardId)
        .shardIteratorType(ShardIteratorType.TRIM_HORIZON)
        .build();

String shardIterator = kinesisClient.getShardIterator(getShardIteratorRequest)
        .get()
        .shardIterator();

GetRecordsRequest getRecordRequest = GetRecordsRequest.builder()
        .shardIterator(shardIterator)
        .build();
GetRecordsResponse recordsResponse = kinesisClient.getRecords(getRecordRequest)
        .get();

List<Object> consumerRecords = new ArrayList<>();
List<Record> recordsFromKinesis = recordsResponse.records();

for (int i = 0; i < recordsFromKinesis.size(); i++) {
    byte[] consumedBytes = recordsFromKinesis.get(i)
            .data()
            .asByteArray();

    Schema gsrSchema = glueSchemaRegistryDeserializer.getSchema(consumedBytes);
    Object decodedRecord = gsrDataFormatDeserializer.deserialize(ByteBuffer.wrap(consumedBytes),
                                                                    gsrSchema.getSchemaDefinition());
    consumerRecords.add(decodedRecord);
}
```

### Interacting with data using the KPL/KCL libraries
<a name="schema-registry-integrations-kds-libraries"></a>

This section describes integrating Kinesis Data Streams with the Schema Registry using the KPL/KCL libraries. For more information on using KPL/KCL, see [Developing Producers Using the Amazon Kinesis Producer Library](https://docs.aws.amazon.com/streams/latest/dev/developing-producers-with-kpl.html) in the *Amazon Kinesis Data Streams Developer Guide*.

#### Setting up the Schema Registry in KPL
<a name="schema-registry-integrations-kds-libraries-kpl"></a>

1. Define the schema definition for the data, the data format, and the schema name as authored in the AWS Glue Schema Registry.

1. Optionally configure the `GlueSchemaRegistryConfiguration` object.

1. Pass the schema object to the `addUserRecord` API.

   ```
   private static final String SCHEMA_DEFINITION = "{\"namespace\": \"example.avro\",\n"
   + " \"type\": \"record\",\n"
   + " \"name\": \"User\",\n"
   + " \"fields\": [\n"
   + " {\"name\": \"name\", \"type\": \"string\"},\n"
   + " {\"name\": \"favorite_number\", \"type\": [\"int\", \"null\"]},\n"
   + " {\"name\": \"favorite_color\", \"type\": [\"string\", \"null\"]}\n"
   + " ]\n"
   + "}";
   
   KinesisProducerConfiguration config = new KinesisProducerConfiguration();
   config.setRegion("us-west-1");
   
   //[Optional] configuration for Schema Registry.
   
   GlueSchemaRegistryConfiguration schemaRegistryConfig =
   new GlueSchemaRegistryConfiguration("us-west-1");
   
   schemaRegistryConfig.setCompression(true);
   
   config.setGlueSchemaRegistryConfiguration(schemaRegistryConfig);
   
   ///Optional configuration ends.
   
   final KinesisProducer producer =
         new KinesisProducer(config);
   
   final ByteBuffer data = getDataToSend();
   
   com.amazonaws.services.schemaregistry.common.Schema gsrSchema =
       new Schema(SCHEMA_DEFINITION, DataFormat.AVRO.toString(), "demoSchema");
   
   ListenableFuture<UserRecordResult> f = producer.addUserRecord(
   config.getStreamName(), TIMESTAMP, Utils.randomExplicitHashKey(), data, gsrSchema);
   
   private static ByteBuffer getDataToSend() {
         org.apache.avro.Schema avroSchema =
           new org.apache.avro.Schema.Parser().parse(SCHEMA_DEFINITION);
   
         GenericRecord user = new GenericData.Record(avroSchema);
         user.put("name", "Emily");
         user.put("favorite_number", 32);
         user.put("favorite_color", "green");
   
         ByteArrayOutputStream outBytes = new ByteArrayOutputStream();
         Encoder encoder = EncoderFactory.get().directBinaryEncoder(outBytes, null);
         new GenericDatumWriter<>(avroSchema).write(user, encoder);
         encoder.flush();
         return ByteBuffer.wrap(outBytes.toByteArray());
    }
   ```

#### Setting up the Kinesis client library
<a name="schema-registry-integrations-kds-libraries-kcl"></a>

You will develop your Kinesis Client Library consumer in Java. For more information, see [Developing a Kinesis Client Library Consumer in Java](https://docs.aws.amazon.com/streams/latest/dev/kcl2-standard-consumer-java-example.html) in the *Amazon Kinesis Data Streams Developer Guide*.

1. Create an instance of `GlueSchemaRegistryDeserializer` by passing a `GlueSchemaRegistryConfiguration` object.

1. Pass the `GlueSchemaRegistryDeserializer` to `retrievalConfig.glueSchemaRegistryDeserializer`.

1. Access the schema of incoming messages by calling `kinesisClientRecord.getSchema()`.

   ```
   GlueSchemaRegistryConfiguration schemaRegistryConfig =
       new GlueSchemaRegistryConfiguration(this.region.toString());
   
    GlueSchemaRegistryDeserializer glueSchemaRegistryDeserializer =
       new GlueSchemaRegistryDeserializerImpl(DefaultCredentialsProvider.builder().build(), schemaRegistryConfig);
   
    RetrievalConfig retrievalConfig = configsBuilder.retrievalConfig().retrievalSpecificConfig(new PollingConfig(streamName, kinesisClient));
    retrievalConfig.glueSchemaRegistryDeserializer(glueSchemaRegistryDeserializer);
   
     Scheduler scheduler = new Scheduler(
               configsBuilder.checkpointConfig(),
               configsBuilder.coordinatorConfig(),
               configsBuilder.leaseManagementConfig(),
               configsBuilder.lifecycleConfig(),
               configsBuilder.metricsConfig(),
               configsBuilder.processorConfig(),
               retrievalConfig
           );
   
    public void processRecords(ProcessRecordsInput processRecordsInput) {
               MDC.put(SHARD_ID_MDC_KEY, shardId);
               try {
                   log.info("Processing {} record(s)",
                   processRecordsInput.records().size());
                   processRecordsInput.records()
                   .forEach(
                       r ->
                           log.info("Processed record pk: {} -- Seq: {} : data {} with schema: {}",
                           r.partitionKey(), r.sequenceNumber(), recordToAvroObj(r).toString(), r.getSchema()));
               } catch (Throwable t) {
                   log.error("Caught throwable while processing records. Aborting.");
                   Runtime.getRuntime().halt(1);
               } finally {
                   MDC.remove(SHARD_ID_MDC_KEY);
               }
    }
   
    private GenericRecord recordToAvroObj(KinesisClientRecord r) {
       byte[] data = new byte[r.data().remaining()];
       r.data().get(data, 0, data.length);
       org.apache.avro.Schema schema = new org.apache.avro.Schema.Parser().parse(r.schema().getSchemaDefinition());
       DatumReader datumReader = new GenericDatumReader<>(schema);
   
       BinaryDecoder binaryDecoder = DecoderFactory.get().binaryDecoder(data, 0, data.length, null);
       return (GenericRecord) datumReader.read(null, binaryDecoder);
    }
   ```

#### Interacting with data using the Kinesis Data Streams APIs
<a name="schema-registry-integrations-kds-apis"></a>

This section describes integrating Kinesis Data Streams with Schema Registry using the Kinesis Data Streams APIs.

1. Update these Maven dependencies:

   ```
   <dependencyManagement>
           <dependencies>
               <dependency>
                   <groupId>com.amazonaws</groupId>
                   <artifactId>aws-java-sdk-bom</artifactId>
                   <version>1.11.884</version>
                   <type>pom</type>
                   <scope>import</scope>
               </dependency>
           </dependencies>
       </dependencyManagement>
   
       <dependencies>
           <dependency>
               <groupId>com.amazonaws</groupId>
               <artifactId>aws-java-sdk-kinesis</artifactId>
           </dependency>
   
           <dependency>
               <groupId>software.amazon.glue</groupId>
               <artifactId>schema-registry-serde</artifactId>
               <version>1.1.5</version>
           </dependency>
   
           <dependency>
               <groupId>com.fasterxml.jackson.dataformat</groupId>
               <artifactId>jackson-dataformat-cbor</artifactId>
               <version>2.11.3</version>
           </dependency>
       </dependencies>
   ```

1. In the producer, add schema header information using the `PutRecords` or `PutRecord` API in Kinesis Data Streams.

   ```
   //The following lines add a Schema Header to the record
           com.amazonaws.services.schemaregistry.common.Schema awsSchema =
               new com.amazonaws.services.schemaregistry.common.Schema(schemaDefinition, DataFormat.AVRO.name(),
                   schemaName);
           GlueSchemaRegistrySerializerImpl glueSchemaRegistrySerializer =
               new GlueSchemaRegistrySerializerImpl(DefaultCredentialsProvider.builder().build(), new GlueSchemaRegistryConfiguration(getConfigs()));
           byte[] recordWithSchemaHeader =
               glueSchemaRegistrySerializer.encode(streamName, awsSchema, recordAsBytes);
   ```

1. In the producer, use the `PutRecords` or `PutRecord` API to put the record into the data stream.

1. In the consumer, remove the schema header from the record, and deserialize the Avro record.

   ```
   //The following lines remove Schema Header from record
           GlueSchemaRegistryDeserializerImpl glueSchemaRegistryDeserializer =
               new GlueSchemaRegistryDeserializerImpl(DefaultCredentialsProvider.builder().build(), getConfigs());
           byte[] recordWithSchemaHeaderBytes = new byte[recordWithSchemaHeader.remaining()];
           recordWithSchemaHeader.get(recordWithSchemaHeaderBytes, 0, recordWithSchemaHeaderBytes.length);
           com.amazonaws.services.schemaregistry.common.Schema awsSchema =
               glueSchemaRegistryDeserializer.getSchema(recordWithSchemaHeaderBytes);
           byte[] record = glueSchemaRegistryDeserializer.getData(recordWithSchemaHeaderBytes);
   
           //The following lines deserialize an AVRO schema record
           if (DataFormat.AVRO.name().equals(awsSchema.getDataFormat())) {
               Schema avroSchema = new org.apache.avro.Schema.Parser().parse(awsSchema.getSchemaDefinition());
               Object genericRecord = convertBytesToRecord(avroSchema, record);
               System.out.println(genericRecord);
           }
   ```

#### Complete example using the Kinesis Data Streams APIs
<a name="schema-registry-integrations-kds-apis-reference"></a>

The following is example code for using the `PutRecords` and `GetRecords` APIs.

```
//Full sample code
import com.amazonaws.services.schemaregistry.deserializers.GlueSchemaRegistryDeserializerImpl;
import com.amazonaws.services.schemaregistry.serializers.GlueSchemaRegistrySerializerImpl;
import com.amazonaws.services.schemaregistry.utils.AVROUtils;
import com.amazonaws.services.schemaregistry.utils.AWSSchemaRegistryConstants;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;
import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider;
import software.amazon.awssdk.services.glue.model.DataFormat;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;


public class PutAndGetExampleWithEncodedData {
    static final String regionName = "us-east-2";
    static final String streamName = "testStream1";
    static final String schemaName = "User-Topic";
    static final String AVRO_USER_SCHEMA_FILE = "src/main/resources/user.avsc";
    KinesisApi kinesisApi = new KinesisApi();

    void runSampleForPutRecord() throws IOException {
        Object testRecord = getTestRecord();
        byte[] recordAsBytes = convertRecordToBytes(testRecord);
        String schemaDefinition = AVROUtils.getInstance().getSchemaDefinition(testRecord);

        //The following lines add a Schema Header to a record
        com.amazonaws.services.schemaregistry.common.Schema awsSchema =
            new com.amazonaws.services.schemaregistry.common.Schema(schemaDefinition, DataFormat.AVRO.name(),
                schemaName);
        GlueSchemaRegistrySerializerImpl glueSchemaRegistrySerializer =
            new GlueSchemaRegistrySerializerImpl(DefaultCredentialsProvider.builder().build(), new GlueSchemaRegistryConfiguration(regionName));
        byte[] recordWithSchemaHeader =
            glueSchemaRegistrySerializer.encode(streamName, awsSchema, recordAsBytes);

        //Use PutRecords api to pass a list of records
        kinesisApi.putRecords(Collections.singletonList(recordWithSchemaHeader), streamName, regionName);

        //OR
        //Use PutRecord api to pass single record
        //kinesisApi.putRecord(recordWithSchemaHeader, streamName, regionName);
    }

    byte[] runSampleForGetRecord() throws IOException {
        ByteBuffer recordWithSchemaHeader = kinesisApi.getRecords(streamName, regionName);

        //The following lines remove the schema registry header
        GlueSchemaRegistryDeserializerImpl glueSchemaRegistryDeserializer =
            new GlueSchemaRegistryDeserializerImpl(DefaultCredentialsProvider.builder().build(), new GlueSchemaRegistryConfiguration(regionName));
        byte[] recordWithSchemaHeaderBytes = new byte[recordWithSchemaHeader.remaining()];
        recordWithSchemaHeader.get(recordWithSchemaHeaderBytes, 0, recordWithSchemaHeaderBytes.length);

        com.amazonaws.services.schemaregistry.common.Schema awsSchema =
            glueSchemaRegistryDeserializer.getSchema(recordWithSchemaHeaderBytes);

        byte[] record = glueSchemaRegistryDeserializer.getData(recordWithSchemaHeaderBytes);

        //The following lines deserialize an AVRO schema record
        if (DataFormat.AVRO.name().equals(awsSchema.getDataFormat())) {
            Schema avroSchema = new org.apache.avro.Schema.Parser().parse(awsSchema.getSchemaDefinition());
            Object genericRecord = convertBytesToRecord(avroSchema, record);
            System.out.println(genericRecord);
        }

        return record;
    }

    private byte[] convertRecordToBytes(final Object record) throws IOException {
        ByteArrayOutputStream recordAsBytes = new ByteArrayOutputStream();
        Encoder encoder = EncoderFactory.get().directBinaryEncoder(recordAsBytes, null);
        GenericDatumWriter datumWriter = new GenericDatumWriter<>(AVROUtils.getInstance().getSchema(record));
        datumWriter.write(record, encoder);
        encoder.flush();
        return recordAsBytes.toByteArray();
    }

    private GenericRecord convertBytesToRecord(Schema avroSchema, byte[] record) throws IOException {
        final GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>(avroSchema);
        Decoder decoder = DecoderFactory.get().binaryDecoder(record, null);
        GenericRecord genericRecord = datumReader.read(null, decoder);
        return genericRecord;
    }

    private Map<String, String> getMetadata() {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("event-source-1", "topic1");
        metadata.put("event-source-2", "topic2");
        metadata.put("event-source-3", "topic3");
        metadata.put("event-source-4", "topic4");
        metadata.put("event-source-5", "topic5");
        return metadata;
    }

    private GlueSchemaRegistryConfiguration getConfigs() {
        GlueSchemaRegistryConfiguration configs = new GlueSchemaRegistryConfiguration(regionName);
        configs.setSchemaName(schemaName);
        configs.setAutoRegistration(true);
        configs.setMetadata(getMetadata());
        return configs;
    }

    private Object getTestRecord() throws IOException {
        GenericRecord genericRecord;
        Schema.Parser parser = new Schema.Parser();
        Schema avroSchema = parser.parse(new File(AVRO_USER_SCHEMA_FILE));

        genericRecord = new GenericData.Record(avroSchema);
        genericRecord.put("name", "testName");
        genericRecord.put("favorite_number", 99);
        genericRecord.put("favorite_color", "red");

        return genericRecord;
    }
}
```

## Use case: Amazon Managed Service for Apache Flink
<a name="schema-registry-integrations-kinesis-data-analytics-apache-flink"></a>

Apache Flink is a popular open source framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Amazon Managed Service for Apache Flink is a fully managed AWS service that enables you to build and manage Apache Flink applications to process streaming data.

Open source Apache Flink provides a number of sources and sinks. For example, predefined data sources include reading from files, directories, and sockets, and ingesting data from collections and iterators. Apache Flink DataStream Connectors provide code for Apache Flink to interface with various third-party systems, such as Apache Kafka or Kinesis as sources and/or sinks.

For more information, see the [Amazon Kinesis Data Analytics Developer Guide](https://docs.aws.amazon.com/kinesisanalytics/latest/java/what-is.html).

### Apache Flink Kafka connector
<a name="schema-registry-integrations-kafka-connector"></a>

Apache Flink provides an Apache Kafka data stream connector for reading data from and writing data to Kafka topics with exactly-once guarantees. Flink's Kafka consumer, `FlinkKafkaConsumer`, provides access to read from one or more Kafka topics. Apache Flink’s Kafka Producer, `FlinkKafkaProducer`, allows writing a stream of records to one or more Kafka topics. For more information, see [Apache Kafka Connector](https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html).

### Apache Flink Kinesis streams Connector
<a name="schema-registry-integrations-kinesis-connector"></a>

The Kinesis data stream connector provides access to Amazon Kinesis Data Streams. The `FlinkKinesisConsumer` is an exactly-once parallel streaming data source that subscribes to multiple Kinesis streams within the same AWS service region, and can transparently handle re-sharding of streams while the job is running. Each subtask of the consumer is responsible for fetching data records from multiple Kinesis shards. The number of shards fetched by each subtask will change as shards are closed and created by Kinesis. The `FlinkKinesisProducer` uses Kinesis Producer Library (KPL) to put data from an Apache Flink stream into a Kinesis stream. For more information, see [Amazon Kinesis Streams Connector](https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/connectors/kinesis.html).

For more information, see the [AWS Glue Schema Github repository](https://github.com/awslabs/aws-glue-schema-registry).

### Integrating with Apache Flink
<a name="schema-registry-integrations-apache-flink-integrate"></a>

The SerDes library provided with the Schema Registry integrates with Apache Flink. To work with Apache Flink, you implement the [SerializationSchema](https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/util/serialization/SerializationSchema.java) and [DeserializationSchema](https://github.com/apache/flink/blob/8674b69964eae50cad024f2c5caf92a71bf21a09/flink-core/src/main/java/org/apache/flink/api/common/serialization/DeserializationSchema.java) interfaces. The library provides implementations called `GlueSchemaRegistryAvroSerializationSchema` and `GlueSchemaRegistryAvroDeserializationSchema`, which you can plug into Apache Flink connectors.

### Adding an AWS Glue Schema Registry dependency into the Apache Flink application
<a name="schema-registry-integrations-kinesis-data-analytics-dependencies"></a>

To set up the integration dependencies to AWS Glue Schema Registry in the Apache Flink application:

1. Add the dependency to the `pom.xml` file.

   ```
   <dependency>
       <groupId>software.amazon.glue</groupId>
       <artifactId>schema-registry-flink-serde</artifactId>
       <version>1.0.0</version>
   </dependency>
   ```
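If your build uses Gradle rather than Maven, the same artifact can be declared with the equivalent coordinates (a sketch, assuming the same 1.0.0 version as above):

```
dependencies {
    implementation 'software.amazon.glue:schema-registry-flink-serde:1.0.0'
}
```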

#### Integrating Kafka or Amazon MSK with Apache Flink
<a name="schema-registry-integrations-kda-integrate-msk"></a>

You can use Amazon Managed Service for Apache Flink with Kafka as a source or as a sink.

**Kafka as a source**  
The following diagram shows an Amazon Managed Service for Apache Flink application with Kafka as a source.

![\[Kafka as a source.\]](http://docs.aws.amazon.com/glue/latest/dg/images/gsr-kafka-source.png)


**Kafka as a sink**  
The following diagram shows an Amazon Managed Service for Apache Flink application with Kafka as a sink.

![\[Kafka as a sink.\]](http://docs.aws.amazon.com/glue/latest/dg/images/gsr-kafka-sink.png)


To integrate Kafka (or Amazon MSK) with Amazon Managed Service for Apache Flink, with Kafka as a source or as a sink, make the code changes below, adding the numbered code blocks to the analogous sections of your own code.

If Kafka is the source, use the deserializer code (block 2). If Kafka is the sink, use the serializer code (block 3).

```
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

String topic = "topic";
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "test");

// block 1
Map<String, Object> configs = new HashMap<>();
configs.put(AWSSchemaRegistryConstants.AWS_REGION, "aws-region");
configs.put(AWSSchemaRegistryConstants.SCHEMA_AUTO_REGISTRATION_SETTING, true);
configs.put(AWSSchemaRegistryConstants.AVRO_RECORD_TYPE, AvroRecordType.GENERIC_RECORD.getName());

FlinkKafkaConsumer<GenericRecord> consumer = new FlinkKafkaConsumer<>(
    topic,
    // block 2
    GlueSchemaRegistryAvroDeserializationSchema.forGeneric(schema, configs),
    properties);

FlinkKafkaProducer<GenericRecord> producer = new FlinkKafkaProducer<>(
    topic,
    // block 3
    GlueSchemaRegistryAvroSerializationSchema.forGeneric(schema, topic, configs),
    properties);

DataStream<GenericRecord> stream = env.addSource(consumer);
stream.addSink(producer);
env.execute();
```

#### Integrating Kinesis Data Streams with Apache Flink
<a name="schema-registry-integrations-integrate-kds"></a>

You can use Amazon Managed Service for Apache Flink with Kinesis Data Streams as a source or as a sink.

**Kinesis Data Streams as a source**  
The following diagram shows an Amazon Managed Service for Apache Flink application with Kinesis Data Streams as a source.

![\[Kinesis Data Streams as a source.\]](http://docs.aws.amazon.com/glue/latest/dg/images/gsr-kinesis-source.png)


**Kinesis Data Streams as a sink**  
The following diagram shows an Amazon Managed Service for Apache Flink application with Kinesis Data Streams as a sink.

![\[Kinesis Data Streams as a sink.\]](http://docs.aws.amazon.com/glue/latest/dg/images/gsr-kinesis-sink.png)


To integrate Kinesis Data Streams with Amazon Managed Service for Apache Flink, with Kinesis Data Streams as a source or as a sink, make the code changes below, adding the numbered code blocks to the analogous sections of your own code.

If Kinesis Data Streams is the source, use the deserializer code (block 2). If Kinesis Data Streams is the sink, use the serializer code (block 3).

```
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

String streamName = "stream";
Properties consumerConfig = new Properties();
consumerConfig.put(AWSConfigConstants.AWS_REGION, "aws-region");
consumerConfig.put(AWSConfigConstants.AWS_ACCESS_KEY_ID, "aws_access_key_id");
consumerConfig.put(AWSConfigConstants.AWS_SECRET_ACCESS_KEY, "aws_secret_access_key");
consumerConfig.put(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST");

// block 1
Map<String, Object> configs = new HashMap<>();
configs.put(AWSSchemaRegistryConstants.AWS_REGION, "aws-region");
configs.put(AWSSchemaRegistryConstants.SCHEMA_AUTO_REGISTRATION_SETTING, true);
configs.put(AWSSchemaRegistryConstants.AVRO_RECORD_TYPE, AvroRecordType.GENERIC_RECORD.getName());

FlinkKinesisConsumer<GenericRecord> consumer = new FlinkKinesisConsumer<>(
    streamName,
    // block 2
    GlueSchemaRegistryAvroDeserializationSchema.forGeneric(schema, configs),
    consumerConfig);

FlinkKinesisProducer<GenericRecord> producer = new FlinkKinesisProducer<>(
    // block 3
    GlueSchemaRegistryAvroSerializationSchema.forGeneric(schema, streamName, configs),
    consumerConfig);
producer.setDefaultStream(streamName);
producer.setDefaultPartition("0");

DataStream<GenericRecord> stream = env.addSource(consumer);
stream.addSink(producer);
env.execute();
```

## Use case: Integration with AWS Lambda
<a name="schema-registry-integrations-aws-lambda"></a>

To use an AWS Lambda function as an Apache Kafka/Amazon MSK consumer and deserialize Avro-encoded messages using AWS Glue Schema Registry, visit the [MSK Labs page](https://amazonmsk-labs.workshop.aws/en/msklambda/gsrschemareg.html).

## Use case: AWS Glue Data Catalog
<a name="schema-registry-integrations-aws-glue-data-catalog"></a>

AWS Glue tables support schemas that you can specify manually or by reference to the AWS Glue Schema Registry. The Schema Registry integrates with the Data Catalog to allow you to optionally use schemas stored in the Schema Registry when creating or updating AWS Glue tables or partitions in the Data Catalog. To identify a schema definition in the Schema Registry, at a minimum, you need to know the ARN of the schema it is part of. A schema version of a schema, which contains a schema definition, can be referenced by its UUID or version number. There is always one schema version, the "latest" version, that can be looked up without knowing its version number or UUID.

When calling the `CreateTable` or `UpdateTable` operations, you pass a `TableInput` structure that contains a `StorageDescriptor`, which may have a `SchemaReference` to an existing schema in the Schema Registry. Similarly, when you call the `GetTable` or `GetPartition` APIs, the response may contain the schema and the `SchemaReference`. When a table or partition was created using a schema reference, the Data Catalog tries to fetch the schema for that reference. If it is unable to find the schema in the Schema Registry, it returns an empty schema in the `GetTable` response; otherwise the response contains both the schema and the schema reference.

You can also perform the actions from the AWS Glue console.

To perform these operations and create, update, or view the schema information, the calling user must have an IAM role that grants permissions for the `GetSchemaVersion` API.
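As an illustrative sketch, a minimal identity-based policy statement granting that permission might look like the following (in practice, scope the `Resource` to your registry and schema ARNs instead of `*`):

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["glue:GetSchemaVersion"],
            "Resource": "*"
        }
    ]
}
```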

### Adding a table or updating the schema for a table
<a name="schema-registry-integrations-aws-glue-data-catalog-table"></a>

Adding a new table from an existing schema binds the table to a specific schema version. Once new schema versions get registered, you can update this table definition from the View table page in the AWS Glue console or using the [UpdateTable action (Python: update_table)](aws-glue-api-catalog-tables.md#aws-glue-api-catalog-tables-UpdateTable) API.

#### Adding a table from an existing schema
<a name="schema-registry-integrations-aws-glue-data-catalog-table-existing"></a>

You can create an AWS Glue table from a schema version in the registry using the AWS Glue console or `CreateTable` API.

**AWS Glue API**  
When calling the `CreateTable` API, you will pass a `TableInput` that contains a `StorageDescriptor` which has a `SchemaReference` to an existing schema in the Schema Registry.
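For illustration, the `SchemaReference` inside the `StorageDescriptor` might look like the following JSON fragment (the Region, account ID, registry name, and schema name are placeholders; you can alternatively reference a version by its `SchemaVersionId`):

```
"StorageDescriptor": {
    "SchemaReference": {
        "SchemaId": {
            "SchemaArn": "arn:aws:glue:us-east-2:123456789012:schema/my-registry/my-schema"
        },
        "SchemaVersionNumber": 1
    }
}
```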

**AWS Glue console**  
To create a table from the AWS Glue console:

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, under **Data catalog**, choose **Tables**.

1. In the **Add Tables** menu, choose **Add table from existing schema**.

1. Configure the table properties and data store per the AWS Glue Developer Guide.

1. In the **Choose a Glue schema** page, select the **Registry** where the schema resides.

1. Choose the **Schema name** and select the **Version** of the schema to apply.

1. Review the schema preview, and choose **Next**.

1. Review and create the table.

The schema and version applied to the table appears in the **Glue schema** column in the list of tables. You can view the table to see more details.

#### Updating the schema for a table
<a name="schema-registry-integrations-aws-glue-data-catalog-table-updating"></a>

When a new schema version becomes available, you may want to update a table's schema using the [UpdateTable action (Python: update_table)](aws-glue-api-catalog-tables.md#aws-glue-api-catalog-tables-UpdateTable) API or the AWS Glue console.

**Important**  
When updating the schema for an existing table that has an AWS Glue schema specified manually, the new schema referenced in the Schema Registry may be incompatible. This can result in your jobs failing.

**AWS Glue API**  
When calling the `UpdateTable` API, you will pass a `TableInput` that contains a `StorageDescriptor` which has a `SchemaReference` to an existing schema in the Schema Registry.

**AWS Glue console**  
To update the schema for a table from the AWS Glue console:

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, under **Data catalog**, choose **Tables**.

1. View the table from the list of tables.

1. Click **Update schema** in the box that informs you about a new version.

1. Review the differences between the current and new schema.

1. Choose **Show all schema differences** to see more details.

1. Choose **Save table** to accept the new version.

## Use case: AWS Glue streaming
<a name="schema-registry-integrations-aws-glue-streaming"></a>

AWS Glue streaming consumes data from streaming sources and performs ETL operations before writing to an output sink. The input streaming source can be specified using a Data Catalog table or directly by specifying the source configuration.

AWS Glue streaming supports Data Catalog tables for the streaming source that are created with a schema in the AWS Glue Schema Registry. You can create a schema in the AWS Glue Schema Registry and create an AWS Glue table with a streaming source using this schema. This table can then be used as the input to an AWS Glue streaming job for deserializing data in the input stream.

One point to note here is that when the schema in the AWS Glue Schema Registry changes, you need to restart the AWS Glue streaming job to reflect the changes in the schema.

## Use case: Apache Kafka Streams
<a name="schema-registry-integrations-apache-kafka-streams"></a>

The Apache Kafka Streams API is a client library for processing and analyzing data stored in Apache Kafka. This section describes the integration of Apache Kafka Streams with AWS Glue Schema Registry, which allows you to manage and enforce schemas on your data streaming applications. For more information on Apache Kafka Streams, see [Apache Kafka Streams](https://kafka.apache.org/documentation/streams/).

### Integrating with the SerDes Libraries
<a name="schema-registry-integrations-apache-kafka-streams-integrate"></a>

The SerDes library includes a `GlueSchemaRegistryKafkaStreamsSerde` class that you can use to configure a Kafka Streams application.

#### Kafka Streams application example code
<a name="schema-registry-integrations-apache-kafka-streams-application"></a>

To use the AWS Glue Schema Registry within an Apache Kafka Streams application:

1. Configure the Kafka Streams application.

   ```
   final Properties props = new Properties();
   props.put(StreamsConfig.APPLICATION_ID_CONFIG, "avro-streams");
   props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
   props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
   props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
   props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, AWSKafkaAvroSerDe.class.getName());
   props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

   props.put(AWSSchemaRegistryConstants.AWS_REGION, "aws-region");
   props.put(AWSSchemaRegistryConstants.SCHEMA_AUTO_REGISTRATION_SETTING, true);
   props.put(AWSSchemaRegistryConstants.AVRO_RECORD_TYPE, AvroRecordType.GENERIC_RECORD.getName());
   props.put(AWSSchemaRegistryConstants.DATA_FORMAT, DataFormat.AVRO.name());
   ```

1. Create a stream from the topic avro-input.

   ```
   StreamsBuilder builder = new StreamsBuilder();
   final KStream<String, GenericRecord> source = builder.stream("avro-input");
   ```

1. Process the data records (the example filters out records whose value of `favorite_color` is "pink" or whose value of `amount` is 15).

   ```
   final KStream<String, GenericRecord> result = source
       .filter((key, value) -> !"pink".equals(String.valueOf(value.get("favorite_color"))))
       .filter((key, value) -> !"15.0".equals(String.valueOf(value.get("amount"))));
   ```

1. Write the results back to the topic avro-output.

   ```
   result.to("avro-output");
   ```

1. Start the Apache Kafka Streams application.

   ```
   KafkaStreams streams = new KafkaStreams(builder.build(), props);
   streams.start();
   ```
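Because both predicates compare `String.valueOf` of a field against a literal, the filtering logic is easy to sanity-check outside Kafka. Below is a minimal, dependency-free sketch of the same predicates, with records modeled as plain `Map`s rather than Avro `GenericRecord`s (a simplifying assumption; the class and method names are illustrative):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Dependency-free sketch of the step 3 predicates; records are modeled as
// Maps instead of Avro GenericRecords (a simplifying assumption).
class FilterSketch {
    // Drop records whose favorite_color stringifies to "pink".
    static final Predicate<Map<String, ?>> notPink =
            r -> !"pink".equals(String.valueOf(r.get("favorite_color")));
    // Drop records whose amount stringifies to "15.0".
    static final Predicate<Map<String, ?>> notFifteen =
            r -> !"15.0".equals(String.valueOf(r.get("amount")));

    static <M extends Map<String, ?>> List<M> filter(List<M> records) {
        return records.stream()
                .filter(notPink.and(notFifteen))
                .collect(Collectors.toList());
    }
}
```

Note that a numeric `amount` read as an Avro `double` stringifies to `"15.0"`, which is why the predicate compares against that literal rather than `"15"`.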

#### Implementation results
<a name="schema-registry-integrations-apache-kafka-streams-results"></a>

These results show the effect of the filtering in step 3, which removes records with a `favorite_color` of "pink" or an `amount` of "15.0".

Records before filtering:

```
{"name": "Sansa", "favorite_number": 99, "favorite_color": "white"}
{"name": "Harry", "favorite_number": 10, "favorite_color": "black"}
{"name": "Hermione", "favorite_number": 1, "favorite_color": "red"}
{"name": "Ron", "favorite_number": 0, "favorite_color": "pink"}
{"name": "Jay", "favorite_number": 0, "favorite_color": "pink"}

{"id": "commute_1","amount": 3.5}
{"id": "grocery_1","amount": 25.5}
{"id": "entertainment_1","amount": 19.2}
{"id": "entertainment_2","amount": 105}
{"id": "commute_1","amount": 15}
```

Records after filtering:

```
{"name": "Sansa", "favorite_number": 99, "favorite_color": "white"}
{"name": "Harry", "favorite_number": 10, "favorite_color": "black"}
{"name": "Hermione", "favorite_number": 1, "favorite_color": "red"}

{"id": "commute_1","amount": 3.5}
{"id": "grocery_1","amount": 25.5}
{"id": "entertainment_1","amount": 19.2}
{"id": "entertainment_2","amount": 105}
```

## Use case: Apache Kafka Connect
<a name="schema-registry-integrations-apache-kafka-connect"></a>

Integrating Apache Kafka Connect with the AWS Glue Schema Registry enables you to get schema information from connectors. Apache Kafka converters specify the format of data within Apache Kafka and how to translate it into Apache Kafka Connect data. Every Apache Kafka Connect user needs to configure these converters based on the format they want their data in when loaded from or stored into Apache Kafka. You can define your own converters to translate Apache Kafka Connect data into the type used in the AWS Glue Schema Registry (for example, Avro), use the Schema Registry serializer to register the schema and perform serialization, and use the Schema Registry deserializer to deserialize data received from Apache Kafka and convert it back into Apache Kafka Connect data. An example workflow diagram is given below.

![\[Apache Kafka Connect workflow.\]](http://docs.aws.amazon.com/glue/latest/dg/images/schema_reg_int_kafka_connect.png)


1. Install the `aws-glue-schema-registry` project by cloning the [Github repository for the AWS Glue Schema Registry](https://github.com/awslabs/aws-glue-schema-registry).

   ```
   git clone git@github.com:awslabs/aws-glue-schema-registry.git
   cd aws-glue-schema-registry
   mvn clean install
   mvn dependency:copy-dependencies
   ```

1. If you plan on using Apache Kafka Connect in *Standalone* mode, update **connect-standalone.properties** using the instructions below for this step. If you plan on using Apache Kafka Connect in *Distributed* mode, update **connect-avro-distributed.properties** using the same instructions.

   1. Add these properties to the Apache Kafka Connect properties file:

      ```
      key.converter.region=aws-region
      value.converter.region=aws-region
      key.converter.schemaAutoRegistrationEnabled=true
      value.converter.schemaAutoRegistrationEnabled=true
      key.converter.avroRecordType=GENERIC_RECORD
      value.converter.avroRecordType=GENERIC_RECORD
      ```


1. Add the command below to the **Launch mode** section under **kafka-run-class.sh**:

   ```
   -cp $CLASSPATH:"<your AWS Glue Schema Registry base directory>/target/dependency/*"
   ```

   It should look like this:

   ```
   # Launch mode
   if [ "x$DAEMON_MODE" = "xtrue" ]; then
     nohup "$JAVA" $KAFKA_HEAP_OPTS $KAFKA_JVM_PERFORMANCE_OPTS $KAFKA_GC_LOG_OPTS $KAFKA_JMX_OPTS $KAFKA_LOG4J_OPTS -cp $CLASSPATH:"/Users/johndoe/aws-glue-schema-registry/target/dependency/*" $KAFKA_OPTS "$@" > "$CONSOLE_OUTPUT_FILE" 2>&1 < /dev/null &
   else
     exec "$JAVA" $KAFKA_HEAP_OPTS $KAFKA_JVM_PERFORMANCE_OPTS $KAFKA_GC_LOG_OPTS $KAFKA_JMX_OPTS $KAFKA_LOG4J_OPTS -cp $CLASSPATH:"/Users/johndoe/aws-glue-schema-registry/target/dependency/*" $KAFKA_OPTS "$@"
   fi
   ```

1. If using bash, run the following commands to set up your CLASSPATH in your `~/.bash_profile`. For any other shell, update the environment accordingly.

   ```
   echo 'export GSR_LIB_BASE_DIR=<your AWS Glue Schema Registry base directory>' >>~/.bash_profile
   echo 'export GSR_LIB_VERSION=1.0.0' >>~/.bash_profile
   echo 'export KAFKA_HOME=<your Apache Kafka installation directory>' >>~/.bash_profile
   echo 'export CLASSPATH=$CLASSPATH:$GSR_LIB_BASE_DIR/avro-kafkaconnect-converter/target/schema-registry-kafkaconnect-converter-$GSR_LIB_VERSION.jar:$GSR_LIB_BASE_DIR/common/target/schema-registry-common-$GSR_LIB_VERSION.jar:$GSR_LIB_BASE_DIR/avro-serializer-deserializer/target/schema-registry-serde-$GSR_LIB_VERSION.jar' >>~/.bash_profile
   source ~/.bash_profile
   ```

1. (Optional) If you want to test with a simple file source, clone the file source connector.

   ```
   git clone https://github.com/mmolimar/kafka-connect-fs.git
   cd kafka-connect-fs/
   ```

   1. Under the source connector configuration, edit the data format to Avro, set the file reader to `AvroFileReader`, and update the file path to point at an example Avro object to read from. For example:

      ```
      vim config/kafka-connect-fs.properties
      ```

      ```
      fs.uris=<path to a sample avro object>
      policy.regexp=^.*\.avro$
      file_reader.class=com.github.mmolimar.kafka.connect.fs.file.reader.AvroFileReader
      ```

   1. Install the source connector.

      ```
      mvn clean package
      echo "export CLASSPATH=\$CLASSPATH:\"\$(find target/ -type f -name '*.jar'| grep '\-package' | tr '\n' ':')\"" >>~/.bash_profile
      source ~/.bash_profile
      ```

   1. Update the sink properties under `<your Apache Kafka installation directory>/config/connect-file-sink.properties` to set the topic name and output file name.

      ```
      file=<output file full path>
      topics=<my topic>
      ```

1. Start the Source Connector (in this example it is a file source connector).

   ```
   $KAFKA_HOME/bin/connect-standalone.sh $KAFKA_HOME/config/connect-standalone.properties config/kafka-connect-fs.properties
   ```

1. Run the Sink Connector (in this example it is a file sink connector).

   ```
   $KAFKA_HOME/bin/connect-standalone.sh $KAFKA_HOME/config/connect-standalone.properties $KAFKA_HOME/config/connect-file-sink.properties
   ```

   For an example of Kafka Connect usage, see the run-local-tests.sh script under the integration-tests folder in the [GitHub repository for the AWS Glue Schema Registry](https://github.com/awslabs/aws-glue-schema-registry/tree/master/integration-tests).

# Migration from a third-party schema registry to AWS Glue Schema Registry
<a name="schema-registry-integrations-migration"></a>

Migration from a third-party schema registry to the AWS Glue Schema Registry depends on the existing third-party schema registry. If records in an Apache Kafka topic were sent using a third-party schema registry, consumers need that registry to deserialize those records. The `AWSKafkaAvroDeserializer` lets you specify a secondary deserializer class that points to the third-party deserializer and is used to deserialize those records.

There are two criteria for retiring a third-party schema registry. First, retirement can occur only after records in Apache Kafka topics that use the third-party schema registry are no longer required by any consumers. Second, the records must age out of the Apache Kafka topics, depending on the retention period specified for those topics. Note that if you have topics with infinite retention, you can still migrate to the AWS Glue Schema Registry, but you will not be able to retire the third-party schema registry. As a workaround, you can use an application or MirrorMaker 2 to read from the current topic and produce to a new topic with the AWS Glue Schema Registry.

To migrate from a third-party schema registry to the AWS Glue Schema Registry:

1. Create a registry in the AWS Glue Schema Registry, or use the default registry.

1. Stop the consumer. Modify it to include AWS Glue Schema Registry as the primary deserializer, and the third-party schema registry as the secondary. 
   + Set the consumer properties. In this example, the secondary deserializer is set to a different deserializer. The behavior is as follows: the consumer retrieves records from Amazon MSK and first tries to use the `AWSKafkaAvroDeserializer`. If it is unable to read the magic byte that contains the Avro schema ID for the AWS Glue Schema Registry schema, the `AWSKafkaAvroDeserializer` then tries to use the deserializer class provided in the secondary deserializer. The properties specific to the secondary deserializer also need to be provided in the consumer properties, such as `SCHEMA_REGISTRY_URL_CONFIG` and `SPECIFIC_AVRO_READER_CONFIG`, as shown below.

     ```
     consumerProps.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
     consumerProps.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, AWSKafkaAvroDeserializer.class.getName());
     consumerProps.setProperty(AWSSchemaRegistryConstants.AWS_REGION, KafkaClickstreamConsumer.gsrRegion);
     consumerProps.setProperty(AWSSchemaRegistryConstants.SECONDARY_DESERIALIZER, KafkaAvroDeserializer.class.getName());
     consumerProps.setProperty(KafkaAvroDeserializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "URL for third-party schema registry");
     consumerProps.setProperty(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, "true");
     ```

1. Restart the consumer.

1. Stop the producer and point the producer to the AWS Glue Schema Registry.

   1. Set the producer properties. In this example, the producer will use the default-registry and auto register schema versions.

      ```
      producerProps.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
      producerProps.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, AWSKafkaAvroSerializer.class.getName());
      producerProps.setProperty(AWSSchemaRegistryConstants.AWS_REGION, "us-east-2");
      producerProps.setProperty(AWSSchemaRegistryConstants.AVRO_RECORD_TYPE, AvroRecordType.SPECIFIC_RECORD.getName());
      producerProps.setProperty(AWSSchemaRegistryConstants.SCHEMA_AUTO_REGISTRATION_SETTING, "true");
      ```

1. (Optional) Manually move existing schemas and schema versions from the current third-party schema registry to the AWS Glue Schema Registry, either to the default-registry or to a specific non-default registry. This can be done by exporting schemas from the third-party schema registry in JSON format and creating new schemas in the AWS Glue Schema Registry using the AWS Management Console or the AWS CLI.

    This step may be important if you need to enable compatibility checks with previous schema versions for newly created schema versions using the AWS CLI and the AWS Management Console, or when producers send messages with a new schema with auto-registration of schema versions turned on.

1. Start the producer.
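As a sketch of the optional schema migration in step 5, you can register an exported schema definition with the AWS CLI. The registry name, schema name, compatibility mode, and file name below are placeholders; adjust them to your setup:

```
aws glue create-schema \
    --registry-id RegistryName=my-registry \
    --schema-name my-schema \
    --data-format AVRO \
    --compatibility BACKWARD \
    --schema-definition file://exported-schema.avsc
```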