

# Defining metadata manually
<a name="populate-dg-manual"></a>

 The AWS Glue Data Catalog is a central repository that stores metadata about your data sources and data sets. While a crawler can automatically crawl and populate metadata for supported data sources, there are certain scenarios where you may need to define metadata manually in the Data Catalog: 
+ Unsupported data formats – If you have data sources that are not supported by the crawler, you need to manually define the metadata for those data sources in the Data Catalog.
+ Custom metadata requirements – The AWS Glue crawler infers metadata based on predefined rules and conventions. If you have specific metadata requirements that are not covered by the AWS Glue crawler inferred metadata, you can manually define the metadata to meet your needs 
+ Data governance and standardization – In some cases, you may want to have more control over the metadata definitions for data governance, compliance, or security reasons. Manually defining metadata allows you to ensure that the metadata adheres to your organization's standards and policies. 
+ Placeholder for future data ingestion – If you have data sources that are not immediately available or accessible, you can create empty schema tables as placeholders. Once the data sources become available, you can populate the tables with the actual data, while maintaining the predefined structure. 

 To define metadata manually, you can use the AWS Glue console, Lake Formation console, AWS Glue API, or the AWS Command Line Interface (AWS CLI). You can create databases, tables, and partitions, and specify metadata properties such as column names, data types, descriptions, and other attributes. 

# Creating databases
<a name="define-database"></a>

Databases are used to organize metadata tables in the AWS Glue. When you define a table in the AWS Glue Data Catalog, you add it to a database. A table can be in only one database.

Your database can contain tables that define data from many different data stores. This data can include objects in Amazon Simple Storage Service (Amazon S3) and relational tables in Amazon Relational Database Service.

**Note**  
When you delete a database from the AWS Glue Data Catalog, all the tables in the database are also deleted.

 To view the list of databases, sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). Choose **Databases**, and then choose a database name in the list to view the details. 

 From the **Databases** tab in the AWS Glue console, you can add, edit, and delete databases:
+ To create a new database, choose **Add database** and provide a name and description. For compatibility with other metadata stores, such as Apache Hive, the name is folded to lowercase characters. 
**Note**  
If you plan to access the database from Amazon Athena, then provide a name with only alphanumeric and underscore characters. For more information, see [ Athena names](https://docs.aws.amazon.com/athena/latest/ug/tables-databases-columns-names.html#ate-table-database-and-column-names-allow-only-underscore-special-characters). 
+  To edit the description for a database, select the check box next to the database name and choose **Edit**. 
+  To delete a database, select the check box next to the database name and choose **Remove**. 
+  To display the list of tables contained in the database, choose the database name and the database properties will display all tables in the database. 

To change the database that a crawler writes to, you must change the crawler definition. For more information, see [Using crawlers to populate the Data Catalog](add-crawler.md).

## Database resource links
<a name="databases-resource-links"></a>


|  | 
| --- |
| The AWS Glue console was recently updated. The current version of the console does not support Database Resource Links. | 

The Data Catalog can also contain *resource links* to databases. A database resource link is a link to a local or shared database. Currently, you can create resource links only in AWS Lake Formation. After you create a resource link to a database, you can use the resource link name wherever you would use the database name. Along with databases that you own or that are shared with you, database resource links are returned by `glue:GetDatabases()` and appear as entries on the **Databases** page of the AWS Glue console.

The Data Catalog can also contain table resource links.

For more information about resource links, see [Creating Resource Links](https://docs.aws.amazon.com/lake-formation/latest/dg/creating-resource-links.html) in the *AWS Lake Formation Developer Guide*.

# Creating tables
<a name="tables-described"></a>

Even though running a crawler is the recommended method to take inventory of the data in your data stores, you can add metadata tables to the AWS Glue Data Catalog manually. This approach allows you to have more control over the metadata definitions and customize them according them to your specific requirements.

You can also add tables to the Data Catalog manually in the following ways:
+ Use the AWS Glue console to manually create a table in the AWS Glue Data Catalog. For more information, see [Creating tables using the console](#console-tables).
+ Use the `CreateTable` operation in the [AWS Glue API](aws-glue-api.md) to create a table in the AWS Glue Data Catalog. For more information, see [CreateTable action (Python: create\$1table)](aws-glue-api-catalog-tables.md#aws-glue-api-catalog-tables-CreateTable).
+ Use CloudFormation templates. For more information, see [AWS CloudFormation for AWS Glue](populate-with-cloudformation-templates.md).

When you define a table manually using the console or an API, you specify the table schema and the value of a classification field that indicates the type and format of the data in the data source. If a crawler creates the table, the data format and schema are determined by either a built-in classifier or a custom classifier. For more information about creating a table using the AWS Glue console, see [Creating tables using the console](#console-tables).

**Topics**
+ [Table partitions](#tables-partition)
+ [Table resource links](#tables-resource-links)
+ [Creating tables using the console](#console-tables)
+ [Creating partition indexes](partition-indexes.md)
+ [Updating manually created Data Catalog tables using crawlers](#update-manual-tables)
+ [Data Catalog table properties](#table-properties)

## Table partitions
<a name="tables-partition"></a>

An AWS Glue table definition of an Amazon Simple Storage Service (Amazon S3) folder can describe a partitioned table. For example, to improve query performance, a partitioned table might separate monthly data into different files using the name of the month as a key. In AWS Glue, table definitions include the partitioning key of a table. When AWS Glue evaluates the data in Amazon S3 folders to catalog a table, it determines whether an individual table or a partitioned table is added. 

You can create partition indexes on a table to fetch a subset of the partitions instead of loading all the partitions in the table. For information about working with partition indexes, see [Creating partition indexes](partition-indexes.md).

All the following conditions must be true for AWS Glue to create a partitioned table for an Amazon S3 folder:
+ The schemas of the files are similar, as determined by AWS Glue.
+ The data format of the files is the same.
+ The compression format of the files is the same.

For example, you might own an Amazon S3 bucket named `my-app-bucket`, where you store both iOS and Android app sales data. The data is partitioned by year, month, and day. The data files for iOS and Android sales have the same schema, data format, and compression format. In the AWS Glue Data Catalog, the AWS Glue crawler creates one table definition with partitioning keys for year, month, and day. 

The following Amazon S3 listing of `my-app-bucket` shows some of the partitions. The `=` symbol is used to assign partition key values. 

```
   my-app-bucket/Sales/year=2010/month=feb/day=1/iOS.csv
   my-app-bucket/Sales/year=2010/month=feb/day=1/Android.csv
   my-app-bucket/Sales/year=2010/month=feb/day=2/iOS.csv
   my-app-bucket/Sales/year=2010/month=feb/day=2/Android.csv
   ...
   my-app-bucket/Sales/year=2017/month=feb/day=4/iOS.csv
   my-app-bucket/Sales/year=2017/month=feb/day=4/Android.csv
```

## Table resource links
<a name="tables-resource-links"></a>


|  | 
| --- |
| The AWS Glue console was recently updated. The current version of the console does not support Table Resource Links. | 

The Data Catalog can also contain *resource links* to tables. A table resource link is a link to a local or shared table. Currently, you can create resource links only in AWS Lake Formation. After you create a resource link to a table, you can use the resource link name wherever you would use the table name. Along with tables that you own or that are shared with you, table resource links are returned by `glue:GetTables()` and appear as entries on the **Tables** page of the AWS Glue console.

The Data Catalog can also contain database resource links.

For more information about resource links, see [Creating Resource Links](https://docs.aws.amazon.com/lake-formation/latest/dg/creating-resource-links.html) in the *AWS Lake Formation Developer Guide*.

## Creating tables using the console
<a name="console-tables"></a>

A table in the AWS Glue Data Catalog is the metadata definition that represents the data in a data store. You create tables when you run a crawler, or you can create a table manually in the AWS Glue console. The **Tables** list in the AWS Glue console displays values of your table's metadata. You use table definitions to specify sources and targets when you create ETL (extract, transform, and load) jobs. 

**Note**  
With recent changes to the AWS management console, you may need to modify your existing IAM roles to have the [https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-SearchTables](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-SearchTables) permission. For new role creation, the `SearchTables` API permission has already been added as default.

To get started, sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). Choose the **Tables** tab, and use the **Add tables** button to create tables either with a crawler or by manually typing attributes. 

### Adding tables on the console
<a name="console-tables-add"></a>

To use a crawler to add tables, choose **Add tables**, **Add tables using a crawler**. Then follow the instructions in the **Add crawler** wizard. When the crawler runs, tables are added to the AWS Glue Data Catalog. For more information, see [Using crawlers to populate the Data Catalog](add-crawler.md).

If you know the attributes that are required to create an Amazon Simple Storage Service (Amazon S3) table definition in your Data Catalog, you can create it with the table wizard. Choose **Add tables**, **Add table manually**, and follow the instructions in the **Add table** wizard.

When adding a table manually through the console, consider the following:
+ If you plan to access the table from Amazon Athena, then provide a name with only alphanumeric and underscore characters. For more information, see [Athena names](https://docs.aws.amazon.com/athena/latest/ug/tables-databases-columns-names.html#ate-table-database-and-column-names-allow-only-underscore-special-characters).
+ The location of your source data must be an Amazon S3 path.
+ The data format of the data must match one of the listed formats in the wizard. The corresponding classification, SerDe, and other table properties are automatically populated based on the format chosen. You can define tables with the following formats:   
**Avro**  
Apache Avro JSON binary format.  
**CSV**  
Character separated values. You also specify the delimiter of either comma, pipe, semicolon, tab, or Ctrl-A.  
**JSON**  
JavaScript Object Notation.  
**XML**  
Extensible Markup Language format. Specify the XML tag that defines a row in the data. Columns are defined within row tags.  
**Parquet**  
Apache Parquet columnar storage.  
**ORC**  
Optimized Row Columnar (ORC) file format. A format designed to efficiently store Hive data.
+ You can define a partition key for the table.
+ Currently, partitioned tables that you create with the console cannot be used in ETL jobs.

### Table attributes
<a name="console-tables-attributes"></a>

The following are some important attributes of your table:

**Name**  
The name is determined when the table is created, and you can't change it. You refer to a table name in many AWS Glue operations.

**Database**  
The container object where your table resides. This object contains an organization of your tables that exists within the AWS Glue Data Catalog and might differ from an organization in your data store. When you delete a database, all tables contained in the database are also deleted from the Data Catalog. 

**Description**  
The description of the table. You can write a description to help you understand the contents of the table.

**Table format**  
Specify creating a standard AWS Glue table, or a table in Apache Iceberg format.  
The Data Catalog provides following table optimization options to manage table storage and improve query performance for Iceberg tables.  
+ **Compaction** – Data files are merged and rewritten remove obsolete data and consolidate fragmented data into larger, more efficient files.
+ **Snapshot retention **– Snapshots are timestamped versions of an Iceberg table. Snapshot retention configurations allow customers to enforce how long to retain snapshots and how many snapshots to retain. Configuring a snapshot retention optimizer can help manage storage overhead by removing older, unnecessary snapshots and their associated underlying files.
+ **Orphan file deletion** – Orphan files are files that are no longer referenced by the Iceberg table metadata. These files can accumulate over time, especially after operations like table deletions or failed ETL jobs. Enabling orphan file deletion allows AWS Glue to periodically identify and remove these unnecessary files, freeing up storage.
For more information, see [Optimizing Iceberg tables](table-optimizers.md).

**Optimization configuration**  
You can either use the default settings or customize the settings for enabling the table optimizers.

**IAM role**  
 To run the table optimizers, the service assumes an IAM role on your behalf. You can choose an IAM role using the drop-down. Ensure that the role has the permissions required to enable compaction.  
To learn more about the required permissions for the IAM role, see [Table optimization prerequisites](optimization-prerequisites.md).

**Location**  
The pointer to the location of the data in a data store that this table definition represents.

**Classification**  
A categorization value provided when the table was created. Typically, this is written when a crawler runs and specifies the format of the source data.

**Last updated**  
The time and date (UTC) that this table was updated in the Data Catalog.

**Date added**  
The time and date (UTC) that this table was added to the Data Catalog.

**Deprecated**  
If AWS Glue discovers that a table in the Data Catalog no longer exists in its original data store, it marks the table as deprecated in the data catalog. If you run a job that references a deprecated table, the job might fail. Edit jobs that reference deprecated tables to remove them as sources and targets. We recommend that you delete deprecated tables when they are no longer needed. 

**Connection**  
If AWS Glue requires a connection to your data store, the name of the connection is associated with the table.

### Viewing and managing table details
<a name="console-tables-details"></a>

To see the details of an existing table, choose the table name in the list, and then choose **Action, View details**.

The table details include properties of your table and its schema. This view displays the schema of the table, including column names in the order defined for the table, data types, and key columns for partitions. If a column is a complex type, you can choose **View properties** to display details of the structure of that field, as shown in the following example:

```
{
"StorageDescriptor": 
    {
      "cols": {
         "FieldSchema": [
           {
             "name": "primary-1",
             "type": "CHAR",
             "comment": ""
           },
           {
             "name": "second ",
             "type": "STRING",
             "comment": ""
           }
         ]
      },
      "location": "s3://aws-logs-111122223333-us-east-1",
      "inputFormat": "",
      "outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
      "compressed": "false", 
      "numBuckets": "0",
      "SerDeInfo": {
           "name": "",
           "serializationLib": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
           "parameters": {
               "separatorChar": "|"
            }
      },
      "bucketCols": [],
      "sortCols": [],
      "parameters": {},
      "SkewedInfo": {},
      "storedAsSubDirectories": "false"
    },
    "parameters": {
       "classification": "csv"
    }
}
```

For more information about the properties of a table, such as `StorageDescriptor`, see [StorageDescriptor structure](aws-glue-api-catalog-tables.md#aws-glue-api-catalog-tables-StorageDescriptor).

To change the schema of a table, choose **Edit schema** to add and remove columns, change column names, and change data types.

 To compare different versions of a table, including its schema, choose **Compare versions** to see a side-by-side comparison of two versions of the schema for a table. For more information, see [Comparing table schema versions](#console-tables-schema-comparison). 

To display the files that make up an Amazon S3 partition, choose **View partition**. For Amazon S3 tables, the **Key** column displays the partition keys that are used to partition the table in the source data store. Partitioning is a way to divide a table into related parts based on the values of a key column, such as date, location, or department. For more information about partitions, search the internet for information about "hive partitioning."

**Note**  
To get step-by-step guidance for viewing the details of a table, see the **Explore table** tutorial in the console.

### Comparing table schema versions
<a name="console-tables-schema-comparison"></a>

 When you compare two versions of table schemas, you can compare nested row changes by expanding and collapsing nested rows, compare schemas of two versions side-by-side, and view table properties side-by-side. 

 To compare versions 

1.  From the AWS Glue console, choose **Tables**, then **Actions** and choose **Compare versions**.   
![\[The screenshot shows the Actions button when selected. The drop-down menu displays the Compare versions option.\]](http://docs.aws.amazon.com/glue/latest/dg/images/catalog-table-compare-versions.png)

1.  Choose a version to compare by choosing the version drop-down menu. When comparing schemas, the Schema tab is highlighted in orange. 

1.  When you compare tables between two versions, the table schemas are presented to you on the left and right side of the screen. This enables you to determine changes visually by comparing the Column name, data type, key, and comment fields side-by-side. When there is a change, a colored icon displays the type of change that was made. 
   +  Deleted – displayed by a red icon indicates where the column was removed from a previous version of the table schema. 
   +  Edited or Moved – displayed by a blue icon indicates where the column was modified or moved in a newer version of the table schema. 
   +  Added – displayed by a green icon indicates where the column was added to a newer version of the table schema. 
   +  Nested changes – displayed by a yellow icon indicates where the nested column contains changes. Choose the column to expand and view the columns that have either been deleted, edited, moved, or added.   
![\[The screenshot shows the table schema comparison between two versions. On the left side is the older version. On the right side is the newer version. The delete icon is next to a column that was removed from the older version and is no longer in the newer version.\]](http://docs.aws.amazon.com/glue/latest/dg/images/catalog-table-version-comparison.png)

1.  Use the filter fields search bar to display fields based on the characters you enter here. If you enter a column name in either table version, the filtered fields are displayed in both table versions to show you where the changes have occurred. 

1.  To compare properties, choose the **Properties tab**. 

1.  To stop comparing versions, choose **Stop comparing** to return to the list of tables. 

# Creating partition indexes
<a name="partition-indexes"></a>

Over time, hundreds of thousands of partitions get added to a table. The [GetPartitions API](https://docs.aws.amazon.com/glue/latest/webapi/API_GetPartitions.html) is used to fetch the partitions in the table. The API returns partitions that match the expression provided in the request.

Lets take a *sales\$1data* table as an example which is partitioned by the keys *Country*, *Category*, *Year*, *Month*, and *creationDate*. If you want to obtain sales data for all the items sold for the *Books* category in the year 2020 after *2020-08-15*, you have to make a `GetPartitions` request with the expression "Category = 'Books' and creationDate > '2020-08-15'" to the Data Catalog.

If no partition indexes are present on the table, AWS Glue loads all the partitions of the table, and then filters the loaded partitions using the query expression provided by the user in the `GetPartitions` request. The query takes more time to run as the number of partitions increase on a table with no indexes. With an index, the `GetPartitions` query will try to fetch a subset of the partitions instead of loading all the partitions in the table.

**Topics**
+ [About partition indexes](#partition-index-1)
+ [Creating a table with partition indexes](#partition-index-creating-table)
+ [Adding a partition index to an existing table](#partition-index-existing-table)
+ [Describing partition indexes on a table](#partition-index-describing)
+ [Limitations on using partition indexes](#partition-index-limitations)
+ [Using indexes for an optimized GetPartitions call](#partition-index-getpartitions)
+ [Integration with engines](#partition-index-integration-engines)

## About partition indexes
<a name="partition-index-1"></a>

When you create a partition index, you specify a list of partition keys that already exist on a given table. Partition index is sub list of partition keys defined in the table. A partition index can be created on any permutation of partition keys defined on the table. For the above *sales\$1data* table, the possible indexes are (country, category, creationDate), (country, category, year), (country, category), (country), (category, country, year, month), and so on.

The Data Catalog will concatenate the partition values in the order provided at the time of index creation. The index is built consistently as partitions are added to the table. Indexes can be created for String (string, char, and varchar), Numeric (int, bigint, long, tinyint, and smallint), and Date (yyyy-MM-dd) column types. 

**Supported data types**
+ Date – A date in ISO format, such as `YYYY-MM-DD`. For example, date `2020-08-15`. The format uses hyphens (‐) to separate the year, month, and day. The permissible range for dates for indexing spans from `0000-01-01` to `9999-12-31`.
+ String – A string literal enclosed in single or double quotes. 
+ Char – Fixed length character data, with a specified length between 1 and 255, such as char(10).
+ Varchar – Variable length character data, with a specified length between 1 and 65535, such as varchar(10).
+ Numeric – int, bigint, long, tinyint, and smallint

Indexes on Numeric, String, and Date data types support =, >, >=, <, <= and between operators. The indexing solution currently only supports the `AND` logical operator. Sub-expressions with the operators "LIKE", "IN", "OR", and "NOT" are ignored in the expression for filtering using an index. Filtering for the ignored sub-expression is done on the partitions fetched after applying index filtering.

For each partition added to a table, there is a corresponding index item created. For a table with ‘n’ partitions, 1 partition index will result in 'n' partition index items. 'm' partition index on same table will result into 'm\$1n' partition index items. Each partition index item will be charged according to the current AWS Glue pricing policy for data catalog storage. For details on storage object pricing, see [AWS Glue pricing](https://aws.amazon.com/glue/pricing/).

## Creating a table with partition indexes
<a name="partition-index-creating-table"></a>

You can create a partition index during table creation. The `CreateTable` request takes a list of [`PartitionIndex` objects](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-PartitionIndex) as an input. A maximum of 3 partition indexes can be created on a given table. Each partition index requires a name and a list of `partitionKeys` defined for the table. Created indexes on a table can be fetched using the [`GetPartitionIndexes` API](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-GetPartitionIndexes)

## Adding a partition index to an existing table
<a name="partition-index-existing-table"></a>

To add a partition index to an existing table, use the `CreatePartitionIndex` operation. You can create one `PartitionIndex` per `CreatePartitionIndex` operation. Adding an index does not affect the availability of a table, as the table continues to be available while indexes are being created.

The index status for an added partition is set to CREATING and the creation of the index data is started. If the process for creating the indexes is successful, the indexStatus is updated to ACTIVE and for an unsuccessful process, the index status is updated to FAILED. Index creation can fail for multiple reasons, and you can use the `GetPartitionIndexes` operation to retrieve the failure details. The possible failures are:
+ ENCRYPTED\$1PARTITION\$1ERROR — Index creation on a table with encrypted partitions is not supported.
+ INVALID\$1PARTITION\$1TYPE\$1DATA\$1ERROR — Observed when the `partitionKey` value is not a valid value for the corresponding `partitionKey` data type. For example: a `partitionKey` with the 'int' datatype has a value 'foo'.
+ MISSING\$1PARTITION\$1VALUE\$1ERROR — Observed when the `partitionValue` for an `indexedKey` is not present. This can happen when a table is not partitioned consistently.
+ UNSUPPORTED\$1PARTITION\$1CHARACTER\$1ERROR — Observed when the value for an indexed partition key contains the characters \$1u0000, \$1u0001 or \$1u0002
+ INTERNAL\$1ERROR — An internal error occurred while indexes were being created. 

## Describing partition indexes on a table
<a name="partition-index-describing"></a>

To fetch the partition indexes created on a table, use the `GetPartitionIndexes` operation. The response returns all the indexes on the table, along with the current status of each index (the `IndexStatus`).

The `IndexStatus` for a partition index will be one of the following:
+ `CREATING` — The index is currently being created, and is not yet available for use.
+ `ACTIVE` — The index is ready for use. Requests can use the index to perform an optimized query.
+ `DELETING` — The index is currently being deleted, and can no longer be used. An index in the active state can be deleted using the `DeletePartitionIndex` request, which moves the status from ACTIVE to DELETING.
+ `FAILED` — The index creation on an existing table failed. Each table stores the last 10 failed indexes.

The possible state transitions for indexes created on an existing table are:
+ CREATING → ACTIVE → DELETING
+ CREATING → FAILED

## Limitations on using partition indexes
<a name="partition-index-limitations"></a>

Once you have created a partition index, note these changes to table and partition functionality:

**New partition creation (after Index Addition)**  
After a partition index is created on a table, all new partitions added to the table will be validated for the data type checks for indexed keys. The partition value of the indexed keys will be validated for data type format. If the data type check fails, the create partition operation will fail. For the *sales\$1data* table, if an index is created for keys (category, year) where the category is of type `string` and year of type `int`, the creation of the new partition with a value of YEAR as "foo" will fail.

After indexes are enabled, the addition of partitions with indexed key values having the characters U\$10000, U\$100001, and U\$10002 will start to fail.

**Table updates**  
Once a partition index is created on a table, you cannot modify the partition key names for existing partition keys, and you cannot change the type, or order, of keys which are registered with the index.

## Using indexes for an optimized GetPartitions call
<a name="partition-index-getpartitions"></a>

When you call `GetPartitions` on a table with an index, you can include an expression, and if applicable the Data Catalog will use an index if possible. The first key of the index should be passed in the expression for the indexes to be used in filtering. Index optimization in filtering is applied as a best effort. The Data Catalog tries to use index optimization as much as possible, but in case of a missing index, or unsupported operator, it falls back to the existing implementation of loading all partitions. 

For the *sales\$1data* table above, lets add the index [Country, Category, Year]. If "Country" is not passed in the expression, the registered index will not be able to filter partitions using indexes. You can add up to 3 indexes to support various query patterns.

Lets take some example expressions and see how indexes work on them:


| Expressions | How index will be used | 
| --- | --- | 
|  Country = 'US'  |  Index will be used to filter partitions.  | 
|  Country = 'US' and Category = 'Shoes'  |  Index will be used to filter partitions.  | 
|  Category = 'Shoes'  |  Indexes will not be used as "country" is not provided in the expression. All partitions will be loaded to return a response.  | 
|  Country = 'US' and Category = 'Shoes' and Year > '2018'  |  Index will be used to filter partitions.  | 
|  Country = 'US' and Category = 'Shoes' and Year > '2018' and month = 2  |  Index will be used to fetch all partitions with country = "US" and category = "shoes" and year > 2018. Then, filtering on the month expression will be performed.  | 
|  Country = 'US' AND Category = 'Shoes' OR Year > '2018'  |  Indexes will not be used as an `OR` operator is present in the expression.  | 
|  Country = 'US' AND Category = 'Shoes' AND (Year = 2017 OR Year = '2018')  |  Index will be used to fetch all partitions with country = "US" and category = "shoes", and then filtering on the year expression will be performed.  | 
|  Country in ('US', 'UK') AND Category = 'Shoes'  |  Indexes will not be used for filtering as the `IN` operator is not supported currently.  | 
|  Country = 'US' AND Category in ('Shoes', 'Books')  |  Index will be used to fetch all partitions with country = "US", and then filtering on the Category expression will be performed.  | 
|  Country = 'US' AND Category in ('Shoes', 'Books') AND (creationDate > '2023-9-01'  |  Index will be used to fetch all partitions with country = "US", with creationDate > '2023-9-01', and then filtering on the Category expression will be performed.  | 

## Integration with engines
<a name="partition-index-integration-engines"></a>

Redshift Spectrum, Amazon EMR and AWS Glue ETL Spark DataFrames are able to utilize indexes for fetching partitions after indexes are in an ACTIVE state in AWS Glue. [Athena](https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#glue-best-practices-partition-index) and [AWS Glue ETL Dynamic frames](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html#aws-glue-programming-etl-partitions-cat-predicates) require you to follow extra steps to utilize indexes for query improvement.

### Enable partition filtering
<a name="enable-partition-filtering-athena"></a>

To enable partition filtering in Athena, you need to update the table properties as follows:

1. In the AWS Glue console, under **Data Catalog**, choose **Tables**.

1. Choose a table.

1. Under **Actions**, choose **Edit table**.

1. Under **Table properties**, add the following:
   + Key –`partition_filtering.enabled`
   + Value – `true`

1. Choose **Apply**.

Alternatively, you can set this parameter by running an [ALTER TABLE SET PROPERTIES](https://docs.aws.amazon.com/athena/latest/ug/alter-table-set-tblproperties.html) query in Athena.

```
ALTER TABLE partition_index.table_with_index
SET TBLPROPERTIES ('partition_filtering.enabled' = 'true')
```

## Updating manually created Data Catalog tables using crawlers
<a name="update-manual-tables"></a>

You might want to create AWS Glue Data Catalog tables manually and then keep them updated with AWS Glue crawlers. Crawlers running on a schedule can add new partitions and update the tables with any schema changes. This also applies to tables migrated from an Apache Hive metastore.

To do this, when you define a crawler, instead of specifying one or more data stores as the source of a crawl, you specify one or more existing Data Catalog tables. The crawler then crawls the data stores specified by the catalog tables. In this case, no new tables are created; instead, your manually created tables are updated.

The following are other reasons why you might want to manually create catalog tables and specify catalog tables as the crawler source:
+ You want to choose the catalog table name and not rely on the catalog table naming algorithm.
+ You want to prevent new tables from being created in the case where files with a format that could disrupt partition detection are mistakenly saved in the data source path.

For more information, see [Step 2: Choose data sources and classifiers](define-crawler-choose-data-sources.md).

## Data Catalog table properties
<a name="table-properties"></a>

 Table properties, or parameters, as they are known in the AWS CLI, are unvalidated key and value strings. You can set your own properties on the table to support uses of the Data Catalog outside of AWS Glue. Other services using the Data Catalog may do so as well. AWS Glue sets some table properties when running jobs or crawlers. Unless otherwise described, these properties are for internal use, we do not support that they will continue to exist in their current form, or support product behavior if these properties are manually changed. 

 For more information about table properties set by AWS Glue crawlers, see [Parameters set on Data Catalog tables by crawler](table-properties-crawler.md). 