# Run SQL queries in Amazon Athena
<a name="querying-athena-tables"></a>

You can run SQL queries using Amazon Athena on data sources that are registered with the AWS Glue Data Catalog and data sources such as Hive metastores and Amazon DocumentDB instances that you connect to using the Athena Federated Query feature. For more information about working with data sources, see [Connect to data sources](work-with-data-stores.md). When you run a Data Definition Language (DDL) query that modifies schema, Athena writes the metadata to the metastore associated with the data source. In addition, some queries, such as `CREATE TABLE AS` and `INSERT INTO` can write records to the dataset—for example, adding a CSV record to an Amazon S3 location.

This section provides guidance for running Athena queries on common data sources and data types using a variety of SQL statements. General guidance is provided for working with common structures and operators—for example, working with arrays, concatenating, filtering, flattening, and sorting. Other examples include queries for data in tables with nested structures and maps, tables based on JSON-encoded datasets, and datasets associated with AWS services such as AWS CloudTrail logs and Amazon EMR logs. Comprehensive coverage of standard SQL usage is beyond the scope of this documentation. For more information about SQL, refer to the [Trino](https://trino.io/docs/current/language.html) and [Presto](https://prestodb.io/docs/current/sql.html) language references.

**Topics**
+ [View query plans](query-plans.md)
+ [Work with query results and recent queries](querying.md)
+ [Reuse query results in Athena](reusing-query-results.md)
+ [View query stats](query-stats.md)
+ [Work with views](views.md)
+ [Use saved queries](saved-queries.md)
+ [Use parameterized queries](querying-with-prepared-statements.md)
+ [Use the cost-based optimizer](cost-based-optimizer.md)
+ [Query S3 Express One Zone](querying-express-one-zone.md)
+ [Query Amazon Glacier](querying-glacier.md)
+ [Handle schema updates](handling-schema-updates-chapter.md)
+ [Query arrays](querying-arrays.md)
+ [Query geospatial data](querying-geospatial-data.md)
+ [Query JSON data](querying-JSON.md)
+ [Use ML with Athena](querying-mlmodel.md)
+ [Query with UDFs](querying-udf.md)
+ [Query across regions](querying-across-regions.md)
+ [Query the AWS Glue Data Catalog](querying-glue-catalog.md)
+ [Query AWS service logs](querying-aws-service-logs.md)
+ [Query web server logs](querying-web-server-logs.md)

For considerations and limitations, see [Considerations and limitations for SQL queries in Amazon Athena](other-notable-limitations.md).

# View execution plans for SQL queries
<a name="query-plans"></a>

You can use the Athena query editor to see graphical representations of how your query will be run. When you enter a query in the editor and choose the **Explain** option, Athena uses an [EXPLAIN](athena-explain-statement.md) SQL statement on your query to create two corresponding graphs: a distributed execution plan and a logical execution plan. You can use these graphs to analyze, troubleshoot, and improve the efficiency of your queries.

**To view execution plans for a query**

1. Enter your query in the Athena query editor, and then choose **Explain**.  
![\[Choose Explain in the Athena query editor.\]](http://docs.aws.amazon.com/athena/latest/ug/images/query-plans-1.png)

   The **Distributed plan** tab shows you the execution plan for your query in a distributed environment. A distributed plan has processing fragments or *stages*. Each stage has a zero-based index number and is processed by one or more nodes. Data can be exchanged between nodes.  
![\[Sample query distributed plan graph.\]](http://docs.aws.amazon.com/athena/latest/ug/images/query-plans-2.png)

1. To navigate the graph, use the following options:
   + To zoom in or out, use the mouse scroll wheel, or choose the magnifying icons.
   + To adjust the graph to fit the screen, choose the **Zoom to fit** icon.
   + To move the graph around, drag the mouse pointer.

1. To see details for a stage, choose the stage.  
![\[Choose a stage to see details for the stage.\]](http://docs.aws.amazon.com/athena/latest/ug/images/query-plans-3.png)

1. To see the stage details full width, choose the expand icon at the top right of the details pane.

1. To see more detail, expand one or more items in the operator tree. For information about distributed plan fragments, see [EXPLAIN statement output types](athena-explain-statement-understanding.md#athena-explain-statement-understanding-explain-plan-types).  
![\[Expanded operator tree for a stage in a distributed query plan.\]](http://docs.aws.amazon.com/athena/latest/ug/images/query-plans-4.png)
**Important**  
Currently, some partition filters may not be visible in the nested operator tree graph even though Athena does apply them to your query. To verify the effect of such filters, run [EXPLAIN](athena-explain-statement.md#athena-explain-statement-syntax-athena-engine-version-2) or [EXPLAIN ANALYZE](athena-explain-statement.md#athena-explain-analyze-statement) on your query and view the results.

1. Choose the **Logical plan** tab. The graph shows the logical plan for running your query. For information about operational terms, see [Understand Athena EXPLAIN statement results](athena-explain-statement-understanding.md).  
![\[Graph of a logical query plan in Athena.\]](http://docs.aws.amazon.com/athena/latest/ug/images/query-plans-5.png)

1. To export a plan as an SVG or PNG image, or as JSON text, choose **Export**.

## Additional resources
<a name="query-plans-additional-resources"></a>

For more information, see the following resources.

[Using EXPLAIN and EXPLAIN ANALYZE in Athena](athena-explain-statement.md)

[Understand Athena EXPLAIN statement results](athena-explain-statement-understanding.md)

[View statistics and execution details for completed queries](query-stats.md)

[![AWS Videos](http://img.youtube.com/vi/7JUyTqglmNU/0.jpg)](http://www.youtube.com/watch?v=7JUyTqglmNU)


# Work with query results and recent queries
<a name="querying"></a>

Amazon Athena automatically stores query results and query execution result metadata for each query that runs in a *query result location* that you can specify in Amazon S3. If necessary, you can access the files in this location to work with them. You can also download query result files directly from the Athena console.

Athena offers two options for managing query results: you can use a customer-owned S3 bucket, or you can opt for the managed query results feature. With your own bucket, you maintain complete control over storage, permissions, lifecycle policies, and retention, which provides maximum flexibility but requires more management. Alternatively, when you choose the managed query results option, the service automatically handles storage and lifecycle management, eliminating the need to configure a separate results bucket and automatically cleaning up results after a predetermined retention period. For more information, see [Managed query results](managed-results.md).

To set up an Amazon S3 query result location for the first time, see [Specify a query result location using the Athena console](query-results-specify-location-console.md).

Output files are saved automatically for every query that runs. To access and view query output files using the Athena console, IAM principals (users and roles) need permission to the Amazon S3 [GetObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html) action for the query result location, as well as permission for the Athena [GetQueryResults](https://docs.aws.amazon.com/athena/latest/APIReference/API_GetQueryResults.html) action. The query result location can be encrypted. If the location is encrypted, users must have the appropriate key permissions to encrypt and decrypt the query result location.

**Important**  
IAM principals with permission to the Amazon S3 `GetObject` action for the query result location are able to retrieve query results from Amazon S3 even if permission to the Athena `GetQueryResults` action is denied.
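
As an illustration, a minimal identity-based policy that grants both the Athena and Amazon S3 permissions for viewing results might look like the following sketch. The Region, account ID, workgroup name, and bucket path are placeholder example values that you would replace with your own.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowGetQueryResults",
            "Effect": "Allow",
            "Action": "athena:GetQueryResults",
            "Resource": "arn:aws:athena:us-east-1:111122223333:workgroup/primary"
        },
        {
            "Sid": "AllowReadResultObjects",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::amzn-s3-demo-bucket/athena-results/*"
        }
    ]
}
```

As the preceding Important note explains, the `s3:GetObject` statement alone is sufficient to retrieve result files directly from Amazon S3, so scope it carefully to the query result location.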

**Note**  
In the case of canceled or failed queries, Athena might have already written partial results to Amazon S3. In such cases, Athena does not delete partial results from the Amazon S3 prefix where results are stored, and you must remove them yourself. Athena uses Amazon S3 multipart uploads to write data to Amazon S3. We recommend that you set a bucket lifecycle policy to end multipart uploads when queries fail. For more information, see [Aborting incomplete multipart uploads using a bucket lifecycle policy](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html#mpu-abort-incomplete-mpu-lifecycle-config) in the *Amazon Simple Storage Service User Guide*.  
Under certain conditions, Athena might automatically retry query executions. In most cases, these retried queries complete successfully and the query is marked as `Completed`. Such queries might have written partial results during the initial attempts and can generate incomplete multipart uploads.
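
As a sketch, a bucket lifecycle configuration that ends incomplete multipart uploads after seven days might look like the following. The rule ID, prefix, and number of days are example values.

```
{
    "Rules": [
        {
            "ID": "abort-incomplete-athena-uploads",
            "Status": "Enabled",
            "Filter": {
                "Prefix": "athena-results/"
            },
            "AbortIncompleteMultipartUpload": {
                "DaysAfterInitiation": 7
            }
        }
    ]
}
```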

**Topics**
+ [Managed query results](managed-results.md)
+ [Specify a query result location](query-results-specify-location.md)
+ [Download query results files using the Athena console](saving-query-results.md)
+ [View recent queries in the Athena console](queries-viewing-history.md)
+ [Download multiple recent queries to a CSV file](queries-downloading-multiple-recent-queries-to-csv.md)
+ [Configure recent query display options](queries-recent-queries-configuring-options.md)
+ [Keep your query history longer than 45 days](querying-keeping-query-history.md)
+ [Find query output files in Amazon S3](querying-finding-output-files.md)

# Managed query results
<a name="managed-results"></a>

With managed query results, you can run SQL queries without providing an Amazon S3 bucket for query result storage. This saves you from having to provision, manage, control access to, and clean up your own S3 buckets. To get started, create a new workgroup or edit an existing workgroup. Under **Query result configuration**, select **Athena managed**. 

**Key features**
+ Simplifies your workflow by removing the requirement to choose an S3 bucket location before you run queries.
+ No additional cost to use managed query results. Automatic deletion of query results reduces administrative overhead and the need for separate S3 bucket cleanup processes.
+ Simple to get started: new and pre-existing workgroups can be easily configured to use managed query results. You can have a mix of Athena managed and customer managed query results in your AWS account.
+ Streamlined IAM permissions with access to read results through `GetQueryResults` and `GetQueryResultsStream` tied to individual workgroups.
+ Query results are automatically encrypted with your choice of AWS owned keys or customer managed keys.

## Considerations and limitations
<a name="managed-results-considerations"></a>

+ Access to query results is managed at the workgroup level in Athena. You need explicit permission for the `GetQueryResults` and `GetQueryResultsStream` IAM actions on the specific workgroup. The `GetQueryResults` action determines who can retrieve the results of a completed query in a paginated format, while the `GetQueryResultsStream` action determines who can stream the results of a completed query (commonly used by Athena drivers).
+ You cannot download query result files larger than 200 MB from the console. Use the `UNLOAD` statement to write results larger than 200 MB to a location that you can download separately.
+ The managed query results feature does not support [Query result reuse](reusing-query-results.md).
+ Query results are available for 24 hours. Query results are stored at no cost to you during this period. After this period, query results are automatically deleted.
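
For example, a policy statement that grants both result-access actions on a single workgroup might look like the following sketch. The Region, account ID, and workgroup name are placeholder example values.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowManagedQueryResultAccess",
            "Effect": "Allow",
            "Action": [
                "athena:GetQueryResults",
                "athena:GetQueryResultsStream"
            ],
            "Resource": "arn:aws:athena:us-east-1:111122223333:workgroup/example-workgroup"
        }
    ]
}
```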

## Create or edit a workgroup with managed query results
<a name="using-managed-query-results"></a>

To create a workgroup or update an existing workgroup with managed query results from the console:

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/).

1. From the left navigation, choose **Workgroups**.

1. Choose **Create Workgroup** to create a new workgroup or edit an existing workgroup from the list.

1. Under **Query result configuration**, choose **Athena managed**.   
![\[The Query result configuration menu.\]](http://docs.aws.amazon.com/athena/latest/ug/images/athena-managed.png)

1. For **Encrypt query results**, choose the encryption option that you want. For more information, see [Choose query result encryption](#managed-query-results-encryption-at-rest).

1. Fill in all the other required details and choose **Save changes**. 

## Choose query result encryption
<a name="managed-query-results-encryption-at-rest"></a>

There are two options for encryption configuration:
+ **Encrypt using an AWS owned key** – This is the default option when you use managed query results. Choose this option if you want query results encrypted by an AWS owned key.
+ **Encrypt using customer managed key** – Choose this option if you want to encrypt and decrypt query results with a customer managed key. To use a customer managed key, add the Athena service in the Principal element of the key policy section. For more information, see [Set up an AWS KMS key policy for managed query results](#managed-query-results-set-up). To run queries successfully, the user running queries needs permission to access the AWS KMS key.

## Set up an AWS KMS key policy for managed query results
<a name="managed-query-results-set-up"></a>

The `Principal` section on the key policy specifies who can use this key. The managed query results feature introduces the principal `encryption.athena.amazonaws.com` that you must specify in the `Principal` section. This service principal is specifically for accessing keys not owned by Athena. You must also add the `kms:Decrypt`, `kms:GenerateDataKey`, and `kms:DescribeKey` actions to the key policy that you use to access managed results. These three actions are the minimum allowed actions.

Managed query results uses your workgroup ARN for [encryption context](https://docs.aws.amazon.com/kms/latest/developerguide/concepts.html#encrypt_context). Because the `Principal` section is an AWS service, you also need to add `aws:sourceArn` and `aws:sourceAccount` to the key policy conditions. The following example shows an AWS KMS key policy that has minimum permissions on a single workgroup.

```
{
    "Sid": "Allow athena service principal to use the key",
    "Effect": "Allow",
    "Principal": {
        "Service": "encryption.athena.amazonaws.com"
    },
    "Action": [
        "kms:Decrypt",
        "kms:GenerateDataKey",
        "kms:DescribeKey"
    ],
    "Resource": "arn:aws:kms:us-east-1:{account-id}:key/{key-id}",
    "Condition": {
        "ArnLike": {
            "kms:EncryptionContext:aws:athena:arn": "arn:aws:athena:us-east-1:{account-id}:workgroup/{workgroup-name}",
            "aws:SourceArn": "arn:aws:athena:us-east-1:{account-id}:workgroup/{workgroup-name}"
        },
        "StringEquals": {
            "aws:SourceAccount": "{account-id}"
        }
    }
}
```

The following example AWS KMS key policy allows all workgroups within the same account *account-id* to use the same AWS KMS key.

```
{
    "Sid": "Allow athena service principal to use the key",
    "Effect": "Allow",
    "Principal": {
        "Service": "encryption.athena.amazonaws.com"
    },
    "Action": [
        "kms:Decrypt",
        "kms:GenerateDataKey",
        "kms:DescribeKey"
    ],
    "Resource": "arn:aws:kms:us-east-1:account-id:key/{key-id}",
    "Condition": {
        "ArnLike": {
          "kms:EncryptionContext:aws:athena:arn": "arn:aws:athena:us-east-1:account-id:workgroup/*",
          "aws:SourceArn": "arn:aws:athena:us-east-1:account-id:workgroup/*"
        },
        "StringEquals": {
          "aws:SourceAccount": "account-id"
        }
    }
}
```

In addition to the Athena and Amazon S3 permissions, you must also have permission to perform the `kms:GenerateDataKey` and `kms:Decrypt` actions. For more information, see [Permissions to encrypted data in Amazon S3](encryption.md#permissions-for-encrypting-and-decrypting-data). 

For more information on managed query results encryption, see [Encrypt managed query results](encrypting-managed-results.md).

# Encrypt managed query results
<a name="encrypting-managed-results"></a>

Athena offers the following options for encrypting [Managed query results](managed-results.md).

## Encrypt using an AWS owned key
<a name="encrypting-managed-results-aws-owned-key"></a>

This is the default option when you use managed query results. This option indicates that you want to encrypt query results using an AWS owned key. AWS owned keys are not stored in your AWS account and are part of a collection of KMS keys that AWS owns. You are not charged a fee when you use AWS owned keys, and they do not count against AWS KMS quotas for your account.

## Encrypt using AWS KMS customer managed key
<a name="encrypting-managed-results-customer-managed-key"></a>

Customer managed keys are the KMS keys in your AWS account that you create, own, and manage. You have full control over these KMS keys, which includes establishing and maintaining their key policies, IAM policies and grants, enabling and disabling them, rotating their cryptographic material, adding tags, creating aliases that refer to them, and scheduling them for deletion. For more information, see [Customer managed keys](https://docs.aws.amazon.com/kms/latest/developerguide/concepts.html#customer-cmk).

## How Athena uses a customer managed key to encrypt results
<a name="encrypting-managed-results-how-ate-uses-cmk"></a>

When you specify a customer managed key, Athena uses it to encrypt query results stored in managed query results. The same key is used to decrypt the results when you call `GetQueryResults`. Disabling the customer managed key or scheduling it for deletion prevents Athena and all users from encrypting or decrypting results with that key.

Athena uses envelope encryption and a key hierarchy to encrypt data. Your AWS KMS key is used to generate and decrypt the root key of this key hierarchy.

Each result is encrypted using the customer managed key that is configured in the workgroup at the time of encryption. Switching the workgroup to a different customer managed key or to an AWS owned key does not re-encrypt existing results with the new key. Deleting or disabling a particular customer managed key affects only decryption of the results that the key encrypted.

Athena needs access to your encryption key to perform `kms:Decrypt`, `kms:GenerateDataKey`, and `kms:DescribeKey` operations for encrypting and decrypting results. For more information, see [Permissions to encrypted data in Amazon S3](encryption.md#permissions-for-encrypting-and-decrypting-data). 

The principal that submits the query using the `StartQueryExecution` API and reads results using `GetQueryResults` must also have permission to the customer managed key for `kms:Decrypt`, `kms:GenerateDataKey`, and `kms:DescribeKey` operations in addition to Athena and Amazon S3 permissions. For more information, see [Key policies in AWS KMS](https://docs.aws.amazon.com/kms/latest/developerguide/key-policies.html#key-policy-default-allow-users).

# Specify a query result location
<a name="query-results-specify-location"></a>

The query result location that Athena uses is determined by a combination of workgroup settings and *client-side settings*. Client-side settings are based on how you run the query. 
+  If you run the query using the Athena console, the **Query result location** entered under **Settings** in the navigation bar determines the client-side setting. 
+ If you run the query using the Athena API, the `OutputLocation` parameter of the [StartQueryExecution](https://docs.aws.amazon.com/athena/latest/APIReference/API_StartQueryExecution.html) action determines the client-side setting. 
+ If you use the ODBC or JDBC drivers to run queries, the `S3OutputLocation` property specified in the connection URL determines the client-side setting. 

**Important**  
When you run a query using the API or using the ODBC or JDBC driver, the console setting does not apply. 
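
For example, the JSON payload for a `StartQueryExecution` request that sets the client-side query result location might look like the following sketch. The query string, workgroup name, and bucket path are placeholder example values.

```
{
    "QueryString": "SELECT * FROM example_table LIMIT 10",
    "WorkGroup": "primary",
    "ResultConfiguration": {
        "OutputLocation": "s3://amzn-s3-demo-bucket/athena-results/"
    }
}
```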

Each workgroup configuration has an [Override client-side settings](https://docs.aws.amazon.com/athena/latest/ug/workgroups-settings-override.html) option that can be enabled. When this option is enabled, the workgroup settings take precedence over the applicable client-side settings when an IAM principal associated with that workgroup runs the query.

## About previously created default locations
<a name="query-results-specify-location-previous-defaults"></a>

Previously in Athena, if you ran a query without specifying a value for **Query result location**, and the query result location setting was not overridden by a workgroup, Athena created a default location for you. The default location was `aws-athena-query-results-MyAcctID-MyRegion`, where *MyAcctID* was the Amazon Web Services account ID of the IAM principal that ran the query, and *MyRegion* was the Region where the query ran (for example, `us-west-1`).

Now, before you can run an Athena query in a region in which your account hasn't used Athena previously, you must specify a query result location, or use a workgroup that overrides the query result location setting. While Athena no longer creates a default query results location for you, previously created default `aws-athena-query-results-MyAcctID-MyRegion` locations remain valid and you can continue to use them.

**Topics**
+ [About previously created default locations](#query-results-specify-location-previous-defaults)
+ [Specify a query result location using the Athena console](query-results-specify-location-console.md)
+ [Specify a query result location using a workgroup](query-results-specify-location-workgroup.md)

# Specify a query result location using the Athena console
<a name="query-results-specify-location-console"></a>

Before you can run a query, you must specify a query result bucket location in Amazon S3, or use a workgroup that has specified a bucket and whose configuration overrides client settings.

**To specify a client-side setting query result location using the Athena console**

1. [Switch](switching-workgroups.md) to the workgroup for which you want to specify a query results location. The name of the default workgroup is **primary**.

1. From the navigation bar, choose **Settings**.

1. From the navigation bar, choose **Manage**.

1. For **Manage settings**, do one of the following:
   + In the **Location of query result** box, enter the path to the bucket that you created in Amazon S3 for your query results. Prefix the path with `s3://`.
   + Choose **Browse S3**, choose the Amazon S3 bucket that you created for your current Region, and then choose **Choose**.
**Note**  
If you are using a workgroup that specifies a query result location for all users of the workgroup, the option to change the query result location is unavailable.

1. (Optional) Choose **View lifecycle configuration** to view and configure the [Amazon S3 lifecycle rules](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on your query results bucket. The Amazon S3 lifecycle rules that you create can be either expiration rules or transition rules. Expiration rules automatically delete query results after a certain amount of time. Transition rules move them to another Amazon S3 storage tier. For more information, see [Setting lifecycle configuration on a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html) in the Amazon Simple Storage Service User Guide.

1. (Optional) For **Expected bucket owner**, enter the ID of the AWS account that you expect to be the owner of the output location bucket. This is an added security measure. If the account ID of the bucket owner does not match the ID that you specify here, attempts to output to the bucket will fail. For in-depth information, see [Verifying bucket ownership with bucket owner condition](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-owner-condition.html) in the *Amazon S3 User Guide*.
**Note**  
The expected bucket owner setting applies only to the Amazon S3 output location that you specify for Athena query results. It does not apply to other Amazon S3 locations like data source locations in external Amazon S3 buckets, `CTAS` and `INSERT INTO` destination table locations, `UNLOAD` statement output locations, operations to spill buckets for federated queries, or `SELECT` queries run against a table in another account.

1. (Optional) Choose **Encrypt query results** if you want to encrypt the query results stored in Amazon S3. For more information about encryption in Athena, see [Encryption at rest](encryption.md).

1. (Optional) Choose **Assign bucket owner full control over query results** to grant full control access over query results to the bucket owner when [ACLs are enabled](https://docs.aws.amazon.com/AmazonS3/latest/userguide/about-object-ownership.html) for the query result bucket. For example, if your query result location is owned by another account, you can grant ownership and full control over your query results to the other account. For more information, see [Controlling ownership of objects and disabling ACLs for your bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/about-object-ownership.html) in the *Amazon S3 User Guide*.

1. Choose **Save**.
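
If you configure lifecycle rules on your query results bucket as described in the optional lifecycle step above, an expiration rule that deletes query results after 30 days might look like the following sketch. The rule ID, prefix, and expiration period are example values.

```
{
    "Rules": [
        {
            "ID": "expire-athena-query-results",
            "Status": "Enabled",
            "Filter": {
                "Prefix": "athena-results/"
            },
            "Expiration": {
                "Days": 30
            }
        }
    ]
}
```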

# Specify a query result location using a workgroup
<a name="query-results-specify-location-workgroup"></a>

You specify the query result location in a workgroup configuration using the AWS Management Console, the AWS CLI, or the Athena API.

When using the AWS CLI, specify the query result location using the `OutputLocation` parameter of the `--configuration` option when you run the [create-work-group](https://docs.aws.amazon.com/cli/latest/reference/athena/create-work-group.html) or [update-work-group](https://docs.aws.amazon.com/cli/latest/reference/athena/update-work-group.html) command.
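
For example, the workgroup configuration value passed to the `--configuration` option might look like the following sketch. The bucket path is a placeholder example value, and `EnforceWorkGroupConfiguration` corresponds to the **Override client-side settings** option.

```
{
    "ResultConfiguration": {
        "OutputLocation": "s3://amzn-s3-demo-bucket/athena-results/"
    },
    "EnforceWorkGroupConfiguration": true
}
```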

**To specify the query result location for a workgroup using the Athena console**

1. If the console navigation pane is not visible, choose the expansion menu on the left.  
![\[Choose the expansion menu.\]](http://docs.aws.amazon.com/athena/latest/ug/images/nav-pane-expansion.png)

1. In the navigation pane, choose **Workgroups**.

1. In the list of workgroups, choose the link of the workgroup that you want to edit.

1. Choose **Edit**.

1. For **Query result location and encryption**, do one of the following:
   + In the **Location of query result** box, enter the path to a bucket in Amazon S3 for your query results. Prefix the path with `s3://`.
   + Choose **Browse S3**, choose the Amazon S3 bucket for your current Region that you want to use, and then choose **Choose**.

1. (Optional) For **Expected bucket owner**, enter the ID of the AWS account that you expect to be the owner of the output location bucket. This is an added security measure. If the account ID of the bucket owner does not match the ID that you specify here, attempts to output to the bucket will fail. For in-depth information, see [Verifying bucket ownership with bucket owner condition](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-owner-condition.html) in the *Amazon S3 User Guide*.
**Note**  
The expected bucket owner setting applies only to the Amazon S3 output location that you specify for Athena query results. It does not apply to other Amazon S3 locations like data source locations in external Amazon S3 buckets, `CTAS` and `INSERT INTO` destination table locations, `UNLOAD` statement output locations, operations to spill buckets for federated queries, or `SELECT` queries run against a table in another account.

1. (Optional) Choose **Encrypt query results** if you want to encrypt the query results stored in Amazon S3. For more information about encryption in Athena, see [Encryption at rest](encryption.md).

1. (Optional) Choose **Assign bucket owner full control over query results** to grant full control access over query results to the bucket owner when [ACLs are enabled](https://docs.aws.amazon.com/AmazonS3/latest/userguide/about-object-ownership.html) for the query result bucket. For example, if your query result location is owned by another account, you can grant ownership and full control over your query results to the other account. 

   If the bucket's S3 Object Ownership setting is **Bucket owner preferred**, the bucket owner also owns all query result objects written from this workgroup. For example, if an external account's workgroup enables this option and sets its query result location to your account's Amazon S3 bucket which has an S3 Object Ownership setting of **Bucket owner preferred**, you own and have full control access over the external workgroup's query results. 

   Selecting this option when the query result bucket's S3 Object Ownership setting is **Bucket owner enforced** has no effect. For more information, see [Controlling ownership of objects and disabling ACLs for your bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/about-object-ownership.html) in the *Amazon S3 User Guide*. 

1. If you want to require all users of the workgroup to use the query results location that you specified, scroll down to the **Settings** section and select **Override client-side settings**.

1. Choose **Save changes**.

# Download query results files using the Athena console
<a name="saving-query-results"></a>

You can download the query results CSV file from the query pane immediately after you run a query. You can also download the results of earlier queries from the **Recent queries** tab.

**Note**  
Athena query result files are data files that contain information that can be configured by individual users. Some programs that read and analyze this data can potentially interpret some of the data as commands (CSV injection). For this reason, when you import query results CSV data to a spreadsheet program, that program might warn you about security concerns. To keep your system secure, you should always choose to disable links or macros from downloaded query results.

**To run a query and download the query results**

1. Enter your query in the query editor and then choose **Run**.

   When the query finishes running, the **Results** pane shows the query results.

1. To download a CSV file of the query results, choose **Download results** above the query results pane. Depending on your browser and browser configuration, you may need to confirm the download.  
![\[Saving query results to a .csv file in the Athena console.\]](http://docs.aws.amazon.com/athena/latest/ug/images/getting-started-query-results-download-csv.png)

**To download a query results file for an earlier query**

1. Choose **Recent queries**.  
![\[Choose Recent queries to view previous queries.\]](http://docs.aws.amazon.com/athena/latest/ug/images/getting-started-recent-queries.png)

1. Use the search box to find the query, select the query, and then choose **Download results**.
**Note**  
You cannot use the **Download results** option to retrieve query results that have been deleted manually, or retrieve query results that have been deleted or moved to another location by Amazon S3 [lifecycle rules](https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html).  
![\[Choose Recent queries to find and download previous query results.\]](http://docs.aws.amazon.com/athena/latest/ug/images/querying-recent-queries-tab-download.png)

# View recent queries in the Athena console
<a name="queries-viewing-history"></a>

You can use the Athena console to see which queries succeeded or failed, and view error details for the queries that failed. Athena keeps a query history for 45 days. 

**To view recent queries in the Athena console**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. Choose **Recent queries**. The **Recent queries** tab shows information about each query that ran.

1. To open a query statement in the query editor, choose the query's execution ID.  
![\[Choose the execution ID of a query to see it in the query editor.\]](http://docs.aws.amazon.com/athena/latest/ug/images/querying-view-query-statement.png)

1. To see the details for a query that failed, choose the **Failed** link for the query.  
![\[Choose the Failed link for a query to view information about the failure.\]](http://docs.aws.amazon.com/athena/latest/ug/images/querying-view-query-failure-details.png)

# Download multiple recent queries to a CSV file
<a name="queries-downloading-multiple-recent-queries-to-csv"></a>

You can use the **Recent queries** tab of the Athena console to export one or more recent queries to a CSV file so that you can view them in tabular format. The downloaded file contains not the query results, but the SQL query strings themselves and other information about each query. Exported fields include the execution ID, query string contents, query start time, status, run time, amount of data scanned, query engine version used, and encryption method. You can export a maximum of 500 recent queries. To narrow the set of exported queries, enter filter criteria in the search box.

**To export one or more recent queries to a CSV file**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. Choose **Recent queries**.

1. (Optional) Use the search box to filter for the recent queries that you want to download.

1. Choose **Download CSV**.  
![\[Choose Download CSV.\]](http://docs.aws.amazon.com/athena/latest/ug/images/querying-recent-queries-csv.png)

1. At the file save prompt, choose **Save**. The default file name is `Recent Queries` followed by a timestamp (for example, `Recent Queries 2022-12-05T16 04 27.352-08 00.csv`).

# Configure recent query display options
<a name="queries-recent-queries-configuring-options"></a>

You can configure options for the **Recent queries** tab, such as which columns to display and whether to wrap text.

**To configure options for the Recent queries tab**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. Choose **Recent queries**.

1. Choose the options button (gear icon).  
![\[Choose the option button to configure the display of recent queries.\]](http://docs.aws.amazon.com/athena/latest/ug/images/querying-recent-queries-options.png)

1. In the **Preferences** dialog box, choose the number of rows per page, line wrapping behavior, and columns to display.  
![\[Configuring the display of recent queries.\]](http://docs.aws.amazon.com/athena/latest/ug/images/querying-recent-queries-preferences.png)

1. Choose **Confirm**.

# Keep your query history longer than 45 days
<a name="querying-keeping-query-history"></a>

If you want to keep the query history longer than 45 days, you can retrieve the query history and save it to a data store such as Amazon S3. To automate this process, you can use Athena and Amazon S3 API actions and CLI commands. The following procedure summarizes these steps.

**To retrieve and save query history programmatically**

1. Use the Athena [ListQueryExecutions](https://docs.aws.amazon.com/athena/latest/APIReference/API_ListQueryExecutions.html) API action or the [list-query-executions](https://docs.aws.amazon.com/cli/latest/reference/athena/list-query-executions.html) CLI command to retrieve the query IDs.

1. Use the Athena [GetQueryExecution](https://docs.aws.amazon.com/athena/latest/APIReference/API_GetQueryExecution.html) API action or the [get-query-execution](https://docs.aws.amazon.com/cli/latest/reference/athena/get-query-execution.html) CLI command to retrieve information about each query based on its ID.

1. Use the Amazon S3 [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) API action or the [put-object](https://docs.aws.amazon.com/cli/latest/reference/s3api/put-object.html) CLI command to save the information in Amazon S3.
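
The three steps can be sketched in Python with boto3. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name, the `prefix/QueryExecutionId.json` key layout, and the choice to archive one JSON object per query are all ours.

```python
import json

def archive_query_history(athena, s3, bucket, prefix, workgroup="primary"):
    """Save details of every query execution in a workgroup to Amazon S3.

    `athena` and `s3` are boto3-style clients (for example,
    boto3.client("athena") and boto3.client("s3")).
    """
    token = None
    while True:
        # Step 1: page through the query IDs for the workgroup.
        kwargs = {"WorkGroup": workgroup}
        if token:
            kwargs["NextToken"] = token
        page = athena.list_query_executions(**kwargs)
        for query_id in page["QueryExecutionIds"]:
            # Step 2: look up full execution details for each query ID.
            detail = athena.get_query_execution(QueryExecutionId=query_id)
            # Step 3: save the details to Amazon S3.
            s3.put_object(
                Bucket=bucket,
                Key=f"{prefix}/{query_id}.json",
                # default=str serializes datetime fields in the response.
                Body=json.dumps(detail["QueryExecution"], default=str),
            )
        token = page.get("NextToken")
        if not token:
            break
```

Run on a schedule (for example, from a daily job) before the 45-day retention window elapses, a sketch like this preserves the full execution record for each query.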

# Find query output files in Amazon S3
<a name="querying-finding-output-files"></a>

Query output files are stored in sub-folders on Amazon S3 in the following path pattern. The exception is queries that run in a workgroup whose configuration overrides client-side settings; those queries use the results path specified by the workgroup.

```
QueryResultsLocationInS3/[QueryName|Unsaved/yyyy/mm/dd/]
```
+ *QueryResultsLocationInS3* is the query result location specified either by workgroup settings or client-side settings. For more information, see [Specify a query result location](query-results-specify-location.md) later in this document.
+ The following sub-folders are created only for queries run from the console whose results path has not been overridden by workgroup configuration. Results of queries run from the AWS CLI or the Athena API are saved directly to the *QueryResultsLocationInS3*.
  + *QueryName* is the name of the query for which the results are saved. If the query ran but wasn't saved, `Unsaved` is used. 
  + *yyyy/mm/dd* is the date that the query ran.

Files associated with a `CREATE TABLE AS SELECT` query are stored in a `tables` sub-folder of the above pattern.
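
As a sketch of the pattern above, the following hypothetical Python helper assembles the prefix for a console query. The function name is ours, and it assumes that both named and unsaved console queries receive the date sub-folders:

```python
from datetime import date

def console_results_prefix(results_location, run_date, query_name=None):
    """Build the S3 prefix where the console stores a query's result files,
    following QueryResultsLocationInS3/[QueryName|Unsaved/yyyy/mm/dd/]."""
    name = query_name if query_name else "Unsaved"
    return f"{results_location.rstrip('/')}/{name}/{run_date:%Y/%m/%d}/"

# For example, an unsaved console query run on August 12, 2019:
# console_results_prefix("s3://amzn-s3-demo-bucket/results", date(2019, 8, 12))
# returns "s3://amzn-s3-demo-bucket/results/Unsaved/2019/08/12/"
```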

## Identify query output files
<a name="querying-identifying-output-files"></a>

Files are saved to the query result location in Amazon S3 based on the name of the query, the query ID, and the date that the query ran. Files for each query are named using the *QueryID*, which is a unique identifier that Athena assigns to each query when it runs.

The following file types are saved:


| File type | File naming patterns | Description | 
| --- | --- | --- | 
|  **Query results files**  |  `QueryID.csv` `QueryID.txt`  |  DML query results files are saved in comma-separated values (CSV) format. DDL query results are saved as plain text files.  You can download results files from the **Results** pane in the console or from the **Recent queries** tab. For more information, see [Download query results files using the Athena console](saving-query-results.md).  | 
|  **Query metadata files**  |  `QueryID.csv.metadata` `QueryID.txt.metadata`  |  DML and DDL query metadata files are saved in binary format and are not human readable. The file extension corresponds to the related query results file. Athena uses the metadata when reading query results using the `GetQueryResults` action. Although these files can be deleted, we do not recommend it because important information about the query is lost.  | 
|  **Data manifest files**  |  `QueryID-manifest.csv`  |  Data manifest files are generated to track files that Athena creates in Amazon S3 data source locations when an [INSERT INTO](insert-into.md) query runs. If a query fails, the manifest also tracks files that the query intended to write. The manifest is useful for identifying orphaned files resulting from a failed query.  | 

## Use the AWS CLI to identify query output location and files
<a name="querying-finding-output-files-cli"></a>

To use the AWS CLI to identify the query output location and result files, run the `aws athena get-query-execution` command, as in the following example. Replace *abc1234d-5efg-67hi-jklm-89n0op12qr34* with the query ID.

```
aws athena get-query-execution --query-execution-id abc1234d-5efg-67hi-jklm-89n0op12qr34
```

The command returns output similar to the following. For descriptions of each output parameter, see [get-query-execution](https://docs.aws.amazon.com/cli/latest/reference/athena/get-query-execution.html) in the *AWS CLI Command Reference*.

```
{
    "QueryExecution": {
        "Status": {
            "SubmissionDateTime": 1565649050.175,
            "State": "SUCCEEDED",
            "CompletionDateTime": 1565649056.6229999
        },
        "Statistics": {
            "DataScannedInBytes": 5944497,
            "DataManifestLocation": "s3://amzn-s3-demo-bucket/athena-query-results-123456789012-us-west-1/MyInsertQuery/2019/08/12/abc1234d-5efg-67hi-jklm-89n0op12qr34-manifest.csv",
            "EngineExecutionTimeInMillis": 5209
        },
        "ResultConfiguration": {
            "EncryptionConfiguration": {
                "EncryptionOption": "SSE_S3"
            },
            "OutputLocation": "s3://amzn-s3-demo-bucket/athena-query-results-123456789012-us-west-1/MyInsertQuery/2019/08/12/abc1234d-5efg-67hi-jklm-89n0op12qr34"
        },
        "QueryExecutionId": "abc1234d-5efg-67hi-jklm-89n0op12qr34",
        "QueryExecutionContext": {},
        "Query": "INSERT INTO mydb.elb_log_backup SELECT * FROM mydb.elb_logs LIMIT 100",
        "StatementType": "DML",
        "WorkGroup": "primary"
    }
}
```
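
After capturing this JSON (for example, by redirecting the command output to a file), a few lines of Python are enough to pull out the output location. The trimmed response below is a stand-in for the full output shown above:

```python
import json

# Trimmed sample of a get-query-execution response like the one above.
response_text = '''{
  "QueryExecution": {
    "Status": {"State": "SUCCEEDED"},
    "Statistics": {"DataScannedInBytes": 5944497},
    "ResultConfiguration": {
      "OutputLocation": "s3://amzn-s3-demo-bucket/path/abc1234d-5efg-67hi-jklm-89n0op12qr34"
    }
  }
}'''

execution = json.loads(response_text)["QueryExecution"]
if execution["Status"]["State"] == "SUCCEEDED":
    # The OutputLocation value is the S3 path of the query results file.
    output_location = execution["ResultConfiguration"]["OutputLocation"]
```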

# Reuse query results in Athena
<a name="reusing-query-results"></a>

When you re-run a query in Athena, you can optionally choose to reuse the last stored query result. This option can increase performance and reduce costs in terms of the number of bytes scanned. Reusing query results is useful if, for example, you know that the results will not change within a given time frame. You can specify a maximum age for reusing query results. Athena uses the stored result as long as it is not older than the age that you specify. For more information, see [Reduce cost and improve query performance with Amazon Athena](https://aws.amazon.com/blogs/big-data/reduce-cost-and-improve-query-performance-with-amazon-athena-query-result-reuse/) in the *AWS Big Data Blog*.

## Key features
<a name="reusing-query-results-key-features"></a>

When you enable result reuse for a query, Athena looks for a previous execution of the query within the same workgroup. If Athena finds a match, it bypasses execution and returns the query result from the previous, matching execution. You can enable query result reuse on a per-query basis.

Athena reuses the last query result when all of the following conditions are true:
+ Query strings match as determined by Athena.
+ Database and catalog names match.
+ The previous result has not expired.
+ Query result configuration matches the query result configuration from the previous execution.
+ You have access to all tables referenced in the query.
+ You have access to the S3 file location where the previous result is stored.

If any of these conditions are not met, Athena runs the query without using the cached results.

## Considerations and limitations
<a name="reusing-query-results-considerations-and-limitations"></a>

When using the query result reuse feature, keep in mind the following points:
+ Athena reuses query results only within the same workgroup.
+ The reuse query results feature respects workgroup configurations. If you override the result configuration for a query, the feature is disabled.
+ Only queries that produce result sets on Amazon S3 are supported. Statements other than `SELECT` and `EXECUTE` are not supported.
+ Apache Hive, Apache Hudi, Apache Iceberg, and Linux Foundation Delta Lake tables registered with AWS Glue are supported. External Hive metastores are not supported.
+ Queries that reference federated catalogs or an external Hive metastore are not supported.
+ Query result reuse is not supported for Lake Formation governed tables.
+ Query result reuse is not supported when the Amazon S3 location of the table source is registered as a data location in Lake Formation. 
+ Tables with row and column permissions are not supported.
+ Tables that have fine-grained access control (for example, column or row filtering) are not supported.
+ Any query that references an unsupported table is not eligible for query result reuse.
+ Athena requires that you have Amazon S3 read permissions for the previously generated output file to be reused.
+ The reuse query results feature assumes that the content of the previous result has not been modified. Athena does not check the integrity of a previous result before using it.
+ If the query results from the previous execution have been deleted or moved to a different location in Amazon S3, subsequent executions of the same query will not reuse the query results. 
+ Potentially stale results can be returned. Athena does not check for changes in source data until the maximum reuse age that you specify has been reached.
+ If multiple results are available for reuse, Athena uses the latest result.
+ Queries that use non-deterministic operators or functions like `rand()` or `shuffle()` do not use cached results. For example, `LIMIT` without `ORDER BY` is non-deterministic and not cached, but `LIMIT` with `ORDER BY` is deterministic and is cached.
+ To use the query result reuse feature with JDBC, the minimum required driver version is 2.0.34.1000. For ODBC, the minimum required driver version is 1.1.19.1002. For driver download information, see [Connect to Amazon Athena with ODBC and JDBC drivers](athena-bi-tools-jdbc-odbc.md).
+ Query result reuse is not supported for queries that use more than one data catalog. 
+  Query result reuse is not supported for queries that include more than 20 tables.
+ For query strings under 100 KB in size, differences in comments and white space are ignored, and `INNER JOIN` and `JOIN` are treated as equivalents for the purposes of reusing results. Query strings larger than 100 KB must be an exact match to reuse results.
+ A query result is considered to be expired if it is older than the maximum age specified, or older than the default of 60 minutes if a maximum age has not been specified. The maximum age for reusing query results can be specified in minutes, hours, or days. The maximum age specifiable is the equivalent of 7 days regardless of the time unit used.
+ The [managed query results](https://docs.aws.amazon.com/athena/latest/ug/managed-results.html) feature is not supported.
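
The expiration rule in the points above can be sketched as a small Python function. The function name and the `min` cap are illustrative assumptions; in practice, the console simply rejects maximum ages over the equivalent of seven days:

```python
from datetime import timedelta

MAX_REUSE_AGE = timedelta(days=7)       # hard upper limit for reuse age
DEFAULT_REUSE_AGE = timedelta(minutes=60)  # default when no age is specified

def is_result_expired(result_age, max_age=None):
    """Return True if a stored result is too old to reuse.

    A result expires after the caller-specified maximum age (capped at the
    7-day limit), or after the 60-minute default when no age is given.
    """
    limit = min(max_age, MAX_REUSE_AGE) if max_age else DEFAULT_REUSE_AGE
    return result_age > limit
```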

## How to reuse query results in the Athena console
<a name="reusing-query-results-athena-console"></a>

To use the feature, enable the **Reuse query results** option in the Athena query editor.

![\[Enable Reuse query results in the Athena query editor.\]](http://docs.aws.amazon.com/athena/latest/ug/images/reusing-query-results-1.png)


**To configure the reuse query results feature**

1. In the Athena query editor, under the **Reuse query results** option, choose the edit icon next to **up to 60 minutes ago**.

1. In the **Edit reuse time** dialog box, from the box on the right, choose a time unit (minutes, hours, or days).

1. In the box on the left, enter or choose the number of time units that you want to specify. The maximum time you can enter is the equivalent of seven days regardless of the time unit chosen.  
![\[Configuring the maximum age for reusing query results.\]](http://docs.aws.amazon.com/athena/latest/ug/images/reusing-query-results-2.png)

1. Choose **Confirm**.

   A banner confirms your configuration change, and the **Reuse query results** option displays your new setting.

# View statistics and execution details for completed queries
<a name="query-stats"></a>

After you run a query, you can get statistics on the input and output data processed, see a graphical representation of the time taken for each phase of the query, and interactively explore execution details.

**To view query statistics for a completed query**

1. After you run a query in the Athena query editor, choose the **Query stats** tab.  
![\[Choose Query stats.\]](http://docs.aws.amazon.com/athena/latest/ug/images/query-stats-1.png)

   The **Query stats** tab provides the following information:
   + **Data processed** – Shows you the number of input rows and bytes processed, and the number of rows and bytes output.
   + **Total runtime** – Shows the total amount of time the query took to run and a graphical representation of how much of that time was spent on queuing, planning, execution, and service processing.
**Note**  
Stage-level input and output row counts and data size information are not shown when a query has row-level filters defined in Lake Formation.

1. To interactively explore information about how the query ran, choose **Execution details**.  
![\[Choose Execution details.\]](http://docs.aws.amazon.com/athena/latest/ug/images/query-stats-2.png)

   The **Execution details** page shows the execution ID for the query and a graph of the zero-based stages in the query. The stages are ordered start to finish from bottom to top. Each stage's label shows the amount of time the stage took to run.
**Note**  
The total runtime and execution stage time of a query often differ significantly. For example, a query with a total runtime in minutes can show an execution time for a stage in hours. Because a stage is a logical unit of computation executed in parallel across many tasks, the execution time of a stage is the aggregate execution time of all of its tasks. Despite this discrepancy, stage execution time can be useful as a relative indicator of which stage was most computationally intensive in a query.  
![\[The execution details page.\]](http://docs.aws.amazon.com/athena/latest/ug/images/query-stats-3.png)

   To navigate the graph, use the following options:
   + To zoom in or out, scroll the mouse, or use the magnifying icons.
   + To adjust the graph to fit the screen, choose the **Zoom to fit** icon.
   + To move the graph around, drag the mouse pointer.

1. To see more details for a stage, choose the stage. The stage details pane on the right shows the number of rows and bytes input and output, and an operator tree.  
![\[Stage details pane.\]](http://docs.aws.amazon.com/athena/latest/ug/images/query-stats-4.png)

1. To see the stage details full width, choose the expand icon at the top right of the details pane.

1. To get information about the parts of the stage, expand one or more items in the operator tree.  
![\[Expanded operator tree.\]](http://docs.aws.amazon.com/athena/latest/ug/images/query-stats-5.png)

For more information about execution details, see [Understand Athena EXPLAIN statement results](athena-explain-statement-understanding.md).

## Additional resources
<a name="query-stats-additional-resources"></a>

For more information, see the following resources.

[View execution plans for SQL queries](query-plans.md)

[Using EXPLAIN and EXPLAIN ANALYZE in Athena](athena-explain-statement.md)

[![AWS Videos](http://img.youtube.com/vi/7JUyTqglmNU/0.jpg)](http://www.youtube.com/watch?v=7JUyTqglmNU)


# Work with views
<a name="views"></a>

A view in Amazon Athena is a logical table, not a physical table. The query that defines a view runs each time the view is referenced in a query. You can create a view from a `SELECT` query and then reference this view in future queries. 

You can use two different kinds of views in Athena: Athena views and AWS Glue Data Catalog views.

## When to use Athena views?
<a name="when-to-use-views"></a>

You may want to create Athena views to: 
+ **Query a subset of data** – For example, you can create a view with a subset of columns from the original table to simplify querying data. 
+ **Combine tables** – You can use views to combine multiple tables into one query. When you have multiple tables and want to combine them with `UNION ALL`, you can create a view with that expression to simplify queries against the combined tables.
+ **Hide complexity** – Use views to hide the complexity of existing base queries and simplify queries run by users. Base queries often include joins between tables, expressions in the column list, and other SQL syntax that make it difficult to understand and debug them. You might create a view that hides the complexity and simplifies queries.
+ **Optimize queries** – You can use views to experiment with optimization techniques to create optimized queries. For example, if you find a combination of `WHERE` conditions, `JOIN` order, or other expressions that demonstrate the best performance, you can create a view with these clauses and expressions. Applications can then make relatively simple queries against this view. If you later find a better way to optimize the original query, when you recreate the view, all the applications immediately take advantage of the optimized base query. 
+ **Hide underlying names** – You can use views to hide underlying table and column names, and minimize maintenance problems if the names change. If the names change, you can simply recreate the view using the new names. Queries that use the view rather than the tables directly keep running with no changes.

  For more information, see [Work with Athena views](views-console.md).

## When to use AWS Glue Data Catalog views?
<a name="when-to-use-views-gdc"></a>

Use AWS Glue Data Catalog views when you want a single common view across AWS services like Amazon Athena and Amazon Redshift. In Data Catalog views, access permissions are defined by the user who created the view instead of the user who queries the view. This method of granting permissions is called *definer* semantics.

The following use cases show how you can use Data Catalog views.
+ **Greater access control** – You create a view that restricts data access based on the level of permissions the user requires. For example, you can use Data Catalog views to prevent employees who don't work in the human resources (HR) department from seeing personally identifiable information.
+ **Ensure complete records** – By applying certain filters to your Data Catalog view, you ensure that the data records in a Data Catalog view are always complete.
+ **Enhanced security** – In Data Catalog views, the query definition that creates the view must be intact in order for the view to be created. This makes Data Catalog views less susceptible to SQL commands from malicious actors.
+ **Prevent access to underlying tables** – Definer semantics allow users to access a view without making the underlying table available to them. Only the user who defines the view requires access to the tables.

Data Catalog view definitions are stored in the AWS Glue Data Catalog. This means that you can use AWS Lake Formation to grant access through resource grants, column grants, or tag-based access controls. For more information about granting and revoking access in Lake Formation, see [Granting and revoking permissions on Data Catalog resources](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-catalog-permissions.html) in the *AWS Lake Formation Developer Guide*.

For more information, see [Use Data Catalog views in Athena](views-glue.md).

# Work with Athena views
<a name="views-console"></a>

Athena views can be easily created, updated, and managed in the Athena console.

## Create views
<a name="creating-views"></a>

You can create a view in the Athena console by using a template or by running an existing query.

**To use a template to create a view**

1. In the Athena console, next to **Tables and views**, choose **Create**, and then choose **Create view**.  
![\[Creating a view.\]](http://docs.aws.amazon.com/athena/latest/ug/images/create-view.png)

   This action places an editable view template into the query editor. 

1. Edit the view template according to your requirements. When you enter a name for the view in the statement, remember that view names cannot contain special characters other than underscore `(_)`. See [Name databases, tables, and columns](tables-databases-columns-names.md). Avoid using reserved keywords for naming views. For more information, see [Escape reserved keywords in queries](reserved-words.md). 

   For more information about creating views, see [CREATE VIEW and CREATE PROTECTED MULTI DIALECT VIEW](create-view.md) and [Examples of Athena views](views-examples.md). 

1. Choose **Run** to create the view. The view appears in the list of views in the Athena console.

**To create a view from an existing query**

1. Use the Athena query editor to run an existing query.

1. Under the query editor window, choose **Create**, and then choose **View from query**.  
![\[Choose Create, View from query.\]](http://docs.aws.amazon.com/athena/latest/ug/images/create-view-from-query.png)

1. In the **Create View** dialog box, enter a name for the view, and then choose **Create**. View names cannot contain special characters other than underscore `(_)`. See [Name databases, tables, and columns](tables-databases-columns-names.md). Avoid using reserved keywords for naming views. For more information, see [Escape reserved keywords in queries](reserved-words.md).

   Athena adds the view to the list of views in the console and displays the `CREATE VIEW` statement for the view in the query editor.

**Notes**
+ If you delete a table on which a view is based and then attempt to run the view, Athena displays an error message.
+ You can create a nested view, which is a view on top of an existing view. Athena prevents you from running a recursive view that references itself.

# Examples of Athena views
<a name="views-examples"></a>

To show the syntax of the view query, use [SHOW CREATE VIEW](show-create-view.md).

**Example 1**  
Consider the following two tables: a table `employees` with two columns, `id` and `name`, and a table `salaries`, with two columns, `id` and `salary`.   
In this example, we create a view named `name_salary` as a `SELECT` query that obtains a list of IDs mapped to salaries from the tables `employees` and `salaries`:  

```
CREATE VIEW name_salary AS
SELECT
 employees.name, 
 salaries.salary 
FROM employees, salaries 
WHERE employees.id = salaries.id
```

**Example 2**  
In the following example, we create a view named `view1` that enables you to hide more complex query syntax.   
This view runs on top of two tables, `table1` and `table2`, where each table is a different `SELECT` query. The view selects columns from `table1` and joins the results with `table2`. The join is based on column `a` that is present in both tables.  

```
CREATE VIEW view1 AS
WITH
  table1 AS (
         SELECT a, 
         MAX(b) AS the_max 
         FROM x 
         GROUP BY a
         ),
  table2 AS (
         SELECT a, 
         AVG(d) AS the_avg 
         FROM y 
         GROUP BY a)
SELECT table1.a, table1.the_max, table2.the_avg
FROM table1
JOIN table2 
ON table1.a = table2.a;
```

For information about querying federated views, see [Query federated views](running-federated-queries.md#running-federated-queries-federated-views).

# Manage Athena views
<a name="views-managing"></a>

In the Athena console, you can:
+ Locate all views in the left pane, where tables are listed.
+ Filter views.
+ Preview a view, show its properties, edit it, or delete it.

**To show the actions for a view**

A view appears in the console only if you have already created it.

1. In the Athena console, choose **Views**, and then choose a view to expand it and show the columns in the view.

1. Choose the three vertical dots next to the view to show a list of actions for the view.  
![\[The actions menu for a view.\]](http://docs.aws.amazon.com/athena/latest/ug/images/view-options.png)

1. Choose actions to preview the view, insert the view name into the query editor, delete the view, see the view's properties, or display and edit the view in the query editor.

## Supported DDL actions for Athena views
<a name="views-supported-actions"></a>

Athena supports the following management actions for views.


| Statement | Description | 
| --- | --- | 
| [CREATE VIEW and CREATE PROTECTED MULTI DIALECT VIEW](create-view.md) |  Creates a new view from a specified `SELECT` query. For more information, see [Create views](views-console.md#creating-views). The optional `OR REPLACE` clause lets you update the existing view by replacing it.  | 
| [DESCRIBE VIEW](describe-view.md) |  Shows the list of columns for the named view. This allows you to examine the attributes of a complex view.   | 
| [DROP VIEW](drop-view.md) |  Deletes an existing view. The optional `IF EXISTS` clause suppresses the error if the view does not exist.  | 
| [SHOW CREATE VIEW](show-create-view.md) |  Shows the SQL statement that creates the specified view.  | 
| [SHOW VIEWS](show-views.md) |  Lists the views in the specified database, or in the current database if you omit the database name. Use the optional `LIKE` clause with a regular expression to restrict the list of view names. You can also see the list of views in the left pane in the console.  | 
| [SHOW COLUMNS](show-columns.md) |  Lists the columns in the schema for a view.  | 

# Considerations and limitations for Athena views
<a name="considerations-limitations-views"></a>

Athena views have the following considerations and limitations.

## Considerations
<a name="considerations-views"></a>

The following considerations apply to creating and using views in Athena:
+ In Athena, you can preview and work with views created in the Athena console, in the AWS Glue Data Catalog, or with Presto running on an Amazon EMR cluster connected to the same catalog.
+ If you have created Athena views in the Data Catalog, the Data Catalog treats views as tables. You can use table-level fine-grained access control in the Data Catalog to [restrict access](fine-grained-access-to-glue-resources.md) to these views. 
+  Athena prevents you from running recursive views and displays an error message in such cases. A recursive view is a view query that references itself.
+ Athena displays an error message when it detects stale views. A stale view is reported when one of the following occurs:
  + The view references tables or databases that do not exist.
  + A schema or metadata change is made in a referenced table. 
  + A referenced table is dropped and recreated with a different schema or configuration.
+ You can create and run nested views as long as the query behind the nested view is valid and the tables and databases exist.

## Limitations
<a name="limitations-views"></a>
+ Athena view names cannot contain special characters other than underscore `(_)`. For more information, see [Name databases, tables, and columns](tables-databases-columns-names.md).
+ Avoid using reserved keywords for naming views. If you use reserved keywords, use double quotes to enclose reserved keywords in your queries on views. See [Escape reserved keywords in queries](reserved-words.md).
+ You cannot use views created in Athena with external Hive metastores or UDFs. For information about working with views created externally in Hive, see [Work with Hive views](hive-views.md).
+ You cannot use views with geospatial functions.
+ You cannot use views to manage access control on data in Amazon S3. To query a view, you need permissions to access the data stored in Amazon S3. For more information, see [Control access to Amazon S3 from Athena](s3-permissions.md).
+ While querying views across accounts is supported in Athena engine version 3, you cannot create a view that includes a cross-account AWS Glue Data Catalog. For information about cross-account data catalog access, see [Configure cross-account access to AWS Glue data catalogs](security-iam-cross-account-glue-catalog-access.md).
+ The Hive or Iceberg hidden metadata columns `$bucket`, `$file_modified_time`, `$file_size`, and `$partition` are not supported for views in Athena. For information about using the `$path` metadata column in Athena, see [Getting the file locations for source data in Amazon S3](select.md#select-path).
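As noted above, if a view name uses a reserved keyword, enclose the name in double quotes when you query the view. The following sketch references a hypothetical view named with the reserved keyword `order`:

```
SELECT * FROM "order" LIMIT 10
```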

# Use Data Catalog views in Athena
<a name="views-glue"></a>

Creating Data Catalog views in Amazon Athena requires a special `CREATE VIEW` statement. Querying them uses conventional SQL `SELECT` syntax. Data Catalog views are also referred to as *multi dialect* views, or MDVs.

## Create a Data Catalog view
<a name="views-glue-creating-a-data-catalog-view"></a>

To create a Data Catalog view in Athena, use the following syntax.

```
CREATE [ OR REPLACE ] PROTECTED MULTI DIALECT VIEW view_name 
SECURITY DEFINER 
[ SHOW VIEW JSON ]
AS athena-sql-statement
```

**Note**  
The `SHOW VIEW JSON` option applies to Data Catalog views only and not to Athena views. Using the `SHOW VIEW JSON` option performs a "dry run" that validates the input and, if the validation succeeds, returns the JSON of the AWS Glue table object that will represent the view. The actual view is not created. If the `SHOW VIEW JSON` option is not specified, validations are done and the view is created as usual in the Data Catalog.

The following example shows how a user of the `Definer` role creates the `orders_by_date` Data Catalog view. The example assumes that the `Definer` role has full `SELECT` permissions on the `orders` table in the `default` database.

```
CREATE PROTECTED MULTI DIALECT VIEW orders_by_date 
SECURITY DEFINER 
AS 
SELECT orderdate, sum(totalprice) AS price 
FROM orders 
WHERE order_city = 'SEATTLE' 
GROUP BY orderdate
```

For syntax information, see [CREATE PROTECTED MULTI DIALECT VIEW](create-view.md#create-protected-multi-dialect-view).
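To validate the same definition without creating the view, you can add the `SHOW VIEW JSON` option, which returns the JSON of the AWS Glue table object instead. The following is a dry-run sketch based on the preceding example:

```
CREATE PROTECTED MULTI DIALECT VIEW orders_by_date 
SECURITY DEFINER 
SHOW VIEW JSON
AS 
SELECT orderdate, sum(totalprice) AS price 
FROM orders 
WHERE order_city = 'SEATTLE' 
GROUP BY orderdate
```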

## Query a Data Catalog view
<a name="views-glue-querying-a-data-catalog-view"></a>

After the view is created, the `Lake Formation` admin can grant `SELECT` permissions on the Data Catalog view to the `Invoker` principals. The `Invoker` principals can then query the view without having access to the underlying base tables referenced by the view. The following is an example `Invoker` query.

```
SELECT * from orders_by_date where price > 5000
```

## Considerations and limitations
<a name="views-glue-limitations"></a>

Most of the following Data Catalog view limitations are specific to Athena. For additional limitations on Data Catalog views that also apply to other services, see the Lake Formation documentation.
+ Data Catalog views cannot reference other views, database resource links, or table resource links.
+ You can reference up to 10 tables in the view definition.
+ Tables must not have the `IAMAllowedPrincipals` data lake permission in Lake Formation. If present, the error `Multi Dialect views may only reference tables without IAMAllowedPrincipals permissions` occurs.
+ The table's Amazon S3 location must be registered as a Lake Formation data lake location. If it is not, the error `Multi Dialect views may only reference Lake Formation managed tables` occurs. For information about how to register Amazon S3 locations in Lake Formation, see [Registering an Amazon S3 location](https://docs.aws.amazon.com/lake-formation/latest/dg/register-location.html) in the *AWS Lake Formation Developer Guide*.
+ The AWS Glue [GetTables](https://docs.aws.amazon.com/glue/latest/webapi/API_GetTables.html) and [SearchTables](https://docs.aws.amazon.com/glue/latest/webapi/API_SearchTables.html) API calls do not update the `IsRegisteredWithLakeFormation` parameter. To view the correct value for the parameter, use the AWS Glue [GetTable](https://docs.aws.amazon.com/glue/latest/webapi/API_GetTable.html) API. For more information, see [GetTables and SearchTables APIs do not update the value for the IsRegisteredWithLakeFormation parameter](https://docs.aws.amazon.com/lake-formation/latest/dg/limitations.html#issue-GetTables-value) in the *AWS Lake Formation Developer Guide*.
+ The `DEFINER` principal can be only an IAM role.
+ The `DEFINER` role must have full `SELECT` (grantable) permissions on the underlying tables.
+ `UNPROTECTED` Data Catalog views are not supported.
+ User-defined functions (UDFs) are not supported in the view definition.
+ Athena federated data sources cannot be used in Data Catalog views.
+ Data Catalog views are not supported for external Hive metastores.
+ Athena displays an error message when it detects stale views. A stale view is reported when one of the following occurs:
  + The view references tables or databases that do not exist.
  + A schema or metadata change is made in a referenced table. 
  + A referenced table is dropped and recreated with a different schema or configuration.

## Permissions
<a name="views-glue-permissions"></a>

Data Catalog views require three roles: `Lake Formation Admin`, `Definer`, and `Invoker`. 
+ **`Lake Formation Admin`** – Has access to configure all Lake Formation permissions.
+ **`Definer`** – Creates the Data Catalog view. The `Definer` role must have full grantable `SELECT` permissions on all underlying tables that the view definition references.
+ **`Invoker`** – Can query the Data Catalog view or check its metadata. To show the invoker of a query, you can use the `invoker_principal()` DML function. For more information, see [invoker_principal()](functions-env3.md#functions-env3-invoker-principal).
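For example, an `Invoker` principal can confirm which principal a query runs as with the following query:

```
SELECT invoker_principal()
```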

The `Definer` role's trust relationships must allow the `sts:AssumeRole` action for the AWS Glue and Lake Formation service principals. For more information, see [Prerequisites for creating views](https://docs.aws.amazon.com/lake-formation/latest/dg/working-with-views.html#views-prereqs) in the *AWS Lake Formation Developer Guide*.

IAM permissions for Athena access are also required. For more information, see [AWS managed policies for Amazon Athena](security-iam-awsmanpol.md).

# Manage Data Catalog views
<a name="views-glue-managing"></a>

You can use DDL commands to update and manage your Data Catalog views.

## Update a Data Catalog view
<a name="views-glue-updating-a-data-catalog-view"></a>

The `Lake Formation` admin or definer can use the `ALTER VIEW UPDATE DIALECT` syntax to update the view definition. The following example modifies the view definition to select columns from the `returns` table instead of the `orders` table.

```
ALTER VIEW orders_by_date UPDATE DIALECT
AS
SELECT return_date, sum(totalprice) AS price
FROM returns
WHERE order_city = 'SEATTLE'
GROUP BY return_date
```

## Supported DDL actions for AWS Glue Data Catalog views
<a name="views-glue-supported-actions"></a>

Athena supports the following actions for AWS Glue Data Catalog views.


| Statement | Description | 
| --- | --- | 
| [ALTER VIEW DIALECT](alter-view-dialect.md) |  Updates a Data Catalog view by either adding an engine dialect or by updating or dropping an existing engine dialect.  | 
| [CREATE PROTECTED MULTI DIALECT VIEW](create-view.md#create-protected-multi-dialect-view) |  Creates a Data Catalog view from a specified `SELECT` query. The optional `OR REPLACE` clause lets you update the existing view by replacing it.  | 
| [DESCRIBE VIEW](describe-view.md) |  Shows the list of columns for the named view. This allows you to examine the attributes of a complex view.   | 
| [DROP VIEW](drop-view.md) |  Deletes an existing view. The optional `IF EXISTS` clause suppresses the error if the view does not exist.  | 
| [SHOW CREATE VIEW](show-create-view.md) |  Shows the SQL statement that creates the specified view.  | 
| [SHOW VIEWS](show-views.md) |  Lists the views in the specified database, or in the current database if you omit the database name. Use the optional `LIKE` clause with a regular expression to restrict the list of view names. You can also see the list of views in the left pane in the console.  | 
| [SHOW COLUMNS](show-columns.md) |  Lists the columns in the schema for a view.  | 
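For example, the following statement lists the views in a hypothetical `sales` database whose names start with `orders`:

```
SHOW VIEWS IN sales LIKE 'orders*'
```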

# Use saved queries
<a name="saved-queries"></a>

You can use the Athena console to save, edit, run, rename, and delete the queries that you create in the query editor.

## Considerations and limitations
<a name="saved-queries-considerations-and-limitations"></a>
+ You can update the name, description, and query text of saved queries.
+ You can only update the queries in your own account.
+ You cannot change the workgroup or database to which the query belongs.
+ Athena does not keep a history of query modifications. If you want to keep a particular version of a query, save it with a different name.

**Note**  
Amazon Athena resources can now be accessed within Amazon SageMaker Unified Studio (Preview), which helps you access your organization's data and act on it with the best tools. You can migrate saved queries from an Athena workgroup to a SageMaker Unified Studio project, configure projects with existing Athena workgroups, and maintain necessary permissions through IAM role updates. For more information, see [Migrating Amazon Athena resources to Amazon SageMaker Unified Studio (Preview)](https://github.com/aws/Unified-Studio-for-Amazon-Sagemaker/tree/main/migration/athena).

**Topics**
+ [Considerations and limitations](#saved-queries-considerations-and-limitations)
+ [Save a query with a name](saved-queries-name.md)
+ [Run a saved query](saved-queries-run.md)
+ [Edit a saved query](saved-queries-edit.md)
+ [Rename or delete a saved query](saved-queries-rename-or-delete.md)
+ [Rename an undisplayed saved query](saved-queries-rename-not-displayed.md)
+ [Delete an undisplayed saved query](saved-queries-delete-not-displayed.md)
+ [Use the Athena API to update saved queries](saved-queries-update-with-api.md)

# Save a query with a name
<a name="saved-queries-name"></a>

**To save a query and give it a name**

1. In the Athena console query editor, enter or run a query.

1. Above the query editor window, on the tab for the query, choose the three vertical dots, and then choose **Save as**.

1. In the **Save query** dialog box, enter a name for the query and an optional description. You can use the expandable **Preview SQL query** window to verify the contents of the query before you save it.

1. Choose **Save query**.

   In the query editor, the tab for the query shows the name that you specified.

# Run a saved query
<a name="saved-queries-run"></a>

**To run a saved query**

1. In the Athena console, choose the **Saved queries** tab.

1. In the **Saved queries** list, choose the ID of the query that you want to run.

   The query editor displays the query that you chose.

1. Choose **Run**.

# Edit a saved query
<a name="saved-queries-edit"></a>

**To edit a saved query**

1. In the Athena console, choose the **Saved queries** tab.

1. In the **Saved queries** list, choose the ID of the query that you want to edit.

1. Edit the query in the query editor.

1. Perform one of the following steps:
   + To run the query, choose **Run**.
   + To save the query, choose the three vertical dots on the tab for the query, and then choose **Save**.
   + To save the query with a different name, choose the three vertical dots on the tab for the query, and then choose **Save as**.

# Rename or delete a saved query
<a name="saved-queries-rename-or-delete"></a>

**To rename or delete a saved query already displayed in the query editor**

1. Choose the three vertical dots on the tab for the query, and then choose **Rename** or **Delete**.

1. Follow the prompts to rename or delete the query.

# Rename an undisplayed saved query
<a name="saved-queries-rename-not-displayed"></a>

**To rename a saved query not displayed in the query editor**

1. In the Athena console, choose the **Saved queries** tab.

1. Select the check box for the query that you want to rename.

1. Choose **Rename**.

1. In the **Rename query** dialog box, edit the query name and query description. You can use the expandable **Preview SQL query** window to verify the contents of the query before you rename it.

1. Choose **Rename query**.

   The renamed query appears in the **Saved queries** list.

# Delete an undisplayed saved query
<a name="saved-queries-delete-not-displayed"></a>

**To delete a saved query not displayed in the query editor**

1. In the Athena console, choose the **Saved queries** tab.

1. Select one or more check boxes for the queries that you want to delete.

1. Choose **Delete**.

1. At the confirmation prompt, choose **Delete**.

   One or more queries are removed from the **Saved queries** list.

# Use the Athena API to update saved queries
<a name="saved-queries-update-with-api"></a>

For information about using the Athena API to update a saved query, see the [UpdateNamedQuery](https://docs.aws.amazon.com/athena/latest/APIReference/API_UpdateNamedQuery.html) action in the Athena API Reference.

# Use parameterized queries
<a name="querying-with-prepared-statements"></a>

You can use Athena parameterized queries to re-run the same query with different parameter values at execution time and help prevent SQL injection attacks. In Athena, parameterized queries can take the form of execution parameters in any DML query or SQL prepared statements.
+ Queries with execution parameters can be done in a single step and are not workgroup specific. You place question marks in any DML query for the values that you want to parameterize. When you run the query, you supply the execution parameter values in order. The parameters and their values are part of the same query submission, but remain decoupled from each other. Unlike prepared statements, queries with execution parameters let you select the workgroup when you submit the query.
+ Prepared statements require two separate SQL statements: `PREPARE` and `EXECUTE`. First, you define the parameters in the `PREPARE` statement. Then, you run an `EXECUTE` statement that supplies the values for the parameters that you defined. Prepared statements are workgroup specific; you cannot run them outside the context of the workgroup to which they belong.
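As a sketch, the same parameterized filter can take either form (the `events` table and the statement name here are hypothetical):

```
-- Execution parameter form: submitted in a single step;
-- the value for ? is supplied when you run the query
SELECT * FROM events WHERE event_type = ?

-- Prepared statement form: two statements, run in the same workgroup
PREPARE get_events FROM SELECT * FROM events WHERE event_type = ?
EXECUTE get_events USING 'login'
```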

## Considerations and limitations
<a name="querying-with-prepared-statements-considerations-and-limitations"></a>
+ Parameterized queries are supported in Athena engine version 2 and later versions. For information about Athena engine versions, see [Athena engine versioning](engine-versions.md).
+ Currently, parameterized queries are supported only for `SELECT`, `INSERT INTO`, `CTAS`, and `UNLOAD` statements.
+ In parameterized queries, parameters are positional and are denoted by `?`. Parameters are assigned values by their order in the query. Named parameters are not supported.
+ Currently, `?` parameters can be placed only in the `WHERE` clause. Syntax like `SELECT ? FROM table` is not supported.
+ Question mark parameters cannot be enclosed in double or single quotes (that is, `'?'` and `"?"` are not valid syntax).
+ For an execution parameter value to be treated as a string, enclose the value in single quotes rather than double quotes.
+ If necessary, you can use the `CAST` function when you enter a value for a parameterized term. For example, if you have a column of the `date` type that you have parameterized in a query and you want to query for the date `2014-07-05`, entering `CAST('2014-07-05' AS DATE)` for the parameter value will return the result.
+ Prepared statements are workgroup specific, and prepared statement names must be unique within the workgroup.
+ IAM permissions for prepared statements are required. For more information, see [Configure access to prepared statements](security-iam-athena-prepared-statements.md).
+ Queries with execution parameters in the Athena console are limited to a maximum of 25 question marks.
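For example, to supply the date value mentioned above to a hypothetical prepared statement that has a `date` parameter:

```
EXECUTE daily_orders USING CAST('2014-07-05' AS DATE)
```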

**Topics**
+ [Considerations and limitations](#querying-with-prepared-statements-considerations-and-limitations)
+ [Use execution parameters](querying-with-prepared-statements-querying-using-execution-parameters.md)
+ [Use prepared statements](querying-with-prepared-statements-querying.md)
+ [Additional resources](querying-with-prepared-statements-additional-resources.md)

# Use execution parameters
<a name="querying-with-prepared-statements-querying-using-execution-parameters"></a>

You can use question mark placeholders in any DML query to create a parameterized query without creating a prepared statement first. To run these queries, you can use the Athena console, or use the AWS CLI or the AWS SDK and declare the variables in the `execution-parameters` argument.

**Topics**
+ [Use the Athena console](querying-with-prepared-statements-running-queries-with-execution-parameters-in-the-athena-console.md)
+ [Use the AWS CLI](querying-with-prepared-statements-running-queries-with-execution-parameters-using-the-aws-cli.md)

# Run queries with execution parameters in the Athena console
<a name="querying-with-prepared-statements-running-queries-with-execution-parameters-in-the-athena-console"></a>

When you run a parameterized query that has execution parameters (question marks) in the Athena console, you are prompted for the values in the order in which the question marks occur in the query.

**To run a query that has execution parameters**

1. Enter a query with question mark placeholders in the Athena editor, as in the following example.

   ```
   SELECT * FROM "my_database"."my_table"
   WHERE year = ? AND month = ? AND day = ?
   ```

1. Choose **Run**.

1. In the **Enter parameters** dialog box, enter a value in order for each of the question marks in the query.  
![\[Enter values for the query parameters in order\]](http://docs.aws.amazon.com/athena/latest/ug/images/querying-with-prepared-statements-1.png)

1. When you are finished entering the parameters, choose **Run**. The editor shows the query results for the parameter values that you entered.

At this point, you can do one of the following:
+ Enter different parameter values for the same query, and then choose **Run again**.
+ To clear all of the values that you entered at once, choose **Clear**.
+ To edit the query directly (for example, to add or remove question marks), close the **Enter parameters** dialog box first.
+ To save the parameterized query for later use, choose **Save** or **Save as**, and then give the query a name. For more information about using saved queries, see [Use saved queries](saved-queries.md).

As a convenience, the **Enter parameters** dialog box remembers the values that you entered previously for the query as long as you use the same tab in the query editor.

# Run queries with execution parameters using the AWS CLI
<a name="querying-with-prepared-statements-running-queries-with-execution-parameters-using-the-aws-cli"></a>

To use the AWS CLI to run queries with execution parameters, use the `start-query-execution` command and provide a parameterized query in the `query-string` argument. Then, in the `execution-parameters` argument, provide the values for the execution parameters. The following example illustrates this technique.

```
aws athena start-query-execution \
--query-string "SELECT * FROM table WHERE x = ? AND y = ?" \
--query-execution-context "Database"="default" \
--result-configuration "OutputLocation"="s3://amzn-s3-demo-bucket/..." \
--execution-parameters "1" "2"
```

# Use prepared statements
<a name="querying-with-prepared-statements-querying"></a>

You can use a prepared statement for repeated execution of the same query with different query parameters. A prepared statement contains parameter placeholders whose values are supplied at execution time.

**Note**  
The maximum number of prepared statements in a workgroup is 1000.

**Topics**
+ [SQL syntax](querying-with-prepared-statements-sql-statements.md)
+ [Use the Athena console](querying-with-prepared-statements-executing-prepared-statements-without-the-using-clause-athena-console.md)
+ [Use the AWS CLI](querying-with-prepared-statements-cli-section.md)

# SQL syntax for prepared statements
<a name="querying-with-prepared-statements-sql-statements"></a>

You can use the `PREPARE`, `EXECUTE`, and `DEALLOCATE PREPARE` SQL statements to run parameterized queries in the Athena console query editor.

 
+ To specify parameters where you would normally use literal values, use question marks in the `PREPARE` statement.
+ To replace the parameters with values when you run the query, use the `USING` clause in the `EXECUTE` statement.
+ To remove a prepared statement from the prepared statements in a workgroup, use the `DEALLOCATE PREPARE` statement.

The following sections provide additional detail about each of these statements.

**Topics**
+ [PREPARE](querying-with-prepared-statements-prepare.md)
+ [EXECUTE](querying-with-prepared-statements-execute.md)
+ [DEALLOCATE PREPARE](querying-with-prepared-statements-deallocate-prepare.md)

# PREPARE
<a name="querying-with-prepared-statements-prepare"></a>

Prepares a statement to be run at a later time. Prepared statements are saved in the current workgroup with the name that you specify. The statement can include parameters in place of literals to be replaced when the query is run. Parameters to be replaced by values are denoted by question marks.

## Syntax
<a name="querying-with-prepared-statements-prepare-syntax"></a>

```
PREPARE statement_name FROM statement
```

The following table describes these parameters.



| Parameter | Description | 
| --- | --- | 
| statement_name | The name of the statement to be prepared. The name must be unique within the workgroup. | 
| statement | A SELECT, CTAS, INSERT INTO, or UNLOAD query. | 

## PREPARE examples
<a name="querying-with-prepared-statements-prepare-examples"></a>

The following examples show the use of the `PREPARE` statement. Question marks denote the values to be supplied by the `EXECUTE` statement when the query is run.

```
PREPARE my_select1 FROM
SELECT * FROM nation
```

```
PREPARE my_select2 FROM
SELECT * FROM "my_database"."my_table" WHERE year = ?
```

```
PREPARE my_select3 FROM
SELECT "order" FROM orders WHERE productid = ? and quantity < ?
```

```
PREPARE my_insert FROM
INSERT INTO cities_usa (city, state)
SELECT city, state
FROM cities_world
WHERE country = ?
```

```
PREPARE my_unload FROM
UNLOAD (SELECT * FROM table1 WHERE productid < ?)
TO 's3://amzn-s3-demo-bucket/'
WITH (format='PARQUET')
```

# EXECUTE
<a name="querying-with-prepared-statements-execute"></a>

Runs a prepared statement. Values for parameters are specified in the `USING` clause.

## Syntax
<a name="querying-with-prepared-statements-execute-syntax"></a>

```
EXECUTE statement_name [USING value1 [ ,value2, ... ] ]
```

*statement_name* is the name of the prepared statement. *value1* and *value2* are the values to be specified for the parameters in the statement.

## EXECUTE examples
<a name="querying-with-prepared-statements-execute-examples"></a>

The following example runs the `my_select1` prepared statement, which contains no parameters.

```
EXECUTE my_select1
```

The following example runs the `my_select2` prepared statement, which contains a single parameter.

```
EXECUTE my_select2 USING 2012
```

The following example runs the `my_select3` prepared statement, which has two parameters.

```
EXECUTE my_select3 USING 346078, 12
```

The following example supplies a string value for a parameter in the prepared statement `my_insert`.

```
EXECUTE my_insert USING 'usa'
```

The following example supplies a numerical value for the `productid` parameter in the prepared statement `my_unload`.

```
EXECUTE my_unload USING 12
```

# DEALLOCATE PREPARE
<a name="querying-with-prepared-statements-deallocate-prepare"></a>

Removes the prepared statement with the specified name from the list of prepared statements in the current workgroup.

## Syntax
<a name="querying-with-prepared-statements-deallocate-prepare-syntax"></a>

```
DEALLOCATE PREPARE statement_name
```

*statement_name* is the name of the prepared statement to be removed.

## Example
<a name="querying-with-prepared-statements-deallocate-prepare-examples"></a>

The following example removes the `my_select1` prepared statement from the current workgroup.

```
DEALLOCATE PREPARE my_select1
```

# Run interactive prepared statements in the Athena console
<a name="querying-with-prepared-statements-executing-prepared-statements-without-the-using-clause-athena-console"></a>

If you run an existing prepared statement with the syntax `EXECUTE` *prepared_statement* in the query editor, Athena opens the **Enter parameters** dialog box so that you can enter the values that would normally go in the `USING` clause of the `EXECUTE ... USING` statement.

**To run a prepared statement using the Enter parameters dialog box**

1. In the query editor, instead of using the syntax `EXECUTE prepared_statement USING` *value1*`,` *value2* ` ...`, use the syntax `EXECUTE` *prepared_statement*.

1. Choose **Run**. The **Enter parameters** dialog box appears.  
![\[Entering parameter values for a prepared statement in the Athena console.\]](http://docs.aws.amazon.com/athena/latest/ug/images/querying-with-prepared-statements-2.png)

1. Enter the values in order in the **Enter parameters** dialog box. Because the original text of the query is not visible, you must remember the meaning of each positional parameter or have the prepared statement available for reference.

1. Choose **Run**.

# Use the AWS CLI to create, execute, and list prepared statements
<a name="querying-with-prepared-statements-cli-section"></a>

You can use the AWS CLI to create, execute, and list prepared statements.

**Topics**
+ [Create](querying-with-prepared-statements-creating-prepared-statements-using-the-aws-cli.md)
+ [Execute](querying-with-prepared-statements-cli-executing-prepared-statements.md)
+ [List](querying-with-prepared-statements-listing.md)

# Create prepared statements using the AWS CLI
<a name="querying-with-prepared-statements-creating-prepared-statements-using-the-aws-cli"></a>

To use the AWS CLI to create a prepared statement, you can use one of the following `athena` commands:
+ Use the `create-prepared-statement` command and provide a query statement that has execution parameters.
+ Use the `start-query-execution` command and provide a query string that uses the `PREPARE` syntax.

## Use create-prepared-statement
<a name="querying-with-prepared-statements-cli-using-create-prepared-statement"></a>

In a `create-prepared-statement` command, define the query text in the `query-statement` argument, as in the following example.

```
aws athena create-prepared-statement \
--statement-name PreparedStatement1 \
--query-statement "SELECT * FROM table WHERE x = ?" \
--work-group athena-engine-v2
```

## Use start-query-execution and the PREPARE syntax
<a name="querying-with-prepared-statements-cli-using-start-query-execution-and-the-prepare-syntax"></a>

Use the `start-query-execution` command. Put the `PREPARE` statement in the `query-string` argument, as in the following example:

```
aws athena start-query-execution \
--query-string "PREPARE PreparedStatement1 FROM SELECT * FROM table WHERE x = ?" \
--query-execution-context '{"Database": "default"}' \
--result-configuration '{"OutputLocation": "s3://amzn-s3-demo-bucket/..."}'
```

# Execute prepared statements using the AWS CLI
<a name="querying-with-prepared-statements-cli-executing-prepared-statements"></a>

To execute a prepared statement with the AWS CLI, you can supply values for the parameters by using one of the following methods:
+ Use the `execution-parameters` argument.
+ Use the `EXECUTE ... USING` SQL syntax in the `query-string` argument.

## Use the execution-parameters argument
<a name="querying-with-prepared-statements-cli-using-the-execution-parameters-argument"></a>

In this approach, you use the `start-query-execution` command and provide the name of an existing prepared statement in the `query-string` argument. Then, in the `execution-parameters` argument, you provide the values for the execution parameters. The following example shows this method.

```
aws athena start-query-execution \
--query-string "EXECUTE PreparedStatement1" \
--query-execution-context "Database"="default" \
--result-configuration "OutputLocation"="s3://amzn-s3-demo-bucket/..." \
--execution-parameters "1" "2"
```

## Use the EXECUTE ... USING SQL syntax
<a name="querying-with-prepared-statements-cli-using-the-execute-using-sql-syntax"></a>

To run an existing prepared statement using the `EXECUTE ... USING` syntax, use the `start-query-execution` command and place both the name of the prepared statement and the parameter values in the `query-string` argument, as in the following example:

```
aws athena start-query-execution \
--query-string "EXECUTE PreparedStatement1 USING 1" \
--query-execution-context '{"Database": "default"}' \
--result-configuration '{"OutputLocation": "s3://amzn-s3-demo-bucket/..."}'
```

# List prepared statements using the AWS CLI
<a name="querying-with-prepared-statements-listing"></a>

To list the prepared statements for a specific workgroup, you can use the Athena [list-prepared-statements](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/athena/list-prepared-statements.html) AWS CLI command or the [ListPreparedStatements](https://docs.aws.amazon.com/athena/latest/APIReference/API_ListPreparedStatements.html) Athena API action. The `--work-group` parameter is required.

```
aws athena list-prepared-statements --work-group primary
```
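To retrieve the details of a single prepared statement, you can use the `get-prepared-statement` command, as in the following sketch (the statement name is hypothetical):

```
aws athena get-prepared-statement \
--statement-name PreparedStatement1 \
--work-group primary
```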

# Additional resources
<a name="querying-with-prepared-statements-additional-resources"></a>

See the following related posts in the AWS Big Data Blog.
+ [Improve reusability and security using Amazon Athena parameterized queries](https://aws.amazon.com/blogs/big-data/improve-reusability-and-security-using-amazon-athena-parameterized-queries/) 
+ [Use Amazon Athena parameterized queries to provide data as a service](https://aws.amazon.com/blogs/big-data/use-amazon-athena-parameterized-queries-to-provide-data-as-a-service/) 

# Use the cost-based optimizer
<a name="cost-based-optimizer"></a>

You can use the cost-based optimizer (CBO) feature in Athena SQL to optimize your queries. You can optionally request that Athena gather table or column-level statistics for your tables in AWS Glue. If all of the tables in your query have statistics, Athena uses the statistics to create the execution plan that it determines to be the most performant. The query optimizer calculates alternative plans based on a statistical model and then selects the plan that is likely to run the query fastest.

Statistics on AWS Glue tables are collected and stored in the AWS Glue Data Catalog and made available to Athena for improved query planning and execution. These are column-level statistics, such as the number of distinct values, the number of nulls, and the maximum and minimum values, for file formats such as Parquet, ORC, JSON, ION, CSV, and XML. Amazon Athena uses these statistics to optimize queries by applying the most restrictive filters as early as possible in query processing. This filtering limits memory usage and the number of records that must be read to deliver the query results.

In conjunction with CBO, Athena uses a feature called the rule-based optimizer (RBO). RBO mechanically applies rules that are expected to improve query performance. RBO is generally beneficial because its transformations aim to simplify the query plan. However, because RBO does not perform cost calculations or plan comparisons, more complicated queries make it difficult for RBO to create an optimal plan.

For this reason, Athena uses both RBO and CBO to optimize your queries. After Athena identifies opportunities to improve query execution, it creates an optimal plan. For information about execution plan details, see [View execution plans for SQL queries](query-plans.md). For a detailed discussion of how CBO works, see [Speed up queries with the cost-based optimizer in Amazon Athena](https://aws.amazon.com/blogs/big-data/speed-up-queries-with-cost-based-optimizer-in-amazon-athena/) in the AWS Big Data Blog.

To generate statistics for AWS Glue Data Catalog tables, you can use the Athena console, the AWS Glue console, or the AWS Glue APIs. Because Athena is integrated with the AWS Glue Data Catalog, you automatically get the corresponding query performance improvements when you run queries from Amazon Athena.

## Considerations and limitations
<a name="cost-based-optimizer-considerations-and-limitations"></a>
+ **Table types** – Currently, the CBO feature in Athena supports only Hive and Iceberg tables that are in the AWS Glue Data Catalog.
+ **Athena for Spark** – The CBO feature is not available in Athena for Spark.
+ **Pricing** – For pricing information, visit the [AWS Glue pricing page](https://aws.amazon.com/glue/pricing).

## Generate table statistics using the Athena console
<a name="cost-based-optimizer-generating-table-statistics-using-the-athena-console"></a>

This section describes how to use the Athena console to generate table or column-level statistics for a table in AWS Glue. For information on using AWS Glue to generate table statistics, see [Working with column statistics](https://docs.aws.amazon.com/glue/latest/dg/column-statistics.html) in the *AWS Glue Developer Guide*.

**To generate statistics for a table using the Athena console**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. In the Athena query editor **Tables** list, choose the three vertical dots for the table that you want, and then choose **Generate statistics**.  
![\[Context menu for a table in the Athena query editor.\]](http://docs.aws.amazon.com/athena/latest/ug/images/cost-based-optimizer-1.png)

1. In the **Generate statistics** dialog box, choose **All columns** to generate statistics for all columns in the table, or choose **Selected columns** to select specific columns. **All columns** is the default setting.  
![\[The generate statistics dialog box.\]](http://docs.aws.amazon.com/athena/latest/ug/images/cost-based-optimizer-2.png)

1. For **AWS Glue service role**, create a new service role or select an existing one to give AWS Glue permission to generate statistics. The AWS Glue service role also requires [GetObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html) permissions to the Amazon S3 bucket that contains the table's data.  
![\[Choosing an AWS Glue service role.\]](http://docs.aws.amazon.com/athena/latest/ug/images/cost-based-optimizer-3.png)

1. Choose **Generate statistics**. A **Generating statistics for *table\_name*** notification banner displays the task status.  
![\[The Generating statistics notification banner.\]](http://docs.aws.amazon.com/athena/latest/ug/images/cost-based-optimizer-4.png)

1. To view details in the AWS Glue console, choose **View in Glue**. 

   For information about viewing statistics in the AWS Glue console, see [Viewing column statistics](https://docs.aws.amazon.com/glue/latest/dg/view-column-stats.html) in the *AWS Glue Developer Guide*. 

1. After statistics have been generated, the tables and columns that have statistics show the word **Statistics** in parentheses, as in the following image.  
![\[A table showing statistics icons in the Athena query editor.\]](http://docs.aws.amazon.com/athena/latest/ug/images/cost-based-optimizer-5.png)

Now when you run your queries, Athena will perform cost-based optimization on the tables and columns for which statistics were generated.

## Enable and disable table statistics
<a name="cost-based-optimizer-enabling-iceberg-table-statistics"></a>

When you generate table statistics for an Iceberg table by following the steps in the previous section, a table property called `use_iceberg_statistics` is automatically added to the Iceberg table in the AWS Glue Data Catalog and set to `true` by default. If you remove this property or set it to `false`, CBO does not use the Iceberg table statistics when optimizing the query plan during query execution, even if the statistics were generated by AWS Glue. For more information on how to generate table statistics, see [Generate table statistics using the Athena console](#cost-based-optimizer-generating-table-statistics-using-the-athena-console).
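
As a minimal sketch, and assuming the property can also be set through Athena DDL rather than only through the AWS Glue console, you could toggle statistics use with [ALTER TABLE SET TBLPROPERTIES](alter-table-set-tblproperties.md). The table name `my_iceberg_table` is a placeholder:

```
ALTER TABLE my_iceberg_table SET TBLPROPERTIES ('use_iceberg_statistics' = 'false')
```

Setting the property back to `'true'` lets CBO use the generated statistics again.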

In contrast, Hive tables in the AWS Glue Data Catalog do not have a similar table property to enable or disable the use of table statistics for CBO. As a result, CBO always uses the table statistics generated by AWS Glue when optimizing the query plan for Hive tables.

## Additional resources
<a name="cost-based-optimizer-additional-resources"></a>

For additional information, see the following resource.

[![AWS Videos](http://img.youtube.com/vi/zUHEXJdHUxs/0.jpg)](https://www.youtube.com/watch?v=zUHEXJdHUxs)


# Query S3 Express One Zone data
<a name="querying-express-one-zone"></a>

The Amazon S3 Express One Zone storage class is a highly performant Amazon S3 storage class that provides single-digit millisecond response times. As such, it is useful for applications that frequently access data with hundreds of thousands of requests per second.

S3 Express One Zone replicates and stores data within the same Availability Zone to optimize for speed and cost. This differs from Amazon S3 Regional storage classes, which automatically replicate data across a minimum of three AWS Availability Zones within an AWS Region.

For more information, see [What is S3 Express One Zone?](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-express-one-zone.html) in the *Amazon S3 User Guide*.

## Prerequisites
<a name="querying-express-one-zone-prerequisites"></a>

Confirm that the following conditions are met before you begin:
+ **Athena engine version 3** – To use S3 Express One Zone with Athena SQL, your workgroup must be configured to use Athena engine version 3.
+ **S3 Express One Zone permissions** – When S3 Express One Zone calls an action like `GET`, `LIST`, or `PUT` on an Amazon S3 object, the storage class calls `CreateSession` on your behalf. For this reason, your IAM policy must allow the `s3express:CreateSession` action, which allows Athena to invoke the corresponding API operation.
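
A minimal identity-based policy statement that grants this permission might look like the following sketch. The Region, account ID, and directory bucket name are placeholders:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3express:CreateSession",
      "Resource": "arn:aws:s3express:us-east-1:111122223333:bucket/amzn-s3-demo-bucket--use1-az4--x-s3"
    }
  ]
}
```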

## Considerations and limitations
<a name="querying-express-one-zone-considerations-and-limitations"></a>

When you query S3 Express One Zone with Athena, consider the following points. 
+ S3 Express One Zone buckets support `SSE_S3` and `SSE_KMS` encryption. Athena query results are written using `SSE_S3` encryption regardless of the option that you choose in workgroup settings to encrypt query results. This limitation applies to all scenarios in which Athena writes data to S3 Express One Zone buckets, including `CREATE TABLE AS` (CTAS) and `INSERT INTO` statements.
+ The AWS Glue crawler is not supported for creating tables on S3 Express One Zone data.
+ The `MSCK REPAIR TABLE` statement is not supported. As a workaround, use [ALTER TABLE ADD PARTITION](alter-table-add-partition.md).
+ Table-modifying DDL statements for Apache Iceberg (that is, `ALTER TABLE` statements) are not supported for S3 Express One Zone.
+ Lake Formation is not supported with S3 Express One Zone buckets.
+ The following file and table formats are unsupported or have limited support. If formats aren't listed, but are supported for Athena (such as Parquet, ORC, and JSON), then they're also supported for use with S3 Express One Zone storage.  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/querying-express-one-zone.html)
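
Because `MSCK REPAIR TABLE` is not supported, you can register partitions explicitly with `ALTER TABLE ADD PARTITION`, as in the following sketch. The table name, partition key, and directory bucket path are placeholders:

```
ALTER TABLE my_express_table ADD IF NOT EXISTS
  PARTITION (dt = '2024-01-01')
  LOCATION 's3://amzn-s3-demo-bucket--use1-az4--x-s3/data/dt=2024-01-01/'
```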

## Get started
<a name="querying-express-one-zone-getting-started"></a>

Querying S3 Express One Zone data with Athena is straightforward. To get started, use the following procedure.

**To use Athena SQL to query S3 Express One Zone data**

1. Transition your data to S3 Express One Zone storage. For more information, see [Setting the storage class of an object](https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-class-intro.html#sc-howtoset) in the *Amazon S3 User Guide*.

1. Use a [CREATE TABLE](create-table.md) statement in Athena to catalog your data in AWS Glue Data Catalog. For information about creating tables in Athena, see [Create tables in Athena](creating-tables.md) and the [CREATE TABLE](create-table.md) statement.

1. (Optional) Configure the query result location of your Athena workgroup to use an Amazon S3 *directory bucket*. Amazon S3 directory buckets are more performant than general purpose buckets and are designed for workloads or performance-critical applications that require consistent single-digit millisecond latency. For more information, see [Directory buckets overview](https://docs.aws.amazon.com/AmazonS3/latest/userguide/directory-buckets-overview.html) in the *Amazon S3 User Guide*.
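
As a sketch of the cataloging step, the following hypothetical `CREATE TABLE` statement registers Parquet data stored in a directory bucket. The table name, columns, and bucket path are placeholders:

```
CREATE EXTERNAL TABLE my_express_table (
   `id` int,
   `name` string
)
STORED AS PARQUET
LOCATION 's3://amzn-s3-demo-bucket--use1-az4--x-s3/tables/my_express_table/';
```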

# Query restored Amazon Glacier objects
<a name="querying-glacier"></a>

You can use Athena to query restored objects from the Amazon Glacier Flexible Retrieval (formerly Glacier) and Amazon Glacier Deep Archive [Amazon S3 storage classes](https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-class-intro.html#sc-glacier). You must enable this capability on a per-table basis. If you do not enable the feature on a table before you run a query, Athena skips all of the table's Amazon Glacier Flexible Retrieval and Amazon Glacier Deep Archive objects during query execution. 

## Considerations and limitations
<a name="querying-glacier-considerations-and-limitations"></a>
+  Querying restored Amazon Glacier objects is supported only on Athena engine version 3. 
+  The feature is supported only for Apache Hive tables. 
+  You must restore your objects before you query your data; Athena does not restore objects for you. 

## Configure a table to use restored objects
<a name="querying-glacier-configuring-a-table-to-use-restored-objects"></a>

 To configure your Athena table to include restored objects in your queries, you must set its `read_restored_glacier_objects` table property to `true`. To do this, you can use the Athena query editor or the AWS Glue console. You can also use the [AWS Glue CLI](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/glue/update-table.html), the [AWS Glue API](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-UpdateTable), or the [AWS Glue SDK](https://docs.aws.amazon.com/glue/latest/dg/sdk-general-information-section.html). 

### Use the Athena query editor
<a name="querying-glacier-using-the-athena-query-editor"></a>

 In Athena, you can use the [ALTER TABLE SET TBLPROPERTIES](alter-table-set-tblproperties.md) command to set the table property, as in the following example. 

```
ALTER TABLE table_name SET TBLPROPERTIES ('read_restored_glacier_objects' = 'true')
```

### Use the AWS Glue console
<a name="querying-glacier-using-the-aws-glue-console"></a>

 In the AWS Glue console, perform the following steps to add the `read_restored_glacier_objects` table property. 

**To configure table properties in the AWS Glue console**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Do one of the following:
   + Choose **Go to the Data Catalog**.
   + In the navigation pane, choose **Data Catalog tables**.

1. On the **Tables** page, in the list of tables, choose the link for the table that you want to edit.

1. Choose **Actions**, **Edit table**.

1. On the **Edit table** page, in the **Table properties** section, add the following key-value pair.
   + For **Key**, add `read_restored_glacier_objects`.
   + For **Value**, enter `true`.

1. Choose **Save**.

### Use the AWS CLI
<a name="querying-glacier-using-the-aws-cli"></a>

 In the AWS CLI, you can use the AWS Glue [update-table](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/glue/update-table.html) command and its `--table-input` argument to redefine the table and, in doing so, add the `read_restored_glacier_objects` property. In the `--table-input` argument, use the `Parameters` structure to specify the `read_restored_glacier_objects` property with a value of `true`. Note that the argument for `--table-input` must not contain spaces and must use backslashes to escape the double quotes. In the following example, replace *my\_database* and *my\_table* with the names of your database and table.

```
aws glue update-table \
   --database-name my_database \
   --table-input={\"Name\":\"my_table\",\"Parameters\":{\"read_restored_glacier_objects\":\"true\"}}
```

**Important**  
The AWS Glue `update-table` command works in overwrite mode, which means that it replaces the existing table definition with the new definition specified by the `table-input` parameter. For this reason, be sure to also specify all of the fields that you want to be in your table in the `table-input` parameter when you add the `read_restored_glacier_objects` property. 

# Handle schema updates
<a name="handling-schema-updates-chapter"></a>

This section provides guidance on handling schema updates for various data formats. Athena is a schema-on-read query engine. This means that when you create a table in Athena, it applies schemas when reading the data. It does not change or rewrite the underlying data. 

If you anticipate changes in table schemas, consider creating them in a data format that is suitable for your needs. Your goals are to reuse existing Athena queries against evolving schemas, and avoid schema mismatch errors when querying tables with partitions.

To achieve these goals, choose a table's data format by using the summary table in the following topic.

**Topics**
+ [Supported schema update operations by data format](#summary-of-updates)
+ [Understand index access for Apache ORC and Apache Parquet](#index-access)
+ [Make schema updates](make-schema-updates.md)
+ [Update tables with partitions](updates-and-partitions.md)

## Supported schema update operations by data format
<a name="summary-of-updates"></a>

The following table summarizes data storage formats and their supported schema manipulations. Use this table to help you choose the format that will enable you to continue using Athena queries even as your schemas change over time. 

In this table, observe that Parquet and ORC are columnar formats with different default column access methods: by default, Parquet accesses columns by name and ORC by index (ordinal value). Therefore, Athena provides a SerDe property, defined when you create a table, that toggles the default column access method and enables greater flexibility with schema evolution.

For Parquet, the `parquet.column.index.access` property can be set to `true`, which makes Athena access columns by ordinal number. Setting the property to `false` makes Athena access columns by name. Similarly, for ORC, use the `orc.column.index.access` property to control the column access method. For more information, see [Understand index access for Apache ORC and Apache Parquet](#index-access).

CSV and TSV allow you to do all schema manipulations except reordering of columns, or adding columns at the beginning of the table. For example, if your schema evolution requires only renaming columns but not removing them, you can choose to create your tables in CSV or TSV. If you require removing columns, do not use CSV or TSV, and instead use any of the other supported formats, preferably, a columnar format, such as Parquet or ORC.


**Schema updates and data formats in Athena**  

| Expected type of schema update | Summary | CSV (with and without headers) and TSV | JSON | AVRO | PARQUET: Read by name (default) | PARQUET: Read by index | ORC: Read by index (default) | ORC: Read by name | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | 
|  [Rename columns](updates-renaming-columns.md) | Store your data in CSV and TSV, or in ORC and Parquet if they are read by index. | Y | N | N | N  | Y | Y | N | 
|  [Add columns at the beginning or in the middle of the table](updates-add-columns-beginning-middle-of-table.md) | Store your data in JSON, AVRO, or in Parquet and ORC if they are read by name. Do not use CSV and TSV. | N | Y | Y | Y | N | N | Y | 
|  [Add columns at the end of the table](updates-add-columns-end-of-table.md) | Store your data in CSV or TSV, JSON, AVRO, ORC, or Parquet. | Y | Y | Y | Y | Y | Y | Y | 
| [Remove columns](updates-removing-columns.md) |  Store your data in JSON, AVRO, or Parquet and ORC, if they are read by name. Do not use CSV and TSV. | N | Y | Y | Y | N | N | Y | 
| [Reorder columns](updates-reordering-columns.md) | Store your data in AVRO, JSON or ORC and Parquet if they are read by name. | N | Y | Y | Y | N | N | Y | 
| [Change a column's data type](updates-changing-column-type.md) | Store your data in any format, but test your query in Athena to make sure the data types are compatible. For Parquet and ORC, changing a data type works only for partitioned tables. | Y | Y | Y | Y | Y | Y | Y | 

## Understand index access for Apache ORC and Apache Parquet
<a name="index-access"></a>

Parquet and ORC are columnar data storage formats that can be read by index or by name. Storing your data in either of these formats lets you perform all operations on schemas and run Athena queries without schema mismatch errors. 
+ Athena *reads ORC by index by default*, as defined in `SERDEPROPERTIES ( 'orc.column.index.access'='true')`. For more information, see [ORC: Read by index](#orc-read-by-index).
+ Athena reads *Parquet by name by default*, as defined in `SERDEPROPERTIES ( 'parquet.column.index.access'='false')`. For more information, see [Parquet: Read by name](#parquet-read-by-name).

Because these are the defaults, specifying these SerDe properties in your `CREATE TABLE` queries is optional; they are used implicitly. When used, they allow you to run some schema update operations while preventing others. To enable those operations, run another `CREATE TABLE` query and change the SerDe settings. 

**Note**  
The SerDe properties are *not* automatically propagated to each partition. Use `ALTER TABLE ADD PARTITION` statements to set the SerDe properties for each partition. To automate this process, write a script that runs `ALTER TABLE ADD PARTITION` statements.

The following sections describe these cases in detail.

### ORC: Read by index
<a name="orc-read-by-index"></a>

A table in *ORC is read by index*, by default. This is defined by the following syntax:

```
WITH SERDEPROPERTIES ( 
  'orc.column.index.access'='true')
```

*Reading by index* allows you to rename columns. But then you lose the ability to remove columns or add them in the middle of the table. 

To make ORC read by name, which will allow you to add columns in the middle of the table or remove columns in ORC, set the SerDe property `orc.column.index.access` to `false` in the `CREATE TABLE` statement. In this configuration, you will lose the ability to rename columns.

**Note**  
In Athena engine version 2, when ORC tables are set to read by name, Athena requires that all column names in the ORC files be in lower case. Because Apache Spark does not lowercase field names when it generates ORC files, Athena might not be able to read the data so generated. The workaround is to rename the columns to be in lower case, or use Athena engine version 3. 

The following example illustrates how to change the ORC to make it read by name:

```
CREATE EXTERNAL TABLE orders_orc_read_by_name (
   `o_comment` string,
   `o_orderkey` int, 
   `o_custkey` int, 
   `o_orderpriority` string, 
   `o_orderstatus` string, 
   `o_clerk` string, 
   `o_shippriority` int, 
   `o_orderdate` string
) 
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
WITH SERDEPROPERTIES ( 
  'orc.column.index.access'='false') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 's3://amzn-s3-demo-bucket/orders_orc/';
```

### Parquet: Read by name
<a name="parquet-read-by-name"></a>

A table in *Parquet is read by name*, by default. This is defined by the following syntax:

```
WITH SERDEPROPERTIES ( 
  'parquet.column.index.access'='false')
```

*Reading by name* allows you to add columns in the middle of the table and remove columns. But then you lose the ability to rename columns. 

To make Parquet read by index, which will allow you to rename columns, you must create a table with `parquet.column.index.access` SerDe property set to `true`.
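
For example, the following statement, modeled on the ORC example earlier in this topic, creates a hypothetical Parquet table that is read by index. The table name and location are placeholders:

```
CREATE EXTERNAL TABLE orders_parquet_read_by_index (
   `o_orderkey` int, 
   `o_custkey` int, 
   `o_orderstatus` string, 
   `o_totalprice` double, 
   `o_orderdate` string, 
   `o_orderpriority` string, 
   `o_clerk` string, 
   `o_shippriority` int
) 
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
WITH SERDEPROPERTIES ( 
  'parquet.column.index.access'='true') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://amzn-s3-demo-bucket/orders_parquet/';
```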

# Make schema updates
<a name="make-schema-updates"></a>

This topic describes some of the changes that you can make to the schema in `CREATE TABLE` statements without actually altering your data. To update a schema, you can in some cases use an `ALTER TABLE` command, but in other cases you do not actually modify an existing table. Instead, you create a table with a new name that modifies the schema that you used in your original `CREATE TABLE` statement.

Depending on how you expect your schemas to evolve, to continue using Athena queries, choose a compatible data format. 

Consider an application that reads orders information from an `orders` table that exists in two formats: CSV and Parquet. 

The following example creates a table in Parquet:

```
CREATE EXTERNAL TABLE orders_parquet (
   `orderkey` int, 
   `orderstatus` string, 
   `totalprice` double, 
   `orderdate` string, 
   `orderpriority` string, 
   `clerk` string, 
   `shippriority` int
) STORED AS PARQUET
LOCATION 's3://amzn-s3-demo-bucket/orders_parquet/';
```

The following example creates the same table in CSV:

```
CREATE EXTERNAL TABLE orders_csv (
   `orderkey` int, 
   `orderstatus` string, 
   `totalprice` double, 
   `orderdate` string, 
   `orderpriority` string, 
   `clerk` string, 
   `shippriority` int
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://amzn-s3-demo-bucket/orders_csv/';
```

The following topics show how updates to these tables affect Athena queries.

**Topics**
+ [Add columns at the beginning or in the middle of the table](updates-add-columns-beginning-middle-of-table.md)
+ [Add columns at the end of the table](updates-add-columns-end-of-table.md)
+ [Remove columns](updates-removing-columns.md)
+ [Rename columns](updates-renaming-columns.md)
+ [Reorder columns](updates-reordering-columns.md)
+ [Change a column data type](updates-changing-column-type.md)

# Add columns at the beginning or in the middle of the table
<a name="updates-add-columns-beginning-middle-of-table"></a>

Adding columns is one of the most frequent schema changes. For example, you may add a new column to enrich the table with new data. Or, you may add a new column if the source for an existing column has changed, and keep the previous version of this column, to adjust applications that depend on them.

To add columns at the beginning or in the middle of the table and continue running queries against existing tables, use AVRO or JSON, or use Parquet or ORC with the SerDe property set to read by name. For information, see [Understand index access for Apache ORC and Apache Parquet](handling-schema-updates-chapter.md#index-access).

Do not add columns at the beginning or in the middle of the table in CSV and TSV, as these formats depend on ordering. Adding a column in such cases will lead to schema mismatch errors when the schema of partitions changes.

 The following example creates a new table that adds an `o_comment` column in the middle of a table based on JSON data.

```
CREATE EXTERNAL TABLE orders_json_column_addition (
   `o_orderkey` int, 
   `o_custkey` int, 
   `o_orderstatus` string, 
   `o_comment` string, 
   `o_totalprice` double, 
   `o_orderdate` string, 
   `o_orderpriority` string, 
   `o_clerk` string, 
   `o_shippriority` int
) 
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://amzn-s3-demo-bucket/orders_json/';
```

# Add columns at the end of the table
<a name="updates-add-columns-end-of-table"></a>

If you create tables in any of the formats that Athena supports, such as Parquet, ORC, Avro, JSON, CSV, and TSV, you can use the `ALTER TABLE ADD COLUMNS` statement to add columns after existing columns but before partition columns.

The following example adds a `comment` column at the end of the `orders_parquet` table before any partition columns: 

```
ALTER TABLE orders_parquet ADD COLUMNS (comment string)
```

**Note**  
To see a new table column in the Athena Query Editor after you run `ALTER TABLE ADD COLUMNS`, manually refresh the table list in the editor, and then expand the table again.

# Remove columns
<a name="updates-removing-columns"></a>

You may need to remove columns from tables if they no longer contain data, or to restrict access to the data in them.
+ You can remove columns from tables in JSON, Avro, and in Parquet and ORC if they are read by name. For information, see [Understand index access for Apache ORC and Apache Parquet](handling-schema-updates-chapter.md#index-access). 
+ We do not recommend removing columns from tables in CSV and TSV if you want to retain the tables you have already created in Athena. Removing a column breaks the schema and requires that you recreate the table without the removed column.

In this example, you remove a column ``totalprice`` from a table in Parquet and run a query. In Athena, Parquet is read by name by default, which is why we omit the `SERDEPROPERTIES` configuration that specifies reading by name. Notice that the following query succeeds, even though you changed the schema:

```
CREATE EXTERNAL TABLE orders_parquet_column_removed (
   `o_orderkey` int, 
   `o_custkey` int, 
   `o_orderstatus` string, 
   `o_orderdate` string, 
   `o_orderpriority` string, 
   `o_clerk` string, 
   `o_shippriority` int, 
   `o_comment` string
) 
STORED AS PARQUET
LOCATION 's3://amzn-s3-demo-bucket/orders_parquet/';
```

# Rename columns
<a name="updates-renaming-columns"></a>

You may want to rename columns in your tables to correct spelling, make column names more descriptive, or to reuse an existing column to avoid column reordering.

You can rename columns if you store your data in CSV and TSV, or in Parquet and ORC that are configured to read by index. For information, see [Understand index access for Apache ORC and Apache Parquet](handling-schema-updates-chapter.md#index-access). 

Athena reads data in CSV and TSV in the order of the columns in the schema and returns them in the same order. It does not use column names for mapping data to a column, which is why you can rename columns in CSV or TSV without breaking Athena queries. 

One strategy for renaming columns is to create a new table based on the same underlying data, but using new column names. The following example creates a new table called `orders_parquet_column_renamed` based on the `orders_parquet` data. The example changes the column name ``o_totalprice`` to ``o_total_price`` and then runs a query in Athena: 

```
CREATE EXTERNAL TABLE orders_parquet_column_renamed (
   `o_orderkey` int, 
   `o_custkey` int, 
   `o_orderstatus` string, 
   `o_total_price` double, 
   `o_orderdate` string, 
   `o_orderpriority` string, 
   `o_clerk` string, 
   `o_shippriority` int, 
   `o_comment` string
) 
STORED AS PARQUET
LOCATION 's3://amzn-s3-demo-bucket/orders_parquet/';
```

In the Parquet table case, the following query runs, but the renamed column does not show data because the column was being accessed by name (a default in Parquet) rather than by index:

```
SELECT * 
FROM orders_parquet_column_renamed;
```

A query with a table in CSV looks similar:

```
CREATE EXTERNAL TABLE orders_csv_column_renamed (
   `o_orderkey` int, 
   `o_custkey` int, 
   `o_orderstatus` string, 
   `o_total_price` double, 
   `o_orderdate` string, 
   `o_orderpriority` string, 
   `o_clerk` string, 
   `o_shippriority` int, 
   `o_comment` string
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://amzn-s3-demo-bucket/orders_csv/';
```

In the CSV table case, the following query runs and the data displays in all columns, including the one that was renamed:

```
SELECT * 
FROM orders_csv_column_renamed;
```

# Reorder columns
<a name="updates-reordering-columns"></a>

You can reorder columns only for tables in data formats that are read by name, such as JSON or Parquet (which reads by name by default). You can also make ORC read by name, if needed. For information, see [Understand index access for Apache ORC and Apache Parquet](handling-schema-updates-chapter.md#index-access).

The following example creates a new table with the columns in a different order:

```
CREATE EXTERNAL TABLE orders_parquet_columns_reordered (
   `o_comment` string,
   `o_orderkey` int, 
   `o_custkey` int, 
   `o_orderpriority` string, 
   `o_orderstatus` string, 
   `o_clerk` string, 
   `o_shippriority` int, 
   `o_orderdate` string
) 
STORED AS PARQUET
LOCATION 's3://amzn-s3-demo-bucket/orders_parquet/';
```

# Change a column data type
<a name="updates-changing-column-type"></a>

You might want to use a different column type when the existing type can no longer hold the amount of information required. For example, an ID column's values might exceed the size of the `INT` data type and require the use of the `BIGINT` data type.

## Considerations
<a name="updates-changing-column-type-considerations"></a>

When planning to use a different data type for a column, consider the following points: 
+ In most cases, you cannot change the data type of a column directly. Instead, you re-create the Athena table and define the column with the new data type. 
+ Only certain data types can be read as other data types. See the table in this section for compatible conversions.
+ For data in Parquet and ORC, you cannot use a different data type for a column if the table is not partitioned. 
+ For partitioned tables in Parquet and ORC, a partition's column type can be different from another partition's column type, and Athena will `CAST` to the desired type, if possible. For information, see [Avoid schema mismatch errors for tables with partitions](updates-and-partitions.md#partitions-dealing-with-schema-mismatch-errors).
+ For tables created using the [LazySimpleSerDe](lazy-simple-serde.md) only, it is possible to use the `ALTER TABLE REPLACE COLUMNS` statement to replace existing columns with a different data type, but all existing columns that you want to keep must also be redefined in the statement, or they will be dropped. For more information, see [ALTER TABLE REPLACE COLUMNS](alter-table-replace-columns.md).
+ For Apache Iceberg tables only, you can use the [ALTER TABLE CHANGE COLUMN](querying-iceberg-alter-table-change-column.md) statement to change the data type of a column. `ALTER TABLE REPLACE COLUMNS` is not supported for Iceberg tables. For more information, see [Evolve Iceberg table schema](querying-iceberg-evolving-table-schema.md).
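
As a sketch of the Iceberg case, the following hypothetical statement changes the type of an `o_shippriority` column from `int` to `bigint`. The table and column names are placeholders:

```
ALTER TABLE my_iceberg_table CHANGE COLUMN o_shippriority o_shippriority bigint
```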

**Important**  
We strongly suggest that you test and verify your queries before performing data type translations. If Athena cannot use the target data type, the `CREATE TABLE` query may fail. 

## Use compatible data types
<a name="updates-changing-column-type-use-compatible-data-types"></a>

Whenever possible, use compatible data types. The following table lists data types that can be treated as other data types:


| Original data type | Available target data types | 
| --- | --- | 
| STRING | BYTE, TINYINT, SMALLINT, INT, BIGINT | 
| BYTE | TINYINT, SMALLINT, INT, BIGINT | 
| TINYINT | SMALLINT, INT, BIGINT | 
| SMALLINT | INT, BIGINT | 
| INT | BIGINT | 
| FLOAT | DOUBLE | 

The following example uses the `CREATE TABLE` statement for the original `orders_json` table to create a new table called `orders_json_bigint`. The new table uses `BIGINT` instead of `INT` as the data type for the `o_shippriority` column. 

```
CREATE EXTERNAL TABLE orders_json_bigint (
   `o_orderkey` int, 
   `o_custkey` int, 
   `o_orderstatus` string, 
   `o_totalprice` double, 
   `o_orderdate` string, 
   `o_orderpriority` string, 
   `o_clerk` string, 
   `o_shippriority` BIGINT
) 
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://amzn-s3-demo-bucket/orders_json';
```

The following query runs successfully, just as it did before the data type change:

```
SELECT * FROM orders_json 
LIMIT 10;
```

# Update tables with partitions
<a name="updates-and-partitions"></a>

In Athena, a table and its partitions must use the same data formats but their schemas may differ. When you create a new partition, that partition usually inherits the schema of the table. Over time, the schemas may start to differ. Reasons include:
+ If your table's schema changes, the schemas for partitions are not updated to remain in sync with the table's schema. 
+ The AWS Glue Crawler allows you to discover data in partitions with different schemas. This means that if you create a table in Athena with AWS Glue, after the crawler finishes processing, the schemas for the table and its partitions may be different.
+ If you add partitions directly using an AWS API.

Athena processes tables with partitions successfully if they meet the following constraints. If these constraints are not met, Athena issues a HIVE_PARTITION_SCHEMA_MISMATCH error. 
+ Each partition's schema is compatible with the table's schema. 
+ The table's data format allows the type of update you want to perform: add, delete, reorder columns, or change a column's data type. 

  For example, for CSV and TSV formats, you can rename columns, add new columns at the end of the table, and change a column's data type if the types are compatible, but you cannot remove columns. For other formats, you can add or remove columns, or change a column's data type to another if the types are compatible. For information, see [Summary: Updates and Data Formats in Athena](handling-schema-updates-chapter.md#summary-of-updates). 

## Avoid schema mismatch errors for tables with partitions
<a name="partitions-dealing-with-schema-mismatch-errors"></a>

At the beginning of query execution, Athena verifies the table's schema by checking that each column data type is compatible between the table and the partition. 
+ For Parquet and ORC data storage types, Athena relies on the column names and uses them for its column name-based schema verification. This eliminates `HIVE_PARTITION_SCHEMA_MISMATCH` errors for tables with partitions in Parquet and ORC. (This is true for ORC if the SerDe property is set to access the index by name: `orc.column.index.access=FALSE`. Parquet reads the index by name by default).
+ For CSV, JSON, and Avro, Athena uses an index-based schema verification. This means that if you encounter a schema mismatch error, you should drop the partition that is causing a schema mismatch and recreate it, so that Athena can query it without failing.

Athena compares the table's schema to the partition schemas. If you create a table in CSV, JSON, or Avro in Athena with AWS Glue Crawler, after the crawler finishes processing, the schemas for the table and its partitions may be different. If there is a mismatch between the table's schema and the partition schemas, your queries fail in Athena with a schema verification error similar to the following: "'crawler_test.click_avro' is declared as type 'string', but partition 'partition_0=2017-01-17' declared column 'col68' as type 'double'."

A typical workaround for such errors is to drop the partition that is causing the error and recreate it. For more information, see [ALTER TABLE DROP PARTITION](alter-table-drop-partition.md) and [ALTER TABLE ADD PARTITION](alter-table-add-partition.md).
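As a sketch of that workaround, assuming a hypothetical table `my_table` partitioned by `dt` with data under the S3 location shown:

```
ALTER TABLE my_table DROP PARTITION (dt='2017-01-17');

ALTER TABLE my_table ADD PARTITION (dt='2017-01-17')
LOCATION 's3://amzn-s3-demo-bucket/data/dt=2017-01-17/';
```

After the partition is recreated, Athena re-reads its schema from the table definition, and subsequent queries over that partition succeed.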

# Query arrays
<a name="querying-arrays"></a>

Amazon Athena lets you create arrays, concatenate them, convert them to different data types, and then filter, flatten, and sort them.

**Topics**
+ [Create arrays](creating-arrays.md)
+ [Concatenate strings and arrays](concatenating-strings-and-arrays.md)
+ [Convert array data types](converting-array-data-types.md)
+ [Find array lengths](finding-lengths.md)
+ [Access array elements](accessing-array-elements.md)
+ [Flatten nested arrays](flattening-arrays.md)
+ [Create arrays from subqueries](creating-arrays-from-subqueries.md)
+ [Filter arrays](filtering-arrays.md)
+ [Sort arrays](sorting-arrays.md)
+ [Use aggregation functions with arrays](arrays-and-aggregation.md)
+ [Convert arrays to strings](converting-arrays-to-strings.md)
+ [Use arrays to create maps](arrays-create-maps.md)
+ [Query arrays with complex types](rows-and-structs.md)

# Create arrays
<a name="creating-arrays"></a>

To build an array literal in Athena, use the `ARRAY` keyword, followed by brackets `[ ]`, and include the array elements separated by commas.

## Examples
<a name="examples"></a>

This query creates one array with four elements.

```
SELECT ARRAY [1,2,3,4] AS items
```

It returns:

```
+-----------+
| items     |
+-----------+
| [1,2,3,4] |
+-----------+
```

This query creates two arrays.

```
SELECT ARRAY[ ARRAY[1,2], ARRAY[3,4] ] AS items
```

It returns:

```
+--------------------+
| items              |
+--------------------+
| [[1, 2], [3, 4]]   |
+--------------------+
```

To create an array from selected columns of compatible types, use a query, as in this example:

```
WITH
dataset AS (
  SELECT 1 AS x, 2 AS y, 3 AS z
)
SELECT ARRAY [x,y,z] AS items FROM dataset
```

This query returns:

```
+-----------+
| items     |
+-----------+
| [1,2,3]   |
+-----------+
```

In the following example, two arrays are selected and returned as a welcome message.

```
WITH
dataset AS (
  SELECT
    ARRAY ['hello', 'amazon', 'athena'] AS words,
    ARRAY ['hi', 'alexa'] AS alexa
)
SELECT ARRAY[words, alexa] AS welcome_msg
FROM dataset
```

This query returns:

```
+----------------------------------------+
| welcome_msg                            |
+----------------------------------------+
| [[hello, amazon, athena], [hi, alexa]] |
+----------------------------------------+
```

To create an array of key-value pairs, use the `MAP` operator that takes an array of keys followed by an array of values, as in this example:

```
SELECT ARRAY[
   MAP(ARRAY['first', 'last', 'age'],ARRAY['Bob', 'Smith', '40']),
   MAP(ARRAY['first', 'last', 'age'],ARRAY['Jane', 'Doe', '30']),
   MAP(ARRAY['first', 'last', 'age'],ARRAY['Billy', 'Smith', '8'])
] AS people
```

This query returns:

```
+-----------------------------------------------------------------------------------------------------+
| people                                                                                              |
+-----------------------------------------------------------------------------------------------------+
| [{last=Smith, first=Bob, age=40}, {last=Doe, first=Jane, age=30}, {last=Smith, first=Billy, age=8}] |
+-----------------------------------------------------------------------------------------------------+
```

# Concatenate strings and arrays
<a name="concatenating-strings-and-arrays"></a>

Concatenating strings and concatenating arrays use similar techniques.

## Concatenate strings
<a name="concatenating-strings"></a>

To concatenate two strings, you can use the double pipe `||` operator, as in the following example.

```
SELECT 'This' || ' is' || ' a' || ' test.' AS Concatenated_String
```

This query returns:


| # | Concatenated_String | 
| --- | --- | 
| 1 |  `This is a test.`  | 

You can use the `concat()` function to achieve the same result.

```
SELECT concat('This', ' is', ' a', ' test.') AS Concatenated_String
```

This query returns:


| # | Concatenated_String | 
| --- | --- | 
| 1 |  `This is a test.`  | 

You can use the `concat_ws()` function to concatenate strings with the separator specified in the first argument.

```
SELECT concat_ws(' ', 'This', 'is', 'a', 'test.') as Concatenated_String
```

This query returns:


| # | Concatenated_String | 
| --- | --- | 
| 1 |  `This is a test.`  | 

To concatenate two columns of the string data type using a dot, reference the two columns using double quotes, and enclose the dot in single quotes as a hard-coded string. If a column is not of the string data type, you can use `CAST("column_name" as VARCHAR)` to cast the column first.

```
SELECT "col1" || '.' || "col2" as Concatenated_String
FROM my_table
```

This query returns:


| # | Concatenated_String | 
| --- | --- | 
| 1 |  `col1_string_value.col2_string_value`  | 
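If one of the columns is not of the string data type, a sketch of the same query with the `CAST` applied (assuming a numeric `col2` in the hypothetical `my_table`) looks like this:

```
SELECT "col1" || '.' || CAST("col2" AS VARCHAR) AS Concatenated_String
FROM my_table
```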

## Concatenate arrays
<a name="concatenating-arrays"></a>

You can use the same techniques to concatenate arrays.

To concatenate multiple arrays, use the double pipe `||` operator.

```
SELECT ARRAY [4,5] || ARRAY[ ARRAY[1,2], ARRAY[3,4] ] AS items
```

This query returns:


| # | items | 
| --- | --- | 
| 1 |  `[[4, 5], [1, 2], [3, 4]]`  | 

To combine multiple arrays into a single array, use the double pipe operator or the `concat()` function.

```
WITH
dataset AS (
  SELECT
    ARRAY ['Hello', 'Amazon', 'Athena'] AS words,
    ARRAY ['Hi', 'Alexa'] AS alexa
)
SELECT concat(words, alexa) AS welcome_msg
FROM dataset
```

This query returns:


| # | welcome_msg | 
| --- | --- | 
| 1 |  `[Hello, Amazon, Athena, Hi, Alexa]`  | 

For more information about `concat()` and other string functions, see [String functions and operators](https://trino.io/docs/current/functions/string.html) in the Trino documentation.

# Convert array data types
<a name="converting-array-data-types"></a>

To convert data in arrays to supported data types, use the `CAST` operator, as `CAST(value AS type)`. Athena supports all of the native Presto data types.

```
SELECT
   ARRAY [CAST(4 AS VARCHAR), CAST(5 AS VARCHAR)]
AS items
```

This query returns:

```
+-------+
| items |
+-------+
| [4,5] |
+-------+
```

Create two arrays with key-value pair elements, convert them to JSON, and concatenate, as in this example:

```
SELECT
   ARRAY[CAST(MAP(ARRAY['a1', 'a2', 'a3'], ARRAY[1, 2, 3]) AS JSON)] ||
   ARRAY[CAST(MAP(ARRAY['b1', 'b2', 'b3'], ARRAY[4, 5, 6]) AS JSON)]
AS items
```

This query returns:

```
+--------------------------------------------------+
| items                                            |
+--------------------------------------------------+
| [{"a1":1,"a2":2,"a3":3}, {"b1":4,"b2":5,"b3":6}] |
+--------------------------------------------------+
```

# Find array lengths
<a name="finding-lengths"></a>

The `cardinality` function returns the length of an array, as in this example:

```
SELECT cardinality(ARRAY[1,2,3,4]) AS item_count
```

This query returns:

```
+------------+
| item_count |
+------------+
| 4          |
+------------+
```

# Access array elements
<a name="accessing-array-elements"></a>

To access array elements, use the `[]` operator, with 1 specifying the first element, 2 specifying the second element, and so on, as in this example:

```
WITH dataset AS (
SELECT
   ARRAY[CAST(MAP(ARRAY['a1', 'a2', 'a3'], ARRAY[1, 2, 3]) AS JSON)] ||
   ARRAY[CAST(MAP(ARRAY['b1', 'b2', 'b3'], ARRAY[4, 5, 6]) AS JSON)]
AS items )
SELECT items[1] AS item FROM dataset
```

This query returns:

```
+------------------------+
| item                   |
+------------------------+
| {"a1":1,"a2":2,"a3":3} |
+------------------------+
```

To access the elements of an array at a given position (known as the index position), use the `element_at()` function and specify the array name and the index position:
+ If the index is greater than 0, `element_at()` returns the element that you specify, counting from the beginning to the end of the array. It behaves like the `[]` operator.
+ If the index is less than 0, `element_at()` returns the element counting from the end to the beginning of the array.

The following query creates an array `words` and selects the first element, `hello`, as `first_word`; the second element from the end of the array, `amazon`, as `middle_word`; and the third element, `athena`, as `last_word`.

```
WITH dataset AS (
  SELECT ARRAY ['hello', 'amazon', 'athena'] AS words
)
SELECT
  element_at(words, 1) AS first_word,
  element_at(words, -2) AS middle_word,
  element_at(words, cardinality(words)) AS last_word
FROM dataset
```

This query returns:

```
+----------------------------------------+
| first_word  | middle_word | last_word  |
+----------------------------------------+
| hello       | amazon      | athena     |
+----------------------------------------+
```

# Flatten nested arrays
<a name="flattening-arrays"></a>

When working with nested arrays, you often need to expand nested array elements into a single array, or expand the array into multiple rows.

## Use the flatten function
<a name="flattening-arrays-flatten-function"></a>

To flatten a nested array's elements into a single array of values, use the `flatten` function. The following query returns a single row that contains the flattened array.

```
SELECT flatten(ARRAY[ ARRAY[1,2], ARRAY[3,4] ]) AS items
```

This query returns:

```
+-----------+
| items     |
+-----------+
| [1,2,3,4] |
+-----------+
```

## Use CROSS JOIN and UNNEST
<a name="flattening-arrays-cross-join-and-unnest"></a>

To flatten an array into multiple rows, use `CROSS JOIN` in conjunction with the `UNNEST` operator, as in this example:

```
WITH dataset AS (
  SELECT
    'engineering' as department,
    ARRAY['Sharon', 'John', 'Bob', 'Sally'] as users
)
SELECT department, names FROM dataset
CROSS JOIN UNNEST(users) as t(names)
```

This query returns:

```
+----------------------+
| department  | names  |
+----------------------+
| engineering | Sharon |
| engineering | John   |
| engineering | Bob    |
| engineering | Sally  |
+----------------------+
```

To flatten an array of key-value pairs, transpose selected keys into columns, as in this example:

```
WITH
dataset AS (
  SELECT
    'engineering' as department,
     ARRAY[
      MAP(ARRAY['first', 'last', 'age'],ARRAY['Bob', 'Smith', '40']),
      MAP(ARRAY['first', 'last', 'age'],ARRAY['Jane', 'Doe', '30']),
      MAP(ARRAY['first', 'last', 'age'],ARRAY['Billy', 'Smith', '8'])
     ] AS people
  )
SELECT names['first'] AS
 first_name,
 names['last'] AS last_name,
 department FROM dataset
CROSS JOIN UNNEST(people) AS t(names)
```

This query returns:

```
+--------------------------------------+
| first_name | last_name | department  |
+--------------------------------------+
| Bob        | Smith     | engineering |
| Jane       | Doe       | engineering |
| Billy      | Smith     | engineering |
+--------------------------------------+
```

From a list of employees, select the employee with the highest combined score. `UNNEST` can be used in the `FROM` clause without a preceding `CROSS JOIN` because it is the default join operator and is therefore implied.

```
WITH
dataset AS (
  SELECT ARRAY[
    CAST(ROW('Sally', 'engineering', ARRAY[1,2,3,4]) AS ROW(name VARCHAR, department VARCHAR, scores ARRAY(INTEGER))),
    CAST(ROW('John', 'finance', ARRAY[7,8,9]) AS ROW(name VARCHAR, department VARCHAR, scores ARRAY(INTEGER))),
    CAST(ROW('Amy', 'devops', ARRAY[12,13,14,15]) AS ROW(name VARCHAR, department VARCHAR, scores ARRAY(INTEGER)))
  ] AS users
),
users AS (
 SELECT person, score
 FROM
   dataset,
   UNNEST(dataset.users) AS t(person),
   UNNEST(person.scores) AS t(score)
)
SELECT person.name, person.department, SUM(score) AS total_score FROM users
GROUP BY (person.name, person.department)
ORDER BY (total_score) DESC
LIMIT 1
```

This query returns:

```
+---------------------------------+
| name | department | total_score |
+---------------------------------+
| Amy  | devops     | 54          |
+---------------------------------+
```

From a list of employees, select the employee with the highest individual score.

```
WITH
dataset AS (
 SELECT ARRAY[
   CAST(ROW('Sally', 'engineering', ARRAY[1,2,3,4]) AS ROW(name VARCHAR, department VARCHAR, scores ARRAY(INTEGER))),
   CAST(ROW('John', 'finance', ARRAY[7,8,9]) AS ROW(name VARCHAR, department VARCHAR, scores ARRAY(INTEGER))),
   CAST(ROW('Amy', 'devops', ARRAY[12,13,14,15]) AS ROW(name VARCHAR, department VARCHAR, scores ARRAY(INTEGER)))
 ] AS users
),
users AS (
 SELECT person, score
 FROM
   dataset,
   UNNEST(dataset.users) AS t(person),
   UNNEST(person.scores) AS t(score)
)
SELECT person.name, score FROM users
ORDER BY (score) DESC
LIMIT 1
```

This query returns:

```
+--------------+
| name | score |
+--------------+
| Amy  | 15    |
+--------------+
```

### Considerations for CROSS JOIN and UNNEST
<a name="flattening-arrays-cross-join-and-unnest-considerations"></a>

If `UNNEST` is used on one or more arrays in the query and one of the arrays is `NULL` or empty, the query returns no rows. If `UNNEST` is used on an array that contains an empty string, the empty string is returned.

For example, in the following query, because the second array is empty, the query returns no rows.

```
SELECT 
    col1, 
    col2 
FROM UNNEST (ARRAY ['apples','oranges','lemons']) AS t(col1)
CROSS JOIN UNNEST (ARRAY []) AS t(col2)
```

In this next example, the second array is modified to contain an empty string. For each row, the query returns the value in `col1` and an empty string for `col2`. The empty string in the second array is required for the values in the first array to be returned.

```
SELECT 
    col1, 
    col2 
FROM UNNEST (ARRAY ['apples','oranges','lemons']) AS t(col1)
CROSS JOIN UNNEST (ARRAY ['']) AS t(col2)
```

# Create arrays from subqueries
<a name="creating-arrays-from-subqueries"></a>

Create an array from a collection of rows.

```
WITH
dataset AS (
  SELECT ARRAY[1,2,3,4,5] AS items
)
SELECT array_agg(i) AS array_items
FROM dataset
CROSS JOIN UNNEST(items) AS t(i)
```

This query returns:

```
+-----------------+
| array_items     |
+-----------------+
| [1, 2, 3, 4, 5] |
+-----------------+
```

To create an array of unique values from a set of rows, use the `distinct` keyword.

```
WITH
dataset AS (
  SELECT ARRAY [1,2,2,3,3,4,5] AS items
)
SELECT array_agg(distinct i) AS array_items
FROM dataset
CROSS JOIN UNNEST(items) AS t(i)
```

This query returns the following result. Note that ordering is not guaranteed.

```
+-----------------+
| array_items     |
+-----------------+
| [1, 2, 3, 4, 5] |
+-----------------+
```

For more information about using the `array_agg` function, see [Aggregate functions](https://trino.io/docs/current/functions/aggregate.html) in the Trino documentation.

# Filter arrays
<a name="filtering-arrays"></a>

Create an array from a collection of rows if they match the filter criteria.

```
WITH
dataset AS (
  SELECT ARRAY[1,2,3,4,5] AS items
)
SELECT array_agg(i) AS array_items
FROM dataset
CROSS JOIN UNNEST(items) AS t(i)
WHERE i > 3
```

This query returns:

```
+-------------+
| array_items |
+-------------+
| [4, 5]      |
+-------------+
```

Filter an array based on whether one of its elements contains a specific value, such as 2, as in this example:

```
WITH
dataset AS (
  SELECT ARRAY
  [
    ARRAY[1,2,3,4],
    ARRAY[5,6,7,8],
    ARRAY[9,0]
  ] AS items
)
SELECT i AS array_items FROM dataset
CROSS JOIN UNNEST(items) AS t(i)
WHERE contains(i, 2)
```

This query returns:

```
+--------------+
| array_items  |
+--------------+
| [1, 2, 3, 4] |
+--------------+
```

## Use the `filter` function
<a name="filtering-arrays-filter-function"></a>

```
 filter(ARRAY [list_of_values], boolean_function)
```

You can use the `filter` function on an `ARRAY` expression to create a new array that is the subset of the items in *list_of_values* for which *boolean_function* is true. The `filter` function can be useful in cases in which you cannot use the `UNNEST` operator.

The following example filters for values greater than zero in the array `[1,0,5,-1]`.

```
SELECT filter(ARRAY [1,0,5,-1], x -> x>0)
```

**Results**  
`[1,5]`

The following example filters for the non-null values in the array `[-1, NULL, 10, NULL]`.

```
SELECT filter(ARRAY [-1, NULL, 10, NULL], q -> q IS NOT NULL)
```

**Results**  
`[-1,10]`

# Sort arrays
<a name="sorting-arrays"></a>

To create a sorted array of unique values from a set of rows, you can use the [array_sort](https://prestodb.io/docs/current/functions/array.html#array_sort) function, as in the following example.

```
WITH
dataset AS (
  SELECT ARRAY[3,1,2,5,2,3,6,3,4,5] AS items
)
SELECT array_sort(array_agg(distinct i)) AS array_items
FROM dataset
CROSS JOIN UNNEST(items) AS t(i)
```

This query returns:

```
+--------------------+
| array_items        |
+--------------------+
| [1, 2, 3, 4, 5, 6] |
+--------------------+
```
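`array_sort` also accepts an optional comparator lambda for custom orderings. The following sketch sorts an array in descending order; verify that the two-argument form is available in your Athena engine version:

```
SELECT array_sort(ARRAY[3,1,2],
                  (x, y) -> IF(x < y, 1, IF(x = y, 0, -1))) AS items
```

This returns `[3, 2, 1]`.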

For information about expanding an array into multiple rows, see [Flatten nested arrays](flattening-arrays.md).

# Use aggregation functions with arrays
<a name="arrays-and-aggregation"></a>
+ To add values within an array, use `SUM`, as in the following example.
+ To aggregate multiple rows within an array, use `array_agg`. For information, see [Create arrays from subqueries](creating-arrays-from-subqueries.md).

**Note**  
`ORDER BY` is supported for aggregation functions starting in Athena engine version 2.

```
WITH
dataset AS (
  SELECT ARRAY
  [
    ARRAY[1,2,3,4],
    ARRAY[5,6,7,8],
    ARRAY[9,0]
  ] AS items
),
item AS (
  SELECT i AS array_items
  FROM dataset, UNNEST(items) AS t(i)
)
SELECT array_items, sum(val) AS total
FROM item, UNNEST(array_items) AS t(val)
GROUP BY array_items;
```

In the last `SELECT` statement, instead of using `sum()` and `UNNEST`, you can use `reduce()` to decrease processing time and data transfer, as in the following example.

```
WITH
dataset AS (
  SELECT ARRAY
  [
    ARRAY[1,2,3,4],
    ARRAY[5,6,7,8],
    ARRAY[9,0]
  ] AS items
),
item AS (
  SELECT i AS array_items
  FROM dataset, UNNEST(items) AS t(i)
)
SELECT array_items, reduce(array_items, 0 , (s, x) -> s + x, s -> s) AS total
FROM item;
```

Either query returns the following results. The order of returned results is not guaranteed.

```
+----------------------+
| array_items  | total |
+----------------------+
| [1, 2, 3, 4] | 10    |
| [5, 6, 7, 8] | 26    |
| [9, 0]       | 9     |
+----------------------+
```
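As noted earlier, `ORDER BY` is supported inside aggregation functions. For example, the following sketch aggregates the unnested values into a single array sorted in descending order:

```
WITH
dataset AS (
  SELECT ARRAY[3,1,2] AS items
)
SELECT array_agg(i ORDER BY i DESC) AS sorted_desc
FROM dataset, UNNEST(items) AS t(i)
```

This returns `[3, 2, 1]`.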

# Convert arrays to strings
<a name="converting-arrays-to-strings"></a>

To convert an array into a single string, use the `array_join` function. The following standalone example creates a table called `dataset` that contains an aliased array called `words`. The query uses `array_join` to join the array elements in `words`, separate them with spaces, and return the resulting string in an aliased column called `welcome_msg`.

```
WITH
dataset AS (
  SELECT ARRAY ['hello', 'amazon', 'athena'] AS words
)
SELECT array_join(words, ' ') AS welcome_msg
FROM dataset
```

This query returns:

```
+---------------------+
| welcome_msg         |
+---------------------+
| hello amazon athena |
+---------------------+
```

# Use arrays to create maps
<a name="arrays-create-maps"></a>

Maps are key-value pairs that consist of data types available in Athena. To create maps, use the `MAP` operator and pass it two arrays: the first contains the column (key) names, and the second contains the values. All values in the arrays must be of the same type. If any of the map value array elements need to be of different types, you can convert them later.

## Examples
<a name="examples"></a>

This example selects a user from a dataset. It uses the `MAP` operator and passes it two arrays. The first array includes values for column names, such as "first", "last", and "age". The second array consists of values for each of these columns, such as "Bob", "Smith", "35".

```
WITH dataset AS (
  SELECT MAP(
    ARRAY['first', 'last', 'age'],
    ARRAY['Bob', 'Smith', '35']
  ) AS user
)
SELECT user FROM dataset
```

This query returns:

```
+---------------------------------+
| user                            |
+---------------------------------+
| {last=Smith, first=Bob, age=35} |
+---------------------------------+
```

You can retrieve `Map` values by selecting the field name followed by `[key_name]`, as in this example:

```
WITH dataset AS (
 SELECT MAP(
   ARRAY['first', 'last', 'age'],
   ARRAY['Bob', 'Smith', '35']
 ) AS user
)
SELECT user['first'] AS first_name FROM dataset
```

This query returns:

```
+------------+
| first_name |
+------------+
| Bob        |
+------------+
```
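Because all of the map values in these examples are strings, you can convert a value to another type at query time with `CAST`, as in this sketch:

```
WITH dataset AS (
  SELECT MAP(
    ARRAY['first', 'last', 'age'],
    ARRAY['Bob', 'Smith', '35']
  ) AS user
)
SELECT CAST(user['age'] AS INTEGER) + 1 AS age_next_year FROM dataset
```

This returns `36`.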

# Query arrays with complex types and nested structures
<a name="rows-and-structs"></a>

Your source data often contains arrays with complex data types and nested structures. Examples in this section show how to change an element's data type, locate elements within arrays, and find keywords using Athena queries.

**Topics**
+ [Create a `ROW`](creating-row.md)
+ [Change field names in arrays using `CAST`](changing-row-arrays-with-cast.md)
+ [Filter arrays using the `.` notation](filtering-with-dot.md)
+ [Filter arrays with nested values](filtering-nested-with-dot.md)
+ [Filter arrays using `UNNEST`](filtering-with-unnest.md)
+ [Find keywords in arrays using `regexp_like`](filtering-with-regexp.md)

# Create a `ROW`
<a name="creating-row"></a>

**Note**  
The examples in this section use `ROW` as a means to create sample data to work with. When you query tables within Athena, you do not need to create `ROW` data types, because they are already created from your data source. When you use `CREATE TABLE` to define a `STRUCT` in a table, Athena populates it with data and creates the `ROW` data type for you for each row in the dataset. The underlying `ROW` data type consists of named fields of any supported SQL data types.

```
WITH dataset AS (
 SELECT
   ROW('Bob', 38) AS users
 )
SELECT * FROM dataset
```

This query returns:

```
+-------------------------+
| users                   |
+-------------------------+
| {field0=Bob, field1=38} |
+-------------------------+
```

# Change field names in arrays using `CAST`
<a name="changing-row-arrays-with-cast"></a>

To change the field name in an array that contains `ROW` values, you can `CAST` the `ROW` declaration:

```
WITH dataset AS (
  SELECT
    CAST(
      ROW('Bob', 38) AS ROW(name VARCHAR, age INTEGER)
    ) AS users
)
SELECT * FROM dataset
```

This query returns:

```
+--------------------+
| users              |
+--------------------+
| {NAME=Bob, AGE=38} |
+--------------------+
```

**Note**  
In the example above, you declare `name` as a `VARCHAR` because this is its type in Presto. If you declare this `STRUCT` inside a `CREATE TABLE` statement, use `String` type because Hive defines this data type as `String`.
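For illustration, a hypothetical `CREATE TABLE` statement with such a `STRUCT` (the table name and location here are placeholders) would use `string` rather than `VARCHAR`:

```
CREATE EXTERNAL TABLE users_struct (
  `user` struct<name:string, age:int>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://amzn-s3-demo-bucket/users/';
```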

# Filter arrays using the `.` notation
<a name="filtering-with-dot"></a>

The following example uses the dot `.` notation to select the `accountId` field from the `userIdentity` column of an AWS CloudTrail logs table. For more information, see [Querying AWS CloudTrail Logs](cloudtrail-logs.md).

```
SELECT
  CAST(useridentity.accountid AS bigint) as newid
FROM cloudtrail_logs
LIMIT 2;
```

This query returns:

```
+--------------+
| newid        |
+--------------+
| 112233445566 |
+--------------+
| 998877665544 |
+--------------+
```

To query an array of values, issue this query:

```
WITH dataset AS (
  SELECT ARRAY[
    CAST(ROW('Bob', 38) AS ROW(name VARCHAR, age INTEGER)),
    CAST(ROW('Alice', 35) AS ROW(name VARCHAR, age INTEGER)),
    CAST(ROW('Jane', 27) AS ROW(name VARCHAR, age INTEGER))
  ] AS users
)
SELECT * FROM dataset
```

It returns this result:

```
+-----------------------------------------------------------------+
| users                                                           |
+-----------------------------------------------------------------+
| [{NAME=Bob, AGE=38}, {NAME=Alice, AGE=35}, {NAME=Jane, AGE=27}] |
+-----------------------------------------------------------------+
```

# Filter arrays with nested values
<a name="filtering-nested-with-dot"></a>

Large arrays often contain nested structures, and you need to be able to filter, or search, for values within them.

To define a dataset for an array of values that includes a nested `BOOLEAN` value, issue this query:

```
WITH dataset AS (
  SELECT
    CAST(
      ROW('aws.amazon.com', ROW(true)) AS ROW(hostname VARCHAR, flaggedActivity ROW(isNew BOOLEAN))
    ) AS sites
)
SELECT * FROM dataset
```

It returns this result:

```
+----------------------------------------------------------+
| sites                                                    |
+----------------------------------------------------------+
| {HOSTNAME=aws.amazon.com, FLAGGEDACTIVITY={ISNEW=true}}  |
+----------------------------------------------------------+
```

Next, to filter and access the `BOOLEAN` value of that element, continue to use the dot `.` notation.

```
WITH dataset AS (
  SELECT
    CAST(
      ROW('aws.amazon.com', ROW(true)) AS ROW(hostname VARCHAR, flaggedActivity ROW(isNew BOOLEAN))
    ) AS sites
)
SELECT sites.hostname, sites.flaggedactivity.isnew
FROM dataset
```

This query selects the nested fields and returns this result:

```
+------------------------+
| hostname       | isnew |
+------------------------+
| aws.amazon.com | true  |
+------------------------+
```

# Filter arrays using `UNNEST`
<a name="filtering-with-unnest"></a>

To filter an array that includes a nested structure by one of its child elements, issue a query with an `UNNEST` operator. For more information about `UNNEST`, see [Flatten nested arrays](flattening-arrays.md).

For example, this query finds host names of sites in the dataset.

```
WITH dataset AS (
  SELECT ARRAY[
    CAST(
      ROW('aws.amazon.com', ROW(true)) AS ROW(hostname VARCHAR, flaggedActivity ROW(isNew BOOLEAN))
    ),
    CAST(
      ROW('news.cnn.com', ROW(false)) AS ROW(hostname VARCHAR, flaggedActivity ROW(isNew BOOLEAN))
    ),
    CAST(
      ROW('netflix.com', ROW(false)) AS ROW(hostname VARCHAR, flaggedActivity ROW(isNew BOOLEAN))
    )
  ] as items
)
SELECT sites.hostname, sites.flaggedActivity.isNew
FROM dataset, UNNEST(items) t(sites)
WHERE sites.flaggedActivity.isNew = true
```

It returns:

```
+------------------------+
| hostname       | isnew |
+------------------------+
| aws.amazon.com | true  |
+------------------------+
```

# Find keywords in arrays using `regexp_like`
<a name="filtering-with-regexp"></a>

The following examples illustrate how to search a dataset for a keyword within an element inside an array, using the [regexp_like](https://prestodb.io/docs/current/functions/regexp.html) function. It takes as input a regular expression pattern to evaluate, or a list of terms separated by a pipe (`|`), evaluates the pattern, and determines whether the specified string contains it.

The regular expression pattern only needs to be contained within the string; it does not have to match the entire string. To match the entire string, enclose the pattern with `^` at the beginning and `$` at the end, as in `'^pattern$'`.
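The following minimal sketch contrasts a contained match with an anchored full match:

```
SELECT regexp_like('bigdata', 'big') AS contains_match,   -- true
       regexp_like('bigdata', '^big$') AS full_match      -- false
```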

Consider an array of sites containing their host name, and a `flaggedActivity` element. This element includes an `ARRAY`, containing several `MAP` elements, each listing different popular keywords and their popularity count. Assume you want to find a particular keyword inside a `MAP` in this array.

To search this dataset for sites with a specific keyword, we use `regexp_like` instead of the similar SQL `LIKE` operator, because searching for a large number of keywords is more efficient with `regexp_like`.

**Example 1: Using `regexp_like`**  
The query in this example uses the `regexp_like` function to search for terms `'politics|bigdata'`, found in values within arrays:  

```
WITH dataset AS (
  SELECT ARRAY[
    CAST(
      ROW('aws.amazon.com', ROW(ARRAY[
          MAP(ARRAY['term', 'count'], ARRAY['bigdata', '10']),
          MAP(ARRAY['term', 'count'], ARRAY['serverless', '50']),
          MAP(ARRAY['term', 'count'], ARRAY['analytics', '82']),
          MAP(ARRAY['term', 'count'], ARRAY['iot', '74'])
      ])
      ) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
   ),
   CAST(
     ROW('news.cnn.com', ROW(ARRAY[
       MAP(ARRAY['term', 'count'], ARRAY['politics', '241']),
       MAP(ARRAY['term', 'count'], ARRAY['technology', '211']),
       MAP(ARRAY['term', 'count'], ARRAY['serverless', '25']),
       MAP(ARRAY['term', 'count'], ARRAY['iot', '170'])
     ])
     ) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
   ),
   CAST(
     ROW('netflix.com', ROW(ARRAY[
       MAP(ARRAY['term', 'count'], ARRAY['cartoons', '1020']),
       MAP(ARRAY['term', 'count'], ARRAY['house of cards', '112042']),
       MAP(ARRAY['term', 'count'], ARRAY['orange is the new black', '342']),
       MAP(ARRAY['term', 'count'], ARRAY['iot', '4'])
     ])
     ) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
   )
 ] AS items
),
sites AS (
  SELECT sites.hostname, sites.flaggedactivity
  FROM dataset, UNNEST(items) t(sites)
)
SELECT hostname
FROM sites, UNNEST(sites.flaggedActivity.flags) t(flags)
WHERE regexp_like(flags['term'], 'politics|bigdata')
GROUP BY (hostname)
```
This query returns two sites:  

```
+----------------+
| hostname       |
+----------------+
| aws.amazon.com |
+----------------+
| news.cnn.com   |
+----------------+
```

**Example 2: Using `regexp_like`**  
The query in the following example adds up the total popularity scores for the sites matching your search terms with the `regexp_like` function, and then orders them from highest to lowest.   

```
WITH dataset AS (
  SELECT ARRAY[
    CAST(
      ROW('aws.amazon.com', ROW(ARRAY[
          MAP(ARRAY['term', 'count'], ARRAY['bigdata', '10']),
          MAP(ARRAY['term', 'count'], ARRAY['serverless', '50']),
          MAP(ARRAY['term', 'count'], ARRAY['analytics', '82']),
          MAP(ARRAY['term', 'count'], ARRAY['iot', '74'])
      ])
      ) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
    ),
    CAST(
      ROW('news.cnn.com', ROW(ARRAY[
        MAP(ARRAY['term', 'count'], ARRAY['politics', '241']),
        MAP(ARRAY['term', 'count'], ARRAY['technology', '211']),
        MAP(ARRAY['term', 'count'], ARRAY['serverless', '25']),
        MAP(ARRAY['term', 'count'], ARRAY['iot', '170'])
      ])
      ) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
    ),
    CAST(
      ROW('netflix.com', ROW(ARRAY[
        MAP(ARRAY['term', 'count'], ARRAY['cartoons', '1020']),
        MAP(ARRAY['term', 'count'], ARRAY['house of cards', '112042']),
        MAP(ARRAY['term', 'count'], ARRAY['orange is the new black', '342']),
        MAP(ARRAY['term', 'count'], ARRAY['iot', '4'])
      ])
      ) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
    )
  ] AS items
),
sites AS (
  SELECT sites.hostname, sites.flaggedactivity
  FROM dataset, UNNEST(items) t(sites)
)
SELECT hostname, array_agg(flags['term']) AS terms, SUM(CAST(flags['count'] AS INTEGER)) AS total
FROM sites, UNNEST(sites.flaggedActivity.flags) t(flags)
WHERE regexp_like(flags['term'], 'politics|bigdata')
GROUP BY (hostname)
ORDER BY total DESC
```
This query returns two sites:  

```
+------------------------------------+
| hostname       | terms    | total  |
+----------------+-------------------+
| news.cnn.com   | politics |  241   |
+----------------+-------------------+
| aws.amazon.com | bigdata |  10    |
+----------------+-------------------+
```

# Query geospatial data
<a name="querying-geospatial-data"></a>

Geospatial data contains identifiers that specify a geographic position for an object. Examples of this type of data include weather reports, map directions, tweets with geographic positions, store locations, and airline routes. Geospatial data plays an important role in business analytics, reporting, and forecasting.

Geospatial identifiers, such as latitude and longitude, allow you to convert any mailing address into a set of geographic coordinates.

## What is a geospatial query?
<a name="geospatial-query-what-is"></a>

Geospatial queries are specialized types of SQL queries supported in Athena. They differ from non-spatial SQL queries in the following ways:
+ Using the following specialized geometry data types: `point`, `line`, `multiline`, `polygon`, and `multipolygon`.
+ Expressing relationships between geometry data types, such as `distance`, `equals`, `crosses`, `touches`, `overlaps`, `disjoint`, and others.

Using geospatial queries in Athena, you can run these and other similar operations:
+ Find the distance between two points.
+ Check whether one area (polygon) contains another.
+ Check whether one line crosses or touches another line or polygon.

For example, to obtain a `point` geometry data type from values of type `double` for the geographic coordinates of Mount Rainier in Athena, use the `ST_Point (longitude, latitude)` geospatial function, as in the following example. 

```
ST_Point(-121.7602, 46.8527)
```
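Relationship functions take geometries as arguments. For example, the following query sketch uses `ST_Distance` to compute the planar distance (in the same units as the input coordinates, here degrees) between two points. The coordinates for the second point, Mount St. Helens, are approximate and shown for illustration only.

```
SELECT ST_Distance(
  ST_Point(-121.7602, 46.8527),   -- Mount Rainier
  ST_Point(-122.1944, 46.1912)    -- Mount St. Helens (approximate)
)
```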

## Input data formats and geometry data types
<a name="geospatial-input-data-formats-supported-geometry-types"></a>

To use geospatial functions in Athena, input your data in the WKT format, or use the Hive JSON SerDe. You can also use the geometry data types supported in Athena.

### Input data formats
<a name="input-data-formats"></a>

To handle geospatial queries, Athena supports input data in these data formats:
+  **WKT (Well-known Text)**. In Athena, WKT is represented as a `varchar(x)` or `string` data type.
+  **JSON-encoded geospatial data**. To parse JSON files with geospatial data and create tables for them, Athena uses the [Hive JSON SerDe](https://github.com/Esri/spatial-framework-for-hadoop/wiki/Hive-JSON-SerDe). For more information about using this SerDe in Athena, see [JSON SerDe libraries](json-serde.md).
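For example, a point in WKT is a string such as `POINT (-121.7602 46.8527)`. You can convert a WKT string to a geometry with the `ST_GeometryFromText` function, as in the following sketch:

```
SELECT ST_GeometryFromText('POINT (-121.7602 46.8527)')
```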

### Geometry data types
<a name="geometry-data-types"></a>

To handle geospatial queries, Athena supports these specialized geometry data types:
+  `point` 
+  `line` 
+  `polygon` 
+  `multiline` 
+  `multipolygon` 

## Supported geospatial functions
<a name="geospatial-functions-list"></a>

For information about the geospatial functions in Athena engine version 3, see [Geospatial functions](https://trino.io/docs/current/functions/geospatial.html) in the Trino documentation.

# Examples: Geospatial queries
<a name="geospatial-example-queries"></a>

The examples in this topic create two tables from sample data available on GitHub and query the tables based on the data. The sample data, which are for illustration purposes only and are not guaranteed to be accurate, are in the following files:
+ **[https://github.com/Esri/gis-tools-for-hadoop/blob/master/samples/data/earthquake-data/earthquakes.csv](https://github.com/Esri/gis-tools-for-hadoop/blob/master/samples/data/earthquake-data/earthquakes.csv)** – Lists earthquakes that occurred in California. The example `earthquakes` table uses fields from this data.
+ **[https://github.com/Esri/gis-tools-for-hadoop/blob/master/samples/data/counties-data/california-counties.json](https://github.com/Esri/gis-tools-for-hadoop/blob/master/samples/data/counties-data/california-counties.json)** – Lists county data for the state of California in [ESRI-compliant GeoJSON format](https://doc.arcgis.com/en/arcgis-online/reference/geojson.htm). The data includes many fields such as `AREA`, `PERIMETER`, `STATE`, `COUNTY`, and `NAME`, but the example `counties` table uses only two: `Name` (string), and `BoundaryShape` (binary). 
**Note**  
Athena uses the `com.esri.json.hadoop.EnclosedEsriJsonInputFormat` to convert the JSON data to geospatial binary format.

The following code example creates a table called `earthquakes`:

```
CREATE external TABLE earthquakes
(
 earthquake_date string,
 latitude double,
 longitude double,
 depth double,
 magnitude double,
 magtype string,
 mbstations string,
 gap string,
 distance string,
 rms string,
 source string,
 eventid string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION 's3://amzn-s3-demo-bucket/my-query-log/csv/';
```

The following code example creates a table called `counties`:

```
CREATE external TABLE IF NOT EXISTS counties
 (
 Name string,
 BoundaryShape binary
 )
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.EsriJsonSerDe'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedEsriJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://amzn-s3-demo-bucket/my-query-log/json/';
```

The following example query performs a `CROSS JOIN` on the `counties` and `earthquakes` tables. The example uses `ST_CONTAINS` to query for counties whose boundaries include earthquake locations, which are specified with `ST_POINT`. The query groups such counties by name, orders them by count, and returns them in descending order.

```
SELECT counties.name,
        COUNT(*) cnt
FROM counties
CROSS JOIN earthquakes
WHERE ST_CONTAINS (ST_GeomFromLegacyBinary(counties.boundaryshape), ST_POINT(earthquakes.longitude, earthquakes.latitude))
GROUP BY  counties.name
ORDER BY  cnt DESC
```

This query returns:

```
+------------------------+
| name             | cnt |
+------------------------+
| Kern             | 36  |
+------------------------+
| San Bernardino   | 35  |
+------------------------+
| Imperial         | 28  |
+------------------------+
| Inyo             | 20  |
+------------------------+
| Los Angeles      | 18  |
+------------------------+
| Riverside        | 14  |
+------------------------+
| Monterey         | 14  |
+------------------------+
| Santa Clara      | 12  |
+------------------------+
| San Benito       | 11  |
+------------------------+
| Fresno           | 11  |
+------------------------+
| San Diego        | 7   |
+------------------------+
| Santa Cruz       | 5   |
+------------------------+
| Ventura          | 3   |
+------------------------+
| San Luis Obispo  | 3   |
+------------------------+
| Orange           | 2   |
+------------------------+
| San Mateo        | 1   |
+------------------------+
```

## Additional resources
<a name="geospatial-example-queries-additional-resources"></a>

For additional examples of geospatial queries, see the following blog posts:
+ [Extend geospatial queries in Amazon Athena with UDFs and AWS Lambda](https://aws.amazon.com/blogs/big-data/extend-geospatial-queries-in-amazon-athena-with-udfs-and-aws-lambda/) 
+ [Visualize over 200 years of global climate data using Amazon Athena and Amazon QuickSight](https://aws.amazon.com/blogs/big-data/visualize-over-200-years-of-global-climate-data-using-amazon-athena-and-amazon-quicksight/)
+ [Querying OpenStreetMap with Amazon Athena](https://aws.amazon.com/blogs/big-data/querying-openstreetmap-with-amazon-athena/)

# Query JSON data
<a name="querying-JSON"></a>

Amazon Athena lets you query JSON-encoded data, extract data from nested JSON, search for values, and find the length and size of JSON arrays. To learn the basics of querying JSON data in Athena, consider the following sample planet data:

```
{name:"Mercury",distanceFromSun:0.39,orbitalPeriod:0.24,dayLength:58.65}
{name:"Venus",distanceFromSun:0.72,orbitalPeriod:0.62,dayLength:243.02}
{name:"Earth",distanceFromSun:1.00,orbitalPeriod:1.00,dayLength:1.00}
{name:"Mars",distanceFromSun:1.52,orbitalPeriod:1.88,dayLength:1.03}
```

Notice how each record (essentially, each row in the table) is on a separate line. To query this JSON data, you can use a `CREATE TABLE` statement like the following:

```
CREATE EXTERNAL TABLE `planets_json`(
  `name` string,
  `distancefromsun` double,
  `orbitalperiod` double,
  `daylength` double)
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
  's3://amzn-s3-demo-bucket/json/'
```

To query the data, use a simple `SELECT` statement like the following example.

```
SELECT * FROM planets_json
```

The query results look like the following.


****  

|  | name | distancefromsun | orbitalperiod | daylength | 
| --- | --- | --- | --- | --- | 
| 1 | Mercury | 0.39 | 0.24 | 58.65 | 
| 2 | Venus | 0.72 | 0.62 | 243.02 | 
| 3 | Earth | 1.0 | 1.0 | 1.0 | 
| 4 | Mars | 1.52 | 1.88 | 1.03 | 

Notice how the `CREATE TABLE` statement uses the [OpenX JSON SerDe](openx-json-serde.md), which requires each JSON record to be on a separate line. If the JSON is in pretty print format, or if all records are on a single line, the data will not be read correctly.

To query JSON data that is in pretty print format, you can use the [Amazon Ion Hive SerDe](ion-serde.md) instead of the OpenX JSON SerDe. Consider the previous data stored in pretty print format:

```
{
  name:"Mercury",
  distanceFromSun:0.39,
  orbitalPeriod:0.24,
  dayLength:58.65
}
{
  name:"Venus",
  distanceFromSun:0.72,
  orbitalPeriod:0.62,
  dayLength:243.02
}
{
  name:"Earth",
  distanceFromSun:1.00,
  orbitalPeriod:1.00,
  dayLength:1.00
}
{
  name:"Mars",
  distanceFromSun:1.52,
  orbitalPeriod:1.88,
  dayLength:1.03
}
```

To query this data without reformatting, you can use a `CREATE TABLE` statement like the following. Notice that, instead of specifying the OpenX JSON SerDe, the statement specifies `STORED AS ION`. 

```
CREATE EXTERNAL TABLE `planets_ion`(
  `name` string,
  `distancefromsun` DECIMAL(10, 2),
  `orbitalperiod` DECIMAL(10, 2),
  `daylength` DECIMAL(10, 2))
STORED AS ION
LOCATION
  's3://amzn-s3-demo-bucket/json-ion/'
```

The query `SELECT * FROM planets_ion` produces the same results as before. For more information about creating tables in this way using the Amazon Ion Hive SerDe, see [Create Amazon Ion tables](ion-serde-using-create-table.md).

The preceding example JSON data does not contain complex data types such as nested arrays or structs. For more information about querying nested JSON data, see [Example: deserializing nested JSON](openx-json-serde.md#nested-json-serde-example).

**Topics**
+ [Best practices for reading JSON data](parsing-json-data.md)
+ [Extract JSON data from strings](extracting-data-from-JSON.md)
+ [Search for values in JSON arrays](searching-for-values.md)
+ [Get the length and size of JSON arrays](length-and-size.md)
+ [Troubleshoot JSON queries](json-troubleshooting.md)

# Best practices for reading JSON data
<a name="parsing-json-data"></a>

JavaScript Object Notation (JSON) is a common method for encoding data structures as text. Many applications and tools output data that is JSON-encoded.

In Amazon Athena, you can create tables from external data and include the JSON-encoded data in them. For such types of source data, use Athena together with [JSON SerDe libraries](json-serde.md). 

Use the following tips to read JSON-encoded data:
+ Choose the right SerDe, a native JSON SerDe, `org.apache.hive.hcatalog.data.JsonSerDe`, or an OpenX SerDe, `org.openx.data.jsonserde.JsonSerDe`. For more information, see [JSON SerDe libraries](json-serde.md).
+ Make sure that each JSON-encoded record is represented on a separate line, not pretty-printed.
**Note**  
The SerDe expects each JSON document to be on a single line of text with no line termination characters separating the fields in the record. If the JSON text is in pretty print format, you may receive an error message like HIVE_CURSOR_ERROR: Row is not a valid JSON Object or HIVE_CURSOR_ERROR: JsonParseException: Unexpected end-of-input: expected close marker for OBJECT when you attempt to query the table after you create it. For more information, see [JSON Data Files](https://github.com/rcongiu/Hive-JSON-Serde#json-data-files) in the OpenX SerDe documentation on GitHub. 
+ Generate your JSON-encoded data in case-insensitive columns.
+ Provide an option to ignore malformed records, as in this example.

  ```
  CREATE EXTERNAL TABLE json_table (
    column_a string,
    column_b int
   )
   ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
   WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
   LOCATION 's3://amzn-s3-demo-bucket/path/';
  ```
+ Convert fields in source data that have an undetermined schema to JSON-encoded strings in Athena.

When Athena creates tables backed by JSON data, it parses the data based on the existing and predefined schema. However, not all of your data may have a predefined schema. To simplify schema management in such cases, it is often useful to convert fields in source data that have an undetermined schema to JSON strings in Athena, and then use [JSON SerDe libraries](json-serde.md).

For example, consider an IoT application that publishes events with common fields from different sensors. One of those fields stores a custom payload that is unique to the sensor sending the event. In this case, because you don't know the schema, we recommend that you store the information as a JSON-encoded string. To do this, convert data in your Athena table to JSON, as in the following example. You can also convert JSON-encoded data to Athena data types.
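As a sketch of this pattern, assume a hypothetical table `iot_events` with a `device_id` column and a `payload` column stored as a plain string. Because the payload is kept as a JSON-encoded string, you can extract fields from it at query time without declaring its schema up front:

```
-- payload is a string column containing values such as '{"temperature": 72.5, "unit": "F"}'
SELECT device_id,
       json_extract_scalar(payload, '$.temperature') AS temperature
FROM iot_events
```

The table, column, and field names here are illustrative; substitute the names from your own dataset.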

**Topics**
+ [Convert Athena data types to JSON](converting-native-data-types-to-json.md)
+ [Convert JSON to Athena data types](converting-json-to-native-data-types.md)

# Convert Athena data types to JSON
<a name="converting-native-data-types-to-json"></a>

To convert Athena data types to JSON, use `CAST`.

```
WITH dataset AS (
  SELECT
    CAST('HELLO ATHENA' AS JSON) AS hello_msg,
    CAST(12345 AS JSON) AS some_int,
    CAST(MAP(ARRAY['a', 'b'], ARRAY[1,2]) AS JSON) AS some_map
)
SELECT * FROM dataset
```

This query returns:

```
+-------------------------------------------+
| hello_msg      | some_int | some_map      |
+-------------------------------------------+
| "HELLO ATHENA" | 12345    | {"a":1,"b":2} |
+-------------------------------------------+
```

# Convert JSON to Athena data types
<a name="converting-json-to-native-data-types"></a>

To convert JSON data to Athena data types, use `CAST`.

**Note**  
In this example, to denote strings as JSON-encoded, start with the `JSON` keyword and use single quotes, such as `JSON '12345'`.

```
WITH dataset AS (
  SELECT
    CAST(JSON '"HELLO ATHENA"' AS VARCHAR) AS hello_msg,
    CAST(JSON '12345' AS INTEGER) AS some_int,
    CAST(JSON '{"a":1,"b":2}' AS MAP(VARCHAR, INTEGER)) AS some_map
)
SELECT * FROM dataset
```

This query returns:

```
+-------------------------------------+
| hello_msg    | some_int | some_map  |
+-------------------------------------+
| HELLO ATHENA | 12345    | {a:1,b:2} |
+-------------------------------------+
```

# Extract JSON data from strings
<a name="extracting-data-from-JSON"></a>

You may have source data containing JSON-encoded strings that you do not necessarily want to deserialize into a table in Athena. In this case, you can still run SQL operations on this data, using the JSON functions available in Presto.

Consider this JSON string as an example dataset.

```
{"name": "Susan Smith",
"org": "engineering",
"projects":
    [
     {"name":"project1", "completed":false},
     {"name":"project2", "completed":true}
    ]
}
```

## Examples: Extract properties
<a name="examples-extracting-properties"></a>

To extract the `name` and `projects` properties from the JSON string, use the `json_extract` function as in the following example. The `json_extract` function takes the column containing the JSON string, and searches it using a `JSONPath`-like expression with the dot `.` notation.

**Note**  
 `JSONPath` performs a simple tree traversal. It uses the `$` sign to denote the root of the JSON document, followed by a period and an element nested directly under the root, such as `$.name`.

```
WITH dataset AS (
  SELECT '{"name": "Susan Smith",
           "org": "engineering",
           "projects": [{"name":"project1", "completed":false},
           {"name":"project2", "completed":true}]}'
    AS myblob
)
SELECT
  json_extract(myblob, '$.name') AS name,
  json_extract(myblob, '$.projects') AS projects
FROM dataset
```

The returned value is a JSON-encoded string, and not a native Athena data type.

```
+-----------------------------------------------------------------------------------------------+
| name           | projects                                                                     |
+-----------------------------------------------------------------------------------------------+
| "Susan Smith"  | [{"name":"project1","completed":false},{"name":"project2","completed":true}] |
+-----------------------------------------------------------------------------------------------+
```

To extract the scalar value from the JSON string, use the `json_extract_scalar(json, json_path)` function. It is similar to `json_extract`, but returns a `varchar` string value instead of a JSON-encoded string. The value that the *json_path* parameter points to must be a scalar (a Boolean, number, or string).

**Note**  
Do not use the `json_extract_scalar` function on arrays, maps, or structs.

```
WITH dataset AS (
  SELECT '{"name": "Susan Smith",
           "org": "engineering",
           "projects": [{"name":"project1", "completed":false},{"name":"project2", "completed":true}]}'
    AS myblob
)
SELECT
  json_extract_scalar(myblob, '$.name') AS name,
  json_extract_scalar(myblob, '$.projects') AS projects
FROM dataset
```

This query returns:

```
+---------------------------+
| name           | projects |
+---------------------------+
| Susan Smith    |          |
+---------------------------+
```

To obtain the first element of the `projects` property in the example array, use the `json_array_get` function and specify the index position.

```
WITH dataset AS (
  SELECT '{"name": "Bob Smith",
           "org": "engineering",
           "projects": [{"name":"project1", "completed":false},{"name":"project2", "completed":true}]}'
    AS myblob
)
SELECT json_array_get(json_extract(myblob, '$.projects'), 0) AS item
FROM dataset
```

It returns the value at the specified index position in the JSON-encoded array.

```
+---------------------------------------+
| item                                  |
+---------------------------------------+
| {"name":"project1","completed":false} |
+---------------------------------------+
```

To return an Athena string type, use the `[]` operator inside a `JSONPath` expression, and then use the `json_extract_scalar` function. For more information about `[]`, see [Access array elements](accessing-array-elements.md).

```
WITH dataset AS (
   SELECT '{"name": "Bob Smith",
             "org": "engineering",
             "projects": [{"name":"project1", "completed":false},{"name":"project2", "completed":true}]}'
     AS myblob
)
SELECT json_extract_scalar(myblob, '$.projects[0].name') AS project_name
FROM dataset
```

It returns this result:

```
+--------------+
| project_name |
+--------------+
| project1     |
+--------------+
```

# Search for values in JSON arrays
<a name="searching-for-values"></a>

To determine if a specific value exists inside a JSON-encoded array, use the `json_array_contains` function.

The following query lists the names of the users who are participating in "project2".

```
WITH dataset AS (
  SELECT * FROM (VALUES
    (JSON '{"name": "Bob Smith", "org": "legal", "projects": ["project1"]}'),
    (JSON '{"name": "Susan Smith", "org": "engineering", "projects": ["project1", "project2", "project3"]}'),
    (JSON '{"name": "Jane Smith", "org": "finance", "projects": ["project1", "project2"]}')
  ) AS t (users)
)
SELECT json_extract_scalar(users, '$.name') AS user
FROM dataset
WHERE json_array_contains(json_extract(users, '$.projects'), 'project2')
```

This query returns a list of users.

```
+-------------+
| user        |
+-------------+
| Susan Smith |
+-------------+
| Jane Smith  |
+-------------+
```

The following query example lists the names of users who have completed projects along with the total number of completed projects. It performs these actions:
+ Uses nested `SELECT` statements for clarity.
+ Extracts the array of projects.
+ Converts the array to a native array of key-value pairs using `CAST`.
+ Extracts each individual array element using the `UNNEST` operator.
+ Filters obtained values by completed projects and counts them.

**Note**  
When using `CAST` to `MAP`, you can specify the key element as `VARCHAR` (native String in Presto), but leave the value as JSON, because the values in the `MAP` are of different types: String for the first key-value pair, and Boolean for the second.

```
WITH dataset AS (
  SELECT * FROM (VALUES
    (JSON '{"name": "Bob Smith",
             "org": "legal",
             "projects": [{"name":"project1", "completed":false}]}'),
    (JSON '{"name": "Susan Smith",
             "org": "engineering",
             "projects": [{"name":"project2", "completed":true},
                          {"name":"project3", "completed":true}]}'),
    (JSON '{"name": "Jane Smith",
             "org": "finance",
             "projects": [{"name":"project2", "completed":true}]}')
  ) AS t (users)
),
employees AS (
  SELECT users, CAST(json_extract(users, '$.projects') AS
    ARRAY(MAP(VARCHAR, JSON))) AS projects_array
  FROM dataset
),
names AS (
  SELECT json_extract_scalar(users, '$.name') AS name, projects
  FROM employees, UNNEST (projects_array) AS t(projects)
)
SELECT name, count(projects) AS completed_projects FROM names
WHERE cast(element_at(projects, 'completed') AS BOOLEAN) = true
GROUP BY name
```

This query returns the following result:

```
+----------------------------------+
| name        | completed_projects |
+----------------------------------+
| Susan Smith | 2                  |
+----------------------------------+
| Jane Smith  | 1                  |
+----------------------------------+
```

# Get the length and size of JSON arrays
<a name="length-and-size"></a>

To get the length and size of JSON arrays, you can use the `json_array_length` and `json_size` functions.

## Example: `json_array_length`
<a name="example-json-array-length"></a>

To obtain the length of a JSON-encoded array, use the `json_array_length` function.

```
WITH dataset AS (
  SELECT * FROM (VALUES
    (JSON '{"name":
            "Bob Smith",
            "org":
            "legal",
            "projects": [{"name":"project1", "completed":false}]}'),
    (JSON '{"name": "Susan Smith",
            "org": "engineering",
            "projects": [{"name":"project2", "completed":true},
                         {"name":"project3", "completed":true}]}'),
    (JSON '{"name": "Jane Smith",
             "org": "finance",
             "projects": [{"name":"project2", "completed":true}]}')
  ) AS t (users)
)
SELECT
  json_extract_scalar(users, '$.name') as name,
  json_array_length(json_extract(users, '$.projects')) as count
FROM dataset
ORDER BY count DESC
```

This query returns this result:

```
+---------------------+
| name        | count |
+---------------------+
| Susan Smith | 2     |
+---------------------+
| Bob Smith   | 1     |
+---------------------+
| Jane Smith  | 1     |
+---------------------+
```

## Example: `json_size`
<a name="example-json-size"></a>

To obtain the size of a JSON-encoded array or object, use the `json_size` function, and specify the column containing the JSON string and the `JSONPath` expression to the array or object.

```
WITH dataset AS (
  SELECT * FROM (VALUES
    (JSON '{"name": "Bob Smith", "org": "legal", "projects": [{"name":"project1", "completed":false}]}'),
    (JSON '{"name": "Susan Smith", "org": "engineering", "projects": [{"name":"project2", "completed":true},{"name":"project3", "completed":true}]}'),
    (JSON '{"name": "Jane Smith", "org": "finance", "projects": [{"name":"project2", "completed":true}]}')
  ) AS t (users)
)
SELECT
  json_extract_scalar(users, '$.name') as name,
  json_size(users, '$.projects') as count
FROM dataset
ORDER BY count DESC
```

This query returns this result:

```
+---------------------+
| name        | count |
+---------------------+
| Susan Smith | 2     |
+---------------------+
| Bob Smith   | 1     |
+---------------------+
| Jane Smith  | 1     |
+---------------------+
```

# Troubleshoot JSON queries
<a name="json-troubleshooting"></a>

For help on troubleshooting issues with JSON-related queries, see [JSON related errors](troubleshooting-athena.md#troubleshooting-athena-json-related-errors) or consult the following resources:
+ [I get errors when I try to read JSON data in Amazon Athena](https://aws.amazon.com/premiumsupport/knowledge-center/error-json-athena/)
+ [How do I resolve "HIVE_CURSOR_ERROR: Row is not a valid JSON object - JSONException: Duplicate key" when reading files from AWS Config in Athena?](https://aws.amazon.com/premiumsupport/knowledge-center/json-duplicate-key-error-athena-config/)
+ [The SELECT COUNT query in Amazon Athena returns only one record even though the input JSON file has multiple records](https://aws.amazon.com/premiumsupport/knowledge-center/select-count-query-athena-json-records/)
+ [How can I see the Amazon S3 source file for a row in an Athena table?](https://aws.amazon.com/premiumsupport/knowledge-center/find-s3-source-file-athena-table-row/)

See also [Considerations and limitations for SQL queries in Amazon Athena](other-notable-limitations.md).

# Use Machine Learning (ML) with Amazon Athena
<a name="querying-mlmodel"></a>

Machine Learning (ML) with Amazon Athena lets you use Athena to write SQL statements that run Machine Learning (ML) inference using Amazon SageMaker AI. This feature simplifies access to ML models for data analysis, eliminating the need to use complex programming methods to run inference.

To use ML with Athena, you define an ML with Athena function with the `USING EXTERNAL FUNCTION` clause. The function points to the SageMaker AI model endpoint that you want to use and specifies the variable names and data types to pass to the model. Subsequent clauses in the query reference the function to pass values to the model. The model runs inference based on the values that the query passes and then returns inference results. For more information about SageMaker AI and how SageMaker AI endpoints work, see the [Amazon SageMaker AI Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/).
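For example, a query of the following form defines a function that sends one `DOUBLE` value per row to a model endpoint and returns the model's prediction. The endpoint name `my-sagemaker-endpoint`, the table `customer_data`, and the column `metric` are hypothetical placeholders for illustration.

```
USING EXTERNAL FUNCTION predict_value(input_value DOUBLE)
RETURNS DOUBLE
SAGEMAKER 'my-sagemaker-endpoint'
SELECT predict_value(metric) AS prediction
FROM customer_data
```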

For an example that uses ML with Athena and SageMaker AI inference to detect an anomalous value in a result set, see the AWS Big Data Blog article [Detecting anomalous values by invoking the Amazon Athena machine learning inference function](https://aws.amazon.com/blogs/big-data/detecting-anomalous-values-by-invoking-the-amazon-athena-machine-learning-inference-function/).

## Considerations and limitations
<a name="considerations-and-limitations"></a>
+ **Available Regions** – The Athena ML feature is available in the AWS Regions where Athena engine version 2 or later is supported.
+ **SageMaker AI model endpoint must accept and return `text/csv`** – For more information about data formats, see [Common data formats for inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html) in the *Amazon SageMaker AI Developer Guide*.
+ **Athena does not send CSV headers** – If your SageMaker AI endpoint accepts `text/csv`, your input handler should not assume that the first line of the input is a CSV header. Because Athena does not send a header row, a handler that skips the first line drops a data record, so the output returned to Athena contains one less row than Athena expects and causes an error. 
+ **SageMaker AI endpoint scaling** – Make sure that the referenced SageMaker AI model endpoint is sufficiently scaled up for Athena calls to the endpoint. For more information, see [Automatically scale SageMaker AI models](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html) in the *Amazon SageMaker AI Developer Guide* and [CreateEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateEndpointConfig.html) in the *Amazon SageMaker AI API Reference*.
+ **IAM permissions** – To run a query that specifies an ML with Athena function, the IAM principal running the query must be allowed to perform the `sagemaker:InvokeEndpoint` action for the referenced SageMaker AI model endpoint. For more information, see [Allow access for ML with Athena](machine-learning-iam-access.md).
+ **ML with Athena functions cannot be used directly in `GROUP BY` clauses**
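Because ML with Athena functions cannot be referenced directly in a `GROUP BY` clause, one possible workaround is to compute the inference result in a subquery and group on the aliased column. The following sketch assumes a hypothetical SageMaker AI endpoint name and the sample table from the example later in this section:

```
USING EXTERNAL FUNCTION predict_customer_registration(age INTEGER)
    RETURNS DOUBLE
    SAGEMAKER 'my-sagemaker-endpoint'
SELECT prediction, COUNT(*) AS customer_count
FROM (
    SELECT predict_customer_registration(age) AS prediction
    FROM "sampledb"."ml_test_dataset"
)
GROUP BY prediction;
```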

**Topics**
+ [Considerations and limitations](#considerations-and-limitations)
+ [Use ML with Athena syntax](ml-syntax.md)
+ [See customer use examples](ml-videos.md)

# Use ML with Athena syntax
<a name="ml-syntax"></a>

The `USING EXTERNAL FUNCTION` clause specifies an ML with Athena function or multiple functions that can be referenced by a subsequent `SELECT` statement in the query. You define the function name, variable names, and data types for the variables and return values.

## Synopsis
<a name="ml-synopsis"></a>

The following syntax shows a `USING EXTERNAL FUNCTION` clause that specifies an ML with Athena function.

```
USING EXTERNAL FUNCTION ml_function_name (variable1 data_type[, variable2 data_type][,...])
RETURNS data_type 
SAGEMAKER 'sagemaker_endpoint'
SELECT ml_function_name()
```

## Parameters
<a name="udf-parameters"></a>

**USING EXTERNAL FUNCTION *ml\_function\_name* (*variable1* *data\_type*[, *variable2* *data\_type*][,...])**  
*ml\_function\_name* defines the function name, which can be used in subsequent query clauses. Each *variable data\_type* specifies a named variable and its corresponding data type that the SageMaker AI model accepts as input. The data type specified must be a supported Athena data type.

**RETURNS *data\_type***  
*data\_type* specifies the SQL data type that *ml\_function\_name* returns to the query as output from the SageMaker AI model.

**SAGEMAKER '*sagemaker\_endpoint*'**  
*sagemaker\_endpoint* specifies the endpoint of the SageMaker AI model.

**SELECT [...] *ml\_function\_name*(*expression*) [...]**  
The `SELECT` query that passes values to function variables and the SageMaker AI model to return a result. *ml\_function\_name* specifies the function defined earlier in the query, followed by an *expression* that is evaluated to pass values. Values that are passed and returned must match the corresponding data types specified for the function in the `USING EXTERNAL FUNCTION` clause.

## Example
<a name="ml-examples"></a>

The following example demonstrates a query using ML with Athena.

**Example**  

```
USING EXTERNAL FUNCTION predict_customer_registration(age INTEGER) 
    RETURNS DOUBLE
    SAGEMAKER 'xgboost-2019-09-20-04-49-29-303' 
SELECT predict_customer_registration(age) AS probability_of_enrolling, customer_id 
     FROM "sampledb"."ml_test_dataset" 
     WHERE predict_customer_registration(age) < 0.5;
```

# See customer use examples
<a name="ml-videos"></a>

The following videos, which use the Preview version of Machine Learning (ML) with Amazon Athena, showcase ways in which you can use SageMaker AI with Athena.

## Predicting customer churn
<a name="ml-videos-predict-churn"></a>

The following video shows how to combine Athena with the machine learning capabilities of Amazon SageMaker AI to predict customer churn.

[![AWS Videos](https://img.youtube.com/vi/CUHbSpekRVg/0.jpg)](https://www.youtube.com/watch?v=CUHbSpekRVg)


## Detecting botnets
<a name="ml-videos-detect-botnets"></a>

The following video shows how one company uses Amazon Athena and Amazon SageMaker AI to detect botnets.

[![AWS Videos](https://img.youtube.com/vi/0dUv-jCt2aw/0.jpg)](https://www.youtube.com/watch?v=0dUv-jCt2aw)


# Query with user defined functions
<a name="querying-udf"></a>

User Defined Functions (UDF) in Amazon Athena allow you to create custom functions to process records or groups of records. A UDF accepts parameters, performs work, and then returns a result.

To use a UDF in Athena, you write a `USING EXTERNAL FUNCTION` clause before a `SELECT` statement in a SQL query. The `SELECT` statement references the UDF and defines the variables that are passed to the UDF when the query runs. The SQL query invokes a Lambda function using the Java runtime when it calls the UDF. UDFs are defined within the Lambda function as methods in a Java deployment package. Multiple UDFs can be defined in the same Java deployment package for a Lambda function. You also specify the name of the Lambda function in the `USING EXTERNAL FUNCTION` clause.

You have two options for deploying a Lambda function for Athena UDFs. You can deploy the function directly using Lambda, or you can use the AWS Serverless Application Repository. To find existing Lambda functions for UDFs, you can search the public AWS Serverless Application Repository or your private repository and then deploy to Lambda. You can also create or modify Java source code, package it into a JAR file, and deploy it using Lambda or the AWS Serverless Application Repository. For example Java source code and packages to get you started, see [Create and deploy a UDF using Lambda](udf-creating-and-deploying.md). For more information about Lambda, see the [AWS Lambda Developer Guide](https://docs.aws.amazon.com/lambda/latest/dg/). For more information about the AWS Serverless Application Repository, see the [AWS Serverless Application Repository Developer Guide](https://docs.aws.amazon.com/serverlessrepo/latest/devguide/).

For an example that uses UDFs with Athena to translate and analyze text, see the AWS Machine Learning Blog article [Translate and analyze text using SQL functions with Amazon Athena, Amazon Translate, and Amazon Comprehend](https://aws.amazon.com/blogs/machine-learning/translate-and-analyze-text-using-sql-functions-with-amazon-athena-amazon-translate-and-amazon-comprehend/), or watch the [video](udf-videos.md#udf-videos-xlate).

For an example of using UDFs to extend geospatial queries in Amazon Athena, see [Extend geospatial queries in Amazon Athena with UDFs and AWS Lambda](https://aws.amazon.com/blogs/big-data/extend-geospatial-queries-in-amazon-athena-with-udfs-and-aws-lambda/) in the *AWS Big Data Blog*.

**Topics**
+ [Videos on UDFs in Athena](udf-videos.md)
+ [Considerations and limitations](udf-considerations-limitations.md)
+ [Query using UDF query syntax](udf-query-syntax.md)
+ [Create and deploy a UDF using Lambda](udf-creating-and-deploying.md)

# Videos on UDFs in Athena
<a name="udf-videos"></a>

Watch the following videos to learn more about using UDFs in Athena.

**Video: Introducing User Defined Functions (UDFs) in Amazon Athena**  
The following video shows how you can use UDFs in Amazon Athena to redact sensitive information.

**Note**  
The syntax in this video is prerelease, but the concepts are the same. Use Athena without the `AmazonAthenaPreviewFunctionality` workgroup. 

[![AWS Videos](https://img.youtube.com/vi/AxJ6jP4Pfmo/0.jpg)](https://www.youtube.com/watch?v=AxJ6jP4Pfmo)


**Video: Translate, analyze, and redact text fields using SQL queries in Amazon Athena**  
The following video shows how you can use UDFs in Amazon Athena together with other AWS services to translate and analyze text.

**Note**  
The syntax in this video is prerelease, but the concepts are the same. For the correct syntax, see the related blog post [Translate, redact, and analyze text using SQL functions with Amazon Athena, Amazon Translate, and Amazon Comprehend](https://aws.amazon.com/blogs/machine-learning/translate-and-analyze-text-using-sql-functions-with-amazon-athena-amazon-translate-and-amazon-comprehend/) on the *AWS Machine Learning Blog*.

[![AWS Videos](https://img.youtube.com/vi/Od7rXG-WMO4/0.jpg)](https://www.youtube.com/watch?v=Od7rXG-WMO4)


# Considerations and limitations
<a name="udf-considerations-limitations"></a>

Consider the following points when you use user defined functions (UDFs) in Athena.
+ **Built-in Athena functions** – Built-in functions in Athena are designed to be highly performant. We recommend that you use built-in functions over UDFs when possible. For more information about built-in functions, see [Functions in Amazon Athena](functions.md).
+ **Scalar UDFs only** – Athena only supports scalar UDFs, which process one row at a time and return a single column value. Athena passes a batch of rows, potentially in parallel, to the UDF each time it invokes Lambda. When designing UDFs and queries, be mindful of the potential impact to network traffic of this processing.
+ **UDF handler functions use abbreviated format** – Use the abbreviated format (not the full format) for your UDF functions (for example, `package.Class` instead of `package.Class::method`). 
+ **UDF methods must be lowercase** – UDF methods must be in lowercase; camel case is not permitted. 
+ **UDF methods require parameters** – UDF methods must have at least one input parameter. Attempting to invoke a UDF defined without input parameters causes a runtime exception. UDFs are meant to perform functions against data records, but a UDF without arguments takes in no data, so an exception occurs.
+ **Java runtime support** – Currently, Athena UDFs support the Java 8, Java 11, and Java 17 runtimes for Lambda. For more information, see [Building Lambda functions with Java](https://docs.aws.amazon.com/lambda/latest/dg/lambda-java.html) in the *AWS Lambda Developer Guide*.
**Note**  
 For Java 17, you must set the `JAVA_TOOL_OPTIONS` environment variable to `--add-opens=java.base/java.nio=ALL-UNNAMED` in your Lambda function. 
+ **IAM permissions** – To run and create UDF query statements in Athena, the IAM principal running the query must be allowed to perform actions in addition to Athena functions. For more information, see [Allow access to Athena UDFs: Example policies](udf-iam-access.md).
+ **Lambda quotas** – Lambda quotas apply to UDFs. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/limits.html) in the *AWS Lambda Developer Guide*.
+ **Row-level filtering** – Lake Formation row-level filtering is not supported for UDFs. 
+ **Views** – You cannot use views with UDFs. 
+ **Known issues** – For the most up-to-date list of known issues, see [Limitations and issues](https://github.com/awslabs/aws-athena-query-federation/wiki/Limitations_And_Issues) in the awslabs/aws-athena-query-federation section of GitHub.
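For the Java 17 requirement in the note above, one way to set the required environment variable is with the AWS CLI; the function name below is a hypothetical placeholder:

```
aws lambda update-function-configuration \
  --function-name my-athena-udf-function \
  --environment '{"Variables":{"JAVA_TOOL_OPTIONS":"--add-opens=java.base/java.nio=ALL-UNNAMED"}}'
```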

# Query using UDF query syntax
<a name="udf-query-syntax"></a>

The `USING EXTERNAL FUNCTION` clause specifies a UDF or multiple UDFs that can be referenced by a subsequent `SELECT` statement in the query. You need the method name for the UDF and the name of the Lambda function that hosts the UDF. In place of the Lambda function name, you can use the Lambda ARN. In cross-account scenarios, the Lambda ARN is required.

## Synopsis
<a name="udf-synopsis"></a>

```
USING EXTERNAL FUNCTION UDF_name(variable1 data_type[, variable2 data_type][,...])
RETURNS data_type
LAMBDA 'lambda_function_name_or_ARN'
[, EXTERNAL FUNCTION UDF_name2(variable1 data_type[, variable2 data_type][,...]) 
RETURNS data_type 
LAMBDA 'lambda_function_name_or_ARN'[,...]]
SELECT  [...] UDF_name(expression) [, UDF_name2(expression)] [...]
```

## Parameters
<a name="udf-parameters"></a>

**USING EXTERNAL FUNCTION *UDF\_name*(*variable1* *data\_type*[, *variable2* *data\_type*][,...])**  
*UDF\_name* specifies the name of the UDF, which must correspond to a Java method within the referenced Lambda function. Each *variable data\_type* specifies a named variable and its corresponding data type that the UDF accepts as input. The *data\_type* must be one of the supported Athena data types listed in the following table and map to the corresponding Java data type.      
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/udf-query-syntax.html)

**RETURNS *data\_type***  
*data\_type* specifies the SQL data type that the UDF returns as output. Athena data types listed in the table above are supported. For the `DECIMAL` data type, use the syntax `RETURNS DECIMAL(precision, scale)` where *precision* and *scale* are integers.

**LAMBDA '*lambda\_function*'**  
*lambda\_function* specifies the name of the Lambda function to be invoked when running the UDF.

**SELECT [...] *UDF\_name*(*expression*) [...]**  
The `SELECT` query that passes values to the UDF and returns a result. *UDF\_name* specifies the UDF to use, followed by an *expression* that is evaluated to pass values. Values that are passed and returned must match the corresponding data types specified for the UDF in the `USING EXTERNAL FUNCTION` clause.

### Examples
<a name="udf-examples"></a>

For example queries based on the [AthenaUDFHandler.java](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-udfs/src/main/java/com/amazonaws/athena/connectors/udfs/AthenaUDFHandler.java) code on GitHub, see the GitHub [Amazon Athena UDF connector](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-udfs) page. 
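As a quick illustration of the syntax, the following sketch calls the `compress` method from that connector through a Lambda ARN, which is the form required in cross-account scenarios; the Region, account ID, and function name in the ARN are placeholders:

```
USING EXTERNAL FUNCTION compress(input VARCHAR)
    RETURNS VARCHAR
    LAMBDA 'arn:aws:lambda:us-east-1:111122223333:function:my-athena-udfs'
SELECT compress('StringToBeCompressed');
```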

# Create and deploy a UDF using Lambda
<a name="udf-creating-and-deploying"></a>

To create a custom UDF, you create a new Java class by extending the `UserDefinedFunctionHandler` class. The source code for the [UserDefinedFunctionHandler.java](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-sdk/src/main/java/com/amazonaws/athena/connector/lambda/handlers/UserDefinedFunctionHandler.java) in the SDK is available on GitHub in the awslabs/aws-athena-query-federation/athena-federation-sdk [repository](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-federation-sdk), along with [example UDF implementations](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-udfs) that you can examine and modify to create a custom UDF.

The steps in this section demonstrate writing and building a custom UDF JAR file using [Apache Maven](https://maven.apache.org/index.html) from the command line and then deploying it.

Perform the following steps to create a custom UDF for Athena using Maven:

1. [Clone the SDK and prepare your development environment](#udf-create-install-sdk-prep-environment)

1. [Create your Maven project](#create-maven-project)

1. [Add dependencies and plugins to your Maven project](#udf-add-maven-dependencies)

1. [Write Java code for the UDFs](#udf-write-java)

1. [Build the JAR file](#udf-create-package-jar)

1. [Deploy the JAR to AWS Lambda](#udf-create-deploy)

## Clone the SDK and prepare your development environment
<a name="udf-create-install-sdk-prep-environment"></a>

Before you begin, make sure that git is installed on your system (for example, by running `sudo yum install git -y`).

**To install the AWS query federation SDK**
+ Enter the following at the command line to clone the SDK repository. This repository includes the SDK, examples and a suite of data source connectors. For more information about data source connectors, see [Use Amazon Athena Federated Query](federated-queries.md).

  ```
  git clone https://github.com/awslabs/aws-athena-query-federation.git
  ```

**To install prerequisites for this procedure**

If you are working on a development machine that already has Apache Maven, the AWS CLI, and the AWS Serverless Application Model build tool installed, you can skip this step.

1. From the root of the `aws-athena-query-federation` directory that you created when you cloned, run the [prepare\_dev\_env.sh](https://github.com/awslabs/aws-athena-query-federation/blob/master/tools/prepare_dev_env.sh) script that prepares your development environment.

1. Update your shell to source new variables created by the installation process or restart your terminal session.

   ```
   source ~/.profile
   ```
**Important**  
If you skip this step, you will get errors later about the AWS CLI or AWS SAM build tool not being able to publish your Lambda function.

## Create your Maven project
<a name="create-maven-project"></a>

Run the following command to create your Maven project. Replace *groupId* with the unique ID of your organization, and replace *my-athena-udfs* with the name of your application. For more information, see [How do I make my first Maven project?](https://maven.apache.org/guides/getting-started/index.html#How_do_I_make_my_first_Maven_project) in the Apache Maven documentation.

```
mvn -B archetype:generate \
-DarchetypeGroupId=org.apache.maven.archetypes \
-DgroupId=groupId \
-DartifactId=my-athena-udfs
```

## Add dependencies and plugins to your Maven project
<a name="udf-add-maven-dependencies"></a>

Add the following configurations to your Maven project `pom.xml` file. For an example, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-udfs/pom.xml) file in GitHub.

```
<properties>
    <aws-athena-federation-sdk.version>2022.47.1</aws-athena-federation-sdk.version>
</properties>

<dependencies>
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-athena-federation-sdk</artifactId>
        <version>${aws-athena-federation-sdk.version}</version>
    </dependency>
</dependencies>
    
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.2.1</version>
            <configuration>
                <createDependencyReducedPom>false</createDependencyReducedPom>
                <filters>
                    <filter>
                        <artifact>*:*</artifact>
                        <excludes>
                            <exclude>META-INF/*.SF</exclude>
                            <exclude>META-INF/*.DSA</exclude>
                            <exclude>META-INF/*.RSA</exclude>
                        </excludes>
                    </filter>
                </filters>
            </configuration>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
```

## Write Java code for the UDFs
<a name="udf-write-java"></a>

Create a new class by extending [UserDefinedFunctionHandler.java](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-sdk/src/main/java/com/amazonaws/athena/connector/lambda/handlers/UserDefinedFunctionHandler.java). Write your UDFs inside the class.

In the following example, two Java methods for UDFs, `compress()` and `decompress()`, are created inside the class `MyUserDefinedFunctions`.

```
package com.mycompany.athena.udfs;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

import com.amazonaws.athena.connector.lambda.handlers.UserDefinedFunctionHandler;

public class MyUserDefinedFunctions
        extends UserDefinedFunctionHandler
{
    private static final String SOURCE_TYPE = "MyCompany";

    public MyUserDefinedFunctions()
    {
        super(SOURCE_TYPE);
    }

    /**
     * Compresses a valid UTF-8 String using the zlib compression library.
     * Encodes bytes with Base64 encoding scheme.
     *
     * @param input the String to be compressed
     * @return the compressed String
     */
    public String compress(String input)
    {
        byte[] inputBytes = input.getBytes(StandardCharsets.UTF_8);

        // create compressor
        Deflater compressor = new Deflater();
        compressor.setInput(inputBytes);
        compressor.finish();

        // compress bytes to output stream
        byte[] buffer = new byte[4096];
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(inputBytes.length);
        while (!compressor.finished()) {
            int bytes = compressor.deflate(buffer);
            byteArrayOutputStream.write(buffer, 0, bytes);
        }

        try {
            byteArrayOutputStream.close();
        }
        catch (IOException e) {
            throw new RuntimeException("Failed to close ByteArrayOutputStream", e);
        }

        // return encoded string
        byte[] compressedBytes = byteArrayOutputStream.toByteArray();
        return Base64.getEncoder().encodeToString(compressedBytes);
    }

    /**
     * Decompresses a valid String that has been compressed using the zlib compression library.
     * Decodes bytes with Base64 decoding scheme.
     *
     * @param input the String to be decompressed
     * @return the decompressed String
     */
    public String decompress(String input)
    {
        byte[] inputBytes = Base64.getDecoder().decode((input));

        // create decompressor
        Inflater decompressor = new Inflater();
        decompressor.setInput(inputBytes, 0, inputBytes.length);

        // decompress bytes to output stream
        byte[] buffer = new byte[4096];
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(inputBytes.length);
        try {
            while (!decompressor.finished()) {
                int bytes = decompressor.inflate(buffer);
                if (bytes == 0 && decompressor.needsInput()) {
                    throw new DataFormatException("Input is truncated");
                }
                byteArrayOutputStream.write(buffer, 0, bytes);
            }
        }
        catch (DataFormatException e) {
            throw new RuntimeException("Failed to decompress string", e);
        }

        try {
            byteArrayOutputStream.close();
        }
        catch (IOException e) {
            throw new RuntimeException("Failed to close ByteArrayOutputStream", e);
        }

        // return decoded string
        byte[] decompressedBytes = byteArrayOutputStream.toByteArray();
        return new String(decompressedBytes, StandardCharsets.UTF_8);
    }
}
```

## Build the JAR file
<a name="udf-create-package-jar"></a>

Run `mvn clean install` to build your project. After it successfully builds, a JAR file is created in the `target` folder of your project named `artifactId-version.jar`, where *artifactId* is the name you provided in the Maven project, for example, `my-athena-udfs`.

## Deploy the JAR to AWS Lambda
<a name="udf-create-deploy"></a>

You have two options to deploy your code to Lambda:
+ Deploy using the AWS Serverless Application Repository (recommended)
+ Create a Lambda function from the JAR file

### Option 1: Deploy to the AWS Serverless Application Repository
<a name="udf-create-deploy-sar"></a>

When you deploy your JAR file to the AWS Serverless Application Repository, you create an AWS SAM template YAML file that represents the architecture of your application. You then specify this YAML file and an Amazon S3 bucket where artifacts for your application are uploaded and made available to the AWS Serverless Application Repository. The procedure below uses the [publish.sh](https://github.com/awslabs/aws-athena-query-federation/blob/master/tools/publish.sh) script located in the `athena-query-federation/tools` directory of the Athena Query Federation SDK that you cloned earlier.

For more information and requirements, see [Publishing applications](https://docs.aws.amazon.com/serverlessrepo/latest/devguide/serverlessrepo-publishing-applications.html) in the *AWS Serverless Application Repository Developer Guide*, [AWS SAM template concepts](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-template-basics.html) in the *AWS Serverless Application Model Developer Guide*, and [Publishing serverless applications using the AWS SAM CLI](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-template-publishing-applications.html).

The following example demonstrates parameters in a YAML file. Add similar parameters to your YAML file and save it in your project directory. See [athena-udfs.yaml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-udfs/athena-udfs.yaml) in GitHub for a full example.

```
Transform: 'AWS::Serverless-2016-10-31'
Metadata:
  'AWS::ServerlessRepo::Application':
    Name: MyApplicationName
    Description: 'The description I write for my application'
    Author: 'Author Name'
    Labels:
      - athena-federation
    SemanticVersion: 1.0.0
Parameters:
  LambdaFunctionName:
    Description: 'The name of the Lambda function that will contain your UDFs.'
    Type: String
  LambdaTimeout:
    Description: 'Maximum Lambda invocation runtime in seconds. (min 1 - 900 max)'
    Default: 900
    Type: Number
  LambdaMemory:
    Description: 'Lambda memory in MB (min 128 - 3008 max).'
    Default: 3008
    Type: Number
Resources:
  ConnectorConfig:
    Type: 'AWS::Serverless::Function'
    Properties:
      FunctionName: !Ref LambdaFunctionName
      Handler: "full.path.to.your.handler. For example, com.amazonaws.athena.connectors.udfs.MyUDFHandler"
      CodeUri: "Relative path to your JAR file. For example, ./target/athena-udfs-1.0.jar"
      Description: "My description of the UDFs that this Lambda function enables."
      Runtime: java8
      Timeout: !Ref LambdaTimeout
      MemorySize: !Ref LambdaMemory
```

Copy the `publish.sh` script to the project directory where you saved your YAML file, and run the following command:

```
./publish.sh MyS3Location MyYamlFile
```

For example, if your bucket location is `s3://amzn-s3-demo-bucket/mysarapps/athenaudf` and your YAML file was saved as `my-athena-udfs.yaml`:

```
./publish.sh amzn-s3-demo-bucket/mysarapps/athenaudf my-athena-udfs
```

**To create a Lambda function**

1. Open the Lambda console at [https://console.aws.amazon.com/lambda/](https://console.aws.amazon.com/lambda/), choose **Create function**, and then choose **Browse serverless app repository**.

1. Choose **Private applications**, find your application in the list or search for it using keywords, and then select it.

1. Review and provide application details, and then choose **Deploy**.

   You can now use the method names defined in your Lambda function JAR file as UDFs in Athena.

### Option 2: Create a Lambda function directly
<a name="udf-create-deploy-lambda"></a>

You can also create a Lambda function directly using the console or AWS CLI. The following example demonstrates using the Lambda `create-function` CLI command. 

```
aws lambda create-function \
 --function-name MyLambdaFunctionName \
 --runtime java8 \
 --role arn:aws:iam::1234567890123:role/my_lambda_role \
 --handler com.mycompany.athena.udfs.MyUserDefinedFunctions \
 --timeout 900 \
 --zip-file fileb://./target/my-athena-udfs-1.0-SNAPSHOT.jar
```

# Query across regions
<a name="querying-across-regions"></a>

Athena supports the ability to query Amazon S3 data in an AWS Region that is different from the Region in which you are using Athena. Querying across Regions can be an option when moving the data is not practical or permissible, or if you want to query data across multiple Regions. Even if Athena is not available in a particular Region, data from that Region can be queried from another Region in which Athena is available.

To query data in a Region, your account must be enabled in that Region even if the Amazon S3 data does not belong to your account. For some Regions, such as US East (Ohio), your access to the Region is automatically enabled when your account is created. Other Regions require that your account be opted in to the Region before you can use it. For a list of Regions that require opt-in, see [Available regions](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-available-regions) in the *Amazon EC2 User Guide*. For specific instructions about opting in to a Region, see [Managing AWS regions](https://docs.aws.amazon.com/general/latest/gr/rande-manage.html) in the *Amazon Web Services General Reference*.

## Considerations and limitations
<a name="querying-across-regions-considerations-and-limitations"></a>
+ **Data access permissions** – To successfully query Amazon S3 data from Athena across Regions, your account must have permissions to read the data. If the data that you want to query belongs to another account, the other account must grant you access to the Amazon S3 location that contains the data.
+ **Data transfer charges** – Amazon S3 data transfer charges apply for cross-region queries. Running a query can result in more data transferred than the size of the dataset. We recommend that you start by testing your queries on a subset of data and reviewing the costs in [AWS Cost Explorer](https://aws.amazon.com/aws-cost-management/aws-cost-explorer/).
+ **AWS Glue** – You can use AWS Glue across Regions. Additional charges may apply for cross-region AWS Glue traffic. For more information, see [Create cross-account and cross-region AWS Glue connections](https://aws.amazon.com/blogs/big-data/create-cross-account-and-cross-region-aws-glue-connections/) in the *AWS Big Data Blog*.
+ **Amazon S3 encryption options** – The SSE-S3 and SSE-KMS encryption options are supported for queries across Regions; CSE-KMS is not. For more information, see [Supported Amazon S3 encryption options](encryption.md#encryption-options-S3-and-Athena).
+ **Federated queries** – Using federated queries across AWS Regions is not supported. 

Provided the above conditions are met, you can create an Athena table that points to the `LOCATION` value that you specify and query the data transparently. No special syntax is required. For information about creating Athena tables, see [Create tables in Athena](creating-tables.md).
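For example, because Amazon S3 bucket names are global, a table over data stored in another Region is created exactly as it would be for a same-Region bucket; the table name, columns, and format below are hypothetical:

```
CREATE EXTERNAL TABLE cross_region_logs (
    id STRING,
    event_time STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://amzn-s3-demo-bucket/logs/';
```

Queries against `cross_region_logs` then read the data across Regions transparently, subject to the data transfer charges noted in the considerations above.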

# Query the AWS Glue Data Catalog
<a name="querying-glue-catalog"></a>

Because the AWS Glue Data Catalog is used by many AWS services as their central metadata repository, you might want to query its metadata. You can do so with SQL queries in Athena, querying AWS Glue Data Catalog metadata such as databases, tables, partitions, and columns.

To obtain AWS Glue Catalog metadata, you query the `information_schema` database on the Athena backend. The example queries in this topic show how to use Athena to query AWS Glue Catalog metadata for common use cases.

## Considerations and limitations
<a name="querying-glue-catalog-considerations-limitations"></a>
+ Instead of querying the `information_schema` database, it is possible to use individual Apache Hive [DDL commands](ddl-reference.md) to extract metadata information for specific databases, tables, views, partitions, and columns from Athena. However, the output is in a non-tabular format.
+ Querying `information_schema` is most performant if you have a small to moderate amount of AWS Glue metadata. If you have a large amount of metadata, errors can occur.
+ You cannot use `CREATE VIEW` to create a view on the `information_schema` database. 
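As an illustration of the first point, Hive DDL commands such as the following return metadata in non-tabular form; the database name `sampledb` is a placeholder:

```
SHOW TABLES IN sampledb;
```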

**Topics**
+ [Considerations and limitations](#querying-glue-catalog-considerations-limitations)
+ [List databases and search a specified database](querying-glue-catalog-querying-available-databases-including-rdbms.md)
+ [List tables in a specified database and search for a table by name](querying-glue-catalog-listing-tables.md)
+ [List partitions for a specific table](querying-glue-catalog-listing-partitions.md)
+ [List or search columns for a specified table or view](querying-glue-catalog-listing-columns.md)
+ [List the columns that specific tables have in common](querying-glue-catalog-listing-columns-in-common.md)
+ [List all columns for all tables](querying-glue-catalog-listing-all-columns-for-all-tables.md)

# List databases and search a specified database
<a name="querying-glue-catalog-querying-available-databases-including-rdbms"></a>

The examples in this section show how to list the databases in your metadata and how to search for a specified database by schema name.

**Example – Listing databases**  
The following example query lists the databases from the `information_schema.schemata` table.  

```
SELECT schema_name
FROM   information_schema.schemata
LIMIT  10;
```
The following table shows sample results.  


|  | schema\_name |
| --- | --- |
| 6 | alb-databas1 |
| 7 | alb\_original\_cust |
| 8 | alblogsdatabase |
| 9 | athena\_db\_test |
| 10 | athena\_ddl\_db |

**Example – Searching a specified database**  
In the following example query, `rdspostgresql` is a sample database.  

```
SELECT schema_name
FROM   information_schema.schemata
WHERE  schema_name = 'rdspostgresql'
```
The following table shows sample results.  


|  | schema\_name |
| --- | --- |
| 1 | rdspostgresql |

# List tables in a specified database and search for a table by name
<a name="querying-glue-catalog-listing-tables"></a>

To list metadata for tables, you can query by table schema or by table name.

**Example – Listing tables by schema**  
The following query lists tables that use the `rdspostgresql` table schema.  

```
SELECT table_schema,
       table_name,
       table_type
FROM   information_schema.tables
WHERE  table_schema = 'rdspostgresql'
```
The following table shows a sample result.  


|  | table\_schema | table\_name | table\_type |
| --- | --- | --- | --- |
| 1 | rdspostgresql | rdspostgresqldb1\_public\_account | BASE TABLE |

**Example – Searching for a table by name**  
The following query obtains metadata information for the table `athena1`.  

```
SELECT table_schema,
       table_name,
       table_type
FROM   information_schema.tables
WHERE  table_name = 'athena1'
```
The following table shows a sample result.  


|  | table\_schema | table\_name | table\_type |
| --- | --- | --- | --- |
| 1 | default | athena1 | BASE TABLE |

# List partitions for a specific table
<a name="querying-glue-catalog-listing-partitions"></a>

You can use `SHOW PARTITIONS table_name` to list the partitions for a specified table, as in the following example.

```
SHOW PARTITIONS cloudtrail_logs_test2
```

You can also use a `$partitions` metadata query to list the partition numbers and partition values for a specific table.

**Example – Querying the partitions for a table using the `$partitions` syntax**  
The following example query lists the partitions for the table `cloudtrail_logs_test2` using the `$partitions` syntax.  

```
SELECT * FROM default."cloudtrail_logs_test2$partitions" ORDER BY partition_number
```
The following table shows sample results.  


|  | table\_catalog | table\_schema | table\_name | Year | Month | Day |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | awsdatacatalog | default | cloudtrail\_logs\_test2 | 2020 | 08 | 10 |
| 2 | awsdatacatalog | default | cloudtrail\_logs\_test2 | 2020 | 08 | 11 |
| 3 | awsdatacatalog | default | cloudtrail\_logs\_test2 | 2020 | 08 | 12 |

# List or search columns for a specified table or view
<a name="querying-glue-catalog-listing-columns"></a>

You can list all columns for a table, all columns for a view, or search for a column by name in a specified database and table.

To list the columns, use a `SELECT *` query. In the `FROM` clause, specify `information_schema.columns`. In the `WHERE` clause, use `table_schema='database_name'` to specify the database and `table_name = 'table_name'` to specify the table or view that has the columns that you want to list.

**Example – Listing all columns for a specified table**  
The following example query lists all columns for the table `rdspostgresqldb1_public_account`.  

```
SELECT *
FROM   information_schema.columns
WHERE  table_schema = 'rdspostgresql'
       AND table_name = 'rdspostgresqldb1_public_account'
```
The following table shows sample results.  


|  | table\_catalog | table\_schema | table\_name | column\_name | ordinal\_position | column\_default | is\_nullable | data\_type | comment | extra\_info |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | awsdatacatalog | rdspostgresql | rdspostgresqldb1\_public\_account | password | 1 |  | YES | varchar |  |  |
| 2 | awsdatacatalog | rdspostgresql | rdspostgresqldb1\_public\_account | user\_id | 2 |  | YES | integer |  |  |
| 3 | awsdatacatalog | rdspostgresql | rdspostgresqldb1\_public\_account | created\_on | 3 |  | YES | timestamp |  |  |
| 4 | awsdatacatalog | rdspostgresql | rdspostgresqldb1\_public\_account | last\_login | 4 |  | YES | timestamp |  |  |
| 5 | awsdatacatalog | rdspostgresql | rdspostgresqldb1\_public\_account | email | 5 |  | YES | varchar |  |  |
| 6 | awsdatacatalog | rdspostgresql | rdspostgresqldb1\_public\_account | username | 6 |  | YES | varchar |  |  |

**Example – Listing the columns for a specified view**  
The following example query lists all the columns in the `default` database for the view `arrayview`.  

```
SELECT *
FROM   information_schema.columns
WHERE  table_schema = 'default'
       AND table_name = 'arrayview'
```
The following table shows sample results.  


|  | table\_catalog | table\_schema | table\_name | column\_name | ordinal\_position | column\_default | is\_nullable | data\_type | comment | extra\_info |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | awsdatacatalog | default | arrayview | searchdate | 1 |  | YES | varchar |  |  |
| 2 | awsdatacatalog | default | arrayview | sid | 2 |  | YES | varchar |  |  |
| 3 | awsdatacatalog | default | arrayview | btid | 3 |  | YES | varchar |  |  |
| 4 | awsdatacatalog | default | arrayview | p | 4 |  | YES | varchar |  |  |
| 5 | awsdatacatalog | default | arrayview | infantprice | 5 |  | YES | varchar |  |  |
| 6 | awsdatacatalog | default | arrayview | sump | 6 |  | YES | varchar |  |  |
| 7 | awsdatacatalog | default | arrayview | journeymaparray | 7 |  | YES | array(varchar) |  |  |

**Example – Searching for a column by name in a specified database and table**  
The following example query searches for metadata for the `sid` column in the `arrayview` view of the `default` database.  

```
SELECT *
FROM   information_schema.columns
WHERE  table_schema = 'default'
       AND table_name = 'arrayview' 
       AND column_name='sid'
```
The following table shows a sample result.  


|  | table\_catalog | table\_schema | table\_name | column\_name | ordinal\_position | column\_default | is\_nullable | data\_type | comment | extra\_info |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | awsdatacatalog | default | arrayview | sid | 2 |  | YES | varchar |  |  |

# List the columns that specific tables have in common
<a name="querying-glue-catalog-listing-columns-in-common"></a>

You can list the columns that specific tables in a database have in common.
+ Use the syntax `SELECT column_name FROM information_schema.columns`.
+ For the `WHERE` clause, use the syntax `WHERE table_name IN ('table1', 'table2')`.

**Example – Listing common columns for two tables in the same database**  
The following example query lists the columns that the tables `table1` and `table2` have in common.  

```
SELECT column_name
FROM information_schema.columns
WHERE table_name IN ('table1', 'table2')
GROUP BY column_name
HAVING COUNT(*) > 1;
```

# List all columns for all tables
<a name="querying-glue-catalog-listing-all-columns-for-all-tables"></a>

You can list all columns for all tables in `AwsDataCatalog` or for all tables in a specific database in `AwsDataCatalog`.
+ To list all columns for all tables in all databases in `AwsDataCatalog`, use the query `SELECT * FROM information_schema.columns`.
+ To restrict the results to a specific database, use `table_schema='database_name'` in the `WHERE` clause.

**Example – Listing all columns for all tables in a specific database**  
The following example query lists all columns for all tables in the database `webdata`.  

```
SELECT * FROM information_schema.columns WHERE table_schema = 'webdata'            
```
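
A related rollup counts the columns in each table of the same database, which can help you spot unusually wide tables. The database name here follows the example above:

```
SELECT table_name,
       COUNT(*) AS column_count
FROM   information_schema.columns
WHERE  table_schema = 'webdata'
GROUP  BY table_name
ORDER  BY column_count DESC;
```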

# Query AWS service logs
<a name="querying-aws-service-logs"></a>

This section includes several procedures for using Amazon Athena to query popular datasets, such as AWS CloudTrail logs, Amazon CloudFront logs, Classic Load Balancer logs, Application Load Balancer logs, Amazon VPC flow logs, and Network Load Balancer logs.

The tasks in this section use the Athena console, but you can also use other tools like the [Athena JDBC driver](connect-with-jdbc.md), the [AWS CLI](https://docs.aws.amazon.com/cli/latest/reference/athena/), or the [Amazon Athena API Reference](https://docs.aws.amazon.com/athena/latest/APIReference/).

For information about using AWS CloudFormation to automatically create AWS service log tables, partitions, and example queries in Athena, see [Automating AWS service logs table creation and querying them with Amazon Athena](https://aws.amazon.com/blogs/big-data/automating-aws-service-logs-table-creation-and-querying-them-with-amazon-athena/) in the AWS Big Data Blog. For information about using a Python library for AWS Glue to create a common framework for processing AWS service logs and querying them in Athena, see [Easily query AWS service logs using Amazon Athena](https://aws.amazon.com/blogs/big-data/easily-query-aws-service-logs-using-amazon-athena/).

The topics in this section assume that you have configured appropriate permissions to access Athena and the Amazon S3 bucket where the data that you want to query resides. For more information, see [Set up, administrative, and programmatic access](setting-up.md) and [Get started](getting-started.md).

**Topics**
+ [Application Load Balancer](application-load-balancer-logs.md)
+ [Elastic Load Balancing](elasticloadbalancer-classic-logs.md)
+ [CloudFront](cloudfront-logs.md)
+ [CloudTrail](cloudtrail-logs.md)
+ [Amazon EMR](emr-logs.md)
+ [Global Accelerator](querying-global-accelerator-flow-logs.md)
+ [GuardDuty](querying-guardduty.md)
+ [Network Firewall](querying-network-firewall-logs.md)
+ [Network Load Balancer](networkloadbalancer-classic-logs.md)
+ [Route 53](querying-r53-resolver-logs.md)
+ [Amazon SES](querying-ses-logs.md)
+ [Amazon VPC](vpc-flow-logs.md)
+ [AWS WAF](waf-logs.md)

# Query Application Load Balancer logs
<a name="application-load-balancer-logs"></a>

An Application Load Balancer is an Elastic Load Balancing option that distributes traffic at the application layer, including in microservices deployments that use containers. Querying Application Load Balancer logs lets you see the source of traffic, the latency, and the bytes transferred to and from Elastic Load Balancing instances and backend applications. For more information, see [Access logs for your Application Load Balancer](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html) and [Connection logs for your Application Load Balancer](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-connection-logs.html) in the *User Guide for Application Load Balancers*.

## Prerequisites
<a name="application-load-balancer-logs-prerequisites"></a>
+ Enable [access logging](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html) or [connection logging](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-connection-logs.html) so that Application Load Balancer logs can be saved to your Amazon S3 bucket.
+ Create a database to hold the table that you will create for Athena. To create a database, you can use the Athena or AWS Glue console. For more information, see [Create databases in Athena](creating-databases.md) in this guide or [Working with databases on the AWS Glue console](https://docs.aws.amazon.com/glue/latest/dg/console-databases.html) in the *AWS Glue Developer Guide*.

**Topics**
+ [Prerequisites](#application-load-balancer-logs-prerequisites)
+ [Create the table for ALB access logs](create-alb-access-logs-table.md)
+ [Create the table for ALB access logs in Athena using partition projection](create-alb-access-logs-table-partition-projection.md)
+ [Example queries for ALB access logs](query-alb-access-logs-examples.md)
+ [Create the table for ALB connection logs](create-alb-connection-logs-table.md)
+ [Create the table for ALB connection logs in Athena using partition projection](create-alb-connection-logs-table-partition-projection.md)
+ [Example queries for ALB connection logs](query-alb-connection-logs-examples.md)
+ [Additional resources](application-load-balancer-logs-additional-resources.md)

# Create the table for ALB access logs
<a name="create-alb-access-logs-table"></a>

1. Copy and paste the following `CREATE TABLE` statement into the query editor in the Athena console, and then modify it as necessary for your own log entry requirements. For information about getting started with the Athena console, see [Get started](getting-started.md). Replace the path in the `LOCATION` clause with your Amazon S3 access log folder location. For more information about access log file location, see [Access log files](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html#access-log-file-format) in the *User Guide for Application Load Balancers*.

   For information about each log file field, see [Access log entries](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html#access-log-entry-format) in the *User Guide for Application Load Balancers*.
**Note**  
The following example `CREATE TABLE` statement includes the recently added `classification`, `classification_reason`, and `conn_trace_id` ('traceability ID', or TID) columns. To create a table for Application Load Balancer access logs that do not contain these entries, remove the corresponding columns from the `CREATE TABLE` statement and modify the regular expression accordingly. 

   ```
   CREATE EXTERNAL TABLE IF NOT EXISTS alb_access_logs (
               type string,
               time string,
               elb string,
               client_ip string,
               client_port int,
               target_ip string,
               target_port int,
               request_processing_time double,
               target_processing_time double,
               response_processing_time double,
               elb_status_code int,
               target_status_code string,
               received_bytes bigint,
               sent_bytes bigint,
               request_verb string,
               request_url string,
               request_proto string,
               user_agent string,
               ssl_cipher string,
               ssl_protocol string,
               target_group_arn string,
               trace_id string,
               domain_name string,
               chosen_cert_arn string,
               matched_rule_priority string,
               request_creation_time string,
               actions_executed string,
               redirect_url string,
               lambda_error_reason string,
               target_port_list string,
               target_status_code_list string,
               classification string,
               classification_reason string,
               conn_trace_id string
               )
               ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
               WITH SERDEPROPERTIES (
               'serialization.format' = '1',
               'input.regex' = 
           '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) (.*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-_]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\\s]+?)\" \"([^\\s]+)\" \"([^ ]*)\" \"([^ ]*)\" ?([^ ]*)? ?( .*)?'
               )
               LOCATION 's3://amzn-s3-demo-bucket/access-log-folder-path/'
   ```
**Note**  
We recommend that you keep the pattern *`?( .*)?`* at the end of the `input.regex` parameter in place so that the expression continues to match if new ALB log fields are added in the future. 

1. Run the query in the Athena console. After the query completes, Athena registers the `alb_access_logs` table, making the data in it ready for you to issue queries.

# Create the table for ALB access logs in Athena using partition projection
<a name="create-alb-access-logs-table-partition-projection"></a>

Because ALB access logs have a known structure whose partition scheme you can specify in advance, you can reduce query runtime and automate partition management by using the Athena partition projection feature. Partition projection automatically adds new partitions as new data is added. This removes the need for you to manually add partitions by using `ALTER TABLE ADD PARTITION`. 

The following example `CREATE TABLE` statement automatically uses partition projection on ALB access logs from a specified date until the present for a single AWS Region. The statement is based on the example in the previous section but adds `PARTITIONED BY` and `TBLPROPERTIES` clauses to enable partition projection. In the `LOCATION` and `storage.location.template` clauses, replace the placeholders with values that identify the Amazon S3 bucket location of your ALB access logs. For more information about access log file location, see [Access log files](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html#access-log-file-format) in the *User Guide for Application Load Balancers*. For `projection.day.range`, replace *2022*/*01*/*01* with the starting date that you want to use. After you run the query successfully, you can query the table. You do not have to run `ALTER TABLE ADD PARTITION` to load the partitions. For information about each log file field, see [Access log entries](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html#access-log-entry-format). 

```
CREATE EXTERNAL TABLE IF NOT EXISTS alb_access_logs (
            type string,
            time string,
            elb string,
            client_ip string,
            client_port int,
            target_ip string,
            target_port int,
            request_processing_time double,
            target_processing_time double,
            response_processing_time double,
            elb_status_code int,
            target_status_code string,
            received_bytes bigint,
            sent_bytes bigint,
            request_verb string,
            request_url string,
            request_proto string,
            user_agent string,
            ssl_cipher string,
            ssl_protocol string,
            target_group_arn string,
            trace_id string,
            domain_name string,
            chosen_cert_arn string,
            matched_rule_priority string,
            request_creation_time string,
            actions_executed string,
            redirect_url string,
            lambda_error_reason string,
            target_port_list string,
            target_status_code_list string,
            classification string,
            classification_reason string,
            conn_trace_id string
            )
            PARTITIONED BY
            (
             day STRING
            )
            ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
            WITH SERDEPROPERTIES (
            'serialization.format' = '1',
            'input.regex' = 
        '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) (.*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-_]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\\s]+?)\" \"([^\\s]+)\" \"([^ ]*)\" \"([^ ]*)\" ?([^ ]*)? ?( .*)?'
            )
            LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/<ACCOUNT-NUMBER>/elasticloadbalancing/<REGION>/'
            TBLPROPERTIES
            (
             "projection.enabled" = "true",
             "projection.day.type" = "date",
             "projection.day.range" = "2022/01/01,NOW",
             "projection.day.format" = "yyyy/MM/dd",
             "projection.day.interval" = "1",
             "projection.day.interval.unit" = "DAYS",
             "storage.location.template" = "s3://amzn-s3-demo-bucket/AWSLogs/<ACCOUNT-NUMBER>/elasticloadbalancing/<REGION>/${day}"
            )
```

For more information about partition projection, see [Use partition projection with Amazon Athena](partition-projection.md).

**Note**  
We suggest that the pattern *`?( .*)?`* at the end of the `input.regex` parameter always remain in place to handle future log entries in case new ALB log fields are added. 

# Example queries for ALB access logs
<a name="query-alb-access-logs-examples"></a>

The following query counts the number of HTTP GET requests received by the load balancer grouped by the client IP address:

```
SELECT COUNT(request_verb) AS
 count,
 request_verb,
 client_ip
FROM alb_access_logs
GROUP BY request_verb, client_ip
LIMIT 100;
```

Another query shows the URLs visited by Safari browser users:

```
SELECT request_url
FROM alb_access_logs
WHERE user_agent LIKE '%Safari%'
LIMIT 10;
```

The following query shows records that have ELB status code values greater than or equal to 500.

```
SELECT * FROM alb_access_logs
WHERE elb_status_code >= 500
```

The following example shows how to parse the logs by `datetime`:

```
SELECT client_ip, sum(received_bytes) 
FROM alb_access_logs
WHERE parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z') 
     BETWEEN parse_datetime('2018-05-30-12:00:00','yyyy-MM-dd-HH:mm:ss') 
     AND parse_datetime('2018-05-31-00:00:00','yyyy-MM-dd-HH:mm:ss') 
GROUP BY client_ip;
```
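
A variation on the same parsing technique buckets received bytes by hour instead of filtering a fixed window. This is a sketch that assumes the same `time` format as the preceding query:

```
SELECT date_trunc('hour',
         parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z')) AS hour,
       sum(received_bytes) AS total_bytes
FROM alb_access_logs
GROUP BY 1
ORDER BY 1;
```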

The following query uses the table configured for partition projection to return all ALB access logs from the specified day.

```
SELECT * 
FROM alb_access_logs 
WHERE day = '2022/02/12'
```

# Create the table for ALB connection logs
<a name="create-alb-connection-logs-table"></a>

1. Copy and paste the following example `CREATE TABLE` statement into the query editor in the Athena console, and then modify it as necessary for your own log entry requirements. For information about getting started with the Athena console, see [Get started](getting-started.md). Replace the path in the `LOCATION` clause with your Amazon S3 connection log folder location. For more information about connection log file location, see [Connection log files](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-connection-logs.html#connection-log-file-format) in the *User Guide for Application Load Balancers*. For information about each log file field, see [Connection log entries](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-connection-logs.html#connection-log-entry-format). 

   ```
   CREATE EXTERNAL TABLE IF NOT EXISTS alb_connection_logs (
            time string,
            client_ip string,
            client_port int,
            listener_port int,
            tls_protocol string,
            tls_cipher string,
            tls_handshake_latency double,
            leaf_client_cert_subject string,
            leaf_client_cert_validity string,
            leaf_client_cert_serial_number string,
            tls_verify_status string,
            conn_trace_id string
            ) 
            ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
            WITH SERDEPROPERTIES (
            'serialization.format' = '1',
            'input.regex' =
             '([^ ]*) ([^ ]*) ([0-9]*) ([0-9]*) ([A-Za-z0-9.-]*) ([^ ]*) ([-.0-9]*) \"([^\"]*)\" ([^ ]*) ([^ ]*) ([^ ]*) ?([^ ]*)?( .*)?'
            )
            LOCATION 's3://amzn-s3-demo-bucket/connection-log-folder-path/'
   ```

1. Run the query in the Athena console. After the query completes, Athena registers the `alb_connection_logs` table, making the data in it ready for you to issue queries.

# Create the table for ALB connection logs in Athena using partition projection
<a name="create-alb-connection-logs-table-partition-projection"></a>

Because ALB connection logs have a known structure whose partition scheme you can specify in advance, you can reduce query runtime and automate partition management by using the Athena partition projection feature. Partition projection automatically adds new partitions as new data is added. This removes the need for you to manually add partitions by using `ALTER TABLE ADD PARTITION`. 

The following example `CREATE TABLE` statement automatically uses partition projection on ALB connection logs from a specified date until the present for a single AWS Region. The statement is based on the example in the previous section but adds `PARTITIONED BY` and `TBLPROPERTIES` clauses to enable partition projection. In the `LOCATION` and `storage.location.template` clauses, replace the placeholders with values that identify the Amazon S3 bucket location of your ALB connection logs. For more information about connection log file location, see [Connection log files](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-connection-logs.html#connection-log-file-format) in the *User Guide for Application Load Balancers*. For `projection.day.range`, replace *2023*/*01*/*01* with the starting date that you want to use. After you run the query successfully, you can query the table. You do not have to run `ALTER TABLE ADD PARTITION` to load the partitions. For information about each log file field, see [Connection log entries](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-connection-logs.html#connection-log-entry-format).

```
CREATE EXTERNAL TABLE IF NOT EXISTS alb_connection_logs (
         time string,
         client_ip string,
         client_port int,
         listener_port int,
         tls_protocol string,
         tls_cipher string,
         tls_handshake_latency double,
         leaf_client_cert_subject string,
         leaf_client_cert_validity string,
         leaf_client_cert_serial_number string,
         tls_verify_status string,
         conn_trace_id string
         )
            PARTITIONED BY
            (
             day STRING
            )
            ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
            WITH SERDEPROPERTIES (
            'serialization.format' = '1',
            'input.regex' =
             '([^ ]*) ([^ ]*) ([0-9]*) ([0-9]*) ([A-Za-z0-9.-]*) ([^ ]*) ([-.0-9]*) \"([^\"]*)\" ([^ ]*) ([^ ]*) ([^ ]*) ?([^ ]*)?( .*)?'
            )
            LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/<ACCOUNT-NUMBER>/elasticloadbalancing/<REGION>/'
            TBLPROPERTIES
            (
             "projection.enabled" = "true",
             "projection.day.type" = "date",
             "projection.day.range" = "2023/01/01,NOW",
             "projection.day.format" = "yyyy/MM/dd",
             "projection.day.interval" = "1",
             "projection.day.interval.unit" = "DAYS",
             "storage.location.template" = "s3://amzn-s3-demo-bucket/AWSLogs/<ACCOUNT-NUMBER>/elasticloadbalancing/<REGION>/${day}"
            )
```

For more information about partition projection, see [Use partition projection with Amazon Athena](partition-projection.md).

# Example queries for ALB connection logs
<a name="query-alb-connection-logs-examples"></a>

The following query counts occurrences where the value for `tls_verify_status` was not `'Success'`, grouped by client IP address:

```
SELECT client_ip, count(*) AS count FROM alb_connection_logs
WHERE tls_verify_status != 'Success'
GROUP BY client_ip
ORDER BY count DESC;
```

The following query finds occurrences where the value for `tls_handshake_latency` was 2 seconds or more in the specified time range:

```
SELECT * FROM alb_connection_logs
WHERE 
  (
    parse_datetime(time, 'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z') 
    BETWEEN 
    parse_datetime('2024-01-01-00:00:00', 'yyyy-MM-dd-HH:mm:ss') 
    AND 
    parse_datetime('2024-03-20-00:00:00', 'yyyy-MM-dd-HH:mm:ss') 
  ) 
  AND 
    (tls_handshake_latency >= 2.0);
```

# Additional resources
<a name="application-load-balancer-logs-additional-resources"></a>

For more information about using ALB logs, see the following resources.
+ [How do I analyze my Application Load Balancer access logs using Amazon Athena?](https://repost.aws/knowledge-center/athena-analyze-access-logs) in the *AWS Knowledge Center*.
+ For information about HTTP status codes in Elastic Load Balancing, see [Troubleshoot your application load balancers](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html) in the *User Guide for Application Load Balancers*.
+ [Catalog and analyze Application Load Balancer logs more efficiently with AWS Glue custom classifiers and Amazon Athena](https://aws.amazon.com/blogs/big-data/catalog-and-analyze-application-load-balancer-logs-more-efficiently-with-aws-glue-custom-classifiers-and-amazon-athena/) in the *AWS Big Data Blog*.

# Query Classic Load Balancer logs
<a name="elasticloadbalancer-classic-logs"></a>

Use Classic Load Balancer logs to analyze and understand traffic patterns to and from Elastic Load Balancing instances and backend applications. You can see the source of traffic, latency, and bytes that have been transferred.

Before you analyze the Elastic Load Balancing logs, configure them for saving in the destination Amazon S3 bucket. For more information, see [Enable access logs for your Classic Load Balancer](https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/enable-access-logs.html).

**To create the table for Elastic Load Balancing logs**

1. Copy and paste the following DDL statement into the Athena console. Check the [syntax](https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/access-log-collection.html#access-log-entry-format) of the Elastic Load Balancing log records. You might need to update the following query to include the columns and the regex syntax for the latest version of the record. 

   ```
   CREATE EXTERNAL TABLE IF NOT EXISTS elb_logs (
    
    timestamp string,
    elb_name string,
    request_ip string,
    request_port int,
    backend_ip string,
    backend_port int,
    request_processing_time double,
    backend_processing_time double,
    client_response_time double,
    elb_response_code string,
    backend_response_code string,
    received_bytes bigint,
    sent_bytes bigint,
    request_verb string,
    url string,
    protocol string,
    user_agent string,
    ssl_cipher string,
    ssl_protocol string
   )
   ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
   WITH SERDEPROPERTIES (
    'serialization.format' = '1',
    'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$'
   )
   LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/AWS_account_ID/elasticloadbalancing/';
   ```

1. Modify the `LOCATION` clause to specify the Amazon S3 bucket that contains your Elastic Load Balancing logs.

1. Run the query in the Athena console. After the query completes, Athena registers the `elb_logs` table, making the data in it ready for queries. For more information, see [Example queries](#query-elb-classic-example).

## Example queries
<a name="query-elb-classic-example"></a>

Use a query similar to the following example. It lists the backend application servers that returned a `4XX` or `5XX` error response code. Use the `LIMIT` clause to limit the number of logs to query at a time.

```
SELECT
 timestamp,
 elb_name,
 backend_ip,
 backend_response_code
FROM elb_logs
WHERE backend_response_code LIKE '4%' OR
      backend_response_code LIKE '5%'
LIMIT 100;
```

Use a subsequent query to sum the response times of all transactions, grouped by backend IP address and Elastic Load Balancing instance name.

```
SELECT sum(backend_processing_time) AS
 total_ms,
 elb_name,
 backend_ip
FROM elb_logs WHERE backend_ip <> ''
GROUP BY backend_ip, elb_name
LIMIT 100;
```
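As a further illustration, the following sketch (assuming the `elb_logs` table created earlier) ranks load balancers by average backend processing time. Requests that the load balancer could not dispatch record a processing time of `-1`, so the filter excludes them.

```
SELECT elb_name,
       avg(backend_processing_time) AS avg_backend_seconds
FROM elb_logs
WHERE backend_processing_time >= 0
GROUP BY elb_name
ORDER BY avg_backend_seconds DESC
LIMIT 10;
```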

For more information, see [Analyzing data in S3 using Athena](https://aws.amazon.com/blogs/big-data/analyzing-data-in-s3-using-amazon-athena/).

# Query Amazon CloudFront logs
<a name="cloudfront-logs"></a>

You can configure Amazon CloudFront to export web distribution access logs to Amazon Simple Storage Service (Amazon S3). Use these logs to explore users' surfing patterns across your web properties served by CloudFront.

Before you begin querying the logs, enable access logging on your preferred CloudFront web distribution. For information, see [Access logs](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html) in the *Amazon CloudFront Developer Guide*. Make a note of the Amazon S3 bucket in which you save these logs.

**Topics**
+ [Create a table for CloudFront standard logs (legacy)](create-cloudfront-table-standard-logs.md)
+ [Create a table for CloudFront logs in Athena using manual partitioning with JSON](create-cloudfront-table-manual-json.md)
+ [Create a table for CloudFront logs in Athena using manual partitioning with Parquet](create-cloudfront-table-manual-parquet.md)
+ [Create a table for CloudFront logs in Athena using partition projection with JSON](create-cloudfront-table-partition-json.md)
+ [Create a table for CloudFront logs in Athena using partition projection with Parquet](create-cloudfront-table-partition-parquet.md)
+ [Create a table for CloudFront real-time logs](create-cloudfront-table-real-time-logs.md)
+ [Additional resources](cloudfront-logs-additional-resources.md)

# Create a table for CloudFront standard logs (legacy)
<a name="create-cloudfront-table-standard-logs"></a>

**Note**  
The following procedure works for the Web distribution access logs in CloudFront. It does not apply to streaming logs from RTMP distributions.

**To create a table for CloudFront standard log file fields**

1. Copy and paste the following example DDL statement into the Query Editor in the Athena console. The example statement uses the log file fields documented in the [Standard log file fields](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html#BasicDistributionFileFormat) section of the *Amazon CloudFront Developer Guide*. Modify the `LOCATION` for the Amazon S3 bucket that stores your logs. For information about using the Query Editor, see [Get started](getting-started.md).

   This query specifies `ROW FORMAT DELIMITED` and `FIELDS TERMINATED BY '\t'` to indicate that the fields are delimited by tab characters. For `ROW FORMAT DELIMITED`, Athena uses the [LazySimpleSerDe](lazy-simple-serde.md) by default. The column `date` is escaped using backticks (`) because it is a reserved word in Athena. For information, see [Escape reserved keywords in queries](reserved-words.md).

   ```
   CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_standard_logs (
     `date` DATE,
     time STRING,
     x_edge_location STRING,
     sc_bytes BIGINT,
     c_ip STRING,
     cs_method STRING,
     cs_host STRING,
     cs_uri_stem STRING,
     sc_status INT,
     cs_referrer STRING,
     cs_user_agent STRING,
     cs_uri_query STRING,
     cs_cookie STRING,
     x_edge_result_type STRING,
     x_edge_request_id STRING,
     x_host_header STRING,
     cs_protocol STRING,
     cs_bytes BIGINT,
     time_taken FLOAT,
     x_forwarded_for STRING,
     ssl_protocol STRING,
     ssl_cipher STRING,
     x_edge_response_result_type STRING,
     cs_protocol_version STRING,
     fle_status STRING,
     fle_encrypted_fields INT,
     c_port INT,
     time_to_first_byte FLOAT,
     x_edge_detailed_result_type STRING,
     sc_content_type STRING,
     sc_content_len BIGINT,
     sc_range_start BIGINT,
     sc_range_end BIGINT
   )
   ROW FORMAT DELIMITED 
   FIELDS TERMINATED BY '\t'
   LOCATION 's3://amzn-s3-demo-bucket/'
   TBLPROPERTIES ( 'skip.header.line.count'='2' )
   ```

1. Run the query in the Athena console. After the query completes, Athena registers the `cloudfront_standard_logs` table, making the data in it ready for you to issue queries.

## Example queries
<a name="query-examples-cloudfront-logs"></a>

The following query adds up the number of bytes served by CloudFront between June 9 and June 11, 2018. Surround the date column name with double quotes because it is a reserved word.

```
SELECT SUM(sc_bytes) AS total_bytes
FROM cloudfront_standard_logs
WHERE "date" BETWEEN DATE '2018-06-09' AND DATE '2018-06-11'
LIMIT 100;
```

To eliminate duplicate rows (for example, duplicate empty rows) from the query results, you can use the `SELECT DISTINCT` statement, as in the following example. 

```
SELECT DISTINCT * 
FROM cloudfront_standard_logs 
LIMIT 10;
```
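Building on the same table, a sketch like the following surfaces the most frequently requested paths (assuming the `cloudfront_standard_logs` table defined above):

```
SELECT cs_uri_stem,
       count(*) AS request_count
FROM cloudfront_standard_logs
GROUP BY cs_uri_stem
ORDER BY request_count DESC
LIMIT 10;
```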

# Create a table for CloudFront logs in Athena using manual partitioning with JSON
<a name="create-cloudfront-table-manual-json"></a>

**To create a table for CloudFront standard log file fields using a JSON format**

1. Copy and paste the following example DDL statement into the Query Editor in the Athena console. The example statement uses the log file fields documented in the [Standard log file fields](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html#BasicDistributionFileFormat) section of the *Amazon CloudFront Developer Guide*. Modify the `LOCATION` for the Amazon S3 bucket that stores your logs. 

   This query uses the OpenX JSON SerDe with the following SerDe properties to read JSON fields correctly in Athena.

   ```
   CREATE EXTERNAL TABLE `cf_logs_manual_partition_json`(
     `date` string , 
     `time` string , 
     `x-edge-location` string , 
     `sc-bytes` string , 
     `c-ip` string , 
     `cs-method` string , 
     `cs(host)` string , 
     `cs-uri-stem` string , 
     `sc-status` string , 
     `cs(referer)` string , 
     `cs(user-agent)` string , 
     `cs-uri-query` string , 
     `cs(cookie)` string , 
     `x-edge-result-type` string , 
     `x-edge-request-id` string , 
     `x-host-header` string , 
     `cs-protocol` string , 
     `cs-bytes` string , 
     `time-taken` string , 
     `x-forwarded-for` string , 
     `ssl-protocol` string , 
     `ssl-cipher` string , 
     `x-edge-response-result-type` string , 
     `cs-protocol-version` string , 
     `fle-status` string , 
     `fle-encrypted-fields` string , 
     `c-port` string , 
     `time-to-first-byte` string , 
     `x-edge-detailed-result-type` string , 
     `sc-content-type` string , 
     `sc-content-len` string , 
     `sc-range-start` string , 
     `sc-range-end` string )
   ROW FORMAT SERDE 
     'org.openx.data.jsonserde.JsonSerDe' 
   WITH SERDEPROPERTIES ( 
     'paths'='c-ip,c-port,cs(Cookie),cs(Host),cs(Referer),cs(User-Agent),cs-bytes,cs-method,cs-protocol,cs-protocol-version,cs-uri-query,cs-uri-stem,date,fle-encrypted-fields,fle-status,sc-bytes,sc-content-len,sc-content-type,sc-range-end,sc-range-start,sc-status,ssl-cipher,ssl-protocol,time,time-taken,time-to-first-byte,x-edge-detailed-result-type,x-edge-location,x-edge-request-id,x-edge-response-result-type,x-edge-result-type,x-forwarded-for,x-host-header') 
   STORED AS INPUTFORMAT 
     'org.apache.hadoop.mapred.TextInputFormat' 
   OUTPUTFORMAT 
     'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
   LOCATION
     's3://amzn-s3-demo-bucket/'
   ```

1. Run the query in the Athena console. After the query completes, Athena registers the `cf_logs_manual_partition_json` table, making the data in it ready for you to issue queries.

## Example queries
<a name="query-examples-cloudfront-logs-manual-json"></a>

The following query adds up the number of bytes served by CloudFront for January 15, 2025.

```
SELECT sum(cast("sc-bytes" as BIGINT)) as sc
FROM cf_logs_manual_partition_json
WHERE "date"='2025-01-15'
```

To eliminate duplicate rows (for example, duplicate empty rows) from the query results, you can use the `SELECT DISTINCT` statement, as in the following example. 

```
SELECT DISTINCT * FROM cf_logs_manual_partition_json
```

# Create a table for CloudFront logs in Athena using manual partitioning with Parquet
<a name="create-cloudfront-table-manual-parquet"></a>

**To create a table for CloudFront standard log file fields using a Parquet format**

1. Copy and paste the following example DDL statement into the Query Editor in the Athena console. The example statement uses the log file fields documented in the [Standard log file fields](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html#BasicDistributionFileFormat) section of the *Amazon CloudFront Developer Guide*. 

   This query uses the ParquetHiveSerDe to read Parquet fields correctly in Athena.

   ```
   CREATE EXTERNAL TABLE `cf_logs_manual_partition_parquet`(
     `date` string, 
     `time` string, 
     `x_edge_location` string, 
     `sc_bytes` string, 
     `c_ip` string, 
     `cs_method` string, 
     `cs_host` string, 
     `cs_uri_stem` string, 
     `sc_status` string, 
     `cs_referer` string, 
     `cs_user_agent` string, 
     `cs_uri_query` string, 
     `cs_cookie` string, 
     `x_edge_result_type` string, 
     `x_edge_request_id` string, 
     `x_host_header` string, 
     `cs_protocol` string, 
     `cs_bytes` string, 
     `time_taken` string, 
     `x_forwarded_for` string, 
     `ssl_protocol` string, 
     `ssl_cipher` string, 
     `x_edge_response_result_type` string, 
     `cs_protocol_version` string, 
     `fle_status` string, 
     `fle_encrypted_fields` string, 
     `c_port` string, 
     `time_to_first_byte` string, 
     `x_edge_detailed_result_type` string, 
     `sc_content_type` string, 
     `sc_content_len` string, 
     `sc_range_start` string, 
     `sc_range_end` string)
   ROW FORMAT SERDE 
     'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
   STORED AS INPUTFORMAT 
     'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
   OUTPUTFORMAT 
     'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
   LOCATION
     's3://amzn-s3-demo-bucket/'
   ```

1. Run the query in the Athena console. After the query completes, Athena registers the `cf_logs_manual_partition_parquet` table, making the data in it ready for you to issue queries.

## Example queries
<a name="query-examples-cloudfront-logs-manual-parquet"></a>

The following query adds up the number of bytes served by CloudFront for January 19, 2025.

```
SELECT sum(cast("sc_bytes" as BIGINT)) as sc
FROM cf_logs_manual_partition_parquet
WHERE "date"='2025-01-19'
```

To eliminate duplicate rows (for example, duplicate empty rows) from the query results, you can use the `SELECT DISTINCT` statement, as in the following example. 

```
SELECT DISTINCT * FROM cf_logs_manual_partition_parquet
```

# Create a table for CloudFront logs in Athena using partition projection with JSON
<a name="create-cloudfront-table-partition-json"></a>

You can reduce query runtime and automate partition management with the Athena partition projection feature. Partition projection automatically adds new partitions as new data arrives, which removes the need for you to add partitions manually by using `ALTER TABLE ADD PARTITION`.

The following example `CREATE TABLE` statement uses partition projection on CloudFront logs from a specified CloudFront distribution, from a start date to the present, for a single AWS Region. After you run the query successfully, you can query the table.

```
CREATE EXTERNAL TABLE `cloudfront_logs_pp`(
  `date` string, 
  `time` string, 
  `x-edge-location` string, 
  `sc-bytes` string, 
  `c-ip` string, 
  `cs-method` string, 
  `cs(host)` string, 
  `cs-uri-stem` string, 
  `sc-status` string, 
  `cs(referer)` string, 
  `cs(user-agent)` string, 
  `cs-uri-query` string, 
  `cs(cookie)` string, 
  `x-edge-result-type` string, 
  `x-edge-request-id` string, 
  `x-host-header` string, 
  `cs-protocol` string, 
  `cs-bytes` string, 
  `time-taken` string, 
  `x-forwarded-for` string, 
  `ssl-protocol` string, 
  `ssl-cipher` string, 
  `x-edge-response-result-type` string, 
  `cs-protocol-version` string, 
  `fle-status` string, 
  `fle-encrypted-fields` string, 
  `c-port` string, 
  `time-to-first-byte` string, 
  `x-edge-detailed-result-type` string, 
  `sc-content-type` string, 
  `sc-content-len` string, 
  `sc-range-start` string, 
  `sc-range-end` string)
  PARTITIONED BY(
         distributionid string,
         year int,
         month int,
         day int,
         hour int )
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 
  'paths'='c-ip,c-port,cs(Cookie),cs(Host),cs(Referer),cs(User-Agent),cs-bytes,cs-method,cs-protocol,cs-protocol-version,cs-uri-query,cs-uri-stem,date,fle-encrypted-fields,fle-status,sc-bytes,sc-content-len,sc-content-type,sc-range-end,sc-range-start,sc-status,ssl-cipher,ssl-protocol,time,time-taken,time-to-first-byte,x-edge-detailed-result-type,x-edge-location,x-edge-request-id,x-edge-response-result-type,x-edge-result-type,x-forwarded-for,x-host-header') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://amzn-s3-demo-bucket/AWSLogs/AWS_ACCOUNT_ID/CloudFront/'
TBLPROPERTIES (
  'projection.distributionid.type'='enum',
  'projection.distributionid.values'='E2Oxxxxxxxxxxx',
  'projection.day.range'='01,31', 
  'projection.day.type'='integer', 
  'projection.day.digits'='2', 
  'projection.enabled'='true', 
  'projection.month.range'='01,12', 
  'projection.month.type'='integer', 
  'projection.month.digits'='2', 
  'projection.year.range'='2025,2026', 
  'projection.year.type'='integer', 
  'projection.hour.range'='00,23',
  'projection.hour.type'='integer',
  'projection.hour.digits'='2',
  'storage.location.template'='s3://amzn-s3-demo-bucket/AWSLogs/AWS_ACCOUNT_ID/CloudFront/${distributionid}/${year}/${month}/${day}/${hour}/')
```

Following are some considerations for the properties used in the previous example.
+ **Table name** – The table name *`cloudfront_logs_pp`* is replaceable. You can change it to any name that you prefer.
+ **Location** – Modify `s3://amzn-s3-demo-bucket/AWSLogs/AWS_ACCOUNT_ID/` to point to your Amazon S3 bucket.
+ **Distribution IDs** – For `projection.distributionid.values`, you can specify multiple distribution IDs if you separate them with commas. For example, *<distributionID1>*, *<distributionID2>*.
+ **Year range** – In `projection.year.range`, you can define the range of years based on your data. For example, you can adjust it to any period, such as *2025*, *2026*.
**Note**  
Including empty partitions, such as those for future dates (example: 2025-2040), can impact query performance. However, partition projection is designed to effectively handle future dates. To maintain optimal performance, ensure that partitions are managed thoughtfully and avoid excessive empty partitions when possible.
+ **Storage location template** – Ensure that you update the `storage.location.template` property correctly, based on your CloudFront partitioning structure and Amazon S3 path.  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/create-cloudfront-table-partition-json.html)

  After you confirm that the CloudFront partitioning structure and S3 structure match the required patterns, update the `storage.location.template` as follows:

  ```
  'storage.location.template'='s3://amzn-s3-demo-bucket/AWSLogs/account_id/CloudFront/${distributionid}/folder2/${year}/${month}/${day}/${hour}/folder3/'
  ```
**Note**  
Proper configuration of the `storage.location.template` is crucial for ensuring correct data storage and retrieval.
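Because the partition keys are projected, queries that filter on `distributionid`, `year`, `month`, `day`, and `hour` scan only the matching Amazon S3 prefixes. The following sketch (the distribution ID and date are placeholder values) counts requests for a single hour:

```
SELECT count(*) AS request_count
FROM cloudfront_logs_pp
WHERE distributionid = 'E2Oxxxxxxxxxxx'
  AND year = 2025
  AND month = 1
  AND day = 15
  AND hour = 10;
```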

# Create a table for CloudFront logs in Athena using partition projection with Parquet
<a name="create-cloudfront-table-partition-parquet"></a>

The following example `CREATE TABLE` statement uses partition projection on CloudFront logs stored in Parquet format, from a specified CloudFront distribution, from a start date to the present, for a single AWS Region. After you run the query successfully, you can query the table.

```
CREATE EXTERNAL TABLE `cloudfront_logs_parquet_pp`(
`date` string, 
`time` string, 
`x_edge_location` string, 
`sc_bytes` string, 
`c_ip` string, 
`cs_method` string, 
`cs_host` string, 
`cs_uri_stem` string, 
`sc_status` string, 
`cs_referer` string, 
`cs_user_agent` string, 
`cs_uri_query` string, 
`cs_cookie` string, 
`x_edge_result_type` string, 
`x_edge_request_id` string, 
`x_host_header` string, 
`cs_protocol` string, 
`cs_bytes` string, 
`time_taken` string, 
`x_forwarded_for` string, 
`ssl_protocol` string, 
`ssl_cipher` string, 
`x_edge_response_result_type` string, 
`cs_protocol_version` string, 
`fle_status` string, 
`fle_encrypted_fields` string, 
`c_port` string, 
`time_to_first_byte` string, 
`x_edge_detailed_result_type` string, 
`sc_content_type` string, 
`sc_content_len` string, 
`sc_range_start` string, 
`sc_range_end` string)
PARTITIONED BY(
 distributionid string,
 year int,
 month int,
 day int,
 hour int )
ROW FORMAT SERDE 
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://amzn-s3-demo-bucket/AWSLogs/AWS_ACCOUNT_ID/CloudFront/'
TBLPROPERTIES (
'projection.distributionid.type'='enum',
'projection.distributionid.values'='E3OK0LPUNWWO3',
'projection.day.range'='01,31',
'projection.day.type'='integer',
'projection.day.digits'='2',
'projection.enabled'='true',
'projection.month.range'='01,12',
'projection.month.type'='integer',
'projection.month.digits'='2',
'projection.year.range'='2019,2025',
'projection.year.type'='integer',
'projection.hour.range'='00,23',
'projection.hour.type'='integer',
'projection.hour.digits'='2',
'storage.location.template'='s3://amzn-s3-demo-bucket/AWSLogs/AWS_ACCOUNT_ID/CloudFront/${distributionid}/${year}/${month}/${day}/${hour}/')
```

Following are some considerations for the properties used in the previous example.
+ **Table name** – The table name *`cloudfront_logs_parquet_pp`* is replaceable. You can change it to any name that you prefer.
+ **Location** – Modify `s3://amzn-s3-demo-bucket/AWSLogs/AWS_ACCOUNT_ID/` to point to your Amazon S3 bucket.
+ **Distribution IDs** – For `projection.distributionid.values`, you can specify multiple distribution IDs if you separate them with commas. For example, *<distributionID1>*, *<distributionID2>*.
+ **Year range** – In `projection.year.range`, you can define the range of years based on your data. For example, you can adjust it to any period, such as *2025*, *2026*.
**Note**  
Including empty partitions, such as those for future dates (example: 2025-2040), can impact query performance. However, partition projection is designed to effectively handle future dates. To maintain optimal performance, ensure that partitions are managed thoughtfully and avoid excessive empty partitions when possible.
+ **Storage location template** – Ensure that you update the `storage.location.template` property correctly, based on your CloudFront partitioning structure and Amazon S3 path.  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/create-cloudfront-table-partition-parquet.html)

  After you confirm that the CloudFront partitioning structure and S3 structure match the required patterns, update the `storage.location.template` as follows:

  ```
  'storage.location.template'='s3://amzn-s3-demo-bucket/AWSLogs/account_id/CloudFront/${distributionid}/folder2/${year}/${month}/${day}/${hour}/folder3/'
  ```
**Note**  
Proper configuration of the `storage.location.template` is crucial for ensuring correct data storage and retrieval.
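As with the JSON example, filtering on the projected partition columns limits the Amazon S3 prefixes that Athena scans. The following sketch (the distribution ID and date are placeholder values) totals the bytes served for one day; `sc_bytes` is declared as a string in the DDL, so it is cast before summing:

```
SELECT sum(cast(sc_bytes AS BIGINT)) AS total_bytes
FROM cloudfront_logs_parquet_pp
WHERE distributionid = 'E3OK0LPUNWWO3'
  AND year = 2025
  AND month = 1
  AND day = 15;
```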

# Create a table for CloudFront real-time logs
<a name="create-cloudfront-table-real-time-logs"></a>

**To create a table for CloudFront real-time log file fields**

1. Copy and paste the following example DDL statement into the Query Editor in the Athena console. The example statement uses the log file fields documented in the [Real-time logs](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/real-time-logs.html) section of the *Amazon CloudFront Developer Guide*. Modify the `LOCATION` for the Amazon S3 bucket that stores your logs. For information about using the Query Editor, see [Get started](getting-started.md).

   This query specifies `ROW FORMAT DELIMITED` and `FIELDS TERMINATED BY '\t'` to indicate that the fields are delimited by tab characters. For `ROW FORMAT DELIMITED`, Athena uses the [LazySimpleSerDe](lazy-simple-serde.md) by default. The column `timestamp` is escaped using backticks (`) because it is a reserved word in Athena. For information, see [Escape reserved keywords in queries](reserved-words.md).

   The following example contains all of the available fields. You can comment out or remove fields that you do not require.

   ```
   CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_real_time_logs ( 
   `timestamp` STRING,
   c_ip STRING,
   time_to_first_byte BIGINT,
   sc_status BIGINT,
   sc_bytes BIGINT,
   cs_method STRING,
   cs_protocol STRING,
   cs_host STRING,
   cs_uri_stem STRING,
   cs_bytes BIGINT,
   x_edge_location STRING,
   x_edge_request_id STRING,
   x_host_header STRING,
   time_taken BIGINT,
   cs_protocol_version STRING,
   c_ip_version STRING,
   cs_user_agent STRING,
   cs_referer STRING,
   cs_cookie STRING,
   cs_uri_query STRING,
   x_edge_response_result_type STRING,
   x_forwarded_for STRING,
   ssl_protocol STRING,
   ssl_cipher STRING,
   x_edge_result_type STRING,
   fle_encrypted_fields STRING,
   fle_status STRING,
   sc_content_type STRING,
   sc_content_len BIGINT,
   sc_range_start STRING,
   sc_range_end STRING,
   c_port BIGINT,
   x_edge_detailed_result_type STRING,
   c_country STRING,
   cs_accept_encoding STRING,
   cs_accept STRING,
   cache_behavior_path_pattern STRING,
   cs_headers STRING,
   cs_header_names STRING,
   cs_headers_count BIGINT,
   primary_distribution_id STRING,
   primary_distribution_dns_name STRING,
   origin_fbl STRING,
   origin_lbl STRING,
   asn STRING
   )
   ROW FORMAT DELIMITED 
   FIELDS TERMINATED BY '\t'
   LOCATION 's3://amzn-s3-demo-bucket/'
   TBLPROPERTIES ( 'skip.header.line.count'='2' )
   ```

1. Run the query in the Athena console. After the query completes, Athena registers the `cloudfront_real_time_logs` table, making the data in it ready for you to issue queries.
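After the table is registered, you can query it like any other Athena table. For example, the following sketch ranks edge locations by request volume:

```
SELECT x_edge_location,
       count(*) AS request_count
FROM cloudfront_real_time_logs
GROUP BY x_edge_location
ORDER BY request_count DESC
LIMIT 10;
```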

# Additional resources
<a name="cloudfront-logs-additional-resources"></a>

For more information about using Athena to query CloudFront logs, see the following posts from the [AWS big data blog](https://aws.amazon.com/blogs/big-data/).

[Easily query AWS service logs using Amazon Athena](https://aws.amazon.com/blogs/big-data/easily-query-aws-service-logs-using-amazon-athena/) (May 29, 2019).

[Analyze your Amazon CloudFront access logs at scale](https://aws.amazon.com/blogs/big-data/analyze-your-amazon-cloudfront-access-logs-at-scale/) (December 21, 2018).

[Build a serverless architecture to analyze Amazon CloudFront access logs using AWS Lambda, Amazon Athena, and Amazon Managed Service for Apache Flink](https://aws.amazon.com/blogs/big-data/build-a-serverless-architecture-to-analyze-amazon-cloudfront-access-logs-using-aws-lambda-amazon-athena-and-amazon-kinesis-analytics/) (May 26, 2017).

# Query AWS CloudTrail logs
<a name="cloudtrail-logs"></a>

AWS CloudTrail is a service that records AWS API calls and events for Amazon Web Services accounts. 

CloudTrail logs include details about any API calls made to your AWS services, including the console. CloudTrail generates encrypted log files and stores them in Amazon S3. For more information, see the [AWS CloudTrail User Guide](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html). 

**Note**  
If you want to perform SQL queries on CloudTrail event information across accounts, regions, and dates, consider using CloudTrail Lake. CloudTrail Lake is an AWS alternative to creating trails that aggregates information from an enterprise into a single, searchable event data store. Instead of using Amazon S3 bucket storage, it stores events in a data lake, which allows richer, faster queries. You can use it to create SQL queries that search events across organizations, regions, and within custom time ranges. Because you perform CloudTrail Lake queries within the CloudTrail console itself, using CloudTrail Lake does not require Athena. For more information, see the [CloudTrail Lake](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-lake.html) documentation.

Using Athena with CloudTrail logs is a powerful way to enhance your analysis of AWS service activity. For example, you can use queries to identify trends and further isolate activity by attributes, such as source IP address or user.

A common application is to use CloudTrail logs to analyze operational activity for security and compliance. For information about a detailed example, see the AWS Big Data Blog post, [Analyze security, compliance, and operational activity using AWS CloudTrail and Amazon Athena](https://aws.amazon.com/blogs/big-data/aws-cloudtrail-and-amazon-athena-dive-deep-to-analyze-security-compliance-and-operational-activity/).

You can use Athena to query these log files directly from Amazon S3, specifying the `LOCATION` of the log files. You can do this in one of two ways:
+ By creating tables for CloudTrail log files directly from the CloudTrail console.
+ By manually creating tables for CloudTrail log files in the Athena console.

**Topics**
+ [Understand CloudTrail logs and Athena tables](create-cloudtrail-table-understanding.md)
+ [Use the CloudTrail console to create an Athena table for CloudTrail logs](create-cloudtrail-table-ct.md)
+ [Create a table for CloudTrail logs in Athena using manual partitioning](create-cloudtrail-table.md)
+ [Create a table for an organization wide trail using manual partitioning](create-cloudtrail-table-org-wide-trail.md)
+ [Create the table for CloudTrail logs in Athena using partition projection](create-cloudtrail-table-partition-projection.md)
+ [Example CloudTrail log queries](query-examples-cloudtrail-logs.md)

# Understand CloudTrail logs and Athena tables
<a name="create-cloudtrail-table-understanding"></a>

Before you begin creating tables, you should understand a little more about CloudTrail and how it stores data. This can help you create the tables that you need, whether you create them from the CloudTrail console or from Athena.

CloudTrail saves logs as JSON text files in compressed gzip format (`*.json.gz`). The location of the log files depends on how you set up trails, the AWS Region or Regions in which you are logging, and other factors. 

For more information about where logs are stored, the JSON structure, and the record file contents, see the following topics in the [AWS CloudTrail User Guide](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html):
+  [Finding your CloudTrail log files](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-find-log-files.html) 
+  [CloudTrail Log File examples](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-log-file-examples.html) 
+  [CloudTrail record contents](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-event-reference-record-contents.html)
+  [CloudTrail event reference](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-event-reference.html) 

To collect logs and save them to Amazon S3, enable CloudTrail from the AWS Management Console. For more information, see [Creating a trail](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-create-a-trail-using-the-console-first-time.html) in the *AWS CloudTrail User Guide*.

# Use the CloudTrail console to create an Athena table for CloudTrail logs
<a name="create-cloudtrail-table-ct"></a>

You can create a non-partitioned Athena table for querying CloudTrail logs directly from the CloudTrail console. Creating an Athena table from the CloudTrail console requires that you be logged in with a role that has sufficient permissions to create tables in Athena.

**Note**  
You cannot use the CloudTrail console to create an Athena table for organization trail logs. Instead, create the table manually using the Athena console so that you can specify the correct storage location. For information about organization trails, see [Creating a trail for an organization](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/creating-trail-organization.html) in the *AWS CloudTrail User Guide*.
+ For information about setting up permissions for Athena, see [Set up, administrative, and programmatic access](setting-up.md).
+ For information about creating a table with partitions, see [Create a table for CloudTrail logs in Athena using manual partitioning](create-cloudtrail-table.md).

**To create an Athena table for a CloudTrail trail using the CloudTrail console**

1. Open the CloudTrail console at [https://console.aws.amazon.com/cloudtrail/](https://console.aws.amazon.com/cloudtrail/).

1. In the navigation pane, choose **Event history**. 

1. Choose **Create Athena table**.  
![\[Choose Create Athena table\]](http://docs.aws.amazon.com/athena/latest/ug/images/cloudtrail-logs-create-athena-table.png)

1. For **Storage location**, use the down arrow to select the Amazon S3 bucket where log files are stored for the trail to query.
**Note**  
To find the name of the bucket that is associated with a trail, choose **Trails** in the CloudTrail navigation pane and view the trail's **S3 bucket** column. To see the Amazon S3 location for the bucket, choose the link for the bucket in the **S3 bucket** column. This opens the Amazon S3 console to the CloudTrail bucket location. 

1. Choose **Create table**. The table is created with a default name that includes the name of the Amazon S3 bucket.

# Create a table for CloudTrail logs in Athena using manual partitioning
<a name="create-cloudtrail-table"></a>

You can manually create tables for CloudTrail log files in the Athena console, and then run queries in Athena.

**To create an Athena table for a CloudTrail trail using the Athena console**

1. Copy and paste the following DDL statement into the Athena console query editor, then modify it according to your requirements. Note that because CloudTrail log files are not an ordered stack trace of public API calls, the fields in the log files do not appear in any specific order.

   ```
   CREATE EXTERNAL TABLE cloudtrail_logs (
   eventversion STRING,
   useridentity STRUCT<
                  type:STRING,
                  principalid:STRING,
                  arn:STRING,
                  accountid:STRING,
                  invokedby:STRING,
                  accesskeyid:STRING,
                  username:STRING,
                  onbehalfof: STRUCT<
                       userid: STRING,
                       identitystorearn: STRING>,
     sessioncontext:STRUCT<
       attributes:STRUCT<
                  mfaauthenticated:STRING,
                  creationdate:STRING>,
       sessionissuer:STRUCT<  
                  type:STRING,
                  principalid:STRING,
                  arn:STRING, 
                  accountid:STRING,
                  username:STRING>,
       ec2roledelivery:string,
       webidfederationdata: STRUCT<
                  federatedprovider: STRING,
                  attributes: map<string,string>>
     >
   >,
   eventtime STRING,
   eventsource STRING,
   eventname STRING,
   awsregion STRING,
   sourceipaddress STRING,
   useragent STRING,
   errorcode STRING,
   errormessage STRING,
   requestparameters STRING,
   responseelements STRING,
   additionaleventdata STRING,
   requestid STRING,
   eventid STRING,
   resources ARRAY<STRUCT<
                  arn:STRING,
                  accountid:STRING,
                  type:STRING>>,
   eventtype STRING,
   apiversion STRING,
   readonly STRING,
   recipientaccountid STRING,
   serviceeventdetails STRING,
   sharedeventid STRING,
   vpcendpointid STRING,
   vpcendpointaccountid STRING,
   eventcategory STRING,
   addendum STRUCT<
     reason:STRING,
     updatedfields:STRING,
     originalrequestid:STRING,
     originaleventid:STRING>,
   sessioncredentialfromconsole STRING,
   edgedevicedetails STRING,
   tlsdetails STRUCT<
     tlsversion:STRING,
     ciphersuite:STRING,
     clientprovidedhostheader:STRING>
   )
   PARTITIONED BY (region string, year string, month string, day string)
   ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
   STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
   LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/Account_ID/';
   ```
**Note**  
We suggest using the `org.apache.hive.hcatalog.data.JsonSerDe` shown in the example. Although a `com.amazon.emr.hive.serde.CloudTrailSerde` exists, it does not currently handle some of the newer CloudTrail fields. 

1. (Optional) Remove any fields not required for your table. If you need to read only a certain set of columns, your table definition can exclude the other columns.

1. Modify `s3://amzn-s3-demo-bucket/AWSLogs/Account_ID/` to point to the Amazon S3 bucket that contains the log data that you want to query. The example uses a `LOCATION` value of logs for a particular account, but you can use the degree of specificity that suits your application. For example:
   + To analyze data from multiple accounts, you can roll back the `LOCATION` specifier to indicate all `AWSLogs` by using `LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/'`.
+ To analyze data from a specific date, account, and Region, use `LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/123456789012/CloudTrail/us-east-1/2016/03/14/'`.
   + To analyze network activity data instead of management events, replace `/CloudTrail/` in the `LOCATION` clause with `/CloudTrail-NetworkActivity/`. 

   Using the highest level in the object hierarchy gives you the greatest flexibility when you query using Athena.

1. Verify that fields are listed correctly. For more information about the full list of fields in a CloudTrail record, see [CloudTrail record contents](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-event-reference-record-contents.html).

   The example `CREATE TABLE` statement in Step 1 uses the [Hive JSON SerDe](hive-json-serde.md). In the example, the fields `requestparameters`, `responseelements`, and `additionaleventdata` are listed as type `STRING`, but their underlying data type in the JSON is `STRUCT`. Therefore, to get data out of these fields, use `JSON_EXTRACT` functions. For more information, see [Extract JSON data from strings](extracting-data-from-JSON.md). For performance improvements, the example partitions the data by AWS Region, year, month, and day.

1. Run the `CREATE TABLE` statement in the Athena console.

1. Use the [ALTER TABLE ADD PARTITION](alter-table-add-partition.md) command to load the partitions so that you can query them, as in the following example.

   ```
   ALTER TABLE table_name ADD 
      PARTITION (region='us-east-1',
                 year='2019',
                 month='02',
                 day='01')
      LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/Account_ID/CloudTrail/us-east-1/2019/02/01/'
   ```
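
   With the table created and a partition loaded, you can constrain queries to the partition columns and extract values from the JSON-encoded `STRING` fields. The following sketch assumes the partition added above; the `bucketName` request parameter is shown as a hypothetical key to extract.

   ```
   SELECT
       eventtime,
       eventname,
       json_extract_scalar(requestparameters, '$.bucketName') AS bucket_name
   FROM cloudtrail_logs
   WHERE region = 'us-east-1'
     AND year = '2019'
     AND month = '02'
     AND day = '01'
     AND eventsource = 's3.amazonaws.com'
   LIMIT 10;
   ```

   Restricting the `WHERE` clause to the partition columns limits the Amazon S3 locations that Athena scans, which reduces both query time and cost.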

# Create a table for an organization-wide trail using manual partitioning
<a name="create-cloudtrail-table-org-wide-trail"></a>

To create a table for organization-wide CloudTrail log files in Athena, follow the steps in [Create a table for CloudTrail logs in Athena using manual partitioning](create-cloudtrail-table.md), but make the modifications noted in the following procedure.

**To create an Athena table for organization-wide CloudTrail logs**

1. In the `CREATE TABLE` statement, modify the `LOCATION` clause to include the organization ID, as in the following example:

   ```
   LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/organization_id/'
   ```

1. In the `PARTITIONED BY` clause, add an entry for the account ID as a string, as in the following example:

   ```
   PARTITIONED BY (account string, region string, year string, month string, day string)
   ```

   The following example shows the combined result:

   ```
   ...
   
   PARTITIONED BY (account string, region string, year string, month string, day string) 
   ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
   STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
   LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/organization_id/'
   ```

1. In the `ALTER TABLE` statement `ADD PARTITION` clause, include the account ID, as in the following example:

   ```
   ALTER TABLE table_name ADD
   PARTITION (account='111122223333',
   region='us-east-1',
   year='2022',
   month='08',
   day='08')
   ```

1. In the `ALTER TABLE` statement `LOCATION` clause, include the organization ID, the account ID, and the partition that you want to add, as in the following example:

   ```
   LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/organization_id/Account_ID/CloudTrail/us-east-1/2022/08/08/'
   ```

   The following example `ALTER TABLE` statement shows the combined result:

   ```
   ALTER TABLE table_name ADD
   PARTITION (account='111122223333',
   region='us-east-1',
   year='2022',
   month='08',
   day='08')
   LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/organization_id/111122223333/CloudTrail/us-east-1/2022/08/08/'
   ```
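
   After the partitions are loaded, you can filter on the account partition directly. The following is a sketch that uses the example partition values above:

   ```
   SELECT
       useridentity.arn,
       eventname,
       eventtime
   FROM table_name
   WHERE account = '111122223333'
     AND region = 'us-east-1'
     AND year = '2022'
     AND month = '08'
   LIMIT 100;
   ```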

Note that, in a large organization, using this method to manually add and maintain a partition for each account ID can be cumbersome. In that scenario, consider using CloudTrail Lake rather than Athena. CloudTrail Lake offers the following advantages:
+ Automatically aggregates logs across an entire organization
+ Does not require setting up or maintaining partitions or an Athena table
+ Queries are run directly in the CloudTrail console
+ Uses a SQL-compatible query language

For more information, see [Working with AWS CloudTrail Lake](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-lake.html) in the *AWS CloudTrail User Guide*. 

# Create the table for CloudTrail logs in Athena using partition projection
<a name="create-cloudtrail-table-partition-projection"></a>

Because CloudTrail logs have a known structure whose partition scheme you can specify in advance, you can reduce query runtime and automate partition management by using the Athena partition projection feature. Partition projection automatically adds new partitions as new data is added. This removes the need for you to manually add partitions by using `ALTER TABLE ADD PARTITION`. 

The following example `CREATE TABLE` statement uses partition projection on CloudTrail logs from a specified date until the present for a single AWS Region. In the `LOCATION` and `storage.location.template` clauses, replace the *bucket*, *account-id*, and *aws-region* placeholders with your own values, making sure that the values in the two clauses match. For `projection.timestamp.range`, replace *2020*/*01*/*01* with the starting date that you want to use. After you run the query successfully, you can query the table. You do not have to run `ALTER TABLE ADD PARTITION` to load the partitions.

```
CREATE EXTERNAL TABLE cloudtrail_logs_pp(
    eventversion STRING,
    useridentity STRUCT<
        type: STRING,
        principalid: STRING,
        arn: STRING,
        accountid: STRING,
        invokedby: STRING,
        accesskeyid: STRING,
        username: STRING,
        onbehalfof: STRUCT<
             userid: STRING,
             identitystorearn: STRING>,
        sessioncontext: STRUCT<
            attributes: STRUCT<
                mfaauthenticated: STRING,
                creationdate: STRING>,
            sessionissuer: STRUCT<
                type: STRING,
                principalid: STRING,
                arn: STRING,
                accountid: STRING,
                username: STRING>,
            ec2roledelivery:string,
            webidfederationdata: STRUCT<
                federatedprovider: STRING,
                attributes: map<string,string>>
        >
    >,
    eventtime STRING,
    eventsource STRING,
    eventname STRING,
    awsregion STRING,
    sourceipaddress STRING,
    useragent STRING,
    errorcode STRING,
    errormessage STRING,
    requestparameters STRING,
    responseelements STRING,
    additionaleventdata STRING,
    requestid STRING,
    eventid STRING,
    readonly STRING,
    resources ARRAY<STRUCT<
        arn: STRING,
        accountid: STRING,
        type: STRING>>,
    eventtype STRING,
    apiversion STRING,
    recipientaccountid STRING,
    serviceeventdetails STRING,
    sharedeventid STRING,
    vpcendpointid STRING,
    vpcendpointaccountid STRING,
    eventcategory STRING,
    addendum STRUCT<
      reason:STRING,
      updatedfields:STRING,
      originalrequestid:STRING,
      originaleventid:STRING>,
    sessioncredentialfromconsole STRING,
    edgedevicedetails STRING,
    tlsdetails STRUCT<
      tlsversion:STRING,
      ciphersuite:STRING,
      clientprovidedhostheader:STRING>
  )
PARTITIONED BY (
   `timestamp` string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://amzn-s3-demo-bucket/AWSLogs/account-id/CloudTrail/aws-region'
TBLPROPERTIES (
  'projection.enabled'='true', 
  'projection.timestamp.format'='yyyy/MM/dd', 
  'projection.timestamp.interval'='1', 
  'projection.timestamp.interval.unit'='DAYS', 
  'projection.timestamp.range'='2020/01/01,NOW', 
  'projection.timestamp.type'='date', 
  'storage.location.template'='s3://amzn-s3-demo-bucket/AWSLogs/account-id/CloudTrail/aws-region/${timestamp}')
```
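
With the table created, you can filter on the `timestamp` partition column directly, and Athena computes the matching Amazon S3 locations for you. The following sketch queries a single day; replace the date with one that falls inside your projected range.

```
SELECT
    eventsource,
    eventname,
    sourceipaddress,
    eventtime
FROM cloudtrail_logs_pp
WHERE "timestamp" = '2023/06/15'
LIMIT 100;
```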

For more information about partition projection, see [Use partition projection with Amazon Athena](partition-projection.md).

# Example CloudTrail log queries
<a name="query-examples-cloudtrail-logs"></a>

The following example query returns all anonymous (unsigned) requests from the table created for CloudTrail event logs. This query selects those requests where `useridentity.accountid` is anonymous and `useridentity.arn` is not specified:

```
SELECT *
FROM cloudtrail_logs
WHERE 
    eventsource = 's3.amazonaws.com' AND 
    eventname in ('GetObject') AND 
    useridentity.accountid = 'anonymous' AND 
    useridentity.arn IS NULL AND
    requestparameters LIKE '%[your bucket name]%';
```

For more information, see the AWS Big Data blog post [Analyze security, compliance, and operational activity using AWS CloudTrail and Amazon Athena](https://aws.amazon.com/blogs/big-data/aws-cloudtrail-and-amazon-athena-dive-deep-to-analyze-security-compliance-and-operational-activity/).

## Query nested fields in CloudTrail logs
<a name="cloudtrail-logs-nested-fields"></a>

Because the `userIdentity` and `resources` fields are nested data types, querying them requires special treatment.

The `userIdentity` object consists of nested `STRUCT` types. These can be queried using a dot to separate the fields, as in the following example:

```
SELECT 
    eventsource, 
    eventname,
    useridentity.sessioncontext.attributes.creationdate,
    useridentity.sessioncontext.sessionissuer.arn
FROM cloudtrail_logs
WHERE useridentity.sessioncontext.sessionissuer.arn IS NOT NULL
ORDER BY eventsource, eventname
LIMIT 10
```

The `resources` field is an array of `STRUCT` objects. For these arrays, use `CROSS JOIN UNNEST` to unnest the array so that you can query its objects.

The following example returns all rows where the resource ARN ends in `example/datafile.txt`. For readability, the [replace](https://prestodb.io/docs/current/functions/string.html#replace) function removes the initial `arn:aws:s3:::` substring from the ARN.

```
SELECT 
    awsregion,
    replace(unnested.resources_entry.ARN,'arn:aws:s3:::') as s3_resource,
    eventname,
    eventtime,
    useragent
FROM cloudtrail_logs t
CROSS JOIN UNNEST(t.resources) unnested (resources_entry)
WHERE unnested.resources_entry.ARN LIKE '%example/datafile.txt'
ORDER BY eventtime
```

The following example queries for `DeleteBucket` events. The query extracts the name of the bucket and the account ID to which the bucket belongs from the `resources` object.

```
SELECT 
    awsregion,
    replace(unnested.resources_entry.ARN,'arn:aws:s3:::') as deleted_bucket,
    eventtime AS time_deleted,
    useridentity.username, 
    unnested.resources_entry.accountid as bucket_acct_id 
FROM cloudtrail_logs t
CROSS JOIN UNNEST(t.resources) unnested (resources_entry)
WHERE eventname = 'DeleteBucket'
ORDER BY eventtime
```

For more information about unnesting, see [Filter arrays](filtering-arrays.md).

## Tips for querying CloudTrail logs
<a name="tips-for-querying-cloudtrail-logs"></a>

Consider the following when exploring CloudTrail log data:
+ Before querying the logs, verify that your logs table looks the same as the one in [Create a table for CloudTrail logs in Athena using manual partitioning](create-cloudtrail-table.md). If it does not, delete the existing table using the following command: `DROP TABLE cloudtrail_logs`.
+ After you drop the existing table, re-create it. For more information, see [Create a table for CloudTrail logs in Athena using manual partitioning](create-cloudtrail-table.md).

  Verify that fields in your Athena query are listed correctly. For information about the full list of fields in a CloudTrail record, see [CloudTrail record contents](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-event-reference-record-contents.html). 

  If your query includes fields in JSON formats, such as `STRUCT`, extract data from JSON. For more information, see [Extract JSON data from strings](extracting-data-from-JSON.md). 

  The following are some suggestions for querying your CloudTrail table:
+ Start by looking at which users called which API operations and from which source IP addresses.
+ Use the following basic SQL query as your template. Paste the query into the Athena console query editor, and then run it.

  ```
  SELECT
   useridentity.arn,
   eventname,
   sourceipaddress,
   eventtime
  FROM cloudtrail_logs
  LIMIT 100;
  ```
+ Modify the query to further explore your data.
+ To improve performance, include the `LIMIT` clause to return a specified subset of rows.
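
For example, the following variation on the template counts calls per API operation and source IP address, which can quickly surface unusual activity. The `cloudtrail_logs` table is the one created earlier in this section.

```
SELECT
    eventname,
    sourceipaddress,
    count(*) AS call_count
FROM cloudtrail_logs
GROUP BY eventname, sourceipaddress
ORDER BY call_count DESC
LIMIT 25;
```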

# Query Amazon EMR logs
<a name="emr-logs"></a>

Amazon EMR and big data applications that run on Amazon EMR produce log files. Log files are written to the [primary node](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-master-core-task-nodes.html), and you can also configure Amazon EMR to archive log files to Amazon S3 automatically. You can use Amazon Athena to query these logs to identify events and trends for applications and clusters. For more information about the types of log files in Amazon EMR and saving them to Amazon S3, see [View log files](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html) in the *Amazon EMR Management Guide*.

**Topics**
+ [Create and query a basic table based on Amazon EMR log files](emr-create-table.md)
+ [Create and query a partitioned table based on Amazon EMR logs](emr-create-table-partitioned.md)

# Create and query a basic table based on Amazon EMR log files
<a name="emr-create-table"></a>

The following example creates a basic table, `myemrlogs`, based on log files saved to `s3://aws-logs-123456789012-us-west-2/elasticmapreduce/j-2ABCDE34F5GH6/`. The Amazon S3 location used in the examples below reflects the pattern of the default log location for an EMR cluster created by Amazon Web Services account *123456789012* in Region *us-west-2*. If you use a custom location, the pattern is `s3://amzn-s3-demo-bucket/ClusterID`.

For information about creating a partitioned table to potentially improve query performance and reduce data transfer, see [Create and query a partitioned table based on Amazon EMR logs](emr-create-table-partitioned.md).

```
CREATE EXTERNAL TABLE `myemrlogs`(
  `data` string COMMENT 'from deserializer')
ROW FORMAT DELIMITED  
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://aws-logs-123456789012-us-west-2/elasticmapreduce/j-2ABCDE34F5GH6'
```

## Example queries
<a name="emr-example-queries-basic"></a>

The following example queries can be run on the `myemrlogs` table created by the previous example.

**Example – Query step logs for occurrences of ERROR, WARN, INFO, EXCEPTION, FATAL, or DEBUG**  

```
SELECT data,
        "$PATH"
FROM "default"."myemrlogs"
WHERE regexp_like("$PATH",'s-86URH188Z6B1')
        AND regexp_like(data, 'ERROR|WARN|INFO|EXCEPTION|FATAL|DEBUG') limit 100;
```

**Example – Query a specific instance log, i-00b3c0a839ece0a9c, for ERROR, WARN, INFO, EXCEPTION, FATAL, or DEBUG**  

```
SELECT "data",
        "$PATH" AS filepath
FROM "default"."myemrlogs"
WHERE regexp_like("$PATH",'i-00b3c0a839ece0a9c')
        AND regexp_like("$PATH",'state')
        AND regexp_like(data, 'ERROR|WARN|INFO|EXCEPTION|FATAL|DEBUG') limit 100;
```

**Example – Query presto application logs for ERROR, WARN, INFO, EXCEPTION, FATAL, or DEBUG**  

```
SELECT "data",
        "$PATH" AS filepath
FROM "default"."myemrlogs"
WHERE regexp_like("$PATH",'presto')
        AND regexp_like(data, 'ERROR|WARN|INFO|EXCEPTION|FATAL|DEBUG') limit 100;
```

**Example – Query Namenode application logs for ERROR, WARN, INFO, EXCEPTION, FATAL, or DEBUG**  

```
SELECT "data",
        "$PATH" AS filepath
FROM "default"."myemrlogs"
WHERE regexp_like("$PATH",'namenode')
        AND regexp_like(data, 'ERROR|WARN|INFO|EXCEPTION|FATAL|DEBUG') limit 100;
```

**Example – Query all logs by date and hour for ERROR, WARN, INFO, EXCEPTION, FATAL, or DEBUG**  

```
SELECT distinct("$PATH") AS filepath
FROM "default"."myemrlogs"
WHERE regexp_like("$PATH",'2019-07-23-10')
        AND regexp_like(data, 'ERROR|WARN|INFO|EXCEPTION|FATAL|DEBUG') limit 100;
```

# Create and query a partitioned table based on Amazon EMR logs
<a name="emr-create-table-partitioned"></a>

These examples use the same log location to create an Athena table, but the table is partitioned, and a partition is then created for each log location. For more information, see [Partition your data](partitions.md).

The following query creates the partitioned table named `mypartitionedemrlogs`:

```
CREATE EXTERNAL TABLE `mypartitionedemrlogs`(
  `data` string COMMENT 'from deserializer')
 partitioned by (logtype string)
ROW FORMAT DELIMITED  
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://aws-logs-123456789012-us-west-2/elasticmapreduce/j-2ABCDE34F5GH6'
```

The following query statements then create table partitions based on sub-directories for different log types that Amazon EMR creates in Amazon S3:

```
ALTER TABLE mypartitionedemrlogs ADD
     PARTITION (logtype='containers')
     LOCATION 's3://aws-logs-123456789012-us-west-2/elasticmapreduce/j-2ABCDE34F5GH6/containers/'
```

```
ALTER TABLE mypartitionedemrlogs ADD
     PARTITION (logtype='hadoop-mapreduce')
     LOCATION 's3://aws-logs-123456789012-us-west-2/elasticmapreduce/j-2ABCDE34F5GH6/hadoop-mapreduce/'
```

```
ALTER TABLE mypartitionedemrlogs ADD
     PARTITION (logtype='hadoop-state-pusher')
     LOCATION 's3://aws-logs-123456789012-us-west-2/elasticmapreduce/j-2ABCDE34F5GH6/hadoop-state-pusher/'
```

```
ALTER TABLE mypartitionedemrlogs ADD
     PARTITION (logtype='node')
     LOCATION 's3://aws-logs-123456789012-us-west-2/elasticmapreduce/j-2ABCDE34F5GH6/node/'
```

```
ALTER TABLE mypartitionedemrlogs ADD
     PARTITION (logtype='steps')
     LOCATION 's3://aws-logs-123456789012-us-west-2/elasticmapreduce/j-2ABCDE34F5GH6/steps/'
```

After you create the partitions, you can run a `SHOW PARTITIONS` query on the table to confirm:

```
SHOW PARTITIONS mypartitionedemrlogs;
```

## Example queries
<a name="emr-example-queries-partitioned"></a>

The following examples demonstrate queries for specific log entries using the table and partitions created by the previous examples.

**Example – Querying application `application_1561661818238_0002` logs in the containers partition for ERROR or WARN**  

```
SELECT data,
        "$PATH"
FROM "default"."mypartitionedemrlogs"
WHERE logtype='containers'
        AND regexp_like("$PATH",'application_1561661818238_0002')
        AND regexp_like(data, 'ERROR|WARN') limit 100;
```

**Example – Querying the hadoop-mapreduce partition for job `job_1561661818238_0004` and failed reduces**  

```
SELECT data,
        "$PATH"
FROM "default"."mypartitionedemrlogs"
WHERE logtype='hadoop-mapreduce'
        AND regexp_like(data,'job_1561661818238_0004|Failed Reduces') limit 100;
```

**Example – Querying Hive logs in the node partition for query ID 056e0609-33e1-4611-956c-7a31b42d2663**  

```
SELECT data,
        "$PATH"
FROM "default"."mypartitionedemrlogs"
WHERE logtype='node'
        AND regexp_like("$PATH",'hive')
        AND regexp_like(data,'056e0609-33e1-4611-956c-7a31b42d2663') limit 100;
```

**Example – Querying resourcemanager logs in the node partition for application `1567660019320_0001_01_000001`**  

```
SELECT data,
        "$PATH"
FROM "default"."mypartitionedemrlogs"
WHERE logtype='node'
        AND regexp_like(data,'resourcemanager')
        AND regexp_like(data,'1567660019320_0001_01_000001') limit 100;
```

# Query AWS Global Accelerator flow logs
<a name="querying-global-accelerator-flow-logs"></a>

You can use AWS Global Accelerator to create accelerators that direct network traffic to optimal endpoints over the AWS global network. For more information about Global Accelerator, see [What is AWS Global Accelerator](https://docs.aws.amazon.com/global-accelerator/latest/dg/what-is-global-accelerator.html).

Global Accelerator flow logs enable you to capture information about the IP address traffic going to and from network interfaces in your accelerators. Flow log data is published to Amazon S3, where you can retrieve and view your data. For more information, see [Flow logs in AWS Global Accelerator](https://docs.aws.amazon.com/global-accelerator/latest/dg/monitoring-global-accelerator.flow-logs.html).

You can use Athena to query your Global Accelerator flow logs by creating a table that specifies their location in Amazon S3.

**To create the table for Global Accelerator flow logs**

1. Copy and paste the following DDL statement into the Athena console. This query specifies `ROW FORMAT DELIMITED` and omits specifying a [SerDe](serde-reference.md), which means that the query uses the [`LazySimpleSerDe`](lazy-simple-serde.md). In this query, fields are terminated by a space.

   ```
   CREATE EXTERNAL TABLE IF NOT EXISTS aga_flow_logs (
     version string,
     account string,
     acceleratorid string,
     clientip string,
     clientport int,
     gip string,
     gipport int,
     endpointip string,
     endpointport int,
     protocol string,
     ipaddresstype string,
     numpackets bigint,
     numbytes int,
     starttime int,
     endtime int,
     action string,
     logstatus string,
     agasourceip string,
     agasourceport int,
     endpointregion string,
     agaregion string,
     direction string,
     vpc_id string,
     reject_reason string
   )
   PARTITIONED BY (dt string)
   ROW FORMAT DELIMITED
   FIELDS TERMINATED BY ' '
   LOCATION 's3://amzn-s3-demo-bucket/prefix/AWSLogs/account_id/globalaccelerator/region/'
   TBLPROPERTIES ("skip.header.line.count"="1");
   ```

1. Modify the `LOCATION` value to point to the Amazon S3 bucket that contains your log data.

   ```
   's3://amzn-s3-demo-bucket/prefix/AWSLogs/account_id/globalaccelerator/region_code/'
   ```

1. Run the query in the Athena console. After the query completes, Athena registers the `aga_flow_logs` table, making the data in it available for queries.

1. Create partitions to read the data, as in the following sample query. The query creates a single partition for a specified date. Replace the placeholders for date and location.

   ```
   ALTER TABLE aga_flow_logs
   ADD PARTITION (dt='YYYY-MM-dd')
   LOCATION 's3://amzn-s3-demo-bucket/prefix/AWSLogs/account_id/globalaccelerator/region_code/YYYY/MM/dd';
   ```
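
   After you add a partition, include the `dt` partition column in the `WHERE` clause so that Athena scans only that day's data. The following is a sketch that assumes a partition exists for the specified date:

   ```
   SELECT
       clientip,
       endpointip,
       protocol,
       action
   FROM aga_flow_logs
   WHERE dt = '2023-06-15'
   LIMIT 100;
   ```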

## Example queries for AWS Global Accelerator flow logs
<a name="querying-global-accelerator-flow-logs-examples"></a>

**Example – List the requests that pass through a specific edge location**  
The following example query lists requests that passed through the LHR edge location. Use the `LIMIT` clause to limit the number of logs to query at one time.  

```
SELECT 
  clientip, 
  agaregion, 
  protocol, 
  action 
FROM 
  aga_flow_logs 
WHERE 
  agaregion LIKE 'LHR%' 
LIMIT 
  100;
```

**Example – List the endpoint IP addresses that receive the most HTTPS requests**  
To see which endpoint IP addresses are receiving the highest number of HTTPS requests, use the following query. This query totals the packets received on HTTPS port 443, groups them by destination IP address, and returns the top 10 addresses.  

```
SELECT 
  SUM(numpackets) AS packetcount, 
  endpointip 
FROM 
  aga_flow_logs 
WHERE 
  endpointport = 443 
GROUP BY 
  endpointip 
ORDER BY 
  packetcount DESC 
LIMIT 
  10;
```

# Query Amazon GuardDuty findings
<a name="querying-guardduty"></a>

[Amazon GuardDuty](https://aws.amazon.com/guardduty/) is a security monitoring service that helps you identify unexpected and potentially unauthorized or malicious activity in your AWS environment. When GuardDuty detects such activity, it generates security [findings](https://docs.aws.amazon.com/guardduty/latest/ug/guardduty_findings.html) that you can export to Amazon S3 for storage and analysis. After you export your findings to Amazon S3, you can use Athena to query them. This topic shows how to create a table in Athena for your GuardDuty findings and query them.

For more information about Amazon GuardDuty, see the [Amazon GuardDuty User Guide](https://docs.aws.amazon.com/guardduty/latest/ug/).

## Prerequisites
<a name="querying-guardduty-prerequisites"></a>
+ Enable the GuardDuty feature for exporting findings to Amazon S3. For steps, see [Exporting findings](https://docs.aws.amazon.com/guardduty/latest/ug/guardduty_exportfindings.html) in the Amazon GuardDuty User Guide.

## Create a table in Athena for GuardDuty findings
<a name="querying-guardduty-creating-a-table-in-athena-for-guardduty-findings"></a>

To query your GuardDuty findings from Athena, you must create a table for them.

**To create a table in Athena for GuardDuty findings**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. Paste the following DDL statement into the Athena console. Modify the values in `LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/account-id/GuardDuty/'` to point to your GuardDuty findings in Amazon S3.

   ```
   CREATE EXTERNAL TABLE `gd_logs` (
     `schemaversion` string,
     `accountid` string,
     `region` string,
     `partition` string,
     `id` string,
     `arn` string,
     `type` string,
     `resource` string,
     `service` string,
     `severity` string,
     `createdat` string,
     `updatedat` string,
     `title` string,
     `description` string)
   ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/account-id/GuardDuty/'
    TBLPROPERTIES ('has_encrypted_data'='true')
   ```
**Note**  
The SerDe expects each JSON document to be on a single line of text with no line termination characters separating the fields in the record. If the JSON text is in pretty print format, you may receive an error message like `HIVE_CURSOR_ERROR: Row is not a valid JSON Object` or `HIVE_CURSOR_ERROR: JsonParseException: Unexpected end-of-input: expected close marker for OBJECT` when you attempt to query the table after you create it. For more information, see [JSON Data Files](https://github.com/rcongiu/Hive-JSON-Serde#json-data-files) in the OpenX SerDe documentation on GitHub. 

1. Run the query in the Athena console to register the `gd_logs` table. When the query completes, the findings are ready for you to query from Athena.

## Example queries
<a name="querying-guardduty-examples"></a>

The following examples show how to query GuardDuty findings from Athena.

**Example – DNS data exfiltration**  
The following query returns information about Amazon EC2 instances that might be exfiltrating data through DNS queries.  

```
SELECT
    title,
    severity,
    type,
    id AS FindingID,
    accountid,
    region,
    createdat,
    updatedat,
    json_extract_scalar(service, '$.count') AS Count,
    json_extract_scalar(resource, '$.instancedetails.instanceid') AS InstanceID,
    json_extract_scalar(service, '$.action.actiontype') AS DNS_ActionType,
    json_extract_scalar(service, '$.action.dnsrequestaction.domain') AS DomainName,
    json_extract_scalar(service, '$.action.dnsrequestaction.protocol') AS protocol,
    json_extract_scalar(service, '$.action.dnsrequestaction.blocked') AS blocked
FROM gd_logs
WHERE type = 'Trojan:EC2/DNSDataExfiltration'
ORDER BY severity DESC
```

**Example – Unauthorized IAM user access**  
The following query returns all `UnauthorizedAccess:IAMUser` finding types for an IAM principal across all Regions.

```
SELECT title,
         severity,
         type,
         id,
         accountid,
         region,
         createdat,
         updatedat,
         json_extract_scalar(service, '$.count') AS Count, 
         json_extract_scalar(resource, '$.accesskeydetails.username') AS IAMPrincipal, 
         json_extract_scalar(service,'$.action.awsapicallaction.api') AS APIActionCalled
FROM gd_logs
WHERE type LIKE '%UnauthorizedAccess:IAMUser%' 
ORDER BY severity DESC;
```

## Tips for querying GuardDuty findings
<a name="querying-guardduty-tips"></a>

When you create your query, keep the following points in mind.
+ To extract data from nested JSON fields, use the Presto `json_extract` or `json_extract_scalar` functions. For more information, see [Extract JSON data from strings](extracting-data-from-JSON.md).
+ Make sure that all characters in the JSON fields are lowercase.
+  For information about downloading query results, see [Download query results files using the Athena console](saving-query-results.md).
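
To see what the `json_extract_scalar` calls in the preceding examples return, you can sketch the behavior locally. The following Python sketch mimics `json_extract_scalar` for simple `$.a.b.c` paths on a hypothetical `service` value (all field values are invented for illustration):

```python
import json

# Hypothetical GuardDuty "service" field, stored as a JSON string in the table
service = json.dumps({
    "count": "4",
    "action": {
        "actiontype": "DNS_REQUEST",
        "dnsrequestaction": {"domain": "example.com", "protocol": "UDP", "blocked": "false"},
    },
})

def json_extract_scalar(document, path):
    """Rough local equivalent of Athena's json_extract_scalar for simple
    '$.a.b.c' paths: walk the keys and return the value as a string."""
    value = json.loads(document)
    for key in path.lstrip("$.").split("."):
        value = value[key]
    return str(value)

print(json_extract_scalar(service, "$.count"))                           # 4
print(json_extract_scalar(service, "$.action.dnsrequestaction.domain"))  # example.com
```

The real Athena function supports richer JSONPath expressions; this sketch only covers the dotted-key paths used in the example queries.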

# Query AWS Network Firewall logs
<a name="querying-network-firewall-logs"></a>

AWS Network Firewall is a managed service that you can use to deploy essential network protections for your Amazon Virtual Private Clouds (VPCs). AWS Network Firewall works together with AWS Firewall Manager, so you can build policies based on AWS Network Firewall rules and then centrally apply those policies across your VPCs and accounts. For more information about AWS Network Firewall, see [AWS Network Firewall](https://aws.amazon.com/network-firewall/).

You can configure AWS Network Firewall logging for traffic that you forward to your firewall's stateful rules engine. Logging gives you detailed information about network traffic, including the time that the stateful engine received a packet, detailed information about the packet, and any stateful rule action taken against the packet. The logs are published to the log destination that you've configured, where you can retrieve and view them. For more information, see [Logging network traffic from AWS Network Firewall](https://docs.aws.amazon.com/network-firewall/latest/developerguide/firewall-logging.html) in the *AWS Network Firewall Developer Guide*.

**Topics**
+ [Create and query a table for alert logs](querying-network-firewall-logs-sample-alert-logs-table.md)
+ [Create and query a table for netflow logs](querying-network-firewall-logs-sample-netflow-logs-table.md)

# Create and query a table for alert logs
<a name="querying-network-firewall-logs-sample-alert-logs-table"></a>

1. Modify the following sample DDL statement to conform to the structure of your alert log. You may need to update the statement to include the columns for the latest version of the logs. For more information, see [Contents of a firewall log](https://docs.aws.amazon.com/network-firewall/latest/developerguide/firewall-logging.html#firewall-logging-contents) in the *AWS Network Firewall Developer Guide*.

   ```
   CREATE EXTERNAL TABLE network_firewall_alert_logs (
     firewall_name string,
     availability_zone string,
     event_timestamp string,
     event struct<
       timestamp:string,
       flow_id:bigint,
       event_type:string,
       src_ip:string,
       src_port:int,
       dest_ip:string,
       dest_port:int,
       proto:string,
       app_proto:string,
       sni:string,
       tls_inspected:boolean,
       tls_error:struct<
         error_message:string>,
       revocation_check:struct<
         leaf_cert_fpr:string,
         status:string,
         action:string>,
       alert:struct<
         alert_id:string,
         alert_type:string,
         action:string,
         signature_id:int,
         rev:int,
         signature:string,
         category:string,
         severity:int,
         rule_name:string,
         alert_name:string,
         alert_severity:string,
         alert_description:string,
         file_name:string,
         file_hash:string,
         packet_capture:string,
         reference_links:array<string>
       >, 
       src_country:string, 
       dest_country:string, 
       src_hostname:string, 
       dest_hostname:string, 
       user_agent:string, 
       url:string
      >
   )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://amzn-s3-demo-bucket/path_to_alert_logs_folder/';
   ```

1. Modify the `LOCATION` clause to specify the folder for your logs in Amazon S3.

1. Run your `CREATE TABLE` query in the Athena query editor. After the query completes, Athena registers the `network_firewall_alert_logs` table, making the data that it points to ready for queries.

## Example query
<a name="querying-network-firewall-logs-alert-log-sample-query"></a>

The sample alert log query in this section filters for events in which TLS inspection was performed and that have alerts with a severity level of 2 or higher.

The query uses aliases to create output column headings that show the `struct` that the column belongs to. For example, the column heading for the `event.alert.category` field is `event_alert_category` instead of just `category`. To customize the column names further, you can modify the aliases to suit your preferences. For example, you can use underscores or other separators to delimit the `struct` names and field names. 

Remember to modify column names and `struct` references based on your table definition and on the fields that you want in the query result.

```
SELECT 
  firewall_name,
  availability_zone,
  event_timestamp,
  event.timestamp AS event_struct_timestamp,
  event.flow_id AS event_flow_id,
  event.event_type AS event_type,
  event.src_ip AS event_src_ip,
  event.src_port AS event_src_port,
  event.dest_ip AS event_dest_ip,
  event.dest_port AS event_dest_port,
  event.proto AS event_proto,
  event.app_proto AS event_app_proto,
  event.sni AS event_sni,
  event.tls_inspected AS event_tls_inspected,
  event.tls_error.error_message AS event_tls_error_message,
  event.revocation_check.leaf_cert_fpr AS event_revocation_leaf_cert,
  event.revocation_check.status AS event_revocation_check_status,
  event.revocation_check.action AS event_revocation_check_action,
  event.alert.alert_id AS event_alert_alert_id,
  event.alert.alert_type AS event_alert_alert_type,
  event.alert.action AS event_alert_action,
  event.alert.signature_id AS event_alert_signature_id,
  event.alert.rev AS event_alert_rev,
  event.alert.signature AS event_alert_signature,
  event.alert.category AS event_alert_category,
  event.alert.severity AS event_alert_severity,
  event.alert.rule_name AS event_alert_rule_name,
  event.alert.alert_name AS event_alert_alert_name,
  event.alert.alert_severity AS event_alert_alert_severity,
  event.alert.alert_description AS event_alert_alert_description,
  event.alert.file_name AS event_alert_file_name,
  event.alert.file_hash AS event_alert_file_hash,
  event.alert.packet_capture AS event_alert_packet_capture,
  event.alert.reference_links AS event_alert_reference_links,
  event.src_country AS event_src_country,
  event.dest_country AS event_dest_country,
  event.src_hostname AS event_src_hostname,
  event.dest_hostname AS event_dest_hostname,
  event.user_agent AS event_user_agent,
  event.url AS event_url
FROM 
  network_firewall_alert_logs 
WHERE 
  event.alert.severity >= 2
  AND event.tls_inspected = true 
LIMIT 10;
```

# Create and query a table for netflow logs
<a name="querying-network-firewall-logs-sample-netflow-logs-table"></a>

1. Modify the following sample DDL statement to conform to the structure of your netflow logs. You may need to update the statement to include the columns for the latest version of the logs. For more information, see [Contents of a firewall log](https://docs.aws.amazon.com/network-firewall/latest/developerguide/firewall-logging.html#firewall-logging-contents) in the *AWS Network Firewall Developer Guide*.

   ```
   CREATE EXTERNAL TABLE network_firewall_netflow_logs (
     firewall_name string,
     availability_zone string,
     event_timestamp string,
     event struct<
       timestamp:string,
       flow_id:bigint,
       event_type:string,
       src_ip:string,
       src_port:int,
       dest_ip:string,
       dest_port:int,
       proto:string,
       app_proto:string,
       tls_inspected:boolean,
       netflow:struct<
         pkts:int,
         bytes:bigint,
         start:string,
         `end`:string,
         age:int,
         min_ttl:int,
         max_ttl:int,
         tcp_flags:struct<
           syn:boolean,
           fin:boolean,
           rst:boolean,
           psh:boolean,
           ack:boolean,
           urg:boolean
           >
         >
       >
   )
   ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 
   LOCATION 's3://amzn-s3-demo-bucket/path_to_netflow_logs_folder/';
   ```

1. Modify the `LOCATION` clause to specify the folder for your logs in Amazon S3.

1. Run the `CREATE TABLE` query in the Athena query editor. After the query completes, Athena registers the `network_firewall_netflow_logs` table, making the data that it points to ready for queries.

## Example query
<a name="querying-network-firewall-logs-netflow-log-sample-query"></a>

The sample netflow log query in this section filters for events in which TLS inspection was performed.

The query uses aliases to create output column headings that show the `struct` that the column belongs to. For example, the column heading for the `event.netflow.bytes` field is `event_netflow_bytes` instead of just `bytes`. To customize the column names further, you can modify the aliases to suit your preferences. For example, you can use underscores or other separators to delimit the `struct` names and field names. 

Remember to modify column names and `struct` references based on your table definition and on the fields that you want in the query result.

```
SELECT
  event.src_ip AS event_src_ip,
  event.dest_ip AS event_dest_ip,
  event.proto AS event_proto,
  event.app_proto AS event_app_proto,
  event.tls_inspected AS event_tls_inspected,
  event.netflow.pkts AS event_netflow_pkts,
  event.netflow.bytes AS event_netflow_bytes,
  event.netflow.tcp_flags.syn AS event_netflow_tcp_flags_syn 
FROM network_firewall_netflow_logs 
WHERE event.tls_inspected = true
```

# Query Network Load Balancer logs
<a name="networkloadbalancer-classic-logs"></a>

Use Athena to analyze and process logs from your Network Load Balancer. These logs contain detailed information about the Transport Layer Security (TLS) requests sent to the Network Load Balancer. You can use these access logs to analyze traffic patterns and troubleshoot issues.

Before you analyze the Network Load Balancer access logs, enable them and configure them to be saved to the destination Amazon S3 bucket. For more information, and for information about each Network Load Balancer access log entry, see [Access logs for your Network Load Balancer](https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-access-logs.html).

**To create the table for Network Load Balancer logs**

1. Copy and paste the following DDL statement into the Athena console. Check the [syntax](https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-access-logs.html#access-log-file-format) of the Network Load Balancer log records. Update the statement as required to include the columns and the regex corresponding to your log records.

   ```
   CREATE EXTERNAL TABLE IF NOT EXISTS nlb_tls_logs (
               type string,
               version string,
               time string,
               elb string,
               listener_id string,
               client_ip string,
               client_port int,
               target_ip string,
               target_port int,
               tcp_connection_time_ms double,
               tls_handshake_time_ms double,
               received_bytes bigint,
               sent_bytes bigint,
               incoming_tls_alert int,
               cert_arn string,
               certificate_serial string,
               tls_cipher_suite string,
               tls_protocol_version string,
               tls_named_group string,
               domain_name string,
               alpn_fe_protocol string,
               alpn_be_protocol string,
               alpn_client_preference_list string,
               tls_connection_creation_time string
               )
               ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
               WITH SERDEPROPERTIES (
               'serialization.format' = '1',
               'input.regex' = 
               '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*):([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-0-9]*) ([-0-9]*) ([-0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ?([^ ]*)?( .*)?'
               )
               LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/AWS_account_ID/elasticloadbalancing/region';
   ```

1. Modify the `LOCATION` clause to specify the Amazon S3 bucket that contains your Network Load Balancer logs.

1. Run the query in the Athena console. After the query completes, Athena registers the `nlb_tls_logs` table, making the data in it ready for queries.
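
Before you create the table, you can verify locally that the regex matches your log format. The following Python sketch applies the same `input.regex` to a hypothetical NLB TLS log line (all field values are invented; substitute a real line from your own logs):

```python
import re

# Same pattern as the input.regex SerDe property above
NLB_REGEX = (
    r'([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*):([0-9]*) '
    r'([-.0-9]*) ([-.0-9]*) ([-0-9]*) ([-0-9]*) ([-0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) '
    r'([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ?([^ ]*)?( .*)?'
)

# Column names in the same order as the CREATE TABLE statement
COLUMNS = [
    "type", "version", "time", "elb", "listener_id", "client_ip", "client_port",
    "target_ip", "target_port", "tcp_connection_time_ms", "tls_handshake_time_ms",
    "received_bytes", "sent_bytes", "incoming_tls_alert", "cert_arn",
    "certificate_serial", "tls_cipher_suite", "tls_protocol_version",
    "tls_named_group", "domain_name", "alpn_fe_protocol", "alpn_be_protocol",
    "alpn_client_preference_list", "tls_connection_creation_time",
]

# Hypothetical log line in the documented access-log layout
line = ("tls 2.0 2020-04-01T08:51:42 net/my-nlb/abc123 g3d4b5e8 "
        "203.0.113.10:51341 192.0.2.5:443 5 2 98 246 - "
        "arn:aws:acm:us-east-2:123456789012:certificate/example - "
        "ECDHE-RSA-AES128-SHA tlsv12 - example.com h2 h2 h2,http/1.1 "
        "2020-04-01T08:51:42")

match = re.match(NLB_REGEX, line)
row = dict(zip(COLUMNS, match.groups()))  # zip drops the trailing catch-all group
print(row["tls_protocol_version"])  # tlsv12
```

If `re.match` returns `None` for one of your real log lines, adjust the regex in the `CREATE TABLE` statement before running it.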

## Example queries
<a name="query-nlb-example"></a>

To see how many times a certificate is used, use a query similar to this example:

```
SELECT count(*) AS ct,
       cert_arn
FROM "nlb_tls_logs"
GROUP BY cert_arn;
```

The following query counts, for each client IP address, the connections that used a TLS version earlier than 1.3:

```
SELECT tls_protocol_version,
         COUNT(tls_protocol_version) AS num_connections,
         client_ip
FROM "nlb_tls_logs"
WHERE tls_protocol_version < 'tlsv13'
GROUP BY tls_protocol_version, client_ip;
```

Use the following query to identify the connections with the longest TLS handshake times:

```
SELECT *
FROM "nlb_tls_logs"
ORDER BY tls_handshake_time_ms DESC 
LIMIT 10;
```

Use the following query to identify and count which TLS protocol versions and cipher suites have been negotiated in the past 30 days.

```
SELECT tls_cipher_suite,
         tls_protocol_version,
         COUNT(*) AS ct
FROM "nlb_tls_logs"
WHERE from_iso8601_timestamp(time) > current_timestamp - interval '30' day
        AND NOT tls_protocol_version = '-'
GROUP BY tls_cipher_suite, tls_protocol_version
ORDER BY ct DESC;
```

# Query Amazon Route 53 resolver query logs
<a name="querying-r53-resolver-logs"></a>

You can create Athena tables for your Amazon Route 53 Resolver query logs and query them from Athena.

Route 53 Resolver query logging records DNS queries made by resources within a VPC, by on-premises resources that use inbound Resolver endpoints, by queries that use an outbound Resolver endpoint for recursive DNS resolution, and by queries that Route 53 Resolver DNS Firewall rules block, allow, or monitor. For more information about Resolver query logging, see [Resolver query logging](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/resolver-query-logs.html) in the *Amazon Route 53 Developer Guide*. For information about each of the fields in the logs, see [Values that appear in Resolver query logs](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/resolver-query-logs-format.html) in the *Amazon Route 53 Developer Guide*.

**Topics**
+ [Create the table for resolver query logs](querying-r53-resolver-logs-creating-the-table.md)
+ [Use partition projection](querying-r53-resolver-logs-partitioning-example.md)
+ [Example queries](querying-r53-resolver-logs-example-queries.md)

# Create the table for resolver query logs
<a name="querying-r53-resolver-logs-creating-the-table"></a>

You can use the Query Editor in the Athena console to create and query a table for your Route 53 Resolver query logs.

**To create and query an Athena table for Route 53 resolver query logs**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. In the Athena Query Editor, enter the following `CREATE TABLE` statement. Replace the `LOCATION` clause values with those corresponding to the location of your Resolver logs in Amazon S3.

   ```
   CREATE EXTERNAL TABLE r53_rlogs (
     version string,
     account_id string,
     region string,
     vpc_id string,
     query_timestamp string,
     query_name string,
     query_type string,
     query_class string,
     rcode string,
     answers array<
       struct<
         Rdata: string,
         Type: string,
         Class: string>
       >,
     srcaddr string,
     srcport int,
     transport string,
     srcids struct<
       instance: string,
       resolver_endpoint: string
       >,
     firewall_rule_action string,
     firewall_rule_group_id string,
     firewall_domain_list_id string
    )
        
   ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
   LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/aws_account_id/vpcdnsquerylogs/{vpc-id}/'
   ```

   Because Resolver query log data is in JSON format, the `CREATE TABLE` statement uses a [JSON SerDe library](json-serde.md) to parse the data.
**Note**  
The SerDe expects each JSON document to be on a single line of text with no line termination characters separating the fields in the record. If the JSON text is in pretty print format, you may receive an error message like `HIVE_CURSOR_ERROR: Row is not a valid JSON Object` or `HIVE_CURSOR_ERROR: JsonParseException: Unexpected end-of-input: expected close marker for OBJECT` when you attempt to query the table after you create it. For more information, see [JSON Data Files](https://github.com/rcongiu/Hive-JSON-Serde#json-data-files) in the OpenX SerDe documentation on GitHub. 

1. Choose **Run query**. The statement creates an Athena table named `r53_rlogs` whose columns represent each of the fields in your Resolver log data.

1. In the Athena console Query Editor, run the following query to verify that your table has been created.

   ```
   SELECT * FROM "r53_rlogs" LIMIT 10
   ```

# Use partition projection
<a name="querying-r53-resolver-logs-partitioning-example"></a>

The following example shows a `CREATE TABLE` statement for Resolver query logs that uses partition projection and is partitioned by vpc and by date. For more information about partition projection, see [Use partition projection with Amazon Athena](partition-projection.md).

```
CREATE EXTERNAL TABLE r53_rlogs (
  version string,
  account_id string,
  region string,
  vpc_id string,
  query_timestamp string,
  query_name string,
  query_type string,
  query_class string,
  rcode string,
  answers array<
    struct<
      Rdata: string,
      Type: string,
      Class: string>
    >,
  srcaddr string,
  srcport int,
  transport string,
  srcids struct<
    instance: string,
    resolver_endpoint: string
    >,
  firewall_rule_action string,
  firewall_rule_group_id string,
  firewall_domain_list_id string
)
PARTITIONED BY (
`date` string,
`vpc` string
)
ROW FORMAT SERDE      'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT          'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION              's3://amzn-s3-demo-bucket/route53-query-logging/AWSLogs/aws_account_id/vpcdnsquerylogs/'
TBLPROPERTIES(
'projection.enabled' = 'true',
'projection.vpc.type' = 'enum',
'projection.vpc.values' = 'vpc-6446ae02',
'projection.date.type' = 'date',
'projection.date.range' = '2023/06/26,NOW',
'projection.date.format' = 'yyyy/MM/dd',
'projection.date.interval' = '1',
'projection.date.interval.unit' = 'DAYS',
'storage.location.template' = 's3://amzn-s3-demo-bucket/route53-query-logging/AWSLogs/aws_account_id/vpcdnsquerylogs/${vpc}/${date}/'
)
```
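
Partition projection computes partition locations from the table properties instead of reading them from the metastore. The following Python sketch illustrates how the configuration above combines the projected `vpc` and `date` values with `storage.location.template` (the bucket name and account ID are placeholders, and the sketch enumerates only the first few daily partitions):

```python
from datetime import date, timedelta
from string import Template

# storage.location.template from the TBLPROPERTIES above
template = Template("s3://amzn-s3-demo-bucket/route53-query-logging/AWSLogs/"
                    "aws_account_id/vpcdnsquerylogs/${vpc}/${date}/")

vpcs = ["vpc-6446ae02"]        # projection.vpc.values
start = date(2023, 6, 26)      # start of projection.date.range

# Enumerate the first three daily partition locations the projection would cover
locations = [
    template.substitute(vpc=vpc,
                        date=(start + timedelta(days=i)).strftime("%Y/%m/%d"))
    for vpc in vpcs
    for i in range(3)
]
for loc in locations:
    print(loc)
```

Athena performs this expansion internally at query time, which is why no `ALTER TABLE ADD PARTITION` statements are needed for a table that uses partition projection.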

# Example queries
<a name="querying-r53-resolver-logs-example-queries"></a>

The following examples show some queries that you can perform from Athena on your Resolver query logs.

## Example 1 - query logs in descending query_timestamp order
<a name="querying-r53-resolver-logs-example-1-query-logs-in-descending-query_timestamp-order"></a>

The following query displays log results in descending `query_timestamp` order.

```
SELECT * FROM "r53_rlogs"
ORDER BY query_timestamp DESC
```

## Example 2 - query logs within specified start and end times
<a name="querying-r53-resolver-logs-example-2-query-logs-within-specified-start-and-end-times"></a>

The following query returns logs between midnight and 8:00 AM on September 24, 2020. Substitute the start and end times according to your own requirements.

```
SELECT query_timestamp, srcids.instance, srcaddr, srcport, query_name, rcode
FROM "r53_rlogs"
WHERE (parse_datetime(query_timestamp,'yyyy-MM-dd''T''HH:mm:ss''Z')
     BETWEEN parse_datetime('2020-09-24-00:00:00','yyyy-MM-dd-HH:mm:ss') 
     AND parse_datetime('2020-09-24-08:00:00','yyyy-MM-dd-HH:mm:ss'))
ORDER BY query_timestamp DESC
```
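
The `parse_datetime` calls convert the string timestamps so that they can be compared. The following Python sketch shows the equivalent conversion and range check locally (the sample timestamp is invented):

```python
from datetime import datetime

# Hypothetical query_timestamp value in the Resolver log format
ts = "2020-09-24T07:15:30Z"

# Equivalent of parse_datetime(query_timestamp, 'yyyy-MM-dd''T''HH:mm:ss''Z')
parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")

# Equivalent of the BETWEEN range in the WHERE clause
start = datetime(2020, 9, 24, 0, 0, 0)
end = datetime(2020, 9, 24, 8, 0, 0)
print(start <= parsed <= end)  # True
```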

## Example 3 - query logs based on a specified DNS query name pattern
<a name="querying-r53-resolver-logs-example-3-query-logs-based-on-a-specified-dns-query-name-pattern"></a>

The following query selects records whose query name includes the string "example.com".

```
SELECT query_timestamp, srcids.instance, srcaddr, srcport, query_name, rcode, answers
FROM "r53_rlogs"
WHERE query_name LIKE '%example.com%'
ORDER BY query_timestamp DESC
```

## Example 4 - query log requests with no answer
<a name="querying-r53-resolver-logs-example-4-query-log-requests-with-no-answer"></a>

The following query selects log entries in which the request received no answer.

```
SELECT query_timestamp, srcids.instance, srcaddr, srcport, query_name, rcode, answers
FROM "r53_rlogs"
WHERE cardinality(answers) = 0
```

## Example 5 - query logs with a specific answer
<a name="querying-r53-resolver-logs-example-5-query-logs-with-a-specific-answer"></a>

The following query shows logs in which the `answer.Rdata` value has the specified IP address.

```
SELECT query_timestamp, srcids.instance, srcaddr, srcport, query_name, rcode, answer.Rdata
FROM "r53_rlogs"
CROSS JOIN UNNEST(r53_rlogs.answers) as st(answer)
WHERE answer.Rdata='203.0.113.16';
```
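
`CROSS JOIN UNNEST` expands the `answers` array so that the query returns one row per answer. The following Python sketch shows the equivalent flattening and filter on a hypothetical log record (values invented for illustration):

```python
# Hypothetical Resolver log record with a multi-answer response
record = {
    "query_name": "example.com.",
    "answers": [
        {"Rdata": "203.0.113.16", "Type": "A", "Class": "IN"},
        {"Rdata": "203.0.113.17", "Type": "A", "Class": "IN"},
    ],
}

# Equivalent of CROSS JOIN UNNEST(answers) AS st(answer):
# produce one output row per array element
rows = [
    {"query_name": record["query_name"], "Rdata": answer["Rdata"]}
    for answer in record["answers"]
]

# Equivalent of WHERE answer.Rdata = '203.0.113.16'
matches = [row for row in rows if row["Rdata"] == "203.0.113.16"]
print(matches)
```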

# Query Amazon SES event logs
<a name="querying-ses-logs"></a>

You can use Amazon Athena to query [Amazon Simple Email Service](https://aws.amazon.com/ses/) (Amazon SES) event logs.

Amazon SES is an email platform that provides a convenient and cost-effective way to send and receive email using your own email addresses and domains. You can monitor your Amazon SES sending activity at a granular level using events, metrics, and statistics.

Based on the characteristics that you define, you can publish Amazon SES events to [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/), [Amazon Data Firehose](https://aws.amazon.com/kinesis/data-firehose/), or [Amazon Simple Notification Service](https://aws.amazon.com/sns/). After the information is stored in Amazon S3, you can query it from Amazon Athena. 

For an example Athena `CREATE TABLE` statement for Amazon SES logs, including steps on how to create views and flatten nested arrays in Amazon SES event log data, see "Step 3: Using Amazon Athena to query the SES event logs" in the AWS blog post [Analyzing Amazon SES event data with AWS Analytics Services](https://aws.amazon.com/blogs/messaging-and-targeting/analyzing-amazon-ses-event-data-with-aws-analytics-services/).

# Query Amazon VPC flow logs
<a name="vpc-flow-logs"></a>

Amazon Virtual Private Cloud flow logs capture information about the IP traffic going to and from network interfaces in a VPC. Use the logs to investigate network traffic patterns and identify threats and risks across your VPC network.

To query your Amazon VPC flow logs, you have two options:

+ **Amazon VPC Console** – Use the Athena integration feature in the Amazon VPC Console to generate a CloudFormation template that creates an Athena database, workgroup, and flow logs table with partitioning for you. The template also creates a set of [predefined flow log queries](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs-athena.html#predefined-queries) that you can use to obtain insights about the traffic flowing through your VPC.

  For information about this approach, see [Query flow logs using Amazon Athena](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs-athena.html) in the *Amazon VPC User Guide*.
+ **Amazon Athena console** – Create your tables and queries directly in the Athena console. For more information, continue reading this page.

Before you begin querying the logs in Athena, [enable VPC flow logs](https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/flow-logs.html), and configure them to be saved to your Amazon S3 bucket. After you create the logs, let them run for a few minutes to collect some data. The logs are saved in GZIP compressed format, which Athena can query directly.

When you create a VPC flow log, you can use a custom format when you want to specify the fields to return in the flow log and the order in which the fields appear. For more information about flow log records, see [Flow log records](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html#flow-log-records) in the *Amazon VPC User Guide*.

## Considerations and limitations
<a name="vpc-flow-logs-common-considerations"></a>

When you create tables in Athena for Amazon VPC flow logs, remember the following points:
+ By default, Athena accesses Parquet columns by name. For more information, see [Handle schema updates](handling-schema-updates-chapter.md).
+ Use the names in the flow log records for the column names in Athena. The names of the columns in the Athena schema should exactly match the field names in the Amazon VPC flow logs, with the following differences: 
  + Replace the hyphens in the Amazon VPC log field names with underscores in the Athena column names. For information about acceptable characters for database names, table names, and column names in Athena, see [Name databases, tables, and columns](tables-databases-columns-names.md).
  + Escape the flow log record names that are [reserved keywords](reserved-words.md) in Athena by enclosing them with backticks. 
+ VPC flow logs are AWS account specific. When you publish your log files to Amazon S3, the path that Amazon VPC creates in Amazon S3 includes the ID of the AWS account that was used to create the flow log. For more information, see [Publish flow logs to Amazon S3](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs-s3.html) in the *Amazon VPC User Guide*.
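
As a quick check of the naming rules above, the following Python sketch maps flow log field names to Athena column names; it replaces hyphens with underscores and backtick-escapes reserved keywords (the reserved word set here is a small illustrative subset, not the full Athena list):

```python
# Small illustrative subset of Athena DDL reserved keywords
RESERVED = {"end", "date", "partition", "timestamp"}

def to_athena_column(field_name: str) -> str:
    """Map a VPC flow log field name to an Athena column name."""
    column = field_name.replace("-", "_")
    return f"`{column}`" if column in RESERVED else column

fields = ["account-id", "interface-id", "srcaddr", "start", "end"]
print([to_athena_column(f) for f in fields])
# ['account_id', 'interface_id', 'srcaddr', 'start', '`end`']
```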

**Topics**
+ [Considerations and limitations](#vpc-flow-logs-common-considerations)
+ [Create a table for Amazon VPC flow logs and query it](vpc-flow-logs-create-table-statement.md)
+ [Create tables for flow logs in Apache Parquet format](vpc-flow-logs-parquet.md)
+ [Create and query a table for Amazon VPC flow logs using partition projection](vpc-flow-logs-partition-projection.md)
+ [Create tables for flow logs in Apache Parquet format using partition projection](vpc-flow-logs-partition-projection-parquet-example.md)
+ [Additional resources](query-examples-vpc-logs-additional-resources.md)

# Create a table for Amazon VPC flow logs and query it
<a name="vpc-flow-logs-create-table-statement"></a>

The following procedure creates an Athena table for Amazon VPC flow logs. If you created the flow log with a custom format, create a table whose fields match the fields that you specified for the flow log, in the same order that you specified them.

**To create an Athena table for Amazon VPC flow logs**

1. Enter a DDL statement like the following into the Athena console query editor, following the guidelines in the [Considerations and limitations](vpc-flow-logs.md#vpc-flow-logs-common-considerations) section. The sample statement creates a table that has the columns for Amazon VPC flow logs versions 2 through 5 as documented in [Flow log records](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html#flow-log-records). If you use a different set of columns or order of columns, modify the statement accordingly.

   ```
   CREATE EXTERNAL TABLE IF NOT EXISTS `vpc_flow_logs` (
     version int,
     account_id string,
     interface_id string,
     srcaddr string,
     dstaddr string,
     srcport int,
     dstport int,
     protocol bigint,
     packets bigint,
     bytes bigint,
     start bigint,
     `end` bigint,
     action string,
     log_status string,
     vpc_id string,
     subnet_id string,
     instance_id string,
     tcp_flags int,
     type string,
     pkt_srcaddr string,
     pkt_dstaddr string,
     region string,
     az_id string,
     sublocation_type string,
     sublocation_id string,
     pkt_src_aws_service string,
     pkt_dst_aws_service string,
     flow_direction string,
     traffic_path int
   )
   PARTITIONED BY (`date` date)
   ROW FORMAT DELIMITED
   FIELDS TERMINATED BY ' '
   LOCATION 's3://amzn-s3-demo-bucket/prefix/AWSLogs/{account_id}/vpcflowlogs/{region_code}/'
   TBLPROPERTIES ("skip.header.line.count"="1");
   ```

   Note the following points:
   + The query specifies `ROW FORMAT DELIMITED` and omits specifying a SerDe. This means that the query uses the [Lazy Simple SerDe for CSV, TSV, and custom-delimited files](lazy-simple-serde.md). In this query, fields are terminated by a space.
   + The `PARTITIONED BY` clause uses the `date` type. This makes it possible to use mathematical operators in queries to select what's older or newer than a certain date.
**Note**  
Because `date` is a reserved keyword in DDL statements, it is escaped by back tick characters. For more information, see [Escape reserved keywords in queries](reserved-words.md).
   + For a VPC flow log with a different custom format, modify the fields to match the fields that you specified when you created the flow log.

1. Modify the `LOCATION 's3://amzn-s3-demo-bucket/prefix/AWSLogs/{account_id}/vpcflowlogs/{region_code}/'` to point to the Amazon S3 bucket that contains your log data.

1. Run the query in the Athena console. After the query completes, Athena registers the `vpc_flow_logs` table, making the data in it ready for you to query.

1. Create partitions to be able to read the data, as in the following sample query. This query creates a single partition for a specified date. Replace the placeholders for date and location as needed. 
**Note**  
This query creates a single partition only, for a date that you specify. To automate the process, use a script that runs this query and creates partitions this way for the `year/month/day`, or use a `CREATE TABLE` statement that specifies [partition projection](vpc-flow-logs-partition-projection.md).

   ```
   ALTER TABLE vpc_flow_logs
   ADD PARTITION (`date`='YYYY-MM-dd')
   LOCATION 's3://amzn-s3-demo-bucket/prefix/AWSLogs/{account_id}/vpcflowlogs/{region_code}/YYYY/MM/dd';
   ```
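
To automate this step, a script can emit one `ALTER TABLE` statement per day. The following Python sketch generates the statements for a week of partitions (the bucket, prefix, account ID, and Region in `BASE` are placeholders to replace with your own values):

```python
from datetime import date, timedelta

# Placeholder S3 location; replace with your own bucket, prefix, account ID, and Region
BASE = "s3://amzn-s3-demo-bucket/prefix/AWSLogs/account_id/vpcflowlogs/region_code"

def add_partition_ddl(day: date) -> str:
    """Build the ALTER TABLE statement that registers one day's partition."""
    return (
        f"ALTER TABLE vpc_flow_logs "
        f"ADD PARTITION (`date`='{day.isoformat()}') "
        f"LOCATION '{BASE}/{day.strftime('%Y/%m/%d')}';"
    )

# Generate statements for each day of one week; run each statement in Athena
statements = [add_partition_ddl(date(2020, 5, 4) + timedelta(days=i)) for i in range(7)]
print(statements[0])
```

Alternatively, a `CREATE TABLE` statement that uses partition projection removes the need to add partitions at all.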

## Example queries for the vpc_flow_logs table
<a name="query-examples-vpc-logs"></a>

Use the query editor in the Athena console to run SQL statements on the table that you create. You can save the queries, view previous queries, or download query results in CSV format. In the following examples, replace `vpc_flow_logs` with the name of your table. Modify the column values and other variables according to your requirements.

The following example query lists a maximum of 100 flow logs for the date specified.

```
SELECT * 
FROM vpc_flow_logs 
WHERE date = DATE('2020-05-04') 
LIMIT 100;
```

The following query lists all of the rejected TCP connections and uses the newly created `date` partition column to extract the day of the week on which these events occurred.

```
SELECT day_of_week(date) AS
  day,
  date,
  interface_id,
  srcaddr,
  action,
  protocol
FROM vpc_flow_logs
WHERE action = 'REJECT' AND protocol = 6
LIMIT 100;
```

To see which one of your servers is receiving the highest number of HTTPS requests, use the following query. It counts the number of packets received on HTTPS port 443, groups them by destination IP address, and returns the top 10 from the last week.

```
SELECT SUM(packets) AS
  packetcount,
  dstaddr
FROM vpc_flow_logs
WHERE dstport = 443 AND date > current_date - interval '7' day
GROUP BY dstaddr
ORDER BY packetcount DESC
LIMIT 10;
```

# Create tables for flow logs in Apache Parquet format
<a name="vpc-flow-logs-parquet"></a>

The following procedure creates an Amazon VPC table for Amazon VPC flow logs in Apache Parquet format.

**To create an Athena table for Amazon VPC flow logs in Parquet format**

1. Enter a DDL statement like the following into the Athena console query editor, following the guidelines in the [Considerations and limitations](vpc-flow-logs.md#vpc-flow-logs-common-considerations) section. The sample statement creates a table that has the columns for Amazon VPC flow logs versions 2 through 5, as documented in [Flow log records](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html#flow-log-records), in Parquet format, Hive-partitioned hourly. If you do not have hourly partitions, remove `hour` from the `PARTITIONED BY` clause.

   ```
   CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs_parquet (
     version int,
     account_id string,
     interface_id string,
     srcaddr string,
     dstaddr string,
     srcport int,
     dstport int,
     protocol bigint,
     packets bigint,
     bytes bigint,
     start bigint,
     `end` bigint,
     action string,
     log_status string,
     vpc_id string,
     subnet_id string,
     instance_id string,
     tcp_flags int,
     type string,
     pkt_srcaddr string,
     pkt_dstaddr string,
     region string,
     az_id string,
     sublocation_type string,
     sublocation_id string,
     pkt_src_aws_service string,
     pkt_dst_aws_service string,
     flow_direction string,
     traffic_path int
   )
   PARTITIONED BY (
     `aws-account-id` string,
     `aws-service` string,
     `aws-region` string,
     `year` string, 
     `month` string, 
     `day` string,
     `hour` string
   )
   ROW FORMAT SERDE 
     'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
   STORED AS INPUTFORMAT 
     'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
   OUTPUTFORMAT 
     'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
   LOCATION
     's3://amzn-s3-demo-bucket/prefix/AWSLogs/'
   TBLPROPERTIES (
     'EXTERNAL'='true', 
     'skip.header.line.count'='1'
     )
   ```

1. Modify the sample `LOCATION 's3://amzn-s3-demo-bucket/prefix/AWSLogs/'` to point to the Amazon S3 path that contains your log data.

1. Run the query in the Athena console.

1. If your data is in Hive-compatible format, run the following command in the Athena console to update and load the Hive partitions in the metastore. After the query completes, you can query the data in the `vpc_flow_logs_parquet` table.

   ```
   MSCK REPAIR TABLE vpc_flow_logs_parquet
   ```

   If you are not using Hive-compatible data, run [ALTER TABLE ADD PARTITION](alter-table-add-partition.md) to load the partitions.
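
   For example, the following sketch loads one hourly partition of the table above. The account ID, Region, and date values are placeholders, and the `LOCATION` path must match the actual layout of your bucket.

   ```
   ALTER TABLE vpc_flow_logs_parquet ADD PARTITION
     (`aws-account-id`='123456789012', `aws-service`='vpcflowlogs', `aws-region`='us-east-1',
      `year`='2021', `month`='01', `day`='01', `hour`='00')
   LOCATION 's3://amzn-s3-demo-bucket/prefix/AWSLogs/123456789012/vpcflowlogs/us-east-1/2021/01/01/';
   ```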

For more information about using Athena to query Amazon VPC flow logs in Parquet format, see the post [Optimize performance and reduce costs for network analytics with VPC Flow Logs in Apache Parquet format](https://aws.amazon.com/blogs/big-data/optimize-performance-and-reduce-costs-for-network-analytics-with-vpc-flow-logs-in-apache-parquet-format/) in the *AWS Big Data Blog*.

# Create and query a table for Amazon VPC flow logs using partition projection
<a name="vpc-flow-logs-partition-projection"></a>

Use a `CREATE TABLE` statement like the following to create a table, partition the table, and populate the partitions automatically by using [partition projection](partition-projection.md). Replace the table name `test_table_vpclogs` in the example with the name of your table. Edit the `LOCATION` clause to specify the Amazon S3 bucket that contains your Amazon VPC log data.

The following `CREATE TABLE` statement is for VPC flow logs delivered in non-Hive style partitioning format. The example allows for multi-account aggregation. If you centralize VPC flow logs from multiple accounts into one Amazon S3 bucket, the account ID must be included in the Amazon S3 path.

```
CREATE EXTERNAL TABLE IF NOT EXISTS test_table_vpclogs (
  version int,
  account_id string,
  interface_id string,
  srcaddr string,
  dstaddr string,
  srcport int,
  dstport int,
  protocol bigint,
  packets bigint,
  bytes bigint,
  start bigint,
  `end` bigint,
  action string,
  log_status string,
  vpc_id string,
  subnet_id string,
  instance_id string,
  tcp_flags int,
  type string,
  pkt_srcaddr string,
  pkt_dstaddr string,
  az_id string,
  sublocation_type string,
  sublocation_id string,
  pkt_src_aws_service string,
  pkt_dst_aws_service string,
  flow_direction string,
  traffic_path int
)
PARTITIONED BY (accid string, region string, day string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LOCATION '$LOCATION_OF_LOGS'
TBLPROPERTIES
(
"skip.header.line.count"="1",
"projection.enabled" = "true",
"projection.accid.type" = "enum",
"projection.accid.values" = "$ACCID_1,$ACCID_2",
"projection.region.type" = "enum",
"projection.region.values" = "$REGION_1,$REGION_2,$REGION_3",
"projection.day.type" = "date",
"projection.day.range" = "$START_RANGE,NOW",
"projection.day.format" = "yyyy/MM/dd",
"storage.location.template" = "s3://amzn-s3-demo-bucket/AWSLogs/${accid}/vpcflowlogs/${region}/${day}"
)
```

## Example queries for test_table_vpclogs
<a name="query-examples-vpc-logs-pp"></a>

The following example queries use the `test_table_vpclogs` table created by the preceding `CREATE TABLE` statement. Replace `test_table_vpclogs` in the queries with the name of your own table. Modify the column values and other variables according to your requirements.

To return the first 100 access log entries in chronological order for a specified period of time, run a query like the following.

```
SELECT *
FROM test_table_vpclogs
WHERE day >= '2021/02/01' AND day < '2021/02/28'
ORDER BY day ASC
LIMIT 100
```

To see which servers receive the highest number of HTTPS packets during a specified period of time, run a query like the following. The query counts the number of packets received on HTTPS port 443, groups them by destination IP address, and returns the top 10 entries for the specified period.

```
SELECT SUM(packets) AS packetcount, 
       dstaddr
FROM test_table_vpclogs
WHERE dstport = 443
  AND day >= '2021/03/01'
  AND day < '2021/03/31'
GROUP BY dstaddr
ORDER BY packetcount DESC
LIMIT 10
```

To return the logs that were created during a specified period of time, run a query like the following.

```
SELECT interface_id,
       srcaddr,
       action,
       protocol,
       to_iso8601(from_unixtime(start)) AS start_time,
       to_iso8601(from_unixtime("end")) AS end_time
FROM test_table_vpclogs
WHERE DAY >= '2021/04/01'
  AND DAY < '2021/04/30'
```

To return the access logs for a source IP address during a specified time period, run a query like the following.

```
SELECT *
FROM test_table_vpclogs
WHERE srcaddr = '10.117.1.22'
  AND day >= '2021/02/01'
  AND day < '2021/02/28'
```

To list rejected TCP connections, run a query like the following.

```
SELECT day,
       interface_id,
       srcaddr,
       action,
       protocol
FROM test_table_vpclogs
WHERE action = 'REJECT' AND protocol = 6 AND day >= '2021/02/01' AND day < '2021/02/28'
LIMIT 10
```

To return the access logs for the IP address range that starts with `10.117`, run a query like the following.

```
SELECT *
FROM test_table_vpclogs
WHERE split_part(srcaddr,'.', 1)='10'
  AND split_part(srcaddr,'.', 2) ='117'
```

To return the access logs for a destination IP address during a specified time period, run a query like the following.

```
SELECT *
FROM test_table_vpclogs
WHERE dstaddr = '10.0.1.14'
  AND day >= '2021/01/01'
  AND day < '2021/01/31'
```

# Create tables for flow logs in Apache Parquet format using partition projection
<a name="vpc-flow-logs-partition-projection-parquet-example"></a>

The following `CREATE TABLE` statement for VPC flow logs uses partition projection and Apache Parquet format. It is not Hive compatible and is partitioned by date and hour rather than by day alone. Replace the table name `test_table_vpclogs_parquet` in the example with the name of your table. Edit the `LOCATION` clause to specify the Amazon S3 bucket that contains your Amazon VPC log data.

```
CREATE EXTERNAL TABLE IF NOT EXISTS test_table_vpclogs_parquet (
  version int,
  account_id string,
  interface_id string,
  srcaddr string,
  dstaddr string,
  srcport int,
  dstport int,
  protocol bigint,
  packets bigint,
  bytes bigint,
  start bigint,
  `end` bigint,
  action string,
  log_status string,
  vpc_id string,
  subnet_id string,
  instance_id string,
  tcp_flags int,
  type string,
  pkt_srcaddr string,
  pkt_dstaddr string,
  az_id string,
  sublocation_type string,
  sublocation_id string,
  pkt_src_aws_service string,
  pkt_dst_aws_service string,
  flow_direction string,
  traffic_path int
)
PARTITIONED BY (region string, date string, hour string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://amzn-s3-demo-bucket/prefix/AWSLogs/{account_id}/vpcflowlogs/'
TBLPROPERTIES (
"EXTERNAL"="true",
"skip.header.line.count" = "1",
"projection.enabled" = "true",
"projection.region.type" = "enum",
"projection.region.values" = "us-east-1,us-west-2,ap-south-1,eu-west-1",
"projection.date.type" = "date",
"projection.date.range" = "2021/01/01,NOW",
"projection.date.format" = "yyyy/MM/dd",
"projection.hour.type" = "integer",
"projection.hour.range" = "00,23",
"projection.hour.digits" = "2",
"storage.location.template" = "s3://amzn-s3-demo-bucket/prefix/AWSLogs/${account_id}/vpcflowlogs/${region}/${date}/${hour}"
)
```
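
Because the table uses partition projection, you can filter on the `region`, `date`, and `hour` partition columns as soon as the table is created, without loading partitions first. The following query is a sketch; the Region and date values are illustrative.

```
SELECT srcaddr, dstaddr, SUM(bytes) AS total_bytes
FROM test_table_vpclogs_parquet
WHERE region = 'us-east-1'
  AND date = '2021/01/15'
  AND hour = '00'
GROUP BY srcaddr, dstaddr
ORDER BY total_bytes DESC
LIMIT 10
```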

# Additional resources
<a name="query-examples-vpc-logs-additional-resources"></a>

For more information about using Athena to analyze VPC flow logs, see the following AWS Big Data blog posts:
+ [Analyze VPC Flow Logs with point-and-click Amazon Athena integration](https://aws.amazon.com/blogs/networking-and-content-delivery/analyze-vpc-flow-logs-with-point-and-click-amazon-athena-integration/) 
+ [Analyzing VPC flow logs using Amazon Athena and Amazon QuickSight](https://aws.amazon.com/blogs/big-data/analyzing-vpc-flow-logs-using-amazon-athena-and-amazon-quicksight/)
+ [Optimize performance and reduce costs for network analytics with VPC Flow Logs in Apache Parquet format](https://aws.amazon.com/blogs/big-data/optimize-performance-and-reduce-costs-for-network-analytics-with-vpc-flow-logs-in-apache-parquet-format/)

# Query AWS WAF logs
<a name="waf-logs"></a>

AWS WAF is a web application firewall that lets you monitor and control the HTTP and HTTPS requests that your protected web applications receive from clients. You define how to handle the web requests by configuring rules inside an AWS WAF web access control list (ACL). You then protect a web application by associating a web ACL to it. Examples of web application resources that you can protect with AWS WAF include Amazon CloudFront distributions, Amazon API Gateway REST APIs, and Application Load Balancers. For more information about AWS WAF, see [AWS WAF](https://docs.aws.amazon.com/waf/latest/developerguide/waf-chapter.html) in the *AWS WAF developer guide*.

AWS WAF logs include information about the traffic that is analyzed by your web ACL, such as the time that AWS WAF received the request from your AWS resource, detailed information about the request, and the action for the rule that each request matched.

You can configure an AWS WAF web ACL to publish logs to one of several destinations, where you can query and view them. For more information about configuring web ACL logging and the contents of the AWS WAF logs, see [Logging AWS WAF web ACL traffic](https://docs.aws.amazon.com/waf/latest/developerguide/logging.html) in the *AWS WAF developer guide*.

For information on how to use Athena to analyze AWS WAF logs for insights into threat detection and potential security attacks, see the AWS Networking & Content Delivery Blog post [How to use Amazon Athena queries to analyze AWS WAF logs and provide the visibility needed for threat detection](https://aws.amazon.com/blogs/networking-and-content-delivery/how-to-use-amazon-athena-queries-to-analyze-aws-waf-logs-and-provide-the-visibility-needed-for-threat-detection/).

For an example of how to aggregate AWS WAF logs into a central data lake repository and query them with Athena, see the AWS Big Data Blog post [Analyzing AWS WAF logs with Amazon OpenSearch Service, Amazon Athena, and Amazon QuickSight](https://aws.amazon.com/blogs/big-data/analyzing-aws-waf-logs-with-amazon-es-amazon-athena-and-amazon-quicksight/).

This topic provides example `CREATE TABLE` statements that use partition projection, manual partitioning, and no partitioning.

**Note**  
The `CREATE TABLE` statements in this topic can be used for both v1 and v2 AWS WAF logs. In v1, the `webaclid` field contains an ID. In v2, the `webaclid` field contains a full ARN. The `CREATE TABLE` statements here treat this content agnostically by using the `string` data type.
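
For example, to count requests per web ACL without depending on whether `webaclid` holds a v1 ID or a v2 ARN, you can group on the raw string and match a substring of it. This sketch assumes a table named `waf_logs` like the ones created in the topics that follow; the web ACL name `my-web-acl` is a placeholder.

```
SELECT webaclid, COUNT(*) AS request_count
FROM waf_logs
WHERE webaclid LIKE '%my-web-acl%'
GROUP BY webaclid
```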

**Topics**
+ [Create a table for AWS WAF S3 logs in Athena using partition projection](create-waf-table-partition-projection.md)
+ [Create a table for AWS WAF S3 logs in Athena using manual partition](create-waf-table-manual-partition.md)
+ [Create a table for AWS WAF logs without partitioning](create-waf-table.md)
+ [Example queries for AWS WAF logs](query-examples-waf-logs.md)

# Create a table for AWS WAF S3 logs in Athena using partition projection
<a name="create-waf-table-partition-projection"></a>

Because AWS WAF logs have a known structure whose partition scheme you can specify in advance, you can reduce query runtime and automate partition management by using the Athena [partition projection](partition-projection.md) feature. Partition projection automatically adds new partitions as new data is added. This removes the need for you to manually add partitions by using `ALTER TABLE ADD PARTITION`. 

The following example `CREATE TABLE` statement automatically uses partition projection on AWS WAF logs from a specified date and time until the present. The `PARTITIONED BY` clause in this example partitions by the `log_time` column, but you can modify this according to your requirements. Modify the fields as necessary to match your log output. In the `LOCATION` and `storage.location.template` clauses, replace the *amzn-s3-demo-bucket* and *AWS_ACCOUNT_NUMBER* placeholders with values that identify the Amazon S3 bucket location of your AWS WAF logs. For `projection.log_time.range`, replace *2025/01/01/00/00* with the starting date and time that you want to use. After you run the query successfully, you can query the table. You do not have to run `ALTER TABLE ADD PARTITION` to load the partitions. 

```
CREATE EXTERNAL TABLE `waf_logs_partition_projection`(
  `timestamp` bigint, 
  `formatversion` int, 
  `webaclid` string, 
  `terminatingruleid` string, 
  `terminatingruletype` string, 
  `action` string, 
  `terminatingrulematchdetails` array<struct<conditiontype:string,sensitivitylevel:string,location:string,matcheddata:array<string>>>, 
  `httpsourcename` string, 
  `httpsourceid` string, 
  `rulegrouplist` array<struct<rulegroupid:string,terminatingrule:struct<ruleid:string,action:string,rulematchdetails:array<struct<conditiontype:string,sensitivitylevel:string,location:string,matcheddata:array<string>>>>,nonterminatingmatchingrules:array<struct<ruleid:string,action:string,overriddenaction:string,rulematchdetails:array<struct<conditiontype:string,sensitivitylevel:string,location:string,matcheddata:array<string>>>,challengeresponse:struct<responsecode:string,solvetimestamp:string>,captcharesponse:struct<responsecode:string,solvetimestamp:string>>>,excludedrules:string>>, 
  `ratebasedrulelist` array<struct<ratebasedruleid:string,limitkey:string,maxrateallowed:int>>, 
  `nonterminatingmatchingrules` array<struct<ruleid:string,action:string,rulematchdetails:array<struct<conditiontype:string,sensitivitylevel:string,location:string,matcheddata:array<string>>>,challengeresponse:struct<responsecode:string,solvetimestamp:string>,captcharesponse:struct<responsecode:string,solvetimestamp:string>>>, 
  `requestheadersinserted` array<struct<name:string,value:string>>, 
  `responsecodesent` string, 
  `httprequest` struct<clientip:string,country:string,headers:array<struct<name:string,value:string>>,uri:string,args:string,httpversion:string,httpmethod:string,requestid:string,fragment:string,scheme:string,host:string>,
  `labels` array<struct<name:string>>, 
  `captcharesponse` struct<responsecode:string,solvetimestamp:string,failurereason:string>, 
  `challengeresponse` struct<responsecode:string,solvetimestamp:string,failurereason:string>, 
  `ja3fingerprint` string, 
  `ja4fingerprint` string, 
  `oversizefields` string, 
  `requestbodysize` int, 
  `requestbodysizeinspectedbywaf` int)
  PARTITIONED BY ( 
   `log_time` string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://amzn-s3-demo-bucket/AWSLogs/AWS_ACCOUNT_NUMBER/WAFLogs/cloudfront/testui/'
TBLPROPERTIES (
 'projection.enabled'='true',
  'projection.log_time.format'='yyyy/MM/dd/HH/mm',
  'projection.log_time.interval'='1',
  'projection.log_time.interval.unit'='minutes',
  'projection.log_time.range'='2025/01/01/00/00,NOW',
  'projection.log_time.type'='date',
  'storage.location.template'='s3://amzn-s3-demo-bucket/AWSLogs/AWS_ACCOUNT_NUMBER/WAFLogs/cloudfront/testui/${log_time}')
```

**Note**  
The format of the path in the `LOCATION` clause in the example is standard but can vary based on the AWS WAF configuration that you have implemented. For example, the following example AWS WAF logs path is for a CloudFront distribution:   

```
s3://amzn-s3-demo-bucket/AWSLogs/AWS_ACCOUNT_NUMBER/WAFLogs/cloudfront/cloudfronyt/2025/01/01/00/00/
```
If you experience issues while creating or querying your AWS WAF logs table, confirm the location of your log data or [contact Support](https://console.aws.amazon.com/support/home/).

For more information about partition projection, see [Use partition projection with Amazon Athena](partition-projection.md).

# Create a table for AWS WAF S3 logs in Athena using manual partition
<a name="create-waf-table-manual-partition"></a>

This section describes how to create a table for AWS WAF logs using manual partition.

In the `LOCATION` and `storage.location.template` clauses, replace the *amzn-s3-demo-bucket* and *AWS_ACCOUNT_NUMBER* placeholders with values that identify the Amazon S3 bucket location of your AWS WAF logs.

```
CREATE EXTERNAL TABLE `waf_logs_manual_partition`(
  `timestamp` bigint, 
  `formatversion` int, 
  `webaclid` string, 
  `terminatingruleid` string, 
  `terminatingruletype` string, 
  `action` string, 
  `terminatingrulematchdetails` array<struct<conditiontype:string,sensitivitylevel:string,location:string,matcheddata:array<string>>>, 
  `httpsourcename` string, 
  `httpsourceid` string, 
  `rulegrouplist` array<struct<rulegroupid:string,terminatingrule:struct<ruleid:string,action:string,rulematchdetails:array<struct<conditiontype:string,sensitivitylevel:string,location:string,matcheddata:array<string>>>>,nonterminatingmatchingrules:array<struct<ruleid:string,action:string,overriddenaction:string,rulematchdetails:array<struct<conditiontype:string,sensitivitylevel:string,location:string,matcheddata:array<string>>>,challengeresponse:struct<responsecode:string,solvetimestamp:string>,captcharesponse:struct<responsecode:string,solvetimestamp:string>>>,excludedrules:string>>, 
  `ratebasedrulelist` array<struct<ratebasedruleid:string,limitkey:string,maxrateallowed:int>>, 
  `nonterminatingmatchingrules` array<struct<ruleid:string,action:string,rulematchdetails:array<struct<conditiontype:string,sensitivitylevel:string,location:string,matcheddata:array<string>>>,challengeresponse:struct<responsecode:string,solvetimestamp:string>,captcharesponse:struct<responsecode:string,solvetimestamp:string>>>, 
  `requestheadersinserted` array<struct<name:string,value:string>>, 
  `responsecodesent` string, 
  `httprequest` struct<clientip:string,country:string,headers:array<struct<name:string,value:string>>,uri:string,args:string,httpversion:string,httpmethod:string,requestid:string,fragment:string,scheme:string,host:string>, 
  `labels` array<struct<name:string>>, 
  `captcharesponse` struct<responsecode:string,solvetimestamp:string,failurereason:string>, 
  `challengeresponse` struct<responsecode:string,solvetimestamp:string,failurereason:string>, 
  `ja3fingerprint` string, 
  `ja4fingerprint` string, 
  `oversizefields` string, 
  `requestbodysize` int, 
  `requestbodysizeinspectedbywaf` int)
  PARTITIONED BY ( `year` string, `month` string, `day` string, `hour` string, `min` string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://amzn-s3-demo-bucket/AWSLogs/AWS_ACCOUNT_NUMBER/WAFLogs/cloudfront/webacl/'
```
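
After you create the table, run `ALTER TABLE ADD PARTITION` to load each partition before querying. The following sketch loads a single five-minute partition; the date values are placeholders, and the path follows the `LOCATION` layout above.

```
ALTER TABLE waf_logs_manual_partition ADD PARTITION
  (`year`='2025', `month`='01', `day`='01', `hour`='00', `min`='00')
LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/AWS_ACCOUNT_NUMBER/WAFLogs/cloudfront/webacl/2025/01/01/00/00/';
```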

# Create a table for AWS WAF logs without partitioning
<a name="create-waf-table"></a>

This section describes how to create a table for AWS WAF logs without partitioning or partition projection.

**Note**  
For performance and cost reasons, we do not recommend using a non-partitioned schema for queries. For more information, see [Top 10 Performance Tuning Tips for Amazon Athena](https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/) in the AWS Big Data Blog.

**To create the AWS WAF table**

1. Copy and paste the following DDL statement into the Athena console. Modify the fields as necessary to match your log output. Modify the `LOCATION` for the Amazon S3 bucket to correspond to the one that stores your logs.

   This query uses the [OpenX JSON SerDe](openx-json-serde.md).
**Note**  
The SerDe expects each JSON document to be on a single line of text with no line termination characters separating the fields in the record. If the JSON text is in pretty print format, you may receive an error message like HIVE_CURSOR_ERROR: Row is not a valid JSON Object or HIVE_CURSOR_ERROR: JsonParseException: Unexpected end-of-input: expected close marker for OBJECT when you attempt to query the table after you create it. For more information, see [JSON Data Files](https://github.com/rcongiu/Hive-JSON-Serde#json-data-files) in the OpenX SerDe documentation on GitHub. 

   ```
   CREATE EXTERNAL TABLE `waf_logs`(
     `timestamp` bigint,
     `formatversion` int,
     `webaclid` string,
     `terminatingruleid` string,
     `terminatingruletype` string,
     `action` string,
     `terminatingrulematchdetails` array <
                                       struct <
                                           conditiontype: string,
                                           sensitivitylevel: string,
                                           location: string,
                                           matcheddata: array < string >
                                             >
                                        >,
     `httpsourcename` string,
     `httpsourceid` string,
     `rulegrouplist` array <
                         struct <
                             rulegroupid: string,
                             terminatingrule: struct <
                                                 ruleid: string,
                                                 action: string,
                                                 rulematchdetails: array <
                                                                      struct <
                                                                          conditiontype: string,
                                                                          sensitivitylevel: string,
                                                                          location: string,
                                                                          matcheddata: array < string >
                                                                             >
                                                                       >
                                                   >,
                             nonterminatingmatchingrules: array <
                                                                 struct <
                                                                     ruleid: string,
                                                                     action: string,
                                                                     overriddenaction: string,
                                                                     rulematchdetails: array <
                                                                                          struct <
                                                                                              conditiontype: string,
                                                                                              sensitivitylevel: string,
                                                                                              location: string,
                                                                                              matcheddata: array < string >
                                                                                                 >
                                                                      >,
                                                                     challengeresponse: struct <
                                                                               responsecode: string,
                                                                               solvetimestamp: string
                                                                                 >,
                                                                     captcharesponse: struct <
                                                                               responsecode: string,
                                                                               solvetimestamp: string
                                                                                 >
                                                                       >
                                                                >,
                             excludedrules: string
                               >
                          >,
   `ratebasedrulelist` array <
                            struct <
                                ratebasedruleid: string,
                                limitkey: string,
                                maxrateallowed: int
                                  >
                             >,
     `nonterminatingmatchingrules` array <
                                       struct <
                                           ruleid: string,
                                           action: string,
                                           rulematchdetails: array <
                                                                struct <
                                                                    conditiontype: string,
                                                                    sensitivitylevel: string,
                                                                    location: string,
                                                                    matcheddata: array < string >
                                                                       >
                                                                >,
                                           challengeresponse: struct <
                                                               responsecode: string,
                                                               solvetimestamp: string
                                                                >,
                                           captcharesponse: struct <
                                                               responsecode: string,
                                                               solvetimestamp: string
                                                                >
                                             >
                                        >,
     `requestheadersinserted` array <
                                   struct <
                                       name: string,
                                       value: string
                                         >
                                    >,
     `responsecodesent` string,
     `httprequest` struct <
                       clientip: string,
                       country: string,
                       headers: array <
                                   struct <
                                       name: string,
                                       value: string
                                         >
                                    >,
                       uri: string,
                       args: string,
                       httpversion: string,
                       httpmethod: string,
                       requestid: string
                         >,
     `labels` array <
                  struct <
                      name: string
                        >
                   >,
     `captcharesponse` struct <
                           responsecode: string,
                           solvetimestamp: string,
                           failureReason: string
                             >,
     `challengeresponse` struct <
                           responsecode: string,
                           solvetimestamp: string,
                           failureReason: string
                           >,
     `ja3Fingerprint` string,
     `oversizefields` string,
     `requestbodysize` int,
     `requestbodysizeinspectedbywaf` int
   )
   ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
   STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
   LOCATION 's3://amzn-s3-demo-bucket/prefix/'
   ```

1. Run the `CREATE EXTERNAL TABLE` statement in the Athena console query editor. This registers the `waf_logs` table and makes the data in it available for queries from Athena.

# Example queries for AWS WAF logs
<a name="query-examples-waf-logs"></a>

Many of the example queries in this section use the partition projection table created previously. Modify the table name, column values, and other variables in the examples according to your requirements. To improve the performance of your queries and reduce cost, add the partition column in the filter condition.
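
For example, with the `waf_logs_partition_projection` table created earlier, filtering on the `log_time` partition column limits the Amazon S3 prefixes that Athena scans. The date values in this sketch are illustrative; because `log_time` uses the `yyyy/MM/dd/HH/mm` format, string comparison orders the values chronologically.

```
SELECT COUNT(*) AS request_count
FROM waf_logs_partition_projection
WHERE log_time >= '2025/01/01/00/00'
  AND log_time < '2025/01/02/00/00'
```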

**Topics**
+ [Count referrers, IP addresses, or matched rules](query-examples-waf-logs-count.md)
+ [Query using date and time](query-examples-waf-logs-date-time.md)
+ [Query for blocked requests or addresses](query-examples-waf-logs-blocked-requests.md)

# Count referrers, IP addresses, or matched rules
<a name="query-examples-waf-logs-count"></a>

The examples in this section query for counts of log items of interest.
+ [Count the number of referrers that contain a specified term](#waf-example-count-referrers-with-specified-term)
+ [Count all matched IP addresses in the last 10 days that have matched excluded rules](#waf-example-count-matched-ip-addresses)
+ [Group all counted managed rules by the number of times matched](#waf-example-group-managed-rules-by-times-matched)
+ [Group all counted custom rules by number of times matched](#waf-example-group-custom-rules-by-times-matched)

**Example – Count the number of referrers that contain a specified term**  
The following query counts the number of referrers that contain the term "amazon" for the specified date range.  

```
WITH test_dataset AS 
  (SELECT header FROM waf_logs
    CROSS JOIN UNNEST(httprequest.headers) AS t(header) WHERE "date" >= '2021/03/01'
    AND "date" < '2021/03/31')
SELECT COUNT(*) referer_count 
FROM test_dataset 
WHERE LOWER(header.name)='referer' AND header.value LIKE '%amazon%'
```

**Example – Count all matched IP addresses in the last 10 days that have matched excluded rules**  
The following query counts the number of times in the last 10 days that an IP address matched an excluded rule in a rule group.  

```
WITH test_dataset AS 
  (SELECT * FROM waf_logs 
    CROSS JOIN UNNEST(rulegrouplist) AS t(allrulegroups))
SELECT 
  COUNT(*) AS count, 
  "httprequest"."clientip", 
  "allrulegroups"."excludedrules",
  "allrulegroups"."ruleGroupId"
FROM test_dataset 
WHERE allrulegroups.excludedrules IS NOT NULL AND from_unixtime(timestamp/1000) > now() - interval '10' day
GROUP BY "httprequest"."clientip", "allrulegroups"."ruleGroupId", "allrulegroups"."excludedrules"
ORDER BY count DESC
```

**Example – Group all counted managed rules by the number of times matched**  
If you set rule group rule actions to Count in your web ACL configuration before October 27, 2022, AWS WAF saved your overrides in the web ACL JSON as `excludedRules`. Now, the JSON setting for overriding a rule to Count is in the `ruleActionOverrides` settings. For more information, see [Action overrides in rule groups](https://docs.aws.amazon.com/waf/latest/developerguide/web-acl-rule-group-override-options.html) in the *AWS WAF Developer Guide*. To extract managed rules in Count mode from the new log structure, query the `nonTerminatingMatchingRules` in the `ruleGroupList` section instead of the `excludedRules` field, as in the following example.  

```
SELECT
 count(*) AS count,
 httpsourceid,
 httprequest.clientip,
 t.rulegroupid, 
 t.nonTerminatingMatchingRules
FROM "waf_logs" 
CROSS JOIN UNNEST(rulegrouplist) AS t(t) 
WHERE action <> 'BLOCK' AND cardinality(t.nonTerminatingMatchingRules) > 0 
GROUP BY t.nonTerminatingMatchingRules, action, httpsourceid, httprequest.clientip, t.rulegroupid 
ORDER BY "count" DESC 
LIMIT 50
```

**Example – Group all counted custom rules by number of times matched**  
The following query groups all counted custom rules by the number of times matched.  

```
SELECT
  count(*) AS count,
         httpsourceid,
         httprequest.clientip,
         t.ruleid,
         t.action
FROM "waf_logs" 
CROSS JOIN UNNEST(nonterminatingmatchingrules) AS t(t) 
WHERE action <> 'BLOCK' AND cardinality(nonTerminatingMatchingRules) > 0 
GROUP BY t.ruleid, t.action, httpsourceid, httprequest.clientip 
ORDER BY "count" DESC
LIMIT 50
```

For information about the log locations for custom rules and managed rule groups, see [Monitoring and tuning](https://docs.aws.amazon.com/waf/latest/developerguide/web-acl-testing-activities.html) in the *AWS WAF Developer Guide*.

# Query using date and time
<a name="query-examples-waf-logs-date-time"></a>

The examples in this section include queries that use date and time values.
+ [Return the timestamp field in human-readable ISO 8601 format](#waf-example-return-human-readable-timestamp)
+ [Return records from the last 24 hours](#waf-example-return-records-last-24-hours)
+ [Return records for a specified date range and IP address](#waf-example-return-records-date-range-and-ip)
+ [For a specified date range, count the number of IP addresses in five-minute intervals](#waf-example-count-ip-addresses-in-date-range)
+ [Count the number of X-Forwarded-For IP addresses in the last 10 days](#waf-example-count-x-forwarded-for-ip)

**Example – Return the timestamp field in human-readable ISO 8601 format**  
The following query uses the `from_unixtime` and `to_iso8601` functions to return the `timestamp` field in human-readable ISO 8601 format (for example, `2019-12-13T23:40:12.000Z` instead of `1576280412771`). The query also returns the HTTP source name, source ID, and request.   

```
SELECT to_iso8601(from_unixtime(timestamp / 1000)) as time_ISO_8601,
       httpsourcename,
       httpsourceid,
       httprequest
FROM waf_logs
LIMIT 10;
```

**Example – Return records from the last 24 hours**  
The following query uses a filter in the `WHERE` clause to return the HTTP source name, HTTP source ID, and HTTP request fields for records from the last 24 hours.  

```
SELECT to_iso8601(from_unixtime(timestamp/1000)) AS time_ISO_8601, 
       httpsourcename, 
       httpsourceid, 
       httprequest 
FROM waf_logs
WHERE from_unixtime(timestamp/1000) > now() - interval '1' day
LIMIT 10;
```

**Example – Return records for a specified date range and IP address**  
The following query lists the records in a specified date range for a specified client IP address.  

```
SELECT * 
FROM waf_logs 
WHERE httprequest.clientip='53.21.198.66' AND "date" >= '2021/03/01' AND "date" < '2021/03/31'
```

**Example – For a specified date range, count the number of IP addresses in five-minute intervals**  
The following query counts, for a particular date range, the number of IP addresses in five-minute intervals.  

```
WITH test_dataset AS 
  (SELECT 
     format_datetime(from_unixtime((timestamp/1000) - ((minute(from_unixtime(timestamp / 1000))%5) * 60)),'yyyy-MM-dd HH:mm') AS five_minutes_ts,
     "httprequest"."clientip" 
     FROM waf_logs 
     WHERE "date" >= '2021/03/01' AND "date" < '2021/03/31')
SELECT five_minutes_ts,"clientip",count(*) ip_count 
FROM test_dataset 
GROUP BY five_minutes_ts,"clientip"
```

**Example – Count the number of X-Forwarded-For IP addresses in the last 10 days**  
The following query filters the request headers and counts the number of X-Forwarded-For IP addresses in the last 10 days.  

```
WITH test_dataset AS
  (SELECT header
   FROM waf_logs
   CROSS JOIN UNNEST (httprequest.headers) AS t(header)
   WHERE from_unixtime("timestamp"/1000) > now() - interval '10' DAY) 
SELECT header.value AS ip,
       count(*) AS COUNT 
FROM test_dataset 
WHERE header.name='X-Forwarded-For' 
GROUP BY header.value 
ORDER BY COUNT DESC
```
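The `X-Forwarded-For` header can carry a comma-separated chain of addresses when a request passes through multiple proxies. To count only the first (originating) address, one approach is to extract it with the `split_part` function, as in the following sketch:

```
WITH test_dataset AS
  (SELECT header
   FROM waf_logs
   CROSS JOIN UNNEST (httprequest.headers) AS t(header)
   WHERE from_unixtime("timestamp"/1000) > now() - interval '10' DAY) 
SELECT trim(split_part(header.value, ',', 1)) AS origin_ip,
       count(*) AS COUNT 
FROM test_dataset 
WHERE header.name='X-Forwarded-For' 
GROUP BY trim(split_part(header.value, ',', 1)) 
ORDER BY COUNT DESC
```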

For more information about date and time functions, see [Date and time functions and operators](https://trino.io/docs/current/functions/datetime.html) in the Trino documentation.

# Query for blocked requests or addresses
<a name="query-examples-waf-logs-blocked-requests"></a>

The examples in this section query for blocked requests or addresses.
+ [Extract the top 100 IP addresses blocked by a specified rule type](#waf-example-extract-top-100-blocked-ip-by-rule)
+ [Count the number of times a request from a specified country has been blocked](#waf-example-count-request-blocks-from-country)
+ [Count the number of times a request has been blocked, grouping by specific attributes](#waf-example-count-request-blocks-by-attribute)
+ [Count the number of times a specific terminating rule ID has been matched](#waf-example-count-terminating-rule-id-matches)
+ [Retrieve the top 100 IP addresses blocked during a specified date range](#waf-example-top-100-ip-addresses-blocked-for-date-range)

**Example – Extract the top 100 IP addresses blocked by a specified rule type**  
The following query extracts and counts the top 100 IP addresses that have been blocked by the `RATE_BASED` terminating rule during the specified date range.  

```
SELECT COUNT(httpRequest.clientIp) as count,
httpRequest.clientIp
FROM waf_logs
WHERE terminatingruletype='RATE_BASED' AND action='BLOCK' and "date" >= '2021/03/01'
AND "date" < '2021/03/31'
GROUP BY httpRequest.clientIp
ORDER BY count DESC
LIMIT 100
```

**Example – Count the number of times a request from a specified country has been blocked**  
The following query counts the number of times that a request arrived from an IP address that belongs to Ireland (IE) and was blocked by the `RATE_BASED` terminating rule.  

```
SELECT 
  COUNT(httpRequest.country) as count, 
  httpRequest.country 
FROM waf_logs
WHERE 
  terminatingruletype='RATE_BASED' AND 
  httpRequest.country='IE'
GROUP BY httpRequest.country
ORDER BY count
LIMIT 100;
```

**Example – Count the number of times a request has been blocked, grouping by specific attributes**  
The following query counts the number of times the request has been blocked, with results grouped by WebACL, RuleId, ClientIP, and HTTP Request URI.  

```
SELECT 
  COUNT(*) AS count,
  webaclid,
  terminatingruleid,
  httprequest.clientip,
  httprequest.uri
FROM waf_logs
WHERE action='BLOCK'
GROUP BY webaclid, terminatingruleid, httprequest.clientip, httprequest.uri
ORDER BY count DESC
LIMIT 100;
```

**Example – Count the number of times a specific terminating rule ID has been matched**  
The following query counts the number of times a specific terminating rule ID has been matched (`WHERE terminatingruleid='e9dd190d-7a43-4c06-bcea-409613d9506e'`). The query then groups the results by WebACL, Action, ClientIP, and HTTP Request URI.  

```
SELECT 
  COUNT(*) AS count,
  webaclid,
  action,
  httprequest.clientip,
  httprequest.uri
FROM waf_logs
WHERE terminatingruleid='e9dd190d-7a43-4c06-bcea-409613d9506e'
GROUP BY webaclid, action, httprequest.clientip, httprequest.uri
ORDER BY count DESC
LIMIT 100;
```

**Example – Retrieve the top 100 IP addresses blocked during a specified date range**  
The following query extracts the top 100 IP addresses that have been blocked for a specified date range. The query also lists the number of times the IP addresses have been blocked.  

```
SELECT "httprequest"."clientip", "count"(*) "ipcount", "httprequest"."country"
FROM waf_logs
WHERE "action" = 'BLOCK' and "date" >= '2021/03/01'
AND "date" < '2021/03/31'
GROUP BY "httprequest"."clientip", "httprequest"."country"
ORDER BY "ipcount" DESC limit 100
```

For information about querying Amazon S3 logs, see the following topics:
+ [How do I analyze my Amazon S3 server access logs using Athena?](https://aws.amazon.com/premiumsupport/knowledge-center/analyze-logs-athena/) in the AWS Knowledge Center
+ [Querying Amazon S3 access logs for requests using Amazon Athena](https://docs.aws.amazon.com/AmazonS3/latest/dev/using-s3-access-logs-to-identify-requests.html#querying-s3-access-logs-for-requests) in the Amazon Simple Storage Service User Guide
+ [Using AWS CloudTrail to identify Amazon S3 requests](https://docs.aws.amazon.com/AmazonS3/latest/dev/cloudtrail-request-identification.html) in the Amazon Simple Storage Service User Guide

# Query web server logs stored in Amazon S3
<a name="querying-web-server-logs"></a>

You can use Athena to query web server logs stored in Amazon S3. The topics in this section show you how to create tables in Athena to query web server logs in a variety of formats.

**Topics**
+ [Query Apache logs stored in Amazon S3](querying-apache-logs.md)
+ [Query Internet Information Services (IIS) logs stored in Amazon S3](querying-iis-logs.md)

# Query Apache logs stored in Amazon S3
<a name="querying-apache-logs"></a>

You can use Amazon Athena to query [Apache HTTP Server log files](https://httpd.apache.org/docs/2.4/logs.html) stored in your Amazon S3 account. This topic shows you how to create table schemas to query Apache [Access log](https://httpd.apache.org/docs/2.4/logs.html#accesslog) files in the common log format.

Fields in the common log format include the client IP address, client ID, user ID, request received timestamp, text of the client request, server status code, and size of the object returned to the client.

The following example data shows the Apache common log format.

```
198.51.100.7 - Li [10/Oct/2019:13:55:36 -0700] "GET /logo.gif HTTP/1.0" 200 232
198.51.100.14 - Jorge [24/Nov/2019:10:49:52 -0700] "GET /index.html HTTP/1.1" 200 2165
198.51.100.22 - Mateo [27/Dec/2019:11:38:12 -0700] "GET /about.html HTTP/1.1" 200 1287
198.51.100.9 - Nikki [11/Jan/2020:11:40:11 -0700] "GET /image.png HTTP/1.1" 404 230
198.51.100.2 - Ana [15/Feb/2019:10:12:22 -0700] "GET /favicon.ico HTTP/1.1" 404 30
198.51.100.13 - Saanvi [14/Mar/2019:11:40:33 -0700] "GET /intro.html HTTP/1.1" 200 1608
198.51.100.11 - Xiulan [22/Apr/2019:10:51:34 -0700] "GET /group/index.html HTTP/1.1" 200 1344
```

## Create a table in Athena for Apache logs
<a name="querying-apache-logs-creating-a-table-in-athena"></a>

Before you can query Apache logs stored in Amazon S3, you must create a table schema for Athena so that it can read the log data. To create an Athena table for Apache logs, you can use the [Grok SerDe](grok-serde.md). For more information about using the Grok SerDe, see [Writing grok custom classifiers](https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html#custom-classifier-grok) in the *AWS Glue Developer Guide*.

**To create a table in Athena for Apache web server logs**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. Paste the following DDL statement into the Athena Query Editor. Modify the values in `LOCATION 's3://amzn-s3-demo-bucket/apache-log-folder/'` to point to your Apache logs in Amazon S3.

   ```
   CREATE EXTERNAL TABLE apache_logs (
     client_ip string,
     client_id string,
     user_id string,
     request_received_time string,
     client_request string,
     server_status string,
     returned_obj_size string
     )
   ROW FORMAT SERDE
      'com.amazonaws.glue.serde.GrokSerDe'
   WITH SERDEPROPERTIES (
      'input.format'='^%{IPV4:client_ip} %{DATA:client_id} %{USERNAME:user_id} %{GREEDYDATA:request_received_time} %{QUOTEDSTRING:client_request} %{DATA:server_status} %{DATA:returned_obj_size}$'
      )
   STORED AS INPUTFORMAT
      'org.apache.hadoop.mapred.TextInputFormat'
   OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
   LOCATION
      's3://amzn-s3-demo-bucket/apache-log-folder/';
   ```

1. Run the query in the Athena console to register the `apache_logs` table. When the query completes, the logs are ready for you to query from Athena.

### Example queries
<a name="querying-apache-logs-example-select-queries"></a>

**Example – Filter for 404 errors**  
The following example query selects the request received time, text of the client request, and server status code from the `apache_logs` table. The `WHERE` clause filters for HTTP status code `404` (page not found).  

```
SELECT request_received_time, client_request, server_status
FROM apache_logs
WHERE server_status = '404'
```
The following image shows the results of the query in the Athena Query Editor.  

![\[Querying an Apache log from Athena for HTTP 404 entries.\]](http://docs.aws.amazon.com/athena/latest/ug/images/querying-apache-logs-1.png)


**Example – Filter for successful requests**  
The following example query selects the user ID, request received time, text of the client request, and server status code from the `apache_logs` table. The `WHERE` clause filters for HTTP status code `200` (successful).  

```
SELECT user_id, request_received_time, client_request, server_status
FROM apache_logs
WHERE server_status = '200'
```
The following image shows the results of the query in the Athena Query Editor.  

![\[Querying an Apache log from Athena for HTTP 200 entries.\]](http://docs.aws.amazon.com/athena/latest/ug/images/querying-apache-logs-2.png)


**Example – Filter by timestamp**  
Because the `request_received_time` column is stored as a string (for example, `[10/Oct/2023:00:00:00 -0700]`), you cannot compare it directly to an unquoted timestamp. The following example uses the `parse_datetime` function to convert both values to timestamps and returns the records whose request received time is later than the specified timestamp. The format string assumes the bracketed common log timestamp shown in the sample data.  

```
SELECT *
FROM apache_logs
WHERE parse_datetime(request_received_time, '[dd/MMM/yyyy:HH:mm:ss Z]')
      > parse_datetime('[10/Oct/2023:00:00:00 -0700]', '[dd/MMM/yyyy:HH:mm:ss Z]')
```

# Query Internet Information Services (IIS) logs stored in Amazon S3
<a name="querying-iis-logs"></a>

You can use Amazon Athena to query Microsoft Internet Information Services (IIS) web server logs stored in your Amazon S3 account. While IIS uses a [variety](https://docs.microsoft.com/en-us/previous-versions/iis/6.0-sdk/ms525807(v%3Dvs.90)) of log file formats, this topic shows you how to create table schemas to query W3C extended and IIS log file format logs from Athena.

Because the W3C extended and IIS log file formats use single character delimiters (spaces and commas, respectively) and do not have values enclosed in quotation marks, you can use the [LazySimpleSerDe](lazy-simple-serde.md) to create Athena tables for them.

**Topics**
+ [Query W3C extended log file format](querying-iis-logs-w3c-extended-log-file-format.md)
+ [Query IIS log file format](querying-iis-logs-iis-log-file-format.md)
+ [Query NCSA log file format](querying-iis-logs-ncsa-log-file-format.md)

# Query W3C extended log file format
<a name="querying-iis-logs-w3c-extended-log-file-format"></a>

The [W3C extended](https://docs.microsoft.com/en-us/windows/win32/http/w3c-logging) log file data format has space-separated fields. The fields that appear in W3C extended logs are determined by a web server administrator who chooses which log fields to include. The following example log data has the fields `date, time`, `c-ip`, `s-ip`, `cs-method`, `cs-uri-stem`, `sc-status`, `sc-bytes`, `cs-bytes`, `time-taken`, and `cs-version`.

```
2020-01-19 22:48:39 203.0.113.5 198.51.100.2 GET /default.html 200 540 524 157 HTTP/1.0
2020-01-19 22:49:40 203.0.113.10 198.51.100.12 GET /index.html 200 420 324 164 HTTP/1.0
2020-01-19 22:50:12 203.0.113.12 198.51.100.4 GET /image.gif 200 324 320 358 HTTP/1.0
2020-01-19 22:51:44 203.0.113.15 198.51.100.16 GET /faq.html 200 330 324 288 HTTP/1.0
```

## Create a table in Athena for W3C extended logs
<a name="querying-iis-logs-creating-a-table-in-athena-for-w3c-extended-logs"></a>

Before you can query your W3C extended logs, you must create a table schema so that Athena can read the log data.

**To create a table in Athena for W3C extended logs**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. Paste a DDL statement like the following into the Athena console, noting the following points:

   1. Add or remove the columns in the example to correspond to the fields in the logs that you want to query.

   1. Column names in the W3C extended log file format contain hyphens (`-`). However, in accordance with [Athena naming conventions](tables-databases-columns-names.md), the example `CREATE TABLE` statement replaces them with underscores (`_`).

   1. To specify the space delimiter, use `FIELDS TERMINATED BY ' '`.

   1. Modify the values in `LOCATION 's3://amzn-s3-demo-bucket/w3c-log-folder/'` to point to your W3C extended logs in Amazon S3.

   ```
   CREATE EXTERNAL TABLE `iis_w3c_logs`( 
     date_col string, 
     time_col string, 
     c_ip string,
     s_ip string,
     cs_method string, 
     cs_uri_stem string, 
     sc_status string,
     sc_bytes string,
     cs_bytes string,
     time_taken string,
     cs_version string
     ) 
   ROW FORMAT DELIMITED  
     FIELDS TERMINATED BY ' '  
   STORED AS INPUTFORMAT  
     'org.apache.hadoop.mapred.TextInputFormat'  
   OUTPUTFORMAT  
     'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' 
   LOCATION   's3://amzn-s3-demo-bucket/w3c-log-folder/'
   ```

1. Run the query in the Athena console to register the `iis_w3c_logs` table. When the query completes, the logs are ready for you to query from Athena.

## Example W3C extended log select query
<a name="querying-iis-logs-example-w3c-extended-log-select-query"></a>

The following example query selects the date, time, request target, and time taken for the request from the table `iis_w3c_logs`. The `WHERE` clause filters for cases in which the HTTP method is `GET` and the HTTP status code is `200` (successful).

```
SELECT date_col, time_col, cs_uri_stem, time_taken
FROM iis_w3c_logs
WHERE cs_method = 'GET' AND sc_status = '200'
```

The following image shows the results of the query in the Athena Query Editor.

![\[Example query results in Athena of W3C extended log files stored in Amazon S3.\]](http://docs.aws.amazon.com/athena/latest/ug/images/querying-iis-logs-1.png)


## Combine the date and time fields
<a name="querying-iis-logs-example-w3c-extended-log-combining-date-and-time"></a>

The space-delimited `date` and `time` fields are separate entries in the log source data, but you can combine them into a single timestamp. Use the [concat()](https://prestodb.io/docs/current/functions/string.html#concat) and [date_parse()](https://prestodb.io/docs/current/functions/datetime.html#date_parse) functions in a [SELECT](select.md) or [CREATE TABLE AS SELECT](create-table-as.md) query to concatenate and convert the `date` and `time` columns into timestamp format. The following example uses a CTAS query to create a new table with a `derived_timestamp` column.

```
CREATE TABLE iis_w3c_logs_w_timestamp AS
SELECT 
  date_parse(concat(date_col,' ', time_col),'%Y-%m-%d %H:%i:%s') as derived_timestamp, 
  c_ip, 
  s_ip, 
  cs_method, 
  cs_uri_stem, 
  sc_status, 
  sc_bytes, 
  cs_bytes, 
  time_taken, 
  cs_version
FROM iis_w3c_logs
```

After the table is created, you can query the new timestamp column directly, as in the following example.

```
SELECT derived_timestamp, cs_uri_stem, time_taken
FROM iis_w3c_logs_w_timestamp
WHERE cs_method = 'GET' AND sc_status = '200'
```

The following image shows the results of the query.

![\[W3C extended log file query results on a table with a derived timestamp column.\]](http://docs.aws.amazon.com/athena/latest/ug/images/querying-iis-logs-1a.png)


# Query IIS log file format
<a name="querying-iis-logs-iis-log-file-format"></a>

Unlike the W3C extended format, the [IIS log file format](https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2003/cc728311(v%3dws.10)) has a fixed set of fields and includes a comma as a delimiter. The LazySimpleSerDe treats the comma as the delimiter and the space after the comma as the beginning of the next field.

The following example shows sample data in the IIS log file format.

```
203.0.113.15, -, 2020-02-24, 22:48:38, W3SVC2, SERVER5, 198.51.100.4, 254, 501, 488, 200, 0, GET, /index.htm, -, 
203.0.113.4, -, 2020-02-24, 22:48:39, W3SVC2, SERVER6, 198.51.100.6, 147, 411, 388, 200, 0, GET, /about.html, -, 
203.0.113.11, -, 2020-02-24, 22:48:40, W3SVC2, SERVER7, 198.51.100.18, 170, 531, 468, 200, 0, GET, /image.png, -, 
203.0.113.8, -, 2020-02-24, 22:48:41, W3SVC2, SERVER8, 198.51.100.14, 125, 711, 868, 200, 0, GET, /intro.htm, -,
```

## Create a table in Athena for IIS log files
<a name="querying-iis-logs-creating-a-table-in-athena-for-iis-log-files"></a>

To query your IIS log file format logs in Amazon S3, you first create a table schema so that Athena can read the log data.

**To create a table in Athena for IIS log file format logs**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. Paste the following DDL statement into the Athena console, noting the following points:

   1. To specify the comma delimiter, use `FIELDS TERMINATED BY ','`.

   1. Modify the values in `LOCATION 's3://amzn-s3-demo-bucket/iis-log-file-folder/'` to point to your IIS log format log files in Amazon S3.

   ```
   CREATE EXTERNAL TABLE `iis_format_logs`(
   client_ip_address string,
   user_name string,
   request_date string,
   request_time string,
   service_and_instance string,
   server_name string,
   server_ip_address string,
   time_taken_millisec string,
   client_bytes_sent string,
   server_bytes_sent string,
   service_status_code string,
   windows_status_code string,
   request_type string,
   target_of_operation string,
   script_parameters string
      )
   ROW FORMAT DELIMITED
     FIELDS TERMINATED BY ','
   STORED AS INPUTFORMAT
     'org.apache.hadoop.mapred.TextInputFormat'
   OUTPUTFORMAT
     'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
   LOCATION
     's3://amzn-s3-demo-bucket/iis-log-file-folder/'
   ```

1. Run the query in the Athena console to register the `iis_format_logs` table. When the query completes, the logs are ready for you to query from Athena.

## Example IIS log format select query
<a name="querying-iis-logs-example-iis-log-format-select-query"></a>

The following example query selects the request date, request time, request target, and time taken in milliseconds from the table `iis_format_logs`. The `WHERE` clause filters for cases in which the request type is `GET` and the HTTP status code is `200` (successful). Note that the leading spaces in `' GET'` and `' 200'` are required for the query to succeed, because the LazySimpleSerDe includes the space that follows each comma in the field value.

```
SELECT request_date, request_time, target_of_operation, time_taken_millisec
FROM iis_format_logs
WHERE request_type = ' GET' AND service_status_code = ' 200'
```
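If you prefer not to rely on the leading spaces, you can instead apply the `trim` function to the column values, as in the following sketch:

```
SELECT request_date, request_time, target_of_operation, time_taken_millisec
FROM iis_format_logs
WHERE trim(request_type) = 'GET' AND trim(service_status_code) = '200'
```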

The following image shows the results of the query of the sample data.

![\[Example query results in Athena of IIS log file format log files stored in Amazon S3.\]](http://docs.aws.amazon.com/athena/latest/ug/images/querying-iis-logs-2.png)


# Query NCSA log file format
<a name="querying-iis-logs-ncsa-log-file-format"></a>

IIS also uses the [NCSA logging](https://docs.microsoft.com/en-us/windows/win32/http/ncsa-logging) format, which has a fixed number of fields in ASCII text format separated by spaces. The structure is similar to the common log format used for Apache access logs. Fields in the NCSA common log data format include the client IP address, client ID (not typically used), domain\user ID, request received timestamp, text of the client request, server status code, and size of the object returned to the client.

The following example shows data in the NCSA common log format as documented for IIS.

```
198.51.100.7 - ExampleCorp\Li [10/Oct/2019:13:55:36 -0700] "GET /logo.gif HTTP/1.0" 200 232
198.51.100.14 - AnyCompany\Jorge [24/Nov/2019:10:49:52 -0700] "GET /index.html HTTP/1.1" 200 2165
198.51.100.22 - ExampleCorp\Mateo [27/Dec/2019:11:38:12 -0700] "GET /about.html HTTP/1.1" 200 1287
198.51.100.9 - AnyCompany\Nikki [11/Jan/2020:11:40:11 -0700] "GET /image.png HTTP/1.1" 404 230
198.51.100.2 - ExampleCorp\Ana [15/Feb/2019:10:12:22 -0700] "GET /favicon.ico HTTP/1.1" 404 30
198.51.100.13 - AnyCompany\Saanvi [14/Mar/2019:11:40:33 -0700] "GET /intro.html HTTP/1.1" 200 1608
198.51.100.11 - ExampleCorp\Xiulan [22/Apr/2019:10:51:34 -0700] "GET /group/index.html HTTP/1.1" 200 1344
```

## Create a table in Athena for IIS NCSA logs
<a name="querying-iis-logs-ncsa-creating-a-table-in-athena"></a>

For your `CREATE TABLE` statement, you can use the [Grok SerDe](grok-serde.md) and a grok pattern similar to the one for [Apache web server logs](querying-apache-logs.md). Unlike Apache logs, the grok pattern uses `%{DATA:user_id}` for the third field instead of `%{USERNAME:user_id}` to account for the presence of the backslash in `domain\user_id`. For more information about using the Grok SerDe, see [Writing grok custom classifiers](https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html#custom-classifier-grok) in the *AWS Glue Developer Guide*.

**To create a table in Athena for IIS NCSA web server logs**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. Paste the following DDL statement into the Athena Query Editor. Modify the values in `LOCATION 's3://amzn-s3-demo-bucket/iis-ncsa-logs/'` to point to your IIS NCSA logs in Amazon S3.

   ```
   CREATE EXTERNAL TABLE iis_ncsa_logs(
     client_ip string,
     client_id string,
     user_id string,
     request_received_time string,
     client_request string,
     server_status string,
     returned_obj_size string
     )
   ROW FORMAT SERDE
      'com.amazonaws.glue.serde.GrokSerDe'
   WITH SERDEPROPERTIES (
      'input.format'='^%{IPV4:client_ip} %{DATA:client_id} %{DATA:user_id} %{GREEDYDATA:request_received_time} %{QUOTEDSTRING:client_request} %{DATA:server_status} %{DATA:returned_obj_size}$'
      )
   STORED AS INPUTFORMAT
      'org.apache.hadoop.mapred.TextInputFormat'
   OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
   LOCATION
      's3://amzn-s3-demo-bucket/iis-ncsa-logs/';
   ```

1. Run the query in the Athena console to register the `iis_ncsa_logs` table. When the query completes, the logs are ready for you to query from Athena.

## Example select queries for IIS NCSA logs
<a name="querying-iis-logs-ncsa-example-select-queries"></a>

**Example – Filtering for 404 errors**  
The following example query selects the request received time, text of the client request, and server status code from the `iis_ncsa_logs` table. The `WHERE` clause filters for HTTP status code `404` (page not found).  

```
SELECT request_received_time, client_request, server_status
FROM iis_ncsa_logs
WHERE server_status = '404'
```
The following image shows the results of the query in the Athena Query Editor.  

![\[Querying an IIS NCSA log from Athena for HTTP 404 entries.\]](http://docs.aws.amazon.com/athena/latest/ug/images/querying-iis-logs-3.png)


**Example – Filtering for successful requests from a particular domain**  
The following example query selects the user ID, request received time, text of the client request, and server status code from the `iis_ncsa_logs` table. The `WHERE` clause filters for requests with HTTP status code `200` (successful) from users in the `AnyCompany` domain.  

```
SELECT user_id, request_received_time, client_request, server_status
FROM iis_ncsa_logs
WHERE server_status = '200' AND user_id LIKE 'AnyCompany%'
```
The following image shows the results of the query in the Athena Query Editor.  

![\[Querying an IIS NCSA log from Athena for HTTP 200 entries.\]](http://docs.aws.amazon.com/athena/latest/ug/images/querying-iis-logs-4.png)
