# Query AWS service logs This section includes several procedures for using Amazon Athena to query popular datasets, such as AWS CloudTrail logs, Amazon CloudFront logs, Classic Load Balancer logs, Application Load Balancer logs, Amazon VPC flow logs, and Network Load Balancer logs. The tasks in this section use the Athena console, but you can also use other tools like the [Athena JDBC driver](connect-with-jdbc.md), the [AWS CLI](https://docs.aws.amazon.com/cli/latest/reference/athena/), or the [Amazon Athena API Reference](https://docs.aws.amazon.com/athena/latest/APIReference/). For information about using AWS CloudFormation to automatically create AWS service log tables, partitions, and example queries in Athena, see [Automating AWS service logs table creation and querying them with Amazon Athena](https://aws.amazon.com/blogs/big-data/automating-aws-service-logs-table-creation-and-querying-them-with-amazon-athena/) in the AWS Big Data Blog. For information about using a Python library for AWS Glue to create a common framework for processing AWS service logs and querying them in Athena, see [Easily query AWS service logs using Amazon Athena](https://aws.amazon.com/blogs/big-data/easily-query-aws-service-logs-using-amazon-athena/). The topics in this section assume that you have configured appropriate permissions to access Athena and the Amazon S3 bucket where the data to query should reside. For more information, see [Set up, administrative, and programmatic access](setting-up.md) and [Get started](getting-started.md). **Topics** + [Application Load Balancer](application-load-balancer-logs.md) + [Elastic Load Balancing](elasticloadbalancer-classic-logs.md) + [CloudFront](cloudfront-logs.md) + [CloudTrail](cloudtrail-logs.md) + [Amazon EMR](emr-logs.md) + [Global Accelerator](querying-global-accelerator-flow-logs.md) + [GuardDuty](querying-guardduty.md) + [Network Firewall](querying-network-firewall-logs.md) + [Network Load Balancer](networkloadbalancer-classic-logs.md) + [Route 53](querying-r53-resolver-logs.md) + [Amazon SES](querying-ses-logs.md) + [Amazon VPC](vpc-flow-logs.md) + [AWS WAF](waf-logs.md) # Query Application Load Balancer logs An Application Load Balancer is a load balancing option for Elastic Load Balancing that enables traffic distribution in a microservices deployment using containers. Querying Application Load Balancer logs allows you to see the source of traffic, latency, and bytes transferred to and from Elastic Load Balancing instances and backend applications. For more information, see [Access logs for your Application Load Balancer](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html) and [Connection logs for your Application Load Balancer](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-connection-logs.html) in the *User Guide for Application Load Balancers*. ## Prerequisites + Enable [access logging](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html) or [connection logging ](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-connection-logs.html) so that Application Load Balancer logs can be saved to your Amazon S3 bucket. + A database to hold the table that you will create for Athena. To create a database, you can use the Athena or AWS Glue console. For more information, see [Create databases in Athena](creating-databases.md) in this guide or [Working with databases on the AWS glue console](https://docs.aws.amazon.com/glue/latest/dg/console-databases.html) in the *AWS Glue Developer Guide*. **Topics** + [Prerequisites](#application-load-balancer-logs-prerequisites) + [Create the table for ALB access logs](create-alb-access-logs-table.md) + [Create the table for ALB access logs in Athena using partition projection](create-alb-access-logs-table-partition-projection.md) + [Example queries for ALB access logs](query-alb-access-logs-examples.md) + [Create the table for ALB connection logs](create-alb-connection-logs-table.md) + [Create the table for ALB connection logs in Athena using partition projection](create-alb-connection-logs-table-partition-projection.md) + [Example queries for ALB connection logs](query-alb-connection-logs-examples.md) + [Additional resources](application-load-balancer-logs-additional-resources.md) # Create the table for ALB access logs 1. Copy and paste the following `CREATE TABLE` statement into the query editor in the Athena console, and then modify it as necessary for your own log entry requirements. For information about getting started with the Athena console, see [Get started](getting-started.md). Replace the path in the `LOCATION` clause with your Amazon S3 access log folder location. For more information about access log file location, see [Access log files](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html#access-log-file-format) in the *User Guide for Application Load Balancers*. For information about each log file field, see [Access log entries](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html#access-log-entry-format) in the *User Guide for Application Load Balancers*. **Note** The following example `CREATE TABLE` statement includes the recently added `classification`, `classification_reason`, and `conn_trace_id` ('traceability ID', or TID) columns. To create a table for Application Load Balancer access logs that do not contain these entries, remove the corresponding columns from the `CREATE TABLE` statement and modify the regular expression accordingly. ``` CREATE EXTERNAL TABLE IF NOT EXISTS alb_access_logs ( type string, time string, elb string, client_ip string, client_port int, target_ip string, target_port int, request_processing_time double, target_processing_time double, response_processing_time double, elb_status_code int, target_status_code string, received_bytes bigint, sent_bytes bigint, request_verb string, request_url string, request_proto string, user_agent string, ssl_cipher string, ssl_protocol string, target_group_arn string, trace_id string, domain_name string, chosen_cert_arn string, matched_rule_priority string, request_creation_time string, actions_executed string, redirect_url string, lambda_error_reason string, target_port_list string, target_status_code_list string, classification string, classification_reason string, conn_trace_id string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = '1', 'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) (.*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-_]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\\s]+?)\" \"([^\\s]+)\" \"([^ ]*)\" \"([^ ]*)\" ?([^ ]*)? ?( .*)?' ) LOCATION 's3://amzn-s3-demo-bucket/access-log-folder-path/' ``` **Note** We suggest that the pattern *`?( .*)?`* at the end of the `input.regex` parameter always remain in place to handle future log entries in case new ALB log fields are added. 1. Run the query in the Athena console. After the query completes, Athena registers the `alb_access_logs` table, making the data in it ready for you to issue queries. # Create the table for ALB access logs in Athena using partition projection Because ALB access logs have a known structure whose partition scheme you can specify in advance, you can reduce query runtime and automate partition management by using the Athena partition projection feature. Partition projection automatically adds new partitions as new data is added. This removes the need for you to manually add partitions by using `ALTER TABLE ADD PARTITION`. The following example `CREATE TABLE` statement automatically uses partition projection on ALB access logs from a specified date until the present for a single AWS region. The statement is based on the example in the previous section but adds `PARTITIONED BY` and `TBLPROPERTIES` clauses to enable partition projection. In the `LOCATION` and `storage.location.template` clauses, replace the placeholders with values that identify the Amazon S3 bucket location of your ALB access logs. For more information about access log file location, see [Access log files](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html#access-log-file-format) in the *User Guide for Application Load Balancers*. For `projection.day.range`, replace *2022*/*01*/*01* with the starting date that you want to use. After you run the query successfully, you can query the table. You do not have to run `ALTER TABLE ADD PARTITION` to load the partitions. For information about each log file field, see [Access log entries](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html#access-log-entry-format). ``` CREATE EXTERNAL TABLE IF NOT EXISTS alb_access_logs ( type string, time string, elb string, client_ip string, client_port int, target_ip string, target_port int, request_processing_time double, target_processing_time double, response_processing_time double, elb_status_code int, target_status_code string, received_bytes bigint, sent_bytes bigint, request_verb string, request_url string, request_proto string, user_agent string, ssl_cipher string, ssl_protocol string, target_group_arn string, trace_id string, domain_name string, chosen_cert_arn string, matched_rule_priority string, request_creation_time string, actions_executed string, redirect_url string, lambda_error_reason string, target_port_list string, target_status_code_list string, classification string, classification_reason string, conn_trace_id string ) PARTITIONED BY ( day STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = '1', 'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) (.*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-_]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\\s]+?)\" \"([^\\s]+)\" \"([^ ]*)\" \"([^ ]*)\" ?([^ ]*)? ?( .*)?' ) LOCATION 's3://amzn-s3-demo-bucket/AWSLogs//elasticloadbalancing//' TBLPROPERTIES ( "projection.enabled" = "true", "projection.day.type" = "date", "projection.day.range" = "2022/01/01,NOW", "projection.day.format" = "yyyy/MM/dd", "projection.day.interval" = "1", "projection.day.interval.unit" = "DAYS", "storage.location.template" = "s3://amzn-s3-demo-bucket/AWSLogs//elasticloadbalancing//${day}" ) ``` For more information about partition projection, see [Use partition projection with Amazon Athena](partition-projection.md). **Note** We suggest that the pattern *?( .\$1)?* at the end of the `input.regex` parameter always remain in place to handle future log entries in case new ALB log fields are added. # Example queries for ALB access logs The following query counts the number of HTTP GET requests received by the load balancer grouped by the client IP address: ``` SELECT COUNT(request_verb) AS count, request_verb, client_ip FROM alb_access_logs GROUP BY request_verb, client_ip LIMIT 100; ``` Another query shows the URLs visited by Safari browser users: ``` SELECT request_url FROM alb_access_logs WHERE user_agent LIKE '%Safari%' LIMIT 10; ``` The following query shows records that have ELB status code values greater than or equal to 500. ``` SELECT * FROM alb_access_logs WHERE elb_status_code >= 500 ``` The following example shows how to parse the logs by `datetime`: ``` SELECT client_ip, sum(received_bytes) FROM alb_access_logs WHERE parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z') BETWEEN parse_datetime('2018-05-30-12:00:00','yyyy-MM-dd-HH:mm:ss') AND parse_datetime('2018-05-31-00:00:00','yyyy-MM-dd-HH:mm:ss') GROUP BY client_ip; ``` The following query queries the table that uses partition projection for all ALB access logs from the specified day. ``` SELECT * FROM alb_access_logs WHERE day = '2022/02/12' ``` # Create the table for ALB connection logs 1. Copy and paste the following example `CREATE TABLE` statement into the query editor in the Athena console, and then modify it as necessary for your own log entry requirements. For information about getting started with the Athena console, see [Get started](getting-started.md). Replace the path in the `LOCATION` clause with your Amazon S3 connection log folder location. For more information about connection log file location, see [Connection log files](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-connection-logs.html#connection-log-file-format) in the *User Guide for Application Load Balancers*. For information about each log file field, see [Connection log entries](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-connection-logs.html#connection-log-entry-format). ``` CREATE EXTERNAL TABLE IF NOT EXISTS alb_connection_logs ( time string, client_ip string, client_port int, listener_port int, tls_protocol string, tls_cipher string, tls_handshake_latency double, leaf_client_cert_subject string, leaf_client_cert_validity string, leaf_client_cert_serial_number string, tls_verify_status string, conn_trace_id string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = '1', 'input.regex' = '([^ ]*) ([^ ]*) ([0-9]*) ([0-9]*) ([A-Za-z0-9.-]*) ([^ ]*) ([-.0-9]*) \"([^\"]*)\" ([^ ]*) ([^ ]*) ([^ ]*) ?([^ ]*)?( .*)?' ) LOCATION 's3://amzn-s3-demo-bucket/connection-log-folder-path/' ``` 1. Run the query in the Athena console. After the query completes, Athena registers the `alb_connection_logs` table, making the data in it ready for you to issue queries. # Create the table for ALB connection logs in Athena using partition projection Because ALB connection logs have a known structure whose partition scheme you can specify in advance, you can reduce query runtime and automate partition management by using the Athena partition projection feature. Partition projection automatically adds new partitions as new data is added. This removes the need for you to manually add partitions by using `ALTER TABLE ADD PARTITION`. The following example `CREATE TABLE` statement automatically uses partition projection on ALB connection logs from a specified date until the present for a single AWS region. The statement is based on the example in the previous section but adds `PARTITIONED BY` and `TBLPROPERTIES` clauses to enable partition projection. In the `LOCATION` and `storage.location.template` clauses, replace the placeholders with values that identify the Amazon S3 bucket location of your ALB connection logs. For more information about connection log file location, see [Connection log files](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-connection-logs.html#connection-log-file-format) in the *User Guide for Application Load Balancers*. For `projection.day.range`, replace *2023*/*01*/*01* with the starting date that you want to use. After you run the query successfully, you can query the table. You do not have to run `ALTER TABLE ADD PARTITION` to load the partitions. For information about each log file field, see [Connection log entries](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-connection-logs.html#connection-log-entry-format). ``` CREATE EXTERNAL TABLE IF NOT EXISTS alb_connection_logs ( time string, client_ip string, client_port int, listener_port int, tls_protocol string, tls_cipher string, tls_handshake_latency double, leaf_client_cert_subject string, leaf_client_cert_validity string, leaf_client_cert_serial_number string, tls_verify_status string, conn_trace_id string ) PARTITIONED BY ( day STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = '1', 'input.regex' = '([^ ]*) ([^ ]*) ([0-9]*) ([0-9]*) ([A-Za-z0-9.-]*) ([^ ]*) ([-.0-9]*) \"([^\"]*)\" ([^ ]*) ([^ ]*) ([^ ]*) ?([^ ]*)?( .*)?' ) LOCATION 's3://amzn-s3-demo-bucket/AWSLogs//elasticloadbalancing//' TBLPROPERTIES ( "projection.enabled" = "true", "projection.day.type" = "date", "projection.day.range" = "2023/01/01,NOW", "projection.day.format" = "yyyy/MM/dd", "projection.day.interval" = "1", "projection.day.interval.unit" = "DAYS", "storage.location.template" = "s3://amzn-s3-demo-bucket/AWSLogs//elasticloadbalancing//${day}" ) ``` For more information about partition projection, see [Use partition projection with Amazon Athena](partition-projection.md). # Example queries for ALB connection logs The following query count occurrences where the value for `tls_verify_status` was not `'Success'`, grouped by client IP address: ``` SELECT DISTINCT client_ip, count() AS count FROM alb_connection_logs WHERE tls_verify_status != 'Success' GROUP BY client_ip ORDER BY count() DESC; ``` The following query searches occurrences where the value for `tls_handshake_latency` was over 2 seconds in the specified time range: ``` SELECT * FROM alb_connection_logs WHERE ( parse_datetime(time, 'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z') BETWEEN parse_datetime('2024-01-01-00:00:00', 'yyyy-MM-dd-HH:mm:ss') AND parse_datetime('2024-03-20-00:00:00', 'yyyy-MM-dd-HH:mm:ss') ) AND (tls_handshake_latency >= 2.0); ``` # Additional resources For more information about using ALB logs, see the following resources. + [How do I analyze my Application Load Balancer access logs using Amazon Athena](https://repost.aws/knowledge-center/athena-analyze-access-logs) in the *AWS Knowledge Center*. + For information about HTTP status codes in Elastic Load Balancing, see [Troubleshoot your application load balancers](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html) in the *User Guide for Application Load Balancers*. + [Catalog and analyze Application Load Balancer logs more efficiently with AWS Glue custom classifiers and Amazon Athena](https://aws.amazon.com/blogs/big-data/catalog-and-analyze-application-load-balancer-logs-more-efficiently-with-aws-glue-custom-classifiers-and-amazon-athena/) in the *AWS Big Data Blog*. # Query Classic Load Balancer logs Use Classic Load Balancer logs to analyze and understand traffic patterns to and from Elastic Load Balancing instances and backend applications. You can see the source of traffic, latency, and bytes that have been transferred. Before you analyze the Elastic Load Balancing logs, configure them for saving in the destination Amazon S3 bucket. For more information, see [Enable access logs for your Classic Load Balancer](https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/enable-access-logs.html). **To create the table for Elastic Load Balancing logs** 1. Copy and paste the following DDL statement into the Athena console. Check the [syntax ](https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/access-log-collection.html#access-log-entry-format) of the Elastic Load Balancing log records. You may need to update the following query to include the columns and the Regex syntax for latest version of the record. ``` CREATE EXTERNAL TABLE IF NOT EXISTS elb_logs ( timestamp string, elb_name string, request_ip string, request_port int, backend_ip string, backend_port int, request_processing_time double, backend_processing_time double, client_response_time double, elb_response_code string, backend_response_code string, received_bytes bigint, sent_bytes bigint, request_verb string, url string, protocol string, user_agent string, ssl_cipher string, ssl_protocol string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = '1', 'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$' ) LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/AWS_account_ID/elasticloadbalancing/'; ``` 1. Modify the `LOCATION` Amazon S3 bucket to specify the destination of your Elastic Load Balancing logs. 1. Run the query in the Athena console. After the query completes, Athena registers the `elb_logs` table, making the data in it ready for queries. For more information, see [Example queries](#query-elb-classic-example). ## Example queries Use a query similar to the following example. It lists the backend application servers that returned a `4XX` or `5XX` error response code. Use the `LIMIT` operator to limit the number of logs to query at a time. ``` SELECT timestamp, elb_name, backend_ip, backend_response_code FROM elb_logs WHERE backend_response_code LIKE '4%' OR backend_response_code LIKE '5%' LIMIT 100; ``` Use a subsequent query to sum up the response time of all the transactions grouped by the backend IP address and Elastic Load Balancing instance name. ``` SELECT sum(backend_processing_time) AS total_ms, elb_name, backend_ip FROM elb_logs WHERE backend_ip <> '' GROUP BY backend_ip, elb_name LIMIT 100; ``` For more information, see [Analyzing data in S3 using Athena](https://aws.amazon.com/blogs/big-data/analyzing-data-in-s3-using-amazon-athena/). # Query Amazon CloudFront logs You can configure Amazon CloudFront CDN to export Web distribution access logs to Amazon Simple Storage Service. Use these logs to explore users' surfing patterns across your web properties served by CloudFront. Before you begin querying the logs, enable Web distributions access log on your preferred CloudFront distribution. For information, see [Access logs](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html) in the *Amazon CloudFront Developer Guide*. Make a note of the Amazon S3 bucket in which you save these logs. **Topics** + [Create a table for CloudFront standard logs (legacy)](create-cloudfront-table-standard-logs.md) + [Create a table for CloudFront logs in Athena using manual partitioning with JSON](create-cloudfront-table-manual-json.md) + [Create a table for CloudFront logs in Athena using manual partitioning with Parquet](create-cloudfront-table-manual-parquet.md) + [Create a table for CloudFront logs in Athena using partition projection with JSON](create-cloudfront-table-partition-json.md) + [Create a table for CloudFront logs in Athena using partition projection with Parquet](create-cloudfront-table-partition-parquet.md) + [Create a table for CloudFront real-time logs](create-cloudfront-table-real-time-logs.md) + [Additional resources](cloudfront-logs-additional-resources.md) # Create a table for CloudFront standard logs (legacy) **Note** The following procedure works for the Web distribution access logs in CloudFront. It does not apply to streaming logs from RTMP distributions. **To create a table for CloudFront standard log file fields** 1. Copy and paste the following example DDL statement into the Query Editor in the Athena console. The example statement uses the log file fields documented in the [Standard log file fields](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html#BasicDistributionFileFormat) section of the *Amazon CloudFront Developer Guide*. Modify the `LOCATION` for the Amazon S3 bucket that stores your logs. For information about using the Query Editor, see [Get started](getting-started.md). This query specifies `ROW FORMAT DELIMITED` and `FIELDS TERMINATED BY '\t'` to indicate that the fields are delimited by tab characters. For `ROW FORMAT DELIMITED`, Athena uses the [LazySimpleSerDe](lazy-simple-serde.md) by default. The column `date` is escaped using backticks (`) because it is a reserved word in Athena. For information, see [Escape reserved keywords in queries](reserved-words.md). ``` CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_standard_logs ( `date` DATE, time STRING, x_edge_location STRING, sc_bytes BIGINT, c_ip STRING, cs_method STRING, cs_host STRING, cs_uri_stem STRING, sc_status INT, cs_referrer STRING, cs_user_agent STRING, cs_uri_query STRING, cs_cookie STRING, x_edge_result_type STRING, x_edge_request_id STRING, x_host_header STRING, cs_protocol STRING, cs_bytes BIGINT, time_taken FLOAT, x_forwarded_for STRING, ssl_protocol STRING, ssl_cipher STRING, x_edge_response_result_type STRING, cs_protocol_version STRING, fle_status STRING, fle_encrypted_fields INT, c_port INT, time_to_first_byte FLOAT, x_edge_detailed_result_type STRING, sc_content_type STRING, sc_content_len BIGINT, sc_range_start BIGINT, sc_range_end BIGINT ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3://amzn-s3-demo-bucket/' TBLPROPERTIES ( 'skip.header.line.count'='2' ) ``` 1. Run the query in Athena console. After the query completes, Athena registers the `cloudfront_standard_logs` table, making the data in it ready for you to issue queries. ## Example queries The following query adds up the number of bytes served by CloudFront between June 9 and June 11, 2018. Surround the date column name with double quotes because it is a reserved word. ``` SELECT SUM(bytes) AS total_bytes FROM cloudfront_standard_logs WHERE "date" BETWEEN DATE '2018-06-09' AND DATE '2018-06-11' LIMIT 100; ``` To eliminate duplicate rows (for example, duplicate empty rows) from the query results, you can use the `SELECT DISTINCT` statement, as in the following example. ``` SELECT DISTINCT * FROM cloudfront_standard_logs LIMIT 10; ``` # Create a table for CloudFront logs in Athena using manual partitioning with JSON **To create a table for CloudFront standard log file fields using a JSON format** 1. Copy and paste the following example DDL statement into the Query Editor in the Athena console. The example statement uses the log file fields documented in the [Standard log file fields](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html#BasicDistributionFileFormat) section of the *Amazon CloudFront Developer Guide*. Modify the `LOCATION` for the Amazon S3 bucket that stores your logs. This query uses OpenX JSON SerDe with the following SerDe properties to read JSON fields correctly in Athena. ``` CREATE EXTERNAL TABLE `cf_logs_manual_partition_json`( `date` string , `time` string , `x-edge-location` string , `sc-bytes` string , `c-ip` string , `cs-method` string , `cs(host)` string , `cs-uri-stem` string , `sc-status` string , `cs(referer)` string , `cs(user-agent)` string , `cs-uri-query` string , `cs(cookie)` string , `x-edge-result-type` string , `x-edge-request-id` string , `x-host-header` string , `cs-protocol` string , `cs-bytes` string , `time-taken` string , `x-forwarded-for` string , `ssl-protocol` string , `ssl-cipher` string , `x-edge-response-result-type` string , `cs-protocol-version` string , `fle-status` string , `fle-encrypted-fields` string , `c-port` string , `time-to-first-byte` string , `x-edge-detailed-result-type` string , `sc-content-type` string , `sc-content-len` string , `sc-range-start` string , `sc-range-end` string ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' WITH SERDEPROPERTIES ( 'paths'='c-ip,c-port,cs(Cookie),cs(Host),cs(Referer),cs(User-Agent),cs-bytes,cs-method,cs-protocol,cs-protocol-version,cs-uri-query,cs-uri-stem,date,fle-encrypted-fields,fle-status,sc-bytes,sc-content-len,sc-content-type,sc-range-end,sc-range-start,sc-status,ssl-cipher,ssl-protocol,time,time-taken,time-to-first-byte,x-edge-detailed-result-type,x-edge-location,x-edge-request-id,x-edge-response-result-type,x-edge-result-type,x-forwarded-for,x-host-header') STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://amzn-s3-demo-bucket/' ``` 1. Run the query in Athena console. After the query completes, Athena registers the `cf_logs_manual_partition_json` table, making the data in it ready for you to issue queries. ## Example queries The following query adds up the number of bytes served by CloudFront for January 15, 2025. ``` SELECT sum(cast("sc-bytes" as BIGINT)) as sc FROM cf_logs_manual_partition_json WHERE "date"='2025-01-15' ``` To eliminate duplicate rows (for example, duplicate empty rows) from the query results, you can use the `SELECT DISTINCT` statement, as in the following example. ``` SELECT DISTINCT * FROM cf_logs_manual_partition_json ``` # Create a table for CloudFront logs in Athena using manual partitioning with Parquet **To create a table for CloudFront standard log file fields using a Parquet format** 1. Copy and paste the following example DDL statement into the Query Editor in the Athena console. The example statement uses the log file fields documented in the [Standard log file fields](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html#BasicDistributionFileFormat) section of the *Amazon CloudFront Developer Guide*. This query uses ParquetHiveSerDe with the following SerDe properties to read Parquet fields correctly in Athena. ``` CREATE EXTERNAL TABLE `cf_logs_manual_partition_parquet`( `date` string, `time` string, `x_edge_location` string, `sc_bytes` string, `c_ip` string, `cs_method` string, `cs_host` string, `cs_uri_stem` string, `sc_status` string, `cs_referer` string, `cs_user_agent` string, `cs_uri_query` string, `cs_cookie` string, `x_edge_result_type` string, `x_edge_request_id` string, `x_host_header` string, `cs_protocol` string, `cs_bytes` string, `time_taken` string, `x_forwarded_for` string, `ssl_protocol` string, `ssl_cipher` string, `x_edge_response_result_type` string, `cs_protocol_version` string, `fle_status` string, `fle_encrypted_fields` string, `c_port` string, `time_to_first_byte` string, `x_edge_detailed_result_type` string, `sc_content_type` string, `sc_content_len` string, `sc_range_start` string, `sc_range_end` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3://amzn-s3-demo-bucket/' ``` 1. Run the query in Athena console. After the query completes, Athena registers the `cf_logs_manual_partition_parquet` table, making the data in it ready for you to issue queries. ## Example queries The following query adds up the number of bytes served by CloudFront for January 19, 2025. ``` SELECT sum(cast("sc_bytes" as BIGINT)) as sc FROM cf_logs_manual_partition_parquet WHERE "date"='2025-01-19' ``` To eliminate duplicate rows (for example, duplicate empty rows) from the query results, you can use the `SELECT DISTINCT` statement, as in the following example. ``` SELECT DISTINCT * FROM cf_logs_manual_partition_parquet ``` # Create a table for CloudFront logs in Athena using partition projection with JSON You can reduce query runtime and automate partition management with Athena partition projection feature. Partition projection automatically adds new partitions as new data is added. This removes the need for you to manually add partitions by using `ALTER TABLE ADD PARTITION`. The following example CREATE TABLE statement automatically uses partition projection on CloudFront logs from a specified CloudFront distribution until present for a single AWS Region. After you run the query successfully, you can query the table. ``` CREATE EXTERNAL TABLE `cloudfront_logs_pp`( `date` string, `time` string, `x-edge-location` string, `sc-bytes` string, `c-ip` string, `cs-method` string, `cs(host)` string, `cs-uri-stem` string, `sc-status` string, `cs(referer)` string, `cs(user-agent)` string, `cs-uri-query` string, `cs(cookie)` string, `x-edge-result-type` string, `x-edge-request-id` string, `x-host-header` string, `cs-protocol` string, `cs-bytes` string, `time-taken` string, `x-forwarded-for` string, `ssl-protocol` string, `ssl-cipher` string, `x-edge-response-result-type` string, `cs-protocol-version` string, `fle-status` string, `fle-encrypted-fields` string, `c-port` string, `time-to-first-byte` string, `x-edge-detailed-result-type` string, `sc-content-type` string, `sc-content-len` string, `sc-range-start` string, `sc-range-end` string) PARTITIONED BY( distributionid string, year int, month int, day int, hour int ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' WITH SERDEPROPERTIES ( 'paths'='c-ip,c-port,cs(Cookie),cs(Host),cs(Referer),cs(User-Agent),cs-bytes,cs-method,cs-protocol,cs-protocol-version,cs-uri-query,cs-uri-stem,date,fle-encrypted-fields,fle-status,sc-bytes,sc-content-len,sc-content-type,sc-range-end,sc-range-start,sc-status,ssl-cipher,ssl-protocol,time,time-taken,time-to-first-byte,x-edge-detailed-result-type,x-edge-location,x-edge-request-id,x-edge-response-result-type,x-edge-result-type,x-forwarded-for,x-host-header') STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/AWS_ACCOUNT_ID/CloudFront/' TBLPROPERTIES ( 'projection.distributionid.type'='enum', 'projection.distributionid.values'='E2Oxxxxxxxxxxx', 'projection.day.range'='01,31', 'projection.day.type'='integer', 'projection.day.digits'='2', 'projection.enabled'='true', 'projection.month.range'='01,12', 'projection.month.type'='integer', 'projection.month.digits'='2', 'projection.year.range'='2025,2026', 'projection.year.type'='integer', 'projection.hour.range'='00,23', 'projection.hour.type'='integer', 'projection.hour.digits'='2', 'storage.location.template'='s3://amzn-s3-demo-bucket/AWSLogs/AWS_ACCOUNT_ID/CloudFront/${distributionid}/${year}/${month}/${day}/${hour}/') ``` Following are some considerations for the properties used in the previous example. + **Table name** – The table name *`cloudfront_logs_pp`* is replaceable. You can change it to any name that you prefer. + **Location** – Modify `s3://amzn-s3-demo-bucket/AWSLogs/AWS_ACCOUNT_ID/` to point to your Amazon S3 bucket. + **Distribution IDs** – For `projection.distributionid.values`, you can specify multiple distribution IDs if you separate them with commas. For example, **, **. + **Year range** – In `projection.year.range`, you can define the range of years based on your data. For example, you can adjust it to any period, such as *2025*, *2026*. **Note** Including empty partitions, such as those for future dates (example: 2025-2040), can impact query performance. However, partition projection is designed to effectively handle future dates. To maintain optimal performance, ensure that partitions are managed thoughtfully and avoid excessive empty partitions when possible. + **Storage location template** – You must ensure to update the `storage.location.template` correctly based on the following CloudFront partitioning structure and S3 path. **** [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/create-cloudfront-table-partition-json.html) After you confirm that the CloudFront partitioning structure and S3 structure match the required patterns, update the `storage.location.template` as follows: ``` 'storage.location.template'='s3://amzn-s3-demo-bucket/AWSLogs/account_id/CloudFront/${distributionid}/folder2/${year}/${month}/${day}/${hour}/folder3/' ``` **Note** Proper configuration of the `storage.location.template` is crucial for ensuring correct data storage and retrieval. # Create a table for CloudFront logs in Athena using partition projection with Parquet The following example CREATE TABLE statement automatically uses partition projection on CloudFront logs in Parquet, from a specified CloudFront distribution until present for a single AWS Region. After you run the query successfully, you can query the table. ``` CREATE EXTERNAL TABLE `cloudfront_logs_parquet_pp`( `date` string, `time` string, `x_edge_location` string, `sc_bytes` string, `c_ip` string, `cs_method` string, `cs_host` string, `cs_uri_stem` string, `sc_status` string, `cs_referer` string, `cs_user_agent` string, `cs_uri_query` string, `cs_cookie` string, `x_edge_result_type` string, `x_edge_request_id` string, `x_host_header` string, `cs_protocol` string, `cs_bytes` string, `time_taken` string, `x_forwarded_for` string, `ssl_protocol` string, `ssl_cipher` string, `x_edge_response_result_type` string, `cs_protocol_version` string, `fle_status` string, `fle_encrypted_fields` string, `c_port` string, `time_to_first_byte` string, `x_edge_detailed_result_type` string, `sc_content_type` string, `sc_content_len` string, `sc_range_start` string, `sc_range_end` string) PARTITIONED BY( distributionid string, year int, month int, day int, hour int ) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/AWS_ACCOUNT_ID/CloudFront/' TBLPROPERTIES ( 'projection.distributionid.type'='enum', 'projection.distributionid.values'='E3OK0LPUNWWO3', 'projection.day.range'='01,31', 'projection.day.type'='integer', 'projection.day.digits'='2', 'projection.enabled'='true', 'projection.month.range'='01,12', 'projection.month.type'='integer', 'projection.month.digits'='2', 'projection.year.range'='2019,2025', 'projection.year.type'='integer', 'projection.hour.range'='01,12', 'projection.hour.type'='integer', 'projection.hour.digits'='2', 'storage.location.template'='s3://amzn-s3-demo-bucket/AWSLogs/AWS_ACCOUNT_ID/CloudFront/${distributionid}/${year}/${month}/${day}/${hour}/') ``` Following are some considerations for the properties used in the previous example. + **Table name** – The table name *`cloudfront_logs_pp`* is replaceable. You can change it to any name that you prefer. + **Location** – Modify `s3://amzn-s3-demo-bucket/AWSLogs/AWS_ACCOUNT_ID/` to point to your Amazon S3 bucket. + **Distribution IDs** – For `projection.distributionid.values`, you can specify multiple distribution IDs if you separate them with commas. For example, **, **. + **Year range** – In `projection.year.range`, you can define the range of years based on your data. For example, you can adjust it to any period, such as *2025*, *2026*. **Note** Including empty partitions, such as those for future dates (example: 2025-2040), can impact query performance. However, partition projection is designed to effectively handle future dates. To maintain optimal performance, ensure that partitions are managed thoughtfully and avoid excessive empty partitions when possible. + **Storage location template** – You must ensure to update the `storage.location.template` correctly based on the following CloudFront partitioning structure and S3 path. **** [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/create-cloudfront-table-partition-parquet.html) After you confirm that the CloudFront partitioning structure and S3 structure match the required patterns, update the `storage.location.template` as follows: ``` 'storage.location.template'='s3://amzn-s3-demo-bucket/AWSLogs/account_id/CloudFront/${distributionid}/folder2/${year}/${month}/${day}/${hour}/folder3/' ``` **Note** Proper configuration of the `storage.location.template` is crucial for ensuring correct data storage and retrieval. # Create a table for CloudFront real-time logs **To create a table for CloudFront real-time log file fields** 1. Copy and paste the following example DDL statement into the Query Editor in the Athena console. The example statement uses the log file fields documented in the [Real-time logs](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/real-time-logs.html) section of the *Amazon CloudFront Developer Guide*. Modify the `LOCATION` for the Amazon S3 bucket that stores your logs. For information about using the Query Editor, see [Get started](getting-started.md). This query specifies `ROW FORMAT DELIMITED` and `FIELDS TERMINATED BY '\t'` to indicate that the fields are delimited by tab characters. For `ROW FORMAT DELIMITED`, Athena uses the [LazySimpleSerDe](lazy-simple-serde.md) by default. The column `timestamp` is escaped using backticks (`) because it is a reserved word in Athena. For information, see [Escape reserved keywords in queries](reserved-words.md). The follow example contains all of the available fields. You can comment out or remove fields that you do not require. ``` CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_real_time_logs ( `timestamp` STRING, c_ip STRING, time_to_first_byte BIGINT, sc_status BIGINT, sc_bytes BIGINT, cs_method STRING, cs_protocol STRING, cs_host STRING, cs_uri_stem STRING, cs_bytes BIGINT, x_edge_location STRING, x_edge_request_id STRING, x_host_header STRING, time_taken BIGINT, cs_protocol_version STRING, c_ip_version STRING, cs_user_agent STRING, cs_referer STRING, cs_cookie STRING, cs_uri_query STRING, x_edge_response_result_type STRING, x_forwarded_for STRING, ssl_protocol STRING, ssl_cipher STRING, x_edge_result_type STRING, fle_encrypted_fields STRING, fle_status STRING, sc_content_type STRING, sc_content_len BIGINT, sc_range_start STRING, sc_range_end STRING, c_port BIGINT, x_edge_detailed_result_type STRING, c_country STRING, cs_accept_encoding STRING, cs_accept STRING, cache_behavior_path_pattern STRING, cs_headers STRING, cs_header_names STRING, cs_headers_count BIGINT, primary_distribution_id STRING, primary_distribution_dns_name STRING, origin_fbl STRING, origin_lbl STRING, asn STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3://amzn-s3-demo-bucket/' TBLPROPERTIES ( 'skip.header.line.count'='2' ) ``` 1. Run the query in Athena console. After the query completes, Athena registers the `cloudfront_real_time_logs` table, making the data in it ready for you to issue queries. # Additional resources For more information about using Athena to query CloudFront logs, see the following posts from the [AWS big data blog](https://aws.amazon.com/blogs/big-data/). [Easily query AWS service logs using Amazon Athena](https://aws.amazon.com/blogs/big-data/easily-query-aws-service-logs-using-amazon-athena/) (May 29, 2019). [Analyze your Amazon CloudFront access logs at scale](https://aws.amazon.com/blogs/big-data/analyze-your-amazon-cloudfront-access-logs-at-scale/) (December 21, 2018). [Build a serverless architecture to analyze Amazon CloudFront access logs using AWS Lambda, Amazon Athena, and Amazon Managed Service for Apache Flink](https://aws.amazon.com/blogs/big-data/build-a-serverless-architecture-to-analyze-amazon-cloudfront-access-logs-using-aws-lambda-amazon-athena-and-amazon-kinesis-analytics/) (May 26, 2017). # Query AWS CloudTrail logs AWS CloudTrail is a service that records AWS API calls and events for Amazon Web Services accounts. CloudTrail logs include details about any API calls made to your AWS services, including the console. CloudTrail generates encrypted log files and stores them in Amazon S3. For more information, see the [AWS CloudTrail User Guide](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html). **Note** If you want to perform SQL queries on CloudTrail event information across accounts, regions, and dates, consider using CloudTrail Lake. CloudTrail Lake is an AWS alternative to creating trails that aggregates information from an enterprise into a single, searchable event data store. Instead of using Amazon S3 bucket storage, it stores events in a data lake, which allows richer, faster queries. You can use it to create SQL queries that search events across organizations, regions, and within custom time ranges. Because you perform CloudTrail Lake queries within the CloudTrail console itself, using CloudTrail Lake does not require Athena. For more information, see the [CloudTrail Lake](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-lake.html) documentation. Using Athena with CloudTrail logs is a powerful way to enhance your analysis of AWS service activity. For example, you can use queries to identify trends and further isolate activity by attributes, such as source IP address or user. A common application is to use CloudTrail logs to analyze operational activity for security and compliance. For information about a detailed example, see the AWS Big Data Blog post, [Analyze security, compliance, and operational activity using AWS CloudTrail and Amazon Athena](https://aws.amazon.com/blogs/big-data/aws-cloudtrail-and-amazon-athena-dive-deep-to-analyze-security-compliance-and-operational-activity/). You can use Athena to query these log files directly from Amazon S3, specifying the `LOCATION` of log files. You can do this one of two ways: + By creating tables for CloudTrail log files directly from the CloudTrail console. + By manually creating tables for CloudTrail log files in the Athena console. **Topics** + [Understand CloudTrail logs and Athena tables](create-cloudtrail-table-understanding.md) + [Use the CloudTrail console to create an Athena table for CloudTrail logs](create-cloudtrail-table-ct.md) + [Create a table for CloudTrail logs in Athena using manual partitioning](create-cloudtrail-table.md) + [Create a table for an organization wide trail using manual partitioning](create-cloudtrail-table-org-wide-trail.md) + [Create the table for CloudTrail logs in Athena using partition projection](create-cloudtrail-table-partition-projection.md) + [Example CloudTrail log queries](query-examples-cloudtrail-logs.md) # Understand CloudTrail logs and Athena tables Before you begin creating tables, you should understand a little more about CloudTrail and how it stores data. This can help you create the tables that you need, whether you create them from the CloudTrail console or from Athena. CloudTrail saves logs as JSON text files in compressed gzip format (`*.json.gz`). The location of the log files depends on how you set up trails, the AWS Region or Regions in which you are logging, and other factors. For more information about where logs are stored, the JSON structure, and the record file contents, see the following topics in the [AWS CloudTrail User Guide](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html): + [Finding your CloudTrail log files](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-find-log-files.html) + [CloudTrail Log File examples](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-log-file-examples.html) + [CloudTrail record contents](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-event-reference-record-contents.html) + [CloudTrail event reference](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-event-reference.html) To collect logs and save them to Amazon S3, enable CloudTrail from the AWS Management Console. For more information, see [Creating a trail](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-create-a-trail-using-the-console-first-time.html) in the *AWS CloudTrail User Guide*. # Use the CloudTrail console to create an Athena table for CloudTrail logs You can create a non-partitioned Athena table for querying CloudTrail logs directly from the CloudTrail console. Creating an Athena table from the CloudTrail console requires that you be logged in with a role that has sufficient permissions to create tables in Athena. **Note** You cannot use the CloudTrail console to create an Athena table for organization trail logs. Instead, create the table manually using the Athena console so that you can specify the correct storage location. For information about organization trails, see [Creating a trail for an organization](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/creating-trail-organization.html) in the *AWS CloudTrail User Guide*. + For information about setting up permissions for Athena, see [Set up, administrative, and programmatic access](setting-up.md). + For information about creating a table with partitions, see [Create a table for CloudTrail logs in Athena using manual partitioning](create-cloudtrail-table.md). **To create an Athena table for a CloudTrail trail using the CloudTrail console** 1. Open the CloudTrail console at [https://console.aws.amazon.com/cloudtrail/](https://console.aws.amazon.com/cloudtrail/). 1. In the navigation pane, choose **Event history**. 1. Choose **Create Athena table**. ![\[Choose Create Athena table\]](http://docs.aws.amazon.com/athena/latest/ug/images/cloudtrail-logs-create-athena-table.png) 1. For **Storage location**, use the down arrow to select the Amazon S3 bucket where log files are stored for the trail to query. **Note** To find the name of the bucket that is associated with a trail, choose **Trails** in the CloudTrail navigation pane and view the trail's **S3 bucket** column. To see the Amazon S3 location for the bucket, choose the link for the bucket in the **S3 bucket** column. This opens the Amazon S3 console to the CloudTrail bucket location. 1. Choose **Create table**. The table is created with a default name that includes the name of the Amazon S3 bucket. # Create a table for CloudTrail logs in Athena using manual partitioning You can manually create tables for CloudTrail log files in the Athena console, and then run queries in Athena. **To create an Athena table for a CloudTrail trail using the Athena console** 1. Copy and paste the following DDL statement into the Athena console query editor, then modify it according to your requirements. Note that because CloudTrail log files are not an ordered stack trace of public API calls, the fields in the log files do not appear in any specific order. ``` CREATE EXTERNAL TABLE cloudtrail_logs ( eventversion STRING, useridentity STRUCT< type:STRING, principalid:STRING, arn:STRING, accountid:STRING, invokedby:STRING, accesskeyid:STRING, username:STRING, onbehalfof: STRUCT< userid: STRING, identitystorearn: STRING>, sessioncontext:STRUCT< attributes:STRUCT< mfaauthenticated:STRING, creationdate:STRING>, sessionissuer:STRUCT< type:STRING, principalid:STRING, arn:STRING, accountid:STRING, username:STRING>, ec2roledelivery:string, webidfederationdata: STRUCT< federatedprovider: STRING, attributes: map> > >, eventtime STRING, eventsource STRING, eventname STRING, awsregion STRING, sourceipaddress STRING, useragent STRING, errorcode STRING, errormessage STRING, requestparameters STRING, responseelements STRING, additionaleventdata STRING, requestid STRING, eventid STRING, resources ARRAY>, eventtype STRING, apiversion STRING, readonly STRING, recipientaccountid STRING, serviceeventdetails STRING, sharedeventid STRING, vpcendpointid STRING, vpcendpointaccountid STRING, eventcategory STRING, addendum STRUCT< reason:STRING, updatedfields:STRING, originalrequestid:STRING, originaleventid:STRING>, sessioncredentialfromconsole STRING, edgedevicedetails STRING, tlsdetails STRUCT< tlsversion:STRING, ciphersuite:STRING, clientprovidedhostheader:STRING> ) PARTITIONED BY (region string, year string, month string, day string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/Account_ID/'; ``` **Note** We suggest using the `org.apache.hive.hcatalog.data.JsonSerDe` shown in the example. Although a `com.amazon.emr.hive.serde.CloudTrailSerde` exists, it does not currently handle some of the newer CloudTrail fields. 1. (Optional) Remove any fields not required for your table. If you need to read only a certain set of columns, your table definition can exclude the other columns. 1. Modify `s3://amzn-s3-demo-bucket/AWSLogs/Account_ID/` to point to the Amazon S3 bucket that contains the log data that you want to query. The example uses a `LOCATION` value of logs for a particular account, but you can use the degree of specificity that suits your application. For example: + To analyze data from multiple accounts, you can roll back the `LOCATION` specifier to indicate all `AWSLogs` by using `LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/'`. + To analyze data from a specific date, account, and Region, use `LOCATION 's3://amzn-s3-demo-bucket/123456789012/CloudTrail/us-east-1/2016/03/14/'.` + To analyze network activity data instead of management events, replace `/CloudTrail/` in the `LOCATION` clause with `/CloudTrail-NetworkActivity/`. Using the highest level in the object hierarchy gives you the greatest flexibility when you query using Athena. 1. Verify that fields are listed correctly. For more information about the full list of fields in a CloudTrail record, see [CloudTrail record contents](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-event-reference-record-contents.html). The example `CREATE TABLE` statement in Step 1 uses the [Hive JSON SerDe](hive-json-serde.md). In the example, the fields `requestparameters`, `responseelements`, and `additionaleventdata` are listed as type `STRING` in the query, but are `STRUCT` data type used in JSON. Therefore, to get data out of these fields, use `JSON_EXTRACT` functions. For more information, see [Extract JSON data from strings](extracting-data-from-JSON.md). For performance improvements, the example partitions the data by AWS Region, year, month, and day. 1. Run the `CREATE TABLE` statement in the Athena console. 1. Use the [ALTER TABLE ADD PARTITION](alter-table-add-partition.md) command to load the partitions so that you can query them, as in the following example. ``` ALTER TABLE table_name ADD PARTITION (region='us-east-1', year='2019', month='02', day='01') LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/Account_ID/CloudTrail/us-east-1/2019/02/01/' ``` # Create a table for an organization wide trail using manual partitioning To create a table for organization wide CloudTrail log files in Athena, follow the steps in [Create a table for CloudTrail logs in Athena using manual partitioning](create-cloudtrail-table.md), but make the modifications noted in the following procedure. **To create an Athena table for organization wide CloudTrail logs** 1. In the `CREATE TABLE` statement, modify the `LOCATION` clause to include the organization ID, as in the following example: ``` LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/organization_id/' ``` 1. In the `PARTITIONED BY` clause, add an entry for the account ID as a string, as in the following example: ``` PARTITIONED BY (account string, region string, year string, month string, day string) ``` The following example shows the combined result: ``` ... PARTITIONED BY (account string, region string, year string, month string, day string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/organization_id/Account_ID/CloudTrail/' ``` 1. In the `ALTER TABLE` statement `ADD PARTITION` clause, include the account ID, as in the following example: ``` ALTER TABLE table_name ADD PARTITION (account='111122223333', region='us-east-1', year='2022', month='08', day='08') ``` 1. In the `ALTER TABLE` statement `LOCATION` clause, include the organization ID, the account ID, and the partition that you want to add, as in the following example: ``` LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/organization_id/Account_ID/CloudTrail/us-east-1/2022/08/08/' ``` The following example `ALTER TABLE` statement shows the combined result: ``` ALTER TABLE table_name ADD PARTITION (account='111122223333', region='us-east-1', year='2022', month='08', day='08') LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/organization_id/111122223333/CloudTrail/us-east-1/2022/08/08/' ``` Note that, in a large organization, using this method to manually add and maintain a partition for each organization account ID can be cumbersome. In such a scenario, consider using CloudTrail Lake rather than Athena. CloudTrail Lake in such a scenario offers the following advantages: + Automatically aggregates logs across an entire organization + Does not require setting up or maintaining partitions or an Athena table + Queries are run directly in the CloudTrail console + Uses a SQL-compatible query language For more information, see [Working with AWS CloudTrail Lake](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-lake.html) in the *AWS CloudTrail User Guide*. # Create the table for CloudTrail logs in Athena using partition projection Because CloudTrail logs have a known structure whose partition scheme you can specify in advance, you can reduce query runtime and automate partition management by using the Athena partition projection feature. Partition projection automatically adds new partitions as new data is added. This removes the need for you to manually add partitions by using `ALTER TABLE ADD PARTITION`. The following example `CREATE TABLE` statement automatically uses partition projection on CloudTrail logs from a specified date until the present for a single AWS Region. In the `LOCATION` and `storage.location.template` clauses, replace the *bucket*, *account-id*, and *aws-region* placeholders with correspondingly identical values. For `projection.timestamp.range`, replace *2020*/*01*/*01* with the starting date that you want to use. After you run the query successfully, you can query the table. You do not have to run `ALTER TABLE ADD PARTITION` to load the partitions. ``` CREATE EXTERNAL TABLE cloudtrail_logs_pp( eventversion STRING, useridentity STRUCT< type: STRING, principalid: STRING, arn: STRING, accountid: STRING, invokedby: STRING, accesskeyid: STRING, username: STRING, onbehalfof: STRUCT< userid: STRING, identitystorearn: STRING>, sessioncontext: STRUCT< attributes: STRUCT< mfaauthenticated: STRING, creationdate: STRING>, sessionissuer: STRUCT< type: STRING, principalid: STRING, arn: STRING, accountid: STRING, username: STRING>, ec2roledelivery:string, webidfederationdata: STRUCT< federatedprovider: STRING, attributes: map> > >, eventtime STRING, eventsource STRING, eventname STRING, awsregion STRING, sourceipaddress STRING, useragent STRING, errorcode STRING, errormessage STRING, requestparameters STRING, responseelements STRING, additionaleventdata STRING, requestid STRING, eventid STRING, readonly STRING, resources ARRAY>, eventtype STRING, apiversion STRING, recipientaccountid STRING, serviceeventdetails STRING, sharedeventid STRING, vpcendpointid STRING, vpcendpointaccountid STRING, eventcategory STRING, addendum STRUCT< reason:STRING, updatedfields:STRING, originalrequestid:STRING, originaleventid:STRING>, sessioncredentialfromconsole STRING, edgedevicedetails STRING, tlsdetails STRUCT< tlsversion:STRING, ciphersuite:STRING, clientprovidedhostheader:STRING> ) PARTITIONED BY ( `timestamp` string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/account-id/CloudTrail/aws-region' TBLPROPERTIES ( 'projection.enabled'='true', 'projection.timestamp.format'='yyyy/MM/dd', 'projection.timestamp.interval'='1', 'projection.timestamp.interval.unit'='DAYS', 'projection.timestamp.range'='2020/01/01,NOW', 'projection.timestamp.type'='date', 'storage.location.template'='s3://amzn-s3-demo-bucket/AWSLogs/account-id/CloudTrail/aws-region/${timestamp}') ``` For more information about partition projection, see [Use partition projection with Amazon Athena](partition-projection.md). # Example CloudTrail log queries The following example shows a portion of a query that returns all anonymous (unsigned) requests from the table created for CloudTrail event logs. This query selects those requests where `useridentity.accountid` is anonymous, and `useridentity.arn` is not specified: ``` SELECT * FROM cloudtrail_logs WHERE eventsource = 's3.amazonaws.com' AND eventname in ('GetObject') AND useridentity.accountid = 'anonymous' AND useridentity.arn IS NULL AND requestparameters LIKE '%[your bucket name ]%'; ``` For more information, see the AWS Big Data blog post [Analyze security, compliance, and operational activity using AWS CloudTrail and Amazon Athena](https://aws.amazon.com/blogs/big-data/aws-cloudtrail-and-amazon-athena-dive-deep-to-analyze-security-compliance-and-operational-activity/). ## Query nested fields in CloudTrail logs Because the `userIdentity` and `resources` fields are nested data types, querying them requires special treatment. The `userIdentity` object consists of nested `STRUCT` types. These can be queried using a dot to separate the fields, as in the following example: ``` SELECT eventsource, eventname, useridentity.sessioncontext.attributes.creationdate, useridentity.sessioncontext.sessionissuer.arn FROM cloudtrail_logs WHERE useridentity.sessioncontext.sessionissuer.arn IS NOT NULL ORDER BY eventsource, eventname LIMIT 10 ``` The `resources` field is an array of `STRUCT` objects. For these arrays, use `CROSS JOIN UNNEST` to unnest the array so that you can query its objects. The following example returns all rows where the resource ARN ends in `example/datafile.txt`. For readability, the [replace](https://prestodb.io/docs/current/functions/string.html#replace) function removes the initial `arn:aws:s3:::` substring from the ARN. ``` SELECT awsregion, replace(unnested.resources_entry.ARN,'arn:aws:s3:::') as s3_resource, eventname, eventtime, useragent FROM cloudtrail_logs t CROSS JOIN UNNEST(t.resources) unnested (resources_entry) WHERE unnested.resources_entry.ARN LIKE '%example/datafile.txt' ORDER BY eventtime ``` The following example queries for `DeleteBucket` events. The query extracts the name of the bucket and the account ID to which the bucket belongs from the `resources` object. ``` SELECT awsregion, replace(unnested.resources_entry.ARN,'arn:aws:s3:::') as deleted_bucket, eventtime AS time_deleted, useridentity.username, unnested.resources_entry.accountid as bucket_acct_id FROM cloudtrail_logs t CROSS JOIN UNNEST(t.resources) unnested (resources_entry) WHERE eventname = 'DeleteBucket' ORDER BY eventtime ``` For more information about unnesting, see [Filter arrays](filtering-arrays.md). ## Tips for querying CloudTrail logs Consider the following when exploring CloudTrail log data: + Before querying the logs, verify that your logs table looks the same as the one in [Create a table for CloudTrail logs in Athena using manual partitioning](create-cloudtrail-table.md). If it is not the first table, delete the existing table using the following command: `DROP TABLE cloudtrail_logs`. + After you drop the existing table, re-create it. For more information, see [Create a table for CloudTrail logs in Athena using manual partitioning](create-cloudtrail-table.md). Verify that fields in your Athena query are listed correctly. For information about the full list of fields in a CloudTrail record, see [CloudTrail record contents](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-event-reference-record-contents.html). If your query includes fields in JSON formats, such as `STRUCT`, extract data from JSON. For more information, see [Extract JSON data from strings](extracting-data-from-JSON.md). Some suggestions for issuing queries against your CloudTrail table: + Start by looking at which users called which API operations and from which source IP addresses. + Use the following basic SQL query as your template. Paste the query to the Athena console and run it. ``` SELECT useridentity.arn, eventname, sourceipaddress, eventtime FROM cloudtrail_logs LIMIT 100; ``` + Modify the query to further explore your data. + To improve performance, include the `LIMIT` clause to return a specified subset of rows. # Query Amazon EMR logs Amazon EMR and big data applications that run on Amazon EMR produce log files. Log files are written to the [primary node](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-master-core-task-nodes.html), and you can also configure Amazon EMR to archive log files to Amazon S3 automatically. You can use Amazon Athena to query these logs to identify events and trends for applications and clusters. For more information about the types of log files in Amazon EMR and saving them to Amazon S3, see [View log files](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html) in the *Amazon EMR Management Guide*. **Topics** + [Create and query a basic table based on Amazon EMR log files](emr-create-table.md) + [Create and query a partitioned table based on Amazon EMR logs](emr-create-table-partitioned.md) # Create and query a basic table based on Amazon EMR log files The following example creates a basic table, `myemrlogs`, based on log files saved to `s3://aws-logs-123456789012-us-west-2/elasticmapreduce/j-2ABCDE34F5GH6/elasticmapreduce/`. The Amazon S3 location used in the examples below reflects the pattern of the default log location for an EMR cluster created by Amazon Web Services account *123456789012* in Region *us-west-2*. If you use a custom location, the pattern is s3://amzn-s3-demo-bucket/*ClusterID*. For information about creating a partitioned table to potentially improve query performance and reduce data transfer, see [Create and query a partitioned table based on Amazon EMR logs](emr-create-table-partitioned.md). ``` CREATE EXTERNAL TABLE `myemrlogs`( `data` string COMMENT 'from deserializer') ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://aws-logs-123456789012-us-west-2/elasticmapreduce/j-2ABCDE34F5GH6' ``` ## Example queries The following example queries can be run on the `myemrlogs` table created by the previous example. **Example – Query step logs for occurrences of ERROR, WARN, INFO, EXCEPTION, FATAL, or DEBUG** ``` SELECT data, "$PATH" FROM "default"."myemrlogs" WHERE regexp_like("$PATH",'s-86URH188Z6B1') AND regexp_like(data, 'ERROR|WARN|INFO|EXCEPTION|FATAL|DEBUG') limit 100; ``` **Example – Query a specific instance log, i-00b3c0a839ece0a9c, for ERROR, WARN, INFO, EXCEPTION, FATAL, or DEBUG** ``` SELECT "data", "$PATH" AS filepath FROM "default"."myemrlogs" WHERE regexp_like("$PATH",'i-00b3c0a839ece0a9c') AND regexp_like("$PATH",'state') AND regexp_like(data, 'ERROR|WARN|INFO|EXCEPTION|FATAL|DEBUG') limit 100; ``` **Example – Query presto application logs for ERROR, WARN, INFO, EXCEPTION, FATAL, or DEBUG** ``` SELECT "data", "$PATH" AS filepath FROM "default"."myemrlogs" WHERE regexp_like("$PATH",'presto') AND regexp_like(data, 'ERROR|WARN|INFO|EXCEPTION|FATAL|DEBUG') limit 100; ``` **Example – Query Namenode application logs for ERROR, WARN, INFO, EXCEPTION, FATAL, or DEBUG** ``` SELECT "data", "$PATH" AS filepath FROM "default"."myemrlogs" WHERE regexp_like("$PATH",'namenode') AND regexp_like(data, 'ERROR|WARN|INFO|EXCEPTION|FATAL|DEBUG') limit 100; ``` **Example – Query all logs by date and hour for ERROR, WARN, INFO, EXCEPTION, FATAL, or DEBUG** ``` SELECT distinct("$PATH") AS filepath FROM "default"."myemrlogs" WHERE regexp_like("$PATH",'2019-07-23-10') AND regexp_like(data, 'ERROR|WARN|INFO|EXCEPTION|FATAL|DEBUG') limit 100; ``` # Create and query a partitioned table based on Amazon EMR logs These examples use the same log location to create an Athena table, but the table is partitioned, and a partition is then created for each log location. For more information, see [Partition your data](partitions.md). The following query creates the partitioned table named `mypartitionedemrlogs`: ``` CREATE EXTERNAL TABLE `mypartitionedemrlogs`( `data` string COMMENT 'from deserializer') partitioned by (logtype string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://aws-logs-123456789012-us-west-2/elasticmapreduce/j-2ABCDE34F5GH6' ``` The following query statements then create table partitions based on sub-directories for different log types that Amazon EMR creates in Amazon S3: ``` ALTER TABLE mypartitionedemrlogs ADD PARTITION (logtype='containers') LOCATION 's3://aws-logs-123456789012-us-west-2/elasticmapreduce/j-2ABCDE34F5GH6/containers/' ``` ``` ALTER TABLE mypartitionedemrlogs ADD PARTITION (logtype='hadoop-mapreduce') LOCATION 's3://aws-logs-123456789012-us-west-2/elasticmapreduce/j-2ABCDE34F5GH6/hadoop-mapreduce/' ``` ``` ALTER TABLE mypartitionedemrlogs ADD PARTITION (logtype='hadoop-state-pusher') LOCATION 's3://aws-logs-123456789012-us-west-2/elasticmapreduce/j-2ABCDE34F5GH6/hadoop-state-pusher/' ``` ``` ALTER TABLE mypartitionedemrlogs ADD PARTITION (logtype='node') LOCATION 's3://aws-logs-123456789012-us-west-2/elasticmapreduce/j-2ABCDE34F5GH6/node/' ``` ``` ALTER TABLE mypartitionedemrlogs ADD PARTITION (logtype='steps') LOCATION 's3://aws-logs-123456789012-us-west-2/elasticmapreduce/j-2ABCDE34F5GH6/steps/' ``` After you create the partitions, you can run a `SHOW PARTITIONS` query on the table to confirm: ``` SHOW PARTITIONS mypartitionedemrlogs; ``` ## Example queries The following examples demonstrate queries for specific log entries use the table and partitions created by the examples above. **Example – Querying application application\$11561661818238\$10002 logs in the containers partition for ERROR or WARN** ``` SELECT data, "$PATH" FROM "default"."mypartitionedemrlogs" WHERE logtype='containers' AND regexp_like("$PATH",'application_1561661818238_0002') AND regexp_like(data, 'ERROR|WARN') limit 100; ``` **Example – Querying the hadoop-Mapreduce partition for job job\$11561661818238\$10004 and failed reduces** ``` SELECT data, "$PATH" FROM "default"."mypartitionedemrlogs" WHERE logtype='hadoop-mapreduce' AND regexp_like(data,'job_1561661818238_0004|Failed Reduces') limit 100; ``` **Example – Querying Hive logs in the node partition for query ID 056e0609-33e1-4611-956c-7a31b42d2663** ``` SELECT data, "$PATH" FROM "default"."mypartitionedemrlogs" WHERE logtype='node' AND regexp_like("$PATH",'hive') AND regexp_like(data,'056e0609-33e1-4611-956c-7a31b42d2663') limit 100; ``` **Example – Querying resourcemanager logs in the node partition for application 1567660019320\$10001\$101\$1000001** ``` SELECT data, "$PATH" FROM "default"."mypartitionedemrlogs" WHERE logtype='node' AND regexp_like(data,'resourcemanager') AND regexp_like(data,'1567660019320_0001_01_000001') limit 100 ``` # Query AWS Global Accelerator flow logs You can use AWS Global Accelerator to create accelerators that direct network traffic to optimal endpoints over the AWS global network. For more information about Global Accelerator, see [What is AWS Global Accelerator](https://docs.aws.amazon.com/global-accelerator/latest/dg/what-is-global-accelerator.html). Global Accelerator flow logs enable you to capture information about the IP address traffic going to and from network interfaces in your accelerators. Flow log data is published to Amazon S3, where you can retrieve and view your data. For more information, see [Flow logs in AWS Global Accelerator](https://docs.aws.amazon.com/global-accelerator/latest/dg/monitoring-global-accelerator.flow-logs.html). You can use Athena to query your Global Accelerator flow logs by creating a table that specifies their location in Amazon S3. **To create the table for Global Accelerator flow logs** 1. Copy and paste the following DDL statement into the Athena console. This query specifies *ROW FORMAT DELIMITED* and omits specifying a [SerDe](serde-reference.md), which means that the query uses the [`LazySimpleSerDe`](lazy-simple-serde.md). In this query, fields are terminated by a space. ``` CREATE EXTERNAL TABLE IF NOT EXISTS aga_flow_logs ( version string, account string, acceleratorid string, clientip string, clientport int, gip string, gipport int, endpointip string, endpointport int, protocol string, ipaddresstype string, numpackets bigint, numbytes int, starttime int, endtime int, action string, logstatus string, agasourceip string, agasourceport int, endpointregion string, agaregion string, direction string, vpc_id string, reject_reason string ) PARTITIONED BY (dt string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LOCATION 's3://amzn-s3-demo-bucket/prefix/AWSLogs/account_id/globalaccelerator/region/' TBLPROPERTIES ("skip.header.line.count"="1"); ``` 1. Modify the `LOCATION` value to point to the Amazon S3 bucket that contains your log data. ``` 's3://amzn-s3-demo-bucket/prefix/AWSLogs/account_id/globalaccelerator/region_code/' ``` 1. Run the query in the Athena console. After the query completes, Athena registers the `aga_flow_logs` table, making the data in it available for queries. 1. Create partitions to read the data, as in the following sample query. The query creates a single partition for a specified date. Replace the placeholders for date and location. ``` ALTER TABLE aga_flow_logs ADD PARTITION (dt='YYYY-MM-dd') LOCATION 's3://amzn-s3-demo-bucket/prefix/AWSLogs/account_id/globalaccelerator/region_code/YYYY/MM/dd'; ``` ## Example queries for AWS Global Accelerator flow logs **Example – List the requests that pass through a specific edge location** The following example query lists requests that passed through the LHR edge location. Use the `LIMIT` operator to limit the number of logs to query at one time. ``` SELECT clientip, agaregion, protocol, action FROM aga_flow_logs WHERE agaregion LIKE 'LHR%' LIMIT 100; ``` **Example – List the endpoint IP addresses that receive the most HTTPS requests** To see which endpoint IP addresses are receiving the highest number of HTTPS requests, use the following query. This query counts the number of packets received on HTTPS port 443, groups them by destination IP address, and returns the top 10 IP addresses. ``` SELECT SUM(numpackets) AS packetcount, endpointip FROM aga_flow_logs WHERE endpointport = 443 GROUP BY endpointip ORDER BY packetcount DESC LIMIT 10; ``` # Query Amazon GuardDuty findings [Amazon GuardDuty](https://aws.amazon.com/guardduty/) is a security monitoring service for helping to identify unexpected and potentially unauthorized or malicious activity in your AWS environment. When it detects unexpected and potentially malicious activity, GuardDuty generates security [findings](https://docs.aws.amazon.com/guardduty/latest/ug/guardduty_findings.html) that you can export to Amazon S3 for storage and analysis. After you export your findings to Amazon S3, you can use Athena to query them. This article shows how to create a table in Athena for your GuardDuty findings and query them. For more information about Amazon GuardDuty, see the [Amazon GuardDuty User Guide](https://docs.aws.amazon.com/guardduty/latest/ug/). ## Prerequisites + Enable the GuardDuty feature for exporting findings to Amazon S3. For steps, see [Exporting findings](https://docs.aws.amazon.com/guardduty/latest/ug/guardduty_exportfindings.html) in the Amazon GuardDuty User Guide. ## Create a table in Athena for GuardDuty findings To query your GuardDuty findings from Athena, you must create a table for them. **To create a table in Athena for GuardDuty findings** 1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home). 1. Paste the following DDL statement into the Athena console. Modify the values in `LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/account-id/GuardDuty/'` to point to your GuardDuty findings in Amazon S3. ``` CREATE EXTERNAL TABLE `gd_logs` ( `schemaversion` string, `accountid` string, `region` string, `partition` string, `id` string, `arn` string, `type` string, `resource` string, `service` string, `severity` string, `createdat` string, `updatedat` string, `title` string, `description` string) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/account-id/GuardDuty/' TBLPROPERTIES ('has_encrypted_data'='true') ``` **Note** The SerDe expects each JSON document to be on a single line of text with no line termination characters separating the fields in the record. If the JSON text is in pretty print format, you may receive an error message like HIVE\$1CURSOR\$1ERROR: Row is not a valid JSON Object or HIVE\$1CURSOR\$1ERROR: JsonParseException: Unexpected end-of-input: expected close marker for OBJECT when you attempt to query the table after you create it. For more information, see [JSON Data Files](https://github.com/rcongiu/Hive-JSON-Serde#json-data-files) in the OpenX SerDe documentation on GitHub. 1. Run the query in the Athena console to register the `gd_logs` table. When the query completes, the findings are ready for you to query from Athena. ## Example queries The following examples show how to query GuardDuty findings from Athena. **Example – DNS data exfiltration** The following query returns information about Amazon EC2 instances that might be exfiltrating data through DNS queries. ``` SELECT title, severity, type, id AS FindingID, accountid, region, createdat, updatedat, json_extract_scalar(service, '$.count') AS Count, json_extract_scalar(resource, '$.instancedetails.instanceid') AS InstanceID, json_extract_scalar(service, '$.action.actiontype') AS DNS_ActionType, json_extract_scalar(service, '$.action.dnsrequestaction.domain') AS DomainName, json_extract_scalar(service, '$.action.dnsrequestaction.protocol') AS protocol, json_extract_scalar(service, '$.action.dnsrequestaction.blocked') AS blocked FROM gd_logs WHERE type = 'Trojan:EC2/DNSDataExfiltration' ORDER BY severity DESC ``` **Example – Unauthorized IAM user access** The following query returns all `UnauthorizedAccess:IAMUser` finding types for an IAM Principal from all regions. ``` SELECT title, severity, type, id, accountid, region, createdat, updatedat, json_extract_scalar(service, '$.count') AS Count, json_extract_scalar(resource, '$.accesskeydetails.username') AS IAMPrincipal, json_extract_scalar(service,'$.action.awsapicallaction.api') AS APIActionCalled FROM gd_logs WHERE type LIKE '%UnauthorizedAccess:IAMUser%' ORDER BY severity desc; ``` ## Tips for querying GuardDuty findings When you create your query, keep the following points in mind. + To extract data from nested JSON fields, use the Presto `json_extract` or `json_extract_scalar` functions. For more information, see [Extract JSON data from strings](extracting-data-from-JSON.md). + Make sure that all characters in the JSON fields are in lower case. + For information about downloading query results, see [Download query results files using the Athena console](saving-query-results.md). # Query AWS Network Firewall logs AWS Network Firewall is a managed service that you can use to deploy essential network protections for your Amazon Virtual Private Cloud instances. AWS Network Firewall works together with AWS Firewall Manager so you can build policies based on AWS Network Firewall rules and then centrally apply those policies across your VPCs and accounts. For more information about AWS Network Firewall, see [AWS Network Firewall](https://aws.amazon.com/network-firewall/). You can configure AWS Network Firewall logging for traffic that you forward to your firewall's stateful rules engine. Logging gives you detailed information about network traffic, including the time that the stateful engine received a packet, detailed information about the packet, and any stateful rule action taken against the packet. The logs are published to the log destination that you've configured, where you can retrieve and view them. For more information, see [Logging network traffic from AWS Network Firewall](https://docs.aws.amazon.com/network-firewall/latest/developerguide/firewall-logging.html) in the *AWS Network Firewall Developer Guide*. **Topics** + [Create and query a table for alert logs](querying-network-firewall-logs-sample-alert-logs-table.md) + [Create and query a table for netflow logs](querying-network-firewall-logs-sample-netflow-logs-table.md) # Create and query a table for alert logs 1. Modify the following sample DDL statement to conform to the structure of your alert log. You may need to update the statement to include the columns for the latest version of the logs. For more information, see [Contents of a firewall log](https://docs.aws.amazon.com/network-firewall/latest/developerguide/firewall-logging.html#firewall-logging-contents) in the *AWS Network Firewall Developer Guide*. ``` CREATE EXTERNAL TABLE network_firewall_alert_logs ( firewall_name string, availability_zone string, event_timestamp string, event struct< timestamp:string, flow_id:bigint, event_type:string, src_ip:string, src_port:int, dest_ip:string, dest_port:int, proto:string, app_proto:string, sni:string, tls_inspected:boolean, tls_error:struct< error_message:string>, revocation_check:struct< leaf_cert_fpr:string, status:string, action:string>, alert:struct< alert_id:string, alert_type:string, action:string, signature_id:int, rev:int, signature:string, category:string, severity:int, rule_name:string, alert_name:string, alert_severity:string, alert_description:string, file_name:string, file_hash:string, packet_capture:string, reference_links:array >, src_country:string, dest_country:string, src_hostname:string, dest_hostname:string, user_agent:string, url:string > ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION 's3://amzn-s3-demo-bucket/path_to_alert_logs_folder/'; ``` 1. Modify the `LOCATION` clause to specify the folder for your logs in Amazon S3. 1. Run your `CREATE TABLE` query in the Athena query editor. After the query completes, Athena registers the `network_firewall_alert_logs` table, making the data that it points to ready for queries. ## Example query The sample alert log query in this section filters for events in which TLS inspection was performed that have alerts with a severity level of 2 or higher. The query uses aliases to create output column headings that show the `struct` that the column belongs to. For example, the column heading for the `event.alert.category` field is `event_alert_category` instead of just `category`. To customize the column names further, you can modify the aliases to suit your preferences. For example, you can use underscores or other separators to delimit the `struct` names and field names. Remember to modify column names and `struct` references based on your table definition and on the fields that you want in the query result. ``` SELECT firewall_name, availability_zone, event_timestamp, event.timestamp AS event_timestamp, event.flow_id AS event_flow_id, event.event_type AS event_type, event.src_ip AS event_src_ip, event.src_port AS event_src_port, event.dest_ip AS event_dest_ip, event.dest_port AS event_dest_port, event.proto AS event_protol, event.app_proto AS event_app_proto, event.sni AS event_sni, event.tls_inspected AS event_tls_inspected, event.tls_error.error_message AS event_tls_error_message, event.revocation_check.leaf_cert_fpr AS event_revocation_leaf_cert, event.revocation_check.status AS event_revocation_check_status, event.revocation_check.action AS event_revocation_check_action, event.alert.alert_id AS event_alert_alert_id, event.alert.alert_type AS event_alert_alert_type, event.alert.action AS event_alert_action, event.alert.signature_id AS event_alert_signature_id, event.alert.rev AS event_alert_rev, event.alert.signature AS event_alert_signature, event.alert.category AS event_alert_category, event.alert.severity AS event_alert_severity, event.alert.rule_name AS event_alert_rule_name, event.alert.alert_name AS event_alert_alert_name, event.alert.alert_severity AS event_alert_alert_severity, event.alert.alert_description AS event_alert_alert_description, event.alert.file_name AS event_alert_file_name, event.alert.file_hash AS event_alert_file_hash, event.alert.packet_capture AS event_alert_packet_capture, event.alert.reference_links AS event_alert_reference_links, event.src_country AS event_src_country, event.dest_country AS event_dest_country, event.src_hostname AS event_src_hostname, event.dest_hostname AS event_dest_hostname, event.user_agent AS event_user_agent, event.url AS event_url FROM network_firewall_alert_logs WHERE event.alert.severity >= 2 AND event.tls_inspected = true LIMIT 10; ``` # Create and query a table for netflow logs 1. Modify the following sample DDL statement to conform to the structure of your netflow logs. You may need to update the statement to include the columns for the latest version of the logs. For more information, see [Contents of a firewall log](https://docs.aws.amazon.com/network-firewall/latest/developerguide/firewall-logging.html#firewall-logging-contents) in the *AWS Network Firewall Developer Guide*. ``` CREATE EXTERNAL TABLE network_firewall_netflow_logs ( firewall_name string, availability_zone string, event_timestamp string, event struct< timestamp:string, flow_id:bigint, event_type:string, src_ip:string, src_port:int, dest_ip:string, dest_port:int, proto:string, app_proto:string, tls_inspected:boolean, netflow:struct< pkts:int, bytes:bigint, start:string, `end`:string, age:int, min_ttl:int, max_ttl:int, tcp_flags:struct< syn:boolean, fin:boolean, rst:boolean, psh:boolean, ack:boolean, urg:boolean > > > ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION 's3://amzn-s3-demo-bucket/path_to_netflow_logs_folder/'; ``` 1. Modify the `LOCATION` clause to specify the folder for your logs in Amazon S3. 1. Run the `CREATE TABLE` query in the Athena query editor. After the query completes, Athena registers the `network_firewall_netflow_logs` table, making the data that it points to ready for queries. ## Example query The sample netflow log query in this section filters for events in which TLS inspection was performed. The query uses aliases to create output column headings that show the `struct` that the column belongs to. For example, the column heading for the `event.netflow.bytes` field is `event_netflow_bytes` instead of just `bytes`. To customize the column names further, you can modify the aliases to suit your preferences. For example, you can use underscores or other separators to delimit the `struct` names and field names. Remember to modify column names and `struct` references based on your table definition and on the fields that you want in the query result. ``` SELECT event.src_ip AS event_src_ip, event.dest_ip AS event_dest_ip, event.proto AS event_proto, event.app_proto AS event_app_proto, event.tls_inspected AS event_tls_inspected, event.netflow.pkts AS event_netflow_pkts, event.netflow.bytes AS event_netflow_bytes, event.netflow.tcp_flags.syn AS event_netflow_tcp_flags_syn FROM network_firewall_netflow_logs WHERE event.tls_inspected = true ``` # Query Network Load Balancer logs Use Athena to analyze and process logs from Network Load Balancer. These logs receive detailed information about the Transport Layer Security (TLS) requests sent to the Network Load Balancer. You can use these access logs to analyze traffic patterns and troubleshoot issues. Before you analyze the Network Load Balancer access logs, enable and configure them for saving in the destination Amazon S3 bucket. For more information, and for information about each Network Load Balancer access log entry, see [ Access logs for your Network Load Balancer](https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-access-logs.html). **To create the table for Network Load Balancer logs** 1. Copy and paste the following DDL statement into the Athena console. Check the [syntax ](https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-access-logs.html#access-log-file-format) of the Network Load Balancer log records. Update the statement as required to include the columns and the regex corresponding to your log records. ``` CREATE EXTERNAL TABLE IF NOT EXISTS nlb_tls_logs ( type string, version string, time string, elb string, listener_id string, client_ip string, client_port int, target_ip string, target_port int, tcp_connection_time_ms double, tls_handshake_time_ms double, received_bytes bigint, sent_bytes bigint, incoming_tls_alert int, cert_arn string, certificate_serial string, tls_cipher_suite string, tls_protocol_version string, tls_named_group string, domain_name string, alpn_fe_protocol string, alpn_be_protocol string, alpn_client_preference_list string, tls_connection_creation_time string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = '1', 'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*):([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-0-9]*) ([-0-9]*) ([-0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ?([^ ]*)?( .*)?' ) LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/AWS_account_ID/elasticloadbalancing/region'; ``` 1. Modify the `LOCATION` Amazon S3 bucket to specify the destination of your Network Load Balancer logs. 1. Run the query in the Athena console. After the query completes, Athena registers the `nlb_tls_logs` table, making the data in it ready for queries. ## Example queries To see how many times a certificate is used, use a query similar to this example: ``` SELECT count(*) AS ct, cert_arn FROM "nlb_tls_logs" GROUP BY cert_arn; ``` The following query shows how many users are using a TLS version earlier than 1.3: ``` SELECT tls_protocol_version, COUNT(tls_protocol_version) AS num_connections, client_ip FROM "nlb_tls_logs" WHERE tls_protocol_version < 'tlsv13' GROUP BY tls_protocol_version, client_ip; ``` Use the following query to identify connections that take a long TLS handshake time: ``` SELECT * FROM "nlb_tls_logs" ORDER BY tls_handshake_time_ms DESC LIMIT 10; ``` Use the following query to identify and count which TLS protocol versions and cipher suites have been negotiated in the past 30 days. ``` SELECT tls_cipher_suite, tls_protocol_version, COUNT(*) AS ct FROM "nlb_tls_logs" WHERE from_iso8601_timestamp(time) > current_timestamp - interval '30' day AND NOT tls_protocol_version = '-' GROUP BY tls_cipher_suite, tls_protocol_version ORDER BY ct DESC; ``` # Query Amazon Route 53 resolver query logs You can create Athena tables for your Amazon Route 53 Resolver query logs and query them from Athena. Route 53 Resolver query logging is for logging of DNS queries made by resources within a VPC, on-premises resources that use inbound resolver endpoints, queries that use an outbound Resolver endpoint for recursive DNS resolution, and queries that use Route 53 Resolver DNS firewall rules to block, allow, or monitor a domain list. For more information about Resolver query logging, see [Resolver query logging](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/resolver-query-logs.html) in the *Amazon Route 53 Developer Guide*. For information about each of the fields in the logs, see [Values that appear in resolver query logs](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/resolver-query-logs-format.html) in the *Amazon Route 53 Developer Guide*. **Topics** + [Create the table for resolver query logs](querying-r53-resolver-logs-creating-the-table.md) + [Use partition projection](querying-r53-resolver-logs-partitioning-example.md) + [Example queries](querying-r53-resolver-logs-example-queries.md) # Create the table for resolver query logs You can use the Query Editor in the Athena console to create and query a table for your Route 53 Resolver query logs. **To create and query an Athena table for Route 53 resolver query logs** 1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home). 1. In the Athena Query Editor, enter the following `CREATE TABLE` statement. Replace the `LOCATION` clause values with those corresponding to the location of your Resolver logs in Amazon S3. ``` CREATE EXTERNAL TABLE r53_rlogs ( version string, account_id string, region string, vpc_id string, query_timestamp string, query_name string, query_type string, query_class string, rcode string, answers array< struct< Rdata: string, Type: string, Class: string> >, srcaddr string, srcport int, transport string, srcids struct< instance: string, resolver_endpoint: string >, firewall_rule_action string, firewall_rule_group_id string, firewall_domain_list_id string ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/aws_account_id/vpcdnsquerylogs/{vpc-id}/' ``` Because Resolver query log data is in JSON format, the CREATE TABLE statement uses a [JSON SerDe library](json-serde.md) to analyze the data. **Note** The SerDe expects each JSON document to be on a single line of text with no line termination characters separating the fields in the record. If the JSON text is in pretty print format, you may receive an error message like HIVE\$1CURSOR\$1ERROR: Row is not a valid JSON Object or HIVE\$1CURSOR\$1ERROR: JsonParseException: Unexpected end-of-input: expected close marker for OBJECT when you attempt to query the table after you create it. For more information, see [JSON Data Files](https://github.com/rcongiu/Hive-JSON-Serde#json-data-files) in the OpenX SerDe documentation on GitHub. 1. Choose **Run query**. The statement creates an Athena table named `r53_rlogs` whose columns represent each of the fields in your Resolver log data. 1. In the Athena console Query Editor, run the following query to verify that your table has been created. ``` SELECT * FROM "r53_rlogs" LIMIT 10 ``` # Use partition projection The following example shows a `CREATE TABLE` statement for Resolver query logs that uses partition projection and is partitioned by vpc and by date. For more information about partition projection, see [Use partition projection with Amazon Athena](partition-projection.md). ``` CREATE EXTERNAL TABLE r53_rlogs ( version string, account_id string, region string, vpc_id string, query_timestamp string, query_name string, query_type string, query_class string, rcode string, answers array< struct< Rdata: string, Type: string, Class: string> >, srcaddr string, srcport int, transport string, srcids struct< instance: string, resolver_endpoint: string >, firewall_rule_action string, firewall_rule_group_id string, firewall_domain_list_id string ) PARTITIONED BY ( `date` string, `vpc` string ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://amzn-s3-demo-bucket/route53-query-logging/AWSLogs/aws_account_id/vpcdnsquerylogs/' TBLPROPERTIES( 'projection.enabled' = 'true', 'projection.vpc.type' = 'enum', 'projection.vpc.values' = 'vpc-6446ae02', 'projection.date.type' = 'date', 'projection.date.range' = '2023/06/26,NOW', 'projection.date.format' = 'yyyy/MM/dd', 'projection.date.interval' = '1', 'projection.date.interval.unit' = 'DAYS', 'storage.location.template' = 's3://amzn-s3-demo-bucket/route53-query-logging/AWSLogs/aws_account_id/vpcdnsquerylogs/${vpc}/${date}/' ) ``` # Example queries The following examples show some queries that you can perform from Athena on your Resolver query logs. ## Example 1 - query logs in descending query\$1timestamp order The following query displays log results in descending `query_timestamp` order. ``` SELECT * FROM "r53_rlogs" ORDER BY query_timestamp DESC ``` ## Example 2 - query logs within specified start and end times The following query queries logs between midnight and 8am on September 24, 2020. Substitute the start and end times according to your own requirements. ``` SELECT query_timestamp, srcids.instance, srcaddr, srcport, query_name, rcode FROM "r53_rlogs" WHERE (parse_datetime(query_timestamp,'yyyy-MM-dd''T''HH:mm:ss''Z') BETWEEN parse_datetime('2020-09-24-00:00:00','yyyy-MM-dd-HH:mm:ss') AND parse_datetime('2020-09-24-00:08:00','yyyy-MM-dd-HH:mm:ss')) ORDER BY query_timestamp DESC ``` ## Example 3 - query logs based on a specified DNS query name pattern The following query selects records whose query name includes the string "example.com". ``` SELECT query_timestamp, srcids.instance, srcaddr, srcport, query_name, rcode, answers FROM "r53_rlogs" WHERE query_name LIKE '%example.com%' ORDER BY query_timestamp DESC ``` ## Example 4 - query log requests with no answer The following query selects log entries in which the request received no answer. ``` SELECT query_timestamp, srcids.instance, srcaddr, srcport, query_name, rcode, answers FROM "r53_rlogs" WHERE cardinality(answers) = 0 ``` ## Example 5 - query logs with a specific answer The following query shows logs in which the `answer.Rdata` value has the specified IP address. ``` SELECT query_timestamp, srcids.instance, srcaddr, srcport, query_name, rcode, answer.Rdata FROM "r53_rlogs" CROSS JOIN UNNEST(r53_rlogs.answers) as st(answer) WHERE answer.Rdata='203.0.113.16'; ``` # Query Amazon SES event logs You can use Amazon Athena to query [Amazon Simple Email Service](https://aws.amazon.com/ses/) (Amazon SES) event logs. Amazon SES is an email platform that provides a convenient and cost-effective way to send and receive email using your own email addresses and domains. You can monitor your Amazon SES sending activity at a granular level using events, metrics, and statistics. Based on the characteristics that you define, you can publish Amazon SES events to [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/), [Amazon Data Firehose](https://aws.amazon.com/kinesis/data-firehose/), or [Amazon Simple Notification Service](https://aws.amazon.com/sns/). After the information is stored in Amazon S3, you can query it from Amazon Athena. For an example Athena `CREATE TABLE` statement for Amazon SES logs, including steps on how to create views and flatten nested arrays in Amazon SES event log data, see "Step 3: Using Amazon Athena to query the SES event logs" in the AWS blog post [Analyzing Amazon SES event data with AWS Analytics Services](https://aws.amazon.com/blogs/messaging-and-targeting/analyzing-amazon-ses-event-data-with-aws-analytics-services/). # Query Amazon VPC flow logs Amazon Virtual Private Cloud flow logs capture information about the IP traffic going to and from network interfaces in a VPC. Use the logs to investigate network traffic patterns and identify threats and risks across your VPC network. To query your Amazon VPC flow logs, you have two options: **** + **Amazon VPC Console** – Use the Athena integration feature in the Amazon VPC Console to generate an CloudFormation template that creates an Athena database, workgroup, and flow logs table with partitioning for you. The template also creates a set of [predefined flow log queries](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs-athena.html#predefined-queries) that you can use to obtain insights about the traffic flowing through your VPC. For information about this approach, see [Query flow logs using Amazon Athena](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs-athena.html) in the *Amazon VPC User Guide*. + **Amazon Athena console** – Create your tables and queries directly in the Athena console. For more information, continue reading this page. Before you begin querying the logs in Athena, [enable VPC flow logs](https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/flow-logs.html), and configure them to be saved to your Amazon S3 bucket. After you create the logs, let them run for a few minutes to collect some data. The logs are created in a GZIP compression format that Athena lets you query directly. When you create a VPC flow log, you can use a custom format when you want to specify the fields to return in the flow log and the order in which the fields appear. For more information about flow log records, see [Flow log records](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html#flow-log-records) in the *Amazon VPC User Guide*. ## Considerations and limitations When you create tables in Athena for Amazon VPC flow logs, remember the following points: + By default, in Athena, Parquet will access columns by name. For more information, see [Handle schema updates](handling-schema-updates-chapter.md). + Use the names in the flow log records for the column names in Athena. The names of the columns in the Athena schema should exactly match the field names in the Amazon VPC flow logs, with the following differences: + Replace the hyphens in the Amazon VPC log field names with underscores in the Athena column names. For information about acceptable characters for database names, table names, and column names in Athena, see [Name databases, tables, and columns](tables-databases-columns-names.md). + Escape the flow log record names that are [reserved keywords](reserved-words.md) in Athena by enclosing them with backticks. + VPC flow logs are AWS account specific. When you publish your log files to Amazon S3, the path that Amazon VPC creates in Amazon S3 includes the ID of the AWS account that was used to create the flow log. For more information, see [Publish flow logs to Amazon S3](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs-s3.html) in the *Amazon VPC User Guide*. **Topics** + [Considerations and limitations](#vpc-flow-logs-common-considerations) + [Create a table for Amazon VPC flow logs and query it](vpc-flow-logs-create-table-statement.md) + [Create tables for flow logs in Apache Parquet format](vpc-flow-logs-parquet.md) + [Create and query a table for Amazon VPC flow logs using partition projection](vpc-flow-logs-partition-projection.md) + [Create tables for flow logs in Apache Parquet format using partition projection](vpc-flow-logs-partition-projection-parquet-example.md) + [Additional resources](query-examples-vpc-logs-additional-resources.md) # Create a table for Amazon VPC flow logs and query it The following procedure creates an Amazon VPC table for Amazon VPC flow logs. When you create a flow log with a custom format, you create a table with fields that match the fields that you specified when you created the flow log in the same order that you specified them. **To create an Athena table for Amazon VPC flow logs** 1. Enter a DDL statement like the following into the Athena console query editor, following the guidelines in the [Considerations and limitations](vpc-flow-logs.md#vpc-flow-logs-common-considerations) section. The sample statement creates a table that has the columns for Amazon VPC flow logs versions 2 through 5 as documented in [Flow log records](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html#flow-log-records). If you use a different set of columns or order of columns, modify the statement accordingly. ``` CREATE EXTERNAL TABLE IF NOT EXISTS `vpc_flow_logs` ( version int, account_id string, interface_id string, srcaddr string, dstaddr string, srcport int, dstport int, protocol bigint, packets bigint, bytes bigint, start bigint, `end` bigint, action string, log_status string, vpc_id string, subnet_id string, instance_id string, tcp_flags int, type string, pkt_srcaddr string, pkt_dstaddr string, region string, az_id string, sublocation_type string, sublocation_id string, pkt_src_aws_service string, pkt_dst_aws_service string, flow_direction string, traffic_path int ) PARTITIONED BY (`date` date) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LOCATION 's3://amzn-s3-demo-bucket/prefix/AWSLogs/{account_id}/vpcflowlogs/{region_code}/' TBLPROPERTIES ("skip.header.line.count"="1"); ``` Note the following points: + The query specifies `ROW FORMAT DELIMITED` and omits specifying a SerDe. This means that the query uses the [Lazy Simple SerDe for CSV, TSV, and custom-delimited files](lazy-simple-serde.md). In this query, fields are terminated by a space. + The `PARTITIONED BY` clause uses the `date` type. This makes it possible to use mathematical operators in queries to select what's older or newer than a certain date. **Note** Because `date` is a reserved keyword in DDL statements, it is escaped by back tick characters. For more information, see [Escape reserved keywords in queries](reserved-words.md). + For a VPC flow log with a different custom format, modify the fields to match the fields that you specified when you created the flow log. 1. Modify the `LOCATION 's3://amzn-s3-demo-bucket/prefix/AWSLogs/{account_id}/vpcflowlogs/{region_code}/'` to point to the Amazon S3 bucket that contains your log data. 1. Run the query in Athena console. After the query completes, Athena registers the `vpc_flow_logs` table, making the data in it ready for you to issue queries. 1. Create partitions to be able to read the data, as in the following sample query. This query creates a single partition for a specified date. Replace the placeholders for date and location as needed. **Note** This query creates a single partition only, for a date that you specify. To automate the process, use a script that runs this query and creates partitions this way for the `year/month/day`, or use a `CREATE TABLE` statement that specifies [partition projection](vpc-flow-logs-partition-projection.md). ``` ALTER TABLE vpc_flow_logs ADD PARTITION (`date`='YYYY-MM-dd') LOCATION 's3://amzn-s3-demo-bucket/prefix/AWSLogs/{account_id}/vpcflowlogs/{region_code}/YYYY/MM/dd'; ``` ## Example queries for the vpc\$1flow\$1logs table Use the query editor in the Athena console to run SQL statements on the table that you create. You can save the queries, view previous queries, or download query results in CSV format. In the following examples, replace `vpc_flow_logs` with the name of your table. Modify the column values and other variables according to your requirements. The following example query lists a maximum of 100 flow logs for the date specified. ``` SELECT * FROM vpc_flow_logs WHERE date = DATE('2020-05-04') LIMIT 100; ``` The following query lists all of the rejected TCP connections and uses the newly created date partition column, `date`, to extract from it the day of the week for which these events occurred. ``` SELECT day_of_week(date) AS day, date, interface_id, srcaddr, action, protocol FROM vpc_flow_logs WHERE action = 'REJECT' AND protocol = 6 LIMIT 100; ``` To see which one of your servers is receiving the highest number of HTTPS requests, use the following query. It counts the number of packets received on HTTPS port 443, groups them by destination IP address, and returns the top 10 from the last week. ``` SELECT SUM(packets) AS packetcount, dstaddr FROM vpc_flow_logs WHERE dstport = 443 AND date > current_date - interval '7' day GROUP BY dstaddr ORDER BY packetcount DESC LIMIT 10; ``` # Create tables for flow logs in Apache Parquet format The following procedure creates an Amazon VPC table for Amazon VPC flow logs in Apache Parquet format. **To create an Athena table for Amazon VPC flow logs in Parquet format** 1. Enter a DDL statement like the following into the Athena console query editor, following the guidelines in the [Considerations and limitations](vpc-flow-logs.md#vpc-flow-logs-common-considerations) section. The sample statement creates a table that has the columns for Amazon VPC flow logs versions 2 through 5 as documented in [Flow log records](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html#flow-log-records) in Parquet format, Hive partitioned hourly. If you do not have hourly partitions, remove `hour` from the `PARTITIONED BY` clause. ``` CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs_parquet ( version int, account_id string, interface_id string, srcaddr string, dstaddr string, srcport int, dstport int, protocol bigint, packets bigint, bytes bigint, start bigint, `end` bigint, action string, log_status string, vpc_id string, subnet_id string, instance_id string, tcp_flags int, type string, pkt_srcaddr string, pkt_dstaddr string, region string, az_id string, sublocation_type string, sublocation_id string, pkt_src_aws_service string, pkt_dst_aws_service string, flow_direction string, traffic_path int ) PARTITIONED BY ( `aws-account-id` string, `aws-service` string, `aws-region` string, `year` string, `month` string, `day` string, `hour` string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3://amzn-s3-demo-bucket/prefix/AWSLogs/' TBLPROPERTIES ( 'EXTERNAL'='true', 'skip.header.line.count'='1' ) ``` 1. Modify the sample `LOCATION 's3://amzn-s3-demo-bucket/prefix/AWSLogs/'` to point to the Amazon S3 path that contains your log data. 1. Run the query in Athena console. 1. If your data is in Hive-compatible format, run the following command in the Athena console to update and load the Hive partitions in the metastore. After the query completes, you can query the data in the `vpc_flow_logs_parquet` table. ``` MSCK REPAIR TABLE vpc_flow_logs_parquet ``` If you are not using Hive compatible data, run [ALTER TABLE ADD PARTITION](alter-table-add-partition.md) to load the partitions. For more information about using Athena to query Amazon VPC flow logs in Parquet format, see the post [Optimize performance and reduce costs for network analytics with VPC Flow Logs in Apache Parquet format](https://aws.amazon.com/blogs/big-data/optimize-performance-and-reduce-costs-for-network-analytics-with-vpc-flow-logs-in-apache-parquet-format/) in the *AWS Big Data Blog*. # Create and query a table for Amazon VPC flow logs using partition projection Use a `CREATE TABLE` statement like the following to create a table, partition the table, and populate the partitions automatically by using [partition projection](partition-projection.md). Replace the table name `test_table_vpclogs` in the example with the name of your table. Edit the `LOCATION` clause to specify the Amazon S3 bucket that contains your Amazon VPC log data. The following `CREATE TABLE` statement is for VPC flow logs delivered in non-Hive style partitioning format. The example allows for multi-account aggregation. If you are centralizing VPC Flow logs from multiple accounts into one Amazon S3 bucket, the account ID must be entered in the Amazon S3 path. ``` CREATE EXTERNAL TABLE IF NOT EXISTS test_table_vpclogs ( version int, account_id string, interface_id string, srcaddr string, dstaddr string, srcport int, dstport int, protocol bigint, packets bigint, bytes bigint, start bigint, `end` bigint, action string, log_status string, vpc_id string, subnet_id string, instance_id string, tcp_flags int, type string, pkt_srcaddr string, pkt_dstaddr string, az_id string, sublocation_type string, sublocation_id string, pkt_src_aws_service string, pkt_dst_aws_service string, flow_direction string, traffic_path int ) PARTITIONED BY (accid string, region string, day string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LOCATION '$LOCATION_OF_LOGS' TBLPROPERTIES ( "skip.header.line.count"="1", "projection.enabled" = "true", "projection.accid.type" = "enum", "projection.accid.values" = "$ACCID_1,$ACCID_2", "projection.region.type" = "enum", "projection.region.values" = "$REGION_1,$REGION_2,$REGION_3", "projection.day.type" = "date", "projection.day.range" = "$START_RANGE,NOW", "projection.day.format" = "yyyy/MM/dd", "storage.location.template" = "s3://amzn-s3-demo-bucket/AWSLogs/${accid}/vpcflowlogs/${region}/${day}" ) ``` ## Example queries for test\$1table\$1vpclogs The following example queries query the `test_table_vpclogs` created by the preceding `CREATE TABLE` statement. Replace `test_table_vpclogs` in the queries with the name of your own table. Modify the column values and other variables according to your requirements. To return the first 100 access log entries in chronological order for a specified period of time, run a query like the following. ``` SELECT * FROM test_table_vpclogs WHERE day >= '2021/02/01' AND day < '2021/02/28' ORDER BY day ASC LIMIT 100 ``` To view which server receives the top ten number of HTTP packets for a specified period of time, run a query like the following. The query counts the number of packets received on HTTPS port 443, groups them by destination IP address, and returns the top 10 entries from the previous week. ``` SELECT SUM(packets) AS packetcount, dstaddr FROM test_table_vpclogs WHERE dstport = 443 AND day >= '2021/03/01' AND day < '2021/03/31' GROUP BY dstaddr ORDER BY packetcount DESC LIMIT 10 ``` To return the logs that were created during a specified period of time, run a query like the following. ``` SELECT interface_id, srcaddr, action, protocol, to_iso8601(from_unixtime(start)) AS start_time, to_iso8601(from_unixtime("end")) AS end_time FROM test_table_vpclogs WHERE DAY >= '2021/04/01' AND DAY < '2021/04/30' ``` To return the access logs for a source IP address between specified time periods, run a query like the following. ``` SELECT * FROM test_table_vpclogs WHERE srcaddr = '10.117.1.22' AND day >= '2021/02/01' AND day < '2021/02/28' ``` To list rejected TCP connections, run a query like the following. ``` SELECT day, interface_id, srcaddr, action, protocol FROM test_table_vpclogs WHERE action = 'REJECT' AND protocol = 6 AND day >= '2021/02/01' AND day < '2021/02/28' LIMIT 10 ``` To return the access logs for the IP address range that starts with `10.117`, run a query like the following. ``` SELECT * FROM test_table_vpclogs WHERE split_part(srcaddr,'.', 1)='10' AND split_part(srcaddr,'.', 2) ='117' ``` To return the access logs for a destination IP address between a certain time range, run a query like the following. ``` SELECT * FROM test_table_vpclogs WHERE dstaddr = '10.0.1.14' AND day >= '2021/01/01' AND day < '2021/01/31' ``` # Create tables for flow logs in Apache Parquet format using partition projection The following partition projection `CREATE TABLE` statement for VPC flow logs is in Apache Parquet format, not Hive compatible, and partitioned by hour and by date instead of day. Replace the table name `test_table_vpclogs_parquet` in the example with the name of your table. Edit the `LOCATION` clause to specify the Amazon S3 bucket that contains your Amazon VPC log data. ``` CREATE EXTERNAL TABLE IF NOT EXISTS test_table_vpclogs_parquet ( version int, account_id string, interface_id string, srcaddr string, dstaddr string, srcport int, dstport int, protocol bigint, packets bigint, bytes bigint, start bigint, `end` bigint, action string, log_status string, vpc_id string, subnet_id string, instance_id string, tcp_flags int, type string, pkt_srcaddr string, pkt_dstaddr string, az_id string, sublocation_type string, sublocation_id string, pkt_src_aws_service string, pkt_dst_aws_service string, flow_direction string, traffic_path int ) PARTITIONED BY (region string, date string, hour string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3://amzn-s3-demo-bucket/prefix/AWSLogs/{account_id}/vpcflowlogs/' TBLPROPERTIES ( "EXTERNAL"="true", "skip.header.line.count" = "1", "projection.enabled" = "true", "projection.region.type" = "enum", "projection.region.values" = "us-east-1,us-west-2,ap-south-1,eu-west-1", "projection.date.type" = "date", "projection.date.range" = "2021/01/01,NOW", "projection.date.format" = "yyyy/MM/dd", "projection.hour.type" = "integer", "projection.hour.range" = "00,23", "projection.hour.digits" = "2", "storage.location.template" = "s3://amzn-s3-demo-bucket/prefix/AWSLogs/${account_id}/vpcflowlogs/${region}/${date}/${hour}" ) ``` # Additional resources For more information about using Athena to analyze VPC flow logs, see the following AWS Big Data blog posts: + [Analyze VPC Flow Logs with point-and-click Amazon Athena integration](https://aws.amazon.com/blogs/networking-and-content-delivery/analyze-vpc-flow-logs-with-point-and-click-amazon-athena-integration/) + [Analyzing VPC flow logs using Amazon Athena and Quick](https://aws.amazon.com/blogs/big-data/analyzing-vpc-flow-logs-using-amazon-athena-and-amazon-quicksight/) + [Optimize performance and reduce costs for network analytics with VPC Flow Logs in Apache Parquet format](https://aws.amazon.com/blogs/big-data/optimize-performance-and-reduce-costs-for-network-analytics-with-vpc-flow-logs-in-apache-parquet-format/) # Query AWS WAF logs AWS WAF is a web application firewall that lets you monitor and control the HTTP and HTTPS requests that your protected web applications receive from clients. You define how to handle the web requests by configuring rules inside an AWS WAF web access control list (ACL). You then protect a web application by associating a web ACL to it. Examples of web application resources that you can protect with AWS WAF include Amazon CloudFront distributions, Amazon API Gateway REST APIs, and Application Load Balancers. For more information about AWS WAF, see [AWS WAF](https://docs.aws.amazon.com/waf/latest/developerguide/waf-chapter.html) in the *AWS WAF developer guide*. AWS WAF logs include information about the traffic that is analyzed by your web ACL, such as the time that AWS WAF received the request from your AWS resource, detailed information about the request, and the action for the rule that each request matched. You can configure an AWS WAF web ACL to publish logs to one of several destinations, where you can query and view them. For more information about configuring web ACL logging and the contents of the AWS WAF logs, see [Logging AWS WAF web ACL traffic](https://docs.aws.amazon.com/waf/latest/developerguide/logging.html) in the *AWS WAF developer guide*. For information on how to use Athena to analyze AWS WAF logs for insights into threat detection and potential security attacks, see the AWS Networking & Content Delivery Blog post [How to use Amazon Athena queries to analyze AWS WAF logs and provide the visibility needed for threat detection](https://aws.amazon.com/blogs/networking-and-content-delivery/how-to-use-amazon-athena-queries-to-analyze-aws-waf-logs-and-provide-the-visibility-needed-for-threat-detection/). For an example of how to aggregate AWS WAF logs into a central data lake repository and query them with Athena, see the AWS Big Data Blog post [Analyzing AWS WAF logs with OpenSearch Service, Amazon Athena, and Quick](https://aws.amazon.com/blogs/big-data/analyzing-aws-waf-logs-with-amazon-es-amazon-athena-and-amazon-quicksight/). This topic provides example `CREATE TABLE` statements for partition projection, manual partitioning, and one that does not uses any partitioning. **Note** The `CREATE TABLE` statements in this topic can be used for both v1 and v2 AWS WAF logs. In v1, the `webaclid` field contains an ID. In v2, the `webaclid` field contains a full ARN. The `CREATE TABLE` statements here treat this content agnostically by using the `string` data type. **Topics** + [Create a table for AWS WAF S3 logs in Athena using partition projection](create-waf-table-partition-projection.md) + [Create a table for AWS WAF S3 logs in Athena using manual partition](create-waf-table-manual-partition.md) + [Create a table for AWS WAF logs without partitioning](create-waf-table.md) + [Example queries for AWS WAF logs](query-examples-waf-logs.md) # Create a table for AWS WAF S3 logs in Athena using partition projection Because AWS WAF logs have a known structure whose partition scheme you can specify in advance, you can reduce query runtime and automate partition management by using the Athena [partition projection](partition-projection.md) feature. Partition projection automatically adds new partitions as new data is added. This removes the need for you to manually add partitions by using `ALTER TABLE ADD PARTITION`. The following example `CREATE TABLE` statement automatically uses partition projection on AWS WAF logs from a specified date until the present for four different AWS regions. The `PARTITION BY` clause in this example partitions by region and by date, but you can modify this according to your requirements. Modify the fields as necessary to match your log output. In the `LOCATION` and `storage.location.template` clauses, replace the *amzn-s3-demo-bucket* and *AWS\$1ACCOUNT\$1NUMBER* placeholders with values that identify the Amazon S3 bucket location of your AWS WAF logs. For `projection.day.range`, replace *2021*/*01*/*01* with the starting date that you want to use. After you run the query successfully, you can query the table. You do not have to run `ALTER TABLE ADD PARTITION` to load the partitions. ``` CREATE EXTERNAL TABLE `waf_logs_partition_projection`( `timestamp` bigint, `formatversion` int, `webaclid` string, `terminatingruleid` string, `terminatingruletype` string, `action` string, `terminatingrulematchdetails` array>>, `httpsourcename` string, `httpsourceid` string, `rulegrouplist` array>>>,nonterminatingmatchingrules:array>>,challengeresponse:struct,captcharesponse:struct>>,excludedrules:string>>, `ratebasedrulelist` array>, `nonterminatingmatchingrules` array>>,challengeresponse:struct,captcharesponse:struct>>, `requestheadersinserted` array>, `responsecodesent` string, `httprequest` struct>,uri:string,args:string,httpversion:string,httpmethod:string,requestid:string,fragment:string,scheme:string,host:string>, `labels` array>, `captcharesponse` struct, `challengeresponse` struct, `ja3fingerprint` string, `ja4fingerprint` string, `oversizefields` string, `requestbodysize` int, `requestbodysizeinspectedbywaf` int) PARTITIONED BY ( `log_time` string) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/AWS_ACCOUNT_NUMBER/WAFLogs/cloudfront/testui/' TBLPROPERTIES ( 'projection.enabled'='true', 'projection.log_time.format'='yyyy/MM/dd/HH/mm', 'projection.log_time.interval'='1', 'projection.log_time.interval.unit'='minutes', 'projection.log_time.range'='2025/01/01/00/00,NOW', 'projection.log_time.type'='date', 'storage.location.template'='s3://amzn-s3-demo-bucket/AWSLogs/AWS_ACCOUNT_NUMBER/WAFLogs/cloudfront/testui/${log_time}') ``` **Note** The format of the path in the `LOCATION` clause in the example is standard but can vary based on the AWS WAF configuration that you have implemented. For example, the following example AWS WAF logs path is for a CloudFront distribution: ``` s3://amzn-s3-demo-bucket/AWSLogs/AWS_ACCOUNT_NUMBER/WAFLogs/cloudfront/cloudfronyt/2025/01/01/00/00/ ``` If you experience issues while creating or querying your AWS WAF logs table, confirm the location of your log data or [contact Support](https://console.aws.amazon.com/support/home/). For more information about partition projection, see [Use partition projection with Amazon Athena](partition-projection.md). # Create a table for AWS WAF S3 logs in Athena using manual partition This section describes how to create a table for AWS WAF logs using manual partition. In the `LOCATION` and `storage.location.template` clauses, replace the *amzn-s3-demo-bucket* and *AWS\$1ACCOUNT\$1NUMBER* placeholders with values that identify the Amazon S3 bucket location of your AWS WAF logs. ``` CREATE EXTERNAL TABLE `waf_logs_manual_partition`( `timestamp` bigint, `formatversion` int, `webaclid` string, `terminatingruleid` string, `terminatingruletype` string, `action` string, `terminatingrulematchdetails` array>>, `httpsourcename` string, `httpsourceid` string, `rulegrouplist` array>>>,nonterminatingmatchingrules:array>>,challengeresponse:struct,captcharesponse:struct>>,excludedrules:string>>, `ratebasedrulelist` array>, `nonterminatingmatchingrules` array>>,challengeresponse:struct,captcharesponse:struct>>, `requestheadersinserted` array>, `responsecodesent` string, `httprequest` struct>,uri:string,args:string,httpversion:string,httpmethod:string,requestid:string,fragment:string,scheme:string,host:string>, `labels` array>, `captcharesponse` struct, `challengeresponse` struct, `ja3fingerprint` string, `ja4fingerprint` string, `oversizefields` string, `requestbodysize` int, `requestbodysizeinspectedbywaf` int) PARTITIONED BY ( `year` string, `month` string, `day` string, `hour` string, `min` string) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://amzn-s3-demo-bucket/AWSLogs/AWS_ACCOUNT_NUMBER/WAFLogs/cloudfront/webacl/' ``` # Create a table for AWS WAF logs without partitioning This section describes how to create a table for AWS WAF logs without partitioning or partition projection. **Note** For performance and cost reasons, we do not recommend using non-partitioned schema for queries. For more information, see [Top 10 Performance Tuning Tips for Amazon Athena](https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/) in the AWS Big Data Blog. **To create the AWS WAF table** 1. Copy and paste the following DDL statement into the Athena console. Modify the fields as necessary to match your log output. Modify the `LOCATION` for the Amazon S3 bucket to correspond to the one that stores your logs. This query uses the [OpenX JSON SerDe](openx-json-serde.md). **Note** The SerDe expects each JSON document to be on a single line of text with no line termination characters separating the fields in the record. If the JSON text is in pretty print format, you may receive an error message like HIVE\$1CURSOR\$1ERROR: Row is not a valid JSON Object or HIVE\$1CURSOR\$1ERROR: JsonParseException: Unexpected end-of-input: expected close marker for OBJECT when you attempt to query the table after you create it. For more information, see [JSON Data Files](https://github.com/rcongiu/Hive-JSON-Serde#json-data-files) in the OpenX SerDe documentation on GitHub. ``` CREATE EXTERNAL TABLE `waf_logs`( `timestamp` bigint, `formatversion` int, `webaclid` string, `terminatingruleid` string, `terminatingruletype` string, `action` string, `terminatingrulematchdetails` array < struct < conditiontype: string, sensitivitylevel: string, location: string, matcheddata: array < string > > >, `httpsourcename` string, `httpsourceid` string, `rulegrouplist` array < struct < rulegroupid: string, terminatingrule: struct < ruleid: string, action: string, rulematchdetails: array < struct < conditiontype: string, sensitivitylevel: string, location: string, matcheddata: array < string > > > >, nonterminatingmatchingrules: array < struct < ruleid: string, action: string, overriddenaction: string, rulematchdetails: array < struct < conditiontype: string, sensitivitylevel: string, location: string, matcheddata: array < string > > >, challengeresponse: struct < responsecode: string, solvetimestamp: string >, captcharesponse: struct < responsecode: string, solvetimestamp: string > > >, excludedrules: string > >, `ratebasedrulelist` array < struct < ratebasedruleid: string, limitkey: string, maxrateallowed: int > >, `nonterminatingmatchingrules` array < struct < ruleid: string, action: string, rulematchdetails: array < struct < conditiontype: string, sensitivitylevel: string, location: string, matcheddata: array < string > > >, challengeresponse: struct < responsecode: string, solvetimestamp: string >, captcharesponse: struct < responsecode: string, solvetimestamp: string > > >, `requestheadersinserted` array < struct < name: string, value: string > >, `responsecodesent` string, `httprequest` struct < clientip: string, country: string, headers: array < struct < name: string, value: string > >, uri: string, args: string, httpversion: string, httpmethod: string, requestid: string >, `labels` array < struct < name: string > >, `captcharesponse` struct < responsecode: string, solvetimestamp: string, failureReason: string >, `challengeresponse` struct < responsecode: string, solvetimestamp: string, failureReason: string >, `ja3Fingerprint` string, `oversizefields` string, `requestbodysize` int, `requestbodysizeinspectedbywaf` int ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://amzn-s3-demo-bucket/prefix/' ``` 1. Run the `CREATE EXTERNAL TABLE` statement in the Athena console query editor. This registers the `waf_logs` table and makes the data in it available for queries from Athena. # Example queries for AWS WAF logs Many of the example queries in this section use the partition projection table created previously. Modify the table name, column values, and other variables in the examples according to your requirements. To improve the performance of your queries and reduce cost, add the partition column in the filter condition. **Topics** + [Count referrers, IP addresses, or matched rules](query-examples-waf-logs-count.md) + [Query using date and time](query-examples-waf-logs-date-time.md) + [Query for blocked requests or addresses](query-examples-waf-logs-blocked-requests.md) # Count referrers, IP addresses, or matched rules The examples in this section query for counts of log items of interest. + [Count the number of referrers that contain a specified term](#waf-example-count-referrers-with-specified-term) + [Count all matched IP addresses in the last 10 days that have matched excluded rules](#waf-example-count-matched-ip-addresses) + [Group all counted managed rules by the number of times matched](#waf-example-group-managed-rules-by-times-matched) + [Group all counted custom rules by number of times matched](#waf-example-group-custom-rules-by-times-matched) **Example – Count the number of referrers that contain a specified term** The following query counts the number of referrers that contain the term "amazon" for the specified date range. ``` WITH test_dataset AS (SELECT header FROM waf_logs CROSS JOIN UNNEST(httprequest.headers) AS t(header) WHERE "date" >= '2021/03/01' AND "date" < '2021/03/31') SELECT COUNT(*) referer_count FROM test_dataset WHERE LOWER(header.name)='referer' AND header.value LIKE '%amazon%' ``` **Example – Count all matched IP addresses in the last 10 days that have matched excluded rules** The following query counts the number of times in the last 10 days that the IP address matched the excluded rule in the rule group. ``` WITH test_dataset AS (SELECT * FROM waf_logs CROSS JOIN UNNEST(rulegrouplist) AS t(allrulegroups)) SELECT COUNT(*) AS count, "httprequest"."clientip", "allrulegroups"."excludedrules", "allrulegroups"."ruleGroupId" FROM test_dataset WHERE allrulegroups.excludedrules IS NOT NULL AND from_unixtime(timestamp/1000) > now() - interval '10' day GROUP BY "httprequest"."clientip", "allrulegroups"."ruleGroupId", "allrulegroups"."excludedrules" ORDER BY count DESC ``` **Example – Group all counted managed rules by the number of times matched** If you set rule group rule actions to Count in your web ACL configuration before October 27, 2022, AWS WAF saved your overrides in the web ACL JSON as `excludedRules`. Now, the JSON setting for overriding a rule to Count is in the `ruleActionOverrides` settings. For more information, see [Action overrides in rule groups](https://docs.aws.amazon.com/waf/latest/developerguide/web-acl-rule-group-override-options.html) in the *AWS WAF Developer Guide*. To extract managed rules in Count mode from the new log structure, query the `nonTerminatingMatchingRules` in the `ruleGroupList` section instead of the `excludedRules` field, as in the following example. ``` SELECT count(*) AS count, httpsourceid, httprequest.clientip, t.rulegroupid, t.nonTerminatingMatchingRules FROM "waf_logs" CROSS JOIN UNNEST(rulegrouplist) AS t(t) WHERE action <> 'BLOCK' AND cardinality(t.nonTerminatingMatchingRules) > 0 GROUP BY t.nonTerminatingMatchingRules, action, httpsourceid, httprequest.clientip, t.rulegroupid ORDER BY "count" DESC Limit 50 ``` **Example – Group all counted custom rules by number of times matched** The following query groups all counted custom rules by the number of times matched. ``` SELECT count(*) AS count, httpsourceid, httprequest.clientip, t.ruleid, t.action FROM "waf_logs" CROSS JOIN UNNEST(nonterminatingmatchingrules) AS t(t) WHERE action <> 'BLOCK' AND cardinality(nonTerminatingMatchingRules) > 0 GROUP BY t.ruleid, t.action, httpsourceid, httprequest.clientip ORDER BY "count" DESC Limit 50 ``` For information about the log locations for custom rules and managed rule groups, see [Monitoring and tuning](https://docs.aws.amazon.com/waf/latest/developerguide/web-acl-testing-activities.html) in the *AWS WAF Developer Guide*. # Query using date and time The examples in this section include queries that use date and time values. + [Return the timestamp field in human-readable ISO 8601 format](#waf-example-return-human-readable-timestamp) + [Return records from the last 24 hours](#waf-example-return-records-last-24-hours) + [Return records for a specified date range and IP address](#waf-example-return-records-date-range-and-ip) + [For a specified date range, count the number of IP addresses in five minute intervals](#waf-example-count-ip-addresses-in-date-range) + [Count the number of X-Forwarded-For IP in the last 10 days](#waf-example-count-x-forwarded-for-ip) **Example – Return the timestamp field in human-readable ISO 8601 format** The following query uses the `from_unixtime` and `to_iso8601` functions to return the `timestamp` field in human-readable ISO 8601 format (for example, `2019-12-13T23:40:12.000Z` instead of `1576280412771`). The query also returns the HTTP source name, source ID, and request. ``` SELECT to_iso8601(from_unixtime(timestamp / 1000)) as time_ISO_8601, httpsourcename, httpsourceid, httprequest FROM waf_logs LIMIT 10; ``` **Example – Return records from the last 24 hours** The following query uses a filter in the `WHERE` clause to return the HTTP source name, HTTP source ID, and HTTP request fields for records from the last 24 hours. ``` SELECT to_iso8601(from_unixtime(timestamp/1000)) AS time_ISO_8601, httpsourcename, httpsourceid, httprequest FROM waf_logs WHERE from_unixtime(timestamp/1000) > now() - interval '1' day LIMIT 10; ``` **Example – Return records for a specified date range and IP address** The following query lists the records in a specified date range for a specified client IP address. ``` SELECT * FROM waf_logs WHERE httprequest.clientip='53.21.198.66' AND "date" >= '2021/03/01' AND "date" < '2021/03/31' ``` **Example – For a specified date range, count the number of IP addresses in five minute intervals** The following query counts, for a particular date range, the number of IP addresses in five minute intervals. ``` WITH test_dataset AS (SELECT format_datetime(from_unixtime((timestamp/1000) - ((minute(from_unixtime(timestamp / 1000))%5) * 60)),'yyyy-MM-dd HH:mm') AS five_minutes_ts, "httprequest"."clientip" FROM waf_logs WHERE "date" >= '2021/03/01' AND "date" < '2021/03/31') SELECT five_minutes_ts,"clientip",count(*) ip_count FROM test_dataset GROUP BY five_minutes_ts,"clientip" ``` **Example – Count the number of X-Forwarded-For IP in the last 10 days** The following query filters the request headers and counts the number of X-Forwarded-For IP in the last 10 days. ``` WITH test_dataset AS (SELECT header FROM waf_logs CROSS JOIN UNNEST (httprequest.headers) AS t(header) WHERE from_unixtime("timestamp"/1000) > now() - interval '10' DAY) SELECT header.value AS ip, count(*) AS COUNT FROM test_dataset WHERE header.name='X-Forwarded-For' GROUP BY header.value ORDER BY COUNT DESC ``` For more information about date and time functions, see [Date and time functions and operators](https://trino.io/docs/current/functions/datetime.html) in the Trino documentation. # Query for blocked requests or addresses The examples in this section query for blocked requests or addresses. + [Extract the top 100 IP addresses blocked by a specified rule type](#waf-example-extract-top-100-blocked-ip-by-rule) + [Count the number of times a request from a specified country has been blocked](#waf-example-count-request-blocks-from-country) + [Count the number of times a request has been blocked, grouping by specific attributes](#waf-example-count-request-blocks-by-attribute) + [Count the number of times a specific terminating rule ID has been matched](#waf-example-count-terminating-rule-id-matches) + [Retrieve the top 100 IP addresses blocked during a specified date range](#waf-example-top-100-ip-addresses-blocked-for-date-range) **Example – Extract the top 100 IP addresses blocked by a specified rule type** The following query extracts and counts the top 100 IP addresses that have been blocked by the `RATE_BASED` terminating rule during the specified date range. ``` SELECT COUNT(httpRequest.clientIp) as count, httpRequest.clientIp FROM waf_logs WHERE terminatingruletype='RATE_BASED' AND action='BLOCK' and "date" >= '2021/03/01' AND "date" < '2021/03/31' GROUP BY httpRequest.clientIp ORDER BY count DESC LIMIT 100 ``` **Example – Count the number of times a request from a specified country has been blocked** The following query counts the number of times the request has arrived from an IP address that belongs to Ireland (IE) and has been blocked by the `RATE_BASED` terminating rule. ``` SELECT COUNT(httpRequest.country) as count, httpRequest.country FROM waf_logs WHERE terminatingruletype='RATE_BASED' AND httpRequest.country='IE' GROUP BY httpRequest.country ORDER BY count LIMIT 100; ``` **Example – Count the number of times a request has been blocked, grouping by specific attributes** The following query counts the number of times the request has been blocked, with results grouped by WebACL, RuleId, ClientIP, and HTTP Request URI. ``` SELECT COUNT(*) AS count, webaclid, terminatingruleid, httprequest.clientip, httprequest.uri FROM waf_logs WHERE action='BLOCK' GROUP BY webaclid, terminatingruleid, httprequest.clientip, httprequest.uri ORDER BY count DESC LIMIT 100; ``` **Example – Count the number of times a specific terminating rule ID has been matched** The following query counts the number of times a specific terminating rule ID has been matched (`WHERE terminatingruleid='e9dd190d-7a43-4c06-bcea-409613d9506e'`). The query then groups the results by WebACL, Action, ClientIP, and HTTP Request URI. ``` SELECT COUNT(*) AS count, webaclid, action, httprequest.clientip, httprequest.uri FROM waf_logs WHERE terminatingruleid='e9dd190d-7a43-4c06-bcea-409613d9506e' GROUP BY webaclid, action, httprequest.clientip, httprequest.uri ORDER BY count DESC LIMIT 100; ``` **Example – Retrieve the top 100 IP addresses blocked during a specified date range** The following query extracts the top 100 IP addresses that have been blocked for a specified date range. The query also lists the number of times the IP addresses have been blocked. ``` SELECT "httprequest"."clientip", "count"(*) "ipcount", "httprequest"."country" FROM waf_logs WHERE "action" = 'BLOCK' and "date" >= '2021/03/01' AND "date" < '2021/03/31' GROUP BY "httprequest"."clientip", "httprequest"."country" ORDER BY "ipcount" DESC limit 100 ``` For information about querying Amazon S3 logs, see the following topics: + [How do I analyze my Amazon S3 server access logs using Athena?](https://aws.amazon.com/premiumsupport/knowledge-center/analyze-logs-athena/) in the AWS Knowledge Center + [Querying Amazon S3 access logs for requests using Amazon Athena](https://docs.aws.amazon.com/AmazonS3/latest/dev/using-s3-access-logs-to-identify-requests.html#querying-s3-access-logs-for-requests) in the Amazon Simple Storage Service User Guide + [Using AWS CloudTrail to identify Amazon S3 requests](https://docs.aws.amazon.com/AmazonS3/latest/dev/cloudtrail-request-identification.html) in the Amazon Simple Storage Service User Guide