

# Available data source connectors
<a name="connectors-available"></a>

This section lists prebuilt Athena data source connectors that you can use to query a variety of data sources external to Amazon S3. To use a connector in your Athena queries, configure it and deploy it to your account. 

## Considerations and limitations
<a name="connectors-available-considerations"></a>
+ Some prebuilt connectors require that you create a VPC and a security group before you can use the connector. For information about creating VPCs, see [Create a VPC for a data source connector or AWS Glue connection](athena-connectors-vpc-creation.md). 
+ To use the Athena Federated Query feature with AWS Secrets Manager, you must configure an Amazon VPC private endpoint for Secrets Manager. For more information, see [Create a Secrets Manager VPC private endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html#vpc-endpoint-create) in the *AWS Secrets Manager User Guide*. 
+ For connectors that do not support predicate pushdown, queries that include a predicate take longer to run. For small datasets, very little data is scanned, and queries take about 2 minutes on average. For large datasets, however, many queries can time out.
+ Some federated data sources use terminology for data objects that differs from Athena's. For more information, see [Understand federated table name qualifiers](tables-qualifiers.md).
+ We update our connectors periodically based on upgrades from the database or data source provider. We do not support data sources that have reached end of life.
+ For connectors that do not support pagination when listing tables, the web service can time out if your database has a large number of tables and a large amount of metadata. The following connectors provide pagination support for listing tables:
  + DocumentDB
  + DynamoDB
  + MySQL
  + OpenSearch
  + Oracle
  + PostgreSQL
  + Redshift
  + SQL Server

## Case resolver modes in Federation SDK
<a name="case-resolver-modes"></a>

The Federation SDK supports the following standardized case resolver modes for schema and table names:
+ `NONE` – Does not change case of the given schema and table names.
+ `LOWER` – Lower case all given schema and table names.
+ `UPPER` – Upper case all given schema and table names.
+ `ANNOTATION` – This mode is maintained for backward compatibility only and is supported exclusively by existing Snowflake and SAP HANA connectors.
+ `CASE_INSENSITIVE_SEARCH` – Perform case-insensitive searches against schema and table names.
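
The semantics of these modes can be sketched as follows. This is an illustrative Python sketch only, not the Federation SDK implementation; the function name `resolve_name` and the `existing_names` parameter are hypothetical.

```python
def resolve_name(name, mode, existing_names=()):
    """Illustrative sketch of case resolver mode semantics (not SDK code)."""
    if mode == "NONE":
        return name              # use the name exactly as given
    if mode == "LOWER":
        return name.lower()      # lowercase before querying the data source
    if mode == "UPPER":
        return name.upper()      # uppercase before querying the data source
    if mode == "CASE_INSENSITIVE_SEARCH":
        # Search the source catalog ignoring case; names must be unique,
        # so "Apple" and "APPLE" in the same schema would be ambiguous.
        matches = [n for n in existing_names if n.lower() == name.lower()]
        if len(matches) != 1:
            raise ValueError(f"expected exactly one match for {name!r}, got {matches}")
        return matches[0]
    raise ValueError(f"unsupported mode: {mode}")

print(resolve_name("MyTable", "LOWER"))                             # mytable
print(resolve_name("apple", "CASE_INSENSITIVE_SEARCH", ["APPLE"]))  # APPLE
```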

## Connector support for case resolver modes
<a name="connector-support-matrix"></a>

### Basic mode support
<a name="basic-mode-support"></a>

All JDBC connectors support the following basic modes:
+ `NONE`
+ `LOWER`
+ `UPPER`

### Annotation mode support
<a name="annotation-mode-support"></a>

Only the following connectors support the `ANNOTATION` mode:
+ Snowflake
+ SAP HANA

**Note**  
We recommend that you use `CASE_INSENSITIVE_SEARCH` instead of `ANNOTATION`.

### Case-insensitive search support
<a name="case-insensitive-search-support"></a>

The following connectors support `CASE_INSENSITIVE_SEARCH`:
+ DataLake Gen2
+ Snowflake
+ Oracle
+ Synapse
+ MySQL
+ PostgreSQL
+ Redshift
+ ClickHouse
+ SQL Server
+ DB2

## Case resolver limitations
<a name="case-resolver-limitations"></a>

Be aware of the following limitations when using case resolver modes:
+ When using `LOWER` mode, your schema name and all tables within the schema must be in lowercase.
+ When using `UPPER` mode, your schema name and all tables within the schema must be in uppercase.
+ When using `CASE_INSENSITIVE_SEARCH`:
  + Schema names must be unique
  + Table names within a schema must be unique (for example, you cannot have both "Apple" and "APPLE")
+ Glue integration limitations:
  + Glue only supports lowercase names
  + Only the `NONE` or `LOWER` modes work when you register your Lambda function with the Glue Data Catalog or Lake Formation

## Additional information
<a name="connectors-available-additional-resources"></a>
+ For information about deploying an Athena data source connector, see [Use Amazon Athena Federated Query](federated-queries.md). 
+ For information about queries that use Athena data source connectors, see [Run federated queries](running-federated-queries.md).

**Topics**
+ [Considerations and limitations](#connectors-available-considerations)
+ [Case resolver modes in Federation SDK](#case-resolver-modes)
+ [Connector support for case resolver modes](#connector-support-matrix)
+ [Case resolver limitations](#case-resolver-limitations)
+ [Additional information](#connectors-available-additional-resources)
+ [Azure Data Lake Storage](connectors-adls-gen2.md)
+ [Azure Synapse](connectors-azure-synapse.md)
+ [Cloudera Hive](connectors-cloudera-hive.md)
+ [Cloudera Impala](connectors-cloudera-impala.md)
+ [CloudWatch](connectors-cloudwatch.md)
+ [CloudWatch metrics](connectors-cwmetrics.md)
+ [CMDB](connectors-cmdb.md)
+ [Db2](connectors-ibm-db2.md)
+ [Db2 iSeries](connectors-ibm-db2-as400.md)
+ [DocumentDB](connectors-docdb.md)
+ [DynamoDB](connectors-dynamodb.md)
+ [Google BigQuery](connectors-bigquery.md)
+ [Google Cloud Storage](connectors-gcs.md)
+ [HBase](connectors-hbase.md)
+ [Hortonworks](connectors-hortonworks.md)
+ [Kafka](connectors-kafka.md)
+ [MSK](connectors-msk.md)
+ [MySQL](connectors-mysql.md)
+ [Neptune](connectors-neptune.md)
+ [OpenSearch](connectors-opensearch.md)
+ [Oracle](connectors-oracle.md)
+ [PostgreSQL](connectors-postgresql.md)
+ [Redis OSS](connectors-redis.md)
+ [Redshift](connectors-redshift.md)
+ [SAP HANA](connectors-sap-hana.md)
+ [Snowflake](connectors-snowflake.md)
+ [SQL Server](connectors-microsoft-sql-server.md)
+ [Teradata](connectors-teradata.md)
+ [Timestream](connectors-timestream.md)
+ [TPC-DS](connectors-tpcds.md)
+ [Vertica](connectors-vertica.md)

**Note**  
The [AthenaJdbcConnector](https://serverlessrepo.aws.amazon.com/applications/us-east-1/292517598671/AthenaJdbcConnector) (latest version 2022.4.1) has been deprecated. Instead, use a database-specific connector like those for [MySQL](connectors-mysql.md), [Redshift](connectors-redshift.md), or [PostgreSQL](connectors-postgresql.md).

# Amazon Athena Azure Data Lake Storage (ADLS) Gen2 connector
<a name="connectors-adls-gen2"></a>

The Amazon Athena connector for [Azure Data Lake Storage (ADLS) Gen2](https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/azure/adls-gen2/) enables Amazon Athena to run SQL queries on data stored on ADLS. Athena cannot access files stored in the data lake directly. 

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.
+ **Workflow** – The connector implements the JDBC interface, which uses the `com.microsoft.sqlserver.jdbc.SQLServerDriver` driver. The connector passes queries to the Azure Synapse engine, which then accesses the data lake. 
+ **Data handling and S3** – Normally, the Lambda connector queries data directly without transfer to Amazon S3. However, when data returned by the Lambda function exceeds Lambda limits, the data is written to the Amazon S3 spill bucket that you specify so that Athena can read the excess.
+ **AAD authentication** – AAD can be used as an authentication method for the Azure Synapse connector. To use AAD, the JDBC connection string that the connector uses must contain the URL parameters `authentication=ActiveDirectoryServicePrincipal`, `AADSecurePrincipalId`, and `AADSecurePrincipalSecret`. These parameters can be passed in directly or supplied through Secrets Manager.

## Prerequisites
<a name="connectors-datalakegentwo-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations
<a name="connectors-adls-gen2-limitations"></a>
+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ Date and timestamp data types in filter conditions must be cast to appropriate data types.

## Terms
<a name="connectors-adls-gen2-terms"></a>

The following terms relate to the Azure Data Lake Storage Gen2 connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters
<a name="connectors-adls-gen2-parameters"></a>

Use the parameters in this section to configure the Azure Data Lake Storage Gen2 connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.  
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html). Starting December 3, 2024, use the parameters below only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector.

### Glue connections (recommended)
<a name="adls-gen2-gc"></a>

We recommend that you configure an Azure Data Lake Storage Gen2 connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the Azure Data Lake Storage Gen2 connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type DATALAKEGEN2
```

**Lambda environment properties**
+ **glue_connection** – Specifies the name of the Glue connection associated with the federated connector. 
+ **casing_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **none** – Do not change case of the given schema and table names. This is the default for connectors that have an associated glue connection. 
  + **upper** – Upper case all given schema and table names.
  + **lower** – Lower case all given schema and table names.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Azure Data Lake Storage Gen2 connector created using Glue connections does not support the use of a multiplexing handler.
The Azure Data Lake Storage Gen2 connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="adls-gen2-legacy"></a>

#### Connection string
<a name="connectors-adls-gen2-connection-string"></a>

Use a JDBC connection string in the following format to connect to a database instance.

```
datalakegentwo://${jdbc_connection_string}
```

#### Using a multiplexing handler
<a name="connectors-adls-gen2-using-a-multiplexing-handler"></a>

You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | DataLakeGen2MuxCompositeHandler | 
| Metadata handler | DataLakeGen2MuxMetadataHandler | 
| Record handler | DataLakeGen2MuxRecordHandler | 

##### Multiplexing handler parameters
<a name="connectors-adls-gen2-multiplexing-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| *catalog*_connection_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is mydatalakegentwocatalog, then the environment variable name is mydatalakegentwocatalog_connection_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS_LAMBDA_FUNCTION_NAME}. | 

The following example properties are for a DataLakeGen2 MUX Lambda function that supports two database instances: `datalakegentwo1` (the default), and `datalakegentwo2`.


****  

| Property | Value | 
| --- | --- | 
| default | datalakegentwo://jdbc:sqlserver://adlsgentwo1.hostname:port;databaseName=database_name;${secret1_name} | 
| datalakegentwo_catalog1_connection_string | datalakegentwo://jdbc:sqlserver://adlsgentwo1.hostname:port;databaseName=database_name;${secret1_name} | 
| datalakegentwo_catalog2_connection_string | datalakegentwo://jdbc:sqlserver://adlsgentwo2.hostname:port;databaseName=database_name;${secret2_name} | 
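
The routing behavior in the table above can be sketched as follows. This is a hypothetical Python illustration of how a multiplexing handler might resolve a catalog name to a connection string from Lambda environment variables; the helper `get_connection_string` is not part of the connector's actual API.

```python
import os

def get_connection_string(catalog, env=None):
    """Sketch: look up <catalog>_connection_string in the environment,
    falling back to the required 'default' connection string."""
    env = os.environ if env is None else env
    return env.get(f"{catalog}_connection_string") or env["default"]

# Example environment mirroring the table above (secret names unresolved).
env = {
    "default": "datalakegentwo://jdbc:sqlserver://adlsgentwo1.hostname:port;databaseName=database_name;${secret1_name}",
    "datalakegentwo_catalog2_connection_string": "datalakegentwo://jdbc:sqlserver://adlsgentwo2.hostname:port;databaseName=database_name;${secret2_name}",
}
print(get_connection_string("datalakegentwo_catalog2", env))  # routes to adlsgentwo2
print(get_connection_string("unknown_catalog", env))          # falls back to default
```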

##### Providing credentials
<a name="connectors-adls-gen2-providing-credentials"></a>

To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${secret1_name}`.

```
datalakegentwo://jdbc:sqlserver://hostname:port;databaseName=database_name;${secret1_name}
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
datalakegentwo://jdbc:sqlserver://hostname:port;databaseName=database_name;user=user_name;password=password
```

#### Using a single connection handler
<a name="connectors-adls-gen2-using-a-single-connection-handler"></a>

You can use the following single connection metadata and record handlers to connect to a single Azure Data Lake Storage Gen2 instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | DataLakeGen2CompositeHandler | 
| Metadata handler | DataLakeGen2MetadataHandler | 
| Record handler | DataLakeGen2RecordHandler | 

##### Single connection handler parameters
<a name="connectors-adls-gen2-single-connection-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single Azure Data Lake Storage Gen2 instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | datalakegentwo://jdbc:sqlserver://hostname:port;databaseName=database_name;${secret_name} | 

#### Spill parameters
<a name="connectors-adls-gen2-spill-parameters"></a>

The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| spill_bucket | Required. Spill bucket name. | 
| spill_prefix | Required. Spill bucket key prefix. | 
| spill_put_request_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, {"x-amz-server-side-encryption" : "AES256"}). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*. | 
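
Because `spill_put_request_headers` is a JSON encoded map, it can be decoded before being applied to the putObject request. The following Python sketch shows the decoding step only; the helper name `parse_spill_headers` is hypothetical.

```python
import json

def parse_spill_headers(raw):
    """Sketch: decode the optional spill_put_request_headers value into a
    dict of header names to values for the S3 putObject spill request."""
    if not raw:
        return {}  # parameter is optional; no extra headers
    return json.loads(raw)

print(parse_spill_headers('{"x-amz-server-side-encryption" : "AES256"}'))
```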

## Data type support
<a name="connectors-adls-gen2-data-type-support"></a>

The following table shows the corresponding data types for ADLS Gen2 and Arrow.


****  

| ADLS Gen2 | Arrow | 
| --- | --- | 
| bit | TINYINT | 
| tinyint | SMALLINT | 
| smallint | SMALLINT | 
| int | INT | 
| bigint | BIGINT | 
| decimal | DECIMAL | 
| numeric | FLOAT8 | 
| smallmoney | FLOAT8 | 
| money | DECIMAL | 
| float[24] | FLOAT4 | 
| float[53] | FLOAT8 | 
| real | FLOAT4 | 
| datetime | Date(MILLISECOND) | 
| datetime2 | Date(MILLISECOND) | 
| smalldatetime | Date(MILLISECOND) | 
| date | Date(DAY) | 
| time | VARCHAR | 
| datetimeoffset | Date(MILLISECOND) | 
| char[n] | VARCHAR | 
| varchar[n/max] | VARCHAR | 

## Partitions and splits
<a name="connectors-adls-gen2-partitions-and-splits"></a>

Azure Data Lake Storage Gen2 uses Hadoop compatible Gen2 blob storage for storing data files. The data from these files is queried from the Azure Synapse engine, which treats Gen2 data stored in file systems as external tables. The partitions are implemented based on the type of data. If the data has already been partitioned and distributed within the Gen2 storage system, the connector retrieves the data as a single split.

## Performance
<a name="connectors-adls-gen2-performance"></a>

The Azure Data Lake Storage Gen2 connector shows slower query performance when running multiple queries at once, and is subject to throttling.

The Athena Azure Data Lake Storage Gen2 connector performs predicate pushdown to decrease the data scanned by the query. Simple predicates and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time.

### Predicates
<a name="connectors-datalakegentwo-performance-predicates"></a>

A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Azure Data Lake Storage Gen2 connector can combine these expressions and push them directly to Azure Data Lake Storage Gen2 for enhanced functionality and to reduce the amount of data scanned.

The following Athena Azure Data Lake Storage Gen2 connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, NULL_IF, IS_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE_PATTERN, IN

### Combined pushdown example
<a name="connectors-datalakegentwo-performance-pushdown-example"></a>

For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d)) 
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%');
```

## Passthrough queries
<a name="connectors-datalakegentwo-passthrough-queries"></a>

The Azure Data Lake Storage Gen2 connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Azure Data Lake Storage Gen2, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Azure Data Lake Storage Gen2. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```
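
The passthrough syntax above can be generated programmatically. The following Python sketch wraps a native query in the `system.query` table function; the helper name `passthrough_query` is hypothetical, and single quotes are doubled per standard SQL string-literal escaping.

```python
def passthrough_query(native_query):
    """Sketch: wrap a native data source query in the Athena passthrough
    syntax, doubling single quotes for the SQL string literal."""
    escaped = native_query.replace("'", "''")
    return (
        "SELECT * FROM TABLE(\n"
        "        system.query(\n"
        f"            query => '{escaped}'\n"
        "        ))"
    )

print(passthrough_query("SELECT * FROM customer LIMIT 10"))
```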

## License information
<a name="connectors-datalakegentwo-license-information"></a>

By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-datalakegen2/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-datalakegen2/LICENSE.txt) file on GitHub.com.

## Additional resources
<a name="connectors-datalakegentwo-additional-resources"></a>

For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-datalakegen2/pom.xml) file for the Azure Data Lake Storage Gen2 connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-datalakegen2) on GitHub.com.

# Amazon Athena Azure Synapse connector
<a name="connectors-azure-synapse"></a>

The Amazon Athena connector for [Azure Synapse analytics](https://docs.microsoft.com/en-us/azure/synapse-analytics/overview-what-is) enables Amazon Athena to run SQL queries on your Azure Synapse databases using JDBC.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

## Prerequisites
<a name="connectors-synapse-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations
<a name="connectors-azure-synapse-limitations"></a>
+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ In filter conditions, you must cast the `Date` and `Timestamp` data types to the appropriate data type.
+ To search for negative values of type `Real` and `Float`, use the `<=` or `>=` operator.
+ The `binary`, `varbinary`, `image`, and `rowversion` data types are not supported.

## Terms
<a name="connectors-azure-synapse-terms"></a>

The following terms relate to the Synapse connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters
<a name="connectors-azure-synapse-parameters"></a>

Use the parameters in this section to configure the Synapse connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.  
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html). Starting December 3, 2024, use the parameters below only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector.

### Glue connections (recommended)
<a name="connectors-azure-synapse-gc"></a>

We recommend that you configure a Synapse connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the Synapse connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type SYNAPSE
```

**Lambda environment properties**
+ **glue_connection** – Specifies the name of the Glue connection associated with the federated connector. 
+ **casing_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **none** – Do not change case of the given schema and table names. This is the default for connectors that have an associated glue connection. 
  + **upper** – Upper case all given schema and table names.
  + **lower** – Lower case all given schema and table names.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Synapse connector created using Glue connections does not support the use of a multiplexing handler.
The Synapse connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="connectors-azure-synapse-legacy"></a>

#### Connection string
<a name="connectors-azure-synapse-connection-string"></a>

Use a JDBC connection string in the following format to connect to a database instance.

```
synapse://${jdbc_connection_string}
```

#### Using a multiplexing handler
<a name="connectors-azure-synapse-using-a-multiplexing-handler"></a>

You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | SynapseMuxCompositeHandler | 
| Metadata handler | SynapseMuxMetadataHandler | 
| Record handler | SynapseMuxRecordHandler | 

##### Multiplexing handler parameters
<a name="connectors-azure-synapse-multiplexing-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| *catalog*_connection_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is mysynapsecatalog, then the environment variable name is mysynapsecatalog_connection_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS_LAMBDA_FUNCTION_NAME}. | 

The following example properties are for a Synapse MUX Lambda function that supports two database instances: `synapse1` (the default), and `synapse2`.


****  

| Property | Value | 
| --- | --- | 
| default | synapse://jdbc:synapse://synapse1.hostname:port;databaseName=<database_name>;${secret1_name} | 
| synapse_catalog1_connection_string | synapse://jdbc:synapse://synapse1.hostname:port;databaseName=<database_name>;${secret1_name} | 
| synapse_catalog2_connection_string | synapse://jdbc:synapse://synapse2.hostname:port;databaseName=<database_name>;${secret2_name} | 

##### Providing credentials
<a name="connectors-azure-synapse-providing-credentials"></a>

To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${secret_name}`.

```
synapse://jdbc:synapse://hostname:port;databaseName=<database_name>;${secret_name}
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
synapse://jdbc:sqlserver://hostname:port;databaseName=<database_name>;user=<user>;password=<password>
```
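Conceptually, the substitution works like the following sketch. This is illustrative only, not the connector's actual code; the secret store and connection string are made-up examples.

```python
import re

def resolve_secret(connection_string: str, lookup) -> str:
    """Replace a ${secret_name} placeholder with user/password properties.

    Illustrative sketch of the substitution the connector performs;
    `lookup` stands in for a Secrets Manager GetSecretValue call.
    """
    def repl(match):
        creds = lookup(match.group(1))
        return f"user={creds['username']};password={creds['password']}"
    return re.sub(r"\$\{([^}]+)\}", repl, connection_string)

# Hypothetical secret store for demonstration.
fake_store = {"secret_name": {"username": "sampleuser", "password": "samplepass"}}
resolved = resolve_secret(
    "synapse://jdbc:sqlserver://hostname:1433;databaseName=mydb;${secret_name}",
    fake_store.__getitem__,
)
```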

#### Using a single connection handler
<a name="connectors-azure-synapse-using-a-single-connection-handler"></a>

You can use the following single connection metadata and record handlers to connect to a single Synapse instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | SynapseCompositeHandler | 
| Metadata handler | SynapseMetadataHandler | 
| Record handler | SynapseRecordHandler | 

##### Single connection handler parameters
<a name="connectors-azure-synapse-single-connection-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single Synapse instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | synapse://jdbc:sqlserver://hostname:port;databaseName=<database_name>;${secret_name} | 

#### Configuring Active Directory authentication
<a name="connectors-azure-synapse-configuring-active-directory-authentication"></a>

The Amazon Athena Azure Synapse connector supports Microsoft Active Directory Authentication. Before you begin, you must configure an administrative user in the Microsoft Azure portal and then use AWS Secrets Manager to create a secret.

**To set the Active Directory administrative user**

1. Using an account that has administrative privileges, sign in to the Microsoft Azure portal at [https://portal.azure.com/](https://portal.azure.com/).

1. In the search box, enter **Azure Synapse Analytics**, and then choose **Azure Synapse Analytics**.  
![\[Choose Azure Synapse Analytics.\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-azure-synapse-1.png)

1. Open the menu on the left.  
![\[Choose the Azure portal menu.\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-azure-synapse-2.png)

1. In the navigation pane, choose **Azure Active Directory**.

1. On the **Set admin** tab, set **Active Directory admin** to a new or existing user.  
![\[Use the Set admin tab\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-azure-synapse-3.png)

1. In AWS Secrets Manager, store the admin username and password credentials. For information on creating a secret in Secrets Manager, see [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html).

**To view your secret in Secrets Manager**

1. Open the Secrets Manager console at [https://console.aws.amazon.com/secretsmanager/](https://console.aws.amazon.com/secretsmanager/).

1. In the navigation pane, choose **Secrets**.

1. On the **Secrets** page, choose the link to your secret.

1. On the details page for your secret, choose **Retrieve secret value**.  
![\[Viewing secrets in AWS Secrets Manager.\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-azure-synapse-4.png)

##### Modifying the connection string
<a name="connectors-azure-synapse-modifying-the-connection-string"></a>

To enable Active Directory Authentication for the connector, modify the connection string using the following syntax:

```
synapse://jdbc:sqlserver://hostname:port;databaseName=database_name;authentication=ActiveDirectoryPassword;${secret_name}
```

##### Using ActiveDirectoryServicePrincipal
<a name="connectors-azure-synapse-using-activedirectoryserviceprincipal"></a>

The Amazon Athena Azure Synapse connector also supports `ActiveDirectoryServicePrincipal`. To enable this, modify the connection string as follows.

```
synapse://jdbc:sqlserver://hostname:port;databaseName=database_name;authentication=ActiveDirectoryServicePrincipal;${secret_name}
```

For `secret_name`, specify a secret whose `username` value is the application (client) ID and whose `password` value is the secret of the service principal identity.
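A sketch of building that secret value follows; the ID and secret below are placeholders, not real Azure credentials.

```python
import json

# Placeholder values; use your Azure application's actual IDs.
app_client_id = "00000000-0000-0000-0000-000000000000"
client_secret_value = "example-client-secret"

# The connector reads the application (client) ID from "username"
# and the service principal secret from "password".
secret_string = json.dumps({
    "username": app_client_id,
    "password": client_secret_value,
})
```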

#### Spill parameters
<a name="connectors-azure-synapse-spill-parameters"></a>

The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| spill_bucket | Required. Spill bucket name. | 
| spill_prefix | Required. Spill bucket key prefix. | 
| spill_put_request_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*. | 
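Because `spill_put_request_headers` must be valid JSON, it can help to build and check the value before pasting it into the Lambda environment variable. A minimal sketch, using the header from the table above:

```python
import json

# Encode the headers map exactly as the example in the table shows.
spill_put_request_headers = json.dumps(
    {"x-amz-server-side-encryption": "AES256"}
)

# Round-trip to confirm the value is well-formed JSON before setting
# it as the Lambda environment variable.
parsed_headers = json.loads(spill_put_request_headers)
```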

## Data type support
<a name="connectors-azure-synapse-data-type-support"></a>

The following table shows the corresponding data types for Synapse and Apache Arrow.


****  

| Synapse | Arrow | 
| --- | --- | 
| bit | TINYINT | 
| tinyint | SMALLINT | 
| smallint | SMALLINT | 
| int | INT | 
| bigint | BIGINT | 
| decimal | DECIMAL | 
| numeric | FLOAT8 | 
| smallmoney | FLOAT8 | 
| money | DECIMAL | 
| float[24] | FLOAT4 | 
| float[53] | FLOAT8 | 
| real | FLOAT4 | 
| datetime | Date(MILLISECOND) | 
| datetime2 | Date(MILLISECOND) | 
| smalldatetime | Date(MILLISECOND) | 
| date | Date(DAY) | 
| time | VARCHAR | 
| datetimeoffset | Date(MILLISECOND) | 
| char[n] | VARCHAR | 
| varchar[n/max] | VARCHAR | 
| nchar[n] | VARCHAR | 
| nvarchar[n/max] | VARCHAR | 

## Partitions and splits
<a name="connectors-azure-synapse-partitions-and-splits"></a>

A partition is represented by a single partition column of type `varchar`. Synapse supports range partitioning, so partitioning is implemented by extracting the partition column and partition range from Synapse metadata tables. These range values are used to create the splits.

## Performance
<a name="connectors-azure-synapse-performance"></a>

Selecting a subset of columns significantly speeds up query runtime. The connector shows significant throttling due to concurrency.

The Athena Synapse connector performs predicate pushdown to decrease the data scanned by the query. Simple predicates and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time.

### Predicates
<a name="connectors-synapse-performance-predicates"></a>

A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Synapse connector can combine these expressions and push them directly to Synapse for enhanced functionality and to reduce the amount of data scanned.

The following Athena Synapse connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, NULL_IF, IS_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE_PATTERN, IN

### Combined pushdown example
<a name="connectors-synapse-performance-pushdown-example"></a>

For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d)) 
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%');
```

## Passthrough queries
<a name="connectors-synapse-passthrough-queries"></a>

The Synapse connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Synapse, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Synapse. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```
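Because the pushed-down query is passed as a single-quoted SQL string literal, any single quotes inside it must be doubled, per standard SQL escaping. The following helper is an illustrative sketch (it is not part of the connector):

```python
def passthrough_sql(inner_query: str) -> str:
    """Wrap a query for the system.query table function, doubling
    single quotes so the string literal remains valid SQL."""
    escaped = inner_query.replace("'", "''")
    return (
        "SELECT * FROM TABLE(\n"
        "        system.query(\n"
        f"            query => '{escaped}'\n"
        "        ))"
    )

sql = passthrough_sql("SELECT * FROM customer WHERE name = 'Alice' LIMIT 10")
```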

## License information
<a name="connectors-synapse-license-information"></a>

By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-synapse/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-synapse/LICENSE.txt) file on GitHub.com.

## Additional resources
<a name="connectors-synapse-additional-resources"></a>
+ For an article that shows how to use Amazon QuickSight and Amazon Athena Federated Query to build dashboards and visualizations on data stored in Microsoft Azure Synapse databases, see [Perform multi-cloud analytics using Amazon QuickSight, Amazon Athena Federated Query, and Microsoft Azure Synapse](https://aws.amazon.com/blogs/business-intelligence/perform-multi-cloud-analytics-using-amazon-quicksight-amazon-athena-federated-query-and-microsoft-azure-synapse/) in the *AWS Big Data Blog*.
+ For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-synapse/pom.xml) file for the Synapse connector on GitHub.com.
+ For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-synapse) on GitHub.com.

# Amazon Athena Cloudera Hive connector
<a name="connectors-cloudera-hive"></a>

The Amazon Athena connector for Cloudera Hive enables Athena to run SQL queries on the [Cloudera Hive](https://www.cloudera.com/products/open-source/apache-hadoop/apache-hive.html) Hadoop distribution. The connector transforms your Athena SQL queries to their equivalent HiveQL syntax. 

You can configure this connector with an AWS Glue connection that centralizes configuration properties, or directly through Lambda environment variables.

## Prerequisites
<a name="connectors-hive-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).
+ Set up a VPC and a security group before you use this connector. For more information, see [Create a VPC for a data source connector or AWS Glue connection](athena-connectors-vpc-creation.md).

## Limitations
<a name="connectors-cloudera-hive-limitations"></a>
+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.

## Terms
<a name="connectors-cloudera-hive-terms"></a>

The following terms relate to the Cloudera Hive connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters
<a name="connectors-cloudera-hive-parameters"></a>

Use the parameters in this section to configure the Cloudera Hive connector.

### Glue connections (recommended)
<a name="connectors-cloudera-hive-gc"></a>

We recommend that you configure a Cloudera Hive connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the Cloudera Hive connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type CLOUDERAHIVE
```

**Lambda environment properties**
+ **glue_connection** – Specifies the name of the Glue connection associated with the federated connector. 
+ **casing_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **none** – Do not change the case of the given schema and table names. This is the default for connectors that have an associated Glue connection. 
  + **upper** – Uppercase all given schema and table names.
  + **lower** – Lowercase all given schema and table names.
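Conceptually, the three modes map to simple transformations of the identifiers that Athena sends to the connector. The following is a sketch of that behavior, not the connector's actual code:

```python
# Sketch of how each casing_mode value transforms identifiers.
CASING_MODES = {
    "none": lambda name: name,  # default when a Glue connection is attached
    "upper": str.upper,
    "lower": str.lower,
}

def apply_casing(mode: str, identifier: str) -> str:
    """Apply a casing_mode transformation to a schema or table name."""
    return CASING_MODES[mode](identifier)
```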

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Cloudera Hive connector created using Glue connections does not support the use of a multiplexing handler.
The Cloudera Hive connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="connectors-cloudera-hive-legacy"></a>

#### Connection string
<a name="connectors-cloudera-hive-connection-string"></a>

Use a JDBC connection string in the following format to connect to a database instance.

```
hive://${jdbc_connection_string}
```

#### Using a multiplexing handler
<a name="connectors-cloudera-hive-multiplexing-handler"></a>

You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | HiveMuxCompositeHandler | 
| Metadata handler | HiveMuxMetadataHandler | 
| Record handler | HiveMuxRecordHandler | 

##### Multiplexing handler parameters
<a name="connectors-cloudera-hive-multiplexing-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| catalog_connection_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is myhivecatalog, then the environment variable name is myhivecatalog_connection_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS_LAMBDA_FUNCTION_NAME}. | 

The following example properties are for a Hive MUX Lambda function that supports two database instances: `hive1` (the default), and `hive2`.


****  

| Property | Value | 
| --- | --- | 
| default | hive://jdbc:hive2://hive1:10000/default;${Test/RDS/hive1} | 
| hive2_catalog1_connection_string | hive://jdbc:hive2://hive1:10000/default;${Test/RDS/hive1} | 
| hive2_catalog2_connection_string | hive://jdbc:hive2://hive2:10000/default;UID=sample&PWD=sample | 
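A sketch of how a multiplexing handler resolves a connection string from these environment variables, trying the catalog-specific entry first and then falling back to the required `default` entry. The environment values below are illustrative, not real endpoints:

```python
import os

# Hypothetical environment variables, mirroring the example table above.
os.environ["default"] = "hive://jdbc:hive2://hive1:10000/default;${Test/RDS/hive1}"
os.environ["mycatalog_connection_string"] = (
    "hive://jdbc:hive2://hive2:10000/default;UID=sample&PWD=sample"
)

def connection_string_for(catalog: str) -> str:
    """Resolve the connection string for an Athena catalog name,
    falling back to the required 'default' entry (sketch only)."""
    return os.environ.get(f"{catalog}_connection_string", os.environ["default"])
```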

##### Providing credentials
<a name="connectors-cloudera-hive-credentials"></a>

To provide a user name and password for your database in your JDBC connection string, the Cloudera Hive connector requires a secret from AWS Secrets Manager. To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

Put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

**Example connection string with secret name**  
The following string has the secret name `${Test/RDS/hive1}`.

```
hive://jdbc:hive2://hive1:10000/default;...&${Test/RDS/hive1}&...
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
hive://jdbc:hive2://hive1:10000/default;...&UID=sample2&PWD=sample2&...
```

Currently, the Cloudera Hive connector recognizes the `UID` and `PWD` JDBC properties.

#### Using a single connection handler
<a name="connectors-cloudera-hive-single-connection-handler"></a>

You can use the following single connection metadata and record handlers to connect to a single Cloudera Hive instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | HiveCompositeHandler | 
| Metadata handler | HiveMetadataHandler | 
| Record handler | HiveRecordHandler | 

##### Single connection handler parameters
<a name="connectors-cloudera-hive-single-connection-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single Cloudera Hive instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | hive://jdbc:hive2://hive1:10000/default;secret=${Test/RDS/hive1} | 

#### Spill parameters
<a name="connectors-cloudera-hive-spill-parameters"></a>

The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| spill_bucket | Required. Spill bucket name. | 
| spill_prefix | Required. Spill bucket key prefix. | 
| spill_put_request_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*. | 

## Data type support
<a name="connectors-cloudera-hive-data-type-support"></a>

The following table shows the corresponding data types for JDBC, Cloudera Hive, and Arrow.


****  

| JDBC | Cloudera Hive | Arrow | 
| --- | --- | --- | 
| Boolean | Boolean | Bit | 
| Integer | TINYINT | Tiny | 
| Short | SMALLINT | Smallint | 
| Integer | INT | Int | 
| Long | BIGINT | Bigint | 
| float | float4 | Float4 | 
| Double | float8 | Float8 | 
| Date | date | DateDay | 
| Timestamp | timestamp | DateMilli | 
| String | VARCHAR | Varchar | 
| Bytes | bytes | Varbinary | 
| BigDecimal | Decimal | Decimal | 
| ARRAY | N/A (see note) | List | 

**Note**  
Currently, Cloudera Hive does not support the aggregate types `ARRAY`, `MAP`, `STRUCT`, or `UNIONTYPE`. Columns of aggregate types are treated as `VARCHAR` columns in SQL.

## Partitions and splits
<a name="connectors-cloudera-hive-partitions-and-splits"></a>

Partitions are used to determine how to generate splits for the connector. Athena constructs a synthetic column of type `varchar` that represents the partitioning scheme for the table to help the connector generate splits. The connector does not modify the actual table definition.

## Performance
<a name="connectors-cloudera-hive-performance"></a>

Cloudera Hive supports static partitions. The Athena Cloudera Hive connector can retrieve data from these partitions in parallel. If you want to query very large datasets with uniform partition distribution, static partitioning is highly recommended. The Cloudera Hive connector is resilient to throttling due to concurrency.

The Athena Cloudera Hive connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses, simple predicates, and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time.

### LIMIT clauses
<a name="connectors-hive-performance-limit-clauses"></a>

A `LIMIT N` statement reduces the data scanned by the query. With `LIMIT N` pushdown, the connector returns only `N` rows to Athena.

### Predicates
<a name="connectors-hive-performance-predicates"></a>

A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Cloudera Hive connector can combine these expressions and push them directly to Cloudera Hive for enhanced functionality and to reduce the amount of data scanned.

The following Athena Cloudera Hive connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, IS_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE_PATTERN, IN

### Combined pushdown example
<a name="connectors-hive-performance-pushdown-example"></a>

For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d))
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%') 
LIMIT 10;
```

## Passthrough queries
<a name="connectors-hive-passthrough-queries"></a>

The Cloudera Hive connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Cloudera Hive, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Cloudera Hive. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information
<a name="connectors-hive-license-information"></a>

By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-cloudera-hive/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-cloudera-hive/LICENSE.txt) file on GitHub.com.

## Additional resources
<a name="connectors-hive-additional-resources"></a>

For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-cloudera-hive/pom.xml) file for the Cloudera Hive connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-cloudera-hive) on GitHub.com.

# Amazon Athena Cloudera Impala connector
<a name="connectors-cloudera-impala"></a>

The Amazon Athena Cloudera Impala connector enables Athena to run SQL queries on the [Cloudera Impala](https://docs.cloudera.com/cdw-runtime/cloud/impala-overview/topics/impala-overview.html) distribution. The connector transforms your Athena SQL queries to the equivalent Impala syntax.

You can configure this connector with an AWS Glue connection that centralizes configuration properties, or directly through Lambda environment variables.

## Prerequisites
<a name="connectors-impala-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).
+ Set up a VPC and a security group before you use this connector. For more information, see [Create a VPC for a data source connector or AWS Glue connection](athena-connectors-vpc-creation.md).

## Limitations
<a name="connectors-cloudera-impala-limitations"></a>
+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.

## Terms
<a name="connectors-cloudera-impala-terms"></a>

The following terms relate to the Cloudera Impala connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters
<a name="connectors-cloudera-impala-parameters"></a>

Use the parameters in this section to configure the Cloudera Impala connector.

### Glue connections (recommended)
<a name="connectors-cloudera-impala-gc"></a>

We recommend that you configure a Cloudera Impala connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the Cloudera Impala connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type CLOUDERAIMPALA
```

**Lambda environment properties**
+ **glue_connection** – Specifies the name of the Glue connection associated with the federated connector. 
+ **casing_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **none** – Do not change the case of the given schema and table names. This is the default for connectors that have an associated Glue connection. 
  + **upper** – Uppercase all given schema and table names.
  + **lower** – Lowercase all given schema and table names.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Cloudera Impala connector created using Glue connections does not support the use of a multiplexing handler.
The Cloudera Impala connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="connectors-cloudera-impala-legacy"></a>

#### Connection string
<a name="connectors-cloudera-impala-connection-string"></a>

Use a JDBC connection string in the following format to connect to an Impala cluster.

```
impala://${jdbc_connection_string}
```

#### Using a multiplexing handler
<a name="connectors-cloudera-impala-using-a-multiplexing-handler"></a>

You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | ImpalaMuxCompositeHandler | 
| Metadata handler | ImpalaMuxMetadataHandler | 
| Record handler | ImpalaMuxRecordHandler | 

##### Multiplexing handler parameters
<a name="connectors-cloudera-impala-multiplexing-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| catalog_connection_string | Required. An Impala cluster connection string for an Athena catalog. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is myimpalacatalog, then the environment variable name is myimpalacatalog_connection_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS_LAMBDA_FUNCTION_NAME}. | 

The following example properties are for an Impala MUX Lambda function that supports two database instances: `impala1` (the default), and `impala2`.


****  

| Property | Value | 
| --- | --- | 
| default | impala://jdbc:impala://some.impala.host.name:21050/?${Test/impala1} | 
| impala_catalog1_connection_string | impala://jdbc:impala://someother.impala.host.name:21050/?${Test/impala1} | 
| impala_catalog2_connection_string | impala://jdbc:impala://another.impala.host.name:21050/?UID=sample&PWD=sample | 

##### Providing credentials
<a name="connectors-cloudera-impala-providing-credentials"></a>

To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${Test/impala1host}`.

```
impala://jdbc:impala://Impala1host:21050/?...&${Test/impala1host}&...
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
impala://jdbc:impala://Impala1host:21050/?...&UID=sample2&PWD=sample2&...
```

Currently, Cloudera Impala recognizes the `UID` and `PWD` JDBC properties.
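
The secret-name substitution described above can be sketched as follows. This is a minimal illustration, not the connector's actual code: the `resolve_secrets` helper is hypothetical, and a plain dict stands in for AWS Secrets Manager (a real connector calls the Secrets Manager `GetSecretValue` API).

```python
import json
import re

def resolve_secrets(connection_string, get_secret_json):
    """Replace each ${secret_name} placeholder with UID/PWD JDBC properties.

    get_secret_json(name) must return the secret as a JSON string in the
    {"username": ..., "password": ...} format shown above.
    """
    def substitute(match):
        creds = json.loads(get_secret_json(match.group(1)))
        return f"UID={creds['username']}&PWD={creds['password']}"
    return re.sub(r"\$\{([^}]+)\}", substitute, connection_string)

# Stand-in for Secrets Manager, for illustration only.
fake_store = {"Test/impala1host": '{"username": "sample2", "password": "sample2"}'}

resolved = resolve_secrets(
    "impala://jdbc:impala://Impala1host:21050/?${Test/impala1host}",
    fake_store.__getitem__,
)
print(resolved)
```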

#### Using a single connection handler
<a name="connectors-cloudera-impala-using-a-single-connection-handler"></a>

You can use the following single connection metadata and record handlers to connect to a single Cloudera Impala instance.



| Handler type | Class | 
| --- | --- | 
| Composite handler | ImpalaCompositeHandler | 
| Metadata handler | ImpalaMetadataHandler | 
| Record handler | ImpalaRecordHandler | 

##### Single connection handler parameters
<a name="connectors-cloudera-impala-single-connection-handler-parameters"></a>



| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single Cloudera Impala instance supported by a Lambda function.



| Property | Value | 
| --- | --- | 
| default | impala://jdbc:impala://Impala1host:21050/?secret=${Test/impala1host} | 

#### Spill parameters
<a name="connectors-cloudera-impala-spill-parameters"></a>

The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.



| Parameter | Description | 
| --- | --- | 
| spill\_bucket | Required. Spill bucket name. | 
| spill\_prefix | Required. Spill bucket key prefix. | 
| spill\_put\_request\_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*. | 

## Data type support
<a name="connectors-cloudera-impala-data-type-support"></a>

The following table shows the corresponding data types for JDBC, Cloudera Impala, and Arrow.



| JDBC | Cloudera Impala | Arrow | 
| --- | --- | --- | 
| Boolean | Boolean | Bit | 
| Integer | TINYINT | Tiny | 
| Short | SMALLINT | Smallint | 
| Integer | INT | Int | 
| Long | BIGINT | Bigint | 
| float | float4 | Float4 | 
| Double | float8 | Float8 | 
| Date | date | DateDay | 
| Timestamp | timestamp | DateMilli | 
| String | VARCHAR | Varchar | 
| Bytes | bytes | Varbinary | 
| BigDecimal | Decimal | Decimal | 
| ARRAY | N/A (see note) | List | 

**Note**  
Currently, Cloudera Impala does not support the aggregate types `ARRAY`, `MAP`, `STRUCT`, or `UNIONTYPE`. Columns of aggregate types are treated as `VARCHAR` columns in SQL.

## Partitions and splits
<a name="connectors-cloudera-impala-partitions-and-splits"></a>

Partitions are used to determine how to generate splits for the connector. Athena constructs a synthetic column of type `varchar` that represents the partitioning scheme for the table to help the connector generate splits. The connector does not modify the actual table definition.

## Performance
<a name="connectors-cloudera-impala-performance"></a>

Cloudera Impala supports static partitions. The Athena Cloudera Impala connector can retrieve data from these partitions in parallel. If you want to query very large datasets with uniform partition distribution, static partitioning is highly recommended. The Cloudera Impala connector is resilient to throttling due to concurrency.

The Athena Cloudera Impala connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses, simple predicates, and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time.

### LIMIT clauses
<a name="connectors-impala-performance-limit-clauses"></a>

A `LIMIT N` statement reduces the data scanned by the query. With `LIMIT N` pushdown, the connector returns only `N` rows to Athena.

### Predicates
<a name="connectors-impala-performance-predicates"></a>

A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Cloudera Impala connector can combine these expressions and push them directly to Cloudera Impala for enhanced functionality and to reduce the amount of data scanned.

The following Athena Cloudera Impala connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT\_EQUAL, LESS\_THAN, LESS\_THAN\_OR\_EQUAL, GREATER\_THAN, GREATER\_THAN\_OR\_EQUAL, IS\_DISTINCT\_FROM, NULL\_IF, IS\_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE\_PATTERN, IN

### Combined pushdown example
<a name="connectors-impala-performance-pushdown-example"></a>

For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d))
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%') 
LIMIT 10;
```
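
Conceptually, pushdown means the connector serializes supported expressions into the SQL it sends to the data source instead of filtering rows after the fact. The sketch below illustrates the idea; the `(column, operator, value)` predicate format and the `build_remote_query` helper are assumptions made for this example, while the real connector works from the Athena Federation SDK's constraint objects.

```python
# Supported predicates are rendered into the remote query's WHERE clause;
# anything unsupported is left behind for Athena to evaluate.
SUPPORTED_OPS = {"=", "<>", "<", "<=", ">", ">=", "LIKE"}

def build_remote_query(table, predicates, limit=None):
    pushed, residual = [], []
    for col, op, value in predicates:
        if op in SUPPORTED_OPS:
            literal = f"'{value}'" if isinstance(value, str) else str(value)
            pushed.append(f"{col} {op} {literal}")
        else:
            residual.append((col, op, value))  # Athena evaluates these
    sql = f"SELECT * FROM {table}"
    if pushed:
        sql += " WHERE " + " AND ".join(pushed)
    if limit is not None:
        sql += f" LIMIT {limit}"  # LIMIT pushdown
    return sql, residual

sql, residual = build_remote_query(
    "my_table", [("col_a", ">", 10), ("col_f", "LIKE", "%pattern%")], limit=10)
print(sql)
```

Because both predicates and the `LIMIT` are pushed down, only the matching rows cross the wire back to Athena.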

## Passthrough queries
<a name="connectors-impala-passthrough-queries"></a>

The Cloudera Impala connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Cloudera Impala, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Cloudera Impala. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information
<a name="connectors-impala-license-information"></a>

By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-cloudera-impala/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-cloudera-impala/LICENSE.txt) file on GitHub.com.

## Additional resources
<a name="connectors-impala-additional-resources"></a>

For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-cloudera-impala/pom.xml) file for the Cloudera Impala connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-cloudera-impala) on GitHub.com.

# Amazon Athena CloudWatch connector
<a name="connectors-cloudwatch"></a>

The Amazon Athena CloudWatch connector enables Amazon Athena to communicate with CloudWatch so that you can query your log data with SQL.

This connector does not use Glue Connections to centralize configuration properties in Glue. Connection configuration is done through Lambda.

The connector maps your LogGroups as schemas and each LogStream as a table. The connector also maps a special `all_log_streams` view that contains all LogStreams in the LogGroup. This view enables you to query all the logs in a LogGroup at once instead of searching through each LogStream individually.

## Prerequisites
<a name="connectors-cloudwatch-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Parameters
<a name="connectors-cloudwatch-parameters"></a>

Use the parameters in this section to configure the CloudWatch connector.

### Glue connections (recommended)
<a name="connectors-cloudwatch-gc"></a>

We recommend that you configure a CloudWatch connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the CloudWatch connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type CLOUDWATCH
```

**Lambda environment properties**
+ **glue\_connection** – Specifies the name of the Glue connection associated with the federated connector. 

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The CloudWatch connector created using Glue connections does not support the use of a multiplexing handler.
The CloudWatch connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="connectors-cloudwatch-legacy"></a>
+ **spill\_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill\_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill\_put\_request\_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms\_key\_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable\_spill\_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).

The connector also supports [AIMD congestion control](https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease) for handling throttling events from CloudWatch through the [Amazon Athena Query Federation SDK](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-federation-sdk) `ThrottlingInvoker` construct. You can tweak the default throttling behavior by setting any of the following optional environment variables:
+ **throttle\_initial\_delay\_ms** – The initial call delay applied after the first congestion event. The default is 10 milliseconds.
+ **throttle\_max\_delay\_ms** – The maximum delay between calls. You can derive TPS by dividing it into 1000 ms. The default is 1000 milliseconds.
+ **throttle\_decrease\_factor** – The factor by which Athena reduces the call rate. The default is 0.5.
+ **throttle\_increase\_ms** – The rate at which Athena decreases the call delay. The default is 10 milliseconds.
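
The AIMD behavior that these variables control can be sketched as follows. This is an illustrative model of how a `ThrottlingInvoker`-style delay might evolve, not the SDK's actual implementation; the `AimdDelay` class is an assumption made for this example.

```python
class AimdDelay:
    """Additive-increase/multiplicative-decrease call delay.

    Parameter names mirror the throttle_* environment variables
    (all delay values are in milliseconds).
    """
    def __init__(self, initial_delay_ms=10, max_delay_ms=1000,
                 decrease_factor=0.5, increase_ms=10):
        self.initial = initial_delay_ms
        self.max = max_delay_ms
        self.factor = decrease_factor
        self.step = increase_ms
        self.delay = 0  # no delay until the first congestion event

    def on_throttle(self):
        if self.delay == 0:
            self.delay = self.initial  # first congestion event
        else:
            # Call rate is multiplied by the factor, so the delay between
            # calls is divided by it (it doubles when the factor is 0.5),
            # capped at the maximum delay.
            self.delay = min(self.max, self.delay / self.factor)

    def on_success(self):
        # Additive increase of the call rate: step the delay back down.
        self.delay = max(0, self.delay - self.step)

d = AimdDelay()
d.on_throttle()   # first congestion event: delay becomes 10 ms
d.on_throttle()   # sustained congestion: delay doubles to 20 ms
d.on_success()    # recovery: delay steps back down by 10 ms
print(d.delay)
```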

## Databases and tables
<a name="connectors-cloudwatch-databases-and-tables"></a>

The Athena CloudWatch connector maps your LogGroups as schemas (that is, databases) and each LogStream as a table. The connector also maps a special `all_log_streams` view that contains all LogStreams in the LogGroup. This view enables you to query all the logs in a LogGroup at once instead of searching through each LogStream individually.

Every table mapped by the Athena CloudWatch connector has the following schema. This schema matches the fields provided by CloudWatch Logs.
+ **log\_stream** – A `VARCHAR` that contains the name of the LogStream that the row is from.
+ **time** – An `INT64` that contains the epoch time when the log line was generated.
+ **message** – A `VARCHAR` that contains the log message.

**Examples**  
The following example shows how to perform a `SELECT` query on a specified LogStream.

```
SELECT * 
FROM "lambda:cloudwatch_connector_lambda_name"."log_group_path"."log_stream_name" 
LIMIT 100
```

The following example shows how to use the `all_log_streams` view to perform a query on all LogStreams in a specified LogGroup. 

```
SELECT * 
FROM "lambda:cloudwatch_connector_lambda_name"."log_group_path"."all_log_streams" 
LIMIT 100
```

## Required Permissions
<a name="connectors-cloudwatch-required-permissions"></a>

For full details on the IAM policies that this connector requires, review the `Policies` section of the [athena-cloudwatch.yaml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-cloudwatch/athena-cloudwatch.yaml) file. The following list summarizes the required permissions.
+ **Amazon S3 write access** – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
+ **Athena GetQueryExecution** – The connector uses this permission to fast-fail when the upstream Athena query has terminated.
+ **CloudWatch Logs Read/Write** – The connector uses this permission to read your log data and to write its diagnostic logs.

## Performance
<a name="connectors-cloudwatch-performance"></a>

The Athena CloudWatch connector attempts to optimize queries against CloudWatch by parallelizing scans of the log streams required for your query. For certain time period filters, predicate pushdown is performed both within the Lambda function and within CloudWatch Logs.

For best performance, use only lowercase for your log group names and log stream names. Using mixed casing causes the connector to perform a case-insensitive search that is more computationally intensive.

**Note**  
 The CloudWatch connector does not support uppercase database names. 

## Passthrough queries
<a name="connectors-cloudwatch-passthrough-queries"></a>

The CloudWatch connector supports [passthrough queries](federated-query-passthrough.md) that use [CloudWatch Logs Insights query syntax](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html). For more information about CloudWatch Logs Insights, see [Analyzing log data with CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html) in the *Amazon CloudWatch Logs User Guide*.

To create passthrough queries with CloudWatch, use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            STARTTIME => 'start_time',
            ENDTIME => 'end_time',
            QUERYSTRING => 'query_string',
            LOGGROUPNAMES => 'log_group_names',
            LIMIT => 'max_number_of_results'
        ))
```

The following example CloudWatch passthrough query filters for the `duration` field when it does not equal 1000.

```
SELECT * FROM TABLE(
        system.query(
            STARTTIME => '1710918615308',
            ENDTIME => '1710918615972',
            QUERYSTRING => 'fields @duration | filter @duration != 1000',
            LOGGROUPNAMES => '/aws/lambda/cloudwatch-test-1',
            LIMIT => '2'
            ))
```
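
The `STARTTIME` and `ENDTIME` values are epoch timestamps in milliseconds (13 digits, as in the example above). A quick way to compute a one-hour query window, sketched here for illustration:

```python
import time

# End the window now and start it one hour earlier, both as epoch
# milliseconds suitable for STARTTIME and ENDTIME.
end_ms = int(time.time() * 1000)
start_ms = end_ms - 60 * 60 * 1000
print(start_ms, end_ms)
```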

## License information
<a name="connectors-cloudwatch-license-information"></a>

The Amazon Athena CloudWatch connector project is licensed under the [Apache-2.0 License](https://www.apache.org/licenses/LICENSE-2.0.html).

## Additional resources
<a name="connectors-cloudwatch-additional-resources"></a>

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-cloudwatch) on GitHub.com.

# Amazon Athena CloudWatch Metrics connector
<a name="connectors-cwmetrics"></a>

The Amazon Athena CloudWatch Metrics connector enables Amazon Athena to query CloudWatch Metrics data with SQL.

This connector does not use Glue Connections to centralize configuration properties in Glue. Connection configuration is done through Lambda.

For information on publishing query metrics to CloudWatch from Athena itself, see [Use CloudWatch and EventBridge to monitor queries and control costs](workgroups-control-limits.md).

## Prerequisites
<a name="connectors-cwmetrics-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Parameters
<a name="connectors-cwmetrics-parameters"></a>

Use the parameters in this section to configure the CloudWatch Metrics connector.

### Glue connections (recommended)
<a name="connectors-cwmetrics-gc"></a>

We recommend that you configure a CloudWatch Metrics connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the CloudWatch Metrics connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type CLOUDWATCHMETRICS
```

**Lambda environment properties**
+ **glue\_connection** – Specifies the name of the Glue connection associated with the federated connector. 

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The CloudWatch Metrics connector created using Glue connections does not support the use of a multiplexing handler.
The CloudWatch Metrics connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="connectors-cwmetrics-legacy"></a>
+ **spill\_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill\_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill\_put\_request\_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms\_key\_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable\_spill\_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).

The connector also supports [AIMD congestion control](https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease) for handling throttling events from CloudWatch through the [Amazon Athena Query Federation SDK](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-federation-sdk) `ThrottlingInvoker` construct. You can tweak the default throttling behavior by setting any of the following optional environment variables:
+ **throttle\_initial\_delay\_ms** – The initial call delay applied after the first congestion event. The default is 10 milliseconds.
+ **throttle\_max\_delay\_ms** – The maximum delay between calls. You can derive TPS by dividing it into 1000 ms. The default is 1000 milliseconds.
+ **throttle\_decrease\_factor** – The factor by which Athena reduces the call rate. The default is 0.5.
+ **throttle\_increase\_ms** – The rate at which Athena decreases the call delay. The default is 10 milliseconds.

## Databases and tables
<a name="connectors-cwmetrics-databases-and-tables"></a>

The Athena CloudWatch Metrics connector maps your namespaces, dimensions, metrics, and metric values into two tables in a single schema called `default`.

### The metrics table
<a name="connectors-cwmetrics-the-metrics-table"></a>

The `metrics` table contains the available metrics as uniquely defined by a combination of namespace, dimension set, and name. The `metrics` table contains the following columns.
+ **namespace** – A `VARCHAR` containing the namespace.
+ **metric\_name** – A `VARCHAR` containing the metric name.
+ **dimensions** – A `LIST` of `STRUCT` objects composed of `dim_name (VARCHAR)` and `dim_value (VARCHAR)`.
+ **statistic** – A `LIST` of `VARCHAR` statistics (for example, `p90`, `AVERAGE`, ...) available for the metric.
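
To see how rows of the `metrics` table line up with CloudWatch data, consider the shape of a CloudWatch `ListMetrics` response (`Metrics` entries with `Namespace`, `MetricName`, and `Dimensions`). The sketch below maps a canned response of that shape onto table rows; the `to_metrics_rows` helper is purely illustrative, and in practice the connector reads this data from the CloudWatch Metrics API.

```python
def to_metrics_rows(list_metrics_response):
    """Map a ListMetrics-shaped response to rows of the metrics table."""
    rows = []
    for metric in list_metrics_response["Metrics"]:
        rows.append({
            "namespace": metric["Namespace"],
            "metric_name": metric["MetricName"],
            "dimensions": [
                {"dim_name": d["Name"], "dim_value": d["Value"]}
                for d in metric.get("Dimensions", [])
            ],
        })
    return rows

# Canned response in the shape the CloudWatch ListMetrics API returns.
response = {
    "Metrics": [
        {
            "Namespace": "AWS/Lambda",
            "MetricName": "Invocations",
            "Dimensions": [{"Name": "FunctionName", "Value": "my-function"}],
        }
    ]
}
rows = to_metrics_rows(response)
print(rows[0]["metric_name"])
```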

### The metric\_samples table
<a name="connectors-cwmetrics-the-metric_samples-table"></a>

The `metric_samples` table contains the available metric samples for each metric in the `metrics` table. The `metric_samples` table contains the following columns.
+ **namespace** – A `VARCHAR` that contains the namespace.
+ **metric\_name** – A `VARCHAR` that contains the metric name.
+ **dimensions** – A `LIST` of `STRUCT` objects composed of `dim_name (VARCHAR)` and `dim_value (VARCHAR)`.
+ **dim\_name** – A `VARCHAR` convenience field that you can use to easily filter on a single dimension name.
+ **dim\_value** – A `VARCHAR` convenience field that you can use to easily filter on a single dimension value.
+ **period** – An `INT` field that represents the "period" of the metric in seconds (for example, a 60 second metric).
+ **timestamp** – A `BIGINT` field that represents the epoch time in seconds that the metric sample is for.
+ **value** – A `FLOAT8` field that contains the value of the sample.
+ **statistic** – A `VARCHAR` that contains the statistic type of the sample (for example, `AVERAGE` or `p90`).

## Required Permissions
<a name="connectors-cwmetrics-required-permissions"></a>

For full details on the IAM policies that this connector requires, review the `Policies` section of the [athena-cloudwatch-metrics.yaml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-cloudwatch-metrics/athena-cloudwatch-metrics.yaml) file. The following list summarizes the required permissions.
+ **Amazon S3 write access** – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
+ **Athena GetQueryExecution** – The connector uses this permission to fast-fail when the upstream Athena query has terminated.
+ **CloudWatch Metrics ReadOnly** – The connector uses this permission to query your metrics data.
+ **CloudWatch Logs Write** – The connector uses this access to write its diagnostic logs.

## Performance
<a name="connectors-cwmetrics-performance"></a>

The Athena CloudWatch Metrics connector attempts to optimize queries against CloudWatch Metrics by parallelizing the scans required for your query. For certain time period, metric, namespace, and dimension filters, predicate pushdown is performed both within the Lambda function and within CloudWatch Metrics.

## License information
<a name="connectors-cwmetrics-license-information"></a>

The Amazon Athena CloudWatch Metrics connector project is licensed under the [Apache-2.0 License](https://www.apache.org/licenses/LICENSE-2.0.html).

## Additional resources
<a name="connectors-cwmetrics-additional-resources"></a>

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-cloudwatch-metrics) on GitHub.com.

# Amazon Athena AWS CMDB connector
<a name="connectors-cmdb"></a>

The Amazon Athena AWS CMDB connector enables Athena to communicate with various AWS services so that you can query them with SQL.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

## Prerequisites
<a name="connectors-cmdb-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Parameters
<a name="connectors-cmdb-parameters"></a>

Use the parameters in this section to configure the AWS CMDB connector.

### Glue connections (recommended)
<a name="connectors-cmdb-gc"></a>

We recommend that you configure an AWS CMDB connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the AWS CMDB connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type CMDB
```

**Lambda environment properties**

**glue\_connection** – Specifies the name of the Glue connection associated with the federated connector. 

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The AWS CMDB connector created using Glue connections does not support the use of a multiplexing handler.
The AWS CMDB connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="connectors-cmdb-legacy"></a>

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.

The parameter names and definitions listed below are for Athena data source connectors created without an associated Glue connection. Use the following parameters only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector or when the `glue_connection` environment property is not specified.

**Lambda environment properties**
+ **spill\_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill\_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill\_put\_request\_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms\_key\_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable\_spill\_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **default\_ec2\_image\_owner** – (Optional) When set, controls the default Amazon EC2 image owner that filters [Amazon Machine Images (AMI)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html). If you do not set this value and your query against the EC2 images table does not include a filter for owner, your results will include all public images.

## Databases and tables
<a name="connectors-cmdb-databases-and-tables"></a>

The Athena AWS CMDB connector makes the following databases and tables available for querying your AWS resource inventory. For more information on the columns available in each table, run a `DESCRIBE database.table` statement using the Athena console or API.
+ **ec2** – This database contains Amazon EC2 related resources, including the following.
+ **ebs\$1volumes** – Contains details of your Amazon EBS volumes.
+ **ec2\$1instances** – Contains details of your EC2 Instances.
+ **ec2\$1images** – Contains details of your EC2 Instance images.
+ **routing\$1tables** – Contains details of your VPC Routing Tables.
+ **security\$1groups** – Contains details of your security groups.
+ **subnets** – Contains details of your VPC Subnets.
+ **vpcs** – Contains details of your VPCs.
+ **emr** – This database contains Amazon EMR related resources, including the following.
+ **emr\$1clusters** – Contains details of your EMR Clusters.
+ **rds** – This database contains Amazon RDS related resources, including the following.
+ **rds\$1instances** – Contains details of your RDS Instances.
+ **s3** – This database contains RDS related resources, including the following.
+ **buckets** – Contains details of your Amazon S3 buckets.
+ **objects** – Contains details of your Amazon S3 objects, excluding their contents.

## Required Permissions
<a name="connectors-cmdb-required-permissions"></a>

For full details on the IAM policies that this connector requires, review the `Policies` section of the [athena-aws-cmdb.yaml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-aws-cmdb/athena-aws-cmdb.yaml) file. The following list summarizes the required permissions.
+ **Amazon S3 write access** – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
+ **Athena GetQueryExecution** – The connector uses this permission to fast-fail when the upstream Athena query has terminated.
+ **S3 List** – The connector uses this permission to list your Amazon S3 buckets and objects.
+ **EC2 Describe** – The connector uses this permission to describe resources such as your Amazon EC2 instances, security groups, VPCs, and Amazon EBS volumes.
+ **EMR Describe / List** – The connector uses this permission to describe your EMR clusters.
+ **RDS Describe** – The connector uses this permission to describe your Amazon RDS instances.

## Performance
<a name="connectors-cmdb-performance"></a>

Currently, the Athena AWS CMDB connector does not support parallel scans. Predicate pushdown is performed within the Lambda function. Where possible, partial predicates are pushed to the services being queried. For example, a query for the details of a specific Amazon EC2 instance calls the EC2 API with the specific instance ID to run a targeted describe operation.
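The split between predicates pushed to a service API and predicates applied inside the Lambda function can be pictured with a small sketch. The helper name and the predicate format below are hypothetical, purely to illustrate the idea; the connector's actual logic is in its GitHub repository.

```python
def build_describe_request(predicates):
    """Split predicates into ones pushable to the EC2 DescribeInstances API
    and leftovers that must be evaluated inside the Lambda function."""
    pushable, residual = {}, {}
    for column, value in predicates.items():
        if column == "instance_id":
            # DescribeInstances accepts instance IDs directly, so this
            # predicate can run a targeted describe operation.
            pushable["InstanceIds"] = [value]
        else:
            # Other columns are filtered after the results come back.
            residual[column] = value
    return pushable, residual

pushed, leftover = build_describe_request(
    {"instance_id": "i-0abc123", "state": "running"}
)
```

Here, only the instance ID reaches the EC2 API; the `state` condition is applied to the returned rows.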

## License information
<a name="connectors-cmdb-license-information"></a>

The Amazon Athena AWS CMDB connector project is licensed under the [Apache-2.0 License](https://www.apache.org/licenses/LICENSE-2.0.html).

## Additional resources
<a name="connectors-cmdb-additional-resources"></a>

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-aws-cmdb) on GitHub.com.

# Amazon Athena IBM Db2 connector
<a name="connectors-ibm-db2"></a>

The Amazon Athena connector for Db2 enables Amazon Athena to run SQL queries on your IBM Db2 databases using JDBC.

This connector can be registered with the AWS Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses AWS Glue connections to centralize configuration properties in AWS Glue.

## Prerequisites
<a name="connectors-dbtwo-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).
+ Set up a VPC and a security group before you use this connector. For more information, see [Create a VPC for a data source connector or AWS Glue connection](athena-connectors-vpc-creation.md).

## Limitations
<a name="connectors-ibm-db2-limitations"></a>
+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ Date and timestamp data types in filter conditions must be cast to appropriate data types.

## Terms
<a name="connectors-ibm-db2-terms"></a>

The following terms relate to the Db2 connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters
<a name="connectors-ibm-db2-parameters"></a>

Use the parameters in this section to configure the Db2 connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.  
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html). Starting December 3, 2024, use the parameters below only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector.

### Glue connections (recommended)
<a name="connectors-ibm-db2-gc"></a>

We recommend that you configure the Db2 connector by using a Glue connection object. To do this, set the `glue_connection` environment variable of the Db2 connector Lambda function to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type DB2
```

**Lambda environment properties**
+ **glue\_connection** – Specifies the name of the Glue connection associated with the federated connector. 
+ **casing\_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **none** – Do not change the case of the given schema and table names. This is the default for connectors that have an associated Glue connection. 
  + **upper** – Uppercase all given schema and table names.
  + **lower** – Lowercase all given schema and table names.
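As an illustration, the three modes behave like the following sketch. The function name is hypothetical; the connector applies this normalization internally before matching names against the database.

```python
def apply_casing_mode(name, casing_mode="none"):
    """Illustrates how a schema or table name is normalized by casing_mode."""
    if casing_mode == "upper":
        return name.upper()
    if casing_mode == "lower":
        return name.lower()
    # "none": leave the name unchanged (the default with Glue connections).
    return name
```

For example, `apply_casing_mode("MySchema", "lower")` yields `myschema`, while the default mode passes `MySchema` through unchanged.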

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Db2 connector created using Glue connections does not support the use of a multiplexing handler.
The Db2 connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="connectors-ibm-db2-legacy"></a>

#### Connection string
<a name="connectors-ibm-db2-connection-string"></a>

Use a JDBC connection string in the following format to connect to a database instance.

```
dbtwo://${jdbc_connection_string}
```

#### Using a multiplexing handler
<a name="connectors-ibm-db2-using-a-multiplexing-handler"></a>

You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | Db2MuxCompositeHandler | 
| Metadata handler | Db2MuxMetadataHandler | 
| Record handler | Db2MuxRecordHandler | 

##### Multiplexing handler parameters
<a name="connectors-ibm-db2-multiplexing-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| \_catalog\_connection\_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is mydbtwocatalog, then the environment variable name is mydbtwocatalog\_connection\_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:$\{AWS\_LAMBDA\_FUNCTION\_NAME\}. | 

The following example properties are for a Db2 MUX Lambda function that supports two database instances: `dbtwo1` (the default) and `dbtwo2`.


****  

| Property | Value | 
| --- | --- | 
| default | dbtwo://jdbc:db2://dbtwo1.hostname:port/database\_name:$\{secret1\_name\} | 
| dbtwo\_catalog1\_connection\_string | dbtwo://jdbc:db2://dbtwo1.hostname:port/database\_name:$\{secret1\_name\} | 
| dbtwo\_catalog2\_connection\_string | dbtwo://jdbc:db2://dbtwo2.hostname:port/database\_name:$\{secret2\_name\} | 
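Routing by catalog name amounts to an environment-variable lookup with a `default` fallback, as in this sketch. The helper name and example values are illustrative only; the actual lookup lives inside the multiplexing handlers.

```python
import os

def resolve_connection_string(catalog):
    """Pick the connection string for a catalog name, falling back to 'default'."""
    if catalog.startswith("lambda:"):
        # A lambda:<function_name> catalog resolves to the default string.
        return os.environ["default"]
    return os.environ.get(f"{catalog}_connection_string", os.environ["default"])

# Example environment mirroring a two-instance setup (values are illustrative).
os.environ["default"] = "dbtwo://jdbc:db2://dbtwo1.hostname:port/database_name:${secret1_name}"
os.environ["dbtwo_catalog2_connection_string"] = (
    "dbtwo://jdbc:db2://dbtwo2.hostname:port/database_name:${secret2_name}"
)
```

A query against `dbtwo_catalog2` resolves to the second instance; any catalog without its own variable falls back to the default connection string.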

##### Providing credentials
<a name="connectors-ibm-db2-providing-credentials"></a>

To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${secret_name}`.

```
dbtwo://jdbc:db2://hostname:port/database_name:${secret_name}
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
dbtwo://jdbc:db2://hostname:port/database_name:user=user_name;password=password;
```
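The substitution above can be sketched as a simple string replacement. This is illustrative only (the helper name is hypothetical); the connector retrieves the secret value through the Secrets Manager API before performing the replacement.

```python
import json

def resolve_secret_in_connection_string(connection_string, secret_name, secret_json):
    """Replace ${secret_name} with user=...;password=...; built from the secret payload."""
    creds = json.loads(secret_json)
    replacement = f"user={creds['username']};password={creds['password']};"
    return connection_string.replace("${" + secret_name + "}", replacement)

resolved = resolve_secret_in_connection_string(
    "dbtwo://jdbc:db2://hostname:port/database_name:${my_secret}",
    "my_secret",
    '{"username": "user_name", "password": "password"}',
)
```

The resolved string matches the expanded form shown above, with the user name and password in place of the secret name.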

#### Using a single connection handler
<a name="connectors-ibm-db2-using-a-single-connection-handler"></a>

You can use the following single connection metadata and record handlers to connect to a single Db2 instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | Db2CompositeHandler | 
| Metadata handler | Db2MetadataHandler | 
| Record handler | Db2RecordHandler | 

##### Single connection handler parameters
<a name="connectors-ibm-db2-single-connection-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single Db2 instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | dbtwo://jdbc:db2://hostname:port/database\_name:$\{secret\_name\} | 

#### Spill parameters
<a name="connectors-ibm-db2-spill-parameters"></a>

The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| spill\_bucket | Required. Spill bucket name. | 
| spill\_prefix | Required. Spill bucket key prefix. | 
| spill\_put\_request\_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, \{"x-amz-server-side-encryption" : "AES256"\}). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the Amazon Simple Storage Service API Reference. | 
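For example, the value of `spill_put_request_headers` must be a JSON-encoded map. The following sketch (with an illustrative header value) shows one way to produce such a value when setting the environment variable manually.

```python
import json

# A map of Amazon S3 PutObject request headers to apply to spilled objects.
# Server-side encryption with AES256 is used here as an illustrative example.
headers = {"x-amz-server-side-encryption": "AES256"}

# The Lambda environment variable expects the JSON-encoded form of this map.
spill_put_request_headers = json.dumps(headers)
```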

## Data type support
<a name="connectors-ibm-db2-data-type-support"></a>

The following table shows the corresponding data types for JDBC and Arrow.


****  

| Db2 | Arrow | 
| --- | --- | 
| CHAR | VARCHAR | 
| VARCHAR | VARCHAR | 
| DATE | DATEDAY | 
| TIME | VARCHAR | 
| TIMESTAMP | DATEMILLI | 
| DATETIME | DATEMILLI | 
| BOOLEAN | BOOL | 
| SMALLINT | SMALLINT | 
| INTEGER | INT | 
| BIGINT | BIGINT | 
| DECIMAL | DECIMAL | 
| REAL | FLOAT8 | 
| DOUBLE | FLOAT8 | 
| DECFLOAT | FLOAT8 | 

## Partitions and splits
<a name="connectors-ibm-db2-partitions-and-splits"></a>

A partition is represented by one or more partition columns of type `varchar`. The Db2 connector creates partitions using the following organization schemes.
+ Distribute by hash
+ Partition by range
+ Organize by dimensions

The connector retrieves partition details, such as the number of partitions and the column name, from one or more Db2 metadata tables. Splits are created based on the number of partitions identified. 
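The relationship between partitions and splits can be sketched as follows. The function and the split shape are hypothetical, for illustration only; the real split generation is performed by the connector's metadata handler.

```python
def build_splits(partition_names):
    """One split per detected partition; a single split when the table is unpartitioned."""
    if not partition_names:
        # Unpartitioned tables are read as one split.
        return [{"partition": None}]
    return [{"partition": name} for name in partition_names]

splits = build_splits(["PART_2023", "PART_2024"])
```

Two partitions yield two splits, so the table can be read in parallel per partition.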

## Performance
<a name="connectors-ibm-db2-performance"></a>

The Athena Db2 connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses, simple predicates, and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time.

### LIMIT clauses
<a name="connectors-dbtwo-performance-limit-clauses"></a>

A `LIMIT N` statement reduces the data scanned by the query. With `LIMIT N` pushdown, the connector returns only `N` rows to Athena.

### Predicates
<a name="connectors-dbtwo-performance-predicates"></a>

A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Db2 connector can combine these expressions and push them directly to Db2 for enhanced functionality and to reduce the amount of data scanned.

The following Athena Db2 connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT\_EQUAL, LESS\_THAN, LESS\_THAN\_OR\_EQUAL, GREATER\_THAN, GREATER\_THAN\_OR\_EQUAL, IS\_DISTINCT\_FROM, IS\_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE\_PATTERN, IN

### Combined pushdown example
<a name="connectors-dbtwo-performance-pushdown-example"></a>

For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d))
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%') 
LIMIT 10;
```

## Passthrough queries
<a name="connectors-dbtwo-passthrough-queries"></a>

The Db2 connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Db2, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Db2. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information
<a name="connectors-dbtwo-license-information"></a>

By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-db2/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-db2/LICENSE.txt) file on GitHub.com.

## Additional resources
<a name="connectors-dbtwo-additional-resources"></a>

For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-db2/pom.xml) file for the Db2 connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-db2) on GitHub.com.

# Amazon Athena IBM Db2 AS/400 (Db2 iSeries) connector
<a name="connectors-ibm-db2-as400"></a>

The Amazon Athena connector for Db2 AS/400 enables Amazon Athena to run SQL queries on your IBM Db2 AS/400 (Db2 iSeries) databases using JDBC.

This connector can be registered with the AWS Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses AWS Glue connections to centralize configuration properties in AWS Glue.

## Prerequisites
<a name="connectors-db2as400-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).
+ Set up a VPC and a security group before you use this connector. For more information, see [Create a VPC for a data source connector or AWS Glue connection](athena-connectors-vpc-creation.md).

## Limitations
<a name="connectors-ibm-db2-as400-limitations"></a>
+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ Date and timestamp data types in filter conditions must be cast to appropriate data types.

## Terms
<a name="connectors-ibm-db2-as400-terms"></a>

The following terms relate to the Db2 AS/400 connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters
<a name="connectors-ibm-db2-as400-parameters"></a>

Use the parameters in this section to configure the Db2 AS/400 connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.  
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html). Starting December 3, 2024, use the parameters below only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector.

### Glue connections (recommended)
<a name="connectors-ibm-db2-as400-gc"></a>

We recommend that you configure the Db2 AS/400 connector by using a Glue connection object. To do this, set the `glue_connection` environment variable of the Db2 AS/400 connector Lambda function to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type DB2AS400
```

**Lambda environment properties**
+ **glue\_connection** – Specifies the name of the Glue connection associated with the federated connector. 
+ **casing\_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **none** – Do not change the case of the given schema and table names. This is the default for connectors that have an associated Glue connection. 
  + **upper** – Uppercase all given schema and table names.
  + **lower** – Lowercase all given schema and table names.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Db2 AS/400 connector created using Glue connections does not support the use of a multiplexing handler.
The Db2 AS/400 connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="connectors-ibm-db2-as400-legacy"></a>

#### Connection string
<a name="connectors-ibm-db2-as400-connection-string"></a>

Use a JDBC connection string in the following format to connect to a database instance.

```
db2as400://${jdbc_connection_string}
```

#### Using a multiplexing handler
<a name="connectors-ibm-db2-as400-using-a-multiplexing-handler"></a>

You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | Db2MuxCompositeHandler | 
| Metadata handler | Db2MuxMetadataHandler | 
| Record handler | Db2MuxRecordHandler | 

##### Multiplexing handler parameters
<a name="connectors-ibm-db2-as400-multiplexing-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| \_catalog\_connection\_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is mydb2as400catalog, then the environment variable name is mydb2as400catalog\_connection\_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:$\{AWS\_LAMBDA\_FUNCTION\_NAME\}. | 

The following example properties are for a Db2 AS/400 MUX Lambda function that supports two database instances, `db2as4001` (the default) and `db2as4002`, plus a third catalog whose connection string embeds the credentials directly.


****  

| Property | Value | 
| --- | --- | 
| default | db2as400://jdbc:as400://<ip\_address>;<properties>;:$\{<secret\_name>\}; | 
| db2as400\_catalog1\_connection\_string | db2as400://jdbc:as400://db2as4001.hostname/:$\{secret1\_name\} | 
| db2as400\_catalog2\_connection\_string | db2as400://jdbc:as400://db2as4002.hostname/:$\{secret2\_name\} | 
| db2as400\_catalog3\_connection\_string | db2as400://jdbc:as400://<ip\_address>;user=<username>;password=<password>;<properties>; | 

##### Providing credentials
<a name="connectors-ibm-db2-as400-providing-credentials"></a>

To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${secret_name}`.

```
db2as400://jdbc:as400://<ip_address>;<properties>;:${<secret_name>};
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
db2as400://jdbc:as400://<ip_address>;user=<username>;password=<password>;<properties>;
```

#### Using a single connection handler
<a name="connectors-ibm-db2-as400-using-a-single-connection-handler"></a>

You can use the following single connection metadata and record handlers to connect to a single Db2 AS/400 instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | Db2CompositeHandler | 
| Metadata handler | Db2MetadataHandler | 
| Record handler | Db2RecordHandler | 

##### Single connection handler parameters
<a name="connectors-ibm-db2-as400-single-connection-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single Db2 AS/400 instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | db2as400://jdbc:as400://<ip\_address>;<properties>;:$\{<secret\_name>\}; | 

#### Spill parameters
<a name="connectors-ibm-db2-as400-spill-parameters"></a>

The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| spill\_bucket | Required. Spill bucket name. | 
| spill\_prefix | Required. Spill bucket key prefix. | 
| spill\_put\_request\_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, \{"x-amz-server-side-encryption" : "AES256"\}). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the Amazon Simple Storage Service API Reference. | 

## Data type support
<a name="connectors-ibm-db2-as400-data-type-support"></a>

The following table shows the corresponding data types for JDBC and Apache Arrow.


****  

| Db2 AS/400 | Arrow | 
| --- | --- | 
| CHAR | VARCHAR | 
| VARCHAR | VARCHAR | 
| DATE | DATEDAY | 
| TIME | VARCHAR | 
| TIMESTAMP | DATEMILLI | 
| DATETIME | DATEMILLI | 
| BOOLEAN | BOOL | 
| SMALLINT | SMALLINT | 
| INTEGER | INT | 
| BIGINT | BIGINT | 
| DECIMAL | DECIMAL | 
| REAL | FLOAT8 | 
| DOUBLE | FLOAT8 | 
| DECFLOAT | FLOAT8 | 

## Partitions and splits
<a name="connectors-ibm-db2-as400-partitions-and-splits"></a>

A partition is represented by one or more partition columns of type `varchar`. The Db2 AS/400 connector creates partitions using the following organization schemes.
+ Distribute by hash
+ Partition by range
+ Organize by dimensions

The connector retrieves partition details, such as the number of partitions and the column name, from one or more Db2 AS/400 metadata tables. Splits are created based on the number of partitions identified. 

## Performance
<a name="connectors-db2-as400-performance"></a>

For improved performance, use predicate pushdown when you query from Athena, as in the following examples.

```
SELECT * FROM "lambda:<LAMBDA_NAME>"."<SCHEMA_NAME>"."<TABLE_NAME>" 
 WHERE integercol = 2147483647
```

```
SELECT * FROM "lambda:<LAMBDA_NAME>"."<SCHEMA_NAME>"."<TABLE_NAME>" 
 WHERE timestampcol >= TIMESTAMP '2018-03-25 07:30:58.878'
```

## Passthrough queries
<a name="connectors-db2as400-passthrough-queries"></a>

The Db2 AS/400 connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Db2 AS/400, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Db2 AS/400. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information
<a name="connectors-db2as400-license-information"></a>

By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-db2-as400/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-db2-as400/LICENSE.txt) file on GitHub.com.

## Additional resources
<a name="connectors-db2as400-additional-resources"></a>

For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-db2-as400/pom.xml) file for the Db2 AS/400 connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-db2-as400) on GitHub.com.

# Amazon Athena DocumentDB connector
<a name="connectors-docdb"></a>

The Amazon Athena DocumentDB connector enables Athena to communicate with your DocumentDB instances so that you can query your DocumentDB data with SQL. The connector also works with any endpoint that is compatible with MongoDB.

Unlike traditional relational data stores, Amazon DocumentDB collections do not have a set schema, and DocumentDB does not have a metadata store. Each entry in a DocumentDB collection can have different fields and data types.

The DocumentDB connector supports two mechanisms for generating table schema information: basic schema inference and AWS Glue Data Catalog metadata.

Schema inference is the default. This option scans a small number of documents in your collection, forms a union of all fields, and coerces fields that have non-overlapping data types. This option works well for collections that have mostly uniform entries.
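Conceptually, the inference step works like this sketch. The helper and the coercion rule shown (conflicting types fall back to string) are simplified illustrations; the connector's actual implementation is in its GitHub repository.

```python
def infer_schema(documents):
    """Union all fields across sampled documents; coerce conflicting types to string."""
    schema = {}
    for doc in documents:
        for field, value in doc.items():
            t = type(value).__name__
            if field not in schema:
                schema[field] = t
            elif schema[field] != t:
                # Non-overlapping data types are coerced to a common type.
                schema[field] = "str"
    return schema

schema = infer_schema([{"id": 1, "name": "a"}, {"id": "x", "active": True}])
```

In this example, `id` appears as both an integer and a string, so it is coerced; `name` and `active` keep their single observed types.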

For collections with a greater variety of data types, the connector supports retrieving metadata from the AWS Glue Data Catalog. If the connector finds an AWS Glue database and table that match your DocumentDB database and collection names, it gets its schema information from the corresponding AWS Glue table. When you create your AWS Glue table, we recommend that you make it a superset of all fields that you might want to access from your DocumentDB collection. 

If you have Lake Formation enabled in your account, the IAM role for your Athena federated Lambda connector that you deployed in the AWS Serverless Application Repository must have read access in Lake Formation to the AWS Glue Data Catalog.

This connector can be registered with the AWS Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses AWS Glue connections to centralize configuration properties in AWS Glue.

## Prerequisites
<a name="connectors-docdb-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Parameters
<a name="connectors-docdb-parameters"></a>

Use the parameters in this section to configure the DocumentDB connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.  
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html). Starting December 3, 2024, use the parameters below only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector.

### Glue connections (recommended)
<a name="connectors-docdb-gc"></a>

We recommend that you configure a DocumentDB connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the DocumentDB connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type DOCUMENTDB
```

**Lambda environment properties**
+ **glue_connection** – Specifies the name of the Glue connection associated with the federated connector.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The DocumentDB connector created using Glue connections does not support the use of a multiplexing handler.
The DocumentDB connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="connectors-docdb-legacy"></a>
+ **spill_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill_put_request_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms_key_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable_spill_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **disable_glue** – (Optional) If present and set to true, the connector does not attempt to retrieve supplemental metadata from AWS Glue.
+ **glue_catalog** – (Optional) Use this option to specify a [cross-account AWS Glue catalog](data-sources-glue-cross-account.md). By default, the connector attempts to get metadata from its own AWS Glue account.
+ **default_docdb** – If present, specifies a DocumentDB connection string to use when no catalog-specific environment variable exists.
+ **disable_projection_and_casing** – (Optional) Disables projection and casing. Use if you want to query Amazon DocumentDB tables that use case sensitive column names. The `disable_projection_and_casing` parameter uses the following values to specify the behavior of casing and column mapping: 
  + **false** – This is the default setting. Projection is enabled, and the connector expects all column names to be in lower case. 
  + **true** – Disables projection and casing. When using the `disable_projection_and_casing` parameter, keep in mind the following points: 
    + Use of the parameter can result in higher bandwidth usage. Additionally, if your Lambda function is not in the same AWS Region as your data source, you will incur higher standard AWS cross-region transfer costs as a result of the higher bandwidth usage. For more information about cross-region transfer costs, see [AWS Data Transfer Charges for Server and Serverless Architectures](https://aws.amazon.com/blogs/apn/aws-data-transfer-charges-for-server-and-serverless-architectures/) in the AWS Partner Network Blog.
    + Because a larger number of bytes is transferred and because the larger number of bytes requires a higher deserialization time, overall latency can increase. 
+ **enable_case_insensitive_match** – (Optional) When `true`, performs case insensitive searches against schema and table names in Amazon DocumentDB. The default is `false`. Use if your query contains uppercase schema or table names.

#### Specifying connection strings
<a name="connectors-docdb-specifying-connection-strings"></a>

You can provide one or more properties that define the DocumentDB connection details for the DocumentDB instances that you use with the connector. To do this, set a Lambda environment variable that corresponds to the catalog name that you want to use in Athena. For example, suppose you want to use the following queries to query two different DocumentDB instances from Athena:

```
SELECT * FROM "docdb_instance_1".database.table
```

```
SELECT * FROM "docdb_instance_2".database.table
```

Before you can use these two SQL statements, you must add two environment variables to your Lambda function: `docdb_instance_1` and `docdb_instance_2`. The value for each should be a DocumentDB connection string in the following format:

```
mongodb://<username>:<password>@<hostname>:<port>/?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=rs0
```
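As a sketch of how the parts of this connection string fit together, the following hypothetical helper assembles one from its components. This is illustrative only; `docdb_connection_string` is not part of the connector:

```python
# Hypothetical helper -- illustrates the DocumentDB connection string format only.
from urllib.parse import quote_plus

def docdb_connection_string(username, password, hostname, port):
    # quote_plus escapes characters such as '@' or ':' in credentials
    return (
        f"mongodb://{quote_plus(username)}:{quote_plus(password)}"
        f"@{hostname}:{port}/?ssl=true"
        "&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=rs0"
    )

print(docdb_connection_string("athena", "p@ss", "docdb.example.com", 27017))
```

Escaping the credentials matters because an unescaped `@` in a password would be misread as the end of the credentials portion of the URI.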

##### Using secrets
<a name="connectors-docdb-using-secrets"></a>

You can optionally use AWS Secrets Manager for part or all of the value for your connection string details. To use the Athena Federated Query feature with Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

If you use the syntax `${my_secret}` to put the name of a secret from Secrets Manager in your connection string, the connector replaces `${my_secret}` with its plain text value from Secrets Manager exactly. Secrets should be stored as a plain text secret with value `<username>:<password>`. Secrets stored as `{username:<username>,password:<password>}` will not be passed to the connection string properly.

A secret can also be used for the entire connection string, with the username and password defined within the secret.

For example, suppose you set the Lambda environment variable for `docdb_instance_1` to the following value:

```
mongodb://${docdb_instance_1_creds}@myhostname.com:123/?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=rs0         
```

The Athena Query Federation SDK automatically attempts to retrieve a secret named `docdb_instance_1_creds` from Secrets Manager and inject that value in place of `${docdb_instance_1_creds}`. Any part of the connection string that is enclosed by the `${ }` character combination is interpreted as a secret from Secrets Manager. If you specify a secret name that the connector cannot find in Secrets Manager, the connector does not replace the text.
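The substitution behavior described above can be sketched with a small regular expression. This is an illustrative sketch, not the SDK's implementation; `lookup_secret` is a stand-in for a real Secrets Manager call:

```python
# Illustrative sketch of ${...} placeholder substitution in a connection string.
import re

def resolve_secrets(connection_string, lookup_secret):
    def replace(match):
        value = lookup_secret(match.group(1))
        # If the secret cannot be found, the placeholder is left as-is
        return value if value is not None else match.group(0)
    return re.sub(r"\$\{([^}]+)\}", replace, connection_string)

secrets = {"docdb_instance_1_creds": "athena:p@ssw0rd"}  # plain text <username>:<password>
resolved = resolve_secrets(
    "mongodb://${docdb_instance_1_creds}@myhostname.com:123/?ssl=true",
    secrets.get,
)
print(resolved)
```

Note that the secret value is injected verbatim, which is why the doc recommends storing it as plain text in the `<username>:<password>` form rather than as JSON.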

## Retrieving supplemental metadata
<a name="supplemental-metadata"></a>

To retrieve supplemental metadata, follow these steps to configure your Glue database and table.

### Set up the Glue database
<a name="setup-glue-database"></a>

1. Create a Glue database with the same name as your DocumentDB database.

1. In the Location URI field, enter `docdb-metadata-flag`.

### Configure the Glue table
<a name="setup-glue-table"></a>

Add the following parameters to your Glue table:
+ `docdb-metadata-flag = true`
+ `columnMapping = apple=APPLE`

  In this example, `apple` represents the lowercase column name in Glue, and `APPLE` represents the actual case-sensitive column name in your DocumentDB collection.

### Verify metadata retrieval
<a name="verify-metadata-retrieval"></a>

1. Run your query.

1. Check the Lambda function's CloudWatch logs for successful metadata retrieval. A successful retrieval will show the following log entry:

   ```
   doGetTable: Retrieved schema for table[TableName{schemaName=test, tableName=profiles}] from AWS Glue.
   ```

**Note**  
If your table already has a `columnMapping` field configured, you only need to add the `docdb-metadata-flag = true` parameter to the table properties.

## Setting up databases and tables in AWS Glue
<a name="connectors-docdb-setting-up-databases-and-tables-in-aws-glue"></a>

Because the connector's built-in schema inference capability scans a limited number of documents and supports only a subset of data types, you might want to use AWS Glue for metadata instead.

To enable an AWS Glue table for use with Amazon DocumentDB, you must have an AWS Glue database and table for the DocumentDB database and collection that you want to supply supplemental metadata for.

**To use an AWS Glue table for supplemental metadata**

1. Use the AWS Glue console to create an AWS Glue database that has the same name as your Amazon DocumentDB database name.

1. Set the URI property of the database to include **docdb-metadata-flag**.

1. (Optional) Add the **sourceTable** table property. This property defines the source table name in Amazon DocumentDB. Use this property if your AWS Glue table has a different name from the table name in Amazon DocumentDB. Differences in naming rules between AWS Glue and Amazon DocumentDB can make this necessary. For example, capital letters are not permitted in AWS Glue table names, but they are permitted in Amazon DocumentDB table names.

1. (Optional) Add the **columnMapping** table property. This property defines column name mappings. Use this property if AWS Glue column naming rules prevent you from creating an AWS Glue table that has the same column names as those in your Amazon DocumentDB table. This can be useful because capital letters are permitted in Amazon DocumentDB column names but are not permitted in AWS Glue column names.

   The `columnMapping` property value is expected to be a set of mappings in the format `col1=Col1,col2=Col2`.
**Note**  
 Column mapping applies only to top level column names and not to nested fields. 

   After you add the AWS Glue `columnMapping` table property, you can remove the `disable_projection_and_casing` Lambda environment variable.

1. Make sure that you use the data types appropriate for AWS Glue as listed in this document.
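The `columnMapping` value format described in the steps above is a simple comma-separated list of `glueName=SourceName` pairs. The following illustrative sketch (not connector code) parses it into a map:

```python
# Illustrative only: parses a columnMapping value such as "col1=Col1,col2=Col2"
# into a Glue-name-to-source-name map. Applies to top-level column names only.

def parse_column_mapping(value):
    mapping = {}
    for pair in value.split(","):
        glue_name, source_name = pair.split("=", 1)
        mapping[glue_name.strip()] = source_name.strip()
    return mapping

print(parse_column_mapping("apple=APPLE,casesensitive=CaseSensitive"))
```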

## Data type support
<a name="connectors-docdb-data-type-support"></a>

This section lists the data types that the DocumentDB connector uses for schema inference, and the data types when AWS Glue metadata is used.

### Schema inference data types
<a name="connectors-docdb-schema-inference-data-types"></a>

The schema inference feature of the DocumentDB connector attempts to infer values as belonging to one of the following data types. The table shows the corresponding Apache Arrow and Java or DocumentDB data types.



| Apache Arrow | Java or DocDB | 
| --- | --- | 
| VARCHAR | String | 
| INT | Integer | 
| BIGINT | Long | 
| BIT | Boolean | 
| FLOAT4 | Float | 
| FLOAT8 | Double | 
| TIMESTAMPSEC | Date | 
| VARCHAR | ObjectId | 
| LIST | List | 
| STRUCT | Document | 

### AWS Glue data types
<a name="connectors-docdb-glue-data-types"></a>

If you use AWS Glue for supplemental metadata, you can configure the following data types. The table shows the corresponding data types for AWS Glue and Apache Arrow.



| AWS Glue | Apache Arrow | 
| --- | --- | 
| int | INT | 
| bigint | BIGINT | 
| double | FLOAT8 | 
| float | FLOAT4 | 
| boolean | BIT | 
| binary | VARBINARY | 
| string | VARCHAR | 
| List | LIST | 
| Struct | STRUCT | 

## Required Permissions
<a name="connectors-docdb-required-permissions"></a>

For full details on the IAM policies that this connector requires, review the `Policies` section of the [athena-docdb.yaml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-docdb/athena-docdb.yaml) file. The following list summarizes the required permissions.
+ **Amazon S3 write access** – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
+ **Athena GetQueryExecution** – The connector uses this permission to fast-fail when the upstream Athena query has terminated.
+ **AWS Glue Data Catalog** – The DocumentDB connector requires read only access to the AWS Glue Data Catalog to obtain schema information.
+ **CloudWatch Logs** – The connector requires access to CloudWatch Logs for storing logs.
+ **AWS Secrets Manager read access** – If you choose to store DocumentDB endpoint details in Secrets Manager, you must grant the connector access to those secrets.
+ **VPC access** – The connector requires the ability to attach and detach interfaces to your VPC so that it can connect to it and communicate with your DocumentDB instances.

## Performance
<a name="connectors-docdb-performance"></a>

The Athena Amazon DocumentDB connector does not currently support parallel scans, but it attempts to push down predicates as part of its DocumentDB queries. Predicates against indexes on your DocumentDB collection result in significantly less data scanned.

The Lambda function performs projection pushdown to decrease the data scanned by the query. However, selecting a subset of columns sometimes results in a longer query execution runtime. `LIMIT` clauses reduce the amount of data scanned, but if you don't provide a predicate, you should expect `SELECT` queries with a `LIMIT` clause to scan at least 16 MB of data.
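To illustrate the predicate pushdown described above, a simple comparison predicate can be translated into a DocumentDB (MongoDB-compatible) filter document using the standard comparison query operators. This is an illustrative sketch, not the connector's code:

```python
# Illustrative sketch: translate a simple SQL comparison predicate into a
# MongoDB/DocumentDB filter document using standard comparison operators.

MONGO_OPS = {">": "$gt", ">=": "$gte", "<": "$lt", "<=": "$lte", "=": "$eq", "<>": "$ne"}

def to_filter(column, op, value):
    return {column: {MONGO_OPS[op]: value}}

# WHERE year > 2000 becomes:
print(to_filter("year", ">", 2000))
```

A filter like this lets DocumentDB evaluate the predicate server-side, against an index when one exists, instead of streaming the whole collection back to Lambda.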

## Passthrough queries
<a name="connectors-docdb-passthrough-queries"></a>

The Athena Amazon DocumentDB connector supports [passthrough queries](federated-query-passthrough.md) and is NoSQL based. For information about querying Amazon DocumentDB, see [Querying](https://docs.aws.amazon.com/documentdb/latest/developerguide/querying.html) in the *Amazon DocumentDB Developer Guide*.

To use passthrough queries with Amazon DocumentDB, use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            database => 'database_name',
            collection => 'collection_name',
            filter => '{query_syntax}'
        ))
```

The following example queries the `tpcds` collection in the `example` database, filtering for all books with the title *Bill of Rights*.

```
SELECT * FROM TABLE(
        system.query(
            database => 'example',
            collection => 'tpcds',
            filter => '{title: "Bill of Rights"}'
        ))
```

## Additional resources
<a name="connectors-docdb-additional-resources"></a>
+ For an article on using [Amazon Athena Federated Query](federated-queries.md) to connect a MongoDB database to [Amazon QuickSight](https://aws.amazon.com/quicksight/) to build dashboards and visualizations, see [Visualize MongoDB data from Amazon QuickSight using Amazon Athena Federated Query](https://aws.amazon.com/blogs/big-data/visualize-mongodb-data-from-amazon-quicksight-using-amazon-athena-federated-query/) in the *AWS Big Data Blog*.
+ For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-docdb) on GitHub.com.

# Amazon Athena DynamoDB connector
<a name="connectors-dynamodb"></a>

The Amazon Athena DynamoDB connector enables Amazon Athena to communicate with DynamoDB so that you can query your tables with SQL. Write operations like [INSERT INTO](insert-into.md) are not supported.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

If you have Lake Formation enabled in your account, the IAM role for your Athena federated Lambda connector that you deployed in the AWS Serverless Application Repository must have read access in Lake Formation to the AWS Glue Data Catalog.

## Prerequisites
<a name="connectors-dynamodb-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations
<a name="connectors-dynamodb-limitations"></a>

If you migrate your DynamoDB connections to Glue Catalog and Lake Formation, only lowercase table and column names are recognized. 

## Parameters
<a name="connectors-dynamodb-parameters"></a>

Use the parameters in this section to configure the DynamoDB connector.

### Glue connections (recommended)
<a name="ddb-gc"></a>

We recommend that you configure a DynamoDB connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the DynamoDB connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type DYNAMODB
```

**Lambda environment properties**

**glue_connection** – Specifies the name of the Glue connection associated with the federated connector. 

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The DynamoDB connector created using Glue connections does not support the use of a multiplexing handler.
The DynamoDB connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="ddb-legacy"></a>

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.

The parameter names and definitions listed below are for Athena data source connectors created without an associated Glue connection. Use the following parameters only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector or when the `glue_connection` environment property is not specified.

**Lambda environment properties**
+ **spill_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill_put_request_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms_key_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable_spill_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **disable_glue** – (Optional) If present and set to true, the connector does not attempt to retrieve supplemental metadata from AWS Glue.
+ **glue_catalog** – (Optional) Use this option to specify a [cross-account AWS Glue catalog](data-sources-glue-cross-account.md). By default, the connector attempts to get metadata from its own AWS Glue account.
+ **disable_projection_and_casing** – (Optional) Disables projection and casing. Use if you want to query DynamoDB tables that have casing in their column names and you do not want to specify a `columnMapping` property on your AWS Glue table.

  The `disable_projection_and_casing` parameter uses the following values to specify the behavior of casing and column mapping:
  + **auto** – Disables projection and casing when a previously unsupported type is detected and column name mapping is not set on the table. This is the default setting.
  + **always** – Disables projection and casing unconditionally. This is useful when you have casing in your DynamoDB column names but do not want to specify any column name mapping.

  When using the `disable_projection_and_casing` parameter, keep in mind the following points:
  + Use of the parameter can result in higher bandwidth usage. Additionally, if your Lambda function is not in the same AWS Region as your data source, you will incur higher standard AWS cross-region transfer costs as a result of the higher bandwidth usage. For more information about cross-region transfer costs, see [AWS Data Transfer Charges for Server and Serverless Architectures](https://aws.amazon.com/blogs/apn/aws-data-transfer-charges-for-server-and-serverless-architectures/) in the AWS Partner Network Blog.
  + Because a larger number of bytes is transferred and because the larger number of bytes requires a higher deserialization time, overall latency can increase. 

## Setting up databases and tables in AWS Glue
<a name="connectors-dynamodb-setting-up-databases-and-tables-in-aws-glue"></a>

Because the connector's built-in schema inference capability is limited, you might want to use AWS Glue for metadata. To do this, you must have a database and table in AWS Glue. To enable them for use with DynamoDB, you must edit their properties.

**To edit database properties in the AWS Glue console**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, expand **Data Catalog**, and then choose **Databases**.

   On the **Databases** page, you can edit an existing database, or choose **Add database** to create one.

1. In the list of databases, choose the link for the database that you want to edit.

1. Choose **Edit**.

1. On the **Update a database** page, under **Database settings**, for **Location**, add the string **dynamo-db-flag**. This keyword indicates that the database contains tables that the Athena DynamoDB connector is using for supplemental metadata and is required for AWS Glue databases other than `default`. The `dynamo-db-flag` property is useful for filtering out databases in accounts with many databases.

1. Choose **Update Database**.

**To edit table properties in the AWS Glue console**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, expand **Data Catalog**, and then choose **Tables**.

1. On the **Tables** page, in the list of tables, choose the linked name of the table that you want to edit.

1. Choose **Actions**, **Edit table**.

1. On the **Edit table** page, in the **Table properties** section, add the following table properties as required. If you use the AWS Glue DynamoDB crawler, these properties are automatically set.
   + **dynamodb** – String that indicates to the Athena DynamoDB connector that the table can be used for supplemental metadata. Enter the `dynamodb` string in the table properties under a field called **classification** (exact match).
**Note**  
The **Set table properties** page that is part of the table creation process in the AWS Glue console has a **Data format** section with a **Classification** field. You cannot enter or choose `dynamodb` here. Instead, after you create your table, follow the steps to edit the table and to enter `classification` and `dynamodb` as a key-value pair in the **Table properties** section.
   + **sourceTable** – Optional table property that defines the source table name in DynamoDB. Use this if AWS Glue table naming rules prevent you from creating an AWS Glue table with the same name as your DynamoDB table. For example, capital letters are not permitted in AWS Glue table names, but they are permitted in DynamoDB table names.
   + **columnMapping** – Optional table property that defines column name mappings. Use this if AWS Glue column naming rules prevent you from creating an AWS Glue table with the same column names as your DynamoDB table. For example, capital letters are not permitted in AWS Glue column names but are permitted in DynamoDB column names. The property value is expected to be in the format `col1=Col1,col2=Col2`. Note that column mapping applies only to top level column names and not to nested fields.
   + **defaultTimeZone** – Optional table property that is applied to `date` or `datetime` values that do not have an explicit time zone. Setting this value is a good practice to avoid discrepancies between the data source default time zone and the Athena session time zone.
   + **datetimeFormatMapping** – Optional table property that specifies the `date` or `datetime` format to use when parsing data from a column of the AWS Glue `date` or `timestamp` data type. If this property is not specified, the connector attempts to [infer](https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/time/DateFormatUtils.html) an ISO-8601 format. If the connector cannot infer the `date` or `datetime` format or parse the raw string, then the value is omitted from the result. 

     The `datetimeFormatMapping` value should be in the format `col1=someformat1,col2=someformat2`. Following are some example formats:

     ```
     yyyyMMdd'T'HHmmss 
     ddMMyyyy'T'HH:mm:ss
     ```

     If your column has `date` or `datetime` values without a time zone and you want to use the column in the `WHERE` clause, set the `datetimeFormatMapping` property for the column.

1. If you define your columns manually, make sure that you use the appropriate data types. If you used a crawler, validate the columns and types that the crawler discovered.

1. Choose **Save**.
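The `defaultTimeZone` behavior described in the steps above can be sketched as follows. This is an illustrative sketch under stated assumptions; `apply_default_tz` is a hypothetical helper, not connector code:

```python
# Illustrative sketch: apply a table's defaultTimeZone to datetime values
# that carry no explicit time zone; values that already have one are kept.
from datetime import datetime
from zoneinfo import ZoneInfo

def apply_default_tz(value: datetime, default_tz: str) -> datetime:
    if value.tzinfo is None:
        return value.replace(tzinfo=ZoneInfo(default_tz))
    return value

ts = datetime(2022, 12, 24, 8, 30)           # no explicit time zone
print(apply_default_tz(ts, "UTC").isoformat())
```

Attaching an explicit zone to naive values is what avoids the discrepancy the doc warns about between the data source default time zone and the Athena session time zone.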

## Required Permissions
<a name="connectors-dynamodb-required-permissions"></a>

For full details on the IAM policies that this connector requires, review the `Policies` section of the [athena-dynamodb.yaml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-dynamodb/athena-dynamodb.yaml) file. The following list summarizes the required permissions.
+ **Amazon S3 write access** – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
+ **Athena GetQueryExecution** – The connector uses this permission to fast-fail when the upstream Athena query has terminated.
+ **AWS Glue Data Catalog** – The DynamoDB connector requires read only access to the AWS Glue Data Catalog to obtain schema information.
+ **CloudWatch Logs** – The connector requires access to CloudWatch Logs for storing logs.
+ **DynamoDB read access** – The connector uses the `DescribeTable`, `ListSchemas`, `ListTables`, `Query`, and `Scan` API operations.

## Performance
<a name="connectors-dynamodb-performance"></a>

The Athena DynamoDB connector supports parallel scans and attempts to push down predicates as part of its DynamoDB queries. A hash key predicate with `X` distinct values results in `X` query calls to DynamoDB. All other predicate scenarios result in `Y` number of scan calls, where `Y` is heuristically determined based on the size of your table and its provisioned throughput. However, selecting a subset of columns sometimes results in a longer query execution runtime.

`LIMIT` clauses and simple predicates are pushed down, which can reduce the amount of data scanned and decrease query execution runtime. 

### LIMIT clauses
<a name="connectors-dynamodb-performance-limit-clauses"></a>

A `LIMIT N` statement reduces the data scanned by the query. With `LIMIT N` pushdown, the connector returns only `N` rows to Athena.

### Predicates
<a name="connectors-dynamodb-performance-predicates"></a>

A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. For enhanced functionality, and to reduce the amount of data scanned, the Athena DynamoDB connector can combine these expressions and push them directly to DynamoDB.

The following Athena DynamoDB connector operators support predicate pushdown:
+ **Boolean:** AND
+ **Equality:** EQUAL, NOT_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, IS_NULL

### Combined pushdown example
<a name="connectors-dynamodb-performance-pushdown-example"></a>

For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT *
FROM my_table
WHERE col_a > 10 and col_b < 10
LIMIT 10
```

For an article on using predicate pushdown to improve performance in federated queries, including DynamoDB, see [Improve federated queries with predicate pushdown in Amazon Athena](https://aws.amazon.com/blogs/big-data/improve-federated-queries-with-predicate-pushdown-in-amazon-athena/) in the *AWS Big Data Blog*.

## Passthrough queries
<a name="connectors-dynamodb-passthrough-queries"></a>

The DynamoDB connector supports [passthrough queries](federated-query-passthrough.md) and uses PartiQL syntax. The DynamoDB [GetItem](https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_GetItem.html) API operation is not supported. For information about querying DynamoDB using PartiQL, see [PartiQL select statements for DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ql-reference.select.html) in the *Amazon DynamoDB Developer Guide*.

To use passthrough queries with DynamoDB, use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query_string'
        ))
```

The following DynamoDB passthrough query example uses PartiQL to return a list of Fire TV Stick devices that have a `DateWatched` property later than 12/24/22.

```
SELECT * FROM TABLE(
        system.query(
           query => 'SELECT Devices 
                       FROM WatchList 
                       WHERE Devices.FireStick.DateWatched[0] > ''12/24/22'''
        ))
```

## Troubleshooting
<a name="connectors-dynamodb-troubleshooting"></a>

### Multiple filters on a sort key column
<a name="connectors-dynamodb-troubleshooting-sort-key-filters"></a>

**Error message**: KeyConditionExpressions must only contain one condition per key

**Cause**: This issue can occur in Athena engine version 3 in queries that have both a lower and upper bounded filter on a DynamoDB sort key column. Because DynamoDB does not support more than one filter condition on a sort key, an error is thrown when the connector attempts to push down a query that has both conditions applied.

**Solution**: Update the connector to version 2023.11.1 or later. For instructions on updating a connector, see [Update a data source connector](connectors-updating.md).

## Costs
<a name="connectors-dynamodb-costs"></a>

The costs for use of the connector depend on the underlying AWS resources that are used. Because queries that use scans can consume a large number of [read capacity units (RCUs)](https://aws.amazon.com/dynamodb/pricing/provisioned/), consider the information for [Amazon DynamoDB pricing](https://aws.amazon.com/dynamodb/pricing/) carefully.
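As a rough illustration of why scans matter for cost, RCU consumption can be estimated from the amount of data read: DynamoDB meters reads in 4 KB units, and an eventually consistent read consumes half a read capacity unit per unit. The following sketch is a back-of-the-envelope estimate only, not an official pricing formula; verify actual charges against the Amazon DynamoDB pricing page.

```python
import math

def estimate_scan_rcus(table_size_bytes: int, eventually_consistent: bool = True) -> float:
    """Rough estimate of RCUs consumed by a full table scan.

    DynamoDB meters reads in 4 KB units; an eventually consistent
    read costs half as much as a strongly consistent read.
    """
    units = math.ceil(table_size_bytes / 4096)
    return units * (0.5 if eventually_consistent else 1.0)

# A 1 GiB table scanned with eventually consistent reads:
print(estimate_scan_rcus(1 << 30))  # 131072.0
```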

## Additional resources
<a name="connectors-dynamodb-additional-resources"></a>
+ For an introduction to using the Amazon Athena DynamoDB connector, see [Access, query, and join Amazon DynamoDB tables using Athena](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/access-query-and-join-amazon-dynamodb-tables-using-athena.html) in the *AWS Prescriptive Guidance Patterns* guide. 
+ For an article on how to use the Athena DynamoDB connector to query data in DynamoDB with SQL and visualize insights in Amazon QuickSight, see the *AWS Big Data Blog* post [Visualize Amazon DynamoDB insights in Amazon QuickSight using the Amazon Athena DynamoDB connector and AWS Glue](https://aws.amazon.com/blogs/big-data/visualize-amazon-dynamodb-insights-in-amazon-quicksight-using-the-amazon-athena-dynamodb-connector-and-aws-glue/). 
+ For an article on using the Amazon Athena DynamoDB connector with Amazon DynamoDB, Athena, and Amazon QuickSight to create a simple governance dashboard, see the *AWS Big Data Blog* post [Query cross-account Amazon DynamoDB tables using Amazon Athena Federated Query](https://aws.amazon.com/blogs/big-data/query-cross-account-amazon-dynamodb-tables-using-amazon-athena-federated-query/).
+ For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-dynamodb) on GitHub.com.

# Amazon Athena Google BigQuery connector
<a name="connectors-bigquery"></a>

The Amazon Athena connector for Google [BigQuery](https://cloud.google.com/bigquery/) enables Amazon Athena to run SQL queries on your Google BigQuery data.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

## Prerequisites
<a name="connectors-bigquery-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations
<a name="connectors-bigquery-limitations"></a>
+ Lambda functions have a maximum timeout value of 15 minutes. Each split executes a query on BigQuery and must finish with enough time to store the results for Athena to read. If the Lambda function times out, the query fails.
+ Google BigQuery is case sensitive. The connector attempts to correct the case of dataset names, table names, and project IDs because Athena lowercases all metadata. These corrections make many extra calls to Google BigQuery.
+ Binary data types are not supported.
+ Because of Google BigQuery concurrency and quota limits, the connector may encounter Google quota limit issues. To avoid these issues, push as many constraints to Google BigQuery as feasible. For information about BigQuery quotas, see [Quotas and limits](https://cloud.google.com/bigquery/quotas) in the Google BigQuery documentation.
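The case correction described in the limitations above amounts to a case-insensitive lookup against the cased names that BigQuery returns. The following Python sketch illustrates the idea only; the function name and logic are hypothetical and do not reflect the connector's actual implementation.

```python
def resolve_case(lowercased: str, candidates: list[str]) -> str:
    """Map an Athena-lowercased identifier back to the cased name
    that BigQuery expects. In the real connector, each unresolved
    name costs extra listing calls against Google BigQuery.
    """
    matches = [c for c in candidates if c.lower() == lowercased]
    if len(matches) != 1:
        raise ValueError(f"ambiguous or missing name: {lowercased!r}")
    return matches[0]

# Athena hands the connector 'salesdata'; BigQuery knows 'SalesData'.
print(resolve_case("salesdata", ["SalesData", "Inventory"]))  # SalesData
```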

## Parameters
<a name="connectors-bigquery-parameters"></a>

Use the parameters in this section to configure the Google BigQuery connector.

### Glue connections (recommended)
<a name="bigquery-gc"></a>

We recommend that you configure a Google BigQuery connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the Google BigQuery connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type BIGQUERY
```

**Lambda environment properties**

**glue\_connection** – Specifies the name of the Glue connection associated with the federated connector. 

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Google BigQuery connector created using Glue connections does not support the use of a multiplexing handler.
The Google BigQuery connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="bigquery-legacy"></a>

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.

The parameter names and definitions listed below are for Athena data source connectors created without an associated Glue connection. Use the following parameters only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector or when the `glue_connection` environment property is not specified.

**Lambda environment properties**
+ **spill\_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill\_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill\_put\_request\_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms\_key\_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable\_spill\_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **gcp\_project\_id** – The project ID (not project name) that contains the datasets that the connector should read from (for example, `semiotic-primer-1234567`).
+ **secret\_manager\_gcp\_creds\_name** – The name of the secret within AWS Secrets Manager that contains your BigQuery credentials in JSON format (for example, `GoogleCloudPlatformCredentials`).
+ **big\_query\_endpoint** – (Optional) The URL of a BigQuery private endpoint. Use this parameter when you want to access BigQuery over a private endpoint.
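The `spill_prefix` description above recommends a lifecycle rule to clean up spilled data. One way to sketch such a rule is shown below; the bucket name is a placeholder, and applying the rule requires appropriate S3 permissions.

```python
def build_spill_expiration_rule(prefix: str = "athena-federation-spill", days: int = 1) -> dict:
    """Lifecycle configuration that expires spilled objects after `days` days."""
    return {
        "Rules": [{
            "ID": "expire-athena-spill",
            "Filter": {"Prefix": prefix + "/"},
            "Status": "Enabled",
            "Expiration": {"Days": days},
        }]
    }

# To apply (requires boto3 and s3:PutLifecycleConfiguration on the bucket):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-spill-bucket",
#       LifecycleConfiguration=build_spill_expiration_rule())
print(build_spill_expiration_rule()["Rules"][0]["Expiration"])  # {'Days': 1}
```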

## Splits and views
<a name="connectors-bigquery-splits-and-views"></a>

Because the BigQuery connector uses the BigQuery Storage Read API to query tables, and the BigQuery Storage API does not support views, the connector uses the BigQuery client with a single split for views.

## Performance
<a name="connectors-bigquery-performance"></a>

To query tables, the BigQuery connector uses the BigQuery Storage Read API, which uses an RPC-based protocol that provides fast access to BigQuery managed storage. For more information about the BigQuery Storage Read API, see [Use the BigQuery Storage Read API to read table data](https://cloud.google.com/bigquery/docs/reference/storage) in the Google Cloud documentation.

Selecting a subset of columns significantly speeds up query runtime and reduces data scanned. The connector is subject to query failures as concurrency increases, and generally is a slow connector.

The Athena Google BigQuery connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses, `ORDER BY` clauses, simple predicates, and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time. 

### LIMIT clauses
<a name="connectors-bigquery-performance-limit-clauses"></a>

A `LIMIT N` statement reduces the data scanned by the query. With `LIMIT N` pushdown, the connector returns only `N` rows to Athena.

### Top N queries
<a name="connectors-bigquery-performance-top-n-queries"></a>

A top `N` query specifies an ordering of the result set and a limit on the number of rows returned. You can use this type of query to determine the top `N` max values or top `N` min values for your datasets. With top `N` pushdown, the connector returns only `N` ordered rows to Athena.

### Predicates
<a name="connectors-bigquery-performance-predicates"></a>

A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Google BigQuery connector can combine these expressions and push them directly to Google BigQuery for enhanced functionality and to reduce the amount of data scanned.

The following Athena Google BigQuery connector operators support predicate pushdown:
+ **Boolean: **AND, OR, NOT
+ **Equality: **EQUAL, NOT\_EQUAL, LESS\_THAN, LESS\_THAN\_OR\_EQUAL, GREATER\_THAN, GREATER\_THAN\_OR\_EQUAL, IS\_DISTINCT\_FROM, NULL\_IF, IS\_NULL
+ **Arithmetic: **ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other: **LIKE\_PATTERN, IN

### Combined pushdown example
<a name="connectors-bigquery-performance-pushdown-example"></a>

For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d)) 
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%') 
ORDER BY col_a DESC 
LIMIT 10;
```

## Passthrough queries
<a name="connectors-bigquery-passthrough-queries"></a>

The Google BigQuery connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Google BigQuery, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Google BigQuery. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information
<a name="connectors-bigquery-license-information"></a>

The Amazon Athena Google BigQuery connector project is licensed under the [Apache-2.0 License](https://www.apache.org/licenses/LICENSE-2.0.html).

By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-google-bigquery/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-google-bigquery/LICENSE.txt) file on GitHub.com.

## Additional resources
<a name="connectors-bigquery-additional-resources"></a>

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-google-bigquery) on GitHub.com.

# Amazon Athena Google Cloud Storage connector
<a name="connectors-gcs"></a>

The Amazon Athena Google Cloud Storage connector enables Amazon Athena to run queries on Parquet and CSV files stored in a Google Cloud Storage (GCS) bucket. After you group one or more Parquet or CSV files in an unpartitioned or partitioned folder in a GCS bucket, you can organize them in an [AWS Glue](https://aws.amazon.com/glue/) database table.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

If you have Lake Formation enabled in your account, the IAM role for your Athena federated Lambda connector that you deployed in the AWS Serverless Application Repository must have read access in Lake Formation to the AWS Glue Data Catalog. 

For an article that shows how to use Athena to run queries on Parquet or CSV files in a GCS bucket, see the AWS Big Data Blog post [Use Amazon Athena to query data stored in Google Cloud Platform](https://aws.amazon.com/blogs/big-data/use-amazon-athena-to-query-data-stored-in-google-cloud-platform/).

## Prerequisites
<a name="connectors-gcs-prerequisites"></a>
+ Set up an AWS Glue database and table that correspond to your bucket and folders in Google Cloud Storage. For the steps, see [Setting up databases and tables in AWS Glue](#connectors-gcs-setting-up-databases-and-tables-in-glue) later in this document.
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations
<a name="connectors-gcs-limitations"></a>
+ Write DDL operations are not supported.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ Currently, the connector supports only the `VARCHAR` type for partition columns (`string` or `varchar` in an AWS Glue table schema). Other partition field types raise errors when you query them in Athena.

## Terms
<a name="connectors-gcs-terms"></a>

The following terms relate to the GCS connector.
+ **Handler** – A Lambda handler that accesses your GCS bucket. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your GCS bucket.
+ **Record handler** – A Lambda handler that retrieves data records from your GCS bucket.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your GCS bucket.

## Supported file types
<a name="connectors-gcs-supported-file-types"></a>

The GCS connector supports the Parquet and CSV file types.

**Note**  
Make sure you do not place both CSV and Parquet files in the same GCS bucket or path. Doing so can result in a runtime error when the connector attempts to read Parquet files as CSV, or vice versa. 

## Parameters
<a name="connectors-gcs-parameters"></a>

Use the parameters in this section to configure the GCS connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.  
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html). Starting December 3, 2024, use the parameters below only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector.

### Glue connections (recommended)
<a name="connectors-gcs-gc"></a>

We recommend that you configure a GCS connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the GCS connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type GOOGLECLOUDSTORAGE
```

**Lambda environment properties**
+ **glue\_connection** – Specifies the name of the Glue connection associated with the federated connector.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The GCS connector created using Glue connections does not support the use of a multiplexing handler.
The GCS connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="connectors-gcs-legacy"></a>
+ **spill\_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill\_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill\_put\_request\_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms\_key\_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable\_spill\_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **secret\_manager\_gcp\_creds\_name** – The name of the secret in AWS Secrets Manager that contains your GCS credentials in JSON format (for example, `GoogleCloudPlatformCredentials`).

## Setting up databases and tables in AWS Glue
<a name="connectors-gcs-setting-up-databases-and-tables-in-glue"></a>

Because the built-in schema inference capability of the GCS connector is limited, we recommend that you use AWS Glue for your metadata. The following procedures show how to create a database and table in AWS Glue that you can access from Athena.

### Creating a database in AWS Glue
<a name="connectors-gcs-creating-a-database-in-glue"></a>

You can use the AWS Glue console to create a database for use with the GCS connector.

**To create a database in AWS Glue**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. From the navigation pane, choose **Databases**.

1. Choose **Add database**.

1. For **Name**, enter a name for the database that you want to use with the GCS connector.

1. For **Location**, specify `google-cloud-storage-flag`. This location tells the GCS connector that the AWS Glue database contains tables for GCS data to be queried in Athena. The connector recognizes databases in Athena that have this flag and ignores databases that do not.

1. Choose **Create database**.

### Creating a table in AWS Glue
<a name="connectors-gcs-creating-a-table-in-glue"></a>

Now you can create a table for the database. When you create an AWS Glue table to use with the GCS connector, you must specify additional metadata.

**To create a table in the AWS Glue console**

1. In the AWS Glue console, from the navigation pane, choose **Tables**.

1. On the **Tables** page, choose **Add table**.

1. On the **Set table properties** page, enter the following information.
   + **Name** – A unique name for the table.
   + **Database** – Choose the AWS Glue database that you created for the GCS connector.
   + **Include path** – In the **Data store** section, for **Include path**, enter the URI location for GCS prefixed by `gs://` (for example, `gs://gcs_table/data/`). If you have one or more partition folders, don't include them in the path.
**Note**  
When you enter a non-`s3://` table path, the AWS Glue console shows an error. You can ignore this error. The table will be created successfully.
   + **Data format** – For **Classification**, select **CSV** or **Parquet**.

1. Choose **Next.**

1. On the **Choose or define schema** page, defining a table schema is highly recommended, but not mandatory. If you do not define a schema, the GCS connector attempts to infer a schema for you.

   Do one of the following:
   + If you want the GCS connector to attempt to infer a schema for you, choose **Next**, and then choose **Create**.
   + To define a schema yourself, follow the steps in the next section.

### Defining a table schema in AWS Glue
<a name="connectors-gcs-defining-a-table-schema-in-glue"></a>

Defining a table schema in AWS Glue requires more steps but gives you greater control over the table creation process.

**To define a schema for your table in AWS Glue**

1. On the **Choose or define schema** page, choose **Add**.

1. Use the **Add schema entry** dialog box to provide a column name and data type.

1. To designate the column as a partition column, select the **Set as partition key** option.

1. Choose **Save** to save the column.

1. Choose **Add** to add another column.

1. When you are finished adding columns, choose **Next**.

1. On the **Review and create** page, review the table, and then choose **Create**.

1. If your schema contains partition information, follow the steps in the next section to add a partition pattern to the table's properties in AWS Glue.

### Adding a partition pattern to table properties in AWS Glue
<a name="connectors-gcs-adding-a-partition-pattern-to-table-properties-in-glue"></a>

If your GCS buckets have partitions, you must add the partition pattern to the properties of the table in AWS Glue.

**To add partition information to table properties in AWS Glue**

1. On the details page for the table that you created in AWS Glue, choose **Actions**, **Edit table**.

1. On the **Edit table** page, scroll down to the **Table properties** section.

1. Choose **Add** to add a partition key.

1. For **Key**, enter **partition.pattern**. This key defines the folder path pattern.

1. For **Value**, enter a folder path pattern like **StateName=\$\{statename\}/ZipCode=\$\{zipcode\}/**, where **statename** and **zipcode** enclosed by **\$\{\}** are partition column names. The GCS connector supports both Hive and non-Hive partition schemes.

1. When you are finished, choose **Save**.

1. To view the table properties that you just created, choose the **Advanced properties** tab.

At this point, you can navigate to the Athena console. The database and table that you created in AWS Glue are available for querying in Athena.

## Data type support
<a name="connectors-gcs-data-type-support"></a>

The following tables show the supported data types for CSV and for Parquet.

### CSV
<a name="connectors-gcs-csv"></a>


****  

| **Nature of data** | **Inferred Data Type** | 
| --- | --- | 
| Data looks like a number | BIGINT | 
| Data looks like a string | VARCHAR | 
| Data looks like a floating point (float, double, or decimal) | DOUBLE | 
| Data looks like a Date | Timestamp | 
| Data that contains true/false values | BOOL | 
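The CSV inference table above can be approximated as follows. This is a simplified illustration of the mapping only, not the connector's actual logic; in particular, date detection (which infers `Timestamp`) is omitted here.

```python
def infer_csv_type(value: str) -> str:
    """Approximate the GCS connector's CSV type inference table."""
    v = value.strip()
    if v.lower() in ("true", "false"):
        return "BOOL"
    try:
        int(v)
        return "BIGINT"
    except ValueError:
        pass
    try:
        float(v)
        return "DOUBLE"
    except ValueError:
        pass
    # A date-like value would infer Timestamp; everything else is VARCHAR.
    return "VARCHAR"

print([infer_csv_type(v) for v in ["42", "3.14", "true", "hello"]])
# ['BIGINT', 'DOUBLE', 'BOOL', 'VARCHAR']
```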

### Parquet
<a name="connectors-gcs-parquet"></a>


****  

| **PARQUET** | **Athena (Arrow)** | 
| --- | --- | 
| BINARY | VARCHAR | 
| BOOLEAN | BOOL | 
| DOUBLE | DOUBLE | 
| ENUM | VARCHAR | 
| FIXED\_LEN\_BYTE\_ARRAY | DECIMAL | 
| FLOAT | FLOAT (32-bit) | 
| INT32 |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/connectors-gcs.html)  | 
| INT64 |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/connectors-gcs.html)  | 
| INT96 | Timestamp | 
| MAP | MAP | 
| STRUCT | STRUCT | 
| LIST | LIST | 

## Required Permissions
<a name="connectors-gcs-required-permissions"></a>

For full details on the IAM policies that this connector requires, review the `Policies` section of the [athena-gcs.yaml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-gcs/athena-gcs.yaml) file. The following list summarizes the required permissions.
+ **Amazon S3 write access** – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
+ **Athena GetQueryExecution** – The connector uses this permission to fast-fail when the upstream Athena query has terminated.
+ **AWS Glue Data Catalog** – The GCS connector requires read only access to the AWS Glue Data Catalog to obtain schema information.
+ **CloudWatch Logs** – The connector requires access to CloudWatch Logs for storing logs.

## Performance
<a name="connectors-gcs-performance"></a>

When the table schema contains partition fields and the `partition.pattern` table property is configured correctly, you can include the partition field in the `WHERE` clause of your queries. For such queries, the GCS connector uses the partition columns to refine the GCS folder path and avoid scanning unneeded files in GCS folders.

For Parquet datasets, selecting a subset of columns results in less data being scanned. When column projection is applied, this usually results in a shorter query runtime. 

For CSV datasets, column projection is not supported and does not reduce the amount of data being scanned. 

`LIMIT` clauses reduce the amount of data scanned, but if you don't provide a predicate, you should expect `SELECT` queries with a `LIMIT` clause to scan at least 16 MB of data. The GCS connector scans more data for larger datasets than for smaller datasets, regardless of the `LIMIT` clause applied. For example, the query `SELECT * LIMIT 10000` scans more data for a larger underlying dataset than a smaller one.

### License information
<a name="connectors-gcs-license-information"></a>

By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-gcs/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-gcs/LICENSE.txt) file on GitHub.com.

### Additional resources
<a name="connectors-gcs-additional-resources"></a>

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-gcs) on GitHub.com.

# Amazon Athena HBase connector
<a name="connectors-hbase"></a>

The Amazon Athena HBase connector enables Amazon Athena to communicate with your Apache HBase instances so that you can query your HBase data with SQL.

Unlike traditional relational data stores, HBase collections do not have a set schema. HBase does not have a metadata store. Each entry in an HBase collection can have different fields and data types.

The HBase connector supports two mechanisms for generating table schema information: basic schema inference and AWS Glue Data Catalog metadata.

Schema inference is the default. This option scans a small number of documents in your collection, forms a union of all fields, and coerces fields that have non-overlapping data types. This option works well for collections that have mostly uniform entries.

For collections with a greater variety of data types, the connector supports retrieving metadata from the AWS Glue Data Catalog. If the connector sees an AWS Glue database and table that match your HBase namespace and collection names, it gets its schema information from the corresponding AWS Glue table. When you create your AWS Glue table, we recommend that you make it a superset of all fields that you might want to access from your HBase collection.
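The schema inference behavior described above — scanning a sample of entries, taking the union of their fields, and coercing fields with non-overlapping types — can be sketched as follows. This is a toy illustration of the approach, not the connector's implementation.

```python
def infer_schema(samples: list[dict]) -> dict:
    """Union the fields of sampled entries; fields whose sampled
    types conflict are coerced to a string type."""
    schema: dict = {}
    for doc in samples:
        for field, value in doc.items():
            t = type(value).__name__
            if field not in schema:
                schema[field] = t
            elif schema[field] != t:
                schema[field] = "str"  # coerce non-overlapping types
    return schema

print(infer_schema([{"id": 1, "name": "a"}, {"id": "x", "active": True}]))
# {'id': 'str', 'name': 'str', 'active': 'bool'}
```

Collections with many such type conflicts are better served by an AWS Glue table that defines the schema explicitly, as described next.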

If you have Lake Formation enabled in your account, the IAM role for your Athena federated Lambda connector that you deployed in the AWS Serverless Application Repository must have read access in Lake Formation to the AWS Glue Data Catalog. 

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

## Prerequisites
<a name="connectors-hbase-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Parameters
<a name="connectors-hbase-parameters"></a>

Use the parameters in this section to configure the HBase connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.  
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html). Starting December 3, 2024, use the parameters below only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector.

### Glue connections (recommended)
<a name="connectors-hbase-gc"></a>

We recommend that you configure an HBase connector by using a Glue connection object. To do this, set the `glue_connection` environment variable of the HBase connector Lambda function to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type HBASE
```

**Lambda environment properties**
+ **glue_connection** – Specifies the name of the Glue connection associated with the federated connector.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The HBase connector created using Glue connections does not support the use of a multiplexing handler.
The HBase connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="connectors-hbase-legacy"></a>
+ **spill_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill_put_request_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms_key_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS, such as `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable_spill_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **disable_glue** – (Optional) If present and set to true, the connector does not attempt to retrieve supplemental metadata from AWS Glue.
+ **glue_catalog** – (Optional) Use this option to specify a [cross-account AWS Glue catalog](data-sources-glue-cross-account.md). By default, the connector attempts to get metadata from its own AWS Glue account.
+ **default_hbase** – If present, specifies an HBase connection string to use when no catalog-specific environment variable exists.
+ **enable_case_insensitive_match** – (Optional) When `true`, performs case insensitive searches against table names in HBase. The default is `false`. Use this option if your query contains uppercase table names.

#### Specifying connection strings
<a name="connectors-hbase-specifying-connection-strings"></a>

You can provide one or more properties that define the HBase connection details for the HBase instances that you use with the connector. To do this, set a Lambda environment variable that corresponds to the catalog name that you want to use in Athena. For example, suppose you want to use the following queries to query two different HBase instances from Athena:

```
SELECT * FROM "hbase_instance_1".database.table
```

```
SELECT * FROM "hbase_instance_2".database.table
```

Before you can use these two SQL statements, you must add two environment variables to your Lambda function: `hbase_instance_1` and `hbase_instance_2`. The value for each should be an HBase connection string in the following format:

```
master_hostname:hbase_port:zookeeper_port
```
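
To make the three-part format concrete, the following is a minimal sketch that validates and splits such a connection string. The function name and the host and port values are illustrative, not part of the connector.

```python
# Minimal sketch: split an HBase connection string of the form
# master_hostname:hbase_port:zookeeper_port into its parts.
def parse_hbase_connection_string(value: str) -> dict:
    parts = value.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected host:hbase_port:zookeeper_port, got {value!r}")
    host, hbase_port, zookeeper_port = parts
    return {
        "master_hostname": host,
        "hbase_port": int(hbase_port),
        "zookeeper_port": int(zookeeper_port),
    }

# Example values only; substitute your own HBase master host and ports.
print(parse_hbase_connection_string("hbase-master.example.internal:16010:2181"))
```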

##### Using secrets
<a name="connectors-hbase-using-secrets"></a>

You can optionally use AWS Secrets Manager for part or all of the value for your connection string details. To use the Athena Federated Query feature with Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

If you use the syntax `${my_secret}` to put the name of a secret from Secrets Manager in your connection string, the connector replaces the secret name with your user name and password values from Secrets Manager.

For example, suppose you set the Lambda environment variable for `hbase_instance_1` to the following value:

```
${hbase_host_1}:${hbase_master_port_1}:${hbase_zookeeper_port_1}
```

The Athena Query Federation SDK automatically attempts to retrieve a secret named `hbase_host_1` from Secrets Manager and inject that value in place of `${hbase_host_1}`. Any part of the connection string that is enclosed by the `${ }` character combination is interpreted as a secret from Secrets Manager. If you specify a secret name that the connector cannot find in Secrets Manager, the connector does not replace the text.
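
The substitution behavior can be sketched as follows. The `SECRETS` dictionary here stands in for Secrets Manager lookups (in the connector, the Federation SDK performs the retrieval); the secret names and values are the illustrative ones from the example above.

```python
import re

# Stand-in for Secrets Manager; values are placeholders for this sketch.
SECRETS = {
    "hbase_host_1": "hbase-master.example.internal",
    "hbase_master_port_1": "16010",
    "hbase_zookeeper_port_1": "2181",
}

def resolve_secrets(connection_string: str) -> str:
    # Replace each ${name} with its secret value; names that cannot be
    # found are left unreplaced, matching the behavior described above.
    def lookup(match: re.Match) -> str:
        return SECRETS.get(match.group(1), match.group(0))
    return re.sub(r"\$\{([^}]+)\}", lookup, connection_string)

print(resolve_secrets("${hbase_host_1}:${hbase_master_port_1}:${hbase_zookeeper_port_1}"))
print(resolve_secrets("${missing_secret}:16010:2181"))  # unknown name stays as-is
```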

## Setting up databases and tables in AWS Glue
<a name="connectors-hbase-setting-up-databases-and-tables-in-aws-glue"></a>

The connector's built-in schema inference supports only values that are serialized in HBase as strings (for example, `String.valueOf(int)`). Because the connector's built-in schema inference capability is limited, you might want to use AWS Glue for metadata instead. To enable an AWS Glue table for use with HBase, you must have an AWS Glue database and table with names that match the HBase namespace and table that you want to supply supplemental metadata for. The use of HBase column family naming conventions is optional.

**To use an AWS Glue table for supplemental metadata**

1. When you edit the table and database in the AWS Glue console, add the following table properties:
   + **hbase-metadata-flag** – This property indicates to the HBase connector that the connector can use the table for supplemental metadata. You can provide any value for `hbase-metadata-flag` as long as the `hbase-metadata-flag` property is present in the list of table properties.
   + **hbase-native-storage-flag** – Use this flag to toggle the two value serialization modes supported by the connector. By default, when this field is not present, the connector assumes all values are stored in HBase as strings. As such it will attempt to parse data types such as `INT`, `BIGINT`, and `DOUBLE` from HBase as strings. If this field is set with any value on the table in AWS Glue, the connector switches to "native" storage mode and attempts to read `INT`, `BIGINT`, `BIT`, and `DOUBLE` as bytes by using the following functions:

     ```
     ByteBuffer.wrap(value).getInt() 
     ByteBuffer.wrap(value).getLong() 
     ByteBuffer.wrap(value).get() 
     ByteBuffer.wrap(value).getDouble()
     ```

1. Make sure that you use the data types appropriate for AWS Glue as listed in this document.
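
The `ByteBuffer` calls above read big-endian fixed-width values (the default byte order for Java `ByteBuffer`). As an illustration only, the equivalent decoding can be sketched in Python with the `struct` module:

```python
import struct

# Big-endian decoders mirroring the ByteBuffer calls used in native mode.
def decode_int(value: bytes) -> int:       # ByteBuffer.wrap(value).getInt()
    return struct.unpack(">i", value)[0]

def decode_bigint(value: bytes) -> int:    # ByteBuffer.wrap(value).getLong()
    return struct.unpack(">q", value)[0]

def decode_bit(value: bytes) -> int:       # ByteBuffer.wrap(value).get()
    return struct.unpack(">b", value)[0]

def decode_double(value: bytes) -> float:  # ByteBuffer.wrap(value).getDouble()
    return struct.unpack(">d", value)[0]

print(decode_int(b"\x00\x00\x00\x2a"))        # 42
print(decode_bigint(b"\x00" * 7 + b"\x01"))   # 1
print(decode_double(struct.pack(">d", 2.5)))  # 2.5
```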

### Modeling column families
<a name="connectors-hbase-modeling-column-families"></a>

The Athena HBase connector supports two ways to model HBase column families: fully qualified (flattened) naming like `family:column`, or using `STRUCT` objects.

In the `STRUCT` model, the name of the `STRUCT` field should match the column family, and children of the `STRUCT` should match the names of the columns of the family. However, because predicate push down and columnar reads are not yet fully supported for complex types like `STRUCT`, using `STRUCT` is currently not advised.

The following image shows a table configured in AWS Glue that uses a combination of the two approaches.

![\[Modeling column families in AWS Glue for Apache Hbase.\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-hbase-1.png)


## Data type support
<a name="connectors-hbase-data-type-support"></a>

The connector retrieves all HBase values as the basic byte type. Then, based on how you defined your tables in AWS Glue Data Catalog, it maps the values into one of the Apache Arrow data types in the following table.


****  

| AWS Glue data type | Apache Arrow data type | 
| --- | --- | 
| int | INT | 
| bigint | BIGINT | 
| double | FLOAT8 | 
| float | FLOAT4 | 
| boolean | BIT | 
| binary | VARBINARY | 
| string | VARCHAR | 

**Note**  
If you do not use AWS Glue to supplement your metadata, the connector's schema inference uses only the data types `BIGINT`, `FLOAT8`, and `VARCHAR`.

## Required permissions
<a name="connectors-hbase-required-permissions"></a>

For full details on the IAM policies that this connector requires, review the `Policies` section of the [athena-hbase.yaml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-hbase/athena-hbase.yaml) file. The following list summarizes the required permissions.
+ **Amazon S3 write access** – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
+ **Athena GetQueryExecution** – The connector uses this permission to fast-fail when the upstream Athena query has terminated.
+ **AWS Glue Data Catalog** – The HBase connector requires read only access to the AWS Glue Data Catalog to obtain schema information.
+ **CloudWatch Logs** – The connector requires access to CloudWatch Logs for storing logs.
+ **AWS Secrets Manager read access** – If you choose to store HBase endpoint details in Secrets Manager, you must grant the connector access to those secrets.
+ **VPC access** – The connector requires the ability to attach and detach interfaces to your VPC so that it can connect to it and communicate with your HBase instances.

## Performance
<a name="connectors-hbase-performance"></a>

The Athena HBase connector attempts to parallelize queries against your HBase instance by reading each region server in parallel. The Athena HBase connector performs predicate pushdown to decrease the data scanned by the query.

The Lambda function also performs *projection* pushdown to decrease the data scanned by the query. However, selecting a subset of columns sometimes results in a longer query execution runtime. `LIMIT` clauses reduce the amount of data scanned, but if you don't provide a predicate, you should expect `SELECT` queries with a `LIMIT` clause to scan at least 16 MB of data.

HBase is prone to query failures and variable query execution times. You might have to retry your queries multiple times for them to succeed. The HBase connector is resilient to throttling due to concurrency.

## Passthrough queries
<a name="connectors-hbase-passthrough-queries"></a>

The HBase connector is NoSQL based and supports [passthrough queries](federated-query-passthrough.md). For information about querying Apache HBase using filtering, see [Filter language](https://hbase.apache.org/book.html#thrift.filter_language) in the Apache documentation.

To use passthrough queries with HBase, use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            database => 'database_name',
            collection => 'collection_name',
            filter => '{query_syntax}'
        ))
```

The following example HBase passthrough query filters for employees aged 24 or 30 within the `employee` collection of the `default` database.

```
SELECT * FROM TABLE(
        system.query(
            DATABASE => 'default',
            COLLECTION => 'employee',
            FILTER => 'SingleColumnValueFilter(''personaldata'', ''age'', =, ''binary:30'')' ||
                       ' OR SingleColumnValueFilter(''personaldata'', ''age'', =, ''binary:24'')'
        ))
```

## License information
<a name="connectors-hbase-license-information"></a>

The Amazon Athena HBase connector project is licensed under the [Apache-2.0 License](https://www.apache.org/licenses/LICENSE-2.0.html).

## Additional resources
<a name="connectors-hbase-additional-resources"></a>

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-hbase) on GitHub.com.

# Amazon Athena Hortonworks connector
<a name="connectors-hortonworks"></a>

The Amazon Athena connector for Hortonworks enables Amazon Athena to run SQL queries on the Cloudera [Hortonworks](https://www.cloudera.com/products/hdp.html) data platform. The connector transforms your Athena SQL queries to their equivalent HiveQL syntax.

This connector does not use Glue Connections to centralize configuration properties in Glue. Connection configuration is done through Lambda.

## Prerequisites
<a name="connectors-hive-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations
<a name="connectors-hortonworks-limitations"></a>
+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.

## Terms
<a name="connectors-hortonworks-terms"></a>

The following terms relate to the Hortonworks Hive connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters
<a name="connectors-hortonworks-parameters"></a>

Use the parameters in this section to configure the Hortonworks Hive connector.

### Connection string
<a name="connectors-hortonworks-connection-string"></a>

Use a JDBC connection string in the following format to connect to a database instance.

```
hive://${jdbc_connection_string}
```

### Using a multiplexing handler
<a name="connectors-hortonworks-using-a-multiplexing-handler"></a>

You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | HiveMuxCompositeHandler | 
| Metadata handler | HiveMuxMetadataHandler | 
| Record handler | HiveMuxRecordHandler | 

#### Multiplexing handler parameters
<a name="connectors-hortonworks-multiplexing-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| catalog_connection_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is myhivecatalog, then the environment variable name is myhivecatalog_connection_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS_LAMBDA_FUNCTION_NAME}. | 

The following example properties are for a Hive MUX Lambda function that supports two database instances: `hive1` (the default), and `hive2`.


****  

| Property | Value | 
| --- | --- | 
| default | hive://jdbc:hive2://hive1:10000/default?${Test/RDS/hive1} | 
| hive_catalog1_connection_string | hive://jdbc:hive2://hive1:10000/default?${Test/RDS/hive1} | 
| hive_catalog2_connection_string | hive://jdbc:hive2://hive2:10000/default?UID=sample&PWD=sample | 
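
The catalog-to-connection-string routing that the multiplexer performs can be sketched as follows. This is a simplified illustration, not the SDK's actual implementation; `ENV` stands in for the Lambda function's environment variables from the table above.

```python
# Stand-in for the Lambda environment configured in the table above.
ENV = {
    "default": "hive://jdbc:hive2://hive1:10000/default?${Test/RDS/hive1}",
    "hive_catalog1_connection_string": "hive://jdbc:hive2://hive1:10000/default?${Test/RDS/hive1}",
    "hive_catalog2_connection_string": "hive://jdbc:hive2://hive2:10000/default?UID=sample&PWD=sample",
}

def connection_string_for(catalog: str) -> str:
    """Resolve <catalog>_connection_string, falling back to the default."""
    return ENV.get(f"{catalog}_connection_string", ENV["default"])

print(connection_string_for("hive_catalog2"))  # routed to the hive2 instance
print(connection_string_for("unregistered"))   # falls back to the default string
```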

#### Providing credentials
<a name="connectors-hortonworks-providing-credentials"></a>

To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${Test/RDS/hive1host}`.

```
hive://jdbc:hive2://hive1host:10000/default?...&${Test/RDS/hive1host}&...
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
hive://jdbc:hive2://hive1host:10000/default?...&UID=sample2&PWD=sample2&...
```

Currently, the Hortonworks Hive connector recognizes the `UID` and `PWD` JDBC properties.
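
The replacement step shown above can be sketched as follows: a `${secret_name}` token in the connection string is swapped for `UID` and `PWD` properties built from the secret's JSON payload. The secret value here is a hypothetical stand-in for Secrets Manager.

```python
import json
import re

# Hypothetical secret payload, stored in the JSON format shown earlier.
SECRET_VALUES = {
    "Test/RDS/hive1host": json.dumps({"username": "sample2", "password": "sample2"}),
}

def inject_credentials(connection_string: str) -> str:
    # Replace each ${secret_name} token with UID/PWD properties from the secret.
    def expand(match: re.Match) -> str:
        creds = json.loads(SECRET_VALUES[match.group(1)])
        return f"UID={creds['username']}&PWD={creds['password']}"
    return re.sub(r"\$\{([^}]+)\}", expand, connection_string)

print(inject_credentials(
    "hive://jdbc:hive2://hive1host:10000/default?${Test/RDS/hive1host}"))
```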

### Using a single connection handler
<a name="connectors-hortonworks-using-a-single-connection-handler"></a>

You can use the following single connection metadata and record handlers to connect to a single Hortonworks Hive instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | HiveCompositeHandler | 
| Metadata handler | HiveMetadataHandler | 
| Record handler | HiveRecordHandler | 

#### Single connection handler parameters
<a name="connectors-hortonworks-single-connection-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single Hortonworks Hive instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | hive://jdbc:hive2://hive1host:10000/default?secret=${Test/RDS/hive1host} | 

### Spill parameters
<a name="connectors-hortonworks-spill-parameters"></a>

The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| spill_bucket | Required. Spill bucket name. | 
| spill_prefix | Required. Spill bucket key prefix. | 
| spill_put_request_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the Amazon Simple Storage Service API Reference. | 

## Data type support
<a name="connectors-hortonworks-data-type-support"></a>

The following table shows the corresponding data types for JDBC, Hortonworks Hive, and Arrow.


****  

| JDBC | Hortonworks Hive | Arrow | 
| --- | --- | --- | 
| Boolean | Boolean | Bit | 
| Integer | TINYINT | Tiny | 
| Short | SMALLINT | Smallint | 
| Integer | INT | Int | 
| Long | BIGINT | Bigint | 
| float | float4 | Float4 | 
| Double | float8 | Float8 | 
| Date | date | DateDay | 
| Timestamp | timestamp | DateMilli | 
| String | VARCHAR | Varchar | 
| Bytes | bytes | Varbinary | 
| BigDecimal | Decimal | Decimal | 
| ARRAY | N/A (see note) | List | 

**Note**  
Currently, Hortonworks Hive does not support the aggregate types `ARRAY`, `MAP`, `STRUCT`, or `UNIONTYPE`. Columns of aggregate types are treated as `VARCHAR` columns in SQL.

## Partitions and splits
<a name="connectors-hortonworks-partitions-and-splits"></a>

Partitions are used to determine how to generate splits for the connector. Athena constructs a synthetic column of type `varchar` that represents the partitioning scheme for the table to help the connector generate splits. The connector does not modify the actual table definition.

## Performance
<a name="connectors-hortonworks-performance"></a>

Hortonworks Hive supports static partitions. The Athena Hortonworks Hive connector can retrieve data from these partitions in parallel. If you want to query very large datasets with uniform partition distribution, static partitioning is highly recommended. Selecting a subset of columns significantly speeds up query runtime and reduces data scanned. The Hortonworks Hive connector is resilient to throttling due to concurrency.

The Athena Hortonworks Hive connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses, simple predicates, and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time. 

### LIMIT clauses
<a name="connectors-hive-performance-limit-clauses"></a>

A `LIMIT N` statement reduces the data scanned by the query. With `LIMIT N` pushdown, the connector returns only `N` rows to Athena.

### Predicates
<a name="connectors-hive-performance-predicates"></a>

A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Hortonworks Hive connector can combine these expressions and push them directly to Hortonworks Hive for enhanced functionality and to reduce the amount of data scanned.

The following Athena Hortonworks Hive connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, IS_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE_PATTERN, IN

### Combined pushdown example
<a name="connectors-hive-performance-pushdown-example"></a>

For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d))
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%') 
LIMIT 10;
```

## Passthrough queries
<a name="connectors-hive-passthrough-queries"></a>

The Hortonworks Hive connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Hortonworks Hive, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Hortonworks Hive. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information
<a name="connectors-hive-license-information"></a>

By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-hortonworks-hive/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-hortonworks-hive/LICENSE.txt) file on GitHub.com.

## Additional resources
<a name="connectors-hive-additional-resources"></a>

For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-hortonworks-hive/pom.xml) file for the Hortonworks Hive connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-hortonworks-hive) on GitHub.com.

# Amazon Athena Apache Kafka connector
<a name="connectors-kafka"></a>

The Amazon Athena connector for Apache Kafka enables Amazon Athena to run SQL queries on your Apache Kafka topics. Use this connector to view [Apache Kafka](https://kafka.apache.org/) topics as tables and messages as rows in Athena.

This connector does not use Glue Connections to centralize configuration properties in Glue. Connection configuration is done through Lambda.

## Prerequisites
<a name="connectors-kafka-prerequisites"></a>

Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations
<a name="connectors-kafka-limitations"></a>
+ Write DDL operations are not supported.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ Date and timestamp data types in filter conditions must be cast to appropriate data types.
+ Date and timestamp data types are not supported for the CSV file type and are treated as varchar values.
+ Mapping into nested JSON fields is not supported. The connector maps top-level fields only.
+ The connector does not support complex types. Complex types are interpreted as strings.
+ To extract or work with complex JSON values, use the JSON-related functions available in Athena. For more information, see [Extract JSON data from strings](extracting-data-from-JSON.md).
+ The connector does not support access to Kafka message metadata.

## Terms
<a name="connectors-kafka-terms"></a>
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Kafka endpoint** – A text string that establishes a connection to a Kafka instance.

## Cluster compatibility
<a name="connectors-kafka-cluster-compatibility"></a>

The Kafka connector can be used with the following cluster types.
+ **Standalone Kafka** – A direct connection to Kafka (authenticated or unauthenticated).
+ **Confluent** – A direct connection to Confluent Kafka. For information about using Athena with Confluent Kafka data, see [Visualize Confluent data in Amazon QuickSight using Amazon Athena](https://aws.amazon.com/blogs/business-intelligence/visualize-confluent-data-in-amazon-quicksight-using-amazon-athena/) in the *AWS Business Intelligence Blog*. 

### Connecting to Confluent
<a name="connectors-kafka-connecting-to-confluent"></a>

Connecting to Confluent requires the following steps:

1. Generate an API key from Confluent.

1. Store the username and password for the Confluent API key into AWS Secrets Manager.

1. Provide the secret name for the `secrets_manager_secret` environment variable in the Kafka connector.

1. Follow the steps in the [Setting up the Kafka connector](#connectors-kafka-setup) section of this document.
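
For step 2, the secret value can be a JSON pair of user name and password, as sketched below. The key and secret values are placeholders; store the actual pair generated in step 1, and confirm the exact secret shape expected by your deployment.

```python
import json

# Sketch: build the Secrets Manager secret value for a Confluent API key.
# Placeholder values only; substitute the real API key and secret.
confluent_api_key = "EXAMPLE_API_KEY"
confluent_api_secret = "EXAMPLE_API_SECRET"

secret_string = json.dumps({
    "username": confluent_api_key,
    "password": confluent_api_secret,
})
print(secret_string)
```

You would then store this string as a secret (for example, with `aws secretsmanager create-secret --secret-string`) and set the secret's name in the connector's `secrets_manager_secret` environment variable.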

## Supported authentication methods
<a name="connectors-kafka-supported-authentication-methods"></a>

The connector supports the following authentication methods.
+ [SSL](https://kafka.apache.org/documentation/#security_ssl)
+ [SASL/SCRAM](https://kafka.apache.org/documentation/#security_sasl_scram)
+ SASL/PLAIN
+ SASL/PLAINTEXT
+ NO_AUTH
+ **Self-managed Kafka and Confluent Platform** – SSL, SASL/SCRAM, SASL/PLAINTEXT, NO_AUTH
+ **Self-managed Kafka and Confluent Cloud** – SASL/PLAIN

For more information, see [Configuring authentication for the Athena Kafka connector](#connectors-kafka-setup-configuring-authentication).

## Supported input data formats
<a name="connectors-kafka-supported-input-data-formats"></a>

The connector supports the following input data formats.
+ JSON
+ CSV
+ AVRO
+ PROTOBUF (PROTOCOL BUFFERS)

## Parameters
<a name="connectors-kafka-parameters"></a>

Use the parameters in this section to configure the Athena Kafka connector.
+ **auth_type** – Specifies the authentication type of the cluster. The connector supports the following types of authentication:
  + **NO_AUTH** – Connect directly to Kafka (for example, to a Kafka cluster deployed over an EC2 instance that does not use authentication).
  + **SASL_SSL_PLAIN** – This method uses the `SASL_SSL` security protocol and the `PLAIN` SASL mechanism. For more information, see [SASL configuration](https://kafka.apache.org/documentation/#security_sasl_config) in the Apache Kafka documentation.
  + **SASL_PLAINTEXT_PLAIN** – This method uses the `SASL_PLAINTEXT` security protocol and the `PLAIN` SASL mechanism. For more information, see [SASL configuration](https://kafka.apache.org/documentation/#security_sasl_config) in the Apache Kafka documentation.
  + **SASL_SSL_SCRAM_SHA512** – You can use this authentication type to control access to your Apache Kafka clusters. This method stores the user name and password in AWS Secrets Manager. The secret must be associated with the Kafka cluster. For more information, see [Authentication using SASL/SCRAM](https://kafka.apache.org/documentation/#security_sasl_scram) in the Apache Kafka documentation.
  + **SASL_PLAINTEXT_SCRAM_SHA512** – This method uses the `SASL_PLAINTEXT` security protocol and the `SCRAM_SHA512` SASL mechanism, with your user name and password stored in AWS Secrets Manager. For more information, see the [SASL configuration](https://kafka.apache.org/documentation/#security_sasl_config) section of the Apache Kafka documentation.
  + **SSL** – SSL authentication uses key store and trust store files to connect with the Apache Kafka cluster. You must generate the trust store and key store files, upload them to an Amazon S3 bucket, and provide the reference to Amazon S3 when you deploy the connector. The key store, trust store, and SSL key are stored in AWS Secrets Manager. Your client must provide the AWS secret key when the connector is deployed. For more information, see [Encryption and Authentication using SSL](https://kafka.apache.org/documentation/#security_ssl) in the Apache Kafka documentation.

    For more information, see [Configuring authentication for the Athena Kafka connector](#connectors-kafka-setup-configuring-authentication).
+ **certificates_s3_reference** – The Amazon S3 location that contains the certificates (the key store and trust store files).
+ **disable_spill_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False`, so that data spilled to Amazon S3 is encrypted using AES-GCM with either a randomly generated key or keys generated by KMS. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **kafka_endpoint** – The endpoint details to provide to Kafka.
+ **schema_registry_url** – The URL address for the schema registry (for example, `http://schema-registry.example.org:8081`). Applies to the `AVRO` and `PROTOBUF` data formats. Athena supports only the Confluent Schema Registry.
+ **secrets_manager_secret** – The name of the AWS secret in which the credentials are saved.
+ **Spill parameters** – Lambda functions temporarily store ("spill") data that does not fit into memory to Amazon S3. All database instances accessed by the same Lambda function spill to the same location. Use the parameters in the following table to specify the spill location.  
****    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/connectors-kafka.html)
+ **Subnet IDs** – The IDs of one or more subnets that the Lambda function can use to access your data source.
  + **Public Kafka cluster or standard Confluent Cloud cluster** – Associate the connector with a private subnet that has a NAT Gateway.
  + **Confluent Cloud cluster with private connectivity** – Associate the connector with a private subnet that has a route to the Confluent Cloud cluster.
    + For [AWS Transit Gateway](https://docs.confluent.io/cloud/current/networking/aws-transit-gateway.html), the subnets must be in a VPC that is attached to the same transit gateway that Confluent Cloud uses.
    + For [VPC Peering](https://docs.confluent.io/cloud/current/networking/peering/aws-peering.html), the subnets must be in a VPC that is peered to Confluent Cloud VPC.
    + For [AWS PrivateLink](https://docs.confluent.io/cloud/current/networking/private-links/aws-privatelink.html), the subnets must be in a VPC that has a route to the VPC endpoints that connect to Confluent Cloud.

**Note**  
If you deploy the connector into a VPC in order to access private resources and also want to connect to a publicly accessible service like Confluent, you must associate the connector with a private subnet that has a NAT Gateway. For more information, see [NAT gateways](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html) in the Amazon VPC User Guide.
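Each `auth_type` value implies a particular Kafka client security protocol and SASL mechanism. The following Python sketch (illustrative only; the function and dictionary names are not part of the connector) summarizes that mapping:

```python
# Illustrative mapping from the connector's auth_type values to the Kafka
# client security properties they imply. NO_AUTH requires no security
# properties, and plain SSL has no SASL mechanism.
AUTH_TYPES = {
    "NO_AUTH": {},
    "SASL_SSL_PLAIN": {"security.protocol": "SASL_SSL", "sasl.mechanism": "PLAIN"},
    "SASL_PLAINTEXT_PLAIN": {"security.protocol": "SASL_PLAINTEXT", "sasl.mechanism": "PLAIN"},
    "SASL_SSL_SCRAM_SHA512": {"security.protocol": "SASL_SSL", "sasl.mechanism": "SCRAM-SHA-512"},
    "SASL_PLAINTEXT_SCRAM_SHA512": {"security.protocol": "SASL_PLAINTEXT", "sasl.mechanism": "SCRAM-SHA-512"},
    "SSL": {"security.protocol": "SSL"},
}

def client_properties(auth_type: str) -> dict:
    """Return the Kafka client security properties implied by auth_type."""
    try:
        return dict(AUTH_TYPES[auth_type])
    except KeyError:
        raise ValueError(f"Unsupported auth_type: {auth_type}") from None

print(client_properties("SASL_SSL_SCRAM_SHA512"))
```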

## Data type support
<a name="connectors-kafka-data-type-support"></a>

The following table shows the corresponding data types supported for Kafka and Apache Arrow.


****  

| Kafka | Arrow | 
| --- | --- | 
| CHAR | VARCHAR | 
| VARCHAR | VARCHAR | 
| TIMESTAMP | MILLISECOND | 
| DATE | DAY | 
| BOOLEAN | BOOL | 
| SMALLINT | SMALLINT | 
| INTEGER | INT | 
| BIGINT | BIGINT | 
| DECIMAL | FLOAT8 | 
| DOUBLE | FLOAT8 | 

## Partitions and splits
<a name="connectors-kafka-partitions-and-splits"></a>

Kafka topics are split into partitions. Each partition is ordered. Each message in a partition has an incremental ID called an *offset*. Each Kafka partition is further divided into multiple splits for parallel processing. Data is available for the retention period configured in the Kafka cluster.
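To illustrate the split concept (the connector's actual split boundaries are internal to its implementation), the following sketch divides a partition's offset range into bounded splits that could be read in parallel:

```python
def make_splits(start_offset: int, end_offset: int, max_split_size: int):
    """Divide a partition's half-open offset range [start_offset, end_offset)
    into splits of at most max_split_size offsets each."""
    splits = []
    lo = start_offset
    while lo < end_offset:
        hi = min(lo + max_split_size, end_offset)
        splits.append((lo, hi))
        lo = hi
    return splits

# A partition holding offsets 0..999, read in splits of up to 400 messages:
print(make_splits(0, 1000, 400))  # [(0, 400), (400, 800), (800, 1000)]
```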

## Best practices
<a name="connectors-kafka-best-practices"></a>

As a best practice, use predicate pushdown when you query Athena, as in the following examples.

```
SELECT * 
FROM "kafka_catalog_name"."glue_schema_registry_name"."glue_schema_name" 
WHERE integercol = 2147483647
```

```
SELECT * 
FROM "kafka_catalog_name"."glue_schema_registry_name"."glue_schema_name" 
WHERE timestampcol >= TIMESTAMP '2018-03-25 07:30:58.878'
```

## Setting up the Kafka connector
<a name="connectors-kafka-setup"></a>

Before you can use the connector, you must set up your Apache Kafka cluster, use the [AWS Glue Schema Registry](https://docs.aws.amazon.com/glue/latest/dg/schema-registry.html) to define your schema, and configure authentication for the connector.

When working with the AWS Glue Schema Registry, note the following points:
+ Make sure that the text in the **Description** field of the AWS Glue Schema Registry includes the string `{AthenaFederationKafka}`. This marker string is required for AWS Glue registries that you use with the Amazon Athena Kafka connector.
+ For best performance, use only lowercase for your database names and table names. Mixed casing causes the connector to perform a case-insensitive search, which is more computationally intensive.

**To set up your Apache Kafka environment and AWS Glue Schema Registry**

1. Set up your Apache Kafka environment.

1. Upload the Kafka topic description file (that is, its schema) in JSON format to the AWS Glue Schema Registry. For more information, see [Integrating with AWS Glue Schema Registry](https://docs.aws.amazon.com/glue/latest/dg/schema-registry-integrations.html) in the AWS Glue Developer Guide.

1. To use the `AVRO` or `PROTOBUF` data format when you define the schema in the AWS Glue Schema Registry:
   + For **Schema name**, enter the Kafka topic name in the same casing as the original.
   + For **Data format**, choose **Apache Avro** or **Protocol Buffers**.

    For example schemas, see the following section.

### Schema examples for the AWS Glue Schema Registry
<a name="connectors-kafka-setup-schema-examples"></a>

Use the format of the examples in this section when you upload your schema to the [AWS Glue Schema Registry](https://docs.aws.amazon.com/glue/latest/dg/schema-registry.html).

#### JSON type schema example
<a name="connectors-kafka-setup-schema-examples-json"></a>

In the following example, the schema to be created in the AWS Glue Schema Registry specifies `json` as the value for `dataFormat` and uses `datatypejson` for `topicName`.

**Note**  
The value for `topicName` should use the same casing as the topic name in Kafka. 

```
{
  "topicName": "datatypejson",
  "message": {
    "dataFormat": "json",
    "fields": [
      {
        "name": "intcol",
        "mapping": "intcol",
        "type": "INTEGER"
      },
      {
        "name": "varcharcol",
        "mapping": "varcharcol",
        "type": "VARCHAR"
      },
      {
        "name": "booleancol",
        "mapping": "booleancol",
        "type": "BOOLEAN"
      },
      {
        "name": "bigintcol",
        "mapping": "bigintcol",
        "type": "BIGINT"
      },
      {
        "name": "doublecol",
        "mapping": "doublecol",
        "type": "DOUBLE"
      },
      {
        "name": "smallintcol",
        "mapping": "smallintcol",
        "type": "SMALLINT"
      },
      {
        "name": "tinyintcol",
        "mapping": "tinyintcol",
        "type": "TINYINT"
      },
      {
        "name": "datecol",
        "mapping": "datecol",
        "type": "DATE",
        "formatHint": "yyyy-MM-dd"
      },
      {
        "name": "timestampcol",
        "mapping": "timestampcol",
        "type": "TIMESTAMP",
        "formatHint": "yyyy-MM-dd HH:mm:ss.SSS"
      }
    ]
  }
}
```
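The `fields` array in the schema above determines the columns the connector exposes in Athena. The following sketch (illustrative only, using a trimmed three-field copy of the example) shows how the column names and types can be read from the schema document:

```python
import json

# A trimmed copy of the JSON schema example (three of its nine fields).
schema_document = json.loads("""
{
  "topicName": "datatypejson",
  "message": {
    "dataFormat": "json",
    "fields": [
      {"name": "intcol", "mapping": "intcol", "type": "INTEGER"},
      {"name": "datecol", "mapping": "datecol", "type": "DATE", "formatHint": "yyyy-MM-dd"},
      {"name": "timestampcol", "mapping": "timestampcol", "type": "TIMESTAMP", "formatHint": "yyyy-MM-dd HH:mm:ss.SSS"}
    ]
  }
}
""")

# Each entry in "fields" becomes a (column name, column type) pair.
columns = [(f["name"], f["type"]) for f in schema_document["message"]["fields"]]
print(columns)  # [('intcol', 'INTEGER'), ('datecol', 'DATE'), ('timestampcol', 'TIMESTAMP')]
```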

#### CSV type schema example
<a name="connectors-kafka-setup-schema-examples-csv"></a>

In the following example, the schema to be created in the AWS Glue Schema Registry specifies `csv` as the value for `dataFormat` and uses `datatypecsvbulk` for `topicName`. The value for `topicName` should use the same casing as the topic name in Kafka.

```
{
  "topicName": "datatypecsvbulk",
  "message": {
    "dataFormat": "csv",
    "fields": [
      {
        "name": "intcol",
        "type": "INTEGER",
        "mapping": "0"
      },
      {
        "name": "varcharcol",
        "type": "VARCHAR",
        "mapping": "1"
      },
      {
        "name": "booleancol",
        "type": "BOOLEAN",
        "mapping": "2"
      },
      {
        "name": "bigintcol",
        "type": "BIGINT",
        "mapping": "3"
      },
      {
        "name": "doublecol",
        "type": "DOUBLE",
        "mapping": "4"
      },
      {
        "name": "smallintcol",
        "type": "SMALLINT",
        "mapping": "5"
      },
      {
        "name": "tinyintcol",
        "type": "TINYINT",
        "mapping": "6"
      },
      {
        "name": "floatcol",
        "type": "DOUBLE",
        "mapping": "7"
      }
    ]
  }
}
```
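Unlike the JSON schema, the CSV schema's `mapping` values are zero-based column positions in each CSV-encoded message rather than field names. The following sketch (illustrative only; the sample message is invented) shows how those positional indexes select values:

```python
import csv
import io

# A subset of the CSV schema's fields: name, declared type, positional index.
fields = [
    ("intcol", "INTEGER", 0),
    ("varcharcol", "VARCHAR", 1),
    ("booleancol", "BOOLEAN", 2),
]

# One CSV-encoded Kafka message (eight comma-separated values).
message = "2147483647,hello,true,9,1.5,3,1,2.25"
record = next(csv.reader(io.StringIO(message)))

# Pick each field's raw value by its positional mapping.
row = {name: record[index] for name, _type, index in fields}
print(row)  # {'intcol': '2147483647', 'varcharcol': 'hello', 'booleancol': 'true'}
```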

#### AVRO type schema example
<a name="connectors-kafka-setup-schema-examples-avro"></a>

The following example is used to create an AVRO-based schema in the AWS Glue Schema Registry. When you define the schema in the AWS Glue Schema Registry, for **Schema name**, you enter the Kafka topic name in the same casing as the original, and for **Data format**, you choose **Apache Avro**. Because you specify this information directly in the registry, the `dataFormat` and `topicName` fields are not required.

```
{
    "type": "record",
    "name": "avrotest",
    "namespace": "example.com",
    "fields": [{
            "name": "id",
            "type": "int"
        },
        {
            "name": "name",
            "type": "string"
        }
    ]
}
```

#### PROTOBUF type schema example
<a name="connectors-kafka-setup-schema-examples-protobuf"></a>

The following example is used to create a PROTOBUF-based schema in the AWS Glue Schema Registry. When you define the schema in the AWS Glue Schema Registry, for **Schema name**, you enter the Kafka topic name in the same casing as the original, and for **Data format**, you choose **Protocol Buffers**. Because you specify this information directly in the registry, the `dataFormat` and `topicName` fields are not required. The first line defines the schema as PROTOBUF.

```
syntax = "proto3";
message protobuftest {
  string name = 1;
  int64 calories = 2;
  string colour = 3;
}
```

For more information about adding a registry and schemas in the AWS Glue Schema Registry, see [Getting started with Schema Registry](https://docs.aws.amazon.com/glue/latest/dg/schema-registry-gs.html) in the AWS Glue documentation.

### Configuring authentication for the Athena Kafka connector
<a name="connectors-kafka-setup-configuring-authentication"></a>

You can use a variety of methods to authenticate to your Apache Kafka cluster, including SSL, SASL/SCRAM, SASL/PLAIN, and SASL/PLAINTEXT.

The following table shows the authentication types for the connector and the security protocol and SASL mechanism for each. For more information, see the [Security](https://kafka.apache.org/documentation/#security) section of the Apache Kafka documentation.


****  

| auth_type | security.protocol | sasl.mechanism | Cluster type compatibility | 
| --- | --- | --- | --- | 
| SASL_SSL_PLAIN | SASL_SSL | PLAIN |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/connectors-kafka.html)  | 
| SASL_PLAINTEXT_PLAIN | SASL_PLAINTEXT | PLAIN |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/connectors-kafka.html)  | 
| SASL_SSL_SCRAM_SHA512 | SASL_SSL | SCRAM-SHA-512 |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/connectors-kafka.html)  | 
| SASL_PLAINTEXT_SCRAM_SHA512 | SASL_PLAINTEXT | SCRAM-SHA-512 |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/connectors-kafka.html)  | 
| SSL | SSL | N/A |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/connectors-kafka.html)  | 

#### SSL
<a name="connectors-kafka-setup-configuring-authentication-tls"></a>

If the cluster is SSL authenticated, you must generate the trust store and key store files and upload them to an Amazon S3 bucket. You must provide this Amazon S3 reference when you deploy the connector. The key store, trust store, and SSL key are stored in AWS Secrets Manager. You provide the name of the AWS secret when you deploy the connector.

For information on creating a secret in Secrets Manager, see [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html).

To use this authentication type, set the environment variables as shown in the following table.


****  

| Parameter | Value | 
| --- | --- | 
| auth_type | SSL | 
| certificates_s3_reference | The Amazon S3 location that contains the certificates. | 
| secrets_manager_secret | The name of your AWS secret. | 

After you create a secret in Secrets Manager, you can view it in the Secrets Manager console.

**To view your secret in Secrets Manager**

1. Open the Secrets Manager console at [https://console.aws.amazon.com/secretsmanager/](https://console.aws.amazon.com/secretsmanager/).

1. In the navigation pane, choose **Secrets**.

1. On the **Secrets** page, choose the link to your secret.

1. On the details page for your secret, choose **Retrieve secret value**.

   The following image shows an example secret with three key/value pairs: `keystore_password`, `truststore_password`, and `ssl_key_password`.  
![\[Retrieving an SSL secret in Secrets Manager\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-kafka-setup-1.png)

For more information about using SSL with Kafka, see [Encryption and Authentication using SSL](https://kafka.apache.org/documentation/#security_ssl) in the Apache Kafka documentation.

#### SASL/SCRAM
<a name="connectors-kafka-setup-configuring-authentication-sasl-scram"></a>

If your cluster uses SCRAM authentication, provide the Secrets Manager secret that is associated with the cluster when you deploy the connector. The user name and password stored in the secret are used to authenticate with the cluster.

Set the environment variables as shown in the following table.


****  

| Parameter | Value | 
| --- | --- | 
| auth_type | SASL_SSL_SCRAM_SHA512 | 
| secrets_manager_secret | The name of your AWS secret. | 

The following image shows an example secret in the Secrets Manager console with two key/value pairs: one for `username`, and one for `password`.

![\[Retrieving a SCRAM secret in Secrets Manager\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-kafka-setup-2.png)


For more information about using SASL/SCRAM with Kafka, see [Authentication using SASL/SCRAM](https://kafka.apache.org/documentation/#security_sasl_scram) in the Apache Kafka documentation.
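As the image indicates, the secret's value is a JSON string with `username` and `password` keys. A minimal sketch of that structure (the values here are placeholders, not working credentials):

```python
import json

# The SCRAM secret's value is a JSON string with exactly these two keys.
# Replace the placeholder values with your cluster's SCRAM credentials.
secret_string = json.dumps({"username": "kafka-user", "password": "kafka-password"})

parsed = json.loads(secret_string)
print(sorted(parsed))  # ['password', 'username']
```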

## License information
<a name="connectors-kafka-license-information"></a>

By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-kafka/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-kafka/LICENSE.txt) file on GitHub.com.

## Additional resources
<a name="connectors-kafka-additional-resources"></a>

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-kafka) on GitHub.com.

# Amazon Athena MSK connector
<a name="connectors-msk"></a>

The Amazon Athena connector for [Amazon MSK](https://aws.amazon.com/msk/) enables Amazon Athena to run SQL queries on your Apache Kafka topics. Use this connector to view [Apache Kafka](https://kafka.apache.org/) topics as tables and messages as rows in Athena. For additional information, see [Analyze real-time streaming data in Amazon MSK with Amazon Athena](https://aws.amazon.com/blogs/big-data/analyze-real-time-streaming-data-in-amazon-msk-with-amazon-athena/) in the AWS Big Data Blog.

This connector does not use AWS Glue connections to centralize configuration properties. Instead, connection configuration is done through the Lambda function.

## Prerequisites
<a name="connectors-msk-prerequisites"></a>

Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations
<a name="connectors-msk-limitations"></a>
+ Write DDL operations are not supported.
+ Relevant Lambda limits apply. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ Date and timestamp data types in filter conditions must be cast to appropriate data types.
+ Date and timestamp data types are not supported for the CSV file type and are treated as varchar values.
+ Mapping into nested JSON fields is not supported. The connector maps top-level fields only.
+ The connector does not support complex types. Complex types are interpreted as strings.
+ To extract or work with complex JSON values, use the JSON-related functions available in Athena. For more information, see [Extract JSON data from strings](extracting-data-from-JSON.md).
+ The connector does not support access to Kafka message metadata.

## Terms
<a name="connectors-msk-terms"></a>
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Kafka endpoint** – A text string that establishes a connection to a Kafka instance.

## Cluster compatibility
<a name="connectors-msk-cluster-compatibility"></a>

The MSK connector can be used with the following cluster types.
+ **MSK Provisioned cluster** – You manually specify, monitor, and scale cluster capacity.
+ **MSK Serverless cluster** – Provides on-demand capacity that scales automatically as application I/O scales.
+ **Standalone Kafka** – A direct connection to Kafka (authenticated or unauthenticated).

## Supported authentication methods
<a name="connectors-msk-supported-authentication-methods"></a>

The connector supports the following authentication methods.
+ [SASL/IAM](https://docs.aws.amazon.com/msk/latest/developerguide/iam-access-control.html) 
+ [SSL](https://docs.aws.amazon.com/msk/latest/developerguide/msk-authentication.html)
+ [SASL/SCRAM](https://docs.aws.amazon.com/msk/latest/developerguide/msk-password.html)
+ SASL/PLAIN
+ SASL/PLAINTEXT
+ NO_AUTH

  For more information, see [Configuring authentication for the Athena MSK connector](#connectors-msk-setup-configuring-authentication).

## Supported input data formats
<a name="connectors-msk-supported-input-data-formats"></a>

The connector supports the following input data formats.
+ JSON
+ CSV

## Parameters
<a name="connectors-msk-parameters"></a>

Use the parameters in this section to configure the Athena MSK connector.
+ **auth_type** – Specifies the authentication type of the cluster. The connector supports the following types of authentication:
  + **NO_AUTH** – Connect directly to Kafka with no authentication (for example, to a Kafka cluster deployed on an EC2 instance that does not use authentication).
  + **SASL_SSL_PLAIN** – This method uses the `SASL_SSL` security protocol and the `PLAIN` SASL mechanism.
  + **SASL_PLAINTEXT_PLAIN** – This method uses the `SASL_PLAINTEXT` security protocol and the `PLAIN` SASL mechanism.
**Note**  
The `SASL_SSL_PLAIN` and `SASL_PLAINTEXT_PLAIN` authentication types are supported by Apache Kafka but not by Amazon MSK.
  + **SASL_SSL_AWS_MSK_IAM** – IAM access control for Amazon MSK enables you to handle both authentication and authorization for your MSK cluster. Your user's AWS credentials (secret key and access key) are used to connect with the cluster. For more information, see [IAM access control](https://docs.aws.amazon.com/msk/latest/developerguide/iam-access-control.html) in the Amazon Managed Streaming for Apache Kafka Developer Guide.
  + **SASL_SSL_SCRAM_SHA512** – You can use this authentication type to control access to your Amazon MSK clusters. This method stores the user name and password in AWS Secrets Manager. The secret must be associated with the Amazon MSK cluster. For more information, see [Setting up SASL/SCRAM authentication for an Amazon MSK cluster](https://docs.aws.amazon.com/msk/latest/developerguide/msk-password.html#msk-password-tutorial) in the Amazon Managed Streaming for Apache Kafka Developer Guide.
  + **SSL** – SSL authentication uses key store and trust store files to connect with the Amazon MSK cluster. You must generate the trust store and key store files, upload them to an Amazon S3 bucket, and provide the reference to Amazon S3 when you deploy the connector. The key store, trust store, and SSL key are stored in AWS Secrets Manager. Your client must provide the AWS secret key when the connector is deployed. For more information, see [Mutual TLS authentication](https://docs.aws.amazon.com/msk/latest/developerguide/msk-authentication.html) in the Amazon Managed Streaming for Apache Kafka Developer Guide.

    For more information, see [Configuring authentication for the Athena MSK connector](#connectors-msk-setup-configuring-authentication).
+ **certificates_s3_reference** – The Amazon S3 location that contains the certificates (the key store and trust store files).
+ **disable_spill_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False`, so that data spilled to Amazon S3 is encrypted using AES-GCM with either a randomly generated key or keys generated by KMS. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **kafka_endpoint** – The endpoint details to provide to Kafka. For example, for an Amazon MSK cluster, you provide a [bootstrap URL](https://docs.aws.amazon.com/msk/latest/developerguide/msk-get-bootstrap-brokers.html) for the cluster.
+ **secrets_manager_secret** – The name of the AWS secret in which the credentials are saved. This parameter is not required for IAM authentication.
+ **Spill parameters** – Lambda functions temporarily store ("spill") data that does not fit into memory to Amazon S3. All database instances accessed by the same Lambda function spill to the same location. Use the parameters in the following table to specify the spill location.  
****    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/connectors-msk.html)

## Data type support
<a name="connectors-msk-data-type-support"></a>

The following table shows the corresponding data types supported for Kafka and Apache Arrow.


****  

| Kafka | Arrow | 
| --- | --- | 
| CHAR | VARCHAR | 
| VARCHAR | VARCHAR | 
| TIMESTAMP | MILLISECOND | 
| DATE | DAY | 
| BOOLEAN | BOOL | 
| SMALLINT | SMALLINT | 
| INTEGER | INT | 
| BIGINT | BIGINT | 
| DECIMAL | FLOAT8 | 
| DOUBLE | FLOAT8 | 

## Partitions and splits
<a name="connectors-msk-partitions-and-splits"></a>

Kafka topics are split into partitions. Each partition is ordered. Each message in a partition has an incremental ID called an *offset*. Each Kafka partition is further divided into multiple splits for parallel processing. Data is available for the retention period configured in the Kafka cluster.

## Best practices
<a name="connectors-msk-best-practices"></a>

As a best practice, use predicate pushdown when you query Athena, as in the following examples.

```
SELECT * 
FROM "msk_catalog_name"."glue_schema_registry_name"."glue_schema_name" 
WHERE integercol = 2147483647
```

```
SELECT * 
FROM "msk_catalog_name"."glue_schema_registry_name"."glue_schema_name" 
WHERE timestampcol >= TIMESTAMP '2018-03-25 07:30:58.878'
```

## Setting up the MSK connector
<a name="connectors-msk-setup"></a>

Before you can use the connector, you must set up your Amazon MSK cluster, use the [AWS Glue Schema Registry](https://docs.aws.amazon.com/glue/latest/dg/schema-registry.html) to define your schema, and configure authentication for the connector.

**Note**  
If you deploy the connector into a VPC in order to access private resources and also want to connect to a publicly accessible service like Confluent, you must associate the connector with a private subnet that has a NAT Gateway. For more information, see [NAT gateways](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html) in the Amazon VPC User Guide.

When working with the AWS Glue Schema Registry, note the following points:
+ Make sure that the text in the **Description** field of the AWS Glue Schema Registry includes the string `{AthenaFederationMSK}`. This marker string is required for AWS Glue registries that you use with the Amazon Athena MSK connector.
+ For best performance, use only lowercase for your database names and table names. Mixed casing causes the connector to perform a case-insensitive search, which is more computationally intensive.

**To set up your Amazon MSK environment and AWS Glue Schema Registry**

1. Set up your Amazon MSK environment. For information and steps, see [Setting up Amazon MSK](https://docs.aws.amazon.com/msk/latest/developerguide/before-you-begin.html) and [Getting started using Amazon MSK](https://docs.aws.amazon.com/msk/latest/developerguide/getting-started.html) in the Amazon Managed Streaming for Apache Kafka Developer Guide.

1. Upload the Kafka topic description file (that is, its schema) in JSON format to the AWS Glue Schema Registry. For more information, see [Integrating with AWS Glue Schema Registry](https://docs.aws.amazon.com/glue/latest/dg/schema-registry-integrations.html) in the AWS Glue Developer Guide. For example schemas, see the following section.

### Schema examples for the AWS Glue Schema Registry
<a name="connectors-msk-setup-schema-examples"></a>

Use the format of the examples in this section when you upload your schema to the [AWS Glue Schema Registry](https://docs.aws.amazon.com/glue/latest/dg/schema-registry.html).

#### JSON type schema example
<a name="connectors-msk-setup-schema-examples-json"></a>

In the following example, the schema to be created in the AWS Glue Schema Registry specifies `json` as the value for `dataFormat` and uses `datatypejson` for `topicName`.

**Note**  
The value for `topicName` should use the same casing as the topic name in Kafka. 

```
{
  "topicName": "datatypejson",
  "message": {
    "dataFormat": "json",
    "fields": [
      {
        "name": "intcol",
        "mapping": "intcol",
        "type": "INTEGER"
      },
      {
        "name": "varcharcol",
        "mapping": "varcharcol",
        "type": "VARCHAR"
      },
      {
        "name": "booleancol",
        "mapping": "booleancol",
        "type": "BOOLEAN"
      },
      {
        "name": "bigintcol",
        "mapping": "bigintcol",
        "type": "BIGINT"
      },
      {
        "name": "doublecol",
        "mapping": "doublecol",
        "type": "DOUBLE"
      },
      {
        "name": "smallintcol",
        "mapping": "smallintcol",
        "type": "SMALLINT"
      },
      {
        "name": "tinyintcol",
        "mapping": "tinyintcol",
        "type": "TINYINT"
      },
      {
        "name": "datecol",
        "mapping": "datecol",
        "type": "DATE",
        "formatHint": "yyyy-MM-dd"
      },
      {
        "name": "timestampcol",
        "mapping": "timestampcol",
        "type": "TIMESTAMP",
        "formatHint": "yyyy-MM-dd HH:mm:ss.SSS"
      }
    ]
  }
}
```

#### CSV type schema example
<a name="connectors-msk-setup-schema-examples-csv"></a>

In the following example, the schema to be created in the AWS Glue Schema Registry specifies `csv` as the value for `dataFormat` and uses `datatypecsvbulk` for `topicName`. The value for `topicName` should use the same casing as the topic name in Kafka.

```
{
  "topicName": "datatypecsvbulk",
  "message": {
    "dataFormat": "csv",
    "fields": [
      {
        "name": "intcol",
        "type": "INTEGER",
        "mapping": "0"
      },
      {
        "name": "varcharcol",
        "type": "VARCHAR",
        "mapping": "1"
      },
      {
        "name": "booleancol",
        "type": "BOOLEAN",
        "mapping": "2"
      },
      {
        "name": "bigintcol",
        "type": "BIGINT",
        "mapping": "3"
      },
      {
        "name": "doublecol",
        "type": "DOUBLE",
        "mapping": "4"
      },
      {
        "name": "smallintcol",
        "type": "SMALLINT",
        "mapping": "5"
      },
      {
        "name": "tinyintcol",
        "type": "TINYINT",
        "mapping": "6"
      },
      {
        "name": "floatcol",
        "type": "DOUBLE",
        "mapping": "7"
      }
    ]
  }
}
```

### Configuring authentication for the Athena MSK connector
<a name="connectors-msk-setup-configuring-authentication"></a>

You can use a variety of methods to authenticate to your Amazon MSK cluster, including SASL/IAM, SSL, and SASL/SCRAM. Standalone Kafka clusters can also be accessed without authentication.

The following table shows the authentication types for the connector and the security protocol and SASL mechanism for each. For more information, see [Authentication and authorization for Apache Kafka APIs](https://docs.aws.amazon.com/msk/latest/developerguide/kafka_apis_iam.html) in the Amazon Managed Streaming for Apache Kafka Developer Guide.


****  

| auth_type | security.protocol | sasl.mechanism | 
| --- | --- | --- | 
| SASL_SSL_PLAIN | SASL_SSL | PLAIN | 
| SASL_PLAINTEXT_PLAIN | SASL_PLAINTEXT | PLAIN | 
| SASL_SSL_AWS_MSK_IAM | SASL_SSL | AWS_MSK_IAM | 
| SASL_SSL_SCRAM_SHA512 | SASL_SSL | SCRAM-SHA-512 | 
| SSL | SSL | N/A | 

**Note**  
The `SASL_SSL_PLAIN` and `SASL_PLAINTEXT_PLAIN` authentication types are supported by Apache Kafka but not by Amazon MSK.

#### SASL/IAM
<a name="connectors-msk-setup-configuring-authentication-sasl-iam"></a>

If the cluster uses IAM authentication, you must configure the IAM policy for the user when you set up the cluster. For more information, see [IAM access control](https://docs.aws.amazon.com/msk/latest/developerguide/iam-access-control.html) in the Amazon Managed Streaming for Apache Kafka Developer Guide.

To use this authentication type, set the `auth_type` Lambda environment variable for the connector to `SASL_SSL_AWS_MSK_IAM`. 

#### SSL
<a name="connectors-msk-setup-configuring-authentication-tls"></a>

If the cluster uses SSL authentication, you must generate the trust store and key store files and upload them to an Amazon S3 bucket. You provide this Amazon S3 reference when you deploy the connector. The key store password, trust store password, and SSL key password are stored in AWS Secrets Manager. You provide the name of the secret when you deploy the connector.

For information on creating a secret in Secrets Manager, see [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html).

To use this authentication type, set the environment variables as shown in the following table.


****  

| Parameter | Value | 
| --- | --- | 
| auth\_type | SSL | 
| certificates\_s3\_reference | The Amazon S3 location that contains the certificates. | 
| secrets\_manager\_secret | The name of your secret in AWS Secrets Manager. | 

After you create a secret in Secrets Manager, you can view it in the Secrets Manager console.

**To view your secret in Secrets Manager**

1. Open the Secrets Manager console at [https://console.aws.amazon.com/secretsmanager/](https://console.aws.amazon.com/secretsmanager/).

1. In the navigation pane, choose **Secrets**.

1. On the **Secrets** page, choose the link to your secret.

1. On the details page for your secret, choose **Retrieve secret value**.

   The following image shows an example secret with three key/value pairs: `keystore_password`, `truststore_password`, and `ssl_key_password`.  
![\[Retrieving an SSL secret in Secrets Manager\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-msk-setup-1.png)
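
For illustration, a secret value with these three keys can be parsed as follows. This is a hedged sketch: the key names match the screenshot above, but the password values and helper function are placeholders.

```python
import json

# Example secret value with the three keys shown above.
# The password values here are placeholders.
secret_string = json.dumps({
    "keystore_password": "keystore-pass",
    "truststore_password": "truststore-pass",
    "ssl_key_password": "ssl-key-pass",
})

def read_ssl_passwords(secret_string: str) -> tuple:
    """Extract the key store, trust store, and SSL key passwords."""
    secret = json.loads(secret_string)
    return (
        secret["keystore_password"],
        secret["truststore_password"],
        secret["ssl_key_password"],
    )
```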

#### SASL/SCRAM
<a name="connectors-msk-setup-configuring-authentication-sasl-scram"></a>

If your cluster uses SCRAM authentication, provide the name of the Secrets Manager secret that is associated with the cluster when you deploy the connector. The user name and password stored in the secret are used to authenticate with the cluster.

Set the environment variables as shown in the following table.


****  

| Parameter | Value | 
| --- | --- | 
| auth\_type | SASL\_SSL\_SCRAM\_SHA512 | 
| secrets\_manager\_secret | The name of your secret in AWS Secrets Manager. | 

The following image shows an example secret in the Secrets Manager console with two key/value pairs: one for `username`, and one for `password`.

![\[Retrieving a SCRAM secret in Secrets Manager\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-msk-setup-2.png)


## License information
<a name="connectors-msk-license-information"></a>

By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-msk/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-msk/LICENSE.txt) file on GitHub.com.

## Additional resources
<a name="connectors-msk-additional-resources"></a>

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-msk) on GitHub.com.

# Amazon Athena MySQL connector
<a name="connectors-mysql"></a>

The Amazon Athena Lambda MySQL connector enables Amazon Athena to access MySQL databases.

This connector can be registered with the AWS Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses AWS Glue connections to centralize configuration properties in Glue.

## Prerequisites
<a name="connectors-mysql-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).
+ Set up a VPC and a security group before you use this connector. For more information, see [Create a VPC for a data source connector or AWS Glue connection](athena-connectors-vpc-creation.md).

## Limitations
<a name="connectors-mysql-limitations"></a>
+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ Because Athena converts queries to lower case, MySQL table names must be in lower case. For example, Athena queries against a table named `myTable` will fail.
+ If you migrate your MySQL connections to Glue Catalog and Lake Formation, only the lowercase table and column names will be recognized. 

## Terms
<a name="connectors-mysql-terms"></a>

The following terms relate to the MySQL connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters
<a name="connectors-mysql-parameters"></a>

Use the parameters in this section to configure the MySQL connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.

### Glue connections (recommended)
<a name="connectors-mysql-gc"></a>

We recommend that you configure a MySQL connector by using a Glue connections object. 

To do this, set the `glue_connection` environment variable of the MySQL connector Lambda to the name of the Glue connection to use.

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type MYSQL
```

**Lambda environment properties**

**glue\_connection** – Specifies the name of the Glue connection associated with the federated connector. 

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The MySQL connector created using Glue connections does not support the use of a multiplexing handler.
The MySQL connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="connectors-mysql-connection-legacy"></a>

The parameter names and definitions listed below are for Athena data source connectors created without an associated Glue connection. Use the following parameters only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector or when the `glue_connection` environment property is not specified.

#### Connection string
<a name="connectors-mysql-connection-string"></a>

Use a JDBC connection string in the following format to connect to a database instance.

```
mysql://${jdbc_connection_string}
```

**Note**  
If you receive the error `java.sql.SQLException: Zero date value prohibited` when running a `SELECT` query on a MySQL table, add the following parameter to your connection string:  

```
zeroDateTimeBehavior=convertToNull
```
For more information, see [Error 'Zero date value prohibited' while trying to select from MySQL table](https://github.com/awslabs/aws-athena-query-federation/issues/760) on GitHub.com.
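
The parameter is appended with `?` or `&` depending on whether the connection string already has a query portion. A small sketch (the helper name is hypothetical):

```python
def add_jdbc_param(connection_string: str, param: str) -> str:
    """Append a parameter to a JDBC-style connection string,
    using '?' for the first parameter and '&' thereafter."""
    separator = "&" if "?" in connection_string else "?"
    return f"{connection_string}{separator}{param}"
```

For example, `add_jdbc_param("mysql://jdbc:mysql://mysql1.host:3306/default", "zeroDateTimeBehavior=convertToNull")` appends the parameter with a leading `?`.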

#### Using a multiplexing handler
<a name="connectors-mysql-using-a-multiplexing-handler"></a>

You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | MySqlMuxCompositeHandler | 
| Metadata handler | MySqlMuxMetadataHandler | 
| Record handler | MySqlMuxRecordHandler | 

##### Multiplexing handler parameters
<a name="connectors-mysql-multiplexing-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| *catalog*\_connection\_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is mymysqlcatalog, then the environment variable name is mymysqlcatalog\_connection\_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:\$\{AWS\_LAMBDA\_FUNCTION\_NAME\}. | 

The following example properties are for a MySQL MUX Lambda function that supports two database instances: `mysql1` (the default) and `mysql2`.


****  

| Property | Value | 
| --- | --- | 
| default | mysql://jdbc:mysql://mysql1.host:3306/default?\$\{Test/RDS/MySql1\} | 
| mysql\_catalog1\_connection\_string | mysql://jdbc:mysql://mysql1.host:3306/default?\$\{Test/RDS/MySql1\} | 
| mysql\_catalog2\_connection\_string | mysql://jdbc:mysql://mysql2.host:3333/default?user=sample2&password=sample2 | 
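
Conceptually, the multiplexer resolves the connection string by catalog name, falling back to `default` when no catalog-specific variable exists. The following is a minimal sketch using a plain dictionary in place of Lambda environment variables; the function name is hypothetical, and the example values are similar to the table above:

```python
def resolve_connection_string(catalog: str, env: dict) -> str:
    """Return the connection string for a catalog, or the default."""
    return env.get(f"{catalog}_connection_string", env["default"])

# Example values, similar to the properties shown above.
env = {
    "default": "mysql://jdbc:mysql://mysql1.host:3306/default?${Test/RDS/MySql1}",
    "mysql_catalog2_connection_string":
        "mysql://jdbc:mysql://mysql2.host:3333/default?user=sample2&password=sample2",
}
```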

##### Providing credentials
<a name="connectors-mysql-providing-credentials"></a>

To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${Test/RDS/MySql1}`.

```
mysql://jdbc:mysql://mysql1.host:3306/default?...&${Test/RDS/MySql1}&...
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
mysql://jdbc:mysql://mysql1.host:3306/default?...&user=sample2&password=sample2&...
```

Currently, the MySQL connector recognizes the `user` and `password` JDBC properties.
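
The substitution described above can be sketched as follows. This is an illustration, not the connector's actual implementation; `get_secret` stands in for a Secrets Manager lookup:

```python
import json
import re

def inject_credentials(connection_string: str, get_secret) -> str:
    """Replace each ${SecretName} token with user/password values
    retrieved from the secret (here, via the get_secret callable)."""
    def replace(match):
        secret = json.loads(get_secret(match.group(1)))
        return f"user={secret['username']}&password={secret['password']}"
    return re.sub(r"\$\{([^}]+)\}", replace, connection_string)
```

With a secret stored as `{"username": "sample2", "password": "sample2"}`, the string `...default?${Test/RDS/MySql1}` becomes `...default?user=sample2&password=sample2`.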

#### Using a single connection handler
<a name="connectors-mysql-using-a-single-connection-handler"></a>

You can use the following single connection metadata and record handlers to connect to a single MySQL instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | MySqlCompositeHandler | 
| Metadata handler | MySqlMetadataHandler | 
| Record handler | MySqlRecordHandler | 

##### Single connection handler parameters
<a name="connectors-mysql-single-connection-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single MySQL instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | mysql://mysql1.host:3306/default?secret=Test/RDS/MySql1 | 

#### Spill parameters
<a name="connectors-mysql-spill-parameters"></a>

The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| spill\_bucket | Required. Spill bucket name. | 
| spill\_prefix | Required. Spill bucket key prefix. | 
| spill\_put\_request\_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, \{"x-amz-server-side-encryption" : "AES256"\}). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*. | 

## Data type support
<a name="connectors-mysql-data-type-support"></a>

The following table shows the corresponding data types for JDBC and Arrow.


****  

| JDBC | Arrow | 
| --- | --- | 
| Boolean | Bit | 
| Integer | Tiny | 
| Short | Smallint | 
| Integer | Int | 
| Long | Bigint | 
| float | Float4 | 
| Double | Float8 | 
| Date | DateDay | 
| Timestamp | DateMilli | 
| String | Varchar | 
| Bytes | Varbinary | 
| BigDecimal | Decimal | 
| ARRAY | List | 

## Partitions and splits
<a name="connectors-mysql-partitions-and-splits"></a>

Partitions are used to determine how to generate splits for the connector. Athena constructs a synthetic column of type `varchar` that represents the partitioning scheme for the table to help the connector generate splits. The connector does not modify the actual table definition.

## Performance
<a name="connectors-mysql-performance"></a>

MySQL supports native partitions. The Athena MySQL connector can retrieve data from these partitions in parallel. If you want to query very large datasets with uniform partition distribution, native partitioning is highly recommended.

The Athena MySQL connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses, simple predicates, and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time. 

### LIMIT clauses
<a name="connectors-mysql-performance-limit-clauses"></a>

A `LIMIT N` statement reduces the data scanned by the query. With `LIMIT N` pushdown, the connector returns only `N` rows to Athena.

### Predicates
<a name="connectors-mysql-performance-predicates"></a>

A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena MySQL connector can combine these expressions and push them directly to MySQL for enhanced functionality and to reduce the amount of data scanned.

The following Athena MySQL connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT\_EQUAL, LESS\_THAN, LESS\_THAN\_OR\_EQUAL, GREATER\_THAN, GREATER\_THAN\_OR\_EQUAL, IS\_DISTINCT\_FROM, NULL\_IF, IS\_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE\_PATTERN, IN

### Combined pushdown example
<a name="connectors-mysql-performance-pushdown-example"></a>

For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d))
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%') 
LIMIT 10;
```

For an article on using predicate pushdown to improve performance in federated queries, including MySQL, see [Improve federated queries with predicate pushdown in Amazon Athena](https://aws.amazon.com/blogs/big-data/improve-federated-queries-with-predicate-pushdown-in-amazon-athena/) in the *AWS Big Data Blog*.

## Passthrough queries
<a name="connectors-mysql-passthrough-queries"></a>

The MySQL connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with MySQL, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in MySQL. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information
<a name="connectors-mysql-license-information"></a>

By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-mysql/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-mysql/LICENSE.txt) file on GitHub.com.

## Additional resources
<a name="connectors-mysql-additional-resources"></a>

For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-mysql/pom.xml) file for the MySQL connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-mysql) on GitHub.com.

# Amazon Athena Neptune connector
<a name="connectors-neptune"></a>

Amazon Neptune is a fast, reliable, fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets. Neptune's purpose-built, high-performance graph database engine stores billions of relationships optimally and queries graphs with a latency of only milliseconds. For more information, see the [Neptune User Guide](https://docs.aws.amazon.com/neptune/latest/userguide/intro.html).

The Amazon Athena Neptune Connector enables Athena to communicate with your Neptune graph database instance, making your Neptune graph data accessible by SQL queries.

This connector does not use Glue Connections to centralize configuration properties in Glue. Connection configuration is done through Lambda.

If you have Lake Formation enabled in your account, the IAM role for your Athena federated Lambda connector that you deployed in the AWS Serverless Application Repository must have read access in Lake Formation to the AWS Glue Data Catalog.

## Prerequisites
<a name="connectors-neptune-prerequisites"></a>

Using the Neptune connector requires the following three steps.
+ Setting up a Neptune cluster
+ Setting up an AWS Glue Data Catalog
+ Deploying the connector to your AWS account. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md). For additional details specific to deploying the Neptune connector, see [Deploy the Amazon Athena Neptune Connector](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-neptune/docs/neptune-connector-setup) on GitHub.com.

## Limitations
<a name="connectors-neptune-limitations"></a>

Currently, the Neptune Connector has the following limitation.
+ Projecting columns, including the primary key (ID), is not supported. 

## Setting up a Neptune cluster
<a name="connectors-neptune-setting-up-a-neptune-cluster"></a>

If you don't have an existing Amazon Neptune cluster and property graph dataset in it that you would like to use, you must set one up.

Make sure you have an internet gateway and NAT gateway in the VPC that hosts your Neptune cluster. The private subnets that the Neptune connector Lambda function uses should have a route to the internet through this NAT gateway. The Neptune connector Lambda function uses the NAT gateway to communicate with AWS Glue.

For instructions on setting up a new Neptune cluster and loading it with a sample dataset, see [Sample Neptune Cluster Setup](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-neptune/docs/neptune-cluster-setup) on GitHub.com.

## Setting up an AWS Glue Data Catalog
<a name="connectors-neptune-setting-up-an-aws-glue-data-catalog"></a>

Unlike traditional relational data stores, Neptune graph DB nodes and edges do not use a set schema. Each entry can have different fields and data types. However, because the Neptune connector retrieves metadata from the AWS Glue Data Catalog, you must create an AWS Glue database that has tables with the required schema. After you create the AWS Glue database and tables, the connector can populate the list of tables available to query from Athena.

### Enabling case insensitive column matching
<a name="connectors-neptune-glue-case-insensitive-column-matching"></a>

To resolve column names from your Neptune table with the correct casing even when the column names are all lower cased in AWS Glue, you can configure the Neptune connector for case insensitive matching.

To enable this feature, set the Neptune connector Lambda function environment variable `enable_caseinsensitivematch` to `true`. 

### Specifying the AWS Glue glabel table parameter for cased table names
<a name="connectors-neptune-glue-glabel-parameter-for-table-names"></a>

Because AWS Glue supports only lowercase table names, when your Neptune table name includes uppercase characters, it is important to specify the `glabel` AWS Glue table parameter when you create the AWS Glue table for Neptune. 

In your AWS Glue table definition, include the `glabel` parameter and set its value to your table name with its original casing. This ensures that the correct casing is preserved when AWS Glue interacts with your Neptune table. The following example sets the value of `glabel` to the table name `Airport`.

```
glabel = Airport
```

![\[Setting the glabel AWS Glue table property to preserve table name casing for a Neptune table\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-neptune-1.png)
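
In code form, a hedged sketch of the AWS Glue `TableInput` (as passed to the Glue `CreateTable` API) might look like the following; the table details are placeholders:

```python
# Sketch of an AWS Glue TableInput that preserves the original
# Neptune table casing through the glabel table parameter.
table_input = {
    "Name": "airport",                    # AWS Glue requires lowercase
    "Parameters": {"glabel": "Airport"},  # original Neptune casing
}
```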


For more information on setting up an AWS Glue Data Catalog to work with Neptune, see [Set up AWS Glue Catalog](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-neptune/docs/aws-glue-sample-scripts) on GitHub.com.

## Performance
<a name="connectors-neptune-performance"></a>

The Athena Neptune connector performs predicate pushdown to decrease the data scanned by the query. However, predicates using the primary key result in query failure. `LIMIT` clauses reduce the amount of data scanned, but if you don't provide a predicate, you should expect `SELECT` queries with a `LIMIT` clause to scan at least 16 MB of data. The Neptune connector is resilient to throttling due to concurrency.

## Passthrough queries
<a name="connectors-neptune-passthrough-queries"></a>

The Neptune connector supports [passthrough queries](federated-query-passthrough.md). You can use this feature to run Gremlin queries on property graphs and to run SPARQL queries on RDF data.

To create passthrough queries with Neptune, use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            DATABASE => 'database_name',
            COLLECTION => 'collection_name',
            QUERY => 'query_string'
        ))
```

The following example Neptune passthrough query filters for airports with the code `ATL`. The doubled single quotes are for escaping.

```
SELECT * FROM TABLE(
        system.query(
            DATABASE => 'graph-database',
            COLLECTION => 'airport',
            QUERY => 'g.V().has(''airport'', ''code'', ''ATL'').valueMap()' 
        ))
```

## Additional resources
<a name="connectors-neptune-additional-resources"></a>

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-neptune) on GitHub.com.

# Amazon Athena OpenSearch connector
<a name="connectors-opensearch"></a>


The Amazon Athena OpenSearch connector enables Amazon Athena to communicate with your OpenSearch instances so that you can use SQL to query your OpenSearch data.

This connector can be registered with the AWS Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses AWS Glue connections to centralize configuration properties in Glue.

**Note**  
Due to a known issue, the OpenSearch connector cannot be used with a VPC.

If you have Lake Formation enabled in your account, the IAM role for your Athena federated Lambda connector that you deployed in the AWS Serverless Application Repository must have read access in Lake Formation to the AWS Glue Data Catalog.

## Prerequisites
<a name="connectors-opensearch-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Terms
<a name="connectors-opensearch-terms"></a>

The following terms relate to the OpenSearch connector.
+ **Domain** – A name that this connector associates with the endpoint of your OpenSearch instance. The domain is also used as the database name. For OpenSearch instances defined within the Amazon OpenSearch Service, the domain is auto-discoverable. For other instances, you must provide a mapping between the domain name and endpoint.
+ **Index** – A database table defined in your OpenSearch instance.
+ **Mapping** – If an index is a database table, then the mapping is its schema (that is, the definitions of its fields and attributes).

  This connector supports metadata retrieval both from the OpenSearch instance and from the AWS Glue Data Catalog. If the connector finds an AWS Glue database and table that match your OpenSearch domain and index names, the connector attempts to use them for schema definition. We recommend that you create your AWS Glue table so that it is a superset of all fields in your OpenSearch index.
+ **Document** – A record within a database table.
+ **Data stream** – Time based data that is composed of multiple backing indices. For more information, see [Data streams](https://opensearch.org/docs/latest/dashboards/im-dashboards/datastream/) in the OpenSearch documentation and [Getting started with data streams](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/data-streams.html#data-streams-example) in the *Amazon OpenSearch Service Developer Guide*.
**Note**  
Because data stream indices are internally created and managed by OpenSearch, the connector chooses the schema mapping from the first available index. For this reason, we strongly recommend setting up an AWS Glue table as a supplemental metadata source. For more information, see [Setting up databases and tables in AWS Glue](#connectors-opensearch-setting-up-databases-and-tables-in-aws-glue). 

## Parameters
<a name="connectors-opensearch-parameters"></a>

Use the parameters in this section to configure the OpenSearch connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.  
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html). Starting December 3, 2024, use the parameters below only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector.

### Glue connections (recommended)
<a name="opensearch-gc"></a>

We recommend that you configure an OpenSearch connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the OpenSearch connector Lambda function to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type OPENSEARCH
```

**Lambda environment properties**
+  **glue\_connection** – Specifies the name of the Glue connection associated with the federated connector. 

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The OpenSearch connector created using Glue connections does not support the use of a multiplexing handler.
The OpenSearch connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="opensearch-legacy"></a>
+ **spill\_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill\_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill\_put\_request\_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms\_key\_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS, you can specify a KMS key ID (for example, `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`).
+ **disable\_spill\_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **disable\_glue** – (Optional) If present and set to true, the connector does not attempt to retrieve supplemental metadata from AWS Glue.
+ **query\_timeout\_cluster** – The timeout period, in seconds, for cluster health queries used in the generation of parallel scans.
+ **query\_timeout\_search** – The timeout period, in seconds, for search queries used in the retrieval of documents from an index.
+ **auto\_discover\_endpoint** – Boolean. The default is `true`. When you use the Amazon OpenSearch Service and set this parameter to true, the connector can auto-discover your domains and endpoints by calling the appropriate describe or list API operations on the OpenSearch Service. For any other type of OpenSearch instance (for example, self-hosted), you must specify the associated domain endpoints in the `domain_mapping` variable. If `auto_discover_endpoint=true`, the connector uses AWS credentials to authenticate to the OpenSearch Service. Otherwise, the connector retrieves user name and password credentials from AWS Secrets Manager through the `domain_mapping` variable.
+ **domain\_mapping** – Used only when `auto_discover_endpoint` is set to false and defines the mapping between domain names and their associated endpoints. The `domain_mapping` variable can accommodate multiple OpenSearch endpoints in the following format:

  ```
  domain1=endpoint1,domain2=endpoint2,domain3=endpoint3,...       
  ```

  For the purpose of authenticating to an OpenSearch endpoint, the connector supports substitution strings injected using the format `${SecretName}` with user name and password retrieved from AWS Secrets Manager. The secret should be stored in the following JSON format:

  ```
  { "username": "your_username", "password": "your_password" }
  ```

  The connector automatically parses this JSON structure to retrieve the credentials.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.

  The following example uses the `opensearch-creds` secret.

  ```
  movies=https://${opensearch-creds}:search-movies-ne...qu---us-east-1---es.amazonaws.com     
  ```

  At runtime, `${opensearch-creds}` is rendered as the user name and password, as in the following example.

  ```
  movies=https://myusername@mypassword:search-movies-ne...qu---us-east-1---es.amazonaws.com
  ```

  In the `domain_mapping` parameter, each domain-endpoint pair can use a different secret. The secret itself must be specified in the format *user\_name*@*password*. Although the password may contain embedded `@` signs, the first `@` serves as the separator from *user\_name*.

  Note that this connector uses the comma (,) and equal sign (=) as separators for the domain-endpoint pairs. For this reason, do not use them anywhere inside the stored secret.
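The `domain_mapping` parsing and secret substitution rules above can be sketched in Python. This is an illustrative sketch only, not the connector's implementation; the helper name and the in-memory `secrets` dictionary (standing in for an AWS Secrets Manager lookup) are assumptions.

```python
import re

def parse_domain_mapping(mapping: str, secrets: dict) -> dict:
    """Split a domain_mapping string into {domain: endpoint} pairs,
    rendering ${SecretName} placeholders as username@password."""
    domains = {}
    for pair in mapping.split(","):            # comma separates domain-endpoint pairs
        domain, endpoint = pair.split("=", 1)  # first '=' separates domain from endpoint

        def render(match):
            secret = secrets[match.group(1)]   # in practice, a Secrets Manager call
            return f"{secret['username']}@{secret['password']}"

        domains[domain] = re.sub(r"\$\{([^}]+)\}", render, endpoint)
    return domains

mapping = "movies=https://${opensearch-creds}:search-movies.example.com"
secrets = {"opensearch-creds": {"username": "myusername", "password": "mypassword"}}
print(parse_domain_mapping(mapping, secrets))
# {'movies': 'https://myusername@mypassword:search-movies.example.com'}
```

Because commas and equal signs act as separators, the sketch also shows why they cannot appear inside the stored secret.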

## Setting up databases and tables in AWS Glue
<a name="connectors-opensearch-setting-up-databases-and-tables-in-aws-glue"></a>

The connector obtains metadata information by using AWS Glue or OpenSearch. You can set up an AWS Glue table as a supplemental metadata definition source. To enable this feature, define an AWS Glue database and table that match the domain and index of the source that you are supplementing. The connector can also take advantage of metadata definitions stored in the OpenSearch instance by retrieving the mapping for the specified index.

### Defining metadata for arrays in OpenSearch
<a name="connectors-opensearch-defining-metadata-for-arrays-in-opensearch"></a>

OpenSearch does not have a dedicated array data type. Any field can contain zero or more values so long as they are of the same data type. If you want to use OpenSearch as your metadata definition source, you must define a `_meta` property for all indices used with Athena for the fields that are to be considered a list or array. If you fail to complete this step, queries return only the first element in the list field. When you specify the `_meta` property, field names should be fully qualified for nested JSON structures (for example, `address.street`, where `street` is a nested field inside an `address` structure).

The following example defines `actor` and `genre` lists in the `movies` table.

```
PUT movies/_mapping 
{ 
  "_meta": { 
    "actor": "list", 
    "genre": "list" 
  } 
}
```

### Data types
<a name="connectors-opensearch-data-types"></a>

The OpenSearch connector can extract metadata definitions from either AWS Glue or the OpenSearch instance. The connector uses the mapping in the following table to convert the definitions to Apache Arrow data types, including the points noted in the section that follows.


****  

| OpenSearch | Apache Arrow | AWS Glue | 
| --- | --- | --- | 
| text, keyword, binary | VARCHAR | string | 
| long | BIGINT | bigint | 
| scaled\_float | BIGINT | SCALED\_FLOAT(...) | 
| integer | INT | int | 
| short | SMALLINT | smallint | 
| byte | TINYINT | tinyint | 
| double | FLOAT8 | double | 
| float, half\_float | FLOAT4 | float | 
| boolean | BIT | boolean | 
| date, date\_nanos | DATEMILLI | timestamp | 
| JSON structure | STRUCT | STRUCT | 
| \_meta (for information, see the section [Defining metadata for arrays in OpenSearch](#connectors-opensearch-defining-metadata-for-arrays-in-opensearch).) | LIST | ARRAY | 

#### Notes on data types
<a name="connectors-opensearch-data-type-considerations-and-limitations"></a>
+ Currently, the connector supports only the OpenSearch and AWS Glue data types listed in the preceding table.
+ A `scaled_float` is a floating-point number scaled by a fixed double scaling factor and represented as a `BIGINT` in Apache Arrow. For example, 0.756 with a scaling factor of 100 is rounded to 76.
+ To define a `scaled_float` in AWS Glue, you must select the `array` column type and declare the field using the format SCALED\_FLOAT(*scaling\_factor*).

  The following examples are valid:

  ```
  SCALED_FLOAT(10.51) 
  SCALED_FLOAT(100) 
  SCALED_FLOAT(100.0)
  ```

  The following examples are not valid:

  ```
  SCALED_FLOAT(10.) 
  SCALED_FLOAT(.5)
  ```
+ When converting from `date_nanos` to `DATEMILLI`, nanoseconds are rounded to the nearest millisecond. Valid values for `date` and `date_nanos` include, but are not limited to, the following formats:

  ```
  "2020-05-18T10:15:30.123456789" 
  "2020-05-15T06:50:01.123Z" 
  "2020-05-15T06:49:30.123-05:00" 
  1589525370001 (epoch milliseconds)
  ```
+ An OpenSearch `binary` is a string representation of a binary value encoded using `Base64` and is converted to a `VARCHAR`.
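The `scaled_float` conversion described above can be illustrated with a short sketch. This is an assumption about the arithmetic that matches the 0.756 example, not the connector's actual code:

```python
def scaled_float_to_bigint(value: float, scaling_factor: float) -> int:
    # A scaled_float is the value multiplied by its fixed scaling factor,
    # rounded to an integer, which Apache Arrow represents as a BIGINT.
    return round(value * scaling_factor)

print(scaled_float_to_bigint(0.756, 100))  # 0.756 with a scaling factor of 100 rounds to 76
```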

## Running SQL queries
<a name="connectors-opensearch-running-sql-queries"></a>

The following are examples of DDL queries that you can use with this connector. In the examples, *function\_name* corresponds to the name of your Lambda function, *domain* is the name of the domain that you want to query, and *index* is the name of your index.

```
SHOW DATABASES in `lambda:function_name`
```

```
SHOW TABLES in `lambda:function_name`.domain
```

```
DESCRIBE `lambda:function_name`.domain.index
```

## Performance
<a name="connectors-opensearch-performance"></a>

The Athena OpenSearch connector supports shard-based parallel scans. The connector uses cluster health information retrieved from the OpenSearch instance to generate multiple requests for a document search query. The requests are split for each shard and run concurrently.

The connector also pushes down predicates as part of its document search queries. The following example query and predicate show how the connector uses predicate pushdown.

**Query**

```
SELECT * FROM "lambda:elasticsearch".movies.movies 
WHERE year >= 1955 AND year <= 1962 OR year = 1996
```

**Predicate**

```
(_exists_:year) AND year:([1955 TO 1962] OR 1996)
```
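A sketch of how such a Lucene-style predicate string could be assembled from range and equality constraints follows. This is illustrative only; the helper name and its input shapes are assumptions, not the connector's internals:

```python
def build_predicate(field: str, ranges: list, values: list) -> str:
    """Combine range and equality constraints on one field into a
    Lucene-style query string of the form the connector produces."""
    clauses = [f"[{lo} TO {hi}]" for lo, hi in ranges]  # inclusive ranges
    clauses += [str(v) for v in values]                 # equality values
    disjunction = " OR ".join(clauses)
    # _exists_ guards against documents that lack the field entirely
    return f"(_exists_:{field}) AND {field}:({disjunction})"

print(build_predicate("year", ranges=[(1955, 1962)], values=[1996]))
# (_exists_:year) AND year:([1955 TO 1962] OR 1996)
```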

## Passthrough queries
<a name="connectors-opensearch-passthrough-queries"></a>

The OpenSearch connector supports [passthrough queries](federated-query-passthrough.md) and uses the Query DSL language. For more information about querying with Query DSL, see [Query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html) in the Elasticsearch documentation or [Query DSL](https://opensearch.org/docs/latest/query-dsl/) in the OpenSearch documentation.

To use passthrough queries with the OpenSearch connector, use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            schema => 'schema_name',
            index => 'index_name',
            query => "{query_string}"
        ))
```

The following OpenSearch example passthrough query filters for employees with active employment status in the `employee` index of the `default` schema.

```
SELECT * FROM TABLE(
        system.query(
            schema => 'default',
            index => 'employee',
            query => "{ ''bool'':{''filter'':{''term'':{''status'': ''active''}}}}"
        ))
```

## Additional resources
<a name="connectors-opensearch-additional-resources"></a>
+ For an article on using the Amazon Athena OpenSearch connector to query data in Amazon OpenSearch Service and Amazon S3 in a single query, see [Query data in Amazon OpenSearch Service using SQL from Amazon Athena](https://aws.amazon.com/blogs/big-data/query-data-in-amazon-opensearch-service-using-sql-from-amazon-athena/) in the *AWS Big Data Blog*.
+ For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-elasticsearch) on GitHub.com.

# Amazon Athena Oracle connector
<a name="connectors-oracle"></a>

The Amazon Athena connector for Oracle enables Amazon Athena to run SQL queries on data stored in Oracle databases running on premises, on Amazon EC2, or on Amazon RDS. You can also use the connector to query data on [Oracle Exadata](https://www.oracle.com/engineered-systems/exadata/).

This connector can be registered with the AWS Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue connections to centralize configuration properties in AWS Glue.

## Prerequisites
<a name="connectors-oracle-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations
<a name="connectors-oracle-limitations"></a>
+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ Only Oracle Database version 12.1.0.2 is supported.
+ If the Oracle connector does not use a Glue connection, the connector converts database, table, and column names to upper case. 

  If the Oracle connector uses a Glue connection, the connector does not convert names to upper case by default. To change this casing behavior, set the Lambda environment variable `casing_mode` to `upper` or `lower` as needed.

   An Oracle connector that uses a Glue connection does not support the use of a multiplexing handler.
+ When you use the Oracle `NUMBER` type without precision and scale defined, Athena treats it as `BIGINT`. To get the required decimal places in Athena, specify `default_scale=<number of decimal places>` in your Lambda environment variables.

## Terms
<a name="connectors-oracle-terms"></a>

The following terms relate to the Oracle connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters
<a name="connectors-oracle-parameters"></a>

Use the parameters in this section to configure the Oracle connector.

### Glue connections (recommended)
<a name="oracle-gc"></a>

We recommend that you configure an Oracle connector by using a Glue connection object. To do this, set the `glue_connection` environment variable of the Oracle connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type ORACLE
```

**Lambda environment properties**
+ **glue\_connection** – Specifies the name of the Glue connection associated with the federated connector. 
+ **is\_fips\_enabled** – (Optional) Set to true when FIPS mode is enabled. The default is false.
+ **casing\_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **lower** – Lower case all given schema and table names. This is the default for connectors that have an associated Glue connection.
  + **upper** – Upper case all given schema and table names. This is the default for connectors that do not have an associated Glue connection.
  + **case\_insensitive\_search** – Perform case insensitive searches against schema and table names in Oracle. Use this value if your query contains schema or table names that do not match the default casing for your connector.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Oracle connector created using Glue connections does not support the use of a multiplexing handler.
The Oracle connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="oracle-legacy"></a>

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.

The parameter names and definitions listed below are for Athena data source connectors created without an associated Glue connection. Use the following parameters only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector or when the `glue_connection` environment property is not specified.

**Lambda environment properties**
+ **default** – The JDBC connection string to use to connect to the Oracle database instance. For example, `oracle://${jdbc_connection_string}`.
+ **catalog\_connection\_string** – Used by the multiplexing handler (not supported when using a Glue connection). A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is myoraclecatalog, then the environment variable name is myoraclecatalog\_connection\_string.
+ **spill\_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill\_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill\_put\_request\_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms\_key\_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable\_spill\_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **is\_fips\_enabled** – (Optional) Set to true when FIPS mode is enabled. The default is false.
+ **casing\_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **lower** – Lower case all given schema and table names. This is the default for connectors that have an associated Glue connection.
  + **upper** – Upper case all given schema and table names. This is the default for connectors that do not have an associated Glue connection.
  + **case\_insensitive\_search** – Perform case insensitive searches against schema and table names in Oracle. Use this value if your query contains schema or table names that do not match the default casing for your connector.
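The three `casing_mode` values can be summarized in a small sketch. The helper is hypothetical, not the connector's code; `case_insensitive_search` is shown as a pass-through because the case-insensitive match happens inside Oracle rather than by rewriting the name:

```python
def apply_casing_mode(name: str, mode: str) -> str:
    """Normalize a schema or table name the way casing_mode describes."""
    if mode == "lower":
        return name.lower()  # default with an associated Glue connection
    if mode == "upper":
        return name.upper()  # default without a Glue connection
    if mode == "case_insensitive_search":
        return name          # passed through; matched case-insensitively in Oracle
    raise ValueError(f"unknown casing_mode: {mode}")

print(apply_casing_mode("MyTable", "upper"))  # MYTABLE
```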

#### Connection string
<a name="connectors-oracle-connection-string"></a>

Use a JDBC connection string in the following format to connect to a database instance.

```
oracle://${jdbc_connection_string}
```

**Note**  
If your password contains special characters (for example, `some.password`), enclose your password in double quotes when you pass it to the connection string (for example, `"some.password"`). Failure to do so can result in an Invalid Oracle URL specified error.

#### Using a single connection handler
<a name="connectors-oracle-using-a-single-connection-handler"></a>

You can use the following single connection metadata and record handlers to connect to a single Oracle instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | OracleCompositeHandler | 
| Metadata handler | OracleMetadataHandler | 
| Record handler | OracleRecordHandler | 

##### Single connection handler parameters
<a name="connectors-oracle-single-connection-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 
| IsFIPSEnabled | Optional. Set to true when FIPS mode is enabled. The default is false.  | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The connector supports SSL based connections to Amazon RDS instances. Support is limited to the Transport Layer Security (TLS) protocol and to authentication of the server by the client. Mutual authentication is not supported in Amazon RDS. The second row in the table below shows the syntax for using SSL.

The following example property is for a single Oracle instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | oracle://jdbc:oracle:thin:\$\{Test/RDS/Oracle\}@//hostname:port/servicename | 
|  | oracle://jdbc:oracle:thin:\$\{Test/RDS/Oracle\}@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCPS) (HOST=<HOST\_NAME>)(PORT=))(CONNECT\_DATA=(SID=))(SECURITY=(SSL\_SERVER\_CERT\_DN=))) | 

#### Providing credentials
<a name="connectors-oracle-providing-credentials"></a>

To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Note**  
If your password contains special characters (for example, `some.password`), enclose your password in double quotes when you store it in Secrets Manager (for example, `"some.password"`). Failure to do so can result in an Invalid Oracle URL specified error.

**Example connection string with secret name**  
The following string has the secret name `${Test/RDS/Oracle}`.

```
oracle://jdbc:oracle:thin:${Test/RDS/Oracle}@//hostname:port/servicename 
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
oracle://jdbc:oracle:thin:username/password@//hostname:port/servicename
```

Currently, the Oracle connector recognizes the `UID` and `PWD` JDBC properties.
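The secret-name substitution described above can be sketched as follows. The helper and the in-memory `secrets` dictionary (standing in for an AWS Secrets Manager lookup) are illustrative assumptions, not the connector's implementation:

```python
import re

def resolve_connection_string(conn_str: str, secrets: dict) -> str:
    """Replace ${SecretName} in a JDBC connection string with the
    username/password pair retrieved from the named secret."""
    def render(match):
        secret = secrets[match.group(1)]  # in practice, a Secrets Manager call
        return f"{secret['username']}/{secret['password']}"
    return re.sub(r"\$\{([^}]+)\}", render, conn_str)

conn = "oracle://jdbc:oracle:thin:${Test/RDS/Oracle}@//hostname:1521/servicename"
secrets = {"Test/RDS/Oracle": {"username": "myuser", "password": "mypassword"}}
print(resolve_connection_string(conn, secrets))
# oracle://jdbc:oracle:thin:myuser/mypassword@//hostname:1521/servicename
```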

#### Using a multiplexing handler
<a name="connectors-oracle-using-a-multiplexing-handler"></a>

You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | OracleMuxCompositeHandler | 
| Metadata handler | OracleMuxMetadataHandler | 
| Record handler | OracleMuxRecordHandler | 

##### Multiplexing handler parameters
<a name="connectors-oracle-multiplexing-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| catalog\_connection\_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is myoraclecatalog, then the environment variable name is myoraclecatalog\_connection\_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:\$\{AWS\_LAMBDA\_FUNCTION\_NAME\}. | 

The following example properties are for an Oracle MUX Lambda function that supports two database instances: `oracle1` (the default), and `oracle2`.


****  

| Property | Value | 
| --- | --- | 
| default | oracle://jdbc:oracle:thin:\$\{Test/RDS/Oracle1\}@//oracle1.hostname:port/servicename | 
| oracle\_catalog1\_connection\_string | oracle://jdbc:oracle:thin:\$\{Test/RDS/Oracle1\}@//oracle1.hostname:port/servicename | 
| oracle\_catalog2\_connection\_string | oracle://jdbc:oracle:thin:\$\{Test/RDS/Oracle2\}@//oracle2.hostname:port/servicename | 

## Data type support
<a name="connectors-oracle-data-type-support"></a>

The following table shows the corresponding data types for JDBC, Oracle, and Arrow.


****  

| JDBC | Oracle | Arrow | 
| --- | --- | --- | 
| Boolean | boolean | Bit | 
| Integer | N/A | Tiny | 
| Short | smallint | Smallint | 
| Integer | integer | Int | 
| Long | bigint | Bigint | 
| float | float4 | Float4 | 
| Double | float8 | Float8 | 
| Date | date | DateDay | 
| Timestamp | timestamp | DateMilli | 
| String | text | Varchar | 
| Bytes | bytes | Varbinary | 
| BigDecimal | numeric(p,s) | Decimal | 
| ARRAY | N/A (see note) | List | 

## Partitions and splits
<a name="connectors-oracle-partitions-and-splits"></a>

Partitions are used to determine how to generate splits for the connector. Athena constructs a synthetic column of type `varchar` that represents the partitioning scheme for the table to help the connector generate splits. The connector does not modify the actual table definition.

## Performance
<a name="connectors-oracle-performance"></a>

Oracle supports native partitions. The Athena Oracle connector can retrieve data from these partitions in parallel. If you want to query very large datasets with uniform partition distribution, native partitioning is highly recommended. Selecting a subset of columns significantly speeds up query runtime and reduces data scanned. The Oracle connector is resilient to throttling due to concurrency. However, query runtimes tend to be long.

The Athena Oracle connector performs predicate pushdown to decrease the data scanned by the query. Simple predicates and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query run time. 

### Predicates
<a name="connectors-oracle-performance-predicates"></a>

A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Oracle connector can combine these expressions and push them directly to Oracle for enhanced functionality and to reduce the amount of data scanned.

The following Athena Oracle connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT\_EQUAL, LESS\_THAN, LESS\_THAN\_OR\_EQUAL, GREATER\_THAN, GREATER\_THAN\_OR\_EQUAL, IS\_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, NEGATE
+ **Other:** LIKE\_PATTERN, IN

### Combined pushdown example
<a name="connectors-oracle-performance-pushdown-example"></a>

For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d)) 
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%');
```

## Passthrough queries
<a name="connectors-oracle-passthrough-queries"></a>

The Oracle connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Oracle, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Oracle. The query selects all columns in the `customer` table.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer'
        ))
```

## License information
<a name="connectors-oracle-license-information"></a>

By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-oracle/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-oracle/LICENSE.txt) file on GitHub.com.

## Additional resources
<a name="connectors-oracle-additional-resources"></a>

For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-oracle/pom.xml) file for the Oracle connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-oracle) on GitHub.com.

# Amazon Athena PostgreSQL connector
<a name="connectors-postgresql"></a>

The Amazon Athena PostgreSQL connector enables Athena to access your PostgreSQL databases.

This connector can be registered with the AWS Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue connections to centralize configuration properties in AWS Glue.

## Prerequisites
<a name="connectors-postgres-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations
<a name="connectors-postgresql-limitations"></a>
+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ Like PostgreSQL, Athena treats trailing spaces in PostgreSQL `CHAR` values as semantically insignificant for length and comparison purposes. This applies only to the `CHAR` type; Athena treats trailing spaces as significant for the `VARCHAR` type.
+ When you use the [citext](https://www.postgresql.org/docs/current/citext.html) case-insensitive character string data type, PostgreSQL uses a case insensitive data comparison that differs from Athena's. This difference creates a data discrepancy during SQL `JOIN` operations. To work around this issue, use the PostgreSQL connector passthrough query feature. For more information, see the [passthrough queries](#connectors-postgres-passthrough-queries) section later in this document. 

## Terms
<a name="connectors-postgresql-terms"></a>

The following terms relate to the PostgreSQL connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters
<a name="connectors-postgresql-parameters"></a>

Use the parameters in this section to configure the PostgreSQL connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.

### Glue connections (recommended)
<a name="connectors-postgresql-gc"></a>

We recommend that you configure a PostgreSQL connector by using an AWS Glue connection object.

To do this, set the `glue_connection` environment variable of the PostgreSQL connector Lambda to the name of the Glue connection to use.

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type POSTGRESQL
```

**Lambda environment properties**

**glue_connection** – Specifies the name of the Glue connection associated with the federated connector.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The PostgreSQL connector created using Glue connections does not support the use of a multiplexing handler.
The PostgreSQL connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="connectors-postgresql-connection-legacy"></a>

The parameter names and definitions listed below are for Athena data source connectors created without an associated Glue connection. Use the following parameters only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector or when the `glue_connection` environment property is not specified.

#### Connection string
<a name="connectors-postgresql-connection-string"></a>

Use a JDBC connection string in the following format to connect to a database instance.

```
postgres://${jdbc_connection_string}
```
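
As an illustrative sketch, the connection string can be assembled from its parts. The following helper is not part of the connector; the host, database, and secret names are placeholders, and the `${secret-name}` placeholder form follows the Secrets Manager examples later in this section.

```python
def postgres_connection_string(host, port, database, secret_name=None):
    """Build a PostgreSQL connector JDBC connection string.

    When secret_name is given, embed an AWS Secrets Manager
    placeholder that the connector resolves at query time.
    """
    base = f"postgres://jdbc:postgresql://{host}:{port}/{database}"
    if secret_name:
        return f"{base}?${{{secret_name}}}"
    return base

# Placeholder values for illustration only
print(postgres_connection_string("postgres1.host", 5432, "default",
                                 secret_name="Test/RDS/PostGres1"))
# postgres://jdbc:postgresql://postgres1.host:5432/default?${Test/RDS/PostGres1}
```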

#### Using a multiplexing handler
<a name="connectors-postgresql-using-a-multiplexing-handler"></a>

You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | PostGreSqlMuxCompositeHandler | 
| Metadata handler | PostGreSqlMuxMetadataHandler | 
| Record handler | PostGreSqlMuxRecordHandler | 

##### Multiplexing handler parameters
<a name="connectors-postgresql-multiplexing-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| *catalog*_connection_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is mypostgrescatalog, then the environment variable name is mypostgrescatalog_connection_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS_LAMBDA_FUNCTION_NAME}. | 

The following example properties are for a PostGreSql MUX Lambda function that supports two database instances: `postgres1` (the default), and `postgres2`.


****  

| Property | Value | 
| --- | --- | 
| default | postgres://jdbc:postgresql://postgres1.host:5432/default?${Test/RDS/PostGres1} | 
| postgres_catalog1_connection_string | postgres://jdbc:postgresql://postgres1.host:5432/default?${Test/RDS/PostGres1} | 
| postgres_catalog2_connection_string | postgres://jdbc:postgresql://postgres2.host:5432/default?user=sample&password=sample | 

##### Providing credentials
<a name="connectors-postgresql-providing-credentials"></a>

To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${Test/RDS/PostGres1}`.

```
postgres://jdbc:postgresql://postgres1.host:5432/default?...&${Test/RDS/PostGres1}&...
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
postgres://jdbc:postgresql://postgres1.host:5432/default?...&user=sample2&password=sample2&...
```

Currently, the PostgreSQL connector recognizes the `user` and `password` JDBC properties.
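
The substitution described above can be modeled in a few lines of Python. This is an illustrative sketch of the behavior, not the connector's actual implementation; the secret name and credential values are placeholders.

```python
import json
import re

def resolve_secret(connection_string, get_secret):
    """Replace each ${secret-name} placeholder with user and password
    properties taken from the secret's JSON payload."""
    def substitute(match):
        secret = json.loads(get_secret(match.group(1)))
        return f"user={secret['username']}&password={secret['password']}"
    return re.sub(r"\$\{([^}]+)\}", substitute, connection_string)

# Stand-in for a Secrets Manager lookup (placeholder values)
fake_store = {"Test/RDS/PostGres1":
              '{"username": "sample2", "password": "sample2"}'}
resolved = resolve_secret(
    "postgres://jdbc:postgresql://postgres1.host:5432/default?${Test/RDS/PostGres1}",
    fake_store.get)
print(resolved)
# postgres://jdbc:postgresql://postgres1.host:5432/default?user=sample2&password=sample2
```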

##### Enabling SSL
<a name="connectors-postgresql-ssl"></a>

To support SSL in your PostgreSQL connection, append the following to your connection string:

```
&sslmode=verify-ca&sslfactory=org.postgresql.ssl.DefaultJavaSSLFactory
```

**Example**  
The following example connection string does not use SSL.

```
postgres://jdbc:postgresql://example-asdf-aurora-postgres-endpoint:5432/asdf?user=someuser&password=somepassword
```

To enable SSL, modify the string as follows.

```
postgres://jdbc:postgresql://example-asdf-aurora-postgres-endpoint:5432/asdf?user=someuser&password=somepassword&sslmode=verify-ca&sslfactory=org.postgresql.ssl.DefaultJavaSSLFactory
```

#### Using a single connection handler
<a name="connectors-postgresql-using-a-single-connection-handler"></a>

You can use the following single connection metadata and record handlers to connect to a single PostgreSQL instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | PostGreSqlCompositeHandler | 
| Metadata handler | PostGreSqlMetadataHandler | 
| Record handler | PostGreSqlRecordHandler | 

##### Single connection handler parameters
<a name="connectors-postgresql-single-connection-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single PostgreSQL instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | postgres://jdbc:postgresql://postgres1.host:5432/default?secret=${Test/RDS/PostgreSQL1} | 

#### Spill parameters
<a name="connectors-postgresql-spill-parameters"></a>

The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| spill_bucket | Required. Spill bucket name. | 
| spill_prefix | Required. Spill bucket key prefix. | 
| spill_put_request_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, {"x-amz-server-side-encryption" : "AES256"}). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*. | 

## Data type support
<a name="connectors-postgresql-data-type-support"></a>

The following table shows the corresponding data types for JDBC, PostGreSQL, and Arrow.


****  

| JDBC | PostGreSQL | Arrow | 
| --- | --- | --- | 
| Boolean | Boolean | Bit | 
| Integer | N/A | Tiny | 
| Short | smallint | Smallint | 
| Integer | integer | Int | 
| Long | bigint | Bigint | 
| float | float4 | Float4 | 
| Double | float8 | Float8 | 
| Date | date | DateDay | 
| Timestamp | timestamp | DateMilli | 
| String | text | Varchar | 
| Bytes | bytes | Varbinary | 
| BigDecimal | numeric(p,s) | Decimal | 
| ARRAY | N/A (see note) | List | 

**Note**  
The `ARRAY` type is supported for the PostgreSQL connector with the following constraints: multidimensional arrays (`<data_type>[][]` or nested arrays) are not supported. Columns with unsupported `ARRAY` data types are converted to an array of string elements (`array<varchar>`).

## Partitions and splits
<a name="connectors-postgresql-partitions-and-splits"></a>

Partitions are used to determine how to generate splits for the connector. Athena constructs a synthetic column of type `varchar` that represents the partitioning scheme for the table to help the connector generate splits. The connector does not modify the actual table definition.

## Performance
<a name="connectors-postgresql-performance"></a>

PostgreSQL supports native partitions. The Athena PostgreSQL connector can retrieve data from these partitions in parallel. If you want to query very large datasets with uniform partition distribution, native partitioning is highly recommended.

The Athena PostgreSQL connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses, simple predicates, and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time. However, selecting a subset of columns sometimes results in a longer query execution runtime.

### LIMIT clauses
<a name="connectors-postgres-performance-limit-clauses"></a>

A `LIMIT N` statement reduces the data scanned by the query. With `LIMIT N` pushdown, the connector returns only `N` rows to Athena.

### Predicates
<a name="connectors-postgres-performance-predicates"></a>

A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena PostgreSQL connector can combine these expressions and push them directly to PostgreSQL for enhanced functionality and to reduce the amount of data scanned.

The following Athena PostgreSQL connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, IS_DISTINCT_FROM, NULL_IF, IS_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE_PATTERN, IN

### Combined pushdown example
<a name="connectors-postgres-performance-pushdown-example"></a>

For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d))
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%') 
LIMIT 10;
```

## Passthrough queries
<a name="connectors-postgres-passthrough-queries"></a>

The PostgreSQL connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with PostgreSQL, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in PostgreSQL. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## Additional resources
<a name="connectors-postgresql-additional-resources"></a>

For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-postgresql/pom.xml) file for the PostgreSQL connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-postgresql) on GitHub.com.

# Amazon Athena Redis OSS connector
<a name="connectors-redis"></a>

The Amazon Athena Redis OSS connector enables Amazon Athena to communicate with your Redis OSS instances so that you can query your Redis OSS data with SQL. You can use the AWS Glue Data Catalog to map your Redis OSS key-value pairs into virtual tables.

Unlike traditional relational data stores, Redis OSS does not have the concept of a table or a column. Instead, Redis OSS offers key-value access patterns where the key is essentially a `string` and the value is a `string`, `z-set`, or `hmap`.

You can use the AWS Glue Data Catalog to create schema and configure virtual tables. Special table properties tell the Athena Redis OSS connector how to map your Redis OSS keys and values into a table. For more information, see [Setting up databases and tables in AWS Glue](#connectors-redis-setting-up-databases-and-tables-in-glue) later in this document.

This connector does not use Glue Connections to centralize configuration properties in Glue. Connection configuration is done through Lambda.

If you have Lake Formation enabled in your account, the IAM role for your Athena federated Lambda connector that you deployed in the AWS Serverless Application Repository must have read access in Lake Formation to the AWS Glue Data Catalog.

The Amazon Athena Redis OSS connector supports Amazon MemoryDB and Amazon ElastiCache (Redis OSS).

## Prerequisites
<a name="connectors-redis-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).
+ Set up a VPC and a security group before you use this connector. For more information, see [Create a VPC for a data source connector or AWS Glue connection](athena-connectors-vpc-creation.md).

## Parameters
<a name="connectors-redis-parameters"></a>

Use the parameters in this section to configure the Redis connector.
+ **spill_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill_put_request_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms_key_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable_spill_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **glue_catalog** – (Optional) Use this option to specify a [cross-account AWS Glue catalog](data-sources-glue-cross-account.md). By default, the connector attempts to get metadata from its own AWS Glue account.
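
The parameters above become Lambda environment variables. As an illustrative sketch (the bucket name is a placeholder, not a value from this guide), the following builds the environment map, JSON-encoding the spill request headers as the connector expects:

```python
import json

# Placeholder values for illustration; substitute your own bucket and prefix.
spill_headers = {"x-amz-server-side-encryption": "AES256"}

redis_connector_environment = {
    "spill_bucket": "example-athena-spill-bucket",
    "spill_prefix": "athena-federation-spill",
    # This parameter takes a JSON-encoded map, not a native dict.
    "spill_put_request_headers": json.dumps(spill_headers),
}

print(redis_connector_environment["spill_put_request_headers"])
# {"x-amz-server-side-encryption": "AES256"}
```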

## Setting up databases and tables in AWS Glue
<a name="connectors-redis-setting-up-databases-and-tables-in-glue"></a>

To enable an AWS Glue table for use with Redis OSS, you can set the following table properties on the table: `redis-endpoint`, `redis-value-type`, and either `redis-keys-zset` or `redis-key-prefix`.

In addition, any AWS Glue database that contains Redis OSS tables must have a `redis-db-flag` in the URI property of the database. To set the `redis-db-flag` URI property, use the AWS Glue console to edit the database.

The following list describes the table properties.
+ **redis-endpoint** – (Required) The *hostname*`:`*port*`:`*password* of the Redis OSS server that contains data for this table (for example, `athena-federation-demo.cache.amazonaws.com:6379`). Alternatively, you can store the endpoint, or part of the endpoint, in AWS Secrets Manager by using `${Secret_Name}` as the table property value.

**Note**  
To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.
+ **redis-keys-zset** – (Required if `redis-key-prefix` is not used) A comma-separated list of keys whose value is a [zset](https://redis.com/ebook/part-2-core-concepts/chapter-3-commands-in-redis/3-5-sorted-sets/) (for example, `active-orders,pending-orders`). Each of the values in the zset is treated as a key that is part of the table. Either the `redis-keys-zset` property or the `redis-key-prefix` property must be set.
+ **redis-key-prefix** – (Required if `redis-keys-zset` is not used) A comma separated list of key prefixes to scan for values in the table (for example, `accounts-*,acct-`). Either the `redis-key-prefix` property or the `redis-keys-zset` property must be set.
+ **redis-value-type** – (Required) Defines how the values for the keys defined by either `redis-key-prefix` or `redis-keys-zset` map to your table. Valid values are `literal`, `zset`, and `hash`. A literal maps to a single column. A zset also maps to a single column, but each key can store many rows. A hash enables each key to be a row with multiple columns.
+ **redis-ssl-flag** – (Optional) When `True`, creates a Redis connection that uses SSL/TLS. The default is `False`.
+ **redis-cluster-flag** – (Optional) When `True`, enables support for clustered Redis instances. The default is `False`.
+ **redis-db-number** – (Optional) Applies only to standalone, non-clustered instances. Set this number (for example, 1, 2, or 3) to read from a non-default Redis database. The default is Redis logical database 0. This number does not refer to a database in Athena or AWS Glue, but to a Redis logical database. For more information, see [SELECT index](https://redis.io/commands/select) in the Redis documentation.
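
Pulling the required properties together, the following sketch builds the table parameters map that you could attach to an AWS Glue table (for example, through the AWS Glue `CreateTable` API). The endpoint and zset names are the placeholder examples from the list above.

```python
# Placeholder endpoint and zset names from the property descriptions above.
redis_table_parameters = {
    "redis-endpoint": "athena-federation-demo.cache.amazonaws.com:6379",
    "redis-value-type": "zset",  # literal, zset, or hash
    "redis-keys-zset": "active-orders,pending-orders",
    "redis-ssl-flag": "True",
}

# Exactly one of redis-keys-zset and redis-key-prefix must be set.
has_zset = "redis-keys-zset" in redis_table_parameters
has_prefix = "redis-key-prefix" in redis_table_parameters
assert has_zset != has_prefix
```

Remember that the AWS Glue database containing this table must also carry the `redis-db-flag` in its URI property.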

## Data types
<a name="connectors-redis-data-types"></a>

The Redis OSS connector supports the following data types. Redis OSS streams are not supported.
+ [String](https://redis.com/ebook/part-1-getting-started/chapter-1-getting-to-know-redis/1-2-what-redis-data-structures-look-like/1-2-1-strings-in-redis/)
+ [Hash](https://redis.com/ebook/part-1-getting-started/chapter-1-getting-to-know-redis/1-2-what-redis-data-structures-look-like/1-2-4-hashes-in-redis/)
+ Sorted Set ([ZSet](https://redis.com/ebook/part-2-core-concepts/chapter-3-commands-in-redis/3-5-sorted-sets/))

All Redis OSS values are retrieved as the `string` data type. Then they are converted to one of the following Apache Arrow data types based on how your tables are defined in the AWS Glue Data Catalog.


****  

| AWS Glue data type | Apache Arrow data type | 
| --- | --- | 
| int | INT | 
| string | VARCHAR | 
| bigint | BIGINT | 
| double | FLOAT8 | 
| float | FLOAT4 | 
| smallint | SMALLINT | 
| tinyint | TINYINT | 
| boolean | BIT | 
| binary | VARBINARY | 

## Required permissions
<a name="connectors-redis-required-permissions"></a>

For full details on the IAM policies that this connector requires, review the `Policies` section of the [athena-redis.yaml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-redis/athena-redis.yaml) file. The following list summarizes the required permissions.
+ **Amazon S3 write access** – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
+ **Athena GetQueryExecution** – The connector uses this permission to fast-fail when the upstream Athena query has terminated.
+ **AWS Glue Data Catalog** – The Redis connector requires read only access to the AWS Glue Data Catalog to obtain schema information.
+ **CloudWatch Logs** – The connector requires access to CloudWatch Logs for storing logs.
+ **AWS Secrets Manager read access** – If you choose to store Redis endpoint details in Secrets Manager, you must grant the connector access to those secrets.
+ **VPC access** – The connector requires the ability to attach and detach interfaces to your VPC so that it can connect to it and communicate with your Redis instances.

## Performance
<a name="connectors-redis-performance"></a>

The Athena Redis OSS connector attempts to parallelize queries against your Redis OSS instance according to the type of table that you have defined (for example, zset keys or prefix keys).

The Athena Redis OSS connector performs predicate pushdown to decrease the data scanned by the query. However, queries that contain a predicate against the primary key can fail with a timeout. `LIMIT` clauses reduce the amount of data scanned, but if you don't provide a predicate, you should expect `SELECT` queries with a `LIMIT` clause to scan at least 16 MB of data. The Redis OSS connector is resilient to throttling due to concurrency.

## Passthrough queries
<a name="connectors-redis-passthrough-queries"></a>

The Redis connector supports [passthrough queries](federated-query-passthrough.md). You can use this feature to run queries that use Lua script on Redis databases. 

To create passthrough queries with Redis, use the following syntax:

```
SELECT * FROM TABLE(
        system.script(
            script => 'return redis.[call|pcall](query_script)',
            keys => '[key_pattern]',
            argv => '[script_arguments]'
))
```

The following example runs a Lua script to get the value at key `l:a`.

```
SELECT * FROM TABLE(
        system.script(
            script => 'return redis.call("GET", KEYS[1])',
            keys => '[l:a]',
            argv => '[]'
))
```

## License information
<a name="connectors-redis-license-information"></a>

The Amazon Athena Redis connector project is licensed under the [Apache-2.0 License](https://www.apache.org/licenses/LICENSE-2.0.html).

## Additional resources
<a name="connectors-redis-additional-resources"></a>

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-redis) on GitHub.com.

# Amazon Athena Redshift connector
<a name="connectors-redshift"></a>

The Amazon Athena Redshift connector enables Amazon Athena to access your Amazon Redshift and Amazon Redshift Serverless databases, including Redshift Serverless views. You can connect to either service using the JDBC connection string configuration settings described on this page.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

## Prerequisites
<a name="connectors-redshift-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations
<a name="connectors-redshift-limitations"></a>
+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ Because Redshift does not support external partitions, all data specified by a query is retrieved every time.
+ Like Redshift, Athena treats trailing spaces in Redshift `CHAR` values as semantically insignificant for length and comparison purposes. This applies only to the `CHAR` type; Athena treats trailing spaces as significant for the `VARCHAR` type.

## Terms
<a name="connectors-redshift-terms"></a>

The following terms relate to the Redshift connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters
<a name="connectors-redshift-parameters"></a>

Use the parameters in this section to configure the Redshift connector.

### Glue connections (recommended)
<a name="redshift-gc"></a>

We recommend that you configure a Redshift connector by using an AWS Glue connection object. To do this, set the `glue_connection` environment variable of the Amazon Redshift connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type REDSHIFT
```

**Lambda environment properties**

**glue_connection** – Specifies the name of the Glue connection associated with the federated connector.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Redshift connector created using Glue connections does not support the use of a multiplexing handler.
The Redshift connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="redshift-legacy"></a>

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.

The parameter names and definitions listed below are for Athena data source connectors created without an associated Glue connection. Use the following parameters only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector or when the `glue_connection` environment property is not specified.

**Lambda environment properties**
+ **spill_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill_put_request_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms_key_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable_spill_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **disable_glue** – (Optional) If present and set to true, the connector does not attempt to retrieve supplemental metadata from AWS Glue.
+ **glue_catalog** – (Optional) Use this option to specify a [cross-account AWS Glue catalog](data-sources-glue-cross-account.md). By default, the connector attempts to get metadata from its own AWS Glue account.

#### Connection string
<a name="connectors-redshift-connection-string"></a>

Use a JDBC connection string in the following format to connect to a database instance.

```
redshift://${jdbc_connection_string}
```
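
As a sketch, the string can be assembled from its parts. This helper is not part of the connector; the host, database, and secret names are placeholders.

```python
def redshift_connection_string(host, port, database, secret_name):
    """Build a Redshift connector JDBC connection string with an
    AWS Secrets Manager placeholder for credentials."""
    return (f"redshift://jdbc:redshift://{host}:{port}/{database}"
            f"?${{{secret_name}}}")

# Placeholder host and secret name for illustration
print(redshift_connection_string("redshift1.host", 5439, "dev",
                                 "Test/RDS/Redshift1"))
# redshift://jdbc:redshift://redshift1.host:5439/dev?${Test/RDS/Redshift1}
```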

#### Using a multiplexing handler
<a name="connectors-redshift-using-a-multiplexing-handler"></a>

You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | RedshiftMuxCompositeHandler | 
| Metadata handler | RedshiftMuxMetadataHandler | 
| Record handler | RedshiftMuxRecordHandler | 

##### Multiplexing handler parameters
<a name="connectors-redshift-multiplexing-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| $catalog\_connection\_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is myredshiftcatalog, then the environment variable name is myredshiftcatalog\_connection\_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS\_LAMBDA\_FUNCTION\_NAME}. | 

The following example properties are for a Redshift MUX Lambda function that supports two database instances: `redshift1` (the default), and `redshift2`.


****  

| Property | Value | 
| --- | --- | 
| default | redshift://jdbc:redshift://redshift1.host:5439/dev?user=sample2&password=sample2 | 
| redshift\_catalog1\_connection\_string | redshift://jdbc:redshift://redshift1.host:3306/default?${Test/RDS/Redshift1} | 
| redshift\_catalog2\_connection\_string | redshift://jdbc:redshift://redshift2.host:3333/default?user=sample2&password=sample2 | 

##### Providing credentials
<a name="connectors-redshift-providing-credentials"></a>

To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```
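
  For example, you can create such a secret with the AWS CLI. The secret name and credential values below are placeholders; substitute your own.

  ```
  aws secretsmanager create-secret \
      --name Test/RDS/Redshift1 \
      --secret-string '{"username": "sample2", "password": "sample2"}'
  ```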

**Example connection string with secret name**  
The following string has the secret name `${Test/RDS/Redshift1}`.

```
redshift://jdbc:redshift://redshift1.host:3306/default?...&${Test/RDS/Redshift1}&...
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
redshift://jdbc:redshift://redshift1.host:3306/default?...&user=sample2&password=sample2&...
```

Currently, the Redshift connector recognizes the `user` and `password` JDBC properties.

## Data type support
<a name="connectors-redshift-data-type-support"></a>

The following table shows the corresponding data types for JDBC and Apache Arrow.


****  

| JDBC | Arrow | 
| --- | --- | 
| Boolean | Bit | 
| Integer | Tiny | 
| Short | Smallint | 
| Integer | Int | 
| Long | Bigint | 
| float | Float4 | 
| Double | Float8 | 
| Date | DateDay | 
| Timestamp | DateMilli | 
| String | Varchar | 
| Bytes | Varbinary | 
| BigDecimal | Decimal | 
| ARRAY | List | 

## Partitions and splits
<a name="connectors-redshift-partitions-and-splits"></a>

Redshift does not support external partitions. For information about performance related issues, see [Performance](#connectors-redshift-performance).

## Performance
<a name="connectors-redshift-performance"></a>

The Athena Redshift connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses, `ORDER BY` clauses, simple predicates, and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time. However, selecting a subset of columns sometimes results in a longer query execution runtime. Amazon Redshift is particularly susceptible to query execution slowdown when you run multiple queries concurrently.

### LIMIT clauses
<a name="connectors-redshift-performance-limit-clauses"></a>

A `LIMIT N` statement reduces the data scanned by the query. With `LIMIT N` pushdown, the connector returns only `N` rows to Athena.

### Top N queries
<a name="connectors-redshift-performance-top-n-queries"></a>

A top `N` query specifies an ordering of the result set and a limit on the number of rows returned. You can use this type of query to determine the top `N` max values or top `N` min values for your datasets. With top `N` pushdown, the connector returns only `N` ordered rows to Athena.
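
For example, the following query against a hypothetical `orders` table combines `ORDER BY` and `LIMIT` and is therefore eligible for top `N` pushdown:

```
SELECT o_orderkey, o_totalprice 
FROM orders 
ORDER BY o_totalprice DESC 
LIMIT 10;
```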

### Predicates
<a name="connectors-redshift-performance-predicates"></a>

A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Redshift connector can combine these expressions and push them directly to Redshift for enhanced functionality and to reduce the amount of data scanned.

The following Athena Redshift connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT\_EQUAL, LESS\_THAN, LESS\_THAN\_OR\_EQUAL, GREATER\_THAN, GREATER\_THAN\_OR\_EQUAL, IS\_DISTINCT\_FROM, NULL\_IF, IS\_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE\_PATTERN, IN

### Combined pushdown example
<a name="connectors-redshift-performance-pushdown-example"></a>

For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d)) 
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%') 
ORDER BY col_a DESC 
LIMIT 10;
```

For an article on using predicate pushdown to improve performance in federated queries, including Amazon Redshift, see [Improve federated queries with predicate pushdown in Amazon Athena](https://aws.amazon.com/blogs/big-data/improve-federated-queries-with-predicate-pushdown-in-amazon-athena/) in the *AWS Big Data Blog*.

## Passthrough queries
<a name="connectors-redshift-passthrough-queries"></a>

The Redshift connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Redshift, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Redshift. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## Additional resources
<a name="connectors-redshift-additional-resources"></a>

For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-redshift/pom.xml) file for the Redshift connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-redshift) on GitHub.com.

# Amazon Athena SAP HANA connector
<a name="connectors-sap-hana"></a>

The Amazon Athena SAP HANA connector enables Amazon Athena to run SQL queries on data in your SAP HANA databases using JDBC.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

## Prerequisites
<a name="connectors-saphana-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations
<a name="connectors-sap-hana-limitations"></a>
+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ In SAP HANA, object names are converted to uppercase when they are stored in the SAP HANA database. However, because names in quotation marks are case sensitive, it is possible for two tables to have the same name in lower and upper case (for example, `EMPLOYEE` and `employee`).

  In Athena Federated Query, schema table names are provided to the Lambda function in lower case. To work around this issue, you can provide `@schemaCase` query hints to retrieve the data from the tables that have case sensitive names. Following are two sample queries with query hints.

  ```
  SELECT * 
  FROM "lambda:saphanaconnector".SYSTEM."MY_TABLE@schemaCase=upper&tableCase=upper"
  ```

  ```
  SELECT * 
  FROM "lambda:saphanaconnector".SYSTEM."MY_TABLE@schemaCase=upper&tableCase=lower"
  ```

## Terms
<a name="connectors-sap-hana-terms"></a>

The following terms relate to the SAP HANA connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters
<a name="connectors-sap-hana-parameters"></a>

Use the parameters in this section to configure the SAP HANA connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.  
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html). Starting December 3, 2024, use the parameters below only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector.

### Glue connections (recommended)
<a name="connectors-sap-hana-gc"></a>

We recommend that you configure a SAP HANA connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the SAP HANA connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type SAPHANA
```
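
Then, assuming a connector Lambda function named `athena-saphana-connector` and a Glue connection named `my-saphana-connection` (both hypothetical names), you could associate them as follows. Note that this call replaces the function's entire environment variable map, so include any other variables your function already uses.

```
aws lambda update-function-configuration \
    --function-name athena-saphana-connector \
    --environment 'Variables={glue_connection=my-saphana-connection}'
```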

**Lambda environment properties**
+ **glue\_connection** – Specifies the name of the Glue connection associated with the federated connector. 
+ **casing\_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **none** – Do not change the case of the given schema and table names. This is the default for connectors that have an associated Glue connection. 
  + **upper** – Upper case all given schema and table names.
  + **lower** – Lower case all given schema and table names.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The SAP HANA connector created using Glue connections does not support the use of a multiplexing handler.
The SAP HANA connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="connectors-sap-hana-legacy"></a>

#### Connection string
<a name="connectors-sap-hana-connection-string"></a>

Use a JDBC connection string in the following format to connect to a database instance.

```
saphana://${jdbc_connection_string}
```

#### Using a multiplexing handler
<a name="connectors-sap-hana-using-a-multiplexing-handler"></a>

You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | SaphanaMuxCompositeHandler | 
| Metadata handler | SaphanaMuxMetadataHandler | 
| Record handler | SaphanaMuxRecordHandler | 

##### Multiplexing handler parameters
<a name="connectors-sap-hana-multiplexing-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| $catalog\_connection\_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is mysaphanacatalog, then the environment variable name is mysaphanacatalog\_connection\_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS\_LAMBDA\_FUNCTION\_NAME}. | 

The following example properties are for a Saphana MUX Lambda function that supports two database instances: `saphana1` (the default), and `saphana2`.


****  

| Property | Value | 
| --- | --- | 
| default | saphana://jdbc:sap://saphana1.host:port/?${Test/RDS/Saphana1} | 
| saphana\_catalog1\_connection\_string | saphana://jdbc:sap://saphana1.host:port/?${Test/RDS/Saphana1} | 
| saphana\_catalog2\_connection\_string | saphana://jdbc:sap://saphana2.host:port/?user=sample2&password=sample2 | 

##### Providing credentials
<a name="connectors-sap-hana-providing-credentials"></a>

To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${Test/RDS/Saphana1}`.

```
saphana://jdbc:sap://saphana1.host:port/?${Test/RDS/Saphana1}&...
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
saphana://jdbc:sap://saphana1.host:port/?user=sample2&password=sample2&...
```

Currently, the SAP HANA connector recognizes the `user` and `password` JDBC properties.

#### Using a single connection handler
<a name="connectors-sap-hana-using-a-single-connection-handler"></a>

You can use the following single connection metadata and record handlers to connect to a single SAP HANA instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | SaphanaCompositeHandler | 
| Metadata handler | SaphanaMetadataHandler | 
| Record handler | SaphanaRecordHandler | 

##### Single connection handler parameters
<a name="connectors-sap-hana-single-connection-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single SAP HANA instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | saphana://jdbc:sap://saphana1.host:port/?secret=Test/RDS/Saphana1 | 

#### Spill parameters
<a name="connectors-sap-hana-spill-parameters"></a>

The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| spill\_bucket | Required. Spill bucket name. | 
| spill\_prefix | Required. Spill bucket key prefix. | 
| spill\_put\_request\_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*. | 

## Data type support
<a name="connectors-sap-hana-data-type-support"></a>

The following table shows the corresponding data types for JDBC and Apache Arrow.


****  

| JDBC | Arrow | 
| --- | --- | 
| Boolean | Bit | 
| Integer | Tiny | 
| Short | Smallint | 
| Integer | Int | 
| Long | Bigint | 
| float | Float4 | 
| Double | Float8 | 
| Date | DateDay | 
| Timestamp | DateMilli | 
| String | Varchar | 
| Bytes | Varbinary | 
| BigDecimal | Decimal | 
| ARRAY | List | 

## Data type conversions
<a name="connectors-sap-hana-data-type-conversions"></a>

In addition to the JDBC to Arrow conversions, the connector performs certain other conversions to make the SAP HANA source and Athena data types compatible. These conversions help ensure that queries get executed successfully. The following table shows these conversions.


****  

| Source data type (SAP HANA) | Converted data type (Athena) | 
| --- | --- | 
| DECIMAL | BIGINT | 
| INTEGER | INT | 
| DATE | DATEDAY | 
| TIMESTAMP | DATEMILLI | 

All other unsupported data types are converted to `VARCHAR`.

## Partitions and splits
<a name="connectors-sap-hana-partitions-and-splits"></a>

A partition is represented by a single partition column of type `Integer`. The column contains partition names of the partitions defined on an SAP HANA table. For a table that does not have partition names, `*` is returned, which is equivalent to a single partition. A partition is equivalent to a split.


****  

| Name | Type | Description | 
| --- | --- | --- | 
| PART\_ID | Integer | Named partition in SAP HANA. | 

## Performance
<a name="connectors-sap-hana-performance"></a>

SAP HANA supports native partitions. The Athena SAP HANA connector can retrieve data from these partitions in parallel. If you want to query very large datasets with uniform partition distribution, native partitioning is highly recommended. Selecting a subset of columns significantly speeds up query runtime and reduces data scanned. The connector shows significant throttling, and sometimes query failures, due to concurrency.

The Athena SAP HANA connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses, simple predicates, and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time. 

### LIMIT clauses
<a name="connectors-saphana-performance-limit-clauses"></a>

A `LIMIT N` statement reduces the data scanned by the query. With `LIMIT N` pushdown, the connector returns only `N` rows to Athena.

### Predicates
<a name="connectors-saphana-performance-predicates"></a>

A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena SAP HANA connector can combine these expressions and push them directly to SAP HANA for enhanced functionality and to reduce the amount of data scanned.

The following Athena SAP HANA connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT\_EQUAL, LESS\_THAN, LESS\_THAN\_OR\_EQUAL, GREATER\_THAN, GREATER\_THAN\_OR\_EQUAL, IS\_DISTINCT\_FROM, NULL\_IF, IS\_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE\_PATTERN, IN

### Combined pushdown example
<a name="connectors-saphana-performance-pushdown-example"></a>

For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d))
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%') 
LIMIT 10;
```

## Passthrough queries
<a name="connectors-saphana-passthrough-queries"></a>

The SAP HANA connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with SAP HANA, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in SAP HANA. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information
<a name="connectors-saphana-license-information"></a>

By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-saphana/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-saphana/LICENSE.txt) file on GitHub.com.

## Additional resources
<a name="connectors-saphana-additional-resources"></a>

For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-saphana/pom.xml) file for the SAP HANA connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-saphana) on GitHub.com.

# Amazon Athena Snowflake connector
<a name="connectors-snowflake"></a>

The Amazon Athena connector for [Snowflake](https://www.snowflake.com/) enables Amazon Athena to run SQL queries on data stored in your Snowflake SQL database or RDS instances using JDBC.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

## Prerequisites
<a name="connectors-snowflake-prerequisites"></a>

Deploy the connector to your AWS account using the Athena console or the `CreateDataCatalog` API operation. For more information, see [Create a data source connection](connect-to-a-data-source.md).

## Limitations
<a name="connectors-snowflake-limitations"></a>
+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ Only legacy connections support a multiplexer setup. 
+ Currently, Snowflake views are supported with a single split. 
+ In Snowflake, object names are case sensitive. Athena accepts mixed case in DDL and DML queries, but by default [lower cases](https://docs.aws.amazon.com/athena/latest/ug/tables-databases-columns-names.html#table-names-and-table-column-names-in-ate-must-be-lowercase) object names when it executes the query. The Snowflake connector supports only lower case when the Glue Data Catalog or Lake Formation is used. When the Athena catalog is used, you can control the casing behavior with the `casing_mode` Lambda environment variable, whose possible values are listed in the [Parameters](#connectors-snowflake-parameters) section (for example, `key=casing_mode, value=CASE_INSENSITIVE_SEARCH`). 

## Terms
<a name="connectors-snowflake-terms"></a>

The following terms relate to the Snowflake connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters
<a name="connectors-snowflake-parameters"></a>

Use the parameters in this section to configure the Snowflake connector.

### Glue connections (recommended)
<a name="snowflake-gc"></a>

We recommend that you configure a Snowflake connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the Snowflake connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type SNOWFLAKE
```

**Lambda environment properties**
+ **glue\_connection** – Specifies the name of the Glue connection associated with the federated connector. 
+ **casing\_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **NONE** – Do not change the case of the given schema and table names (run the query as is against Snowflake). This is the default value when **casing\_mode** is not specified. 
  + **UPPER** – Upper case all given schema and table names in the query before running it against Snowflake.
  + **LOWER** – Lower case all given schema and table names in the query before running it against Snowflake.
  + **CASE\_INSENSITIVE\_SEARCH** – Perform case insensitive searches against schema and table names in Snowflake. For example, you can use this mode when you have a query like `SELECT * FROM EMPLOYEE` and Snowflake contains a table called `Employee`. However, in the presence of name collisions, such as a table called `EMPLOYEE` and another table called `Employee` in Snowflake, the query fails.

**Note**  
The Snowflake connector created using Glue connections does not support the use of a multiplexing handler.
The Snowflake connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

**Storing credentials**

All connectors that use Glue connections must use AWS Secrets Manager to store credentials. For more information, see [Authenticate with Snowflake](connectors-snowflake-authentication.md).

### Legacy connections
<a name="snowflake-legacy"></a>

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.

The parameter names and definitions listed below are for Athena data source connectors created without an associated Glue connection. Use the following parameters only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector or when the `glue_connection` environment property is not specified.

**Lambda environment properties**
+ **default** – The JDBC connection string to use to connect to the Snowflake database instance. For example, `snowflake://${jdbc_connection_string}`
+ **catalog\_connection\_string** – Used by the multiplexing handler (not supported when using a Glue connection). A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is mysnowflakecatalog, then the environment variable name is mysnowflakecatalog\_connection\_string.
+ **casing\_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **NONE** – Do not change the case of the given schema and table names (run the query as is against Snowflake). This is the default value when **casing\_mode** is not specified. 
  + **UPPER** – Upper case all given schema and table names in the query before running it against Snowflake.
  + **LOWER** – Lower case all given schema and table names in the query before running it against Snowflake.
  + **CASE\_INSENSITIVE\_SEARCH** – Perform case insensitive searches against schema and table names in Snowflake. For example, you can use this mode when you have a query like `SELECT * FROM EMPLOYEE` and Snowflake contains a table called `Employee`. However, in the presence of name collisions, such as having a table called `EMPLOYEE` and another table called `Employee` in Snowflake, the query will fail.
+ **spill\_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill\_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill\_put\_request\_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms\_key\_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable\_spill\_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
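For illustration only, a manually deployed Snowflake connector (without a Glue connection) might define Lambda environment variables like the following. The bucket name is a placeholder, and the secret name reuses the example secret from this section.

```json
{
  "default": "snowflake://jdbc:snowflake://snowflake1.host:port/?warehouse=warehousename&db=db1&schema=schema1&${Test/RDS/Snowflake1}",
  "spill_bucket": "amzn-s3-demo-bucket",
  "spill_prefix": "athena-federation-spill",
  "disable_spill_encryption": "false"
}
```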

#### Connection string
<a name="connectors-snowflake-connection-string"></a>

Use a JDBC connection string in the following format to connect to a database instance.

```
snowflake://${jdbc_connection_string}
```

#### Using a multiplexing handler
<a name="connectors-snowflake-using-a-multiplexing-handler"></a>

You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | SnowflakeMuxCompositeHandler | 
| Metadata handler | SnowflakeMuxMetadataHandler | 
| Record handler | SnowflakeMuxRecordHandler | 

##### Multiplexing handler parameters
<a name="connectors-snowflake-multiplexing-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| catalog\_connection\_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is mysnowflakecatalog, then the environment variable name is mysnowflakecatalog\_connection\_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS\_LAMBDA\_FUNCTION\_NAME}. | 

The following example properties are for a Snowflake MUX Lambda function that supports two database instances: `snowflake1` (the default), and `snowflake2`.


****  

| Property | Value | 
| --- | --- | 
| default | snowflake://jdbc:snowflake://snowflake1.host:port/?warehouse=warehousename&db=db1&schema=schema1&${Test/RDS/Snowflake1} | 
| snowflake\_catalog1\_connection\_string | snowflake://jdbc:snowflake://snowflake1.host:port/?warehouse=warehousename&db=db1&schema=schema1${Test/RDS/Snowflake1} | 
| snowflake\_catalog2\_connection\_string | snowflake://jdbc:snowflake://snowflake2.host:port/?warehouse=warehousename&db=db1&schema=schema1&user=sample2&password=sample2 | 

##### Providing credentials
<a name="connectors-snowflake-providing-credentials"></a>

To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${Test/RDS/Snowflake1}`.

```
snowflake://jdbc:snowflake://snowflake1.host:port/?warehouse=warehousename&db=db1&schema=schema1${Test/RDS/Snowflake1}&... 
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
snowflake://jdbc:snowflake://snowflake1.host:port/warehouse=warehousename&db=db1&schema=schema1&user=sample2&password=sample2&... 
```

Currently, Snowflake recognizes the `user` and `password` JDBC properties. It also accepts the user name and password in the format *username*`/`*password* without the keys `user` or `password`.
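The secret-name replacement described above can be sketched as follows. This is a simplified illustration, not the connector's actual implementation, and the secret values are placeholders rather than real credentials.

```python
import re

def resolve_secret(connection_string: str, secrets: dict) -> str:
    """Replace each ${secret_name} token with user/password properties,
    mimicking how the connector expands Secrets Manager references."""
    def expand(match):
        # In the real connector, the secret is fetched from Secrets Manager.
        secret = secrets[match.group(1)]
        return "&user={username}&password={password}".format(**secret)
    return re.sub(r"\$\{([^}]+)\}", expand, connection_string)

secrets = {"Test/RDS/Snowflake1": {"username": "sample2", "password": "sample2"}}
jdbc = ("snowflake://jdbc:snowflake://snowflake1.host:port/"
        "?warehouse=warehousename&db=db1&schema=schema1${Test/RDS/Snowflake1}")
print(resolve_secret(jdbc, secrets))
```

The expanded string matches the second connection string shown above, with the secret token replaced by `user` and `password` properties.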

#### Using a single connection handler
<a name="connectors-snowflake-using-a-single-connection-handler"></a>

You can use the following single connection metadata and record handlers to connect to a single Snowflake instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | SnowflakeCompositeHandler | 
| Metadata handler | SnowflakeMetadataHandler | 
| Record handler | SnowflakeRecordHandler | 

##### Single connection handler parameters
<a name="connectors-snowflake-single-connection-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single Snowflake instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | snowflake://jdbc:snowflake://snowflake1.host:port/?secret=Test/RDS/Snowflake1 | 

#### Spill parameters
<a name="connectors-snowflake-spill-parameters"></a>

The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| spill\_bucket | Required. Spill bucket name. | 
| spill\_prefix | Required. Spill bucket key prefix. | 
| spill\_put\_request\_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*. | 

## Data type support
<a name="connectors-snowflake-data-type-support"></a>

The following table shows the corresponding data types for JDBC and Apache Arrow.


****  

| JDBC | Arrow | 
| --- | --- | 
| Boolean | Bit | 
| Integer | Tiny | 
| Short | Smallint | 
| Integer | Int | 
| Long | Bigint | 
| float | Float4 | 
| Double | Float8 | 
| Date | DateDay | 
| Timestamp | DateMilli | 
| String | Varchar | 
| Bytes | Varbinary | 
| BigDecimal | Decimal | 
| ARRAY | List | 

## Data type conversions
<a name="connectors-snowflake-data-type-conversions"></a>

In addition to the JDBC to Arrow conversions, the connector performs certain other conversions to make the Snowflake source and Athena data types compatible. These conversions help ensure that queries get executed successfully. The following table shows these conversions.


****  

| Source data type (Snowflake) | Converted data type (Athena) | 
| --- | --- | 
| TIMESTAMP | TIMESTAMPMILLI | 
| DATE | TIMESTAMPMILLI | 
| INTEGER | INT | 
| DECIMAL | BIGINT | 
| TIMESTAMP\_NTZ | TIMESTAMPMILLI | 

All other unsupported data types are converted to `VARCHAR`.

## Partitions and splits
<a name="connectors-snowflake-partitions-and-splits"></a>

Partitions are used to determine how to generate splits for the connector. Athena constructs a synthetic column of type `varchar` that represents the partitioning scheme for the table to help the connector generate splits. The connector does not modify the actual table definition.

To create this synthetic column and the partitions, Athena requires a primary key to be defined. However, because Snowflake does not enforce primary key constraints, you must enforce uniqueness yourself. Failure to do so causes Athena to default to a single split.
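For example, Snowflake lets you declare a primary key constraint (which it stores but does not enforce) so that the connector has a key to base split generation on. The database, table, and column names below are hypothetical.

```
ALTER TABLE my_database.my_schema.employee
  ADD CONSTRAINT pk_employee PRIMARY KEY (employee_id);
```

Because the constraint is unenforced, you must ensure the key column values are actually unique; otherwise Athena falls back to a single split.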

## Performance
<a name="connectors-snowflake-performance"></a>

For optimal performance, use filters in queries whenever possible. In addition, we highly recommend native partitioning to retrieve huge datasets that have uniform partition distribution. Selecting a subset of columns significantly speeds up query runtime and reduces data scanned. The Snowflake connector is resilient to throttling due to concurrency.

The Athena Snowflake connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses, simple predicates, and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time.

### LIMIT clauses
<a name="connectors-snowflake-performance-limit-clauses"></a>

A `LIMIT N` statement reduces the data scanned by the query. With `LIMIT N` pushdown, the connector returns only `N` rows to Athena.

### Predicates
<a name="connectors-snowflake-performance-predicates"></a>

A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Snowflake connector can combine these expressions and push them directly to Snowflake for enhanced functionality and to reduce the amount of data scanned.

The following Athena Snowflake connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT\_EQUAL, LESS\_THAN, LESS\_THAN\_OR\_EQUAL, GREATER\_THAN, GREATER\_THAN\_OR\_EQUAL, IS\_DISTINCT\_FROM, NULL\_IF, IS\_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE\_PATTERN, IN

### Combined pushdown example
<a name="connectors-snowflake-performance-pushdown-example"></a>

For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d))
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%') 
LIMIT 10;
```

# Authenticate with Snowflake
<a name="connectors-snowflake-authentication"></a>

You can configure the Amazon Athena Snowflake connector to use either key-pair authentication or OAuth authentication to connect to your Snowflake data warehouse. Both methods provide secure access to Snowflake and eliminate the need to store passwords in connection strings.
+ **Key-pair authentication** – This method uses an RSA public/private key pair to authenticate with Snowflake. The private key digitally signs authentication requests, while the corresponding public key is registered in Snowflake for verification. This method eliminates password storage.
+ **OAuth authentication** – This method uses an authorization token and a refresh token to authenticate with Snowflake. It supports automatic token refresh, making it suitable for long-running applications.

For more information, see [Key-pair authentication](https://docs.snowflake.com/en/user-guide/key-pair-auth) and [OAuth authentication](https://docs.snowflake.com/en/user-guide/oauth-custom) in the Snowflake documentation.

## Prerequisites
<a name="connectors-snowflake-authentication-prerequisites"></a>

Before you begin, complete the following prerequisites:
+ Snowflake account access with administrative privileges.
+ Snowflake user account dedicated for the Athena connector.
+ OpenSSL or equivalent key generation tools for key-pair authentication.
+ AWS Secrets Manager access to create and manage secrets.
+ Web browser to complete the OAuth flow for the OAuth authentication.

## Configure key-pair authentication
<a name="connectors-snowflake-keypair-authentication"></a>

This process involves generating an RSA key-pair, configuring your Snowflake account with the public key, and securely storing the private key in AWS Secrets Manager. The following steps will guide you through creating the cryptographic keys, setting up the necessary Snowflake permissions, and configuring AWS credentials for seamless authentication. 

1. **Generate RSA key-pair**

   Generate a private and public key pair using OpenSSL.
   + To generate an unencrypted version, use the following command in your local command line application.

     ```
     openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8 -nocrypt
     ```
   + To generate an encrypted version, use the following command, which omits `-nocrypt` and encrypts the private key with `-v2 des3`.

     ```
     openssl genrsa 2048 | openssl pkcs8 -topk8 -v2 des3 -inform PEM -out rsa_key.p8
     ```
   + To generate a public key from the private key, use the following commands, which also set restrictive file permissions.

     ```
     openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub
     # Set appropriate permissions (Unix/Linux)
     chmod 600 rsa_key.p8
     chmod 644 rsa_key.pub
     ```
**Note**  
Do not share your private key. The private key should only be accessible to the application that needs to authenticate with Snowflake.

1. **Extract public key content without delimiters for Snowflake**

   ```
   # Extract public key content (remove BEGIN/END lines and newlines)
   cat rsa_key.pub | grep -v "BEGIN\|END" | tr -d '\n'
   ```

   Save this output; you will need it when you assign the public key to the Snowflake user.

1. **Configure Snowflake user**

   Follow these steps to configure a Snowflake user.

   1. Create a dedicated user for the Athena connector if one doesn't already exist.

      ```
      -- Create a role and user for the Athena connector
      CREATE ROLE athena_connector_role;
      CREATE USER athena_connector_user;
      GRANT ROLE athena_connector_role TO USER athena_connector_user;
      
      -- Grant necessary privileges to the role
      GRANT USAGE ON WAREHOUSE your_warehouse TO ROLE athena_connector_role;
      GRANT USAGE ON DATABASE your_database TO ROLE athena_connector_role;
      GRANT SELECT ON ALL TABLES IN DATABASE your_database TO ROLE athena_connector_role;
      ```

   1. Grant authentication privileges. To assign a public key to a user, you must have one of the following roles or privileges.
      + The `MODIFY PROGRAMMATIC AUTHENTICATION METHODS` or `OWNERSHIP` privilege on the user.
      + The `SECURITYADMIN` role or higher.

      Grant the necessary privileges to assign public keys with the following command.

      ```
      GRANT MODIFY PROGRAMMATIC AUTHENTICATION METHODS ON USER athena_connector_user TO ROLE your_admin_role;
      ```

   1. Assign the public key to the Snowflake user with the following command.

      ```
      ALTER USER athena_connector_user SET RSA_PUBLIC_KEY='RSAkey';
      ```

      Verify that the public key is successfully assigned to the user with the following command.

      ```
      DESC USER athena_connector_user;
      ```

1. **Store private key in AWS Secrets Manager**

   1. Convert your private key to the format required by the connector.

      ```
      # Read private key content
      cat rsa_key.p8
      ```

   1. Create a secret in AWS Secrets Manager with the following structure.

      ```
      {
        "sfUser": "your_snowflake_user",
        "pem_private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----",
        "pem_private_key_passphrase": "passphrase_in_case_of_encrypted_private_key(optional)"
      }
      ```
**Note**  
The PEM header and footer lines are optional.
Line breaks in the private key must be encoded as `\n` in the secret value.
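As an illustration (not part of the connector), the following sketch shows how a multi-line PEM key ends up `\n`-separated when serialized into the secret's JSON value. The key content and user name are placeholders.

```python
import json

# Multi-line PEM content (truncated placeholder, not a real key).
pem = ("-----BEGIN PRIVATE KEY-----\n"
       "MIIEvQIBADANBgkq...\n"
       "-----END PRIVATE KEY-----")

secret_json = json.dumps({
    "sfUser": "athena_connector_user",
    "pem_private_key": pem,
})
# json.dumps encodes the real line breaks as literal \n sequences,
# which is the form the connector expects in the secret value.
print(secret_json)
```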

## Configure OAuth authentication
<a name="connectors-snowflake-oauth-authentication"></a>

This authentication method enables secure, token-based access to Snowflake with automatic credential refresh capabilities. The configuration process involves creating a security integration in Snowflake, retrieving OAuth client credentials, completing the authorization flow to obtain an access code, and storing the OAuth credentials in AWS Secrets Manager for the connector to use. 

1. **Create a security integration in Snowflake**

   Execute the following SQL command in Snowflake to create a Snowflake OAuth security integration.

   ```
   CREATE SECURITY INTEGRATION my_snowflake_oauth_integration_a
     TYPE = OAUTH
     ENABLED = TRUE
     OAUTH_CLIENT = CUSTOM
     OAUTH_CLIENT_TYPE = 'CONFIDENTIAL'
     OAUTH_REDIRECT_URI = 'https://localhost:8080/oauth/callback'
     OAUTH_ISSUE_REFRESH_TOKENS = TRUE
     OAUTH_REFRESH_TOKEN_VALIDITY = 7776000;
   ```

   **Configuration parameters**
   + `TYPE = OAUTH` – Specifies OAuth authentication type.
   + `ENABLED = TRUE` – Enables the security integration.
   + `OAUTH_CLIENT = CUSTOM` – Uses custom OAuth client configuration.
   + `OAUTH_CLIENT_TYPE = 'CONFIDENTIAL'` – Sets client type for secure applications.
   + `OAUTH_REDIRECT_URI` – The callback URL for OAuth flow. It can be localhost for testing.
   + `OAUTH_ISSUE_REFRESH_TOKENS = TRUE` – Enables refresh token generation.
   + `OAUTH_REFRESH_TOKEN_VALIDITY = 7776000` – Sets refresh token validity (90 days in seconds).

1. **Retrieve OAuth client secrets**

   1. Run the following SQL command to get the client credentials.

      ```
      DESC SECURITY INTEGRATION MY_SNOWFLAKE_OAUTH_INTEGRATION_A;
      ```

   1. Retrieve the OAuth client secrets.

      ```
      SELECT SYSTEM$SHOW_OAUTH_CLIENT_SECRETS('MY_SNOWFLAKE_OAUTH_INTEGRATION_A');
      ```

      **Example response**

      ```
      {
        "OAUTH_CLIENT_SECRET_2": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        "OAUTH_CLIENT_SECRET": "je7MtGbClwBF/2Zp9Utk/h3yCo8nvbEXAMPLEKEY",
        "OAUTH_CLIENT_ID": "AIDACKCEVSQ6C2EXAMPLE"
      }
      ```
**Note**  
Keep these credentials secure and do not share them. These will be used to configure the OAuth client.

1. **Authorize user and retrieve authorization code**

   1. Open the following URL in a browser.

      ```
      https://<your_account>.snowflakecomputing.com/oauth/authorize?client_id=<OAUTH_CLIENT_ID>&response_type=code&redirect_uri=https://localhost:8080/oauth/callback
      ```

   1. Complete the authorization flow.

      1. Sign in using your Snowflake credentials.

      1. Grant the requested permissions. You will be redirected to the callback URI with an authorization code.

   1. Extract the authorization code by copying the code parameter from the redirect URL.

      ```
      https://localhost:8080/oauth/callback?code=<authorizationcode>
      ```
**Note**  
The authorization code is valid for a limited time and can only be used once.
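As an illustration (not part of the connector), the authorization code can be extracted from the callback URL with standard URL parsing. The code value below is a placeholder.

```python
from urllib.parse import urlparse, parse_qs

# Placeholder callback URL captured from the browser after authorization.
callback = "https://localhost:8080/oauth/callback?code=exampleauthcode123"

# Parse the query string and pull out the "code" parameter.
query = parse_qs(urlparse(callback).query)
auth_code = query["code"][0]
print(auth_code)
```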

1. **Store OAuth credentials in AWS Secrets Manager**

   Create a secret in AWS Secrets Manager with the following structure.

   ```
   {
     "redirect_uri": "https://localhost:8080/oauth/callback",
     "client_secret": "je7MtGbClwBF/2Zp9Utk/h3yCo8nvbEXAMPLEKEY",
     "token_url": "https://<your_account>.snowflakecomputing.com/oauth/token-request",
     "client_id": "AIDACKCEVSQ6C2EXAMPLE",
     "username": "your_snowflake_username",
     "auth_code": "authorizationcode"
   }
   ```

   **Required fields**
   + `redirect_uri` – OAuth redirect URI that you obtained from Step 1.
   + `client_secret` – OAuth client secret that you obtained from Step 2.
   + `token_url` – The Snowflake OAuth token endpoint.
   + `client_id` – The OAuth client ID from Step 2.
   + `username` – The Snowflake username for the connector.
   + `auth_code` – The authorization code that you obtained from Step 3.

After you create a secret, you get a secret ARN that you can use in your Glue connection when you [create a data source connection](connect-to-a-data-source.md). 

## Passthrough queries
<a name="connectors-snowflake-passthrough-queries"></a>

The Snowflake connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Snowflake, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Snowflake. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information
<a name="connectors-snowflake-license-information"></a>

By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-snowflake/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-snowflake/LICENSE.txt) file on GitHub.com.

## Additional resources
<a name="connectors-snowflake-additional-resources"></a>

For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-snowflake/pom.xml) file for the Snowflake connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-snowflake) on GitHub.com.

# Amazon Athena Microsoft SQL Server connector
<a name="connectors-microsoft-sql-server"></a>

The Amazon Athena connector for [Microsoft SQL Server](https://docs.microsoft.com/en-us/sql/?view=sql-server-ver15) enables Amazon Athena to run SQL queries on your data stored in Microsoft SQL Server using JDBC.

This connector can be registered with the AWS Glue Data Catalog as a federated catalog. It supports data access controls defined in AWS Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses AWS Glue connections to centralize configuration properties in AWS Glue.

## Prerequisites
<a name="connectors-sqlserver-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations
<a name="connectors-microsoft-sql-server-limitations"></a>
+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ In filter conditions, you must cast the `Date` and `Timestamp` data types to the appropriate data type.
+ To search for negative values of type `Real` and `Float`, use the `<=` or `>=` operator.
+ The `binary`, `varbinary`, `image`, and `rowversion` data types are not supported.

## Terms
<a name="connectors-microsoft-sql-server-terms"></a>

The following terms relate to the SQL Server connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters
<a name="connectors-microsoft-sql-server-parameters"></a>

Use the parameters in this section to configure the SQL Server connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.  
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html). Starting December 3, 2024, use the parameters below only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector.

### Glue connections (recommended)
<a name="connectors-microsoft-sql-server-gc"></a>

We recommend that you configure a SQL Server connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the SQL Server connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type SQLSERVER
```

**Lambda environment properties**
+ **glue\_connection** – Specifies the name of the Glue connection associated with the federated connector.
+ **casing\_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **none** – Do not change the case of the given schema and table names. This is the default for connectors that have an associated Glue connection. 
  + **upper** – Upper case all given schema and table names.
  + **lower** – Lower case all given schema and table names.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The SQL Server connector created using Glue connections does not support the use of a multiplexing handler.
The SQL Server connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="connectors-microsoft-sql-server-legacy"></a>

#### Connection string
<a name="connectors-microsoft-sql-server-connection-string"></a>

Use a JDBC connection string in the following format to connect to a database instance.

```
sqlserver://${jdbc_connection_string}
```

#### Using a multiplexing handler
<a name="connectors-microsoft-sql-server-using-a-multiplexing-handler"></a>

You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | SqlServerMuxCompositeHandler | 
| Metadata handler | SqlServerMuxMetadataHandler | 
| Record handler | SqlServerMuxRecordHandler | 

##### Multiplexing handler parameters
<a name="connectors-microsoft-sql-server-multiplexing-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| catalog\_connection\_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is mysqlservercatalog, then the environment variable name is mysqlservercatalog\_connection\_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS\_LAMBDA\_FUNCTION\_NAME}. | 

The following example properties are for a SqlServer MUX Lambda function that supports two database instances: `sqlserver1` (the default), and `sqlserver2`.


****  

| Property | Value | 
| --- | --- | 
| default | sqlserver://jdbc:sqlserver://sqlserver1.hostname:port;databaseName=<database\_name>;${secret1\_name} | 
| sqlserver\_catalog1\_connection\_string | sqlserver://jdbc:sqlserver://sqlserver1.hostname:port;databaseName=<database\_name>;${secret1\_name} | 
| sqlserver\_catalog2\_connection\_string | sqlserver://jdbc:sqlserver://sqlserver2.hostname:port;databaseName=<database\_name>;${secret2\_name} | 

##### Providing credentials
<a name="connectors-microsoft-sql-server-providing-credentials"></a>

To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${secret_name}`.

```
sqlserver://jdbc:sqlserver://hostname:port;databaseName=<database_name>;${secret_name}
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
sqlserver://jdbc:sqlserver://hostname:port;databaseName=<database_name>;user=<user>;password=<password>
```

#### Using a single connection handler
<a name="connectors-microsoft-sql-server-using-a-single-connection-handler"></a>

You can use the following single connection metadata and record handlers to connect to a single SQL Server instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | SqlServerCompositeHandler | 
| Metadata handler | SqlServerMetadataHandler | 
| Record handler | SqlServerRecordHandler | 

##### Single connection handler parameters
<a name="connectors-microsoft-sql-server-single-connection-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single SQL Server instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | sqlserver://jdbc:sqlserver://hostname:port;databaseName=<database\_name>;\$\{secret\_name\} | 

#### Spill parameters
<a name="connectors-microsoft-sql-server-spill-parameters"></a>

The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| spill\_bucket | Required. Spill bucket name. | 
| spill\_prefix | Required. Spill bucket key prefix. | 
| spill\_put\_request\_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, \{"x-amz-server-side-encryption" : "AES256"\}). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*. | 

## Data type support
<a name="connectors-microsoft-sql-server-data-type-support"></a>

The following table shows the corresponding data types for SQL Server and Apache Arrow.


****  

| SQL Server | Arrow | 
| --- | --- | 
| bit | TINYINT | 
| tinyint | SMALLINT | 
| smallint | SMALLINT | 
| int | INT | 
| bigint | BIGINT | 
| decimal | DECIMAL | 
| numeric | FLOAT8 | 
| smallmoney | FLOAT8 | 
| money | DECIMAL | 
| float[24] | FLOAT4 | 
| float[53] | FLOAT8 | 
| real | FLOAT4 | 
| datetime | Date(MILLISECOND) | 
| datetime2 | Date(MILLISECOND) | 
| smalldatetime | Date(MILLISECOND) | 
| date | Date(DAY) | 
| time | VARCHAR | 
| datetimeoffset | Date(MILLISECOND) | 
| char[n] | VARCHAR | 
| varchar[n/max] | VARCHAR | 
| nchar[n] | VARCHAR | 
| nvarchar[n/max] | VARCHAR | 
| text | VARCHAR | 
| ntext | VARCHAR | 
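
For reference, the mapping above can be expressed as a lookup map. This is an illustrative sketch rather than the connector's source code; the `SQLSERVER_TO_ARROW` name is an assumption.

```python
# SQL Server -> Apache Arrow type mapping, transcribed from the table above.
SQLSERVER_TO_ARROW = {
    "bit": "TINYINT",
    "tinyint": "SMALLINT",
    "smallint": "SMALLINT",
    "int": "INT",
    "bigint": "BIGINT",
    "decimal": "DECIMAL",
    "numeric": "FLOAT8",
    "smallmoney": "FLOAT8",
    "money": "DECIMAL",
    "float[24]": "FLOAT4",
    "float[53]": "FLOAT8",
    "real": "FLOAT4",
    "datetime": "Date(MILLISECOND)",
    "datetime2": "Date(MILLISECOND)",
    "smalldatetime": "Date(MILLISECOND)",
    "date": "Date(DAY)",
    "time": "VARCHAR",
    "datetimeoffset": "Date(MILLISECOND)",
    "char[n]": "VARCHAR",
    "varchar[n/max]": "VARCHAR",
    "nchar[n]": "VARCHAR",
    "nvarchar[n/max]": "VARCHAR",
    "text": "VARCHAR",
    "ntext": "VARCHAR",
}
print(SQLSERVER_TO_ARROW["money"])  # DECIMAL
```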

## Partitions and splits
<a name="connectors-microsoft-sql-server-partitions-and-splits"></a>

A partition is represented by a single partition column of type `varchar`. In the case of the SQL Server connector, a partition function determines how partitions are applied to the table. The partition function and column name information are retrieved from the SQL Server metadata table. A custom query then retrieves the partitions. Splits are created based on the number of distinct partitions received.
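
The one-split-per-distinct-partition step described above might be sketched like this; `make_splits` is a hypothetical stand-in for the connector's partition-to-split logic.

```python
# Hypothetical sketch: derive splits from the distinct partition values
# returned by the metadata query. The real connector reads partition
# information from SQL Server system tables.
def make_splits(partition_values):
    distinct = sorted(set(partition_values))
    # One split per distinct partition value
    return [{"partition": p} for p in distinct]

splits = make_splits(["1", "2", "2", "3"])
print(len(splits))  # 3
```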

## Performance
<a name="connectors-microsoft-sql-server-performance"></a>

Selecting a subset of columns significantly speeds up query runtime and reduces data scanned. The SQL Server connector is resilient to throttling due to concurrency.

The Athena SQL Server connector performs predicate pushdown to decrease the data scanned by the query. Simple predicates and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time. 

### Predicates
<a name="connectors-sqlserver-performance-predicates"></a>

A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena SQL Server connector can combine these expressions and push them directly to SQL Server for enhanced functionality and to reduce the amount of data scanned.

The following Athena SQL Server connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT\_EQUAL, LESS\_THAN, LESS\_THAN\_OR\_EQUAL, GREATER\_THAN, GREATER\_THAN\_OR\_EQUAL, IS\_DISTINCT\_FROM, NULL\_IF, IS\_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE\_PATTERN, IN

### Combined pushdown example
<a name="connectors-sqlserver-performance-pushdown-example"></a>

For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d)) 
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%');
```

## Passthrough queries
<a name="connectors-sqlserver-passthrough-queries"></a>

The SQL Server connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with SQL Server, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in SQL Server. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT TOP 10 * FROM customer'
        ))
```

## License information
<a name="connectors-sqlserver-license-information"></a>

By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-sqlserver/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-sqlserver/LICENSE.txt) file on GitHub.com.

## Additional resources
<a name="connectors-sqlserver-additional-resources"></a>

For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-sqlserver/pom.xml) file for the SQL Server connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-sqlserver) on GitHub.com.

# Amazon Athena Teradata connector
<a name="connectors-teradata"></a>

 The Amazon Athena connector for Teradata enables Athena to run SQL queries on data stored in your Teradata databases. 

This connector can use Glue Connections to centralize configuration properties in Glue. Alternatively, you can configure the connection through Lambda environment variables.

## Prerequisites
<a name="connectors-teradata-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations
<a name="connectors-teradata-limitations"></a>
+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.

## Terms
<a name="connectors-teradata-terms"></a>

The following terms relate to the Teradata connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters
<a name="connectors-teradata-parameters"></a>

Use the parameters in this section to configure the Teradata connector.

### Glue connections (recommended)
<a name="connectors-teradata-gc"></a>

We recommend that you configure a Teradata connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the Teradata connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type TERADATA
```

**Lambda environment properties**
+ **glue\_connection** – Specifies the name of the Glue connection associated with the federated connector.
+ **casing\_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the casing behavior:
  + **none** – Do not change the case of the given schema and table names. This is the default for connectors that have an associated Glue connection. 
  + **upper** – Uppercase all given schema and table names.
  + **lower** – Lowercase all given schema and table names.
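
The effect of each `casing_mode` value can be sketched as follows; `apply_casing_mode` is a hypothetical illustration, not the connector's implementation.

```python
def apply_casing_mode(name, casing_mode="none"):
    """Sketch of the casing_mode behavior described above, applied to a
    schema or table name (assumption: the connector applies it uniformly)."""
    if casing_mode == "upper":
        return name.upper()
    if casing_mode == "lower":
        return name.lower()
    return name  # "none": leave the name unchanged

print(apply_casing_mode("MySchema", "lower"))  # myschema
```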

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Teradata connector created using Glue connections does not support the use of a multiplexing handler.
The Teradata connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="connectors-teradata-legacy"></a>

#### Connection string
<a name="connectors-teradata-connection-string"></a>

Use a JDBC connection string in the following format to connect to a database instance.

```
teradata://${jdbc_connection_string}
```

#### Using a multiplexing handler
<a name="connectors-teradata-using-a-multiplexing-handler"></a>

You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | TeradataMuxCompositeHandler | 
| Metadata handler | TeradataMuxMetadataHandler | 
| Record handler | TeradataMuxRecordHandler | 

##### Multiplexing handler parameters
<a name="connectors-teradata-multiplexing-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| *catalog*\_connection\_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is myteradatacatalog, then the environment variable name is myteradatacatalog\_connection\_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:\$\{AWS\_LAMBDA\_FUNCTION\_NAME\}. | 

The following example properties are for a Teradata MUX Lambda function that supports two database instances: `teradata1` (the default), and `teradata2`.


****  

| Property | Value | 
| --- | --- | 
| default | teradata://jdbc:teradata://teradata2.host/TMODE=ANSI,CHARSET=UTF8,DATABASE=TEST,user=sample2&password=sample2 | 
| teradata\_catalog1\_connection\_string | teradata://jdbc:teradata://teradata1.host/TMODE=ANSI,CHARSET=UTF8,DATABASE=TEST,\$\{Test/RDS/Teradata1\} | 
| teradata\_catalog2\_connection\_string | teradata://jdbc:teradata://teradata2.host/TMODE=ANSI,CHARSET=UTF8,DATABASE=TEST,user=sample2&password=sample2 | 
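
The catalog-based routing that the multiplexer performs might look like the following sketch; `connection_string_for` is hypothetical, and the environment dict mirrors the example properties above.

```python
import os

def connection_string_for(catalog, env=None):
    """Hypothetical sketch of multiplexer routing: look up the
    <catalog>_connection_string variable, else fall back to the
    required default connection string."""
    env = os.environ if env is None else env
    return env.get(f"{catalog}_connection_string", env["default"])

# Environment variables mirroring the example table above
env = {
    "default": "teradata://jdbc:teradata://teradata2.host/TMODE=ANSI,CHARSET=UTF8,DATABASE=TEST",
    "teradata_catalog1_connection_string": "teradata://jdbc:teradata://teradata1.host/TMODE=ANSI,CHARSET=UTF8,DATABASE=TEST",
}
print(connection_string_for("teradata_catalog1", env))  # routes to teradata1.host
```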

##### Providing credentials
<a name="connectors-teradata-providing-credentials"></a>

To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${Test/RDS/Teradata1}`.

```
teradata://jdbc:teradata://teradata1.host/TMODE=ANSI,CHARSET=UTF8,DATABASE=TEST,${Test/RDS/Teradata1}&...
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
teradata://jdbc:teradata://teradata1.host/TMODE=ANSI,CHARSET=UTF8,DATABASE=TEST,...&user=sample2&password=sample2&...
```

Currently, Teradata recognizes the `user` and `password` JDBC properties. It also accepts the user name and password in the format *username*`/`*password* without the keys `user` or `password`.
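
The two accepted credential formats can be illustrated with a small sketch; `teradata_credentials` is a hypothetical helper, not part of the connector.

```python
def teradata_credentials(username, password, use_keys=True):
    """Sketch of the two credential formats Teradata accepts: the
    user/password JDBC properties, or the positional username/password
    form without keys."""
    if use_keys:
        return f"user={username},password={password}"
    return f"{username}/{password}"

print(teradata_credentials("sample2", "sample2"))         # user/password properties
print(teradata_credentials("sample2", "sample2", False))  # positional form
```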

#### Using a single connection handler
<a name="connectors-teradata-using-a-single-connection-handler"></a>

You can use the following single connection metadata and record handlers to connect to a single Teradata instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | TeradataCompositeHandler | 
| Metadata handler | TeradataMetadataHandler | 
| Record handler | TeradataRecordHandler | 

##### Single connection handler parameters
<a name="connectors-teradata-single-connection-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single Teradata instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | teradata://jdbc:teradata://teradata1.host/TMODE=ANSI,CHARSET=UTF8,DATABASE=TEST,secret=Test/RDS/Teradata1 | 

#### Spill parameters
<a name="connectors-teradata-spill-parameters"></a>

The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| spill\_bucket | Required. Spill bucket name. | 
| spill\_prefix | Required. Spill bucket key prefix. | 
| spill\_put\_request\_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, \{"x-amz-server-side-encryption" : "AES256"\}). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*. | 

## Data type support
<a name="connectors-teradata-data-type-support"></a>

The following table shows the corresponding data types for JDBC and Apache Arrow.


****  

| JDBC | Arrow | 
| --- | --- | 
| Boolean | Bit | 
| Integer | Tiny | 
| Short | Smallint | 
| Integer | Int | 
| Long | Bigint | 
| float | Float4 | 
| Double | Float8 | 
| Date | DateDay | 
| Timestamp | DateMilli | 
| String | Varchar | 
| Bytes | Varbinary | 
| BigDecimal | Decimal | 
| ARRAY | List | 

## Partitions and splits
<a name="connectors-teradata-partitions-and-splits"></a>

A partition is represented by a single partition column of type `Integer`. The column contains partition names of the partitions defined on a Teradata table. For a table that does not have partition names, \* is returned, which is equivalent to a single partition. A partition is equivalent to a split.


****  

| Name | Type | Description | 
| --- | --- | --- | 
| partition | Integer | Named partition in Teradata. | 

## Performance
<a name="connectors-teradata-performance"></a>

Teradata supports native partitions. The Athena Teradata connector can retrieve data from these partitions in parallel. Native partitioning is highly recommended if you want to query very large datasets with uniform partition distribution. Selecting a subset of columns significantly speeds up query runtime and reduces data scanned. The connector shows some throttling due to concurrency.

The Athena Teradata connector performs predicate pushdown to decrease the data scanned by the query. Simple predicates and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time.

### Predicates
<a name="connectors-teradata-performance-predicates"></a>

A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Teradata connector can combine these expressions and push them directly to Teradata for enhanced functionality and to reduce the amount of data scanned.

The following Athena Teradata connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT\_EQUAL, LESS\_THAN, LESS\_THAN\_OR\_EQUAL, GREATER\_THAN, GREATER\_THAN\_OR\_EQUAL, NULL\_IF, IS\_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE\_PATTERN, IN

### Combined pushdown example
<a name="connectors-teradata-performance-pushdown-example"></a>

For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d)) 
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%');
```

## Passthrough queries
<a name="connectors-teradata-passthrough-queries"></a>

The Teradata connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Teradata, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Teradata. The query selects all columns in the `customer` table.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer'
        ))
```

## License information
<a name="connectors-teradata-license-information"></a>

By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-teradata/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-teradata/LICENSE.txt) file on GitHub.com.

## Additional resources
<a name="connectors-teradata-additional-resources"></a>

For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-teradata/pom.xml) file for the Teradata connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-teradata) on GitHub.com.

# Amazon Athena Timestream connector
<a name="connectors-timestream"></a>

The Amazon Athena Timestream connector enables Amazon Athena to communicate with [Amazon Timestream](https://aws.amazon.com/timestream/), making your time series data accessible through Amazon Athena. You can optionally use AWS Glue Data Catalog as a source of supplemental metadata.

Amazon Timestream is a fast, scalable, fully managed, purpose-built time series database that makes it easy to store and analyze trillions of time series data points per day. Timestream saves you time and cost in managing the lifecycle of time series data by keeping recent data in memory and moving historical data to a cost-optimized storage tier based on user-defined policies.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

If you have Lake Formation enabled in your account, the IAM role for your Athena federated Lambda connector that you deployed in the AWS Serverless Application Repository must have read access in Lake Formation to the AWS Glue Data Catalog.

## Prerequisites
<a name="connectors-timestream-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Parameters
<a name="connectors-timestream-parameters"></a>

Use the parameters in this section to configure the Timestream connector.

### Glue connections (recommended)
<a name="connectors-timestream-gc"></a>

We recommend that you configure a Timestream connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the Timestream connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type TIMESTREAM
```

**Lambda environment properties**

**glue\_connection** – Specifies the name of the Glue connection associated with the federated connector. 

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Timestream connector created using Glue connections does not support the use of a multiplexing handler.
The Timestream connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="connectors-timestream-legacy"></a>

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.

The parameter names and definitions listed below are for Athena data source connectors created without an associated Glue connection. Use the following parameters only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector or when the `glue_connection` environment property is not specified.

**Lambda environment properties**
+ **spill\_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill\_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill\_put\_request\_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms\_key\_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by AWS KMS, you can specify a KMS key ID (for example, `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`).
+ **disable\_spill\_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **glue\_catalog** – (Optional) Use this option to specify a [cross-account AWS Glue catalog](data-sources-glue-cross-account.md). By default, the connector attempts to get metadata from its own AWS Glue account.

## Setting up databases and tables in AWS Glue
<a name="connectors-timestream-setting-up-databases-and-tables-in-aws-glue"></a>

You can optionally use the AWS Glue Data Catalog as a source of supplemental metadata. To enable an AWS Glue table for use with Timestream, you must have an AWS Glue database and table with names that match the Timestream database and table that you want to supply supplemental metadata for.

**Note**  
For best performance, use only lowercase for your database names and table names. Using mixed casing causes the connector to perform a case-insensitive search that is more computationally intensive.

To configure an AWS Glue table for use with Timestream, you must set its table properties in AWS Glue.

**To use an AWS Glue table for supplemental metadata**

1. Edit the table in the AWS Glue console to add the following table properties:
   + **timestream-metadata-flag** – This property indicates to the Timestream connector that the connector can use the table for supplemental metadata. You can provide any value for `timestream-metadata-flag` as long as the `timestream-metadata-flag` property is present in the list of table properties.
   + **\_view\_template** – When you use AWS Glue for supplemental metadata, you can use this table property and specify any Timestream SQL as the view. The Athena Timestream connector uses the SQL from the view together with your SQL from Athena to run your query. This is useful if you want to use a feature of Timestream SQL that is not otherwise available in Athena.

1. Make sure that you use the data types appropriate for AWS Glue as listed in this document.

### Data types
<a name="connectors-timestream-data-types"></a>

Currently, the Timestream connector supports only a subset of the data types available in Timestream, specifically: the scalar values `varchar`, `double`, and `timestamp`.

To query the `timeseries` data type, you must configure a view in AWS Glue table properties that uses the Timestream `CREATE_TIME_SERIES` function. You also need to provide a schema for the view that uses the syntax `ARRAY<STRUCT<time:timestamp,measure_value::double:double>>` as the type for any of your time series columns. Be sure to replace `double` with the appropriate scalar type for your table.

The following image shows an example of AWS Glue table properties configured to set up a view over a time series.

![\[Configuring table properties in AWS Glue to set up a view over a time series.\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-timestream-1.png)


## Required Permissions
<a name="connectors-timestream-required-permissions"></a>

For full details on the IAM policies that this connector requires, review the `Policies` section of the [athena-timestream.yaml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-timestream/athena-timestream.yaml) file. The following list summarizes the required permissions.
+ **Amazon S3 write access** – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
+ **Athena GetQueryExecution** – The connector uses this permission to fast-fail when the upstream Athena query has terminated.
+ **AWS Glue Data Catalog** – The Timestream connector requires read only access to the AWS Glue Data Catalog to obtain schema information.
+ **CloudWatch Logs** – The connector requires access to CloudWatch Logs for storing logs.
+ **Timestream Access** – For running Timestream queries.

## Performance
<a name="connectors-timestream-performance"></a>

We recommend that you use the `LIMIT` clause to limit the data returned (not the data scanned) to less than 256 MB to ensure that interactive queries are performant.

The Athena Timestream connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses reduce the amount of data scanned, but if you don't provide a predicate, you should expect `SELECT` queries with a `LIMIT` clause to scan at least 16 MB of data. Selecting a subset of columns significantly speeds up query runtime and reduces data scanned. The Timestream connector is resilient to throttling due to concurrency.

## Passthrough queries
<a name="connectors-timestream-passthrough-queries"></a>

The Timestream connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Timestream, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Timestream. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information
<a name="connectors-timestream-license-information"></a>

The Amazon Athena Timestream connector project is licensed under the [Apache-2.0 License](https://www.apache.org/licenses/LICENSE-2.0.html).

## Additional resources
<a name="connectors-timestream-additional-resources"></a>

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-timestream) on GitHub.com.

# Amazon Athena TPC benchmark DS (TPC-DS) connector
<a name="connectors-tpcds"></a>

The Amazon Athena TPC-DS connector enables Amazon Athena to communicate with a source of randomly generated TPC Benchmark DS data for use in benchmarking and functional testing of Athena Federation. The Athena TPC-DS connector generates a TPC-DS compliant database at one of four scale factors. We do not recommend the use of this connector as an alternative to Amazon S3-based data lake performance tests.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

## Prerequisites
<a name="connectors-tpcds-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Parameters
<a name="connectors-tpcds-parameters"></a>

Use the parameters in this section to configure the TPC-DS connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.  
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html). Starting December 3, 2024, use the parameters below only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector.

### Glue connections (recommended)
<a name="connectors-tpcds-gc"></a>

We recommend that you configure a TPC-DS connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the TPC-DS connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type TPCDS
```

**Lambda environment properties**
+ **glue\_connection** – Specifies the name of the Glue connection associated with the federated connector.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The TPC-DS connector created using Glue connections does not support the use of a multiplexing handler.
The TPC-DS connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="connectors-tpcds-legacy"></a>
+ **spill\_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill\_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill\_put\_request\_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms\_key\_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable\_spill\_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
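The `spill_put_request_headers` value must be a JSON-encoded map supplied as a Lambda environment variable. A minimal Python sketch (illustrative, not part of the connector) of producing that value:

```python
import json

# Headers to apply to every spill PutObject request; the map shown here
# matches the example from the parameter description above.
headers = {"x-amz-server-side-encryption": "AES256"}

# The Lambda environment variable takes the JSON-encoded string form.
spill_put_request_headers = json.dumps(headers)
```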

## Test databases and tables
<a name="connectors-tpcds-test-databases-and-tables"></a>

The Athena TPC-DS connector generates a TPC-DS compliant database at one of five scale factors: `tpcds1`, `tpcds10`, `tpcds100`, `tpcds250`, or `tpcds1000`.

### Summary of tables
<a name="connectors-tpcds-table-summary"></a>

For a complete list of the test data tables and columns, run the `SHOW TABLES` or `DESCRIBE TABLE` queries. The following summary of tables is provided for convenience.

1. call\_center

1. catalog\_page

1. catalog\_returns

1. catalog\_sales

1. customer

1. customer\_address

1. customer\_demographics

1. date\_dim

1. dbgen\_version

1. household\_demographics

1. income\_band

1. inventory

1. item

1. promotion

1. reason

1. ship\_mode

1. store

1. store\_returns

1. store\_sales

1. time\_dim

1. warehouse

1. web\_page

1. web\_returns

1. web\_sales

1. web\_site

For TPC-DS queries that are compatible with this generated schema and data, see the [athena-tpcds/src/main/resources/queries/](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-tpcds/src/main/resources/queries) directory on GitHub.

### Example query
<a name="connectors-tpcds-example-query"></a>

The following example `SELECT` query retrieves customer demographics for specific counties from the `tpcds` catalog.

```
SELECT
  cd_gender,
  cd_marital_status,
  cd_education_status,
  count(*) cnt1,
  cd_purchase_estimate,
  count(*) cnt2,
  cd_credit_rating,
  count(*) cnt3,
  cd_dep_count,
  count(*) cnt4,
  cd_dep_employed_count,
  count(*) cnt5,
  cd_dep_college_count,
  count(*) cnt6
FROM
  "lambda:tpcds".tpcds1.customer c, "lambda:tpcds".tpcds1.customer_address ca, "lambda:tpcds".tpcds1.customer_demographics
WHERE
  c.c_current_addr_sk = ca.ca_address_sk AND
    ca_county IN ('Rush County', 'Toole County', 'Jefferson County',
                  'Dona Ana County', 'La Porte County') AND
    cd_demo_sk = c.c_current_cdemo_sk AND
    exists(SELECT *
           FROM "lambda:tpcds".tpcds1.store_sales, "lambda:tpcds".tpcds1.date_dim
           WHERE c.c_customer_sk = ss_customer_sk AND
             ss_sold_date_sk = d_date_sk AND
             d_year = 2002 AND
             d_moy BETWEEN 1 AND 1 + 3) AND
    (exists(SELECT *
            FROM "lambda:tpcds".tpcds1.web_sales, "lambda:tpcds".tpcds1.date_dim
            WHERE c.c_customer_sk = ws_bill_customer_sk AND
              ws_sold_date_sk = d_date_sk AND
              d_year = 2002 AND
              d_moy BETWEEN 1 AND 1 + 3) OR
      exists(SELECT *
             FROM "lambda:tpcds".tpcds1.catalog_sales, "lambda:tpcds".tpcds1.date_dim
             WHERE c.c_customer_sk = cs_ship_customer_sk AND
               cs_sold_date_sk = d_date_sk AND
               d_year = 2002 AND
               d_moy BETWEEN 1 AND 1 + 3))
GROUP BY cd_gender,
  cd_marital_status,
  cd_education_status,
  cd_purchase_estimate,
  cd_credit_rating,
  cd_dep_count,
  cd_dep_employed_count,
  cd_dep_college_count
ORDER BY cd_gender,
  cd_marital_status,
  cd_education_status,
  cd_purchase_estimate,
  cd_credit_rating,
  cd_dep_count,
  cd_dep_employed_count,
  cd_dep_college_count
LIMIT 100
```

## Required permissions
<a name="connectors-tpcds-required-permissions"></a>

For full details on the IAM policies that this connector requires, review the `Policies` section of the [athena-tpcds.yaml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-tpcds/athena-tpcds.yaml) file. The following list summarizes the required permissions.
+ **Amazon S3 write access** – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
+ **Athena GetQueryExecution** – The connector uses this permission to fast-fail when the upstream Athena query has terminated.

## Performance
<a name="connectors-tpcds-performance"></a>

The Athena TPC-DS connector attempts to parallelize queries based on the scale factor that you choose. Predicate pushdown is performed within the Lambda function.

## License information
<a name="connectors-tpcds-license-information"></a>

The Amazon Athena TPC-DS connector project is licensed under the [Apache-2.0 License](https://www.apache.org/licenses/LICENSE-2.0.html).

## Additional resources
<a name="connectors-tpcds-additional-resources"></a>

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-tpcds) on GitHub.com.

# Amazon Athena Vertica connector
<a name="connectors-vertica"></a>

Vertica is a columnar database platform, deployable in the cloud or on premises, that supports exabyte-scale data warehouses. You can use the Amazon Athena Vertica connector in federated queries to query Vertica data sources from Athena. For example, you can run analytical queries over a data warehouse on Vertica and a data lake in Amazon S3.

This connector does not use Glue Connections to centralize configuration properties in Glue. Connection configuration is done through Lambda.

## Prerequisites
<a name="connectors-vertica-prerequisites"></a>
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).
+ Set up a VPC and a security group before you use this connector. For more information, see [Create a VPC for a data source connector or AWS Glue connection](athena-connectors-vpc-creation.md).

## Limitations
<a name="connectors-vertica-limitations"></a>
+ Because the Athena Vertica connector reads exported Parquet files from Amazon S3, performance of the connector can be slow. When you query large tables, we recommend that you use a [CREATE TABLE AS (SELECT ...)](ctas.md) query and SQL predicates.
+ Currently, due to a known issue in Athena Federated Query, the connector causes Vertica to export all columns of the queried table to Amazon S3, but only the queried columns are visible in the results on the Athena console.
+ Write DDL operations are not supported.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.

## Workflow
<a name="connectors-vertica-workflow"></a>

The following diagram shows the workflow of a query that uses the Vertica connector.

![\[Workflow of a Vertica query from Amazon Athena\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-vertica-1.png)


1. A SQL query is issued against one or more tables in Vertica.

1. The connector parses the SQL query to send the relevant portion to Vertica through the JDBC connection.

1. The connection strings use the user name and password stored in AWS Secrets Manager to gain access to Vertica.

1. The connector wraps the SQL query with a Vertica `EXPORT` command, as in the following example.

   ```
   EXPORT TO PARQUET (directory = 's3://amzn-s3-demo-bucket/folder_name', 
      Compression='Snappy', fileSizeMB=64) OVER() as 
   SELECT
   PATH_ID,
   ...
   SOURCE_ITEMIZED,
   SOURCE_OVERRIDE
   FROM DELETED_OBJECT_SCHEMA.FORM_USAGE_DATA
   WHERE PATH_ID <= 5;
   ```

1. Vertica processes the SQL query and sends the result set to an Amazon S3 bucket. For better throughput, Vertica uses the `EXPORT` option to parallelize the write operation of multiple Parquet files.

1. Athena scans the Amazon S3 bucket to determine the number of files to read for the result set.

1. Athena makes multiple calls to the Lambda function and uses an Apache `ArrowReader` to read the Parquet files from the resulting data set. Multiple calls enable Athena to parallelize the reading of the Amazon S3 files and achieve a throughput of up to 100GB per second.

1. Athena processes the data returned from Vertica with data scanned from the data lake and returns the result.

## Terms
<a name="connectors-vertica-terms"></a>

The following terms relate to the Vertica connector.
+ **Database instance** – Any instance of a Vertica database deployed on Amazon EC2.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.

## Parameters
<a name="connectors-vertica-parameters"></a>

Use the parameters in this section to configure the Vertica connector.

### Glue connections (recommended)
<a name="connectors-vertica-gc"></a>

We recommend that you configure a Vertica connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the Vertica connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type VERTICA
```

**Lambda environment properties**
+ **glue\_connection** – Specifies the name of the Glue connection associated with the federated connector. 
+ **casing\$1mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **none** – Do not change the case of the given schema and table names. This is the default for connectors that have an associated Glue connection. 
  + **upper** – Upper case all given schema and table names.
  + **lower** – Lower case all given schema and table names.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Vertica connector created using Glue connections does not support the use of a multiplexing handler.
The Vertica connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections
<a name="connectors-vertica-legacy"></a>

The Amazon Athena Vertica connector exposes configuration options through Lambda environment variables. You can use the following Lambda environment variables to configure the connector. 
+  **AthenaCatalogName** – Lambda function name 
+  **ExportBucket** – The Amazon S3 bucket where the Vertica query results are exported. 
+  **SpillBucket** – The name of the Amazon S3 bucket where this function can spill data. 
+  **SpillPrefix** – The prefix for the `SpillBucket` location where this function can spill data. 
+  **SecurityGroupIds** – One or more IDs that correspond to the security group that should be applied to the Lambda function (for example, `sg1`, `sg2`, or `sg3`). 
+  **SubnetIds** – One or more subnet IDs that correspond to the subnet that the Lambda function can use to access your data source (for example, `subnet1`, or `subnet2`). 
+  **SecretNameOrPrefix** – The name or prefix of a set of names in Secrets Manager that this function has access to (for example, `vertica-*`). 
+  **VerticaConnectionString** – The Vertica connection details to use by default if no catalog specific connection is defined. The string can optionally use AWS Secrets Manager syntax (for example, `${secret_name}`). 
+  **VPC ID** – The VPC ID to be attached to the Lambda function. 

#### Connection string
<a name="connectors-vertica-connection-string"></a>

Use a JDBC connection string in the following format to connect to a database instance.

```
vertica://jdbc:vertica://host_name:port/database?user=vertica-username&password=vertica-password
```
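As a sketch of how the pieces fit together, the following Python helper (the function name and parameters are illustrative, not part of the connector) assembles a string in this format:

```python
def vertica_connection_string(host: str, port: int, database: str,
                              user: str, password: str) -> str:
    # Assemble the connector's expected JDBC-style connection string.
    # In practice, prefer Secrets Manager placeholders over literal
    # credentials (see "Providing credentials" below).
    return (f"vertica://jdbc:vertica://{host}:{port}/{database}"
            f"?user={user}&password={password}")

conn = vertica_connection_string(
    "vertica.example.com", 5433, "VMart", "dbadmin", "example-password"
)
```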

#### Using a single connection handler
<a name="connectors-vertica-using-a-single-connection-handler"></a>

You can use the following single connection metadata and record handlers to connect to a single Vertica instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | VerticaCompositeHandler | 
| Metadata handler | VerticaMetadataHandler | 
| Record handler | VerticaRecordHandler | 

#### Single connection handler parameters
<a name="connectors-vertica-single-connection-handler-parameters"></a>


****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

#### Providing credentials
<a name="connectors-vertica-providing-credentials"></a>

To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret names**  
The following string has the secret names `${vertica-username}` and `${vertica-password}`. 

```
vertica://jdbc:vertica://host_name:port/database?user=${vertica-username}&password=${vertica-password}
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
vertica://jdbc:vertica://host_name:port/database?user=sample-user&password=sample-password
```

Currently, the Vertica connector recognizes the `vertica-username` and `vertica-password` JDBC properties. 
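The substitution the connector performs can be sketched in a few lines of Python. Here a plain dictionary stands in for Secrets Manager, so this is a simulation of the behavior, not the connector's implementation:

```python
import re

def resolve_secrets(conn_str: str, secrets: dict) -> str:
    # Replace each ${name} placeholder with the value retrieved for
    # that secret name (a dict stands in for Secrets Manager here).
    return re.sub(r"\$\{([^}]+)\}", lambda m: secrets[m.group(1)], conn_str)

resolved = resolve_secrets(
    "vertica://jdbc:vertica://host_name:port/database"
    "?user=${vertica-username}&password=${vertica-password}",
    {"vertica-username": "sample-user", "vertica-password": "sample-password"},
)
```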

#### Spill parameters
<a name="connectors-vertica-spill-parameters"></a>

The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| spill\_bucket | Required. Spill bucket name. | 
| spill\_prefix | Required. Spill bucket key prefix. | 
| spill\_put\_request\_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the Amazon Simple Storage Service API Reference. | 

## Data type support
<a name="connectors-vertica-data-type-support"></a>

The following table shows the supported data types for the Vertica connector.


****  

| Boolean | 
| --- | 
| BigInt | 
| Short | 
| Integer | 
| Long | 
| Float | 
| Double | 
| Date | 
| Varchar | 
| Bytes | 
| BigDecimal | 
| TimeStamp as Varchar | 

## Performance
<a name="connectors-vertica-performance"></a>

The Lambda function performs projection pushdown to decrease the data scanned by the query. `LIMIT` clauses reduce the amount of data scanned, but if you don't provide a predicate, you should expect `SELECT` queries with a `LIMIT` clause to scan at least 16 MB of data. The Vertica connector is resilient to throttling due to concurrency.

## Passthrough queries
<a name="connectors-vertica-passthrough-queries"></a>

The Vertica connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Vertica, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Vertica. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information
<a name="connectors-vertica-license-information"></a>

By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-vertica/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-vertica/LICENSE.txt) file on GitHub.com.

## Additional resources
<a name="connectors-vertica-additional-resources"></a>

For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-vertica/pom.xml) file for the Vertica connector on GitHub.com.

For additional information about this connector, see [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-vertica) on GitHub.com and [Querying a Vertica data source in Amazon Athena using the Athena Federated Query SDK](https://aws.amazon.com/blogs/big-data/querying-a-vertica-data-source-in-amazon-athena-using-the-athena-federated-query-sdk/) in the *AWS Big Data Blog*.