

# Connection types and options for ETL in AWS Glue for Spark
<a name="aws-glue-programming-etl-connect"></a>

In AWS Glue for Spark, various PySpark and Scala methods and transforms specify the connection type using a `connectionType` parameter. They specify connection options using a `connectionOptions` or `options` parameter.

The `connectionType` parameter can take the values shown in the following table. The associated `connectionOptions` (or `options`) parameter values for each type are documented in the following sections. Except where otherwise noted, the parameters apply when the connection is used as a source or sink.

For sample code that demonstrates setting and using connection options, see the homepage for each connection type.


| `connectionType` | Connects to | 
| --- | --- | 
| [dynamodb](aws-glue-programming-etl-connect-dynamodb-home.md) | [Amazon DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/) database | 
| [kinesis](aws-glue-programming-etl-connect-kinesis-home.md) | [Amazon Kinesis Data Streams](https://docs.aws.amazon.com/streams/latest/dev/introduction.html) | 
| [s3](aws-glue-programming-etl-connect-s3-home.md) | [Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/dev/) | 
| [documentdb](aws-glue-programming-etl-connect-documentdb-home.md#aws-glue-programming-etl-connect-documentdb) | [Amazon DocumentDB (with MongoDB compatibility)](https://docs.aws.amazon.com/documentdb/latest/developerguide/) database | 
| [opensearch](aws-glue-programming-etl-connect-opensearch-home.md) | [Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/) | 
| [redshift](aws-glue-programming-etl-connect-redshift-home.md) | [Amazon Redshift](https://aws.amazon.com/redshift/) database | 
| [kafka](aws-glue-programming-etl-connect-kafka-home.md) |  [Kafka](https://kafka.apache.org/) or [Amazon Managed Streaming for Apache Kafka](https://docs.aws.amazon.com/msk/latest/developerguide/what-is-msk.html) | 
| [azurecosmos](aws-glue-programming-etl-connect-azurecosmos-home.md) | Azure Cosmos DB for NoSQL | 
| [azuresql](aws-glue-programming-etl-connect-azuresql-home.md) | Azure SQL | 
| [bigquery](aws-glue-programming-etl-connect-bigquery-home.md) | Google BigQuery | 
| [mongodb](aws-glue-programming-etl-connect-mongodb-home.md) | [MongoDB](https://www.mongodb.com/what-is-mongodb) database, including MongoDB Atlas | 
| [sqlserver](aws-glue-programming-etl-connect-jdbc-home.md) |  Microsoft SQL Server database (see [JDBC connections](aws-glue-programming-etl-connect-jdbc-home.md)) | 
| [mysql](aws-glue-programming-etl-connect-jdbc-home.md) | [MySQL](https://www.mysql.com/) database (see [JDBC connections](aws-glue-programming-etl-connect-jdbc-home.md)) | 
| [oracle](aws-glue-programming-etl-connect-jdbc-home.md) | [Oracle](https://www.oracle.com/database/) database (see [JDBC connections](aws-glue-programming-etl-connect-jdbc-home.md)) | 
| [postgresql](aws-glue-programming-etl-connect-jdbc-home.md) |  [PostgreSQL](https://www.postgresql.org/) database (see [JDBC connections](aws-glue-programming-etl-connect-jdbc-home.md)) | 
| [saphana](aws-glue-programming-etl-connect-saphana-home.md) | SAP HANA | 
| [snowflake](aws-glue-programming-etl-connect-snowflake-home.md) | [Snowflake](https://www.snowflake.com/) data lake | 
| [teradata](aws-glue-programming-etl-connect-teradata-home.md) | Teradata Vantage | 
| [vertica](aws-glue-programming-etl-connect-vertica-home.md) | Vertica | 
| [custom.\*](#aws-glue-programming-etl-connect-market) | Spark, Athena, or JDBC data stores (see [Custom and AWS Marketplace connectionType values](#aws-glue-programming-etl-connect-market)) | 
| [marketplace.\*](#aws-glue-programming-etl-connect-market) | Spark, Athena, or JDBC data stores (see [Custom and AWS Marketplace connectionType values](#aws-glue-programming-etl-connect-market)) | 

# DynamoDB connections
<a name="aws-glue-programming-etl-connect-dynamodb-home"></a>

You can use AWS Glue for Spark to read from and write to tables in DynamoDB. You connect to DynamoDB using IAM permissions attached to your AWS Glue job. AWS Glue supports writing data into another AWS account's DynamoDB table. For more information, see [Cross-account cross-Region access to DynamoDB tables](aws-glue-programming-etl-dynamo-db-cross-account.md).

The original DynamoDB connector uses AWS Glue DynamicFrame objects to work with the data extracted from DynamoDB. AWS Glue 5.0 introduces a new [DynamoDB connector with Spark DataFrame support](aws-glue-programming-etl-connect-dynamodb-dataframe-support.md) that works natively with Spark DataFrames.

In addition to the AWS Glue DynamoDB ETL connector, you can read from DynamoDB using the DynamoDB export connector, which invokes a DynamoDB `ExportTableToPointInTime` request and stores the result in an Amazon S3 location you supply, in [DynamoDB JSON](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.Output.html) format. AWS Glue then creates a DynamicFrame object by reading the data from the Amazon S3 export location.

The DynamoDB writer is available in AWS Glue version 1.0 or later versions. The AWS Glue DynamoDB export connector is available in AWS Glue version 2.0 or later versions. The new DataFrame-based DynamoDB connector is available in AWS Glue version 5.0 or later versions.

For more information about DynamoDB, consult [Amazon DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/) documentation.

**Note**  
The DynamoDB ETL reader does not support filters or pushdown predicates.

## Configuring DynamoDB connections
<a name="aws-glue-programming-etl-connect-dynamodb-configure"></a>

To connect to DynamoDB from AWS Glue, grant the IAM role associated with your AWS Glue job permission to interact with DynamoDB. For more information about permissions necessary to read or write from DynamoDB, consult [Actions, resources, and condition keys for DynamoDB](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazondynamodb.html) in the IAM documentation.

In the following situations, you may need additional configuration:
+ When using the DynamoDB export connector, you will need to configure IAM so your job can request DynamoDB table exports. Additionally, you will need to identify an Amazon S3 bucket for the export and provide appropriate permissions in IAM for DynamoDB to write to it, and for your AWS Glue job to read from it. For more information, consult [Request a table export in DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/S3DataExport_Requesting.html).
+ If your AWS Glue job has specific Amazon VPC connectivity requirements, use the `NETWORK` AWS Glue connection type to provide network options. Because access to DynamoDB is authorized through IAM, there is no need for an AWS Glue DynamoDB connection type.
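
As a rough illustration of the export-connector permissions described above, the job role's policy might combine DynamoDB export actions with access to the export bucket. This is a hedged sketch, not an authoritative policy: the table ARN, account ID, and bucket name `amzn-s3-demo-bucket` are placeholders, and you should confirm the required actions in the linked DynamoDB documentation.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["dynamodb:ExportTableToPointInTime", "dynamodb:DescribeExport"],
      "Resource": "arn:aws:dynamodb:us-east-1:111122223333:table/my-table*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-bucket",
        "arn:aws:s3:::amzn-s3-demo-bucket/*"
      ]
    }
  ]
}
```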

## Reading from and writing to DynamoDB
<a name="aws-glue-programming-etl-connect-dynamodb-read-write"></a>

The following code examples show how to read from and write to DynamoDB tables via the ETL connector. They demonstrate reading from one table and writing to another table.

------
#### [ Python ]

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context= GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={"dynamodb.input.tableName": test_source,
        "dynamodb.throughput.read.percent": "1.0",
        "dynamodb.splits": "100"
    }
)
print(dyf.getNumPartitions())

glue_context.write_dynamic_frame_from_options(
    frame=dyf,
    connection_type="dynamodb",
    connection_options={"dynamodb.output.tableName": test_sink,
        "dynamodb.throughput.write.percent": "1.0"
    }
)

job.commit()
```

------
#### [ Scala ]

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamoDbDataSink
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._


object GlueApp {

  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(SparkContext.getOrCreate())
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType = "dynamodb",
      options = JsonOptions(Map(
        "dynamodb.input.tableName" -> test_source,
        "dynamodb.throughput.read.percent" -> "1.0",
        "dynamodb.splits" -> "100"
      ))
    ).getDynamicFrame()
    
    print(dynamicFrame.getNumPartitions())

    val dynamoDbSink: DynamoDbDataSink =  glueContext.getSinkWithFormat(
      connectionType = "dynamodb",
      options = JsonOptions(Map(
        "dynamodb.output.tableName" -> test_sink,
        "dynamodb.throughput.write.percent" -> "1.0"
      ))
    ).asInstanceOf[DynamoDbDataSink]
    
    dynamoDbSink.writeDynamicFrame(dynamicFrame)

    Job.commit()
  }

}
```

------

## Using the DynamoDB export connector
<a name="aws-glue-programming-etl-connect-dynamodb-export-connector"></a>

The export connector performs better than the ETL connector when the DynamoDB table size is larger than 80 GB. In addition, because the export request is conducted outside of the Spark processes in an AWS Glue job, you can enable [auto scaling of AWS Glue jobs](https://docs.aws.amazon.com/glue/latest/dg/auto-scaling.html) to save DPU usage during the export request. With the export connector, you also do not need to configure the number of splits for Spark executor parallelism or the DynamoDB throughput read percentage.

**Note**  
DynamoDB has specific requirements for invoking `ExportTableToPointInTime` requests. For example, point-in-time recovery (PITR) must be enabled on the table to use this connector. For more information, see [Requesting a table export in DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.Requesting.html). The connector also supports AWS KMS encryption for DynamoDB exports to Amazon S3: supplying a security configuration in the AWS Glue job configuration enables AWS KMS encryption for the export. The KMS key must be in the same Region as the Amazon S3 bucket.  
Additional charges apply for the DynamoDB export and for Amazon S3 storage. Exported data in Amazon S3 persists after a job run finishes, so you can reuse it without additional DynamoDB exports.  
Neither the DynamoDB ETL connector nor the export connector supports filters or pushdown predicates applied at the DynamoDB source.

The following code examples show how to read from a DynamoDB table via the export connector and print the number of partitions.

------
#### [ Python ]

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context= GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "ddb",
        "dynamodb.tableArn": test_source,
        "dynamodb.s3.bucket": bucket_name,
        "dynamodb.s3.prefix": bucket_prefix,
        "dynamodb.s3.bucketOwner": account_id_of_bucket,
    }
)
print(dyf.getNumPartitions())

job.commit()
```

------
#### [ Scala ]

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamoDbDataSink
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._


object GlueApp {

  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(SparkContext.getOrCreate())
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType = "dynamodb",
      options = JsonOptions(Map(
        "dynamodb.export" -> "ddb",
        "dynamodb.tableArn" -> test_source,
        "dynamodb.s3.bucket" -> bucket_name,
        "dynamodb.s3.prefix" -> bucket_prefix,
        "dynamodb.s3.bucketOwner" -> account_id_of_bucket,
      ))
    ).getDynamicFrame()
    
    print(dynamicFrame.getNumPartitions())

    Job.commit()
  }

}
```

------

These examples show how to read via the export connector from an AWS Glue Data Catalog table that has a `dynamodb` classification, and print the number of partitions:

------
#### [ Python ]

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context= GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

dynamicFrame = glue_context.create_dynamic_frame.from_catalog(
        database=catalog_database,
        table_name=catalog_table_name,
        additional_options={
            "dynamodb.export": "ddb", 
            "dynamodb.s3.bucket": s3_bucket,
            "dynamodb.s3.prefix": s3_bucket_prefix
        }
    )
print(dynamicFrame.getNumPartitions())

job.commit()
```

------
#### [ Scala ]

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamoDbDataSink
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._


object GlueApp {

  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(SparkContext.getOrCreate())
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    val dynamicFrame = glueContext.getCatalogSource(
        database = catalog_database,
        tableName = catalog_table_name,
        additionalOptions = JsonOptions(Map(
            "dynamodb.export" -> "ddb", 
            "dynamodb.s3.bucket" -> s3_bucket,
            "dynamodb.s3.prefix" -> s3_bucket_prefix
        ))
    ).getDynamicFrame()
    print(dynamicFrame.getNumPartitions())

    Job.commit()
  }

}
```

------

## Simplifying usage of DynamoDB export JSON
<a name="etl-connect-dynamodb-traversing-structure"></a>

DynamoDB exports performed with the AWS Glue DynamoDB export connector result in JSON files with specific nested structures. For more information, see [Data objects](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/S3DataExport.Output.html). AWS Glue supplies a DynamicFrame transformation that can unnest such structures into an easier-to-use form for downstream applications.

The transform can be invoked in one of two ways. You can set the connection option `"dynamodb.simplifyDDBJson"` with the value `"true"` when calling a method to read from DynamoDB. You can also call the transform as a method independently available in the AWS Glue library.
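
For intuition, the kind of rewrite this transform performs can be sketched in plain Python by recursively stripping the DynamoDB JSON type descriptors (`M`, `S`, `N`, `SS`, `BOOL`, `NULL`, and so on). This is an illustrative sketch of the behavior shown in the schemas that follow, not the connector's actual implementation, and the sample item is hypothetical:

```python
def simplify_ddb_json(value):
    """Recursively strip DynamoDB JSON type descriptors from an exported item."""
    if isinstance(value, dict) and len(value) == 1:
        (tag, inner), = value.items()
        if tag == "M":                 # map: unwrap and recurse into its entries
            return {k: simplify_ddb_json(v) for k, v in inner.items()}
        if tag in ("S", "N"):          # strings and numbers stay as strings
            return inner
        if tag in ("SS", "NS", "BS"):  # sets become arrays of strings
            return list(inner)
        if tag == "BOOL":
            return inner
        if tag == "NULL":
            return None
    if isinstance(value, dict):        # plain attribute map: recurse per key
        return {k: simplify_ddb_json(v) for k, v in value.items()}
    return value

item = {
    "parentMap": {"M": {"childMap": {"M": {"appName": {"S": "app1"},
                                           "updatedAt": {"N": "1700000000"}}}}},
    "strings": {"SS": ["a", "b"]},
    "isDDBJson": {"BOOL": True},
    "nullValue": {"NULL": True},
}
simplified = simplify_ddb_json(item)
```

After the transform, `simplified["parentMap"]["childMap"]["appName"]` is the plain string `"app1"`, mirroring the simplified schema shown below.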

Consider the following schema generated by a DynamoDB export:

```
root
|-- Item: struct
|    |-- parentMap: struct
|    |    |-- M: struct
|    |    |    |-- childMap: struct
|    |    |    |    |-- M: struct
|    |    |    |    |    |-- appName: struct
|    |    |    |    |    |    |-- S: string
|    |    |    |    |    |-- packageName: struct
|    |    |    |    |    |    |-- S: string
|    |    |    |    |    |-- updatedAt: struct
|    |    |    |    |    |    |-- N: string
|    |-- strings: struct
|    |    |-- SS: array
|    |    |    |-- element: string
|    |-- numbers: struct
|    |    |-- NS: array
|    |    |    |-- element: string
|    |-- binaries: struct
|    |    |-- BS: array
|    |    |    |-- element: string
|    |-- isDDBJson: struct
|    |    |-- BOOL: boolean
|    |-- nullValue: struct
|    |    |-- NULL: boolean
```

The `simplifyDDBJson` transform will simplify this to:

```
root
|-- parentMap: struct
|    |-- childMap: struct
|    |    |-- appName: string
|    |    |-- packageName: string
|    |    |-- updatedAt: string
|-- strings: array
|    |-- element: string
|-- numbers: array
|    |-- element: string
|-- binaries: array
|    |-- element: string
|-- isDDBJson: boolean
|-- nullValue: null
```

**Note**  
`simplifyDDBJson` is available in AWS Glue 3.0 and later versions. The `unnestDDBJson` transform is also available to simplify DynamoDB export JSON. We encourage users to transition to `simplifyDDBJson` from `unnestDDBJson`.

## Configuring parallelism in DynamoDB operations
<a name="aws-glue-programming-etl-connect-dynamodb-parallelism"></a>

To improve performance, you can tune certain parameters available for the DynamoDB connector. Your goal when tuning parallelism parameters is to maximize the use of the provisioned AWS Glue workers. If you then need more performance, we recommend scaling out your job by increasing the number of DPUs.

 You can alter the parallelism in a DynamoDB read operation using the `dynamodb.splits` parameter when using the ETL connector. When reading with the export connector, you do not need to configure the number of splits for Spark executor parallelism. You can alter the parallelism in a DynamoDB write operation with `dynamodb.output.numParallelTasks`.

**Reading with the DynamoDB ETL connector**

We recommend calculating `dynamodb.splits` based on the maximum number of workers set in your job configuration and the following `numSlots` calculation. If auto scaling is enabled, the actual number of available workers may vary under that cap. For more information about setting the maximum number of workers, see **Number of workers** (`NumberOfWorkers`) in [Configuring job properties for Spark jobs in AWS Glue](add-job.md). 
+ `numExecutors = NumberOfWorkers - 1`

   For context, one executor is reserved for the Spark driver; other executors are used to process data.
+ `numSlotsPerExecutor =`

------
#### [ AWS Glue 3.0 and later versions ]
  + `4` if `WorkerType` is `G.1X`
  + `8` if `WorkerType` is `G.2X`
  + `16` if `WorkerType` is `G.4X`
  + `32` if `WorkerType` is `G.8X`

------
#### [ AWS Glue 2.0 and legacy versions ]
  + `8` if `WorkerType` is `G.1X`
  + `16` if `WorkerType` is `G.2X`

------
+ `numSlots = numSlotsPerExecutor * numExecutors`

We recommend you set `dynamodb.splits` to the number of slots available, `numSlots`.
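
The calculation above can be sketched as a small helper. The function name and signature are illustrative only, not part of any AWS Glue API:

```python
def recommended_splits(number_of_workers, worker_type, glue_version=3.0):
    """Estimate a dynamodb.splits value (numSlots) from the job configuration."""
    num_executors = number_of_workers - 1  # one executor is reserved for the Spark driver
    if glue_version >= 3.0:
        slots_per_executor = {"G.1X": 4, "G.2X": 8, "G.4X": 16, "G.8X": 32}[worker_type]
    else:  # AWS Glue 2.0 and legacy versions
        slots_per_executor = {"G.1X": 8, "G.2X": 16}[worker_type]
    return slots_per_executor * num_executors  # numSlots

recommended_splits(10, "G.2X")  # a 10-worker G.2X job on AWS Glue 3.0+
```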

**Writing to DynamoDB**

The `dynamodb.output.numParallelTasks` parameter is used to determine WCU per Spark task, using the following calculation:

`permittedWcuPerTask = ( TableWCU * dynamodb.throughput.write.percent ) / dynamodb.output.numParallelTasks`

The DynamoDB writer functions best when the configuration accurately represents the number of Spark tasks writing to DynamoDB. In some cases, you may need to override the default calculation to improve write performance. If you do not specify this parameter, the permitted WCU per Spark task is automatically calculated with the following formula:
+ `numPartitions = dynamicframe.getNumPartitions()`
+ `numSlots` (as defined previously in this section)
+ `numParallelTasks = min(numPartitions, numSlots)`
+ Example 1. DPU=10, WorkerType=Standard. Input DynamicFrame has 100 RDD partitions.
  + `numPartitions = 100`
  + `numExecutors = (10 - 1) * 2 - 1 = 17`
  + `numSlots = 4 * 17 = 68`
  + `numParallelTasks = min(100, 68) = 68`
+ Example 2. DPU=10, WorkerType=Standard. Input DynamicFrame has 20 RDD partitions.
  + `numPartitions = 20`
  + `numExecutors = (10 - 1) * 2 - 1 = 17`
  + `numSlots = 4 * 17 = 68`
  + `numParallelTasks = min(20, 68) = 20`
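
The default calculation and the examples above can be reproduced with a short sketch; the function name and parameters are illustrative, not an AWS Glue API:

```python
def permitted_wcu_per_task(table_wcu, write_percent, num_partitions, num_slots,
                           num_parallel_tasks=None):
    """Per-task WCU budget, following the permittedWcuPerTask formula above."""
    if num_parallel_tasks is None:
        # Default: the task count is derived from partitions and slots.
        num_parallel_tasks = min(num_partitions, num_slots)
    return (table_wcu * write_percent) / num_parallel_tasks

# Example 2 above: 20 partitions, 68 slots -> min(20, 68) = 20 parallel tasks
permitted_wcu_per_task(40000, 0.5, num_partitions=20, num_slots=68)
```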

**Note**  
Jobs on legacy AWS Glue versions and those using Standard workers require different methods to calculate the number of slots. If you need to tune the performance of these jobs, we recommend transitioning to supported AWS Glue versions.

## DynamoDB connection option reference
<a name="aws-glue-programming-etl-connect-dynamodb"></a>

Designates a connection to Amazon DynamoDB.

Connection options differ for a source connection and a sink connection.

### "connectionType": "dynamodb" with the ETL connector as source
<a name="etl-connect-dynamodb-as-source"></a>

Use the following connection options with `"connectionType": "dynamodb"` as a source, when using the AWS Glue DynamoDB ETL connector:
+ `"dynamodb.input.tableName"`: (Required) The DynamoDB table to read from.
+ `"dynamodb.throughput.read.percent"`: (Optional) The percentage of read capacity units (RCU) to use. The default is set to "0.5". Acceptable values are from "0.1" to "1.5", inclusive.
  + `0.5` represents the default read rate, meaning that AWS Glue will attempt to consume half of the read capacity of the table. If you increase the value above `0.5`, AWS Glue increases the request rate; decreasing the value below `0.5` decreases the read request rate. (The actual read rate will vary, depending on factors such as whether there is a uniform key distribution in the DynamoDB table.)
  + When the DynamoDB table is in on-demand mode, AWS Glue treats the read capacity of the table as `40000`. For exporting a large table, we recommend switching your DynamoDB table to on-demand mode.
+ `"dynamodb.splits"`: (Optional) Defines how many splits we partition this DynamoDB table into while reading. The default is set to "1". Acceptable values are from "1" to "1,000,000", inclusive.

  `1` means there is no parallelism. We highly recommend that you specify a larger value for better performance, using the guidance in [Configuring parallelism in DynamoDB operations](#aws-glue-programming-etl-connect-dynamodb-parallelism).
+ `"dynamodb.sts.roleArn"`: (Optional) The IAM role ARN to be assumed for cross-account access. This parameter is available in AWS Glue 1.0 or later.
+ `"dynamodb.sts.roleSessionName"`: (Optional) STS session name. The default is set to "glue-dynamodb-read-sts-session". This parameter is available in AWS Glue 1.0 or later.
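
To see what a given `dynamodb.throughput.read.percent` implies, the reader's RCU budget can be sketched as follows. On-demand tables are treated as having 40,000 RCU, per the option description above; the function name is illustrative:

```python
def reader_rcu_budget(read_percent=0.5, provisioned_rcu=None):
    """RCU the AWS Glue DynamoDB reader attempts to consume (sketch).

    provisioned_rcu=None models an on-demand table, which AWS Glue
    treats as having a read capacity of 40000.
    """
    table_rcu = 40000 if provisioned_rcu is None else provisioned_rcu
    return table_rcu * float(read_percent)

reader_rcu_budget()           # on-demand table at the default 0.5
reader_rcu_budget(1.0, 1000)  # fully consume a 1000-RCU provisioned table
```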

### "connectionType": "dynamodb" with the AWS Glue DynamoDB export connector as source
<a name="etl-connect-dynamodb-as-source-export-connector"></a>

Use the following connection options with `"connectionType": "dynamodb"` as a source, when using the AWS Glue DynamoDB export connector, which is available only in AWS Glue version 2.0 and later:
+ `"dynamodb.export"`: (Required) A string value:
  + If set to `ddb`, enables the AWS Glue DynamoDB export connector, where a new `ExportTableToPointInTime` request is invoked during the AWS Glue job. A new export is generated in the location specified by `dynamodb.s3.bucket` and `dynamodb.s3.prefix`.
  + If set to `s3`, enables the AWS Glue DynamoDB export connector but skips the creation of a new DynamoDB export and instead uses `dynamodb.s3.bucket` and `dynamodb.s3.prefix` as the Amazon S3 location of a past export of that table.
+ `"dynamodb.tableArn"`: (Required) The DynamoDB table to read from.
+ `"dynamodb.unnestDDBJson"`: (Optional) Default: false. Valid values: boolean. If set to true, performs an unnest transformation of the DynamoDB JSON structure that is present in exports. It is an error to set `"dynamodb.unnestDDBJson"` and `"dynamodb.simplifyDDBJson"` to true at the same time. In AWS Glue 3.0 and later versions, we recommend you use `"dynamodb.simplifyDDBJson"` for better behavior when simplifying DynamoDB Map types. For more information, see [Simplifying usage of DynamoDB export JSON](#etl-connect-dynamodb-traversing-structure). 
+ `"dynamodb.simplifyDDBJson"`: (Optional) Default: false. Valid values: boolean. If set to true, performs a transformation to simplify the schema of the DynamoDB JSON structure that is present in exports. This has the same purpose as the `"dynamodb.unnestDDBJson"` option but provides better support for DynamoDB Map types or even nested Map types in your DynamoDB table. This option is available in AWS Glue 3.0 and later versions. It is an error to set `"dynamodb.unnestDDBJson"` and `"dynamodb.simplifyDDBJson"` to true at the same time. For more information, see [Simplifying usage of DynamoDB export JSON](#etl-connect-dynamodb-traversing-structure).
+ `"dynamodb.s3.bucket"`: (Optional) Indicates the Amazon S3 bucket location in which the DynamoDB `ExportTableToPointInTime` process is to be conducted. The file format for the export is DynamoDB JSON.
  + `"dynamodb.s3.prefix"`: (Optional) Indicates the Amazon S3 prefix location inside the Amazon S3 bucket in which the DynamoDB `ExportTableToPointInTime` loads are to be stored. If neither `dynamodb.s3.prefix` nor `dynamodb.s3.bucket` are specified, these values will default to the Temporary Directory location specified in the AWS Glue job configuration. For more information, see [Special Parameters Used by AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html).
  + `"dynamodb.s3.bucketOwner"`: Indicates the bucket owner needed for cross-account Amazon S3 access.
+ `"dynamodb.sts.roleArn"`: (Optional) The IAM role ARN to be assumed for cross-account access and/or cross-Region access for the DynamoDB table. Note: The same IAM role ARN will be used to access the Amazon S3 location specified for the `ExportTableToPointInTime` request.
+ `"dynamodb.sts.roleSessionName"`: (Optional) STS session name. The default is set to "glue-dynamodb-read-sts-session".
+ `"dynamodb.exportTime"`: (Optional) Valid values: strings representing ISO-8601 instants. The point in time at which the export should be made.
+ `"dynamodb.sts.region"`: (Required if making a cross-Region call using a regional endpoint) The Region hosting the DynamoDB table you want to read.

### "connectionType": "dynamodb" with the ETL connector as sink
<a name="etl-connect-dynamodb-as-sink"></a>

Use the following connection options with `"connectionType": "dynamodb"` as a sink:
+ `"dynamodb.output.tableName"`: (Required) The DynamoDB table to write to.
+ `"dynamodb.throughput.write.percent"`: (Optional) The percentage of write capacity units (WCU) to use. The default is set to "0.5". Acceptable values are from "0.1" to "1.5", inclusive.
  + `0.5` represents the default write rate, meaning that AWS Glue will attempt to consume half of the write capacity of the table. If you increase the value above 0.5, AWS Glue increases the request rate; decreasing the value below 0.5 decreases the write request rate. (The actual write rate will vary, depending on factors such as whether there is a uniform key distribution in the DynamoDB table).
  + When the DynamoDB table is in on-demand mode, AWS Glue treats the write capacity of the table as `40000`. For importing a large table, we recommend switching your DynamoDB table to on-demand mode.
+ `"dynamodb.output.numParallelTasks"`: (Optional) Defines how many parallel tasks write into DynamoDB at the same time. Used to calculate the permitted WCU per Spark task. In most cases, AWS Glue will calculate a reasonable default for this value. For more information, see [Configuring parallelism in DynamoDB operations](#aws-glue-programming-etl-connect-dynamodb-parallelism).
+ `"dynamodb.output.retry"`: (Optional) Defines how many retries we perform when there is a `ProvisionedThroughputExceededException` from DynamoDB. The default is set to "10".
+ `"dynamodb.sts.roleArn"`: (Optional) The IAM role ARN to be assumed for cross-account access.
+ `"dynamodb.sts.roleSessionName"`: (Optional) STS session name. The default is set to "glue-dynamodb-write-sts-session".

# Cross-account cross-Region access to DynamoDB tables
<a name="aws-glue-programming-etl-dynamo-db-cross-account"></a>

AWS Glue ETL jobs support both cross-Region and cross-account access to DynamoDB tables: you can read data from, and write data into, a DynamoDB table that belongs to another AWS account, and you can read from and write into a DynamoDB table in another Region. This section gives instructions on setting up the access and provides an example script.

The procedures in this section reference an IAM tutorial for creating an IAM role and granting access to the role. The tutorial also discusses assuming a role, but here you will instead use a job script to assume the role in AWS Glue. This tutorial also contains information about general cross-account practices. For more information, see [Tutorial: Delegate Access Across AWS Accounts Using IAM Roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html) in the *IAM User Guide*.

## Create a role
<a name="aws-glue-programming-etl-dynamo-db-create-role"></a>

Follow [step 1 in the tutorial](https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html#tutorial_cross-account-with-roles-1) to create an IAM role in account A. When defining the permissions of the role, you can choose to attach existing policies such as `AmazonDynamoDBReadOnlyAccess`, or `AmazonDynamoDBFullAccess` to allow the role to read/write DynamoDB. The following example shows creating a role named `DynamoDBCrossAccessRole`, with the permission policy `AmazonDynamoDBFullAccess`.

## Grant access to the role
<a name="aws-glue-programming-etl-dynamo-db-grant-access"></a>

Follow [step 2 in the tutorial](https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html#tutorial_cross-account-with-roles-2) in the *IAM User Guide* to allow account B to switch to the newly-created role. The following example creates a new policy with the following statement:

------
#### [ JSON ]

****  

```
{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Action": "sts:AssumeRole",
    "Resource": "arn:aws:iam::111122223333:role/DynamoDBCrossAccessRole"
  }
}
```

------

Then, you can attach this policy to the group/role/user you would like to use to access DynamoDB.

## Assume the role in the AWS Glue job script
<a name="aws-glue-programming-etl-dynamo-db-assume-role"></a>

Now, you can log in to account B and create an AWS Glue job. To create a job, refer to the instructions at [Configuring job properties for Spark jobs in AWS Glue](add-job.md). 

In the job script, you need to use the `dynamodb.sts.roleArn` parameter to assume the `DynamoDBCrossAccessRole` role. Assuming this role allows you to obtain temporary credentials, which are used to access the DynamoDB table in account A. Review these example scripts.

For a cross-account read across regions (ETL connector):

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context= GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

dyf = glue_context.create_dynamic_frame_from_options(
    connection_type="dynamodb",
    connection_options={
    "dynamodb.region": "us-east-1",
    "dynamodb.input.tableName": "test_source",
    "dynamodb.sts.roleArn": "<DynamoDBCrossAccessRole's ARN>"
    }
)
dyf.show()
job.commit()
```

For a cross-account read across regions (export connector):

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context= GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

dyf = glue_context.create_dynamic_frame_from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "ddb",
        "dynamodb.tableArn": "<test_source ARN>",
        "dynamodb.sts.roleArn": "<DynamoDBCrossAccessRole's ARN>"
    }
)
dyf.show()
job.commit()
```

For a read and cross-account write across regions:

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
 
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context= GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)
 
dyf = glue_context.create_dynamic_frame_from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.region": "us-east-1",
        "dynamodb.input.tableName": "test_source"
    }
)
dyf.show()
 
glue_context.write_dynamic_frame_from_options(
    frame=dyf,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.region": "us-west-2",
        "dynamodb.output.tableName": "test_sink",
        "dynamodb.sts.roleArn": "<DynamoDBCrossAccessRole's ARN>"
    }
)
 
job.commit()
```

# DynamoDB connector with Spark DataFrame support
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-support"></a>

The DynamoDB connector with Spark DataFrame support allows you to read from and write to tables in DynamoDB using Spark DataFrame APIs. The connector setup steps are the same as for the DynamicFrame-based connector and can be found [here](aws-glue-programming-etl-connect-dynamodb-home.md#aws-glue-programming-etl-connect-dynamodb-configure).

To load the DataFrame-based connector library, attach a DynamoDB connection to the AWS Glue job.

**Note**  
The AWS Glue console currently does not support creating a DynamoDB connection. You can use the AWS CLI ([CreateConnection](https://docs.aws.amazon.com/cli/latest/reference/glue/create-connection.html)) to create a DynamoDB connection:  

```
aws glue create-connection \
    --connection-input '{
        "Name": "my-dynamodb-connection",
        "ConnectionType": "DYNAMODB",
        "ConnectionProperties": {},
        "ValidateCredentials": false,
        "ValidateForComputeEnvironments": ["SPARK"]
    }'
```

After creating the DynamoDB connection, you can attach it to your AWS Glue job through the CLI ([CreateJob](https://docs.aws.amazon.com/cli/latest/reference/glue/create-job.html), [UpdateJob](https://docs.aws.amazon.com/cli/latest/reference/glue/update-job.html)) or directly on the **Job details** page:

![\[Attaching a DynamoDB connection on the Job details page\]](http://docs.aws.amazon.com/glue/latest/dg/images/dynamodb-dataframe-connector.png)


After confirming that a connection of type DYNAMODB is attached to your AWS Glue job, you can use the following read, write, and export operations from the DataFrame-based connector.

## Reading from and writing to DynamoDB with the DataFrame-based connector
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-read-write"></a>

The following code examples show how to read from and write to DynamoDB tables via the DataFrame-based connector. They demonstrate reading from one table and writing to another table.

------
#### [ Python ]

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from DynamoDB
df = spark.read.format("dynamodb") \
    .option("dynamodb.input.tableName", "test-source") \
    .option("dynamodb.throughput.read.ratio", "0.5") \
    .option("dynamodb.consistentRead", "false") \
    .load()

print(df.rdd.getNumPartitions())

# Write to DynamoDB
df.write \
  .format("dynamodb") \
  .option("dynamodb.output.tableName", "test-sink") \
  .option("dynamodb.throughput.write.ratio", "0.5") \
  .option("dynamodb.item.size.check.enabled", "true") \
  .save()

job.commit()
```

------
#### [ Scala ]

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    
    val glueContext = new GlueContext(SparkContext.getOrCreate())
    val spark = glueContext.getSparkSession
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    val df = spark.read
      .format("dynamodb")
      .option("dynamodb.input.tableName", "test-source")
      .option("dynamodb.throughput.read.ratio", "0.5")
      .option("dynamodb.consistentRead", "false")
      .load()

    print(df.rdd.getNumPartitions)

    df.write
      .format("dynamodb")
      .option("dynamodb.output.tableName", "test-sink")
      .option("dynamodb.throughput.write.ratio", "0.5")
      .option("dynamodb.item.size.check.enabled", "true")
      .save()

    Job.commit()
  }
}
```

------

## Using DynamoDB export via the DataFrame-based connector
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-export"></a>

The export operation is preferred over the read operation for DynamoDB tables larger than 80 GB. The following code examples show how to read from a table, export to S3, and print the number of partitions via the DataFrame-based connector.

**Note**  
The DynamoDB export functionality is available through the Scala `DynamoDBExport` object. Python users can access it via Spark's JVM interop or use the AWS SDK for Python (boto3) with the DynamoDB `ExportTableToPointInTime` API.

------
#### [ Scala ]

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.{GlueArgParser, Job}
import org.apache.spark.SparkContext
import glue.spark.dynamodb.DynamoDBExport
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(SparkContext.getOrCreate())
    val spark = glueContext.getSparkSession
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    val options = Map(
      "dynamodb.export" -> "ddb",
      "dynamodb.tableArn" -> "arn:aws:dynamodb:us-east-1:123456789012:table/my-table",
      "dynamodb.s3.bucket" -> "my-s3-bucket",
      "dynamodb.s3.prefix" -> "my-s3-prefix",
      "dynamodb.simplifyDDBJson" -> "true"
    )
    val df = DynamoDBExport.fullExport(spark, options)
    
    print(df.rdd.getNumPartitions)
    df.count()
    
    Job.commit()
  }
}
```

------

## Configuration Options
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-options"></a>

### Read options
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-read-options"></a>


| Option | Description | Default | 
| --- | --- | --- | 
| dynamodb.input.tableName | DynamoDB table name (required) | - | 
| dynamodb.throughput.read | The read capacity units (RCU) to use. If unspecified, dynamodb.throughput.read.ratio is used for calculation. | - | 
| dynamodb.throughput.read.ratio | The ratio of read capacity units (RCU) to use | 0.5 | 
| dynamodb.table.read.capacity | The read capacity of the on-demand table used for calculating the throughput. This parameter takes effect only for on-demand capacity tables. Defaults to the table's warm throughput read units. | - | 
| dynamodb.splits | Defines how many segments are used in parallel scan operations. If not provided, the connector calculates a reasonable default value. | - | 
| dynamodb.consistentRead | Whether to use strongly consistent reads | FALSE | 
| dynamodb.input.retry | The number of retries performed when there is a retryable exception. | 10 | 

### Write options
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-write-options"></a>


| Option | Description | Default | 
| --- | --- | --- | 
| dynamodb.output.tableName | DynamoDB table name (required) | - | 
| dynamodb.throughput.write | The write capacity units (WCU) to use. If unspecified, dynamodb.throughput.write.ratio is used for calculation. | - | 
| dynamodb.throughput.write.ratio | The ratio of write capacity units (WCU) to use | 0.5 | 
| dynamodb.table.write.capacity | The write capacity of the on-demand table used for calculating the throughput. This parameter takes effect only for on-demand capacity tables. Defaults to the table's warm throughput write units. | - | 
| dynamodb.item.size.check.enabled | If true, the connector calculates each item's size before writing to the DynamoDB table and aborts if the size exceeds the maximum. | TRUE | 
| dynamodb.output.retry | The number of retries performed when there is a retryable exception. | 10 | 

### Export options
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-export-options"></a>


| Option | Description | Default | 
| --- | --- | --- | 
| dynamodb.export | If set to ddb, enables the AWS Glue DynamoDB export connector, where a new ExportTableToPointInTime request is invoked during the AWS Glue job. A new export is generated in the location passed from dynamodb.s3.bucket and dynamodb.s3.prefix. If set to s3, enables the AWS Glue DynamoDB export connector but skips the creation of a new DynamoDB export and instead uses dynamodb.s3.bucket and dynamodb.s3.prefix as the Amazon S3 location of a past export of that table. | ddb | 
| dynamodb.tableArn | The DynamoDB table to read from. Required if dynamodb.export is set to ddb. |  | 
| dynamodb.simplifyDDBJson | If set to true, performs a transformation to simplify the schema of the DynamoDB JSON structure that is present in exports. | FALSE | 
| dynamodb.s3.bucket | The S3 bucket to store temporary data during DynamoDB export (required) |  | 
| dynamodb.s3.prefix | The S3 prefix to store temporary data during DynamoDB export |  | 
| dynamodb.s3.bucketOwner | Indicate the bucket owner needed for cross-account Amazon S3 access |  | 
| dynamodb.s3.sse.algorithm | Type of encryption used on the bucket where temporary data will be stored. Valid values are AES256 and KMS. |  | 
| dynamodb.s3.sse.kmsKeyId | The ID of the AWS KMS managed key used to encrypt the S3 bucket where temporary data will be stored (if applicable). |  | 
| dynamodb.exportTime | A point-in-time at which the export should be made. Valid values: strings representing ISO-8601 instants. |  | 
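For example, `dynamodb.exportTime` accepts a string representing an ISO-8601 instant. The following sketch assembles export options with a point-in-time timestamp; the table ARN, bucket, prefix, and timestamp are placeholders, and the option names follow the table above:

```python
from datetime import datetime, timezone

# Build an ISO-8601 instant for dynamodb.exportTime (placeholder date).
export_time = datetime(2023, 5, 1, 12, 0, 0, tzinfo=timezone.utc).isoformat()

# Export options as described in the table above; ARN and bucket are placeholders.
export_options = {
    "dynamodb.export": "ddb",
    "dynamodb.tableArn": "arn:aws:dynamodb:us-east-1:123456789012:table/my-table",
    "dynamodb.s3.bucket": "my-s3-bucket",
    "dynamodb.s3.prefix": "my-s3-prefix",
    "dynamodb.simplifyDDBJson": "true",
    "dynamodb.exportTime": export_time,
}
print(export_time)  # 2023-05-01T12:00:00+00:00
```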

### General options
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-general-options"></a>


| Option | Description | Default | 
| --- | --- | --- | 
| dynamodb.sts.roleArn | The IAM role ARN to be assumed for cross-account access | - | 
| dynamodb.sts.roleSessionName | STS session name | glue-dynamodb-sts-session | 
| dynamodb.sts.region | Region for the STS client (for cross-region role assumption) | Same as region option | 

# Kinesis connections
<a name="aws-glue-programming-etl-connect-kinesis-home"></a>

You can use a Kinesis connection to read and write to Amazon Kinesis data streams using information stored in a Data Catalog table, or by providing information to directly access the data stream. You can read information from Kinesis into a Spark DataFrame, then convert it to an AWS Glue DynamicFrame. You can write DynamicFrames to Kinesis in JSON format. If you directly access the data stream, use these options to provide the information about how to access the data stream.

If you use `getCatalogSource` or `create_data_frame_from_catalog` to consume records from a Kinesis streaming source, the job has the Data Catalog database and table name information, and can use that to obtain some basic parameters for reading from the Kinesis streaming source. If you use `getSource`, `getSourceWithFormat`, `createDataFrameFromOptions` or `create_data_frame_from_options`, you must specify these basic parameters using the connection options described here.

You can specify the connection options for Kinesis using the following arguments for the specified methods in the `GlueContext` class.
+ Scala
  + `connectionOptions`: Use with `getSource`, `createDataFrameFromOptions`, `getSink` 
  + `additionalOptions`: Use with `getCatalogSource`, `getCatalogSink`
  + `options`: Use with `getSourceWithFormat`, `getSinkWithFormat`
+ Python
  + `connection_options`: Use with `create_data_frame_from_options`, `write_dynamic_frame_from_options`
  + `additional_options`: Use with `create_data_frame_from_catalog`, `write_dynamic_frame_from_catalog`
  + `options`: Use with `getSource`, `getSink`

For notes and restrictions about Streaming ETL jobs, consult [Streaming ETL notes and restrictions](add-job-streaming.md#create-job-streaming-restrictions).

## Configure Kinesis
<a name="aws-glue-programming-etl-connect-kinesis-configure"></a>

To connect to a Kinesis data stream in an AWS Glue Spark job, you will need some prerequisites:
+ If reading, the AWS Glue job must have Read access level IAM permissions to the Kinesis data stream.
+ If writing, the AWS Glue job must have Write access level IAM permissions to the Kinesis data stream.

In certain cases, you will need to configure additional prerequisites:
+ If your AWS Glue job is configured with **Additional network connections** (typically to connect to other datasets) and one of those connections provides Amazon VPC **Network options**, this will direct your job to communicate over Amazon VPC. In this case you will also need to configure your Kinesis data stream to communicate over Amazon VPC. You can do this by creating an interface VPC endpoint between your Amazon VPC and Kinesis data stream. For more information, see [Using Kinesis Data Streams with Interface VPC Endpoints](https://docs.aws.amazon.com//streams/latest/dev/vpc.html).
+ When specifying Amazon Kinesis Data Streams in another account, you must set up the roles and policies to allow cross-account access. For more information, see [ Example: Read From a Kinesis Stream in a Different Account](https://docs.aws.amazon.com/kinesisanalytics/latest/java/examples-cross.html).

For more information about Streaming ETL job prerequisites, consult [Streaming ETL jobs in AWS Glue](add-job-streaming.md).

## Example: Reading from Kinesis streams
<a name="aws-glue-programming-etl-connect-kinesis-read"></a>


Used in conjunction with [forEachBatch](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-forEachBatch).

Example for Amazon Kinesis streaming source:

```
kinesis_options = {
    "streamARN": "arn:aws:kinesis:us-east-2:777788889999:stream/fromOptionsStream",
    "startingPosition": "TRIM_HORIZON",
    "inferSchema": "true",
    "classification": "json"
}
data_frame_datasource0 = glueContext.create_data_frame.from_options(connection_type="kinesis", connection_options=kinesis_options)
```

## Example: Writing to Kinesis streams
<a name="aws-glue-programming-etl-connect-kinesis-write"></a>

Used in conjunction with [forEachBatch](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-forEachBatch).

Example for an Amazon Kinesis streaming sink, where `frameToWrite` is a DynamicFrame produced earlier in the job:

```
glueContext.write_dynamic_frame_from_options(
    frame=frameToWrite,
    connection_type="kinesis",
    connection_options={
        "partitionKey": "part1",
        "streamARN": "arn:aws:kinesis:us-east-1:111122223333:stream/streamName"
    }
)
```

## Kinesis connection option reference
<a name="aws-glue-programming-etl-connect-kinesis"></a>

Designates connection options for Amazon Kinesis Data Streams.

Use the following connection options for Kinesis streaming data sources: 
+ `"streamARN"` (Required) Used for Read/Write. The ARN of the Kinesis data stream.
+ `"classification"` (Required for read) Used for Read. The file format used by the data in the record. Required unless provided through the Data Catalog.
+ `"streamName"` – (Optional) Used for Read. The name of a Kinesis data stream to read from. Used with `endpointUrl`.
+ `"endpointUrl"` – (Optional) Used for Read. Default: "https://kinesis.us-east-1.amazonaws.com". The AWS endpoint of the Kinesis stream. You do not need to change this unless you are connecting to a special region.
+ `"partitionKey"` – (Optional) Used for Write. The Kinesis partition key used when producing records.
+ `"delimiter"` (Optional) Used for Read. The value separator used when `classification` is CSV. The default is `","`.
+ `"startingPosition"`: (Optional) Used for Read. The starting position in the Kinesis data stream to read data from. The possible values are `"latest"`, `"trim_horizon"`, `"earliest"`, or a Timestamp string in UTC format in the pattern `yyyy-mm-ddTHH:MM:SSZ` (where `Z` represents a UTC timezone offset with a +/-. For example "2023-04-04T08:00:00-04:00"). The default value is `"latest"`. Note: the Timestamp string in UTC format for `"startingPosition"` is supported only for AWS Glue version 4.0 or later.
+ `"failOnDataLoss"`: (Optional) Fail the job if any active shard is missing or expired. The default value is `"false"`.
+ `"awsSTSRoleARN"`: (Optional) Used for Read/Write. The Amazon Resource Name (ARN) of the role to assume using AWS Security Token Service (AWS STS). This role must have permissions for describe or read record operations for the Kinesis data stream. You must use this parameter when accessing a data stream in a different account. Used in conjunction with `"awsSTSSessionName"`.
+ `"awsSTSSessionName"`: (Optional) Used for Read/Write. An identifier for the session assuming the role using AWS STS. You must use this parameter when accessing a data stream in a different account. Used in conjunction with `"awsSTSRoleARN"`.
+ `"awsSTSEndpoint"`: (Optional) The AWS STS endpoint to use when connecting to Kinesis with an assumed role. This allows using the regional AWS STS endpoint in a VPC, which is not possible with the default global endpoint.
+ `"maxFetchTimeInMs"`: (Optional) Used for Read. The maximum time spent for the job executor to read records for the current batch from the Kinesis data stream, specified in milliseconds (ms). Multiple `GetRecords` API calls may be made within this time. The default value is `1000`.
+ `"maxFetchRecordsPerShard"`: (Optional) Used for Read. The maximum number of records to fetch per shard in the Kinesis data stream per microbatch. Note: The client can exceed this limit if the streaming job has already read extra records from Kinesis (in the same get-records call). If `maxFetchRecordsPerShard` needs to be strict then it needs to be a multiple of `maxRecordPerRead`. The default value is `100000`.
+ `"maxRecordPerRead"`: (Optional) Used for Read. The maximum number of records to fetch from the Kinesis data stream in each `getRecords` operation. The default value is `10000`.
+ `"addIdleTimeBetweenReads"`: (Optional) Used for Read. Adds a time delay between two consecutive `getRecords` operations. The default value is `"False"`. This option is only configurable for Glue version 2.0 and above. 
+ `"idleTimeBetweenReadsInMs"`: (Optional) Used for Read. The minimum time delay between two consecutive `getRecords` operations, specified in ms. The default value is `1000`. This option is only configurable for Glue version 2.0 and above. 
+ `"describeShardInterval"`: (Optional) Used for Read. The minimum time interval between two `ListShards` API calls for your script to consider resharding. For more information, see [Strategies for Resharding](https://docs.aws.amazon.com//streams/latest/dev/kinesis-using-sdk-java-resharding-strategies.html) in *Amazon Kinesis Data Streams Developer Guide*. The default value is `1s`.
+ `"numRetries"`: (Optional) Used for Read. The maximum number of retries for Kinesis Data Streams API requests. The default value is `3`.
+ `"retryIntervalMs"`: (Optional) Used for Read. The cool-off time period (specified in ms) before retrying the Kinesis Data Streams API call. The default value is `1000`.
+ `"maxRetryIntervalMs"`: (Optional) Used for Read. The maximum cool-off time period (specified in ms) between two retries of a Kinesis Data Streams API call. The default value is `10000`.
+ `"avoidEmptyBatches"`: (Optional) Used for Read. Avoids creating an empty microbatch job by checking for unread data in the Kinesis data stream before the batch is started. The default value is `"False"`.
+ `"schema"`: (Required when inferSchema set to false) Used for Read. The schema to use to process the payload. If classification is `avro` the provided schema must be in the Avro schema format. If the classification is not `avro` the provided schema must be in the DDL schema format.

  The following are schema examples.

------
#### [ Example in DDL schema format ]

  ```
  `column1` INT, `column2` STRING , `column3` FLOAT
  ```

------
#### [ Example in Avro schema format ]

  ```
  {
    "type":"array",
    "items":
    {
      "type":"record",
      "name":"test",
      "fields":
      [
        {
          "name":"_id",
          "type":"string"
        },
        {
          "name":"index",
          "type":
          [
            "int",
            "string",
            "float"
          ]
        }
      ]
    }
  }
  ```

------
+ `"inferSchema"`: (Optional) Used for Read. The default value is 'false'. If set to 'true', the schema will be detected at runtime from the payload within `foreachbatch`.
+ `"avroSchema"`: (Deprecated) Used for Read. Parameter used to specify a schema of Avro data when Avro format is used. This parameter is now deprecated. Use the `schema` parameter.
+ `"addRecordTimestamp"`: (Optional) Used for Read. When this option is set to 'true', the data output will contain an additional column named "__src_timestamp" that indicates the time when the corresponding record was received by the stream. The default value is 'false'. This option is supported in AWS Glue version 4.0 or later.
+ `"emitConsumerLagMetrics"`: (Optional) Used for Read. When the option is set to 'true', for each batch, it will emit the metrics for the duration between the oldest record received by the stream and the time it arrives in AWS Glue to CloudWatch. The metric's name is "glue.driver.streaming.maxConsumerLagInMs". The default value is 'false'. This option is supported in AWS Glue version 4.0 or later.
+ `"fanoutConsumerARN"`: (Optional) Used for Read. The ARN of a Kinesis stream consumer for the stream specified in `streamARN`. Used to enable enhanced fan-out mode for your Kinesis connection. For more information on consuming a Kinesis stream with enhanced fan-out, see [Using enhanced fan-out in Kinesis streaming jobs](aws-glue-programming-etl-connect-kinesis-efo.md).
+ `"recordMaxBufferedTime"` – (Optional) Used for Write. Default: 1000 (ms). Maximum time a record is buffered while waiting to be written.
+ `"aggregationEnabled"` – (Optional) Used for Write. Default: true. Specifies if records should be aggregated before sending them to Kinesis.
+ `"aggregationMaxSize"` – (Optional) Used for Write. Default: 51200 (bytes). If a record is larger than this limit, it will bypass the aggregator. Note Kinesis enforces a limit of 50KB on record size. If you set this beyond 50KB, oversize records will be rejected by Kinesis.
+ `"aggregationMaxCount"` – (Optional) Used for Write. Default: 4294967295. Maximum number of items to pack into an aggregated record.
+ `"producerRateLimit"` – (Optional) Used for Write. Default: 150 (%). Limits per-shard throughput sent from a single producer (such as your job), as a percentage of the backend limit.
+ `"collectionMaxCount"` – (Optional) Used for Write. Default: 500. Maximum number of items to pack into a PutRecords request. 
+ `"collectionMaxSize"` – (Optional) Used for Write. Default: 5242880 (bytes). Maximum amount of data to send with a PutRecords request.
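
Putting several of these options together, the following sketch assembles a cross-account read configuration with a timestamp starting position. The stream ARN follows the earlier example; the role ARN, session name, and timestamp are placeholders:

```python
# Connection options assembled from the reference above; ARNs are placeholders.
kinesis_options = {
    "streamARN": "arn:aws:kinesis:us-east-2:777788889999:stream/fromOptionsStream",
    "startingPosition": "2023-04-04T08:00:00-04:00",  # timestamp form requires Glue 4.0+
    "inferSchema": "true",
    "classification": "json",
    "awsSTSRoleARN": "arn:aws:iam::111122223333:role/KinesisCrossAccountRole",  # hypothetical role
    "awsSTSSessionName": "glue-kinesis-session",
}

# These options would be passed to create_data_frame.from_options
# with connection_type="kinesis" inside a Glue streaming job.
print(sorted(kinesis_options))
```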

# Using enhanced fan-out in Kinesis streaming jobs
<a name="aws-glue-programming-etl-connect-kinesis-efo"></a>

An enhanced fan-out consumer is able to receive records from a Kinesis stream with dedicated throughput that can be greater than typical consumers. This is done by optimizing the transfer protocol used to provide data to a Kinesis consumer, such as your job. For more information about Kinesis Enhanced Fan-Out, see [the Kinesis documentation](https://docs.aws.amazon.com//streams/latest/dev/enhanced-consumers.html).

In enhanced fan-out mode, the `maxRecordPerRead` and `idleTimeBetweenReadsInMs` connection options no longer apply, as those parameters are not configurable when using enhanced fan-out. The configuration options for retries perform as described.

Use the following procedures to enable and disable enhanced fan-out for your streaming job. You should register a stream consumer for each job that will consume data from your stream.

**To enable enhanced fan-out consumption on your job:**

1. Register a stream consumer for your job using the Kinesis API. Follow the instructions to *register a consumer with enhanced fan-out using the Kinesis Data Streams API* in the [Kinesis documentation](https://docs.aws.amazon.com//streams/latest/dev/building-enhanced-consumers-api). You will only need to follow the first step - calling [RegisterStreamConsumer](https://docs.aws.amazon.com/kinesis/latest/APIReference/API_RegisterStreamConsumer.html). Your request should return an ARN, *consumerARN*. 

1. Set the connection option `fanoutConsumerARN` to *consumerARN* in your connection method arguments.

1. Restart your job.

**To disable enhanced fan-out consumption on your job:**

1. Remove the `fanoutConsumerARN` connection option from your method call.

1. Restart your job.

1. Follow the instructions to *deregister a consumer* in the [Kinesis documentation](https://docs.aws.amazon.com/streams/latest/dev/building-enhanced-consumers-console.html). These instructions apply to the console, but can also be achieved through the Kinesis API. For more information about stream consumer deregistration through the Kinesis API, consult [DeregisterStreamConsumer](https://docs.aws.amazon.com//kinesis/latest/APIReference/API_DeregisterStreamConsumer.html) in the Kinesis documentation.

# Amazon S3 connections
<a name="aws-glue-programming-etl-connect-s3-home"></a>

You can use AWS Glue for Spark to read and write files in Amazon S3. AWS Glue for Spark supports many common data formats stored in Amazon S3 out of the box, including CSV, Avro, JSON, Orc and Parquet. For more information about supported data formats, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md). Each data format may support a different set of AWS Glue features. Consult the page for your data format for the specifics of feature support. Additionally, you can read and write versioned files stored in the Hudi, Iceberg and Delta Lake data lake frameworks. For more information about data lake frameworks, see [Using data lake frameworks with AWS Glue ETL jobs](aws-glue-programming-etl-datalake-native-frameworks.md). 

With AWS Glue you can partition your Amazon S3 objects into a folder structure while writing, then retrieve it by partition to improve performance using simple configuration. You can also set configuration to group small files together when transforming your data to improve performance. You can read and write `bzip2` and `gzip` archives in Amazon S3.

**Topics**
+ [Configuring S3 connections](#aws-glue-programming-etl-connect-s3-configure)
+ [Amazon S3 connection option reference](#aws-glue-programming-etl-connect-s3)
+ [Deprecated connection syntaxes for data formats](#aws-glue-programming-etl-connect-legacy-format)
+ [Excluding Amazon S3 storage classes](aws-glue-programming-etl-storage-classes.md)
+ [Managing partitions for ETL output in AWS Glue](aws-glue-programming-etl-partitions.md)
+ [Reading input files in larger groups](grouping-input-files.md)
+ [Amazon VPC endpoints for Amazon S3](vpc-endpoints-s3.md)

## Configuring S3 connections
<a name="aws-glue-programming-etl-connect-s3-configure"></a>

To connect to Amazon S3 in an AWS Glue for Spark job, you will need some prerequisites:
+ The AWS Glue job must have IAM permissions for relevant Amazon S3 buckets.

In certain cases, you will need to configure additional prerequisites:
+ When configuring cross-account access, you need appropriate access controls on the Amazon S3 bucket.
+ For security reasons, you may choose to route your Amazon S3 requests through an Amazon VPC. This approach can introduce bandwidth and availability challenges. For more information, see [Amazon VPC endpoints for Amazon S3](vpc-endpoints-s3.md). 

## Amazon S3 connection option reference
<a name="aws-glue-programming-etl-connect-s3"></a>

Designates a connection to Amazon S3.

Since Amazon S3 manages files rather than tables, in addition to specifying the connection properties provided in this document, you will need to specify additional configuration about your file type. You specify this information through data format options. For more information about format options, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md). You can also specify this information by integrating with the AWS Glue Data Catalog.

For an example of the distinction between connection options and format options, consider how the [create\_dynamic\_frame\_from\_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) method takes `connection_type`, `connection_options`, `format` and `format_options`. This section specifically discusses parameters provided to `connection_options`.

Use the following connection options with `"connectionType": "s3"`:
+ `"paths"`: (Required) A list of the Amazon S3 paths to read from.
+ `"exclusions"`: (Optional) A string containing a JSON list of Unix-style glob patterns to exclude. For example, `"[\"**.pdf\"]"` excludes all PDF files. For more information about the glob syntax that AWS Glue supports, see [Include and Exclude Patterns](https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html#crawler-data-stores-exclude).
+ `"compressionType"` or `"compression"`: (Optional) Specifies how the data is compressed. Use `"compressionType"` for Amazon S3 sources and `"compression"` for Amazon S3 targets. This is generally not necessary if the data has a standard file extension. Possible values are `"gzip"` and `"bzip2"`. Additional compression formats may be supported for specific formats. For the specifics of feature support, consult the data format page. 
+ `"groupFiles"`: (Optional) Grouping files is turned on by default when the input contains more than 50,000 files. To turn on grouping with fewer than 50,000 files, set this parameter to `"inPartition"`. To disable grouping when there are more than 50,000 files, set this parameter to `"none"`.
+ `"groupSize"`: (Optional) The target group size in bytes. The default is computed based on the input data size and the size of your cluster. When there are fewer than 50,000 input files, `"groupFiles"` must be set to `"inPartition"` for this to take effect.
+ `"recurse"`: (Optional) If set to true, recursively reads files in all subdirectories under the specified paths.
+ `"maxBand"`: (Optional, advanced) This option controls the duration in milliseconds after which the `s3` listing is likely to be consistent. Files with modification timestamps falling within the last `maxBand` milliseconds are tracked specially when using `JobBookmarks` to account for Amazon S3 eventual consistency. Most users don't need to set this option. The default is 900000 milliseconds, or 15 minutes.
+ `"maxFilesInBand"`: (Optional, advanced) This option specifies the maximum number of files to save from the last `maxBand` milliseconds. If this number is exceeded, extra files are skipped and only processed in the next job run. Most users don't need to set this option.
+ `"isFailFast"`: (Optional) This option determines if an AWS Glue ETL job throws reader parsing exceptions. If set to `true`, jobs fail fast if four retries of the Spark task fail to parse the data correctly.
+ `"catalogPartitionPredicate"`: (Optional) Used for Read. The contents of a SQL `WHERE` clause. Used when reading from Data Catalog tables with a very large quantity of partitions. Retrieves matching partitions from Data Catalog indexes. Used with `push_down_predicate`, an option on the [create_dynamic_frame_from_catalog](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_catalog) method (and other similar methods). For more information, see [Server-side filtering using catalog partition predicates](aws-glue-programming-etl-partitions.md#aws-glue-programming-etl-partitions-cat-predicates).
+ `"partitionKeys"`: (Optional) Used for Write. An array of column label strings. AWS Glue will partition your data as specified by this configuration. For more information, see [Writing partitions](aws-glue-programming-etl-partitions.md#aws-glue-programming-etl-partitions-writing).
+ `"excludeStorageClasses"`: (Optional) Used for Read. An array of strings specifying Amazon S3 storage classes. AWS Glue will exclude Amazon S3 objects based on this configuration. For more information, see [Excluding Amazon S3 storage classes](aws-glue-programming-etl-storage-classes.md).
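As a minimal sketch of how several of these options fit together, the following collects them into a `connection_options` dictionary. The bucket and paths are hypothetical, and the Glue call is shown commented out because it requires a `GlueContext` inside a Glue job.

```
# Hypothetical bucket and paths; illustrates the option shapes described above.
s3_connection_options = {
    "paths": ["s3://amzn-s3-demo-bucket/logs/"],          # required: list of S3 paths
    "exclusions": "[\"**.pdf\"]",                         # JSON list of glob patterns, as a string
    "recurse": True,                                      # read subdirectories
    "groupFiles": "inPartition",                          # turn on grouping under 50,000 files
    "groupSize": "1048576",                               # 1 MB target group size, in bytes
    "excludeStorageClasses": ["GLACIER", "DEEP_ARCHIVE"], # skip archived objects on read
}

# dyf = glueContext.create_dynamic_frame.from_options(
#     connection_type="s3",
#     connection_options=s3_connection_options,
#     format="json",
# )
```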

## Deprecated connection syntaxes for data formats
<a name="aws-glue-programming-etl-connect-legacy-format"></a>

Certain data formats can be accessed using a specific connection type syntax. This syntax is deprecated. We recommend you specify your formats using the `s3` connection type and the format options provided in [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) instead.

### "connectionType": "orc"
<a name="aws-glue-programming-etl-connect-orc"></a>

Designates a connection to files stored in Amazon S3 in the [Apache Hive Optimized Row Columnar (ORC)](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) file format.

Use the following connection options with `"connectionType": "orc"`:
+ `paths`: (Required) A list of the Amazon S3 paths to read from.
+ *(Other option name/value pairs)*: Any additional options, including formatting options, are passed directly to the SparkSQL `DataSource`.

### "connectionType": "parquet"
<a name="aws-glue-programming-etl-connect-parquet"></a>

Designates a connection to files stored in Amazon S3 in the [Apache Parquet](https://parquet.apache.org/docs/) file format.

Use the following connection options with `"connectionType": "parquet"`:
+ `paths`: (Required) A list of the Amazon S3 paths to read from.
+ *(Other option name/value pairs)*: Any additional options, including formatting options, are passed directly to the SparkSQL `DataSource`.

# Excluding Amazon S3 storage classes
<a name="aws-glue-programming-etl-storage-classes"></a>

If you're running AWS Glue ETL jobs that read files or partitions from Amazon Simple Storage Service (Amazon S3), you can exclude some Amazon S3 storage class types.

The following storage classes are available in Amazon S3:
+ `STANDARD` — For general-purpose storage of frequently accessed data.
+ `INTELLIGENT_TIERING` — For data with unknown or changing access patterns.
+ `STANDARD_IA` and `ONEZONE_IA` — For long-lived, but less frequently accessed data.
+ `GLACIER`, `DEEP_ARCHIVE`, and `REDUCED_REDUNDANCY` — For long-term archive and digital preservation.

For more information, see [Amazon S3 Storage Classes](https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-class-intro.html) in the *Amazon S3 Developer Guide*.

The examples in this section show how to exclude the `GLACIER` and `DEEP_ARCHIVE` storage classes. These classes allow you to list files, but they won't let you read the files unless they are restored. (For more information, see [Restoring Archived Objects](https://docs.aws.amazon.com/AmazonS3/latest/dev/restoring-objects.html) in the *Amazon S3 Developer Guide*.)

By using storage class exclusions, you can ensure that your AWS Glue jobs will work on tables that have partitions across these storage class tiers. Without exclusions, jobs that read data from these tiers fail with the following error: `AmazonS3Exception: The operation is not valid for the object's storage class`.

There are different ways that you can filter Amazon S3 storage classes in AWS Glue.

**Topics**
+ [Excluding Amazon S3 storage classes when creating a Dynamic Frame](#aws-glue-programming-etl-storage-classes-dynamic-frame)
+ [Excluding Amazon S3 storage classes on a Data Catalog table](#aws-glue-programming-etl-storage-classes-table)

## Excluding Amazon S3 storage classes when creating a Dynamic Frame
<a name="aws-glue-programming-etl-storage-classes-dynamic-frame"></a>

To exclude Amazon S3 storage classes while creating a dynamic frame, use `excludeStorageClasses` in `additionalOptions`. AWS Glue automatically uses its own Amazon S3 `Lister` implementation to list and exclude files corresponding to the specified storage classes.

The following Python and Scala examples show how to exclude the `GLACIER` and `DEEP_ARCHIVE` storage classes when creating a dynamic frame.

Python example:

```
glueContext.create_dynamic_frame.from_catalog(
    database = "my_database",
    table_name = "my_table_name",
    redshift_tmp_dir = "",
    transformation_ctx = "my_transformation_context",
    additional_options = {
        "excludeStorageClasses" : ["GLACIER", "DEEP_ARCHIVE"]
    }
)
```

Scala example:

```
val df = glueContext.getCatalogSource(
    nameSpace, tableName, "", "my_transformation_context",  
    additionalOptions = JsonOptions(
        Map("excludeStorageClasses" -> List("GLACIER", "DEEP_ARCHIVE"))
    )
).getDynamicFrame()
```

## Excluding Amazon S3 storage classes on a Data Catalog table
<a name="aws-glue-programming-etl-storage-classes-table"></a>

You can specify storage class exclusions to be used by an AWS Glue ETL job as a table parameter in the AWS Glue Data Catalog. You can include this parameter in the `CreateTable` operation using the AWS Command Line Interface (AWS CLI) or programmatically using the API. For more information, see [Table Structure](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-Table) and [CreateTable](https://docs.aws.amazon.com/glue/latest/webapi/API_CreateTable.html). 
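As an illustration (the table name and bucket are hypothetical), the `--table-input` JSON passed to the AWS CLI `aws glue create-table` command could carry the exclusion as a table parameter. Note that the parameter value is a JSON list serialized as a string. This is a partial sketch, not the full `TableInput` shape; consult the `CreateTable` reference for the complete structure.

```
{
  "Name": "my_archived_table",
  "StorageDescriptor": {
    "Location": "s3://amzn-s3-demo-bucket/data/"
  },
  "Parameters": {
    "excludeStorageClasses": "[\"GLACIER\",\"DEEP_ARCHIVE\"]"
  }
}
```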

You can also specify excluded storage classes on the AWS Glue console.

**To exclude Amazon S3 storage classes (console)**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane on the left, choose **Tables**.

1. Choose the table name in the list, and then choose **Edit table**.

1. In **Table properties**, add **excludeStorageClasses** as a key and **["GLACIER", "DEEP_ARCHIVE"]** as a value.

1. Choose **Apply**.

# Managing partitions for ETL output in AWS Glue
<a name="aws-glue-programming-etl-partitions"></a>

Partitioning is an important technique for organizing datasets so they can be queried efficiently. It organizes data in a hierarchical directory structure based on the distinct values of one or more columns.

For example, you might decide to partition your application logs in Amazon Simple Storage Service (Amazon S3) by date, broken down by year, month, and day. Files that correspond to a single day's worth of data are then placed under a prefix such as `s3://my_bucket/logs/year=2018/month=01/day=23/`. Systems like Amazon Athena, Amazon Redshift Spectrum, and now AWS Glue can use these partitions to filter data by partition value without having to read all the underlying data from Amazon S3.
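To illustrate how such a prefix encodes partition values, the following small helper (an illustration only, not part of the AWS Glue API) extracts the partition columns from a Hive-style `key=value` path. Engines can filter on these values without reading the object data.

```
def parse_partition_path(prefix: str) -> dict:
    """Extract partition columns from path segments of the form key=value."""
    parts = {}
    for segment in prefix.strip("/").split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

print(parse_partition_path("logs/year=2018/month=01/day=23/"))
# {'year': '2018', 'month': '01', 'day': '23'}
```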

Crawlers not only infer file types and schemas, they also automatically identify the partition structure of your dataset when they populate the AWS Glue Data Catalog. The resulting partition columns are available for querying in AWS Glue ETL jobs or query engines like Amazon Athena.

After you crawl a table, you can view the partitions that the crawler created. In the AWS Glue console, choose **Tables** in the left navigation pane. Choose the table created by the crawler, and then choose **View Partitions**.

For Apache Hive-style partitioned paths in `key=val` style, crawlers automatically populate the column name using the key name. Otherwise, the crawler uses default names like `partition_0`, `partition_1`, and so on. You can change the default names on the console. To do so, navigate to the table. Check whether indexes exist under the **Indexes** tab. If they do, you need to delete them to proceed (you can recreate them using the new column names afterward). Then choose **Edit Schema**, and modify the names of the partition columns there.

In your ETL scripts, you can then filter on the partition columns. Because the partition information is stored in the Data Catalog, use the `from_catalog` API calls to include the partition columns in the `DynamicFrame`. For example, use `create_dynamic_frame.from_catalog` instead of `create_dynamic_frame.from_options`.

Partitioning is an optimization technique that reduces the amount of data scanned. For more information about identifying when this technique is appropriate, consult [Reduce the amount of data scan](https://docs.aws.amazon.com/prescriptive-guidance/latest/tuning-aws-glue-for-apache-spark/reduce-data-scan.html) in the *Best practices for performance tuning AWS Glue for Apache Spark jobs* guide on AWS Prescriptive Guidance.

## Pre-filtering using pushdown predicates
<a name="aws-glue-programming-etl-partitions-pushdowns"></a>

In many cases, you can use a pushdown predicate to filter on partitions without having to list and read all the files in your dataset. Instead of reading the entire dataset and then filtering in a DynamicFrame, you can apply the filter directly on the partition metadata in the Data Catalog. Then you only list and read what you actually need into a DynamicFrame.

For example, in Python, you could write the following.

```
glue_context.create_dynamic_frame.from_catalog(
    database = "my_S3_data_set",
    table_name = "catalog_data_table",
    push_down_predicate = my_partition_predicate)
```

This creates a DynamicFrame that loads only the partitions in the Data Catalog that satisfy the predicate expression. Depending on how small a subset of your data you are loading, this can save a great deal of processing time.

The predicate expression can be any Boolean expression supported by Spark SQL. Anything you could put in a `WHERE` clause in a Spark SQL query will work. For example, the predicate expression `pushDownPredicate = "(year=='2017' and month=='04')"` loads only the partitions in the Data Catalog that have both `year` equal to 2017 and `month` equal to 04. For more information, see the [Apache Spark SQL documentation](https://spark.apache.org/docs/2.1.1/sql-programming-guide.html), and in particular, the [Scala SQL functions reference](https://spark.apache.org/docs/2.1.1/api/scala/index.html#org.apache.spark.sql.functions$).
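Because the predicate is just a string, you can assemble it programmatically. The helper below is a hypothetical illustration that builds the year/month predicate shown above from a Python date:

```
from datetime import date

def month_predicate(d: date) -> str:
    """Build a Spark SQL pushdown predicate for year/month partition columns."""
    return f"(year=='{d.year}' and month=='{d.month:02d}')"

print(month_predicate(date(2017, 4, 15)))
# (year=='2017' and month=='04')
```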

## Server-side filtering using catalog partition predicates
<a name="aws-glue-programming-etl-partitions-cat-predicates"></a>

The `push_down_predicate` option is applied after listing all the partitions from the catalog and before listing files from Amazon S3 for those partitions. If you have a lot of partitions for a table, catalog partition listing can still incur additional time overhead. To address this overhead, you can use server-side partition pruning with the `catalogPartitionPredicate` option that uses [partition indexes](https://docs.aws.amazon.com/glue/latest/dg/partition-indexes.html) in the AWS Glue Data Catalog. This makes partition filtering much faster when you have millions of partitions in one table. You can use both `push_down_predicate` and `catalogPartitionPredicate` in `additional_options` together if your `catalogPartitionPredicate` requires predicate syntax that is not yet supported with the catalog partition indexes.

Python:

```
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database=dbname, 
    table_name=tablename,
    transformation_ctx="datasource0",
    push_down_predicate="day>=10 and customer_id like '10%'",
    additional_options={"catalogPartitionPredicate":"year='2021' and month='06'"}
)
```

Scala:

```
val dynamicFrame = glueContext.getCatalogSource(
    database = dbname,
    tableName = tablename, 
    transformationContext = "datasource0",
    pushDownPredicate="day>=10 and customer_id like '10%'",
    additionalOptions = JsonOptions("""{
        "catalogPartitionPredicate": "year='2021' and month='06'"}""")
    ).getDynamicFrame()
```

**Note**  
The `push_down_predicate` and `catalogPartitionPredicate` options use different syntaxes. The former uses Spark SQL standard syntax, and the latter uses the JSQL parser.

## Writing partitions
<a name="aws-glue-programming-etl-partitions-writing"></a>

By default, a DynamicFrame is not partitioned when it is written. All of the output files are written at the top level of the specified output path. Until recently, the only way to write a DynamicFrame into partitions was to convert it to a Spark SQL DataFrame before writing.

However, DynamicFrames now support native partitioning using a sequence of keys, using the `partitionKeys` option when you create a sink. For example, the following Python code writes out a dataset to Amazon S3 in the Parquet format, into directories partitioned by the `type` field. From there, you can process these partitions using other systems, such as Amazon Athena.

```
glue_context.write_dynamic_frame.from_options(
    frame = projectedEvents,
    connection_type = "s3",    
    connection_options = {"path": "$outpath", "partitionKeys": ["type"]},
    format = "parquet")
```

# Reading input files in larger groups
<a name="grouping-input-files"></a>

You can set properties of your tables to enable an AWS Glue ETL job to group files when they are read from an Amazon S3 data store. These properties enable each ETL task to read a group of input files into a single in-memory partition. This is especially useful when there is a large number of small files in your Amazon S3 data store. When you set certain properties, you instruct AWS Glue to group files within an Amazon S3 data partition and set the size of the groups to be read. You can also set these options when reading from an Amazon S3 data store with the `create_dynamic_frame.from_options` method. 

To enable grouping files for a table, you set key-value pairs in the parameters field of your table structure. Use JSON notation to set a value for the parameter field of your table. For more information about editing the properties of a table, see [Viewing and managing table details](tables-described.md#console-tables-details). 

You can use this method to enable grouping for tables in the Data Catalog with Amazon S3 data stores. 

**groupFiles**  
Set **groupFiles** to `inPartition` to enable the grouping of files within an Amazon S3 data partition. AWS Glue automatically enables grouping if there are more than 50,000 input files, as in the following example.  

```
  'groupFiles': 'inPartition'
```

**groupSize**  
Set **groupSize** to the target size of groups in bytes. The **groupSize** property is optional. If not provided, AWS Glue calculates a size to use all the CPU cores in the cluster while still reducing the overall number of ETL tasks and in-memory partitions.   
For example, the following sets the group size to 1 MB.  

```
  'groupSize': '1048576'
```
Note that `groupSize` should be set to the result of a calculation, for example 1024 * 1024 = 1048576.

**recurse**  
Set **recurse** to `True` to recursively read files in all subdirectories when specifying `paths` as an array of paths. You do not need to set **recurse** if `paths` is an array of object keys in Amazon S3, or if the input format is parquet/orc, as in the following example.  

```
  'recurse':True
```

If you are reading from Amazon S3 directly using the `create_dynamic_frame.from_options` method, add these connection options. For example, the following attempts to group files into 1 MB groups.

```
df = glueContext.create_dynamic_frame.from_options(
    "s3",
    {'paths': ["s3://s3path/"], 'recurse': True, 'groupFiles': 'inPartition', 'groupSize': '1048576'},
    format="json")
```

**Note**  
`groupFiles` is supported for DynamicFrames created from the following data formats: csv, ion, grokLog, json, and xml. This option is not supported for avro, parquet, and orc.

# Amazon VPC endpoints for Amazon S3
<a name="vpc-endpoints-s3"></a>

For security reasons, many AWS customers run their applications within an Amazon Virtual Private Cloud environment (Amazon VPC). With Amazon VPC, you can launch Amazon EC2 instances into a virtual private cloud, which is logically isolated from other networks—including the public internet. With an Amazon VPC, you have control over its IP address range, subnets, routing tables, network gateways, and security settings.

**Note**  
If you created your AWS account after 2013-12-04, you already have a default VPC in each AWS Region. You can immediately start using your default VPC without any additional configuration.  
For more information, see [Your Default VPC and Subnets](https://docs.aws.amazon.com/vpc/latest/userguide/default-vpc.html) in the Amazon VPC User Guide.

Many customers have legitimate privacy and security concerns about sending and receiving data across the public internet. Customers can address these concerns by using a virtual private network (VPN) to route all Amazon S3 network traffic through their own corporate network infrastructure. However, this approach can introduce bandwidth and availability challenges.

VPC endpoints for Amazon S3 can alleviate these challenges. A VPC endpoint for Amazon S3 enables AWS Glue to use private IP addresses to access Amazon S3 with no exposure to the public internet. AWS Glue does not require public IP addresses, and you don't need an internet gateway, a NAT device, or a virtual private gateway in your VPC. You use endpoint policies to control access to Amazon S3. Traffic between your VPC and the AWS service does not leave the Amazon network.

When you create a VPC endpoint for Amazon S3, any requests to an Amazon S3 endpoint within the Region (for example, *s3.us-west-2.amazonaws.com*) are routed to a private Amazon S3 endpoint within the Amazon network. You don't need to modify your applications running on Amazon EC2 instances in your VPC—the endpoint name remains the same, but the route to Amazon S3 stays entirely within the Amazon network, and does not access the public internet.

For more information about VPC endpoints, see [VPC Endpoints](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints.html) in the Amazon VPC User Guide.

The following diagram shows how AWS Glue can use a VPC endpoint to access Amazon S3.

![\[Network traffic flow showing VPC connection to Amazon S3.\]](http://docs.aws.amazon.com/glue/latest/dg/images/PopulateCatalog-vpc-endpoint.png)


**To set up access for Amazon S3**

1. Sign in to the AWS Management Console and open the Amazon VPC console at [https://console.aws.amazon.com/vpc/](https://console.aws.amazon.com/vpc/).

1. In the left navigation pane, choose **Endpoints**.

1. Choose **Create Endpoint**, and follow the steps to create an Amazon S3 VPC endpoint of type Gateway. 

# Amazon DocumentDB connections
<a name="aws-glue-programming-etl-connect-documentdb-home"></a>

You can use AWS Glue for Spark to read from and write to tables in Amazon DocumentDB. You can connect to Amazon DocumentDB using credentials stored in AWS Secrets Manager through an AWS Glue connection.

For more information about Amazon DocumentDB, consult the [Amazon DocumentDB documentation](https://docs.aws.amazon.com/documentdb/latest/developerguide/what-is.html).

**Note**  
Amazon DocumentDB elastic clusters are not currently supported when using the AWS Glue connector. For more information about elastic clusters, see [Using Amazon DocumentDB elastic clusters](https://docs.aws.amazon.com/documentdb/latest/developerguide/docdb-using-elastic-clusters.html).

## Reading and writing to Amazon DocumentDB collections
<a name="aws-glue-programming-etl-connect-documentdb-read-write"></a>

**Note**  
When you create an ETL job that connects to Amazon DocumentDB, for the `Connections` job property, you must designate a connection object that specifies the virtual private cloud (VPC) in which Amazon DocumentDB is running. For the connection object, the connection type must be `JDBC`, and the `JDBC URL` must be `mongo://<DocumentDB_host>:27017`.

**Note**  
These code samples were developed for AWS Glue 3.0. To migrate to AWS Glue 4.0, consult [MongoDB](migrating-version-40.md#migrating-version-40-connector-driver-migration-mongodb). The `uri` parameter has changed.

**Note**  
When using Amazon DocumentDB, `retryWrites` must be set to false in certain situations, such as when the document written specifies `_id`. For more information, consult [Functional Differences with MongoDB](https://docs.aws.amazon.com/documentdb/latest/developerguide/functional-differences.html#functional-differences.retryable-writes) in the Amazon DocumentDB documentation.

The following Python script demonstrates using connection types and connection options for reading and writing to Amazon DocumentDB.

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext, SparkConf
from awsglue.context import GlueContext
from awsglue.job import Job
import time

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

job = Job(glueContext)
job.init(args['JOB_NAME'], args)

output_path = "s3://some_bucket/output/" + str(time.time()) + "/"
documentdb_uri = "mongodb://<mongo-instanced-ip-address>:27017"
documentdb_write_uri = "mongodb://<mongo-instanced-ip-address>:27017"

read_docdb_options = {
    "uri": documentdb_uri,
    "database": "test",
    "collection": "coll",
    "username": "username",
    "password": "1234567890",
    "ssl": "true",
    "ssl.domain_match": "false",
    "partitioner": "MongoSamplePartitioner",
    "partitionerOptions.partitionSizeMB": "10",
    "partitionerOptions.partitionKey": "_id"
}

write_documentdb_options = {
    "retryWrites": "false",
    "uri": documentdb_write_uri,
    "database": "test",
    "collection": "coll",
    "username": "username",
    "password": "pwd"
}

# Get DynamicFrame from  DocumentDB
dynamic_frame2 = glueContext.create_dynamic_frame.from_options(connection_type="documentdb",
                                                               connection_options=read_docdb_options)

# Write DynamicFrame to MongoDB and DocumentDB
glueContext.write_dynamic_frame.from_options(dynamic_frame2, connection_type="documentdb",
                                             connection_options=write_documentdb_options)

job.commit()
```

The following Scala script demonstrates using connection types and connection options for reading and writing to Amazon DocumentDB.

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamicFrame
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  val DOC_URI: String = "mongodb://<mongo-instanced-ip-address>:27017"
  val DOC_WRITE_URI: String = "mongodb://<mongo-instanced-ip-address>:27017"
  lazy val documentDBJsonOption = jsonOptions(DOC_URI)
  lazy val writeDocumentDBJsonOption = jsonOptions(DOC_WRITE_URI)
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Get DynamicFrame from DocumentDB
    val resultFrame2: DynamicFrame = glueContext.getSource("documentdb", documentDBJsonOption).getDynamicFrame()

    // Write DynamicFrame to DocumentDB
    glueContext.getSink("documentdb", writeDocumentDBJsonOption).writeDynamicFrame(resultFrame2)

    Job.commit()
  }

  private def jsonOptions(uri: String): JsonOptions = {
    new JsonOptions(
      s"""{"uri": "${uri}",
         |"database":"test",
         |"collection":"coll",
         |"username": "username",
         |"password": "pwd",
         |"ssl":"true",
         |"ssl.domain_match":"false",
         |"partitioner": "MongoSamplePartitioner",
         |"partitionerOptions.partitionSizeMB": "10",
         |"partitionerOptions.partitionKey": "_id"}""".stripMargin)
  }
}
```

## Amazon DocumentDB connection option reference
<a name="aws-glue-programming-etl-connect-documentdb"></a>

Designates a connection to Amazon DocumentDB (with MongoDB compatibility). 

Connection options differ for a source connection and a sink connection.

### "connectionType": "documentdb" as source
<a name="etl-connect-documentdb-as-source"></a>

Use the following connection options with `"connectionType": "documentdb"` as a source:
+ `"uri"`: (Required) The Amazon DocumentDB host to read from, formatted as `mongodb://<host>:<port>`.
+ `"database"`: (Required) The Amazon DocumentDB database to read from.
+ `"collection"`: (Required) The Amazon DocumentDB collection to read from.
+ `"username"`: (Required) The Amazon DocumentDB user name.
+ `"password"`: (Required) The Amazon DocumentDB password.
+ `"ssl"`: (Required if using SSL) If your connection uses SSL, then you must include this option with the value `"true"`.
+ `"ssl.domain_match"`: (Required if using SSL) If your connection uses SSL, then you must include this option with the value `"false"`.
+ `"batchSize"`: (Optional) The number of documents to return per batch, used within the cursor of internal batches.
+ `"partitioner"`: (Optional) The class name of the partitioner for reading input data from Amazon DocumentDB. The connector provides the following partitioners:
  + `MongoDefaultPartitioner` (default) (Not supported in AWS Glue 4.0)
  + `MongoSamplePartitioner` (Not supported in AWS Glue 4.0)
  + `MongoShardedPartitioner`
  + `MongoSplitVectorPartitioner`
  + `MongoPaginateByCountPartitioner`
  + `MongoPaginateBySizePartitioner` (Not supported in AWS Glue 4.0)
+ `"partitionerOptions"`: (Optional) Options for the designated partitioner. The following options are supported for each partitioner:
  + `MongoSamplePartitioner`: `partitionKey`, `partitionSizeMB`, `samplesPerPartition`
  + `MongoShardedPartitioner`: `shardkey`
  + `MongoSplitVectorPartitioner`: `partitionKey`, `partitionSizeMB`
  + `MongoPaginateByCountPartitioner`: `partitionKey`, `numberOfPartitions`
  + `MongoPaginateBySizePartitioner`: `partitionKey`, `partitionSizeMB`

  For more information about these options, see [Partitioner Configuration](https://docs.mongodb.com/spark-connector/master/configuration/#partitioner-conf) in the MongoDB documentation.

### "connectionType": "documentdb" as sink
<a name="etl-connect-documentdb-as-sink"></a>

Use the following connection options with `"connectionType": "documentdb"` as a sink:
+ `"uri"`: (Required) The Amazon DocumentDB host to write to, formatted as `mongodb://<host>:<port>`.
+ `"database"`: (Required) The Amazon DocumentDB database to write to.
+ `"collection"`: (Required) The Amazon DocumentDB collection to write to.
+ `"username"`: (Required) The Amazon DocumentDB user name.
+ `"password"`: (Required) The Amazon DocumentDB password.
+ `"extendedBsonTypes"`: (Optional) If `true`, allows extended BSON types when writing data to Amazon DocumentDB. The default is `true`.
+ `"replaceDocument"`: (Optional) If `true`, replaces the whole document when saving datasets that contain an `_id` field. If `false`, only fields in the document that match the fields in the dataset are updated. The default is `true`.
+ `"maxBatchSize"`: (Optional) The maximum batch size for bulk operations when saving data. The default is 512.
+ `"retryWrites"`: (Optional) Automatically retries certain write operations a single time if AWS Glue encounters a network error.
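As a compact sketch of the sink options above, the following builds a `connection_options` dictionary. The host and credentials are placeholders, and the write call is shown commented out because it requires a `GlueContext` inside a Glue job.

```
# Placeholder host and credentials; illustrates the sink option shapes above.
write_documentdb_options = {
    "uri": "mongodb://<DocumentDB_host>:27017",
    "database": "test",
    "collection": "coll",
    "username": "username",
    "password": "pwd",
    "retryWrites": "false",      # required in some cases, e.g. documents that specify _id
    "replaceDocument": "false",  # update matching fields instead of replacing the document
}

# glueContext.write_dynamic_frame.from_options(
#     dynamic_frame, connection_type="documentdb",
#     connection_options=write_documentdb_options)
```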

# OpenSearch Service connections
<a name="aws-glue-programming-etl-connect-opensearch-home"></a>

You can use AWS Glue for Spark to read from and write to tables in OpenSearch Service in AWS Glue 4.0 and later versions. You can define what to read from OpenSearch Service with an OpenSearch query. You connect to OpenSearch Service using HTTP basic authentication credentials stored in AWS Secrets Manager through an AWS Glue connection. This feature is not compatible with Amazon OpenSearch Serverless.

For more information about Amazon OpenSearch Service, see the [Amazon OpenSearch Service documentation](https://docs.aws.amazon.com/opensearch-service/).

## Configuring OpenSearch Service connections
<a name="aws-glue-programming-etl-connect-opensearch-configure"></a>

To connect to OpenSearch Service from AWS Glue, create and store your OpenSearch Service credentials in an AWS Secrets Manager secret, then associate that secret with an OpenSearch Service AWS Glue connection.

**Prerequisites:** 
+ Identify the domain endpoint (*aosEndpoint*) and port (*aosPort*) you would like to read from, or create the resource by following the instructions in the Amazon OpenSearch Service documentation. For more information on creating a domain, see [Creating and managing Amazon OpenSearch Service domains](https://docs.aws.amazon.com//opensearch-service/latest/developerguide/createupdatedomains.html) in the Amazon OpenSearch Service documentation.

  An Amazon OpenSearch Service domain endpoint has the following default form: https://search-*domainName*-*unstructuredIdContent*.*region*.es.amazonaws.com. For more information on identifying your domain endpoint, see [Creating and managing Amazon OpenSearch Service domains](https://docs.aws.amazon.com//opensearch-service/latest/developerguide/createupdatedomains.html) in the Amazon OpenSearch Service documentation. 

  Identify or generate HTTP basic authentication credentials, *aosUser* and *aosPassword*, for your domain.

**To configure a connection to OpenSearch Service:**

1. In AWS Secrets Manager, create a secret using your OpenSearch Service credentials. To create a secret in Secrets Manager, follow the tutorial in [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com//secretsmanager/latest/userguide/create_secret.html) in the AWS Secrets Manager documentation. After creating the secret, keep the secret name, *secretName*, for the next step. 
   + When selecting **Key/value pairs**, create a pair for the key `USERNAME` with the value *aosUser*.
   + When selecting **Key/value pairs**, create a pair for the key `PASSWORD` with the value *aosPassword*.

1. In the AWS Glue console, create a connection by following the steps in [Adding an AWS Glue connection](console-connections.md). After creating the connection, keep the connection name, *connectionName*, for future use in AWS Glue. 
   + When selecting a **Connection type**, select **OpenSearch Service**.
   + When selecting a **Domain endpoint**, provide *aosEndpoint*.
   + When selecting a **Port**, provide *aosPort*.
   + When selecting an **AWS Secret**, provide *secretName*.

After creating an AWS Glue OpenSearch Service connection, you will need to perform the following steps before running your AWS Glue job:
+ Grant the IAM role associated with your AWS Glue job permission to read *secretName*.
+ In your AWS Glue job configuration, provide *connectionName* as an **Additional network connection**.
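The secret-read grant can be sketched as an identity policy like the following (the Region, account ID, and secret suffix shown are placeholders, not values from this guide):

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "secretsmanager:GetSecretValue",
            "Resource": "arn:aws:secretsmanager:region:account-id:secret:secretName-*"
        }
    ]
}
```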

## Reading from OpenSearch Service indexes
<a name="aws-glue-programming-etl-connect-opensearch-read"></a>

**Prerequisites:** 
+ An OpenSearch Service index you would like to read from, *aosIndex*.
+ An AWS Glue OpenSearch Service connection configured to provide auth and network location information. To acquire this, complete the steps in the previous procedure, *To configure a connection to OpenSearch Service*. You will need the name of the AWS Glue connection, *connectionName*. 

This example reads an index from Amazon OpenSearch Service. You will need to provide the `pushdown` parameter.

For example: 

```
opensearch_read = glueContext.create_dynamic_frame.from_options(
    connection_type="opensearch",
    connection_options={
        "connectionName": "connectionName",
        "opensearch.resource": "aosIndex",
        "pushdown": "true",
    }
)
```

You can also provide a query string to filter the results returned in your DynamicFrame. You will need to configure `opensearch.query`.

`opensearch.query` can take a URL query parameter string *queryString* or a query DSL JSON object *queryObject*. For more information about the query DSL, see [Query DSL](https://opensearch.org/docs/latest/query-dsl/index/) in the OpenSearch documentation. To provide a URL query parameter string, prepend `?q=` to your query, as you would in a fully qualified URL. To provide a query DSL object, string escape the JSON object before providing it.

For example: 

```
queryObject = '{ "query": { "multi_match": { "query": "Sample", "fields": [ "sample" ] } } }'
queryString = "?q=queryString"

opensearch_read_query = glueContext.create_dynamic_frame.from_options(
    connection_type="opensearch",
    connection_options={
        "connectionName": "connectionName",
        "opensearch.resource": "aosIndex",
        "opensearch.query": queryString,
        "pushdown": "true",
    }
)
```
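Rather than hand-escaping the DSL object, you can build it as a Python dict and serialize it with the standard `json` module; the following is a sketch, and the field names are illustrative:

```
import json

# Build the query DSL as a dict, then let json.dumps produce the
# correctly quoted string to pass as "opensearch.query".
query_dsl = {"query": {"multi_match": {"query": "Sample", "fields": ["sample"]}}}
queryObject = json.dumps(query_dsl)

# URL query parameter form: prepend ?q= to the query text.
queryString = "?q=Sample"
```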

For more information about how to build a query outside of its specific syntax, see [Query string syntax](https://opensearch.org/docs/latest/query-dsl/full-text/query-string/#query-string-syntax) in the OpenSearch documentation.

When reading from OpenSearch collections that contain array type data, you must specify which fields are array type in your method call using the `opensearch.read.field.as.array.include` parameter. 

For example, when reading the following document, you will encounter the `genre` and `actor` array fields:

```
{
    "_index": "movies",
    "_id": "2",
    "_version": 1,
    "_seq_no": 0,
    "_primary_term": 1,
    "found": true,
    "_source": {
        "director": "Frankenheimer, John",
        "genre": [
            "Drama",
            "Mystery",
            "Thriller",
            "Crime"
        ],
        "year": 1962,
        "actor": [
            "Lansbury, Angela",
            "Sinatra, Frank",
            "Leigh, Janet",
            "Harvey, Laurence",
            "Silva, Henry",
            "Frees, Paul",
            "Gregory, James",
            "Bissell, Whit",
            "McGiver, John",
            "Parrish, Leslie",
            "Edwards, James",
            "Flowers, Bess",
            "Dhiegh, Khigh",
            "Payne, Julie",
            "Kleeb, Helen",
            "Gray, Joe",
            "Nalder, Reggie",
            "Stevens, Bert",
            "Masters, Michael",
            "Lowell, Tom"
        ],
        "title": "The Manchurian Candidate"
    }
}
```

In this case, you would include those field names in your method call. For example:

```
"opensearch.read.field.as.array.include": "genre,actor"
```

If your array field is nested inside of your document structure, refer to it using dot notation: `"genre,actor,foo.bar.baz"`. This would specify an array `baz` included in your source document through the embedded document `foo` containing the embedded document `bar`.
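The include list is simply a comma-separated join of those paths, so you can assemble it programmatically; a sketch, with illustrative field names:

```
# Paths to array-typed fields, using dot notation for nested fields.
array_fields = ["genre", "actor", "foo.bar.baz"]

connection_options = {
    "connectionName": "connectionName",
    "opensearch.resource": "aosIndex",
    "pushdown": "true",
    "opensearch.read.field.as.array.include": ",".join(array_fields),
}
```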

## Writing to OpenSearch Service tables
<a name="aws-glue-programming-etl-connect-opensearch-write"></a>

This example writes information from an existing DynamicFrame, *dynamicFrame*, to OpenSearch Service. If the index already contains information, AWS Glue appends data from your DynamicFrame.

**Prerequisites:** 
+ An OpenSearch Service index you would like to write to, *aosIndex*.
+ An AWS Glue OpenSearch Service connection configured to provide auth and network location information. To acquire this, complete the steps in the previous procedure, *To configure a connection to OpenSearch Service*. You will need the name of the AWS Glue connection, *connectionName*. 

For example: 

```
glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="opensearch",
    connection_options={
      "connectionName": "connectionName",
      "opensearch.resource": "aosIndex",
    },
)
```

## OpenSearch Service connection option reference
<a name="aws-glue-programming-etl-connect-opensearch-reference"></a>
+ `connectionName` — Required. Used for Read/Write. The name of an AWS Glue OpenSearch Service connection configured to provide auth and network location information to your connection method.
+ `opensearch.resource` — Required. Used for Read/Write. Valid Values: OpenSearch index names. The name of the index your connection method will interact with.
+ `opensearch.query` — Used for Read. Valid Values: String escaped JSON or, when this string begins with `?`, the search part of a URL. An OpenSearch query that filters what should be retrieved when reading. For more information on using this parameter, consult the previous section [Reading from OpenSearch Service indexes](#aws-glue-programming-etl-connect-opensearch-read).
+ `pushdown` — Required. Used for Read. Valid Values: boolean. Instructs Spark to pass read queries down to OpenSearch so the database only returns relevant documents.
+ `opensearch.read.field.as.array.include` — Required if reading array type data. Used for Read. Valid Values: comma separated lists of field names. Specifies fields to read as arrays from OpenSearch documents. For more information on using this parameter, consult the previous section [Reading from OpenSearch Service indexes](#aws-glue-programming-etl-connect-opensearch-read).

# Redshift connections
<a name="aws-glue-programming-etl-connect-redshift-home"></a>

You can use AWS Glue for Spark to read from and write to tables in Amazon Redshift databases. When connecting to Amazon Redshift databases, AWS Glue moves data through Amazon S3 to achieve maximum throughput, using the Amazon Redshift SQL `COPY` and `UNLOAD` commands. In AWS Glue 4.0 and later, you can use the [Amazon Redshift integration for Apache Spark](https://docs.aws.amazon.com/redshift/latest/mgmt/spark-redshift-connector.html) to read and write with optimizations and features specific to Amazon Redshift beyond those available when connecting through previous versions. 

Learn about how AWS Glue is making it easier than ever for Amazon Redshift users to migrate to AWS Glue for serverless data integration and ETL.

[![AWS Videos](http://img.youtube.com/vi/ZapycBq8TKU/0.jpg)](http://www.youtube.com/watch?v=ZapycBq8TKU)


## Configuring Redshift connections
<a name="aws-glue-programming-etl-connect-redshift-configure"></a>

To use Amazon Redshift clusters in AWS Glue, you will need some prerequisites:
+ An Amazon S3 directory to use for temporary storage when reading from and writing to the database.
+ An Amazon VPC enabling communication between your Amazon Redshift cluster, your AWS Glue job and your Amazon S3 directory.
+ Appropriate IAM permissions on the AWS Glue job and Amazon Redshift cluster.

### Configuring IAM roles
<a name="aws-glue-programming-etl-redshift-config-iam"></a>

**Set up the role for the Amazon Redshift cluster**  
Your Amazon Redshift cluster needs to be able to read and write to Amazon S3 in order to integrate with AWS Glue jobs. To allow this, you can associate IAM roles with the Amazon Redshift cluster you want to connect to. Your role should have a policy allowing read from and write to your Amazon S3 temporary directory. Your role should have a trust relationship allowing the `redshift.amazonaws.com` service to `AssumeRole`.
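That trust relationship takes the standard service-role form; as a sketch (using the service principal named above):

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": { "Service": "redshift.amazonaws.com" },
            "Action": "sts:AssumeRole"
        }
    ]
}
```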

**To associate an IAM role with Amazon Redshift**

1. **Prerequisites:** An Amazon S3 bucket or directory used for the temporary storage of files.

1. Identify which Amazon S3 permissions your Amazon Redshift cluster will need. When moving data to and from an Amazon Redshift cluster, AWS Glue jobs issue COPY and UNLOAD statements against Amazon Redshift. If your job modifies a table in Amazon Redshift, AWS Glue will also issue CREATE LIBRARY statements. For information on specific Amazon S3 permissions required for Amazon Redshift to execute these statements, refer to the Amazon Redshift documentation: [ Amazon Redshift: Permissions to access other AWS Resources](https://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-access-permissions.html).

1. In the IAM console, create an IAM policy with the necessary permissions. For more information about creating a policy, see [Creating IAM policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create.html). 

1. In the IAM console, create a role and trust relationship allowing Amazon Redshift to assume the role. Follow the instructions in the IAM documentation: [To create a role for an AWS service (console)](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-service.html#roles-creatingrole-service-console).
   + When asked to choose an AWS service use case, choose "Redshift - Customizable".
   + When asked to attach a policy, choose the policy you previously defined.
**Note**  
For more information about configuring roles for Amazon Redshift, see [Authorizing Amazon Redshift to access other AWS services on your behalf](https://docs.aws.amazon.com/redshift/latest/mgmt/authorizing-redshift-service.html) in the Amazon Redshift documentation. 

1. In the Amazon Redshift console, associate the role with your Amazon Redshift cluster. Follow the instructions in [the Amazon Redshift documentation](https://docs.aws.amazon.com/redshift/latest/mgmt/copy-unload-iam-role.html).

   Select the highlighted option in the Amazon Redshift console to configure this setting:  
![\[An example of where to manage IAM permissions in the Amazon Redshift console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/RS-role-config.png)

**Note**  
 By default, AWS Glue jobs pass Amazon Redshift temporary credentials that are created using the role that you specified to run the job. We do not recommend relying on these credentials; for security purposes, they expire after 1 hour. 

**Set up the role for the AWS Glue job**  
The AWS Glue job needs a role to access the Amazon S3 bucket. You do not need IAM permissions for the Amazon Redshift cluster; access is controlled by connectivity in your Amazon VPC and by your database credentials.

### Set up Amazon VPC
<a name="aws-glue-programming-etl-redshift-config-vpc"></a>

**To set up access for Amazon Redshift data stores**

1. Sign in to the AWS Management Console and open the Amazon Redshift console at [https://console.aws.amazon.com/redshiftv2/](https://console.aws.amazon.com/redshiftv2/).

1. In the left navigation pane, choose **Clusters**.

1. Choose the cluster name that you want to access from AWS Glue.

1. In the **Cluster Properties** section, choose a security group in **VPC security groups** to allow AWS Glue to use. Record the name of the security group that you chose for future reference. Choosing the security group opens the Amazon EC2 console **Security Groups** list.

1. Choose the security group to modify and navigate to the **Inbound** tab.

1. Add a self-referencing rule to allow AWS Glue components to communicate. Specifically, add or confirm that there is a rule whose **Type** is `All TCP`, **Protocol** is `TCP`, **Port Range** includes all ports, and whose **Source** is the same security group name as the **Group ID**. 

   The inbound rule looks similar to the following example:  
![\[An example of a self-referencing inbound rule.\]](http://docs.aws.amazon.com/glue/latest/dg/images/SetupSecurityGroup-Start.png)

1. Add a rule for outbound traffic as well. Either open outbound traffic to all ports, or create a self-referencing rule whose **Type** is `All TCP`, **Protocol** is `TCP`, **Port Range** includes all ports, and whose **Destination** is the same security group name as the **Group ID**. If using an Amazon S3 VPC endpoint, also add an HTTPS rule for Amazon S3 access. The *s3-prefix-list-id* is required in the security group rule to allow traffic from the VPC to the Amazon S3 VPC endpoint.


### Set up AWS Glue
<a name="aws-glue-programming-etl-redshift-config-glue"></a>

You will need to create an AWS Glue Data Catalog connection that provides Amazon VPC connection information.

**To configure Amazon Redshift Amazon VPC connectivity to AWS Glue in the console**

1. Create a Data Catalog connection by following the steps in [Adding an AWS Glue connection](console-connections.md). After creating the connection, keep the connection name, *connectionName*, for the next step.
   + When selecting a **Connection type**, select **Amazon Redshift**.
   + When selecting a **Redshift cluster**, select your cluster by name.
   + Provide default connection information for an Amazon Redshift user on your cluster.
   + Your Amazon VPC settings will be automatically configured.
**Note**  
You will need to manually provide `PhysicalConnectionRequirements` for your Amazon VPC when creating an **Amazon Redshift** connection through the AWS SDK.

1. In your AWS Glue job configuration, provide *connectionName* as an **Additional network connection**.

## Example: Reading from Amazon Redshift tables
<a name="aws-glue-programming-etl-connect-redshift-read"></a>

 You can read from Amazon Redshift clusters and Amazon Redshift serverless environments. 

**Prerequisites:** An Amazon Redshift table you would like to read from. Follow the steps in the previous section [Configuring Redshift connections](#aws-glue-programming-etl-connect-redshift-configure), after which you should have the Amazon S3 URI for a temporary directory, *temp-s3-dir*, and an IAM role, *rs-role-name* (in account *role-account-id*).

------
#### [ Using the Data Catalog ]

**Additional Prerequisites:** A Data Catalog database and table for the Amazon Redshift table you would like to read from. For more information about the Data Catalog, see [Data discovery and cataloging in AWS Glue](catalog-and-crawler.md). After creating an entry for your Amazon Redshift table, you will identify your connection with a *redshift-dc-database-name* and *redshift-table-name*.

**Configuration:** In your function options you will identify your Data Catalog Table with the `database` and `table_name` parameters. You will identify your Amazon S3 temporary directory with `redshift_tmp_dir`. You will also provide *rs-role-name* using the `aws_iam_role` key in the `additional_options` parameter.

```
 glueContext.create_dynamic_frame.from_catalog(
    database = "redshift-dc-database-name", 
    table_name = "redshift-table-name", 
    redshift_tmp_dir = args["temp-s3-dir"], 
    additional_options = {"aws_iam_role": "arn:aws:iam::role-account-id:role/rs-role-name"})
```

------
#### [ Connecting directly ]

**Additional Prerequisites:** You will need the name of your Amazon Redshift table (*redshift-table-name*) and the JDBC connection information for the Amazon Redshift cluster storing that table. You will supply your connection information with *host*, *port*, *redshift-database-name*, *username* and *password*.

You can retrieve your connection information from the Amazon Redshift console when working with Amazon Redshift clusters. When using Amazon Redshift serverless, consult [Connecting to Amazon Redshift Serverless](https://docs.aws.amazon.com//redshift/latest/mgmt/serverless-connecting.html) in the Amazon Redshift documentation.

**Configuration:** In your function options you will identify your connection parameters with `url`, `dbtable`, `user` and `password`. You will identify your Amazon S3 temporary directory with `redshift_tmp_dir`. You can specify your IAM role using `aws_iam_role` when you use `from_options`. The syntax is similar to connecting through the Data Catalog, but you put the parameters in the `connection_options` map.

It is bad practice to hardcode passwords into AWS Glue scripts. Consider storing your passwords in AWS Secrets Manager and retrieving them in your script with SDK for Python (Boto3).

```
my_conn_options = {  
    "url": "jdbc:redshift://host:port/redshift-database-name",
    "dbtable": "redshift-table-name",
    "user": "username",
    "password": "password",
    "redshiftTmpDir": args["temp-s3-dir"],
    "aws_iam_role": "arn:aws:iam::role-account-id:role/rs-role-name"
}

df = glueContext.create_dynamic_frame.from_options("redshift", my_conn_options)
```
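For instance, a sketch of pulling the credentials from Secrets Manager instead of hardcoding them (the secret name `redshift-creds` and its `username`/`password` JSON keys are assumptions, not values from this guide):

```
import json

def parse_secret(secret_string):
    # A Secrets Manager SecretString storing key/value pairs is a JSON object.
    creds = json.loads(secret_string)
    return creds["username"], creds["password"]

# In a Glue job, fetch the string with Boto3, for example:
# import boto3
# resp = boto3.client("secretsmanager").get_secret_value(SecretId="redshift-creds")
# username, password = parse_secret(resp["SecretString"])
```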

------

## Example: Writing to Amazon Redshift tables
<a name="aws-glue-programming-etl-connect-redshift-write"></a>

 You can write to Amazon Redshift clusters and Amazon Redshift serverless environments. 

**Prerequisites:** An Amazon Redshift cluster. Follow the steps in the previous section [Configuring Redshift connections](#aws-glue-programming-etl-connect-redshift-configure), after which you should have the Amazon S3 URI for a temporary directory, *temp-s3-dir*, and an IAM role, *rs-role-name* (in account *role-account-id*). You will also need a `DynamicFrame` whose contents you would like to write to the database. 

------
#### [ Using the Data Catalog ]

**Additional Prerequisites:** A Data Catalog database for the Amazon Redshift cluster and table you would like to write to. For more information about the Data Catalog, see [Data discovery and cataloging in AWS Glue](catalog-and-crawler.md). You will identify your connection with *redshift-dc-database-name* and the target table with *redshift-table-name*.

**Configuration:** In your function options you will identify your Data Catalog database with the `database` parameter, then provide the table with `table_name`. You will identify your Amazon S3 temporary directory with `redshift_tmp_dir`. You will also provide *rs-role-name* using the `aws_iam_role` key in the `additional_options` parameter.

```
 glueContext.write_dynamic_frame.from_catalog(
    frame = input dynamic frame, 
    database = "redshift-dc-database-name", 
    table_name = "redshift-table-name", 
    redshift_tmp_dir = args["temp-s3-dir"], 
    additional_options = {"aws_iam_role": "arn:aws:iam::role-account-id:role/rs-role-name"})
```

------
#### [ Connecting through an AWS Glue connection ]

You can connect to Amazon Redshift directly using the `write_dynamic_frame.from_options` method. However, rather than insert your connection details directly into your script, you can reference connection details stored in a Data Catalog connection with the `from_jdbc_conf` method. You can do this without crawling or creating Data Catalog tables for your database. For more information about Data Catalog connections, see [Connecting to data](glue-connections.md).

**Additional Prerequisites:** A Data Catalog connection for your database, and an Amazon Redshift table you would like to write to.

**Configuration:** You will identify your Data Catalog connection with *dc-connection-name*. You will identify your Amazon Redshift database and table with *redshift-table-name* and *redshift-database-name*. You will provide your Data Catalog connection information with `catalog_connection` and your Amazon Redshift information with `dbtable` and `database`. The syntax is similar to connecting through the Data Catalog, but you put the parameters in the `connection_options` map. 

```
my_conn_options = {
    "dbtable": "redshift-table-name",
    "database": "redshift-database-name",
    "aws_iam_role": "arn:aws:iam::role-account-id:role/rs-role-name"
}

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = input dynamic frame, 
    catalog_connection = "dc-connection-name", 
    connection_options = my_conn_options, 
    redshift_tmp_dir = args["temp-s3-dir"])
```

------

## Amazon Redshift connection option reference
<a name="w2aac67c11c24b8c21c15"></a>

The basic connection options used for all AWS Glue JDBC connections to set up information like `url`, `user` and `password` are consistent across all JDBC types. For more information about standard JDBC parameters, see [JDBC connection option reference](aws-glue-programming-etl-connect-jdbc-home.md#aws-glue-programming-etl-connect-jdbc).

The Amazon Redshift connection type takes some additional connection options:
+ `"redshiftTmpDir"`: (Required) The Amazon S3 path where temporary data can be staged when copying out of the database.
+ `"aws_iam_role"`: (Optional) ARN for an IAM role. The AWS Glue job will pass this role to the Amazon Redshift cluster to grant the cluster permissions needed to complete instructions from the job.

### Additional connection options available in AWS Glue 4.0 and later
<a name="aws-glue-programming-etl-redshift-enhancements"></a>

You can also pass options for the new Amazon Redshift connector through AWS Glue connection options. For a complete list of supported connector options, see the *Spark SQL parameters* section in [Amazon Redshift integration for Apache Spark](https://docs.aws.amazon.com/redshift/latest/mgmt/spark-redshift-connector.html).

For your convenience, we reiterate certain new options here:


| Name | Required | Default | Description | 
| --- | --- | --- | --- | 
|  autopushdown  | No | TRUE |  Applies predicate and query pushdown by capturing and analyzing the Spark logical plans for SQL operations. The operations are translated into a SQL query, and then run in Amazon Redshift to improve performance.  | 
|  autopushdown.s3\_result\_cache  | No | FALSE |  Caches the SQL query to unload data for Amazon S3 path mapping in memory so that the same query doesn't need to run again in the same Spark session. Only supported when `autopushdown` is enabled.  | 
|  unload\_s3\_format  | No | PARQUET |  PARQUET - Unloads the query results in Parquet format. TEXT - Unloads the query results in pipe-delimited text format.  | 
|  sse\_kms\_key  | No | N/A |  The AWS SSE-KMS key to use for encryption during `UNLOAD` operations instead of the default encryption for AWS.  | 
|  extracopyoptions  | No | N/A |  A list of extra options to append to the Amazon Redshift `COPY` command when loading data, such as `TRUNCATECOLUMNS` or `MAXERROR n` (for other options see [COPY: Optional parameters](https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html#r_COPY-syntax-overview-optional-parameters)).  Note that because these options are appended to the end of the `COPY` command, only options that make sense at the end of the command can be used. That should cover most possible use cases.  | 
|  csvnullstring (experimental)  | No | NULL |  The String value to write for nulls when using the CSV `tempformat`. This should be a value that doesn't appear in your actual data.  | 

These new parameters can be used in the following ways.

**New options for performance improvement**  
The new connector introduces some new performance improvement options:
+ `autopushdown`: Enabled by default.
+ `autopushdown.s3_result_cache`: Disabled by default.
+ `unload_s3_format`: `PARQUET` by default.

For information about using these options, see [Amazon Redshift integration for Apache Spark](https://docs.aws.amazon.com/redshift/latest/mgmt/spark-redshift-connector.html). We recommend that you don't turn on `autopushdown.s3_result_cache` when you have mixed read and write operations, because the cached results might contain stale information. The option `unload_s3_format` is set to `PARQUET` by default for the `UNLOAD` command, to improve performance and reduce storage cost. To use the `UNLOAD` command default behavior, reset the option to `TEXT`.

**New encryption option for reading**  
By default, the data in the temporary folder that AWS Glue uses when it reads data from the Amazon Redshift table is encrypted using `SSE-S3` encryption. To use customer managed keys from AWS Key Management Service (AWS KMS) to encrypt your data, you can set up `("sse_kms_key" → kmsKey)` where kmsKey is the [key ID from AWS KMS](https://docs.aws.amazon.com/kms/latest/developerguide/find-cmk-id-arn.html), instead of the legacy setting option `("extraunloadoptions" → s"ENCRYPTED KMS_KEY_ID '$kmsKey'")` in AWS Glue version 3.0.

```
datasource0 = glueContext.create_dynamic_frame.from_catalog(
  database = "database-name", 
  table_name = "table-name", 
  redshift_tmp_dir = args["TempDir"],
  additional_options = {"sse_kms_key":"<KMS_KEY_ID>"}, 
  transformation_ctx = "datasource0"
)
```

**Support IAM-based JDBC URL**  
The new connector supports an IAM-based JDBC URL so you don't need to pass in a user/password or secret. With an IAM-based JDBC URL, the connector uses the job runtime role to access the Amazon Redshift data source. 

Step 1: Attach the following minimal required policy to your AWS Glue job runtime role.

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "redshift:GetClusterCredentials",
            "Resource": [
                "arn:aws:redshift:us-east-1:111122223333:dbgroup:<cluster name>/*",
                "arn:aws:redshift:*:111122223333:dbuser:*/*",
                "arn:aws:redshift:us-east-1:111122223333:dbname:<cluster name>/<database name>"
            ]
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "redshift:DescribeClusters",
            "Resource": "*"
        }
    ]
}
```

------

Step 2: Use the IAM-based JDBC URL as follows. Specify a new option `DbUser` with the Amazon Redshift user name that you're connecting with.

```
conn_options = {
    # IAM-based JDBC URL
    "url": "jdbc:redshift:iam://<cluster name>:<region>/<database name>",
    "dbtable": dbtable,
    "redshiftTmpDir": redshiftTmpDir,
    "aws_iam_role": aws_iam_role,
    "DbUser": "<Redshift User name>"  # required for IAM-based JDBC URL
}

redshift_write = glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="redshift",
    connection_options=conn_options
)

redshift_read = glueContext.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options=conn_options
)
```

**Note**  
A `DynamicFrame` currently only supports an IAM-based JDBC URL with a `DbUser` in the `GlueContext.create_dynamic_frame.from_options` workflow. 

## Migrating from AWS Glue version 3.0 to version 4.0
<a name="aws-glue-programming-etl-redshift-migrating"></a>

In AWS Glue 4.0, ETL jobs have access to a new Amazon Redshift Spark connector and a new JDBC driver with different options and configuration. The new Amazon Redshift connector and driver are written with performance in mind, and keep transactional consistency of your data. These products are documented in the Amazon Redshift documentation. For more information, see:
+ [Amazon Redshift integration for Apache Spark](https://docs.aws.amazon.com/redshift/latest/mgmt/spark-redshift-connector.html)
+ [Amazon Redshift JDBC driver, version 2.1](https://docs.aws.amazon.com/redshift/latest/mgmt/jdbc20-download-driver.html)

**Table/column names and identifiers restriction**  
The new Amazon Redshift Spark connector and driver have a more restricted requirement for the Redshift table name. For more information, see [Names and identifiers](https://docs.aws.amazon.com/redshift/latest/dg/r_names.html) to define your Amazon Redshift table name. The job bookmark workflow might not work with a table name that doesn't match the rules and with certain characters, such as a space.

If you have legacy tables with names that don't conform to the [Names and identifiers](https://docs.aws.amazon.com/redshift/latest/dg/r_names.html) rules and see issues with bookmarks (jobs reprocessing old Amazon Redshift table data), we recommend that you rename your table names. For more information, see [ALTER TABLE examples](https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE_examples_basic.html). 
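A quick pre-flight check along these lines can flag non-conforming names before a bookmark-enabled job runs; this is a sketch whose pattern approximates the standard-identifier rules, not the authoritative validation:

```
import re

# Approximation of an Amazon Redshift standard identifier: starts with a
# letter or underscore, then letters, digits, underscores, or dollar signs.
STANDARD_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")

def is_standard_identifier(name, max_len=127):
    return len(name) <= max_len and bool(STANDARD_IDENTIFIER.match(name))
```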

**Default tempformat change in Dataframe**  
The AWS Glue version 3.0 Spark connector defaults the `tempformat` to CSV while writing to Amazon Redshift. To be consistent, in AWS Glue version 4.0, the `DynamicFrame` still defaults the `tempformat` to `CSV`. If you've previously used Spark Dataframe APIs directly with the Amazon Redshift Spark connector, you can explicitly set the `tempformat` to CSV in the `DataframeReader`/`Writer` options. Otherwise, `tempformat` defaults to `AVRO` in the new Spark connector.

**Behavior change: map Amazon Redshift data type REAL to Spark data type FLOAT instead of DOUBLE**  
In AWS Glue version 3.0, Amazon Redshift `REAL` is converted to a Spark `DOUBLE` type. The new Amazon Redshift Spark connector has updated the behavior so that the Amazon Redshift `REAL` type is converted to, and back from, the Spark `FLOAT` type. If you have a legacy use case where you still want the Amazon Redshift `REAL` type to be mapped to a Spark `DOUBLE` type, you can use the following workaround:
+ For a `DynamicFrame`, map the `Float` type to a `Double` type with `DynamicFrame.ApplyMapping`. For a `Dataframe`, you need to use `cast`.

Code example:

```
dyf_cast = dyf.apply_mapping([('a', 'long', 'a', 'long'), ('b', 'float', 'b', 'double')])
```
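The `cast` counterpart for the DataFrame API can be sketched as follows; `df` is assumed to be an existing DataFrame whose column `b` was read back as `FLOAT` (names mirror the `apply_mapping` example above):

```
# Sketch of the DataFrame-side workaround (assumes pyspark and an existing
# DataFrame `df`; column names are illustrative):
#
#   from pyspark.sql.functions import col
#   df_cast = df.withColumn("b", col("b").cast("double"))
```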

**Handling the VARBYTE data type**  
AWS Glue 3.0 converts the Amazon Redshift `VARBYTE` type to the Spark `STRING` type. However, the latest Amazon Redshift Spark connector doesn't support the `VARBYTE` data type. To work around this limitation, you can [create a Redshift view](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_VIEW.html) that transforms `VARBYTE` columns to a supported data type. Then, use the new connector to load data from this view instead of the original table, which ensures compatibility while maintaining access to your `VARBYTE` data.

Example for Redshift query:

```
CREATE VIEW view_name AS SELECT FROM_VARBYTE(varbyte_column, 'hex') AS varbyte_column FROM table_name;
```

# Kafka connections
<a name="aws-glue-programming-etl-connect-kafka-home"></a>

You can use a Kafka connection to read and write to Kafka data streams using information stored in a Data Catalog table, or by providing information to directly access the data stream. The connection supports a Kafka cluster or an Amazon Managed Streaming for Apache Kafka cluster. You can read information from Kafka into a Spark DataFrame, then convert it to an AWS Glue DynamicFrame. You can write DynamicFrames to Kafka in JSON format. If you directly access the data stream, use these options to provide the information about how to access the data stream.

If you use `getCatalogSource` or `create_data_frame_from_catalog` to consume records from a Kafka streaming source, or `getCatalogSink` or `write_dynamic_frame_from_catalog` to write records to Kafka, the job has the Data Catalog database and table name information, and can use that to obtain some basic parameters for reading from the Kafka streaming source. If you use `getSource`, `getSink`, `getSourceWithFormat`, `getSinkWithFormat`, `createDataFrameFromOptions`, `create_data_frame_from_options`, or `write_dynamic_frame_from_options`, you must specify these basic parameters using the connection options described here.

You can specify the connection options for Kafka using the following arguments for the specified methods in the `GlueContext` class.
+ Scala
  + `connectionOptions`: Use with `getSource`, `createDataFrameFromOptions`, `getSink` 
  + `additionalOptions`: Use with `getCatalogSource`, `getCatalogSink`
  + `options`: Use with `getSourceWithFormat`, `getSinkWithFormat`
+ Python
  + `connection_options`: Use with `create_data_frame_from_options`, `write_dynamic_frame_from_options`
  + `additional_options`: Use with `create_data_frame_from_catalog`, `write_dynamic_frame_from_catalog`
  + `options`: Use with `getSource`, `getSink`

For notes and restrictions about streaming ETL jobs, consult [Streaming ETL notes and restrictions](add-job-streaming.md#create-job-streaming-restrictions).

**Topics**
+ [Configure Kafka](#aws-glue-programming-etl-connect-kafka-configure)
+ [Example: Reading from Kafka streams](#aws-glue-programming-etl-connect-kafka-read)
+ [Example: Writing to Kafka streams](#aws-glue-programming-etl-connect-kafka-write)
+ [Kafka connection option reference](#aws-glue-programming-etl-connect-kafka)

## Configure Kafka
<a name="aws-glue-programming-etl-connect-kafka-configure"></a>

There are no AWS prerequisites for connecting to Kafka streams that are available through the internet.

You can create an AWS Glue Kafka connection to manage your connection credentials. For more information, see [Creating an AWS Glue connection for an Apache Kafka data stream](add-job-streaming.md#create-conn-streaming). In your AWS Glue job configuration, provide *connectionName* as an **Additional network connection**, then, in your method call, provide *connectionName* to the `connectionName` parameter.

In certain cases, you will need to configure additional prerequisites:
+ If using Amazon Managed Streaming for Apache Kafka with IAM authentication, you will need appropriate IAM configuration.
+ If using Amazon Managed Streaming for Apache Kafka within an Amazon VPC, you will need appropriate Amazon VPC configuration: create an AWS Glue connection that provides the Amazon VPC connection information, and include that connection in your job configuration as an **Additional network connection**.

For more information about Streaming ETL job prerequisites, consult [Streaming ETL jobs in AWS Glue](add-job-streaming.md).

## Example: Reading from Kafka streams
<a name="aws-glue-programming-etl-connect-kafka-read"></a>

This example is used in conjunction with [forEachBatch](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-forEachBatch).

Example for Kafka streaming source:

```
kafka_options = {
    "connectionName": "ConfluentKafka",
    "topicName": "kafka-auth-topic",
    "startingOffsets": "earliest",
    "inferSchema": "true",
    "classification": "json"
}
data_frame_datasource0 = glueContext.create_data_frame.from_options(connection_type="kafka", connection_options=kafka_options)
```
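The batch handler that `forEachBatch` invokes can be sketched as follows. The handler itself is plain Python; the commented-out `forEachBatch` call shows where it plugs in, and the window size, checkpoint path, and sink logic are illustrative assumptions:

```
# Sketch of a forEachBatch handler for the Kafka source above.
def process_batch(data_frame, batch_id):
    # Skip empty micro-batches; return whether the batch held records.
    if data_frame.count() == 0:
        return False
    # dynamic_frame = DynamicFrame.fromDF(data_frame, glueContext, "from_kafka")
    # ... write dynamic_frame to your sink here ...
    return True

# glueContext.forEachBatch(
#     frame=data_frame_datasource0,
#     batch_function=process_batch,
#     options={"windowSize": "100 seconds",
#              "checkpointLocation": "s3://amzn-s3-demo-bucket/checkpoints/"})
```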

## Example: Writing to Kafka streams
<a name="aws-glue-programming-etl-connect-kafka-write"></a>

Examples for writing to Kafka:

Example with the `getSink` method:

```
data_frame_datasource0 = glueContext.getSink(
    connectionType="kafka",
    connectionOptions=JsonOptions("""{
        "connectionName": "ConfluentKafka",
        "classification": "json",
        "topic": "kafka-auth-topic",
        "typeOfData": "kafka"}
    """),
    transformationContext="dataframe_ApacheKafka_node1711729173428"
).getDataFrame()
```

Example with the `write_dynamic_frame.from_options` method:

```
kafka_options = {
    "connectionName": "ConfluentKafka",
    "topicName": "kafka-auth-topic",
    "classification": "json"
}
kafka_sink = glueContext.write_dynamic_frame.from_options(frame=dynamicFrame, connection_type="kafka", connection_options=kafka_options)
```

## Kafka connection option reference
<a name="aws-glue-programming-etl-connect-kafka"></a>

When reading, use the following connection options with `"connectionType": "kafka"`:
+ `"bootstrap.servers"` (Required) A list of bootstrap server URLs, for example: `b-1.vpc-test-2.o4q88o.c6.kafka.us-east-1.amazonaws.com:9094`. This option must be specified in the API call or defined in the table metadata in the Data Catalog.
+ `"security.protocol"` (Required) The protocol used to communicate with brokers. The possible values are `"SSL"` or `"PLAINTEXT"`.
+ `"topicName"` (Required) A comma-separated list of topics to subscribe to. You must specify one and only one of `"topicName"`, `"assign"` or `"subscribePattern"`.
+ `"assign"`: (Required) A JSON string specifying the specific `TopicPartitions` to consume. You must specify one and only one of `"topicName"`, `"assign"` or `"subscribePattern"`.

  Example: '{"topicA":[0,1],"topicB":[2,4]}'
+ `"subscribePattern"`: (Required) A Java regex string that identifies the topic list to subscribe to. You must specify one and only one of `"topicName"`, `"assign"` or `"subscribePattern"`.

  Example: 'topic.*'
+ `"classification"` (Required, unless provided through the Data Catalog) The file format used by the data in the record.
+ `"delimiter"` (Optional) The value separator used when `classification` is CSV. The default value is `","`.
+ `"startingOffsets"`: (Optional) The starting position in the Kafka topic to read data from. The possible values are `"earliest"` or `"latest"`. The default value is `"latest"`.
+ `"startingTimestamp"`: (Optional, supported only for AWS Glue version 4.0 or later) The timestamp of the record in the Kafka topic to read data from. The possible value is a timestamp string in UTC format in the pattern `yyyy-mm-ddTHH:MM:SSZ` (where `Z` represents a UTC timezone offset with a +/-. For example: "2023-04-04T08:00:00-04:00").

  Note: Only one of `startingOffsets` or `startingTimestamp` can be present in the connection options list of the AWS Glue streaming script; including both of these properties will result in job failure.
+ `"endingOffsets"`: (Optional) The end point when a batch query is ended. Possible values are either `"latest"` or a JSON string that specifies an ending offset for each `TopicPartition`.

  For the JSON string, the format is `{"topicA":{"0":23,"1":-1},"topicB":{"0":-1}}`. The value `-1` as an offset represents `"latest"`.
+ `"pollTimeoutMs"`: (Optional) The timeout in milliseconds to poll data from Kafka in Spark job executors. The default value is `600000`.
+ `"numRetries"`: (Optional) The number of times to retry before failing to fetch Kafka offsets. The default value is `3`.
+ `"retryIntervalMs"`: (Optional) The time in milliseconds to wait before retrying to fetch Kafka offsets. The default value is `10`.
+ `"maxOffsetsPerTrigger"`: (Optional) The rate limit on the maximum number of offsets that are processed per trigger interval. The specified total number of offsets is proportionally split across `topicPartitions` of different volumes. The default value is null, which means that the consumer reads all offsets until the known latest offset.
+ `"minPartitions"`: (Optional) The desired minimum number of partitions to read from Kafka. The default value is null, which means that the number of Spark partitions is equal to the number of Kafka partitions.
+  `"includeHeaders"`: (Optional) Whether to include the Kafka headers. When the option is set to "true", the data output will contain an additional column named "glue_streaming_kafka_headers" with type `Array[Struct(key: String, value: String)]`. The default value is "false". This option is available in AWS Glue version 3.0 or later. 
+ `"schema"`: (Required when `inferSchema` is set to false) The schema to use to process the payload. If the classification is `avro`, the provided schema must be in the Avro schema format. If the classification is not `avro`, the provided schema must be in the DDL schema format.

  The following are schema examples.

------
#### [ Example in DDL schema format ]

  ```
  'column1' INT, 'column2' STRING, 'column3' FLOAT
  ```

------
#### [ Example in Avro schema format ]

  ```
  {
  "type":"array",
  "items":
  {
  "type":"record",
  "name":"test",
  "fields":
  [
    {
      "name":"_id",
      "type":"string"
    },
    {
      "name":"index",
      "type":
      [
        "int",
        "string",
        "float"
      ]
    }
  ]
  }
  }
  ```

------
+ `"inferSchema"`: (Optional) The default value is 'false'. If set to 'true', the schema will be detected at runtime from the payload within `forEachBatch`.
+ `"avroSchema"`: (Deprecated) Parameter used to specify a schema of Avro data when Avro format is used. This parameter is now deprecated. Use the `schema` parameter.
+ `"addRecordTimestamp"`: (Optional) When this option is set to 'true', the data output will contain an additional column named "__src_timestamp" that indicates the time when the corresponding record was received by the topic. The default value is 'false'. This option is supported in AWS Glue version 4.0 or later.
+ `"emitConsumerLagMetrics"`: (Optional) When the option is set to 'true', for each batch, AWS Glue emits to CloudWatch a metric measuring the duration between the oldest record received by the topic and the time it arrives in AWS Glue. The metric's name is "glue.driver.streaming.maxConsumerLagInMs". The default value is 'false'. This option is supported in AWS Glue version 4.0 or later.
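Because `assign` and `endingOffsets` take JSON strings and `startingTimestamp` takes a formatted timestamp, it can be convenient to build those values programmatically. A sketch with illustrative topic names (remember to pass only one of `startingOffsets`/`startingTimestamp`):

```
import json
from datetime import datetime, timezone

# JSON-valued options, built from plain Python structures:
assign = json.dumps({"topicA": [0, 1], "topicB": [2, 4]})
ending_offsets = json.dumps({"topicA": {"0": 23, "1": -1}, "topicB": {"0": -1}})

# A UTC timestamp with an explicit offset, matching yyyy-mm-ddTHH:MM:SSZ:
starting_timestamp = datetime(2023, 4, 4, 8, 0, 0, tzinfo=timezone.utc).isoformat()

kafka_options = {
    "connectionName": "ConfluentKafka",
    "assign": assign,
    "classification": "json",
    "startingTimestamp": starting_timestamp,  # not combined with startingOffsets
    "endingOffsets": ending_offsets,
}
```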

When writing, use the following connection options with `"connectionType": "kafka"`:
+ `"connectionName"` (Required) Name of the AWS Glue connection used to connect to the Kafka cluster (similar to Kafka source).
+ `"topic"` (Required) If a topic column exists, then its value is used as the topic when writing the given row to Kafka, unless the topic configuration option is set. That is, the `topic` configuration option overrides the topic column.
+ `"partition"` (Optional) If a valid partition number is specified, that `partition` will be used when sending the record.

  If no partition is specified but a `key` is present, a partition will be chosen using a hash of the key.

  If neither `key` nor `partition` is present, a partition will be chosen using sticky partitioning, which changes when at least `batch.size` bytes are produced to the partition.
+ `"key"` (Optional) Used for partitioning if `partition` is null.
+ `"classification"` (Optional) The file format used by the data in the record. We only support JSON, CSV and Avro.

  With the Avro format, we can provide a custom `avroSchema` to serialize with, but note that this schema needs to be provided on the source for deserializing as well. Otherwise, by default it uses the Apache `AvroSchema` for serializing.

Additionally, you can fine-tune the Kafka sink as required by updating the [Kafka producer configuration parameters](https://kafka.apache.org/documentation/#producerconfigs). Note that there is no allow listing on connection options; all the key-value pairs are persisted on the sink as is.

However, there is a small deny list of options that will not take effect. For more information, see [Kafka specific configurations](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html).
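Because the options are passed through unfiltered, standard producer settings can ride along with the AWS Glue-specific ones. A sketch; the producer values chosen here are illustrative:

```
# Sketch: Glue-specific options plus standard Kafka producer settings,
# which are persisted on the sink as-is (values are illustrative).
kafka_sink_options = {
    "connectionName": "ConfluentKafka",
    "topic": "kafka-auth-topic",
    "classification": "json",
    # Standard Kafka producer configuration parameters:
    "compression.type": "gzip",
    "linger.ms": "50",
}
```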

# Azure Cosmos DB connections
<a name="aws-glue-programming-etl-connect-azurecosmos-home"></a>

You can use AWS Glue for Spark to read from and write to existing containers in Azure Cosmos DB using the NoSQL API in AWS Glue 4.0 and later versions. You can define what to read from Azure Cosmos DB with a SQL query. You connect to Azure Cosmos DB using an Azure Cosmos DB Key stored in AWS Secrets Manager through an AWS Glue connection.

For more information about Azure Cosmos DB for NoSQL, consult [the Azure documentation](https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/).

## Configuring Azure Cosmos DB connections
<a name="aws-glue-programming-etl-connect-azurecosmos-configure"></a>

To connect to Azure Cosmos DB from AWS Glue, you will need to create and store your Azure Cosmos DB Key in an AWS Secrets Manager secret, then associate that secret with an Azure Cosmos DB AWS Glue connection.

**Prerequisites:** 
+ In Azure, you will need to identify or generate an Azure Cosmos DB Key for use by AWS Glue, `cosmosKey`. For more information, see [Secure access to data in Azure Cosmos DB](https://learn.microsoft.com/en-us/azure/cosmos-db/secure-access-to-data?tabs=using-primary-key) in the Azure documentation.

**To configure a connection to Azure Cosmos DB:**

1. In AWS Secrets Manager, create a secret using your Azure Cosmos DB Key. To create a secret in Secrets Manager, follow the tutorial available in [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html) in the AWS Secrets Manager documentation. After creating the secret, keep the secret name, *secretName*, for the next step. 
   + When selecting **Key/value pairs**, create a pair for the key `spark.cosmos.accountKey` with the value *cosmosKey*.

1. In the AWS Glue console, create a connection by following the steps in [Adding an AWS Glue connection](console-connections.md). After creating the connection, keep the connection name, *connectionName*, for future use in AWS Glue. 
   + When selecting a **Connection type**, select Azure Cosmos DB.
   + When selecting an **AWS Secret**, provide *secretName*.

After creating an AWS Glue Azure Cosmos DB connection, you will need to perform the following steps before running your AWS Glue job:
+ Grant the IAM role associated with your AWS Glue job permission to read *secretName*.
+ In your AWS Glue job configuration, provide *connectionName* as an **Additional network connection**.

## Reading from Azure Cosmos DB for NoSQL containers
<a name="aws-glue-programming-etl-connect-azurecosmos-read"></a>

**Prerequisites:** 
+ An Azure Cosmos DB for NoSQL container you would like to read from. You will need identification information for the container.

  An Azure Cosmos DB for NoSQL container is identified by its database and container. You must provide the database, *cosmosDBName*, and container, *cosmosContainerName*, names when connecting to the Azure Cosmos DB for NoSQL API.
+ An AWS Glue Azure Cosmos DB connection configured to provide auth and network location information. To acquire this, complete the steps in the previous procedure, *To configure a connection to Azure Cosmos DB*. You will need the name of the AWS Glue connection, *connectionName*. 

For example: 

```
azurecosmos_read = glueContext.create_dynamic_frame.from_options(
    connection_type="azurecosmos",
    connection_options={
        "connectionName": connectionName,
        "spark.cosmos.database": cosmosDBName,
        "spark.cosmos.container": cosmosContainerName,
    }
)
```

You can also provide a SELECT SQL query to filter the results returned to your DynamicFrame. You will need to configure `spark.cosmos.read.customQuery`.

For example:

```
azurecosmos_read_query = glueContext.create_dynamic_frame.from_options(
    connection_type="azurecosmos",
    connection_options={
        "connectionName": "connectionName",
        "spark.cosmos.database": cosmosDBName,
        "spark.cosmos.container": cosmosContainerName,
        "spark.cosmos.read.customQuery": "query"
    }
)
```

## Writing to Azure Cosmos DB for NoSQL containers
<a name="aws-glue-programming-etl-connect-azurecosmos-write"></a>

This example writes information from an existing DynamicFrame, *dynamicFrame*, to Azure Cosmos DB. If the container already has information, AWS Glue will append data from your DynamicFrame. If the information in the container has a different schema from the information you write, you will encounter errors.

**Prerequisites:** 
+ An Azure Cosmos DB container you would like to write to. You will need identification information for the container. **You must create the container before calling the connection method.**

  An Azure Cosmos DB for NoSQL container is identified by its database and container. You must provide the database, *cosmosDBName*, and container, *cosmosContainerName*, names when connecting to the Azure Cosmos DB for NoSQL API.
+ An AWS Glue Azure Cosmos DB connection configured to provide auth and network location information. To acquire this, complete the steps in the previous procedure, *To configure a connection to Azure Cosmos DB*. You will need the name of the AWS Glue connection, *connectionName*. 

For example: 

```
azurecosmos_write = glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="azurecosmos",
    connection_options={
        "connectionName": connectionName,
        "spark.cosmos.database": cosmosDBName,
        "spark.cosmos.container": cosmosContainerName
    }
)
```

## Azure Cosmos DB connection option reference
<a name="aws-glue-programming-etl-connect-azurecosmos-reference"></a>
+ `connectionName` — Required. Used for Read/Write. The name of an AWS Glue Azure Cosmos DB connection configured to provide auth and network location information to your connection method.
+ `spark.cosmos.database` — Required. Used for Read/Write. Valid Values: database names. Azure Cosmos DB for NoSQL database name.
+ `spark.cosmos.container` — Required. Used for Read/Write. Valid Values: container names. Azure Cosmos DB for NoSQL container name.
+ `spark.cosmos.read.customQuery` — Used for Read. Valid Values: SELECT SQL queries. Custom query to select documents to be read.

# Azure SQL connections
<a name="aws-glue-programming-etl-connect-azuresql-home"></a>

You can use AWS Glue for Spark to read from and write to tables on Azure SQL Managed Instances in AWS Glue 4.0 and later versions. You can define what to read from Azure SQL with a SQL query. You connect to Azure SQL using user and password credentials stored in AWS Secrets Manager through an AWS Glue connection.

For more information about Azure SQL, consult the [Azure SQL documentation](https://azure.microsoft.com/en-us/products/azure-sql).

## Configuring Azure SQL connections
<a name="aws-glue-programming-etl-connect-azuresql-configure"></a>

To connect to Azure SQL from AWS Glue, you will need to create and store your Azure SQL credentials in an AWS Secrets Manager secret, then associate that secret with an Azure SQL AWS Glue connection.

**To configure a connection to Azure SQL:**

1. In AWS Secrets Manager, create a secret using your Azure SQL credentials. To create a secret in Secrets Manager, follow the tutorial available in [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html) in the AWS Secrets Manager documentation. After creating the secret, keep the secret name, *secretName*, for the next step. 
   + When selecting **Key/value pairs**, create a pair for the key `user` with the value *azuresqlUsername*.
   + When selecting **Key/value pairs**, create a pair for the key `password` with the value *azuresqlPassword*.

1. In the AWS Glue console, create a connection by following the steps in [Adding an AWS Glue connection](console-connections.md). After creating the connection, keep the connection name, *connectionName*, for future use in AWS Glue. 
   + When selecting a **Connection type**, select Azure SQL.
   + When providing **Azure SQL URL**, provide a JDBC endpoint URL.

      The URL must be in the following format: `jdbc:sqlserver://databaseServerName:databasePort;databaseName=azuresqlDBname;`.

     AWS Glue requires the following URL properties: 
     + `databaseName` – A default database in Azure SQL to connect to.

     For more information about JDBC URLs for Azure SQL Managed Instances, see the [Microsoft documentation](https://learn.microsoft.com/en-us/sql/connect/jdbc/building-the-connection-url?view=azuresqldb-mi-current).
   + When selecting an **AWS Secret**, provide *secretName*.

After creating an AWS Glue Azure SQL connection, you will need to perform the following steps before running your AWS Glue job:
+ Grant the IAM role associated with your AWS Glue job permission to read *secretName*.
+ In your AWS Glue job configuration, provide *connectionName* as an **Additional network connection**.
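The JDBC URL format described above can be assembled from its parts; a sketch in which the server, port, and database names are illustrative:

```
# Sketch: building the Azure SQL JDBC endpoint URL in the format AWS Glue
# expects (jdbc:sqlserver://server:port;databaseName=...;).
database_server_name = "myserver.database.windows.net"  # illustrative
database_port = 1433
azuresql_db_name = "azuresqlDBname"

azure_sql_url = (
    f"jdbc:sqlserver://{database_server_name}:{database_port};"
    f"databaseName={azuresql_db_name};"
)
```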

## Reading from Azure SQL tables
<a name="aws-glue-programming-etl-connect-azuresql-read"></a>

**Prerequisites:** 
+ An Azure SQL table you would like to read from. You will need identification information for the table, *databaseName* and *tableIdentifier*.

  An Azure SQL table is identified by its database, schema and table name. You must provide the database name and table name when connecting to Azure SQL. You must also provide the schema if it is not the default, "public". The database is provided through a URL property in *connectionName*; the schema and table name are provided through `dbtable`.
+ An AWS Glue Azure SQL connection configured to provide auth information. Complete the steps in the previous procedure, *To configure a connection to Azure SQL*, to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*. 

For example: 

```
azuresql_read_table = glueContext.create_dynamic_frame.from_options(
    connection_type="azuresql",
    connection_options={
        "connectionName": "connectionName",
        "dbtable": "tableIdentifier"
    }
)
```

You can also provide a SELECT SQL query to filter the results returned to your DynamicFrame. You will need to configure `query`.

For example:

```
azuresql_read_query = glueContext.create_dynamic_frame.from_options(
    connection_type="azuresql",
    connection_options={
        "connectionName": "connectionName",
        "query": "query"
    }
)
```

## Writing to Azure SQL tables
<a name="aws-glue-programming-etl-connect-azuresql-write"></a>

This example writes information from an existing DynamicFrame, *dynamicFrame*, to Azure SQL. If the table already has information, AWS Glue will append data from your DynamicFrame.

**Prerequisites:** 
+ An Azure SQL table you would like to write to. You will need identification information for the table, *databaseName* and *tableIdentifier*.

  An Azure SQL table is identified by its database, schema and table name. You must provide the database name and table name when connecting to Azure SQL. You must also provide the schema if it is not the default, "public". The database is provided through a URL property in *connectionName*; the schema and table name are provided through `dbtable`.
+ Azure SQL auth information. Complete the steps in the previous procedure, *To configure a connection to Azure SQL*, to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*. 

For example: 

```
azuresql_write = glueContext.write_dynamic_frame.from_options(
    connection_type="azuresql",
    connection_options={
        "connectionName": "connectionName",
        "dbtable": "tableIdentifier"
    }
)
```

## Azure SQL connection option reference
<a name="aws-glue-programming-etl-connect-azuresql-reference"></a>
+ `connectionName` — Required. Used for Read/Write. The name of an AWS Glue Azure SQL connection configured to provide auth information to your connection method.
+ `databaseName` — Used for Read/Write. Valid Values: Azure SQL database names. The name of the database in Azure SQL to connect to.
+ `dbtable` — Required for writing, required for reading unless `query` is provided. Used for Read/Write. Valid Values: Names of Azure SQL tables, or period-separated schema/table name combinations. Used to specify the table and schema that identify the table to connect to. The default schema is "public". If your table is in a non-default schema, provide this information in the form `schemaName.tableName`.
+ `query` — Used for Read. A Transact-SQL SELECT query defining what should be retrieved when reading from Azure SQL. For more information, see the [Microsoft documentation](https://learn.microsoft.com/en-us/sql/t-sql/queries/select-transact-sql?view=azuresqldb-mi-current).

# BigQuery connections
<a name="aws-glue-programming-etl-connect-bigquery-home"></a>

You can use AWS Glue for Spark to read from and write to tables in Google BigQuery in AWS Glue 4.0 and later versions. You can read from BigQuery with a Google SQL query. You connect to BigQuery using credentials stored in AWS Secrets Manager through an AWS Glue connection.

For more information about Google BigQuery, see the [Google Cloud BigQuery website](https://cloud.google.com/bigquery).

## Configuring BigQuery connections
<a name="aws-glue-programming-etl-connect-bigquery-configure"></a>

To connect to Google BigQuery from AWS Glue, you will need to create and store your Google Cloud Platform credentials in an AWS Secrets Manager secret, then associate that secret with a Google BigQuery AWS Glue connection.

**To configure a connection to BigQuery:**

1. In Google Cloud Platform, create and identify relevant resources:
   + Create or identify a GCP project containing BigQuery tables you would like to connect to.
   + Enable the BigQuery API. For more information, see [ Use the BigQuery Storage Read API to read table data ](https://cloud.google.com/bigquery/docs/reference/storage/#enabling_the_api).

1. In Google Cloud Platform, create and export service account credentials:

   You can use the BigQuery credentials wizard to expedite this step: [Create credentials](https://console.cloud.google.com/apis/credentials/wizard?api=bigquery.googleapis.com).

   To create a service account in GCP, follow the tutorial available in [Create service accounts](https://cloud.google.com/iam/docs/service-accounts-create).
   + When selecting **project**, select the project containing your BigQuery table.
   + When selecting GCP IAM roles for your service account, add or create a role that would grant appropriate permissions to run BigQuery jobs to read, write or create BigQuery tables.

   To create credentials for your service account, follow the tutorial available in [Create a service account key](https://cloud.google.com/iam/docs/keys-create-delete#creating).
   + When selecting key type, select **JSON**.

   You should now have downloaded a JSON file with credentials for your service account. It should look similar to the following:

   ```
   {
     "type": "service_account",
     "project_id": "*****",
     "private_key_id": "*****",
     "private_key": "*****",
     "client_email": "*****",
     "client_id": "*****",
     "auth_uri": "https://accounts.google.com/o/oauth2/auth",
     "token_uri": "https://oauth2.googleapis.com/token",
     "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
     "client_x509_cert_url": "*****",
     "universe_domain": "googleapis.com"
   }
   ```

1. Upload your credentials JSON file to an appropriately secure Amazon S3 location. Retain the path to the file, *s3secretpath* for future steps.

1. In AWS Secrets Manager, create a secret using your Google Cloud Platform credentials. To create a secret in Secrets Manager, follow the tutorial available in [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html) in the AWS Secrets Manager documentation. After creating the secret, keep the secret name, *secretName*, for the next step. 

   When creating **Key/value pairs**, specify keys and values as follows:
   + For the `token_uri`, `client_x509_cert_url`, `private_key_id`, `project_id`, `universe_domain`, `auth_provider_x509_cert_url`, `auth_uri`, `client_email`, `private_key`, `type`, and `client_id` keys, specify the corresponding values from the downloaded JSON file.
   + For the `spark.hadoop.google.cloud.auth.service.account.json.keyfile` key, specify *s3secretpath*.

1. In the AWS Glue Data Catalog, create a connection by following the steps in [Adding an AWS Glue connection](console-connections.md). After creating the connection, keep the connection name, *connectionName*, for the next step. 
   + When selecting a **Connection type**, select Google BigQuery.
   + When selecting an **AWS Secret**, provide *secretName*.

1. Grant the IAM role associated with your AWS Glue job permission to read *secretName*.

1. In your AWS Glue job configuration, provide *connectionName* as an **Additional network connection**.
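The mapping from the downloaded service-account JSON file to the secret's key/value pairs can be sketched as follows; the S3 path at the end is an illustrative stand-in for *s3secretpath*:

```
import json

# One secret key/value pair per key in the downloaded credentials file:
credentials = json.loads("""{
  "type": "service_account",
  "project_id": "*****",
  "private_key_id": "*****",
  "private_key": "*****",
  "client_email": "*****",
  "client_id": "*****",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "*****",
  "universe_domain": "googleapis.com"
}""")

secret_key_values = dict(credentials)
# Plus the pointer to the copy of the file uploaded to Amazon S3 (s3secretpath):
secret_key_values["spark.hadoop.google.cloud.auth.service.account.json.keyfile"] = (
    "s3://amzn-s3-demo-bucket/credentials.json"  # illustrative path
)
```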

## Reading from BigQuery tables
<a name="aws-glue-programming-etl-connect-bigquery-read"></a>

**Prerequisites:** 
+ A BigQuery table you would like to read from. You will need the BigQuery table and dataset names, in the form `[dataset].[table]`. Let's call this *tableName*.
+ The billing project for the BigQuery table. You will need the name of the project, *parentProject*. If there is no billing parent project, use the project containing the table.
+ BigQuery auth information. Complete the steps in the previous procedure, *To configure a connection to BigQuery*, to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*. 

For example: 

```
bigquery_read = glueContext.create_dynamic_frame.from_options(
    connection_type="bigquery",
    connection_options={
        "connectionName": "connectionName",
        "parentProject": "parentProject",
        "sourceType": "table",
        "table": "tableName",
    }
)
```

You can also provide a query to filter the results returned to your DynamicFrame. You will need to configure `query`, `sourceType`, `viewsEnabled`, and `materializationDataset`.

For example:

**Additional prerequisites:**

You will need to create or identify a BigQuery dataset, *materializationDataset*, where BigQuery can write materialized views for your queries.

You will need to grant appropriate GCP IAM permissions to your service account to create tables in *materializationDataset*.

```
bigquery_read_query = glueContext.create_dynamic_frame.from_options(
    connection_type="bigquery",
    connection_options={
        "connectionName": "connectionName",
        "materializationDataset": materializationDataset,
        "parentProject": "parentProject",
        "viewsEnabled": "true",
        "sourceType": "query",
        "query": "select * from bqtest.test"
    }
)
```

## Writing to BigQuery tables
<a name="aws-glue-programming-etl-connect-bigquery-write"></a>

This example writes directly to the BigQuery service. BigQuery also supports the "indirect" writing method. For more information about configuring indirect writes, see [Using indirect write with Google BigQuery](#aws-glue-programming-etl-connect-bigquery-indirect-write).

**Prerequisites:** 
+ A BigQuery table you would like to write to. You will need the BigQuery table and dataset names, in the form `[dataset].[table]`. You can also provide a new table name that will automatically be created. Let's call this *tableName*.
+ The billing project for the BigQuery table. You will need the name of the project, *parentProject*. If there is no billing parent project, use the project containing the table.
+ BigQuery auth information. Complete the steps *To manage your connection credentials with AWS Glue* to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*. 

For example: 

```
bigquery_write = glueContext.write_dynamic_frame.from_options(
    frame=frameToWrite,
    connection_type="bigquery",
    connection_options={
        "connectionName": "connectionName",
        "parentProject": "parentProject",
        "writeMethod": "direct",
        "table": "tableName",
    }
)
```

## BigQuery connection option reference
<a name="aws-glue-programming-etl-connect-bigquery-reference"></a>
+ `project` — Default: Google Cloud service account default. Used for Read/Write. The name of a Google Cloud project associated with your table.
+ `table` — (Required) Used for Read/Write. The name of your BigQuery table in the format `[[project:]dataset.]table`.
+ `dataset` — Required when not defined through the `table` option. Used for Read/Write. The name of the dataset containing your BigQuery table.
+ `parentProject` — Default: Google Cloud service account default. Used for Read/Write. The name of a Google Cloud project associated with `project` used for billing.
+ `sourceType` — Used for Read. Required when reading. Valid Values: `table`, `query`. Informs AWS Glue whether you will read by table or by query. 
+ `materializationDataset` — Used for Read. Valid Values: strings. The name of a BigQuery dataset used to store materializations for views.
+ `viewsEnabled` — Used for Read. Default: false. Valid Values: true, false. Configures whether BigQuery will use views. 
+ `query` — Used for Read. Used when `viewsEnabled` is true. A GoogleSQL DQL query.
+ `temporaryGcsBucket` — Used for Write. Required when `writeMethod` is set to default (`indirect`). Name of a Google Cloud Storage bucket used to store an intermediate form of your data while writing to BigQuery.
+ `writeMethod` — Default: `indirect`. Valid Values: `direct`, `indirect`. Used for Write. Specifies the method used to write your data.
  + If set to `direct`, your connector will write using the BigQuery Storage Write API.
  + If set to `indirect`, your connector will write to Google Cloud Storage, then transfer the data to BigQuery using a load operation. Your Google Cloud service account will need appropriate GCS permissions.
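To summarize the two write paths, the following is a minimal sketch of the option dictionaries each method requires; the connection, table, and bucket names are placeholders, not values from this guide:

```python
# A hedged sketch, not AWS sample code: the two write methods differ only
# in the connection options supplied. All names here are placeholders.
direct_options = {
    "connectionName": "connectionName",
    "table": "tableName",
    "writeMethod": "direct",                  # BigQuery Storage Write API
}

indirect_options = {
    "connectionName": "connectionName",
    "table": "tableName",
    "writeMethod": "indirect",                # stage in GCS, then load
    "temporaryGcsBucket": "temporaryBucket",  # required for indirect writes
}
```

Either dictionary would be passed as `connection_options` to a `write_dynamic_frame.from_options` call with `connection_type="bigquery"`.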

## Using indirect write with Google BigQuery
<a name="aws-glue-programming-etl-connect-bigquery-indirect-write"></a>

This example uses indirect write, which writes data to Google Cloud Storage and copies it to Google BigQuery.

**Prerequisites:**

You will need a temporary Google Cloud Storage bucket, *temporaryBucket*.

The GCP IAM role for AWS Glue's GCP service account will need appropriate GCS permissions to access *temporaryBucket*.

**Additional Configuration:**

**To configure indirect write with BigQuery:**

1. Review [Configuring BigQuery connections](#aws-glue-programming-etl-connect-bigquery-configure) and locate or redownload your GCP credentials JSON file. Identify *secretName*, the AWS Secrets Manager secret for the Google BigQuery AWS Glue connection used in your job.

1. Upload your credentials JSON file to an appropriately secure Amazon S3 location. Retain the path to the file, *s3secretpath*, for future steps.

1. Edit *secretName*, adding the `spark.hadoop.google.cloud.auth.service.account.json.keyfile` key. Set the value to *s3secretpath*.

1. Grant your AWS Glue job Amazon S3 IAM permissions to access *s3secretpath*.

You can now provide your temporary GCS bucket location to your write method. You do not need to provide `writeMethod`, as `indirect` is historically the default.

```
bigquery_write = glueContext.write_dynamic_frame.from_options(
    frame=frameToWrite,
    connection_type="bigquery",
    connection_options={
        "connectionName": "connectionName",
        "parentProject": "parentProject",
        "temporaryGcsBucket": "temporaryBucket",
        "table": "tableName",
    }
)
```

# JDBC connections
<a name="aws-glue-programming-etl-connect-jdbc-home"></a>

 Certain, typically relational, database types support connecting through the JDBC standard. For more information about JDBC, see the [Java JDBC API](https://docs.oracle.com/javase/8/docs/technotes/guides/jdbc/) documentation. AWS Glue natively supports connecting to certain databases through their JDBC connectors - the JDBC libraries are provided in AWS Glue Spark jobs. When connecting to these database types using AWS Glue libraries, you have access to a standard set of options. 

The JDBC connectionType values include the following:
+ `"connectionType": "sqlserver"`: Designates a connection to a Microsoft SQL Server database.
+ `"connectionType": "mysql"`: Designates a connection to a MySQL database.
+ `"connectionType": "oracle"`: Designates a connection to an Oracle database.
+ `"connectionType": "postgresql"`: Designates a connection to a PostgreSQL database.
+ `"connectionType": "redshift"`: Designates a connection to an Amazon Redshift database. For more information, see [Redshift connections](aws-glue-programming-etl-connect-redshift-home.md).

The following table lists the JDBC driver versions that AWS Glue supports.


| Product | JDBC driver versions for Glue 5.1 | JDBC driver versions for Glue 5.0 | JDBC driver versions for Glue 4.0 | JDBC driver versions for Glue 3.0 | JDBC driver versions for Glue 0.9, 1.0, 2.0 | 
| --- | --- | --- | --- | --- | --- | 
| Microsoft SQL Server | 10.2.0 | 10.2.0 | 9.4.0 | 7.x | 6.x | 
| MySQL | 8.0.33 | 8.0.33 | 8.0.23 | 8.0.23 | 5.1 | 
| Oracle Database | 23.3.0.23.09 | 23.3.0.23.09 | 21.7 | 21.1 | 11.2 | 
| PostgreSQL | 42.7.3 | 42.7.3 | 42.3.6 | 42.2.18 | 42.1.x | 
| Amazon Redshift \* | redshift-jdbc42-2.1.0.29 | redshift-jdbc42-2.1.0.29 | redshift-jdbc42-2.1.0.16 | redshift-jdbc41-1.2.12.1017 | redshift-jdbc41-1.2.12.1017 | 

\* For the Amazon Redshift connection type, all other option name/value pairs that are included in connection options for a JDBC connection, including formatting options, are passed directly to the underlying SparkSQL DataSource. In AWS Glue with Spark jobs in AWS Glue 4.0 and later versions, the AWS Glue native connector for Amazon Redshift uses the Amazon Redshift integration for Apache Spark. For more information, see [Amazon Redshift integration for Apache Spark](https://docs.aws.amazon.com/redshift/latest/mgmt/spark-redshift-connector.html). In previous versions, see [Amazon Redshift data source for Spark](https://github.com/databricks/spark-redshift).

To configure your Amazon VPC to connect to Amazon RDS data stores using JDBC, refer to [Setting up Amazon VPC for JDBC connections to Amazon RDS data stores from AWS Glue](setup-vpc-for-glue-access.md).

**Note**  
AWS Glue jobs are only associated with one subnet during a run. This may impact your ability to connect to multiple data sources through the same job. This behavior is not limited to JDBC sources.

**Topics**
+ [JDBC connection option reference](#aws-glue-programming-etl-connect-jdbc)
+ [Use sampleQuery](#aws-glue-programming-etl-jdbc-samplequery)
+ [Use custom JDBC driver](#aws-glue-programming-etl-jdbc-custom-driver)
+ [Reading from JDBC tables in parallel](run-jdbc-parallel-read-job.md)
+ [Setting up Amazon VPC for JDBC connections to Amazon RDS data stores from AWS Glue](setup-vpc-for-glue-access.md)

## JDBC connection option reference
<a name="aws-glue-programming-etl-connect-jdbc"></a>

If you already have a JDBC AWS Glue connection defined, you can reuse the configuration properties defined in it, such as `url`, `user`, and `password`, so you don't have to specify them in the code as connection options. This feature is available in AWS Glue 3.0 and later versions. To do so, use the following connection properties:
+ `"useConnectionProperties"`: Set it to "true" to indicate that you want to use the configuration from a connection.
+ `"connectionName"`: Enter the name of the connection to retrieve the configuration from. The connection must be defined in the same Region as the job.
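For example, a read that reuses a stored connection could be configured as in the following sketch. The connection name `my-jdbc-connection` is hypothetical; `url`, `user`, and `password` are supplied by the stored connection rather than in code:

```python
# Sketch, assuming a Data Catalog connection named "my-jdbc-connection"
# exists in the job's Region. url/user/password come from that connection.
connection_options = {
    "useConnectionProperties": "true",
    "connectionName": "my-jdbc-connection",
    "dbtable": "public.sales",  # still required, in schema.table-name form
}
# dyf = glueContext.create_dynamic_frame.from_options(
#     connection_type="postgresql", connection_options=connection_options)
```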

Use these connection options with JDBC connections:
+ `"url"`: (Required) The JDBC URL for the database.
+ `"dbtable"`: (Required) The database table to read from. For JDBC data stores that support schemas within a database, specify `schema.table-name`. If a schema is not provided, then the default "public" schema is used.
+ `"user"`: (Required) The user name to use when connecting.
+ `"password"`: (Required) The password to use when connecting.
+ (Optional) The following options allow you to supply a custom JDBC driver. Use these options if you must use a driver that AWS Glue does not natively support. 

  ETL jobs can use different JDBC driver versions for the data source and target, even if the source and target are the same database product. This allows you to migrate data between source and target databases with different versions. To use these options, you must first upload the JAR file of the JDBC driver to Amazon S3.
  + `"customJdbcDriverS3Path"`: The Amazon S3 path of the custom JDBC driver.
  + `"customJdbcDriverClassName"`: The class name of the JDBC driver.
+ `"bulkSize"`: (Optional) Used to configure parallel inserts for speeding up bulk loads into JDBC targets. Specify an integer value for the degree of parallelism to use when writing or inserting data. This option is helpful for improving the performance of writes into databases such as Amazon Aurora.
+ `"hashfield"` (Optional) A string, used to specify the name of a column in the JDBC table to be used to divide the data into partitions when reading from JDBC tables in parallel. Provide "hashfield" OR "hashexpression". For more information, see [Reading from JDBC tables in parallel](run-jdbc-parallel-read-job.md).
+ `"hashexpression"` (Optional) A SQL select clause returning a whole number. Used to divide the data in a JDBC table into partitions when reading from JDBC tables in parallel. Provide "hashfield" OR "hashexpression". For more information, see [Reading from JDBC tables in parallel](run-jdbc-parallel-read-job.md).
+ `"hashpartitions"` (Optional) A positive integer. Used to specify the number of parallel reads of the JDBC table when reading from JDBC tables in parallel. Default: 7. For more information, see [Reading from JDBC tables in parallel](run-jdbc-parallel-read-job.md).
+ `"sampleQuery"`: (Optional) A custom SQL query statement. Used to specify a subset of information in a table to retrieve a sample of the table contents. **When configured without regard to your data, it can be less efficient than DynamicFrame methods, causing timeouts or out of memory errors.** For more information, see [Use sampleQuery](#aws-glue-programming-etl-jdbc-samplequery).
+ `"enablePartitioningForSampleQuery"`: (Optional) A boolean. Default: false. Used to enable reading from JDBC tables in parallel when specifying `sampleQuery`. **If set to true, `sampleQuery` must end with "where" or "and" for AWS Glue to append partitioning conditions.** For more information, see [Use sampleQuery](#aws-glue-programming-etl-jdbc-samplequery).
+ `"sampleSize"`: (Optional) A positive integer. Limits the number of rows returned by the sample query. Works only when `enablePartitioningForSampleQuery` is true. If partitioning is not enabled, you should instead directly add `"limit x"` in the `sampleQuery` to limit the size. For more information, see [Use sampleQuery](#aws-glue-programming-etl-jdbc-samplequery).

## Use sampleQuery
<a name="aws-glue-programming-etl-jdbc-samplequery"></a>

This section explains how to use `sampleQuery`, `sampleSize` and `enablePartitioningForSampleQuery`.

`sampleQuery` can be an efficient way to sample a few rows of your dataset. By default, the query is run by a single executor. When configured without regard to your data, it can be less efficient than DynamicFrame methods, causing timeouts or out of memory errors. Running SQL on the underlying database as part of your ETL pipeline is generally only needed for performance purposes. If you are trying to preview a few rows of your dataset, consider using [show](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-show). If you are trying to transform your dataset using SQL, consider using [toDF](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-toDF) to define a SparkSQL transform against your data in a DataFrame form.

While your query may manipulate a variety of tables, `dbtable` remains required.

**Using sampleQuery to retrieve a sample of your table**

When using default `sampleQuery` behavior to retrieve a sample of your data, AWS Glue does not expect substantial throughput, so it runs your query on a single executor. To limit the data retrieved and avoid performance problems, we suggest you provide SQL with a `LIMIT` clause.

**Example Use sampleQuery without partitioning**  
The following code example shows how to use `sampleQuery` without partitioning.  

```
//A full SQL query statement.
val query = s"select name from $tableName where age > 0 limit 1"
val connectionOptions = JsonOptions(Map(
    "url" -> url, 
    "dbtable" -> tableName, 
    "user" -> user, 
    "password" -> password, 
    "sampleQuery" -> query ))
val dyf = glueContext.getSource("mysql", connectionOptions)
          .getDynamicFrame()
```

**Using sampleQuery against larger datasets**

 If you're reading a large dataset, you might need to enable JDBC partitioning to query a table in parallel. For more information, see [Reading from JDBC tables in parallel](run-jdbc-parallel-read-job.md). To use `sampleQuery` with JDBC partitioning, set `enablePartitioningForSampleQuery` to true. Enabling this feature requires you to make some changes to your `sampleQuery`.

When using JDBC partitioning with `sampleQuery`, your query must end with "where" or "and" for AWS Glue to append partitioning conditions.

 If you would like to limit the results of your sampleQuery when reading from JDBC tables in parallel, set the `"sampleSize"` parameter rather than specifying a `LIMIT` clause.

**Example Use sampleQuery with JDBC partitioning**  
The following code example shows how to use `sampleQuery` with JDBC partitioning.  

```
//Note that the query should end with "where" or "and" if used with JDBC partitioning.
val query = s"select name from $tableName where age > 0 and"

//Enable JDBC partitioning by setting hashfield.
//to use sampleQuery with partitioning, set enablePartitioningForSampleQuery.
//use sampleSize to limit the size of returned data.
val connectionOptions = JsonOptions(Map(
    "url" -> url, 
    "dbtable" -> tableName, 
    "user" -> user, 
    "password" -> password, 
    "hashfield" -> primaryKey,
    "sampleQuery" -> query,
    "enablePartitioningForSampleQuery" -> true,
    "sampleSize" -> "1" ))
val dyf = glueContext.getSource("mysql", connectionOptions)
          .getDynamicFrame()
```

 **Notes and Restrictions:** 

Sample queries cannot be used together with job bookmarks. The bookmark state is ignored when both are configured.

## Use custom JDBC driver
<a name="aws-glue-programming-etl-jdbc-custom-driver"></a>

The following code examples show how to read from and write to JDBC databases with custom JDBC drivers. They demonstrate reading from one version of a database product, and writing to a later version of the same product.

------
#### [ Python ]

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext, SparkConf
from awsglue.context import GlueContext
from awsglue.job import Job
import time
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Construct JDBC connection options
connection_mysql5_options = {
    "url": "jdbc:mysql://<jdbc-host-name>:3306/db",
    "dbtable": "test",
    "user": "admin",
    "password": "pwd"}

connection_mysql8_options = {
    "url": "jdbc:mysql://<jdbc-host-name>:3306/db",
    "dbtable": "test",
    "user": "admin",
    "password": "pwd",
    "customJdbcDriverS3Path": "s3://amzn-s3-demo-bucket/mysql-connector-java-8.0.17.jar",
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}

connection_oracle11_options = {
    "url": "jdbc:oracle:thin:@//<jdbc-host-name>:1521/ORCL",
    "dbtable": "test",
    "user": "admin",
    "password": "pwd"}

connection_oracle18_options = {
    "url": "jdbc:oracle:thin:@//<jdbc-host-name>:1521/ORCL",
    "dbtable": "test",
    "user": "admin",
    "password": "pwd",
    "customJdbcDriverS3Path": "s3://amzn-s3-demo-bucket/ojdbc10.jar",
    "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}

# Read from JDBC databases with custom driver
df_mysql8 = glueContext.create_dynamic_frame.from_options(connection_type="mysql",
                                                          connection_options=connection_mysql8_options)

# Read DynamicFrame from MySQL 5 and write to MySQL 8
df_mysql5 = glueContext.create_dynamic_frame.from_options(connection_type="mysql",
                                                          connection_options=connection_mysql5_options)
glueContext.write_from_options(frame_or_dfc=df_mysql5, connection_type="mysql",
                               connection_options=connection_mysql8_options)

# Read DynamicFrame from Oracle 11 and write to Oracle 18
df_oracle11 = glueContext.create_dynamic_frame.from_options(connection_type="oracle",
                                                            connection_options=connection_oracle11_options)
glueContext.write_from_options(frame_or_dfc=df_oracle11, connection_type="oracle",
                               connection_options=connection_oracle18_options)
```

------
#### [ Scala ]

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamicFrame
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._


object GlueApp {
  val MYSQL_5_URI: String = "jdbc:mysql://<jdbc-host-name>:3306/db"
  val MYSQL_8_URI: String = "jdbc:mysql://<jdbc-host-name>:3306/db"
  val ORACLE_11_URI: String = "jdbc:oracle:thin:@//<jdbc-host-name>:1521/ORCL"
  val ORACLE_18_URI: String = "jdbc:oracle:thin:@//<jdbc-host-name>:1521/ORCL"

  // Construct JDBC connection options
  lazy val mysql5JsonOption = jsonOptions(MYSQL_5_URI)
  lazy val mysql8JsonOption = customJDBCDriverJsonOptions(MYSQL_8_URI, "s3://amzn-s3-demo-bucket/mysql-connector-java-8.0.17.jar", "com.mysql.cj.jdbc.Driver")
  lazy val oracle11JsonOption = jsonOptions(ORACLE_11_URI)
  lazy val oracle18JsonOption = customJDBCDriverJsonOptions(ORACLE_18_URI, "s3://amzn-s3-demo-bucket/ojdbc10.jar", "oracle.jdbc.OracleDriver")

  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Read from JDBC database with custom driver
    val df_mysql8: DynamicFrame = glueContext.getSource("mysql", mysql8JsonOption).getDynamicFrame()

    // Read DynamicFrame from MySQL 5 and write to MySQL 8
    val df_mysql5: DynamicFrame = glueContext.getSource("mysql", mysql5JsonOption).getDynamicFrame()
    glueContext.getSink("mysql", mysql8JsonOption).writeDynamicFrame(df_mysql5)

    // Read DynamicFrame from Oracle 11 and write to Oracle 18
    val df_oracle11: DynamicFrame = glueContext.getSource("oracle", oracle11JsonOption).getDynamicFrame()
    glueContext.getSink("oracle", oracle18JsonOption).writeDynamicFrame(df_oracle11)

    Job.commit()
  }

  private def jsonOptions(url: String): JsonOptions = {
    new JsonOptions(
      s"""{"url": "${url}",
         |"dbtable":"test",
         |"user": "admin",
         |"password": "pwd"}""".stripMargin)
  }

  private def customJDBCDriverJsonOptions(url: String, customJdbcDriverS3Path: String, customJdbcDriverClassName: String): JsonOptions = {
    new JsonOptions(
      s"""{"url": "${url}",
         |"dbtable":"test",
         |"user": "admin",
         |"password": "pwd",
         |"customJdbcDriverS3Path": "${customJdbcDriverS3Path}",
         |"customJdbcDriverClassName" : "${customJdbcDriverClassName}"}""".stripMargin)
  }
}
```

------

# Reading from JDBC tables in parallel
<a name="run-jdbc-parallel-read-job"></a>

You can set properties of your JDBC table to enable AWS Glue to read data in parallel. When you set certain properties, you instruct AWS Glue to run parallel SQL queries against logical partitions of your data. You can control partitioning by setting a hash field or a hash expression. You can also control the number of parallel reads that are used to access your data. 

Reading from JDBC tables in parallel is an optimization technique that may improve performance. For more information about the process of identifying when this technique is appropriate, consult [Reduce the amount of data scan](https://docs.aws.amazon.com/prescriptive-guidance/latest/tuning-aws-glue-for-apache-spark/parallelize-tasks.html) in the *Best practices for performance tuning AWS Glue for Apache Spark jobs* guide on AWS Prescriptive Guidance.

To enable parallel reads, you can set key-value pairs in the parameters field of your table structure. Use JSON notation to set a value for the parameter field of your table. For more information about editing the properties of a table, see [Viewing and managing table details](tables-described.md#console-tables-details). You can also enable parallel reads when you call the ETL (extract, transform, and load) methods `create_dynamic_frame_from_options` and `create_dynamic_frame_from_catalog`. For more information about specifying options in these methods, see [from_options](aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader-from_options) and [from_catalog](aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader-from_catalog). 

You can use this method for JDBC tables, that is, most tables whose base data is a JDBC data store. These properties are ignored when reading Amazon Redshift and Amazon S3 tables.

**hashfield**  
Set `hashfield` to the name of a column in the JDBC table to be used to divide the data into partitions. For best results, this column should have an even distribution of values to spread the data between partitions. This column can be of any data type. AWS Glue generates non-overlapping queries that run in parallel to read the data partitioned by this column. For example, if your data is evenly distributed by month, you can use the `month` column to read each month of data in parallel.  

```
  'hashfield': 'month'
```
AWS Glue creates a query to hash the field value to a partition number and runs the query for all partitions in parallel. To use your own query to partition a table read, provide a `hashexpression` instead of a `hashfield`.

**hashexpression**  
Set `hashexpression` to an SQL expression (conforming to the JDBC database engine grammar) that returns a whole number. A simple expression is the name of any numeric column in the table. AWS Glue generates SQL queries to read the JDBC data in parallel using the `hashexpression` in the `WHERE` clause to partition data.  
For example, use the numeric column `customerID` to read data partitioned by a customer number.  

```
  'hashexpression': 'customerID'
```
To have AWS Glue control the partitioning, provide a `hashfield` instead of a `hashexpression`.

**hashpartitions**  
Set `hashpartitions` to the number of parallel reads of the JDBC table. If this property is not set, the default value is 7.  
For example, set the number of parallel reads to `5` so that AWS Glue reads your data with five queries (or fewer).  

```
  'hashpartitions': '5'
```
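Combining the three properties, a parallel read might be configured as in this sketch; the host, table name, and credentials are placeholders:

```python
# Sketch: a parallel JDBC read combining hashfield and hashpartitions.
# Host, table, and credentials are placeholders.
connection_options = {
    "url": "jdbc:mysql://<jdbc-host-name>:3306/db",
    "dbtable": "orders",
    "user": "admin",
    "password": "pwd",
    "hashfield": "month",    # evenly distributed partitioning column
    "hashpartitions": "5",   # AWS Glue runs up to five queries in parallel
}
# dyf = glueContext.create_dynamic_frame.from_options(
#     connection_type="mysql", connection_options=connection_options)
```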

# Setting up Amazon VPC for JDBC connections to Amazon RDS data stores from AWS Glue
<a name="setup-vpc-for-glue-access"></a>

 When using JDBC to connect to databases in Amazon RDS, you will need to perform additional setup. To enable AWS Glue components to communicate with Amazon RDS, you must set up access to your Amazon RDS data stores in Amazon VPC. To enable AWS Glue to communicate between its components, specify a security group with a self-referencing inbound rule for all TCP ports. By creating a self-referencing rule, you can restrict the source to the same security group in the VPC. A self-referencing rule will not open the VPC to all networks. The default security group for your VPC might already have a self-referencing inbound rule for ALL Traffic. 

**To set up access between AWS Glue and Amazon RDS data stores**

1. Sign in to the AWS Management Console and open the Amazon RDS console at [https://console.aws.amazon.com/rds/](https://console.aws.amazon.com/rds/).

1. In the Amazon RDS console, identify the security group(s) used to control access to your Amazon RDS database.

   In the left navigation pane, choose **Databases**, then select the instance you would like to connect to from the list in the main pane.

   In the database detail page, find **VPC security groups** on the **Connectivity & security** tab.

1. Based on your network architecture, identify which associated security group is best to modify to allow access for the AWS Glue service. Save its name, *database-security-group*, for future reference. If there is no appropriate security group, follow the directions to [Provide access to your DB instance in your VPC by creating a security group](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_SettingUp.html) in the Amazon RDS documentation.

1. Sign in to the AWS Management Console and open the Amazon VPC console at [https://console.aws.amazon.com/vpc/](https://console.aws.amazon.com/vpc/).

1. In the Amazon VPC console, identify how to update *database-security-group*.

   In the left navigation pane, choose **Security groups**, then select *database-security-group* from the list in the main pane.

1. Identify the security group ID for *database-security-group*, *database-sg-id*. Save it for future reference.

   In the security group detail page, find **Security group ID**.

1. Alter the inbound rules for *database-security-group* by adding a self-referencing rule that allows AWS Glue components to communicate. Specifically, add or confirm that there is a rule where **Type** is `All TCP`, **Protocol** is `TCP`, **Port Range** includes all ports, and **Source** is *database-sg-id*. Verify that the security group you have entered for **Source** is the same as the security group you are editing.

   In the security group detail page, select **Edit inbound rules**.

   The inbound rule looks similar to this:  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/setup-vpc-for-glue-access.html)

1. Add rules for outbound traffic.

   In the security group detail page, select **Edit outbound rules**.

   If your security group allows all outbound traffic, you do not need separate rules. For example:  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/setup-vpc-for-glue-access.html)

   If your network architecture is designed for you to restrict outbound traffic, create the following outbound rules:

   Create a self-referencing rule where **Type** is `All TCP`, **Protocol** is `TCP`, **Port Range** includes all ports, and **Destination** is *database-sg-id*. Verify that the security group you have entered for **Destination** is the same as the security group you are editing.

    If using an Amazon S3 VPC endpoint, add an HTTPS rule to allow traffic from the VPC to Amazon S3. Create a rule where **Type** is `HTTPS`, **Protocol** is `TCP`, **Port Range** is `443` and **Destination** is the ID of the managed prefix list for the Amazon S3 gateway endpoint, *s3-prefix-list-id*. For more information about prefix lists and Amazon S3 gateway endpoints, see [Gateway endpoints for Amazon S3](https://docs.aws.amazon.com//vpc/latest/privatelink/vpc-endpoints-s3.html) in the Amazon VPC documentation.

   For example:  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/setup-vpc-for-glue-access.html)

# MongoDB connections
<a name="aws-glue-programming-etl-connect-mongodb-home"></a>

You can use AWS Glue for Spark to read from and write to tables in MongoDB and MongoDB Atlas in AWS Glue 4.0 and later versions. You can connect to MongoDB using username and password credentials stored in AWS Secrets Manager through an AWS Glue connection.

For more information about MongoDB, consult the [MongoDB documentation](https://www.mongodb.com/docs/).

## Configuring MongoDB connections
<a name="aws-glue-programming-etl-connect-mongodb-configure"></a>

To connect to MongoDB from AWS Glue, you will need your MongoDB credentials, *mongodbUser* and *mongodbPass*.

You may also need to satisfy the following prerequisite:
+ If your MongoDB instance is in an Amazon VPC, configure Amazon VPC to allow your AWS Glue job to communicate with the MongoDB instance without traffic traversing the public internet. 

  In Amazon VPC, identify or create a **VPC**, **Subnet**, and **Security group** that AWS Glue will use while executing the job. Additionally, you need to ensure that Amazon VPC is configured to permit network traffic between your MongoDB instance and this location. Based on your network layout, this may require changes to security group rules, network ACLs, NAT gateways, and peering connections.

You can then proceed to configure AWS Glue for use with MongoDB.

**To configure a connection to MongoDB:**

1. Optionally, in AWS Secrets Manager, create a secret using your MongoDB credentials. To create a secret in Secrets Manager, follow the tutorial available in [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com//secretsmanager/latest/userguide/create_secret.html) in the AWS Secrets Manager documentation. After creating the secret, keep the secret name, *secretName*, for the next step. 
   + When selecting **Key/value pairs**, create a pair for the key `username` with the value *mongodbUser*.

     When selecting **Key/value pairs**, create a pair for the key `password` with the value *mongodbPass*.

1. In the AWS Glue console, create a connection by following the steps in [Adding an AWS Glue connection](console-connections.md). After creating the connection, keep the connection name, *connectionName*, for future use in AWS Glue. 
   + When selecting a **Connection type**, select **MongoDB** or **MongoDB Atlas**.
   + When selecting **MongoDB URL** or **MongoDB Atlas URL**, provide the hostname of your MongoDB instance.

     A MongoDB URL is provided in the format `mongodb://mongoHost:mongoPort/mongoDBname`.

     A MongoDB Atlas URL is provided in the format `mongodb+srv://mongoHost/mongoDBname`.
   + If you chose to create a Secrets Manager secret, choose the AWS Secrets Manager **Credential type**.

     Then, in **AWS Secret** provide *secretName*.
   + If you choose to provide **Username and password**, provide *mongodbUser* and *mongodbPass*.

1. In the following situations, you may require additional configuration:
   + For MongoDB instances hosted on AWS in an Amazon VPC, you will need to provide Amazon VPC connection information to the AWS Glue connection that defines your MongoDB security credentials. When creating or updating your connection, set **VPC**, **Subnet** and **Security groups** in **Network options**.
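
As a quick sketch of the URL and secret formats described in the procedure above (the hostname, port, database name, and credentials are all placeholders):

```python
import json

# Placeholder values -- substitute your own host, port, database, and credentials.
mongo_host, mongo_port, mongo_db = "mongo.example.internal", 27017, "salesdb"

# MongoDB URL format from step 2
mongodb_url = f"mongodb://{mongo_host}:{mongo_port}/{mongo_db}"

# MongoDB Atlas URL format uses the mongodb+srv scheme and omits the port
atlas_url = f"mongodb+srv://{mongo_host}/{mongo_db}"

# SecretString payload matching the key/value pairs from step 1
secret_string = json.dumps({"username": "mongodbUser", "password": "mongodbPass"})

print(mongodb_url)  # mongodb://mongo.example.internal:27017/salesdb
```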

After creating an AWS Glue MongoDB connection, you will need to perform the following actions before calling your connection method:
+ If you chose to create a Secrets Manager secret, grant the IAM role associated with your AWS Glue job permission to read *secretName*.
+ In your AWS Glue job configuration, provide *connectionName* as an **Additional network connection**.
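
A minimal sketch of the Secrets Manager read permission, expressed as an IAM policy document built in Python (the Region, account ID, and secret ARN suffix are hypothetical; use your secret's actual ARN):

```python
import json

# Sketch of an IAM policy statement granting the job's role read access to the
# secret. The ARN below is hypothetical.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "secretsmanager:GetSecretValue",
            "Resource": "arn:aws:secretsmanager:us-east-1:111122223333:secret:secretName-AbCdEf",
        }
    ],
}
print(json.dumps(policy, indent=2))
```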

To use your AWS Glue MongoDB connection in AWS Glue for Spark, provide the `connectionName` option in your connection method call. Alternatively, you can follow the steps in [Working with MongoDB connections in ETL jobs](integrate-with-mongo-db.md) to use the connection in conjunction with the AWS Glue Data Catalog.

## Reading from MongoDB using an AWS Glue connection
<a name="aws-glue-programming-etl-connect-mongodb-read"></a>

**Prerequisites:** 
+ A MongoDB collection you would like to read from. You will need identification information for the collection.

  A MongoDB collection is identified by a database name and a collection name, *mongodbName*, *mongodbCollection*.
+ An AWS Glue MongoDB connection configured to provide auth information. Complete the steps in the previous procedure, *To configure a connection to MongoDB*, to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*.

For example: 

```
mongodb_read = glueContext.create_dynamic_frame.from_options(
    connection_type="mongodb",
    connection_options={
        "connectionName": "connectionName",
        "database": "mongodbName",
        "collection": "mongodbCollection",
        "partitioner": "com.mongodb.spark.sql.connector.read.partitioner.SinglePartitionPartitioner",
        "partitionerOptions.partitionSizeMB": "10",
        "partitionerOptions.partitionKey": "_id",
        "disableUpdateUri": "false",
    }
)
```

## Writing to MongoDB tables
<a name="aws-glue-programming-etl-connect-mongodb-write"></a>

This example writes information from an existing DynamicFrame, *dynamicFrame*, to MongoDB.

**Prerequisites:** 
+ A MongoDB collection you would like to write to. You will need identification information for the collection.

  A MongoDB collection is identified by a database name and a collection name, *mongodbName*, *mongodbCollection*.
+ An AWS Glue MongoDB connection configured to provide auth information. Complete the steps in the previous procedure, *To configure a connection to MongoDB*, to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*.

For example: 

```
glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="mongodb",
    connection_options={
        "connectionName": "connectionName",
        "database": "mongodbName",
        "collection": "mongodbCollection",
        "disableUpdateUri": "false",
        "retryWrites": "false", 
    },
)
```

## Reading and writing to MongoDB tables
<a name="aws-glue-programming-etl-connect-mongodb-read-write"></a>

This example reads information from MongoDB into a DynamicFrame and then writes that DynamicFrame back to MongoDB.

**Prerequisites:** 
+ A MongoDB collection you would like to read from. You will need identification information for the collection.

  A MongoDB collection you would like to write to. You will need identification information for the collection.

  A MongoDB collection is identified by a database name and a collection name, *mongodbName*, *mongodbCollection*.
+ MongoDB auth information, *mongodbUser* and *mongodbPassword*.

For example: 

------
#### [ Python ]

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import time

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

job = Job(glueContext)
job.init(args['JOB_NAME'], args)

output_path = "s3://some_bucket/output/" + str(time.time()) + "/"
mongo_uri = "mongodb://<mongo-instanced-ip-address>:27017"
mongo_ssl_uri = "mongodb://<mongo-instanced-ip-address>:27017"
write_uri = "mongodb://<mongo-instanced-ip-address>:27017"

read_mongo_options = {
    "uri": mongo_uri,
    "database": "mongodbName",
    "collection": "mongodbCollection",
    "username": "mongodbUsername",
    "password": "mongodbPassword",
    "partitioner": "MongoSamplePartitioner",
    "partitionerOptions.partitionSizeMB": "10",
    "partitionerOptions.partitionKey": "_id"}

ssl_mongo_options = {
    "uri": mongo_ssl_uri,
    "database": "mongodbName",
    "collection": "mongodbCollection",
    "ssl": "true",
    "ssl.domain_match": "false"
}

write_mongo_options = {
    "uri": write_uri,
    "database": "mongodbName",
    "collection": "mongodbCollection",
    "username": "mongodbUsername",
    "password": "mongodbPassword",
}

# Get DynamicFrame from MongoDB
dynamic_frame = glueContext.create_dynamic_frame.from_options(connection_type="mongodb",
                                                              connection_options=read_mongo_options)

# Write DynamicFrame to MongoDB
glueContext.write_dynamic_frame.from_options(frame=dynamic_frame, connection_type="mongodb", connection_options=write_mongo_options)

job.commit()
```

------
#### [ Scala ]

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamicFrame
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  val DEFAULT_URI: String = "mongodb://<mongo-instanced-ip-address>:27017"
  val WRITE_URI: String = "mongodb://<mongo-instanced-ip-address>:27017"
  lazy val defaultJsonOption = jsonOptions(DEFAULT_URI)
  lazy val writeJsonOption = jsonOptions(WRITE_URI)
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Get DynamicFrame from MongoDB
    val dynamicFrame: DynamicFrame = glueContext.getSource("mongodb", defaultJsonOption).getDynamicFrame()

    // Write DynamicFrame to MongoDB
    glueContext.getSink("mongodb", writeJsonOption).writeDynamicFrame(dynamicFrame)

    Job.commit()
  }

  private def jsonOptions(uri: String): JsonOptions = {
    new JsonOptions(
      s"""{"uri": "${uri}",
         |"database":"mongodbName",
         |"collection":"mongodbCollection",
         |"username": "mongodbUsername",
         |"password": "mongodbPassword",
         |"ssl":"true",
         |"ssl.domain_match":"false",
         |"partitioner": "MongoSamplePartitioner",
         |"partitionerOptions.partitionSizeMB": "10",
         |"partitionerOptions.partitionKey": "_id"}""".stripMargin)
  }
}
```

------

## MongoDB connection option reference
<a name="aws-glue-programming-etl-connect-mongodb"></a>

Designates a connection to MongoDB. Connection options differ for a source connection and a sink connection.

These connection properties are shared between source and sink connections:
+ `connectionName` — Used for Read/Write. The name of an AWS Glue MongoDB connection configured to provide auth and networking information to your connection method. When an AWS Glue connection is configured as described in the previous section, [Configuring MongoDB connections](#aws-glue-programming-etl-connect-mongodb-configure), providing `connectionName` removes the need to provide the `"uri"`, `"username"` and `"password"` connection options.
+ `"uri"`: (Required) The MongoDB host to read from, formatted as `mongodb://<host>:<port>`. Used in AWS Glue versions prior to AWS Glue 4.0.
+ `"connection.uri"`: (Required) The MongoDB host to read from, formatted as `mongodb://<host>:<port>`. Used in AWS Glue 4.0 and later versions.
+ `"username"`: (Required) The MongoDB user name.
+ `"password"`: (Required) The MongoDB password.
+ `"database"`: (Required) The MongoDB database to read from. This option can also be passed in `additional_options` when calling `glue_context.create_dynamic_frame_from_catalog` in your job script.
+ `"collection"`: (Required) The MongoDB collection to read from. This option can also be passed in `additional_options` when calling `glue_context.create_dynamic_frame_from_catalog` in your job script.

### "connectionType": "mongodb" as source
<a name="etl-connect-mongodb-as-source"></a>

Use the following connection options with `"connectionType": "mongodb"` as a source:
+ `"ssl"`: (Optional) If `true`, initiates an SSL connection. The default is `false`.
+ `"ssl.domain_match"`: (Optional) If `true` and `ssl` is `true`, domain match check is performed. The default is `true`.
+ `"batchSize"`: (Optional) The number of documents to return per batch, used within the cursor of internal batches.
+ `"partitioner"`: (Optional) The class name of the partitioner for reading input data from MongoDB. The connector provides the following partitioners:
  + `MongoDefaultPartitioner` (default) (Not supported in AWS Glue 4.0)
  + `MongoSamplePartitioner` (Requires MongoDB 3.2 or later) (Not supported in AWS Glue 4.0)
  + `MongoShardedPartitioner` (Not supported in AWS Glue 4.0)
  + `MongoSplitVectorPartitioner` (Not supported in AWS Glue 4.0)
  + `MongoPaginateByCountPartitioner` (Not supported in AWS Glue 4.0)
  + `MongoPaginateBySizePartitioner` (Not supported in AWS Glue 4.0)
  + `com.mongodb.spark.sql.connector.read.partitioner.SinglePartitionPartitioner`
  + `com.mongodb.spark.sql.connector.read.partitioner.ShardedPartitioner`
  + `com.mongodb.spark.sql.connector.read.partitioner.PaginateIntoPartitionsPartitioner`
+ `"partitionerOptions"`: (Optional) Options for the designated partitioner. The following options are supported for each partitioner:
  + `MongoSamplePartitioner`: `partitionKey`, `partitionSizeMB`, `samplesPerPartition`
  + `MongoShardedPartitioner`: `shardkey`
  + `MongoSplitVectorPartitioner`: `partitionKey`, `partitionSizeMB`
  + `MongoPaginateByCountPartitioner`: `partitionKey`, `numberOfPartitions`
  + `MongoPaginateBySizePartitioner`: `partitionKey`, `partitionSizeMB`

  For more information about these options, see [Partitioner Configuration](https://docs.mongodb.com/spark-connector/master/configuration/#partitioner-conf) in the MongoDB documentation.
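
For illustration, the partitioner and its options above combine into one flat options map. This sketch targets `MongoSamplePartitioner` on AWS Glue 3.0 or earlier (see the support notes above); the connection, database, and collection names are placeholders:

```python
# Source options for MongoSamplePartitioner (AWS Glue 3.0 and earlier).
# Partitioner options are passed as flat, dot-prefixed keys; all values are strings.
read_options = {
    "connectionName": "connectionName",  # placeholder AWS Glue connection name
    "database": "mongodbName",
    "collection": "mongodbCollection",
    "partitioner": "MongoSamplePartitioner",
    "partitionerOptions.partitionKey": "_id",
    "partitionerOptions.partitionSizeMB": "64",
    "partitionerOptions.samplesPerPartition": "10",
}
```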

### "connectionType": "mongodb" as sink
<a name="etl-connect-mongodb-as-sink"></a>

Use the following connection options with `"connectionType": "mongodb"` as a sink:
+ `"ssl"`: (Optional) If `true`, initiates an SSL connection. The default is `false`.
+ `"ssl.domain_match"`: (Optional) If `true` and `ssl` is `true`, domain match check is performed. The default is `true`.
+ `"extendedBsonTypes"`: (Optional) If `true`, allows extended BSON types when writing data to MongoDB. The default is `true`.
+ `"replaceDocument"`: (Optional) If `true`, replaces the whole document when saving datasets that contain an `_id` field. If `false`, only fields in the document that match the fields in the dataset are updated. The default is `true`.
+ `"maxBatchSize"`: (Optional) The maximum batch size for bulk operations when saving data. The default is 512.
+ `"retryWrites"`: (Optional) Automatically retry certain write operations a single time if AWS Glue encounters a network error.

# SAP HANA connections
<a name="aws-glue-programming-etl-connect-saphana-home"></a>

You can use AWS Glue for Spark to read from and write to tables in SAP HANA in AWS Glue 4.0 and later versions. You can define what to read from SAP HANA with a SQL query. You connect to SAP HANA using JDBC credentials stored in AWS Secrets Manager through an AWS Glue SAP HANA connection.

For more information about SAP HANA JDBC, consult [the SAP HANA documentation](https://help.sap.com/docs/SAP_HANA_PLATFORM/0eec0d68141541d1b07893a39944924e/ff15928cf5594d78b841fbbe649f04b4.html).

## Configuring SAP HANA connections
<a name="aws-glue-programming-etl-connect-saphana-configure"></a>

To connect to SAP HANA from AWS Glue, you will need to create and store your SAP HANA credentials in an AWS Secrets Manager secret, then associate that secret with an AWS Glue SAP HANA connection. You will need to configure network connectivity between your SAP HANA service and AWS Glue.

To connect to SAP HANA, you may need to satisfy the following prerequisites:
+ If your SAP HANA service is in an Amazon VPC, configure Amazon VPC to allow your AWS Glue job to communicate with the SAP HANA service without traffic traversing the public internet.

  In Amazon VPC, identify or create a **VPC**, **Subnet** and **Security group** that AWS Glue will use while executing the job. Additionally, you need to ensure Amazon VPC is configured to permit network traffic between your SAP HANA endpoint and this location. Your job will need to establish a TCP connection with your SAP HANA JDBC port. For more information about SAP HANA ports, see the [SAP HANA documentation](https://help.sap.com/docs/HANA_SMART_DATA_INTEGRATION/7952ef28a6914997abc01745fef1b607/88e2e8bded9e4041ad3ad87dc46c7b55.html?locale=en-US). Based on your network layout, this may require changes to security group rules, Network ACLs, NAT Gateways and Peering connections.
+ There are no additional prerequisites if your SAP HANA endpoint is internet accessible.

**To configure a connection to SAP HANA:**

1. In AWS Secrets Manager, create a secret using your SAP HANA credentials. To create a secret in Secrets Manager, follow the tutorial available in [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com//secretsmanager/latest/userguide/create_secret.html) in the AWS Secrets Manager documentation. After creating the secret, keep the secret name, *secretName*, for the next step.
   + When selecting **Key/value pairs**, create a pair for the key `username/USERNAME` with the value *saphanaUsername*.
   + When selecting **Key/value pairs**, create a pair for the key `password/PASSWORD` with the value *saphanaPassword*.

1. In the AWS Glue console, create a connection by following the steps in [Adding an AWS Glue connection](console-connections.md). After creating the connection, keep the connection name, *connectionName*, for future use in AWS Glue. 
   + When selecting a **Connection type**, select SAP HANA.
   + When providing **SAP HANA URL**, provide the URL for your instance.

     SAP HANA JDBC URLs are in the form `jdbc:sap://saphanaHostname:saphanaPort/?databaseName=saphanaDBname,ParameterName=ParameterValue`

     AWS Glue requires the following JDBC URL parameters: 
     + `databaseName` – A default database in SAP HANA to connect to.
   + When selecting an **AWS Secret**, provide *secretName*.
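
As a sketch of the JDBC URL format above (the hostname, port, and database name are placeholders, not values from a real SAP HANA deployment):

```python
# Placeholder endpoint values -- replace with your SAP HANA host, port, and database.
host, port, db_name = "saphana.example.internal", 30015, "HXE"

# AWS Glue requires the databaseName JDBC URL parameter.
jdbc_url = f"jdbc:sap://{host}:{port}/?databaseName={db_name}"
print(jdbc_url)  # jdbc:sap://saphana.example.internal:30015/?databaseName=HXE
```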

After creating an AWS Glue SAP HANA connection, you will need to perform the following steps before running your AWS Glue job:
+ Grant the IAM role associated with your AWS Glue job permission to read *secretName*.
+ In your AWS Glue job configuration, provide *connectionName* as an **Additional network connection**.

## Reading from SAP HANA tables
<a name="aws-glue-programming-etl-connect-saphana-read"></a>

**Prerequisites:** 
+ A SAP HANA table you would like to read from. You will need identification information for the table.

  A table can be specified with a SAP HANA table name and schema name, in the form `schemaName.tableName`. The schema name and "." separator are not required if the table is in the default schema, "public". Call this *tableIdentifier*. Note that the database is provided as a JDBC URL parameter in `connectionName`.
+ An AWS Glue SAP HANA connection configured to provide auth information. Complete the steps in the previous procedure, *To configure a connection to SAP HANA*, to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*.

For example: 

```
saphana_read_table = glueContext.create_dynamic_frame.from_options(
    connection_type="saphana",
    connection_options={
        "connectionName": "connectionName",
        "dbtable": "tableIdentifier",
    }
)
```

You can also provide a SELECT SQL query to filter the results returned to your DynamicFrame. You will need to configure `query`.

For example:

```
saphana_read_query = glueContext.create_dynamic_frame.from_options(
    connection_type="saphana",
    connection_options={
        "connectionName": "connectionName",
        "query": "query"
    }
)
```

## Writing to SAP HANA tables
<a name="aws-glue-programming-etl-connect-saphana-write"></a>

This example writes information from an existing DynamicFrame, *dynamicFrame*, to SAP HANA. If the table already contains data, AWS Glue will raise an error.

**Prerequisites:** 
+ A SAP HANA table you would like to write to. 

  A table can be specified with a SAP HANA table name and schema name, in the form `schemaName.tableName`. The schema name and "." separator are not required if the table is in the default schema, "public". Call this *tableIdentifier*. Note that the database is provided as a JDBC URL parameter in `connectionName`.
+ SAP HANA auth information. Complete the steps in the previous procedure, *To configure a connection to SAP HANA*, to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*.

For example: 

```
options = {
    "connectionName": "connectionName",
    "dbtable": "tableIdentifier",
}

saphana_write = glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="saphana",
    connection_options=options,
)
```

## SAP HANA connection option reference
<a name="aws-glue-programming-etl-connect-saphana-reference"></a>
+ `connectionName` — Required. Used for Read/Write. The name of an AWS Glue SAP HANA connection configured to provide auth and networking information to your connection method.
+ `databaseName` — Used for Read/Write. Valid Values: names of databases in SAP HANA. Name of database to connect to.
+ `dbtable` — Required for writing, required for reading unless `query` is provided. Used for Read/Write. Valid Values: contents of a SAP HANA SQL FROM clause. Identifies a table in SAP HANA to connect to. You may also provide other SQL than a table name, such as a subquery. For more information, see the [From clause](https://help.sap.com/docs/SAP_HANA_PLATFORM/4fe29514fd584807ac9f2a04f6754767/20fcf24075191014a89e9dc7b8408b26.html#loio20fcf24075191014a89e9dc7b8408b26__from_clause) in the SAP HANA documentation.
+ `query` — Used for Read. A SAP HANA SQL SELECT query defining what should be retrieved when reading from SAP HANA.
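
Because `dbtable` accepts anything valid in a SAP HANA FROM clause, a subquery can stand in for a plain table name. A sketch, with hypothetical schema, table, and column names:

```python
# "dbtable" can carry a subquery instead of a table name. The schema, table,
# and column names here are hypothetical.
options_subquery = {
    "connectionName": "connectionName",  # placeholder AWS Glue connection name
    "dbtable": "(SELECT id, amount FROM mySchema.orders WHERE amount > 100)",
}
```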

# Snowflake connections
<a name="aws-glue-programming-etl-connect-snowflake-home"></a>

You can use AWS Glue for Spark to read from and write to tables in Snowflake in AWS Glue 4.0 and later versions. You can read from Snowflake with a SQL query. You can connect to Snowflake using one of three methods: basic authentication (using a username and password), OAuth authentication, or key-pair authentication. You can refer to Snowflake credentials stored in AWS Secrets Manager through AWS Glue Data connections. Data connection Snowflake credentials for AWS Glue for Spark are stored separately from Data Catalog Snowflake credentials for crawlers. You must choose a `SNOWFLAKE` type connection and not a `JDBC` type connection configured to connect to Snowflake.

For more information about Snowflake, see the [Snowflake website](https://www.snowflake.com/). For more information about Snowflake on AWS, see [Snowflake Data Warehouse on Amazon Web Services](https://aws.amazon.com/financial-services/partner-solutions/snowflake/).

## Configuring Snowflake connections
<a name="aws-glue-programming-etl-connect-snowflake-configure"></a>

There are no AWS prerequisites for connecting to Snowflake databases available through the internet.

Optionally, you can perform the following configuration to manage your connection credentials with AWS Glue.

**To manage your connection credentials with AWS Glue**

1. In AWS Secrets Manager, create a secret using your Snowflake credentials. To create a secret in Secrets Manager, follow the tutorial available in [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html#create_secret_cli) in the AWS Secrets Manager documentation. After creating the secret, keep the secret name, *secretName*, for the next step.
   + For OAuth authentication:
     + When selecting **Key/value pairs**, create a pair for *snowflakeUser* with the key `sfUser`
     + When selecting **Key/value pairs**, create a pair for *OAUTH_CLIENT_SECRET* with the key `USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET`
   + For Key-pair authentication:
     + When selecting **Key/value pairs**, create a pair for *snowflakeUser* with the key `sfUser`
     + When selecting **Key/value pairs**, create a pair for *private key* with the key `pem_private_key`
   + For basic authentication:
     + When selecting **Key/value pairs**, create a pair for *snowflakeUser* with the key `USERNAME`
     + When selecting **Key/value pairs**, create a pair for *snowflakePassword* with the key `PASSWORD`
   + When selecting **Key/value pairs**, you can provide your Snowflake warehouse with the key `sfWarehouse`.
   + When selecting **Key/value pairs**, you can provide additional Snowflake connection properties using their corresponding Spark property names as keys. Supported properties include:
     + `sfDatabase` - Snowflake database name
     + `sfSchema` - Snowflake schema name
     + `sfRole` - Snowflake role name

1. In the AWS Glue Studio console, create a connection by choosing **Data Connections**, then **Create connection**. Follow the steps in the connection wizard to complete the process:
   + When selecting a **Data source**, select Snowflake, then choose **Next**.
   + Enter the connection details such as host and port. When entering the host **Snowflake URL**, provide the URL of your Snowflake instance. The URL will typically use a hostname in the form `account_identifier.snowflakecomputing.com`. However, the URL format may vary depending on your Snowflake account type (for example, AWS, Azure, or Snowflake-hosted).
   + When selecting the IAM service role, choose from the drop-down menu. This is the IAM role from your account that AWS Glue uses to access AWS Secrets Manager and, if a VPC is specified, to provision network interfaces.
   + When selecting an **AWS Secret**, provide *secretName*.

1. In the next step in the wizard, set properties for your Snowflake connection. 

1. In the final step in the wizard, review your settings and then complete the process to create your connection.
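
The basic-authentication secret from step 1 can be sketched as a JSON payload (all values here are placeholders, not real credentials):

```python
import json

# SecretString payload for basic authentication; sfWarehouse is the optional
# warehouse key described in step 1. All values are placeholders.
secret_string = json.dumps({
    "USERNAME": "snowflakeUser",
    "PASSWORD": "snowflakePassword",
    "sfWarehouse": "warehouseName",
})
```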

For Snowflake hosted on AWS in an Amazon VPC, you may require the following:
+ You will need appropriate Amazon VPC configuration for Snowflake. For more information on how to configure your Amazon VPC, consult [AWS PrivateLink & Snowflake ](https://docs.snowflake.com/en/user-guide/admin-security-privatelink) in the Snowflake documentation.
+ You will need appropriate Amazon VPC configuration for AWS Glue. For more information, see [Configuring interface VPC endpoints (AWS PrivateLink) for AWS Glue](vpc-interface-endpoints.md).
+ You will need to create an AWS Glue Data Catalog connection that provides Amazon VPC connection information (in addition to the ID of an AWS Secrets Manager secret that defines your Snowflake security credentials). Your URL will change when using AWS PrivateLink, as described in the Snowflake documentation linked in a previous item.
+ You will need your job configuration to include the Data Catalog connection as an **Additional network connection**.

## Reading from Snowflake tables
<a name="aws-glue-programming-etl-connect-snowflake-read"></a>

**Prerequisites:** A Snowflake table you would like to read from. You will need the Snowflake table name, *tableName*. If your Snowflake user does not have a default namespace set, you will need the Snowflake database name, *databaseName*, and the schema name, *schemaName*. Additionally, if your Snowflake user does not have a default warehouse set, you will need a warehouse name, *warehouseName*. The `connectionName` parameter selects which **Additional network connection** to connect with.

```
snowflake_read = glueContext.create_dynamic_frame.from_options(
  connection_type="snowflake",
  connection_options={
        "connectionName": "connectionName",
        "dbtable": "tableName",
        "sfDatabase": "databaseName",
        "sfSchema": "schemaName",
        "sfWarehouse": "warehouseName",
    }
)
```

 Additionally, you can use the `autopushdown` and `query` parameters to read a portion of a Snowflake table. This can be substantially more efficient than filtering your results after they have been loaded into Spark. Consider an example where all sales are stored in the same table, but you only need to analyze sales from a certain store on holidays. If that information is stored in the table, you could use predicate pushdown to retrieve the results as follows:

```
snowflake_node = glueContext.create_dynamic_frame.from_options(
    connection_type="snowflake",
    connection_options={
        "autopushdown": "on",
        "query": "select * from sales where store='1' and IsHoliday='TRUE'",
        "connectionName": "snowflake-glue-conn",
        "sfDatabase": "databaseName",
        "sfSchema": "schemaName",
        "sfWarehouse": "warehouseName",
    }
)
```

## Writing to Snowflake tables
<a name="aws-glue-programming-etl-connect-snowflake-write"></a>

**Prerequisites:** A Snowflake database you would like to write to. You will need a current or desired table name, *tableName*. If your Snowflake user does not have a default namespace set, you will need the Snowflake database name, *databaseName*, and the schema name, *schemaName*. Additionally, if your Snowflake user does not have a default warehouse set, you will need a warehouse name, *warehouseName*. The `connectionName` parameter selects which **Additional network connection** to connect with.

```
glueContext.write_dynamic_frame.from_options(
    connection_type="snowflake",
    connection_options={
        "connectionName": "connectionName",
        "dbtable": "tableName",
        "sfDatabase": "databaseName",
        "sfSchema": "schemaName",
        "sfWarehouse": "warehouseName",
    },
)
```

## Snowflake connection option reference
<a name="aws-glue-programming-etl-connect-snowflake-reference"></a>

The Snowflake connection type takes the following connection options:

You can retrieve some of the parameters in this section from an AWS Glue connection (`sfUrl`, `sfUser`, `sfPassword`), in which case you are not required to provide them. You can do this by providing the parameter `connectionName`.

You can retrieve connection parameters from AWS Secrets Manager secrets using the `secretId` parameter. When using Secrets Manager, the following Spark properties can be automatically retrieved if present in the secret:
+ `sfUser` (using key `USERNAME` or `sfUser`)
+ `sfPassword` (using key `PASSWORD` or `sfPassword`, when using basic authentication)
+ `sfWarehouse` (using key `sfWarehouse`)
+ `sfDatabase` (using key `sfDatabase`)
+ `sfSchema` (using key `sfSchema`)
+ `sfRole` (using key `sfRole`)
+ `pem_private_key` (using key `pem_private_key`, when using key-pair authentication)
+ `USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET` (when using OAuth authentication)

**Property Precedence Order:** When the same property is specified in multiple locations, AWS Glue uses the following precedence order (highest to lowest):

1. Explicitly provided connection options in your job code

1. AWS Glue connection properties

1. AWS Secrets Manager secret values (when `secretId` is specified)

1. Snowflake user defaults
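
The precedence order can be sketched as a dictionary merge: lower-precedence sources are applied first so higher-precedence sources overwrite them. (Snowflake user defaults, the lowest level, apply server-side and are not shown; all property values here are hypothetical.)

```python
# Lower-precedence sources are applied first; each update() overwrites
# duplicate keys, so the job code wins. All values are hypothetical.
secret_values = {"sfUser": "analyst", "sfWarehouse": "default_wh"}  # 3. secret
glue_connection = {"sfUrl": "acct.snowflakecomputing.com"}          # 2. connection
job_options = {"sfWarehouse": "etl_wh"}                             # 1. job code

effective = {}
for source in (secret_values, glue_connection, job_options):
    effective.update(source)

print(effective["sfWarehouse"])  # etl_wh -- job code beats the secret
```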

The following parameters are used generally when connecting to Snowflake.
+ `sfDatabase` — Required if a user default is not set in Snowflake. Used for Read/Write. The database to use for the session after connecting.
+ `sfSchema` — Required if a user default is not set in Snowflake. Used for Read/Write. The schema to use for the session after connecting.
+ `sfWarehouse` — Required if a user default is not set in Snowflake. Used for Read/Write. The default virtual warehouse to use for the session after connecting.
+ `sfRole` — Required if a user default is not set in Snowflake. Used for Read/Write. The default security role to use for the session after connecting.
+ `sfUrl` — (Required) Used for Read/Write. Specifies the hostname for your account in the following format: `account_identifier.snowflakecomputing.com`. For more information about account identifiers, see [Account Identifiers](https://docs.snowflake.com/en/user-guide/admin-account-identifier) in the Snowflake documentation.
+ `sfUser` — (Required) Used for Read/Write. Login name for the Snowflake user.
+ `sfPassword` — (Required when using basic authentication) Used for Read/Write. Password for the Snowflake user.
+ `dbtable` — Required when working with full tables. Used for Read/Write. The name of the table to be read or the table to which data is written. When reading, all columns and records are retrieved.
+ `pem_private_key` — (Required when using key-pair authentication) Used for Read/Write. An unencrypted b64-encoded private key string. The private key for the Snowflake user. It is common to copy this out of a PEM file. For more information, see [Key-pair authentication and key-pair rotation](https://docs.snowflake.com/en/user-guide/key-pair-auth) in the Snowflake documentation.
+ `USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET` — (Required when using OAuth authentication) Used for Read/Write. This value corresponds to the `OAUTH_CLIENT_SECRET`, which can be obtained from the Snowflake security integration configured to enable OAuth-based authentication for your account. For more details, see [Configure Snowflake OAuth for custom clients](https://docs.snowflake.com/en/user-guide/oauth-custom) in the Snowflake documentation.
+ `query` — Required when reading with a query. Used for Read. The exact query (`SELECT` statement) to run.

The following options are used to configure specific behaviors during the process of connecting to Snowflake.
+ `preactions` — Used for Read/Write. Valid values: a semicolon-separated list of SQL statements, as a string. SQL statements run before data is transferred between AWS Glue and Snowflake. If a statement contains `%s`, the `%s` is replaced with the table name referenced for the operation.
+ `postactions` — Used for Read/Write. SQL statements run after data is transferred between AWS Glue and Snowflake. If a statement contains `%s`, the `%s` is replaced with the table name referenced for the operation.
+ `autopushdown` — Default: `"on"`. Valid Values: `"on"`, `"off"`. This parameter controls whether automatic query pushdown is enabled. If pushdown is enabled, then when a query is run on Spark, if part of the query can be "pushed down" to the Snowflake server, it is pushed down. This improves performance of some queries. For information about whether your query can be pushed down, consult [Pushdown](https://docs.snowflake.com/en/user-guide/spark-connector-use#pushdown) in the Snowflake documentation.
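
The `%s` substitution described for `preactions` and `postactions` can be pictured with a small helper. The function below is illustrative only (it is not part of the connector); it mirrors the documented behavior of splitting on semicolons and replacing `%s` with the table name:

```python
def expand_table_placeholder(statements, table_name):
    """Split a semicolon-separated SQL string and substitute %s with the table name."""
    return [
        stmt.strip().replace("%s", table_name)
        for stmt in statements.split(";")
        if stmt.strip()
    ]

preactions = "DELETE FROM %s WHERE load_date < CURRENT_DATE; ANALYZE %s"
print(expand_table_placeholder(preactions, "staging_sales"))
# → ['DELETE FROM staging_sales WHERE load_date < CURRENT_DATE', 'ANALYZE staging_sales']
```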

Additionally, some of the options available on the Snowflake Spark connector may be supported in AWS Glue. For more information about options available on the Snowflake Spark connector, see [Setting Configuration Options for the Connector](https://docs.snowflake.com/en/user-guide/spark-connector-use#setting-configuration-options-for-the-connector) in the Snowflake documentation. 

## Snowflake authentication methods
<a name="aws-glue-programming-etl-connect-snowflake-authentication"></a>

AWS Glue supports the following authentication methods for connecting to Snowflake:
+ **Basic authentication:** Provide `sfUser` and `sfPassword` parameters.
+ **Key-pair authentication:** Provide `sfUser` and `pem_private_key` parameters. When using key-pair authentication, the `sfPassword` parameter is not required.
+ **OAuth authentication:** The Snowflake connector supports the `AUTHORIZATION_CODE` grant type to request access to your Snowflake data. This grant type is referred to as "3-legged OAuth" because it involves redirecting users to a third-party authorization server where they can authenticate and approve access. This method is used when creating a connection through the AWS Glue console.
  + **Prerequisite:** To use this authentication method, ensure the following setup is complete: 
    + **Configure Snowflake OAuth for a custom client** by following the official Snowflake documentation: [Configure Snowflake OAuth for custom clients.](https://docs.snowflake.com/en/user-guide/oauth-custom) 
    + **Set the correct redirect URI** when creating the Snowflake security integration. For example, if you are creating the connection in the eu-west-1 (Ireland) Region, your redirect URI should be: `https://eu-west-1.console.aws.amazon.com/gluestudio/oauth`
    + After creating the security integration, retain the following information for use when creating the Glue connection: 
      + `OAUTH_CLIENT_ID`: Provide this value as **User Managed Client Application Client ID** on the AWS Glue connection creation page.
      + `OAUTH_CLIENT_SECRET`: Store this value in the AWS Secrets Manager secret used for the connection, under the key `USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET`.
  + OAuth Scopes — (Optional) Defines the specific permissions or levels of access requested from the Snowflake account. For example, a scope might limit access to a particular resource or operation.
    + This value can be specified in the following format: `session:role:Snowflake_Role_Name`
    + Example: `session:role:ANALYST_ROLE`
  + Authorization Code URL — (Required) The endpoint where the user is redirected to log in and grant authorization.
    + Example: `https://host/oauth/authorize`
  + Authorization Token URL — (Required) The endpoint used to exchange the authorization code for an access token.
    + Example: `https://host/oauth/token-request`
  + User Managed Client Application Client ID — (Required) The unique identifier for your registered OAuth client application in Snowflake.
  + AWS Secret — (Required) Refers to an AWS Secrets Manager secret containing the following key-value pairs:
    + `sfUser` - The Snowflake username
    + `USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET` - The client secret associated with the OAuth client application

All three authentication methods are fully supported and can be configured using any combination of connection options, AWS Glue connections, or AWS Secrets Manager secrets.
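
For key-pair authentication specifically, the `pem_private_key` value is the base64 body of the private key with the PEM header, footer, and line breaks removed. A sketch of that conversion follows; the key material shown is a truncated placeholder, not a real key:

```python
def pem_body(pem_text):
    """Strip PEM armor lines (-----BEGIN/END-----) and whitespace, keeping the base64 payload."""
    return "".join(
        line.strip()
        for line in pem_text.splitlines()
        if line.strip() and not line.strip().startswith("-----")
    )

example_pem = """-----BEGIN PRIVATE KEY-----
MIIBVAIBADANBgkqhkiG9w0BAQEFAASCAT4w
ggE6AgEAAkEAxyz
-----END PRIVATE KEY-----"""

pem_private_key = pem_body(example_pem)  # value for the pem_private_key option
```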

## Snowflake connector limitations
<a name="aws-glue-programming-etl-connect-snowflake-limitations"></a>

Connecting to Snowflake with AWS Glue for Spark is subject to the following limitations. 
+ This connector does not support job bookmarks. For more information about job bookmarks, see [Tracking processed data using job bookmarks](monitor-continuations.md).
+ This connector does not support Snowflake reads and writes through tables in the AWS Glue Data Catalog using the `create_dynamic_frame.from_catalog` and `write_dynamic_frame.from_catalog` methods.
+ This connector supports basic authentication, key-pair authentication, and OAuth authentication. Other authentication methods (such as SAML) are not currently supported.
+ This connector is not supported within streaming jobs.
+ This connector supports queries based on `SELECT` statements when retrieving information (such as with the `query` parameter). Other kinds of queries (such as `SHOW`, `DESC`, or DML statements) are not supported.
+ Snowflake limits the size of query text (that is, SQL statements) submitted through Snowflake clients to 1 MB per statement. For more details, see [Limits on Query Text Size](https://docs.snowflake.com/en/user-guide/query-size-limits).
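
The 1 MB query-text limit in the last point can be checked client-side before a job submits a long generated statement; a minimal sketch:

```python
SNOWFLAKE_MAX_QUERY_BYTES = 1_048_576  # 1 MB per statement

def within_query_limit(sql):
    """Return True if the statement text fits within Snowflake's 1 MB limit."""
    return len(sql.encode("utf-8")) <= SNOWFLAKE_MAX_QUERY_BYTES

assert within_query_limit("SELECT id FROM sales")
```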

# Teradata Vantage connections
<a name="aws-glue-programming-etl-connect-teradata-home"></a>

You can use AWS Glue for Spark to read from and write to existing tables in Teradata Vantage in AWS Glue 4.0 and later versions. You can define what to read from Teradata with a SQL query. You can connect to Teradata using username and password credentials stored in AWS Secrets Manager through an AWS Glue connection.

For more information about Teradata, consult the [Teradata documentation](https://docs.teradata.com/).

## Configuring Teradata connections
<a name="aws-glue-programming-etl-connect-teradata-configure"></a>

To connect to Teradata from AWS Glue, you will need to create and store your Teradata credentials in an AWS Secrets Manager secret, then associate that secret with an AWS Glue Teradata connection. If your Teradata instance is in an Amazon VPC, you will also need to provide networking options to your AWS Glue Teradata connection.

To connect to Teradata from AWS Glue, you will need to meet some prerequisites:
+ If you are accessing your Teradata environment through Amazon VPC, configure Amazon VPC to allow your AWS Glue job to communicate with the Teradata environment. We discourage accessing the Teradata environment over the public internet.

  In Amazon VPC, identify or create a **VPC**, **Subnet** and **Security group** that AWS Glue will use while executing the job. Additionally, you need to ensure Amazon VPC is configured to permit network traffic between your Teradata instance and this location. Your job will need to establish a TCP connection with your Teradata client port. For more information about Teradata ports, see the [Teradata documentation](https://docs.teradata.com/r/Teradata-VantageTM-on-AWS-DIY-Installation-and-Administration-Guide/April-2020/Before-Deploying-Vantage-on-AWS-DIY/Security-Groups-and-Ports).

  Based on your network layout, secure VPC connectivity may require changes in Amazon VPC and other networking services. For more information about AWS connectivity, consult [AWS Connectivity Options](https://docs.teradata.com/r/Teradata-VantageCloud-Enterprise/Get-Started/Connecting-Your-Environment/AWS-Connectivity-Options) in the Teradata documentation.

**To configure an AWS Glue Teradata connection:**

1. In your Teradata configuration, identify or create a user name and password AWS Glue will connect with, *teradataUsername* and *teradataPassword*. For more information, consult [Vantage Security Overview](https://docs.teradata.com/r/Configuring-Teradata-VantageTM-After-Installation/January-2021/Security-Overview/Vantage-Security-Overview) in the Teradata documentation.

1. In AWS Secrets Manager, create a secret using your Teradata credentials. To create a secret in Secrets Manager, follow the tutorial in [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com//secretsmanager/latest/userguide/create_secret.html) in the AWS Secrets Manager documentation. After creating the secret, keep the secret name, *secretName*, for the next step.
   + When selecting **Key/value pairs**, create a pair for the key `user` with the value *teradataUsername*.
   + When selecting **Key/value pairs**, create a pair for the key `password` with the value *teradataPassword*.

1. In the AWS Glue console, create a connection by following the steps in [Adding an AWS Glue connection](console-connections.md). After creating the connection, keep the connection name, *connectionName*, for the next step. 
   + When selecting a **Connection type**, select Teradata.
   + When providing **JDBC URL**, provide the URL for your instance. You can also hardcode certain comma-separated connection parameters in your JDBC URL. The URL must conform to the following format: `jdbc:teradata://teradataHostname/ParameterName=ParameterValue,ParameterName=ParameterValue`

     Supported URL parameters include:
     + `DATABASE` – the name of the database on the host to access by default.
     + `DBS_PORT` – the database port, used when running on a nonstandard port.
   + When selecting a **Credential type**, select **AWS Secrets Manager**, then set **AWS Secret** to *secretName*.

1. In the following situations, you may require additional configuration:
   + For Teradata instances hosted on AWS in an Amazon VPC, you will need to provide Amazon VPC connection information to the AWS Glue connection that defines your Teradata security credentials. When creating or updating your connection, set **VPC**, **Subnet** and **Security groups** in **Network options**.

After creating an AWS Glue Teradata connection, you will need to perform the following steps before calling your connection method.
+ Grant the IAM role associated with your AWS Glue job permission to read *secretName*.
+ In your AWS Glue job configuration, provide *connectionName* as an **Additional network connection**.
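
The JDBC URL format described in the procedure above can also be assembled programmatically. This helper is illustrative only (the hostname and parameters shown are placeholders):

```python
def teradata_jdbc_url(hostname, **params):
    """Build a Teradata JDBC URL with optional comma-separated URL parameters."""
    url = "jdbc:teradata://" + hostname
    if params:
        url += "/" + ",".join(k + "=" + v for k, v in params.items())
    return url

print(teradata_jdbc_url("td.example.internal", DATABASE="sales", DBS_PORT="1025"))
# → jdbc:teradata://td.example.internal/DATABASE=sales,DBS_PORT=1025
```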

## Reading from Teradata
<a name="aws-glue-programming-etl-connect-teradata-read"></a>

**Prerequisites:**
+ A Teradata table you would like to read from. You will need the table name, *tableName*.
+ An AWS Glue Teradata connection configured to provide auth information. Complete the steps in the previous procedure, *To configure an AWS Glue Teradata connection*, to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*.

For example: 

```
teradata_read_table = glueContext.create_dynamic_frame.from_options(
    connection_type="teradata",
    connection_options={
        "connectionName": "connectionName",
        "dbtable": "tableName"
    }
)
```

You can also provide a SELECT SQL query to filter the results returned to your DynamicFrame. You will need to configure `query`.

For example:

```
teradata_read_query = glueContext.create_dynamic_frame.from_options(
    connection_type="teradata",
    connection_options={
        "connectionName": "connectionName",
        "query": "query"
    }
)
```

## Writing to Teradata tables
<a name="aws-glue-programming-etl-connect-teradata-write"></a>

**Prerequisites:** A Teradata table you would like to write to, *tableName*. **You must create the table before calling the connection method.**

For example:

```
teradata_write = glueContext.write_dynamic_frame.from_options(
    connection_type="teradata",
    connection_options={
        "connectionName": "connectionName", 
        "dbtable": "tableName"
    }
)
```

## Teradata connection option reference
<a name="aws-glue-programming-etl-connect-teradata-reference"></a>
+ `connectionName` — Required. Used for Read/Write. The name of an AWS Glue Teradata connection configured to provide auth and networking information to your connection method.
+ `dbtable` — Required for writing, required for reading unless `query` is provided. Used for Read/Write. The name of a table your connection method will interact with.
+ `query` — Used for Read. A SELECT SQL query defining what should be retrieved when reading from Teradata.

# Teradata Vantage NOS connections
<a name="connecting-to-data-teradata-nos"></a>

Teradata NOS (Native Object Store) connection is a new connection for Teradata Vantage that uses the Teradata `WRITE_NOS` query to read from existing tables and the `READ_NOS` query to write to tables. These queries use Amazon S3 as a staging directory, so the Teradata NOS connector is faster than the existing JDBC-based Teradata connector, especially when handling large amounts of data.

You can use the Teradata NOS connection in AWS Glue for Spark to read from and write to existing tables in Teradata Vantage in AWS Glue 5.0 and later versions. You can define what to read from Teradata with a SQL query. You can connect to Teradata using username and password credentials stored in AWS Secrets Manager through an AWS Glue connection.

 For more information about Teradata, consult the [Teradata documentation](https://docs.teradata.com/). 

**Topics**
+ [Creating a Teradata NOS connection](#creating-teradata-nos-connection)
+ [Reading from Teradata tables](#reading-from-teradata-nos-tables)
+ [Writing to Teradata tables](#writing-to-teradata-nos-tables)
+ [Teradata NOS connection option reference](#teradata-nos-connection-option-reference)
+ [Provide Options in AWS Glue Visual ETL UI](#teradata-nos-connection-option-visual-etl-ui)

## Creating a Teradata NOS connection
<a name="creating-teradata-nos-connection"></a>

To connect to Teradata NOS from AWS Glue, you will need to create and store your Teradata credentials in an AWS Secrets Manager secret, then associate that secret with an AWS Glue Teradata NOS connection. If your Teradata instance is in an Amazon VPC, you will also need to provide networking options to your AWS Glue Teradata NOS connection.

 **Prerequisites**: 
+  If you are accessing your Teradata environment through Amazon VPC, configure Amazon VPC to allow your AWS Glue job to communicate with the Teradata environment. We discourage accessing the Teradata environment over the public internet. 
+ In Amazon VPC, identify or create a VPC, Subnet and Security group that AWS Glue will use while executing the job. Additionally, you need to ensure Amazon VPC is configured to permit network traffic between your Teradata instance and this location. Your job will need to establish a TCP connection with your Teradata client port. For more information about Teradata ports, see [Security Groups and Ports](https://docs.teradata.com/r/Teradata-VantageTM-on-AWS-DIY-Installation-and-Administration-Guide/April-2020/Before-Deploying-Vantage-on-AWS-DIY/Security-Groups-and-Ports) in the Teradata documentation.
+ Based on your network layout, secure VPC connectivity may require changes in Amazon VPC and other networking services. For more information about AWS connectivity, see [AWS Connectivity Options](https://docs.teradata.com/r/Teradata-VantageCloud-Enterprise/Get-Started/Connecting-Your-Environment/AWS-Connectivity-Options) in the Teradata documentation.

### To configure an AWS Glue Teradata NOS connection:
<a name="creating-teradata-nos-connection-procedure"></a>

1. In your Teradata configuration, identify or create a user name and password AWS Glue will connect with, *teradataUsername* and *teradataPassword*. For more information, see [Vantage Security Overview](https://docs.teradata.com/r/Configuring-Teradata-VantageTM-After-Installation/January-2021/Security-Overview/Vantage-Security-Overview) in the Teradata documentation.

1. In AWS Secrets Manager, create a secret using your Teradata credentials. To create a secret in AWS Secrets Manager, follow the tutorial in [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html) in the AWS Secrets Manager documentation. After creating the secret, keep the secret name, *secretName*, for the next step.
   + When selecting **Key/value pairs**, create a pair for the key `USERNAME` with the value *teradataUsername*.
   + When selecting **Key/value pairs**, create a pair for the key `PASSWORD` with the value *teradataPassword*.

1. In the AWS Glue console, create a connection by following the steps in [Adding an AWS Glue connection](https://docs.aws.amazon.com/glue/latest/dg/console-connections.html). After creating the connection, keep the connection name, *connectionName*, for the next step.
   +  When selecting a **Connection type**, select Teradata Vantage NOS. 
   + When providing **JDBC URL**, provide the URL for your instance. You can also hardcode certain comma-separated connection parameters in your JDBC URL. The URL must conform to the following format: `jdbc:teradata://teradataHostname/ParameterName=ParameterValue,ParameterName=ParameterValue`.
   + Supported URL parameters include:
     + `DATABASE` – the name of the database on the host to access by default.
     + `DBS_PORT` – the database port, used when running on a nonstandard port.
   + When selecting a **Credential type**, select **AWS Secrets Manager**, then set **AWS Secret** to *secretName*.

1.  In the following situations, you may require additional configuration: 
   +  For Teradata instances hosted on AWS in an Amazon VPC, you will need to provide Amazon VPC connection information to the AWS Glue connection that defines your Teradata security credentials. When creating or updating your connection, set **VPC**, **Subnet**, and **Security groups** in **Network options**. 

After creating an AWS Glue Teradata Vantage NOS connection, you will need to perform the following steps before calling your connection method.

1.  Grant the IAM role associated with your AWS Glue job permission to read *secretName*. 

1. In your AWS Glue job configuration, provide *connectionName* as an **Additional network connection** under **Connections**.

## Reading from Teradata tables
<a name="reading-from-teradata-nos-tables"></a>

### Prerequisites:
<a name="w2aac67c11c24b8c41c17b3"></a>
+  A Teradata table you would like to read from. You will need the table name, *tableName*. 
+ The Teradata environment has write access to the Amazon S3 path specified by the `staging_fs_url` option, *stagingFsUrl*.
+ The IAM role associated with your AWS Glue job has write access to the Amazon S3 location specified by the `staging_fs_url` option.
+ An AWS Glue Teradata NOS connection configured to provide auth information. Complete the steps in [To configure an AWS Glue Teradata NOS connection](#creating-teradata-nos-connection-procedure) to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*.

 Example: 

```
teradata_read_table = glueContext.create_dynamic_frame.from_options(
    connection_type="teradatanos",
    connection_options={
        "connectionName": "connectionName",
        "dbtable": "tableName",
        "staging_fs_url": "stagingFsUrl"
    }
)
```

You can also provide a SELECT SQL query to filter the results returned to your DynamicFrame. You will need to configure `query`. If you configure both `dbtable` and `query`, the connector fails to read the data. For example:

```
teradata_read_query = glueContext.create_dynamic_frame.from_options(
    connection_type="teradatanos",
    connection_options={
        "connectionName": "connectionName",
        "query": "query",
        "staging_fs_url": "stagingFsUrl"
    }
)
```

Additionally, you can use the Spark DataFrame API to read from Teradata tables. For example:

```
options = {
    "url": "JDBC_URL",
    "dbtable": "tableName",
    "user": "teradataUsername", # or use "username" as key here
    "password": "teradataPassword",
    "staging_fs_url": "stagingFsUrl"
}
teradata_read_table = spark.read.format("teradatanos").options(**options).load()
```

## Writing to Teradata tables
<a name="writing-to-teradata-nos-tables"></a>

### Prerequisites
<a name="writing-to-teradata-nos-tables-prerequisites"></a>
+  A Teradata table you would like to write to: *tableName*. 
+ The Teradata environment has read access to the Amazon S3 location specified by the `staging_fs_url` option, *stagingFsUrl*.
+ The IAM role associated with your AWS Glue job has write access to the Amazon S3 location specified by the `staging_fs_url` option.
+ An AWS Glue Teradata NOS connection configured to provide auth information. Complete the steps in [To configure an AWS Glue Teradata NOS connection](#creating-teradata-nos-connection-procedure) to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*.

   For example: 

  ```
  teradata_write = glueContext.write_dynamic_frame.from_options(
      frame=dynamicFrame,
      connection_type="teradatanos",
      connection_options={
          "connectionName": "connectionName", 
          "dbtable": "tableName",
          "staging_fs_url": "stagingFsUrl"
      }
  )
  ```

## Teradata NOS connection option reference
<a name="teradata-nos-connection-option-reference"></a>

 **Connection and Operation Options:** 
+  `connectionName` — Required. Used for Read/Write. The name of an AWS Glue Teradata connection configured to provide auth and networking information to your connection method. 
+ `staging_fs_url` — Required. Used for Read/Write. A writable location in Amazon S3, used for unloaded data when reading from Teradata and for Parquet data to be loaded into Teradata when writing. The Amazon S3 bucket must be in the same AWS Region as your AWS Glue jobs.
+  `dbtable` — Required for writing, required for reading unless `query` is provided. Used for Read/Write. The name of a table your connection method will interact with. 
+ `query` — Used for Read. A SELECT SQL query defining what should be retrieved when reading from Teradata. You cannot provide this option if the `dbtable` option is provided.
+  `clean_staging_s3_dir` — Optional. Used for Read/Write. If true, clean up the staging Amazon S3 objects after a read or a write. The default value is true. 
+  `pre_actions` — Optional. Used for Write. Semicolon-separated list of SQL commands that are executed before data is transferred between Spark and Teradata Vantage. 
+  `post_actions` — Optional. Used for Write. Semicolon-separated list of SQL commands that are executed after data is transferred between Spark and Teradata Vantage. 
+  `truncate` — Optional. Used for Write. If true, the connector truncates the table when writing in overwrite mode. If false, the connector drops the table when writing in overwrite mode. The default value is false. 
+ `create_table_script` — Optional. Used for Write. A SQL statement that creates the table when writing to Teradata Vantage. This is useful when you want to create a table with custom metadata (for example, a CREATE MULTISET or SET table, or a different primary index). The table name used in the create table script must match the table name specified in the `dbtable` option.
+  `partition_size_in_mb` — Optional. Used for Read. Maximum size of a Spark partition in megabytes while reading staging Amazon S3 objects. The default value is 128. 
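
Several of these options can be combined in one `connection_options` dictionary. A hypothetical write configuration follows; the connection name, table, SQL statement, and S3 path are all placeholders:

```python
# Placeholder values throughout; substitute your own connection, table, and bucket.
teradata_nos_write_options = {
    "connectionName": "my-teradata-nos-connection",
    "dbtable": "sales_summary",
    "staging_fs_url": "s3://my-staging-bucket/glue-teradata/",
    "truncate": "true",               # truncate rather than drop in overwrite mode
    "pre_actions": "DELETE FROM sales_summary WHERE load_date = CURRENT_DATE",
    "clean_staging_s3_dir": "true",   # remove staging objects after the write (default)
}
```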

 You can provide advanced options when creating a Teradata node. These options are the same as those available when programming AWS Glue for Spark scripts.

 See [Teradata Vantage connections](aws-glue-programming-etl-connect-teradata-home.md). 

 **Authorization Options:** 

Below are options used to provide AWS account credentials that the connector uses to access the staging Amazon S3 bucket. You can choose to (1) not provide any authorization options and use temporary credentials generated from your AWS Glue execution role; (2) provide an authorization object, `auth_object`, that you created; or (3) provide `aws_access_key_id` and `aws_secret_access_key` if using long-term credentials, or `aws_access_key_id`, `aws_secret_access_key`, and `aws_session_token` if using temporary credentials.
+  `auth_object` – Optional. Used for accessing the staging Amazon S3 bucket. An authorization object string created in Teradata instance. If provided, the connector will use this authorization object to access the staging Amazon S3 bucket. If not provided, and `aws_access_key_id` and `aws_secret_access_key` are also not provided, a temporary credential will be retrieved from AWS Glue execution role and used by the connector. The AWS account associated with this authorization object must be in the same region as your AWS Glue jobs and your staging Amazon S3 bucket or configured with cross account trust. 
+  `aws_access_key_id` – Optional. Used for accessing the staging Amazon S3 bucket. Part of an AWS account security credential. If `auth_object` is not provided, and `aws_access_key_id` is provided with `aws_secret_access_key`, the connector will use them to access staging Amazon S3 bucket. The AWS account associated with this access key must be in the same region as your AWS Glue jobs and your staging Amazon S3 bucket or configured with cross account trust. 
+ `aws_secret_access_key` – Optional. Used for accessing the staging Amazon S3 bucket. Part of an AWS account security credential. If `auth_object` is not provided, and `aws_secret_access_key` is provided with `aws_access_key_id`, the connector will use them to access the staging Amazon S3 bucket. The AWS account associated with this secret key must be in the same region as your AWS Glue jobs and your staging Amazon S3 bucket or configured with cross-account trust.
+  `aws_session_token` – Optional. Used for accessing the staging Amazon S3 bucket. Part of a temporary AWS account security credential. Should be provided with `aws_access_key_id` and `aws_secret_access_key`. 
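
The three credential paths above are mutually exclusive choices. A small illustrative helper (not part of the connector) showing how the precedence described could be classified:

```python
def staging_auth_mode(options):
    """Classify which staging-bucket credential path a set of options selects."""
    if "auth_object" in options:
        return "authorization-object"
    if "aws_access_key_id" in options and "aws_secret_access_key" in options:
        if "aws_session_token" in options:
            return "temporary-credentials"
        return "long-term-credentials"
    return "glue-execution-role"

print(staging_auth_mode({}))  # → glue-execution-role
```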

## Provide Options in AWS Glue Visual ETL UI
<a name="teradata-nos-connection-option-visual-etl-ui"></a>

You can provide all of the above options in your visual ETL job UI. For the `connectionName` option, choose it from the Teradata Vantage NOS connection drop-down list. Provide all other options as key-value pairs under **Custom Teradata Vantage NOS properties**.

![\[The window pane displays the Teradata NOS Vantage connection is selected.\]](http://docs.aws.amazon.com/glue/latest/dg/images/teradata-nos-vantage-connection-options.png)


# Vertica connections
<a name="aws-glue-programming-etl-connect-vertica-home"></a>

You can use AWS Glue for Spark to read from and write to tables in Vertica in AWS Glue 4.0 and later versions. You can define what to read from Vertica with a SQL query. You connect to Vertica using username and password credentials stored in AWS Secrets Manager through an AWS Glue connection.

For more information about Vertica, consult the [Vertica documentation](https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/UsingVerticaOnAWS/UsingVerticaOnAWS.htm).

## Configuring Vertica connections
<a name="aws-glue-programming-etl-connect-vertica-configure"></a>

To connect to Vertica from AWS Glue, you will need to create and store your Vertica credentials in an AWS Secrets Manager secret, then associate that secret with an AWS Glue Vertica connection. If your Vertica instance is in an Amazon VPC, you will also need to provide networking options to your AWS Glue Vertica connection. You will need an Amazon S3 bucket or folder to use for temporary storage when reading from and writing to the database.

To connect to Vertica from AWS Glue, you will need to meet some prerequisites:
+ An Amazon S3 bucket or folder to use for temporary storage when reading from and writing to the database, referred to by *tempS3Path*.
**Note**  
When using Vertica in AWS Glue job data previews, temporary files may not be automatically removed from *tempS3Path*. To ensure the removal of temporary files, directly end the data preview session by choosing **End session** in the **Data preview** pane.  
If you cannot guarantee the data preview session is ended directly, consider setting Amazon S3 Lifecycle configuration to remove old data. We recommend removing data older than 49 hours, based on maximum job runtime plus a margin. For more information about configuring Amazon S3 Lifecycle, see [Managing your storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) in the Amazon S3 documentation.
+ An IAM policy with appropriate permissions for your Amazon S3 path that you can associate with your AWS Glue job role.
+ If your Vertica instance is in an Amazon VPC, configure Amazon VPC to allow your AWS Glue job to communicate with the Vertica instance without traffic traversing the public internet. 

  In Amazon VPC, identify or create a **VPC**, **Subnet** and **Security group** that AWS Glue will use while executing the job. Additionally, you need to ensure Amazon VPC is configured to permit network traffic between your Vertica instance and this location. Your job will need to establish a TCP connection with your Vertica client port (default 5433). Based on your network layout, this may require changes to security group rules, Network ACLs, NAT Gateways and Peering connections.
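
The lifecycle cleanup suggested in the note above can be expressed as an Amazon S3 Lifecycle rule. S3 Lifecycle expiration is specified in whole days, so a 3-day window is the smallest that covers the suggested 49 hours; the rule ID and prefix here are placeholders:

```python
# Hypothetical rule; apply it with boto3's put_bucket_lifecycle_configuration
# against your own temporary-storage bucket and prefix.
lifecycle_config = {
    "Rules": [
        {
            "ID": "expire-glue-vertica-temp",
            "Filter": {"Prefix": "glue-vertica-temp/"},
            "Status": "Enabled",
            "Expiration": {"Days": 3},  # smallest whole-day window covering 49 hours
        }
    ]
}
```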

You can then proceed to configure AWS Glue for use with Vertica.

**To configure a connection to Vertica:**

1. In AWS Secrets Manager, create a secret using your Vertica credentials, *verticaUsername* and *verticaPassword*. To create a secret in Secrets Manager, follow the tutorial in [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com//secretsmanager/latest/userguide/create_secret.html) in the AWS Secrets Manager documentation. After creating the secret, keep the secret name, *secretName*, for the next step.
   + When selecting **Key/value pairs**, create a pair for the key `user` with the value *verticaUsername*.
   + When selecting **Key/value pairs**, create a pair for the key `password` with the value *verticaPassword*.

1. In the AWS Glue console, create a connection by following the steps in [Adding an AWS Glue connection](console-connections.md). After creating the connection, keep the connection name, *connectionName*, for the next step. 
   + When selecting a **Connection type**, select Vertica.
   + When selecting **Vertica Host**, provide the hostname of your Vertica installation.
   + When selecting **Vertica Port**, provide the port your Vertica installation is available through.
   + When selecting an **AWS Secret**, provide *secretName*.

1. In the following situations, you may require additional configuration:
   + For Vertica instances hosted on AWS in an Amazon VPC, provide Amazon VPC connection information to the AWS Glue connection that defines your Vertica security credentials. When creating or updating your connection, set **VPC**, **Subnet** and **Security groups** in **Network options**.

After creating an AWS Glue Vertica connection, you will need to perform the following steps before calling your connection method.
+ Grant the IAM role associated with your AWS Glue job permissions to *tempS3Path*.
+ Grant the IAM role associated with your AWS Glue job permission to read *secretName*.
+ In your AWS Glue job configuration, provide *connectionName* as an **Additional network connection**.

## Reading from Vertica
<a name="aws-glue-programming-etl-connect-vertica-read"></a>

**Prerequisites:** 
+ A Vertica table you would like to read from. You will need the Vertica database name, *dbName* and the table name, *tableName*.
+ An AWS Glue Vertica connection configured to provide auth information. Complete the steps in the previous procedure, *To configure a connection to Vertica*, to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*. 
+ An Amazon S3 bucket or folder to use for temporary storage, mentioned previously. You will need the name, *tempS3Path*. You will need to connect to this location using the `s3a` protocol.

For example: 

```
dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="vertica",
    connection_options={
        "connectionName": "connectionName",
        "staging_fs_url": "s3a://tempS3Path",
        "db": "dbName",
        "table": "tableName",
    }
)
```

You can also provide a `SELECT` SQL query to filter the results returned to your DynamicFrame or to access a dataset from multiple tables.

For example:

```
dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="vertica",
    connection_options={
        "connectionName": "connectionName",
        "staging_fs_url": "s3a://tempS3Path",
        "db": "dbName",
        "query": "SELECT * FROM tableName",
    },
)
```

## Writing to Vertica tables
<a name="aws-glue-programming-etl-connect-vertica-write"></a>

This example writes information from an existing DynamicFrame, *dynamicFrame*, to Vertica. If the table already contains data, AWS Glue will append data from your DynamicFrame.

**Prerequisites:** 
+ A current or desired table name, *tableName*, you would like to write to. You will also need the corresponding Vertica database name, *dbName*.
+ An AWS Glue Vertica connection configured to provide auth information. Complete the steps in the previous procedure, *To configure a connection to Vertica*, to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*. 
+ An Amazon S3 bucket or folder to use for temporary storage, mentioned previously. You will need the name, *tempS3Path*. You will need to connect to this location using the `s3a` protocol.

For example: 

```
glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="vertica",
    connection_options={
        "connectionName": "connectionName",
        "staging_fs_url": "s3a://tempS3Path",
        "db": "dbName",
        "table": "tableName",
    }
)
```

## Vertica connection option reference
<a name="aws-glue-programming-etl-connect-vertica-reference"></a>
+ `connectionName` — Required. Used for Read/Write. The name of an AWS Glue Vertica connection configured to provide auth and networking information to your connection method.
+ `db` — Required. Used for Read/Write. The name of a database in Vertica your connection method will interact with.
+ `dbSchema` — Required if needed to identify your table. Used for Read/Write. Default: `public`. The name of a schema your connection method will interact with.
+ `table` — Required for writing, required for reading unless `query` is provided. Used for Read/Write. The name of a table your connection method will interact with.
+ `query` — Used for Read. A SELECT SQL query defining what should be retrieved when reading from Vertica.
+ `staging_fs_url` — Required. Used for Read/Write. Valid Values: `s3a` URLs. The URL of an Amazon S3 bucket or folder to use for temporary storage.

## DataFrame options for ETL in AWS Glue 5.0 for Spark
<a name="aws-glue-programming-etl-connect-dataframe"></a>

A DataFrame is a Dataset organized into named columns similar to a table and supports functional-style (map/reduce/filter/etc.) operations and SQL operations (select, project, aggregate).

To create a DataFrame for a data source supported by AWS Glue, the following are required:
+ data source connector `ClassName`
+ data source connection `Options`

Similarly, to write a DataFrame to a data sink supported by AWS Glue, the same are required:
+ data sink connector `ClassName`
+ data sink connection `Options`

Note that AWS Glue features such as job bookmarks, and DynamicFrame options such as `connectionName`, are not supported with DataFrames. For more details about DataFrames and the supported operations, see the Spark documentation for [DataFrame](https://spark.apache.org/docs/3.5.2/api/python/reference/pyspark.sql/dataframe.html).

### Specifying the connector ClassName
<a name="aws-glue-programming-etl-connect-dataframe-classname"></a>

To specify the `ClassName` of a data source/sink, use the `.format` option to provide the corresponding connector `ClassName` that defines the data source/sink.

**JDBC connectors**  
For JDBC connectors, specify `jdbc` as the value of the `.format` option and provide the JDBC driver `ClassName` in the `driver` option.

```
df = spark.read.format("jdbc").option("driver", "<DATA SOURCE JDBC DRIVER CLASSNAME>")...

df.write.format("jdbc").option("driver", "<DATA SINK JDBC DRIVER CLASSNAME>")...
```

The following table lists the JDBC driver `ClassName` values for the data sources supported in AWS Glue for DataFrames.

| Data source | Driver ClassName | 
| --- |--- |
| PostgreSQL | org.postgresql.Driver | 
| Oracle | oracle.jdbc.driver.OracleDriver | 
| SQLServer | com.microsoft.sqlserver.jdbc.SQLServerDriver | 
| MySQL | com.mysql.jdbc.Driver | 
| SAPHana | com.sap.db.jdbc.Driver | 
| Teradata | com.teradata.jdbc.TeraDriver | 

**Spark connectors**  
For Spark connectors, specify the `ClassName` of the connector as the value of the `.format` option.

```
df = spark.read.format("<DATA SOURCE CONNECTOR CLASSNAME>")...

df.write.format("<DATA SINK CONNECTOR CLASSNAME>")...
```

The following table lists the Spark connector `ClassName` values for the data sources supported in AWS Glue for DataFrames.

| Data source | ClassName | 
| --- |--- |
| MongoDB/DocumentDB | glue.spark.mongodb | 
| Redshift | io.github.spark_redshift_community.spark.redshift | 
| AzureCosmos | cosmos.oltp | 
| AzureSQL | com.microsoft.sqlserver.jdbc.spark | 
| BigQuery | com.google.cloud.spark.bigquery | 
| OpenSearch | org.opensearch.spark.sql | 
| Snowflake | net.snowflake.spark.snowflake | 
| Vertica | com.vertica.spark.datasource.VerticaSource | 

### Specifying the connection Options
<a name="aws-glue-programming-etl-connect-dataframe-connection-options"></a>

To specify the `Options` of the connection to a data source/sink, use the `.option(<KEY>, <VALUE>)` to provide individual options or `.options(<MAP>)` to provide multiple options as a key-value map.

Each data source/sink supports its own set of connection `Options`. For details on the available `Options`, refer to the public documentation of the specific data source/sink Spark connector listed in the following table.
+ [JDBC](https://spark.apache.org/docs/3.5.2/sql-data-sources-jdbc.html)
+ [MongoDB/DocumentDB](https://www.mongodb.com/docs/spark-connector/v10.4/)
+ [Redshift](https://docs.aws.amazon.com/redshift/latest/mgmt/spark-redshift-connector.html)
+ [AzureCosmos](https://github.com/Azure/azure-cosmosdb-spark)
+ [AzureSQL](https://learn.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver16)
+ [BigQuery](https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example)
+ [OpenSearch](https://github.com/opensearch-project/opensearch-hadoop/blob/main/USER_GUIDE.md#apache-spark)
+ [Snowflake](https://docs.snowflake.com/en/user-guide/spark-connector-use#setting-configuration-options-for-the-connector)
+ [Vertica](https://github.com/vertica/spark-connector)

### Examples
<a name="aws-glue-programming-etl-connect-dataframe-examples"></a>

The following examples read from PostgreSQL and write to Snowflake:

**Python**  
Example:

```
from awsglue.context import GlueContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dataSourceClassName = "jdbc"
dataSourceOptions = {
  "driver": "org.postgresql.Driver",
  "url": "<url>",
  "user": "<user>",
  "password": "<password>",
  "dbtable": "<dbtable>",
}

dataframe = spark.read.format(dataSourceClassName).options(**dataSourceOptions).load()

dataSinkClassName = "net.snowflake.spark.snowflake"
dataSinkOptions = {
  "sfUrl": "<url>",
  "sfUsername": "<username>",
  "sfPassword": "<password>",
  "sfDatabase": "<database>",
  "sfSchema": "<schema>",
  "sfWarehouse": "<warehouse>"
}

dataframe.write.format(dataSinkClassName).options(**dataSinkOptions).save()
```

**Scala**  
Example:

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val dataSourceClassName = "jdbc"
val dataSourceOptions = Map(
  "driver" -> "org.postgresql.Driver",
  "url" -> "<url>",
  "user" -> "<user>",
  "password" -> "<password>",
  "dbtable" -> "<dbtable>"
)

val dataframe = spark.read.format(dataSourceClassName).options(dataSourceOptions).load()

val dataSinkClassName = "net.snowflake.spark.snowflake"
val dataSinkOptions = Map(
  "sfUrl" -> "<url>",
  "sfUsername" -> "<username>",
  "sfPassword" -> "<password>",
  "sfDatabase" -> "<database>",
  "sfSchema" -> "<schema>",
  "sfWarehouse" -> "<warehouse>"
)

dataframe.write.format(dataSinkClassName).options(dataSinkOptions).save()
```

## Custom and AWS Marketplace connectionType values
<a name="aws-glue-programming-etl-connect-market"></a>

The `connectionType` values for custom and AWS Marketplace connectors include the following:
+ `"connectionType": "marketplace.athena"`: Designates a connection to an Amazon Athena data store. The connection uses a connector from AWS Marketplace.
+ `"connectionType": "marketplace.spark"`: Designates a connection to an Apache Spark data store. The connection uses a connector from AWS Marketplace.
+ `"connectionType": "marketplace.jdbc"`: Designates a connection to a JDBC data store. The connection uses a connector from AWS Marketplace.
+ `"connectionType": "custom.athena"`: Designates a connection to an Amazon Athena data store. The connection uses a custom connector that you upload to AWS Glue Studio.
+ `"connectionType": "custom.spark"`: Designates a connection to an Apache Spark data store. The connection uses a custom connector that you upload to AWS Glue Studio.
+ `"connectionType": "custom.jdbc"`: Designates a connection to a JDBC data store. The connection uses a custom connector that you upload to AWS Glue Studio.

### Connection options for type custom.jdbc or marketplace.jdbc
<a name="marketplace-jdbc-connect-options"></a>
+ `className` – String, required, driver class name.
+ `connectionName` – String, required, name of the connection that is associated with the connector.
+ `url` – String, required, JDBC URL with placeholders (`${}`) which are used to build the connection to the data source. The placeholder `${secretKey}` is replaced with the secret of the same name in AWS Secrets Manager. Refer to the data store documentation for more information about constructing the URL. 
+ `secretId` or `user/password` – String, required, used to retrieve credentials for the URL. 
+ `dbTable` or `query` – String, required, the table or SQL query to get the data from. You can specify either `dbTable` or `query`, but not both. 
+ `partitionColumn` – String, optional, the name of an integer column that is used for partitioning. This option works only when it's included with `lowerBound`, `upperBound`, and `numPartitions`. This option works the same way as in the Spark SQL JDBC reader. For more information, see [JDBC To Other Databases](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html) in the *Apache Spark SQL, DataFrames and Datasets Guide*.

  The `lowerBound` and `upperBound` values are used to decide the partition stride, not for filtering the rows in the table. All rows in the table are partitioned and returned. 
**Note**  
When using a query instead of a table name, you should validate that the query works with the specified partitioning condition. For example: 
+ If your query format is `"SELECT col1 FROM table1"`, then test the query by appending a `WHERE` clause at the end of the query that uses the partition column. 
+ If your query format is `"SELECT col1 FROM table1 WHERE col2=val"`, then test the query by extending the `WHERE` clause with `AND` and an expression that uses the partition column.
+ `lowerBound` – Integer, optional, the minimum value of `partitionColumn` that is used to decide partition stride. 
+ `upperBound` – Integer, optional, the maximum value of `partitionColumn` that is used to decide partition stride. 
+ `numPartitions` – Integer, optional, the number of partitions. This value, along with `lowerBound` (inclusive) and `upperBound` (exclusive), form partition strides for generated `WHERE` clause expressions that are used to split the `partitionColumn`. 
**Important**  
Be careful with the number of partitions because too many partitions might cause problems on your external database systems. 
+ `filterPredicate` – String, optional, extra condition clause to filter data from source. For example: 

  ```
  BillingCity='Mountain View'
  ```

  When using a *query* instead of a *table* name, you should validate that the query works with the specified `filterPredicate`. For example: 
  + If your query format is `"SELECT col1 FROM table1"`, then test the query by appending a `WHERE` clause at the end of the query that uses the filter predicate. 
  + If your query format is `"SELECT col1 FROM table1 WHERE col2=val"`, then test the query by extending the `WHERE` clause with `AND` and an expression that uses the filter predicate.
+ `dataTypeMapping` – Dictionary, optional, custom data type mapping that builds a mapping from a **JDBC** data type to a **Glue** data type. For example, the option `"dataTypeMapping":{"FLOAT":"STRING"}` maps data fields of JDBC type `FLOAT` into the Java `String` type by calling the `ResultSet.getString()` method of the driver, and uses it to build AWS Glue records. The `ResultSet` object is implemented by each driver, so the behavior is specific to the driver you use. Refer to the documentation for your JDBC driver to understand how the driver performs the conversions. 
  The AWS Glue data types supported currently are:
  + DATE
  + STRING
  + TIMESTAMP
  + INT
  + FLOAT
  + LONG
  + BIGDECIMAL
  + BYTE
  + SHORT
  + DOUBLE

  The JDBC data types supported are [Java 8 java.sql.types](https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html).

  The default data type mappings (from JDBC to AWS Glue) are:
  +  DATE -> DATE
  +  VARCHAR -> STRING
  +  CHAR -> STRING
  +  LONGNVARCHAR -> STRING
  +  TIMESTAMP -> TIMESTAMP
  +  INTEGER -> INT
  +  FLOAT -> FLOAT
  +  REAL -> FLOAT
  +  BIT -> BOOLEAN
  +  BOOLEAN -> BOOLEAN
  +  BIGINT -> LONG
  +  DECIMAL -> BIGDECIMAL
  +  NUMERIC -> BIGDECIMAL
  +  TINYINT -> SHORT
  +  SMALLINT -> SHORT
  +  DOUBLE -> DOUBLE

  If you use a custom data type mapping with the option `dataTypeMapping`, then you can override a default data type mapping. Only the JDBC data types listed in the `dataTypeMapping` option are affected; the default mapping is used for all other JDBC data types. You can add mappings for additional JDBC data types if needed. If a JDBC data type is not included in either the default mapping or a custom mapping, then the data type converts to the AWS Glue `STRING` data type by default. 
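
The partitioning behavior described for `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` can be sketched in plain Python. This is a simplified illustration of how strides become `WHERE` clause expressions; the actual Spark JDBC reader differs in detail (for example, it also accounts for `NULL` values and uneven strides).

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Sketch of how a JDBC reader splits partitionColumn into WHERE clauses.

    The first partition is unbounded below and the last is unbounded above,
    so every row in the table is assigned to some partition.
    """
    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        if i == num_partitions - 1:
            predicates.append(f"{column} >= {current}")
        elif i == 0:
            predicates.append(f"{column} < {current + stride}")
        else:
            predicates.append(f"{column} >= {current} AND {column} < {current + stride}")
        current += stride
    return predicates

# With lowerBound=0, upperBound=200, numPartitions=4 (as in the example below),
# the stride is 50 and four range predicates are generated.
print(partition_predicates("id", 0, 200, 4))
```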

The following Python code example shows how to read from JDBC databases with AWS Marketplace JDBC drivers. It demonstrates reading from a database and writing to an S3 location. 

```
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
     
    ## @params: [JOB_NAME]
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
     
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    ## @type: DataSource
    ## @args: [connection_type = "marketplace.jdbc", connection_options = 
     {"dataTypeMapping":{"INTEGER":"STRING"},"upperBound":"200","query":"select id, 
       name, department from department where id < 200","numPartitions":"4",
       "partitionColumn":"id","lowerBound":"0","connectionName":"test-connection-jdbc"},
        transformation_ctx = "DataSource0"]
    ## @return: DataSource0
    ## @inputs: []
    DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = 
      "marketplace.jdbc", connection_options = {"dataTypeMapping":{"INTEGER":"STRING"},
      "upperBound":"200","query":"select id, name, department from department where 
       id < 200","numPartitions":"4","partitionColumn":"id","lowerBound":"0",
       "connectionName":"test-connection-jdbc"}, transformation_ctx = "DataSource0")
    ## @type: ApplyMapping
    ## @args: [mappings = [("department", "string", "department", "string"), ("name", "string",
      "name", "string"), ("id", "int", "id", "int")], transformation_ctx = "Transform0"]
    ## @return: Transform0
    ## @inputs: [frame = DataSource0]
    Transform0 = ApplyMapping.apply(frame = DataSource0, mappings = [("department", "string",
      "department", "string"), ("name", "string", "name", "string"), ("id", "int", "id", "int")], 
       transformation_ctx = "Transform0")
    ## @type: DataSink
    ## @args: [connection_type = "s3", format = "json", connection_options = {"path": 
     "s3://<S3 path>/", "partitionKeys": []}, transformation_ctx = "DataSink0"]
    ## @return: DataSink0
    ## @inputs: [frame = Transform0]
    DataSink0 = glueContext.write_dynamic_frame.from_options(frame = Transform0, 
      connection_type = "s3", format = "json", connection_options = {"path": 
      "s3://<S3 path>/", "partitionKeys": []}, transformation_ctx = "DataSink0")
    job.commit()
```

### Connection options for type custom.athena or marketplace.athena
<a name="marketplace-athena-connect-options"></a>
+ `className` – String, required, driver class name. When you're using the Athena-CloudWatch connector, this parameter value is the prefix of the class name (for example, `"com.amazonaws.athena.connectors"`). The Athena-CloudWatch connector is composed of two classes: a metadata handler and a record handler. If you supply the common prefix here, then the API loads the correct classes based on that prefix.
+ `tableName` – String, required, the name of the CloudWatch log stream to read. The following code example uses the special view name `all_log_streams`, which means that the dynamic data frame returned will contain data from all log streams in the log group.
+ `schemaName` – String, required, the name of the CloudWatch log group to read from. For example, `/aws-glue/jobs/output`.
+ `connectionName` – String, required, name of the connection that is associated with the connector.

For additional options for this connector, see the [Amazon Athena CloudWatch Connector README](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-cloudwatch) file on GitHub.

The following Python code example shows how to read from an Athena data store using an AWS Marketplace connector. It demonstrates reading from Athena and writing to an S3 location. 

```
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
     
    ## @params: [JOB_NAME]
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
     
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    ## @type: DataSource
    ## @args: [connection_type = "marketplace.athena", connection_options = 
     {"tableName":"all_log_streams","schemaName":"/aws-glue/jobs/output",
      "connectionName":"test-connection-athena"}, transformation_ctx = "DataSource0"]
    ## @return: DataSource0
    ## @inputs: []
    DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = 
      "marketplace.athena", connection_options = {"tableName":"all_log_streams",
      "schemaName":"/aws-glue/jobs/output","connectionName":
      "test-connection-athena"}, transformation_ctx = "DataSource0")
    ## @type: ApplyMapping
    ## @args: [mappings = [("department", "string", "department", "string"), ("name", "string",
      "name", "string"), ("id", "int", "id", "int")], transformation_ctx = "Transform0"]
    ## @return: Transform0
    ## @inputs: [frame = DataSource0]
    Transform0 = ApplyMapping.apply(frame = DataSource0, mappings = [("department", "string",
      "department", "string"), ("name", "string", "name", "string"), ("id", "int", "id", "int")], 
       transformation_ctx = "Transform0")
    ## @type: DataSink
    ## @args: [connection_type = "s3", format = "json", connection_options = {"path": 
     "s3://<S3 path>/", "partitionKeys": []}, transformation_ctx = "DataSink0"]
    ## @return: DataSink0
    ## @inputs: [frame = Transform0]
    DataSink0 = glueContext.write_dynamic_frame.from_options(frame = Transform0, 
      connection_type = "s3", format = "json", connection_options = {"path": 
      "s3://<S3 path>/", "partitionKeys": []}, transformation_ctx = "DataSink0")
    job.commit()
```

### Connection options for type custom.spark or marketplace.spark
<a name="marketplace-spark-connect-options"></a>
+ `className` – String, required, connector class name. 
+ `secretId` – String, optional, used to retrieve credentials for the connector connection.
+ `connectionName` – String, required, name of the connection that is associated with the connector.
+ Other options depend on the data store. For example, OpenSearch configuration options start with the prefix `es`, as described in the [Elasticsearch for Apache Hadoop](https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html) documentation. Spark connections to Snowflake use options such as `sfUser` and `sfPassword`, as described in [Using the Spark Connector](https://docs.snowflake.com/en/user-guide/spark-connector-use.html) in the *Connecting to Snowflake* guide.

The following Python code example shows how to read from an OpenSearch data store using a `marketplace.spark` connection.

```
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
     
    ## @params: [JOB_NAME]
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
     
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    ## @type: DataSource
    ## @args: [connection_type = "marketplace.spark", connection_options = {"path":"test",
      "es.nodes.wan.only":"true","es.nodes":"https://<AWS endpoint>",
      "connectionName":"test-spark-es","es.port":"443"}, transformation_ctx = "DataSource0"]
    ## @return: DataSource0
    ## @inputs: []
    DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = 
      "marketplace.spark", connection_options = {"path":"test","es.nodes.wan.only":
      "true","es.nodes":"https://<AWS endpoint>","connectionName":
      "test-spark-es","es.port":"443"}, transformation_ctx = "DataSource0")
    ## @type: DataSink
    ## @args: [connection_type = "s3", format = "json", connection_options = {"path": 
         "s3://<S3 path>/", "partitionKeys": []}, transformation_ctx = "DataSink0"]
    ## @return: DataSink0
    ## @inputs: [frame = DataSource0]
    DataSink0 = glueContext.write_dynamic_frame.from_options(frame = DataSource0, 
       connection_type = "s3", format = "json", connection_options = {"path": 
       "s3://<S3 path>/", "partitionKeys": []}, transformation_ctx = "DataSink0")
    job.commit()
```

## General options
<a name="aws-glue-programming-etl-connect-general-options"></a>

The options in this section are provided as `connection_options`, but do not specifically apply to one connector.

The following parameters are used generally when configuring bookmarks. They may apply to Amazon S3 or JDBC workflows. For more information, see [Using job bookmarks](programming-etl-connect-bookmarks.md).
+ `jobBookmarkKeys` — Array of column names. 
+ `jobBookmarkKeysSortOrder` — String defining how to compare values based on sort order. Valid values: `"asc"`, `"desc"`.
+ `useS3ListImplementation` — Used to manage memory performance when listing Amazon S3 bucket contents. For more information, see [Optimize memory management in AWS Glue](https://aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/).
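
As an illustrative sketch, the bookmark options above might appear together in `connection_options` as follows. The column name `updated_at` is hypothetical, and `useS3ListImplementation` applies only to Amazon S3 sources; it is shown here alongside the bookmark keys purely for illustration.

```python
# Hypothetical example: bookmark keys for an incremental read.
# "updated_at" is an illustrative column name, not a required value.
connection_options = {
    "jobBookmarkKeys": ["updated_at"],
    "jobBookmarkKeysSortOrder": "asc",  # process rows in ascending key order
    "useS3ListImplementation": True,    # Amazon S3 sources only
}
```

You would pass this dictionary as the `connection_options` argument of `create_dynamic_frame.from_options` in a job that has bookmarks enabled.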