

# DynamoDB connections
<a name="aws-glue-programming-etl-connect-dynamodb-home"></a>

You can use AWS Glue for Spark to read from and write to tables in DynamoDB. You connect to DynamoDB using IAM permissions attached to your AWS Glue job. AWS Glue supports writing data into another AWS account's DynamoDB table. For more information, see [Cross-account cross-Region access to DynamoDB tables](aws-glue-programming-etl-dynamo-db-cross-account.md).

The original DynamoDB connector uses AWS Glue DynamicFrame objects to work with the data extracted from DynamoDB. AWS Glue 5.0 introduces a new [DynamoDB connector with Spark DataFrame support](aws-glue-programming-etl-connect-dynamodb-dataframe-support.md) that provides native Spark DataFrame support.

In addition to the AWS Glue DynamoDB ETL connector, you can read from DynamoDB using the DynamoDB export connector, which invokes a DynamoDB `ExportTableToPointInTime` request and stores the exported data in an Amazon S3 location you supply, in the [DynamoDB JSON](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.Output.html) format. AWS Glue then creates a DynamicFrame object by reading the data from the Amazon S3 export location.

The DynamoDB writer is available in AWS Glue version 1.0 or later versions. The AWS Glue DynamoDB export connector is available in AWS Glue version 2.0 or later versions. The new DataFrame-based DynamoDB connector is available in AWS Glue version 5.0 or later versions.

For more information about DynamoDB, consult the [Amazon DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/) documentation.

**Note**  
The DynamoDB ETL reader does not support filters or pushdown predicates.

## Configuring DynamoDB connections
<a name="aws-glue-programming-etl-connect-dynamodb-configure"></a>

To connect to DynamoDB from AWS Glue, grant the IAM role associated with your AWS Glue job permission to interact with DynamoDB. For more information about permissions necessary to read or write from DynamoDB, consult [Actions, resources, and condition keys for DynamoDB](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazondynamodb.html) in the IAM documentation.

In the following situations, you may need additional configuration:
+ When using the DynamoDB export connector, you will need to configure IAM so your job can request DynamoDB table exports. Additionally, you will need to identify an Amazon S3 bucket for the export and provide appropriate permissions in IAM for DynamoDB to write to it, and for your AWS Glue job to read from it. For more information, consult [Request a table export in DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/S3DataExport_Requesting.html).
+ If your AWS Glue job has specific Amazon VPC connectivity requirements, use the `NETWORK` AWS Glue connection type to provide network options. Because access to DynamoDB is authorized by IAM, there is no need for an AWS Glue DynamoDB connection type.
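
For example, an identity-based policy for a job that reads from one table and writes to another might look like the following sketch. The table names, Region, and account ID are placeholders, and the action list covers only common scan and write calls; consult the service authorization reference linked above for the complete set your job needs.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "dynamodb:DescribeTable",
                "dynamodb:Scan"
            ],
            "Resource": "arn:aws:dynamodb:us-east-1:111122223333:table/test_source"
        },
        {
            "Effect": "Allow",
            "Action": [
                "dynamodb:DescribeTable",
                "dynamodb:BatchWriteItem",
                "dynamodb:PutItem",
                "dynamodb:DeleteItem"
            ],
            "Resource": "arn:aws:dynamodb:us-east-1:111122223333:table/test_sink"
        }
    ]
}
```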

## Reading from and writing to DynamoDB
<a name="aws-glue-programming-etl-connect-dynamodb-read-write"></a>

The following code examples show how to read from and write to DynamoDB tables through the ETL connector. They demonstrate reading from one table and writing to another table.

------
#### [ Python ]

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context= GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={"dynamodb.input.tableName": "test_source",
        "dynamodb.throughput.read.percent": "1.0",
        "dynamodb.splits": "100"
    }
)
print(dyf.getNumPartitions())

glue_context.write_dynamic_frame_from_options(
    frame=dyf,
    connection_type="dynamodb",
    connection_options={"dynamodb.output.tableName": "test_sink",
        "dynamodb.throughput.write.percent": "1.0"
    }
)

job.commit()
```

------
#### [ Scala ]

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamoDbDataSink
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._


object GlueApp {

  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(SparkContext.getOrCreate())
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType = "dynamodb",
      options = JsonOptions(Map(
        "dynamodb.input.tableName" -> "test_source",
        "dynamodb.throughput.read.percent" -> "1.0",
        "dynamodb.splits" -> "100"
      ))
    ).getDynamicFrame()
    
    print(dynamicFrame.getNumPartitions())

    val dynamoDbSink: DynamoDbDataSink =  glueContext.getSinkWithFormat(
      connectionType = "dynamodb",
      options = JsonOptions(Map(
        "dynamodb.output.tableName" -> "test_sink",
        "dynamodb.throughput.write.percent" -> "1.0"
      ))
    ).asInstanceOf[DynamoDbDataSink]
    
    dynamoDbSink.writeDynamicFrame(dynamicFrame)

    Job.commit()
  }

}
```

------

## Using the DynamoDB export connector
<a name="aws-glue-programming-etl-connect-dynamodb-export-connector"></a>

The export connector performs better than the ETL connector when the DynamoDB table size is larger than 80 GB. In addition, because the export request is conducted outside of the Spark processes in an AWS Glue job, you can enable [auto scaling of AWS Glue jobs](https://docs.aws.amazon.com/glue/latest/dg/auto-scaling.html) to save DPU usage during the export request. With the export connector, you also do not need to configure the number of splits for Spark executor parallelism or the DynamoDB throughput read percentage.

**Note**  
DynamoDB has specific requirements to invoke the `ExportTableToPointInTime` requests. For more information, see [Requesting a table export in DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.Requesting.html). For example, Point-in-Time-Restore (PITR) needs to be enabled on the table to use this connector. The DynamoDB connector also supports AWS KMS encryption for DynamoDB exports to Amazon S3. Supplying your security configuration in the AWS Glue job configuration enables AWS KMS encryption for a DynamoDB export. The KMS key must be in the same Region as the Amazon S3 bucket.  
Note that additional charges apply for the DynamoDB export and for Amazon S3 storage. Exported data in Amazon S3 persists after a job run finishes, so you can reuse it without additional DynamoDB exports.  
Neither the DynamoDB ETL connector nor the export connector supports filters or pushdown predicates applied at the DynamoDB source.

The following code examples show how to read through the export connector and print the number of partitions.

------
#### [ Python ]

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context= GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "ddb",
        "dynamodb.tableArn": "<DynamoDB table ARN>",
        "dynamodb.s3.bucket": "<bucket name>",
        "dynamodb.s3.prefix": "<bucket prefix>",
        "dynamodb.s3.bucketOwner": "<account ID of the bucket owner>",
    }
)
print(dyf.getNumPartitions())

job.commit()
```

------
#### [ Scala ]

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamoDbDataSink
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._


object GlueApp {

  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(SparkContext.getOrCreate())
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType = "dynamodb",
      options = JsonOptions(Map(
        "dynamodb.export" -> "ddb",
        "dynamodb.tableArn" -> "<DynamoDB table ARN>",
        "dynamodb.s3.bucket" -> "<bucket name>",
        "dynamodb.s3.prefix" -> "<bucket prefix>",
        "dynamodb.s3.bucketOwner" -> "<account ID of the bucket owner>"
      ))
    ).getDynamicFrame()
    
    print(dynamicFrame.getNumPartitions())

    Job.commit()
  }

}
```

------

These examples show how to read through the export connector and print the number of partitions from an AWS Glue Data Catalog table that has a `dynamodb` classification:

------
#### [ Python ]

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context= GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

dynamicFrame = glue_context.create_dynamic_frame.from_catalog(
        database="<catalog database>",
        table_name="<catalog table name>",
        additional_options={
            "dynamodb.export": "ddb",
            "dynamodb.s3.bucket": "<bucket name>",
            "dynamodb.s3.prefix": "<bucket prefix>"
        }
    )
print(dynamicFrame.getNumPartitions())

job.commit()
```

------
#### [ Scala ]

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamoDbDataSink
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._


object GlueApp {

  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(SparkContext.getOrCreate())
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    val dynamicFrame = glueContext.getCatalogSource(
        database = "<catalog database>",
        tableName = "<catalog table name>",
        additionalOptions = JsonOptions(Map(
            "dynamodb.export" -> "ddb",
            "dynamodb.s3.bucket" -> "<bucket name>",
            "dynamodb.s3.prefix" -> "<bucket prefix>"
        ))
    ).getDynamicFrame()
    print(dynamicFrame.getNumPartitions())

    Job.commit()
  }
}
```

------

## Simplifying usage of DynamoDB export JSON
<a name="etl-connect-dynamodb-traversing-structure"></a>

A DynamoDB export created with the AWS Glue DynamoDB export connector results in JSON files with a specific nested structure. For more information, see [Data objects](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/S3DataExport.Output.html). AWS Glue supplies a DynamicFrame transformation that can unnest such structures into an easier-to-use form for downstream applications.

The transform can be invoked in one of two ways. You can set the connection option `"dynamodb.simplifyDDBJson"` to `"true"` when calling a method that reads from DynamoDB, or you can call the transform as a method independently available in the AWS Glue library.

Consider the following schema generated by a DynamoDB export:

```
root
|-- Item: struct
|    |-- parentMap: struct
|    |    |-- M: struct
|    |    |    |-- childMap: struct
|    |    |    |    |-- M: struct
|    |    |    |    |    |-- appName: struct
|    |    |    |    |    |    |-- S: string
|    |    |    |    |    |-- packageName: struct
|    |    |    |    |    |    |-- S: string
|    |    |    |    |    |-- updatedAt: struct
|    |    |    |    |    |    |-- N: string
|    |-- strings: struct
|    |    |-- SS: array
|    |    |    |-- element: string
|    |-- numbers: struct
|    |    |-- NS: array
|    |    |    |-- element: string
|    |-- binaries: struct
|    |    |-- BS: array
|    |    |    |-- element: string
|    |-- isDDBJson: struct
|    |    |-- BOOL: boolean
|    |-- nullValue: struct
|    |    |-- NULL: boolean
```

The `simplifyDDBJson` transform will simplify this to:

```
root
|-- parentMap: struct
|    |-- childMap: struct
|    |    |-- appName: string
|    |    |-- packageName: string
|    |    |-- updatedAt: string
|-- strings: array
|    |-- element: string
|-- numbers: array
|    |-- element: string
|-- binaries: array
|    |-- element: string
|-- isDDBJson: boolean
|-- nullValue: null
```

**Note**  
`simplifyDDBJson` is available in AWS Glue 3.0 and later versions. The `unnestDDBJson` transform is also available to simplify DynamoDB export JSON. We encourage users to transition to `simplifyDDBJson` from `unnestDDBJson`.
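
To make the simplification concrete, the following standalone sketch applies the same kind of unwrapping to a Python dict in DynamoDB JSON form. It is illustrative only; the `simplify_ddb_item` helper is hypothetical, not the AWS Glue implementation:

```
# Illustrative only: unwraps DynamoDB JSON type descriptors the way the
# simplifyDDBJson transform flattens the exported schema. This is NOT the
# AWS Glue implementation.
def simplify_ddb_value(value):
    (tag, inner), = value.items()          # e.g. {"S": "myApp"} -> ("S", "myApp")
    if tag in ("S", "N", "B", "BOOL"):     # scalars keep their exported value
        return inner
    if tag == "NULL":
        return None
    if tag in ("SS", "NS", "BS"):          # sets become plain arrays of strings
        return list(inner)
    if tag == "L":                         # lists simplify element-wise
        return [simplify_ddb_value(v) for v in inner]
    if tag == "M":                         # maps simplify per attribute
        return {k: simplify_ddb_value(v) for k, v in inner.items()}
    raise ValueError(f"unknown type descriptor: {tag}")

def simplify_ddb_item(item):
    return {k: simplify_ddb_value(v) for k, v in item.items()}

item = {"parentMap": {"M": {"childMap": {"M": {"appName": {"S": "myApp"}}}}},
        "numbers": {"NS": ["1", "2"]},
        "isDDBJson": {"BOOL": True}}
print(simplify_ddb_item(item))
```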

## Configuring parallelism in DynamoDB operations
<a name="aws-glue-programming-etl-connect-dynamodb-parallelism"></a>

To improve performance, you can tune certain parameters available for the DynamoDB connector. Your goal when tuning parallelism parameters is to maximize the use of the provisioned AWS Glue workers. If you then need more performance, we recommend scaling out your job by increasing the number of DPUs.

You can alter the parallelism of a DynamoDB read operation with the `dynamodb.splits` parameter when using the ETL connector. When reading with the export connector, you do not need to configure the number of splits for Spark executor parallelism. You can alter the parallelism of a DynamoDB write operation with `dynamodb.output.numParallelTasks`.

**Reading with the DynamoDB ETL connector**

We recommend calculating `dynamodb.splits` based on the maximum number of workers set in your job configuration and the following `numSlots` calculation. If auto scaling is enabled, the actual number of workers available may vary under that cap. For more information about setting the maximum number of workers, see **Number of workers** (`NumberOfWorkers`) in [Configuring job properties for Spark jobs in AWS Glue](add-job.md).
+ `numExecutors = NumberOfWorkers - 1`

   For context, one executor is reserved for the Spark driver; the other executors process data.
+ `numSlotsPerExecutor =`

------
#### [ AWS Glue 3.0 and later versions ]
  + `4` if `WorkerType` is `G.1X`
  + `8` if `WorkerType` is `G.2X`
  + `16` if `WorkerType` is `G.4X`
  + `32` if `WorkerType` is `G.8X`

------
#### [ AWS Glue 2.0 and legacy versions ]
  + `8` if `WorkerType` is `G.1X`
  + `16` if `WorkerType` is `G.2X`

------
+ `numSlots = numSlotsPerExecutor * numExecutors`

We recommend that you set `dynamodb.splits` to the number of available slots, `numSlots`.
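
The calculation above can be sketched as a small Python helper. The slots-per-executor values are the AWS Glue 3.0+ figures from the tabs above, and `recommended_splits` is a hypothetical name used here for illustration:

```
# Illustrative sketch of the numSlots calculation above; recommended_splits
# is a hypothetical helper, not an AWS Glue API.
SLOTS_PER_EXECUTOR = {"G.1X": 4, "G.2X": 8, "G.4X": 16, "G.8X": 32}  # Glue 3.0+

def recommended_splits(number_of_workers: int, worker_type: str) -> int:
    """Suggested dynamodb.splits value: one split per available slot."""
    num_executors = number_of_workers - 1  # one executor is reserved for the driver
    return SLOTS_PER_EXECUTOR[worker_type] * num_executors

# A job capped at 10 G.1X workers: 9 executors * 4 slots = 36 splits.
print(recommended_splits(10, "G.1X"))  # 36
```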

**Writing to DynamoDB**

The `dynamodb.output.numParallelTasks` parameter is used to determine WCU per Spark task, using the following calculation:

`permittedWcuPerTask = ( TableWCU * dynamodb.throughput.write.percent ) / dynamodb.output.numParallelTasks`

The DynamoDB writer will function best if configuration accurately represents the number of Spark tasks writing to DynamoDB. In some cases, you may need to override the default calculation to improve write performance. If you do not specify this parameter, the permitted WCU per Spark task will be automatically calculated by the following formula:
+ The default calculation:
  + `numPartitions = dynamicframe.getNumPartitions()`
  + `numSlots` (as defined previously in this section)
  + `numParallelTasks = min(numPartitions, numSlots)`
+ Example 1. DPU=10, WorkerType=Standard. Input DynamicFrame has 100 RDD partitions.
  + `numPartitions = 100`
  + `numExecutors = (10 - 1) * 2 - 1 = 17`
  + `numSlots = 4 * 17 = 68`
  + `numParallelTasks = min(100, 68) = 68`
+ Example 2. DPU=10, WorkerType=Standard. Input DynamicFrame has 20 RDD partitions.
  + `numPartitions = 20`
  + `numExecutors = (10 - 1) * 2 - 1 = 17`
  + `numSlots = 4 * 17 = 68`
  + `numParallelTasks = min(20, 68) = 20`
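
The default calculation and the two examples above can be sketched in a few lines of Python; the helper names here are illustrative, not AWS Glue APIs:

```
# Illustrative sketch of the default numParallelTasks and permittedWcuPerTask
# calculations above; these helpers are not AWS Glue APIs.
def num_parallel_tasks(num_partitions: int, num_slots: int) -> int:
    return min(num_partitions, num_slots)

def permitted_wcu_per_task(table_wcu: float, write_percent: float,
                           parallel_tasks: int) -> float:
    return (table_wcu * write_percent) / parallel_tasks

# Example 1 above: DPU=10 with Standard workers gives 17 executors and 68 slots.
num_executors = (10 - 1) * 2 - 1   # 17
num_slots = 4 * num_executors      # 68
print(num_parallel_tasks(100, num_slots))  # 68, as in Example 1
print(num_parallel_tasks(20, num_slots))   # 20, as in Example 2
```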

**Note**  
Jobs on legacy AWS Glue versions and those using Standard workers require different methods to calculate the number of slots. If you need to tune the performance of these jobs, we recommend that you transition to supported AWS Glue versions.

## DynamoDB connection option reference
<a name="aws-glue-programming-etl-connect-dynamodb"></a>

Designates a connection to Amazon DynamoDB.

Connection options differ for a source connection and a sink connection.

### "connectionType": "dynamodb" with the ETL connector as source
<a name="etl-connect-dynamodb-as-source"></a>

Use the following connection options with `"connectionType": "dynamodb"` as a source, when using the AWS Glue DynamoDB ETL connector:
+ `"dynamodb.input.tableName"`: (Required) The DynamoDB table to read from.
+ `"dynamodb.throughput.read.percent"`: (Optional) The percentage of read capacity units (RCU) to use. The default is set to "0.5". Acceptable values are from "0.1" to "1.5", inclusive.
  + `0.5` represents the default read rate, meaning that AWS Glue will attempt to consume half of the read capacity of the table. If you increase the value above `0.5`, AWS Glue increases the request rate; decreasing the value below `0.5` decreases the read request rate. (The actual read rate will vary, depending on factors such as whether there is a uniform key distribution in the DynamoDB table.)
  + When the DynamoDB table is in on-demand mode, AWS Glue handles the read capacity of the table as `40000`. For exporting a large table, we recommend switching your DynamoDB table to on-demand mode.
+ `"dynamodb.splits"`: (Optional) Defines how many splits we partition this DynamoDB table into while reading. The default is set to "1". Acceptable values are from "1" to "1,000,000", inclusive.

  A value of `1` means there is no parallelism. We highly recommend that you specify a larger value for better performance. For more information about appropriately setting a value, see [Configuring parallelism in DynamoDB operations](#aws-glue-programming-etl-connect-dynamodb-parallelism).
+ `"dynamodb.sts.roleArn"`: (Optional) The IAM role ARN to be assumed for cross-account access. This parameter is available in AWS Glue 1.0 or later.
+ `"dynamodb.sts.roleSessionName"`: (Optional) STS session name. The default is set to "glue-dynamodb-read-sts-session". This parameter is available in AWS Glue 1.0 or later.

### "connectionType": "dynamodb" with the AWS Glue DynamoDB export connector as source
<a name="etl-connect-dynamodb-as-source-export-connector"></a>

Use the following connection options with `"connectionType": "dynamodb"` as a source when using the AWS Glue DynamoDB export connector, which is available only in AWS Glue version 2.0 onwards:
+ `"dynamodb.export"`: (Required) A string value:
  + If set to `ddb`, enables the AWS Glue DynamoDB export connector, where a new `ExportTableToPointInTime` request is invoked during the AWS Glue job. A new export is generated in the location passed through `dynamodb.s3.bucket` and `dynamodb.s3.prefix`.
  + If set to `s3`, enables the AWS Glue DynamoDB export connector but skips the creation of a new DynamoDB export, instead using `dynamodb.s3.bucket` and `dynamodb.s3.prefix` as the Amazon S3 location of a past export of that table.
+ `"dynamodb.tableArn"`: (Required) The DynamoDB table to read from.
+ `"dynamodb.unnestDDBJson"`: (Optional) Default: false. Valid values: boolean. If set to true, performs an unnest transformation of the DynamoDB JSON structure that is present in exports. It is an error to set `"dynamodb.unnestDDBJson"` and `"dynamodb.simplifyDDBJson"` to true at the same time. In AWS Glue 3.0 and later versions, we recommend you use `"dynamodb.simplifyDDBJson"` for better behavior when simplifying DynamoDB Map types. For more information, see [Simplifying usage of DynamoDB export JSON](#etl-connect-dynamodb-traversing-structure). 
+ `"dynamodb.simplifyDDBJson"`: (Optional) Default: false. Valid values: boolean. If set to true, performs a transformation to simplify the schema of the DynamoDB JSON structure that is present in exports. This has the same purpose as the `"dynamodb.unnestDDBJson"` option but provides better support for DynamoDB Map types or even nested Map types in your DynamoDB table. This option is available in AWS Glue 3.0 and later versions. It is an error to set `"dynamodb.unnestDDBJson"` and `"dynamodb.simplifyDDBJson"` to true at the same time. For more information, see [Simplifying usage of DynamoDB export JSON](#etl-connect-dynamodb-traversing-structure).
+ `"dynamodb.s3.bucket"`: (Optional) Indicates the Amazon S3 bucket location in which the DynamoDB `ExportTableToPointInTime` process is to be conducted. The file format for the export is DynamoDB JSON.
  + `"dynamodb.s3.prefix"`: (Optional) Indicates the Amazon S3 prefix location inside the Amazon S3 bucket in which the DynamoDB `ExportTableToPointInTime` loads are to be stored. If neither `dynamodb.s3.prefix` nor `dynamodb.s3.bucket` are specified, these values will default to the Temporary Directory location specified in the AWS Glue job configuration. For more information, see [Special Parameters Used by AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html).
  + `"dynamodb.s3.bucketOwner"`: Indicates the bucket owner needed for cross-account Amazon S3 access.
+ `"dynamodb.sts.roleArn"`: (Optional) The IAM role ARN to be assumed for cross-account access and/or cross-Region access for the DynamoDB table. Note: The same IAM role ARN will be used to access the Amazon S3 location specified for the `ExportTableToPointInTime` request.
+ `"dynamodb.sts.roleSessionName"`: (Optional) STS session name. The default is set to "glue-dynamodb-read-sts-session".
+ `"dynamodb.exportTime"`: (Optional) Valid values: strings representing ISO-8601 instants. A point in time at which the export should be made.
+ `"dynamodb.sts.region"`: (Required if making a cross-Region call using a Regional endpoint) The Region hosting the DynamoDB table you want to read.

### "connectionType": "dynamodb" with the ETL connector as sink
<a name="etl-connect-dynamodb-as-sink"></a>

Use the following connection options with `"connectionType": "dynamodb"` as a sink:
+ `"dynamodb.output.tableName"`: (Required) The DynamoDB table to write to.
+ `"dynamodb.throughput.write.percent"`: (Optional) The percentage of write capacity units (WCU) to use. The default is set to "0.5". Acceptable values are from "0.1" to "1.5", inclusive.
  + `0.5` represents the default write rate, meaning that AWS Glue will attempt to consume half of the write capacity of the table. If you increase the value above 0.5, AWS Glue increases the request rate; decreasing the value below 0.5 decreases the write request rate. (The actual write rate will vary, depending on factors such as whether there is a uniform key distribution in the DynamoDB table).
  + When the DynamoDB table is in on-demand mode, AWS Glue handles the write capacity of the table as `40000`. For importing a large table, we recommend switching your DynamoDB table to on-demand mode.
+ `"dynamodb.output.numParallelTasks"`: (Optional) Defines how many parallel tasks write into DynamoDB at the same time. Used to calculate the permitted WCU per Spark task. In most cases, AWS Glue calculates a reasonable default for this value. For more information, see [Configuring parallelism in DynamoDB operations](#aws-glue-programming-etl-connect-dynamodb-parallelism).
+ `"dynamodb.output.retry"`: (Optional) Defines how many retries we perform when there is a `ProvisionedThroughputExceededException` from DynamoDB. The default is set to "10".
+ `"dynamodb.sts.roleArn"`: (Optional) The IAM role ARN to be assumed for cross-account access.
+ `"dynamodb.sts.roleSessionName"`: (Optional) STS session name. The default is set to "glue-dynamodb-write-sts-session".

# Cross-account cross-Region access to DynamoDB tables
<a name="aws-glue-programming-etl-dynamo-db-cross-account"></a>

AWS Glue ETL jobs support both cross-Region and cross-account access to DynamoDB tables: you can read data from and write data into another AWS account's DynamoDB table, and you can read from and write into a DynamoDB table in another Region. This section gives instructions on setting up the access and provides example scripts.

The procedures in this section reference an IAM tutorial for creating an IAM role and granting access to the role. The tutorial also discusses assuming a role, but here you will instead use a job script to assume the role in AWS Glue. This tutorial also contains information about general cross-account practices. For more information, see [Tutorial: Delegate Access Across AWS Accounts Using IAM Roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html) in the *IAM User Guide*.

## Create a role
<a name="aws-glue-programming-etl-dynamo-db-create-role"></a>

Follow [step 1 in the tutorial](https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html#tutorial_cross-account-with-roles-1) to create an IAM role in account A. When defining the permissions of the role, you can choose to attach existing policies such as `AmazonDynamoDBReadOnlyAccess`, or `AmazonDynamoDBFullAccess` to allow the role to read/write DynamoDB. The following example shows creating a role named `DynamoDBCrossAccessRole`, with the permission policy `AmazonDynamoDBFullAccess`.

## Grant access to the role
<a name="aws-glue-programming-etl-dynamo-db-grant-access"></a>

Follow [step 2 in the tutorial](https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html#tutorial_cross-account-with-roles-2) in the *IAM User Guide* to allow account B to switch to the newly created role. The following example creates a policy with this statement:

------
#### [ JSON ]

```
{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Action": "sts:AssumeRole",
    "Resource": "arn:aws:iam::111122223333:role/DynamoDBCrossAccessRole"
  }
}
```

------

Then, you can attach this policy to the group/role/user you would like to use to access DynamoDB.

## Assume the role in the AWS Glue job script
<a name="aws-glue-programming-etl-dynamo-db-assume-role"></a>

Now, you can log in to account B and create an AWS Glue job. To create a job, refer to the instructions at [Configuring job properties for Spark jobs in AWS Glue](add-job.md). 

In the job script, you need to use the `dynamodb.sts.roleArn` parameter to assume the `DynamoDBCrossAccessRole` role. Assuming this role allows you to obtain temporary credentials, which are used to access DynamoDB in account A. Review these example scripts.

For a cross-account read across Regions (ETL connector):

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context= GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

dyf = glue_context.create_dynamic_frame_from_options(
    connection_type="dynamodb",
    connection_options={
    "dynamodb.region": "us-east-1",
    "dynamodb.input.tableName": "test_source",
    "dynamodb.sts.roleArn": "<DynamoDBCrossAccessRole's ARN>"
    }
)
dyf.show()
job.commit()
```

For a cross-account read across Regions (export connector):

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context= GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

dyf = glue_context.create_dynamic_frame_from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "ddb",
        "dynamodb.tableArn": "<test_source ARN>",
        "dynamodb.sts.roleArn": "<DynamoDBCrossAccessRole's ARN>"
    }
)
dyf.show()
job.commit()
```

For a read and cross-account write across Regions:

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
 
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context= GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)
 
dyf = glue_context.create_dynamic_frame_from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.region": "us-east-1",
        "dynamodb.input.tableName": "test_source"
    }
)
dyf.show()
 
glue_context.write_dynamic_frame_from_options(
    frame=dyf,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.region": "us-west-2",
        "dynamodb.output.tableName": "test_sink",
        "dynamodb.sts.roleArn": "<DynamoDBCrossAccessRole's ARN>"
    }
)
 
job.commit()
```

# DynamoDB connector with Spark DataFrame support
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-support"></a>

The DynamoDB connector with Spark DataFrame support allows you to read from and write to tables in DynamoDB using Spark DataFrame APIs. The connector setup steps are the same as for the DynamicFrame-based connector; see [Configuring DynamoDB connections](aws-glue-programming-etl-connect-dynamodb-home.md#aws-glue-programming-etl-connect-dynamodb-configure).

To load the DataFrame-based connector library, make sure to attach a DynamoDB connection to the Glue job.

**Note**  
The Glue console UI currently does not support creating a DynamoDB connection. You can use the AWS CLI ([CreateConnection](https://docs.aws.amazon.com/cli/latest/reference/glue/create-connection.html)) to create a DynamoDB connection:  

```
aws glue create-connection \
    --connection-input '{
        "Name": "my-dynamodb-connection",
        "ConnectionType": "DYNAMODB",
        "ConnectionProperties": {},
        "ValidateCredentials": false,
        "ValidateForComputeEnvironments": ["SPARK"]
    }'
```

After creating the DynamoDB connection, you can attach it to your Glue job via the CLI ([CreateJob](https://docs.aws.amazon.com/cli/latest/reference/glue/create-job.html), [UpdateJob](https://docs.aws.amazon.com/cli/latest/reference/glue/update-job.html)) or directly in the **Job details** page:

![Attaching a DynamoDB connection to a Glue job on the Job details page](http://docs.aws.amazon.com/glue/latest/dg/images/dynamodb-dataframe-connector.png)


After ensuring a connection with the DYNAMODB type is attached to your Glue job, you can use the following read, write, and export operations from the DataFrame-based connector.

## Reading from and writing to DynamoDB with the DataFrame-based connector
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-read-write"></a>

The following code examples show how to read from and write to DynamoDB tables via the DataFrame-based connector. They demonstrate reading from one table and writing to another table.

------
#### [ Python ]

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from DynamoDB
df = spark.read.format("dynamodb") \
    .option("dynamodb.input.tableName", "test-source") \
    .option("dynamodb.throughput.read.ratio", "0.5") \
    .option("dynamodb.consistentRead", "false") \
    .load()

print(df.rdd.getNumPartitions())

# Write to DynamoDB
df.write \
  .format("dynamodb") \
  .option("dynamodb.output.tableName", "test-sink") \
  .option("dynamodb.throughput.write.ratio", "0.5") \
  .option("dynamodb.item.size.check.enabled", "true") \
  .save()

job.commit()
```

------
#### [ Scala ]

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    
    val glueContext = new GlueContext(SparkContext.getOrCreate())
    val spark = glueContext.getSparkSession
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    val df = spark.read
      .format("dynamodb")
      .option("dynamodb.input.tableName", "test-source")
      .option("dynamodb.throughput.read.ratio", "0.5")
      .option("dynamodb.consistentRead", "false")
      .load()

    print(df.rdd.getNumPartitions)

    df.write
      .format("dynamodb")
      .option("dynamodb.output.tableName", "test-sink")
      .option("dynamodb.throughput.write.ratio", "0.5")
      .option("dynamodb.item.size.check.enabled", "true")
      .save()

    Job.commit()
  }
}
```

------

## Using DynamoDB export via the DataFrame-based connector
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-export"></a>

The export operation is preferred over the read operation for DynamoDB tables larger than 80 GB. The following code example shows how to export a table to Amazon S3, read the exported data, and print the number of partitions via the DataFrame-based connector.

**Note**  
The DynamoDB export functionality is available through the Scala `DynamoDBExport` object. Python users can access it via Spark's JVM interop or use the AWS SDK for Python (boto3) with the DynamoDB `ExportTableToPointInTime` API.

------
#### [ Scala ]

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.{GlueArgParser, Job}
import org.apache.spark.SparkContext
import glue.spark.dynamodb.DynamoDBExport
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(SparkContext.getOrCreate())
    val spark = glueContext.getSparkSession
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    val options = Map(
      "dynamodb.export" -> "ddb",
      "dynamodb.tableArn" -> "arn:aws:dynamodb:us-east-1:123456789012:table/my-table",
      "dynamodb.s3.bucket" -> "my-s3-bucket",
      "dynamodb.s3.prefix" -> "my-s3-prefix",
      "dynamodb.simplifyDDBJson" -> "true"
    )
    val df = DynamoDBExport.fullExport(spark, options)
    
    print(df.rdd.getNumPartitions)
    df.count()
    
    Job.commit()
  }
}
```

------
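As the note above mentions, Python users can drive the export themselves with boto3 and the DynamoDB `ExportTableToPointInTime` API. The sketch below separates building the request parameters from calling AWS; `build_export_request` and `start_export` are hypothetical helper names, and the table ARN, bucket, and prefix are placeholders.

```python
def build_export_request(table_arn, bucket, prefix, export_time=None):
    """Build the parameters for DynamoDB's ExportTableToPointInTime API."""
    params = {
        "TableArn": table_arn,
        "S3Bucket": bucket,
        "S3Prefix": prefix,
        "ExportFormat": "DYNAMODB_JSON",  # the format the connector reads
    }
    if export_time is not None:
        params["ExportTime"] = export_time  # datetime for a point-in-time export
    return params

def start_export(params):
    """Start the export and return its ARN. Requires AWS credentials and
    point-in-time recovery enabled on the source table."""
    import boto3  # imported here so build_export_request stays usable offline
    client = boto3.client("dynamodb")
    response = client.export_table_to_point_in_time(**params)
    return response["ExportDescription"]["ExportArn"]
```

The exported files land under the S3 prefix you supplied, in DynamoDB JSON format, and can then be read back with Spark.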

## Configuration Options
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-options"></a>

### Read options
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-read-options"></a>


| Option | Description | Default | 
| --- | --- | --- | 
| dynamodb.input.tableName | DynamoDB table name (required) | - | 
| dynamodb.throughput.read | The read capacity units (RCU) to use. If unspecified, dynamodb.throughput.read.ratio is used for calculation. | - | 
| dynamodb.throughput.read.ratio | The ratio of read capacity units (RCU) to use | 0.5 | 
| dynamodb.table.read.capacity | The read capacity of the on-demand table used for calculating the throughput. This parameter applies only to on-demand capacity tables. Defaults to the table's warm throughput read units. | - | 
| dynamodb.splits | Defines how many segments are used in parallel scan operations. If not provided, the connector calculates a reasonable default. | - | 
| dynamodb.consistentRead | Whether to use strongly consistent reads | FALSE | 
| dynamodb.input.retry | Defines how many retries to perform when there is a retryable exception. | 10 | 

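The two throughput read options interact: if `dynamodb.throughput.read` is unset, the connector derives a read budget from the table's capacity and `dynamodb.throughput.read.ratio`. The following is a rough sketch of that documented meaning; the connector's internal calculation may differ.

```python
def read_capacity_budget(table_rcu, read_rcu=None, read_ratio=0.5):
    """RCU budget for the job: an explicit dynamodb.throughput.read value
    wins; otherwise, table capacity times dynamodb.throughput.read.ratio."""
    if read_rcu is not None:
        return float(read_rcu)
    if not 0.0 < read_ratio <= 1.0:
        raise ValueError("read ratio must be in (0, 1]")
    return table_rcu * read_ratio
```

With a 100-RCU table and the default ratio of 0.5, the job targets roughly 50 RCU, leaving the remaining capacity for application traffic.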
### Write options
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-write-options"></a>


| Option | Description | Default | 
| --- | --- | --- | 
| dynamodb.output.tableName | DynamoDB table name (required) | - | 
| dynamodb.throughput.write | The write capacity units (WCU) to use. If unspecified, dynamodb.throughput.write.ratio is used for calculation. | - | 
| dynamodb.throughput.write.ratio | The ratio of write capacity units (WCU) to use | 0.5 | 
| dynamodb.table.write.capacity | The write capacity of the on-demand table used for calculating the throughput. This parameter applies only to on-demand capacity tables. Defaults to the table's warm throughput write units. | - | 
| dynamodb.item.size.check.enabled | If true, the connector calculates each item's size before writing to the DynamoDB table and aborts if the size exceeds the maximum. | TRUE | 
| dynamodb.output.retry | Defines how many retries to perform when there is a retryable exception. | 10 | 
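
The `dynamodb.item.size.check.enabled` option guards against DynamoDB's 400 KB per-item limit by sizing each item before writing. A hypothetical approximation is sketched below; real DynamoDB sizing has per-type rules for numbers, sets, and nested documents, and this sketch handles only string-like attributes.

```python
MAX_ITEM_SIZE_BYTES = 400 * 1024  # DynamoDB's hard per-item limit

def approximate_item_size(item):
    """Rough item size: UTF-8 bytes of each attribute name plus its value,
    with every value coerced to a string for simplicity."""
    size = 0
    for name, value in item.items():
        size += len(name.encode("utf-8"))
        size += len(str(value).encode("utf-8"))
    return size

def exceeds_limit(item):
    return approximate_item_size(item) > MAX_ITEM_SIZE_BYTES
```

Items flagged by the real check cause the write to abort before any data reaches the table.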

### Export options
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-export-options"></a>


| Option | Description | Default | 
| --- | --- | --- | 
| dynamodb.export | If set to ddb, enables the AWS Glue DynamoDB export connector; a new ExportTableToPointInTime request is invoked during the AWS Glue job, and a new export is generated at the location passed in dynamodb.s3.bucket and dynamodb.s3.prefix. If set to s3, enables the AWS Glue DynamoDB export connector but skips the creation of a new export and instead uses dynamodb.s3.bucket and dynamodb.s3.prefix as the Amazon S3 location of a past export of that table. | ddb | 
| dynamodb.tableArn | The DynamoDB table to read from. Required if dynamodb.export is set to ddb. |  | 
| dynamodb.simplifyDDBJson | If set to true, performs a transformation to simplify the schema of the DynamoDB JSON structure that is present in exports. | FALSE | 
| dynamodb.s3.bucket | The S3 bucket to store temporary data during DynamoDB export (required) |  | 
| dynamodb.s3.prefix | The S3 prefix to store temporary data during DynamoDB export |  | 
| dynamodb.s3.bucketOwner | The account ID of the bucket owner, needed for cross-account Amazon S3 access |  | 
| dynamodb.s3.sse.algorithm | Type of encryption used on the bucket where temporary data will be stored. Valid values are AES256 and KMS. |  | 
| dynamodb.s3.sse.kmsKeyId | The ID of the AWS KMS managed key used to encrypt the S3 bucket where temporary data will be stored (if applicable). |  | 
| dynamodb.exportTime | A point-in-time at which the export should be made. Valid values: strings representing ISO-8601 instants. |  | 

### General options
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-general-options"></a>


| Option | Description | Default | 
| --- | --- | --- | 
| dynamodb.sts.roleArn | The IAM role ARN to be assumed for cross-account access | - | 
| dynamodb.sts.roleSessionName | STS session name | glue-dynamodb-sts-session | 
| dynamodb.sts.region | Region for the STS client (for cross-region role assumption) | Same as region option | 
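
For cross-account access, the STS options are passed alongside the regular read or write options, and the connector assumes the specified role before calling DynamoDB. A minimal sketch, assuming a hypothetical role ARN and table name:

```python
# Options for reading a table owned by another account; the role ARN,
# table name, and Region below are placeholders.
cross_account_options = {
    "dynamodb.input.tableName": "shared-table",
    "dynamodb.sts.roleArn": "arn:aws:iam::123456789012:role/GlueDynamoDBAccess",
    "dynamodb.sts.roleSessionName": "glue-dynamodb-sts-session",  # the default
    "dynamodb.sts.region": "us-east-1",  # Region for the STS call
}

# In a Glue job, these would be applied as:
# df = spark.read.format("dynamodb").options(**cross_account_options).load()
```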