

# Converting ETL processes to AWS Glue in AWS Schema Conversion Tool
<a name="CHAP-converting-aws-glue-api-process"></a>

In the following sections, you can find a description of a conversion that calls AWS Glue API operations in Python. For more information, see [Program AWS Glue ETL scripts in Python](https://docs.aws.amazon.com//glue/latest/dg/aws-glue-programming-python.html) in the* AWS Glue Developer Guide.*

**Topics**
+ [Step 1: Create a database](#CHAP-converting-aws-glue-step-api-create-db)
+ [Step 2: Create a connection](#CHAP-converting-aws-glue-step-api-connection)
+ [Step 3: Create an AWS Glue crawler](#CHAP-converting-aws-glue-step-api-crawler)

## Step 1: Create a database
<a name="CHAP-converting-aws-glue-step-api-create-db"></a>

The first step is to create a new database in an AWS Glue Data Catalog by using the [AWS SDK API](https://docs.aws.amazon.com/glue/latest/webapi/API_CreateDatabase.html). When you define a table in the Data Catalog, you add it to a database. A database is used to organize the tables in AWS Glue. 

The following example demonstrates the `create_database` method of the Python API for AWS Glue.

```
response = client.create_database(
    DatabaseInput={
        'Name': '{{database_name}}’,
        'Description': '{{description}}',
        'LocationUri': '{{string}}',
        'Parameters': {         
            '{{parameter-name}}': '{{parameter value}}'
        }
    }
)
```

If you are using Amazon Redshift, the database name is formed as follows.

```
{redshift_cluster_name}_{redshift_database_name}_{redshift_schema_name}
```

The full name of Amazon Redshift cluster for this example is as follows.

```
rsdbb03.apq1mpqso.us-west-2.redshift.amazonaws.com
```

The following shows an example of a well-formed database name. In this case `rsdbb03` is the name, which is the first part of the full name of the cluster endpoint. The database is named `dev` and the schema is `ora_glue`.

```
rsdbb03_dev_ora_glue
```

## Step 2: Create a connection
<a name="CHAP-converting-aws-glue-step-api-connection"></a>

Create a new connection in a Data Catalog by using the [AWS SDK API](https://docs.aws.amazon.com/glue/latest/webapi/API_CreateConnection.html).

The following example demonstrates using the [https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-connections.html](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-connections.html) method of the Python API for AWS Glue. 

 

```
response = client.create_connection(
    ConnectionInput={
        'Name': '{{Redshift_abcde03.aabbcc112233.us-west-2.redshift.amazonaws.com_dev}}',
        'Description': {{'Created from SCT}}',
        'ConnectionType': '{{JDBC}}',
        'ConnectionProperties': {
            {{'JDBC_CONNECTION_URL': 'jdbc:redshift://aabbcc03.aabbcc112233.us-west-2.redshift.amazonaws.com:5439/dev',
            'USERNAME': 'user_name',
            'PASSWORD': 'password'}}
        },
        'PhysicalConnectionRequirements': {
            'AvailabilityZone': '{{us-west-2c}}',
            'SubnetId': '{{subnet-a1b23c45}}',
            'SecurityGroupIdList': [
                '{{sg-000a2b3c}}', '{{sg-1a230b4c}}', '{{sg-aba12c3d}}', '{{sg-1abb2345}}'
            ]
        }
    }
)
```

The parameters used in `create_connection` are as follows:
+ `Name` (UTF-8 string) – required. For Amazon Redshift, the connection name is formed as follows: `Redshift_{{<Endpoint-name>}}_{{<redshift-database-name>}}`, for example:` Redshift_abcde03_dev`
+ `Description` (UTF-8 string) – Your description of the connection.
+ `ConnectionType` (UTF-8 string) – Required. The type of connection. Currently, only JDBC is supported; SFTP is not supported.
+ `ConnectionProperties` (dict) – Required. A list of key-value pairs used as parameters for this connection, including the JDBC connection URL, the user name, and the password.
+ `PhysicalConnectionRequirements` (dict) – Physical connection requirements, which include the following:
  + `SubnetId` (UTF-8 string) – The ID of the subnet used by the connection.
  + `SecurityGroupIdList` (list) – The security group ID list used by the connection.
  + `AvailabilityZone` (UTF-8 string) – Required. The Availability Zone that contains the endpoint. This parameter is deprecated.

## Step 3: Create an AWS Glue crawler
<a name="CHAP-converting-aws-glue-step-api-crawler"></a>

Next, you create an AWS Glue crawler to populate the AWS Glue catalog. For more information, see [Cataloging tables with a crawler](https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html) in the *AWS Glue Developer Guide. *

The first step in adding a crawler is to create a new database in a Data Catalog by using the [AWS SDK API](https://docs.aws.amazon.com//glue/latest/webapi/API_CreateCrawler.html). Before you begin, make sure to first delete any previous version of it by using the `delete_crawler` operation.

When you create your crawler, a few considerations apply:
+ For the crawler name, use the format `{{<redshift_node_name>}}_{{<redshift_database_name>}}_{{<redshift_shema_name>}}`, for example: `abcde03_dev_ora_glue`
+ Use an IAM role that already exists. For more information on creating IAM roles, see [Creating IAM roles](https://docs.aws.amazon.com//IAM/latest/UserGuide/id_roles_create.html) in the *IAM User Guide.*
+ Use the name of the database that you created in the previous steps.
+ Use the `ConnectionName` parameter, which is required.
+ For the `path` parameter, use the path to the JDBC target, for example: `dev/ora_glue/%`

The following example deletes an existing crawler and then creates a new one by using the Python API for AWS Glue. 

```
response = client.delete_crawler(
    Name='{{crawler_name}}'
)

response = client.create_crawler(
    Name='{{crawler_name}}',
    Role= ‘{{IAM_role}}’,
    DatabaseName='{{database_name}}’,
    Description='{{string}}',
    Targets={
        'S3Targets': [
            {
                'Path': '{{string}}',
                'Exclusions': [
                    '{{string}}',
                ]
            },
        ],
        'JdbcTargets': [
            {
                'ConnectionName': ‘{{ConnectionName}}’,
                'Path': ‘{{Include_path}}’,
                'Exclusions': [
                    '{{string}}',
                ]
            },
        ]
    },
    Schedule='{{string}}',
    Classifiers=[
        '{{string}}',
    ],
    TablePrefix='{{string}}',
    SchemaChangePolicy={
        'UpdateBehavior': {{'LOG'|'UPDATE_IN_DATABASE'}},
        'DeleteBehavior': {{'LOG'|'DELETE_FROM_DATABASE'|'DEPRECATE_IN_DATABASE'}}
    },
    Configuration='{{string}}'
)
```

Create and then run a crawler that connects to one or more data stores, determines the data structures, and writes tables into the Data Catalog. You can run your crawler on a schedule, as shown following.

```
response = client.start_crawler(
    Name='string'
)
```

This example uses Amazon Redshift as the target. Amazon Redshift data types map to AWS Glue data types in the following way after the crawler runs.


|  |  | 
| --- |--- |
| Amazon Redshift data type | AWS Glue data type | 
| smallint | smallint | 
| integer | int | 
| bigint | bigint | 
| decimal | decimal(18,0) | 
| decimal(p,s) | decimal(p,s) | 
| real | double | 
| double precision | double | 
| boolean | boolean | 
| char | string | 
| varchar | string | 
| varchar(n) | string | 
| date | date | 
| timestamp | timestamp | 
| timestamptz | timestamp | 