

# Transform data with AWS Glue managed transforms
<a name="edit-jobs-transforms"></a>

 AWS Glue Studio provides two types of transforms: 
+  **AWS Glue-native transforms** - Available to all users and managed by AWS Glue. 
+  **Custom visual transforms** - Allow you to upload your own transforms to use in AWS Glue Studio. 

## AWS Glue managed data transform nodes
<a name="edit-jobs-transforms-glue-managed"></a>

AWS Glue Studio provides a set of built-in transforms that you can use to process your data. Your data passes from one node in the job diagram to another in a data structure called a `DynamicFrame`, which is an extension to an Apache Spark SQL `DataFrame`.

In the pre-populated diagram for a job, between the data source and data target nodes is the **Change Schema** transform node. You can configure this transform node to modify your data, or you can use additional transforms. 

The following built-in transforms are available with AWS Glue Studio:
+ **[ChangeSchema](transforms-configure-applymapping.md)**: Map data property keys in the data source to data property keys in the data target. You can rename keys, modify the data types for keys, and choose which keys to drop from the dataset.
+ **[SelectFields](transforms-configure-select-fields.md)**: Choose the data property keys that you want to keep.
+ **[DropFields](transforms-configure-drop-fields.md)**: Choose the data property keys that you want to drop.
+ **[RenameField](transforms-configure-rename-field.md)**: Rename a single data property key.
+ **[Spigot](transforms-configure-spigot.md)**: Write samples of the data to an Amazon S3 bucket.
+ **[Join](transforms-configure-join.md)**: Join two datasets into one dataset using a comparison phrase on the specified data property keys. You can use inner, outer, left, right, left semi, and left anti joins.
+ **[Union](transforms-configure-union.md)**: Combine rows from more than one data source that have the same schema.
+ **[SplitFields](transforms-configure-split-fields.md)**: Split data property keys into two `DynamicFrames`. Output is a collection of `DynamicFrames`: one with selected data property keys, and one with the remaining data property keys. 
+ **[SelectFromCollection](transforms-selectfromcollection-overview.md)**: Choose one `DynamicFrame` from a collection of `DynamicFrames`. The output is the selected `DynamicFrame`.
+ **[FillMissingValues](transforms-configure-fmv.md)**: Locate records in the dataset that have missing values and add a new field with a suggested value that is determined by imputation.
+ **[Filter](transforms-filter.md)**: Split a dataset into two, based on a filter condition.
+  **[Drop Null Fields](transforms-dropnull-fields.md)**: Removes columns from the dataset if all values in the column are null. 
+  **[Drop Duplicates](transforms-drop-duplicates.md)**: Removes duplicate rows from your data source by matching entire rows or by matching specified keys. 
+ **[SQL](transforms-sql.md)**: Enter SparkSQL code into a text entry field to use a SQL query to transform the data. The output is a single `DynamicFrame`. 
+  **[Aggregate](transforms-aggregate-fields.md)**: Performs a calculation (such as average, sum, min, max) on selected fields and rows, and creates a new field with the newly calculated value(s). 
+ **[Flatten](transforms-flatten.md)**: Extract fields inside structs into top level fields.
+ **[UUID](transforms-uuid.md)**: Add a column with a Universally Unique Identifier for each row.
+ **[Identifier](transforms-identifier.md)**: Add a column with a numeric identifier for each row.
+ **[To timestamp](transforms-to-timestamp.md)**: Convert a column to timestamp type.
+ **[Format timestamp](transforms-format-timestamp.md)**: Convert a timestamp column to a formatted string.
+ **[Conditional Router transform](transforms-conditional-router.md)**: Apply multiple conditions to incoming data. Each row of the incoming data is evaluated by a group filter condition and processed into its corresponding group. 
+  **[Concatenate Columns transform](transforms-concatenate-columns.md)**: Build a new string column using the values of other columns with an optional spacer. 
+  **[Split String transform](transforms-split-string.md)**: Break up a string into an array of tokens using a regular expression to define how the split is done. 
+  **[Array To Columns transform](transforms-array-to-columns.md)**: Extract some or all the elements of a column of type array into new columns. 
+  **[Add Current Timestamp transform](transforms-add-current-timestamp.md)**: Mark the rows with the time on which the data was processed. This is useful for auditing purposes or to track latency in the data pipeline. 
+  **[Pivot Rows to Columns transform](transforms-pivot-rows-to-columns.md)**: Aggregate a numeric column by rotating unique values on selected columns which become new columns. If multiple columns are selected, the values are concatenated to name the new columns. 
+  **[Unpivot Columns To Rows transform](transforms-unpivot-columns-to-rows.md)**: Convert columns into values of new columns generating a row for each unique value. 
+  **[Autobalance Processing transform](transforms-autobalance-processing.md)**: Redistribute the data more evenly among the workers. This is useful when the data is unbalanced, or when the way it arrives from the source doesn't allow enough parallel processing. 
+  **[Derived Column transform](transforms-derived-column.md)**: Define a new column based on a math formula or SQL expression in which you can use other columns in the data, as well as constants and literals. 
+  **[Lookup transform](transforms-lookup.md)**: Add columns from a defined catalog table when the keys match the defined lookup columns in the data. 
+  **[Explode Array or Map Into Rows transform](transforms-explode-array.md)**: Extract values from a nested structure into individual rows that are easier to manipulate. 
+  **[Record matching transform](transforms-record-matching.md)**: Invoke an existing Record Matching machine learning data classification transform. 
+  **[Remove null rows transform](transforms-remove-null-rows.md)**: Remove rows from the dataset whose columns are all null or empty. 
+  **[Parse JSON column transform](transforms-parse-json-column.md)**: Parse a string column containing JSON data and convert it to a struct or an array column, depending on whether the JSON is an object or an array. 
+  **[Extract JSON path transform](transforms-extract-json-path.md)**: Extract new columns from a JSON string column. 
+  **[Extract string fragments from a regular expression](transforms-regex-extractor.md)**: Extract string fragments using a regular expression and create a new column from them, or multiple columns if the regular expression uses groups. 
+ **[Custom transform](transforms-custom.md)**: Enter code into a text entry field to use custom transforms. The output is a collection of `DynamicFrames`. 
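As a rough analogy for what two of these transforms do to rows, the following plain-Python sketch mimics the semantics of an inner **Join** and a **Union** on small in-memory datasets. This is an illustration only, not the AWS Glue API (in AWS Glue these transforms operate on `DynamicFrames`), and the dataset and function names are hypothetical:

```python
# Plain-Python analogy for the Join (inner) and Union transforms.
# In AWS Glue Studio these operate on DynamicFrames; this sketch uses
# lists of dicts purely to illustrate the row-level semantics.

customers = [
    {"id": 1, "name": "Joy"},
    {"id": 2, "name": "Tim"},
]
orders = [
    {"id": 1, "total": 20},
    {"id": 3, "total": 15},
]

def inner_join(left, right, key):
    """Keep only rows whose key appears in both datasets, merging fields."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**l, **right_by_key[l[key]]}
        for l in left
        if l[key] in right_by_key
    ]

def union(*datasets):
    """Combine rows from datasets that share the same schema."""
    return [row for ds in datasets for row in ds]

joined = inner_join(customers, orders, "id")
# Only id 1 appears in both datasets, so joined has a single merged row.
```

The built-in Join transform also supports outer, left, right, left semi, and left anti variants; only the inner case is sketched here.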

# Using a data preparation recipe in AWS Glue Studio
<a name="glue-studio-data-preparation"></a>

 The **Data preparation recipe** transform allows you to author a data preparation recipe from scratch using an interactive grid style authoring interface. It also allows you to import an existing AWS Glue DataBrew recipe and then edit it in AWS Glue Studio. 

 The **Data Preparation Recipe** node is available from the Resource panel. You can connect the **Data Preparation Recipe** node to another node in the visual workflow, whether it is a data source node or another transform node. After choosing an AWS Glue DataBrew recipe and version, the applied steps in the recipe are visible in the node properties tab. 

## Prerequisites
<a name="glue-studio-databrew-prerequisites"></a>
+  If you are importing an AWS Glue DataBrew recipe, you must have the required IAM permissions as described in [Import an AWS Glue DataBrew recipe in AWS Glue Studio](glue-studio-data-preparation-import-recipe.md). 
+  A data preview session must be created. 

## Limitations
<a name="glue-studio-databrew-limitations"></a>
+  AWS Glue DataBrew recipes are only supported in [commercial DataBrew regions](https://docs.aws.amazon.com/general/latest/gr/databrew.html). 
+  Not all AWS Glue DataBrew recipes are supported by AWS Glue; some recipes cannot be run in AWS Glue Studio. 
  +  Recipes with `UNION` and `JOIN` transforms are not supported. However, AWS Glue Studio already provides **Join** and **Union** transform nodes, which you can use before or after a **Data Preparation Recipe** node. 
+  **Data Preparation Recipe** nodes are supported for jobs starting with AWS Glue version 4.0. This version will be auto-selected after a **Data Preparation Recipe** node is added to the job. 
+  **Data Preparation Recipe** nodes require Python. This is automatically set when the **Data Preparation Recipe** node is added to the job. 
+  Adding a new **Data Preparation Recipe** node to the visual graph will automatically restart your Data Preview session with the correct libraries to use the **Data Preparation Recipe** node. 
+  The following transforms are not supported for import or editing in a **Data Preparation Recipe** node: `GROUP_BY`, `PIVOT`, `UNPIVOT`, and `TRANSPOSE`. 

## Additional features
<a name="glue-studio-data-preparation-recipe-transform-features"></a>

 When you've selected the **Data Preparation Recipe** transform, you have the ability to take additional actions after choosing **Author recipe**. 
+  Add step – you can add additional steps to a recipe as needed by choosing the add step icon, or by choosing an action from the toolbar in the Preview pane.   
![\[The screenshot shows the add recipe icon.\]](http://docs.aws.amazon.com/glue/latest/dg/images/add-recipe-icon.png)  
![\[The screenshot shows the authoring toolbar in the Preview pane.\]](http://docs.aws.amazon.com/glue/latest/dg/images/author-recipe-toolbar.png)
+  Import recipe – choose **More**, then **Import recipe**, to import an existing recipe into your AWS Glue Studio job.   
![\[The screenshot shows the more icon.\]](http://docs.aws.amazon.com/glue/latest/dg/images/data-preparation-recipe-node-more-icon.png)  
![\[The screenshot shows the more icon.\]](http://docs.aws.amazon.com/glue/latest/dg/images/data-preparation-recipe-node-more-features.png)
+  Download as YAML – choose **More** then **Download as YAML** to download your recipe to save outside of AWS Glue Studio. 
+  Download as JSON – choose **More** then **Download as JSON** to download your recipe to save outside of AWS Glue Studio. 
+  Undo and redo recipe steps – You can undo and redo recipe steps in the Preview pane when working with data in the grid.   
![\[The screenshot shows the undo and redo icons.\]](http://docs.aws.amazon.com/glue/latest/dg/images/author-recipe-toolbar-undo-redo.png)

# Author and run data preparation recipes in a visual ETL AWS Glue job
<a name="glue-studio-data-preparation-recipe-transform-tutorial"></a>

 In this scenario, you can author data preparation recipes without having to first create them in DataBrew. Before you can start authoring recipes, you must: 
+  Have an active Data Preview session running. When the data preview session is READY, **Author Recipe** becomes active and you can begin authoring or editing your recipe.   
![\[The screenshot shows the Data Preview session as complete.\]](http://docs.aws.amazon.com/glue/latest/dg/images/data-preparation-recipe-data-preview-complete.png)
+  Ensure that the toggle for **Automatically import glue libraries** is enabled.   
![\[The screenshot shows the option for Automatically import glue libraries toggled on.\]](http://docs.aws.amazon.com/glue/latest/dg/images/data-preparation-recipe-automatically-import-glue-libraries.png)

   You can do this by choosing the gear icon in the Data Preview pane.   
![\[The screenshot shows the option for Automatically import glue libraries toggled on.\]](http://docs.aws.amazon.com/glue/latest/dg/images/data-preview-preferences.png)

**To author a data preparation recipe in AWS Glue Studio:**

1.  Add the **Data Preparation Recipe** transform to your job canvas. Your transform should be connected to a data source node parent. When you add the **Data Preparation Recipe** node, the Data Preview session will restart with the proper libraries and you will see the data frame being prepared.   
![\[The screenshot shows the data frame loading after adding the Data Preparation Recipe.\]](http://docs.aws.amazon.com/glue/latest/dg/images/data-preparation-preparing-dataframe.png)

1.  Once the Data Preview session is ready, the data with any previously applied steps appears at the bottom of the screen. 

1.  Choose **Author Recipe**. This will allow you to start a new recipe in AWS Glue Studio.   
![\[The screenshot shows the Transform panel with the fields for Name and Node parents, as well as option to Author Recipe.\]](http://docs.aws.amazon.com/glue/latest/dg/images/data-preparation-recipe-transform-tab-new.png)

1.  In the **Transform** panel to the right of the job canvas, enter a name for your data preparation recipe. 

1.  On the left side, the canvas will be replaced with a grid view of your data. To the right, the **Transform** panel will change to show your recipe steps. Choose **Add step** to add the first step in your recipe.   
![\[The screenshot shows the Transform panel after choosing Add Step. When you choose a column, the options will change dynamically. You can choose to sort, take an action on the column, and filter values.\]](http://docs.aws.amazon.com/glue/latest/dg/images/author-recipe-preview-data-transform-panel.png)

1.  In the **Transform** panel, choose to sort, take an action on the column, and filter values. For example, choose **Rename column**.   
![\[The screenshot shows the Transform panel after choosing Add Step. When you choose a column, the options will change dynamically. You can choose to sort, take an action on the column, and filter values.\]](http://docs.aws.amazon.com/glue/latest/dg/images/author-recipe-add-step.png)

1.  In the **Transform** panel on the right side, the options for renaming a column let you choose the source column to rename and enter the new column name. Once you have done so, choose **Apply**. 

    You can preview each step, undo a step, reorder steps, and use any of the action icons, such as Filter, Sort, Split, and Merge. When you perform actions in the data grid, the steps are added to the recipe in the Transform panel.   
![\[The screenshot shows the Preview data grid with the toolbar highlighted. You can apply an action by using any of the tools and it will be added to the recipe in the Transform panel on the right.\]](http://docs.aws.amazon.com/glue/latest/dg/images/author-recipe-preview-data-grid.png)

    If you need to make a change, you can do this in the Preview pane by previewing the result of each step, undoing a step, and re-ordering steps. For example: 
   +  Undo/redo step – undo a step by choosing the **undo** icon. You can repeat a step by choosing the **redo** icon.   
![\[The screenshot shows the undo and redo icons.\]](http://docs.aws.amazon.com/glue/latest/dg/images/author-recipe-toolbar-undo-redo.png)
   +  Reorder step – when you reorder a step, AWS Glue Studio will validate each step and let you know if the step is invalid. 

1.  Once you've applied a step, the Transform panel will show you all the steps in your recipe. You can clear all the steps to start over, add more steps by choosing the add icon, or choose **Done Authoring Recipe**.   
![\[The screenshot shows the Transform panel with steps added to your recipe. When done, choose Done Authoring Recipe or choose the add icon to add more steps to the recipe.\]](http://docs.aws.amazon.com/glue/latest/dg/images/author-recipe-done-authoring-recipe.png)

1.  Choose **Save** at the top right side of your screen. Your recipe steps will not be saved until you save your job. 

# Import an AWS Glue DataBrew recipe in AWS Glue Studio
<a name="glue-studio-data-preparation-import-recipe"></a>

 In AWS Glue DataBrew, a recipe is a set of data transformation steps. An AWS Glue DataBrew recipe prescribes how to transform data that has already been read; it doesn't describe where and how to read the data, or how and where to write it. Those details are configured in Source and Target nodes in AWS Glue Studio. For more information on recipes, see [Creating and using AWS Glue DataBrew recipes](https://docs.aws.amazon.com/databrew/latest/dg/recipes.html). 

 To use AWS Glue DataBrew recipes in AWS Glue Studio, begin by creating recipes in AWS Glue DataBrew. If you already have recipes you want to use, you can skip this step. 

## IAM permissions for AWS Glue DataBrew
<a name="glue-studio-databrew-permissions"></a>

 This topic provides information to help you understand the actions and resources that you, as an IAM administrator, can use in an AWS Identity and Access Management (IAM) policy for the Data Preparation Recipe transform. 

 For additional information about security in AWS Glue, see [Access Management](https://docs.aws.amazon.com/glue/latest/dg/security.html). 

**Note**  
 The following table lists the permissions that a user needs if importing an existing AWS Glue DataBrew recipe. 


**Data Preparation Recipe transform actions**  

| Action | Description | 
| --- | --- | 
| databrew:ListRecipes | Grants permission to retrieve AWS Glue DataBrew recipes. | 
| databrew:ListRecipeVersions | Grants permission to retrieve AWS Glue DataBrew recipe versions. | 
| databrew:DescribeRecipe | Grants permission to retrieve AWS Glue DataBrew recipe description. | 



 The role you're using to access this functionality should have a policy that allows several AWS Glue DataBrew actions. You can achieve this either by using the `AWSGlueConsoleFullAccess` managed policy, which includes the necessary actions, or by adding the following inline policy to your role: 


```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "databrew:ListRecipes",
        "databrew:ListRecipeVersions",
        "databrew:DescribeRecipe"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
```




 To use the Data Preparation Recipe transform, you must add the `iam:PassRole` action to the permissions policy. 


**Additional required permissions**  

| Action | Description | 
| --- | --- | 
| iam:PassRole | Grants permission for the user to pass approved roles to AWS services. | 

Without this permission, the following error occurs:

```
"errorCode": "AccessDenied"
"errorMessage": "User: arn:aws:sts::account_id:assumed-role/AWSGlueServiceRole is not 
authorized to perform: iam:PassRole on resource: arn:aws:iam::account_id:role/service-role/AWSGlueServiceRole 
because no identity-based policy allows the iam:PassRole action"
```
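To grant this permission, you can add a statement like the following to the role's policy. This is a minimal sketch; the account ID and role name below are placeholders taken from the error message above, so substitute your own values:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::account_id:role/service-role/AWSGlueServiceRole"
    }
  ]
}
```

Scoping the `Resource` to the specific service role, rather than `*`, limits which roles the user can pass.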



## Importing an AWS Glue DataBrew recipe
<a name="glue-studio-databrew-import-steps"></a>

**To import an AWS Glue DataBrew recipe and use it in AWS Glue Studio:**

 If you have an existing **Data Preparation Recipe** node and you want to edit the recipe steps directly in AWS Glue Studio, you will have to import the recipe steps into your AWS Glue Studio job. 

1.  Start an AWS Glue job in AWS Glue Studio with a data source. 

1.  Add the **Data Preparation Recipe** node to the job canvas.   
![\[The screenshot shows the Add node modal with data preparation recipe available for selection.\]](http://docs.aws.amazon.com/glue/latest/dg/images/glue-add-node-data-preparation-recipe.png)

1.  In the Transform panel, enter a name for your recipe. 

1.  Choose one or more parent nodes by selecting the available nodes on the canvas from the drop-down list. 

1.  Choose **Author Recipe**. If **Author Recipe** is greyed out, it is unavailable until node parents have been selected and a data preview session has finished.   
![\[Author Data Preparation Recipe form with name field and node parents selection dropdown.\]](http://docs.aws.amazon.com/glue/latest/dg/images/glue-author-data-preparation-recipe.png)

1.  The data frame loads and shows you detailed information about your source data. 

    Select the **more actions** icon and choose **Import recipe**.   
![\[Data preparation interface showing "Build your Recipe" with an "Add step" button.\]](http://docs.aws.amazon.com/glue/latest/dg/images/glue-dataframe-import-recipe.png)

1.  Use the Import recipe wizard to complete the steps. In step 1, search for your recipe, select it, and choose **Next**.   
![\[Import recipe interface showing two recipes, with one selected for import.\]](http://docs.aws.amazon.com/glue/latest/dg/images/import-recipe-step-1.png)

1.  In step 2, choose your import options. You can choose to Append a new recipe to an existing recipe or Overwrite an existing recipe. Choose **Next**.   
![\[Import recipe interface showing selected recipe, version, and two imported steps.\]](http://docs.aws.amazon.com/glue/latest/dg/images/import-recipe-step-2.png)

1.  In step 3, validate the recipe steps. Once you import your AWS Glue DataBrew recipe, you can edit this recipe directly in AWS Glue Studio.   
![\[Recipe import interface showing two steps and a validation progress indicator.\]](http://docs.aws.amazon.com/glue/latest/dg/images/import-recipe-step-3.png)  
![\[Import recipe interface showing validated steps for sorting and formatting data.\]](http://docs.aws.amazon.com/glue/latest/dg/images/import-recipe-step-3-validated-2.png)

1.  After this, the steps will be imported as part of your AWS Glue job. Make necessary configuration changes in the **Job details** tab, like naming your job and adjusting allocated capacity as needed. Choose **Save** to save your job and recipe. 
**Note**  
 JOIN, UNION, GROUP\_BY, PIVOT, UNPIVOT, and TRANSPOSE are not supported for recipe import, nor are they available in recipe authoring mode. 

1.  Optionally, you can finish authoring the job by adding other transform nodes as needed and adding data target nodes. 

    If you reorder steps after you import a recipe, AWS Glue performs validation on those steps. For example, if you renamed and then deleted a column, and you moved the delete step on top, then the rename step would be invalid. You can then edit the steps to fix the validation error. 

# Migrating from AWS Glue DataBrew to AWS Glue Studio
<a name="databrew-migration-to-glue-studio"></a>

 If you have recipes in AWS Glue DataBrew, use the following checklist to migrate your recipes to AWS Glue Studio. 


| If you want to | Then do this | 
| --- | --- | 
|  Allow users to retrieve AWS Glue DataBrew recipes, recipe versions, and recipe descriptions.  |  Add IAM permissions to a policy that allows your role to access the necessary actions. See [IAM permissions for AWS Glue DataBrew](glue-studio-data-preparation-import-recipe.md#glue-studio-databrew-permissions).  | 
|  Import an existing AWS Glue DataBrew recipe into AWS Glue Studio.  |  Follow the steps in [Importing an AWS Glue DataBrew recipe](glue-studio-data-preparation-import-recipe.md#glue-studio-databrew-import-steps).  | 
|  Import a recipe with JOIN and UNION.  |  Recipes with UNION and JOIN transforms are not supported. Use the Join and Union transforms in AWS Glue Studio before or after a Data Preparation Recipe node.  | 

# Using Change Schema to remap data property keys
<a name="transforms-configure-applymapping"></a>

A *Change Schema* transform remaps the source data property keys into the desired configuration for the target data. In a Change Schema transform node, you can:
+ Change the name of multiple data property keys.
+ Change the data type of the data property keys, if the new data type is supported and there is a transformation path between the two data types.
+ Choose a subset of data property keys by indicating which data property keys you want to drop.

You can also add additional *Change Schema* nodes to the job diagram as needed – for example, to modify additional data sources or following a *Join* transform. 
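Conceptually, the transform applies a mapping of (source key → target key, target type, drop flag) to each record. The following plain-Python sketch illustrates that idea only; it is not the AWS Glue API, and the mapping and record shown are hypothetical:

```python
# Plain-Python sketch of Change Schema semantics: rename keys, convert
# data types, and drop keys. In AWS Glue Studio this is configured
# visually and applied to a DynamicFrame; this is an illustration only.

# Mapping: source key -> (target key, target type); None means drop.
mapping = {
    "name": ("customer_name", str),
    "age": ("age", int),
    "notes": None,  # drop this key from the target schema
}

def change_schema(record, mapping):
    out = {}
    for src, rule in mapping.items():
        if rule is None or src not in record:
            continue  # dropped key, or key absent from this record
        target_key, target_type = rule
        out[target_key] = target_type(record[src])
    return out

row = {"name": "Joy", "age": "33", "notes": "vip"}
converted = change_schema(row, mapping)
# -> {"customer_name": "Joy", "age": 33}
```

Note that a real type change only succeeds when a transformation path exists between the two data types, as described above.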

## Using Change Schema with decimal datatype
<a name="transforms-configure-applymapping-decimal-datatype"></a>

 When you use the **Change Schema** transform with the decimal datatype, the transform changes the precision to the default value of (10,2). To set the precision for your use case instead, use the **SQL Query** transform and cast the columns with a specific precision. 

 For example, if you have an input column named "DecimalCol" of type Decimal, and you want to remap it to an output column named "OutputDecimalCol" with a specific precision of (18,6), you would: 

1.  Add a subsequent **SQL Query** transform after the **Change Schema** transform. 

1.  In the **SQL Query** transform, use an SQL query to cast the remapped column to the desired precision. The SQL query would look like this: 

   ```
   SELECT col1, col2, CAST(DecimalCol AS DECIMAL(18,6)) AS OutputDecimalCol
   FROM __THIS__
   ```

    In the above SQL query: 
   +  `col1` and `col2` are other columns in your data that you want to pass through without modification. 
   +  `DecimalCol` is the original column name from the input data. 
   +  `CAST(DecimalCol AS DECIMAL(18,6))` casts the `DecimalCol` to a Decimal type with a precision of 18 digits and 6 decimal places. 
   +  `AS OutputDecimalCol` renames the cast column to `OutputDecimalCol`. 

 By using the **SQL Query** transform, you can override the default precision set by the **Change Schema** transform and explicitly cast the Decimal columns to the desired precision. This approach allows you to leverage the **Change Schema** transform for renaming and restructuring your data while handling the precision requirements for Decimal columns through the subsequent **SQL Query** transformation. 

## Adding a Change Schema transform to your job
<a name="transforms-configure-applymapping-add-to-job"></a>

**Note**  
The **Change Schema** transform is not case-sensitive.

**To add a Change Schema transform node to your job diagram**

1. (Optional) Open the Resource panel and then choose **Change Schema** to add a new transform to your job diagram, if needed. 

1. In the node properties panel, enter a name for the node in the job diagram. If a node parent isn't already selected, choose a node from the **Node parents** list to use as the input source for the transform.

1. Choose the **Transform** tab in the node properties panel.

1. Modify the input schema:
   + To rename a data property key, enter the new name of the key in the **Target key** field.
   + To change the data type for a data property key, choose the new data type for the key from the **Data type** list.
   + To remove a data property key from the target schema, choose the **Drop** check box for that key.

1. (Optional) After configuring the transform node properties, you can view the modified schema for your data by choosing the **Output schema** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you have not specified an IAM role on the **Job details** tab, you are prompted to enter an IAM role here.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the **Data preview** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.

# Using Drop Duplicates
<a name="transforms-drop-duplicates"></a>

 The Drop Duplicates transform removes duplicate rows from your data source and gives you two options: you can remove rows that are completely identical, or you can choose the fields to match and remove rows based only on those fields. 

 For example, in this dataset, some rows are exact duplicates of another row, while other rows match on only some of their values. 


| Row | Name | Email | Age | State | Note | 
| --- | --- | --- | --- | --- | --- | 
| 1 | Joy | joy@gmail | 33 | NY |  | 
| 2 | Tim | tim@gmail | 45 | OH |  | 
| 3 | Rose | rose@gmail | 23 | NJ |  | 
| 4 | Tim | tim@gmail | 42 | OH |  | 
| 5 | Rose | rose@gmail | 23 | NJ |  | 
| 6 | Tim | tim@gmail | 42 | OH | This is a duplicate row and matches completely on all values as row 4 | 
| 7 | Rose | rose@gmail | 23 | NJ | This is a duplicate row and matches completely on all values as row 5 | 

 If you choose to match entire rows, rows 6 and 7 will be removed from the data set. The data set is now: 


| Row | Name | Email | Age | State | 
| --- | --- | --- | --- | --- | 
| 1 | Joy | joy@gmail | 33 | NY | 
| 2 | Tim | tim@gmail | 45 | OH | 
| 3 | Rose | rose@gmail | 23 | NJ | 
| 4 | Tim | tim@gmail | 42 | OH | 
| 5 | Rose | rose@gmail | 23 | NJ | 

 If you choose to specify keys, you can choose to remove rows that match on ‘name’ and ‘email’. This gives you finer control over what counts as a ‘duplicate row’ for your dataset. By specifying ‘name’ and ‘email’, the dataset is now: 


| Row | Name | Email | Age | State | 
| --- | --- | --- | --- | --- | 
| 1 | Joy | joy@gmail | 33 | NY | 
| 2 | Tim | tim@gmail | 45 | OH | 
| 3 | Rose | rose@gmail | 23 | NJ | 



 Some things to keep in mind: 
+  For rows to be recognized as duplicates, values are compared case-sensitively; all values in the rows must have the same casing. This applies to either option you choose (Match entire rows or Specify keys). 
+  All values are read in as strings. 
+  The **Drop Duplicates** transform uses the Spark `dropDuplicates` command. 
+  When using the **Drop Duplicates** transform, the first occurrence of a row is kept and subsequent duplicates are dropped. 
+  The **Drop Duplicates** transform does not change the schema of the dataframe. If you choose to specify keys, all fields are kept in the resulting dataframe. 
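The behavior described above can be sketched in plain Python. This is an analogy to Spark's `dropDuplicates`, not the AWS Glue implementation, using a trimmed version of the example dataset:

```python
# Plain-Python sketch of Drop Duplicates semantics: keep the first
# occurrence of a row, drop later rows that match on the chosen keys
# (or on the entire row when no keys are given). All values are
# compared as strings, case-sensitively, as in the real transform.

def drop_duplicates(rows, keys=None):
    seen = set()
    out = []
    for row in rows:
        # Signature: the whole row, or only the specified key fields.
        if keys is None:
            signature = tuple(sorted(row.items()))
        else:
            signature = tuple(row[k] for k in keys)
        if signature not in seen:
            seen.add(signature)
            out.append(row)
    return out

rows = [
    {"name": "Joy", "email": "joy@gmail", "age": "33"},
    {"name": "Tim", "email": "tim@gmail", "age": "45"},
    {"name": "Tim", "email": "tim@gmail", "age": "42"},
    {"name": "Tim", "email": "tim@gmail", "age": "42"},  # exact repeat
]

whole = drop_duplicates(rows)                       # drops the exact repeat only
by_keys = drop_duplicates(rows, ["name", "email"])  # keeps the first Tim only
```

Matching entire rows removes only the last row; matching on `name` and `email` also removes the second Tim row, mirroring the tables above.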

# Using SelectFields to remove most data property keys
<a name="transforms-configure-select-fields"></a>

You can create a subset of data property keys from the dataset using the *SelectFields* transform. You indicate which data property keys you want to keep and the rest are removed from the dataset.
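In terms of plain records, SelectFields keeps only the chosen keys and drops everything else. This is a hypothetical plain-Python sketch of that behavior, not the AWS Glue API:

```python
# Plain-Python sketch of SelectFields semantics: keep only the chosen
# data property keys. Key matching in the real transform is
# case-sensitive.

def select_fields(record, keep):
    return {k: v for k, v in record.items() if k in keep}

row = {"name": "Joy", "email": "joy@gmail", "age": 33}
subset = select_fields(row, {"name", "age"})
# -> {"name": "Joy", "age": 33}
```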

**Note**  
The *SelectFields* transform is case sensitive. Use *Change Schema* if you need a case-insensitive way to select fields.

**To add a SelectFields transform node to your job diagram**

1. (Optional) Open the Resource panel, and then choose **SelectFields** to add a new transform to your job diagram, if needed. 

1. On the **Node properties** tab, enter a name for the node in the job diagram. If a node parent is not already selected, choose a node from the **Node parents** list to use as the input source for the transform.

1. Choose the **Transform** tab in the node details panel.

1. Under the heading **SelectFields**, choose the data property keys in the dataset that you want to keep. Any data property keys not selected are dropped from the dataset.

   You can also choose the check box next to the column heading **Field** to automatically choose all the data property keys in the dataset. Then you can deselect individual data property keys to remove them from the dataset.

1. (Optional) After configuring the transform node properties, you can view the modified schema for your data by choosing the **Output schema** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you have not specified an IAM role on the **Job details** tab, you are prompted to enter an IAM role here.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the **Data preview** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.

# Using DropFields to keep most data property keys
<a name="transforms-configure-drop-fields"></a>

You can create a subset of data property keys from the dataset using the *DropFields* transform. You indicate which data property keys you want to remove from the dataset and the rest of the keys are retained.

**Note**  
The *DropFields* transform is case sensitive. Use *Change Schema* if you need a case-insensitive way to select fields.

**To add a DropFields transform node to your job diagram**

1. (Optional) Open the Resource panel and then choose **DropFields** to add a new transform to your job diagram, if needed. 

1. On the **Node properties** tab, enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the **Node parents** list to use as the input source for the transform.

1. Choose the **Transform** tab in the node details panel.

1. Under the heading **DropFields**, choose the data property keys to drop from the data source.

   You can also choose the check box next to the column heading **Field** to automatically choose all the data property keys in the dataset. Then you can deselect individual data property keys so they are retained in the dataset.

1. (Optional) After configuring the transform node properties, you can view the modified schema for your data by choosing the **Output schema** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you have not specified an IAM role on the **Job details** tab, you are prompted to enter an IAM role here.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the **Data preview** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.
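Conceptually, *DropFields* is the inverse of *SelectFields*. A plain-Python sketch with hypothetical records (not the AWS Glue API):

```python
# Plain-Python sketch of DropFields semantics: remove the chosen keys and
# keep everything else.
def drop_fields(records, drop):
    drop = set(drop)
    return [{k: v for k, v in row.items() if k not in drop} for row in records]

records = [{"id": 1, "name": "Ana", "ssn": "000-00-0000"}]
trimmed = drop_fields(records, ["ssn"])
```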

# Renaming a field in the dataset
<a name="transforms-configure-rename-field"></a>

You can use the *RenameField* transform to change the name for an individual property key in the dataset. 

**Note**  
The *RenameField* transform is case sensitive. Use *ApplyMapping* if you need a case-insensitive transform.

**Tip**  
If you use the *Change Schema* transform, you can rename multiple data property keys in the dataset with a single transform.

**To add a RenameField transform node to your job diagram**

1. (Optional) Open the Resource panel and then choose **RenameField** to add a new transform to your job diagram, if needed. 

1. On the **Node properties** tab, enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the **Node parents** list to use as the input source for the transform.

1. Choose the **Transform** tab.

1. Under the heading **Data field**, choose a property key from the source data and then enter a new name in the **New field name** field. 

1. (Optional) After configuring the transform node properties, you can view the modified schema for your data by choosing the **Output schema** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you have not specified an IAM role on the **Job details** tab, you are prompted to enter an IAM role here.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the **Data preview** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.
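A minimal plain-Python sketch of the rename semantics, using hypothetical records (not the AWS Glue API):

```python
# Plain-Python sketch of RenameField: move one key to a new name, matching
# the old name case sensitively.
def rename_field(records, old_name, new_name):
    out = []
    for row in records:
        row = dict(row)
        if old_name in row:
            row[new_name] = row.pop(old_name)
        out.append(row)
    return out

renamed = rename_field([{"dob": "1990-01-01"}], "dob", "birth_date")
```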

# Using Spigot to sample your dataset
<a name="transforms-configure-spigot"></a>

To test the transformations performed by your job, you might want to get a sample of the data to check that the transformation works as intended. The *Spigot* transform writes a subset of records from the dataset to a JSON file in an Amazon S3 bucket. The data sampling method can be either a specific number of records from the beginning of the file or a probability factor used to pick records.

**To add a Spigot transform node to your job diagram**

1. (Optional) Open the Resource panel and then choose **Spigot** to add a new transform to your job diagram, if needed. 

1. On the **Node properties** tab, enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the **Node parents** list to use as the input source for the transform.

1. Choose the **Transform** tab in the node details panel.

1. Enter an Amazon S3 path or choose **Browse S3** to choose a location in Amazon S3. This is the location where the job writes the JSON file that contains the data sample.

1. Enter information for the sampling method. You can specify a value for **Number of records** to write starting from the beginning of the dataset and a **Probability threshold** (entered as a decimal value with a maximum value of 1) of picking any given record. 

   For example, to write the first 50 records from the dataset, you would set **Number of records** to 50 and **Probability threshold** to 1 (100%).
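The two sampling controls can be sketched in plain Python, with hypothetical records (not the AWS Glue API):

```python
import random

# Plain-Python sketch of Spigot's sampling rule: keep each record with the
# given probability, up to a maximum record count.
def spigot_sample(records, number_of_records, probability, seed=0):
    rng = random.Random(seed)   # fixed seed so the sketch is reproducible
    kept = [r for r in records if rng.random() < probability]
    return kept[:number_of_records]

# Probability 1 keeps every record, so this is just the first 50.
first_fifty = spigot_sample(list(range(100)), 50, 1.0)
```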

# Joining datasets
<a name="transforms-configure-join"></a>

The *Join* transform allows you to combine two datasets into one. You specify the key names in the schema of each dataset to compare. The output `DynamicFrame` contains rows where keys meet the join condition. The rows in each dataset that meet the join condition are combined into a single row in the output `DynamicFrame` that contains all the columns found in either dataset.

**To add a Join transform node to your job diagram**

1. If there is only one data source available, you must add a new data source node to the job diagram.

1. Choose one of the source nodes for the join. Open the Resource panel and then choose **Join** to add a new transform to your job diagram.

1. On the **Node properties** tab, enter a name for the node in the job diagram.

1. In the **Node properties** tab, under the heading **Node parents**, add a parent node so that there are two datasets providing inputs for the join. The parent can be a data source node or a transform node. 
**Note**  
A join can have only two parent nodes.

1. Choose the **Transform** tab.

   If you see a message indicating that there are conflicting key names, you can either:
   + Choose **Resolve it** to automatically add an *ApplyMapping* transform node to your job diagram. The ApplyMapping node adds a prefix to any keys in the dataset that have the same name as a key in the other dataset. For example, if you use the default value of **right**, then any keys in the right dataset that have the same name as a key in the left dataset will be renamed to `(right)key name`.
   + Manually add a transform node earlier in the job diagram to remove or rename the conflicting keys.

1. Choose the type of join in the **Join type** list. 
   + **Inner join**: Returns a row with columns from both datasets for every match based on the join condition. Rows that don't satisfy the join condition aren't returned.
   + **Left join**: Returns all rows from the left dataset, plus only the rows from the right dataset that satisfy the join condition.
   + **Right join**: Returns all rows from the right dataset, plus only the rows from the left dataset that satisfy the join condition.
   + **Outer join**: Returns all rows from both datasets.
   + **Left semi join**: Returns all rows from the left dataset that have a match in the right dataset based on the join condition.
   + **Left anti join**: Returns all rows from the left dataset that don't have a match in the right dataset based on the join condition.

1. On the **Transform** tab, under the heading **Join conditions**, choose **Add condition**. Choose a property key from each dataset to compare. Property keys on the left side of the comparison operator are referred to as the left dataset and property keys on the right are referred to as the right dataset. 

   For more complex join conditions, you can add additional matching keys by choosing **Add condition** more than once. If you accidentally add a condition, you can choose the delete icon (![\[\]](http://docs.aws.amazon.com/glue/latest/dg/images/delete-icon-black.png)) to remove it.

1. (Optional) After configuring the transform node properties, you can view the modified schema for your data by choosing the **Output schema** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you have not specified an IAM role on the **Job details** tab, you are prompted to enter an IAM role here.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the **Data preview** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.

For an example of the join output schema, consider a join between two datasets with the following property keys:

```
Left: {id, dept, hire_date, salary, employment_status}
Right: {id, first_name, last_name, hire_date, title}
```

The join is configured to match on the `id` and `hire_date` keys using the `=` comparison operator. 

Because both datasets contain `id` and `hire_date` keys, you chose **Resolve it** to automatically add the prefix **right** to the keys in the right dataset. 

The keys in the output schema would be:

```
{id, dept, hire_date, salary, employment_status, 
(right)id, first_name, last_name, (right)hire_date, title}
```
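This example's inner-join and key-renaming behavior can be sketched in plain Python, with hypothetical rows (not the AWS Glue API):

```python
# Plain-Python sketch of an inner Join with the "Resolve it" renaming:
# conflicting right-side keys get a "(right)" prefix.
def inner_join(left, right, keys, prefix="(right)"):
    out = []
    for l_row in left:
        for r_row in right:
            if all(l_row[k] == r_row[k] for k in keys):
                conflicts = set(l_row) & set(r_row)
                resolved = {(prefix + k if k in conflicts else k): v
                            for k, v in r_row.items()}
                out.append({**l_row, **resolved})
    return out

left = [{"id": 1, "dept": "hr", "hire_date": "2020-01-02"}]
right = [{"id": 1, "first_name": "Ana", "hire_date": "2020-01-02"}]
joined = inner_join(left, right, keys=["id", "hire_date"])
```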

# Using Union to combine rows
<a name="transforms-configure-union"></a>

 You use the Union transform node when you want to combine rows from more than one data source that have the same schema. 

 There are two types of Union transformations: 

1. ALL – when applying ALL, the resulting union does not remove duplicate rows.

1. DISTINCT – when applying DISTINCT, the resulting union removes duplicate rows.

 **Unions vs. Joins** 

 You use Union to combine rows. You use Join to combine columns. 

**Using the Union transform in the Visual ETL canvas**

1.  Add more than one data source to perform a union transform. To add a data source, open the Resource Panel, then choose the data source from the Sources tab. Before using the Union transformation, you must ensure that all data sources involved in the union have the same schema and structure. 

1.  When you have at least two data sources that you want to combine using the Union transform, create the Union transform by adding it to the canvas. Open the Resource Panel on the canvas and search for 'Union'. You can also choose the Transforms tab in the Resource Panel and scroll down until you find the Union transform, then choose **Union**. 

1. Select the Union node on the job canvas. In the Node properties window, choose the parent nodes to connect to the Union transform.

1. AWS Glue checks for compatibility to make sure that the Union transform can be applied to all data sources. If the schemas for the data sources are the same, the operation is allowed. If the data sources do not have the same schema, an error message is displayed: “The input schemas of this union are not the same. Consider using ApplyMapping to match the schemas.” To fix this, choose **Use ApplyMapping**. 

1. Choose the Union type.

   1. All – By default, the All Union type is selected; this will result in duplicate rows if there are any in the data combination.

   1. Distinct – Choose Distinct if you want duplicate rows to be removed from the resulting data combination.
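The difference between the two Union types can be sketched in plain Python, with hypothetical rows (not the AWS Glue API):

```python
# Plain-Python sketch of Union ALL vs. DISTINCT. Both inputs are assumed to
# share the same schema.
def union(rows_a, rows_b, distinct=False):
    combined = rows_a + rows_b          # ALL: duplicate rows are kept
    if not distinct:
        return combined
    seen, out = set(), []
    for row in combined:
        key = tuple(sorted(row.items()))
        if key not in seen:             # DISTINCT: duplicate rows are dropped
            seen.add(key)
            out.append(row)
    return out

a = [{"id": 1}, {"id": 2}]
b = [{"id": 2}, {"id": 3}]
```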

# Using SplitFields to split a dataset into two
<a name="transforms-configure-split-fields"></a>

The *SplitFields* transform allows you to choose some of the data property keys in the input dataset and put them into one dataset and the unselected keys into a separate dataset. The output from this transform is a collection of `DynamicFrames`.

**Note**  
You must use a *SelectFromCollection* transform to convert the collection of `DynamicFrames` into a single `DynamicFrame` before you can send the output to a target location.

The *SplitFields* transform is case sensitive. Add an *ApplyMapping* transform as a parent node if you need case-insensitive property key names.

**To add a SplitFields transform node to your job diagram**

1. (Optional) Open the Resource panel and then choose **SplitFields** to add a new transform to your job diagram, if needed. 

1. On the **Node properties** tab, enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the **Node parents** list to use as the input source for the transform.

1. Choose the **Transform** tab.

1. Choose which property keys you want to put into the first dataset. The keys that you do not choose are placed in the second dataset.

1. (Optional) After configuring the transform node properties, you can view the modified schema for your data by choosing the **Output schema** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you have not specified an IAM role on the **Job details** tab, you are prompted to enter an IAM role here.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the **Data preview** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.

1. Configure a *SelectFromCollection* transform node to process the resulting datasets.
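The split into a two-element collection can be sketched in plain Python, with hypothetical records (not the AWS Glue API):

```python
# Plain-Python sketch of SplitFields: the output is a two-element collection,
# with the selected keys in the first dataset and the rest in the second.
def split_fields(records, selected):
    selected = set(selected)
    first = [{k: v for k, v in r.items() if k in selected} for r in records]
    second = [{k: v for k, v in r.items() if k not in selected} for r in records]
    return [first, second]

collection = split_fields([{"id": 1, "dept": "hr", "salary": 100}],
                          ["id", "dept"])
```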

# Overview of *SelectFromCollection* transform
<a name="transforms-selectfromcollection-overview"></a>

Certain transforms have multiple datasets as their output instead of a single dataset, for example, *SplitFields*. The *SelectFromCollection* transform selects one dataset (`DynamicFrame`) from a collection of datasets (an array of `DynamicFrames`). The output for the transform is the selected `DynamicFrame`. 

You must use this transform after you use a transform that creates a collection of `DynamicFrames`, such as:
+ Custom code transforms
+ *SplitFields*

If you don't add a *SelectFromCollection* transform node to your job diagram after any of these transforms, you will get an error for your job. 

The parent node for this transform must be a node that returns a collection of `DynamicFrames`. If you choose a parent for this transform node that returns a single `DynamicFrame`, such as a *Join* transform, your job returns an error. 

Similarly, if you use a *SelectFromCollection* node in your job diagram as the parent for a transform that expects a single `DynamicFrame` as input, your job returns an error.

![\[The screenshot shows the Node parents field on the Node properties tab of the node details panel. The selected node parent is SplitFields and the error message displayed reads "Parent node Split Fields outputs a collection, but node Drop Fields does not accept a collection."\]](http://docs.aws.amazon.com/glue/latest/dg/images/screenshot-edit-splitfields-wrong-parent.png)


# Using SelectFromCollection to choose which dataset to keep
<a name="transforms-configure-select-collection"></a>

Use the *SelectFromCollection* transform to convert a collection of `DynamicFrames` into a single `DynamicFrame`.

**To add a SelectFromCollection transform node to your job diagram**

1. (Optional) Open the Resource panel and then choose **SelectFromCollection** to add a new transform to your job diagram, if needed. 

1. On the **Node properties** tab, enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the **Node parents** list to use as the input source for the transform.

1. Choose the **Transform** tab.

1. Under the heading **Frame index**, choose the array index number that corresponds to the `DynamicFrame` you want to select from the collection of `DynamicFrames`.

   For example, if the parent node for this transform is a *SplitFields* transform, on the **Output schema** tab of that node you can see the schema for each `DynamicFrame`. If you want to keep the `DynamicFrame` associated with the schema for **Output 2**, you would select **1** for the value of **Frame index**, which is the second value in the list.

   Only the `DynamicFrame` that you choose is included in the output.

1. (Optional) After configuring the transform node properties, you can view the modified schema for your data by choosing the **Output schema** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you have not specified an IAM role on the **Job details** tab, you are prompted to enter an IAM role here.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the **Data preview** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.
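The frame-index selection can be sketched in plain Python, with a hypothetical collection (not the AWS Glue API):

```python
# Plain-Python sketch of SelectFromCollection: frame index 1 picks the
# second dataset in the collection.
collection = [
    [{"id": 1, "dept": "hr"}],   # frame index 0
    [{"salary": 100}],           # frame index 1
]

def select_from_collection(collection, frame_index):
    return collection[frame_index]

chosen = select_from_collection(collection, 1)
```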

# Find and fill missing values in a dataset
<a name="transforms-configure-fmv"></a>

You can use the *FillMissingValues* transform to locate records in the dataset that have missing values and add a new field with a value determined by imputation. The input dataset is used to train the machine learning (ML) model that determines what the missing value should be. If you use incremental datasets, then each incremental set is used as the training data for the ML model, so the results might not be as accurate.

**To use a FillMissingValues transform node in your job diagram**

1. (Optional) Open the Resource panel and then choose **FillMissingValues** to add a new transform to your job diagram, if needed.

1. On the **Node properties** tab, enter a name for the node in the job diagram. If a node parent isn't already selected, choose a node from the **Node parents** list to use as the input source for the transform. 

1. Choose the **Transform** tab.

1. For **Data field**, choose the column or field name from the source data that you want to analyze for missing values.

1. (Optional) In the **New field name** field, enter a name for the field added to each record that will hold the estimated replacement value for the analyzed field. If the analyzed field doesn't have a missing value, the value in the analyzed field is copied into the new field. 

   If you don't specify a name for the new field, the default name is the name of the analyzed column with `_filled` appended. For example, if you enter **Age** for **Data field** and don't specify a value for **New field name**, a new field named **Age\_filled** is added to each record.

1. (Optional) After configuring the transform node properties, you can view the modified schema for your data by choosing the **Output schema** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you have not specified an IAM role on the **Job details** tab, you are prompted to enter an IAM role here.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the **Data preview** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.
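A rough plain-Python analogue of this behavior, using hypothetical records (not the AWS Glue API). Note that AWS Glue trains an ML model to impute values; this sketch substitutes a simple column mean as the estimate:

```python
# Simplified analogue of FillMissingValues: add a new field that holds either
# the original value or an estimate (here, the column mean) when it is missing.
def fill_missing(records, field, new_field=None):
    new_field = new_field or field + "_filled"
    present = [r[field] for r in records if r.get(field) is not None]
    estimate = sum(present) / len(present)   # stand-in for the ML estimate
    return [{**r, new_field: r[field] if r.get(field) is not None else estimate}
            for r in records]

filled = fill_missing([{"Age": 20}, {"Age": None}, {"Age": 40}], "Age")
```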

# Filtering keys within a dataset
<a name="transforms-filter"></a>

Use the *Filter* transform to create a new dataset by filtering records from the input dataset based on a regular expression. Rows that don't satisfy the filter condition are removed from the output.
+ For string data types, you can filter rows where the key value matches a specified string.
+ For numeric data types, you can filter rows by comparing the key value to a specified value using the comparison operators `<`, `>`, `=`, `!=`, `<=`, and `>=`.

If you specify multiple filter conditions, the results are combined using an `AND` operator by default, but you can choose `OR` instead.

The *Filter* transform is case sensitive. Add an *ApplyMapping* transform as a parent node if you need case-insensitive property key names.

**To add a Filter transform node to your job diagram**

1. (Optional) Open the Resource panel and then choose **Filter** to add a new transform to your job diagram, if needed. 

1. On the **Node properties** tab, enter a name for the node in the job diagram. If a node parent isn't already selected, then choose a node from the **Node parents** list to use as the input source for the transform.

1. Choose the **Transform** tab.

1. Choose either **Global AND** or **Global OR**. This determines how multiple filter conditions are combined. All conditions are combined using either `AND` or `OR` operations. If you have only a single filter condition, then you can choose either one.

1. Choose the **Add condition** button in the **Filter condition** section to add a filter condition. 

   In the **Key** field, choose a property key name from the dataset. In the **Operation** field, choose the comparison operator. In the **Value** field, enter the comparison value. Here are some examples of filter conditions:
   + `year >= 2018`
   + `State matches 'CA*'`

   When you filter on string values, make sure that the comparison value uses a regular expression format that matches the script language selected in the job properties (Python or Scala).

1. Add additional filter conditions, as needed. 

1. (Optional) After configuring the transform node properties, you can view the modified schema for your data by choosing the **Output schema** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you have not specified an IAM role on the **Job details** tab, you are prompted to enter an IAM role here.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the **Data preview** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.
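The condition evaluation can be sketched in plain Python, with hypothetical rows and a made-up condition format (not the AWS Glue API):

```python
import re

# Plain-Python sketch of the Filter transform: each condition is a
# (key, operation, value) tuple; string values are matched as regexes.
def filter_rows(records, conditions, combine="AND"):
    ops = {"<": lambda a, b: a < b, ">": lambda a, b: a > b,
           "=": lambda a, b: a == b, "!=": lambda a, b: a != b,
           "<=": lambda a, b: a <= b, ">=": lambda a, b: a >= b,
           "matches": lambda a, b: re.match(b, a) is not None}
    agg = all if combine == "AND" else any   # Global AND vs. Global OR
    return [r for r in records
            if agg(ops[op](r[key], value) for key, op, value in conditions)]

rows = [{"year": 2019, "State": "CA"}, {"year": 2017, "State": "CA"},
        {"year": 2019, "State": "NY"}]
matched = filter_rows(rows, [("year", ">=", 2018), ("State", "matches", "CA")])
```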

# Using DropNullFields to remove fields with null values
<a name="transforms-dropnull-fields"></a>

 Use the *DropNullFields* transform to remove fields from the dataset if all values in the field are null. By default, AWS Glue Studio recognizes null objects, but some values, such as empty strings, the string "null", -1 integers, or other placeholders such as zeros, are not automatically recognized as null. 

**To use the DropNullFields**

1.  Add a DropNullFields node to the job diagram. 

1.  On the **Node properties** tab, choose additional values that represent a null value. You can choose to select none or all of the values:   
![\[The screenshot shows the Transform tab for the DropNullFields node.\]](http://docs.aws.amazon.com/glue/latest/dg/images/DropNullFields-transform-tab.png)
   +  Empty String ("" or '') - fields that contain empty strings will be removed 
   +  "null string" - fields that contain the string with the word 'null' will be removed 
   +  -1 integer - fields that contain a -1 (negative one) integer will be removed 

1.  If needed, you can also specify custom null values. These are null values that may be unique to your dataset. To add a custom null value, choose **Add new value**. 

1.  Enter the custom null value. For example, this can be zero, or any other value that is used to represent a null in the dataset. 

1.  Choose the data type in the drop-down field. Data types can either be String or Integer. 
**Note**  
 Custom null values and their data types must match exactly in order for the fields to be recognized as null values and the fields removed. Partial matches where only the custom null value matches but the data type does not will not result in the fields being removed. 
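The exact-match rule for custom null values can be sketched in plain Python, with hypothetical records (not the AWS Glue API):

```python
# Plain-Python sketch of DropNullFields: a field is dropped only when every
# row's value is None or one of the chosen null tokens, matched by both
# value and exact type.
def drop_null_fields(records, null_values=("", "null", -1)):
    def is_null(v):
        return v is None or any(v == n and type(v) is type(n)
                                for n in null_values)
    fields = {k for r in records for k in r}
    dead = {f for f in fields if all(is_null(r.get(f)) for r in records)}
    return [{k: v for k, v in r.items() if k not in dead} for r in records]

rows = [{"id": 1, "note": ""}, {"id": 2, "note": "null"}]
cleaned = drop_null_fields(rows)
```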

# Using a SQL query to transform data
<a name="transforms-sql"></a>

You can use a **SQL** transform to write your own transform in the form of a SQL query.

A SQL transform node can have multiple datasets as inputs, but produces only a single dataset as output. It contains a text field where you enter the Apache Spark SQL query. You can assign aliases to each dataset used as input to help simplify the SQL query. For more information about the SQL syntax, see the [Spark SQL documentation](https://spark.apache.org/docs/latest/sql-ref.html).

**Note**  
If you use a Spark SQL transform with a data source located in a VPC, add an AWS Glue VPC endpoint to the VPC that contains the data source. For more information about configuring development endpoints, see [Adding a Development Endpoint](https://docs.aws.amazon.com/glue/latest/dg/add-dev-endpoint.html), [Setting Up Your Environment for Development Endpoints](https://docs.aws.amazon.com/glue/latest/dg/start-development-endpoint.html), and [Accessing Your Development Endpoint](https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-elastic-ip.html) in the *AWS Glue Developer Guide*.

**To use a SQL transform node in your job diagram**

1. (Optional) Add a transform node to the job diagram, if needed. Choose **SQL Query** for the node type.
**Note**  
 If you use a data preview session and a custom SQL or custom code node, the data preview session will execute the SQL or code block as-is for the entire dataset. 

1. On the **Node properties** tab, enter a name for the node in the job diagram. If a node parent is not already selected, or if you want multiple inputs for the SQL transform, choose a node from the **Node parents** list to use as the input source for the transform. Add additional parent nodes as needed.

1. Choose the **Transform** tab in the node details panel. 

1. The source datasets for the SQL query are identified by the names you specified in the **Name** field for each node. If you do not want to use these names, or if the names are not suitable for a SQL query, you can associate an alias with each dataset. The console provides default aliases, such as `MyDataSource`.

   For example, if a parent node for the SQL transform node is named `Rename Org PK field`, you might associate the name `org_table` with this dataset. This alias can then be used in the SQL query in place of the node name. 

1. In the text entry field under the heading **Code block**, paste or enter the SQL query. The text field displays SQL syntax highlighting and keyword suggestions.

1. With the SQL transform node selected, choose the **Output schema** tab, and then choose **Edit**. Provide the columns and data types that describe the output fields of the SQL query.

   Specify the schema using the following actions in the **Output schema** section of the page:
   + To rename a column, place the cursor in the **Key** text box for the column (also referred to as a *field* or *property key*) and enter the new name.
   + To change the data type for a column, select the new data type for the column from the drop-down list.
   + To add a new top-level column to the schema, choose the Overflow (![\[\]](http://docs.aws.amazon.com/glue/latest/dg/images/edit-schema-actions-button.png)) button, and then choose **Add root key**. New columns are added at the top of the schema.
   + To remove a column from the schema, choose the delete icon (![\[\]](http://docs.aws.amazon.com/glue/latest/dg/images/delete-icon-black.png)) to the far right of the Key name. 

1. When you finish specifying the output schema, choose **Apply** to save your changes and exit the schema editor. If you do not want to save your changes, choose **Cancel** to exit the schema editor.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the **Data preview** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.
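The idea of querying an aliased input dataset with SQL can be sketched using Python's built-in `sqlite3` in place of Spark SQL (hypothetical table name, columns, and data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Register the parent dataset under the alias used in the query, e.g. org_table.
conn.execute("CREATE TABLE org_table (id INTEGER, dept TEXT)")
conn.executemany("INSERT INTO org_table VALUES (?, ?)",
                 [(1, "hr"), (2, "eng"), (3, "eng")])
# The alias, not the node name, appears in the SQL text.
rows = conn.execute(
    "SELECT dept, COUNT(*) AS n FROM org_table GROUP BY dept ORDER BY dept"
).fetchall()
# rows -> [('eng', 2), ('hr', 1)]
```

Spark SQL syntax differs from SQLite in places, so treat this only as an illustration of the aliasing workflow.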

# Using Aggregate to perform summary calculations on selected fields
<a name="transforms-aggregate-fields"></a>

**To use the Aggregate transform**

1.  Add the Aggregate node to the job diagram. 

1.  On the **Node properties** tab, optionally choose the fields to group by from the drop-down list. You can select more than one field at a time, or search for a field name by typing in the search bar. 

    When fields are selected, the name and datatype are shown. To remove a field, choose 'X' on the field.   
![\[The screenshot shows the Transform tab for the Aggregate node.\]](http://docs.aws.amazon.com/glue/latest/dg/images/Aggregate-transform-tab.png)

1.  Choose **Aggregate another column**. You must select at least one field.   
![\[The screenshot shows the fields when choosing Aggregate another column.\]](http://docs.aws.amazon.com/glue/latest/dg/images/Aggregate-fieldtoaggregate.png)

1.  Choose a field in the **Field to aggregate** drop-down. 

1.  Choose the aggregation function to apply to the chosen field: 
   +  avg - calculates the average 
   +  countDistinct - calculates the number of unique non-null values 
   +  count - calculates the number of non-null values 
   +  first - returns the first value that satisfies the 'group by' criteria 
   +  last - returns the last value that satisfies the 'group by' criteria 
   +  kurtosis - calculates the sharpness of the peak of a frequency-distribution curve 
   +  max - returns the highest value that satisfies the 'group by' criteria 
   +  min - returns the lowest value that satisfies the 'group by' criteria 
   +  skewness - calculates the asymmetry of the value distribution around its mean 
   +  stddev\_pop - calculates the population standard deviation (the square root of the population variance) 
   +  sum - the sum of all values in the group 
   +  sumDistinct - the sum of distinct values in the group 
   +  var\_samp - the sample variance of the group (ignores nulls) 
   +  var\_pop - the population variance of the group (ignores nulls) 
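The group-then-aggregate behavior can be sketched in plain Python, with hypothetical records (not the AWS Glue API):

```python
from collections import defaultdict

# Plain-Python sketch of the Aggregate transform: group rows by the chosen
# fields, then apply one aggregation function per aggregated column.
def aggregate(records, group_by, field, func):
    funcs = {"avg": lambda v: sum(v) / len(v), "sum": sum, "min": min,
             "max": max, "count": len,
             "countDistinct": lambda v: len(set(v))}
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[k] for k in group_by)].append(r[field])
    return {k: funcs[func](v) for k, v in groups.items()}

salaries = [{"dept": "hr", "salary": 100}, {"dept": "hr", "salary": 200},
            {"dept": "eng", "salary": 300}]
averages = aggregate(salaries, ["dept"], "salary", "avg")
```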

# Flatten nested structs
<a name="transforms-flatten"></a>

*Flatten* the fields of nested structs in the data so that they become top-level fields. Each new field is named using the names of the struct fields needed to reach it, separated by dots. 

For example, suppose the data has a Struct field named "phone\_numbers" that, among other fields, contains a Struct named "home\_phone" with two fields: "country\_code" and "number". Once flattened, these two fields become top-level fields named "phone\_numbers.home\_phone.country\_code" and "phone\_numbers.home\_phone.number", respectively.
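This naming scheme can be sketched in plain Python, with nested dicts standing in for structs (not the AWS Glue API):

```python
# Plain-Python sketch of Flatten: nested field names are joined with dots
# to form the top-level field names.
def flatten(record, prefix=""):
    out = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):          # nested struct
            out.update(flatten(value, name + "."))
        else:
            out[name] = value
    return out

row = {"phone_numbers": {"home_phone": {"country_code": "+1",
                                        "number": "5550100"}}}
flat = flatten(row)
```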

**To add a *Flatten* transform node in your job diagram**

1. Open the Resource panel, choose the **Transforms** tab, and then choose **Flatten** to add a new transform to your job diagram. You can also use the search bar by entering 'Flatten' and then choosing the Flatten node. The node selected at the time of adding the node will be its parent.  
![\[The screenshot shows the Resource Panel and the search bar populated with the word 'Flatten'. The search result shows the Flatten transform.\]](http://docs.aws.amazon.com/glue/latest/dg/images/transform-flatten.png)

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the **Node parents** list to use as the input source for the transform.

1. (Optional) On the **Transform** tab, you can limit the maximum nesting level to flatten. For instance, setting that value to 1 means that only top-level structs will be flattened. Setting the max to 2 will flatten the top level and the structs directly under it.

# Add a UUID column
<a name="transforms-uuid"></a>

When you add a *UUID* (Universally Unique Identifier) column, each row is assigned a unique 36-character string.
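
The effect can be sketched with Python's standard `uuid` module; this is an illustration of the result, not the Glue implementation.

```python
import uuid

# A UUID4 string is 36 characters: 32 hex digits plus 4 hyphens.
rows = [{"order_id": 1}, {"order_id": 2}]
for row in rows:
    row["uuid"] = str(uuid.uuid4())  # "uuid" is the transform's default column name

print(len(rows[0]["uuid"]))  # 36
```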

**To add a *UUID* transform node in your job diagram**

1. Open the Resource panel and then choose **UUID** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the **Node parents** list to use as the input source for the transform.

1. (Optional) On the **Transform** tab, you can customize the name of the new column. By default it will be named "uuid".

# Add an identifier column
<a name="transforms-identifier"></a>

Assign a numeric *Identifier* for each row in the dataset.

**To add an *Identifier* transform node in your job diagram**

1. Open the Resource panel and then choose **Identifier** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the **Node parents** list to use as the input source for the transform.

1. (Optional) On the **Transform** tab, you can customize the name of the new column. By default, it will be named "id".

1. (Optional) If the job processes and stores data incrementally, you might want to avoid reusing the same IDs between job runs.

   On the **Transform** tab, select the **unique** checkbox option. This includes the job timestamp in the identifier, making it unique across multiple runs. To accommodate the larger number, the column will be of type decimal instead of long.

# Convert a column to timestamp type
<a name="transforms-to-timestamp"></a>

You can use the transform *To timestamp* to change the data type of a numeric or string column into timestamp, so that it can be stored with that data type or applied to other transforms that require a timestamp.

**To add a *To timestamp* transform node in your job diagram**

1. Open the Resource panel and then choose **To timestamp** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the **Node parents** list to use as the input source for the transform.

1. On the **Transform** tab, enter the name of the column to be converted.

1. On the **Transform** tab, define how to parse the column selected by choosing the type.

   If the value is a number, it can be expressed in seconds (Unix/Python timestamp), milliseconds, or microseconds; choose the corresponding option.

   If the value is a formatted string, choose the "iso" type; the string needs to conform to one of the variants of the ISO format, for example: “2022-11-02T14:40:59.915Z“.

   If you don’t know the type at this point, or different rows use different types, you can choose ”autodetect“ and the system will make its best guess, at a small performance cost.

1. (Optional) On the **Transform** tab, instead of converting the selected column, you can create a new one and keep the original by entering a name for the new column.
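
The parsing choices above can be sketched in plain Python. This is an illustrative sketch, not the Glue API; the `to_timestamp` helper name and its `kind` parameter are hypothetical.

```python
from datetime import datetime, timezone

# Sketch of the conversion options: numeric values interpreted as
# seconds/milliseconds/microseconds, or an ISO-formatted string.
def to_timestamp(value, kind):
    if kind == "seconds":
        return datetime.fromtimestamp(value, tz=timezone.utc)
    if kind == "milliseconds":
        return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    if kind == "microseconds":
        return datetime.fromtimestamp(value / 1_000_000, tz=timezone.utc)
    if kind == "iso":
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    raise ValueError(f"unknown kind: {kind}")

print(to_timestamp(1667400059, "seconds"))
print(to_timestamp("2022-11-02T14:40:59.915Z", "iso"))
```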

# Convert a timestamp column to a formatted string
<a name="transforms-format-timestamp"></a>

Format a timestamp column into a string based on a pattern. You can use *Format timestamp* to get date and time as a string with the desired format. You can define the format using [Spark date syntax](https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html) as well as most of the [Python date codes](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes).

For example, if you want your date string to be formatted like “2023-01-01 00:00”, you can define such a format using the Spark syntax as “yyyy-MM-dd HH:mm” or the equivalent Python date codes as “%Y-%m-%d %H:%M”.
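
The Python-date-code equivalent can be checked directly with `strftime`:

```python
from datetime import datetime

ts = datetime(2023, 1, 1, 0, 0)
# Python date codes "%Y-%m-%d %H:%M" correspond to the Spark pattern "yyyy-MM-dd HH:mm"
print(ts.strftime("%Y-%m-%d %H:%M"))  # 2023-01-01 00:00
```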

**To add a *Format timestamp* transform node in your job diagram**

1. Open the Resource panel and then choose **Format timestamp** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the **Node parents** list to use as the input source for the transform.

1. On the **Transform** tab, enter the name of the column to be converted.

1. On the **Transform** tab, enter the **Timestamp format** pattern to use, expressed using [Spark date syntax](https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html) or [Python date codes](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes).

1. (Optional) On the **Transform** tab, instead of converting the selected column, you can create a new one and keep the original by entering a name for the new column.

# Creating a Conditional Router transformation
<a name="transforms-conditional-router"></a>

 The Conditional Router transform allows you to apply multiple conditions to incoming data. Each row of the incoming data is evaluated by a group filter condition and processed into its corresponding group. If a row meets more than one group filter condition, the transform passes the row to multiple groups. If a row does not meet any condition, it can either be dropped or routed to a default output group. 

 This transform is similar to the filter transform, but useful for users who want to test the same input data on multiple conditions. 
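
The routing semantics can be sketched in plain Python. This is an illustration of the behavior described above, not the Glue implementation; the `route` function and the `default_group` name are hypothetical.

```python
# Sketch: each row is tested against every group's condition; rows can
# land in several groups, and unmatched rows go to the default group.
def route(rows, groups):
    out = {name: [] for name in groups}
    out["default_group"] = []
    for row in rows:
        matched = False
        for name, predicate in groups.items():
            if predicate(row):
                out[name].append(row)
                matched = True
        if not matched:
            out["default_group"].append(row)
    return out

rows = [{"status": "ok", "latency": 5}, {"status": "error", "latency": 120}]
result = route(rows, {"errors": lambda r: r["status"] == "error",
                      "slow": lambda r: r["latency"] > 100})
# The second row meets both conditions, so it appears in both groups.
```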

**To add a Conditional Router transform:**

1.  Choose a node where you will perform the conditional router transformation. This can be a source node or another transform. 

1.  Choose **Action**, then use the search bar to find and choose 'Conditional Router'. A **Conditional Router** transform is added along with two output nodes. One output node, 'Default group', contains records which do not meet any of the conditions defined in the other output node(s). The default group cannot be edited.   
![\[The screenshot shows the conditional router transform node connected to a source node. Output nodes are shown branching from the conditional router node.\]](http://docs.aws.amazon.com/glue/latest/dg/images/transform-conditional-router-node.png)

    You can add additional output groups by choosing **Add group**. For each output group, you can name the group and add filter conditions and a logical operator.   
![\[The screenshot shows the conditional router transform tab with options to name the output group, logical operator and conditional filter(s).\]](http://docs.aws.amazon.com/glue/latest/dg/images/transform-conditional-router-tab.png)

1.  Rename the output group name by entering a new name for the group. AWS Glue Studio will automatically name your groups for you (for example, 'output\_group\_1'). 

1.  Choose a logical operator (**AND**, **OR**) and add a **Filter condition** by specifying the **Key**, **Operation**, and **Value**. Logical operators allow you to implement more than one filter condition and perform the logical operator on each filter condition you specify. 

    When specifying the key, you can choose from available keys in your schema. You can then choose the available operation depending on the type of key you selected. For example, if the key type is 'string', then the available operation to choose from is 'matches'.   
![\[The screenshot shows the conditional router transform tab with the filter condition fields for key, operation and value.\]](http://docs.aws.amazon.com/glue/latest/dg/images/transform-conditional-router-filter-condition.png)

1.  Enter the value in the **Value** field. To add additional filter conditions, choose **Add condition**. To remove filter conditions, choose the trash can icon. 

# Using the Concatenate Columns transform to append columns
<a name="transforms-concatenate-columns"></a>

 The Concatenate transform allows you to build a new string column using the values of other columns with an optional spacer. For example, if we define a concatenated column “date” as the concatenation of “year”, “month” and “day” (in that order) with “-” as the spacer, we would get: 


| day | month | year | date | 
| --- | --- | --- | --- | 
| 01 | 01 | 2020 | 2020-01-01 | 
| 02 | 01 | 2020 | 2020-01-02 | 
| 03 | 01 | 2020 | 2020-01-03 | 
| 04 | 01 | 2020 | 2020-01-04 | 
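
The concatenation semantics can be sketched in plain Python. This is an illustrative sketch, not the Glue implementation; the `concatenate` helper and its parameter names mirror the transform's options but are hypothetical.

```python
# Sketch: join the chosen columns in order, with an optional spacer,
# substituting the null-value string when a column is null.
def concatenate(row, columns, spacer="", null_value=""):
    parts = [null_value if row.get(c) is None else str(row[c]) for c in columns]
    return spacer.join(parts)

row = {"day": "01", "month": "01", "year": "2020"}
print(concatenate(row, ["year", "month", "day"], spacer="-"))  # 2020-01-01
```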

**To add a Concatenate transform:**

1. Open the Resource panel. Then choose **Concatenate Columns** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the Node parents list to use as the input source for the transform.

1. On the **Transform** tab, enter the name of the column that will hold the concatenated string as well as the columns to concatenate. The order in which you check the columns in the dropdown will be the order used.  
![\[The screenshot shows the Transform tab for the Concatenate transform.\]](http://docs.aws.amazon.com/glue/latest/dg/images/transforms-concatenate-transform-tab.png)

1. **Spacer - optional** – Enter a string to place between the concatenated fields. By default, there is no spacer.

1. **Null value - optional** – Enter a string to use in place of null column values. By default, an empty string is used.

# Using the Split String transform to break up a string column
<a name="transforms-split-string"></a>

 The Split String transform allows you to break up a string into an array of tokens, using a regular expression to define how the split is done. You can then keep the column as an array type, or apply an **Array To Columns** transform after this one to extract the array values into top-level fields, assuming that each token has a known meaning. Also, if the order of the tokens is irrelevant (for instance, a set of categories), you can use the **Explode** transform to generate a separate row for each value. 

 For example, you can split the column “categories” using a comma as the pattern to add a column “categories\_arr”. 


| product\_id | categories | categories\_arr | 
| --- | --- | --- | 
| 1 | sports,winter | [sports, winter] | 
| 2 | garden,tools | [garden, tools] | 
| 3 | videogames | [videogames] | 
| 4 | game,boardgame,social | [game, boardgame, social] | 
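
Splitting on a regular expression can be illustrated with Python's standard `re` module; the example above corresponds to splitting on a comma, while regex metacharacters such as a dot need escaping.

```python
import re

# A comma has no special regex meaning, so it can be used as-is;
# a dot is a metacharacter and must be escaped with a backslash.
print(re.split(",", "sports,winter"))   # ['sports', 'winter']
print(re.split(r"\.", "192.168.0.1"))   # ['192', '168', '0', '1']
```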

**To add a Split String transform:**

1. Open the Resource panel and then choose Split String to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the Node properties tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the Node parents list to use as the input source for the transform.

1. On the **Transform** tab, choose the column to split and enter the pattern to use to split the string. In most cases you can just enter the character(s), unless the character has a special meaning as a regular expression and needs to be escaped by adding a backslash in front of it. The characters that need escaping are: `\.[]{}()<>*+-=!?^$|`. For instance, to split on a dot ('.') you need to enter `\.`. A comma, however, has no special meaning and can be specified as is: `,`.  
![\[The screenshot shows the Transform tab for the Split String transform.\]](http://docs.aws.amazon.com/glue/latest/dg/images/transforms-split-string-transform-tab.png)

1. (Optional) If you want to keep the original string column, enter a name for a new array column; this way both the original string column and the new tokenized array column are kept.

# Using the Array To Columns transform to extract the elements of an array into top level columns
<a name="transforms-array-to-columns"></a>

 The Array To Columns transform allows you to extract some or all of the elements of an array column into new columns. The transform fills as many of the new columns as the array has values to extract, optionally taking the elements from the positions specified. 

 For instance, if you have an array column “subnet”, which was the result of applying the Split String transform on an IPv4 subnet, you can extract the first and fourth positions into new columns “first\_octet” and “fourth\_octet”. The output of the transform in this example would be (notice the last two rows have shorter arrays than expected): 


| subnet | first\_octet | fourth\_octet | 
| --- | --- | --- | 
| [54, 240, 197, 238] | 54 | 238 | 
| [192, 168, 0, 1] | 192 | 1 | 
| [192, 168] | 192 |  | 
| [] |  |  | 
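
The extraction-with-padding behavior can be sketched in plain Python. This is illustrative, not the Glue implementation; the helper name and 1-based index convention are assumptions mirroring the description above.

```python
# Sketch: take the array elements at the given 1-based positions,
# filling with None (NULL) when the array is too short.
def array_to_columns(arr, indexes):
    return [arr[i - 1] if i <= len(arr) else None for i in indexes]

print(array_to_columns([54, 240, 197, 238], [1, 4]))  # [54, 238]
print(array_to_columns([192, 168], [1, 4]))           # [192, None]
```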

**To add an Array To Columns transform:**

1. Open the Resource panel and then choose **Array To Columns** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the Node parents list to use as the input source for the transform.

1. On the **Transform** tab, choose the array column to extract and enter the list of new columns for the tokens extracted.  
![\[The screenshot shows the Transform tab for the Array To Columns transform.\]](http://docs.aws.amazon.com/glue/latest/dg/images/transforms-array-to-columns-transform-tab.png)

1. (Optional) If you don’t want to take the array tokens in order, you can specify the indexes to take; they will be assigned to the list of columns in the order specified. For instance, if the output columns are “column1, column2, column3” and the indexes are “4, 1, 3”, the fourth element of the array goes to column1, the first to column2, and the third to column3 (if the array is shorter than the index number, a NULL value is set).

# Using the Add Current Timestamp transform
<a name="transforms-add-current-timestamp"></a>

 The **Add Current Timestamp** transform allows you to mark the rows with the time at which the data was processed. This is useful for auditing purposes or to track latency in the data pipeline. You can add this new column as a timestamp data type or a formatted string. 
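
The effect can be sketched in plain Python; the column name `processing_timestamp` here is illustrative, not the transform's default.

```python
from datetime import datetime, timezone

# Sketch: stamp every row with the same processing time, either as a
# timestamp object or as a formatted string.
rows = [{"order_id": 1}, {"order_id": 2}]
processed_at = datetime.now(timezone.utc)
for row in rows:
    row["processing_timestamp"] = processed_at  # or processed_at.strftime("%Y-%m-%d %H:%M")
```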

**To add an Add Current Timestamp transform:**

1. Open the Resource panel and then choose **Add Current Timestamp** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent. 

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the Node parents list to use as the input source for the transform.  
![\[The screenshot shows the Transform tab for the Add Current Timestamp transform.\]](http://docs.aws.amazon.com/glue/latest/dg/images/transforms-add-current-timestamp-transform-tab.png)

1. (Optional) On the **Transform** tab, enter a custom name for the new column, and a format if you would rather the column be a formatted date string.

# Using the Pivot Rows to Columns transform
<a name="transforms-pivot-rows-to-columns"></a>

 The **Pivot Rows to Columns** transform allows you to aggregate a numeric column by rotating unique values in selected columns into new columns (if multiple columns are selected, their values are concatenated to name the new columns). Rows are consolidated into fewer rows with more columns, holding partial aggregations for each unique value. For example, suppose you have this dataset of sales by month and country (sorted to make it easier to illustrate): 


| year | month | country | amount | 
| --- | --- | --- | --- | 
| 2020 | Jan | uk | 32 | 
| 2020 | Jan | de | 42 | 
| 2020 | Jan | us | 64 | 
| 2020 | Feb | uk | 67 | 
| 2020 | Feb | de | 4 | 
| 2020 | Feb | de | 7 | 
| 2020 | Feb | us | 6 | 
| 2020 | Feb | us | 12 | 
| 2021 | Jan | us | 90 | 

 If you pivot using **amount** as the aggregated column and **country** as the pivot column, new columns are created from the unique values of the original **country** column. In the table below, you have new columns for **de**, **uk**, and **us** instead of the **country** column. 


| year | month | de | uk | us | 
| --- | --- | --- | --- | --- | 
| 2020 | Jan | 42 | 32 | 64 | 
| 2020 | Feb | 11 | 67 | 18 | 
| 2021 | Jan |  |  | 90 | 

 If instead you pivot both the month and country, you get a column for each combination of the values of those columns: 


| year | Jan\_de | Jan\_uk | Jan\_us | Feb\_de | Feb\_uk | Feb\_us | 
| --- | --- | --- | --- | --- | --- | --- | 
| 2020 | 42 | 32 | 64 | 11 | 67 | 18 | 
| 2021 |  |  | 90 |  |  |  | 
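
The pivot-with-aggregation semantics (here with a sum) can be sketched in plain Python. This is illustrative, not the Glue implementation; the `pivot` helper and its parameter names are hypothetical.

```python
from collections import defaultdict

# Sketch: group rows by the kept columns, then spread the pivot column's
# unique values into per-group columns, summing the numeric values.
def pivot(rows, group_keys, pivot_key, value_key):
    out = defaultdict(dict)
    for row in rows:
        group = tuple(row[k] for k in group_keys)
        col = row[pivot_key]
        out[group][col] = out[group].get(col, 0) + row[value_key]
    return dict(out)

rows = [
    {"year": 2020, "month": "Feb", "country": "de", "amount": 4},
    {"year": 2020, "month": "Feb", "country": "de", "amount": 7},
    {"year": 2020, "month": "Feb", "country": "uk", "amount": 67},
]
print(pivot(rows, ["year", "month"], "country", "amount"))
# {(2020, 'Feb'): {'de': 11, 'uk': 67}}
```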

**To add a Pivot Rows To Columns transform:**

1. Open the Resource panel and then choose **Pivot Rows To Columns** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the Node parents list to use as the input source for the transform.

1. On the **Transform** tab, choose the numeric column to aggregate to produce the values for the new columns, the aggregation function to apply, and the column(s) whose unique values will be converted into new columns.  
![\[The screenshot shows the Transform tab for the Pivot Rows To Columns transform.\]](http://docs.aws.amazon.com/glue/latest/dg/images/transforms-pivot-rows-to-columns-transform-tab.png)

# Using the Unpivot Columns To Rows transform
<a name="transforms-unpivot-columns-to-rows"></a>

 The **Unpivot** transform allows you to convert columns into values of new columns, generating a row for each unique value. It is the opposite of pivot, but note that it’s not equivalent, since it cannot separate rows with identical values that were aggregated, nor split combined column names back into the original columns (you can do that later using a Split transform). For example, if you have the following table: 


| year | month | de | uk | us | 
| --- | --- | --- | --- | --- | 
| 2020 | Jan | 42 | 32 | 64 | 
| 2020 | Feb | 11 | 67 | 18 | 
| 2021 | Jan |  |  | 90 | 

 You can unpivot the columns: “de”, “uk” and “us” into a column “country” with the value “amount”, and get the following (sorted here for illustration purposes): 


| year | month | country | amount | 
| --- | --- | --- | --- | 
| 2020 | Jan | uk | 32 | 
| 2020 | Jan | de | 42 | 
| 2020 | Jan | us | 64 | 
| 2020 | Feb | uk | 67 | 
| 2020 | Feb | de | 11 | 
| 2020 | Feb | us | 18 | 
| 2021 | Jan | us | 90 | 

 Notice that the columns that have a NULL value (“de” and “uk” of Jan 2021) are not generated by default. You can enable the option to include nulls to get: 


| year | month | country | amount | 
| --- | --- | --- | --- | 
| 2020 | Jan | uk | 32 | 
| 2020 | Jan | de | 42 | 
| 2020 | Jan | us | 64 | 
| 2020 | Feb | uk | 67 | 
| 2020 | Feb | de | 11 | 
| 2020 | Feb | us | 18 | 
| 2021 | Jan | us | 90 | 
| 2021 | Jan | de |  | 
| 2021 | Jan | uk |  | 
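
The unpivot semantics, including the include-nulls option, can be sketched in plain Python. This is illustrative, not the Glue implementation; the `unpivot` helper and its parameter names are hypothetical.

```python
# Sketch: turn each selected column into a (name, value) row, keeping the
# other columns; null values are skipped unless include_nulls is set.
def unpivot(rows, keep, columns, name_col, value_col, include_nulls=False):
    out = []
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None and not include_nulls:
                continue
            new_row = {k: row[k] for k in keep}
            new_row[name_col] = col
            new_row[value_col] = value
            out.append(new_row)
    return out

rows = [{"year": 2021, "month": "Jan", "de": None, "uk": None, "us": 90}]
print(unpivot(rows, ["year", "month"], ["de", "uk", "us"], "country", "amount"))
# [{'year': 2021, 'month': 'Jan', 'country': 'us', 'amount': 90}]
```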

**To add an Unpivot Columns to Rows transform:**

1. Open the Resource panel and then choose **Unpivot Columns to Rows** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the Node parents list to use as the input source for the transform.

1. On the **Transform** tab, enter the new columns to be created to hold the names and values of the columns chosen to unpivot.  
![\[The screenshot shows the Transform tab for the Unpivot Columns To Rows transform.\]](http://docs.aws.amazon.com/glue/latest/dg/images/transforms-unpivot-columns-to-rows-transform-tab.png)

# Using the Autobalance Processing transform to optimize your runtime
<a name="transforms-autobalance-processing"></a>

 The **Autobalance Processing** transform redistributes the data among the workers for better performance. This helps when the data is unbalanced, or when the way it comes from the source doesn’t allow enough parallel processing on it, which is common when the source is gzipped or is JDBC. Redistributing the data has a modest performance cost, so the optimization might not pay off if the data was already well balanced. Under the hood, the transform uses Apache Spark repartition to randomly reassign data among a number of partitions optimal for the cluster capacity; advanced users can enter a number of partitions manually. In addition, the transform can optimize the writing of partitioned tables by reorganizing the data based on specified columns, which results in more consolidated output files. 

**To add an Autobalance Processing transform:**

1. Open the Resource panel and then choose **Autobalance Processing** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the Node parents list to use as the input source for the transform.

1. (Optional) On the **Transform** tab, you can enter a number of partitions. In general, it’s recommended that you let the system decide this value; however, you can tune the multiplier or enter a specific value if you need to control it. If you are going to save the data partitioned by columns, choose the same columns as repartition columns. This minimizes the number of files in each partition and avoids having many files per partition, which would hinder the performance of the tools querying that data.  
![\[The screenshot shows the Transform tab for the Autobalance Processing transform.\]](http://docs.aws.amazon.com/glue/latest/dg/images/transforms-autobalance-processing-transform-tab.png)

# Using the Derived Column transform to combine other columns
<a name="transforms-derived-column"></a>

 The **Derived Column** transform allows you to define a new column based on a math formula or SQL expression, in which you can use other columns in the data as well as constants and literals. For instance, to derive a “percentage” column from the columns "success" and "count", you can enter the SQL expression: "success \* 100 / count \|\| '%'". 

 Example result: 


| success | count | percentage | 
| --- | --- | --- | 
| 14 | 100 | 14% | 
| 6 | 20 | 30% | 
| 3 | 40 | 7.5% | 
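
The same expression can be sketched in plain Python to make the arithmetic and string concatenation explicit; the `percentage` helper is illustrative, not part of Glue.

```python
# Sketch of "success * 100 / count || '%'": '||' is SQL string
# concatenation; ":g" formatting drops a trailing ".0".
def percentage(success, count):
    value = success * 100 / count
    return f"{value:g}%"

print(percentage(14, 100))  # 14%
print(percentage(3, 40))    # 7.5%
```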

**To add a Derived Column transform:**

1. Open the Resource panel and then choose **Derived Column** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the Node parents list to use as the input source for the transform.

1. On the **Transform** tab, enter the name of the column and the expression for its content.  
![\[The screenshot shows the Transform tab for the Derived Column transform.\]](http://docs.aws.amazon.com/glue/latest/dg/images/transforms-dervied-column-transform-tab.png)

# Using the Lookup transform to add matching data from a catalog table
<a name="transforms-lookup"></a>

 The **Lookup** transform allows you to add columns from a defined catalog table when the keys match the defined lookup columns in the data. This is equivalent to doing a left outer join between the data and the lookup table, using the matching columns as the join condition. 
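
The lookup-as-left-join semantics can be sketched in plain Python. This is illustrative, not the Glue implementation; the `lookup` helper and its parameter names are hypothetical.

```python
# Sketch: index the lookup table by its key columns, then for each data
# row copy the requested columns from the matching entry (left join:
# unmatched rows get NULL/None).
def lookup(rows, table, match, bring):
    # match maps data column -> lookup column; bring lists columns to copy
    index = {tuple(t[c] for c in match.values()): t for t in table}
    for row in rows:
        hit = index.get(tuple(row[c] for c in match.keys()))
        for col in bring:
            row[col] = hit[col] if hit else None
    return rows

users = [{"id": 7, "region": "eu", "name": "Ana"}]
data = [{"user_id": 7, "region": "eu"}, {"user_id": 9, "region": "us"}]
print(lookup(data, users, {"user_id": "id", "region": "region"}, ["name"]))
# [{'user_id': 7, 'region': 'eu', 'name': 'Ana'}, {'user_id': 9, 'region': 'us', 'name': None}]
```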

**To add a Lookup transform:**

1. Open the Resource panel and then choose **Lookup** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the Node parents list to use as the input source for the transform.

1. On the **Transform** tab, enter the fully qualified catalog table name to use to perform the lookups. For example, if your database is “mydb” and your table “mytable”, enter “mydb.mytable”. Then enter the criteria used to find a match in the lookup table. If the lookup key is composite, enter the list of key columns separated by commas. If one or more of the key columns don’t have the same name in both tables, you need to define the match mapping. 

   For example, if the data columns are “user\_id” and “region” and in the users table the corresponding columns are named “id” and “region“, then in the **Columns to match** field, enter: ”user\_id=id, region“. You could add region=region, but it’s not needed since the names are the same.

1. Finally, enter the columns to bring from the row matched in the lookup table to incorporate them into the data. If no match was found those columns will be set to NULL.
**Note**  
Under the hood, the **Lookup** transform uses a left join in order to be efficient. If the lookup table has a composite key, ensure the columns to match are set up to match all the key columns so that only one match can occur. Otherwise, multiple lookup rows will match, resulting in extra rows added for each of those matches.  
![\[The screenshot shows the Transform tab for the Lookup transform.\]](http://docs.aws.amazon.com/glue/latest/dg/images/transforms-lookup-transform-tab.png)

# Using the Explode Array or Map Into Rows transform
<a name="transforms-explode-array"></a>

 The **Explode** transform allows you to extract values from a nested structure into individual rows that are easier to manipulate. In the case of an array, the transform will generate a row for each value of the array, replicating the values for the other columns in the row. In the case of a map, the transform will generate a row for each entry with the key and value as columns plus any other columns in the row. 

 For example, if we have this dataset which has a “category” array column with multiple values. 


| product\_id | category | 
| --- | --- | 
| 1 | [sports, winter] | 
| 2 | [garden, tools] | 
| 3 | [videogames] | 
| 4 | [game, boardgame, social] | 
| 5 | [] | 

 If you explode the 'category' column into a column with the same name, you overwrite the original column. You can also select that you want NULLs included, to get the following (ordered for illustration purposes): 


| product\_id | category | 
| --- | --- | 
| 1 | sports | 
| 1 | winter | 
| 2 | garden | 
| 2 | tools | 
| 3 | videogames | 
| 4 | game | 
| 4 | boardgame | 
| 4 | social | 
| 5 |  | 
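
The array case can be sketched in plain Python (the map case additionally produces key and value columns). This is illustrative, not the Glue implementation; the `explode` helper and its `include_nulls` flag are hypothetical names mirroring the transform's options.

```python
# Sketch: generate one row per array value, replicating the other
# columns; with include_nulls, empty/NULL arrays still yield one row.
def explode(rows, column, include_nulls=False):
    out = []
    for row in rows:
        values = row.get(column) or []
        if not values and include_nulls:
            values = [None]
        for value in values:
            out.append({**row, column: value})
    return out

rows = [{"product_id": 3, "category": ["videogames"]}, {"product_id": 5, "category": []}]
result = explode(rows, "category", include_nulls=True)
print(result)
# [{'product_id': 3, 'category': 'videogames'}, {'product_id': 5, 'category': None}]
```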

**To add an Explode Array Or Map Into Rows transform:**

1. Open the Resource panel and then choose **Explode Array Or Map Into Rows** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the Node parents list to use as the input source for the transform.

1. On the **Transform** tab, choose the column to explode (it must be an array or map type). Then enter a name for the column holding the items of the array, or the names of the columns for the keys and values if you are exploding a map.

1. (Optional) On the **Transform** tab: by default, if the column to explode is NULL or has an empty structure, it is omitted from the exploded dataset. If you want to keep the row (with the new columns as NULL), check “Include NULLs”.  
![\[The screenshot shows the Transform tab for the Explode Array or Map Into Rows transform.\]](http://docs.aws.amazon.com/glue/latest/dg/images/transforms-explode-array-transform-tab.png)

# Using the Record Matching transform to invoke an existing data classification transform
<a name="transforms-record-matching"></a>

This transform invokes an existing Record Matching machine learning data classification transform.

The transform evaluates the current data against the trained model based on labels. A column "match\_id" is added to assign each row to a group of items that are considered equivalent based on the algorithm training. For more information, see [Record matching with Lake Formation FindMatches](https://docs.aws.amazon.com/glue/latest/dg/machine-learning.html).

**Note**  
The version of AWS Glue used by the visual job must match the version that AWS Glue used to create the Record Matching transform.

![\[The screenshot shows a data preview for the transform.\]](http://docs.aws.amazon.com/glue/latest/dg/images/recording-matching-transform-1.png)


**To add a Record Matching transform node to your job diagram**

1. Open the Resource panel, and then choose **Record Matching** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. In the node properties panel, you can enter a name for the node in the job diagram. If a node parent isn't already selected, choose a node from the **Node parents** list to use as the input source for the transform.

1. On the **Transform** tab, enter the ID taken from the **Machine learning transforms** page:  
![\[The screenshot shows the ID from the Machine learning transforms page.\]](http://docs.aws.amazon.com/glue/latest/dg/images/recording-matching-transform-2.png)

1. (Optional) On the **Transform** tab, you can check the option to add the confidence scores. At the cost of extra computing, the model will estimate a confidence score for each match as an additional column.

# Removing null rows
<a name="transforms-remove-null-rows"></a>

This transform removes from the dataset the rows where all columns are null. In addition, you can extend this criterion to also treat empty fields as null, keeping only rows where at least one column is non-empty.
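
The removal criteria can be sketched in plain Python. This is illustrative, not the Glue implementation; the helper names and the `extended` flag mirror the transform's option but are hypothetical.

```python
# Sketch: drop rows where every column is "empty"; in extended mode,
# empty strings, arrays and maps count as null too.
def is_empty(value, extended):
    if value is None:
        return True
    return extended and value in ("", [], {})

def remove_null_rows(rows, extended=False):
    return [r for r in rows if not all(is_empty(v, extended) for v in r.values())]

rows = [{"a": None, "b": ""}, {"a": 1, "b": None}]
print(len(remove_null_rows(rows)))                 # 2
print(len(remove_null_rows(rows, extended=True)))  # 1
```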

**To add a Remove Null Rows transform node to your job diagram**

1. Open the Resource panel, and then choose **Remove Null Rows** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. In the node properties panel, you can enter a name for the node in the job diagram. If a node parent isn't already selected, choose a node from the **Node parents** list to use as the input source for the transform.

1. (Optional) On the **Transform** tab, check the **Extended** option if you want rows to be required not only to be non-null but also non-empty; this way, empty strings, arrays, or maps are considered null for the purpose of this transform.

# Parsing a string column containing JSON data
<a name="transforms-parse-json-column"></a>

This transform parses a string column containing JSON data and converts it to a struct or an array column, depending on whether the JSON is an object or an array, respectively. Optionally, you can keep both the parsed and the original column.

The JSON schema can be provided or inferred (in the case of JSON objects), with optional sampling.

**To add a Parse JSON Column transform node to your job diagram**

1. Open the Resource panel, and then choose **Parse JSON Column** to add a new transform to your job diagram. The node that is selected when you add the transform becomes its parent.

1. In the node properties panel, you can enter a name for the node in the job diagram. If a node parent isn't already selected, choose a node from the **Node parents** list to use as the input source for the transform.

1. On the **Transform** tab, select the column containing the JSON string.

1. (Optional) On the **Transform** tab, enter the schema that the JSON data follows using SQL syntax, for instance `field1 STRING, field2 INT` in the case of an object or `ARRAY<STRING>` in the case of an array.

   In the case of an array, the schema is required. In the case of an object, if the schema is not specified, it is inferred from the data. To reduce the impact of inferring the schema (especially on a large dataset), you can avoid reading all the data twice by entering a **Ratio of samples to use to infer schema**. If the value is lower than 1, the corresponding ratio of random samples is used to infer the schema. If the data is reliable and the objects are consistent between rows, you can use a small ratio, such as 0.1, to improve performance.

1. (Optional) On the **Transform** tab, you can enter a new column name if you want to keep both the original string column and the parsed column.
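   The object-versus-array behavior can be illustrated in plain Python with the standard `json` module. This is a conceptual sketch, not the Glue API: the column is modeled as a list of strings, and the sample values are hypothetical.

   ```python
   import json

   def parse_json_column(values, keep_original=False):
       """Parse each JSON string: objects become dicts (struct-like),
       arrays become lists. Optionally keep the original string too."""
       parsed = [json.loads(v) if v is not None else None for v in values]
       if keep_original:
           # pair the original string with its parsed value
           return list(zip(values, parsed))
       return parsed

   col = ['{"field1": "a", "field2": 1}', '["x", "y"]']
   print(parse_json_column(col))
   # the first value parses to a dict, the second to a list
   ```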

# Extracting a JSON path
<a name="transforms-extract-json-path"></a>

This transform extracts new columns from a JSON string column. This transform is useful when you only need a few data elements and don't want to import the entire JSON content into the table schema.

**To add an Extract JSON Path transform node to your job diagram**

1. Open the Resource panel, and then choose **Extract JSON Path** to add a new transform to your job diagram. The node that is selected when you add the transform becomes its parent.

1. In the node properties panel, you can enter a name for the node in the job diagram. If a node parent isn't already selected, choose a node from the **Node parents** list to use as the input source for the transform.

1. On the **Transform** tab, select the column containing the JSON string. Enter one or more JSON path expressions, separated by commas, each one describing how to extract a value out of the JSON object or array. For instance, if the JSON column contains objects with properties "prop1" and "prop2", you could extract both by specifying their names: "prop1, prop2".

   If a JSON field name has special characters, you can use bracket notation; for instance, to extract the property from the JSON `{"a. a": 1}`, you can use the path `$['a. a']`. The exception is the comma, because it is reserved to separate paths. Then enter the corresponding column names for each path, separated by commas.

1. (Optional) On the **Transform** tab, you can check the option to drop the JSON column after extraction. This makes sense when you don't need the rest of the JSON data once you have extracted the parts you need.
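   The two path shapes mentioned above can be sketched in plain Python. This is an illustrative sketch, not the Glue API: the tiny evaluator below supports only the `$.name` and `$['name']` forms, and the sample data is hypothetical.

   ```python
   import json
   import re

   def extract_json_path(json_str, path):
       """Extract one top-level value by JSON path: either $.name
       or $['name'] (the bracket form allows special characters)."""
       doc = json.loads(json_str)
       m = re.fullmatch(r"\$\.(\w+)|\$\['([^']+)'\]", path)
       key = m.group(1) or m.group(2)
       return doc.get(key)

   row = '{"prop1": 10, "a. a": 1}'
   print(extract_json_path(row, "$.prop1"))     # 10
   print(extract_json_path(row, "$['a. a']"))   # 1
   ```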

# Extracting string fragments using a regular expression
<a name="transforms-regex-extractor"></a>

This transform uses a regular expression to extract string fragments from a column and creates a new column from them, or multiple columns if the regular expression uses groups.

**To add a Regex Extractor transform node to your job diagram**

1. Open the Resource panel, and then choose **Regex Extractor** to add a new transform to your job diagram. The node that is selected when you add the transform becomes its parent.

1. In the node properties panel, you can enter a name for the node in the job diagram. If a node parent isn't already selected, choose a node from the **Node parents** list to use as the input source for the transform.

1. On the **Transform** tab, enter the regular expression and the column to apply it to. Then enter the name of the new column in which to store the matching string. The new column is null only if the source column is null; if the regex doesn't match, the new column is empty.

   If the regex uses groups, there must be a corresponding column name for each group, separated by commas, but you can skip a group by leaving its column name empty.

   For example, suppose you have a column "purchase_date" containing dates in both long and short ISO formats, and you want to extract the year, month, day, and hour when available. Notice that the hour group is optional; otherwise, in rows where the hour is not available, all the extracted groups would be empty strings (because the regex wouldn't match). In this case, we don't want to extract the outer group that makes the time optional, but rather the inner hour group, so we leave the outer group's name empty and it doesn't get extracted (that group would include the `T` character).  
![\[The screenshot shows configuring a regular expression for the Regex extractor.\]](http://docs.aws.amazon.com/glue/latest/dg/images/regex-extractor-1.png)

   Resulting in the data preview:  
![\[The screenshot shows configuring a data preview for the Regex extractor.\]](http://docs.aws.amazon.com/glue/latest/dg/images/regex-extractor-2.png)
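The group-extraction and group-skipping behavior can be sketched with Python's standard `re` module. This is a conceptual sketch, not the Glue API: the pattern below is an assumed simplification of the one in the screenshot, and the sample dates are hypothetical.

```python
import re

# Year, month, day, then an optional hour. The outer group wraps "T<hour>"
# and gets no column name (None), so only the inner hour group is kept.
pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})(T(\d{2}))?")
columns = ["year", "month", "day", None, "hour"]  # None = skipped group

def extract(value):
    """Return one extracted column per named group; unmatched
    groups (and a fully unmatched row) become empty strings."""
    m = pattern.search(value)
    if m is None:
        return {c: "" for c in columns if c}
    return {c: (m.group(i + 1) or "")
            for i, c in enumerate(columns) if c}

print(extract("2023-01-15T09:30"))  # hour is extracted
print(extract("2023-01-15"))        # hour is empty; other groups still match
```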

# Creating a custom transformation
<a name="transforms-custom"></a>

If you need to perform more complicated transformations on your data, or want to add data property keys to the dataset, you can add a **Custom code** transform to your job diagram. The Custom code node allows you to enter a script that performs the transformation. 

When using custom code, you must use a schema editor to indicate the changes made to the output through the custom code. When editing the schema, you can perform the following actions:
+ Add or remove data property keys
+ Change the data type of data property keys
+ Change the name of data property keys
+ Restructure a nested property key

You must use a *SelectFromCollection* transform to choose a single `DynamicFrame` from the result of your Custom transform node before you can send the output to a target location. 

Use the following tasks to add a custom transform node to your job diagram.

## Adding a custom code transform node to the job diagram
<a name="transforms-custom-addnode"></a>

**To add a custom transform node to your job diagram**

1. (Optional) Open the Resource panel and then choose **Custom transform** to add a custom transform to your job diagram. 

1. On the **Node properties** tab, enter a name for the node in the job diagram. If a node parent is not already selected, or if you want multiple inputs for the custom transform, then choose a node from the **Node parents** list to use as the input source for the transform.

## Entering code for the custom transform node
<a name="transforms-custom-addcode"></a>

You can type or copy code into an input field. The job uses this code to perform the data transformation. You can provide a code snippet in either Python or Scala. The code should take one or more `DynamicFrames` as input and return a collection of `DynamicFrames`. 

**To enter the script for a custom transform node**

1. With the custom transform node selected in the job diagram, choose the **Transform** tab. 

1. In the text entry field under the heading **Code block**, paste or enter the code for the transformation. The code that you use must match the language specified for the job on the **Job details** tab.

   When referring to the input nodes in your code, AWS Glue Studio names the `DynamicFrames` returned by the job diagram nodes sequentially based on the order of creation. Use one of the following naming methods in your code:
   + Classic code generation – Use functional names to refer to the nodes in your job diagram.
     + Data source nodes: `DataSource0`, `DataSource1`, `DataSource2`, and so on.
     + Transform nodes: `Transform0`, `Transform1`, `Transform2`, and so on.
   + New code generation – Use the name specified on the **Node properties** tab of a node, appended with '`_node1`', '`_node2`', and so on. For example, `S3bucket_node1`, `ApplyMapping_node2`, `S3bucket_node2`, `MyCustomNodeName_node1`.

   For more information about the new code generator, see [Script code generation](job-editor-features.md#code-gen).

The following examples show the format of the code to enter in the code box:

------
#### [ Python ]

The following example takes the first `DynamicFrame` received, converts it to a `DataFrame` to apply the native filter method (keeping only records that have over 1000 votes), then converts it back to a `DynamicFrame` before returning it.

```
def FilterHighVoteCounts (glueContext, dfc) -> DynamicFrameCollection:
    df = dfc.select(list(dfc.keys())[0]).toDF()
    df_filtered = df.filter(df["vote_count"] > 1000)
    dyf_filtered = DynamicFrame.fromDF(df_filtered, glueContext, "filter_votes")
    return(DynamicFrameCollection({"CustomTransform0": dyf_filtered}, glueContext))
```

------
#### [ Scala ]

The following example takes the first `DynamicFrame` received, converts it to a `DataFrame` to apply the native filter method (keeping only records that have over 1000 votes), then converts it back to a `DynamicFrame` before returning it.

```
object FilterHighVoteCounts {
  def execute(glueContext : GlueContext, input : Seq[DynamicFrame]) : Seq[DynamicFrame] = {
    val frame = input(0).toDF()
    val filtered = DynamicFrame(frame.filter(frame("vote_count") > 1000), glueContext)
    Seq(filtered)
  }
}
```

------

## Editing the schema in a custom transform node
<a name="transforms-custom-editschema"></a>

When you use a custom transform node, AWS Glue Studio cannot automatically infer the output schemas created by the transform. You use the schema editor to describe the schema changes implemented by the custom transform code.

A custom code node can have any number of parent nodes, each providing a `DynamicFrame` as input for your custom code. A custom code node returns a collection of `DynamicFrames`. Each `DynamicFrame` that is used as input has an associated schema. You must add a schema that describes each `DynamicFrame` returned by the custom code node. 

**Note**  
 When you set your own schema on a custom transform, AWS Glue Studio does not inherit schemas from previous nodes. To update the schema, select the custom transform node, then choose the **Data preview** tab. After the preview is generated, choose **Use Preview Schema**. The schema is then replaced by the schema inferred from the preview data. 

**To edit the output schemas for a custom transform node**

1. With the custom transform node selected in the job diagram, in the node details panel, choose the **Output schema** tab. 

1. Choose **Edit** to make changes to the schema. 

   If you have nested data property keys, such as an array or object, you can choose the **Expand-Rows** icon (![\[\]](http://docs.aws.amazon.com/glue/latest/dg/images/expand-rows-icon.png)) on the top right of each schema panel to expand the list of child data property keys. After you choose this icon, it changes to the **Collapse-Rows** icon (![\[\]](http://docs.aws.amazon.com/glue/latest/dg/images/collapse-rows-icon.png)), which you can choose to collapse the list of child property keys.

1. Modify the schema using the following actions in the section on the right side of the page:
   + To rename a property key, place the cursor in the **Key** text box for the property key, then enter the new name.
   + To change the data type for a property key, use the list to choose the new data type for the property key.
   + To add a new top-level property key to the schema, choose the **Overflow** (![\[\]](http://docs.aws.amazon.com/glue/latest/dg/images/edit-schema-actions-button.png)) icon to the left of the **Cancel** button, and then choose **Add root key**.
   + To add a child property key to the schema, choose the **Add-Key** icon (![\[\]](http://docs.aws.amazon.com/glue/latest/dg/images/filter-add-icon.png)) associated with the parent key. Enter a name for the child key and choose the data type.
   + To remove a property key from the schema, choose the **Remove** icon (![\[\]](http://docs.aws.amazon.com/glue/latest/dg/images/delete-icon-black.png)) to the far right of the key name. 

1. If your custom transform code uses multiple `DynamicFrames`, you can add additional output schemas. 
   + To add a new, empty schema, choose the **Overflow** (![\[\]](http://docs.aws.amazon.com/glue/latest/dg/images/edit-schema-actions-button.png)) icon, and then choose **Add output schema**.
   + To copy an existing schema to a new output schema, make sure the schema you want to copy is displayed in the schema selector. Choose the **Overflow** (![\[\]](http://docs.aws.amazon.com/glue/latest/dg/images/edit-schema-actions-button.png)) icon, and then choose **Duplicate**.

   If you want to remove an output schema, make sure the schema you want to remove is displayed in the schema selector. Choose the **Overflow** (![\[\]](http://docs.aws.amazon.com/glue/latest/dg/images/edit-schema-actions-button.png)) icon, and then choose **Delete**.

1. Add new root keys to the new schema or edit the duplicated keys. 

1. When you are modifying the output schemas, choose the **Apply** button to save your changes and exit the schema editor.

   If you do not want to save your changes, choose the **Cancel** button.

## Configure the custom transform output
<a name="transforms-custom-output"></a>

A custom code transform returns a collection of `DynamicFrames`, even if there is only one `DynamicFrame` in the result set. 

**To process the output from a custom transform node**

1. Add a *SelectFromCollection* transform node, which has the custom transform node as its parent node. Update this transform to indicate which dataset you want to use. See [Using SelectFromCollection to choose which dataset to keep](transforms-configure-select-collection.md) for more information.

1. Add additional *SelectFromCollection* transforms to the job diagram if you want to use additional `DynamicFrames` produced by the custom transform node. 

   Consider a scenario in which you add a custom transform node to split a flight dataset into multiple datasets, but duplicate some of the identifying property keys in each output schema, such as the flight date or flight number. You add a *SelectFromCollection* transform node for each output schema, with the custom transform node as its parent.

1. (Optional) You can then use each *SelectFromCollection* transform node as input for other nodes in the job, or as a parent for a data target node.
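The collection-then-select pattern above can be sketched in plain Python. This is a conceptual sketch, not the Glue API: a dictionary stands in for the `DynamicFrameCollection`, and the dataset names and flight values are hypothetical.

```python
# A custom transform hands back a named collection of frames; a downstream
# node can consume only one of them, which is what SelectFromCollection picks.
collection = {
    "domestic_flights": [{"flight": "AA100", "date": "2023-01-15"}],
    "international_flights": [{"flight": "BA200", "date": "2023-01-15"}],
}

def select_from_collection(dfc, index):
    """Pick a single frame from the collection by position."""
    key = list(dfc.keys())[index]
    return dfc[key]

frame = select_from_collection(collection, 0)
print(frame)  # only the selected dataset flows to the next node
```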