

# Visual ETL


Data engineers, analysts, and scientists use visual ETL features to create extract, transform, and load (ETL) flows using an intuitive visual interface. With visual ETL, analytics users can discover, prepare, move, and integrate data from multiple sources. This simplifies the process of data manipulation and integration so that you can prepare data for analysis and reporting.

Visual ETL in Amazon SageMaker Unified Studio provides a drag-and-drop interface for building ETL flows and authoring flows with Amazon Q. You can connect to data sources, apply transformations, and define target destinations without writing complex code.

You can use Visual ETL to implement solutions such as:
+ Data integration from multiple sources
+ Data cleansing and normalization
+ Creating data warehouses or data lakes
+ Preparing data for machine learning models
+ Automating regular data processing tasks

Authoring flows with Visual ETL utilizes AWS Glue interactive sessions. You can use AWS Glue version 4.0, 5.0, or 5.1.

Amazon SageMaker Unified Studio supports Visual ETL in two domain types:
+ **IAM-based domains** - A domain type where users log in to Amazon SageMaker Unified Studio using federated roles, and existing IAM permissions apply. Provides access to a new Amazon SageMaker Unified Studio interface.
+ **Identity Center-based domains** - The original Amazon SageMaker Unified Studio interface and user experience, which continues to be supported and maintained.

While workflows are similar across both domain types, the steps to complete actions differ between the two interfaces. Choose the appropriate section below based on your domain type.

**Topics**
+ [Supported connectors for Visual ETL](connectors-visual-etl.md)
+ [Supported transforms for Visual ETL](visual-etl-supported-transforms.md)
+ [Visual ETL for IAM-based domains](visual-etl-iam-domains.md)
+ [Visual ETL for Identity Center-based domains](visual-etl-identity-center-domains.md)
+ [Data preview](visual-etl-data-previews.md)

# Supported connectors for Visual ETL


Visual ETL in Amazon SageMaker Unified Studio supports the connectors listed at [Data connections in the lakehouse architecture of Amazon SageMaker](https://docs.aws.amazon.com/sagemaker-lakehouse-architecture/latest/userguide/lakehouse-data-connection.html#lakehouse-data-connection-supported), in addition to Amazon S3, Amazon S3 Tables, and AWS Glue Data Catalog.

For instructions on how to add a new connection in Identity Center-based domains, see [Adding data sources in lakehouse architecture](https://docs.aws.amazon.com/sagemaker-lakehouse-architecture/latest/userguide/lakehouse-create-connection.html). For instructions on how to add a new connection in IAM-based domains, see [Data and catalog connections in IAM-based domains](data-connections-iam-based-domains.md).

# Supported transforms for Visual ETL


The following sections contain information about the supported transforms for Visual ETL in Amazon SageMaker Unified Studio:

**Topics**
+ [Add current timestamp transform](add-current-timestamp-transform.md)
+ [Aggregate transform](aggregate-transform.md)
+ [Array to columns transform](array-to-columns-transform.md)
+ [Autobalance processing transform](autobalance-processing-transform.md)
+ [Change columns transform](change-columns-transform.md)
+ [Concatenate columns transform](concatenate-columns-transform.md)
+ [Custom code transform](custom-code-transform.md)
+ [Derived column transform](derived-column-transform.md)
+ [Drop columns transform](drop-columns-transform.md)
+ [Drop duplicates transform](drop-duplicates-transform.md)
+ [Drop nulls transform](drop-null-transform.md)
+ [Explode array or map into rows transform](explode-array-map-rows-transform.md)
+ [Extract JSON path transform](extract-json-path-transform.md)
+ [Fill nulls transform](fill-null-transform.md)
+ [Filter transform](filter-transform.md)
+ [Flatten transform](flatten-transform.md)
+ [Format timestamp transform](format-timestamp-transform.md)
+ [Identifier transform](identifier-transform.md)
+ [Intersect transform](intersect-transform.md)
+ [Join transform](join-transform.md)
+ [Limit transform](limit-transform.md)
+ [Lookup transform](lookup-transform.md)
+ [Parse JSON column transform](parse-json-transform.md)
+ [Pivot rows into columns transform](pivot-transform.md)
+ [Regex extract transform](regex-extract-transform.md)
+ [Rename columns transform](rename-columns-transform.md)
+ [Select columns transform](select-columns-transform.md)
+ [Split string transform](split-string-transform.md)
+ [SQL query transform](sql-query-transform.md)
+ [To timestamp transform](to-timestamp-transform.md)
+ [Union transform](union-transform.md)
+ [Unpivot columns into rows transform](unpivot-transform.md)
+ [UUID transform](uuid-transform.md)

# Add current timestamp transform


The Add Current Timestamp transform marks each row with the time at which the data was processed. This is useful for auditing or for tracking latency in the data pipeline. You can add the new column as a timestamp data type or as a formatted string.

**To add an Add Current Timestamp transform:**

1. Navigate to your Visual ETL job in Amazon SageMaker Unified Studio.

1. Choose the plus icon to open the **Add nodes** menu.

1. Under **Transforms**, choose **Add Current Timestamp**.

1. Select the diagram to add the node to your Visual ETL job.

1. Select the node on the diagram to view details about the transform.

1. Under **Timestamp column name**, enter a custom name for the new column.

1. (Optional) Under **Timestamp format**, enter a format if you prefer the column to be a formatted date string.
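The steps above can be sketched in plain Python to show the transform's semantics. This is an illustrative sketch, not the Spark code Visual ETL generates; the function name `add_current_timestamp` is hypothetical:

```python
from datetime import datetime, timezone

def add_current_timestamp(rows, column_name, fmt=None):
    """Append a processing-time column to each row (a dict).

    With fmt, the value is a formatted string; without it, a datetime
    object -- mirroring the timestamp-or-formatted-string choice above."""
    now = datetime.now(timezone.utc)
    value = now.strftime(fmt) if fmt else now
    return [{**row, column_name: value} for row in rows]

rows = [{"id": 1}, {"id": 2}]
out = add_current_timestamp(rows, "processed_at", "%Y-%m-%d")
```

Every row in a run receives the same processing time, which is what makes the column useful for correlating rows with a particular pipeline run.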

# Aggregate transform


Use the Aggregate transform to perform summary calculations on selected fields.

**To use the Aggregate transform**

1. Add the Aggregate node to the visual job diagram.

1. (Optional) Click on the rename node icon to enter a new name for the node in the job diagram.

1. (Optional) On the Node properties view, choose fields to group by from the drop-down list. You can select more than one field at a time, or search for a field name by typing in the search bar.

   When fields are selected, the name and data type are shown. To remove a field, choose 'X' on the field.

1. Choose Aggregate another column. You must select at least one field.

1. Choose a field in the Field to aggregate drop-down.

1. Choose the aggregation function to apply to the chosen field:
   + avg - calculates the average
   + count - calculates the number of non-null values
   + max - returns the highest value that satisfies the 'group by' criteria
   + min - returns the lowest value that satisfies the 'group by' criteria
   + sum - the sum of all values in the group  
![\[The Amazon SageMaker Unified Studio UI showing the configuration options for the Aggregate transform.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/vis-etl/vis-etl-aggregate.png)
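The group-by and aggregation behavior above can be sketched in plain Python. This is a minimal sketch of the semantics, not the Spark implementation; the `aggregate` helper is illustrative:

```python
from collections import defaultdict

def aggregate(rows, group_by, field, func):
    """Group rows (dicts) by the group_by keys and aggregate one field.
    func is one of: avg, count, max, min, sum."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[k] for k in group_by)
        if row.get(field) is not None:  # count skips null values, as described above
            groups[key].append(row[field])
    funcs = {
        "avg": lambda v: sum(v) / len(v),
        "count": len,
        "max": max,
        "min": min,
        "sum": sum,
    }
    return {key: funcs[func](vals) for key, vals in groups.items()}

sales = [
    {"region": "east", "amount": 10},
    {"region": "east", "amount": 30},
    {"region": "west", "amount": 5},
]
totals = aggregate(sales, ["region"], "amount", "sum")
# totals == {("east",): 40, ("west",): 5}
```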

# Array to columns transform


The Array To Columns transform allows you to extract some or all of the elements of an array-type column into new columns. The transform fills the new columns when the array has enough values to extract, optionally taking the elements at the positions you specify.

For instance, if you have an array column "subnet", which was the result of applying the Split String transform to an IPv4 subnet, you can extract the first and fourth positions into new columns "first_octet" and "fourth_octet". The output of the transform in this example would be (notice the last two rows have shorter arrays than expected):


| subnet | first_octet | fourth_octet | 
| --- | --- | --- | 
| [54, 240, 197, 238] | 54 | 238 | 
| [192, 168, 0, 1] | 192 | 1 | 
| [192, 168] | 192 |  | 
| [] |  |  | 
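The subnet example above can be sketched in plain Python; missing positions are filled with None, matching the short-array rows in the table. This is an illustrative sketch, not the product's implementation:

```python
def array_to_columns(rows, array_col, out_cols, indexes=None):
    """Extract array elements into new columns; positions past the end
    of a short array become None, as in the subnet example."""
    idx = indexes if indexes is not None else list(range(len(out_cols)))
    result = []
    for row in rows:
        arr = row[array_col]
        new = dict(row)
        for name, i in zip(out_cols, idx):
            new[name] = arr[i] if i < len(arr) else None
        result.append(new)
    return result

rows = [{"subnet": [54, 240, 197, 238]}, {"subnet": [192, 168]}]
out = array_to_columns(rows, "subnet", ["first_octet", "fourth_octet"], [0, 3])
# out[0]["fourth_octet"] == 238; out[1]["fourth_octet"] is None
```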

**To add an Array to Columns transform:**

1. Navigate to your visual ETL job in Amazon SageMaker Unified Studio.

1. Choose the plus icon to open the **Add nodes** menu.

1. Under **Transforms**, choose **Array to Columns**.

1. Select the diagram to add the node to your visual ETL job.

1. Select the node on the diagram to view details about the transform.

1. Under **Array type column**, choose the column of type array from which the new columns are extracted.

1. Under **Output columns**, enter names for the output columns.

1. (Optional) Under **Array indexes to use**, enter numbers to indicate which columns to include.

# Autobalance processing transform


Use this transform to repartition data to optimize future cluster resource usage. This transform is particularly useful for uneven datasets.

**To add an Autobalance Processing transform:**

1. Navigate to your visual ETL job in Amazon SageMaker Unified Studio.

1. Choose the plus icon to open the **Add nodes** menu.

1. Under **Transforms**, choose **Autobalance Processing**.

1. Select the diagram to add the node to your visual ETL job.

1. Select the node on the diagram to view details about the transform.

1. Under **Number of partitions**, input a number of partitions to randomly distribute the data into. Or, switch the toggle to off to use the number of cores as the partition number.

1. (Optional) Under **Repartition columns**, identify columns whose rows with the same value should be assigned to the same partition.

# Change columns transform


A Change columns transform remaps the source data property keys into the desired configuration for the target data. In a Change Columns transform node, you can:

1. Change the name of multiple columns.

1. Change the data type of the columns, if the new data type is supported and there is a transformation path between the two data types.

1. Use a custom expression to change the values of selected columns.

**To add a Change Schema node to your job diagram**

1. Open the menu and then choose Change Columns to add a new transform to your job diagram, if needed.

1. (Optional) Click on the rename node icon to enter a new name for the node in the job diagram.

1. Modify the input schema by clicking on the node, then:

   1. To rename a column, enter the new name in the Target column field.

   1. To change the data type for a selected column, select "Add new type cast", then choose the source column, target column, and new data type from the Target type list.

   1. To change the data values for a selected column, add a custom expression.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the Data preview tab in the node details panel.

![\[The Amazon SageMaker Unified Studio UI showing the configuration options for the Change Columns transform.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/vis-etl/vis-etl-change-columns.png)


# Concatenate columns transform


The Concatenate transform allows you to build a new string column using the values of other columns with an optional spacer. For example, if we define a concatenated column “date” as the concatenation of “year”, “month” and “day” (in that order) with “-” as the spacer, we would get:


| day | month | year | date | 
| --- | --- | --- | --- | 
| 01 | 01 | 2020 | 2020-01-01 | 
| 02 | 01 | 2020 | 2020-01-02 | 
| 03 | 01 | 2020 | 2020-01-03 | 
| 04 | 01 | 2020 | 2020-01-04 | 

**To add a Concatenate Columns transform:**

1. Navigate to your visual ETL job in Amazon SageMaker Unified Studio.

1. Choose the plus icon to open the **Add nodes** menu.

1. Under **Transforms**, choose **Concatenate Columns**.

1. Select the diagram to add the node to your visual ETL job.

1. Select the node on the diagram to view details about the transform.

1. Under **Concatenated column**, enter the name of a new column that will be generated.

1. Under **Columns**, select the input columns.

1. (Optional) Under **Spacer**, enter a string to place between concatenated fields.

1. (Optional) Under **Null value**, enter the string to use when a column value is null.
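The date example above can be sketched in plain Python, including the optional spacer and null-value settings. This is a minimal sketch of the semantics; `concatenate_columns` is an illustrative name:

```python
def concatenate_columns(rows, new_col, columns, spacer="", null_value=""):
    """Build a string column by joining the values of other columns,
    in order, with an optional spacer; nulls become null_value."""
    out = []
    for row in rows:
        parts = [str(row[c]) if row.get(c) is not None else null_value
                 for c in columns]
        out.append({**row, new_col: spacer.join(parts)})
    return out

rows = [{"year": "2020", "month": "01", "day": "01"}]
out = concatenate_columns(rows, "date", ["year", "month", "day"], spacer="-")
# out[0]["date"] == "2020-01-01"
```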

# Custom code transform


If you need to perform more complicated transformations on your data, or want to add data property keys to the dataset, you can add a Custom code transform to your job diagram. The Custom code node allows you to enter a script that performs the transformation.

**To add a custom code node to your job diagram**

1. Open the Resource panel and then choose Custom Code to add a custom transform to your job diagram.

1. (Optional) Click on the rename node icon to enter a new name for the node in the job diagram.

1. Enter desired code changes.

The following examples show the format of the code to enter in the code box:

```
def FilterPopulationAbove1000(input_df):
  df = input_df
  df = df[df['population'] > 1000]
  return df
```

![\[The Amazon SageMaker Unified Studio UI showing the configuration options for the Custom Code transform.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/vis-etl/vis-etl-customcode.png)


# Derived column transform


The Derived Column transform allows you to define a new column based on a math formula or SQL expression in which you can use other columns in the data, as well as constants and literals. For instance, to derive a "percentage" column from the columns "success" and "count", you can enter the SQL expression: "success * 100 / count || '%'".

Example result:


| success | count | percentage | 
| --- | --- | --- | 
| 14 | 100 | 14% | 
| 6 | 20 | 30% | 
| 3 | 40 | 7.5% | 

**To add a Derived Column transform:**

1. Navigate to your visual ETL job in Amazon SageMaker Unified Studio.

1. Choose the plus icon to open the **Add nodes** menu.

1. Under **Transforms**, choose **Derived Column**.

1. Select the diagram to add the node to your visual ETL job.

1. Select the node on the diagram to view details about the transform.

1. Under **Name of derived column**, enter the name of a new column that will be generated.

1. Under **Column expression**, enter a SQL expression to define the new column based on existing columns.
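The percentage expression above can be reproduced in plain Python to show what the SQL expression computes. This is an illustrative sketch, not how Visual ETL evaluates the expression:

```python
def derive_percentage(rows):
    """Pure-Python equivalent of the example SQL expression
    success * 100 / count || '%'."""
    out = []
    for row in rows:
        pct = row["success"] * 100 / row["count"]
        # %g drops a trailing .0 so 14.0 renders as "14", matching the table
        out.append({**row, "percentage": f"{pct:g}%"})
    return out

rows = [{"success": 14, "count": 100}, {"success": 3, "count": 40}]
# derive_percentage(rows)[0]["percentage"] == "14%"
```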

# Drop columns transform


You can create a subset of data property keys from the dataset using the Drop columns transform. You indicate which data property keys you want to remove from the dataset and the rest of the keys are retained.

**To add a Drop columns transform node to your job diagram**

1. Open the Resource panel and then choose Drop columns to add a new transform to your diagram.

1. (Optional) Click on the rename node icon to enter a new name for the node in the job diagram.

1. Click on the node and view the Node properties panel.

1. Choose the data property keys from the "Field" column to drop the column from the data source.

1. (Optional) After configuring the node properties and transform properties, you can preview using the Data preview tab in job diagram.

![\[The Amazon SageMaker Unified Studio UI showing the configuration options for the Drop columns transform.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/vis-etl/vis-etl-drop-cols.png)


# Drop duplicates transform


The Drop duplicates transform removes duplicate rows from your data source, with two options. You can remove rows that are completely identical, or you can choose specific fields to match and remove only rows that duplicate those fields.

For example, in this data set, you have duplicate rows where all the values in some of the rows are exactly the same as another row, and some of the values in rows are the same or different.


**Example Data Set**  

| Row | Name | Email | Age | State | Note | 
| --- | --- | --- | --- | --- | --- | 
| 1 | Joy | joy@gmail | 33 | NY |  | 
| 2 | Tim | tim@gmail | 45 | OH |  | 
| 3 | Rose | rose@gmail | 23 | NJ |  | 
| 4 | Tim | tim@gmail | 42 | OH |  | 
| 5 | Rose | rose@gmail | 23 | NJ |  | 
| 6 | Tim | tim@gmail | 42 | OH | This is a duplicate row and matches completely on all values as row 4 | 
| 7 | Rose | rose@gmail | 23 | NJ | This is a duplicate row and matches completely on all values as row 5 | 

If you choose to match entire rows, rows 6 and 7 will be removed from the data set. The data set is now:


**Data Set After Matching Entire Rows**  

| Row | Name | Email | Age | State | 
| --- | --- | --- | --- | --- | 
| 1 | Joy | joy@gmail | 33 | NY | 
| 2 | Tim | tim@gmail | 45 | OH | 
| 3 | Rose | rose@gmail | 23 | NJ | 
| 4 | Tim | tim@gmail | 42 | OH | 
| 5 | Rose | rose@gmail | 23 | NJ | 

If you choose to specify keys, you can remove rows that match on 'name' and 'email'. This gives you finer control over what counts as a 'duplicate row' for your data set. By specifying 'name' and 'email', the data set is now:


**Data Set After Specifying Keys**  

| Row | Name | Email | Age | State | 
| --- | --- | --- | --- | --- | 
| 1 | Joy | joy@gmail | 33 | NY | 
| 2 | Tim | tim@gmail | 45 | OH | 
| 3 | Rose | rose@gmail | 23 | NJ | 

Some things to keep in mind:
+ For rows to be recognized as duplicates, values must match exactly, including case. This applies to both options (Match entire rows or Specify keys).
+ All values are read in as strings.
+ The Drop duplicates transform utilizes the Spark dropDuplicates command.
+ When using the Drop duplicates transform, the first row is kept and other rows are dropped.
+ The Drop duplicates transform does not change the schema of the dataframe.
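The dropDuplicates behavior described above (first row kept, case-sensitive matching, whole rows or chosen keys) can be sketched in plain Python. This is a sketch of the semantics, not the Spark `dropDuplicates` implementation itself:

```python
def drop_duplicates(rows, keys=None):
    """Keep the first row for each distinct key, as described above.
    With keys=None, entire rows must match; otherwise only the named fields."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[k] for k in keys) if keys else tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [
    {"name": "Tim", "email": "tim@gmail", "age": 45},
    {"name": "Tim", "email": "tim@gmail", "age": 42},
]
# Matching entire rows keeps both (ages differ); matching on
# name + email keeps only the first row.
```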

**To add a Drop duplicates transform node to your job diagram**

1. Open the Resource panel and then choose Drop duplicates to add a new transform to your diagram.

1. (Optional) Click on the rename node icon to enter a new name for the node in the job diagram.

1. Click on the node and view the Node properties panel.

1. Choose if you prefer to drop duplicates by matching entire rows or specific keys.

1. (Optional) After configuring the node properties and transform properties, you can preview using the Data preview tab in job diagram.

![\[The Amazon SageMaker Unified Studio UI showing the configuration options for the Drop duplicates transform.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/vis-etl/vis-etl-drop-dupes.png)


# Drop nulls transform


Use the Drop nulls transform to remove fields from the dataset if all values in the field are 'null'. By default, Visual ETL recognizes null objects, but some values, such as empty strings, the string "null", -1 integers, or other placeholders such as zeros, are not automatically recognized as nulls.

**To use the Drop nulls transform**

1. Add a Drop nulls node to the diagram, if needed.

1. Set a threshold value if rows should be dropped only when a certain number of null values is present.

1. Choose whether to drop nulls for all columns or to check null values only for specific columns. If you choose the subset option, select the desired column names from the drop-down list.

![\[The Amazon SageMaker Unified Studio UI showing the configuration options for the Drop nulls transform.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/vis-etl/vis-etl-dropnull.png)


# Explode array or map into rows transform


You can use the Explode transform to extract values from a nested structure into individual rows that are easier to manipulate. In the case of an array, the transform will generate a row for each value of the array, replicating the values for the other columns in the row. In the case of a map, the transform will generate a row for each entry with the key and value as columns plus any other columns in the row.

For example, if we have this dataset which has a “category” array column with multiple values.


| product_id | category | 
| --- | --- | 
| 1 | [sports, winter] | 
| 2 | [garden, tools] | 
| 3 | [videos] | 
| 4 | [games, electronics, social] | 
| 5 | [] | 

If you explode the 'category' column into a column with the same name, you overwrite the column. You can choose to include NULLs to get the following (ordered for illustration purposes):


| product_id | category | 
| --- | --- | 
| 1 | sports | 
| 1 | winter | 
| 2 | garden | 
| 2 | tools | 
| 3 | videos | 
| 4 | games | 
| 4 | electronics | 
| 4 | social | 
| 5 |  | 

**To add an Explode Array Or Map Into Rows transform:**

1. Navigate to your visual ETL job in Amazon SageMaker Unified Studio.

1. Choose the plus icon to open the **Add nodes** menu.

1. Under **Transforms**, choose **Explode Array or Map Into Rows**.

1. Select the diagram to add the node to your visual ETL job.

1. Select the node on the diagram to view details about the transform.

1. Under **Column to explode**, enter the name of an existing column that will be exploded.

1. Under **New column name**, enter the name of a new column that will be generated.

1. (Optional) Under **Values column**, if you are exploding a dictionary, specify a name for a column to contain the values.

1. Use the toggle to indicate whether or not to include null values. By default, if the column to explode is NULL or has an empty structure, it will be omitted on the exploded dataset.
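The array case above can be sketched in plain Python, including the include-nulls toggle. This is an illustrative sketch of the semantics, not the product's implementation:

```python
def explode(rows, column, include_nulls=False):
    """Emit one output row per array element, replicating the other
    columns. Empty or None arrays are dropped unless include_nulls is
    set, in which case a row with None is emitted instead."""
    out = []
    for row in rows:
        values = row.get(column) or []
        if not values and include_nulls:
            out.append({**row, column: None})
        for v in values:
            out.append({**row, column: v})
    return out

rows = [{"product_id": 4, "category": ["games", "electronics"]},
        {"product_id": 5, "category": []}]
# explode(rows, "category") yields 2 rows; include_nulls=True yields 3
```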

# Extract JSON path transform


Use this transform to extract new columns from a string column with JSON data.

**To add an Extract JSON path transform:**

1. Navigate to your visual ETL job in Amazon SageMaker Unified Studio.

1. Choose the plus icon to open the **Add nodes** menu.

1. Under **Transforms**, choose **Extract JSON Path**.

1. Select the diagram to add the node to your visual ETL job.

1. Connect the transform node to a data source node.

1. Select the node on the diagram to view details about the transform.

1. Under **JSON column**, select a column from the data source.

1. Under **JSON paths**, enter a JSON path or multiple paths separated by commas to specify the location to extract the data from.

1. Under **Output columns**, enter the names of the columns to create from each of the JSON paths.

1. (Optional) If desired, use the toggle to drop the JSON column.
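The steps above can be sketched with the standard `json` module, using simple dotted paths. This is a minimal sketch, not the transform's actual JSON-path engine; `extract_json_path` and the `payload` column are illustrative:

```python
import json

def extract_json_path(rows, json_col, paths, out_cols, drop_source=False):
    """Extract dotted paths (e.g. "user.name") from a JSON string column
    into new columns, optionally dropping the source column."""
    out = []
    for row in rows:
        data = json.loads(row[json_col])
        new = dict(row)
        for path, name in zip(paths, out_cols):
            value = data
            for part in path.split("."):
                value = value.get(part) if isinstance(value, dict) else None
            new[name] = value
        if drop_source:
            del new[json_col]
        out.append(new)
    return out

rows = [{"payload": '{"user": {"name": "Rose", "state": "NJ"}}'}]
out = extract_json_path(rows, "payload", ["user.name"], ["user_name"])
# out[0]["user_name"] == "Rose"
```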

# Fill nulls transform


Fill nulls allows you to fill in null values in a column with a chosen fill value. Select a column with nulls that you want to fill and provide a fill value. You can use multiple fill conditions at one time. To consider only a subset of the data, use the drop-down menu to select columns.

**To add a Fill nulls node to your job diagram**

1. Open the menu and then choose Fill nulls to add a new transform to your job diagram.

1. (Optional) Click on the rename node icon to enter a new name for the node in the job diagram.

1. Modify the input schema:

   1. To create a subset, use the dropdown menu and select a column or columns to consider.

   1. To specify a value to fill nulls with, select "Add new fill null value", then choose a column to fill and provide a Fill value.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the Data preview tab in the node details panel.

![\[The Amazon SageMaker Unified Studio UI showing the configuration options for the Fill nulls transform.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/vis-etl/vis-etl-fill-null.png)
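The per-column fill behavior can be sketched in plain Python. This is an illustrative sketch of the semantics, not the Spark implementation; `fill_nulls` is a hypothetical name:

```python
def fill_nulls(rows, fills):
    """Replace None values per column; fills maps column name -> fill value.
    Columns without a fill condition are left untouched."""
    return [
        {k: (fills[k] if v is None and k in fills else v) for k, v in row.items()}
        for row in rows
    ]

rows = [{"name": "Joy", "state": None}, {"name": None, "state": "OH"}]
out = fill_nulls(rows, {"state": "unknown"})
# out[0]["state"] == "unknown"; out[1]["name"] stays None (no fill given)
```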


# Filter transform


Use the Filter transform to create a new dataset by filtering records from the input dataset. Rows that don't satisfy the filter condition are removed from the output. You can select from two filter types: Global AND or Global OR. You must select a column name to serve as the key, a comparison operation, and provide a value to filter on.

**To add a Filter node to your job diagram**

1. Open the menu and then choose Filter to add a new transform to your job diagram.

1. (Optional) Click on the rename node icon to enter a new name for the node in the job diagram.

1. Modify the input schema:

   1. Select "Add new filter condition".

   1. Choose a filter type: "Global AND" or "Global OR".

   1. Select a Key column to filter.

   1. Select a comparison operation.

   1. Type in a value to compare in the "Value" box.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the Data preview tab in the node details panel.

![\[The Amazon SageMaker Unified Studio UI showing the configuration options for the Filter transform.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/vis-etl/vis-etl-filter.png)


# Flatten transform


You can flatten the fields of nested structs in the data so they become top level fields. The new fields are named using the field name prefixed with the names of the struct fields to reach it, separated by dots.

For example, suppose the data has a field of type Struct named "phone_numbers", which among other fields has one of type Struct named "home_phone" with two fields: "country_code" and "number". After they are flattened, these two fields become top-level fields named "phone_numbers.home_phone.country_code" and "phone_numbers.home_phone.number" respectively.
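That flattening can be sketched as a recursive walk over nested dicts. This is a minimal sketch of the naming scheme, not the product's implementation; the field names come from the example above:

```python
def flatten(record, separator="."):
    """Lift nested dict fields to the top level, joining the path to each
    leaf with the separator, e.g. phone_numbers.home_phone.number."""
    out = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for subkey, subvalue in flatten(value, separator).items():
                out[f"{key}{separator}{subkey}"] = subvalue
        else:
            out[key] = value
    return out

rec = {"phone_numbers": {"home_phone": {"country_code": "1", "number": "5550100"}}}
# flatten(rec) == {"phone_numbers.home_phone.country_code": "1",
#                  "phone_numbers.home_phone.number": "5550100"}
```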

**To add a Flatten transform:**

1. Navigate to your visual ETL job in Amazon SageMaker Unified Studio.

1. Choose the plus icon to open the **Add nodes** menu.

1. Under **Transforms**, choose **Flatten**.

1. Select the diagram to add the node to your visual ETL job.

1. Select the node on the diagram to view details about the transform.

1. Under **Max levels to flatten**, enter the maximum number of levels to flatten.

1. Under **Separator**, enter a character to use in the new column names to separate the different levels.

# Format timestamp transform


Use this transform to convert a timestamp column into a formatted string.

**To add a Format Timestamp transform:**

1. Navigate to your visual ETL job in Amazon SageMaker Unified Studio.

1. Choose the plus icon to open the **Add nodes** menu.

1. Under **Transforms**, choose **Format Timestamp**.

1. Select the diagram to add the node to your visual ETL job.

1. Select the node on the diagram to view details about the transform.

1. Under **Column**, choose the column that contains the timestamps.

1. Under **Timestamp format**, input the format you want to use for the strings.

1. (Optional) Enter a name for the new formatted column.

# Identifier transform


Use this transform to assign a numeric identifier for each row in the dataset.

**To add an Identifier transform:**

1. Navigate to your visual ETL job in Amazon SageMaker Unified Studio.

1. Choose the plus icon to open the **Add nodes** menu.

1. Under **Transforms**, choose **Identifier**.

1. Select the diagram to add the node to your visual ETL job.

1. Connect the transform node to a data source node.

1. Select the node on the diagram to view details about the transform.

1. Under **Column name**, enter a name for the new column. This column is named id by default.

1. (Optional) If you want to make the ID unique among the job runs, switch the toggle to on to include a timestamp and extend the column type from long to decimal.

# Intersect transform


Use this transform to select only the common rows in two datasets. Duplicates are removed by default, and you can choose to keep them in by checking the **Intersect all** checkbox.

**To add an Intersect transform:**

1. Navigate to your visual ETL job in Amazon SageMaker Unified Studio.

1. Choose the plus icon to open the **Add nodes** menu.

1. Under **Transforms**, choose **Intersect**.

1. Select the diagram to add the node to your visual ETL job.

1. Select the node on the diagram to view details about the transform.

1. (Optional) If you want to intersect duplicated rows as well, check the **Intersect all** checkbox.

# Join transform


The Join transform allows you to combine two datasets into one. You specify the key names in the schema of each dataset to compare. The output frame contains rows where the keys meet the join condition. The rows from each dataset that meet the join condition are combined into a single output row that contains all the columns found in either dataset. You can select one of the following join types: Inner, Left, Right, Full, Cross, Semi, and Anti.

**To add a Join node to your job diagram**

1. Open the menu and then choose Join to add a new transform to your job diagram, if needed.

1. (Optional) Click on the rename node icon to enter a new name for the node in the job diagram.

1. Ensure that two data sources are connected to the Join node.

1. Modify the input schema:

   1. Select a join type from the "Join type" drop-down menu.

   1. Select a column for the "Left data source" using the dropdown menu.

   1. Select a column for the "Right data source" using the dropdown menu.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the Data preview tab in the node details panel.

![\[The Amazon SageMaker Unified Studio UI showing the configuration options for the Join transform.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/vis-etl/vis-etl-join.png)


**Note**  
If you have columns using the same names in your data sources, the join will result in a COLUMN_ALREADY_EXISTS error. To avoid this error, either: (1) use Rename columns before the join for one of your data sources or (2) use Drop Columns after the join to remove both duplicated columns.
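An inner join's key-matching behavior can be sketched in plain Python. This is an illustrative sketch of the semantics only, not Spark's join; `inner_join` and the sample columns are hypothetical. Note how the output row carries columns from both sides, which is why clashing names must be renamed or dropped as the note above describes:

```python
def inner_join(left, right, left_key, right_key):
    """Pair up rows whose key values match; each output row merges the
    columns of both inputs."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    out = []
    for lrow in left:
        for rrow in index.get(lrow[left_key], []):
            out.append({**lrow, **rrow})
    return out

people = [{"name": "Tim", "state": "OH"}]
states = [{"state": "OH", "capital": "Columbus"}]
# inner_join(people, states, "state", "state")[0]["capital"] == "Columbus"
```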

# Limit transform


Use this transform to limit the number of rows in a dataset.

**To add a Limit transform:**

1. Navigate to your visual ETL job in Amazon SageMaker Unified Studio.

1. Choose the plus icon to open the **Add nodes** menu.

1. Under **Transforms**, choose **Limit**.

1. Select the diagram to add the node to your visual ETL job.

1. Select the node on the diagram to view details about the transform.

1. Under **Number of records**, enter the maximum number of rows you want to see or process from the dataset.

# Lookup transform


Use this transform to add matching columns by looking them up on another catalog table.

**To add a Lookup transform:**

1. Navigate to your visual ETL job in Amazon SageMaker Unified Studio.

1. Choose the plus icon to open the **Add nodes** menu.

1. Under **Transforms**, choose **Lookup**.

1. Select the diagram to add the node to your visual ETL job.

1. Connect the transform node to a data source node.

1. Select the node on the diagram to view details about the transform.

1. Under **Catalog**, select the catalog to use for lookup.

1. Under **Database**, select a database.

1. Under **Table**, select a table.

1. Under **Lookup key columns to match**, enter columns in the lookup table, separated by commas.

1. Under **Lookup columns to take**, enter columns in the lookup table to add to the data when a match is found.

# Parse JSON column transform


Use this transform to parse a string column that contains JSON data.

**To add a Parse JSON Column transform:**

1. Navigate to your visual ETL job in Amazon SageMaker Unified Studio.

1. Choose the plus icon to open the **Add nodes** menu.

1. Under **Transforms**, choose **Parse JSON Column**.

1. Select the diagram to add the node to your visual ETL job.

1. Connect the transform node to a data source node.

1. Select the node on the diagram to view details about the transform.

1. Under **JSON column**, choose the column that contains JSON data.

1. Enter a number to use as a ratio of samples to use for inferring schema. This must be a value between 0.1 and 1, with 1 meaning all samples are used. If your JSON strings are consistent, you can use a lower value to speed up the inference.

1. (Optional) Enter a name for the new column.

# Pivot rows into columns transform


Use this transform to aggregate the values of a numeric column into categories defined by the values of the columns you configure.

**To add a Pivot Rows Into Columns transform:**

1. Navigate to your visual ETL job in Amazon SageMaker Unified Studio.

1. Choose the plus icon to open the **Add nodes** menu.

1. Under **Transforms**, choose **Pivot Rows Into Columns**.

1. Select the diagram to add the node to your visual ETL job.

1. Connect the transform node to a data source node.

1. Select the node on the diagram to view details about the transform.

1. Under **Aggregation column**, select a numeric column to use.

1. Under **Aggregation**, select a Spark function to apply to the aggregation column.

1. Under **Columns to convert**, select columns whose values will become new columns.
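
A simplified sketch of the pivot logic, assuming a single hypothetical grouping column and `sum` as the aggregation (the real transform runs the chosen Spark function over the configured columns):

```python
from collections import defaultdict

def pivot_rows(dataset, key_column, pivot_column, agg_column):
    """Group by key_column; each distinct value of pivot_column becomes a
    new column holding the sum of agg_column for that group."""
    table = defaultdict(dict)
    for row in dataset:
        cols = table[row[key_column]]
        cols[row[pivot_column]] = cols.get(row[pivot_column], 0) + row[agg_column]
    return [{key_column: key, **cols} for key, cols in table.items()]

sales = [
    {"store": "A", "quarter": "Q1", "amount": 100},
    {"store": "A", "quarter": "Q2", "amount": 150},
    {"store": "B", "quarter": "Q1", "amount": 200},
]
wide = pivot_rows(sales, "store", "quarter", "amount")
```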

# Regex extract transform


Use this transform to extract new columns from a string column by applying a regular expression.

**To add a Regex Extract transform:**

1. Navigate to your visual ETL job in Amazon SageMaker Unified Studio.

1. Choose the plus icon to open the **Add nodes** menu.

1. Under **Transforms**, choose **Regex Extract**.

1. Select the diagram to add the node to your visual ETL job.

1. Select the node on the diagram to view details about the transform.

1. Under **Column to extract from**, choose a column from the data source you connected to the transform.

1. Under **Regular expression**, enter a regular expression to apply to the column. If you extract multiple columns, the expression must have one capture group per extracted column.

1. Under **Extracted column**, enter the name of the column that will contain the extracted value.
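
The extraction can be illustrated with Python's `re` module (the column names, pattern, and sample data are hypothetical):

```python
import re

def regex_extract(dataset, column, pattern, extracted_column):
    """Apply the pattern to each value; capture group 1 becomes the new column."""
    out = []
    for row in dataset:
        match = re.search(pattern, row[column])
        out.append({**row, extracted_column: match.group(1) if match else None})
    return out

logs = [{"line": "user=alice status=200"}, {"line": "user=bob status=404"}]
extracted = regex_extract(logs, "line", r"user=(\w+)", "user")
```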

# Rename columns transform


You can use the Rename columns transform to change the name of multiple columns in the dataset. You select a column that you want to rename from a list of available columns and provide the string that you want to update the column name to.

**To add a Rename columns node to your job diagram**

1. Open the menu and then choose Rename columns to add a new transform to your job diagram, if needed.

1. (Optional) Click on the rename node icon to enter a new name for the node in the job diagram.

1. Modify the input schema:

   1. Select "Add new rename pair".

   1. Select the current column to rename from the "Current name" column.

   1. Provide a name for the column in the "New name" box.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the Data preview tab in the node details panel.

![\[The Amazon SageMaker Unified Studio UI showing the configuration options for the Rename transform.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/vis-etl/vis-etl-renamecolumns.png)
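
The rename pairs you configure behave like a mapping from current name to new name; a minimal plain-Python sketch (sample data is hypothetical):

```python
def rename_columns(dataset, rename_pairs):
    """rename_pairs maps each current name to its new name; columns not
    listed keep their original names."""
    return [{rename_pairs.get(k, k): v for k, v in row.items()} for row in dataset]

people = [{"fname": "Ada", "years": 36}]
renamed = rename_columns(people, {"fname": "first_name", "years": "age"})
```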


# Select columns transform


You can create a subset of columns in a dataset using the Select columns transform. You indicate which columns you want to keep and the rest are removed from the dataset.

**To add a Select columns node to your job diagram**

1. Open the menu and then choose Select columns to add a new transform to your job diagram, if needed.

1. (Optional) Click on the rename node icon to enter a new name for the node in the job diagram.

1. Modify the input schema:

   1. Select the columns you want to keep in the output frame by checking the corresponding box under the "Field" option.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the Data preview tab in the node details panel.

![\[The Amazon SageMaker Unified Studio UI showing the configuration options for the Select columns transform.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/vis-etl/vis-etl-selectclumns.png)
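
The keep-only-checked-fields behavior can be sketched in plain Python (field names are hypothetical):

```python
def select_columns(dataset, fields_to_keep):
    """Keep only the checked fields; all other columns are dropped."""
    return [{k: row[k] for k in fields_to_keep} for row in dataset]

rows = [{"id": 1, "name": "a", "internal_flag": True}]
subset = select_columns(rows, ["id", "name"])
```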


# Split string transform


Use this transform to split a string column into an array.

**To add a Split String transform:**

1. Navigate to your visual ETL job in Amazon SageMaker Unified Studio.

1. Choose the plus icon to open the **Add nodes** menu.

1. Under **Transforms**, choose **Split String**.

1. Select the diagram to add the node to your visual ETL job.

1. Connect the transform node to a data source node.

1. Select the node on the diagram to view details about the transform.

1. Under **Column to split**, select the column that you want to split.

1. Under **Splitting regular expression**, enter a regular expression that defines the separator token.

1. (Optional) Under **Array column name**, enter a name for the new column that contains the resulting array.
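
The split can be illustrated with `re.split` (the column names, separator pattern, and data are hypothetical):

```python
import re

def split_string(dataset, column, splitting_regex, array_column):
    """Split the string column on the separator pattern into an array column."""
    return [{**row, array_column: re.split(splitting_regex, row[column])}
            for row in dataset]

rows = [{"tags": "red;green;blue"}]
result = split_string(rows, "tags", r";", "tag_list")
```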

# SQL query transform


You can use a SQL query transform to write your own transform in the form of a SQL query. Writing the SQL query in Visual ETL creates a subset of the data corresponding to the query. A SQL transform node can have multiple datasets as inputs, but produces only a single dataset as output. It contains a text field, where you enter the SQL query.

**To add an SQL query node to your job diagram**

1. Open the menu and then choose SQL query to add a new transform to your job diagram, if needed.

1. (Optional) Click on the rename node icon to enter a new name for the node in the job diagram.

1. Modify the input schema:

   1. Ensure the alias in the "SQL aliases" box is appropriate. Visual ETL will autopopulate this field, but you can change it.

   1. Write an SQL statement that queries the data to suit your needs.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the Data preview tab in the node details panel.

![\[The Amazon SageMaker Unified Studio UI showing the configuration options for the SQL query transform.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/vis-etl/vis-etl-sqlquery.png)
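
The node's semantics (each input dataset is registered under its SQL alias, and one statement produces the output dataset) can be illustrated with Python's built-in `sqlite3` standing in for Spark SQL; the table name and data here are hypothetical:

```python
import sqlite3

# An in-memory table plays the role of an input node registered
# under the SQL alias "orders".
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 9.5), (2, 20.0), (3, 3.25)])

# The query text you would enter in the SQL query node.
result = con.execute("SELECT id, amount FROM orders WHERE amount > 5").fetchall()
```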


# To timestamp transform


You can use the To Timestamp transform to change the data type of a numeric or string column to timestamp, so that the data can be stored with that data type or passed to other transforms that require a timestamp.

**To add a To Timestamp transform:**

1. Navigate to your visual ETL job in Amazon SageMaker Unified Studio.

1. Choose the plus icon to open the **Add nodes** menu.

1. Under **Transforms**, choose **To Timestamp**.

1. Select the diagram to add the node to your visual ETL job.

1. Select the node on the diagram to view details about the transform.

1. Under **Column to convert**, select the column that you want to convert.

1. Under **Time format**, define how to parse the column selected by choosing the format.

   If the value is a number, it can be expressed in seconds (Unix/Python timestamp), milliseconds, or microseconds.

   If the value is a formatted string, choose the "iso" type. The string needs to conform to one of the variants of the ISO format, for example: "2022-11-02T14:40:59.915Z".

   If you don't know the type at this point, or different rows use different types, you can choose "autodetect" and the system will make its best guess, with a small performance cost.

1. Under **Column name**, enter a name for a new column that will be generated for the timestamps. If a column name is not specified, the existing column will be replaced instead of generating a new column.
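
The format options above can be sketched with Python's `datetime` module; this is an illustration of the conversion rules, not the transform's actual implementation, and the autodetect threshold used here is an assumption:

```python
from datetime import datetime, timezone

def to_timestamp(value, time_format="autodetect"):
    """Convert an epoch number ("seconds"/"milliseconds") or an ISO 8601
    string ("iso") to a UTC datetime; "autodetect" guesses the type."""
    if isinstance(value, (int, float)):
        # Assumed heuristic: values >= 1e11 are treated as milliseconds.
        if time_format == "milliseconds" or (time_format == "autodetect" and value >= 1e11):
            return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
        return datetime.fromtimestamp(value, tz=timezone.utc)
    # "iso" (or an autodetected string), e.g. "2022-11-02T14:40:59.915Z"
    return datetime.fromisoformat(str(value).replace("Z", "+00:00"))
```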

# Union transform


You use the Union transform node when you want to combine rows from more than one data source that have the same schema. When applying a Union transformation, you can select "Union by name" to have the union performed on columns with the same name (rather than by position). Selecting this option also lets you select "Allow missing columns", which produces a frame in which missing columns are filled with null values.

**To add a Union transform node to your job diagram**

1. Open the menu and then choose "Union transform" to add a new transform to your job diagram, if needed.

1. (Optional) Click on the union node icon to enter a new name for the node in the job diagram.

1. Modify the input schema:

   1. Select "Union by name" if you want the union to be done on columns with the same name.

   1. If you have enabled Union by name, select "Allow missing columns" if you want to fill missing columns with null values.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the Data preview tab in the node details panel.

![\[The Amazon SageMaker Unified Studio UI showing the configuration options for the Union transform.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/vis-etl/vis-etlunion.png)
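
The "Union by name" and "Allow missing columns" behavior can be sketched in plain Python (sample data and helper name are hypothetical; Spark exposes the same idea through union-by-name semantics):

```python
def union_by_name(left, right, allow_missing_columns=False):
    """Stack rows from both datasets, matching columns by name."""
    columns = list(dict.fromkeys(k for row in left + right for k in row))
    if not allow_missing_columns:
        for row in left + right:
            if set(row) != set(columns):
                raise ValueError("schemas differ; enable 'Allow missing columns'")
    # Missing columns are filled with None (null).
    return [{c: row.get(c) for c in columns} for row in left + right]

left = [{"id": 1, "name": "x"}]
right = [{"id": 2}]
combined = union_by_name(left, right, allow_missing_columns=True)
```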


# Unpivot columns into rows transform


Use this transform to convert column names into values of a new column, generating one row per converted column for each original row.

**To add an Unpivot Columns Into Rows transform:**

1. Navigate to your visual ETL job in Amazon SageMaker Unified Studio.

1. Choose the plus icon to open the **Add nodes** menu.

1. Under **Transforms**, choose **Unpivot Columns Into Rows**.

1. Select the diagram to add the node to your visual ETL job.

1. Connect the transform node to a data source node.

1. Select the node on the diagram to view details about the transform.

1. Under **Unpivot names column**, enter a name for the new column that will contain the source column names.

1. Under **Unpivot values column**, enter a name for the new column that will contain the source column values.

1. Under **Columns to unpivot into the new value column**, select the column or columns whose names will become the values of the new column.
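
The wide-to-long reshaping can be sketched in plain Python (the column names and sample data are hypothetical):

```python
def unpivot(dataset, names_column, values_column, columns_to_unpivot):
    """Emit one row per unpivoted column: the column name goes into
    names_column and its value into values_column."""
    out = []
    for row in dataset:
        kept = {k: v for k, v in row.items() if k not in columns_to_unpivot}
        for col in columns_to_unpivot:
            out.append({**kept, names_column: col, values_column: row[col]})
    return out

wide = [{"store": "A", "Q1": 100, "Q2": 150}]
long_rows = unpivot(wide, "quarter", "amount", ["Q1", "Q2"])
```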

# UUID transform


Use this transform to generate a 36-character unique identifier for each row in your data source.

**To add a UUID transform:**

1. Navigate to your visual ETL job in Amazon SageMaker Unified Studio.

1. Choose the plus icon to open the **Add nodes** menu.

1. Under **Transforms**, choose **UUID**.

1. Select the diagram to add the node to your visual ETL job.

1. Connect the transform node to a data source node.

1. Select the node on the diagram to view details about the transform.

1. Under **UUID column name**, enter a name to use for the new column that contains the unique identifiers.
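
The 36-character identifiers are standard UUIDs; a plain-Python sketch using the stdlib `uuid` module (sample data and helper name are illustrative):

```python
import uuid

def add_uuid_column(dataset, uuid_column_name):
    """Append a 36-character random UUID string to each row."""
    return [{**row, uuid_column_name: str(uuid.uuid4())} for row in dataset]

rows = add_uuid_column([{"a": 1}, {"a": 2}], "row_id")
```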

# Visual ETL for IAM-based domains


IAM-based domains provide access to a new Amazon SageMaker Unified Studio interface where customers log in using federated roles and existing IAM permissions apply. The following sections describe how to use Visual ETL in IAM-based domains.

## Key features for IAM-based domains


Visual ETL offers several capabilities to streamline your data workflows in IAM-based domains:
+ Drag-and-drop interface: Create Visual ETL flows by dragging and connecting components on a canvas.
+ Wide range of data connectors: Connect to various data sources and destinations, including databases, file systems, cloud storage, and APIs.
+ Extensive transformation library: Apply a variety of pre-built transformations to your data, such as filtering, aggregation, joining, and data type conversions.
+ Custom transformations: Create and save custom transformations using SQL or Python for reuse in multiple flows.
+ Data preview: Visualize your data at each step of the authoring process to ensure accuracy and data quality.
+ View scripts: View the code generated and choose to convert the flow to a notebook and continue authoring with code.
+ Code and compute configuration: Use a configuration panel to add code libraries and adjust the compute settings.

# Create a Visual ETL job in IAM-based domains


To create a job using Visual ETL in Amazon SageMaker Unified Studio IAM-based domains:

1. Log in to Amazon SageMaker Unified Studio.

1. Navigate to the Visual ETL tool by selecting "Visual ETL" in the left menu.

1. Choose "Create Visual job" to open the Visual ETL editor.

1. Give the job a name when you begin authoring the job and choose "Save".

1. Open the "Add nodes" menu by choosing the plus icon and select a node, choosing your node from one of the three tabs: "Data sources", "Transforms", or "Data targets".

1. Drag a source component onto the canvas.

1. Configure the component to connect to your data source by choosing the node and editing its settings.

1. Add transformation components as needed, connecting them in the desired order.

1. Drag a data target onto the canvas and configure it to specify where the processed data should be stored.

1. Connect the components to create a complete job.

1. Choose the "Checklist" button to check for any configuration errors.

1. Choose "Save" when you are done correcting all errors.

1. Select "Run" to execute it immediately or choose the schedule icon to create a recurring run schedule.

# Authoring a Visual ETL job using generative AI in IAM-based domains


To author a Visual ETL job using generative AI in Amazon SageMaker Unified Studio IAM-based domains:

1. Verify Amazon Q is enabled for your domain.

1. Open the Visual ETL editor.

1. In the "Add nodes" panel choose the Amazon Q icon.

1. (Optional) Choose "What can I ask?" and copy a prompt.

1. Enter the desired prompt in the chat box and choose 'Submit'.

1. Choose each node in the Visual ETL editor and configure its settings.

# Scheduling and running visual jobs in IAM-based domains


There are two ways to schedule visual ETL jobs in Amazon SageMaker Unified Studio IAM-based domains:
+ You can schedule your visual jobs directly in the Visual ETL editor. This way you can schedule a single visual job quickly.
+ You can schedule your visual job using a DAG and the workflows interface. This way you can combine multiple elements in the same schedule.

# Visual ETL for Identity Center-based domains


Identity Center-based domains use the original Amazon SageMaker Unified Studio interface and user experience that continues to be supported and maintained. The following sections describe how to use Visual ETL in Identity Center-based domains.

## Key features for Identity Center-based domains


Visual ETL offers several capabilities to streamline your data workflows:

1. Drag-and-drop interface: Create Visual ETL flows by dragging and connecting components on a canvas.

1. Wide range of data connectors: Connect to various data sources and destinations, including databases, file systems, cloud storage, and APIs.

1. Extensive transformation library: Apply a variety of pre-built transformations to your data, such as filtering, aggregation, joining, and data type conversions.

1. Custom transformations: Create and save custom transformations using SQL or Python for reuse in multiple flows.

1. Data preview: Visualize your data at each step of the authoring process to ensure accuracy and data quality.

1. View scripts: View the code generated and choose to convert the flow to a notebook and continue authoring with code.

1. Code and compute configuration: Use a configuration panel to add code libraries and adjust the compute settings.

# Creating a Visual ETL job in Identity Center-based domains


To create a job using Visual ETL in Amazon SageMaker Unified Studio Identity Center-based domains:

1. Log in to Amazon SageMaker Unified Studio and select a project.

1. Navigate to the Visual ETL tool using the dropdown "Build" menu, selecting "Visual ETL jobs".

1. Choose "Create Visual ETL job" to open the Visual ETL editor.

   If this is your first time using Visual ETL jobs in Amazon SageMaker Unified Studio, you are asked to choose a default compute permission mode option based on your data access preference. For more information, see [Configuring permission mode for Glue ETL in Amazon SageMaker Unified Studio](compute-permissions-mode-glue.md).

1. Give the job a name when you begin authoring the job.

1. From the dropdown menu next to the Run button, choose the compute permission mode option that supports the data you will be using in the job.
   + Select **project.spark.fineGrained** for data managed using fine-grained access, meaning the compute engine can only access specific rows and columns from the full dataset. Choosing this option configures your compute to work with data asset subscriptions from Amazon SageMaker Catalog. 
   + Select **project.spark.compatibility** to configure permission mode to be compatible with data managed using full-table access, meaning the compute engine can access all rows and columns in the data. Choosing this option configures your compute to work with data assets from AWS and from external systems that you connect to from your project.

1. Select the "Add nodes" button and select a node, choosing your node from one of the three tabs: "Data sources", "Transforms", or "Data targets".

1. Drag a source component onto the canvas.

1. Configure the component to connect to your data source by choosing the node and editing its settings.

1. Add transformation components as needed, connecting them in the desired order.

1. Drag a data target onto the canvas and configure it to specify where the processed data should be stored.

1. Connect the components to create a complete job.

1. Choose the "Checklist" button to check for any configuration errors.

1. To make the job accessible for all project members to view and edit, select "Save to project". 

1. Select "Run" to execute it immediately or run it on a schedule with the instructions at [Scheduling and running visual jobs in Identity Center-based domains](identity-center-schedule-visual-etl.md).

# Authoring a Visual ETL job using generative AI in Identity Center-based domains


To author a Visual ETL job using generative AI in Amazon SageMaker Unified Studio Identity Center-based domains:

1. Verify Amazon Q is enabled for your domain.

1. Open the Visual ETL editor.

1. In the "Add nodes" panel choose the Amazon Q icon.

1. (Optional) Choose "What can I ask?" and copy a prompt.

1. Enter the desired prompt in the chat box and choose 'Submit'.

1. Choose each node in the Visual ETL editor and configure its settings.

# Scheduling and running visual jobs in Identity Center-based domains


There are two ways to schedule visual ETL jobs in Amazon SageMaker Unified Studio Identity Center-based domains.
+ You can schedule your visual jobs directly in the Visual ETL editor. This way you can schedule a single visual job quickly.
+ You can schedule your visual job using a DAG and the workflows interface. This way you can combine multiple elements in the same schedule.

## Scheduling visual jobs from the editor


You can schedule your visual jobs to run from within the Visual ETL editor. To do this, use a project with the **All capabilities** project profile or another project profile with scheduling enabled in the Tooling blueprint parameters. If you have created a project that needs to be updated to enable scheduling, contact your admin.

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to your visual ETL jobs by choosing **Visual ETL jobs** from the **Build** menu.

1. Choose the visual job you want to schedule from the list to open it in the editor.

1. Choose the Schedule icon in the upper-right corner of the editor.

1. Under **Schedule name**, enter a name for the schedule.

1. Under **Schedule status**, choose an option to determine whether the schedule will begin running after being created.
   + Choose **Active** to activate the schedule and run the Visual ETL job when the schedule indicates it should run.
   + Choose **Paused** to create a schedule that will not run the visual ETL job yet.

1. (Optional) Write a description of the schedule.

1. Choose a schedule type.
   + Choose **One-time** to run the visual ETL job at one specific time.
   + Choose **Recurring** to create a schedule that runs the Visual ETL job at multiple times that you choose.

1. Choose the days and times that the schedule will run.

1. Choose **Create schedule**.

You can then view the schedule on the **Schedules** tab of the Visual ETL page in your project.

You can enable the project repository auto-sync flag when creating or updating the project to ensure that schedules always execute the latest ETL notebook saved to the repository. We recommend that you test the ETL flow in draft mode before saving.

## Reviewing scheduled visual jobs in the editor


You can review scheduled visual jobs in the Visual ETL interface in Amazon SageMaker Unified Studio. On the schedules page, you can pause, edit, and delete schedules. You can also view the status and other information for a schedule and choose the name of a schedule to view runs and additional data.

To review scheduled queries, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to your project.

1. Choose **Visual ETL jobs** from the **Build** menu.

1. Choose the **Schedules** tab.

You can then pause, edit, or delete a schedule by choosing the three-dot **Actions** menu next to a schedule in the list.

To view information about different times the schedule has run, choose the name of the schedule to view the **Runs** section for that schedule. You can choose the name of a run to see a log and other details for that run.

## Scheduling visual jobs with workflows


You can schedule the Visual ETL jobs you authored to run based on a schedule using Workflows. The following is an example of how to do this:

1. Create a Visual ETL flow and name it "mwaa-test".

1. Save your draft flow ("mwaa-test.vetl") to your project.

1. Navigate to Build → Workflows menu, choose "Create workflow in editor".

1. You will now see an example DAG template in JupyterLab.

1. Modify the lines of Python code as shown below, then save the file as "mwaa_test_dag.py". This example executes the dataflow at 8 AM every day. By default, the dataflow's notebook file is under the path "src/dataflows".

   ```
   WORKFLOW_SCHEDULE = '0 8 * * *'
   NOTEBOOK_PATH = 'src/dataflows/mwaa-test.vetl'
   dag_id = "workflow-mwaa-test" # optional, set to give your workflow a meaningful name
   ```

1. Pull the file "dataflows/mwaa-test.vetl" from the project's source code repository to JupyterLab.

1. Navigate back to the Workflows console to confirm that the DAG was created. You can access the Airflow UI from the "Actions" dropdown list.

1. Manually trigger the DAG.

# Using both external data and fine-grained data in Amazon SageMaker Unified Studio visual ETL jobs

When you use visual ETL, you must select a permission mode to use with your visual ETL flow.

Permission mode is a configuration available to Spark compute resources such as Glue ETL or EMR Serverless. It configures Spark to access different types of data based on the permissions configured for that data. There are two configuration options for permission mode:
+ Compatibility mode. This is a configuration for data managed using full-table access, meaning the compute engine can access all rows and columns in the data. Choosing this option enables your compute to work with data assets from AWS and from external systems. 
+ Fine-grained mode. This is a configuration for data managed using fine-grained access controls, meaning the compute engine can only access specific rows and columns from the full dataset. Choosing this option enables your Glue ETL to work with data asset subscriptions from Amazon SageMaker Catalog.

In cases where you want to use both data configured with fine-grained access and data from external sources that you connect to your project, you can use two visual ETL jobs and orchestrate them to run together using workflows. To do this, complete the following steps.

**Combining jobs with different kinds of data in visual ETL**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the project you want to use visual ETL in.

1. Choose **Visual ETL** from the **Build** menu.

1. Choose **Create visual ETL job**.

1. Configure the visual ETL job with fine-grained access by using the AWS Glue ETL compute named **project.spark.fineGrained**.

1. Configure your visual ETL job to ingest the subscribed data to an Amazon S3 target used for temporary staging. To do this, use the plus icon to add a lakehouse architecture node as a data source and an Amazon S3 node as a data target, then connect the nodes on the diagram.

1. Select the lakehouse architecture node and configure it to point to the data you want to use.

   1. Under **Database**, choose the name of the database you want to use.

   1. Under **Table**, choose the name of the table you want to use.

1. Configure the Amazon S3 node to point to a new location.

   1. Under **S3 URI**, create a new Amazon S3 folder name and note the location for later use.

   1. Under **Mode**, select **Overwrite** to clear the Amazon S3 bucket and overwrite it with new data when you are ready to use it again.

   1. (Optional) Configure the other settings as desired.

1. Save the flow and run it using **project.spark.fineGrained** to verify correctness of the results.

1. Create a new visual ETL job that uses the AWS Glue ETL compute named **project.spark.compatibility**.

1. Configure this second visual ETL job to combine the data from the staging S3 location and the data accessible through full-table access to generate the final result.

   1. Select the plus icon. Under **Data sources**, select Amazon S3 and place the node on the diagram.

   1. Select the Amazon S3 node to configure it.

   1. Under **S3 URI**, enter the Amazon S3 folder location you used in the first visual ETL job.

   1. Use the plus icon, and under **Data sources**, select an external data source to add to your visual ETL job. Place the node on the diagram.

   1. Use the plus icon to add a data target and place the data target node on the diagram.

   1. Select the external data source and the data target to edit the configurations as desired and point to the locations you want to use.

   1. Use the plus icon, and under **Transforms**, select the **Join** transform. Place the transform on your diagram.

   1. Connect the Amazon S3 node containing the data from the first flow and the other data source to the data target using the **Join** transform.

1. Save the second flow and run it using **project.spark.compatibility** to verify correctness of the results.

1. Orchestrate these two visual ETL jobs using Amazon SageMaker Unified Studio workflows. For more information, see [Scheduling and running visual jobs in Identity Center-based domains](identity-center-schedule-visual-etl.md) and [Create a code workflow](code-workflow.md#workflow-create).

   Make sure that the workflow is configured so that the first visual ETL job finishes running before the second visual ETL job runs. By default, they run in succession, one after the other. This can also be configured using the `wait_for_completion` parameter, as shown in [Sample code workflow](code-workflow.md#workflows-sample).

# Best practices for Visual ETL in Identity Center-based domains


To get the most out of Visual ETL in Amazon SageMaker Unified Studio Identity Center-based domains:
+ Start with simple flows and gradually increase complexity as you become more familiar with the tool.
+ Use data preview features frequently to verify the results of your transformations.
+ Leverage custom transformations to standardize and streamline your flows.
+ Monitor flow performance and optimize as necessary, using Amazon SageMaker Unified Studio's built-in performance analytics.

By following these guidelines and exploring the various features of Visual ETL, you can efficiently create powerful data integration and transformation Visual ETL flows in Amazon SageMaker Unified Studio.

# Data preview


Data preview helps you create and test your visual flow using a sample of your data without having to repeatedly run the flow with the full dataset. By using data preview, you can:
+ Make sure that you have access to your data sources or data targets.
+ Check that the transform is modifying the data in the intended way. For example, if you use a Filter transform, you can make sure that the filter is selecting the right subset of data.
+ Check your data. If your dataset contains columns with values of multiple types, the data preview shows a list of tuples for these columns. Each tuple contains the data type and its value.

## Authoring with Glue data preview


As you author a visual ETL flow, a new Glue interactive session is automatically started and used to preview your data. Learn more about authoring with Glue data previews in the AWS Glue [documentation](https://docs.aws.amazon.com/glue/latest/dg/job-editor-features.html) and on the AWS Glue [pricing](https://aws.amazon.com/glue/pricing/) page. AWS Glue interactive sessions are billed per second with a 1-minute minimum.

## Authoring with SageMaker data preview


Amazon SageMaker Unified Studio offers a new built-in data preview experience that doesn't consume Glue interactive session resources. You can choose "Data preview v2.0" in your preview panel to use SageMaker data preview and turn off "Glue data preview". This new data preview experience is optional.

### Key benefits


SageMaker data preview provides the following advantages:
+ No Glue interactive session resources consumed
+ Faster time to start and to preview your data

### Limitations


SageMaker data preview has the following support limitations:

#### File format and compression


Unsupported file formats:
+ ORC
+ Avro

Supported file formats:
+ CSV (uncompressed files, must include headers)
+ JSON/JSONL (uncompressed files)
+ Parquet (uncompressed files, gzip, zstandard, lz4, and Snappy compression)

#### Transforms


The following transforms are not supported:
+ FlattenTransform
+ PivotRowsTransform
+ LookupTransform

#### Data sources


Queries with WHERE clauses are not supported for JDBC connectors (MySQL, PostgreSQL, and SQL Server) and BigQuery connectors.

#### Performance considerations


Large Parquet files are read by loading only the first row group. If your Parquet file has very large row groups (2 GB or larger), you may see performance degradation. For large Parquet files, we recommend continuing to use Glue data preview.

SageMaker data preview works best with simple primitive data types such as strings, integers, booleans, and binary data. Complex types in Parquet and JSON files (such as struct, map, or array) are supported, but may run into limitations for extremely nested data types. For complex types, we recommend continuing to use Glue data preview.