

# Visual ETL for Identity Center-based domains
<a name="visual-etl-identity-center-domains"></a>

Identity Center-based domains use the original Amazon SageMaker Unified Studio interface and user experience, which continues to be supported and maintained. The following sections describe how to use Visual ETL in Identity Center-based domains.

## Key features for Identity Center-based domains
<a name="identity-center-key-features"></a>

Visual ETL offers several capabilities to streamline your data workflows:

+ **Drag-and-drop interface**: Create Visual ETL flows by dragging and connecting components on a canvas.
+ **Wide range of data connectors**: Connect to various data sources and destinations, including databases, file systems, cloud storage, and APIs.
+ **Extensive transformation library**: Apply a variety of pre-built transformations to your data, such as filtering, aggregation, joining, and data type conversions.
+ **Custom transformations**: Create and save custom transformations using SQL or Python for reuse in multiple flows.
+ **Data preview**: Visualize your data at each step of the authoring process to ensure accuracy and data quality.
+ **View scripts**: View the generated code and choose to convert the flow to a notebook and continue authoring with code.
+ **Code and compute configuration**: Use a configuration panel to add code libraries and adjust the compute settings.

# Creating a Visual ETL job in Identity Center-based domains
<a name="identity-center-getting-started-visual-etl"></a>

To create a job using Visual ETL in Amazon SageMaker Unified Studio Identity Center-based domains:

1. Log in to Amazon SageMaker Unified Studio and select a project.

1. Navigate to the Visual ETL tool by choosing **Visual ETL jobs** from the **Build** dropdown menu.

1. Choose **Create Visual ETL job** to open the Visual ETL editor.

   If this is your first time using Visual ETL jobs in Amazon SageMaker Unified Studio, you are asked to choose a default compute permission mode option based on your data access preference. For more information, see [Configuring permission mode for Glue ETL in Amazon SageMaker Unified Studio](compute-permissions-mode-glue.md).

1. Give the job a name when you begin authoring it.

1. From the dropdown menu next to the **Run** button, choose the compute permission mode option that supports the data you will be using in the job.
   + Select **project.spark.fineGrained** for data managed using fine-grained access, meaning the compute engine can only access specific rows and columns from the full dataset. Choosing this option configures your compute to work with data asset subscriptions from Amazon SageMaker Catalog. 
   + Select **project.spark.compatibility** to configure permission mode to be compatible with data managed using full-table access, meaning the compute engine can access all rows and columns in the data. Choosing this option configures your compute to work with data assets from AWS and from external systems that you connect to from your project.

1. Select the **Add nodes** button and choose a node from one of the three tabs: **Data sources**, **Transforms**, or **Data targets**.

1. Drag a source component onto the canvas.

1. Configure the component by choosing the node and editing its settings to connect to your data source.

1. Add transformation components as needed, connecting them in the desired order.

1. Drag a data target onto the canvas and configure it to specify where the processed data should be stored.

1. Connect the components to create a complete job.

1. Choose the **Checklist** button to check for any configuration errors.

1. To make the job accessible for all project members to view and edit, select **Save to project**.

1. Select **Run** to run the job immediately, or run it on a schedule by following the instructions at [Scheduling and running visual jobs in Identity Center-based domains](identity-center-schedule-visual-etl.md).

# Authoring a Visual ETL job using generative AI in Identity Center-based domains
<a name="identity-center-visual-etl-flow-example"></a>

To author a Visual ETL job using generative AI in Amazon SageMaker Unified Studio Identity Center-based domains:

1. Verify Amazon Q is enabled for your domain.

1. Open the Visual ETL editor.

1. In the **Add nodes** panel, choose the Amazon Q icon.

1. (Optional) Choose **What can I ask?** and copy a prompt.

1. Enter the desired prompt in the chat box and choose **Submit**.

1. Choose each node in the Visual ETL editor and configure its settings.

# Scheduling and running visual jobs in Identity Center-based domains
<a name="identity-center-schedule-visual-etl"></a>

There are two ways to schedule visual ETL jobs in Amazon SageMaker Unified Studio Identity Center-based domains.
+ You can schedule your visual jobs directly in the Visual ETL editor. This way you can schedule a single visual job quickly.
+ You can schedule your visual job using a DAG and the workflows interface. This way you can combine multiple elements in the same schedule.

## Scheduling visual jobs from the editor
<a name="identity-center-schedule-visual-etl-editor"></a>

You can schedule your visual jobs to run from within the Visual ETL editor. To do this, use a project with the **All capabilities** project profile or another project profile with scheduling enabled in the Tooling blueprint parameters. If you have created a project that needs to be updated to enable scheduling, contact your admin.

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to your visual ETL jobs by choosing **Visual ETL jobs** from the **Build** menu.

1. Choose the visual job you want to schedule from the list to open it in the editor.

1. Choose the Schedule icon in the upper-right corner of the editor.

1. Under **Schedule name**, enter a name for the schedule.

1. Under **Schedule status**, choose an option to determine whether the schedule will begin running after being created.
   + Choose **Active** to activate the schedule and run the Visual ETL job when the schedule indicates it should run.
   + Choose **Paused** to create a schedule that will not run the visual ETL job yet.

1. (Optional) Write a description of the schedule.

1. Choose a schedule type.
   + Choose **One-time** to run the visual ETL job at one specific time.
   + Choose **Recurring** to create a schedule that runs the Visual ETL job at multiple times that you choose.

1. Choose the days and times that the schedule will run.

1. Choose **Create schedule**.

You can then view the schedule on the **Schedules** tab of the Visual ETL page in your project.

You can enable the project repository auto sync flag when creating or updating the project to ensure that schedules always run the latest ETL notebook saved to the repository. It is recommended that you test the ETL job in draft mode before saving.

## Reviewing scheduled visual jobs in the editor
<a name="identity-center-schedule-visual-etl-review"></a>

You can review scheduled visual jobs in the Visual ETL interface in Amazon SageMaker Unified Studio. On the schedules page, you can pause, edit, and delete schedules. You can also view the status and other information for a schedule and choose the name of a schedule to view runs and additional data.

To review scheduled visual jobs, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to your project.

1. Choose **Visual ETL jobs** from the **Build** menu.

1. Choose the **Schedules** tab.

You can then pause, edit, or delete a schedule by choosing the three-dot **Actions** menu next to a schedule in the list.

To view information about different times the schedule has run, choose the name of the schedule to view the **Runs** section for that schedule. You can choose the name of a run to see a log and other details for that run.

## Scheduling visual jobs with workflows
<a name="identity-center-schedule-visual-etl-workflows"></a>

You can use Workflows to run the Visual ETL jobs that you authored on a schedule. The following is an example of how to do this:

1. Create a Visual ETL flow and name it "mwaa-test".

1. Save your draft flow ("mwaa-test.vetl") to your project.

1. From the **Build** menu, choose **Workflows**, then choose **Create workflow in editor**.

1. You will now see an example DAG template in JupyterLab.

1. Modify the lines of Python code as shown below, then save the file as "mwaa_test_dag.py". The workflow will run the dataflow at 8:00 AM every day. By default, the dataflow's notebook file is under the path "src/dataflows".

   ```python
   WORKFLOW_SCHEDULE = '0 8 * * *'  # cron expression: minute 0, hour 8, every day
   NOTEBOOK_PATH = 'src/dataflows/mwaa-test.vetl'  # saved flow in the project repository
   dag_id = "workflow-mwaa-test" # optional, set to give your workflow a meaningful name
   ```

1. Pull the file "dataflows/mwaa-test.vetl" from the project's source code repository to JupyterLab.

1. Navigate back to the Workflows console to confirm that the DAG has been created. You can access the Airflow UI from the **Actions** dropdown list.

1. Manually trigger the DAG.
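Putting the steps above together, the saved DAG file might look like the following minimal sketch. This assumes an Apache Airflow 2.x environment; the `BashOperator` task here is a hypothetical stand-in for the notebook-execution operator that your generated DAG template provides, so keep the template's own operator and imports and edit only the constants.

```python
# Minimal sketch of a workflow DAG file. The BashOperator task is a
# hypothetical stand-in -- keep the operator from your generated DAG
# template, and change only the constants below for your own flow.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

WORKFLOW_SCHEDULE = '0 8 * * *'                 # cron: 8:00 AM every day
NOTEBOOK_PATH = 'src/dataflows/mwaa-test.vetl'  # saved flow in the repository
dag_id = "workflow-mwaa-test"                   # meaningful workflow name

with DAG(
    dag_id=dag_id,
    schedule_interval=WORKFLOW_SCHEDULE,
    start_date=datetime(2024, 1, 1),
    catchup=False,                              # skip backfill of missed runs
) as dag:
    # Stand-in for the notebook-execution operator from the template
    run_dataflow = BashOperator(
        task_id="run-dataflow",
        bash_command=f"echo 'would execute {NOTEBOOK_PATH}'",
    )
```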

# Using both external data and fine-grained data in Amazon SageMaker Unified Studio visual ETL jobs
<a name="identity-center-visual-etl-combining-different-data"></a>

When you use visual ETL, you must select a permission mode to use with your visual ETL flow.

Permission mode is a configuration available to Spark compute resources such as AWS Glue ETL or Amazon EMR Serverless. It configures Spark to access different types of data based on the permissions configured for that data. There are two configuration options for permission mode:
+ Compatibility mode. This is a configuration for data managed using full-table access, meaning the compute engine can access all rows and columns in the data. Choosing this option enables your compute to work with data assets from AWS and from external systems.
+ Fine-grained mode. This is a configuration for data managed using fine-grained access controls, meaning the compute engine can only access specific rows and columns from the full dataset. Choosing this option enables your AWS Glue ETL job to work with data asset subscriptions from Amazon SageMaker Catalog.

In cases where you want to use both data configured with fine-grained access and data from external sources that you connect to your project, you can use two visual ETL jobs and orchestrate them to run together using workflows. To do this, complete the following steps.

**Combining jobs with different kinds of data in visual ETL**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the project you want to use visual ETL in.

1. Choose **Visual ETL** from the **Build** menu.

1. Choose **Create visual ETL job**.

1. Choose to configure the visual ETL job with fine-grained access using the AWS Glue ETL compute named **project.spark.fineGrained**.

1. Configure your visual ETL job to ingest the subscribed data to an Amazon S3 target used for temporary staging. You can do this by using the plus icon to add a lakehouse architecture node as a data source and an Amazon S3 node as a data target, then connecting the nodes on the diagram.

1. Select the lakehouse architecture node and configure it to point to the data you want to use.

   1. Under **Database**, choose the name of the database you want to use.

   1. Under **Table**, choose the name of the table you want to use.

1. Configure the Amazon S3 node to point to a new location.

   1. Under **S3 URI**, create a new Amazon S3 folder name and note the location for later use.

   1. Under **Mode**, select **Overwrite** to clear the Amazon S3 location and overwrite it with new data each time the job runs.

   1. (Optional) Configure the other settings as desired.

1. Save the flow and run it using **project.spark.fineGrained** to verify correctness of the results.

1. Create a new visual ETL job that uses the AWS Glue ETL compute named **project.spark.compatibility**.

1. Configure this second visual ETL job to combine the data from the staging S3 location and the data accessible through full-table access to generate the final result.

   1. Select the plus icon. Under **Data sources**, select Amazon S3 and place the node on the diagram.

   1. Select the Amazon S3 node to configure it.

   1. Under **S3 URI**, enter the Amazon S3 folder location you used in the first visual ETL job.

   1. Use the plus icon, and under **Data sources**, select an external data source to add to your visual ETL job. Place the node on the diagram.

   1. Use the plus icon to add a data target and place the data target node on the diagram.

   1. Select the external data source and the data target to edit the configurations as desired and point to the locations you want to use.

   1. Use the plus icon, and under **Transforms**, select the **Join** transform. Place the transform on your diagram.

   1. Connect the Amazon S3 node containing the data from the first flow and the other data source to the data target using the **Join** transform.

1. Save the second flow and run it using **project.spark.compatibility** to verify correctness of the results.

1. Orchestrate these two visual ETL jobs using Amazon SageMaker Unified Studio workflows. For more information, see [Scheduling and running visual jobs in Identity Center-based domains](identity-center-schedule-visual-etl.md) and [Create a code workflow](code-workflow.md#workflow-create).

   Make sure that the workflow is configured so that the first visual ETL job finishes running before the second visual ETL job starts. By default, they run in succession, one after the other. This can also be configured using the `wait_for_completion` parameter, as shown in [Sample code workflow](code-workflow.md#workflows-sample).
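As a sketch of this orchestration, a small controller DAG could trigger the two visual ETL job workflows in order. This assumes an Apache Airflow 2.x environment; the two `trigger_dag_id` values below are hypothetical, so substitute the `dag_id` of each visual ETL job's own workflow.

```python
# Sketch: run the fine-grained staging job to completion, then the
# compatibility job that joins the staged data with external data.
# The two trigger_dag_id values are hypothetical -- use the dag_id
# of each visual ETL job's workflow.
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="combine-fine-grained-and-external",
    schedule_interval="0 8 * * *",      # run the pair daily at 8:00 AM
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    stage_subscribed_data = TriggerDagRunOperator(
        task_id="run-fine-grained-job",
        trigger_dag_id="visual-etl-fine-grained",   # hypothetical dag_id
        wait_for_completion=True,       # block until the first job finishes
    )
    combine_with_external = TriggerDagRunOperator(
        task_id="run-compatibility-job",
        trigger_dag_id="visual-etl-compatibility",  # hypothetical dag_id
    )
    # The second job starts only after the first completes
    stage_subscribed_data >> combine_with_external
```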

# Best practices for Visual ETL in Identity Center-based domains
<a name="identity-center-best-practices-visual-etl"></a>

To get the most out of Visual ETL in Amazon SageMaker Unified Studio Identity Center-based domains:
+ Start with simple flows and gradually increase complexity as you become more familiar with the tool.
+ Use data preview features frequently to verify the results of your transformations.
+ Leverage custom transformations to standardize and streamline your flows.
+ Monitor flow performance and optimize as necessary, using Amazon SageMaker Unified Studio's built-in performance analytics.

By following these guidelines and exploring the various features of Visual ETL, you can efficiently create powerful data integration and transformation flows in Amazon SageMaker Unified Studio.