

# Amazon SageMaker Canvas
Canvas

Amazon SageMaker Canvas gives you the ability to use machine learning to generate predictions without needing to write any code. The following are some use cases where you can use SageMaker Canvas:
+ Predict customer churn
+ Plan inventory efficiently
+ Optimize price and revenue
+ Improve on-time deliveries
+ Classify text or images based on custom categories
+ Identify objects and text in images
+ Extract information from documents

With Canvas, you can chat with popular large language models (LLMs), access Ready-to-use models, or build a custom model trained on your data.

Canvas chat is a functionality that leverages open-source and Amazon LLMs to help you boost your productivity. You can prompt the models to get assistance with tasks such as generating content, summarizing or categorizing documents, and answering questions. To learn more, see [Generative AI foundation models in SageMaker Canvas](canvas-fm-chat.md).

The [Ready-to-use models](canvas-ready-to-use-models.md) in Canvas can extract insights from your data for a variety of use cases. You don’t have to build a model to use Ready-to-use models because they are powered by Amazon AI services, including [Amazon Rekognition](https://docs.aws.amazon.com/rekognition/latest/dg/what-is.html), [Amazon Textract](https://docs.aws.amazon.com/textract/latest/dg/what-is.html), and [Amazon Comprehend](https://docs.aws.amazon.com/comprehend/latest/dg/what-is.html). You only have to import your data and start using a solution to generate predictions.

If you want a model that is customized to your use case and trained with your data, you can [build a model](canvas-custom-models.md). You can get predictions customized to your data by doing the following:

1. Import your data from one or more data sources.

1. Build a predictive model.

1. Evaluate the model's performance.

1. Generate predictions with the model.

Canvas supports the following types of custom models:
+ Numeric prediction (also known as *regression*)
+ Categorical prediction for 2 and 3+ categories (also known as *binary* and *multi-class classification*)
+ Time series forecasting
+ Single-label image prediction (also known as *image classification*)
+ Multi-category text prediction (also known as *multi-class text classification*)

To learn more about pricing, see the [SageMaker Canvas pricing page](https://aws.amazon.com/sagemaker/canvas/pricing/). You can also see [Billing and cost in SageMaker Canvas](canvas-manage-cost.md) for more information.

SageMaker Canvas is currently available in the following Regions:
+ US East (Ohio)
+ US East (N. Virginia)
+ US West (N. California)
+ US West (Oregon)
+ Asia Pacific (Mumbai)
+ Asia Pacific (Seoul)
+ Asia Pacific (Singapore)
+ Asia Pacific (Sydney)
+ Asia Pacific (Tokyo)
+ Canada (Central)
+ Europe (Frankfurt)
+ Europe (Ireland)
+ Europe (London)
+ Europe (Paris)
+ Europe (Stockholm)
+ South America (São Paulo)

**Topics**
+ [Are you a first-time SageMaker Canvas user?](#canvas-first-time-user)
+ [Getting started with using Amazon SageMaker Canvas](canvas-getting-started.md)
+ [Tutorial: Build an end-to-end machine learning workflow in SageMaker Canvas](canvas-end-to-end-machine-learning-workflow.md)
+ [Amazon SageMaker Canvas setup and permissions management (for IT administrators)](canvas-setting-up.md)
+ [Generative AI assistance for solving ML problems in Canvas using Amazon Q Developer](canvas-q.md)
+ [Data import](canvas-importing-data.md)
+ [Data preparation](canvas-data-prep.md)
+ [Generative AI foundation models in SageMaker Canvas](canvas-fm-chat.md)
+ [Ready-to-use models](canvas-ready-to-use-models.md)
+ [Custom models](canvas-custom-models.md)
+ [Logging out of Amazon SageMaker Canvas](canvas-log-out.md)
+ [Limitations and troubleshooting](canvas-limits.md)
+ [Billing and cost in SageMaker Canvas](canvas-manage-cost.md)

## Are you a first-time SageMaker Canvas user?


If you are a first-time user of SageMaker Canvas, we recommend that you begin by reading the following sections:
+ For IT administrators – [Amazon SageMaker Canvas setup and permissions management (for IT administrators)](canvas-setting-up.md)
+ For analysts and individual users – [Getting started with using Amazon SageMaker Canvas](canvas-getting-started.md)
+ For an example of an end-to-end workflow – [Tutorial: Build an end-to-end machine learning workflow in SageMaker Canvas](canvas-end-to-end-machine-learning-workflow.md)

# Getting started with using Amazon SageMaker Canvas
Getting started

This guide tells you how to get started with using SageMaker Canvas. If you're an IT administrator and would like more in-depth details, see [Amazon SageMaker Canvas setup and permissions management (for IT administrators)](canvas-setting-up.md) to set up SageMaker Canvas for your users.

**Topics**
+ [Prerequisites for setting up Amazon SageMaker Canvas](#canvas-prerequisites)
+ [Step 1: Log in to SageMaker Canvas](#canvas-getting-started-step1)
+ [Step 2: Use SageMaker Canvas to get predictions](#canvas-getting-started-step2)

## Prerequisites for setting up Amazon SageMaker Canvas


To set up a SageMaker Canvas application, onboard using one of the following setup methods:

1. **Onboard with the AWS console.** To onboard through the AWS console, you first create an Amazon SageMaker AI domain. SageMaker AI domains support the various machine learning (ML) environments such as Canvas and [SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html). For more information about domains, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md).

   1. (Quick) [Use quick setup for Amazon SageMaker AI](onboard-quick-start.md) – Choose this option if you’d like to quickly set up a domain. This grants your user all of the default Canvas permissions and basic functionality. Any additional features such as [document querying](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-fm-chat.html#canvas-fm-chat-query) can be enabled later by an admin. If you want to configure more granular permissions, we recommend that you choose the standard setup instead.

   1. (Standard) [Use custom setup for Amazon SageMaker AI](onboard-custom.md) – Choose this option if you’d like to complete a more advanced setup of your domain. Maintain granular control over user permissions such as access to data preparation features, generative AI functionality, and model deployments. 

1. **Onboard with CloudFormation.** [CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) automates the provisioning of resources and configurations so that you can set up Canvas for one or more user profiles at the same time. Use this option if you want to automate the onboarding process at scale and make sure that your applications are configured the same way every time. The following [CloudFormation template](https://github.com/aws-samples/cloudformation-studio-domain) provides a streamlined way to onboard to Canvas, ensuring that all required components are properly set up and allowing you to focus on building and deploying your machine learning models.
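
If you'd rather script this step, the following is a minimal boto3 sketch that launches a CloudFormation stack from a template like the one linked above. The stack name and template URL are placeholders, not values defined by this guide.

```
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# StackName and TemplateURL are placeholders; point TemplateURL at your copy
# of the onboarding template (for example, one uploaded to your own bucket).
response = cfn.create_stack(
    StackName="sagemaker-canvas-onboarding",
    TemplateURL="https://amzn-s3-demo-bucket.s3.amazonaws.com/canvas-domain.yaml",
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the template creates IAM roles
)
print(response["StackId"])
```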

The following section describes how to onboard to Canvas by using the AWS console to create a domain.

**Important**  
For you to set up Amazon SageMaker Canvas, your version of Amazon SageMaker Studio must be 3.19.0 or later. For information about updating Amazon SageMaker Studio, see [Shut Down and Update Amazon SageMaker Studio Classic](studio-tasks-update-studio.md).

### Onboard with the AWS console


If you’re doing the quick domain setup, then you can follow the instructions in [Use quick setup for Amazon SageMaker AI](onboard-quick-start.md), skip the rest of this section, and move on to [Step 1: Log in to SageMaker Canvas](#canvas-getting-started-step1).

If you’re doing the standard domain setup, then you can specify the Canvas features to which you’d like to grant your users access. Use the rest of this section as you complete the standard domain setup to help you configure the permissions that are specific to Canvas.

In the [Use custom setup for Amazon SageMaker AI](onboard-custom.md) setup instructions, for **Step 2: Users and ML Activities**, you must select the Canvas permissions that you want to grant. In the **ML activities** section, you can select the following permissions policies to grant access to Canvas features. You can only select up to 8 **ML activities** total when setting up your domain. The first two permissions in the following list are required to use Canvas, while the rest are for additional features.
+ **Run Studio Applications** – These permissions are necessary to start up the Canvas application.
+ **[Canvas Core Access](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasFullAccess.html)** – These permissions grant you access to the Canvas application and the basic functionality of Canvas, such as creating datasets, using basic data transforms, and building and analyzing models.
+ (Optional) **[Canvas Data Preparation (powered by Data Wrangler)](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasDataPrepFullAccess.html)** – These permissions grant you access to create data flows and use advanced transforms to prepare your data in Canvas. These permissions are also necessary for creating data processing jobs and data preparation job schedules.
+ (Optional) **[Canvas AI Services](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasAIServicesAccess.html)** – These permissions grant you access to the Ready-to-use models, foundation models, and Chat with Data features in Canvas.
+ (Optional) **Kendra access** – This permission grants you access to the [document querying](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-fm-chat.html#canvas-fm-chat-query) feature, where you can query documents stored in an Amazon Kendra index using foundation models in Canvas.

  If you select this option, then in the **Canvas Kendra Access** section, enter the IDs for your Amazon Kendra indexes to which you want to grant access.
+ (Optional) **[Canvas MLOps](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasDirectDeployAccess.html)** – This permission grants you access to the [model deployment](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-deploy-model.html) feature in Canvas, where you can deploy models for use in production.

In the domain setup’s **Step 3: Applications** section, choose **Configure Canvas** and then do the following:

1.  For the **Canvas storage configuration**, specify where you want Canvas to store the application data, such as model artifacts, batch predictions, datasets, and logs. SageMaker AI creates a `Canvas/` folder inside this bucket to store the data. For more information, see [Configure your Amazon S3 storage](canvas-storage-configuration.md). For this section, do the following:

   1. Select **System managed** if you want to set the location to the default SageMaker AI-created bucket that follows the pattern `s3://sagemaker-{Region}-{your-account-id}`.

   1. Select **Custom S3** to specify your own Amazon S3 bucket as the storage location. Then, enter the Amazon S3 URI.

   1. (Optional) For **Encryption key**, specify a KMS key for encrypting Canvas artifacts stored at the specified location.

1. (Optional) For **Amazon Q Developer**, do the following:

   1. Turn on **Enable Amazon Q Developer in SageMaker Canvas for natural language ML** to give your users permissions to leverage generative AI assistance during their ML workflow in Canvas. This option only grants permissions to query Amazon Q Developer for help with predetermined tasks that can be completed in the Canvas application.

   1. Turn on **Enable Amazon Q Developer chat for general AWS questions** to give your users permissions to make generative AI queries related to AWS services.

1. (Optional) Configure the **Large data processing** section if your users plan to process datasets larger than 5 GB in Canvas. For more detailed information about how to configure these options, see [Grant Users Permissions to Use Large Data across the ML Lifecycle](canvas-large-data-permissions.md).

1. (Optional) For the **ML Ops permissions configuration** section, do the following:

   1. Leave the **Enable direct deployment of Canvas models** option turned on to give your users permissions to deploy their models from Canvas to a SageMaker AI endpoint. For more information about model deployment in Canvas, see [Deploy your models to an endpoint](canvas-deploy-model.md).

   1. Leave the **Enable Model Registry registration permissions for all users** option turned on to give your users permissions to register their model version to the SageMaker AI model registry (it is turned on by default). For more information, see [Register a model version in the SageMaker AI model registry](canvas-register-model.md).

   1. If you left the **Enable Model Registry registration permissions for all users** option turned on, then select either **Register to Model Registry only** or **Register and approve model in Model Registry**.

1. (Optional) For the **Local file upload configuration** section, turn on the **Enable local file upload** option to give your users permissions to upload files to Canvas from their local machines. Turning this option on attaches a cross-origin resource sharing (CORS) policy to the Amazon S3 bucket specified in the **Canvas storage configuration** (and overrides any existing CORS policy). To learn more about local file upload permissions, see [Grant Your Users Permissions to Upload Local Files](canvas-set-up-local-upload.md).

1. (Optional) For the **OAuth settings** section, do the following:

   1. Choose **Add OAuth configuration**.

   1. For **Data source**, select your data source.

   1. For **Secret setup**, select **Create a new secret** and enter the information you have from your identity provider. If you haven’t done the initial OAuth setup with your data source yet, see [Set up connections to data sources with OAuth](canvas-setting-up-oauth.md).

1. (Optional) For the **Canvas Ready-to-use models configuration**, do the following:

   1. Leave the **Enable Canvas Ready-to-use models** option turned on to give your users permissions to generate predictions with Ready-to-use models in Canvas (it is turned on by default). This option also gives you permissions to chat with generative-AI powered models. For more information, see [Generative AI foundation models in SageMaker Canvas](canvas-fm-chat.md).

   1. Leave the **Enable document query using Amazon Kendra** option turned on to give your users permissions to use foundation models for querying documents stored in an Amazon Kendra index. Then, from the dropdown menu, select the existing indexes to which you want to grant access. For more information, see [Generative AI foundation models in SageMaker Canvas](canvas-fm-chat.md).

   1. For **Amazon Bedrock role**, select **Create and use a new execution role** to create a new IAM execution role that has a trust relationship with Amazon Bedrock. This IAM role is assumed by Amazon Bedrock to fine-tune large language models (LLMs) in Canvas. If you already have an execution role with a trust relationship, then select **Use an existing execution role** and choose your role from the dropdown. For more information about manually configuring permissions for your own execution role, see [Grant Users Permissions to Use Amazon Bedrock and Generative AI Features in Canvas](canvas-fine-tuning-permissions.md).

1. Finish configuring the rest of the domain settings using the [Use custom setup for Amazon SageMaker AI](onboard-custom.md) procedures.

**Note**  
If you encounter any issues with granting permissions through the console, such as permissions for Ready-to-use models, see the topic [Troubleshooting issues with granting permissions through the SageMaker AI console](canvas-limits.md#canvas-troubleshoot-trusted-services).

You should now have a SageMaker AI domain set up and all of the Canvas permissions configured.

You can edit the Canvas permissions for a domain or a specific user after the initial domain setup. Individual user settings override the domain settings. To learn how to edit your Canvas permissions in the domain settings, see [Edit domain settings](domain-edit.md).

### Give yourself permissions to use specific features in Canvas


The following information outlines the various permissions that you can grant to a Canvas user to allow the use of various features and functionalities within Canvas. Some of these permissions can be granted during the domain setup, but some require additional permissions or configuration. Refer to the specific permissions information for each feature that you want to enable:
+ **Local file upload.** The permissions for local file upload are turned on by default in the Canvas base permissions when setting up your domain. If you can't upload local files from your machine to SageMaker Canvas, you can attach a CORS policy to the Amazon S3 bucket that you specified in the Canvas storage configuration. If you allowed SageMaker AI to use the default bucket, the bucket follows the naming pattern `s3://sagemaker-{Region}-{your-account-id}`. For more information, see [Grant Your Users Permissions to Upload Local Files](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-set-up-local-upload.html).
+ **Custom image and text prediction models.** The permissions for building custom image and text prediction models are turned on by default in the Canvas base permissions when setting up your domain. However, if you have a custom IAM configuration and don't want to attach the [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasFullAccess) policy to your user's IAM execution role, then you must explicitly grant your user the necessary permissions. For more information, see [Grant Your Users Permissions to Build Custom Image and Text Prediction Models](canvas-set-up-cv-nlp.md).
+ **Ready-to-use models and foundation models.** You might want to use the Canvas Ready-to-use models to make predictions for your data. With the Ready-to-use models permissions, you can also chat with generative AI-powered models. The permissions are turned on by default when setting up your domain, or you can edit the permissions for a domain that you’ve already created. The Canvas Ready-to-use models permissions option adds the [AmazonSageMakerCanvasAIServicesAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasAIServicesAccess) policy to your execution role. For more information, see the [Get started](canvas-ready-to-use-models.md#canvas-ready-to-use-get-started) section of the Ready-to-use models documentation.

  For more information about getting started with generative AI foundation models, see [Generative AI foundation models in SageMaker Canvas](canvas-fm-chat.md).
+ **Fine-tune foundation models.** If you'd like to fine-tune foundation models in Canvas, you can either add the permissions when setting up your domain, or you can edit the permissions for the domain or user profile after creating your domain. You must add the [AmazonSageMakerCanvasAIServicesAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasAIServicesAccess) policy to the AWS IAM role you chose when setting up the user profile, and you must also add a trust relationship with Amazon Bedrock to the role. For instructions on how to add these permissions to your IAM role, see [Grant Users Permissions to Use Amazon Bedrock and Generative AI Features in Canvas](canvas-fine-tuning-permissions.md).
+ **Send batch predictions to Quick.** You might want to [send *batch predictions*](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-send-predictions.html), or datasets of predictions you generate from a custom model, to Quick for analysis. In [QuickSight](https://docs.aws.amazon.com/quicksight/latest/user/welcome.html), you can build and publish predictive dashboards with your prediction results. For instructions on how to add these permissions to your Canvas user's IAM role, see [Grant Your Users Permissions to Send Predictions to Quick](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-quicksight-permissions.html).
+ **Deploy Canvas models to a SageMaker AI endpoint.** SageMaker AI Hosting offers *endpoints* which you can use to deploy your model for use in production. You can deploy models built in Canvas to a SageMaker AI endpoint and then make predictions programmatically in a production environment. For more information, see [Deploy your models to an endpoint](canvas-deploy-model.md).
+ **Register model versions to the model registry.** You might want to register *versions* of your model to the [SageMaker AI model registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html), which is a repository for tracking the status of updated versions of your model. A data scientist or MLOps team working in the SageMaker Model Registry can view the versions of your model that you’ve built and approve or reject them. Then, they can deploy your model version to production or kick off an automated workflow. Model registration permissions are turned on by default for your domain. You can manage permissions at the user profile level and grant or remove permissions to specific users. For more information, see [Register a model version in the SageMaker AI model registry](canvas-register-model.md).
+ **Import data from Amazon Redshift.** If you want to import data from Amazon Redshift, you must give yourself additional permissions. You must add the `AmazonRedshiftFullAccess` managed policy to the AWS IAM role you chose when setting up the user profile. For instructions on how to add the policy to the role, see [Grant Users Permissions to Import Amazon Redshift Data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-redshift-permissions.html). A minimal boto3 sketch follows the note after this list.

**Note**  
The necessary permissions to import through other data sources, such as Amazon Athena and SaaS platforms, are included in the [AmazonSageMakerFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html) and [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasFullAccess) policies. If you followed the standard setup instructions, these policies should already be attached to your execution role. For more information about these data sources and their permissions, see [Connect to data sources](canvas-connecting-external.md).
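
For the permissions above that come down to attaching a managed policy, such as the Amazon Redshift import permissions, a minimal boto3 sketch looks like the following; the role name is a placeholder for your user's execution role.

```
import boto3

iam = boto3.client("iam")

# "CanvasUserExecutionRole" is a placeholder; use your user's execution role.
iam.attach_role_policy(
    RoleName="CanvasUserExecutionRole",
    PolicyArn="arn:aws:iam::aws:policy/AmazonRedshiftFullAccess",
)
```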

## Step 1: Log in to SageMaker Canvas


When the initial setup is complete, you can access SageMaker Canvas with any of the following methods, depending on your use case:
+ In the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/), choose **Canvas** in the left navigation pane. Then, on the **Canvas** page, select your user from the dropdown and launch the Canvas application.
+ Open [SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html), and in the Studio interface, go to the Canvas page and launch the Canvas application.
+ Use your organization’s SAML 2.0-based SSO methods, such as Okta or the IAM Identity Center.

When you log into SageMaker Canvas for the first time, SageMaker AI creates the application and a SageMaker AI *space* for you. The Canvas application’s data is stored in the space. To learn more about spaces, see [Collaboration with shared spaces](domain-space.md). The space consists of your user profile’s applications and a shared directory for all of your applications’ data. If you don’t want to use the default space created by SageMaker AI and would prefer to create your own space for storing application data, see the page [Store SageMaker Canvas application data in your own SageMaker AI space](canvas-spaces-setup.md).
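
Administrators can also generate a login URL programmatically. The following is a boto3 sketch using the `CreatePresignedDomainUrl` API; the domain ID and user profile name are placeholders, and the `LandingUri` value shown is our assumption for landing directly in the Canvas application.

```
import boto3

sm = boto3.client("sagemaker")

# DomainId and UserProfileName are placeholders for your own values.
response = sm.create_presigned_domain_url(
    DomainId="d-xxxxxxxxxxxx",
    UserProfileName="canvas-user",
    SessionExpirationDurationInSeconds=43200,  # maximum allowed session length
    LandingUri="app:Canvas",  # assumption: open the Canvas application directly
)
print(response["AuthorizedUrl"])  # single-use URL; open it in a browser
```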

## Step 2: Use SageMaker Canvas to get predictions


After you’ve logged in to Canvas, you can start building models and generating predictions for your data.

You can either use Canvas Ready-to-use models to make predictions without building a model, or you can build a custom model for your specific business problem. Review the following information to decide whether Ready-to-use models or custom models are best for your use case.
+ **Ready-to-use models.** With Ready-to-use models, you can use pre-built models to extract insights from your data. The Ready-to-use models cover a variety of use cases, such as language detection and document analysis. To get started making predictions with Ready-to-use models, see [Ready-to-use models](canvas-ready-to-use-models.md).
+ **Custom models.** With custom models, you can build a variety of model types that are customized to make predictions for your data. Use custom models if you’d like to build a model that is trained on your business-specific data and if you’d like to use features such as [evaluating your model’s performance](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-evaluate-model.html). To get started with building a custom model, see [Custom models](canvas-custom-models.md).

# Tutorial: Build an end-to-end machine learning workflow in SageMaker Canvas
Tutorial: Build a machine learning workflow in Canvas

This tutorial guides you through an end-to-end machine learning (ML) workflow using Amazon SageMaker Canvas. SageMaker Canvas is a visual no-code interface that you can use to prepare data and to train and deploy ML models. For the tutorial, you use a NYC taxi dataset to train a model that predicts the fare amount for a given trip. You get hands-on experience with key ML tasks such as assessing data quality and addressing data issues, splitting data into training and test sets, model training and evaluation, making predictions, and deploying your trained model, all within the SageMaker Canvas application.

**Important**  
This tutorial assumes that you or your administrator have created an AWS account. For information about creating an AWS account, see [Getting started: Are you a first-time AWS user?](https://docs.aws.amazon.com/accounts/latest/reference/welcome-first-time-user.html)

## Setting up


An Amazon SageMaker AI domain is a centralized place to manage all your Amazon SageMaker AI environments and resources. A domain acts as a virtual boundary for your work in SageMaker AI, providing isolation and access control for your machine learning (ML) resources. 

To get started with Amazon SageMaker Canvas, you or your administrator must navigate to the SageMaker AI console and create an Amazon SageMaker AI domain. A domain has the storage and compute resources needed for you to run SageMaker Canvas. Within the domain, you configure SageMaker Canvas to access your Amazon S3 buckets and deploy models. Use the following procedure to set up a quick domain and create a SageMaker Canvas application.

**To set up SageMaker Canvas**

1. Navigate to the [SageMaker AI console](https://console.aws.amazon.com/sagemaker).

1. On the left-hand navigation, choose **SageMaker Canvas**.

1. Choose **Create a SageMaker AI domain**.

1. Choose **Set up**. The domain can take a few minutes to set up.

The preceding procedure used a quick domain setup. You can perform an advanced setup to control all aspects of the account configuration, including permissions, integrations, and encryption. For more information about a custom setup, see [Use custom setup for Amazon SageMaker AI](onboard-custom.md).

By default, the quick domain setup provides you with permissions to deploy models. If you have custom permissions set up through a standard domain and you need to manually grant model deployment permissions, see [Permissions management](canvas-deploy-model.md#canvas-deploy-model-prereqs).

## Flow creation


Amazon SageMaker Canvas is a machine learning platform that enables users to build, train, and deploy machine learning models without extensive coding or machine learning expertise. One of the powerful features of Amazon SageMaker Canvas is the ability to import and work with large datasets from various sources, such as Amazon S3.

For this tutorial, we're using the NYC taxi dataset to predict the fare amount for each trip using an Amazon SageMaker Canvas Data Wrangler data flow. The following procedure outlines the steps for importing a modified version of the NYC taxi dataset into a data flow.

**Note**  
For improved processing, SageMaker Canvas imports a sample of your data. By default, it randomly samples 50,000 rows.

**To import the NYC taxi dataset**

1. From the SageMaker Canvas home page, choose **Data Wrangler**.

1. Choose **Import data**.

1. Select **Tabular**.

1. Choose the toolbox icon next to the data source.

1. Select **Amazon S3** from the dropdown.

1. For **Input S3 endpoint**, specify `s3://amazon-sagemaker-data-wrangler-documentation-artifacts/canvas-single-file-nyc-taxi-dataset.csv`

1. Choose **Go**.

1. Select the checkbox next to the dataset.

1. Choose **Preview data**.

1. Choose **Save**.
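
If you want to inspect the tutorial dataset before importing it, the following is an optional pandas sketch. It assumes `pandas` and `s3fs` are installed and that your credentials can read the public bucket; it is not part of the Canvas import itself.

```
import pandas as pd

uri = (
    "s3://amazon-sagemaker-data-wrangler-documentation-artifacts/"
    "canvas-single-file-nyc-taxi-dataset.csv"
)
# Read only the first rows to keep the download small.
df = pd.read_csv(uri, nrows=1000)
print(df.columns.tolist())
print(df.head())
```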

## Data Quality and Insights Report 1 (sample)


After importing a dataset into Amazon SageMaker Canvas, you can generate a Data Quality and Insights report on a sample of the data. The report provides valuable insights into the dataset. It does the following:
+ Assesses the dataset's completeness
+ Identifies missing values and outliers
+ Flags other potential issues that may impact model performance

The report also evaluates the predictive power of each feature with respect to the target variable, allowing you to identify the features most relevant to the problem you're trying to solve.

We can use the insights from the report to predict the fare amount. By specifying the **Fare amount** column as the target variable and selecting **Regression** as the problem type, the report analyzes the dataset's suitability for predicting continuous values like fare prices. The report should reveal that features like **year** and **hour\_of\_day** have low predictive power for the chosen target variable.

Use the following procedure to get a Data Quality and Insights report on a 50,000-row sample from the dataset.

**To get a report on a sample**

1. Choose **Get data insights** from the pop-up window next to the **Data types** node.

1. For **Analysis name**, specify a name for the report.

1. For **Problem type**, choose **Regression**.

1. For **Target column**, choose **Fare amount**.

1. Choose **Create**.

You can review the Data Quality and Insights report on a sample of your data. The report indicates that the **year** and **hour\_of\_day** features are not predictive of the target variable, **Fare amount**.

At the top of the navigation, choose the name of the data flow to navigate back to it.

## Drop year and hour of day


We're using insights from the report to drop the **year** and **hour\_of\_day** columns to streamline the feature space and potentially improve model performance.

Amazon SageMaker Canvas provides a user-friendly interface and tools to perform such data transformations.

Use the following procedure to drop the **year** and **hour\_of\_day** columns from the NYC taxi dataset using the Data Wrangler tool in Amazon SageMaker Canvas.

1. Choose the icon next to **Data types**.

1. Choose **Add step**.

1. In the search bar, enter **Drop column**.

1. Choose **Manage columns**.

1. Choose **Drop column**.

1. For **Columns to drop**, select the **year** and **hour\_of\_day** columns.

1. Choose **Preview** to view how your transform changes your data.

1. Choose **Add**.

You can use the preceding procedure as the basis to add all of the other transforms in SageMaker Canvas.

## Data Quality and Insights Report 2 (full dataset)


For the previous insights report, we used a sample of the NYC taxi dataset. For our second report, we're running a comprehensive analysis on the entire dataset to identify potential issues impacting model performance.

Use the following procedure to create a Data Quality and Insights report on an entire dataset.

**To get a report on the entire dataset**

1. Choose the icon next to the **Drop columns** node.

1. Choose **Get data insights**.

1. For **Analysis name**, specify a name for the report.

1. For **Problem type**, choose **Regression**.

1. For **Target column**, choose **Fare amount**.

1. For **Data size**, choose **Full dataset**.

1. Choose **Create**.

The following is an image from the insights report:

![\[Duplicate rows, Skewed target, and Very low quick model score are listed as the insights.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/canvas-tutorial-dqi-insights.png)


It shows the following issues:
+ Duplicate rows
+ Skewed target

Duplicate rows can lead to data leakage, where the model is exposed to the same data during training and testing, which produces overly optimistic performance metrics. Removing duplicate rows ensures that the model is trained on unique instances, reducing the risk of data leakage and improving the model's ability to generalize.

A skewed distribution in the target variable, in this case the **Fare amount** column, can bias the model toward the most common range of values. This can lead to poor predictions for rare or extreme fares, which is particularly problematic when accurately predicting those underrepresented trips is important.
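
Canvas applies these fixes through its built-in transforms, but the equivalent logic in pandas may clarify what the report is flagging. The following sketch is illustrative only; the file name and column names are assumptions.

```
import numpy as np
import pandas as pd

df = pd.read_csv("nyc-taxi-sample.csv")  # hypothetical local copy of the data

# Remove exact duplicate rows to avoid train/test leakage.
df = df.drop_duplicates()

# A log transform is one common way to reduce right skew in a fare column.
print("skew before:", df["fare_amount"].skew())
print("skew after :", np.log1p(df["fare_amount"]).skew())
```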

## Addressing data quality issues


To address these issues and prepare the dataset for modeling, you can search for the following transformations and apply them:

1. Drop duplicates using the **Manage rows** transform.

1. **Handle outliers** in the **Fare amount** column using the **Robust standard deviation numeric outliers** method.

1. **Handle outliers** in the **Trip distance** and **Trip duration** columns using the **Standard deviation numeric outliers** method.

1. Use the **Encode categorical** transform to encode the **Rate code id**, **Payment type**, **Extra flag**, and **Toll flag** columns as floats.

If you're not sure about how to apply a transform, see [Drop year and hour of day](#canvas-tutorial-drop-year-and-hour-of-day).

By addressing these data quality issues and applying appropriate transformations, you can improve the dataset's suitability for modeling.

## Verifying data quality and quick model accuracy


After applying the transforms to address data quality issues, such as removing duplicate rows, we create our final Data Quality and Insights report. This report helps verify that the applied transformations resolved the issues and that the dataset is now in a suitable state for modeling.

When reviewing the final Data Quality and Insights report, you should expect to see no major data quality issues flagged. The report should indicate that:
+ The target variable is no longer skewed
+ There are no outliers or duplicate rows

Additionally, the report should provide a quick model score based on a baseline model trained on the transformed dataset. This score serves as an initial indicator of the model's potential accuracy and performance.

Use the following procedure to create the Data Quality and Insights report.

**To create the Data Quality and Insights report**

1. Choose the icon next to the **Drop columns** node.

1. Choose **Get data insights**.

1. For **Analysis name**, specify a name for the report.

1. For **Problem type**, choose **Regression**.

1. For **Target column**, choose **Fare amount**.

1. For **Data size**, choose **Full dataset**.

1. Choose **Create**.

## Split the data into training and test sets


To train a model and evaluate its performance, we use the **Split data** transform to split the data into training and test sets.

By default, SageMaker Canvas uses a Randomized split, but you can also use the following types of splits:
+ Ordered
+ Stratified
+ Split by key

You can change the **Split percentage** or add splits.

For this tutorial, use all of the default settings for the split. Double-click a dataset node to view its name; the training dataset is named **Dataset (Train)**.

Next to the **Ordinal encode** node, apply the **Split data** transform.
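
Outside Canvas, the equivalent of a randomized split is a one-liner with scikit-learn. In the following sketch, the file name is a placeholder and the 80/20 ratio is an assumption that mirrors a typical default **Split percentage**.

```
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("nyc-taxi-prepared.csv")  # placeholder for the prepared data

# Randomized 80/20 split; fix random_state so the split is reproducible.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
print(len(train_df), "training rows,", len(test_df), "test rows")
```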

## Train model


After you split your data, you can train a model. This model learns from patterns in your data. You can use it to make predictions or uncover insights.

SageMaker Canvas has both quick builds and standard builds. Use a standard build to train the best-performing model on your data.

Before you start training a model, you must first export the training dataset as a SageMaker Canvas dataset.

**To export your dataset**

1. Next to the node for the training dataset, choose the icon and select **Export**.

1. Select **SageMaker Canvas dataset**.

1. Choose **Export** to export the dataset.

After you've created a dataset, you can train a model on the SageMaker Canvas dataset that you've created. For information about training a model, see [Build a custom numeric or categorical prediction model](canvas-build-model-how-to.md#canvas-build-model-numeric-categorical).

## Evaluate model and make predictions


After training your machine learning model, it's crucial to evaluate its performance to ensure it meets your requirements and performs well on unseen data. Amazon SageMaker Canvas provides a user-friendly interface to assess your model's accuracy, review its predictions, and gain insights into its strengths and weaknesses. You can use the insights to make informed decisions about its deployment and potential areas for improvement.

Use the following procedure to evaluate a model before you deploy it.

**To evaluate a model**

1. Choose **My Models**.

1. Choose the model you've created.

1. Under **Versions**, select the version corresponding to the model.

You can now view the model evaluation metrics.

After you evaluate the model, you can make predictions on new data. We're using the test dataset that we've created.

To use the test dataset for predictions, we need to convert it into a SageMaker Canvas dataset, which is in a format that the model can interpret.

Use the following procedure to create a SageMaker Canvas dataset from the test dataset.

**To create a SageMaker Canvas dataset**

1. Next to the **Dataset (Test)** dataset, choose the radio button.

1. Select **Export**.

1. Select **SageMaker Canvas dataset**.

1. For **Dataset name**, specify a name for the dataset.

1. Choose **Export**.

Use the following procedure to make predictions. It assumes that you're still on the **Analyze** page.

**To make predictions on the test dataset**

1. Choose **Predict**.

1. Choose **Manual**.

1. Select the dataset that you've exported.

1. Choose **Generate predictions**.

1. When SageMaker Canvas has finished generating predictions, select the icon to the right of the dataset.

1. Choose **Preview** to view the predictions.

## Deploy a model


After you've evaluated your model, you can deploy it to an endpoint. You can submit requests to the endpoint to get predictions.

Use the following procedure to deploy a model. It assumes that you're still on the **Predict** page.

**To deploy a model**

1. Choose **Deploy**.

1. Choose **Create deployment**.

1. Choose **Deploy**.
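
Once the endpoint is in service, you can invoke it programmatically. The following boto3 sketch assumes a hypothetical endpoint name and an illustrative CSV payload; the feature order must match your model's inputs.

```
import boto3

runtime = boto3.client("sagemaker-runtime")

# "canvas-nyc-taxi-endpoint" is a placeholder for the name on the Deploy page.
response = runtime.invoke_endpoint(
    EndpointName="canvas-nyc-taxi-endpoint",
    ContentType="text/csv",
    Body="2.5,900,1,1,0,0",  # illustrative feature values only
)
print(response["Body"].read().decode("utf-8"))
```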

## Cleaning up


You've successfully completed the tutorial. To avoid incurring additional charges, delete the resources that you're not using.

Use the following procedure to delete the endpoint that you created. It assumes that you're still on the **Deploy** page.

**To delete an endpoint**

1. Choose the radio button to the right of your deployment.

1. Select **Delete deployment**.

1. Choose **Delete**.

After deleting the deployment, delete the datasets that you've created within SageMaker Canvas. Use the following procedure to delete the datasets.

**To delete the datasets**

1. Choose **Datasets** on the left-hand navigation.

1. Select the training dataset that you analyzed and the test dataset that you used for predictions.

1. Choose **Delete**.

To avoid incurring additional charges, you must log out of SageMaker Canvas. For more information, see [Logging out of Amazon SageMaker Canvas](canvas-log-out.md).

# Amazon SageMaker Canvas setup and permissions management (for IT administrators)


The following pages explain how IT administrators can configure Amazon SageMaker Canvas and grant permissions to users within their organizations. You learn how to set up the storage configuration, manage data encryption and VPCs, control access to specific capabilities like generative AI foundation models, integrate with other AWS services like Amazon Redshift, and more. By following these steps, you can tailor SageMaker Canvas for your users based on your organization's specific requirements.

You can also set up SageMaker Canvas for your users with AWS CloudFormation. For more information, see [AWS::SageMaker::App](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-sagemaker-app.html) in the *AWS CloudFormation User Guide*.
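
The CloudFormation resource corresponds to the `CreateApp` API. As a rough illustration, the following boto3 sketch creates a Canvas application for an existing domain and user profile; the IDs are placeholders, and the app name `default` is our assumption.

```
import boto3

sm = boto3.client("sagemaker")

# DomainId and UserProfileName are placeholders for your own values.
sm.create_app(
    DomainId="d-xxxxxxxxxxxx",
    UserProfileName="canvas-user",
    AppType="Canvas",
    AppName="default",  # assumption: Canvas apps are typically named "default"
)
```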

**Topics**
+ [Grant Your Users Permissions to Upload Local Files](canvas-set-up-local-upload.md)
+ [Set Up SageMaker Canvas for Your Users](setting-up-canvas-sso.md)
+ [Configure your Amazon S3 storage](canvas-storage-configuration.md)
+ [Grant permissions for cross-account Amazon S3 storage](canvas-permissions-cross-account.md)
+ [Grant Users Permissions to Use Large Data across the ML Lifecycle](canvas-large-data-permissions.md)
+ [Encrypt Your SageMaker Canvas Data with AWS KMS](canvas-kms.md)
+ [Store SageMaker Canvas application data in your own SageMaker AI space](canvas-spaces-setup.md)
+ [Grant Your Users Permissions to Build Custom Image and Text Prediction Models](canvas-set-up-cv-nlp.md)
+ [Grant Users Permissions to Use Amazon Bedrock and Generative AI Features in Canvas](canvas-fine-tuning-permissions.md)
+ [Update SageMaker Canvas for Your Users](canvas-update.md)
+ [Request a Quota Increase](canvas-requesting-quota-increases.md)
+ [Grant Users Permissions to Import Amazon Redshift Data](canvas-redshift-permissions.md)
+ [Grant Your Users Permissions to Send Predictions to Quick](canvas-quicksight-permissions.md)
+ [Applications management](canvas-manage-apps.md)
+ [Configure Amazon SageMaker Canvas in a VPC without internet access](canvas-vpc.md)
+ [Set up connections to data sources with OAuth](canvas-setting-up-oauth.md)

# Grant Your Users Permissions to Upload Local Files


If your users are uploading files from their local machines to SageMaker Canvas, you must attach a CORS (cross-origin resource sharing) configuration to the Amazon S3 bucket that they're using. When setting up or editing the SageMaker AI domain or user profile, you can specify either a custom Amazon S3 location or the default location, which is a SageMaker AI created Amazon S3 bucket with a name that uses the following pattern: `s3://sagemaker-{Region}-{your-account-id}`. SageMaker Canvas adds your users' data to the bucket whenever they upload a file.

To grant users permissions to upload local files to the bucket, you can attach a CORS configuration to it using either of the following procedures. You can use the first method when editing the settings of your domain, where you opt in to allow SageMaker AI to attach the CORS configuration to the bucket for you. You can also use the first method for editing a user profile within a domain. The second method is the manual method, where you can attach the CORS configuration to the bucket yourself.

## SageMaker AI domain settings method


To grant your users permissions to upload local files, you can edit the Canvas application configuration in the domain settings. This attaches a Cross-Origin Resource Sharing (CORS) configuration to the Canvas storage configuration's Amazon S3 bucket and grants all users in the domain permission to upload local files into SageMaker Canvas. By default, the permissions option is turned on when you set up a new domain, but you can turn this option on and off as needed.

**Note**  
If you have an existing CORS configuration on the storage configuration Amazon S3 bucket, turning on the local file upload option overwrites the existing configuration with the new configuration.

The following procedure shows how you can turn on this option by editing the domain settings in the SageMaker AI console.

1. Go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Domains**.

1. From the list of domains, choose your domain.

1. On the domain details page, select the **App Configurations** tab.

1. Go to the **Canvas** section and choose **Edit**.

1. Turn on the **Enable local file upload** toggle. This attaches the CORS configuration and grants local file upload permissions.

1. Choose **Submit**.

Users in the specified domain should now have local file upload permissions.

You can also grant permissions to specific user profiles in a domain by following the preceding procedure and going into the user profile settings instead of the overall domain settings.

## Amazon S3 bucket method


If you want to manually attach the CORS configuration to the SageMaker AI Amazon S3 bucket, use the following procedure.

1. Sign in to [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/).

1. Choose your bucket. If your domain uses the default SageMaker AI created bucket, the bucket’s name uses the following pattern: `s3://sagemaker-{Region}-{your-account-id}`.

1. Choose **Permissions**.

1. Navigate to **Cross-origin resource sharing (CORS)**.

1. Choose **Edit**.

1. Add the following CORS policy:

   ```
   [
       {
           "AllowedHeaders": [
               "*"
           ],
           "AllowedMethods": [
               "POST"
           ],
           "AllowedOrigins": [
               "*"
           ],
           "ExposeHeaders": []
       }
   ]
   ```

1. Choose **Save changes**.

In the preceding procedure, the CORS policy must have `"POST"` listed under `AllowedMethods`.
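
If you prefer to apply the configuration with the AWS SDK instead of the console, the following boto3 sketch sets the same rules; the bucket name is a placeholder.

```
import boto3

s3 = boto3.client("s3")

# Placeholder bucket; the default follows s3://sagemaker-{Region}-{account-id}.
s3.put_bucket_cors(
    Bucket="sagemaker-us-east-1-111122223333",
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedHeaders": ["*"],
                "AllowedMethods": ["POST"],
                "AllowedOrigins": ["*"],
                "ExposeHeaders": [],
            }
        ]
    },
)
```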

After you've gone through the procedure, you should have:
+ An IAM role assigned to each of your users.
+ Amazon SageMaker Studio Classic runtime permissions for each of your users. SageMaker Canvas uses Studio Classic to run the commands from your users.
+ If the users are uploading files from their local machines, a CORS policy attached to their Amazon S3 bucket.

If your users still can't upload the local files after you update the CORS policy, the browser might be caching the CORS settings from a previous upload attempt. If they're running into issues, instruct them to clear their browser cache and try again.

# Set Up SageMaker Canvas for Your Users


To set up Amazon SageMaker Canvas, do the following:
+ Create an Amazon SageMaker AI domain.
+ Create user profiles for the domain.
+ Set up Okta Single Sign-On (Okta SSO) for your users.
+ Activate link sharing for models.

Use Okta Single Sign-On (Okta SSO) to grant your users access to Amazon SageMaker Canvas. SageMaker Canvas supports SAML 2.0 SSO methods. The following sections guide you through the procedures to set up Okta SSO.

To set up a domain, see [Use custom setup for Amazon SageMaker AI](onboard-custom.md) and follow the instructions for setting up your domain using IAM authentication. You can use the following information to help you complete the procedure in the section:
+ You can ignore the step about creating projects.
+ You don't need to provide access to additional Amazon S3 buckets. Your users can use the default bucket that we provide when we create a role.
+ To grant your users access to share their notebooks with data scientists, turn on **Notebook Sharing Configuration**.
+ Use Amazon SageMaker Studio Classic version 3.19.0 or later. For information about updating Amazon SageMaker Studio Classic, see [Shut Down and Update Amazon SageMaker Studio Classic](studio-tasks-update-studio.md).

Use the following procedure to set up Okta. For all of the following procedures, you specify the same IAM role for `IAM-role`.

## Add the SageMaker Canvas application to Okta


Set up the sign-on method for Okta.

1. Sign in to the Okta Admin dashboard.

1. Choose **Add application**. Search for **AWS Account Federation**.

1. Choose **Add**.

1. Optional: Change the name to **Amazon SageMaker Canvas**.

1. Choose **Next**.

1. Choose **SAML 2.0** as the **Sign-On** method.

1. Choose **Identity Provider Metadata** to open the metadata XML file. Save the file locally.

1. Choose **Done**.

## Set up ID federation in IAM


AWS Identity and Access Management (IAM) is the AWS service that controls authentication and access to resources in your AWS account. Use the following procedure to set up Okta as a SAML identity provider in IAM.

1. Sign in to the AWS console.

1. Choose **AWS Identity and Access Management (IAM)**.

1. Choose **Identity Providers**.

1. Choose **Create Provider**.

1. For **Configure Provider**, specify the following:
   + **Provider Type** – From the dropdown list, choose **SAML**.
   + **Provider Name** – Specify **Okta**.
   + **Metadata Document** – Upload the XML document that you've saved locally from step 7 of [Add the SageMaker Canvas application to Okta](#canvas-set-up-okta).

1. Find your identity provider under **Identity Providers**. Copy its **Provider ARN** value.

1. For **Roles**, choose the IAM role that you're using for Okta SSO access.

1. Under **Trust Relationship** for the IAM role, choose **Edit Trust Relationship**.

1. Modify the IAM trust relationship policy by specifying the **Provider ARN** value that you copied, and add the following policy (a programmatic sketch follows this procedure):

------
#### [ JSON ]

****  

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "Federated": "arn:aws:iam::111122223333:saml-provider/Okta"
         },
         "Action": [
           "sts:AssumeRoleWithSAML",
           "sts:TagSession"
         ],
         "Condition": {
           "StringEquals": {
             "SAML:aud": "https://signin.aws.amazon.com/saml"
           }
         }
       },
       {
         "Effect": "Allow",
         "Principal": {
           "Federated": "arn:aws:iam::111122223333:saml-provider/Okta"
         },
         "Action": [
           "sts:SetSourceIdentity"
         ]
       }
     ]
   }
   ```

------

1. For **Permissions**, add the following policy:

------
#### [ JSON ]

****  

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "AmazonSageMakerPresignedUrlPolicy",
         "Effect": "Allow",
         "Action": [
           "sagemaker:CreatePresignedDomainUrl"
         ],
         "Resource": "*"
       }
     ]
   }
   ```

------
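
As an alternative to editing the trust relationship in the console, you can apply it with the AWS SDK. The following boto3 sketch reuses the trust policy from step 9 of the preceding procedure; the role name is a placeholder for your `IAM-role`.

```
import json

import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Federated": "arn:aws:iam::111122223333:saml-provider/Okta"},
            "Action": ["sts:AssumeRoleWithSAML", "sts:TagSession"],
            "Condition": {"StringEquals": {"SAML:aud": "https://signin.aws.amazon.com/saml"}},
        },
        {
            "Effect": "Allow",
            "Principal": {"Federated": "arn:aws:iam::111122223333:saml-provider/Okta"},
            "Action": ["sts:SetSourceIdentity"],
        },
    ],
}

# "OktaSSORole" is a placeholder for the IAM role that you use for Okta SSO.
iam.update_assume_role_policy(
    RoleName="OktaSSORole",
    PolicyDocument=json.dumps(trust_policy),
)
```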

## Configure SageMaker Canvas in Okta


Configure Amazon SageMaker Canvas in Okta using the following procedure.

To configure Amazon SageMaker Canvas to use Okta, follow the steps in this section. You must specify unique user names for each **SageMakerStudioUserProfileName** field. For example, you can use `user.login` as a value. If the username is different from the SageMaker Canvas profile name, choose a different uniquely identifying attribute. For example, you can use an employee's ID number for the profile name.

For an example of values that you can set for **Attributes**, see the code following the procedure.

1. Under **Directory**, choose **Groups**.

1. Add a group with the following pattern: `sagemaker#canvas#IAM-role#AWS-account-id`.

1. In Okta, open the **AWS Account Federation** application integration configuration.

1. Select **Sign On** for the AWS Account Federation application.

1. Choose **Edit** and specify the following:
   + SAML 2.0
   + **Default Relay State** – https://*Region*.console.aws.amazon.com/sagemaker/home?region=*Region*#/studio/canvas/open/*StudioId*. You can find the Studio Classic ID in the console: [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/)

1. Choose **Attributes**.

1. In the **SageMakerStudioUserProfileName** fields, specify unique values for each username. The usernames must match the usernames that you've created in the AWS console.

   ```
   Attribute 1:
   Name: https://aws.amazon.com/SAML/Attributes/PrincipalTag:SageMakerStudioUserProfileName
   Value: ${user.login}
   
   Attribute 2:
   Name: https://aws.amazon.com/SAML/Attributes/TransitiveTagKeys
   Value: {"SageMakerStudioUserProfileName"}
   ```

1. Select **Environment Type**. Choose **Regular AWS**.
   + If your environment type isn't listed, you can set your ACS URL in the **ACS URL** field. If your environment type is listed, you don't need to enter your ACS URL.

1. For **Identity Provider ARN**, specify the ARN you used in step 6 of the preceding procedure.

1. Specify a **Session Duration**.

1. Choose **Join all roles**.

1. Turn on **Use Group Mapping** by specifying the following fields:
   + **App Filter** – `okta`
   + **Group Filter** – `^aws\#\S+\#(?{{IAM-role}}[\w\-]+)\#(?{{accountid}}\d+)$`
   + **Role Value Pattern** – `arn:aws:iam::$accountid:saml-provider/Okta,arn:aws:iam::$accountid:role/IAM-role`

1. Choose **Save/Next**.

1. Under **Assignments**, assign the application to the group that you've created.

## Add optional policies on access control in IAM


In IAM, you can apply the following policy to the administrator user who creates the user profiles.

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "CreateSageMakerStudioUserProfilePolicy",
            "Effect": "Allow",
            "Action": "sagemaker:CreateUserProfile",
            "Resource": "*",
            "Condition": {
                "ForAnyValue:StringEquals": {
                    "aws:TagKeys": [
                        "studiouserid"
                    ]
                }
            }
        }
    ]
}
```

------

If you choose to add the preceding policy to the admin user, you must use the following permissions from [Set up ID federation in IAM](#set-up-id-federation-IAM).

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AmazonSageMakerPresignedUrlPolicy",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreatePresignedDomainUrl"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "sagemaker:ResourceTag/studiouserid": "${aws:PrincipalTag/SageMakerStudioUserProfileName}"
                }
            }
        }
    ]
}
```

------

# Configure your Amazon S3 storage


When you set up your SageMaker Canvas application, the default storage location for model artifacts, datasets, and other application data is an Amazon S3 bucket that Canvas creates. This default Amazon S3 bucket follows the naming pattern `s3://sagemaker-{Region}-{your-account-id}` and exists in the same Region as your Canvas application. However, you can customize the storage location and specify your own Amazon S3 bucket for storing Canvas application data. You might want to use your own Amazon S3 bucket for storing application data for any of the following reasons:
+ Your organization has internal naming conventions for Amazon S3 buckets.
+ You want to enable cross-account access to model artifacts or other Canvas data.
+ You want to be compliant with internal security guidelines, such as restricting users to specific Amazon S3 buckets or model artifacts.
+ You want enhanced visibility and access to logs produced by Canvas, independent of the AWS console or SageMaker Studio Classic.

By specifying your own Amazon S3 bucket, you gain more control over your storage and can stay compliant with your organization's requirements.

To get started, you can either create a new SageMaker AI domain or user profile, or you can update an existing domain or user profile. Note that the user profile settings override the domain-level settings. For example, you can use the default bucket configuration at the domain level, but you can specify a custom Amazon S3 bucket for an individual user. After specifying your own Amazon S3 bucket for the domain or user profile, Canvas creates a subfolder called `Canvas/<UserProfileName>` under the input Amazon S3 URI and saves all artifacts generated in the Canvas application under this subfolder.

**Important**  
If you update an existing domain or user profile, you no longer have access to your Canvas artifacts from the previous location. Your files are still in the old Amazon S3 location, but you can no longer view them from Canvas. The new configuration takes effect the next time you log into the application.

For more information about granting cross-account access to your Amazon S3 bucket, see [Granting cross-account object permissions](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-walkthroughs-managing-access-example4.html#access-policies-walkthrough-example4-overview) in the *Amazon S3 User Guide*.

The following sections describe how to specify a custom Amazon S3 bucket for your Canvas storage configuration. If you’re setting up a new SageMaker AI domain (or a new user in a domain), then use the [New domain setup method](#canvas-storage-configuration-new-domain) or the [New user profile setup method](#canvas-storage-configuration-new-user). If you have an existing Canvas user profile and would like to update the profile's storage configuration, use the [Existing user method](#canvas-storage-configuration-existing-user).

## Before you begin


If you’re specifying an Amazon S3 URI from a different AWS account, or if you’re using a bucket that is encrypted with AWS KMS, then you must configure permissions before proceeding. You must grant AWS IAM permissions to ensure that Canvas can download and upload objects to and from your bucket. For detailed information on how to grant the required permissions, see [Grant permissions for cross-account Amazon S3 storage](canvas-permissions-cross-account.md).

Additionally, the final Amazon S3 URI for the training folder in your Canvas storage location must be 128 characters or less. The final Amazon S3 URI consists of your bucket path `s3://<your-bucket-name>/<folder-name>/` plus the path that Canvas adds to your bucket: `Canvas/<user-profile-name>/Training`. For example, an acceptable path that is less than 128 characters is `s3://<amzn-s3-demo-bucket>/<machine-learning>/Canvas/<user-1>/Training`.
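
If you want to validate this constraint before you configure the storage location, you can compute the final path length yourself. The following is a minimal sketch; the bucket path and user profile name are placeholders:

```
# Minimal sketch: check that the final Canvas training URI is within the
# 128-character limit. The bucket path and user profile name are placeholders.
bucket_path = "s3://amzn-s3-demo-bucket/machine-learning/"
user_profile_name = "user-1"

final_uri = f"{bucket_path}Canvas/{user_profile_name}/Training"
assert len(final_uri) <= 128, f"{final_uri} is {len(final_uri)} characters; the limit is 128"
print(f"{final_uri} ({len(final_uri)} characters)")
```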

## New domain setup method


If you’re setting up a new domain and Canvas application, use this section to configure the storage location at the domain level. This configuration applies to all new users you create in the domain, unless you specify a different storage location for individual user profiles.

When doing a **Standard setup** for your domain, on the **Step 3: Configure Applications - Optional** page, use the following procedure for the **Canvas** section:

1. For the **Canvas storage configuration**, do the following:

   1. Select **System managed** if you want to set the location to the default SageMaker AI bucket that follows the pattern `s3://sagemaker-{Region}-{your-account-id}`.

   1. Select **Custom S3** to specify your own Amazon S3 bucket as the storage location. Then, enter the Amazon S3 URI.

   1. (Optional) For **Encryption key**, specify a KMS key for encrypting Canvas artifacts stored at the specified location. 

1. Finish setting up the domain and choose **Submit**.

Your domain is now configured to use the Amazon S3 location you specified for SageMaker Canvas application storage.
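
If you create domains programmatically instead of through the console, you can pass the equivalent configuration to the `CreateDomain` API. The following Boto3 sketch assumes the `CanvasAppSettings.WorkspaceSettings` field names for the custom storage location and encryption key; all IDs, names, and ARNs are placeholders:

```
import boto3

sagemaker = boto3.client("sagemaker")

# All IDs, names, and ARNs below are placeholders.
sagemaker.create_domain(
    DomainName="my-canvas-domain",
    AuthMode="IAM",
    VpcId="vpc-0abc1234",
    SubnetIds=["subnet-0abc1234"],
    DefaultUserSettings={
        "ExecutionRole": "arn:aws:iam::111122223333:role/MySageMakerExecutionRole",
        "CanvasAppSettings": {
            "WorkspaceSettings": {
                # Equivalent to choosing "Custom S3" in the console
                "S3ArtifactPath": "s3://amzn-s3-demo-bucket/machine-learning/",
                # Optional KMS key for Canvas artifacts at that location
                "S3KmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",
            }
        },
    },
)
```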

## New user profile setup method


If you’re setting up a new user profile in your domain, use this section to configure the storage location for the user. This configuration overrides the domain-level configuration.

When adding a user profile to your domain, for **Step 2: Configure Applications**, use the following procedure for the **Canvas** section:

1. For the **Canvas storage configuration**, do the following:

   1. Select **System managed** if you want to set the location to the default SageMaker AI created bucket that follows the pattern `s3://sagemaker-{Region}-{your-account-id}`.

   1. Select **Custom S3** to specify your own Amazon S3 bucket as the storage location. Then, enter the Amazon S3 URI.

   1. (Optional) For **Encryption key**, specify a KMS key for encrypting Canvas artifacts stored at the specified location. 

1. Finish setting up the user profile and choose **Submit**.

Your user profile is now configured to use the Amazon S3 location you specified for SageMaker Canvas application storage.

## Existing user method


If you have an existing Canvas user profile and would like to update the Amazon S3 storage location, you can edit the SageMaker AI domain or user profile settings. The change takes effect the next time you log into the Canvas application.

**Note**  
When you change the storage location for an existing Canvas application, you lose access to your Canvas artifacts from the previous storage location. The artifacts are still stored in the old Amazon S3 location, but you can no longer view them from Canvas.

Remember that the user profile settings override the general domain settings, so you can update the Amazon S3 storage location for specific user profiles without changing it for all of the users. You can update the storage configuration for an existing domain or user by using the following procedures.

------
#### [ Update an existing domain ]

Use the following procedure to update the storage configuration for a domain.

1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**. 

1. From the list of domains, choose your domain.

1. On the **Domain details** page, choose the **App Configurations** tab.

1. Scroll down to the **Canvas** section and choose **Edit**.

1. The **Edit Canvas settings** page opens. For the **Canvas storage configuration** section, do the following:

   1. Select **System managed** if you want to set the location to the default SageMaker AI created bucket that follows the pattern `s3://sagemaker-{Region}-{your-account-id}`.

   1. Select **Custom S3** to specify your own Amazon S3 bucket as the storage location. Then, enter the Amazon S3 URI.

   1. (Optional) For **Encryption key**, specify a KMS key for encrypting Canvas artifacts stored at the specified location. 

1. Finish any other modifications you want to make to the domain, and then choose **Submit** to save your changes.

------
#### [ Update an existing user profile ]

Use the following procedure to update the storage configuration for a user profile.

1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. From the list of domains, choose your domain.

1. From the list of users in the domain, choose the user whose configuration you want to edit.

1. On the **User Details** page, choose **Edit**.

1. In the navigation pane, choose **Canvas settings**.

1. For the **Canvas storage configuration**, do the following:

   1. Select **System managed** if you want to set the location to the default SageMaker AI bucket that follows the pattern `s3://sagemaker-{Region}-{your-account-id}`.

   1. Select **Custom S3** to specify your own Amazon S3 bucket as the storage location. Then, enter the Amazon S3 URI.

   1. (Optional) For **Encryption key**, specify a KMS key for encrypting Canvas artifacts stored at the specified location. 

1. Finish any other modifications you want to make to the user profile, and then choose **Submit** to save your changes.

------

The storage location for your Canvas user profile should now be updated. The next time you log into the Canvas application, you receive a notification that the storage location has been updated. You lose access to any previous artifacts that you created in Canvas. You can still access the files in Amazon S3, but you can no longer view them in Canvas.
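
If you manage user profiles programmatically, the same change can be made with the `UpdateUserProfile` API. The following Boto3 sketch assumes the `CanvasAppSettings.WorkspaceSettings` field names; the domain ID, user profile name, and ARNs are placeholders:

```
import boto3

sagemaker = boto3.client("sagemaker")

# All IDs, names, and ARNs below are placeholders.
sagemaker.update_user_profile(
    DomainId="d-xxxxxxxxxxxx",
    UserProfileName="user-1",
    UserSettings={
        "CanvasAppSettings": {
            "WorkspaceSettings": {
                "S3ArtifactPath": "s3://amzn-s3-demo-bucket/machine-learning/",
                "S3KmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",
            }
        }
    },
)
```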

# Grant permissions for cross-account Amazon S3 storage


When setting up your SageMaker AI domain or user profile for users to access SageMaker Canvas, you specify an Amazon S3 storage location for Canvas artifacts. These artifacts include saved copies of your input datasets, model artifacts, predictions, and other application data. You can either use the default SageMaker AI created Amazon S3 bucket, or you can customize the storage location and specify your own bucket for storing Canvas application data.

You can specify an Amazon S3 bucket in another AWS account for storing your Canvas data, but first you must grant cross-account permissions so that Canvas can access the bucket.

The following sections describe how to grant permissions to Canvas for uploading and downloading objects to and from an Amazon S3 bucket in another account. There are additional permissions for when your bucket is encrypted with AWS KMS.

## Requirements


Before you begin, review the following requirements:
+ Cross-account Amazon S3 buckets (and any associated AWS KMS keys) must be in the same AWS Region as the Canvas user domain or user profile.
+ The final Amazon S3 URI for the training folder in your Canvas storage location must be 128 characters or less. The final S3 URI consists of your bucket path `s3://<your-bucket-name>/<folder-name>/` plus the path that Canvas adds to your bucket: `Canvas/<user-profile-name>/Training`. For example, an acceptable path that is less than 128 characters is `s3://<amzn-s3-demo-bucket>/<machine-learning>/Canvas/<user-1>/Training`.

## Permissions for cross-account Amazon S3 buckets


The following section outlines the basic steps for granting the necessary permissions so that Canvas can access your Amazon S3 bucket in another account. For more detailed instructions, see [Example 2: Bucket owner granting cross-account bucket permissions](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-walkthroughs-managing-access-example2.html) in the *Amazon S3 User Guide*.

1. Create an Amazon S3 bucket, `bucketA`, in Account A.

1. The Canvas user exists in another account called Account B. In the following steps, we refer to the Canvas user's IAM role as `roleB` in Account B.

   Give the IAM role `roleB` in Account B permission to download (`GetObject`) and upload (`PutObject`) objects to and from `bucketA` in Account A by attaching an IAM policy.

   To limit access to a specific bucket folder, define the folder name in the resource element, such as `arn:aws:s3:::<bucketA>/FolderName/*`. For more information, see [How can I use IAM policies to grant user-specific access to specific folders?](https://aws.amazon.com/premiumsupport/knowledge-center/iam-s3-user-specific-folder/)
**Note**  
Bucket-level actions, such as `GetBucketCors` and `GetBucketLocation`, should be added on bucket-level resources, not folders.

   The following example IAM policy grants the required permissions for `roleB` to access objects in `bucketA`:

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "s3:GetObject",
                   "s3:PutObject",
                   "s3:DeleteObject"
               ],
               "Resource": [
                   "arn:aws:s3:::bucketA/FolderName/*"
               ]
           },
           {
               "Effect": "Allow",
               "Action": [
                   "s3:ListBucket",
                   "s3:GetBucketCors",
                   "s3:GetBucketLocation"
               ],
               "Resource": [
                   "arn:aws:s3:::bucketA"
               ]
           }
       ]
   }
   ```

------

1. Configure the bucket policy for `bucketA` in Account A to grant permissions to the IAM role `roleB` in Account B.
**Note**  
Admins must also turn off **Block all public access** under the bucket **Permissions** section.

   The following is an example bucket policy for `bucketA` to grant the necessary permissions to `roleB`:

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Principal": {
                   "AWS": "arn:aws:iam::111122223333:role/roleB"
               },
               "Action": [
                   "s3:DeleteObject",
                   "s3:GetObject",
                   "s3:PutObject"
               ],
               "Resource": "arn:aws:s3:::bucketA/FolderName/*"
           },
           {
               "Effect": "Allow",
               "Principal": {
                   "AWS": "arn:aws:iam::111122223333:role/roleB"
               },
               "Action": [
                   "s3:ListBucket",
                   "s3:GetBucketCors",
                   "s3:GetBucketLocation"
               ],
               "Resource": "arn:aws:s3:::bucketA"
           }
       ]
   }
   ```

------

After configuring the preceding permissions, your Canvas user profile in Account B can now use the Amazon S3 bucket in Account A as the storage location for Canvas artifacts.
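
To confirm that the grants work before pointing Canvas at the bucket, you can exercise the permissions directly with credentials for `roleB`. The following is a minimal Boto3 sketch using the placeholder bucket and folder names from the preceding policies:

```
import boto3

# Run this with credentials for roleB in Account B. The bucket and folder
# names are the placeholders used in the preceding policies.
s3 = boto3.client("s3")

# Upload, read back, and remove a test object to confirm the grants.
s3.put_object(Bucket="bucketA", Key="FolderName/access-test.txt", Body=b"ok")
obj = s3.get_object(Bucket="bucketA", Key="FolderName/access-test.txt")
print(obj["Body"].read())
s3.delete_object(Bucket="bucketA", Key="FolderName/access-test.txt")
```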

## Permissions for cross-account Amazon S3 buckets encrypted with AWS KMS


The following procedure shows you how to grant the necessary permissions so that Canvas can access your Amazon S3 bucket in another account that is encrypted with AWS KMS. The steps are similar to the procedure above, but with additional permissions. For more information about granting cross-account KMS key access, see [Allowing users in other accounts to use a KMS key](https://docs.aws.amazon.com/kms/latest/developerguide/key-policy-modifying-external-accounts.html) in the *AWS KMS Developer Guide*.

1. Create an Amazon S3 bucket, `bucketA`, and an Amazon S3 KMS key `s3KmsInAccountA` in Account A.

1. The Canvas user exists in another account called Account B. In the following steps, we refer to the Canvas user's IAM role as `roleB` in Account B.

   Give the IAM role `roleB` in Account B permission to do the following:
   + Download (`GetObject`) and upload (`PutObject`) objects to and from `bucketA` in Account A.
   + Access the AWS KMS key `s3KmsInAccountA` in Account A.

   The following example IAM policy grants the required permissions for `roleB` to access objects in `bucketA` and use the KMS key `s3KmsInAccountA`:

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "s3:GetObject",
                   "s3:PutObject",
                   "s3:DeleteObject"
               ],
               "Resource": [
                   "arn:aws:s3:::bucketA/FolderName/*"
               ]
           },
           {
               "Effect": "Allow",
               "Action": [
                   "s3:GetBucketCors",
                   "s3:GetBucketLocation"
               ],
               "Resource": [
                   "arn:aws:s3:::bucketA"
               ]
           },
           {
               "Action": [
                   "kms:DescribeKey",
                   "kms:CreateGrant",
                   "kms:RetireGrant",
                   "kms:GenerateDataKey",
                   "kms:GenerateDataKeyWithoutPlainText",
                   "kms:Decrypt"
               ],
               "Effect": "Allow",
               "Resource": "arn:aws:kms:us-east-1:111122223333:key/s3KmsInAccountA"
           }
       ]
   }
   ```

------

1. Configure the bucket policy for `bucketA` and the key policy for `s3KmsInAccountA` in Account A to grant permissions to the IAM role `roleB` in Account B.

   The following is an example bucket policy for `bucketA` to grant the necessary permissions to `roleB`:

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Principal": {
                   "AWS": "arn:aws:iam::111122223333:role/roleB"
               },
               "Action": [
                   "s3:DeleteObject",
                   "s3:GetObject",
                   "s3:PutObject"
               ],
               "Resource": "arn:aws:s3:::bucketA/FolderName/*"
           },
           {
               "Effect": "Allow",
               "Principal": {
                   "AWS": "arn:aws:iam::111122223333:role/roleB"
               },
               "Action": [
                   "s3:GetBucketCors",
                   "s3:GetBucketLocation"
               ],
               "Resource": "arn:aws:s3:::bucketA"
           }
       ]
   }
   ```

------

   The following example is a key policy that you attach to the KMS key `s3KmsInAccountA` in Account A to grant `roleB` access. For more information about how to create and attach a key policy statement, see [Creating a key policy](https://docs.aws.amazon.com/kms/latest/developerguide/key-policy-overview.html) in the *AWS KMS Developer Guide*.

   ```
   {
     "Sid": "Allow use of the key",
     "Effect": "Allow",
     "Principal": {
       "AWS": [
         "arn:aws:iam::accountB:role/roleB"
       ]
     },
     "Action": [
           "kms:DescribeKey",
           "kms:CreateGrant",
           "kms:RetireGrant",
           "kms:GenerateDataKey",
           "kms:GenerateDataKeyWithoutPlainText",
           "kms:Decrypt"
     ],
     "Resource": "*"
   }
   ```

After configuring the preceding permissions, your Canvas user profile in Account B can now use the encrypted Amazon S3 bucket in Account A as the storage location for Canvas artifacts.
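
As with the bucket permissions, you can verify the key grants with credentials for `roleB` before using the bucket from Canvas. The following is a minimal Boto3 sketch using the placeholder key name from the preceding policies:

```
import boto3

# Run this with credentials for roleB in Account B. The key ARN uses the
# placeholder key name from the preceding policies.
kms = boto3.client("kms", region_name="us-east-1")
key_arn = "arn:aws:kms:us-east-1:111122223333:key/s3KmsInAccountA"

# Both calls succeed only if the key policy and roleB's IAM policy grant access.
kms.describe_key(KeyId=key_arn)
kms.generate_data_key(KeyId=key_arn, KeySpec="AES_256")
```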

# Grant Users Permissions to Use Large Data across the ML Lifecycle
Grant Large Data Permissions

Amazon SageMaker Canvas users working with datasets larger than 10 GB in CSV format or 2.5 GB in Parquet format require specific permissions for large data processing. These permissions are essential for managing large-scale data throughout the machine learning lifecycle. When datasets exceed the stated thresholds, or the application's local memory capacity, SageMaker Canvas uses Amazon EMR Serverless for efficient processing. This applies to:
+ Data Import: Importing large datasets with random or stratified sampling.
+ Data Preparation: Exporting processed data from Data Wrangler in Canvas to Amazon S3, to a new Canvas dataset, or to a Canvas model.
+ Model Building: Training models on large datasets.
+ Inference: Making predictions on large datasets.

By default, SageMaker Canvas uses EMR Serverless to run these remote jobs with the following app settings:
+ Pre-Initialized capacity: Not configured
+ Application limits: Maximum capacity of 400 vCPUs, a maximum of 16 concurrent vCPUs per account, 3000 GB memory, and 20000 GB disk
+ Metastore configuration: AWS Glue Data Catalog
+ Application logs: AWS managed storage (enabled), using an AWS owned encryption key
+ Application behavior: Auto-starts on job submission and auto-stops after the application is idle for 15 minutes

To enable these large data processing capabilities, users need the necessary permissions, which can be granted through the Amazon SageMaker AI domain settings. The method for granting these permissions depends on how your Amazon SageMaker AI domain was set up initially. We'll cover three main scenarios:
+ Quick domain setup
+ Custom domain setup (with public internet access/without VPC)
+ Custom domain setup (with VPC and without public internet access)

Each scenario requires specific steps to ensure that users have the required permissions to leverage EMR Serverless for large data processing across the entire machine learning lifecycle in SageMaker Canvas.

## Scenario 1: Quick domain setup


If you used the **Quick setup** option when creating your SageMaker AI domain, follow these steps:

1. Navigate to the Amazon SageMaker AI domain settings:

   1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

   1. In the left navigation pane, choose **Domains**.

   1. Select your domain.

   1. Choose the **App Configurations** tab.

   1. Scroll to the **Canvas** section and choose **Edit**.

1. Enable large data processing:

   1. In the **Large data processing configuration** section, turn on **Enable EMR Serverless for large data processing**.

   1. Create or select an EMR Serverless role:

      1. Choose **Create and use a new execution role** to create a new IAM role that has a trust relationship with EMR Serverless and the [AWS managed policy: AmazonSageMakerCanvasEMRServerlessExecutionRolePolicy](security-iam-awsmanpol-canvas.md#security-iam-awsmanpol-AmazonSageMakerCanvasEMRServerlessExecutionRolePolicy) policy attached. This IAM role is assumed by Canvas to create EMR Serverless jobs.

      1. Alternatively, if you already have an execution role with a trust relationship for EMR Serverless, then select **Use an existing execution role** and choose your role from the dropdown.
         + The existing role must have a name that begins with the prefix `AmazonSageMakerCanvasEMRSExecutionAccess-`.
         + The role you select should also have at least the permissions described in the [AWS managed policy: AmazonSageMakerCanvasEMRServerlessExecutionRolePolicy](security-iam-awsmanpol-canvas.md#security-iam-awsmanpol-AmazonSageMakerCanvasEMRServerlessExecutionRolePolicy) policy.
         + The role should have an EMR Serverless trust policy, as shown below:

------
#### [ JSON ]

****  

           ```
           {
               "Version":"2012-10-17",		 	 	 
               "Statement": [
                   {
                       "Sid": "EMRServerlessTrustPolicy",
                       "Effect": "Allow",
                       "Principal": {
                           "Service": "emr-serverless.amazonaws.com"
                       },
                       "Action": "sts:AssumeRole",
                       "Condition": {
                           "StringEquals": {
                               "aws:SourceAccount": "111122223333"
                           }
                       }
                   }
               ]
           }
           ```

------

1. (Optional) Add Amazon S3 permissions for custom Amazon S3 buckets:

   1. The Canvas managed policy automatically grants read and write permissions for Amazon S3 buckets with `sagemaker` or `SageMaker` in their names. It also grants read permissions for objects in custom Amazon S3 buckets with the tag `"SageMaker": "true"`.

   1. For custom Amazon S3 buckets without the required tag, add the following policy to your EMR Serverless role:


------
#### [ JSON ]

****  

      ```
      {
          "Version":"2012-10-17",		 	 	 
          "Statement": [
              {
                  "Effect": "Allow",
                  "Action": [
                      "s3:GetObject",
                      "s3:PutObject",
                      "s3:DeleteObject"
                  ],
                  "Resource": [
                      "arn:aws:s3:::*"
                  ]
              }
          ]
      }
      ```

------

   1. We recommend that you scope down the permissions to specific Amazon S3 buckets that you want Canvas to access.

1. Save your changes and restart your SageMaker Canvas application.
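
If you administer domains programmatically rather than through the console, the same setting can be applied with the `UpdateDomain` API. The following Boto3 sketch assumes the `CanvasAppSettings.EmrServerlessSettings` field names; the domain ID and role ARN are placeholders:

```
import boto3

sagemaker = boto3.client("sagemaker")

# The domain ID and role ARN are placeholders.
sagemaker.update_domain(
    DomainId="d-xxxxxxxxxxxx",
    DefaultUserSettings={
        "CanvasAppSettings": {
            "EmrServerlessSettings": {
                "ExecutionRoleArn": "arn:aws:iam::111122223333:role/AmazonSageMakerCanvasEMRSExecutionAccess-example",
                "Status": "ENABLED",
            }
        }
    },
)
```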

## Scenario 2: Custom domain setup (with public internet access/without VPC)


If you created or use a custom domain, follow steps 1-3 from Scenario 1, and then complete the following additional steps:

1. Add permissions for the Amazon ECR `DescribeImages` operation to your Amazon SageMaker AI execution role, because Canvas uses public Amazon ECR Docker images for data preparation and model training:

   1. Sign in to the AWS console and open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

   1. Choose **Roles**.

   1. In the search box, search for your SageMaker AI execution role by name and select it.

   1. Add the following policy to your SageMaker AI execution role. This can be done either by adding it as a new inline policy or by appending the policy statement to an existing one. Note that an IAM role can have a maximum of 10 policies attached.

------
#### [ JSON ]

****  

      ```
      {
          "Version":"2012-10-17",		 	 	 
          "Statement": [{
              "Sid": "ECRDescribeImagesOperation",
              "Effect": "Allow",
              "Action": "ecr:DescribeImages",
              "Resource": [
                  "arn:aws:ecr:*:*:repository/sagemaker-data-wrangler-emr-container",
                  "arn:aws:ecr:*:*:repository/ap-dataprep-emr"
              ]
          }]
      }
      ```

------

1. Save your changes and restart your SageMaker Canvas application.

## Scenario 3: Custom domain setup (with VPC and without public internet access)


If you created or use a custom domain, complete all steps from Scenario 2, and then follow these additional steps:

1. Ensure your VPC subnets are private:

   1. Verify that the route table for your subnets doesn't have an entry mapping `0.0.0.0/0` to an Internet Gateway. You can also verify this programmatically; see the sketch at the end of this section.

1. Add permissions for creating network interfaces:

   1. When using SageMaker Canvas with EMR Serverless for large-scale data processing, EMR Serverless requires the ability to create Amazon EC2 elastic network interfaces (ENIs) to enable network communication between EMR Serverless applications and your VPC resources.

   1. Add the following policy to your Amazon SageMaker AI execution role. This can be done either by adding it as a new inline policy or by appending the policy statement to an existing one. Note that an IAM role can have a maximum of 10 policies attached.

------
#### [ JSON ]

****  

      ```
      {
          "Version":"2012-10-17",		 	 	 
          "Statement": [
              {
                  "Sid": "AllowEC2ENICreation",
                  "Effect": "Allow",
                  "Action": [
                      "ec2:CreateNetworkInterface"
                  ],
                  "Resource": [
                      "arn:aws:ec2:*:*:network-interface/*"
                  ],
                  "Condition": {
                      "StringEquals": {
                          "aws:CalledViaLast": "ops.emr-serverless.amazonaws.com"
                      }
                  }
              }
          ]
      }
      ```

------

1. (Optional) Restrict ENI creation to specific subnets:

   1. To further secure your setup, you can restrict the creation of ENIs to specific subnets within your VPC by tagging the allowed subnets and security groups and referencing those tags in an IAM policy condition.

   1. Use the following IAM policy to ensure that EMR Serverless applications can only create Amazon EC2 ENIs within the allowed subnets and security groups:

      ```
      {
          "Sid": "AllowEC2ENICreationInSubnetAndSecurityGroupWithEMRTags",
          "Effect": "Allow", 
          "Action": [
              "ec2:CreateNetworkInterface"
          ],
          "Resource": [
              "arn:aws:ec2:*:*:subnet/*",
              "arn:aws:ec2:*:*:security-group/*"
          ],
          "Condition": {
              "StringEquals": {
                  "aws:ResourceTag/KEY": "VALUE"
              }
          }
      }
      ```

1. Follow the steps on the page [Configure Amazon SageMaker Canvas in a VPC without internet access](canvas-vpc.md) to set the VPC endpoint for Amazon S3, which is required by EMR Serverless and other AWS services that are used by SageMaker Canvas.

1. Save your changes and restart your SageMaker Canvas application.

By following these steps, you can enable large data processing in SageMaker Canvas for various domain setups, including those with custom VPC configurations. Remember to restart your SageMaker Canvas application after making these changes to apply the new permissions.
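
To verify that your subnets are private (step 1 of the preceding procedure) without clicking through the console, you can inspect the associated route tables. The following is a minimal Boto3 sketch with placeholder subnet IDs:

```
import boto3

ec2 = boto3.client("ec2")

# Placeholder subnet IDs for the subnets used by your domain.
subnet_ids = ["subnet-0abc1234", "subnet-0def5678"]

response = ec2.describe_route_tables(
    Filters=[{"Name": "association.subnet-id", "Values": subnet_ids}]
)

# A private subnet must not route 0.0.0.0/0 to an Internet Gateway (igw-*).
for table in response["RouteTables"]:
    for route in table["Routes"]:
        if (
            route.get("DestinationCidrBlock") == "0.0.0.0/0"
            and route.get("GatewayId", "").startswith("igw-")
        ):
            print(f"{table['RouteTableId']} routes to an Internet Gateway; subnet is not private")
```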

# Encrypt Your SageMaker Canvas Data with AWS KMS


You might have data that you want to encrypt while using Amazon SageMaker Canvas, such as your private company information or customer data. SageMaker Canvas uses AWS Key Management Service to protect your data. AWS KMS is a service that you can use to create and manage cryptographic keys for encrypting your data. For more information about AWS KMS, see [AWS Key Management Service](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html) in the *AWS KMS Developer Guide*.

Amazon SageMaker Canvas provides you with several options for encrypting your data. SageMaker Canvas provides default encryption within the application for tasks such as building your model and generating insights. You can also choose to encrypt data stored in Amazon S3 to protect your data at rest. SageMaker Canvas supports importing encrypted datasets, so you can establish an encrypted workflow. The following sections describe how you can use AWS KMS encryption to protect your data while building models with SageMaker Canvas.

## Encrypt your data in SageMaker Canvas


With SageMaker Canvas, you can use two different AWS KMS encryption keys to encrypt your data, which you can specify when [setting up your domain](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-studio-onboard.html) using the standard domain setup. These keys are specified in the following domain setup steps:
+ **Step 3: Configure Applications - (Optional)** – When configuring the **Canvas storage configuration** section, you can specify an **Encryption key**. This is a KMS key that SageMaker Canvas uses for long-term storage of model objects and datasets, which are stored in the provided Amazon S3 bucket for your domain. If you create your domain with the [CreateDomain](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateDomain.html) API, use the `S3KmsKeyId` field to specify this key.
+ **Step 6: Configure storage** – SageMaker Canvas uses one key for encrypting the Amazon SageMaker Studio private space that is created for your Canvas application, which includes temporary application storage, visualizations, and compute jobs (such as building models). You can use either the default AWS managed key or specify your own. If you specify your own AWS KMS key, the data stored in the `/home/sagemaker-user` directory is encrypted with your key. If you don't specify an AWS KMS key, the data inside `/home/sagemaker-user` is encrypted with an AWS managed key. Regardless of whether you specify an AWS KMS key, all of the data outside of the working directory is encrypted with an AWS managed key. To learn more about the Studio space and your Canvas application storage, see [Store SageMaker Canvas application data in your own SageMaker AI space](canvas-spaces-setup.md). If you create your domain with the [CreateDomain](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateDomain.html) API, use the `KmsKeyId` field to specify this key.

The preceding keys can be the same or different KMS keys.

### Prerequisites


To use your own KMS key for either of the previously described purposes, you must first grant your user's IAM role permission to use the key. Then, you can specify the KMS key when setting up your domain.

The simplest way to grant your role permission to use the key is to modify the key policy. Use the following procedure to grant your role the necessary permissions.

1. Open the [AWS KMS console](https://console.aws.amazon.com/kms/) and choose your KMS key.

1. In the **Key Policy** section, choose **Switch to policy view**.

1. Modify the key's policy to grant permissions for the `kms:GenerateDataKey` and `kms:Decrypt` actions to the IAM role. Additionally, if you're modifying the key policy that encrypts your Canvas application storage in the Studio space, grant the `kms:CreateGrant` action. You can add a statement that's similar to the following:

   ```
   {
     "Sid": "ExampleStmt",
     "Action": [
       "kms:CreateGrant", #this permission is only required for the key that encrypts your SageMaker Canvas application storage
       "kms:Decrypt",
       "kms:GenerateDataKey"
     ],
     "Effect": "Allow",
     "Principal": {
       "AWS": "<arn:aws:iam::111122223333:role/Jane>"
     },
     "Resource": "*"
   }
   ```

1. Choose **Save changes**.

The less preferred method is to modify the user’s IAM role to grant the user permissions to use or manage the KMS key. If you use this method, the KMS key policy must also allow access management through IAM. To learn how to grant permission to a KMS key through the user’s IAM role, see [Specifying KMS keys in IAM policy statements](https://docs.aws.amazon.com/kms/latest/developerguide/cmks-in-iam-policies.html) in the *AWS KMS Developer Guide*.

### Encrypt your data in the SageMaker Canvas application


The first KMS key you can use in SageMaker Canvas is used for encrypting application data stored on Amazon Elastic Block Store (Amazon EBS) volumes and in the Amazon Elastic File System that SageMaker AI creates in your domain. SageMaker Canvas encrypts your data with this key in the underlying application and temporary storage systems created when using compute instances for building models and generating insights. SageMaker Canvas passes the key to other AWS services, such as Autopilot, whenever SageMaker Canvas initiates jobs with them to process your data.

You can specify this key by setting the `KmsKeyId` field in the `CreateDomain` API call or while doing the standard domain setup in the console. If you don’t specify your own KMS key, SageMaker AI uses a default AWS managed KMS key to encrypt your data in the SageMaker Canvas application.

To specify your own KMS key for use in the SageMaker Canvas application through the console, first set up your Amazon SageMaker AI domain using the **Standard setup**. Use the following procedure to complete the **Network and Storage Section** for the domain.

1. Fill out your desired Amazon VPC settings.

1. For **Encryption key**, choose **Enter a KMS key ARN**.

1. For **KMS ARN**, enter the ARN for your KMS key, which should have a format similar to the following: `arn:aws:kms:example-region-1:123456789098:key/111aa2bb-333c-4d44-5555-a111bb2c33dd`

### Encrypt your SageMaker Canvas data saved in Amazon S3


The second KMS key you can specify is used for data that SageMaker Canvas stores in Amazon S3. This KMS key is specified in the `S3KmsKeyId` field in the `CreateDomain` API call, or while doing the standard domain setup in the SageMaker AI console. SageMaker Canvas saves duplicates of your input datasets, application and model data, and output data to the Region’s default SageMaker AI S3 bucket for your account. The naming pattern for this bucket is `s3://sagemaker-{Region}-{your-account-id}`, and SageMaker Canvas stores data in the `Canvas/` folder.

To specify your own KMS key for the data that SageMaker Canvas stores in Amazon S3 through the console, set up your domain using the **Standard setup**, and then use the following procedure to complete the notebook sharing settings for the domain.

1. Turn on **Enable notebook resource sharing**.

1. For **S3 location for shareable notebook resources**, leave the default Amazon S3 path. Note that SageMaker Canvas does not use this Amazon S3 path; this Amazon S3 path is used for Studio Classic notebooks.

1. For **Encryption key**, choose **Enter a KMS key ARN**.

1. For **KMS ARN**, enter the ARN for your KMS key, which should have a format similar to the following: `arn:aws:kms:us-east-1:111122223333:key/111aa2bb-333c-4d44-5555-a111bb2c33dd`

## Import encrypted datasets from Amazon S3


Your users might have datasets that have been encrypted with a KMS key. While the preceding section shows you how to encrypt data in SageMaker Canvas and data stored to Amazon S3, you must grant your user's IAM role additional permissions if you want to import data from Amazon S3 that is already encrypted with AWS KMS.

To grant your user permissions to import encrypted datasets from Amazon S3 into SageMaker Canvas, add the following permissions to the IAM execution role that you've used for the user profile.

```
      "kms:Decrypt",
      "kms:GenerateDataKey"
```
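
For example, you might attach these actions to the role as an inline policy whose resource is the KMS key that encrypted the datasets. The following Boto3 sketch uses placeholder role, policy, and key names:

```
import json

import boto3

iam = boto3.client("iam")

# The role name, policy name, and key ARN are placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
            "Resource": "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",
        }
    ],
}

iam.put_role_policy(
    RoleName="MySageMakerExecutionRole",
    PolicyName="CanvasImportEncryptedDatasets",
    PolicyDocument=json.dumps(policy),
)
```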

To learn how to edit the IAM permissions for a role, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html) in the *IAM User Guide*. For more information about KMS keys, see [Key policies in AWS Key Management Service](https://docs.aws.amazon.com/kms/latest/developerguide/key-policies.html) in the *AWS KMS Developer Guide*.

## FAQs


Refer to the following FAQ items for answers to commonly asked questions about SageMaker Canvas AWS KMS support.

### Q: Does SageMaker Canvas retain my KMS key?


A: No. SageMaker Canvas may temporarily cache your key or pass it on to other AWS services (such as Autopilot), but SageMaker Canvas does not retain your KMS key.

### Q: I specified a KMS key when setting up my domain. Why did my dataset fail to import in SageMaker Canvas?


A: Your user’s IAM role may not have permissions to use that KMS key. To grant your user permissions, see the [Prerequisites](#canvas-kms-app-data-prereqs). Another possible error is that you have a bucket policy on your Amazon S3 bucket that requires the use of a specific KMS key that doesn’t match the KMS key you specified in your domain. Make sure that you specify the same KMS key for your Amazon S3 bucket and your domain.

### Q: How do I find the Region’s default SageMaker AI Amazon S3 bucket for my account?


A: The default Amazon S3 bucket follows the naming pattern `s3://sagemaker-{Region}-{your-account-id}`. The `Canvas/` folder in this bucket stores your SageMaker Canvas application data.

### Q: Can I change the default SageMaker AI Amazon S3 bucket used to store SageMaker Canvas data?


A: No, SageMaker AI creates this bucket for you.

### Q: What does SageMaker Canvas store in the default SageMaker AI Amazon S3 bucket?


A: SageMaker Canvas uses the default SageMaker AI Amazon S3 bucket to store duplicates of your input datasets, model artifacts, and model outputs.

### Q: What use cases are supported for using KMS keys with SageMaker Canvas?


A: With SageMaker Canvas, you can use your own encryption keys with AWS KMS for building regression, binary and multi-class classification, and time series forecasting models, as well as for batch inference with your model.

# Store SageMaker Canvas application data in your own SageMaker AI space


Your Amazon SageMaker Canvas application data, such as datasets that you import and your model artifacts, is stored in an *Amazon SageMaker Studio private space*. The space consists of a storage volume for your application data with 100 GB of storage per user profile, the type of the space (in this case, a Canvas application), and the image for your application's container. When you set up Canvas and launch your application for the first time, SageMaker AI creates a default private space that is assigned to your user profile and stores your Canvas data. You don't have to do any additional configuration to set up the space because SageMaker AI automatically creates the space on your behalf. However, if you don't want to use the default space, you have the option to specify a space that you created yourself. This can be useful if you want to isolate your data. The following page shows you how to create and configure your own Studio space for storing Canvas application data.

**Note**  
You can only configure a custom Studio space for new Canvas applications. You can't modify the space configuration for existing Canvas applications.

## Before you begin


Your Amazon SageMaker AI domain or user profile must have at least 100 GB of storage in order to create and use the SageMaker Canvas application.

If you created your domain through the SageMaker AI console, enough storage is provisioned by default and you don't need to take any additional action. If you created your domain or user profile with the [CreateDomain](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateDomain.html) or [ CreateUserProfile](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateUserProfile.html) APIs, then make sure that you set the `MaximumEbsVolumeSizeInGb` value to 100 GB or greater. To set a greater storage value, you can either create a new domain or user profile, or you can update an existing domain or user profile using the [UpdateDomain](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateDomain.html) or [ UpdateUserProfile](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateUserProfile.html) APIs. 
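
If you update the domain programmatically, the storage values are set through the space storage settings. The following Boto3 sketch assumes the `SpaceStorageSettings` field names; the domain ID is a placeholder:

```
import boto3

sagemaker = boto3.client("sagemaker")

# The domain ID is a placeholder. Canvas needs at least 100 GB of space
# storage, so set the maximum to 100 or greater.
sagemaker.update_domain(
    DomainId="d-xxxxxxxxxxxx",
    DefaultUserSettings={
        "SpaceStorageSettings": {
            "DefaultEbsStorageSettings": {
                "DefaultEbsVolumeSizeInGb": 100,
                "MaximumEbsVolumeSizeInGb": 200,
            }
        }
    },
)
```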

## Create a new space


First, create a new Studio space that is configured to store Canvas application data. This is the space that you specify when creating a new Canvas application in the next step.

To create a space, you can use the AWS SDK for Python (Boto3) or the AWS CLI.

------
#### [ SDK for Python (Boto3) ]

The following example shows you how to use the AWS SDK for Python (Boto3) `create_space` method to create a space that you can use for Canvas applications. Make sure to specify these parameters:
+ `DomainId`: Specify the ID for your SageMaker AI domain. To find your ID, you can go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/) and locate your domain in the **Domains** section.
+ `SpaceName`: Specify a name for the new space.
+ `EbsVolumeSizeInGb`: Specify the storage volume size for your space (in GB). The minimum value is `5` and the maximum is `16384`.
+ `SharingType`: Specify this field as `Private`. For more information, see [Amazon SageMaker Studio spaces](studio-updated-spaces.md).
+ `OwnerUserProfileName`: Specify the user profile name. To find user profile names associated with a domain, you can go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/) and locate your domain in the **Domains** section. In the domain's settings, you can view the user profiles.
+ `AppType`: Specify this field as `Canvas`.

```
response = client.create_space(
    DomainId='<your-domain-id>', 
    SpaceName='<your-new-space-name>',
    SpaceSettings={
        'AppType': 'Canvas',
        'SpaceStorageSettings': {
            'EbsStorageSettings': {
                'EbsVolumeSizeInGb': <storage-volume-size>
            }
        },
    },
    OwnershipSettings={
        'OwnerUserProfileName': '<your-user-profile>'
    },
    SpaceSharingSettings={
        'SharingType': 'Private'
    }  
)
```

------
#### [ AWS CLI ]

The following example shows you how to use the AWS CLI `create-space` method to create a space that you can use for Canvas applications. Make sure to specify these parameters:
+ `domain-id`: Specify the ID for your domain. To find your ID, you can go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/) and locate your domain in the **Domains** section.
+ `space-name`: Specify a name for the new space.
+ `EbsVolumeSizeInGb`: Specify the storage volume size for your space (in GB). The minimum value is `5` and the maximum is `16384`.
+ `SharingType`: Specify this field as `Private`. For more information, see [Amazon SageMaker Studio spaces](studio-updated-spaces.md).
+ `OwnerUserProfileName`: Specify the user profile name. To find user profile names associated with a domain, you can go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/) and locate your domain in the **Domains** section. In the domain's settings, you can view the user profiles.
+ `AppType`: Specify this field as `Canvas`.

```
aws sagemaker create-space \
--domain-id <your-domain-id> \
--space-name <your-new-space-name> \
--space-settings '{
        "AppType": "Canvas",
        "SpaceStorageSettings": {
            "EbsStorageSettings": {"EbsVolumeSizeInGb": <storage-volume-size>}
        }
    }' \
--ownership-settings '{"OwnerUserProfileName": "<your-user-profile>"}' \
--space-sharing-settings '{"SharingType": "Private"}'
```

------

You should now have a space. Keep track of your space's name for the next step.

## Create a new Canvas application


After creating a space, create a new Canvas application that specifies the space as its storage location.

To create a new Canvas application, you can use the AWS SDK for Python (Boto3) or the AWS CLI.

**Important**  
You must use the AWS SDK for Python (Boto3) or the AWS CLI to create your Canvas application. Specifying a custom space when creating Canvas applications through the SageMaker AI console isn't supported.

------
#### [ SDK for Python (Boto3) ]

The following example shows you how to use the AWS SDK for Python (Boto3) `create_app` method to create a new Canvas application. Make sure to specify these parameters:
+ `DomainId`: Specify the ID for your SageMaker AI domain.
+ `SpaceName`: Specify the name of the space that you created in the previous step.
+ `AppType`: Specify this field as `Canvas`.
+ `AppName`: Specify `default` as the app name.

```
response = client.create_app(  
    DomainId='<your-domain-id>',
    SpaceName='<your-space-name>',
    AppType='Canvas', 
    AppName='default'  
)
```

------
#### [ AWS CLI ]

The following example shows you how to use the AWS CLI `create-app` method to create a new Canvas application. Make sure to specify these parameters:
+ `domain-id`: Specify the ID for your SageMaker AI domain.
+ `space-name`: Specify the name of the space that you created in the previous step.
+ `app-type`: Specify this field as `Canvas`.
+ `app-name`: Specify `default` as the app name.

```
aws sagemaker create-app \
--domain-id <your-domain-id> \
--space-name <your-space-name> \
--app-type Canvas \
--app-name default
```

------

You should now have a new Canvas application that uses a custom Studio space as the storage location for application data.

**Important**  
Any time you delete the Canvas application (or log out) and have to re-create the application, you must provide your space in the `SpaceName` field to make sure that Canvas uses your space.

The space is attached to the user profile you specified in the space configuration. You can delete your Canvas application without deleting the space, and the data stored in the space remains. The data stored in your space is only deleted if you delete your user profile, or if you directly delete the space.

# Grant Your Users Permissions to Build Custom Image and Text Prediction Models


**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

In Amazon SageMaker Canvas, you can build [custom models](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-build-model.html) to meet your specific business needs. Two of these custom model types are single-label image prediction and multi-category text prediction. The permissions to build these model types are included in the AWS Identity and Access Management (IAM) policy called [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasFullAccess), which SageMaker AI attaches by default to your user's IAM execution role if you leave the [Canvas base permissions turned on](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-getting-started.html#canvas-prerequisites). If you are using a custom IAM configuration, you must explicitly add permissions to your user's IAM execution role so that they can build custom image and text prediction models. The following section shows how to attach a least-permissions policy to the role.

To add the permissions to the user's IAM role, do the following:

1. Go to the [IAM console](https://console.aws.amazon.com/iamv2).

1. Choose **Roles**.

1. In the search box, search for the user's IAM role by name and select it.

1. On the page for the user's role, under **Permissions**, choose **Add permissions**.

1. Choose **Create inline policy**.

1. Select the JSON tab, and then paste the following least-permissions policy into the editor.

------
#### [ JSON ]

****  

   ```
   {
   "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "sagemaker:CreateAutoMLJobV2",
                   "sagemaker:DescribeAutoMLJobV2"
               ],
               "Resource": "*"
           }
       ]
   }
   ```

------

1. Choose **Review policy**.

1. Enter a **Name** for the policy.

1. Choose **Create policy**.

For more information about AWS managed policies, see [Managed policies and inline policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_managed-vs-inline.html) in the *IAM User Guide*.

# Grant Users Permissions to Use Amazon Bedrock and Generative AI Features in Canvas


Generative AI features in Amazon SageMaker Canvas are powered by Amazon Bedrock foundation models, which are large language models (LLMs) that have the capability to understand and generate human-like text. This page describes how to grant the permissions necessary for the following features in SageMaker Canvas:
+ [Chat with and compare Amazon Bedrock models](canvas-fm-chat.md): Access and start conversational chats with Amazon Bedrock models through SageMaker Canvas.
+ [Use the Chat for data prep feature in Data Wrangler](canvas-chat-for-data-prep.md): Use natural language to explore, visualize, and transform your data. This feature is powered by Anthropic Claude 2.
+ [Fine-tune Amazon Bedrock foundation models](canvas-fm-chat-fine-tune.md): Fine-tune an Amazon Bedrock foundation model on your own data to receive customized responses.

In order to use these features, you must first request access to the specific Amazon Bedrock model that you want to use. Then, add the necessary AWS IAM permissions and a trust relationship with Amazon Bedrock to the user's execution role. To grant the permissions to the role, you can choose one of the following methods:
+ Create a new Amazon SageMaker AI domain or user profile and turn on Amazon Bedrock permissions. For more information, see [Getting started with using Amazon SageMaker Canvas](canvas-getting-started.md).
+ Edit the settings for an existing Amazon SageMaker AI domain or user profile.
+ Manually add permissions and a trust relationship to a domain's or user's IAM role.

## Step 1: Add Amazon Bedrock model access


Access to Amazon Bedrock models isn't granted by default, so you must go to the Amazon Bedrock console to request access to models for your AWS account.

To learn how to request access to a specific Amazon Bedrock model, follow the procedure to **Add model access** on the page [Manage access to Amazon Bedrock foundation models](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html) in the *Amazon Bedrock User Guide*.

## Step 2: Grant permissions to the user's IAM role


When setting up your Amazon SageMaker AI domain or user profile, the user's IAM execution role must have the [ AmazonSageMakerCanvasBedrockAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasBedrockAccess.html) policy attached, as well as a trust relationship with Amazon Bedrock, so that your user can access Amazon Bedrock models from SageMaker Canvas.

You can modify the domain settings and either create a new execution role (to which SageMaker AI attaches the required permissions for you) or specify an existing role.

Alternatively, you can manually modify the permissions for an existing IAM role through the IAM console.

Both methods are described in the following sections.

### Grant permissions through the domain settings


You can edit your domain or user profile settings to turn on the **Canvas Ready-to-use models configuration** setting and specify an Amazon Bedrock role.

To edit your domain settings and grant access to Amazon Bedrock models for Canvas users in the domain, do the following:

1. Go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Domains**.

1. From the list of domains, choose your domain.

1. Choose the **App Configurations** tab.

1. In the **Canvas** section, choose **Edit**.

1. The **Edit Canvas settings** page opens. For the **Canvas Ready-to-use models configuration** section, do the following:

   1. Turn on the **Enable Canvas Ready-to-use models** option.

   1. For **Amazon Bedrock role**, select **Create and use a new execution role** to create a new IAM execution role that has the [ AmazonSageMakerCanvasBedrockAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasBedrockAccess.html) policy attached and a trust relationship with Amazon Bedrock. This IAM role is assumed by Amazon Bedrock when you access Amazon Bedrock models, use the chat for data prep feature, or fine-tune Amazon Bedrock models in Canvas. If you already have an execution role with a trust relationship, then select **Use an existing execution role** and choose your role from the dropdown.

1. Choose **Submit** to save your changes.

Your users should now have the necessary permissions to access Amazon Bedrock models, use the chat for data prep feature, and fine-tune Amazon Bedrock models in Canvas.

You can use the same procedure above for editing an individual user’s settings, except go into the individual user’s profile from the domain page and edit the user settings instead. Permissions granted to an individual user don’t apply to other users in the domain, while permissions granted through the domain settings apply to all user profiles in the domain.

For more information on editing your domain settings, see [View and Edit domains](https://docs.aws.amazon.com/sagemaker/latest/dg/domain-view-edit.html).

### Grant permissions manually through IAM


You can manually grant users permissions to access and fine-tune Amazon Bedrock models in Canvas by adding permissions to the IAM role specified for the domain or user’s profile. The IAM role must have the [ AmazonSageMakerCanvasBedrockAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasBedrockAccess.html) policy attached and a trust relationship with Amazon Bedrock.

The following section shows you how to attach the policy to your IAM role and create the trust relationship with Amazon Bedrock.

First, take note of your domain or user profile’s IAM role. Note that permissions granted to an individual user don’t apply to other users in the domain, while permissions granted through the domain apply to all user profiles in the domain.

To configure the IAM role and grant permissions to fine-tune foundation models in Canvas, do the following:

1. Go to the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. In the left navigation pane, choose **Roles**.

1. Search for the user's IAM role by name from the list of roles and select it.

1. On the **Permissions** tab, choose **Add permissions**. From the dropdown menu, choose **Attach policies**.

1. Search for the `AmazonSageMakerCanvasBedrockAccess` policy and select it.

1. Choose **Add permissions**.

1. Back on the IAM role’s page, choose the **Trust relationships** tab.

1. Choose **Edit trust policy**.

1. In the policy editor, find the **Add a principal** option in the right panel and choose **Add**.

1. In the dialog box, for **Principal type**, select **AWS services**.

1. For **ARN**, enter **bedrock.amazonaws.com**.

1. Choose **Add principal**.

1. Choose **Update policy**.

You should now have an IAM role that has the [AmazonSageMakerCanvasBedrockAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasBedrockAccess.html) policy attached and a trust relationship with Amazon Bedrock. For information about AWS managed policies, see [Managed policies and inline policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_managed-vs-inline.html) in the *IAM User Guide*.
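
For reference, after these steps the role's trust policy should look similar to the following sketch. This is an illustration only; your trust policy might contain additional statements, such as one for the `sagemaker.amazonaws.com` service principal.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "bedrock.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```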

# Update SageMaker Canvas for Your Users


You can update to the latest version of Amazon SageMaker Canvas as either a user or an IT administrator. You can update Amazon SageMaker Canvas for a single user at a time.

To update the Amazon SageMaker Canvas application, you must delete the previous version.

**Important**  
Deleting the previous version of Amazon SageMaker Canvas doesn't delete the data or models that the users have created.

Use the following procedure to log in to AWS, open your Amazon SageMaker AI domain, and update Amazon SageMaker Canvas. Users can start using the updated SageMaker Canvas application when they log back in.

1. Sign in to the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. On the **Domains** page, choose your domain.

1. From the list of **User profiles**, choose a user profile.

1. In the list of **Apps**, find the Canvas application (the **App type** says **Canvas**) and choose **Delete app**.

1. Complete the dialog box and choose **Confirm action**.

The following image shows the user profile page and highlights the **Delete app** action from the preceding procedure.

![\[Screenshot of the user profile page with the Delete app action highlighted.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-update-app-1.png)


# Request a Quota Increase


Your users might use AWS resources in amounts that exceed those specified by their quotas. If your users are resource constrained and encounter errors in SageMaker Canvas, you can request a quota increase for them.

For more details about SageMaker AI quotas and how to request a quota increase, see [Quotas](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#regions-quotas-quotas).

Amazon SageMaker Canvas uses the following services to process the requests of your users:
+ Amazon SageMaker Autopilot
+ Amazon SageMaker Studio Classic domain

For a list of the available quotas for SageMaker Canvas operations, see [Amazon SageMaker AI endpoints and quotas](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html).

## Request an increase for instances to build custom models


When building a custom model, if you encounter an error during post-building analysis that tells you to increase your quota for `ml.m5.2xlarge` instances, use the following information to resolve the issue.

You must increase the SageMaker AI Hosting endpoint quota for the `ml.m5.2xlarge` instance type to a non-zero value in your AWS account. After building a model, SageMaker Canvas hosts the model on a SageMaker AI Hosting endpoint and uses the endpoint to generate the post-building analysis. If you don't increase the default account quota of 0 for `ml.m5.2xlarge` instances, SageMaker Canvas cannot complete this step and generates an error during post-building analysis.

For the procedure to increase the quota, see [Requesting a quota increase](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html) in the *Service Quotas User Guide*.
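
If you prefer to request the increase programmatically, you can call the Service Quotas `RequestServiceQuotaIncrease` API. The following is a sketch of the request parameters; the `QuotaCode` value is a placeholder, and you can look up the actual quota code for `ml.m5.2xlarge` endpoint usage with the `ListServiceQuotas` API or in the Service Quotas console.

```
{
  "ServiceCode": "sagemaker",
  "QuotaCode": "L-XXXXXXXX",
  "DesiredValue": 1
}
```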

# Grant Users Permissions to Import Amazon Redshift Data


Your users might have datasets stored in Amazon Redshift. Before users can import data from Amazon Redshift into SageMaker Canvas, you must add the `AmazonRedshiftFullAccess` managed policy to the IAM execution role that you've used for the user profile and add Amazon Redshift as a service principal to the role's trust policy. You must also associate the IAM execution role with your Amazon Redshift cluster. Complete the procedures in the following sections to give your users the required permissions to import Amazon Redshift data.

## Add Amazon Redshift permissions to your IAM role


You must grant Amazon Redshift permissions to the IAM role specified in your user profile.

To add the `AmazonRedshiftFullAccess` policy to the user's IAM role, do the following.

1. Sign in to the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. Choose **Roles**.

1. In the search box, search for the user's IAM role by name and select it.

1. On the page for the user's role, under **Permissions**, choose **Add permissions**.

1. Choose **Attach policies**.

1. Search for the `AmazonRedshiftFullAccess` managed policy and select it.

1. Choose **Attach policies** to attach the policy to the role.

After attaching the policy, the role’s **Permissions** section should now include `AmazonRedshiftFullAccess`.

To add Amazon Redshift as a service principal to the IAM role, do the following.

1. On the same page for the IAM role, under **Trust relationships**, choose **Edit trust policy**.

1. In the **Edit trust policy** editor, update the trust policy to add Amazon Redshift as a service principal. An IAM role that allows Amazon Redshift to access other AWS services on your behalf has a trust relationship as follows:

------
#### [ JSON ]

****  

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "Service": "redshift.amazonaws.com"
         },
         "Action": "sts:AssumeRole"
       }
     ]
   }
   ```

------

1. After editing the trust policy, choose **Update policy**.

You should now have an IAM role that has the policy `AmazonRedshiftFullAccess` attached to it and a trust relationship established with Amazon Redshift, giving users permission to import Amazon Redshift data into SageMaker Canvas. For more information about AWS managed policies, see [Managed policies and inline policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_managed-vs-inline.html) in the *IAM User Guide*.

## Associate the IAM role with your Amazon Redshift cluster


In the settings for your Amazon Redshift cluster, you must associate the IAM role that you granted permissions to in the preceding section.

To associate an IAM role with your cluster, do the following.

1. Sign in to the Amazon Redshift console at [https://console.aws.amazon.com/redshiftv2/](https://console.aws.amazon.com/redshiftv2/).

1. On the navigation menu, choose **Clusters**, and then choose the name of the cluster that you want to update.

1. In the **Actions** dropdown menu, choose **Manage IAM roles**. The **Cluster permissions** page appears.

1. For **Available IAM roles**, enter either the ARN or the name of the IAM role, or choose the IAM role from the list.

1. Choose **Associate IAM role** to add it to the list of **Associated IAM roles**.

1. Choose **Save changes** to associate the IAM role with the cluster.

Amazon Redshift modifies the cluster to complete the change, and the IAM role to which you previously granted Amazon Redshift permissions is now associated with your Amazon Redshift cluster. Your users now have the required permissions to import Amazon Redshift data into SageMaker Canvas.

# Grant Your Users Permissions to Send Predictions to Amazon QuickSight


You must grant your SageMaker Canvas users permissions to send batch predictions to Amazon QuickSight. In QuickSight, users can create analyses and reports with a dataset and prepare dashboards to share their results. For more information about sending predictions to QuickSight for analysis, see [Send predictions to QuickSight](canvas-send-predictions.md).

To grant the necessary permissions to share batch predictions with users in QuickSight, you must add a permissions policy to the AWS Identity and Access Management (IAM) execution role that you’ve used for the user profile. The following section shows you how to attach a least-privilege policy to your role.

**Add the permissions policy to your IAM role**

**To add the permissions policy, use the following procedure:**

1. Sign in to the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. Choose **Roles**.

1. In the search box, search for the user's IAM role by name and select it.

1. On the page for the user's role, under **Permissions**, choose **Add permissions**.

1. Choose **Create inline policy**.

1. Select the **JSON** tab, and then paste the following least-privilege policy into the editor. In the `Resource` ARNs, replace the example account number `111122223333` with your own AWS account number.

------
#### [ JSON ]

****  

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Action": [
           "quicksight:CreateDataSet",
           "quicksight:ListUsers",
           "quicksight:ListNamespaces",
           "quicksight:CreateDataSource",
           "quicksight:PassDataSet",
           "quicksight:PassDataSource"
         ],
         "Resource": [
           "arn:aws:quicksight:*:111122223333:datasource/*",
           "arn:aws:quicksight:*:111122223333:user/*",
           "arn:aws:quicksight:*:111122223333:namespace/*",
           "arn:aws:quicksight:*:111122223333:dataset/*"
         ]
       }
     ]
   }
   ```

------

1. Choose **Review policy**.

1. Enter a **Name** for the policy.

1. Choose **Create policy**.

You should now have a customer-managed IAM policy attached to your execution role that grants your Canvas users the necessary permissions to send batch predictions to users in QuickSight.

# Applications management


The following sections describe how you can manage your SageMaker Canvas applications. You can view, delete, or relaunch your applications from the **Domains** section of the SageMaker AI console.

**Topics**
+ [

# Check for active applications
](canvas-manage-apps-active.md)
+ [

# Delete an application
](canvas-manage-apps-delete.md)
+ [

# Relaunch an application
](canvas-manage-apps-relaunch.md)

# Check for active applications


To check if you have any actively running SageMaker Canvas applications, use the following procedure.

1. Open the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Dashboard**.

1. In the **LCNC** (low-code/no-code) section, there is a row for Canvas that tells you how many active apps are running. Choose the number to view the list of apps.

The **Status** column displays the status of the application, such as **Ready**, **Pending**, or **Deleted**. If the application is **Ready**, then your SageMaker Canvas workspace instance is active. You can delete the application from the console, or you can reopen Canvas and log out.

# Delete an application


If you want to terminate your SageMaker Canvas workspace instance, you can either log out from the SageMaker Canvas application or delete your application from the SageMaker AI console. A *workspace instance* is dedicated for your use from when you start using SageMaker Canvas to the point when you stop using it. Deleting the application only terminates the workspace instance and stops workspace instance charges. Models and datasets aren’t affected, but Quick build tasks automatically restart when you relaunch the application.

To delete your Canvas application through the AWS console, first close the browser tab in which your Canvas application was open. Then, use the following procedure to delete your SageMaker Canvas application.

1. Open the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**. 

1. On the **Domains** page, choose your domain.

1. On the **Domain details** page, choose **Resources**.

1. Under **Applications**, find the application that says **Canvas** in the **App type** column.

1. Select the checkbox next to the Canvas application and choose **Stop**.

You have now successfully stopped the application and terminated the workspace instance.

You can also terminate the workspace instance by [logging out](canvas-log-out.md) from within the SageMaker Canvas application.

# Relaunch an application


If you delete or log out of your SageMaker Canvas application and want to relaunch the application, use the following procedure.

1. Navigate to the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. In the navigation pane, choose **Canvas**.

1. On the SageMaker Canvas landing page, in the **Get Started** box, select your user profile from the dropdown.

1. Choose **Open Canvas** to open the application.

SageMaker Canvas begins launching the application.

If you encounter issues with the preceding procedure, you can use the following alternative procedure.

1. Open the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. On the **Domains** page, choose your domain.

1. On the **Domain details** page, under **User profiles**, select the user profile name for the SageMaker Canvas application you want to view.

1. Choose **Launch** and select **Canvas** from the dropdown list.

SageMaker Canvas begins launching the application.

# Configure Amazon SageMaker Canvas in a VPC without internet access


The Amazon SageMaker Canvas application runs in a container in an AWS managed Amazon Virtual Private Cloud (VPC). If you want to further control access to your resources or run SageMaker Canvas without public internet access, you can configure your Amazon SageMaker AI domain and VPC settings. Within your own VPC, you can configure settings such as security groups (virtual firewalls that control inbound and outbound traffic from Amazon EC2 instances) and subnets (ranges of IP addresses in your VPC). To learn more about VPCs, see [How Amazon VPC works](https://docs.aws.amazon.com/vpc/latest/userguide/how-it-works.html).

When the SageMaker Canvas application runs in the AWS managed VPC, it can interact with other AWS services either over an internet connection or through VPC endpoints created in a customer-managed VPC (without public internet access). SageMaker Canvas applications can access these VPC endpoints through a Studio Classic-created network interface that provides connectivity to the customer-managed VPC. By default, the SageMaker Canvas application has internet access. When using an internet connection, Canvas accesses AWS resources over the internet, such as the Amazon S3 buckets where you store training data and model artifacts.

However, if you have security requirements to control access to your data and job containers, we recommend that you configure SageMaker Canvas and your VPC so that your data and containers aren’t accessible over the internet. SageMaker AI uses the VPC configuration settings you specify when setting up your domain for SageMaker Canvas.

If you want to configure your SageMaker Canvas application without internet access, you must configure your VPC settings when you onboard to an [Amazon SageMaker AI domain](gs-studio-onboard.md), set up VPC endpoints, and grant the necessary AWS Identity and Access Management permissions. For information about configuring a VPC in Amazon SageMaker AI, see [Choose an Amazon VPC](onboard-vpc.md). The following sections describe how to run SageMaker Canvas in a VPC without public internet access.

## Configure Amazon SageMaker Canvas in a VPC without internet access


You can send traffic from SageMaker Canvas to other AWS services through your own VPC. If your VPC doesn't have public internet access and you've set up your domain in **VPC only** mode, then SageMaker Canvas doesn't have public internet access either. All requests, such as accessing datasets in Amazon S3 or running training jobs for standard builds, go through VPC endpoints in your VPC instead of the public internet. When you onboard to a domain and [Choose an Amazon VPC](onboard-vpc.md), you can specify your own VPC as the default VPC for the domain, along with your desired security group and subnet settings. SageMaker AI then creates a network interface in your VPC that SageMaker Canvas uses to access VPC endpoints in your VPC.

Make sure that you set up one or more security groups in your VPC with inbound and outbound rules that allow [TCP traffic within the security group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/security-group-rules-reference.html#sg-rules-other-instances). This is required for connectivity between the Jupyter Server application and the Kernel Gateway applications. At a minimum, you must allow access to ports in the range `8192-65535`. Also, make sure to create a distinct security group for each user profile and add inbound access from that same security group (see the sketch after this paragraph). We don't recommend reusing a domain-level security group for user profiles. If the domain-level security group allows inbound access to itself, all applications in the domain have access to all other applications in the domain. Note that the security group and subnet settings are set after you finish onboarding to the domain.
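
As an illustration, the following is a sketch of the parameters for an EC2 `AuthorizeSecurityGroupIngress` call that adds a self-referencing inbound rule for TCP ports `8192-65535` to a user profile security group. The security group ID is a placeholder.

```
{
  "GroupId": "sg-0123456789abcdef0",
  "IpPermissions": [
    {
      "IpProtocol": "tcp",
      "FromPort": 8192,
      "ToPort": 65535,
      "UserIdGroupPairs": [
        {
          "GroupId": "sg-0123456789abcdef0"
        }
      ]
    }
  ]
}
```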

When onboarding to a domain, if you choose **Public internet only** as the network access type, the VPC is managed by SageMaker AI and allows internet access.

You can change this behavior by choosing **VPC only** so that SageMaker AI sends all traffic to a network interface that SageMaker AI creates in your specified VPC. When you choose this option, you must provide the subnets, security groups, and VPC endpoints that are necessary to communicate with the SageMaker API, the SageMaker AI Runtime, and the various AWS services that SageMaker Canvas uses, such as Amazon S3 and Amazon CloudWatch. Note that you can only import data from Amazon S3 buckets located in the same Region as your VPC.

The following procedures show how you can configure these settings to use SageMaker Canvas without the internet.

### Step 1: Onboard to Amazon SageMaker AI domain


To send SageMaker Canvas traffic to a network interface in your own VPC instead of over the internet, specify the VPC you want to use when onboarding to an [Amazon SageMaker AI domain](gs-studio-onboard.md). You must also specify at least two subnets in your VPC that SageMaker AI can use. Choose **Standard setup** and do the following procedure when configuring the **Network and Storage** section for the domain.

1. Select your desired **VPC**.

1. Choose two or more **Subnets**. If you don’t specify the subnets, SageMaker AI uses all of the subnets in the VPC.

1. Choose one or more **Security group(s)**.

1. Choose **VPC Only** to turn off direct internet access in the AWS managed VPC where SageMaker Canvas is hosted.

After disabling internet access, finish the onboarding process to set up your domain. For more information about the VPC settings for Amazon SageMaker AI domain, see [Choose an Amazon VPC](onboard-vpc.md).

### Step 2: Configure VPC endpoints and access


**Note**  
In order to configure Canvas in your own VPC, you must enable private DNS hostnames for your VPC endpoints. For more information, see [Connect to SageMaker AI Through a VPC Interface Endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/interface-vpc-endpoint.html).

SageMaker Canvas only accesses other AWS services to manage and store data for its functionality. For example, it connects to Amazon Redshift if your users access an Amazon Redshift database. It can connect to an AWS service such as Amazon Redshift using an internet connection or a VPC endpoint. Use VPC endpoints if you want to set up connections from your VPC to AWS services that don't use the public internet.

A VPC endpoint creates a private connection to an AWS service that uses a networking path that is isolated from the public internet. For example, if you set up access to Amazon S3 using a VPC endpoint from your own VPC, then the SageMaker Canvas application can access Amazon S3 by going through the network interface in your VPC and then through the VPC endpoint that connects to Amazon S3. The communication between SageMaker Canvas and Amazon S3 is private.

For more information about configuring VPC endpoints for your VPC, see [AWS PrivateLink](https://docs.aws.amazon.com/vpc/latest/privatelink/what-is-privatelink.html). If you are using Amazon Bedrock models in Canvas with a VPC, see [Protect jobs using a VPC](https://docs.aws.amazon.com/bedrock/latest/userguide/usingVPC.html#configureVPC) in the *Amazon Bedrock User Guide* for more information about controlling access to your data.
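
As an illustration, the following is a sketch of the parameters for an EC2 `CreateVpcEndpoint` call that creates an interface endpoint for the SageMaker API (one of the services listed in the following table), with private DNS enabled as the preceding note requires. The VPC, subnet, and security group IDs, as well as the Region, are placeholders.

```
{
  "VpcId": "vpc-0123456789abcdef0",
  "ServiceName": "com.amazonaws.us-east-1.sagemaker.api",
  "VpcEndpointType": "Interface",
  "SubnetIds": ["subnet-0123456789abcdef0", "subnet-0123456789abcdef1"],
  "SecurityGroupIds": ["sg-0123456789abcdef0"],
  "PrivateDnsEnabled": true
}
```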

The following are the VPC endpoints for each service you can use with SageMaker Canvas:


| Service | Endpoint | Endpoint type | 
| --- | --- | --- | 
|  AWS Application Auto Scaling  |  com.amazonaws.*Region*.application-autoscaling  | Interface | 
|  Amazon Athena  |  com.amazonaws.*Region*.athena  | Interface | 
|  Amazon SageMaker AI  |  com.amazonaws.*Region*.sagemaker.api com.amazonaws.*Region*.sagemaker.runtime com.amazonaws.*Region*.notebook  | Interface | 
|  Amazon SageMaker AI Data Science Assistant  |  com.amazonaws.*Region*.sagemaker-data-science-assistant  | Interface | 
|  AWS Security Token Service  |  com.amazonaws.*Region*.sts  | Interface | 
|  Amazon Elastic Container Registry (Amazon ECR)  |  com.amazonaws.*Region*.ecr.api com.amazonaws.*Region*.ecr.dkr  | Interface | 
|  Amazon Elastic Compute Cloud (Amazon EC2)  |  com.amazonaws.*Region*.ec2  | Interface | 
|  Amazon Simple Storage Service (Amazon S3)  |  com.amazonaws.*Region*.s3  | Gateway | 
|  Amazon Redshift  |  com.amazonaws.*Region*.redshift-data  | Interface | 
|  AWS Secrets Manager  |  com.amazonaws.*Region*.secretsmanager  | Interface | 
|  AWS Systems Manager  |  com.amazonaws.*Region*.ssm  | Interface | 
|  Amazon CloudWatch  |  com.amazonaws.*Region*.monitoring  | Interface | 
|  Amazon CloudWatch Logs  |  com.amazonaws.*Region*.logs  | Interface | 
|  Amazon Forecast  |  com.amazonaws.*Region*.forecast com.amazonaws.*Region*.forecastquery  | Interface | 
|  Amazon Textract  |  com.amazonaws.*Region*.textract  | Interface | 
|  Amazon Comprehend  |  com.amazonaws.*Region*.comprehend  | Interface | 
|  Amazon Rekognition  |  com.amazonaws.*Region*.rekognition  | Interface | 
|  AWS Glue  |  com.amazonaws.*Region*.glue  | Interface | 
|  Amazon Relational Database Service (Amazon RDS)  |  com.amazonaws.*Region*.rds  | Interface | 
|  Amazon Bedrock (see note after table)  |  com.amazonaws.*Region*.bedrock-runtime  | Interface | 
|  Amazon Kendra  |  com.amazonaws.*Region*.kendra  | Interface | 
|  Amazon EMR Serverless  |  com.amazonaws.*Region*.emr-serverless  | Interface | 
|  Amazon Q Developer (see note after table)  |  com.amazonaws.*Region*.q  | Interface | 

**Note**  
The Amazon Q Developer VPC endpoint is currently available only in the US East (N. Virginia) Region. To connect to it from other Regions, you can choose one of the following options based on your security and infrastructure preferences:  
**Set up a NAT Gateway.** Configure a NAT Gateway in your VPC's private subnet to enable internet connectivity for the Q Developer endpoint. For more information, see [Setting up a NAT Gateway in a VPC Private Subnet](https://repost.aws/knowledge-center/nat-gateway-vpc-private-subnet).
**Enable cross-region VPC endpoint access.** Set up cross-region VPC endpoint access for Q Developer. Use this option to connect securely without requiring internet access. For more information, see [Configuring Cross-Region VPC Endpoint Access](https://repost.aws/knowledge-center/vpc-endpoints-cross-region-aws-services).

**Note**  
For Amazon Bedrock, the interface endpoint service name `com.amazonaws.Region.bedrock` has been deprecated. Create a new VPC endpoint with the service name listed in the preceding table.  
Additionally, you can't fine-tune foundation models from Canvas VPCs with no internet access. This is because Amazon Bedrock doesn't support VPC endpoints for model customization APIs. To learn more about fine-tuning foundation models in Canvas, see [Fine-tune foundation models](canvas-fm-chat-fine-tune.md).

You must also add an endpoint policy for Amazon S3 to control AWS principal access to your VPC endpoint. For information about how to update your VPC endpoint policy, see [Control access to VPC endpoints using endpoint policies](https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-access.html).

The following are two VPC endpoint policies that you can use. Use the first policy if you only want to grant access to the basic functionality of Canvas, such as importing data and creating models. Use the second policy if you want to grant access to the additional [generative AI features](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-fm-chat.html) in Canvas.

------
#### [ Basic VPC endpoint policy ]

The following policy grants the necessary access to your VPC endpoint for basic operations in Canvas.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:CreateBucket",
                "s3:GetBucketCors",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::*SageMaker*",
                "arn:aws:s3:::*Sagemaker*",
                "arn:aws:s3:::*sagemaker*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:ListAllMyBuckets"
            ],
            "Resource": "*"
        }
    ]
}
```

------
#### [ Generative AI VPC endpoint policy ]

The following policy grants the necessary access to your VPC endpoint for basic operations in Canvas, as well as using generative AI foundation models.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:CreateBucket",
                "s3:GetBucketCors",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::*SageMaker*",
                "arn:aws:s3:::*Sagemaker*",
                "arn:aws:s3:::*sagemaker*",
                "arn:aws:s3:::*fmeval/datasets*",
                "arn:aws:s3:::*jumpstart-cache-prod*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:ListAllMyBuckets"
            ],
            "Resource": "*"
        }
    ]
}
```

------

### Step 3: Grant IAM permissions


The SageMaker Canvas user must have the necessary AWS Identity and Access Management permissions to allow connection to the VPC endpoints. The IAM role to which you give permissions must be the same one you used when onboarding to Amazon SageMaker AI domain. You can attach the SageMaker AI managed `AmazonSageMakerFullAccess` policy to the IAM role for the user to give the user the required permissions. If you require more restrictive IAM permissions and use custom policies instead, then give the user’s role the `ec2:DescribeVpcEndpointServices` permission. SageMaker Canvas requires these permissions to verify the existence of the required VPC endpoints for standard build jobs. If it detects these VPC endpoints, then standard build jobs run by default in your VPC. Otherwise, they will run in the default AWS managed VPC.

For instructions on how to attach the `AmazonSageMakerFullAccess` IAM policy to your user’s IAM role, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html).

To grant your user’s IAM role the granular `ec2:DescribeVpcEndpointServices` permission, use the following procedure.

1. Sign in to the AWS Management Console and open the [IAM console](https://console.aws.amazon.com/iam/).

1. In the navigation pane, choose **Roles**.

1. In the list, choose the name of the role to which you want to grant permissions.

1. Choose the **Permissions** tab.

1. Choose **Add permissions** and then choose **Create inline policy**.

1. Choose the **JSON** tab and enter the following policy, which grants the `ec2:DescribeVpcEndpointServices` permission:

------
#### [ JSON ]

****  

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "VisualEditor0",
         "Effect": "Allow",
         "Action": "ec2:DescribeVpcEndpointServices",
         "Resource": "*"
       }
     ]
   }
   ```

------

1. Choose **Review policy**, and then enter a **Name** for the policy (for example, `VPCEndpointPermissions`).

1. Choose **Create policy**.

The user’s IAM role should now have permissions to access the VPC endpoints configured in your VPC.

### (Optional) Step 4: Override security group settings for specific users


If you are an administrator, you might want different users to have different VPC settings, or user-specific VPC settings. When you override the default VPC’s security group settings for a specific user, these settings are passed on to the SageMaker Canvas application for that user.

You can override the security groups that a specific user has access to in your VPC when you set up a new user profile in Studio Classic. You can use the [CreateUserProfile](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateUserProfile.html) SageMaker API call (or [create_user_profile](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_user_profile) with the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html)), and then in the `UserSettings`, you can specify the `SecurityGroups` for the user, as in the following sketch.
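
As an illustration, a `CreateUserProfile` request that overrides the security groups for a single user might look like the following sketch. The domain ID, user profile name, and security group ID are placeholders.

```
{
  "DomainId": "d-xxxxxxxxxxxx",
  "UserProfileName": "canvas-user",
  "UserSettings": {
    "SecurityGroups": [
      "sg-0123456789abcdef0"
    ]
  }
}
```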

# Set up connections to data sources with OAuth


The following section describes the steps you must take to set up OAuth connections to data sources from SageMaker Canvas. [OAuth](https://oauth.net/2/) is a common authentication platform for granting access to resources without sharing passwords. With OAuth, you can quickly connect to your data from Canvas and import it for building models. Canvas currently supports OAuth for Snowflake and Salesforce Data Cloud. 

**Note**  
You can only establish one OAuth connection for each data source.

## Set up OAuth for Salesforce Data Cloud


To set up OAuth for Salesforce Data Cloud, follow these general steps:

1. Sign in to Salesforce Data Cloud.

1. In Salesforce Data Cloud, create a new app connection and do the following:

   1. Enable OAuth settings.

   1. When prompted for a callback URL (or the URL of the resource accessing your data), specify the URL for your Canvas application. The Canvas application URL follows this format: `https://<domain-id>.studio.<region>.sagemaker.aws/canvas/default`

   1. Copy the consumer key and secret.

   1. Copy your authorization URL and token URL.

For more detailed instructions about performing the preceding tasks in Salesforce Data Cloud, see [Import data from Salesforce Data Cloud](data-wrangler-import.md#data-wrangler-import-salesforce-data-cloud) in the Data Wrangler documentation.

After enabling access from Salesforce Data Cloud and getting your connection information, you must create an [AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html) secret to store the information and add it to your Amazon SageMaker AI domain or user profile. Note that you can add a secret to both a domain and user profile, but Canvas looks for secrets in the user profile first.
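
If you create the secret yourself, store the Salesforce Data Cloud connection information as a JSON key-value document. The following is a sketch that assumes the same field names as the Snowflake secret format shown later in this topic; all values, including the URLs and the `identity_provider` value, are placeholders that you should replace with the values from your Salesforce Data Cloud app connection.

```
{
  "identity_provider": "SALESFORCE",
  "client_id": "example-consumer-key",
  "client_secret": "example-consumer-secret",
  "authorization_url": "https://example.my.salesforce.com/services/oauth2/authorize",
  "token_url": "https://example.my.salesforce.com/services/oauth2/token"
}
```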

To add a secret to your domain or user profile, do the following:

1. Go to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker).

1. Choose **Domains** in the navigation pane.

1. From the list of domains, choose your domain.

   1. If adding your secret to your domain, do the following:

      1. Choose the domain.

      1. On the **Domain details** page, choose the **Domain settings** tab.

      1. Choose **Edit**.

   1. If adding the secret to your user profile, do the following:

      1. Choose the user’s domain.

      1. On the **Domain details** page, choose the user profile.

      1. On the **User Details** page, choose **Edit**.

1. In the navigation pane, choose **Canvas settings**.

1. For **OAuth settings**, choose **Add OAuth configuration**.

1. For **Data source**, select **Salesforce Data Cloud**.

1. For **Secret Setup**, select **Create a new secret**. Alternatively, if you already created an AWS Secrets Manager secret with your credentials, enter the ARN for the secret. If creating a new secret, do the following:

   1. For **Identity Provider**, select **SALESFORCE**.

   1. For **Client ID**, **Client Secret**, **Authorization URL**, and **Token URL**, enter all of the information you gathered from Salesforce Data Cloud in the previous procedure.

1. Save your domain or user profile settings.

You should now be able to create a connection to your data in Salesforce Data Cloud from Canvas.

## Set up OAuth for Snowflake


To set up authentication for Snowflake, Canvas supports external identity providers so that users don't have to enter their credentials directly into Canvas.

The following are links to the Snowflake documentation for the identity providers that Canvas supports:
+ [Azure AD](https://docs.snowflake.com/en/user-guide/oauth-azure.html)
+ [Okta](https://docs.snowflake.com/en/user-guide/oauth-okta.html)
+ [Ping Federate](https://docs.snowflake.com/en/user-guide/oauth-pingfed.html)

The following process describes the general steps you must take. For more detailed instructions about performing these steps, you can refer to the [Setting up Snowflake OAuth Access](data-wrangler-import.md#data-wrangler-snowflake-oauth-setup) section in the Data Wrangler documentation for importing data from Snowflake.

To set up OAuth for Snowflake, do the following:

1. Register Canvas as an application with the identity provider. This requires specifying a redirect URL to Canvas, which should follow this format: `https://<domain-id>.studio.<region>.sagemaker.aws/canvas/default`

1. Within the identity provider, create a server or API that sends OAuth tokens to Canvas so that Canvas can access Snowflake. When setting up the server, use the authorization code and refresh token grant types, specify the access token lifetime, and set a refresh token policy. Additionally, within the External OAuth Security Integration for Snowflake, enable `external_oauth_any_role_mode`.

1. Get the following information from the identity provider: the token URL, the authorization URL, the client ID, and the client secret. For Azure AD, also retrieve the OAuth scope credentials.

1. Store the information retrieved in the previous step in an AWS Secrets Manager secret.

   1. For Okta and Ping Federate, the secret should look like the following format:

      ```
      {
        "token_url": "https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/token",
        "client_id": "example-client-id",
        "client_secret": "example-client-secret",
        "identity_provider": "OKTA"|"PING_FEDERATE",
        "authorization_url": "https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/authorize"
      }
      ```

   1. For Azure AD, the secret should also include the OAuth scope credentials as the `datasource_oauth_scope` field, as in the sketch following this list.
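
As an illustration, an Azure AD secret that follows the preceding format might look like the following sketch. The URLs and scope are placeholders, and the `identity_provider` value shown here is illustrative rather than authoritative.

```
{
  "token_url": "https://login.microsoftonline.com/example-tenant-id/oauth2/v2.0/token",
  "client_id": "example-client-id",
  "client_secret": "example-client-secret",
  "identity_provider": "AZURE_AD",
  "authorization_url": "https://login.microsoftonline.com/example-tenant-id/oauth2/v2.0/authorize",
  "datasource_oauth_scope": "example-scope"
}
```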

After configuring the identity provider and creating the [AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html) secret with your connection information, add the secret to your Amazon SageMaker AI domain or user profile. Note that you can add a secret to both a domain and a user profile, but Canvas looks for secrets in the user profile first.

To add a secret to your domain or user profile, do the following:

1. Go to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker).

1. Choose **Domains** in the navigation pane.

1. From the list of domains, choose your domain.

   1. If adding your secret to your domain, do the following:

      1. Choose the domain.

      1. On the **Domain details** page, choose the **Domain settings** tab.

      1. Choose **Edit**.

   1. If adding the secret to your user profile, do the following:

      1. Choose the user’s domain.

      1. On the **Domain details** page, choose the user profile.

      1. On the **User Details** page, choose **Edit**.

1. In the navigation pane, choose **Canvas settings**.

1. For **OAuth settings**, choose **Add OAuth configuration**.

1. For **Data source**, select **Snowflake**.

1. For **Secret Setup**, select **Create a new secret**. Alternatively, if you already created an AWS Secrets Manager secret with your credentials, enter the ARN for the secret. If creating a new secret, do the following:

   1. For **Identity Provider**, select **SNOWFLAKE**.

   1. For **Client ID**, **Client Secret**, **Authorization URL**, and **Token URL**, enter all of the information you gathered from the identity provider in the previous procedure.

1. Save your domain or user profile settings.

You should now be able to create a connection to your data in Snowflake from Canvas.

# Generative AI assistance for solving ML problems in Canvas using Amazon Q Developer
Generative AI assistance using Q Developer

While using Amazon SageMaker Canvas, you can chat with Amazon Q Developer in natural language to leverage generative AI and solve problems. Q Developer is an assistant that helps you translate your goals into machine learning (ML) tasks and describes each step of the ML workflow. Q Developer helps Canvas users reduce the time, effort, and data science expertise required to apply ML and make data-driven decisions for their organizations.

Through a conversation with Q Developer, you can initiate actions in Canvas such as preparing data, building an ML model, making predictions, and deploying a model. Q Developer makes suggestions for next steps and provides you with context as you complete each step. It also informs you of results; for example, Canvas can transform your dataset according to best practices, and Q Developer can list the transforms that were used and why.

Amazon Q Developer is available in SageMaker Canvas at no additional cost to both Amazon Q Developer Pro Tier and Free Tier users. However, standard charges apply for resources such as the SageMaker Canvas workspace instance and any resources used for building or deploying models. For more information about pricing, see [Amazon SageMaker Canvas pricing](https://aws.amazon.com/sagemaker-ai/canvas/pricing/).

Use of Amazon Q is licensed to you under the [MIT-0 License](https://github.com/aws/mit-0) and subject to the [AWS Responsible AI Policy](https://aws.amazon.com/machine-learning/responsible-ai/policy/). When you use Q Developer from outside the US, Q Developer processes data across US Regions. For more information, see [Cross region inference in Amazon Q Developer](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/cross-region-inference.html).

**Note**  
Amazon Q Developer in SageMaker Canvas doesn't use user content to improve the service, regardless of whether you use the Free Tier or Pro Tier subscription. For service telemetry purposes, Q Developer might track your usage, such as the number of questions asked and whether recommendations were accepted or rejected. This telemetry data doesn't include personally identifiable information such as IP addresses.

## How it works


Amazon Q Developer is a generative AI-powered assistant available in SageMaker Canvas that you can query using natural language. Q Developer makes suggestions for each step of the machine learning workflow, explaining concepts and providing you with options and more details as needed. You can use Q Developer for help with regression, binary classification, and multi-class classification use cases.

For example, to predict customer churn, upload a dataset of historical customer churn information to Canvas through Q Developer. Q Developer suggests an appropriate ML model type and steps to fix dataset issues, build a model, and make predictions.

**Important**  
Amazon Q Developer is intended for conversations about machine learning problems within SageMaker Canvas. It guides users through Canvas actions and optionally answers questions about AWS services. Q Developer processes model inputs only in English. For more information about how you can use Q Developer, see [Amazon Q Developer features](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/features.html) in the *Amazon Q Developer User Guide*.

## Supported regions


Amazon Q Developer is available within SageMaker Canvas in the following AWS Regions:
+ US East (N. Virginia)
+ US East (Ohio)
+ US West (Oregon)
+ Asia Pacific (Mumbai)
+ Asia Pacific (Seoul)
+ Asia Pacific (Singapore)
+ Asia Pacific (Sydney)
+ Asia Pacific (Tokyo)
+ Europe (Frankfurt)
+ Europe (Ireland)
+ Europe (Paris)

## Amazon Q Developer capabilities available in Canvas


The following list summarizes the Canvas tasks with which Q Developer can provide assistance:
+ **Describe your objective** – Q Developer can suggest an ML model type and general approach to solve your problem.
+ **Import and analyze datasets** – Tell Q Developer where your dataset is stored or upload a file to save it as a Canvas dataset. Prompt Q Developer to identify any issues in your dataset, such as outliers or missing values. Q Developer provides summary statistics about your dataset and lists any identified issues.

  Q Developer supports queries about the following statistics for individual columns:
  + Numeric columns – `number of valid values`, `feature type`, `mean`, `median`, `minimum`, `maximum`, `standard deviation`, `25th percentile`, `75th percentile`, `number of outliers`
  + Categorical columns – `number of missing values`, `number of valid values`, `feature type`, `most frequent`, `most frequent category`, `most frequent category count`, `least frequent`, `least frequent category`, `least frequent category count`, `categories`
+ **Fix dataset issues** – Prompt Q Developer to use Canvas's data transformation capabilities to create a revised version of your dataset. Canvas creates a Data Wrangler data flow and applies transforms according to data science best practices. For more information, see [Data preparation](canvas-data-prep.md).

  If you want to do more advanced data analysis or data preparation tasks than you can accomplish with Q Developer, then we recommend that you go to the Data Wrangler data flow interface.
+ **Train a model** – Q Developer tells you the recommended ML model type for your problem and a proposed model building configuration. You can use the suggested default settings to do a quick build, or you can modify the configuration and do a standard build. When ready, prompt Q Developer to build your Canvas model.

  All of the custom model types are supported. For more information about model types and quick versus standard builds, see [How custom models work](canvas-build-model.md).
+ **Evaluate model accuracy** – After building a model, Q Developer provides a summary of how the model scores across various metrics. These metrics help you determine the usefulness and accuracy of your model. Q Developer can explain any concept or metric in detail.

  To view full details and visualizations, open the model from the chat or the **My Models** page of Canvas. For more information, see [Model evaluation](canvas-evaluate-model.md).
+ **Get predictions for new data** – You can upload a new dataset and prompt Q Developer to help you open the prediction feature of Canvas. 

  Q Developer opens a new window in the application where you can either make a single prediction or make batch predictions with a new dataset. For more information, see [Predictions with custom models](canvas-make-predictions.md).
+ **Deploy a model** – To deploy your model for production, ask Q Developer to help you deploy your model through Canvas. Q Developer opens a new window in which you can configure your deployment. 

  After deploying, view your deployment details either 1) on the **My Models** page of Canvas in the model's **Deploy** tab, or 2) on the **ML Ops** page in the **Deployments** tab. For more information, see [Deploy your models to an endpoint](canvas-deploy-model.md).

## Prerequisites


To use Amazon Q Developer to build ML models in SageMaker Canvas, complete the following prerequisites:

**Set up a Canvas application**

Make sure that you have a Canvas application set up. For information about how to set up a Canvas application, see [Getting started with using Amazon SageMaker Canvas](canvas-getting-started.md).

**Grant Q Developer permissions**

To access Q Developer while using Canvas, you must attach the necessary permissions to the AWS IAM role used for your SageMaker AI domain or user profile. You can do this through the console method described in this section. If you encounter any permissions issues with the console method, manually attach the AWS managed policy [AmazonSageMakerCanvasSMDataScienceAssistantAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasSMDataScienceAssistantAccess) to the IAM role.

Permissions attached at the domain level apply to all user profiles in the domain, unless individual permissions are granted or revoked at the user profile level.

------
#### [ SageMaker AI console method ]

You can grant permissions by editing the SageMaker AI domain or user profile settings.

To grant permissions through the domain settings in the SageMaker AI console, do the following:

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. From the list of domains, select your domain.

1. On the **Domain details** page, select the **App configurations** tab.

1. In the **Canvas** section, choose **Edit**.

1. On the **Edit Canvas settings** page, go to the **Amazon Q Developer** section and do the following:

   1. Turn on **Enable Amazon Q Developer in SageMaker Canvas for natural language ML** to add the permissions to chat with Q Developer in Canvas to your domain's execution role.

   1. (Optional) Turn on **Enable Amazon Q Developer chat for general AWS questions** if you want to ask Q Developer questions about various AWS services (for example: Describe how Athena works).
**Note**  
When making general AWS queries to Q Developer, your requests route through the US East (N. Virginia) AWS Region. To prevent your data from routing through US East (N. Virginia), turn off the **Enable Amazon Q Developer chat for general AWS questions** toggle.

------
#### [ Manual method ]

Attach the [AmazonSageMakerCanvasSMDataScienceAssistantAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasSMDataScienceAssistantAccess) policy to the AWS IAM role used for your domain or user profile. For more information about how to do this, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html) in the *AWS IAM User Guide*.

------

**(Optional) Configure access to Q Developer from your VPC**

If you have a VPC that is configured without public internet access, you can add a VPC endpoint for Q Developer. For more information, see [Configure Amazon SageMaker Canvas in a VPC without internet access](canvas-vpc.md).

## Getting started


To use Amazon Q Developer to build ML models in SageMaker Canvas, do the following:

1. Open your SageMaker Canvas application.

1. In the left navigation pane, choose **Amazon Q**.

1. Choose **Start a new conversation** to open a new chat.

When you start a new chat, Q Developer prompts you to state your problem or provide a dataset.

![\[The greeting that Q Developer gives you upon starting a new chat.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/amazon-q-greeting.png)


After importing your data, you can ask Q Developer to provide you with summary statistics about your dataset, or you can ask questions about specific columns. For a list of the different statistics that Q Developer supports, see the preceding section [Amazon Q Developer capabilities available in Canvas](#canvas-q-capabilities). The following screenshot shows an example of asking for dataset statistics and the most frequent category in a product category column.

![\[Chat dialog asking Q Developer to provide dataset statistics and the most frequent category statistic.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/amazon-q-dataset-statistics.png)


Q Developer tracks any Canvas artifacts you import or create during the conversation, such as transformed datasets and models. You can access them from the chat or other Canvas application tabs. For example, if Q Developer fixes issues in your dataset, you can access the new, transformed dataset from the following places:
+ The artifacts sidebar in the Q Developer chat interface
+ The **Datasets** page of Canvas, where you can view both your original and transformed datasets. The transformed dataset has the **Built by Amazon Q** label added to it.
+ The **Data Wrangler** page of Canvas, where Q Developer creates a new data flow for your dataset

The following screenshot shows the original dataset and the transformed dataset in the sidebar of a chat.

![\[The artifacts, which are a dataset and a transformed dataset, shown in the sidebar of a Q Developer chat.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/amazon-q-artifacts.png)


When your data is ready, ask Q Developer to help build a Canvas model. Q Developer might prompt you to confirm a few fields and review the build configuration. If you use the default build configuration, then your model is built using a quick build. If you want to customize any part of your build configuration, such as selecting the algorithms used or changing the objective metric, then your model is built with a standard build.

The following screenshot shows how you can prompt Q Developer to initiate a Canvas model build with only a few prompts. This example uses the default configuration to start a quick build.

![\[A conversation with Q Developer where the user prompted to start a Canvas model build.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/amazon-q-training-chat.png)


After building your model, you can perform additional actions using either natural language in the chat or the artifacts sidebar menu. For example, you can view model details and metrics, make predictions, or deploy the model. The following screenshot shows the sidebar where you can choose these additional options.

![\[A Q Developer conversation ellipsis menu expanded, showing options for viewing models details, predictions, and deployment.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/amazon-q-ellipsis-menu.png)


You can also perform any of these actions by going to the **My Models** page of Canvas and selecting your model. From your model's page, you can navigate to the **Analyze**, **Predict**, and **Deploy** tabs to view model metrics and visualizations, make predictions, and manage deployments, respectively.

# Logging Q Developer conversations with AWS CloudTrail


AWS CloudTrail is a service that records actions taken by users, roles, or AWS services in Amazon SageMaker AI. CloudTrail captures API calls resulting from your interactions with Amazon Q Developer (a conversational AI assistant) while using SageMaker Canvas (a no-code ML interface). CloudTrail data shows request details, the IP address of the requester, who made the request, and when.

Your interactions with Q Developer are sent as `SendConversation` API calls to the SageMaker AI Data Science Assistant service, which is an internal service that Canvas leverages on the backend. The event source for `SendConversation` API calls is `sagemaker-data-science-assistant.amazonaws.com`.

**Note**  
For privacy and security reasons, the content of your conversations is hidden in the logs, appearing as `HIDDEN_DUE_TO_SECURITY_REASONS` in the request and response elements.

To learn more about CloudTrail, see the [AWS CloudTrail User Guide](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html). To learn more about CloudTrail in SageMaker AI, see [Logging Amazon SageMaker AI API calls using AWS CloudTrail](logging-using-cloudtrail.md).

The following is an example log file entry for the `SendConversation` API:

```
{
    "eventVersion":"1.10",
    "userIdentity": {
        "type":"AssumedRole",
        "principalId":"AROA123456789EXAMPLE:user-Isengard",
        "arn":"arn:aws:sts::111122223333:assumed-role/Admin/user",
        "accountId":"111122223333",
        "accessKeyId":"ASIAIOSFODNN7EXAMPLE",
        "sessionContext": {
            "sessionIssuer": {
                "type":"Role",
                "principalId":"AROA123456789EXAMPLE",
                "arn":"arn:aws:iam::111122223333:role/Admin",
                "accountId":"111122223333",
                "userName":"Admin"
            },
            "attributes": {
                "creationDate":"2024-11-11T22:04:37Z",
                "mfaAuthenticated":"false"
            }
        }
    },
    "eventTime":"2024-11-11T22:09:22Z",
    "eventSource":"sagemaker-data-science-assistant.amazonaws.com",
    "eventName":"SendConversation",
    "awsRegion":"us-west-2",
    "sourceIPAddress":"192.0.2.0",
    "userAgent":"Boto3/1.33.13 md/Botocore#1.33.13 ua/2.0 os/linux#5.10.227-198.884.amzn2int.x86_64 md/arch#x86_64 lang/python#3.7.16 md/pyimpl#CPython cfg/retry-mode#legacy Botocore/1.33.13",
    "requestParameters": {
        "conversation": [
            {
                "utteranceId":"a1b2c3d4-5678-90ab-cdef-EXAMPLE11111",
                "utterance":"HIDDEN_DUE_TO_SECURITY_REASONS",
                "timestamp":"Feb 4, 2020, 7:46:29 AM",
                "utteranceType":"User"
            }
        ],
        "utteranceId":"a1b2c3d4-5678-90ab-cdef-EXAMPLE11111"
    },
    "responseElements": {
        "responseCode":"CHAT_RESPONSE",
        "conversationId":"1234567890abcdef0",
        "response": {
            "chat": {
                "body":"HIDDEN_DUE_TO_SECURITY_REASONS"
            }
        }
    },
    "requestID":"a1b2c3d4-5678-90ab-cdef-EXAMPLE11111",
    "eventID":"a1b2c3d4-5678-90ab-cdef-EXAMPLE11111",
    "readOnly":false,
    "eventType":"AwsApiCall",
    "managementEvent":true,
    "recipientAccountId":"123456789012",
    "eventCategory":"Management",
    "tlsDetails": {
        "tlsVersion":"TLSv1.2",
        "cipherSuite":"ECDHE-RSA-AES128-GCM-SHA256",
        "clientProvidedHostHeader":"gamma.us-west-2.data-science-assistant.sagemaker.aws.dev"
    }
}
```

# Data import


Amazon SageMaker Canvas supports importing tabular, image, and document data. You can import datasets from your local machine, Amazon services such as Amazon S3 and Amazon Redshift, and external data sources. When importing datasets from Amazon S3, you can bring a dataset of any size. Use the datasets that you import to build models and make predictions for other datasets.

Each use case for which you can build a custom model accepts different types of input. For example, if you want to build a single-label image classification model, then you should import image data. For more information about the different model types and the data they accept, see [How custom models work](canvas-build-model.md). You can import data and build custom models in SageMaker Canvas for the following data types:
+ **Tabular** (CSV, Parquet, or tables)
  + Categorical – Use categorical data to build custom categorical prediction models for 2 and 3+ category prediction.
  + Numeric – Use numeric data to build custom numeric prediction models.
  + Text – Use text data to build custom multi-category text prediction models.
  + Timeseries – Use timeseries data to build custom time series forecasting models.
+ **Image** (JPG or PNG) – Use image data to build custom single-label image prediction models.
+ **Document** (PDF, JPG, PNG, TIFF) – Document data is only supported for SageMaker Canvas Ready-to-use models. To learn more about Ready-to-use models that can make predictions for document data, see [Ready-to-use models](canvas-ready-to-use-models.md).

You can import data into Canvas from the following data sources:
+ Local files on your computer
+ Amazon S3 buckets
+ Amazon Redshift provisioned clusters (not Amazon Redshift Serverless)
+ AWS Glue Data Catalog through Amazon Athena
+ Amazon Aurora
+ Amazon Relational Database Service (Amazon RDS)
+ Salesforce Data Cloud
+ Snowflake
+ Databricks, SQLServer, MariaDB, and other popular databases through JDBC connectors
+ Over 40 external SaaS platforms, such as SAP OData

For a full list of data sources from which you can import, see the following table:


| Source | Type | Supported data types | 
| --- | --- | --- | 
| Local file upload | Local | Tabular, Image, Document | 
| Amazon Aurora | Amazon internal | Tabular | 
| Amazon S3 bucket | Amazon internal | Tabular, Image, Document | 
| Amazon RDS | Amazon internal | Tabular | 
| Amazon Redshift provisioned clusters (not Redshift Serverless) | Amazon internal | Tabular | 
| AWS Glue Data Catalog (through Amazon Athena) | Amazon internal | Tabular | 
| [Databricks](https://www.databricks.com/) | External | Tabular | 
| Snowflake | External | Tabular | 
| [Salesforce Data Cloud](https://www.salesforce.com/products/genie/overview/) | External | Tabular | 
| SQLServer | External | Tabular | 
| MySQL | External | Tabular | 
| PostgreSQL | External | Tabular | 
| MariaDB | External | Tabular | 
| [Amplitude](https://docs.aws.amazon.com/appflow/latest/userguide/amplitude.html) | External SaaS platform | Tabular | 
| [CircleCI](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-circleci.html) | External SaaS platform | Tabular | 
| [DocuSign Monitor](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-docusign-monitor.html) | External SaaS platform | Tabular | 
| [Domo](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-domo.html) | External SaaS platform | Tabular | 
| [Datadog](https://docs.aws.amazon.com/appflow/latest/userguide/datadog.html) | External SaaS platform | Tabular | 
| [Dynatrace](https://docs.aws.amazon.com/appflow/latest/userguide/dynatrace.html) | External SaaS platform | Tabular | 
| [Facebook Ads](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-facebook-ads.html) | External SaaS platform | Tabular | 
| [Facebook Page Insights](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-facebook-page-insights.html) | External SaaS platform | Tabular | 
| [Google Ads](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-google-ads.html) | External SaaS platform | Tabular | 
| [Google Analytics 4](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-google-analytics-4.html) | External SaaS platform | Tabular | 
| [Google Search Console](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-google-search-console.html) | External SaaS platform | Tabular | 
| [GitHub](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-github.html) | External SaaS platform | Tabular | 
| [GitLab](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-gitlab.html) | External SaaS platform | Tabular | 
| [Infor Nexus](https://docs.aws.amazon.com/appflow/latest/userguide/infor-nexus.html) | External SaaS platform | Tabular | 
| [Instagram Ads](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-instagram-ads.html) | External SaaS platform | Tabular | 
| [Jira Cloud](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-jira-cloud.html) | External SaaS platform | Tabular | 
| [LinkedIn Ads](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-linkedin-ads.html) | External SaaS platform | Tabular | 
| [Mailchimp](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-mailchimp.html) | External SaaS platform | Tabular | 
| [Marketo](https://docs.aws.amazon.com/appflow/latest/userguide/marketo.html) | External SaaS platform | Tabular | 
| [Microsoft Teams](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-microsoft-teams.html) | External SaaS platform | Tabular | 
| [Mixpanel](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-mixpanel.html) | External SaaS platform | Tabular | 
| [Okta](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-okta.html) | External SaaS platform | Tabular | 
| [Salesforce](https://docs.aws.amazon.com/appflow/latest/userguide/salesforce.html) | External SaaS platform | Tabular | 
| [Salesforce Marketing Cloud](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-salesforce-marketing-cloud.html) | External SaaS platform | Tabular | 
| [Salesforce Pardot](https://docs.aws.amazon.com/appflow/latest/userguide/pardot.html) | External SaaS platform | Tabular | 
| [SAP OData](https://docs.aws.amazon.com/appflow/latest/userguide/sapodata.html) | External SaaS platform | Tabular | 
| [SendGrid](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-sendgrid.html) | External SaaS platform | Tabular | 
| [ServiceNow](https://docs.aws.amazon.com/appflow/latest/userguide/servicenow.html) | External SaaS platform | Tabular | 
| [Singular](https://docs.aws.amazon.com/appflow/latest/userguide/singular.html) | External SaaS platform | Tabular | 
| [Slack](https://docs.aws.amazon.com/appflow/latest/userguide/slack.html) | External SaaS platform | Tabular | 
| [Stripe](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-stripe.html) | External SaaS platform | Tabular | 
| [Trend Micro](https://docs.aws.amazon.com/appflow/latest/userguide/trend-micro.html) | External SaaS platform | Tabular | 
| [Typeform](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-typeform.html) | External SaaS platform | Tabular | 
| [Veeva](https://docs.aws.amazon.com/appflow/latest/userguide/veeva.html) | External SaaS platform | Tabular | 
| [Zendesk](https://docs.aws.amazon.com/appflow/latest/userguide/zendesk.html) | External SaaS platform | Tabular | 
| [Zendesk Chat](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-zendesk-chat.html) | External SaaS platform | Tabular | 
| [Zendesk Sell](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-zendesk-sell.html) | External SaaS platform | Tabular | 
| [Zendesk Sunshine](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-zendesk-sunshine.html) | External SaaS platform | Tabular | 
| [Zoom Meetings](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-zoom.html) | External SaaS platform | Tabular | 

For instructions on how to import data and information regarding input data requirements, such as the maximum file size for images, see [Create a dataset](canvas-import-dataset.md).

Canvas also provides several sample datasets in your application to help you get started. To learn more about the SageMaker AI-provided sample datasets you can experiment with, see [Use sample datasets](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-sample-datasets.html).

After you import a dataset into Canvas, you can update the dataset at any time. You can do a manual update or you can set up a schedule for automatic dataset updates. For more information, see [Update a dataset](canvas-update-dataset.md).

For more information specific to each dataset type, see the following sections:

**Tabular**

To import data from an external data source (such as a Snowflake database or a SaaS platform), you must authenticate and connect to the data source in the Canvas application. For more information, see [Connect to data sources](canvas-connecting-external.md).

If you want to import datasets larger than 5 GB from Amazon S3 into Canvas, you can achieve faster sampling by using Amazon Athena to query and sample the data from Amazon S3.

After creating datasets in Canvas, you can prepare and transform your data using the data preparation functionality of Data Wrangler. You can use Data Wrangler to handle missing values, transform your features, join multiple datasets into a single dataset, and more. For more information, see [Data preparation](canvas-data-prep.md).

**Tip**  
As long as your data is arranged into tables, you can join datasets from various sources, such as Amazon Redshift, Amazon Athena, or Snowflake.

**Image**

For information about how to edit an image dataset and perform tasks such as assigning or reassigning labels, adding images, or deleting images, see [Edit an image dataset](canvas-edit-image.md).

# Create a dataset


**Note**  
If you're importing datasets larger than 5 GB into Amazon SageMaker Canvas, we recommend that you use the [Data Wrangler feature](canvas-data-prep.md) in Canvas to create a data flow. Data Wrangler supports advanced data preparation features such as [joining](canvas-transform.md#canvas-transform-join) and [concatenating](canvas-transform.md#canvas-transform-concatenate) data. After you create a data flow, you can export your data flow as a Canvas dataset and begin building a model. For more information, see [Export to create a model](canvas-processing-export-model.md).

The following sections describe how to create a dataset in Amazon SageMaker Canvas. For custom models, you can create datasets for tabular and image data. For Ready-to-use models, you can use tabular and image datasets as well as document datasets. Choose your workflow based on the following information:
+ For categorical, numeric, text, and timeseries data, see [Import tabular data](#canvas-import-dataset-tabular).
+ For image data, see [Import image data](#canvas-import-dataset-image).
+ For document data, see [Import document data](#canvas-ready-to-use-import-document).

A dataset can consist of multiple files. For example, you might have multiple files of inventory data in CSV format. You can upload these files together as a dataset as long as the schema (or column names and data types) of the files match.
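
Before uploading multiple files as one dataset, you can confirm locally that their schemas match. The following is a quick sketch, assuming CSV files and pandas; the folder name is hypothetical.

```
import glob

import pandas as pd

# Compare each file's column names and inferred dtypes against the first
# file. Inference from a 100-row sample is a heuristic; read the full files
# if you need certainty.
files = sorted(glob.glob("inventory-data/*.csv"))  # hypothetical folder
reference = pd.read_csv(files[0], nrows=100).dtypes

for path in files[1:]:
    schema = pd.read_csv(path, nrows=100).dtypes
    if not schema.equals(reference):
        print(f"Schema mismatch in {path}")
```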

Canvas also supports managing multiple versions of your dataset. When you create a dataset, the first version is labeled as `V1`. You can create a new version of your dataset by updating your dataset. You can do a manual update, or you can set up an automated schedule for updating your dataset with new data. For more information, see [Update a dataset](canvas-update-dataset.md).

When you import your data into Canvas, make sure that it meets the requirements in the following table. The limitations are specific to the type of model you’re building.


| Limit | 2 category, 3+ category, numeric, and time series models | Text prediction models | Image prediction models | ¹Document data for Ready-to-use models | 
| --- | --- | --- | --- | --- | 
| Supported file types | CSV and Parquet (local upload, Amazon S3, or databases); JSON (databases) | CSV and Parquet (local upload, Amazon S3, or databases); JSON (databases) | JPG, PNG | PDF, JPG, PNG, TIFF | 
| Maximum file size | Local upload: 5 GB; data sources: PBs | Local upload: 5 GB; data sources: PBs | 30 MB per image | 5 MB per document | 
| Maximum number of files you can upload at a time | 30 | 30 | N/A | N/A | 
| Maximum number of columns | 1,000 | 1,000 | N/A | N/A | 
| Maximum number of entries (rows, images, or documents) for **Quick builds** | N/A | 7,500 rows | 5,000 images | N/A | 
| Maximum number of entries (rows, images, or documents) for **Standard builds** | N/A | 150,000 rows | 180,000 images | N/A | 
| Minimum number of entries (rows) for **Quick builds** | 2 category: 500 rows; 3+ category, numeric, time series: N/A | N/A | N/A | N/A | 
| Minimum number of entries (rows, images, or documents) for **Standard builds** | 250 rows | 50 rows | 50 images | N/A | 
| Minimum number of entries (rows or images) per label | N/A | 25 rows | 25 images | N/A | 
| Minimum number of labels | 2 category: 2; 3+ category: 3; numeric, time series: N/A | 2 | 2 | N/A | 
|  Minimum sample size for random sampling | 500 | N/A | N/A | N/A | 
|  Maximum sample size for random sampling | 200,000 | N/A | N/A | N/A | 
| Maximum number of labels | 2 category: 2; 3+ category, numeric, time series: N/A | 1,000 | 1,000 | N/A | 

¹Document data is currently only supported for [Ready-to-use models](canvas-ready-to-use-models.md) that accept document data. You can't build a custom model with document data.
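
As an example of applying these limits, the following sketch checks a text prediction dataset against the standard build minimums above (at least 50 rows, 25 rows per label, and 2 labels). It assumes pandas, and the file and target column names are hypothetical.

```
import pandas as pd

df = pd.read_csv("support-tickets.csv")  # hypothetical dataset

assert len(df) >= 50, "Standard builds need at least 50 rows"
assert df["label"].nunique() >= 2, "Text prediction needs at least 2 labels"

per_label = df["label"].value_counts()
low = per_label[per_label < 25]
assert low.empty, f"Labels with fewer than 25 rows:\n{low}"
```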

Also note the following restrictions (a pre-import check sketch follows this list):
+ When importing data from an Amazon S3 bucket, make sure that your Amazon S3 bucket name doesn't contain a `.`. If your bucket name contains a `.`, you might experience errors when trying to import data into Canvas.
+ For tabular data, Canvas disallows selecting any file with extensions other than .csv, .parquet, .parq, and .pqt for both local upload and Amazon S3 import. CSV files can use any common or custom delimiter, and they must not have newline characters except when denoting a new row.
+ For tabular data using Parquet files, note the following:
  + Parquet files can't include complex types like maps and lists.
  + The column names of Parquet files can't contain spaces.
  + If using compression, Parquet files must use either gzip or snappy compression types. For more information about the preceding compression types, see the [gzip documentation](https://www.gzip.org/) and the [snappy documentation](https://github.com/google/snappy).
+ For image data, if you have any unlabeled images, you must label them before building your model. For information about how to assign labels to images within the Canvas application, see [Edit an image dataset](canvas-edit-image.md).
+ If you set up automatic dataset updates or automatic batch prediction configurations, you can only create a total of 20 configurations in your Canvas application. For more information, see [How to manage automations](canvas-manage-automations.md).
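
The following sketch shows one way to catch these restrictions before attempting an import. It assumes pandas with a Parquet engine such as pyarrow, and the bucket and file names are hypothetical.

```
import pandas as pd

bucket = "my-canvas-data"  # hypothetical bucket name
assert "." not in bucket, "Bucket names containing '.' can cause import errors"

allowed = (".csv", ".parquet", ".parq", ".pqt")
path = "weekly-inventory.parquet"
assert path.lower().endswith(allowed), f"Unsupported extension: {path}"

# When writing Parquet for Canvas: no spaces in column names, and only gzip
# or snappy compression.
df = pd.read_csv("weekly-inventory.csv")
df.columns = [c.replace(" ", "_") for c in df.columns]
df.to_parquet(path, compression="snappy")
```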

After you import a dataset, you can view your datasets on the **Datasets** page at any time.

## Import tabular data


With tabular datasets, you can build categorical, numeric, time series forecasting, and text prediction models. Review the limitations table in the preceding [Create a dataset](#canvas-import-dataset) section to ensure that your data meets the requirements for tabular data.

Use the following procedure to import a tabular dataset into Canvas:

1. Open your SageMaker Canvas application.

1. In the left navigation pane, choose **Datasets**.

1. Choose **Import data**.

1. From the dropdown menu, choose **Tabular**.

1. In the popup dialog box, in the **Dataset name** field, enter a name for the dataset and choose **Create**.

1. On the **Create tabular dataset** page, open the **Data Source** dropdown menu.

1. Choose your data source:
   + To upload files from your computer, choose **Local upload**.
   + To import data from another source, such as an Amazon S3 bucket or a Snowflake database, search for your data source in the **Search data source** bar. Then, choose the tile for your desired data source.
**Note**  
You can only import data from the tiles that have an active connection. If you want to connect to a data source that is unavailable to you, contact your administrator. If you’re an administrator, see [Connect to data sources](canvas-connecting-external.md).

   The following screenshot shows the **Data Source** dropdown menu.  
![\[Screenshot showing the Data Source dropdown menu and a search for a data source in the search bar.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/import-data-choose-source.png)

1. (Optional) If you’re connecting to an Amazon Redshift or Snowflake database for the first time, a dialog box appears to create a connection. Fill out the dialog box with your credentials and choose **Create connection**. If you already have a connection, choose your connection.

1. From your data source, select your files to import. For local upload and importing from Amazon S3, you can select files. For Amazon S3 only, you also have the option to directly enter the S3 URI, alias, or ARN of your bucket or S3 access point in the **Input S3 endpoint** field, and then choose files to import. For database sources, you can drag-and-drop data tables from the left navigation pane.

1. (Optional) For tabular data sources that support SQL querying (such as Amazon Redshift, Amazon Athena, or Snowflake), you can choose **Edit in SQL** to run SQL queries against your data before importing it.

   The following screenshot shows the **Edit SQL** view for an Amazon Athena data source.  
![\[Screenshot showing a SQL query in the Edit SQL view for Amazon Athena data.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/import-data-edit-sql.png)

1. Choose **Preview dataset** to preview your data before importing it.

1. In the **Import settings**, enter a **Dataset name** or use the default dataset name.

1. (Optional) For data that you import from Amazon S3, you are shown the **Advanced** settings and can fill out the following fields:

   1. Toggle the **Use first row as header** option on if you want to use the first row of your dataset as the column names. If you selected multiple files, this applies to each file.

   1. If you're importing a CSV file, for the **File encoding (CSV)** dropdown, select your dataset file’s encoding. `UTF-8` is the default. (A sketch for re-encoding a file locally follows this procedure.)

   1. For the **Delimiter** dropdown, select the delimiter that separates each cell in your data. The default delimiter is `,`. You can also specify a custom delimiter.

   1. Select **Multi-line detection** if you’d like Canvas to parse your entire dataset for multi-line cells. By default, this option is not selected, and Canvas determines whether to use multi-line support by taking a sample of your data. However, Canvas might not detect any multi-line cells in the sample. If you have multi-line cells, we recommend that you select the **Multi-line detection** option to force Canvas to check your entire dataset for multi-line cells.

1. When you’re ready to import your data, choose **Create dataset**.
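
If a source file doesn't match the default import settings, you can also normalize it locally before uploading, as in the following sketch (hypothetical file names; the source file is assumed to be Latin-1 encoded and semicolon-delimited). It rewrites the file as comma-delimited UTF-8 so that the default settings apply.

```
import csv

with open("export-latin1.csv", encoding="latin-1", newline="") as src, \
        open("export-utf8.csv", "w", encoding="utf-8", newline="") as dst:
    reader = csv.reader(src, delimiter=";")  # delimiter used by the source
    writer = csv.writer(dst)                 # writes comma-delimited UTF-8
    writer.writerows(reader)
```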

While your dataset is importing into Canvas, you can see your datasets listed on the **Datasets** page. From this page, you can [View your dataset details](#canvas-view-dataset-details).

When the **Status** of your dataset shows as `Ready`, Canvas successfully imported your data and you can proceed with [building a model](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-build-model.html).

If you have a connection to a data source, such as an Amazon Redshift database or a SaaS connector, you can return to that connection. For Amazon Redshift and Snowflake, you can add another connection by creating another dataset, returning to the **Import data** page, and choosing the **Data Source** tile for that connection. From the dropdown menu, you can open the previous connection or choose **Add connection**.

**Note**  
For SaaS platforms, you can only have one connection per data source.

## Import image data


With image datasets, you can build single-label image prediction custom models, which predict a label for an image. Review the limitations table in the preceding [Create a dataset](#canvas-import-dataset) section to ensure that your image dataset meets the requirements for image data.

**Note**  
You can only import image datasets from local file upload or an Amazon S3 bucket. Also, for image datasets, you must have at least 25 images per label.

Use the following procedure to import an image dataset into Canvas:

1. Open your SageMaker Canvas application.

1. In the left navigation pane, choose **Datasets**.

1. Choose **Import data**.

1. From the dropdown menu, choose **Image**.

1. In the popup dialog box, in the **Dataset name** field, enter a name for the dataset and choose **Create**.

1. On the **Import** page, open the **Data Source** dropdown menu.

1. Choose your data source. To upload files from your computer, choose **Local upload**. To import files from Amazon S3, choose **Amazon S3**.

1. From your computer or Amazon S3 bucket, select the images or folders of images that you want to upload.

1. When you’re ready to import your data, choose **Import data**.

While your dataset is importing into Canvas, you can see your datasets listed on the **Datasets** page. From this page, you can [View your dataset details](#canvas-view-dataset-details).

When the **Status** of your dataset shows as `Ready`, Canvas successfully imported your data and you can proceed with [building a model](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-build-model.html).

When you are building your model, you can edit your image dataset, and you can assign or re-assign labels, add images, or delete images from your dataset. For more information about how to edit your image dataset, see [Edit an image dataset](canvas-edit-image.md).

## Import document data


Document data is supported only for the Ready-to-use models for expense analysis, identity document analysis, document analysis, and document queries. You can’t build a custom model with document data.

With document datasets, you can generate predictions with these Ready-to-use models. Review the limitations table in the [Create a dataset](#canvas-import-dataset) section to ensure that your document dataset meets the requirements for document data.

**Note**  
You can only import document datasets from local file upload or an Amazon S3 bucket.

Use the following procedure to import a document dataset into Canvas:

1. Open your SageMaker Canvas application.

1. In the left navigation pane, choose **Datasets**.

1. Choose **Import data**.

1. From the dropdown menu, choose **Document**.

1. In the popup dialog box, in the **Dataset name** field, enter a name for the dataset and choose **Create**.

1. On the **Import** page, open the **Data Source** dropdown menu.

1. Choose your data source. To upload files from your computer, choose **Local upload**. To import files from Amazon S3, choose **Amazon S3**.

1. From your computer or Amazon S3 bucket, select the document files that you want to upload.

1. When you’re ready to import your data, choose **Import data**.

While your dataset is importing into Canvas, you can see your datasets listed on the **Datasets** page. From this page, you can [View your dataset details](#canvas-view-dataset-details).

When the **Status** of your dataset shows as `Ready`, Canvas has successfully imported your data.

On the **Datasets** page, you can choose your dataset to preview it, which shows you up to the first 100 documents of your dataset.

## View your dataset details




For each of your datasets, you can view all of the files in a dataset, the dataset’s version history, and any auto update configurations for the dataset. From the **Datasets** page, you can also initiate actions such as updating your dataset (see [Update a dataset](canvas-update-dataset.md)) or building a model (see [How custom models work](canvas-build-model.md)).

To view the details for a dataset, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **Datasets**.

1. From the list of datasets, choose your dataset.

On the **Data** tab, you can see a preview of your data. If you choose **Dataset details**, you can see all of the files that are part of your dataset. Choose a file to see only the data from that file in the preview. For image datasets, the preview only shows you the first 100 images of your dataset.

On the **Version history** tab, you can see a list of all of the versions of your dataset. A new version is made whenever you update a dataset. To learn more about updating a dataset, see [Update a dataset](canvas-update-dataset.md). The following screenshot shows the **Version history** tab in the Canvas application.

![\[Screenshot of the Version history tab for a dataset, with a list of dataset versions.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-version-history.png)


On the **Auto updates** tab, you can enable auto updates for the dataset and set up a configuration to update your dataset on a regular schedule. To learn more about setting up auto updates for a dataset, see [Configure automatic updates for a dataset](canvas-update-dataset-auto.md). The following screenshot shows the **Auto updates** tab with auto updates turned on and a list of auto update jobs that have been performed on the dataset.

![\[The Auto updates tab for dataset showing the auto updates turned on and a list of auto update jobs.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-auto-updates.png)


# Update a dataset


After importing your initial dataset into Amazon SageMaker Canvas, you might have additional data that you want to add to your dataset. For example, you might get inventory data at the end of every week that you want to add to your dataset. Instead of importing your data multiple times, you can update your existing dataset and add or remove files from it.

**Note**  
You can only update datasets that you have imported through local upload or Amazon S3.

You can update your dataset either manually or automatically. For more information about automatic dataset updates, see [Configure automatic updates for a dataset](canvas-update-dataset-auto.md).

Every time you update your dataset, Canvas creates a new version of your dataset. You can only use the latest version of your dataset to build a model or generate predictions. For more information about viewing the version history of your dataset, see [View your dataset details](canvas-import-dataset.md#canvas-view-dataset-details).

You can also use dataset updates with automated batch predictions, which starts a batch prediction job whenever you update your dataset. For more information, see [Batch predictions in SageMaker Canvas](canvas-make-predictions-batch.md).

The following section describes how to do manual updates to your dataset.

## Manually update a dataset


To do a manual update, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **Datasets**.

1. From the list of datasets, choose the dataset you want to update.

1. Choose the **Update dataset** dropdown menu and choose **Manual update**. You are taken to the import data workflow.

1. From the **Data source** dropdown menu, choose either **Local upload** or **Amazon S3**.

1. The page shows you a preview of your data. From here, you can add or remove files from the dataset. If you’re importing tabular data, the schema of the new files (column names and data types) must match the schema of the existing files. Additionally, your new files must not exceed the maximum dataset size or file size. For more information about these limitations, see [Create a dataset](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-import-dataset.html).
**Note**  
If you add a file with the same name as an existing file in your dataset, the new file overwrites the old version of the file.

1. When you’re ready to save your changes, choose **Update dataset**.

You should now have a new version of your dataset.

On the **Datasets** page, you can choose the **Version history** tab to see all of the versions of your dataset and the history of both manual and automatic updates you’ve made.

# Configure automatic updates for a dataset


After importing your initial dataset into Amazon SageMaker Canvas, you might have additional data that you want to add to your dataset. For example, you might get inventory data at the end of every week that you want to add to your dataset. Instead of importing your data multiple times, you can update your existing dataset and add or remove files from it.

**Note**  
You can only update datasets that you have imported through local upload or Amazon S3.

With automatic dataset updates, you specify a location where Canvas checks for files at a frequency you specify. If you import new files during the update, the schema of the files must match the existing dataset exactly.

Every time you update your dataset, Canvas creates a new version of your dataset. You can only use the latest version of your dataset to build a model or generate predictions. For more information about viewing the version history of your dataset, see [View your dataset details](canvas-import-dataset.md#canvas-view-dataset-details).

You can also use dataset updates with automated batch predictions, which starts a batch prediction job whenever you update your dataset. For more information, see [Batch predictions in SageMaker Canvas](canvas-make-predictions-batch.md).

The following section describes how to do automatic updates to your dataset.

An automatic update is when you set up a configuration for Canvas to update your dataset at a given frequency. We recommend that you use this option if you regularly receive new files of data that you want to add to your dataset.

When you set up the auto update configuration, you specify an Amazon S3 location where you upload your files and a frequency at which Canvas checks the location and imports files. Each instance of Canvas updating your dataset is referred to as a *job*. For each job, Canvas imports all of the files in the Amazon S3 location. If you have new files with the same names as existing files in your dataset, Canvas overwrites the old files with the new files.
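
On your side, a scheduled script only needs to place new files at that location. The following is a sketch (the bucket, prefix, and file name are hypothetical) that uploads a weekly file with Boto3; Canvas imports everything at that location during the next job.

```
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="inventory-2024-11-11.csv",
    Bucket="my-canvas-data",
    Key="canvas/auto-updates/inventory-2024-11-11.csv",  # the watched folder
)
```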

For automatic dataset updates, Canvas doesn’t perform schema validation. If the schema of the files imported during an automatic update doesn’t match the schema of the existing files, or if the files exceed the size limitations (see [Create a dataset](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-import-dataset.html) for a table of file size limitations), then you get errors when your jobs run.

**Note**  
You can only set up a maximum of 20 automatic configurations in your Canvas application. Additionally, Canvas only does automatic updates while you’re logged in to your Canvas application. If you log out of your Canvas application, automatic updates pause until you log back in.

To configure automatic updates for your dataset, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **Datasets**.

1. From the list of datasets, choose the dataset you want to update.

1. Choose the **Update dataset** dropdown menu and choose **Automatic update**. You are taken to the **Auto updates** tab for the dataset.

1. Turn on the **Auto update enabled** toggle.

1. For **Specify a data source**, enter the Amazon S3 path to a folder where you plan to regularly upload files.

1. For **Choose a frequency**, select **Hourly**, **Weekly**, or **Daily**.

1. For **Specify a starting time**, use the calendar and time picker to select when you want the first auto update job to start.

1. When you’re ready to create the auto update configuration, choose **Save**.

Canvas begins the first job of your auto update cadence at the specified starting time.

# View your automatic dataset update jobs


To view the job history for your automatic dataset updates in Amazon SageMaker Canvas, on your dataset details page, choose the **Auto updates** tab.

Each automatic update to a dataset shows as a job in the **Auto updates** tab under the **Job history** section. For each job, you can see the following:
+ **Job created** – The timestamp for when Canvas started updating the dataset.
+ **Files** – The number of files in the dataset.
+ **Cells (Columns x Rows)** – The number of columns and rows in the dataset.
+ **Status** – The status of the dataset after the update. If the job was successful, the status is **Ready**. If the job failed for any reason, the status is **Failed**, and you can hover over the status for more details.

# Edit your automatic dataset update configuration


You might want to make changes to your auto update configuration for a dataset, such as changing the frequency of the updates. You might also want to turn off your automatic update configuration to pause the updates to your dataset.

To make changes to your auto update configuration for a dataset, go to the **Auto updates** tab of your dataset and choose **Edit** to make changes to the configuration.

To pause your dataset updates, turn off your automatic configuration. You can turn off auto updates by going to the **Auto updates** tab of your dataset and turning the **Enable auto updates** toggle off. You can turn this toggle back on at any time to resume the update schedule.

To learn how to delete your configuration, see [Delete an automatic configuration](canvas-manage-automations-delete.md).

# Connect to data sources


In Amazon SageMaker Canvas, you can import data from a location outside of your local file system through an AWS service, a SaaS platform, or other databases using JDBC connectors. For example, you might want to import tables from a data warehouse in Amazon Redshift, or you might want to import Google Analytics data.

When you go through the **Import** workflow to import data in the Canvas application, you can choose your data source and then select the data that you want to import. For certain data sources, like Snowflake and Amazon Redshift, you must specify your credentials and add a connection to the data source.

The following screenshot shows the **Data Source** dropdown menu in the **Import** workflow, with all of the available data sources shown. You can only import data from the data sources that are available to you. Contact your administrator if your desired data source isn’t available.

![\[The Data Source dropdown menu on the Import data page in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/data-sources.png)


The following sections provide information about establishing connections to external data sources and importing data from them. Review the following section first to determine which permissions you need to import data from your data source.

## Permissions


Review the following information to ensure that you have the necessary permissions to import data from your data source:
+ **Amazon S3:** You can import data from any Amazon S3 bucket as long as your user has permissions to access the bucket. For more information about using AWS IAM to control access to Amazon S3 buckets, see [Identity and access management in Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-access-control.html) in the *Amazon S3 User Guide*.
+ **Amazon Athena:** If you have the [AmazonSageMakerFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html) policy and the [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasFullAccess.html) policy attached to your user’s execution role, then you can query your AWS Glue Data Catalog with Amazon Athena. If you’re part of an Athena workgroup, make sure that the Canvas user has permissions to run Athena queries on the data. For more information, see [Using workgroups for running queries](https://docs.aws.amazon.com/athena/latest/ug/workgroups.html) in the *Amazon Athena User Guide*.
+ **Amazon DocumentDB:** You can import data from any Amazon DocumentDB database as long as you have the credentials (username and password) to connect to the database and have the minimum base Canvas permissions attached to your user’s execution role. For more information about Canvas permissions, see the [Prerequisites for setting up Amazon SageMaker Canvas](canvas-getting-started.md#canvas-prerequisites).
+ **Amazon Redshift:** To give yourself the necessary permissions to import data from Amazon Redshift, see [Grant Users Permissions to Import Amazon Redshift Data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-redshift-permissions.html).
+ **Amazon RDS:** If you have the [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasFullAccess.html) policy attached to your user’s execution role, then you’ll be able to access your Amazon RDS databases from Canvas.
+ **SaaS platforms:** If you have the [AmazonSageMakerFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html) policy and the [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasFullAccess.html) policy attached to your user’s execution role, then you have the necessary permissions to import data from SaaS platforms. See [Use SaaS connectors with Canvas](#canvas-connecting-external-appflow) for more information about connecting to a specific SaaS connector.
+ **JDBC connectors:** For database sources such as Databricks, MySQL or MariaDB, you must enable username and password authentication on the source database before attempting to connect from Canvas. If you’re connecting to a Databricks database, you must have the JDBC URL that contains the necessary credentials.

## Connect to a database stored in AWS


You might want to import data that you’ve stored in AWS. You can import data from Amazon S3, use Amazon Athena to query a database in the AWS Glue Data Catalog, import data from [Amazon RDS](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Welcome.html), or make a connection to a provisioned Amazon Redshift database (not Redshift Serverless).

You can create multiple connections to Amazon Redshift. For Amazon Athena, you can access any databases that you have in your [AWS Glue Data Catalog](https://docs.aws.amazon.com/prescriptive-guidance/latest/serverless-etl-aws-glue/aws-glue-data-catalog.html). For Amazon S3, you can import data from a bucket as long as you have the necessary permissions.

Review the following sections for more detailed information.

### Connect to data in Amazon S3, Amazon Athena, or Amazon RDS


For Amazon S3, you can import data from an Amazon S3 bucket as long as you have permissions to access the bucket.

For Amazon Athena, you can access databases in your AWS Glue Data Catalog as long as you have permissions through your [Amazon Athena workgroup](https://docs.aws.amazon.com/athena/latest/ug/manage-queries-control-costs-with-workgroups.html).

For Amazon RDS, if you have the [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasFullAccess.html) policy attached to your user’s role, then you’ll be able to import data from your Amazon RDS databases into Canvas.

To import data from an Amazon S3 bucket, or to run queries and import data tables with Amazon Athena, see [Create a dataset](canvas-import-dataset.md). You can only import tabular data from Amazon Athena, and you can import tabular and image data from Amazon S3.

### Connect to an Amazon DocumentDB database


Amazon DocumentDB is a fully managed, serverless, document database service. You can import unstructured document data stored in an Amazon DocumentDB database into SageMaker Canvas as a tabular dataset, and then you can build machine learning models with the data.

**Important**  
Your SageMaker AI domain must be configured in **VPC only** mode to add connections to Amazon DocumentDB. You can only access Amazon DocumentDB clusters in the same Amazon VPC as your Canvas application. Additionally, Canvas can only connect to TLS-enabled Amazon DocumentDB clusters. For more information about how to set up Canvas in **VPC only** mode, see [Configure Amazon SageMaker Canvas in a VPC without internet access](canvas-vpc.md).

To import data from Amazon DocumentDB databases, you must have credentials to access the Amazon DocumentDB database and specify the username and password when creating a database connection. You can configure more granular permissions and restrict access by modifying the Amazon DocumentDB user permissions. To learn more about access control in Amazon DocumentDB, see [Database Access Using Role-Based Access Control](https://docs.aws.amazon.com/documentdb/latest/developerguide/role_based_access_control.html) in the *Amazon DocumentDB Developer Guide*.

When you import from Amazon DocumentDB, Canvas converts your unstructured data into a tabular dataset by mapping the fields to columns in a table. Additional tables are created for each complex field (or nested structure) in the data, where the columns correspond to the sub-fields of the complex field. For more detailed information about this process and examples of schema conversion, see the [Amazon DocumentDB JDBC Driver Schema Discovery](https://github.com/aws/amazon-documentdb-jdbc-driver/blob/develop/src/markdown/schema/schema-discovery.md) GitHub page.
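
As a rough illustration of that conversion (a made-up document, following the schema discovery rules linked above), a nested field becomes its own table that joins back to the base table:

```
# A document in a hypothetical "orders" collection:
order = {
    "_id": "1",
    "customer": "Ana",
    "address": {"city": "Seattle", "zip": "98101"},
}
# Base table "orders":            _id | customer
# Virtual table "orders_address": _id | city | zip  (rows join back on _id)
```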

Canvas can only make a connection to a single database in Amazon DocumentDB. To import data from a different database, you must create a new connection.

You can import data from Amazon DocumentDB into Canvas by using the following methods:
+ [Create a dataset](canvas-import-dataset.md). You can import your Amazon DocumentDB data and create a tabular dataset in Canvas. If you choose this method, make sure that you follow the [Import tabular data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-import-dataset.html#canvas-import-dataset-tabular) procedure.
+ [Create a data flow](canvas-data-flow.md). You can create a data preparation pipeline in Canvas and add your Amazon DocumentDB database as a data source.

To proceed with importing your data, follow the procedure for one of the methods linked in the preceding list.

When you reach the step in either workflow to choose a data source (Step 6 for creating a dataset, or Step 8 for creating a data flow), do the following:

1. For **Data Source**, open the dropdown menu and choose **DocumentDB**.

1. Choose **Add connection**.

1. In the dialog box, specify your Amazon DocumentDB credentials:

   1. Enter a **Connection name**. This is a name used by Canvas to identify this connection.

   1. For **Cluster**, select the cluster in Amazon DocumentDB that stores your data. Canvas automatically populates the dropdown menu with Amazon DocumentDB clusters in the same VPC as your Canvas application.

   1. Enter the **Username** for your Amazon DocumentDB cluster.

   1. Enter the **Password** for your Amazon DocumentDB cluster.

   1. Enter the name of the **Database** to which you want to connect.

   1. The **Read preference** option determines which types of instances on your cluster Canvas reads the data from. Select one of the following:
      + **Secondary preferred** – Canvas defaults to reading from the cluster’s secondary instances, but if a secondary instance isn’t available, then Canvas reads from a primary instance.
      + **Secondary** – Canvas only reads from the cluster’s secondary instances, which prevents the read operations from interfering with the cluster’s regular read and write operations.

   1. Choose **Add connection**. The following image shows the dialog box with the preceding fields for an Amazon DocumentDB connection.  
![\[Screenshot of the Add a new DocumentDB connection dialog box in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/add-docdb-connection.png)

You should now have an Amazon DocumentDB connection, and you can use your Amazon DocumentDB data in Canvas to create either a dataset or a data flow.

### Connect to an Amazon Redshift database


You can import data from Amazon Redshift, a data warehouse where your organization keeps its data. Before you can import data from Amazon Redshift, the AWS IAM role you use must have the `AmazonRedshiftFullAccess` managed policy attached. For instructions on how to attach this policy, see [Grant Users Permissions to Import Amazon Redshift Data](canvas-redshift-permissions.md). 

To import data from Amazon Redshift, you do the following:

1. Create a connection to an Amazon Redshift database.

1. Choose the data that you're importing.

1. Import the data.

You can use the Amazon Redshift editor to drag datasets onto the import pane and import them into SageMaker Canvas. For more control over the values returned in the dataset, you can use the following:
+ SQL queries
+ Joins

With SQL queries, you can customize how you import the values in the dataset. For example, you can specify the columns returned in the dataset or the range of values for a column, such as `SELECT customer_id, revenue FROM sales WHERE revenue > 1000` (the table and column names here are hypothetical).

You can use joins to combine multiple datasets from Amazon Redshift into a single dataset. You can drag your datasets from Amazon Redshift into the panel that gives you the ability to join them.

You can use the SQL editor to edit the joined dataset and convert it into a single node, join another dataset to that node, and then import the data that you've selected into SageMaker Canvas.

Use the following procedure to import data from Amazon Redshift.

1. In the SageMaker Canvas application, go to the **Datasets** page.

1. Choose **Import data**, and from the dropdown menu, choose **Tabular**.

1. Enter a name for the dataset and choose **Create**.

1. For **Data Source**, open the dropdown menu and choose **Redshift**.

1. Choose **Add connection**.

1. In the dialog box, specify your Amazon Redshift credentials:

   1. For **Authentication method**, choose **IAM**.

   1. Enter the **Cluster identifier** to specify to which cluster you want to connect. Enter only the cluster identifier and not the full endpoint of the Amazon Redshift cluster.

   1. Enter the **Database name** of the database to which you want to connect.

   1. Enter a **Database user** to identify the user you want to use to connect to the database.

   1. For **ARN**, enter the IAM role ARN of the role that the Amazon Redshift cluster should assume to move and write data to Amazon S3. For more information about this role, see [Authorizing Amazon Redshift to access other AWS services on your behalf](https://docs.aws.amazon.com/redshift/latest/mgmt/authorizing-redshift-service.html) in the *Amazon Redshift Management Guide*.

   1. Enter a **Connection name**. This is a name used by Canvas to identify this connection.

1. From the tab that has the name of your connection, drag the table that you're importing to the **Drag and drop table to import** pane.

1. Optional: Drag additional tables to the import pane. You can use the GUI to join the tables. For more specificity in your joins, choose **Edit in SQL**.

1. Optional: If you're using SQL to query the data, you can choose **Context** to add context to the connection by specifying values for the following:
   + **Warehouse**
   + **Database**
   + **Schema**

1. Choose **Import data**.

The following image shows an example of fields specified for an Amazon Redshift connection.

![\[Screenshot of the Add a new Redshift connection dialog box in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-redshift-add-connection.png)


The following image shows the page used to join datasets in Amazon Redshift.

![\[Screenshot of the Import page in Canvas, showing two datasets being joined.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-redshift-join.png)


The following image shows an SQL query being used to edit a join in Amazon Redshift.

![\[Screenshot of a SQL query in the Edit SQL editor on the Import page in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-redshift-edit-sql.png)


## Connect to your data with JDBC connectors


With JDBC, you can connect to your databases from sources such as Databricks, SQLServer, MySQL, PostgreSQL, MariaDB, Amazon RDS, and Amazon Aurora.

You must make sure that you have the necessary credentials and permissions to create the connection from Canvas.
+ For Databricks, you must provide a JDBC URL. The URL formatting can vary between Databricks instances. For information about finding the URL and specifying the parameters within it, see [JDBC configuration and connection parameters](https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html#jdbc-configuration-and-connection-parameters) in the Databricks documentation. The following is an example of how a URL can be formatted: `jdbc:spark://aws-sagemaker-datawrangler.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/3122619508517275/0909-200301-cut318;AuthMech=3;UID=token;PWD=personal-access-token`
+ For other database sources, you must set up username and password authentication, and then specify those credentials when connecting to the database from Canvas. 

Additionally, your data source must either be accessible through the public internet, or if your Canvas application is running in **VPC only** mode, then the data source must run in the same VPC. For more information about configuring an Amazon RDS database in a VPC, see [Amazon Virtual Private Cloud VPCs and Amazon RDS](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_VPC.html) in the *Amazon RDS User Guide*.

After you’ve configured your data source credentials, you can sign in to the Canvas application and create a connection to the data source. Specify your credentials (or, for Databricks, the URL) when creating the connection.

## Connect to data sources with OAuth


Canvas supports using OAuth as an authentication method for connecting to your data in Snowflake and Salesforce Data Cloud. [OAuth](https://oauth.net/2/) is a common authentication platform for granting access to resources without sharing passwords.

**Note**  
You can only establish one OAuth connection for each data source.

To authorize the connection, you must follow the initial setup described in [Set up connections to data sources with OAuth](canvas-setting-up-oauth.md).

After setting up the OAuth credentials, you can do the following to add a Snowflake or Salesforce Data Cloud connection with OAuth:

1. Sign in to the Canvas application.

1. Create a tabular dataset. When prompted to upload data, choose Snowflake or Salesforce Data Cloud as your data source.

1. Create a new connection to your Snowflake or Salesforce Data Cloud data source. Specify OAuth as the authentication method and enter your connection details.

You should now be able to import data from your databases in Snowflake or Salesforce Data Cloud.

## Connect to a SaaS platform


You can import data from Snowflake and over 40 other external SaaS platforms. For a full list of the connectors, see the table on [Data import](canvas-importing-data.md).

**Note**  
You can only import tabular data, such as data tables, from SaaS platforms.

### Use Snowflake with Canvas


Snowflake is a data storage and analytics service, and you can import your data from Snowflake into SageMaker Canvas. For more information about Snowflake, see the [Snowflake documentation](https://www.snowflake.com/en/).

You can import data from your Snowflake account by doing the following:

1. Create a connection to the Snowflake database.

1. Choose the data that you're importing by dragging and dropping the table from the left navigation menu into the editor.

1. Import the data.

You can use the Snowflake editor to drag datasets onto the import pane and import them into SageMaker Canvas. For more control over the values returned in the dataset, you can use the following:
+ SQL queries
+ Joins

With SQL queries, you can customize how you import the values in the dataset. For example, you can specify the columns returned in the dataset or the range of values for a column.

You can join multiple Snowflake datasets into a single dataset before you import into Canvas by using SQL or the Canvas interface. You can drag your datasets from Snowflake into the panel that gives you the ability to join them, or you can edit the joins in SQL and convert the SQL into a single node. You can join other nodes to the node that you've converted, combine the datasets that you've joined into a single node, and join the result to a different Snowflake dataset. Finally, you can import the data that you've selected into Canvas.

Use the following procedure to import data from Snowflake to Amazon SageMaker Canvas.

1. In the SageMaker Canvas application, go to the **Datasets** page.

1. Choose **Import data**, and from the dropdown menu, choose **Tabular**.

1. Enter a name for the dataset and choose **Create**.

1. For **Data Source**, open the dropdown menu and choose **Snowflake**.

1. Choose **Add connection**.

1. In the **Add a new Snowflake connection** dialog box, specify your Snowflake credentials. For the **Authentication method**, choose one of the following:
   + **Basic - username password** – Provide your Snowflake account ID, username, and password.
   + **ARN** – For improved protection of your Snowflake credentials, provide the ARN of an AWS Secrets Manager secret that contains your credentials. For more information, see [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html) in the *AWS Secrets Manager User Guide*. A sketch for creating such a secret programmatically follows this procedure.

     Your secret should have your Snowflake credentials stored in the following JSON format:

     ```
     {"accountid": "ID",
     "username": "username",
     "password": "password"}
     ```
   + **OAuth** – OAuth lets you authenticate without providing a password but requires additional setup. For more information about setting up OAuth credentials for Snowflake, see [Set up connections to data sources with OAuth](canvas-setting-up-oauth.md).

1. Choose **Add connection**.

1. From the tab that has the name of your connection, drag the table that you're importing to the **Drag and drop table to import** pane.

1. Optional: Drag additional tables to the import pane. You can use the user interface to join the tables. For more specificity in your joins, choose **Edit in SQL**.

1. Optional: If you're using SQL to query the data, you can choose **Context** to add context to the connection by specifying values for the following:
   + **Warehouse**
   + **Database**
   + **Schema**

   Adding context to a connection makes it easier to specify future queries.

1. Choose **Import data**.
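
If you chose the **ARN** authentication method, you can create the Secrets Manager secret programmatically. The following is a minimal boto3 sketch; the secret name and credential values are placeholders:

```
import json

import boto3

secretsmanager = boto3.client("secretsmanager")

# Hypothetical secret name; the JSON matches the format shown in the procedure above
response = secretsmanager.create_secret(
    Name="my-canvas-snowflake-secret",
    SecretString=json.dumps(
        {"accountid": "ID", "username": "username", "password": "password"}
    ),
)

# Use the returned ARN in the Canvas connection dialog
print(response["ARN"])
```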

The following image shows an example of fields specified for a Snowflake connection.

![\[Screenshot of the Add a new Snowflake connection dialog box in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-snowflake-connection.png)


The following image shows the page used to add context to a connection.

![\[Screenshot of the Import page in Canvas, showing the Context dialog box.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-connection-context.png)


The following image shows the page used to join datasets in Snowflake.

![\[Screenshot of the Import page in Canvas, showing datasets being joined.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-snowflake-join.png)


The following image shows a SQL query being used to edit a join in Snowflake.

![\[Screenshot of a SQL query in the Edit SQL editor on the Import page in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-snowflake-edit-sql.png)


### Use SaaS connectors with Canvas


**Note**  
For SaaS platforms besides Snowflake, you can only have one connection per data source.

Before you can import data from a SaaS platform, your administrator must authenticate and create a connection to the data source. For more information about how administrators can create a connection with a SaaS platform, see [Managing Amazon AppFlow connections](https://docs.aws.amazon.com/appflow/latest/userguide/connections.html) in the *Amazon AppFlow User Guide*.

If you’re an administrator getting started with Amazon AppFlow for the first time, see [Getting started](https://docs.aws.amazon.com/appflow/latest/userguide/getting-started.html) in the *Amazon AppFlow User Guide*.

To import data from a SaaS platform, you can follow the standard [Import tabular data](canvas-import-dataset.md#canvas-import-dataset-tabular) procedure, which shows you how to import tabular datasets into Canvas.

# Sample datasets in Canvas


SageMaker Canvas provides sample datasets addressing unique use cases so you can start building, training, and validating models quickly without writing any code. The use cases associated with these datasets highlight the capabilities of SageMaker Canvas, and you can leverage these datasets to get started with building models. You can find the sample datasets in the **Datasets** page of your SageMaker Canvas application.

The following datasets are the samples that SageMaker Canvas provides by default. These datasets cover use cases such as predicting house prices, loan defaults, and readmission for diabetic patients; forecasting sales; predicting machine failures to streamline predictive maintenance in manufacturing units; and generating supply chain predictions for transportation and logistics. The datasets are stored in the `sample_dataset` folder in the default Amazon S3 bucket that SageMaker AI creates for your account in a Region.
+ **canvas-sample-diabetic-readmission.csv:** This dataset contains historical data including over fifteen features with patient and hospital outcomes. You can use this dataset to predict whether high-risk diabetic patients are likely to get readmitted to the hospital within 30 days of discharge, after 30 days, or not at all. Use the **readmitted** column as the target column, and use the 3+ category prediction model type with this dataset. To learn more about how to build a model with this dataset, see the [SageMaker Canvas workshop page](https://catalog.us-east-1.prod.workshops.aws/workshops/80ba0ea5-7cf9-4b8c-9d3f-1cd988b6c071/en-US/zzz-legacy/1-use-cases/5-hcls). This dataset was obtained from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008). 
+ **canvas-sample-housing.csv:** This dataset contains data on the characteristics tied to a given housing price. You can use this dataset to predict housing prices. Use the **median\_house\_value** column as the target column, and use the numeric prediction model type with this dataset. To learn more about building a model with this dataset, see the [SageMaker Canvas workshop page](https://catalog.us-east-1.prod.workshops.aws/workshops/80ba0ea5-7cf9-4b8c-9d3f-1cd988b6c071/en-US/zzz-legacy/1-use-cases/2-real-estate). This is the California housing dataset obtained from the [StatLib repository](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html).
+ **canvas-sample-loans.csv:** This dataset contains complete loan data for all loans issued from 2007–2011, including the current loan status and latest payment information. You can use this dataset to predict whether a customer will repay a loan. Use the **loan\_status** column as the target column, and use the 3+ category prediction model type with this dataset. To learn more about how to build a model with this dataset, see the [SageMaker Canvas workshop page](https://catalog.us-east-1.prod.workshops.aws/workshops/80ba0ea5-7cf9-4b8c-9d3f-1cd988b6c071/en-US/zzz-legacy/1-use-cases/4-finserv). This is LendingClub data obtained from [Kaggle](https://www.kaggle.com/datasets/wordsforthewise/lending-club).
+ **canvas-sample-maintenance.csv:** This dataset contains data on the characteristics tied to a given maintenance failure type. You can use this dataset to predict which failure will occur in the future. Use the **Failure Type** column as the target column, and use the 3+ category prediction model type with this dataset. To learn more about how to build a model with this dataset, see the [SageMaker Canvas workshop page](https://catalog.us-east-1.prod.workshops.aws/workshops/80ba0ea5-7cf9-4b8c-9d3f-1cd988b6c071/en-US/zzz-legacy/1-use-cases/6-manufacturing). This dataset was obtained from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset).
+ **canvas-sample-shipping-logs.csv:** This dataset contains complete shipping data for all products delivered, including estimated time, shipping priority, carrier, and origin. You can use this dataset to predict the estimated time of arrival of the shipment in number of days. Use the **ActualShippingDays** column as the target column, and use the numeric prediction model type with this dataset. To learn more about how to build a model with this data, see the [SageMaker Canvas workshop page](https://catalog.us-east-1.prod.workshops.aws/workshops/80ba0ea5-7cf9-4b8c-9d3f-1cd988b6c071/en-US/zzz-legacy/1-use-cases/7-supply-chain). This is a synthetic dataset created by Amazon.
+ **canvas-sample-sales-forecasting.csv:** This dataset contains historical time series sales data for retail stores. You can use this dataset to forecast sales for a particular retail store. Use the **sales** column as the target column, and use the time series forecasting model type with this dataset. To learn more about how to build a model with this dataset, see the [SageMaker Canvas workshop page](https://catalog.us-east-1.prod.workshops.aws/workshops/80ba0ea5-7cf9-4b8c-9d3f-1cd988b6c071/en-US/zzz-legacy/1-use-cases/3-retail). This is a synthetic dataset created by Amazon.

# Re-import a deleted sample dataset


Amazon SageMaker Canvas provides you with sample datasets for various use cases that highlight the capabilities of Canvas. To learn more about the sample datasets that are available, see [Sample datasets in Canvas](canvas-sample-datasets.md). If you no longer wish to use the sample datasets, you can delete them from the **Datasets** page of your SageMaker Canvas application. However, these datasets are still stored in the Amazon S3 bucket that you specified as the [Canvas storage location](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-storage-configuration.html), so you can always access them later. 

If you used the default Amazon S3 bucket, the bucket name follows the pattern `sagemaker-{region}-{account ID}`. You can find the sample datasets in the directory path `Canvas/sample_dataset`.

If you delete a sample dataset from your SageMaker Canvas application and want to access the sample dataset again, use the following procedure.

1. Navigate to the **Datasets** page in your SageMaker Canvas application.

1. Choose **Import data**.

1. From the list of Amazon S3 buckets, select the bucket that is your Canvas storage location. If using the default SageMaker AI-created Amazon S3 bucket, it follows the naming pattern `sagemaker-{region}-{account ID}`.

1. Select the **Canvas** folder.

1. Select the **sample\_dataset** folder, which contains all of the sample datasets for SageMaker Canvas.

1. Select the dataset you want to import, and then choose **Import data**.

# Data preparation


**Note**  
Previously, Amazon SageMaker Data Wrangler was part of the SageMaker Studio Classic experience. Now, if you update to using the new Studio experience, you must use SageMaker Canvas to access Data Wrangler and receive the latest feature updates. If you have been using Data Wrangler in Studio Classic until now and want to migrate to Data Wrangler in Canvas, you might have to grant additional permissions so that you can create and use a Canvas application. For more information, see [(Optional) Migrate from Data Wrangler in Studio Classic to SageMaker Canvas](studio-updated-migrate-ui.md#studio-updated-migrate-dw).  
To learn how to migrate your data flows from Data Wrangler in Studio Classic, see [(Optional) Migrate data from Studio Classic to Studio](studio-updated-migrate-data.md).

Use Amazon SageMaker Data Wrangler in Amazon SageMaker Canvas to prepare, featurize, and analyze your data. You can integrate a Data Wrangler data preparation flow into your machine learning (ML) workflows to simplify and streamline data preprocessing and feature engineering with little to no coding. You can also add your own Python scripts and transformations to customize workflows.
+ **Data Flow** – Create a data flow to define a series of ML data prep steps. You can use a flow to combine datasets from different data sources, identify the number and types of transformations you want to apply to datasets, and define a data prep workflow that can be integrated into an ML pipeline. 
+ **Transform** – Clean and transform your dataset using standard *transforms* like string, vector, and numeric data formatting tools. Featurize your data using transforms like text and date/time embedding and categorical encoding.
+ **Generate Data Insights** – Automatically verify data quality and detect abnormalities in your data with Data Wrangler Data Quality and Insights Report. 
+ **Analyze** – Analyze features in your dataset at any point in your flow. Data Wrangler includes built-in data visualization tools like scatter plots and histograms, as well as data analysis tools like target leakage analysis and quick modeling to understand feature correlation. 
+ **Export** – Export your data preparation workflow to a different location. The following are example locations: 
  + Amazon Simple Storage Service (Amazon S3) bucket
  + Amazon SageMaker Feature Store – Store the features and their data in a centralized store.
+ **Automate data preparation** – Create machine learning workflows from your data flow.
  + Amazon SageMaker Pipelines – Build workflows that manage your SageMaker AI data preparation, model training, and model deployment jobs.
  + Serial inference pipeline – Create a serial inference pipeline from your data flow. Use it to make predictions on new data.
  + Python script – Store the data and their transformations in a Python script for your custom workflows.

# Create a data flow


Use a Data Wrangler flow in SageMaker Canvas, or *data flow*, to create and modify a data preparation pipeline. We recommend that you use Data Wrangler for datasets larger than 5 GB.

To get started, use the following procedure to import your data into a data flow.

1. Open SageMaker Canvas.

1. In the left-hand navigation, choose **Data Wrangler**.

1. Choose **Import and prepare**.

1. From the dropdown menu, choose either **Tabular** or **Image**.

1. For **Select a data source**, choose your data source and select the data that you want to import. You have the option to select up to 30 files or one folder. If you have a dataset already imported into Canvas, choose **Canvas dataset** as your source. Otherwise, connect to a data source such as Amazon S3 or Snowflake and browse through your data. For information about connecting to a data source or importing data, see the following pages:
   + [Data import](canvas-importing-data.md)
   + [Connect to data sources](canvas-connecting-external.md)

1. After selecting the data that you want to import, choose **Next**.

1. (Optional) For the **Import settings** section when importing a tabular dataset, expand the **Advanced** dropdown menu. You can specify the following advanced settings for data flow imports:
   + **Sampling method** – Select the sampling method and sample size you'd like to use. For more information about how to change your sample, see the section [Edit the data flow sampling configuration](canvas-data-flow-edit-sampling.md).
   + **File encoding (CSV)** – Select your dataset file’s encoding. `UTF-8` is the default.
   + **Skip first rows** – Enter the number of rows you’d like to skip importing if you have redundant rows at the beginning of your dataset.
   + **Delimiter** – Select the delimiter that separates each item in your data. You can also specify a custom delimiter.
   + **Multi-line detection** – Select this option if you’d like Canvas to parse your entire dataset for multi-line cells. Canvas determines whether or not to use multi-line support by taking a sample of your data, but Canvas might not detect any multi-line cells in the sample. In this case, we recommend that you select the **Multi-line detection** option to force Canvas to check your entire dataset for multi-line cells.

1. Choose **Import**.

You should now have a new data flow, and you can begin adding transform steps and analyses.

# How the data flow UI works


To help you navigate your data flow, Data Wrangler has the following tabs in the top navigation pane:
+ **Data flow** – This tab provides you with a visual view of your data flow step where you can add or remove transforms, and export data.
+ **Data** – This tab gives you a preview of your data so that you can check the results of your transforms. You can also see an ordered list of your data flow steps and edit or reorder the steps.
**Note**  
In this tab, you can only preview data visualizations (such as the distribution of values per column) for Amazon S3 data sources. Visualizations for other data sources, such as Amazon Athena, aren't supported.
+ **Analyses** – In this tab, you can see separate sub-tabs for each analysis you create. For example, if you create a histogram and a Data Quality and Insights (DQI) report, Canvas creates a tab for each.

When you import a dataset, the original dataset appears on the data flow and is named **Source**. SageMaker Canvas automatically infers the types of each column in your dataset and creates a new dataframe named **Data types**. You can select this frame to update the inferred data types.

The datasets, transformations, and analyses that you use in the data flow are represented as *steps*. Each time you add a transform step, you create a new dataframe. When multiple transform steps (other than **Join** or **Concatenate**) are added to the same dataset, they are stacked.

Under the **Combine data** option, **Join** and **Concatenate** create standalone steps that contain the new joined or concatenated dataset.

# Edit the data flow sampling configuration


When importing tabular data into a Data Wrangler data flow, you can opt to take a sample of your dataset to speed up the data exploration and cleaning process. Running exploratory transforms on a sample of your dataset is often faster than running transforms on your entire dataset, and when you're ready to export your dataset and build a model, you can apply the transforms to the full dataset.

Canvas supports the following sampling methods:
+ **FirstK** – Canvas selects the first *K* items from your dataset, where *K* is a number you specify. This sampling method is simple but can introduce bias if your dataset isn't randomly ordered.
+ **Random** – Canvas selects items from the dataset at random, with each item having an equal probability of being chosen. This sampling method helps ensure that the sample is representative of the entire dataset.
+ **Stratified** – Canvas divides the dataset into groups (or *strata*) based on one or more attributes (for example, age and income level). Then, a proportional number of items are randomly selected from each group. This method ensures that all relevant subgroups are adequately represented in the sample.
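
For intuition, the following pandas sketch mirrors these three methods; the file name, sample size, and stratification column (`segment`) are hypothetical:

```
import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical file
k = 20_000                       # hypothetical sample size

first_k = df.head(k)                                      # FirstK: biased if data is ordered
random_k = df.sample(n=min(k, len(df)), random_state=0)   # Random: equal probability per row

# Stratified: sample proportionally within each group of a hypothetical "segment" column
frac = min(k / len(df), 1.0)
stratified = (
    df.groupby("segment", group_keys=False)
      .apply(lambda g: g.sample(frac=frac, random_state=0))
)
```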

You can edit your sampling configuration at any time to change the size of the sample used for data exploration.

To make changes to your sampling configuration, do the following:

1. In your data flow graph, select your data source node.

1. Choose **Sampling** on the bottom navigation bar.

1. The **Sampling** dialog box opens. For the **Sampling method** dropdown, select your desired sampling method.

1. For **Maximum sample size**, enter the number of rows you want to sample.

1. Choose **Update** to save your changes.

The changes to your sampling configuration should now be applied.

# Add a step to your data flow


In your Data Wrangler data flows, you can add steps that represent data transformations and analyses.

To add a step to your data flow, select the plus icon (**+**) next to any dataset node or previously added step. Then, select one of the following options:
+ **Edit data types** (For a **Data types** step only): If you have not added any transforms to a **Data types** step, you can double-click on the **Data types** step in your flow to open the **Data** tab and edit the data types that Data Wrangler inferred when importing your dataset. 
+ **Add transform**: Adds a new transform step. See [Transform data](canvas-transform.md) to learn more about the data transformations you can add. 
+ **Get data insights**: Add analyses, such as histograms or custom visualizations. You can use this option to analyze your data at any point in the data flow. See [Perform exploratory data analysis (EDA)](canvas-analyses.md) to learn more about the analyses you can add. 
+ **Join**: Find this option under **Combine data** to join two datasets and add the resulting dataset to the data flow. To learn more, see [Join Datasets](canvas-transform.md#canvas-transform-join).
+ **Concatenate**: Find this option under **Combine data** to concatenate two datasets and add the resulting dataset to the data flow. To learn more, see [Concatenate Datasets](canvas-transform.md#canvas-transform-concatenate).

# Edit data flow steps


In Amazon SageMaker Canvas, you can edit individual steps in your data flows to transform your dataset without having to create a new data flow. The following page covers how to edit join and concatenate steps, as well as data source steps.

## Edit join and concatenate steps


Within your data flows, you have the flexibility to edit your join and concatenate steps. You can make necessary adjustments to your data processing workflow, ensuring that your data is properly combined and transformed without having to redo your entire data flow.

To edit a join or concatenate step in your data flow, do the following:

1. Open your data flow.

1. Choose the plus icon (**+**) next to the join or concatenate node that you want to edit.

1. From the context menu, choose **Edit**.

1. A side panel opens where you can edit the details of your join or concatenation. Modify your step fields, such as the type of join. To swap out a data node and select a different one to join or concatenate, choose the delete icon next to the node and then, in the data flow view, select the new node that you want to include in your transformation.
**Note**  
When swapping out a node during the editing process, you can only select steps that occur before the join or concatenate operation. You can swap either the left or right node, but you can only swap one node at a time. Additionally, you cannot select a source node as a replacement.

1. Choose **Preview** to view the result of the combining operation.

1. Choose **Update** to save your changes.

Your data flow should now be updated.

## Edit or replace a data source step


You might need to make changes to your data source or dataset without deleting the transforms and data flow steps applied to your original data. Within Data Wrangler, you can edit or replace your data source configuration while keeping the steps of your data flow. When editing a data source, you can change the import settings, such as the sampling size or method and any advanced settings. You can also add more files with the same schema, or for query-based data sources such as Amazon Athena, you can edit the query. When replacing a data source, you have the option to select a different dataset, or even import the data from a different data source altogether, as long as the schema of the new data matches the original data.

To edit a data source configuration, do the following:

1. In the Canvas application, go to the **Data Wrangler** page.

1. Choose your data flow to view it.

1. In the **Data flow** tab that shows your data flow steps, find the **Source** node that you want to edit.

1. Choose the ellipsis icon next to the **Source** node.

1. From the context menu, choose **Edit**.

1. For Amazon S3 data sources and local upload, you have the option to select or upload more files with the same schema as your original data. For query-based data sources such as Amazon Athena, you can remove and select different tables in the visual query builder, or you can edit the SQL query directly. When you're done, choose **Next**.

1. For the **Import settings**, make any desired changes.

1. When you're done, choose **Save changes**.

Your data source should now be updated.

To replace a data source, do the following:

1. In the Canvas application, go to the **Data Wrangler** page.

1. Choose your data flow to view it.

1. In the **Data flow** tab that shows your data flow steps, find the **Source** node that you want to edit.

1. Choose the ellipsis icon next to the **Source** node.

1. From the context menu, choose **Replace**.

1. Go through the [create a data flow experience](canvas-data-flow.md) to select another data source and data.

1. When you’ve selected your data and are ready to update the source node, choose **Save**.

You should now see the **Source** node updated in your data flow.

# Reorder steps in your data flow


After adding steps to your data flow, you have the option to reorder steps instead of deleting and re-adding them in the correct order. For example, you might decide to move a transform to impute missing values before a step to format strings.

**Note**  
You can’t change the order of certain step types, such as defining your data source, changing data types, joining, concatenating, or splitting. Steps that can’t be reordered are grayed out in the Canvas application UI.

To reorder your data flow steps, do the following:

1. While editing a data flow in Data Wrangler, choose the **Data** tab. A side panel called **Steps** lists your data flow steps in order.

1. Hover over a transform step and choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) next to that step.

1. From the context menu, choose **Reorder**.

1. Drag and drop your data flow steps into your desired order.

1. When you’ve finished, choose **Save**.

Your data flow steps and graph should now reflect the changes you’ve made.

# Delete a step from your data flow


Within your data flows, you have the flexibility to delete your join and concatenate steps and choose whether or not to still apply any subsequent transforms to your data.

To delete a join or concatenate step from your data flow, do the following:

1. Open your data flow.

1. Choose the plus icon (**+**) next to the join or concatenate node that you want to delete.

1. In the context menu, choose **Delete**.

1. (Optional) If you have transformation steps following the join or concatenate step, then you can choose whether or not to keep the subsequent transformation steps and add them separately to each data node. In the **Delete join** side panel, choose a node to deselect it and remove any subsequent transformation steps. You can leave both nodes selected to keep all transformation steps, or you can deselect both nodes to discard all transformation steps.

   The following screenshot shows this step with only the second of two data nodes selected. When the join is deleted, the subsequent **Rename column** transform is kept only by the second data node.  
![\[Screenshot of a data flow in Data Wrangler showing the delete join view.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-data-flow-delete-step.png)

1. Choose **Delete**.

The join or concatenate step should now be removed from your data flow.

# Perform exploratory data analysis (EDA)
Perform EDA

Data Wrangler includes built-in analyses that help you generate visualizations and data analyses in a few clicks. You can also create custom analyses using your own code. 

You add an analysis to a dataframe by selecting a step in your data flow, and then choosing **Add analysis**. To access an analysis you've created, select the step that contains the analysis, and select the analysis. 

Analyses are generated using a sample of up to 200,000 rows of your dataset, and you can configure the sample size. For more information about changing the sample size of your data flow, see [Edit the data flow sampling configuration](canvas-data-flow-edit-sampling.md).

**Note**  
Analyses are optimized for data with 1000 or fewer columns. You may experience some latency when generating analyses for data with additional columns.

You can add the following analyses to a dataframe:
+ Data visualizations, including histograms and scatter plots. 
+ A quick summary of your dataset, including number of entries, minimum and maximum values (for numeric data), and most and least frequent categories (for categorical data).
+ A quick model of the dataset, which can be used to generate an importance score for each feature. 
+ A target leakage report, which you can use to determine if one or more features are strongly correlated with your target feature.
+ A custom visualization using your own code. 

Use the following sections to learn more about these options.

## Get insights on data and data quality


Use the **Data Quality and Insights Report** to perform an analysis of the data that you've imported into Data Wrangler. We recommend that you create the report after you import your dataset. You can use the report to help you clean and process your data. It gives you information such as the number of missing values and the number of outliers. If you have issues with your data, such as target leakage or imbalance, the insights report can bring those issues to your attention.

Use the following procedure to create a Data Quality and Insights report. It assumes that you've already imported a dataset into your Data Wrangler flow.

**To create a Data Quality and Insights report**

1. Choose the ellipsis icon next to a node in your Data Wrangler flow.

1. Select **Get data insights**.

1. For **Analysis type**, select **Data Quality and Insights Report**.

1. For **Analysis name**, specify a name for the insights report.

1. For **Problem type**, specify **Regression** or **Classification**.

1. For **Target column**, specify the target column.

1. For **Data size**, specify one of the following:
   + **Sampled dataset** – Uses the interactive sample from your data flow, which can contain up to 200,000 rows of your dataset. For information about how to edit the size of your sample, see [Edit the data flow sampling configuration](canvas-data-flow-edit-sampling.md).
   + **Full dataset** – Uses the full dataset from your data source to create the report.
**Note**  
Creating a Data Quality and Insights report on the full dataset uses an Amazon SageMaker Processing job. A SageMaker Processing job provisions the additional compute resources required to get insights for all of your data. For more information about SageMaker Processing jobs, see [Data transformation workloads with SageMaker Processing](processing-job.md).

1. Choose **Create**.

The following topics show the sections of the report:

**Topics**
+ [

### Summary
](#canvas-data-insights-summary)
+ [

### Target column
](#canvas-data-insights-target-column)
+ [

### Quick model
](#canvas-data-insights-quick-model)
+ [

### Feature summary
](#canvas-data-insights-feature-summary)
+ [

### Samples
](#canvas-data-insights-samples)
+ [

### Definitions
](#canvas-data-insights-definitions)

You can either download the report or view it online. To download the report, choose the download button at the top right corner of the screen. 

### Summary


The insights report has a brief summary of the data that includes general information such as missing values, invalid values, feature types, outlier counts, and more. It can also include high severity warnings that point to probable issues with the data. We recommend that you investigate the warnings.

### Target column


When you create the Data Quality and Insights Report, Data Wrangler gives you the option to select a target column. A target column is a column that you're trying to predict. When you choose a target column, Data Wrangler automatically creates a target column analysis. It also ranks the features in the order of their predictive power. When you select a target column, you must specify whether you’re trying to solve a regression or a classification problem.

For classification, Data Wrangler shows a table and a histogram of the most common classes. A class is a category. It also presents observations, or rows, with a missing or invalid target value.

For regression, Data Wrangler shows a histogram of all the values in the target column. It also presents observations, or rows, with a missing, invalid, or outlier target value.

### Quick model


The **Quick model** provides an estimate of the expected prediction quality of a model that you train on your data.

Data Wrangler splits your data into training and validation folds. It uses 80% of the samples for training and 20% for validation. For classification, Data Wrangler performs a stratified split, so each data partition has the same ratio of labels. For classification problems, it's important to have the same ratio of labels between the training and validation folds. Data Wrangler trains an XGBoost model with the default hyperparameters. It applies early stopping on the validation data and performs minimal feature preprocessing.

For classification models, Data Wrangler returns both a model summary and a confusion matrix.

 To learn more about the information that the classification model summary returns, see [Definitions](#canvas-data-insights-definitions).

A confusion matrix gives you the following information:
+ The number of times the predicted label matches the true label.
+ The number of times the predicted label doesn't match the true label.

The true label represents an actual observation in your data. For example, if you're using a model to detect fraudulent transactions, the true label represents a transaction that is actually fraudulent or non-fraudulent. The predicted label represents the label that your model assigns to the data.

You can use the confusion matrix to see how well the model predicts the presence or the absence of a condition. If you're predicting fraudulent transactions, you can use the confusion matrix to get a sense of both the sensitivity and the specificity of the model. The sensitivity refers to the model's ability to detect fraudulent transactions. The specificity refers to the model's ability to avoid detecting non-fraudulent transactions as fraudulent.
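
As an illustration, the following minimal scikit-learn sketch computes a confusion matrix for a hypothetical fraud-detection example; the labels are placeholders:

```
from sklearn.metrics import confusion_matrix

# y_true holds the actual observations; y_pred holds the model's predicted labels
y_true = ["fraud", "legit", "legit", "fraud", "legit"]
y_pred = ["fraud", "legit", "fraud", "fraud", "legit"]

# Rows are true labels, columns are predicted labels; the diagonal counts matches
print(confusion_matrix(y_true, y_pred, labels=["fraud", "legit"]))
```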

### Feature summary


When you specify a target column, Data Wrangler orders the features by their prediction power. Prediction power is measured on the data after it is split into 80% training and 20% validation folds. Data Wrangler fits a model for each feature separately on the training fold. It applies minimal feature preprocessing and measures prediction performance on the validation data.

It normalizes the scores to the range [0,1]. Higher prediction scores indicate columns that are more useful for predicting the target on their own. Lower scores point to columns that aren’t predictive of the target column.

It’s uncommon for a column that isn’t predictive on its own to be predictive when it’s used in tandem with other columns. You can confidently use the prediction scores to determine whether a feature in your dataset is predictive.

A low score usually indicates the feature is redundant. A score of 1 implies perfect predictive abilities, which often indicates target leakage. Target leakage usually happens when the dataset contains a column that isn’t available at the prediction time. For example, it could be a duplicate of the target column.
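
As a conceptual sketch of per-feature prediction power (not the exact Canvas implementation), the following fits one single-feature model per column with scikit-learn on an 80/20 split, assuming numeric features and a hypothetical 0/1 target column named `label`:

```
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("dataset.csv")                    # hypothetical file with numeric features
X, y = df.drop(columns=["label"]), df["label"]     # hypothetical 0/1 target column
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

scores = {}
for col in X.columns:                              # one single-feature model per column
    model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr[[col]], y_tr)
    scores[col] = roc_auc_score(y_va, model.predict_proba(X_va[[col]])[:, 1])

# Rank columns by how predictive they are on their own
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```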

### Samples


Data Wrangler provides information about whether your samples are anomalous or if there are duplicates in your dataset.

Data Wrangler detects anomalous samples using the *isolation forest algorithm*. The isolation forest associates an anomaly score with each sample (row) of the dataset. Low anomaly scores indicate anomalous samples; high scores indicate non-anomalous samples. Samples with a negative anomaly score are usually considered anomalous, and samples with a positive anomaly score are considered non-anomalous.

When you look at a sample that might be anomalous, we recommend that you pay attention to unusual values. For example, you might have anomalous values that result from errors in gathering and processing the data. Data Wrangler shows the most anomalous samples according to its implementation of the isolation forest algorithm. We recommend using domain knowledge and business logic when you examine the anomalous samples.
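
The following is a minimal scikit-learn sketch of the same idea; Data Wrangler's own implementation may differ, and the file name is hypothetical:

```
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical file; keep only numeric columns for the isolation forest
df = pd.read_csv("dataset.csv").select_dtypes("number").dropna()
forest = IsolationForest(random_state=0).fit(df)

# decision_function returns low (negative) scores for anomalous rows,
# matching the scoring convention described above
df["anomaly_score"] = forest.decision_function(df)
print(df.nsmallest(10, "anomaly_score"))  # the most anomalous samples
```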

Data Wrangler detects duplicate rows and calculates the ratio of duplicate rows in your data. Some data sources could include valid duplicates. Other data sources could have duplicates that point to problems in data collection. Duplicate samples that result from faulty data collection could interfere with machine learning processes that rely on splitting the data into independent training and validation folds.

The following are elements of the insights report that can be impacted by duplicated samples:
+ Quick model
+ Prediction power estimation
+ Automatic hyperparameter tuning

You can remove duplicate samples from the dataset using the **Drop duplicates** transform under **Manage rows**. Data Wrangler shows you the most frequently duplicated rows.
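
For reference, a pandas sketch of the same duplicate checks; the file name is hypothetical:

```
import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical file

dup_ratio = df.duplicated().mean()  # fraction of rows that repeat an earlier row
most_common = (
    df[df.duplicated(keep=False)]   # keep=False marks every copy of a duplicated row
      .value_counts()               # counts identical rows
      .head(10)
)
print(f"duplicate ratio: {dup_ratio:.2%}")
print(most_common)
```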

### Definitions


The following are definitions for the technical terms that are used in the data insights report.

------
#### [ Feature types ]

The following are the definitions for each of the feature types:
+ **Numeric** – Numeric values can be either floats or integers, such as age or income. The machine learning models assume that numeric values are ordered and a distance is defined over them. For example, 3 is closer to 4 than to 10 and 3 < 4 < 10.
+ **Categorical** – The column entries belong to a set of unique values, which is usually much smaller than the number of entries in the column. For example, a column of length 100 could contain the unique values `Dog`, `Cat`, and `Mouse`. The values could be numeric, text, or a combination of both. `Horse`, `House`, `8`, `Love`, and `3.1` would all be valid values and could be found in the same categorical column. The machine learning model does not assume order or distance on the values of categorical features, as opposed to numeric features, even when all the values are numbers.
+ **Binary** – Binary features are a special categorical feature type in which the cardinality of the set of unique values is 2.
+ **Text** – A text column contains many non-numeric unique values. In the extreme case, all the elements of the column are unique and no two entries are the same.
+ **Datetime** – A datetime column contains information about the date or time. It can have information about both the date and time.

------
#### [ Feature statistics ]

The following are definitions for each of the feature statistics:
+ **Prediction power** – Prediction power measures how useful the column is in predicting the target.
+ **Outliers** (in numeric columns) – Data Wrangler detects outliers using two statistics that are robust to outliers: median and robust standard deviation (RSTD). RSTD is derived by clipping the feature values to the range [5 percentile, 95 percentile] and calculating the standard deviation of the clipped vector. All values larger than median + 5 * RSTD or smaller than median - 5 * RSTD are considered to be outliers (see the sketch following this list).
+ **Skew** (in numeric columns) – Skew measures the symmetry of the distribution and is defined as the third moment of the distribution divided by the third power of the standard deviation. The skewness of the normal distribution or any other symmetric distribution is zero. Positive values imply that the right tail of the distribution is longer than the left tail. Negative values imply that the left tail of the distribution is longer than the right tail. As a rule of thumb, a distribution is considered skewed when the absolute value of the skew is larger than 3.
+ **Kurtosis** (in numeric columns) – Pearson's kurtosis measures the heaviness of the tail of the distribution. It's defined as the fourth moment of the distribution divided by the square of the second moment. The kurtosis of the normal distribution is 3. Kurtosis values lower than 3 imply that the distribution is concentrated around the mean and the tails are lighter than the tails of the normal distribution. Kurtosis values higher than 3 imply heavier tails or outliers.
+ **Missing values** – Null-like objects, empty strings, and strings composed of only white spaces are considered missing.
+ **Valid values for numeric features or regression target** – All values that you can cast to finite floats are valid. Missing values are not valid.
+ **Valid values for categorical, binary, or text features, or for classification target** – All values that are not missing are valid.
+ **Datetime features** – All values that you can cast to a datetime object are valid. Missing values are not valid.
+ **Invalid values** – Values that are either missing or you can't properly cast. For example, in a numeric column, you can't cast the string `"six"` or a null value.
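
The following sketch computes skew, Pearson's kurtosis, and the robust outlier rule described above using NumPy and SciPy, on synthetic data:

```
import numpy as np
from scipy import stats

x = np.random.default_rng(0).lognormal(size=1_000)  # synthetic, skewed sample data

skew = stats.skew(x)                    # 0 for symmetric distributions
kurt = stats.kurtosis(x, fisher=False)  # Pearson's kurtosis; 3 for the normal distribution

# Robust outlier rule: clip to the [5th, 95th] percentile range, take the std
clipped = np.clip(x, *np.percentile(x, [5, 95]))
median, rstd = np.median(x), clipped.std()
outliers = x[(x > median + 5 * rstd) | (x < median - 5 * rstd)]
print(skew, kurt, len(outliers))
```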

------
#### [ Quick model metrics for regression ]

The following are the definitions for the quick model metrics:
+ **R2 or coefficient of determination** – R2 is the proportion of the variation in the target that is predicted by the model. R2 is in the range of [-∞, 1]. 1 is the score of the model that predicts the target perfectly and 0 is the score of the trivial model that always predicts the target mean.
+ **MSE or mean squared error** – MSE is in the range [0, ∞]. 0 is the score of the model that predicts the target perfectly.
+ **MAE or mean absolute error** – MAE is in the range [0, ∞] where 0 is the score of the model that predicts the target perfectly.
+ **RMSE or root mean square error** – RMSE is in the range [0, ∞] where 0 is the score of the model that predicts the target perfectly.
+ **Max error** – The maximum absolute value of the error over the dataset. Max error is in the range [0, ∞]. 0 is the score of the model that predicts the target perfectly.
+ **Median absolute error** – Median absolute error is in the range [0, ∞]. 0 is the score of the model that predicts the target perfectly.

------
#### [ Quick model metrics for classification ]

The following are the definitions for the quick model metrics:
+ **Accuracy** – Accuracy is the ratio of samples that are predicted accurately. Accuracy is in the range [0, 1]. 0 is the score of the model that predicts all samples incorrectly and 1 is the score of the perfect model.
+ **Balanced accuracy** – Balanced accuracy is the ratio of samples that are predicted accurately when the class weights are adjusted to balance the data. All classes are given the same importance, regardless of their frequency. Balanced accuracy is in the range [0, 1]. 0 is the score of the model that predicts all samples wrong. 1 is the score of the perfect model.
+ **AUC (binary classification)** – This is the area under the receiver operating characteristic curve. AUC is in the range [0, 1] where a random model returns a score of 0.5 and the perfect model returns a score of 1.
+ **AUC (OVR)** – For multiclass classification, this is the area under the receiver operating characteristic curve calculated separately for each label using one versus rest. Data Wrangler reports the average of the areas. AUC is in the range [0, 1] where a random model returns a score of 0.5 and the perfect model returns a score of 1.
+ **Precision** – Precision is defined for a specific class. Precision is the fraction of true positives out of all the instances that the model classified as that class. Precision is in the range [0, 1]. 1 is the score of the model that has no false positives for the class. For binary classification, Data Wrangler reports the precision of the positive class.
+ **Recall** – Recall is defined for a specific class. Recall is the fraction of the relevant class instances that are successfully retrieved. Recall is in the range [0, 1]. 1 is the score of the model that classifies all the instances of the class correctly. For binary classification, Data Wrangler reports the recall of the positive class.
+ **F1** – F1 is defined for a specific class. It's the harmonic mean of the precision and recall. F1 is in the range [0, 1]. 1 is the score of the perfect model. For binary classification, Data Wrangler reports the F1 of the positive class.

------
#### [ Textual patterns ]

**Patterns** describe the textual format of a string in an easy-to-read way. The following are examples of textual patterns:
+ "\[digits:4-7\]" describes a sequence of digits that have a length between 4 and 7.
+ "\[alnum:5\]" describes an alphanumeric string with a length of exactly 5.

Data Wrangler infers the patterns by looking at samples of non-empty strings from your data. It can describe many of the commonly used patterns. The **confidence** expressed as a percentage indicates how much of the data is estimated to match the pattern. Using the textual pattern, you can see which rows in your data you need to correct or drop.

The following describes the patterns that Data Wrangler can recognize:


| Pattern | Textual Format | 
| --- | --- | 
|  \[alnum\]  |  Alphanumeric strings  | 
|  \[any\]  |  Any string of word characters  | 
|  \[digits\]  |  A sequence of digits  | 
|  \[lower\]  |  A lowercase word  | 
|  \[mixed\]  |  A mixed-case word  | 
|  \[name\]  |  A word beginning with a capital letter  | 
|  \[upper\]  |  An uppercase word  | 
|  \[whitespace\]  |  Whitespace characters  | 

A word character is either an underscore or a character that might appear in a word in any language. For example, the strings `'Hello_word'` and `'écoute'` both consist of word characters. 'H' and 'é' are both examples of word characters.

------

## Bias report


SageMaker Canvas provides the bias report in Data Wrangler to help uncover potential biases in your data. The bias report analyzes the relationship between the target column (label) and a column that you believe might contain bias (facet variable). For example, if you are trying to predict customer conversion, the facet variable may be the age of the customer. The bias report can help you determine whether or not your data is biased toward a certain age group.

To generate a bias report in Canvas, do the following:

1. In your data flow in Data Wrangler, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) next to a node in the flow.

1. From the context menu, choose **Get data insights**.

1. The **Create analysis** side panel opens. For the **Analysis type** dropdown menu, select **Bias Report**.

1.  In the **Analysis name** field, enter a name for the bias report.

1. For the **Select the column your model predicts (target)** dropdown menu, select your target column.

1. For **Is your predicted column a value or threshold?**, select **Value** if your target column has categorical values or **Threshold** if it has numerical values.

1. For **Predicted value** (or **Predicted threshold**, depending on your selection in the previous step), enter the target column value or values that correspond to a positive outcome. For example, if predicting customer conversion, your value might be `yes` to indicate that a customer was converted.

1. For the **Select the column to analyze for bias** dropdown menu, select the column that you believe might contain bias, also known as the facet variable.

1. For **Is your column a value or threshold?**, select **Value** if the facet variable has categorical values or **Threshold** if it has numerical values.

1. For **Column value(s) to analyze for bias** (or **Column threshold to analyze for bias**, depending on your selection in the previous step), enter the value or values that you want to analyze for potential bias. For example, if you're checking for bias against customers over a certain age, use the beginning of that age range as your threshold.

1. For **Choose bias metrics**, select the bias metrics you'd like to include in your bias report. Hover over the info icons for more information about each metric.

1. (Optional) When prompted with the option **Would you like to analyze additional metrics?**, select **Yes** to view and include more bias metrics.

1. When you're ready to create the bias report, choose **Add**.

Once generated, the report gives you an overview of the bias metrics you selected. You can view the bias report at any time from the **Analyses** tab of your data flow.

## Histogram


Use histograms to see the counts of feature values for a specific feature. You can inspect the relationships between features using the **Color by** option.

You can use the **Facet by** feature to create histograms of one column for each value in another column. 

## Scatter plot


Use the **Scatter Plot** feature to inspect the relationship between features. To create a scatter plot, select features to plot on the **X axis** and the **Y axis**. Both of these columns must be numeric. 

You can color scatter plots by an additional column. 

Additionally, you can facet scatter plots by features.

## Table summary


Use the **Table Summary** analysis to quickly summarize your data.

For columns with numerical data, including long and float data, a table summary reports the number of entries (count), minimum (min), maximum (max), mean, and standard deviation (stddev) for each column.

For columns with non-numerical data, including columns with string, Boolean, or date/time data, a table summary reports the number of entries (count), least frequent value (min), and most frequent value (max). 

## Quick model


Use the **Quick Model** visualization to quickly evaluate your data and produce importance scores for each feature. A [feature importance score](http://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTreeClassificationModel.featureImportances) indicates how useful a feature is at predicting a target label. The feature importance score is between [0, 1], and a higher number indicates that the feature is more important to the whole dataset. On the top of the quick model chart, there is a model score. A classification problem shows an F1 score. A regression problem has a mean squared error (MSE) score.

When you create a quick model chart, you select a dataset you want evaluated, and a target label against which you want feature importance to be compared. Data Wrangler does the following:
+ Infers the data types for the target label and each feature in the dataset selected. 
+ Determines the problem type. Based on the number of distinct values in the label column, Data Wrangler determines if this is a regression or classification problem type. Data Wrangler sets a categorical threshold to 100. If there are more than 100 distinct values in the label column, Data Wrangler classifies it as a regression problem; otherwise, it is classified as a classification problem. 
+ Pre-processes features and label data for training. The algorithm used requires encoding features to vector type and encoding labels to double type. 
+ Trains a random forest algorithm with 70% of data. Spark’s [RandomForestRegressor](https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-regression) is used to train a model for regression problems. The [RandomForestClassifier](https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier) is used to train a model for classification problems.
+ Evaluates a random forest model with the remaining 30% of data. Data Wrangler evaluates classification models using an F1 score and evaluates regression models using an MSE score.
+ Calculates feature importance for each feature using the Gini importance method. 
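
The following is a minimal scikit-learn analogue of this procedure (Data Wrangler itself uses Spark's random forest implementations); the file and target column names are hypothetical:

```
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset.csv")                 # hypothetical file with numeric features
X, y = df.drop(columns=["label"]), df["label"]  # hypothetical target column

# 70/30 train/evaluate split, mirroring the procedure described above
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

print("F1:", f1_score(y_te, model.predict(X_te), average="weighted"))
# Gini-based importances, one score per feature, summing to 1
print(dict(zip(X.columns, model.feature_importances_)))
```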

## Target leakage


Target leakage occurs when there is data in a machine learning training dataset that is strongly correlated with the target label, but is not available in real-world data. For example, you may have a column in your dataset that serves as a proxy for the column you want to predict with your model. 

When you use the **Target Leakage** analysis, you specify the following:
+ **Target**: This is the feature about which you want your ML model to be able to make predictions.
+ **Problem type**: This is the ML problem type on which you are working. Problem type can either be **classification** or **regression**. 
+  (Optional) **Max features**: This is the maximum number of features to present in the visualization, which shows features ranked by their risk of being target leakage.

For classification, the target leakage analysis uses the area under the receiver operating characteristic curve (AUC-ROC) for each column, up to **Max features**. For regression, it uses the coefficient of determination (R2).

The AUC-ROC metric is computed individually for each column using cross-validation, on a sample of up to around 1000 rows. A score of 1 indicates perfect predictive abilities, which often indicates target leakage. A score of 0.5 or lower indicates that the information in the column could not provide, on its own, any useful information towards predicting the target. Although a column can be uninformative on its own yet useful in predicting the target when used in tandem with other features, a low score could indicate that the feature is redundant.
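
As a conceptual sketch (not the Canvas implementation), the following scores each column's cross-validated AUC with scikit-learn; the file name and 0/1 target column are hypothetical:

```
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical file, sampled to roughly 1,000 rows as described above
df = pd.read_csv("dataset.csv").sample(n=1_000, random_state=0)
y = df.pop("label")  # hypothetical 0/1 target column

# Cross-validated AUC of a single-column model; scores near 1 suggest leakage
for col in df.columns:
    auc = cross_val_score(
        DecisionTreeClassifier(max_depth=3), df[[col]], y, cv=3, scoring="roc_auc"
    ).mean()
    print(col, round(auc, 3))
```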

## Multicollinearity


Multicollinearity is a circumstance where two or more predictor variables are related to each other. The predictor variables are the features in your dataset that you're using to predict a target variable. When you have multicollinearity, the predictor variables are not only predictive of the target variable, but also predictive of each other.

You can use the **Variance Inflation Factor (VIF)**, **Principal Component Analysis (PCA)**, or **Lasso feature selection** as measures for the multicollinearity in your data. For more information, see the following.

------
#### [ Variance Inflation Factor (VIF) ]

The Variance Inflation Factor (VIF) is a measure of collinearity among variable pairs. Data Wrangler returns a VIF score as a measure of how closely the variables are related to each other. A VIF score is a positive number that is greater than or equal to 1.

A score of 1 means that the variable is uncorrelated with the other variables. Scores greater than 1 indicate higher correlation.

Theoretically, you can have a VIF score with a value of infinity. Data Wrangler clips high scores to 50. If you have a VIF score greater than 50, Data Wrangler sets the score to 50.

You can use the following guidelines to interpret your VIF scores:
+ A VIF score less than 5 indicates that the variables are moderately correlated with the other variables.
+ A VIF score greater than or equal to 5 indicates that the variables are highly correlated with the other variables.
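
If you want to reproduce this check outside Canvas, the following sketch computes per-column VIF scores with statsmodels; the file name is hypothetical:

```
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical file; keep only numeric predictors and add a constant term
X = sm.add_constant(pd.read_csv("dataset.csv").select_dtypes("number").dropna())

for i, col in enumerate(X.columns):
    if col == "const":
        continue  # skip the intercept column
    vif = min(variance_inflation_factor(X.values, i), 50)  # clip high scores to 50, as above
    print(col, round(vif, 2))
```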

------
#### [ Principal Component Analysis (PCA) ]

Principal Component Analysis (PCA) measures the variance of the data along different directions in the feature space. The feature space consists of all the predictor variables that you use to predict the target variable in your dataset.

For example, if you're trying to predict who survived on the *RMS Titanic* after it hit an iceberg, your feature space can include the passengers' age, gender, and the fare that they paid.

From the feature space, PCA generates an ordered list of variances. These variances are also known as singular values. The values in the list of variances are greater than or equal to 0. We can use them to determine how much multicollinearity there is in our data.

When the numbers are roughly uniform, the data has very few instances of multicollinearity. When there is a lot of variability among the values, we have many instances of multicollinearity. Before it performs PCA, Data Wrangler normalizes each feature to have a mean of 0 and a standard deviation of 1.

**Note**  
PCA in this circumstance can also be referred to as Singular Value Decomposition (SVD).
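
The following sketch uses scikit-learn to reproduce this style of check; the file name is hypothetical:

```
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical file; keep only numeric predictor columns
X = pd.read_csv("dataset.csv").select_dtypes("number").dropna()

# Normalize each feature to mean 0 and standard deviation 1, as described above
X_std = StandardScaler().fit_transform(X)
variances = PCA().fit(X_std).explained_variance_

# Roughly uniform values suggest little multicollinearity; a steep drop-off
# suggests strongly correlated predictors
print(variances)
```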

------
#### [ Lasso feature selection ]

Lasso feature selection uses the L1 regularization technique to only include the most predictive features in your dataset.

For both classification and regression, the regularization technique generates a coefficient for each feature. The absolute value of the coefficient provides an importance score for the feature: a higher score indicates that the feature is more predictive of the target variable. A common feature selection method is to use all the features that have a non-zero lasso coefficient.
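
A minimal scikit-learn sketch of lasso-based feature selection, assuming a hypothetical numeric dataset and target column:

```
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("dataset.csv")                     # hypothetical file
features = df.drop(columns=["target"])              # hypothetical target column
X = StandardScaler().fit_transform(features)        # standardize before regularization
y = df["target"]

# L1 regularization drives uninformative coefficients to exactly zero
coef = Lasso(alpha=0.1).fit(X, y).coef_
selected = [c for c, w in zip(features.columns, coef) if w != 0]
print(selected)  # features with non-zero lasso coefficients
```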

------

## Detect anomalies in time series data


You can use the anomaly detection visualization to see outliers in your time series data. To understand what determines an anomaly, note that we decompose the time series into a predicted term and an error term. We treat the seasonality and trend of the time series as the predicted term, and we treat the residuals as the error term.

For the error term, you specify a threshold as the number of standard deviations that a residual can be from the mean before it's considered an anomaly. For example, you can specify a threshold of 3 standard deviations. Any residual more than 3 standard deviations from the mean is an anomaly.
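
As a rough illustration of this logic (not Data Wrangler's internal implementation), the following sketch flags anomalies with `statsmodels`, assuming a hypothetical pandas Series `series` of hourly values at a regular frequency:

```
from statsmodels.tsa.seasonal import STL

# Predicted term = trend + seasonality; error term = residual.
resid = STL(series, period=24).fit().resid

# Flag residuals more than 3 standard deviations from the mean.
z_scores = (resid - resid.mean()) / resid.std()
anomalies = series[z_scores.abs() > 3]
```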

You can use the following procedure to perform an **Anomaly detection** analysis.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **More options** icon, and select **Add analysis**.

1. For **Analysis type**, choose **Time Series**.

1. For **Visualization**, choose **Anomaly detection**.

1. For **Anomaly threshold**, choose the threshold above which a value is considered an anomaly.

1. Choose **Preview** to generate a preview of the analysis.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

## Seasonal trend decomposition in time series data


You can determine whether there's seasonality in your time series data by using the Seasonal Trend Decomposition visualization. We use the STL (Seasonal Trend decomposition using LOESS) method to perform the decomposition. We decompose the time series into its seasonal, trend, and residual components. The trend reflects the long term progression of the series. The seasonal component is a signal that recurs in a time period. After removing the trend and the seasonal components from the time series, you have the residual.
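
For intuition, the decomposition is similar to the following sketch, which uses the `STL` class from `statsmodels` and assumes a hypothetical pandas Series `series` of daily values with weekly seasonality; Data Wrangler performs the decomposition for you.

```
from statsmodels.tsa.seasonal import STL

result = STL(series, period=7).fit()  # Seasonal-Trend decomposition using LOESS
trend = result.trend          # long-term progression of the series
seasonal = result.seasonal    # signal that recurs in each time period
residual = result.resid       # what remains after removing trend and seasonality
```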

You can use the following procedure to perform a **Seasonal-Trend decomposition** analysis.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **More options** icon, and select **Add analysis**.

1. For **Analysis type**, choose **Time Series**.

1. For **Visualization**, choose **Seasonal-Trend decomposition**.

1. For **Anomaly threshold**, choose the threshold above which a value is considered an anomaly.

1. Choose **Preview** to generate a preview of the analysis.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

## Create custom visualizations


You can add an analysis to your Data Wrangler flow to create a custom visualization. Your dataset, with all the transformations you've applied, is available as a [Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Data Wrangler uses the `df` variable to store the dataframe. You access the dataframe by calling the variable.

You must provide the output variable, `chart`, to store an [Altair](https://altair-viz.github.io/) output chart. For example, you can use the following code block to create a custom histogram using the Titanic dataset.

```
import altair as alt

# Take the first 30 rows and count how often each age value appears
df = df.iloc[:30]
df = df.rename(columns={"Age": "value"})
df = df.assign(count=df.groupby('value').value.transform('count'))
df = df[["value", "count"]]

# Layer a binned bar chart with a red rule marking the mean value
base = alt.Chart(df)
bar = base.mark_bar().encode(x=alt.X('value', bin=True, axis=None), y=alt.Y('count'))
rule = base.mark_rule(color='red').encode(
    x='mean(value):Q',
    size=alt.value(5))
chart = bar + rule
```

**To create a custom visualization:**

1. Next to the node containing the transformation that you'd like to visualize, choose the **More options** icon.

1. Choose **Add analysis**.

1. For **Analysis type**, choose **Custom Visualization**.

1. For **Analysis name**, specify a name.

1. Enter your code in the code box. 

1. Choose **Preview** to preview your visualization.

1. Choose **Save** to add your visualization.

If you don’t know how to use the Altair visualization package in Python, you can use custom code snippets to help you get started.

Data Wrangler has a searchable collection of visualization snippets. To use a visualization snippet, choose **Search example snippets** and specify a query in the search bar.

The following example uses the **Binned scatterplot** code snippet. It plots a two-dimensional histogram.

The snippets have comments to help you understand the changes that you need to make to the code. You usually need to specify the column names of your dataset in the code.

```
import altair as alt

# Specify the number of top rows for plotting
rows_number = 1000
df = df.head(rows_number)  
# You can also choose bottom rows or randomly sampled rows
# df = df.tail(rows_number)
# df = df.sample(rows_number)


chart = (
    alt.Chart(df)
    .mark_circle()
    .encode(
        # Specify the column names for binning and number of bins for X and Y axis
        x=alt.X("col1:Q", bin=alt.Bin(maxbins=20)),
        y=alt.Y("col2:Q", bin=alt.Bin(maxbins=20)),
        size="count()",
    )
)

# :Q specifies that label column has quantitative type.
# For more details on Altair typing refer to
# https://altair-viz.github.io/user_guide/encoding.html#encoding-data-types
```

# Transform data


Amazon SageMaker Data Wrangler provides numerous ML data transforms to streamline cleaning and featurizing your data. Using the interactive data preparation tools in Data Wrangler, you can sample datasets of any size with a variety of sampling techniques and start exploring your data in a matter of minutes. After finalizing your data transforms on the sampled data, you can then scale the data flow to apply those transformations to the entire dataset.

When you add a transform, it adds a step to the data flow. Each transform you add modifies your dataset and produces a new dataframe. All subsequent transforms apply to the resulting dataframe.

Data Wrangler includes built-in transforms, which you can use to transform columns without any code. If you know how you want to prepare your data but don't know how to get started or which transforms to use, you can use the chat for data prep feature to interact conversationally with Data Wrangler and apply transforms using natural language. For more information, see [Chat for data prep](canvas-chat-for-data-prep.md). 

You can also add custom transformations using PySpark, Python (User-Defined Function), pandas, and PySpark SQL. Some transforms operate in place, while others create a new output column in your dataset.

You can apply transforms to multiple columns at once. For example, you can delete multiple columns in a single step.

You can apply the **Process numeric** and **Handle missing** transforms only to a single column.

Use this page to learn more about the built-in and custom transforms offered by Data Wrangler.

## Join Datasets


You can join datasets directly in your data flow. When you join two datasets, the resulting joined dataset appears in your flow. The following join types are supported by Data Wrangler.
+ **Left outer** – Include all rows from the left table. If the value for the column joined on a left table row does not match any right table row values, that row contains null values for all right table columns in the joined table.
+ **Left anti** – Include rows from the left table that do not contain values in the right table for the joined column.
+ **Left semi** – Include a single row from the left table for all identical rows that satisfy the criteria in the join statement. This excludes duplicate rows from the left table that match the criteria of the join.
+ **Right outer** – Include all rows from the right table. If the value for the joined column in a right table row does not match any left table row values, that row contains null values for all left table columns in the joined table.
+ **Inner** – Include rows from left and right tables that contain matching values in the joined column. 
+ **Full outer** – Include all rows from the left and right tables. If the row value for the joined column in either table does not match, separate rows are created in the joined table. If a row doesn’t contain a value for a column in the joined table, null is inserted for that column.
+ **Cartesian cross** – Include rows which combine each row from the first table with each row from the second table. This is a [Cartesian product](https://en.wikipedia.org/wiki/Cartesian_product) of rows from tables in the join. The result of this product is the size of the left table times the size of the right table. Therefore, we recommend caution in using this join between very large datasets. 
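
Data Wrangler performs the join for you, but if you're familiar with PySpark, the following minimal sketch shows roughly equivalent join calls, assuming two hypothetical DataFrames `left_df` and `right_df` that share an `id` column:

```
left_outer  = left_df.join(right_df, on="id", how="left")       # Left outer
left_anti   = left_df.join(right_df, on="id", how="left_anti")  # Left anti
left_semi   = left_df.join(right_df, on="id", how="left_semi")  # Left semi
right_outer = left_df.join(right_df, on="id", how="right")      # Right outer
inner       = left_df.join(right_df, on="id", how="inner")      # Inner
full_outer  = left_df.join(right_df, on="id", how="outer")      # Full outer
cross       = left_df.crossJoin(right_df)                       # Cartesian cross
```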

Use the following procedure to join two datasets. You should have already imported two data sources into your data flow.

1. Select the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) next to the left node that you want to join. The first node you select is always the left table in your join. 

1. Hover over **Combine data**, and then choose **Join**.

1. Select the right node. The second node you select is always the right table in your join.

1. The **Join type** field is set to **Inner join** by default. Select the dropdown menu to change the join type.

1. For **Join keys**, verify the columns from the left and right tables that you want to use to join the data. You can add or remove additional join keys.

1. For **Name of join**, enter a name for the joined data, or use the default name.

1. (Optional) Choose **Preview** to preview the joined data.

1. Choose **Add** to complete the join.

**Note**  
If you receive a notice that Canvas didn't identify any matching rows when joining your data, we recommend that you either verify that you've selected the correct columns, or update your sample to try to find matching rows. You can choose a different sampling strategy or change the size of the sample. For information about how to edit the sample, see [Edit the data flow sampling configuration](canvas-data-flow-edit-sampling.md).

You should now see a join node added to your data flow.

## Concatenate Datasets


Concatenating combines two datasets by appending the rows from one dataset to another.
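
In pandas terms, a concatenation with an indicator column looks roughly like the following sketch, which assumes two hypothetical dataframes `first_df` and `second_df` with the same columns; Data Wrangler performs the operation for you.

```
import pandas as pd

# Indicator values that mark the source dataset for each record.
first_df = first_df.assign(source="first")
second_df = second_df.assign(source="second")

combined = pd.concat([first_df, second_df], ignore_index=True)
combined = combined.drop_duplicates()  # optional: remove duplicates after concatenation
```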

Use the following procedure to concatenate two datasets. You should have already imported two data sources into your data flow.

**To concatenate two datasets:**

1. Select the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) next to the left node that you want to concatenate. The first node you select is always the left table in your concatenate operation. 

1. Hover over **Combine data**, and then choose **Concatenate**.

1. Select the right node. The second node you select is always the right table in your concatenate.

1. (Optional) Select the checkbox next to **Remove duplicates after concatenation** to remove duplicate columns. 

1. (Optional) Select the checkbox next to **Add column to indicate source dataframe** to add a column to the resulting dataframe that lists the source dataset for each record.

   1. For **Indicator column name**, enter a name for the added column.

   1. For **First dataset indicating string**, enter the value you want to use to mark records from the first dataset (or the left node).

   1. For **Second dataset indicating string**, enter the value you want to use to mark records from the second dataset (or the right node).

1. For **Name of concatenate**, enter a name for the concatenation.

1. (Optional) Choose **Preview** to preview the concatenated data.

1. Choose **Add** to add the new dataset to your data flow. 

You should now see a concatenate node added to your data flow.

## Balance Data


You can balance the data for datasets with an underrepresented category. Balancing a dataset can help you create better models for binary classification.

**Note**  
You can't balance datasets containing column vectors.

You can use the **Balance data** operation to balance your data using one of the following operators:
+ *Random oversampling* – Randomly duplicates samples in the minority category. For example, if you're trying to detect fraud, you might only have cases of fraud in 10% of your data. For an equal proportion of fraudulent and non-fraudulent cases, this operator randomly duplicates fraud cases within the dataset 8 times, so that fraud cases make up half of the resulting dataset.
+ *Random undersampling* – The counterpart of random oversampling. Randomly removes samples from the overrepresented category to get the proportion of samples that you desire.
+ *Synthetic Minority Oversampling Technique (SMOTE)* – Uses samples from the underrepresented category to interpolate new synthetic minority samples. For more information about SMOTE, see the following description.

You can use all of these transforms on datasets containing both numeric and non-numeric features. SMOTE interpolates values by using neighboring samples. Data Wrangler uses the R-squared distance to determine the neighborhood used to interpolate the additional samples. Data Wrangler only uses numeric features to calculate the distances between samples in the underrepresented group.

For two real samples in the underrepresented group, Data Wrangler interpolates the numeric features by using a weighted average, with weights randomly assigned in the range [0, 1]. For samples A and B, Data Wrangler could randomly assign a weight of 0.7 to A and 0.3 to B. The interpolated sample then has a value of 0.7A + 0.3B.

Data Wrangler interpolates non-numeric features by copying them from one of the two real samples, with a probability that it randomly assigns to each sample. For samples A and B, it can assign probabilities of 0.8 to A and 0.2 to B, in which case the interpolated sample copies the non-numeric values from A 80% of the time.
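
The following sketch illustrates the interpolation logic for a single synthetic sample, assuming one numeric and one non-numeric feature per sample; it demonstrates the technique, not Data Wrangler's implementation.

```
import numpy as np

rng = np.random.default_rng()

def interpolate_sample(numeric_a, numeric_b, category_a, category_b):
    # Numeric features: weighted average with a random weight in [0, 1].
    w = rng.random()
    numeric = w * numeric_a + (1 - w) * numeric_b

    # Non-numeric features: copied from one sample with a randomly assigned probability.
    p_copy_a = rng.random()
    category = category_a if rng.random() < p_copy_a else category_b
    return numeric, category

print(interpolate_sample(30.0, 50.0, "cash", "credit"))
```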

## Custom Transforms


The **Custom Transforms** group allows you to use Python (User-Defined Function), PySpark, pandas, or PySpark (SQL) to define custom transformations. For each option, you use the variable `df` to access the dataframe to which you want to apply the transform. To apply your custom code to your dataframe, assign the dataframe with the transformations that you've made to the `df` variable. If you're not using Python (User-Defined Function), you don't need to include a return statement. Choose **Preview** to preview the result of the custom transform. Choose **Add** to add the custom transform to your list of **Previous steps**.

You can import popular libraries with an `import` statement in the custom transform code block, such as the following:
+ NumPy version 1.19.0
+ scikit-learn version 0.23.2
+ SciPy version 1.5.4
+ pandas version 1.0.3
+ PySpark version 3.0.0

**Important**  
**Custom transform** doesn't support columns with spaces or special characters in the name. We recommend that you specify column names that only have alphanumeric characters and underscores. You can use the **Rename column** transform in the **Manage columns** transform group to remove spaces from a column's name. You can also add a **Python (Pandas)** **Custom transform** similar to the following to remove spaces from multiple columns in a single step. This example changes columns named `A column` and `B column` to `A_column` and `B_column` respectively.   

```
df = df.rename(columns={"A column": "A_column", "B column": "B_column"})
```

If you include print statements in the code block, the result appears when you select **Preview**. You can resize the custom code transformer panel. Resizing the panel provides more space to write code. 

The following sections provide additional context and examples for writing custom transform code.

**Python (User-Defined Function)**

The Python function gives you the ability to write custom transformations without needing to know Apache Spark or pandas. Data Wrangler is optimized to run your custom code quickly. You get similar performance using custom Python code and an Apache Spark plugin.

To use the Python (User-Defined Function) code block, you specify the following:
+ **Input column** – The input column where you're applying the transform.
+ **Mode** – The scripting mode, either pandas or Python.
+ **Return type** – The data type of the value that you're returning.

Using the pandas mode gives better performance. The Python mode makes it easier for you to write transformations by using pure Python functions.

**PySpark**

The following example extracts date and time from a timestamp.

```
from pyspark.sql.functions import from_unixtime, to_date, date_format

# Convert the Unix timestamp, then split it into separate date and time columns
df = df.withColumn('DATE_TIME', from_unixtime('TIMESTAMP'))
df = df.withColumn('EVENT_DATE', to_date('DATE_TIME')) \
       .withColumn('EVENT_TIME', date_format('DATE_TIME', 'HH:mm:ss'))
```

**pandas**

The following example provides an overview of the dataframe to which you are adding transforms. 

```
df.info()
```

**PySpark (SQL)**

The following example creates a new dataframe with four columns: *name*, *fare*, *pclass*, *survived*.

```
SELECT name, fare, pclass, survived FROM df
```

If you don’t know how to use PySpark, you can use custom code snippets to help you get started.

Data Wrangler has a searchable collection of code snippets. You can use the code snippets to perform tasks such as dropping columns, grouping by columns, or modeling.

To use a code snippet, choose **Search example snippets** and specify a query in the search bar. The text you specify in the query doesn’t have to match the name of the code snippet exactly.

The following example shows a **Drop duplicate rows** code snippet that deletes rows with duplicate data in your dataset. You can find the code snippet by searching for one of the following:
+ Duplicates
+ Identical
+ Remove

The following snippet has comments to help you understand the changes that you need to make. For most snippets, you must specify the column names of your dataset in the code.

```
# Specify the subset of columns
# all rows having identical values in these columns will be dropped

subset = ["col1", "col2", "col3"]
df = df.dropDuplicates(subset)  

# to drop the full-duplicate rows run
# df = df.dropDuplicates()
```

To use a snippet, copy and paste its content into the **Custom transform** field. You can copy and paste multiple code snippets into the custom transform field.

## Custom Formula


Use **Custom formula** to define a new column using a Spark SQL expression to query data in the current dataframe. The query must use the conventions of Spark SQL expressions.

**Important**  
**Custom formula** doesn't support columns with spaces or special characters in the name. We recommend that you specify column names that only have alphanumeric characters and underscores. You can use the **Rename column** transform in the **Manage columns** transform group to remove spaces from a column's name. You can also add a **Python (Pandas)** **Custom transform** similar to the following to remove spaces from multiple columns in a single step. This example changes columns named `A column` and `B column` to `A_column` and `B_column` respectively.   

```
df = df.rename(columns={"A column": "A_column", "B column": "B_column"})
```

You can use this transform to perform operations on columns, referencing the columns by name. For example, assuming the current dataframe contains columns named `col_a` and `col_b`, you can use the following operation to produce an **Output column** that is the product of these two columns:

```
col_a * col_b
```

Other common operations include the following, assuming a dataframe contains `col_a` and `col_b` columns:
+ Concatenate two columns: `concat(col_a, col_b)`
+ Add two columns: `col_a + col_b`
+ Subtract two columns: `col_a - col_b`
+ Divide two columns: `col_a / col_b`
+ Take the absolute value of a column: `abs(col_a)`

For more information, see the [Spark documentation](http://spark.apache.org/docs/latest/api/python) on selecting data. 

## Reduce Dimensionality within a Dataset

Reduce the dimensionality in your data by using Principal Component Analysis (PCA). The dimensionality of your dataset corresponds to the number of features. When you use dimensionality reduction in Data Wrangler, you get a new set of features called components. Each component accounts for some variability in the data.

The first component accounts for the largest amount of variation in the data. The second component accounts for the second largest amount of variation in the data, and so on.

You can use dimensionality reduction to reduce the size of the datasets that you use to train models. Instead of using all of the features in your dataset, you can use the principal components.

To perform PCA, Data Wrangler creates axes for your data. An axis is an affine combination of columns in your dataset. The first principal component is the value on the axis that has the largest amount of variance. The second principal component is the value on the axis that has the second largest amount of variance. The nth principal component is the value on the axis that has the nth largest amount of variance.

You can configure the number of principal components that Data Wrangler returns. You can either specify the number of principal components directly or you can specify the variance threshold percentage. Each principal component explains an amount of variance in the data. For example, you might have a principal component with a value of 0.5. The component would explain 50% of the variation in the data. When you specify a variance threshold percentage, Data Wrangler returns the smallest number of components that meet the percentage that you specify.

The following are example principal components with the amount of variance that they explain in the data.
+ Component 1 – 0.5
+ Component 2 – 0.45
+ Component 3 – 0.05

If you specify a variance threshold percentage of `94` or `95`, Data Wrangler returns Component 1 and Component 2. If you specify a variance threshold percentage of `96`, Data Wrangler returns all three principal components.
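
scikit-learn's `PCA` supports the same kind of threshold, which can be useful for checking the behavior outside of Data Wrangler; the sketch below assumes a hypothetical standardized feature matrix `X`.

```
from sklearn.decomposition import PCA

# Keep the smallest number of components that explain at least 95% of the
# variance, analogous to Data Wrangler's default variance threshold of 95.
pca = PCA(n_components=0.95)
components = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # variance explained by each component
```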

You can use the following procedure to run PCA on your dataset.

1. Open your Data Wrangler data flow.

1. Choose the **More options** icon, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Dimensionality Reduction**.

1. For **Input Columns**, choose the features that you're reducing into the principal components.

1. (Optional) For **Number of principal components**, choose the number of principal components that Data Wrangler returns in your dataset. If you specify a value for the field, you can't specify a value for **Variance threshold percentage**.

1. (Optional) For **Variance threshold percentage**, specify the percentage of variation in the data that you want explained by the principal components. Data Wrangler uses the default value of `95` if you don't specify a value for the variance threshold. You can't specify a variance threshold percentage if you've specified a value for **Number of principal components**.

1. (Optional) Deselect **Center** to not use the mean of the columns as the center of the data. By default, Data Wrangler centers the data with the mean before scaling.

1. (Optional) Deselect **Scale** to not scale the data with the unit standard deviation.

1. (Optional) Choose **Columns** to output the components to separate columns. Choose **Vector** to output the components as a single vector.

1. (Optional) For **Output column**, specify a name for an output column. If you're outputting the components to separate columns, the name that you specify is a prefix. If you're outputting the components to a vector, the name that you specify is the name of the vector column.

1. (Optional) Select **Keep input columns**. We don't recommend selecting this option if you plan on only using the principal components to train your model.

1. Choose **Preview**.

1. Choose **Add**.

## Encode Categorical


Categorical data is usually composed of a finite number of categories, where each category is represented with a string. For example, if you have a table of customer data, a column that indicates the country a person lives in is categorical. The categories would be *Afghanistan*, *Albania*, *Algeria*, and so on. Categorical data can be *nominal* or *ordinal*. Ordinal categories have an inherent order, and nominal categories do not. The highest degree obtained (*High school*, *Bachelors*, *Masters*, and so on) is an example of ordinal categories. 

Encoding categorical data is the process of creating a numerical representation for categories. For example, if your categories are *Dog* and *Cat*, you may encode this information into two vectors, `[1,0]` to represent *Dog*, and `[0,1]` to represent *Cat*.

When you encode ordinal categories, you may need to translate the natural order of categories into your encoding. For example, you can represent the highest degree obtained with the following map: `{"High school": 1, "Bachelors": 2, "Masters":3}`.

Use categorical encoding to encode categorical data that is in string format into arrays of integers. 
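
In pandas terms, the two encodings look roughly like the following sketch, which assumes hypothetical `degree` (ordinal) and `pet` (nominal) columns; Data Wrangler applies the encodings for you.

```
import pandas as pd

# Ordinal encoding: translate the natural order of categories into integers.
degree_map = {"High school": 1, "Bachelors": 2, "Masters": 3}
df["degree_encoded"] = df["degree"].map(degree_map)

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["pet"], prefix="pet")
df = pd.concat([df, one_hot], axis=1)
```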

The Data Wrangler categorical encoders create encodings for all categories that exist in a column at the time the step is defined. If new categories have been added to a column when you start a Data Wrangler job to process your dataset at time *t*, and this column was the input for a Data Wrangler categorical encoding transform at time *t*-1, these new categories are considered *missing* in the Data Wrangler job. The option you select for **Invalid handling strategy** is applied to these missing values. Examples of when this can occur are: 
+ When you use a .flow file to create a Data Wrangler job to process a dataset that was updated after the creation of the data flow. For example, you may use a data flow to regularly process sales data each month. If that sales data is updated weekly, new categories may be introduced into columns for which an encode categorical step is defined. 
+ When you select **Sampling** when you import your dataset, some categories may be left out of the sample. 

In these situations, these new categories are considered missing values in the Data Wrangler job.

You can choose from and configure an *ordinal* and a *one-hot encode*. Use the following sections to learn more about these options. 

Both transforms create a new column named **Output column name**. You specify the output format of this column with **Output style**:
+ Select **Vector** to produce a single column with a sparse vector. 
+ Select **Columns** to create a column for every category with an indicator variable for whether the text in the original column contains a value that is equal to that category.

### Ordinal Encode


Select **Ordinal encode** to encode categories into an integer between 0 and the total number of categories in the **Input column** you select.

**Invalid handling strategy**: Select a method to handle invalid or missing values. 
+ Choose **Skip** if you want to omit the rows with missing values.
+ Choose **Keep** to retain missing values as the last category.
+ Choose **Error** if you want Data Wrangler to throw an error if missing values are encountered in the **Input column**.
+ Choose **Replace with NaN** to replace missing values with NaN. This option is recommended if your ML algorithm can handle missing values. Otherwise, the first three options in this list may produce better results.

### One-Hot Encode


Select **One-hot encode** for **Transform** to use one-hot encoding. Configure this transform using the following: 
+ **Drop last category**: If `True`, the last category does not have a corresponding index in the one-hot encoding. When missing values are possible, a missing category is always the last one and setting this to `True` means that a missing value results in an all zero vector.
+ **Invalid handling strategy**: Select a method to handle invalid or missing values. 
  + Choose **Skip** if you want to omit the rows with missing values.
  + Choose **Keep** to retain missing values as the last category.
  + Choose **Error** if you want Data Wrangler to throw an error if missing values are encountered in the **Input column**.
+ **Is input ordinal encoded**: Select this option if the input vector contains ordinal encoded data. This option requires that input data contain non-negative integers. If **True**, input *i* is encoded as a vector with a non-zero in the *i*th location. 

### Similarity encode


Use similarity encoding when you have the following:
+ A large number of categorical variables
+ Noisy data

The similarity encoder creates embeddings for columns with categorical data. An embedding is a mapping of discrete objects, such as words, to vectors of real numbers. It encodes similar strings to vectors containing similar values. For example, it creates very similar encodings for "California" and "Calfornia".

Data Wrangler converts each category in your dataset into a set of tokens using a 3-gram tokenizer. It converts the tokens into an embedding using min-hash encoding.
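
To see why misspellings end up with similar encodings, the following sketch compares the character 3-gram sets of two strings. The Jaccard score here only illustrates token overlap; it is not the min-hash embedding that Data Wrangler computes.

```
def char_ngrams(s, n=3):
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

a = char_ngrams("California")
b = char_ngrams("Calfornia")

# High token overlap means the min-hash encodings are also similar.
print(len(a & b) / len(a | b))  # Jaccard similarity, well above that of unrelated strings
```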

The similarity encodings that Data Wrangler creates:
+ Have low dimensionality
+ Are scalable to a large number of categories
+ Are robust and resistant to noise

For the preceding reasons, similarity encoding is more versatile than one-hot encoding.

To add the similarity encoding transform to your dataset, use the following procedure.

1. Sign in to the [Amazon SageMaker AI Console](https://console.aws.amazon.com/sagemaker/).

1. Choose **Open Studio Classic**.

1. Choose **Launch app**.

1. Choose **Studio**.

1. Specify your data flow.

1. Choose a step with a transformation.

1. Choose **Add step**.

1. Choose **Encode categorical**.

1. Specify the following:
   + **Transform** – **Similarity encode**
   + **Input column** – The column containing the categorical data that you're encoding.
   + **Target dimension** – (Optional) The dimension of the categorical embedding vector. The default value is 30. We recommend using a larger target dimension if you have a large dataset with many categories.
   + **Output style** – Choose **Vector** for a single vector with all of the encoded values. Choose **Column** to have the encoded values in separate columns.
   + **Output column** – (Optional) The name of the output column for a vector encoded output. For a column-encoded output, this is the prefix of the column names, followed by a number.

## Featurize Text


Use the **Featurize Text** transform group to inspect string-typed columns and use text embedding to featurize these columns. 

This feature group contains two features, *Character statistics* and *Vectorize*. Use the following sections to learn more about these transforms. For both options, the **Input column** must contain text data (string type).

### Character Statistics


Use **Character statistics** to generate statistics for each row in a column containing text data. 

This transform computes the following ratios and counts for each row, and creates a new column to report the result. The new column is named using the input column name as a prefix and a suffix that is specific to the ratio or count. 
+ **Number of words**: The total number of words in that row. The suffix for this output column is `-stats_word_count`.
+ **Number of characters**: The total number of characters in that row. The suffix for this output column is `-stats_char_count`.
+ **Ratio of upper**: The number of uppercase characters, from A to Z, divided by all characters in the column. The suffix for this output column is `-stats_capital_ratio`.
+ **Ratio of lower**: The number of lowercase characters, from a to z, divided by all characters in the column. The suffix for this output column is `-stats_lower_ratio`.
+ **Ratio of digits**: The ratio of digits in a single row over the sum of digits in the input column. The suffix for this output column is `-stats_digit_ratio`.
+ **Special characters ratio**: The ratio of non-alphanumeric characters (such as `&%:@`) over the sum of all characters in the input column. The suffix for this output column is `-stats_special_ratio`.
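
The following sketch reproduces a few of these statistics with pandas for a hypothetical `review` text column; Data Wrangler computes all of them for you.

```
import pandas as pd

text = df["review"]  # hypothetical text column

stats = pd.DataFrame({
    "review-stats_word_count": text.str.split().str.len(),
    "review-stats_char_count": text.str.len(),
    "review-stats_capital_ratio": text.str.count(r"[A-Z]") / text.str.len(),
    "review-stats_lower_ratio": text.str.count(r"[a-z]") / text.str.len(),
})
```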

### Vectorize


Text embedding involves mapping words or phrases from a vocabulary to vectors of real numbers. Use the Data Wrangler text embedding transform to tokenize and vectorize text data into term frequency–inverse document frequency (TF-IDF) vectors. 

When TF-IDF is calculated for a column of text data, each word in each sentence is converted to a real number that represents its semantic importance. Higher numbers are associated with less frequent words, which tend to be more meaningful. 

When you define a **Vectorize** transform step, Data Wrangler uses the data in your dataset to define the count vectorizer and TF-IDF methods. Running a Data Wrangler job uses these same methods.
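
Conceptually, the pipeline resembles the following Spark ML sketch, assuming a hypothetical Spark DataFrame `df` with a `text` column; the parameters correspond to the configuration fields described below, and the exact internal implementation may differ.

```
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer, IDF

tokenizer = Tokenizer(inputCol="text", outputCol="tokens")  # split on whitespace, lowercase
vectorizer = CountVectorizer(inputCol="tokens", outputCol="tf",
                             vocabSize=262144, minDF=1.0)   # term-frequency vectors
idf = IDF(inputCol="tf", outputCol="tfidf", minDocFreq=5)   # inverse document frequency

model = Pipeline(stages=[tokenizer, vectorizer, idf]).fit(df)
tfidf_df = model.transform(df)
```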

You configure this transform using the following: 
+ **Output column name**: This transform creates a new column with the text embedding. Use this field to specify a name for this output column. 
+ **Tokenizer**: A tokenizer converts the sentence into a list of words, or *tokens*. 

  Choose **Standard** to use a tokenizer that splits by white space and converts each word to lowercase. For example, `"Good dog"` is tokenized to `["good","dog"]`.

  Choose **Custom** to use a customized tokenizer. If you choose **Custom**, you can use the following fields to configure the tokenizer:
  + **Minimum token length**: The minimum length, in characters, for a token to be valid. Defaults to `1`. For example, if you specify `3` for minimum token length, words like `a, at, in` are dropped from the tokenized sentence. 
  + **Should regex split on gaps**: If selected, **regex** splits on gaps. Otherwise, it matches tokens. Defaults to `True`. 
  + **Regex pattern**: Regex pattern that defines the tokenization process. Defaults to `'\\s+'`.
  + **To lowercase**: If chosen, Data Wrangler converts all characters to lowercase before tokenization. Defaults to `True`.

  To learn more, see the Spark documentation on [Tokenizer](https://spark.apache.org/docs/latest/ml-features#tokenizer).
+ **Vectorizer**: The vectorizer converts the list of tokens into a sparse numeric vector. Each token corresponds to an index in the vector and a non-zero indicates the existence of the token in the input sentence. You can choose from two vectorizer options, *Count* and *Hashing*.
  + **Count vectorize** allows customizations that filter infrequent or too common tokens. **Count vectorize parameters** include the following: 
    + **Minimum term frequency**: In each row, terms (tokens) with smaller frequency are filtered. If you specify an integer, this is an absolute threshold (inclusive). If you specify a fraction between 0 (inclusive) and 1, the threshold is relative to the total term count. Defaults to `1`.
    + **Minimum document frequency**: Minimum number of rows in which a term (token) must appear to be included. If you specify an integer, this is an absolute threshold (inclusive). If you specify a fraction between 0 (inclusive) and 1, the threshold is relative to the total term count. Defaults to `1`.
    + **Maximum document frequency**: Maximum number of documents (rows) in which a term (token) can appear to be included. If you specify an integer, this is an absolute threshold (inclusive). If you specify a fraction between 0 (inclusive) and 1, the threshold is relative to the total term count. Defaults to `0.999`.
    + **Maximum vocabulary size**: Maximum size of the vocabulary. The vocabulary is made up of all terms (tokens) in all rows of the column. Defaults to `262144`.
    + **Binary outputs**: If selected, the vector outputs do not include the number of appearances of a term in a document, but rather are a binary indicator of its appearance. Defaults to `False`.

    To learn more about this option, see the Spark documentation on [CountVectorizer](https://spark.apache.org/docs/latest/ml-features#countvectorizer).
  + **Hashing** is computationally faster. **Hash vectorize parameters** includes the following:
    + **Number of features during hashing**: A hash vectorizer maps tokens to a vector index according to their hash value. This feature determines the number of possible hash values. Large values result in fewer collisions between hash values but a higher dimension output vector.

    To learn more about this option, see the Spark documentation on [FeatureHasher](https://spark.apache.org/docs/latest/ml-features#featurehasher)
+ **Apply IDF** applies an IDF transformation, which multiplies the term frequency with the standard inverse document frequency used for TF-IDF embedding. **IDF parameters** include the following: 
  + **Minimum document frequency**: Minimum number of documents (rows) in which a term (token) must appear to be included. If **Count vectorize** is the chosen vectorizer, we recommend that you keep the default value and only modify the **Minimum document frequency** field in **Count vectorize parameters**. Defaults to `5`.
+ **Output format**: The output format of each row. 
  + Select **Vector** to produce a single column with a sparse vector. 
  + Select **Flattened** to create a column for every category with an indicator variable for whether the text in the original column contains a value that is equal to that category. You can only choose flattened when **Vectorizer** is set as **Count vectorizer**.

## Transform Time Series


In Data Wrangler, you can transform time series data. The values in a time series dataset are indexed to a specific time. For example, a dataset that shows the number of customers in a store for each hour in a day is a time series dataset. The following table shows an example of a time series dataset.

Hourly number of customers in a store


| Number of customers | Time (hour) | 
| --- | --- | 
| 4 | 09:00 | 
| 10 | 10:00 | 
| 14 | 11:00 | 
| 25 | 12:00 | 
| 20 | 13:00 | 
| 18 | 14:00 | 

For the preceding table, the **Number of customers** column contains the time series data, which is indexed on the hourly values in the **Time (hour)** column.

You might need to perform a series of transformations on your data to get it in a format that you can use for your analysis. Use the **Time series** transform group to transform your time series data. For more information about the transformations that you can perform, see the following sections.

**Topics**
+ [Group by a Time Series](#canvas-group-by-time-series)
+ [Resample Time Series Data](#canvas-resample-time-series)
+ [Handle Missing Time Series Data](#canvas-transform-handle-missing-time-series)
+ [Validate the Timestamp of Your Time Series Data](#canvas-transform-validate-timestamp)
+ [Standardizing the Length of the Time Series](#canvas-transform-standardize-length)
+ [Extract Features from Your Time Series Data](#canvas-transform-extract-time-series-features)
+ [Use Lagged Features from Your Time Series Data](#canvas-transform-lag-time-series)
+ [Create a Datetime Range In Your Time Series](#canvas-transform-datetime-range)
+ [Use a Rolling Window In Your Time Series](#canvas-transform-rolling-window)

### Group by a Time Series


You can use the group by operation to group time series data for specific values in a column.

For example, suppose you have the following table that tracks the average daily electricity usage in a household.

Average daily household electricity usage


| Household ID | Daily timestamp | Electricity usage (kWh) | Number of household occupants | 
| --- | --- | --- | --- | 
| household_0 | 1/1/2020 | 30 | 2 | 
| household_0 | 1/2/2020 | 40 | 2 | 
| household_0 | 1/4/2020 | 35 | 3 | 
| household_1 | 1/2/2020 | 45 | 3 | 
| household_1 | 1/3/2020 | 55 | 4 | 

If you choose to group by ID, you get the following table.

Electricity usage grouped by household ID


| Household ID | Electricity usage series (kWh) | Number of household occupants series | 
| --- | --- | --- | 
| household_0 | [30, 40, 35] | [2, 2, 3] | 
| household_1 | [45, 55] | [3, 4] | 

Each entry in the time series sequence is ordered by the corresponding timestamp. The first element of the sequence corresponds to the first timestamp of the series. For `household_0`, `30` is the first value of the **Electricity Usage Series**. The value of `30` corresponds to the first timestamp of `1/1/2020`.

You can include the starting timestamp and ending timestamp. The following table shows how that information appears.

Electricity usage grouped by household ID


| Household ID | Electricity usage series (kWh) | Number of household occupants series | Start_time | End_time | 
| --- | --- | --- | --- | --- | 
| household_0 | [30, 40, 35] | [2, 2, 3] | 1/1/2020 | 1/4/2020 | 
| household_1 | [45, 55] | [3, 4] | 1/2/2020 | 1/3/2020 | 
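
In pandas terms, the grouping looks roughly like the following sketch, which assumes hypothetical column names matching the preceding tables; Data Wrangler performs the grouping for you.

```
grouped = (
    df.sort_values("timestamp")
      .groupby("household_id")
      .agg(
          usage_series=("electricity_usage", list),  # ordered by timestamp
          occupants_series=("occupants", list),
          start_time=("timestamp", "min"),
          end_time=("timestamp", "max"),
      )
      .reset_index()
)
```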

You can use the following procedure to group by a time series column. 

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **More options** icon, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Time Series**.

1. Under **Transform**, choose **Group by**.

1. Specify a column in **Group by this column**.

1. For **Apply to columns**, specify a value.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Resample Time Series Data


Time series data usually has observations that aren't taken at regular intervals. For example, a dataset could have some observations that are recorded hourly and other observations that are recorded every two hours.

Many analyses, such as forecasting algorithms, require the observations to be taken at regular intervals. Resampling gives you the ability to establish regular intervals for the observations in your dataset.

You can either upsample or downsample a time series. Downsampling increases the interval between observations in the dataset. For example, if you downsample observations that are taken either every hour or every two hours, each observation in your dataset is taken every two hours. The hourly observations are aggregated into a single value using an aggregation method such as the mean or median.

Upsampling reduces the interval between observations in the dataset. For example, if you upsample observations that are taken every two hours into hourly observations, you can use an interpolation method to infer hourly observations from the ones that have been taken every two hours. For information on interpolation methods, see [pandas.DataFrame.interpolate](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html).

You can resample both numeric and non-numeric data.

Use the **Resample** operation to resample your time series data. If you have multiple time series in your dataset, Data Wrangler standardizes the time interval for each time series.

The following table shows an example of downsampling time series data by using the mean as the aggregation method. The data is downsampled from hourly readings to readings taken every two hours.

Hourly temperature readings over a day before downsampling


| Timestamp | Temperature (Celsius) | 
| --- | --- | 
| 12:00 | 30 | 
| 1:00 | 32 | 
| 2:00 | 35 | 
| 3:00 | 32 | 
| 4:00 | 30 | 

Temperature readings downsampled to every two hours


| Timestamp | Temperature (Celsius) | 
| --- | --- | 
| 12:00 | 30 | 
| 2:00 | 33.5 | 
| 4:00 | 35 | 
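
With pandas, the two directions of resampling look roughly like the following sketch, assuming a hypothetical hourly `temperature` series indexed by timestamp; Data Wrangler applies the equivalent operation for you.

```
series = df.set_index("timestamp")["temperature"]

downsampled = series.resample("2h").mean()          # aggregate into two-hour intervals
upsampled = series.resample("30min").interpolate()  # infer half-hourly observations
```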

You can use the following procedure to resample time series data.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **More options** icon, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Resample**.

1. For **Timestamp**, choose the timestamp column.

1. For **Frequency unit**, specify the frequency that you're resampling.

1. (Optional) Specify a value for **Frequency quantity**.

1. Configure the transform by specifying the remaining fields.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Handle Missing Time Series Data


If you have missing values in your dataset, you can do one of the following:
+ For datasets that have multiple time series, drop the time series that have more missing values than a threshold that you specify.
+ Impute the missing values in a time series by using other values in the time series.

Imputing a missing value involves replacing the data by either specifying a value or by using an inferential method. The following are the methods that you can use for imputation:
+ Constant value – Replace all the missing data in your dataset with a value that you specify.
+ Most common value – Replace all the missing data with the value that has the highest frequency in the dataset.
+ Forward fill – Use a forward fill to replace the missing values with the non-missing value that precedes the missing values. For the sequence: [2, 4, 7, NaN, NaN, NaN, 8], all of the missing values are replaced with 7. The sequence that results from using a forward fill is [2, 4, 7, 7, 7, 7, 8].
+ Backward fill – Use a backward fill to replace the missing values with the non-missing value that follows the missing values. For the sequence: [2, 4, 7, NaN, NaN, NaN, 8], all of the missing values are replaced with 8. The sequence that results from using a backward fill is [2, 4, 7, 8, 8, 8, 8]. 
+ Interpolate – Uses an interpolation function to impute the missing values. For more information on the functions that you can use for interpolation, see [pandas.DataFrame.interpolate](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html).

Some of the imputation methods might not be able to impute all of the missing values in your dataset. For example, a **Forward fill** can't impute a missing value that appears at the beginning of the time series. In that case, you can impute the values by using a backward fill instead.
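
The imputation methods map onto familiar pandas operations, sketched below for the example sequence from the preceding list:

```
import pandas as pd

s = pd.Series([2, 4, 7, None, None, None, 8])

s.ffill()              # forward fill  -> [2, 4, 7, 7, 7, 7, 8]
s.bfill()              # backward fill -> [2, 4, 7, 8, 8, 8, 8]
s.fillna(0)            # constant value
s.fillna(s.mode()[0])  # most common value
s.interpolate()        # interpolate between the surrounding values
```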

You can either impute missing values within a cell or within a column.

The following example shows how values are imputed within a cell.

Electricity usage with missing values


| Household ID | Electricity usage series (kWh) | 
| --- | --- | 
| household_0 | [30, 40, 35, NaN, NaN] | 
| household_1 | [45, NaN, 55] | 

Electricity usage with values imputed using a forward fill


| Household ID | Electricity usage series (kWh) | 
| --- | --- | 
| household_0 | [30, 40, 35, 35, 35] | 
| household_1 | [45, 45, 55] | 

The following example shows how values are imputed within a column.

Average daily household electricity usage with missing values


| Household ID | Electricity usage (kWh) | 
| --- | --- | 
| household_0 | 30 | 
| household_0 | 40 | 
| household_0 | NaN | 
| household_1 | NaN | 
| household_1 | NaN | 

Average daily household electricity usage with values imputed using a forward fill


| Household ID | Electricity usage (kWh) | 
| --- | --- | 
| household_0 | 30 | 
| household_0 | 40 | 
| household_0 | 40 | 
| household_1 | 40 | 
| household_1 | 40 | 

You can use the following procedure to handle missing values.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **More options** icon, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Handle missing**.

1. For **Time series input type**, choose whether you want to handle missing values inside of a cell or along a column.

1. For **Impute missing values for this column**, specify the column that has the missing values.

1. For **Method for imputing values**, select a method.

1. Configure the transform by specifying the remaining fields.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Validate the Timestamp of Your Time Series Data


You might have timestamp data that is invalid. You can use the **Validate timestamps** transform to determine whether the timestamps in your dataset are valid. Your timestamps can be invalid for one or more of the following reasons:
+ Your timestamp column has missing values.
+ The values in your timestamp column are not formatted correctly.

If you have invalid timestamps in your dataset, you can't perform your analysis successfully. You can use Data Wrangler to identify invalid timestamps and understand where you need to clean your data.

You can configure Data Wrangler to do one of the following if it encounters missing or invalid values in your dataset:
+ Drop the rows that have the missing or invalid values.
+ Identify the rows that have the missing or invalid values.
+ Throw an error if it finds any missing or invalid values in your dataset.

You can validate the timestamps on columns that either have the `timestamp` type or the `string` type. If the column has the `string` type, Data Wrangler converts the type of the column to `timestamp` and performs the validation.

You can use the following procedure to validate the timestamps in your dataset.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **More options** icon, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Validate timestamps**.

1. For **Timestamp Column**, choose the timestamp column.

1. For **Policy**, choose how you want to handle missing or invalid timestamps.

1. (Optional) For **Output column**, specify a name for the output column.

1. If the datetime column has the string type, choose **Cast to datetime**.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Standardizing the Length of the Time Series


If you have time series data stored as arrays, you can standardize each time series to the same length. Standardizing the length of the time series array might make it easier for you to perform your analysis on the data.

You can standardize your time series for data transformations that require the length of your data to be fixed.

Many ML algorithms require you to flatten your time series data before you use them. Flattening time series data means separating each value of the time series into its own column in a dataset. The number of columns in a dataset can't change, so the lengths of the time series need to be standardized before you flatten each array into a set of features.

Each time series is set to the length that you specify as a quantile or percentile of the time series set. For example, you can have three sequences that have the following lengths:
+ 3
+ 4
+ 5

You can set the length of all of the sequences as the length of the sequence that has the 50th percentile length.

Time series arrays that are shorter than the length you've specified have missing values added. For example, a series [2, 4, 5] standardized to a length of 6 becomes [2, 4, 5, NaN, NaN, NaN].

You can use different approaches to handle the missing values. For information on those approaches, see [Handle Missing Time Series Data](#canvas-transform-handle-missing-time-series).

The time series arrays that are longer than the length that you specify are truncated.
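
The following sketch shows the idea with pandas, assuming a hypothetical `series` column that stores each time series as a list:

```
import numpy as np

lengths = df["series"].apply(len)
target = int(np.percentile(lengths, 50))  # 50th-percentile cutoff quantile

def standardize(seq):
    seq = list(seq)[:target]                     # truncate longer arrays
    return seq + [np.nan] * (target - len(seq))  # pad shorter arrays with missing values

df["series"] = df["series"].apply(standardize)
```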

You can use the following procedure to standardize the length of the time series.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **More options** icon, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Standardize length**.

1. For **Standardize the time series length for the column**, choose a column.

1. (Optional) For **Output column**, specify a name for the output column. If you don't specify a name, the transform is done in place.

1. If the datetime column has the string type, choose **Cast to datetime**.

1. Choose **Cutoff quantile** and specify a quantile to set the length of the sequence.

1. Choose **Flatten the output** to output the values of the time series into separate columns.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Extract Features from Your Time Series Data


If you're running a classification or a regression algorithm on your time series data, we recommend extracting features from the time series before running the algorithm. Extracting features might improve the performance of your algorithm.

Use the following options to choose how you want to extract features from your data:
+ Use **Minimal subset** to specify extracting 8 features that you know are useful in downstream analyses. You can use a minimal subset when you need to perform computations quickly. You can also use it when your ML algorithm has a high risk of overfitting and you want to provide it with fewer features.
+ Use **Efficient subset** to specify extracting the most features possible without extracting features that are computationally intensive in your analyses.
+ Use **All features** to specify extracting all features from the time series.
+ Use **Manual subset** to choose a list of features that you think explain the variation in your data well.

Use the following procedure to extract features from your time series data.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **More options** icon, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Extract features**.

1. For **Extract features for this column**, choose a column.

1. (Optional) Select **Flatten** to output the features into separate columns.

1. For **Strategy**, choose a strategy to extract the features.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Use Lagged Features from Your Time Series Data


For many use cases, the best way to predict the future behavior of your time series is to use its most recent behavior.

The most common uses of lagged features are the following:
+ Collecting a handful of past values. For example, for time, t + 1, you collect t, t - 1, t - 2, and t - 3.
+ Collecting values that correspond to seasonal behavior in the data. For example, to predict the occupancy in a restaurant at 1:00 PM, you might want to use the features from 1:00 PM on the previous day. Using the features from 12:00 PM or 11:00 AM on the same day might not be as predictive as using the features from previous days.
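
In pandas terms, both patterns reduce to `shift`, as the following sketch shows for a hypothetical hourly target column `y` sorted by timestamp:

```
df = df.sort_values("timestamp")

# Collect a handful of past values: t - 1, t - 2, and t - 3.
for k in (1, 2, 3):
    df[f"y_lag_{k}"] = df["y"].shift(k)

# Seasonal lag: the value from the same time on the previous day (hourly data).
df["y_lag_24h"] = df["y"].shift(24)

df = df.dropna()  # drop rows without enough history
```

You can use the following procedure to create lagged features.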

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **More options** icon, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Lag features**.

1. For **Generate lag features for this column**, choose a column.

1. For **Timestamp Column**, choose the column containing the timestamps.

1. For **Lag**, specify the duration of the lag.

1. (Optional) Configure the output using one of the following options:
   + **Include the entire lag window**
   + **Flatten the output**
   + **Drop rows without history**

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Create a Datetime Range In Your Time Series


You might have time series data that doesn't have timestamps. If you know that the observations were taken at regular intervals, you can generate timestamps for the time series in a separate column. To generate timestamps, you specify the value for the start timestamp and the frequency of the timestamps.

For example, you might have the following time series data for the number of customers at a restaurant.

Time series data on the number of customers at a restaurant


| Number of customers | 
| --- | 
| 10 | 
| 14 | 
| 24 | 
| 40 | 
| 30 | 
| 20 | 

If you know that the restaurant opened at 1:00 PM and that the observations are taken hourly, you can add a timestamp column that corresponds to the time series data. You can see the timestamp column in the following table.

Time series data on the number of customers at a restaurant


| Number of customers | Timestamp | 
| --- | --- | 
| 10 | 1:00 PM | 
| 14 | 2:00 PM | 
| 24 | 3:00 PM | 
| 40 | 4:00 PM | 
| 30 | 5:00 PM | 
| 20 | 6:00 PM | 
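
With pandas, generating such a column is a one-liner, sketched below under the assumption of hourly observations starting at 1:00 PM on a hypothetical date:

```
import pandas as pd

# One timestamp per observation, hourly, starting at 1:00 PM.
df["timestamp"] = pd.date_range(start="2020-01-01 13:00", periods=len(df), freq="h")
```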

Use the following procedure to add a datetime range to your data.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **More options** icon, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Datetime range**.

1. For **Frequency type**, choose the unit used to measure the frequency of the timestamps.

1. For **Starting timestamp**, specify the start timestamp.

1. For **Output column**, specify a name for the output column.

1. (Optional) Configure the output using the remaining fields.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Use a Rolling Window In Your Time Series


You can extract features over a time period. For example, for time *t* and a time window length of 3, the row for the *t*th timestamp is appended with the features that are extracted from the time series at times *t* - 3, *t* - 2, and *t* - 1. For information on extracting features, see [Extract Features from Your Time Series Data](#canvas-transform-extract-time-series-features). 

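As a rough illustration of what a rolling window computes, the following pandas sketch aggregates over the window *t* - 3, *t* - 2, *t* - 1 for each row; the column name and the strategies (mean and max) are illustrative.

```
import pandas as pd

df = pd.DataFrame({"demand": [10, 14, 24, 40, 30, 20]})

# shift(1) excludes the current row, so the window covers t - 3 to t - 1.
window = df["demand"].shift(1).rolling(window=3)
df["rolling_mean_3"] = window.mean()
df["rolling_max_3"] = window.max()
```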
You can use the following procedure to extract features over a time period.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the ellipsis icon, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Rolling window features**.

1. For **Generate rolling window features for this column**, choose a column.

1. For **Timestamp Column**, choose the column containing the timestamps.

1. (Optional) For **Output Column**, specify the name of the output column.

1. For **Window size**, specify the window size.

1. For **Strategy**, choose the extraction strategy.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

## Featurize Datetime


Use **Featurize date/time** to create a vector embedding representing a datetime field. To use this transform, your datetime data must be in one of the following formats: 
+ Strings describing datetime: For example, `"January 1st, 2020, 12:44pm"`. 
+ A Unix timestamp: A Unix timestamp describes the number of seconds, milliseconds, microseconds, or nanoseconds that have elapsed since January 1, 1970 (the Unix epoch). 

You can choose whether to **Infer datetime format** and whether to provide a **Datetime format**. If you provide a datetime format, you must use the codes described in the [Python documentation](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes). The options you select for these two configurations have implications for the speed of the operation and the final results.
+ The most manual option, which is also the computationally fastest, is to specify a **Datetime format** and select **No** for **Infer datetime format**.
+ To reduce manual labor, you can choose **Infer datetime format** and not specify a datetime format. This is also a computationally fast operation; however, the first datetime format encountered in the input column is assumed to be the format for the entire column. If the column contains other formats, those values are NaN in the final output. As a result, inferring the datetime format can leave some strings unparsed. 
+ If you don't specify a format and select **No** for **Infer datetime format**, you get the most robust results. All the valid datetime strings are parsed. However, this operation can be an order of magnitude slower than the first two options in this list. 

When you use this transform, you specify an **Input column** that contains datetime data in one of the formats listed above. The transform creates an output column named **Output column name**. The format of the output column depends on which of the following you choose:
+ **Vector**: Outputs a single column as a vector. 
+ **Columns**: Creates a new column for every feature. For example, if the output contains a year, month, and day, three separate columns are created for year, month, and day. 

Additionally, you must choose an **Embedding mode**. For linear models and deep networks, we recommend choosing **cyclic**. For tree-based algorithms, we recommend choosing **ordinal**.
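The following sketch shows the intuition behind the two embedding modes for a single extracted feature (the hour of day); the column names are illustrative, not the transform's actual output names.

```
import numpy as np
import pandas as pd

ts = pd.to_datetime(pd.Series(["2020-01-01 12:44", "2020-06-15 23:10"]))
hour = ts.dt.hour

# Ordinal: the raw integer value, suited to tree-based algorithms.
ordinal = hour

# Cyclic: a sine/cosine pair, so 23:00 and 00:00 end up close together;
# suited to linear models and deep networks.
cyclic = pd.DataFrame({
    "hour_sin": np.sin(2 * np.pi * hour / 24),
    "hour_cos": np.cos(2 * np.pi * hour / 24),
})
```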

## Format String


The **Format string** transforms contain standard string formatting operations. For example, you can use these operations to remove special characters, normalize string lengths, and update string casing.

This feature group contains the following transforms. All transforms return copies of the strings in the **Input column** and add the result to a new, output column.


| Name | Function | 
| --- | --- | 
| Left pad |  Left-pad the string with a given **Fill character** to the given **width**. If the string is longer than **width**, the return value is shortened to **width** characters.  | 
| Right pad |  Right-pad the string with a given **Fill character** to the given **width**. If the string is longer than **width**, the return value is shortened to **width** characters.  | 
| Center (pad on either side) |  Center-pad the string (add padding on both sides of the string) with a given **Fill character** to the given **width**. If the string is longer than **width**, the return value is shortened to **width** characters.  | 
| Prepend zeros |  Left-fill a numeric string with zeros, up to a given **width**. If the string is longer than **width**, the return value is shortened to **width** characters.  | 
| Strip left and right |  Returns a copy of the string with the leading and trailing characters removed.  | 
| Strip characters from left |  Returns a copy of the string with leading characters removed.  | 
| Strip characters from right |  Returns a copy of the string with trailing characters removed.  | 
| Lower case |  Convert all letters in text to lowercase.  | 
| Upper case |  Convert all letters in text to uppercase.  | 
| Capitalize |  Capitalize the first letter in each sentence.   | 
| Swap case | Converts all uppercase characters in the string to lowercase and all lowercase characters to uppercase, and returns the result. | 
| Add prefix or suffix |  Adds a prefix and a suffix to the string column. You must specify at least one of **Prefix** and **Suffix**.   | 
| Remove symbols |  Removes given symbols from a string. All listed characters are removed. Defaults to white space.   | 
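As a rough guide to what these operations do, the following pandas sketch mirrors a few of them; the series values, widths, and fill characters are illustrative assumptions.

```
import pandas as pd

s = pd.Series([" Widget-01 ", "GADGET_7"])

left_padded = s.str.pad(width=12, side="left", fillchar="*")  # Left pad
stripped = s.str.strip()                                      # Strip left and right
lowered = s.str.lower()                                       # Lower case
prefixed = "sku_" + s.str.strip()                             # Add prefix or suffix
```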

## Handle Outliers


Machine learning models are sensitive to the distribution and range of your feature values. Outliers, or rare values, can negatively impact model accuracy and lead to longer training times. Use this feature group to detect and update outliers in your dataset. 

When you define a **Handle outliers** transform step, the statistics used to detect outliers are generated on the data available in Data Wrangler when defining this step. These same statistics are used when running a Data Wrangler job. 

Use the following sections to learn more about the transforms this group contains. You specify an **Output name** and each of these transforms produces an output column with the resulting data. 

### Robust standard deviation numeric outliers


This transform detects and fixes outliers in numeric features using statistics that are robust to outliers.

You must define an **Upper quantile** and a **Lower quantile** for the statistics used to calculate outliers. You must also specify the number of **Standard deviations** by which a value must vary from the mean to be considered an outlier. For example, if you specify 3 for **Standard deviations**, a value must fall more than 3 standard deviations from the mean to be considered an outlier. 

The **Fix method** is the method used to handle outliers when they are detected. You can choose from the following:
+ **Clip**: Use this option to clip the outliers to the corresponding outlier detection bound.
+ **Remove**: Use this option to remove rows with outliers from the dataframe.
+ **Invalidate**: Use this option to replace outliers with invalid values.

### Standard Deviation Numeric Outliers


This transform detects and fixes outliers in numeric features using the mean and standard deviation.

You specify the number of **Standard deviations** a value must vary from the mean to be considered an outlier. For example, if you specify 3 for **Standard deviations**, a value must fall more than 3 standard deviations from the mean to be considered an outlier. 

The **Fix method** is the method used to handle outliers when they are detected. You can choose from the following:
+ **Clip**: Use this option to clip the outliers to the corresponding outlier detection bound.
+ **Remove**: Use this option to remove rows with outliers from the dataframe.
+ **Invalidate**: Use this option to replace outliers with invalid values.
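The following is a minimal pandas sketch of the standard-deviation rule with the **Clip** fix method; the series values and the choice of 3 standard deviations are illustrative.

```
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 250])  # 250 is an outlier

n_std = 3
lower = s.mean() - n_std * s.std()
upper = s.mean() + n_std * s.std()

# Clip: outliers are pulled back to the detection bounds.
clipped = s.clip(lower=lower, upper=upper)
```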

### Quantile Numeric Outliers


Use this transform to detect and fix outliers in numeric features using quantiles. You can define an **Upper quantile** and a **Lower quantile**. All values that fall above the upper quantile or below the lower quantile are considered outliers. 

The **Fix method** is the method used to handle outliers when they are detected. You can choose from the following:
+ **Clip**: Use this option to clip the outliers to the corresponding outlier detection bound.
+ **Remove**: Use this option to remove rows with outliers from the dataframe.
+ **Invalidate**: Use this option to replace outliers with invalid values. 

### Min-Max Numeric Outliers


This transform detects and fixes outliers in numeric features using upper and lower thresholds. Use this method if you know the threshold values that demarcate outliers.

You specify an **Upper threshold** and a **Lower threshold**. If values fall above or below those thresholds, respectively, they are considered outliers. 

The **Fix method** is the method used to handle outliers when they are detected. You can choose from the following:
+ **Clip**: Use this option to clip the outliers to the corresponding outlier detection bound.
+ **Remove**: Use this option to remove rows with outliers from the dataframe.
+ **Invalidate**: Use this option to replace outliers with invalid values. 

### Replace Rare


When you use the **Replace rare** transform, you specify a threshold and Data Wrangler finds all values that meet that threshold and replaces them with a string that you specify. For example, you may want to use this transform to categorize all outliers in a column into an "Others" category. 
+ **Replacement string**: The string with which to replace outliers.
+ **Absolute threshold**: A category is rare if the number of instances is less than or equal to this absolute threshold.
+ **Fraction threshold**: A category is rare if the number of instances is less than or equal to this fraction threshold multiplied by the number of rows.
+ **Max common categories**: The maximum number of not-rare categories that remain after the operation. If the threshold does not filter out enough categories, the categories with the highest number of appearances are classified as not rare. If set to 0 (the default), there is no hard limit on the number of categories.
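A minimal pandas sketch of the rare-category logic, assuming an absolute threshold of 1 and the replacement string "Others"; the category values are illustrative.

```
import pandas as pd

s = pd.Series(["red", "red", "blue", "blue", "blue", "teal"])

# A category is rare if its count is at or below the absolute threshold.
absolute_threshold = 1
counts = s.value_counts()
rare = counts[counts <= absolute_threshold].index

replaced = s.where(~s.isin(rare), other="Others")  # "teal" becomes "Others"
```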

## Handle Missing Values


Missing values are a common occurrence in machine learning datasets. In some situations, it is appropriate to impute missing data with a calculated value, such as an average or categorically common value. You can process missing values using the **Handle missing values** transform group. This group contains the following transforms. 

### Fill Missing


Use the **Fill missing** transform to replace missing values with a **Fill value** you define. 

### Impute Missing


Use the **Impute missing** transform to create a new column that contains imputed values where missing values were found in input categorical and numerical data. The configuration depends on your data type.

For numeric data, choose an imputing strategy, the strategy used to determine the new value to impute. You can choose to impute the mean or the median over the values that are present in your dataset. Data Wrangler uses the value that it computes to impute the missing values.

For categorical data, Data Wrangler imputes missing values using the most frequent value in the column. To impute a custom string, use the **Fill missing** transform instead.
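The following pandas sketch shows both imputing strategies; the column names and values are illustrative, and Data Wrangler writes the result to a new column rather than modifying the input.

```
import pandas as pd

df = pd.DataFrame({"price": [10.0, None, 14.0],
                   "color": ["red", None, "red"]})

# Numeric: impute the mean (or median) of the present values.
df["price_imputed"] = df["price"].fillna(df["price"].mean())

# Categorical: impute the most frequent value in the column.
df["color_imputed"] = df["color"].fillna(df["color"].mode().iloc[0])
```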

### Add Indicator for Missing


Use the **Add indicator for missing** transform to create a new indicator column, which contains a Boolean `"false"` if a row contains a value, and `"true"` if a row contains a missing value. 

### Drop Missing


Use the **Drop missing** option to drop rows that contain missing values from the **Input column**.

## Manage Columns


You can use the following transforms to quickly update and manage columns in your dataset: 


****  

| Name | Function | 
| --- | --- | 
| Drop Column | Delete a column.  | 
| Duplicate Column | Duplicate a column. | 
| Rename Column | Rename a column. | 
| Move Column |  Move a column's location in the dataset. Choose to move your column to the start or end of the dataset, before or after a reference column, or to a specific index.   | 

## Manage Rows


Use this transform group to quickly perform sort and shuffle operations on rows. This group contains the following:
+ **Sort**: Sort the entire dataframe by a given column. Select the check box next to **Ascending order** to sort in ascending order; deselect the check box to sort in descending order. 
+ **Shuffle**: Randomly shuffle all rows in the dataset. 

## Manage Vectors


Use this transform group to combine or flatten vector columns. This group contains the following transforms. 
+ **Assemble**: Use this transform to combine Spark vectors and numeric data into a single column. For example, you can combine three columns: two containing numeric data and one containing vectors. Add all the columns you want to combine in **Input columns** and specify an **Output column name** for the combined data. A code sketch follows this list. 
+ **Flatten**: Use this transform to flatten a single column containing vector data. The input column must contain PySpark vectors or array-like objects. You can control the number of columns created by specifying a **Method to detect number of outputs**. For example, if you select **Length of first vector**, the number of elements in the first valid vector or array found in the column determines the number of output columns that are created. All other input vectors with too many items are truncated. Inputs with too few items are filled with NaNs.

  You also specify an **Output prefix**, which is used as the prefix for each output column. 
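The following is a minimal PySpark sketch of the **Assemble** transform; the column names and values are illustrative assumptions.

```
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(34, 55000.0, Vectors.dense([0.1, 0.9]))],
    ["age", "income", "embedding"],
)

# Combine two numeric columns and one vector column into a single vector column.
assembler = VectorAssembler(inputCols=["age", "income", "embedding"], outputCol="features")
assembled = assembler.transform(df)
```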

## Process Numeric


Use the **Process Numeric** feature group to process numeric data. Each scaler in this group is defined using the Spark library. The following scalers are supported; a code sketch follows the list:
+ **Standard Scaler**: Standardize the input column by subtracting the mean from each value and scaling to unit variance. To learn more, see the Spark documentation for [StandardScaler](https://spark.apache.org/docs/latest/ml-features#standardscaler).
+ **Robust Scaler**: Scale the input column using statistics that are robust to outliers. To learn more, see the Spark documentation for [RobustScaler](https://spark.apache.org/docs/latest/ml-features#robustscaler).
+ **Min Max Scaler**: Transform the input column by scaling each feature to a given range. To learn more, see the Spark documentation for [MinMaxScaler](https://spark.apache.org/docs/latest/ml-features#minmaxscaler).
+ **Max Absolute Scaler**: Scale the input column by dividing each value by the maximum absolute value. To learn more, see the Spark documentation for [MaxAbsScaler](https://spark.apache.org/docs/latest/ml-features#maxabsscaler).
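As an example of how these scalers behave, the following PySpark sketch applies the Standard Scaler to a numeric column; Spark scalers operate on vector columns, so the column is assembled into a vector first. The column names are illustrative.

```
from pyspark.sql import SparkSession
from pyspark.ml.feature import StandardScaler, VectorAssembler

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(10.0,), (12.0,), (250.0,)], ["price"])

# Assemble the numeric column into a vector, then standardize it.
vec_df = VectorAssembler(inputCols=["price"], outputCol="price_vec").transform(df)
scaler = StandardScaler(inputCol="price_vec", outputCol="price_scaled",
                        withMean=True, withStd=True)
scaled_df = scaler.fit(vec_df).transform(vec_df)
```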

## Sampling


After you've imported your data, you can use the **Sampling** transformer to take one or more samples of it. When you use the sampling transformer, Data Wrangler samples your original dataset.

You can choose one of the following sample methods:
+ **Limit**: Samples the dataset starting from the first row up to the limit that you specify.
+ **Randomized**: Takes a random sample of a size that you specify.
+ **Stratified**: Takes a stratified random sample.

You can stratify a randomized sample to make sure that it represents the original distribution of the dataset.
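The following pandas sketch shows the idea behind stratification; the column name, fraction, and seed are illustrative assumptions.

```
import pandas as pd

df = pd.DataFrame({"region": ["east"] * 8 + ["west"] * 2,
                   "sales": range(10)})

# Sample the same fraction from each category so that the sample
# preserves the original distribution of the stratify column.
stratified = df.groupby("region", group_keys=False).sample(frac=0.5, random_state=42)
```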

You might be performing data preparation for multiple use cases. For each use case, you can take a different sample and apply a different set of transformations.

The following procedure describes the process of creating a random sample. 

**To take a random sample from your data**

1. Choose the ellipsis icon to the right of the dataset that you've imported. The name of your dataset is located below the icon.

1. Choose **Add transform**.

1. Choose **Sampling**.

1. For **Sampling method**, choose the sampling method.

1. For **Approximate sample size**, choose the approximate number of observations that you want in your sample.

1. (Optional) Specify an integer for **Random seed** to create a reproducible sample.

The following procedure describes the process of creating a stratified sample.

**To take a stratified sample from your data**

1. Choose the ellipsis icon to the right of the dataset that you've imported. The name of your dataset is located below the icon.

1. Choose **Add transform**.

1. Choose **Sampling**.

1. For **Sampling method**, choose the sampling method.

1. For **Approximate sample size**, choose the approximate number of observations that you want in your sample.

1. For **Stratify column**, specify the name of the column that you want to stratify on.

1. (Optional) Specify an integer for **Random seed** to create a reproducible sample.

## Search and Edit


Use this section to search for and edit specific patterns within strings. For example, you can find and update strings within sentences or documents, split strings by delimiters, and find occurrences of specific strings. 

The following transforms are supported under **Search and edit**. All transforms return copies of the strings in the **Input column** and add the result to a new output column.


****  

| Name | Function | 
| --- | --- | 
|  Find substring  |  Returns the index of the first occurrence of the **Substring** for which you searched. You can start and end the search at **Start** and **End** respectively.   | 
|  Find substring (from right)  |  Returns the index of the last occurrence of the **Substring** for which you searched. You can start and end the search at **Start** and **End** respectively.   | 
|  Matches prefix  |  Returns a Boolean value if the string contains a given **Pattern**. A pattern can be a character sequence or regular expression. Optionally, you can make the pattern case sensitive.   | 
|  Find all occurrences  |  Returns an array with all occurrences of a given pattern. A pattern can be a character sequence or regular expression.   | 
|  Extract using regex  |  Returns a string that matches a given Regex pattern.  | 
|  Extract between delimiters  |  Returns a string with all characters found between **Left delimiter** and **Right delimiter**.   | 
|  Extract from position  |  Returns a string, starting from **Start position** in the input string, that contains all characters up to the start position plus **Length**.   | 
|  Find and replace substring  |  Returns a string with all matches of a given **Pattern** (regular expression) replaced by **Replacement string**.  | 
|  Replace between delimiters  |  Returns a string with the substring found between the first appearance of a **Left delimiter** and the last appearance of a **Right delimiter** replaced by **Replacement string**. If no match is found, nothing is replaced.   | 
|  Replace from position  |  Returns a string with the substring between **Start position** and **Start position** plus **Length** replaced by **Replacement string**. If **Start position** plus **Length** is greater than the length of the replacement string, the output contains **…**.  | 
|  Convert regex to missing  |  Converts a string to `None` if invalid and returns the result. Validity is defined with a regular expression in **Pattern**.  | 
|  Split string by delimiter  |  Returns an array of strings from the input string, split by **Delimiter**, with up to **Max number of splits** (optional). The delimiter defaults to white space.   | 

## Split data


Use the **Split data** transform to split your dataset into two or three datasets. For example, you can split your dataset into a dataset used to train your model and a dataset used to test it. You can determine the proportion of the dataset that goes into each split. For example, if you’re splitting one dataset into two datasets, the training dataset can have 80% of the data while the testing dataset has 20%.

Splitting your data into three datasets gives you the ability to create training, validation, and test datasets. You can see how well the model performs on the test dataset by dropping the target column.

Your use case determines how much of the original dataset each of your datasets get and the method you use to split the data. For example, you might want to use a stratified split to make sure that the distribution of the observations in the target column are the same across datasets. You can use the following split transforms:
+ Randomized split — Each split is a random, non-overlapping sample of the original dataset. For larger datasets, using a randomized split might be computationally expensive and take longer than an ordered split.
+ Ordered split – Splits the dataset based on the sequential order of the observations. For example, for an 80/20 train-test split, the first observations that make up 80% of the dataset go to the training dataset. The last 20% of the observations go to the testing dataset. Ordered splits are effective in keeping the existing order of the data between splits.
+ Stratified split – Splits the dataset to make sure that the number of observations in the input column have proportional representation. For an input column that has the observations 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, an 80/20 split on the column would mean that approximately 80% of the 1s, 80% of the 2s, and 80% of the 3s go to the training set. About 20% of each type of observation go to the testing set.
+ Split by key – Avoids data with the same key occurring in more than one split. For example, if you have a dataset with the column `customer_id` and you're using it as a key, no customer ID appears in more than one split.

After you split the data, you can apply additional transformations to each dataset. For most use cases, they aren't necessary.

For performance, Data Wrangler calculates approximations of the split proportions. You can choose an error threshold to set the accuracy of the splits. Lower error thresholds more accurately reflect the proportions that you specify for the splits. If you set a higher error threshold, you get better performance but lower accuracy.

For perfectly split data, set the error threshold to 0. You can specify a threshold between 0 and 1 for better performance. If you specify a value greater than 1, Data Wrangler interprets that value as 1.

For example, if you have 10,000 rows in your dataset and you specify an 80/20 split with an error threshold of 0.001, you would get observations approximating one of the following results:
+ 8010 observations in the training set and 1990 in the testing set
+ 7990 observations in the training set and 2010 in the testing set

The number of observations in the training set in the preceding example falls in the interval between 7990 and 8010.

By default, Data Wrangler uses a random seed to make the splits reproducible. You can specify a different value for the seed to create a different reproducible split.
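For intuition, the following PySpark sketch performs an 80/20 randomized split with a fixed seed; the proportions and seed are illustrative, and Data Wrangler manages the equivalent operation for you.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10000)

# An 80/20 randomized split; the seed makes the split reproducible.
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
```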

------
#### [ Randomized split ]

Use the following procedure to perform a randomized split on your dataset.

To split your dataset randomly, do the following.

1. Choose the ellipsis icon next to the node containing the dataset that you're splitting.

1. Choose **Add transform**.

1. Choose **Split data**.

1. (Optional) For **Splits**, specify the names and proportions of each split. The proportions must sum to 1.

1. (Optional) Choose the **+** to create an additional split.

   1. Specify the names and proportions of all the splits. The proportions must sum to 1.

1. (Optional) Specify a value for **Error threshold** other than the default value.

1. (Optional) Specify a value for **Random seed**.

1. Choose **Preview**.

1. Choose **Add**.

------
#### [ Ordered split ]

Use the following procedure to perform an ordered split on your dataset.

To make an ordered split in your dataset, do the following.

1. Choose the ellipsis icon next to the node containing the dataset that you're splitting.

1. Choose **Add transform**.

1. Choose **Split data**.

1. For **Transform**, choose **Ordered split**.

1. (Optional) For **Splits**, specify the names and proportions of each split. The proportions must sum to 1.

1. (Optional) Choose the **+** to create an additional split.

   1. Specify the names and proportions of all the splits. The proportions must sum to 1.

1. (Optional) Specify a value for **Error threshold** other than the default value.

1. (Optional) For **Input column**, specify a column with numeric values. Data Wrangler uses the values of the column to determine which records are in each split. The smaller values go into one split, and the larger values go into the other splits.

1. (Optional) Select **Handle duplicates** to add noise to duplicate values and create a dataset of entirely unique values.

1. (Optional) Specify a value for **Random seed**.

1. Choose **Preview**.

1. Choose **Add**.

------
#### [ Stratified split ]

Use the following procedure to perform a stratified split on your dataset.

To make a stratified split in your dataset, do the following.

1. Choose the ellipsis icon next to the node containing the dataset that you're splitting.

1. Choose **Add transform**.

1. Choose **Split data**.

1. For **Transform**, choose **Stratified split**.

1. (Optional) For **Splits**, specify the names and proportions of each split. The proportions must sum to 1.

1. (Optional) Choose the **+** to create an additional split.

   1. Specify the names and proportions of all the splits. The proportions must sum to 1.

1. For **Input column**, specify a column with up to 100 unique values. Data Wrangler can't stratify a column with more than 100 unique values.

1. (Optional) Specify a value for **Error threshold** other than the default value.

1. (Optional) Specify a different value for **Random seed**.

1. Choose **Preview**.

1. Choose **Add**.

------
#### [ Split by column keys ]

Use the following procedure to split by the column keys in your dataset.

To split by the column keys in your dataset, do the following.

1. Choose the ellipsis icon next to the node containing the dataset that you're splitting.

1. Choose **Add transform**.

1. Choose **Split data**.

1. For **Transform**, choose **Split by key**.

1. (Optional) For **Splits**, specify the names and proportions of each split. The proportions must sum to 1.

1. (Optional) Choose the **+** to create an additional split.

   1. Specify the names and proportions of all the splits. The proportions must sum to 1.

1. For **Key columns**, specify the columns with values that you don't want to appear in both datasets.

1. (Optional) Specify a value for **Error threshold** other than the default value.

1. Choose **Preview**.

1. Choose **Add**.

------

## Parse Value as Type


Use this transform to cast a column to a new type; a code sketch follows the list. The supported Data Wrangler data types are:
+ Long
+ Float
+ Boolean
+ Date, in the format dd-MM-yyyy, representing day, month, and year respectively. 
+ String
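The following is a minimal PySpark sketch of these casts; the column names and values are illustrative assumptions.

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("7", "19.99", "01-02-2020")],
                           ["quantity", "price", "order_date"])

# Cast string columns to Long, Float, and Date (dd-MM-yyyy) types.
typed_df = (df.withColumn("quantity", col("quantity").cast("long"))
              .withColumn("price", col("price").cast("float"))
              .withColumn("order_date", to_date(col("order_date"), "dd-MM-yyyy")))
```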

## Validate String


Use the **Validate string** transforms to create a new column that indicates whether a row of text data meets a specified condition. For example, you can use a **Validate string** transform to verify that a string only contains lowercase characters. 

The following transforms are included in this transform group. If a transform outputs a Boolean value, `True` is represented with a `1` and `False` is represented with a `0`.


****  

| Name | Function | 
| --- | --- | 
|  String length  |  Returns `True` if a string's length equals the specified length. Otherwise, returns `False`.   | 
|  Starts with  |  Returns `True` if a string starts with a specified prefix. Otherwise, returns `False`.  | 
|  Ends with  |  Returns `True` if a string ends with a specified suffix. Otherwise, returns `False`.  | 
|  Is alphanumeric  |  Returns `True` if a string only contains numbers and letters. Otherwise, returns `False`.  | 
|  Is alpha (letters)  |  Returns `True` if a string only contains letters. Otherwise, returns `False`.  | 
|  Is digit  |  Returns `True` if a string only contains digits. Otherwise, returns `False`.  | 
|  Is space  |  Returns `True` if a string only contains white space. Otherwise, returns `False`.  | 
|  Is title  |  Returns `True` if a string is in title case (each word begins with an uppercase letter). Otherwise, returns `False`.  | 
|  Is lowercase  |  Returns `True` if a string only contains lower case letters. Otherwise, returns `False`.  | 
|  Is uppercase  |  Returns `True` if a string only contains upper case letters. Otherwise, returns `False`.  | 
|  Is numeric  |  Returns `True` if a string only contains numbers. Otherwise, returns `False`.  | 
|  Is decimal  |  Returns `True` if a string only contains decimal numbers. Otherwise, returns `False`.  | 

## Unnest JSON Data


If you have a .csv file, you might have values in your dataset that are JSON strings. Similarly, you might have nested data in columns of either a Parquet file or a JSON document.

Use the **Flatten structured** operator to separate the first level keys into separate columns. A first level key is a key that isn't nested within a value.

For example, you might have a dataset that has a *person* column with demographic information on each person stored as JSON strings. A JSON string might look like the following.

```
 "{"seq": 1,"name": {"first": "Nathaniel","last": "Ferguson"},"age": 59,"city": "Posbotno","state": "WV"}"
```

The **Flatten structured** operator converts the following first level keys into additional columns in your dataset:
+ seq
+ name
+ age
+ city
+ state

Data Wrangler puts the values of the keys as values under the columns. The following shows the column names and values of the JSON.

```
seq, name,                                    age, city, state
1, {"first": "Nathaniel","last": "Ferguson"}, 59, Posbotno, WV
```

For each value in your dataset containing JSON, the **Flatten structured** operator creates columns for the first-level keys. To create columns for nested keys, call the operator again. For the preceding example, calling the operator again creates the following columns:
+ name_first
+ name_last

The following example shows the dataset that results from calling the operation again.

```
seq, name,                                    age, city, state, name_first, name_last
1, {"first": "Nathaniel","last": "Ferguson"}, 59, Posbotno, WV, Nathaniel, Ferguson
```

Choose **Keys to flatten on** to specify the first-level keys that you want to extract as separate columns. If you don't specify any keys, Data Wrangler extracts all the keys by default.
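If you want to reproduce this flattening outside of Canvas, the following pandas sketch shows the same first-level expansion; the use of `json_normalize` here is an assumption for illustration, not how Data Wrangler implements the operator.

```
import json
import pandas as pd

df = pd.DataFrame({"person": [
    '{"seq": 1, "name": {"first": "Nathaniel", "last": "Ferguson"}, '
    '"age": 59, "city": "Posbotno", "state": "WV"}'
]})

# Flatten only the first-level keys; "name" stays a nested object.
flattened = pd.json_normalize(list(df["person"].map(json.loads)), max_level=0)
result = pd.concat([df, flattened], axis=1)

# Expanding nested keys such as name.first takes a second pass
# (or a higher max_level).
```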

## Explode Array


Use **Explode array** to expand the values of an array into separate output rows. For example, the operation can take each value in the array [[1, 2, 3], [4, 5, 6], [7, 8, 9]] and create a new column with the following rows:

```
                [1, 2, 3]
                [4, 5, 6]
                [7, 8, 9]
```

Data Wrangler names the new column `input_column_name_flatten`.

You can call the **Explode array** operation multiple times to get the nested values of the array into separate output columns. The following example shows the result of calling the operation multiple times on a dataset with a nested array.

Putting the values of a nested array into separate columns


| id | array | id | array_items | id | array_items_items | 
| --- | --- | --- | --- | --- | --- | 
| 1 | [ [cat, dog], [bat, frog] ] | 1 | [cat, dog] | 1 | cat | 
| 2 | [ [rose, petunia], [lily, daisy] ] | 1 | [bat, frog] | 1 | dog | 
|  |  | 2 | [rose, petunia] | 1 | bat | 
|  |  | 2 | [lily, daisy] | 1 | frog | 
|  |  |  |  | 2 | rose | 
|  |  |  |  | 2 | petunia | 
|  |  |  |  | 2 | lily | 
|  |  |  |  | 2 | daisy | 
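The following pandas sketch reproduces the two explode calls from the preceding table; the column names match the table, and the renames are illustrative.

```
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "array": [[["cat", "dog"], ["bat", "frog"]],
              [["rose", "petunia"], ["lily", "daisy"]]],
})

# First call: one row per inner array; second call: one row per value.
step1 = df.explode("array").rename(columns={"array": "array_items"})
step2 = step1.explode("array_items").rename(columns={"array_items": "array_items_items"})
```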

## Transform Image Data


Use Data Wrangler to import and transform the images that you're using for your machine learning (ML) pipelines. After you've prepared your image data, you can export it from your Data Wrangler flow to your ML pipeline.

You can use the information provided here to familiarize yourself with importing and transforming image data in Data Wrangler. Data Wrangler uses OpenCV to import images. For more information about supported image formats, see [Image file reading and writing](https://docs.opencv.org/3.4/d4/da8/group__imgcodecs.html#ga288b8b3da0892bd651fce07b3bbd3a56).

After you've familiarized yourself with the concepts of transforming your image data, go through the following tutorial, [Prepare image data with Amazon SageMaker Data Wrangler](https://aws.amazon.com/blogs/machine-learning/prepare-image-data-with-amazon-sagemaker-data-wrangler/).

The following industries and use cases are examples where applying machine learning to transformed image data can be useful:
+ Manufacturing – Identifying defects in items from the assembly line
+ Food – Identifying spoiled or rotten food
+ Medicine – Identifying lesions in tissues

When you work with image data in Data Wrangler, you go through the following process:

1. Import – Select the images by choosing the directory containing them in your Amazon S3 bucket.

1. Transform – Use the built-in transformations to prepare the images for your machine learning pipeline.

1. Export – Export the images that you’ve transformed to a location that can be accessed from the pipeline.

Use the following procedure to import your image data.

**To import your image data**

1. Navigate to the **Create connection** page.

1. Choose **Amazon S3**.

1. Specify the Amazon S3 file path that contains the image data.

1. For **File type**, choose **Image**.

1. (Optional) Choose **Import nested directories** to import images from multiple Amazon S3 paths.

1. Choose **Import**.

Data Wrangler uses the open-source [imgaug](https://imgaug.readthedocs.io/en/latest/) library for its built-in image transformations. You can use the following built-in transformations:
+ **ResizeImage**
+ **EnhanceImage**
+ **CorruptImage**
+ **SplitImage**
+ **DropCorruptedImages**
+ **DropImageDuplicates**
+ **Brightness**
+ **ColorChannels**
+ **Grayscale**
+ **Rotate**

Use the following procedure to transform your images without writing code.

**To transform the image data without writing code**

1. From your Data Wrangler flow, choose the ellipsis icon next to the node representing the images that you've imported.

1. Choose **Add transform**.

1. Choose **Add step**.

1. Choose the transform and configure it.

1. Choose **Preview**.

1. Choose **Add**.

In addition to using the transformations that Data Wrangler provides, you can also use your own custom code snippets. For more information about using custom code snippets, see [Custom Transforms](#canvas-transform-custom). You can import the OpenCV and imgaug libraries within your code snippets and use the transforms associated with them. The following is an example of a code snippet that detects edges within the images.

```
# A table with your image data is stored in the `df` variable
import cv2
import numpy as np
from pyspark.sql.functions import column

from sagemaker_dataprep.compute.operators.transforms.image.constants import DEFAULT_IMAGE_COLUMN, IMAGE_COLUMN_TYPE
from sagemaker_dataprep.compute.operators.transforms.image.decorators import BasicImageOperationDecorator, PandasUDFOperationDecorator


@BasicImageOperationDecorator
def my_transform(image: np.ndarray) -> np.ndarray:
    # To use the code snippet on your image data, modify the following lines within the function
    HYST_THRLD_1, HYST_THRLD_2 = 100, 200
    edges = cv2.Canny(image, HYST_THRLD_1, HYST_THRLD_2)
    return edges
    

@PandasUDFOperationDecorator(IMAGE_COLUMN_TYPE)
def custom_image_udf(image_row):
    return my_transform(image_row)
    

df = df.withColumn(DEFAULT_IMAGE_COLUMN, custom_image_udf(column(DEFAULT_IMAGE_COLUMN)))
```

When you apply transformations in your Data Wrangler flow, Data Wrangler applies them only to a sample of the images in your dataset. To optimize your experience with the application, Data Wrangler doesn't apply the transforms to all of your images.

## Filter data


Use Data Wrangler to filter the data in your columns. When you filter the data in a column, you specify the following fields:
+ **Column name** – The name of the column that you're using to filter the data.
+ **Condition** – The type of filter that you're applying to values in the column.
+ **Value** – The value or category in the column to which you're applying the filter.

You can filter on the following conditions:
+ **=** – Returns values that match the value or category that you specify.
+ **!=** – Returns values that don't match the value or category that you specify.
+ **>=** – For **Long** or **Float** data, filters for values that are greater than or equal to the value that you specify.
+ **<=** – For **Long** or **Float** data, filters for values that are less than or equal to the value that you specify.
+ **>** – For **Long** or **Float** data, filters for values that are greater than the value that you specify.
+ **<** – For **Long** or **Float** data, filters for values that are less than the value that you specify.

For example, for a column that has the categories `male` and `female`, you can filter out all the `male` values. Because the column contains only `male` and `female` values, the filter then returns a column that has only `female` values.

You can also add multiple filters. The filters can be applied across multiple columns or the same column. For example, if you're creating a column that only has values within a certain range, you add two different filters. One filter specifies that the column must have values greater than the value that you provide. The other filter specifies that the column must have values less than the value that you provide.
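For example, the following PySpark sketch combines two filters to keep only values within a range; the column name and bounds are illustrative.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(5.0,), (50.0,), (500.0,)], ["price"])

# Keep rows where price is greater than 10 and less than 100.
filtered_df = df.filter((df["price"] > 10) & (df["price"] < 100))
```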

Use the following procedure to add the filter transform to your data.

**To filter your data**

1. From your Data Wrangler flow, choose the ellipsis icon next to the node with the data that you're filtering.

1. Choose **Add transform**.

1. Choose **Add step**.

1. Choose **Filter data**.

1. Specify the following fields:
   + **Column name** – The column that you're filtering.
   + **Condition** – The condition of the filter.
   + **Value** – The value or category in the column to which you're applying the filter.

1. (Optional) Choose the **+** following the filter that you've created to add another filter.

1. Configure the filter.

1. Choose **Preview**.

1. Choose **Add**.

# Chat for data prep


**Important**  
For administrators:  
Chat for data prep requires the `AmazonSageMakerCanvasAIServicesAccess` policy. For more information, see [AWS managed policy: AmazonSageMakerCanvasAIServicesAccess](security-iam-awsmanpol-canvas.md#security-iam-awsmanpol-AmazonSageMakerCanvasAIServicesAccess).
Chat for data prep requires access to Amazon Bedrock and the **Anthropic Claude** model within it. For more information, see [Add model access](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html#add-model-access).
You must run SageMaker Canvas data prep in the same AWS Region where you're running your model. Chat for data prep is available in the US East (N. Virginia), US West (Oregon), and Europe (Frankfurt) AWS Regions.

In addition to using the built-in transforms and analyses, you can use natural language to explore, visualize, and transform your data in a conversational interface. Within the conversational interface, you can use natural language queries to understand and prepare your data to build ML models.

The following are examples of some prompts that you can use:
+ Summarize my data
+ Drop column `example-column-name`
+ Replace missing values with median
+ Plot histogram of prices
+ What is the most expensive item sold?
+ How many distinct items were sold?
+ Sort data by region

When you’re transforming your data using prompts, you can view a preview that shows how the data is being transformed. Based on what you see in the preview, you can choose to add the transformation as a step in your Data Wrangler flow.

The responses to your prompts generate code for your transformations and analyses. You can modify the code to update the output from the prompt. For example, you can modify the code for an analysis to change the values of the axes of a graph.

Use the following procedure to start chatting with your data:

**To chat with your data**

1. Open the SageMaker Canvas data flow.

1. Choose the speech bubble.  
![\[Chat for data prep is at the top of the screen\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/chat-for-data-prep-welcome-step.png)

1. Specify a prompt.

1. (Optional) If an analysis has been generated by your query, choose **Add to analyses** to reference it for later.  
![\[The view of an editable and copyable code block.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/encanto-query-for-visualization.png)

1. (Optional) If you've transformed your data using a prompt, do the following.

   1. Choose **Preview** to view the results.

   1. (Optional) Modify the code in the transform and choose **Update**.

   1. (Optional) If you're happy with the results of the transform, choose **Add to steps** to add it to the steps panel on the right-hand navigation.  
![\[Added to steps shows confirmation that the transform has been added to the flow.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/transform-added-to-steps-panel.png)

After you’ve prepared your data using natural language, you can create a model using your transformed data. For more information about creating a model, see [How custom models work](canvas-build-model.md).

# How data processing works in Data Wrangler
Data processing

While working with data interactively in an Amazon SageMaker Data Wrangler data flow, Amazon SageMaker Canvas only applies the transformations to a sample dataset for you to preview. After finishing your data flow in SageMaker Canvas, you can process all of your data and save it in a location that is suitable for your machine learning workflows.

There are several options for how to proceed after you've finished transforming your data in Data Wrangler:
+ [Create a model](canvas-processing-export-model.md). You can create a Canvas model, where you directly start creating a model with your prepared data. You can create a model either after processing your entire dataset, or by exporting just the sample data you worked with in Data Wrangler. Canvas saves your processed data (either the entire dataset or the sample data) as a Canvas dataset.

  We recommend that you use your sample data for quick iterations, but that you use your entire data when you want to train your final model. When building tabular models, datasets larger than 5 GB are automatically downsampled to 5 GB, and for time series forecasting models, datasets larger than 30 GB are downsampled to 30 GB.

  To learn more about creating a model, see [How custom models work](canvas-build-model.md).
+ [Export the data](canvas-export-data.md). You can export your data for use in machine learning workflows. When you choose to export your data, you have several options:
  + You can save your data in the Canvas application as a dataset. For more information about the supported file types for Canvas datasets and additional requirements when importing data into Canvas, see [Create a dataset](canvas-import-dataset.md).
  + You can save your data to Amazon S3. Depending on the Canvas memory availability, your data is processed in the application and then exported to Amazon S3. If the size of your dataset exceeds what Canvas can process, then by default, Canvas uses an EMR Serverless job to scale to multiple compute instances, process your full dataset, and export it to Amazon S3. You can also manually configure a SageMaker Processing job to have more granular control over the compute resources used to process your data.
+ [Export a data flow](canvas-export-data-flow.md). You might want to save the code for your data flow so that you can modify or run your transformations outside of Canvas. Canvas provides you with the option to save your data flow transformations as Python code in a Jupyter notebook, which you can then export to Amazon S3 for use elsewhere in your machine learning workflows.

When you export your data from a data flow and save it either as a Canvas dataset or to Amazon S3, Canvas creates a new destination node in your data flow, which is a final node that shows you where your processed data is stored. You can add additional destination nodes to your flow if you'd like to perform multiple export operations. For example, you can export the data from different points in your data flow to only apply some of the transformations, or you can export transformed data to different Amazon S3 locations. For more information about how to add or edit a destination node, see [Add destination nodes](canvas-destination-nodes-add.md) and [Edit a destination node](canvas-destination-nodes-edit.md).

For more information about setting up a schedule with Amazon EventBridge to automatically process and export your data on a schedule, see [Create a schedule to automatically process new data](canvas-data-export-schedule-job.md).

# Export to create a model


In just a few clicks from your data flow, you can export your transformed data and start creating an ML model in Canvas. Canvas saves your data as a Canvas dataset, and you're taken to the model build configuration page for a new model.

To create a Canvas model with your transformed data:

1. Navigate to your data flow.

1. Choose the ellipsis icon next to the node that you're exporting.

1. From the context menu, choose **Create model**.

1. In the **Export to create a model** side panel, enter a **Dataset name** for the new dataset.

1. Leave the **Process entire dataset** option selected to process and export your entire dataset before proceeding with building a model. Turn this option off to train your model using the interactive sample data you are working with in your data flow.

1. Enter a **Model name** to name the new model.

1. Select a **Problem type**, or the type of model that you want to build. For more information about the supported model types in SageMaker Canvas, see [How custom models work](canvas-build-model.md).

1. Select the **Target column**, or the value that you want the model to predict.

1. Choose **Export and create model**.

The **Build** tab for a new Canvas model should open, and you can finish configuring and training your model. For more information about how to build a model, see [Build a model](canvas-build-model-how-to.md).

# Export data


Export data to apply the transforms from your data flow to the full imported dataset. You can export any node in your data flow to the following locations:
+ SageMaker Canvas dataset
+ Amazon S3

If you want to train models in Canvas, you can export your full, transformed dataset as a Canvas dataset. If you want to use your transformed data in machine learning workflows external to SageMaker Canvas, you can export your dataset to Amazon S3.

## Export to a Canvas dataset


Use the following procedure to export a SageMaker Canvas dataset from a node in your data flow.

**To export a node in your flow as a SageMaker Canvas dataset**

1. Navigate to your data flow.

1. Choose the ellipsis icon next to the node that you're exporting.

1. In the context menu, hover over **Export**, and then select **Export data to Canvas dataset**.

1. In the **Export to Canvas dataset** side panel, enter a **Dataset name** for the new dataset.

1. Leave the **Process entire dataset** option selected if you want SageMaker Canvas to process and save your full dataset. Turn this option off to only apply the transforms to the sample data you are working with in your data flow.

1. Choose **Export**.

You should now be able to go to the **Datasets** page of the Canvas application and see your new dataset.

## Export to Amazon S3


When exporting your data to Amazon S3, you can scale to transform and process data of any size. Canvas automatically processes your data locally if the application's memory can handle the size of your dataset. If your dataset size exceeds the local memory capacity of 5 GB, then Canvas initiates a remote job on your behalf to provision additional compute resources and process the data more quickly. By default, Canvas uses Amazon EMR Serverless to run these remote jobs. However, you can manually configure Canvas to use either EMR Serverless or a SageMaker Processing job with your own settings.

**Note**  
When running an EMR Serverless job, by default the job inherits the IAM role, KMS key settings, and tags of your Canvas application.

The following summarizes the options for remote jobs in Canvas:
+ **EMR Serverless**: This is the default option that Canvas uses for remote jobs. EMR Serverless automatically provisions and scales compute resources to process your data so that you don't have to worry about choosing the right compute resources for your workload. For more information about EMR Serverless, see the [EMR Serverless User Guide](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html).
+ **SageMaker Processing**: SageMaker Processing jobs offer more advanced options and granular control over the compute resources used to process your data. For example, you can specify the type and count of the compute instances, configure the job in your own VPC and control network access, automate processing jobs, and more. For more information about automating processing jobs see [Create a schedule to automatically process new data](canvas-data-export-schedule-job.md). For more general information about SageMaker Processing jobs, see [Data transformation workloads with SageMaker Processing](processing-job.md).

The following file types are supported when exporting to Amazon S3:
+ CSV
+ Parquet

To get started, review the following prerequisites.

### Prerequisites for EMR Serverless jobs


To create a remote job that uses EMR Serverless resources, you must have the necessary permissions. You can grant permissions either through the Amazon SageMaker AI domain or user profile settings, or you can manually configure your user's AWS IAM role. For instructions on how to grant users permissions to perform large data processing, see [Grant Users Permissions to Use Large Data across the ML Lifecycle](canvas-large-data-permissions.md).

If you don't want to configure these policies but still need to process large datasets through Data Wrangler, you can alternatively use a SageMaker Processing job.

Use the following procedures to export your data to Amazon S3. To configure a remote job, follow the optional advanced steps.

**To export a node in your flow to Amazon S3**

1. Navigate to your data flow.

1. Choose the ellipsis icon next to the node that you're exporting.

1. In the context menu, hover over **Export**, and then select **Export data to Amazon S3**.

1. In the **Export to Amazon S3** side panel, you can change the **Dataset name** for the new dataset.

1. For the **S3 location**, enter the Amazon S3 location to which you want to export the dataset. You can enter the S3 URI, alias, or ARN of the S3 location or S3 access point. For more information about access points, see [Managing data access with Amazon S3 access points](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points.html) in the *Amazon S3 User Guide*.

1. (Optional) For the **Advanced settings**, specify values for the following fields:

   1. **File type** – The file format of your exported data.

   1. **Delimiter** – The delimiter used to separate values in the file.

   1. **Compression** – The compression method used to reduce the file size.

   1. **Number of partitions** – The number of dataset files that Canvas writes as the output of the job.

   1. **Choose columns** – You can choose a subset of columns from the data to include in the partitions.

1. Leave the **Process entire dataset** option selected if you want Canvas to apply your data flow transforms to your entire dataset and export the result. If you deselect this option, Canvas only applies the transforms to the sample of your dataset used in the interactive Data Wrangler data flow.
**Note**  
If you only export a sample of your data, Canvas processes your data in the application and doesn't create a remote job for you.

1. Leave the **Auto job configuration** option selected if you want Canvas to automatically determine whether to run the job using Canvas application memory or an EMR Serverless job. If you deselect this option and manually configure your job, then you can choose to use either an EMR Serverless or a SageMaker Processing job. For instructions on how to configure an EMR Serverless or a SageMaker Processing job, see the section after this procedure before you export your data.

1. Choose **Export**.

The following procedures show how to manually configure the remote job settings for either EMR Serverless or SageMaker Processing when exporting your full dataset to Amazon S3.

------
#### [ EMR Serverless ]

To configure an EMR Serverless job while exporting to Amazon S3, do the following:

1. In the Export to Amazon S3 side panel, turn off the **Auto job configuration** option.

1. Select **EMR Serverless**.

1. For **Job name**, enter a name for your EMR Serverless job. The name can contain letters, numbers, hyphens, and underscores.

1. For **IAM role**, enter the user's IAM execution role. This role should have the required permissions to run EMR Serverless applications. For more information, see [Grant Users Permissions to Use Large Data across the ML Lifecycle](canvas-large-data-permissions.md).

1. (Optional) For **KMS key**, specify the key ID or ARN of an AWS KMS key to encrypt the job logs. If you don't enter a key, Canvas uses a default key for EMR Serverless.

1. (Optional) For **Monitoring configuration**, enter the name of an Amazon CloudWatch Logs log group to which you want to publish your logs.

1. (Optional) For **Tags**, add metadata tags to the EMR Serverless job consisting of key-value pairs. These tags can be used to categorize and search for jobs.

1. Choose **Export** to start the job.

------
#### [ SageMaker Processing ]

To configure a SageMaker Processing job while exporting to Amazon S3, do the following:

1. In the **Export to Amazon S3** side panel, turn off the **Auto job configuration** option.

1. Select **SageMaker Processing**.

1. For **Job name**, enter a name for your SageMaker AI Processing job.

1. For **Instance type**, select the type of compute instance to run the processing job.

1. For **Instance count**, specify the number of compute instances to launch.

1. For **IAM role**, enter the user's IAM execution role. This role should have the required permissions for SageMaker AI to create and run processing jobs on your behalf. These permissions are granted if you have the [AmazonSageMakerFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html) policy attached to your IAM role.

1. For **Volume size**, enter the storage size in GB for the ML storage volume that is attached to each processing instance. Choose the size based on your expected input and output data size.

1. (Optional) For **Volume KMS key**, specify a KMS key to encrypt the storage volume. If you don't specify a key, the default Amazon EBS encryption key is used.

1. (Optional) For **KMS key**, specify a KMS key to encrypt input and output Amazon S3 data sources used by the processing job.

1. (Optional) For **Spark memory configuration**, do the following:

   1. Enter **Driver memory in MB** for the Spark driver node that handles job coordination and scheduling.

   1. Enter **Executor memory in MB** for the Spark executor nodes that run individual tasks in the job.

1. (Optional) For **Network configuration**, do the following:

   1. For **Subnet configuration**, enter the IDs of the VPC subnets for the processing instances to be launched in. By default, the job uses the settings of your default VPC.

   1. For **Security group configuration**, enter the IDs of the security groups to control inbound and outbound connectivity rules.

   1. Turn on the **Enable inter-container traffic encryption** option to encrypt network communication between processing containers during the job.

1. (Optional) For **Associate schedules**, you can choose to create an Amazon EventBridge schedule to run the processing job on recurring intervals. Choose **Create new schedule** and fill out the dialog box. For more information about filling out this section and running processing jobs on a schedule, see [Create a schedule to automatically process new data](canvas-data-export-schedule-job.md).

1. (Optional) Add **Tags** as key-value pairs so that you can categorize and search for processing jobs.

1. Choose **Export** to start the processing job.

------

After exporting your data, you should find the fully processed dataset in the specified Amazon S3 location.
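
If you want to verify the export programmatically, the following is a minimal sketch that lists the exported files with the AWS SDK for Python (Boto3). The bucket name and prefix are placeholders for your export location.

```
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="amzn-s3-demo-bucket",          # placeholder: your export bucket
    Prefix="canvas/exports/my-dataset/",   # placeholder: your export prefix
)

# Print the key and size of each exported file.
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```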

# Export a data flow


Exporting your data flow translates the operations that you've made in Data Wrangler into a Jupyter notebook of Python code that you can modify and run. This can be helpful for integrating the code for your data transformations into your machine learning pipelines.

You can choose any data node in your data flow and export it. Exporting the data node exports the transformation that the node represents and the transformations that precede it.

**To export a data flow as a Jupyter notebook**

1. Navigate to your data flow.

1. Choose the ellipsis icon next to the node that you want to export.

1. In the context menu, hover over **Export**, and then hover over **Export via Jupyter notebook**.

1. Choose one of the following:
   + **SageMaker Pipelines**
   + **Amazon S3**
   + **SageMaker AI Inference Pipeline**
   + **SageMaker AI Feature Store**
   + **Python Code**

1. The **Export data flow as notebook** dialog box opens. Select one of the following:
   + **Download a local copy**
   + **Export to S3 location**

1. If you selected **Export to S3 location**, enter the Amazon S3 location to which you want to export the notebook.

1. Choose **Export**.

Your Jupyter notebook should either download to your local machine, or you can find it saved in the Amazon S3 location you specified.

# Add destination nodes


A destination node in SageMaker Canvas specifies where to store your processed and transformed data. When you choose to export your transformed data to Amazon S3, Canvas uses the specified destination node location, applying all the transformations you've configured in your data flow. For more information about export jobs to Amazon S3, see the preceding section [Export to Amazon S3](canvas-export-data.md#canvas-export-data-s3).

By default, choosing to export your data to Amazon S3 adds a destination node to your data flow. However, you can add multiple destination nodes to your flow, allowing you to simultaneously export different sets of transformations or variations of your data to different Amazon S3 locations. For example, you can create one destination node that exports the data after applying all transformations, and another destination node that exports the data after only certain initial transformations, such as a join operation. This flexibility enables you to export and store different versions or subsets of your transformed data in separate S3 locations for various use cases.

Use the following procedure to add a destination node to your data flow.

**To add a destination node**

1. Navigate to your data flow.

1. Choose the ellipsis icon next to the node where you want to place the destination node.

1. In the context menu, hover over **Export**, and then select **Add destination**.

1. In the **Export destination** side panel, enter a **Dataset name** to name the output.

1. For **Amazon S3 location**, enter the Amazon S3 location to which you want to export the output. You can enter the S3 URI, alias, or ARN of the S3 location or S3 access point. For more information about access points, see [Managing data access with Amazon S3 access points](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points.html) in the *Amazon S3 User Guide*.

1. For **Export settings**, specify the following fields:

   1. **File type** – The file format of the exported data.

   1. **Delimiter** – The delimiter used to separate values in the file.

   1. **Compression** – The compression method used to reduce the file size.

1. For **Partitioning**, specify the following fields:

   1. **Number of partitions** – The number of dataset files that SageMaker Canvas writes as the output of the job.

   1. **Choose columns** – You can choose a subset of columns from the data to include in the partitions.

1. Choose **Add** to add the destination node to your data flow, or choose **Add** and then **Export** to add the node and initiate an export job.

You should now see a new destination node in your flow.

# Edit a destination node


A *destination node* in an Amazon SageMaker Canvas data flow specifies the Amazon S3 location where your processed and transformed data is stored, applying all the configured transformations in your data flow. You can edit the configuration of an existing destination node and then choose to re-run the job to overwrite the data in the specified Amazon S3 location. For more information about adding a new destination node, see [Add destination nodes](canvas-destination-nodes-add.md).

Use the following procedure to edit a destination node in your data flow and initiate an export job.

**To edit a destination node**

1. Navigate to your data flow.

1. Choose the ellipsis icon next to the destination node that you want to edit.

1. In the context menu, choose **Edit**.

1. The **Edit destination** side panel opens. From this panel, you can edit details such as the dataset name, the Amazon S3 location, and the export and partitioning settings.

1. (Optional) In **Additional nodes to export**, you can select more destination nodes to process when you run the export job.

1. Leave the **Process entire dataset** option selected if you want Canvas to apply your data flow transforms to your entire dataset and export the result. If you deselect this option, Canvas only applies the transforms to the sample of your dataset used in the interactive Data Wrangler data flow.

1. Leave the **Auto job configuration** option selected if you want Canvas to automatically determine whether to run the job using Canvas application memory or an EMR Serverless job. If you deselect this option and manually configure your job, then you can choose to use either an EMR Serverless or a SageMaker Processing job. For instructions on how to configure an EMR Serverless or a SageMaker Processing job, see the preceding section [Export to Amazon S3](canvas-export-data.md#canvas-export-data-s3).

1. When you're done making changes, choose **Update**.

Saving changes to your destination node configuration doesn't automatically re-run a job or overwrite data that has already been processed and exported. Export your data again to run a job with the new configuration. If you decide to export your data again with a job, Canvas uses the updated destination node configuration to transform and output the data to the specified location, overwriting any existing data.

# Create a schedule to automatically process new data


**Note**  
The following section only applies to SageMaker Processing jobs. If you used the default Canvas settings or EMR Serverless to create a remote job to apply transforms to your full dataset, this section doesn’t apply.

If you're processing data periodically, you can create a schedule to run the processing job automatically. For example, you can create a schedule that runs a processing job automatically when you get new data. For more information about processing jobs, see [Export to Amazon S3](canvas-export-data.md#canvas-export-data-s3).

When you create a job, you must specify an IAM role that has permissions to create the job. You can use the [AmazonSageMakerCanvasDataPrepFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasDataPrepFullAccess.html) policy to add permissions.

Add the following trust policy to the role to allow EventBridge to assume it.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "events.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```
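
The following is a hedged sketch of applying that trust policy with the AWS SDK for Python (Boto3), assuming you've saved the preceding document as a complete policy file. The role name is a placeholder, and note that `update_assume_role_policy` replaces the role's entire trust policy, so merge in any statements the role already needs.

```
import boto3

iam = boto3.client("iam")

# Read the trust policy document shown above from a local file.
with open("eventbridge-trust-policy.json") as f:
    trust_policy = f.read()

# Replaces the role's entire trust policy with the document above.
iam.update_assume_role_policy(
    RoleName="MyCanvasProcessingRole",  # placeholder: your job's IAM role
    PolicyDocument=trust_policy,
)
```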

**Important**  
When you create a schedule, Data Wrangler creates an `eventRule` in EventBridge. You incur charges for both the event rules that you create and the instances used to run the processing job.  
For information about EventBridge pricing, see [Amazon EventBridge pricing](https://aws.amazon.com/eventbridge/pricing/). For information about processing job pricing, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

You can set a schedule using one of the following methods (example expressions are shown after this list):
+ [CRON expressions](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-rule-schedule.html)
**Note**  
Data Wrangler doesn't support the following expressions:  
LW#
Abbreviations for days
Abbreviations for months
+ [RATE expressions](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-rule-schedule.html#eb-rate-expressions)
+ Recurring – Set an hourly or daily interval to run the job.
+ Specific time – Set specific days and times to run the job.
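
When you choose one of these methods, Canvas creates the EventBridge rule for you. For illustration only, the following is a minimal sketch of what the two expression styles look like when passed to EventBridge through the AWS SDK for Python (Boto3); the rule names are placeholders.

```
import boto3

events = boto3.client("events")

# RATE expression: run once every day.
events.put_rule(Name="demo-rate-rule", ScheduleExpression="rate(1 day)")

# CRON expression: run at 12:00 UTC every day.
events.put_rule(Name="demo-cron-rule", ScheduleExpression="cron(0 12 * * ? *)")
```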

The following sections provide procedures on scheduling jobs when filling out the SageMaker AI Processing job settings while [exporting your data to Amazon S3](canvas-export-data.md#canvas-export-data-s3). All of the following instructions begin in the **Associate schedules** section of the SageMaker Processing job settings.

------
#### [ CRON ]

Use the following procedure to create a schedule with a CRON expression.

1. In the **Export to Amazon S3** side panel, make sure you've turned off the **Auto job configuration** toggle and have the **SageMaker Processing** option selected.

1. In the **SageMaker Processing** job settings, open the **Associate schedules** section and choose **Create new schedule**.

1. The **Create new schedule** dialog box opens. For **Schedule Name**, specify the name of the schedule.

1. For **Run Frequency**, choose **CRON**.

1. For each of the **Minutes**, **Hours**, **Days of month**, **Month**, and **Day of week** fields, enter valid CRON expression values.

1. Choose **Create**.

1. (Optional) Choose **Add another schedule** to run the job on an additional schedule.
**Note**  
You can associate a maximum of two schedules. The schedules are independent and don't affect each other unless the times overlap.

1. Choose one of the following:
   + **Schedule and run now** – The job runs immediately and subsequently runs on the schedules.
   + **Schedule only** – The job only runs on the schedules that you specify.

1. Choose **Export** after you've filled out the rest of the export job settings.

------
#### [ RATE ]

Use the following procedure to create a schedule with a RATE expression.

1. In the **Export to Amazon S3** side panel, make sure you've turned off the **Auto job configuration** toggle and have the **SageMaker Processing** option selected.

1. In the **SageMaker Processing** job settings, open the **Associate schedules** section and choose **Create new schedule**.

1. The **Create new schedule** dialog box opens. For **Schedule Name**, specify the name of the schedule.

1. For **Run Frequency**, choose **Rate**.

1. For **Value**, specify an integer.

1. For **Unit**, select one of the following:
   + **Minutes**
   + **Hours**
   + **Days**

1. Choose **Create**.

1. (Optional) Choose **Add another schedule** to run the job on an additional schedule.
**Note**  
You can associate a maximum of two schedules. The schedules are independent and don't affect each other unless the times overlap.

1. Choose one of the following:
   + **Schedule and run now** – The job runs immediately and subsequently runs on the schedules.
   + **Schedule only** – The job only runs on the schedules that you specify.

1. Choose **Export** after you've filled out the rest of the export job settings.

------
#### [ Recurring ]

Use the following procedure to create a schedule that runs a job on a recurring basis.

1. In the **Export to Amazon S3** side panel, make sure you've turned off the **Auto job configuration** toggle and have the **SageMaker Processing** option selected.

1. In the **SageMaker Processing** job settings, open the **Associate schedules** section and choose **Create new schedule**.

1. The **Create new schedule** dialog box opens. For **Schedule Name**, specify the name of the schedule.

1. For **Run Frequency**, choose **Recurring**.

1. For **Every x hours**, specify the hourly frequency that the job runs during the day. Valid values are integers in the inclusive range of **1** and **23**.

1. For **On days**, select one of the following options:
   + **Every Day**
   + **Weekends**
   + **Weekdays**
   + **Select Days**

   1. (Optional) If you've selected **Select Days**, choose the days of the week to run the job.
**Note**  
The schedule resets every day. If you schedule a job to run every five hours, it runs at the following times during the day:  
00:00
05:00
10:00
15:00
20:00

1. Choose **Create**.

1. (Optional) Choose **Add another schedule** to run the job on an additional schedule.
**Note**  
You can associate a maximum of two schedules. The schedules are independent and don't affect each other unless the times overlap.

1. Choose one of the following:
   + **Schedule and run now** – The job runs immediately and subsequently runs on the schedules.
   + **Schedule only** – The job only runs on the schedules that you specify.

1. Choose **Export** after you've filled out the rest of the export job settings.

------
#### [ Specific time ]

Use the following procedure to create a schedule that runs a job at specific times.

1. In the **Export to Amazon S3** side panel, make sure you've turned off the **Auto job configuration** toggle and have the **SageMaker Processing** option selected.

1. In the **SageMaker Processing** job settings, open the **Associate schedules** section and choose **Create new schedule**.

1. The **Create new schedule** dialog box opens. For **Schedule Name**, specify the name of the schedule.

1. For **Run Frequency**, choose **Start time**.

1. For **Start time**, enter a time in 24-hour format (for example, **09:00**). The start time uses the time zone where you are located.

1. For **On days**, select one of the following options:
   + **Every Day**
   + **Weekends**
   + **Weekdays**
   + **Select Days**

   1. (Optional) If you've selected **Select Days**, choose the days of the week to run the job.

1. Choose **Create**.

1. (Optional) Choose **Add another schedule** to run the job on an additional schedule.
**Note**  
You can associate a maximum of two schedules. The schedules are independent and don't affect each other unless the times overlap.

1. Choose one of the following:
   + **Schedule and run now** – The job runs immediately and subsequently runs on the schedules.
   + **Schedule only** – The job only runs on the schedules that you specify.

1. Choose **Export** after you've filled out the rest of the export job settings.

------

You can use the SageMaker AI console to view the jobs that are scheduled to run. Your processing jobs run within Pipelines: each processing job has its own pipeline and runs as a processing step within it. You can view the schedules that you've created within a pipeline. For information about viewing a pipeline, see [View the details of a pipeline](pipelines-studio-list.md).

To view the jobs that you've scheduled, do the following.

1. Open Amazon SageMaker Studio Classic.

1. Open **Pipelines**.

1. View the pipelines for the jobs that you've created.

   The name of the pipeline is the job name prefixed with `canvas-data-prep`. For example, if you've created a job named `housing-data-feature-engineering`, the name of the pipeline is `canvas-data-prep-housing-data-feature-engineering`.

1. Choose the pipeline containing your job.

1. View the status of the pipelines. Pipelines with a **Status** of **Succeeded** have run the processing job successfully.
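
You can also list the Canvas-created pipelines and their run statuses programmatically. The following is a minimal sketch using the AWS SDK for Python (Boto3); the `canvas-data-prep` prefix reflects the naming convention described in this procedure.

```
import boto3

sm = boto3.client("sagemaker")

# Canvas names these pipelines canvas-data-prep-<job name>.
pipelines = sm.list_pipelines(PipelineNamePrefix="canvas-data-prep")
for summary in pipelines["PipelineSummaries"]:
    name = summary["PipelineName"]
    executions = sm.list_pipeline_executions(PipelineName=name)
    for execution in executions["PipelineExecutionSummaries"]:
        print(name, execution["PipelineExecutionStatus"])
```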

To stop a processing job from running, delete the event rule that specifies the schedule. Deleting an event rule stops all the jobs associated with the schedule from running. For information about deleting a rule, see [Disabling or deleting an Amazon EventBridge rule](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-delete-rule.html).

You can stop and delete the pipelines associated with the schedules as well. For information about stopping a pipeline, see [StopPipelineExecution](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_StopPipelineExecution.html). For information about deleting a pipeline, see [DeletePipeline](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeletePipeline.html#API_DeletePipeline_RequestSyntax).
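
The following is a hedged sketch of those cleanup calls with the AWS SDK for Python (Boto3). The rule name, target ID, execution ARN, and pipeline name are all placeholders that you must replace with your own values.

```
import boto3

events = boto3.client("events")
sm = boto3.client("sagemaker")

# A rule's targets must be removed before the rule itself can be deleted.
events.remove_targets(Rule="my-schedule-rule", Ids=["1"])  # placeholders
events.delete_rule(Name="my-schedule-rule")

# Stop an in-progress run, then delete the pipeline if you no longer need it.
sm.stop_pipeline_execution(
    PipelineExecutionArn="arn:aws:sagemaker:us-east-1:111122223333:pipeline/canvas-data-prep-my-job/execution/example"  # placeholder
)
sm.delete_pipeline(PipelineName="canvas-data-prep-my-job")  # placeholder
```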

# Automate data preparation in SageMaker Canvas


After you transform your data in your data flow, you can export the transforms to your machine learning workflows. When you export your transforms, SageMaker Canvas creates a Jupyter notebook that you must run within Amazon SageMaker Studio Classic. For information about getting started with Studio Classic, contact your administrator.

## Automate data preparation using Pipelines


When you want to build and deploy large-scale machine learning (ML) workflows, you can use Pipelines to create workflows that manage and deploy SageMaker AI jobs. With Pipelines, you can build workflows that manage your SageMaker AI data preparation, model training, and model deployment jobs, and you can use the first-party algorithms that SageMaker AI offers within those workflows. For more information on Pipelines, see [SageMaker Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html).

When you export one or more steps from your data flow to Pipelines, Data Wrangler creates a Jupyter notebook that you can use to define, instantiate, run, and manage a pipeline.

### Use a Jupyter notebook to create a pipeline


Use the following procedure to generate a Jupyter notebook and run it to export your Data Wrangler flow to Pipelines.

1. Choose the ellipsis icon next to the node that you want to export.

1. Choose **Export data flow**.

1. Choose **Pipelines (via Jupyter Notebook)**.

1. Download the Jupyter notebook or copy it to an Amazon S3 location. We recommend copying it to an Amazon S3 location that you can access within Studio Classic. Contact your administrator if you need guidance on a suitable location.

1. Run the Jupyter notebook.

You can use the Jupyter notebook that Data Wrangler produces to define a pipeline. The pipeline includes the data processing steps that are defined by your Data Wrangler flow. 

You can add additional steps to your pipeline by adding steps to the `steps` list in the following code in the notebook:

```
# Pipeline comes from sagemaker.workflow.pipeline and is imported
# earlier in the generated notebook.
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[instance_type, instance_count],
    steps=[step_process],  # Add more steps to this list to run in your Pipeline
)
```

For more information on defining pipelines, see [Define SageMaker AI Pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/define-pipeline.html).
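
After the pipeline is defined, the generated notebook typically registers it and starts a run. The following is a minimal sketch of those calls from the SageMaker Python SDK; the role ARN is a placeholder.

```
# Register (or update) the pipeline definition, then start a run.
pipeline.upsert(role_arn="arn:aws:iam::111122223333:role/MySageMakerRole")  # placeholder
execution = pipeline.start()
execution.wait()  # optional: block until the run completes
```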

## Automate data preparation using an inference endpoint


Use your Data Wrangler flow to process data at the time of inference by creating a SageMaker AI serial inference pipeline from your Data Wrangler flow. An inference pipeline is a series of steps that results in a trained model making predictions on new data. A serial inference pipeline within Data Wrangler transforms the raw data and provides it to the machine learning model for a prediction. You create, run, and manage the inference pipeline from a Jupyter notebook within Studio Classic. For more information about accessing the notebook, see [Use a Jupyter notebook to create an inference endpoint](#canvas-inference-notebook).

Within the notebook, you can either train a machine learning model or specify one that you've already trained. You can use either Amazon SageMaker Autopilot or XGBoost to train the model using the data that you've transformed in your Data Wrangler flow.

The pipeline provides the ability to perform either batch or real-time inference. You can also add the Data Wrangler flow to SageMaker Model Registry. For more information about hosting models, see [Multi-model endpoints](multi-model-endpoints.md).

**Important**  
You can't export your Data Wrangler flow to an inference endpoint if it has the following transformations:  
Join
Concatenate
Group by
If you must use the preceding transforms to prepare your data, use the following procedure.  
Create a Data Wrangler flow.
Apply the preceding transforms that aren't supported.
Export the data to an Amazon S3 bucket.
Create a separate Data Wrangler flow.
Import the data that you've exported from the preceding flow.
Apply the remaining transforms.
Create a serial inference pipeline using the Jupyter notebook that we provide.
For information about exporting your data to an Amazon S3 bucket, see [Export data](canvas-export-data.md). For information about opening the Jupyter notebook used to create the serial inference pipeline, see [Use a Jupyter notebook to create an inference endpoint](#canvas-inference-notebook).

Data Wrangler ignores transforms that remove data at the time of inference. For example, Data Wrangler ignores the [Handle Missing Values](canvas-transform.md#canvas-transform-handle-missing) transform if you use the **Drop missing** configuration.

If you've refit transforms to your entire dataset, the transforms carry over to your inference pipeline. For example, if you used the median value to impute missing values, the median value from refitting the transform is applied to your inference requests. You can refit the transforms from your Data Wrangler flow either when you're using the Jupyter notebook or when you're exporting your data to an inference pipeline.

The serial inference pipeline supports the following data types for the input and output strings. Each data type has a set of requirements, and a sketch of sending each format to an endpoint follows this list.

**Supported datatypes**
+ `text/csv` – the datatype for CSV strings
  + The string can't have a header.
  + Features used for the inference pipeline must be in the same order as features in the training dataset.
  + There must be a comma delimiter between features.
  + Records must be delimited by a newline character.

  The following is an example of a validly formatted CSV string that you can provide in an inference request.

  ```
  abc,0.0,"Doe, John",12345\ndef,1.1,"Doe, Jane",67890                    
  ```
+ `application/json` – the datatype for JSON strings
  + The features used in the dataset for the inference pipeline must be in the same order as the features in the training dataset.
  + The data must have a specific schema. You define the schema as a single `instances` object that has a set of `features`. Each `features` object represents an observation.

  The following is an example of a validly formatted JSON string that you can provide in an inference request.

  ```
  {
      "instances": [
          {
              "features": ["abc", 0.0, "Doe, John", 12345]
          },
          {
              "features": ["def", 1.1, "Doe, Jane", 67890]
          }
      ]
  }
  ```
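
After the serial inference pipeline is deployed to an endpoint, you can send it either payload format. The following is a minimal sketch using the AWS SDK for Python (Boto3); the endpoint name is a placeholder.

```
import json
import boto3

runtime = boto3.client("sagemaker-runtime")
endpoint = "my-inference-pipeline-endpoint"  # placeholder

# CSV payload: no header, comma-delimited features, newline-delimited records.
csv_body = 'abc,0.0,"Doe, John",12345\ndef,1.1,"Doe, Jane",67890'
response = runtime.invoke_endpoint(
    EndpointName=endpoint, ContentType="text/csv", Body=csv_body
)
print(response["Body"].read())

# JSON payload: one "instances" object; each "features" list is an observation.
json_body = json.dumps({
    "instances": [
        {"features": ["abc", 0.0, "Doe, John", 12345]},
        {"features": ["def", 1.1, "Doe, Jane", 67890]},
    ]
})
response = runtime.invoke_endpoint(
    EndpointName=endpoint, ContentType="application/json", Body=json_body
)
print(response["Body"].read())
```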

### Use a Jupyter notebook to create an inference endpoint


Use the following procedure to export your Data Wrangler flow and create an inference pipeline with a Jupyter notebook.

1. Choose the ellipsis icon next to the node that you want to export.

1. Choose **Export data flow**.

1. Choose **SageMaker AI Inference Pipeline (via Jupyter Notebook)**.

1. Download the Jupyter notebook or copy it to an Amazon S3 location. We recommend copying it to an Amazon S3 location that you can access within Studio Classic. Contact your administrator if you need guidance on a suitable location.

1. Run the Jupyter notebook.

When you run the Jupyter notebook, it creates an inference flow artifact. An inference flow artifact is a Data Wrangler flow file with additional metadata used to create the serial inference pipeline. The node that you're exporting encompasses all of the transforms from the preceding nodes.

**Important**  
Data Wrangler needs the inference flow artifact to run the inference pipeline. You can't use your own flow file as the artifact. You must create it by using the preceding procedure.

## Automate data preparation using Python Code


Use the following procedure to generate and run a Jupyter notebook that exports all of the steps in your data flow to a Python file, which you can manually integrate into any data processing workflow.

1. Choose the ellipsis icon next to the node that you want to export.

1. Choose **Export data flow**.

1. Choose **Python Code**.

1. Download the Jupyter notebook or copy it to an Amazon S3 location. We recommend copying it to an Amazon S3 location that you can access within Studio Classic. Contact your administrator if you need guidance on a suitable location.

1. Run the Jupyter notebook.

You might need to configure the Python script to make it run in your pipeline. For example, if you're running a Spark environment, make sure that you are running the script from an environment that has permission to access AWS resources.

# Generative AI foundation models in SageMaker Canvas
Generative AI foundation models

Amazon SageMaker Canvas provides generative AI foundation models that you can use to start conversational chats. These content generation models are trained on large amounts of text data to learn the statistical patterns and relationships between words, and they can produce coherent text that is statistically similar to the text on which they were trained. You can use this capability to increase your productivity by doing the following:
+ Generate content, such as document outlines, reports, and blogs
+ Summarize text from large corpuses of text, such as earnings call transcripts, annual reports, or chapters of user manuals
+ Extract insights and key takeaways from large passages of text, such as meeting notes or narratives
+ Improve text and catch grammatical errors or typos

The foundation models are a combination of Amazon SageMaker JumpStart and [Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-service.html) large language models (LLMs). Canvas offers the following models:


| Model | Type | Description | 
| --- | --- | --- | 
|  Amazon Titan  | Amazon Bedrock model |  Amazon Titan is a powerful, general-purpose language model that you can use for tasks such as summarization, text generation (such as creating a blog post), classification, open-ended Q&A, and information extraction. It is pretrained on large datasets, making it suitable for complex tasks and reasoning. To continue supporting best practices in the responsible use of AI, Amazon Titan foundation models are built to detect and remove harmful content in the data, reject inappropriate content in the user input, and filter model outputs that contain inappropriate content (such as hate speech, profanity, and violence).  | 
|  Anthropic Claude Instant  | Amazon Bedrock model |  Anthropic's Claude Instant is a faster and more cost-effective yet still very capable model. This model can handle a range of tasks including casual dialogue, text analysis, summarization, and document question answering. Just like Claude-2, Claude Instant can support up to 100,000 tokens in each prompt, equivalent to about 200 pages of information.  | 
|  Anthropic Claude-2  | Amazon Bedrock model |  Claude-2 is Anthropic's most powerful model, which excels at a wide range of tasks from sophisticated dialogue and creative content generation to detailed instruction following. Claude-2 can take up to 100,000 tokens in each prompt, equivalent to about 200 pages of information. It can generate longer responses compared to its prior version. It supports use cases such as question answering, information extraction, removing PII, content generation, multiple-choice classification, roleplay, comparing text, summarization, and document Q&A with citation.  | 
|  Falcon-7B-Instruct  | JumpStart model |  Falcon-7B-Instruct has 7 billion parameters and was fine-tuned on a mixture of chat and instruct datasets. It is suitable as a virtual assistant and performs best when following instructions or engaging in conversation. Since the model was trained on large amounts of English-language web data, it carries the stereotypes and biases commonly found online and is not suitable for languages other than English. Compared to Falcon-40B-Instruct, Falcon-7B-Instruct is a slightly smaller and more compact model.  | 
|  Falcon-40B-Instruct  | JumpStart model |  Falcon-40B-Instruct has 40 billion parameters and was fine-tuned on a mixture of chat and instruct datasets. It is suitable as a virtual assistant and performs best when following instructions or engaging in conversation. Since the model was trained on large amounts of English-language web data, it carries the stereotypes and biases commonly found online and is not suitable for languages other than English. Compared to Falcon-7B-Instruct, Falcon-40B-Instruct is a slightly larger and more powerful model.  | 
|  Jurassic-2 Mid  | Amazon Bedrock model |  Jurassic-2 Mid is a high-performance text generation model trained on a massive corpus of text (current up to mid 2022). It is highly versatile, general-purpose, and capable of composing human-like text and solving complex tasks such as question answering, text classification, and many others. This model offers zero-shot instruction capabilities, allowing it to be directed with only natural language and without the use of examples. It performs up to 30% faster than its predecessor, the Jurassic-1 model. Jurassic-2 Mid is AI21’s mid-sized model, carefully designed to strike the right balance between exceptional quality and affordability.  | 
|  Jurassic-2 Ultra  | Amazon Bedrock model |  Jurassic-2 Ultra is a high-performance text generation model trained on a massive corpus of text (current up to mid 2022). It is highly versatile, general-purpose, and capable of composing human-like text and solving complex tasks such as question answering, text classification, and many others. This model offers zero-shot instruction capabilities, allowing it to be directed with only natural language and without the use of examples. It performs up to 30% faster than its predecessor, the Jurassic-1 model. Compared to Jurassic-2 Mid, Jurassic-2 Ultra is a slightly larger and more powerful model.  | 
|  Llama-2-7b-Chat  | JumpStart model |  Llama-2-7b-Chat is a foundation model by Meta that is suitable for engaging in meaningful and coherent conversations, generating new content, and extracting answers from existing notes. Since the model was trained on large amounts of English-language internet data, it carries the biases and limitations commonly found online and is best-suited for tasks in English.  | 
|  Llama-2-13B-Chat  | Amazon Bedrock model |  Llama-2-13B-Chat by Meta was fine-tuned on conversational data after initial training on internet data. It is optimized for natural dialog and engaging chat abilities, making it well-suited as a conversational agent. Compared to the smaller Llama-2-7b-Chat, Llama-2-13B-Chat has nearly twice as many parameters, allowing it to remember more context and produce more nuanced conversational responses. Like Llama-2-7b-Chat, Llama-2-13B-Chat was trained on English-language data and is best-suited for tasks in English.  | 
|  Llama-2-70B-Chat  | Amazon Bedrock model |  Like Llama-2-7b-Chat and Llama-2-13B-Chat, the Llama-2-70B-Chat model by Meta is optimized for engaging in natural and meaningful dialog. With 70 billion parameters, this large conversational model can remember more extensive context and produce highly coherent responses when compared to the more compact model versions. However, this comes at the cost of slower responses and higher resource requirements. Llama-2-70B-Chat was trained on large amounts of English-language internet data and is best-suited for tasks in English.  | 
|  Mistral-7B  | JumpStart model |  Mistral-7B by Mistral.AI is an excellent general purpose language model suitable for a wide range of natural language (NLP) tasks like text generation, summarization, and question answering. It utilizes grouped-query attention (GQA) which allows for faster inference speeds, making it perform comparably to models with twice or three times as many parameters. It was trained on a mixture of text data including books, websites, and scientific papers in the English language, so it is best-suited for tasks in English.  | 
|  Mistral-7B-Chat  | JumpStart model |  Mistral-7B-Chat is a conversational model by Mistral.AI based on Mistral-7B. While Mistral-7B is best for general NLP tasks, Mistral-7B-Chat has been further fine-tuned on conversational data to optimize its abilities for natural, engaging chat. As a result, Mistral-7B-Chat generates more human-like responses and remembers the context of previous responses. Like Mistral-7B, this model is best-suited for English language tasks.  | 
|  MPT-7B-Instruct  | JumpStart model |  MPT-7B-Instruct is a model for long-form instruction following tasks and can assist you with writing tasks including text summarization and question-answering to save you time and effort. This model was trained on large amounts of fine-tuned data and can handle larger inputs, such as complex documents. Use this model when you want to process large bodies of text or want the model to generate long responses.  | 

The foundation models from Amazon Bedrock are currently only available in the US East (N. Virginia) and US West (Oregon) Regions. Additionally, when using foundation models from Amazon Bedrock, you are charged based on the volume of input tokens and output tokens, as specified by each model provider. For more information, see the [Amazon Bedrock pricing page](https://aws.amazon.com/bedrock/pricing/). The JumpStart foundation models are deployed on SageMaker AI Hosting instances, and you are charged for the duration of usage based on the instance type used. For more information about the cost of different instance types, see the Amazon SageMaker AI Hosting: Real-Time Inference section on the [SageMaker pricing page](https://aws.amazon.com/sagemaker/pricing/).

Document querying is an additional feature that you can use to query and get insights from documents stored in indexes using Amazon Kendra. With this functionality, you can generate content from the context of those documents and receive responses that are specific to your business use case, as opposed to responses that are generic to the large amounts of data on which the foundation models were trained. For more information about indexes in Amazon Kendra, see the [Amazon Kendra Developer Guide](https://docs.aws.amazon.com/kendra/latest/dg/what-is-kendra.html).

If you would like to get responses from any of the foundation models that are customized to your data and use case, you can fine-tune foundation models. To learn more, see [Fine-tune foundation models](canvas-fm-chat-fine-tune.md).

If you'd like to get predictions from an Amazon SageMaker JumpStart foundation model through an application or website, you can deploy the model to a SageMaker AI *endpoint*. SageMaker AI endpoints host your model, and you can send requests to the endpoint through your application code to receive predictions from the model. For more information, see [Deploy your models to an endpoint](canvas-deploy-model.md).
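
The following is a minimal sketch of that flow using the SageMaker Python SDK outside of Canvas. The model ID is one example JumpStart identifier, shown only to illustrate the pattern; deployment details vary by model.

```
from sagemaker.jumpstart.model import JumpStartModel

# Deploy a JumpStart foundation model to a real-time endpoint.
model = JumpStartModel(model_id="huggingface-llm-falcon-7b-instruct-bf16")  # example ID
predictor = model.deploy()

# Send a prompt to the endpoint from your application code.
response = predictor.predict({"inputs": "Draft a two-sentence product summary."})
print(response)
```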

# Complete the prerequisites for foundation models in SageMaker Canvas
Complete prerequisites

The following sections outline the prerequisites for interacting with foundation models and using the document query feature in Canvas. The rest of the content on this page assumes that you’ve met the prerequisites for foundation models. The document query feature requires additional permissions.

## Prerequisites for foundation models


The permissions you need for interacting with models are included in the Canvas Ready-to-use models permissions. To use the generative AI-powered models in Canvas, you must turn on the **Canvas Ready-to-use models configuration** permissions when setting up your Amazon SageMaker AI domain. For more information, see [Prerequisites for setting up Amazon SageMaker Canvas](canvas-getting-started.md#canvas-prerequisites). The **Canvas Ready-to-use models configuration** attaches the [AmazonSageMakerCanvasAIServicesAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasAIServicesAccess) policy to your Canvas user's AWS Identity and Access Management (IAM) execution role. If you encounter any issues with granting permissions, see the topic [Troubleshooting issues with granting permissions through the SageMaker AI console](canvas-limits.md#canvas-troubleshoot-trusted-services).

If you’ve already set up your domain, you can edit your domain settings and turn on the permissions. For instructions on how to edit your domain settings, see [Edit domain settings](domain-edit.md). When editing the settings for your domain, go to the **Canvas settings** and turn on the **Enable Canvas Ready-to-use models** option.

Certain JumpStart foundation models also require that you request a SageMaker AI instance quota increase. Canvas hosts the models that you’re currently interacting with on these instances, but the default quota for your account may be insufficient. If you run into an error while running any of the following models, request a quota increase for the associated instance types:
+ Falcon-40B – `ml.g5.12xlarge`, `ml.g5.24xlarge`
+ Falcon-7B – `ml.g5.2xlarge`, `ml.g5.4xlarge`, `ml.g5.8xlarge`
+ MPT-7B-Instruct – `ml.g5.2xlarge`, `ml.g5.4xlarge`, `ml.g5.8xlarge`

For the preceding instance types, request an increase from 0 to 1 for the endpoint usage quota. For more information about how to increase an instance quota for your account, see [Requesting a quota increase](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html) in the *Service Quotas User Guide*.
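
You can also submit the request programmatically. The following is a hedged sketch using the Service Quotas API through the AWS SDK for Python (Boto3); the quota name shown is an assumption about how the quota is labeled, so verify the name and code in your account first.

```
import boto3

sq = boto3.client("service-quotas")

# Find the endpoint-usage quota for the instance type, then request a value of 1.
paginator = sq.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        if quota["QuotaName"] == "ml.g5.12xlarge for endpoint usage":  # assumed label
            sq.request_service_quota_increase(
                ServiceCode="sagemaker",
                QuotaCode=quota["QuotaCode"],
                DesiredValue=1.0,
            )
```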

## Prerequisites for document querying


**Note**  
Document querying is supported in the following AWS Regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), and Asia Pacific (Mumbai).

The document querying feature requires that you already have an Amazon Kendra index that stores your documents and document metadata. For more information about Amazon Kendra, see the [Amazon Kendra Developer Guide](https://docs.aws.amazon.com/kendra/latest/dg/what-is-kendra.html). To learn more about the quotas for querying indexes, see [Quotas](https://docs.aws.amazon.com/kendra/latest/dg/quotas.html) in the *Amazon Kendra Developer Guide*.

You must also make sure that your Canvas user profile has the necessary permissions for document querying. The [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasFullAccess.html) policy must be attached to the AWS IAM execution role for the SageMaker AI domain that hosts your Canvas application (this policy is attached by default to all new and existing Canvas user profiles). You must also specifically grant document querying permissions and specify access to one or more Amazon Kendra indexes.

If your Canvas administrator is setting up a new domain or user profile, have them set up the domain by following the instructions in [Prerequisites for setting up Amazon SageMaker Canvas](canvas-getting-started.md#canvas-prerequisites). While setting up the domain, they can turn on the document querying permissions through the **Canvas Ready-to-use models configuration**.

The Canvas administrator can manage document querying permissions at the user profile level as well. For example, if the administrator wants to grant document querying permissions to some user profiles but remove permissions for others, they can edit the permissions for a specific user.

The following procedure shows how to turn on document querying permissions for a specific user profile:

1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. From the list of domains, select the user profile’s domain.

1. On the **Domain details** page, choose the **User profile** whose permissions you want to edit.

1. On the **User Details** page, choose **Edit**.

1. In the left navigation pane, choose **Canvas settings**.

1. In the **Canvas Ready-to-use models configuration** section, turn on the **Enable document query using Amazon Kendra** toggle.

1. In the dropdown, select one or more Amazon Kendra indexes to which you want to grant access.

1. Choose **Submit** to save the changes to your domain settings.

You should now be able to use Canvas foundation models to query documents in the specified Amazon Kendra indexes.

# Start a new conversation to generate, extract, or summarize content


To get started with generative AI foundation models in Canvas, you can initiate a new chat session with one of the models. For JumpStart models, you are charged while the model is active, so you must start up models when you want to use them and shut them down when you are done interacting. If you do not shut down a JumpStart model, Canvas shuts it down after 2 hours of inactivity. For Amazon Bedrock models (such as Amazon Titan), you are charged by prompt; the models are already active and don’t need to be started up or shut down. You are charged directly for use of these models by Amazon Bedrock.

To open a chat with a model, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **Ready-to-use models**.

1. Choose **Generate, extract and summarize content**.

1. On the welcome page, you’ll receive a recommendation to start up the default model. You can start the recommended model, or you can choose **Select another model** from the dropdown to choose a different one.

1. If you selected a JumpStart foundation model, you have to start it up before it is available for use. Choose **Start up the model**, and then the model is deployed to a SageMaker AI instance. It might take several minutes for this to complete. When the model is ready, you can enter prompts and ask the model questions.

   If you selected a foundation model from Amazon Bedrock, you can start using it instantly by entering a prompt and asking questions.

Depending on the model, you can perform various tasks. For example, you can enter a passage of text and ask the model to summarize it. Or, you can ask the model to come up with a short summary of the market trends in your domain.

The model’s responses in a chat are based on the context of your previous prompts. If you want to ask a new question in the chat that is unrelated to the previous conversation topic, we recommend that you start a new chat with the model.

# Extract information from documents with document querying


**Note**  
This section assumes that you’ve completed the section above [Prerequisites for document querying](canvas-fm-chat-prereqs.md#canvas-fm-chat-prereqs-kendra).

Document querying is a feature that you can use while interacting with foundation models in Canvas. With document querying, you can access a corpus of documents stored in an Amazon Kendra *index*, which holds the contents of your documents and is structured in a way to make documents searchable. You can ask specific questions that are targeted to the data in your Amazon Kendra index, and the foundation model returns answers to your questions. For example, you can query an internal knowledge base of IT information and ask questions such as “How do I connect to my company’s network?” For more information about setting up an index, see the [Amazon Kendra Developer Guide](https://docs.aws.amazon.com/kendra/latest/dg/what-is-kendra.html).

When using the document query feature, the foundation models restrict their responses to the content of the documents in your index with a technique called Retrieval Augmented Generation (RAG). This technique bundles the most relevant information from the index along with the user's prompt and sends it to the foundation model to get a response. Responses are limited to what can be found in your index, preventing the model from giving you incorrect responses based on external data. For more information about this process, see the blog post [Quickly build high-accuracy Generative AI applications on enterprise data](https://aws.amazon.com/blogs/machine-learning/quickly-build-high-accuracy-generative-ai-applications-on-enterprise-data-using-amazon-kendra-langchain-and-large-language-models/).
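
For illustration only, the following is a simplified sketch of the retrieval step that RAG performs, using the Amazon Kendra `Retrieve` API through the AWS SDK for Python (Boto3). Canvas handles this for you; the index ID is a placeholder.

```
import boto3

kendra = boto3.client("kendra")
question = "How do I connect to my company's network?"

# Retrieve the passages from the index that are most relevant to the question.
result = kendra.retrieve(
    IndexId="0123abcd-12ab-34cd-56ef-123456789012",  # placeholder index ID
    QueryText=question,
)

# Bundle the top passages with the user's prompt before sending it to the model.
context = "\n\n".join(item["Content"] for item in result["ResultItems"][:3])
prompt = f"Answer using only the following context:\n{context}\n\nQuestion: {question}"
```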

To get started, in a chat with a foundation model in Canvas, turn on the **Document query** toggle at the top of the page. From the dropdown, select the Amazon Kendra index that you want to query. Then, you can begin asking questions related to the documents in your index.

**Important**  
Document querying supports the [Compare model outputs](canvas-fm-chat-compare.md) feature. Any existing chat history is overwritten when you start a new chat to compare model outputs.

# Start up models


**Note**  
The following section describes starting up models, which only applies to the JumpStart foundation models, such as Falcon-40B-Instruct. You can access Amazon Bedrock models, such as Amazon Titan, instantly at any time.

You can start up as many JumpStart models as you like. Each active JumpStart model incurs charges on your account, so we recommend that you don’t start up more models than you are currently using.

To start up another model, you can do the following:

1. On the **Generate, extract and summarize content** page, choose **New chat**.

1. Choose the model from the dropdown menu. If you want to choose a model not displayed in the dropdown, choose **Start up another model**, and then select the model that you want to start up.

1. Choose **Start up model**.

The model should begin starting up, and within a few minutes you can chat with the model.

# Shut down models


We highly recommend that you shut down models that you aren’t using. The models automatically shut down after 2 hours of inactivity. However, to manually shut down a model, you can do the following:

1. On the **Generate, extract and summarize content** page, open the chat for the model that you want to shut down.

1. On the chat page, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)).

1. Choose **Shut down model**.

1. In the **Shut down model** confirmation box, choose **Shut down**.

The model begins shutting down. If your chat compares two or more models, you can shut down an individual model from the chat page by choosing the model’s **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) and then choosing **Shut down model**.

# Compare model outputs


You might want to compare the output of different models side by side to see which model output you prefer. This can help you decide which model is best suited to your use case. You can compare up to three models in chats.

**Note**  
Each individual model incurs charges on your account.

You must start a new chat to add models for comparison. To compare the output of models side by side in a chat, do the following:

1. In a chat, choose **New chat**.

1. Choose **Compare**, and use the dropdown menu to select the model that you want to add. To add a third model, choose **Compare** again to add another model.
**Note**  
If you want to use a JumpStart model that isn’t currently active, you are prompted to start up the model.

When the models are active, you see the two models side by side in the chat. You can submit your prompt, and each model responds in the same chat, as shown in the following screenshot.

![\[Screenshot of the Canvas interface with the output of two models shown side by side.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-chat-compare-outputs.png)


When you’re done interacting, make sure to shut down any JumpStart models individually to avoid incurring further charges.

# Fine-tune foundation models


The foundation models that you can access through Amazon SageMaker Canvas can help you with a range of general purpose tasks. However, if you have a specific use case and would like responses that are customized to your own data, you can *fine-tune* a foundation model.

To fine-tune a foundation model, you provide a dataset that consists of sample prompts and model responses. Then, you train the foundation model on the data. Finally, the fine-tuned foundation model is able to provide you with more specific responses.

The following list contains the foundation models that you can fine-tune in Canvas:
+ Titan Express
+ Falcon-7B
+ Falcon-7B-Instruct
+ Falcon-40B-Instruct
+ Falcon-40B
+ Flan-T5-Large
+ Flan-T5-Xl
+ Flan-T5-Xxl
+ MPT-7B
+ MPT-7B-Instruct

You can access more detailed information about each foundation model in the Canvas application while fine-tuning a model. For more information, see [Fine-tune the model](#canvas-fm-chat-fine-tune-procedure-model).

This topic describes how to fine-tune foundation models in Canvas.

## Before you begin


Before fine-tuning a foundation model, make sure that you have the permissions for Ready-to-use models in Canvas and an AWS Identity and Access Management execution role that has a trust relationship with Amazon Bedrock, which allows Amazon Bedrock to assume your role while fine-tuning foundation models.

While setting up or editing your Amazon SageMaker AI domain, you must 1) turn on the Canvas Ready-to-use models configuration permissions, and 2) create or specify an Amazon Bedrock role, which is an IAM execution role to which SageMaker AI attaches a trust relationship with Amazon Bedrock. For more information about configuring these settings, see [Prerequisites for setting up Amazon SageMaker Canvas](canvas-getting-started.md#canvas-prerequisites).

You can configure the Amazon Bedrock role manually if you would rather use your own IAM execution role (instead of letting SageMaker AI create one on your behalf). For more information about configuring your own IAM execution role’s trust relationship with Amazon Bedrock, see [Grant Users Permissions to Use Amazon Bedrock and Generative AI Features in Canvas](canvas-fine-tuning-permissions.md).

You must also have a dataset that is formatted for fine-tuning large language models (LLMs). The following is a list of requirements for your dataset, with a brief example of the expected format after the list:
+ The dataset must be tabular and contain at least two columns of text data: one input column (which contains example prompts to the model) and one output column (which contains example responses from the model).

  An example is the following:     
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/canvas-fm-chat-fine-tune.html)
+ We recommend that the dataset has at least 100 text pairs (rows of corresponding input and output items). This ensures that the foundation model has enough data for fine-tuning and increases the accuracy of its responses.
+ Each input and output item should contain a maximum of 512 characters. Anything longer is reduced to 512 characters when fine-tuning the foundation model.
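
As an illustration of the expected shape, the following is a minimal sketch that writes a two-column CSV dataset with pandas. The column names and sample rows are invented for this example; a real dataset should have at least 100 such pairs.

```
import pandas as pd

df = pd.DataFrame({
    "input": [   # example prompts to the model
        "Summarize our return policy in one sentence.",
        "Draft a subject line for a product launch email.",
    ],
    "output": [  # example responses from the model
        "Items can be returned within 30 days with a receipt.",
        "Introducing our newest product, built for you.",
    ],
})
df.to_csv("fine-tuning-dataset.csv", index=False)
```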

When fine-tuning an Amazon Bedrock model, you must adhere to the Amazon Bedrock quotas. For more information, see [Model customization quotas](https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html#model-customization-quotas) in the *Amazon Bedrock User Guide*.

For more information about general dataset requirements and limitations in Canvas, see [Create a dataset](canvas-import-dataset.md).

## Fine-tune a foundation model


You can fine-tune a foundation model by using any of the following methods in the Canvas application:
+ While in a **Generate, extract and summarize content** chat with a foundation model, choose the **Fine-tune model** icon (![\[Wrench icon representing the option to fine-tune a model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/wrench-icon-small.png)).
+ While in a chat with a foundation model, if you’ve re-generated the response two or more times, then Canvas offers you the option to **Fine-tune model**. The following screenshot shows you what this looks like.  
![\[Screenshot of the Fine-tune foundation model option shown in a chat.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/fine-tuning-ingress.png)
+ On the **My models** page, you can create a new model by choosing **New model**, and then select **Fine-tune foundation model**.
+ On the **Ready-to-use models** home page, you can choose **Create your own model**, and then in the **Create new model** dialog box, choose **Fine-tune foundation model**.
+ While browsing your datasets in the **Data Wrangler** tab, you can select a dataset and choose **Create a model**. Then, choose **Fine-tune foundation model**.

After you’ve begun to fine-tune a model, do the following:

### Select a dataset


On the **Select** tab of the fine-tuning workflow, choose the data on which you'd like to train the foundation model.

Either select an existing dataset or create a new dataset that meets the requirements listed in the [Before you begin](#canvas-fm-chat-fine-tune-prereqs) section. For more information about how to create a dataset, see [Create a dataset](canvas-import-dataset.md).

When you’ve selected or created a dataset and you’re ready to move on, choose **Select dataset**.

### Fine-tune the model


After selecting your data, you're ready to begin training and fine-tuning the model.

On the **Fine-tune** tab, do the following:

1. (Optional) Choose **Learn more about our foundation models** to access more information about each model and help you decide which foundation model or models to deploy.

1. For **Select up to 3 base models**, open the dropdown menu and check up to 3 foundation models (up to 2 JumpStart models and 1 Amazon Bedrock model) that you’d like to fine-tune during the training job. By fine-tuning multiple foundation models, you can compare their performance and ultimately choose the one best suited to your use case as the default model. For more information about default models, see [View model candidates in the model leaderboard](canvas-evaluate-model-candidates.md).

1. For **Select Input column**, select the column of text data in your dataset that contains the example model prompts.

1. For **Select Output column**, select the column of text data in your dataset that contains the example model responses.

1. (Optional) To configure advanced settings for the training job, choose **Configure model**. For more information about the advanced model building settings, see [Advanced model building configurations](canvas-advanced-settings.md).

   In the **Configure model** pop-up window, do the following:

   1. For **Hyperparameters**, you can adjust the **Epoch count**, **Batch size**, **Learning rate**, and **Learning rate warmup steps** for each model you selected. For more information about these parameters, see the [ Hyperparameters section in the JumpStart documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-fine-tune.html#jumpstart-hyperparameters).

   1. For **Data split**, you can specify percentages for how to divide your data between the **Training set** and **Validation set**.

   1. For **Max job runtime**, you can set the maximum amount of time that Canvas runs the build job. This feature is only available for JumpStart foundation models.

   1. After configuring the settings, choose **Save**.

1. Choose **Fine-tune** to begin training the foundation models you selected.

After the fine-tuning job begins, you can leave the page. When the model shows as **Ready** on the **My models** page, it’s ready for use, and you can now analyze the performance of your fine-tuned foundation model.
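
To get a sense of what the training job does under the hood, the following is a minimal, hedged sketch of a comparable fine-tuning job written with the SageMaker Python SDK’s JumpStart estimator. The model ID, instance type, S3 path, and hyperparameter names are illustrative assumptions, not Canvas defaults; the model notebook that Canvas generates (see [Download a model notebook](canvas-notebook.md)) shows the exact code for your job.

```python
# A hedged sketch of a JumpStart fine-tuning job, comparable to what Canvas
# runs during a fine-tune build. All names and values here are assumptions.
from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id="huggingface-llm-falcon-7b-bf16",  # hypothetical base model
    instance_type="ml.g5.12xlarge",             # assumed training instance
    environment={"accept_eula": "true"},        # some models require a EULA
)

# Mirrors the Epoch count, Batch size, Learning rate, and Learning rate
# warmup steps settings in the Configure model pop-up window. The exact
# hyperparameter names vary by model.
estimator.set_hyperparameters(
    epoch="3",
    per_device_train_batch_size="4",
    learning_rate="0.0001",
    warmup_steps="10",
)

# The training channel points to data with prompt and response columns,
# matching the Input and Output columns selected in Canvas.
estimator.fit({"training": "s3://amzn-s3-demo-bucket/fine-tuning-data/"})
```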

### Analyze the fine-tuned foundation model


On the **Analyze** tab of your fine-tuned foundation model, you can see the model’s performance.

The **Overview** tab on this page shows you the perplexity and loss scores, along with analyses that visualize the model’s improvement over time during training. The following screenshot shows the **Overview** tab.

![\[The Analyze tab of a fine-tuned foundation model in Canvas, showing the perplexity and loss curves.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-fine-tune-analyze-2.png)


On this page, you can see the following visualizations:
+ The **Perplexity Curve** measures how well the model predicts the next word in a sequence, or roughly how fluent the model’s output is. Ideally, as the model improves during training, the score decreases, resulting in a curve that lowers and flattens over time.
+ The **Loss Curve** quantifies the difference between the correct output and the model’s predicted output. A loss curve that decreases and flattens over time indicates that the model is improving its ability to make accurate predictions. (The relationship between loss and perplexity is sketched after this list.)
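
The two curves are tightly related: when the loss is the average cross-entropy per token, perplexity is its exponential. The following minimal sketch illustrates the relationship; the loss value is invented.

```python
import math

# Perplexity is the exponential of the average cross-entropy loss (in nats
# per token), so the two curves fall together as the model improves.
loss = 2.1                    # invented average cross-entropy loss
perplexity = math.exp(loss)   # ~8.2: roughly as uncertain as choosing
                              # uniformly among 8 possible next tokens
print(f"{perplexity:.1f}")
```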

The **Advanced metrics** tab shows you the hyperparameters and additional metrics for your model. It looks like the following screenshot:

![\[Screenshot of the Advanced metrics tab of a fine-tuned foundation model in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-fine-tune-metrics.png)


The **Advanced metrics** tab contains the following information:
+ The **Explainability** section contains the **Hyperparameters**, which are the values set before the job to guide the model’s fine-tuning. If you didn’t specify custom hyperparameters in the model’s advanced settings in the [Fine-tune the model](#canvas-fm-chat-fine-tune-procedure-model) section, then Canvas selects default hyperparameters for you.

  For JumpStart models, you can also see the advanced metric [ROUGE (Recall-Oriented Understudy for Gisting Evaluation)](https://en.wikipedia.org/wiki/ROUGE_(metric)), which evaluates the quality of summaries generated by the model. It measures how well the model can summarize the main points of a passage. A runnable ROUGE illustration follows later in this section.
+ The **Artifacts** section provides you with links to artifacts generated during the fine-tuning job. You can access the training and validation data saved in Amazon S3, as well as the link to the model evaluation report (to learn more, see the following paragraph).

To get more model evaluation insights, you can download a report that is generated using [SageMaker Clarify](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-configure-processing-jobs.html), a feature that can help you detect bias in your model and data. First, generate the report by choosing **Generate evaluation report** at the bottom of the page. After the report has been generated, you can download the full report by choosing **Download report** or by returning to the **Artifacts** section.
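
As a hedged illustration of the ROUGE metric mentioned above, the following sketch uses the open-source `rouge-score` package (`pip install rouge-score`). The reference and generated summaries are invented, and this is not necessarily the exact implementation Canvas uses.

```python
from rouge_score import rouge_scorer

# Score an invented model summary against an invented reference summary.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "the cat sat on the mat",     # reference summary
    "the cat lay on the mat",     # model-generated summary
)
print(scores["rouge1"].fmeasure)  # unigram-overlap F1
print(scores["rougeL"].fmeasure)  # longest-common-subsequence F1
```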

You can also access a Jupyter notebook that shows you how to replicate your fine-tuning job in Python code. You can use this notebook to replicate your fine-tuning job, make programmatic changes to it, or get a deeper understanding of how Canvas fine-tunes your model. To learn more about model notebooks and how to access them, see [Download a model notebook](canvas-notebook.md).

For more information about how to interpret the information in the **Analyze** tab of your fine-tuned foundation model, see the topic [Model evaluation](canvas-evaluate-model.md).

After analyzing the **Overview** and **Advanced metrics** tabs, you can also choose to open the **Model leaderboard**, which shows you the list of the base models trained during the build. The model with the lowest loss score is considered the best performing model and is selected as the **Default model**, which is the model whose analysis you see in the **Analyze** tab. You can only test and deploy the default model. For more information about the model leaderboard and how to change the default model, see [View model candidates in the model leaderboard](canvas-evaluate-model-candidates.md).

### Test a fine-tuned foundation model in a chat


After analyzing the performance of a fine-tuned foundation model, you might want to test it out or compare its responses with the base model. You can test a fine-tuned foundation model in a chat in the **Generate, extract and summarize content** feature.

Start a chat with a fine-tuned model by choosing one of the following methods:
+ On the fine-tuned model’s **Analyze** tab, choose **Test in Ready-to-use foundation models**.
+ On the Canvas **Ready-to-use models** page, choose **Generate, extract and summarize content**. Then, choose **New chat** and select the version of the model that you want to test.

The model starts up in a chat, and you can interact with it like any other foundation model. You can add more models to the chat and compare their outputs. For more information about the functionality of chats, see [Generative AI foundation models in SageMaker Canvas](canvas-fm-chat.md).

## Operationalize fine-tuned foundation models


After fine-tuning your model in Canvas, you can do the following:
+ Register the model to the SageMaker Model Registry for integration into your organization's MLOps processes. For more information, see [Register a model version in the SageMaker AI model registry](canvas-register-model.md).
+ Deploy the model to a SageMaker AI endpoint and send requests to the model from your application or website to get predictions (or *inference*). For more information, see [Deploy your models to an endpoint](canvas-deploy-model.md).

**Important**  
You can only register and deploy JumpStart-based fine-tuned foundation models, not Amazon Bedrock-based models.
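
After the model is deployed, the following is a minimal sketch of sending a request to the endpoint with boto3. The endpoint name and the request and response payload shapes are assumptions; the exact format depends on the model you deployed.

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="canvas-fine-tuned-model",  # hypothetical endpoint name
    ContentType="application/json",
    # The payload shape below is an assumption; check your model's docs.
    Body=json.dumps({"inputs": "Summarize this support ticket: ..."}),
)
print(response["Body"].read().decode("utf-8"))
```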

# Ready-to-use models


With Amazon SageMaker Canvas Ready-to-use models, you can make predictions on your data without writing a single line of code or building a model; all you have to bring is your data. Ready-to-use models are pre-built, so you can generate predictions without the time, expertise, or cost of building a model yourself, and you can choose from a variety of use cases ranging from language detection to expense analysis.

Canvas integrates with existing AWS services, such as [Amazon Textract](https://docs.aws.amazon.com/textract/latest/dg/what-is.html), [Amazon Rekognition](https://docs.aws.amazon.com/rekognition/latest/dg/what-is.html), and [Amazon Comprehend](https://docs.aws.amazon.com/comprehend/latest/dg/what-is.html), to analyze your data and make predictions or extract insights. You can use the predictive power of these services from within the Canvas application to get high-quality predictions for your data.

Canvas supports the following Ready-to-use model types:


| Ready-to-use model | Description | Supported data type | 
| --- | --- | --- | 
| Sentiment analysis | Detect sentiment in lines of text, which can be positive, negative, neutral, or mixed. Currently, you can only do sentiment analysis for English language text. | Plain text or tabular (CSV, Parquet) | 
| Entities extraction | Extract entities, which are real-world objects such as people, places, and commercial items, or units such as dates and quantities, from text. | Plain text or tabular (CSV, Parquet) | 
| Language detection | Determine the dominant language in text such as English, French, or German. | Plain text or tabular (CSV, Parquet) | 
| Personal information detection | Detect personal information that could be used to identify an individual, such as addresses, bank account numbers, and phone numbers, from text. | Plain text or tabular (CSV, Parquet) | 
| Object detection in images | Detect objects, concepts, scenes, and actions in your images. | Image (JPG, PNG) | 
| Text detection in images | Detect text in your images. | Image (JPG, PNG) | 
| Expense analysis | Extract information from invoices and receipts, such as date, number, item prices, total amount, and payment terms. | Document (PDF, JPG, PNG, TIFF) | 
| Identity document analysis | Extract information from passports, driver's licenses, and other identity documents issued by the US government. | Document (PDF, JPG, PNG, TIFF) | 
| Document analysis | Analyze documents and forms for relationships among detected text. | Document (PDF, JPG, PNG, TIFF) | 
| Document queries | Extract information from structured documents such as paystubs, bank statements, W-2s, and mortgage application forms by asking questions using natural language. | Document (PDF) | 

## Get started


To get started with Ready-to-use models, review the following information.

**Prerequisites**

To use Ready-to-use models in Canvas, you must turn on the **Canvas Ready-to-use models configuration** permissions when [setting up your Amazon SageMaker AI domain](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-getting-started.html#canvas-prerequisites). The **Canvas Ready-to-use models configuration** attaches the [AmazonSageMakerCanvasAIServicesAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasAIServicesAccess) policy to your Canvas user's AWS Identity and Access Management (IAM) execution role. If you encounter any issues with granting permissions, see the topic [Troubleshooting issues with granting permissions through the SageMaker AI console](canvas-limits.md#canvas-troubleshoot-trusted-services).

If you’ve already set up your domain, you can edit your domain settings and turn on the permissions. For instructions on how to edit your domain settings, see [Edit domain settings](https://docs.aws.amazon.com/sagemaker/latest/dg/domain-edit.html). When editing the settings for your domain, go to the **Canvas settings** and turn on the **Enable Canvas Ready-to-use models** option.

**(Optional) Opt out of AI services data storage**

Certain AWS AI services store and use your data to make improvements to the service. You can opt out of having your data stored or used for service improvements. To learn more about how to opt out, see [ AI services opt-out policies](https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_ai-opt-out.html) in the *AWS Organizations User Guide*.
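
For reference, the following is a hedged sketch of creating an organization-wide opt-out policy with boto3, following the policy format described in the *AWS Organizations User Guide*. Run it from your organization's management account; the policy name and description are placeholders.

```python
import json

import boto3

orgs = boto3.client("organizations")

# Opt all accounts out of data storage and usage for all AI services.
policy_document = {
    "services": {
        "default": {
            "opt_out_policy": {"@@assign": "optOut"}
        }
    }
}

orgs.create_policy(
    Name="ai-services-opt-out",                       # placeholder name
    Description="Opt out of AI services data usage",  # placeholder description
    Type="AISERVICES_OPT_OUT_POLICY",
    Content=json.dumps(policy_document),
)
```

Note that creating the policy is not enough on its own; you also need to attach it to your organization root or to specific accounts (for example, with `attach_policy`).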

**How to use Ready-to-use models**

To get started with Ready-to-use models, do the following:

1. **(Optional) Import your data.** You can import a tabular, image, or document dataset to generate batch predictions (a dataset of predictions) with Ready-to-use models. To get started with importing a dataset, see [Create a data flow](canvas-data-flow.md).

1. **Generate predictions.** You can generate single or batch predictions with your chosen Ready-to-use model. To get started with making predictions, see [Make predictions for text data](canvas-ready-to-use-predict-text.md).

# Make predictions for text data


The following procedures describe how to make both single and batch predictions for text datasets. Each Ready-to-use model supports both **Single predictions** and **Batch predictions** for your dataset. A **Single prediction** is when you only need to make one prediction. For example, you have one image from which you want to extract text, or one paragraph of text for which you want to detect the dominant language. A **Batch prediction** is when you’d like to make predictions for an entire dataset. For example, you might have a CSV file of customer reviews for which you’d like to analyze the customer sentiment, or you might have image files in which you’d like to detect objects.

You can use these procedures for the following Ready-to-use model types: sentiment analysis, entities extraction, language detection, and personal information detection.

**Note**  
For sentiment analysis, you can only use English language text.

## Single predictions


To make a single prediction for Ready-to-use models that accept text data, do the following:

1. In the left navigation pane of the Canvas application, choose **Ready-to-use models**.

1. On the **Ready-to-use models** page, choose the Ready-to-use model for your use case. For text data, it should be one of the following: **Sentiment analysis**, **Entities extraction**, **Language detection**, or **Personal information detection**.

1. On the **Run predictions** page for your chosen Ready-to-use model, choose **Single prediction**.

1. For **Text field**, enter the text for which you’d like to get a prediction.

1. Choose **Generate prediction results** to get your prediction.

In the right pane **Prediction results**, you receive an analysis of your text in addition to a **Confidence** score for each result or label. For example, if you chose language detection and entered a passage of text in French, you might get French with a 95% confidence score and traces of other languages, like English, with a 5% confidence score.
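
Because these Ready-to-use models are powered by Amazon Comprehend, you can reproduce the same kind of single prediction programmatically. The following is a minimal sketch; the sample text is invented.

```python
import boto3

comprehend = boto3.client("comprehend")

# Language detection: returns language codes with confidence scores,
# similar to the Canvas Prediction results pane.
languages = comprehend.detect_dominant_language(
    Text="Bonjour, comment allez-vous?"
)
print(languages["Languages"])  # e.g. [{"LanguageCode": "fr", "Score": 0.99}]

# Sentiment analysis (English-only, matching the Canvas limitation).
sentiment = comprehend.detect_sentiment(
    Text="I love this product!", LanguageCode="en"
)
print(sentiment["Sentiment"], sentiment["SentimentScore"])
```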

The following screenshot shows the results for a single prediction using language detection where the model is 100% confident that the passage is English.

![\[Screenshot of the results of a single prediction with the language detection Ready-to-use model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-ready-to-use/ai-solutions-text-prediction.png)


## Batch predictions


To make batch predictions for Ready-to-use models that accept text data, do the following:

1. In the left navigation pane of the Canvas application, choose **Ready-to-use models**.

1. On the **Ready-to-use models** page, choose the Ready-to-use model for your use case. For text data, it should be one of the following: **Sentiment analysis**, **Entities extraction**, **Language detection**, or **Personal information detection**.

1. On the **Run predictions** page for your chosen Ready-to-use model, choose **Batch prediction**.

1. Choose **Select dataset** if you’ve already imported your dataset. If not, choose **Import new dataset**, and then you are directed through the import data workflow.

1. From the list of available datasets, select your dataset and choose **Generate predictions** to get your predictions.

After the prediction job finishes running, on the **Run predictions** page, you see an output dataset listed under **Predictions**. This dataset contains your results, and if you select the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), you can **Preview** the output data. Then, you can choose **Download** to download the results.
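
If you later want the same kind of batch scoring outside the Canvas application, Amazon Comprehend also exposes batch APIs. The following sketch uses `batch_detect_sentiment`, which accepts up to 25 documents per call; the review strings are invented.

```python
import boto3

comprehend = boto3.client("comprehend")

reviews = [
    "Great quality, would buy again.",
    "Arrived broken and support never replied.",
]
result = comprehend.batch_detect_sentiment(TextList=reviews, LanguageCode="en")
for item in result["ResultList"]:
    print(item["Index"], item["Sentiment"])
```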

# Make predictions for image data


The following procedures describe how to make both single and batch predictions for image datasets. Each Ready-to-use model supports both **Single predictions** and **Batch predictions** for your dataset. A **Single prediction** is when you only need to make one prediction. For example, you have one image from which you want to extract text, or one paragraph of text for which you want to detect the dominant language. A **Batch prediction** is when you’d like to make predictions for an entire dataset. For example, you might have a CSV file of customer reviews for which you’d like to analyze the customer sentiment, or you might have image files in which you’d like to detect objects.

You can use these procedures for the following Ready-to-use model types: object detection in images and text detection in images.

## Single predictions


To make a single prediction for Ready-to-use models that accept image data, do the following:

1. In the left navigation pane of the Canvas application, choose **Ready-to-use models**.

1. On the **Ready-to-use models** page, choose the Ready-to-use model for your use case. For image data, it should be one of the following: **Object detection in images** or **Text detection in images**.

1. On the **Run predictions** page for your chosen Ready-to-use model, choose **Single prediction**.

1. Choose **Upload image**.

1. You are prompted to select an image to upload from your local computer. Select the image from your local files, and then the prediction results are generated.

In the right pane **Prediction results**, you receive an analysis of your image in addition to a **Confidence** score for each object or text detected. For example, if you chose object detection in images, you receive a list of objects in the image along with a confidence score of how certain the model is that each object was accurately detected, such as 93%.
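
Because the image Ready-to-use models are powered by Amazon Rekognition, you can reproduce a single object-detection prediction programmatically. The following is a minimal sketch; the file name and thresholds are placeholders.

```python
import boto3

rekognition = boto3.client("rekognition")

with open("street-scene.jpg", "rb") as f:  # hypothetical local image
    image_bytes = f.read()

labels = rekognition.detect_labels(
    Image={"Bytes": image_bytes},
    MaxLabels=10,
    MinConfidence=80,                      # only fairly confident labels
)
for label in labels["Labels"]:
    print(label["Name"], round(label["Confidence"], 1))
```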

The following screenshot shows the results for a single prediction using the object detection in images solution, where the model predicts objects such as a clock tower and bus with 100% confidence.

![\[The results of a single prediction with the object detection solution in images Ready-to-use model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-ready-to-use/ai-solutions-image-prediction.png)


## Batch predictions


To make batch predictions for Ready-to-use models that accept image data, do the following:

1. In the left navigation pane of the Canvas application, choose **Ready-to-use models**.

1. On the **Ready-to-use models** page, choose the Ready-to-use model for your use case. For image data, it should be one of the following: **Object detection in images** or **Text detection in images**.

1. On the **Run predictions** page for your chosen Ready-to-use model, choose **Batch prediction**.

1. Choose **Select dataset** if you’ve already imported your dataset. If not, choose **Import new dataset**, and then you are directed through the import data workflow.

1. From the list of available datasets, select your dataset and choose **Generate predictions** to get your predictions.

After the prediction job finishes running, on the **Run predictions** page, you see an output dataset listed under **Predictions**. This dataset contains your results, and if you select the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), you can choose **View prediction results** to preview the output data. Then, you can choose **Download prediction** and download the results as a CSV or a ZIP file.

# Make predictions for document data


The following procedures describe how to make both single and batch predictions for document datasets. Each Ready-to-use model supports both **Single predictions** and **Batch predictions** for your dataset. A **Single prediction** is when you only need to make one prediction. For example, you have one image from which you want to extract text, or one paragraph of text for which you want to detect the dominant language. A **Batch prediction** is when you’d like to make predictions for an entire dataset. For example, you might have a CSV file of customer reviews for which you’d like to analyze the customer sentiment, or you might have image files in which you’d like to detect objects.

You can use these procedures for the following Ready-to-use model types: expense analysis, identity document analysis, and document analysis.

**Note**  
For document queries, only single predictions are currently supported.

## Single predictions


To make a single prediction for Ready-to-use models that accept document data, do the following:

1. In the left navigation pane of the Canvas application, choose **Ready-to-use models**.

1. On the **Ready-to-use models** page, choose the Ready-to-use model for your use case. For document data, it should be one of the following: **Expense analysis**, **Identity document analysis**, or **Document analysis**.

1. On the **Run predictions** page for your chosen Ready-to-use model, choose **Single prediction**.

1. If your Ready-to-use model is identity document analysis or document analysis, complete the following actions. If you’re doing expense analysis or document queries, skip this step and go to Step 5 or Step 6, respectively.

   1. Choose **Upload document**.

   1. You are prompted to upload a PDF, JPG, or PNG file from your local computer. Select the document from your local files, and then the prediction results are generated.

1. If your Ready-to-use model is expense analysis, do the following:

   1. Choose **Upload invoice or receipt**.

   1. You are prompted to upload a PDF, JPG, PNG, or TIFF file from your local computer. Select the document from your local files, and then the prediction results are generated.

1. If your Ready-to-use model is document queries, do the following:

   1. Choose **Upload document**.

   1. You are prompted to upload a PDF file from your local computer. Select the document from your local files. Your PDF must be 1–100 pages long.
**Note**  
If you're in the Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), or Europe (Frankfurt) Regions, then the maximum PDF size for document queries is 20 pages.

   1. In the right side pane, enter queries to search for information in the document. A single query can contain 1–200 characters, and you can add up to 15 queries at a time.

   1. Choose **Submit queries**, and then the results generate with answers to your queries. You are billed once for each submission of queries that you make.

In the right pane **Prediction results**, you receive an analysis of your document.

The following information describes the results for each type of solution:
+ For expense analysis, the results are categorized into **Summary fields**, which include fields such as the total on a receipt, and **Line item fields**, which include fields such as individual items on a receipt. The identified fields are highlighted on the document image in the output.
+ For identity document analysis, the output shows you the fields that the Ready-to-use model identified, such as first and last name, address, or date of birth. The identified fields are highlighted on the document image in the output.
+ For document analysis, the results are categorized into **Raw text**, **Forms**, **Tables**, and **Signatures**. **Raw text** includes all of the extracted text, while **Forms**, **Tables**, and **Signatures** only include information on the form that falls into those categories. For example, **Tables** only includes information extracted from tables in the document. The identified fields are highlighted on the document image in the output.
+ For document queries, Canvas returns answers to each of your queries. You can open the collapsible query dropdown to view a result, along with a confidence score for the prediction. If Canvas finds multiple answers in the document, then you might have more than one result for each query.
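
The document Ready-to-use models are powered by Amazon Textract, so you can reproduce a document-queries prediction programmatically. The following is a minimal sketch; the file name and query text are invented. The synchronous API shown here accepts single-page documents; multipage PDFs require the asynchronous `StartDocumentAnalysis` API.

```python
import boto3

textract = boto3.client("textract")

with open("paystub.pdf", "rb") as f:  # hypothetical single-page document
    doc_bytes = f.read()

response = textract.analyze_document(
    Document={"Bytes": doc_bytes},
    FeatureTypes=["QUERIES"],
    QueriesConfig={"Queries": [{"Text": "What is the employee's gross pay?"}]},
)

# Answers come back as QUERY_RESULT blocks with confidence scores.
for block in response["Blocks"]:
    if block["BlockType"] == "QUERY_RESULT":
        print(block["Text"], block["Confidence"])
```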

The following screenshot shows the results for a single prediction using the document analysis solution.

![\[Screenshot of the results of a single prediction with the document analysis Ready-to-use model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-ready-to-use/ai-solutions-document-analysis.png)


## Batch predictions


To make batch predictions for Ready-to-use models that accept document data, do the following:

1. In the left navigation pane of the Canvas application, choose **Ready-to-use models**.

1. On the **Ready-to-use models** page, choose the Ready-to-use model for your use case. For document data, it should be one of the following: **Expense analysis**, **Identity document analysis**, or **Document analysis**.

1. On the **Run predictions** page for your chosen Ready-to-use model, choose **Batch prediction**.

1. Choose **Select dataset** if you’ve already imported your dataset. If not, choose **Import new dataset**, and then you are directed through the import data workflow.

1. From the list of available datasets, select your dataset and choose **Generate predictions**. If your use case is document analysis, continue to Step 6.

1. If your use case is document analysis, another dialog box called **Select features to include in batch prediction** appears. You can optionally select **Forms**, **Tables**, and **Signatures** to group the results by those features. Then, choose **Generate predictions**.

After the prediction job finishes running, on the **Run predictions** page, you see an output dataset listed under **Predictions**. This dataset contains your results, and if you select the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), you can choose **View prediction results** to preview the analysis of your document data.

The following information describes the results for each type of solution:
+ For expense analysis, the results are categorized into **Summary fields**, which include fields such as the total on a receipt, and **Line item fields**, which include fields such as individual items on a receipt. The identified fields are highlighted on the document image in the output.
+ For identity document analysis, the output shows you the fields that the Ready-to-use model identified, such as first and last name, address, or date of birth. The identified fields are highlighted on the document image in the output.
+ For document analysis, the results are categorized into **Raw text**, **Forms**, **Tables**, and **Signatures**. **Raw text** includes all of the extracted text, while **Forms**, **Tables**, and **Signatures** only include information on the form that falls into those categories. For example, **Tables** only includes information extracted from tables in the document. The identified fields are highlighted on the document image in the output.

After previewing your results, you can choose **Download prediction** and download the results as a ZIP file.

# Custom models


In Amazon SageMaker Canvas, you can train custom machine learning models tailored to your specific data and use case. By training a custom model on your data, you are able to capture characteristics and trends that are specific and most representative of your data. For example, you might want to create a custom time series forecasting model that you train on inventory data from your warehouse to manage your logistics operations.

Canvas supports training a range of model types. After training a custom model, you can evaluate the model's performance and accuracy. Once satisfied with a model, you can make predictions on new data, and you also have the option to share the custom model with data scientists for further analysis or to deploy it to a SageMaker AI hosted endpoint for real-time inference, all from within the Canvas application.

You can train a Canvas custom model on the following types of datasets:
+ Tabular (including numeric, categorical, time series, and text data)
+ Image

The following table shows the types of custom models that you can build in Canvas, along with their supported data types and data sources.


| Model type | Example use case | Supported data types | Supported data sources | 
| --- | --- | --- | --- | 
| Numeric prediction | Predicting house prices based on features like square footage | Numeric | Local upload, Amazon S3, SaaS connectors | 
| 2 category prediction | Predicting whether or not a customer is likely to churn | Binary or categorical | Local upload, Amazon S3, SaaS connectors | 
| 3+ category prediction | Predicting patient outcomes after being discharged from the hospital | Categorical | Local upload, Amazon S3, SaaS connectors | 
| Time series forecasting | Predicting your inventory for the next quarter | Time series | Local upload, Amazon S3, SaaS connectors | 
| Single-label image prediction | Predicting types of manufacturing defects in images | Image (JPG, PNG) | Local upload, Amazon S3 | 
| Multi-category text prediction | Predicting categories of products, such as clothing, electronics, or household goods, based on product descriptions | Source column: text; target column: binary or categorical | Local upload, Amazon S3 | 

**Get started**

To get started with building and generating predictions from a custom model, do the following:
+ Determine your use case and type of model that you want to build. For more information about the custom model types, see [How custom models work](canvas-build-model.md). For more information about the data types and sources supported for custom models, see [Data import](canvas-importing-data.md).
+ [Import your data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-importing-data.html) into Canvas. You can build a custom model with any tabular or image dataset that meets the input requirements. For more information about the input requirements, see [Create a dataset](canvas-import-dataset.md).

  To learn more about sample datasets provided by SageMaker AI with which you can experiment, see [Sample datasets in Canvas](canvas-sample-datasets.md).
+ [Build](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-build-model.html) your custom model. You can do a **Quick build** to get your model and start making predictions more quickly, or you can do a **Standard build** for greater accuracy.

  For numeric, categorical, and time series forecasting model types, you can clean and prepare your data with the [Data Wrangler feature](canvas-data-prep.md). In Data Wrangler, you can create a data flow and use various data preparation techniques, such as applying advanced transforms or joining datasets. For image prediction models, you can [Edit an image dataset](canvas-edit-image.md) to update your labels or add and delete images. Note that you can't use these features for multi-category text prediction models.
+ [Evaluate your model's performance](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-evaluate-model.html) and determine how well it might perform on real-world data.
+ [Make single or batch predictions](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-make-predictions.html) with your model.

# How custom models work


Use Amazon SageMaker Canvas to build a custom model on the dataset that you've imported. Use the model that you've built to make predictions on new data. SageMaker Canvas uses the information in the dataset to build up to 250 models and choose the one that performs the best.

When you begin building a model, Canvas automatically recommends one or more *model types*. Model types fall into one of the following categories:
+ **Numeric prediction** – This is known as *regression* in machine learning. Use the numeric prediction model type when you want to make predictions for numeric data. For example, you might want to predict the price of houses based on features such as the house’s square footage.
+ **Categorical prediction** – This is known as *classification* in machine learning. When you want to categorize data into groups, use the categorical prediction model types:
  + **2 category prediction** – Use the 2 category prediction model type (also known as *binary classification* in machine learning) when you have two categories that you want to predict for your data. For example, you might want to determine whether a customer is likely to churn.
  + **3+ category prediction** – Use the 3+ category prediction model type (also known as *multi-class classification* in machine learning) when you have three or more categories that you want to predict for your data. For example, you might want to predict a customer's loan status based on features such as previous payments.
+ **Time series forecasting** – Use time series forecasts when you want to make predictions over a period of time. For example, you might want to predict the number of items you’ll sell in the next quarter. For information about time series forecasts, see [Time Series Forecasts in Amazon SageMaker Canvas](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-time-series.html).
+ **Image prediction** – Use the single-label image prediction model type (also known as *single-label image classification* in machine learning) when you want to assign labels to images. For example, you might want to classify different types of manufacturing defects in images of your product.
+ **Text prediction** – Use the multi-category text prediction model type (also known as *multi-class text classification* in machine learning) when you want to assign labels to passages of text. For example, you might have a dataset of customer reviews for a product, and you want to determine whether customers liked or disliked the product. You might have your model predict whether a given passage of text is `Positive`, `Negative`, or `Neutral`.

For a table of the supported input data types for each model type, see [Custom models](canvas-custom-models.md).

For each tabular data model that you build (which includes numeric, categorical, time series forecasting, and text prediction models), you choose the **Target column**. The **Target column** is the column that contains the information that you want to predict. For example, if you're building a model to predict whether people have cancelled their subscriptions, the **Target column** contains data points that are either a `yes` or a `no` about someone's cancellation status.

For image prediction models, you build the model with a dataset of images that have been assigned labels. For the unlabeled images that you provide, the model predicts a label. For example, if you’re building a model to predict whether an image is a cat or a dog, you provide images labeled as cats or dogs when building the model. Then, the model can accept unlabeled images and predict them as either cats or dogs.

**What happens when you build a model**

To build your model, you can choose either a **Quick build** or a **Standard build**. The **Quick build** has a shorter build time, but the **Standard build** generally has a higher accuracy.

For tabular and time series forecasting models, Canvas uses *downsampling* to reduce the size of datasets larger than 5 GB and 30 GB, respectively. Canvas downsamples with the stratified sampling method. The following table lists the size of the downsample by model type. To control the sampling process, you can use Data Wrangler in Canvas to sample using your preferred sampling technique. For time series data, you can resample to aggregate data points. For more information about sampling, see [Sampling](canvas-transform.md#canvas-transform-sampling). For more information about resampling time series data, see [Resample Time Series Data](canvas-transform.md#canvas-resample-time-series).

If you choose to do a **Quick build** on a dataset with more than 50,000 rows, then Canvas samples your data down to 50,000 rows for a shorter model training time.
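
As a rough illustration of what stratified downsampling does, the following sketch samples a large tabular dataset down to the 50,000-row **Quick build** cap with scikit-learn while preserving the label proportions in the target column. The file and column names are placeholders, and this is not the exact implementation Canvas uses.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("churn.csv")  # hypothetical dataset with > 50,000 rows

# Keep 50,000 rows while preserving the class balance of the target column.
sample, _ = train_test_split(
    df,
    train_size=50_000,
    stratify=df["churned"],    # keep label proportions intact
    random_state=42,
)
print(sample["churned"].value_counts(normalize=True))
```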

The following table summarizes key characteristics of the model building process, including average build times for each model and build type, the size of the downsample when building models with large datasets, and the minimum and maximum number of data points you should have for each build type.


| Limit | Numeric and categorical prediction | Time series forecasting | Image prediction | Text prediction | 
| --- | --- | --- | --- | --- | 
| **Quick build** time | 2–20 minutes | 2–20 minutes | 15–30 minutes | 15–30 minutes | 
| **Standard build** time | 2–4 hours | 2–4 hours | 2–5 hours | 2–5 hours | 
| Downsample size (the reduced size of a large dataset after Canvas downsamples) | 5 GB | 30 GB | N/A | N/A | 
| Minimum number of entries (rows) for **Quick builds** | 2 category: 500 rows; 3+ category, numeric, time series: N/A | N/A | N/A | N/A | 
| Minimum number of entries (rows, images, or documents) for **Standard builds** | 250 | 50 | 50 | N/A | 
| Maximum number of entries (rows, images, or documents) for **Quick builds** | N/A | N/A | 5,000 | 7,500 | 
| Maximum number of entries (rows, images, or documents) for **Standard builds** | N/A | 150,000 | 180,000 | N/A | 
| Maximum number of columns | 1,000 | 1,000 | N/A | N/A | 

Canvas predicts values by using the information in the rest of the dataset, depending on the model type:
+ For categorical prediction, Canvas puts each row into one of the categories listed in the **Target column**.
+ For numeric prediction, Canvas uses the information in the dataset to predict the numeric values in the **Target column**.
+ For time series forecasting, Canvas uses historical data to predict values for the **Target column** in the future.
+ For image prediction, Canvas uses images that have been assigned labels to predict labels for unlabeled images.
+ For text prediction, Canvas analyzes text data that has been assigned labels to predict labels for passages of unlabeled text.

**Additional features to help you build your model**

Before building your model, you can use Data Wrangler in Canvas to prepare your data using 300+ built-in transforms and operators. Data Wrangler supports transforms for both tabular and image datasets. Additionally, you can connect to data sources outside of Canvas, create jobs to apply transforms to your entire dataset, and export your fully prepared and cleaned data for use in ML workflows outside of Canvas. For more information, see [Data preparation](canvas-data-prep.md).

To see visualizations and analytics to explore your data and determine which features to include in your model, you can use Data Wrangler’s built-in analyses. You can also access a **Data Quality and Insights Report** that highlights potential issues with your dataset and provides recommendations for how to fix them. For more information, see [Perform exploratory data analysis (EDA)](canvas-analyses.md).

In addition to the more advanced data preparation and exploration functionality provided through Data Wrangler, Canvas provides some basic features that you can use:
+ To filter your data and access a set of basic data transforms, see [Prepare data for model building](canvas-prepare-data.md).
+ To access simple visualizations and analytics for feature exploration, see [Data exploration and analysis](canvas-explore-data.md).
+ To learn more about additional features such as previewing your model, validating your dataset, and changing the size of the random sample used to build your model, see [Preview your model](canvas-preview-model.md).

For tabular datasets with multiple columns (such as datasets for building categorical, numeric, or time series forecasting model types), you might have rows with missing data points. While Canvas builds the model, it automatically fills in the missing values by using the values in your dataset to compute a mathematical approximation for them. For the highest model accuracy, we recommend adding in the missing data if you can find it. Note that the missing data feature is not supported for text prediction or image prediction models.

**Get started**

To get started with building a custom model, see [Build a model](canvas-build-model-how-to.md) and follow the procedure for the type of model that you want to build.

# Preview your model


**Note**  
The following functionality is only available for custom models built with tabular datasets. Multi-category text prediction models are also excluded.

SageMaker Canvas provides you with a tool to preview your model before you begin building. The preview gives you an estimated accuracy score and a preliminary idea of how each column might impact the model.

To preview the model score, when you're on the **Build** tab of your model, choose **Preview model**.

The model preview generates an **Estimated accuracy** prediction of how well the model might analyze your data. The accuracy of a **Quick build** or a **Standard build** represents how well the model can perform on real data and is generally higher than the **Estimated accuracy**.

The model preview also provides you with the **Column Impact** scores, which can indicate the importance of each column to the model's predictions.

The following screenshot shows a model preview in the Canvas application.

![\[Screenshot of the Build tab for a model in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-build/canvas-build-preview-model.png)


Amazon SageMaker Canvas automatically handles missing values in your dataset while it builds the model. It infers the missing values by using adjacent values that are present in the dataset.

If you're satisfied with your model preview and want to proceed with building a model, then see [Build a model](canvas-build-model-how-to.md).

# Data validation


Before you build your model, SageMaker Canvas checks your dataset for issues that might cause your build to fail. If SageMaker Canvas finds any issues, then it warns you on the **Build** page before you attempt to build a model.

You can choose **Validate data** to see a list of the issues with your dataset. You can then use the SageMaker Canvas [Data Wrangler data preparation features](canvas-data-prep.md), or your own tools, to fix your dataset before starting a build. If you don’t fix the issues with your dataset, then your build fails.

If you make changes to your dataset to fix the issues, you have the option to re-validate your dataset before attempting a build. We recommend that you re-validate your dataset before building.

The following table shows the issues that SageMaker Canvas checks for in your dataset and how to resolve them.


| Issue | Resolution | 
| --- | --- | 
|  Wrong model type for your data  |  Try another model type or use a different dataset.  | 
|  Missing values in your target column  |  Replace the missing values, drop rows with missing values, or use a different dataset.  | 
|  Too many unique labels in your target column  |  Verify that you've used the correct column for your target column, or use a different dataset.  | 
|  Too many non-numeric values in your target column  |  Choose a different target column, select another model type, or use a different dataset.  | 
|  One or more column names contain double underscores  |  Rename the columns to remove any double underscores, and try again.  | 
|  None of the rows in your dataset are complete  |  Replace the missing values, or use a different dataset.  | 
|  Too many unique labels for the number of rows in your data  |  Check that you're using the right target column, increase the number of rows in your dataset, consolidate similar labels, or use a different dataset.  | 

# Random sample


SageMaker Canvas uses the random sampling method to sample your dataset, which means that each row has an equal chance of being picked for the sample. You can choose a column in the preview to get summary statistics for the random sample, such as the mean and the mode.

By default, SageMaker Canvas uses a random sample size of 20,000 rows from your dataset for datasets with more than 20,000 rows. For datasets smaller than 20,000 rows, the default sample size is the number of rows in your dataset. You can increase or decrease the sample size by choosing **Random sample** in the **Build** tab of the SageMaker Canvas application. You can use the slider to select your desired sample size, and then choose **Update** to change the sample size. The maximum sample size you can choose for a dataset is 40,000 rows, and the minimum sample size is 500 rows. If you choose a large sample size, the dataset preview and summary statistics might take a few moments to reload.

The **Build** page shows a preview of 100 rows from your dataset. If the sample size is the same size as your dataset, then the preview uses the first 100 rows of your dataset. Otherwise, the preview uses the first 100 rows of the random sample.
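
The following sketch approximates the Canvas sampling behavior in pandas: a uniform random sample capped at the 20,000-row default, with summary statistics computed on the sample. The file name is a placeholder.

```python
import pandas as pd

df = pd.read_csv("dataset.csv")    # hypothetical dataset

n = min(len(df), 20_000)           # default Canvas sample size
sample = df.sample(n=n, random_state=0)

print(sample.describe())           # summary statistics such as the mean
print(sample.head(100))            # akin to the 100-row preview
```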

# Build a model


The following sections show you how to build a model for each of the main types of custom models.
+ To build numeric prediction, 2 category prediction, or 3+ category prediction models, see [Build a custom numeric or categorical prediction model](#canvas-build-model-numeric-categorical).
+ To build single-label image prediction models, see [Build a custom image prediction model](#canvas-build-model-image).
+ To build multi-category text prediction models, see [Build a custom text prediction model](#canvas-build-model-text).
+ To build time series forecasting models, see [Build a time series forecasting model](#canvas-build-model-forecasting).

**Note**  
If you encounter an error during post-building analysis that tells you to increase your quota for `ml.m5.2xlarge` instances, see [Request a Quota Increase](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-requesting-quota-increases.html).

## Build a custom numeric or categorical prediction model


Numeric and categorical prediction models support both **Quick builds** and **Standard builds**.

To build a numeric or categorical prediction model, use the following procedure:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **My models**.

1. Choose **New model**.

1. In the **Create new model** dialog box, do the following:

   1. Enter a name in the **Model name** field.

   1. Select the **Predictive analysis** problem type.

   1. Choose **Create**.

1. For **Select dataset**, select your dataset from the list of datasets. If you haven’t already imported your data, choose **Import** to be directed through the import data workflow.

1. When you’re ready to begin building your model, choose **Select dataset**.

1. On the **Build** tab, for the **Target column** dropdown list, select the target for your model that you would like to predict.

1. For **Model type**, Canvas automatically detects the problem type for you. If you want to change the type or configure advanced model settings, choose **Configure model**.

   When the **Configure model** dialog box opens, do the following:

   1. For **Model type**, choose the model type that you want to build.

   1. After you choose the model type, there are additional **Advanced settings**. For more information about each of the advanced settings, see [Advanced model building configurations](canvas-advanced-settings.md). To configure the advanced settings, do the following:

      1. (Optional) For the **Objective metric** dropdown menu, select the metric that you want Canvas to optimize while building your model. If you don’t select a metric, Canvas chooses one for you by default. For descriptions of the available metrics, see [Metrics reference](canvas-metrics.md).

      1. For **Training method**, choose **Auto**, **Ensemble**, or **Hyperparameter optimization (HPO) mode**.

      1. For **Algorithms**, select the algorithms that you want to include for building model candidates.

      1. For **Data split**, specify in percentages how you want to split your data between the **Training set** and the **Validation set**. The training set is used for building the model, while the validation set is used for testing the accuracy of model candidates.

      1. For **Max candidates and runtime**, do the following:

         1. Set the **Max candidates** value, or the maximum number of model candidates that Canvas can generate. Note that **Max candidates** is only available in HPO mode.

         1. Set the hour and minute values for **Max job runtime**, or the maximum amount of time that Canvas can spend building your model. After the maximum time, Canvas stops building and selects the best model candidate.

   1. After configuring the advanced settings, choose **Save**.

1. Select or deselect columns in your data to include or drop them from your build.
**Note**  
If you make batch predictions with your model after building, Canvas adds dropped columns to your prediction results. However, Canvas does not add the dropped columns to your batch predictions for time series models.

1. (Optional) Use the visualization and analytics tools that Canvas provides to visualize your data and determine which features you might want to include in your model. For more information, see [Explore and analyze your data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-explore-data.html).

1. (Optional) Use data transformations to clean, transform, and prepare your data for model building. For more information, see [Prepare your data with advanced transformations](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-prepare-data.html). You can view and remove your transforms by choosing **Model recipe** to open the **Model recipe** side panel.

1. (Optional) For additional features such as previewing the accuracy of your model, validating your dataset, and changing the size of the random sample that Canvas takes from your dataset, see [Preview your model](canvas-preview-model.md).

1. After reviewing your data and making any changes to your dataset, choose **Quick build** or **Standard build** to begin a build for your model. The following screenshot shows the **Build** page and the **Quick build** and **Standard build** options.  
![\[The Build page for a 2 category model showing the Quick build and Standard build options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/build-page-tabular-quick-standard-options.png)

After your model begins building, you can leave the page. When the model shows as **Ready** on the **My models** page, it’s ready for analysis and predictions.
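
Canvas builds tabular models on SageMaker AutoML technology, so a comparable job can be expressed with the SageMaker Python SDK. The following is a hedged sketch only; the role ARN, S3 path, column name, and setting values are assumptions, not your build's actual configuration.

```python
from sagemaker.automl.automl import AutoML

# A hedged sketch of a standard build expressed as a SageMaker AutoML job.
automl = AutoML(
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # assumed role
    target_attribute_name="churned",       # the Canvas Target column
    problem_type="BinaryClassification",   # the Canvas Model type
    job_objective={"MetricName": "F1"},    # the Objective metric setting
    max_candidates=100,                    # Max candidates (HPO mode only)
    mode="HYPERPARAMETER_TUNING",          # or "ENSEMBLING"
)
automl.fit(inputs="s3://amzn-s3-demo-bucket/train.csv", wait=False)
```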

## Build a custom image prediction model


Single-label image prediction models support both **Quick builds** and **Standard builds**.

To build a single-label image prediction model, use the following procedure:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **My models**.

1. Choose **New model**.

1. In the **Create new model** dialog box, do the following:

   1. Enter a name in the **Model name** field.

   1. Select the **Image analysis** problem type.

   1. Choose **Create**.

1. For **Select dataset**, select your dataset from the list of datasets. If you haven’t already imported your data, choose **Import** to be directed through the import data workflow.

1. When you’re ready to begin building your model, choose **Select dataset**.

1. On the **Build** tab, you see the **Label distribution** for the images in your dataset. The **Model type** is set to **Single-label image prediction**.

1. On this page, you can preview your images and edit the dataset. If you have any unlabeled images, choose **Edit dataset** and [Assign labels to unlabeled images](canvas-edit-image.md#canvas-edit-image-assign). You can also perform other tasks when you [Edit an image dataset](canvas-edit-image.md), such as renaming labels and adding images to the dataset.

1. After reviewing your data and making any changes to your dataset, choose **Quick build** or **Standard build** to begin a build for your model. The following screenshot shows the **Build** page of an image prediction model that is ready to be built.  
![\[The Build page for a single-label image prediction model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/build-page-image-model.png)

After your model begins building, you can leave the page. When the model shows as **Ready** on the **My models** page, it’s ready for analysis and predictions.

## Build a custom text prediction model


Multi-category text prediction models support both **Quick builds** and **Standard builds**.

To build a text prediction model, use the following procedure:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **My models**.

1. Choose **New model**.

1. In the **Create new model** dialog box, do the following:

   1. Enter a name in the **Model name** field.

   1. Select the **Text analysis** problem type.

   1. Choose **Create**.

1. For **Select dataset**, select your dataset from the list of datasets. If you haven’t already imported your data, choose **Import** to be directed through the import data workflow.

1. When you’re ready to begin building your model, choose **Select dataset**.

1. On the **Build** tab, for the **Target column** dropdown list, select the target for your model that you would like to predict. The target column must have a binary or categorical data type, and there must be at least 25 entries (or rows of data) for each unique label in the target column.

1. For **Model type**, confirm that the model type is automatically set to **Multi-category text prediction**.

1. For the training column, select your source column of text data. This should be the column containing the text that you want to analyze.

1. Choose **Quick build** or **Standard build** to begin building your model. The following screenshot shows the **Build** page of a text prediction model that is ready to be built.  
![\[The Build page for a multi-category text prediction model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/build-page-text-model.png)

After your model begins building, you can leave the page. When the model shows as **Ready** on the **My models** page, it’s ready for analysis and predictions.

## Build a time series forecasting model


Time series forecasting models support both **Quick builds** and **Standard builds**.

To build a time series forecasting model, use the following procedure:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **My models**.

1. Choose **New model**.

1. In the **Create new model** dialog box, do the following:

   1. Enter a name in the **Model name** field.

   1. Select the **Time series forecasting** problem type.

   1. Choose **Create**.

1. For **Select dataset**, select your dataset from the list of datasets. If you haven’t already imported your data, choose **Import** to be directed through the import data workflow.

1. When you’re ready to begin building your model, choose **Select dataset**.

1. On the **Build** tab, for the **Target column** dropdown list, select the target for your model that you would like to predict.

1. In the **Model type** section, choose **Configure model**.

1. The **Configure model** box opens. For the **Time series configuration** section, fill out the following fields:

   1. For **Item ID column**, choose a column in your dataset that uniquely identifies each row. The column should have a data type of `Text`.

   1. (Optional) For **Group column**, choose one or more categorical columns (with a data type of `Text`) that you want to use for grouping your forecasting values.

   1. For **Time stamp column**, select the column with timestamps (in datetime format). For more information about the accepted datetime formats, see [Time Series Forecasts in Amazon SageMaker Canvas](canvas-time-series.md).

   1. For the **Forecast length** field, enter the period of time for which you want to forecast values. Canvas automatically detects the units of time in your data.

   1. (Optional) Turn on the **Use holiday schedule** toggle to select a holiday schedule from various countries and make your forecasts with holiday data more accurate.

1. In the **Configure model** box, there are additional settings in the **Advanced** section. For more information about each of the advanced settings, see [Advanced model building configurations](canvas-advanced-settings.md). To configure the **Advanced** settings, do the following:

   1. For the **Objective metric** dropdown menu, select the metric that you want Canvas to optimize while building your model. If you don’t select a metric, Canvas chooses one for you by default. For descriptions of the available metrics, see [Metrics reference](canvas-metrics.md).

   1. If you’re running a standard build, you’ll see the **Algorithms** section. This section is for selecting the time series forecasting algorithms that you’d like to use for building your model. You can select a subset of the available algorithms, or you can select all of them if you aren’t sure which ones to try.

      When you run your standard build, Canvas builds an ensemble model that combines all of the algorithms together to optimize prediction accuracy.
**Note**  
If you’re running a quick build, Canvas uses a single tree-based learning algorithm to train your model, and you don’t have to select any algorithms.

   1. For **Forecast quantiles**, enter up to 5 comma-separated quantile values to specify the upper and lower bounds of your forecast. A toy illustration of what quantiles represent follows this procedure.

   1. After configuring the **Advanced** settings, choose **Save**.

1. Select or deselect columns in your data to include or drop them from your build.
**Note**  
If you make batch predictions with your model after building, Canvas adds the dropped columns back to your prediction results. However, Canvas does not add the dropped columns back for time series models.

1. (Optional) Use the visualization and analytics tools that Canvas provides to visualize your data and determine which features you might want to include in your model. For more information, see [Explore and analyze your data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-explore-data.html).

1. (Optional) Use data transformations to clean, transform, and prepare your data for model building. For more information, see [ Prepare your data with advanced transformations](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-prepare-data.html). You can view and remove your transforms by choosing **Model recipe** to open the **Model recipe** side panel.

1. (Optional) For additional features such as previewing the accuracy of your model, validating your dataset, and changing the size of the random sample that Canvas takes from your dataset, see [Preview your model](canvas-preview-model.md).

1. After reviewing your data and making any changes to your dataset, choose **Quick build** or **Standard build** to begin a build for your model.

After your model begins building, you can leave the page. When the model shows as **Ready** on the **My models** page, it’s ready for analysis and predictions.

# Advanced model building configurations


Amazon SageMaker Canvas supports various advanced settings that you can configure when building a model. This page lists all of the advanced settings along with additional information about their options and configurations.

**Note**  
The following advanced settings are currently only supported for numeric, categorical, and time series forecasting model types.

## Advanced numeric and categorical prediction model settings


Canvas supports the following advanced settings for numeric and categorical prediction model types.

### Objective metric


The objective metric is the metric that you want Canvas to optimize while building your model. If you don’t select a metric, Canvas chooses one for you by default. For descriptions of the available metrics, see the [Metrics reference](canvas-metrics.md).

### Training method


Canvas can automatically select the training method based on the dataset size, or you can select it manually. The following training methods are available for you to choose from:
+ **Ensembling** – SageMaker AI leverages the AutoGluon library to train several base models. To find the best combination for your dataset, ensemble mode runs 5–10 trials with different model and meta parameter settings. Then, these models are combined using a stacking ensemble method to create an optimal predictive model. For a list of algorithms supported by ensemble mode for tabular data, see the following [Algorithms](#canvas-advanced-settings-predictive-algos) section.
+ **Hyperparameter optimization (HPO)** – SageMaker AI finds the best version of a model by tuning hyperparameters using Bayesian optimization or multi-fidelity optimization while running training jobs on your dataset. HPO mode selects the algorithms that are most relevant to your dataset and selects the best range of hyperparameters to tune your models. To tune your models, HPO mode runs up to 100 trials (default) to find the optimal hyperparameter settings within the selected range. If your dataset size is less than 100 MB, SageMaker AI uses Bayesian optimization. SageMaker AI chooses multi-fidelity optimization if your dataset is larger than 100 MB.

  For a list of algorithms supported by HPO mode for tabular data, see the following [Algorithms](#canvas-advanced-settings-predictive-algos) section.
+ **Auto** – SageMaker AI automatically chooses either ensembling mode or HPO mode based on your dataset size. If your dataset is larger than 100 MB, SageMaker AI chooses HPO mode. Otherwise, it chooses ensembling mode.

### Algorithms


In **Ensembling** mode, Canvas supports the following machine learning algorithms:
+ [LightGBM](https://docs.aws.amazon.com/sagemaker/latest/dg/lightgbm.html) – An optimized framework that uses tree-based algorithms with gradient boosting. This algorithm uses trees that grow in breadth, rather than depth, and is highly optimized for speed.
+ [CatBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/catboost.html) – A framework that uses tree-based algorithms with gradient boosting. Optimized for handling categorical variables.
+ [XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) – A framework that uses tree-based algorithms with gradient boosting that grows in depth, rather than breadth.
+ [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) – A tree-based algorithm that uses several decision trees on random sub-samples of the data with replacement. The trees are split into optimal nodes at each level. The decisions of each tree are averaged together to prevent overfitting and improve predictions.
+ [Extra Trees](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier) – A tree-based algorithm that uses several decision trees on the entire dataset. The trees are split randomly at each level. The decisions of each tree are averaged to prevent overfitting and to improve predictions. Extra trees add a degree of randomization in comparison to the random forest algorithm.
+ [Linear Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) – A framework that uses a linear equation to model the relationship between two variables in observed data.
+ Neural network torch – A neural network model that's implemented using [PyTorch](https://pytorch.org/).
+ Neural network fast.ai – A neural network model that's implemented using [fast.ai](https://www.fast.ai/).

In **HPO mode**, Canvas supports the following machine learning algorithms:
+ [XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) – A supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models.
+ Deep learning algorithm – A multilayer perceptron (MLP), which is a feedforward artificial neural network. This algorithm can handle data that is not linearly separable.

### Data split


You have the option to specify how you want to split your dataset between the training set (the portion of your dataset used for building the model) and the validation set (the portion of your dataset used for verifying the model’s accuracy). For example, a common split ratio is 80% training and 20% validation, where 80% of your data is used to build the model and 20% is reserved for measuring model performance. If you don’t specify a custom ratio, then Canvas splits your dataset automatically.
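For intuition, the split can be reproduced outside Canvas in a few lines. The following is a minimal sketch using pandas and scikit-learn, with an illustrative DataFrame and placeholder column names, not Canvas output:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data; the column names are placeholders.
df = pd.DataFrame({"feature": range(10), "target": [0, 1] * 5})

# 80% of rows build the model; 20% are held out to measure performance.
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)
print(len(train_df), len(val_df))  # 8 2
```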

### Max candidates


**Note**  
This feature is only available in the HPO training mode.

You can specify the maximum number of model candidates that Canvas generates while building your model. We recommend that you use the default number of candidates, which is 100, to build the most accurate models. The maximum number you can specify is 250. Decreasing the number of model candidates may impact your model’s accuracy.

### Max job runtime


You can specify the maximum job runtime, or the maximum amount of time that Canvas spends building your model. After the time limit, Canvas stops building and selects the best model candidate.

The maximum time that you can specify is 720 hours. We highly recommend that you keep the maximum job runtime greater than 30 minutes to ensure that Canvas has enough time to generate model candidates and finish building your model.

## Advanced time series forecasting model settings


For time series forecasting models, Canvas supports the objective metric setting, which is described in the previous section.

Time series forecasting models also support the following advanced setting:

### Algorithm selection


When you build a time series forecasting model, Canvas uses an *ensemble* (or a combination) of statistical and machine learning algorithms to deliver highly accurate time series forecasts. By default, Canvas selects the optimal combination of all the available algorithms based on the time series in your dataset. However, you have the option to specify one or more algorithms to use for your forecasting model. In this case, Canvas determines the best blend using only your selected algorithms. If you're uncertain about which algorithm to select for training your model, we recommend that you choose all of the available algorithms.

**Note**  
Algorithm selection is only supported for standard builds. If you don’t select any algorithms in the advanced settings, then by default SageMaker AI runs a quick build and trains model candidates using a single tree-based learning algorithm. For more information about the difference between quick builds and standard builds, see [How custom models work](canvas-build-model.md).

Canvas supports the following time series forecasting algorithms:
+ [ Autoregressive Integrated Moving Average (ARIMA)](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average) – A simple stochastic time series model that uses statistical analysis to interpret the data and make future predictions. This algorithm is useful for simple datasets with fewer than 100 time series.
+ [ Convolutional Neural Network - Quantile Regression (CNN-QR)](https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-algo-cnnqr.html) – A proprietary, supervised learning algorithm that trains one global model from a large collection of time series and uses a quantile decoder to make predictions. CNN-QR works best with large datasets containing hundreds of time series.
+ [ DeepAR+](https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-recipe-deeparplus.html) – A proprietary, supervised learning algorithm for forecasting scalar time series using recurrent neural networks (RNNs) to train a single model jointly over all of the time series. DeepAR+ works best with large datasets containing hundreds of feature time series.
+ [ Non-Parametric Time Series (NPTS)](https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-recipe-npts.html) – A scalable, probabilistic baseline forecaster that predicts the future value distribution of a given time series by sampling from past observations. NPTS is useful when working with sparse or intermittent time series (for example, forecasting demand for individual items where the time series has many 0s or low counts).
+ [Exponential Smoothing (ETS)](https://en.wikipedia.org/wiki/Exponential_smoothing) – A forecasting method that produces forecasts which are weighted averages of past observations where the weights of older observations exponentially decrease. The algorithm is useful for simple datasets with fewer than 100 time series and datasets with seasonality patterns.
+ [Prophet](https://facebook.github.io/prophet/) – An additive regression model that works best with time series that have strong seasonal effects and several seasons of historical data. The algorithm is useful for datasets with non-linear growth trends that approach a limit.

### Forecast quantiles


For time series forecasting, SageMaker AI trains 6 model candidates with your target time series. Then, SageMaker AI combines these models using a stacking ensemble method to create an optimal forecasting model for a given objective metric. Each forecasting model generates a probabilistic forecast by producing forecasts at quantiles between P1 and P99. These quantiles are used to account for forecast uncertainty. By default, forecasts are generated for 0.1 (`p10`), 0.5 (`p50`), and 0.9 (`p90`). You can choose to specify up to five of your own quantiles from 0.01 (`p1`) to 0.99 (`p99`), by increments of 0.01 or higher.
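To make the quantile bounds concrete, the following sketch computes empirical p10, p50, and p90 values from simulated forecast samples with NumPy. The data is illustrative only and is not how Canvas computes its forecasts:

```python
import numpy as np

# Illustrative only: 1,000 simulated forecasts for one future time step.
rng = np.random.default_rng(0)
forecast_samples = rng.normal(loc=500, scale=40, size=1000)

# The Canvas defaults: p10, p50 (the median), and p90.
p10, p50, p90 = np.quantile(forecast_samples, [0.10, 0.50, 0.90])
print(f"p10={p10:.1f}  p50={p50:.1f}  p90={p90:.1f}")
# Roughly 80% of the simulated outcomes fall between p10 and p90,
# which is how the quantiles express forecast uncertainty.
```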

# Edit an image dataset


In Amazon SageMaker Canvas, you can edit your image datasets and review your labels before building a model. You might want to perform tasks such as assigning labels to unlabeled images or adding more images to the dataset. These tasks can all be done in the Canvas application, providing you with one place to modify your dataset and build a model.

**Note**  
Before building a model, you must assign labels to all images in your dataset. Also, you must have at least 25 images per label and a minimum of two labels. For more information about assigning labels, see the section on this page called **Assign labels to unlabeled images**. If you can’t determine a label for an image, you should delete it from your dataset. For more information about deleting images, see the section on this page [Add or delete images from the dataset](#canvas-edit-image-add-delete).

To begin editing your image dataset, go to the **Build** tab while building your single-label image prediction model.

From the **Build** tab, you can open a page that shows the images in your dataset along with their labels. This page categorizes your image dataset into **Total images**, **Labeled images**, and **Unlabeled images**. You can also review the **Dataset preparation guide** for best practices on building a more accurate image prediction model.

The following screenshot shows the page for editing your image dataset.

![\[Screenshot of the image dataset management page in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/dataset-management-page.png)


From this page, you can do the following actions.

## View the properties for each image (label, size, dimensions)


To view an individual image, you can search for it by file name in the search bar. Then, choose the image to open the full view. You can view the image properties and reassign the image’s label. Choose **Save** when you’re done viewing the image.

## Add, rename, or delete labels in the dataset


Canvas lists the labels for your dataset in the left navigation pane. You can add new labels to the dataset by entering a label in the **Add label** text field.

To rename or delete a label from your dataset, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) next to the label and select either **Rename** or **Delete**. If you rename the label, you can enter the new label name and choose **Confirm**. If you delete the label, the label is removed from all images in your dataset that have that label. Any images with that label are left unlabeled.

## Assign labels to unlabeled images


To view the unlabeled images in your dataset, choose **Unlabeled** in the left navigation pane. Select an image, open the dropdown list titled **Unlabeled**, and choose a label to assign to the image. You can also select multiple images at once, and the label you choose is assigned to all of the selected images.

## Reassign labels to images


You can reassign labels to images by selecting the image (or multiple images at a time) and opening the dropdown titled with the current label. Select your desired label, and the image or images are updated with the new label.

## Sort your images by label


You can view all the images for a given label by choosing the label in the left navigation pane.

## Add or delete images from the dataset


You can add more images to your dataset by choosing **Add images** in the top navigation pane. You’ll be taken through the workflow to import more images. The images you import are added to your existing dataset.

You can delete images from your dataset by selecting them and then choosing **Delete** in the top navigation pane.

**Note**  
After making any changes to your dataset, choose **Save dataset** to make sure that you don’t lose your changes.

# Data exploration and analysis


**Note**  
You can only use SageMaker Canvas visualizations and analytics for models built on tabular datasets. Multi-category text prediction models are also excluded.

In Amazon SageMaker Canvas, you can explore the variables in your dataset using visualizations and analytics and create in-application visualizations and analytics. You can use these explorations to uncover relationships between your variables before building your model.

For more information about visualization techniques in Canvas, see [Explore your data using visualization techniques](canvas-explore-data-visualization.md).

For more information about analytics in Canvas, see [Explore your data using analytics](canvas-explore-data-analytics.md).

# Explore your data using visualization techniques


**Note**  
You can only use SageMaker Canvas visualizations for models built on tabular datasets. Multi-category text prediction models are also excluded.

With Amazon SageMaker Canvas, you can explore and visualize your data to gain insights before building your ML models. You can create scatter plots, bar charts, and box plots, which can help you understand your data and discover the relationships between features that could affect model accuracy.

In the **Build** tab of the SageMaker Canvas application, choose **Data visualizer** to begin creating your visualizations.

You can change the visualization sample size to adjust the size of the random sample taken from your dataset. A sample size that is too large might affect the performance of your data visualizations, so we recommend that you choose an appropriate sample size. To change the sample size, use the following procedure.

1. Choose **Visualization sample**.

1. Use the slider to select your desired sample size.

1. Choose **Update** to confirm the change to your sample size.

**Note**  
Certain visualization techniques require columns of a specific data type. For example, you can only use numeric columns for the x and y-axes of scatter plots.

## Scatter plot


To create a scatter plot with your dataset, choose **Scatter plot** in the **Visualization** panel. Choose the features you want to plot on the x and y-axes from the **Columns** section. You can drag and drop a column onto each axis, or, after you drop a column onto an axis, you can choose a different column from the list of supported columns.

You can use **Color by** to color the data points on the plot with a third feature. You can also use **Group by** to group the data into separate plots based on a fourth feature.

The following image shows a scatter plot that uses **Color by** and **Group by**. In this example, each data point is colored by the `MaritalStatus` feature, and grouping by the `Department` feature results in a scatter plot for the data points of each department.

![\[Screenshot of a scatter plot in the Data visualizer view of the Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-eda-scatter-plot.png)
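If you want to reproduce a similar view outside Canvas, the following is a rough matplotlib sketch. It assumes a pandas DataFrame that contains the example columns (`MaritalStatus`, `Department`) plus two numeric columns; the function and column names here are illustrative and not part of Canvas:

```python
import matplotlib.pyplot as plt
import pandas as pd

def scatter_by_group(df, x, y, color_by, group_by):
    """One scatter subplot per group, with points colored by a category."""
    groups = df[group_by].unique()
    fig, axes = plt.subplots(1, len(groups), figsize=(5 * len(groups), 4),
                             sharex=True, sharey=True, squeeze=False)
    for ax, group in zip(axes[0], groups):
        subset = df[df[group_by] == group]
        # One color per unique value of the color_by column.
        for color_value, points in subset.groupby(color_by):
            ax.scatter(points[x], points[y], label=str(color_value), s=12)
        ax.set_title(f"{group_by} = {group}")
        ax.set_xlabel(x)
    axes[0][0].set_ylabel(y)
    axes[0][0].legend(title=color_by)
    plt.show()

# Example call (assuming matching columns exist in df):
# scatter_by_group(df, "Age", "MonthlyIncome",
#                  color_by="MaritalStatus", group_by="Department")
```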


## Bar chart


To create a bar chart with your dataset, choose **Bar chart** in the **Visualization** panel. Choose the features you want to plot on the x and y-axes from the **Columns** section. You can drag and drop a column onto each axis, or, after you drop a column onto an axis, you can choose a different column from the list of supported columns.

You can use **Group by** to group the bar chart by a third feature. You can use **Stack by** to vertically shade each bar based on the unique values of a fourth feature.

The following image shows a bar chart that uses **Group by** and **Stack by**. In this example, the bar chart is grouped by the `MaritalStatus` feature and stacked by the `JobLevel` feature. For each `JobRole` on the x axis, there is a separate bar for the unique categories in the `MaritalStatus` feature, and every bar is vertically stacked by the `JobLevel` feature.

![\[Screenshot of a bar chart in the Data visualizer view of the Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-eda-bar-chart.png)


## Box plot


To create a box plot with your dataset, choose **Box plot** in the **Visualization** panel. Choose the features you want to plot on the x and y-axes from the **Columns** section. You can drag and drop a column onto each axis, or, after you drop a column onto an axis, you can choose a different column from the list of supported columns.

You can use **Group by** to group the box plots by a third feature.

The following image shows a box plot that uses **Group by**. In this example, the x and y-axes show `JobLevel` and `JobSatisfaction`, respectively, and the colored box plots are grouped by the `Department` feature.

![\[Screenshot of a box plot in the Data visualizer view of the Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-eda-box-plot.png)


# Explore your data using analytics


**Note**  
You can only use SageMaker Canvas analytics for models built on tabular datasets. Multi-category text prediction models are also excluded.

With analytics in Amazon SageMaker Canvas, you can explore your dataset and gain insight on all of your variables before building a model. You can determine the relationships between features in your dataset using correlation matrices. You can use this technique to summarize your dataset into a matrix that shows the correlations between two or more values. This helps you identify and visualize patterns in a given dataset for advanced data analysis.

The matrix shows the correlation between each feature as positive, negative, or neutral. You might want to include features that have a high correlation with each other when building your model. Features that have little to no correlation might be irrelevant to your model, and you can drop those features when building your model.

To get started with correlation matrices in SageMaker Canvas, see the following section.

## Create a correlation matrix


You can create a correlation matrix when you are preparing to build a model in the **Build** tab of the SageMaker Canvas application.

For instructions on how to begin creating a model, see [Build a model](canvas-build-model-how-to.md).

After you’ve started preparing a model in the SageMaker Canvas application, do the following:

1. In the **Build** tab, choose **Data visualizer**.

1. Choose **Analytics**.

1. Choose **Correlation matrix**.

You should see a visualization similar to the following screenshot, which shows up to 15 columns of the dataset organized into a correlation matrix.

![\[Screenshot of a correlation matrix in the Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-correlation-matrix-2.png)


After you’ve created the correlation matrix, you can customize it by doing the following:

### 1. Choose your columns


For **Columns**, you can select the columns that you want to include in the matrix. You can compare up to 15 columns from your dataset.

**Note**  
You can use numeric, categorical, or binary column types for a correlation matrix. The correlation matrix doesn’t support datetime or text data column types.

To add or remove columns from the correlation matrix, select and deselect columns from the **Columns** panel. You can also drag and drop columns from the panel directly onto the matrix. If your dataset has a lot of columns, you can search for the columns you want in the **Search columns** bar.

To filter the columns by data type, choose the dropdown list and select **All**, **Numeric**, or **Categorical**. Selecting **All** shows you all of the columns from your dataset, whereas the **Numeric** and **Categorical** filters only show you the numeric or categorical columns in your dataset. Note that binary column types are included in the numeric or categorical filters.

For the best data insights, include your target column in the correlation matrix. When you include your target column in the correlation matrix, it appears as the last feature on the matrix with a target symbol.

### 2. Choose your correlation type


SageMaker Canvas supports different *correlation types*, or methods for calculating the correlation between your columns.

To change the correlation type, use the **Columns** filter mentioned in the preceding section to filter for your desired column type and columns. You should see the **Correlation type** in the side panel. For numeric comparisons, you have the option to select either **Pearson** or **Spearman**. For categorical comparisons, the correlation type is set as **MI**. For categorical and mixed comparisons, the correlation type is set as **Spearman & MI**.

For matrices that only compare numeric columns, the correlation type is either Pearson or Spearman. The Pearson measure evaluates the linear relationship between two continuous variables. The Spearman measure evaluates the monotonic relationship between two variables. For both Pearson and Spearman, the scale of correlation ranges from -1 to 1, with either end of the scale indicating a perfect correlation (a direct 1:1 relationship) and 0 indicating no correlation. You might want to select Pearson if your data has more linear relationships (as revealed by a [scatter plot visualization](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-explore-data.html#canvas-explore-data-scatterplot)). If your data is not linear, or contains a mixture of linear and monotonic relationships, then you might want to select Spearman.
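As a quick illustration of the difference, pandas computes both measures directly. The data below is illustrative; Canvas performs this calculation for you:

```python
import pandas as pd

# Illustrative data: y is linear in x, while z is monotonic but not linear.
df = pd.DataFrame({"x": range(10),
                   "y": [v * 2 + 1 for v in range(10)],
                   "z": [v ** 3 for v in range(10)]})

print(df.corr(method="pearson"))   # evaluates linear relationships
print(df.corr(method="spearman"))  # evaluates monotonic relationships
# Spearman reports exactly 1.0 for x vs. z, while Pearson reports less
# than 1.0, because x vs. z is monotonic but not linear.
```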

For matrices that only compare categorical columns, the correlation type is set to Mutual Information Classification (MI). The MI value is a measure of the mutual dependence between two random variables. The MI measure is on a scale of 0 to 1, with 0 indicating no correlation and 1 indicating a perfect correlation.
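Canvas computes the MI value for you. For intuition, a closely related measure that is also on a 0 to 1 scale is scikit-learn's normalized mutual information; the following sketch uses illustrative data and is not Canvas's exact calculation:

```python
from sklearn.metrics import normalized_mutual_info_score

# Two categorical columns encoded as label sequences (illustrative data).
department = ["Sales", "Sales", "HR", "HR", "R&D", "R&D"]
job_role = ["Rep", "Rep", "Admin", "Admin", "Engineer", "Engineer"]

# 1.0: knowing the department fully determines the job role here.
print(normalized_mutual_info_score(department, job_role))

# Near 0: this second column is independent of the department.
print(normalized_mutual_info_score(department, ["A", "B"] * 3))
```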

For matrices that compare a mix of numeric and categorical columns, the correlation type **Spearman & MI** is a combination of the Spearman and MI correlation types. For correlations between two numeric columns, the matrix shows the Spearman value. For correlations between a numeric and categorical column or two categorical columns, the matrix shows the MI value.

Lastly, remember that correlation does not necessarily indicate causation. A strong correlation value only indicates that there is a relationship between two variables, but the variables might not have a causal relationship. Carefully review your columns of interest to avoid bias when building your model.

### 3. Filter your correlations


In the side panel, you can use the **Filter correlations** feature to filter for the range of correlation values that you want to include in the matrix. For example, if you want to filter for features that only have positive or neutral correlation, you can set the **Min** to 0 and the **Max** to 1 (valid values are -1 to 1).

For Spearman and Pearson comparisons, you can set the **Filter correlations** range anywhere from -1 to 1, with 0 meaning that there is no correlation. -1 and 1 mean that the variables have a strong negative or positive correlation, respectively.

For MI comparisons, the correlation range only goes from 0 to 1, with 0 meaning that there is no correlation and 1 meaning that the variables have a strong correlation, either positive or negative.

Each feature has a perfect correlation (1) with itself. Therefore, you might notice that the top row of the correlation matrix is always 1. If you want to exclude these values, you can use the filter to set the **Max** less than 1.

Keep in mind that if your matrix compares a mix of numeric and categorical columns and uses the **Spearman & MI** correlation type, then the *categorical x numeric* and *categorical x categorical* correlations (which use the MI measure) are on a scale of 0 to 1, whereas the *numeric x numeric* correlations (which use the Spearman measure) are on a scale of -1 to 1. Review your correlations of interest carefully to ensure that you know the correlation type being used to calculate each value.

### 4. Choose the visualization method


In the side panel, you can use **Visualize by** to change the visualization method of the matrix. Choose the **Numeric** visualization method to show the correlation (Pearson, Spearman, or MI) value, or choose the **Size** visualization method to visualize the correlation with differently sized and colored dots. If you choose **Size**, you can hover over a specific dot on the matrix to see the actual correlation value.

### 5. Choose a color palette


In the side panel, you can use **Color selection** to change the color palette used for the scale of negative to positive correlation in the matrix. Select one of the alternative color palettes to change the colors used in the matrix.

# Prepare data for model building


**Note**  
You can now do advanced data preparation in SageMaker Canvas with Data Wrangler, which provides you with a natural language interface and over 300 built-in transformations. For more information, see [Data preparation](canvas-data-prep.md).

Your machine learning dataset might require data preparation before you build your model. You might want to clean your data due to various issues, which might include missing values or outliers, and perform feature engineering to improve the accuracy of your model. Amazon SageMaker Canvas provides ML data transforms with which you can clean, transform, and prepare your data for model building. You can use these transforms on your datasets without any code. SageMaker Canvas adds the transforms you use to the **Model recipe**, which is a record of the data preparation done on your data before building the model. Any data transforms you use only modify the input data for model building and do not modify your original data source.

The preview of your dataset shows the first 100 rows of the dataset. If your dataset has more than 20,000 rows, Canvas takes a random sample of 20,000 rows and previews the first 100 rows from that sample. You can only search for and specify values from the previewed rows, and the filter functionality only filters the previewed rows and not the entire dataset.

The following transforms are available in SageMaker Canvas for you to prepare your data for building.

**Note**  
You can only use advanced transformations for models built on tabular datasets. Multi-category text prediction models are also excluded.

## Drop columns


You can exclude a column from your model build by dropping it in the **Build** tab of the SageMaker Canvas application. Deselect the column you want to drop, and it isn't included when building the model.

**Note**  
If you drop columns and then make [batch predictions](canvas-make-predictions.md) with your model, SageMaker Canvas adds the dropped columns back to the output dataset available for you to download. However, SageMaker Canvas does not add the dropped columns back for time series models.

## Filter rows


The filter functionality filters the previewed rows (the first 100 rows of your dataset) according to conditions that you specify. Filtering rows creates a temporary preview of the data and does not impact the model building. You can filter to preview rows that have missing values, contain outliers, or meet custom conditions in a column you choose.

### Filter rows by missing values


Missing values are a common occurrence in machine learning datasets. If you have rows with null or empty values in certain columns, you might want to filter for and preview those rows.

To filter missing values from your previewed data, do the following.

1. In the **Build** tab of the SageMaker Canvas application, choose **Filter by rows** (![\[Filter icon in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/filter-icon.png)).

1. Choose the **Column** you want to check for missing values.

1. For the **Operation**, choose **Is missing**.

SageMaker Canvas filters for rows that contain missing values in the **Column** you selected and provides a preview of the filtered rows.

![\[Screenshot of the filter by missing values operation in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-filter-missing.png)


### Filter rows by outliers


Outliers, or rare values in the distribution and range of your data, can negatively impact model accuracy and lead to longer building times. SageMaker Canvas enables you to detect and filter rows that contain outliers in numeric columns. You can choose to define outliers with either standard deviations or a custom range.

To filter for outliers in your data, do the following.

1. In the **Build** tab of the SageMaker Canvas application, choose **Filter by rows** (![\[Filter icon in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/filter-icon.png)).

1. Choose the **Column** you want to check for outliers.

1. For the **Operation**, choose **Is outlier**.

1. Set the **Outlier range** to either **Standard deviation** or **Custom range**.

1. If you choose **Standard deviation**, specify a **SD** (standard deviation) value from 1–3. If you choose **Custom range**, select either **Percentile** or **Number**, and then specify the **Min** and **Max** values.

The **Standard deviation** option detects and filters for outliers in numeric columns using the mean and standard deviation. You specify the number of standard deviations a value must vary from the mean to be considered an outlier. For example, if you specify `3` for **SD**, a value must fall more than 3 standard deviations from the mean to be considered an outlier.

The **Custom range** option detects and filters for outliers in numeric columns using minimum and maximum values. Use this method if you know your threshold values that delimit outliers. You can set the **Type** of the range to either **Percentile** or **Number**. If you choose **Percentile**, the **Min** and **Max** values should be the minimum and maximum of the percentile range (0-100) that you want to allow. If you choose **Number**, the **Min** and **Max** values should be the minimum and maximum numeric values that you want to filter in the data.

![\[Screenshot of the filter by outliers operation in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-filter-outlier.png)
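For intuition, both outlier definitions can be expressed with pandas. The following is a minimal sketch with an illustrative column name and data; Canvas applies the same idea to your previewed rows:

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 12, 11, 13, 9, 250]})  # 250 is an outlier

# Standard deviation method: flag values more than 2 SDs from the mean.
mean, sd = df["price"].mean(), df["price"].std()
outliers_sd = df[(df["price"] - mean).abs() > 2 * sd]

# Custom numeric range method: flag values outside the range [0, 100].
outliers_range = df[(df["price"] < 0) | (df["price"] > 100)]

print(outliers_sd)     # the 250 row
print(outliers_range)  # the 250 row
```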


### Filter rows by custom values


You can filter for rows with values that meet custom conditions. For example, you might want to preview rows that have a price value greater than 100 before removing them. With this functionality, you can filter rows that exceed the threshold you set and preview the filtered data.

To use the custom filter functionality, do the following.

1. In the **Build** tab of the SageMaker Canvas application, choose **Filter by rows** (![\[Filter icon in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/filter-icon.png)).

1. Choose the **Column** you want to check.

1. Select the type of **Operation** you want to use, and then specify the values for the selected condition.

For the **Operation**, you can choose one of the following options. Note that the available operations depend on the data type of the column you choose. For example, you cannot create an `is greater than` operation for a column containing text values.


| Operation | Supported data type | Supported feature type | Function | 
| --- | --- | --- | --- | 
|  Is equal to  |  Numeric, Text  | Binary, Categorical |  Filters rows where the value in **Column** equals the values you specify.  | 
|  Is not equal to  |  Numeric, Text  | Binary, Categorical |  Filters rows where the value in **Column** doesn't equal the values you specify.  | 
|  Is less than  |  Numeric  | N/A |  Filters rows where the value in **Column** is less than the value you specify.  | 
|  Is less than or equal to  |  Numeric  | N/A |  Filters rows where the value in **Column** is less than or equal to the value you specify.  | 
|  Is greater than  |  Numeric  | N/A |  Filters rows where the value in **Column** is greater than the value you specify.  | 
|  Is greater than or equal to  |  Numeric  | N/A |  Filters rows where the value in **Column** is greater than or equal to the value you specify.  | 
|  Is between  |  Numeric  | N/A |  Filters rows where the value in **Column** is between or equal to two values you specify.  | 
|  Contains  |  Text  | Categorical |  Filters rows where the value in **Column** contains a value you specify.  | 
|  Starts with  |  Text  | Categorical |  Filters rows where the value in **Column** begins with a value you specify.  | 
|  Ends with  |  Text  | Categorical |  Filters rows where the value in **Column** ends with a value you specify.  | 

After you set the filter operation, SageMaker Canvas updates the preview of the dataset to show you the filtered data.

![\[Screenshot of the filter by custom values operation in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-filter-custom.png)


## Functions and operators


You can use mathematical functions and operators to explore and distribute your data. You can use the SageMaker Canvas supported functions or create your own formula with your existing data and create a new column with the result of the formula. For example, you can add the corresponding values of two columns and save the result to a new column.

You can nest statements to create more complex functions. The following are some examples of nested functions that you might use; a pandas sketch of equivalent logic appears after the list.
+ To calculate BMI, you could use the function `weight / (height ^ 2)`.
+ To classify ages, you could use the function `Case(age < 18, 'child', age < 65, 'adult', 'senior')`.
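For illustration, the following sketch expresses the same two examples with pandas and NumPy; the column names match the examples above and the data is made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"weight": [70.0, 85.0], "height": [1.75, 1.80],
                   "age": [15, 40]})

# weight / (height ^ 2)
df["bmi"] = df["weight"] / (df["height"] ** 2)

# Case(age < 18, 'child', age < 65, 'adult', 'senior')
df["age_group"] = np.select([df["age"] < 18, df["age"] < 65],
                            ["child", "adult"], default="senior")
print(df)
```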

You can specify functions in the data preparation stage before you build your model. To use a function, do the following.
+ In the **Build** tab of the SageMaker Canvas application, choose **View all** and then choose **Custom formula** to open the **Custom formula** panel.
+ In the **Custom formula** panel, you can choose a **Formula** to add to your **Model Recipe**. Each formula is applied to all of the values in the columns you specify. For formulas that accept two or more columns as arguments, use columns with matching data types; otherwise, you get an error or `null` values in the new column. 
+ After you’ve specified a **Formula**, add a column name in the **New Column Name** field. SageMaker Canvas uses this name for the new column that is created.
+ (Optional) Choose **Preview** to preview your transform.
+ To add the function to your **Model Recipe**, choose **Add**.

SageMaker Canvas saves the result of your function to a new column using the name you specified in **New Column Name**. You can view or remove functions from the **Model Recipe** panel.

SageMaker Canvas supports the following operators for functions. You can use either the text format or the in-line format to specify your function.


| Operator | Description | Supported data types | Text format | In-line format | 
| --- | --- | --- | --- | --- | 
|  Add  |  Returns the sum of the values  |  Numeric  | Add(sales1, sales2) | sales1 + sales2 | 
|  Subtract  |  Returns the difference between the values  |  Numeric  | Subtract(sales1, sales2) | sales1 - sales2 | 
|  Multiply  |  Returns the product of the values  |  Numeric  | Multiply(sales1, sales2) | sales1 \* sales2 | 
|  Divide  |  Returns the quotient of the values  |  Numeric  | Divide(sales1, sales2) | sales1 / sales2 | 
|  Mod  |  Returns the result of the modulo operator (the remainder after dividing the two values)  |  Numeric  | Mod(sales1, sales2) | sales1 % sales2 | 
|  Abs  | Returns the absolute value of the value |  Numeric  | Abs(sales1) | N/A | 
|  Negate  | Returns the negative of the value |  Numeric  | Negate(c1) | -c1 | 
|  Exp  |  Returns e (Euler's number) raised to the power of the value  |  Numeric  | Exp(sales1) | N/A | 
|  Log  |  Returns the logarithm (base 10) of the value  |  Numeric  | Log(sales1) | N/A | 
|  Ln  |  Returns the natural logarithm (base e) of the value  |  Numeric  | Ln(sales1) | N/A | 
|  Pow  |  Returns the value raised to a power  |  Numeric  | Pow(sales1, 2) | sales1 ^ 2 | 
|  If  |  Returns a true or false label based on a condition you specify  |  Boolean, Numeric, Text  | If(sales1 > 7000, 'truelabel', 'falselabel') | N/A | 
|  Or  |  Returns a Boolean value of whether one of the specified values or conditions is true or not  |  Boolean  | Or(fullprice, discount) | fullprice \|\| discount | 
|  And  |  Returns a Boolean value of whether two of the specified values or conditions are true or not  |  Boolean  | And(sales1, sales2) | sales1 && sales2 | 
|  Not  |  Returns a Boolean value that is the opposite of the specified value or conditions  |  Boolean  | Not(sales1) | !sales1 | 
|  Case  |  Returns a value based on conditional statements (returns c1 if cond1 is true, returns c2 if cond2 is true, else returns c3)  |  Boolean, Numeric, Text  | Case(cond1, c1, cond2, c2, c3) | N/A | 
|  Equal  |  Returns a Boolean value of whether two values are equal  |  Boolean, Numeric, Text  | N/A | c1 = c2 or c1 == c2 | 
|  Not equal  |  Returns a Boolean value of whether two values are not equal  |  Boolean, Numeric, Text  | N/A | c1 != c2 | 
|  Less than  |  Returns a Boolean value of whether c1 is less than c2  |  Boolean, Numeric, Text  | N/A | c1 < c2 | 
|  Greater than  |  Returns a Boolean value of whether c1 is greater than c2  |  Boolean, Numeric, Text  | N/A | c1 > c2 | 
|  Less than or equal  |  Returns a Boolean value of whether c1 is less than or equal to c2  |  Boolean, Numeric, Text  | N/A | c1 <= c2 | 
|  Greater than or equal  |  Returns a Boolean value of whether c1 is greater than or equal to c2  |  Boolean, Numeric, Text  | N/A | c1 >= c2 | 

SageMaker Canvas also supports aggregate operators, which can perform operations such as calculating the sum of all the values or finding the minimum value in a column. You can use aggregate operators in combination with standard operators in your functions. For example, to calculate the difference of values from the mean, you could use the function `Abs(height - avg(height))`. SageMaker Canvas supports the following aggregate operators.


| Aggregate operator | Description | Format | Example | 
| --- | --- | --- | --- | 
|  sum  |  Returns the sum of all the values in a column  | sum | sum(c1) | 
|  minimum  |  Returns the minimum value of a column  | min | min(c2) | 
|  maximum  |  Returns the maximum value of a column  | max | max(c3) | 
|  average  |  Returns the average value of a column  | avg | avg(c4) | 
|  std  | Returns the sample standard deviation of a column | std | std(c1) | 
|  stddev  | Returns the standard deviation of the values in a column | stddev | stddev(c1) | 
|  variance  | Returns the unbiased variance of the values in a column | variance | variance(c1) | 
|  approx\_count\_distinct  | Returns the approximate number of distinct items in a column | approx\_count\_distinct | approx\_count\_distinct(c1) | 
|  count  | Returns the number of items in a column | count | count(c1) | 
|  first  |  Returns the first value of a column  | first | first(c1) | 
|  last  |  Returns the last value of a column  | last | last(c1) | 
|  stddev\_pop  | Returns the population standard deviation of a column | stddev\_pop | stddev\_pop(c1) | 
|  variance\_pop  |  Returns the population variance of the values in a column  | variance\_pop | variance\_pop(c1) | 
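As an illustration, the earlier `Abs(height - avg(height))` example maps naturally onto pandas; the data below is made up:

```python
import pandas as pd

df = pd.DataFrame({"height": [150.0, 160.0, 170.0, 180.0]})

# Abs(height - avg(height)): distance of each value from the column mean.
df["dist_from_mean"] = (df["height"] - df["height"].mean()).abs()
print(df)  # avg(height) is 165, so the new column holds 15, 5, 5, 15
```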

## Manage rows


With the Manage rows transforms, you can sort, randomly shuffle, and remove rows of data from the dataset.

### Sort rows


To sort the rows in a dataset by a given column, do the following.

1. In the **Build** tab of the SageMaker Canvas application, choose **Manage rows** and then choose **Sort rows**.

1. For **Sort Column**, choose the column you want to sort by.

1. For **Sort Order**, choose either **Ascending** or **Descending**.

1. Choose **Add** to add the transform to the **Model recipe**.

### Shuffle rows


To randomly shuffle the rows in a dataset, do the following.

1. In the **Build** tab of the SageMaker Canvas application, choose **Manage rows** and then choose **Shuffle rows**.

1. Choose **Add** to add the transform to the **Model recipe**.

### Drop duplicate rows


To remove duplicate rows in a dataset, do the following.

1. In the **Build** tab of the SageMaker Canvas application, choose **Manage rows** and then choose **Drop duplicate rows**.

1. Choose **Add** to add the transform to the **Model recipe**.

### Remove rows by missing values


Missing values are a common occurrence in machine learning datasets and can impact model accuracy. Use this transform if you want to drop rows with null or empty values in certain columns.

To remove rows that contain missing values in a specified column, do the following.

1. In the **Build** tab of the SageMaker Canvas application, choose **Manage rows**.

1. Choose **Drop rows by missing values**.

1. Choose **Add** to add the transform to the **Model recipe**.

SageMaker Canvas drops rows that contain missing values in the **Column** you selected. After removing the rows from the dataset, SageMaker Canvas adds the transform in the **Model recipe** section. If you remove the transform from the **Model recipe** section, the rows return to your dataset.

![\[Screenshot of the remove rows by missing values operation in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-remove-missing.png)


### Remove rows by outliers


Outliers, or rare values in the distribution and range of your data, can negatively impact model accuracy and lead to longer building times. With SageMaker Canvas, you can detect and remove rows that contain outliers in numeric columns. You can choose to define outliers with either standard deviations or a custom range.

To remove outliers from your data, do the following.

1. In the **Build** tab of the SageMaker Canvas application, choose **Manage rows**.

1. Choose **Drop rows by outlier values**.

1. Choose the **Column** you want to check for outliers.

1. Set the **Operator** to **Standard deviation**, **Custom numeric range**, or **Custom quantile range**.

1. If you choose **Standard deviation**, specify a **Standard deviations** (standard deviation) value from 1–3. If you choose **Custom numeric range** or **Custom quantile range**, specify the **Min** and **Max** values (numbers for numeric ranges, or percentiles between 0–100% for quantile ranges).

1. Choose **Add** to add the transform to the **Model recipe**.

The **Standard deviation** option detects and removes outliers in numeric columns using the mean and standard deviation. You specify the number of standard deviations a value must vary from the mean to be considered an outlier. For example, if you specify `3` for **Standard deviations**, a value must fall more than 3 standard deviations from the mean to be considered an outlier.

The **Custom numeric range** and **Custom quantile range** options detect and remove outliers in numeric columns using minimum and maximum values. Use this method if you know your threshold values that delimit outliers. If you choose a numeric range, the **Min** and **Max** values should be the minimum and maximum numeric values that you want to allow in the data. If you choose a quantile range, the **Min** and **Max** values should be the minimum and maximum of the percentile range (0–100) that you want to allow.

After removing the rows from the dataset, SageMaker Canvas adds the transform in the **Model recipe** section. If you remove the transform from the **Model recipe** section, the rows return to your dataset.

![\[Screenshot of the remove rows by outliers operation in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-remove-outlier.png)


### Remove rows by custom values


You can remove rows with values that meet custom conditions. For example, you might want to exclude all of the rows with a price value greater than 100 when building your model. With this transform, you can create a rule that removes all rows that exceed the threshold you set.

To use the custom remove transform, do the following.

1. In the **Build** tab of the SageMaker Canvas application, choose **Manage rows**.

1. Choose **Drop rows by formula**.

1. Choose the **Column** you want to check.

1. Select the type of **Operation** you want to use, and then specify the values for the selected condition.

1. Choose **Add** to add the transform to the **Model recipe**.

For the **Operation**, you can choose one of the following options. Note that the available operations depend on the data type of the column you choose. For example, you cannot create an `is greater than` operation for a column containing text values.


| Operation | Supported data type | Supported feature type | Function | 
| --- | --- | --- | --- | 
|  Is equal to  |  Numeric, Text  |  Binary, Categorical  |  Removes rows where the value in **Column** equals the values you specify.  | 
|  Is not equal to  |  Numeric, Text  |  Binary, Categorical  |  Removes rows where the value in **Column** doesn't equal the values you specify.  | 
|  Is less than  |  Numeric  | N/A |  Removes rows where the value in **Column** is less than the value you specify.  | 
|  Is less than or equal to  |  Numeric  | N/A |  Removes rows where the value in **Column** is less than or equal to the value you specify.  | 
|  Is greater than  |  Numeric  | N/A |  Removes rows where the value in **Column** is greater than the value you specify.  | 
|  Is greater than or equal to  | Numeric | N/A |  Removes rows where the value in **Column** is greater than or equal to the value you specify.  | 
|  Is between  | Numeric | N/A |  Removes rows where the value in **Column** is between or equal to two values you specify.  | 
|  Contains  |  Text  | Categorical |  Removes rows where the value in **Column** contains a value you specify.  | 
|  Starts with  |  Text  | Categorical |  Removes rows where the value in **Column** begins with a value you specify.  | 
|  Ends with  |  Text  | Categorical |  Removes rows where the value in **Column** ends with a value you specify.  | 

After removing the rows from the dataset, SageMaker Canvas adds the transform in the **Model recipe** section. If you remove the transform from the **Model recipe** section, the rows return to your dataset.

![\[Screenshot of the remove rows by custom values operation in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-remove-custom.png)


## Rename columns


With the rename columns transform, you can rename columns in your data. When you rename a column, SageMaker Canvas changes the column name in the model input.

You can rename a column in your dataset by double-clicking on the column name in the **Build** tab of the SageMaker Canvas application and entering a new name. Pressing the **Enter** key submits the change, and clicking anywhere outside the input cancels the change. You can also rename a column by clicking the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), located at the end of the row in list view or at the end of the header cell in grid view, and choosing **Rename**.

Your column name can’t be longer than 32 characters or contain double underscores (\_\_), and you can’t rename a column to the same name as another column. You also can’t rename a dropped column.

The following screenshot shows how to rename a column by double-clicking the column name.

![\[Screenshot of renaming a column with the double-click method in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-rename-column.png)


When you rename a column, SageMaker Canvas adds the transform in the **Model recipe** section. If you remove the transform from the **Model recipe** section, the column reverts to its original name.

## Manage columns


With the following transforms, you can change the data type of columns and replace missing values or outliers for specific columns. SageMaker Canvas uses the updated data types or values when building your model but doesn’t change your original dataset. Note that if you've dropped a column from your dataset using the [Drop columns](#canvas-prepare-data-drop) transform, you can't replace values in that column.

### Replace missing values


Missing values are a common occurrence in machine learning datasets and can impact model accuracy. You can choose to drop rows that have missing values, but your model is more accurate if you replace the missing values instead. With this transform, you can replace missing values in numeric columns with the mean or median of the data in the column, or you can specify a custom value with which to replace missing values. For non-numeric columns, you can replace missing values with the mode (the most common value) of the column or a custom value.

Use this transform if you want to replace the null or empty values in certain columns. To replace missing values in a specified column, do the following. 

1. In the **Build** tab of the SageMaker Canvas application, choose **Manage columns**.

1. Choose **Replace missing values**.

1. Choose the **Column** in which you want to replace missing values.

1. Set **Mode** to **Manual** to replace missing values with values that you specify. With the **Automatic (default)** setting, SageMaker Canvas replaces missing values with imputed values that best fit your data. This imputation method is done automatically for each model build, unless you specify the **Manual** mode.

1. Set the **Replace with** value:
   + If your column is numeric, then select **Mean**, **Median**, or **Custom**. **Mean** replaces missing values with the mean for the column, and **Median** replaces missing values with the median for the column. If you choose **Custom**, then specify a custom value that you want to use to replace missing values.
   + If your column is non-numeric, then select **Mode** or **Custom**. **Mode** replaces missing values with the mode, or the most common value, for the column. If you choose **Custom**, then specify a custom value that you want to use to replace missing values.

1. Choose **Add** to add the transform to the **Model recipe**.

After replacing the missing values in the dataset, SageMaker Canvas adds the transform in the **Model recipe** section. If you remove the transform from the **Model recipe** section, the missing values return to the dataset.

![\[Screenshot of the replace missing values operation in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-replace-missing.png)
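For intuition, the replacement strategies map onto pandas `fillna` calls. The following is a minimal sketch with illustrative column names and data; Canvas applies the equivalent logic to your model input:

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, None, 30.0],
                   "color": ["red", None, "red"]})

df["price_mean"] = df["price"].fillna(df["price"].mean())      # Mean
df["price_median"] = df["price"].fillna(df["price"].median())  # Median
df["price_custom"] = df["price"].fillna(0)                     # Custom value
df["color_mode"] = df["color"].fillna(df["color"].mode()[0])   # Mode
print(df)
```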


### Replace outliers


Outliers, or rare values in the distribution and range of your data, can negatively impact model accuracy and lead to longer building times. SageMaker Canvas enables you to detect outliers in numeric columns and replace the outliers with values that lie within an accepted range in your data. You can choose to define outliers with either standard deviations or a custom range, and you can replace outliers with the minimum and maximum values in the accepted range.

To replace outliers in your data, do the following.

1. In the **Build** tab of the SageMaker Canvas application, choose **Manage columns**.

1. Choose **Replace outlier values**.

1. Choose the **Column** in which you want to replace outliers.

1. For **Define outliers**, choose **Standard deviation**, **Custom numeric range**, or **Custom quantile range**.

1. If you choose **Standard deviation**, specify a **Standard deviations** (standard deviation) value from 1–3. If you choose **Custom numeric range** or **Custom quantile range**, specify the **Min** and **Max** values (numbers for numeric ranges, or percentiles between 0–100% for quantile ranges).

1. For **Replace with**, select **Min/max range**.

1. Choose **Add** to add the transform to the **Model recipe**.

The **Standard deviation** option detects outliers in numeric columns using the mean and standard deviation. You specify the number of standard deviations a value must vary from the mean to be considered an outlier. For example, if you specify 3 for **Standard deviations**, a value must fall more than 3 standard deviations from the mean to be considered an outlier. SageMaker Canvas replaces outliers with the minimum value or maximum value in the accepted range. For example, if you configure the standard deviations to only include values from 200–300, then SageMaker Canvas changes a value of 198 to 200 (the minimum).

The **Custom numeric range** and **Custom quantile range** options detect outliers in numeric columns using minimum and maximum values. Use this method if you know the threshold values that delimit outliers. If you choose a numeric range, the **Min** and **Max** values should be the minimum and maximum numeric values that you want to allow. SageMaker Canvas replaces any value that falls outside of the range with the nearest bound. For example, if your range only allows values from 1–100, then SageMaker Canvas changes a value of 102 to 100 (the maximum). If you choose a quantile range, the **Min** and **Max** values should be the minimum and maximum of the percentile range (0–100) that you want to allow.
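Outside of Canvas, you could approximate both outlier definitions with pandas `clip`, which replaces values outside an accepted range with the nearest bound. The following is a minimal sketch with hypothetical values:

```python
import pandas as pd

s = pd.Series([198, 205, 240, 260, 300, 1025])  # numeric column with one extreme value

# Standard deviation method: accept values within n standard deviations of the mean.
n = 3
lo, hi = s.mean() - n * s.std(), s.mean() + n * s.std()
print(s.clip(lower=lo, upper=hi))

# Custom quantile range method: accept values between the 5th and 95th percentiles.
print(s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95)))
```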

After replacing the values in the dataset, SageMaker Canvas adds the transform in the **Model recipe** section. If you remove the transform from the **Model recipe** section, the original values return to the dataset.

![\[Screenshot of the replace outliers operation in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-replace-outlier.png)


### Change data type


SageMaker Canvas provides you with the ability to change the *data type* of your columns between numeric, text, and datetime, while also displaying the associated *feature type* for that data type. A *data type* refers to the format of the data and how it is stored, while a *feature type* refers to the characteristic of the data used in machine learning algorithms, such as binary or categorical. This gives you the flexibility to manually change the data type of your columns. Choosing the right data type helps ensure data integrity and accuracy before you build models, and Canvas uses these data types when building models.

**Note**  
Currently, changing the feature type (for example, from binary to categorical) is not supported.

The following table lists all of the supported data types in Canvas.


| Data type | Description | Example | 
| --- | --- | --- | 
| Numeric | Numeric data represents numerical values | 1, 2, 3; 1.1, 1.2, 1.3 | 
| Text | Text data represents sequences of characters, like names or descriptions | A, B, C, D; apple, banana, orange; 1A+, 2A+, 3A+ | 
| Datetime | Datetime data represents dates and times in timestamp format | 2019-07-01 01:00:00, 2019-07-01 02:00:00, 2019-07-01 03:00:00 | 

The following table lists all of the supported feature types in Canvas.


| Feature type | Description | Example | 
| --- | --- | --- | 
| Binary | Binary features represent two possible values | 0, 1, 0, 1, 0 (2 distinct values); true, false, true (2 distinct values) | 
| Categorical | Categorical features represent distinct categories or groups | apple, banana, orange, apple (3 distinct values); A, B, C, D, E, A, D, C (5 distinct values) | 

To modify the data type of a column in a dataset, do the following.

1. In the **Build** tab of the SageMaker Canvas application, go to the **Column view** or **Grid view** and select the **Data type** dropdown for the specific column.

1. In the **Data type** dropdown, choose the data type to convert to. The following screenshot shows the dropdown menu.  
![\[The data type conversion dropdown menu for a column, shown in the Build tab.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-prepare-data-change.png)

1. For **Column**, choose or verify the column you want to change the data type for.

1. For **New data type**, choose or verify the new data type you want to convert to.

1. If the **New data type** is `Datetime` or `Numeric`, choose one of the following options under **Handle invalid values**:

   1. **Replace with empty value** – Invalid values are substituted with an empty value.

   1. **Delete rows** – Rows with an invalid value are removed from the dataset.

   1. **Replace with custom value** – Invalid values are substituted with the **Custom Value** that you specify.

1. Choose **Add** to add the transform to the **Model recipe**.

The data type for your column should now be updated.
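The invalid-value options behave much like a coerced type conversion in pandas. The following is a small illustrative sketch with a hypothetical column:

```python
import pandas as pd

df = pd.DataFrame({"amount": ["10", "20", "oops", "40"]})

# Convert text to numeric; invalid values become NaN ("Replace with empty value").
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# "Delete rows": drop the rows whose conversion failed.
dropped = df.dropna(subset=["amount"])

# "Replace with custom value": substitute a value that you choose, for example 0.
custom = df.fillna({"amount": 0})

# Datetime conversions behave the same way with pd.to_datetime(..., errors="coerce").
```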

## Prepare time series data


Use the following functionalities to prepare your time series data for building time series forecasting models.

### Resample time series data


By resampling time series data, you can establish regular intervals for the observations in your time series dataset. This is particularly useful when working with time series data containing irregularly spaced observations. For instance, you can use resampling to transform a dataset whose observations were recorded at irregular one-, two-, and three-hour intervals into a dataset with a regular one-hour interval between observations. Forecasting algorithms require the observations to be taken at regular intervals.

To resample time series data, do the following.

1. In the **Build** tab of the SageMaker Canvas application, choose **Time series**.

1. Choose **Resample**.

1. For **Timestamp column**, choose the column you want to apply the transform to. You can only select columns of the **Datetime** type.

1. In the **Frequency settings** section, choose a **Frequency** and **Rate**. **Frequency** is the unit of frequency, and **Rate** is the interval of that unit to be applied to the column. For example, choosing `Calendar Day` for **Frequency** and `1` for **Rate** sets the interval to every 1 calendar day, such as `2023-03-26 00:00:00`, `2023-03-27 00:00:00`, `2023-03-28 00:00:00`. See the table after this procedure for a complete list of **Frequency** values. 

1. Choose **Add** to add the transform to the **Model recipe**.

The following table lists all of the **Frequency** types you can select when resampling time series data.


| Frequency | Description | Example values (assuming Rate is 1) | 
| --- | --- | --- | 
|  Business Day  |  Resample observations in the datetime column to 5 business days of the week (Monday, Tuesday, Wednesday, Thursday, Friday)  |  2023-03-24 00:00:00 2023-03-27 00:00:00 2023-03-28 00:00:00 2023-03-29 00:00:00 2023-03-30 00:00:00 2023-03-31 00:00:00 2023-04-03 00:00:00  | 
|  Calendar Day  |  Resample observations in the datetime column to all 7 days of the week (Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday)  |  2023-03-26 00:00:00 2023-03-27 00:00:00 2023-03-28 00:00:00 2023-03-29 00:00:00 2023-03-30 00:00:00 2023-03-31 00:00:00 2023-04-01 00:00:00  | 
|  Week  |  Resample observations in the datetime column to the first day of each week  |  2023-03-13 00:00:00 2023-03-20 00:00:00 2023-03-27 00:00:00 2023-04-03 00:00:00  | 
|  Month  |  Resample observations in the datetime column to the first day of each month  |  2023-03-01 00:00:00 2023-04-01 00:00:00 2023-05-01 00:00:00 2023-06-01 00:00:00  | 
|  Annual Quarter  |  Resample observations in the datetime column to the last day of each quarter  |  2023-03-31 00:00:00 2023-06-30 00:00:00 2023-09-30 00:00:00 2023-12-31 00:00:00  | 
|  Year  |  Resample observations in the datetime column to the last day of each year  |  2022-12-31 00:00:00 2023-12-31 00:00:00 2024-12-31 00:00:00  | 
|  Hour  |  Resample observations in the datetime column to each hour of each day  |  2023-03-24 00:00:00 2023-03-24 01:00:00 2023-03-24 02:00:00 2023-03-24 03:00:00  | 
|  Minute  |  Resample observations in the datetime column to each minute of each hour  |  2023-03-24 00:00:00 2023-03-24 00:01:00 2023-03-24 00:02:00 2023-03-24 00:03:00  | 
|  Second  |  Resample observations in the datetime column to each second of each minute  |  2023-03-24 00:00:00 2023-03-24 00:00:01 2023-03-24 00:00:02 2023-03-24 00:00:03  | 

When applying the resampling transform, you can use the **Advanced** option to specify how the resulting values of the rest of the columns (other than the timestamp column) in your dataset are modified. This can be achieved by specifying the resampling methodology, which can either be downsampling or upsampling for both numeric and non-numeric columns.

*Downsampling* increases the interval between observations in the dataset. For example, if you downsample hourly observations to a two-hour interval, each observation in your dataset is taken every two hours. The values of the other columns within each new interval are aggregated into a single value using a combination method. The following tables show an example of downsampling time series data from every hour to every two hours, using the mean as the combination method.

The following table shows the hourly temperature readings over a day before downsampling.


| Timestamp | Temperature (Celsius) | 
| --- | --- | 
| 12:00 pm | 30 | 
| 1:00 pm | 32 | 
| 2:00 pm | 35 | 
| 3:00 pm | 32 | 
| 4:00 pm | 30 | 

The following table shows the temperature readings after downsampling to every two hours.


| Timestamp | Temperature (Celsius) | 
| --- | --- | 
| 12:00 pm | 31 | 
| 2:00 pm | 33.5 | 
| 4:00 pm | 30 | 
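If you want to verify the arithmetic in this example outside of Canvas, the following pandas sketch reproduces it (assuming the readings start at noon on an arbitrary date):

```python
import pandas as pd

idx = pd.date_range("2023-03-24 12:00", periods=5, freq="h")
temps = pd.Series([30, 32, 35, 32, 30], index=idx, name="temperature")

# Downsample from hourly to 2-hour observations, combining values with the mean.
print(temps.resample("2h").mean())
# 12:00 pm -> mean(30, 32) = 31.0
# 2:00 pm  -> mean(35, 32) = 33.5
# 4:00 pm  -> 30.0
```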

To downsample time series data, do the following:

1. Expand the **Advanced** section under the **Resample** transform.

1. Choose **Non-numeric combination** to specify the combination method for non-numeric columns. See the table after this procedure for a complete list of combination methods.

1. Choose **Numeric combination** to specify the combination method for numeric columns. See the table after this procedure for a complete list of combination methods.

If you don’t specify combination methods, the default values are `Most Common` for **Non-numeric combination** and `Mean` for **Numeric combination**. The following table lists the methods for numeric and non-numeric combination.


| Downsampling methodology | Combination method | Description | 
| --- | --- | --- | 
| Non-numeric combination | Most Common | Aggregate values in the non-numeric column by taking the most commonly occurring value | 
| Non-numeric combination | Last | Aggregate values in the non-numeric column by taking the last value in the column | 
| Non-numeric combination | First | Aggregate values in the non-numeric column by taking the first value in the column | 
| Numeric combination | Mean | Aggregate values in the numeric column by taking the mean of all the values in the column | 
| Numeric combination | Median | Aggregate values in the numeric column by taking the median of all the values in the column | 
| Numeric combination | Min | Aggregate values in the numeric column by taking the minimum of all the values in the column | 
| Numeric combination | Max | Aggregate values in the numeric column by taking the maximum of all the values in the column | 
| Numeric combination | Sum | Aggregate values in the numeric column by adding all the values in the column | 
| Numeric combination | Quantile | Aggregate values in the numeric column by taking a quantile of all the values in the column | 
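As an illustration of how different combination methods can apply to different columns during the same downsampling pass, here is a small pandas sketch with hypothetical columns (`Most Common` approximated by the mode, `Mean` by the mean):

```python
import pandas as pd

idx = pd.date_range("2023-03-24 00:00", periods=4, freq="h")
df = pd.DataFrame(
    {"store": ["A", "A", "B", "A"], "sales": [10.0, 12.0, 11.0, 9.0]}, index=idx
)

# Downsample to 2-hour intervals: most common value for the non-numeric column,
# mean for the numeric column.
print(df.resample("2h").agg({"store": lambda s: s.mode().iloc[0], "sales": "mean"}))
```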

*Upsampling* reduces the interval between observations in the dataset. For example, if you upsample observations taken every two hours into hourly observations, the values of the other columns for the new hourly observations are interpolated from the two-hour observations.

To upsample time series data, do the following:

1. Expand the **Advanced** section under the **Resample** transform.

1. Choose **Non-numeric estimation** to specify the estimation method for non-numeric columns. See the table after this procedure for a complete list of methods.

1. Choose **Numeric estimation** to specify the estimation method for numeric columns. See the table after this procedure for a complete list of methods.

1. (Optional) Choose **ID Column** to specify the column that contains the IDs of the observations in your time series. Specify this option if your dataset contains more than one time series. If your dataset contains only one time series, don't specify a value for this field. For example, you can have a dataset with the columns `id` and `purchase`. The `id` column has the values `[1, 2, 2, 1]`, and the `purchase` column has the values `[$2, $3, $4, $1]`. This dataset therefore contains two time series: `1: [$2, $1]` and `2: [$3, $4]`.

If you don’t specify estimation methods, the default values are `Forward Fill` for **Non-numeric estimation** and `Linear` for **Numeric estimation**. The following table lists the methods for estimation.


| Upsampling methodology | Estimation method | Description | 
| --- | --- | --- | 
| Non-numeric estimation | Forward Fill | Fill the new observations in the non-numeric column by carrying the last observed value forward | 
| Non-numeric estimation | Backward Fill | Fill the new observations in the non-numeric column by carrying the next observed value backward | 
| Non-numeric estimation | Keep Missing | Leave the new observations in the non-numeric column empty | 
| Numeric estimation | Linear, Time, Index, Zero, S-Linear, Nearest, Quadratic, Cubic, Barycentric, Polynomial, Krogh, Piecewise Polynomial, Spline, P-chip, Akima, Cubic Spline, From Derivatives | Interpolate values in the numeric column by using the specified interpolator. For information on interpolation methods, see [pandas.DataFrame.interpolate](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html) in the pandas documentation. | 
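The following is a minimal pandas sketch of upsampling, using linear interpolation for a numeric column and forward fill for a non-numeric column (the column names are hypothetical):

```python
import pandas as pd

idx = pd.date_range("2023-03-24 00:00", periods=3, freq="2h")
df = pd.DataFrame({"status": ["ok", "ok", "warn"], "temp": [30.0, 34.0, 31.0]}, index=idx)

# Upsample from 2-hour to hourly observations; the new rows start out empty.
hourly = df.resample("h").asfreq()

# Numeric estimation: interpolate the new rows (linear interpolation shown here).
hourly["temp"] = hourly["temp"].interpolate(method="linear")

# Non-numeric estimation: forward fill carries the previous observed value forward.
hourly["status"] = hourly["status"].ffill()

print(hourly)
```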

The following screenshot shows the **Advanced** settings with the fields for downsampling and upsampling filled out.

![\[The Canvas application, with the time series resampling side panel showing the advanced options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-prepare-data-resampling.png)


### Use datetime extraction


With the datetime extraction transform, you can extract values from a datetime column to a separate column. For example, if you have a column containing dates of purchases, you can extract the month value to a separate column and use the new column when building your model. You can also extract multiple values to separate columns with a single transform.

Your datetime column must use a supported timestamp format. For a list of the formats that SageMaker Canvas supports, see [Time Series Forecasts in Amazon SageMaker Canvas](canvas-time-series.md). If your dataset does not use one of the supported formats, update your dataset to use a supported timestamp format and re-import it to Amazon SageMaker Canvas before building your model.

To perform a datetime extraction, do the following.

1. In the **Build** tab of the SageMaker Canvas application, on the transforms bar, choose **View all**.

1. Choose **Extract features**.

1. Choose the **Timestamp column** from which you want to extract values.

1. For **Values**, select one or more values to extract from the column. The values you can extract from a timestamp column are **Year**, **Month**, **Day**, **Hour**, **Week of year**, **Day of year**, and **Quarter**.

1. (Optional) Choose **Preview** to preview the transform results.

1. Choose **Add** to add the transform to the **Model recipe**.

SageMaker Canvas creates a new column in the dataset for each of the values you extract. Except for **Year** values, SageMaker Canvas uses a 0-based encoding for the extracted values. For example, if you extract the **Month** value, January is extracted as 0, and February is extracted as 1.
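Outside of Canvas, the same extraction could be sketched with the pandas `dt` accessor. Note that pandas encodes months and days starting at 1, so the sketch subtracts 1 to match the 0-based encoding described above (applying that 0-basing uniformly to every value other than **Year** is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"purchase_date": pd.to_datetime(["2023-01-15", "2023-08-20"])})

dt = df["purchase_date"].dt
df["year"] = dt.year                          # Year values are kept as-is
df["month"] = dt.month - 1                    # 0-based: January = 0, February = 1
df["day"] = dt.day - 1                        # 0-based day of the month
df["hour"] = dt.hour                          # hours are already 0-based
df["week_of_year"] = dt.isocalendar().week - 1
df["day_of_year"] = dt.dayofyear - 1
df["quarter"] = dt.quarter - 1

print(df)
```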

![\[Screenshot of the datetime extraction box in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-datetime-extract.png)


You can see the transform listed in the **Model recipe** section. If you remove the transform from the **Model recipe** section, the new columns are removed from the dataset.

# Model evaluation


After you’ve built your model, you can evaluate how well your model performed on your data before using it to make predictions. You can use information, such as the model’s accuracy when predicting labels and advanced metrics, to determine whether your model can make sufficiently accurate predictions for your data.

The section [Evaluate your model's performance](canvas-scoring.md) describes how to view and interpret the information on your model's **Analyze** page. The section [Use advanced metrics in your analyses](canvas-advanced-metrics.md) contains more detailed information about the **Advanced metrics** used to quantify your model’s accuracy.

You can also view more advanced information for specific *model candidates*, which are all of the model iterations that Canvas runs through while building your model. Based on the advanced metrics for a given model candidate, you can select a different candidate to be the default, or the version that is used for making predictions and deploying. For each model candidate, you can view the **Advanced metrics** information to help you decide which model candidate you’d like to select as the default. You can view this information by selecting the model candidate from the **Model leaderboard**. For more information, see [View model candidates in the model leaderboard](canvas-evaluate-model-candidates.md).

Canvas also provides the option to download a Jupyter notebook so that you can view and run the code used to build your model. This is useful if you’d like to make adjustments to the code or learn more about how your model was built. For more information, see [Download a model notebook](canvas-notebook.md).

# Evaluate your model's performance


Amazon SageMaker Canvas provides overview and scoring information for the different types of models. Your model’s score can help you determine how accurate your model is when it makes predictions. The additional scoring insights can help you quantify the differences between the actual and predicted values.

To view the analysis of your model, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **My models**.

1. Choose the model that you built.

1. In the top navigation pane, choose the **Analyze** tab.

1. Within the **Analyze** tab, you can view the overview and scoring information for your model.

The following sections describe how to interpret the scoring for each model type.

## Evaluate categorical prediction models


The **Overview** tab shows you the column impact for each column. **Column impact** is a percentage score that indicates how much weight a column has in making predictions in relation to the other columns. For a column impact of 25%, Canvas weighs the prediction as 25% for the column and 75% for the other columns.

The following screenshot shows the **Accuracy** score for the model, along with the **Optimization metric**, which is the metric that you choose to optimize when building the model. In this case, the **Optimization metric** is **Accuracy**. You can specify a different optimization metric if you build a new version of your model.

![\[Screenshot of the accuracy score and optimization metric on the Analyze tab in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/analyze-tab-2-category.png)


The **Scoring** tab for a categorical prediction model gives you the ability to visualize all the predictions. Line segments extend from the left of the page, indicating all the predictions the model has made. In the middle of the page, the line segments converge on a perpendicular segment to indicate the proportion of each prediction to a single category. From the predicted category, the segments branch out to the actual category. You can get a visual sense of how accurate the predictions were by following each line segment from the predicted category to the actual category.

The following image gives you an example **Scoring** section for a **3+ category prediction** model.

![\[Screenshot of the Scoring tab for a 3+ category prediction model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-analyze/canvas-multiclass-classification.png)


You can also view the **Advanced metrics** tab for more detailed information about your model’s performance, such as the advanced metrics, error density plots, or confusion matrices. To learn more about the **Advanced metrics** tab, see [Use advanced metrics in your analyses](canvas-advanced-metrics.md).

## Evaluate numeric prediction models


The **Overview** tab shows you the column impact for each column. **Column impact** is a percentage score that indicates how much weight a column has in making predictions in relation to the other columns. For a column impact of 25%, Canvas weighs the prediction as 25% for the column and 75% for the other columns.

The following screenshot shows the **RMSE** score for the model on the **Overview** tab, which in this case is the **Optimization metric**. The **Optimization metric** is the metric that you choose to optimize when building the model. You can specify a different optimization metric if you build a new version of your model.

![\[Screenshot of the RMSE optimization metric on the Analyze tab in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/analyze-tab-2-numeric.png)


The **Scoring** tab for numeric prediction shows a line to indicate the model's predicted value in relation to the data used to make predictions. The predicted values typically fall within +/- the RMSE (root mean squared error) of the actual values. The width of the purple band around the line indicates the RMSE range, and the predicted values often fall within that band.

The following image shows the **Scoring** section for numeric prediction.

![\[Screenshot of the Scoring tab for a numeric prediction model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-analyze/canvas-analyze-regression-scoring.png)


You can also view the **Advanced metrics** tab for more detailed information about your model’s performance, such as the advanced metrics, error density plots, or confusion matrices. To learn more about the **Advanced metrics** tab, see [Use advanced metrics in your analyses](canvas-advanced-metrics.md).

## Evaluate time series forecasting models


On the **Analyze** page for time series forecasting models, you can see an overview of the model’s metrics. You can hover over each metric for more information, or you can see [Use advanced metrics in your analyses](canvas-advanced-metrics.md) for more information about each metric.

In the **Column impact** section, you can see the score for each column. **Column impact** is a percentage score that indicates how much weight a column has in making predictions in relation to the other columns. For a column impact of 25%, Canvas weighs the prediction as 25% for the column and 75% for the other columns.

The following screenshot shows the time series metrics scores for the model, along with the **Optimization metric**, which is the metric that you choose to optimize when building the model. In this case, the **Optimization metric** is **RMSE**. You can specify a different optimization metric if you build a new version of your model. These metrics scores are taken from your backtest results, which are available for download in the **Artifacts** tab.

![\[Screenshot of the RMSE optimization metric on the Analyze tab in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/analyze-tab-2-time-series.png)


The **Artifacts** tab provides access to several key resources that you can use to dive deeper into your model’s performance and continue iterating upon it:
+ **Shuffled training and validation splits** – This section includes links to the artifacts generated when your dataset was split into training and validation sets, enabling you to review the data distribution and potential biases.
+ **Backtest results** – This section includes a link to the forecasted values for your validation dataset, which is used to generate accuracy metrics and evaluation data for your model.
+ **Accuracy metrics** – This section lists the advanced metrics that evaluate your model's performance, such as Root Mean Squared Error (RMSE). For more information about each metric, see [Metrics for time series forecasts](canvas-metrics.md#canvas-time-series-forecast-metrics).
+ **Explainability report** – This section provides a link to download the explainability report, which offers insights into the model's decision-making process and the relative importance of input columns. This report can help you identify potential areas for improvement.

On the **Analyze** page, you can also choose the **Download** button to directly download the backtest results, accuracy metrics, and explainability report artifacts to your local machine.

## Evaluate image prediction models


The **Overview** tab shows you the **Per label performance**, which gives you an overall accuracy score for the images predicted for each label. You can choose a label to see more specific details, such as the **Correctly predicted** and **Incorrectly predicted** images for the label.

You can turn on the **Heatmap** toggle to see a heatmap for each image. The heatmap shows you the areas of interest that have the most impact when your model is making predictions. For more information about heatmaps and how to use them to improve your model, choose the **More info** icon next to the **Heatmap** toggle.

The **Scoring** tab for single-label image prediction models shows you a comparison of what the model predicted as the label versus what the actual label was. You can select up to 10 labels at a time. You can change the labels in the visualization by choosing the labels dropdown menu and selecting or deselecting labels.

You can also view insights for individual labels or groups of labels, such as the three labels with the highest or lowest accuracy, by choosing the **View scores for** dropdown menu in the **Model accuracy insights** section.

The following screenshot shows the **Scoring** information for a single-label image prediction model.

![\[The actual versus predicted labels on the Scoring page for a single-label image prediction model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/analyze-image-scoring.png)


## Evaluate text prediction models


The **Overview** tab shows you the **Per label performance**, which gives you an overall accuracy score for the passages of text predicted for each label. You can choose a label to see more specific details, such as the **Correctly predicted** and **Incorrectly predicted** passages for the label.

The **Scoring** tab for multi-category text prediction models shows you a comparison of what the model predicted as the label versus what the actual label was.

In the **Model accuracy insights** section, you can see the **Most frequent category**, which tells you the category that the model predicted most frequently and how accurate those predictions were. If your model predicts a label of **Positive** correctly 99% of the time, then you can be fairly confident that your model is good at predicting positive sentiment in text.

The following screenshot shows the **Scoring** information for a multi-category text prediction model.

![\[The actual versus predicted labels on the Scoring page for a multi-category text prediction model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/analyze-text-scoring.png)


# Use advanced metrics in your analyses


The following section describes how to find and interpret the advanced metrics for your model in Amazon SageMaker Canvas.

**Note**  
Advanced metrics are currently only available for numeric and categorical prediction models.

To find the **Advanced metrics** tab, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **My models**.

1. Choose the model that you built.

1. In the top navigation pane, choose the **Analyze** tab.

1. Within the **Analyze** tab, choose the **Advanced metrics** tab.

In the **Advanced metrics** tab, you can find the **Performance** tab. The page looks like the following screenshot.

![\[Screenshot of the advanced metrics tab for a categorical prediction model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-analyze-performance.png)


At the top, you can see an overview of the metrics scores, including the **Optimization metric**, which is the metric that you selected (or that Canvas selected by default) to optimize when building the model.

The following sections describe more detailed information for the **Performance** tab within the **Advanced metrics**.

## Performance


In the **Performance** tab, you’ll see a **Metrics table**, along with visualizations that Canvas creates based on your model type. For categorical prediction models, Canvas provides a *confusion matrix*, whereas for numeric prediction models, Canvas provides you with *residuals* and *error density* charts.

In the **Metrics table**, you are provided with a full list of your model’s scores for each advanced metric, which is more comprehensive than the scores overview at the top of the page. The metrics shown here depend on your model type. For a reference to help you understand and interpret each metric, see [Metrics reference](canvas-metrics.md).

To understand the visualizations that might appear based on your model type, see the following options; a short sketch after this list illustrates the confusion matrix and residuals concepts:
+ **Confusion matrix** – Canvas uses confusion matrices to help you visualize when a model makes predictions correctly. In a confusion matrix, your results are arranged to compare the predicted values against the actual values. The following example explains how a confusion matrix works for a 2 category prediction model that predicts positive and negative labels:
  + True positive – The model correctly predicted positive when the true label was positive.
  + True negative – The model correctly predicted negative when the true label was negative.
  + False positive – The model incorrectly predicted positive when the true label was negative.
  + False negative – The model incorrectly predicted negative when the true label was positive.
+ **Precision recall curve** – The precision recall curve is a visualization of the model’s precision score plotted against the model’s recall score. Generally, a model that can make perfect predictions would have precision and recall scores that are both 1. The precision recall curve for a decently accurate model is fairly high in both precision and recall.
+ **Residuals** – Residuals are the difference between the actual values and the values predicted by the model. A residuals chart plots the residuals against the corresponding values to visualize their distribution and any patterns or outliers. A normal distribution of residuals around zero indicates that the model is a good fit for the data. However, if the residuals are significantly skewed or have outliers, it may indicate that the model is overfitting the data or that there are other issues that need to be addressed.
+ **Error density** – An error density plot is a representation of the distribution of errors made by a model. It shows the probability density of the errors at each point, helping you to identify any areas where the model may be overfitting or making systematic errors.
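Canvas renders these visualizations for you, but if you want to explore the same diagnostics on your own predictions, a short sketch with scikit-learn and NumPy (hypothetical labels and values) might look like the following:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Categorical prediction: rows are actual labels, columns are predicted labels.
y_true = ["pos", "pos", "neg", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "pos", "pos"]
print(confusion_matrix(y_true, y_pred, labels=["pos", "neg"]))
# [[2 1]   2 true positives, 1 false negative
#  [1 1]]  1 false positive, 1 true negative

# Numeric prediction: residuals are the differences between actual and predicted values.
actual = np.array([10.0, 12.0, 9.0, 11.0])
predicted = np.array([9.5, 12.5, 9.0, 10.0])
print(actual - predicted)  # roughly centered on zero for a well-fit model
```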

# View model candidates in the model leaderboard


When you do a [Standard build](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-build-model.html) for tabular and time series forecasting models in Amazon SageMaker Canvas, SageMaker AI trains multiple *model candidates* (different iterations of the model) and by default selects the one with the best score for the optimization metric. For tabular models, Canvas builds up to 250 different model candidates using various algorithms and hyperparameter settings. For time series forecasting models, Canvas builds 7 different models: one for each of the [supported forecasting algorithms](canvas-advanced-settings.md#canvas-advanced-settings-time-series) and one ensemble model that averages the predictions of the other models to try to optimize accuracy.

The default model candidate is the only version that you can use in Canvas for actions like making predictions, registering to the model registry, or deploying to an endpoint. However, you might want to review all of the model candidates and select a different candidate to be the default model. You can view all of the model candidates and more details about each candidate on the **Model leaderboard** in Canvas.

To view the **Model leaderboard**, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **My models**.

1. Choose the model that you built.

1. In the top navigation pane, choose the **Analyze** tab.

1. Within the **Analyze** tab, choose **Model leaderboard**.

The **Model leaderboard** page opens, which for tabular models looks like the following screenshot.

![\[The model leaderboard, which lists all of the model candidates that Canvas trained.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-model-leaderboard.png)


For time series forecasting models, you see 7 models, which include one for each of the time series forecasting algorithms supported by Canvas and one ensemble model. For more information about the algorithms, see [Advanced time series forecasting model settings](canvas-advanced-settings.md#canvas-advanced-settings-time-series).

In the preceding screenshot, you can see that the first model candidate listed is marked as the **Default model**. This is the model candidate with which you can make predictions or deploy to endpoints.

To view more detailed metrics information about the model candidates to compare them, you can choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) and choose **View model details**.

**Important**  
 Loading the model details for non-default model candidates may take a few minutes (typically less than 10 minutes), and SageMaker AI Hosting charges apply. For more information, see [SageMaker AI Pricing](https://aws.amazon.com/sagemaker/pricing/).

The model candidate opens in the **Analyze** tab, and the metrics shown are specific to that model candidate. When you’re done reviewing the model candidate’s metrics, you can go back or exit the view to return to the **Model leaderboard**.

If you’d like to set the **Default model** to a different candidate, you can choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) and choose **Change to default model**. Changing the default model for a model trained using HPO mode might take several minutes.

**Note**  
If your model is already deployed in production, [registered to the model registry](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-register-model.html), or has [automations](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-manage-automations.html) set up, you must delete your deployment, model registration, or automations before changing the default model.

# Metrics reference


The following sections describe the metrics that are available in Amazon SageMaker Canvas for each model type.

## Metrics for numeric prediction


The following list defines the metrics for numeric prediction in SageMaker Canvas and gives you information about how you can use them. A short sketch after the list shows how these metrics can be computed.
+ InferenceLatency – The approximate amount of time between making a request for a model prediction and receiving it from a real-time endpoint to which the model is deployed. This metric is measured in seconds and is only available for models built with the **Ensembling** mode.
+ MAE – Mean absolute error. On average, the prediction for the target column is +/- the MAE from the actual value.

  Measures how different the predicted and actual values are when they're averaged over all values. MAE is commonly used in numeric prediction to understand model prediction error. If the predictions are linear, MAE represents the average distance from a predicted line to the actual value. MAE is defined as the sum of absolute errors divided by the number of observations. Values range from 0 to infinity, with smaller numbers indicating a better model fit to the data.
+ MAPE – Mean absolute percent error. On average, the prediction for the target column is +/- the MAPE percent from the actual value.

  MAPE is the mean of the absolute differences between the actual values and the predicted or estimated values, divided by the actual values and expressed as a percentage. A lower MAPE indicates better performance, as it means that the predicted or estimated values are closer to the actual values.
+ MSE – Mean squared error, or the average of the squared differences between the predicted and actual values.

  MSE values are always positive. The better a model is at predicting the actual values, the smaller the MSE value is.
+ R2 – The percentage of the difference in the target column that can be explained by the input columns.

  Quantifies how much a model can explain the variance of a dependent variable. Values range from one (1) to negative one (-1). Higher numbers indicate a higher fraction of explained variability. Values close to zero (0) indicate that very little of the dependent variable can be explained by the model. Negative values indicate a poor fit and that the model is outperformed by a constant function (or a horizontal line).
+ RMSE – Root mean squared error, or the standard deviation of the errors.

  Measures the square root of the average squared difference between the predicted and actual values. It is used to understand model prediction error, and it's an important metric that indicates the presence of large model errors and outliers. Values range from zero (0) to infinity, with smaller numbers indicating a better model fit to the data. RMSE is dependent on scale, and should not be used to compare datasets of different types.
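Canvas computes these metrics for you, but the definitions above can be checked with a few lines of NumPy (hypothetical values):

```python
import numpy as np

actual = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0])

mae = np.mean(np.abs(actual - predicted))               # mean absolute error
mape = np.mean(np.abs((actual - predicted) / actual))   # mean absolute percent error
mse = np.mean((actual - predicted) ** 2)                # mean squared error
rmse = np.sqrt(mse)                                     # root mean squared error
r2 = 1 - np.sum((actual - predicted) ** 2) / np.sum((actual - actual.mean()) ** 2)

print(f"MAE={mae:.3f} MAPE={mape:.1%} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```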

## Metrics for categorical prediction


This section defines the metrics for categorical prediction in SageMaker Canvas and gives you information about how you can use them.

The following is a list of available metrics for 2-category prediction; a short sketch after the list shows how these metrics can be computed with scikit-learn:
+ Accuracy – The percentage of correct predictions.

  Or, the ratio of the number of correctly predicted items to the total number of predictions. Accuracy measures how close the predicted class values are to the actual values. Values for accuracy metrics vary between zero (0) and one (1). A value of 1 indicates perfect accuracy, and 0 indicates complete inaccuracy.
+ AUC – A value between 0 and 1 that indicates how well your model is able to separate the categories in your dataset. A value of 1 indicates that it was able to separate the categories perfectly.
+ BalancedAccuracy – Measures the ratio of accurate predictions to all predictions.

  This ratio is calculated after normalizing true positives (TP) and true negatives (TN) by the total number of positive (P) and negative (N) values. It is defined as follows: `0.5*((TP/P)+(TN/N))`, with values ranging from 0 to 1. The balanced accuracy metric gives a better measure of accuracy when the number of positives or negatives differ greatly from each other in an imbalanced dataset, such as when only 1% of email is spam.
+ F1 – A balanced measure of accuracy that takes class balance into account.

  It is the harmonic mean of the precision and recall scores, defined as follows: `F1 = 2 * (precision * recall) / (precision + recall)`. F1 scores vary between 0 and 1. A score of 1 indicates the best possible performance, and 0 indicates the worst.
+ InferenceLatency – The approximate amount of time between making a request for a model prediction and receiving it from a real-time endpoint to which the model is deployed. This metric is measured in seconds and is only available for models built with the **Ensembling** mode.
+ LogLoss – Log loss, also known as cross-entropy loss, is a metric used to evaluate the quality of the probability outputs, rather than the outputs themselves. Log loss is an important metric to indicate when a model makes incorrect predictions with high probabilities. Values range from 0 to infinity. A value of 0 represents a model that perfectly predicts the data.
+ Precision – Of all the times that *category x* was predicted, the prediction was correct *precision*% of the time.

  Precision measures how well an algorithm predicts the true positives (TP) out of all of the positives that it identifies. It is defined as follows: `Precision = TP/(TP+FP)`, with values ranging from zero (0) to one (1). Precision is an important metric when the cost of a false positive is high. For example, the cost of a false positive is very high if an airplane safety system is falsely deemed safe to fly. A false positive (FP) reflects a positive prediction that is actually negative in the data.
+ Recall – The model correctly predicted *recall*% to be *category x* when the *target column* was actually *category x*.

  Recall measures how well an algorithm correctly predicts all of the true positives (TP) in a dataset. A true positive is a positive prediction that is also an actual positive value in the data. Recall is defined as follows: `Recall = TP/(TP+FN)`, with values ranging from 0 to 1. Higher scores reflect a better ability of the model to predict true positives (TP) in the data. Note that it is often insufficient to measure only recall, because predicting every output as a true positive yields a perfect recall score.
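The following sketch computes each of these metrics with scikit-learn on hypothetical labels and probabilities:

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, f1_score,
                             log_loss, precision_score, recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                    # predicted labels
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.95, 0.3]   # predicted probability of class 1

print("Accuracy:", accuracy_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_prob))
print("BalancedAccuracy:", balanced_accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("LogLoss:", log_loss(y_true, y_prob))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
```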

The following is a list of available metrics for 3+ category prediction:
+ Accuracy – The percentage of correct predictions.

  Or, the ratio of the number of correctly predicted items to the total number of predictions. Accuracy measures how close the predicted class values are to the actual values. Values for accuracy metrics vary between zero (0) and one (1). A value of 1 indicates perfect accuracy, and 0 indicates complete inaccuracy.
+ BalancedAccuracy – Measures the ratio of accurate predictions to all predictions.

  This ratio is calculated after normalizing true positives (TP) and true negatives (TN) by the total number of positive (P) and negative (N) values. It is defined as follows: `0.5*((TP/P)+(TN/N))`, with values ranging from 0 to 1. The balanced accuracy metric gives a better measure of accuracy when the number of positives or negatives differ greatly from each other in an imbalanced dataset, such as when only 1% of email is spam.
+ F1macro – The F1macro score applies F1 scoring by calculating the precision and recall, and then taking their harmonic mean to calculate the F1 score for each class. Then, the F1macro averages the individual scores to obtain the F1macro score. F1macro scores vary between 0 and 1. A score of 1 indicates the best possible performance, and 0 indicates the worst.
+ InferenceLatency – The approximate amount of time between making a request for a model prediction and receiving it from a real-time endpoint to which the model is deployed. This metric is measured in seconds and is only available for models built with the **Ensembling** mode.
+ LogLoss – Log loss, also known as cross-entropy loss, is a metric used to evaluate the quality of the probability outputs, rather than the outputs themselves. Log loss is an important metric to indicate when a model makes incorrect predictions with high probabilities. Values range from 0 to infinity. A value of 0 represents a model that perfectly predicts the data.
+ PrecisionMacro – Measures precision by calculating precision for each class and averaging scores to obtain precision for several classes. Scores range from zero (0) to one (1). Higher scores reflect the model's ability to predict true positives (TP) out of all of the positives that it identifies, averaged across multiple classes.
+ RecallMacro – Measures recall by calculating recall for each class and averaging scores to obtain recall for several classes. Scores range from 0 to 1. Higher scores reflect the model's ability to predict true positives (TP) in a dataset, whereas a true positive reflects a positive prediction that is also an actual positive value in the data. It is often insufficient to measure only recall, because predicting every output as a true positive will yield a perfect recall score.

Note that for 3+ category prediction, you also receive the averaged F1, Accuracy, Precision, and Recall metrics. The scores for these metrics are the metric scores averaged across all categories. A short sketch of macro averaging follows.
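For example, macro averaging with scikit-learn (hypothetical labels):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = ["a", "b", "c", "a", "b", "c"]
y_pred = ["a", "b", "b", "a", "c", "c"]

# "Macro" metrics compute the score per class, then average the per-class scores.
print("F1macro:", f1_score(y_true, y_pred, average="macro"))
print("PrecisionMacro:", precision_score(y_true, y_pred, average="macro"))
print("RecallMacro:", recall_score(y_true, y_pred, average="macro"))
```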

## Metrics for image and text prediction


The following is a list of available metrics for image prediction and text prediction.
+ Accuracy – The percentage of correct predictions.

  Or, the ratio of the number of correctly predicted items to the total number of predictions. Accuracy measures how close the predicted class values are to the actual values. Values for accuracy metrics vary between zero (0) and one (1). A value of 1 indicates perfect accuracy, and 0 indicates complete inaccuracy.
+ F1 – A balanced measure of accuracy that takes class balance into account.

  It is the harmonic mean of the precision and recall scores, defined as follows: `F1 = 2 * (precision * recall) / (precision + recall)`. F1 scores vary between 0 and 1. A score of 1 indicates the best possible performance, and 0 indicates the worst.
+ Precision – Of all the times that *category x* was predicted, the prediction was correct *precision*% of the time.

  Precision measures how well an algorithm predicts the true positives (TP) out of all of the positives that it identifies. It is defined as follows: `Precision = TP/(TP+FP)`, with values ranging from zero (0) to one (1). Precision is an important metric when the cost of a false positive is high. For example, the cost of a false positive is very high if an airplane safety system is falsely deemed safe to fly. A false positive (FP) reflects a positive prediction that is actually negative in the data.
+ Recall – The model correctly predicted *recall*% to be *category x* when the *target column* was actually *category x*.

  Recall measures how well an algorithm correctly predicts all of the true positives (TP) in a dataset. A true positive is a positive prediction that is also an actual positive value in the data. Recall is defined as follows: `Recall = TP/(TP+FN)`, with values ranging from 0 to 1. Higher scores reflect a better ability of the model to predict true positives (TP) in the data. Note that it is often insufficient to measure only recall, because predicting every output as a true positive yields a perfect recall score.

Note that for image and text prediction models where you are predicting 3 or more categories, you also receive the *average* F1, Accuracy, Precision, and Recall metrics. The scores for these metrics are the metric scores averaged across all categories.

## Metrics for time series forecasts


The following defines the advanced metrics for time series forecasts in Amazon SageMaker Canvas and gives you information about how you can use them. A short sketch after the list shows how two of these metrics are computed.
+ Average Weighted Quantile Loss (wQL) – Evaluates the forecast by averaging the accuracy at the P10, P50, and P90 quantiles. A lower value indicates a more accurate model.
+ Weighted Absolute Percent Error (WAPE) – The sum of the absolute error normalized by the sum of the absolute target, which measures the overall deviation of forecasted values from observed values. A lower value indicates a more accurate model, where WAPE = 0 is a model with no errors.
+ Root Mean Square Error (RMSE) – The square root of the average squared errors. A lower RMSE indicates a more accurate model, where RMSE = 0 is a model with no errors.
+ Mean Absolute Percent Error (MAPE) – The percentage error (percent difference of the mean forecasted value versus the actual value) averaged over all time points. A lower value indicates a more accurate model, where MAPE = 0 is a model with no errors.
+ Mean Absolute Scaled Error (MASE) – The mean absolute error of the forecast normalized by the mean absolute error of a simple baseline forecasting method. A lower value indicates a more accurate model, where MASE < 1 is estimated to be better than the baseline and MASE > 1 is estimated to be worse than the baseline.
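For example, WAPE and MASE follow directly from their definitions; the following is a short NumPy sketch with hypothetical values:

```python
import numpy as np

actual = np.array([100.0, 120.0, 130.0, 110.0])
forecast = np.array([95.0, 125.0, 120.0, 115.0])

# WAPE: sum of absolute errors normalized by the sum of absolute actual values.
wape = np.abs(actual - forecast).sum() / np.abs(actual).sum()

# MASE: MAE of the forecast divided by the MAE of a naive baseline that
# predicts each value with the previous observed value.
naive_mae = np.mean(np.abs(np.diff(actual)))
mase = np.mean(np.abs(actual - forecast)) / naive_mae

print(f"WAPE={wape:.3f} MASE={mase:.3f}")
```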

# Predictions with custom models


Use the custom model that you've built in SageMaker Canvas to make predictions for your data. The following sections show you how to make predictions for numeric and categorical prediction models, time series forecasts, image prediction models, and text prediction models.

Numeric and categorical prediction, image prediction, and text prediction custom models support making the following types of predictions for your data:
+ **Single predictions** — A **Single prediction** is when you only need to make one prediction. For example, you have one image or passage of text that you want to classify.
+ **Batch predictions** — A **Batch prediction** is when you’d like to make predictions for an entire dataset. You can make batch predictions for datasets that are 1 TB+. For example, you have a CSV file of customer reviews for which you’d like to predict the customer sentiment, or you have a folder of image files that you'd like to classify. You should make predictions with a dataset whose schema matches that of your input dataset. Canvas provides you with the ability to do manual batch predictions, or you can configure automatic batch predictions that run whenever you update a dataset.

For each prediction or set of predictions, SageMaker Canvas returns the following:
+ The predicted values
+ The probability of the predicted value being correct

**Get started**

Choose one of the following workflows to make predictions with your custom model:
+ [Batch predictions in SageMaker Canvas](canvas-make-predictions-batch.md)
+ [Make single predictions](canvas-make-predictions-single.md)

After generating predictions with your model, you can also do the following:
+ [Update your model by adding versions.](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-update-model.html) If you want to try to improve the prediction accuracy of your model, you can build new versions of your model. You can choose to clone your original model building configuration and dataset, or you can change your configuration and select a different dataset. After adding a new version, you can review and compare versions to choose the best one.
+ [Register a model version in the SageMaker AI model registry](canvas-register-model.md). You can register versions of your model to the SageMaker Model Registry, which is a feature for tracking and managing the status of model versions and machine learning pipelines. A data scientist or MLOps team user with access to the SageMaker Model Registry can review your model versions and approve or reject them before deploying them to production.
+ [Send your batch predictions to Quick.](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-send-predictions.html) In Quick, you can build and publish dashboards with your batch prediction datasets. This can help you analyze and share results generated by your custom model.

# Make single predictions


**Note**  
This section describes how to get single predictions from your model inside the Canvas application. For information about making real-time invocations in a production environment by deploying your model to an endpoint, see [Deploy your models to an endpoint](canvas-deploy-model.md).

Make single predictions if you want to get a prediction for a single data point. You can use this feature to get real-time predictions or to experiment with changing individual values to see how they impact the prediction outcome. Note that single predictions rely on an Asynchronous Inference endpoint, which shuts down after being idle (or not receiving any prediction requests) for two hours.

Choose one of the following procedures based on your model type.

## Make single predictions with numeric and categorical prediction models


To make a single prediction for a numeric or categorical prediction model, do the following:

1. In the left navigation pane of the Canvas application, choose **My models**.

1. On the **My models** page, choose your model.

1. After opening your model, choose the **Predict** tab.

1. On the **Run predictions** page, choose **Single prediction**.

1. For each **Column** field, which represents the columns of your input data, you can change the **Value**. Select the dropdown menu for the **Value** you want to change. For numeric fields, you can enter a new number. For fields with labels, you can select a different label.

1. When you’re ready to generate the prediction, in the right **Prediction** pane, choose **Update**.

In the right **Prediction** pane, you’ll see the prediction result. You can **Copy** the prediction result chart, or you can also choose **Download** to either download the prediction result chart as an image or to download the values and prediction as a CSV file.

## Make single predictions with time series forecasting models


To make a single prediction for a time series forecasting model, do the following:

1. In the left navigation pane of the Canvas application, choose **My models**.

1. On the **My models** page, choose your model.

1. After opening your model, choose the **Predict** tab.

1. Choose **Single prediction**.

1. For **Item**, select the item for which you want to forecast values.

1. If you used a group by column to train the model, then select the group by category for the item.

The prediction result loads in the pane below, showing you a chart with the forecast for each quantile. Choose **Schema view** to see the numeric predicted values. You can also choose **Download** to download the prediction results as either an image or a CSV file.

## Make single predictions with image prediction models


To make a single prediction for a single-label image prediction model, do the following:

1. In the left navigation pane of the Canvas application, choose **My models**.

1. On the **My models** page, choose your model.

1. After opening your model, choose the **Predict** tab.

1. On the **Run predictions** page, choose **Single prediction**.

1. Choose **Import image**.

1. You’ll be prompted to upload an image. You can upload an image from your local computer or from an Amazon S3 bucket.

1. Choose **Import** to import your image and generate the prediction.

In the right **Prediction results** pane, the model lists the possible labels for the image along with a **Confidence** score for each label. For example, the model might predict the label **Sea** for an image with a confidence score of 96%, and predict the label **Glacier** with a confidence score of only 4%. Therefore, you can determine that your model is fairly confident in predicting images of the sea.

## Make single predictions with text prediction models


To make a single prediction for a multi-category text prediction model, do the following:

1. In the left navigation pane of the Canvas application, choose **My models**.

1. On the **My models** page, choose your model.

1. After opening your model, choose the **Predict** tab.

1. On the **Run predictions** page, choose **Single prediction**.

1. For the **Text field**, enter the text for which you’d like to get a prediction.

1. Choose **Generate prediction results** to get your prediction.

In the right **Prediction results** pane, you receive an analysis of your text in addition to a **Confidence** score for each possible label. For example, if you entered a good review for a product, you might get **Positive** with a confidence score of 85%, while the confidence score for **Neutral** might be 10% and the confidence score for **Negative** only 5%.

# Batch predictions in SageMaker Canvas
Batch predictions

Make batch predictions when you have an entire dataset for which you’d like to generate predictions. Amazon SageMaker Canvas supports batch predictions for datasets up to petabytes (PB) in size.

There are two types of batch predictions you can make:
+ [Manual batch predictions](canvas-make-predictions-batch-manual.md) are when you have a dataset for which you want to make one-time predictions.
+ [Automatic batch predictions](canvas-make-predictions-batch-auto.md) are when you set up a configuration that runs whenever a specific dataset is updated. For example, if you’ve configured weekly updates to a SageMaker Canvas dataset of inventory data, you can set up automatic batch predictions that run whenever you update the dataset. After setting up an automated batch predictions workflow, see [How to manage automations](canvas-manage-automations.md) for more information about viewing and editing the details of your configuration. For more information about setting up automatic dataset updates, see [Configure automatic updates for a dataset](canvas-update-dataset-auto.md).

**Note**  
Time series forecasting models don't support automatic batch predictions.  
You can only set up automatic batch predictions for datasets imported through local upload or Amazon S3. Additionally, automatic batch predictions can only run while you’re logged in to the Canvas application. If you log out of Canvas, the automatic batch prediction job resumes when you log back in.

To get started, review the [Batch prediction dataset requirements](canvas-make-predictions-batch-preqreqs.md), and then choose one of the following manual or automatic batch prediction workflows.

**Topics**
+ [

# Batch prediction dataset requirements
](canvas-make-predictions-batch-preqreqs.md)
+ [

# Make manual batch predictions
](canvas-make-predictions-batch-manual.md)
+ [

# Make automatic batch predictions
](canvas-make-predictions-batch-auto.md)
+ [

# Edit your automatic batch prediction configuration
](canvas-make-predictions-batch-auto-edit.md)
+ [

# Delete your automatic batch prediction configuration
](canvas-make-predictions-batch-auto-delete.md)
+ [

# View your batch prediction jobs
](canvas-make-predictions-batch-auto-view.md)

# Batch prediction dataset requirements


For batch predictions, make sure that your datasets meet the requirements outlined in [Create a dataset](canvas-import-dataset.md). If your dataset is larger than 5 GB, then Canvas uses Amazon EMR Serverless to process your data and split it into smaller batches. After your data has been split, Canvas uses SageMaker AI Batch Transform to make predictions. You may see charges from both of these services after running batch predictions. For more information, see [Canvas pricing](https://aws.amazon.com/sagemaker/canvas/pricing/).

You might not be able to make predictions on some datasets if their *schemas* are incompatible. A *schema* is an organizational structure. For a tabular dataset, the schema consists of the column names and the data type of the data in each column. An incompatible schema can result from any of the following:
+ The dataset that you're using to make predictions has fewer columns than the dataset that you used to build the model.
+ The data types in the columns that you used to build the model differ from the data types in the dataset that you're using to make predictions.
+ The dataset that you're using to make predictions and the dataset that you used to build the model have column names that don't match. The column names are case sensitive; `Column1` is not the same as `column1`.

To ensure that you can successfully generate batch predictions, match the schema of your batch predictions dataset to the dataset you used to train the model.
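
To check schema compatibility before you import a prediction dataset, you can compare the two files locally. The following is a minimal sketch using pandas; the file names are placeholders for your own training and prediction datasets.

```python
import pandas as pd

# Placeholder file names -- substitute your own training and prediction datasets.
train = pd.read_csv("training_data.csv")
batch = pd.read_csv("prediction_data.csv")

# Column names are case sensitive: `Column1` is not the same as `column1`.
missing = set(train.columns) - set(batch.columns)
if missing:
    print(f"The prediction dataset is missing columns: {sorted(missing)}")

# Data types must also match between the two datasets.
for col in set(train.columns) & set(batch.columns):
    if train[col].dtype != batch[col].dtype:
        print(f"Type mismatch in '{col}': {train[col].dtype} vs {batch[col].dtype}")
```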

**Note**  
For batch predictions, if you dropped any columns when building your model, Canvas adds the dropped columns back to the prediction results. However, Canvas does not add the dropped columns to your batch predictions for time series models.

# Make manual batch predictions


Choose one of the following procedures to make manual batch predictions based on your model type.

## Make manual batch predictions with numeric, categorical, and time series forecasting models


To make manual batch predictions for numeric, categorical, and time series forecasting model types, do the following:

1. In the left navigation pane of the Canvas application, choose **My models**.

1. On the **My models** page, choose your model.

1. After opening your model, choose the **Predict** tab.

1. On the **Run predictions** page, choose **Batch prediction**.

1. Choose **Select dataset** to pick a dataset for generating predictions.

1. From the list of available datasets, select your dataset, and then choose **Start Predictions** to get your predictions.

After the prediction job finishes running, there is an output dataset listed on the same page in the **Predictions** section. This dataset contains your results, and if you select the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), you can choose **Preview** to preview the output data. You can see the input data matched to the prediction and the probability that the prediction is correct. Then, you can choose **Download prediction** to download the results as a file.

## Make manual batch predictions with image prediction models


To make manual batch predictions for a single-label image prediction model, do the following:

1. In the left navigation pane of the Canvas application, choose **My models**.

1. On the **My models** page, choose your model.

1. After opening your model, choose the **Predict** tab.

1. On the **Run predictions** page, choose **Batch prediction**.

1. Choose **Select dataset** if you’ve already imported your dataset. If not, choose **Import new dataset**, and then you’ll be directed through the import data workflow.

1. From the list of available datasets, select your dataset and choose **Generate predictions** to get your predictions.

After the prediction job finishes running, on the **Run predictions** page, you see an output dataset listed under **Predictions**. This dataset contains your results, and if you select the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), you can choose **View prediction results** to see the output data. You can see the images along with their predicted labels and confidence scores. Then, you can choose **Download prediction** to download the results as a CSV or a ZIP file.

## Make manual batch predictions with text prediction models


To make manual batch predictions for a multi-category text prediction model, do the following:

1. In the left navigation pane of the Canvas application, choose **My models**.

1. On the **My models** page, choose your model.

1. After opening your model, choose the **Predict** tab.

1. On the **Run predictions** page, choose **Batch prediction**.

1. Choose **Select dataset** if you’ve already imported your dataset. If not, choose **Import new dataset**, and then you’ll be directed through the import data workflow. The dataset you choose must have the same source column as the dataset with which you built the model.

1. From the list of available datasets, select your dataset and choose **Generate predictions** to get your predictions.

After the prediction job finishes running, on the **Run predictions** page, you see an output dataset listed under **Predictions**. This dataset contains your results, and if you select the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), you can choose **Preview** to see the output data. You can see the text entries along with their predicted labels and confidence scores. Then, you can choose **Download prediction** to download the results.

# Make automatic batch predictions


**Note**  
Time series forecasting models don't support automatic batch predictions.

To set up a schedule for automatic batch predictions, do the following:

1. In the left navigation pane of Canvas, choose **My models**.

1. Choose your model.

1. Choose the **Predict** tab.

1. Choose **Batch prediction**.

1. For **Generate predictions**, choose **Automatic**.

1. The **Automate batch predictions** dialog box opens. Choose **Select dataset** and choose the dataset for which you want to automate predictions. Note that you can only select a dataset that was imported through local upload or Amazon S3.

1. After selecting a dataset, choose **Set up**.

Canvas runs a batch predictions job for the dataset after you set up the configuration. Then, every time you [Update a dataset](canvas-update-dataset.md), either manually or automatically, another batch predictions job runs.

After the prediction job finishes running, on the **Run predictions** page, you see an output dataset listed under **Predictions**. This dataset contains your results, and if you select the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), you can choose **Preview** to preview the output data. You can see the input data matched to the prediction and the probability that the prediction is correct. Then, you can choose **Download** to download the results.

The following sections describe how to view, update, and delete your automatic batch prediction configuration through the **Predict** tab of your model in the Canvas application. You can set up a maximum of 20 automatic configurations in Canvas. For more information about viewing your automated batch predictions job history or making changes to your automatic configuration through the **Automations** page, see [How to manage automations](canvas-manage-automations.md).

# Edit your automatic batch prediction configuration


You might want to make changes to your automatic batch prediction configuration, such as changing the target dataset. You might also want to turn off the configuration to pause your automatic predictions.

When you edit a batch prediction configuration, you can change the target dataset but not the frequency, because automatic batch predictions run whenever the dataset is updated.

To edit your automatic batch prediction configuration, do the following:

1. Go to the **Predict** tab of your model.

1. Under **Predictions**, choose the **Configuration** tab.

1. Find your configuration and choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)).

1. From the dropdown menu, choose **Update configuration**.

1. The **Automate batch prediction** dialog box opens. You can select another dataset and choose **Set up** to save your changes.

Your automatic batch predictions configuration is now updated.

To pause your automatic batch predictions, turn off your automatic configuration by doing the following:

1. Go to the **Predict** tab of your model.

1. Under **Predictions**, choose the **Configuration** tab.

1. Find your configuration from the list and turn off the **Auto update** toggle.

Automatic batch predictions are now paused. You can turn the toggle back on at any time to resume the update schedule.

# Delete your automatic batch prediction configuration


To learn how to delete your automatic batch prediction configuration, see [Delete an automatic configuration](canvas-manage-automations-delete.md).

You can also delete your configuration by doing the following:

1. Go to the **Predict** tab of your model.

1. Under **Predictions**, choose the **Configuration** tab.

1. Find your configuration from the list and choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)).

1. From the dropdown menu, choose **Delete configuration**.

Your configuration should now be deleted.

# View your batch prediction jobs


To view the statuses and history of your batch prediction jobs, go to the **Predict** tab of your model.

Each batch prediction job shows up in the **Predict** tab of your model. Under **Predictions**, you can see the **All jobs** and **Configuration** tabs:
+ **All jobs** – In this tab, you can see all of the manual and automatic batch prediction jobs for this model. You can filter the jobs by configuration name. For each job, you can see the following fields:
  + **Status** – The current status of your batch prediction job. If the status is **Failed** or **Partially failed**, you can hover over the status to view a more detailed error message to help you troubleshoot.
  + **Input dataset** – The name of your Canvas input dataset, including the dataset version.
  + **Prediction type** – Whether the prediction job was automatic or manual.
  + **Rows** – The number of rows predicted.
  + **Configuration name** – The name of the batch prediction job configuration.
  + **QuickSight** – Whether you've sent the batch predictions to Quick.
  + **Created** – The creation time of the batch prediction job.

  If you choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), you can choose **View details**, **Preview prediction**, **Download prediction**, or **Send to Quick**. If you choose **View details**, a page opens that shows you the full details of the batch prediction job, including the status, the input and output data configurations, information about the instances used to complete the job, and access to the Amazon CloudWatch logs. The page looks like the following screenshot.  
![\[Batch prediction job details page showing all of the additional details about a job.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-view-batch-prediction-job-details.png)
+ **Configuration** – In this tab, you can see all of the automatic batch prediction configurations you’ve created for this model. For each configuration, you can see fields such as the timestamp for when it was **Created**, the **Input dataset** it tracks for updates, and the **Next job scheduled**, which is the time when the next automatic prediction job is scheduled to start. If you choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), you can choose **View all jobs** to see the job history and in progress jobs for the configuration.



# Send predictions to Quick


**Note**  
You can send batch predictions to Quick for numeric prediction, categorical prediction, and time series forecasting models. Single-label image prediction and multi-category text prediction models are excluded.

Once you generate batch predictions with custom tabular models in SageMaker Canvas, you can send those predictions as CSV files to Quick, a business intelligence (BI) service that you can use to build and publish predictive dashboards.

For example, if you built a 2-category prediction model to determine whether a customer will churn, you can create a visual, predictive dashboard in Quick that shows the percentage of customers expected to churn. To learn more about Quick, see the [Quick User Guide](https://docs.aws.amazon.com/quicksight/latest/user/welcome.html).

The following sections show you how to send your batch predictions to Quick for analysis.

## Before you begin


Your user must have the necessary AWS Identity and Access Management (IAM) permissions to send your predictions to Quick. Your administrator can set up the IAM permissions for your user. For more information, see [Grant Your Users Permissions to Send Predictions to Quick](canvas-quicksight-permissions.md).

Your Quick account must contain the `default` namespace, which is set up when you first create your Quick account. Contact your administrator to help you get access to Quick. For more information, see [Setting up for Quick](https://docs.aws.amazon.com/quicksight/latest/user/setting-up.html) in the *Quick User Guide*.

Your Quick account must be created in the same Region as your Canvas application. If your Quick account’s home Region differs from your Canvas application’s Region, you must either [close](https://docs.aws.amazon.com/quicksight/latest/user/closing-account.html) and recreate your Quick account, or [set up a Canvas application](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-getting-started.html#canvas-prerequisites) in the same Region as your Quick account. You can check your Quick home Region by doing the following (assuming you already have a Quick account):

1. Open your [Quick console](https://quicksight.aws.amazon.com/).

1. When the page loads, your Quick home Region appears in the URL in the following format: `https://<your-home-region>.quicksight.aws.amazon.com/`.

You must know the usernames of the Quick users to whom you want to send your predictions. You can send predictions to yourself or other users who have the right permissions. Any users to whom you send predictions must be in the `default` [namespace](https://docs.aws.amazon.com/quicksight/latest/user/namespaces.html) of your Quick account and have the `Author` or `Admin` role in Quick.

Additionally, Quick must have access to the SageMaker AI default Amazon S3 bucket for your domain, which is named with the following format: `sagemaker-{REGION}-{ACCOUNT_ID}`. The Region should be the same as your Quick account's home Region and your Canvas application’s Region. To learn how to give Quick access to the batch predictions stored in your Amazon S3 bucket, see the topic [I can’t connect to Amazon S3](https://docs.aws.amazon.com/quicksight/latest/user/troubleshoot-connect-S3.html) in the *Quick User Guide*.
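
If you need to locate this bucket, you can construct its name programmatically. The following is a small sketch that assumes you have boto3 installed and AWS credentials with a default Region configured.

```python
import boto3

# The SageMaker AI default bucket is named sagemaker-{REGION}-{ACCOUNT_ID}.
session = boto3.session.Session()
region = session.region_name  # assumes a default Region is configured
account_id = boto3.client("sts").get_caller_identity()["Account"]

default_bucket = f"sagemaker-{region}-{account_id}"
print(default_bucket)  # for example: sagemaker-us-east-1-111122223333
```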

## Supported data formats


Before sending your predictions, check that the data format of your batch predictions is compatible with Quick.
+ To learn more about the accepted date formats for time series data, see [Supported date formats](https://docs.aws.amazon.com/quicksight/latest/user/supported-date-formats.html) in the *Quick User Guide*.
+ To learn more about data values that might prevent you from sending to Quick, see [Unsupported values in data](https://docs.aws.amazon.com/quicksight/latest/user/unsupported-data-values.html) in the *Quick User Guide*.

Also note that Quick uses the `"` character as a text qualifier. If your Canvas data contains any `"` characters, make sure that each opening quote has a matching closing quote. Mismatched quotes can cause issues with sending your dataset to Quick.
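
As a rough pre-check, you can scan your exported predictions for lines with an odd number of `"` characters. The following sketch assumes a hypothetical file name, and it can report false positives for valid multi-line quoted fields.

```python
# Rough check for mismatched text qualifiers before sending data to Quick.
# "predictions.csv" is a placeholder for your exported batch predictions file.
with open("predictions.csv", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        if line.count('"') % 2 != 0:
            print(f"Line {line_number} has an unmatched quote: {line.rstrip()}")
```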

## Send your batch predictions to Quick


Use the following procedure to send your predictions to Quick:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **My models**.

1. On the **My models** page, choose your model.

1. Choose the **Predict** tab.

1. Under **Predictions**, select the dataset (or datasets) of batch predictions that you’d like to share. You can share up to 5 datasets of batch predictions at a time.

1. After you select your dataset, choose **Send to Quick**.
**Note**  
The **Send to Quick** button doesn’t activate unless you select one or more datasets.

   Alternatively, you can preview your predictions by choosing the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) and then **View prediction results**. From the dataset preview, you can choose **Send to Quick**. The following screenshot shows you the **Send to Quick** button in a dataset preview.  
![\[Screenshot of a dataset preview with the Send to Quick button at the bottom.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/send-to-quicksight-preview.png)

1. In the **Send to Quick** dialog box, do the following:

   1. For **QuickSight users**, enter the names of the Quick users to whom you want to send your predictions. If you want to send the predictions to yourself, enter your own username. You can only send predictions to users in the `default` namespace of the Quick account, and each user must have the `Author` or `Admin` role in Quick.

   1. Choose **Send**.

   The following screenshot shows the **Send to Quick** dialog box:  
![\[The Send to Quick dialog box.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/send-to-quicksight.png)

After you send your batch predictions, the **QuickSight** field for the datasets you sent shows as `Sent`. In the box that confirms your predictions were sent, you can choose **Open Quick** to open your Quick application. If you’re done using Canvas, you should [log out](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-log-out.html) of the Canvas application.

The Quick users that you’ve sent datasets to can open their Quick application and view the Canvas datasets that have been shared with them. Then, they can create predictive dashboards with the data. For more information, see [Getting started with Quick data analysis](https://docs.aws.amazon.com/quicksight/latest/user/getting-started.html) in the *Quick User Guide*.

By default, all of the users to whom you send predictions have owner permissions for the dataset in Quick. Owners are able to create analyses, refresh, edit, delete, and re-share datasets. The changes that owners make to a dataset change the dataset for all users with access. To change the permissions, go to the dataset in Quick and manage its permissions. For more information, see [Viewing and editing the users that a dataset is shared with](https://docs.aws.amazon.com/quicksight/latest/user/sharing-data-sets.html#view-users-data-set) in the *Quick User Guide*.

# Download a model notebook


**Note**  
The model notebook feature is available for quick build and standard build tabular models, and fine-tuned foundation models. Model notebooks aren't supported for image prediction, text prediction, or time series forecasting models.  
If you'd like to generate a model notebook for a tabular model built before this feature was launched, you must rebuild the model.

For eligible models that you successfully build in Amazon SageMaker Canvas, Canvas generates a Jupyter notebook containing a report of all the model building steps. This notebook contains Python code that you can run locally or in an environment like Amazon SageMaker Studio Classic to replicate the steps necessary to build your model. The notebook can be useful if you’d like to experiment with the code or see the backend details of how Canvas builds models.

To access the model notebook, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **My models**.

1. Choose the model and version that you built.

1. On the model version’s page, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) in the header.

1. From the dropdown menu, choose **View Notebook**.

1. A popup appears with the notebook content. You can choose **Download** and then do one of the following:

   1. Choose **Download** to save the notebook content to your local device.

   1. Choose **Copy S3 URI** to copy the Amazon S3 location where the notebook is stored. The notebook is stored in the Amazon S3 bucket specified in your **Canvas storage configuration**, which is configured in the [Prerequisites for setting up Amazon SageMaker Canvas](canvas-getting-started.md#canvas-prerequisites) section.

You should now be able to view the notebook either locally or as an object in Amazon S3. You can upload the notebook to an IDE to edit and run the code, or you can share the notebook with others in your organization for review.
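
For example, if you copied the S3 URI, you might download the notebook with boto3 as in the following sketch; the URI shown is a placeholder.

```python
import boto3

# Placeholder URI -- paste the value from Copy S3 URI in the Canvas application.
s3_uri = "s3://amzn-s3-demo-bucket/Canvas/model-notebooks/my-model.ipynb"

bucket, key = s3_uri.removeprefix("s3://").split("/", 1)
boto3.client("s3").download_file(bucket, key, "my-model.ipynb")
print("Notebook saved to my-model.ipynb")
```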

# Send your model to Quick


If you use Quick and want to leverage SageMaker Canvas in your Quick visualizations, you can build an Amazon SageMaker Canvas model and use it as a *predictive field* in your Quick dataset. A *predictive field* is a field in your Quick dataset that can make predictions for a given column in your dataset, similar to how Canvas users make single or batch predictions with a model. To learn more about how to integrate Canvas predictive abilities into your Quick datasets, see [SageMaker Canvas integration](https://docs.aws.amazon.com/quicksight/latest/user/sagemaker-canvas-integration.html) in the [Quick User Guide](https://docs.aws.amazon.com/quicksight/latest/user/welcome.html).

The following steps explain how you can add a predictive field to your Quick dataset using a Canvas model:

1. Open the Canvas application and build a model with your dataset.

1. After building the model in Canvas, send the model to Quick. A schema file automatically downloads to your local machine when you send the model to Quick. You upload this schema file to Quick in the next step.

1. Open Quick and choose a dataset with the same schema as the dataset you used to build your model. Add a predictive field to the dataset and do the following:

   1. Specify the model sent from Canvas.

   1. Upload the schema file that was downloaded in Step 2.

1. Save and publish your changes, and then generate predictions for the new dataset. Quick uses the model to fill in the target column with predictions.

In order to send a model from Canvas to Quick, you must meet the following prerequisites:
+ You must have both Canvas and Quick set up. Your Quick account must be created in the same AWS Region as your Canvas application. If your Quick account’s home Region differs from your Canvas application’s Region, you must either [close](https://docs.aws.amazon.com/quicksight/latest/user/closing-account.html) and recreate your Quick account, or [set up a Canvas application](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-getting-started.html#canvas-prerequisites) in the same Region as your Quick account. Your Quick account must also contain the default namespace, which you set up when you first create your Quick account. Contact your administrator to help you get access to Quick. For more information, see [Setting up for Quick](https://docs.aws.amazon.com/quicksight/latest/user/setting-up.html) in the *Quick User Guide*.
+ Your user must have the necessary AWS Identity and Access Management (IAM) permissions to send your predictions to Quick. Your administrator can set up the IAM permissions for your user. For more information, see [Grant Your Users Permissions to Send Predictions to Quick](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-quicksight-permissions.html).
+ Quick must have access to the Amazon S3 bucket that you’ve specified for Canvas application storage. For more information, see [Configure your Amazon S3 storage](canvas-storage-configuration.md).

# Time Series Forecasts in Amazon SageMaker Canvas


**Note**  
Time series forecasting models are only supported for tabular datasets.

Amazon SageMaker Canvas gives you the ability to make time series forecasts with machine learning. Time series forecasting lets you make predictions for values that vary over time.

You can make a time series forecast for examples such as the following:
+ Forecasting your inventory in the coming months.
+ Predicting the number of items sold in the next four months.
+ Estimating the effect of reducing the price on sales during the holiday season.
+ Forecasting item inventory in the next 12 months.
+ Predicting the number of customers entering a store in the next several hours.
+ Forecasting how a 10% reduction in the price of a product affects sales over a time period.

To make a time series forecast, your dataset must have the following:
+ A timestamp column with all values having the `datetime` type.
+ A target column that has the values that you're using to forecast future values.
+ An item ID column that contains unique identifiers for each item in your dataset, such as SKU numbers.

The `datetime` values in the timestamp column must use one of the following formats (a validation sketch follows the list):
+ `YYYY-MM-DD HH:MM:SS`
+ `YYYY-MM-DDTHH:MM:SSZ`
+ `YYYY-MM-DD`
+ `MM/DD/YY`
+ `MM/DD/YY HH:MM`
+ `MM/DD/YYYY`
+ `YYYY/MM/DD HH:MM:SS`
+ `YYYY/MM/DD`
+ `DD/MM/YYYY`
+ `DD/MM/YY`
+ `DD-MM-YY`
+ `DD-MM-YYYY`
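
Before you import your data, you might want to confirm that every timestamp parses with one of the accepted formats. The following is a minimal sketch with pandas; the file and column names are placeholders for your own dataset.

```python
import pandas as pd

# The accepted timestamp formats, expressed as strptime patterns.
FORMATS = [
    "%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%d",
    "%m/%d/%y", "%m/%d/%y %H:%M", "%m/%d/%Y",
    "%Y/%m/%d %H:%M:%S", "%Y/%m/%d", "%d/%m/%Y",
    "%d/%m/%y", "%d-%m-%y", "%d-%m-%Y",
]

def matches_accepted_format(value: str) -> bool:
    """Return True if the value parses with at least one accepted format."""
    for fmt in FORMATS:
        try:
            pd.to_datetime(value, format=fmt)
            return True
        except ValueError:
            continue
    return False

# "sales.csv" and "timestamp" are placeholders for your dataset and its timestamp column.
df = pd.read_csv("sales.csv")
bad = df[~df["timestamp"].astype(str).map(matches_accepted_format)]
print(f"{len(bad)} rows have timestamps in an unsupported format")
```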

You can make forecasts for the following intervals:
+ 1 min
+ 5 min
+ 15 min
+ 30 min
+ 1 hour
+ 1 day
+ 1 week
+ 1 month
+ 1 year

## Future values in your input dataset


Canvas automatically detects columns in your dataset that might contain future values. If present, these values can enhance the accuracy of predictions. Canvas marks these columns with a `Future values` label. Canvas infers the relationship between the data in these columns and the target column that you're trying to predict, and uses that relationship to generate more accurate forecasts.

For example, you can forecast the amount of ice cream sold by a grocery store. To make a forecast, you must have a timestamp column and a column that indicates how much ice cream the grocery store sold. For a more accurate forecast, your dataset can also include the price, the ambient temperature, the flavor of the ice cream, or a unique identifier for the ice cream.

Ice cream sales might increase when the weather is warmer. A decrease in the price of the ice cream might result in more units sold. Having a column with ambient temperature data and a column with pricing data can improve your ability to forecast the number of units of ice cream the grocery store sells.

While providing future values is optional, doing so helps you perform what-if analyses directly in the Canvas application, showing you how changes in future values could alter your predictions.

## Handling missing values


You might have missing data for different reasons. The reason for your missing data might inform how you want Canvas to impute it. For example, your organization might use an automatic system that only tracks when a sale happens. If you're using a dataset that comes from this type of automatic system, you have missing values in the target column.

**Important**  
If you have missing values in the target column, we recommend using a dataset that doesn't have them. SageMaker Canvas uses the target column to forecast future values. Missing values in the target column can greatly reduce the accuracy of the forecast.

By default, Canvas automatically imputes missing values in your dataset for you by filling the target column with `0` and other numeric columns with the median value of the column.

However, you can select your own filling logic for the target column and other numeric columns in your datasets. Target columns have different filling guidelines and restrictions than the rest of the numeric columns. Target columns are filled up to the end of the historical period, whereas numeric columns are filled across both historical and future periods all the way to the end of the forecast horizon. Canvas only fills future values in a numeric column if your data has at least one record with a future timestamp and a value for that specific column.

You can choose one of the following filling logic options to impute missing values in your data:
+ `zero` – Fill with `0`.
+ `NaN` – Fill with NaN, or not a number. This is only supported for the target column.
+ `mean` – Fill with the mean value from the data series.
+ `median` – Fill with the median value from the data series.
+ `min` – Fill with the minimum value from the data series.
+ `max` – Fill with the maximum value from the data series.

When choosing a filling logic, consider how your model interprets it. For example, in a retail scenario, recording zero sales of an available item is different from recording zero sales of an unavailable item, because the latter doesn't necessarily imply a lack of customer interest. In this case, filling the target column with `0` might bias the model toward predicting a lack of customer interest in unavailable items. Conversely, filling with `NaN` might cause the model to ignore true occurrences of zero sales for available items.
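
In pandas terms, the filling options correspond roughly to the following operations. This sketch only illustrates the semantics of each option on a hypothetical `demand` target column; Canvas applies the filling logic for you during training.

```python
import pandas as pd

# "inventory.csv" and "demand" are placeholders for your dataset and target column.
df = pd.read_csv("inventory.csv")
series = df["demand"]

filled = {
    "zero":   series.fillna(0),
    "NaN":    series,                         # leave gaps as NaN (target column only)
    "mean":   series.fillna(series.mean()),
    "median": series.fillna(series.median()),
    "min":    series.fillna(series.min()),
    "max":    series.fillna(series.max()),
}
```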

## Types of forecasts


You can make one of the following types of forecasts:
+ **Single item**
+ **All items**

For a forecast on all the items in your dataset, SageMaker Canvas returns a forecast for the future values for each item in your dataset.

For a single item forecast, you specify the item and SageMaker Canvas returns a forecast for the future values. The forecast includes a line graph that plots the predicted values over time.

**Topics**
+ [

## Future values in your input dataset
](#canvas-time-series-future)
+ [

## Handling missing values
](#canvas-time-series-missing)
+ [

## Types of forecasts
](#canvas-time-series-types)
+ [

# Additional options for forecasting insights
](canvas-additional-insights.md)

# Additional options for forecasting insights


In Amazon SageMaker Canvas, you can use the following optional methods to get more insights from your forecast:
+ Group column
+ Holiday schedule
+ What-if scenario

You can specify a column in your dataset as a **Group column**. Amazon SageMaker Canvas groups the forecast by each value in the column. For example, you can group the forecast on columns containing price data or unique item identifiers. Grouping a forecast by a column lets you make more specific forecasts. For example, if you group a forecast on a column containing item identifiers, you can see the forecast for each item.

Overall sales of items might be impacted by holidays. For example, in the United States, the number of items sold in November and December might differ greatly from the number sold in January. If you use the data from November and December to forecast sales in January, your results might be inaccurate. Using a holiday schedule helps you avoid inaccurate results. You can use a holiday schedule for 251 countries.

For a forecast on a single item in your dataset, you can use what-if scenarios. A what-if scenario gives you the ability to change values in your data and see how the forecast changes. For example, you can use a what-if scenario to answer questions such as, "What if I lowered prices? How would that affect the number of items sold?"

# Adding model versions in Amazon SageMaker Canvas


In Amazon SageMaker Canvas, you can update the models that you’ve built by adding *versions*. Each model that you build has a version number. The first model is version 1 or `V1`. You can use model versions to see changes in prediction accuracy when you update your data or use [advanced transformations](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-prepare-data.html).

When viewing your model, SageMaker Canvas shows you the model history so that you can compare all of the model versions that you built. You can also delete versions that are no longer useful to you. By creating multiple model versions and evaluating their accuracy, you can iteratively improve your model performance.

**Note**  
Text prediction and image prediction models only support one model version.

To add a model version, you can either clone an existing version or create a new version. 

Cloning an existing version copies over the current model configuration, including the model recipe and the input dataset. Alternatively, you can create a new version if you want to configure a new model recipe or choose a different dataset. 

If you create a new version and select a different dataset, you must choose a dataset with the same target column and schema as the dataset from version 1.

Before you can add a new version, you must successfully build at least one model version. Then, you can [register a model version in the SageMaker Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-register-model.html). Use the registry for tracking model versions and for collaborating with Studio Classic users on production model approvals.

If you did a quick build for your first model version, you have the option to run a standard build when you add a version. Standard builds generally have higher accuracy. Therefore, if you feel confident in your quick build configuration, you can run a standard build to create a final version of your model. To learn more about the differences between quick builds and standard builds, see [How custom models work](canvas-build-model.md).

The following procedures show you how to add model versions; the procedure is different depending on whether you are adding a version of the same build type or a different build type (quick versus standard). Use the procedure **To add a new model version** to add versions of the same build type. To add a standard build model version after running a quick build, follow the procedure **To run a standard build**.

**To add a new model version**

1. Open your SageMaker Canvas application. For more information, see [Getting started with using Amazon SageMaker Canvas](canvas-getting-started.md).

1. In the left navigation pane, choose **My models**.

1. On the **My models** page, choose your model. To find your model, you can choose **Filter by problem type**.

1. After your model opens, choose the **Add version** button in the top panel.

1. From the dropdown menu, select one of the following options:

   1. **Add a new version from scratch** – When you select this option, the **Build** tab opens with the draft for a new model version. You can select a different dataset (as long as the schema matches the schema of the first model version’s dataset) and configure a new model recipe. For more information about building a model version, see [Build a model](canvas-build-model-how-to.md).

   1. **Clone an existing version with configurations** – A dialog box prompts you to select the version that you want to clone. After you've selected your desired version, choose **Clone**. The **Build** tab opens with the draft for a new model version. Any model recipe configurations are copied over from the cloned version. For more information about building a model version, see [Build a model](canvas-build-model-how-to.md).

**To run a standard build**

1. Open your SageMaker Canvas application. For more information, see [Getting started with using Amazon SageMaker Canvas](canvas-getting-started.md).

1. In the left navigation pane, choose **My models**.

1. On the **My models** page, choose your model. You can choose **Filter by problem type** to find your model more easily.

1. After your model opens, choose the **Analyze** tab.

1. Choose **Standard build**.  
![\[The Analyze tab of a Canvas model showing the standard build button.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-add-version-quick-to-standard.png)

   On the model draft page that opens to the **Build** tab, you can modify your model configuration and start a build. For more information about building a model version, see [Build a model](canvas-build-model-how-to.md).

You should now have a new model version build in progress. For more information about building a model, see [How custom models work](canvas-build-model.md).

After building a model version, you can return to your model details page at any time to view all of the versions or add more versions. The following image shows the **Versions** page for a model.

![\[The model versions page for a model in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/model-versions.png)


On the **Versions** page, you can view the following information for each of your model versions:
+ **Status** – This field tells you whether your model is currently building (`In building`), done building (`Ready`), failed to build (`Failed`), or still being edited (`In draft`).
+ **Model score**, **F1**, **Precision**, **Recall**, and **AUC** – If you turn on the **Show advanced metrics** toggle on this page, you can see these model metrics. These metrics indicate the accuracy and performance of your model. For more information, see [Evaluate your model](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-evaluate-model.html).
+ **Shared** – This field states whether you shared the model version with SageMaker Studio Classic users.
+ **Model registry** – This field states whether you registered the version to a model registry. For more information, see [Register a model version in the SageMaker AI model registry](canvas-register-model.md).

# MLOps


After building a model in SageMaker Canvas that you feel confident about, you might want to integrate your model with the machine learning operations (MLOps) processes in your organization. MLOps includes common tasks such as deploying a model for use in production or setting up continuous integration and continuous deployment (CI/CD) pipelines.

The following topics describe how you can use features within Canvas to use a Canvas-built model in production.

**Topics**
+ [

# Register a model version in the SageMaker AI model registry
](canvas-register-model.md)
+ [

# Deploy your models to an endpoint
](canvas-deploy-model.md)
+ [

# View your deployments
](canvas-deploy-model-view.md)
+ [

# Update a deployment configuration
](canvas-deploy-model-update.md)
+ [

# Test your deployment
](canvas-deploy-model-test.md)
+ [

# Invoke your endpoint
](canvas-deploy-model-invoke.md)
+ [

# Delete a model deployment
](canvas-deploy-model-delete.md)

# Register a model version in the SageMaker AI model registry


With SageMaker Canvas, you can build multiple iterations, or versions, of your model to improve it over time. You might want to build a new version of your model if you acquire better training data or if you want to attempt to improve the model’s accuracy. For more information about adding versions to your model, see [Update a model](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-update-model.html).

After you’ve [built a model](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-build-model.html) that you feel confident about, you might want to evaluate its performance and have it reviewed by a data scientist or MLOps engineer in your organization before using it in production. To do this, you can register your model versions to the [SageMaker Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html). The SageMaker Model Registry is a repository that data scientists or engineers can use to catalog machine learning (ML) models and manage model versions and their associated metadata, such as training metrics. They can also manage and log the approval status of a model.

After you register your model versions to the SageMaker Model Registry, a data scientist or your MLOps team can access the SageMaker Model Registry through [SageMaker Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html), which is a web-based integrated development environment (IDE) for working with machine learning models. In the SageMaker Model Registry interface in Studio Classic, the data scientist or MLOps team can evaluate your model and update its approval status. If the model doesn’t perform to their requirements, the data scientist or MLOps team can update the status to `Rejected`. If the model does perform to their requirements, then the data scientist or MLOps team can update the status to `Approved`. Then, they can [deploy your model to an endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html#deploy-model-prereqs) or [automate model deployment](https://aws.amazon.com/blogs/machine-learning/building-automating-managing-and-scaling-ml-workflows-using-amazon-sagemaker-pipelines/) with CI/CD pipelines. You can use the SageMaker AI model registry feature to seamlessly integrate models built in Canvas with the MLOps processes in your organization.
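
For example, outside of the Studio Classic interface, a data scientist could also review and approve registered versions with the boto3 SageMaker client, as in the following sketch. The model group name is a placeholder.

```python
import boto3

sm = boto3.client("sagemaker")

# "canvas-churn-model" is a placeholder for the model group created from Canvas.
versions = sm.list_model_packages(ModelPackageGroupName="canvas-churn-model")
for package in versions["ModelPackageSummaryList"]:
    print(package["ModelPackageArn"], package["ModelApprovalStatus"])

# After reviewing a version, update its approval status.
sm.update_model_package(
    ModelPackageArn=versions["ModelPackageSummaryList"][0]["ModelPackageArn"],
    ModelApprovalStatus="Approved",
)
```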

The following diagram summarizes an example of registering a model version built in Canvas to the SageMaker Model Registry for integration into an MLOps workflow.

![\[The steps registering a model version built in Canvas for integration into an MLOps workflow.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-model-registration-diagram.jpg)


You can register tabular, image, and text model versions to the SageMaker Model Registry. This includes time series forecasting models and JumpStart-based [fine-tuned foundation models](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-fm-chat-fine-tune.html).

**Note**  
Currently, you can't register Amazon Bedrock-based fine-tuned foundation models built in Canvas to the SageMaker Model Registry.

The following sections show you how to register a model version to the SageMaker Model Registry from Canvas.

## Permissions management


By default, you have permissions to register model versions to the SageMaker Model Registry. SageMaker AI grants these permissions for all new and existing Canvas user profiles through the [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasFullAccess.html) policy, which is attached to the AWS IAM execution role for the SageMaker AI domain that hosts your Canvas application.

When your Canvas administrator sets up a new domain or user profile by following the prerequisite instructions in the [Getting started guide](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-getting-started.html#canvas-prerequisites), SageMaker AI turns on model registration permissions through the **ML Ops permissions configuration** option, which is enabled by default.

The Canvas administrator can manage model registration permissions at the user profile level as well. For example, if the administrator wants to grant model registration permissions to some user profiles but remove permissions for others, they can edit the permissions for a specific user. The following procedure shows how to turn off model registration permissions for a specific user profile:

1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. From the list of domains, select the user profile’s domain.

1. On the **Domain details** page, choose the **User profile** whose permissions you want to edit.

1. On the **User Details** page, choose **Edit**.

1. In the left navigation pane, choose **Canvas settings**.

1. In the **ML Ops permissions configuration** section, turn off the **Enable Model Registry registration permissions** toggle.

1. Choose **Submit** to save the changes to your domain settings.

The user profile should no longer have model registration permissions.

## Register a model version to the SageMaker AI model registry


SageMaker Model Registry tracks all of the model versions that you build to solve a particular problem in a *model group*. When you build a SageMaker Canvas model and register it to SageMaker Model Registry, it gets added to a model group as a new model version. For example, if you build and register four versions of your model, then a data scientist or MLOps team working in the SageMaker Model Registry interface can view the model group and review all four versions of the model in one place.

When you register a Canvas model to the SageMaker Model Registry, a model group is automatically created and named after your Canvas model. Optionally, you can specify a different name, or you can use an existing model group in the SageMaker Model Registry. For more information about creating a model group, see [Create a Model Group](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry-model-group.html).
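
If you prefer to prepare an existing model group ahead of time, you can create one with the boto3 SageMaker client, as in the following sketch; the group name and description are placeholders.

```python
import boto3

sm = boto3.client("sagemaker")

# Placeholder name; Canvas can also create a model group for you automatically.
sm.create_model_package_group(
    ModelPackageGroupName="canvas-churn-model",
    ModelPackageGroupDescription="Versions of the Canvas customer churn model",
)
```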

**Note**  
Currently, you can only register models built in Canvas to the SageMaker Model Registry in the same account.

To register a model version to the SageMaker Model Registry from the Canvas application, use the following procedure:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **My models**.

1. On the **My models** page, choose your model. You can **Filter by problem type** to find your model more easily.

1. After you choose your model, the **Versions** page opens, listing all of the versions of your model. You can turn on the **Show advanced metrics** toggle to view advanced metrics, such as **Recall** and **Precision**, to compare your model versions and determine which one you’d like to register.

1. From the list of model versions, for the version that you want to register, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)). Alternatively, you can double-click the version that you want to register, and then on the version details page, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)).

1. In the dropdown list, choose **Add to Model Registry**. The **Add to Model Registry** dialog box opens.

1. In the **Add to Model Registry** dialog box, do the following:

   1. (Optional) In the **SageMaker Studio Classic model group** section, for the **Model group name** field, enter the name of the model group to which you want to register your version. You can specify the name for a new model group that SageMaker AI creates for you, or you can specify an existing model group. If you don’t specify this field, Canvas registers your version to a default model group with the same name as your model.

   1. Choose **Add**.

Your model version should now be registered to the model group in the SageMaker Model Registry. When you register a model version to a model group in the SageMaker Model Registry, all subsequent versions of the Canvas model are registered to the same model group (if you choose to register them). If you want to register your versions to a different model group, you must go to the SageMaker Model Registry and [delete the model group](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry-delete-model-group.html). Then, you can re-register your model versions to the new model group.

To view the status of your models, you can return to the **Versions** page for your model in the Canvas application. This page shows you the **Model Registry** status of each version. If the status is `Registered`, then the model has been successfully registered.

If you want to view the details of your registered model version, for the **Model Registry** status, you can hover over the **Registered** field to see the **Model registry details** pop-up box. These details include the following:
+ The **Model package group name** is the model group that your version is registered to in the SageMaker Model Registry.
+ The **Approval status**, which can be `Pending Approval`, `Approved`, or `Rejected`. If a Studio Classic user approves or rejects your version in the SageMaker Model Registry, then this status is updated on your model versions page when you refresh the page.

The following screenshot shows the **Model registry details** box, along with an **Approval status** of `Approved` for this particular model version.

![\[Screenshot of the SageMaker Model Registry details box in the Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/approved-mr.png)


# Deploy your models to an endpoint


In Amazon SageMaker Canvas, you can deploy your models to an endpoint to make predictions. SageMaker AI provides the ML infrastructure for you to host your model on an endpoint with the compute instances that you choose. Then, you can *invoke* the endpoint (send a prediction request) and get a real-time prediction from your model. With this functionality, you can use your model in production to respond to incoming requests, and you can integrate your model with existing applications and workflows.

To get started, you should have a model that you'd like to deploy. You can deploy custom model versions that you've built, Amazon SageMaker JumpStart foundation models, and fine-tuned JumpStart foundation models. For more information about building a model in Canvas, see [How custom models work](canvas-build-model.md). For more information about JumpStart foundation models in Canvas, see [Generative AI foundation models in SageMaker Canvas](canvas-fm-chat.md).

Review the following **Permissions management** section, and then begin creating new deployments in the **Deploy a model** section.

## Permissions management


By default, you have permissions to deploy models to SageMaker AI Hosting endpoints. SageMaker AI grants these permissions for all new and existing Canvas user profiles through the [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasFullAccess.html) policy, which is attached to the AWS IAM execution role for the SageMaker AI domain that hosts your Canvas application.

When your Canvas administrator sets up a new domain or user profile by following the prerequisite instructions in the [Prerequisites for setting up Amazon SageMaker Canvas](canvas-getting-started.md#canvas-prerequisites), SageMaker AI turns on model deployment permissions through the **Enable direct deployment of Canvas models** option, which is enabled by default.

The Canvas administrator can manage model deployment permissions at the user profile level as well. For example, if the administrator doesn't want to grant model deployment permissions to all user profiles when setting up a domain, they can grant permissions to specific users after creating the domain.

The following procedure shows how to modify the model deployment permissions for a specific user profile:

1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. From the list of domains, select the user profile’s domain.

1. On the **Domain details** page, select the **User profiles** tab.

1. Choose your **User profile**.

1. On the user profile's page, select the **App Configurations** tab.

1. In the **Canvas** section, choose **Edit**.

1. In the **ML Ops configuration** section, turn on the **Enable direct deployment of Canvas models** toggle to enable deployment permissions.

1. Choose **Submit** to save the changes to your domain settings.

The user profile should now have model deployment permissions.

After granting permissions to the domain or user profile, make sure that the user logs out of their Canvas application and logs back in to apply the permission changes.

## Deploy a model


To get started with deploying your model, you create a new deployment in Canvas and specify the model version that you want to deploy along with the ML infrastructure, such as the type and number of compute instances that you would like to use for hosting the model.

Canvas suggests a default instance type and number of instances based on your model type. You can learn more about the various SageMaker AI instance types on the [Amazon SageMaker pricing page](https://aws.amazon.com/sagemaker/pricing/). You are charged based on SageMaker AI instance pricing while your endpoint is active.

When deploying JumpStart foundation models, you also have the option to specify the length of the deployment time. You can deploy the model to an endpoint indefinitely (meaning the endpoint is active until you delete the deployment). Or, if you only need the endpoint for a short period of time and would like to reduce costs, you can deploy the model to an endpoint for a specified amount of time, after which SageMaker AI shuts down the endpoint for you.

**Note**  
If you deploy a model for a specified amount of time, stay logged in to the Canvas application for the duration of the deployment. If you log out of or delete the application, Canvas can't shut down the endpoint at the specified time.

After your model is deployed to a SageMaker AI Hosting [real-time inference endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html), you can begin making predictions by *invoking* the endpoint.

There are several different ways for you to deploy a model from the Canvas application. You can access the model deployment option through any of the following methods:
+ On the **My models** page of the Canvas application, choose the model that you want to deploy. Then, from the model’s **Versions** page, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) next to a model version and select **Deploy**.
+ When on the details page for a model version, on the **Analyze** tab, choose the **Deploy** option.
+ When on the details page for a model version, on the **Predict** tab, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) at the top of the page and select **Deploy**.
+ On the **ML Ops** page of the Canvas application, choose the **Deployments** tab and then choose **Create deployment**.
+ For JumpStart foundation models and fine-tuned foundation models, go to the **Ready-to-use models** page of the Canvas application. Choose **Generate, extract and summarize content**. Then, find the JumpStart foundation model or fine-tuned foundation model that you want to deploy. Choose the model, and on the model's chat page, choose the **Deploy** button.

All of these methods open the **Deploy model** side panel, where you specify the deployment configuration for your model. To deploy the model from this panel, do the following:

1. (Optional) If you’re creating a deployment from the **ML Ops** page, you’ll have the option to **Select model and version**. Use the dropdown menus to select the model and model version that you want to deploy.

1. Enter a name in the **Deployment name** field.

1. (For JumpStart foundation models and fine-tuned foundation models only) Choose a **Deployment length**. Select **Indefinite** to leave the endpoint active until you shut it down, or select **Specify length** and then enter the period of time for which you want the endpoint to remain active.

1. For **Instance type**, SageMaker AI detects a default instance type and number that is suitable for your model. However, you can change the instance type that you would like to use for hosting your model.
**Note**  
If you run out of the instance quota for the chosen instance type on your AWS account, you can request a quota increase. For more information about the default quotas and how to request an increase, see [Amazon SageMaker AI endpoints and quotas](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html) in the *AWS General Reference*.

1. For **Instance count**, you can set the number of active instances that are used for your endpoint. SageMaker AI detects a default number that is suitable for your model, but you can change this number.

1. When you’re ready to deploy your model, choose **Deploy**.

Your model should now be deployed to an endpoint.

# View your deployments


You might want to check the status or details of a model deployment in Amazon SageMaker Canvas. For example, if your deployment failed, you might want to check the details to troubleshoot.

You can view your Canvas model deployments from the Canvas application or from the Amazon SageMaker AI console.

To view deployment details from Canvas, use one of the following procedures:

To view your deployment details from the **ML Ops** page, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **ML Ops**.

1. Choose the **Deployments** tab.

1. Choose your deployment by name from the list.

To view your deployment details from a model version’s page, do the following:

1. In the SageMaker Canvas application, go to your model version’s details page.

1. Choose the **Deploy** tab.

1. In the **Deployments** section, which lists all of the deployment configurations associated with that model version, find your deployment.

1. Choose the **More options** icon (![\[More options icon for the output CSV file.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), and then select **View details** to open the details page.

The details page for your deployment opens, and you can view information such as the time of the most recent prediction, the endpoint’s status and configuration, and the model version that is currently deployed to the endpoint.

You can also view your currently active Canvas workspace instances and active endpoints from the **SageMaker AI dashboard** in the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/). Your Canvas endpoints are listed alongside any other SageMaker AI Hosting endpoints that you’ve created, and you can filter them by searching for endpoints with the Canvas tag.
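
You can also list endpoints and inspect their tags with the SDK. The following is a minimal sketch; because the exact Canvas tag key isn't listed here, it prints each endpoint's tags so that you can filter on the tag you see:

```
import boto3

sm = boto3.client("sagemaker")

# List every endpoint in the account, then look up each endpoint's tags.
for endpoint in sm.list_endpoints()["Endpoints"]:
    tags = sm.list_tags(ResourceArn=endpoint["EndpointArn"])["Tags"]
    print(endpoint["EndpointName"], endpoint["EndpointStatus"], tags)
```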

The following screenshot shows the SageMaker AI dashboard. In the **Canvas** section, you can see that one workspace instance is in service and four endpoints are active.

![\[Screenshot of the SageMaker AI dashboard showing the active Canvas workspace instances and endpoints.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-sagemaker-dashboard.png)


# Update a deployment configuration


You can update the deployment configuration for models that you've deployed to endpoints in Amazon SageMaker Canvas. For example, you can deploy an updated model version to the endpoint, or you can update the instance type or number of instances behind the endpoint based on your capacity needs.

There are several different ways for you to update your deployment from the Canvas application. You can use any of the following methods:
+ On the **ML Ops** page of the Canvas application, you can choose the **Deployments** tab and select the deployment that you want to update. Then, choose **Update configuration**.
+ When on the details page for a model version, on the **Deploy** tab, you can view the deployments for that version. Next to the deployment, choose the **More options** icon (![\[More options icon for the output CSV file.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) and then choose **Update configuration**.

Both of the preceding methods open the **Update configuration** side panel, where you can make changes to your deployment configuration. To update the configuration, do the following:

1. For the **Select version** dropdown menu, you can select a different model version to deploy to the endpoint.
**Note**  
When updating a deployment configuration, you can only choose a different model version to deploy. To deploy a different model, create a new deployment.

1. For **Instance type**, you can select a different instance type for hosting your model.

1. For **Instance count**, you can change the number of active instances that are used for your endpoint.

1. Choose **Save**.

Your deployment configuration should now be updated.
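
If you want to confirm the update outside of the Canvas application, you can check the endpoint status with the SageMaker AI API. The following is a minimal sketch, assuming a placeholder endpoint name; the status shows `Updating` while the change is applied and returns to `InService` when the new configuration is live:

```
import boto3

sm = boto3.client("sagemaker")

# "endpoint_name" is a placeholder for your deployment's endpoint name.
status = sm.describe_endpoint(EndpointName="endpoint_name")["EndpointStatus"]
print(status)
```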

# Test your deployment


You can test a model deployment by invoking the endpoint, or making single prediction requests, through the Amazon SageMaker Canvas application. You can use this functionality to confirm that your endpoint responds to requests before invoking your endpoint programmatically in a production environment.

## Test a custom model deployment


You can test a custom model deployment by accessing it through the **ML Ops** page and making a single invocation, which returns a prediction along with the probability that the prediction is correct.

**Note**  
Execution length is an estimate of the time taken to invoke and get a response from the endpoint in Canvas. For detailed latency metrics, see [SageMaker AI Endpoint Invocation Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-endpoint-invocation).

To test your endpoint through the Canvas application, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation panel, choose **ML Ops**.

1. Choose the **Deployments** tab.

1. From the list of deployments, choose the one with the endpoint that you want to invoke.

1. On the deployment’s details page, choose the **Test deployment** tab.

1. On the deployment testing page, you can modify the **Value** fields to specify a new data point. For time series forecasting models, you specify the **Item ID** for which you want to make a forecast.

1. After modifying the values, choose **Update** to get the prediction result.

The prediction loads, along with the **Invocation result** fields, which indicate whether the invocation was successful and how long the request took to process.

The following screenshot shows a prediction performed in the Canvas application on the **Test deployment** tab.

![\[The Canvas application showing a test prediction for a deployed model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-test-deployments.png)


For all model types except numeric prediction and time series forecasting, the prediction returns the following fields:
+ **predicted_label** – The predicted output.
+ **probability** – The probability that the predicted label is correct.
+ **labels** – The list of all possible labels.
+ **probabilities** – The probabilities corresponding to each label (the order of this list matches the order of the labels).

For numeric prediction models, the prediction only contains the **score** field, which is the predicted output of the model, such as the predicted price of a house.

For time series forecasting models, the prediction is a graph showing the forecasts by quantile. You can choose **Schema view** to see the forecasted numeric values for each quantile.

You can continue making single predictions through the deployment testing page, or you can see the following section [Invoke your endpoint](canvas-deploy-model-invoke.md) to learn how to invoke your endpoint programmatically from applications.

## Test a JumpStart foundation model deployment


You can chat with a deployed JumpStart foundation model through the Canvas application to test its functionality before invoking it through code.

To chat with a deployed JumpStart foundation model, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation panel, choose **ML Ops**.

1. Choose the **Deployments** tab.

1. From the list of deployments, find the one that you want to invoke and choose its **More options** icon (![\[More options icon for a model deployment.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)).

1. From the context menu, choose **Test deployment**.

1. A new **Generate, extract and summarize content** chat opens with the JumpStart foundation model, and you can begin typing prompts. Note that prompts from this chat are sent as requests to your SageMaker AI Hosting endpoint.

# Invoke your endpoint


**Note**  
We recommend that you [test your model deployment in Amazon SageMaker Canvas](canvas-deploy-model-test.md) before invoking a SageMaker AI endpoint programmatically.

You can use the Amazon SageMaker Canvas models that you've deployed to SageMaker AI endpoints in your production applications. Invoke the endpoint programmatically the same way that you invoke any other [SageMaker AI real-time endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html). Invoking an endpoint programmatically returns a response object that contains the same fields described in [Test your deployment](canvas-deploy-model-test.md).

For more detailed information about how to programmatically invoke endpoints, see [Invoke models for real-time inference](realtime-endpoints-test-endpoints.md).

The following Python examples show you how to invoke your endpoint based on the model type.

## JumpStart foundation models


The following example shows you how to invoke a JumpStart foundation model that you've deployed to an endpoint. Most foundation models accept a JSON payload, but the exact request schema varies by model, so check your model's documentation for the fields it expects.

```
import boto3
import json

client = boto3.client("runtime.sagemaker")

# The prompt and parameters below are illustrative; the payload fields
# (such as "inputs" and "parameters") depend on the model you deployed.
body = json.dumps({
    "inputs": "Summarize the benefits of no-code machine learning.",
    "parameters": {"max_new_tokens": 256}
})

response = client.invoke_endpoint(
    EndpointName="endpoint_name",
    ContentType="application/json",
    Body=body,
    Accept="application/json"
)
```

## Numeric and categorical prediction models


The following example shows you how to invoke numeric or categorical prediction models.

```
import boto3
import pandas as pd

client = boto3.client("runtime.sagemaker")
body = pd.DataFrame(
    # Each inner list is one row of feature values, in the same column
    # order as your training data.
    [['feature_column1', 'feature_column2'],
    ['feature_column1', 'feature_column2']]
).to_csv(header=False, index=False).encode("utf-8")
    
response = client.invoke_endpoint(
    EndpointName="endpoint_name",
    ContentType="text/csv",
    Body=body,
    Accept="application/json"
)
```

## Time series forecasting models


The following example shows you how to invoke time series forecasting models. For a complete example of how to invoke a time series forecasting model, see [Time-Series Forecasting with Amazon SageMaker Autopilot](https://github.com/aws/amazon-sagemaker-examples/blob/eef13dae197a6e588a8bc111aba3244f99ee0fbb/autopilot/autopilot_time_series.ipynb).

```
import boto3
import pandas as pd

csv_path = './real-time-payload.csv'
data = pd.read_csv(csv_path)

client = boto3.client("runtime.sagemaker")

body = data.to_csv(index=False).encode("utf-8")
    
response = client.invoke_endpoint(
    EndpointName="endpoint_name",
    ContentType="text/csv",
    Body=body,
    Accept="application/json"
)
```

## Image prediction models


The following example shows you how to invoke image prediction models.

```
import boto3
client = boto3.client("runtime.sagemaker")
with open("example_image.jpg", "rb") as file:
    body = file.read()
    response = client.invoke_endpoint(
        EndpointName="endpoint_name",
        ContentType="application/x-image",
        Body=body,
        Accept="application/json"
    )
```

## Text prediction models


The following example shows you how to invoke text prediction models.

```
import boto3
import pandas as pd

client = boto3.client("runtime.sagemaker")
body = pd.DataFrame([["Example text 1"], ["Example text 2"]]).to_csv(header=False, index=False).encode("utf-8")
    
response = client.invoke_endpoint(
    EndpointName="endpoint_name",
    ContentType="text/csv",
    Body=body,
    Accept="application/json"
)
```
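
For any of the preceding examples, you can read the prediction from the response body. The following is a minimal sketch; the exact fields depend on your model type, as described in [Test your deployment](canvas-deploy-model-test.md):

```
import json

# The response body is a stream containing the prediction as JSON (for
# example, predicted_label and probability for categorical models, or
# score for numeric prediction models).
result = json.loads(response["Body"].read().decode("utf-8"))
print(result)
```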

# Delete a model deployment


You can delete your model deployments from the Amazon SageMaker Canvas application. This action also deletes the endpoint from the SageMaker AI console and shuts down any endpoint-related resources.

**Note**  
Optionally, you can delete your endpoint through the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/) or using the SageMaker AI `DeleteEndpoint` API. For more information, see [Delete Endpoints and Resources](realtime-endpoints-delete-resources.md). However, when you delete the endpoint through the SageMaker AI console or APIs instead of the Canvas application, the list of deployments in Canvas isn’t automatically updated. You must also delete the deployment from the Canvas application to remove it from the list.
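
As a sketch of the API route mentioned in the preceding note, the following deletes an endpoint with the SDK, assuming a placeholder endpoint name:

```
import boto3

sm = boto3.client("sagemaker")

# "endpoint_name" is a placeholder. Deleting the endpoint this way stops
# endpoint billing, but remember to also delete the deployment from the
# Canvas application so that the deployments list stays accurate.
sm.delete_endpoint(EndpointName="endpoint_name")
```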

To delete a deployment in Canvas, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation panel, choose **ML Ops**.

1. Choose the **Deployments** tab.

1. From the list of deployments, choose the one that you want to delete.

1. At the top of the deployment details page, choose the **More options** icon (![\[More options icon for the output CSV file.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)).

1. Choose **Delete deployment**.

1. In the **Delete deployment** dialog box, choose **Delete**.

Your deployment and SageMaker AI Hosting endpoint should now be deleted from both Canvas and the SageMaker AI console.

# How to manage automations


In SageMaker Canvas, you can create automations that update your dataset or generate predictions from your model on a schedule. For example, you might receive new shipping data on a daily basis. You can set up an automatic update for your dataset and automatic batch predictions that run whenever the dataset is updated. Using these features, you can set up an automated workflow and reduce the amount of time you spend manually updating datasets and making predictions.

**Note**  
You can set up a maximum of 20 automatic configurations in your Canvas application. Automations are active only while you're logged in to the Canvas application. If you log out of Canvas, your automatic jobs pause until you log back in.

The following sections describe how to view, edit, and delete configurations for existing automations. To learn how to set up automations, see the following topics:
+ To set up automatic dataset updates, see [Update a dataset](canvas-update-dataset.md).
+ To set up automatic batch predictions, see [Batch predictions in SageMaker Canvas](canvas-make-predictions-batch.md).

**Topics**
+ [

# View your automations
](canvas-manage-automations-view.md)
+ [

# Edit your automatic configurations
](canvas-manage-automations-edit.md)
+ [

# Delete an automatic configuration
](canvas-manage-automations-delete.md)

# View your automations


You can view all of your automation jobs by going to the left navigation pane of Canvas and choosing **ML Ops**. The **ML Ops** page combines automations for both automatic dataset updates and automatic batch predictions. On the **Automations** tab, you can see the following sub-tabs:
+ **All jobs** – You can see every instance of a **Dataset update** or **Batch prediction** job that Canvas has run. For each job, you can see fields such as the associated **Input dataset**, the **Configuration name** of the associated auto update configuration, and the **Status** showing whether the job succeeded. You can filter the jobs by configuration name. Depending on the job type, you can do the following:
  + For dataset update jobs, you can choose the latest version of the dataset, or the most recent job, to preview the dataset.
  + For batch prediction jobs, you can choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) to preview or download the predictions for that job. You can also choose **View details** to see more details about your prediction job. For more information about batch prediction job details, see [View your batch prediction jobs](canvas-make-predictions-batch-auto-view.md).
+ **Configuration** – You can see all of the **Dataset update** and **Batch prediction** configurations you’ve created. For each configuration, you can see fields such as the associated **Input dataset** and the **Frequency** of the jobs. You can also turn off or turn on the **Auto update** toggle to pause or resume automatic updates. If you choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) for a specific configuration, you can choose to **View all jobs** for the configuration, **Update configuration**, or **Delete configuration**.

# Edit your automatic configurations


After setting up a configuration, you might want to make changes to it. For automatic dataset updates, you can update the Amazon S3 location for Canvas to import data, the frequency of the updates, and the starting time. For automatic batch predictions, you can change the dataset that the configuration tracks for updates. You can also turn off the automation to temporarily pause updates until you choose to resume them.

The following sections show you how to update each type of configuration.

**Note**  
You can’t change the frequency for automatic batch predictions because automatic batch predictions run every time the target dataset is updated.

**Topics**
+ [

# Edit your automatic dataset update configuration
](canvas-manage-automations-edit-dataset.md)
+ [

# Edit your automatic batch prediction configuration
](canvas-manage-automations-edit-batch.md)

# Edit your automatic dataset update configuration


You might want to make changes to your auto update configuration for a dataset, such as changing the frequency of the updates. You might also want to turn off your automatic update configuration to pause the updates to your dataset.

To make changes to your auto update configuration for a dataset, do the following:

1. In the left navigation pane of Canvas, choose **ML Ops**.

1. Choose the **Automations** tab.

1. Choose the **Configuration** tab.

1. For your auto update configuration, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)).

1. In the dropdown menu, choose **Update configuration**. You are taken to the **Auto updates** tab of the dataset.

1. Make your changes to the configuration. When you’re done making changes, choose **Save**.

To pause your dataset updates, turn off your automatic configuration. One way to turn off auto updates is by doing the following:

1. In the left navigation pane of Canvas, choose **ML Ops**.

1. Choose the **Automations** tab.

1. Choose the **Configuration** tab.

1. Find your configuration from the list and turn off the **Auto update** toggle.

Automatic updates for your dataset are now paused. You can turn this toggle back on at any time to resume the update schedule.

# Edit your automatic batch prediction configuration


When you edit a batch prediction configuration, you can change the target dataset but not the frequency (since automatic batch predictions occur whenever the dataset is updated).

To make changes to your automatic batch predictions configuration, do the following:

1. In the left navigation pane of Canvas, choose **ML Ops**.

1. Choose the **Automations** tab.

1. Choose the **Configuration** tab.

1. For your auto update configuration, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)).

1. In the dropdown menu, choose **Update configuration**.

1. In the **Automate batch prediction** dialog box that opens, select another dataset, and then choose **Set up** to save your changes.

Your automatic batch predictions configuration is now updated.

To pause your automatic batch predictions, turn off your automatic configuration. Use the following procedure to turn off your configuration:

1. In the left navigation pane of Canvas, choose **ML Ops**.

1. Choose the **Automations** tab.

1. Choose the **Configuration** tab.

1. Find your configuration from the list and turn off the **Auto update** toggle.

Automatic batch predictions for your dataset are now paused. You can turn this toggle back on at any time to resume them.

# Delete an automatic configuration


You might want to delete a configuration to stop your automated workflow in SageMaker Canvas.

To delete a configuration for automatic dataset updates or automatic batch predictions, do the following:

1. In the left navigation pane of Canvas, choose **ML Ops**.

1. Choose the **Automations** tab.

1. Choose the **Configuration** tab.

1. Find your auto update configuration, and choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)).

1. Choose **Delete configuration**.

1. In the dialog box that pops up, choose **Delete**.

Your auto update configuration is now deleted.

# Logging out of Amazon SageMaker Canvas
Logging out

After completing your work in Amazon SageMaker Canvas, you can log out or configure your application to automatically terminate the *workspace instance*. A workspace instance is dedicated for your use every time you launch a Canvas application, and you are billed for as long as the instance runs. Logging out or terminating the workspace instance stops the workspace instance billing. For more information, see [SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

The following sections describe how to log out of your Canvas application and how to configure your application to automatically shut down on a schedule.

## Log out of Canvas


When you log out of Canvas, your models and datasets aren't affected. Any quick or standard model builds or [large data processing jobs](canvas-export-data.md#canvas-export-data-s3) continue running even if you log out.

To log out, choose the **Log out** button (![\[Filter icon in the SageMaker Canvas app.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/logout-icon.png)) on the left panel of the SageMaker Canvas application.

You can also log out from the SageMaker Canvas application by closing your browser tab and then [deleting the application](canvas-manage-apps-delete.md) in the console.

After you log out, SageMaker Canvas tells you to relaunch it in a different tab. Logging back in takes around 1 minute. If you have an administrator who set up SageMaker Canvas for you, use the instructions they gave you to log back in. If you don't have an administrator, see the procedure for accessing SageMaker Canvas in [Prerequisites for setting up Amazon SageMaker Canvas](canvas-getting-started.md#canvas-prerequisites).

## Automatically shut down Canvas


If you’re a Canvas administrator, you might want to regularly shut down applications to reduce costs. You can either create a schedule to shut down active Canvas applications, or you can create an automation to shut down Canvas applications as soon as they’re *idle* (meaning the user hasn’t been active for 2 hours).

You can create these solutions using AWS Lambda functions that call the `DeleteApp` API and delete Canvas applications given certain conditions. For more information about these solutions and access to CloudFormation templates that you can use, see the blog post [Optimizing costs for Amazon SageMaker Canvas with automatic shutdown of idle apps](https://aws.amazon.com/blogs/machine-learning/optimizing-costs-for-amazon-sagemaker-canvas-with-automatic-shutdown-of-idle-apps/).
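
The following is a heavily simplified sketch of such a Lambda function; the blog's CloudFormation templates add the scheduling and idle-detection logic, which this sketch omits by deleting running Canvas apps unconditionally:

```
import boto3

sagemaker = boto3.client("sagemaker")

def lambda_handler(event, context):
    # Find running Canvas apps and delete them. Treat this only as a
    # starting point; see the blog post for the full solution.
    for app in sagemaker.list_apps()["Apps"]:
        if app["AppType"] == "Canvas" and app["Status"] == "InService":
            sagemaker.delete_app(
                DomainId=app["DomainId"],
                # Assumes the app runs under a user profile rather than
                # a shared space.
                UserProfileName=app["UserProfileName"],
                AppType="Canvas",
                AppName=app["AppName"],
            )
```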

**Note**  
You might experience missing [Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html) metrics if an error occurred when setting up your idle shutdown schedule or within CloudWatch itself. We recommend that you add a CloudWatch alarm that monitors for missing metrics. If you encounter this issue, contact Support for help.

# Limitations and troubleshooting


The following sections outline limitations that apply when using Amazon SageMaker Canvas, along with troubleshooting guidance. You can use this topic to help troubleshoot any issues that you encounter.

## Troubleshooting issues with granting permissions through the SageMaker AI console


If you're having trouble granting Canvas base permissions or Ready-to-use models permissions to your user, the user's AWS IAM execution role might have trust relationships with more than one AWS service. A trust relationship is a policy attached to your role that defines which principals (users, roles, accounts, or services) can assume the role. For example, you might encounter an issue granting additional Canvas permissions to your user if their execution role has trust relationships with both Amazon SageMaker AI and Amazon Forecast.

You can fix this problem by choosing one of the following options.

### 1. Remove all but one trusted service from the role.


This solution requires you to edit the trust relationship for your user profile’s IAM role and remove all AWS services except SageMaker AI.

To edit the trust relationship for your IAM execution role, do the following:

1. Go to the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. In the navigation pane of the IAM console, choose **Roles**. The console displays the roles for your account.

1. Choose the name of the role that you want to modify, and select the **Trust relationships** tab on the details page.

1. Choose **Edit trust policy**.

1. In the **Edit trust policy editor**, paste the following, and then choose **Update Policy**.

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Principal": {
                   "Service": [
                       "sagemaker.amazonaws.com"
                   ]
               },
               "Action": "sts:AssumeRole"
           }
       ]
   }
   ```


You can also update this policy document using the AWS CLI. For more information, see [update-assume-role-policy](https://docs.aws.amazon.com/cli/latest/reference/iam/update-assume-role-policy.html) in the *AWS CLI Command Reference*.
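
If you prefer the SDK, the following is a minimal sketch of the same change, assuming a placeholder role name:

```
import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": ["sagemaker.amazonaws.com"]},
            "Action": "sts:AssumeRole",
        }
    ],
}

# "MySageMakerExecutionRole" is a placeholder for your role's name.
iam.update_assume_role_policy(
    RoleName="MySageMakerExecutionRole",
    PolicyDocument=json.dumps(trust_policy),
)
```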

You can now retry granting the Canvas base permissions or the Ready-to-use models permissions to your user.

### 2. Use a different role with only one trusted service.


This solution requires you to specify a different IAM role for your user profile. Use this option if you already have an IAM role that you can substitute.

To specify a different execution role for your user, do the following:

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. From the list of domains, select the domain that you want to view a list of user profiles for.

1. On the **Domain details** page, choose the **User profiles** tab.

1. Choose the user whose permissions you want to edit. On the **User details** page, choose **Edit**.

1. On the **General settings** page, choose the **Execution role** dropdown list and select the role that you want to use.

1. Choose **Submit** to save your changes to the user profile.

Your user should now be using an execution role with only one trusted service (SageMaker AI).

You can retry granting the Canvas base permissions or the Ready-to-use models permissions to your user.

### 3. Manually attach the AWS managed policy to the execution role instead of using the toggle in the SageMaker AI domain settings.


Instead of using the toggle in the domain or user profile settings, you can manually attach the AWS managed policies that grant a user the correct permissions.

To grant a user Canvas base permissions, attach the [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasFullAccess) policy. To grant a user Ready-to-use models permissions, attach the [AmazonSageMakerCanvasAIServicesAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasAIServicesAccess) policy.

Use the following procedure to attach an AWS managed policy to your role:

1. Go to the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. Choose **Roles**.

1. In the search box, search for the user's IAM role by name and select it.

1. On the page for the user's role, under **Permissions**, choose **Add permissions**.

1. From the dropdown menu, choose **Attach policies**.

1. Search for and select the policy or policies that you want to attach to the user’s execution role:

   1. To grant the Canvas base permissions, search for and select the [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasFullAccess) policy.

   1. To grant the Ready-to-use models permissions, search for and select the [AmazonSageMakerCanvasAIServicesAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasAIServicesAccess) policy.

1. Choose **Add permissions** to attach the policy to the role.

After attaching an AWS managed policy to the user’s role through the IAM console, your user should now have the Canvas base permissions or Ready-to-use models permissions.
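
Equivalently, you can attach the managed policies with the SDK. The following is a minimal sketch, assuming a placeholder role name; attach one or both policies depending on the permissions that the user needs:

```
import boto3

iam = boto3.client("iam")

# "MySageMakerExecutionRole" is a placeholder for the user's execution role.
for policy_arn in [
    "arn:aws:iam::aws:policy/AmazonSageMakerCanvasFullAccess",
    "arn:aws:iam::aws:policy/AmazonSageMakerCanvasAIServicesAccess",
]:
    iam.attach_role_policy(
        RoleName="MySageMakerExecutionRole",
        PolicyArn=policy_arn,
    )
```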

## Troubleshooting issues with creating a Canvas application due to space failure


When creating a new Canvas application, if you encounter an error stating `Unable to create app <app-arn> because space <space-arn> is not in InService state`, this indicates that the underlying Amazon SageMaker Studio space creation has failed. A Studio *space* is the underlying storage that hosts your Canvas application data. For more general information about Studio spaces, see [Amazon SageMaker Studio spaces](studio-updated-spaces.md). For more information about configuring spaces in Canvas, see [Store SageMaker Canvas application data in your own SageMaker AI space](canvas-spaces-setup.md).

To determine the root cause of the space creation failure, you can use the [DescribeSpace](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeSpace.html) API to check the `FailureReason` field. For more information about the possible statuses of spaces and what they mean, see [Amazon SageMaker AI domain entities and statuses](sm-domain.md).
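
For example, the following is a minimal sketch of that check with the SDK; the domain ID and space name are placeholders that you can take from the space ARN in the error message:

```
import boto3

sm = boto3.client("sagemaker")

# "d-xxxxxxxxxxxx" and "space_name" are placeholders.
space = sm.describe_space(DomainId="d-xxxxxxxxxxxx", SpaceName="space_name")
print(space["Status"], space.get("FailureReason"))
```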

To resolve this issue, find your domain in the SageMaker AI console and delete the failed space listed in the error message you received. For detailed steps on how to find and delete a space, see the page [Stop and delete your Studio running applications and spaces](studio-updated-running-stop.md) and follow the instructions to **Delete a Studio space**. Deleting the space also deletes any applications associated with the space. After deleting the space, you can try to create your Canvas application again. The space should now provision successfully, allowing Canvas to launch.

# Billing and cost in SageMaker Canvas


To track the costs associated with your SageMaker Canvas application, you can use the AWS Billing and Cost Management service. Billing and Cost Management provides tools to help you gather information related to your cost and usage, analyze your cost drivers and usage trends, and take action to budget your spending. For more information, see [What is AWS Billing and Cost Management?](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/billing-what-is.html)

Billing in SageMaker Canvas consists of the following components:
+ Workspace instance charges – You are charged for the number of hours that you are logged in to or using SageMaker Canvas. We recommend that you log out or create a schedule to shut down any Canvas applications that you’re not actively using to reduce costs. For more information, see [Logging out of Amazon SageMaker Canvas](canvas-log-out.md).
+ AWS service charges – You are charged for building and making predictions with custom models, or for making predictions with Ready-to-use models:
  + Training charges – For all model types, you are charged based on your resource usage while the model builds. These resources include any compute instances that Canvas spins up. You may see these charges on your account as Hosting, Training, Processing, or Batch Transform jobs.
  + Prediction charges – You are charged for the resources used to generate predictions, depending on the type of custom model that you built or the type of Ready-to-use model you used.

The [Ready-to-use models](canvas-ready-to-use-models.md) in Canvas leverage other AWS services to generate predictions. When you use a Ready-to-use model, you are charged by the respective service, and their pricing conditions apply:
+ For sentiment analysis, entities extraction, language detection, and personal information detection, you're charged according to [Amazon Comprehend pricing](https://aws.amazon.com/comprehend/pricing/).
+ For object detection in images and text detection in images, you're charged according to [Amazon Rekognition pricing](https://aws.amazon.com/rekognition/pricing/).
+ For expense analysis, identity document analysis, and document analysis, you're charged according to [Amazon Textract pricing](https://aws.amazon.com/textract/pricing/).

For more information, see [SageMaker Canvas pricing](https://aws.amazon.com/sagemaker/canvas/pricing/).

To help you track your costs in Billing and Cost Management, you can assign custom tags to your SageMaker Canvas app and users. You can track the costs your apps incur, and by tagging individual user profiles, you can track costs based on the user profile. For more information about tags, see [Using Cost Allocation Tags](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/cost-alloc-tags.html).

You can add tags to your SageMaker Canvas app and users by doing the following:
+ If you are setting up your Amazon SageMaker AI domain and SageMaker Canvas for the first time, follow the [Getting Started](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-getting-started.html) instructions and add tags when creating your domain or users. You can add tags either through the **General settings** in the domain console setup, or through the APIs ([CreateDomain](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateDomain.html) or [CreateUserProfile](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateUserProfile.html)). SageMaker AI adds the tags specified in your domain or UserProfile to any SageMaker Canvas apps or users you create after you create the domain.
+ If you want to add tags to apps in an existing domain, you must add tags to either the domain or the UserProfile. You can add tags through either the console or the [AddTags](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AddTags.html) API. If you add tags through the console, you must delete and relaunch your SageMaker Canvas app for the tags to propagate to the app. If you use the API, the tags are added directly to the app. For more information about deleting and relaunching a SageMaker Canvas app, see [Manage apps](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-manage-apps.html). A sketch of the `AddTags` call follows this list.
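
The following is a minimal sketch of tagging a user profile through the API; the Region, account ID, domain ID, user profile name, and tag key/value are all hypothetical placeholders:

```
import boto3

sm = boto3.client("sagemaker")

# The ARN and the tag key/value below are placeholders; tag the domain or
# the user profile whose costs you want to track.
sm.add_tags(
    ResourceArn="arn:aws:sagemaker:us-east-1:111122223333:user-profile/d-xxxxxxxxxxxx/my-user",
    Tags=[{"Key": "cost-center", "Value": "analytics"}],
)
```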

After you add tags to your domain, it might take up to 24 hours for the tags to appear in the AWS Billing and Cost Management console for activation. After they appear in the console, it takes another 24 hours for the tags to activate.

On the **Cost explorer** page, you can group and filter your costs by tags and usage types to separate your Workspace instance charges from your Training charges. The charges for each are listed as the following:
+ Workspace instance charges: Charges show up under the usage type `REGION-Canvas:Session-Hrs (Hrs)`.
+ Training charges: Charges show up under the usage types for SageMaker AI Hosting, Training, Processing, or Batch Transform jobs.
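
If you prefer to query these charges programmatically, the Cost Explorer (`GetCostAndUsage`) API supports the same usage type filter. The following is a minimal sketch with placeholder dates; the usage type value is Region-prefixed, so adjust it to match your Region:

```
import boto3

ce = boto3.client("ce")

# A sketch that totals workspace instance charges for one month. The usage
# type is Region-prefixed (for example, USE1 for US East (N. Virginia)),
# and the dates are placeholders.
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={
        "Dimensions": {
            "Key": "USAGE_TYPE",
            "Values": ["USE1-Canvas:Session-Hrs"],
        }
    },
)
print(response["ResultsByTime"])
```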