

# Machine learning environments offered by Amazon SageMaker AI
<a name="machine-learning-environments"></a>

**Important**  
Amazon SageMaker Studio and Amazon SageMaker Studio Classic are two of the machine learning environments that you can use to interact with SageMaker AI.  
If your domain was created after November 30, 2023, Studio is your default experience.  
If your domain was created before November 30, 2023, Amazon SageMaker Studio Classic is your default experience. To use Studio if Amazon SageMaker Studio Classic is your default experience, see [Migration from Amazon SageMaker Studio Classic](studio-updated-migrate.md).  
When you migrate from Amazon SageMaker Studio Classic to Amazon SageMaker Studio, there is no loss in feature availability. Studio Classic also exists as an IDE within Amazon SageMaker Studio to help you run your legacy machine learning workflows.

SageMaker AI supports the following machine learning environments:
+ *Amazon SageMaker Studio* (Recommended): The latest web-based experience for running ML workflows with a suite of IDEs. Studio supports the following applications:
  + Amazon SageMaker Studio Classic
  + Code Editor, based on Code-OSS, Visual Studio Code - Open Source
  + JupyterLab
  + Amazon SageMaker Canvas
  + RStudio
+ *Amazon SageMaker Studio Classic*: Lets you build, train, debug, deploy, and monitor your machine learning models.
+ *Amazon SageMaker Notebook Instances*: Lets you prepare and process data, and train and deploy machine learning models from a compute instance running the Jupyter Notebook application.
+ *Amazon SageMaker Studio Lab*: Studio Lab is a free service that gives you access to AWS compute resources, in an environment based on open-source JupyterLab, without requiring an AWS account.
+ *Amazon SageMaker Canvas*: Gives you the ability to use machine learning to generate predictions without needing to code.
+ *Amazon SageMaker geospatial*: Gives you the ability to build, train, and deploy geospatial models.
+ *RStudio on Amazon SageMaker AI*: RStudio is an IDE for [R](https://aws.amazon.com/blogs/opensource/getting-started-with-r-on-amazon-web-services/) that includes a console, a syntax-highlighting editor with support for direct code execution, and tools for plotting, history, debugging, and workspace management.
+ *SageMaker HyperPod*: SageMaker HyperPod lets you provision resilient clusters for running machine learning (ML) workloads and developing state-of-the-art models such as large language models (LLMs), diffusion models, and foundation models (FMs).

To use these machine learning environments, you or your organization's administrator must create an Amazon SageMaker AI domain. The exceptions are Studio Lab, SageMaker Notebook Instances, and SageMaker HyperPod.

Instead of manually provisioning resources and managing permissions for yourself and your users, you can create an Amazon DataZone domain. Creating an Amazon DataZone domain creates a corresponding Amazon SageMaker AI domain with AWS Glue or Amazon Redshift databases for your ETL workflows. Setting up a domain through Amazon DataZone reduces the time it takes to set up SageMaker AI environments for your users. For more information about setting up an Amazon SageMaker AI domain within Amazon DataZone, see [Set up SageMaker Assets (administrator guide)](sm-assets-set-up.md).

Users within the Amazon DataZone domain have permissions to all Amazon SageMaker AI actions, but their permissions are scoped down to resources within the Amazon DataZone domain.

Creating an Amazon DataZone domain also streamlines setting up a domain in which your users can share data and models with each other. For information about how users can share data and models, see [Controlled access to assets with Amazon SageMaker Assets](sm-assets.md).

**Topics**
+ [Amazon SageMaker Studio](studio-updated.md)
+ [SageMaker JupyterLab](studio-updated-jl.md)
+ [Amazon SageMaker notebook instances](nbi.md)
+ [Amazon SageMaker Studio Lab](studio-lab.md)
+ [Amazon SageMaker Canvas](canvas.md)
+ [Amazon SageMaker geospatial capabilities](geospatial.md)
+ [RStudio on Amazon SageMaker AI](rstudio.md)
+ [Code Editor in Amazon SageMaker Studio](code-editor.md)
+ [Amazon SageMaker HyperPod](sagemaker-hyperpod.md)
+ [Generative AI in SageMaker notebook environments](jupyterai.md)
+ [Amazon Q Developer](studio-updated-amazon-q.md)
+ [Amazon SageMaker Partner AI Apps overview](partner-apps.md)

# Amazon SageMaker Studio
<a name="studio-updated"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the updated Studio experience. For information about using the Studio Classic application, see [Amazon SageMaker Studio Classic](studio.md).

 Amazon SageMaker Studio is the latest web-based experience for running ML workflows. Studio offers a suite of integrated development environments (IDEs). These include Code Editor, based on Code-OSS, Visual Studio Code - Open Source, a new JupyterLab application, RStudio, and Amazon SageMaker Studio Classic. For more information, see [Applications supported in Amazon SageMaker Studio](studio-updated-apps.md). 

The new web-based UI in Studio is faster and provides access to all SageMaker AI resources, including jobs and endpoints, in one interface. ML practitioners can also choose their preferred IDE to accelerate ML development. A data scientist can use JupyterLab to explore data and tune models. In addition, a machine learning operations (MLOps) engineer can use Code Editor with the pipelines tool in Studio to deploy and monitor models in production. 

 The previous Studio experience is still being supported as Amazon SageMaker Studio Classic. Studio Classic is the default experience for existing customers, and is available as an application in Studio. For more information about Studio Classic, see [Amazon SageMaker Studio Classic](studio.md). For information about how to migrate from Studio Classic to Studio, see [Migration from Amazon SageMaker Studio Classic](studio-updated-migrate.md). 

 Studio offers the following benefits: 
+ A new JupyterLab application that has a faster start-up time and is more reliable than the existing Studio Classic application. For more information, see [SageMaker JupyterLab](studio-updated-jl.md).
+ A suite of IDEs that open in a separate tab, including the new Code Editor, based on Code-OSS, Visual Studio Code - Open Source application. Users can interact with supported IDEs in a full screen experience. For more information, see [Applications supported in Amazon SageMaker Studio](studio-updated-apps.md).
+ Access to all of your SageMaker AI resources in one place. Studio displays running instances across all of your applications.  
+ Access to all training jobs in a single view, regardless of whether they were scheduled from notebooks or initiated from Amazon SageMaker JumpStart.
+ Simplified model deployment workflows and endpoint management and monitoring directly from Studio. You don't need to access the SageMaker AI console. 
+ Automatic creation of all configured applications when you onboard to a domain. For information about onboarding to a domain, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md).
+ An improved JumpStart experience where you can discover, import, register, fine-tune, and deploy a foundation model. For more information, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).

**Topics**
+ [Launch Amazon SageMaker Studio](studio-updated-launch.md)
+ [Amazon SageMaker Studio UI overview](studio-updated-ui.md)
+ [Amazon EFS auto-mounting in Studio](studio-updated-automount.md)
+ [Idle shutdown](studio-updated-idle-shutdown.md)
+ [Applications supported in Amazon SageMaker Studio](studio-updated-apps.md)
+ [Connect your Remote IDE to SageMaker spaces with remote access](remote-access.md)
+ [Bring your own image (BYOI)](studio-updated-byoi.md)
+ [Lifecycle configurations within Amazon SageMaker Studio](studio-lifecycle-configurations.md)
+ [Amazon SageMaker Studio spaces](studio-updated-spaces.md)
+ [Trusted identity propagation with Studio](trustedidentitypropagation.md)
+ [Perform common UI tasks](studio-updated-common.md)
+ [NVMe stores with Amazon SageMaker Studio](studio-updated-nvme.md)
+ [Local mode support in Amazon SageMaker Studio](studio-updated-local.md)
+ [View your Studio running instances, applications, and spaces](studio-updated-running.md)
+ [Stop and delete your Studio running applications and spaces](studio-updated-running-stop.md)
+ [SageMaker Studio image support policy](sagemaker-distribution.md)
+ [Amazon SageMaker Studio pricing](studio-updated-cost.md)
+ [Troubleshooting](studio-updated-troubleshooting.md)
+ [Migration from Amazon SageMaker Studio Classic](studio-updated-migrate.md)
+ [Amazon SageMaker Studio Classic](studio.md)

# Launch Amazon SageMaker Studio
<a name="studio-updated-launch"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the updated Studio experience. For information about using the Studio Classic application, see [Amazon SageMaker Studio Classic](studio.md).

 This page's topics demonstrate how to launch Amazon SageMaker Studio from the Amazon SageMaker AI console and the AWS Command Line Interface (AWS CLI). 

**Topics**
+ [Prerequisites](#studio-updated-launch-prereq)
+ [Launch from the Amazon SageMaker AI console](#studio-updated-launch-console)
+ [Launch using the AWS CLI](#studio-updated-launch-cli)

## Prerequisites
<a name="studio-updated-launch-prereq"></a>

 Before you begin, complete the following prerequisites: 
+ Onboard to a SageMaker AI domain with Studio access. If you don't have permissions to set Studio as the default experience for your domain, contact your administrator. For more information, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md). 
+ Update the AWS CLI by following the steps in [Installing the current AWS CLI Version](https://docs.aws.amazon.com//cli/latest/userguide/install-cliv1.html#install-tool-bundled). 
+ From your local machine, run `aws configure` and provide your AWS credentials. For information about AWS credentials, see [Understanding and getting your AWS credentials](https://docs.aws.amazon.com//general/latest/gr/aws-sec-cred-types.html).
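
After running `aws configure`, you can confirm that the AWS CLI can reach AWS with your credentials. The following check uses the standard `aws sts get-caller-identity` command, which is not specific to SageMaker AI:

```
aws sts get-caller-identity
```

If the command returns your account ID and caller ARN, the CLI is configured correctly.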

## Launch from the Amazon SageMaker AI console
<a name="studio-updated-launch-console"></a>

Complete the following procedure to launch Studio from the Amazon SageMaker AI console.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1.  From the left navigation pane, choose Studio. 

1.  From the Studio landing page, select the domain and user profile for launching Studio. 

1.  Choose **Open Studio**. 

1.  To launch Studio, choose **Launch personal Studio**. 

## Launch using the AWS CLI
<a name="studio-updated-launch-cli"></a>

This section demonstrates how to launch Studio using the AWS CLI. The procedure to access Studio using the AWS CLI depends on whether the domain uses AWS Identity and Access Management (IAM) authentication or AWS IAM Identity Center authentication. When your domain uses IAM authentication, you can use the AWS CLI to launch Studio by creating a presigned domain URL. For information about launching Studio with IAM Identity Center authentication, see [Use custom setup for Amazon SageMaker AI](onboard-custom.md). 

### Launch if Studio is the default experience
<a name="studio-updated-launch-console-updated"></a>

 The following code snippet demonstrates how to launch Studio from the AWS CLI using a presigned domain URL if Studio is the default experience. For more information, see [create-presigned-domain-url](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/create-presigned-domain-url.html). 

```
aws sagemaker create-presigned-domain-url \
--region region \
--domain-id domain-id \
--user-profile-name user-profile-name \
--session-expiration-duration-in-seconds 43200
```
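
The command returns a JSON response that includes an `AuthorizedUrl` field containing the presigned URL. As a sketch, you can extract just the URL with the `--query` option, using the same placeholder values as above:

```
aws sagemaker create-presigned-domain-url \
--region region \
--domain-id domain-id \
--user-profile-name user-profile-name \
--session-expiration-duration-in-seconds 43200 \
--query AuthorizedUrl \
--output text
```

Open the returned URL in a browser before the session expiration duration elapses.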

### Launch if Amazon SageMaker Studio Classic is your default experience
<a name="studio-updated-launch-console-classic"></a>

 The following code snippet demonstrates how to launch Studio from the AWS CLI using a presigned domain URL if Studio Classic is the default experience. For more information, see [create-presigned-domain-url](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/create-presigned-domain-url.html). 

```
aws sagemaker create-presigned-domain-url \
--region region \
--domain-id domain-id \
--user-profile-name user-profile-name \
--session-expiration-duration-in-seconds 43200 \
--landing-uri studio::
```

# Amazon SageMaker Studio UI overview
<a name="studio-updated-ui"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the updated Studio experience. For information about using the Studio Classic application, see [Amazon SageMaker Studio Classic](studio.md).

 The Amazon SageMaker Studio user interface is split into three distinct parts. This page gives information about the distinct parts and their components. 
+  **Navigation bar**– This section of the UI includes the URL, breadcrumbs, notifications, and user options. 
+  **Navigation pane**– This section of the UI includes a list of the applications that are supported in Studio and options for the main workflows in Studio. 
+  **Content pane**– The main working area that displays the current page of the Studio UI that you have open.

![\[Amazon SageMaker Studio home page with navigation pane and content pane (main working area).\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/monarch/studio-updated-ui.png)


**Topics**
+ [Amazon SageMaker Studio navigation bar](#studio-updated-ui-top)
+ [Amazon SageMaker Studio navigation pane](#studio-updated-ui-left)
+ [Studio content pane](#studio-updated-ui-working)

## Amazon SageMaker Studio navigation bar
<a name="studio-updated-ui-top"></a>

 The navigation bar of the Studio UI includes the URL, breadcrumbs, notifications, and user options. 

 **URL Structure** 

 The URL of Studio changes as you navigate the UI. When you navigate to a different page in the UI, the URL changes to reflect that page. With the updated URL, you can open any page in the Studio UI directly without navigating to the landing page first. 

 **Breadcrumbs** 

 As you navigate through the Studio UI, the breadcrumbs keep track of the parent pages of the current page. By choosing one of these breadcrumbs, you can navigate to parent pages in the UI. 

 **Notifications** 

 The notifications section of the UI gives information about important changes to Studio, updates to applications, and issues to resolve. 

 **User options** 

Choose the user options icon (![\[User icon with a circular avatar placeholder and a downward-pointing arrow.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/monarch/user-settings.png)) to view information about the user profile that is currently signed in to Studio and to sign out of Studio.

## Amazon SageMaker Studio navigation pane
<a name="studio-updated-ui-left"></a>

 **Navigation pane** 

 The navigation pane of the UI includes a list of the applications that are supported in Studio. It also provides options for the main workflows in Studio. 

 This section of the UI can be used in an expanded or collapsed state. To change whether the section is expanded or collapsed, select the **Collapse** icon (![\[Square icon with "ID" text representing an identity or identification concept.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/monarch/collapse-ui.png)). 

 **Applications** 

 The applications section lists the applications that are available in Studio. If you choose one of the application types, you are directed to the landing page for that application. 

 **Workflows** 

 The list of workflows includes all of the available actions that you can take in Studio. Choose one of the options to navigate to the landing page for that workflow. If there are multiple workflows available for that option, choosing the option opens a dropdown menu where you can select the desired landing page. 

 The following list describes the options and provides a link for more information. 
+  **Home**– The main landing page with an overview, getting started, and what’s new. 
+  **Running instances**– All of the instances that are currently running in Studio. For more information, see [View your Studio running instances, applications, and spaces](studio-updated-running.md). 
+  **Data**– Data preparation options where you can collaborate to store, explore, prepare, transform, and share your data.  
  +  For more information about Amazon SageMaker Data Wrangler, see [Data preparation](canvas-data-prep.md). 
  +  For more information about Amazon SageMaker Feature Store, see [Create, store, and share features with Feature Store](feature-store.md). 
  +  For more information about Amazon EMR clusters, see [Data preparation using Amazon EMR](studio-notebooks-emr-cluster.md). 
+  **Auto ML**– Automatically build, train, tune, and deploy machine learning (ML) models. For more information, see [Amazon SageMaker Canvas](canvas.md). 
+  **Experiments**– Create, manage, analyze, and compare your machine learning experiments using Amazon SageMaker Experiments. For more information, see [Amazon SageMaker Experiments in Studio Classic](experiments.md). 
+  **Jobs**– View jobs created in Studio.  
  +  For more information about training, see [Model training](train-model.md). 
  +  For more information about model evaluation, see [Understand options for evaluating large language models with SageMaker Clarify](clarify-foundation-model-evaluate.md). 
+  **Pipelines**– Automate your ML workflow with Amazon SageMaker Pipelines, which provides resources to help you build, track, and manage your pipeline resources. For more information, see [Pipelines](pipelines.md).
+  **Models**– Organize your models into groups and collections in the model registry, where you can manage model versions, view metadata, and deploy models to production. For more information, see [Model Registration Deployment with Model Registry](model-registry.md).
+  **JumpStart**– Amazon SageMaker JumpStart provides pretrained, open-source models for a wide range of problem types to help you get started with machine learning. For more information, see [SageMaker JumpStart pretrained models](studio-jumpstart.md). 
+  **Deployments**– Deploy your machine learning (ML) models for inference.
  +  For more information about Amazon SageMaker Inference Recommender, see [Amazon SageMaker Inference Recommender](inference-recommender.md). 
  +  For more information about endpoints, see [Deploy models for inference](deploy-model.md). 

## Studio content pane
<a name="studio-updated-ui-working"></a>

 The main working area is also called the content pane. It displays the current page of the Studio UI that you have open. 

 **Studio home page** 

 The Studio home page is the primary landing page in the main working area. The home page includes two distinct tabs. There is an **Overview** tab and a **Getting started** tab. 

 **Overview** 

 The **Overview** tab includes options to start spaces for popular application types, get started with pre-built and automated solutions for ML workflows, and links to common tasks in the Studio UI. 

 **Getting started** 

 The **Getting started** tab includes information, guidance, and resources on how to begin with Studio. This includes a guided tour of the Studio UI, a link to documentation about Studio, and a selection of quick tips. 

# Amazon EFS auto-mounting in Studio
<a name="studio-updated-automount"></a>

 Amazon SageMaker AI supports automatically mounting a folder in an Amazon EFS volume for each user in a domain. Using this folder, users can share data between their own private spaces. However, users cannot share data with other users in the domain. Users only have access to their own folder. 

 The user’s folder can be accessed through a folder named `user-default-efs` in the `$HOME` directory of the Studio application.

 For information about opting out of Amazon EFS auto-mounting, see [Opt out of Amazon EFS auto-mounting](studio-updated-automount-optout.md). 

 Amazon EFS auto-mounting also facilitates the migration of data from Studio Classic to Studio. For more information, see [(Optional) Migrate data from Studio Classic to Studio](studio-updated-migrate-data.md). 

 **Access point information** 

 When auto-mounting is activated, SageMaker AI uses an Amazon EFS access point to facilitate access to the data in the Amazon EFS volume. For more information about access points, see [Working with Amazon EFS access points](https://docs.aws.amazon.com/efs/latest/ug/efs-access-points.html). SageMaker AI creates a unique access point for each user profile in the domain, either during user profile creation or during application creation for an existing user profile. The POSIX user value of the access point matches the `HomeEfsFileSystemUid` value of the user profile that SageMaker AI creates the access point for. To get this value for a user profile, see [DescribeUserProfile](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeUserProfile.html#sagemaker-DescribeUserProfile-response-HomeEfsFileSystemUid). The root directory path is also set to the same value as the POSIX user value.
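
For example, the `HomeEfsFileSystemUid` value for a user profile can be retrieved with the AWS CLI, using placeholder values in the same style as the other examples on this page:

```
aws sagemaker describe-user-profile \
--region region \
--domain-id domain-id \
--user-profile-name user-profile-name \
--query HomeEfsFileSystemUid \
--output text
```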

 SageMaker AI sets the permissions of the new directory to the following values: 

 
+ Owner user ID: the POSIX user value
+ Owner group ID: `0`
+ Permissions: `700`
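
Mode `700` grants read, write, and execute access to the owner only; group members and other users have no access. The following Python snippet is purely illustrative (it is not part of SageMaker AI) and shows how this mode reads for a directory:

```
import stat

# Directory mode 700: full access for the owner, none for group or others.
mode = stat.S_IFDIR | 0o700
print(stat.filemode(mode))  # drwx------

# Owner has read, write, and execute permissions.
assert mode & stat.S_IRWXU == stat.S_IRWXU
# Group and others have no permissions.
assert mode & (stat.S_IRWXG | stat.S_IRWXO) == 0
```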

 The access point is required to access the Amazon EFS volume. As a result, you cannot delete or update the access point without losing access to the Amazon EFS volume. 

 **Error resolution** 

 If SageMaker AI encounters an issue when auto-mounting the Amazon EFS user folder during application creation, the application is still created. However, in this case, SageMaker AI creates a file named `error.txt` instead of mounting the Amazon EFS folder. This file describes the error encountered, as well as steps to resolve it. SageMaker AI creates the `error.txt` file in the `user-default-efs` folder located in the `$HOME` directory of the application. 
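
As a quick diagnostic, you can check for this file from a terminal in the application. The following is a minimal sketch; the path follows the convention described above:

```
ERROR_FILE="$HOME/user-default-efs/error.txt"
if [ -f "$ERROR_FILE" ]; then
  echo "EFS auto-mount failed; see details below:"
  cat "$ERROR_FILE"
else
  echo "No auto-mount error recorded."
fi
```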

# Opt out of Amazon EFS auto-mounting
<a name="studio-updated-automount-optout"></a>

 You can opt out of Amazon SageMaker AI automatically mounting Amazon EFS user folders during domain or user profile creation, or for an existing domain or user profile. 

## Opt out during domain creation
<a name="studio-updated-automount-optout-domain-creation"></a>

 You can opt out of Amazon EFS auto-mounting when creating a domain using either the console or the AWS Command Line Interface. 

### Console
<a name="studio-updated-automount-optout-domain-creation-console"></a>

Complete the following steps to opt out of Amazon EFS auto-mounting when creating a domain from the console. 

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1.  Complete the steps in [Use custom setup for Amazon SageMaker AI](onboard-custom.md) with the following modification to set up a domain. 
   +  On the **Configure storage** step, turn off **Automatically mount EFS storage and data**. 

### AWS CLI
<a name="studio-updated-automount-optout-domain-creation-cli"></a>

 Use the following command to opt out of Amazon EFS auto-mounting during domain creation using the AWS CLI. For more information about creating a domain using the AWS CLI, see [Use custom setup for Amazon SageMaker AI](onboard-custom.md).

```
aws --region region sagemaker create-domain \
--domain-name "my-domain-$(date +%s)" \
--vpc-id default-vpc-id \
--subnet-ids subnet-ids \
--auth-mode IAM \
--default-user-settings "ExecutionRole=execution-role-arn,AutoMountHomeEFS=Disabled" \
--default-space-settings "ExecutionRole=execution-role-arn"
```

## Opt out for an existing domain
<a name="studio-updated-automount-optout-domain-existing"></a>

 You can opt out of Amazon EFS auto-mounting for an existing domain using either the console or the AWS CLI. 

### Console
<a name="studio-updated-automount-optout-domain-existing-console"></a>

 Complete the following steps to opt out of Amazon EFS auto-mounting when updating a domain from the console. 

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1.  On the left navigation under **Admin configurations**, choose **Domains**. 

1.  On the **Domains** page, select the domain that you want to opt out of Amazon EFS auto-mounting for. 

1.  On the **Domain details** page, select the **Domain settings** tab. 

1.  Navigate to the **Storage configurations** section. 

1.  Select **Edit**. 

1.  From the **Edit storage settings** page, turn off **Automatically mount EFS storage and data**. 

1.  Select **Submit**.

### AWS CLI
<a name="studio-updated-automount-optout-domain-existing-cli"></a>

 Use the following command to opt out of Amazon EFS auto-mounting while updating an existing domain using the AWS CLI. 

```
aws --region region sagemaker update-domain \
--domain-id domain-id \
--default-user-settings "AutoMountHomeEFS=Disabled"
```

## Opt out during user profile creation
<a name="studio-updated-automount-optout-user-creation"></a>

 You can opt out of Amazon EFS auto-mounting when creating a user profile using either the console or the AWS CLI. 

### Console
<a name="studio-updated-automount-optout-user-creation-console"></a>

 Complete the following steps to opt out of Amazon EFS auto-mounting when creating a user profile from the console. 

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1.  Complete the steps in [Add user profiles](domain-user-profile-add.md) with the following modification to create a user profile. 
   +  On the **Data and Storage** step, turn off **Inherit settings from domain**. This allows the user to have a different value than the defaults that are set for the domain.  
   +  Turn off **Automatically mount EFS storage and data**. 

### AWS CLI
<a name="studio-updated-automount-optout-user-creation-cli"></a>

 Use the following command to opt out of Amazon EFS auto-mounting during user profile creation using the AWS CLI. Valid values for `AutoMountHomeEFS` are `Enabled`, `Disabled`, and `DefaultAsDomain`. For more information about creating a user profile using the AWS CLI, see [Add user profiles](domain-user-profile-add.md).

```
aws --region region sagemaker create-user-profile \
--domain-id domain-id \
--user-profile-name "user-profile-$(date +%s)" \
--user-settings "ExecutionRole=arn:aws:iam::account-id:role/execution-role-name,AutoMountHomeEFS=Disabled"
```

## Opt out for an existing user profile
<a name="studio-updated-automount-optout-user-existing"></a>

 You can opt out of Amazon EFS auto-mounting for an existing user profile using either the console or the AWS CLI. 

### Console
<a name="studio-updated-automount-optout-user-existing-console"></a>

 Complete the following steps to opt out of Amazon EFS auto-mounting when updating a user profile from the console. 

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1.  On the left navigation under **Admin configurations**, choose **Domains**. 

1.  On the **Domains** page, select the domain containing the user profile that you want to opt out of Amazon EFS auto-mounting for. 

1.  On the **Domains details** page, select the **User profiles** tab. 

1.  Select the user profile to update. 

1.  From the **User Details** tab, navigate to the **AutoMountHomeEFS** section. 

1.  Select **Edit**. 

1.  From the **Edit storage settings** page, turn off **Inherit settings from domain**. This allows the user to have a different value than the defaults that are set for the domain.  

1.  Turn off **Automatically mount EFS storage and data**. 

1.  Select **Submit**. 

### AWS CLI
<a name="studio-updated-automount-optout-user-existing-cli"></a>

 Use the following command to opt out of Amazon EFS auto-mounting while updating an existing user profile using the AWS CLI. 

```
aws --region region sagemaker update-user-profile \
--domain-id domain-id \
--user-profile-name user-profile-name \
--user-settings "AutoMountHomeEFS=DefaultAsDomain"
```

# Idle shutdown
<a name="studio-updated-idle-shutdown"></a>

Amazon SageMaker AI supports shutting down idle resources to help manage costs and prevent overruns caused by idle, billable resources. It does this by detecting when an application is idle and shutting the application down when the idle criteria are met. 

SageMaker AI supports idle shutdown for the following applications. Idle shutdown must be set for each application type independently. 
+  JupyterLab 
+  Code Editor, based on Code-OSS, Visual Studio Code - Open Source 

 Idle shutdown can be set at either the domain or user profile level. When idle shutdown is set at the domain level, the idle shutdown settings apply to all applications created in the domain. When set at the user profile level, the idle shutdown settings apply only to the specific users that they are set for. User profile settings override domain settings.  

**Note**  
Idle shutdown requires the SageMaker Distribution (SMD) image, version 2.0 or later. Domains using an older SMD version can't use this feature and must instead use a lifecycle configuration (LCC) to manage auto-shutdown. 

## Definition of idle
<a name="studio-updated-idle-shutdown-definition"></a>

 Idle shutdown settings apply only when the application becomes idle with no jobs running. SageMaker AI doesn't start the idle shutdown timer until the instance becomes idle. The definition of idle differs based on whether the application type is JupyterLab or Code Editor. 

 For JupyterLab applications, the instance is considered idle when the following conditions are met: 
+  No active Jupyter kernel sessions 
+  No active Jupyter terminal sessions 

 For Code Editor applications, the instance is considered idle when the following conditions are met: 
+  No text file or notebook changes 
+  No files being viewed 
+  No interaction with the terminal
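As an illustration only, the two idle definitions above can be sketched as predicates over an activity snapshot. The field names here are hypothetical; the actual detection is performed inside the application by the activity-monitor extension:

```python
from dataclasses import dataclass

@dataclass
class ActivitySnapshot:
    """Hypothetical view of an application's activity signals."""
    active_kernel_sessions: int = 0         # JupyterLab
    active_terminal_sessions: int = 0       # JupyterLab and Code Editor
    file_or_notebook_changes: bool = False  # Code Editor
    files_being_viewed: bool = False        # Code Editor

def jupyterlab_is_idle(s: ActivitySnapshot) -> bool:
    # Idle when there are no active kernel or terminal sessions.
    return s.active_kernel_sessions == 0 and s.active_terminal_sessions == 0

def code_editor_is_idle(s: ActivitySnapshot) -> bool:
    # Idle when nothing is edited or viewed and the terminal is untouched.
    return (not s.file_or_notebook_changes
            and not s.files_being_viewed
            and s.active_terminal_sessions == 0)
```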

# Set up idle shutdown
<a name="studio-updated-idle-shutdown-setup"></a>

 The following sections show how to set up idle shutdown from either the console or using the AWS CLI. Idle shutdown can be set at either the domain or user profile level. 

## Prerequisites
<a name="studio-updated-idle-shutdown-setup-prereq"></a>

 To use idle shutdown with your application, you must complete the following prerequisites. 
+ Ensure that your application is using SageMaker Distribution (SMD) version 2.0 or later. You can select this version during application creation or update the image version of the application after creation. For more information, see [Update the SageMaker Distribution Image](studio-updated-jl-update-distribution-image.md).
+ For applications built with custom images, idle shutdown is supported if your custom image is created with SageMaker Distribution (SMD) version 2.0 or later as the base image. If the custom image is created with a different base image, then you must install the [jupyter-activity-monitor-extension >= 0.3.1](https://anaconda.org/conda-forge/jupyter-activity-monitor-extension) extension on the image and attach the image to your Amazon SageMaker AI domain for JupyterLab applications. For more information about custom images, see [Bring your own image (BYOI)](studio-updated-byoi.md).
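If you script your setup, you might want to check an image's distribution version string against the 2.0 floor before relying on idle shutdown. This is a hypothetical helper; how you obtain the version string depends on your image configuration:

```python
def meets_smd_floor(version: str, floor: tuple = (2, 0)) -> bool:
    """Return True if an SMD version string such as '2.1.0' meets the floor."""
    parts = [int(p) for p in version.split(".")[:2]]
    parts += [0] * (2 - len(parts))  # pad a bare '2' to (2, 0)
    return tuple(parts) >= floor
```

For example, `meets_smd_floor("1.12.3")` is `False`, so a domain pinned to that image would need LCC-based auto-shutdown instead.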

## From the Console
<a name="studio-updated-idle-shutdown-setup-console"></a>

 The following sections show how to enable idle shutdown from the console. 

### Add when creating a new domain
<a name="studio-updated-idle-shutdown-setup-console-new-domain"></a>

1. Create a domain by following the steps in [Use custom setup for Amazon SageMaker AI](onboard-custom.md).

1.  When configuring the application settings in the domain, navigate to either the Code Editor or JupyterLab section.  

1.  Select **Enable idle shutdown**. 

1.  Enter a default idle shutdown time in minutes. This value defaults to `10,080` (7 days) if no value is entered.

1.  (Optional) Select **Allow users to set custom idle shutdown time** to allow users to modify the idle shutdown time. 
   + Enter the maximum value, in minutes, that users can set the idle shutdown time to. A maximum value is required. The minimum value is fixed by Amazon SageMaker AI at `60` minutes.

### Add to an existing domain
<a name="studio-updated-idle-shutdown-setup-console-existing-domain"></a>

**Note**  
If idle shutdown is set when applications are running, they must be restarted for idle shutdown settings to take effect. 

1.  Navigate to the domain. 

1.  Choose the **App Configurations** tab. 

1.  From the **App Configurations** tab, navigate to either the Code Editor or JupyterLab section. 

1.  Select **Edit**. 

1.  Select **Enable idle shutdown**. 

1.  Enter a default idle shutdown time in minutes. This value defaults to `10,080` (7 days) if no value is entered.

1.  (Optional) Select **Allow users to set custom idle shutdown time** to allow users to modify the idle shutdown time. 
   + Enter the maximum value, in minutes, that users can set the idle shutdown time to. A maximum value is required. The minimum value is fixed by Amazon SageMaker AI at `60` minutes.

1.  Select **Submit**. 

### Add when creating a new user profile
<a name="studio-updated-idle-shutdown-setup-console-new-userprofile"></a>

1. Add a user profile by following the steps at [Add user profiles](domain-user-profile-add.md).

1.  When configuring the application settings for the user profile, navigate to either the Code Editor or JupyterLab section. 

1.  Select **Enable idle shutdown**. 

1.  Enter a default idle shutdown time in minutes. This value defaults to `10,080` (7 days) if no value is entered.

1.  (Optional) Select **Allow users to set custom idle shutdown time** to allow users to modify the idle shutdown time. 
   + Enter the maximum value, in minutes, that users can set the idle shutdown time to. A maximum value is required. The minimum value is fixed by Amazon SageMaker AI at `60` minutes.

1.  Select **Save Changes**.

### Add to an existing user profile
<a name="studio-updated-idle-shutdown-setup-console-existing-userprofile"></a>

**Note**  
If idle shutdown is set when applications are running, they must be restarted for idle shutdown settings to take effect.

1.  Navigate to the user profile. 

1.  Choose the **App Configurations** tab. 

1.  From the **App Configurations** tab, navigate to either the Code Editor or JupyterLab section.

1.  Select **Edit**. 

1.  If idle shutdown is configured at the domain level, the domain settings are shown by default.

1.  Select **Enable idle shutdown**. 

1.  Enter a default idle shutdown time in minutes. This value defaults to `10,080` (7 days) if no value is entered.

1.  (Optional) Select **Allow users to set custom idle shutdown time** to allow users to modify the idle shutdown time. 
   + Enter the maximum value, in minutes, that users can set the idle shutdown time to. A maximum value is required. The minimum value is fixed by Amazon SageMaker AI at `60` minutes.

1.  Select **Save Changes**. 

## From the AWS CLI
<a name="studio-updated-idle-shutdown-setup-cli"></a>

 The following sections show how to enable idle shutdown using the AWS CLI. 

**Note**  
To enforce a specific timeout value from the AWS CLI, you must set `IdleTimeoutInMinutes`, `MaxIdleTimeoutInMinutes`, and `MinIdleTimeoutInMinutes` to the same value.
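For example, the settings files used in the following sections can be generated with a small helper (illustrative only, not part of the AWS CLI or SDK); omitting the min/max arguments sets all three fields to the same value, which enforces a fixed timeout:

```python
import json

def idle_settings(default_minutes, min_minutes=None, max_minutes=None):
    """Build the JupyterLab idle-shutdown settings payload."""
    min_minutes = min_minutes or default_minutes
    max_minutes = max_minutes or default_minutes
    if min_minutes < 60:
        raise ValueError("the minimum idle timeout is 60 minutes")
    if not min_minutes <= default_minutes <= max_minutes:
        raise ValueError("default must fall between min and max")
    return {
        "JupyterLabAppSettings": {
            "AppLifecycleManagement": {
                "IdleSettings": {
                    "LifecycleManagement": "ENABLED",
                    "IdleTimeoutInMinutes": default_minutes,
                    "MinIdleTimeoutInMinutes": min_minutes,
                    "MaxIdleTimeoutInMinutes": max_minutes,
                }
            }
        }
    }

# Enforced 2-hour timeout, for use with `aws sagemaker update-domain`.
with open("default-user-settings.json", "w") as f:
    json.dump(idle_settings(120), f, indent=4)
```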

### Domain
<a name="studio-updated-idle-shutdown-setup-cli-domain"></a>

 The following command shows how to enable idle shutdown when updating an existing domain. To add idle shutdown for a new domain, use the `create-domain` command instead. 

**Note**  
If idle shutdown is set when applications are running, they must be restarted for idle shutdown settings to take effect. 

```
aws sagemaker update-domain --region region --domain-id domain-id \
--default-user-settings file://default-user-settings.json

## default-user-settings.json example for enforcing the default timeout
{
    "JupyterLabAppSettings": {
        "AppLifecycleManagement": {
            "IdleSettings": {
                "LifecycleManagement": "ENABLED",
                "IdleTimeoutInMinutes": 120,
                "MaxIdleTimeoutInMinutes": 120,
                "MinIdleTimeoutInMinutes": 120
            }
        }
    }
}

## default-user-settings.json example for letting users customize the default timeout, between 2-5 hours
{
    "JupyterLabAppSettings": {
        "AppLifecycleManagement": {
            "IdleSettings": {
                "LifecycleManagement": "ENABLED",
                "IdleTimeoutInMinutes": 120,
                "MinIdleTimeoutInMinutes": 120,
                "MaxIdleTimeoutInMinutes": 300
            }
        }
    }
}
```

### User profile
<a name="studio-updated-idle-shutdown-setup-cli-userprofile"></a>

 The following command shows how to enable idle shutdown when updating an existing user profile. To add idle shutdown for a new user profile, use the `create-user-profile` command instead. 

**Note**  
If idle shutdown is set when applications are running, they must be restarted for idle shutdown settings to take effect. 

```
aws sagemaker update-user-profile --region region --domain-id domain-id \
--user-profile-name user-profile-name --user-settings file://user-settings.json

## user-settings.json example for enforcing the default timeout
{
    "JupyterLabAppSettings": {
        "AppLifecycleManagement": {
            "IdleSettings": {
                "LifecycleManagement": "ENABLED",
                "IdleTimeoutInMinutes": 120,
                "MaxIdleTimeoutInMinutes": 120,
                "MinIdleTimeoutInMinutes": 120
            }
        }
    }
}

## user-settings.json example for letting users customize the default timeout, between 2-5 hours
{
    "JupyterLabAppSettings": {
        "AppLifecycleManagement": {
            "IdleSettings": {
                "LifecycleManagement": "ENABLED",
                "IdleTimeoutInMinutes": 120,
                "MinIdleTimeoutInMinutes": 120,
                "MaxIdleTimeoutInMinutes": 300
            }
        }
    }
}
```

# Update default idle shutdown settings
<a name="studio-updated-idle-shutdown-update"></a>

 You can update the default idle shutdown settings at either the domain or user profile level. 

**Note**  
If idle shutdown is set when applications are running, they must be restarted for idle shutdown settings to take effect. 

## Update domain settings
<a name="studio-updated-idle-shutdown-update-domain"></a>

1.  Navigate to the domain. 

1.  Choose the **App Configurations** tab. 

1.  From the **App Configurations** tab, navigate to either the Code Editor or JupyterLab section.  

1.  In the section for the application that you want to modify the idle shutdown time limit for, select **Edit**. 

1.  Update the idle shutdown settings for the domain. 

1.  Select **Save Changes**. 

## Update user profile settings
<a name="studio-updated-idle-shutdown-update-userprofile"></a>

1.  Navigate to the domain. 

1.  Choose the **User profiles** tab. 

1.  From the **User profiles** tab, select the user profile to edit. 

1.  From the **User profile** page, choose the **Applications** tab. 

1.  On the **Applications** tab, navigate to either the Code Editor or JupyterLab section.  

1.  In the section for the application that you want to modify the idle shutdown time limit for, select **Edit**. 

1.  Update the idle shutdown settings for the user profile. 

1.  Select **Save Changes**. 

# Modify your idle shutdown time limit
<a name="studio-updated-idle-shutdown-modify"></a>

Users can modify the idle shutdown time limit if the administrator allows it when enabling idle shutdown. The administrator can also apply an upper limit on the idle shutdown time. A user can set the value anywhere between the lower and upper limits.

1.  Launch Amazon SageMaker Studio by following the steps in [Launch Amazon SageMaker Studio](studio-updated-launch.md). 

1.  From the **Applications** section, select the application type to update the idle shutdown time for. 

1.  Select the space to update. 

1.  Update **Idle shutdown (mins)** with your desired value. 
**Note**  
If idle shutdown is set when applications are running, they must be restarted for idle shutdown settings to take effect. 

# Applications supported in Amazon SageMaker Studio
<a name="studio-updated-apps"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the updated Studio experience. For information about using the Studio Classic application, see [Amazon SageMaker Studio Classic](studio.md).

 Amazon SageMaker Studio supports the following applications: 
+  **Code Editor, based on Code-OSS, Visual Studio Code - Open Source**– Code Editor offers a lightweight and powerful integrated development environment (IDE) with familiar shortcuts, an integrated terminal, advanced debugging capabilities, and refactoring tools. It is a fully managed, browser-based application in Studio. For more information, see [Code Editor in Amazon SageMaker Studio](code-editor.md). 
+  **Amazon SageMaker Studio Classic**– Amazon SageMaker Studio Classic is a web-based IDE for machine learning. With Studio Classic, you can build, train, debug, deploy, and monitor your machine learning models. For more information, see [Amazon SageMaker Studio Classic](studio.md). 
+  **JupyterLab**– JupyterLab offers a set of capabilities that augment the fully managed notebook offering. It includes kernels that start in seconds, a pre-configured runtime with popular data science and machine learning frameworks, and high-performance block storage. For more information, see [SageMaker JupyterLab](studio-updated-jl.md). 
+  **Amazon SageMaker Canvas**– With SageMaker Canvas, you can use machine learning to generate predictions without writing code. With Canvas, you can chat with popular large language models (LLMs), access ready-to-use models, or build a custom model that's trained on your data. For more information, see [Amazon SageMaker Canvas](canvas.md). 
+  **RStudio**– RStudio is an integrated development environment for R. It includes a console and syntax-highlighting editor that supports running code directly. It also includes tools for plotting, history, debugging, and workspace management. For more information, see [RStudio on Amazon SageMaker AI](rstudio.md). 

# Connect your Remote IDE to SageMaker spaces with remote access
<a name="remote-access"></a>

You can remotely connect from your Remote IDE to Amazon SageMaker Studio spaces. You can use your customized local IDE setup, including AI-assisted development tools and custom extensions, with the scalable compute resources in Amazon SageMaker AI. This guide provides concepts and setup instructions for administrators and users.

A Remote IDE connection establishes a secure connection between your local IDE and SageMaker spaces. This connection lets you:
+ **Access SageMaker AI compute resources** — Run code on scalable SageMaker AI infrastructure from your local environment
+ **Maintain security boundaries** — Work within the same security framework as SageMaker AI
+ **Keep your familiar IDE experience** — Use compatible local extensions, themes, and configurations that support remote development

**Note**  
Not all IDE extensions are compatible with remote development. Extensions that require local GUI components, have architecture dependencies, or need specific client-server interactions may not work properly in the remote environment. Verify that your required extensions support remote development before use.

**Topics**
+ [Key concepts](#remote-access-key-concepts)
+ [Connection methods](#remote-access-connection-methods)
+ [Supported IDEs](#remote-access-supported-ides)
+ [IDE version requirements](#remote-access-ide-version-requirements)
+ [Operating system requirements](#remote-access-os-requirements)
+ [Local machine prerequisites](#remote-access-local-prerequisites)
+ [Image requirements](#remote-access-image-requirements)
+ [Instance requirements](#remote-access-instance-requirements)
+ [Set up remote access](remote-access-remote-setup.md)
+ [Set up Remote IDE](remote-access-local-ide-setup.md)
+ [Supported AWS Regions](remote-access-supported-regions.md)

## Key concepts
<a name="remote-access-key-concepts"></a>
+ **Remote connection** — A secure tunnel between your Remote IDE and a SageMaker space. This connection enables interactive development and code execution using SageMaker AI compute resources.
+ **[Space](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-spaces.html)** — A dedicated environment within Amazon SageMaker Studio where you can manage the storage and resources for your Studio applications.
+ **Deep link** — A button (direct URL) from the SageMaker UI that initiates a remote connection to your local IDE.

## Connection methods
<a name="remote-access-connection-methods"></a>

There are three main ways to connect your Remote IDE to SageMaker spaces:
+ **Deep link access** — You can connect directly to a specific space by using the **Open space with** button available in SageMaker AI. This uses URL patterns to establish a remote connection and open your SageMaker space in your Remote IDE.
+ **[AWS Toolkit for Visual Studio Code](https://docs.aws.amazon.com/toolkit-for-vscode/latest/userguide/welcome.html)** — You can authenticate with the AWS Toolkit for Visual Studio Code. This allows you to connect to spaces and open a remotely connected window from your Remote IDE.
+ **SSH terminal connection** — You can connect via command line using SSH configuration.

## Supported IDEs
<a name="remote-access-supported-ides"></a>

Remote connection to Studio spaces supports:
+ [Visual Studio Code](https://code.visualstudio.com/)
+ [Kiro](https://kiro.dev/)
+ [Cursor](https://cursor.com/home)

## IDE version requirements
<a name="remote-access-ide-version-requirements"></a>

The following table lists the minimum version requirements for each supported Remote IDE.


| IDE | Minimum version | 
| --- | --- | 
|  Visual Studio Code  |  [v1.90](https://code.visualstudio.com/updates/v1_90) or greater. We recommend using the [latest stable version](https://code.visualstudio.com/updates).  | 
|  Kiro  |  v0.10.78 or greater  | 
|  Cursor  |  v2.6.18 or greater  | 

The AWS Toolkit extension is required to connect your Remote IDE to Studio spaces. For Kiro and Cursor, AWS Toolkit extension version v3.100 or greater is required.

## Operating system requirements
<a name="remote-access-os-requirements"></a>

You need one of the following operating systems to remotely connect to Studio spaces:
+ macOS 13 or later
+ Windows 10
  + [Windows 10 support ends on October 14, 2025](https://support.microsoft.com/en-us/windows/windows-10-support-ends-on-october-14-2025-2ca8b313-1946-43d3-b55c-2b95b107f281)
+ Windows 11
+ Linux
  + For VS Code, install the official [Microsoft VS Code for Linux](https://code.visualstudio.com/docs/setup/linux), not an open-source version

## Local machine prerequisites
<a name="remote-access-local-prerequisites"></a>

Before connecting your Remote IDE to Studio spaces, ensure your local machine has the required dependencies and network access.

**Important**  
Environments with software installation restrictions may prevent users from installing required dependencies. The AWS Toolkit for Visual Studio Code automatically searches for these dependencies when initiating remote connections and will prompt for installation if any are missing. Coordinate with your IT department to ensure these components are available.

**Required local dependencies**

Your local machine must have the following components installed:
+ **[Remote-SSH Extension](https://code.visualstudio.com/docs/remote/ssh)** — Remote development extension for your IDE (available in the extension marketplace for VS Code, Kiro, and Cursor)
+ **[Session Manager plugin](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html)** — Required for secure session management
+ **SSH Client** — Standard component on most machines ([OpenSSH recommended for Windows](https://learn.microsoft.com/en-us/windows-server/administration/openssh/openssh_install_firstuse))
+ **IDE CLI Command** — Typically included with IDE installation (for example, `code` for VS Code, `kiro` for Kiro, `cursor` for Cursor)

**Platform-specific requirements**
+ **Windows users** — PowerShell 5.1 or later is required for SSH terminal connections

**Network connectivity requirements**

Your local machine must have network access to [Session Manager endpoints](https://docs.aws.amazon.com/general/latest/gr/ssm.html). For example, in US East (N. Virginia) (us-east-1), these include:
+ ssm.us-east-1.amazonaws.com
+ ssm.us-east-1.api.aws
+ ssmmessages.us-east-1.amazonaws.com
+ ec2messages.us-east-1.amazonaws.com
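Before initiating a connection, you can verify that the required commands are on your `PATH`. The command names below assume VS Code (`code`); substitute `kiro` or `cursor` as appropriate:

```python
import shutil

def missing_dependencies(commands):
    """Return the subset of required commands not found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]

required = ["ssh", "session-manager-plugin", "code"]
missing = missing_dependencies(required)
if missing:
    print("Install before connecting:", ", ".join(missing))
```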

## Image requirements
<a name="remote-access-image-requirements"></a>

**SageMaker Distribution images**

When using SageMaker Distribution with remote access, use [SageMaker Distribution](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-distribution.html) version 2.7 or later.

**Custom images**

When you [Bring your own image (BYOI)](studio-updated-byoi.md) with remote access, ensure that you follow the [custom image specifications](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-byoi-specs.html) and ensure the following dependencies are installed:
+ `curl` or `wget` — Required for downloading AWS CLI components
+ `unzip` — Required for extracting AWS CLI installation files
+ `tar` — Required for archive extraction
+ `gzip` — Required for compressed file handling

## Instance requirements
<a name="remote-access-instance-requirements"></a>
+ **Memory** — 8GB or more
+ **Instance types** — Use instances with at least 8GB of memory. The following instance types are *not* supported due to insufficient memory (less than 8GB): `ml.t3.medium`, `ml.c7i.large`, `ml.c6i.large`, `ml.c6id.large`, and `ml.c5.large`. For a more complete list of instance types, see the [Amazon EC2 On-Demand Pricing page](https://aws.amazon.com/ec2/pricing/on-demand/).
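When scripting space creation, you can pre-filter candidate instance types against the 8 GB floor. The memory figures below are illustrative assumptions; confirm actual values on the Amazon EC2 pricing page:

```python
# Approximate memory per instance type in GiB (assumed for illustration).
INSTANCE_MEMORY_GIB = {
    "ml.t3.medium": 4,
    "ml.c5.large": 4,
    "ml.m5.large": 8,
    "ml.m5.xlarge": 16,
}

def supports_remote_access(instance_type: str, min_gib: int = 8) -> bool:
    """Return True if the instance type meets the remote-access memory floor."""
    memory = INSTANCE_MEMORY_GIB.get(instance_type)
    if memory is None:
        raise KeyError(f"unknown instance type: {instance_type}")
    return memory >= min_gib
```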

# Set up remote access
<a name="remote-access-remote-setup"></a>

Before users can connect their Remote IDE to Studio spaces, the administrator must configure permissions. This section provides instructions for administrators on how to set up their Amazon SageMaker AI domain with remote access.

Different connection methods require different IAM permissions. Configure the appropriate permissions based on how your users will connect. Use the following workflow along with the permissions aligned with the connection method.

**Important**  
Currently, remote IDE connections are authenticated using IAM credentials, not IAM Identity Center. This applies even to domains that use the IAM Identity Center [authentication method](https://docs.aws.amazon.com/sagemaker/latest/dg/onboard-custom.html#onboard-custom-authentication-details) for user access to the domain. If you prefer not to use IAM authentication for remote connections, you can opt out by disabling this feature using the `RemoteAccess` condition key in your IAM policies. For more information, see [Remote access enforcement](remote-access-remote-setup-abac.md#remote-access-remote-setup-abac-remote-access-enforcement). When using IAM credentials, Remote IDE connections may maintain active sessions even after you log out of your IAM Identity Center session; these connections can persist for up to 12 hours. To keep your environment secure, administrators should review session duration settings where possible and be cautious when using shared workstations or public networks.

1. Choose one of the following connection method permissions that align with your users’ [Connection methods](remote-access.md#remote-access-connection-methods).

1. [Create a custom IAM policy](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create.html) based on the connection method permission.

**Topics**
+ [Step 1: Configure security and permissions](#remote-access-remote-setup-permissions)
+ [Step 2: Enable remote access for your space](#remote-access-remote-setup-enable)
+ [Advanced access control](remote-access-remote-setup-abac.md)
+ [Set up Studio to run with subnets without internet access within a VPC](remote-access-remote-setup-vpc-subnets-without-internet-access.md)
+ [Set up automated Studio space filtering when using the AWS Toolkit](remote-access-remote-setup-filter.md)

## Step 1: Configure security and permissions
<a name="remote-access-remote-setup-permissions"></a>

**Topics**
+ [Method 1: Deep link permissions](#remote-access-remote-setup-method-1-deep-link-permissions)
+ [Method 2: AWS Toolkit permissions](#remote-access-remote-setup-method-2-aws-toolkit-permissions)
+ [Method 3: SSH terminal permissions](#remote-access-remote-setup-method-3-ssh-terminal-permissions)

**Important**  
Using broad permissions for `sagemaker:StartSession`, especially with a wildcard resource (`*`), creates the risk that any user with this permission can initiate a session against any SageMaker Space app in the account. For example, data scientists could unintentionally access other users' SageMaker Spaces. For production environments, scope down these permissions to specific space ARNs to enforce the principle of least privilege. See [Advanced access control](remote-access-remote-setup-abac.md) for examples of more granular permission policies using resource ARNs, tags, and network-based constraints.

### Method 1: Deep link permissions
<a name="remote-access-remote-setup-method-1-deep-link-permissions"></a>

For users connecting via deep links from the SageMaker UI, use the following permission and attach it to your SageMaker AI [space execution role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html#sagemaker-roles-get-execution-role-space) or [domain execution role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html#sagemaker-roles-get-execution-role). If the space execution role is not configured, the domain execution role is used by default.

------
#### [ JSON ]

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "RestrictStartSessionOnSpacesToUserProfile",
            "Effect": "Allow",
            "Action": [
                "sagemaker:StartSession"
            ],
            "Resource": "arn:*:sagemaker:*:*:space/${sagemaker:DomainId}/*",
            "Condition": {
                "ArnLike": {
                    "sagemaker:ResourceTag/sagemaker:user-profile-arn": "arn:aws:sagemaker:*:*:user-profile/${sagemaker:DomainId}/${sagemaker:UserProfileName}"
                }
            }
        }
    ]
}
```

------

### Method 2: AWS Toolkit permissions
<a name="remote-access-remote-setup-method-2-aws-toolkit-permissions"></a>

For users connecting through the AWS Toolkit for Visual Studio Code extension, attach the following policy to one of the following:
+ For IAM authentication, attach this policy to the IAM user or role
+ For IAM Identity Center authentication, attach this policy to the [permission sets](https://docs.aws.amazon.com/singlesignon/latest/userguide/permissionsetsconcept.html) managed by IAM Identity Center

To show only spaces relevant to the authenticated user, see [Filtering overview](remote-access-remote-setup-filter.md#remote-access-remote-setup-filter-overview).

**Important**  
The following policy using `*` as the resource constraint is only recommended for quick testing purposes. For production environments, you should scope down these permissions to specific space ARNs to enforce the principle of least privilege. See [Advanced access control](remote-access-remote-setup-abac.md) for examples of more granular permission policies using resource ARNs, tags, and network-based constraints.

------
#### [ JSON ]

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListSpaces",
                "sagemaker:DescribeSpace",
                "sagemaker:ListApps",
                "sagemaker:DescribeApp",
                "sagemaker:DescribeDomain",
                "sagemaker:UpdateSpace",
                "sagemaker:CreateApp",
                "sagemaker:DeleteApp",
                "sagemaker:AddTags"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowStartSessionOnSpaces",
            "Effect": "Allow",
            "Action": "sagemaker:StartSession",
            "Resource": [
                "arn:aws:sagemaker:us-east-1:111122223333:space/domain-id/space-name-1",
                "arn:aws:sagemaker:us-east-1:111122223333:space/domain-id/space-name-2"
            ]
        }
    ]
}
```

------

### Method 3: SSH terminal permissions
<a name="remote-access-remote-setup-method-3-ssh-terminal-permissions"></a>

For SSH terminal connections, the `StartSession` API is called by the SSH proxy command script below, using the local AWS credentials. See [Configure the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) for information and instructions on setting up the user's local AWS credentials. To use these permissions:

1. Attach this policy to the IAM user or role associated with the local AWS credentials.

1. If using a named credential profile, modify the proxy command in your SSH config:

   ```
   ProxyCommand '/home/user/sagemaker_connect.sh' '%h' YOUR_CREDENTIAL_PROFILE_NAME
   ```
**Note**  
The policy needs to be attached to the IAM identity (user/role) used in your local AWS credentials configuration, not to the Amazon SageMaker AI domain execution role.

------
#### [ JSON ]

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Sid": "AllowStartSessionOnSpecificSpaces",
               "Effect": "Allow",
               "Action": "sagemaker:StartSession",
               "Resource": [
                   "arn:aws:sagemaker:us-east-1:111122223333:space/domain-id/space-name-1",
                   "arn:aws:sagemaker:us-east-1:111122223333:space/domain-id/space-name-2"
               ]
           }
       ]
   }
   ```

------

After setup, users can run `ssh my_studio_space_abc` to start up the space. For more information, see [Method 3: Connect from the terminal via SSH CLI](remote-access-local-ide-setup.md#remote-access-local-ide-setup-local-vs-code-method-3-connect-from-the-terminal-via-ssh-cli).
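Putting the pieces together, the resulting entry in the user's `~/.ssh/config` might look like the following sketch. The host alias, script path, and profile name are placeholders; the actual entry is produced during [Set up Remote IDE](remote-access-local-ide-setup.md):

```
Host my_studio_space_abc
    # Tunnel through Session Manager using the generated proxy script.
    ProxyCommand '/home/user/sagemaker_connect.sh' '%h' YOUR_CREDENTIAL_PROFILE_NAME
```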

## Step 2: Enable remote access for your space
<a name="remote-access-remote-setup-enable"></a>

After you set up the permissions, you must toggle on **Remote Access** and start your space in Studio before the user can connect using their Remote IDE. This setup only needs to be done once.

**Note**  
If your users connect using [Method 2: AWS Toolkit permissions](#remote-access-remote-setup-method-2-aws-toolkit-permissions), this step is not strictly required. AWS Toolkit for Visual Studio Code users can enable remote access from the Toolkit.

**Activate remote access for your Studio space**

1. [Launch Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-launch.html#studio-updated-launch-console).

1. Open the Studio UI.

1. Navigate to your space.

1. In the space details, toggle on **Remote Access**.

1. Choose **Run space**.

# Advanced access control
<a name="remote-access-remote-setup-abac"></a>

Amazon SageMaker AI supports [attribute-based access control (ABAC)](https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction_attribute-based-access-control.html) to achieve fine-grained access control for Remote IDE connections using ABAC policies. The following are example ABAC policies for Remote IDE connections.

**Topics**
+ [Remote access enforcement](#remote-access-remote-setup-abac-remote-access-enforcement)
+ [Tag-based access control](#remote-access-remote-setup-abac-tag-based-access-control)

## Remote access enforcement
<a name="remote-access-remote-setup-abac-remote-access-enforcement"></a>

Control access to resources using the `sagemaker:RemoteAccess` condition key. This is supported by both `CreateSpace` and `UpdateSpace` APIs. The following example uses `CreateSpace`. 

You can prevent users from creating or updating spaces with remote access enabled. This helps maintain security by defaulting to more restricted access settings. The following policy ensures that users can:
+ Create or update Studio spaces where remote access is explicitly disabled
+ Create or update Studio spaces without specifying any remote access settings

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyCreateSpaceRemoteAccessEnabled",
            "Effect": "Deny",
            "Action": [
                "sagemaker:CreateSpace",
                "sagemaker:UpdateSpace"
            ],
            "Resource": "arn:aws:sagemaker:*:*:space/*",
            "Condition": {
                "StringEquals": {
                    "sagemaker:RemoteAccess": [
                        "ENABLED"
                    ]
                }
            }
        },
        {
            "Sid": "AllowCreateSpace",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateSpace",
                "sagemaker:UpdateSpace"
            ],
            "Resource": "arn:aws:sagemaker:*:*:space/*"
        }
    ]
}
```

------
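
With this policy attached, a request that enables remote access is denied, while a compliant request succeeds. The following AWS CLI calls sketch both cases; the domain ID and space name are placeholders.

```
# Denied by the policy: remote access explicitly enabled
aws sagemaker create-space \
    --domain-id d-xxxxxxxxxxxx \
    --space-name my-space \
    --space-settings '{"RemoteAccess": "ENABLED"}'

# Allowed: remote access explicitly disabled (omitting the setting is also allowed)
aws sagemaker create-space \
    --domain-id d-xxxxxxxxxxxx \
    --space-name my-space \
    --space-settings '{"RemoteAccess": "DISABLED"}'
```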

## Tag-based access control
<a name="remote-access-remote-setup-abac-tag-based-access-control"></a>

Implement [tag-based](https://docs.aws.amazon.com/whitepapers/latest/tagging-best-practices/what-are-tags.html) access control to restrict connections based on resource and principal tags.

You can ensure that users only access resources appropriate for their role and project assignments. You can use the following policy to:
+ Allow users to connect only to spaces that match their assigned team, environment, and cost center
+ Implement fine-grained access control based on organizational structure

In the following example, the space is tagged with the following:

```
{ "Team": "ML", "Environment": "Production", "CostCenter": "12345" }
```
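
If you manage tags with the AWS CLI, you could apply these tags with `add-tags`; the space ARN below is a placeholder.

```
aws sagemaker add-tags \
    --resource-arn arn:aws:sagemaker:us-east-1:111122223333:space/d-xxxxxxxxxxxx/my-space \
    --tag-list Key=Team,Value=ML Key=Environment,Value=Production Key=CostCenter,Value=12345
```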

You can have a role that contains the following policy to match resource and principal tags:

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RestrictStartSessionOnTaggedSpacesInDomain",
            "Effect": "Allow",
            "Action": [
                "sagemaker:StartSession"
            ],
            "Resource": [
                "arn:aws:sagemaker:us-east-1:111122223333:space/domain-id/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/Team": "${aws:PrincipalTag/Team}",
                    "aws:ResourceTag/Environment": "${aws:PrincipalTag/Environment}",
                    "aws:ResourceTag/CostCenter": "${aws:PrincipalTag/CostCenter}",
                    "aws:ResourceTag/IDC_UserName": "${aws:PrincipalTag/IDC_UserName}"
                }
            }
        }
    ]
}
```

------

When the role’s tags match, the user has permission to start the session and remotely connect to their space. See [Control access to AWS resources using tags](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_tags.html) for more information.

# Set up Studio to run with subnets without internet access within a VPC
<a name="remote-access-remote-setup-vpc-subnets-without-internet-access"></a>

This guide shows you how to connect to Amazon SageMaker Studio spaces from your Remote IDE when your Amazon SageMaker AI domain runs in private subnets without internet access. You’ll learn about connectivity requirements and setup options to establish secure remote connections in isolated network environments.

You can configure Amazon SageMaker Studio to run in VPC only mode with subnets without internet access. This setup enhances security for your machine learning workloads by operating in an isolated network environment where all traffic flows through the VPC. To enable external communications while maintaining security, use VPC endpoints for AWS services and configure VPC PrivateLink for required AWS dependencies.

**IDE support for private subnet connections**

The following table shows the supported connection methods for each Remote IDE when connecting to Studio spaces in private subnets without internet access.


| Connection method | VS Code | Kiro | Cursor | 
| --- | --- | --- | --- | 
|  HTTP Proxy support  |  Supported  |  Supported  |  Not supported  | 
|  Pre-packaged remote server and extensions  |  Supported  |  Not supported  |  Not supported  | 

**Important**  
Cursor is not supported for connecting to Studio spaces in private subnets without outbound internet access.

**Topics**
+ [Studio remote access network requirements](#remote-access-remote-setup-vpc-subnets-without-internet-access-network-requirements)
+ [Set up the Studio remote access network](#remote-access-remote-setup-vpc-subnets-without-internet-access-setup)

## Studio remote access network requirements
<a name="remote-access-remote-setup-vpc-subnets-without-internet-access-network-requirements"></a>

**VPC mode limitations** Studio in VPC mode supports only private subnets. Studio cannot work with subnets directly attached to an internet gateway (IGW). Remote IDE connections share the same limitations as SageMaker AI. For more information, see [Connect Studio notebooks in a VPC to external resources](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-and-internet-access.html).

### VPC PrivateLink requirements
<a name="remote-access-remote-setup-vpc-subnets-without-internet-access-vpc-privatelink-requirements"></a>

When SageMaker AI runs in private subnets, configure the following SSM VPC endpoints in addition to the standard VPC endpoints required for SageMaker AI. For more information, see [Connect Studio Through a VPC Endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-interface-endpoint.html).
+ `com.amazonaws.REGION.ssm`
+ `com.amazonaws.REGION.ssmmessages`
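
If these endpoints don't exist yet, you can create them as interface endpoints, for example with the AWS CLI; the Region, VPC, subnet, and security group IDs below are placeholders.

```
for service in ssm ssmmessages; do
    aws ec2 create-vpc-endpoint \
        --vpc-id vpc-0123456789abcdef0 \
        --vpc-endpoint-type Interface \
        --service-name "com.amazonaws.us-east-1.${service}" \
        --subnet-ids subnet-0123456789abcdef0 \
        --security-group-ids sg-0123456789abcdef0 \
        --private-dns-enabled
done
```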

**VPC endpoint policy recommendations**

The following are the recommended VPC endpoint policies that allow the necessary actions for remote access while using the `aws:PrincipalIsAWSService` condition to ensure only AWS services like Amazon SageMaker AI can make the calls. For more information about the `aws:PrincipalIsAWSService` condition key, see [the documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_condition-keys.html#condition-keys-principalisawsservice).

**SSM endpoint policy**

Use the following policy for the `com.amazonaws.REGION.ssm` endpoint:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "ssm:CreateActivation",
                "ssm:RegisterManagedInstance",
                "ssm:DeleteActivation",
                "ssm:DeregisterManagedInstance",
                "ssm:AddTagsToResource",
                "ssm:UpdateInstanceInformation",
                "ssm:UpdateInstanceAssociationStatus",
                "ssm:DescribeInstanceInformation",
                "ssm:ListInstanceAssociations",
                "ssm:ListAssociations",
                "ssm:GetDocument",
                "ssm:PutInventory"
            ],
            "Resource": "*",
            "Condition": {
                "BoolIfExists": {
                    "aws:PrincipalIsAWSService": "true"
                }
            }
        }
    ]
}
```

**SSM Messages endpoint policy**

Use the following policy for the `com.amazonaws.REGION.ssmmessages` endpoint:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "ssmmessages:CreateControlChannel",
                "ssmmessages:CreateDataChannel",
                "ssmmessages:OpenControlChannel",
                "ssmmessages:OpenDataChannel"
            ],
            "Resource": "*",
            "Condition": {
                "BoolIfExists": {
                    "aws:PrincipalIsAWSService": "true"
                }
            }
        }
    ]
}
```

**VS Code specific network requirements**

Remote VS Code connection requires VS Code remote development, which needs specific network access to install the remote server and extensions. See the [remote development FAQ](https://code.visualstudio.com/docs/remote/faq) in the Visual Studio Code documentation for full network requirements. The following is a summary of the requirements:
+ Access to Microsoft’s VS Code server endpoints is needed to install and update the VS Code remote server.
+ Access to the Visual Studio Marketplace and related CDN endpoints is required for installing VS Code extensions through the extension panel (alternatively, extensions can be installed manually from VSIX files without an internet connection).
+ Some extensions may require access to additional endpoints for downloading their specific dependencies. See the extension’s documentation for their specific connectivity requirements.

**Kiro specific network requirements**

Remote Kiro connection requires Kiro remote development, which needs specific network access to install the remote server and extensions. For firewall and proxy server configuration, see [Kiro firewall configuration](https://kiro.dev/docs/privacy-and-security/firewalls/). The requirements are similar to VS Code:
+ Access to Kiro server endpoints is needed to install and update the Kiro remote server.
+ Access to extension marketplace and related CDN endpoints is required for installing Kiro extensions through the extension panel.
+ Some extensions may require access to additional endpoints for downloading their specific dependencies. See the extension’s documentation for their specific connectivity requirements.

## Set up the Studio remote access network
<a name="remote-access-remote-setup-vpc-subnets-without-internet-access-setup"></a>

You have the following options to connect your Remote IDE to Studio spaces in private subnets:
+ HTTP Proxy (supported for VS Code and Kiro)
+ Pre-packaged remote server and extensions (VS Code only)

### Set up HTTP Proxy with controlled allow-listing
<a name="remote-access-remote-setup-vpc-subnets-without-internet-access-setup-http-proxy-with-controlled-allow-listing"></a>

When your Studio space is behind a firewall or proxy, allow access to your IDE server and extension-related CDNs and endpoints.

1. Set up a public subnet to run the HTTP proxy (such as Squid), where you can configure which websites to allow. Ensure that the HTTP proxy is reachable from your SageMaker spaces.

1. Place the public subnet either in the same VPC used by Studio or in a separate VPC peered with all the VPCs used by your Amazon SageMaker AI domains.
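
As an illustration, a minimal Squid allow-list for VS Code traffic might look like the following excerpt. The domains shown are a starting point based on the VS Code network requirements; verify the full list against your IDE’s documentation.

```
# /etc/squid/squid.conf (excerpt) - illustrative allow-list only
acl ide_endpoints dstdomain update.code.visualstudio.com
acl ide_endpoints dstdomain .vscode-cdn.net
acl ide_endpoints dstdomain marketplace.visualstudio.com
acl ide_endpoints dstdomain .gallerycdn.vsassets.io
http_access allow ide_endpoints
http_access deny all
http_port 3128
```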

### Set up Pre-packaged remote server and extensions (VS Code only)
<a name="remote-access-remote-setup-vpc-subnets-without-internet-access-setup-pre-packaged-vs-code-remote-server-and-extensions"></a>

**Note**  
This option is only available for Visual Studio Code. Kiro and Cursor do not support pre-packaged remote server setup.

When your Studio spaces can’t access external endpoints to download the VS Code remote server and extensions, you can pre-package them. With this approach, you export a tarball containing the `.vscode-server` directory for a specific version of VS Code. Then, you use a SageMaker AI Lifecycle Configuration (LCC) script to copy and extract the tarball into the home directory (`/home/sagemaker-user`) of the Studio spaces. This LCC-based solution works with both AWS-provided and custom images. Even when you’re not using private subnets, this approach accelerates the setup of the VS Code remote server and pre-installed extensions.

**Instructions for pre-packaging your VS Code remote server and extensions**

1. Install VS Code on your local machine.

1. Launch a Linux-based (x64) Docker container with SSH enabled, either locally or via a Studio space with internet access. We recommend using a temporary Studio space with remote access and internet enabled for simplicity.

1. Connect your installed VS Code to the local Docker container via Remote SSH, or connect to the Studio space via the Studio remote VS Code feature. During connection, VS Code installs the remote server into `.vscode-server` in the home directory of the remote container. See [Example Dockerfile usage for pre-packaging your VS Code remote server and extensions](remote-access-local-ide-setup-vpc-no-internet.md#remote-access-local-ide-setup-vpc-no-internet-pre-packaged-vs-code-remote-server-and-extensions-example-dockerfile) for more information.

1. After connecting remotely, ensure you use the VS Code Default profile.

1. Install the required VS Code extensions and validate their functionality. For example, create and run a notebook to install Jupyter notebook-related extensions in the VS Code remote server.

   Ensure you [install the AWS Toolkit for Visual Studio Code extension](https://docs.aws.amazon.com/toolkit-for-visual-studio/latest/user-guide/setup.html) after connecting to the remote container.

1. Archive the `$HOME/.vscode-server` directory (for example, as `vscode-server-with-extensions-for-1.100.2.tar.gz`) in either the local Docker container or in the terminal of the remotely connected Studio space.

1. Upload the tarball to Amazon S3.

1. Create an [LCC script](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-lifecycle-configurations.html) ([Example LCC script (LCC-install-vscode-server-v1.100.2)](remote-access-local-ide-setup-vpc-no-internet.md#remote-access-local-ide-setup-vpc-no-internet-pre-packaged-vs-code-remote-server-and-extensions-example-lcc)) that:
   + Downloads the specific archive from Amazon S3.
   + Extracts it into the home directory when a Studio space in a private subnet launches.

1. (Optional) Extend the LCC script to support per-user VS Code server tarballs stored in user-specific Amazon S3 folders.

1. (Optional) Maintain version-specific LCC scripts ([Example LCC script (LCC-install-vscode-server-v1.100.2)](remote-access-local-ide-setup-vpc-no-internet.md#remote-access-local-ide-setup-vpc-no-internet-pre-packaged-vs-code-remote-server-and-extensions-example-lcc)) that you can attach to your spaces, ensuring compatibility between your local VS Code client and the remote server.
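
The LCC script in the previous steps can be sketched as follows. This is a minimal illustration only, assuming a hypothetical S3 bucket and tarball name; see the linked example LCC script for the complete version.

```
#!/bin/bash
# Minimal LCC sketch: fetch the pre-packaged VS Code server tarball from
# Amazon S3 and extract it into the space's home directory.
# The bucket and tarball names below are placeholders.
set -euo pipefail

TARBALL="s3://amzn-s3-demo-bucket/vscode-server-with-extensions-for-1.100.2.tar.gz"

# Skip the download if the server is already in place so restarts stay fast
if [ ! -d /home/sagemaker-user/.vscode-server ]; then
    aws s3 cp "$TARBALL" /tmp/vscode-server.tar.gz
    tar -xzf /tmp/vscode-server.tar.gz -C /home/sagemaker-user
    rm /tmp/vscode-server.tar.gz
fi
```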

# Set up automated Studio space filtering when using the AWS Toolkit
<a name="remote-access-remote-setup-filter"></a>

Users can filter spaces in the AWS Toolkit for Visual Studio Code explorer to display only relevant spaces. This section provides information on filtering and how to set up automated filtering.

This setup only applies when using the [Method 2: AWS Toolkit in the Remote IDE](remote-access-local-ide-setup.md#remote-access-local-ide-setup-local-vs-code-method-2-aws-toolkit-in-vs-code) method to connect from your Remote IDE to Amazon SageMaker Studio spaces. See [Set up remote access](remote-access-remote-setup.md) for more information.

**Topics**
+ [Filtering overview](#remote-access-remote-setup-filter-overview)
+ [Set up when connecting with IAM credentials](#remote-access-remote-setup-filter-set-up-iam-credentials)

## Filtering overview
<a name="remote-access-remote-setup-filter-overview"></a>

**Manual filtering** allows users to manually select which user profiles to display spaces for through the AWS Toolkit interface. This method works for all authentication types and takes precedence over automated filtering. To manually filter, see [Manual filtering](remote-access-local-ide-setup-filter.md#remote-access-local-ide-setup-filter-manual).

**Automated filtering** automatically shows only spaces relevant to the authenticated user. This filtering behavior depends on the authentication method during sign-in. See [connecting to AWS from the Toolkit](https://docs.aws.amazon.com/toolkit-for-vscode/latest/userguide/connect.html#connect-to-aws) in the Toolkit for VS Code User Guide for more information. The following lists the sign-in options.
+ **Authenticate and connect with SSO**: Automated filtering works by default.
+ **Authenticate and connect with IAM credentials**: Automated filtering **requires administrator setup** for the following IAM credential types. Without this setup, the AWS Toolkit cannot identify which spaces belong to the user, so all spaces are shown by default.
  + **Using IAM user credentials**
  + **Using assumed IAM role session credentials**

## Set up when connecting with IAM credentials
<a name="remote-access-remote-setup-filter-set-up-iam-credentials"></a>

**When using IAM user credentials**

The Toolkit for VS Code matches spaces belonging to user profiles whose names start with the authenticated IAM user name or assumed-role session name. To set this up:

**Note**  
Automated filtering only works when administrators configure Studio user profile names to follow this naming convention.
+ Administrators must ensure that Studio user profile names follow the naming convention:
  + For IAM users: prefix the profile name with the IAM user name (`IAM-user-name-`)
  + For assumed roles: prefix the profile name with the role session name (`assumed-role-session-name-`)
+ `aws sts get-caller-identity` returns the identity information used for matching.
+ Spaces belonging to the matched user profiles are automatically filtered in the Toolkit for VS Code.
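
The prefix matching described above can be illustrated with a small shell sketch; the user name and profile names below are hypothetical.

```
# A profile's spaces are shown when the profile name starts with the
# caller's IAM user name (or role session name). Names here are hypothetical.
is_profile_shown() {
    case "$2" in
        "$1"*) echo "shown" ;;
        *)     echo "hidden" ;;
    esac
}

is_profile_shown "data-scientist-1" "data-scientist-1-dev"   # shown
is_profile_shown "data-scientist-1" "ml-team-shared"         # hidden
```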

**When using assumed IAM role session credentials** In addition to the setup for IAM user credentials above, you must ensure that role session names include a matching user identifier. You can configure trust policies that enforce this. [Create a trust policy](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create.html) and attach it to the assumed role used for authentication.

This setup is not required for direct IAM user credentials or IdC authentication.

**Set up trust policy for IAM role session credentials example** Create a trust policy that requires role sessions to use the IAM user name as the session name. The following is an example policy:

```
{
    "Statement": [
        {
            "Sid": "RoleTrustPolicyRequireUsernameForSessionName",
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Principal": {"AWS": "arn:aws:iam::ACCOUNT:root"},
            "Condition": {
                "StringLike": {"sts:RoleSessionName": "${aws:username}"}
            }
        }
    ]
}
```

# Set up Remote IDE
<a name="remote-access-local-ide-setup"></a>

After administrators complete the instructions in [Connect your Remote IDE to SageMaker spaces with remote access](remote-access.md), you can connect your Remote IDE to your remote SageMaker spaces.

**Topics**
+ [Set up your local environment](#remote-access-local-ide-setup-local-environment)
+ [Connect to your Remote IDE](#remote-access-local-ide-setup-local-vs-code)
+ [Connect to VPC with subnets without internet access](remote-access-local-ide-setup-vpc-no-internet.md)
+ [Filter your Studio spaces](remote-access-local-ide-setup-filter.md)

## Set up your local environment
<a name="remote-access-local-ide-setup-local-environment"></a>

Install your preferred Remote IDE on your local machine:
+ [Visual Studio Code](https://code.visualstudio.com/)
+ [Kiro](https://kiro.dev/)
+ [Cursor](https://cursor.com/home)

For information on the version requirements, see [IDE version requirements](remote-access.md#remote-access-ide-version-requirements).

## Connect to your Remote IDE
<a name="remote-access-local-ide-setup-local-vs-code"></a>

Before you can establish a connection from your Remote IDE to your remote SageMaker spaces, your administrator must [Set up remote access](remote-access-remote-setup.md). Your administrator sets up a specific method for you to establish a connection. Choose the method that was set up for you.

**Topics**
+ [Method 1: Deep link from Studio UI](#remote-access-local-ide-setup-local-vs-code-method-1-deep-link-from-studio-ui)
+ [Method 2: AWS Toolkit in the Remote IDE](#remote-access-local-ide-setup-local-vs-code-method-2-aws-toolkit-in-vs-code)
+ [Method 3: Connect from the terminal via SSH CLI](#remote-access-local-ide-setup-local-vs-code-method-3-connect-from-the-terminal-via-ssh-cli)

### Method 1: Deep link from Studio UI
<a name="remote-access-local-ide-setup-local-vs-code-method-1-deep-link-from-studio-ui"></a>

Use the following procedure to establish a connection using deep link.

1. [Launch Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-launch.html#studio-updated-launch-console).

1. In the Studio UI, navigate to your space.

1. Choose the **Open in VS Code**, **Open in Kiro**, or **Open in Cursor** button for your preferred IDE. Ensure that your preferred IDE is already installed on your local computer.

1. When prompted, confirm that you want to open your IDE. Your IDE opens and displays another confirmation prompt. After you confirm, the remote connection is established.

### Method 2: AWS Toolkit in the Remote IDE
<a name="remote-access-local-ide-setup-local-vs-code-method-2-aws-toolkit-in-vs-code"></a>

Use the following procedure to establish a connection using the AWS Toolkit for Visual Studio Code. This method is available for VS Code, Kiro, and Cursor.

1. Open your Remote IDE (VS Code, Kiro, or Cursor).

1. Open the AWS Toolkit extension.

1. [Connect to AWS](https://docs.aws.amazon.com/toolkit-for-vscode/latest/userguide/connect.html).

1. In the AWS Explorer, expand **SageMaker AI**, then expand **Studio**.

1. Find your Studio space.

1. Choose the **Connect** icon next to your space to start it.
**Note**  
If remote access is not already enabled, stop the space, enable remote access, and restart it from the Toolkit for Visual Studio Code.
If the space is not using a supported [instance size](https://docs.aws.amazon.com/sagemaker/latest/dg/remote-access.html#remote-access-instance-requirements), you will be asked to change the instance.

### Method 3: Connect from the terminal via SSH CLI
<a name="remote-access-local-ide-setup-local-vs-code-method-3-connect-from-the-terminal-via-ssh-cli"></a>

Choose one of the following platform options to view the procedure to establish a connection using the SSH CLI.

**Note**  
Ensure that you have the latest versions of the [Local machine prerequisites](remote-access.md#remote-access-local-prerequisites) installed before following the instructions below.
If you [Bring your own image (BYOI)](studio-updated-byoi.md), ensure you have installed the required dependencies listed in [Image requirements](remote-access.md#remote-access-image-requirements) before proceeding.

------
#### [ Linux/macOS ]

Create a shell script (for example, `/home/user/sagemaker_connect.sh`):

```
#!/bin/bash
# Disable the -x option if printing each command is not needed.
set -exuo pipefail

SPACE_ARN="$1"
AWS_PROFILE="${2:-}"

# Validate ARN and extract region
if [[ "$SPACE_ARN" =~ ^arn:aws[-a-z]*:sagemaker:([a-z0-9-]+):[0-9]{12}:space\/[^\/]+\/[^\/]+$ ]]; then
    AWS_REGION="${BASH_REMATCH[1]}"
else
    echo "Error: Invalid SageMaker Studio Space ARN format."
    exit 1
fi

# Optional profile flag
PROFILE_ARG=()
if [[ -n "$AWS_PROFILE" ]]; then
    PROFILE_ARG=(--profile "$AWS_PROFILE")
fi

# Start session
START_SESSION_JSON=$(aws sagemaker start-session \
    --resource-identifier "$SPACE_ARN" \
    --region "${AWS_REGION}" \
    "${PROFILE_ARG[@]}")

# Extract fields using grep and sed
SESSION_ID=$(echo "$START_SESSION_JSON" | grep -o '"SessionId": "[^"]*"' | sed 's/.*: "//;s/"$//')
STREAM_URL=$(echo "$START_SESSION_JSON" | grep -o '"StreamUrl": "[^"]*"' | sed 's/.*: "//;s/"$//')
TOKEN=$(echo "$START_SESSION_JSON" | grep -o '"TokenValue": "[^"]*"' | sed 's/.*: "//;s/"$//')

# Validate extracted values
if [[ -z "$SESSION_ID" || -z "$STREAM_URL" || -z "$TOKEN" ]]; then
    echo "Error: Failed to extract session information from sagemaker start session response."
    exit 1
fi

# Call session-manager-plugin
session-manager-plugin \
    "{\"streamUrl\":\"$STREAM_URL\",\"tokenValue\":\"$TOKEN\",\"sessionId\":\"$SESSION_ID\"}" \
    "$AWS_REGION" "StartSession"
```

1. Make the script executable:

   ```
   chmod +x /home/user/sagemaker_connect.sh
   ```

1. Configure `$HOME/.ssh/config` to add the following entry:

```
Host space-name
  HostName 'arn:PARTITION:sagemaker:us-east-1:111122223333:space/domain-id/space-name'
  ProxyCommand '/home/user/sagemaker_connect.sh' '%h'
  ForwardAgent yes
  AddKeysToAgent yes
  StrictHostKeyChecking accept-new
```

For example, the `PARTITION` can be `aws`.

If you need to use a [named AWS credential profile](https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-files.html#cli-configure-files-using-profiles), change the proxy command as follows:

```
  ProxyCommand '/home/user/sagemaker_connect.sh' '%h' YOUR_CREDENTIAL_PROFILE_NAME
```
1. Connect via SSH or run an SCP command:

```
ssh space-name
scp file_abc space-name:/tmp/
```

------
#### [ Windows ]

**Prerequisites for Windows:**
+ PowerShell 5.1 or later
+ SSH client (OpenSSH recommended)

Create a PowerShell script (for example, `C:\Users\user-name\sagemaker_connect.ps1`):

```
# sagemaker_connect.ps1
param(
    [Parameter(Mandatory=$true)]
    [string]$SpaceArn,

    [Parameter(Mandatory=$false)]
    [string]$AwsProfile = ""
)

# Enable error handling
$ErrorActionPreference = "Stop"

# Validate ARN and extract region
if ($SpaceArn -match "^arn:aws[-a-z]*:sagemaker:([a-z0-9-]+):[0-9]{12}:space\/[^\/]+\/[^\/]+$") {
    $AwsRegion = $Matches[1]
} else {
    Write-Error "Error: Invalid SageMaker Studio Space ARN format."
    exit 1
}

# Build AWS CLI command
$awsCommand = @("sagemaker", "start-session", "--resource-identifier", $SpaceArn, "--region", $AwsRegion)

if ($AwsProfile) {
    $awsCommand += @("--profile", $AwsProfile)
}

try {
    # Start session and capture output
    Write-Host "Starting SageMaker session..." -ForegroundColor Green
    $startSessionOutput = & aws @awsCommand

    # Try to parse JSON response
    try {
        $sessionData = $startSessionOutput | ConvertFrom-Json
    } catch {
        Write-Error "Failed to parse JSON response: $_"
        Write-Host "Raw response was:" -ForegroundColor Yellow
        Write-Host $startSessionOutput
        exit 1
    }

    $sessionId = $sessionData.SessionId
    $streamUrl = $sessionData.StreamUrl
    $token = $sessionData.TokenValue

    # Validate extracted values
    if (-not $sessionId -or -not $streamUrl -or -not $token) {
        Write-Error "Error: Failed to extract session information from sagemaker start session response."
        Write-Host "Parsed response was:" -ForegroundColor Yellow
        Write-Host ($sessionData | ConvertTo-Json)
        exit 1
    }

    Write-Host "Session started successfully. Connecting..." -ForegroundColor Green

    # Create session manager plugin command
    $sessionJson = @{
        streamUrl = $streamUrl
        tokenValue = $token
        sessionId = $sessionId
    } | ConvertTo-Json -Compress

    # Escape the JSON string
    $escapedJson = $sessionJson -replace '"', '\"'

    # Call session-manager-plugin
    & session-manager-plugin "$escapedJson" $AwsRegion "StartSession"

} catch {
    Write-Error "Failed to start session: $_"
    exit 1
}
```
+ Configure `C:\Users\user-name\.ssh\config` to add the following entry:

```
Host space-name                            
  HostName "arn:aws:sagemaker:us-east-1:111122223333:space/domain-id/space-name"
  ProxyCommand "C:\WINDOWS\System32\WindowsPowerShell\v1.0\powershell.exe" -ExecutionPolicy RemoteSigned -File "C:\\Users\\user-name\\sagemaker_connect.ps1" "%h"
  ForwardAgent yes
  AddKeysToAgent yes
  User sagemaker-user
  StrictHostKeyChecking accept-new
```

------

# Connect to VPC with subnets without internet access
<a name="remote-access-local-ide-setup-vpc-no-internet"></a>

Before connecting your Remote IDE to Studio spaces in private subnets without internet access, ensure your administrator has [Set up Studio to run with subnets without internet access within a VPC](remote-access-remote-setup-vpc-subnets-without-internet-access.md).

You have the following options to connect your Remote IDE to Studio spaces in private subnets:
+ Set up HTTP Proxy (supported for VS Code and Kiro)
+ Pre-packaged remote server and extensions (VS Code only)

**Important**  
Cursor is not supported for connecting to Studio spaces in private subnets without outbound internet access.

**Topics**
+ [HTTP Proxy with controlled allow-listing](#remote-access-local-ide-setup-vpc-no-internet-http-proxy-with-controlled-allow-listing)
+ [Pre-packaged remote server and extensions (VS Code only)](#remote-access-local-ide-setup-vpc-no-internet-pre-packaged-vs-code-remote-server-and-extensions)

## HTTP Proxy with controlled allow-listing
<a name="remote-access-local-ide-setup-vpc-no-internet-http-proxy-with-controlled-allow-listing"></a>

When your Studio space is behind a firewall or proxy, ask your administrator to allow access to your IDE server and extension-related CDNs and endpoints. For more information, see [Set up HTTP Proxy with controlled allow-listing](remote-access-remote-setup-vpc-subnets-without-internet-access.md#remote-access-remote-setup-vpc-subnets-without-internet-access-setup-http-proxy-with-controlled-allow-listing).

------
#### [ VS Code ]

Configure the HTTP proxy for VS Code remote development by providing the proxy URL with the `remote.SSH.httpProxy` or `remote.SSH.httpsProxy` setting.

**Note**  
Consider enabling the **Remote.SSH: Use Curl And Wget Configuration Files** setting so that the `curlrc` and `wgetrc` files, placed in their respective default locations in the SageMaker space, are used when downloading the VS Code server.

------
#### [ Kiro ]

Configure the HTTP proxy for Kiro remote development by setting the `aws.sagemaker.ssh.kiro.httpsProxy` setting to your HTTP or HTTPS proxy endpoint.

If you use MCP (Model Context Protocol) servers in Kiro, you also need to add the proxy environment variables to your MCP server configuration:

```
"env": {
    "http_proxy": "${http_proxy}",
    "https_proxy": "${https_proxy}"
}
```

------

This option works when you are allowed to set up an HTTP proxy, and it gives you the flexibility to install additional extensions, because some extensions require access to a public endpoint.

## Pre-packaged remote server and extensions (VS Code only)
<a name="remote-access-local-ide-setup-vpc-no-internet-pre-packaged-vs-code-remote-server-and-extensions"></a>

**Note**  
This option is only available for Visual Studio Code. Kiro and Cursor do not support pre-packaged remote server setup.

When your Studio spaces can’t access external endpoints to download the VS Code remote server and extensions, you can pre-package them. With this approach, your administrator can export a tarball containing the `.vscode-server` directory for a specific version of VS Code. Then, the administrator uses a SageMaker AI Lifecycle Configuration (LCC) script to copy and extract the tarball into your home directory (`/home/sagemaker-user`). For more information, see [Set up Pre-packaged remote server and extensions (VS Code only)](remote-access-remote-setup-vpc-subnets-without-internet-access.md#remote-access-remote-setup-vpc-subnets-without-internet-access-setup-pre-packaged-vs-code-remote-server-and-extensions).

**Instructions for using pre-packaging for your VS Code remote server and extensions**

1. Install VS Code on your local machine.

1. When you connect to the SageMaker space:
   + Use the Default profile to ensure compatibility with pre-packaged extensions. Otherwise, you’ll need to install extensions using downloaded VSIX files after connecting to the Studio space.
   + Choose a VS Code version specific LCC script to attach to the space when you launch the space.

### Example Dockerfile usage for pre-packaging your VS Code remote server and extensions
<a name="remote-access-local-ide-setup-vpc-no-internet-pre-packaged-vs-code-remote-server-and-extensions-example-dockerfile"></a>

The following is a sample Dockerfile that launches a local container with an SSH server pre-installed, for cases where you can't create a space with remote access and internet access enabled.

**Note**  
In this example, the SSH server does not require authentication and is used only for exporting the VS Code remote server.  
Build and run the container on an x64 architecture.

```
FROM amazonlinux:2023

# Install OpenSSH server and required tools
RUN dnf install -y \
    openssh-server \
    shadow-utils \
    passwd \
    sudo \
    tar \
    gzip \
    && dnf clean all

# Create a user with no password
RUN useradd -m -s /bin/bash sagemaker-user && \
    passwd -d sagemaker-user

# Add sagemaker-user to sudoers via wheel group
RUN usermod -aG wheel sagemaker-user && \
    echo 'sagemaker-user ALL=(ALL) NOPASSWD:ALL' > /etc/sudoers.d/sagemaker-user && \
    chmod 440 /etc/sudoers.d/sagemaker-user

# Configure SSH to allow empty passwords and password auth
RUN sed -i 's/^#\?PermitEmptyPasswords .*/PermitEmptyPasswords yes/' /etc/ssh/sshd_config && \
    sed -i 's/^#\?PasswordAuthentication .*/PasswordAuthentication yes/' /etc/ssh/sshd_config

# Generate SSH host keys
RUN ssh-keygen -A

# Expose SSH port
EXPOSE 22

WORKDIR /home/sagemaker-user
USER sagemaker-user

# Keep the container running; the SSH server is started later with docker exec
CMD ["bash"]
```

Use the following commands to build and run the container:

```
# Build the image
docker build . -t remote_server_export

# Run the container
docker run --rm -it -d \
  -v /tmp/remote_access/.vscode-server:/home/sagemaker-user/.vscode-server \
  -p 2222:22 \
  --name remote_server_export \
  remote_server_export

# Change the permissions for the mounted folder
docker exec -i remote_server_export \
       bash -c 'sudo chown sagemaker-user:sagemaker-user ~/.vscode-server'

# Start the SSH server in the container
docker exec -i remote_server_export bash -c 'sudo /usr/sbin/sshd -D &'
```

Connect using the following command:

```
ssh sagemaker-user@localhost -p 2222
```

Before you can connect to this container, add the following to your `~/.ssh/config` file. Afterwards, `remote_access_export` appears as a host name in the Remote SSH side panel when connecting. For example:

```
Host remote_access_export
  HostName localhost
  User sagemaker-user
  Port 2222
  ForwardAgent yes
```

Archive `/tmp/remote_access/.vscode-server` and follow the steps in Pre-packaged remote server and extensions to connect and install the extensions. After extracting, ensure that the `.vscode-server` folder is the top-level (parent) folder.

```
cd /tmp/remote_access/
sudo tar -czvf vscode-server-with-extensions-for-1.100.2.tar.gz .vscode-server
```
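
To confirm the archive has the right layout, you can list its contents; the sketch below uses stand-in paths rather than a real exported server directory, and checks that `.vscode-server` is the top-level entry:

```shell
# Create a stand-in directory to illustrate the expected tarball layout
mkdir -p /tmp/demo/.vscode-server/bin
cd /tmp/demo
# Archive the directory, then list the first entry of the tarball
tar -czf server.tar.gz .vscode-server
tar -tzf server.tar.gz | head -1
```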

### Example LCC script (LCC-install-vscode-server-v1.100.2)
<a name="remote-access-local-ide-setup-vpc-no-internet-pre-packaged-vs-code-remote-server-and-extensions-example-lcc"></a>

The following is an example of how to install a specific version of VS Code remote server.

```
#!/bin/bash

set -x

remote_server_file=vscode-server-with-extensions-for-1.100.2.tar.gz

if [ ! -d "${HOME}/.vscode-server" ]; then
    cd /tmp
    aws s3 cp s3://S3_BUCKET/remote_access/${remote_server_file} .
    tar -xzvf ${remote_server_file}
    mv .vscode-server "${HOME}"
    rm ${remote_server_file}
else
    echo "${HOME}/.vscode-server already exists, skipping download and install."
fi
fi
```

# Filter your Studio spaces
<a name="remote-access-local-ide-setup-filter"></a>

You can use filtering to display only the relevant Amazon SageMaker AI spaces in the AWS Toolkit for Visual Studio Code explorer. The following provides information on manual filtering and automated filtering. For more information on the definitions of manual and automatic filtering, see [Filtering overview](remote-access-remote-setup-filter.md#remote-access-remote-setup-filter-overview).

This setup only applies when using the [Method 2: AWS Toolkit in the Remote IDE](remote-access-local-ide-setup.md#remote-access-local-ide-setup-local-vs-code-method-2-aws-toolkit-in-vs-code) method to connect from your Remote IDE to Amazon SageMaker Studio spaces. See [Set up remote access](remote-access-remote-setup.md) for more information.

**Topics**
+ [Manual filtering](#remote-access-local-ide-setup-filter-manual)
+ [Automatic filtering setup when using IAM credentials to sign-in](#remote-access-local-ide-setup-filter-automatic-IAM-credentials)

## Manual filtering
<a name="remote-access-local-ide-setup-filter-manual"></a>

To manually filter displayed spaces:
+ Open your Remote IDE and navigate to the Toolkit for VS Code side panel explorer
+ Find the **SageMaker AI** section
+ Choose the filter icon on the right of the SageMaker AI section header. This will open a dropdown menu.
+ In the dropdown menu, select the user profiles for which you want to display spaces

## Automatic filtering setup when using IAM credentials to sign-in
<a name="remote-access-local-ide-setup-filter-automatic-IAM-credentials"></a>

Automated filtering depends on the authentication method during sign-in. See [Connecting to AWS from the Toolkit](https://docs.aws.amazon.com/toolkit-for-vscode/latest/userguide/connect.html#connect-to-aws) in the Toolkit for VS Code User Guide for more information.

When you authenticate and connect with **IAM Credentials**, automated filtering requires the configuration described in [Set up when connecting with IAM credentials](remote-access-remote-setup-filter.md#remote-access-remote-setup-filter-set-up-iam-credentials). Without this setup, if users opt in to identity filtering, no spaces are shown.

Once the above is set up, AWS Toolkit matches spaces belonging to user profiles that start with the authenticated IAM user name or assumed role session name.

Automatic filtering is opt-in for users:
+ Open your Remote IDE settings
+ Navigate to the **AWS Toolkit** extension
+ Find **Enable Identity Filtering**
+ Choose to enable automatic filtering of spaces based on your AWS identity

# Supported AWS Regions
<a name="remote-access-supported-regions"></a>

The following table lists the AWS Regions where Remote IDE connections to Studio spaces are supported, along with the IDEs available in each Region.


| AWS Region | VS Code | Kiro | Cursor | 
| --- | --- | --- | --- | 
| us-east-1 | Supported | Supported | Supported | 
| us-east-2 | Supported | Supported | Supported | 
| us-west-1 | Supported | Supported | Supported | 
| us-west-2 | Supported | Supported | Supported | 
| af-south-1 | Supported | Supported | Supported | 
| ap-east-1 | Supported | Supported | Supported | 
| ap-south-1 | Supported | Supported | Supported | 
| ap-northeast-1 | Supported | Supported | Supported | 
| ap-northeast-2 | Supported | Supported | Supported | 
| ap-northeast-3 | Supported | Supported | Supported | 
| ap-southeast-1 | Supported | Supported | Supported | 
| ap-southeast-2 | Supported | Supported | Supported | 
| ap-southeast-3 | Supported | Supported | Supported | 
| ap-southeast-5 | Supported | Supported | Supported | 
| ca-central-1 | Supported | Supported | Supported | 
| eu-central-1 | Supported | Supported | Supported | 
| eu-central-2 | Supported | Supported | Supported | 
| eu-north-1 | Supported | Supported | Supported | 
| eu-south-1 | Supported | Supported | Supported | 
| eu-south-2 | Supported | Supported | Supported | 
| eu-west-1 | Supported | Supported | Supported | 
| eu-west-2 | Supported | Supported | Supported | 
| eu-west-3 | Supported | Supported | Supported | 
| il-central-1 | Supported | Supported | Supported | 
| me-central-1 | Supported | Not supported | Not supported | 
| me-south-1 | Supported | Not supported | Not supported | 
| sa-east-1 | Supported | Supported | Supported | 

# Bring your own image (BYOI)
<a name="studio-updated-byoi"></a>

An image is a file that identifies the kernels, language packages, and other dependencies required to run your applications. It includes:
+ Programming languages (like Python or R)
+ Kernels
+ Libraries and packages
+ Other necessary software

Amazon SageMaker Distribution (`sagemaker-distribution`) is a set of Docker images that include popular frameworks and packages for machine learning, data science, and visualization. For more information, see [SageMaker Studio image support policy](sagemaker-distribution.md).

If you need different functionality, you can bring your own image (BYOI). You may want to create a custom image if:
+ You need a specific version of a programming language or library
+ You want to include custom tools or packages
+ You're working with specialized software not available in the standard images

## Key terminology
<a name="studio-updated-byoi-basics"></a>

The following section defines key terms for bringing your own image to use with SageMaker AI.
+ **Dockerfile:** A text-based document with instructions for building a Docker image. This identifies the language packages and other dependencies for your Docker image.
+ **Docker image:** A packaged set of software and dependencies built from a Dockerfile.
+ **SageMaker AI image store:** Storage for your custom images in SageMaker AI.

**Topics**
+ [Key terminology](#studio-updated-byoi-basics)
+ [Custom image specifications](studio-updated-byoi-specs.md)
+ [How to bring your own image](studio-updated-byoi-how-to.md)
+ [Launch a custom image in Studio](studio-updated-byoi-how-to-launch.md)
+ [View your custom image details](studio-updated-byoi-view-images.md)
+ [Speed up container startup with SOCI](soci-indexing.md)
+ [Detach and clean up custom image resources](studio-updated-byoi-how-to-detach-from-domain.md)

# Custom image specifications
<a name="studio-updated-byoi-specs"></a>

The image that you specify in your Dockerfile must match the specifications in the following sections to create the image successfully.

**Topics**
+ [Running the image](#studio-updated-byoi-specs-run)
+ [Specifications for the user and file system](#studio-updated-byoi-specs-user-and-filesystem)
+ [Health check and URL for applications](#studio-updated-byoi-specs-app-healthcheck)
+ [Dockerfile samples](#studio-updated-byoi-specs-dockerfile-templates)

## Running the image
<a name="studio-updated-byoi-specs-run"></a>

The following configurations can be made by updating your [ContainerConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ContainerConfig.html). For an example, see [Update container configuration](studio-updated-byoi-how-to-container-configuration.md).
+ `Entrypoint` – You can configure `ContainerEntrypoint` and `ContainerArguments` that are passed to the container at runtime. We recommend configuring your entry point using `ContainerConfig`. See the above link for an example.
+ `EnvVariables` – When using Studio, you can define custom `ContainerEnvironment` variables for your container. You can optionally update your environment variables using `ContainerConfig`. See the above link for an example.

  SageMaker AI-specific environment variables take precedence and will override any variables with the same names. For example, SageMaker AI automatically provides environment variables prefixed with `AWS_` and `SAGEMAKER_` to ensure proper integration with AWS services and SageMaker AI functionality. The following are a few example SageMaker AI-specific environment variables:
  + `AWS_ACCOUNT_ID`
  + `AWS_REGION`
  + `AWS_DEFAULT_REGION`
  + `AWS_CONTAINER_CREDENTIALS_RELATIVE_URI`
  + `SAGEMAKER_SPACE_NAME`
  + `SAGEMAKER_APP_TYPE`
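
A quick way to see which platform-provided variables are present is to filter the environment by prefix. In the sketch below, a hypothetical `SAGEMAKER_SPACE_NAME` value is exported so that the filter matches even outside a running space:

```shell
# Inside a running space, SageMaker AI injects AWS_- and SAGEMAKER_-prefixed
# variables; here a hypothetical value stands in for one of them.
export SAGEMAKER_SPACE_NAME="my-space"
# List all variables with the reserved prefixes
env | grep -E '^(AWS_|SAGEMAKER_)' | sort
```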

## Specifications for the user and file system
<a name="studio-updated-byoi-specs-user-and-filesystem"></a>
+ `WorkingDirectory` – The Amazon EBS volume for your space is mounted on the path `/home/sagemaker-user`. You can't change the mount path. Use the `WORKDIR` instruction to set the working directory of your image to a folder within `/home/sagemaker-user`.
+ `UID` – The user ID of the Docker container. UID=1000 is a supported value. You can add sudo access to your users. The IDs are remapped to prevent a process running in the container from having more privileges than necessary.
+ `GID` – The group ID of the Docker container. GID=100 is a supported value. You can add sudo access to your users. The IDs are remapped to prevent a process running in the container from having more privileges than necessary.
+ Metadata directories – The `/opt/.sagemakerinternal` and `/opt/ml` directories that are used by AWS. The metadata file in `/opt/ml` contains metadata about resources such as `DomainId`.

  Use the following command to show the file system contents:

  ```
  cat /opt/ml/metadata/resource-metadata.json
  ```
+ Logging directories – The `/var/log/studio` directory is reserved for the logs of your applications and their associated extensions. We recommend that you don't use this folder when creating your image.
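
As an illustration, the metadata file can be read with a short Python one-liner. The sketch below uses a stand-in file and a hypothetical `DomainId`, since the real file at `/opt/ml/metadata/resource-metadata.json` exists only inside a running space:

```shell
# Create a sample metadata file with a hypothetical DomainId
cat > /tmp/resource-metadata.json <<'EOF'
{"DomainId": "d-example123", "SpaceName": "my-space"}
EOF
# Read a single field from the JSON metadata
python3 -c "import json; print(json.load(open('/tmp/resource-metadata.json'))['DomainId'])"
```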

## Health check and URL for applications
<a name="studio-updated-byoi-specs-app-healthcheck"></a>

The health check and URL depend on the applications. Choose the following link associated with the application you are building the image for.
+ [Health check and URL for applications](code-editor-custom-images.md#code-editor-custom-images-app-healthcheck) for Code Editor
+ [Health check and URL for applications](studio-updated-jl-admin-guide-custom-images.md#studio-updated-jl-admin-guide-custom-images-app-healthcheck) for JupyterLab

## Dockerfile samples
<a name="studio-updated-byoi-specs-dockerfile-templates"></a>

For Dockerfile samples that meet both the requirements on this page and your specific application needs, navigate to the sample Dockerfiles in the respective application's section. The following options include Amazon SageMaker Studio applications. 
+ [Dockerfile examples](code-editor-custom-images.md#code-editor-custom-images-dockerfile-templates) for Code Editor
+ [Dockerfile examples](studio-updated-jl-admin-guide-custom-images.md#studio-updated-jl-custom-images-dockerfile-templates) for JupyterLab

**Note**  
If you are bringing your own image to SageMaker Unified Studio, you will need to follow the [Dockerfile specifications](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/byoi-specifications.html) in the *Amazon SageMaker Unified Studio User Guide*.  
`Dockerfile` examples for SageMaker Unified Studio can be found in [Dockerfile example](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/byoi-specifications.html#byoi-specifications-example) in the *Amazon SageMaker Unified Studio User Guide*.

# How to bring your own image
<a name="studio-updated-byoi-how-to"></a>

The following pages will provide instructions on how to bring your own custom image. Ensure that the following prerequisites are satisfied before continuing.

## Prerequisites
<a name="studio-updated-byoi-how-to-prerequisites"></a>

You will need to complete the following prerequisites to bring your own image to Amazon SageMaker AI.
+ Set up the Docker application. For more information, see [Get started](https://docs.docker.com/get-started/) in the *Docker documentation*.
+ Install the latest AWS CLI by following the steps in [Getting started with the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html) in the *AWS Command Line Interface User Guide for Version 2*.
+ Permissions to access the Amazon Elastic Container Registry (Amazon ECR) service. For more information, see [Amazon ECR Managed Policies](https://docs.aws.amazon.com/AmazonECR/latest/userguide/ecr_managed_policies.html) in the *Amazon ECR User Guide*.
+ An AWS Identity and Access Management role that has the [AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home?#/policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess) policy attached.

**Topics**
+ [Prerequisites](#studio-updated-byoi-how-to-prerequisites)
+ [Create a custom image and push to Amazon ECR](studio-updated-byoi-how-to-prepare-image.md)
+ [Attach your custom image to your domain](studio-updated-byoi-how-to-attach-to-domain.md)
+ [Update container configuration](studio-updated-byoi-how-to-container-configuration.md)

# Create a custom image and push to Amazon ECR
<a name="studio-updated-byoi-how-to-prepare-image"></a>

This page provides instructions on how to create a local Dockerfile, build the container image, and add it to Amazon Elastic Container Registry (Amazon ECR).

**Note**  
In the following examples, the tags are not specified, and the tag `latest` is applied by default. If you would like to specify a tag, append `:tag` to the end of the image names. For more information, see [docker image tag](https://docs.docker.com/reference/cli/docker/image/tag/) in the *Docker documentation*.

**Topics**
+ [Create a local Dockerfile and build the container image](#studio-updated-byoi-how-to-create-local-dockerfile)
+ [Add a Docker image to Amazon ECR](#studio-updated-byoi-add-container-image)

## Create a local Dockerfile and build the container image
<a name="studio-updated-byoi-how-to-create-local-dockerfile"></a>

Use the following instructions to create a Dockerfile with your desired software and dependencies.

**To create your Dockerfile**

1. First set your variables for the AWS CLI commands that follow.

   ```
   LOCAL_IMAGE_NAME=local-image-name
   ```

   `local-image-name` is the name of the container image on your local machine, which you define here.

1. Create a text file named `Dockerfile` that meets the specifications in [Custom image specifications](studio-updated-byoi-specs.md).

   `Dockerfile` examples for supported applications can be found in [Dockerfile samples](studio-updated-byoi-specs.md#studio-updated-byoi-specs-dockerfile-templates).
**Note**  
If you are bringing your own image to SageMaker Unified Studio, you will need to follow the [Dockerfile specifications](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/byoi-specifications.html) in the *Amazon SageMaker Unified Studio User Guide*.  
`Dockerfile` examples for SageMaker Unified Studio can be found in [Dockerfile example](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/byoi-specifications.html#byoi-specifications-example) in the *Amazon SageMaker Unified Studio User Guide*.

1. In the directory containing your `Dockerfile`, build the Docker image using the following command. The period (`.`) sets the build context to the current directory, where Docker looks for the `Dockerfile`.

   ```
   docker build -t ${LOCAL_IMAGE_NAME} .
   ```

   After the build completes, you can list your container image information with the following command.

   ```
   docker images
   ```

1. (Optional) You can test your image by using the following command.

   ```
   docker run -it ${LOCAL_IMAGE_NAME}
   ```

   In the output, you will find that your server is running at a URL such as `http://127.0.0.1:8888/...`. You can test the image by copying the URL into your browser.

   If this does not work, you may need to include `-p port:port` in the docker run command. This option maps the exposed port on the container to a port on the host system. For more information about docker run, see [Running containers](https://docs.docker.com/engine/containers/run/) in the *Docker documentation*.

   Once you have verified that the server is working, you can stop the server and shut down all kernels before continuing. The instructions are viewable in the output.

## Add a Docker image to Amazon ECR
<a name="studio-updated-byoi-add-container-image"></a>

To add a container image to Amazon ECR, you will need to do the following.
+ Create an Amazon ECR repository.
+ Log in to your default registry.
+ Push the image to the Amazon ECR repository.

**Note**  
The Amazon ECR repository must be in the same AWS Region as the domain you are attaching the image to.

**To build and push the container image to Amazon ECR**

1. First set your variables for the AWS CLI commands that follow.

   ```
   ACCOUNT_ID=account-id
   REGION=aws-region
   ECR_REPO_NAME=ecr-repository-name
   ```
   + `account-id` is your account ID. You can find this at the top right of any AWS console page. For example, the [SageMaker AI console](https://console.aws.amazon.com/sagemaker).
   + `aws-region` is the AWS Region of your Amazon SageMaker AI domain. You can find this at the top right of any AWS console page. 
   + `ecr-repository-name` is the name of your Amazon Elastic Container Registry repository, which you define here. To view your Amazon ECR repositories, see the [Amazon ECR console](https://console.aws.amazon.com/ecr).

1. Log in to Amazon ECR and sign in to Docker.

   ```
   aws ecr get-login-password \
       --region ${REGION} | \
       docker login \
       --username AWS \
       --password-stdin ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com
   ```

   On successful authentication, you receive a login succeeded message.
**Important**  
If you receive an error, you may need to install or upgrade to the latest version of the AWS CLI. For more information, see [Installing the AWS Command Line Interface](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) in the *AWS Command Line Interface User Guide*.

1. Tag the image in a format compatible with Amazon ECR so that you can push it to your repository.

   ```
   docker tag \
       ${LOCAL_IMAGE_NAME} \
       ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${ECR_REPO_NAME}
   ```

1. Create an Amazon ECR repository using the AWS CLI. To create the repository using the Amazon ECR console, see [Creating an Amazon ECR private repository to store images](https://docs.aws.amazon.com/AmazonECR/latest/userguide/repository-create.html).

   ```
   aws ecr create-repository \
       --region ${REGION} \
       --repository-name ${ECR_REPO_NAME}
   ```

1. Push the image to your Amazon ECR repository. The image is pushed with the tag you applied, or with `latest` if you did not specify one.

   ```
   docker push ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${ECR_REPO_NAME}
   ```

Once the image has been successfully added to your Amazon ECR repository, you can view it in the [Amazon ECR console](https://console.aws.amazon.com/ecr).
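
As a quick sanity check, the variables used in these steps combine into a registry endpoint of the following shape (the values below are hypothetical):

```shell
# Hypothetical example values (replace with your own)
ACCOUNT_ID=111122223333
REGION=us-east-1
ECR_REPO_NAME=my-custom-images
# The fully qualified repository URI used by docker tag and docker push
echo "${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${ECR_REPO_NAME}"
```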

# Attach your custom image to your domain
<a name="studio-updated-byoi-how-to-attach-to-domain"></a>

This page provides instructions on how to attach your custom image to your domain. Use the following procedure to use the Amazon SageMaker AI console to navigate to your domain and start the **Attach image** process.

The following instructions assume that you have pushed an image to an Amazon ECR repository in the same AWS Region as your domain. If you have not already done so, see [Create a custom image and push to Amazon ECR](studio-updated-byoi-how-to-prepare-image.md).

When you choose to attach an image, you will have two options:
+ Attach a **New image**: This option will create an image and image version in your SageMaker AI image store and then attach it to your domain.
**Note**  
If you are continuing the BYOI process, from [Create a custom image and push to Amazon ECR](studio-updated-byoi-how-to-prepare-image.md), use the **New image** option.
+ Attach an **Existing image**: If you have already created your intended custom image in the SageMaker AI image store, use this option. This option attaches an existing custom image to your domain. To view your custom images in the SageMaker AI image store, see [View custom image details (console)](studio-updated-byoi-view-images.md#studio-updated-byoi-view-images-console).

------
#### [ New image ]

**To attach a new image to your domain**

1. Open the [SageMaker AI console](https://console.aws.amazon.com/sagemaker).

1. Expand the **Admin configurations** section, if you have not already done so.

1. Under **Admin configurations**, choose **Domains**.

1. From the list of **Domains**, select the domain you want to attach the image to.
**Note**  
If you are attaching the image to a SageMaker Unified Studio project and you need clarification on which domain to use, see [View the SageMaker AI domain details associated with your project](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/view-project-details.html#view-project-details-smai-domain).

1. Open the **Environment** tab.

1. In the **Custom images for personal Studio apps** section, choose **Attach image**.

1. For the **Image source**, choose **New image**.

1. Include your Amazon ECR image URI. The format is as follows.

   ```
   account-id.dkr.ecr.aws-region.amazonaws.com/repository-name:tag
   ```

   1. To obtain your Amazon ECR image URI, navigate to your [Amazon ECR private repositories](https://console.aws.amazon.com/ecr/private-registry/repositories) page.

   1. Choose your repository name link.

   1. Choose the **Copy URI** icon that corresponds to your image version (**Image tag**).

1. Follow the rest of the instructions to attach your custom image.
**Note**  
Ensure that you are using the application type consistent with your `Dockerfile`. For more information, see [Dockerfile samples](studio-updated-byoi-specs.md#studio-updated-byoi-specs-dockerfile-templates).

Once the image has been successfully attached to your domain, you will be able to view it in the **Environment** tab.

------
#### [ Existing image ]

**To attach an existing image to your domain**

1. Open the [SageMaker AI console](https://console.aws.amazon.com/sagemaker).

1. Expand the **Admin configurations** section, if you have not already done so.

1. Under **Admin configurations**, choose **Domains**.

1. From the list of **Domains**, select the domain you want to attach the image to.
**Note**  
If you are attaching the image to a SageMaker Unified Studio project and you need clarification on which domain to use, see [View the SageMaker AI domain details associated with your project](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/view-project-details.html#view-project-details-smai-domain).

1. Open the **Environment** tab.

1. In the **Custom images for personal Studio apps** section, choose **Attach image**.

1. For the **Image source**, choose **Existing image**.

1. Choose an existing image and image version from the SageMaker AI image store.

   If you are unable to view your image version, you may need to create an image version. For more information, see [View custom image details (console)](studio-updated-byoi-view-images.md#studio-updated-byoi-view-images-console).

1. Follow the rest of the instructions to attach your custom image.
**Note**  
Ensure that you are using the application type consistent with your `Dockerfile`. For more information, see [Dockerfile samples](studio-updated-byoi-specs.md#studio-updated-byoi-specs-dockerfile-templates).

Once the image has been successfully attached to your domain, you will be able to view it in the **Environment** tab.

------

Once your image has been successfully attached to your domain, the domain users can choose the image for their application. For more information, see [Launch a custom image in Studio](studio-updated-byoi-how-to-launch.md).

**Note**  
If you have attached a custom image to your SageMaker Unified Studio project, you will need to launch the application from within SageMaker Unified Studio. For more information, see [Launch your custom image](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/byoi-launch-custom-image.html) in the *Amazon SageMaker Unified Studio User Guide*.

# Update container configuration
<a name="studio-updated-byoi-how-to-container-configuration"></a>

You can bring custom Docker images into your machine learning workflows. A key aspect of customizing these images is configuring the container configuration, or [ContainerConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ContainerConfig.html). The following page provides an example of how to configure your `ContainerConfig`.

An entrypoint is the command or script that runs when the container starts. Custom entrypoints enable you to set up your environment, initialize services, or perform any necessary setup before your application launches. 

This example provides instructions on how to configure a custom entrypoint, for your JupyterLab application, using the AWS CLI. This example assumes that you have already created a custom image and domain. For instructions, see [Attach your custom image to your domain](studio-updated-byoi-how-to-attach-to-domain.md).

1. First set your variables for the AWS CLI commands that follow.

   ```
   APP_IMAGE_CONFIG_NAME=app-image-config-name
   ENTRYPOINT_FILE=entrypoint-file-name
   ENV_KEY=environment-key
   ENV_VALUE=environment-value
   REGION=aws-region
   DOMAIN_ID=domain-id
   IMAGE_NAME=custom-image-name
   IMAGE_VERSION=custom-image-version
   ```
   + `app-image-config-name` is the name of your application image configuration.
   + `entrypoint-file-name` is the name of your container's entrypoint script. For example, `entrypoint.sh`.
   + `environment-key` is the name of your environment variable.
   + `environment-value` is the value assigned to your environment variable.
   + `aws-region` is the AWS Region of your Amazon SageMaker AI domain. You can find this at the top right of any AWS console page. 
   + `domain-id` is your domain ID. To view your domains, see [View domains](domain-view.md).
   + `custom-image-name` is the name of your custom image. To view your custom image details, see [View custom image details (console)](studio-updated-byoi-view-images.md#studio-updated-byoi-view-images-console).

     If you followed the instructions in [Attach your custom image to your domain](studio-updated-byoi-how-to-attach-to-domain.md), you may want to use the same image name you used in that process.
   + `custom-image-version` is the version number of your custom image. This should be an integer, representing the version of your image. To view your custom image details, see [View custom image details (console)](studio-updated-byoi-view-images.md#studio-updated-byoi-view-images-console).

1. Use the [CreateAppImageConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateAppImageConfig.html) API to create an image configuration.

   ```
   aws sagemaker create-app-image-config \
       --region ${REGION} \
       --app-image-config-name ${APP_IMAGE_CONFIG_NAME} \
       --jupyter-lab-app-image-config \
           "ContainerConfig={ContainerEntrypoint=[${ENTRYPOINT_FILE}],ContainerEnvironmentVariables={${ENV_KEY}=${ENV_VALUE}}}"
   ```

1. Use the [UpdateDomain](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateDomain.html) API to update the default settings for your domain. This attaches the custom image as well as the application image configuration.

   ```
   aws sagemaker update-domain \
       --region ${REGION} \
       --domain-id "${DOMAIN_ID}" \
       --default-user-settings "{
           \"JupyterLabAppSettings\": {
               \"CustomImages\": [
                   {
                       \"ImageName\": \"${IMAGE_NAME}\",
                       \"ImageVersionNumber\": ${IMAGE_VERSION},
                       \"AppImageConfigName\": \"${APP_IMAGE_CONFIG_NAME}\"
                   }
               ]
           }
       }"
   ```
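
The escaped JSON passed to `--default-user-settings` is easy to get wrong. As an optional local check, the following sketch (using placeholder values for the image name, version, and configuration name) builds the same payload as a file and validates it with `python3 -m json.tool` before you call the API:

```
#!/bin/bash
set -eu
# Placeholder values; substitute your own.
IMAGE_NAME="my-custom-image"
IMAGE_VERSION=1
APP_IMAGE_CONFIG_NAME="my-app-image-config"
# Build the default user settings payload and confirm that it parses as JSON.
cat > default-user-settings.json <<EOF
{
    "JupyterLabAppSettings": {
        "CustomImages": [
            {
                "ImageName": "${IMAGE_NAME}",
                "ImageVersionNumber": ${IMAGE_VERSION},
                "AppImageConfigName": "${APP_IMAGE_CONFIG_NAME}"
            }
        ]
    }
}
EOF
python3 -m json.tool default-user-settings.json > /dev/null && echo "Settings JSON is valid"
```

If the check passes, you can pass the file directly with `--default-user-settings file://default-user-settings.json` instead of an inline string.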

# Launch a custom image in Studio
<a name="studio-updated-byoi-how-to-launch"></a>

After you have attached a custom image to your Amazon SageMaker AI domain, the image becomes available to the users in the domain. Use the following instructions to launch an application with the custom image.

**Note**  
If you have attached a custom image to your SageMaker Unified Studio project, you will need to launch the application from within SageMaker Unified Studio. For more information, see [Launch your custom image](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/byoi-launch-custom-image.html) in the *Amazon SageMaker Unified Studio User Guide*.

1. Launch Amazon SageMaker Studio. For instructions, see [Launch Amazon SageMaker Studio](studio-updated-launch.md).

1. If you have not already done so, expand the **Applications** section.

1. Choose the application from the **Applications** section. If you do not see the application, it may be hidden from your view. In this case, contact your administrator.

1. To create a space, choose **Create *application* space** and follow the instructions to create the space.

   To choose an existing space, choose the link name of the space you want to open.

   

1. Under **Image**, choose the image you want to use.

   If the **Image** dropdown is unavailable, you may need to stop your space. Choose **Stop space** to do so.

1. Confirm the settings for the space and choose **Run space**.

# View your custom image details
<a name="studio-updated-byoi-view-images"></a>

The following page provides instructions on how to view your custom image details in the SageMaker AI image store.

## View custom image details (console)
<a name="studio-updated-byoi-view-images-console"></a>

The following provides instructions on how to view your custom images using the SageMaker AI console, where you can view and edit your image details.

**View your custom images (console)**

1. Open the [SageMaker AI console](https://console.aws.amazon.com/sagemaker).

1. Expand the **Admin configurations** section.

1. Under **Admin configurations**, choose **Images**.

1. From the list of **Custom images**, select the hyperlink of your image name.

## View custom image details (AWS CLI)
<a name="studio-updated-byoi-view-images-cli"></a>

The following section shows an example of how to view your custom images using the AWS CLI.

```
aws sagemaker list-images \
       --region aws-region
```
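
The response includes an `Images` list. As a sketch (the file name and field values here are illustrative), you can save a response and extract fields locally with `python3`, or use the AWS CLI `--query` option for the same result:

```
#!/bin/bash
set -eu
# Illustrative list-images response saved locally (values are placeholders).
cat > sample-response.json <<'EOF'
{
    "Images": [
        {
            "ImageArn": "arn:aws:sagemaker:us-east-1:111122223333:image/my-custom-image",
            "ImageName": "my-custom-image",
            "ImageStatus": "CREATED"
        }
    ]
}
EOF
# Extract the image names from the response.
python3 -c "import json; d = json.load(open('sample-response.json')); print([i['ImageName'] for i in d['Images']])"
```

You can achieve similar filtering directly in the CLI with a JMESPath expression, for example `--query "Images[].ImageName"`.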

# Speed up container startup with SOCI
<a name="soci-indexing"></a>

SOCI (Seekable Open Container Initiative) indexing enables lazy loading of custom container images in [Amazon SageMaker Studio](studio-updated.md) or [Amazon SageMaker Unified Studio](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/what-is-sagemaker-unified-studio.html). SOCI can reduce startup times for your custom [Bring your own image (BYOI)](studio-updated-byoi.md) containers by roughly 30-70%. The latency improvement varies depending on image size, hosting instance availability, and other application dependencies. SOCI creates an index that allows containers to launch with only the necessary components, fetching additional files on demand.

SOCI addresses the slow container startup times that interrupt iterative machine learning (ML) development workflows with custom images. As ML workloads become more complex, container images have grown larger, creating startup delays that hamper development cycles.

**Topics**
+ [Key benefits](#soci-indexing-key-benefits)
+ [How SOCI indexing works](#soci-indexing-how-works)
+ [Architecture components](#soci-indexing-architecture-components)
+ [Supported tools](#soci-indexing-supported-tools)
+ [Permissions for SOCI indexing](soci-indexing-setup.md)
+ [Create SOCI indexes with nerdctl and SOCI CLI example](soci-indexing-example-create-indexes.md)
+ [Integrate SOCI-indexed images with Studio example](soci-indexing-example-integrate-studio.md)

## Key benefits
<a name="soci-indexing-key-benefits"></a>
+ **Faster iteration cycles**: Reduce container startup times, depending on image size and instance type
+ **Universal optimization**: Extend performance benefits to all custom BYOI containers in Studio

## How SOCI indexing works
<a name="soci-indexing-how-works"></a>

SOCI creates a specialized metadata index that maps your container image's internal file structure. This index enables access to individual files without downloading the entire image. The SOCI index is stored as an OCI (Open Container Initiative) compliant artifact in [Amazon ECR](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) and linked to your original container image, preserving image digests and signature validity.

When you launch a container in Studio, the system uses the SOCI index to identify and download only essential files needed for startup. Additional components are fetched in parallel as your application requires them.

## Architecture components
<a name="soci-indexing-architecture-components"></a>
+ **Original container image**: Your base container stored in Amazon ECR
+ **SOCI index artifact**: Metadata mapping your image's file structure
+ **OCI Image Index manifest**: Links your original image and SOCI index
+ **Finch container runtime**: Enables lazy loading integration with Studio

## Supported tools
<a name="soci-indexing-supported-tools"></a>


| Tool | Integration | 
| --- | --- | 
| nerdctl | Requires containerd setup | 
| Finch CLI | Native SOCI support | 
| Docker + SOCI CLI | Additional tooling required | 


# Permissions for SOCI indexing
<a name="soci-indexing-setup"></a>

Create SOCI indexes for your container images and store them in Amazon ECR before using SOCI indexing with [Amazon SageMaker Studio](studio-updated.md) or [Amazon SageMaker Unified Studio](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/what-is-sagemaker-unified-studio.html).

**Topics**
+ [Prerequisites](#soci-indexing-setup-prerequisites)
+ [Required IAM permissions](#soci-indexing-setup-iam-permissions)

## Prerequisites
<a name="soci-indexing-setup-prerequisites"></a>
+ An AWS account with an [AWS Identity and Access Management](https://docs.aws.amazon.com/IAM/latest/UserGuide/getting-started.html) (IAM) role that has permissions to manage the following:
  + [Amazon ECR](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html)
  + [Amazon SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/gs.html)
+ [Amazon ECR private repositories](https://docs.aws.amazon.com/AmazonECR/latest/userguide/Repositories.html) for storing your container images
+ [AWS CLI v2.0 or later](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) configured with appropriate credentials
+ The following container tools:
  + Required: [soci-snapshotter](https://github.com/awslabs/soci-snapshotter)
  + One of the following:
    + [nerdctl](https://github.com/containerd/nerdctl)
    + [finch](https://github.com/runfinch/finch)

## Required IAM permissions
<a name="soci-indexing-setup-iam-permissions"></a>

Your IAM role needs permissions to:
+ Create and manage SageMaker AI resources (domains, images, app configs).
  + You may use the [SageMakerFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html) AWS managed policy. For more permission details, see [AWS managed policy: AmazonSageMakerFullAccess](security-iam-awsmanpol.md#security-iam-awsmanpol-AmazonSageMakerFullAccess).
+ [IAM permissions for pushing an image to an Amazon ECR private repository](https://docs.aws.amazon.com/AmazonECR/latest/userguide/image-push-iam.html).

# Create SOCI indexes with nerdctl and SOCI CLI example
<a name="soci-indexing-example-create-indexes"></a>

The following page provides an example of how to create SOCI indexes with nerdctl and the SOCI CLI.

**Create SOCI indexes example**

1. First set your variables for the AWS CLI commands that follow. The following is an example of setting up your variables.

   ```
   ACCOUNT_ID="111122223333"
   REGION="us-east-1"
   REPOSITORY_NAME="repository-name"
   ORIGINAL_IMAGE_TAG="original-image-tag"
   SOCI_IMAGE_TAG="soci-indexed-image-tag"
   ```

   Variable definitions:
   + `ACCOUNT_ID` is your AWS account ID
   + `REGION` is the AWS Region of your Amazon ECR private registry
   + `REPOSITORY_NAME` is the name of your Amazon ECR private repository
   + `ORIGINAL_IMAGE_TAG` is the tag of your original image
   + `SOCI_IMAGE_TAG` is the tag of your SOCI-indexed image

1. Install required tools:

   ```
   # Install SOCI CLI, containerd, and nerdctl
   sudo yum install soci-snapshotter
   sudo yum install containerd jq  
   sudo systemctl start soci-snapshotter
   sudo systemctl restart containerd
   sudo yum install nerdctl
   ```

1. Set your registry variables:

   ```
   REGISTRY_USER=AWS
   REGISTRY="$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com"
   ```

1. Export your region and authenticate to Amazon ECR:

   ```
   export AWS_REGION=$REGION
   REGISTRY_PASSWORD=$(/usr/local/bin/aws ecr get-login-password --region $AWS_REGION)
   echo $REGISTRY_PASSWORD | sudo nerdctl login -u $REGISTRY_USER --password-stdin $REGISTRY
   ```

1. Pull your original container image:

   ```
   sudo nerdctl pull $REGISTRY/$REPOSITORY_NAME:$ORIGINAL_IMAGE_TAG
   ```

1. Create the SOCI index:

   ```
   sudo nerdctl image convert --soci $REGISTRY/$REPOSITORY_NAME:$ORIGINAL_IMAGE_TAG $REGISTRY/$REPOSITORY_NAME:$SOCI_IMAGE_TAG
   ```

1. Push the SOCI-indexed image:

   ```
   sudo nerdctl push --platform linux/amd64 $REGISTRY/$REPOSITORY_NAME:$SOCI_IMAGE_TAG
   ```

This process creates two artifacts for the original container image in your ECR repository:
+ SOCI index - Metadata enabling lazy loading
+ Image Index manifest - OCI-compliant manifest

# Integrate SOCI-indexed images with Studio example
<a name="soci-indexing-example-integrate-studio"></a>

To use SOCI-indexed images in Studio, reference the SOCI-indexed image tag rather than the original container image tag. Use the tag you specified during the SOCI conversion process (for example, `SOCI_IMAGE_TAG` in the [Create SOCI indexes with nerdctl and SOCI CLI example](soci-indexing-example-create-indexes.md)).

**Integrate SOCI-indexed images example**

1. First set your variables for the AWS CLI commands that follow. The following is an example of setting up your variables.

   ```
   ACCOUNT_ID="111122223333"
   REGION="us-east-1"
   IMAGE_NAME="sagemaker-image-name"
   IMAGE_CONFIG_NAME="sagemaker-image-config-name"
   ROLE_ARN="your-role-arn"
   DOMAIN_ID="domain-id"
   SOCI_IMAGE_TAG="soci-indexed-image-tag"
   ```

   Variable definitions:
   + `ACCOUNT_ID` is your AWS account ID
   + `REGION` is the AWS Region of your Amazon ECR private registry
   + `IMAGE_NAME` is the name of your SageMaker image
   + `IMAGE_CONFIG_NAME` is the name of your SageMaker image configuration
   + `ROLE_ARN` is the ARN of your execution role with the permissions listed in [Required IAM permissions](soci-indexing-setup.md#soci-indexing-setup-iam-permissions)
   + `DOMAIN_ID` is the [domain ID](https://docs.aws.amazon.com/sagemaker/latest/dg/domain-view.html)
   + `SOCI_IMAGE_TAG` is the tag of your SOCI-indexed image

   **Note**  
   If you are attaching the image to a SageMaker Unified Studio project and you need clarification on which domain to use, see [View the SageMaker AI domain details associated with your project](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/view-project-details.html#view-project-details-smai-domain).

1. Export your region:

   ```
   export AWS_REGION=$REGION
   ```

1. Create a SageMaker image:

   ```
   aws sagemaker create-image \
       --image-name "$IMAGE_NAME" \
       --role-arn "$ROLE_ARN"
   ```

1. Create a SageMaker Image Version using your SOCI index URI:

   ```
   IMAGE_INDEX_URI="$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$IMAGE_NAME:$SOCI_IMAGE_TAG"
   
   aws sagemaker create-image-version \
       --image-name "$IMAGE_NAME" \
       --base-image "$IMAGE_INDEX_URI"
   ```

1. Create an application image configuration and update your Amazon SageMaker AI domain to include the custom image for your app. You can do this for Code Editor, based on Code-OSS, Visual Studio Code - Open Source (Code Editor), and JupyterLab applications. Choose an application option below to view the steps.

------
#### [ Code Editor ]

   Create an application image configuration for Code Editor:

   ```
   aws sagemaker create-app-image-config \
       --app-image-config-name "$IMAGE_CONFIG_NAME" \
       --code-editor-app-image-config '{ "FileSystemConfig": { "MountPath": "/home/sagemaker-user", "DefaultUid": 1000, "DefaultGid": 100 } }'
   ```

   Update your Amazon SageMaker AI domain to include the custom image for Code Editor:

   ```
   aws sagemaker update-domain \
       --domain-id "$DOMAIN_ID" \
       --default-user-settings "{
           \"CodeEditorAppSettings\": {
               \"CustomImages\": [{
                   \"ImageName\": \"$IMAGE_NAME\",
                   \"AppImageConfigName\": \"$IMAGE_CONFIG_NAME\"
               }]
           }
       }"
   ```

------
#### [ JupyterLab ]

   Create an application image configuration for JupyterLab:

   ```
   aws sagemaker create-app-image-config \
       --app-image-config-name "$IMAGE_CONFIG_NAME" \
       --jupyter-lab-app-image-config '{ "FileSystemConfig": { "MountPath": "/home/sagemaker-user", "DefaultUid": 1000, "DefaultGid": 100 } }'
   ```

   Update your Amazon SageMaker AI domain to include the custom image for JupyterLab:

   ```
   aws sagemaker update-domain \
       --domain-id "$DOMAIN_ID" \
       --default-user-settings "{
           \"JupyterLabAppSettings\": {
               \"CustomImages\": [{
                   \"ImageName\": \"$IMAGE_NAME\",
                   \"AppImageConfigName\": \"$IMAGE_CONFIG_NAME\"
               }]
           }
       }"
   ```

------

1. After you update your domain to include your custom image, you can create an application in Studio using your custom image. When you [Launch a custom image in Studio](studio-updated-byoi-how-to-launch.md), ensure that you are using your custom image.

# Detach and clean up custom image resources
<a name="studio-updated-byoi-how-to-detach-from-domain"></a>

The following page provides instructions on how to detach your custom images and clean up the related resources using the Amazon SageMaker AI console or the AWS Command Line Interface (AWS CLI). 

**Important**  
You must first detach your custom image from your domain before deleting the image from the SageMaker AI image store. If not, you may experience errors while viewing your domain information or attaching new custom images to your domain.   
If you are experiencing an error loading a custom image, see [Failure to load custom image](studio-updated-troubleshooting.md#studio-updated-troubleshooting-custom-image). 

## Detach and delete custom images (console)
<a name="studio-updated-byoi-how-to-detach-from-domain-console"></a>

The following provides instructions on how to detach your custom images from SageMaker AI and clean up your custom image resources using the console.

**Detach your custom image from your domain**

1. Open the [SageMaker AI console](https://console.aws.amazon.com/sagemaker).

1. Expand the **Admin configurations** section.

1. Under **Admin configurations**, choose **Domains**.

1. From the list of **domains**, select a domain.

1. Open the **Environment** tab.

1. For **Custom images for personal Studio apps**, select the checkboxes for the images you want to detach.

1. Choose **Detach**.

1. Follow the instructions to detach.

**Delete your custom image**

1. Open the [SageMaker AI console](https://console.aws.amazon.com/sagemaker).

1. Expand the **Admin configurations** section, if you have not already done so.

1. Under **Admin configurations**, choose **Images**.

1. From the list of **Images**, select an image you would like to delete.

1. Choose **Delete**.

1. Follow the instructions to delete your image and all its versions from SageMaker AI.

**Delete your custom images and repository from Amazon ECR**
**Important**  
This will also delete any container images and artifacts in this repository.

1. Open the [Amazon ECR console](https://console.aws.amazon.com/ecr).

1. If you have not already done so, expand the left navigation pane.

1. Under **Private registry**, choose **Repositories**.

1. Select the repositories you want to delete.

1. Choose **Delete**.

1. Follow the instructions to delete.

## Detach and delete custom images (AWS CLI)
<a name="studio-updated-byoi-how-to-detach-from-domain-cli"></a>

The following section shows an example of how to detach your custom images using the AWS CLI.

1. First set your variables for the AWS CLI commands that follow.

   ```
   ACCOUNT_ID=account-id
   REGION=aws-region
   APP_IMAGE_CONFIG=app-image-config
   SAGEMAKER_IMAGE_NAME=custom-image-name
   ```
   + `aws-region` is the AWS Region of your Amazon SageMaker AI domain. You can find this at the top right of any AWS console page. 
   + `app-image-config` is the name of your application image configuration. Use the following AWS CLI command to list the application image configurations in your AWS Region.

     ```
     aws sagemaker list-app-image-configs \
            --region ${REGION}
     ```
   + `custom-image-name` is the custom image name. Use the following AWS CLI command to list the images in your AWS Region.

     ```
     aws sagemaker list-images \
            --region ${REGION}
     ```

1. To detach the image and image versions from your domain, create or update a domain configuration JSON file.
**Note**  
If you followed the instructions in [Attach your custom image to your domain](studio-updated-byoi-how-to-attach-to-domain.md), you may have updated your domain using the file named `update-domain.json`.   
If you do not have that file, you can create a new JSON file instead.

   Create a file named `update-domain.json` that you will use to update your domain.

1. To delete the custom images, leave `CustomImages` empty, such that `"CustomImages": []`. Choose one of the following to view example configuration files for Code Editor or JupyterLab.

------
#### [ Code Editor: update domain configuration file example ]

   A configuration file example for Code Editor, using [CodeEditorAppSettings](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CodeEditorAppSettings.html).

   ```
   {
       "DomainId": "domain-id",
       "DefaultUserSettings": {
           "CodeEditorAppSettings": {
               "CustomImages": [
               ]
           }
       }
   }
   ```

------
#### [ JupyterLab: update domain configuration file example ]

   A configuration file example for JupyterLab, using [JupyterLabAppSettings](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_JupyterLabAppSettings.html).

   ```
   {
       "DomainId": "domain-id",
       "DefaultUserSettings": {
           "JupyterLabAppSettings": {
               "CustomImages": [
               ]
           }
       }
   }
   ```

------

   `domain-id` is the domain ID that your image is attached to. Use the following command to list your domains.

   ```
   aws sagemaker list-domains \
         --region ${REGION}
   ```

1. Save the file.
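
   Optionally, confirm that the file parses as valid JSON before updating your domain. The following sketch recreates the JupyterLab example under a separate file name so that it does not overwrite your own `update-domain.json`; run the same `json.tool` check against your actual file.

   ```
   #!/bin/bash
   set -eu
   # Recreate the JupyterLab example configuration under a separate name
   # and confirm that it parses as valid JSON.
   cat > update-domain-example.json <<'EOF'
   {
       "DomainId": "domain-id",
       "DefaultUserSettings": {
           "JupyterLabAppSettings": {
               "CustomImages": []
           }
       }
   }
   EOF
   python3 -m json.tool update-domain-example.json > /dev/null && echo "valid JSON"
   ```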

1. Call the [update-domain](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/update-domain.html) AWS CLI command with the update domain configuration file, `update-domain.json`.
**Note**  
Before you can update the custom images, you must delete all of the **applications** in your domain. You **do not** need to delete user profiles or shared spaces. For instructions on deleting applications, choose one of the following options.  
If you want to use the SageMaker AI console, see [Shut down SageMaker AI resources in your domain](sm-console-domain-resources-shut-down.md).
If you want to use the AWS CLI, use steps 1 through 3 of [Delete an Amazon SageMaker AI domain (AWS CLI)](gs-studio-delete-domain.md#gs-studio-delete-domain-cli).

   ```
   aws sagemaker update-domain \
       --cli-input-json file://update-domain.json \
       --region ${REGION}
   ```

1. Delete the app image config.

   ```
   aws sagemaker delete-app-image-config \
       --app-image-config-name ${APP_IMAGE_CONFIG} \
       --region ${REGION}
   ```

1. Delete the custom image. This also deletes all of the image versions. This does not delete the Amazon ECR container image and image versions. To do so, use the optional steps below.

   ```
   aws sagemaker delete-image \
       --image-name ${SAGEMAKER_IMAGE_NAME} \
       --region ${REGION}
   ```

1. (Optional) Delete your Amazon ECR resources. The following list provides AWS CLI commands to obtain your Amazon ECR resource information for the steps below.

   1. Set your variables for the AWS CLI commands that follow.

      ```
      ECR_REPO_NAME=ecr-repository-name
      ```

      `ecr-repository-name` is the name of your Amazon Elastic Container Registry repository. 

      To list the details of your repositories, use the following command.

      ```
      aws ecr describe-repositories \
              --region ${REGION}
      ```

   1. Delete your repository from Amazon ECR. 
**Important**  
This will also delete any container images and artifacts in this repository.

      ```
      aws ecr delete-repository \
            --repository-name ${ECR_REPO_NAME} \
            --force \
            --region ${REGION}
      ```

# Lifecycle configurations within Amazon SageMaker Studio
<a name="studio-lifecycle-configurations"></a>

Lifecycle configurations (LCCs) are scripts that administrators and users can use to automate the customization of the following applications within your Amazon SageMaker Studio environment:
+ Amazon SageMaker AI JupyterLab
+ Code Editor, based on Code-OSS, Visual Studio Code - Open Source
+ Studio Classic
+ Notebook instance

Customizing your application includes:
+ Installing custom packages
+ Configuring extensions
+ Preloading datasets
+ Setting up source code repositories

Users create and attach built-in lifecycle configurations to their own user profiles. Administrators create and attach default or built-in lifecycle configurations at the domain, space, or user profile level.

**Important**  
Amazon SageMaker Studio first runs the built-in lifecycle configuration and then runs the default LCC. Amazon SageMaker AI won't resolve package conflicts between the user and administrator LCCs. For example, if the built-in LCC installs `python3.11` and the default LCC installs `python3.12`, Studio installs `python3.12`. 

# Create and attach lifecycle configurations
<a name="studio-lifecycle-configurations-create"></a>

You can create and attach lifecycle configurations using either the AWS Management Console or the AWS Command Line Interface.

**Topics**
+ [Create and attach lifecycle configurations (AWS CLI)](#studio-lifecycle-configurations-create-cli)
+ [Create and attach lifecycle configurations (console)](#studio-lifecycle-configurations-create-console)

## Create and attach lifecycle configurations (AWS CLI)
<a name="studio-lifecycle-configurations-create-cli"></a>

**Important**  
Before you begin, complete the following prerequisites:   
Update the AWS CLI by following the steps in [Installing the current AWS CLI Version](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv1.html#install-tool-bundled).
From your local machine, run `aws configure` and provide your AWS credentials. For information about AWS credentials, see [Understanding and getting your AWS credentials](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html). 
Onboard to Amazon SageMaker AI domain. For conceptual information, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md). For a quickstart guide, see [Use quick setup for Amazon SageMaker AI](onboard-quick-start.md).

The following procedure shows how to create a lifecycle configuration script that prints `Hello World` within Code Editor or JupyterLab.

**Note**  
Each script can have up to **16,384 characters**.

1. From your local machine, create a file named `my-script.sh` with the following content:

   ```
   #!/bin/bash
   set -eux
   echo 'Hello World!'
   ```

1. Use the following command to convert your `my-script.sh` file into base64 format. Encoding the script prevents errors caused by spacing and line-break encoding.

   ```
   LCC_CONTENT=`openssl base64 -A -in my-script.sh`
   ```
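
   Optionally, verify the encoding by decoding `LCC_CONTENT` and comparing the result with the original script. The following sketch recreates `my-script.sh` from the previous step so that it is self-contained:

   ```
   #!/bin/bash
   set -eu
   # Recreate my-script.sh from the earlier step so this check is self-contained.
   printf '%s\n' '#!/bin/bash' 'set -eux' "echo 'Hello World!'" > my-script.sh
   # Encode, then decode, and confirm the round trip preserves the script.
   LCC_CONTENT=$(openssl base64 -A -in my-script.sh)
   echo "${LCC_CONTENT}" | openssl base64 -d -A > decoded.sh
   cmp my-script.sh decoded.sh && echo "Round trip OK"
   ```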

1. Create a lifecycle configuration for use with Studio. The following command creates a lifecycle configuration that runs when you launch an associated `JupyterLab` application:

   ```
   aws sagemaker create-studio-lifecycle-config \
   --region region \
   --studio-lifecycle-config-name my-lcc \
   --studio-lifecycle-config-content $LCC_CONTENT \
   --studio-lifecycle-config-app-type application-type
   ```

   For `studio-lifecycle-config-app-type`, specify either *CodeEditor* or *JupyterLab*.
**Note**  
Make a note of the ARN of the newly created lifecycle configuration that is returned. You need this ARN to attach the lifecycle configuration to your application.
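
The returned ARN follows the standard SageMaker AI pattern. The following sketch shows its general form using placeholder Region and account values; confirm the exact value from the `create-studio-lifecycle-config` response, or retrieve it with `aws sagemaker describe-studio-lifecycle-config --studio-lifecycle-config-name my-lcc`.

```
#!/bin/bash
set -eu
REGION="us-east-1"        # placeholder
ACCOUNT_ID="111122223333" # placeholder
LCC_NAME="my-lcc"
# General form of a Studio lifecycle configuration ARN.
LCC_ARN="arn:aws:sagemaker:${REGION}:${ACCOUNT_ID}:studio-lifecycle-config/${LCC_NAME}"
echo "${LCC_ARN}" > lcc-arn.txt
cat lcc-arn.txt
```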

To ensure that the environments are customized properly, users and administrators use different commands to attach lifecycle configurations.

### Attach default lifecycle configurations (administrator)
<a name="studio-lifecycle-configurations-attach-cli-administrator"></a>

To attach the lifecycle configuration, you must update the `UserSettings` for your domain or user profile. Lifecycle configuration scripts that are associated at the domain level are inherited by all users. However, scripts that are associated at the user profile level are scoped to a specific user. 

You can create a new user profile, domain, or space with a lifecycle configuration attached by using the following commands:
+ [create-user-profile](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/create-user-profile.html)
+ [create-domain](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/create-domain.html)
+ [create-space](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/create-space.html)

The following command creates a user profile with a lifecycle configuration for a JupyterLab application. Add the lifecycle configuration ARN from the preceding step to the `JupyterLabAppSettings` of the user. You can add multiple lifecycle configurations at the same time by passing a list of them. When a user launches a JupyterLab application with the AWS CLI, they can specify a lifecycle configuration instead of using the default one. The lifecycle configuration that the user passes must belong to the list of lifecycle configurations in `JupyterLabAppSettings`.

```
# Create a new UserProfile
aws sagemaker create-user-profile --domain-id domain-id \
--user-profile-name user-profile-name \
--region region \
--user-settings '{
"JupyterLabAppSettings": {
  "LifecycleConfigArns":
    [lifecycle-configuration-arn-list]
  }
}'
```
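
When launching the application with the AWS CLI, a user selects one of the attached lifecycle configurations through the `LifecycleConfigArn` field of the resource spec. The following sketch uses a hypothetical ARN and instance type, and shows the `create-app` call commented out because it requires a real domain and space:

```
#!/bin/bash
set -eu
# Hypothetical lifecycle configuration ARN; substitute your own.
LCC_ARN="arn:aws:sagemaker:us-east-1:111122223333:studio-lifecycle-config/my-lcc"
cat > resource-spec.json <<EOF
{
    "LifecycleConfigArn": "${LCC_ARN}",
    "InstanceType": "ml.t3.medium"
}
EOF
python3 -m json.tool resource-spec.json
# Launch JupyterLab with the selected lifecycle configuration:
# aws sagemaker create-app --domain-id domain-id --space-name space-name \
#     --app-type JupyterLab --app-name default \
#     --resource-spec file://resource-spec.json
```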

The following command creates a user profile with a lifecycle configuration for a Code Editor application. Add the lifecycle configuration ARN from the preceding step to the `CodeEditorAppSettings` of the user. You can add multiple lifecycle configurations at the same time by passing a list of them. When a user launches a Code Editor application with the AWS CLI, they can specify a lifecycle configuration instead of using the default one. The lifecycle configuration that the user passes must belong to the list of lifecycle configurations in `CodeEditorAppSettings`.

```
# Create a new UserProfile
aws sagemaker create-user-profile --domain-id domain-id \
--user-profile-name user-profile-name \
--region region \
--user-settings '{
"CodeEditorAppSettings": {
  "LifecycleConfigArns":
    [lifecycle-configuration-arn-list]
  }
}'
```

### Attach built-in lifecycle configurations (user)
<a name="studio-lifecycle-configurations-attach-cli-user"></a>

To attach the lifecycle configuration, you must update the `UserSettings` for your user profile.

The following command updates a user profile with a lifecycle configuration for a JupyterLab application. Add the lifecycle configuration ARN from the preceding step to the `JupyterLabAppSettings` of your user profile.

```
# Update a UserProfile
aws sagemaker update-user-profile --domain-id domain-id \
--user-profile-name user-profile-name \
--region region \
--user-settings '{
"JupyterLabAppSettings": {
  "BuiltInLifecycleConfigArn":"lifecycle-configuration-arn"
  }
}'
```

The following command updates a user profile with a lifecycle configuration for a Code Editor application. Add the lifecycle configuration ARN from the preceding step to the `CodeEditorAppSettings` of your user profile.

```
# Update a UserProfile
aws sagemaker update-user-profile --domain-id domain-id \
--user-profile-name user-profile-name \
--region region \
--user-settings '{
"CodeEditorAppSettings": {
  "BuiltInLifecycleConfigArn":"lifecycle-configuration-arn"
  }
}'
```

## Create and attach lifecycle configurations (console)
<a name="studio-lifecycle-configurations-create-console"></a>

To create and attach lifecycle configurations in the AWS Management Console, navigate to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker) and choose **Lifecycle configurations** in the left-hand navigation. The console will guide you through the process of creating the lifecycle configuration.

# Debug lifecycle configurations
<a name="studio-lifecycle-configurations-debug"></a>

The following topics show how to get information about and debug your lifecycle configurations.

**Topics**
+ [Verify lifecycle configuration process from CloudWatch Logs](#studio-lifecycle-configurations-debug-logs)
+ [Lifecycle configuration timeout](studio-lifecycle-configurations-debug-timeout.md)

## Verify lifecycle configuration process from CloudWatch Logs
<a name="studio-lifecycle-configurations-debug-logs"></a>

Lifecycle configurations only log `STDOUT` and `STDERR`.

`STDOUT` is the default output for bash scripts. You can write to `STDERR` by appending `>&2` to the end of a bash command. For example, `echo 'hello' >&2`. 

Logs for your lifecycle configurations are published to your AWS account using Amazon CloudWatch. You can find these logs in the `/aws/sagemaker/studio` log group in the CloudWatch console.

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. Choose **Logs** from the left navigation pane. From the dropdown menu, select **Log groups**.

1. On the **Log groups** page, search for `/aws/sagemaker/studio`. 

1. Select the log group.

1. On the **Log group details** page, choose the **Log streams** tab.

1. To find the logs for a specific space and app, search the log streams using the following format:

   ```
   domain-id/space-name/app-type/default/LifecycleConfigOnStart
   ```

   For example, to find the lifecycle configuration logs for domain ID `d-m85lcu8vbqmz`, space name `i-sonic-js`, and application type `JupyterLab`, use the following search string:

   ```
   d-m85lcu8vbqmz/i-sonic-js/JupyterLab/default/LifecycleConfigOnStart
   ```

1. To view the script execution logs, select the log stream appended with `LifecycleConfigOnStart`.
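The search string is just these identifiers joined with slashes. A small helper like the following (hypothetical; not part of any AWS SDK) shows how it is assembled:

```python
def lifecycle_log_stream(domain_id: str, space_name: str, app_type: str) -> str:
    """Build the CloudWatch log stream search string for lifecycle configuration logs."""
    return f"{domain_id}/{space_name}/{app_type}/default/LifecycleConfigOnStart"

# Reproduces the example from the text.
print(lifecycle_log_stream("d-m85lcu8vbqmz", "i-sonic-js", "JupyterLab"))
# → d-m85lcu8vbqmz/i-sonic-js/JupyterLab/default/LifecycleConfigOnStart
```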

# Lifecycle configuration timeout
<a name="studio-lifecycle-configurations-debug-timeout"></a>

There is a lifecycle configuration timeout limit of 5 minutes. If a lifecycle configuration script takes longer than 5 minutes to run, you get an error.

To resolve this error, make sure that your lifecycle configuration script completes in less than 5 minutes. 

To help decrease the runtime of scripts, try the following:
+ Reduce unnecessary steps. For example, limit which conda environments to install large packages in.
+ Run tasks in parallel processes.
+ Use the `nohup` command in your script to make sure that hangup signals are ignored so that the script runs without stopping.
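For example, independent setup steps can run as background jobs that are awaited together. This is a minimal sketch with placeholder tasks, not a complete lifecycle configuration script:

```shell
#!/bin/bash
set -e

# Run two independent setup tasks in parallel; the task bodies here are placeholders.
# In a real lifecycle configuration, long-running steps can also be prefixed with
# nohup so that they ignore hangup signals.
echo "installing packages" > /tmp/lcc_task1.log &
echo "configuring tools" > /tmp/lcc_task2.log &

# Wait for both background jobs to finish before the script exits.
wait
cat /tmp/lcc_task1.log /tmp/lcc_task2.log
```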

# Amazon SageMaker Studio spaces
<a name="studio-updated-spaces"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the updated Studio experience. For information about using the Studio Classic application, see [Amazon SageMaker Studio Classic](studio.md).

Spaces are used to manage the storage and resource needs of some Amazon SageMaker Studio applications. Each space is composed of multiple resources and can be either private or shared. Each space has a 1:1 relationship with an instance of an application. Every supported application that is created gets its own space. The following applications in Studio run on spaces: 
+  [Code Editor in Amazon SageMaker Studio](code-editor.md)
+  [SageMaker JupyterLab](studio-updated-jl.md) 
+  [Amazon SageMaker Studio Classic](studio.md) 

A space is composed of the following resources: 
+ A storage volume. 
  + For Studio Classic, the space is connected to the shared Amazon Elastic File System (Amazon EFS) volume for the domain. 
  + For other applications, a distinct Amazon Elastic Block Store (Amazon EBS) volume is attached to the space. All applications are given their own Amazon EBS volume. Applications do not have access to the Amazon EBS volume of other applications. For more information about Amazon EBS volumes, see [Amazon Elastic Block Store (Amazon EBS)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html). 
+ The application type of the space. 
+ The image that the application is based on.

Spaces can be either private or shared:
+  **Private**: Private spaces are scoped to a single user in a domain. Private spaces cannot be shared with other users. All applications that support spaces also support private spaces. 
+  **Shared**: Shared spaces are accessible by all users in the domain. For more information about shared spaces, see [Collaboration with shared spaces](domain-space.md). 

Spaces can be created in domains that use either AWS IAM Identity Center or AWS Identity and Access Management (IAM) authentication. The following sections give general information about how to access spaces. For specific information about creating and accessing a space, see the documentation for the respective application type of the space that you're creating. 

For information about viewing, stopping, or deleting your applications, instances, or spaces, see [Stop and delete your Studio running applications and spaces](studio-updated-running-stop.md).

**Topics**
+ [Launch spaces](studio-updated-spaces-access.md)
+ [Collaboration with shared spaces](domain-space.md)

# Launch spaces
<a name="studio-updated-spaces-access"></a>

The following sections give information about accessing spaces in a domain. Spaces can be accessed in one of the following ways:
+ from the Amazon SageMaker AI console
+ from Studio
+ using the AWS CLI

## Accessing spaces from the Amazon SageMaker AI console
<a name="studio-updated-spaces-access-console"></a>

**To access spaces from the Amazon SageMaker AI console**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Under **Admin configurations**, choose **Domains**.

1. From the list of domains, select the domain that contains the spaces.

1. On the **Domain details** page, select the **Space management** tab. For more information about managing spaces, see [Collaboration with shared spaces](domain-space.md).

1. From the list of spaces for that domain, select the space to launch.

1. Choose **Launch Studio** for the space that you want to launch.

## Accessing spaces from Studio
<a name="studio-updated-spaces-access-updated"></a>

Follow these steps to access spaces from Studio for a specific application type. 

**To access spaces from Studio**

1. Open Studio by following the steps in [Launch Amazon SageMaker Studio](studio-updated-launch.md). 

1. Select the application type with spaces that you want to access.

## Accessing spaces using the AWS CLI
<a name="studio-updated-spaces-access-cli"></a>

The following sections show how to access a space from the AWS Command Line Interface (AWS CLI). The procedures are for domains that use AWS Identity and Access Management (IAM) or AWS IAM Identity Center authentication. 

### IAM authentication
<a name="studio-updated-spaces-access-cli-iam"></a>

The following procedure outlines generally how to access a space using IAM authentication from the AWS CLI. 

1. Create a presigned domain URL specifying the name of the space that you want to access.

   ```
   aws \
       --region region \
       sagemaker \
       create-presigned-domain-url \
       --domain-id domain-id \
       --user-profile-name user-profile-name \
       --space-name space-name
   ```

1. Navigate to the URL. 
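The response contains the presigned URL in its `AuthorizedUrl` field. As a sketch, using the same placeholder values, you can extract the URL directly with a `--query` expression:

```
# Extract only the presigned URL from the response.
# region, domain-id, user-profile-name, and space-name are placeholders.
aws --region region \
    sagemaker create-presigned-domain-url \
    --domain-id domain-id \
    --user-profile-name user-profile-name \
    --space-name space-name \
    --query AuthorizedUrl --output text
```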

### Accessing a space in IAM Identity Center authentication
<a name="studio-updated-spaces-access-identity-center"></a>

The following procedure outlines how to access a space using IAM Identity Center authentication from the AWS CLI. 

1. Use the following command to return the URL associated with the space.

   ```
   aws \
       --region region \
       sagemaker \
       describe-space \
       --domain-id domain-id \
       --space-name space-name
   ```

1. Append the respective redirect parameter for the application type to the URL to be federated through IAM Identity Center. For more information about the redirect parameters, see [describe-space](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/describe-space.html). 

1. Navigate to the URL to be federated through IAM Identity Center. 

# Collaboration with shared spaces
<a name="domain-space"></a>

An Amazon SageMaker Studio Classic shared space consists of a shared JupyterServer application and a shared directory. A JupyterLab shared space consists of a shared JupyterLab application and a shared directory within Amazon SageMaker Studio. All user profiles in a domain have access to all shared spaces in the domain. Amazon SageMaker AI automatically scopes resources in a shared space to the context of the Amazon SageMaker Studio Classic application that you launch in that shared space. Resources in a shared space include notebooks, files, experiments, and models. Use shared spaces to collaborate with other users in real time using features like automatic tagging, real-time co-editing of notebooks, and customization. 

Shared spaces are available in:
+ Amazon SageMaker Studio Classic
+ JupyterLab

A Studio Classic shared space only supports Studio Classic and KernelGateway applications, and only supports the use of a JupyterLab 3 image Amazon Resource Name (ARN). For more information, see [JupyterLab Versioning in Amazon SageMaker Studio Classic](studio-jl.md).

 Amazon SageMaker AI automatically tags all SageMaker AI resources that you create within the scope of a shared space. You can use these tags to monitor costs and plan budgets using tools, such as AWS Budgets. 

A shared space uses the same VPC settings as the domain that it's created in. 

**Note**  
 Shared spaces do not support the use of Amazon SageMaker Data Wrangler or Amazon EMR cross-account clusters. 

 **Automatic tagging** 

 All resources created in a shared space are automatically tagged with a domain ARN tag and shared space ARN tag. The domain ARN tag is based on the domain ID, while the shared space ARN tag is based on the shared space name. 

 You can use these tags to monitor AWS CloudTrail usage. For more information, see [Log Amazon SageMaker API Calls with AWS CloudTrail](https://docs.aws.amazon.com/sagemaker/latest/dg/logging-using-cloudtrail.html). 

 You can also use these tags to monitor costs with AWS Billing and Cost Management. For more information, see [Using AWS cost allocation tags](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/cost-alloc-tags.html). 

 **Real-time co-editing of notebooks** 

 A key benefit of a shared space is that it facilitates real-time collaboration between members of the shared space. Users collaborating in a workspace get access to a shared Studio Classic application where they can access, read, and edit their notebooks in real time. Real-time collaboration is only supported for JupyterServer applications within a shared space. 

 Users with access to a shared space can simultaneously open, view, edit, and execute Jupyter notebooks in the shared Studio Classic or JupyterLab application in that space. 

The notebook indicates each co-editing user with a different cursor that shows the user profile name. While multiple users can view the same notebook, co-editing is best suited for small groups of two to five users.

To track changes made by multiple users, we strongly recommend using Studio Classic's built-in Git-based version control.

 **JupyterServer 2** 

To use shared spaces in Studio Classic, Jupyter Server version 2 is required. Certain JupyterLab extensions and packages can forcefully downgrade Jupyter Server to version 1, which prevents the use of shared spaces. Run the following commands from the command prompt to restore the version and continue using shared spaces.

```
conda activate studio
pip install jupyter-server==2.0.0rc3
```

 **Customize a shared space** 

To attach a lifecycle configuration or custom image to a shared space, you must use the AWS CLI. For more information about creating and attaching lifecycle configurations, see [Create and Associate a Lifecycle Configuration with Amazon SageMaker Studio Classic](studio-lcc-create.md). For more information about creating and attaching custom images, see [Custom Images in Amazon SageMaker Studio Classic](studio-byoi.md).

# Create a shared space
<a name="domain-space-create"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

 The following topic demonstrates how to create a shared space in an existing Amazon SageMaker AI domain. If you created your domain without support for shared spaces, you must add support for shared spaces to your existing domain before you can create a shared space. 

**Topics**
+ [Add shared space support to an existing domain](#domain-space-add)
+ [Create a shared space](#domain-space-create-app)

## Add shared space support to an existing domain
<a name="domain-space-add"></a>

 You can use the SageMaker AI console or the AWS CLI to add support for shared spaces to an existing domain. If the domain is using `VPC only` network access, then you can only add shared space support using the AWS CLI.

### Console
<a name="domain-space-add-console"></a>

 Complete the following procedure to add support for Studio Classic shared spaces to an existing domain from the SageMaker AI console. 

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**. 

1.  From the list of domains, select the domain that you want to open the **Domain settings** page for. 

1.  On the **Domain details** page, choose the **Domain settings** tab. 

1.  Choose **Edit**. 

1.  For **Space default execution role**, set an IAM role that is used by default for all shared spaces created in the domain. 

1.  Choose **Next**. 

1.  Choose **Next**. 

1.  Choose **Next**. 

1.  Choose **Submit**. 

### AWS CLI
<a name="domain-space-add-cli"></a>

------
#### [ Studio Classic ]

Run the following command from the terminal of your local machine to add default shared space settings to a domain from the AWS CLI. If you are adding default shared space settings to a domain within an Amazon VPC, you must also include a list of security groups. Studio Classic shared spaces only support the use of JupyterLab 3 image ARNs. For more information, see [JupyterLab Versioning in Amazon SageMaker Studio Classic](studio-jl.md).

```
# Public Internet domain
aws --region region \
sagemaker update-domain \
--domain-id domain-id \
--default-space-settings "ExecutionRole=execution-role-arn,JupyterServerAppSettings={DefaultResourceSpec={InstanceType=example-instance-type,SageMakerImageArn=sagemaker-image-arn}}"

# VPCOnly domain
aws --region region \
sagemaker update-domain \
--domain-id domain-id \
--default-space-settings "ExecutionRole=execution-role-arn,JupyterServerAppSettings={DefaultResourceSpec={InstanceType=system,SageMakerImageArn=sagemaker-image-arn}},SecurityGroups=[security-groups]"
```

Use the following command to verify that the default shared space settings have been updated. 

```
aws --region region \
sagemaker describe-domain \
--domain-id domain-id
```

------
#### [ JupyterLab ]

Run the following command from the terminal of your local machine to add default shared space settings to a domain from the AWS CLI. If you are adding default shared space settings to a domain within an Amazon VPC, you must also include a list of security groups. JupyterLab shared spaces only support the use of JupyterLab 4 image ARNs. For more information, see [SageMaker JupyterLab](studio-updated-jl.md).

```
# Public Internet domain
aws --region region \
sagemaker update-domain \
--domain-id domain-id \
--default-space-settings "ExecutionRole=execution-role-arn,JupyterLabAppSettings={DefaultResourceSpec={InstanceType=example-instance-type,SageMakerImageArn=sagemaker-image-arn}}"

# VPCOnly domain
aws --region region \
sagemaker update-domain \
--domain-id domain-id \
--default-space-settings "ExecutionRole=execution-role-arn,SecurityGroups=[security-groups]"
```

Use the following command to verify that the default shared space settings have been updated. 

```
aws --region region \
sagemaker describe-domain \
--domain-id domain-id
```

------

## Create a shared space
<a name="domain-space-create-app"></a>

The following sections demonstrate how to create a shared space from the Amazon SageMaker AI console, Amazon SageMaker Studio, or the AWS CLI.

### Create from Studio
<a name="domain-space-create-updated"></a>

Use the following procedures to create a shared space in a domain from Studio.

------
#### [ Studio Classic ]

1. Navigate to Studio following the steps in [Launch Amazon SageMaker Studio](studio-updated-launch.md).

1. From the Studio UI, find the applications pane on the left side.

1. From the applications pane, select **Studio Classic**.

1. Choose **Create Studio Classic space**.

1. In the pop-up window, enter a name for the space.

1. Choose **Create space**.

------
#### [ JupyterLab ]

1. Navigate to Studio following the steps in [Launch Amazon SageMaker Studio](studio-updated-launch.md).

1. From the Studio UI, find the applications pane on the left side.

1. From the applications pane, select **JupyterLab**.

1. Choose **Create JupyterLab space**.

1. In the pop-up window, enter a name for the space.

1. Choose **Create space**.

------

### Create from the console
<a name="domain-space-create-console"></a>

 Complete the following procedure to create a shared space in a domain from the SageMaker AI console. 

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**. 

1.  From the list of domains, select the domain that you want to create a shared space in. 

1.  On the **Domain details** page, choose the **Space management** tab. 

1.  Choose **Create**. 

1.  Enter a name for your shared space. Shared space names within a domain must be unique. The execution role for the shared space is set to the domain IAM execution role. 

### Create from AWS CLI
<a name="domain-space-create-cli"></a>

This section shows how to create a shared space from the AWS CLI. 

You cannot set the execution role of a shared space when you create or update it. The `DefaultDomainExecRole` can only be set when creating or updating the domain. Shared spaces only support the use of JupyterLab 3 image ARNs. For more information, see [JupyterLab Versioning in Amazon SageMaker Studio Classic](studio-jl.md).

To create a shared space from the AWS CLI, run one of the following commands from the terminal of your local machine.

------
#### [ Studio Classic ]

```
aws --region region \
sagemaker create-space \
--domain-id domain-id \
--space-name space-name \
--space-settings '{
  "JupyterServerAppSettings": {
    "DefaultResourceSpec": {
      "SageMakerImageArn": "sagemaker-image-arn",
      "InstanceType": "system"
    }
  }
}'
```

------
#### [ JupyterLab ]

```
aws --region region \
sagemaker create-space \
--domain-id domain-id \
--space-name space-name \
--ownership-settings "{\"OwnerUserProfileName\": \"user-profile-name\"}" \
--space-sharing-settings "{\"SharingType\": \"Shared\"}" \
--space-settings "{\"AppType\": \"JupyterLab\"}"
```

------
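The escaped JSON arguments in the JupyterLab command above can be error-prone to write by hand. The following Python sketch (using the same placeholder values) builds each argument string programmatically:

```python
import json

# Placeholder values, matching the CLI example above.
ownership_settings = {"OwnerUserProfileName": "user-profile-name"}
space_sharing_settings = {"SharingType": "Shared"}
space_settings = {"AppType": "JupyterLab"}

# Each flag is passed to the CLI as a JSON string; json.dumps handles the quoting.
args = {
    "--ownership-settings": json.dumps(ownership_settings),
    "--space-sharing-settings": json.dumps(space_sharing_settings),
    "--space-settings": json.dumps(space_settings),
}
for flag, value in args.items():
    print(flag, value)
```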

# Get information about shared spaces
<a name="domain-space-list"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

 This guide shows how to access a list of shared spaces in an Amazon SageMaker AI domain with the Amazon SageMaker AI console, Amazon SageMaker Studio, or the AWS CLI. It also shows how to view details of a shared space from the AWS CLI. 

**Topics**
+ [List shared spaces](#domain-space-list-spaces)
+ [View shared space details](#domain-space-describe)

## List shared spaces
<a name="domain-space-list-spaces"></a>

 The following topic describes how to view a list of shared spaces within a domain from Studio, the SageMaker AI console, or the AWS CLI. 

### List shared spaces from Studio
<a name="domain-space-list-updated"></a>

 Complete the following procedure to view a list of the shared spaces in a domain from Studio.

1. Navigate to Studio following the steps in [Launch Amazon SageMaker Studio](studio-updated-launch.md).

1. From the Studio UI, find the applications pane on the left side.

1. From the applications pane, select **Studio Classic** or **JupyterLab**. You can view the spaces that are being used to run the application type.

### List shared spaces from the console
<a name="domain-space-list-console"></a>

 Complete the following procedure to view a list of the shared spaces in a domain from the SageMaker AI console. 

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**. 

1.  From the list of domains, select the domain that you want to view the list of shared spaces for. 

1.  On the **Domain details** page, choose the **Space management** tab. 

### List shared spaces from the AWS CLI
<a name="domain-space-list-cli"></a>

 To list the shared spaces in a domain from the AWS CLI, run the following command from the terminal of your local machine.

```
aws --region region \
sagemaker list-spaces \
--domain-id domain-id
```

## View shared space details
<a name="domain-space-describe"></a>

 The following section describes how to view shared space details from the SageMaker AI console, Studio, or the AWS CLI. 

### View shared space details from Studio
<a name="domain-space-describe-updated"></a>

 Complete the following procedure to view the details of a shared space in a domain from Studio.

1. Navigate to Studio following the steps in [Launch Amazon SageMaker Studio](studio-updated-launch.md).

1. From the Studio UI, find the applications pane on the left side.

1. From the applications pane, select **Studio Classic** or **JupyterLab**. You can view the spaces that are running the application.

1. Select the name of the space that you want to view more details for.

### View shared space details from the console
<a name="domain-space-describe-console"></a>

 You can view the details of a shared space from the SageMaker AI console using the following procedure. 

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**. 

1.  From the list of domains, select the domain that you want to view shared space details for. 

1.  On the **Domain details** page, choose the **Space management** tab. 

1.  Select the name of the space to open a new page that lists details about the shared space. 

### View shared space details from the AWS CLI
<a name="domain-space-describe-cli"></a>

To view the details of a shared space from the AWS CLI, run the following command from the terminal of your local machine.

```
aws --region region \
sagemaker describe-space \
--domain-id domain-id \
--space-name space-name
```

# Edit a shared space
<a name="domain-space-edit"></a>

 You can only edit the details of an Amazon SageMaker Studio Classic or JupyterLab shared space using the AWS CLI. You can't edit the details of a shared space from the Amazon SageMaker AI console. You can only update a space's attributes when there are no running applications in the space. 

------
#### [ Studio Classic ]

To edit the details of a Studio Classic shared space from the AWS CLI, run the following command from the terminal of your local machine. Studio Classic shared spaces only support the use of JupyterLab 3 image ARNs. For more information, see [JupyterLab Versioning in Amazon SageMaker Studio Classic](studio-jl.md).

```
aws --region region \
sagemaker update-space \
--domain-id domain-id \
--space-name space-name \
--query SpaceArn --output text \
--space-settings '{
  "JupyterServerAppSettings": {
    "DefaultResourceSpec": {
      "SageMakerImageArn": "sagemaker-image-arn",
      "InstanceType": "system"
    }
  }
}'
```

------
#### [ JupyterLab ]

To edit the details of a JupyterLab shared space from the AWS CLI, run the following command from the terminal of your local machine. JupyterLab shared spaces only support the use of JupyterLab 4 image ARNs. For more information, see [SageMaker JupyterLab](studio-updated-jl.md).

```
aws --region region \
sagemaker update-space \
--domain-id domain-id \
--space-name space-name \
--space-settings '{
  "SpaceStorageSettings": {
    "EbsStorageSettings": {
      "EbsVolumeSizeInGb": 100
    }
  }
}'
```

------

# Delete a shared space
<a name="domain-space-delete"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

 The following topic shows how to delete an Amazon SageMaker Studio Classic shared space from the Amazon SageMaker AI console or AWS CLI. A shared space can only be deleted if it has no running applications. 

**Topics**
+ [Console](#domain-space-delete-console)
+ [AWS CLI](#domain-space-delete-cli)

## Console
<a name="domain-space-delete-console"></a>

 Complete the following procedure to delete a shared space in the Amazon SageMaker AI domain from the SageMaker AI console. 

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**. 

1.  From the list of domains, select the domain that contains the shared space that you want to delete. 

1.  On the **Domain details** page, choose the **Space management** tab. 

1.  Select the shared space that you want to delete. The shared space can't contain any applications, unless they are in a failed state. 

1.  Choose **Delete**. This opens a new window. 

1.  Choose **Yes, delete space**. 

1.  Enter *delete* in the field. 

1.  Choose **Delete space**. 

## AWS CLI
<a name="domain-space-delete-cli"></a>

To delete a shared space from the AWS CLI, run the following command from the terminal of your local machine.

```
aws --region region \
sagemaker delete-space \
--domain-id domain-id \
--space-name space-name
```

# Trusted identity propagation with Studio
<a name="trustedidentitypropagation"></a>

Trusted identity propagation is an AWS IAM Identity Center feature that administrators of connected AWS services can use to grant and audit access to service data. Access to this data is based on user attributes such as group associations. Setting up trusted identity propagation requires collaboration between the administrators of connected AWS services and the IAM Identity Center administrator. For more information, see [Prerequisites and considerations](https://docs.aws.amazon.com/singlesignon/latest/userguide/trustedidentitypropagation-overall-prerequisites.html).

The Amazon SageMaker Studio and IAM Identity Center administrators can collaborate to connect the services for trusted identity propagation. Trusted identity propagation addresses enterprise authentication needs across AWS services by providing:
+ Enhanced auditing that ties actions to specific users
+ Access management for data science and machine learning workloads through integration with compatible AWS services
+ Support for compliance requirements in regulated industries

Studio supports trusted identity propagation for audit purposes and access control with connected AWS services. Trusted identity propagation in Studio does not directly handle authentication or authorization decisions within Studio itself. Instead, it propagates identity context information to compatible services that can use this information for access control.

When you use trusted identity propagation with Studio, your IAM Identity Center identity propagates to connected AWS services, creating more granular permissions and security governance.

**Topics**
+ [Trusted identity propagation architecture and compatibility](trustedidentitypropagation-compatibility.md)
+ [Setting up trusted identity propagation for Studio](trustedidentitypropagation-setup.md)
+ [Monitoring and auditing with CloudTrail](trustedidentitypropagation-auditing.md)
+ [User background sessions](trustedidentitypropagation-user-background-sessions.md)
+ [How to connect with other AWS services with trusted identity propagation enabled](trustedidentitypropagation-connect-other.md)

# Trusted identity propagation architecture and compatibility
<a name="trustedidentitypropagation-compatibility"></a>

Trusted identity propagation integrates AWS IAM Identity Center with Amazon SageMaker Studio and other connected AWS services to propagate users' identity context across services. The following page summarizes the trusted identity propagation architecture and compatibility with SageMaker AI. For a comprehensive overview of how trusted identity propagation works across AWS, see [Trusted identity propagation overview](https://docs.aws.amazon.com/singlesignon/latest/userguide/trustedidentitypropagation-overview.html).

The key components of the trusted identity propagation architecture include:
+ **Trusted identity propagation**: A methodology for propagating a user's identity context between applications and services
+ **Identity context**: Information about a user
+ **Identity-enhanced IAM role session**: An IAM role session with an added identity context that carries a user identifier to the AWS services that it calls
+ **Connected AWS services**: Other AWS services that can recognize the identity context that is propagated through trusted identity propagation

Trusted identity propagation allows connected AWS services to make access decisions based on a user's identity. Within Studio itself, IAM roles are used as carriers of the identity context rather than for making access control decisions. The identity context is propagated to connected AWS services where it can be used for both access control and audit purposes. See [trusted identity propagation considerations](https://docs.aws.amazon.com/singlesignon/latest/userguide/trustedidentitypropagation-overall-prerequisites.html#trustedidentitypropagation-considerations) for more information.

When you enable trusted identity propagation with Studio and authenticate through IAM Identity Center, SageMaker AI:
+ Captures the user's identity context from IAM Identity Center
+ Creates an identity-enhanced IAM role session that includes the user's identity context
+ Passes the identity-enhanced IAM role session to compatible AWS services when the user accesses resources
+ Enables downstream AWS services to make access decisions and log activities based on the user identity
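
SageMaker AI performs these steps on your behalf; you do not call AWS STS yourself. Purely as a conceptual sketch, the identity-enhanced role session corresponds to an `sts:AssumeRole` request that carries the identity context in the `ProvidedContexts` parameter. The role ARN and context assertion below are placeholders.

```
# Conceptual sketch only: SageMaker AI builds a request like this internally.
# The role ARN and context assertion are placeholders, not real values.
def build_identity_enhanced_assume_role(role_arn: str, context_assertion: str) -> dict:
    return {
        "RoleArn": role_arn,
        "RoleSessionName": "studio-user-session",
        # ProvidedContexts is the sts:AssumeRole parameter that carries the
        # IAM Identity Center identity context to downstream services.
        "ProvidedContexts": [
            {
                "ProviderArn": "arn:aws:iam::aws:contextProvider/IdentityCenter",
                "ContextAssertion": context_assertion,
            }
        ],
    }

params = build_identity_enhanced_assume_role(
    "arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    "EXAMPLE-context-assertion",
)
```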

## Compatible SageMaker AI features
<a name="trustedidentitypropagation-compatibility-compatible-features"></a>

Trusted identity propagation works with the following Studio features:
+ [Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-launch.html) private spaces (JupyterLab and Code Editor, based on Code-OSS, Visual Studio Code - Open Source)

**Note**  
When Studio launches with trusted identity propagation enabled, it uses your identity context in addition to your execution role permissions. However, the following processes during instance setup will only use the execution role permissions, without the identity context: Lifecycle Configuration, Bring-Your-Own-Image, CloudWatch agent for user log forwarding.
[Remote access](https://docs.aws.amazon.com/sagemaker/latest/dg/remote-access.html) is not currently supported with trusted identity propagation.
When you use assume role operations within Studio notebooks, the assumed roles don't propagate trusted identity propagation context. Only the original execution role maintains the identity context.
+  [SageMaker Training](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html) 
+  [SageMaker Processing](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html) 
+  [SageMaker AI real-time hosting](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-options.html) 
+  [SageMaker Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-overview.html) 
+  [SageMaker real-time inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html) 
+  [SageMaker Asynchronous Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html) 
+  [Managed MLflow](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html) 

## Compatible AWS services
<a name="trustedidentitypropagation-compatibility-compatible-services"></a>

Trusted identity propagation for Amazon SageMaker Studio integrates with compatible AWS services where trusted identity propagation is enabled. See [use cases](https://docs.aws.amazon.com/singlesignon/latest/userguide/trustedidentitypropagation-integrations.html) for a comprehensive list with examples of how to enable trusted identity propagation. Compatible services include the following.
+  [Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/workgroups-identity-center.html) 
+  [Amazon EMR on EC2](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-idc-start.html) 
+  [EMR Serverless](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/security-iam-service-trusted-prop.html) 
+  [AWS Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/identity-center-integration.html) 
+  [Amazon Redshift Data API](https://docs.aws.amazon.com/redshift/latest/mgmt/data-api-trusted-identity-propagation.html) 
+ Amazon S3 (via [Amazon S3 Access Grants](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-grants-get-started.html))
+ [AWS Glue Connections](https://docs.aws.amazon.com/glue/latest/dg/security-trusted-identity-propagation.html)

When trusted identity propagation is enabled for SageMaker AI, it connects to the other AWS services that also have trusted identity propagation enabled. Once connected, those services recognize and use the user's identity context for access control and auditing.

## Supported AWS Regions
<a name="trustedidentitypropagation-compatibility-supported-regions"></a>

Studio supports trusted identity propagation where [IAM Identity Center is supported](https://docs.aws.amazon.com/singlesignon/latest/userguide/regions.html) and Studio with IAM Identity Center authentication is supported. Studio supports trusted identity propagation in the following AWS Regions:
+ af-south-1
+ ap-east-1
+ ap-northeast-1
+ ap-northeast-2
+ ap-northeast-3
+ ap-south-1
+ ap-southeast-1
+ ap-southeast-2
+ ap-southeast-3
+ ca-central-1
+ eu-central-1
+ eu-central-2
+ eu-north-1
+ eu-south-1
+ eu-west-1
+ eu-west-2
+ eu-west-3
+ il-central-1
+ me-south-1
+ sa-east-1
+ us-east-1
+ us-east-2
+ us-west-1
+ us-west-2

# Setting up trusted identity propagation for Studio
<a name="trustedidentitypropagation-setup"></a>

Setting up trusted identity propagation for Amazon SageMaker Studio requires your Amazon SageMaker AI domain to have the IAM Identity Center authentication method configured. This section guides you through the prerequisites and steps needed to enable and configure trusted identity propagation for your Studio users.

**Topics**
+ [

## Prerequisites
](#trustedidentitypropagation-setup-prerequisites)
+ [

## Enable trusted identity propagation for your Amazon SageMaker AI domain
](#trustedidentitypropagation-setup-enable)
+ [

## Configure your SageMaker AI execution role
](#trustedidentitypropagation-setup-permissions)

## Prerequisites
<a name="trustedidentitypropagation-setup-prerequisites"></a>

Before setting up trusted identity propagation for SageMaker AI, set up your IAM Identity Center using the following instructions.

**Note**  
Ensure that your IAM Identity Center instance and domain are in the same AWS Region.
+  [IAM Identity Center trusted identity propagation prerequisites](https://docs.aws.amazon.com/singlesignon/latest/userguide/trustedidentitypropagation-overall-prerequisites.html#trustedidentitypropagation-prerequisites) 
+  [Set up IAM Identity Center](https://docs.aws.amazon.com/singlesignon/latest/userguide/getting-started.html) 
+  [Add users to your IAM Identity Center directory](https://docs.aws.amazon.com/singlesignon/latest/userguide/addusers.html) 

## Enable trusted identity propagation for your Amazon SageMaker AI domain
<a name="trustedidentitypropagation-setup-enable"></a>

**Important**  
You can only enable trusted identity propagation for domains with the AWS IAM Identity Center authentication method configured.
Your IAM Identity Center and Amazon SageMaker AI domain must be in the same AWS Region.

Use one of the following options to learn how to enable trusted identity propagation for a new or existing domain.

------
#### [ New domain - console ]

**Enable trusted identity propagation for a new domain using the SageMaker AI console**

1. Open the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker).

1. Navigate to **Domains**.

1. [Create a custom domain](https://docs.aws.amazon.com/sagemaker/latest/dg/onboard-custom.html). The domain must have the **AWS IAM Identity Center** authentication method configured.

1. In the **Trusted identity propagation** section, choose **Enable the trusted identity propagation for all users on this domain**.

1. Complete the domain creation process.

------
#### [ Existing domain - console ]

**Enable trusted identity propagation for an existing domain using the SageMaker AI console**
**Note**  
For trusted identity propagation to work properly after it is enabled for an existing domain, users must restart their existing IAM Identity Center sessions. To do so, either:  
Users can log out and log back in to their existing IAM Identity Center sessions, or  
Administrators can [end active sessions for their workforce users](https://docs.aws.amazon.com/singlesignon/latest/userguide/end-active-sessions.html).

1. Open the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker).

1. Navigate to **Domains**.

1. Select your existing domain. The domain must have the **AWS IAM Identity Center** authentication method configured.

1. In the **Domain settings** tab, choose **Edit** in the **Authentication and permissions** section.

1. Choose **Enable the trusted identity propagation for all users on this domain**.

1. Complete the domain configuration.

------
#### [ Existing domain - AWS CLI ]

**Enable trusted identity propagation for an existing domain using the AWS CLI**

**Note**  
For trusted identity propagation to work properly after it is enabled for an existing domain, users must restart their existing IAM Identity Center sessions. To do so, either:  
Users can log out and log back in to their existing IAM Identity Center sessions, or  
Administrators can [end active sessions for their workforce users](https://docs.aws.amazon.com/singlesignon/latest/userguide/end-active-sessions.html).

```
aws sagemaker update-domain \
    --region $REGION \
    --domain-id $DOMAIN_ID \
    --domain-settings "TrustedIdentityPropagationSettings={Status=ENABLED}"
```
+ `DOMAIN_ID` is the Amazon SageMaker AI domain ID. See [View domains](https://docs.aws.amazon.com/sagemaker/latest/dg/domain-view.html) for more information.
+ `REGION` is the AWS Region of your Amazon SageMaker AI domain. You can find this at the top right of any AWS console page.
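
The same update can be issued from Python. The following hedged sketch builds the `UpdateDomain` request payload equivalent to the CLI command above, using the `DomainSettingsForUpdate` field from the SageMaker `UpdateDomain` API; the domain ID is a placeholder, and the boto3 call is left commented out so you can review the payload first.

```
# Builds the UpdateDomain payload equivalent to the CLI command above.
# The domain ID passed in below is a placeholder.
def build_tip_update_request(domain_id: str, status: str = "ENABLED") -> dict:
    return {
        "DomainId": domain_id,
        "DomainSettingsForUpdate": {
            "TrustedIdentityPropagationSettings": {"Status": status}
        },
    }

request = build_tip_update_request("d-xxxxxxxxxxxx")

# To apply the update (requires sagemaker:UpdateDomain permission):
# import boto3
# boto3.client("sagemaker", region_name="us-east-1").update_domain(**request)
```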

------

## Configure your SageMaker AI execution role
<a name="trustedidentitypropagation-setup-permissions"></a>

To enable trusted identity propagation for your Studio users, all roles used with trusted identity propagation need the following context permissions. Update the trust policy for each role to include the `sts:AssumeRole` and `sts:SetContext` actions. Use the following policy when you [update your role trust policy](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_update-role-trust-policy.html).

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "sagemaker.amazonaws.com"
                ]
            },
            "Action": [
                "sts:AssumeRole",
                "sts:SetContext"
            ]
        }
    ]
}
```

------
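
To apply the trust policy from the command line, you can use the IAM `update-assume-role-policy` command. The following sketch writes the policy above to a file and validates it; the role name is a placeholder, and the CLI call is left commented out so you can review the file first.

```
# Placeholder role name; replace with your SageMaker AI execution role.
ROLE_NAME=MySageMakerExecutionRole

# Save the trust policy shown above to a file.
cat > tip-trust-policy.json <<'EOF'
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": ["sagemaker.amazonaws.com"]
            },
            "Action": ["sts:AssumeRole", "sts:SetContext"]
        }
    ]
}
EOF

# Validate the JSON before applying it.
python3 -m json.tool tip-trust-policy.json > /dev/null && echo "trust policy OK"

# Apply the trust policy (requires iam:UpdateAssumeRolePolicy permission):
# aws iam update-assume-role-policy --role-name "$ROLE_NAME" \
#     --policy-document file://tip-trust-policy.json
```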

# Monitoring and auditing with CloudTrail
<a name="trustedidentitypropagation-auditing"></a>

With trusted identity propagation enabled, AWS CloudTrail logs include the identity information of the specific user who performed an action, rather than just the IAM role. This provides enhanced auditing capabilities for compliance and security.

To view identity information in CloudTrail logs:
+ Open the [CloudTrail console](https://console.aws.amazon.com/cloudtrail).
+ Choose **Event history** from the left navigation pane.
+ Choose events from SageMaker AI and related services.
+ Under the **Event record**, find the `onBehalfOf` key. It contains the `userId` key and other user identification information that can be mapped to a specific IAM Identity Center user.

  See [CloudTrail use cases for IAM Identity Center](https://docs.aws.amazon.com/singlesignon/latest/userguide/sso-cloudtrail-use-cases.html) for more information.
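
A small script can pull the propagated user out of an event record. The following sketch parses an illustrative event fragment; the field values are placeholders, and the `onBehalfOf` layout follows the description above.

```
import json

# Illustrative CloudTrail event fragment with placeholder values.
event_json = """
{
  "eventSource": "sagemaker.amazonaws.com",
  "userIdentity": {
    "type": "AssumedRole",
    "onBehalfOf": {
      "userId": "94482488-3041-7023-1111-EXAMPLE",
      "identityStoreArn": "arn:aws:identitystore::111122223333:identitystore/d-111122223333"
    }
  }
}
"""

event = json.loads(event_json)
# onBehalfOf is absent when trusted identity propagation is not enabled.
on_behalf_of = event["userIdentity"].get("onBehalfOf", {})
print(on_behalf_of.get("userId"))  # the IAM Identity Center user ID
```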

# User background sessions
<a name="trustedidentitypropagation-user-background-sessions"></a>

User background sessions continue even when the user is no longer active. They allow long-running jobs to keep running after the user has logged off, and can be enabled through trusted identity propagation in SageMaker AI. The following page explains the configuration options and behaviors for user background sessions.

**Note**  
Existing active user sessions are not impacted when trusted identity propagation is enabled. The default duration applies only to new user sessions or restarted sessions.
User background sessions apply to any long-running SageMaker AI workflows or jobs with persistent states. This includes, but is not limited to, any SageMaker AI resources that maintain execution status or require ongoing monitoring. For example, SageMaker Training, Processing, and Pipelines execution jobs.

**Topics**
+ [

## Configure user background session
](#configure-user-background-sessions)
+ [

## Default user background session duration
](#default-user-background-session-duration)
+ [

## Impact of disabling trusted identity propagation in Studio
](#user-background-session-impact-disable-trustedidentitypropagation-studio)
+ [

## Impact of disabling user background sessions in the IAM Identity Center console
](#user-background-session-impact-disable-trustedidentitypropagation-identity-center)
+ [

## Runtime considerations
](#user-background-session-runtime-considerations)

## Configure user background session
<a name="configure-user-background-sessions"></a>

Once trusted identity propagation for Amazon SageMaker Studio is enabled, default duration limits can be configured through the [user background sessions in the IAM Identity Center](https://docs.aws.amazon.com/singlesignon/latest/userguide/user-background-sessions.html).

## Default user background session duration
<a name="default-user-background-session-duration"></a>

By default, all user background sessions have a duration limit of 7 days. Administrators can [modify this duration in the IAM Identity Center console](https://docs.aws.amazon.com/singlesignon/latest/userguide/user-background-sessions.html). This setting applies at the IAM Identity Center instance level, affecting all supported IAM Identity Center applications and Studio domains within that instance.

When trusted identity propagation is enabled, administrators will find a banner in the SageMaker AI console with the following information:
+ The duration limit for user background sessions
+ A link to the IAM Identity Center console where administrators can change this configuration. The duration can be set to any value from 15 minutes up to 90 days.

An error message appears when a user background session has expired. You can use the link to the IAM Identity Center console to update the duration.

## Impact of disabling trusted identity propagation in Studio
<a name="user-background-session-impact-disable-trustedidentitypropagation-studio"></a>

If an administrator disables trusted identity propagation in the SageMaker AI console after initially enabling it:
+ Existing jobs continue to run without interruption when user background sessions are enabled.
+ When user background sessions are disabled, any long-running SageMaker AI workflows or jobs with persistent states will switch to using interactive sessions. This includes, but is not limited to, any SageMaker AI resources that maintain execution status or require ongoing monitoring. For example, Amazon SageMaker Training and Processing jobs.
+ Users can restart expired jobs from checkpoints.
+ New jobs run with IAM role credentials and do not propagate the identity context.

## Impact of disabling user background sessions in the IAM Identity Center console
<a name="user-background-session-impact-disable-trustedidentitypropagation-identity-center"></a>

When the user background session is **disabled** for the IAM Identity Center instance, the SageMaker AI job uses user interactive sessions. When using interactive sessions, a SageMaker AI job will fail within 15 minutes when:
+ The user logs out
+ The interactive session is revoked by the administrator

When the user background session is **enabled** for the IAM Identity Center instance, the SageMaker AI job uses user background sessions. When using background sessions, a SageMaker AI job will fail within 15 minutes when:
+ The user background session expires
+ The user background session is manually revoked by an administrator
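
The failure rules above can be sketched as a small decision function, assuming the 15-minute window described:

```
def job_fails_within_15_minutes(background_sessions_enabled: bool,
                                interactive_session_active: bool,
                                background_session_active: bool) -> bool:
    if background_sessions_enabled:
        # The job fails only if the background session expired or was revoked.
        return not background_session_active
    # Without background sessions, the job depends on the interactive session,
    # which ends when the user logs out or an administrator revokes it.
    return not interactive_session_active

# Background sessions enabled: the job survives the user logging out.
print(job_fails_within_15_minutes(True,
                                  interactive_session_active=False,
                                  background_session_active=True))  # False
```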

The following provides example behavior with SageMaker Training jobs. When an administrator enables trusted identity propagation but disables [user background sessions](https://docs.aws.amazon.com/singlesignon/latest/userguide/user-background-sessions.html) in the IAM Identity Center console:
+ If a user stays logged in, their Training jobs created while background sessions are disabled fall back to the interactive session.
+ If the user logs off, the session expires and Training jobs depending on the interactive session will fail.
+ Users can restart their Training job from the last checkpoint. The session duration is determined by what is set for the interactive session duration in the IAM Identity Center console.
+ If background sessions are disabled **after** a user starts a job, the job continues to use its existing background sessions. In other words, SageMaker AI will not create any new background sessions.

The same behavior applies if background sessions are enabled at the IAM Identity Center instance level but disabled specifically for the Studio application using [IAM Identity Center APIs](https://docs.aws.amazon.com/singlesignon/latest/APIReference/welcome.html).

## Runtime considerations
<a name="user-background-session-runtime-considerations"></a>

When an administrator sets a `MaxRuntimeInSeconds` for a long-running Training or Processing job that is lower than the user background session duration, SageMaker AI runs the job for the shorter of `MaxRuntimeInSeconds` and the user background session duration. For more information about `MaxRuntimeInSeconds`, see [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html#sagemaker-CreateTrainingJob-request-StoppingCondition). See [user background sessions in the IAM Identity Center](https://docs.aws.amazon.com/singlesignon/latest/userguide/user-background-sessions.html) for information on how to set the runtime.
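
In other words, the effective runtime is the smaller of the two limits. A minimal sketch:

```
DEFAULT_BACKGROUND_SESSION_SECONDS = 7 * 24 * 3600  # default 7-day limit

def effective_job_runtime_seconds(
        max_runtime_in_seconds: int,
        background_session_seconds: int = DEFAULT_BACKGROUND_SESSION_SECONDS) -> int:
    # The job stops at whichever limit is reached first.
    return min(max_runtime_in_seconds, background_session_seconds)

# A job configured for 10 days is capped by the default 7-day session:
print(effective_job_runtime_seconds(10 * 24 * 3600))  # 604800
```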

# How to connect with other AWS services with trusted identity propagation enabled
<a name="trustedidentitypropagation-connect-other"></a>

When trusted identity propagation is enabled for your Amazon SageMaker AI domain, domain users can connect to other trusted identity propagation enabled AWS services. Your identity context is automatically propagated to compatible services, allowing for fine-grained access control and improved auditing across your machine learning workflows. This integration eliminates the need for complex IAM role switching and provides a unified identity experience across AWS services. The following pages provide information on how to connect Amazon SageMaker Studio to other AWS services when trusted identity propagation is enabled.

**Topics**
+ [

# Connect Studio JupyterLab notebooks to Amazon S3 Access Grants with trusted identity propagation enabled
](trustedidentitypropagation-s3-access-grants.md)
+ [

# Connect Studio JupyterLab notebooks to Amazon EMR with trusted identity propagation enabled
](trustedidentitypropagation-emr-ec2.md)
+ [

# Connect your Studio JupyterLab notebooks to EMR Serverless with trusted identity propagation enabled
](trustedidentitypropagation-emr-serverless.md)
+ [

# Connect Studio JupyterLab notebooks to Redshift Data API with trusted identity propagation enabled
](trustedidentitypropagation-redshift-data-apis.md)
+ [

# Connect Studio JupyterLab notebooks to Lake Formation and Athena with trusted identity propagation enabled
](trustedidentitypropagation-lake-formation-athena.md)

# Connect Studio JupyterLab notebooks to Amazon S3 Access Grants with trusted identity propagation enabled
<a name="trustedidentitypropagation-s3-access-grants"></a>

You can use [Amazon S3 Access Grants](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-grants.html) to grant identity-based, fine-grained access to Amazon S3 locations. These grants give your corporate users and groups direct access to Amazon S3 buckets. The following pages provide information and instructions on how to use Amazon S3 Access Grants with trusted identity propagation for SageMaker AI.

## Prerequisites
<a name="s3-access-grants-prerequisites"></a>

To connect Studio to Amazon S3 Access Grants with trusted identity propagation enabled, ensure you have completed the following prerequisites:
+  [Setting up trusted identity propagation for Studio](trustedidentitypropagation-setup.md) 
+ Follow the [getting started with Amazon S3 Access Grants](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-grants-get-started.html) instructions to set up Amazon S3 Access Grants for your bucket. See [scaling data access with Amazon S3 Access Grants](https://aws.amazon.com/blogs/storage/scaling-data-access-with-amazon-s3-access-grants/) for more information.
**Note**  
Standard Amazon S3 APIs do not automatically work with Amazon S3 Access Grants. You must explicitly use Amazon S3 Access Grants APIs. See [Managing access with Amazon S3 Access Grants](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-grants.html) for more information.

**Topics**
+ [

## Prerequisites
](#s3-access-grants-prerequisites)
+ [

# Connect Amazon S3 Access Grants with Studio JupyterLab notebooks
](s3-access-grants-setup.md)
+ [

# Connect Studio JupyterLab notebooks to Amazon S3 Access Grants with Training and Processing jobs
](trustedidentitypropagation-s3-access-grants-jobs.md)

# Connect Amazon S3 Access Grants with Studio JupyterLab notebooks
<a name="s3-access-grants-setup"></a>

Use the following information to use Amazon S3 Access Grants from Studio JupyterLab notebooks.

After Amazon S3 Access Grants is set up, [add the following permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html) to your domain or user [execution role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html#sagemaker-roles-get-execution-role).
+ `us-east-1` is your AWS Region
+ `111122223333` is your AWS account ID
+ `S3-ACCESS-GRANT-ROLE` is your Amazon S3 Access Grant role

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowDataAccessAPI",
            "Effect": "Allow",
            "Action": [
                "s3:GetDataAccess"
            ],
            "Resource": [
                "arn:aws:s3:us-east-1:111122223333:access-grants/default"
            ]
        },
        {
            "Sid": "RequiredForTIP",
            "Effect": "Allow",
            "Action": "sts:SetContext",
            "Resource": "arn:aws:iam::111122223333:role/S3-ACCESS-GRANT-ROLE"
        }
    ]
}
```

------

Ensure that your Amazon S3 Access Grants role's trust policy allows the `sts:SetContext` and `sts:AssumeRole` actions. The following is an example policy for when you [update your role trust policy](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_update-role-trust-policy.html).

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "access-grants.s3.amazonaws.com"
                ]
            },
            "Action": [
                "sts:AssumeRole",
                "sts:SetContext"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "111122223333",
                    "aws:SourceArn": "arn:aws:s3:us-east-1:111122223333:access-grants/default"
                }
            }
        }
    ]
}
```

------

## Use Amazon S3 Access Grants to call Amazon S3
<a name="s3-access-grants-python-example"></a>

The following is an example Python script showing how to use Amazon S3 Access Grants to call Amazon S3. It assumes you have already set up trusted identity propagation with SageMaker AI.

```
import boto3


def get_access_grant_credentials(account_id: str, target: str,
                                 permission: str = 'READ'):
    # Exchange the caller's identity context for temporary,
    # grant-scoped credentials.
    s3control = boto3.client('s3control')
    response = s3control.get_data_access(
        AccountId=account_id,
        Target=target,
        Permission=permission
    )
    return response['Credentials']


def create_s3_client_from_credentials(credentials):
    # Build an S3 client from the temporary credentials
    # returned by GetDataAccess.
    return boto3.client(
        's3',
        aws_access_key_id=credentials['AccessKeyId'],
        aws_secret_access_key=credentials['SecretAccessKey'],
        aws_session_token=credentials['SessionToken']
    )


# Create a client scoped to the granted Amazon S3 location
credentials = get_access_grant_credentials(
    '111122223333', "s3://tip-enabled-bucket/tip-enabled-path/")
s3 = create_s3_client_from_credentials(credentials)

s3.list_objects(Bucket="tip-enabled-bucket", Prefix="tip-enabled-path/")
```

If you use a path to an Amazon S3 bucket where Amazon S3 Access Grants is not enabled, the call fails.

For other programming languages, see [Managing access with Amazon S3 Access Grants](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-grants.html) for more information.

# Connect Studio JupyterLab notebooks to Amazon S3 Access Grants with Training and Processing jobs
<a name="trustedidentitypropagation-s3-access-grants-jobs"></a>

Use the following information to use Amazon S3 Access Grants to access data in Amazon SageMaker Training and Processing jobs.

When a user with trusted identity propagation enabled launches a SageMaker Training or Processing job that needs to access Amazon S3 data:
+ SageMaker AI calls Amazon S3 Access Grants to get temporary credentials based on the user's identity
+ If successful, these temporary credentials are used to access the Amazon S3 data
+ If unsuccessful, SageMaker AI falls back to using the IAM role credentials

**Note**  
To enforce that all permissions are granted through Amazon S3 Access Grants, remove the related Amazon S3 access permissions from your execution role and attach them to your corresponding [Amazon S3 Access Grant](https://docs.aws.amazon.com/singlesignon/latest/userguide/tip-tutorial-s3.html#tip-tutorial-s3-create-grant).

**Topics**
+ [

## Considerations
](#s3-access-grants-jobs-considerations)
+ [

## Set up Amazon S3 Access Grants with Training and Processing jobs
](#s3-access-grants-jobs-setup)

## Considerations
<a name="s3-access-grants-jobs-considerations"></a>

Amazon S3 Access Grants cannot be used with [Pipe mode](https://docs.aws.amazon.com/sagemaker/latest/dg/augmented-manifest-stream.html) for Amazon S3 input in either SageMaker Training or Processing.

When trusted identity propagation is enabled, you cannot launch a SageMaker Training job with the following features:
+ Remote Debug
+ Debugger
+ Profiler

When trusted identity propagation is enabled, you cannot launch a Processing job with the following feature:
+ DatasetDefinition

## Set up Amazon S3 Access Grants with Training and Processing jobs
<a name="s3-access-grants-jobs-setup"></a>

After Amazon S3 Access Grants is set up, [add the following permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html) to your domain or user [execution role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html#sagemaker-roles-get-execution-role).
+ `us-east-1` is your AWS Region
+ `111122223333` is your AWS account ID
+ `S3-ACCESS-GRANT-ROLE` is your Amazon S3 Access Grant role

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowDataAccessAPI",
            "Effect": "Allow",
            "Action": [
                "s3:GetDataAccess",
                "s3:GetAccessGrantsInstanceForPrefix"
            ],
            "Resource": [
                "arn:aws:s3:us-east-1:111122223333:access-grants/default"
            ]
        },
        {
            "Sid": "RequiredForIdentificationPropagation",
            "Effect": "Allow",
            "Action": "sts:SetContext",
            "Resource": "arn:aws:iam::111122223333:role/S3-ACCESS-GRANT-ROLE"
        }
    ]
}
```

------

# Connect Studio JupyterLab notebooks to Amazon EMR with trusted identity propagation enabled
<a name="trustedidentitypropagation-emr-ec2"></a>

Connecting Amazon SageMaker Studio JupyterLab notebooks to Amazon EMR clusters enables you to leverage the distributed computing power of Amazon EMR for large-scale data processing and analytics workloads. With trusted identity propagation enabled, your identity context is propagated to Amazon EMR, allowing for fine-grained access control and comprehensive audit trails. The following page provides instructions on how to connect your Studio notebook to Amazon EMR clusters. Once set up, you can use the `Connect to Cluster` option in your Studio notebook.

To connect Studio to Amazon EMR with trusted identity propagation enabled, ensure you have completed the following setups:
+  [Setting up trusted identity propagation for Studio](trustedidentitypropagation-setup.md) 
+  [Getting started with AWS IAM Identity Center integration for Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-idc-start.html) 
+  [Enable communications between Studio and Amazon EMR clusters](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-emr-cluster.html) 

 **Connect to the Amazon EMR cluster** 

For a full list of options on how to connect your JupyterLab notebook to Amazon EMR, see [Connect to an Amazon EMR cluster](https://docs.aws.amazon.com/sagemaker/latest/dg/connect-emr-clusters.html).

# Connect your Studio JupyterLab notebooks to EMR Serverless with trusted identity propagation enabled
<a name="trustedidentitypropagation-emr-serverless"></a>

Amazon EMR Serverless provides a serverless option for running Apache Spark and Apache Hive applications without managing clusters. When integrated with trusted identity propagation, EMR Serverless automatically scales compute resources while maintaining your identity context for access control and auditing. This approach eliminates the operational overhead of cluster management while preserving the security benefits of identity-based access control. The following section provides information on how to connect your trusted identity propagation enabled Studio with EMR Serverless.

To connect Studio to Amazon EMR Serverless with trusted identity propagation enabled, ensure that you have completed the following setup tasks:
+  [Setting up trusted identity propagation for Studio](trustedidentitypropagation-setup.md) 
+  [Trusted identity propagation with EMR Serverless](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/security-iam-service-trusted-prop.html) 
+  [Enable communications between Studio and EMR Serverless](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-emr-serverless.html) 

 **Connect to the EMR Serverless application** 

For a full list of options on how to connect your JupyterLab notebook to EMR Serverless, see [Connect to an EMR Serverless application](https://docs.aws.amazon.com/sagemaker/latest/dg/connect-emr-serverless-application.html).

# Connect Studio JupyterLab notebooks to Redshift Data API with trusted identity propagation enabled
<a name="trustedidentitypropagation-redshift-data-apis"></a>

Amazon Redshift Data API enables you to interact with your Amazon Redshift clusters programmatically without managing persistent connections. When combined with trusted identity propagation, the Redshift Data API provides secure, identity-based access to your data warehouse, allowing you to run SQL queries and retrieve results while maintaining full audit trails of user activities. This integration is particularly valuable for data science workflows that require access to structured data stored in Redshift. The following page includes information and instructions on how to connect Amazon SageMaker Studio to the Redshift Data API with trusted identity propagation enabled.

To connect Studio to Redshift Data API with trusted identity propagation enabled, ensure that you have completed the following setup tasks:
+  [Setting up trusted identity propagation for Studio](trustedidentitypropagation-setup.md) 
+  [Using Redshift Data API with trusted identity propagation](https://docs.aws.amazon.com/redshift/latest/mgmt/data-api-trusted-identity-propagation.html) 
  + Ensure your execution role has relevant permissions for Redshift Data API. See [authorizing access](https://docs.aws.amazon.com/redshift/latest/mgmt/data-api-access.html) for more information.
+  [Simplify access management with Amazon Redshift and AWS Lake Formation for users in an External Identity Provider](https://aws.amazon.com/blogs/big-data/simplify-access-management-with-amazon-redshift-and-aws-lake-formation-for-users-in-an-external-identity-provider/) 
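
After these setup tasks are complete, you can call the Redshift Data API from a Studio notebook. The following is a minimal sketch using the AWS SDK for Python (Boto3); the database name, SQL statement, and workgroup name are placeholders that you would replace with your own values, and the live call requires AWS credentials with Redshift Data API permissions.

```python
def build_statement(database: str, sql: str, workgroup: str) -> dict:
    """Assemble the keyword arguments for the Redshift Data API ExecuteStatement call."""
    return {"Database": database, "Sql": sql, "WorkgroupName": workgroup}


def run_query(database: str, sql: str, workgroup: str) -> str:
    # Import here so the module loads even where Boto3 isn't installed;
    # the call itself requires valid AWS credentials.
    import boto3

    client = boto3.client("redshift-data")
    response = client.execute_statement(**build_statement(database, sql, workgroup))
    return response["Id"]  # statement ID, used to poll for and fetch results
```

For a provisioned cluster rather than a serverless workgroup, you would pass `ClusterIdentifier` instead of `WorkgroupName`.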

# Connect Studio JupyterLab notebooks to Lake Formation and Athena with trusted identity propagation enabled
<a name="trustedidentitypropagation-lake-formation-athena"></a>

AWS Lake Formation and Amazon Athena work together to provide a comprehensive data lake solution with fine-grained access control and serverless query capabilities. Lake Formation centralizes permissions management for your data lake, while Athena provides interactive query services. When integrated with trusted identity propagation, this combination enables data scientists to access only the data they're authorized to see, with all queries and data access automatically logged for compliance and auditing purposes. The following page provides information and instructions on how to connect Amazon SageMaker Studio to Lake Formation and Athena with trusted identity propagation enabled.

To connect Studio to Lake Formation and Athena with trusted identity propagation enabled, ensure that you have completed the following setup tasks:
+  [Setting up trusted identity propagation for Studio](trustedidentitypropagation-setup.md) 
+  [Create a Lake Formation role](https://docs.aws.amazon.com/lake-formation/latest/dg/prerequisites-identity-center.html) 
+  [Connect Lake Formation with IAM Identity Center](https://docs.aws.amazon.com/lake-formation/latest/dg/connect-lf-identity-center.html) 
+ Create Lake Formation resources:
  +  [Database](https://docs.aws.amazon.com/lake-formation/latest/dg/creating-database.html) 
  +  [Tables](https://docs.aws.amazon.com/lake-formation/latest/dg/creating-tables.html) 
+  [Create Athena workgroup](https://docs.aws.amazon.com/athena/latest/ug/creating-workgroups.html) 
  + Choose **AthenaSQL** for the engine
  + Choose **IAM Identity Center** for authentication method
  + Create a new service role
    + Ensure that the IAM Identity Center users have access to the query result location using Amazon S3 Access Grants
+  [Granting database permissions using the named resource method](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-database-permissions.html) 
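
After the workgroup is configured, queries run through the standard Athena API, and your IAM Identity Center identity governs which Lake Formation tables a query can read. The following is a minimal Boto3 sketch; the SQL statement and workgroup name are placeholders, and the live call requires AWS credentials.

```python
def build_query(sql: str, workgroup: str) -> dict:
    """Parameters for Athena StartQueryExecution. The workgroup must use
    the IAM Identity Center authentication method."""
    return {"QueryString": sql, "WorkGroup": workgroup}


def start_query(sql: str, workgroup: str) -> str:
    import boto3  # run this from your Studio notebook with credentials configured

    athena = boto3.client("athena")
    return athena.start_query_execution(**build_query(sql, workgroup))["QueryExecutionId"]
```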

# Perform common UI tasks
<a name="studio-updated-common"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the updated Studio experience. For information about using the Studio Classic application, see [Amazon SageMaker Studio Classic](studio.md).

 The following sections describe how to perform common tasks in the Amazon SageMaker Studio UI. For an overview of the Studio user interface, see [Amazon SageMaker Studio UI overview](studio-updated-ui.md). 

 **Set cookie preferences** 

1. Launch Studio following the steps in [Launch Amazon SageMaker Studio](studio-updated-launch.md). 

1.  At the bottom of the Studio user interface, choose **Cookie Preferences**. 

1.  Select the check box for each type of cookie that you want Amazon SageMaker AI to use. 

1.  Choose **Save preferences**. 

 **Manage notifications** 

Notifications give information about important changes to Studio, updates to applications, and issues to resolve. 

1. Launch Studio following the steps in [Launch Amazon SageMaker Studio](studio-updated-launch.md). 

1.  On the top navigation bar, choose the **Notifications** icon (![\[Logo for Notifications, a cloud service with a stylized bell icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/monarch/notification.png)). 

1.  From the list of notifications, select the notification to get information about it. 

 **Leave feedback** 

 We take your feedback seriously and encourage you to share it. 

 At the top navigation of Studio, choose **Provide feedback**. 

 **Sign out** 

 Signing out of the Studio UI is different from closing the browser window. Signing out clears session data from the browser and deletes unsaved changes. 

The same behavior occurs when the Studio session times out, which happens after 5 minutes. 

1. Launch Studio following the steps in [Launch Amazon SageMaker Studio](studio-updated-launch.md). 

1. Choose the **User options** icon (![\[User icon with a circular avatar placeholder and a downward-pointing arrow.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/monarch/user-settings.png)). 

1.  Choose **Sign out**. 

1. In the pop-up window, choose **Sign out**. 

# NVMe stores with Amazon SageMaker Studio
<a name="studio-updated-nvme"></a>

Amazon SageMaker Studio applications and their associated notebooks run on Amazon Elastic Compute Cloud (Amazon EC2) instances. Some Amazon EC2 instance types, such as the `ml.m5d` instance family, offer non-volatile memory express (NVMe) solid state drive (SSD) instance stores. NVMe instance stores are local ephemeral disk stores that are physically connected to an instance for fast temporary storage. Studio applications support NVMe instance stores on the instance types that offer them. For more information about instance types and their associated NVMe store volumes, see the [Amazon Elastic Compute Cloud Instance Type Details](https://aws.amazon.com/ec2/instance-types/). This topic provides information about accessing and using NVMe instance stores, as well as considerations when using them with Studio.

## Considerations
<a name="studio-updated-nvme-considerations"></a>

The following considerations apply when using NVMe instance stores with Studio.
+ An NVMe instance store is temporary storage. The data stored on the NVMe store is deleted when the instance is terminated, stopped, or hibernated. When using NVMe stores with Studio applications, the data on the NVMe instance store is lost whenever the application is deleted, restarted, or patched. We recommend that you back up valuable data to persistent storage solutions, such as Amazon Elastic Block Store, Amazon Elastic File System, or Amazon Simple Storage Service. 
+ Studio patches instances periodically to install new security updates. When an instance is patched, the instance is restarted. This restart results in the deletion of data stored in the NVMe instance store. We recommend that you frequently back up necessary data from the NVMe instance store to persistent storage solutions, such as Amazon Elastic Block Store, Amazon Elastic File System, or Amazon Simple Storage Service. 
+ The following Studio applications support using NVMe storage:
  + JupyterLab
  + Code Editor, based on Code-OSS, Visual Studio Code - Open Source
  + KernelGateway

## Access NVMe instance stores
<a name="studio-updated-nvme-access"></a>

When you select an instance type with attached NVMe instance stores to host a Studio application, the NVMe instance store directory is mounted to the application container at the following location:

```
/mnt/sagemaker-nvme
```

If an instance has more than one NVMe instance store attached to it, Studio creates a striped logical volume that spans all of the attached local disks. Studio then mounts this striped logical volume to the `/mnt/sagemaker-nvme` directory. As a result, the directory storage size is the sum of all NVMe instance store volume sizes attached to the instance. 

If the `/mnt/sagemaker-nvme` directory does not exist, verify that the instance type hosting your application has an attached NVMe instance store volume.
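
From a notebook or terminal, you can verify that the NVMe store is mounted and check its capacity. The following is a small sketch using only the Python standard library; the hypothetical `nvme_capacity` helper is for illustration.

```python
import os
import shutil


def nvme_capacity(mount_point: str = "/mnt/sagemaker-nvme"):
    """Return (total, free) bytes for the NVMe store, or None if it isn't mounted."""
    if not os.path.isdir(mount_point):
        return None  # the instance type likely has no NVMe instance store
    usage = shutil.disk_usage(mount_point)
    return usage.total, usage.free
```

On an instance with multiple NVMe volumes, the total reflects the striped logical volume that spans all of them.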

# Local mode support in Amazon SageMaker Studio
<a name="studio-updated-local"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

Amazon SageMaker Studio applications support the use of local mode to create estimators, processors, and pipelines, then deploy them to a local environment. With local mode, you can test machine learning scripts before running them in Amazon SageMaker AI managed training or hosting environments. Studio supports local mode in the following applications:
+ Amazon SageMaker Studio Classic
+ JupyterLab
+ Code Editor, based on Code-OSS, Visual Studio Code - Open Source

Local mode in Studio applications is invoked using the SageMaker Python SDK. In Studio applications, local mode functions similarly to the way it does in Amazon SageMaker notebook instances, with some differences. With the [Rootless Docker configuration](studio-updated-local-get-started.md#studio-updated-local-rootless) enabled, you can also access additional Docker registries through your VPC configuration, including on-premises repositories and public registries. For more information about using local mode with the SageMaker Python SDK, see [Local Mode](https://sagemaker.readthedocs.io/en/stable/overview.html#local-mode).

**Note**  
Studio applications do not support multi-container jobs in local mode. Local mode jobs are limited to a single instance for training, inference, and processing jobs. When creating a local mode job, the instance count configuration must be `1`. 
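
In the SageMaker Python SDK, local mode is selected by passing `local` as the instance type, and per the note above the instance count must be `1`. The following is a minimal sketch; the image URI and role ARN are placeholders, and the `local_mode_args` helper is hypothetical.

```python
def local_mode_args(image_uri: str, role: str) -> dict:
    """Estimator arguments for local mode; instance_count must be 1."""
    return {
        "image_uri": image_uri,
        "role": role,
        "instance_count": 1,       # multi-instance local mode jobs aren't supported
        "instance_type": "local",  # use "local_gpu" on GPU instances
    }


def train_locally(image_uri: str, role: str, inputs):
    from sagemaker.estimator import Estimator  # SageMaker Python SDK

    estimator = Estimator(**local_mode_args(image_uri, role))
    estimator.fit(inputs)  # runs the training container on the Studio instance
    return estimator
```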

## Docker support
<a name="studio-updated-local-docker"></a>

As part of local mode support, Studio applications support limited Docker access capabilities. With this support, you can interact with the Docker API from Jupyter notebooks or from the image terminal of the application, using one of the following:
+ [Docker CLI](https://docs.docker.com/engine/reference/run/)
+ [Docker Compose CLI ](https://docs.docker.com/compose/reference/)
+ Language specific Docker SDK clients

This Docker access is subject to the following restrictions:
+ Usage of Docker networks is not supported.
+ Docker [volume](https://docs.docker.com/storage/volumes/) usage is not supported during container run. Only volume bind mount inputs are allowed during container orchestration. The volume bind mount inputs must be located on the Amazon Elastic File System (Amazon EFS) volume for Studio Classic. For JupyterLab and Code Editor applications, it must be located on the Amazon Elastic Block Store (Amazon EBS) volume.
+ Container inspect operations are allowed.
+ Container port to host mapping is not allowed. However, you can specify a port for hosting. The endpoint is then accessible from Studio using the following URL:

  ```
  http://localhost:port
  ```

### Docker operations supported
<a name="studio-updated-local-docker-supported"></a>

The following table lists all of the Docker API endpoints that are supported in Studio, including any support limitations. If an API endpoint is missing from the table, Studio doesn't support it.


|  API Documentation  |  Limitations  | 
| --- | --- | 
|  [SystemAuth](https://docs.docker.com/engine/api/v1.43/#tag/System/operation/SystemAuth)  |   | 
|  [SystemEvents](https://docs.docker.com/engine/api/v1.43/#tag/System/operation/SystemEvents)  |   | 
|  [SystemVersion](https://docs.docker.com/engine/api/v1.43/#tag/System/operation/SystemVersion)  |   | 
|  [SystemPing](https://docs.docker.com/engine/api/v1.43/#tag/System/operation/SystemPing)  |   | 
|  [SystemPingHead](https://docs.docker.com/engine/api/v1.43/#tag/System/operation/SystemPingHead)  |   | 
|  [ContainerCreate](https://docs.docker.com/engine/api/v1.43/#tag/Container/operation/ContainerCreate)  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-local.html)  | 
|  [ContainerStart](https://docs.docker.com/engine/api/v1.43/#tag/Container/operation/ContainerStart)  |   | 
|  [ContainerStop](https://docs.docker.com/engine/api/v1.43/#tag/Container/operation/ContainerStop)  |   | 
|  [ContainerKill](https://docs.docker.com/engine/api/v1.43/#tag/Container/operation/ContainerKill)  |   | 
|  [ContainerDelete](https://docs.docker.com/engine/api/v1.43/#tag/Container/operation/ContainerDelete)  |   | 
|  [ContainerList](https://docs.docker.com/engine/api/v1.43/#tag/Container/operation/ContainerList)  |   | 
|  [ContainerLogs](https://docs.docker.com/engine/api/v1.43/#tag/Container/operation/ContainerLogs)  |   | 
|  [ContainerInspect](https://docs.docker.com/engine/api/v1.43/#tag/Container/operation/ContainerInspect)  |   | 
|  [ContainerWait](https://docs.docker.com/engine/api/v1.43/#tag/Container/operation/ContainerWait)  |   | 
|  [ContainerAttach](https://docs.docker.com/engine/api/v1.43/#tag/Container/operation/ContainerAttach)  |   | 
|  [ContainerPrune](https://docs.docker.com/engine/api/v1.43/#tag/Container/operation/ContainerPrune)  |   | 
|  [ContainerResize](https://docs.docker.com/engine/api/v1.43/#tag/Container/operation/ContainerResize)  |   | 
|  [ImageCreate](https://docs.docker.com/engine/api/v1.43/#tag/Image/operation/ImageCreate)  |  VPC-only mode support is limited to Amazon ECR images in allowlisted accounts. With the [Rootless Docker configuration](studio-updated-local-get-started.md#studio-updated-local-rootless) enabled, you can also access additional Docker registries through your VPC configuration, including on-premises repositories and public registries. | 
|  [ImagePrune](https://docs.docker.com/engine/api/v1.43/#tag/Image/operation/ImagePrune)  |   | 
|  [ImagePush](https://docs.docker.com/engine/api/v1.43/#tag/Image/operation/ImagePush)  |  VPC-only mode support is limited to Amazon ECR images in allowlisted accounts. With the [Rootless Docker configuration](studio-updated-local-get-started.md#studio-updated-local-rootless) enabled, you can also access additional Docker registries through your VPC configuration, including on-premises repositories and public registries. | 
|  [ImageList](https://docs.docker.com/engine/api/v1.43/#tag/Image/operation/ImageList)  |   | 
|  [ImageInspect](https://docs.docker.com/engine/api/v1.43/#tag/Image/operation/ImageInspect)  |   | 
|  [ImageGet](https://docs.docker.com/engine/api/v1.43/#tag/Image/operation/ImageGet)  |   | 
|  [ImageDelete](https://docs.docker.com/engine/api/v1.43/#tag/Image/operation/ImageDelete)  |   | 
|  [ImageBuild](https://docs.docker.com/engine/api/v1.43/#tag/Image/operation/ImageBuild)  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-local.html)  | 

**Topics**
+ [Docker support](#studio-updated-local-docker)
+ [Getting started with local mode](studio-updated-local-get-started.md)

# Getting started with local mode
<a name="studio-updated-local-get-started"></a>

The following sections outline the steps needed to get started with local mode in Amazon SageMaker Studio, including:
+ Completing prerequisites
+ Setting `EnableDockerAccess`
+ Docker installation

## Prerequisites
<a name="studio-updated-local-prereq"></a>

Complete the following prerequisites to use local mode in Studio applications:
+ To pull images from an Amazon Elastic Container Registry repository, the account hosting the Amazon ECR image must provide access permission for the user’s execution role. The domain’s execution role must also allow Amazon ECR access.
+ Verify that you are using the latest version of the SageMaker Python SDK by running the following command: 

  ```
  pip install -U sagemaker
  ```
+ To use local mode and Docker capabilities, set the following parameter of the domain’s `DockerSettings` using the AWS Command Line Interface (AWS CLI): 

  ```
  EnableDockerAccess : ENABLED
  ```
+ Using `EnableDockerAccess`, you can also control whether users in the domain can use local mode. By default, local mode and Docker capabilities aren't allowed in Studio applications. For more information, see [Setting `EnableDockerAccess`](#studio-updated-local-enable).
+ Install the Docker CLI in the Studio application by following the steps in [Docker installation](#studio-updated-local-docker-installation).
+ For the [Rootless Docker configuration](#studio-updated-local-rootless), ensure your VPC has appropriate endpoints and routing configured for your desired Docker registries.

## Setting `EnableDockerAccess`
<a name="studio-updated-local-enable"></a>

The following sections show how to set `EnableDockerAccess` when the domain has public internet access or is in `VPC-only` mode.

**Note**  
Changes to `EnableDockerAccess` only apply to applications created after the domain is updated. You must create a new application after updating the domain.

**Public internet access**

The following example commands show how to set `EnableDockerAccess` when creating a new domain or updating an existing domain with public internet access:

```
# create new domain
aws --region region \
    sagemaker create-domain --domain-name domain-name \
    --vpc-id vpc-id \
    --subnet-ids subnet-ids \
    --auth-mode IAM \
    --default-user-settings "ExecutionRole=execution-role" \
    --domain-settings '{"DockerSettings": {"EnableDockerAccess": "ENABLED"}}' \
    --query DomainArn \
    --output text

# update domain
aws --region region \
    sagemaker update-domain --domain-id domain-id \
    --domain-settings-for-update '{"DockerSettings": {"EnableDockerAccess": "ENABLED"}}'
```

**`VPC-only` mode**

When using a domain in `VPC-only` mode, Docker image push and pull requests are routed through the service VPC instead of the VPC configured by the customer. Because of this, administrators can configure a list of trusted AWS accounts to which users can make Amazon ECR Docker pull and push requests.

If a Docker image push or pull request is made to an AWS account that is not in the list of trusted AWS accounts, the request fails. Docker pull and push operations outside of Amazon Elastic Container Registry (Amazon ECR) aren't supported in `VPC-only` mode.

The following AWS accounts are trusted by default:
+ The account hosting the SageMaker AI domain.
+ SageMaker AI accounts that host the following SageMaker images:
  + DLC framework images
  + Sklearn, Spark, XGBoost processing images

To configure a list of additional trusted AWS accounts, specify the `VpcOnlyTrustedAccounts` value as follows:

```
aws --region region \
    sagemaker update-domain --domain-id domain-id \
    --domain-settings-for-update '{"DockerSettings": {"EnableDockerAccess": "ENABLED", "VpcOnlyTrustedAccounts": ["account-list"]}}'
```

**Note**  
When the [Rootless Docker configuration](#studio-updated-local-rootless) is enabled, `VpcOnlyTrustedAccounts` is ignored and Docker traffic routes through your VPC configuration, allowing access to any registry your VPC can reach.

## Rootless Docker configuration
<a name="studio-updated-local-rootless"></a>

When the `RootlessDocker` setting in [DockerSettings](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DockerSettings.html) is enabled, Studio uses a [rootless Docker daemon](https://docs.docker.com/engine/security/rootless/) that routes traffic through your VPC. This provides enhanced security and allows access to additional Docker registries. With `RootlessDocker` enabled, your VPC configuration determines which registries are accessible for Docker operations: `VpcOnlyTrustedAccounts` is ignored and Docker traffic routes through your VPC configuration.

To use rootless Docker, set both `EnableDockerAccess` and `RootlessDocker` to `ENABLED` in your `DockerSettings`. For example, in the [Setting `EnableDockerAccess`](#studio-updated-local-enable) examples above, you can modify your domain settings to include:

```
'{"DockerSettings": {"EnableDockerAccess": "ENABLED", "RootlessDocker": "ENABLED"}}'
```
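
The same update can be made programmatically. The following Boto3 sketch assumes your credentials can call `UpdateDomain`; the domain ID is a placeholder, and the `rootless_docker_settings` helper is for illustration.

```python
def rootless_docker_settings() -> dict:
    """DomainSettingsForUpdate payload enabling Docker access with rootless Docker."""
    return {
        "DockerSettings": {
            "EnableDockerAccess": "ENABLED",
            "RootlessDocker": "ENABLED",
        }
    }


def enable_rootless_docker(domain_id: str) -> None:
    import boto3  # the call requires AWS credentials with sagemaker:UpdateDomain

    boto3.client("sagemaker").update_domain(
        DomainId=domain_id,
        DomainSettingsForUpdate=rootless_docker_settings(),
    )
```

Remember that changes to `DockerSettings` only apply to applications created after the domain is updated.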

## Docker installation
<a name="studio-updated-local-docker-installation"></a>

To use Docker, you must manually install Docker from the terminal of your Studio application. The steps to install Docker differ depending on whether the domain has internet access.

### Internet access
<a name="studio-updated-local-docker-installation-internet"></a>

If the domain is created with public internet access or in `VPC-only` mode with limited internet access, use the following steps to install Docker.

1. (Optional) If your domain is created in `VPC-only` mode with limited internet access, create a public NAT gateway with access to the Docker website. For instructions, see [NAT gateways](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html).

1. Navigate to the terminal of the Studio application that you want to install Docker in.

1. To return the operating system of the application, run the following command from the terminal:

   ```
   cat /etc/os-release
   ```

1. Install Docker following the instructions for the operating system of the application in the [Amazon SageMaker AI Local Mode Examples repository](https://github.com/aws-samples/amazon-sagemaker-local-mode/tree/main/sagemaker_studio_docker_cli_install).

   For example, install Docker on Ubuntu following the script at [sagemaker-ubuntu-focal-docker-cli-install.sh](https://github.com/aws-samples/amazon-sagemaker-local-mode/blob/main/sagemaker_studio_docker_cli_install/sagemaker-ubuntu-focal-docker-cli-install.sh) with the following considerations:
   + If chained commands fail, run commands one at a time.
   + Studio only supports Docker version `20.10.X` and Docker Engine API version `1.41`.
   + The following packages aren't required to use the Docker CLI in Studio and their installation can be skipped:
     + `containerd.io`
     + `docker-ce`
     + `docker-buildx-plugin`
**Note**  
You do not need to start the Docker service in your applications. The instance that hosts the Studio application runs the Docker service by default. All Docker API calls are routed through the Docker service automatically.

1. Use the exposed Docker socket for Docker interactions within Studio applications. By default, the following socket is exposed:

   ```
   unix:///docker/proxy.sock
   ```

   The following Studio application environment variable for the default `USER` uses this exposed socket:

   ```
   DOCKER_HOST
   ```
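
For example, the Docker SDK for Python can be pointed at this socket. The following is a minimal sketch; it assumes the `docker` Python package is installed alongside the Docker CLI, and the `studio_docker_host` helper is hypothetical.

```python
import os


def studio_docker_host() -> str:
    """Resolve the Docker endpoint, preferring DOCKER_HOST if it is set."""
    return os.environ.get("DOCKER_HOST", "unix:///docker/proxy.sock")


def docker_client():
    import docker  # Docker SDK for Python (pip install docker)

    return docker.DockerClient(base_url=studio_docker_host())
```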

### No internet access
<a name="studio-updated-local-docker-installation-no-internet"></a>

If the domain is created in `VPC-only` mode with no internet access, use the following steps to install Docker.

1. Navigate to the terminal of the Studio application that you want to install Docker in.

1. Run the following command from the terminal to return the operating system of the application:

   ```
   cat /etc/os-release
   ```

1. Download the required Docker `.deb` files to your local machine. For instructions about downloading the required files for the operating system of the Studio application, see [Install Docker Engine](https://docs.docker.com/engine/install/).

   For example, install Docker from a package on Ubuntu following the steps 1–4 in [Install from a package](https://docs.docker.com/engine/install/ubuntu/#install-from-a-package) with the following considerations:
   + Install Docker from a package. Using other methods to install Docker will fail.
   + Install the latest packages corresponding to Docker version `20.10.X`.
   + The following packages aren't required to use the Docker CLI in Studio and their installation can be skipped:
     + `containerd.io`
     + `docker-ce`
     + `docker-buildx-plugin`
**Note**  
You do not need to start the Docker service in your applications. The instance that hosts the Studio application runs the Docker service by default. All Docker API calls are routed through the Docker service automatically.

1. Upload the `.deb` files to the Amazon EFS file system or to the Amazon EBS file system of the application.

1. Manually install the `docker-ce-cli` and `docker-compose-plugin` `.deb` packages from the Studio application terminal. For more information and instructions, see step 5 in [Install from a package](https://docs.docker.com/engine/install/ubuntu/#install-from-a-package) on the Docker docs website.

1. Use the exposed Docker socket for Docker interactions within Studio applications. By default, the following socket is exposed:

   ```
   unix:///docker/proxy.sock
   ```

   The following Studio application environment variable for the default `USER` uses this exposed socket:

   ```
   DOCKER_HOST
   ```

# View your Studio running instances, applications, and spaces
<a name="studio-updated-running"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the updated Studio experience. For information about using the Studio Classic application, see [Amazon SageMaker Studio Classic](studio.md).

The following topics include information and instructions about how to view your Studio running instances, applications, and spaces. For more information about Studio spaces, see [Amazon SageMaker Studio spaces](studio-updated-spaces.md).

## View your Studio running instances and applications
<a name="studio-updated-running-view-app"></a>

The **Running instances** page gives information about all running application instances that were created in Amazon SageMaker Studio by the user or were shared with the user. 

You can view and stop running instances for all of your applications and spaces. If an instance is stopped, it does not appear on this page. Stopped instances can be viewed from the landing page for their respective application types. 

You can view a list of running applications and their details in Studio.

**To view running instances**

1. Launch Studio following the steps in [Launch Amazon SageMaker Studio](studio-updated-launch.md). 

1. On the left navigation pane, choose **Running instances**. 

1. From the **Running instances** page, you can view a list of running applications and details about those applications. 

   To view non-running instances, from the left navigation pane, choose the relevant application under **Applications**. Non-running applications have the **Stopped** status under the **Status** column.

## View your Studio spaces
<a name="studio-updated-running-view-space"></a>

The **Spaces** section within your **Domain details** page gives information about Studio spaces within your domain. You can view, create, and delete spaces on this page. 

The spaces that you can view in the **Spaces** section are running spaces for the following:
+ JupyterLab private space. For information about JupyterLab, see [SageMaker JupyterLab](studio-updated-jl.md).
+ Code Editor private space. For information about Code Editor, based on Code-OSS, Visual Studio Code - Open Source, see [Code Editor in Amazon SageMaker Studio](code-editor.md).
+ Studio Classic shared space. For information about Studio Classic shared space, see [Collaboration with shared spaces](domain-space.md).

There are no spaces for SageMaker Canvas, Studio Classic (private), or RStudio. 

**View your Studio spaces in a domain**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. From the left navigation pane, expand **Admin configurations** and choose **Domains**.

1. Choose the domain where you want to view the spaces.

1. On the **Domain details** page, choose the **Space management** tab to open the **Spaces** section.

**View your Studio spaces using the AWS CLI**

Use the following command to list all spaces in your domain:

```
aws sagemaker list-spaces --region us-east-1 --domain-id domain-id
```
+ `us-east-1` is your AWS Region.
+ *domain-id* is your domain ID. See [View domains](domain-view.md) for more information.
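
The equivalent call is available in Boto3. The sketch below separates the response handling into a pure helper so it can be shown without a live call; the `list_space_names` function itself requires AWS credentials with `sagemaker:ListSpaces` permission.

```python
def space_names(pages) -> list:
    """Flatten SpaceName values out of ListSpaces response pages."""
    return [s["SpaceName"] for page in pages for s in page.get("Spaces", [])]


def list_space_names(domain_id: str) -> list:
    import boto3  # requires AWS credentials

    sm = boto3.client("sagemaker")
    pages, kwargs = [], {"DomainIdEquals": domain_id}
    while True:  # follow NextToken until all pages are collected
        page = sm.list_spaces(**kwargs)
        pages.append(page)
        token = page.get("NextToken")
        if not token:
            return space_names(pages)
        kwargs["NextToken"] = token
```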

# Stop and delete your Studio running applications and spaces
<a name="studio-updated-running-stop"></a>

The following page includes information and instructions on how to stop and delete unused Amazon SageMaker Studio resources to avoid unwanted additional costs. For the Studio resources that you no longer wish to use, you must do both of the following:
+ Stop the application: This stops the application and deletes the instance that the application is running on. After you stop an application, you can start it again.
+ Delete the space: This deletes the Amazon EBS volume that was created for the application and instance.
**Important**  
If you delete the space, you will lose access to the data within that space. Do not delete the space unless you're sure that you want to.

For more information about the differences between Studio spaces and applications, see [View your Studio running instances, applications, and spaces](studio-updated-running.md).

**Topics**
+ [Stop your Amazon SageMaker Studio application](#studio-updated-running-stop-app)
+ [Delete a Studio space](#studio-updated-running-stop-space)

## Stop your Amazon SageMaker Studio application
<a name="studio-updated-running-stop-app"></a>

To avoid additional charges from unused running applications, you must stop them. The following includes information on what stopping an application does and how to do it.
+ The following instructions use the [DeleteApp](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteApp.html) API to stop the application. This also stops the instance that the application is running on.
+ After you stop an application, you can start up the application again later.
  + When you stop an application, the files in the space persist. You can run the application again and expect to have access to the same files stored in the space, as you did before stopping the application.
  + When you stop an application, the *metadata* for the application is deleted within 24 hours. For more information, see the note in the `CreationTime` response element for the [DescribeApp](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeApp.html#sagemaker-DescribeApp-response-CreationTime) API.

**Note**  
If the service detects that an application is unhealthy, it assumes the [AmazonSageMakerNotebooksServiceRolePolicy](security-iam-awsmanpol-notebooks.md#security-iam-awsmanpol-AmazonSageMakerNotebooksServiceRolePolicy) service-linked role and deletes the application using the [DeleteApp](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteApp.html) API.

The following tabs provide instructions to stop an application from your domain using the Studio UI, the SageMaker AI console, or the AWS CLI.

**Note**  
To view and stop all of your Studio running instances in one location, we recommend the [Stop applications using the Studio UI](#studio-updated-running-stop-app-using-studio-updated-ui) workflow from the following options.

### Stop applications using the Studio UI
<a name="studio-updated-running-stop-app-using-studio-updated-ui"></a>

To stop your Studio applications using the Studio UI, use the following instructions.

**To stop your applications (Studio UI)**

1. Launch Studio. This process may differ depending on your setup. For information about launching Studio, see [Launch Amazon SageMaker Studio](studio-updated-launch.md). 

1. From the left navigation pane, choose **Running instances**. 

   If the table on the page is empty, you don't have any running instances or applications in your spaces.

1. In the table under the **Name** and **Application** columns, find the space name and the application that you want to stop.

1. Choose the corresponding **Stop** button to stop the application.

### Stop applications using the SageMaker AI console
<a name="studio-updated-running-stop-app-using-sagemaker-console"></a>

To view or stop Studio running instances from a centralized location, see [Stop applications using the Studio UI](#studio-updated-running-stop-app-using-studio-updated-ui). Otherwise, use the following instructions.

In the SageMaker AI console, you can only stop the running Studio applications for the spaces that you are able to view in the **Spaces** section of the console. For a list of the viewable spaces, see [View your Studio spaces](studio-updated-running.md#studio-updated-running-view-space).

These steps show how to stop your Studio applications by using the SageMaker AI console.

**To stop your applications (SageMaker AI console)**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. From the left navigation pane, expand **Admin configurations** and choose **Domains**. 

1. Choose the domain that contains the application that you want to stop.

1. On the **Domain details** page, choose the **Space management** tab.

1. 
**Important**  
In the **Space management** tab, you have the option to delete the space. There is a difference between deleting the space and deleting an application. If you delete the space, you will lose access to the data within that space. Do not delete the space unless you're sure that you want to.

   To stop the application, on the **Space management** tab, under the **Name** column, choose the space for the application.

1. In the **Apps** section and under the **App type** column, search for the app to stop.

1. Under the **Action** column, choose the corresponding **Delete app** button.

1. In the pop-up box, choose **Yes, delete app**. After you do so, the delete input field becomes available.

1. Enter **delete** in the delete input field to confirm deletion.

1. Choose **Delete**.

### Stop your domain applications using the AWS CLI
<a name="studio-updated-running-stop-app-using-cli"></a>

To view or stop any of your Studio running instances from a centralized location, see [Stop applications using the Studio UI](#studio-updated-running-stop-app-using-studio-updated-ui). Otherwise, use the following instructions.

The following code examples use the [DeleteApp](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteApp.html) API to stop an application in an example domain. 

To stop your running **JupyterLab** or **Code Editor** instances, use the following code example:

```
aws sagemaker delete-app \
--domain-id example-domain-id \
--region aws-region \
--app-name default \
--app-type example-app-type \
--space-name example-space-name
```
+ To obtain your `example-domain-id`, use the following instructions:

**To get `example-domain-id`**

  1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

  1. From the left navigation pane, expand **Admin configurations** and choose **Domains**. 

  1. Choose the relevant domain.

  1. On the **Domain details** page, choose the **Domain settings** tab.

  1. Copy the **Domain ID**.
+ To obtain your `aws-region` value, use the following instructions to ensure that you are using the correct AWS Region for your domain: 

**To get `aws-region`**

  1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

  1. From the left navigation pane, expand **Admin configurations** and choose **Domains**. 

  1. Choose the relevant domain.

  1. On the **Domain details** page, verify that this is the relevant domain.

  1. Expand the Region dropdown list at the top right of the SageMaker AI console, and use the AWS Region ID that appears to the right of your AWS Region name. For example, `us-west-1`.
+ For `example-app-type`, use the application type that's relevant to the application that you want to stop. For example, replace `example-app-type` with one of the following application types:
  + JupyterLab application type: `JupyterLab`. For information about JupyterLab, see [SageMaker JupyterLab](studio-updated-jl.md).
  + Code Editor application type: `CodeEditor`. For information about Code Editor, based on Code-OSS, Visual Studio Code - Open Source, see [Code Editor in Amazon SageMaker Studio](code-editor.md).
+ To obtain your `example-space-name`, use the following steps: 

**To get `example-space-name`**

  1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

  1. From the left navigation pane, expand **Admin configurations** and choose **Domains**. 

  1. Choose the relevant domain.

  1. On the **Domain details** page, choose the **Space management** tab.

  1. Copy the relevant space name.
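Putting these values together, a filled-in command might look like the following. The domain ID, Region, and space name here are hypothetical placeholders; substitute the values you collected above.

```
# Stop a JupyterLab application (domain ID, Region, and space name are hypothetical)
aws sagemaker delete-app \
  --domain-id d-xxxxxxxxxxxx \
  --region us-west-2 \
  --app-name default \
  --app-type JupyterLab \
  --space-name my-jupyterlab-space
```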

To stop running instances for **SageMaker Canvas**, **Studio Classic**, or **RStudio**, use the following code example:

```
aws sagemaker delete-app \
--domain-id example-domain-id \
--region aws-region \
--app-name default \
--app-type example-app-type \
--user-profile-name example-user-name
```
+ For `example-app-type`, use the application type relevant to the application that you want to stop. For example, replace `example-app-type` with one of the following application types:
  + SageMaker Canvas application type: `Canvas`. For information about SageMaker Canvas, see [Amazon SageMaker Canvas](canvas.md).
  + Studio Classic application type: `JupyterServer`. For information about Studio Classic, see [Amazon SageMaker Studio Classic](studio.md).
  + RStudio application type: `RStudioServerPro`. For information about RStudio, see [RStudio on Amazon SageMaker AI](rstudio.md).
+ To obtain your `example-user-name`, navigate to the **Domain details** page. 
  + Next, choose the **User profiles** tab, and copy the relevant user profile name.

For alternative instructions to stop your running Studio applications, see: 
+ JupyterLab: [Delete unused resources](studio-updated-jl-admin-guide-clean-up.md).
+ Code Editor: [Shut down Code Editor resources](code-editor-use-log-out.md).
+ SageMaker Canvas: [Logging out of Amazon SageMaker Canvas](canvas-log-out.md).
+ Studio Classic: [Shut Down and Update Amazon SageMaker Studio Classic and Apps](studio-tasks-update.md).
+ RStudio: [Shut down RStudio](rstudio-shutdown.md).
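Before following any of these instructions, it can help to see every running application in a domain at once. One way to do this is with the `list-apps` command and a `--query` filter, as in the following sketch (the domain ID and Region are hypothetical):

```
# Show all InService applications in a domain (domain ID and Region are hypothetical)
aws sagemaker list-apps \
  --region us-west-2 \
  --domain-id-equals d-xxxxxxxxxxxx \
  --query "Apps[?Status=='InService'].[AppType,AppName,SpaceName,UserProfileName]" \
  --output table
```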

## Delete a Studio space
<a name="studio-updated-running-stop-space"></a>

**Important**  
After you delete your space, you will lose all of the data stored in the space. We recommend that you back up your data before deleting your space.

To delete a Studio space, you need administrator permissions, or at minimum permissions to update the domain, IAM, and Amazon S3.
+ Spaces are used to manage the storage and resource needs of the relevant application. When you delete a space, the storage volume is also deleted, so you lose access to the files stored in that space. For more information about Studio spaces, see [Amazon SageMaker Studio spaces](studio-updated-spaces.md).

  We recommend that you back up your data before you delete a space.
+ After you delete a space, you can't access that space again.

You can delete the Studio spaces that are viewable in the **Spaces** section of the console. For a list of the viewable spaces, see [View your Studio spaces](studio-updated-running.md#studio-updated-running-view-space). 

There are no spaces for SageMaker Canvas, Studio Classic (private), and RStudio. To stop and delete your SageMaker Canvas, Studio Classic (private), or RStudio applications, see [Stop your Amazon SageMaker Studio application](#studio-updated-running-stop-app).

### Delete a space using the SageMaker AI console
<a name="studio-updated-running-stop-space-using-sagemaker-console"></a>

The **Spaces** section within your **Domain details** page gives information about Studio spaces within your domain. You can view, create, and delete spaces on this page. 

**To view Studio spaces in a domain**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. From the left navigation pane, expand **Admin configurations** and choose **Domains**. 

1. Choose the domain where you want to view the spaces.

1. On the **Domain details** page, choose the **Space management** tab to open the **Spaces** section.

1. Select the space to delete.

1. Choose **Delete**. 

1. In the pop-up box titled **Delete space**, you have two options: 
   + If you already shut down all applications in the space, choose **Yes, delete space**.
   + If you still have applications running in the space, choose **Yes, shut down all apps and delete space**.

1. Enter **delete** in the delete input field to confirm deletion.

1. To delete the space, you have two options:
   + If you already shut down all applications in the space, choose **Delete space**.
   + If you still have applications running in the space, choose **Shut down all apps and delete space**.

### Delete a space using the AWS CLI
<a name="studio-updated-running-stop-space-using-cli"></a>

Before you can delete a space using the AWS CLI, you must delete the application associated with it. For information about stopping your Studio applications, see [Stop your Amazon SageMaker Studio application](#studio-updated-running-stop-app).

Use the following AWS CLI command to delete a space within a domain:

```
aws sagemaker delete-space \
--domain-id example-domain-id \
--region aws-region \
--space-name example-space-name
```
+ To obtain your `example-domain-id`, use the following instructions:

**To get `example-domain-id`**

  1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

  1. From the left navigation pane, expand **Admin configurations** and choose **Domains**. 

  1. Choose the relevant domain.

  1. On the **Domain details** page, choose the **Domain settings** tab.

  1. Copy the **Domain ID**.
+ To obtain your `aws-region` value, use the following instructions to ensure that you are using the correct AWS Region for your domain: 

**To get `aws-region`**

  1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

  1. From the left navigation pane, expand **Admin configurations** and choose **Domains**. 

  1. Choose the relevant domain.

  1. On the **Domain details** page, verify that this is the relevant domain.

  1. Expand the Region dropdown list at the top right of the SageMaker AI console, and use the AWS Region ID that appears to the right of your AWS Region name. For example, `us-west-1`.
+ To obtain your `example-space-name`, use the following steps: 

**To get `example-space-name`**

  1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

  1. From the left navigation pane, expand **Admin configurations** and choose **Domains**. 

  1. Choose the relevant domain.

  1. On the **Domain details** page, choose the **Space management** tab.

  1. Copy the relevant space name.
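With the values filled in, the command and a follow-up check might look like the following sketch (all identifiers are hypothetical). Once deletion completes, describing the deleted space should fail because it no longer exists.

```
# Delete a space (domain ID, Region, and space name are hypothetical)
aws sagemaker delete-space \
  --domain-id d-xxxxxxxxxxxx \
  --region us-west-2 \
  --space-name my-space

# Once deletion completes, this call should return an error
# because the space no longer exists
aws sagemaker describe-space \
  --domain-id d-xxxxxxxxxxxx \
  --region us-west-2 \
  --space-name my-space
```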

# SageMaker Studio image support policy
<a name="sagemaker-distribution"></a>

**Important**  
Currently, all packages in SageMaker Distribution images are licensed for use with Amazon SageMaker AI and do not require additional commercial licenses. However, this might be subject to change in the future, and we recommend reviewing the licensing terms regularly for any updates.

Amazon SageMaker Distribution is a set of Docker images available on SageMaker Studio that include popular frameworks for machine learning, data science, and visualization.

The images include deep learning frameworks like PyTorch, TensorFlow, and Keras; popular Python packages like numpy, scikit-learn, and pandas; and IDEs like JupyterLab and Code Editor, based on Code-OSS, Visual Studio Code - Open Source. The distribution contains the latest versions of all of these packages such that they are mutually compatible.

This page details the support policy and availability for SageMaker Distribution Images on SageMaker Studio.

## Versioning, release cadence, and support policy
<a name="sm-distribution-versioning"></a>

The table below outlines the release schedule for SageMaker Distribution Image versions and their planned support. AWS provides ongoing functionality and security updates for supported image versions. New minor versions are released for major versions for 12 months, and supported minor versions receive ongoing functionality and security patches. In some cases, an image version may need to be designated end of support earlier than originally planned if (a) security issues cannot be addressed while maintaining semantic versioning guidelines or (b) any of our major dependencies, like Python, reach end-of-life. AWS releases ad-hoc major or minor versions on an as-needed basis.


| Version | Description | Release cadence | Planned support | 
| --- | --- | --- | --- | 
| Major | Amazon SageMaker Distribution's major version releases involve upgrading all of its core dependencies to the latest compatible versions. These major releases may also add or remove packages as part of the update. Major versions are denoted by the first number in the version string, such as 1.0, 2.0, or 3.0. | 6 months | 18 months | 
| Minor | Amazon SageMaker Distribution's minor version releases include upgrading all of its core dependencies to the latest compatible minor versions within the same major version. SageMaker Distribution can add new packages during a minor version release. Minor versions are denoted by the second number in the version string, for example, 1.1, 1.2, or 2.1. | 1 month | 6 months | 
| Patch | Amazon SageMaker Distribution's patch version releases include updating all of its core dependencies to the latest compatible patch versions within the same minor version. SageMaker Distribution does not add or remove any packages during a patch version release. Patch versions are denoted by the third number in the version string, for example, 1.1.1, 1.2.1, or 2.1.3. Since patch versions are generally released for fixing security vulnerabilities, we recommend always upgrading to the newest patch version when they become available. | As necessary for fixing security vulnerabilities | Until new patch version is released | 

Each major version of the Amazon SageMaker Distribution is available for 18 months. During the first 12 months, new minor versions are released monthly. For the remaining 6 months, the existing minor versions continue to be supported.

## Supported image versions
<a name="sagemaker-distribution-supported-packages"></a>

The tables below list the supported SageMaker Distribution image versions, their planned end of support dates, and their availability on SageMaker Studio. For image versions whose support ends sooner than the planned end of support date, the versions continue to be available on Studio until the designated availability date. You can continue using the image to launch applications for up to 90 days or until the availability date on Studio, whichever comes first. For more information about such cases, contact AWS Support.

Migrate to a newer supported version as soon as possible to ensure that you receive ongoing functionality and security updates. When choosing an image version in SageMaker Studio, we recommend that you choose a supported image version from the tables below.

### Supported major versions
<a name="sm-distribution-major-versions"></a>

The following table lists the supported SageMaker Distribution major image versions.


| Image version | Last minor version release | Supported until | Description | 
| --- | --- | --- | --- | 
| 1.x.x | Apr 30th, 2025 | Oct 30th, 2025 | SageMaker Distribution major version 1 is built with Python 3.10. | 
| 2.x.x | Aug 25th, 2025 | Feb 25th, 2026 | SageMaker Distribution major version 2 is built with Python 3.11. | 
|  3.x.x  | Mar 29th, 2026 | Sep 29th, 2026 |  SageMaker Distribution major version 3 is built with Python 3.12.  | 

### CPU image minor versions
<a name="sm-distribution-cpu-versions"></a>

The following table lists the supported SageMaker Distribution minor image versions for CPUs.


| Image version | Amazon ECR image URI | Planned end of support date | Availability on Studio until | Release notes | 
| --- | --- | --- | --- | --- | 
| 3.1.x | public.ecr.aws/sagemaker/sagemaker-distribution:3.1-cpu | Nov 19th, 2025 | Nov 19th, 2025 | [Release notes](https://github.com/aws/sagemaker-distribution/tree/main/build_artifacts/v3/v3.1) | 
| 3.0.x  | public.ecr.aws/sagemaker/sagemaker-distribution:3.0-cpu  | Jun 30th, 2025 | Sep 29th, 2025 | [Release notes](https://github.com/aws/sagemaker-distribution/tree/main/build_artifacts/v3/v3.0) | 
| 2.6.x | public.ecr.aws/sagemaker/sagemaker-distribution:2.6-cpu | Jun 30th, 2025 | Oct 28th, 2025 | [Release notes](https://github.com/aws/sagemaker-distribution/tree/main/build_artifacts/v2/v2.6) | 
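Because these images are public, you can pull one locally to inspect its contents. The following sketch assumes Docker is installed and the tag is still available:

```
# Pull a supported CPU image and check its Python version (requires Docker)
docker pull public.ecr.aws/sagemaker/sagemaker-distribution:3.1-cpu
docker run --rm public.ecr.aws/sagemaker/sagemaker-distribution:3.1-cpu python --version
```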

### GPU image minor versions
<a name="sm-distribution-gpu-versions"></a>

The following table lists the supported SageMaker Distribution minor image versions for GPUs.


| Image version | Amazon ECR image URI | Planned end of support date | Availability on Studio until | Release notes for newest patch | 
| --- | --- | --- | --- | --- | 
| 3.1.x | public.ecr.aws/sagemaker/sagemaker-distribution:3.1-gpu | Nov 19th, 2025 | Nov 19th, 2025 | [Release notes](https://github.com/aws/sagemaker-distribution/tree/main/build_artifacts/v3/v3.1) | 
| 3.0.x | public.ecr.aws/sagemaker/sagemaker-distribution:3.0-gpu | Jun 30th, 2025 | Sep 29th, 2025 | [Release notes](https://github.com/aws/sagemaker-distribution/tree/main/build_artifacts/v3/v3.0) | 
| 2.6.x | public.ecr.aws/sagemaker/sagemaker-distribution:2.6-gpu | Jun 30th, 2025 | Oct 28th, 2025 | [Release notes](https://github.com/aws/sagemaker-distribution/tree/main/build_artifacts/v2/v2.6) | 

### Unsupported images
<a name="sm-distribution-unsupported-images"></a>

The following table lists unsupported SageMaker Distribution image versions.


| Image version | Amazon ECR image URI | End of support date | Availability on Studio until | 
| --- | --- | --- | --- | 
| 2.4.x |  public.ecr.aws/sagemaker/sagemaker-distribution:2.4-cpu public.ecr.aws/sagemaker/sagemaker-distribution:2.4-gpu  | Sep 7th, 2025 | Sep 7th, 2025 | 
| 2.3.x |  public.ecr.aws/sagemaker/sagemaker-distribution:2.3-cpu public.ecr.aws/sagemaker/sagemaker-distribution:2.3-gpu  | July 27th, 2025 | July 27th, 2025 | 
| 2.2.x |  public.ecr.aws/sagemaker/sagemaker-distribution:2.2-cpu public.ecr.aws/sagemaker/sagemaker-distribution:2.2-gpu  | May 15th, 2025 | May 15th, 2025 | 
| 2.1.x |  public.ecr.aws/sagemaker/sagemaker-distribution:2.1-cpu public.ecr.aws/sagemaker/sagemaker-distribution:2.1-gpu  | Apr 25th, 2025 | May 12th, 2025 | 
| 2.0.x |  public.ecr.aws/sagemaker/sagemaker-distribution:2.0-cpu public.ecr.aws/sagemaker/sagemaker-distribution:2.0-gpu  | Feb 25th, 2025 | Apr 21st, 2025 | 
| 1.13.x |  public.ecr.aws/sagemaker/sagemaker-distribution:1.13-cpu public.ecr.aws/sagemaker/sagemaker-distribution:1.13-gpu  | May 15th, 2025 | Sep 20th, 2025 | 
| 1.12.x |  public.ecr.aws/sagemaker/sagemaker-distribution:1.12-cpu public.ecr.aws/sagemaker/sagemaker-distribution:1.12-gpu  | July 23rd, 2025 | July 23rd, 2025 | 
| 1.11.x |  public.ecr.aws/sagemaker/sagemaker-distribution:1.11-cpu public.ecr.aws/sagemaker/sagemaker-distribution:1.11-gpu  | Apr 1st, 2025 | May 12th, 2025 | 
| 1.10.x |  public.ecr.aws/sagemaker/sagemaker-distribution:1.10-cpu public.ecr.aws/sagemaker/sagemaker-distribution:1.10-gpu  | Feb 5th, 2025 | Apr 10th, 2025 | 
| 1.9.x |  public.ecr.aws/sagemaker/sagemaker-distribution:1.9-cpu public.ecr.aws/sagemaker/sagemaker-distribution:1.9-gpu  | Jan 15th, 2025 | Apr 10th, 2025 | 
| 1.8.x |  public.ecr.aws/sagemaker/sagemaker-distribution:1.8-cpu public.ecr.aws/sagemaker/sagemaker-distribution:1.8-gpu  | Dec 31st, 2024 | Apr 10th, 2025 | 
| 1.7.x |  public.ecr.aws/sagemaker/sagemaker-distribution:1.7-cpu public.ecr.aws/sagemaker/sagemaker-distribution:1.7-gpu  | Dec 15th, 2024 | Apr 10th, 2025 | 
| 1.6.x |  public.ecr.aws/sagemaker/sagemaker-distribution:1.6-cpu public.ecr.aws/sagemaker/sagemaker-distribution:1.6-gpu  | Dec 15th, 2024 | Apr 10th, 2025 | 
| 1.5.x |  public.ecr.aws/sagemaker/sagemaker-distribution:1.5-cpu public.ecr.aws/sagemaker/sagemaker-distribution:1.5-gpu  | Oct 31st, 2024 | Nov 1st, 2024 | 
| 1.4.x |  public.ecr.aws/sagemaker/sagemaker-distribution:1.4-cpu public.ecr.aws/sagemaker/sagemaker-distribution:1.4-gpu  | Oct 31st, 2024 | Nov 1st, 2024 | 
| 1.3.x | public.ecr.aws/sagemaker/sagemaker-distribution:1.3-cpu | June 28th, 2024 | Oct 18th, 2024 | 
| 1.2.x | public.ecr.aws/sagemaker/sagemaker-distribution:1.2-cpu | June 28th, 2024 | Oct 18th, 2024 | 

### Frequently asked questions
<a name="sm-distribution-faqs"></a>

**What constitutes a major image version release?**

Major image versions are released every 6 months. A major image version release for Amazon SageMaker Distribution involves upgrading all core dependencies to the latest compatible versions and may include adding or removing packages. The Python version is upgraded only with new major version releases. For example, with the major version 2 release, Python was upgraded from 3.10 to 3.11, PyTorch from 2.0 to 2.3, TensorFlow from 2.14 to 2.17, and AutoGluon from 0.8 to 1.1, and 4 packages were added to the image.

**What constitutes a minor image version release?**

Minor image versions are released for all supported major versions monthly. A minor image version release for Amazon SageMaker Distribution involves upgrading all core dependencies except Python and CUDA to the latest compatible minor versions within the same major version and may include adding new packages. For example, with a minor version release, langchain might be upgraded from 0.1 to 0.2 and jupyter-ai from 2.18 to 2.20.

**What constitutes a patch image version release?**

Patch image versions are released as necessary to fix security vulnerabilities. A patch image version release for Amazon SageMaker Distribution involves updating all of its core dependencies to the latest compatible patch versions within the same minor version. SageMaker Distribution does not add or remove any packages during a patch version release. For example, with a patch version release, matplotlib might be upgraded from 3.9.1 to 3.9.2 and boto3 from 1.34.131 to 1.34.162.

**Where can I find the packages available in a specific image version?**

Each image version has a `release.md` file in the [GitHub repository's](https://github.com/aws/sagemaker-distribution) `build_artifacts` folder, showing all packages and package versions for CPU and GPU images. Separate changelog files for CPU and GPU versions detail package upgrades. Changelogs compare the new image version to the previous. For example, version 1.9.0 compares to the latest patch version of 1.8, version 1.9.1 compares to 1.9.0, and version 2.0.0 compares to the latest patch version of the latest minor version available at the time.

**How are images scanned for Common Vulnerabilities and Exposures (CVEs)?**

Amazon SageMaker AI leverages [Amazon Elastic Container Registry (Amazon ECR) enhanced scanning](https://docs.aws.amazon.com/AmazonECR/latest/userguide/image-scanning-enhanced.html) to automatically detect vulnerabilities and fixes for SageMaker Distribution Images. AWS continuously runs ECR enhanced scanning for the latest patch version of all supported image versions. When vulnerabilities are detected and a fix is available, AWS releases an updated image version to remediate the issue.

**Can I still use older images after an image is no longer supported?**

Images are available on SageMaker Studio until the designated availability date. Older images remain available in ECR after they reach end of support and are removed from Studio. You can download older image versions from ECR and [create a custom SageMaker image](studio-byoi-create.md). However, we highly recommend upgrading to a supported image version that continuously receives security updates and bug fixes. Customers who build their own custom images are responsible for scanning and patching their images. For more information, see the [AWS Shared Responsibility model](https://aws.amazon.com/compliance/shared-responsibility-model/).
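As a sketch, downloading an unsupported version from Amazon ECR as a base for a custom image might look like the following. The image tag comes from the table above; the account ID, Region, and repository name are hypothetical.

```
# Pull an unsupported version to use as a custom-image base (requires Docker)
docker pull public.ecr.aws/sagemaker/sagemaker-distribution:1.13-cpu

# Tag it for your own private ECR repository
# (account ID, Region, and repository name are hypothetical)
docker tag public.ecr.aws/sagemaker/sagemaker-distribution:1.13-cpu \
  111122223333.dkr.ecr.us-west-2.amazonaws.com/my-custom-sagemaker-image:1.13-cpu
```

After pushing the tagged image to your repository, register it as a custom SageMaker image by following the custom-image guide linked above.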

**Important**  
SageMaker Distribution v0.x.y is only used in Studio Classic. SageMaker Distribution v1.x.y is only used in JupyterLab.

# Amazon SageMaker Studio pricing
<a name="studio-updated-cost"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the updated Studio experience. For information about using the Studio Classic application, see [Amazon SageMaker Studio Classic](studio.md).

There is no additional charge for using the Amazon SageMaker Studio UI.  

The following do incur costs:
+ Amazon Elastic Block Store or Amazon Elastic File System volumes that are mounted with your applications.
+ Any jobs and resources that users launch from Studio applications.
+ Launching a JupyterLab application, even if no resources or jobs are launched in the application. 

For information about how Amazon SageMaker Studio Classic is billed, see [Amazon SageMaker Studio Classic Pricing](studio-pricing.md). 

For more information about billing along with pricing examples, see [Amazon SageMaker Pricing](https://aws.amazon.com//sagemaker/pricing/). 

# Troubleshooting
<a name="studio-updated-troubleshooting"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the updated Studio experience. For information about using the Studio Classic application, see [Amazon SageMaker Studio Classic](studio.md).

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

This section shows how to troubleshoot common problems in Amazon SageMaker Studio.

## Recovery mode
<a name="studio-updated-troubleshooting-recovery-mode"></a>

Recovery mode allows you to access your Studio application when a configuration issue prevents it from starting normally. It provides a simplified environment with essential functionality to help you diagnose and fix the issue.

When an application fails to launch, you may see an error message about accessing recovery mode to address one of the following configuration issues.
+ Corrupted [`.condarc`](https://docs.conda.io/projects/conda/en/latest/user-guide/configuration/use-condarc.html) file.

  For information on troubleshooting your `.condarc` file, see the [troubleshooting](https://docs.conda.io/projects/conda/en/latest/user-guide/troubleshooting.html) page in the *Conda user guide*.
+ Insufficient storage volume available. 

  You can increase the Amazon EBS storage space available for the application, or enter recovery mode to remove unnecessary data.

  For information on increasing the Amazon EBS volume size, see [Requesting a quota increase](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html) in the *Service Quotas User Guide*.

In recovery mode:
+ Your home directory differs from your normal start up. This directory is temporary and ensures that any corrupted configuration in your standard home directory does not impact your recovery mode operations. You can navigate to your standard home directory by using the command `cd /home/sagemaker-user`.
  + Standard mode: `/home/sagemaker-user`
  + Recovery mode: `/tmp/sagemaker-recovery-mode-home`
+ The application uses a minimal base conda environment with essential packages only. The simplified conda setup helps isolate environment-related issues and provides basic functionality for troubleshooting.
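Once in recovery mode, a quick way to find what is consuming space is to size the contents of your standard home directory. The following is a minimal sketch; `TARGET` defaults to your current home directory, and in recovery mode you would point it at `/home/sagemaker-user`.

```shell
# Default to the current home directory; in recovery mode, set
# TARGET=/home/sagemaker-user to inspect the standard home volume.
TARGET="${TARGET:-$HOME}"

# Free space on the volume that holds the directory.
df -h "$TARGET"

# The ten largest first-level entries, biggest first.
du -sh "$TARGET"/* 2>/dev/null | sort -rh | head -n 10

# Individual files over 100 MB anywhere under the directory.
find "$TARGET" -type f -size +100M 2>/dev/null
```

After identifying large files or directories, remove only the data you no longer need, then restart the application normally.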

You can use the Studio UI or the AWS CLI to access the application in recovery mode.

### Use the Studio UI to access the application in recovery mode
<a name="studio-updated-troubleshooting-recovery-mode-console"></a>

The following provides instructions on accessing your application in recovery mode.

1. If you have not already done so, launch the Studio UI by following the instructions in [Launch from the Amazon SageMaker AI console](studio-updated-launch.md#studio-updated-launch-console).

1. In the left navigation menu, under **Applications**, choose the application.

1. Choose the space you are having configuration issues with.

   The following steps become available when you have one or more of the configuration issues mentioned previously. In this case, a warning banner and a **Recovery mode** message appear.
**Note**  
The warning banner should have a recommended solution for the issue. Take note of it before proceeding.

1. Choose **Run space (Recovery mode)**. 

1. To access your application in recovery mode, choose **Open *application* (Recovery mode)**.

### Use the AWS CLI to access the application in recovery mode
<a name="studio-updated-troubleshooting-recovery-mode-cli"></a>

To access your application in recovery mode, you must append `--recovery-mode` to your [create-app](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/create-app.html) AWS CLI command. The following examples show how to access your application in recovery mode.

For the following example, you will need your:
+ *domain-id*

  To obtain your domain details, see [View domains](domain-view.md).
+ *space-name*

  To obtain the space names associated with your domain, see [Use the AWS CLI to view the SageMaker AI spaces in your domain](sm-console-domain-resources-view.md#sm-console-domain-resources-view-spaces-cli).
+ *app-name*

  The name of your application. To view your applications, see [Use the AWS CLI to view the SageMaker AI applications in your domain](sm-console-domain-resources-view.md#sm-console-domain-resources-view-apps-cli).

------
#### [ Access Code Editor application in recovery mode ]

```
aws sagemaker create-app \
    --app-name app-name \
    --app-type CodeEditor \
    --domain-id domain-id \
    --space-name space-name \
    --recovery-mode
```

------
#### [ Access JupyterLab application in recovery mode ]

```
aws sagemaker create-app \
    --app-name app-name \
    --app-type JupyterLab \
    --domain-id domain-id \
    --space-name space-name \
    --recovery-mode
```

------

## Cannot delete the Code Editor or JupyterLab application
<a name="studio-updated-troubleshooting-cannot-delete-application"></a>

This issue occurs when a user creates an application that is only available in Amazon SageMaker Studio, then reverts their default experience to Studio Classic. As a result, the user cannot delete a Code Editor, based on Code-OSS, Visual Studio Code - Open Source, or JupyterLab application, because they can no longer access the Studio UI.

To resolve this issue, notify your administrator so that they can delete the application manually using the AWS Command Line Interface (AWS CLI). 
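For example, an administrator could delete a leftover JupyterLab application with a command like the following sketch; the domain ID, space name, and application name are placeholders for your own values.

```shell
aws sagemaker delete-app \
    --domain-id domain-id \
    --app-type JupyterLab \
    --app-name app-name \
    --space-name space-name
```

For a Code Editor application, set `--app-type CodeEditor` instead.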

## EC2InsufficientCapacityError
<a name="studio-updated-troubleshooting-ec2-capacity"></a>

This issue occurs when you try to run a space and AWS does not currently have enough available on-demand capacity to fulfill your request. 

To resolve this issue, try one of the following. 
+ Wait a few minutes, then resubmit your request. Capacity can shift frequently.
+ Run the space with an alternate instance size or type.

**Note**  
Available capacity differs between Availability Zones. To maximize the chance of finding capacity, we recommend setting up subnets in all Availability Zones. Studio retries all Availability Zones available to the domain.  
Instance type availability differs between AWS Regions. For a list of supported instance types per Region, see [Amazon SageMaker AI pricing](https://aws.amazon.com/sagemaker/pricing/).

The following table lists instance families and their recommended alternatives.


| Instance family | CPU Type | vCPUs | Memory (GiB) | GPU type | GPUs | GPU Memory (GiB) | Recommended alternative | 
| --- | --- | --- | --- | --- | --- | --- | --- | 
| G4dn | 2nd Generation Intel Xeon Scalable Processors | 4 to 96 | 16 to 384 | NVIDIA T4 Tensor Core | 1 to 8 | 16 per GPU | G6 | 
| G5 | 2nd generation AMD EPYC processors | 4 to 192 | 16 to 768 | NVIDIA A10G Tensor core | 1 to 8 | 24 per GPU | G6e | 
| G6 | 3rd generation AMD EPYC processors | 4 to 192 | 16 to 768 | NVIDIA L4 Tensor Core | 1 to 8 | 24 per GPU | G4dn | 
| G6e | 3rd generation AMD EPYC processors | 4 to 192 | 32 to 1536 | NVIDIA L40S Tensor Core | 1 to 8 | 48 per GPU | G5, P4 | 
| P3 | Intel Xeon Scalable Processors | 8 to 96 | 61 to 768 | NVIDIA Tesla V100 | 1 to 8 | 16 per GPU (32 per GPU for P3dn) | G6e, P4 | 
| P4 | 2nd Generation Intel Xeon Scalable processors | 96 | 1152 | NVIDIA A100 Tensor Core | 8 | 320 (640 for P4de) | G6e | 
| P5 | 3rd Gen AMD EPYC processors | 192 | 2000 | NVIDIA H100 Tensor Core | 8 | 640 | P4de | 
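The retry-with-an-alternate-type approach can be sketched as a simple fallback loop. Here `launch_space` is a placeholder for your real launch call (for example, the `create-app` command shown elsewhere in this guide), not a SageMaker API.

```shell
# Placeholder for the real launch call; in this sketch it "succeeds"
# only for the one instance type that has capacity.
launch_space() {
    [ "$1" = "ml.g6.xlarge" ]
}

# Try the preferred instance type first, then fall back to alternatives.
for instance_type in ml.g5.xlarge ml.g6.xlarge ml.g4dn.xlarge; do
    if launch_space "$instance_type"; then
        echo "launched on $instance_type"
        break
    fi
    echo "no capacity for $instance_type, trying the next type"
done
```

In practice, order the list by the recommended alternatives in the table above.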

## Insufficient limit (quota increase required)
<a name="studio-updated-troubleshooting-insufficient-limit"></a>

This issue occurs when you get the following error message while attempting to run a space. 

```
Error when creating application for space: ... : The account-level service limit is X Apps, with current utilization Y Apps and a request delta of 1 Apps. Please use Service Quotas to request an increase for this quota.
```

There is a default limit on the number of instances, for each instance type, that you can run in each AWS Region. This error means that you have reached that limit. 

To resolve this issue, request an instance limit increase for the AWS Region that you are launching the space in. For more information, see [Requesting a quota increase](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html).
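Before filing the request, you can check the applied quota values with the Service Quotas AWS CLI. The following sketch lists the quotas for the SageMaker AI service code so you can find the one named in the error message.

```shell
# List SageMaker AI quota names and applied values; look for the
# application or instance type named in the error message.
aws service-quotas list-service-quotas \
    --service-code sagemaker \
    --query "Quotas[].{Name:QuotaName, Applied:Value}" \
    --output table
```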

## Failure to load custom image
<a name="studio-updated-troubleshooting-custom-image"></a>

This issue occurs when a SageMaker AI image is deleted before it is detached from your domain. The resulting error appears when you view the **Environment** tab for your domain.

To resolve this issue, create a temporary image with the same name as the deleted one, detach the image from the domain, and then delete the temporary image. The following procedure walks you through these steps.

1. If you have not already done so, launch the [SageMaker AI console](https://console.aws.amazon.com/sagemaker).

1. In the left navigation menu, under **Admin configurations**, choose **Domains**.

1. Choose your domain.

1. Choose the **Environment** tab. You will see the error message on this page.

1. Copy your image name from the image ARN.

1. In the left navigation menu, under **Admin configurations**, choose **Images**.

1. Choose **Create image**.

1. Follow the steps in the procedure, but ensure that the image name matches the one you copied from the image ARN.

   If you do not have an image in an Amazon ECR repository, see the instructions in [Create a custom image and push to Amazon ECR](studio-updated-byoi-how-to-prepare-image.md).

1. Once you have created your SageMaker AI image, navigate back to your domain **Environment** tab. You will see the image attached to your domain.

1. Select the image and choose **Detach**.

1. Follow the instructions to detach and delete the temporary SageMaker AI image.

# Migration from Amazon SageMaker Studio Classic
<a name="studio-updated-migrate"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

When you open Amazon SageMaker Studio, the web-based UI is based on the chosen default experience. Amazon SageMaker AI currently supports two different default experiences: the Amazon SageMaker Studio experience and the Amazon SageMaker Studio Classic experience. To access the latest Amazon SageMaker Studio features, you must migrate existing domains from the Amazon SageMaker Studio Classic experience. When you migrate your default experience from Studio Classic to Studio, you don't lose any features, and can still access the Studio Classic IDE within Studio. For information about the added benefits of the Studio experience, see [Amazon SageMaker Studio](studio-updated.md).

**Note**  
For existing customers that created their accounts before November 30, 2023, Studio Classic may be the default experience. You can enable Studio as your default experience using the AWS Command Line Interface (AWS CLI) or the Amazon SageMaker AI console. For more information about Studio Classic, see [Amazon SageMaker Studio Classic](studio.md). 
For customers that created their accounts after November 30, 2023, we recommend using Studio as the default experience because it contains various integrated development environments (IDEs), including the Studio Classic IDE, and other new features.  
JupyterLab 3 reached its end of maintenance date on May 15, 2024. After December 31, 2024, you can only create new Studio Classic notebooks on JupyterLab 3 for a limited period. However after December 31, 2024, SageMaker AI will no longer provide fixes for critical issues on Studio Classic notebooks on JupyterLab 3. We recommend that you migrate your workloads to the new Studio experience, which supports JupyterLab 4.
+ If Studio is your default experience, the UI is similar to the images found in [Amazon SageMaker Studio UI overview](studio-updated-ui.md).
+ If Studio Classic is your default experience, the UI is similar to the images found in [Amazon SageMaker Studio Classic UI Overview](studio-ui.md).

To migrate, you must update an existing domain. Migrating an existing domain from Studio Classic to Studio requires three distinct phases:

1. **Migrate the UI from Studio Classic to Studio**: A one-time, low-lift task that requires creating a test domain to verify that Studio is compatible with your organization's network configurations before you migrate the existing domain's UI from Studio Classic to Studio.

1. **(Optional) Migrate custom images and lifecycle configuration scripts**: A medium-lift task for migrating your custom images and LCC scripts from Studio Classic to Studio.

1. **(Optional) Migrate data from Studio Classic to Studio**: A heavy-lift task that requires using AWS DataSync to migrate data from the Studio Classic Amazon Elastic File System volume to a target Amazon EFS or Amazon Elastic Block Store volume.

   1. **(Optional) Migrate data flows from Data Wrangler in Studio Classic**: A one-time, low-lift task for migrating your data flows from Data Wrangler in Studio Classic to Studio, which you can then access in the latest version of Studio through SageMaker Canvas. For more information, see [Migrate data flows from Data Wrangler](studio-updated-migrate-data.md#studio-updated-migrate-flows).

The following topics show how to complete these phases to migrate an existing domain from Studio Classic to Studio.

## Automatic migration
<a name="studio-updated-migrate-auto"></a>

Between July 2024 and August 2024, we are automatically upgrading the default landing experience for users to the new Studio experience. This only changes the default landing UI to the updated Studio UI. The Studio Classic application is still accessible from the new Studio UI.

To ensure that migration works successfully for your users, see [Migrate the UI from Studio Classic to Studio](studio-updated-migrate-ui.md). In particular, ensure the following:
+ The domain's execution role has the required permissions.
+ The default landing experience is set to Studio.
+ The domain's Amazon VPC, if applicable, is configured for Studio using the Studio VPC endpoint.

However, if you need to continue having Studio Classic as your default UI for a limited time, set the landing experience to Studio Classic explicitly. For more information, see [Set Studio Classic as the default experience](studio-updated-migrate-ui.md#studio-updated-migrate-revert).
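As a sketch, setting the landing experience back to Studio Classic with the AWS CLI looks like the following; the domain ID is a placeholder, and the `StudioWebPortal` and `DefaultLandingUri` settings are the counterparts of the Studio values used elsewhere in this guide.

```shell
aws sagemaker update-domain \
    --domain-id domain-id \
    --default-user-settings '{
        "StudioWebPortal": "DISABLED",
        "DefaultLandingUri": "app:JupyterServer:"
    }'
```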

**Topics**
+ [Automatic migration](#studio-updated-migrate-auto)
+ [Complete prerequisites to migrate the Studio experience](studio-updated-migrate-prereq.md)
+ [Migrate the UI from Studio Classic to Studio](studio-updated-migrate-ui.md)
+ [(Optional) Migrate custom images and lifecycle configurations](studio-updated-migrate-lcc.md)
+ [(Optional) Migrate data from Studio Classic to Studio](studio-updated-migrate-data.md)

# Complete prerequisites to migrate the Studio experience
<a name="studio-updated-migrate-prereq"></a>

Migration of the default experience from Studio Classic to Studio is managed by the administrator of the existing domain. If you do not have permissions to set Studio as the default experience for the existing domain, contact your administrator. To migrate your default experience, you must have administrator permissions, or at minimum, permissions to update the existing domain and to use AWS Identity and Access Management (IAM) and Amazon Simple Storage Service (Amazon S3). Complete the following prerequisites before migrating an existing domain from Studio Classic to Studio.
+ The AWS Identity and Access Management role used to complete migration must have a policy attached with at least the following permissions. For information about creating an IAM policy, see [Creating IAM policies](https://docs.aws.amazon.com//IAM/latest/UserGuide/access_policies_create.html).
**Note**  
The release of Studio includes updates to the AWS managed policies. For more information, see [SageMaker AI Updates to AWS Managed Policies](security-iam-awsmanpol.md#security-iam-awsmanpol-updates).
  + Phase 1 required permissions:
    + `iam:CreateServiceLinkedRole`
    + `iam:PassRole`
    + `sagemaker:DescribeDomain`
    + `sagemaker:UpdateDomain`
    + `sagemaker:CreateDomain`
    + `sagemaker:CreateUserProfile`
    + `sagemaker:ListApps`
    + `sagemaker:AddTags`
    + `sagemaker:DeleteApp`
    + `sagemaker:DeleteSpace`
    + `sagemaker:UpdateSpace`
    + `sagemaker:DeleteUserProfile`
    + `sagemaker:DeleteDomain`
    + `s3:PutBucketCORS`
  + Phase 2 required permissions (Optional, only if using lifecycle configuration scripts):

    No additional permissions needed. If the existing domain has lifecycle configurations and custom images, the admin will already have the required permissions.
  + Phase 3 using custom Amazon Elastic File System required permissions (Optional, only if transferring data):
    + `efs:CreateFileSystem`
    + `efs:CreateMountTarget`
    + `efs:DescribeFileSystems`
    + `efs:DescribeMountTargets`
    + `efs:DescribeMountTargetSecurityGroups`
    + `efs:ModifyMountTargetSecurityGroups`
    + `ec2:DescribeSubnets`
    + `ec2:DescribeSecurityGroups`
    + `ec2:DescribeNetworkInterfaceAttribute`
    + `ec2:DescribeNetworkInterfaces`
    + `ec2:AuthorizeSecurityGroupEgress`
    + `ec2:AuthorizeSecurityGroupIngress`
    + `ec2:CreateNetworkInterface`
    + `ec2:CreateNetworkInterfacePermission`
    + `ec2:RevokeSecurityGroupIngress`
    + `ec2:RevokeSecurityGroupEgress`
    + `ec2:DeleteSecurityGroup`
    + `datasync:CreateLocationEfs`
    + `datasync:CreateTask`
    + `datasync:StartTaskExecution`
    + `datasync:DeleteTask`
    + `datasync:DeleteLocation`
    + `sagemaker:ListUserProfiles`
    + `sagemaker:DescribeUserProfile`
    + `sagemaker:UpdateDomain`
    + `sagemaker:UpdateUserProfile`
  + Phase 3 using Amazon Simple Storage Service required permissions (Optional, only if transferring data):
    + `iam:CreateRole`
    + `iam:GetRole`
    + `iam:AttachRolePolicy`
    + `iam:DetachRolePolicy`
    + `iam:DeleteRole`
    + `efs:DescribeFileSystems`
    + `efs:DescribeMountTargets`
    + `efs:DescribeMountTargetSecurityGroups`
    + `ec2:DescribeSubnets`
    + `ec2:CreateSecurityGroup`
    + `ec2:DescribeSecurityGroups`
    + `ec2:DescribeNetworkInterfaces`
    + `ec2:CreateNetworkInterface`
    + `ec2:CreateNetworkInterfacePermission`
    + `ec2:DetachNetworkInterfaces`
    + `ec2:DeleteNetworkInterface`
    + `ec2:DeleteNetworkInterfacePermission`
    + `ec2:CreateTags`
    + `ec2:AuthorizeSecurityGroupEgress`
    + `ec2:AuthorizeSecurityGroupIngress`
    + `ec2:RevokeSecurityGroupIngress`
    + `ec2:RevokeSecurityGroupEgress`
    + `ec2:DeleteSecurityGroup`
    + `datasync:CreateLocationEfs`
    + `datasync:CreateLocationS3`
    + `datasync:CreateTask`
    + `datasync:StartTaskExecution`
    + `datasync:DescribeTaskExecution`
    + `datasync:DeleteTask`
    + `datasync:DeleteLocation`
    + `sagemaker:CreateStudioLifecycleConfig`
    + `sagemaker:UpdateDomain`
    + `s3:ListBucket`
    + `s3:GetObject`
+ Access to AWS services from a terminal environment on either:
  + Your local machine using the AWS CLI version `2.13+`. Use the following command to verify the AWS CLI version.

    ```
    aws --version
    ```
  + AWS CloudShell. For more information, see [What is AWS CloudShell?](https://docs.aws.amazon.com/cloudshell/latest/userguide/welcome.html)
+ From your local machine or AWS CloudShell, run the following command and provide your AWS credentials. For information about AWS credentials, see [Understanding and getting your AWS credentials](https://docs.aws.amazon.com/IAM/latest/UserGuide/security-creds.html).

  ```
  aws configure
  ```
+ Verify that the lightweight JSON processor, jq, is installed in the terminal environment. jq is required to parse AWS CLI responses.

  ```
  jq --version
  ```

  If jq is not installed, install it using one of the following commands:
  + On Debian-based systems:

    ```
    sudo apt-get install -y jq
    ```
  + On RPM-based systems:

    ```
    sudo yum install -y jq
    ```
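As a quick sanity check, the version string that `aws --version` prints (for example, `aws-cli/2.15.0 Python/3.11.6 ...`) can be compared against the `2.13` minimum with a small helper like the following sketch.

```shell
# Verify the AWS CLI major/minor version meets the 2.13 minimum.
# Parses the "aws-cli/X.Y.Z ..." format that `aws --version` prints.
check_cli_version() {
    local version="${1#aws-cli/}"    # strip the "aws-cli/" prefix
    version="${version%% *}"         # keep only "X.Y.Z"
    local major="${version%%.*}"
    local rest="${version#*.}"
    local minor="${rest%%.*}"
    [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 13 ]; }
}

check_cli_version "aws-cli/2.15.0 Python/3.11.6" && echo "version ok"
check_cli_version "aws-cli/2.2.0 Python/3.8.8" || echo "too old"
```

You can wire this up as `check_cli_version "$(aws --version)"` in your own setup scripts.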

# Migrate the UI from Studio Classic to Studio
<a name="studio-updated-migrate-ui"></a>

The first phase for migrating an existing domain involves migrating the UI from Amazon SageMaker Studio Classic to Amazon SageMaker Studio. This phase does not include the migration of data. Users can continue working with their data the same way as they were before migration. For information about migrating data, see [(Optional) Migrate data from Studio Classic to Studio](studio-updated-migrate-data.md).

Phase 1 consists of the following steps:

1. Update application creation permissions for new applications available in Studio.

1. Update the VPC configuration for the domain.

1. Upgrade the domain to use the Studio UI.

## Prerequisites
<a name="studio-updated-migrate-ui-prereq"></a>

Before running these steps, complete the prerequisites in [Complete prerequisites to migrate the Studio experience](studio-updated-migrate-prereq.md).

## Step 1: Update application creation permissions
<a name="studio-updated-migrate-limit-apps"></a>

Before migrating the domain, update the domain's execution role to grant users permissions to create applications.

1. Create an AWS Identity and Access Management policy with one of the following contents by following the steps in [Creating IAM policies](https://docs.aws.amazon.com//IAM/latest/UserGuide/access_policies_create.html): 
   + Use the following policy to grant permissions for all application types and spaces.
**Note**  
If the domain uses the `SageMakerFullAccess` policy, you do not need to perform this action. `SageMakerFullAccess` grants permissions to create all applications.

------
#### [ JSON ]


     ```
     {
         "Version": "2012-10-17",
         "Statement": [
             {
                 "Sid": "SMStudioUserProfileAppPermissionsCreateAndDelete",
                 "Effect": "Allow",
                 "Action": [
                     "sagemaker:CreateApp",
                     "sagemaker:DeleteApp"
                 ],
                 "Resource": "arn:aws:sagemaker:us-east-1:111122223333:app/*",
                 "Condition": {
                     "Null": {
                         "sagemaker:OwnerUserProfileArn": "true"
                     }
                 }
             },
             {
                 "Sid": "SMStudioCreatePresignedDomainUrlForUserProfile",
                 "Effect": "Allow",
                 "Action": [
                     "sagemaker:CreatePresignedDomainUrl"
                 ],
                 "Resource": "arn:aws:sagemaker:us-east-1:111122223333:user-profile/${sagemaker:DomainId}/${sagemaker:UserProfileName}"
             },
             {
                 "Sid": "SMStudioAppPermissionsListAndDescribe",
                 "Effect": "Allow",
                 "Action": [
                     "sagemaker:ListApps",
                     "sagemaker:ListDomains",
                     "sagemaker:ListUserProfiles",
                     "sagemaker:ListSpaces",
                     "sagemaker:DescribeApp",
                     "sagemaker:DescribeDomain",
                     "sagemaker:DescribeUserProfile",
                     "sagemaker:DescribeSpace"
                 ],
                 "Resource": "*"
             },
             {
                 "Sid": "SMStudioAppPermissionsTagOnCreate",
                 "Effect": "Allow",
                 "Action": [
                     "sagemaker:AddTags"
                 ],
                 "Resource": "arn:aws:sagemaker:us-east-1:111122223333:*/*",
                 "Condition": {
                     "Null": {
                         "sagemaker:TaggingAction": "false"
                     }
                 }
             },
             {
                 "Sid": "SMStudioRestrictSharedSpacesWithoutOwners",
                 "Effect": "Allow",
                 "Action": [
                     "sagemaker:CreateSpace",
                     "sagemaker:UpdateSpace",
                     "sagemaker:DeleteSpace"
                 ],
                 "Resource": "arn:aws:sagemaker:us-east-1:111122223333:space/${sagemaker:DomainId}/*",
                 "Condition": {
                     "Null": {
                         "sagemaker:OwnerUserProfileArn": "true"
                     }
                 }
             },
             {
                 "Sid": "SMStudioRestrictSpacesToOwnerUserProfile",
                 "Effect": "Allow",
                 "Action": [
                     "sagemaker:CreateSpace",
                     "sagemaker:UpdateSpace",
                     "sagemaker:DeleteSpace"
                 ],
                 "Resource": "arn:aws:sagemaker:us-east-1:111122223333:space/${sagemaker:DomainId}/*",
                 "Condition": {
                     "ArnLike": {
                         "sagemaker:OwnerUserProfileArn": "arn:aws:sagemaker:us-east-1:111122223333:user-profile/${sagemaker:DomainId}/${sagemaker:UserProfileName}"
                     },
                     "StringEquals": {
                         "sagemaker:SpaceSharingType": [
                             "Private",
                             "Shared"
                         ]
                     }
                 }
             },
             {
                 "Sid": "SMStudioRestrictCreatePrivateSpaceAppsToOwnerUserProfile",
                 "Effect": "Allow",
                 "Action": [
                     "sagemaker:CreateApp",
                     "sagemaker:DeleteApp"
                 ],
                 "Resource": "arn:aws:sagemaker:us-east-1:111122223333:app/${sagemaker:DomainId}/*",
                 "Condition": {
                     "ArnLike": {
                         "sagemaker:OwnerUserProfileArn": "arn:aws:sagemaker:us-east-1:111122223333:user-profile/${sagemaker:DomainId}/${sagemaker:UserProfileName}"
                     },
                     "StringEquals": {
                         "sagemaker:SpaceSharingType": [
                             "Private"
                         ]
                     }
                 }
             },
             {
                 "Sid": "AllowAppActionsForSharedSpaces",
                 "Effect": "Allow",
                 "Action": [
                     "sagemaker:CreateApp",
                     "sagemaker:DeleteApp"
                 ],
                 "Resource": "arn:aws:sagemaker:*:*:app/${sagemaker:DomainId}/*/*/*",
                 "Condition": {
                     "StringEquals": {
                         "sagemaker:SpaceSharingType": [
                             "Shared"
                         ]
                     }
                 }
             }
         ]
     }
     ```

------
   + Because Studio shows an expanded set of applications, users may have access to applications that weren't displayed before. Administrators can limit access to these default applications by creating an AWS Identity and Access Management (IAM) policy that denies permissions for specific applications to specific users.
**Note**  
Application type can be either `jupyterlab` or `codeeditor`.

------
#### [ JSON ]


     ```
     {
         "Version": "2012-10-17",
         "Statement": [
             {
                 "Sid": "DenySageMakerCreateAppForSpecificAppTypes",
                 "Effect": "Deny",
                 "Action": "sagemaker:CreateApp",
                 "Resource": "arn:aws:sagemaker:us-east-1:111122223333:app/domain-id/*/app-type/*"
             }
         ]
     }
     ```

------

1. Attach the policy to the execution role of the domain. For instructions, follow the steps in [Adding IAM identity permissions (console)](https://docs.aws.amazon.com//IAM/latest/UserGuide/access_policies_manage-attach-detach.html#add-policies-console).
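If you prefer the AWS CLI, attaching the policy to the execution role is a single call; in the following sketch, the role name and policy ARN are placeholders for your own values.

```shell
aws iam attach-role-policy \
    --role-name SageMakerExecutionRole \
    --policy-arn arn:aws:iam::111122223333:policy/StudioAppPermissions
```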

## Step 2: Update VPC configuration
<a name="studio-updated-migrate-vpc"></a>

If you use your domain in `VPC-Only` mode, ensure your VPC configuration meets the requirements for using Studio in `VPC-Only` mode. For more information, see [Connect Amazon SageMaker Studio in a VPC to External Resources](studio-updated-and-internet-access.md).

## Step 3: Upgrade to the Studio UI
<a name="studio-updated-migrate-set-studio-updated"></a>

Before you migrate your existing domain from Studio Classic to Studio, we recommend creating a test domain using Studio with the same configurations as your existing domain.

### (Optional) Create a test domain
<a name="studio-updated-migrate-ui-create-test"></a>

Use this test domain to interact with Studio, test networking configurations, and launch applications before migrating the existing domain.

1. Get the domain ID of your existing domain.

   1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

   1. From the left navigation pane, expand **Admin configurations** and choose **Domains**. 

   1. Choose the existing domain.

   1. On the **Domain details** page, choose the **Domain settings** tab.

   1. Copy the **Domain ID**.

1. Add the domain ID of your existing domain.

   ```
   export REF_DOMAIN_ID="domain-id"
   export SM_REGION="region"
   ```

1. Use `describe-domain` to get important information about the existing domain.

   ```
   export REF_EXECROLE=$(aws sagemaker describe-domain --region=$SM_REGION --domain-id=$REF_DOMAIN_ID | jq -r '.DefaultUserSettings.ExecutionRole')
   export REF_VPC=$(aws sagemaker describe-domain --region=$SM_REGION --domain-id=$REF_DOMAIN_ID | jq -r '.VpcId')
   export REF_SIDS=$(aws sagemaker describe-domain --region=$SM_REGION --domain-id=$REF_DOMAIN_ID | jq -r '.SubnetIds | join(",")')
   export REF_SGS=$(aws sagemaker describe-domain --region=$SM_REGION --domain-id=$REF_DOMAIN_ID | jq -r '.DefaultUserSettings.SecurityGroups | join(",")')
   export AUTHMODE=$(aws sagemaker describe-domain --region=$SM_REGION --domain-id=$REF_DOMAIN_ID | jq -r '.AuthMode')
   ```

1. Validate the parameters.

   ```
   echo "Execution Role: $REF_EXECROLE || VPCID: $REF_VPC || SubnetIDs: $REF_SIDS || Security GroupIDs: $REF_SGS || AuthMode: $AUTHMODE"
   ```

1. Create a test domain using the configurations from the existing domain.

   ```
   IFS=',' read -r -a subnet_ids <<< "$REF_SIDS"
   IFS=',' read -r -a security_groups <<< "$REF_SGS"
   security_groups_json=$(printf '%s\n' "${security_groups[@]}" | jq -R . | jq -s .)
   
   aws sagemaker create-domain \
   --domain-name "TestV2Config" \
   --vpc-id $REF_VPC \
   --auth-mode $AUTHMODE \
   --subnet-ids "${subnet_ids[@]}" \
   --app-network-access-type VpcOnly \
   --default-user-settings "
   {
       \"ExecutionRole\": \"$REF_EXECROLE\",
       \"StudioWebPortal\": \"ENABLED\",
       \"DefaultLandingUri\": \"studio::\",
       \"SecurityGroups\": $security_groups_json
   }
   "
   ```

1. After the test domain is `InService`, use the test domain's ID to create a user profile. This user profile is used to launch and test applications.

   ```
   aws sagemaker create-user-profile \
   --region="$SM_REGION" --domain-id=test-domain-id \
   --user-profile-name test-network-user
   ```
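The repeated `describe-domain` calls in step 3 can be reduced to a single network call by capturing the response once and parsing each field locally. The following sketch uses a sample response in place of the real API call; the field names match the `describe-domain` output used above.

```shell
# Sample response standing in for:
#   aws sagemaker describe-domain --region=$SM_REGION --domain-id=$REF_DOMAIN_ID
domain_json='{
  "AuthMode": "IAM",
  "VpcId": "vpc-0123456789abcdef0",
  "SubnetIds": ["subnet-aaa", "subnet-bbb"],
  "DefaultUserSettings": {
    "ExecutionRole": "arn:aws:iam::111122223333:role/ExampleRole",
    "SecurityGroups": ["sg-111", "sg-222"]
  }
}'

# Parse each field from the captured response instead of calling the API again.
REF_EXECROLE=$(echo "$domain_json" | jq -r '.DefaultUserSettings.ExecutionRole')
REF_VPC=$(echo "$domain_json" | jq -r '.VpcId')
REF_SIDS=$(echo "$domain_json" | jq -r '.SubnetIds | join(",")')
REF_SGS=$(echo "$domain_json" | jq -r '.DefaultUserSettings.SecurityGroups | join(",")')
AUTHMODE=$(echo "$domain_json" | jq -r '.AuthMode')

echo "VPC: $REF_VPC || Subnets: $REF_SIDS || Auth: $AUTHMODE"
```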

#### Test Studio functionality
<a name="studio-updated-migrate-ui-testing"></a>

Launch the test domain using the `test-network-user` user profile. We suggest that you thoroughly test the Studio UI and create applications to verify Studio functionality in `VpcOnly` mode. Test the following workflows:
+ Create a new JupyterLab space, then test the environment and connectivity.
+ Create a new Code Editor, based on Code-OSS, Visual Studio Code - Open Source space, then test the environment and connectivity.
+ Launch a new Studio Classic application, then test the environment and connectivity.
+ Test Amazon Simple Storage Service connectivity with test read and write actions.

If these tests are successful, upgrade the existing domain. If you encounter any failures, we recommend fixing your environment and connectivity issues before updating the existing domain.

#### Clean up test domain resources
<a name="studio-updated-migrate-ui-clean"></a>

After you have migrated the existing domain, clean up test domain resources.

1. Add the test domain's ID.

   ```
   export TEST_DOMAIN="test-domain-id"
   export SM_REGION="region"
   ```

1. List all applications in the domain that are in a running state.

   ```
   active_apps_json=$(aws sagemaker list-apps --region=$SM_REGION --domain-id=$TEST_DOMAIN)
   echo $active_apps_json
   ```

1. Parse the JSON list of running applications and delete them. If users attempted to create an application that they do not have permissions for, there may be spaces that are not captured in the following script. You must manually delete these spaces.

   ```
   echo "$active_apps_json" | jq -c '.Apps[]' | while read -r app;
   do
       if echo "$app" | jq -e '. | has("SpaceName")' > /dev/null;
       then
           app_type=$(echo "$app" | jq -r '.AppType')
           app_name=$(echo "$app" | jq -r '.AppName')
           domain_id=$(echo "$app" | jq -r '.DomainId')
           space_name=$(echo "$app" | jq -r '.SpaceName')
   
           echo "Deleting App - AppType: $app_type || AppName: $app_name || DomainId: $domain_id || SpaceName: $space_name"
           aws sagemaker delete-app --region=$SM_REGION --domain-id=$domain_id \
           --app-type $app_type --app-name $app_name --space-name $space_name
   
           echo "Deleting Space - AppType: $app_type || AppName: $app_name || DomainId: $domain_id || SpaceName: $space_name"
           aws sagemaker delete-space --region=$SM_REGION --domain-id=$domain_id \
           --space-name $space_name
       else
   
           app_type=$(echo "$app" | jq -r '.AppType')
           app_name=$(echo "$app" | jq -r '.AppName')
           domain_id=$(echo "$app" | jq -r '.DomainId')
           user_profile_name=$(echo "$app" | jq -r '.UserProfileName')
   
           echo "Deleting Studio Classic - AppType: $app_type || AppName: $app_name || DomainId: $domain_id || UserProfileName: $user_profile_name"
           aws sagemaker delete-app --region=$SM_REGION --domain-id=$domain_id \
           --app-type $app_type --app-name $app_name --user-profile-name $user_profile_name
   
       fi
   
   done
   ```

1. Delete the test user profile.

   ```
   aws sagemaker delete-user-profile \
   --region=$SM_REGION --domain-id=$TEST_DOMAIN \
   --user-profile-name "test-network-user"
   ```

1. Delete the test domain.

   ```
   aws sagemaker delete-domain \
   --region=$SM_REGION --domain-id=$TEST_DOMAIN
   ```
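To confirm that the cleanup completed, you can check whether the test domain ID still appears in `list-domains`. A minimal sketch; the sample ID list stands in for the live call so the check is self-contained:

```shell
TEST_DOMAIN="test-domain-id"
# In practice, pull the live list:
# domains=$(aws sagemaker list-domains --region "$SM_REGION" \
#   --query 'Domains[].DomainId' --output text)
domains="d-aaaa1111 d-bbbb2222"   # sample stand-in output
case " $domains " in
  *" $TEST_DOMAIN "*) echo "test domain still present" ;;
  *) echo "test domain deleted" ;;
esac
```

Note that a domain can remain in a `Deleting` state for several minutes before it disappears from the list.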

After you have tested Studio functionality with the configurations in your test domain, migrate the existing domain. When Studio is the default experience for a domain, Studio is the default experience for all users in the domain. However, user settings take precedence over the domain settings. Therefore, if a user has their default experience set to Studio Classic in their user settings, then that user will have Studio Classic as their default experience. 
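The precedence rule can be sketched as follows. The values are illustrative, not output from any SageMaker AI API: a user-level `StudioWebPortal` setting, when present, overrides the domain default.

```shell
# User-level UserSettings.StudioWebPortal overrides the domain-level
# DefaultUserSettings.StudioWebPortal when it is set.
domain_default="ENABLED"    # domain-level default
user_override="DISABLED"    # user-level setting; may be empty if unset
effective="${user_override:-$domain_default}"
echo "effective StudioWebPortal: $effective"
```

With the sample values above, the user keeps Studio Classic (`DISABLED`) even though the domain defaults to Studio.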

You can migrate the existing domain by updating it from the SageMaker AI console, the AWS CLI, or AWS CloudFormation. Choose one of the following tabs to view the relevant instructions.

### Set Studio as the default experience for the existing domain using the SageMaker AI console
<a name="studio-updated-migrate-set-studio-updated-console"></a>

You can set Studio as the default experience for the existing domain by using the SageMaker AI console.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. From the left navigation pane, expand **Admin configurations** and choose **Domains**. 

1. Choose the existing domain that you want to enable Studio as the default experience for.

1. On the **Domain details** page, expand **Enable the new Studio**.

1. (Optional) To view the details about the steps involved in enabling Studio as your default experience, choose **View details**. The page shows the following.
   + In the **SageMaker Studio Overview** section you can view the applications that are included or available in the Studio web-based interface. 
   + In the **Enablement process** section you can view descriptions of the workflow tasks to enable Studio.
**Note**  
You will need to migrate your data manually. For instructions about migrating your data, see [(Optional) Migrate data from Studio Classic to Studio](studio-updated-migrate-data.md).
   + In the **Revert to Studio Classic experience** section you can view how to revert back to Studio Classic after enabling Studio as your default experience.

1. To begin the process to enable Studio as your default experience, choose **Enable the new Studio**.

1. In the **Specify and configure role** section, you can view the default applications that are automatically included in Studio.

   To prevent users from running these applications, choose an AWS Identity and Access Management (IAM) role that has an IAM policy that denies access. For information about how to create a policy to limit access, see [Step 1: Update application creation permissions](#studio-updated-migrate-limit-apps).

1. In the **Choose default S3 bucket to attach CORS policy** section, you can give Studio access to Amazon S3 buckets. The default Amazon S3 bucket, in this case, is the default Amazon S3 bucket for your Studio Classic. In this step you can do the following:
   + Verify the domain’s default Amazon S3 bucket to attach the CORS policy to. If your domain does not have a default Amazon S3 bucket, SageMaker AI creates an Amazon S3 bucket with the correct CORS policy attached.
   + You can include up to 10 additional Amazon S3 buckets to attach the CORS policy to.

     If you want to include more than 10 buckets, you can add them manually. For more information about manually attaching the CORS policy to your Amazon S3 buckets, see [(Optional) Update your CORS policy to access Amazon S3 buckets](#studio-updated-migrate-cors).

   To proceed, select the check box next to **Do you agree to overriding any existing CORS policy on the chosen Amazon S3 buckets?**.

1. The **Migrate data** section contains information about the different data storage volumes for Studio Classic and Studio. Your data will not be migrated automatically through this process. For instructions about migrating your data, lifecycle configurations, and JupyterLab extensions, see [(Optional) Migrate data from Studio Classic to Studio](studio-updated-migrate-data.md).

1. Once you have completed the tasks on the page and verified your configuration, choose **Enable the new Studio**.

### Set Studio as the default experience for the existing domain using the AWS CLI
<a name="studio-updated-migrate-set-studio-updated-cli"></a>

To set Studio as the default experience for the existing domain using the AWS CLI, use the [update-domain](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/update-domain.html) call. You must set `ENABLED` as the value for `StudioWebPortal`, and set `studio::` as the value for `DefaultLandingUri` as part of the `default-user-settings` parameter. 

`StudioWebPortal` indicates if the Studio experience is the default experience and `DefaultLandingUri` indicates the default experience that the user is directed to when accessing the domain. In this example, setting these values on a domain level (in `default-user-settings`) makes Studio the default experience for users within the domain.

If a user within the domain has their `StudioWebPortal` set to `DISABLED` and `DefaultLandingUri` set to `app:JupyterServer:` on a user level (in `UserSettings`), this takes precedence over the domain settings. In other words, that user will have Studio Classic as their default experience, regardless of the domain settings. 

The following code example shows how to set Studio as the default experience for users within the domain:

```
aws sagemaker update-domain \
--domain-id existing-domain-id \
--region AWS Region \
--default-user-settings '
{
    "StudioWebPortal": "ENABLED",
    "DefaultLandingUri": "studio::"
}
'
```
+ To obtain your `existing-domain-id`, use the following instructions:

**To get `existing-domain-id`**

  1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

  1. From the left navigation pane, expand **Admin configurations** and choose **Domains**. 

  1. Choose the existing domain.

  1. On the **Domain details** page, choose the **Domain settings** tab.

  1. Copy the **Domain ID**.
+ To ensure you are using the correct AWS Region for your domain, use the following instructions: 

**To get `AWS Region`**

  1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

  1. From the left navigation pane, expand **Admin configurations** and choose **Domains**. 

  1. Choose the existing domain.

  1. On the **Domain details** page, verify that this is the existing domain.

  1. Expand the AWS Region dropdown list from the top right of the SageMaker AI console, and use the corresponding AWS Region ID to the right of your AWS Region name. For example, `us-west-1`.
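After running `update-domain`, you can confirm that the change took effect with `describe-domain`. The following hedged sketch uses a sample response in place of the live call so the check is self-contained:

```shell
# In practice, fetch the live settings:
# response=$(aws sagemaker describe-domain --region "$SM_REGION" \
#   --domain-id existing-domain-id)
# A sample response stands in here:
response='{"DomainId":"d-example","DefaultUserSettings":{"StudioWebPortal":"ENABLED","DefaultLandingUri":"studio::"}}'
case "$response" in
  *'"StudioWebPortal":"ENABLED"'*) echo "Studio is the default experience" ;;
  *) echo "Studio Classic is the default experience" ;;
esac
```

Remember that a user-level `StudioWebPortal` setting in `UserSettings` still overrides this domain-level value.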

After you migrate your default experience to Studio, you can give Studio access to Amazon S3 buckets. For example, you can include access to your Studio Classic default Amazon S3 bucket and additional Amazon S3 buckets. To do so, you must manually attach a [Cross-Origin Resource Sharing](https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS) (CORS) configuration to the Amazon S3 buckets. For more information about how to manually attach the CORS policy to your Amazon S3 buckets, see [(Optional) Update your CORS policy to access Amazon S3 buckets](#studio-updated-migrate-cors).

Similarly, you can set Studio as the default experience when you create a domain from the AWS CLI using the [create-domain](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/create-domain.html) call. 

### Set Studio as the default experience for the existing domain using AWS CloudFormation
<a name="studio-updated-migrate-set-studio-updated-cloud-formation"></a>

You can set the default experience when creating a domain using AWS CloudFormation. For a CloudFormation migration template, see [SageMaker Studio Administrator IaC Templates](https://github.com/aws-samples/sagemaker-studio-admin-iac-templates/tree/main?tab=readme-ov-file#phase-1-migration). For more information about creating a domain using CloudFormation, see [Creating Amazon SageMaker AI domain using CloudFormation](https://github.com/aws-samples/cloudformation-studio-domain?tab=readme-ov-file#creating-sagemaker-studio-domains-using-cloudformation).

For information about the domain resource supported by AWS CloudFormation, see [AWS::SageMaker::Domain](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-sagemaker-domain.html#cfn-sagemaker-domain-defaultusersettings).

After you migrate your default experience to Studio, you can give Studio access to Amazon S3 buckets. For example, you can include access to your Studio Classic default Amazon S3 bucket and additional Amazon S3 buckets. To do so, you must manually attach a [Cross-Origin Resource Sharing](https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS) (CORS) configuration to the Amazon S3 buckets. For information about how to manually attach the CORS policy to your Amazon S3 buckets, see [(Optional) Update your CORS policy to access Amazon S3 buckets](#studio-updated-migrate-cors).

### (Optional) Update your CORS policy to access Amazon S3 buckets
<a name="studio-updated-migrate-cors"></a>

In Studio Classic, users can create, list, and upload files to Amazon Simple Storage Service (Amazon S3) buckets. To support the same experience in Studio, administrators must attach a [Cross-Origin Resource Sharing](https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS) (CORS) configuration to the Amazon S3 buckets. This is required because Studio makes Amazon S3 calls from the internet browser. The browser invokes CORS on behalf of users. As a result, all of the requests to Amazon S3 buckets fail unless the CORS policy is attached to the Amazon S3 buckets.

You may need to manually attach the CORS policy to Amazon S3 buckets for the following reasons.
+ If there is already an existing Amazon S3 default bucket that doesn’t have the correct CORS policy attached when you migrate the existing domain's default experience to Studio.
+ If you are using the AWS CLI to migrate the existing domain's default experience to Studio. For information about using the AWS CLI to migrate, see [Set Studio as the default experience for the existing domain using the AWS CLI](#studio-updated-migrate-set-studio-updated-cli).
+ If you want to attach the CORS policy to additional Amazon S3 buckets.

**Note**  
If you plan to use the SageMaker AI console to enable Studio as your default experience, the Amazon S3 buckets that you attach the CORS policy to will have their existing CORS policies overridden during the migration. For this reason, you can ignore the following manual instructions.  
However, if you have already used the SageMaker AI console to migrate and want to include more Amazon S3 buckets to attach the CORS policy to, then continue with the following manual instructions.

The following procedure shows how to manually add a CORS configuration to an Amazon S3 bucket.

**To add a CORS configuration to an Amazon S3 bucket**

1. Verify that there is an Amazon S3 bucket in the same AWS Region as the existing domain with the following name. For instructions, see [Viewing the properties for an Amazon S3 bucket](https://docs.aws.amazon.com//AmazonS3/latest/userguide/view-bucket-properties.html). 

   ```
   sagemaker-region-account-id
   ```

1. Add a CORS configuration with the following content to the default Amazon S3 bucket. For instructions, see [Configuring cross-origin resource sharing (CORS)](https://docs.aws.amazon.com//AmazonS3/latest/userguide/enabling-cors-examples.html).

   ```
   [
       {
           "AllowedHeaders": [
               "*"
           ],
           "AllowedMethods": [
               "POST",
               "PUT",
               "GET",
               "HEAD",
               "DELETE"
           ],
           "AllowedOrigins": [
               "https://*.sagemaker.aws"
           ],
           "ExposeHeaders": [
               "ETag",
               "x-amz-delete-marker",
               "x-amz-id-2",
               "x-amz-request-id",
               "x-amz-server-side-encryption",
               "x-amz-version-id"
           ]
       }
   ]
   ```
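If you prefer the AWS CLI to the Amazon S3 console, the same rules can be attached with `s3api put-bucket-cors`. Note that the CLI expects the rules wrapped in a `CORSRules` key, whereas the console accepts the bare array shown above. A sketch, with a placeholder bucket name and the live call commented out:

```shell
# Save the rules shown above (same content, wrapped for the CLI).
cat > cors-rules.json <<'EOF'
{
    "CORSRules": [
        {
            "AllowedHeaders": ["*"],
            "AllowedMethods": ["POST", "PUT", "GET", "HEAD", "DELETE"],
            "AllowedOrigins": ["https://*.sagemaker.aws"],
            "ExposeHeaders": ["ETag", "x-amz-delete-marker", "x-amz-id-2",
                "x-amz-request-id", "x-amz-server-side-encryption",
                "x-amz-version-id"]
        }
    ]
}
EOF
# aws s3api put-bucket-cors --bucket sagemaker-region-account-id \
#   --cors-configuration file://cors-rules.json
echo "CORS rule file written"
```

Repeat the `put-bucket-cors` call for each additional bucket you want Studio to access.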

### (Optional) Migrate from Data Wrangler in Studio Classic to SageMaker Canvas
<a name="studio-updated-migrate-dw"></a>

Amazon SageMaker Data Wrangler exists as its own feature in the Studio Classic experience. When you enable Studio as your default experience, use the [Amazon SageMaker Canvas](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas.html) application to access Data Wrangler functionality. SageMaker Canvas is an application in which you can train and deploy machine learning models without writing any code, and Canvas provides data preparation features powered by Data Wrangler.

The new Studio experience doesn’t support the classic Data Wrangler UI, and you must create a Canvas application if you want to continue using Data Wrangler. However, you must have the necessary permissions to create and use Canvas applications.

Complete the following steps to attach the necessary permissions policies to your SageMaker AI domain's or user’s AWS IAM role.

**To grant permissions for Data Wrangler functionality inside Canvas**

1. Attach the AWS managed policy [AmazonSageMakerFullAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol.html#security-iam-awsmanpol-AmazonSageMakerFullAccess) to your user’s IAM role. For a procedure that shows you how to attach IAM policies to a role, see [Adding IAM identity permissions (console)](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html#add-policies-console) in the *AWS IAM User Guide.*

   If this permissions policy is too permissive for your use case, you can create scoped-down policies that include at least the following permissions:

   ```
   {
       "Sid": "AllowStudioActions",
       "Effect": "Allow",
       "Action": [
           "sagemaker:CreatePresignedDomainUrl",
           "sagemaker:DescribeDomain",
           "sagemaker:ListDomains",
           "sagemaker:DescribeUserProfile",
           "sagemaker:ListUserProfiles",
           "sagemaker:DescribeSpace",
           "sagemaker:ListSpaces",
           "sagemaker:DescribeApp",
           "sagemaker:ListApps"
       ],
       "Resource": "*"
   },
   {
       "Sid": "AllowAppActionsForUserProfile",
       "Effect": "Allow",
       "Action": [
           "sagemaker:CreateApp",
           "sagemaker:DeleteApp"
       ],
       "Resource": "arn:aws:sagemaker:region:account-id:app/domain-id/user-profile-name/canvas/*",
       "Condition": {
           "Null": {
               "sagemaker:OwnerUserProfileArn": "true"
           }
       }
   }
   ```

1. Attach the AWS managed policy [AmazonSageMakerCanvasDataPrepFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasDataPrepFullAccess.html) to your user’s IAM role.

After attaching the necessary permissions, you can create a Canvas application and log in. For more information, see [Getting started with using Amazon SageMaker Canvas](canvas-getting-started.md).

When you’ve logged into Canvas, you can directly access Data Wrangler and begin creating data flows. For more information, see [Data preparation](canvas-data-prep.md) in the Canvas documentation.

### (Optional) Migrate from Autopilot in Studio Classic to SageMaker Canvas
<a name="studio-updated-migrate-autopilot"></a>

[Amazon SageMaker Autopilot](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-automate-model-development.html) exists as its own feature in the Studio Classic experience. When you migrate to the updated Studio experience, use the [Amazon SageMaker Canvas](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas.html) application to continue using the same automated machine learning (AutoML) capabilities through a user interface (UI). SageMaker Canvas is an application in which you can train and deploy machine learning models without writing any code, and Canvas provides a UI to run your AutoML tasks.

The new Studio experience doesn’t support the classic Autopilot UI. You must create a Canvas application if you want to continue using Autopilot's AutoML features through a UI. 

However, you must have the necessary permissions to create and use Canvas applications.
+ If you are accessing SageMaker Canvas from Studio, add those permissions to the execution role of your SageMaker AI domain or user profile.
+ If you are accessing SageMaker Canvas from the Console, add those permissions to your user’s AWS IAM role.
+ If you are accessing SageMaker Canvas via a [presigned URL](https://docs.aws.amazon.com/sagemaker/latest/dg/setting-up-canvas-sso.html#canvas-optional-access), add those permissions to the IAM role that you're using for Okta SSO access.

To enable AutoML capabilities in Canvas, add the following policies to your execution role or IAM user role.
+ AWS managed policy: [`AmazonSageMakerCanvasFullAccess`](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasFullAccess). 
+ Inline policy:

  ```
  {
      "Sid": "AllowAppActionsForUserProfile",
      "Effect": "Allow",
      "Action": [
          "sagemaker:CreateApp",
          "sagemaker:DeleteApp"
      ],
      "Resource": "arn:aws:sagemaker:region:account-id:app/domain-id/user-profile-name/canvas/*",
      "Condition": {
          "Null": {
              "sagemaker:OwnerUserProfileArn": "true"
          }
      }
  }
  ```
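The statement above is not a complete policy document on its own. If you attach it with the AWS CLI rather than the console, wrap it in a `Version`/`Statement` envelope first. A hedged sketch; the role name and policy name are placeholders, and the `put-role-policy` call is commented out:

```shell
cat > canvas-app-policy.json <<'EOF'
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowAppActionsForUserProfile",
            "Effect": "Allow",
            "Action": ["sagemaker:CreateApp", "sagemaker:DeleteApp"],
            "Resource": "arn:aws:sagemaker:region:account-id:app/domain-id/user-profile-name/canvas/*",
            "Condition": {"Null": {"sagemaker:OwnerUserProfileArn": "true"}}
        }
    ]
}
EOF
# aws iam put-role-policy --role-name your-execution-role-name \
#   --policy-name CanvasAppAccess --policy-document file://canvas-app-policy.json
echo "inline policy document written"
```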

**To attach IAM policies to an execution role**

1. **Find the execution role attached to your SageMaker AI user profile**

   1. In the SageMaker AI console [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/), navigate to **Domains**, then choose your SageMaker AI domain.

   1. Choose your user profile. The execution role ARN is listed under **Execution role** on the **User Details** page. Make note of the execution role name in the ARN.

   1. In the IAM console [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/), choose **Roles**.

   1. Search for your role by name in the search field.

   1. Select the role.

1. **Add policies to the role**

   1. In the IAM console [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/), choose **Roles**.

   1. Search for your role by name in the search field.

   1. Select the role.

   1. In the **Permissions** tab, navigate to the dropdown menu **Add permissions**.

   1. Do one of the following:
      + For managed policies: Choose **Attach policies**, search for the name of the managed policy that you want to attach, select the policy, and then choose **Add permissions**.
      + For inline policies: Choose **Create inline policy**, paste your policy in the **JSON** tab, choose **Next**, name your policy, and then choose **Create policy**.

For a procedure that shows you how to attach IAM policies to a role, see [Adding IAM identity permissions (console)](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html#add-policies-console) in the *AWS IAM User Guide.*

After attaching the necessary permissions, you can create a Canvas application and log in. For more information, see [Getting started with using Amazon SageMaker Canvas](canvas-getting-started.md).

## Set Studio Classic as the default experience
<a name="studio-updated-migrate-revert"></a>

Administrators can revert to Studio Classic as the default experience for an existing domain. This can be done through the AWS CLI.

**Note**  
When Studio Classic is set as the default experience on a domain level, Studio Classic is the default experience for all users in the domain. However, settings on a user level take precedence over the domain-level settings. If a user has their default experience set to Studio, then that user will have Studio as their default experience. 

To revert to Studio Classic as the default experience for the existing domain using the AWS CLI, use the [update-domain](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/update-domain.html) call. As part of the `default-user-settings` field, you must set:
+ `StudioWebPortal` value to `DISABLED`.
+ `DefaultLandingUri` value to `app:JupyterServer:`.

`StudioWebPortal` indicates if the Studio experience is the default experience and `DefaultLandingUri` indicates the default experience that the user is directed to when accessing the domain. In this example, setting these values on a domain level (in `default-user-settings`) makes Studio Classic the default experience for users within the domain.

If a user within the domain has their `StudioWebPortal` set to `ENABLED` and `DefaultLandingUri` set to `studio::` on a user level (in `UserSettings`), this takes precedence over the domain level settings. In other words, that user will have Studio as their default experience, regardless of the domain level settings. 

The following code example shows how to set Studio Classic as the default experience for users within the domain:

```
aws sagemaker update-domain \
--domain-id existing-domain-id \
--region AWS Region \
--default-user-settings '
{
    "StudioWebPortal": "DISABLED",
    "DefaultLandingUri": "app:JupyterServer:"
}
'
```

Use the following instructions to obtain your `existing-domain-id`.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. From the left navigation pane, expand **Admin configurations** and choose **Domains**. 

1. Choose the existing domain.

1. On the **Domain details** page, choose the **Domain settings** tab.

1. Copy the **Domain ID**.

To obtain your `AWS Region`, use the following instructions to ensure you are using the correct AWS Region for your domain.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. From the left navigation pane, expand **Admin configurations** and choose **Domains**. 

1. Choose the existing domain.

1. On the **Domain details** page, verify that this is the existing domain.

1. Expand the AWS Region dropdown list from the top right of the SageMaker AI console, and use the corresponding AWS Region ID to the right of your AWS Region name. For example, `us-west-1`.

# (Optional) Migrate custom images and lifecycle configurations
<a name="studio-updated-migrate-lcc"></a>

You must update your custom images and lifecycle configuration (LCC) scripts to work with the simplified local runtime model in Amazon SageMaker Studio. If you have not created custom images or lifecycle configurations in your domain, skip this phase.

Amazon SageMaker Studio Classic operates in a split environment with:
+ A `JupyterServer` application running the Jupyter Server. 
+ Studio Classic notebooks running on one or more `KernelGateway` applications. 

Studio has shifted away from this split environment. Studio runs the JupyterLab and Code Editor, based on Code-OSS, Visual Studio Code - Open Source, applications in a local runtime model. For more information about the change in architecture, see [Boost productivity on Amazon SageMaker Studio](https://aws.amazon.com/blogs/machine-learning/boost-productivity-on-amazon-sagemaker-studio-introducing-jupyterlab-spaces-and-generative-ai-tools/). 

## Migrate custom images
<a name="studio-updated-migrate-lcc-custom"></a>

Your existing Studio Classic custom images may not work in Studio. We recommend creating a new custom image that satisfies the requirements for use in Studio. The release of Studio simplifies the process of building custom images by providing [SageMaker Distribution images](sagemaker-distribution.md) as a base. SageMaker Distribution images include popular libraries and packages for machine learning, data science, and data analytics visualization. For a list of base SageMaker Distribution images and Amazon Elastic Container Registry account information, see [Amazon SageMaker Images Available for Use With Studio Classic Notebooks](notebooks-available-images.md).

To build a custom image, complete one of the following.
+ Extend a SageMaker Distribution image with custom packages and modules. These images are pre-configured with JupyterLab and Code Editor, based on Code-OSS, Visual Studio Code - Open Source.
+ Build a custom Dockerfile by following the instructions in [Bring your own image (BYOI)](studio-updated-byoi.md). You must install JupyterLab and the open source CodeServer on the image to make it compatible with Studio.
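Extending a SageMaker Distribution image can be as small as a few Dockerfile lines. The image URI, tag, and package below are illustrative placeholders, not a confirmed recipe — take the exact base image URI and tag from the image list linked above:

```shell
cat > Dockerfile <<'EOF'
# Placeholder base image; substitute the SageMaker Distribution image URI
# and tag for your Region.
FROM public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu
USER root
# Add custom packages on top of the pre-configured JupyterLab / Code Editor setup.
RUN micromamba install -y -n base -c conda-forge scikit-learn
USER $MAMBA_USER
EOF
# docker build -t my-studio-image .
echo "Dockerfile written"
```

After building, push the image to your Amazon ECR repository and attach it to your domain as a custom image.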

## Migrate lifecycle configurations
<a name="studio-updated-migrate-lcc-lcc"></a>

Because of the simplified local runtime model in Studio, we recommend migrating the structure of your existing Studio Classic LCCs. In Studio Classic, you often have to create separate lifecycle configurations for both KernelGateway and JupyterServer applications. Because the JupyterServer and KernelGateway applications run on separate compute resources within Studio Classic, Studio Classic LCCs can be one of either type: 
+ JupyterServer LCC: These LCCs mostly govern actions in a user’s home environment, including setting a proxy, creating environment variables, and automatically shutting down idle resources.
+ KernelGateway LCC: These LCCs govern Studio Classic notebook environment optimizations, such as updating numpy package versions in the `Data Science 3.0` kernel or installing the snowflake package in the `PyTorch 2.0 GPU` kernel.

In the simplified Studio architecture, you only need one LCC script that runs at application startup. While migration of your LCC scripts varies based on your development environment, we recommend combining your JupyterServer and KernelGateway LCCs into a single combined LCC.
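A combined LCC can be sketched as a single script that handles both sets of concerns at startup. Everything below is illustrative: the proxy value, environment variable, and package are placeholders, not settings that any SageMaker AI feature requires.

```shell
# Write a hypothetical merged JupyterServer + KernelGateway LCC script.
cat > combined-lcc.sh <<'EOF'
#!/bin/bash
set -eux
# Former JupyterServer LCC concerns: proxy and environment variables.
export HTTP_PROXY="http://proxy.example.com:8080"   # placeholder proxy
# Former KernelGateway LCC concerns: package installs into the base environment.
pip install --upgrade numpy
EOF
# Check that the script parses before registering it, for example with
# `aws sagemaker create-studio-lifecycle-config`:
bash -n combined-lcc.sh && echo "LCC script syntax OK"
```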

LCCs in Studio can be associated with one of the following applications: 
+ JupyterLab 
+ Code Editor

Users can select the LCC for the respective application type when creating a space or use the default LCC set by the admin.

**Note**  
Existing Studio Classic auto-shutdown scripts do not work with Studio. For an example Studio auto-shutdown script, see [SageMaker Studio Lifecycle Configuration examples](https://github.com/aws-samples/sagemaker-studio-apps-lifecycle-config-examples).

### Considerations when refactoring LCCs
<a name="studio-updated-migrate-lcc-considerations"></a>

Consider the following differences between Studio Classic and Studio when refactoring your LCCs.
+ JupyterLab and Code Editor applications, when created, run as `sagemaker-user` with `UID:1001` and `GID:101`. By default, `sagemaker-user` can assume sudo (root) permissions. In contrast, KernelGateway applications run as `root` by default.
+ SageMaker Distribution images that run inside JupyterLab and Code Editor apps use the Debian-based package manager, `apt-get`.
+ Studio JupyterLab and Code Editor applications use the Conda package manager. SageMaker AI creates a single base Python3 Conda environment when a Studio application is launched. For information about updating packages in the base Conda environment and creating new Conda environments, see [JupyterLab user guide](studio-updated-jl-user-guide.md). In contrast, not all KernelGateway applications use Conda as a package manager.
+ The Studio JupyterLab application uses `JupyterLab 4.0`, while Studio Classic uses `JupyterLab 3.0`. Validate that all JupyterLab extensions you use are compatible with `JupyterLab 4.0`. For more information about extensions, see [Extension Compatibility with JupyterLab 4.0](https://github.com/jupyterlab/jupyterlab/issues/14590).

# (Optional) Migrate data from Studio Classic to Studio
<a name="studio-updated-migrate-data"></a>

Studio Classic and Studio use two different types of storage volumes. Studio Classic uses a single Amazon Elastic File System (Amazon EFS) volume to store data across all users and shared spaces in the domain. In Studio, each space gets its own Amazon Elastic Block Store (Amazon EBS) volume. When you update the default experience of an existing domain, SageMaker AI automatically mounts a folder in an Amazon EFS volume for each user in a domain. As a result, users are able to access files from Studio Classic in their Studio applications. For more information, see [Amazon EFS auto-mounting in Studio](studio-updated-automount.md). 

You can also opt out of Amazon EFS auto-mounting and manually migrate the data to give users access to files from Studio Classic in their Studio applications. To accomplish this, you must transfer the files from the users' home directories to the Amazon EBS volumes associated with the users' spaces. The following section gives information about this workflow. For more information about opting out of Amazon EFS auto-mounting, see [Opt out of Amazon EFS auto-mounting](studio-updated-automount-optout.md).

## Manually migrate all of your data from Studio Classic
<a name="studio-updated-migrate-data-all"></a>

The following section describes how to migrate all of the data from your Studio Classic storage volume to the new Studio experience.

When manually migrating a user's data, code, and artifacts from Studio Classic to Studio, we recommend one of the following approaches:

1. Using a custom Amazon EFS volume

1. Using Amazon Simple Storage Service (Amazon S3)

If you used Amazon SageMaker Data Wrangler in Studio Classic and want to migrate your data flow files, then choose one of the following options for migration:
+ If you want to migrate all of the data from your Studio Classic storage volume, including your data flow files, go to [Manually migrate all of your data from Studio Classic](#studio-updated-migrate-data-all) and complete the section **Use Amazon S3 to migrate data**. Then, skip to the [Import the flow files into Canvas](#studio-updated-migrate-flows-import) section.
+ If you only want to migrate your data flow files and no other data from your Studio Classic storage volume, skip to the [Migrate data flows from Data Wrangler](#studio-updated-migrate-flows) section.

### Prerequisites
<a name="studio-updated-migrate-data-prereq"></a>

Before running these steps, complete the prerequisites in [Complete prerequisites to migrate the Studio experience](studio-updated-migrate-prereq.md). You must also complete the steps in [Migrate the UI from Studio Classic to Studio](studio-updated-migrate-ui.md).

### Choosing an approach
<a name="studio-updated-migrate-data-choose"></a>

Consider the following when choosing an approach to migrate your Studio Classic data.

**Pros and cons of using a custom Amazon EFS volume**

In this approach, you use an Amazon EFS-to-Amazon EFS AWS DataSync task, run once or on a regular cadence, to copy data, and then mount the target Amazon EFS volume to a user’s spaces. This gives users access to data from Studio Classic in their Studio compute environments.

Pros:
+ Only the user’s home directory data is visible in the user's spaces. There is no data cross-pollination.
+ Syncing from the source Amazon EFS volume to a target Amazon EFS volume is safer than directly mounting the SageMaker AI-managed source Amazon EFS volume into spaces. This avoids the risk of impacting users' home directory files.
+ Users have the flexibility to continue working in Studio Classic and Studio applications, while having their data available in both applications if AWS DataSync is set up on a regular cadence.
+ No need for repeated push and pull with Amazon S3.

Cons:
+ No write access to the target Amazon EFS volume mounted to users' spaces. To get write access, you must mount the target Amazon EFS volume to an Amazon Elastic Compute Cloud (Amazon EC2) instance and grant users the appropriate permissions to write to the Amazon EFS prefix.
+ Requires modification to the security groups managed by SageMaker AI to allow network file system (NFS) inbound and outbound flow.
+ Costs more than using Amazon S3.
+ If [migrating data flows from Data Wrangler in Studio Classic](#studio-updated-migrate-flows), you must follow the steps for manually exporting flow files.

**Pros and cons of using Amazon S3**

In this approach, you use an Amazon EFS-to-Amazon S3 AWS DataSync task, run once or on a regular cadence, to copy data, and then create a lifecycle configuration to copy the user’s data from Amazon S3 to their private space’s Amazon EBS volume.

Pros:
+ If the lifecycle configuration (LCC) is attached to the domain, users can choose to use the LCC to copy data to their space or to run the space with no LCC script. This gives users the choice to copy their files only to the spaces they need.
+ If an AWS DataSync task is set up on a cadence, users can restart their Studio application to get the latest files.
+ Because the data is copied over to Amazon EBS, users have write permissions on the files.
+ Amazon S3 storage is cheaper than Amazon EFS.
+ If [migrating data flows from Data Wrangler in Studio Classic](#studio-updated-migrate-flows), you can skip the manual export steps and directly import the data flows into SageMaker Canvas from Amazon S3.

Cons:
+ If administrators need to prevent cross-pollination, they must create AWS Identity and Access Management policies at the user level to ensure users can only access the Amazon S3 prefix that contains their files.
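
To prevent cross-pollination with the Amazon S3 approach, an administrator could scope each user's access with a per-user policy along the following lines. This is a sketch, not a complete policy: the bucket name, `studio-new` prefix, and UID value are placeholders from this guide, and you would attach the resulting document to each user profile's execution role.

```
# Sketch: build a policy that limits one user to their own Amazon S3 prefix.
# BUCKET_NAME and USER_UID (the user's HomeEfsFileSystemUid) are placeholders.
export BUCKET_NAME=s3-bucket-name
export USER_UID=200001

export USER_POLICY=$(cat <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::$BUCKET_NAME",
            "Condition": {
                "StringLike": {"s3:prefix": "studio-new/$USER_UID/*"}
            }
        },
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::$BUCKET_NAME/studio-new/$USER_UID/*"
        }
    ]
}
EOF
)
echo "$USER_POLICY"
```

You could then attach the generated policy to each user profile's execution role, for example with `aws iam put-role-policy`.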

### Use a custom Amazon EFS volume to migrate data
<a name="studio-updated-migrate-data-approach1"></a>

In this approach, you use an Amazon EFS-to-Amazon EFS AWS DataSync task to copy the contents of a Studio Classic Amazon EFS volume to a target Amazon EFS volume, either once or on a regular cadence, and then mount the target Amazon EFS volume to a user’s spaces. This gives users access to data from Studio Classic in their Studio compute environments.

1. Create a target Amazon EFS volume. You will transfer data into this Amazon EFS volume and mount it to a corresponding user's space using prefix-level mounting.

   ```
   export SOURCE_DOMAIN_ID="domain-id"
   export REGION="region"
   export AWS_REGION=$REGION
   
   export TARGET_EFS=$(aws efs create-file-system --performance-mode generalPurpose --throughput-mode bursting --encrypted --region $REGION | jq -r '.FileSystemId')
   
   echo "Target EFS volume Created: $TARGET_EFS"
   ```

1. Add variables for the source Amazon EFS volume currently attached to the domain and used by all users. The domain's Amazon Virtual Private Cloud information is required to ensure the target Amazon EFS is created in the same Amazon VPC and subnet, with the same security group configuration.

   ```
   export SOURCE_EFS=$(aws sagemaker describe-domain --domain-id $SOURCE_DOMAIN_ID | jq -r '.HomeEfsFileSystemId')
   export VPC_ID=$(aws sagemaker describe-domain --domain-id $SOURCE_DOMAIN_ID | jq -r '.VpcId')
   
   echo "EFS managed by SageMaker: $SOURCE_EFS | VPC: $VPC_ID"
   ```

1. Create an Amazon EFS mount target in the same Amazon VPC and subnet as the source Amazon EFS volume, with the same security group configuration. The mount target takes a few minutes to be available.

   ```
   export EFS_VPC_ID=$(aws efs describe-mount-targets --file-system-id $SOURCE_EFS | jq -r ".MountTargets[0].VpcId")
   export EFS_AZ_NAME=$(aws efs describe-mount-targets --file-system-id $SOURCE_EFS | jq -r ".MountTargets[0].AvailabilityZoneName")
   export EFS_AZ_ID=$(aws efs describe-mount-targets --file-system-id $SOURCE_EFS | jq -r ".MountTargets[0].AvailabilityZoneId")
   export EFS_SUBNET_ID=$(aws efs describe-mount-targets --file-system-id $SOURCE_EFS | jq -r ".MountTargets[0].SubnetId")
   export EFS_MOUNT_TARG_ID=$(aws efs describe-mount-targets --file-system-id $SOURCE_EFS | jq -r ".MountTargets[0].MountTargetId")
   export EFS_SG_IDS=$(aws efs describe-mount-target-security-groups --mount-target-id $EFS_MOUNT_TARG_ID | jq -r '.SecurityGroups[]')
   
   aws efs create-mount-target \
   --file-system-id $TARGET_EFS \
   --subnet-id $EFS_SUBNET_ID \
   --security-groups $EFS_SG_IDS
   ```

1. Create Amazon EFS source and destination locations for the AWS DataSync task.

   ```
   export SOURCE_EFS_ARN=$(aws efs describe-file-systems --file-system-id $SOURCE_EFS | jq -r ".FileSystems[0].FileSystemArn")
   export TARGET_EFS_ARN=$(aws efs describe-file-systems --file-system-id $TARGET_EFS | jq -r ".FileSystems[0].FileSystemArn")
   export EFS_SUBNET_ID_ARN=$(aws ec2 describe-subnets --subnet-ids $EFS_SUBNET_ID | jq -r ".Subnets[0].SubnetArn")
   export ACCOUNT_ID=$(aws ec2 describe-security-groups --group-ids $EFS_SG_IDS | jq -r ".SecurityGroups[0].OwnerId")
   export EFS_SG_ID_ARN=arn:aws:ec2:$REGION:$ACCOUNT_ID:security-group/$EFS_SG_IDS
   
   export SOURCE_LOCATION_ARN=$(aws datasync create-location-efs --subdirectory "/" --efs-filesystem-arn $SOURCE_EFS_ARN --ec2-config SubnetArn=$EFS_SUBNET_ID_ARN,SecurityGroupArns=$EFS_SG_ID_ARN --region $REGION | jq -r ".LocationArn")
   export DESTINATION_LOCATION_ARN=$(aws datasync create-location-efs --subdirectory "/" --efs-filesystem-arn $TARGET_EFS_ARN --ec2-config SubnetArn=$EFS_SUBNET_ID_ARN,SecurityGroupArns=$EFS_SG_ID_ARN --region $REGION | jq -r ".LocationArn")
   ```

1. Allow traffic between the source and target network file system (NFS) mounts. When a new domain is created, SageMaker AI creates two security groups:
   + NFS inbound security group with only inbound traffic.
   + NFS outbound security group with only outbound traffic.

   The source and target NFS are placed inside the same security groups. You can allow traffic between these mounts from the AWS Management Console or AWS CLI.
   + Allow traffic from the AWS Management Console

     1. Sign in to the AWS Management Console and open the Amazon VPC console at [https://console.aws.amazon.com/vpc/](https://console.aws.amazon.com/vpc/).

     1. Choose **Security Groups**.

     1. Search for the existing domain's ID on the **Security Groups** page.

        ```
        d-xxxxxxx
        ```

        The results should return two security groups that include the domain ID in the name.
        + `security-group-for-inbound-nfs-domain-id`
        + `security-group-for-outbound-nfs-domain-id`

     1. Select the inbound security group ID. This opens a new page with details about the security group.

     1. Select the **Outbound Rules** tab.

     1. Select **Edit outbound rules**.

     1. Update the existing outbound rules or add a new outbound rule with the following values:
        + **Type**: NFS
        + **Protocol**: TCP
        + **Port range**: 2049
        + **Destination**: The security group ID of security-group-for-outbound-nfs-*domain-id*

     1. Choose **Save rules**.

     1. Return to the **Security Groups** page and select the outbound security group ID. This opens a new page with details about the security group.

     1. Select the **Inbound Rules** tab.

     1. Select **Edit inbound rules**.

     1. Update the existing inbound rules or add a new inbound rule with the following values:
        + **Type**: NFS
        + **Protocol**: TCP
        + **Port range**: 2049
        + **Source**: The security group ID of security-group-for-inbound-nfs-*domain-id*

     1. Choose **Save rules**.
   + Allow traffic from the AWS CLI

     1. Update the security group inbound and outbound rules with the following values:
        + **Protocol**: TCP
        + **Port range**: 2049
        + **Group ID**: Inbound security group ID or outbound security group ID

        ```
        export INBOUND_SG_ID=$(aws ec2 describe-security-groups --filters "Name=group-name,Values=security-group-for-inbound-nfs-$SOURCE_DOMAIN_ID" | jq -r ".SecurityGroups[0].GroupId")
        export OUTBOUND_SG_ID=$(aws ec2 describe-security-groups --filters "Name=group-name,Values=security-group-for-outbound-nfs-$SOURCE_DOMAIN_ID" | jq -r ".SecurityGroups[0].GroupId")
        
        echo "Outbound SG ID: $OUTBOUND_SG_ID | Inbound SG ID: $INBOUND_SG_ID"
        aws ec2 authorize-security-group-egress \
        --group-id $INBOUND_SG_ID \
        --protocol tcp --port 2049 \
        --source-group $OUTBOUND_SG_ID
        
        aws ec2 authorize-security-group-ingress \
        --group-id $OUTBOUND_SG_ID \
        --protocol tcp --port 2049 \
        --source-group $INBOUND_SG_ID
        ```

     1. Add both the inbound and outbound security groups to the source and target Amazon EFS mount targets. This allows traffic between the two Amazon EFS mounts.

        ```
        export SOURCE_EFS_MOUNT_TARGET=$(aws efs describe-mount-targets --file-system-id $SOURCE_EFS | jq -r ".MountTargets[0].MountTargetId")
        export TARGET_EFS_MOUNT_TARGET=$(aws efs describe-mount-targets --file-system-id $TARGET_EFS | jq -r ".MountTargets[0].MountTargetId")
        
        aws efs modify-mount-target-security-groups \
        --mount-target-id $SOURCE_EFS_MOUNT_TARGET \
        --security-groups $INBOUND_SG_ID $OUTBOUND_SG_ID
        
        aws efs modify-mount-target-security-groups \
        --mount-target-id $TARGET_EFS_MOUNT_TARGET \
        --security-groups $INBOUND_SG_ID $OUTBOUND_SG_ID
        ```

1. Create an AWS DataSync task. This returns a task ARN that you can use to run the task on demand or on a regular cadence.

   ```
   export EXTRA_XFER_OPTIONS='VerifyMode=ONLY_FILES_TRANSFERRED,OverwriteMode=ALWAYS,Atime=NONE,Mtime=NONE,Uid=NONE,Gid=NONE,PreserveDeletedFiles=REMOVE,PreserveDevices=NONE,PosixPermissions=NONE,TaskQueueing=ENABLED,TransferMode=CHANGED,SecurityDescriptorCopyFlags=NONE,ObjectTags=NONE'
   export DATASYNC_TASK_ARN=$(aws datasync create-task --source-location-arn $SOURCE_LOCATION_ARN --destination-location-arn $DESTINATION_LOCATION_ARN --name "SMEFS_to_CustomEFS_Sync" --region $REGION --options $EXTRA_XFER_OPTIONS | jq -r ".TaskArn")
   ```

1. Start an AWS DataSync task to automatically copy data from the source Amazon EFS volume to the target Amazon EFS mount. Because the task doesn't retain the files' POSIX permissions, users can read from the target Amazon EFS mount, but not write to it.

   ```
   aws datasync start-task-execution --task-arn $DATASYNC_TASK_ARN
   ```

1. Mount the target Amazon EFS volume on the domain at the root level.

   ```
   aws sagemaker update-domain --domain-id $SOURCE_DOMAIN_ID \
   --default-user-settings '{"CustomFileSystemConfigs": [{"EFSFileSystemConfig": {"FileSystemId": "'"$TARGET_EFS"'", "FileSystemPath": "/"}}]}'
   ```

1. Overwrite every user profile with a `FileSystemPath` prefix. The prefix includes the user’s UID, which is created by SageMaker AI. This ensures users only have access to their own data and prevents cross-pollination. When a space is created in the domain and the target Amazon EFS volume is mounted to the application, the user’s prefix overwrites the domain prefix. As a result, SageMaker AI only mounts the `/user-id` directory on the user's application.

   ```
   aws sagemaker list-user-profiles --domain-id $SOURCE_DOMAIN_ID | jq -r '.UserProfiles[] | "\(.UserProfileName)"' | while read user; do
   export uid=$(aws sagemaker describe-user-profile --domain-id $SOURCE_DOMAIN_ID --user-profile-name $user | jq -r ".HomeEfsFileSystemUid")
   echo "$user $uid"
   aws sagemaker update-user-profile --domain-id $SOURCE_DOMAIN_ID --user-profile-name $user --user-settings '{"CustomFileSystemConfigs": [{"EFSFileSystemConfig":{"FileSystemId": "'"$TARGET_EFS"'", "FileSystemPath": "'"/$uid/"'"}}]}'
   done
   ```

1. Users can then select the custom Amazon EFS file system when launching an application. For more information, see [JupyterLab user guide](studio-updated-jl-user-guide.md) or [Launch a Code Editor application in Studio](code-editor-use-studio.md).

### Use Amazon S3 to migrate data
<a name="studio-updated-migrate-data-approach2"></a>

In this approach, you use an Amazon EFS-to-Amazon S3 AWS DataSync task to copy the contents of a Studio Classic Amazon EFS volume to an Amazon S3 bucket once or in a regular cadence, then create a lifecycle configuration to copy the user’s data from Amazon S3 to their private space’s Amazon EBS volume.

**Note**  
This approach only works for domains that have internet access.

1. Set the source Amazon EFS volume ID from the domain containing the data that you are migrating.

   ```
   timestamp=$(date +%Y%m%d%H%M%S)
   export SOURCE_DOMAIN_ID="domain-id"
   export REGION="region"
   export AWS_REGION=$REGION
   export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
   export EFS_ID=$(aws sagemaker describe-domain --domain-id $SOURCE_DOMAIN_ID | jq -r '.HomeEfsFileSystemId')
   ```

1. Set the target Amazon S3 bucket name. For information about creating an Amazon S3 bucket, see [Creating a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html). The bucket used must have a CORS policy as described in [(Optional) Update your CORS policy to access Amazon S3 buckets](studio-updated-migrate-ui.md#studio-updated-migrate-cors). Users in the domain must also have permissions to access the Amazon S3 bucket.

   In this example, we are copying files to a prefix named `studio-new`. If you are using a single Amazon S3 bucket to migrate multiple domains, use the `studio-new/<domain-id>` prefix to restrict permissions to the files using IAM.

   ```
   export BUCKET_NAME=s3-bucket-name
   export S3_DESTINATION_PATH=studio-new
   ```
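
   If you're consolidating multiple domains into a single bucket, you can derive the per-domain prefix mentioned above from the domain ID that you set in the previous step. A minimal sketch, using the same placeholder values as this guide:

   ```
   # Build a per-domain destination prefix so each domain's files stay separate.
   export SOURCE_DOMAIN_ID="domain-id"
   export S3_DESTINATION_PATH=studio-new/$SOURCE_DOMAIN_ID
   echo "Destination prefix: $S3_DESTINATION_PATH"
   ```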

1. Create a trust policy that allows AWS DataSync to assume an IAM role in your account.

   ```
   export TRUST_POLICY=$(cat <<EOF
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Principal": {
                   "Service": "datasync.amazonaws.com"
               },
               "Action": "sts:AssumeRole",
               "Condition": {
                   "StringEquals": {
                       "aws:SourceAccount": "$ACCOUNT_ID"
                   },
                   "ArnLike": {
                       "aws:SourceArn": "arn:aws:datasync:$REGION:$ACCOUNT_ID:*"
                   }
               }
           }
       ]
   }
   EOF
   )
   ```

1. Create an IAM role and attach the trust policy.

   ```
   export timestamp=$(date +%Y%m%d%H%M%S)
   export ROLE_NAME="DataSyncS3Role-$timestamp"
   
   aws iam create-role --role-name $ROLE_NAME --assume-role-policy-document "$TRUST_POLICY"
   aws iam attach-role-policy --role-name $ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
   echo "Attached IAM Policy AmazonS3FullAccess"
   aws iam attach-role-policy --role-name $ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
   echo "Attached IAM Policy AmazonSageMakerFullAccess"
   export ROLE_ARN=$(aws iam get-role --role-name $ROLE_NAME --query 'Role.Arn' --output text)
   echo "Created IAM Role $ROLE_ARN"
   ```

1. Create a security group to give access to the Amazon EFS location.

   ```
   export EFS_ARN=$(aws efs describe-file-systems --file-system-id $EFS_ID | jq -r '.FileSystems[0].FileSystemArn' )
   export EFS_SUBNET_ID=$(aws efs describe-mount-targets --file-system-id $EFS_ID | jq -r '.MountTargets[0].SubnetId')
   export EFS_VPC_ID=$(aws efs describe-mount-targets --file-system-id $EFS_ID | jq -r '.MountTargets[0].VpcId')
   export MOUNT_TARGET_ID=$(aws efs describe-mount-targets --file-system-id $EFS_ID | jq -r '.MountTargets[0].MountTargetId')
   export EFS_SECURITY_GROUP_ID=$(aws efs describe-mount-target-security-groups --mount-target-id $MOUNT_TARGET_ID | jq -r '.SecurityGroups[0]')
   export EFS_SUBNET_ARN=$(aws ec2 describe-subnets --subnet-ids $EFS_SUBNET_ID | jq -r '.Subnets[0].SubnetArn')
   echo "Subnet ID: $EFS_SUBNET_ID"
   echo "Security Group ID: $EFS_SECURITY_GROUP_ID"
   echo "Subnet ARN: $EFS_SUBNET_ARN"
   
   timestamp=$(date +%Y%m%d%H%M%S)
   sg_name="datasync-sg-$timestamp"
   export DATASYNC_SG_ID=$(aws ec2 create-security-group --vpc-id $EFS_VPC_ID --group-name $sg_name --description "DataSync SG" --output text --query 'GroupId')
   aws ec2 authorize-security-group-egress --group-id $DATASYNC_SG_ID --protocol tcp --port 2049 --source-group $EFS_SECURITY_GROUP_ID
   aws ec2 authorize-security-group-ingress --group-id $EFS_SECURITY_GROUP_ID --protocol tcp --port 2049 --source-group $DATASYNC_SG_ID
   export DATASYNC_SG_ARN="arn:aws:ec2:$REGION:$ACCOUNT_ID:security-group/$DATASYNC_SG_ID"
   echo "Security Group ARN: $DATASYNC_SG_ARN"
   ```

1. Create a source Amazon EFS location for the AWS DataSync task.

   ```
   export SOURCE_ARN=$(aws datasync create-location-efs --efs-filesystem-arn $EFS_ARN --ec2-config "{\"SubnetArn\": \"$EFS_SUBNET_ARN\", \"SecurityGroupArns\": [\"$DATASYNC_SG_ARN\"]}" | jq -r '.LocationArn')
   echo "Source Location ARN: $SOURCE_ARN"
   ```

1. Create a target Amazon S3 location for the AWS DataSync task.

   ```
   export BUCKET_ARN="arn:aws:s3:::$BUCKET_NAME"
   export DESTINATION_ARN=$(aws datasync create-location-s3 --s3-bucket-arn $BUCKET_ARN --s3-config "{\"BucketAccessRoleArn\": \"$ROLE_ARN\"}" --subdirectory $S3_DESTINATION_PATH | jq -r '.LocationArn')
   echo "Destination Location ARN: $DESTINATION_ARN"
   ```

1. Create an AWS DataSync task.

   ```
   export TASK_ARN=$(aws datasync create-task --source-location-arn $SOURCE_ARN --destination-location-arn $DESTINATION_ARN | jq -r '.TaskArn')
   echo "DataSync Task: $TASK_ARN"
   ```

1. Start the AWS DataSync task. This task automatically copies data from the source Amazon EFS volume to the target Amazon S3 bucket. Wait for the task to be complete.

   ```
   aws datasync start-task-execution --task-arn $TASK_ARN
   ```

1. Check the status of the AWS DataSync task to verify that it is complete. Pass the task execution ARN that the `start-task-execution` command returned in the previous step.

   ```
   export TASK_EXEC_ARN=datasync-task-execution-arn
   echo "Task execution ARN: $TASK_EXEC_ARN"
   export STATUS=$(aws datasync describe-task-execution --task-execution-arn $TASK_EXEC_ARN | jq -r '.Status')
   echo "Execution status: $STATUS"
   while [ "$STATUS" = "QUEUED" ] || [ "$STATUS" = "LAUNCHING" ] || [ "$STATUS" = "PREPARING" ] || [ "$STATUS" = "TRANSFERRING" ] || [ "$STATUS" = "VERIFYING" ]; do
       STATUS=$(aws datasync describe-task-execution --task-execution-arn $TASK_EXEC_ARN | jq -r '.Status')
       if [ $? -ne 0 ]; then
           echo "Error Running DataSync Task"
           exit 1
       fi
       echo "Execution status: $STATUS"
       sleep 30
   done
   ```

1. After the AWS DataSync task is complete, clean up the previously created resources.

   ```
   aws datasync delete-task --task-arn $TASK_ARN
   echo "Deleted task $TASK_ARN"
   aws datasync delete-location --location-arn $SOURCE_ARN
   echo "Deleted location source $SOURCE_ARN"
   aws datasync delete-location --location-arn $DESTINATION_ARN
   echo "Deleted location source $DESTINATION_ARN"
   aws iam detach-role-policy --role-name $ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
   aws iam detach-role-policy --role-name $ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
   aws iam delete-role --role-name $ROLE_NAME
   echo "Deleted IAM Role $ROLE_NAME"
   echo "Wait 5 minutes for the elastic network interface to detach..."
   sleep 300
   aws ec2 revoke-security-group-ingress --group-id $EFS_SECURITY_GROUP_ID --protocol tcp --port 2049 --source-group $DATASYNC_SG_ID
   echo "Revoked Ingress from $EFS_SECURITY_GROUP_ID"
   aws ec2 revoke-security-group-egress --group-id $DATASYNC_SG_ID --protocol tcp --port 2049 --source-group $EFS_SECURITY_GROUP_ID
   echo "Revoked Egress from $DATASYNC_SG_ID"
   aws ec2 delete-security-group --group-id $DATASYNC_SG_ID
   echo "Deleted DataSync SG $DATASYNC_SG_ID"
   ```

1. From your local machine, create a file named `on-start.sh` with the following content. This script copies the user’s Studio Classic home directory from the Amazon S3 prefix for that user profile to the user’s Amazon EBS volume in Studio.

   ```
   #!/bin/bash
   set -eo pipefail
   
   sudo apt-get install -y jq
   
   # Studio Variables
   DOMAIN_ID=$(cat /opt/ml/metadata/resource-metadata.json | jq -r '.DomainId')
   SPACE_NAME=$(cat /opt/ml/metadata/resource-metadata.json | jq -r '.SpaceName')
   USER_PROFILE_NAME=$(aws sagemaker describe-space --domain-id=$DOMAIN_ID --space-name=$SPACE_NAME | jq -r '.OwnershipSettings.OwnerUserProfileName')
   
   # S3 bucket to copy from
   BUCKET=s3-bucket-name
   # Subfolder in bucket to copy
   PREFIX=studio-new
   
   # Getting HomeEfsFileSystemUid for the current user-profile
   EFS_FOLDER_ID=$(aws sagemaker describe-user-profile --domain-id $DOMAIN_ID --user-profile-name $USER_PROFILE_NAME | jq -r '.HomeEfsFileSystemUid')
   
   # Local destination directory
   DEST=./studio-classic-efs-backup
   mkdir -p $DEST
   
   echo "Bucket: s3://$BUCKET/$PREFIX/$EFS_FOLDER_ID/"
   echo "Destination $DEST/"
   echo "Excluding .*"
   echo "Excluding **/.*"
   
   aws s3 cp s3://$BUCKET/$PREFIX/$EFS_FOLDER_ID/ $DEST/ \
       --exclude ".*" \
       --exclude "**/.*" \
       --recursive
   ```

1. Convert your script into base64 format. This requirement prevents errors that occur from spacing and line break encoding. The script type can be either `JupyterLab` or `CodeEditor`.

   ```
   export LCC_SCRIPT_NAME='studio-classic-sync'
   export SCRIPT_FILE_NAME='on-start.sh'
   export SCRIPT_TYPE='JupyterLab-or-CodeEditor'
   LCC_CONTENT=$(openssl base64 -A -in ${SCRIPT_FILE_NAME})
   ```
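
   Before creating the lifecycle configuration, you can optionally verify that the encoded content decodes back to the original script byte for byte. The following sketch uses a stand-in file name so it doesn't touch your actual `on-start.sh`:

   ```
   # Round-trip check: encode a sample script, decode it, and compare the bytes.
   printf '#!/bin/bash\necho hello\n' > sample-lcc.sh
   ENCODED=$(openssl base64 -A -in sample-lcc.sh)
   printf '%s' "$ENCODED" | openssl base64 -d -A > decoded-lcc.sh
   diff sample-lcc.sh decoded-lcc.sh && echo "Round-trip OK"
   ```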

1. Verify the following before you use the script: 
   + The Amazon EBS volume is large enough to store the objects that you're exporting.
   + You aren't unintentionally migrating hidden files and folders, such as `.bashrc` and `.condarc`. The preceding script excludes them by default.
   + The AWS Identity and Access Management (IAM) execution role that's associated with Studio user profiles has the policies configured to access only the respective home directory in Amazon S3.
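
   To check the first item, you can inspect the free space on the space's Amazon EBS volume from a terminal in your Studio application. A quick sketch:

   ```
   # Report available space in the home directory before running the sync.
   df -h "$HOME"
   AVAILABLE_KB=$(df -k "$HOME" | awk 'NR==2 {print $4}')
   echo "Available: ${AVAILABLE_KB} KB"
   ```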

1. Create a lifecycle configuration using your script.

   ```
   aws sagemaker create-studio-lifecycle-config \
       --studio-lifecycle-config-name $LCC_SCRIPT_NAME \
       --studio-lifecycle-config-content $LCC_CONTENT \
       --studio-lifecycle-config-app-type $SCRIPT_TYPE
   ```

1. Attach the LCC to your domain. The following example attaches a `JupyterLab` LCC; if you created a `CodeEditor` lifecycle configuration, use `CodeEditorAppSettings` instead. Replace `lifecycle-config-arn` with the ARN of the LCC that you created.

   ```
   aws sagemaker update-domain \
       --domain-id $SOURCE_DOMAIN_ID \
       --default-user-settings '
           {"JupyterLabAppSettings":
               {"LifecycleConfigArns":
                   [
                       "lifecycle-config-arn"
                   ]
               }
           }'
   ```

1. Users can then select the LCC script when launching an application. For more information, see [JupyterLab user guide](studio-updated-jl-user-guide.md) or [Launch a Code Editor application in Studio](code-editor-use-studio.md). This automatically syncs the files from Amazon S3 to the Amazon EBS storage for the user's space.

## Migrate data flows from Data Wrangler
<a name="studio-updated-migrate-flows"></a>

If you have previously used Amazon SageMaker Data Wrangler in Amazon SageMaker Studio Classic for data preparation tasks, you can migrate to the new Amazon SageMaker Studio and access the latest version of Data Wrangler in Amazon SageMaker Canvas. Data Wrangler in SageMaker Canvas provides you with an enhanced user experience and access to the latest features, such as a natural language interface and faster performance.

You can onboard to SageMaker Canvas at any time to begin using the new Data Wrangler experience. For more information, see [Getting started with using Amazon SageMaker Canvas](canvas-getting-started.md).

If you have data flow files saved in Studio Classic that you were previously working on, you can onboard to Studio and then import the flow files into Canvas. You have the following options for migration:
+ One-click migration: When you sign in to Canvas, you can use a one-time import option that migrates all of your flow files on your behalf.
+ Manual migration: You can manually import your flow files into Canvas. From Studio Classic, either export the files to Amazon S3 or download them to your local machine. Then, you sign in to the SageMaker Canvas application, import the flow files, and continue your data preparation tasks.

The following guide describes the prerequisites to migration and how to migrate your data flow files using either the one-click or manual option.

### Prerequisites
<a name="studio-updated-migrate-flows-prereqs"></a>

Review the following prerequisites before you begin migrating your flow files.

**Step 1. Migrate the domain and grant permissions**

Before migrating data flow files, you need to follow specific steps of the [Migration from Amazon SageMaker Studio Classic](studio-updated-migrate.md) guide to ensure that your user profile's IAM execution role has the required permissions. Before proceeding, follow the [Prerequisites](studio-updated-migrate-prereq.md) and [Migrate the UI from Studio Classic to Studio](studio-updated-migrate-ui.md) pages, which describe how to grant the required permissions, configure Studio as the new experience, and migrate your existing domain.

Specifically, you must have permissions to create a SageMaker Canvas application and use the SageMaker Canvas data preparation features. To obtain these permissions, you can either:
+ Add the [AmazonSageMakerCanvasDataPrepFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasDataPrepFullAccess.html) policy to your IAM role, or
+ Attach a least-permissions policy, as shown in the **(Optional) Migrate from Data Wrangler in Studio Classic to SageMaker Canvas** section of the page [Migrate the UI from Studio Classic to Studio](studio-updated-migrate-ui.md).

Make sure to use the same user profile for both Studio and SageMaker Canvas.

After completing the prerequisites outlined in the migration guide, you should have a new domain with the required permissions to access SageMaker Canvas through Studio.

**Step 2. (Optional) Prepare an Amazon S3 location**

If you are doing a manual migration and plan to use Amazon S3 to transfer your flow files instead of using the local download option, you should have an Amazon S3 bucket in your account that you'd like to use for storing the flow files.

### One-click migration method
<a name="studio-updated-migrate-flows-auto"></a>

SageMaker Canvas offers a one-time import option for migrating your data flows from Data Wrangler in Studio Classic to Data Wrangler in SageMaker Canvas. As long as your Studio Classic and Canvas applications share the same Amazon EFS storage volume, you can migrate in one click from Canvas. This streamlined process eliminates the need for manual export and import steps, and you can import all of your flows at once.

Use the following procedure to migrate all of your flow files:

1. Open your latest version of Studio.

1. In Studio, in the left navigation pane, choose the **Data** dropdown menu.

1. From the navigation options, choose **Data Wrangler**.

1. On the **Data Wrangler** page, choose **Run in Canvas**. If you have successfully set up the permissions, this creates a Canvas application for you. The Canvas application may take a few minutes before it's ready. 

1. When Canvas is ready, choose **Open in Canvas.**

1. Canvas opens to the **Data Wrangler** page, and a banner appears at the top of the page that says *Import your data flows from Data Wrangler in Studio Classic to Canvas. This is a one time import. Learn more.* In the banner, choose **Import All**.
**Warning**  
If you close the banner notification, you won't be able to re-open it or use the one-click migration method anymore. 

A pop-up notification appears, indicating that Canvas is importing your flow files from Studio Classic. If the import is fully successful, you receive another notification that `X` number of flow files were imported, and you can see your flow files on the **Data Wrangler** page of the Canvas application. Any imported flow files that have the same name as existing data flows in your Canvas application are renamed with a postfix. You can open a data flow to verify that it looks as expected.

In case any of your flow files don't import successfully, you receive a notification that the import was either partially successful or failed. Choose **View errors** on the notification message to check the individual error messages for guidance on how to reformat any incorrectly formatted flow files.

After importing your flow files, you should now be able to continue using Data Wrangler to prepare data in SageMaker Canvas.

### Manual migration method
<a name="studio-updated-migrate-flows-manual"></a>

The following sections describe how to manually import your flow files into Canvas in case the one-click migration method didn't work.

#### Export the flow files from Studio Classic
<a name="studio-updated-migrate-flows-export"></a>

**Note**  
If you've already migrated your Studio Classic data to Amazon S3 by following the instructions in [(Optional) Migrate data from Studio Classic to Studio](#studio-updated-migrate-data), you can skip this step and go straight to the [Import the flow files into Canvas](#studio-updated-migrate-flows-import) section in which you import your flow files from the Amazon S3 location where your Studio Classic data is stored.

You can export your flow files by either saving them to Amazon S3 or downloading them to your local machine. When you import your flow files into SageMaker Canvas in the next step, if you choose the local upload option, then you can only upload 20 flow files at a time. If you have a large number of flow files to import, we recommend that you use Amazon S3 instead.
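
To decide between the two methods, you can count your flow files from a Studio Classic terminal. A sketch, run from the directory that contains your flows:

```
# Count .flow files to decide whether local upload (max 20 at a time) is practical.
FLOW_COUNT=$(find . -type f -name "*.flow" | wc -l | tr -d ' ')
echo "Found $FLOW_COUNT flow files"
if [ "$FLOW_COUNT" -gt 20 ]; then
    echo "More than 20 flow files: consider the Amazon S3 method."
fi
```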

Follow the instructions in either [Method 1: Use Amazon S3 to transfer flow files](#studio-updated-migrate-flows-export-s3) or [Method 2: Use your local machine to transfer flow files](#studio-updated-migrate-flows-export-local) to proceed.

##### Method 1: Use Amazon S3 to transfer flow files
<a name="studio-updated-migrate-flows-export-s3"></a>

With this method, you use Amazon S3 as the intermediary between Data Wrangler in Studio Classic and Data Wrangler in SageMaker Canvas (accessed through the latest version of Studio). You export the flow files from Studio Classic to Amazon S3, and then in the next step, you access Canvas through Studio and import the flow files from Amazon S3.

Make sure that you have an Amazon S3 bucket prepared as the storage location for the flow files.

Use the following procedure to export your flow files from Studio Classic to Amazon S3:

1. Open Studio Classic.

1. Open a new terminal by doing the following:

   1. On the top navigation bar, choose **File**.

   1. In the context menu, hover over **New**, and then select **Terminal**.

1. By default, the terminal should open in your home directory. Navigate to the folder that contains all of the flow files that you want to migrate.

1. Use the following command to synchronize all of the flow files to the specified Amazon S3 location. Replace `{bucket-name}` and `{folder}` with the path to your desired Amazon S3 location. For more information about the command and parameters, see the [sync](https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html) command in the AWS CLI Command Reference.

   ```
   aws s3 sync . s3://{bucket-name}/{folder}/ --exclude "*.*" --include "*.flow"
   ```

   If you are using your own AWS KMS key, then use the following command instead to synchronize the files, and specify your KMS key ID. Make sure that the user's IAM execution role (which should be the same role used in **Step 1. Migrate the domain and grant permissions** of the preceding [Prerequisites](#studio-updated-migrate-flows-prereqs)) has been granted access to use the KMS key.

   ```
   aws s3 sync . s3://{bucket-name}/{folder}/ --exclude "*.*" --include "*.flow" --sse-kms-key-id {your-key-id}
   ```
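   If the sync returns an access-denied error, check the execution role's key permissions. The following is a minimal sketch of the kind of IAM policy statement typically needed for SSE-KMS uploads; the actions shown and the placeholder key ARN are illustrative, not an exact requirement for your setup:

   ```
   {
     "Effect": "Allow",
     "Action": [
       "kms:Decrypt",
       "kms:GenerateDataKey"
     ],
     "Resource": "arn:aws:kms:{region}:{account-id}:key/{your-key-id}"
   }
   ```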

Your flow files should now be exported. You can check your Amazon S3 bucket to make sure that the flow files synchronized successfully.
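For example, you can list the objects at the destination to confirm that every flow file was uploaded; the placeholders are the same ones used in the sync command above:

```
aws s3 ls s3://{bucket-name}/{folder}/ --recursive
```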

To import these files in the latest version of Data Wrangler, follow the steps in [Import the flow files into Canvas](#studio-updated-migrate-flows-import).

##### Method 2: Use your local machine to transfer flow files
<a name="studio-updated-migrate-flows-export-local"></a>

With this method, you download the flow files from Studio Classic to your local machine. You can download the files directly, or you can compress them as a zip archive. Then, you unpack the zip file locally (if applicable), sign in to Canvas, and import the flow files by uploading them from your local machine.

Use the following procedure to download your flow files from Studio Classic:

1. Open Studio Classic.

1. (Optional) If you want to compress multiple flow files into a zip archive and download them all at once, then do the following:

   1. On the top navigation bar of Studio Classic, choose **File**.

   1. In the context menu, hover over **New**, and then select **Terminal**.

   1. By default, the terminal opens in your home directory. Navigate to the folder that contains all of the flow files that you want to migrate.

   1. Use the following command to pack the flow files in the current directory as a zip. The command excludes any hidden files:

      ```
      find . -not -path "*/.*" -name "*.flow" -print0 | xargs -0 zip my_archive.zip
      ```

1. Download the zip archive or individual flow files to your local machine by doing the following:

   1. In the left navigation pane of Studio Classic, choose **File Browser**.

   1. Find the file you want to download in the file browser.

   1. Right-click the file, and in the context menu, select **Download**.

The files should download to your local machine. If you packed them as a zip archive, extract the files locally. After the files are extracted, to import them into the latest version of Data Wrangler, follow the steps in [Import the flow files into Canvas](#studio-updated-migrate-flows-import).

#### Import the flow files into Canvas
<a name="studio-updated-migrate-flows-import"></a>

After exporting your flow files, access Canvas through Studio and import the files.

Use the following procedure to import flow files into Canvas:

1. Open your latest version of Studio.

1. In Studio, in the left navigation pane, choose the **Data** dropdown menu.

1. From the navigation options, choose **Data Wrangler**.

1. On the **Data Wrangler** page, choose **Run in Canvas**. If you have successfully set up the permissions, this creates a Canvas application for you. The Canvas application may take a few minutes before it's ready. 

1. When Canvas is ready, choose **Open in Canvas**.

1. Canvas opens to the **Data Wrangler** page. In the top pane, choose **Import data flows**.

1. For **Data source**, choose either **Amazon S3** or **Local upload**.

1. Select your flow files from your Amazon S3 bucket, or upload the files from your local machine.
**Note**  
For local upload, you can upload a maximum of 20 flow files at a time. For larger imports, use Amazon S3. If you select a folder to import, any flow files in sub-folders are also imported.

1. Choose **Import data**.

If the import was successful, you receive a notification that `X` number of flow files were successfully imported.

If your flow files don't import successfully, you receive a notification in the SageMaker Canvas application. Choose **View errors** on the notification message to check the individual error messages for guidance on how to reformat any incorrectly formatted flow files.

After your flow files are done importing, go to the **Data Wrangler** page of the SageMaker Canvas application to view your data flows. You can try opening a data flow to verify that it looks as expected.

# Amazon SageMaker Studio Classic
<a name="studio"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

Amazon SageMaker Studio Classic is a web-based integrated development environment (IDE) for machine learning (ML). Studio Classic lets you build, train, debug, deploy, and monitor your ML models. Studio Classic includes all of the tools you need to take your models from data preparation to experimentation to production with increased productivity. In a single visual interface, you can do the following tasks:
+ Write and run code in Jupyter notebooks
+ Prepare data for machine learning
+ Build and train ML models
+ Deploy the models and monitor the performance of their predictions
+ Track and debug ML experiments
+ Collaborate with other users in real time

For information on the onboarding steps for Studio Classic, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md).

For information about collaborating with other users in real time, see [Collaboration with shared spaces](domain-space.md).

For the AWS Regions supported by Studio Classic, see [Supported Regions and Quotas](regions-quotas.md).

## Amazon SageMaker Studio Classic maintenance phase plan
<a name="studio-deprecation"></a>

The following table gives information about the timeline for when Amazon SageMaker Studio Classic entered its extended maintenance phase.




| Date | Description | 
| --- | --- | 
|  12/31/2024  |  Starting December 31st, Studio Classic reaches end of maintenance. At this point, Studio Classic will no longer receive updates and security fixes. All new domains will be created with Amazon SageMaker Studio as the default.  | 
|  1/31/2025  |  Starting January 31st, users will no longer be able to create new JupyterLab 3 notebooks in Studio Classic. Users will also not be able to restart or update existing notebooks. Users will be able to access existing Studio Classic applications from Studio only to delete or stop existing notebooks.  | 

**Note**  
Your existing Studio Classic domain is not automatically migrated to Studio. For information about migrating, see [Migration from Amazon SageMaker Studio Classic](studio-updated-migrate.md).

**Topics**
+ [Amazon SageMaker Studio Classic maintenance phase plan](#studio-deprecation)
+ [Amazon SageMaker Studio Classic Features](#studio-features)
+ [Amazon SageMaker Studio Classic UI Overview](studio-ui.md)
+ [Launch Amazon SageMaker Studio Classic](studio-launch.md)
+ [JupyterLab Versioning in Amazon SageMaker Studio Classic](studio-jl.md)
+ [Use the Amazon SageMaker Studio Classic Launcher](studio-launcher.md)
+ [Use Amazon SageMaker Studio Classic Notebooks](notebooks.md)
+ [Customize Amazon SageMaker Studio Classic](studio-customize.md)
+ [Perform Common Tasks in Amazon SageMaker Studio Classic](studio-tasks.md)
+ [Amazon SageMaker Studio Classic Pricing](studio-pricing.md)
+ [Troubleshooting Amazon SageMaker Studio Classic](studio-troubleshooting.md)

## Amazon SageMaker Studio Classic Features
<a name="studio-features"></a>

Studio Classic includes the following features:
+ [SageMaker Autopilot](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-automate-model-development.html)
+ [SageMaker Clarify](https://docs.aws.amazon.com/sagemaker/latest/dg/model-explainability.html)
+ [SageMaker Data Wrangler](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler.html)
+ [SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-on-studio.html)
+ [SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html)
+ [SageMaker Feature Store](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-use-with-studio.html)
+ [SageMaker JumpStart](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html)
+ [Amazon SageMaker Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-studio.html)
+ [SageMaker Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html)
+ [SageMaker Projects](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects.html)
+ [SageMaker Studio Classic Notebooks](https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks.html)
+ [SageMaker Studio Universal Notebook](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-emr-cluster.html)

# Amazon SageMaker Studio Classic UI Overview
<a name="studio-ui"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

Amazon SageMaker Studio Classic extends the capabilities of JupyterLab with custom resources that can speed up your Machine Learning (ML) process by harnessing the power of AWS compute. Previous users of JupyterLab will notice the similarity of the user interface. The most prominent additions are detailed in the following sections. For an overview of the original JupyterLab interface, see [The JupyterLab Interface](https://jupyterlab.readthedocs.io/en/latest/user/interface.html). 

The following image shows the default view upon launching Amazon SageMaker Studio Classic. The *left navigation* panel displays all top-level categories of features, and the *[Amazon SageMaker Studio Classic home page](#studio-ui-home)* is open in the *main working area*. You can come back to this central point of orientation at any time by choosing the **Home** icon (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/house.png)), then selecting the **Home** node in the navigation menu.

Try the **Getting started notebook** for an in-product hands-on guide on how to set up and get familiar with Amazon SageMaker Studio Classic features. On the **Quick actions** section of the Studio Classic Home page, choose **Open the Getting started notebook**.

![\[SageMaker Studio Classic home page.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/studio-home.png)


**Note**  
This chapter is based on Studio Classic's updated user interface (UI), available in version `v5.38.x` and above on JupyterLab 3.  
To retrieve the version of your Studio Classic UI, open a System Terminal from the [Studio Classic Launcher](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-launcher.html), then do the following:  
1. Run `conda activate studio`.  
1. Run `jupyter labextension list`.  
1. Search for the version displayed after `@amzn/sagemaker-ui` in the output.  
For information about updating Amazon SageMaker Studio Classic, see [Shut Down and Update Amazon SageMaker Studio Classic](studio-tasks-update-studio.md).

**Topics**
+ [Amazon SageMaker Studio Classic home page](#studio-ui-home)
+ [Amazon SageMaker Studio Classic layout](#studio-ui-layout)

## Amazon SageMaker Studio Classic home page
<a name="studio-ui-home"></a>

The Home page provides access to common tasks and workflows. In particular, it includes a list of **Quick actions** for common tasks, such as **Open Launcher** to create notebooks and other resources, and **Import & prepare data visually** to create a new flow in Data Wrangler. The **Home** page also offers tooltips on key controls in the UI.

The **Prebuilt and automated solutions** help you get started quickly with SageMaker AI's low-code solutions such as Amazon SageMaker JumpStart and Autopilot.

In **Workflows and tasks**, you can find a list of relevant tasks for each step of your ML workflow, each of which takes you to the right tool for the job. For example, **Transform, analyse, and export data** takes you to Amazon SageMaker Data Wrangler and opens the workflow to create a new data flow, and **View all experiments** takes you to SageMaker Experiments and opens the experiments list view.

Upon Studio Classic launch, the **Home** page is open in the main working area. You can customize your SageMaker AI **Home** page by choosing the **Customize Layout** icon (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/layout.png)) at the top right of the **Home** tab.

## Amazon SageMaker Studio Classic layout
<a name="studio-ui-layout"></a>

The Amazon SageMaker Studio Classic interface consists of a *menu bar* at the top, a collapsible *left sidebar* displaying a variety of icons such as the **Home** icon and the **File Browser**, a *status bar* at the bottom of the screen, and a *central area* divided horizontally into two panes. The left pane is a collapsible *navigation panel*. The right pane, or main working area, contains one or more tabs for resources such as launchers, notebooks, terminals, metrics, and graphs, and can be further divided.

On the right side of the menu bar, you can choose **Report a bug** to report an issue in Studio Classic, or choose the notification icon (![\[Red circle icon with white exclamation mark, indicating an alert or warning.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/Notification.png)) to view notifications from Studio Classic, such as new Studio Classic versions and new SageMaker AI features. To update to a new version of Studio Classic, see [Shut Down and Update Amazon SageMaker Studio Classic and Apps](studio-tasks-update.md).

The following sections describe the Studio Classic main user interface areas.

### Left sidebar
<a name="studio-ui-nav-bar"></a>

The *left sidebar* includes the following icons. When you hover over an icon, a tooltip displays the icon name. A single click on an icon opens the left navigation panel with the described functionality. A double click minimizes the left navigation panel.


| Icon | Description | 
| --- | --- | 
|  ![\[The Home icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/house@2x.png)  | **Home** Choose the **Home** icon to open a top-level navigation menu in the *left navigation* panel. Using the **Home** navigation menu, you can discover and navigate to the right tools for each step of your ML workflow. The menu also provides shortcuts to quick-start solutions and learning resources such as documentation and guided tutorials. The menu categories group relevant features together. Choosing **Data**, for example, expands the relevant SageMaker AI capabilities for your data preparations tasks. From here, you can prepare your data with Data Wrangler, create and store ML features with Amazon SageMaker Feature Store, and manage Amazon EMR clusters for large-scale data processing. The categories are ordered following a typical ML workflow from preparing data, to building, training, and deploying ML models (data, pipelines, models, and deployments). When you choose a specific node (such as Data Wrangler), a corresponding page opens in the main working area. Choose **Home** in the navigation menu to open the [Amazon SageMaker Studio Classic home page](#studio-ui-home) | 
|  ![\[The File Browser icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/folder@2x.png)  |  **File Browser** The **File Browser** displays lists of your notebooks, experiments, trials, trial components, endpoints, and low-code solutions. Whether you are in a personal or shared space determines who has access to your files. You can identify which type of space you are in by looking at the top right corner. If you are in a personal app, you see a user icon followed by *user-name* **/ Personal Studio**, and if you are in a collaborative space, you see a globe icon followed by *user-name* **/** *space-name*. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/studio-ui.html) For hierarchical entries, a selectable breadcrumb at the top of the browser shows your location in the hierarchy.  | 
|  ![\[The Property Inspector icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/gears@2x.png)  |  **Property Inspector** The Property Inspector is a notebook cell tools inspector which displays contextual property settings when open.  | 
|  ![\[The Running Terminals and Kernels icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/running-terminals-kernels@2x.png)  |  **Running Terminals and Kernels** You can check the list of all the *kernels* and *terminals* currently running across all notebooks, code consoles, and directories. You can shut down individual resources, including notebooks, terminals, kernels, apps, and instances. You can also shut down all resources in one of these categories at the same time. For more information, see [Shut Down Resources from Amazon SageMaker Studio Classic](notebooks-run-and-manage-shut-down.md).  | 
|  ![\[The Git icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/git@2x.png)  |  **Git** You can connect to a Git repository and then access a full range of Git tools and operations. For more information, see [Clone a Git Repository in Amazon SageMaker Studio Classic](studio-tasks-git.md).  | 
|  ![\[The Table of Contents icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/table-of-contents@2x.png)  | **Table of Contents**You can navigate the structure of a document when a notebook or Python files are open. A table of contents is auto-generated in the left navigation panel when you have a notebook, Markdown files, or Python files opened. The entries are clickable and scroll the document to the heading in question. | 
|  ![\[The Extensions icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/extensions@2x.png)  |  **Extensions** You can turn on and manage third-party JupyterLab extensions. You can check the already installed extensions and search for extensions by typing the name in the search bar. When you have found the extension you want to install, choose **Install**. After installing your new extensions, be sure to restart JupyterLab by refreshing your browser. For more information, see [JupyterLab Extensions documentation](https://jupyterlab.readthedocs.io/en/stable/user/extensions.html).  | 

### Left navigation panel
<a name="studio-ui-browser"></a>

The left navigation panel content varies with the icon selected in the left sidebar.

For example, choosing the **Home** icon displays the navigation menu. Choosing **File browser** lists all the files and directories available in your workspace (notebooks, experiments, data flows, trials, trial components, endpoints, or low-code solutions).

In the navigation menu, choosing a node brings up the corresponding feature page in the main working area. For example, choosing **Data Wrangler** in the **Data** menu opens up the **Data Wrangler** tab listing all existing flows.

### Main working area
<a name="studio-ui-work"></a>

The main working area consists of multiple tabs that contain your open notebooks, terminals, and detailed information about your experiments and endpoints. In the main working area, you can arrange documents (such as notebooks and text files) and other activities (such as terminals and code consoles) into panels of tabs that you can resize or subdivide. Drag a tab to the center of a tab panel to move the tab to the panel. Subdivide a tab panel by dragging a tab to the left, right, top, or bottom of the panel. The tab for the current activity is marked with a colored top border (blue by default).

**Note**  
All feature pages provide in-product contextual help. To access help, choose **Show information**. The help interface provides a brief introduction to the tool and links to additional resources, such as videos, tutorials, or blogs.

# Launch Amazon SageMaker Studio Classic
<a name="studio-launch"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.
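For illustration only, a tagging permission added to a custom policy might look like the following sketch; scope the `Resource` element to your own resources rather than using a broad wildcard:

```
{
  "Effect": "Allow",
  "Action": "sagemaker:AddTags",
  "Resource": "arn:aws:sagemaker:*:*:*"
}
```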

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

After you have onboarded to an Amazon SageMaker AI domain, you can launch an Amazon SageMaker Studio Classic application from either the SageMaker AI console or the AWS CLI. For more information about onboarding to a domain, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md).

**Topics**
+ [Launch Amazon SageMaker Studio Classic Using the Amazon SageMaker AI Console](#studio-launch-console)
+ [Launch Amazon SageMaker Studio Classic Using the AWS CLI](#studio-launch-cli)

## Launch Amazon SageMaker Studio Classic Using the Amazon SageMaker AI Console
<a name="studio-launch-console"></a>

The process to navigate to Studio Classic from the Amazon SageMaker AI Console differs depending on whether Studio Classic or Amazon SageMaker Studio is set as the default experience for your domain. For more information about setting the default experience for your domain, see [Migration from Amazon SageMaker Studio Classic](studio-updated-migrate.md).

**Topics**
+ [Prerequisite](#studio-launch-console-prerequisites)

### Prerequisite
<a name="studio-launch-console-prerequisites"></a>

 To complete this procedure, you must onboard to a domain by following the steps in [Onboard to Amazon SageMaker AI domain](https://docs.aws.amazon.com//sagemaker/latest/dg/gs-studio-onboard.html). 

### Launch Studio Classic if Studio is your default experience
<a name="studio-launch-console-updated"></a>

1. Navigate to Studio following the steps in [Launch Amazon SageMaker Studio](studio-updated-launch.md).

1. From the Studio UI, find the applications pane on the left side.

1. From the applications pane, select **Studio Classic**.

1. From the Studio Classic landing page, select the Studio Classic instance to open.

1. Choose **Open**.

### Launch Studio Classic if Studio Classic is your default experience
<a name="studio-launch-console-classic"></a>

When Studio Classic is your default experience, you can launch an Amazon SageMaker Studio Classic application from the SageMaker AI console using the Studio Classic landing page or the Amazon SageMaker AI domain details page. The following sections demonstrate how to launch the Studio Classic application from the SageMaker AI console.

#### Launch Studio Classic from the domain details page
<a name="studio-launch-console-details"></a>

The following sections describe how to launch a Studio Classic application from the domain details page. The steps to launch the Studio Classic application after you have navigated to the domain details page differ depending on whether you're launching a personal application or a shared space.

 **Navigate to the domain details page** 

 The following procedure shows how to navigate to the domain details page. 

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. From the list of domains, select the domain that you want to launch the Studio Classic application in.

 **Launch a user profile app** 

 The following procedure shows how to launch a Studio Classic application that is scoped to a user profile. 

1.  On the domain details page, choose the **User profiles** tab. 

1.  Identify the user profile that you want to launch the Studio Classic application for. 

1.  Choose **Launch** for your selected user profile, then choose **Studio Classic**. 

 **Launch a shared space app** 

 The following procedure shows how to launch a Studio Classic application that is scoped to a shared space. 

1.  On the domain details page, choose the **Space management** tab. 

1.  Identify the shared space that you want to launch the Studio Classic application for. 

1.  Choose **Launch Studio Classic** for your selected shared space. 

#### Launch Studio Classic from the Studio Classic landing page
<a name="studio-launch-console-landing"></a>

 The following procedure describes how to launch a Studio Classic application from the Studio Classic landing page. 

 **Launch Studio Classic** 

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose Studio Classic.

1.  Under **Get started**, select the domain that you want to launch the Studio Classic application in. If your user profile only belongs to one domain, you do not see the option for selecting a domain.

1.  Select the user profile that you want to launch the Studio Classic application for. If there is no user profile in the domain, choose **Create user profile**. For more information, see [Add user profiles](domain-user-profile-add.md).

1.  Choose **Launch Studio Classic**. If the user profile belongs to a shared space, choose **Open Spaces**. 

1.  To launch a Studio Classic application scoped to a user profile, choose **Launch personal Studio Classic**. 

1.  To launch a shared Studio Classic application, choose the **Launch shared Studio Classic** button next to the shared space that you want to launch into. 

## Launch Amazon SageMaker Studio Classic Using the AWS CLI
<a name="studio-launch-cli"></a>

You can use the AWS Command Line Interface (AWS CLI) to launch Amazon SageMaker Studio Classic by creating a presigned domain URL.

 **Prerequisites** 

 Before you begin, complete the following prerequisites: 
+  Onboard to Amazon SageMaker AI domain. For more information, see [Onboard to Amazon SageMaker AI domain](https://docs.aws.amazon.com//sagemaker/latest/dg/gs-studio-onboard.html). 
+  Update the AWS CLI by following the steps in [Installing the current AWS CLI Version](https://docs.aws.amazon.com//cli/latest/userguide/install-cliv1.html#install-tool-bundled). 
+  From your local machine, run `aws configure` and provide your AWS credentials. For information about AWS credentials, see [Understanding and getting your AWS credentials](https://docs.aws.amazon.com//general/latest/gr/aws-sec-cred-types.html). 

The following code snippet demonstrates how to launch Amazon SageMaker Studio Classic from the AWS CLI using a presigned domain URL. For more information, see [create-presigned-domain-url](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/create-presigned-domain-url.html).

```
aws sagemaker create-presigned-domain-url \
--region region \
--domain-id domain-id \
--space-name space-name \
--user-profile-name user-profile-name \
--session-expiration-duration-in-seconds 43200
```
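The command returns a JSON object that contains an `AuthorizedUrl` field. If you only need the URL itself (for example, to open it directly in a browser), you can filter the output with `--query`, as in this sketch using the same placeholder values:

```
aws sagemaker create-presigned-domain-url \
--region region \
--domain-id domain-id \
--user-profile-name user-profile-name \
--query AuthorizedUrl \
--output text
```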

# JupyterLab Versioning in Amazon SageMaker Studio Classic
<a name="studio-jl"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

The Amazon SageMaker Studio Classic interface is based on JupyterLab, which is a web-based interactive development environment for notebooks, code, and data. Studio Classic only supports using JupyterLab 3.

If you created your domain and user profile using the AWS Management Console before August 31, 2022, or using the AWS Command Line Interface before February 22, 2023, then your Studio Classic instance defaulted to JupyterLab 1. After July 1, 2024, you cannot create any Studio Classic applications that run JupyterLab 1.

## JupyterLab 3
<a name="jl3"></a>

JupyterLab 3 includes the following features that are not available in previous versions. For more information about these features, see [JupyterLab 3.0 is released](https://blog.jupyter.org/jupyterlab-3-0-is-out-4f58385e25bb).
+ Visual debugger when using the Base Python 2.0 and Data Science 2.0 kernels
+ File browser filter
+ Table of Contents (TOC)
+ Multi-language support
+ Simple mode
+ Single interface mode

### Important changes to JupyterLab 3
<a name="jl3-changes"></a>

 Consider the following when using JupyterLab 3: 
+ When setting the JupyterLab version using the AWS CLI, select the corresponding image for your Region and JupyterLab version from the image list in [From the AWS CLI](#studio-jl-set-cli).
+ In JupyterLab 3, you must activate the `studio` conda environment before installing extensions. For more information, see [Installing JupyterLab and Jupyter Server extensions](#studio-jl-install).
+ Debugger is only supported when using the following images: 
  + Base Python 2.0
  + Data Science 2.0
  + Base Python 3.0
  + Data Science 3.0

## Restricting default JupyterLab version using an IAM policy condition key
<a name="iam-policy"></a>

You can use IAM policy condition keys to restrict the version of JupyterLab that your users can launch.

The following policy shows how to limit the JupyterLab version at the domain level. 

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "BlockJupyterLab3DomainLevelAppCreation",
            "Effect": "Deny",
            "Action": [
                "sagemaker:CreateDomain",
                "sagemaker:UpdateDomain"
            ],
            "Resource": "*",
            "Condition": {
                "ForAnyValue:ArnLike": {
                    "sagemaker:ImageArns": "arn:aws:sagemaker:us-east-1:111122223333:image/jupyter-server-3"
                }
            }
        }
    ]
}
```

------
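
To put a policy like this into effect, attach it to the IAM identity that creates or updates domains. The following is a sketch with hypothetical role and policy names, assuming the JSON document above is saved as `policy.json`:

```shell
# Attach the deny policy inline to the role that calls CreateDomain/UpdateDomain.
# Role and policy names here are examples only.
aws iam put-role-policy \
  --role-name MySageMakerAdminRole \
  --policy-name BlockJupyterLab3DomainLevel \
  --policy-document file://policy.json
```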

The following policy shows how to limit the JupyterLab version at the user profile level. 

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "BlockUsersFromCreatingJupyterLab3Apps",
            "Effect": "Deny",
            "Action": [
                "sagemaker:CreateUserProfile",
                "sagemaker:UpdateUserProfile"
            ],
            "Resource": "*",
            "Condition": {
                "ForAnyValue:ArnLike": {
                    "sagemaker:ImageArns": "arn:aws:sagemaker:us-east-1:111122223333:image/jupyter-server-3"
                }
            }
        }
    ]
}
```

------

The following policy shows how to limit the JupyterLab version at the application level. The `CreateApp` request must include the image ARN for this policy to apply.

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "BlockJupyterLab3AppLevelAppCreation",
            "Effect": "Deny",
            "Action": "sagemaker:CreateApp",
            "Resource": "*",
            "Condition": {
                "ForAnyValue:ArnLike": {
                    "sagemaker:ImageArns": "arn:aws:sagemaker:us-east-1:111122223333:image/jupyter-server-3"
                }
            }
        }
    ]
}
```

------

## Setting a default JupyterLab version
<a name="studio-jl-set"></a>

The following sections show how to set a default JupyterLab version for Studio Classic using either the console or the AWS CLI.  

### From the console
<a name="studio-jl-set-console"></a>

 You can select the default JupyterLab version to use on either the domain or user profile level during resource creation. To set the default JupyterLab version using the console, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md).  

### From the AWS CLI
<a name="studio-jl-set-cli"></a>

 You can select the default JupyterLab version to use on either the domain or user profile level using the AWS CLI.  

 To set the default JupyterLab version using the AWS CLI, you must include the ARN of the desired default JupyterLab version as part of an AWS CLI command. This ARN differs based on the version and the Region of the SageMaker AI domain.  

The following table lists the ARNs of the available JupyterLab versions for each Region:


|  Region  |  JupyterLab 3 image ARN  | 
| --- | --- | 
|  us-east-1  |  arn:aws:sagemaker:us-east-1:081325390199:image/jupyter-server-3  | 
|  us-east-2  |  arn:aws:sagemaker:us-east-2:429704687514:image/jupyter-server-3  | 
|  us-west-1  |  arn:aws:sagemaker:us-west-1:742091327244:image/jupyter-server-3  | 
|  us-west-2  |  arn:aws:sagemaker:us-west-2:236514542706:image/jupyter-server-3  | 
|  af-south-1  |  arn:aws:sagemaker:af-south-1:559312083959:image/jupyter-server-3  | 
|  ap-east-1  |  arn:aws:sagemaker:ap-east-1:493642496378:image/jupyter-server-3  | 
|  ap-south-1  |  arn:aws:sagemaker:ap-south-1:394103062818:image/jupyter-server-3  | 
|  ap-northeast-2  |  arn:aws:sagemaker:ap-northeast-2:806072073708:image/jupyter-server-3  | 
|  ap-southeast-1  |  arn:aws:sagemaker:ap-southeast-1:492261229750:image/jupyter-server-3  | 
|  ap-southeast-2  |  arn:aws:sagemaker:ap-southeast-2:452832661640:image/jupyter-server-3  | 
|  ap-northeast-1  |  arn:aws:sagemaker:ap-northeast-1:102112518831:image/jupyter-server-3  | 
|  ca-central-1  |  arn:aws:sagemaker:ca-central-1:310906938811:image/jupyter-server-3  | 
|  eu-central-1  |  arn:aws:sagemaker:eu-central-1:936697816551:image/jupyter-server-3  | 
|  eu-west-1  |  arn:aws:sagemaker:eu-west-1:470317259841:image/jupyter-server-3  | 
|  eu-west-2  |  arn:aws:sagemaker:eu-west-2:712779665605:image/jupyter-server-3  | 
|  eu-west-3  |  arn:aws:sagemaker:eu-west-3:615547856133:image/jupyter-server-3  | 
|  eu-north-1  |  arn:aws:sagemaker:eu-north-1:243637512696:image/jupyter-server-3  | 
|  eu-south-1  |  arn:aws:sagemaker:eu-south-1:592751261982:image/jupyter-server-3  | 
|  eu-south-2  |  arn:aws:sagemaker:eu-south-2:127363102723:image/jupyter-server-3  | 
|  sa-east-1  |  arn:aws:sagemaker:sa-east-1:782484402741:image/jupyter-server-3  | 
|  cn-north-1  |  arn:aws-cn:sagemaker:cn-north-1:390048526115:image/jupyter-server-3  | 
|  cn-northwest-1  |  arn:aws-cn:sagemaker:cn-northwest-1:390780980154:image/jupyter-server-3  | 
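
The table above can be expressed as a small shell helper if you script against multiple Regions. The following sketch covers a subset of Regions; extend the `case` statement with the remaining rows as needed:

```shell
# Look up the JupyterLab 3 image ARN for a Region (subset of the table above).
jl3_image_arn() {
  case "$1" in
    us-east-1) echo "arn:aws:sagemaker:us-east-1:081325390199:image/jupyter-server-3" ;;
    us-west-2) echo "arn:aws:sagemaker:us-west-2:236514542706:image/jupyter-server-3" ;;
    eu-west-1) echo "arn:aws:sagemaker:eu-west-1:470317259841:image/jupyter-server-3" ;;
    *) echo "unknown region: $1" >&2; return 1 ;;
  esac
}

jl3_image_arn us-east-1
```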

#### Create or update domain
<a name="studio-jl-set-cli-domain"></a>

 You can set a default JupyterServer version at the domain level by invoking [CreateDomain](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateDomain.html) or [UpdateDomain](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateDomain.html) and passing the `DefaultUserSettings.JupyterServerAppSettings.DefaultResourceSpec.SageMakerImageArn` field. 

 The following shows how to create a domain with JupyterLab 3 as the default, using the AWS CLI: 

```
aws --region <REGION> \
sagemaker create-domain \
--domain-name <NEW_DOMAIN_NAME> \
--auth-mode <AUTHENTICATION_MODE> \
--subnet-ids <SUBNET-IDS> \
--vpc-id <VPC-ID> \
--default-user-settings '{
  "JupyterServerAppSettings": {
    "DefaultResourceSpec": {
      "SageMakerImageArn": "arn:aws:sagemaker:<REGION>:<ACCOUNT_ID>:image/jupyter-server-3",
      "InstanceType": "system"
    }
  }
}'
```

 The following shows how to update a domain to use JupyterLab 3 as the default, using the AWS CLI: 

```
aws --region <REGION> \
sagemaker update-domain \
--domain-id <YOUR_DOMAIN_ID> \
--default-user-settings '{
  "JupyterServerAppSettings": {
    "DefaultResourceSpec": {
      "SageMakerImageArn": "arn:aws:sagemaker:<REGION>:<ACCOUNT_ID>:image/jupyter-server-3",
      "InstanceType": "system"
    }
  }
}'
```
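
To confirm that the new default took effect, you can read the setting back. The following is a sketch; the domain ID is a placeholder:

```shell
# Verify the default JupyterServer image now set at the domain level.
aws sagemaker describe-domain \
  --domain-id <YOUR_DOMAIN_ID> \
  --query 'DefaultUserSettings.JupyterServerAppSettings.DefaultResourceSpec.SageMakerImageArn' \
  --output text
```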

#### Create or update user profile
<a name="studio-jl-set-cli-user"></a>

 You can set a default JupyterServer version at the user profile level by invoking [CreateUserProfile](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateUserProfile.html) or [UpdateUserProfile](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateUserProfile.html) and passing the `UserSettings.JupyterServerAppSettings.DefaultResourceSpec.SageMakerImageArn` field. 

 The following shows how to create a user profile with JupyterLab 3 as the default on an existing domain, using the AWS CLI: 

```
aws --region <REGION> \
sagemaker create-user-profile \
--domain-id <YOUR_DOMAIN_ID> \
--user-profile-name <NEW_USERPROFILE_NAME> \
--query UserProfileArn --output text \
--user-settings '{
  "JupyterServerAppSettings": {
    "DefaultResourceSpec": {
      "SageMakerImageArn": "arn:aws:sagemaker:<REGION>:<ACCOUNT_ID>:image/jupyter-server-3",
      "InstanceType": "system"
    }
  }
}'
```

 The following shows how to update a user profile to use JupyterLab 3 as the default, using the AWS CLI: 

```
aws --region <REGION> \
sagemaker update-user-profile \
--domain-id <YOUR_DOMAIN_ID> \
--user-profile-name <EXISTING_USERPROFILE_NAME> \
--user-settings '{
  "JupyterServerAppSettings": {
    "DefaultResourceSpec": {
      "SageMakerImageArn": "arn:aws:sagemaker:<REGION>:<ACCOUNT_ID>:image/jupyter-server-3",
      "InstanceType": "system"
    }
  }
}'
```

## View and update the JupyterLab version of an application from the console
<a name="studio-jl-view"></a>

 The following shows how to view and update the JupyterLab version of an application. 

1.  Navigate to the SageMaker AI **domains** page. 

1.  Select a domain to view its user profiles. 

1.  Select a user to view their applications. 

1.  To view the JupyterLab version of an application, select the application's name. 

1.  To update the JupyterLab version, select **Action**. 

1.  From the dropdown menu, select **Change JupyterLab version**. 

1.  From the **Studio Classic settings** page, select the JupyterLab version from the dropdown menu. 

1.  After the JupyterLab version for the user profile has been successfully updated, restart the JupyterServer application to make the version changes effective. For more information about restarting a JupyterServer application, see [Shut Down and Update Amazon SageMaker Studio Classic](studio-tasks-update-studio.md).

## Installing JupyterLab and Jupyter Server extensions
<a name="studio-jl-install"></a>

In JupyterLab 3, you must activate the `studio` conda environment before installing extensions. The method for this differs if you're installing the extensions from within Studio Classic or using a lifecycle configuration script.

### Installing Extensions from within Studio Classic
<a name="studio-jl-install-studio"></a>

To install extensions from within Studio Classic, you must activate the `studio` environment before you install extensions. 

```
# Before installing extensions
conda activate studio

# Install your extensions
pip install <JUPYTER_EXTENSION>

# After installing extensions
conda deactivate
```

### Installing Extensions using a lifecycle configuration script
<a name="studio-jl-install-lcc"></a>

If you're installing JupyterLab and Jupyter Server extensions in your lifecycle configuration script, you must modify your script so that it works with JupyterLab 3. The following sections show the code needed for existing and new lifecycle configuration scripts.

#### Existing lifecycle configuration script
<a name="studio-jl-install-lcc-existing"></a>

If you're reusing an existing lifecycle configuration script that must work with both versions of JupyterLab, use the following code in your script:

```
# Before installing extension
export AWS_SAGEMAKER_JUPYTERSERVER_IMAGE="${AWS_SAGEMAKER_JUPYTERSERVER_IMAGE:-jupyter-server}"
if [ "$AWS_SAGEMAKER_JUPYTERSERVER_IMAGE" = "jupyter-server-3" ] ; then
   eval "$(conda shell.bash hook)"
   conda activate studio
fi;

# Install your extensions
pip install <JUPYTER_EXTENSION>


# After installing extension
if [ "$AWS_SAGEMAKER_JUPYTERSERVER_IMAGE" = "jupyter-server-3" ]; then
   conda deactivate
fi;
```

#### New lifecycle configuration script
<a name="studio-jl-install-lcc-new"></a>

If you're writing a new lifecycle configuration script that only uses JupyterLab 3, you can use the following code in your script:

```
# Before installing extension
eval "$(conda shell.bash hook)"
conda activate studio


# Install your extensions
pip install <JUPYTER_EXTENSION>


conda deactivate
```
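
Once written, the script must be registered as a Studio lifecycle configuration before a domain or user profile can reference it. The following is a sketch, assuming the script above was saved as `install-extensions.sh` (a hypothetical name); the API requires the content to be base64-encoded:

```shell
# Register the extension-install script as a JupyterServer lifecycle configuration.
# The configuration name and file name are examples only.
# On macOS, use "base64 -i install-extensions.sh" instead of "base64 -w 0".
aws sagemaker create-studio-lifecycle-config \
  --studio-lifecycle-config-name install-jupyter-extensions \
  --studio-lifecycle-config-content "$(base64 -w 0 install-extensions.sh)" \
  --studio-lifecycle-config-app-type JupyterServer
```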

# Use the Amazon SageMaker Studio Classic Launcher
<a name="studio-launcher"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

You can use the Amazon SageMaker Studio Classic Launcher to create notebooks and text files, and to launch terminals and interactive Python shells.

You can open Studio Classic Launcher in any of the following ways:
+ Choose **Amazon SageMaker Studio Classic** at the top left of the Studio Classic interface.
+ Use the keyboard shortcut `Ctrl + Shift + L`.
+ From the Studio Classic menu, choose **File** and then choose **New Launcher**.
+ If the SageMaker AI file browser is open, choose the plus (**+**) sign in the Studio Classic file browser menu.
+ In the **Quick actions** section of the **Home** tab, choose **Open Launcher**. The Launcher opens in a new tab. The **Quick actions** section is visible by default but can be toggled off. Choose **Customize Layout** to turn this section back on.

![\[SageMaker Studio Classic launcher.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/studio-new-launcher.png)


The Launcher consists of the following two sections:

**Topics**
+ [Notebooks and compute resources](#studio-launcher-launch)
+ [Utilities and files](#studio-launcher-other)

## Notebooks and compute resources
<a name="studio-launcher-launch"></a>

In this section, you can create a notebook, open an image terminal, or open a Python console.

To create or launch one of those items:

1. Choose **Change environment** to select a SageMaker image, a kernel, an instance type, and, optionally, add a lifecycle configuration script that runs on image start-up. For more information on lifecycle configuration scripts, see [Use Lifecycle Configurations to Customize Amazon SageMaker Studio Classic](studio-lcc.md). For more information about kernel updates, see [Change the Image or a Kernel for an Amazon SageMaker Studio Classic Notebook](notebooks-run-and-manage-change-image.md).

1. Select an item.

**Note**  
When you choose an item from this section, you might incur additional usage charges. For more information, see [Usage Metering for Amazon SageMaker Studio Classic Notebooks](notebooks-usage-metering.md).

The following items are available:
+ **Notebook**

  Launches the notebook in a kernel session on the chosen SageMaker image.

  Creates the notebook in the folder that you have currently selected in the file browser. To view the file browser, in the left sidebar of Studio Classic, choose the **File Browser** icon.
+ **Console**

  Launches the shell in a kernel session on the chosen SageMaker image.

  Opens the shell in the folder that you have currently selected in the file browser.
+ **Image terminal**

  Launches the terminal in a terminal session on the chosen SageMaker image.

  Opens the terminal in the root folder for the user (as shown by the **Home** folder in the file browser).

**Note**  
By default, CPU-based images launch on an `ml.t3.medium` instance, while GPU-based images launch on an `ml.g4dn.xlarge` instance.

## Utilities and files
<a name="studio-launcher-other"></a>

In this section, you can add contextual help in a notebook; create Python, Markdown and text files; and open a system terminal.

**Note**  
Items in this section run in the context of Amazon SageMaker Studio Classic and don't incur usage charges.

The following items are available:
+ **Show Contextual Help**

  Opens a new tab that displays contextual help for functions in a Studio Classic notebook. To display the help, choose a function in an active notebook. To make it easier to see the help in context, drag the help tab so that it's adjacent to the notebook tab. To open the help tab from within a notebook, press `Ctrl + I`.

  The following screenshot shows the contextual help for the `Experiment.create` method.  
![\[SageMaker Studio Classic contextual help.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/studio-context-help.png)
+ **System terminal**

  Opens a `bash` shell in the root folder for the user (as shown by the **Home** folder in the file browser).
+ **Text File** and **Markdown File**

  Creates a file of the associated type in the folder that you have currently selected in the file browser. To view the file browser, in the left sidebar, choose the **File Browser** icon (![\[File Browser icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/folder.png)).

# Use Amazon SageMaker Studio Classic Notebooks
<a name="notebooks"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

Amazon SageMaker Studio Classic notebooks are collaborative notebooks that you can launch quickly because you don't need to set up compute instances and file storage beforehand. Studio Classic notebooks provide persistent storage, which enables you to view and share notebooks even if the instances that the notebooks run on are shut down.

You can share your notebooks with others, so that they can easily reproduce your results and collaborate while building models and exploring your data. You provide access to a read-only copy of the notebook through a secure URL. Dependencies for your notebook are included in the notebook's metadata. When your colleagues copy the notebook, it opens in the same environment as the original notebook.

A Studio Classic notebook runs in an environment defined by the following:
+ Amazon EC2 instance type – The hardware configuration the notebook runs on. The configuration includes the number and type of processors (vCPU and GPU), and the amount and type of memory. The instance type determines the pricing rate.
+ SageMaker image – A container image that is compatible with SageMaker Studio Classic. The image consists of the kernels, language packages, and other files required to run a notebook in Studio Classic. There can be multiple images in an instance. For more information, see [Custom Images in Amazon SageMaker Studio Classic](studio-byoi.md).
+ KernelGateway app – A SageMaker image runs as a KernelGateway app. The app provides access to the kernels in the image. There is a one-to-one correspondence between a SageMaker AI image and a KernelGateway app.
+ Kernel – The process that inspects and runs the code contained in the notebook. A kernel is defined by a *kernel spec* in the image. There can be multiple kernels in an image.

You can change any of these resources from within the notebook.
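
Each running notebook environment corresponds to a KernelGateway app, which you can inspect from the AWS CLI. The following is a sketch with placeholder IDs:

```shell
# List the running apps (JupyterServer and KernelGateway) for one user profile.
aws sagemaker list-apps \
  --domain-id-equals <YOUR_DOMAIN_ID> \
  --user-profile-name-equals <USER_PROFILE_NAME>
```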

The following diagram outlines how a notebook kernel runs in relation to the KernelGateway App, User, and domain.

![\[How a notebook kernel runs in relation to the KernelGateway App, User, and domain.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/studio-components.png)


Sample SageMaker Studio Classic notebooks are available in the [aws_sagemaker_studio](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/aws_sagemaker_studio) folder of the [Amazon SageMaker examples GitHub repository](https://github.com/awslabs/amazon-sagemaker-examples). Each notebook comes with the necessary SageMaker image that opens the notebook with the appropriate kernel.

We recommend that you familiarize yourself with the SageMaker Studio Classic interface and the Studio Classic notebook toolbar before creating or using a Studio Classic notebook. For more information, see [Amazon SageMaker Studio Classic UI Overview](studio-ui.md) and [Use the Studio Classic Notebook Toolbar](notebooks-menu.md).

**Topics**
+ [How Are Amazon SageMaker Studio Classic Notebooks Different from Notebook Instances?](notebooks-comparison.md)
+ [Get Started with Amazon SageMaker Studio Classic Notebooks](notebooks-get-started.md)
+ [Amazon SageMaker Studio Classic Tour](gs-studio-end-to-end.md)
+ [Create or Open an Amazon SageMaker Studio Classic Notebook](notebooks-create-open.md)
+ [Use the Studio Classic Notebook Toolbar](notebooks-menu.md)
+ [Install External Libraries and Kernels in Amazon SageMaker Studio Classic](studio-notebooks-add-external.md)
+ [Share and Use an Amazon SageMaker Studio Classic Notebook](notebooks-sharing.md)
+ [Get Amazon SageMaker Studio Classic Notebook and App Metadata](notebooks-run-and-manage-metadata.md)
+ [Get Notebook Differences in Amazon SageMaker Studio Classic](notebooks-diff.md)
+ [Manage Resources for Amazon SageMaker Studio Classic Notebooks](notebooks-run-and-manage.md)
+ [Usage Metering for Amazon SageMaker Studio Classic Notebooks](notebooks-usage-metering.md)
+ [Available Resources for Amazon SageMaker Studio Classic Notebooks](notebooks-resources.md)

# How Are Amazon SageMaker Studio Classic Notebooks Different from Notebook Instances?
<a name="notebooks-comparison"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

When you're starting a new notebook, we recommend that you create the notebook in Amazon SageMaker Studio Classic instead of launching a notebook instance from the Amazon SageMaker AI console. There are many benefits to using a Studio Classic notebook, including the following:
+ **Faster: **Starting a Studio Classic notebook is faster than launching an instance-based notebook, typically 5-10 times faster.
+ **Easy notebook sharing: **Notebook sharing is an integrated feature in Studio Classic. In a few clicks, users can generate a shareable link that reproduces both the notebook code and the SageMaker image required to run it.
+ **Latest Python SDK: **Studio Classic notebooks come pre-installed with the latest [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).
+ **Access all Studio Classic features: **Studio Classic notebooks are accessed from within Studio Classic. This enables you to build, train, debug, track, and monitor your models without leaving Studio Classic.
+ **Persistent user directories:** Each member of a Studio team gets their own home directory to store their notebooks and other files. The directory is automatically mounted onto all instances and kernels as they're started, so their notebooks and other files are always available. The home directories are stored in Amazon Elastic File System (Amazon EFS) so that you can access them from other services.
+ **Direct access:** When using IAM Identity Center, you use your IAM Identity Center credentials through a unique URL to directly access Studio Classic. You don't have to interact with the AWS Management Console to run your notebooks.
+ **Optimized images:** Studio Classic notebooks are equipped with a set of predefined SageMaker image settings to get you started faster.

**Note**  
Studio Classic notebooks don't support *local mode*. However, you can use a notebook instance to train a sample of your dataset locally, and then use the same code in a Studio Classic notebook to train on the full dataset.

When you open a notebook in SageMaker Studio Classic, the view is an extension of the JupyterLab interface. The primary features are the same, so you'll find the typical features of a Jupyter notebook and JupyterLab. For more information about the Studio Classic interface, see [Amazon SageMaker Studio Classic UI Overview](studio-ui.md).

# Get Started with Amazon SageMaker Studio Classic Notebooks
<a name="notebooks-get-started"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

To get started, you or your organization's administrator need to complete the SageMaker AI domain onboarding process. For more information, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md).

You can access a Studio Classic notebook in any of the following ways:
+ You receive an email invitation to access Studio Classic through your organization's IAM Identity Center, which includes a direct link to log in to Studio Classic without having to use the Amazon SageMaker AI console. You can proceed to the [Next Steps](#notebooks-get-started-next-steps).
+ You receive a link to a shared Studio Classic notebook, which includes a direct link to log in to Studio Classic without having to use the SageMaker AI console. You can proceed to the [Next Steps](#notebooks-get-started-next-steps). 
+ You onboard to a domain and then log in to the SageMaker AI console. For more information, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md).

## Launch Amazon SageMaker AI
<a name="notebooks-get-started-log-in"></a>

Complete the steps in [Launch Amazon SageMaker Studio Classic](studio-launch.md) to launch Studio Classic.

## Next Steps
<a name="notebooks-get-started-next-steps"></a>

Now that you're in Studio Classic, you can try any of the following options:
+ To create a Studio Classic notebook or explore Studio Classic end-to-end tutorial notebooks – See [Amazon SageMaker Studio Classic Tour](gs-studio-end-to-end.md) in the next section.
+ To familiarize yourself with the Studio Classic interface – See [Amazon SageMaker Studio Classic UI Overview](studio-ui.md) or try the **Getting started notebook** by selecting **Open the Getting started notebook** in the **Quick actions** section of the Studio Classic Home page.

# Amazon SageMaker Studio Classic Tour
<a name="gs-studio-end-to-end"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

For a walkthrough that takes you on a tour of the main features of Amazon SageMaker Studio Classic, see the [xgboost_customer_churn_studio.ipynb](https://sagemaker-examples.readthedocs.io/en/latest/aws_sagemaker_studio/getting_started/xgboost_customer_churn_studio.html) sample notebook from the [aws/amazon-sagemaker-examples](https://github.com/aws/amazon-sagemaker-examples) GitHub repository. The code in the notebook trains multiple models and sets up the SageMaker Debugger and SageMaker Model Monitor. The walkthrough shows you how to view the trials, compare the resulting models, show the debugger results, and deploy the best model using the Studio Classic UI. You don't need to understand the code to follow this walkthrough.

**Prerequisites**

To run the notebook for this tour, you need:
+ An IAM account to sign in to Studio. For information, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md).
+ Basic familiarity with the Studio user interface and Jupyter notebooks. For information, see [Amazon SageMaker Studio Classic UI Overview](studio-ui.md).
+ A copy of the [aws/amazon-sagemaker-examples](https://github.com/aws/amazon-sagemaker-examples) repository in your Studio environment.

**To clone the repository**

1. Launch Studio Classic following the steps in [Launch Amazon SageMaker Studio Classic](studio-launch.md). For users in IAM Identity Center, sign in using the URL from your invitation email.

1. On the top menu, choose **File**, then **New**, then **Terminal**.

1. At the command prompt, run the following command to clone the [aws/amazon-sagemaker-examples](https://github.com/aws/amazon-sagemaker-examples) GitHub repository.

   ```
   $ git clone https://github.com/aws/amazon-sagemaker-examples.git
   ```

**To navigate to the sample notebook**

1. From the **File Browser** on the left menu, select **amazon-sagemaker-examples**.

1. Navigate to the example notebook with the following path.

   `~/amazon-sagemaker-examples/aws_sagemaker_studio/getting_started/xgboost_customer_churn_studio.ipynb`

1. Follow the notebook to learn about Studio Classic's main features.

**Note**  
If you encounter an error when you run the sample notebook, and some time has passed from when you cloned the repository, review the notebook on the remote repository for updates.

# Create or Open an Amazon SageMaker Studio Classic Notebook
<a name="notebooks-create-open"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

When you [Create a Notebook from the File Menu](#notebooks-create-file-menu) in Amazon SageMaker Studio Classic or [Open a notebook in Studio Classic](#notebooks-open) for the first time, you are prompted to set up your environment by choosing a SageMaker image, a kernel, an instance type, and, optionally, a lifecycle configuration script that runs on image start-up. SageMaker AI launches the notebook on an instance of the chosen type. By default, the instance type is set to `ml.t3.medium` (available as part of the [AWS Free Tier](https://aws.amazon.com/free)) for CPU-based images. For GPU-based images, the default instance type is `ml.g4dn.xlarge`.

If you create or open additional notebooks that use the same instance type, whether or not the notebooks use the same kernel, the notebooks run on the same instance of that instance type.

After you launch a notebook, you can change its instance type, SageMaker image, and kernel from within the notebook. For more information, see [Change the Instance Type for an Amazon SageMaker Studio Classic Notebook](notebooks-run-and-manage-switch-instance-type.md) and [Change the Image or a Kernel for an Amazon SageMaker Studio Classic Notebook](notebooks-run-and-manage-change-image.md).

**Note**  
You can have only one instance of each instance type. Each instance can have multiple SageMaker images running on it. Each SageMaker image can run multiple kernels or terminal instances. 

Billing occurs per instance and starts when the first instance of a given instance type is launched. If you want to create or open a notebook without the risk of incurring charges, open the notebook from the **File** menu and choose **No Kernel** from the **Select Kernel** dialog box. You can read and edit a notebook without a running kernel but you can't run cells.

Billing ends when the SageMaker image for the instance is shut down. For more information, see [Usage Metering for Amazon SageMaker Studio Classic Notebooks](notebooks-usage-metering.md).

For information about shutting down the notebook, see [Shut down resources](notebooks-run-and-manage-shut-down.md#notebooks-run-and-manage-shut-down-sessions).

**Topics**
+ [Open a notebook in Studio Classic](#notebooks-open)
+ [Create a Notebook from the File Menu](#notebooks-create-file-menu)
+ [Create a Notebook from the Launcher](#notebooks-create-launcher)
+ [List of the available instance types, images, and kernels](#notebooks-instance-image-kernels)

## Open a notebook in Studio Classic
<a name="notebooks-open"></a>

Amazon SageMaker Studio Classic can only open notebooks listed in the Studio Classic file browser. For instructions on uploading a notebook to the file browser, see [Upload Files to Amazon SageMaker Studio Classic](studio-tasks-files.md) or [Clone a Git Repository in Amazon SageMaker Studio Classic](studio-tasks-git.md).

**To open a notebook**

1. In the left sidebar, choose the **File Browser** icon ( ![\[The File Browser icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/folder.png)) to display the file browser.

1. Browse to a notebook file and double-click it to open the notebook in a new tab.

## Create a Notebook from the File Menu
<a name="notebooks-create-file-menu"></a>

**To create a notebook from the File menu**

1. From the Studio Classic menu, choose **File**, choose **New**, and then choose **Notebook**.

1. In the **Change environment** dialog box, use the dropdown menus to select your **Image**, **Kernel**, **Instance type**, and **Start-up script**, then choose **Select**. Your notebook launches and opens in a new Studio Classic tab.  
![\[Studio Classic notebook environment setup.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/studio-notebook-environment-setup.png)

## Create a Notebook from the Launcher
<a name="notebooks-create-launcher"></a>

**To create a notebook from the Launcher**

1. To open the Launcher, choose **Amazon SageMaker Studio Classic** at the top left of the Studio Classic interface or use the keyboard shortcut `Ctrl + Shift + L`.

   To learn about all the available ways to open the Launcher, see [Use the Amazon SageMaker Studio Classic Launcher](studio-launcher.md).

1. In the Launcher, in the **Notebooks and compute resources** section, choose **Change environment**.  
![\[SageMaker Studio Classic set notebook environment.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/studio-launcher-notebook-creation.png)

1. In the **Change environment** dialog box, use the dropdown menus to select your **Image**, **Kernel**, **Instance type**, and **Start-up script**, then choose **Select**.

1. In the Launcher, choose **Create notebook**. Your notebook launches and opens in a new Studio Classic tab.

To view the notebook's kernel session, in the left sidebar, choose the **Running Terminals and Kernels** icon (![\[The Running Terminals and Kernels icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/running-terminals-kernels.png)). You can stop the notebook's kernel session from this view.

## List of the available instance types, images, and kernels
<a name="notebooks-instance-image-kernels"></a>

For a list of all available resources, see:
+ [Instance Types Available for Use With Amazon SageMaker Studio Classic Notebooks](notebooks-available-instance-types.md)
+ [Amazon SageMaker Images Available for Use With Studio Classic Notebooks](notebooks-available-images.md)

# Use the Studio Classic Notebook Toolbar
<a name="notebooks-menu"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

Amazon SageMaker Studio Classic notebooks extend the JupyterLab interface. For an overview of the original JupyterLab interface, see [The JupyterLab Interface](https://jupyterlab.readthedocs.io/en/latest/user/interface.html).

The following image shows the toolbar and an empty cell from a Studio Classic notebook.

![\[SageMaker Studio Classic notebook menu.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/studio-notebook-menu.png)


When you pause on a toolbar icon, a tooltip displays the icon function. Additional notebook commands are found in the Studio Classic main menu. The toolbar includes the following icons:


| Icon | Description | 
| --- | --- | 
|  ![\[The Save and checkpoint icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/notebook-save-and-checkpoint.png)  |  **Save and checkpoint** Saves the notebook and updates the checkpoint file. For more information, see [Get the Difference Between the Last Checkpoint](notebooks-diff.md#notebooks-diff-checkpoint).  | 
|  ![\[The Insert cell icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/notebook-insert-cell.png)  |  **Insert cell** Inserts a code cell below the current cell. The current cell is noted by the blue vertical marker in the left margin.  | 
|  ![\[The Cut, copy, and paste cells icons.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/notebook-cut-copy-paste.png)  |  **Cut, copy, and paste cells** Cuts, copies, and pastes the selected cells.  | 
|  ![\[The Run cells icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/notebook-run.png)  |  **Run cells** Runs the selected cells and then makes the cell that follows the last selected cell the new selected cell.  | 
|  ![\[The Interrupt kernel icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/notebook-interrupt-kernel.png)  |  **Interrupt kernel** Interrupts the kernel, which cancels the currently running operation. The kernel remains active.  | 
|  ![\[The Restart kernel icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/notebook-restart-kernel.png)  |  **Restart kernel** Restarts the kernel. Variables are reset. Unsaved information is not affected.  | 
|  ![\[The Restart kernel and run all cells icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/notebook-restart-kernel-run-all-cells.png)  |  **Restart kernel and run all cells** Restarts the kernel, then runs all the cells of the notebook.  | 
|  ![\[The Cell type icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/notebook-cell-type.png)  |  **Cell type** Displays or changes the current cell type. The cell types are: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-menu.html)  | 
|  ![\[The Launch terminal icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/notebook-launch-terminal.png)  |  **Launch terminal** Launches a terminal in the SageMaker image hosting the notebook. For an example, see [Get App Metadata](notebooks-run-and-manage-metadata.md#notebooks-run-and-manage-metadata-app).  | 
|  ![\[The Checkpoint diff icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/notebook-checkpoint-diff.png)  |  **Checkpoint diff** Opens a new tab that displays the difference between the notebook and the checkpoint file. For more information, see [Get the Difference Between the Last Checkpoint](notebooks-diff.md#notebooks-diff-checkpoint).  | 
|  ![\[The Git diff icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/notebook-git-diff.png)  |  **Git diff** Only enabled if the notebook is opened from a Git repository. Opens a new tab that displays the difference between the notebook and the last Git commit. For more information, see [Get the Difference Between the Last Commit](notebooks-diff.md#notebooks-diff-git).  | 
|  **2 vCPU + 4 GiB**  |  **Instance type** Displays or changes the instance type the notebook runs in. The format is as follows: `number of vCPUs + amount of memory + number of GPUs` `Unknown` indicates the notebook was opened without specifying a kernel. The notebook runs on the SageMaker Studio instance and doesn't accrue runtime charges. You can't assign the notebook to an instance type. You must specify a kernel and then Studio assigns the notebook to a default type. For more information, see [Create or Open an Amazon SageMaker Studio Classic Notebook](notebooks-create-open.md) and [Change the Instance Type for an Amazon SageMaker Studio Classic Notebook](notebooks-run-and-manage-switch-instance-type.md).  | 
|  ![\[The Cluster icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/notebook-cluster.png)  |  **Cluster** Connect your notebook to an Amazon EMR cluster and scale your ETL jobs or run large-scale model training using Apache Spark, Hive, or Presto. For more information, see [Data preparation using Amazon EMR](studio-notebooks-emr-cluster.md).  | 
|  **Python 3 (Data Science)**  |  **Kernel and SageMaker Image** Displays or changes the kernel that processes the cells in the notebook. The format is as follows: `Kernel (SageMaker Image)` `No Kernel` indicates the notebook was opened without specifying a kernel. You can edit the notebook but you can't run any cells. For more information, see [Change the Image or a Kernel for an Amazon SageMaker Studio Classic Notebook](notebooks-run-and-manage-change-image.md).  | 
|  ![\[The Kernel busy status icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/notebook-kernel-status.png)  |  **Kernel busy status** Displays the busy status of the kernel. When the edge of the circle and its interior are the same color, the kernel is busy. The kernel is busy when it is starting and when it is processing cells. Additional kernel states are displayed in the status bar at the bottom-left corner of SageMaker Studio.  | 
|  ![\[The Share notebook icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/notebook-share.png)  |  **Share notebook** Shares the notebook. For more information, see [Share and Use an Amazon SageMaker Studio Classic Notebook](notebooks-sharing.md).  | 

To select multiple cells, click in the left margin outside of a cell. Hold down the `Shift` key and use `K` or the `Up` key to select previous cells, or use `J` or the `Down` key to select following cells.

# Install External Libraries and Kernels in Amazon SageMaker Studio Classic
<a name="studio-notebooks-add-external"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

Amazon SageMaker Studio Classic notebooks come with multiple images already installed. These images contain kernels and Python packages including scikit-learn, Pandas, NumPy, TensorFlow, PyTorch, and MXNet. You can also install your own images that contain your choice of packages and kernels. For more information on installing your own image, see [Custom Images in Amazon SageMaker Studio Classic](studio-byoi.md).

The different Jupyter kernels in Amazon SageMaker Studio Classic notebooks are separate conda environments. For information about conda environments, see [Managing environments](https://conda.io/docs/user-guide/tasks/manage-environments.html).

## Package installation tools
<a name="studio-notebooks-external-tools"></a>

**Important**  
Currently, all packages in Amazon SageMaker notebooks are licensed for use with Amazon SageMaker AI and do not require additional commercial licenses. However, this might be subject to change in the future, and we recommend reviewing the licensing terms regularly for any updates.

The method that you use to install Python packages from the terminal differs depending on the image. Studio Classic supports the following package installation tools:
+ **Notebooks** – The following commands are supported. If one of the following does not work on your image, try the other one.
  + `%conda install`
  + `%pip install`
+ **The Jupyter terminal** – You can install packages using pip and conda directly. You can also use `apt-get install` to install system packages from the terminal.

**Note**  
We do not recommend using `pip install -u` or `pip install --user`, because those commands install packages on the user's Amazon EFS volume and can potentially block JupyterServer app restarts. Instead, use a lifecycle configuration to reinstall the required packages on app restarts as shown in [Install packages using lifecycle configurations](#nbi-add-external-lcc).

We recommend using `%pip` and `%conda` to install packages from within a notebook because they correctly take into account the active environment or interpreter being used. For more information, see [Add %pip and %conda magic functions](https://github.com/ipython/ipython/pull/11524). You can also use the system command syntax (lines starting with `!`) to install packages. For example, `!pip install` and `!conda install`. 
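To illustrate why environment-aware installs matter, the following sketch shows the equivalent pattern in plain Python: invoke pip through the same interpreter that runs your kernel, so packages land in the active environment rather than in some other Python installation. This is an illustration, not a command taken from this guide; it only queries pip's version, but you could swap in `install` and a package name.

```python
import subprocess
import sys

# Run pip via the interpreter that is executing this code (sys.executable),
# which is what makes the install target the active environment. Here we
# only ask pip for its version instead of installing anything.
result = subprocess.run(
    [sys.executable, "-m", "pip", "--version"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())
```

Inside a notebook, `%pip install <package>` achieves the same environment awareness without the subprocess boilerplate.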

### Conda
<a name="studio-notebooks-add-external-tools-conda"></a>

Conda is an open source package management system and environment management system that can install packages and their dependencies. SageMaker AI supports using conda with the conda-forge channel. For more information, see [Conda channels](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html). The conda-forge channel is a community channel where contributors can upload packages.

**Note**  
Installing packages from conda-forge can take up to 10 minutes. The time required depends on how long conda takes to resolve the dependency graph.

All of the SageMaker AI-provided environments are functional. User-installed packages might not function correctly.

Conda has two methods for activating environments: `conda activate` and `source activate`. For more information, see [Managing environments](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).

**Supported conda operations**
+ `conda install` of a package in a single environment
+ `conda install` of a package in all environments
+ Installing a package from the main conda repository
+ Installing a package from conda-forge
+ Changing the conda install location to use Amazon EBS
+ Supporting both `conda activate` and `source activate`

### Pip
<a name="studio-notebooks-add-external-tools-pip"></a>

Pip is the tool for installing and managing Python packages. By default, pip searches for packages on the Python Package Index (PyPI). Unlike conda, pip doesn't have built-in environment support. Therefore, pip isn't as thorough as conda when it comes to packages with native or system library dependencies. You can use pip to install packages in conda environments, and you can use alternative package repositories with pip instead of PyPI.

**Supported pip operations**
+ Using pip to install a package without an active conda environment
+ Using pip to install a package in a conda environment
+ Using pip to install a package in all conda environments
+ Changing the pip install location to use Amazon EBS
+ Using an alternative repository to install packages with pip

### Unsupported
<a name="studio-notebooks-add-external-tools-misc"></a>

SageMaker AI aims to support as many package installation operations as possible. However, if the packages were installed by SageMaker AI and you use the following operations on these packages, it might make your environment unstable:
+ Uninstalling
+ Downgrading
+ Upgrading

Due to potential issues with network conditions or configurations, or the availability of conda or PyPI, packages might not install in a fixed or deterministic amount of time.

**Note**  
Attempting to install a package in an environment with incompatible dependencies can result in a failure. If issues occur, you can contact the library maintainer about updating the package dependencies. Modifying the environment, such as removing or updating existing packages, can make that environment unstable.

## Install packages using lifecycle configurations
<a name="nbi-add-external-lcc"></a>

Install custom images and kernels on the Studio Classic instance's Amazon EBS volume so that they persist when you stop and restart the notebook, and so that any external libraries you install aren't updated by SageMaker AI. To do that, use a lifecycle configuration that includes both a script that runs when you create the notebook (`on-create`) and a script that runs each time you restart the notebook (`on-start`). For more information about using lifecycle configurations with Studio Classic, see [Use Lifecycle Configurations to Customize Amazon SageMaker Studio Classic](studio-lcc.md). For sample lifecycle configuration scripts, see [SageMaker AI Studio Classic Lifecycle Configuration Samples](https://github.com/aws-samples/sagemaker-studio-lifecycle-config-examples).
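As a minimal sketch, an `on-start` script for such a lifecycle configuration could simply reinstall the packages your notebooks need each time the app restarts. The package name below is a hypothetical example, not taken from this guide; see the sample repository linked above for complete, tested scripts.

```shell
#!/bin/bash
# Hypothetical on-start lifecycle configuration sketch: reinstall needed
# packages on every app restart, rather than persisting them to the
# user's Amazon EFS volume with `pip install --user`.
set -eux
pip install --quiet pyarrow
```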

# Share and Use an Amazon SageMaker Studio Classic Notebook
<a name="notebooks-sharing"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

You can share your Amazon SageMaker Studio Classic notebooks with your colleagues. The shared notebook is a copy. After you share your notebook, any changes you make to your original notebook aren't reflected in the shared notebook, and any changes your colleagues make in their shared copies of the notebook aren't reflected in your original notebook. If you want to share your latest version, you must create a new snapshot and then share it.

**Topics**
+ [Share a Notebook](#notebooks-sharing-share)
+ [Use a Shared Notebook](#notebooks-sharing-using)
+ [Shared spaces and realtime collaboration](#notebooks-sharing-rtc)

## Share a Notebook
<a name="notebooks-sharing-share"></a>

The following screenshot shows the menu from a Studio Classic notebook.

![\[The location of the Share icon in a Studio Classic notebook.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/studio-notebook-menu-share.png)


**To share a notebook**

1. In the upper-right corner of the notebook, choose **Share**.

1. (Optional) In **Create shareable snapshot**, choose any of the following items:
   + **Include Git repo information** – Includes a link to the Git repository that contains the notebook. This enables you and your colleague to collaborate and contribute to the same Git repository.
   + **Include output** – Includes all notebook output that has been saved.
**Note**  
If you're a user in IAM Identity Center and you don't see these options, your IAM Identity Center administrator probably disabled the feature. Contact your administrator.

1. Choose **Create**.

1. After the snapshot is created, choose **Copy link** and then choose **Close**.

1. Share the link with your colleague.

After selecting your sharing options, you are provided with a URL. You can share this link with users that have access to Amazon SageMaker Studio Classic. When the user opens the URL, they're prompted to log in using IAM Identity Center or IAM authentication. This shared notebook becomes a copy, so changes made by the recipient will not be reproduced in your original notebook.

## Use a Shared Notebook
<a name="notebooks-sharing-using"></a>

You use a shared notebook in the same way you would use a notebook that you created yourself. You must first log in to your account and then open the shared link. If you don't have an active session, you receive an error.

When you choose a link to a shared notebook for the first time, a read-only version of the notebook opens. To edit the shared notebook, choose **Create a Copy**. This copies the shared notebook to your personal storage.

The copied notebook launches on an instance of the instance type and SageMaker image that the notebook was using when the sender shared it. If you aren't currently running an instance of the instance type, a new instance is started. Customization to the SageMaker image isn't shared. You can also inspect the notebook snapshot by choosing **Snapshot Details**.

The following are some important considerations about sharing and authentication:
+ If you have an active session, you see a read-only view of the notebook until you choose **Create a Copy**.
+ If you don't have an active session, you need to log in.
+ If you use IAM to log in, select your user profile after you log in, and then choose **Open Studio Classic**. Then choose the link you were sent.
+ If you use IAM Identity Center to log in, the shared notebook opens automatically in Studio Classic after you log in.

## Shared spaces and realtime collaboration
<a name="notebooks-sharing-rtc"></a>

A shared space consists of a shared JupyterServer application and a shared directory. A key benefit of a shared space is that it facilitates real time collaboration between members of the shared space. Users collaborating in a workspace get access to a shared Studio Classic application where they can access, read, and edit their notebooks in real time. Real time collaboration is only supported for JupyterServer applications within a shared space. Users with access to a shared space can simultaneously open, view, edit, and execute Jupyter notebooks in the shared Studio Classic application in that space. For more information about shared spaces and real time collaboration, see [Collaboration with shared spaces](domain-space.md).

# Get Amazon SageMaker Studio Classic Notebook and App Metadata
<a name="notebooks-run-and-manage-metadata"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

You can access notebook metadata and App metadata using the Amazon SageMaker Studio Classic UI.

**Topics**
+ [Get Studio Classic Notebook Metadata](#notebooks-run-and-manage-metadata-notebook)
+ [Get App Metadata](#notebooks-run-and-manage-metadata-app)

## Get Studio Classic Notebook Metadata
<a name="notebooks-run-and-manage-metadata-notebook"></a>

Jupyter notebooks contain optional metadata that you can access through the Amazon SageMaker Studio Classic UI.

**To view the notebook metadata:**

1. In the right sidebar, choose the **Property Inspector** icon (![\[The Property Inspector icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/gears.png)). 

1. Open the **Advanced Tools** section.

The metadata should look similar to the following.

```
{
    "instance_type": "ml.t3.medium",
    "kernelspec": {
        "display_name": "Python 3 (Data Science)",
        "language": "python",
        "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:<acct-id>:image/datascience-1.0"
    },
    "language_info": {
        "codemirror_mode": {
            "name": "ipython",
            "version": 3
        },
        "file_extension": ".py",
        "mimetype": "text/x-python",
        "name": "python",
        "nbconvert_exporter": "python",
        "pygments_lexer": "ipython3",
        "version": "3.7.10"
    }
}
```
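If you want to work with this metadata programmatically, a short sketch like the following parses the example shown above. The `<acct-id>` placeholder is kept as-is from the sample, and the assumption that the kernel name embeds the image ARN after the `__SAGEMAKER_INTERNAL__` token is based only on this example.

```python
import json

# The notebook metadata from the example above, trimmed to the fields
# used here; the account ID placeholder is left unfilled.
raw = """
{
    "instance_type": "ml.t3.medium",
    "kernelspec": {
        "display_name": "Python 3 (Data Science)",
        "language": "python",
        "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:<acct-id>:image/datascience-1.0"
    }
}
"""
meta = json.loads(raw)
# Split the kernel name on the internal separator to recover the short
# kernel name and the SageMaker image ARN.
kernel, _, image_arn = meta["kernelspec"]["name"].partition("__SAGEMAKER_INTERNAL__")
print(meta["instance_type"], kernel, image_arn)
```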

## Get App Metadata
<a name="notebooks-run-and-manage-metadata-app"></a>

When you create a notebook in Amazon SageMaker Studio Classic, the App metadata is written to a file named `resource-metadata.json` in the folder `/opt/ml/metadata/`. You can get the App metadata by opening an Image terminal from within the notebook. The metadata gives you the following information, which includes the SageMaker image and instance type the notebook runs in:
+ **AppType** – `KernelGateway` 
+ **DomainId** – Same as the Studio Classic domain ID
+ **UserProfileName** – The profile name of the current user
+ **ResourceArn** – The Amazon Resource Name (ARN) of the App, which includes the instance type
+ **ResourceName** – The name of the SageMaker image

Additional metadata might be included for internal use by Studio Classic and is subject to change.

**To get the App metadata**

1. In the center of the notebook menu, choose the **Launch Terminal** icon (![\[The Launch terminal icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/notebook-launch-terminal.png)). This opens a terminal in the SageMaker image that the notebook runs in.

1. Run the following commands to display the contents of the `resource-metadata.json` file.

   ```
   $ cd /opt/ml/metadata/
   cat resource-metadata.json
   ```

   The file should look similar to the following.

   ```
   {
       "AppType": "KernelGateway",
       "DomainId": "d-xxxxxxxxxxxx",
       "UserProfileName": "profile-name",
       "ResourceArn": "arn:aws:sagemaker:us-east-2:account-id:app/d-xxxxxxxxxxxx/profile-name/KernelGateway/datascience--1-0-ml-t3-medium",
       "ResourceName": "datascience--1-0-ml",
       "AppImageVersion":""
   }
   ```
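A small sketch like the following shows one way to read fields out of a file with this structure; it uses an inline copy of the sample above rather than the real `/opt/ml/metadata/resource-metadata.json` path, and the observation that the last ARN segment is the app name (which embeds the instance type) is based only on this example.

```python
import json

# Inline copy of the sample resource-metadata.json shown above; on a
# Studio Classic image you would read /opt/ml/metadata/resource-metadata.json.
raw = """
{
    "AppType": "KernelGateway",
    "DomainId": "d-xxxxxxxxxxxx",
    "UserProfileName": "profile-name",
    "ResourceArn": "arn:aws:sagemaker:us-east-2:account-id:app/d-xxxxxxxxxxxx/profile-name/KernelGateway/datascience--1-0-ml-t3-medium",
    "ResourceName": "datascience--1-0-ml",
    "AppImageVersion": ""
}
"""
meta = json.loads(raw)
# The final ARN segment is the app name, which embeds the instance type.
app_name = meta["ResourceArn"].rsplit("/", 1)[-1]
print(meta["AppType"], meta["DomainId"], app_name)
```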

# Get Notebook Differences in Amazon SageMaker Studio Classic
<a name="notebooks-diff"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

You can display the difference between the current notebook and the last checkpoint or the last Git commit using the Amazon SageMaker AI UI.

The following screenshot shows the menu from a Studio Classic notebook.

![\[The location of the relevant menu in a Studio Classic notebook.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/studio-notebook-menu-diffs.png)


**Topics**
+ [Get the Difference Between the Last Checkpoint](#notebooks-diff-checkpoint)
+ [Get the Difference Between the Last Commit](#notebooks-diff-git)

## Get the Difference Between the Last Checkpoint
<a name="notebooks-diff-checkpoint"></a>

When you create a notebook, a hidden checkpoint file that matches the notebook is created. You can view changes between the notebook and the checkpoint file or revert the notebook to match the checkpoint file.

By default, a notebook is auto-saved every 120 seconds and also when you close the notebook. However, the checkpoint file isn't updated to match the notebook. To save the notebook and update the checkpoint file to match, you must choose the **Save notebook and create checkpoint** icon (![\[Save notebook and create checkpoint icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/notebook-save-and-checkpoint.png)) on the left of the notebook menu or use the `Ctrl + S` keyboard shortcut.

To view the changes between the notebook and the checkpoint file, choose the **Checkpoint diff** icon (![\[Checkpoint diff icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/notebook-checkpoint-diff.png)) in the center of the notebook menu.

To revert the notebook to the checkpoint file, from the main Studio Classic menu, choose **File** then **Revert Notebook to Checkpoint**.

## Get the Difference Between the Last Commit
<a name="notebooks-diff-git"></a>

If a notebook is opened from a Git repository, you can view the difference between the notebook and the last Git commit.

To view the changes in the notebook from the last Git commit, choose the **Git diff** icon (![\[Dark button with white text displaying "git" in lowercase letters.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/notebook-git-diff.png)) in the center of the notebook menu.
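The **Git diff** icon corresponds to diffing the notebook file against the last commit. A minimal sketch of building the equivalent `git` command follows; the notebook filename is a hypothetical example.

```python
def git_diff_command(notebook_path, ref="HEAD"):
    """Build the git command that diffs one file against a commit."""
    return ["git", "diff", ref, "--", notebook_path]

# "analysis.ipynb" is a hypothetical notebook name.
cmd = git_diff_command("analysis.ipynb")
print(" ".join(cmd))
# To run it from inside the cloned repository:
# import subprocess
# subprocess.run(cmd, check=True)
```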

# Manage Resources for Amazon SageMaker Studio Classic Notebooks
<a name="notebooks-run-and-manage"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

You can change the instance type, and SageMaker image and kernel from within an Amazon SageMaker Studio Classic notebook. To create a custom kernel to use with your notebooks, see [Custom Images in Amazon SageMaker Studio Classic](studio-byoi.md).

**Topics**
+ [Change the Instance Type for an Amazon SageMaker Studio Classic Notebook](notebooks-run-and-manage-switch-instance-type.md)
+ [Change the Image or a Kernel for an Amazon SageMaker Studio Classic Notebook](notebooks-run-and-manage-change-image.md)
+ [Shut Down Resources from Amazon SageMaker Studio Classic](notebooks-run-and-manage-shut-down.md)

# Change the Instance Type for an Amazon SageMaker Studio Classic Notebook
<a name="notebooks-run-and-manage-switch-instance-type"></a>

When you open a new Studio Classic notebook for the first time, you are assigned a default Amazon Elastic Compute Cloud (Amazon EC2) instance type to run the notebook. When you open additional notebooks on the same instance type, the notebooks run on the same instance as the first notebook, even if the notebooks use different kernels. 

You can change the instance type that your Studio Classic notebook runs on from within the notebook. 

The following information only applies to Studio Classic notebooks. For information about how to change the instance type of an Amazon SageMaker notebook instance, see [Update a Notebook Instance](nbi-update.md).

**Important**  
If you change the instance type, unsaved information and existing settings for the notebook are lost, and installed packages must be re-installed.  
The previous instance type continues to run even if no kernel sessions or apps are active. You must explicitly stop the instance to stop accruing charges. To stop the instance, see [Shut down resources](notebooks-run-and-manage-shut-down.md#notebooks-run-and-manage-shut-down-sessions).

The following screenshot shows the menu from a Studio Classic notebook. The processor and memory of the instance type powering the notebook are displayed as **2 vCPU + 4 GiB**.

![\[The location of the processor and memory of the instance type for the Studio Classic notebook.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/studio-notebook-menu-instance.png)


**To change the instance type**

1. Choose the processor and memory of the instance type powering the notebook. This opens a pop-up window.

1. In the **Set up notebook environment** pop-up window, select the **Instance type** dropdown menu.

1. From the **Instance type** dropdown, choose one of the listed instance types.

1. After choosing a type, choose **Select**.

1. Wait for the new instance to become enabled. The new instance type information is then displayed.

For a list of the available instance types, see [Instance Types Available for Use With Amazon SageMaker Studio Classic Notebooks](notebooks-available-instance-types.md). 
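You can also check which instance types are currently backing your running apps with the SageMaker `ListApps` API. The following is a sketch that filters a `list_apps`-style response; the sample data is hypothetical, and the exact response shape should be verified against the SageMaker API reference.

```python
def running_instance_types(apps):
    """Return the set of instance types backing apps that are InService."""
    return {
        app["ResourceSpec"]["InstanceType"]
        for app in apps
        if app.get("Status") == "InService" and "ResourceSpec" in app
    }

# With boto3 (the domain ID is a placeholder):
# import boto3
# resp = boto3.client("sagemaker").list_apps(DomainIdEquals="d-xxxxxxxxxxxx")
# print(running_instance_types(resp["Apps"]))

# Hypothetical sample response entries:
sample_apps = [
    {"AppName": "datascience--1-0-ml", "Status": "InService",
     "ResourceSpec": {"InstanceType": "ml.t3.medium"}},
    {"AppName": "old-app", "Status": "Deleted",
     "ResourceSpec": {"InstanceType": "ml.m5.large"}},
]
print(running_instance_types(sample_apps))
```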

# Change the Image or a Kernel for an Amazon SageMaker Studio Classic Notebook
<a name="notebooks-run-and-manage-change-image"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

With Amazon SageMaker Studio Classic notebooks, you can change the notebook's image or kernel from within the notebook.

The following screenshot shows the menu from a Studio Classic notebook. The current SageMaker AI kernel and image are displayed as **Python 3 (Data Science)**, where `Python 3` denotes the kernel and `Data Science` denotes the SageMaker AI image that contains the kernel. The color of the circle to the right indicates whether the kernel is idle or busy. The kernel is busy when the center and the edge of the circle are the same color.

![\[The location of the current kernel and image in the menu bar from a Studio Classic notebook.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/studio-notebook-menu-kernel.png)


**To change a notebook's image or kernel**

1. Choose the image/kernel name in the notebook menu.

1. In the **Set up notebook environment** pop-up window, select the **Image** or **Kernel** dropdown menu.

1. From the dropdown menu, choose one of the images or kernels that are listed.

1. After choosing an image or kernel, choose **Select**.

1. Wait for the kernel's status to show as idle, which indicates the kernel has started.

For a list of available SageMaker images and kernels, see [Amazon SageMaker Images Available for Use With Studio Classic Notebooks](notebooks-available-images.md).

# Shut Down Resources from Amazon SageMaker Studio Classic
<a name="notebooks-run-and-manage-shut-down"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

You can shut down individual Amazon SageMaker AI resources, including notebooks, terminals, kernels, apps, and instances from Studio Classic. You can also shut down all of the resources in one of these categories at the same time. Amazon SageMaker Studio Classic does not support shutting down resources from within a notebook.

**Note**  
When you shut down a Studio Classic notebook instance, additional resources that you created in Studio Classic are not deleted. For example, additional resources can include SageMaker AI endpoints, Amazon EMR clusters, and Amazon S3 buckets. To stop the accrual of charges, you must manually delete these resources. For information about finding resources that are accruing charges, see [Analyzing your costs with AWS Cost Explorer](https://docs.aws.amazon.com/cost-management/latest/userguide/ce-what-is.html).

The following topics demonstrate how to shut down these SageMaker AI resources.

**Topics**
+ [Shut down an open notebook](#notebooks-run-and-manage-shut-down-notebook)
+ [Shut down resources](#notebooks-run-and-manage-shut-down-sessions)

## Shut down an open notebook
<a name="notebooks-run-and-manage-shut-down-notebook"></a>

When you shut down a Studio Classic notebook, the notebook is not deleted. The kernel that the notebook is running on is shut down and any unsaved information in the notebook is lost. You can shut down an open notebook from the Studio Classic **File** menu or from the Running Terminal and Kernels pane. The following procedure shows how to shut down an open notebook from the Studio Classic **File** menu.

**To shut down an open notebook from the File menu**

1. Launch Studio Classic by following the steps in [Launch Amazon SageMaker Studio Classic](studio-launch.md).

1. (Optional) Save the notebook contents by choosing **File**, then **Save Notebook**.

1. Choose **File**.

1. Choose **Close and Shutdown Notebook**. This opens a pop-up window.

1. From the pop-up window, choose **OK**.

## Shut down resources
<a name="notebooks-run-and-manage-shut-down-sessions"></a>

You can reach the **Running Terminals and Kernels** pane of Amazon SageMaker Studio Classic by choosing the **Running Terminals and Kernels** icon (![\[Running Terminals and Kernels icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/running-terminals-kernels.png)). The **Running Terminals and Kernels** pane consists of four sections. Each section lists all the resources of that type. You can shut down each resource individually or shut down all the resources in a section at the same time.

When you choose to shut down all resources in a section, the following occurs:
+ **RUNNING INSTANCES/RUNNING APPS** – All instances, apps, notebooks, kernel sessions, consoles/shells, and image terminals are shut down. System terminals aren't shut down.
+ **KERNEL SESSIONS** – All kernels, notebooks, and consoles/shells are shut down.
+ **TERMINAL SESSIONS** – All image terminals and system terminals are shut down.

**To shut down resources**

1. Launch Studio Classic by following the steps in [Launch Amazon SageMaker Studio Classic](studio-launch.md).

1. Choose the **Running Terminals and Kernels** icon.

1. Do either of the following:
   + To shut down a specific resource, choose the **Shut Down** icon on the same row as the resource.

     For running instances, a confirmation dialog box lists all of the running apps and other resources that SageMaker AI will shut down. To proceed, choose **Shut down all**.
**Note**  
A confirmation dialog box isn't displayed for kernel sessions or terminal sessions.
   + To shut down all resources in a section, choose the **X** to the right of the section label. A confirmation dialog box is displayed. Choose **Shut down all** to proceed.
**Note**  
When you shut down these Studio Classic resources, any additional resources created from Studio Classic, such as SageMaker AI endpoints, Amazon EMR clusters, and Amazon S3 buckets are not deleted. You must manually delete these resources to stop the accrual of charges. For information about finding resources that are accruing charges, see [Analyzing your costs with AWS Cost Explorer](https://docs.aws.amazon.com/cost-management/latest/userguide/ce-what-is.html).
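Apps can also be shut down programmatically with the SageMaker `DeleteApp` API. The following is a sketch that selects still-running apps from a `list_apps`-style response; the sample entries and identifiers are placeholders, and the actual `boto3` call is shown in comments.

```python
def apps_to_shut_down(apps):
    """Select apps that are still running; return (AppType, AppName) pairs."""
    return [(a["AppType"], a["AppName"]) for a in apps if a["Status"] == "InService"]

# Deleting each selected app with boto3 (identifiers are placeholders):
# import boto3
# sm = boto3.client("sagemaker")
# for app_type, app_name in apps_to_shut_down(resp["Apps"]):
#     sm.delete_app(DomainId="d-xxxxxxxxxxxx", UserProfileName="profile-name",
#                   AppType=app_type, AppName=app_name)

# Hypothetical sample response entries:
sample = [
    {"AppType": "KernelGateway", "AppName": "datascience--1-0-ml", "Status": "InService"},
    {"AppType": "JupyterServer", "AppName": "default", "Status": "Deleted"},
]
print(apps_to_shut_down(sample))
```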

# Usage Metering for Amazon SageMaker Studio Classic Notebooks
<a name="notebooks-usage-metering"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

There is no additional charge for using Amazon SageMaker Studio Classic. The costs incurred for running Amazon SageMaker Studio Classic notebooks, interactive shells, consoles, and terminals are based on Amazon Elastic Compute Cloud (Amazon EC2) instance usage.

When you run the following resources, you must choose a SageMaker image and kernel:

**From the Studio Classic Launcher**
+ Notebook
+ Interactive Shell
+ Image Terminal

**From the File menu**
+ Notebook
+ Console

When launched, the resource is run on an Amazon EC2 instance of the chosen instance type. If an instance of that type was previously launched and is available, the resource is run on that instance.

For CPU-based images, the default suggested instance type is `ml.t3.medium`. For GPU-based images, the default suggested instance type is `ml.g4dn.xlarge`.

The costs incurred are based on the instance type. You are billed separately for each instance.

Metering starts when an instance is created. Metering ends when all the apps on the instance are shut down, or the instance is shut down. For information about how to shut down an instance, see [Shut Down Resources from Amazon SageMaker Studio Classic](notebooks-run-and-manage-shut-down.md).

**Important**  
You must shut down the instance to stop incurring charges. If you shut down the notebook running on the instance but don't shut down the instance, you will still incur charges. When you shut down the Studio Classic notebook instances, any additional resources, such as SageMaker AI endpoints, Amazon EMR clusters, and Amazon S3 buckets created from Studio Classic are not deleted. Delete those resources to stop accrual of charges.

When you open multiple notebooks on the same instance type, the notebooks run on the same instance even if they are using different kernels. You are billed only for the time that one instance is running.
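To illustrate the per-instance billing model described above, the following sketch computes cost as instance uptime times an hourly rate. The rates shown are placeholders, not published SageMaker prices; see the pricing page for actual rates.

```python
def billable_cost(instance_hours, hourly_rate):
    """Cost for one instance: uptime in hours times the instance's hourly rate."""
    return instance_hours * hourly_rate

# Two notebooks sharing one instance for 3 hours bill as a single instance
# (0.05 is a placeholder hourly rate):
shared = billable_cost(3.0, 0.05)

# Notebooks on two different instance types bill separately, one charge per
# instance (both rates are placeholders):
total = billable_cost(3.0, 0.05) + billable_cost(1.0, 0.70)
print(round(shared, 4), round(total, 4))
```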

You can change the instance type from within the notebook after you open it. For more information, see [Change the Instance Type for an Amazon SageMaker Studio Classic Notebook](notebooks-run-and-manage-switch-instance-type.md).

For information about billing along with pricing examples, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

# Available Resources for Amazon SageMaker Studio Classic Notebooks
<a name="notebooks-resources"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

The following sections list the available resources for Amazon SageMaker Studio Classic notebooks.

**Topics**
+ [Instance Types Available for Use With Amazon SageMaker Studio Classic Notebooks](notebooks-available-instance-types.md)
+ [Amazon SageMaker Images Available for Use With Studio Classic Notebooks](notebooks-available-images.md)

# Instance Types Available for Use With Amazon SageMaker Studio Classic Notebooks
<a name="notebooks-available-instance-types"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

Amazon SageMaker Studio Classic notebooks run on Amazon Elastic Compute Cloud (Amazon EC2) instances. The following Amazon EC2 instance types are available for use with Studio Classic notebooks. For detailed information on which instance types fit your use case, and their performance capabilities, see [Amazon Elastic Compute Cloud Instance types](https://aws.amazon.com/ec2/instance-types/). For information about pricing for these instance types, see [Amazon EC2 Pricing](https://aws.amazon.com/ec2/pricing/).

For information about available Amazon SageMaker Notebook Instance types, see [CreateNotebookInstance](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateNotebookInstance.html#sagemaker-CreateNotebookInstance-request-InstanceType).

**Note**  
For most use cases, you should use an `ml.t3.medium` instance. This is the default instance type for CPU-based SageMaker images, and is available as part of the [AWS Free Tier](https://aws.amazon.com/free).

**Topics**
+ [CPU instances](#notebooks-resources-no-gpu)
+ [Instances with 1 or more GPUs](#notebooks-resources-gpu)

## CPU instances
<a name="notebooks-resources-no-gpu"></a>

The following table lists the Amazon EC2 CPU instance types with no GPU attached that are available for use with Studio Classic notebooks. It also lists information about the specifications of each instance type. The default instance type for CPU-based images is `ml.t3.medium`. 

For detailed information on which instance types fit your use case, and their performance capabilities, see [Amazon Elastic Compute Cloud Instance types](https://aws.amazon.com/ec2/instance-types/). For information about pricing for these instance types, see [Amazon EC2 Pricing](https://aws.amazon.com/ec2/pricing/).

CPU instances


| Instance | Use case | Fast launch | vCPU | Memory (GiB) | Instance Storage (GB) | 
| --- | --- | --- | --- | --- | --- | 
| ml.t3.medium | General purpose | Yes | 2 | 4 | Amazon EBS Only | 
| ml.t3.large | General purpose | No | 2 | 8 | Amazon EBS Only | 
| ml.t3.xlarge | General purpose | No | 4 | 16 | Amazon EBS Only | 
| ml.t3.2xlarge | General purpose | No | 8 | 32 | Amazon EBS Only | 
| ml.m5.large | General purpose | Yes | 2 | 8 | Amazon EBS Only | 
| ml.m5.xlarge | General purpose | No | 4 | 16 | Amazon EBS Only | 
| ml.m5.2xlarge | General purpose | No | 8 | 32 | Amazon EBS Only | 
| ml.m5.4xlarge | General purpose | No | 16 | 64 | Amazon EBS Only | 
| ml.m5.8xlarge | General purpose | No | 32 | 128 | Amazon EBS Only | 
| ml.m5.12xlarge | General purpose | No | 48 | 192 | Amazon EBS Only | 
| ml.m5.16xlarge | General purpose | No | 64 | 256 | Amazon EBS Only | 
| ml.m5.24xlarge | General purpose | No | 96 | 384 | Amazon EBS Only | 
| ml.m5d.large | General purpose | No | 2 | 8 | 1 x 75 NVMe SSD | 
| ml.m5d.xlarge | General purpose | No | 4 | 16 | 1 x 150 NVMe SSD | 
| ml.m5d.2xlarge | General purpose | No | 8 | 32 | 1 x 300 NVMe SSD | 
| ml.m5d.4xlarge | General purpose | No | 16 | 64 | 2 x 300 NVMe SSD | 
| ml.m5d.8xlarge | General purpose | No | 32 | 128 | 2 x 600 NVMe SSD | 
| ml.m5d.12xlarge | General purpose | No | 48 | 192 | 2 x 900 NVMe SSD | 
| ml.m5d.16xlarge | General purpose | No | 64 | 256 | 4 x 600 NVMe SSD | 
| ml.m5d.24xlarge | General purpose | No | 96 | 384 | 4 x 900 NVMe SSD | 
| ml.c5.large | Compute optimized | Yes | 2 | 4 | Amazon EBS Only | 
| ml.c5.xlarge | Compute optimized | No | 4 | 8 | Amazon EBS Only | 
| ml.c5.2xlarge | Compute optimized | No | 8 | 16 | Amazon EBS Only | 
| ml.c5.4xlarge | Compute optimized | No | 16 | 32 | Amazon EBS Only | 
| ml.c5.9xlarge | Compute optimized | No | 36 | 72 | Amazon EBS Only | 
| ml.c5.12xlarge | Compute optimized | No | 48 | 96 | Amazon EBS Only | 
| ml.c5.18xlarge | Compute optimized | No | 72 | 144 | Amazon EBS Only | 
| ml.c5.24xlarge | Compute optimized | No | 96 | 192 | Amazon EBS Only | 
| ml.r5.large | Memory optimized | No | 2 | 16 | Amazon EBS Only | 
| ml.r5.xlarge | Memory optimized | No | 4 | 32 | Amazon EBS Only | 
| ml.r5.2xlarge | Memory optimized | No | 8 | 64 | Amazon EBS Only | 
| ml.r5.4xlarge | Memory optimized | No | 16 | 128 | Amazon EBS Only | 
| ml.r5.8xlarge | Memory optimized | No | 32 | 256 | Amazon EBS Only | 
| ml.r5.12xlarge | Memory optimized | No | 48 | 384 | Amazon EBS Only | 
| ml.r5.16xlarge | Memory optimized | No | 64 | 512 | Amazon EBS Only | 
| ml.r5.24xlarge | Memory optimized | No | 96 | 768 | Amazon EBS Only | 
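As an illustration of reading the table above, the following sketch picks the first listed instance type that satisfies given vCPU and memory requirements. Only a few rows are reproduced; extend the list from the table as needed.

```python
# A few rows from the CPU instance table above: (name, vCPU, memory in GiB).
CPU_INSTANCES = [
    ("ml.t3.medium", 2, 4),
    ("ml.t3.large", 2, 8),
    ("ml.m5.xlarge", 4, 16),
    ("ml.m5.4xlarge", 16, 64),
]

def smallest_fit(min_vcpu, min_mem_gib, instances=CPU_INSTANCES):
    """Return the first (smallest listed) instance meeting both requirements."""
    for name, vcpu, mem in instances:
        if vcpu >= min_vcpu and mem >= min_mem_gib:
            return name
    return None

print(smallest_fit(4, 10))
```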

## Instances with 1 or more GPUs
<a name="notebooks-resources-gpu"></a>

The following table lists the Amazon EC2 instance types with 1 or more GPUs attached that are available for use with Studio Classic notebooks. It also lists information about the specifications of each instance type. The default instance type for GPU-based images is `ml.g4dn.xlarge`. 

For detailed information on which instance types fit your use case, and their performance capabilities, see [Amazon Elastic Compute Cloud Instance types](https://aws.amazon.com/ec2/instance-types/). For information about pricing for these instance types, see [Amazon EC2 Pricing](https://aws.amazon.com/ec2/pricing/).

Instances with 1 or more GPUs


| Instance | Use case | Fast launch | GPUs | vCPU | Memory (GiB) | GPU Memory (GiB) | Instance Storage (GB) | 
| --- | --- | --- | --- | --- | --- | --- | --- | 
| ml.p3.2xlarge | Accelerated computing | No | 1 | 8 | 61 | 16 | Amazon EBS Only | 
| ml.p3.8xlarge | Accelerated computing | No | 4 | 32 | 244 | 64 | Amazon EBS Only | 
| ml.p3.16xlarge | Accelerated computing | No | 8 | 64 | 488 | 128 | Amazon EBS Only | 
| ml.p3dn.24xlarge | Accelerated computing | No | 8 | 96 | 768 | 256 | 2 x 900 NVMe SSD | 
| ml.p4d.24xlarge | Accelerated computing | No | 8 | 96 | 1152 | 320 GB HBM2 | 8 x 1000 NVMe SSD | 
| ml.p4de.24xlarge | Accelerated computing | No | 8 | 96 | 1152 | 640 GB HBM2e | 8 x 1000 NVMe SSD | 
| ml.g4dn.xlarge | Accelerated computing | Yes | 1 | 4 | 16 | 16 | 1 x 125 NVMe SSD | 
| ml.g4dn.2xlarge | Accelerated computing | No | 1 | 8 | 32 | 16 | 1 x 225 NVMe SSD | 
| ml.g4dn.4xlarge | Accelerated computing | No | 1 | 16 | 64 | 16 | 1 x 225 NVMe SSD | 
| ml.g4dn.8xlarge | Accelerated computing | No | 1 | 32 | 128 | 16 | 1 x 900 NVMe SSD | 
| ml.g4dn.12xlarge | Accelerated computing | No | 4 | 48 | 192 | 64 | 1 x 900 NVMe SSD | 
| ml.g4dn.16xlarge | Accelerated computing | No | 1 | 64 | 256 | 16 | 1 x 900 NVMe SSD | 
| ml.g5.xlarge | Accelerated computing | No | 1 | 4 | 16 | 24 | 1 x 250 NVMe SSD | 
| ml.g5.2xlarge | Accelerated computing | No | 1 | 8 | 32 | 24 | 1 x 450 NVMe SSD | 
| ml.g5.4xlarge | Accelerated computing | No | 1 | 16 | 64 | 24 | 1 x 600 NVMe SSD | 
| ml.g5.8xlarge | Accelerated computing | No | 1 | 32 | 128 | 24 | 1 x 900 NVMe SSD | 
| ml.g5.12xlarge | Accelerated computing | No | 4 | 48 | 192 | 96 | 1 x 3800 NVMe SSD | 
| ml.g5.16xlarge | Accelerated computing | No | 1 | 64 | 256 | 24 | 1 x 1900 NVMe SSD | 
| ml.g5.24xlarge | Accelerated computing | No | 4 | 96 | 384 | 96 | 1 x 3800 NVMe SSD | 
| ml.g5.48xlarge | Accelerated computing | No | 8 | 192 | 768 | 192 | 2 x 3800 NVMe SSD | 

# Amazon SageMaker Images Available for Use With Studio Classic Notebooks
<a name="notebooks-available-images"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

This page lists the SageMaker images and associated kernels that are available in Amazon SageMaker Studio Classic. This page also gives information about the format needed to create the ARN for each image. SageMaker images contain the latest [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) and the latest version of the kernel. For more information, see [Deep Learning Containers Images](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html).

**Topics**
+ [Image ARN format](#notebooks-available-images-arn)
+ [Supported URI tags](#notebooks-available-uri-tag)
+ [Supported images](#notebooks-available-images-supported)
+ [Images slated for deprecation](#notebooks-available-images-deprecation)
+ [Deprecated images](#notebooks-available-images-deprecated)

## Image ARN format
<a name="notebooks-available-images-arn"></a>

The following table lists the image ARN and URI format for each Region. To create the full ARN for an image, replace the *resource-identifier* placeholder with the corresponding resource identifier for the image. The resource identifier is found in the SageMaker images and kernels table. To create the full URI for an image, replace the *tag* placeholder with the corresponding cpu or gpu tag. For the list of tags you can use, see [Supported URI tags](#notebooks-available-uri-tag).
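As an illustration of the format, a minimal Python sketch that assembles a standard image ARN from a Region and a resource identifier, using account IDs taken from the table that follows. The resource identifier shown is a placeholder; use the value from the SageMaker images and kernels table.

```python
# Account IDs for standard SageMaker image ARNs, from the Region table below
# (subset shown; see the table for all Regions).
IMAGE_ACCOUNTS = {
    "us-east-1": "081325390199",
    "us-east-2": "429704687514",
    "us-west-2": "236514542706",
}

def image_arn(region, resource_identifier):
    """Build the full image ARN for a Region and image resource identifier."""
    account = IMAGE_ACCOUNTS[region]
    return f"arn:aws:sagemaker:{region}:{account}:image/{resource_identifier}"

# "datascience-1.0" is a placeholder resource identifier.
print(image_arn("us-east-1", "datascience-1.0"))
```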

**Note**  
SageMaker Distribution images use a distinct set of image ARNs, which are listed in the following table.


| Region | Image ARN Format | SageMaker Distribution Image ARN Format | SageMaker Distribution Image URI Format | 
| --- | --- | --- | --- | 
|  us-east-1  | arn:aws:sagemaker:us-east-1:081325390199:image/resource-identifier | arn:aws:sagemaker:us-east-1:885854791233:image/resource-identifier | 885854791233.dkr.ecr.us-east-1.amazonaws.com/sagemaker-distribution-prod:tag | 
|  us-east-2  | arn:aws:sagemaker:us-east-2:429704687514:image/resource-identifier | arn:aws:sagemaker:us-east-2:137914896644:image/resource-identifier | 137914896644.dkr.ecr.us-east-2.amazonaws.com/sagemaker-distribution-prod:tag | 
|  us-west-1  | arn:aws:sagemaker:us-west-1:742091327244:image/resource-identifier | arn:aws:sagemaker:us-west-1:053634841547:image/resource-identifier | 053634841547.dkr.ecr.us-west-1.amazonaws.com/sagemaker-distribution-prod:tag | 
|  us-west-2  | arn:aws:sagemaker:us-west-2:236514542706:image/resource-identifier | arn:aws:sagemaker:us-west-2:542918446943:image/resource-identifier | 542918446943.dkr.ecr.us-west-2.amazonaws.com/sagemaker-distribution-prod:tag | 
|  af-south-1  | arn:aws:sagemaker:af-south-1:559312083959:image/resource-identifier | arn:aws:sagemaker:af-south-1:238384257742:image/resource-identifier | 238384257742.dkr.ecr.af-south-1.amazonaws.com/sagemaker-distribution-prod:tag | 
|  ap-east-1  | arn:aws:sagemaker:ap-east-1:493642496378:image/resource-identifier | arn:aws:sagemaker:ap-east-1:523751269255:image/resource-identifier | 523751269255.dkr.ecr.ap-east-1.amazonaws.com/sagemaker-distribution-prod:tag | 
|  ap-south-1  | arn:aws:sagemaker:ap-south-1:394103062818:image/resource-identifier | arn:aws:sagemaker:ap-south-1:245090515133:image/resource-identifier | 245090515133.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-distribution-prod:tag | 
|  ap-northeast-2  | arn:aws:sagemaker:ap-northeast-2:806072073708:image/resource-identifier | arn:aws:sagemaker:ap-northeast-2:064688005998:image/resource-identifier | 064688005998.dkr.ecr.ap-northeast-2.amazonaws.com/sagemaker-distribution-prod:tag | 
|  ap-southeast-1  | arn:aws:sagemaker:ap-southeast-1:492261229750:image/resource-identifier | arn:aws:sagemaker:ap-southeast-1:022667117163:image/resource-identifier | 022667117163.dkr.ecr.ap-southeast-1.amazonaws.com/sagemaker-distribution-prod:tag | 
|  ap-southeast-2  | arn:aws:sagemaker:ap-southeast-2:452832661640:image/resource-identifier | arn:aws:sagemaker:ap-southeast-2:648430277019:image/resource-identifier | 648430277019.dkr.ecr.ap-southeast-2.amazonaws.com/sagemaker-distribution-prod:tag | 
|  ap-northeast-1  | arn:aws:sagemaker:ap-northeast-1:102112518831:image/resource-identifier | arn:aws:sagemaker:ap-northeast-1:010972774902:image/resource-identifier | 010972774902.dkr.ecr.ap-northeast-1.amazonaws.com/sagemaker-distribution-prod:tag | 
|  ca-central-1  | arn:aws:sagemaker:ca-central-1:310906938811:image/resource-identifier | arn:aws:sagemaker:ca-central-1:481561238223:image/resource-identifier | 481561238223.dkr.ecr.ca-central-1.amazonaws.com/sagemaker-distribution-prod:tag | 
|  eu-central-1  | arn:aws:sagemaker:eu-central-1:936697816551:image/resource-identifier | arn:aws:sagemaker:eu-central-1:545423591354:image/resource-identifier | 545423591354.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-distribution-prod:tag | 
|  eu-west-1  | arn:aws:sagemaker:eu-west-1:470317259841:image/resource-identifier | arn:aws:sagemaker:eu-west-1:819792524951:image/resource-identifier | 819792524951.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-distribution-prod:tag | 
|  eu-west-2  | arn:aws:sagemaker:eu-west-2:712779665605:image/resource-identifier | arn:aws:sagemaker:eu-west-2:021081402939:image/resource-identifier | 021081402939.dkr.ecr.eu-west-2.amazonaws.com/sagemaker-distribution-prod:tag | 
|  eu-west-3  | arn:aws:sagemaker:eu-west-3:615547856133:image/resource-identifier | arn:aws:sagemaker:eu-west-3:856416204555:image/resource-identifier | 856416204555.dkr.ecr.eu-west-3.amazonaws.com/sagemaker-distribution-prod:tag | 
|  eu-north-1  | arn:aws:sagemaker:eu-north-1:243637512696:image/resource-identifier | arn:aws:sagemaker:eu-north-1:175620155138:image/resource-identifier | 175620155138.dkr.ecr.eu-north-1.amazonaws.com/sagemaker-distribution-prod:tag | 
|  eu-south-1  | arn:aws:sagemaker:eu-south-1:592751261982:image/resource-identifier | arn:aws:sagemaker:eu-south-1:810671768855:image/resource-identifier | 810671768855.dkr.ecr.eu-south-1.amazonaws.com/sagemaker-distribution-prod:tag | 
|  sa-east-1  | arn:aws:sagemaker:sa-east-1:782484402741:image/resource-identifier | arn:aws:sagemaker:sa-east-1:567556641782:image/resource-identifier | 567556641782.dkr.ecr.sa-east-1.amazonaws.com/sagemaker-distribution-prod:tag | 
|  ap-northeast-3  | arn:aws:sagemaker:ap-northeast-3:792733760839:image/resource-identifier | arn:aws:sagemaker:ap-northeast-3:564864627153:image/resource-identifier | 564864627153.dkr.ecr.ap-northeast-3.amazonaws.com/sagemaker-distribution-prod:tag | 
|  ap-southeast-3  | arn:aws:sagemaker:ap-southeast-3:276181064229:image/resource-identifier | arn:aws:sagemaker:ap-southeast-3:370607712162:image/resource-identifier | 370607712162.dkr.ecr.ap-southeast-3.amazonaws.com/sagemaker-distribution-prod:tag | 
|  me-south-1  | arn:aws:sagemaker:me-south-1:117516905037:image/resource-identifier | arn:aws:sagemaker:me-south-1:523774347010:image/resource-identifier | 523774347010.dkr.ecr.me-south-1.amazonaws.com/sagemaker-distribution-prod:tag | 
|  me-central-1  | arn:aws:sagemaker:me-central-1:103105715889:image/resource-identifier | arn:aws:sagemaker:me-central-1:358593528301:image/resource-identifier | 358593528301.dkr.ecr.me-central-1.amazonaws.com/sagemaker-distribution-prod:tag | 

## Supported URI tags
<a name="notebooks-available-uri-tag"></a>

The following list shows the tags you can include in your image URI.
+ 1-cpu
+ 1-gpu
+ 0-cpu
+ 0-gpu

**The following examples show URIs with various tag formats:**
+ 542918446943.dkr.ecr.us-west-2.amazonaws.com/sagemaker-distribution-prod:1-cpu
+ 542918446943.dkr.ecr.us-west-2.amazonaws.com/sagemaker-distribution-prod:0-gpu
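As a sketch, a URI in this format combines the Region-specific ECR account ID (from the table earlier on this page) with the Region and a supported tag. The helper below is hypothetical, not part of any AWS SDK; it simply assembles the string:

```python
# Supported tags from the list above.
SUPPORTED_TAGS = {"0-cpu", "0-gpu", "1-cpu", "1-gpu"}

def build_distribution_uri(account_id: str, region: str, tag: str) -> str:
    """Assemble a SageMaker Distribution image URI (hypothetical helper).

    account_id must be the ECR account for the given Region, taken
    from the Region table earlier on this page.
    """
    if tag not in SUPPORTED_TAGS:
        raise ValueError(f"unsupported tag: {tag}")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/sagemaker-distribution-prod:{tag}"

# Reproduces the first example URI above.
print(build_distribution_uri("542918446943", "us-west-2", "1-cpu"))
```

Passing a tag outside the supported list raises `ValueError`, which guards against typos before the URI reaches a `docker pull` or an API call.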

## Supported images
<a name="notebooks-available-images-supported"></a>

The following table describes the SageMaker images and associated kernels available in Amazon SageMaker Studio Classic, along with each image's resource identifier and included Python version.

SageMaker images and kernels


| SageMaker Image | Description | Resource Identifier | Kernels (and Identifier) | Python Version | 
| --- | --- | --- | --- | --- | 
| Base Python 4.3 | Official Python 3.11 image from DockerHub with boto3 and AWS CLI included. | sagemaker-base-python-v4 | Python 3 (python3) | Python 3.11 | 
| Base Python 4.2 | Official Python 3.11 image from DockerHub with boto3 and AWS CLI included. | sagemaker-base-python-v4 | Python 3 (python3) | Python 3.11 | 
| Base Python 4.1 | Official Python 3.11 image from DockerHub with boto3 and AWS CLI included. | sagemaker-base-python-v4 | Python 3 (python3) | Python 3.11 | 
| Base Python 4.0 | Official Python 3.11 image from DockerHub with boto3 and AWS CLI included. | sagemaker-base-python-v4 | Python 3 (python3) | Python 3.11 | 
| Base Python 3.0 | Official Python 3.10 image from DockerHub with boto3 and AWS CLI included. | sagemaker-base-python-310-v1 | Python 3 (python3) | Python 3.10 | 
| Data Science 5.3 | Data Science 5.3 is a Python 3.11 [conda](https://docs.conda.io/projects/conda/en/latest/index.html) image based on Ubuntu version jammy-20240212. It includes the most commonly used Python packages and libraries, such as NumPy and SciKit Learn. | sagemaker-data-science-v5 | Python 3 (python3) | Python 3.11 | 
| Data Science 5.2 | Data Science 5.2 is a Python 3.11 [conda](https://docs.conda.io/projects/conda/en/latest/index.html) image based on Ubuntu version jammy-20240212. It includes the most commonly used Python packages and libraries, such as NumPy and SciKit Learn. | sagemaker-data-science-v5 | Python 3 (python3) | Python 3.11 | 
| Data Science 5.1 | Data Science 5.1 is a Python 3.11 [conda](https://docs.conda.io/projects/conda/en/latest/index.html) image based on Ubuntu version jammy-20240212. It includes the most commonly used Python packages and libraries, such as NumPy and SciKit Learn. | sagemaker-data-science-v5 | Python 3 (python3) | Python 3.11 | 
| Data Science 5.0 | Data Science 5.0 is a Python 3.11 [conda](https://docs.conda.io/projects/conda/en/latest/index.html) image based on Ubuntu version jammy-20240212. It includes the most commonly used Python packages and libraries, such as NumPy and SciKit Learn. | sagemaker-data-science-v5 | Python 3 (python3) | Python 3.11 | 
| Data Science 4.0 | Data Science 4.0 is a Python 3.11 [conda](https://docs.conda.io/projects/conda/en/latest/index.html) image based on Ubuntu version 22.04. It includes the most commonly used Python packages and libraries, such as NumPy and SciKit Learn. | sagemaker-data-science-311-v1 | Python 3 (python3) | Python 3.11 | 
| Data Science 3.0 | Data Science 3.0 is a Python 3.10 [conda](https://docs.conda.io/projects/conda/en/latest/index.html) image based on Ubuntu version 22.04. It includes the most commonly used Python packages and libraries, such as NumPy and SciKit Learn. | sagemaker-data-science-310-v1 | Python 3 (python3) | Python 3.10 | 
| Geospatial 1.0 | Amazon SageMaker geospatial is a Python image consisting of commonly used geospatial libraries such as GDAL, Fiona, GeoPandas, Shapely, and Rasterio. It allows you to visualize geospatial data within SageMaker AI. For more information, see [Amazon SageMaker geospatial Notebook SDK](https://docs.aws.amazon.com/sagemaker/latest/dg/geospatial-notebook-sdk.html). | sagemaker-geospatial-1.0 | Python 3 (python3) | Python 3.10 | 
| SparkAnalytics 4.3 | The SparkAnalytics 4.3 image provides Spark and PySpark kernel options on Amazon SageMaker Studio Classic, including SparkMagic Spark, SparkMagic PySpark, Glue Spark, and Glue PySpark, enabling flexible distributed data processing. | sagemaker-spark-analytics-v4 |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-available-images.html)  | Python 3.11 | 
| SparkAnalytics 4.2 | The SparkAnalytics 4.2 image provides Spark and PySpark kernel options on Amazon SageMaker Studio Classic, including SparkMagic Spark, SparkMagic PySpark, Glue Spark, and Glue PySpark, enabling flexible distributed data processing. | sagemaker-spark-analytics-v4 |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-available-images.html)  | Python 3.11 | 
| SparkAnalytics 4.1 | The SparkAnalytics 4.1 image provides Spark and PySpark kernel options on Amazon SageMaker Studio Classic, including SparkMagic Spark, SparkMagic PySpark, Glue Spark, and Glue PySpark, enabling flexible distributed data processing. | sagemaker-spark-analytics-v4 |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-available-images.html)  | Python 3.11 | 
| SparkAnalytics 4.0 | The SparkAnalytics 4.0 image provides Spark and PySpark kernel options on Amazon SageMaker Studio Classic, including SparkMagic Spark, SparkMagic PySpark, Glue Spark, and Glue PySpark, enabling flexible distributed data processing. | sagemaker-spark-analytics-v4 |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-available-images.html)  | Python 3.11 | 
| SparkAnalytics 3.0 | The SparkAnalytics 3.0 image provides Spark and PySpark kernel options on Amazon SageMaker Studio Classic, including SparkMagic Spark, SparkMagic PySpark, Glue Spark, and Glue PySpark, enabling flexible distributed data processing. | sagemaker-sparkanalytics-311-v1 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-available-images.html) | Python 3.11 | 
| SparkAnalytics 2.0 | Anaconda Individual Edition with PySpark and Spark kernels. For more information, see [sparkmagic](https://github.com/jupyter-incubator/sparkmagic). | sagemaker-sparkanalytics-310-v1 | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-available-images.html) | Python 3.10 | 
| PyTorch 2.4.0 Python 3.11 CPU Optimized | The AWS Deep Learning Containers for PyTorch 2.4.0 with CUDA 12.4 include containers for training on CPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | pytorch-2.4.0-cpu-py311 | Python 3 (python3) | Python 3.11 | 
| PyTorch 2.4.0 Python 3.11 GPU Optimized | The AWS Deep Learning Containers for PyTorch 2.4.0 with CUDA 12.4 include containers for training on GPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | pytorch-2.4.0-gpu-py311 | Python 3 (python3) | Python 3.11 | 
| PyTorch 2.3.0 Python 3.11 CPU Optimized | The AWS Deep Learning Containers for PyTorch 2.3.0 with CUDA 12.1 include containers for training on CPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | pytorch-2.3.0-cpu-py311 | Python 3 (python3) | Python 3.11 | 
| PyTorch 2.3.0 Python 3.11 GPU Optimized | The AWS Deep Learning Containers for PyTorch 2.3.0 with CUDA 12.1 include containers for training on GPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | pytorch-2.3.0-gpu-py311 | Python 3 (python3) | Python 3.11 | 
| PyTorch 2.2.0 Python 3.10 CPU Optimized | The AWS Deep Learning Containers for PyTorch 2.2 with CUDA 12.1 include containers for training on CPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | pytorch-2.2.0-cpu-py310 | Python 3 (python3) | Python 3.10 | 
| PyTorch 2.2.0 Python 3.10 GPU Optimized | The AWS Deep Learning Containers for PyTorch 2.2 with CUDA 12.1 include containers for training on GPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | pytorch-2.2.0-gpu-py310 | Python 3 (python3) | Python 3.10 | 
| PyTorch 2.1.0 Python 3.10 CPU Optimized | The AWS Deep Learning Containers for PyTorch 2.1 with CUDA 12.1 include containers for training on CPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | pytorch-2.1.0-cpu-py310 | Python 3 (python3) | Python 3.10 | 
| PyTorch 2.1.0 Python 3.10 GPU Optimized | The AWS Deep Learning Containers for PyTorch 2.1 with CUDA 12.1 include containers for training on GPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | pytorch-2.1.0-gpu-py310 | Python 3 (python3) | Python 3.10 | 
| PyTorch 1.13 HuggingFace Python 3.10 Neuron Optimized | PyTorch 1.13 image with HuggingFace and Neuron packages installed for training on Trainium instances optimized for performance and scale on AWS. | pytorch-1.13-hf-neuron-py310 | Python 3 (python3) | Python 3.10 | 
| PyTorch 1.13 Python 3.10 Neuron Optimized | PyTorch 1.13 image with Neuron packages installed for training on Trainium instances optimized for performance and scale on AWS. | pytorch-1.13-neuron-py310 | Python 3 (python3) | Python 3.10 | 
| TensorFlow 2.14.0 Python 3.10 CPU Optimized | The AWS Deep Learning Containers for TensorFlow 2.14 with CUDA 11.8 include containers for training on CPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | tensorflow-2.14.1-cpu-py310-ubuntu20.04-sagemaker-v1.0 | Python 3 (python3) | Python 3.10 | 
| TensorFlow 2.14.0 Python 3.10 GPU Optimized | The AWS Deep Learning Containers for TensorFlow 2.14 with CUDA 11.8 include containers for training on GPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | tensorflow-2.14.1-gpu-py310-cu118-ubuntu20.04-sagemaker-v1.0 | Python 3 (python3) | Python 3.10 | 
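Entries in the Resource Identifier column combine with a Region and a Region-specific account ID (from the ARN table earlier on this page) to form a SageMaker image ARN. A minimal sketch, assuming the `arn:aws:sagemaker:<region>:<account>:image/<resource-identifier>` pattern shown in that table; the helper name is hypothetical:

```python
def build_image_arn(region: str, account_id: str, resource_id: str) -> str:
    """Build a SageMaker image ARN (hypothetical helper).

    account_id must be the Region-specific account from the ARN table
    earlier on this page; resource_id comes from the Resource
    Identifier column above.
    """
    return f"arn:aws:sagemaker:{region}:{account_id}:image/{resource_id}"

# Example using the ap-south-1 account from the ARN table and the
# Data Science 5.x resource identifier from the table above.
print(build_image_arn("ap-south-1", "394103062818", "sagemaker-data-science-v5"))
```

An ARN built this way is what you would pass in fields such as a `ResourceSpec` `SageMakerImageArn` when configuring Studio Classic apps programmatically.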

## Images slated for deprecation
<a name="notebooks-available-images-deprecation"></a>

SageMaker AI ends support for an image the day after any package in the image reaches end-of-life with its publisher. The following SageMaker images are slated for deprecation.

Images based on Python 3.8 reached [end-of-life](https://endoflife.date/python) on October 31, 2024. Starting on November 1, 2024, SageMaker AI will discontinue support for these images and they will not be selectable from the Studio Classic UI. If you're using any of these images, we recommend that you move to an image with a later version to avoid non-compliance issues.

SageMaker images slated for deprecation


| SageMaker Image | Deprecation date | Description | Resource Identifier | Kernels | Python Version | 
| --- | --- | --- | --- | --- | --- | 
| SageMaker Distribution v0.12 CPU | November 1, 2024 | SageMaker Distribution v0 CPU is a Python 3.8 image that includes popular frameworks for machine learning, data science and visualization on CPU. This includes deep learning frameworks like PyTorch, TensorFlow and Keras; popular Python packages like numpy, scikit-learn and pandas; and IDEs like Jupyter Lab. For more information, see the [Amazon SageMaker AI Distribution](https://github.com/aws/sagemaker-distribution) repo.  | sagemaker-distribution-cpu-v0 | Python 3 (python3) | Python 3.8 | 
| SageMaker Distribution v0.12 GPU | November 1, 2024 | SageMaker Distribution v0 GPU is a Python 3.8 image that includes popular frameworks for machine learning, data science and visualization on GPU. This includes deep learning frameworks like PyTorch, TensorFlow and Keras; popular Python packages like numpy, scikit-learn and pandas; and IDEs like Jupyter Lab. For more information, see the [Amazon SageMaker AI Distribution](https://github.com/aws/sagemaker-distribution) repo.  | sagemaker-distribution-gpu-v0 | Python 3 (python3) | Python 3.8 | 
| Base Python 2.0 | November 1, 2024 | Official Python 3.8 image from DockerHub with boto3 and AWS CLI included. | sagemaker-base-python-38 | Python 3 (python3) | Python 3.8 | 
| Data Science 2.0 | November 1, 2024 | Data Science 2.0 is a Python 3.8 [conda](https://docs.conda.io/projects/conda/en/latest/index.html) image based on Ubuntu version 22.04. It includes the most commonly used Python packages and libraries, such as NumPy and SciKit Learn. | sagemaker-data-science-38 | Python 3 (python3) | Python 3.8 | 
| PyTorch 1.13 Python 3.9 CPU Optimized | November 1, 2024 | The AWS Deep Learning Containers for PyTorch 1.13 with CUDA 11.3 include containers for training on CPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | pytorch-1.13-cpu-py39 | Python 3 (python3) | Python 3.9 | 
| PyTorch 1.13 Python 3.9 GPU Optimized | November 1, 2024 | The AWS Deep Learning Containers for PyTorch 1.13 with CUDA 11.7 include containers for training on GPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | pytorch-1.13-gpu-py39 | Python 3 (python3) | Python 3.9 | 
| PyTorch 1.12 Python 3.8 CPU Optimized | November 1, 2024 | The AWS Deep Learning Containers for PyTorch 1.12 with CUDA 11.3 include containers for training on CPU, optimized for performance and scale on AWS. For more information, see [AWS Deep Learning Containers for PyTorch 1.12.0](https://aws.amazon.com/releasenotes/aws-deep-learning-containers-for-pytorch-1-12-0-on-sagemaker/). | pytorch-1.12-cpu-py38 | Python 3 (python3) | Python 3.8 | 
| PyTorch 1.12 Python 3.8 GPU Optimized | November 1, 2024 | The AWS Deep Learning Containers for PyTorch 1.12 with CUDA 11.3 include containers for training on GPU, optimized for performance and scale on AWS. For more information, see [AWS Deep Learning Containers for PyTorch 1.12.0](https://aws.amazon.com/releasenotes/aws-deep-learning-containers-for-pytorch-1-12-0-on-sagemaker/). | pytorch-1.12-gpu-py38 | Python 3 (python3) | Python 3.8 | 
| PyTorch 1.10 Python 3.8 CPU Optimized | November 1, 2024 | The AWS Deep Learning Containers for PyTorch 1.10 include containers for training on CPU, optimized for performance and scale on AWS. For more information, see [AWS Deep Learning Containers for PyTorch 1.10.2 on SageMaker AI](https://aws.amazon.com/releasenotes/aws-deep-learning-containers-for-pytorch-1-10-2-on-sagemaker/). | pytorch-1.10-cpu-py38 | Python 3 (python3) | Python 3.8 | 
| PyTorch 1.10 Python 3.8 GPU Optimized | November 1, 2024 | The AWS Deep Learning Containers for PyTorch 1.10 with CUDA 11.3 include containers for training on GPU, optimized for performance and scale on AWS. For more information, see [AWS Deep Learning Containers for PyTorch 1.10.2 on SageMaker AI](https://aws.amazon.com/releasenotes/aws-deep-learning-containers-for-pytorch-1-10-2-on-sagemaker/). | pytorch-1.10-gpu-py38 | Python 3 (python3) | Python 3.8 | 
| SparkAnalytics 1.0 | November 1, 2024 | Anaconda Individual Edition with PySpark and Spark kernels. For more information, see [sparkmagic](https://github.com/jupyter-incubator/sparkmagic). | sagemaker-sparkanalytics-v1 |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-available-images.html)  | Python 3.8 | 
| TensorFlow 2.13.0 Python 3.10 CPU Optimized | November 1, 2024 | The AWS Deep Learning Containers for TensorFlow 2.13 with CUDA 11.8 include containers for training on CPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | tensorflow-2.13.0-cpu-py310-ubuntu20.04-sagemaker-v1.0 | Python 3 (python3) | Python 3.10 | 
| TensorFlow 2.13.0 Python 3.10 GPU Optimized | November 1, 2024 | The AWS Deep Learning Containers for TensorFlow 2.13 with CUDA 11.8 include containers for training on GPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | tensorflow-2.13.0-gpu-py310-cu118-ubuntu20.04-sagemaker-v1.0 | Python 3 (python3) | Python 3.10 | 
| TensorFlow 2.6 Python 3.8 CPU Optimized | November 1, 2024 | The AWS Deep Learning Containers for TensorFlow 2.6 include containers for training on CPU, optimized for performance and scale on AWS. For more information, see [AWS Deep Learning Containers for TensorFlow 2.6](https://aws.amazon.com/releasenotes/aws-deep-learning-containers-for-tensorflow-2-6/). | tensorflow-2.6-cpu-py38-ubuntu20.04-v1 | Python 3 (python3) | Python 3.8 | 
| TensorFlow 2.6 Python 3.8 GPU Optimized | November 1, 2024 | The AWS Deep Learning Containers for TensorFlow 2.6 with CUDA 11.2 include containers for training on GPU, optimized for performance and scale on AWS. For more information, see [AWS Deep Learning Containers for TensorFlow 2.6](https://aws.amazon.com/releasenotes/aws-deep-learning-containers-for-tensorflow-2-6/). | tensorflow-2.6-gpu-py38-cu112-ubuntu20.04-v1 | Python 3 (python3) | Python 3.8 | 
| PyTorch 2.0.1 Python 3.10 CPU Optimized | November 1, 2024 | The AWS Deep Learning Containers for PyTorch 2.0.1 with CUDA 12.1 include containers for training on CPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | pytorch-2.0.1-cpu-py310 | Python 3 (python3) | Python 3.10 | 
| PyTorch 2.0.1 Python 3.10 GPU Optimized | November 1, 2024 | The AWS Deep Learning Containers for PyTorch 2.0.1 with CUDA 12.1 include containers for training on GPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | pytorch-2.0.1-gpu-py310 | Python 3 (python3) | Python 3.10 | 
| PyTorch 2.0.0 Python 3.10 CPU Optimized | November 1, 2024 | The AWS Deep Learning Containers for PyTorch 2.0.0 include containers for training on CPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | pytorch-2.0.0-cpu-py310 | Python 3 (python3) | Python 3.10 | 
| PyTorch 2.0.0 Python 3.10 GPU Optimized | November 1, 2024 | The AWS Deep Learning Containers for PyTorch 2.0.0 with CUDA 11.8 include containers for training on GPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | pytorch-2.0.0-gpu-py310 | Python 3 (python3) | Python 3.10 | 
| TensorFlow 2.12.0 Python 3.10 CPU Optimized | November 1, 2024 | The AWS Deep Learning Containers for TensorFlow 2.12.0 with CUDA 11.2 include containers for training on CPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | tensorflow-2.12.0-cpu-py310-ubuntu20.04-sagemaker-v1.0 | Python 3 (python3) | Python 3.10 | 
| TensorFlow 2.12.0 Python 3.10 GPU Optimized | November 1, 2024 | The AWS Deep Learning Containers for TensorFlow 2.12.0 with CUDA 11.8 include containers for training on GPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | tensorflow-2.12.0-gpu-py310-cu118-ubuntu20.04-sagemaker-v1 | Python 3 (python3) | Python 3.10 | 
| TensorFlow 2.11.0 Python 3.9 CPU Optimized | November 1, 2024 | The AWS Deep Learning Containers for TensorFlow 2.11.0 with CUDA 11.2 include containers for training on CPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | tensorflow-2.11.0-cpu-py39-ubuntu20.04-sagemaker-v1.1 | Python 3 (python3) | Python 3.9 | 
| TensorFlow 2.11.0 Python 3.9 GPU Optimized | November 1, 2024 | The AWS Deep Learning Containers for TensorFlow 2.11.0 with CUDA 11.2 include containers for training on GPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | tensorflow-2.11.0-gpu-py39-cu112-ubuntu20.04-sagemaker-v1.1 | Python 3 (python3) | Python 3.9 | 
| TensorFlow 2.10 Python 3.9 CPU Optimized | November 1, 2024 | The AWS Deep Learning Containers for TensorFlow 2.10 with CUDA 11.2 include containers for training on CPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | tensorflow-2.10.1-cpu-py39-ubuntu20.04-sagemaker-v1.2 | Python 3 (python3) | Python 3.9 | 
| TensorFlow 2.10 Python 3.9 GPU Optimized | November 1, 2024 | The AWS Deep Learning Containers for TensorFlow 2.10 with CUDA 11.2 include containers for training on GPU, optimized for performance and scale on AWS. For more information, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html). | tensorflow-2.10.1-gpu-py39-ubuntu20.04-sagemaker-v1.2 | Python 3 (python3) | Python 3.9 | 

## Deprecated images
<a name="notebooks-available-images-deprecated"></a>

SageMaker AI has ended support for the following images. Deprecation occurs the day after any package in the image reaches end-of-life with its publisher.

Deprecated SageMaker images


| SageMaker Image | Deprecation date | Description | Resource Identifier | Kernels | Python Version | 
| --- | --- | --- | --- | --- | --- | 
| Data Science | October 30, 2023 | Data Science is a Python 3.7 [conda](https://docs.conda.io/projects/conda/en/latest/index.html) image with the most commonly used Python packages and libraries, such as NumPy and SciKit Learn. | datascience-1.0 | Python 3 | Python 3.7 | 
| SageMaker JumpStart Data Science 1.0 | October 30, 2023 | SageMaker JumpStart Data Science 1.0 is a JumpStart image that includes commonly used packages and libraries. | sagemaker-jumpstart-data-science-1.0 | Python 3 | Python 3.7 | 
| SageMaker JumpStart MXNet 1.0 | October 30, 2023 | SageMaker JumpStart MXNet 1.0 is a JumpStart image that includes MXNet. | sagemaker-jumpstart-mxnet-1.0 | Python 3 | Python 3.7 | 
| SageMaker JumpStart PyTorch 1.0 | October 30, 2023 | SageMaker JumpStart PyTorch 1.0 is a JumpStart image that includes PyTorch. | sagemaker-jumpstart-pytorch-1.0 | Python 3 | Python 3.7 | 
| SageMaker JumpStart TensorFlow 1.0 | October 30, 2023 | SageMaker JumpStart TensorFlow 1.0 is a JumpStart image that includes TensorFlow. | sagemaker-jumpstart-tensorflow-1.0 | Python 3 | Python 3.7 | 
| SparkMagic | October 30, 2023 | Anaconda Individual Edition with PySpark and Spark kernels. For more information, see [sparkmagic](https://github.com/jupyter-incubator/sparkmagic). | sagemaker-sparkmagic |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-available-images.html)  | Python 3.7 | 
| TensorFlow 2.3 Python 3.7 CPU Optimized | October 30, 2023 | The AWS Deep Learning Containers for TensorFlow 2.3 include containers for training on CPU, optimized for performance and scale on AWS. For more information, see [AWS Deep Learning Containers with TensorFlow 2.3.0](https://aws.amazon.com/releasenotes/aws-deep-learning-containers-with-tensorflow-2-3-0/). | tensorflow-2.3-cpu-py37-ubuntu18.04-v1 | Python 3 | Python 3.7 | 
| TensorFlow 2.3 Python 3.7 GPU Optimized | October 30, 2023 | The AWS Deep Learning Containers for TensorFlow 2.3 with CUDA 11.0 include containers for training on GPU, optimized for performance and scale on AWS. For more information, see [AWS Deep Learning Containers for TensorFlow 2.3.1 with CUDA 11.0](https://aws.amazon.com/releasenotes/aws-deep-learning-containers-for-tensorflow-2-3-1-with-cuda-11-0/). | tensorflow-2.3-gpu-py37-cu110-ubuntu18.04-v3 | Python 3 | Python 3.7 | 
| TensorFlow 1.15 Python 3.7 CPU Optimized | October 30, 2023 | The AWS Deep Learning Containers for TensorFlow 1.15 include containers for training on CPU, optimized for performance and scale on AWS. For more information, see [AWS Deep Learning Containers v7.0 for TensorFlow](https://aws.amazon.com/releasenotes/aws-deep-learning-containers-v7-0-for-tensorflow/). | tensorflow-1.15-cpu-py37-ubuntu18.04-v7 | Python 3 | Python 3.7 | 
| TensorFlow 1.15 Python 3.7 GPU Optimized | October 30, 2023 | The AWS Deep Learning Containers for TensorFlow 1.15 with CUDA 11.0 include containers for training on GPU, optimized for performance and scale on AWS. For more information, see [AWS Deep Learning Containers v7.0 for TensorFlow](https://aws.amazon.com/releasenotes/aws-deep-learning-containers-v7-0-for-tensorflow/). | tensorflow-1.15-gpu-py37-cu110-ubuntu18.04-v8 | Python 3 | Python 3.7 | 

# Customize Amazon SageMaker Studio Classic
<a name="studio-customize"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

There are four options for customizing your Amazon SageMaker Studio Classic environment: bring your own SageMaker image, use a lifecycle configuration script, attach suggested Git repos to Studio Classic, or create kernels from persistent Conda environments in Amazon EFS. You can use these options individually or together.
+ **Bring your own SageMaker image:** A SageMaker image is a file that identifies the kernels, language packages, and other dependencies required to run a Jupyter notebook in Amazon SageMaker Studio Classic. Amazon SageMaker AI provides many built-in images for you to use. If you need different functionality, you can bring your own custom images to Studio Classic.
+ **Use lifecycle configurations with Amazon SageMaker Studio Classic:** Lifecycle configurations are shell scripts triggered by Amazon SageMaker Studio Classic lifecycle events, such as starting a new Studio Classic notebook. You can use lifecycle configurations to automate customization for your Studio Classic environment. For example, you can install custom packages, configure notebook extensions, preload datasets, and set up source code repositories.
+ **Attach suggested Git repos to Studio Classic:** You can attach suggested Git repository URLs at the Amazon SageMaker AI domain or user profile level. Then, you can select the repo URL from the list of suggestions and clone that into your environment using the Git extension in Studio Classic. 
+ **Persist Conda environments to the Studio Classic Amazon EFS volume:** Studio Classic uses an Amazon EFS volume as a persistent storage layer. You can save your Conda environment on this Amazon EFS volume, then use the saved environment to create kernels. Studio Classic automatically picks up all valid environments saved in Amazon EFS as KernelGateway kernels. These kernels persist through restart of the kernel, app, and Studio Classic. For more information, see the **Persist Conda environments to the Studio Classic EFS volume** section in [Four approaches to manage Python packages in Amazon SageMaker Studio Classic notebooks](https://aws.amazon.com/blogs/machine-learning/four-approaches-to-manage-python-packages-in-amazon-sagemaker-studio-notebooks/).

The following topics show how to use the first three of these options to customize your Amazon SageMaker Studio Classic environment.

**Topics**
+ [Custom Images in Amazon SageMaker Studio Classic](studio-byoi.md)
+ [Use Lifecycle Configurations to Customize Amazon SageMaker Studio Classic](studio-lcc.md)
+ [Attach Suggested Git Repos to Amazon SageMaker Studio Classic](studio-git-attach.md)

# Custom Images in Amazon SageMaker Studio Classic
<a name="studio-byoi"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

A SageMaker image is a file that identifies the kernels, language packages, and other dependencies required to run a Jupyter notebook in Amazon SageMaker Studio Classic. These images are used to create an environment that you then run Jupyter notebooks from. Amazon SageMaker AI provides many built-in images for you to use. For the list of built-in images, see [Amazon SageMaker Images Available for Use With Studio Classic Notebooks](notebooks-available-images.md).

If you need different functionality, you can bring your own custom images to Studio Classic. You can create images and image versions, and attach image versions to your domain or shared space, using the SageMaker AI control panel, the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html), and the [AWS Command Line Interface (AWS CLI)](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/). You can also create images and image versions using the SageMaker AI console, even if you haven't onboarded to a SageMaker AI domain. SageMaker AI provides sample Dockerfiles to use as a starting point for your custom SageMaker images in the [SageMaker Studio Classic Custom Image Samples](https://github.com/aws-samples/sagemaker-studio-custom-image-samples/) repository.

The following topics explain how to bring your own image using the SageMaker AI console or AWS CLI, then launch the image in Studio Classic. For a similar blog article, see [Bringing your own R environment to Amazon SageMaker Studio Classic](https://aws.amazon.com/blogs/machine-learning/bringing-your-own-r-environment-to-amazon-sagemaker-studio/). For notebooks that show how to bring your own image for use in training and inference, see [Amazon SageMaker Studio Classic Container Build CLI](https://github.com/aws/amazon-sagemaker-examples/tree/main/aws_sagemaker_studio/sagemaker_studio_image_build).

## Key terminology
<a name="studio-byoi-basics"></a>

The following section defines key terms for bringing your own image to use with Studio Classic.
+ **Dockerfile:** A Dockerfile is a file that identifies the language packages and other dependencies for your Docker image.
+ **Docker image:** The Docker image is a built Dockerfile. This image is checked into Amazon ECR and serves as the basis of the SageMaker AI image.
+ **SageMaker image:** A SageMaker image is a holder for a set of SageMaker AI image versions based on Docker images. Each image version is immutable.
+ **Image version:** An image version of a SageMaker image represents a Docker image and is stored in an Amazon ECR repository. Each image version is immutable. These image versions can be attached to a domain or shared space and used with Studio Classic.

**Topics**
+ [Key terminology](#studio-byoi-basics)
+ [Custom SageMaker Image Specifications for Amazon SageMaker Studio Classic](studio-byoi-specs.md)
+ [Prerequisites for Custom Images in Amazon SageMaker Studio Classic](studio-byoi-prereq.md)
+ [Add a Docker Image Compatible with Amazon SageMaker Studio Classic to Amazon ECR](studio-byoi-sdk-add-container-image.md)
+ [Create a Custom SageMaker Image for Amazon SageMaker Studio Classic](studio-byoi-create.md)
+ [Attach a Custom SageMaker Image in Amazon SageMaker Studio Classic](studio-byoi-attach.md)
+ [Launch a Custom SageMaker Image in Amazon SageMaker Studio Classic](studio-byoi-launch.md)
+ [Clean Up Resources for Custom Images in Amazon SageMaker Studio Classic](studio-byoi-cleanup.md)

# Custom SageMaker Image Specifications for Amazon SageMaker Studio Classic
<a name="studio-byoi-specs"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

The following specifications apply to the container image that is represented by a SageMaker AI image version.

**Running the image**  
`ENTRYPOINT` and `CMD` instructions are overridden to enable the image to run as a KernelGateway app.  
Port 8888 in the image is reserved for running the KernelGateway web server.

**Stopping the image**  
The `DeleteApp` API issues the equivalent of a `docker stop` command. Other processes in the container won’t get the SIGKILL/SIGTERM signals.

**Kernel discovery**  
SageMaker AI recognizes kernels as defined by Jupyter [kernel specs](https://jupyter-client.readthedocs.io/en/latest/kernels.html#kernelspecs).  
You can specify a list of kernels to display before running the image. If not specified, python3 is displayed. Use the [DescribeAppImageConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeAppImageConfig.html) API to view the list of kernels.  
Conda environments are recognized as kernel specs by default. 
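For reference, Jupyter discovers a kernel through a `kernel.json` file in a directory such as `/usr/local/share/jupyter/kernels/python3/`. A minimal spec for a `python3` kernel might look like the following sketch (the exact install path depends on your image):

```
{
    "argv": ["python3", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
    "display_name": "Python 3",
    "language": "python"
}
```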

**File system**  
The `/opt/.sagemakerinternal` and `/opt/ml` directories are reserved. Any data in these directories might not be visible at runtime.

**User data**  
Each user in a domain gets a user directory on a shared Amazon Elastic File System volume in the image. The location of the current user's directory on the Amazon EFS volume is configurable. By default, the location of the directory is `/home/sagemaker-user`.  
SageMaker AI configures POSIX UID/GID mappings between the image and the host. This defaults to mapping the root user's UID/GID (0/0) to the UID/GID on the host.  
You can specify these values using the [CreateAppImageConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateAppImageConfig.html) API.

**GID/UID limits**  
Amazon SageMaker Studio Classic only supports the following `DefaultUID` and `DefaultGID` combinations:   
+  DefaultUID: 1000 and DefaultGID: 100, which corresponds to a non-privileged user.
+  DefaultUID: 0 and DefaultGID: 0, which corresponds to root access.

**Metadata**  
A metadata file is located at `/opt/ml/metadata/resource-metadata.json`. No additional environment variables are added to the variables defined in the image. For more information, see [Get App Metadata](notebooks-run-and-manage-metadata.md#notebooks-run-and-manage-metadata-app).

**GPU**  
On a GPU instance, the image is run with the `--gpus` option. Include only the CUDA toolkit in the image, not the NVIDIA drivers. For more information, see [NVIDIA User Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html).

**Metrics and logging**  
Logs from the KernelGateway process are sent to Amazon CloudWatch in the customer’s account. The name of the log group is `/aws/sagemaker/studio`. The name of the log stream is `$domainID/$userProfileName/KernelGateway/$appName`.

**Image size**  
Limited to 35 GB. To view the size of your image, run `docker image ls`.  


## Sample Dockerfile
<a name="studio-byoi-specs-sample"></a>

The following sample Dockerfile creates an image based on Amazon Linux 2, installs third-party packages and the `python3` kernel, and runs as the non-privileged user.

```
FROM public.ecr.aws/amazonlinux/amazonlinux:2

ARG NB_USER="sagemaker-user"
ARG NB_UID="1000"
ARG NB_GID="100"

RUN \
    yum install --assumeyes python3 shadow-utils && \
    useradd --create-home --shell /bin/bash --gid "${NB_GID}" --uid ${NB_UID} ${NB_USER} && \
    yum clean all && \
    python3 -m pip install ipykernel && \
    python3 -m ipykernel install

USER ${NB_UID}
```
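Before registering the image with SageMaker AI, you can build it locally and confirm that `ipykernel` is available. This is a sketch; the tag `smstudio-custom:custom` is an example name:

```
# Build the image from the directory that contains the Dockerfile
docker build -t smstudio-custom:custom .

# Smoke test: confirm that Python 3 and ipykernel are present in the image
docker run --rm smstudio-custom:custom \
    python3 -c "import ipykernel; print(ipykernel.__version__)"
```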

# Prerequisites for Custom Images in Amazon SageMaker Studio Classic
<a name="studio-byoi-prereq"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

You must satisfy the following prerequisites to bring your own container for use with Amazon SageMaker Studio Classic.
+ The Docker application installed. For information about setting up Docker, see [Orientation and setup](https://docs.docker.com/get-started/).
+ The AWS CLI installed. Follow the steps in [Getting started with the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html).
+ A local copy of the Dockerfile for creating a Studio Classic compatible image. For sample custom images, see the [SageMaker AI Studio Classic custom image samples](https://github.com/aws-samples/sagemaker-studio-custom-image-samples/) repository.
+ Permissions to access the Amazon Elastic Container Registry (Amazon ECR) service. For more information, see [Amazon ECR Managed Policies](https://docs.aws.amazon.com/AmazonECR/latest/userguide/ecr_managed_policies.html).
+ An AWS Identity and Access Management (IAM) execution role that has the [AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home?#/policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess) policy attached. If you have onboarded to an Amazon SageMaker AI domain, you can get the role from the **Domain Summary** section of the SageMaker AI control panel.
+ The Studio Classic image build CLI installed. Follow the steps in [SageMaker Docker Build](https://github.com/aws-samples/sagemaker-studio-image-build-cli). This CLI enables you to build a Dockerfile using AWS CodeBuild.
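You can quickly confirm that the tooling prerequisites are in place by checking each version from a shell (output varies by installed version):

```
docker --version
aws --version
sm-docker --help   # provided by the sagemaker-studio-image-build package
```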

# Add a Docker Image Compatible with Amazon SageMaker Studio Classic to Amazon ECR
<a name="studio-byoi-sdk-add-container-image"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

You perform the following steps to add a container image to Amazon ECR:
+ Create an Amazon ECR repository.
+ Authenticate to Amazon ECR.
+ Build a Docker image compatible with Studio Classic.
+ Push the image to the Amazon ECR repository.

**Note**  
The Amazon ECR repository must be in the same AWS Region as Studio Classic.

**To build and add a container image to Amazon ECR**

1. Create an Amazon ECR repository using the AWS CLI. To create the repository using the Amazon ECR console, see [Creating a repository](https://docs.aws.amazon.com/AmazonECR/latest/userguide/repository-create.html).

   ```
   aws ecr create-repository \
       --repository-name smstudio-custom \
       --image-scanning-configuration scanOnPush=true
   ```

   The response should look similar to the following.

   ```
   {
       "repository": {
           "repositoryArn": "arn:aws:ecr:us-east-2:acct-id:repository/smstudio-custom",
           "registryId": "acct-id",
           "repositoryName": "smstudio-custom",
           "repositoryUri": "acct-id.dkr.ecr.us-east-2.amazonaws.com/smstudio-custom",
           ...
       }
   }
   ```

1. Build the `Dockerfile` using the Studio Classic image build CLI. The period (.) sets the build context to the current directory, which must contain the Dockerfile. This command builds the image, uploads it to the Amazon ECR repository, and then outputs the image URI.

   ```
   sm-docker build . --repository smstudio-custom:custom
   ```

   The response should look similar to the following.

   ```
   Image URI: <acct-id>.dkr.ecr.<region>.amazonaws.com/<image_name>
   ```
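If you prefer not to use the image build CLI, a roughly equivalent flow with plain Docker commands might look like the following sketch (replace `<acct-id>` and `<region>` with your own values):

```
# Authenticate Docker to your Amazon ECR registry
aws ecr get-login-password --region <region> | \
    docker login --username AWS --password-stdin <acct-id>.dkr.ecr.<region>.amazonaws.com

# Build, tag, and push the image to the repository created in step 1
docker build -t smstudio-custom:custom .
docker tag smstudio-custom:custom <acct-id>.dkr.ecr.<region>.amazonaws.com/smstudio-custom:custom
docker push <acct-id>.dkr.ecr.<region>.amazonaws.com/smstudio-custom:custom
```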

# Create a Custom SageMaker Image for Amazon SageMaker Studio Classic
<a name="studio-byoi-create"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

This topic describes how you can create a custom SageMaker image using the SageMaker AI console or AWS CLI.

When you create an image from the console, SageMaker AI also creates an initial image version. The image version represents a container image in [Amazon Elastic Container Registry (ECR)](https://console.aws.amazon.com/ecr/). The container image must satisfy the requirements to be used in Amazon SageMaker Studio Classic. For more information, see [Custom SageMaker Image Specifications for Amazon SageMaker Studio Classic](studio-byoi-specs.md). For information on testing your image locally and resolving common issues, see the [SageMaker Studio Classic Custom Image Samples repo](https://github.com/aws-samples/sagemaker-studio-custom-image-samples/blob/main/DEVELOPMENT.md).

After you have created your custom SageMaker image, you must attach it to your domain or shared space to use it with Studio Classic. For more information, see [Attach a Custom SageMaker Image in Amazon SageMaker Studio Classic](studio-byoi-attach.md).

## Create a SageMaker image from the console
<a name="studio-byoi-create-console"></a>

The following section demonstrates how to create a custom SageMaker image from the SageMaker AI console.

**To create an image**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Images**. 

1. On the **Custom images** page, choose **Create image**.

1. For **Image source**, enter the registry path to the container image in Amazon ECR. The path is in the following format:

   `acct-id.dkr.ecr.region.amazonaws.com/repo-name[:tag] or [@digest]`

1. Choose **Next**.

1. Under **Image properties**, enter the following:
   + Image name – The name must be unique to your account in the current AWS Region.
   + (Optional) Display name – The name displayed in the Studio Classic user interface. When not provided, `Image name` is displayed.
   + (Optional) Description – A description of the image.
   + IAM role – The role must have the [AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home?#/policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess) policy attached. Use the dropdown menu to choose one of the following options:
     + Create a new role – Specify any additional Amazon Simple Storage Service (Amazon S3) buckets that you want users of your notebooks to have access to. If you don't want to allow access to additional buckets, choose **None**.

       SageMaker AI attaches the `AmazonSageMakerFullAccess` policy to the role. The role allows users of your notebooks access to the S3 buckets listed next to the checkmarks.
     + Enter a custom IAM role ARN – Enter the Amazon Resource Name (ARN) of your IAM role.
     + Use existing role – Choose one of your existing roles from the list.
   + (Optional) Image tags – Choose **Add new tag**. You can add up to 50 tags. Tags are searchable using the Studio Classic user interface, the SageMaker AI console, or the SageMaker AI `Search` API.

1. Choose **Submit**.

The new image is displayed in the **Custom images** list and briefly highlighted. After the image has been successfully created, you can choose the image name to view its properties or choose **Create version** to create another version.

**To create another image version**

1. Choose **Create version** on the same row as the image.

1. For **Image source**, enter the registry path to the Amazon ECR container image. The container image shouldn't be the same image as used in a previous version of the SageMaker image.

## Create a SageMaker image from the AWS CLI
<a name="studio-byoi-sdk-create-image"></a>

You perform the following steps to create a SageMaker image from the container image using the AWS CLI.
+ Create an `Image`.
+ Create an `ImageVersion`.
+ Create a configuration file.
+ Create an `AppImageConfig`.

**To create the SageMaker image entities**

1. Create a SageMaker image.

   ```
   aws sagemaker create-image \
       --image-name custom-image \
       --role-arn arn:aws:iam::<acct-id>:role/service-role/<execution-role>
   ```

   The response should look similar to the following.

   ```
   {
       "ImageArn": "arn:aws:sagemaker:us-east-2:acct-id:image/custom-image"
   }
   ```

1. Create a SageMaker image version from the container image.

   ```
   aws sagemaker create-image-version \
       --image-name custom-image \
       --base-image <acct-id>.dkr.ecr.<region>.amazonaws.com/smstudio-custom:custom-image
   ```

   The response should look similar to the following.

   ```
   {
       "ImageVersionArn": "arn:aws:sagemaker:us-east-2:acct-id:image-version/custom-image/1"
   }
   ```

1. Check that the image version was successfully created.

   ```
   aws sagemaker describe-image-version \
       --image-name custom-image \
       --version-number 1
   ```

   The response should look similar to the following.

   ```
   {
       "ImageVersionArn": "arn:aws:sagemaker:us-east-2:acct-id:image-version/custom-image/1",
       "ImageVersionStatus": "CREATED"
   }
   ```
**Note**  
If the response is `"ImageVersionStatus": "CREATED_FAILED"`, the response also includes the failure reason. A permissions issue is a common cause of failure. You also can check your Amazon CloudWatch logs if you experience a failure when starting or running the KernelGateway app for a custom image. The name of the log group is `/aws/sagemaker/studio`. The name of the log stream is `$domainID/$userProfileName/KernelGateway/$appName`.
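   For example, you can retrieve recent KernelGateway log events for a failed app with the AWS CLI (substitute your own domain ID, user profile name, and app name):

   ```
   aws logs get-log-events \
       --log-group-name /aws/sagemaker/studio \
       --log-stream-name <domain-id>/<user-profile-name>/KernelGateway/<app-name> \
       --limit 50
   ```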

1. Create a configuration file, named `app-image-config-input.json`. The `Name` value of `KernelSpecs` must match the name of the kernelSpec available in the Image associated with this `AppImageConfig`. This value is case sensitive. You can find the available kernelSpecs in an image by running `jupyter-kernelspec list` from a shell inside the container. `MountPath` is the path within the image to mount your Amazon Elastic File System (Amazon EFS) home directory. It needs to be different from the path you use inside the container because that path will be overridden when your Amazon EFS home directory is mounted.
**Note**  
The following `DefaultUID` and `DefaultGID` combinations are the only accepted values:
+ DefaultUID: 1000 and DefaultGID: 100
+ DefaultUID: 0 and DefaultGID: 0

   ```
   {
       "AppImageConfigName": "custom-image-config",
       "KernelGatewayImageConfig": {
           "KernelSpecs": [
               {
                   "Name": "python3",
                   "DisplayName": "Python 3 (ipykernel)"
               }
           ],
           "FileSystemConfig": {
               "MountPath": "/home/sagemaker-user",
               "DefaultUid": 1000,
               "DefaultGid": 100
           }
       }
   }
   ```
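   To find the exact, case-sensitive kernel names available in your image, you can run `jupyter-kernelspec list` inside the container, for example:

   ```
   docker run --rm <acct-id>.dkr.ecr.<region>.amazonaws.com/smstudio-custom:custom-image \
       jupyter-kernelspec list
   ```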

1. Create the AppImageConfig using the file created in the previous step.

   ```
   aws sagemaker create-app-image-config \
       --cli-input-json file://app-image-config-input.json
   ```

   The response should look similar to the following.

   ```
   {
       "AppImageConfigArn": "arn:aws:sagemaker:us-east-2:acct-id:app-image-config/custom-image-config"
   }
   ```
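   You can verify the configuration, including the registered kernel specs and file system settings, with the `DescribeAppImageConfig` API:

   ```
   aws sagemaker describe-app-image-config \
       --app-image-config-name custom-image-config
   ```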

# Attach a Custom SageMaker Image in Amazon SageMaker Studio Classic
<a name="studio-byoi-attach"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

To use a custom SageMaker image, you must attach a version of the image to your domain or shared space. When you attach an image version, it appears in the SageMaker Studio Classic Launcher and is available in the **Select image** dropdown list, which users use to launch an activity or change the image used by a notebook.

To make a custom SageMaker image available to all users within a domain, you attach the image to the domain. To make an image available to all users within a shared space, you can attach the image to the shared space. To make an image available to a single user, you attach the image to the user's profile. When you attach an image, SageMaker AI uses the latest image version by default. You can also attach a specific image version. After you attach the version, you can choose the version from the SageMaker AI Launcher or the image selector when you launch a notebook.

There is a limit to the number of image versions that can be attached at any given time. After you reach the limit, you must detach a version in order to attach another version of the image.

The following sections demonstrate how to attach a custom SageMaker image to your domain using either the SageMaker AI console or the AWS CLI. You can only attach a custom image to a shared space by using the AWS CLI.

## Attach the SageMaker image to a domain
<a name="studio-byoi-attach-domain"></a>

### Attach the SageMaker image using the Console
<a name="studio-byoi-attach-existing"></a>

This topic describes how you can attach an existing custom SageMaker image version to your domain using the SageMaker AI control panel. You can also create a custom SageMaker image and image version, and then attach that version to your domain. For the procedure to create an image and image version, see [Create a Custom SageMaker Image for Amazon SageMaker Studio Classic](studio-byoi-create.md).

**To attach an existing image**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. From the **Domains** page, select the domain to attach the image to.

1. From the **Domain details** page, select the **Environment** tab.

1. On the **Environment** tab, under **Custom SageMaker Studio Classic images attached to domain**, choose **Attach image**.

1. For **Image source**, choose **Existing image**.

1. Choose an existing image from the list.

1. Choose a version of the image from the list.

1. Choose **Next**.

1. Verify the values for **Image name**, **Image display name**, and **Description**.

1. Choose the IAM role. For more information, see [Create a Custom SageMaker Image for Amazon SageMaker Studio Classic](studio-byoi-create.md).

1. (Optional) Add tags for the image.

1. Specify the EFS mount path. This is the path within the image to mount the user's Amazon Elastic File System (EFS) home directory.

1. For **Image type**, select **SageMaker Studio image**.

1. For **Kernel name**, enter the name of an existing kernel in the image. For information on how to get the kernel information from the image, see [DEVELOPMENT](https://github.com/aws-samples/sagemaker-studio-custom-image-samples/blob/main/DEVELOPMENT.md) in the SageMaker Studio Classic Custom Image Samples repository. For more information, see the **Kernel discovery** and **User data** sections of [Custom SageMaker Image Specifications for Amazon SageMaker Studio Classic](studio-byoi-specs.md).

1. (Optional) For **Kernel display name**, enter the display name for the kernel.

1. Choose **Add kernel**.

1. Choose **Submit**.

   Wait for the image version to be attached to the domain. When attached, the version is displayed in the **Custom images** list and briefly highlighted.

### Attach the SageMaker image using the AWS CLI
<a name="studio-byoi-sdk-attach"></a>

The following sections demonstrate how to attach a custom SageMaker image when creating a new domain or updating your existing domain using the AWS CLI.

#### Attach the SageMaker image to a new domain
<a name="studio-byoi-sdk-attach-new-domain"></a>

The following section demonstrates how to create a new domain with the version attached. These steps require that you specify the Amazon Virtual Private Cloud (VPC) information and execution role required to create the domain. You perform the following steps to create the domain and attach the custom SageMaker image:
+ Get your default VPC ID and subnet IDs.
+ Create the configuration file for the domain, which specifies the image.
+ Create the domain with the configuration file.

**To add the custom SageMaker image to your domain**

1. Get your default VPC ID.

   ```
   aws ec2 describe-vpcs \
       --filters Name=isDefault,Values=true \
       --query "Vpcs[0].VpcId" --output text
   ```

   The response should look similar to the following.

   ```
   vpc-xxxxxxxx
   ```

1. Get your default subnet IDs using the VPC ID from the previous step.

   ```
   aws ec2 describe-subnets \
       --filters Name=vpc-id,Values=<vpc-id> \
       --query "Subnets[*].SubnetId" --output json
   ```

   The response should look similar to the following.

   ```
   [
       "subnet-b55171dd",
       "subnet-8a5f99c6",
       "subnet-e88d1392"
   ]
   ```
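   The two lookups above can be captured in shell variables for use in the configuration file in the next step, for example:

   ```
   VPC_ID=$(aws ec2 describe-vpcs \
       --filters Name=isDefault,Values=true \
       --query "Vpcs[0].VpcId" --output text)

   SUBNET_IDS=$(aws ec2 describe-subnets \
       --filters Name=vpc-id,Values="$VPC_ID" \
       --query "Subnets[*].SubnetId" --output text)

   echo "VPC: $VPC_ID"
   echo "Subnets: $SUBNET_IDS"
   ```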

1. Create a configuration file named `create-domain-input.json`. Insert the VPC ID, subnet IDs, `ImageName`, and `AppImageConfigName` from the previous steps. Because `ImageVersionNumber` isn't specified, the latest version of the image is used, which is the only version in this case.

   ```
   {
       "DomainName": "domain-with-custom-image",
       "VpcId": "<vpc-id>",
       "SubnetIds": [
           "<subnet-ids>"
       ],
       "DefaultUserSettings": {
           "ExecutionRole": "<execution-role>",
           "KernelGatewayAppSettings": {
               "CustomImages": [
                   {
                       "ImageName": "custom-image",
                       "AppImageConfigName": "custom-image-config"
                   }
               ]
           }
       },
       "AuthMode": "IAM"
   }
   ```

1. Create the domain with the attached custom SageMaker image.

   ```
   aws sagemaker create-domain \
       --cli-input-json file://create-domain-input.json
   ```

   The response should look similar to the following.

   ```
   {
       "DomainArn": "arn:aws:sagemaker:us-east-2:acct-id:domain/d-xxxxxxxxxxxx",
       "Url": "https://d-xxxxxxxxxxxx.studio.us-east-2.sagemaker.aws/..."
   }
   ```
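
   Domain creation is asynchronous, so the domain may not be usable immediately. As a sketch, you can poll until the domain reaches the `InService` status, with the domain ID as a placeholder:

   ```
   # Poll the domain status until it reaches InService.
   while true; do
       STATUS=$(aws sagemaker describe-domain \
           --domain-id <d-xxxxxxxxxxxx> \
           --query "Status" --output text)
       echo "Domain status: $STATUS"
       [ "$STATUS" = "InService" ] && break
       sleep 15
   done
   ```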

#### Attach the SageMaker image to your current domain
<a name="studio-byoi-sdk-attach-current-domain"></a>

If you have onboarded to a SageMaker AI domain, you can attach the custom image to your current domain. For more information about onboarding to a SageMaker AI domain, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md). You don't need to specify the VPC information and execution role when attaching a custom image to your current domain. After you attach the version, you must delete all the apps in your domain and reopen Studio Classic. For information about deleting the apps, see [Delete an Amazon SageMaker AI domain](gs-studio-delete-domain.md).

You perform the following steps to add the SageMaker image to your current domain.
+ Get your `DomainId` from the SageMaker AI console.
+ Use the `DomainId` to get the `DefaultUserSettings` for the domain.
+ Add the `ImageName` and `AppImageConfigName` as a `CustomImage` to the `DefaultUserSettings`.
+ Update your domain to include the custom image.

**To add the custom SageMaker image to your domain**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. From the **Domains** page, select the domain to attach the image to.

1. From the **Domain details** page, select the **Domain settings** tab.

1. From the **Domain settings** tab, under **General settings**, find the `DomainId`. The ID is in the following format: `d-xxxxxxxxxxxx`.

1. Use the domain ID to get the description of the domain.

   ```
   aws sagemaker describe-domain \
       --domain-id <d-xxxxxxxxxxxx>
   ```

   The response should look similar to the following.

   ```
   {
       "DomainId": "d-xxxxxxxxxxxx",
       "DefaultUserSettings": {
         "KernelGatewayAppSettings": {
           "CustomImages": [
           ],
           ...
         }
       }
   }
   ```

1. Save the default user settings section of the response to a file named `default-user-settings.json`.
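
   Rather than copying the section by hand, you can extract it directly with a `--query` expression. The following is a sketch; the multiselect hash keeps the `DefaultUserSettings` wrapper that `update-domain` expects:

   ```
   aws sagemaker describe-domain \
       --domain-id <d-xxxxxxxxxxxx> \
       --query "{DefaultUserSettings: DefaultUserSettings}" \
       --output json > default-user-settings.json
   ```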

1. Insert the `ImageName` and `AppImageConfigName` from the previous steps as a custom image. Because `ImageVersionNumber` isn't specified, the latest version of the image is used, which is the only version in this case.

   ```
   {
       "DefaultUserSettings": {
           "KernelGatewayAppSettings": { 
              "CustomImages": [ 
                 { 
                    "ImageName": "string",
                    "AppImageConfigName": "string"
                 }
              ],
              ...
           }
       }
   }
   ```

1. Use the domain ID and default user settings file to update your domain.

   ```
   aws sagemaker update-domain \
       --domain-id <d-xxxxxxxxxxxx> \
       --cli-input-json file://default-user-settings.json
   ```

   The response should look similar to the following.

   ```
   {
       "DomainArn": "arn:aws:sagemaker:us-east-2:acct-id:domain/d-xxxxxxxxxxxx"
   }
   ```

## Attach the SageMaker image to a shared space
<a name="studio-byoi-attach-shared-space"></a>

You can attach the SageMaker image to a shared space only by using the AWS CLI. After you attach the version, you must delete all of the applications in your shared space and reopen Studio Classic. For information about deleting the apps, see [Delete an Amazon SageMaker AI domain](gs-studio-delete-domain.md).

You perform the following steps to add the SageMaker image to a shared space.
+ Get your `DomainId` from the SageMaker AI console.
+ Use the `DomainId` to get the `DefaultSpaceSettings` for the domain.
+ Add the `ImageName` and `AppImageConfigName` as a `CustomImage` to the `DefaultSpaceSettings`.
+ Update your domain to include the custom image for the shared space.

**To add the custom SageMaker image to your shared space**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. From the **Domains** page, select the domain to attach the image to.

1. From the **Domain details** page, select the **Domain settings** tab.

1. From the **Domain settings** tab, under **General settings**, find the `DomainId`. The ID is in the following format: `d-xxxxxxxxxxxx`.

1. Use the domain ID to get the description of the domain.

   ```
   aws sagemaker describe-domain \
       --domain-id <d-xxxxxxxxxxxx>
   ```

   The response should look similar to the following.

   ```
   {
       "DomainId": "d-xxxxxxxxxxxx",
       ...
       "DefaultSpaceSettings": {
         "KernelGatewayAppSettings": {
           "CustomImages": [
           ],
           ...
         }
       }
   }
   ```

1. Save the default space settings section of the response to a file named `default-space-settings.json`.

1. Insert the `ImageName` and `AppImageConfigName` from the previous steps as a custom image. Because `ImageVersionNumber` isn't specified, the latest version of the image is used, which is the only version in this case.

   ```
   {
       "DefaultSpaceSettings": {
           "KernelGatewayAppSettings": { 
              "CustomImages": [ 
                 { 
                    "ImageName": "string",
                    "AppImageConfigName": "string"
                 }
              ],
              ...
           }
       }
   }
   ```

1. Use the domain ID and default space settings file to update your domain.

   ```
   aws sagemaker update-domain \
       --domain-id <d-xxxxxxxxxxxx> \
       --cli-input-json file://default-space-settings.json
   ```

   The response should look similar to the following.

   ```
   {
       "DomainArn": "arn:aws:sagemaker:us-east-2:acct-id:domain/d-xxxxxxxxxxxx"
   }
   ```

## View the attached image in SageMaker AI
<a name="studio-byoi-sdk-view"></a>

After you create the custom SageMaker image and attach it to your domain, the image appears in the **Environment** tab of the domain. Images attached for shared spaces aren't shown in the console; you can view them only by using the following AWS CLI command.

```
aws sagemaker describe-domain \
    --domain-id <d-xxxxxxxxxxxx>
```
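
To narrow the output to just the attached custom images, you can add a `--query` expression, as in the following sketch:

```
aws sagemaker describe-domain \
    --domain-id <d-xxxxxxxxxxxx> \
    --query "DefaultSpaceSettings.KernelGatewayAppSettings.CustomImages" \
    --output json
```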

# Launch a Custom SageMaker Image in Amazon SageMaker Studio Classic
<a name="studio-byoi-launch"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

After you create your custom SageMaker image and attach it to your domain or shared space, the custom image and kernel appear in selectors in the **Change environment** dialog box of the Studio Classic Launcher.

**To launch and select your custom image and kernel**

1. In Amazon SageMaker Studio Classic, open the Launcher. To open the Launcher, choose **Amazon SageMaker Studio Classic** at the top left of the Studio Classic interface or use the keyboard shortcut `Ctrl + Shift + L`.

   To learn about all the available ways to open the Launcher, see [Use the Amazon SageMaker Studio Classic Launcher](studio-launcher.md).  
![\[SageMaker Studio Classic launcher.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/studio-new-launcher.png)

1. In the Launcher, in the **Notebooks and compute resources** section, choose **Change environment**.

1. In the **Change environment** dialog, use the dropdown menus to select your **Image** from the **Custom Image** section, and your **Kernel**, then choose **Select**.

1. In the Launcher, choose **Create notebook** or **Open image terminal**. Your notebook or terminal launches in the selected custom image and kernel.

To change your image or kernel in an open notebook, see [Change the Image or a Kernel for an Amazon SageMaker Studio Classic Notebook](notebooks-run-and-manage-change-image.md).

**Note**  
If you encounter an error when launching the image, check your Amazon CloudWatch logs. The name of the log group is `/aws/sagemaker/studio`. The name of the log stream is `$domainID/$userProfileName/KernelGateway/$appName`.
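
The logs can also be retrieved from the AWS CLI. The following is a sketch; the domain ID and user profile name in the stream prefix are placeholders:

```
# List recent events from the KernelGateway log streams for one user profile.
aws logs filter-log-events \
    --log-group-name /aws/sagemaker/studio \
    --log-stream-name-prefix "domainID/userProfileName/KernelGateway" \
    --limit 50
```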

# Clean Up Resources for Custom Images in Amazon SageMaker Studio Classic
<a name="studio-byoi-cleanup"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

The following sections show how to clean up the resources you created in the previous sections from the SageMaker AI console or AWS CLI. You perform the following steps to clean up the resources:
+ Detach the image and image versions from your domain.
+ Delete the image, image version, and app image config.
+ Delete the container image and repository from Amazon ECR. For more information, see [Deleting a repository](https://docs.aws.amazon.com/AmazonECR/latest/userguide/repository-delete.html).

## Clean up resources from the SageMaker AI console
<a name="studio-byoi-detach"></a>

The following section shows how to clean up resources from the SageMaker AI console.

When you detach an image from a domain, all versions of the image are detached and all users of the domain lose access to those versions. A running notebook that has a kernel session on a detached image version continues to run; when the notebook is stopped or the kernel is shut down, the image version becomes unavailable.

**To detach an image**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Images**.

1. Under **Custom SageMaker Studio Classic images attached to domain**, choose the image and then choose **Detach**.

1. (Optional) To delete the image and all versions from SageMaker AI, select **Also delete the selected images ...**. This does not delete the associated container images from Amazon ECR.

1. Choose **Detach**.

## Clean up resources from the AWS CLI
<a name="studio-byoi-sdk-cleanup"></a>

The following section shows how to clean up resources from the AWS CLI.

**To clean up resources**

1. Detach the image and image versions from your domain by passing an empty custom image list to the domain. Open the `default-user-settings.json` file you created in [Attach the SageMaker image to your current domain](studio-byoi-attach.md#studio-byoi-sdk-attach-current-domain). To detach the image and image version from a shared space, open the `default-space-settings.json` file.

1. Delete the custom images and then save the file.

   ```
   "DefaultUserSettings": {
     "KernelGatewayAppSettings": {
        "CustomImages": [
        ],
        ...
     },
     ...
   }
   ```

1. Use the domain ID and default user settings file to update your domain. To update your shared space, use the default space settings file.

   ```
   aws sagemaker update-domain \
       --domain-id <d-xxxxxxxxxxxx> \
       --cli-input-json file://default-user-settings.json
   ```

   The response should look similar to the following.

   ```
   {
       "DomainArn": "arn:aws:sagemaker:us-east-2:acct-id:domain/d-xxxxxxxxxxxx"
   }
   ```

1. Delete the app image config.

   ```
   aws sagemaker delete-app-image-config \
       --app-image-config-name custom-image-config
   ```

1. Delete the SageMaker image, which also deletes all image versions. The container images in ECR that are represented by the image versions are not deleted.

   ```
   aws sagemaker delete-image \
       --image-name custom-image
   ```
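
   To finish cleaning up, you can also delete the Amazon ECR repository that holds the container images. The following is a sketch; the repository name is a placeholder, and `--force` deletes the repository even if it still contains images:

   ```
   aws ecr delete-repository \
       --repository-name <repository-name> \
       --force
   ```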

# Use Lifecycle Configurations to Customize Amazon SageMaker Studio Classic
<a name="studio-lcc"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

Amazon SageMaker Studio Classic runs lifecycle configuration shell scripts during important lifecycle events, such as starting a new Studio Classic notebook. You can use lifecycle configurations to automate customization of your Studio Classic environment. This customization includes installing custom packages, configuring notebook extensions, preloading datasets, and setting up source code repositories.

Using lifecycle configurations gives you flexibility and control to configure Studio Classic to meet your specific needs. For example, you can use customized container images with lifecycle configuration scripts to modify your environment. First, create a minimal set of base container images, then install the most commonly used packages and libraries in those images. After you have completed your images, use lifecycle configurations to install additional packages for specific use cases. This gives you the flexibility to modify your environment across your data science and machine learning teams based on need.

Users can only select lifecycle configuration scripts that they are given access to. While you can give access to multiple lifecycle configuration scripts, you can also set default lifecycle configuration scripts for resources. Based on the resource that the default lifecycle configuration is set for, the default either runs automatically or is the first option shown.

For example lifecycle configuration scripts, see the [Studio Classic Lifecycle Configuration examples GitHub repository](https://github.com/aws-samples/sagemaker-studio-lifecycle-config-examples). For a blog on implementing lifecycle configuration, see [Customize Amazon SageMaker Studio Classic using Lifecycle Configurations](https://aws.amazon.com/blogs/machine-learning/customize-amazon-sagemaker-studio-using-lifecycle-configurations/).

**Note**  
Each script has a limit of **16,384 characters**.

**Topics**
+ [Create and Associate a Lifecycle Configuration with Amazon SageMaker Studio Classic](studio-lcc-create.md)
+ [Set Default Lifecycle Configurations for Amazon SageMaker Studio Classic](studio-lcc-defaults.md)
+ [Debug Lifecycle Configurations in Amazon SageMaker Studio Classic](studio-lcc-debug.md)
+ [Update and Detach Lifecycle Configurations in Amazon SageMaker Studio Classic](studio-lcc-delete.md)

# Create and Associate a Lifecycle Configuration with Amazon SageMaker Studio Classic
<a name="studio-lcc-create"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

Amazon SageMaker AI provides interactive applications that enable Studio Classic's visual interface, code authoring, and run experience. The following topics show how to create a lifecycle configuration and associate it with a SageMaker AI domain.

Application types can be either `JupyterServer` or `KernelGateway`. 
+ **`JupyterServer` applications:** This application type enables access to the visual interface for Studio Classic. Every user and shared space in Studio Classic gets its own JupyterServer application.
+ **`KernelGateway` applications:** This application type enables access to the code run environment and kernels for your Studio Classic notebooks and terminals. For more information, see [Jupyter Kernel Gateway](https://jupyter-kernel-gateway.readthedocs.io/en/latest/).

For more information about Studio Classic's architecture and Studio Classic applications, see [Use Amazon SageMaker Studio Classic Notebooks](https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks.html).

**Topics**
+ [Create a Lifecycle Configuration from the AWS CLI for Amazon SageMaker Studio Classic](studio-lcc-create-cli.md)
+ [Create a Lifecycle Configuration from the SageMaker AI Console for Amazon SageMaker Studio Classic](studio-lcc-create-console.md)

# Create a Lifecycle Configuration from the AWS CLI for Amazon SageMaker Studio Classic
<a name="studio-lcc-create-cli"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

The following topic shows how to create a lifecycle configuration using the AWS CLI to automate customization for your Studio Classic environment.

## Prerequisites
<a name="studio-lcc-create-cli-prerequisites"></a>

Before you begin, complete the following prerequisites: 
+ Update the AWS CLI by following the steps in [Installing the current AWS CLI Version](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv1.html#install-tool-bundled).
+ From your local machine, run `aws configure` and provide your AWS credentials. For information about AWS credentials, see [Understanding and getting your AWS credentials](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html). 
+ Onboard to a SageMaker AI domain by following the steps in [Amazon SageMaker AI domain overview](gs-studio-onboard.md).

## Step 1: Create a lifecycle configuration
<a name="studio-lcc-create-cli-step1"></a>

The following procedure shows how to create a lifecycle configuration script that prints `Hello World`.

**Note**  
Each script can have up to **16,384 characters**.

1. From your local machine, create a file named `my-script.sh` with the following content.

   ```
   #!/bin/bash
   set -eux
   echo 'Hello World!'
   ```

1. Convert your `my-script.sh` file to base64 format. This prevents errors caused by spacing and line break encoding.

   ```
   LCC_CONTENT=`openssl base64 -A -in my-script.sh`
   ```
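
   Before creating the lifecycle configuration, you can verify that the encoding round-trips cleanly and that the script is within the 16,384-character limit. The following self-contained sketch recreates the step 1 file for illustration:

   ```
   # Recreate my-script.sh from step 1 so this check is self-contained.
   printf '%s\n' '#!/bin/bash' 'set -eux' "echo 'Hello World!'" > my-script.sh
   LCC_CONTENT=`openssl base64 -A -in my-script.sh`

   # Decode the base64 content and compare it with the original file.
   printf '%s' "$LCC_CONTENT" | openssl base64 -d -A > decoded.sh
   cmp -s my-script.sh decoded.sh && echo "round-trip OK"

   # The Studio Classic limit applies to the script: at most 16,384 characters.
   wc -c < my-script.sh
   ```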

1. Create a lifecycle configuration for use with Studio Classic. The following command creates a lifecycle configuration that runs when you launch an associated `KernelGateway` application. 

   ```
   aws sagemaker create-studio-lifecycle-config \
   --region region \
   --studio-lifecycle-config-name my-studio-lcc \
   --studio-lifecycle-config-content $LCC_CONTENT \
   --studio-lifecycle-config-app-type KernelGateway
   ```

   Note the ARN of the newly created lifecycle configuration that is returned. This ARN is required to attach the lifecycle configuration to your application.

## Step 2: Attach the lifecycle configuration to your domain, user profile, or shared space
<a name="studio-lcc-create-cli-step2"></a>

To attach the lifecycle configuration, you must update the `UserSettings` for your domain or user profile, or the `SpaceSettings` for a shared space. Lifecycle configuration scripts that are associated at the domain level are inherited by all users. However, scripts that are associated at the user profile level are scoped to a specific user, while scripts that are associated at the shared space level are scoped to the shared space. 

The following example shows how to create a new user profile with the lifecycle configuration attached. You can also create a new domain or space with a lifecycle configuration attached by using the [create-domain](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/create-domain.html) and [create-space](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/create-space.html) commands, respectively.

Add the lifecycle configuration ARN from the previous step to the settings for the appropriate app type. For example, place it in the `JupyterServerAppSettings` of the user. You can add multiple lifecycle configurations at the same time by passing a list of lifecycle configurations. When a user launches a JupyterServer application with the AWS CLI, they can pass a lifecycle configuration to use instead of the default. The lifecycle configuration that the user passes must belong to the list of lifecycle configurations in `JupyterServerAppSettings`.

```
# Create a new UserProfile
aws sagemaker create-user-profile --domain-id domain-id \
--user-profile-name user-profile-name \
--region region \
--user-settings '{
"JupyterServerAppSettings": {
  "LifecycleConfigArns":
    [lifecycle-configuration-arn-list]
  }
}'
```

The following example shows how to update an existing shared space to attach the lifecycle configuration. You can also update an existing domain or user profile with a lifecycle configuration attached by using the [update-domain](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/update-domain.html) or [update-user-profile](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/update-user-profile.html) command. When you update the list of lifecycle configurations attached, you must pass all lifecycle configurations as part of the list. If a lifecycle configuration is not part of this list, it will not be attached to the application.

```
aws sagemaker update-space --domain-id domain-id \
--space-name space-name \
--region region \
--space-settings '{
"JupyterServerAppSettings": {
  "LifecycleConfigArns":
    [lifecycle-configuration-arn-list]
  }
}'
```
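
To confirm the attachment, you can read the settings back with `describe-space`, as in the following sketch using the same placeholder names:

```
aws sagemaker describe-space --domain-id domain-id \
--space-name space-name \
--region region \
--query "SpaceSettings.JupyterServerAppSettings.LifecycleConfigArns"
```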

For information about setting a default lifecycle configuration for a resource, see [Set Default Lifecycle Configurations for Amazon SageMaker Studio Classic](studio-lcc-defaults.md).

## Step 3: Launch application with lifecycle configuration
<a name="studio-lcc-create-cli-step3"></a>

After you attach a lifecycle configuration to a domain, user profile, or space, the user can select it when launching an application with the AWS CLI. This section describes how to launch an application with an attached lifecycle configuration. For information about changing the default lifecycle configuration after launching a JupyterServer application, see [Set Default Lifecycle Configurations for Amazon SageMaker Studio Classic](studio-lcc-defaults.md).

Launch the desired application type using the `create-app` command and specify the lifecycle configuration ARN in the `resource-spec` argument. 
+ The following example shows how to create a `JupyterServer` application with an associated lifecycle configuration. When creating the `JupyterServer`, the `app-name` must be `default`. The lifecycle configuration ARN passed as part of the `resource-spec` parameter must be part of the list of lifecycle configuration ARNs specified in `UserSettings` for your domain or user profile, or `SpaceSettings` for a shared space.

  ```
  aws sagemaker create-app --domain-id domain-id \
  --region region \
  --user-profile-name user-profile-name \
  --app-type JupyterServer \
  --resource-spec LifecycleConfigArn=lifecycle-configuration-arn \
  --app-name default
  ```
+ The following example shows how to create a `KernelGateway` application with an associated lifecycle configuration.

  ```
  aws sagemaker create-app --domain-id domain-id \
  --region region \
  --user-profile-name user-profile-name \
  --app-type KernelGateway \
  --resource-spec LifecycleConfigArn=lifecycle-configuration-arn,SageMakerImageArn=sagemaker-image-arn,InstanceType=instance-type \
  --app-name app-name
  ```
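
After `create-app` returns, the application starts asynchronously. As a sketch, you can check whether it has reached the `InService` status with `describe-app`:

```
aws sagemaker describe-app --domain-id domain-id \
--region region \
--user-profile-name user-profile-name \
--app-type KernelGateway \
--app-name app-name \
--query "Status" --output text
```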

# Create a Lifecycle Configuration from the SageMaker AI Console for Amazon SageMaker Studio Classic
<a name="studio-lcc-create-console"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

The following topic shows how to create a lifecycle configuration from the Amazon SageMaker AI console to automate customization for your Studio Classic environment.

## Prerequisites
<a name="studio-lcc-create-console-prerequisites"></a>

Before you can begin this tutorial, complete the following prerequisite:
+ Onboard to Amazon SageMaker Studio Classic. For more information, see [Onboard to Amazon SageMaker Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-studio-onboard.html).

## Step 1: Create a new lifecycle configuration
<a name="studio-lcc-create-console-step1"></a>

You can create a lifecycle configuration by entering a script from the Amazon SageMaker AI console.

**Note**  
Each script can have up to **16,384 characters**.

The following procedure shows how to create a lifecycle configuration script that prints `Hello World`.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Lifecycle configurations**. 

1. Choose the **Studio** tab.

1. Choose **Create configuration**.

1. Under **Select configuration type**, select the type of application that the lifecycle configuration should be attached to. For more information about selecting which application to attach the lifecycle configuration to, see [Set Default Lifecycle Configurations for Amazon SageMaker Studio Classic](studio-lcc-defaults.md).

1. Choose **Next**.

1. In the section called **Configuration settings**, enter a name for your lifecycle configuration.

1. In the **Scripts** section, enter the following content.

   ```
   #!/bin/bash
   set -eux
   echo 'Hello World!'
   ```

1. (Optional) Create a tag for your lifecycle configuration.

1. Choose **Submit**.

## Step 2: Attach the lifecycle configuration to a domain or user profile
<a name="studio-lcc-create-console-step2"></a>

Lifecycle configuration scripts associated at the domain level are inherited by all users. However, scripts that are associated at the user profile level are scoped to a specific user. 

You can attach multiple lifecycle configurations to a domain or user profile for both JupyterServer and KernelGateway applications.

**Note**  
To attach a lifecycle configuration to a shared space, you must use the AWS CLI. For more information, see [Create a Lifecycle Configuration from the AWS CLI for Amazon SageMaker Studio Classic](studio-lcc-create-cli.md).

The following sections show how to attach a lifecycle configuration to your domain or user profile.

### Attach to a domain
<a name="studio-lcc-create-console-step2-domain"></a>

The following shows how to attach a lifecycle configuration to your existing domain from the SageMaker AI console.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. From the list of domains, select the domain to attach the lifecycle configuration to.

1. From the **Domain details**, choose the **Environment** tab.

1. Under **Lifecycle configurations for personal Studio apps**, choose **Attach**.

1. Under **Source**, choose **Existing configuration**.

1. Under **Studio lifecycle configurations**, select the lifecycle configuration that you created in the previous step.

1. Choose **Attach to domain**.

### Attach to your user profile
<a name="studio-lcc-create-console-step2-userprofile"></a>

The following shows how to attach a lifecycle configuration to your existing user profile.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. From the list of domains, select the domain that contains the user profile to attach the lifecycle configuration to.

1. Under **User profiles**, select the user profile.

1. From the **User Details** page, choose **Edit**.

1. In the left navigation, choose **Studio settings**.

1. Under **Lifecycle configurations attached to user**, choose **Attach**.

1. Under **Source**, choose **Existing configuration**.

1. Under **Studio lifecycle configurations**, select the lifecycle configuration that you created in the previous step.

1. Choose **Attach to user profile**.

## Step 3: Launch an application with the lifecycle configuration
<a name="studio-lcc-create-console-step3"></a>

After you attach a lifecycle configuration to a domain or user profile, you can launch an application with that attached lifecycle configuration. Choosing which lifecycle configuration to launch with depends on the application type.
+ **JupyterServer**: When launching a JupyterServer application from the console, SageMaker AI always uses the default lifecycle configuration. You can't use a different lifecycle configuration when launching from the console. For information about changing the default lifecycle configuration after launching a JupyterServer application, see [Set Default Lifecycle Configurations for Amazon SageMaker Studio Classic](studio-lcc-defaults.md).

  To select a different attached lifecycle configuration, you must launch with the AWS CLI. For more information about launching a JupyterServer application with an attached lifecycle configuration from the AWS CLI, see [Create a Lifecycle Configuration from the AWS CLI for Amazon SageMaker Studio Classic](studio-lcc-create-cli.md).
+ **KernelGateway**: You can select any of the attached lifecycle configurations when launching a KernelGateway application using the Studio Classic Launcher.
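
As a sketch of the CLI path for a JupyterServer application, the following hypothetical `create-app` call launches an app with a specific attached lifecycle configuration; the domain ID, user profile name, and ARN are placeholders:

```shell
# Launch a JupyterServer app with a specific attached lifecycle configuration.
# The domain ID, user profile name, and ARN are hypothetical placeholders.
aws sagemaker create-app --domain-id d-xxxxxxxxxxxx \
    --user-profile-name my-user-profile \
    --app-type JupyterServer \
    --app-name default \
    --resource-spec LifecycleConfigArn=arn:aws:sagemaker:us-east-1:111122223333:studio-lifecycle-config/my-config
```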

The following procedure describes how to launch a KernelGateway application with an attached lifecycle configuration from the SageMaker AI console.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Launch Studio Classic. For more information, see [Launch Amazon SageMaker Studio Classic](studio-launch.md).

1. In the Studio Classic UI, open the Studio Classic Launcher. For more information, see [Use the Amazon SageMaker Studio Classic Launcher](studio-launcher.md). 

1. In the Studio Classic Launcher, navigate to the **Notebooks and compute resources** section. 

1. Choose **Change environment**.

1. In the **Change environment** dialog, use the dropdown menus to select your **Image**, **Kernel**, **Instance type**, and **Start-up script**. If there is no default lifecycle configuration, the **Start-up script** value defaults to `No script`. Otherwise, the **Start-up script** value is your default lifecycle configuration. After you select a lifecycle configuration, you can view the entire script.

1. Choose **Select**.

1. Back in the Launcher, choose **Create notebook** to launch a new notebook kernel with your selected image and lifecycle configuration.

## Step 4: View logs for a lifecycle configuration
<a name="studio-lcc-create-console-step4"></a>

You can view the logs for your lifecycle configuration after it has been attached to a domain or user profile. 

1. First, provide access to CloudWatch for your AWS Identity and Access Management (IAM) role. Add read permissions for the following log group and log stream.
   + **Log group:** `/aws/sagemaker/studio`
   + **Log stream:** `domain/user-profile/app-type/app-name/LifecycleConfigOnStart`

    For information about adding permissions, see [Enabling logging from certain AWS services](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AWS-logs-and-resource-policy.html).

1. From within Studio Classic, navigate to the **Running Terminals and Kernels** icon (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/running-terminals-kernels.png)) to monitor your lifecycle configuration.

1. Select an application from the list of running applications. Applications with attached lifecycle configurations have an attached indicator icon ![\[Code brackets symbol representing programming or markup languages.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/studio-lcc-indicator-icon.png).

1. Select the indicator icon for your application. This opens a new panel that lists the lifecycle configuration.

1. From the new panel, select `View logs`. This opens a new tab that displays the logs.
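
You can also pull the same logs from the AWS CLI. The following sketch assembles the log stream name from placeholder identifiers (the domain ID, user profile, and app name are hypothetical) and fetches the events with `aws logs get-log-events`:

```shell
#!/bin/bash
# Assemble the Studio Classic lifecycle configuration log stream name.
# All identifiers here are hypothetical placeholders.
DOMAIN_ID="d-xxxxxxxxxxxx"
USER_PROFILE="my-user-profile"
APP_TYPE="JupyterServer"
APP_NAME="default"
LOG_STREAM="${DOMAIN_ID}/${USER_PROFILE}/${APP_TYPE}/${APP_NAME}/LifecycleConfigOnStart"

# Fetch the log events (requires CloudWatch Logs read permissions).
aws logs get-log-events \
    --log-group-name /aws/sagemaker/studio \
    --log-stream-name "${LOG_STREAM}"
```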

# Set Default Lifecycle Configurations for Amazon SageMaker Studio Classic
<a name="studio-lcc-defaults"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

Although you can attach multiple lifecycle configuration scripts to a single resource, you can only set one default lifecycle configuration for each JupyterServer or KernelGateway application. The behavior of the default lifecycle configuration depends on whether it is set for JupyterServer or KernelGateway apps. 
+ **JupyterServer apps:** When set as the default lifecycle configuration script for JupyterServer apps, the lifecycle configuration script runs automatically when the user signs in to Studio Classic for the first time or restarts Studio Classic. Use this default lifecycle configuration to automate one-time setup actions for the Studio Classic developer environment, such as installing notebook extensions or setting up a GitHub repo. For an example of this, see [Customize Amazon SageMaker Studio using Lifecycle Configurations](https://aws.amazon.com/blogs/machine-learning/customize-amazon-sagemaker-studio-using-lifecycle-configurations/).
+ **KernelGateway apps:** When set as the default lifecycle configuration script for KernelGateway apps, the lifecycle configuration is selected by default in the Studio Classic launcher. Users can launch a notebook or terminal with the default script selected, or they can select a different one from the list of lifecycle configurations.

SageMaker AI supports setting a default lifecycle configuration for the following resources:
+ Domains
+ User profiles
+ Shared spaces

While domains and user profiles support setting a default lifecycle configuration from both the Amazon SageMaker AI console and AWS Command Line Interface, shared spaces only support setting a default lifecycle configuration from the AWS CLI.

You can set a lifecycle configuration as the default when creating a new resource or updating an existing resource. The following topics demonstrate how to set a default lifecycle configuration using the SageMaker AI console and AWS CLI.

## Default lifecycle configuration inheritance
<a name="studio-lcc-defaults-inheritance"></a>

Default lifecycle configurations set at the *domain* level are inherited by all users and shared spaces. Default lifecycle configurations set at the *user* and *shared space* level are scoped to only that user or shared space. User and space defaults override defaults set at the domain level.

A default KernelGateway lifecycle configuration set for a domain applies to all KernelGateway applications launched in the domain. Unless the user selects a different lifecycle configuration from the list presented in the Studio Classic launcher, the default lifecycle configuration is used. The default script also runs if `No Script` is selected by the user. For more information about selecting a script, see [Step 3: Launch an application with the lifecycle configuration](studio-lcc-create-console.md#studio-lcc-create-console-step3).
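
To check which default a user profile ends up with, you can inspect the resource from the AWS CLI. The following sketch uses placeholder names and shows only the profile-level setting; an empty result suggests the domain-level default applies:

```shell
# Inspect the user-level default KernelGateway lifecycle configuration.
# Domain ID and profile name are hypothetical placeholders.
aws sagemaker describe-user-profile --domain-id d-xxxxxxxxxxxx \
    --user-profile-name my-user-profile \
    --query 'UserSettings.KernelGatewayAppSettings.DefaultResourceSpec.LifecycleConfigArn'
```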

**Topics**
+ [Default lifecycle configuration inheritance](#studio-lcc-defaults-inheritance)
+ [Set Defaults from the AWS CLI for Amazon SageMaker Studio Classic](studio-lcc-defaults-cli.md)
+ [Set Defaults from the SageMaker AI Console for Amazon SageMaker Studio Classic](studio-lcc-defaults-console.md)

# Set Defaults from the AWS CLI for Amazon SageMaker Studio Classic
<a name="studio-lcc-defaults-cli"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

You can set default lifecycle configuration scripts from the AWS CLI for the following resources:
+ Domains
+ User profiles
+ Shared spaces

The following sections outline how to set default lifecycle configuration scripts from the AWS CLI.

**Topics**
+ [Prerequisites](#studio-lcc-defaults-cli-prereq)
+ [Set a default lifecycle configuration when creating a new resource](#studio-lcc-defaults-cli-new)
+ [Set a default lifecycle configuration for an existing resource](#studio-lcc-defaults-cli-existing)

## Prerequisites
<a name="studio-lcc-defaults-cli-prereq"></a>

Before you begin, complete the following prerequisites:
+ Update the AWS CLI by following the steps in [Installing the current AWS CLI version](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv1.html#install-tool-bundled).
+ From your local machine, run `aws configure` and provide your AWS credentials. For information about AWS credentials, see [Understanding and getting your AWS credentials](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html). 
+ Onboard to an Amazon SageMaker AI domain by following the steps in [Amazon SageMaker AI domain overview](gs-studio-onboard.md).
+ Create a lifecycle configuration following the steps in [Create and Associate a Lifecycle Configuration with Amazon SageMaker Studio Classic](studio-lcc-create.md).

## Set a default lifecycle configuration when creating a new resource
<a name="studio-lcc-defaults-cli-new"></a>

To set a default lifecycle configuration when creating a new domain, user profile, or space, pass the ARN of your previously created lifecycle configuration as part of one of the following AWS CLI commands:
+ [create-user-profile](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/create-user-profile.html)
+ [create-domain](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/create-domain.html)
+ [create-space](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/create-space.html)

You must pass the lifecycle configuration ARN for the following values in the KernelGateway or JupyterServer default settings:
+ `DefaultResourceSpec`:`LifecycleConfigArn` - This specifies the default lifecycle configuration for the application type.
+ `LifecycleConfigArns` - This is the list of all lifecycle configurations attached to the application type. The default lifecycle configuration must also be part of this list.

For example, the following API call creates a new user profile with a default lifecycle configuration.

```
aws sagemaker create-user-profile --domain-id domain-id \
--user-profile-name user-profile-name \
--region region \
--user-settings '{
"KernelGatewayAppSettings": {
    "DefaultResourceSpec": { 
            "InstanceType": "ml.t3.medium",
            "LifecycleConfigArn": "lifecycle-configuration-arn"
         },
    "LifecycleConfigArns": [lifecycle-configuration-arn-list]
  }
}'
```

## Set a default lifecycle configuration for an existing resource
<a name="studio-lcc-defaults-cli-existing"></a>

To set or update the default lifecycle configuration for an existing resource, pass the ARN of your previously created lifecycle configuration as part of one of the following AWS CLI commands:
+ [update-user-profile](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/update-user-profile.html)
+ [update-domain](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/update-domain.html)
+ [update-space](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/update-space.html)

You must pass the lifecycle configuration ARN for the following values in the KernelGateway or JupyterServer default settings:
+ `DefaultResourceSpec`:`LifecycleConfigArn` - This specifies the default lifecycle configuration for the application type.
+ `LifecycleConfigArns` - This is the list of all lifecycle configurations attached to the application type. The default lifecycle configuration must also be part of this list.

For example, the following API call updates a user profile with a default lifecycle configuration.

```
aws sagemaker update-user-profile --domain-id domain-id \
--user-profile-name user-profile-name \
--region region \
--user-settings '{
"KernelGatewayAppSettings": {
    "DefaultResourceSpec": {
            "InstanceType": "ml.t3.medium",
            "LifecycleConfigArn": "lifecycle-configuration-arn"
         },
    "LifecycleConfigArns": [lifecycle-configuration-arn-list]
  }
}'
```

The following API call updates a domain to set a new default lifecycle configuration.

```
aws sagemaker update-domain --domain-id domain-id \
--region region \
--default-user-settings '{
"JupyterServerAppSettings": {
    "DefaultResourceSpec": {
            "InstanceType": "system",
            "LifecycleConfigArn": "lifecycle-configuration-arn"
         },
    "LifecycleConfigArns": [lifecycle-configuration-arn-list]
  }
}'
```
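
Shared spaces can only be updated from the AWS CLI. The following hypothetical `update-space` call sets a default lifecycle configuration for a shared space; the domain ID, space name, Region, and ARNs are placeholders:

```shell
# Set a default KernelGateway lifecycle configuration for a shared space.
# All identifiers below are hypothetical placeholders.
aws sagemaker update-space --domain-id d-xxxxxxxxxxxx \
    --space-name my-shared-space \
    --region us-east-1 \
    --space-settings '{
  "KernelGatewayAppSettings": {
    "DefaultResourceSpec": {
      "InstanceType": "ml.t3.medium",
      "LifecycleConfigArn": "arn:aws:sagemaker:us-east-1:111122223333:studio-lifecycle-config/my-config"
    },
    "LifecycleConfigArns": ["arn:aws:sagemaker:us-east-1:111122223333:studio-lifecycle-config/my-config"]
  }
}'
```

Note that, as with domains and user profiles, the default ARN must also appear in the `LifecycleConfigArns` list.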

# Set Defaults from the SageMaker AI Console for Amazon SageMaker Studio Classic
<a name="studio-lcc-defaults-console"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

You can set default lifecycle configuration scripts from the SageMaker AI console for the following resources.
+ Domains
+ User profiles

You cannot set default lifecycle configuration scripts for shared spaces from the SageMaker AI console. For information about setting defaults for shared spaces, see [Set Defaults from the AWS CLI for Amazon SageMaker Studio Classic](studio-lcc-defaults-cli.md).

The following sections outline how to set default lifecycle configuration scripts from the SageMaker AI console.

**Topics**
+ [Prerequisites](#studio-lcc-defaults-cli-prerequisites)
+ [Set a default lifecycle configuration for a domain](#studio-lcc-defaults-cli-domain)
+ [Set a default lifecycle configuration for a user profile](#studio-lcc-defaults-cli-user-profile)

## Prerequisites
<a name="studio-lcc-defaults-cli-prerequisites"></a>

Before you begin, complete the following prerequisites:
+ Onboard to an Amazon SageMaker AI domain by following the steps in [Amazon SageMaker AI domain overview](gs-studio-onboard.md).
+ Create a lifecycle configuration following the steps in [Create and Associate a Lifecycle Configuration with Amazon SageMaker Studio Classic](studio-lcc-create.md).

## Set a default lifecycle configuration for a domain
<a name="studio-lcc-defaults-cli-domain"></a>

The following procedure shows how to set a default lifecycle configuration for a domain from the SageMaker AI console.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. From the list of domains, select the name of the domain to set the default lifecycle configuration for.

1. From the **Domain details** page, choose the **Environment** tab.

1. Under **Lifecycle configurations for personal Studio apps**, select the lifecycle configuration that you want to set as the default for the domain. You can set distinct defaults for JupyterServer and KernelGateway applications.

1. Choose **Set as default**. This opens a pop-up window that lists the current defaults for JupyterServer and KernelGateway applications.

1. Choose **Set as default** to set the lifecycle configuration as the default for its respective application type.

## Set a default lifecycle configuration for a user profile
<a name="studio-lcc-defaults-cli-user-profile"></a>

The following procedure shows how to set a default lifecycle configuration for a user profile from the SageMaker AI console.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. From the list of domains, select the name of the domain that contains the user profile that you want to set the default lifecycle configuration for.

1. From the **Domain details** page, choose the **User profiles** tab.

1. Select the name of the user profile to set the default lifecycle configuration for. This opens a **User Details** page.

1. From the **User Details** page, choose **Edit**. This opens an **Edit user profile** page.

1. From the **Edit user profile** page, choose **Step 2 Studio settings**.

1. Under **Lifecycle configurations attached to user**, select the lifecycle configuration that you want to set as the default for the user profile. You can set distinct defaults for JupyterServer and KernelGateway applications.

1. Choose **Set as default**. This opens a pop-up window that lists the current defaults for JupyterServer and KernelGateway applications.

1. Choose **Set as default** to set the lifecycle configuration as the default for its respective application type.

# Debug Lifecycle Configurations in Amazon SageMaker Studio Classic
<a name="studio-lcc-debug"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

The following topics show how to get information about and debug your lifecycle configurations.

**Topics**
+ [Verify lifecycle configuration process from CloudWatch Logs](#studio-lcc-debug-logs)
+ [JupyterServer app failure](#studio-lcc-debug-jupyterserver)
+ [KernelGateway app failure](#studio-lcc-debug-kernel)
+ [Lifecycle configuration timeout](#studio-lcc-debug-timeout)

## Verify lifecycle configuration process from CloudWatch Logs
<a name="studio-lcc-debug-logs"></a>

Lifecycle configurations only log `STDOUT` and `STDERR`.

`STDOUT` is the default output for bash scripts. You can write to `STDERR` by appending `>&2` to the end of a bash command. For example, `echo 'hello' >&2`. 

Logs for your lifecycle configurations are published to your AWS account using Amazon CloudWatch. These logs can be found in the `/aws/sagemaker/studio` log group in the CloudWatch console.

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. Choose **Logs** from the left side. From the dropdown menu, select **Log groups**.

1. On the **Log groups** page, search for `aws/sagemaker/studio`. 

1. Select the log group.

1. On the **Log group details** page, choose the **Log streams** tab.

1. To find the logs for a specific app, search the log streams using the following format:

   ```
   domain-id/space-name/app-type/default/LifecycleConfigOnStart
   ```

   For example, to find the lifecycle configuration logs for domain `d-m85lcu8vbqmz`, space name `i-sonic-js`, and application type `JupyterLab`, use the following search string:

   ```
   d-m85lcu8vbqmz/i-sonic-js/JupyterLab/default/LifecycleConfigOnStart
   ```

## JupyterServer app failure
<a name="studio-lcc-debug-jupyterserver"></a>

If your JupyterServer app crashes because of an issue with the attached lifecycle configuration, Studio Classic displays the following error message on the Studio Classic startup screen. 

```
Failed to create SageMaker Studio due to start-up script failure
```

Select the `View script logs` link to view the CloudWatch logs for your JupyterServer app.

If the faulty lifecycle configuration is specified in the `DefaultResourceSpec` of your domain, user profile, or shared space, Studio Classic continues to use the lifecycle configuration even after you restart Studio Classic. 

To resolve this error, follow the steps in [Set Default Lifecycle Configurations for Amazon SageMaker Studio Classic](studio-lcc-defaults.md) to remove the lifecycle configuration script from the `DefaultResourceSpec` or select another script as the default. Then launch a new JupyterServer app.

## KernelGateway app failure
<a name="studio-lcc-debug-kernel"></a>

If your KernelGateway app crashes because of an issue with the attached lifecycle configuration, Studio Classic displays the error message in your Studio Classic Notebook. 

Choose `View script logs` to view the CloudWatch logs for your KernelGateway app.

In this case, your lifecycle configuration is specified in the Studio Classic Launcher when launching a new Studio Classic Notebook. 

To resolve this error, use the Studio Classic launcher to select a different lifecycle configuration or select `No script`.

**Note**  
A default KernelGateway lifecycle configuration specified in `DefaultResourceSpec` applies to all KernelGateway images in the domain, user profile, or shared space unless the user selects a different script from the list presented in the Studio Classic launcher. The default script also runs if `No Script` is selected by the user. For more information on selecting a script, see [Step 3: Launch an application with the lifecycle configuration](studio-lcc-create-console.md#studio-lcc-create-console-step3).

## Lifecycle configuration timeout
<a name="studio-lcc-debug-timeout"></a>

There is a lifecycle configuration timeout limitation of 5 minutes. If a lifecycle configuration script takes longer than 5 minutes to run, Studio Classic throws an error.

To resolve this error, ensure that your lifecycle configuration script completes in less than 5 minutes. 

To help decrease the run time of scripts, try the following:
+ Cut down on unnecessary steps. For example, limit the conda environments in which you install large packages.
+ Run tasks in parallel processes.
+ Use the `nohup` command in your script to ensure that hangup signals are ignored and do not stop the execution of the script.

# Update and Detach Lifecycle Configurations in Amazon SageMaker Studio Classic
<a name="studio-lcc-delete"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

A lifecycle configuration script can't be changed after it's created. To update your script, you must create a new lifecycle configuration script and attach it to the respective domain, user profile, or shared space. For more information about creating and attaching the lifecycle configuration, see [Create and Associate a Lifecycle Configuration with Amazon SageMaker Studio Classic](studio-lcc-create.md).

The following topic shows how to detach a lifecycle configuration using the AWS CLI and SageMaker AI console.

**Topics**
+ [Prerequisites](#studio-lcc-delete-pre)
+ [Detach using the AWS CLI](#studio-lcc-delete-cli)

## Prerequisites
<a name="studio-lcc-delete-pre"></a>

Before detaching a lifecycle configuration, you must complete the following prerequisite.
+ To successfully detach a lifecycle configuration, no running application can be using the lifecycle configuration. You must first shut down the running applications as shown in [Shut Down and Update Amazon SageMaker Studio Classic and Apps](studio-tasks-update.md).

## Detach using the AWS CLI
<a name="studio-lcc-delete-cli"></a>

To detach a lifecycle configuration using the AWS CLI, remove the desired lifecycle configuration from the list of lifecycle configurations attached to the resource and pass the list as part of the respective command:
+ [update-user-profile](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/update-user-profile.html)
+ [update-domain](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/update-domain.html)
+ [update-space](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/update-space.html)

For example, the following command removes all lifecycle configurations for KernelGateways attached to the domain.

```
aws sagemaker update-domain --domain-id domain-id \
--region region \
--default-user-settings '{
"KernelGatewayAppSettings": {
  "LifecycleConfigArns":
    []
  }
}'
```
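
Similarly, the following call detaches all KernelGateway lifecycle configurations from a single user profile (the domain ID and profile name are placeholders). Keep in mind that the list you pass replaces the previously stored user-level list, so to detach just one configuration, pass the existing list without that configuration's ARN.

```shell
# Remove all KernelGateway lifecycle configurations from a user profile.
# Domain ID and profile name are hypothetical placeholders.
aws sagemaker update-user-profile --domain-id domain-id \
--user-profile-name user-profile-name \
--region region \
--user-settings '{
"KernelGatewayAppSettings": {
    "LifecycleConfigArns": []
  }
}'
```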

# Attach Suggested Git Repos to Amazon SageMaker Studio Classic
<a name="studio-git-attach"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

Amazon SageMaker Studio Classic offers a Git extension for you to enter the URL of a Git repository (repo), clone it into your environment, push changes, and view commit history. In addition to this Git extension, you can also attach suggested Git repository URLs at the Amazon SageMaker AI domain or user profile level. Then, you can select the repo URL from the list of suggestions and clone that into your environment using the Git extension in Studio Classic. 

The following topics show how to attach Git repo URLs to a domain or user profile from the AWS CLI and SageMaker AI console. You'll also learn how to detach these repository URLs.

**Topics**
+ [Attach a Git Repository from the AWS CLI for Amazon SageMaker Studio Classic](studio-git-attach-cli.md)
+ [Attach a Git Repository from the SageMaker AI Console for Amazon SageMaker Studio Classic](studio-git-attach-console.md)
+ [Detach Git Repos from Amazon SageMaker Studio Classic](studio-git-detach.md)

# Attach a Git Repository from the AWS CLI for Amazon SageMaker Studio Classic
<a name="studio-git-attach-cli"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

The following topic shows how to attach a Git repository URL using the AWS CLI, so that Amazon SageMaker Studio Classic automatically suggests it for cloning. After you attach the Git repository URL, you can clone it by following the steps in [Clone a Git Repository in Amazon SageMaker Studio Classic](studio-tasks-git.md).

## Prerequisites
<a name="studio-git-attach-cli-prerequisites"></a>

Before you begin, complete the following prerequisites: 
+ Update the AWS CLI by following the steps in [Installing the current AWS CLI Version](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv1.html#install-tool-bundled).
+ From your local machine, run `aws configure` and provide your AWS credentials. For information about AWS credentials, see [Understanding and getting your AWS credentials](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html). 
+ Onboard to an Amazon SageMaker AI domain. For more information, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md).

## Attach the Git repo to a domain or user profile
<a name="studio-git-attach-cli-attach"></a>

Git repo URLs associated at the domain level are inherited by all users. However, Git repo URLs that are associated at the user profile level are scoped to a specific user. You can attach multiple Git repo URLs to a domain or user profile by passing a list of repository URLs.

The following sections show how to attach a Git repo URL to your domain and user profile.

### Attach to a domain
<a name="studio-git-attach-cli-attach-domain"></a>

The following command attaches a Git repo URL to an existing domain.

```
aws sagemaker update-domain --region region --domain-id domain-id \
    --default-user-settings JupyterServerAppSettings={CodeRepositories=[{RepositoryUrl="repository"}]}
```
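If you script this with the AWS SDK for Python (Boto3) instead of the AWS CLI, the same settings can be passed to the `update_domain` method as a nested dictionary. The following is a minimal sketch; the `code_repo_settings` and `attach_repos_to_domain` helper names are illustrative, and it assumes that Boto3 is installed and your AWS credentials are configured.

```python
def code_repo_settings(repo_urls):
    """Build the DefaultUserSettings payload that mirrors the CLI shorthand above."""
    return {
        "JupyterServerAppSettings": {
            "CodeRepositories": [{"RepositoryUrl": url} for url in repo_urls]
        }
    }

def attach_repos_to_domain(domain_id, repo_urls, region_name=None):
    # Assumes boto3 is installed and AWS credentials are configured.
    import boto3

    sagemaker = boto3.client("sagemaker", region_name=region_name)
    sagemaker.update_domain(
        DomainId=domain_id,
        DefaultUserSettings=code_repo_settings(repo_urls),
    )
```

Passing a list of URLs to `code_repo_settings` is how you attach multiple repositories in a single call.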

### Attach to a user profile
<a name="studio-git-attach-cli-attach-userprofile"></a>

The following command attaches a Git repo URL to an existing user profile.

```
aws sagemaker update-user-profile --domain-id domain-id --user-profile-name user-name \
    --user-settings JupyterServerAppSettings={CodeRepositories=[{RepositoryUrl="repository"}]}
```

# Attach a Git Repository from the SageMaker AI Console for Amazon SageMaker Studio Classic
<a name="studio-git-attach-console"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

The following topic shows how to associate a Git repository URL from the Amazon SageMaker AI console to clone it in your Studio Classic environment. After you associate the Git repository URL, you can clone it by following the steps in [Clone a Git Repository in Amazon SageMaker Studio Classic](studio-tasks-git.md).

## Prerequisites
<a name="studio-git-attach-console-prerequisites"></a>

Before you begin this tutorial, you must onboard to an Amazon SageMaker AI domain. For more information, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md).

## Attach the Git repo to a domain or user profile
<a name="studio-git-attach-console-attach"></a>

Git repo URLs associated at the domain level are inherited by all users. However, Git repo URLs that are associated at the user profile level are scoped to a specific user.

The following sections show how to attach a Git repo URL to a domain and user profile.

### Attach to a domain
<a name="studio-git-attach-console-attach-domain"></a>

**To attach a Git repo URL to an existing domain**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. Select the domain to attach the Git repo to.

1. On the **Domain details** page, choose the **Environment** tab.

1. On the **Suggested code repositories for the domain** tab, choose **Attach**.

1. Under **Source**, enter the Git repository URL.

1. Choose **Attach to domain**.

### Attach to a user profile
<a name="studio-git-attach-console-attach-userprofile"></a>

The following shows how to attach a Git repository URL to an existing user profile.

**To attach a Git repository URL to a user profile**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. Select the domain that includes the user profile to attach the Git repo to.

1. On the **Domain details** page, choose the **User profiles** tab.

1. Select the user profile to attach the Git repo URL to.

1. On the **User details** page, choose **Edit**.

1. On the **Studio settings** page, choose **Attach** from the **Suggested code repositories for the user** section.

1. Under **Source**, enter the Git repository URL.

1. Choose **Attach to user**.

# Detach Git Repos from Amazon SageMaker Studio Classic
<a name="studio-git-detach"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

This guide shows how to detach Git repository URLs from an Amazon SageMaker AI domain or user profile using the AWS CLI or Amazon SageMaker AI console.

**Topics**
+ [

## Detach a Git repo using the AWS CLI
](#studio-git-detach-cli)
+ [

## Detach the Git repo using the SageMaker AI console
](#studio-git-detach-console)

## Detach a Git repo using the AWS CLI
<a name="studio-git-detach-cli"></a>

To detach all Git repo URLs from a domain or user profile, you must pass an empty list of code repositories. This list is passed as part of the `JupyterServerAppSettings` parameter in an `update-domain` or `update-user-profile` command. To detach only one Git repo URL, pass the code repositories list without the URL that you want to detach. This section shows how to detach all Git repo URLs from your domain or user profile using the AWS Command Line Interface (AWS CLI).

### Detach from a domain
<a name="studio-git-detach-cli-domain"></a>

The following command detaches all Git repo URLs from a domain.

```
aws sagemaker update-domain --region region --domain-id domain-id \
    --default-user-settings JupyterServerAppSettings={CodeRepositories=[]}
```

### Detach from a user profile
<a name="studio-git-detach-cli-userprofile"></a>

The following command detaches all Git repo URLs from a user profile.

```
aws sagemaker update-user-profile --domain-id domain-id --user-profile-name user-name \
    --user-settings JupyterServerAppSettings={CodeRepositories=[]}
```
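To detach only one repo URL programmatically, read the current list, remove the matching entry, and write the remainder back. The following Boto3 sketch does this for a user profile; the `without_repo` and `detach_repo_from_user_profile` helper names are illustrative, and it assumes that Boto3 is installed and your AWS credentials are configured.

```python
def without_repo(code_repositories, repo_url):
    """Return the CodeRepositories list minus the entry that matches repo_url."""
    return [r for r in code_repositories if r.get("RepositoryUrl") != repo_url]

def detach_repo_from_user_profile(domain_id, user_profile_name, repo_url):
    # Assumes boto3 is installed and AWS credentials are configured.
    import boto3

    sagemaker = boto3.client("sagemaker")
    profile = sagemaker.describe_user_profile(
        DomainId=domain_id, UserProfileName=user_profile_name
    )
    current = (
        profile.get("UserSettings", {})
        .get("JupyterServerAppSettings", {})
        .get("CodeRepositories", [])
    )
    sagemaker.update_user_profile(
        DomainId=domain_id,
        UserProfileName=user_profile_name,
        UserSettings={
            "JupyterServerAppSettings": {
                "CodeRepositories": without_repo(current, repo_url)
            }
        },
    )
```

Passing an empty filtered list has the same effect as the detach-all commands above.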

## Detach the Git repo using the SageMaker AI console
<a name="studio-git-detach-console"></a>

The following sections show how to detach a Git repo URL from a domain or user profile using the SageMaker AI console.

### Detach from a domain
<a name="studio-git-detach-console-domain"></a>

Use the following steps to detach a Git repo URL from an existing domain.

**To detach a Git repo URL from an existing domain**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. Select the domain with the Git repo URL that you want to detach.

1. On the **Domain details** page, choose the **Environment** tab.

1. On the **Suggested code repositories for the domain** tab, select the Git repository URL to detach.

1. Choose **Detach**.

1. From the new window, choose **Detach**.

### Detach from a user profile
<a name="studio-git-detach-console-userprofile"></a>

Use the following steps to detach a Git repo URL from a user profile.

**To detach a Git repo URL from a user profile**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. Select the domain that includes the user profile with the Git repo URL that you want to detach.

1. On the **Domain details** page, choose the **User profiles** tab.

1. Select the user profile with the Git repo URL that you want to detach.

1. On the **User details** page, choose **Edit**.

1. On the **Studio settings** page, in the **Suggested code repositories for the user** section, select the Git repo URL to detach.

1. Choose **Detach**.

1. From the new window, choose **Detach**.

# Perform Common Tasks in Amazon SageMaker Studio Classic
<a name="studio-tasks"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

The following sections describe how to perform common tasks in Amazon SageMaker Studio Classic. For an overview of the Studio Classic interface, see [Amazon SageMaker Studio Classic UI Overview](studio-ui.md).

**Topics**
+ [

# Upload Files to Amazon SageMaker Studio Classic
](studio-tasks-files.md)
+ [

# Clone a Git Repository in Amazon SageMaker Studio Classic
](studio-tasks-git.md)
+ [

# Stop a Training Job in Amazon SageMaker Studio Classic
](studio-tasks-stop-training-job.md)
+ [

# Use TensorBoard in Amazon SageMaker Studio Classic
](studio-tensorboard.md)
+ [

# Use Amazon Q Developer with Amazon SageMaker Studio Classic
](sm-q.md)
+ [

# Manage Your Amazon EFS Storage Volume in Amazon SageMaker Studio Classic
](studio-tasks-manage-storage.md)
+ [

# Provide Feedback on Amazon SageMaker Studio Classic
](studio-tasks-provide-feedback.md)
+ [

# Shut Down and Update Amazon SageMaker Studio Classic and Apps
](studio-tasks-update.md)

# Upload Files to Amazon SageMaker Studio Classic
<a name="studio-tasks-files"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

When you onboard to Amazon SageMaker Studio Classic, a home directory is created for you in the Amazon Elastic File System (Amazon EFS) volume that was created for your team. Studio Classic can only open files that have been uploaded to your directory. The Studio Classic file browser maps to your home directory.

**Note**  
Studio Classic does not support uploading folders. While you can only upload individual files, you can upload multiple files at the same time.

**To upload files to your home directory**

1. In the left sidebar, choose the **File Browser** icon ( ![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/folder.png)).

1. In the file browser, choose the **Upload Files** icon (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/File_upload_squid.png)).

1. Select the files you want to upload and then choose **Open**.

1. Double-click a file to open the file in a new tab in Studio Classic.

# Clone a Git Repository in Amazon SageMaker Studio Classic
<a name="studio-tasks-git"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

Amazon SageMaker Studio Classic can connect only to a local Git repository (repo). This means that you must clone the Git repo from within Studio Classic to access the files in the repo. Studio Classic offers a Git extension that lets you enter the URL of a Git repo, clone it into your environment, push changes, and view the commit history. If the repo is private and requires credentials to access, you are prompted to enter your user credentials: your username and personal access token. For more information about personal access tokens, see [Managing your personal access tokens](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens).

Admins can also attach suggested Git repository URLs at the Amazon SageMaker AI domain or user profile level. Users can then select the repo URL from the list of suggestions and clone that into Studio Classic. For more information about attaching suggested repos, see [Attach Suggested Git Repos to Amazon SageMaker Studio Classic](studio-git-attach.md).

The following procedure shows how to clone a GitHub repo from Studio Classic. 

**To clone the repo**

1. In the left sidebar, choose the **Git** icon ( ![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/git.png)).

1. Choose **Clone a Repository**. This opens a new window.

1. In the **Clone Git Repository** window, enter the URL for the Git repo that you want to clone in the following format, or select a repository from the list of **Suggested repositories**.

   ```
   https://github.com/path-to-git-repo/repo.git
   ```

1. If you entered the URL of the Git repo manually, select **Clone "*git-url*"** from the dropdown menu.

1. Under **Project directory to clone into**, enter the path to the local directory that you want to clone the Git repo into. If this value is left empty, Studio Classic clones the repo into JupyterLab's root directory.

1. Choose **Clone**. This opens a new terminal window.

1. If the repo requires credentials, you are prompted to enter your username and personal access token. This prompt does not accept passwords; you must use a personal access token. For more information about personal access tokens, see [Managing your personal access tokens](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens).

1. Wait for the download to finish. After the repo has been cloned, the **File Browser** opens to display the cloned repo.

1. Double-click the repo to open it.

1. Choose the **Git** icon to view the Git user interface, which now tracks the repo.

1. To track a different repo, open the repo in the file browser and then choose the **Git** icon.

# Stop a Training Job in Amazon SageMaker Studio Classic
<a name="studio-tasks-stop-training-job"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

You can stop a training job with the Amazon SageMaker Studio Classic UI. When you stop a training job, its status changes to `Stopping`, at which time billing ceases. An algorithm can delay termination in order to save model artifacts, after which the job status changes to `Stopped`. For more information, see the [stop_training_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.stop_training_job) method in the AWS SDK for Python (Boto3).
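The same operation can be scripted with Boto3. The sketch below stops a job and waits for it to finish stopping; the `stop_and_wait` and `billing_has_ceased` helper names are illustrative, and it assumes that Boto3 is installed and your AWS credentials are configured.

```python
TERMINAL_BILLING_STATUSES = ("Stopping", "Stopped")

def billing_has_ceased(status):
    """Billing stops as soon as the job enters Stopping, not only at Stopped."""
    return status in TERMINAL_BILLING_STATUSES

def stop_and_wait(training_job_name):
    # Assumes boto3 is installed and AWS credentials are configured.
    import boto3

    sagemaker = boto3.client("sagemaker")
    sagemaker.stop_training_job(TrainingJobName=training_job_name)
    # The algorithm may delay termination to save model artifacts,
    # so wait until the job reaches a terminal state.
    waiter = sagemaker.get_waiter("training_job_completed_or_stopped")
    waiter.wait(TrainingJobName=training_job_name)
    return sagemaker.describe_training_job(
        TrainingJobName=training_job_name
    )["TrainingJobStatus"]
```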

**To stop a training job**

1. Follow the [View experiments and runs](experiments-view-compare.md) procedure on this page until you open the **Describe Trial Component** tab.

1. At the upper-right side of the tab, choose **Stop training job**. The **Status** at the top left of the tab changes to **Stopped**.

1. To view the training time and billing time, choose **AWS Settings**.

# Use TensorBoard in Amazon SageMaker Studio Classic
<a name="studio-tensorboard"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

The following topic outlines how to install and run TensorBoard in Amazon SageMaker Studio Classic.

**Note**  
This guide shows how to open the TensorBoard application through a SageMaker Studio Classic notebook server of an individual SageMaker AI domain user profile. For a more comprehensive TensorBoard experience integrated with SageMaker Training and the access control functionalities of SageMaker AI domain, see [TensorBoard in Amazon SageMaker AI](tensorboard-on-sagemaker.md).

## Prerequisites
<a name="studio-tensorboard-prereq"></a>

This tutorial requires a SageMaker AI domain. For more information, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md).

## Set Up `TensorBoardCallback`
<a name="studio-tensorboard-setup"></a>

1. Launch Studio Classic, and open the Launcher. For more information, see [Use the Amazon SageMaker Studio Classic Launcher](studio-launcher.md).

1. In the Amazon SageMaker Studio Classic Launcher, under `Notebooks and compute resources`, choose the **Change environment** button.

1. In the **Change environment** dialog, use the dropdown menus to select the `TensorFlow 2.6 Python 3.8 CPU Optimized` Studio Classic **Image**.

1. Back in the Launcher, choose the **Create notebook** tile. Your notebook launches and opens in a new Studio Classic tab.

1. Run the following code in your notebook cells.

1. Import the required packages. 

   ```
   import os
   import datetime
   import tensorflow as tf
   ```

1. Create a Keras model.

   ```
   mnist = tf.keras.datasets.mnist
   
   (x_train, y_train),(x_test, y_test) = mnist.load_data()
   x_train, x_test = x_train / 255.0, x_test / 255.0
   
   def create_model():
     return tf.keras.models.Sequential([
       tf.keras.layers.Flatten(input_shape=(28, 28)),
       tf.keras.layers.Dense(512, activation='relu'),
       tf.keras.layers.Dropout(0.2),
       tf.keras.layers.Dense(10, activation='softmax')
     ])
   ```

1. Create a directory for your TensorBoard logs.

   ```
   LOG_DIR = os.path.join(os.getcwd(), "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
   ```

1. Run training with TensorBoard.

   ```
   model = create_model()
   model.compile(optimizer='adam',
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])

   tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=LOG_DIR, histogram_freq=1)
   
   model.fit(x=x_train,
             y=y_train,
             epochs=5,
             validation_data=(x_test, y_test),
             callbacks=[tensorboard_callback])
   ```

1. Generate the EFS path for the TensorBoard logs. You use this path to set up your logs from the terminal.

   ```
   EFS_PATH_LOG_DIR = "/".join(LOG_DIR.strip("/").split('/')[1:-1])
   print(EFS_PATH_LOG_DIR)
   ```

   Retrieve the `EFS_PATH_LOG_DIR`. You will need it in the TensorBoard installation section.

## Install TensorBoard
<a name="studio-tensorboard-install"></a>

1. Choose the **Amazon SageMaker Studio Classic** button in the upper-left corner of Studio Classic to open the Amazon SageMaker Studio Classic Launcher. The Launcher must be opened from your root directory. For more information, see [Use the Amazon SageMaker Studio Classic Launcher](studio-launcher.md).

1. In the Launcher, under **Utilities and files**, choose **System terminal**.

1. From the terminal, run the following commands. Copy `EFS_PATH_LOG_DIR` from the Jupyter notebook. You must run this from the `/home/sagemaker-user` root directory.

   ```
   pip install tensorboard
   tensorboard --logdir <EFS_PATH_LOG_DIR>
   ```

## Launch TensorBoard
<a name="studio-tensorboard-launch"></a>

1. To launch TensorBoard, copy your Studio Classic URL and replace `lab?` with `proxy/6006/` as follows. You must include the trailing `/` character.

   ```
   https://<YOUR_URL>.studio.region.sagemaker.aws/jupyter/default/proxy/6006/
   ```

1. Navigate to the URL to examine your results. 

# Use Amazon Q Developer with Amazon SageMaker Studio Classic
<a name="sm-q"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

Amazon SageMaker Studio Classic is an integrated machine learning environment where you can build, train, deploy, and analyze your models all in the same application. You can generate code recommendations and suggest improvements related to code issues by using Amazon Q Developer with Amazon SageMaker AI.

Amazon Q Developer is a generative artificial intelligence (AI) powered conversational assistant that can help you understand, build, extend, and operate AWS applications. In the context of an integrated AWS coding environment, Amazon Q can generate code recommendations based on developers' code, as well as their comments in natural language. 

Amazon Q has the most support for Java, Python, JavaScript, TypeScript, C#, Go, PHP, Rust, Kotlin, and SQL, as well as the Infrastructure as Code (IaC) languages JSON (CloudFormation), YAML (CloudFormation), HCL (Terraform), and CDK (TypeScript, Python). It also supports code generation for Ruby, C++, C, Shell, and Scala. For examples of how Amazon Q integrates with Amazon SageMaker AI and displays code suggestions in the Amazon SageMaker Studio Classic IDE, see [Code Examples](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/inline-suggestions-code-examples.html) in the *Amazon Q Developer User Guide*.

For more information on using Amazon Q with Amazon SageMaker Studio Classic, see the [Amazon Q Developer User Guide](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/sagemaker-setup.html).

# Manage Your Amazon EFS Storage Volume in Amazon SageMaker Studio Classic
<a name="studio-tasks-manage-storage"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

The first time a user on your team onboards to Amazon SageMaker Studio Classic, Amazon SageMaker AI creates an Amazon Elastic File System (Amazon EFS) volume for the team. A home directory is created in the volume for each user who onboards to Studio Classic as part of your team. Notebook files and data files are stored in these directories. Users don't have access to other team members' home directories. Amazon SageMaker AI domain does not support mounting custom or additional Amazon EFS volumes.

**Important**  
Don't delete the Amazon EFS volume. If you delete it, the domain will no longer function and all of your users will lose their work.

**To find your Amazon EFS volume**

1. Open the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. From the **Domains** page, select the domain to find the ID for.

1. From the **Domain details** page, select the **Domain settings** tab.

1. Under **General settings**, find the **Domain ID**. The ID will be in the following format: `d-xxxxxxxxxxxx`.

1. Pass the `Domain ID`, as `DomainId`, to the [describe_domain](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.describe_domain) method.

1. In the response from `describe_domain`, note the value for the `HomeEfsFileSystemId` key. This is the Amazon EFS file system ID.

1. Open the [Amazon EFS console](https://console.aws.amazon.com/efs#/file-systems/). Make sure the AWS Region is the same Region that's used by Studio Classic.

1. Under **File systems**, choose the file system ID from the previous step.

1. To verify that you've chosen the correct file system, select the **Tags** heading. The value corresponding to the `ManagedByAmazonSageMakerResource` key should match the `Studio Classic ID`.

For information on how to access the Amazon EFS volume, see [Using file systems in Amazon EFS](https://docs.aws.amazon.com/efs/latest/ug/using-fs.html).

To delete the Amazon EFS volume, see [Deleting an Amazon EFS file system](https://docs.aws.amazon.com/efs/latest/ug/delete-efs-fs.html).
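The `describe_domain` lookup in the steps above can also be done in a few lines of Boto3. The following sketch is illustrative (the helper names are not part of SageMaker AI) and assumes that Boto3 is installed, your AWS credentials are configured, and that domain IDs follow the `d-xxxxxxxxxxxx` format shown in the console.

```python
import re

def looks_like_domain_id(domain_id):
    """Check the d-xxxxxxxxxxxx format shown under General settings (assumed format)."""
    return re.fullmatch(r"d-[a-z0-9]{12}", domain_id) is not None

def efs_file_system_id(domain_id):
    # Assumes boto3 is installed and AWS credentials are configured.
    import boto3

    sagemaker = boto3.client("sagemaker")
    return sagemaker.describe_domain(DomainId=domain_id)["HomeEfsFileSystemId"]
```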

# Provide Feedback on Amazon SageMaker Studio Classic
<a name="studio-tasks-provide-feedback"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

Amazon SageMaker AI takes your feedback seriously. We encourage you to provide feedback.

**To provide feedback**

1. At the right of SageMaker Studio Classic, find the **Feedback** icon (![\[Speech bubble icon representing messaging or communication functionality.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/feedback.png)).

1. Choose a smiley emoji to let us know how satisfied you are with SageMaker Studio Classic and add any feedback you'd care to share with us.

1. Decide whether to share your identity with us, then choose **Submit**.

# Shut Down and Update Amazon SageMaker Studio Classic and Apps
<a name="studio-tasks-update"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

The following topics show how to shut down and update SageMaker Studio Classic and Studio Classic Apps.

Studio Classic provides a notification icon (![\[Red circle icon with white exclamation mark, indicating an alert or warning.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/Notification.png)) in the upper-right corner of the Studio Classic UI. This notification icon displays the number of unread notices. To read the notices, select the icon.

Studio Classic provides two types of notifications:
+ Upgrade – Displayed when Studio Classic or one of the Studio Classic apps have released a new version. To update Studio Classic, see [Shut Down and Update Amazon SageMaker Studio Classic](studio-tasks-update-studio.md). To update Studio Classic apps, see [Shut Down and Update Amazon SageMaker Studio Classic Apps](studio-tasks-update-apps.md).
+ Information – Displayed for new features and other information.

To reset the notification icon, select the link in each notice. Notifications that you have already read might still be counted by the icon; this does not mean that updates are still needed after you have updated Studio Classic and Studio Classic apps.

To learn how to update [Amazon SageMaker Data Wrangler](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler.html), see [Shut Down and Update Amazon SageMaker Studio Classic Apps](studio-tasks-update-apps.md).

To ensure that you have the most recent software updates, update Amazon SageMaker Studio Classic and your Studio Classic apps using the methods outlined in the following topics.

**Topics**
+ [

# Shut Down and Update Amazon SageMaker Studio Classic
](studio-tasks-update-studio.md)
+ [

# Shut Down and Update Amazon SageMaker Studio Classic Apps
](studio-tasks-update-apps.md)

# Shut Down and Update Amazon SageMaker Studio Classic
<a name="studio-tasks-update-studio"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

To update Amazon SageMaker Studio Classic to the latest release, you must shut down the JupyterServer app. You can shut down the JupyterServer app from the SageMaker AI console, from Amazon SageMaker Studio, or from within Studio Classic. After the JupyterServer app is shut down, reopen Studio Classic through the SageMaker AI console or from Studio, which creates a new version of the JupyterServer app.

You cannot delete the JupyterServer application while the Studio Classic UI is still open in the browser. If you do, SageMaker AI automatically re-creates the application.

Any unsaved notebook information is lost in the process. The user data in the Amazon EFS volume isn't impacted.

Some of the services within Studio Classic, like Data Wrangler, run on their own app. To update these services you must delete the app for that service. To learn more, see [Shut Down and Update Amazon SageMaker Studio Classic Apps](studio-tasks-update-apps.md).

**Note**  
A JupyterServer app is associated with a single Studio Classic user. When you update the app for one user, it doesn't affect other users.

The following page shows how to update the JupyterServer App from the SageMaker AI console, from Studio, or from inside Studio Classic.

## Shut down and update from the SageMaker AI console
<a name="studio-tasks-update-studio-console"></a>

1. Navigate to [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. Select the domain that includes the Studio Classic application that you want to update.

1. Under **User profiles**, select your user name.

1. Under **Apps**, in the row displaying **JupyterServer**, choose **Action**, then choose **Delete**.

1. Choose **Yes, delete app**.

1. Type **delete** in the confirmation box.

1. Choose **Delete**.

1. After the app has been deleted, launch a new Studio Classic app to get the latest version.
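
The console steps above can also be scripted with the SageMaker API. The following is a hedged sketch using boto3; `active_jupyter_server_apps` and `delete_jupyter_server` are hypothetical helper names, and running the delete requires AWS credentials with `sagemaker:ListApps` and `sagemaker:DeleteApp` permissions.

```python
def active_jupyter_server_apps(apps):
    """Keep only JupyterServer apps that are not already deleted."""
    return [a for a in apps
            if a["AppType"] == "JupyterServer" and a["Status"] != "Deleted"]

def delete_jupyter_server(domain_id, user_profile_name):
    """Delete the JupyterServer app so the next launch creates a new version."""
    import boto3  # imported lazily so the pure helper above works offline
    sm = boto3.client("sagemaker")
    apps = sm.list_apps(DomainIdEquals=domain_id,
                        UserProfileNameEquals=user_profile_name)["Apps"]
    for app in active_jupyter_server_apps(apps):
        sm.delete_app(DomainId=domain_id,
                      UserProfileName=user_profile_name,
                      AppType="JupyterServer",
                      AppName=app["AppName"])
```

As in the console flow, launching Studio Classic again after the deletion creates a new JupyterServer app on the latest version.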

## Shut down and update from Studio
<a name="studio-tasks-update-studio-updated"></a>

1. Navigate to Studio following the steps in [Launch Amazon SageMaker Studio](studio-updated-launch.md).

1. From the Studio UI, find the applications pane on the left side.

1. From the applications pane, select **Studio Classic**.

1. From the Studio Classic landing page, select the Studio Classic instance to stop.

1. Choose **Stop**.

1. After the app has been stopped, select **Run** to use the latest version.

## Shut down and update from inside Studio Classic
<a name="studio-tasks-update-studio-classic"></a>

1. Launch Studio Classic.

1. On the top menu, choose **File** then **Shut Down**.

1. Choose one of the following options:
   + **Shutdown Server** – Shuts down the JupyterServer app. Terminal sessions, kernel sessions, SageMaker images, and instances aren't shut down. These resources continue to accrue charges.
   + **Shutdown All** – Shuts down all apps, terminal sessions, kernel sessions, SageMaker images, and instances. These resources no longer accrue charges.

1. Close the window.

1. After the app has been deleted, launch a new Studio Classic app to use the latest version.

# Shut Down and Update Amazon SageMaker Studio Classic Apps
<a name="studio-tasks-update-apps"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

To update an Amazon SageMaker Studio Classic app to the latest release, you must first shut down the corresponding KernelGateway app from the SageMaker AI console. After the KernelGateway app is shut down, you must reopen it through SageMaker Studio Classic by running a new kernel. The kernel automatically updates. Any unsaved notebook information is lost in the process. The user data in the Amazon EFS volume isn't impacted.

After an application has been shut down for 24 hours, SageMaker AI deletes all metadata for the application. For a restart to be considered an update and retain the application metadata, you must restart the application within 24 hours of shutting down the previous one. After this window, a newly created application is treated as a new application rather than an update of the previous one.
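
The 24-hour retention window comes down to simple timestamp arithmetic. The following is a minimal sketch; `restart_retains_metadata` is a hypothetical helper name, not a SageMaker API.

```python
from datetime import datetime, timedelta, timezone

# Metadata is kept only if the app restarts within 24 hours of shutdown.
METADATA_RETENTION = timedelta(hours=24)

def restart_retains_metadata(shutdown_at, restart_at):
    """True if the restart is treated as an update (metadata retained)."""
    return restart_at - shutdown_at <= METADATA_RETENTION
```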

**Note**  
A KernelGateway app is associated with a single Studio Classic user. When you update the app for one user, it doesn't affect other users.

**To update the KernelGateway app**

1. Navigate to [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. Select the domain that includes the application that you want to update.

1. Under **User profiles**, select your user name.

1. Under **Apps**, in the row displaying the **App name**, choose **Action**, then choose **Delete**.

   To update Data Wrangler, delete the app that starts with **sagemaker-data-wrang**.

1. Choose **Yes, delete app**.

1. Type **delete** in the confirmation box.

1. Choose **Delete**.

1. After the app has been deleted, launch a new kernel from within Studio Classic to use the latest version.
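
The Data Wrangler name check from the procedure above can be scripted. The following is a hedged boto3 sketch; `data_wrangler_apps` and `delete_kernel_gateway` are hypothetical helper names, and the delete call requires AWS credentials.

```python
def data_wrangler_apps(apps):
    """KernelGateway apps whose name starts with the Data Wrangler prefix."""
    return [a for a in apps
            if a["AppType"] == "KernelGateway"
            and a["AppName"].startswith("sagemaker-data-wrang")]

def delete_kernel_gateway(domain_id, user_profile_name, app_name):
    """Shut down one KernelGateway app; relaunching a kernel updates it."""
    import boto3  # imported lazily so the pure helper above works offline
    boto3.client("sagemaker").delete_app(
        DomainId=domain_id,
        UserProfileName=user_profile_name,
        AppType="KernelGateway",
        AppName=app_name)
```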

# Amazon SageMaker Studio Classic Pricing
<a name="studio-pricing"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

When the first member of your team onboards to Amazon SageMaker Studio Classic, Amazon SageMaker AI creates an Amazon Elastic File System (Amazon EFS) volume for the team. When this member, or any member of the team, opens Studio Classic, a home directory is created in the volume for the member. A storage charge is incurred for this directory. Subsequently, additional storage charges are incurred for the notebooks and data files stored in the member's home directory. For pricing information on Amazon EFS, see [Amazon EFS Pricing](https://aws.amazon.com/efs/pricing/).

Additional costs are incurred when other operations are run inside Studio Classic, for example, running a notebook, running training jobs, and hosting a model.

For information on the costs associated with using Studio Classic notebooks, see [Usage Metering for Amazon SageMaker Studio Classic Notebooks](notebooks-usage-metering.md).

For information about billing along with pricing examples, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

If Amazon SageMaker Studio is your default experience, see [Amazon SageMaker Studio pricing](studio-updated-cost.md) for more pricing information.

# Troubleshooting Amazon SageMaker Studio Classic
<a name="studio-troubleshooting"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

This topic describes how to troubleshoot common Amazon SageMaker Studio Classic issues during setup and use. The following are common errors that might occur while using Amazon SageMaker Studio Classic. Each error is followed by its solution.

## Studio Classic application issues
<a name="studio-troubleshooting-ui"></a>

 The following issues occur when launching and using the Studio Classic application.
+ **Screen not loading: Clearing workspace and waiting doesn't help**

  When launching the Studio Classic application, a pop-up displays the following message. No matter which option is selected, Studio Classic does not load. 

  ```
  Loading...
  The loading screen is taking a long time. Would you like to clear the workspace or keep waiting?
  ```

  The Studio Classic application can have a launch delay if multiple tabs are open in the Studio Classic workspace or several files are on Amazon EFS. This pop-up should disappear in a few seconds after the Studio Classic workspace is ready. 

  If you continue to see a loading screen with a spinner after selecting either option, there might be connectivity issues with the Amazon Virtual Private Cloud (Amazon VPC) used by Studio Classic.

  To resolve these connectivity issues, verify the following networking configurations:
  + If your domain is set up in `VpcOnly` mode: Verify that there is an Amazon VPC endpoint for AWS STS, or a NAT Gateway for outbound traffic, including traffic over the internet. To do this, follow the steps in [Connect Studio notebooks in a VPC to external resources](studio-notebooks-and-internet-access.md). 
  + If your Amazon VPC is set up with a custom DNS instead of the DNS provided by Amazon: Verify that the routes are configured using Dynamic Host Configuration Protocol (DHCP) for each Amazon VPC endpoint added to the Amazon VPC used by Studio Classic. For more information about setting default and custom DHCP option sets, see [DHCP option sets in Amazon VPC](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_DHCP_Options.html). 
+ **Internal Failure when launching Studio Classic**

  When launching Studio Classic, you are unable to view the Studio Classic UI. You also see an error similar to the following, with **Internal Failure** as the error detail. 

  ```
  Amazon SageMaker Studio
  The JupyterServer app default encountered a problem and was stopped.
  ```

  This error can be caused by multiple factors. If completing these steps does not resolve your issue, contact AWS Support at https://aws.amazon.com/premiumsupport/.
  + **Missing Amazon EFS mount target**: Studio Classic uses Amazon EFS for storage. The Amazon EFS volume needs a mount target for each subnet that the Amazon SageMaker AI domain is created in. If this Amazon EFS mount target is deleted accidentally, the Studio Classic application cannot load because it cannot mount the user’s file directory. To resolve this issue, complete the following steps. 

**To verify or create mount targets**

    1. Find the Amazon EFS volume that is associated with the domain by using the [DescribeDomain](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeDomain.html) API call.  

    1. Sign in to the AWS Management Console and open the Amazon EFS console at [ https://console.aws.amazon.com/efs/](https://console.aws.amazon.com/efs/).

    1. From the list of Amazon EFS volumes, select the Amazon EFS volume that is associated with the domain. 

    1. On the Amazon EFS details page, select the **Network** tab. Verify that there are mount targets for all of the subnets that the domain is set up in. 

    1. If mount targets are missing, add the missing Amazon EFS mount targets. For instructions, see [Creating and managing mount targets and security groups](https://docs.aws.amazon.com/efs/latest/ug/accessing-fs.html). 

    1. After the missing mount targets are created, launch the Studio Classic application. 
  + **Conflicting files in the user’s `.local` folder**: If you're using JupyterLab version 1 on Studio Classic, conflicting libraries in your `.local` folder can cause issues when launching the Studio Classic application. To resolve this, update your user profile's default JupyterLab version to JupyterLab 3.0. For more information about viewing and updating the JupyterLab version, see [JupyterLab Versioning in Amazon SageMaker Studio Classic](studio-jl.md). 
+ **ConfigurationError: LifecycleConfig when launching Studio Classic**

  You can't view the Studio Classic UI when launching Studio Classic. This is caused by issues with the default lifecycle configuration script attached to the domain.

**To resolve lifecycle configuration issues**

  1. View the Amazon CloudWatch Logs for the lifecycle configuration to trace the command that caused the failure. To view the log, follow the steps in [Verify lifecycle configuration process from CloudWatch Logs](studio-lcc-debug.md#studio-lcc-debug-logs). 

  1. Detach the default script from the user profile or domain. For more information, see [Update and Detach Lifecycle Configurations in Amazon SageMaker Studio Classic](studio-lcc-delete.md). 

  1. Launch the Studio Classic application. 

  1. Debug your lifecycle configuration script. You can run the lifecycle configuration script from the system terminal to troubleshoot. When the script runs successfully from the terminal, you can attach the script to the user profile or the domain. 
+ **SageMaker Studio Classic core functionalities are not available.**

  If you get this error message when opening Studio Classic, it may be due to Python package version conflicts. This occurs if you used the following commands in a notebook or terminal to install Python packages that have version conflicts with SageMaker AI package dependencies.

  ```
  !pip install
  pip install --user
  ```

  To resolve this issue, complete the following steps:

  1. Uninstall recently installed Python packages. If you’re not sure which package to uninstall, contact AWS Support at https://aws.amazon.com/premiumsupport/. 

  1. Restart Studio Classic:

     1. Shut down Studio Classic from the **File** menu.

     1. Wait for one minute.

     1. Reopen Studio Classic by refreshing the page or opening it from the AWS Management Console.

  The problem should be resolved if you have uninstalled the package that caused the conflict. To install packages without causing this issue again, use `%pip install` without the `--user` flag.

  If the issue persists, create a new user profile and set up your environment with that user profile.

  If these solutions don't fix the issue, contact AWS Support at https://aws.amazon.com/premiumsupport/. 
+ **Unable to open Studio Classic from the AWS Management Console.**

  If you are unable to open Studio Classic and cannot create a new running instance with all default settings, contact AWS Support at https://aws.amazon.com/premiumsupport/. 
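
The mount-target verification described under **Missing Amazon EFS mount target** can also be scripted. The following is a hedged boto3 sketch; `find_missing_mount_targets` is a hypothetical helper name, and running it requires AWS credentials for the SageMaker and Amazon EFS APIs.

```python
def missing_mount_target_subnets(domain_subnets, mount_target_subnets):
    """Domain subnets that have no Amazon EFS mount target yet."""
    return sorted(set(domain_subnets) - set(mount_target_subnets))

def find_missing_mount_targets(domain_id):
    """List subnets of the domain that are missing an EFS mount target."""
    import boto3  # imported lazily so the pure helper above works offline
    sm = boto3.client("sagemaker")
    efs = boto3.client("efs")
    domain = sm.describe_domain(DomainId=domain_id)
    targets = efs.describe_mount_targets(
        FileSystemId=domain["HomeEfsFileSystemId"])["MountTargets"]
    return missing_mount_target_subnets(
        domain["SubnetIds"], [t["SubnetId"] for t in targets])
```

Any subnet IDs that the helper returns need a mount target created before the Studio Classic application can load.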

## KernelGateway application issues
<a name="studio-troubleshooting-kg"></a>

 The following issues are specific to KernelGateway applications that are launched in Studio Classic. 
+ **Cannot access the Kernel session**

  When the user launches a new notebook, they are unable to connect to the notebook session. If the KernelGateway application's status is `In Service`, you can verify the following to resolve the issue. 
  + **Check Security Group configurations**

    If the domain is set up in `VPCOnly` mode, the security group associated with the domain must allow traffic between the ports in the range `8192-65535` for connectivity between the JupyterServer and KernelGateway apps.

**To verify the security group rules**

    1. Get the security groups associated with the domain using the [DescribeDomain](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeDomain.html) API call.

    1. Sign in to the AWS Management Console and open the Amazon VPC console at [https://console.aws.amazon.com/vpc/](https://console.aws.amazon.com/vpc/).

    1. From the left navigation, under **Security**, choose **Security Groups**.

    1. Filter by the IDs of the security groups that are associated with the domain.

    1. For each security group: 

       1. Select the security group. 

       1. From the security group details page, view the **Inbound rules**. Verify that traffic is allowed between ports in the range `8192-65535`. 

    For more information about security group rules, see [Control traffic to resources using security groups](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_SecurityGroups.html#working-with-security-group-rules). For more information about requirements to use Studio Classic in `VPCOnly` mode, see [Connect Studio notebooks in a VPC to external resources](studio-notebooks-and-internet-access.md).
  + **Verify firewall and WebSocket connections**

    If the KernelGateway apps have an `InService` status and the user is unable to connect to the Studio Classic notebook session, verify the firewall and WebSocket settings. 

    1. Launch the Studio Classic application. For more information, see [Launch Amazon SageMaker Studio Classic](studio-launch.md). 

    1. Open your web browser’s developer tools. 

    1. Choose the **Network** tab. 

    1. Search for an entry that matches the following format.

       ```
       wss://<domain-id>.studio.<region>.sagemaker.aws/jupyter/default/api/kernels/<unique-code>/channels?session_id=<unique-code>
       ```

       If the status or response code for the entry is anything other than `101`, then your network settings are preventing the connection between the Studio Classic application and the KernelGateway apps.

       To resolve this issue, contact the team that manages your networking settings to allowlist the Studio Classic URL and enable WebSocket connections.
+ **Unable to launch an app caused by exceeded resource quotas**

  When a user tries to launch a new notebook, the notebook creation fails with either of the following errors. This is caused by exceeding resource quotas. 
  + 

    ```
    Unable to start more Apps of AppType [KernelGateway] and ResourceSpec(instanceType=[]) for UserProfile []. Please delete an App with a matching AppType and ResourceSpec, then try again
    ```

    Studio Classic supports up to four running KernelGateway apps on the same instance. To resolve this issue, you can do either of the following:
    + Delete an existing KernelGateway application running on the instance, then restart the new notebook.
    + Start the new notebook on a different instance type.

     For more information, see [Change the Instance Type for an Amazon SageMaker Studio Classic Notebook](notebooks-run-and-manage-switch-instance-type.md).
  + 

    ```
    An error occurred (ResourceLimitExceeded) when calling the CreateApp operation
    ```

    In this case, the account does not have sufficient limits to create a Studio Classic application on the specified instance type. To resolve this, navigate to the Service Quotas console at [https://console.aws.amazon.com/servicequotas/](https://console.aws.amazon.com/servicequotas/). In that console, request to increase the `Studio KernelGateway Apps running on instance-type instance` limit. For more information, see [AWS service quotas](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html). 
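
The inbound-rule check described under **Check Security Group configurations** can be sketched against the shape of an EC2 `DescribeSecurityGroups` response. The following is a minimal sketch; `covers_kernel_ports` is a hypothetical helper name.

```python
def covers_kernel_ports(ip_permissions, low=8192, high=65535):
    """True if any single inbound rule spans the full port range that the
    JupyterServer and KernelGateway apps use to talk to each other."""
    for perm in ip_permissions:
        # "-1" means all protocols; such rules omit FromPort/ToPort.
        if perm.get("IpProtocol") not in ("tcp", "-1"):
            continue
        if perm.get("FromPort", 0) <= low and perm.get("ToPort", 65535) >= high:
            return True
    return False
```

Feed the helper the `IpPermissions` list of each security group attached to the domain; a `False` result points to the misconfigured group.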

# SageMaker JupyterLab
<a name="studio-updated-jl"></a>

Create a JupyterLab space within Amazon SageMaker Studio to launch the JupyterLab application. A JupyterLab space is a private or shared space within Studio that manages the storage and compute resources needed to run the JupyterLab application. The JupyterLab application is a web-based interactive development environment (IDE) for notebooks, code, and data. Use the JupyterLab application's flexible and extensive interface to configure and arrange machine learning (ML) workflows.

By default, the JupyterLab application comes with the SageMaker Distribution image. The distribution image has popular packages, such as the following:
+ PyTorch
+ TensorFlow
+ Keras
+ NumPy
+ Pandas
+ Scikit-learn

You can use shared spaces to collaborate on your Jupyter notebooks with other users in real time. For more information about shared spaces, see [Collaboration with shared spaces](domain-space.md).

Within the JupyterLab application, you can use Amazon Q Developer, a generative AI-powered coding companion, to generate, debug, and explain your code. For information about using Amazon Q Developer, see [JupyterLab user guide](studio-updated-jl-user-guide.md). For information about setting up Amazon Q Developer, see [JupyterLab administrator guide](studio-updated-jl-admin-guide.md).

Build unified analytics and ML workflows in the same Jupyter notebook. Run interactive Spark jobs on Amazon EMR and AWS Glue serverless infrastructure directly from your notebook. Monitor and debug jobs faster using the inline Spark UI. In a few steps, you can automate your data preparation by scheduling the notebook as a job.

The JupyterLab application helps you work collaboratively with your peers. Use the built-in Git integration within the JupyterLab IDE to share and version code. Bring your own file storage system if you have an Amazon EFS volume.

The JupyterLab application runs on a single Amazon Elastic Compute Cloud (Amazon EC2) instance and uses a single Amazon Elastic Block Store (Amazon EBS) volume for storage. You can switch to faster instances or increase the Amazon EBS volume size to suit your needs.

The JupyterLab 4 application runs in a JupyterLab space within Studio. Studio Classic uses the JupyterLab 3 application. JupyterLab 4 provides the following benefits:
+ A faster IDE than Amazon SageMaker Studio Classic, especially with large notebooks
+ Improved document search
+ A more performant and accessible text editor

For more information about JupyterLab, see [JupyterLab Documentation](https://jupyterlab.readthedocs.io/en/stable/#).

**Topics**
+ [JupyterLab user guide](studio-updated-jl-user-guide.md)
+ [JupyterLab administrator guide](studio-updated-jl-admin-guide.md)

# JupyterLab user guide
<a name="studio-updated-jl-user-guide"></a>

This guide shows JupyterLab users how to run analytics and machine learning workflows within SageMaker Studio. You can get fast storage and scale your compute up or down, depending on your needs.

JupyterLab supports both private and shared spaces. Private spaces are scoped to a single user in a domain. Shared spaces let other users in your domain collaborate with you in real time. For information about Studio spaces, see [Amazon SageMaker Studio spaces](studio-updated-spaces.md).

To get started using JupyterLab, create a space and launch your JupyterLab application. The space running your JupyterLab application is a JupyterLab space. The JupyterLab space uses a single Amazon EC2 instance for your compute and a single Amazon EBS volume for your storage. Everything in your space, such as your code, Git profile, and environment variables, is stored on the same Amazon EBS volume. The volume has 3,000 IOPS and a throughput of 125 megabytes per second (MBps). You can use this fast storage to open and run multiple Jupyter notebooks on the same instance and switch kernels in a notebook quickly.

Your administrator has configured the default Amazon EBS storage settings for your space. The default storage size is 5 GB, but you can increase it. Ask your administrator for guidelines.

You can switch the Amazon EC2 instance type that you’re using to run JupyterLab, scaling your compute up or down depending on your needs. The **Fast launch** instances start up much faster than the other instances.

Your administrator might provide you with a lifecycle configuration that customizes your environment. You can specify the lifecycle configuration when you create the space.

If your administrator gives you access to an Amazon EFS, you can configure your JupyterLab space to access it.

By default, the JupyterLab application uses the SageMaker distribution image. This includes support for many machine learning, analytics, and deep learning packages. However, if you need a custom image, your administrator can help provide access to the custom images.

The Amazon EBS volume persists independently from the life of an instance. You won’t lose your data when you change instances. Use the conda and pip package management libraries to create reproducible custom environments that persist even when you switch instance types.

After you open JupyterLab, you can configure your environment using the terminal. To open the terminal, navigate to the **Launcher** and choose **Terminal**.

The following are examples of different ways that you can configure an environment in JupyterLab.

**Note**  
Within Studio, you can use lifecycle configurations to customize your environment, but we recommend using a package manager instead. Using lifecycle configurations is a more error-prone method. It’s easier to add or remove dependencies than it is to debug a lifecycle configuration script. It can also increase the JupyterLab startup time.  
For information about lifecycle configurations, see [Lifecycle configurations with JupyterLab](jl-lcc.md).

**Topics**
+ [Create a space](studio-updated-jl-user-guide-create-space.md)
+ [Configure a space](studio-updated-jl-user-guide-configure-space.md)
+ [Customize your environment using a package manager](studio-updated-jl-user-guide-customize-package-manager.md)
+ [Clean up a conda environment](studio-updated-jl-clean-up-conda.md)
+ [Share conda environments between instance types](studio-updated-jl-create-conda-share-environment.md)
+ [Use Amazon Q to Expedite Your Machine Learning Workflows](studio-updated-jl-user-guide-use-amazon-q.md)

# Create a space
<a name="studio-updated-jl-user-guide-create-space"></a>

To get started using JupyterLab, create a space or choose the space that your administrator created for you and open JupyterLab.

Use the following procedure to create a space and open JupyterLab.

**To create a space and open JupyterLab**

1. Open Studio. For information about opening Studio, see [Launch Amazon SageMaker Studio](studio-updated-launch.md).

1. Choose **JupyterLab**.

1. Choose **Create JupyterLab space**.

1. For **Name**, specify the name of the space.

1. (Optional) Select **Share with my domain** to create a shared space.

1. Choose **Create space**.

1. (Optional) For **Instance**, specify the Amazon EC2 instance that runs the space.

1. (Optional) For **Image**, specify an image that your administrator provided to customize your environment.
**Important**  
Custom IAM policies that allow Studio users to create spaces must also grant permissions to list images (`sagemaker:ListImages`) to view custom images. To add the permission, see [Add or remove identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html) in the *AWS Identity and Access Management User Guide*.  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker AI resources already include permissions to list images while creating those resources.

1. (Optional) For **Space Settings**, specify the following:
   + **Storage (GB)** – Up to 100 GB or the amount that your administrator specifies.
   + **Lifecycle Configuration** – A lifecycle configuration that your administrator specifies.
   + **Attach custom EFS filesystem** – An Amazon EFS to which your administrator provides access.

1. Choose **Run space**.

1. Choose **Open JupyterLab**.
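
The console steps above correspond to the `CreateSpace` API. The following is a hedged boto3 sketch; `build_space_settings` and `create_jupyterlab_space` are hypothetical helper names, the instance type and storage size are placeholder defaults, and the create call requires AWS credentials.

```python
def build_space_settings(instance_type="ml.t3.medium", storage_gb=5):
    """Assemble the SpaceSettings payload for a JupyterLab space."""
    return {
        "AppType": "JupyterLab",
        "JupyterLabAppSettings": {
            "DefaultResourceSpec": {"InstanceType": instance_type}},
        "SpaceStorageSettings": {
            "EbsStorageSettings": {"EbsVolumeSizeInGb": storage_gb}},
    }

def create_jupyterlab_space(domain_id, space_name, **kwargs):
    """Create a JupyterLab space in the given domain."""
    import boto3  # imported lazily so the builder above works offline
    return boto3.client("sagemaker").create_space(
        DomainId=domain_id,
        SpaceName=space_name,
        SpaceSettings=build_space_settings(**kwargs))
```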

# Configure a space
<a name="studio-updated-jl-user-guide-configure-space"></a>

After you create a JupyterLab space, you can configure it to do the following:
+ Change the instance type.
+ Change the storage volume.
+ (Admin set up required) Use a custom image.
+ (Admin set up required) Use a lifecycle configuration.
+ (Admin set up required) Attach a custom Amazon EFS.

**Important**  
You must stop the JupyterLab space every time you configure it. Use the following procedure to configure the space.

**To configure a space**

1. Within Studio, navigate to the JupyterLab application page.

1. Choose the name of the space.

1. (Optional) For **Image**, specify an image that your administrator provided to customize your environment.
**Important**  
Custom IAM policies that allow Studio users to create spaces must also grant permissions to list images (`sagemaker:ListImages`) to view custom images. To add the permission, see [Add or remove identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html) in the *AWS Identity and Access Management User Guide*.  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker AI resources already include permissions to list images while creating those resources.

1. (Optional) For **Space Settings**, specify the following:
   + **Storage (GB)** – Up to 100 GB or the amount that your administrator configured for the space.
   + **Lifecycle Configuration** – A lifecycle configuration that your administrator provides.
   + **Attach custom EFS filesystem** – An Amazon EFS file system that your administrator provides access to.

1. Choose **Run space**.

When you open the JupyterLab application, your space has the updated configuration.

# Customize your environment using a package manager
<a name="studio-updated-jl-user-guide-customize-package-manager"></a>

Use pip or conda to customize your environment. We recommend using package managers instead of lifecycle configuration scripts. 

## Create and activate your custom environment
<a name="studio-updated-jl-create-basic-conda"></a>

This section provides examples of different ways that you can configure an environment in JupyterLab.

A basic conda environment has the minimum number of packages that are required for your workflows in SageMaker AI. Use the following template to create a basic conda environment:

```
# initialize conda for shell interaction
conda init

# create a fresh environment
conda create --name test-env

# check if your new environment is created successfully
conda info --envs

# activate the new environment
conda activate test-env

# install packages in your new conda environment
conda install pip boto3 pandas ipykernel

# list all packages installed in your new environment
conda list

# parse env name information from your new environment
export CURRENT_ENV_NAME=$(conda info | grep "active environment" | cut -d : -f 2 | tr -d ' ')

# register your new environment as Jupyter Kernel for execution 
python3 -m ipykernel install --user --name $CURRENT_ENV_NAME --display-name "user-env:($CURRENT_ENV_NAME)"

# to exit your new environment
conda deactivate
```

The following image shows the location of the environment that you've created.

![\[The test-env environment is displayed in the top right corner of the screen.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/juptyer-notebook-environment-location.png)


To change your environment, choose it and select an option from the dropdown menu.

![\[The checkmark and its corresponding text shows an example environment that you previously created.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/jupyter-notebook-select-env.png)


Choose **Select** to select a kernel for the environment.

## Create a conda environment with a specific Python version
<a name="studio-updated-jl-create-conda-version"></a>

Use the following template to create a conda environment with a specific version of Python:

```
# create a conda environment with a specific python version
conda create --name py38-test-env python=3.8.10

# activate and test your new python version
conda activate py38-test-env && python3 --version

# install ipykernel to facilitate environment registration
conda install ipykernel

# parse env name information from your new environment
export CURRENT_ENV_NAME=$(conda info | grep "active environment" | cut -d : -f 2 | tr -d ' ')

# register your new environment as Jupyter Kernel for execution 
python3 -m ipykernel install --user --name $CURRENT_ENV_NAME --display-name "user-env:($CURRENT_ENV_NAME)"

# deactivate your py38 test environment
conda deactivate
```

## Create a conda environment with a specific set of packages
<a name="studio-updated-jl-create-conda-specific-packages"></a>

Use the following template to create a conda environment with a specific version of Python and set of packages:

```
# prefill your conda environment with a set of packages
conda create --name py38-test-env python=3.8.10 pandas matplotlib=3.7 scipy ipykernel

# activate your conda environment and ensure these packages exist
conda activate py38-test-env

# check if these packages exist
conda list | grep -E 'pandas|matplotlib|scipy'

# parse env name information from your new environment
export CURRENT_ENV_NAME=$(conda info | grep "active environment" | cut -d : -f 2 | tr -d ' ')

# register your new environment as Jupyter Kernel for execution 
python3 -m ipykernel install --user --name $CURRENT_ENV_NAME --display-name "user-env:($CURRENT_ENV_NAME)"

# deactivate your conda environment
conda deactivate
```

## Clone conda from an existing environment
<a name="studio-updated-jl-create-conda-clone"></a>

Clone your conda environment to preserve its working state. You can experiment in the cloned environment without having to worry about introducing breaking changes into your original environment.

Use the following command to clone an environment.

```
# create a fresh env from a base environment 
conda create --name py310-base-ext --clone base # replace 'base' with another env

# activate your cloned environment
conda activate py310-base-ext

# install ipykernel to register your env
conda install ipykernel

# parse env name information from your new environment
export CURRENT_ENV_NAME=$(conda info | grep "active environment" | cut -d : -f 2 | tr -d ' ')

# register your new environment as Jupyter Kernel for execution 
python3 -m ipykernel install --user --name $CURRENT_ENV_NAME --display-name "user-env:($CURRENT_ENV_NAME)"

# deactivate your conda environment
conda deactivate
```

## Clone conda from a reference YAML file
<a name="studio-updated-jl-create-conda-yaml"></a>

Create a conda environment from a reference YAML file. The following is an example of a YAML file that you can use.

```
# anatomy of a reference environment.yml
name: py311-new-env
channels:
  - conda-forge
dependencies:
  - python=3.11
  - numpy
  - pandas
  - scipy
  - matplotlib
  - pip
  - ipykernel
  - pip:
      - git+https://github.com/huggingface/transformers
```

Under `pip`, we recommend specifying only the dependencies that aren't available with conda.

Use the following commands to create a conda environment from a YAML file.

```
# create your conda environment 
conda env create -f environment.yml

# activate your env
conda activate py311-new-env
```

# Clean up a conda environment
<a name="studio-updated-jl-clean-up-conda"></a>

Cleaning up conda environments that you’re not using can help free up disk space and improve performance. Use the following template to clean up a conda environment:

```
# list your environments to select an environment to clean
conda info --envs # or conda info -e

# once you've selected your environment to purge
conda remove --name test-env --all

# run conda environment list to ensure the target environment is purged
conda info --envs # or conda info -e
```

# Share conda environments between instance types
<a name="studio-updated-jl-create-conda-share-environment"></a>

You can share conda environments by saving them to an Amazon EFS directory outside of your Amazon EBS volume. Another user can access the environment in the directory where you saved it.

**Important**  
There are limitations to sharing your environments. For example, we don't recommend using an environment that was created on a GPU Amazon EC2 instance on a CPU instance.

Use the following commands as a template to create a conda environment at a specific path. By creating the environment in a directory on the Amazon EFS file system, you can spin up a new instance and activate the environment there by passing the full path to `conda activate`.

```
# if you know your environment path for your conda environment
conda create --prefix /home/sagemaker-user/my-project/py39-test python=3.9

# activate the env with full path from prefix
conda activate /home/sagemaker-user/my-project/py39-test

# parse env name information from your new environment
export CURRENT_ENV_NAME=$(conda info | grep "active environment" | awk -F' : ' '{print $2}' | awk -F'/' '{print $NF}')

# register your new environment as Jupyter Kernel for execution 
python3 -m ipykernel install --user --name $CURRENT_ENV_NAME --display-name "user-env-prefix:($CURRENT_ENV_NAME)"

# deactivate your conda environment
conda deactivate
```

# Use Amazon Q to expedite your machine learning workflows
<a name="studio-updated-jl-user-guide-use-amazon-q"></a>

Amazon Q Developer is your AI-powered companion for machine learning development. With Amazon Q Developer, you can:
+ Receive step-by-step guidance on using SageMaker AI features independently or in combination with other AWS services.
+ Get sample code to get started on your ML tasks such as data preparation, training, inference, and MLOps.
+ Receive troubleshooting assistance to debug and resolve errors encountered while running code.

Amazon Q Developer seamlessly integrates into your JupyterLab environment. To use Amazon Q Developer, choose the **Q** icon in the left-hand navigation of your JupyterLab or Code Editor environment.

If you don't see the **Q** icon, your administrator needs to set it up for you. For more information about setting up Amazon Q Developer, see [Set up Amazon Q Developer for your users](studio-updated-amazon-q-admin-guide-set-up.md).

Amazon Q automatically provides suggestions to help you write your code. You can also ask for suggestions through the chat interface.

# JupyterLab administrator guide
<a name="studio-updated-jl-admin-guide"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

This guide for administrators describes SageMaker AI JupyterLab resources, such as those from Amazon Elastic Block Store (Amazon EBS) and Amazon Elastic Compute Cloud (Amazon EC2). The topics also show how to provide user access and change storage size. 

A SageMaker AI JupyterLab space is composed of the following resources:
+ A distinct Amazon EBS volume that stores all of the data, such as the code and the environment variables. 
+ The Amazon EC2 instance used to run the space.
+ The image used to run JupyterLab.

**Note**  
Applications do not have access to the EBS volume of other applications. For example, Code Editor, based on Code-OSS, Visual Studio Code - Open Source doesn't have access to the EBS volume for JupyterLab. For more information about EBS volumes, see [Amazon Elastic Block Store (Amazon EBS)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html).

You can use the Amazon SageMaker API to do the following:
+ Change the default storage size of the EBS volume for your users.
+ Change the maximum size of the EBS storage volume.
+ Specify the user settings for the application. For example, you can specify whether the user is using a custom image or a code repository.
+ Specify the support application type.

The default size of the Amazon EBS volume is 5 GB. You can increase the volume size to a maximum of 16,384 GB. By default, your users can increase their volume size up to 100 GB. The volume size can be changed only once within a six-hour period.
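If a space already exists, an administrator can also resize its volume directly. The following is a minimal sketch that uses the `UpdateSpace` API; the domain ID, space name, and 50 GB size are placeholders, and the space must be stopped before you change its settings.

```shell
# resize the EBS volume of an existing, stopped space (sketch;
# replace domain-id and space-name with your own values)
aws sagemaker update-space \
--domain-id domain-id \
--space-name space-name \
--space-settings '{
    "SpaceStorageSettings": {
        "EbsStorageSettings": {
            "EbsVolumeSizeInGb": 50
        }
    }
}'
```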

The kernels associated with the JupyterLab application run on the same Amazon EC2 instance that runs JupyterLab. When you create a space, the latest version of the SageMaker Distribution Image is used by default. For more information about SageMaker Distribution Images, see [SageMaker Studio image support policy](sagemaker-distribution.md).

**Important**  
For information about updating the space to use the latest version of the SageMaker AI Distribution Image, see [Update the SageMaker Distribution Image](studio-updated-jl-update-distribution-image.md).

The working directory of your users within the storage volume is `/home/sagemaker-user`. If you specify your own AWS KMS key to encrypt the volume, everything in the working directory is encrypted using your customer managed key. If you don't specify an AWS KMS key, the data inside `/home/sagemaker-user` is encrypted with an AWS managed key. Regardless of whether you specify an AWS KMS key, all of the data outside of the working directory is encrypted with an AWS managed key.

The following sections walk you through the configurations that you need to perform as an administrator.

**Topics**
+ [Give your users access to spaces](studio-updated-jl-admin-guide-permissions.md)
+ [Change the default storage size for your JupyterLab users](studio-updated-jl-admin-guide-storage-size.md)
+ [Lifecycle configurations with JupyterLab](jl-lcc.md)
+ [Git repos in JupyterLab](studio-updated-jl-admin-guide-git-attach.md)
+ [Custom images](studio-updated-jl-admin-guide-custom-images.md)
+ [Update the SageMaker Distribution Image](studio-updated-jl-update-distribution-image.md)
+ [Delete unused resources](studio-updated-jl-admin-guide-clean-up.md)
+ [Quotas](studio-updated-jl-admin-guide-quotas.md)

# Give your users access to spaces
<a name="studio-updated-jl-admin-guide-permissions"></a>

To give users access to private or shared spaces, you must attach a permissions policy to their IAM roles. You can also use the permissions policy to restrict private spaces and their associated applications to a specific user profile.

The following permissions policy grants access to private and shared spaces. This allows users to create their own space and list other spaces within their domain. A user with this policy can't access the private space of a different user. For information about Studio spaces, see [Amazon SageMaker Studio spaces](studio-updated-spaces.md).

The policy provides users with permissions to the following:
+ Private spaces or shared spaces.
+ A user profile for accessing those spaces.

To provide permissions, you can scope down the permissions of the following policy and add it to the IAM roles of your users. You can also use this policy to restrict your spaces, and their associated applications, to a specific user profile.

------
#### [ JSON ]

****  

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateApp",
        "sagemaker:DeleteApp"
      ],
      "Resource": "arn:aws:sagemaker:us-east-2:111122223333:app/*",
      "Condition": {
        "Null": {
          "sagemaker:OwnerUserProfileArn": "true"
        }
      }
    },
    {
      "Sid": "SMStudioCreatePresignedDomainUrlForUserProfile",
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreatePresignedDomainUrl"
      ],
      "Resource": "arn:aws:sagemaker:us-east-2:111122223333:user-profile/sagemaker:DomainId/sagemaker:UserProfileName"
    },
    {
      "Sid": "SMStudioAppPermissionsListAndDescribe",
      "Effect": "Allow",
      "Action": [
        "sagemaker:ListApps",
        "sagemaker:ListDomains",
        "sagemaker:ListUserProfiles",
        "sagemaker:ListSpaces",
        "sagemaker:DescribeApp",
        "sagemaker:DescribeDomain",
        "sagemaker:DescribeUserProfile",
        "sagemaker:DescribeSpace"
      ],
      "Resource": "*"
    },
    {
      "Sid": "SMStudioAppPermissionsTagOnCreate",
      "Effect": "Allow",
      "Action": [
        "sagemaker:AddTags"
      ],
      "Resource": "arn:aws:sagemaker:us-east-2:111122223333:*/*",
      "Condition": {
        "Null": {
          "sagemaker:TaggingAction": "false"
        }
      }
    },
    {
      "Sid": "SMStudioRestrictSharedSpacesWithoutOwners",
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateSpace",
        "sagemaker:UpdateSpace",
        "sagemaker:DeleteSpace"
      ],
      "Resource": "arn:aws:sagemaker:us-east-2:111122223333:space/sagemaker:DomainId/*",
      "Condition": {
        "Null": {
          "sagemaker:OwnerUserProfileArn": "true"
        }
      }
    },
    {
      "Sid": "SMStudioRestrictSpacesToOwnerUserProfile",
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateSpace",
        "sagemaker:UpdateSpace",
        "sagemaker:DeleteSpace"
      ],
      "Resource": "arn:aws:sagemaker:us-east-2:111122223333:space/sagemaker:DomainId/*",
      "Condition": {
        "ArnLike": {
          "sagemaker:OwnerUserProfileArn": "arn:aws:sagemaker:us-east-2:111122223333:user-profile/sagemaker:DomainId/sagemaker:UserProfileName"
        },
        "StringEquals": {
          "sagemaker:SpaceSharingType": [
            "Private",
            "Shared"
          ]
        }
      }
    },
    {
      "Sid": "SMStudioRestrictCreatePrivateSpaceAppsToOwnerUserProfile",
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateApp",
        "sagemaker:DeleteApp"
      ],
      "Resource": "arn:aws:sagemaker:us-east-2:111122223333:app/sagemaker:DomainId/*",
      "Condition": {
        "ArnLike": {
          "sagemaker:OwnerUserProfileArn": "arn:aws:sagemaker:us-east-2:111122223333:user-profile/sagemaker:DomainId/sagemaker:UserProfileName"
        },
        "StringEquals": {
          "sagemaker:SpaceSharingType": [
            "Private"
          ]
        }
      }
    }
  ]
}
```

------

# Change the default storage size for your JupyterLab users
<a name="studio-updated-jl-admin-guide-storage-size"></a>

You can change the default storage settings for your users based on your organizational requirements and the needs of your users.

To change the storage size, this section provides commands to do the following:

1. Update the Amazon EBS storage settings in the Amazon SageMaker AI domain (domain).

1. Create a user profile and specify the storage settings within it.

Use the following AWS Command Line Interface (AWS CLI) commands to change the default storage size.

Use the following AWS CLI command to update the domain:

```
aws --region AWS Region sagemaker update-domain \
--domain-id domain-id \
--default-user-settings '{
    "SpaceStorageSettings": {
        "DefaultEbsStorageSettings":{
            "DefaultEbsVolumeSizeInGb":5,
            "MaximumEbsVolumeSizeInGb":100
        }
    }
}'
```

Use the following AWS CLI command to create the user profile and specify the default storage settings:

```
aws --region AWS Region sagemaker create-user-profile \
--domain-id domain-id \
--user-profile-name user-profile-name \
--user-settings '{
    "SpaceStorageSettings": {
        "DefaultEbsStorageSettings":{
            "DefaultEbsVolumeSizeInGb":5,
            "MaximumEbsVolumeSizeInGb":100
        }
    }
}'
```

Use the following AWS CLI command to update the default storage settings in the user profile:

```
aws --region AWS Region sagemaker update-user-profile \
--domain-id domain-id \
--user-profile-name user-profile-name \
--user-settings '{
    "SpaceStorageSettings": {
        "DefaultEbsStorageSettings":{
            "DefaultEbsVolumeSizeInGb":25,
            "MaximumEbsVolumeSizeInGb":200
        }
    }
}'
```
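To confirm the new defaults, you can describe the user profile. This is a sketch; the `--query` path follows the `DescribeUserProfile` response shape.

```shell
# print only the storage settings from the user profile (sketch)
aws --region AWS Region sagemaker describe-user-profile \
--domain-id domain-id \
--user-profile-name user-profile-name \
--query 'UserSettings.SpaceStorageSettings'
```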

# Lifecycle configurations with JupyterLab
<a name="jl-lcc"></a>

Lifecycle configurations are shell scripts that are triggered by JupyterLab lifecycle events, such as starting a new JupyterLab notebook. You can use lifecycle configurations to automate customization for your JupyterLab environment. This customization includes installing custom packages, configuring notebook extensions, preloading datasets, and setting up source code repositories.

Using lifecycle configurations gives you flexibility and control to configure JupyterLab to meet your specific needs. For example, you can create a minimal set of base container images with the most commonly used packages and libraries. Then you can use lifecycle configurations to install additional packages for specific use cases across your data science and machine learning teams.

**Note**  
Each script has a limit of **16,384 characters**.

**Topics**
+ [Lifecycle configuration creation](jl-lcc-create.md)
+ [Debug lifecycle configurations](jl-lcc-debug.md)
+ [Detach lifecycle configurations](jl-lcc-delete.md)

# Lifecycle configuration creation
<a name="jl-lcc-create"></a>

This topic includes instructions for creating and associating a lifecycle configuration with JupyterLab. You use the AWS Command Line Interface (AWS CLI) or the AWS Management Console to automate customization for your JupyterLab environment.

Lifecycle configurations are shell scripts triggered by JupyterLab lifecycle events, such as starting a new JupyterLab notebook. For more information about lifecycle configurations, see [Lifecycle configurations with JupyterLab](jl-lcc.md).

## Create a lifecycle configuration (AWS CLI)
<a name="jl-lcc-create-cli"></a>

Learn how to create a lifecycle configuration using the AWS Command Line Interface (AWS CLI) to automate customization for your Studio environment.

### Prerequisites
<a name="jl-lcc-create-cli-prerequisites"></a>

Before you begin, complete the following prerequisites: 
+ Update the AWS CLI by following the steps in [Installing the current AWS CLI Version](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv1.html#install-tool-bundled).
+ From your local machine, run `aws configure` and provide your AWS credentials. For information about AWS credentials, see [Understanding and getting your AWS credentials](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html). 
+ Onboard to Amazon SageMaker AI domain. For conceptual information, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md). For a quickstart guide, see [Use quick setup for Amazon SageMaker AI](onboard-quick-start.md).

### Step 1: Create a lifecycle configuration
<a name="jl-lcc-create-cli-step1"></a>

The following procedure shows how to create a lifecycle configuration script that prints `Hello World`.

**Note**  
Each script can have up to **16,384 characters**.

1. From your local machine, create a file named `my-script.sh` with the following content:

   ```
   #!/bin/bash
   set -eux
   echo 'Hello World!'
   ```

1. Use the following command to convert your `my-script.sh` file into base64 format. This conversion prevents errors caused by spacing and line break encoding.

   ```
   LCC_CONTENT=`openssl base64 -A -in my-script.sh`
   ```

1. Create a lifecycle configuration for use with Studio. The following command creates a lifecycle configuration that runs when you launch an associated `JupyterLab` application:

   ```
   aws sagemaker create-studio-lifecycle-config \
   --region region \
   --studio-lifecycle-config-name my-jl-lcc \
   --studio-lifecycle-config-content $LCC_CONTENT \
   --studio-lifecycle-config-app-type JupyterLab
   ```

   Note the ARN of the newly created lifecycle configuration that is returned. This ARN is required to attach the lifecycle configuration to your application.
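Before creating the lifecycle configuration, you can verify the encoding locally by decoding the variable and comparing the result to the original file. This is an optional sketch; `decoded.sh` is a temporary file name used only for the comparison.

```shell
# encode the script, then decode it again to confirm the round trip
LCC_CONTENT=$(openssl base64 -A -in my-script.sh)
echo "$LCC_CONTENT" | openssl base64 -d -A > decoded.sh

# the decoded file should match the original byte for byte
diff my-script.sh decoded.sh
```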

### Step 2: Attach the lifecycle configuration to your Amazon SageMaker AI domain (domain) and user profile
<a name="jl-lcc-create-cli-step2"></a>

To attach the lifecycle configuration, you must update the `UserSettings` for your domain or user profile. Lifecycle configuration scripts that are associated at the domain level are inherited by all users. However, scripts that are associated at the user profile level are scoped to a specific user. 

You can create a new user profile, domain, or space with a lifecycle configuration attached by using the following commands:
+ [create-user-profile](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/create-user-profile.html)
+ [create-domain](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/create-domain.html)
+ [create-space](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/create-space.html)

The following command creates a user profile with a lifecycle configuration. Add the lifecycle configuration ARN from the preceding step to the `JupyterLabAppSettings` of the user. You can add multiple lifecycle configurations at the same time by passing a list of them. When a user launches a JupyterLab application with the AWS CLI, they can specify a lifecycle configuration instead of using the default one. The lifecycle configuration that the user passes must belong to the list of lifecycle configurations in `JupyterLabAppSettings`.

```
# Create a new UserProfile
aws sagemaker create-user-profile --domain-id domain-id \
--user-profile-name user-profile-name \
--region region \
--user-settings '{
"JupyterLabAppSettings": {
  "LifecycleConfigArns":
    [lifecycle-configuration-arn-list]
  }
}'
```
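As a sketch of how a user might pass a specific lifecycle configuration when launching a JupyterLab application from the AWS CLI, the `--resource-spec` of `create-app` accepts a `LifecycleConfigArn`. The domain ID, space name, instance type, and ARN below are placeholders.

```shell
# launch a JupyterLab app with an explicit lifecycle configuration (sketch)
aws sagemaker create-app \
--domain-id domain-id \
--space-name space-name \
--app-type JupyterLab \
--app-name default \
--resource-spec '{
    "InstanceType": "ml.t3.medium",
    "LifecycleConfigArn": "lifecycle-configuration-arn"
}'
```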

## Create a lifecycle configuration (Console)
<a name="jl-lcc-create-console"></a>

Learn how to create a lifecycle configuration using the AWS Management Console to automate customization for your Studio environment.

### Step 1: Create a lifecycle configuration
<a name="jl-lcc-create-console-step1"></a>

Use the following procedure to create a lifecycle configuration script that prints `Hello World`.

**To create a lifecycle configuration**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Lifecycle configurations**. 

1. Choose the **JupyterLab** tab.

1. Choose **Create configuration**.

1. For **Name**, specify the name of the lifecycle configuration.

1. For the text box under **Scripts**, specify the following lifecycle configuration:

   ```
   #!/bin/bash
   set -eux
   echo 'Hello World!'
   ```

1. Choose **Create configuration**.

### Step 2: Attach the lifecycle configuration to your Amazon SageMaker AI domain (domain) and user profile
<a name="jl-lcc-create-console-step2"></a>

Lifecycle configuration scripts associated at the domain level are inherited by all users. However, scripts that are associated at the user profile level are scoped to a specific user.

You can attach multiple lifecycle configurations to a domain or user profile for JupyterLab.

Use the following procedure to attach a lifecycle configuration to a domain.

**To attach a lifecycle configuration to a domain**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**. 

1. From the list of domains, select the domain to attach the lifecycle configuration to.

1. From the **Domain details**, choose the **Environment** tab.

1. Under **Lifecycle configurations for personal Studio apps**, choose **Attach**.

1. Under **Source**, choose **Existing configuration**.

1. Under **Studio lifecycle configurations**, select the lifecycle configuration that you created in the previous step.

1. Choose **Attach to domain**.

Use the following procedure to attach a lifecycle configuration to a user profile.

**To attach a lifecycle configuration to a user profile**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**. 

1. From the list of domains, select the domain that contains the user profile to attach the lifecycle configuration to.

1. Under **User profiles**, select the user profile.

1. From the **User Details** page, choose **Edit**.

1. On the left navigation, choose **Studio settings**.

1. Under **Lifecycle configurations attached to user**, choose **Attach**.

1. Under **Source**, choose **Existing configuration**.

1. Under **Studio lifecycle configurations**, select the lifecycle configuration that you created in the previous step.

1. Choose **Attach to user profile**.

# Debug lifecycle configurations
<a name="jl-lcc-debug"></a>

The following topics show how to get information about and debug your lifecycle configurations.

**Topics**
+ [Verify lifecycle configuration process from CloudWatch Logs](#jl-lcc-debug-logs)
+ [Lifecycle configuration timeout](#jl-lcc-debug-timeout)

## Verify lifecycle configuration process from CloudWatch Logs
<a name="jl-lcc-debug-logs"></a>

Lifecycle configurations only log `STDOUT` and `STDERR`.

`STDOUT` is the default output for bash scripts. You can write to `STDERR` by appending `>&2` to the end of a bash command. For example, `echo 'hello'>&2`. 

Logs for your lifecycle configurations are published to your AWS account using Amazon CloudWatch. These logs can be found in the `/aws/sagemaker/studio` log stream in the CloudWatch console.

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. Choose **Logs** from the left navigation pane. From the dropdown menu, select **Log groups**.

1. On the **Log groups** page, search for `aws/sagemaker/studio`. 

1. Select the log group.

1. On the **Log group details** page, choose the **Log streams** tab.

1. To find the logs for a specific space, search the log streams using the following format:

   ```
   domain-id/space-name/app-type/default/LifecycleConfigOnStart
   ```

   For example, to find the lifecycle configuration logs for domain ID `d-m85lcu8vbqmz`, space name `i-sonic-js`, and application type `JupyterLab`, use the following search string:

   ```
   d-m85lcu8vbqmz/i-sonic-js/JupyterLab/default/LifecycleConfigOnStart
   ```

## Lifecycle configuration timeout
<a name="jl-lcc-debug-timeout"></a>

There is a lifecycle configuration timeout limitation of 5 minutes. If a lifecycle configuration script takes longer than 5 minutes to run, you get an error.

To resolve this error, make sure that your lifecycle configuration script completes in less than 5 minutes. 

To help decrease the runtime of scripts, try the following:
+ Reduce unnecessary steps. For example, limit which conda environments to install large packages in.
+ Run tasks in parallel processes.
+ Use the `nohup` command in your script to make sure that hangup signals are ignored and that long-running tasks continue after the script exits.
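A minimal sketch of the background pattern follows. The package name and log path are hypothetical; the point is that `nohup ... &` lets the script return immediately while the long task continues.

```shell
#!/bin/bash
set -eux

# start the long-running install in the background so the lifecycle
# configuration script finishes within the 5-minute limit; output is
# captured in a log file you can inspect later
nohup pip install --quiet large-example-package > /tmp/install.log 2>&1 &

echo 'background installation started'
```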

# Detach lifecycle configurations
<a name="jl-lcc-delete"></a>

To update your script, you must create a new lifecycle configuration script and attach it to the respective Amazon SageMaker AI domain (domain), user profile, or shared space. A lifecycle configuration script can't be changed after it's created. For more information about creating and attaching the lifecycle configuration, see [Lifecycle configuration creation](jl-lcc-create.md).

The following section shows how to detach a lifecycle configuration using the AWS Command Line Interface (AWS CLI).

## Detach using the AWS CLI
<a name="jl-lcc-delete-cli"></a>

To detach a lifecycle configuration using the AWS CLI, remove the desired lifecycle configuration from the list of lifecycle configurations attached to the resource. Then pass the updated list as part of the respective command:
+ [update-user-profile](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/update-user-profile.html)
+ [update-domain](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/update-domain.html)
+ [update-space](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/update-space.html)

For example, the following command removes all lifecycle configurations for the JupyterLab application that's attached to the domain.

```
aws sagemaker update-domain --domain-id domain-id \
--region region \
--default-user-settings '{
"JupyterLabAppSettings": {
  "LifecycleConfigArns":
    []
  }
}'
```
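To detach only one lifecycle configuration while keeping the rest, filter it out of the attached list and pass the result back. The following Bash sketch uses placeholder ARNs; in practice, read the current list from `aws sagemaker describe-domain` first:

```shell
# Placeholder ARNs; in practice, read these from describe-domain output.
attached=(
  "arn:aws:sagemaker:us-east-1:111122223333:studio-lifecycle-config/keep-me"
  "arn:aws:sagemaker:us-east-1:111122223333:studio-lifecycle-config/remove-me"
)

# Keep every ARN except the one being detached.
remaining=()
for arn in "${attached[@]}"; do
  [[ "$arn" == */remove-me ]] || remaining+=("$arn")
done

# Render the remaining ARNs as the JSON list that update-domain expects.
printf -v json '"%s",' "${remaining[@]}"
json="[${json%,}]"
echo "$json"
```

You can then pass `$json` as the `LifecycleConfigArns` value inside `--default-user-settings`.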

# Git repos in JupyterLab
<a name="studio-updated-jl-admin-guide-git-attach"></a>

JupyterLab offers a Git extension to enter the URL of a Git repository (repo), clone it into an environment, push changes, and view the commit history. You can also attach suggested Git repo URLs to an Amazon SageMaker AI domain (domain) or user profile.

The following sections show how to attach or detach Git repo URLs.

**Topics**
+ [Attach a Git repository (AWS CLI)](studio-updated-git-attach-cli.md)
+ [Detach Git repo URLs](studio-updated-git-detach.md)

# Attach a Git repository (AWS CLI)
<a name="studio-updated-git-attach-cli"></a>

This section shows how to attach a Git repository (repo) URL using the AWS CLI. After you attach the Git repo URL, you can clone it by following the steps in [Clone a Git repo in Amazon SageMaker Studio](#studio-updated-tasks-git).

## Prerequisites
<a name="studio-updated-git-attach-cli-prerequisites"></a>

Before you begin, complete the following prerequisites: 
+ Update the AWS CLI by following the steps in [Installing the current AWS Command Line Interface Version](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv1.html#install-tool-bundled).
+ From your local machine, run `aws configure` and provide your AWS credentials. For information about AWS credentials, see [Understanding and getting your AWS credentials](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html). 
+ Onboard to Amazon SageMaker AI domain. For more information, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md).

## Attach the Git repo to an Amazon SageMaker AI domain (domain) or user profile
<a name="studio-updated-git-attach-cli-attach"></a>

Git repo URLs that are associated at the domain level are inherited by all users. However, Git repo URLs that are associated at the user profile level are scoped to a specific user. You can attach multiple Git repo URLs to an Amazon SageMaker AI domain or to a user profile by passing a list of repository URLs.

The following sections show how to attach a Git repo URL to your domain and your user profile.

### Attach to an Amazon SageMaker AI domain
<a name="studio-updated-git-attach-cli-attach-domain"></a>

The following command attaches a Git repo URL to an existing domain: 

```
aws sagemaker update-domain --region region --domain-id domain-id \
    --default-user-settings JupyterLabAppSettings={CodeRepositories=[{RepositoryUrl="repository"}]}
```

### Attach to a user profile
<a name="studio-updated-git-attach-cli-attach-userprofile"></a>

The following command attaches a Git repo URL to an existing user profile:

```
aws sagemaker update-user-profile --domain-id domain-id --user-profile-name user-name \
    --user-settings JupyterLabAppSettings={CodeRepositories=[{RepositoryUrl="repository"}]}
```
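To attach more than one suggested repo URL, pass a list of `RepositoryUrl` entries. The following sketch builds the JSON settings from placeholder repository URLs:

```shell
# Placeholder repository URLs; substitute your own.
repos='[{"RepositoryUrl": "https://github.com/example/repo-a"}, {"RepositoryUrl": "https://github.com/example/repo-b"}]'

# Wrap the list in the JupyterLabAppSettings structure that the CLI expects.
settings="{\"JupyterLabAppSettings\": {\"CodeRepositories\": ${repos}}}"
echo "${settings}"
```

Pass the result as `--user-settings "$settings"` to `update-user-profile`, or as `--default-user-settings "$settings"` to `update-domain` to attach at the domain level.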

## Clone a Git repo in Amazon SageMaker Studio
<a name="studio-updated-tasks-git"></a>

Amazon SageMaker Studio connects to a local Git repo only. To access the files in the repo, clone the Git repo from within Studio. To do so, Studio offers a Git extension for you to enter the URL of a Git repo, clone it into your environment, push changes, and view commit history. 

If the repo is private and requires credentials to access, you receive a prompt to enter your user credentials. Your credentials include your username and personal access token. For more information about personal access tokens, see [Managing your personal access tokens](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens).

Admins can also attach suggested Git repository URLs at the Amazon SageMaker AI domain or user profile level. Users can then select the repo URL from the list of suggestions and clone that into Studio. For more information about attaching suggested repos, see [Attach Suggested Git Repos to Amazon SageMaker Studio Classic](studio-git-attach.md).

# Detach Git repo URLs
<a name="studio-updated-git-detach"></a>

This section shows how to detach Git repository URLs from an Amazon SageMaker AI domain (domain) or a user profile. You can detach repo URLs by using the AWS Command Line Interface (AWS CLI) or the Amazon SageMaker AI console.

## Detach a Git repo using the AWS CLI
<a name="studio-updated-git-detach-cli"></a>

To detach all Git repo URLs from a domain or user profile, you must pass an empty list of code repositories. This list is passed as part of the `JupyterLabAppSettings` parameter in an `update-domain` or `update-user-profile` command. To detach only one Git repo URL, pass the code repositories list without the desired Git repo URL. 

### Detach from an Amazon SageMaker AI domain
<a name="studio-updated-git-detach-cli-domain"></a>

The following command detaches all Git repo URLs from a domain:

```
aws sagemaker update-domain --region region --domain-id domain-id \
    --default-user-settings JupyterLabAppSettings={CodeRepositories=[]}
```

### Detach from a user profile
<a name="studio-updated-git-detach-cli-userprofile"></a>

The following command detaches all Git repo URLs from a user profile:

```
aws sagemaker update-user-profile --domain-id domain-id --user-profile-name user-name \
    --user-settings JupyterLabAppSettings={CodeRepositories=[]}
```

# Custom images
<a name="studio-updated-jl-admin-guide-custom-images"></a>

If you need functionality that differs from what the SageMaker Distribution image provides, you can bring your own image with your custom extensions and packages. You can also use a custom image to personalize the JupyterLab UI for your own branding or compliance needs.

This page provides JupyterLab-specific information and templates for creating your own custom SageMaker AI images. It supplements the Amazon SageMaker Studio instructions on creating your own SageMaker AI image and bringing it to Studio. To learn about custom Amazon SageMaker AI images and how to bring your own image to Studio, see [Bring your own image (BYOI)](studio-updated-byoi.md). 

**Topics**
+ [Health check and URL for applications](#studio-updated-jl-admin-guide-custom-images-app-healthcheck)
+ [Dockerfile examples](#studio-updated-jl-custom-images-dockerfile-templates)

## Health check and URL for applications
<a name="studio-updated-jl-admin-guide-custom-images-app-healthcheck"></a>
+ `Base URL` – The base URL for the BYOI application must be `jupyterlab/default`. You can only have one application and it must always be named `default`.
+ `HealthCheck API` – SageMaker AI uses the health check endpoint at port `8888` to check the health of the JupyterLab application. `jupyterlab/default/api/status` is the endpoint for the health check.
+ `Home/Default URL` – The `/opt/.sagemakerinternal` and `/opt/ml` directories are reserved for use by AWS. The metadata file in `/opt/ml` contains metadata about resources such as the `DomainId`.
+ Authentication – To enable authentication for your users, turn off the Jupyter notebook's token-based and password-based authentication and allow all origins.
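Before registering a custom image, you can reproduce the health check locally. The following sketch defines a helper to run inside the started container; it assumes `curl` is available in the image:

```shell
# Reproduce the health check that SageMaker AI performs against port 8888.
# Run inside the container after the JupyterLab server has started.
check_jupyterlab_health() {
  curl --fail --silent "http://localhost:8888/jupyterlab/default/api/status"
}
```

`curl --fail` exits with a nonzero status if the endpoint doesn't return a success code, which makes the helper usable in scripts and Docker `HEALTHCHECK` instructions.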

## Dockerfile examples
<a name="studio-updated-jl-custom-images-dockerfile-templates"></a>

The following examples are `Dockerfile`s that meet the preceding requirements and [Custom image specifications](studio-updated-byoi-specs.md).

**Note**  
If you are bringing your own image to SageMaker Unified Studio, you will need to follow the [Dockerfile specifications](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/byoi-specifications.html) in the *Amazon SageMaker Unified Studio User Guide*.  
`Dockerfile` examples for SageMaker Unified Studio can be found in [Dockerfile example](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/byoi-specifications.html#byoi-specifications-example) in the *Amazon SageMaker Unified Studio User Guide*.

------
#### [ Example AL2023 Dockerfile ]

The following is an example AL2023 Dockerfile that meets the preceding requirements and [Custom image specifications](studio-updated-byoi-specs.md).

```
FROM public.ecr.aws/amazonlinux/amazonlinux:2023

ARG NB_USER="sagemaker-user"
ARG NB_UID=1000
ARG NB_GID=100

# Install Python3, pip, and other dependencies
RUN yum install -y \
    python3 \
    python3-pip \
    python3-devel \
    gcc \
    shadow-utils && \
    useradd --create-home --shell /bin/bash --gid "${NB_GID}" --uid ${NB_UID} ${NB_USER} && \
    yum clean all

RUN python3 -m pip install --no-cache-dir \
    'jupyterlab>=4.0.0,<5.0.0' \
    urllib3 \
    jupyter-activity-monitor-extension \
    --ignore-installed

# Verify versions
RUN python3 --version && \
    jupyter lab --version

USER ${NB_UID}
CMD jupyter lab --ip 0.0.0.0 --port 8888 \
    --ServerApp.base_url="/jupyterlab/default" \
    --ServerApp.token='' \
    --ServerApp.allow_origin='*'
```

------
#### [ Example Amazon SageMaker Distribution Dockerfile ]

The following is an example Amazon SageMaker Distribution Dockerfile that meets the preceding requirements and [Custom image specifications](studio-updated-byoi-specs.md).

```
FROM public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu
ARG NB_USER="sagemaker-user"
ARG NB_UID=1000
ARG NB_GID=100

ENV MAMBA_USER=$NB_USER

USER root

RUN apt-get update
RUN micromamba install sagemaker-inference --freeze-installed --yes --channel conda-forge --name base

USER $MAMBA_USER

ENTRYPOINT ["entrypoint-jupyter-server"]
```

------
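After building an image from either Dockerfile, push it to Amazon ECR so that you can attach it to your domain. The following sketch uses placeholder account, Region, and repository values, and assumes that the ECR repository already exists and Docker is running:

```shell
build_and_push() {
  # Placeholder values; substitute your own account, Region, and repository.
  local account="111122223333" region="us-east-1" repo="custom-jupyterlab"
  local registry="${account}.dkr.ecr.${region}.amazonaws.com"

  # Authenticate Docker to ECR, then build, tag, and push the image.
  aws ecr get-login-password --region "${region}" \
    | docker login --username AWS --password-stdin "${registry}"
  docker build -t "${repo}" .
  docker tag "${repo}:latest" "${registry}/${repo}:latest"
  docker push "${registry}/${repo}:latest"
}
```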

# Update the SageMaker Distribution Image
<a name="studio-updated-jl-update-distribution-image"></a>

**Important**  
This topic assumes that you've created a space and given the user access to it. For more information, see [Give your users access to spaces](studio-updated-jl-admin-guide-permissions.md).

Update the JupyterLab spaces that you've already created to use the latest version of the SageMaker Distribution Image to access the latest features. You can use either the Studio UI or the AWS Command Line Interface (AWS CLI) to update the image.

The following sections provide information about updating an image.

## Update the image (UI)
<a name="studio-updated-jl-update-distribution-image-ui"></a>

Updating the image involves restarting the JupyterLab space of your user. Use the following procedure to update your user's JupyterLab space with the latest image.

**To update the image (UI)**

1. Open Studio. For information about opening Studio, see [Launch Amazon SageMaker Studio](studio-updated-launch.md).

1. Choose **JupyterLab**.

1. Select the JupyterLab space of your user.

1. Choose **Stop space**.

1. For **Image**, select an updated version of the SageMaker AI Distribution Image. For the latest image, choose **Latest**.

1. Choose **Run space**.

## Update the image (AWS CLI)
<a name="studio-updated-jl-update-distribution-image-cli"></a>

This section assumes that you have the AWS Command Line Interface (AWS CLI) installed. For information about installing the AWS CLI, see [Install or update to the latest version of the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).

To update the image, you must do the following for your user's space:

1. Delete the JupyterLab application

1. Update the space

1. Create the application

**Important**  
You must have the following information ready before you start updating the image:  
Domain ID – The ID of your user's Amazon SageMaker AI domain.  
Application type – `JupyterLab`.  
Application name – `default`.  
Space name – The name specified for the space.  
Instance type – The Amazon EC2 instance type that you're using to run the application. For example, `ml.t3.medium`.  
SageMaker Image ARN – The Amazon Resource Name (ARN) of the SageMaker AI Distribution Image. You can provide the latest version of the SageMaker AI Distribution Image by specifying either `sagemaker-distribution-cpu` or `sagemaker-distribution-gpu` as the resource identifier.

To delete the JupyterLab application, run the following command:

```
aws sagemaker delete-app \
--domain-id your-user's-domain-id \
--app-type JupyterLab \
--app-name default \
--space-name name-of-your-user's-space
```

To update your user's space, run the following command:

```
aws sagemaker update-space \
--space-name name-of-your-user's-space \
--domain-id your-user's-domain-id
```

If you've updated the space successfully, you'll see the space ARN in the response:

```
{
"SpaceArn": "arn:aws:sagemaker:AWS Region:111122223333:space/your-user's-domain-id/name-of-your-user's-space"
}
```

To create the application, run the following command:

```
aws sagemaker create-app \
--domain-id your-user's-domain-id  \
--app-type JupyterLab \
--app-name default \
--space-name name-of-your-user's-space \
--resource-spec "InstanceType=instance-type,SageMakerImageArn=arn:aws:sagemaker:AWS Region:555555555555:image/sagemaker-distribution-resource-identifier"
```
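The three steps can be wrapped in a single helper, sketched below with placeholder arguments. Note that `delete-app` is asynchronous, so in practice you may need to poll `aws sagemaker describe-app` until the application is deleted before creating it again:

```shell
update_jupyterlab_image() {
  local domain_id="$1" space_name="$2" instance_type="$3" image_arn="$4"

  # 1. Delete the existing JupyterLab application.
  aws sagemaker delete-app --domain-id "${domain_id}" \
    --app-type JupyterLab --app-name default --space-name "${space_name}"

  # 2. Update the space.
  aws sagemaker update-space --domain-id "${domain_id}" --space-name "${space_name}"

  # 3. Recreate the application with the updated image.
  aws sagemaker create-app --domain-id "${domain_id}" \
    --app-type JupyterLab --app-name default --space-name "${space_name}" \
    --resource-spec "InstanceType=${instance_type},SageMakerImageArn=${image_arn}"
}
```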

# Delete unused resources
<a name="studio-updated-jl-admin-guide-clean-up"></a>

To avoid incurring additional costs running JupyterLab, we recommend deleting unused resources in the following order:

1. JupyterLab applications

1. Spaces

1. User profiles

1. Domains

Use the following AWS Command Line Interface (AWS CLI) commands to delete resources within a domain:

------
#### [ Delete a JupyterLab application ]

```
aws --region AWS Region sagemaker delete-app --domain-id example-domain-id --app-name default --app-type JupyterLab --space-name example-space-name
```

------
#### [ Delete a space ]

**Important**  
If you delete a space, you delete the Amazon EBS volume associated with it. We recommend backing up any valuable data before you delete your space.

```
aws --region AWS Region sagemaker delete-space --domain-id example-domain-id  --space-name example-space-name
```

------
#### [ Delete a user profile ]

```
aws --region AWS Region sagemaker delete-user-profile --domain-id example-domain-id --user-profile-name example-user-profile
```

------

# Quotas
<a name="studio-updated-jl-admin-guide-quotas"></a>

JupyterLab has quotas for the following:
+ The sum of all Amazon EBS volumes within an AWS account.
+ The instance types that are available for your users.
+ The number of instances of a particular type that your users can launch.

To get more storage and compute for your users, request an increase to your AWS quotas. For more information about requesting a quota increase, see [Amazon SageMaker AI endpoints and quotas](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html).

# Amazon SageMaker notebook instances
<a name="nbi"></a>

An Amazon SageMaker notebook instance is a machine learning (ML) compute instance running the Jupyter Notebook application. One of the best ways for ML practitioners to use Amazon SageMaker AI is to train and deploy ML models using SageMaker notebook instances. Notebook instances create the environment by initiating Jupyter servers on Amazon Elastic Compute Cloud (Amazon EC2) and providing preconfigured kernels with the following packages: the Amazon SageMaker Python SDK, AWS SDK for Python (Boto3), AWS Command Line Interface (AWS CLI), Conda, Pandas, deep learning framework libraries, and other libraries for data science and machine learning.

Use Jupyter notebooks in your notebook instance to:
+ prepare and process data
+ write code to train models
+ deploy models to SageMaker hosting
+ test or validate your models

For information about pricing for Amazon SageMaker notebook instances, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

## Maintenance
<a name="nbi-maintenance"></a>

SageMaker AI updates the underlying software for Amazon SageMaker Notebook Instances at least once every 90 days. Some maintenance updates, such as operating system upgrades, may require your application to be taken offline for a short period of time. It is not possible to perform any operations during this period while the underlying software is being updated. We recommend that you restart your notebooks at least once every 30 days to automatically consume patches.

If the notebook instance isn't updated and is running insecure software, SageMaker AI might periodically update the instance as part of regular maintenance. During these updates, data outside of the folder `/home/ec2-user/SageMaker` is not persisted.

For more information, contact [AWS Support](https://aws.amazon.com/premiumsupport/).

## Machine Learning with the SageMaker Python SDK
<a name="gs-ml-with-sagemaker-pysdk"></a>

To train, validate, deploy, and evaluate an ML model in a SageMaker notebook instance, use the SageMaker Python SDK. The SageMaker Python SDK abstracts the AWS SDK for Python (Boto3) and SageMaker API operations. It enables you to integrate with and orchestrate other AWS services, such as Amazon Simple Storage Service (Amazon S3) for saving data and model artifacts, Amazon Elastic Container Registry (Amazon ECR) for importing and serving the ML models, and Amazon Elastic Compute Cloud (Amazon EC2) for training and inference.

You can also take advantage of SageMaker AI features that help you deal with every stage of a complete ML cycle: data labeling, data preprocessing, model training, model deployment, evaluation of prediction performance, and monitoring the quality of the model in production.

If you're a first-time SageMaker AI user, we recommend that you use the SageMaker Python SDK by following the end-to-end ML tutorial. To find the open source documentation, see the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

**Topics**
+ [Maintenance](#nbi-maintenance)
+ [Machine Learning with the SageMaker Python SDK](#gs-ml-with-sagemaker-pysdk)
+ [Tutorial for building models with Notebook Instances](gs-console.md)
+ [AL2023 notebook instances](nbi-al2023.md)
+ [Amazon Linux 2 notebook instances](nbi-al2.md)
+ [JupyterLab versioning](nbi-jl.md)
+ [Create an Amazon SageMaker notebook instance](howitworks-create-ws.md)
+ [Access Notebook Instances](howitworks-access-ws.md)
+ [Update a Notebook Instance](nbi-update.md)
+ [Customization of a SageMaker notebook instance using an LCC script](notebook-lifecycle-config.md)
+ [Set the Notebook Kernel](howitworks-set-kernel.md)
+ [Git repositories with SageMaker AI Notebook Instances](nbi-git-repo.md)
+ [Notebook Instance Metadata](nbi-metadata.md)
+ [Monitor Jupyter Logs in Amazon CloudWatch Logs](jupyter-logs.md)

# Tutorial for building models with Notebook Instances
<a name="gs-console"></a>

This Get Started tutorial walks you through how to create a SageMaker notebook instance, open a Jupyter notebook with a preconfigured kernel with the Conda environment for machine learning, and start a SageMaker AI session to run an end-to-end ML cycle. You'll learn how to save a dataset to a default Amazon S3 bucket automatically paired with the SageMaker AI session, submit a training job of an ML model to Amazon EC2, and deploy the trained model for prediction by hosting or batch inferencing through Amazon EC2. 

This tutorial explicitly shows a complete ML flow of training the XGBoost model from the SageMaker AI built-in model pool. You use the [US Adult Census dataset](https://archive.ics.uci.edu/ml/datasets/adult), and you evaluate the performance of the trained SageMaker AI XGBoost model on predicting individuals' income.
+ [SageMaker AI XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) – The [XGBoost](https://xgboost.readthedocs.io/en/latest/) model is adapted to the SageMaker AI environment and preconfigured as Docker containers. SageMaker AI provides a suite of [built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html) that are prepared for using SageMaker AI features. To learn more about what ML algorithms are adapted to SageMaker AI, see [Choose an Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/algorithms-choose.html) and [Use Amazon SageMaker Built-in Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html). For the SageMaker AI built-in algorithm API operations, see [First-Party Algorithms](https://sagemaker.readthedocs.io/en/stable/algorithms/index.html) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).
+ [Adult Census dataset](https://archive.ics.uci.edu/ml/datasets/adult) – The dataset from the [1994 Census bureau database](http://www.census.gov/en.html) by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). The SageMaker AI XGBoost model is trained using this dataset to predict if an individual makes over \$50,000 a year or less.

**Topics**
+ [Create an Amazon SageMaker Notebook Instance for the tutorial](gs-setup-working-env.md)
+ [Create a Jupyter notebook in the SageMaker notebook instance](ex1-prepare.md)
+ [Prepare a dataset](ex1-preprocess-data.md)
+ [Train a Model](ex1-train-model.md)
+ [Deploy the model to Amazon EC2](ex1-model-deployment.md)
+ [Evaluate the model](ex1-test-model.md)
+ [Clean up Amazon SageMaker notebook instance resources](ex1-cleanup.md)

# Create an Amazon SageMaker Notebook Instance for the tutorial
<a name="gs-setup-working-env"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

An Amazon SageMaker notebook instance is a fully managed machine learning (ML) Amazon Elastic Compute Cloud (Amazon EC2) compute instance that runs the Jupyter Notebook application. Use the notebook instance to create and manage Jupyter notebooks for preprocessing data, training ML models, and deploying ML models.

**To create a SageMaker notebook instance**  
![\[Animated screenshot that shows how to create a SageMaker notebook instance.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/get-started-ni/gs-ni-create-instance.gif)

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose **Notebook instances**, and then choose **Create notebook instance**.

1. On the **Create notebook instance** page, provide the following information (if a field is not mentioned, leave the default values):

   1. For **Notebook instance name**, type a name for your notebook instance.

   1. For **Notebook Instance type**, choose `ml.t2.medium`. This is the least expensive instance type that notebook instances support, and it is enough for this exercise. If an `ml.t2.medium` instance type isn't available in your current AWS Region, choose `ml.t3.medium`.

   1. For **Platform Identifier**, choose a platform type to create the notebook instance on. This platform type defines the Operating System and the JupyterLab version that your notebook instance is created with. The latest and recommended version is `notebook-al2023-v1`, for an Amazon Linux 2023 notebook instance. For information about platform identifier types, see [AL2023 notebook instances](nbi-al2023.md) and [Amazon Linux 2 notebook instances](nbi-al2.md). For information about JupyterLab versions, see [JupyterLab versioning](nbi-jl.md).

   1. For **IAM role**, choose **Create a new role**, and then choose **Create role**. This IAM role automatically gets permissions to access any S3 bucket that has `sagemaker` in the name. It gets these permissions through the `AmazonSageMakerFullAccess` policy, which SageMaker AI attaches to the role. 
**Note**  
If you want to grant the IAM role permission to access S3 buckets without `sagemaker` in the name, you need to attach the `S3FullAccess` policy. You can also limit the permissions to specific S3 buckets to the IAM role. For more information and examples of adding bucket policies to the IAM role, see [Bucket Policy Examples](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-bucket-policies.html).

   1. Choose **Create notebook instance**. 

      In a few minutes, SageMaker AI launches a notebook instance and attaches a 5 GB Amazon EBS storage volume to it. The notebook instance has a preconfigured Jupyter notebook server, SageMaker AI and AWS SDK libraries, and a set of Anaconda libraries.

      For more information about creating a SageMaker notebook instance, see [Create a Notebook Instance](https://docs.aws.amazon.com/sagemaker/latest/dg/howitworks-create-ws.html). 

## (Optional) Change SageMaker Notebook Instance Settings
<a name="gs-change-ni-settings"></a>

To change the ML compute instance type or the size of the Amazon EBS storage of a SageMaker AI notebook instance, edit the notebook instance settings.

**To change and update the SageMaker Notebook instance type and the EBS volume**

1. On the **Notebook instances** page in the SageMaker AI console, choose your notebook instance.

1. Choose **Actions**, choose **Stop**, and then wait until the notebook instance fully stops.

1. After the notebook instance status changes to **Stopped**, choose **Actions**, and then choose **Update settings**.  
![\[Animated screenshot that shows how to update SageMaker notebook instance settings.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/get-started-ni/gs-ni-update-instance.gif)

   1. For **Notebook instance type**, choose a different ML instance type.

   1. For **Volume size in GB**, type a different integer to specify a new EBS volume size.
**Note**  
EBS storage volumes are encrypted, so SageMaker AI can't determine the amount of available free space on the volume. Because of this, you can increase the volume size when you update a notebook instance, but you can't decrease the volume size. If you want to decrease the size of the ML storage volume in use, create a new notebook instance with the desired size. 

1. At the bottom of the page, choose **Update notebook instance**. 

1. When the update is complete, **Start** the notebook instance with the new settings.

For more information about updating SageMaker notebook instance settings, see [Update a Notebook Instance](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-update.html). 

## (Optional) Advanced Settings for SageMaker Notebook Instances
<a name="gs-ni-advanced-settings"></a>

The following tutorial video shows how to set up and use SageMaker notebook instances through the SageMaker AI console. It includes advanced options, such as SageMaker AI lifecycle configuration and importing GitHub repositories. (Length: 26:04)

[![AWS Videos](http://img.youtube.com/vi/X5CLunIzj3U/0.jpg)](http://www.youtube.com/watch?v=X5CLunIzj3U)


For complete documentation about SageMaker notebook instances, see [Use Amazon SageMaker notebook Instances](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html).

# Create a Jupyter notebook in the SageMaker notebook instance
<a name="ex1-prepare"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

To start scripting for training and deploying your model, create a Jupyter notebook in the SageMaker notebook instance. Using the Jupyter notebook, you can run machine learning (ML) experiments for training and inference while using SageMaker AI features and the AWS infrastructure.

**To create a Jupyter notebook**  
![\[Animated screenshot that shows how to create a Jupyter notebook in the SageMaker AI notebook instance.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/get-started-ni/gs-ni-create-notebook.gif)

1. Open the notebook instance as follows:

   1. Sign in to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

   1. On the **Notebook instances** page, open your notebook instance by choosing either:
      + **Open JupyterLab** for the JupyterLab interface
      + **Open Jupyter** for the classic Jupyter view
**Note**  
If the notebook instance status shows **Pending** in the **Status** column, your notebook instance is still being created. The status will change to **InService** when the notebook instance is ready to use. 

1. Create a notebook as follows: 
   + If you opened the notebook in the JupyterLab view, on the **File** menu, choose **New**, and then choose **Notebook**. For **Select Kernel**, choose **conda_python3**. This preinstalled environment includes the default Anaconda installation and Python 3.
   + If you opened the notebook in the classic Jupyter view, on the **Files** tab, choose **New**, and then choose **conda_python3**. This preinstalled environment includes the default Anaconda installation and Python 3.

1. Save the notebook as follows:
   + In the JupyterLab view, choose **File**, choose **Save Notebook As...**, and then rename the notebook.
   + In the Jupyter classic view, choose **File**, choose **Save as...**, and then rename the notebook.

# Prepare a dataset
<a name="ex1-preprocess-data"></a>

In this step, you load the [Adult Census dataset](https://archive.ics.uci.edu/ml/datasets/adult) to your notebook instance using the SHAP (SHapley Additive exPlanations) Library, review the dataset, transform it, and upload it to Amazon S3. SHAP is a game theoretic approach to explain the output of any machine learning model. For more information about SHAP, see [Welcome to the SHAP documentation](https://shap.readthedocs.io/en/latest/).

To run the following example, paste the sample code into a cell in your notebook instance.

## Load Adult Census Dataset Using SHAP
<a name="ex1-preprocess-data-pull-data"></a>

Using the SHAP library, import the Adult Census dataset as shown following:

```
import shap
X, y = shap.datasets.adult()
X_display, y_display = shap.datasets.adult(display=True)
feature_names = list(X.columns)
feature_names
```

**Note**  
If the current Jupyter kernel does not have the SHAP library, install it by running the following `conda` command:  

```
%conda install -c conda-forge shap
```
If you're using JupyterLab, you must manually refresh the kernel after the installation and updates have completed. Run the following IPython script to shut down the kernel (the kernel will restart automatically):  

```
import IPython
IPython.Application.instance().kernel.do_shutdown(True)
```

The `feature_names` list object should return the following list of features: 

```
['Age',
 'Workclass',
 'Education-Num',
 'Marital Status',
 'Occupation',
 'Relationship',
 'Race',
 'Sex',
 'Capital Gain',
 'Capital Loss',
 'Hours per week',
 'Country']
```

**Tip**  
If you're starting with unlabeled data, you can use Amazon SageMaker Ground Truth to create a data labeling workflow in minutes. To learn more, see [Label Data](https://docs.aws.amazon.com/sagemaker/latest/dg/data-label.html). 

## Overview of the Dataset
<a name="ex1-preprocess-data-inspect"></a>

Run the following script to display a statistical overview of the dataset and histograms of the numeric features.

```
display(X.describe())
hist = X.hist(bins=30, sharey=True, figsize=(20, 10))
```

![\[Overview of the Adult Census dataset.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/get-started-ni/gs-ni-prepare-data-1.png)


**Tip**  
If you want to use a dataset that needs to be cleaned and transformed, you can simplify and streamline data preprocessing and feature engineering using Amazon SageMaker Data Wrangler. To learn more, see [Prepare ML Data with Amazon SageMaker Data Wrangler](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler.html).

## Split the Dataset into Train, Validation, and Test Datasets
<a name="ex1-preprocess-data-transform"></a>

Using Scikit-learn, split the dataset into a training set and a test set. The training set is used to train the model, while the test set is used to evaluate the performance of the final trained model. The dataset is randomly shuffled with a fixed random seed: 80 percent of the dataset goes to the training set and 20 percent to the test set.

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train_display = X_display.loc[X_train.index]
```

Split the training set to separate out a validation set. The validation set is used to evaluate the performance of the trained model while tuning the model's hyperparameters. 75 percent of the training set becomes the final training set, and the rest is the validation set.

```
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)
X_train_display = X_display.loc[X_train.index]
X_val_display = X_display.loc[X_val.index]
```
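The two splits above yield a 60/20/20 train/validation/test partition. As a quick sanity check, the following sketch (using stand-in data rather than the Adult Census dataset) reproduces the same split ratios:

```
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 1,000 samples, so the resulting sizes are easy to read.
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000) % 2

# Same ratios as the tutorial: 20% test, then 25% of the remainder as validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```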

Using the pandas package, explicitly align each dataset by concatenating the numeric features with the true labels.

```
import pandas as pd
train = pd.concat([pd.Series(y_train, index=X_train.index,
                             name='Income>50K', dtype=int), X_train], axis=1)
validation = pd.concat([pd.Series(y_val, index=X_val.index,
                            name='Income>50K', dtype=int), X_val], axis=1)
test = pd.concat([pd.Series(y_test, index=X_test.index,
                            name='Income>50K', dtype=int), X_test], axis=1)
```

Check if the dataset is split and structured as expected:

```
train
```

![\[The example training dataset.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/get-started-ni/gs-ni-prepare-data-2-train.png)


```
validation
```

![\[The example validation dataset.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/get-started-ni/gs-ni-prepare-data-2-validation.png)


```
test
```

![\[The example test dataset.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/get-started-ni/gs-ni-prepare-data-2-test.png)


## Convert the Train and Validation Datasets to CSV Files
<a name="ex1-preprocess-data-transform-2"></a>

Convert the `train` and `validation` dataframe objects to CSV files to match the input file format for the XGBoost algorithm.

```
# Use 'csv' format to store the data
# The first column is expected to be the output column
train.to_csv('train.csv', index=False, header=False)
validation.to_csv('validation.csv', index=False, header=False)
```
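To see what the resulting CSV rows look like, the following sketch (with a small stand-in frame and a hypothetical `sample_train.csv` file name) shows the layout XGBoost expects: the label in the first column, no header line, and no index column.

```
import pandas as pd

# Stand-in frame with the label in the first column, as XGBoost expects.
df = pd.DataFrame({'Income>50K': [0, 1], 'Age': [39, 50], 'Hours per week': [40, 13]})
df.to_csv('sample_train.csv', index=False, header=False)

with open('sample_train.csv') as f:
    print(f.read())
# Each row: label first, then the features, with no header line.
```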

## Upload the Datasets to Amazon S3
<a name="ex1-preprocess-data-transform-4"></a>

Using the SageMaker Python SDK and Boto3, upload the training and validation datasets to the default Amazon S3 bucket. The datasets in the S3 bucket are used by a compute-optimized SageMaker AI instance on Amazon EC2 for training. 

The following code sets up the default S3 bucket URI for your current SageMaker AI session, creates a new `demo-sagemaker-xgboost-adult-income-prediction` folder, and uploads the training and validation datasets to the `data` subfolder.

```
import sagemaker, boto3, os
bucket = sagemaker.Session().default_bucket()
prefix = "demo-sagemaker-xgboost-adult-income-prediction"

boto3.Session().resource('s3').Bucket(bucket).Object(
    os.path.join(prefix, 'data/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(
    os.path.join(prefix, 'data/validation.csv')).upload_file('validation.csv')
```

Run the following AWS CLI command to check whether the CSV files were successfully uploaded to the S3 bucket.

```
! aws s3 ls {bucket}/{prefix}/data --recursive
```

This should return the following output:

![\[Output of the CLI command to check the datasets in the S3 bucket.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/get-started-ni/gs-ni-prepare-data-3.png)


# Train a Model
<a name="ex1-train-model"></a>

In this step, you choose a training algorithm and run a training job for the model. The [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) provides framework estimators and generic estimators to train your model while orchestrating the machine learning (ML) lifecycle, accessing SageMaker AI training features and AWS infrastructure such as Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Compute Cloud (Amazon EC2), and Amazon Simple Storage Service (Amazon S3). For more information about SageMaker AI built-in framework estimators, see [Frameworks](https://sagemaker.readthedocs.io/en/stable/frameworks/index.html) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) documentation. For more information about built-in algorithms, see [Built-in algorithms and pretrained models in Amazon SageMaker](algos.md).

**Topics**
+ [Choose the Training Algorithm](#ex1-train-model-select-algorithm)
+ [Create and Run a Training Job](#ex1-train-model-sdk)

## Choose the Training Algorithm
<a name="ex1-train-model-select-algorithm"></a>

To choose the right algorithm for your dataset, you typically evaluate different models to find the one most suitable for your data. For simplicity, the [XGBoost algorithm with Amazon SageMaker AI](xgboost.md) built-in algorithm is used throughout this tutorial without pre-evaluating other models.

**Tip**  
If you want SageMaker AI to find an appropriate model for your tabular dataset, use Amazon SageMaker Autopilot, which automates a machine learning solution. For more information, see [SageMaker Autopilot](autopilot-automate-model-development.md).

## Create and Run a Training Job
<a name="ex1-train-model-sdk"></a>

After you have decided which model to use, start constructing a SageMaker AI estimator for training. This tutorial uses the XGBoost built-in algorithm with the SageMaker AI generic estimator.

**To run a model training job**

1. Import the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) and start by retrieving the basic information from your current SageMaker AI session.

   ```
   import sagemaker
   
   region = sagemaker.Session().boto_region_name
   print("AWS Region: {}".format(region))
   
   role = sagemaker.get_execution_role()
   print("RoleArn: {}".format(role))
   ```

   This returns the following information:
   + `region` – The current AWS Region where the SageMaker AI notebook instance is running.
   + `role` – The IAM role used by the notebook instance.
**Note**  
Check the SageMaker Python SDK version by running `sagemaker.__version__`. This tutorial is based on `sagemaker>=2.20`. If the SDK is outdated, install the latest version by running the following command:   

   ```
   ! pip install -qU sagemaker
   ```
If you run this installation in your existing SageMaker Studio or notebook instance kernels, you need to manually refresh the kernel to finish applying the version update.

1. Create an XGBoost estimator using the `sagemaker.estimator.Estimator` class. In the following example code, the XGBoost estimator is named `xgb_model`.

   ```
   from sagemaker.debugger import Rule, ProfilerRule, rule_configs
   from sagemaker.session import TrainingInput
   
   s3_output_location='s3://{}/{}/{}'.format(bucket, prefix, 'xgboost_model')
   
   container=sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")
   print(container)
   
   xgb_model=sagemaker.estimator.Estimator(
       image_uri=container,
       role=role,
       instance_count=1,
       instance_type='ml.m4.xlarge',
       volume_size=5,
       output_path=s3_output_location,
       sagemaker_session=sagemaker.Session(),
       rules=[
           Rule.sagemaker(rule_configs.create_xgboost_report()),
           ProfilerRule.sagemaker(rule_configs.ProfilerReport())
       ]
   )
   ```

   To construct the SageMaker AI estimator, specify the following parameters:
   + `image_uri` – Specify the training container image URI. In this example, the SageMaker AI XGBoost training container URI is specified using `sagemaker.image_uris.retrieve`.
   + `role` – The AWS Identity and Access Management (IAM) role that SageMaker AI uses to perform tasks on your behalf (for example, reading training data and model artifacts from Amazon S3, and writing training results to Amazon S3). 
   + `instance_count` and `instance_type` – The type and number of Amazon EC2 ML compute instances to use for model training. For this training exercise, you use a single `ml.m4.xlarge` instance, which has 4 vCPUs, 16 GB of memory, Amazon Elastic Block Store (Amazon EBS) storage, and high network performance. For more information about EC2 compute instance types, see [Amazon EC2 Instance Types](https://aws.amazon.com/ec2/instance-types/). For more information about billing, see [Amazon SageMaker pricing](https://aws.amazon.com/sagemaker/pricing/). 
   + `volume_size` – The size, in GB, of the EBS storage volume to attach to the training instance. This must be large enough to store training data if you use `File` mode (`File` mode is on by default). If you don't specify this parameter, its value defaults to 30.
   + `output_path` – The path to the S3 bucket where SageMaker AI stores the model artifact and training results.
   + `sagemaker_session` – The session object that manages interactions with SageMaker API operations and other AWS services that the training job uses.
   + `rules` – Specify a list of SageMaker Debugger built-in rules. In this example, the `create_xgboost_report()` rule creates an XGBoost report that provides insights into the training progress and results, and the `ProfilerReport()` rule creates a report regarding the EC2 compute resource utilization. For more information, see [SageMaker Debugger interactive report for XGBoost](debugger-report-xgboost.md).
**Tip**  
If you want to run distributed training of large deep learning models, such as convolutional neural networks (CNNs) and natural language processing (NLP) models, use the SageMaker AI distributed training libraries for data parallelism or model parallelism. For more information, see [Distributed training in Amazon SageMaker AI](distributed-training.md).

1. Set the hyperparameters for the XGBoost algorithm by calling the `set_hyperparameters` method of the estimator. For a complete list of XGBoost hyperparameters, see [XGBoost hyperparameters](xgboost_hyperparameters.md).

   ```
   xgb_model.set_hyperparameters(
       max_depth = 5,
       eta = 0.2,
       gamma = 4,
       min_child_weight = 6,
       subsample = 0.7,
       objective = "binary:logistic",
       num_round = 1000
   )
   ```
**Tip**  
You can also tune the hyperparameters using the SageMaker AI hyperparameter optimization feature. For more information, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md). 

1. Use the `TrainingInput` class to configure a data input flow for training. The following example code shows how to configure `TrainingInput` objects to use the training and validation datasets you uploaded to Amazon S3 in the [Split the Dataset into Train, Validation, and Test Datasets](ex1-preprocess-data.md#ex1-preprocess-data-transform) section.

   ```
   from sagemaker.session import TrainingInput
   
   train_input = TrainingInput(
       "s3://{}/{}/{}".format(bucket, prefix, "data/train.csv"), content_type="csv"
   )
   validation_input = TrainingInput(
       "s3://{}/{}/{}".format(bucket, prefix, "data/validation.csv"), content_type="csv"
   )
   ```

1. To start model training, call the estimator's `fit` method with the training and validation datasets. By setting `wait=True`, the `fit` method displays progress logs and waits until training is complete.

   ```
   xgb_model.fit({"train": train_input, "validation": validation_input}, wait=True)
   ```

   For more information about model training, see [Train a Model with Amazon SageMaker](how-it-works-training.md). The training job for this tutorial might take up to 10 minutes.

   After the training job has completed, you can download an XGBoost training report and a profiling report generated by SageMaker Debugger. The XGBoost training report offers insights into the training progress and results, such as the loss function with respect to iteration, feature importance, confusion matrix, accuracy curves, and other statistical results of training. For example, you can find the following loss curve from the XGBoost training report, which clearly indicates an overfitting problem.  
![\[The chart in the XGBoost training report.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/get-started-ni/gs-ni-train-loss-curve-validation-overfitting.png)

   Run the following code to specify the S3 bucket URI where the Debugger training reports are generated and to check whether the reports exist.

   ```
   rule_output_path = xgb_model.output_path + "/" + xgb_model.latest_training_job.job_name + "/rule-output"
   ! aws s3 ls {rule_output_path} --recursive
   ```

   Download the Debugger XGBoost training and profiling reports to the current workspace:

   ```
   ! aws s3 cp {rule_output_path} ./ --recursive
   ```

   Run the following IPython script to get the file link of the XGBoost training report:

   ```
   from IPython.display import FileLink, FileLinks
   display("Click link below to view the XGBoost Training report", FileLink("CreateXgboostReport/xgboost_report.html"))
   ```

   The following IPython script returns the file link of the Debugger profiling report, which shows summaries and details of the EC2 instance resource utilization, system bottleneck detection results, and Python operation profiling results:

   ```
   profiler_report_name = [rule["RuleConfigurationName"] 
                           for rule in xgb_model.latest_training_job.rule_job_summary() 
                           if "Profiler" in rule["RuleConfigurationName"]][0]
   profiler_report_name
   display("Click link below to view the profiler report", FileLink(profiler_report_name+"/profiler-output/profiler-report.html"))
   ```
**Tip**  
If the HTML reports do not render plots in the JupyterLab view, you must choose **Trust HTML** at the top of the reports.  
To identify training issues, such as overfitting, vanishing gradients, and other problems that prevent your model from converging, use SageMaker Debugger and take automated actions while prototyping and training your ML models. For more information, see [Amazon SageMaker Debugger](train-debugger.md). To find a complete analysis of model parameters, see the [Explainability with Amazon SageMaker Debugger](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/xgboost_census_explanations/xgboost-census-debugger-rules.html#Explainability-with-Amazon-SageMaker-Debugger) example notebook. 

You now have a trained XGBoost model. SageMaker AI stores the model artifact in your S3 bucket. To find the location of the model artifact, run the following code to print the `model_data` attribute of the `xgb_model` estimator:

```
xgb_model.model_data
```

**Tip**  
To measure biases that can occur during each stage of the ML lifecycle (data collection, model training and tuning, and monitoring of ML models deployed for prediction), use SageMaker Clarify. For more information, see [Model Explainability](clarify-model-explainability.md). For an end-to-end example, see the [Fairness and Explainability with SageMaker Clarify](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-clarify/fairness_and_explainability/fairness_and_explainability.html) example notebook.

# Deploy the model to Amazon EC2
<a name="ex1-model-deployment"></a>

To get predictions, deploy your model to Amazon EC2 using Amazon SageMaker AI.

**Topics**
+ [Deploy the Model to SageMaker AI Hosting Services](#ex1-deploy-model)
+ [(Optional) Use SageMaker AI Predictor to Reuse the Hosted Endpoint](#ex1-deploy-model-sdk-use-endpoint)
+ [(Optional) Make Prediction with Batch Transform](#ex1-batch-transform)

## Deploy the Model to SageMaker AI Hosting Services
<a name="ex1-deploy-model"></a>

To host a model through Amazon EC2 using Amazon SageMaker AI, deploy the model that you trained in [Create and Run a Training Job](ex1-train-model.md#ex1-train-model-sdk) by calling the `deploy` method of the `xgb_model` estimator. When you call the `deploy` method, you must specify the number and type of EC2 ML instances that you want to use for hosting an endpoint.

```
import sagemaker
from sagemaker.serializers import CSVSerializer
xgb_predictor=xgb_model.deploy(
    initial_instance_count=1,
    instance_type='ml.t2.medium',
    serializer=CSVSerializer()
)
```
+ `initial_instance_count` (int) – The number of instances on which to deploy the model.
+ `instance_type` (str) – The type of instance on which to run your deployed model.
+ `serializer` – A serializer object that converts input data of various formats (a NumPy array, list, file, or buffer) to a CSV-formatted string. This example uses a `CSVSerializer` because the XGBoost algorithm accepts input files in CSV format.

The `deploy` method creates a deployable model, configures the SageMaker AI hosting services endpoint, and launches the endpoint to host the model. For more information, see the [SageMaker AI generic Estimator's deploy class method](https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.Estimator.deploy) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable). To retrieve the name of endpoint that's generated by the `deploy` method, run the following code:

```
xgb_predictor.endpoint_name
```

This should return the endpoint name of the `xgb_predictor`. The endpoint name has the format `"sagemaker-xgboost-YYYY-MM-DD-HH-MM-SS-SSS"`. The endpoint remains active on the ML instance, and you can make real-time predictions at any time until you delete it. Copy this endpoint name and save it so you can reuse it to make real-time predictions elsewhere in SageMaker Studio or SageMaker AI notebook instances.

**Tip**  
To learn more about compiling and optimizing your model for deployment to Amazon EC2 instances or edge devices, see [Compile and Deploy Models with Neo](https://docs.aws.amazon.com/sagemaker/latest/dg/neo.html).

## (Optional) Use SageMaker AI Predictor to Reuse the Hosted Endpoint
<a name="ex1-deploy-model-sdk-use-endpoint"></a>

After you deploy the model to an endpoint, you can set up a new SageMaker AI predictor that attaches to the endpoint and make real-time predictions from any other notebook. The following example code shows how to use the SageMaker AI Predictor class to set up a new predictor object against the same endpoint. Reuse the endpoint name that you used for the `xgb_predictor`.

```
import sagemaker
xgb_predictor_reuse=sagemaker.predictor.Predictor(
    endpoint_name="sagemaker-xgboost-YYYY-MM-DD-HH-MM-SS-SSS",
    sagemaker_session=sagemaker.Session(),
    serializer=sagemaker.serializers.CSVSerializer()
)
```

The `xgb_predictor_reuse` Predictor behaves exactly the same as the original `xgb_predictor`. For more information, see the [SageMaker AI Predictor](https://sagemaker.readthedocs.io/en/stable/predictors.html#sagemaker.predictor.RealTimePredictor) class in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

## (Optional) Make Prediction with Batch Transform
<a name="ex1-batch-transform"></a>

Instead of hosting an endpoint in production, you can run a one-time batch inference job to make predictions on a test dataset using SageMaker AI batch transform. After your model training has completed, you can extend the estimator to a `transformer` object, which is based on the [SageMaker AI Transformer](https://sagemaker.readthedocs.io/en/stable/api/inference/transformer.html) class. The batch transformer reads input data from a specified S3 bucket and makes predictions.

**To run a batch transform job**

1. Run the following code to convert the feature columns of the test dataset to a CSV file and upload it to the S3 bucket:

   ```
   X_test.to_csv('test.csv', index=False, header=False)
   
   boto3.Session().resource('s3').Bucket(bucket).Object(
   os.path.join(prefix, 'test/test.csv')).upload_file('test.csv')
   ```

1. Specify S3 bucket URIs of input and output for the batch transform job as shown following:

   ```
   # The location of the test dataset
   batch_input = 's3://{}/{}/test'.format(bucket, prefix)
   
   # The location to store the results of the batch transform job
   batch_output = 's3://{}/{}/batch-prediction'.format(bucket, prefix)
   ```

1. Create a transformer object, specifying the minimal set of parameters: `instance_count` and `instance_type` to run the batch transform job, and `output_path` to save the prediction data, as shown following: 

   ```
   transformer = xgb_model.transformer(
       instance_count=1, 
       instance_type='ml.m4.xlarge', 
       output_path=batch_output
   )
   ```

1. Initiate the batch transform job by executing the `transform()` method of the `transformer` object as shown following:

   ```
   transformer.transform(
       data=batch_input, 
       data_type='S3Prefix',
       content_type='text/csv', 
       split_type='Line'
   )
   transformer.wait()
   ```

1. When the batch transform job is complete, SageMaker AI creates the `test.csv.out` prediction data saved in the `batch_output` path, which should be in the following format: `s3://sagemaker-<region>-111122223333/demo-sagemaker-xgboost-adult-income-prediction/batch-prediction`. Run the following AWS CLI command to download the output data of the batch transform job:

   ```
   ! aws s3 cp {batch_output} ./ --recursive
   ```

   This should create the `test.csv.out` file under the current working directory. It contains float values predicted by the logistic regression objective of the XGBoost training job.
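As a sketch of post-processing the batch output, you can load the predicted probabilities with NumPy and apply a cutoff. The stand-in file written below is only so the snippet runs outside the tutorial environment; when you have downloaded the real `test.csv.out`, the fallback is skipped.

```
import os
import numpy as np

# Create a tiny stand-in file if the real batch output isn't present,
# so this sketch also runs outside the tutorial environment.
if not os.path.exists('test.csv.out'):
    with open('test.csv.out', 'w') as f:
        f.write('0.12\n0.87\n0.45\n')

# Each line of test.csv.out is one predicted probability.
preds = np.loadtxt('test.csv.out')

# Apply an example cutoff of 0.5 to turn probabilities into class labels.
labels = (preds > 0.5).astype(int)
print(labels[:10])
```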

# Evaluate the model
<a name="ex1-test-model"></a>

Now that you have trained and deployed a model using Amazon SageMaker AI, evaluate the model to ensure that it generates accurate predictions on new data. For model evaluation, use the test dataset that you created in [Prepare a dataset](ex1-preprocess-data.md).

## Evaluate the Model Deployed to SageMaker AI Hosting Services
<a name="ex1-test-model-endpoint"></a>

To evaluate the model before using it in production, invoke the endpoint with the test dataset and check whether the inferences you get achieve the target accuracy you want.

**To evaluate the model**

1. Set up the following function to predict each line of the test set. In the following example code, the `rows` argument specifies the number of lines to predict at a time. You can change its value to perform batch inference that fully utilizes the instance's hardware resources.

   ```
   import numpy as np
   def predict(data, rows=1000):
       # Split the dataset into mini-batches of up to `rows` samples each.
       split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
       predictions = ''
       for array in split_array:
           # Each response is a comma-separated string of predicted probabilities.
           predictions = ','.join([predictions, xgb_predictor.predict(array).decode('utf-8')])
       # Drop the leading comma and parse the accumulated string into floats.
       return np.fromstring(predictions[1:], sep=',')
   ```

1. Run the following code to make predictions for the test dataset and plot a histogram. Use only the feature columns of the test dataset, excluding the 0th column, which contains the actual labels.

   ```
   import matplotlib.pyplot as plt
   
   predictions=predict(test.to_numpy()[:,1:])
   plt.hist(predictions)
   plt.show()
   ```  
![\[A histogram of predicted values.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/get-started-ni/gs-ni-eval-predicted-values-histogram.png)

1. The predicted values are of float type. To determine `True` or `False` from the float values, you need to set a cutoff value. As shown in the following example code, use the Scikit-learn library to output the confusion matrix and classification report with a cutoff of 0.5.

   ```
   import sklearn
   
   cutoff=0.5
   print(sklearn.metrics.confusion_matrix(test.iloc[:, 0], np.where(predictions > cutoff, 1, 0)))
   print(sklearn.metrics.classification_report(test.iloc[:, 0], np.where(predictions > cutoff, 1, 0)))
   ```

   This should return the following confusion matrix:  
![\[An example of confusion matrix and statistics after getting the inference of the deployed model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/get-started-ni/gs-ni-evaluate-confusion-matrix.png)

1. To find the best cutoff with the given test set, compute the log loss function of the logistic regression. The log loss function is defined as the negative log-likelihood of a logistic model that returns prediction probabilities for its ground truth labels. The following example code numerically and iteratively calculates the log loss values (`-(y*log(p) + (1-y)*log(1-p))`), where `y` is the true label and `p` is the probability estimate of the corresponding test sample. It returns a log loss versus cutoff graph.

   ```
   import matplotlib.pyplot as plt
   
   cutoffs = np.arange(0.01, 1, 0.01)
   log_loss = []
   for c in cutoffs:
       log_loss.append(
           sklearn.metrics.log_loss(test.iloc[:, 0], np.where(predictions > c, 1, 0))
       )
   
   plt.figure(figsize=(15,10))
   plt.plot(cutoffs, log_loss)
   plt.xlabel("Cutoff")
   plt.ylabel("Log loss")
   plt.show()
   ```

   This should return the following log loss curve.  
![\[Example following log loss curve.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/get-started-ni/gs-ni-evaluate-logloss-vs-cutoff.png)

1. Find the minimum points of the error curve using the NumPy `argmin` and `min` functions:

   ```
   print(
       'Log loss is minimized at a cutoff of ', cutoffs[np.argmin(log_loss)], 
       ', and the log loss value at the minimum is ', np.min(log_loss)
   )
   ```

   This should return: `Log loss is minimized at a cutoff of 0.53, and the log loss value at the minimum is 4.348539186773897`.

   Instead of computing and minimizing the log loss function, you can estimate a cost function as an alternative. For example, if you want to train a model to perform binary classification for a business problem, such as customer churn prediction, you can assign weights to the elements of the confusion matrix and calculate the cost function accordingly.
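A minimal sketch of that idea, using synthetic labels and probabilities and assumed per-cell costs (here, a false negative costing five times a false positive), sweeps the cutoffs and reports the cost-minimizing one:

```
import numpy as np

# Synthetic stand-ins for the true labels and predicted probabilities.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=1000)
p = np.clip(0.6 * y_true + rng.normal(0.2, 0.25, size=1000), 0, 1)

# Assumed business costs: each false positive costs 1, each false negative costs 5.
FP_COST, FN_COST = 1, 5

cutoffs = np.arange(0.01, 1, 0.01)
costs = []
for c in cutoffs:
    pred = (p > c).astype(int)
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    costs.append(FP_COST * fp + FN_COST * fn)

best = cutoffs[np.argmin(costs)]
print("Cost is minimized at a cutoff of", round(float(best), 2))
```

Because false negatives are weighted more heavily here, the cost-minimizing cutoff tends to sit lower than the log-loss-minimizing one.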

You have now trained, deployed, and evaluated your first model in SageMaker AI.

**Tip**  
To monitor model quality, data quality, and bias drift, use Amazon SageMaker Model Monitor and SageMaker AI Clarify. To learn more, see [Amazon SageMaker Model Monitor](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html), [Monitor Data Quality](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-data-quality.html), [Monitor Model Quality](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-model-quality.html), [Monitor Bias Drift](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-model-monitor-bias-drift.html), and [Monitor Feature Attribution Drift](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-model-monitor-feature-attribution-drift.html).

**Tip**  
To get human review of low confidence ML predictions or a random sample of predictions, use Amazon Augmented AI human review workflows. For more information, see [Using Amazon Augmented AI for Human Review](https://docs.aws.amazon.com/sagemaker/latest/dg/a2i-use-augmented-ai-a2i-human-review-loops.html).

# Clean up Amazon SageMaker notebook instance resources
<a name="ex1-cleanup"></a>

To avoid incurring unnecessary charges, use the AWS Management Console to delete the endpoints and resources that you created while running the exercises. 

**Note**  
Training jobs and logs cannot be deleted and are retained indefinitely.

**Note**  
If you plan to explore other exercises in this guide, you might want to keep some of these resources, such as your notebook instance, S3 bucket, and IAM role.

 

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/) and delete the following resources:
   + The endpoint. Deleting the endpoint also deletes the ML compute instance or instances that support it.

     1. Under **Inference**, choose **Endpoints**.

     1. Choose the endpoint that you created in the example, choose **Actions**, and then choose **Delete**.
   + The endpoint configuration.

     1. Under **Inference**, choose **Endpoint configurations**.

     1. Choose the endpoint configuration that you created in the example, choose **Actions**, and then choose **Delete**.
   + The model.

     1. Under **Inference**, choose **Models**.

     1. Choose the model that you created in the example, choose **Actions**, and then choose **Delete**.
   + The notebook instance. Before deleting the notebook instance, stop it.

     1. Under **Notebook**, choose **Notebook instances**.

     1. Choose the notebook instance that you created in the example, choose **Actions**, and then choose **Stop**. The notebook instance takes several minutes to stop. When the **Status** changes to **Stopped**, move on to the next step.

     1. Choose **Actions**, and then choose **Delete**.

1. Open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/), and then delete the bucket that you created for storing model artifacts and the training dataset. 

1. Open the Amazon CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/), and then delete all of the log groups that have names starting with `/aws/sagemaker/`.

# AL2023 notebook instances
<a name="nbi-al2023"></a>

Amazon SageMaker notebook instances currently support the AL2023 operating system, which is the latest and recommended operating system for notebook instances. You can select the operating system that your notebook instance is based on when you create the notebook instance.

SageMaker AI supports notebook instances based on the following AL2023 operating systems.
+ **notebook-al2023-v1**: These notebook instances support JupyterLab version 4. For information about JupyterLab versions, see [JupyterLab versioning](nbi-jl.md).

**Topics**
+ [Supported instance types](#nbi-al2023-instances)
+ [Available kernels](#nbi-al2023-kernel)

## Supported instance types
<a name="nbi-al2023-instances"></a>

AL2023 supports the instance types listed under **Notebook Instances** in [SageMaker AI Pricing](https://aws.amazon.com/sagemaker/pricing/), with the exception that AL2023 does not support `ml.p2`, `ml.p3`, or `ml.g3` instances.

## Available kernels
<a name="nbi-al2023-kernel"></a>

The following table gives information about the available kernels for SageMaker notebook instances. All of these images are supported on notebook instances based on the `notebook-al2023-v1` operating system.


| Kernel name | Description | 
| --- | --- | 
| R | A kernel used to perform data analysis and visualization using R code from a Jupyter notebook. | 
| Sparkmagic (PySpark) | A kernel used to do data science with remote Spark clusters from Jupyter notebooks using the Python programming language. This kernel comes with Python 3.10. | 
| Sparkmagic (Spark) | A kernel used to do data science with remote Spark clusters from Jupyter notebooks using the Scala programming language. This kernel comes with Python 3.10. | 
| Sparkmagic (SparkR) | A kernel used to do data science with remote Spark clusters from Jupyter notebooks using the R programming language. This kernel comes with Python 3.10. | 
| conda_python3 | A conda environment that comes pre-installed with popular packages for data science and machine learning. This kernel comes with Python 3.10. | 
| conda_pytorch | A conda environment that comes pre-installed with PyTorch version 2.7.0, as well as popular data science and machine learning packages. This kernel comes with Python 3.10. | 
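To confirm which of these kernels are actually installed on a given instance, you can inspect the kernel specs from a terminal. This is a sketch, assuming you have opened a terminal on an in-service notebook instance; the environment name `python3` is an assumption about the conda environment that typically backs the `conda_python3` kernel.

```shell
# List the Jupyter kernels installed on the notebook instance.
jupyter kernelspec list

# Show the Python version inside a given conda environment.
conda run -n python3 python --version
```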

# Amazon Linux 2 notebook instances
<a name="nbi-al2"></a>

**Important**  
JupyterLab 1 and JupyterLab 3 are no longer supported as of June 30, 2025. You can no longer create new or restart stopped notebook instances using these versions. Existing in-service instances may continue to function but will not receive security updates or bug fixes. Migrate to JupyterLab 4 notebook instances for continued support. For more information, see [JupyterLab version maintenance](nbi-jl.md#nbi-jl-version-maintenance).

**Note**  
AL2023 is the latest and recommended operating system available for notebook instances. To learn more, see [AL2023 notebook instances](nbi-al2023.md).

Amazon SageMaker notebook instances currently support Amazon Linux 2 (AL2) operating systems. You can select the operating system that your notebook instance is based on when you create the notebook instance.

SageMaker AI supports notebook instances based on the following Amazon Linux 2 operating systems.
+ **notebook-al2-v1** (deprecated): These notebook instances supported JupyterLab version 1. As of June 30, 2025, you can no longer create new instances with this platform identifier. For information about JupyterLab versions, see [JupyterLab versioning](nbi-jl.md).
+ **notebook-al2-v2** (deprecated): These notebook instances supported JupyterLab version 3. As of June 30, 2025, you can no longer create new instances with this platform identifier. For information about JupyterLab versions, see [JupyterLab versioning](nbi-jl.md).
+ **notebook-al2-v3**: These notebook instances support JupyterLab version 4. For information about JupyterLab versions, see [JupyterLab versioning](nbi-jl.md).

Notebook instances created before 08/18/2021 automatically run on Amazon Linux (AL1). Notebook instances based on AL1 entered a maintenance phase as of 12/01/2022 and are no longer available for new notebook instance creation as of 02/01/2023. To replace AL1, you now have the option to create Amazon SageMaker notebook instances with AL2. For more information, see [AL1 Maintenance Phase Plan](#nbi-al2-deprecation).

**Topics**
+ [Supported instance types](#nbi-al2-instances)
+ [Available Kernels](#nbi-al2-kernel)
+ [AL1 Maintenance Phase Plan](#nbi-al2-deprecation)

## Supported instance types
<a name="nbi-al2-instances"></a>

Amazon Linux 2 supports instance types listed under **Notebook Instances** in [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/) with the exception that Amazon Linux 2 does not support `ml.p2` instances.

## Available Kernels
<a name="nbi-al2-kernel"></a>

The following table gives information about the available kernels for SageMaker notebook instances. All of these images are supported on notebook instances based on the `notebook-al2-v1`, `notebook-al2-v2`, and `notebook-al2-v3` operating systems.

SageMaker notebook instance kernels


| Kernel name | Description | 
| --- | --- | 
| R | A kernel used to perform data analysis and visualization using R code from a Jupyter notebook. | 
| Sparkmagic (PySpark) | A kernel used to do data science with remote Spark clusters from Jupyter notebooks using the Python programming language. This kernel comes with Python 3.10. | 
| Sparkmagic (Spark) | A kernel used to do data science with remote Spark clusters from Jupyter notebooks using the Scala programming language. This kernel comes with Python 3.10. | 
| Sparkmagic (SparkR) | A kernel used to do data science with remote Spark clusters from Jupyter notebooks using the R programming language. This kernel comes with Python 3.10. | 
| conda_python3 | A conda environment that comes pre-installed with popular packages for data science and machine learning. This kernel comes with Python 3.10. | 
| conda_pytorch_p310 |  A conda environment that comes pre-installed with PyTorch version 2.2.0, as well as popular data science and machine learning packages. This kernel comes with Python 3.10. | 
| conda_tensorflow2_p310 | A conda environment that comes pre-installed with TensorFlow version 2.16.0, as well as popular data science and machine learning packages. This kernel comes with Python 3.10. | 

## AL1 Maintenance Phase Plan
<a name="nbi-al2-deprecation"></a>

The following table is a timeline for when AL1 entered its extended maintenance phase. The AL1 maintenance phase also coincides with the deprecation of Python 2 and Chainer. Notebooks based on AL2 do not have managed Python 2 and Chainer kernels.


|  Date  |  Description  | 
| --- | --- | 
|  08/18/2021  |  Notebook instances based on AL2 are launched. Newly launched notebook instances still default to AL1. AL1 is supported with security patches and updates, but no new features. You can choose between the two operating systems when launching a new notebook instance.  | 
|  10/31/2022  |  The default platform identifier for SageMaker notebook instances changes from Amazon Linux (al1-v1) to Amazon Linux 2 (al2-v2). You can choose between the two operating systems when launching a new notebook instance.  | 
|  12/01/2022  |  AL1 is no longer supported with non-critical security patches and updates. AL1 still receives fixes for [critical](https://nvd.nist.gov/vuln-metrics/cvss#) security-related issues. You can still launch instances on AL1, but assume the risks associated with using an unsupported operating system.  | 
|  02/01/2023  |  AL1 is no longer an available option for new notebook instance creation. After this date, customers can create notebook instances with the AL2 platform identifiers. Existing notebooks with an `INSERVICE` status should be migrated to the latest platform since continuous availability of AL1 notebook instances cannot be guaranteed.  | 
|  03/31/2024  |  AL1 reaches its end of life on notebook instances on March 31, 2024. After this date, AL1 no longer receives security updates or bug fixes and is not available for new notebook instance creation.  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/nbi-al2.html)  | 

### Migrating to Amazon Linux 2
<a name="nbi-al2-upgrade"></a>

Your existing AL1 notebook instance is not automatically migrated to Amazon Linux 2. To upgrade your AL1 notebook instance to Amazon Linux 2, you must create a new notebook instance, replicate your code and environment, and then delete your old notebook instance. For more information, see the [Amazon Linux 2 migration blog](https://aws.amazon.com/blogs/machine-learning/migrate-your-work-to-amazon-sagemaker-notebook-instance-with-amazon-linux-2/).

# JupyterLab versioning
<a name="nbi-jl"></a>

**Important**  
JupyterLab 1 and JupyterLab 3 are no longer supported as of June 30, 2025. You can no longer create new or restart stopped notebook instances using these versions. Existing in-service instances may continue to function but will not receive security updates or bug fixes. Migrate to JupyterLab 4 notebook instances for continued support. For more information, see [JupyterLab version maintenance](#nbi-jl-version-maintenance).

The Amazon SageMaker notebook instance interface is based on JupyterLab, a web-based interactive development environment for notebooks, code, and data. Notebook instances support JupyterLab 1, JupyterLab 3, or JupyterLab 4. A single notebook instance can run at most one instance of JupyterLab, but you can have multiple notebook instances with different JupyterLab versions. 

You can configure your notebook to run your preferred JupyterLab version by selecting the appropriate platform identifier. Use either the AWS CLI or the SageMaker AI console when creating your notebook instance. For more information about platform identifiers, see [AL2023 notebook instances](nbi-al2023.md) and [Amazon Linux 2 notebook instances](nbi-al2.md). If you don’t explicitly configure a platform identifier, your notebook instance defaults to running JupyterLab 1. 

**Topics**
+ [JupyterLab version maintenance](#nbi-jl-version-maintenance)
+ [JupyterLab 4](#nbi-jl-4)
+ [JupyterLab 3](#nbi-jl-3)
+ [Create a notebook with your JupyterLab version](nbi-jl-create.md)
+ [View the JupyterLab version of a notebook from the console](nbi-jl-view.md)

## JupyterLab version maintenance
<a name="nbi-jl-version-maintenance"></a>

JupyterLab 1 and JupyterLab 3 platforms reached end of standard support on June 30, 2025. As of this date:
+ You can no longer create new or restart stopped JupyterLab 1 and JupyterLab 3 notebook instances.
+ Existing in-service JupyterLab 1 and JupyterLab 3 notebook instances may continue to function, but no longer receive SageMaker AI security updates or critical bug fixes.
+ You are responsible for managing the security of these deprecated instances.
+ If issues arise with existing JupyterLab 1 or JupyterLab 3 notebook instances, SageMaker AI cannot guarantee their continued availability. You must migrate your workload to a JupyterLab 4 notebook instance.

Migrate your work to JupyterLab 4 notebook instances (the latest version's platform identifier is [notebook-al2023-v1](nbi-al2023.md)) to ensure you have a secure and supported environment. This allows you to use the latest versions of Jupyter Notebook, JupyterLab, and other ML libraries. For instructions, see [Migrate your work to a SageMaker AI notebook instance with Amazon Linux 2](https://aws.amazon.com/blogs/machine-learning/migrate-your-work-to-amazon-sagemaker-notebook-instance-with-amazon-linux-2/).

## JupyterLab 4
<a name="nbi-jl-4"></a>

JupyterLab 4 support is available on the AL2023 and Amazon Linux 2 operating system platforms. JupyterLab 4 includes the following features that are not available in JupyterLab 3:
+ Optimized rendering for a faster experience
+ Opt-in settings for faster tab switching and better performance with long notebooks. For more information, see the blog post [ JupyterLab 4.0 is Here](https://blog.jupyter.org/jupyterlab-4-0-is-here-388d05e03442).
+ An upgraded text editor
+ A new extension manager that installs extensions from PyPI
+ UI improvements, including document search and accessibility improvements

You can run JupyterLab 4 by specifying [notebook-al2023-v1](nbi-al2023.md) (the latest and recommended version) or [notebook-al2-v3](nbi-al2.md) as the platform identifier when creating your notebook instance.

**Note**  
If you attempt to migrate to a JupyterLab 4 Notebook Instance from another JupyterLab version, the package version changes between JupyterLab 3 and JupyterLab 4 might break any existing lifecycle configurations or Jupyter/JupyterLab extensions.

**Package version changes**

JupyterLab 4 has the following package version changes from JupyterLab 3:
+ JupyterLab has been upgraded from 3.x to 4.x.
+ Jupyter notebook has been upgraded from 6.x to 7.x.
+ jupyterlab-git has been updated to version 0.50.0.

## JupyterLab 3
<a name="nbi-jl-3"></a>

**Important**  
JupyterLab 1 and JupyterLab 3 are no longer supported as of June 30, 2025. You can no longer create new or restart stopped notebook instances using these versions. Existing in-service instances may continue to function but will not receive security updates or bug fixes. Migrate to JupyterLab 4 notebook instances for continued support. For more information, see [JupyterLab version maintenance](#nbi-jl-version-maintenance).

JupyterLab 3 support is available only on the Amazon Linux 2 operating system platform. JupyterLab 3 includes the following features that are not available in JupyterLab 1. For more information about these features, see [JupyterLab 3.0 is released](https://blog.jupyter.org/jupyterlab-3-0-is-out-4f58385e25bb). 
+  Visual debugger when using the following kernels: 
  + conda_pytorch_p38
  + conda_tensorflow2_p38
  + conda_amazonei_pytorch_latest_p37
+ File browser filter
+ Table of Contents (TOC)
+ Multi-language support
+ Simple mode
+ Single interface mode
+ Live editing SVG files with updated rendering
+ User interface for notebook cell tags

### Important changes to JupyterLab 3
<a name="nbi-jl-3-changes"></a>

 For information about important changes when using JupyterLab 3, see the following JupyterLab change logs: 
+  [v2.0.0](https://github.com/jupyterlab/jupyterlab/releases) 
+  [v3.0.0](https://jupyterlab.readthedocs.io/en/stable/getting_started/changelog.html#for-developers) 

 **Package version changes** 

 JupyterLab 3 has the following package version changes from JupyterLab 1: 
+  JupyterLab has been upgraded from 1.x to 3.x.
+  Jupyter notebook has been upgraded from 5.x to 6.x.
+  jupyterlab-git has been updated to version 0.37.1.
+  nbserverproxy 0.x (0.3.2) has been replaced with jupyter-server-proxy 3.x (3.2.1).

# Create a notebook with your JupyterLab version
<a name="nbi-jl-create"></a>

**Important**  
JupyterLab 1 and JupyterLab 3 are no longer supported as of June 30, 2025. You can no longer create new or restart stopped notebook instances using these versions. Existing in-service instances may continue to function but will not receive security updates or bug fixes. Migrate to JupyterLab 4 notebook instances for continued support. For more information, see [JupyterLab version maintenance](nbi-jl.md#nbi-jl-version-maintenance).

 You can select the JupyterLab version when creating your notebook instance from the console following the steps in [Create an Amazon SageMaker notebook instance](howitworks-create-ws.md). 

 You can also select the JupyterLab version by passing the `platform-identifier` parameter when creating your notebook instance using the AWS CLI as follows: 

```
aws sagemaker create-notebook-instance --notebook-instance-name <NEW_NOTEBOOK_NAME> \
    --instance-type <INSTANCE_TYPE> \
    --role-arn <YOUR_ROLE_ARN> \
    --platform-identifier notebook-al2-v3
```

# View the JupyterLab version of a notebook from the console
<a name="nbi-jl-view"></a>

**Important**  
JupyterLab 1 and JupyterLab 3 are no longer supported as of June 30, 2025. You can no longer create new or restart stopped notebook instances using these versions. Existing in-service instances may continue to function but will not receive security updates or bug fixes. Migrate to JupyterLab 4 notebook instances for continued support. For more information, see [JupyterLab version maintenance](nbi-jl.md#nbi-jl-version-maintenance).

 You can view the JupyterLab version of a notebook using the following procedure: 

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. From the left navigation, select **Notebook**.

1.  From the dropdown menu, select **Notebook instances** to navigate to the **Notebook instances** page. 

1.  From the list of notebook instances, select your notebook instance name. 

1.  On the **Notebook instance settings** page, view the **Platform Identifier** to see the JupyterLab version of the notebook. 
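The platform identifier is also available from the AWS CLI via `describe-notebook-instance`; the instance name below is a placeholder.

```shell
# Print the platform identifier (and thus the JupyterLab version) of an instance.
aws sagemaker describe-notebook-instance \
  --notebook-instance-name my-notebook \
  --query 'PlatformIdentifier' --output text
```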

# Create an Amazon SageMaker notebook instance
<a name="howitworks-create-ws"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

An Amazon SageMaker notebook instance is an ML compute instance running the Jupyter Notebook application. SageMaker AI manages creating the instance and related resources. Use Jupyter notebooks in your notebook instance to:
+ prepare and process data
+ write code to train models
+ deploy models to SageMaker AI hosting
+ test or validate your models

To create a notebook instance, use either the SageMaker AI console or the [`CreateNotebookInstance`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateNotebookInstance.html) API.
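The API path can be exercised from the AWS CLI. This is a minimal sketch using placeholder values: the instance name, account ID, role name, instance type, and volume size are all illustrative, not prescribed.

```shell
# Placeholders -- substitute your own name, role ARN, and sizes.
aws sagemaker create-notebook-instance \
  --notebook-instance-name my-notebook \
  --instance-type ml.t3.medium \
  --role-arn arn:aws:iam::111122223333:role/MySageMakerExecutionRole \
  --platform-identifier notebook-al2023-v1 \
  --volume-size-in-gb 20

# Block until the instance reaches the InService status.
aws sagemaker wait notebook-instance-in-service --notebook-instance-name my-notebook
```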

The notebook instance type you choose depends on how you use your notebook instance. Ensure that your notebook instance is not bound by memory, CPU, or I/O. To load a dataset into memory on the notebook instance for exploration or preprocessing, choose an instance type with enough RAM for your dataset. This typically requires an instance with at least 16 GB of memory (.xlarge or larger). If you plan to use the notebook for compute-intensive preprocessing, we recommend that you choose a compute-optimized instance such as a c4 or c5.

A best practice when using a SageMaker notebook is to use the notebook instance to orchestrate other AWS services. For example, you can use the notebook instance to manage large dataset processing. To do this, make calls to AWS Glue for ETL (extract, transform, and load) services or Amazon EMR for mapping and data reduction using Hadoop. You can use AWS services as temporary forms of computation or storage for your data.

You can store and retrieve your training and test data using an Amazon Simple Storage Service (Amazon S3) bucket. You can then use SageMaker AI to train and build your model. As a result, the instance type of your notebook has no bearing on the speed of your model training and testing.

After receiving the request, SageMaker AI does the following:
+ **Creates a network interface**—If you choose the optional VPC configuration, SageMaker AI creates the network interface in your VPC. It uses the subnet ID that you provide in the request to determine the Availability Zone in which to create the network interface. SageMaker AI associates the security group that you provide in the request with the subnet. For more information, see [Connect a Notebook Instance in a VPC to External Resources](appendix-notebook-and-internet-access.md). 
+ **Launches an ML compute instance**—SageMaker AI launches an ML compute instance in a SageMaker AI VPC. SageMaker AI performs the configuration tasks that allow it to manage your notebook instance. If you specified your VPC, SageMaker AI enables traffic between your VPC and the notebook instance.
+ **Installs Anaconda packages and libraries for common deep learning platforms**—SageMaker AI installs all of the Anaconda packages that are included in the installer. For more information, see [Anaconda package list](https://docs.anaconda.com/free/anaconda/pkg-docs/). SageMaker AI also installs the TensorFlow and Apache MXNet deep learning libraries. 
+ **Attaches an ML storage volume**—SageMaker AI attaches an ML storage volume to the ML compute instance. You can use the volume as a working area to clean up the training dataset or to temporarily store validation, test, or other data. Choose any size between 5 GB and 16384 GB, in 1 GB increments, for the volume. The default is 5 GB. ML storage volumes are encrypted, so SageMaker AI can't determine the amount of available free space on the volume. Because of this, you can increase the volume size when you update a notebook instance, but you can't decrease the volume size. If you want to decrease the size of the ML storage volume in use, create a new notebook instance with the desired size.

  Only files and data saved within the `/home/ec2-user/SageMaker` folder persist between notebook instance sessions. Files and data that are saved outside this directory are overwritten when the notebook instance stops and restarts. Each notebook instance's `/tmp` directory provides a minimum of 10 GB of storage in an instance store. An instance store is temporary, block-level storage that isn't persistent. When the instance is stopped or restarted, SageMaker AI deletes the directory's contents and any operating system customizations. This temporary storage is part of the root volume of the notebook instance.

  If the notebook instance isn't updated and is running insecure software, SageMaker AI might periodically update the instance as part of regular maintenance. During these updates, data outside of the folder `/home/ec2-user/SageMaker` is not persisted. For more information about maintenance and security patches, see [Maintenance](nbi.md#nbi-maintenance).

  If the instance type used by the notebook instance has NVMe support, customers can use the NVMe instance store volumes available for that instance type. For instances with NVMe store volumes, all instance store volumes are automatically attached to the instance at launch. For more information about instance types and their associated NVMe store volumes, see the [Amazon Elastic Compute Cloud Instance Type Details](https://aws.amazon.com/ec2/instance-types/).

  To make the attached NVMe store volume available for your notebook instance, complete the steps in [Make instance store volumes available on your instance ](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/add-instance-store-volumes.html#making-instance-stores-available-on-your-instances). Complete the steps with root access or by using a lifecycle configuration script.
**Note**  
NVMe instance store volumes are not persistent storage. This storage is short-lived with the instance and must be reconfigured every time an instance with this storage is launched.
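The EC2 steps referenced above amount to formatting and mounting the device. This is a sketch, assuming root access and that the instance store appears as `/dev/nvme1n1`; verify the device name with `lsblk` first, because formatting erases the volume.

```shell
# Identify the instance store device; /dev/nvme1n1 below is an assumption.
lsblk

# Format the volume (destroys its contents) and mount it.
mkfs -t xfs /dev/nvme1n1
mkdir -p /mnt/instance-store
mount /dev/nvme1n1 /mnt/instance-store
```

Because the instance store is ephemeral, these commands must be run again after every stop and start, which makes them a natural candidate for a lifecycle configuration script.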

**To create a SageMaker AI notebook instance:**

1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/). 

1. Choose **Notebook instances**, then choose **Create notebook instance**.

1. On the **Create notebook instance** page, provide the following information: 

   1. For **Notebook instance name**, type a name for your notebook instance.

   1. For **Notebook instance type**, choose an instance size suitable for your use case. For a list of supported instance types and quotas, see [Amazon SageMaker AI Service Quotas](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html#limits_sagemaker).

   1. For **Platform Identifier**, choose a platform type to create the notebook instance on. This platform type dictates the Operating System and the JupyterLab version that your notebook instance is created with. The latest and recommended version is `notebook-al2023-v1`, for an Amazon Linux 2023 notebook instance. As of June 30, 2025, only JupyterLab 4 is supported for new instances. For information about platform identifier types, see [AL2023 notebook instances](nbi-al2023.md) and [Amazon Linux 2 notebook instances](nbi-al2.md). For information about JupyterLab versions, see [JupyterLab versioning](nbi-jl.md).
**Important**  
JupyterLab 1 and JupyterLab 3 are no longer supported as of June 30, 2025. You can no longer create new or restart stopped notebook instances using these versions. Existing in-service instances may continue to function but will not receive security updates or bug fixes. Migrate to JupyterLab 4 notebook instances for continued support. For more information, see [JupyterLab version maintenance](nbi-jl.md#nbi-jl-version-maintenance).

   1. (Optional) **Additional configuration** lets advanced users create a shell script that can run when you create or start the instance. This script, called a lifecycle configuration script, can be used to set the environment for the notebook or to perform other functions. For information, see [Customization of a SageMaker notebook instance using an LCC script](notebook-lifecycle-config.md).

   1. (Optional) **Additional configuration** also lets you specify the size, in GB, of the ML storage volume that is attached to the notebook instance. You can choose a size between 5 GB and 16,384 GB, in 1 GB increments. You can use the volume to clean up the training dataset or to temporarily store validation or other data.

   1. (Optional) For **Minimum IMDS Version**, select a version from the dropdown list. If this value is set to v1, both versions can be used with the notebook instance. If v2 is selected, then only IMDSv2 can be used with the notebook instance. For information about IMDSv2, see [Use IMDSv2](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-service.html).
**Note**  
Starting October 31, 2022, the default minimum IMDS Version for SageMaker notebook instances changes from IMDSv1 to IMDSv2.   
Starting February 1, 2023, IMDSv1 is no longer available for new notebook instance creation. After this date, you can only create notebook instances with a minimum IMDS version of 2.

   1. For **IAM role**, choose either an existing IAM role in your account with the necessary permissions to access SageMaker AI resources or **Create a new role**. If you choose **Create a new role**, SageMaker AI creates an IAM role named `AmazonSageMaker-ExecutionRole-YYYYMMDDTHHmmSS`. The AWS managed policy `AmazonSageMakerFullAccess` is attached to the role. The role provides permissions that allow the notebook instance to call SageMaker AI and Amazon S3.

   1. For **Root access**, to give root access to all notebook instance users, choose **Enable**. To remove root access for users, choose **Disable**. If you give root access, all notebook instance users have administrator privileges and can access and edit all files on the instance. 

   1. (Optional) **Encryption key** lets you encrypt data on the ML storage volume attached to the notebook instance using an AWS Key Management Service (AWS KMS) key. If you plan to store sensitive information on the ML storage volume, consider encrypting the information. 

   1. (Optional) **Network** lets you put your notebook instance inside a Virtual Private Cloud (VPC). A VPC provides additional security and limits access to resources in the VPC from sources outside the VPC. For more information on VPCs, see [Amazon VPC User Guide](https://docs.aws.amazon.com/vpc/latest/userguide/).

      **To add your notebook instance to a VPC:**

      1. Choose the **VPC** and a **SubnetId**.

      1. For **Security Group**, choose your VPC's default security group. 

      1. If you need your notebook instance to have internet access, enable direct internet access. For **Direct internet access**, choose **Enable**. Internet access can make your notebook instance less secure. For more information, see [Connect a Notebook Instance in a VPC to External Resources](appendix-notebook-and-internet-access.md). 

   1. (Optional) To associate Git repositories with the notebook instance, choose a default repository and up to three additional repositories. For more information, see [Git repositories with SageMaker AI Notebook Instances](nbi-git-repo.md).

   1. Choose **Create notebook instance**. 

      In a few minutes, Amazon SageMaker AI launches an ML compute instance—in this case, a notebook instance—and attaches an ML storage volume to it. The notebook instance has a preconfigured Jupyter notebook server and a set of Anaconda libraries. For more information, see the [  `CreateNotebookInstance`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateNotebookInstance.html) API. 

1. When the status of the notebook instance is `InService`, in the console, the notebook instance is ready to use. Choose **Open Jupyter** next to the notebook name to open the classic Jupyter dashboard.
**Note**  
To augment the security of your Amazon SageMaker notebook instance, all regional `notebook.region.sagemaker.aws` domains are registered in the internet [Public Suffix List (PSL)](https://publicsuffix.org/). For further security, we recommend that you use cookies with a `__Host-` prefix to set sensitive cookies for the domains of your SageMaker notebook instances. This helps to defend your domain against cross-site request forgery attempts (CSRF). For more information, see the [Set-Cookie](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#cookie_prefixes) page in the [mozilla.org](https://www.mozilla.org/en-GB/?v=1) developer documentation website.

    You can choose **Open JupyterLab** to open the JupyterLab dashboard. The dashboard provides access to your notebook instance.

   For more information about Jupyter notebooks, see [The Jupyter notebook](https://jupyter-notebook.readthedocs.io/en/stable/).

# Access Notebook Instances
<a name="howitworks-access-ws"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

To access your Amazon SageMaker notebook instances, choose one of the following options: 
+ Use the console.

  Choose **Notebook instances**. The console displays a list of notebook instances in your account. To open a notebook instance with a standard Jupyter interface, choose **Open Jupyter** for that instance. To open a notebook instance with a JupyterLab interface, choose **Open JupyterLab** for that instance.  
![\[Example Notebook instances section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/ws-notebook-10.png)

  The console uses your sign-in credentials to send a [`CreatePresignedNotebookInstanceUrl`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreatePresignedNotebookInstanceUrl.html) API request to SageMaker AI. SageMaker AI returns the URL for your notebook instance, and the console opens the URL in another browser tab and displays the Jupyter notebook dashboard. 
**Note**  
The URL that you get from a call to [`CreatePresignedNotebookInstanceUrl`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreatePresignedNotebookInstanceUrl.html) is valid only for 5 minutes. If you try to use the URL after the 5-minute limit expires, you are directed to the AWS Management Console sign-in page.
+ Use the API.

  To get the URL for the notebook instance, call the [`CreatePresignedNotebookInstanceUrl`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreatePresignedNotebookInstanceUrl.html) API and use the URL that the API returns to open the notebook instance.

Use the Jupyter notebook dashboard to create and manage notebooks and to write code. For more information about Jupyter notebooks, see [http://jupyter.org/documentation.html](http://jupyter.org/documentation.html).

# Update a Notebook Instance
<a name="nbi-update"></a>

After you create a notebook instance, you can update it using the SageMaker AI console or the [`UpdateNotebookInstance`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateNotebookInstance.html) API operation.

You can update the tags of a notebook instance that is `InService`. To update any other attribute of a notebook instance, its status must be `Stopped`.

**To update a notebook instance in the SageMaker AI console:**

1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/). 

1. Choose **Notebook instances**.

1. Choose the notebook instance that you want to update by selecting the notebook instance **Name** from the list.

1. If your notebook **Status** is not `Stopped`, select the **Stop** button to stop the notebook instance. 

   When you do this, the notebook instance status changes to `Stopping`. Wait until the status changes to `Stopped` to complete the following steps. 

1. Select the **Edit** button to open the **Edit notebook instance** page. For information about the notebook properties you can update, see [Create an Amazon SageMaker notebook instance](howitworks-create-ws.md).

1. Update your notebook instance and select the **Update notebook instance** button at the bottom of the page when you are done to return to the notebook instances page. Your notebook instance status changes to **Updating**. 

   When the notebook instance update is complete, the status changes to `Stopped`.

# Customization of a SageMaker notebook instance using an LCC script
<a name="notebook-lifecycle-config"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

A *lifecycle configuration* (LCC) provides shell scripts that run only when you create the notebook instance or whenever you start one. When you create a notebook instance, you can create a new LCC or attach an LCC that you already have. Lifecycle configuration scripts are useful for the following use cases:
+ Installing packages or sample notebooks on a notebook instance
+ Configuring networking and security for a notebook instance
+ Using a shell script to customize a notebook instance

You can also use a lifecycle configuration script to access AWS services from your notebook. For example, you can create a script that lets you use your notebook to control other AWS resources, such as an Amazon EMR instance.

We maintain a public repository of notebook lifecycle configuration scripts that address common use cases for customizing notebook instances at [https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples](https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples).

**Note**  
Each script has a limit of 16384 characters.  
The value of the `$PATH` environment variable that is available to both scripts is `/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin`. The working directory, which is the value of the `$PWD` environment variable, is `/`.  
View CloudWatch Logs for notebook instance lifecycle configurations in log group `/aws/sagemaker/NotebookInstances` in log stream `[notebook-instance-name]/[LifecycleConfigHook]`.  
Scripts cannot run for longer than 5 minutes. If a script runs for longer than 5 minutes, it fails and the notebook instance is not created or started. To help decrease the run time of scripts, try the following:  
Reduce the number of steps. For example, limit the conda environments in which you install large packages.  
Run tasks in parallel processes.  
Use the `nohup` command in your script.
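To illustrate the `nohup` suggestion, the following minimal sketch (the setup command and log path are placeholders for your real setup work) starts a long-running task in the background so that the lifecycle script itself returns well within the 5-minute limit:

```shell
#!/bin/bash
# Run the long task in the background with nohup so this script can exit
# immediately; replace the sleep/echo with your real setup commands.
nohup bash -c 'sleep 1; echo "setup complete" > /tmp/lcc-setup.log' >/dev/null 2>&1 &
echo "Setup continues in the background."
```

Because the background process outlives the script, write its output to a log file (as above) so you can verify later that the setup finished.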

You can see a list of notebook instance lifecycle configurations you previously created by choosing **Lifecycle configuration** in the SageMaker AI console. You can attach a notebook instance LCC when you create a new notebook instance. For more information about creating a notebook instance, see [Create an Amazon SageMaker notebook instance](howitworks-create-ws.md).

# Create a lifecycle configuration script
<a name="notebook-lifecycle-config-create"></a>

The following procedure shows how to create a lifecycle configuration script for use with an Amazon SageMaker notebook instance. For more information about creating a notebook instance, see [Create an Amazon SageMaker notebook instance](howitworks-create-ws.md).

**To create a lifecycle configuration**

1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/). 

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Lifecycle configurations**. 

1. From the **Lifecycle configurations** page, choose the **Notebook Instance** tab.

1. Choose **Create configuration**.

1. For **Name**, type a name using alphanumeric characters and "-", but no spaces. The name can have a maximum of 63 characters.

1. (Optional) To create a script that runs when you create the notebook and every time you start it, choose **Start notebook**.

1. In the **Start notebook** editor, type the script.

1. (Optional) To create a script that runs only once, when you create the notebook, choose **Create notebook**.

1. In the **Create notebook** editor, type the script.

1. Choose **Create configuration**.
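The naming rule in step 6 can be checked locally before you create the configuration. The following bash sketch (the function name is illustrative) validates that a name contains only alphanumeric characters and hyphens and is at most 63 characters:

```shell
#!/bin/bash
# Returns success only for names that match the console's constraints:
# alphanumeric characters and "-", no spaces, 1 to 63 characters.
valid_lcc_name() {
  [[ "$1" =~ ^[a-zA-Z0-9-]{1,63}$ ]]
}

valid_lcc_name "my-lifecycle-config" && echo "valid"
valid_lcc_name "bad name!" || echo "invalid"
```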

## Lifecycle Configuration Best Practices
<a name="nbi-lifecycle-config-bp"></a>

The following are best practices for using lifecycle configurations:

**Important**  
We do not recommend storing sensitive information in your lifecycle configuration script.

**Important**  
Lifecycle configuration scripts run with root access and the notebook instance's IAM execution role privileges, regardless of the root access setting for notebook users. Principals with permissions to create or modify lifecycle configurations and update notebook instances can execute code with the execution role's credentials. See [Control root access to a SageMaker notebook instance](nbi-root-access.md) for more information.
+ Lifecycle configurations run as the `root` user. If your script makes any changes within the `/home/ec2-user/SageMaker` directory (for example, installing a package with `pip`), use the command `sudo -u ec2-user` to run as the `ec2-user` user. This is the same user that Amazon SageMaker AI runs as.
+ SageMaker AI notebook instances use `conda` environments to implement different kernels for Jupyter notebooks. If you want to install packages that are available to one or more notebook kernels, enclose the commands to install the packages with `conda` environment commands that activate the conda environment that contains the kernel where you want to install the packages.

  For example, if you want to install a package only for the `python3` environment, use the following code:

  ```
  #!/bin/bash
  sudo -u ec2-user -i <<EOF
  
  # This will affect only the Jupyter kernel called "conda_python3".
  source activate python3
  
  # Replace myPackage with the name of the package you want to install.
  pip install myPackage
  # You can also perform "conda install" here as well.
  
  source deactivate
  
  EOF
  ```

  If you want to install a package in all conda environments in the notebook instance, use the following code:

  ```
  #!/bin/bash
  sudo -u ec2-user -i <<EOF
  
  # Note that "base" is a special environment name; include it as well.
  for env in base /home/ec2-user/anaconda3/envs/*; do
      source /home/ec2-user/anaconda3/bin/activate "$(basename "$env")"
  
      # Installing packages in the Jupyter system environment can affect stability of your SageMaker
      # Notebook Instance.  You can remove this check if you'd like to install Jupyter extensions, etc.
      if [ "$(basename "$env")" = 'JupyterSystemEnv' ]; then
        continue
      fi
  
      # Replace myPackage with the name of the package you want to install.
      pip install --upgrade --quiet myPackage
      # You can also perform "conda install" here as well.
  
      source /home/ec2-user/anaconda3/bin/deactivate
  done
  
  EOF
  ```
+ You must store all conda environments in the default environments folder (`/home/ec2-user/anaconda3/envs`).

**Important**  
When you create or change a script, we recommend that you use a text editor that provides Unix-style line breaks, such as the text editor available in the console when you create a notebook. Copying text from a non-Linux operating system might introduce incompatible line breaks and result in an unexpected error.
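If you suspect that a script was copied from a non-Linux editor, a quick check-and-fix along these lines (the paths are illustrative) removes Windows-style carriage returns:

```shell
#!/bin/bash
# Simulate a script saved with Windows (CRLF) line endings.
printf 'echo hello\r\n' > /tmp/lcc-script.sh

# Strip the carriage returns to produce Unix-style (LF) line endings.
tr -d '\r' < /tmp/lcc-script.sh > /tmp/lcc-script-unix.sh
```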

# External library and kernel installation
<a name="nbi-add-external"></a>

**Important**  
Currently, all packages in notebook instance environments are licensed for use with Amazon SageMaker AI and do not require additional commercial licenses. However, this might be subject to change in the future, and we recommend reviewing the licensing terms regularly for any updates.

Amazon SageMaker notebook instances come with multiple environments already installed. These environments contain Jupyter kernels and Python packages, including scikit-learn, pandas, NumPy, TensorFlow, and MXNet. These environments, along with all files in the `sample-notebooks` folder, are refreshed when you stop and start a notebook instance. You can also install your own environments that contain your choice of packages and kernels.

The different Jupyter kernels in Amazon SageMaker notebook instances are separate conda environments. For information about conda environments, see [Managing environments](https://conda.io/docs/user-guide/tasks/manage-environments.html) in the *Conda* documentation.

Install custom environments and kernels on the notebook instance's Amazon EBS volume. This ensures that they persist when you stop and restart the notebook instance, and that any external libraries you install are not updated by SageMaker AI. To do that, use a lifecycle configuration that includes both a script that runs when you create the notebook instance (`on-create`) and a script that runs each time you restart the notebook instance (`on-start`). For more information about using notebook instance lifecycle configurations, see [Customization of a SageMaker notebook instance using an LCC script](notebook-lifecycle-config.md). There is a GitHub repository that contains sample lifecycle configuration scripts at [SageMaker AI Notebook Instance Lifecycle Config Samples](https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples).

The examples at [https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/blob/master/scripts/persistent-conda-ebs/on-create.sh](https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/blob/master/scripts/persistent-conda-ebs/on-create.sh) and [https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/blob/master/scripts/persistent-conda-ebs/on-start.sh](https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/blob/master/scripts/persistent-conda-ebs/on-start.sh) show the best practice for installing environments and kernels on a notebook instance. The `on-create` script installs the `ipykernel` library to create custom environments as Jupyter kernels, then uses `pip install` and `conda install` to install libraries. You can adapt the script to create custom environments and install libraries that you want. SageMaker AI does not update these libraries when you stop and restart the notebook instance, so you can ensure that your custom environment has specific versions of libraries that you want. The `on-start` script installs any custom environments that you create as Jupyter kernels, so that they appear in the dropdown list in the Jupyter **New** menu.

## Package installation tools
<a name="nbi-add-external-tools"></a>

SageMaker notebooks support the following package installation tools:
+ conda install
+ pip install

You can install packages using the following methods:
+ Lifecycle configuration scripts.

  For example scripts, see [SageMaker AI Notebook Instance Lifecycle Config Samples](https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples). For more information on lifecycle configuration, see [Customize a Notebook Instance Using a Lifecycle Configuration Script](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-lifecycle-config.html).
+ Notebooks – The following commands are supported.
  + `%conda install`
  + `%pip install`
+ The Jupyter terminal – You can install packages using pip and conda directly.

From within a notebook, you can use the system command syntax (lines starting with `!`) to install packages, for example, `!pip install` and `!conda install`. More recently, new magic commands have been added to IPython: `%pip` and `%conda`. These commands are the recommended way to install packages from a notebook because they correctly take into account the active environment or interpreter being used. For more information, see [Add %pip and %conda magic functions](https://github.com/ipython/ipython/pull/11524).

### Conda
<a name="nbi-add-external-tools-conda"></a>

Conda is an open source package management system and environment management system that can install packages and their dependencies. SageMaker AI supports using Conda with either of the two main channels: the default channel and the conda-forge channel. For more information, see [Conda channels](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html). The conda-forge channel is a community channel where contributors can upload packages.

**Note**  
Due to how Conda resolves the dependency graph, installing packages from conda-forge can take significantly longer (in the worst cases, upwards of 10 minutes).

The Deep Learning AMI comes with many conda environments and many packages preinstalled. Due to the number of packages preinstalled, finding a set of packages that are guaranteed to be compatible is difficult. You may see the warning "The environment is inconsistent, please check the package plan carefully". Despite this warning, SageMaker AI ensures that all SageMaker AI-provided environments are correct. SageMaker AI cannot guarantee that any user-installed packages will function correctly.

**Note**  
Users of SageMaker AI, AWS Deep Learning AMIs and Amazon EMR can access the commercial Anaconda repository without taking a commercial license through February 1, 2024 when using Anaconda in those services. For any usage of the commercial Anaconda repository after February 1, 2024, customers are responsible for determining their own Anaconda license requirements.

Conda has two methods for activating environments: conda activate/deactivate, and source activate/deactivate. For more information, see [Should I use 'conda activate' or 'source activate' in Linux](https://stackoverflow.com/questions/49600611/python-anaconda-should-i-use-conda-activate-or-source-activate-in-linux).

SageMaker AI supports moving Conda environments onto the Amazon EBS volume, which is persisted when the instance is stopped. The environments aren't persisted when the environments are installed to the root volume, which is the default behavior. For an example lifecycle script, see [persistent-conda-ebs](https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/tree/master/scripts/persistent-conda-ebs).

**Supported conda operations (see note at the bottom of this topic)**
+ conda install of a package in a single environment
+ conda install of a package in all environments
+ conda install of an R package in the R environment
+ Installing a package from the main conda repository
+ Installing a package from conda-forge
+ Changing the Conda install location to use EBS
+ Supporting both conda activate and source activate

### Pip
<a name="nbi-add-external-tools-pip"></a>

Pip is the de facto tool for installing and managing Python packages. Pip searches for packages on the Python Package Index (PyPI) by default. Unlike Conda, pip doesn't have built-in environment support and is not as thorough as Conda when it comes to packages with native/system library dependencies. Pip can be used to install packages in Conda environments.

You can use alternative package repositories with pip instead of PyPI. For an example lifecycle script, see [on-start.sh](https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/blob/master/scripts/add-pypi-repository/on-start.sh).
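As a minimal sketch of pointing pip at an alternative index (the URL and paths are placeholders, not a real repository), you can write a pip configuration file and direct pip to it with the `PIP_CONFIG_FILE` environment variable:

```shell
#!/bin/bash
# Write a pip configuration that replaces the default PyPI index.
mkdir -p /tmp/pip-demo
cat > /tmp/pip-demo/pip.conf <<'EOF'
[global]
index-url = https://pypi.example.com/simple
EOF

# Point pip at this configuration for subsequent installs in this shell.
export PIP_CONFIG_FILE=/tmp/pip-demo/pip.conf
```

In a lifecycle configuration script you would typically write the configuration to `/home/ec2-user/.pip/pip.conf` instead, so that it applies to the `ec2-user` account.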

**Supported pip operations (see note at the bottom of this topic)**
+ Using pip to install a package without an active conda environment (installs packages system-wide)
+ Using pip to install a package in a conda environment
+ Using pip to install a package in all conda environments
+ Changing the pip install location to use EBS
+ Using an alternative repository to install packages with pip

### Unsupported
<a name="nbi-add-external-tools-misc"></a>

SageMaker AI aims to support as many package installation operations as possible. However, if the packages were installed by SageMaker AI or DLAMI, and you use the following operations on these packages, it might make your notebook instance unstable:
+ Uninstalling
+ Downgrading
+ Upgrading

We do not provide support for installing packages via `yum install` or installing R packages from CRAN.

Due to potential issues with network conditions or configurations, or the availability of Conda or PyPI, we cannot guarantee that packages will install in a fixed or deterministic amount of time.

**Note**  
We cannot guarantee that a package installation will be successful. Attempting to install a package in an environment with incompatible dependencies can result in a failure. In such a case, contact the library maintainer to see if it is possible to update the package dependencies. Alternatively, you can attempt to modify the environment in a way that allows the installation. This modification, however, will likely mean removing or updating existing packages, which means we can no longer guarantee the stability of the environment.

# Notebook Instance Software Updates
<a name="nbi-software-updates"></a>

Amazon SageMaker AI periodically tests and releases software that is installed on notebook instances. This includes:
+ Kernel updates
+ Security patches
+ AWS SDK updates
+ [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) updates
+ Open source software updates

To ensure that you have the most recent software updates, stop and restart your notebook instance, either in the SageMaker AI console or by calling [`StopNotebookInstance`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_StopNotebookInstance.html).

You can also manually update software installed on your notebook instance while it is running by using update commands in a terminal or in a notebook.

**Note**  
Updating kernels and some packages might depend on whether root access is enabled for the notebook instance. For more information, see [Control root access to a SageMaker notebook instance](nbi-root-access.md).

You can check the [Personal Health Dashboard](https://aws.amazon.com/premiumsupport/technology/personal-health-dashboard/) or the security bulletins at [Security Bulletins](https://aws.amazon.com/security/security-bulletins/) for updates.

# Control an Amazon EMR Spark Instance Using a Notebook
<a name="nbi-lifecycle-config-emr"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

You can use a notebook instance created with a custom lifecycle configuration script to access AWS services from your notebook. For example, you can create a script that lets you use your notebook with Sparkmagic to control other AWS resources, such as an Amazon EMR instance. You can then use the Amazon EMR instance to process your data instead of running the data analysis on your notebook. This allows you to create a smaller notebook instance because you won't use the instance to process data. This is helpful when you have large datasets that would require a large notebook instance to process the data.

The process requires three procedures using the Amazon SageMaker AI console:
+ Create the Amazon EMR Spark instance
+ Create the Jupyter Notebook
+ Test the notebook-to-Amazon EMR connection

**To create an Amazon EMR Spark instance that can be controlled from a notebook using Sparkmagic**

1. Open the Amazon EMR console at [https://console.aws.amazon.com/elasticmapreduce/](https://console.aws.amazon.com/elasticmapreduce/).

1. In the navigation pane, choose **Create cluster**.

1. On the **Create Cluster - Quick Options** page, under **Software configuration**, choose **Spark: Spark 2.4.4 on Hadoop 2.8.5 YARN with Ganglia 3.7.2 and Zeppelin 0.8.2**.

1. Set additional parameters on the page and then choose **Create cluster**.

1. On the **Cluster** page, choose the cluster name that you created. Note the **Master Public DNS**, the **EMR master's security group**, and the VPC name and subnet ID where the EMR cluster was created. You will use these values when you create a notebook.

**To create a notebook that uses Sparkmagic to control an Amazon EMR Spark instance**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the navigation pane, under **Notebook instances**, choose **Create notebook**.

1. Enter the notebook instance name and choose the instance type.

1. Choose **Additional configuration**, then, under **Lifecycle configuration**, choose **Create a new lifecycle configuration**.

1. Add the following code to the lifecycle configuration script:

   ```
   #!/bin/bash
   # OVERVIEW
   # This script connects an Amazon EMR cluster to an Amazon SageMaker notebook instance that uses Sparkmagic.
   #
   # Note that this script will fail if the Amazon EMR cluster's master node IP address is not reachable.
   #   1. Ensure that the EMR master node IP is resolvable from the notebook instance.
   #      One way to accomplish this is to have the notebook instance and the Amazon EMR cluster in the same subnet.
   #   2. Ensure the EMR master node security group provides inbound access from the notebook instance security group.
   #       Type        - Protocol - Port - Source
   #       Custom TCP  - TCP      - 8998 - $NOTEBOOK_SECURITY_GROUP
   #   3. Ensure the notebook instance has internet connectivity to fetch the SparkMagic example config.
   #
   # https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/
   
   # PARAMETERS
   EMR_MASTER_IP=your.emr.master.ip
   
   
   cd /home/ec2-user/.sparkmagic
   
   echo "Fetching Sparkmagic example config from GitHub..."
   wget https://raw.githubusercontent.com/jupyter-incubator/sparkmagic/master/sparkmagic/example_config.json
   
   echo "Replacing EMR master node IP in Sparkmagic config..."
   sed -i -- "s/localhost/$EMR_MASTER_IP/g" example_config.json
   mv example_config.json config.json
   
   echo "Sending a sample request to Livy..."
   curl "$EMR_MASTER_IP:8998/sessions"
   ```

1. In the `PARAMETERS` section of the script, replace `your.emr.master.ip` with the Master Public DNS name for the Amazon EMR instance.

1. Choose **Create configuration**.

1. On the **Create notebook** page, choose **Network - optional**.

1. Choose the VPC and subnet where the Amazon EMR instance is located.

1. Choose the security group used by the Amazon EMR master node.

1. Choose **Create notebook instance**.

While the notebook instance is being created, the status is **Pending**. After the instance has been created and the lifecycle configuration script has successfully run, the status is **InService**.

**Note**  
If the notebook instance can't connect to the Amazon EMR instance, SageMaker AI can't create the notebook instance. The connection can fail if the Amazon EMR instance and notebook are not in the same VPC and subnet, if the Amazon EMR master security group is not used by the notebook, or if the Master Public DNS name in the script is incorrect. 
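To help diagnose a failed connection, you can check from a terminal on the notebook instance (or any host in the same subnet) whether the Livy endpoint on the EMR master node is reachable. Here `your.emr.master.ip` is the same placeholder used in the lifecycle configuration script:

```shell
#!/bin/bash
# Probe the Livy REST endpoint on port 8998; print a hint on failure.
EMR_MASTER_IP=your.emr.master.ip
curl -s --connect-timeout 5 "http://$EMR_MASTER_IP:8998/sessions" \
  || echo "Livy not reachable: check the VPC/subnet, the EMR master security group, and the DNS name"
```

If the probe fails, verify the three conditions listed in the note above before recreating the notebook instance.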

**To test the connection between the Amazon EMR instance and the notebook**

1.  When the status of the notebook is **InService**, choose **Open Jupyter** to open the notebook.

1. Choose **New**, then choose **Sparkmagic (PySpark)**.

1. In the code cell, enter **%%info** and then run the cell.

   The output should be similar to the following:

   ```
   Current session configs: {'driverMemory': '1000M', 'executorCores': 2, 'kind': 'pyspark'}
                       No active sessions.
   ```

# Set the Notebook Kernel
<a name="howitworks-set-kernel"></a>

Amazon SageMaker AI provides several kernels for Jupyter that provide support for Python 2 and 3, Apache MXNet, TensorFlow, and PySpark. To set a kernel for a new notebook in the Jupyter notebook dashboard, choose **New**, and then choose the kernel from the list. For more information about the available kernels, see [Available Kernels](nbi-al2.md#nbi-al2-kernel).

![\[Location of the New drop-down list in the Jupyter notebook dashboard.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/nbi-set-kernel.png)


You can also create a custom kernel that you can use in your notebook instance. For information, see [External library and kernel installation](nbi-add-external.md).

# Git repositories with SageMaker AI Notebook Instances
<a name="nbi-git-repo"></a>

Associate Git repositories with your notebook instance to save your notebooks in a source control environment that persists even if you stop or delete your notebook instance. You can associate one default repository and up to three additional repositories with a notebook instance. The repositories can be hosted in AWS CodeCommit, GitHub, or on any other Git server. Associating Git repositories with your notebook instance can be useful for:
+ Persistence - Notebooks in a notebook instance are stored on durable Amazon EBS volumes, but they do not persist beyond the life of your notebook instance. Storing notebooks in a Git repository enables you to store and use notebooks even if you stop or delete your notebook instance.
+ Collaboration - Peers on a team often work on machine learning projects together. Storing your notebooks in Git repositories allows peers working in different notebook instances to share notebooks and collaborate on them in a source-control environment.
+ Learning - Many Jupyter notebooks that demonstrate machine learning techniques are available in publicly hosted Git repositories, such as on GitHub. You can associate your notebook instance with a repository to easily load Jupyter notebooks contained in that repository.

There are two ways to associate a Git repository with a notebook instance:
+ Add a Git repository as a resource in your Amazon SageMaker AI account. Then, to access the repository, you can specify an AWS Secrets Manager secret that contains credentials. That way, you can access repositories that require authentication.
+ Associate a public Git repository that is not a resource in your account. If you do this, you cannot specify credentials to access the repository.

**Topics**
+ [Add a Git repository to your Amazon SageMaker AI account](nbi-git-resource.md)
+ [Create a Notebook Instance with an Associated Git Repository](nbi-git-create.md)
+ [Associate a CodeCommit Repository in a Different AWS Account with a Notebook Instance](nbi-git-cross.md)
+ [Use Git Repositories in a Notebook Instance](git-nbi-use.md)

# Add a Git repository to your Amazon SageMaker AI account
<a name="nbi-git-resource"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

To manage your GitHub repositories, easily associate them with your notebook instances, and associate credentials for repositories that require authentication, add the repositories as resources in your Amazon SageMaker AI account. You can view a list of repositories that are stored in your account and details about each repository in the SageMaker AI console and by using the API.

You can add Git repositories to your SageMaker AI account in the SageMaker AI console or by using the AWS CLI.

**Note**  
You can use the SageMaker AI API [`CreateCodeRepository`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCodeRepository.html) to add Git repositories to your SageMaker AI account, but step-by-step instructions are not provided here.

## Add a Git repository to your SageMaker AI account (Console)
<a name="nbi-git-resource-console"></a>

**To add a Git repository as a resource in your SageMaker AI account**

1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Under **Notebook**, choose **Git repositories**, then choose **Add repository**.

1. To add a CodeCommit repository, choose **AWS CodeCommit**. To add a GitHub or other Git-based repository, choose **GitHub/Other Git-based repo**.

**To add an existing CodeCommit repository**

1. Choose **Use existing repository**.

1. For **Repository**, choose a repository from the list.

1. Enter a name to use for the repository in SageMaker AI. The name must be 1 to 63 characters. Valid characters are a-z, A-Z, 0-9, and - (hyphen).

1. Choose **Add repository**.

**To create a new CodeCommit repository**

1. Choose **Create new repository**.

1. Enter a name for the repository that you can use in both CodeCommit and SageMaker AI. The name must be 1 to 63 characters. Valid characters are a-z, A-Z, 0-9, and - (hyphen).

1. Choose **Create repository**.

**To add a Git repository hosted somewhere other than CodeCommit**

1. Choose **GitHub/Other Git-based repo**.

1. Enter a name to use for the repository in SageMaker AI. The name must be 1 to 63 characters. Valid characters are a-z, A-Z, 0-9, and - (hyphen).

1. Enter the URL for the repository. Do not provide a username in the URL. Add the sign-in credentials in AWS Secrets Manager as described in the next step.

1. For **Git credentials**, choose the credentials to use to authenticate to the repository. This is necessary only if the Git repository is private.
**Note**  
If you have two-factor authentication enabled for your Git repository, enter a personal access token generated by your Git service provider in the `password` field.

   1. To use an existing AWS Secrets Manager secret, choose **Use existing secret**, and then choose a secret from the list. For information about creating and storing a secret, see [Creating a Basic Secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/manage_create-basic-secret.html) in the *AWS Secrets Manager User Guide*. The name of the secret you use must contain the string `sagemaker`.
**Note**  
The secret must have a staging label of `AWSCURRENT` and must be in the following format:  
`{"username": UserName, "password": Password}`  
For GitHub repositories, we recommend using a personal access token in the `password` field. For information, see [https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/).

   1. To create a new AWS Secrets Manager secret, choose **Create secret**, enter a name for the secret, and then enter the sign-in credentials to use to authenticate to the repository. The name for the secret must contain the string `sagemaker`.
**Note**  
The IAM role you use to create the secret must have the `secretsmanager:GetSecretValue` permission in its IAM policy.  
The secret must have a staging label of `AWSCURRENT` and must be in the following format:  
`{"username": UserName, "password": Password}`  
For GitHub repositories, we recommend using a personal access token.

   1. To not use any credentials, choose **No secret**.

1. Choose **Add repository**.
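If you create the secret yourself in AWS Secrets Manager, it is easy to get the secret string shape wrong. The following minimal Python sketch (the helper name and values are hypothetical) builds a secret string in the `{"username": UserName, "password": Password}` format that SageMaker AI expects:

```python
import json

def build_git_secret(username, token):
    """Build a secret string in the shape SageMaker AI expects:
    {"username": UserName, "password": Password}.
    For GitHub, pass a personal access token as the password."""
    return json.dumps({"username": username, "password": token})

# Placeholder values for illustration only.
secret_string = build_git_secret("myprofile", "my-personal-access-token")
print(secret_string)
```

You would store the resulting string as the secret value in Secrets Manager, under a secret whose name contains the string `sagemaker`.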

## Add a Git repository to your Amazon SageMaker AI account (CLI)
<a name="nbi-git-resource-cli"></a>

Use the `create-code-repository` AWS CLI command to add a Git repository to Amazon SageMaker AI to give users access to external resources. Specify a name for the repository as the value of the `code-repository-name` argument. The name must be 1 to 63 characters. Valid characters are a-z, A-Z, 0-9, and - (hyphen). Also specify the following:
+ The default branch
+ The URL of the Git repository
**Note**  
Do not provide a username in the URL. Add the sign-in credentials in AWS Secrets Manager as described in the next step.
+ The Amazon Resource Name (ARN) of an AWS Secrets Manager secret that contains the credentials to use to authenticate to the repository, as the value of the `git-config` argument

For information about creating and storing a secret, see [Creating a Basic Secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/manage_create-basic-secret.html) in the *AWS Secrets Manager User Guide*. The following command creates a new repository named `MyRepository` in your Amazon SageMaker AI account that points to a Git repository hosted at `https://github.com/myprofile/my-repo`.

For Linux, OS X, or Unix:

```
aws sagemaker create-code-repository \
                    --code-repository-name "MyRepository" \
                    --git-config Branch=branch,RepositoryUrl=https://github.com/myprofile/my-repo,SecretArn=arn:aws:secretsmanager:us-east-2:012345678901:secret:my-secret-ABc0DE
```

For Windows:

```
aws sagemaker create-code-repository ^
                    --code-repository-name "MyRepository" ^
                    --git-config "{\"Branch\":\"master\", \"RepositoryUrl\" :
                    \"https://github.com/myprofile/my-repo\", \"SecretArn\" : \"arn:aws:secretsmanager:us-east-2:012345678901:secret:my-secret-ABc0DE\"}"
```

**Note**  
The secret must have a staging label of `AWSCURRENT` and must be in the following format:  
`{"username": UserName, "password": Password}`  
For GitHub repositories, we recommend using a personal access token.
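You can make the same call from code. For example, boto3's `create_code_repository` method takes parameters that mirror the `--code-repository-name` and `--git-config` arguments above. The following sketch only assembles and sanity-checks the request parameters locally, using the placeholder values from the example, rather than calling AWS:

```python
# Request parameters for the CreateCodeRepository API, mirroring the CLI
# example above. With boto3 you would pass this dict as keyword arguments:
# boto3.client("sagemaker").create_code_repository(**params)
params = {
    "CodeRepositoryName": "MyRepository",
    "GitConfig": {
        "Branch": "master",
        "RepositoryUrl": "https://github.com/myprofile/my-repo",
        "SecretArn": "arn:aws:secretsmanager:us-east-2:012345678901:secret:my-secret-ABc0DE",
    },
}

# Local checks on the constraints described above: 1-63 character name,
# and an HTTPS repository URL with no username embedded in it.
assert 1 <= len(params["CodeRepositoryName"]) <= 63
assert params["GitConfig"]["RepositoryUrl"].startswith("https://")
assert "@" not in params["GitConfig"]["RepositoryUrl"]
print(params["CodeRepositoryName"])
```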

# Create a Notebook Instance with an Associated Git Repository
<a name="nbi-git-create"></a>

You can associate Git repositories with a notebook instance when you create the notebook instance by using the AWS Management Console or the AWS CLI. If you want to use a CodeCommit repository that is in a different AWS account than the notebook instance, set up cross-account access for the repository. For information, see [Associate a CodeCommit Repository in a Different AWS Account with a Notebook Instance](nbi-git-cross.md).

**Topics**
+ [Create a Notebook Instance with an Associated Git Repository (Console)](#nbi-git-create-console)
+ [Create a Notebook Instance with an Associated Git Repository (CLI)](nbi-git-create-cli.md)

## Create a Notebook Instance with an Associated Git Repository (Console)
<a name="nbi-git-create-console"></a>

**To create a notebook instance and associate Git repositories in the Amazon SageMaker AI console**

1. Follow the instructions at [Create an Amazon SageMaker Notebook Instance for the tutorial](gs-setup-working-env.md).

1. For **Git repositories**, choose Git repositories to associate with the notebook instance.

   1. For **Default repository**, choose a repository that you want to use as your default repository. SageMaker AI clones this repository as a subdirectory in the Jupyter startup directory at `/home/ec2-user/SageMaker`. When you open your notebook instance, it opens in this repository. To choose a repository that is stored as a resource in your account, choose its name from the list. To add a new repository as a resource in your account, choose **Add a repository to SageMaker AI (opens the Add repository flow in a new window)** and then follow the instructions at [Add a Git repository to your Amazon SageMaker AI account](nbi-git-resource.md). To clone a public repository that is not stored in your account, choose **Clone a public Git repository to this notebook instance only**, and then specify the URL for that repository.

   1. For **Additional repository 1**, choose a repository that you want to add as an additional directory. SageMaker AI clones this repository as a subdirectory in the Jupyter startup directory at `/home/ec2-user/SageMaker`. To choose a repository that is stored as a resource in your account, choose its name from the list. To add a new repository as a resource in your account, choose **Add a repository to SageMaker AI (opens the Add repository flow in a new window)** and then follow the instructions at [Add a Git repository to your Amazon SageMaker AI account](nbi-git-resource.md). To clone a public repository that is not stored in your account, choose **Clone a public Git repository to this notebook instance only**, and then specify the URL for that repository.

      Repeat this step up to three times to add up to three additional repositories to your notebook instance.

# Create a Notebook Instance with an Associated Git Repository (CLI)
<a name="nbi-git-create-cli"></a>

To create a notebook instance and associate Git repositories by using the AWS CLI, use the `create-notebook-instance` command as follows:
+ Specify the repository that you want to use as your default repository as the value of the `default-code-repository` argument. Amazon SageMaker AI clones this repository as a subdirectory in the Jupyter startup directory at `/home/ec2-user/SageMaker`. When you open your notebook instance, it opens in this repository. To use a repository that is stored as a resource in your SageMaker AI account, specify the name of the repository as the value of the `default-code-repository` argument. To use a repository that is not stored in your account, specify the URL of the repository as the value of the `default-code-repository` argument.
+ Specify up to three additional repositories as the value of the `additional-code-repositories` argument. SageMaker AI clones these repositories as subdirectories in the Jupyter startup directory at `/home/ec2-user/SageMaker`, and excludes them from the default repository by adding them to the `.git/info/exclude` file of the default repository. To use repositories that are stored as resources in your SageMaker AI account, specify the names of the repositories as the value of the `additional-code-repositories` argument. To use repositories that are not stored in your account, specify the URLs of the repositories as the value of the `additional-code-repositories` argument.

For example, the following command creates a notebook instance whose default repository is `MyGitRepo`, stored as a resource in your SageMaker AI account, along with an additional repository that is hosted on GitHub:

```
aws sagemaker create-notebook-instance \
                    --notebook-instance-name "MyNotebookInstance" \
                    --instance-type "ml.t2.medium" \
                    --role-arn "arn:aws:iam::012345678901:role/service-role/AmazonSageMaker-ExecutionRole-20181129T121390" \
                    --default-code-repository "MyGitRepo" \
                    --additional-code-repositories "https://github.com/myprofile/my-other-repo"
```

**Note**  
If you use an AWS CodeCommit repository that does not contain "SageMaker" in its name, add the `codecommit:GitPull` and `codecommit:GitPush` permissions to the role that you pass as the `role-arn` argument to the `create-notebook-instance` command. For information about how to add permissions to a role, see [Adding and Removing IAM Policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html) in the *AWS Identity and Access Management User Guide*. 
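A policy statement granting those permissions might look like the following. This is a sketch: the repository ARN is a placeholder, and you should scope `Resource` to your own repository:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "codecommit:GitPull",
                "codecommit:GitPush"
            ],
            "Resource": "arn:aws:codecommit:us-east-2:012345678901:MyRepository"
        }
    ]
}
```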

# Associate a CodeCommit Repository in a Different AWS Account with a Notebook Instance
<a name="nbi-git-cross"></a>

To associate a CodeCommit repository in a different AWS account with your notebook instance, set up cross-account access for the CodeCommit repository.

**To set up cross-account access for a CodeCommit repository and associate it with a notebook instance:**

1. In the AWS account that contains the CodeCommit repository, create an IAM policy that allows access to the repository from users in the account that contains your notebook instance. For information, see [Step 1: Create a Policy for Repository Access in AccountA](https://docs.aws.amazon.com/codecommit/latest/userguide/cross-account-administrator-a.html#cross-account-create-policy-a) in the *CodeCommit User Guide*.

1. In the AWS account that contains the CodeCommit repository, create an IAM role, and attach the policy that you created in the previous step to that role. For information, see [Step 2: Create a Role for Repository Access in AccountA](https://docs.aws.amazon.com/codecommit/latest/userguide/cross-account-administrator-a.html#cross-account-create-role-a) in the *CodeCommit User Guide*.

1. Create a profile in the notebook instance that uses the role that you created in the previous step:

   1. Open the notebook instance.

   1. Open a terminal in the notebook instance.

   1. Create a new profile by typing the following in the terminal:

      ```
      vi /home/ec2-user/.aws/config
      ```

   1. Edit the file with the following profile information:

      ```
      [profile CrossAccountAccessProfile]
      region = us-west-2
      role_arn = arn:aws:iam::CodeCommitAccount:role/CrossAccountRepositoryContributorRole
      credential_source=Ec2InstanceMetadata
      output = json
      ```

      Where *CodeCommitAccount* is the account that contains the CodeCommit repository, *CrossAccountAccessProfile* is the name of the new profile, and *CrossAccountRepositoryContributorRole* is the name of the role you created in the previous step.

1. On the notebook instance, configure Git to use the profile that you created in the previous step:

   1. Open the notebook instance.

   1. Open a terminal in the notebook instance.

   1. Edit the Git configuration file by typing the following in the terminal:

      ```
      vi /home/ec2-user/.gitconfig
      ```

   1. Edit the file with the following profile information:

      ```
      [credential]
              helper = !aws codecommit credential-helper --profile CrossAccountAccessProfile $@
              UseHttpPath = true
      ```

      Where *CrossAccountAccessProfile* is the name of the profile that you created in the previous step.
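The profile file in step 3 uses standard INI syntax, so you can sanity-check it before relying on it. A minimal Python sketch using the same placeholder values as the procedure above:

```python
import configparser

# The profile text from /home/ec2-user/.aws/config, with the same
# placeholder values used in the procedure above.
profile_text = """
[profile CrossAccountAccessProfile]
region = us-west-2
role_arn = arn:aws:iam::CodeCommitAccount:role/CrossAccountRepositoryContributorRole
credential_source = Ec2InstanceMetadata
output = json
"""

config = configparser.ConfigParser()
config.read_string(profile_text)
section = config["profile CrossAccountAccessProfile"]

# credential_source tells the AWS SDK to use the notebook instance's own
# EC2 instance credentials when assuming the cross-account role.
assert section["credential_source"] == "Ec2InstanceMetadata"
assert section["role_arn"].startswith("arn:aws:iam::")
print(section["region"])
```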

# Use Git Repositories in a Notebook Instance
<a name="git-nbi-use"></a>

When you open a notebook instance that has Git repositories associated with it, it opens in the default repository, which is installed in your notebook instance directly under `/home/ec2-user/SageMaker`. You can open and create notebooks, and you can manually run Git commands in a notebook cell. For example:

```
!git pull origin master
```

To open any of the additional repositories, navigate up one folder. The additional repositories are also installed as directories under `/home/ec2-user/SageMaker`.

If you open the notebook instance with a JupyterLab interface, the jupyter-git extension is installed and available to use. For information about the jupyter-git extension for JupyterLab, see [https://github.com/jupyterlab/jupyterlab-git](https://github.com/jupyterlab/jupyterlab-git).

When you open a notebook instance in JupyterLab, you see the Git repositories associated with it on the left menu:

![Example file browser in JupyterLab.](http://docs.aws.amazon.com/sagemaker/latest/dg/images/git-notebook.png)


You can use the jupyter-git extension to manage Git visually, instead of using the command line:

![Example of the jupyter-git extension in JupyterLab.](http://docs.aws.amazon.com/sagemaker/latest/dg/images/jupyterlab-git.png)


# Notebook Instance Metadata
<a name="nbi-metadata"></a>

When you create a notebook instance, Amazon SageMaker AI creates a JSON file on the instance at the location `/opt/ml/metadata/resource-metadata.json` that contains the `ResourceName` and `ResourceArn` of the notebook instance. You can access this metadata from anywhere within the notebook instance, including in lifecycle configurations. For information about notebook instance lifecycle configurations, see [Customization of a SageMaker notebook instance using an LCC script](notebook-lifecycle-config.md).

**Note**  
The `resource-metadata.json` file can be modified with root access.

The `resource-metadata.json` file has the following structure:

```
{
    "ResourceArn": "NotebookInstanceArn",
    "ResourceName": "NotebookInstanceName"
}
```

You can use this metadata from within the notebook instance to get other information about the notebook instance. For example, the following commands get the tags associated with the notebook instance:

```
NOTEBOOK_ARN=$(jq --raw-output '.ResourceArn' /opt/ml/metadata/resource-metadata.json)
aws sagemaker list-tags --resource-arn $NOTEBOOK_ARN
```

The output looks like the following:

```
{
    "Tags": [
        {
            "Key": "test",
            "Value": "true"
        }
    ]
}
```
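You can read the same metadata from Python. Because `/opt/ml/metadata/resource-metadata.json` only exists on a notebook instance, this sketch stages a sample copy (with placeholder values) in a temporary directory so it is self-contained:

```python
import json
import pathlib
import tempfile

# On a real notebook instance, read /opt/ml/metadata/resource-metadata.json
# directly. Here we stage a sample copy with placeholder values.
sample = {
    "ResourceArn": "arn:aws:sagemaker:us-east-2:012345678901:notebook-instance/mynotebook",
    "ResourceName": "mynotebook",
}
metadata_path = pathlib.Path(tempfile.mkdtemp()) / "resource-metadata.json"
metadata_path.write_text(json.dumps(sample))

# Parse the file and pull out the ARN, e.g. to pass to list-tags.
metadata = json.loads(metadata_path.read_text())
notebook_arn = metadata["ResourceArn"]
print(metadata["ResourceName"])
```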

# Monitor Jupyter Logs in Amazon CloudWatch Logs
<a name="jupyter-logs"></a>

Jupyter logs include important information, such as events, metrics, and health data, that provides actionable insights when running Amazon SageMaker notebooks. By importing Jupyter logs into CloudWatch Logs, you can use CloudWatch Logs to detect anomalous behavior, set alarms, and discover insights that keep your SageMaker AI notebooks running smoothly. You can access the logs even when the Amazon EC2 instance that hosts the notebook is unresponsive, and use the logs to troubleshoot the unresponsive notebook. Sensitive information, such as AWS account IDs, secret keys, and authentication tokens in presigned URLs, is removed so that you can share logs without leaking private information.

**To view Jupyter logs for a notebook instance:**

1. Sign in to the AWS Management Console and open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/). 

1. Choose **Notebook instances**.

1. In the list of notebook instances, choose the **Name** of the notebook instance for which you want to view Jupyter logs.

   This opens the details page for that notebook instance.

1. Under **Monitor** on the notebook instance details page, choose **View logs**.

1. In the CloudWatch console, choose the log stream for your notebook instance. Its name is in the form `NotebookInstanceName/jupyter.log`.

For more information about monitoring CloudWatch logs for SageMaker AI, see [CloudWatch Logs for Amazon SageMaker AI](logging-cloudwatch.md).

# Amazon SageMaker Studio Lab
<a name="studio-lab"></a>

**Note**  
As of August 8, 2025, Amazon SageMaker Studio Lab uses JupyterLab 4 instead of JupyterLab 3. If you experience dependency issues, reinstall any extensions that you added to your environments.

Amazon SageMaker Studio Lab is a free service that gives you access to AWS compute resources in an environment based on open-source JupyterLab 4. It is based on the same architecture and user interface as Amazon SageMaker Studio Classic, but with a subset of Studio Classic capabilities.

With Studio Lab, you can use AWS compute resources to create and run your Jupyter notebooks without signing up for an AWS account. Because Studio Lab is based on open-source JupyterLab, you can take advantage of open-source Jupyter extensions to run your Jupyter notebooks.

 **Studio Lab compared to Amazon SageMaker Studio Classic**

While Studio Lab provides free access to AWS compute resources, Amazon SageMaker Studio Classic provides the following advanced machine learning capabilities that Studio Lab does not support:
+ Continuous integration and continuous delivery (Pipelines)
+ Real-time predictions
+ Large-scale distributed training
+ Data preparation (Amazon SageMaker Data Wrangler)
+ Data labeling (Amazon SageMaker Ground Truth)
+ Feature Store
+ Bias analysis (Clarify)
+ Model deployment
+ Model monitoring

Studio Classic also supports fine-grained access control and security by using AWS Identity and Access Management (IAM), Amazon Virtual Private Cloud (Amazon VPC), and AWS Key Management Service (AWS KMS). Studio Lab does not support these Studio Classic features, nor does it support the use of estimators and built-in SageMaker AI algorithms.

To export your Studio Lab projects for use with Studio Classic, see [Export an Amazon SageMaker Studio Lab environment to Amazon SageMaker Studio Classic](studio-lab-use-migrate.md).

The following topics give information about Studio Lab and how to use it.

**Topics**
+ [Amazon SageMaker Studio Lab components overview](studio-lab-overview.md)
+ [Onboard to Amazon SageMaker Studio Lab](studio-lab-onboard.md)
+ [Manage your account](studio-lab-manage-account.md)
+ [Launch your Amazon SageMaker Studio Lab project runtime](studio-lab-manage-runtime.md)
+ [Use Amazon SageMaker Studio Lab starter assets](studio-lab-integrated-resources.md)
+ [Studio Lab pre-installed environments](studio-lab-environments.md)
+ [Use the Amazon SageMaker Studio Lab project runtime](studio-lab-use.md)
+ [Troubleshooting](studio-lab-troubleshooting.md)

# Amazon SageMaker Studio Lab components overview
<a name="studio-lab-overview"></a>

Amazon SageMaker Studio Lab consists of the following components. The following topics give more details about these components. 

**Topics**
+ [Landing page](#studio-lab-overview-landing)
+ [Studio Lab account](#studio-lab-overview-account)
+ [Project overview page](#studio-lab-overview-project-overview)
+ [Preview page](#studio-lab-overview-preview)
+ [Project](#studio-lab-overview-project)
+ [Compute instance type](#studio-lab-overview-project-compute)
+ [Project runtime](#studio-lab-overview-runtime)
+ [Session](#studio-lab-overview-session)

## Landing page
<a name="studio-lab-overview-landing"></a>

You can request an account and sign in to an existing account on your landing page. To navigate to the landing page, see the [Amazon SageMaker Studio Lab website](https://studiolab.sagemaker.aws/). For more information about creating a Studio Lab account, see [Onboard to Amazon SageMaker Studio Lab](studio-lab-onboard.md).

The following screenshot shows the Studio Lab landing page interface for requesting a user account and signing in.

![The Amazon SageMaker Studio Lab landing page layout.](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio-lab-landing.png)


## Studio Lab account
<a name="studio-lab-overview-account"></a>

Your Studio Lab account gives you access to Studio Lab. For more information about creating a user account, see [Onboard to Amazon SageMaker Studio Lab](studio-lab-onboard.md).

## Project overview page
<a name="studio-lab-overview-project-overview"></a>

You can launch a compute instance and view information about your project on this page. To navigate to this page, you must sign in from the [Amazon SageMaker Studio Lab website](https://studiolab.sagemaker.aws/). The URL takes the following format.

```
https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>
```

The following screenshot shows a project overview in the Studio Lab user interface.

![The layout of the project overview user interface.](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio-lab-overview.png)


## Preview page
<a name="studio-lab-overview-preview"></a>

On this page, you can access a read-only preview of a Jupyter notebook. You cannot run the notebook from the preview, but you can copy it into your project. This may be the first Studio Lab page that you see, because you may be opening a notebook directly from GitHub. For more information about how to use GitHub resources, see [Use GitHub resources](studio-lab-use-external.md#studio-lab-use-external-clone-github).

To copy the notebook preview to your Studio Lab project:

1.  Sign in to your Studio Lab account. For more information about creating a Studio Lab account, see [Onboard to Amazon SageMaker Studio Lab](studio-lab-onboard.md). 

1.  Under **Notebook compute instance**, choose a compute instance type. For more information about compute instance types, see [Compute instance type](#studio-lab-overview-project-compute). 

1.  Choose **Start runtime**. You might be asked to solve a CAPTCHA puzzle. For more information on CAPTCHA, see [What is a CAPTCHA puzzle?](https://docs.aws.amazon.com/waf/latest/developerguide/waf-captcha-puzzle.html)

1.  Complete the following one-time setup the first time that you start the runtime with your Studio Lab account:

   1.  Enter a mobile phone number to associate with your Amazon SageMaker Studio Lab account and choose **Continue**. 

      For information on supported countries and regions, see [Supported countries and regions (SMS channel)](https://docs.aws.amazon.com/pinpoint/latest/userguide/channels-sms-countries.html).

   1.  Enter the 6-digit code sent to the associated mobile phone number and choose **Verify**. 

1.  Choose **Copy to project**. 

## Project
<a name="studio-lab-overview-project"></a>

Your project contains all of your files and folders, including your Jupyter notebooks. You have full control over the files in your project. Your project also includes the JupyterLab-based user interface. From this interface, you can interact with your Jupyter notebooks, edit your source code files, integrate with GitHub, and connect to Amazon S3. For more information, see [Use the Amazon SageMaker Studio Lab project runtime](studio-lab-use.md). 

The following screenshot shows a Studio Lab project with the file browser open and the Studio Lab Launcher displayed.

![The layout of the project user interface.](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio-lab-ui.png)


## Compute instance type
<a name="studio-lab-overview-project-compute"></a>

Your Amazon SageMaker Studio Lab project runtime is based on an EC2 instance. You are allotted 15 GB of storage and 16 GB of RAM. Availability of compute instances is not guaranteed and is subject to demand. If you require additional storage or compute resources, consider switching to Studio.

Amazon SageMaker Studio Lab offers the choice of a CPU (central processing unit) or a GPU (graphics processing unit). The following sections give information about these two options, including selection guidance.

 **CPU** 

A central processing unit (CPU) is designed to handle a wide range of tasks efficiently, but is limited in how many tasks it can run concurrently. For machine learning, a CPU is recommended for compute-intensive workloads, such as time series forecasting and tabular data.

The CPU compute type can run for up to 4 hours at a time, with a limit of 8 hours in a 24-hour period.

 **GPU** 

A graphics processing unit (GPU) is designed to render high-resolution images and video concurrently. A GPU is recommended for deep learning tasks, especially for transformers and computer vision. 

The GPU compute type can run for up to 4 hours at a time, with a limit of 4 hours in a 24-hour period.

 **Compute time** 

When compute time for Studio Lab reaches its time limit, the instance stops all running computations. Studio Lab does not support time limit increases.

Studio Lab automatically saves your environment when you update your environment and every time you create a new file. Custom-installed extensions and packages persist even after your runtime has ended.

File edits are periodically saved, but are not saved when your runtime ends. To ensure that you do not lose your progress, save your work manually. If you have content in your Studio Lab project that you don’t want to lose, we recommend that you back up your content elsewhere. For more information about exporting your environment and files, see [Export an Amazon SageMaker Studio Lab environment to Amazon SageMaker Studio Classic](studio-lab-use-migrate.md).

During a long computation, you do not need to keep your project open. For example, you can start training a model and then close your browser. The instance keeps running, up to the compute type's limit in a 24-hour period. You can then sign in later to continue your work.

We recommend that you use checkpointing in your deep learning jobs. You can use saved checkpoints to restart a job from the previously saved checkpoint. For more information, see [File I/O](https://d2l.ai/chapter_deep-learning-computation/read-write.html?highlight=checkpointing).
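The idea can be sketched framework-agnostically: periodically write the training state to disk, and on restart resume from the newest checkpoint. The following minimal pure-Python illustration (the loop is a stand-in for a real training job, and the file layout is hypothetical) shows the pattern; with a real framework you would save model weights instead of a small JSON dict:

```python
import json
import pathlib
import tempfile

# On Studio Lab, use a folder inside your project so checkpoints persist.
ckpt_dir = pathlib.Path(tempfile.mkdtemp())

def save_checkpoint(step, state):
    # One file per step; the zero-padded name keeps files sortable.
    (ckpt_dir / f"ckpt-{step:06d}.json").write_text(
        json.dumps({"step": step, **state})
    )

def load_latest_checkpoint():
    ckpts = sorted(ckpt_dir.glob("ckpt-*.json"))
    return json.loads(ckpts[-1].read_text()) if ckpts else None

# Stand-in training loop: resume from the last checkpoint if one exists.
start = (load_latest_checkpoint() or {"step": 0})["step"]
for step in range(start, start + 3):
    loss = 1.0 / (step + 1)            # placeholder for real training work
    save_checkpoint(step + 1, {"loss": loss})

resumed = load_latest_checkpoint()
print(resumed["step"])
```

If the runtime stops mid-job, rerunning the same script picks up from the last saved step instead of starting over.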

## Project runtime
<a name="studio-lab-overview-runtime"></a>

The project runtime is the period of time when your compute instance is running.

## Session
<a name="studio-lab-overview-session"></a>

A user session begins every time you launch your project. 

# Onboard to Amazon SageMaker Studio Lab
<a name="studio-lab-onboard"></a>

To onboard to Amazon SageMaker Studio Lab, follow the steps in this guide. In the following sections, you learn how to request a Studio Lab account, create your account, and sign in.

**Topics**
+ [Request a Studio Lab account](#studio-lab-onboard-request)
+ [Create a Studio Lab account](#studio-lab-onboard-register)
+ [Sign in to Studio Lab](#studio-lab-onboard-signin)

## Request a Studio Lab account
<a name="studio-lab-onboard-request"></a>

To use Studio Lab, you must first request approval to create a Studio Lab account. An AWS account cannot be used for onboarding to Studio Lab. 

The following steps show how to request a Studio Lab account.

1. Navigate to the [Studio Lab landing page](https://studiolab.sagemaker.aws).

1. Select **Request account**.

1. Enter the required information into the form.

1. Select **Submit request**.

1. If you receive an email to verify your email address, follow the instructions in the email to complete this step.

Your account request must be approved before you can register for a Studio Lab account. Your request will be reviewed within five business days. When your account request is approved, you receive an email with a link to the Studio Lab account registration page. This link expires seven days after your request is approved. If the link expires, you must submit a new account request. 

**Note**  
Your account request is denied if your email has been associated with activity that violates our [Terms of Service](https://aws.amazon.com/service-terms/) or other agreements. 

### Referral codes
<a name="studio-lab-onboard-request-referral"></a>

Studio Lab referral codes enable new account requests to be automatically approved to support machine learning events like workshops, hackathons, and classes. With a referral code, a trusted host can get their participants immediate access to Studio Lab. After an account has been created using a referral code, the account continues to exist after the expiration of the code.

To get a referral code, contact [Sales Support](https://aws.amazon.com/contact-us/sales-support/). To use a referral code, enter the code as part of the account request form.

## Create a Studio Lab account
<a name="studio-lab-onboard-register"></a>

After your request is approved, complete the following steps to create your Studio Lab account.

1. Select **Create account** in the account request approval email to open a new page.

1. From the new page, enter your **Email**, a **Password**, and a **Username**. 

1. Select **Create account**. 

   You might be asked to solve a CAPTCHA puzzle. For more information on CAPTCHA, see [What is a CAPTCHA puzzle?](https://docs.aws.amazon.com/waf/latest/developerguide/waf-captcha-puzzle.html)

## Sign in to Studio Lab
<a name="studio-lab-onboard-signin"></a>

After you register for your account, you can sign in to Studio Lab.

1. Navigate to the [Studio Lab landing page](https://studiolab.sagemaker.aws).

1. Select **Sign in** to open a new page.

1. Enter your **Email** or **Username** and **Password**. 

1. Select **Sign in** to open a new page to your project. 

   You might be asked to solve a CAPTCHA puzzle. For more information on CAPTCHA, see [What is a CAPTCHA puzzle?](https://docs.aws.amazon.com/waf/latest/developerguide/waf-captcha-puzzle.html)

# Manage your account
<a name="studio-lab-manage-account"></a>

The following topics give information about managing your account, including changing your password, deleting your account, and getting the information that we have collected. These topics require that you sign in to your Amazon SageMaker Studio Lab account. For more information, see [Sign in to Studio Lab](studio-lab-onboard.md#studio-lab-onboard-signin).

## Change your password
<a name="studio-lab-manage-change-password"></a>

 Follow these steps to change your Amazon SageMaker Studio Lab password. 

1.  Navigate to the Studio Lab project overview page. The URL takes the following format.

   ```
   https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>
   ```

1.  From the top-right corner, select your user name to open a dropdown menu. 

1.  From the dropdown menu, select **Change password** to open a new page. 

1.  Enter your current password into the **Enter your current password** field.

1.  Enter your new password into the **Create a new password** and **Confirm your new password** fields.

1.  Select **Submit**. 

## Delete your account
<a name="studio-lab-manage-delete"></a>

 Follow these steps to delete your Studio Lab account.  

1.  Navigate to the Studio Lab project overview page. The URL takes the following format.

   ```
   https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>
   ```

1.  From the top-right corner, select your user name to open a dropdown menu. 

1.  From the dropdown menu, select **Delete account** to open a new page. 

1.  Enter your password to confirm the deletion of your Studio Lab account. 

1.  Select **Delete**. 

## Customer information
<a name="studio-lab-manage-information"></a>

 Studio Lab collects your email address, user name, encrypted password, project files, and metadata. When requesting an account, you can optionally choose to provide your first and last name, country, organization name, occupation, and the reason for your interest in this product. We protect all customer personal data with encryption. For more information about how your personal information is handled, see the [Privacy Notice](https://aws.amazon.com//privacy/). 

When you delete your account, all of your information is deleted immediately. If you have an inquiry about this, submit the [Amazon SageMaker Studio Lab Form](https://pages.awscloud.com/GLOBAL_PM_PA_amazon-sagemaker_20211116_7014z000000rjq2-registration.html). For information and support related to AWS compliance, see [Compliance support](https://aws.amazon.com//contact-us/compliance-support/).

# Launch your Amazon SageMaker Studio Lab project runtime
<a name="studio-lab-manage-runtime"></a>

The Amazon SageMaker Studio Lab project runtime lets you write and run code directly from your browser. It is based on JupyterLab and has an integrated terminal and console. For more information about JupyterLab, see the [JupyterLab Documentation](https://jupyterlab.readthedocs.io/en/stable/).

The following topics give information about how to manage your project runtime. These topics require that you sign in to your Amazon SageMaker Studio Lab account. For more information about signing in, see [Sign in to Studio Lab](studio-lab-onboard.md#studio-lab-onboard-signin). For more information about your project, see [Amazon SageMaker Studio Lab components overview](studio-lab-overview.md). 

**Topics**
+ [Start your project runtime](#studio-lab-manage-runtime-start)
+ [Stop your project runtime](#studio-lab-manage-runtime-stop)
+ [View remaining compute time](#studio-lab-manage-runtime-view)
+ [Change your compute type](#studio-lab-manage-runtime-change)

## Start your project runtime
<a name="studio-lab-manage-runtime-start"></a>

To use Studio Lab, you must start your project runtime. This runtime gives you access to the JupyterLab environment.

1. Navigate to the Studio Lab project overview page. The URL takes the following format.

   ```
   https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>
   ```

1. Under **My Project**, select a compute type. For more information about compute types, see [Compute instance type](studio-lab-overview.md#studio-lab-overview-project-compute).

1. Select **Start runtime**. 

   You might be asked to solve a CAPTCHA puzzle. For more information on CAPTCHA, see [What is a CAPTCHA puzzle?](https://docs.aws.amazon.com/waf/latest/developerguide/waf-captcha-puzzle.html)

1. Complete the following one-time setup the first time you start a runtime with your Studio Lab account: 

   1. Enter a mobile phone number to associate with your Amazon SageMaker Studio Lab account and choose **Continue**. 

      For information on supported countries and regions, see [Supported countries and regions (SMS channel)](https://docs.aws.amazon.com/pinpoint/latest/userguide/channels-sms-countries.html).

   1. Enter the 6-digit code sent to the associated mobile phone number and choose **Verify**. 

1. After the runtime is running, select **Open project** to open the project runtime environment in a new browser tab. 

## Stop your project runtime
<a name="studio-lab-manage-runtime-stop"></a>

When you stop your project runtime, your files are not automatically saved. To ensure that you don't lose your work, save all of your changes before stopping your project runtime.
+ Under **My Project**, select **Stop runtime**. 

## View remaining compute time
<a name="studio-lab-manage-runtime-view"></a>

Your project runtime has limited compute time based on the compute type that you select. For more information about compute time in Studio Lab, see [Compute instance type](studio-lab-overview.md#studio-lab-overview-project-compute).
+ Under **My Project**, view **Time remaining**. 

## Change your compute type
<a name="studio-lab-manage-runtime-change"></a>

You can switch your compute type based on your workflow. For more information about compute types, see [Compute instance type](studio-lab-overview.md#studio-lab-overview-project-compute).

1. Save any project files before changing the compute type. 

1. Navigate to the Studio Lab project overview page. The URL takes the following format.

   ```
   https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>
   ```

1. Under **My Project**, select the desired compute type (CPU or GPU). 

1. Confirm your choice by selecting **Restart** in the **Restart project runtime?** dialog box. Studio Lab stops your current project runtime, then starts a new project runtime with your updated compute type.

1. After your project runtime has started, select **Open project**. This opens your project runtime environment in a new browser tab. For information about using your project runtime environment, see [Use the Amazon SageMaker Studio Lab project runtime](studio-lab-use.md).

# Use Amazon SageMaker Studio Lab starter assets
<a name="studio-lab-integrated-resources"></a>

Amazon SageMaker Studio Lab supports the following assets to help machine learning (ML) practitioners get started. This guide shows you how to clone notebooks for your project. 

**Getting started notebook** 

Studio Lab comes with a starter notebook that gives general information and guides you through key workflows. When you launch your project runtime for the first time, this notebook automatically opens.

**Dive into Deep Learning** 

Dive into Deep Learning (D2L) is an interactive, open-source book that teaches the ideas, mathematical theory, and code that power machine learning. With over 150 Jupyter notebooks, D2L provides a comprehensive overview of deep learning principles. For more information about D2L, see the [D2L website](https://d2l.ai/).

The following procedure shows how to clone the D2L Jupyter notebooks to your instance. 

1. Start and open the Studio Lab project runtime environment by following [Start your project runtime](studio-lab-manage-runtime.md#studio-lab-manage-runtime-start).

1. Once Studio Lab is open, choose the Git tab (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/git.png)) on the left sidebar. 

1. Choose **Clone a Repository**.

   If you do not see the **Clone a Repository** option, you may currently be in a Git repository. In that case, use the following substeps.

   1. Choose the Folder tab (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/folder.png)) on the left sidebar.

   1. Beneath the file search bar, choose the folder icon to the left of the currently selected repository. When you hover over the folder icon, you will see the user directory (`/home/studio-lab-user`).

   1. Once you are in the user directory, choose the Git tab on the left sidebar.

   1. Choose **Clone a Repository**.

1. Under **Git repository URL (.git)**, you are prompted for a URL. Find this URL using the following steps.

1. On a new browser tab, navigate to your Studio Lab project overview page. The URL takes the following format.

   ```
   https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>
   ```

1. Under **New to machine learning?**, choose **Dive into Deep Learning**. 

1. From the new **Dive into Deep Learning** browser tab, choose **GitHub** to open a new page with the example notebooks.

1. Choose **Code** and copy the GitHub repository's URL in the **HTTPS** tab.

1. Return to the open Studio Lab project browser tab, paste the D2L repository URL, and choose **Clone** to clone the repository.

**AWS Machine Learning University** 

The AWS Machine Learning University (MLU) provides access to the machine learning courses used to train Amazon’s own developers. With AWS MLU, any developer can learn how to use machine learning with the learn-at-your-own-pace MLU Accelerator learning series. The MLU Accelerator series is designed to help developers begin their ML journey. It offers three-day foundational courses on these three subjects: Natural Language Processing, Tabular Data, and Computer Vision. For more information, see [Machine Learning University](https://aws.amazon.com//machine-learning/mlu/). 

The following procedure shows how to clone the AWS MLU Jupyter notebooks to your instance. 

1. Start and open the Studio Lab project runtime environment by following [Start your project runtime](studio-lab-manage-runtime.md#studio-lab-manage-runtime-start).

1. Once Studio Lab is open, choose the Git tab (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/git.png)) on the left sidebar. 

1. Choose **Clone a Repository**.

   If you do not see the **Clone a Repository** option, you may currently be in a Git repository. In that case, use the following substeps.

   1. Choose the Folder tab (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/folder.png)) on the left sidebar.

   1. Beneath the file search bar, choose the folder icon to the left of the currently selected repository. When you hover over the folder icon, you will see the user directory (`/home/studio-lab-user`).

   1. Once you are in the user directory, choose the Git tab on the left sidebar.

   1. Choose **Clone a Repository**.

1. Under **Git repository URL (.git)**, you are prompted for a URL. Find this URL using the following steps.

1. On a new browser tab, navigate to your Studio Lab project overview page. The URL takes the following format.

   ```
   https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>
   ```

1. Under **New to machine learning?**, choose **AWS Machine Learning University**. 

1. From the new **AWS Machine Learning University** browser tab, find a course that interests you by reading the **Course Summary** for each course.

1. Under **Course Content**, choose the GitHub repository for the course of interest to open a new page with the example notebooks.

1. Choose **Code** and copy the GitHub repository's URL in the **HTTPS** tab.

1. Return to the Studio Lab open project browser tab, paste the MLU repository URL, and choose **Clone** to clone the repository.

**Roboflow** 

Roboflow gives you tools to label data and to train and fine-tune models for computer vision applications. For more information, see [https://roboflow.com/](https://roboflow.com/).

The following procedure shows how to clone the Roboflow Jupyter notebooks to your instance.

1. Navigate to the Studio Lab project overview page. The URL takes the following format.

   ```
   https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>
   ```

1. Under **Resources and community**, find **Make AI Generated Images**.

1. Under **Make AI Generated Images** choose **Open notebook**.

1. Follow the tutorial under the Notebook preview.

# Studio Lab pre-installed environments
<a name="studio-lab-environments"></a>

Amazon SageMaker Studio Lab uses conda environments to manage packages (or libraries) for your projects. This guide explains what conda environments are, how to interact with them, and the different pre-installed environments available in Studio Lab.

A conda environment is a directory that contains a collection of packages you have installed. It allows you to create isolated environments with specific package versions, preventing conflicts between projects with different dependencies.

You can interact with conda environments in Studio Lab in two ways:
+ Terminal: Use the terminal to create, activate, and manage environments.
+ JupyterLab notebook: When opening a notebook, select the kernel named after the environment you want to use. The notebook then uses the packages installed in that environment.

For a walkthrough on managing environments, see [Manage your environment](studio-lab-use-manage.md).
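To confirm which environment a notebook kernel is actually using, you can inspect the interpreter from a notebook cell. This is a quick sanity check; the example paths in the comments are illustrative:

```python
import sys

# Root directory of the active environment. For a conda environment this
# path typically ends with the environment's name, for example
# .../envs/default or .../envs/sagemaker-distribution.
print(sys.prefix)

# The Python executable that this kernel is running.
print(sys.executable)
```

If the printed path does not match the environment you expected, switch the notebook's kernel before installing packages or running code.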

Studio Lab comes with several pre-installed environments, which are either persistent or non-persistent. Changes that you make to persistent environments remain in place for your next session. Changes to non-persistent environments do not carry over to later sessions, but the packages in those environments are updated and tested for compatibility by Amazon SageMaker AI. Here's an overview of each environment and its use case:
+ `sagemaker-distribution`: A non-persistent environment managed by Amazon SageMaker AI. It contains popular packages for machine learning, data science, and visualization. This environment is regularly updated and tested for compatibility. Use this environment if you want a fully-managed setup with common packages pre-installed. 

  The `sagemaker-distribution` environment is closely related to the environment used in Amazon SageMaker Studio Classic, so after graduating from Studio Lab to Studio Classic the notebooks should run similarly. For information on exporting your environment from Studio Lab to Studio Classic, see [Export an Amazon SageMaker Studio Lab environment to Amazon SageMaker Studio Classic](studio-lab-use-migrate.md).
+ `default`: A persistent environment with minimal packages pre-installed. Use this environment if you want to customize it significantly by installing additional packages. 
+ `studiolab`: A persistent environment where JupyterLab and related packages are installed. Use this environment for configuring the JupyterLab user interface and installing Jupyter server extensions.
+ `studiolab-safemode`: A non-persistent environment activated automatically when there's an issue with your project runtime. Use this environment for troubleshooting purposes. For information on troubleshooting, see [Troubleshooting](studio-lab-troubleshooting.md). 
+ `base`: A non-persistent environment used for system tooling. This environment is not intended for customer use.

To view the packages in an environment, run the command `conda list`.

For more information on installing packages within your environment, see [Customize your environment](studio-lab-use-manage.md#studio-lab-use-manage-conda-default-customize).

If you plan to graduate from Studio Lab to Amazon SageMaker Studio Classic, see [Export an Amazon SageMaker Studio Lab environment to Amazon SageMaker Studio Classic](studio-lab-use-migrate.md).

For information on SageMaker images and their versions, see [Amazon SageMaker Images Available for Use With Studio Classic Notebooks](notebooks-available-images.md).

# Use the Amazon SageMaker Studio Lab project runtime
<a name="studio-lab-use"></a>

 The following topics give information about using the Amazon SageMaker Studio Lab project runtime. Before you can use the Studio Lab project runtime, you must onboard to Studio Lab by following the steps in [Onboard to Amazon SageMaker Studio Lab](studio-lab-onboard.md).

**Topics**
+ [Amazon SageMaker Studio Lab UI overview](studio-lab-use-ui.md)
+ [Create or open an Amazon SageMaker Studio Lab notebook](studio-lab-use-create.md)
+ [Use the Amazon SageMaker Studio Lab notebook toolbar](studio-lab-use-menu.md)
+ [Manage your environment](studio-lab-use-manage.md)
+ [Use external resources in Amazon SageMaker Studio Lab](studio-lab-use-external.md)
+ [Get notebook differences](studio-lab-use-diff.md)
+ [Export an Amazon SageMaker Studio Lab environment to Amazon SageMaker Studio Classic](studio-lab-use-migrate.md)
+ [Shut down Studio Lab resources](studio-lab-use-shutdown.md)

# Amazon SageMaker Studio Lab UI overview
<a name="studio-lab-use-ui"></a>

Amazon SageMaker Studio Lab extends the JupyterLab interface. Previous users of JupyterLab will notice similarities between the JupyterLab and Studio Lab UI, including the workspace. For an overview of the basic JupyterLab interface, see [The JupyterLab Interface](https://jupyterlab.readthedocs.io/en/latest/user/interface.html).

The following image shows Studio Lab with the file browser open and the Studio Lab Launcher displayed.

![\[The layout of the project user interface.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio-lab-ui.png)


You will find the *menu bar* at the top of the screen. The *left sidebar* contains icons to open file browsers, resource browsers, and tools. The *status bar* is located at the bottom-left corner of Studio Lab.

The main work area is divided horizontally into two panes. The left pane is the *file and resource browser*. The right pane contains one or more tabs for resources, such as notebooks and terminals.

**Topics**
+ [Left sidebar](#studio-lab-use-ui-nav-bar)
+ [File and resource browser](#studio-lab-use-ui-browser)
+ [Main work area](#studio-lab-use-ui-work)

## Left sidebar
<a name="studio-lab-use-ui-nav-bar"></a>

The left sidebar includes the following icons. When you hover over an icon, a tooltip displays the icon name. When you choose an icon, the file and resource browser displays the described functionality. For hierarchical entries, a selectable breadcrumb at the top of the browser shows your location in the hierarchy.


| Icon | Description | 
| --- | --- | 
|  ![\[The File Browser icon\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/File_browser_squid@2x.png)  |  **File Browser** Choose the **Upload Files** icon (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/File_upload_squid.png)) to add files to Studio Lab. Double-click a file to open the file in a new tab. To have adjacent files open, choose a tab that contains a notebook, Python, or text file, and then choose **New View for File**. Choose the plus (**+**) sign on the menu at the top of the file browser to open the Studio Lab Launcher.  | 
|  ![\[The Running Terminals and Kernels icon\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/Running_squid@2x.png)  |  **Running Terminals and Kernels** You can see a list of all of the running terminals and kernels in your project. For more information, see [Shut down Studio Lab resources](studio-lab-use-shutdown.md).  | 
|  ![\[The Git icon\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/Git_squid@2x.png)  |  **Git** You can connect to a Git repository and then access a full range of Git tools and operations. For more information, see [Use external resources in Amazon SageMaker Studio Lab](studio-lab-use-external.md).  | 
|  ![\[The Table of Contents icon\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/studio-lab-toc.png)  |  **Table of Contents** You can access the Table of Contents for your current Jupyter notebook.  | 
|  ![\[The Extension Manager icon\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/studio-lab-extension.png)  |  **Extension Manager** You can enable and manage third-party JupyterLab extensions.  | 

## File and resource browser
<a name="studio-lab-use-ui-browser"></a>

The file and resource browser shows lists of your notebooks and files. On the menu at the top of the file browser, choose the plus (**+**) sign to open the Studio Lab Launcher. The Launcher allows you to create a notebook or open a terminal.

## Main work area
<a name="studio-lab-use-ui-work"></a>

The main work area has multiple tabs that contain your open notebooks and terminals.

# Create or open an Amazon SageMaker Studio Lab notebook
<a name="studio-lab-use-create"></a>

When you create a notebook in Amazon SageMaker Studio Lab or open a notebook in Studio Lab, you must select a kernel for the notebook. The following topics describe how to create and open notebooks in Studio Lab.

For information about shutting down the notebook, see [Shut down Studio Lab resources](studio-lab-use-shutdown.md).

**Topics**
+ [Open a Studio Lab notebook](#studio-lab-use-create-open)
+ [Create a notebook from the file menu](#studio-lab-use-create-file)
+ [Create a notebook from the Launcher](#studio-lab-use-create-launcher)

## Open a Studio Lab notebook
<a name="studio-lab-use-create-open"></a>

Studio Lab can only open notebooks listed in the Studio Lab file browser. To clone a notebook into your file browser from an external repository, see [Use external resources in Amazon SageMaker Studio Lab](studio-lab-use-external.md).

**To open a notebook**

1. In the left sidebar, choose the **File Browser** icon (![\[Dark blue square icon with a white outline of a cloud and an arrow pointing upward.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/File_browser_squid.png)) to display the file browser.

1. Browse to a notebook file and double-click it to open the notebook in a new tab.

## Create a notebook from the file menu
<a name="studio-lab-use-create-file"></a>

**To create a notebook from the File menu**

1. From the Studio Lab menu, choose **File**, choose **New**, and then choose **Notebook**.

1. To use the default kernel, in the **Select Kernel** dialog box, choose **Select**. Otherwise, to select a different kernel, use the dropdown menu.

## Create a notebook from the Launcher
<a name="studio-lab-use-create-launcher"></a>

**To create a notebook from the Launcher**

1. Open the Launcher by using the keyboard shortcut `Ctrl + Shift + L`.

   Alternatively, you can open the Launcher from the left sidebar: Choose the **File Browser** icon, and then choose the plus (**+**) icon.

1. To use the default kernel from the Launcher, under **Notebook**, choose **default:Python**. Otherwise, select a different kernel.

After you choose the kernel, your notebook launches and opens in a new Studio Lab tab. 

To view the notebook's kernel session, in the left sidebar, choose the **Running Terminals and Kernels** icon (![\[Square icon with a white outline of a cloud on a dark blue background.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/Running_squid.png)). You can stop the notebook's kernel session from this view.

# Use the Amazon SageMaker Studio Lab notebook toolbar
<a name="studio-lab-use-menu"></a>

Amazon SageMaker Studio Lab notebooks extend the JupyterLab interface. For an overview of the basic JupyterLab interface, see [The JupyterLab Interface](https://jupyterlab.readthedocs.io/en/latest/user/interface.html).

The following image shows the toolbar and an empty cell from a Studio Lab notebook.

![\[The layout of the notebook toolbar, including the toolbar icons.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio-lab-menu.png)


When you hover over a toolbar icon, a tooltip displays the icon function. You can find additional notebook commands in the Studio Lab main menu. The toolbar includes the following icons:


| Icon | Description | 
| --- | --- | 
|  ![\[The Save and checkpoint icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/studio-lab-save-and-checkpoint.png)  |  **Save and checkpoint** Saves the notebook and updates the checkpoint file.  | 
|  ![\[The Insert cell icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/studio-lab-insert-cell.png)  |  **Insert cell** Inserts a code cell below the current cell. The current cell is noted by the blue vertical marker in the left margin.  | 
|  ![\[The Cut, copy, and paste cells icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/studio-lab_cut_copy_paste.png)  |  **Cut, copy, and paste cells** Cuts, copies, and pastes the selected cells.  | 
|  ![\[The Run cells icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/studio-lab-run.png)  |  **Run cells** Runs the selected cells. The cell that follows the last selected cell becomes the newly selected cell.  | 
|  ![\[The Interrupt kernel icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/studio-lab-interrupt-kernel.png)  |  **Interrupt kernel** Interrupts the kernel, which cancels the currently-running operation. The kernel remains active.  | 
|  ![\[The Restart kernel icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/studio-lab-restart-kernel.png)  |  **Restart kernel** Restarts the kernel. Variables are reset. Unsaved information is not affected.  | 
|  ![\[The Restart kernel and re-run notebook icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/studio-lab-restart-rerun-kernel.png)  |  **Restart kernel and re-run notebook** Restarts the kernel. Variables are reset. Unsaved information is not affected. Then re-runs the entire notebook.  | 
|  ![\[The Cell type icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/studio-lab_cell.png)  |  **Cell type** Displays or changes the current cell type. The cell types are: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/studio-lab-use-menu.html)  | 
|  ![\[The Checkpoint diff icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/studio-lab-checkpoint-diff.png)  |  **Checkpoint diff** Opens a new tab that displays the difference between the notebook and the checkpoint file. For more information, see [Get notebook differences](studio-lab-use-diff.md).  | 
|  ![\[The Git diff icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/studio-lab-git-diff.png)  |  **Git diff** Only enabled if the notebook is opened from a Git repository. Opens a new tab that displays the difference between the notebook and the last Git commit. For more information, see [Get notebook differences](studio-lab-use-diff.md).  | 
|  **default**  |  **Kernel** Displays or changes the kernel that processes the cells in the notebook. `No Kernel` indicates that the notebook was opened without specifying a kernel. You can edit the notebook, but you can't run any cells.  | 
|  ![\[The Kernel busy status icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/studio-lab-kernel.png)  |  **Kernel busy status** Displays a kernel's busy status by showing the circle's edge and its interior as the same color. The kernel is busy when it is starting and when it is processing cells. Additional kernel states are displayed in the status bar at the bottom-left corner of Studio Lab.  | 

# Manage your environment
<a name="studio-lab-use-manage"></a>

Amazon SageMaker Studio Lab provides pre-installed environments for your Studio Lab notebook instances. Environments allow you to start up a Studio Lab notebook instance with the packages you want to use. This is done by installing packages in the environment and then selecting the environment as a Kernel. 

Studio Lab has various environments pre-installed for you. Use the `sagemaker-distribution` environment if you want a fully managed environment that already contains many popular packages used by machine learning (ML) engineers and data scientists. Otherwise, use the `default` environment if you want persistent customization of your environment. For more information on the available pre-installed Studio Lab environments, see [Studio Lab pre-installed environments](studio-lab-environments.md).

You can customize your environment by adding new packages (or libraries) to it. You can also create new environments from Studio Lab, import compatible environments, reset your environment to create space, and more. 

The following commands are intended to be run in a Studio Lab terminal. However, when installing packages, we highly recommend installing them from within your Studio Lab Jupyter notebook, which ensures that the packages are installed in the intended environment. To run these commands in a Jupyter notebook, prefix each command with `%` before running the cell. For example, `pip list` in a terminal corresponds to `%pip list` in a Jupyter notebook.

The following sections give information about your `default` conda environment, how to customize it, and how to add and remove conda environments. For a list of sample environments that you can install into Studio Lab, see [Creating Custom conda Environments](https://github.com/aws/studio-lab-examples/tree/main/custom-environments). To use these sample environment YAML files with Studio Lab, see [Step 4: Install your Studio Lab conda environments in Studio Classic](studio-lab-use-migrate.md#studio-lab-use-migrate-step4). 

**Topics**
+ [Your default environment](#studio-lab-use-manage-conda-default)
+ [View environments](#studio-lab-use-view-conda-envs)
+ [Create, activate, and use new conda environments](#studio-lab-use-manage-conda-new-conda)
+ [Using sample Studio Lab environments](#studio-lab-use-manage-conda-sample)
+ [Customize your environment](#studio-lab-use-manage-conda-default-customize)
+ [Refresh Studio Lab](#studio-lab-use-manage-conda-reset)

## Your default environment
<a name="studio-lab-use-manage-conda-default"></a>

Studio Lab uses conda environments to encapsulate the software packages that are needed to run notebooks. Your project contains a default conda environment, named `default`, with the [IPython kernel](https://ipython.readthedocs.io/en/stable/). This environment serves as the default kernel for your Jupyter notebooks.

## View environments
<a name="studio-lab-use-view-conda-envs"></a>

To view the environments in Studio Lab, you can use a terminal or a Jupyter notebook. The following command is for a Studio Lab terminal. If you wish to run the corresponding command in a Jupyter notebook, see [Manage your environment](#studio-lab-use-manage).

Open the Studio Lab terminal by opening the **File Browser** panel (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/folder.png)), choosing the plus (**+**) sign on the menu at the top of the file browser to open the **Launcher**, and then choosing **Terminal**. From the Studio Lab terminal, list the conda environments by running the following.

```
conda env list
```

This command outputs a list of the conda environments and their locations in the file system. When you onboard to Studio Lab, you automatically activate the `studiolab`  conda environment. The following is an example of listed environments after you onboard.

```
# conda environments:
#
default                  /home/studio-lab-user/.conda/envs/default
studiolab             *  /home/studio-lab-user/.conda/envs/studiolab
studiolab-safemode       /opt/amazon/sagemaker/safemode-home/.conda/envs/studiolab-safemode
base                     /opt/conda
sagemaker-distribution     /opt/conda/envs/sagemaker-distribution
```

The `*` marks the activated environment.
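
If you need the active environment's name in a script, you can parse the starred row out of this output. The following is a minimal sketch, assuming the output format shown above; the here-document stands in for live output, and in practice you would pipe `conda env list` into the function instead.

```shell
# Print the name of the active environment (the row marked with `*`).
active_env() {
  awk '$2 == "*" { print $1 }'
}

# Sample output from above; replace the here-document with `conda env list`.
active_env <<'EOF'
# conda environments:
#
default                  /home/studio-lab-user/.conda/envs/default
studiolab             *  /home/studio-lab-user/.conda/envs/studiolab
base                     /opt/conda
EOF
```

This prints `studiolab` for the sample output, because that row carries the `*` marker in its second field.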

## Create, activate, and use new conda environments
<a name="studio-lab-use-manage-conda-new-conda"></a>

If you would like to maintain multiple environments for different use cases, you can create new conda environments in your project. The following sections show how to create and activate new conda environments. For a Jupyter notebook that shows how to create a custom environment, see [Setting up a Custom Environment in SageMaker Studio Lab](https://github.com/aws/studio-lab-examples/blob/main/custom-environments/custom_environment.ipynb).

**Note**  
Maintaining multiple environments counts against your available Studio Lab memory.

 **Create conda environment** 

To create a conda environment, run the following conda command from your terminal. This example creates a new environment with Python 3.9. 

```
conda create --name <ENVIRONMENT_NAME> python=3.9
```

Once the conda environment is created, you can view the environment in your environment list. For more information on how to view your environment list, see [View environments](#studio-lab-use-view-conda-envs).

 **Activate a conda environment** 

To activate any conda environment, run the following command in the terminal.

```
conda activate <ENVIRONMENT_NAME>
```

After you activate an environment, any packages installed using conda or pip are installed in that environment. For more information on installing packages, see [Customize your environment](#studio-lab-use-manage-conda-default-customize).

 **Use a conda environment** 

1. To use your new conda environments with notebooks, make sure the `ipykernel` package is installed in the environment.

   ```
   conda install ipykernel
   ```

1. Once the `ipykernel` package is installed in the environment, you can select the environment as the kernel for your notebook. 

   You may need to restart JupyterLab to see the environment available as a kernel. This can be done by choosing **Amazon SageMaker Studio Lab** in the top menu of your Studio Lab open project, and choosing **Restart JupyterLab...**. 

1. You can choose the kernel for an existing notebook or when you create a new one.
   + For an existing notebook: Open the notebook and choose the current kernel from the right side of the top menu. Choose the kernel you wish to use from the drop-down menu.
   + For a new notebook: Open the Studio Lab launcher and choose the kernel under **Notebook**. This opens the notebook with the kernel you chose.

     For an overview of the Studio Lab UI, see [Amazon SageMaker Studio Lab UI overview](studio-lab-use-ui.md).

## Using sample Studio Lab environments
<a name="studio-lab-use-manage-conda-sample"></a>

Studio Lab provides sample custom environments through the [SageMaker Studio Lab Examples](https://github.com/aws/studio-lab-examples) repository. The following shows how to clone and build these environments.

1. Clone the SageMaker Studio Lab Examples GitHub repository by following the instructions in [Use GitHub resources](studio-lab-use-external.md#studio-lab-use-external-clone-github).

1. In Studio Lab choose the **File Browser** icon (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/folder.png)) on the left menu, so that the **File Browser** panel shows on the left.

1. Navigate to the `studio-lab-examples/custom-environments` directory in the File Browser.

1. Open the directory for the environment that you want to build.

1. Right click the `.yml` file in the folder, then select **Build conda Environment**.

1. You can now use the environment as a kernel after your conda environment has finished building. For instructions on how to use an existing environment as a kernel, see [Create, activate, and use new conda environments](#studio-lab-use-manage-conda-new-conda).

## Customize your environment
<a name="studio-lab-use-manage-conda-default-customize"></a>

You can customize your environment by installing and removing extensions and packages as needed. Studio Lab comes with environments that have packages pre-installed. Using an existing environment can save you time and memory, because pre-installed packages do not count against your available Studio Lab memory. For more information on the available pre-installed Studio Lab environments, see [Studio Lab pre-installed environments](studio-lab-environments.md).

Any extensions and packages installed in your `default` environment persist in your project. That is, you do not need to reinstall your packages for every project runtime session. However, extensions and packages installed in your `sagemaker-distribution` environment do not persist, so you will need to reinstall them during your next session. For this reason, we highly recommend installing packages within your notebook to ensure that they are installed in the intended environment.

To view your environments, run the command `conda env list`.

To activate your environment, run the command `conda activate <ENVIRONMENT_NAME>`.

To view the packages in an environment, run the command `conda list`.

 **Install packages** 

It is highly recommended to install your packages within your Jupyter notebook to ensure that your packages are installed in the intended environment. To install additional packages to your environment from a Jupyter notebook, run one of the following commands in a cell within your Jupyter notebook. These commands install packages in the currently activated environment. 
+  `%conda install <PACKAGE>` 
+  `%pip install <PACKAGE>` 

We don't recommend using the `!pip` or `!conda` commands because they can behave in unexpected ways when you have multiple environments. 

After you install new packages to your environment, you may need to restart the kernel to ensure that the packages work in your notebook. This can be done by choosing **Amazon SageMaker Studio Lab** in the top menu of your Studio Lab open project and choosing **Restart JupyterLab...**. 

 **Remove packages** 

To remove a package, run the command

```
%conda remove <PACKAGE_NAME>
```

This command will also remove any package that depends on `<PACKAGE_NAME>`, unless a replacement can be found without that dependency. 

To remove all of the packages in an environment, run the command

```
conda deactivate && conda env remove --name <ENVIRONMENT_NAME>
```

## Refresh Studio Lab
<a name="studio-lab-use-manage-conda-reset"></a>

To refresh Studio Lab, remove all of your environments and files. 

1. List all conda environments.

   ```
   conda env list
   ```

1. Activate the base environment.

   ```
   conda activate base
   ```

1. Remove each environment in the list of conda environments, except `base`.

   ```
   conda remove --name <ENVIRONMENT_NAME> --all
   ```

1. Delete all of the files on your Studio Lab.

   ```
   rm -rf *.*
   ```
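
Steps 1 through 3 can be combined into a small loop. The following is a hedged sketch, not an official cleanup script: it parses `conda env list` output to collect every environment except `base`, and leaves the actual removal as a comment. The here-document stands in for live `conda env list` output so the parsing can be tried without conda.

```shell
# Collect removable environment names: skip comment lines, keep the first
# field of each non-empty row, and exclude `base`.
removable_envs() {
  awk '!/^#/ && NF && $1 != "base" { print $1 }'
}

# Sample output; in practice use: conda env list | removable_envs
envs=$(removable_envs <<'EOF'
# conda environments:
#
default                  /home/studio-lab-user/.conda/envs/default
studiolab             *  /home/studio-lab-user/.conda/envs/studiolab
base                     /opt/conda
EOF
)
echo "$envs"

# With conda available, each environment would then be removed:
# for env in $envs; do conda remove --name "$env" --all --yes; done
```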

# Use external resources in Amazon SageMaker Studio Lab
<a name="studio-lab-use-external"></a>

With Amazon SageMaker Studio Lab, you can integrate external resources, such as Jupyter notebooks and data, from Git repositories and Amazon S3. You can also add an **Open in Studio Lab** button to your GitHub repo and notebooks. This button lets you clone your notebooks directly from Studio Lab.

The following topics show how to integrate external resources.

**Topics**
+ [Use GitHub resources](#studio-lab-use-external-clone-github)
+ [Add an **Open in Studio Lab** button to your notebook](#studio-lab-use-external-add-button)
+ [Import files from your computer](#studio-lab-use-external-import)
+ [Connect to Amazon S3](#studio-lab-use-external-s3)

## Use GitHub resources
<a name="studio-lab-use-external-clone-github"></a>

Studio Lab offers integration with GitHub. With this integration, you can clone notebooks and repositories directly to your Studio Lab project. 

The following topics give information about how to use GitHub resources with Studio Lab.

### Studio Lab sample notebooks
<a name="studio-lab-use-external-clone-examples"></a>

To get started with a repository of sample notebooks tailored for Studio Lab, see [Studio Lab Sample Notebooks](https://github.com/aws/studio-lab-examples#sagemaker-studio-lab-sample-notebooks).

This repository provides notebooks for the following use cases and others.
+ Computer vision
+ Connecting to AWS
+ Creating custom environments
+ Geospatial data analysis
+ Natural language processing
+ Using R

### Clone a GitHub repo
<a name="studio-lab-use-external-clone-repo"></a>

To clone a GitHub repo to your Studio Lab project, follow these steps. 

1. Start your Studio Lab project runtime. For more information on launching Studio Lab project runtime, see [Start your project runtime](studio-lab-manage-runtime.md#studio-lab-manage-runtime-start). 

1. In Studio Lab, choose the **File Browser** icon (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/folder.png)) on the left menu, so that the **File Browser** panel shows on the left. 

1. Navigate to your user directory by choosing the file icon beneath the file search bar. 

1. Select the **Git** icon (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/git.png)) from the left menu to open a new dropdown menu. 

1. Choose **Clone a Repository**. 

1. Paste the repository's URL under **Git repository URL (.git)**. 

1. Select **Clone**. 

### Clone individual notebooks from GitHub
<a name="studio-lab-use-external-clone-individual"></a>

To open a notebook in Studio Lab, you must have access to the repo that the notebook is in. The following examples describe Studio Lab permission-related behavior in various situations.
+ If a repo is public, you can automatically clone the notebook into your project from the Studio Lab preview page.
+ If a repo is private, you are prompted to sign in to GitHub from the Studio Lab preview page. If you have access to a private repo, you can clone the notebook into your project.
+ If you don't have access to a private repo, you cannot clone the notebook from the Studio Lab preview page.

The following sections show two options for you to copy a GitHub notebook in your Studio Lab project. These options depend on whether the notebook has an **Open in Studio Lab** button. 

#### Option 1: Copy notebook with an **Open in Studio Lab** button
<a name="studio-lab-use-external-clone-individual-button"></a>

The following procedure shows how to copy a notebook that has an **Open in Studio Lab** button. If you want to add this button to your notebook, see [Add an **Open in Studio Lab** button to your notebook](#studio-lab-use-external-add-button).

1. Sign in to Studio Lab following the steps in [Sign in to Studio Lab](studio-lab-onboard.md#studio-lab-onboard-signin).

1. In a new browser tab, navigate to the GitHub notebook that you want to clone. 

1. In the notebook, select the **Open in Studio Lab** button to open a new page in Studio Lab with a preview of the notebook.

1. If your project runtime is not already running, start it by choosing the **Start runtime** button at the top of the preview page. Wait for the runtime to start before proceeding to the next step.

1. After your project runtime has started, select **Copy to project** to open your project runtime in a new browser tab. 

1. In the **Copy from GitHub?** dialog box, select **Copy notebook only**. This copies the notebook file to your project.

#### Option 2: Clone any GitHub notebook
<a name="studio-lab-use-external-clone-individual-general"></a>

The following procedure shows how to copy any notebook from GitHub. 

1. Navigate to the notebook in GitHub. 

1. In the browser’s address bar, modify the notebook URL, as follows.

   ```
   # Original URL
   https://github.com/<PATH_TO_NOTEBOOK>
   
   # Modified URL 
   https://studiolab.sagemaker.aws/import/github/<PATH_TO_NOTEBOOK>
   ```

1. Navigate to the modified URL. This opens a preview of the notebook in Studio Lab. 

1. If your project runtime is not already running, start it by choosing the **Start runtime** button at the top of the preview page. Wait for the runtime to start before proceeding to the next step. 

1. After your project runtime has started, select **Copy to project** to open your project runtime in a new browser tab. 

1. In the **Copy from GitHub?** dialog box, select **Copy notebook only** to copy the notebook file to your project.
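
The URL rewrite in step 2 is mechanical, so it can also be scripted. The following is a minimal sketch; the example notebook path is only an illustration.

```shell
# Rewrite a github.com notebook URL into a Studio Lab import URL by
# replacing the github.com prefix and keeping the repository path.
github_url="https://github.com/aws/studio-lab-examples/blob/main/README.md"
import_url="https://studiolab.sagemaker.aws/import/github/${github_url#https://github.com/}"
echo "$import_url"
```

The `${github_url#https://github.com/}` expansion strips the original prefix, leaving `<PATH_TO_NOTEBOOK>` to append to the Studio Lab import base URL.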

## Add an **Open in Studio Lab** button to your notebook
<a name="studio-lab-use-external-add-button"></a>

When you add the **Open in Studio Lab** button to your notebooks, others can clone your notebooks or repositories directly to their Studio Lab projects. If you are sharing your notebook within a public GitHub repository, your content will be publicly readable. Do not share private content, such as AWS access keys or AWS Identity and Access Management credentials, in your notebook.

To add the functional **Open in Studio Lab** button to your Jupyter notebook or repository, add the following markdown to the top of your notebook or repository. 

```
[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/<PATH_TO_YOUR_NOTEBOOK_ON_GITHUB>)
```

## Import files from your computer
<a name="studio-lab-use-external-import"></a>

The following steps show how to import files from your computer to your Studio Lab project.  

1. Open the Studio Lab project runtime. 

1. Open the **File Browser** panel. 

1. In the actions bar of the **File Browser** panel, select the **Upload Files** button. 

1. Select the files that you want to upload from your local machine. 

1. Select **Open**. 



Alternatively, you can drag and drop files from your computer into the **File Browser** panel. 

## Connect to Amazon S3
<a name="studio-lab-use-external-s3"></a>

The AWS CLI enables AWS integration in your Studio Lab project. With this integration, you can pull resources from Amazon S3 to use with your Jupyter notebooks.

To use AWS CLI with Studio Lab, complete the following steps. For a notebook that outlines this integration, see [Using Studio Lab with AWS Resources](https://github.com/aws/studio-lab-examples/blob/main/connect-to-aws/Access_AWS_from_Studio_Lab.ipynb).

1. Install the AWS CLI following the steps in [Installing or updating the latest version of the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html). 

1. Configure your AWS credentials by following the steps in [Quick setup](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html). The role for your AWS account must have permissions to access the Amazon S3 bucket that you are copying data from. 

1. From your Jupyter notebook, clone resources from the Amazon S3 bucket, as needed. The following command shows how to clone all resources from an Amazon S3 path to your project. For more information, see the [AWS CLI Command Reference](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/cp.html).

   ```
   !aws s3 cp s3://<BUCKET_NAME>/<PATH_TO_RESOURCES>/ <PROJECT_DESTINATION_PATH>/ --recursive
   ```

# Get notebook differences
<a name="studio-lab-use-diff"></a>

You can display the difference between the current notebook and the last checkpoint, or the last Git commit, using the Amazon SageMaker Studio Lab project UI.

**Topics**
+ [Get the difference between the last checkpoint](#studio-lab-use-diff-checkpoint)
+ [Get the difference between the last commit](#studio-lab-use-diff-git)

## Get the difference between the last checkpoint
<a name="studio-lab-use-diff-checkpoint"></a>

When you create a notebook, a hidden checkpoint file that matches the notebook is created. You can view changes between the notebook and the checkpoint file, or revert the notebook to match the checkpoint file.

To save the Studio Lab notebook and update the checkpoint file to match: Choose the **Save notebook and create checkpoint** icon (![\[Icon of a cloud with an arrow pointing upward, representing cloud upload functionality.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/Notebook_save.png)). This is located on the Studio Lab menu's left side. The keyboard shortcut for **Save notebook and create checkpoint** is `Ctrl + s`.

To view changes between the Studio Lab notebook and the checkpoint file: Choose the **Checkpoint diff** icon (![\[Camera icon representing image capture or photo functionality.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/Checkpoint_diff.png)), located in the center of the Studio Lab menu.

To revert the Studio Lab notebook to the checkpoint file: On the main Studio Lab menu, choose **File**, and then **Revert Notebook to Checkpoint**.

## Get the difference between the last commit
<a name="studio-lab-use-diff-git"></a>

If a notebook is opened from a Git repository, you can view the difference between the notebook and the last Git commit.

To view the changes in the notebook from the last Git commit: Choose the **Git diff** icon (![\[GitHub icon representing version control and source code management.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/Git_diff.png)) in the center of the notebook menu.

# Export an Amazon SageMaker Studio Lab environment to Amazon SageMaker Studio Classic
<a name="studio-lab-use-migrate"></a>

Amazon SageMaker Studio Classic offers many features for machine learning and deep learning workflows that are unavailable in Amazon SageMaker Studio Lab. This page shows how to migrate a Studio Lab environment to Studio Classic to take advantage of more compute capacity, storage, and features. However, you may want to familiarize yourself with Studio Classic's prebuilt containers, which are optimized for the full MLOps pipeline. For more information, see [Amazon SageMaker Studio Lab](studio-lab.md).

To migrate your Studio Lab environment to Studio Classic, you must first onboard to Studio Classic following the steps in [Amazon SageMaker AI domain overview](gs-studio-onboard.md). 

**Topics**
+ [Step 1: Export your Studio Lab conda environment](#studio-lab-use-migrate-step1)
+ [Step 2: Save your Studio Lab artifacts](#studio-lab-use-migrate-step2)
+ [Step 3: Import your Studio Lab artifacts to Studio Classic](#studio-lab-use-migrate-step3)
+ [Step 4: Install your Studio Lab conda environments in Studio Classic](#studio-lab-use-migrate-step4)

## Step 1: Export your Studio Lab conda environment
<a name="studio-lab-use-migrate-step1"></a>

You can export a conda environment and add libraries or packages to the environment by following the steps in [Manage your environment](studio-lab-use-manage.md). The following example demonstrates exporting the `default` environment to Studio Classic. 

1. Open the Studio Lab terminal by opening the **File Browser** panel (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/folder.png)), choosing the plus (**+**) sign on the menu at the top of the file browser to open the **Launcher**, and then choosing **Terminal**. From the Studio Lab terminal, list the conda environments by running the following.

   ```
   conda env list
   ```

   This command outputs a list of the conda environments and their locations in the file system. When you onboard to Studio Lab, you automatically activate the `studiolab`  conda environment.

   ```
   # conda environments:
   #
   default                  /home/studio-lab-user/.conda/envs/default
   studiolab             *  /home/studio-lab-user/.conda/envs/studiolab
   studiolab-safemode       /opt/amazon/sagemaker/safemode-home/.conda/envs/studiolab-safemode
   base                     /opt/conda
   ```

   We recommend that you do not export the `studiolab`, `studiolab-safemode`, and `base` environments. These environments are not usable in Studio Classic for the following reasons: 
   +  `studiolab`: This sets up the JupyterLab environment for Studio Lab. Studio Lab runs a different major version of JupyterLab than Studio Classic, so it is not usable in Studio Classic. 
   +  `studiolab-safemode`: This also sets up the JupyterLab environment for Studio Lab. Studio Lab runs a different major version of JupyterLab than Studio Classic, so it is not usable in Studio Classic. 
   +  `base`: This environment comes with conda by default. The `base` environment in Studio Lab and the `base` environment in Studio Classic have incompatible versions of many packages. 

1. Activate the conda environment that you want to migrate to Studio Classic. Because the `default` environment changes as libraries are installed in or removed from it, export it into a YAML file to capture its exact state. The following commands export the `default` environment into a YAML file named `myenv.yml` in your home directory.

   ```
   conda activate default
   conda env export > ~/myenv.yml
   ```

## Step 2: Save your Studio Lab artifacts
<a name="studio-lab-use-migrate-step2"></a>

Now that you have saved your environment to a YAML file, you can move the environment file to any platform. 

------
#### [ Save to a local machine using Studio Lab GUI ]

**Note**  
Downloading a directory from the Studio Lab GUI by right-clicking on the directory is currently unavailable. If you wish to export a directory, please follow the steps using the **Save to Git repository** tab. 

One option is to save the environment onto your local machine. To do this, use the following procedure.

1. In Studio Lab, choose the **File Browser** icon (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/folder.png)) on the left menu, so that the **File Browser** panel shows on the left. 

1. Navigate to your user directory by choosing the file icon beneath the file search bar. 

1. Choose (right-click) the `myenv.yml` file and then choose **Download**. You can repeat this process for other files you want to import to Studio Classic. 

------
#### [ Save to a Git repository ]

Another option is to save your environment to a Git repository. This option uses GitHub as an example. These steps require a GitHub account and repository. For more information, visit [GitHub](https://github.com/). The following procedure shows how to synchronize your content with GitHub using the Studio Lab terminal. 

1. From the Studio Lab terminal, navigate to your user directory and make a new directory to contain the files you want to export. 

   ```
   cd ~
   mkdir <NEW_DIRECTORY_NAME>
   ```

1. After you create a new directory, copy any file or directory you want to export to `<NEW_DIRECTORY_NAME>`. 

   Copy a file using the following code format:

   ```
   cp <FILE_NAME> <NEW_DIRECTORY_NAME>
   ```

   For example, replace `<FILE_NAME>` with `myenv.yml`. 

   Copy any directory using the following code format:

   ```
   cp -r <DIRECTORY_NAME> <NEW_DIRECTORY_NAME>
   ```

   For example, replace `<DIRECTORY_NAME>` with any directory name in your user directory.

1. Navigate to the new directory and initialize the directory as a Git repository using the following command. For more information, see the [git-init documentation](https://git-scm.com/docs/git-init). 

   ```
   cd <NEW_DIRECTORY_NAME>
   git init
   ```

1. Using Git, add all relevant files and then commit your changes. 

   ```
   git add .
   git commit -m "<COMMIT_MESSAGE>"
   ```

   For example, replace `<COMMIT_MESSAGE>` with `Add Amazon SageMaker Studio Lab artifacts to GitHub repository to migrate to Amazon SageMaker Studio Classic`.

1. Push the commit to your remote repository. This repository has the format `https://github.com/<GITHUB_USERNAME>/<REPOSITORY_NAME>.git`, where `<GITHUB_USERNAME>` is your GitHub user name and `<REPOSITORY_NAME>` is your remote repository name. Create a branch `<BRANCH_NAME>` to push the content to the GitHub repository.

   ```
   git branch -M <BRANCH_NAME>
   git remote add origin https://github.com/<GITHUB_USERNAME>/<REPOSITORY_NAME>.git
   git push -u origin <BRANCH_NAME>
   ```
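
The local half of this procedure (steps 1 through 4) can be rehearsed end to end. The following sketch uses a temporary directory, a stand-in artifact file, and a placeholder Git identity, and leaves the network-dependent push from step 5 as comments.

```shell
# Stage artifacts in a fresh directory and commit them locally.
export_dir=$(mktemp -d)
echo "name: default" > "$export_dir/myenv.yml"   # stand-in for an exported file
cd "$export_dir"
git init -q
git add .
git -c user.name="Studio Lab User" -c user.email="user@example.com" \
    commit -q -m "Add Studio Lab artifacts"
git log --oneline

# Steps that need a real remote (placeholders from the text):
# git branch -M <BRANCH_NAME>
# git remote add origin https://github.com/<GITHUB_USERNAME>/<REPOSITORY_NAME>.git
# git push -u origin <BRANCH_NAME>
```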

------

## Step 3: Import your Studio Lab artifacts to Studio Classic
<a name="studio-lab-use-migrate-step3"></a>

The following procedure shows how to import artifacts to Studio Classic. How you access Studio Classic through the console depends on whether Studio or Studio Classic is your default experience. For information on accessing Studio Classic through the console, see [Launch Studio Classic if Studio is your default experience](studio-launch.md#studio-launch-console-updated).

From Studio Classic, you can import files from your local machine or from a Git repository. You can do this using the Studio Classic GUI or terminal. The following procedure uses the examples from [Step 2: Save your Studio Lab artifacts](#studio-lab-use-migrate-step2). 

------
#### [ Import using the Studio Classic GUI ]

If you saved the files to your local machine, you can import the files to Studio Classic using the following steps.

1. Open the **File Browser** panel (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/folder.png)) at the top left of Studio Classic. 

1. Choose the **Upload Files** icon (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/File_upload_squid.png)) on the menu at the top of the **File Browser** panel. 

1. Navigate to the file that you want to import, then choose **Open**. 

**Note**  
To import a directory into Studio Classic, first compress the directory on your local machine into a file. On a Mac, right-click the directory and choose **Compress "*<DIRECTORY_NAME>*"**. On Windows, right-click the directory, choose **Send to**, and then choose **Compressed (zipped) folder**. After the directory is compressed, import the compressed file using the preceding steps. Unzip the compressed file by navigating to the Studio Classic terminal and running the command `unzip <DIRECTORY_NAME>.zip`. 

------
#### [ Import using a Git repository ]

This example provides two options for how to clone a GitHub repository into Studio Classic. You can use the Studio Classic GUI by choosing the **Git** (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/git.png)) tab on the left side of Studio Classic. Choose **Clone a Repository**, then paste your GitHub repository URL from [Step 2: Save your Studio Lab artifacts](#studio-lab-use-migrate-step2). Another option is to use the Studio Classic terminal by using the following procedure. 

1. Open the Studio Classic **Launcher**. For more information on opening the **Launcher**, see [Amazon SageMaker Studio Classic Launcher](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-launcher.html). 

1. In the **Launcher**, in the **Notebooks and compute resources** section, choose **Change environment**.

1. In the **Change environment** dialog, use the **Image** dropdown list to select the **Data Science** image and choose **Select**. This image comes with conda pre-installed. 

1. In the Studio Classic **Launcher**, choose **Open image terminal**.

1. From the image terminal, run the following command to clone your repository. This command creates a directory named `<REPOSITORY_NAME>` in your Studio Classic instance and clones your artifacts into that directory.

   ```
   git clone https://github.com/<GITHUB_USERNAME>/<REPOSITORY_NAME>.git
   ```

------

## Step 4: Install your Studio Lab conda environments in Studio Classic
<a name="studio-lab-use-migrate-step4"></a>

You can now recreate your conda environment by using your YAML file in your Studio Classic instance. Open the Studio Classic **Launcher**. For more information on opening the **Launcher**, see [Use the Amazon SageMaker Studio Classic Launcher](studio-launcher.md). From the **Launcher**, choose **Open image terminal**. In the terminal, navigate to the directory that contains the YAML file, then run the following commands. 

```
conda env create --file <ENVIRONMENT_NAME>.yml
conda activate <ENVIRONMENT_NAME>
```
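For reference, an environment YAML file of the kind exported from Studio Lab typically has the following shape; the environment name and package list below are illustrative only:

```
name: <ENVIRONMENT_NAME>
channels:
  - conda-forge
dependencies:
  - python=3.9
  - numpy
  - ipykernel
```

Including `ipykernel` in the dependency list helps the environment appear as a selectable notebook kernel.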

After these commands complete, you can select your environment as the kernel for your Studio Classic notebooks. To view the available environments, run `conda env list`. To activate your environment, run `conda activate <ENVIRONMENT_NAME>`.



# Shut down Studio Lab resources
<a name="studio-lab-use-shutdown"></a>

You can view and shut down your running Amazon SageMaker Studio Lab resources from one location in your Studio Lab environment. The running resource types include terminals and kernels. You can also shut down all resources of one resource type at the same time.

When you shut down all resources belonging to a resource type, the following occurs:
+ **KERNELS** – All kernels, notebooks, and consoles are shut down.
+ **TERMINALS** – All terminals are shut down.

**Shut down Studio Lab resources**

1. Start your Studio Lab project runtime. For more information on starting the Studio Lab project runtime, see [Start your project runtime](studio-lab-manage-runtime.md#studio-lab-manage-runtime-start).

1. Choose the **Running Terminals and Kernels** icon (![\[Square icon with a white outline of a cloud on a dark blue background.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/Running_squid.png)) on the left navigation pane.

1. Choose the **X** symbol to the right of the resource you wish to shut down. You can view the **X** symbol by hovering your cursor over a resource.

1. (Optional) You can shut down all the resources of a given resource type by choosing **Shut Down All** to the right of the resource type name.

# Troubleshooting
<a name="studio-lab-troubleshooting"></a>

This guide describes common errors that might occur when using Amazon SageMaker Studio Lab (Studio Lab). Each entry includes a description of the error and a solution.

**Note**  
You cannot share your password with multiple users or use Studio Lab to mine cryptocurrency. We don’t recommend using Studio Lab for production tasks because of runtime limits.

 **Dependency issues** 

On August 8, 2025, Studio Lab migrated from JupyterLab 3 to JupyterLab 4. For information about JupyterLab 4, see the [4.0.0 - Highlights](https://jupyterlab.readthedocs.io/en/4.0.x/getting_started/changelog.html#highlights) in the JupyterLab Changelog. If you installed JupyterLab extensions in your Studio Lab environment before August 8, 2025, you might need to reinstall them in the JupyterLab 4 environment.

 **Can’t access account** 

If you can’t access your account, verify that you are using the correct email and password. If you have forgotten your password, use the following steps to reset your password. If you still cannot access your account, you must request and register for a new account using the instructions in [Onboard to Amazon SageMaker Studio Lab](studio-lab-onboard.md). 

 **Forgot password** 

If you forget your password, you must reset it using the following steps. 

1. Navigate to the [Studio Lab landing page](https://studiolab.sagemaker.aws).

1. Select **Sign in**.

1. Select **Forgot password?** to open a new page. 

1. Enter the email address that you used to sign up for an account. 

1. Select **Send reset link** to send an email with a password reset link. 

1. From the password reset email, select **Reset your password**. 

1. Enter your new password. 

1. Select **Submit**. 

 **Can't launch project runtime** 

If the Studio Lab project runtime does not launch, try launching it again. If that doesn't work, switch the compute type from CPU to GPU (or vice versa). For more information, see [Change your compute type](studio-lab-manage-runtime.md#studio-lab-manage-runtime-change).

 **Runtime stopped running unexpectedly** 

If there is an issue with the environment used to run JupyterLab, then Studio Lab will automatically recreate the environment. Studio Lab does not support manual activation of this process. 

 **Conflicting versions** 

Because you can add packages and modify your environment as needed, you may run into conflicts between packages in your environment. If there are conflicts between packages in your environment, you must remove the conflicting package.

 **Environment build fails** 

When you build an environment from a YAML file, a package-version conflict or file issue might cause a build to fail. To resolve this, remove the environment by running the following command. Do this before attempting to build it again. 

```
conda remove --name <YOUR_ENVIRONMENT> --all
```

 **Error message about allowing script downloads from the domain \*.awswaf.com** 

Studio Lab uses the web application firewall service AWS WAF to protect your resources, and AWS WAF relies on JavaScript. If you are using a browser security plugin that prevents JavaScript from downloading, this error might appear. To use Studio Lab, allow the JavaScript download from `*.awswaf.com` as a trusted domain. For more information on AWS WAF, see [AWS WAF](https://docs.aws.amazon.com/waf/latest/developerguide/waf-chapter.html) in the *AWS WAF, AWS Firewall Manager, and AWS Shield Advanced Developer Guide*. 

 **Disk space is full** 

If you get a notification saying that your disk space is full, or a **File Load Error for *<FILE_NAME>*** while attempting to open a file, you can remove files, directories, libraries, or environments to increase space. For more information on managing your libraries and environments, see [Manage your environment](studio-lab-use-manage.md).

 **"Project runtime is in safe mode" notification** 

If you get a notification that **Project runtime is in safe mode**, you must free up disk space to resume using the Studio Lab project runtime. Follow the instructions in the preceding troubleshooting item, **Disk space is full**. After at least 500 MB of space has been cleared, restart the project runtime by choosing **Amazon SageMaker Studio Lab** in the top menu of Studio Lab and then choosing **Restart JupyterLab...**.

 **Cannot import `cv2`** 

If you run into an error when importing `cv2` after installing `opencv-python`, you must uninstall `opencv-python` and install `opencv-python-headless` as follows.

```
%pip uninstall opencv-python --yes
%pip install opencv-python-headless
```

You can then import `cv2` as expected.

 **Studio Lab becomes unresponsive when opening large files** 

The Studio Lab IDE may fail to render when large files are opened, resulting in blocked access to Studio Lab resources. To resolve this, reset the Studio Lab workspace using the following procedure.

1. After you open the IDE, copy the URL in your browser's address bar. This URL should be in the `https://xxxxxx.studio.us-east-2.sagemaker.aws/studiolab/default/jupyter/lab` format. Close the tab.

1. In a new tab, paste the URL and remove anything after `https://xxxxxx.studio.us-east-2.sagemaker.aws/studiolab/default/jupyter/lab`.

1. Add `?reset` to the end of the URL, so it is in the `https://xxxxxx.studio.us-east-2.sagemaker.aws/studiolab/default/jupyter/lab?reset` format.

1. Navigate to the updated URL. This resets the saved UI state and makes the Studio Lab IDE responsive.
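The URL edit in the steps above can also be sketched in a terminal with shell parameter expansion; the URL shown is a hypothetical example:

```
# Strip everything after /jupyter/lab and append ?reset
URL="https://xxxxxx.studio.us-east-2.sagemaker.aws/studiolab/default/jupyter/lab/tree/example.ipynb"
echo "${URL%%/jupyter/lab*}/jupyter/lab?reset"
```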

# Amazon SageMaker Canvas
<a name="canvas"></a>

Amazon SageMaker Canvas gives you the ability to use machine learning to generate predictions without needing to write any code. The following are some use cases where you can use SageMaker Canvas:
+ Predict customer churn
+ Plan inventory efficiently
+ Optimize price and revenue
+ Improve on-time deliveries
+ Classify text or images based on custom categories
+ Identify objects and text in images
+ Extract information from documents

With Canvas, you can chat with popular large language models (LLMs), access Ready-to-use models, or build a custom model trained on your data.

Canvas chat is a functionality that leverages open-source and Amazon LLMs to help you boost your productivity. You can prompt the models to get assistance with tasks such as generating content, summarizing or categorizing documents, and answering questions. To learn more, see [Generative AI foundation models in SageMaker Canvas](canvas-fm-chat.md).

The [Ready-to-use models](canvas-ready-to-use-models.md) in Canvas can extract insights from your data for a variety of use cases. You don’t have to build a model to use Ready-to-use models because they are powered by Amazon AI services, including [Amazon Rekognition](https://docs.aws.amazon.com/rekognition/latest/dg/what-is.html), [Amazon Textract](https://docs.aws.amazon.com/textract/latest/dg/what-is.html), and [Amazon Comprehend](https://docs.aws.amazon.com/comprehend/latest/dg/what-is.html). You only have to import your data and start using a solution to generate predictions.

If you want a model that is customized to your use case and trained with your data, you can [build a model](canvas-custom-models.md). You can get predictions customized to your data by doing the following:

1. Import your data from one or more data sources.

1. Build a predictive model.

1. Evaluate the model's performance.

1. Generate predictions with the model.

Canvas supports the following types of custom models:
+ Numeric prediction (also known as *regression*)
+ Categorical prediction for 2 and 3+ categories (also known as *binary* and *multi-class classification*)
+ Time series forecasting
+ Single-label image prediction (also known as *image classification*)
+ Multi-category text prediction (also known as *multi-class text classification*)

To learn more about pricing, see the [SageMaker Canvas pricing page](https://aws.amazon.com/sagemaker/canvas/pricing/). You can also see [Billing and cost in SageMaker Canvas](canvas-manage-cost.md) for more information.

SageMaker Canvas is currently available in the following Regions:
+ US East (Ohio)
+ US East (N. Virginia)
+ US West (N. California)
+ US West (Oregon)
+ Asia Pacific (Mumbai)
+ Asia Pacific (Seoul)
+ Asia Pacific (Singapore)
+ Asia Pacific (Sydney)
+ Asia Pacific (Tokyo)
+ Canada (Central)
+ Europe (Frankfurt)
+ Europe (Ireland)
+ Europe (London)
+ Europe (Paris)
+ Europe (Stockholm)
+ South America (São Paulo)

**Topics**
+ [Are you a first-time SageMaker Canvas user?](#canvas-first-time-user)
+ [Getting started with using Amazon SageMaker Canvas](canvas-getting-started.md)
+ [Tutorial: Build an end-to-end machine learning workflow in SageMaker Canvas](canvas-end-to-end-machine-learning-workflow.md)
+ [Amazon SageMaker Canvas setup and permissions management (for IT administrators)](canvas-setting-up.md)
+ [Generative AI assistance for solving ML problems in Canvas using Amazon Q Developer](canvas-q.md)
+ [Data import](canvas-importing-data.md)
+ [Data preparation](canvas-data-prep.md)
+ [Generative AI foundation models in SageMaker Canvas](canvas-fm-chat.md)
+ [Ready-to-use models](canvas-ready-to-use-models.md)
+ [Custom models](canvas-custom-models.md)
+ [Logging out of Amazon SageMaker Canvas](canvas-log-out.md)
+ [Limitations and troubleshooting](canvas-limits.md)
+ [Billing and cost in SageMaker Canvas](canvas-manage-cost.md)

## Are you a first-time SageMaker Canvas user?
<a name="canvas-first-time-user"></a>

If you are a first-time user of SageMaker Canvas, we recommend that you begin by reading the following sections:
+ For IT administrators – [Amazon SageMaker Canvas setup and permissions management (for IT administrators)](canvas-setting-up.md)
+ For analysts and individual users – [Getting started with using Amazon SageMaker Canvas](canvas-getting-started.md)
+ For an example of an end-to-end workflow – [Tutorial: Build an end-to-end machine learning workflow in SageMaker Canvas](canvas-end-to-end-machine-learning-workflow.md)

# Getting started with using Amazon SageMaker Canvas
<a name="canvas-getting-started"></a>

This guide tells you how to get started with using SageMaker Canvas. If you're an IT administrator and would like more in-depth details, see [Amazon SageMaker Canvas setup and permissions management (for IT administrators)](canvas-setting-up.md) to set up SageMaker Canvas for your users.

**Topics**
+ [Prerequisites for setting up Amazon SageMaker Canvas](#canvas-prerequisites)
+ [Step 1: Log in to SageMaker Canvas](#canvas-getting-started-step1)
+ [Step 2: Use SageMaker Canvas to get predictions](#canvas-getting-started-step2)

## Prerequisites for setting up Amazon SageMaker Canvas
<a name="canvas-prerequisites"></a>

To set up a SageMaker Canvas application, onboard using one of the following setup methods:

1. **Onboard with the AWS console.** To onboard through the AWS console, you first create an Amazon SageMaker AI domain. SageMaker AI domains support the various machine learning (ML) environments such as Canvas and [SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html). For more information about domains, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md).

   1. (Quick) [Use quick setup for Amazon SageMaker AI](onboard-quick-start.md) – Choose this option if you’d like to quickly set up a domain. This grants your user all of the default Canvas permissions and basic functionality. Any additional features such as [document querying](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-fm-chat.html#canvas-fm-chat-query) can be enabled later by an admin. If you want to configure more granular permissions, we recommend that you choose the Advanced option instead.

   1. (Standard) [Use custom setup for Amazon SageMaker AI](onboard-custom.md) – Choose this option if you’d like to complete a more advanced setup of your domain. Maintain granular control over user permissions such as access to data preparation features, generative AI functionality, and model deployments. 

1. **Onboard with CloudFormation.** [CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) automates the provisioning of resources and configurations so that you can set up Canvas for one or more user profiles at the same time. Use this option if you want to automate the onboarding process at scale and make sure that your applications are configured the same way every time. The following [CloudFormation template](https://github.com/aws-samples/cloudformation-studio-domain) provides a streamlined way to onboard to Canvas, ensuring that all required components are properly set up and allowing you to focus on building and deploying your machine learning models.

The following section describes how to onboard to Canvas by using the AWS console to create a domain.

**Important**  
For you to set up Amazon SageMaker Canvas, your version of Amazon SageMaker Studio must be 3.19.0 or later. For information about updating Amazon SageMaker Studio, see [Shut Down and Update Amazon SageMaker Studio Classic](studio-tasks-update-studio.md).

### Onboard with the AWS console
<a name="canvas-prerequisites-domain"></a>

If you’re doing the quick domain setup, then you can follow the instructions in [Use quick setup for Amazon SageMaker AI](onboard-quick-start.md), skip the rest of this section, and move on to [Step 1: Log in to SageMaker Canvas](#canvas-getting-started-step1).

If you’re doing the standard domain setup, then you can specify the Canvas features to which you’d like to grant your users access. Use the rest of this section as you complete the standard domain setup to help you configure the permissions that are specific to Canvas.

In the [Use custom setup for Amazon SageMaker AI](onboard-custom.md) setup instructions, for **Step 2: Users and ML Activities**, you must select the Canvas permissions that you want to grant. In the **ML activities** section, you can select the following permissions policies to grant access to Canvas features. You can only select up to 8 **ML activities** total when setting up your domain. The first two permissions in the following list are required to use Canvas, while the rest are for additional features.
+ **Run Studio Applications** – These permissions are necessary to start up the Canvas application.
+ **[Canvas Core Access](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasFullAccess.html)** – These permissions grant you access to the Canvas application and the basic functionality of Canvas, such as creating datasets, using basic data transforms, and building and analyzing models.
+ (Optional) **[Canvas Data Preparation (powered by Data Wrangler)](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasDataPrepFullAccess.html)** – These permissions grant you access to create data flows and use advanced transforms to prepare your data in Canvas. These permissions are also necessary for creating data processing jobs and data preparation job schedules.
+ (Optional) **[Canvas AI Services](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasAIServicesAccess.html)** – These permissions grant you access to the Ready-to-use models, foundation models, and Chat with Data features in Canvas.
+ (Optional) **Kendra access** – This permission grants you access to the [document querying](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-fm-chat.html#canvas-fm-chat-query) feature, where you can query documents stored in an Amazon Kendra index using foundation models in Canvas.

  If you select this option, then in the **Canvas Kendra Access** section, enter the IDs for your Amazon Kendra indexes to which you want to grant access.
+ (Optional) **[Canvas MLOps](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasDirectDeployAccess.html)** – This permission grants you access to the [model deployment](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-deploy-model.html) feature in Canvas, where you can deploy models for use in production.

In the domain setup’s **Step 3: Applications** section, choose **Configure Canvas** and then do the following:

1.  For the **Canvas storage configuration**, specify where you want Canvas to store the application data, such as model artifacts, batch predictions, datasets, and logs. SageMaker AI creates a `Canvas/` folder inside this bucket to store the data. For more information, see [Configure your Amazon S3 storage](canvas-storage-configuration.md). For this section, do the following:

   1. Select **System managed** if you want to set the location to the default SageMaker AI-created bucket that follows the pattern `s3://sagemaker-{Region}-{your-account-id}`.

   1. Select **Custom S3** to specify your own Amazon S3 bucket as the storage location. Then, enter the Amazon S3 URI.

   1. (Optional) For **Encryption key**, specify a KMS key for encrypting Canvas artifacts stored at the specified location.

1. (Optional) For **Amazon Q Developer**, do the following:

   1. Turn on **Enable Amazon Q Developer in SageMaker Canvas for natural language ML** to give your users permissions to leverage generative AI assistance during their ML workflow in Canvas. This option only grants permissions to query Amazon Q Developer for help with predetermined tasks that can be completed in the Canvas application.

   1. Turn on **Enable Amazon Q Developer chat for general AWS questions** to give your users permissions to make generative AI queries related to AWS services.

1. (Optional) Configure the **Large data processing** section if your users plan to process datasets larger than 5 GB in Canvas. For more detailed information about how to configure these options, see [Grant Users Permissions to Use Large Data across the ML Lifecycle](canvas-large-data-permissions.md).

1. (Optional) For the **ML Ops permissions configuration** section, do the following:

   1. Leave the **Enable direct deployment of Canvas models** option turned on to give your users permissions to deploy their models from Canvas to a SageMaker AI endpoint. For more information about model deployment in Canvas, see [Deploy your models to an endpoint](canvas-deploy-model.md).

   1. Leave the **Enable Model Registry registration permissions for all users** option turned on to give your users permissions to register their model version to the SageMaker AI model registry (it is turned on by default). For more information, see [Register a model version in the SageMaker AI model registry](canvas-register-model.md).

   1. If you left the **Enable Model Registry registration permissions for all users** option turned on, then select either **Register to Model Registry only** or **Register and approve model in Model Registry**.

1. (Optional) For the **Local file upload configuration** section, turn on the **Enable local file upload** option to give your users permissions to upload files to Canvas from their local machines. Turning this option on attaches a cross-origin resource sharing (CORS) policy to the Amazon S3 bucket specified in the **Canvas storage configuration** (and overrides any existing CORS policy). To learn more about local file upload permissions, see [Grant Your Users Permissions to Upload Local Files](canvas-set-up-local-upload.md).

1. (Optional) For the **OAuth settings** section, do the following:

   1. Choose **Add OAuth configuration**.

   1. For **Data source**, select your data source.

   1. For **Secret setup**, select **Create a new secret** and enter the information you have from your identity provider. If you haven’t done the initial OAuth setup with your data source yet, see [Set up connections to data sources with OAuth](canvas-setting-up-oauth.md).

1. (Optional) For the **Canvas Ready-to-use models configuration**, do the following:

   1. Leave the **Enable Canvas Ready-to-use models** option turned on to give your users permissions to generate predictions with Ready-to-use models in Canvas (it is turned on by default). This option also gives you permissions to chat with generative-AI powered models. For more information, see [Generative AI foundation models in SageMaker Canvas](canvas-fm-chat.md).

   1. Leave the **Enable document query using Amazon Kendra** option turned on to give your users permissions to use foundation models for querying documents stored in an Amazon Kendra index. Then, from the dropdown menu, select the existing indexes to which you want to grant access. For more information, see [Generative AI foundation models in SageMaker Canvas](canvas-fm-chat.md).

   1. For **Amazon Bedrock role**, select **Create and use a new execution role** to create a new IAM execution role that has a trust relationship with Amazon Bedrock. This IAM role is assumed by Amazon Bedrock to fine-tune large language models (LLMs) in Canvas. If you already have an execution role with a trust relationship, then select **Use an existing execution role** and choose your role from the dropdown. For more information about manually configuring permissions for your own execution role, see [Grant Users Permissions to Use Amazon Bedrock and Generative AI Features in Canvas](canvas-fine-tuning-permissions.md).

1. Finish configuring the rest of the domain settings using the [Use custom setup for Amazon SageMaker AI](onboard-custom.md) procedures.

**Note**  
If you encounter any issues with granting permissions through the console, such as permissions for Ready-to-use models, see the topic [Troubleshooting issues with granting permissions through the SageMaker AI console](canvas-limits.md#canvas-troubleshoot-trusted-services).

You should now have a SageMaker AI domain set up and all of the Canvas permissions configured.

You can edit the Canvas permissions for a domain or a specific user after the initial domain setup. Individual user settings override the domain settings. To learn how to edit your Canvas permissions in the domain settings, see [Edit domain settings](domain-edit.md).

### Give yourself permissions to use specific features in Canvas
<a name="canvas-prerequisites-permissions"></a>

The following information outlines the various permissions that you can grant to a Canvas user to allow the use of various features and functionalities within Canvas. Some of these permissions can be granted during the domain setup, but some require additional permissions or configuration. Refer to the specific permissions information for each feature that you want to enable:
+ **Local file upload.** The permissions for local file upload are turned on by default in the Canvas base permissions when setting up your domain. If you can't upload local files from your machine to SageMaker Canvas, you can attach a CORS policy to the Amazon S3 bucket that you specified in the Canvas storage configuration. If you allowed SageMaker AI to use the default bucket, the bucket follows the naming pattern `s3://sagemaker-{Region}-{your-account-id}`. For more information, see [Grant Your Users Permissions to Upload Local Files](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-set-up-local-upload.html).
+ **Custom image and text prediction models.** The permissions for building custom image and text prediction models are turned on by default in the Canvas base permissions when setting up your domain. However, if you have a custom IAM configuration and don't want to attach the [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasFullAccess) policy to your user's IAM execution role, then you must explicitly grant your user the necessary permissions. For more information, see [Grant Your Users Permissions to Build Custom Image and Text Prediction Models](canvas-set-up-cv-nlp.md).
+ **Ready-to-use models and foundation models.** You might want to use the Canvas Ready-to-use models to make predictions for your data. With the Ready-to-use models permissions, you can also chat with generative AI-powered models. The permissions are turned on by default when setting up your domain, or you can edit the permissions for a domain that you’ve already created. The Canvas Ready-to-use models permissions option adds the [AmazonSageMakerCanvasAIServicesAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasAIServicesAccess) policy to your execution role. For more information, see the [Get started](canvas-ready-to-use-models.md#canvas-ready-to-use-get-started) section of the Ready-to-use models documentation.

  For more information about getting started with generative AI foundation models, see [Generative AI foundation models in SageMaker Canvas](canvas-fm-chat.md).
+ **Fine-tune foundation models.** If you'd like to fine-tune foundation models in Canvas, you can either add the permissions when setting up your domain, or you can edit the permissions for the domain or user profile after creating your domain. You must add the [AmazonSageMakerCanvasAIServicesAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasAIServicesAccess) policy to the AWS IAM role you chose when setting up the user profile, and you must also add a trust relationship with Amazon Bedrock to the role. For instructions on how to add these permissions to your IAM role, see [Grant Users Permissions to Use Amazon Bedrock and Generative AI Features in Canvas](canvas-fine-tuning-permissions.md). 
+ **Send batch predictions to Amazon QuickSight.** You might want to [send *batch predictions*](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-send-predictions.html), or datasets of predictions you generate from a custom model, to Amazon QuickSight for analysis. In [QuickSight](https://docs.aws.amazon.com/quicksight/latest/user/welcome.html), you can build and publish predictive dashboards with your prediction results. For instructions on how to add these permissions to your Canvas user's IAM role, see [Grant Your Users Permissions to Send Predictions to Amazon QuickSight](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-quicksight-permissions.html).
+ **Deploy Canvas models to a SageMaker AI endpoint.** SageMaker AI Hosting offers *endpoints* which you can use to deploy your model for use in production. You can deploy models built in Canvas to a SageMaker AI endpoint and then make predictions programmatically in a production environment. For more information, see [Deploy your models to an endpoint](canvas-deploy-model.md).
+ **Register model versions to the model registry.** You might want to register *versions* of your model to the [SageMaker AI model registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html), which is a repository for tracking the status of updated versions of your model. A data scientist or MLOps team working in the SageMaker Model Registry can view the versions of your model that you’ve built and approve or reject them. Then, they can deploy your model version to production or kick off an automated workflow. Model registration permissions are turned on by default for your domain. You can manage permissions at the user profile level and grant or remove permissions to specific users. For more information, see [Register a model version in the SageMaker AI model registry](canvas-register-model.md).
+ **Import data from Amazon Redshift.** If you want to import data from Amazon Redshift, you must give yourself additional permissions. You must add the `AmazonRedshiftFullAccess` managed policy to the AWS IAM role you chose when setting up the user profile. For instructions on how to add the policy to the role, see [Grant Users Permissions to Import Amazon Redshift Data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-redshift-permissions.html).

**Note**  
The necessary permissions to import through other data sources, such as Amazon Athena and SaaS platforms, are included in the [AmazonSageMakerFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html) and [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasFullAccess) policies. If you followed the standard setup instructions, these policies should already be attached to your execution role. For more information about these data sources and their permissions, see [Connect to data sources](canvas-connecting-external.md).
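As an illustration of the local file upload requirement described above, a CORS policy attached to the Canvas storage bucket generally has the following shape. This is a sketch only; confirm the exact policy in the local file upload guide linked above before applying it:

```
[
    {
        "AllowedHeaders": ["*"],
        "AllowedMethods": ["POST"],
        "AllowedOrigins": ["*"],
        "ExposeHeaders": []
    }
]
```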

## Step 1: Log in to SageMaker Canvas
<a name="canvas-getting-started-step1"></a>

When the initial setup is complete, you can access SageMaker Canvas with any of the following methods, depending on your use case:
+ In the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/), choose **Canvas** in the left navigation pane. Then, on the **Canvas** page, select your user from the dropdown and launch the Canvas application.
+ Open [SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html), and in the Studio interface, go to the Canvas page and launch the Canvas application.
+ Use your organization’s SAML 2.0-based SSO methods, such as Okta or the IAM Identity Center.

When you log into SageMaker Canvas for the first time, SageMaker AI creates the application and a SageMaker AI *space* for you. The Canvas application’s data is stored in the space. To learn more about spaces, see [Collaboration with shared spaces](domain-space.md). The space consists of your user profile’s applications and a shared directory for all of your applications’ data. If you don’t want to use the default space created by SageMaker AI and would prefer to create your own space for storing application data, see the page [Store SageMaker Canvas application data in your own SageMaker AI space](canvas-spaces-setup.md).

## Step 2: Use SageMaker Canvas to get predictions
<a name="canvas-getting-started-step2"></a>

After you’ve logged in to Canvas, you can start building models and generating predictions for your data.

You can either use Canvas Ready-to-use models to make predictions without building a model, or you can build a custom model for your specific business problem. Review the following information to decide whether Ready-to-use models or custom models are best for your use case.
+ **Ready-to-use models.** With Ready-to-use models, you can use pre-built models to extract insights from your data. The Ready-to-use models cover a variety of use cases, such as language detection and document analysis. To get started making predictions with Ready-to-use models, see [Ready-to-use models](canvas-ready-to-use-models.md).
+ **Custom models.** With custom models, you can build a variety of model types that are customized to make predictions for your data. Use custom models if you’d like to build a model that is trained on your business-specific data and if you’d like to use features such as [evaluating your model’s performance](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-evaluate-model.html). To get started with building a custom model, see [Custom models](canvas-custom-models.md).

# Tutorial: Build an end-to-end machine learning workflow in SageMaker Canvas
<a name="canvas-end-to-end-machine-learning-workflow"></a>

This tutorial guides you through an end-to-end machine learning (ML) workflow using Amazon SageMaker Canvas. SageMaker Canvas is a visual no-code interface that you can use to prepare data and to train and deploy ML models. For the tutorial, you use a NYC taxi dataset to train a model that predicts the fare amount for a given trip. You get hands-on experience with key ML tasks such as assessing data quality and addressing data issues, splitting data into training and test sets, model training and evaluation, making predictions, and deploying your trained model, all within the SageMaker Canvas application.

**Important**  
This tutorial assumes that you or your administrator have created an AWS account. For information about creating an AWS account, see [Getting started: Are you a first time AWS User?](https://docs.aws.amazon.com/accounts/latest/reference/welcome-first-time-user.html)

## Setting up
<a name="canvas-tutorial-setting-up"></a>

An Amazon SageMaker AI domain is a centralized place to manage all your Amazon SageMaker AI environments and resources. A domain acts as a virtual boundary for your work in SageMaker AI, providing isolation and access control for your machine learning (ML) resources. 

To get started with Amazon SageMaker Canvas, you or your administrator must navigate to the SageMaker AI console and create an Amazon SageMaker AI domain. A domain has the storage and compute resources needed for you to run SageMaker Canvas. Within the domain, you configure SageMaker Canvas to access your Amazon S3 buckets and deploy models. Use the following procedure to set up a quick domain and create a SageMaker Canvas application.

**To set up SageMaker Canvas**

1. Navigate to the [SageMaker AI console](https://console.aws.amazon.com/sagemaker).

1. In the left navigation pane, choose **SageMaker Canvas**.

1. Choose **Create a SageMaker AI domain**.

1. Choose **Set up**. The domain can take a few minutes to set up.

The preceding procedure used a quick domain set up. You can perform an advanced set up to control all aspects of the account configuration, including permissions, integrations, and encryption. For more information about a custom set up, see [Use custom setup for Amazon SageMaker AI](onboard-custom.md).

By default, doing the quick domain setup provides you with permissions to deploy models. If you have custom permissions set up through a standard domain and you need to manually grant model deployment permissions, see [Permissions management](canvas-deploy-model.md#canvas-deploy-model-prereqs).

## Flow creation
<a name="canvas-tutorial-flow-creation"></a>

Amazon SageMaker Canvas is a machine learning platform that enables users to build, train, and deploy machine learning models without extensive coding or machine learning expertise. One of the powerful features of Amazon SageMaker Canvas is the ability to import and work with large datasets from various sources, such as Amazon S3.

For this tutorial, we're using the NYC taxi dataset to predict the fare amount for each trip using an Amazon SageMaker Canvas Data Wrangler data flow. The following procedure outlines the steps for importing a modified version of the NYC taxi dataset into a data flow.

**Note**  
For improved processing, SageMaker Canvas imports a sample of your data. By default, it randomly samples 50,000 rows.
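Canvas performs this sampling internally, but the idea is a uniform random sample without replacement, capped at 50,000 rows. The following stdlib-only sketch illustrates the concept on a hypothetical dataset (the function name and synthetic rows are illustrative, not part of the Canvas API):

```python
import random

def sample_rows(rows, max_rows=50_000, seed=None):
    """Uniformly sample up to max_rows rows without replacement,
    mirroring the kind of default random sampling Canvas applies."""
    rng = random.Random(seed)
    if len(rows) <= max_rows:
        return list(rows)
    return rng.sample(rows, max_rows)

# Synthetic stand-in for an imported dataset
rows = [{"trip_id": i, "fare_amount": round(5 + i * 0.01, 2)} for i in range(100_000)]
sampled = sample_rows(rows, seed=42)
print(len(sampled))  # 50000
```

Working on a sample like this keeps interactive analyses fast; report results on the full dataset can differ slightly, which is why a second, full-dataset report is generated later in this tutorial.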

**To import the NYC taxi dataset**

1. From the SageMaker Canvas home page, choose **Data Wrangler**.

1. Choose **Import data**.

1. Select **Tabular**.

1. Choose the toolbox icon next to the data source.

1. Select **Amazon S3** from the dropdown.

1. For **Input S3 endpoint**, specify `s3://amazon-sagemaker-data-wrangler-documentation-artifacts/canvas-single-file-nyc-taxi-dataset.csv`.

1. Choose **Go**.

1. Select the checkbox next to the dataset.

1. Choose **Preview data**.

1. Choose **Save**.

## Data Quality and Insights Report 1 (sample)
<a name="canvas-tutorial-data-quality-insights-report-1"></a>

After importing a dataset into Amazon SageMaker Canvas, you can generate a Data Quality and Insights report on a sample of the data. The report provides valuable insights into the dataset. It does the following:
+ Assesses the dataset's completeness
+ Identifies missing values and outliers

It can also identify other potential issues that might impact model performance, and it evaluates the predictive power of each feature relative to the target variable, helping you identify the most relevant features for the problem you're trying to solve.

We can use the insights from the report to predict the fare amount. When you specify the **Fare amount** column as the target variable and select **Regression** as the problem type, the report analyzes the dataset's suitability for predicting continuous values like fare prices. The report should reveal that features like **year** and **hour\_of\_day** have low predictive power for the chosen target variable, providing you with valuable insights.

Use the following procedure to get a Data Quality and Insights report on a 50,000 row sample from the dataset.

**To get a report on a sample**

1. Choose **Get data insights** from the pop up window next to the **Data types** node.

1. For **Analysis name**, specify a name for the report.

1. For **Problem type**, choose **Regression**.

1. For **Target column**, choose **Fare amount**.

1. Choose **Create**.

You can review the Data Quality and Insights report on a sample of your data. The report indicates that the **year** and **hour\_of\_day** features are not predictive of the target variable, **Fare amount**.

At the top of the navigation, choose the name of the data flow to navigate back to it.

## Drop year and hour of day
<a name="canvas-tutorial-drop-year-and-hour-of-day"></a>

We're using insights from the report to drop the **year** and **hour\_of\_day** columns to streamline the feature space and potentially improve model performance.

Amazon SageMaker Canvas provides a user-friendly interface and tools to perform such data transformations.

Use the following procedure to drop the **year** and **hour\_of\_day** columns from the NYC taxi dataset using the Data Wrangler tool in Amazon SageMaker Canvas.

1. Choose the icon next to **Data types**.

1. Choose **Add step**.

1. In the search bar, write **Drop column**.

1. Choose **Manage columns**.

1. Choose **Drop column**.

1. For **Columns to drop**, select the **year** and **hour\_of\_day** columns.

1. Choose **Preview** to view how your transform changes your data.

1. Choose **Add**.

You can use the preceding procedure as the basis to add all of the other transforms in SageMaker Canvas.

## Data Quality and Insights Report 2 (full dataset)
<a name="canvas-tutorial-data-quality-insights-report-2"></a>

For the previous insights report, we used a sample of the NYC taxi dataset. For our second report, we're running a comprehensive analysis on the entire dataset to identify potential issues impacting model performance.

Use the following procedure to create a Data Quality and Insights report on an entire dataset.

**To get a report on the entire dataset**

1. Choose the icon next to the **Drop columns** node.

1. Choose **Get data insights**.

1. For **Analysis name**, specify a name for the report.

1. For **Problem type**, choose **Regression**.

1. For **Target column**, choose **Fare amount**.

1. For **Data size**, choose **Full dataset**.

1. Choose **Create**.

The following is an image from the insights report:

![\[Duplicate rows, Skewed target, and Very low quick model score are listed as the insights\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/canvas-tutorial-dqi-insights.png)

It shows the following issues:
+ Duplicate rows
+ Skewed target

Duplicate rows can lead to data leakage, where the model is exposed to the same data during training and testing, which can produce overly optimistic performance metrics. Removing duplicate rows ensures that the model is trained on unique instances, reducing the risk of data leakage and improving the model's ability to generalize.

A skewed target variable distribution, in this case, the **Fare amount** column, can cause imbalanced classes, where the model may become biased towards the majority class. This can lead to poor performance on minority classes, which is particularly problematic in scenarios where accurately predicting rare or underrepresented instances is important.
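Both issues are easy to see on a toy example. The stdlib-only sketch below uses hypothetical fare values (not the real NYC taxi data): it deduplicates rows and uses the gap between mean and median as a rough indicator of right skew in the target:

```python
from statistics import mean, median

# Hypothetical fare amounts, including duplicates and a long right tail
fares = [6.5, 7.0, 6.5, 8.0, 7.5, 6.5, 95.0, 7.0, 120.0, 7.5, 6.5, 7.0]

# Duplicate rows: keep the first occurrence of each value
deduped = list(dict.fromkeys(fares))

# A right-skewed target has a mean well above its median
skew_gap = mean(fares) - median(fares)

print(len(fares), len(deduped))   # 12 6
print(round(skew_gap, 2))         # 16.75
```

In Canvas you address these issues with built-in transforms (described in the next section) rather than code, but the same intuition applies: dedupe first, then tame the tail of the target distribution.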

## Addressing data quality issues
<a name="canvas-tutorial-addressing-data-quality-issues"></a>

To address these issues and prepare the dataset for modeling, you can search for the following transformations and apply them:

1. Drop duplicates using the **Manage rows** transform.

1. **Handle outliers** in the **Fare amount** column using the **Robust standard deviation numeric outliers**.

1. **Handle outliers** in the **Trip distance** and **Trip duration** columns using the **Standard deviation numeric outliers**.

1. Use the **Encode categorical** to encode the **Rate code id**, **Payment type**, **Extra flag**, and **Toll flag** columns as floats.

If you're not sure about how to apply a transform, see [Drop year and hour of day](#canvas-tutorial-drop-year-and-hour-of-day).

By addressing these data quality issues and applying appropriate transformations, you can improve the dataset's suitability for modeling.

## Verifying data quality and quick model accuracy
<a name="canvas-tutorial-verifying-data-quality-and-quick-model-accuracy"></a>

After applying the transforms to address data quality issues, such as removing duplicate rows, we create our final Data Quality and Insights report. This report helps verify that the applied transformations resolved the issues and that the dataset is now in a suitable state for modeling.

When reviewing the final Data Quality and Insights report, you should expect to see no major data quality issues flagged. The report should indicate that:
+ The target variable is no longer skewed
+ There are no outliers or duplicate rows

Additionally, the report should provide a quick model score based on a baseline model trained on the transformed dataset. This score serves as an initial indicator of the model's potential accuracy and performance.

Use the following procedure to create the Data Quality and Insights report.

**To create the Data Quality and Insights report**

1. Choose the icon next to the **Drop columns** node.

1. Choose **Get data insights**.

1. For **Analysis name**, specify a name for the report.

1. For **Problem type**, choose **Regression**.

1. For **Target column**, choose **Fare amount**.

1. For **Data size**, choose **Full dataset**.

1. Choose **Create**.

## Split the data into training and test sets
<a name="canvas-tutorial-split-data"></a>

To train a model and evaluate its performance, we use the **Split data** transform to split the data into training and test sets.

By default, SageMaker Canvas uses a Randomized split, but you can also use the following types of splits:
+ Ordered
+ Stratified
+ Split by key

You can change the **Split percentage** or add splits.

For this tutorial, use all of the default settings for the split. Double-click a dataset node to view its name; the training dataset is named **Dataset (Train)**.

Next to the **Ordinal encode** node, apply the **Split data** transform.
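Conceptually, the Randomized split shuffles the rows and cuts them at the configured **Split percentage**. The stdlib-only sketch below illustrates the idea; the 80/20 ratio matches Canvas's default, and the function name is illustrative:

```python
import random

def random_split(rows, train_fraction=0.8, seed=0):
    """Shuffle rows and split them into train/test sets,
    like Canvas's default Randomized split."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(1000))
train, test = random_split(rows)
print(len(train), len(test))  # 800 200
```

Holding out rows this way ensures that the model is evaluated on data it never saw during training, which is what makes the later evaluation metrics meaningful.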

## Train model
<a name="canvas-tutorial-train-model"></a>

After you split your data, you can train a model. This model learns from patterns in your data. You can use it to make predictions or uncover insights.

SageMaker Canvas has both quick builds and standard builds. Use a standard build to train the best-performing model on your data.

Before you start training a model, you must first export the training dataset as a SageMaker Canvas dataset.

**To export your dataset**

1. Next to the node for the training dataset, choose the icon and select **Export**.

1. Select **SageMaker Canvas dataset**.

1. Choose **Export** to export the dataset.

After you've created a dataset, you can train a model on the SageMaker Canvas dataset that you've created. For information about training a model, see [Build a custom numeric or categorical prediction model](canvas-build-model-how-to.md#canvas-build-model-numeric-categorical).

## Evaluate model and make predictions
<a name="canvas-tutorial-evaluate-model-and-make-predictions"></a>

After training your machine learning model, it's crucial to evaluate its performance to ensure it meets your requirements and performs well on unseen data. Amazon SageMaker Canvas provides a user-friendly interface to assess your model's accuracy, review its predictions, and gain insights into its strengths and weaknesses. You can use the insights to make informed decisions about its deployment and potential areas for improvement.

Use the following procedure to evaluate a model before you deploy it.

**To evaluate a model**

1. Choose **My Models**.

1. Choose the model you've created.

1. Under **Versions**, select the version corresponding to the model.

You can now view the model evaluation metrics.

After you evaluate the model, you can make predictions on new data. We're using the test dataset that we've created.

To use the test dataset for predictions, we need to convert it into a SageMaker Canvas dataset, which is in a format that the model can interpret.

Use the following procedure to create a SageMaker Canvas dataset from the test dataset.

**To create a SageMaker Canvas dataset**

1. Next to the **Dataset (Test)** dataset, choose the radio icon.

1. Select **Export**.

1. Select **SageMaker Canvas dataset**.

1. For **Dataset name**, specify a name for the dataset.

1. Choose **Export**.

Use the following procedure to make predictions. It assumes that you're still on the **Analyze** page.

**To make predictions on test dataset**

1. Choose **Predict**.

1. Choose **Manual**.

1. Select the dataset that you've exported.

1. Choose **Generate predictions**.

1. When SageMaker Canvas has finished generating predictions, select the icon to the right of the dataset.

1. Choose **Preview** to view the predictions.

## Deploy a model
<a name="canvas-tutorial-deploy-a-model"></a>

After you've evaluated your model, you can deploy it to an endpoint. You can submit requests to the endpoint to get predictions.

Use the following procedure to deploy a model. It assumes that you're still on the **Predict** page.

**To deploy a model**

1. Choose **Deploy**.

1. Choose **Create deployment**.

1. Choose **Deploy**.

## Cleaning up
<a name="canvas-tutorial-cleaning-up"></a>

You've successfully completed the tutorial. To avoid incurring additional charges, delete the resources that you're not using.

Use the following procedure to delete the endpoint that you created. It assumes that you're still on the **Deploy** page.

**To delete an endpoint**

1. Choose the radio button to the right of your deployment.

1. Select **Delete deployment**.

1. Choose **Delete**.

After deleting the deployment, delete the datasets that you've created within SageMaker Canvas. Use the following procedure to delete the datasets.

**To delete the datasets**

1. Choose **Datasets** on the left-hand navigation.

1. Select the dataset that you've analyzed and the synthetic dataset used for predictions.

1. Choose **Delete**.

To avoid incurring additional charges, you must log out of SageMaker Canvas. For more information, see [Logging out of Amazon SageMaker Canvas](canvas-log-out.md).

# Amazon SageMaker Canvas setup and permissions management (for IT administrators)
<a name="canvas-setting-up"></a>

The following pages explain how IT administrators can configure Amazon SageMaker Canvas and grant permissions to users within their organizations. You learn how to set up the storage configuration, manage data encryption and VPCs, control access to specific capabilities like generative AI foundation models, integrate with other AWS services like Amazon Redshift, and more. By following these steps, you can tailor SageMaker Canvas for your users based on your organization's specific requirements.

You can also set up SageMaker Canvas for your users with AWS CloudFormation. For more information, see [AWS::SageMaker::App](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-sagemaker-app.html) in the *AWS CloudFormation User Guide*.

**Topics**
+ [Grant Your Users Permissions to Upload Local Files](canvas-set-up-local-upload.md)
+ [Set Up SageMaker Canvas for Your Users](setting-up-canvas-sso.md)
+ [Configure your Amazon S3 storage](canvas-storage-configuration.md)
+ [Grant permissions for cross-account Amazon S3 storage](canvas-permissions-cross-account.md)
+ [Grant Users Permissions to Use Large Data across the ML Lifecycle](canvas-large-data-permissions.md)
+ [Encrypt Your SageMaker Canvas Data with AWS KMS](canvas-kms.md)
+ [Store SageMaker Canvas application data in your own SageMaker AI space](canvas-spaces-setup.md)
+ [Grant Your Users Permissions to Build Custom Image and Text Prediction Models](canvas-set-up-cv-nlp.md)
+ [Grant Users Permissions to Use Amazon Bedrock and Generative AI Features in Canvas](canvas-fine-tuning-permissions.md)
+ [Update SageMaker Canvas for Your Users](canvas-update.md)
+ [Request a Quota Increase](canvas-requesting-quota-increases.md)
+ [Grant Users Permissions to Import Amazon Redshift Data](canvas-redshift-permissions.md)
+ [Grant Your Users Permissions to Send Predictions to Amazon QuickSight](canvas-quicksight-permissions.md)
+ [Applications management](canvas-manage-apps.md)
+ [Configure Amazon SageMaker Canvas in a VPC without internet access](canvas-vpc.md)
+ [Set up connections to data sources with OAuth](canvas-setting-up-oauth.md)

# Grant Your Users Permissions to Upload Local Files
<a name="canvas-set-up-local-upload"></a>

If your users are uploading files from their local machines to SageMaker Canvas, you must attach a CORS (cross-origin resource sharing) configuration to the Amazon S3 bucket that they're using. When setting up or editing the SageMaker AI domain or user profile, you can specify either a custom Amazon S3 location or the default location, which is a SageMaker AI created Amazon S3 bucket with a name that uses the following pattern: `s3://sagemaker-{Region}-{your-account-id}`. SageMaker Canvas adds your users' data to the bucket whenever they upload a file.
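Because the default bucket name is deterministic, administrative scripts can reconstruct it from the Region and account ID. A minimal sketch (the Region and account ID below are placeholders, and the helper function is illustrative):

```python
def default_canvas_bucket(region: str, account_id: str) -> str:
    """Name of the SageMaker AI created default bucket:
    sagemaker-{Region}-{your-account-id}."""
    return f"sagemaker-{region}-{account_id}"

print(default_canvas_bucket("us-east-1", "111122223333"))
# sagemaker-us-east-1-111122223333
```

Note that this is the bucket name only; when you reference it as a URI elsewhere in this guide, prefix it with `s3://`.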

To grant users permissions to upload local files to the bucket, you can attach a CORS configuration to it using either of the following procedures. You can use the first method when editing the settings of your domain, where you opt in to allow SageMaker AI to attach the CORS configuration to the bucket for you. You can also use the first method for editing a user profile within a domain. The second method is the manual method, where you can attach the CORS configuration to the bucket yourself.

## SageMaker AI domain settings method
<a name="canvas-set-up-local-upload-domain"></a>

To grant your users permissions to upload local files, you can edit the Canvas application configuration in the domain settings. This attaches a Cross-Origin Resource Sharing (CORS) configuration to the Canvas storage configuration's Amazon S3 bucket and grants all users in the domain permission to upload local files into SageMaker Canvas. By default, the permissions option is turned on when you set up a new domain, but you can turn this option on and off as needed.

**Note**  
If you have an existing CORS configuration on the storage configuration Amazon S3 bucket, turning on the local file upload option overwrites the existing configuration with the new configuration.

The following procedure shows how you can turn on this option by editing the domain settings in the SageMaker AI console.

1. Go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Domains**.

1. From the list of domains, choose your domain.

1. On the domain details page, select the **App Configurations** tab.

1. Go to the **Canvas** section and choose **Edit**.

1. Turn on the **Enable local file upload** toggle. This attaches the CORS configuration and grants local file upload permissions.

1. Choose **Submit**.

Users in the specified domain should now have local file upload permissions.

You can also grant permissions to specific user profiles in a domain by following the preceding procedure and going into the user profile settings instead of the overall domain settings.

## Amazon S3 bucket method
<a name="canvas-set-up-local-upload-s3"></a>

If you want to manually attach the CORS configuration to the SageMaker AI Amazon S3 bucket, use the following procedure.

1. Sign in to [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/).

1. Choose your bucket. If your domain uses the default SageMaker AI created bucket, the bucket’s name uses the following pattern: `s3://sagemaker-{Region}-{your-account-id}`.

1. Choose **Permissions**.

1. Navigate to **Cross-origin resource sharing (CORS)**.

1. Choose **Edit**.

1. Add the following CORS policy:

   ```
   [
       {
           "AllowedHeaders": [
               "*"
           ],
           "AllowedMethods": [
               "POST"
           ],
           "AllowedOrigins": [
               "*"
           ],
           "ExposeHeaders": []
       }
   ]
   ```

1. Choose **Save changes**.

In the preceding procedure, the CORS policy must have `"POST"` listed under `AllowedMethods`.
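If you manage bucket configuration in code rather than the console, you can build and sanity-check the same CORS document before applying it (for example, with boto3's `put_bucket_cors`, which wraps the rules in a `CORSRules` key; the apply step is not shown here). A minimal stdlib sketch:

```python
import json

# Same rules as the console policy above, wrapped in the
# CORSRules key that the S3 API expects
cors_configuration = {
    "CORSRules": [
        {
            "AllowedHeaders": ["*"],
            "AllowedMethods": ["POST"],  # Canvas local file upload requires POST
            "AllowedOrigins": ["*"],
            "ExposeHeaders": [],
        }
    ]
}

# Verify the requirement called out above before applying the config
rule = cors_configuration["CORSRules"][0]
assert "POST" in rule["AllowedMethods"]

print(json.dumps(cors_configuration["CORSRules"], indent=4))
```

Checking the document locally like this catches the common failure mode where a CORS rule omits `POST` and uploads silently fail in the browser.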

After you've gone through the procedure, you should have:
+ An IAM role assigned to each of your users.
+ Amazon SageMaker Studio Classic runtime permissions for each of your users. SageMaker Canvas uses Studio Classic to run the commands from your users.
+ If the users are uploading files from their local machines, a CORS policy attached to their Amazon S3 bucket.

If your users still can't upload the local files after you update the CORS policy, the browser might be caching the CORS settings from a previous upload attempt. If they're running into issues, instruct them to clear their browser cache and try again.

# Set Up SageMaker Canvas for Your Users
<a name="setting-up-canvas-sso"></a>

To set up Amazon SageMaker Canvas, do the following:
+ Create an Amazon SageMaker AI domain.
+ Create user profiles for the domain.
+ Set up Okta Single Sign-On (Okta SSO) for your users.
+ Activate link sharing for models.

Use Okta Single Sign-On (Okta SSO) to grant your users access to Amazon SageMaker Canvas. SageMaker Canvas supports SAML 2.0 SSO methods. The following sections guide you through the procedures to set up Okta SSO.

To set up a domain, see [Use custom setup for Amazon SageMaker AI](onboard-custom.md) and follow the instructions for setting up your domain using IAM authentication. You can use the following information to help you complete the procedure in the section:
+ You can ignore the step about creating projects.
+ You don't need to provide access to additional Amazon S3 buckets. Your users can use the default bucket that we provide when we create a role.
+ To grant your users access to share their notebooks with data scientists, turn on **Notebook Sharing Configuration**.
+ Use Amazon SageMaker Studio Classic version 3.19.0 or later. For information about updating Amazon SageMaker Studio Classic, see [Shut Down and Update Amazon SageMaker Studio Classic](studio-tasks-update-studio.md).

Use the following procedure to set up Okta. For all of the following procedures, you specify the same IAM role for `IAM-role`.

## Add the SageMaker Canvas application to Okta
<a name="canvas-set-up-okta"></a>

Set up the sign-on method for Okta.

1. Sign in to the Okta Admin dashboard.

1. Choose **Add application**. Search for **AWS Account Federation**.

1. Choose **Add**.

1. Optional: Change the name to **Amazon SageMaker Canvas**.

1. Choose **Next**.

1. Choose **SAML 2.0** as the **Sign-On** method.

1. Choose **Identity Provider Metadata** to open the metadata XML file. Save the file locally.

1. Choose **Done**.

## Set up ID federation in IAM
<a name="set-up-id-federation-IAM"></a>

AWS Identity and Access Management (IAM) is the AWS service that you use to securely control access to your AWS account and its resources.

1. Sign in to the AWS console.

1. Choose **AWS Identity and Access Management (IAM)**.

1. Choose **Identity Providers**.

1. Choose **Create Provider**.

1. For **Configure Provider**, specify the following:
   + **Provider Type** – From the dropdown list, choose **SAML**.
   + **Provider Name** – Specify **Okta**.
   + **Metadata Document** – Upload the XML document that you've saved locally from step 7 of [Add the SageMaker Canvas application to Okta](#canvas-set-up-okta).

1. Find your identity provider under **Identity Providers**. Copy its **Provider ARN** value.

1. For **Roles**, choose the IAM role that you're using for Okta SSO access.

1. Under **Trust Relationship** for the IAM role, choose **Edit Trust Relationship**.

1. Modify the IAM trust relationship policy by specifying the **Provider ARN** value that you've copied and add the following policy:

------
#### [ JSON ]

****  

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "Federated": "arn:aws:iam::111122223333:saml-provider/Okta"
         },
         "Action": [
           "sts:AssumeRoleWithSAML",
           "sts:TagSession"
         ],
         "Condition": {
           "StringEquals": {
             "SAML:aud": "https://signin.aws.amazon.com/saml"
           }
         }
       },
       {
         "Effect": "Allow",
         "Principal": {
           "Federated": "arn:aws:iam::111122223333:saml-provider/Okta"
         },
         "Action": [
           "sts:SetSourceIdentity"
         ]
       }
     ]
   }
   ```

------

1. For **Permissions**, add the following policy:

------
#### [ JSON ]

****  

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "AmazonSageMakerPresignedUrlPolicy",
         "Effect": "Allow",
         "Action": [
           "sagemaker:CreatePresignedDomainUrl"
         ],
         "Resource": "*"
       }
     ]
   }
   ```

------
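Malformed JSON or a missing SAML action in these policies is a common cause of federation failures, so it can help to lint the documents locally before pasting them into the console. The stdlib sketch below mirrors the trust policy above (same example account ID) and checks for the actions that Okta SSO depends on:

```python
import json

# Mirror of the trust policy shown above (111122223333 is the example account)
trust_policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Federated": "arn:aws:iam::111122223333:saml-provider/Okta"},
      "Action": ["sts:AssumeRoleWithSAML", "sts:TagSession"],
      "Condition": {"StringEquals": {"SAML:aud": "https://signin.aws.amazon.com/saml"}}
    },
    {
      "Effect": "Allow",
      "Principal": {"Federated": "arn:aws:iam::111122223333:saml-provider/Okta"},
      "Action": ["sts:SetSourceIdentity"]
    }
  ]
}
""")

# Collect every action granted across all statements
actions = {a for stmt in trust_policy["Statement"] for a in stmt["Action"]}
assert "sts:AssumeRoleWithSAML" in actions
assert "sts:SetSourceIdentity" in actions
print(sorted(actions))
```

Running `json.loads` alone catches stray characters (such as trailing commas or invisible whitespace) that the IAM console would otherwise reject.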

## Configure SageMaker Canvas in Okta
<a name="canvas-configure-okta"></a>

To configure Amazon SageMaker Canvas to use Okta, follow the steps in this section. You must specify a unique user name for each **SageMakerStudioProfileName** field. For example, you can use `user.login` as a value. If the username is different from the SageMaker Canvas profile name, choose a different uniquely identifying attribute. For example, you can use an employee's ID number for the profile name.

For an example of values that you can set for **Attributes**, see the code following the procedure.

1. Under **Directory**, choose **Groups**.

1. Add a group with the following pattern: `sagemaker#canvas#IAM-role#AWS-account-id`.

1. In Okta, open the **AWS Account Federation** application integration configuration.

1. Select **Sign On** for the AWS Account Federation application.

1. Choose **Edit** and specify the following:
   + SAML 2.0
   + **Default Relay State** – https://*Region*.console.aws.amazon.com/sagemaker/home?region=*Region*#/studio/canvas/open/*StudioId*. You can find the Studio Classic ID in the console: [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/)

1. Choose **Attributes**.

1. In the **SageMakerStudioProfileName** fields, specify unique values for each username. The usernames must match the usernames that you've created in the AWS console.

   ```
   Attribute 1:
   Name: https://aws.amazon.com/SAML/Attributes/PrincipalTag:SageMakerStudioUserProfileName 
   Value: ${user.login}
   
   Attribute 2:
   Name: https://aws.amazon.com/SAML/Attributes/TransitiveTagKeys
   Value: {"SageMakerStudioUserProfileName"}
   ```

1. Select **Environment Type**. Choose **Regular AWS**.
   + If your environment type isn't listed, you can set your ACS URL in the **ACS URL** field. If your environment type is listed, you don't need to enter your ACS URL.

1. For **Identity Provider ARN**, specify the ARN you used in step 6 of the preceding procedure.

1. Specify a **Session Duration**.

1. Choose **Join all roles**.

1. Turn on **Use Group Mapping** by specifying the following fields:
   + **App Filter** – `okta`
   + **Group Filter** – `^aws\#\S+\#(?{{role}}[\w\-]+)\#(?{{accountid}}\d+)$`
   + **Role Value Pattern** – `arn:aws:iam::$accountid:saml-provider/Okta,arn:aws:iam::$accountid:role/IAM-role`

1. Choose **Save/Next**.

1. Under **Assignments**, assign the application to the group that you've created.
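The group filter above uses Okta's own `{{…}}` named-capture syntax. As a rough illustration only, the same extraction can be sketched with a standard Python named-group regex (the group name and role below are hypothetical):

```python
import re

# Standard-regex approximation of the Okta group filter; Okta's own
# syntax uses (?{{name}}...) rather than Python's (?P<name>...).
pattern = re.compile(r"^aws#\S+#(?P<role>[\w\-]+)#(?P<accountid>\d+)$")

match = pattern.match("aws#canvas#MyCanvasRole#111122223333")
assert match is not None

# Rebuild the role ARN the way the role value pattern does.
role_arn = f"arn:aws:iam::{match.group('accountid')}:role/{match.group('role')}"
print(role_arn)  # arn:aws:iam::111122223333:role/MyCanvasRole
```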

## Add optional policies on access control in IAM
<a name="canvas-optional-access"></a>

In IAM, you can apply the following policy to the administrator user who creates the user profiles.

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "CreateSageMakerStudioUserProfilePolicy",
            "Effect": "Allow",
            "Action": "sagemaker:CreateUserProfile",
            "Resource": "*",
            "Condition": {
                "ForAnyValue:StringEquals": {
                    "aws:TagKeys": [
                        "studiouserid"
                    ]
                }
            }
        }
    ]
}
```

------

If you choose to add the preceding policy to the admin user, you must also use the following permissions from [Set up ID federation in IAM](#set-up-id-federation-IAM).

------
#### [ JSON ]

****  

```
{
   "Version":"2012-10-17",		 	 	 
   "Statement": [
       {
           "Sid": "AmazonSageMakerPresignedUrlPolicy",
           "Effect": "Allow",
           "Action": [
               "sagemaker:CreatePresignedDomainUrl"
           ],
           "Resource": "*",
           "Condition": {
                  "StringEquals": {
                      "sagemaker:ResourceTag/studiouserid": "${aws:PrincipalTag/SageMakerStudioUserProfileName}"
                 }
            }
      }
  ]
}
```

------

# Configure your Amazon S3 storage
<a name="canvas-storage-configuration"></a>

When you set up your SageMaker Canvas application, the default storage location for model artifacts, datasets, and other application data is an Amazon S3 bucket that Canvas creates. This default Amazon S3 bucket follows the naming pattern `s3://sagemaker-{Region}-{your-account-id}` and exists in the same Region as your Canvas application. However, you can customize the storage location and specify your own Amazon S3 bucket for storing Canvas application data. You might want to use your own Amazon S3 bucket for storing application data for any of the following reasons:
+ Your organization has internal naming conventions for Amazon S3 buckets.
+ You want to enable cross-account access to model artifacts or other Canvas data.
+ You want to be compliant with internal security guidelines, such as restricting users to specific Amazon S3 buckets or model artifacts.
+ You want enhanced visibility and access to logs produced by Canvas, independent of the AWS console or SageMaker Studio Classic.

By specifying your own Amazon S3 bucket, you can have increased control over your own storage and be compliant with your organization. 

To get started, you can either create a new SageMaker AI domain or user profile, or you can update an existing domain or user profile. Note that the user profile settings override the domain-level settings. For example, you can use the default bucket configuration at the domain level, but you can specify a custom Amazon S3 bucket for an individual user. After specifying your own Amazon S3 bucket for the domain or user profile, Canvas creates a subfolder called `Canvas/<UserProfileName>` under the input Amazon S3 URI and saves all artifacts generated in the Canvas application under this subfolder.

**Important**  
If you update an existing domain or user profile, you no longer have access to your Canvas artifacts from the previous location. Your files are still in the old Amazon S3 location, but you can no longer view them from Canvas. The new configuration takes effect the next time you log into the application.

For more information about granting cross-account access to your Amazon S3 bucket, see [Granting cross-account object permissions](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-walkthroughs-managing-access-example4.html#access-policies-walkthrough-example4-overview) in the *Amazon S3 User Guide*.

The following sections describe how to specify a custom Amazon S3 bucket for your Canvas storage configuration. If you’re setting up a new SageMaker AI domain (or a new user in a domain), then use the [New domain setup method](#canvas-storage-configuration-new-domain) or the [New user profile setup method](#canvas-storage-configuration-new-user). If you have an existing Canvas user profile and would like to update the profile's storage configuration, use the [Existing user method](#canvas-storage-configuration-existing-user).

## Before you begin
<a name="canvas-storage-configuration-prereqs"></a>

If you’re specifying an Amazon S3 URI from a different AWS account, or if you’re using a bucket that is encrypted with AWS KMS, then you must configure permissions before proceeding. You must grant AWS IAM permissions to ensure that Canvas can download and upload objects to and from your bucket. For detailed information on how to grant the required permissions, see [Grant permissions for cross-account Amazon S3 storage](canvas-permissions-cross-account.md).

Additionally, the final Amazon S3 URI for the training folder in your Canvas storage location must be 128 characters or less. The final Amazon S3 URI consists of your bucket path `s3://<your-bucket-name>/<folder-name>/` plus the path that Canvas adds to your bucket: `Canvas/<user-profile-name>/Training`. For example, an acceptable path that is less than 128 characters is `s3://<amzn-s3-demo-bucket>/<machine-learning>/Canvas/<user-1>/Training`.
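As a quick sanity check, you can compute the final training URI and verify the 128-character limit before configuring the bucket; a minimal sketch (the bucket path and profile name below are placeholders):

```python
def canvas_training_uri(bucket_path: str, user_profile_name: str) -> str:
    """Build the final training URI that Canvas derives from your bucket path."""
    return f"{bucket_path.rstrip('/')}/Canvas/{user_profile_name}/Training"

uri = canvas_training_uri("s3://amzn-s3-demo-bucket/machine-learning", "user-1")
# Enforce the documented 128-character limit on the final training URI.
assert len(uri) <= 128, f"URI is {len(uri)} characters; must be 128 or less"
print(uri)  # s3://amzn-s3-demo-bucket/machine-learning/Canvas/user-1/Training
```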

## New domain setup method
<a name="canvas-storage-configuration-new-domain"></a>

If you’re setting up a new domain and Canvas application, use this section to configure the storage location at the domain level. This configuration applies to all new users you create in the domain, unless you specify a different storage location for individual user profiles.

When doing a **Standard setup** for your domain, on the **Step 3: Configure Applications - Optional** page, use the following procedure for the **Canvas** section:

1. For the **Canvas storage configuration**, do the following:

   1. Select **System managed** if you want to set the location to the default SageMaker AI bucket that follows the pattern `s3://sagemaker-{Region}-{your-account-id}`.

   1. Select **Custom S3** to specify your own Amazon S3 bucket as the storage location. Then, enter the Amazon S3 URI.

   1. (Optional) For **Encryption key**, specify a KMS key for encrypting Canvas artifacts stored at the specified location. 

1. Finish setting up the domain and choose **Submit**.

Your domain is now configured to use the Amazon S3 location you specified for SageMaker Canvas application storage.

## New user profile setup method
<a name="canvas-storage-configuration-new-user"></a>

If you’re setting up a new user profile in your domain, use this section to configure the storage location for the user. This configuration overrides the domain-level configuration.

When adding a user profile to your domain, for **Step 2: Configure Applications**, use the following procedure for the **Canvas** section:

1. For the **Canvas storage configuration**, do the following:

   1. Select **System managed** if you want to set the location to the default SageMaker AI created bucket that follows the pattern `s3://sagemaker-{Region}-{your-account-id}`.

   1. Select **Custom S3** to specify your own Amazon S3 bucket as the storage location. Then, enter the Amazon S3 URI.

   1. (Optional) For **Encryption key**, specify a KMS key for encrypting Canvas artifacts stored at the specified location. 

1. Finish setting up the user profile and choose **Submit**.

Your user profile is now configured to use the Amazon S3 location you specified for SageMaker Canvas application storage.

## Existing user method
<a name="canvas-storage-configuration-existing-user"></a>

If you have an existing Canvas user profile and would like to update the Amazon S3 storage location, you can edit the SageMaker AI domain or user profile settings. The change takes effect the next time you log into the Canvas application.

**Note**  
When you change the storage location for an existing Canvas application, you lose access to your Canvas artifacts from the previous storage location. The artifacts are still stored in the old Amazon S3 location, but you can no longer view them from Canvas.

Remember that the user profile settings override the general domain settings, so you can update the Amazon S3 storage location for specific user profiles without changing it for all of the users. You can update the storage configuration for an existing domain or user by using the following procedures.

------
#### [ Update an existing domain ]

Use the following procedure to update the storage configuration for a domain.

1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**. 

1. From the list of domains, choose your domain.

1. On the **Domain details** page, choose the **App Configurations** tab.

1. Scroll down to the **Canvas** section and choose **Edit**.

1. The **Edit Canvas settings** page opens. For the **Canvas storage configuration** section, do the following:

   1. Select **System managed** if you want to set the location to the default SageMaker AI created bucket that follows the pattern `s3://sagemaker-{Region}-{your-account-id}`.

   1. Select **Custom S3** to specify your own Amazon S3 bucket as the storage location. Then, enter the Amazon S3 URI.

   1. (Optional) For **Encryption key**, specify a KMS key for encrypting Canvas artifacts stored at the specified location. 

1. Finish any other modifications you want to make to the domain, and then choose **Submit** to save your changes.

------
#### [ Update an existing user profile ]

Use the following procedure to update the storage configuration for a user profile.

1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**. 

1. From the list of domains, choose your domain.

1. From the list of users in the domain, choose the user whose configuration you want to edit.

1. On the **User Details** page, choose **Edit**.

1. In the navigation pane, choose **Canvas settings**.

1. For the **Canvas storage configuration**, do the following:

   1. Select **System managed** if you want to set the location to the default SageMaker AI bucket that follows the pattern `s3://sagemaker-{Region}-{your-account-id}`.

   1. Select **Custom S3** to specify your own Amazon S3 bucket as the storage location. Then, enter the Amazon S3 URI.

   1. (Optional) For **Encryption key**, specify a KMS key for encrypting Canvas artifacts stored at the specified location. 

1. Finish any other modifications you want to make to the user profile, and then choose **Submit** to save your changes.

------

The storage location for your Canvas user profile should now be updated. The next time you log into the Canvas application, you receive a notification that the storage location has been updated. You lose access to any previous artifacts that you created in Canvas. You can still access the files in Amazon S3, but you can no longer view them in Canvas.

# Grant permissions for cross-account Amazon S3 storage
<a name="canvas-permissions-cross-account"></a>

When setting up your SageMaker AI domain or user profile for users to access SageMaker Canvas, you specify an Amazon S3 storage location for Canvas artifacts. These artifacts include saved copies of your input datasets, model artifacts, predictions, and other application data. You can either use the default SageMaker AI created Amazon S3 bucket, or you can customize the storage location and specify your own bucket for storing Canvas application data.

You can specify an Amazon S3 bucket in another AWS account for storing your Canvas data, but first you must grant cross-account permissions so that Canvas can access the bucket.

The following sections describe how to grant permissions to Canvas for uploading and downloading objects to and from an Amazon S3 bucket in another account. There are additional permissions for when your bucket is encrypted with AWS KMS.

## Requirements
<a name="canvas-permissions-cross-account-prereqs"></a>

Before you begin, review the following requirements:
+ Cross-account Amazon S3 buckets (and any associated AWS KMS keys) must be in the same AWS Region as the Canvas user domain or user profile.
+ The final Amazon S3 URI for the training folder in your Canvas storage location must be 128 characters or less. The final S3 URI consists of your bucket path `s3://<your-bucket-name>/<folder-name>/` plus the path that Canvas adds to your bucket: `Canvas/<user-profile-name>/Training`. For example, an acceptable path that is less than 128 characters is `s3://<amzn-s3-demo-bucket>/<machine-learning>/Canvas/<user-1>/Training`.

## Permissions for cross-account Amazon S3 buckets
<a name="canvas-permissions-cross-account-s3"></a>

The following section outlines the basic steps for granting the necessary permissions so that Canvas can access your Amazon S3 bucket in another account. For more detailed instructions, see [Example 2: Bucket owner granting cross-account bucket permissions](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-walkthroughs-managing-access-example2.html) in the *Amazon S3 User Guide*.

1. Create an Amazon S3 bucket, `bucketA`, in Account A.

1. The Canvas user exists in another account called Account B. In the following steps, we refer to the Canvas user's IAM role as `roleB` in Account B.

   Give the IAM role `roleB` in Account B permission to download (`GetObject`) and upload (`PutObject`) objects to and from `bucketA` in Account A by attaching an IAM policy.

   To limit access to a specific bucket folder, define the folder name in the resource element, such as `arn:aws:s3:::<bucketA>/FolderName/*`. For more information, see [How can I use IAM policies to grant user-specific access to specific folders?](https://aws.amazon.com/premiumsupport/knowledge-center/iam-s3-user-specific-folder/)
**Note**  
Bucket-level actions, such as `GetBucketCors` and `GetBucketLocation`, should be added on bucket-level resources, not folders.

   The following example IAM policy grants the required permissions for `roleB` to access objects in `bucketA`:

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "s3:GetObject",
                   "s3:PutObject",
                   "s3:DeleteObject"
               ],
               "Resource": [
                   "arn:aws:s3:::bucketA/FolderName/*"
               ]
           },
           {
               "Effect": "Allow",
               "Action": [
                   "s3:ListBucket",
                   "s3:GetBucketCors",
                   "s3:GetBucketLocation"
               ],
               "Resource": [
                   "arn:aws:s3:::bucketA"
               ]
           }
       ]
   }
   ```

------

1. Configure the bucket policy for `bucketA` in Account A to grant permissions to the IAM role `roleB` in Account B.
**Note**  
Admins must also turn off **Block all public access** under the bucket **Permissions** section.

   The following is an example bucket policy for `bucketA` to grant the necessary permissions to `roleB`:

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Principal": {
                   "AWS": "arn:aws:iam::111122223333:role/roleB"
               },
               "Action": [
                   "s3:DeleteObject",
                   "s3:GetObject",
                   "s3:PutObject"
               ],
               "Resource": "arn:aws:s3:::bucketA/FolderName/*"
           },
           {
               "Effect": "Allow",
               "Principal": {
                   "AWS": "arn:aws:iam::111122223333:role/roleB"
               },
               "Action": [
                   "s3:ListBucket",
                   "s3:GetBucketCors",
                   "s3:GetBucketLocation"
               ],
               "Resource": "arn:aws:s3:::bucketA"
           }
       ]
   }
   ```

------

After configuring the preceding permissions, your Canvas user profile in Account B can now use the Amazon S3 bucket in Account A as the storage location for Canvas artifacts.
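Rather than hand-editing the JSON, you can also generate the bucket policy above programmatically; the following is a minimal sketch using the placeholder account, role, and folder names from the example:

```python
import json

def canvas_bucket_policy(bucket: str, role_arn: str, folder: str) -> str:
    """Render the cross-account bucket policy shown above as a JSON document."""
    principal = {"AWS": role_arn}
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level access scoped to the Canvas folder.
                "Effect": "Allow",
                "Principal": principal,
                "Action": ["s3:DeleteObject", "s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{folder}/*",
            },
            {
                # Bucket-level actions go on the bucket resource, not the folder.
                "Effect": "Allow",
                "Principal": principal,
                "Action": ["s3:ListBucket", "s3:GetBucketCors", "s3:GetBucketLocation"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }
    return json.dumps(policy, indent=4)

doc = canvas_bucket_policy("bucketA", "arn:aws:iam::111122223333:role/roleB", "FolderName")
```

You could then save the rendered document to a file and apply it with the `aws s3api put-bucket-policy` CLI command.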

## Permissions for cross-account Amazon S3 buckets encrypted with AWS KMS
<a name="canvas-permissions-cross-account-s3-kms"></a>

The following procedure shows you how to grant the necessary permissions so that Canvas can access your Amazon S3 bucket in another account that is encrypted with AWS KMS. The steps are similar to the preceding procedure, but with additional permissions. For more information about granting cross-account KMS key access, see [Allowing users in other accounts to use a KMS key](https://docs.aws.amazon.com/kms/latest/developerguide/key-policy-modifying-external-accounts.html) in the *AWS KMS Developer Guide*.

1. Create an Amazon S3 bucket, `bucketA`, and an Amazon S3 KMS key `s3KmsInAccountA` in Account A.

1. The Canvas user exists in another account called Account B. In the following steps, we refer to the Canvas user's IAM role as `roleB` in Account B.

   Give the IAM role `roleB` in Account B permission to do the following:
   + Download (`GetObject`) and upload (`PutObject`) objects to and from `bucketA` in Account A.
   + Access the AWS KMS key `s3KmsInAccountA` in Account A.

   The following example IAM policy grants the required permissions for `roleB` to access objects in `bucketA` and use the KMS key `s3KmsInAccountA`:

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "s3:GetObject",
                   "s3:PutObject",
                   "s3:DeleteObject"
               ],
               "Resource": [
                   "arn:aws:s3:::bucketA/FolderName/*"
               ]
           },
           {
               "Effect": "Allow",
               "Action": [
                   "s3:GetBucketCors",
                   "s3:GetBucketLocation"
               ],
               "Resource": [
                   "arn:aws:s3:::bucketA"
               ]
           },
           {
               "Action": [
                   "kms:DescribeKey",
                   "kms:CreateGrant",
                   "kms:RetireGrant",
                   "kms:GenerateDataKey",
                   "kms:GenerateDataKeyWithoutPlainText",
                   "kms:Decrypt"
               ],
               "Effect": "Allow",
               "Resource": "arn:aws:kms:us-east-1:111122223333:key/s3KmsInAccountA"
           }
       ]
   }
   ```

------

1. Configure the bucket policy for `bucketA` and the key policy for `s3KmsInAccountA` in Account A to grant permissions to the IAM role `roleB` in Account B.

   The following is an example bucket policy for `bucketA` to grant the necessary permissions to `roleB`:

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Principal": {
                   "AWS": "arn:aws:iam::111122223333:role/roleB"
               },
               "Action": [
                   "s3:DeleteObject",
                   "s3:GetObject",
                   "s3:PutObject"
               ],
               "Resource": "arn:aws:s3:::bucketA/FolderName/*"
           },
           {
               "Effect": "Allow",
               "Principal": {
                   "AWS": "arn:aws:iam::111122223333:role/roleB"
               },
               "Action": [
                   "s3:GetBucketCors",
                   "s3:GetBucketLocation"
               ],
               "Resource": "arn:aws:s3:::bucketA"
           }
       ]
   }
   ```

------

   The following example is a key policy that you attach to the KMS key `s3KmsInAccountA` in Account A to grant `roleB` access. For more information about how to create and attach a key policy statement, see [Creating a key policy](https://docs.aws.amazon.com/kms/latest/developerguide/key-policy-overview.html) in the *AWS KMS Developer Guide*.

   ```
   {
     "Sid": "Allow use of the key",
     "Effect": "Allow",
     "Principal": {
       "AWS": [
         "arn:aws:iam::accountB:role/roleB"
       ]
     },
     "Action": [
           "kms:DescribeKey",
           "kms:CreateGrant",
           "kms:RetireGrant",
           "kms:GenerateDataKey",
           "kms:GenerateDataKeyWithoutPlainText",
           "kms:Decrypt"
     ],
     "Resource": "*"
   }
   ```

After configuring the preceding permissions, your Canvas user profile in Account B can now use the encrypted Amazon S3 bucket in Account A as the storage location for Canvas artifacts.

# Grant users permissions to use large data across the ML lifecycle
<a name="canvas-large-data-permissions"></a>

Amazon SageMaker Canvas users working with datasets larger than 10 GB in CSV format or 2.5 GB in Parquet format require specific permissions for large data processing. These permissions are essential for managing large-scale data throughout the machine learning lifecycle. When datasets exceed the stated thresholds, or the application's local memory capacity, SageMaker Canvas uses Amazon EMR Serverless for efficient processing. This applies to:
+ Data Import: Importing large datasets with random or stratified sampling.
+ Data Preparation: Exporting processed data from Data Wrangler in Canvas to Amazon S3, to a new Canvas dataset, or to a Canvas model.
+ Model Building: Training models on large datasets.
+ Inference: Making predictions on large datasets.
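The offload decision described above can be sketched as follows. This is illustrative only: the function is hypothetical, and the actual behavior also depends on the application's available local memory:

```python
# Size thresholds (in GB) above which Canvas offloads processing
# to EMR Serverless, per the documentation above.
THRESHOLDS_GB = {"csv": 10.0, "parquet": 2.5}

def uses_emr_serverless(size_gb: float, fmt: str) -> bool:
    """Return True if a dataset of this size/format exceeds the in-app limit."""
    return size_gb > THRESHOLDS_GB[fmt.lower()]

print(uses_emr_serverless(12, "csv"))     # True
print(uses_emr_serverless(2, "parquet"))  # False
```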

By default, SageMaker Canvas uses EMR Serverless to run these remote jobs with the following app settings:
+ Pre-Initialized capacity: Not configured
+ Application limits: Maximum capacity of 400 vCPUs, a maximum of 16 concurrent vCPUs per account, 3000 GB of memory, and 20000 GB of disk
+ Metastore configuration: AWS Glue Data Catalog
+ Application logs: AWS managed storage (enabled), using an AWS owned encryption key
+ Application behavior: Auto-starts on job submission and auto-stops after the application is idle for 15 minutes

To enable these large data processing capabilities, users need the necessary permissions, which can be granted through the Amazon SageMaker AI domain settings. The method for granting these permissions depends on how your Amazon SageMaker AI domain was initially set up. The following sections cover three main scenarios:
+ Quick domain setup
+ Custom domain setup (with public internet access/without VPC)
+ Custom domain setup (with VPC and without public internet access)

Each scenario requires specific steps to ensure that users have the required permissions to use EMR Serverless for large data processing across the entire machine learning lifecycle in SageMaker Canvas.

## Scenario 1: Quick domain setup
<a name="canvas-large-data-quick-setup"></a>

If you used the **Quick setup** option when creating your SageMaker AI domain, follow these steps:

1. Navigate to the Amazon SageMaker AI domain settings:

   1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

   1. In the left navigation pane, choose **Domains**.

   1. Select your domain.

   1. Choose the **App Configurations** tab.

   1. Scroll to the **Canvas** section and choose **Edit**.

1. Enable large data processing:

   1. In the **Large data processing configuration** section, turn on **Enable EMR Serverless for large data processing**.

   1. Create or select an EMR Serverless role:

      1. Choose **Create and use a new execution role** to create a new IAM role that has a trust relationship with EMR Serverless and the [AWS managed policy: AmazonSageMakerCanvasEMRServerlessExecutionRolePolicy](security-iam-awsmanpol-canvas.md#security-iam-awsmanpol-AmazonSageMakerCanvasEMRServerlessExecutionRolePolicy) policy attached. This IAM role is assumed by Canvas to create EMR Serverless jobs.

      1. Alternatively, if you already have an execution role with a trust relationship for EMR Serverless, then select **Use an existing execution role** and choose your role from the dropdown.
         + The existing role must have a name that begins with the prefix `AmazonSageMakerCanvasEMRSExecutionAccess-`.
         + The role you select should also have at least the permissions described in the [AWS managed policy: AmazonSageMakerCanvasEMRServerlessExecutionRolePolicy](security-iam-awsmanpol-canvas.md#security-iam-awsmanpol-AmazonSageMakerCanvasEMRServerlessExecutionRolePolicy) policy.
         + The role should have an EMR Serverless trust policy, as shown below:

------
#### [ JSON ]

****  

           ```
           {
               "Version":"2012-10-17",		 	 	 
               "Statement": [
                   {
                       "Sid": "EMRServerlessTrustPolicy",
                       "Effect": "Allow",
                       "Principal": {
                           "Service": "emr-serverless.amazonaws.com"
                       },
                       "Action": "sts:AssumeRole",
                       "Condition": {
                           "StringEquals": {
                               "aws:SourceAccount": "111122223333"
                           }
                       }
                   }
               ]
           }
           ```

------
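If you bring an existing execution role, the naming requirement above is easy to verify programmatically; a minimal sketch (the role names shown are hypothetical):

```python
# Required name prefix for an existing Canvas EMR Serverless execution role,
# per the documentation above.
REQUIRED_PREFIX = "AmazonSageMakerCanvasEMRSExecutionAccess-"

def is_valid_canvas_emr_role_name(role_name: str) -> bool:
    """Check the naming requirement for an existing EMR Serverless execution role."""
    return role_name.startswith(REQUIRED_PREFIX)

print(is_valid_canvas_emr_role_name("AmazonSageMakerCanvasEMRSExecutionAccess-demo"))  # True
print(is_valid_canvas_emr_role_name("MyCanvasRole"))  # False
```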

1. (Optional) Add Amazon S3 permissions for custom Amazon S3 buckets:

   1. The Canvas managed policy automatically grants read and write permissions for Amazon S3 buckets with `sagemaker` or `SageMaker` in their names. It also grants read permissions for objects in custom Amazon S3 buckets with the tag `"SageMaker": "true"`.

   1. For custom Amazon S3 buckets without the required tag, add the following policy to your EMR Serverless role:

------
#### [ JSON ]

****  

      ```
      {
          "Version":"2012-10-17",		 	 	 
          "Statement": [
              {
                  "Effect": "Allow",
                  "Action": [
                      "s3:GetObject",
                      "s3:PutObject",
                      "s3:DeleteObject"
                  ],
                  "Resource": [
                      "arn:aws:s3:::*"
                  ]
              }
          ]
      }
      ```

------

   1. We recommend that you scope down the permissions to specific Amazon S3 buckets that you want Canvas to access.

1. Save your changes and restart your SageMaker Canvas application.
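To follow the scoping-down recommendation above, you can replace the broad `arn:aws:s3:::*` resource with explicit bucket ARNs; a sketch with placeholder bucket names:

```python
import json

def scoped_s3_statement(buckets):
    """Build the S3 object-access statement limited to specific buckets."""
    return {
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
        # One object-level ARN per bucket instead of the wildcard resource.
        "Resource": [f"arn:aws:s3:::{b}/*" for b in buckets],
    }

stmt = scoped_s3_statement(["amzn-s3-demo-bucket-1", "amzn-s3-demo-bucket-2"])
print(json.dumps(stmt, indent=2))
```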

## Scenario 2: Custom domain setup (with public internet access/without VPC)
<a name="canvas-large-data-custom-no-vpc"></a>

If you created or use a custom domain, follow steps 1-3 from Scenario 1, and then do these additional steps:

1. Add permissions for the Amazon ECR `DescribeImages` operation to your Amazon SageMaker AI execution role, because Canvas uses public Amazon ECR Docker images for data preparation and model training:

   1. Sign in to the AWS console and open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

   1. Choose **Roles**.

   1. In the search box, search for your SageMaker AI execution role by name and select it.

   1. Add the following policy to your SageMaker AI execution role. This can be done either by adding it as a new inline policy or by appending the policy statement to an existing one. Note that an IAM role can have a maximum of 10 policies attached.

------
#### [ JSON ]

****  

      ```
      {
          "Version":"2012-10-17",		 	 	 
          "Statement": [{
              "Sid": "ECRDescribeImagesOperation",
              "Effect": "Allow",
              "Action": "ecr:DescribeImages",
              "Resource": [
                  "arn:aws:ecr:*:*:repository/sagemaker-data-wrangler-emr-container",
                  "arn:aws:ecr:*:*:repository/ap-dataprep-emr"
              ]
          }]
      }
      ```

------

1. Save your changes and restart your SageMaker Canvas application.

## Scenario 3: Custom domain setup (with VPC and without public internet access)
<a name="canvas-large-data-custom-vpc"></a>

If you created or use a custom domain, follow all steps from Scenario 2, then follow these additional steps:

1. Ensure your VPC subnets are private:

   1. Verify that the route table for your subnets doesn't have an entry mapping `0.0.0.0/0` to an Internet Gateway.

1. Add permissions for creating network interfaces:

   1. When using SageMaker Canvas with EMR Serverless for large-scale data processing, EMR Serverless requires the ability to create Amazon EC2 ENIs to enable network communication between EMR Serverless applications and your VPC resources.

   1. Add the following policy to your Amazon SageMaker AI execution role. This can be done either by adding it as a new inline policy or by appending the policy statement to an existing one. Note that an IAM role can have a maximum of 10 policies attached.

------
#### [ JSON ]

****  

      ```
      {
          "Version":"2012-10-17",		 	 	 
          "Statement": [
              {
                  "Sid": "AllowEC2ENICreation",
                  "Effect": "Allow",
                  "Action": [
                      "ec2:CreateNetworkInterface"
                  ],
                  "Resource": [
                      "arn:aws:ec2:*:*:network-interface/*"
                  ],
                  "Condition": {
                      "StringEquals": {
                          "aws:CalledViaLast": "ops.emr-serverless.amazonaws.com"
                      }
                  }
              }
          ]
      }
      ```

------

1. (Optional) Restrict ENI creation to specific subnets:

   1. To further secure your setup, you can restrict the creation of ENIs to specific subnets within your VPC by tagging those subnets and conditioning the policy on those tags.

   1. Use the following IAM policy to ensure that EMR Serverless applications can only create Amazon EC2 ENIs within the allowed subnets and security groups:

      ```
      {
          "Sid": "AllowEC2ENICreationInSubnetAndSecurityGroupWithEMRTags",
          "Effect": "Allow", 
          "Action": [
              "ec2:CreateNetworkInterface"
          ],
          "Resource": [
              "arn:aws:ec2:*:*:subnet/*",
              "arn:aws:ec2:*:*:security-group/*"
          ],
          "Condition": {
              "StringEquals": {
                  "aws:ResourceTag/KEY": "VALUE"
              }
          }
      }
      ```

1. Follow the steps on the page [Configure Amazon SageMaker Canvas in a VPC without internet access](canvas-vpc.md) to set the VPC endpoint for Amazon S3, which is required by EMR Serverless and other AWS services that are used by SageMaker Canvas.

1. Save your changes and restart your SageMaker Canvas application.

By following these steps, you can enable large data processing in SageMaker Canvas for various domain setups, including those with custom VPC configurations. Remember to restart your SageMaker Canvas application after making these changes to apply the new permissions.

# Encrypt your SageMaker Canvas data with AWS KMS
<a name="canvas-kms"></a>

You might have data that you want to encrypt while using Amazon SageMaker Canvas, such as your private company information or customer data. SageMaker Canvas uses AWS Key Management Service to protect your data. AWS KMS is a service that you can use to create and manage cryptographic keys for encrypting your data. For more information about AWS KMS, see [AWS Key Management Service](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html) in the *AWS KMS Developer Guide*.

Amazon SageMaker Canvas provides you with several options for encrypting your data. SageMaker Canvas provides default encryption within the application for tasks such as building your model and generating insights. You can also choose to encrypt data stored in Amazon S3 to protect your data at rest. SageMaker Canvas supports importing encrypted datasets, so you can establish an encrypted workflow. The following sections describe how you can use AWS KMS encryption to protect your data while building models with SageMaker Canvas.

## Encrypt your data in SageMaker Canvas
<a name="canvas-kms-app-data"></a>

You can use two different AWS KMS encryption keys to encrypt your data in SageMaker Canvas. You can specify these keys when [setting up your domain](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-studio-onboard.html) using the standard domain setup, in the following steps:
+ **Step 3: Configure Applications - (Optional)** – When configuring the **Canvas storage configuration** section, you can specify an **Encryption key**. This is a KMS key that SageMaker Canvas uses for long-term storage of model objects and datasets, which are stored in the provided Amazon S3 bucket for your domain. If creating a Canvas application with the [CreateApp](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateApp.html) API, use the `S3KMSKeyId` field to specify this key.
+ **Step 6: Configure storage** – SageMaker Canvas uses one key for encrypting the Amazon SageMaker Studio private space that is created for your Canvas application, which includes temporary application storage, visualizations, and compute jobs (such as building models). You can use either the default AWS managed key or specify your own. If you specify your AWS KMS key, the data stored in the `/home/sagemaker-user` directory is encrypted with your key. If you don't specify an AWS KMS key, the data inside `/home/sagemaker-user` is encrypted with an AWS managed key. Regardless of whether you specify an AWS KMS key, all of the data outside of the working directory is encrypted with an AWS Managed Key. To learn more about the Studio space and your Canvas application storage, see [Store SageMaker Canvas application data in your own SageMaker AI space](canvas-spaces-setup.md). If creating a Canvas application with the [CreateApp](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateApp.html) API, use the `KmsKeyID` field to specify this key.

The preceding keys can be the same or different KMS keys.

### Prerequisites
<a name="canvas-kms-app-data-prereqs"></a>

To use your own KMS key for either of the previously described purposes, you must first grant your user's IAM role permission to use the key. Then, you can specify the KMS key when setting up your domain.

The simplest way to grant your role permission to use the key is to modify the key policy. Use the following procedure to grant your role the necessary permissions.

1. Open the [AWS KMS console](https://console.aws.amazon.com/kms/).

1. In the **Key Policy** section, choose **Switch to policy view**.

1. Modify the key's policy to grant permissions for the `kms:GenerateDataKey` and `kms:Decrypt` actions to the IAM role. Additionally, if you're modifying the key policy that encrypts your Canvas application storage in the Studio space, grant the `kms:CreateGrant` action. You can add a statement that's similar to the following:

   ```
   {
     "Sid": "ExampleStmt",
     "Action": [
       "kms:CreateGrant",
       "kms:Decrypt",
       "kms:GenerateDataKey"
     ],
     "Effect": "Allow",
     "Principal": {
       "AWS": "<arn:aws:iam::111122223333:role/Jane>"
     },
     "Resource": "*"
   }
   ```

1. Choose **Save changes**.

The less preferred method is to modify the user’s IAM role to grant the user permissions to use or manage the KMS key. If you use this method, the KMS key policy must also allow access management through IAM. To learn how to grant permission to a KMS key through the user’s IAM role, see [Specifying KMS keys in IAM policy statements](https://docs.aws.amazon.com/kms/latest/developerguide/cmks-in-iam-policies.html) in the *AWS KMS Developer Guide*.
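If you choose the IAM-role method, the identity policy that you attach might look like the following sketch. The key ARN shown is a placeholder; scope the statement to your own KMS key, and include `kms:CreateGrant` only for the key that encrypts your Canvas application storage:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "kms:CreateGrant",
                "kms:Decrypt",
                "kms:GenerateDataKey"
            ],
            "Resource": "arn:aws:kms:us-east-1:111122223333:key/111aa2bb-333c-4d44-5555-a111bb2c33dd"
        }
    ]
}
```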

### Encrypt your data in the SageMaker Canvas application
<a name="canvas-kms-app-data-app"></a>

The first KMS key you can use in SageMaker Canvas is used for encrypting application data stored on Amazon Elastic Block Store (Amazon EBS) volumes and in the Amazon Elastic File System that SageMaker AI creates in your domain. SageMaker Canvas encrypts your data with this key in the underlying application and temporary storage systems created when using compute instances for building models and generating insights. SageMaker Canvas passes the key to other AWS services, such as Autopilot, whenever SageMaker Canvas initiates jobs with them to process your data.

You can specify this key by setting the `KmsKeyID` in the `CreateDomain` API call or while doing the standard domain setup in the console. If you don’t specify your own KMS key, SageMaker AI uses a default AWS managed KMS key to encrypt your data in the SageMaker Canvas application.

To specify your own KMS key for use in the SageMaker Canvas application through the console, first set up your Amazon SageMaker AI domain using the **Standard setup**. Use the following procedure to complete the **Network and Storage Section** for the domain.

1. Fill out your desired Amazon VPC settings.

1. For **Encryption key**, choose **Enter a KMS key ARN**.

1. For **KMS ARN**, enter the ARN for your KMS key, which should have a format similar to the following: `arn:aws:kms:example-region-1:123456789098:key/111aa2bb-333c-4d44-5555-a111bb2c33dd`
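If you script your domain setup, a quick format check can catch malformed ARNs before you submit them. This sketch uses an illustrative regular expression for the `arn:aws:kms:<region>:<account>:key/<key-id>` form (it checks the shape only; it doesn't verify that the key exists):

```python
import re

# Illustrative pattern for a KMS key ARN: region, 12-digit account ID,
# and a 36-character hyphenated key ID.
KMS_KEY_ARN = re.compile(r"^arn:aws:kms:[a-z0-9-]+:\d{12}:key/[0-9a-f-]{36}$")

def is_kms_key_arn(arn: str) -> bool:
    """Return True if the string matches the expected KMS key ARN format."""
    return bool(KMS_KEY_ARN.match(arn))
```

For example, the sample ARN from the step above passes this check, while an Amazon S3 ARN or a bare key ID does not.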

### Encrypt your SageMaker Canvas data saved in Amazon S3
<a name="canvas-kms-app-data-s3"></a>

The second KMS key you can specify is used for data that SageMaker Canvas stores to Amazon S3. This KMS key is specified in the `S3KMSKeyId` field in the `CreateDomain` API call, or while doing the standard domain setup in the SageMaker AI console. SageMaker Canvas saves duplicates of your input datasets, application and model data, and output data to the Region’s default SageMaker AI S3 bucket for your account. The naming pattern for this bucket is `s3://sagemaker-{Region}-{your-account-id}`, and SageMaker Canvas stores data in the `Canvas/` folder.

To specify this KMS key through the console, use the **Standard setup** for your domain and complete the following steps:

1. Turn on **Enable notebook resource sharing**.

1. For **S3 location for shareable notebook resources**, leave the default Amazon S3 path. Note that SageMaker Canvas does not use this Amazon S3 path; this Amazon S3 path is used for Studio Classic notebooks.

1. For **Encryption key**, choose **Enter a KMS key ARN**.

1. For **KMS ARN**, enter the ARN for your KMS key, which should have a format similar to the following: `arn:aws:kms:us-east-1:111122223333:key/111aa2bb-333c-4d44-5555-a111bb2c33dd`

## Import encrypted datasets from Amazon S3
<a name="canvas-kms-datasets"></a>

Your users might have datasets that have been encrypted with a KMS key. While the preceding section shows you how to encrypt data in SageMaker Canvas and data stored to Amazon S3, you must grant your user's IAM role additional permissions if you want to import data from Amazon S3 that is already encrypted with AWS KMS.

To grant your user permissions to import encrypted datasets from Amazon S3 into SageMaker Canvas, add the following permissions to the IAM execution role that you've used for the user profile.

```
      "kms:Decrypt",
      "kms:GenerateDataKey"
```
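In context, these actions belong in a policy statement like the following sketch. The key ARN is a placeholder; scoping `Resource` to the specific key that encrypted the datasets is tighter than using `*`:

```
{
    "Effect": "Allow",
    "Action": [
        "kms:Decrypt",
        "kms:GenerateDataKey"
    ],
    "Resource": "arn:aws:kms:us-east-1:111122223333:key/111aa2bb-333c-4d44-5555-a111bb2c33dd"
}
```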

To learn how to edit the IAM permissions for a role, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html) in the *IAM User Guide*. For more information about KMS keys, see [Key policies in AWS Key Management Service](https://docs.aws.amazon.com//kms/latest/developerguide/key-policies.html) in the *AWS KMS Developer Guide*.

## FAQs
<a name="canvas-kms-faqs"></a>

Refer to the following FAQ items for answers to commonly asked questions about SageMaker Canvas AWS KMS support.

### Q: Does SageMaker Canvas retain my KMS key?
<a name="canvas-kms-faqs-1"></a>

A: No. SageMaker Canvas may temporarily cache your key or pass it on to other AWS services (such as Autopilot), but SageMaker Canvas does not retain your KMS key.

### Q: I specified a KMS key when setting up my domain. Why did my dataset fail to import in SageMaker Canvas?
<a name="canvas-kms-faqs-2"></a>

A: Your user’s IAM role may not have permissions to use that KMS key. To grant your user permissions, see the [Prerequisites](#canvas-kms-app-data-prereqs). Another possible error is that you have a bucket policy on your Amazon S3 bucket that requires the use of a specific KMS key that doesn’t match the KMS key you specified in your domain. Make sure that you specify the same KMS key for your Amazon S3 bucket and your domain.

### Q: How do I find the Region’s default SageMaker AI Amazon S3 bucket for my account?
<a name="canvas-kms-faqs-3"></a>

A: The default Amazon S3 bucket follows the naming pattern `s3://sagemaker-{Region}-{your-account-id}`. The `Canvas/` folder in this bucket stores your SageMaker Canvas application data.
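Because the bucket name is deterministic, a small helper can construct it when you're scripting; this is an illustrative sketch, not a SageMaker API:

```python
def default_sagemaker_bucket(region: str, account_id: str) -> str:
    """Build the default SageMaker AI bucket name for a Region and account."""
    return f"sagemaker-{region}-{account_id}"

# Canvas data lives under the Canvas/ prefix of this bucket.
bucket = default_sagemaker_bucket("us-east-1", "111122223333")
```

For example, `default_sagemaker_bucket("us-east-1", "111122223333")` returns `sagemaker-us-east-1-111122223333`.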

### Q: Can I change the default SageMaker AI Amazon S3 bucket used to store SageMaker Canvas data?
<a name="canvas-kms-faqs-4"></a>

A: No, SageMaker AI creates this bucket for you.

### Q: What does SageMaker Canvas store in the default SageMaker AI Amazon S3 bucket?
<a name="canvas-kms-faqs-5"></a>

A: SageMaker Canvas uses the default SageMaker AI Amazon S3 bucket to store duplicates of your input datasets, model artifacts, and model outputs.

### Q: What use cases are supported for using KMS keys with SageMaker Canvas?
<a name="canvas-kms-faqs-6"></a>

A: With SageMaker Canvas, you can use your own encryption keys with AWS KMS for building regression, binary and multi-class classification, and time series forecasting models, as well as for batch inference with your model.

# Store SageMaker Canvas application data in your own SageMaker AI space
<a name="canvas-spaces-setup"></a>

Your Amazon SageMaker Canvas application data, such as datasets that you import and your model artifacts, is stored in an *Amazon SageMaker Studio private space*. The space consists of a storage volume for your application data with 100 GB of storage per user profile, the type of the space (in this case, a Canvas application), and the image for your application's container. When you set up Canvas and launch your application for the first time, SageMaker AI creates a default private space that is assigned to your user profile and stores your Canvas data. You don't have to do any additional configuration to set up the space because SageMaker AI automatically creates the space on your behalf. However, if you don't want to use the default space, you have the option to specify a space that you created yourself. This can be useful if you want to isolate your data. The following page shows you how to create and configure your own Studio space for storing Canvas application data.

**Note**  
You can only configure a custom Studio space for new Canvas applications. You can't modify the space configuration for existing Canvas applications.

## Before you begin
<a name="canvas-spaces-setup-prereqs"></a>

Your Amazon SageMaker AI domain or user profile must have at least 100 GB of storage in order to create and use the SageMaker Canvas application.

If you created your domain through the SageMaker AI console, enough storage is provisioned by default and you don't need to take any additional action. If you created your domain or user profile with the [CreateDomain](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateDomain.html) or [CreateUserProfile](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateUserProfile.html) APIs, then make sure that you set the `MaximumEbsVolumeSizeInGb` value to 100 GB or greater. To set a greater storage value, you can either create a new domain or user profile, or you can update an existing domain or user profile using the [UpdateDomain](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateDomain.html) or [UpdateUserProfile](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateUserProfile.html) APIs.
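As a sketch, the request to raise the storage limit on an existing user profile might look like the following. The field nesting reflects the author's understanding of the `UpdateUserProfile` API's space storage settings, so verify it against the current SageMaker API reference before use; the domain ID and profile name are placeholders:

```python
# Sketch of an UpdateUserProfile request body that raises the maximum
# space storage to 100 GB. Replace the placeholder values.
update_kwargs = {
    "DomainId": "<your-domain-id>",
    "UserProfileName": "<your-user-profile>",
    "UserSettings": {
        "SpaceStorageSettings": {
            "DefaultEbsStorageSettings": {
                "DefaultEbsVolumeSizeInGb": 100,
                "MaximumEbsVolumeSizeInGb": 100,
            }
        }
    },
}

# With a boto3 SageMaker AI client, you would then call:
# client.update_user_profile(**update_kwargs)
```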

## Create a new space
<a name="canvas-spaces-setup-new-space"></a>

First, create a new Studio space that is configured to store Canvas application data. This is the space that you specify when creating a new Canvas application in the next step.

To create a space, you can use the AWS SDK for Python (Boto3) or the AWS CLI.

------
#### [ SDK for Python (Boto3) ]

The following example shows you how to use the AWS SDK for Python (Boto3) `create_space` method to create a space that you can use for Canvas applications. Make sure to specify these parameters:
+ `DomainId`: Specify the ID for your SageMaker AI domain. To find your ID, you can go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/) and locate your domain in the **Domains** section.
+ `SpaceName`: Specify a name for the new space.
+ `EbsVolumeSizeInGb`: Specify the storage volume size for your space (in GB). The minimum value is `5` and the maximum is `16384`.
+ `SharingType`: Specify this field as `Private`. For more information, see [Amazon SageMaker Studio spaces](studio-updated-spaces.md).
+ `OwnerUserProfileName`: Specify the user profile name. To find user profile names associated with a domain, you can go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/) and locate your domain in the **Domains** section. In the domain's settings, you can view the user profiles.
+ `AppType`: Specify this field as `Canvas`.

```
response = client.create_space(
    DomainId='<your-domain-id>', 
    SpaceName='<your-new-space-name>',
    SpaceSettings={
        'AppType': 'Canvas',
        'SpaceStorageSettings': {
            'EbsStorageSettings': {
                'EbsVolumeSizeInGb': <storage-volume-size>
            }
        },
    },
    OwnershipSettings={
        'OwnerUserProfileName': '<your-user-profile>'
    },
    SpaceSharingSettings={
        'SharingType': 'Private'
    }  
)
```

------
#### [ AWS CLI ]

The following example shows you how to use the AWS CLI `create-space` method to create a space that you can use for Canvas applications. Make sure to specify these parameters:
+ `domain-id`: Specify the ID for your domain. To find your ID, you can go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/) and locate your domain in the **Domains** section.
+ `space-name`: Specify a name for the new space.
+ `EbsVolumeSizeInGb`: Specify the storage volume size for your space (in GB). The minimum value is `5` and the maximum is `16384`.
+ `SharingType`: Specify this field as `Private`. For more information, see [Amazon SageMaker Studio spaces](studio-updated-spaces.md).
+ `OwnerUserProfileName`: Specify the user profile name. To find user profile names associated with a domain, you can go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/) and locate your domain in the **Domains** section. In the domain's settings, you can view the user profiles.
+ `AppType`: Specify this field as `Canvas`.

```
  
create-space
--domain-id <your-domain-id>
--space-name <your-new-space-name>  
--space-settings '{
        "AppType": "Canvas",
        "SpaceStorageSettings": {
            "EbsStorageSettings": {"EbsVolumeSizeInGb": <storage-volume-size>}
        }
    }'
--ownership-settings '{"OwnerUserProfileName": "<your-user-profile>"}'
--space-sharing-settings '{"SharingType": "Private"}'
```

------

You should now have a space. Keep track of your space's name for the next step.

## Create a new Canvas application
<a name="canvas-spaces-setup-new-app"></a>

After creating a space, create a new Canvas application that specifies the space as its storage location.

To create a new Canvas application, you can use the AWS SDK for Python (Boto3) or the AWS CLI.

**Important**  
You must use the AWS SDK for Python (Boto3) or the AWS CLI to create your Canvas application. Specifying a custom space when creating Canvas applications through the SageMaker AI console isn't supported.

------
#### [ SDK for Python (Boto3) ]

The following example shows you how to use the AWS SDK for Python (Boto3) `create_app` method to create a new Canvas application. Make sure to specify these parameters:
+ `DomainId`: Specify the ID for your SageMaker AI domain.
+ `SpaceName`: Specify the name of the space that you created in the previous step.
+ `AppType`: Specify this field as `Canvas`.
+ `AppName`: Specify `default` as the app name.

```
response = client.create_app(  
    DomainId='<your-domain-id>',
    SpaceName='<your-space-name>',
    AppType='Canvas', 
    AppName='default'  
)
```

------
#### [ AWS CLI ]

The following example shows you how to use the AWS CLI `create-app` method to create a new Canvas application. Make sure to specify these parameters:
+ `domain-id`: Specify the ID for your SageMaker AI domain.
+ `space-name`: Specify the name of the space that you created in the previous step.
+ `app-type`: Specify this field as `Canvas`.
+ `app-name`: Specify `default` as the app name.

```
create-app
--domain-id <your-domain-id>
--space-name <your-space-name>
--app-type Canvas
--app-name default
```

------

You should now have a new Canvas application that uses a custom Studio space as the storage location for application data.

**Important**  
Any time you delete the Canvas application (or log out) and have to re-create the application, you must provide your space in the `SpaceName` field to make sure that Canvas uses your space.

The space is attached to the user profile you specified in the space configuration. You can delete your Canvas application without deleting the space, and the data stored in the space remains. The data stored in your space is only deleted if you delete your user profile, or if you directly delete the space.

# Grant Your Users Permissions to Build Custom Image and Text Prediction Models
<a name="canvas-set-up-cv-nlp"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

In Amazon SageMaker Canvas, you can build [custom models](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-build-model.html) to meet your specific business need. Two of these custom model types are single-label image prediction and multi-category text prediction. The permissions to build these model types are included in the AWS Identity and Access Management (IAM) policy called [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasFullAccess), which SageMaker AI attaches by default to your user's IAM execution role if you leave the [Canvas base permissions turned on](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-getting-started.html#canvas-prerequisites). If you are using a custom IAM configuration, then you must explicitly add permissions to your user's IAM execution role so that they can build custom image and text prediction model types. The following section shows you how to attach a least-permissions policy to the role.

To add the permissions to the user's IAM role, do the following:

1. Go to the [IAM console](https://console.aws.amazon.com/iamv2).

1. Choose **Roles**.

1. In the search box, search for the user's IAM role by name and select it.

1. On the page for the user's role, under **Permissions**, choose **Add permissions**.

1. Choose **Create inline policy**.

1. Select the JSON tab, and then paste the following least-permissions policy into the editor.

------
#### [ JSON ]

****  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "sagemaker:CreateAutoMLJobV2",
                   "sagemaker:DescribeAutoMLJobV2"
               ],
               "Resource": "*"
           }
       ]
   }
   ```

------

1. Choose **Review policy**.

1. Enter a **Name** for the policy.

1. Choose **Create policy**.

For more information about AWS managed policies, see [Managed policies and inline policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_managed-vs-inline.html) in the *IAM User Guide*.

# Grant Users Permissions to Use Amazon Bedrock and Generative AI Features in Canvas
<a name="canvas-fine-tuning-permissions"></a>

Generative AI features in Amazon SageMaker Canvas are powered by Amazon Bedrock foundation models, which are large language models (LLMs) that have the capability to understand and generate human-like text. This page describes how to grant the permissions necessary for the following features in SageMaker Canvas:
+ [Chat with and compare Amazon Bedrock models](canvas-fm-chat.md): Access and start conversational chats with Amazon Bedrock models through SageMaker Canvas.
+ [Use the Chat for data prep feature in Data Wrangler ](canvas-chat-for-data-prep.md): Use natural language to explore, visualize, and transform your data. This feature is powered by Anthropic Claude 2.
+ [Fine-tune Amazon Bedrock foundation models](canvas-fm-chat-fine-tune.md): Fine-tune an Amazon Bedrock foundation model on your own data to receive customized responses.

In order to use these features, you must first request access to the specific Amazon Bedrock model that you want to use. Then, add the necessary AWS IAM permissions and a trust relationship with Amazon Bedrock to the user's execution role. To grant the permissions to the role, you can choose one of the following methods:
+ Create a new Amazon SageMaker AI domain or user profile and turn on Amazon Bedrock permissions. For more information, see [Getting started with using Amazon SageMaker Canvas](canvas-getting-started.md).
+ Edit the settings for an existing Amazon SageMaker AI domain or user profile.
+ Manually add permissions and a trust relationship to a domain's or user's IAM role.

## Step 1: Add Amazon Bedrock model access
<a name="canvas-bedrock-access"></a>

Access to Amazon Bedrock models isn't granted by default, so you must go to the Amazon Bedrock console to request access to models for your AWS account.

To learn how to request access to a specific Amazon Bedrock model, follow the procedure to **Add model access** on the page [Manage access to Amazon Bedrock foundation models](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html) in the *Amazon Bedrock User Guide*.

## Step 2: Grant permissions to the user's IAM role
<a name="canvas-bedrock-iam-permissions"></a>

When setting up your Amazon SageMaker AI domain or user profile, the user's IAM execution role must have the [AmazonSageMakerCanvasBedrockAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasBedrockAccess.html) policy attached, as well as a trust relationship with Amazon Bedrock, so that your user can access Amazon Bedrock models from SageMaker Canvas.

You can modify the domain settings and either create a new execution role (to which SageMaker AI attaches the required permissions for you) or specify an existing role.

Alternatively, you can manually modify the permissions for an existing IAM role through the IAM console.

Both methods are described in the following sections.

### Grant permissions through the domain settings
<a name="canvas-fine-tuning-permissions-console"></a>

You can edit your domain or user profile settings to turn on the **Canvas Ready-to-use models configuration** setting and specify an Amazon Bedrock role.

To edit your domain settings and grant access to Amazon Bedrock models for Canvas users in the domain, do the following:

1. Go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Domains**.

1. From the list of domains, choose your domain.

1. Choose the **App Configurations** tab.

1. In the **Canvas** section, choose **Edit**.

1. The **Edit Canvas settings** page opens. For the **Canvas Ready-to-use models configuration** section, do the following:

   1. Turn on the **Enable Canvas Ready-to-use models** option.

   1. For **Amazon Bedrock role**, select **Create and use a new execution role** to create a new IAM execution role that has the [AmazonSageMakerCanvasBedrockAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasBedrockAccess.html) policy attached and a trust relationship with Amazon Bedrock. This IAM role is assumed by Amazon Bedrock when you access Amazon Bedrock models, use the chat for data prep feature, or fine-tune Amazon Bedrock models in Canvas. If you already have an execution role with a trust relationship, then select **Use an existing execution role** and choose your role from the dropdown.

1. Choose **Submit** to save your changes.

Your users should now have the necessary permissions to access Amazon Bedrock models, use the chat for data prep feature, and fine-tune Amazon Bedrock models in Canvas.

You can use the same procedure above for editing an individual user’s settings, except go into the individual user’s profile from the domain page and edit the user settings instead. Permissions granted to an individual user don’t apply to other users in the domain, while permissions granted through the domain settings apply to all user profiles in the domain.

For more information on editing your domain settings, see [View and Edit domains](https://docs.aws.amazon.com/sagemaker/latest/dg/domain-view-edit.html).

### Grant permissions manually through IAM
<a name="canvas-fine-tuning-permissions-manual"></a>

You can manually grant users permissions to access and fine-tune Amazon Bedrock models in Canvas by adding permissions to the IAM role specified for the domain or user’s profile. The IAM role must have the [AmazonSageMakerCanvasBedrockAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasBedrockAccess.html) policy attached and a trust relationship with Amazon Bedrock.

The following section shows you how to attach the policy to your IAM role and create the trust relationship with Amazon Bedrock.

First, take note of your domain or user profile’s IAM role. Note that permissions granted to an individual user don’t apply to other users in the domain, while permissions granted through the domain apply to all user profiles in the domain.

To configure the IAM role and grant permissions to fine-tune foundation models in Canvas, do the following:

1. Go to the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. In the left navigation pane, choose **Roles**.

1. Search for the user's IAM role by name from the list of roles and select it.

1. On the **Permissions** tab, choose **Add permissions**. From the dropdown menu, choose **Attach policies**.

1. Search for the `AmazonSageMakerCanvasBedrockAccess` policy and select it.

1. Choose **Add permissions**.

1. Back on the IAM role’s page, choose the **Trust relationships** tab.

1. Choose **Edit trust policy**.

1. In the policy editor, find the **Add a principal** option in the right panel and choose **Add**.

1. In the dialog box, for **Principal type**, select **AWS services**.

1. For **ARN**, enter **bedrock.amazonaws.com**.

1. Choose **Add principal**.

1. Choose **Update policy**.
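After these steps, the role's trust policy should contain a statement similar to the following sketch (the `sagemaker.amazonaws.com` principal is typically already present so that SageMaker AI itself can assume the role):

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "sagemaker.amazonaws.com",
                    "bedrock.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```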

You should now have an IAM role that has the [AmazonSageMakerCanvasBedrockAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasBedrockAccess.html) policy attached and a trust relationship with Amazon Bedrock. For information about AWS managed policies, see [Managed policies and inline policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_managed-vs-inline.html) in the *IAM User Guide*.

# Update SageMaker Canvas for Your Users
<a name="canvas-update"></a>

You can update to the latest version of Amazon SageMaker Canvas as either a user or an IT administrator. You can update Amazon SageMaker Canvas for a single user at a time.

To update the Amazon SageMaker Canvas application, you must delete the previous version.

**Important**  
Deleting the previous version of Amazon SageMaker Canvas doesn't delete the data or models that the users have created.

Use the following procedure to log in to AWS, open your Amazon SageMaker AI domain, and update Amazon SageMaker Canvas. Users can start using the SageMaker Canvas application again when they log back in.

1. Sign in to the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. On the **Domains** page, choose your domain.

1. From the list of **User profiles**, choose a user profile.

1. For the list of **Apps**, find the Canvas application (the **App type** says **Canvas**) and choose **Delete app**.

1. Complete the dialog box and choose **Confirm action**.

The following image shows the user profile page and highlights the **Delete app** action from the preceding procedure.

![\[Screenshot of the user profile page with the Delete app action highlighted.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-update-app-1.png)


# Request a Quota Increase
<a name="canvas-requesting-quota-increases"></a>

Your users might use AWS resources in amounts that exceed those specified by their quotas. If your users are resource constrained and encounter errors in SageMaker Canvas, you can request a quota increase for them.

For more details about SageMaker AI quotas and how to request a quota increase, see [Quotas](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#regions-quotas-quotas).

Amazon SageMaker Canvas uses the following services to process the requests of your users:
+ Amazon SageMaker Autopilot
+ Amazon SageMaker Studio Classic domain

For a list of the available quotas for SageMaker Canvas operations, see [Amazon SageMaker AI endpoints and quotas](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html).

## Request an increase for instances to build custom models
<a name="canvas-requesting-quota-increases-instances"></a>

When building a custom model, if you encounter an error during post-building analysis that tells you to increase your quota for `ml.m5.2xlarge` instances, use the following information to resolve the issue.

You must increase the SageMaker AI Hosting endpoint quota for the `ml.m5.2xlarge` instance type to a non-zero value in your AWS account. After building a model, SageMaker Canvas hosts the model on a SageMaker AI Hosting endpoint and uses the endpoint to generate the post-building analysis. If you don't increase the default account quota of 0 for `ml.m5.2xlarge` instances, SageMaker Canvas cannot complete this step and generates an error during post-building analysis.

For the procedure to increase the quota, see [Requesting a quota increase](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html) in the *Service Quotas User Guide*.
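Quota increases can also be requested through the Service Quotas API. The following sketch builds the request parameters; the quota code shown is a hypothetical placeholder that you would look up with the `ListServiceQuotas` operation, and the actual API call appears only in a comment.

```python
# Sketch of requesting the ml.m5.2xlarge hosting-endpoint quota increase
# programmatically. The quota code "L-XXXXXXXX" is a placeholder; find the
# real code for your quota with the ListServiceQuotas API first.

def build_quota_increase_request(quota_code, desired_value):
    """Return parameters for the RequestServiceQuotaIncrease API call."""
    return {
        "ServiceCode": "sagemaker",
        "QuotaCode": quota_code,
        "DesiredValue": float(desired_value),
    }

# With boto3 (not executed here):
#   sq = boto3.client("service-quotas")
#   sq.request_service_quota_increase(**build_quota_increase_request("L-XXXXXXXX", 1))

req = build_quota_increase_request("L-XXXXXXXX", 1)
```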

# Grant Users Permissions to Import Amazon Redshift Data
<a name="canvas-redshift-permissions"></a>

Your users might have datasets stored in Amazon Redshift. Before users can import data from Amazon Redshift into SageMaker Canvas, you must add the `AmazonRedshiftFullAccess` managed policy to the IAM execution role that you've used for the user profile and add Amazon Redshift as a service principal to the role's trust policy. You must also associate the IAM execution role with your Amazon Redshift cluster. Complete the procedures in the following sections to give your users the required permissions to import Amazon Redshift data.

## Add Amazon Redshift permissions to your IAM role
<a name="canvas-redshift-permissions-iam-role"></a>

You must grant Amazon Redshift permissions to the IAM role specified in your user profile.

To add the `AmazonRedshiftFullAccess` policy to the user's IAM role, do the following.

1. Sign in to the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. Choose **Roles**.

1. In the search box, search for the user's IAM role by name and select it.

1. On the page for the user's role, under **Permissions**, choose **Add permissions**.

1. Choose **Attach policies**.

1. Search for the `AmazonRedshiftFullAccess` managed policy and select it.

1. Choose **Attach policies** to attach the policy to the role.

After attaching the policy, the role’s **Permissions** section should now include `AmazonRedshiftFullAccess`.

To add Amazon Redshift as a service principal to the IAM role, do the following.

1. On the same page for the IAM role, under **Trust relationships**, choose **Edit trust policy**.

1. In the **Edit trust policy** editor, update the trust policy to add Amazon Redshift as a service principal. An IAM role that allows Amazon Redshift to access other AWS services on your behalf has a trust relationship as follows:

------
#### [ JSON ]

****  

   ```
   {
     "Version":"2012-10-17",		 	 	 
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "Service": "redshift.amazonaws.com"
         },
         "Action": "sts:AssumeRole"
       }
     ]
   }
   ```

------

1. After editing the trust policy, choose **Update policy**.

You should now have an IAM role that has the policy `AmazonRedshiftFullAccess` attached to it and a trust relationship established with Amazon Redshift, giving users permission to import Amazon Redshift data into SageMaker Canvas. For more information about AWS managed policies, see [Managed policies and inline policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_managed-vs-inline.html) in the *IAM User Guide*.
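The two console procedures above can also be scripted with the IAM API. The following sketch builds the parameters for `AttachRolePolicy` and `UpdateAssumeRolePolicy`; the role name is a placeholder, and note that `UpdateAssumeRolePolicy` replaces the existing trust policy, so in practice you would merge the Amazon Redshift principal into the role's current statements.

```python
import json

# Sketch: granting the execution role Amazon Redshift access with the IAM
# API instead of the console. The trust policy mirrors the JSON shown above.

TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

def build_iam_calls(role_name):
    """Return (AttachRolePolicy kwargs, UpdateAssumeRolePolicy kwargs)."""
    attach = {
        "RoleName": role_name,
        "PolicyArn": "arn:aws:iam::aws:policy/AmazonRedshiftFullAccess",
    }
    trust = {
        "RoleName": role_name,
        "PolicyDocument": json.dumps(TRUST_POLICY),
    }
    return attach, trust

# With boto3 (not executed here):
#   iam = boto3.client("iam")
#   attach, trust = build_iam_calls("MyCanvasExecutionRole")  # placeholder name
#   iam.attach_role_policy(**attach)
#   iam.update_assume_role_policy(**trust)
```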

## Associate the IAM role with your Amazon Redshift cluster
<a name="canvas-redshift-permissions-cluster"></a>

In the settings for your Amazon Redshift cluster, you must associate the IAM role that you granted permissions to in the preceding section.

To associate an IAM role with your cluster, do the following.

1. Sign in to the Amazon Redshift console at [https://console.aws.amazon.com/redshiftv2/](https://console.aws.amazon.com/redshiftv2/).

1. On the navigation menu, choose **Clusters**, and then choose the name of the cluster that you want to update.

1. In the **Actions** dropdown menu, choose **Manage IAM roles**. The **Cluster permissions** page appears.

1. For **Available IAM roles**, enter either the ARN or the name of the IAM role, or choose the IAM role from the list.

1. Choose **Associate IAM role** to add it to the list of **Associated IAM roles**.

1. Choose **Save changes** to associate the IAM role with the cluster.

Amazon Redshift modifies the cluster to complete the change, and the IAM role to which you previously granted Amazon Redshift permissions is now associated with your Amazon Redshift cluster. Your users now have the required permissions to import Amazon Redshift data into SageMaker Canvas.
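The association can also be made with the Amazon Redshift `ModifyClusterIamRoles` API. A minimal sketch, with placeholder identifiers and the boto3 call shown only in a comment:

```python
# Sketch: associating the IAM role with a Redshift cluster via the
# ModifyClusterIamRoles API. The cluster ID and role ARN are placeholders.

def build_associate_role_params(cluster_id, role_arn):
    """Return parameters for the ModifyClusterIamRoles API call."""
    return {
        "ClusterIdentifier": cluster_id,
        "AddIamRoles": [role_arn],
    }

# With boto3 (not executed here):
#   redshift = boto3.client("redshift")
#   redshift.modify_cluster_iam_roles(
#       **build_associate_role_params(
#           "my-cluster", "arn:aws:iam::111122223333:role/MyCanvasRole"))

params = build_associate_role_params(
    "my-cluster", "arn:aws:iam::111122223333:role/MyCanvasRole")
```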

# Grant Your Users Permissions to Send Predictions to Amazon QuickSight
<a name="canvas-quicksight-permissions"></a>

You must grant your SageMaker Canvas users permissions to send batch predictions to Amazon QuickSight. In QuickSight, users can create analyses and reports with a dataset and prepare dashboards to share their results. For more information about sending predictions to QuickSight for analysis, see [Send predictions to Amazon QuickSight](canvas-send-predictions.md).

To grant the necessary permissions to share batch predictions with users in QuickSight, you must add a permissions policy to the AWS Identity and Access Management (IAM) execution role that you’ve used for the user profile. The following section shows you how to attach a least-privilege policy to your role.

**Add the permissions policy to your IAM role**

**To add the permissions policy, use the following procedure:**

1. Sign in to the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. Choose **Roles**.

1. In the search box, search for the user's IAM role by name and select it.

1. On the page for the user's role, under **Permissions**, choose **Add permissions**.

1. Choose **Create inline policy**.

1. Select the **JSON** tab, and then paste the following least-privilege policy into the editor. Replace the example account number `111122223333` with your own AWS account number.

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "quicksight:CreateDataSet",
                   "quicksight:ListUsers",
                   "quicksight:ListNamespaces",
                   "quicksight:CreateDataSource",
                   "quicksight:PassDataSet",
                   "quicksight:PassDataSource"
               ],
               "Resource": [
                   "arn:aws:quicksight:*:111122223333:datasource/*",
                   "arn:aws:quicksight:*:111122223333:user/*",
                   "arn:aws:quicksight:*:111122223333:namespace/*",
                   "arn:aws:quicksight:*:111122223333:dataset/*"
               ]
           }
       ]
   }
   ```

------

1. Choose **Review policy**.

1. Enter a **Name** for the policy.

1. Choose **Create policy**.

You should now have an inline IAM policy attached to your execution role that grants your Canvas users the necessary permissions to send batch predictions to users in QuickSight.
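If you manage multiple accounts, you can template the policy above instead of editing it by hand. The following sketch parameterizes the document by account ID; `json.dumps(policy, indent=4)` produces text you can paste into the inline-policy editor.

```python
import json

# Sketch: generating the least-privilege QuickSight policy for a given
# account ID, so the same document can be reused across accounts.

ACTIONS = [
    "quicksight:CreateDataSet",
    "quicksight:ListUsers",
    "quicksight:ListNamespaces",
    "quicksight:CreateDataSource",
    "quicksight:PassDataSet",
    "quicksight:PassDataSource",
]

def build_quicksight_policy(account_id):
    """Return the policy document with the account ID substituted."""
    resources = [
        f"arn:aws:quicksight:*:{account_id}:{suffix}/*"
        for suffix in ("datasource", "user", "namespace", "dataset")
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ACTIONS, "Resource": resources}
        ],
    }

policy = build_quicksight_policy("111122223333")
# Paste json.dumps(policy, indent=4) into the inline-policy editor, or
# attach it with iam.put_role_policy (not executed here).
```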

# Applications management
<a name="canvas-manage-apps"></a>

The following sections describe how you can manage your SageMaker Canvas applications. You can view, delete, or relaunch your applications from the **Domains** section of the SageMaker AI console.

**Topics**
+ [Check for active applications](canvas-manage-apps-active.md)
+ [Delete an application](canvas-manage-apps-delete.md)
+ [Relaunch an application](canvas-manage-apps-relaunch.md)

# Check for active applications
<a name="canvas-manage-apps-active"></a>

To check if you have any actively running SageMaker Canvas applications, use the following procedure.

1. Open the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Dashboard**.

1. In the **LCNC** (low-code no-code) section, a row for Canvas shows how many active apps are running. Choose the number to view the list of apps.

The **Status** column displays the status of the application, such as **Ready**, **Pending**, or **Deleted**. If the application is **Ready**, then your SageMaker Canvas workspace instance is active. You can delete the application from the console, or you can reopen Canvas and log out.
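You can also check for active Canvas apps with the SageMaker `ListApps` API. The sketch below filters a list shaped like a `ListApps` response; the boto3 call is shown only in a comment, and the sample data is illustrative.

```python
# Sketch: finding actively running Canvas apps. The filter runs on sample
# data shaped like ListApps response entries.

def active_canvas_apps(apps):
    """Return the apps that are Canvas type and still in service."""
    return [
        a for a in apps
        if a.get("AppType") == "Canvas" and a.get("Status") == "InService"
    ]

# With boto3 (not executed here):
#   resp = boto3.client("sagemaker").list_apps(
#       DomainIdEquals="d-xxxxxxxxxxxx", UserProfileNameEquals="my-user")
#   running = active_canvas_apps(resp["Apps"])

sample = [
    {"AppName": "default", "AppType": "Canvas", "Status": "InService"},
    {"AppName": "default", "AppType": "JupyterServer", "Status": "InService"},
    {"AppName": "old", "AppType": "Canvas", "Status": "Deleted"},
]
print(len(active_canvas_apps(sample)))  # 1
```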

# Delete an application
<a name="canvas-manage-apps-delete"></a>

If you want to terminate your SageMaker Canvas workspace instance, you can either log out from the SageMaker Canvas application or delete your application from the SageMaker AI console. A *workspace instance* is dedicated for your use from when you start using SageMaker Canvas to the point when you stop using it. Deleting the application only terminates the workspace instance and stops workspace instance charges. Models and datasets aren’t affected, but Quick build tasks automatically restart when you relaunch the application.

To delete your Canvas application through the AWS console, first close the browser tab in which your Canvas application was open. Then, use the following procedure to delete your SageMaker Canvas application.

1. Open the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**. 

1. On the **Domains** page, choose your domain.

1. On the **Domain details** page, choose **Resources**.

1. Under **Applications**, find the application that says **Canvas** in the **App type** column.

1. Select the checkbox next to the Canvas application and choose **Stop**.

You have now successfully stopped the application and terminated the workspace instance.

You can also terminate the workspace instance by [logging out](canvas-log-out.md) from within the SageMaker Canvas application.

# Relaunch an application
<a name="canvas-manage-apps-relaunch"></a>

If you delete or log out of your SageMaker Canvas application and want to relaunch the application, use the following procedure.

1. Navigate to the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. In the navigation pane, choose **Canvas**.

1. On the SageMaker Canvas landing page, in the **Get Started** box, select your user profile from the dropdown.

1. Choose **Open Canvas** to open the application.

SageMaker Canvas begins launching the application.

You can also use the following secondary procedure if you encounter any issues with the previous procedure.

1. Open the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. On the **Domains** page, choose your domain.

1. On the **Domain details** page, under **User profiles**, select the user profile name for the SageMaker Canvas application you want to view.

1. Choose **Launch** and select **Canvas** from the dropdown list.

SageMaker Canvas begins launching the application.
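Administrators can also generate a login link for a user with the SageMaker `CreatePresignedDomainUrl` API. A minimal sketch with placeholder identifiers; the boto3 call appears only in a comment:

```python
# Sketch: generating a presigned domain URL that a user can open to launch
# their applications, including Canvas. Identifiers are placeholders.

def build_presigned_url_params(domain_id, user_profile_name, ttl_seconds=300):
    """Return parameters for the CreatePresignedDomainUrl API call."""
    return {
        "DomainId": domain_id,
        "UserProfileName": user_profile_name,
        "ExpiresInSeconds": ttl_seconds,  # how long the URL stays valid
    }

# With boto3 (not executed here):
#   sm = boto3.client("sagemaker")
#   url = sm.create_presigned_domain_url(
#       **build_presigned_url_params("d-xxxxxxxxxxxx", "my-user"))["AuthorizedUrl"]

params = build_presigned_url_params("d-xxxxxxxxxxxx", "my-user")
```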

# Configure Amazon SageMaker Canvas in a VPC without internet access
<a name="canvas-vpc"></a>

The Amazon SageMaker Canvas application runs in a container in an AWS managed Amazon Virtual Private Cloud (VPC). If you want to further control access to your resources or run SageMaker Canvas without public internet access, you can configure your Amazon SageMaker AI domain and VPC settings. Within your own VPC, you can configure settings such as security groups (virtual firewalls that control inbound and outbound traffic from Amazon EC2 instances) and subnets (ranges of IP addresses in your VPC). To learn more about VPCs, see [How Amazon VPC works](https://docs.aws.amazon.com/vpc/latest/userguide/how-it-works.html).

When the SageMaker Canvas application is running in the AWS managed VPC, it can interact with other AWS services using either an internet connection or through VPC endpoints created in a customer-managed VPC (without public internet access). SageMaker Canvas applications can access these VPC endpoints through a Studio Classic-created network interface that provides connectivity to the customer-managed VPC. The default behavior of the SageMaker Canvas application is to have internet access. When using an internet connection, the containers for the preceding jobs access AWS resources over the internet, such as the Amazon S3 buckets where you store training data and model artifacts.

However, if you have security requirements to control access to your data and job containers, we recommend that you configure SageMaker Canvas and your VPC so that your data and containers aren’t accessible over the internet. SageMaker AI uses the VPC configuration settings you specify when setting up your domain for SageMaker Canvas.

If you want to configure your SageMaker Canvas application without internet access, you must configure your VPC settings when you onboard to [Amazon SageMaker AI domain](gs-studio-onboard.md), set up VPC endpoints, and grant the necessary AWS Identity and Access Management permissions. For information about configuring a VPC in Amazon SageMaker AI, see [Choose an Amazon VPC](onboard-vpc.md). The following sections describe how to run SageMaker Canvas in a VPC without public internet access.

## Configure Amazon SageMaker Canvas in a VPC without internet access
<a name="canvas-vpc-configure"></a>

You can send traffic from SageMaker Canvas to other AWS services through your own VPC. If your VPC doesn't have public internet access and you've set up your domain in **VPC only** mode, then SageMaker Canvas doesn't have public internet access either. All requests, such as accessing datasets in Amazon S3 or running training jobs for standard builds, go through VPC endpoints in your VPC instead of the public internet. When you onboard to a domain and [Choose an Amazon VPC](onboard-vpc.md), you can specify your own VPC as the default VPC for the domain, along with your desired security group and subnet settings. SageMaker AI then creates a network interface in your VPC that SageMaker Canvas uses to access VPC endpoints in your VPC.

Make sure that you set up one or more security groups in your VPC with inbound and outbound rules that allow [TCP traffic within the security group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/security-group-rules-reference.html#sg-rules-other-instances). This is required for connectivity between the Jupyter Server application and the Kernel Gateway applications. At a minimum, you must allow access to ports in the range `8192-65535`. Also, create a distinct security group for each user profile and add inbound access from that same security group. We don't recommend reusing a domain-level security group for user profiles; if the domain-level security group allows inbound access to itself, all applications in the domain have access to all other applications in the domain. Note that the security group and subnet settings are set after you finish onboarding to a domain.

When onboarding to a domain, if you choose **Public internet only** as the network access type, the VPC is SageMaker AI managed and allows internet access.

You can change this behavior by choosing **VPC only** so that SageMaker AI sends all traffic to a network interface that SageMaker AI creates in your specified VPC. When you choose this option, you must provide the subnets, security groups, and VPC endpoints that are necessary to communicate with the SageMaker API and SageMaker AI Runtime, and various AWS services, such as Amazon S3 and Amazon CloudWatch, that are used by SageMaker Canvas. Note that you can only import data from Amazon S3 buckets located in the same Region as your VPC.

The following procedures show how you can configure these settings to use SageMaker Canvas without the internet.

### Step 1: Onboard to Amazon SageMaker AI domain
<a name="canvas-vpc-configure-onboard"></a>

To send SageMaker Canvas traffic to a network interface in your own VPC instead of over the internet, specify the VPC you want to use when onboarding to [Amazon SageMaker AI domain](gs-studio-onboard.md). You must also specify at least two subnets in your VPC that SageMaker AI can use. Choose **Standard setup** and do the following procedure when configuring the **Network and Storage Section** for the domain.

1. Select your desired **VPC**.

1. Choose two or more **Subnets**. If you don’t specify the subnets, SageMaker AI uses all of the subnets in the VPC.

1. Choose one or more **Security group(s)**.

1. Choose **VPC Only** to turn off direct internet access in the AWS managed VPC where SageMaker Canvas is hosted.

After disabling internet access, finish the onboarding process to set up your domain. For more information about the VPC settings for Amazon SageMaker AI domain, see [Choose an Amazon VPC](onboard-vpc.md).
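The onboarding choices above correspond to parameters of the SageMaker `CreateDomain` API. The following sketch builds those parameters; the domain name, VPC, subnet, security group, and role identifiers are all placeholders, and the actual call is shown only in a comment.

```python
# Sketch: the VPC-only onboarding settings expressed as CreateDomain
# parameters. All identifiers below are placeholders.

def build_domain_params(vpc_id, subnet_ids, security_group_ids, execution_role_arn):
    """Return parameters for a CreateDomain call with internet access off."""
    return {
        "DomainName": "canvas-no-internet",    # hypothetical name
        "AuthMode": "IAM",
        "AppNetworkAccessType": "VpcOnly",     # turns off direct internet access
        "VpcId": vpc_id,
        "SubnetIds": subnet_ids,               # at least two subnets
        "DefaultUserSettings": {
            "ExecutionRole": execution_role_arn,
            "SecurityGroups": security_group_ids,
        },
    }

# With boto3 (not executed here):
#   boto3.client("sagemaker").create_domain(**params)

params = build_domain_params(
    "vpc-0123456789abcdef0",
    ["subnet-1111", "subnet-2222"],
    ["sg-0123456789abcdef0"],
    "arn:aws:iam::111122223333:role/CanvasExecutionRole",
)
```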

### Step 2: Configure VPC endpoints and access
<a name="canvas-vpc-configure-endpoints"></a>

**Note**  
In order to configure Canvas in your own VPC, you must enable private DNS hostnames for your VPC endpoints. For more information, see [Connect to SageMaker AI Through a VPC Interface Endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/interface-vpc-endpoint.html).

SageMaker Canvas only accesses other AWS services to manage and store data for its functionality. For example, it connects to Amazon Redshift if your users access an Amazon Redshift database. It can connect to an AWS service such as Amazon Redshift using an internet connection or a VPC endpoint. Use VPC endpoints if you want to set up connections from your VPC to AWS services that don't use the public internet.

A VPC endpoint creates a private connection to an AWS service that uses a networking path that is isolated from the public internet. For example, if you set up access to Amazon S3 using a VPC endpoint from your own VPC, then the SageMaker Canvas application can access Amazon S3 by going through the network interface in your VPC and then through the VPC endpoint that connects to Amazon S3. The communication between SageMaker Canvas and Amazon S3 is private.

For more information about configuring VPC endpoints for your VPC, see [AWS PrivateLink](https://docs.aws.amazon.com/vpc/latest/privatelink/what-is-privatelink.html). If you are using Amazon Bedrock models in Canvas with a VPC, for more information about controlling access to your data, see [ Protect jobs using a VPC](https://docs.aws.amazon.com/bedrock/latest/userguide/usingVPC.html#configureVPC) in the *Amazon Bedrock User Guide*.

The following are the VPC endpoints for each service you can use with SageMaker Canvas:


| Service | Endpoint | Endpoint type | 
| --- | --- | --- | 
|  AWS Application Auto Scaling  |  com.amazonaws.*Region*.application-autoscaling  | Interface | 
|  Amazon Athena  |  com.amazonaws.*Region*.athena  | Interface | 
|  Amazon SageMaker AI  |  com.amazonaws.*Region*.sagemaker.api com.amazonaws.*Region*.sagemaker.runtime com.amazonaws.*Region*.notebook  | Interface | 
|  Amazon SageMaker AI Data Science Assistant  |  com.amazonaws.*Region*.sagemaker-data-science-assistant  | Interface | 
|  AWS Security Token Service  |  com.amazonaws.*Region*.sts  | Interface | 
|  Amazon Elastic Container Registry (Amazon ECR)  |  com.amazonaws.*Region*.ecr.api com.amazonaws.*Region*.ecr.dkr  | Interface | 
|  Amazon Elastic Compute Cloud (Amazon EC2)  |  com.amazonaws.*Region*.ec2  | Interface | 
|  Amazon Simple Storage Service (Amazon S3)  |  com.amazonaws.*Region*.s3  | Gateway | 
|  Amazon Redshift  |  com.amazonaws.*Region*.redshift-data  | Interface | 
|  AWS Secrets Manager  |  com.amazonaws.*Region*.secretsmanager  | Interface | 
|  AWS Systems Manager  |  com.amazonaws.*Region*.ssm  | Interface | 
|  Amazon CloudWatch  |  com.amazonaws.*Region*.monitoring  | Interface | 
|  Amazon CloudWatch Logs  |  com.amazonaws.*Region*.logs  | Interface | 
|  Amazon Forecast  |  com.amazonaws.*Region*.forecast com.amazonaws.*Region*.forecastquery  | Interface | 
|  Amazon Textract  |  com.amazonaws.*Region*.textract  | Interface | 
|  Amazon Comprehend  |  com.amazonaws.*Region*.comprehend  | Interface | 
|  Amazon Rekognition  |  com.amazonaws.*Region*.rekognition  | Interface | 
|  AWS Glue  |  com.amazonaws.*Region*.glue  | Interface | 
|  Amazon Relational Database Service (Amazon RDS)  |  com.amazonaws.*Region*.rds  | Interface | 
|  Amazon Bedrock (see note after table)  |  com.amazonaws.*Region*.bedrock-runtime  | Interface | 
|  Amazon Kendra  |  com.amazonaws.*Region*.kendra  | Interface | 
|  Amazon EMR Serverless  |  com.amazonaws.*Region*.emr-serverless  | Interface | 
|  Amazon Q Developer (see note after table)  |  com.amazonaws.*Region*.q  | Interface | 

**Note**  
The Amazon Q Developer VPC endpoint is currently available only in the US East (N. Virginia) Region. To connect to it from other Regions, you can choose one of the following options based on your security and infrastructure preferences:  
**Set up a NAT Gateway.** Configure a NAT Gateway in your VPC's private subnet to enable internet connectivity for the Q Developer endpoint. For more information, see [Setting up a NAT Gateway in a VPC Private Subnet](https://repost.aws/knowledge-center/nat-gateway-vpc-private-subnet).
**Enable cross-region VPC endpoint access.** Set up cross-region VPC endpoint access for Q Developer. Use this option to connect securely without requiring internet access. For more information, see [Configuring Cross-Region VPC Endpoint Access](https://repost.aws/knowledge-center/vpc-endpoints-cross-region-aws-services).

**Note**  
For Amazon Bedrock, the interface endpoint service name `com.amazonaws.Region.bedrock` has been deprecated. Create a new VPC endpoint with the service name listed in the preceding table.  
Additionally, you can't fine-tune foundation models from Canvas VPCs with no internet access. This is because Amazon Bedrock doesn't support VPC endpoints for model customization APIs. To learn more about fine-tuning foundation models in Canvas, see [Fine-tune foundation models](canvas-fm-chat-fine-tune.md).
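The service names in the preceding table follow the pattern `com.amazonaws.Region.suffix`. The following sketch expands a subset of them into concrete names for a Region, for example to feed into `ec2 create-vpc-endpoint` calls; the subset chosen here is illustrative, not exhaustive.

```python
# Sketch: expanding a subset of the endpoint table into per-Region service
# names. Add or remove suffixes from the table as your setup requires.

INTERFACE_SUFFIXES = [
    "sagemaker.api", "sagemaker.runtime", "notebook",
    "sts", "ecr.api", "ecr.dkr", "ec2",
    "redshift-data", "secretsmanager", "ssm",
    "monitoring", "logs",
]

def endpoint_service_names(region):
    """Return the service names to create VPC endpoints for in a Region."""
    names = [f"com.amazonaws.{region}.{s}" for s in INTERFACE_SUFFIXES]
    names.append(f"com.amazonaws.{region}.s3")  # Gateway endpoint for Amazon S3
    return names

names = endpoint_service_names("us-east-1")
print(names[0])  # com.amazonaws.us-east-1.sagemaker.api
```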

You must also add an endpoint policy for Amazon S3 to control AWS principal access to your VPC endpoint. For information about how to update your VPC endpoint policy, see [Control access to VPC endpoints using endpoint policies](https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-access.html).

The following are two VPC endpoint policies that you can use. Use the first policy if you only want to grant access to the basic functionality of Canvas, such as importing data and creating models. Use the second policy if you want to grant access to the additional [generative AI features](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-fm-chat.html) in Canvas.

------
#### [ Basic VPC endpoint policy ]

The following policy grants the necessary access to your VPC endpoint for basic operations in Canvas.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:CreateBucket",
                "s3:GetBucketCors",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::*SageMaker*",
                "arn:aws:s3:::*Sagemaker*",
                "arn:aws:s3:::*sagemaker*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:ListAllMyBuckets"
            ],
            "Resource": "*"
        }
    ]
}
```

------
#### [ Generative AI VPC endpoint policy ]

The following policy grants the necessary access to your VPC endpoint for basic operations in Canvas, as well as using generative AI foundation models.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:CreateBucket",
                "s3:GetBucketCors",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::*SageMaker*",
                "arn:aws:s3:::*Sagemaker*",
                "arn:aws:s3:::*sagemaker*",
                "arn:aws:s3:::*fmeval/datasets*",
                "arn:aws:s3:::*jumpstart-cache-prod*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:ListAllMyBuckets"
            ],
            "Resource": "*"
        }
    ]
}
```

------
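You can apply either policy to an existing endpoint with the EC2 `ModifyVpcEndpoint` API. The sketch below assembles the basic statements into a complete document and builds the call parameters; the endpoint ID is a placeholder.

```python
import json

# Sketch: applying the basic Canvas endpoint policy with the
# ModifyVpcEndpoint API. The endpoint ID is a placeholder.

BASIC_STATEMENTS = [
    {
        "Effect": "Allow",
        "Action": [
            "s3:GetObject", "s3:PutObject", "s3:DeleteObject",
            "s3:CreateBucket", "s3:GetBucketCors", "s3:GetBucketLocation",
        ],
        "Resource": [
            "arn:aws:s3:::*SageMaker*",
            "arn:aws:s3:::*Sagemaker*",
            "arn:aws:s3:::*sagemaker*",
        ],
    },
    {
        "Effect": "Allow",
        "Action": ["s3:ListBucket", "s3:ListAllMyBuckets"],
        "Resource": "*",
    },
]

def build_endpoint_policy_params(endpoint_id, statements):
    """Return parameters for a ModifyVpcEndpoint call with the policy set."""
    policy = {"Version": "2012-10-17", "Statement": statements}
    return {"VpcEndpointId": endpoint_id, "PolicyDocument": json.dumps(policy)}

params = build_endpoint_policy_params("vpce-0123456789abcdef0", BASIC_STATEMENTS)
# With boto3 (not executed here):
#   boto3.client("ec2").modify_vpc_endpoint(**params)
```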

### Step 3: Grant IAM permissions
<a name="canvas-vpc-configure-permissions"></a>

The SageMaker Canvas user must have the necessary AWS Identity and Access Management permissions to allow connection to the VPC endpoints. The IAM role to which you give permissions must be the same one you used when onboarding to Amazon SageMaker AI domain. You can attach the SageMaker AI managed `AmazonSageMakerFullAccess` policy to the IAM role for the user to give the user the required permissions. If you require more restrictive IAM permissions and use custom policies instead, then give the user’s role the `ec2:DescribeVpcEndpointServices` permission. SageMaker Canvas requires these permissions to verify the existence of the required VPC endpoints for standard build jobs. If it detects these VPC endpoints, then standard build jobs run by default in your VPC. Otherwise, they will run in the default AWS managed VPC.

For instructions on how to attach the `AmazonSageMakerFullAccess` IAM policy to your user’s IAM role, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html).

To grant your user’s IAM role the granular `ec2:DescribeVpcEndpointServices` permission, use the following procedure.

1. Sign in to the AWS Management Console and open the [IAM console](https://console.aws.amazon.com/iam/).

1. In the navigation pane, choose **Roles**.

1. In the list, choose the name of the role to which you want to grant permissions.

1. Choose the **Permissions** tab.

1. Choose **Add permissions** and then choose **Create inline policy**.

1. Choose the **JSON** tab and enter the following policy, which grants the `ec2:DescribeVpcEndpointServices` permission:

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Sid": "VisualEditor0",
               "Effect": "Allow",
               "Action": "ec2:DescribeVpcEndpointServices",
               "Resource": "*"
           }
       ]
   }
   ```

------

1. Choose **Review policy**, and then enter a **Name** for the policy (for example, `VPCEndpointPermissions`).

1. Choose **Create policy**.

The user’s IAM role should now have permissions to access the VPC endpoints configured in your VPC.
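The same inline policy can be attached programmatically with the IAM `PutRolePolicy` API. A minimal sketch; the role name is a placeholder, and the boto3 call is shown only in a comment.

```python
import json

# Sketch: attaching the ec2:DescribeVpcEndpointServices inline policy with
# the PutRolePolicy API instead of the console. Role name is a placeholder.

POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "ec2:DescribeVpcEndpointServices",
            "Resource": "*",
        }
    ],
}

def build_put_role_policy_params(role_name, policy_name="VPCEndpointPermissions"):
    """Return parameters for the PutRolePolicy API call."""
    return {
        "RoleName": role_name,
        "PolicyName": policy_name,
        "PolicyDocument": json.dumps(POLICY),
    }

params = build_put_role_policy_params("CanvasExecutionRole")
# With boto3 (not executed here):
#   boto3.client("iam").put_role_policy(**params)
```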

### (Optional) Step 4: Override security group settings for specific users
<a name="canvas-vpc-configure-override"></a>

If you are an administrator, you might want different users to have different VPC settings, or user-specific VPC settings. When you override the default VPC’s security group settings for a specific user, these settings are passed on to the SageMaker Canvas application for that user.

You can override the security groups that a specific user has access to in your VPC when you set up a new user profile in Studio Classic. You can use the [CreateUserProfile](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateUserProfile.html) SageMaker API call (or the [create_user_profile](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_user_profile) method in the AWS SDK for Python (Boto3), or the equivalent [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html) command), and then in the `UserSettings`, specify the `SecurityGroups` for the user.
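For example, the per-user override can be sketched as the following `CreateUserProfile` parameters; all identifiers are placeholders, and the actual call appears only in a comment.

```python
# Sketch: overriding the default security groups for one user at profile
# creation time. All identifiers below are placeholders.

def build_user_profile_params(domain_id, user_name, security_group_ids):
    """Return parameters for a CreateUserProfile call with SG overrides."""
    return {
        "DomainId": domain_id,
        "UserProfileName": user_name,
        "UserSettings": {"SecurityGroups": security_group_ids},
    }

params = build_user_profile_params(
    "d-xxxxxxxxxxxx", "analyst-1", ["sg-1111", "sg-2222"])
# With boto3 (not executed here):
#   boto3.client("sagemaker").create_user_profile(**params)
```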

# Set up connections to data sources with OAuth
<a name="canvas-setting-up-oauth"></a>

The following section describes the steps you must take to set up OAuth connections to data sources from SageMaker Canvas. [OAuth](https://oauth.net/2/) is a common authentication platform for granting access to resources without sharing passwords. With OAuth, you can quickly connect to your data from Canvas and import it for building models. Canvas currently supports OAuth for Snowflake and Salesforce Data Cloud. 

**Note**  
You can only establish one OAuth connection for each data source.

## Set up OAuth for Salesforce Data Cloud
<a name="canvas-setting-up-oauth-salesforce"></a>

To set up OAuth for Salesforce Data Cloud, follow these general steps:

1. Sign in to Salesforce Data Cloud.

1. In Salesforce Data Cloud, create a new app connection and do the following:

   1. Enable OAuth settings.

   1. When prompted for a callback URL (or the URL of the resource accessing your data), specify the URL for your Canvas application. The Canvas application URL follows this format: `https://<domain-id>.studio.<region>.sagemaker.aws/canvas/default`

   1. Copy the consumer key and secret.

   1. Copy your authorization URL and token URL.

For more detailed instructions about performing the preceding tasks in Salesforce Data Cloud, see [Import data from Salesforce Data Cloud](data-wrangler-import.md#data-wrangler-import-salesforce-data-cloud) in the Data Wrangler documentation for importing data from Salesforce Data Cloud.

After enabling access from Salesforce Data Cloud and getting your connection information, you must create an [AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html) secret to store the information and add it to your Amazon SageMaker AI domain or user profile. Note that you can add a secret to both a domain and user profile, but Canvas looks for secrets in the user profile first.

To add a secret to your domain or user profile, do the following:

1. Go to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker).

1. Choose **Domains** in the navigation pane.

1. From the list of domains, choose your domain.

   1. If adding your secret to your domain, do the following:

      1. Choose the domain.

      1. On the **domain settings** page, choose the **domain settings** tab.

      1. Choose **Edit**.

   1. If adding the secret to your user profile, do the following:

      1. Choose the user’s domain.

      1. On the **domain settings** page, choose the user profile.

      1. On the **User Details** page, choose **Edit**.

1. In the navigation pane, choose **Canvas settings**.

1. For **OAuth settings**, choose **Add OAuth configuration**.

1. For **Data source**, select **Salesforce Data Cloud**.

1. For **Secret Setup**, select **Create a new secret**. Alternatively, if you already created an AWS Secrets Manager secret with your credentials, enter the ARN for the secret. If creating a new secret, do the following:

   1. For **Identity Provider**, select **SALESFORCE**.

   1. For **Client ID**, **Client Secret**, **Authorization URL**, and **Token URL**, enter all of the information you gathered from Salesforce Data Cloud in the previous procedure.

1. Save your domain or user profile settings.

You should now be able to create a connection to your data in Salesforce Data Cloud from Canvas.

## Set up OAuth for Snowflake
<a name="canvas-setting-up-oauth-snowflake"></a>

For Snowflake, Canvas supports authentication through OAuth identity providers so that users don't have to enter their Snowflake credentials directly into Canvas.

The following are links to the Snowflake documentation for the identity providers that Canvas supports:
+ [Azure AD](https://docs.snowflake.com/en/user-guide/oauth-azure.html)
+ [Okta](https://docs.snowflake.com/en/user-guide/oauth-okta.html)
+ [Ping Federate](https://docs.snowflake.com/en/user-guide/oauth-pingfed.html)

The following process describes the general steps you must take. For more detailed instructions, see [Setting up Snowflake OAuth Access](data-wrangler-import.md#data-wrangler-snowflake-oauth-setup) in the Data Wrangler documentation.

To set up OAuth for Snowflake, do the following:

1. Register Canvas as an application with the identity provider. This requires specifying a redirect URL to Canvas, which should follow this format: `https://<domain-id>.studio.<region>.sagemaker.aws/canvas/default`

1. Within the identity provider, create a server or API that sends OAuth tokens to Canvas so that Canvas can access Snowflake. When setting up the server, use the authorization code and refresh token grant types, specify the access token lifetime, and set a refresh token policy. Additionally, within the External OAuth Security Integration for Snowflake, enable `external_oauth_any_role_mode`.

1. Get the following information from the identity provider: the token URL, the authorization URL, the client ID, and the client secret. For Azure AD, also retrieve the OAuth scope credentials.

1. Store the information retrieved in the previous step in an AWS Secrets Manager secret.

   1. For Okta and Ping Federate, the secret should use the following format, with `identity_provider` set to either `OKTA` or `PING_FEDERATE`:

      ```
      {
        "identity_provider": "OKTA",
        "client_id": "example-client-id",
        "client_secret": "example-client-secret",
        "authorization_url": "https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/authorize",
        "token_url": "https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/token"
      }
      ```

   1. For Azure AD, the secret should also include the OAuth scope credentials as the `datasource_oauth_scope` field.
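
The secret formats above can also be assembled programmatically. The following is a minimal sketch; the `oauth_scope` handling applies only to Azure AD, and you should confirm the final JSON against the formats shown in the preceding steps.

```python
import json

def build_snowflake_oauth_secret(identity_provider, client_id, client_secret,
                                 authorization_url, token_url, oauth_scope=None):
    """Assemble the Secrets Manager secret value for a Snowflake OAuth connection."""
    secret = {
        "identity_provider": identity_provider,  # "OKTA" or "PING_FEDERATE"
        "client_id": client_id,
        "client_secret": client_secret,
        "authorization_url": authorization_url,
        "token_url": token_url,
    }
    if oauth_scope is not None:
        # Azure AD secrets also include the OAuth scope credentials.
        secret["datasource_oauth_scope"] = oauth_scope
    return json.dumps(secret)
```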

After configuring the identity provider and creating your [AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html) secret, you must add the secret to your Amazon SageMaker AI domain or user profile. Note that you can add a secret to both a domain and a user profile, but Canvas looks for secrets in the user profile first.

To add a secret to your domain or user profile, do the following:

1. Go to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker).

1. Choose **Domains** in the navigation pane.

1. From the list of domains, choose your domain.

   1. If adding the secret to your domain, do the following:

      1. On the **Domain details** page, choose the **Domain settings** tab.

      1. Choose **Edit**.

   1. If adding the secret to your user profile, do the following:

      1. On the **Domain details** page, choose the user profile.

      1. On the **User Details** page, choose **Edit**.

1. In the navigation pane, choose **Canvas settings**.

1. For **OAuth settings**, choose **Add OAuth configuration**.

1. For **Data source**, select **Snowflake**.

1. For **Secret Setup**, select **Create a new secret**. Alternatively, if you already created an AWS Secrets Manager secret with your credentials, enter the ARN for the secret. If creating a new secret, do the following:

   1. For **Identity Provider**, select **SNOWFLAKE**.

   1. For **Client ID**, **Client Secret**, **Authorization URL**, and **Token URL**, enter all of the information you gathered from the identity provider in the previous procedure.

1. Save your domain or user profile settings.

You should now be able to create a connection to your data in Snowflake from Canvas.

# Generative AI assistance for solving ML problems in Canvas using Amazon Q Developer
<a name="canvas-q"></a>

While using Amazon SageMaker Canvas, you can chat with Amazon Q Developer in natural language to apply generative AI to your ML problems. Q Developer is an assistant that helps you translate your goals into machine learning (ML) tasks and describes each step of the ML workflow. Q Developer reduces the time, effort, and data science expertise that Canvas users need to apply ML and make data-driven decisions for their organizations. 

Through a conversation with Q Developer, you can initiate actions in Canvas such as preparing data, building an ML model, making predictions, and deploying a model. Q Developer makes suggestions for next steps and provides you with context as you complete each step. It also informs you of results; for example, Canvas can transform your dataset according to best practices, and Q Developer can list the transforms that were used and why.

Amazon Q Developer is available in SageMaker Canvas at no additional cost to both Amazon Q Developer Pro Tier and Free Tier users. However, standard charges apply for resources such as the SageMaker Canvas workspace instance and any resources used for building or deploying models. For more information about pricing, see [Amazon SageMaker Canvas pricing](https://aws.amazon.com/sagemaker-ai/canvas/pricing/).

Use of Amazon Q is licensed to you under the [MIT-0 license](https://github.com/aws/mit-0) and subject to the [AWS Responsible AI Policy](https://aws.amazon.com/machine-learning/responsible-ai/policy/). When you use Q Developer from outside the US, Q Developer processes data across US Regions. For more information, see [Cross region inference in Amazon Q Developer](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/cross-region-inference.html).

**Note**  
Amazon Q Developer in SageMaker Canvas doesn't use user content to improve the service, regardless of whether you use the Free Tier or Pro Tier subscription. For service telemetry purposes, Q Developer might track your usage, such as the number of questions asked and whether recommendations were accepted or rejected. This telemetry data doesn't include personally identifiable information, such as your IP address.

## How it works
<a name="canvas-q-how-it-works"></a>

Amazon Q Developer is a generative AI powered assistant available in SageMaker Canvas that you can query using natural language. Q Developer makes suggestions for each step of the machine learning workflow, explaining concepts and providing you with options and more details as needed. You can use Q Developer for help with regression, binary classification, and multi-class classification use cases.

For example, to predict customer churn, upload a dataset of historical customer churn information to Canvas through Q Developer. Q Developer suggests an appropriate ML model type and steps to fix dataset issues, build a model, and make predictions.

**Important**  
Amazon Q Developer is intended for conversations about machine learning problems within SageMaker Canvas. It guides users through Canvas actions and optionally answers questions about AWS services. Q Developer processes model inputs only in English. For more information about how you can use Q Developer, see [Amazon Q Developer features](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/features.html) in the *Amazon Q Developer User Guide*.

## Supported regions
<a name="canvas-q-regions"></a>

Amazon Q Developer is available within SageMaker Canvas in the following AWS Regions:
+ US East (N. Virginia)
+ US East (Ohio)
+ US West (Oregon)
+ Asia Pacific (Mumbai)
+ Asia Pacific (Seoul)
+ Asia Pacific (Singapore)
+ Asia Pacific (Sydney)
+ Asia Pacific (Tokyo)
+ Europe (Frankfurt)
+ Europe (Ireland)
+ Europe (Paris)

## Amazon Q Developer capabilities available in Canvas
<a name="canvas-q-capabilities"></a>

The following list summarizes the Canvas tasks with which Q Developer can provide assistance:
+ **Describe your objective** – Q Developer can suggest an ML model type and general approach to solve your problem.
+ **Import and analyze datasets** – Tell Q Developer where your dataset is stored or upload a file to save it as a Canvas dataset. Prompt Q Developer to identify any issues in your dataset, such as outliers or missing values. Q Developer provides summary statistics about your dataset and lists any identified issues.

  Q Developer supports queries about the following statistics for individual columns:
  + Numeric columns – `number of valid values`, `feature type`, `mean`, `median`, `minimum`, `maximum`, `standard deviation`, `25th percentile`, `75th percentile`, `number of outliers`
  + Categorical columns – `number of missing values`, `number of valid values`, `feature type`, `most frequent`, `most frequent category`, `most frequent category count`, `least frequent`, `least frequent category`, `least frequent category count`, `categories`
+ **Fix dataset issues** – Prompt Q Developer to use Canvas's data transformation capabilities to create a revised version of your dataset. Canvas creates a Data Wrangler data flow and applies transforms according to data science best practices. For more information, see [Data preparation](canvas-data-prep.md).

  If you want to do more advanced data analysis or data preparation tasks than you can accomplish with Q Developer, then we recommend that you go to the Data Wrangler data flow interface.
+ **Train a model** – Q Developer tells you the recommended ML model type for your problem and a proposed model building configuration. You can use the suggested default settings to do a quick build, or you can modify the configuration and do a standard build. When ready, prompt Q Developer to build your Canvas model.

  All of the custom model types are supported. For more information about model types and quick versus standard builds, see [How custom models work](canvas-build-model.md).
+ **Evaluate model accuracy** – After building a model, Q Developer provides a summary of how the model scores across various metrics. These metrics help you determine the usefulness and accuracy of your model. Q Developer can explain any concept or metric in detail.

  To view full details and visualizations, open the model from the chat or the **My Models** page of Canvas. For more information, see [Model evaluation](canvas-evaluate-model.md).
+ **Get predictions for new data** – You can upload a new dataset and prompt Q Developer to help you open the prediction feature of Canvas. 

  Q Developer opens a new window in the application where you can either make a single prediction or make batch predictions with a new dataset. For more information, see [Predictions with custom models](canvas-make-predictions.md).
+ **Deploy a model** – To deploy your model for production, ask Q Developer to help you deploy your model through Canvas. Q Developer opens a new window in which you can configure your deployment. 

  After deploying, view your deployment details either 1) on the **My Models** page of Canvas in the model's **Deploy** tab, or 2) on the **ML Ops** page in the **Deployments** tab. For more information, see [Deploy your models to an endpoint](canvas-deploy-model.md).

## Prerequisites
<a name="canvas-q-prereqs"></a>

To use Amazon Q Developer to build ML models in SageMaker Canvas, complete the following prerequisites:

**Set up a Canvas application**

Make sure that you have a Canvas application set up. For information about how to set up a Canvas application, see [Getting started with using Amazon SageMaker Canvas](canvas-getting-started.md).

**Grant Q Developer permissions**

To access Q Developer while using Canvas, you must attach the necessary permissions to the AWS IAM role used for your SageMaker AI domain or user profile. You can do this through the console, as described in this section. If you encounter any permissions issues with the console method, manually attach the AWS managed policy [AmazonSageMakerCanvasSMDataScienceAssistantAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasSMDataScienceAssistantAccess) to the IAM role.

Permissions attached at the domain level apply to all user profiles in the domain, unless individual permissions are granted or revoked at the user profile level.

------
#### [ SageMaker AI console method ]

You can grant permissions by editing the SageMaker AI domain or user profile settings.

To grant permissions through the domain settings in the SageMaker AI console, do the following:

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. From the list of domains, select your domain.

1. On the **Domain details** page, select the **App configurations** tab.

1. In the **Canvas** section, choose **Edit**.

1. On the **Edit Canvas settings** page, go to the **Amazon Q Developer** section and do the following:

   1. Turn on **Enable Amazon Q Developer in SageMaker Canvas for natural language ML** to add the permissions to chat with Q Developer in Canvas to your domain's execution role.

   1. (Optional) Turn on **Enable Amazon Q Developer chat for general AWS questions** if you want to ask Q Developer questions about various AWS services (for example: Describe how Athena works).
**Note**  
When making general AWS queries to Q Developer, your requests route through the US East (N. Virginia) AWS Region. To prevent your data from routing through US East (N. Virginia), turn off the **Enable Amazon Q Developer chat for general AWS questions** toggle.

------
#### [ Manual method ]

Attach the [AmazonSageMakerCanvasSMDataScienceAssistantAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasSMDataScienceAssistantAccess) policy to the AWS IAM role used for your domain or user profile. For more information about how to do this, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html) in the *AWS IAM User Guide*.
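
As a scripted alternative, the policy can be attached with the IAM `AttachRolePolicy` API. The following is a minimal sketch; the role name is a placeholder, and you should confirm the exact policy ARN in the IAM console before use.

```python
# AWS managed policies live under the arn:aws:iam::aws:policy/ namespace;
# confirm the exact ARN for this policy in the IAM console before use.
POLICY_ARN = (
    "arn:aws:iam::aws:policy/"
    "AmazonSageMakerCanvasSMDataScienceAssistantAccess"
)

def attach_q_developer_policy(role_name):
    """Attach the Q Developer managed policy to a SageMaker AI execution role."""
    import boto3  # requires AWS credentials at call time
    boto3.client("iam").attach_role_policy(
        RoleName=role_name, PolicyArn=POLICY_ARN
    )

# Example (placeholder role name):
# attach_q_developer_policy("AmazonSageMaker-ExecutionRole-Example")
```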

------

**(Optional) Configure access to Q Developer from your VPC**

If you have a VPC that is configured without public internet access, you can add a VPC endpoint for Q Developer. For more information, see [Configure Amazon SageMaker Canvas in a VPC without internet access](canvas-vpc.md).

## Getting started
<a name="canvas-q-get-started"></a>

To use Amazon Q Developer to build ML models in SageMaker Canvas, do the following:

1. Open your SageMaker Canvas application.

1. In the left navigation pane, choose **Amazon Q**.

1. Choose **Start a new conversation** to open a new chat.

When you start a new chat, Q Developer prompts you to state your problem or provide a dataset.

![\[The greeting that Q Developer gives you upon starting a new chat.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/amazon-q-greeting.png)


After importing your data, you can ask Q Developer to provide you with summary statistics about your dataset, or you can ask questions about specific columns. For a list of the different statistics that Q Developer supports, see the preceding section [Amazon Q Developer capabilities available in Canvas](#canvas-q-capabilities). The following screenshot shows an example of asking for dataset statistics and the most frequent category in a product category column.

![\[Chat dialog asking Q Developer to provide dataset statistics and the most frequent category statistic.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/amazon-q-dataset-statistics.png)


Q Developer tracks any Canvas artifacts you import or create during the conversation, such as transformed datasets and models. You can access them from the chat or other Canvas application tabs. For example, if Q Developer fixes issues in your dataset, you can access the new, transformed dataset from the following places:
+ The artifacts sidebar in the Q Developer chat interface
+ The **Datasets** page of Canvas, where you can view both your original and transformed datasets. The transformed dataset has the **Built by Amazon Q** label added to it.
+ The **Data Wrangler** page of Canvas, where Q Developer creates a new data flow for your dataset

The following screenshot shows the original dataset and the transformed dataset in the sidebar of a chat.

![\[The artifacts, which are a dataset and a transformed dataset, shown in the sidebar of a Q Developer chat.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/amazon-q-artifacts.png)


When your data is ready, ask Q Developer to help build a Canvas model. Q Developer might prompt you to confirm a few fields and review the build configuration. If you use the default build configuration, then your model is built using a quick build. If you want to customize any part of your build configuration, such as selecting the algorithms used or changing the objective metric, then your model is built with a standard build.

The following screenshot shows how you can prompt Q Developer to initiate a Canvas model build with only a few prompts. This example uses the default configuration to start a quick build.

![\[A conversation with Q Developer where the user prompted to start a Canvas model build.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/amazon-q-training-chat.png)


After building your model, you can perform additional actions using either natural language in the chat or the artifacts sidebar menu. For example, you can view model details and metrics, make predictions, or deploy the model. The following screenshot shows the sidebar where you can choose these additional options.

![\[A Q Developer conversation ellipsis menu expanded, showing options for viewing models details, predictions, and deployment.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/amazon-q-ellipsis-menu.png)


You can also perform any of these actions by going to the **My Models** page of Canvas and selecting your model. From your model's page, you can navigate to the **Analyze**, **Predict**, and **Deploy** tabs to view model metrics and visualizations, make predictions, and manage deployments, respectively.

# Logging Q Developer conversations with AWS CloudTrail
<a name="canvas-q-cloudtrail"></a>

AWS CloudTrail is a service that records actions taken by users, roles, or AWS services in Amazon SageMaker AI. CloudTrail captures API calls resulting from your interactions with Amazon Q Developer (a conversational AI assistant) while using SageMaker Canvas (a no-code ML interface). CloudTrail data shows request details, the IP address of the requester, who made the request, and when.

Your interactions with Q Developer are sent as `SendConversation` API calls to the SageMaker AI Data Science Assistant service, an internal service that Canvas uses on the backend. The event source for `SendConversation` API calls is `sagemaker-data-science-assistant.amazonaws.com`.
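
For example, you could list recent `SendConversation` events with the CloudTrail `LookupEvents` API. The following is a hedged sketch; running the commented call requires AWS credentials, and the 90-day CloudTrail event history limit applies.

```python
def send_conversation_lookup_params(max_results=50):
    """Build LookupEvents parameters that filter for SendConversation calls."""
    return {
        "LookupAttributes": [
            {"AttributeKey": "EventName", "AttributeValue": "SendConversation"},
        ],
        "MaxResults": max_results,
    }

# import boto3
# resp = boto3.client("cloudtrail").lookup_events(**send_conversation_lookup_params())
# for event in resp["Events"]:
#     print(event["EventTime"], event.get("Username"))
```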

**Note**  
For privacy and security reasons, the content of your conversations is hidden in the logs, appearing as `HIDDEN_DUE_TO_SECURITY_REASONS` in the request and response elements.

To learn more about CloudTrail, see the [AWS CloudTrail User Guide](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html). To learn more about CloudTrail in SageMaker AI, see [Logging Amazon SageMaker AI API calls using AWS CloudTrail](logging-using-cloudtrail.md).

The following is an example log file entry for the `SendConversation` API:

```
{
    "eventVersion":"1.10",
    "userIdentity": {
        "type":"AssumedRole",
        "principalId":"AROA123456789EXAMPLE:user-Isengard",
        "arn":"arn:aws:sts::111122223333:assumed-role/Admin/user",
        "accountId":"111122223333",
        "accessKeyId":"ASIAIOSFODNN7EXAMPLE",
        "sessionContext": {
            "sessionIssuer": {
                "type":"Role",
                "principalId":"AROA123456789EXAMPLE",
                "arn":"arn:aws:iam::111122223333:role/Admin",
                "accountId":"111122223333",
                "userName":"Admin"
            },
            "attributes": {
                "creationDate":"2024-11-11T22:04:37Z",
                "mfaAuthenticated":"false"
            }
        }
    },
    "eventTime":"2024-11-11T22:09:22Z",
    "eventSource":"sagemaker-data-science-assistant.amazonaws.com",
    "eventName":"SendConversation",
    "awsRegion":"us-west-2",
    "sourceIPAddress":"192.0.2.0",
    "userAgent":"Boto3/1.33.13 md/Botocore#1.33.13 ua/2.0 os/linux#5.10.227-198.884.amzn2int.x86_64 md/arch#x86_64 lang/python#3.7.16 md/pyimpl#CPython cfg/retry-mode#legacy Botocore/1.33.13",
    "requestParameters": {
        "conversation": [
            {
                "utteranceId":"a1b2c3d4-5678-90ab-cdef-EXAMPLE11111",
                "utterance":"HIDDEN_DUE_TO_SECURITY_REASONS",
                "timestamp":"Feb 4, 2020, 7:46:29 AM",
                "utteranceType":"User"
            }
        ],
        "utteranceId":"a1b2c3d4-5678-90ab-cdef-EXAMPLE11111"
    },
    "responseElements": {
        "responseCode":"CHAT_RESPONSE",
        "conversationId":"1234567890abcdef0",
        "response": {
            "chat": {
                "body":"HIDDEN_DUE_TO_SECURITY_REASONS"
            }
        }
    },
    "requestID":"a1b2c3d4-5678-90ab-cdef-EXAMPLE11111",
    "eventID":"a1b2c3d4-5678-90ab-cdef-EXAMPLE11111",
    "readOnly":false,
    "eventType":"AwsApiCall",
    "managementEvent":true,
    "recipientAccountId":"123456789012",
    "eventCategory":"Management",
    "tlsDetails": {
        "tlsVersion":"TLSv1.2",
        "cipherSuite":"ECDHE-RSA-AES128-GCM-SHA256",
        "clientProvidedHostHeader":"gamma.us-west-2.data-science-assistant.sagemaker.aws.dev"
    }
}
```

# Data import
<a name="canvas-importing-data"></a>

Amazon SageMaker Canvas supports importing tabular, image, and document data. You can import datasets from your local machine, AWS services such as Amazon S3 and Amazon Redshift, and external data sources. When importing datasets from Amazon S3, you can bring a dataset of any size. Use the datasets that you import to build models and make predictions for other datasets.

Each use case for which you can build a custom model accepts different types of input. For example, if you want to build a single-label image classification model, then you should import image data. For more information about the different model types and the data they accept, see [How custom models work](canvas-build-model.md). You can import data and build custom models in SageMaker Canvas for the following data types:
+ **Tabular** (CSV, Parquet, or tables)
  + Categorical – Use categorical data to build custom categorical prediction models for 2 and 3+ category prediction.
  + Numeric – Use numeric data to build custom numeric prediction models.
  + Text – Use text data to build custom multi-category text prediction models.
  + Time series – Use time series data to build custom time series forecasting models.
+ **Image** (JPG or PNG) – Use image data to build custom single-label image prediction models.
+ **Document** (PDF, JPG, PNG, TIFF) – Document data is only supported for SageMaker Canvas Ready-to-use models. To learn more about Ready-to-use models that can make predictions for document data, see [Ready-to-use models](canvas-ready-to-use-models.md).

You can import data into Canvas from the following data sources:
+ Local files on your computer
+ Amazon S3 buckets
+ Amazon Redshift provisioned clusters (not Amazon Redshift Serverless)
+ AWS Glue Data Catalog through Amazon Athena
+ Amazon Aurora
+ Amazon Relational Database Service (Amazon RDS)
+ Salesforce Data Cloud
+ Snowflake
+ Databricks, SQLServer, MariaDB, and other popular databases through JDBC connectors
+ Over 40 external SaaS platforms, such as SAP OData

For a full list of data sources from which you can import, see the following table:


| Source | Type | Supported data types | 
| --- | --- | --- | 
| Local file upload | Local | Tabular, Image, Document | 
| Amazon Aurora | Amazon internal | Tabular | 
| Amazon S3 bucket | Amazon internal | Tabular, Image, Document | 
| Amazon RDS | Amazon internal | Tabular | 
| Amazon Redshift provisioned clusters (not Redshift Serverless) | Amazon internal | Tabular | 
| AWS Glue Data Catalog (through Amazon Athena) | Amazon internal | Tabular | 
| [Databricks](https://www.databricks.com/) | External | Tabular | 
| Snowflake | External | Tabular | 
| [Salesforce Data Cloud](https://www.salesforce.com/products/genie/overview/) | External | Tabular | 
| SQLServer | External | Tabular | 
| MySQL | External | Tabular | 
| PostgreSQL | External | Tabular | 
| MariaDB | External | Tabular | 
| [Amplitude](https://docs.aws.amazon.com/appflow/latest/userguide/amplitude.html) | External SaaS platform | Tabular | 
| [CircleCI](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-circleci.html) | External SaaS platform | Tabular | 
| [DocuSign Monitor](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-docusign-monitor.html) | External SaaS platform | Tabular | 
| [Domo](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-domo.html) | External SaaS platform | Tabular | 
| [Datadog](https://docs.aws.amazon.com/appflow/latest/userguide/datadog.html) | External SaaS platform | Tabular | 
| [Dynatrace](https://docs.aws.amazon.com/appflow/latest/userguide/dynatrace.html) | External SaaS platform | Tabular | 
| [Facebook Ads](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-facebook-ads.html) | External SaaS platform | Tabular | 
| [Facebook Page Insights](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-facebook-page-insights.html) | External SaaS platform | Tabular | 
| [Google Ads](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-google-ads.html) | External SaaS platform | Tabular | 
| [Google Analytics 4](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-google-analytics-4.html) | External SaaS platform | Tabular | 
| [Google Search Console](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-google-search-console.html) | External SaaS platform | Tabular | 
| [GitHub](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-github.html) | External SaaS platform | Tabular | 
| [GitLab](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-gitlab.html) | External SaaS platform | Tabular | 
| [Infor Nexus](https://docs.aws.amazon.com/appflow/latest/userguide/infor-nexus.html) | External SaaS platform | Tabular | 
| [Instagram Ads](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-instagram-ads.html) | External SaaS platform | Tabular | 
| [Jira Cloud](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-jira-cloud.html) | External SaaS platform | Tabular | 
| [LinkedIn Ads](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-linkedin-ads.html) | External SaaS platform | Tabular | 
| [Mailchimp](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-mailchimp.html) | External SaaS platform | Tabular | 
| [Marketo](https://docs.aws.amazon.com/appflow/latest/userguide/marketo.html) | External SaaS platform | Tabular | 
| [Microsoft Teams](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-microsoft-teams.html) | External SaaS platform | Tabular | 
| [Mixpanel](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-mixpanel.html) | External SaaS platform | Tabular | 
| [Okta](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-okta.html) | External SaaS platform | Tabular | 
| [Salesforce](https://docs.aws.amazon.com/appflow/latest/userguide/salesforce.html) | External SaaS platform | Tabular | 
| [Salesforce Marketing Cloud](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-salesforce-marketing-cloud.html) | External SaaS platform | Tabular | 
| [Salesforce Pardot](https://docs.aws.amazon.com/appflow/latest/userguide/pardot.html) | External SaaS platform | Tabular | 
| [SAP OData](https://docs.aws.amazon.com/appflow/latest/userguide/sapodata.html) | External SaaS platform | Tabular | 
| [SendGrid](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-sendgrid.html) | External SaaS platform | Tabular | 
| [ServiceNow](https://docs.aws.amazon.com/appflow/latest/userguide/servicenow.html) | External SaaS platform | Tabular | 
| [Singular](https://docs.aws.amazon.com/appflow/latest/userguide/singular.html) | External SaaS platform | Tabular | 
| [Slack](https://docs.aws.amazon.com/appflow/latest/userguide/slack.html) | External SaaS platform | Tabular | 
| [Stripe](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-stripe.html) | External SaaS platform | Tabular | 
| [Trend Micro](https://docs.aws.amazon.com/appflow/latest/userguide/trend-micro.html) | External SaaS platform | Tabular | 
| [Typeform](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-typeform.html) | External SaaS platform | Tabular | 
| [Veeva](https://docs.aws.amazon.com/appflow/latest/userguide/veeva.html) | External SaaS platform | Tabular | 
| [Zendesk](https://docs.aws.amazon.com/appflow/latest/userguide/zendesk.html) | External SaaS platform | Tabular | 
| [Zendesk Chat](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-zendesk-chat.html) | External SaaS platform | Tabular | 
| [Zendesk Sell](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-zendesk-sell.html) | External SaaS platform | Tabular | 
| [Zendesk Sunshine](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-zendesk-sunshine.html) | External SaaS platform | Tabular | 
| [Zoom Meetings](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-zoom.html) | External SaaS platform | Tabular | 

For instructions on how to import data, and for information about input data requirements such as the maximum file size for images, see [Create a dataset](canvas-import-dataset.md).

Canvas also provides several sample datasets in your application to help you get started. To learn more about the SageMaker AI-provided sample datasets you can experiment with, see [Use sample datasets](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-sample-datasets.html).

After you import a dataset into Canvas, you can update the dataset at any time. You can do a manual update or you can set up a schedule for automatic dataset updates. For more information, see [Update a dataset](canvas-update-dataset.md).

For more information specific to each dataset type, see the following sections:

**Tabular**

To import data from an external data source (such as a Snowflake database or a SaaS platform), you must authenticate and connect to the data source in the Canvas application. For more information, see [Connect to data sources](canvas-connecting-external.md).

If you want to import datasets larger than 5 GB from Amazon S3 into Canvas, you can achieve faster sampling by using Amazon Athena to query and sample the data from Amazon S3.

After creating datasets in Canvas, you can prepare and transform your data using the data preparation functionality of Data Wrangler. You can use Data Wrangler to handle missing values, transform your features, join multiple datasets into a single dataset, and more. For more information, see [Data preparation](canvas-data-prep.md).

**Tip**  
As long as your data is arranged into tables, you can join datasets from various sources, such as Amazon Redshift, Amazon Athena, or Snowflake.

**Image**

For information about how to edit an image dataset and perform tasks such as assigning or reassigning labels, adding images, or deleting images, see [Edit an image dataset](canvas-edit-image.md).

# Create a dataset
<a name="canvas-import-dataset"></a>

**Note**  
If you're importing datasets larger than 5 GB into Amazon SageMaker Canvas, we recommend that you use the [Data Wrangler feature](canvas-data-prep.md) in Canvas to create a data flow. Data Wrangler supports advanced data preparation features such as [joining](canvas-transform.md#canvas-transform-join) and [concatenating](canvas-transform.md#canvas-transform-concatenate) data. After you create a data flow, you can export your data flow as a Canvas dataset and begin building a model. For more information, see [Export to create a model](canvas-processing-export-model.md).

The following sections describe how to create a dataset in Amazon SageMaker Canvas. For custom models, you can create datasets for tabular and image data. For Ready-to-use models, you can use tabular and image datasets as well as document datasets. Choose your workflow based on the following information:
+ For categorical, numeric, text, and time series data, see [Import tabular data](#canvas-import-dataset-tabular).
+ For image data, see [Import image data](#canvas-import-dataset-image).
+ For document data, see [Import document data](#canvas-ready-to-use-import-document).

A dataset can consist of multiple files. For example, you might have multiple files of inventory data in CSV format. You can upload these files together as a dataset as long as the schema (or column names and data types) of the files match.
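Because every file in a multi-file dataset must share the same schema, you can save a failed import by verifying the headers locally first. The following is a minimal sketch in plain Python (the helper name and file paths are illustrative, not part of Canvas) that checks that a set of CSV files all share the same column names:

```python
import csv

def csv_headers_match(paths):
    """Return True if every CSV file in `paths` has the same header row."""
    headers = []
    for path in paths:
        with open(path, newline="") as f:
            headers.append(next(csv.reader(f)))
    return all(h == headers[0] for h in headers)
```

For example, running this over your weekly inventory files before uploading them tells you immediately whether one file uses a different column name.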

Canvas also supports managing multiple versions of your dataset. When you create a dataset, the first version is labeled as `V1`. You can create a new version of your dataset by updating your dataset. You can do a manual update, or you can set up an automated schedule for updating your dataset with new data. For more information, see [Update a dataset](canvas-update-dataset.md).

When you import your data into Canvas, make sure that it meets the requirements in the following table. The limitations are specific to the type of model you’re building.


| Limit | 2 category, 3+ category, numeric, and time series models | Text prediction models | Image prediction models | Document data for Ready-to-use models\* | 
| --- | --- | --- | --- | --- | 
| Supported file types |  CSV and Parquet (local upload, Amazon S3, or databases); JSON (databases)  |  CSV and Parquet (local upload, Amazon S3, or databases); JSON (databases)  | JPG, PNG | PDF, JPG, PNG, TIFF | 
| Maximum file size |  Local upload: 5 GB; data sources: PBs  |  Local upload: 5 GB; data sources: PBs  | 30 MB per image | 5 MB per document | 
| Maximum number of files you can upload at a time | 30 | 30 | N/A | N/A | 
| Maximum number of columns | 1,000 | 1,000 | N/A | N/A | 
| Maximum number of entries (rows, images, or documents) for **Quick builds** | N/A | 7,500 rows | 5,000 images | N/A | 
| Maximum number of entries (rows, images, or documents) for **Standard builds** | N/A | 150,000 rows | 180,000 images | N/A | 
| Minimum number of entries (rows) for **Quick builds** |  2 category: 500 rows; 3+ category, numeric, time series: N/A  | N/A | N/A | N/A | 
| Minimum number of entries (rows, images, or documents) for **Standard builds** | 250 rows | 50 rows | 50 images | N/A | 
|  Minimum number of entries (rows or images) per label | N/A | 25 rows | 25 images | N/A | 
| Minimum number of labels |  2 category: 2; 3+ category: 3; numeric, time series: N/A  | 2 | 2 | N/A | 
|  Minimum sample size for random sampling | 500 | N/A | N/A | N/A | 
|  Maximum sample size for random sampling | 200,000 | N/A | N/A | N/A | 
| Maximum number of labels |  2 category: 2; 3+ category, numeric, time series: N/A  | 1,000 | 1,000 | N/A | 

\*Document data is currently only supported for [Ready-to-use models](canvas-ready-to-use-models.md) that accept document data. You can't build a custom model with document data.

Also note the following restrictions:
+ When importing data from an Amazon S3 bucket, make sure that your Amazon S3 bucket name doesn't contain a `.`. If your bucket name contains a `.`, you might experience errors when trying to import data into Canvas.
+ For tabular data, Canvas disallows selecting any file with extensions other than .csv, .parquet, .parq, and .pqt for both local upload and Amazon S3 import. CSV files can use any common or custom delimiter, and they must not have newline characters except when denoting a new row.
+ For tabular data using Parquet files, note the following:
  + Parquet files can't include complex types like maps and lists.
  + The column names of Parquet files can't contain spaces.
  + If using compression, Parquet files must use either gzip or snappy compression types. For more information about the preceding compression types, see the [gzip documentation](https://www.gzip.org/) and the [snappy documentation](https://github.com/google/snappy).
+ For image data, if you have any unlabeled images, you must label them before building your model. For information about how to assign labels to images within the Canvas application, see [Edit an image dataset](canvas-edit-image.md).
+ If you set up automatic dataset updates or automatic batch prediction configurations, you can only create a total of 20 configurations in your Canvas application. For more information, see [How to manage automations](canvas-manage-automations.md).
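Two of the restrictions above (no `.` in the Amazon S3 bucket name, no spaces in Parquet column names) are easy to check programmatically before you upload. A minimal sketch in plain Python; the function names are illustrative, not part of any Canvas API:

```python
def bucket_name_ok(bucket: str) -> bool:
    """Canvas imports can fail for Amazon S3 bucket names containing '.'."""
    return "." not in bucket

def sanitize_parquet_columns(columns):
    """Replace spaces in column names; Parquet files imported into
    Canvas can't have spaces in column names."""
    return [c.replace(" ", "_") for c in columns]
```

You could run `sanitize_parquet_columns` over a table's column list before writing it out as Parquet with your tool of choice.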

After you import a dataset, you can view your datasets on the **Datasets** page at any time.

## Import tabular data
<a name="canvas-import-dataset-tabular"></a>

With tabular datasets, you can build categorical, numeric, time series forecasting, and text prediction models. Review the limitations table in the preceding **Create a dataset** section to ensure that your data meets the requirements for tabular data.

Use the following procedure to import a tabular dataset into Canvas:

1. Open your SageMaker Canvas application.

1. In the left navigation pane, choose **Datasets**.

1. Choose **Import data**.

1. From the dropdown menu, choose **Tabular**.

1. In the popup dialog box, in the **Dataset name** field, enter a name for the dataset and choose **Create**.

1. On the **Create tabular dataset** page, open the **Data Source** dropdown menu.

1. Choose your data source:
   + To upload files from your computer, choose **Local upload**.
   + To import data from another source, such as an Amazon S3 bucket or a Snowflake database, search for your data source in the search bar. Then, choose the tile for your desired data source.
**Note**  
You can only import data from the tiles that have an active connection. If you want to connect to a data source that is unavailable to you, contact your administrator. If you’re an administrator, see [Connect to data sources](canvas-connecting-external.md).

   The following screenshot shows the **Data Source** dropdown menu.  
![\[Screenshot showing the Data Source dropdown menu and a search for a data source in the search bar.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/import-data-choose-source.png)

1. (Optional) If you’re connecting to an Amazon Redshift or Snowflake database for the first time, a dialog box appears to create a connection. Fill out the dialog box with your credentials and choose **Create connection**. If you already have a connection, choose your connection.

1. From your data source, select the files to import. For local upload and Amazon S3, you select files directly. For Amazon S3 only, you can also enter the S3 URI, alias, or ARN of your bucket or S3 access point in the **Input S3 endpoint** field, and then choose the files to import. For database sources, you can drag and drop data tables from the left navigation pane.

1. (Optional) For tabular data sources that support SQL querying (such as Amazon Redshift, Amazon Athena, or Snowflake), you can choose **Edit in SQL** to run SQL queries on your data before importing it.

   The following screenshot shows the **Edit SQL** view for an Amazon Athena data source.  
![\[Screenshot showing a SQL query in the Edit SQL view for Amazon Athena data.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/import-data-edit-sql.png)

1. Choose **Preview dataset** to preview your data before importing it.

1. In the **Import settings**, enter a **Dataset name** or use the default dataset name.

1. (Optional) For data that you import from Amazon S3, you are shown the **Advanced** settings and can fill out the following fields:

   1. Toggle the **Use first row as header** option on if you want to use the first row of your dataset as the column names. If you selected multiple files, this applies to each file.

   1. If you're importing a CSV file, for the **File encoding (CSV)** dropdown, select your dataset file’s encoding. `UTF-8` is the default.

   1. For the **Delimiter** dropdown, select the delimiter that separates each cell in your data. The default delimiter is `,`. You can also specify a custom delimiter.

   1. Select **Multi-line detection** if you want Canvas to parse your entire dataset for multi-line cells. By default, this option is not selected, and Canvas determines whether to use multi-line support by taking a sample of your data. However, Canvas might not detect any multi-line cells in the sample. If your data has multi-line cells, we recommend that you select **Multi-line detection** to force Canvas to check your entire dataset.

1. When you’re ready to import your data, choose **Create dataset**.
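The **Multi-line detection** option in the steps above concerns cells that contain embedded newlines. In CSV, a multi-line cell is simply a quoted field with a newline inside it, which is why sampling can miss rare occurrences. This stdlib Python sketch shows what one looks like on disk:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "notes"])
writer.writerow([1, "first line\nsecond line"])  # a multi-line cell

raw = buf.getvalue()
# The embedded newline is inside a quoted field, so the physical file
# has three line breaks but only two logical rows:
rows = list(csv.reader(io.StringIO(raw)))
print(len(rows))    # 2 logical rows
print(rows[1][1])   # the cell keeps its internal newline
```

A parser that splits naively on newlines would miscount the rows here, which is the failure mode that multi-line detection guards against.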

While your dataset is importing into Canvas, you can see your datasets listed on the **Datasets** page. From this page, you can [View your dataset details](#canvas-view-dataset-details).

When the **Status** of your dataset shows as `Ready`, Canvas successfully imported your data and you can proceed with [building a model](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-build-model.html).

If you have a connection to a data source, such as an Amazon Redshift database or a SaaS connector, you can return to that connection. For Amazon Redshift and Snowflake, you can add another connection by creating another dataset, returning to the **Import data** page, and choosing the **Data Source** tile for that connection. From the dropdown menu, you can open the previous connection or choose **Add connection**.

**Note**  
For SaaS platforms, you can only have one connection per data source.

## Import image data
<a name="canvas-import-dataset-image"></a>

With image datasets, you can build single-label image prediction custom models, which predict a label for an image. Review the limitations in the preceding **Create a dataset** section to ensure that your image dataset meets the requirements for image data.

**Note**  
You can only import image datasets from local file upload or an Amazon S3 bucket. Also, for image datasets, you must have at least 25 images per label.

Use the following procedure to import an image dataset into Canvas:

1. Open your SageMaker Canvas application.

1. In the left navigation pane, choose **Datasets**.

1. Choose **Import data**.

1. From the dropdown menu, choose **Image**.

1. In the popup dialog box, in the **Dataset name** field, enter a name for the dataset and choose **Create**.

1. On the **Import** page, open the **Data Source** dropdown menu.

1. Choose your data source. To upload files from your computer, choose **Local upload**. To import files from Amazon S3, choose **Amazon S3**.

1. From your computer or Amazon S3 bucket, select the images or folders of images that you want to upload.

1. When you’re ready to import your data, choose **Import data**.

While your dataset is importing into Canvas, you can see your datasets listed on the **Datasets** page. From this page, you can [View your dataset details](#canvas-view-dataset-details).

When the **Status** of your dataset shows as `Ready`, Canvas successfully imported your data and you can proceed with [building a model](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-build-model.html).

While you are building your model, you can edit your image dataset to assign or reassign labels, add images, or delete images from your dataset. For more information about how to edit your image dataset, see [Edit an image dataset](canvas-edit-image.md).

## Import document data
<a name="canvas-ready-to-use-import-document"></a>

Document data is supported only by the Ready-to-use models for expense analysis, identity document analysis, document analysis, and document queries. You can’t build a custom model with document data.

With document datasets, you can generate predictions with those Ready-to-use models. Review the limitations table in the [Create a dataset](#canvas-import-dataset) section to ensure that your document dataset meets the requirements for document data.

**Note**  
You can only import document datasets from local file upload or an Amazon S3 bucket.

Use the following procedure to import a document dataset into Canvas:

1. Open your SageMaker Canvas application.

1. In the left navigation pane, choose **Datasets**.

1. Choose **Import data**.

1. From the dropdown menu, choose **Document**.

1. In the popup dialog box, in the **Dataset name** field, enter a name for the dataset and choose **Create**.

1. On the **Import** page, open the **Data Source** dropdown menu.

1. Choose your data source. To upload files from your computer, choose **Local upload**. To import files from Amazon S3, choose **Amazon S3**.

1. From your computer or Amazon S3 bucket, select the document files that you want to upload.

1. When you’re ready to import your data, choose **Import data**.

While your dataset is importing into Canvas, you can see your datasets listed on the **Datasets** page. From this page, you can [View your dataset details](#canvas-view-dataset-details).

When the **Status** of your dataset shows as `Ready`, Canvas has successfully imported your data.

On the **Datasets** page, you can choose your dataset to preview it, which shows you up to the first 100 documents of your dataset.

## View your dataset details
<a name="canvas-view-dataset-details"></a>



For each of your datasets, you can view all of the files in the dataset, the dataset’s version history, and any auto update configurations for the dataset. From the **Datasets** page, you can also initiate actions such as [updating a dataset](canvas-update-dataset.md) or [building a model](canvas-build-model.md).

To view the details for a dataset, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **Datasets**.

1. From the list of datasets, choose your dataset.

On the **Data** tab, you can see a preview of your data. If you choose **Dataset details**, you can see all of the files that are part of your dataset. Choose a file to see only the data from that file in the preview. For image datasets, the preview only shows you the first 100 images of your dataset.

On the **Version history** tab, you can see a list of all of the versions of your dataset. A new version is made whenever you update a dataset. To learn more about updating a dataset, see [Update a dataset](canvas-update-dataset.md). The following screenshot shows the **Version history** tab in the Canvas application.

![\[Screenshot of the Version history tab for a dataset, with a list of dataset versions.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-version-history.png)


On the **Auto updates** tab, you can enable auto updates for the dataset and set up a configuration to update your dataset on a regular schedule. To learn more about setting up auto updates for a dataset, see [Configure automatic updates for a dataset](canvas-update-dataset-auto.md). The following screenshot shows the **Auto updates** tab with auto updates turned on and a list of auto update jobs that have been performed on the dataset.

![\[The Auto updates tab for dataset showing the auto updates turned on and a list of auto update jobs.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-auto-updates.png)


# Update a dataset
<a name="canvas-update-dataset"></a>

After importing your initial dataset into Amazon SageMaker Canvas, you might have additional data that you want to add to your dataset. For example, you might get inventory data at the end of every week that you want to add to your dataset. Instead of importing your data multiple times, you can update your existing dataset and add or remove files from it.

**Note**  
You can only update datasets that you have imported through local upload or Amazon S3.

You can update your dataset either manually or automatically. For more information about automatic dataset updates, see [Configure automatic updates for a dataset](canvas-update-dataset-auto.md).

Every time you update your dataset, Canvas creates a new version of your dataset. You can only use the latest version of your dataset to build a model or generate predictions. For more information about viewing the version history of your dataset, see [View your dataset details](canvas-import-dataset.md#canvas-view-dataset-details).

You can also use dataset updates with automated batch predictions, which starts a batch prediction job whenever you update your dataset. For more information, see [Batch predictions in SageMaker Canvas](canvas-make-predictions-batch.md).

The following section describes how to do manual updates to your dataset.

## Manually update a dataset
<a name="canvas-update-dataset-manual"></a>

To do a manual update, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **Datasets**.

1. From the list of datasets, choose the dataset you want to update.

1. Choose the **Update dataset** dropdown menu and choose **Manual update**. You are taken to the import data workflow.

1. From the **Data source** dropdown menu, choose either **Local upload** or **Amazon S3**.

1. The page shows you a preview of your data. From here, you can add or remove files from the dataset. If you’re importing tabular data, the schema of the new files (column names and data types) must match the schema of the existing files. Additionally, your new files must not exceed the maximum dataset size or file size. For more information about these limitations, see [Create a dataset](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-import-dataset.html).
**Note**  
If you add a file with the same name as an existing file in your dataset, the new file overwrites the old version of the file.

1. When you’re ready to save your changes, choose **Update dataset**.

You should now have a new version of your dataset.

On the **Datasets** page, you can choose the **Version history** tab to see all of the versions of your dataset and the history of both manual and automatic updates you’ve made.

# Configure automatic updates for a dataset
<a name="canvas-update-dataset-auto"></a>

After importing your initial dataset into Amazon SageMaker Canvas, you might have additional data that you want to add to your dataset. For example, you might get inventory data at the end of every week that you want to add to your dataset. Instead of importing your data multiple times, you can update your existing dataset and add or remove files from it.

**Note**  
You can only update datasets that you have imported through local upload or Amazon S3.

With automatic dataset updates, you specify a location where Canvas checks for files at a frequency you specify. If you import new files during the update, the schema of the files must match the existing dataset exactly.

Every time you update your dataset, Canvas creates a new version of your dataset. You can only use the latest version of your dataset to build a model or generate predictions. For more information about viewing the version history of your dataset, see [View your dataset details](canvas-import-dataset.md#canvas-view-dataset-details).

You can also use dataset updates with automated batch predictions, which starts a batch prediction job whenever you update your dataset. For more information, see [Batch predictions in SageMaker Canvas](canvas-make-predictions-batch.md).

The following section describes how to do automatic updates to your dataset.

An automatic update is when you set up a configuration for Canvas to update your dataset at a given frequency. We recommend that you use this option if you regularly receive new files of data that you want to add to your dataset.

When you set up the auto update configuration, you specify an Amazon S3 location where you upload your files and a frequency at which Canvas checks the location and imports files. Each instance of Canvas updating your dataset is referred to as a *job*. For each job, Canvas imports all of the files in the Amazon S3 location. If you have new files with the same names as existing files in your dataset, Canvas overwrites the old files with the new files.

For automatic dataset updates, Canvas doesn’t perform schema validation. If the schema of the files imported during an automatic update doesn’t match the schema of the existing files, or the files exceed the size limitations (see [Create a dataset](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-import-dataset.html) for a table of file size limits), then you get errors when your jobs run.
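Because Canvas skips schema validation for automatic updates, it can help to validate files yourself before placing them in the watched Amazon S3 location. A minimal local sketch in plain Python (the helper name and file names are illustrative):

```python
import csv

def validate_against_schema(path, expected_header):
    """Check that a CSV file's header matches the existing dataset's
    column names before uploading it to the auto-update S3 location."""
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    return header == expected_header
```

Running a check like this in the pipeline that produces your weekly files catches a renamed column before it fails an auto update job.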

**Note**  
You can only set up a maximum of 20 automatic configurations in your Canvas application. Additionally, Canvas only does automatic updates while you’re logged in to your Canvas application. If you log out of your Canvas application, automatic updates pause until you log back in.

To configure automatic updates for your dataset, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **Datasets**.

1. From the list of datasets, choose the dataset you want to update.

1. Choose the **Update dataset** dropdown menu and choose **Automatic update**. You are taken to the **Auto updates** tab for the dataset.

1. Turn on the **Auto update enabled** toggle.

1. For **Specify a data source**, enter the Amazon S3 path to a folder where you plan to regularly upload files.

1. For **Choose a frequency**, select **Hourly**, **Weekly**, or **Daily**.

1. For **Specify a starting time**, use the calendar and time picker to select when you want the first auto update job to start.

1. When you’re ready to create the auto update configuration, choose **Save**.

Canvas begins the first job of your auto update cadence at the specified starting time.

# View your automatic dataset update jobs
<a name="canvas-update-dataset-auto-view"></a>

To view the job history for your automatic dataset updates in Amazon SageMaker Canvas, on your dataset details page, choose the **Auto updates** tab.

Each automatic update to a dataset shows as a job in the **Auto updates** tab under the **Job history** section. For each job, you can see the following:
+ **Job created** – The timestamp for when Canvas started updating the dataset.
+ **Files** – The number of files in the dataset.
+ **Cells (Columns x Rows)** – The number of columns and rows in the dataset.
+ **Status** – The status of the dataset after the update. If the job was successful, the status is **Ready**. If the job failed for any reason, the status is **Failed**, and you can hover over the status for more details.

# Edit your automatic dataset update configuration
<a name="canvas-update-dataset-auto-edit"></a>

You might want to make changes to your auto update configuration for a dataset, such as changing the frequency of the updates. You might also want to turn off your automatic update configuration to pause the updates to your dataset.

To make changes to your auto update configuration for a dataset, go to the **Auto updates** tab of your dataset and choose **Edit** to make changes to the configuration.

To pause your dataset updates, turn off your automatic configuration. You can turn off auto updates by going to the **Auto updates** tab of your dataset and turning the **Enable auto updates** toggle off. You can turn this toggle back on at any time to resume the update schedule.

To learn how to delete your configuration, see [Delete an automatic configuration](canvas-manage-automations-delete.md).

# Connect to data sources
<a name="canvas-connecting-external"></a>

In Amazon SageMaker Canvas, you can import data from a location outside of your local file system through an AWS service, a SaaS platform, or other databases using JDBC connectors. For example, you might want to import tables from a data warehouse in Amazon Redshift, or you might want to import Google Analytics data.

When you go through the **Import** workflow to import data in the Canvas application, you can choose your data source and then select the data that you want to import. For certain data sources, like Snowflake and Amazon Redshift, you must specify your credentials and add a connection to the data source.

The following screenshot shows the data sources toolbar in the **Import** workflow, with all of the available data sources highlighted. You can only import data from the data sources that are available to you. Contact your administrator if your desired data source isn’t available.

![\[The Data Source dropdown menu on the Import data page in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/data-sources.png)


The following sections provide information about establishing connections to external data sources and importing data from them. Review the following section first to determine what permissions you need to import data from your data source.

## Permissions
<a name="canvas-connecting-external-permissions"></a>

Review the following information to ensure that you have the necessary permissions to import data from your data source:
+ **Amazon S3:** You can import data from any Amazon S3 bucket as long as your user has permissions to access the bucket. For more information about using AWS IAM to control access to Amazon S3 buckets, see [Identity and access management in Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-access-control.html) in the *Amazon S3 User Guide*.
+ **Amazon Athena:** If you have the [AmazonSageMakerFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html) policy and the [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasFullAccess.html) policy attached to your user’s execution role, then you can query your AWS Glue Data Catalog with Amazon Athena. If you’re part of an Athena workgroup, make sure that the Canvas user has permissions to run Athena queries on the data. For more information, see [Using workgroups for running queries](https://docs.aws.amazon.com/athena/latest/ug/workgroups.html) in the *Amazon Athena User Guide*.
+ **Amazon DocumentDB:** You can import data from any Amazon DocumentDB database as long as you have the credentials (username and password) to connect to the database and have the minimum base Canvas permissions attached to your user’s execution role. For more information about Canvas permissions, see the [Prerequisites for setting up Amazon SageMaker Canvas](canvas-getting-started.md#canvas-prerequisites).
+ **Amazon Redshift:** To give yourself the necessary permissions to import data from Amazon Redshift, see [Grant Users Permissions to Import Amazon Redshift Data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-redshift-permissions.html).
+ **Amazon RDS:** If you have the [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasFullAccess.html) policy attached to your user’s execution role, then you’ll be able to access your Amazon RDS databases from Canvas.
+ **SaaS platforms:** If you have the [AmazonSageMakerFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html) policy and the [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasFullAccess.html) policy attached to your user’s execution role, then you have the necessary permissions to import data from SaaS platforms. See [Use SaaS connectors with Canvas](#canvas-connecting-external-appflow) for more information about connecting to a specific SaaS connector.
+ **JDBC connectors:** For database sources such as Databricks, MySQL, or MariaDB, you must enable username and password authentication on the source database before attempting to connect from Canvas. If you’re connecting to a Databricks database, you must have the JDBC URL that contains the necessary credentials.

## Connect to a database stored in AWS
<a name="canvas-connecting-internal-database"></a>

You might want to import data that you’ve stored in AWS. You can import data from Amazon S3, use Amazon Athena to query a database in the AWS Glue Data Catalog, import data from [Amazon RDS](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Welcome.html), or make a connection to a provisioned Amazon Redshift database (not Redshift Serverless).

You can create multiple connections to Amazon Redshift. For Amazon Athena, you can access any databases that you have in your [AWS Glue Data Catalog](https://docs.aws.amazon.com/prescriptive-guidance/latest/serverless-etl-aws-glue/aws-glue-data-catalog.html). For Amazon S3, you can import data from a bucket as long as you have the necessary permissions.

Review the following sections for more detailed information.

### Connect to data in Amazon S3, Amazon Athena, or Amazon RDS
<a name="canvas-connecting-internal-database-s3-athena"></a>

For Amazon S3, you can import data from an Amazon S3 bucket as long as you have permissions to access the bucket.

For Amazon Athena, you can access databases in your AWS Glue Data Catalog as long as you have permissions through your [Amazon Athena workgroup](https://docs.aws.amazon.com/athena/latest/ug/manage-queries-control-costs-with-workgroups.html).

For Amazon RDS, if you have the [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasFullAccess.html) policy attached to your user’s role, then you’ll be able to import data from your Amazon RDS databases into Canvas.

To import data from an Amazon S3 bucket, or to run queries and import data tables with Amazon Athena, see [Create a dataset](canvas-import-dataset.md). You can only import tabular data from Amazon Athena, and you can import tabular and image data from Amazon S3.

### Connect to an Amazon DocumentDB database
<a name="canvas-connecting-docdb"></a>

Amazon DocumentDB is a fully managed, serverless, document database service. You can import unstructured document data stored in an Amazon DocumentDB database into SageMaker Canvas as a tabular dataset, and then you can build machine learning models with the data.

**Important**  
Your SageMaker AI domain must be configured in **VPC only** mode to add connections to Amazon DocumentDB. You can only access Amazon DocumentDB clusters in the same Amazon VPC as your Canvas application. Additionally, Canvas can only connect to TLS-enabled Amazon DocumentDB clusters. For more information about how to set up Canvas in **VPC only** mode, see [Configure Amazon SageMaker Canvas in a VPC without internet access](canvas-vpc.md).

To import data from Amazon DocumentDB databases, you must have credentials to access the Amazon DocumentDB database and specify the username and password when creating a database connection. You can configure more granular permissions and restrict access by modifying the Amazon DocumentDB user permissions. To learn more about access control in Amazon DocumentDB, see [Database Access Using Role-Based Access Control](https://docs.aws.amazon.com/documentdb/latest/developerguide/role_based_access_control.html) in the *Amazon DocumentDB Developer Guide*.

When you import from Amazon DocumentDB, Canvas converts your unstructured data into a tabular dataset by mapping the fields to columns in a table. Additional tables are created for each complex field (or nested structure) in the data, where the columns correspond to the sub-fields of the complex field. For more detailed information about this process and examples of schema conversion, see the [Amazon DocumentDB JDBC Driver Schema Discovery](https://github.com/aws/amazon-documentdb-jdbc-driver/blob/develop/src/markdown/schema/schema-discovery.md) GitHub page.
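
To make that flattening concrete, the following is a simplified sketch (not the JDBC driver's actual algorithm) of how one nested document could map to a main table plus a child table per complex field:

```python
# Simplified illustration of mapping a nested document to tables.
# This mirrors the idea described above (one child table per nested structure),
# not the actual Amazon DocumentDB JDBC driver implementation.

def flatten_document(doc, table="main", doc_id=None, tables=None):
    """Split a nested dict into per-table rows; nested dicts become child tables."""
    tables = {} if tables is None else tables
    row = {"_id": doc_id}
    for key, value in doc.items():
        if isinstance(value, dict):
            # Complex field: emit a child table whose columns are its sub-fields
            flatten_document(value, table=f"{table}_{key}", doc_id=doc_id, tables=tables)
        else:
            row[key] = value
    tables.setdefault(table, []).append(row)
    return tables

doc = {"name": "Ada", "address": {"city": "Tacoma", "zip": "98402"}}
tables = flatten_document(doc, doc_id=1)
# tables["main"] holds the scalar fields; tables["main_address"] holds city/zip
```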

Canvas can only make a connection to a single database in Amazon DocumentDB. To import data from a different database, you must create a new connection.

You can import data from Amazon DocumentDB into Canvas by using the following methods:
+ [Create a dataset](canvas-import-dataset.md). You can import your Amazon DocumentDB data and create a tabular dataset in Canvas. If you choose this method, make sure that you follow the [Import tabular data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-import-dataset.html#canvas-import-dataset-tabular) procedure.
+ [Create a data flow](canvas-data-flow.md). You can create a data preparation pipeline in Canvas and add your Amazon DocumentDB database as a data source.

To proceed with importing your data, follow the procedure for one of the methods linked in the preceding list.

When you reach the step in either workflow to choose a data source (Step 6 for creating a dataset, or Step 8 for creating a data flow), do the following:

1. For **Data Source**, open the dropdown menu and choose **DocumentDB**.

1. Choose **Add connection**.

1. In the dialog box, specify your Amazon DocumentDB credentials:

   1. Enter a **Connection name**. This is a name used by Canvas to identify this connection.

   1. For **Cluster**, select the cluster in Amazon DocumentDB that stores your data. Canvas automatically populates the dropdown menu with Amazon DocumentDB clusters in the same VPC as your Canvas application.

   1. Enter the **Username** for your Amazon DocumentDB cluster.

   1. Enter the **Password** for your Amazon DocumentDB cluster.

   1. Enter the name of the **Database** to which you want to connect.

   1. The **Read preference** option determines which types of instances on your cluster Canvas reads the data from. Select one of the following:
      + **Secondary preferred** – Canvas defaults to reading from the cluster’s secondary instances, but if a secondary instance isn’t available, then Canvas reads from a primary instance.
      + **Secondary** – Canvas only reads from the cluster’s secondary instances, which prevents the read operations from interfering with the cluster’s regular read and write operations.

   1. Choose **Add connection**. The following image shows the dialog box with the preceding fields for an Amazon DocumentDB connection.  
![\[Screenshot of the Add a new DocumentDB connection dialog box in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/add-docdb-connection.png)

You should now have an Amazon DocumentDB connection, and you can use your Amazon DocumentDB data in Canvas to create either a dataset or a data flow.

### Connect to an Amazon Redshift database
<a name="canvas-connecting-redshift"></a>

You can import data from Amazon Redshift, a data warehouse where your organization keeps its data. Before you can import data from Amazon Redshift, the AWS IAM role you use must have the `AmazonRedshiftFullAccess` managed policy attached. For instructions on how to attach this policy, see [Grant Users Permissions to Import Amazon Redshift Data](canvas-redshift-permissions.md). 

To import data from Amazon Redshift, you do the following:

1. Create a connection to an Amazon Redshift database.

1. Choose the data that you're importing.

1. Import the data.

You can use the Amazon Redshift editor to drag datasets onto the import pane and import them into SageMaker Canvas. For more control over the values returned in the dataset, you can use the following:
+ SQL queries
+ Joins

With SQL queries, you can customize how you import the values in the dataset. For example, you can specify the columns returned in the dataset or the range of values for a column.

You can use joins to combine multiple datasets from Amazon Redshift into a single dataset. You can drag your datasets from Amazon Redshift into the panel that gives you the ability to join the datasets.

You can use the SQL editor to edit the joined dataset and convert it into a single node, and then join another dataset to that node. When you're finished, you can import the data that you've selected into SageMaker Canvas.

Use the following procedure to import data from Amazon Redshift.

1. In the SageMaker Canvas application, go to the **Datasets** page.

1. Choose **Import data**, and from the dropdown menu, choose **Tabular**.

1. Enter a name for the dataset and choose **Create**.

1. For **Data Source**, open the dropdown menu and choose **Redshift**.

1. Choose **Add connection**.

1. In the dialog box, specify your Amazon Redshift credentials:

   1. For **Authentication method**, choose **IAM**.

   1. Enter the **Cluster identifier** of the cluster that you want to connect to. Enter only the cluster identifier, not the full endpoint of the Amazon Redshift cluster.

   1. Enter the **Database name** of the database to which you want to connect.

   1. Enter the **Database user** name that you want to use to connect to the database.

   1. For **ARN**, enter the IAM role ARN of the role that the Amazon Redshift cluster should assume to move and write data to Amazon S3. For more information about this role, see [Authorizing Amazon Redshift to access other AWS services on your behalf](https://docs.aws.amazon.com/redshift/latest/mgmt/authorizing-redshift-service.html) in the *Amazon Redshift Management Guide*.

   1. Enter a **Connection name**. This is a name used by Canvas to identify this connection.

1. From the tab that has the name of your connection, drag the table that you're importing to the **Drag and drop table to import** pane.

1. Optional: Drag additional tables to the import pane. You can use the GUI to join the tables. For more specificity in your joins, choose **Edit in SQL**.

1. Optional: If you're using SQL to query the data, you can choose **Context** to add context to the connection by specifying values for the following:
   + **Warehouse**
   + **Database**
   + **Schema**

1. Choose **Import data**.
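
The connection dialog expects the cluster identifier, not the full endpoint. If you only have the endpoint on hand, the identifier is its first DNS label; a small helper for illustration (the endpoint below is made up):

```python
# Extract the cluster identifier from a full Amazon Redshift endpoint.
# The connection dialog wants only the identifier, not the endpoint;
# the endpoint here is a made-up example.

def cluster_identifier(endpoint):
    """Return the leading DNS label, stripping any :port suffix."""
    host = endpoint.split(":", 1)[0]
    return host.split(".", 1)[0]

endpoint = "examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com:5439"
print(cluster_identifier(endpoint))  # -> examplecluster
```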

The following image shows an example of fields specified for an Amazon Redshift connection.

![\[Screenshot of the Add a new Redshift connection dialog box in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-redshift-add-connection.png)


The following image shows the page used to join datasets in Amazon Redshift.

![\[Screenshot of the Import page in Canvas, showing two datasets being joined.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-redshift-join.png)


The following image shows an SQL query being used to edit a join in Amazon Redshift.

![\[Screenshot of a SQL query in the Edit SQL editor on the Import page in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-redshift-edit-sql.png)


## Connect to your data with JDBC connectors
<a name="canvas-connecting-jdbc"></a>

With JDBC, you can connect to your databases from sources such as Databricks, SQL Server, MySQL, PostgreSQL, MariaDB, Amazon RDS, and Amazon Aurora.

You must make sure that you have the necessary credentials and permissions to create the connection from Canvas.
+ For Databricks, you must provide a JDBC URL. The URL formatting can vary between Databricks instances. For information about finding the URL and specifying the parameters within it, see [JDBC configuration and connection parameters](https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html#jdbc-configuration-and-connection-parameters) in the Databricks documentation. The following is an example of how a URL can be formatted: `jdbc:spark://aws-sagemaker-datawrangler.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/3122619508517275/0909-200301-cut318;AuthMech=3;UID=token;PWD=personal-access-token`
+ For other database sources, you must set up username and password authentication, and then specify those credentials when connecting to the database from Canvas. 

Additionally, your data source must either be accessible through the public internet, or if your Canvas application is running in **VPC only** mode, then the data source must run in the same VPC. For more information about configuring an Amazon RDS database in a VPC, see [Amazon VPC VPCs and Amazon RDS](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_VPC.html) in the *Amazon RDS User Guide*.

After you’ve configured your data source credentials, you can sign in to the Canvas application and create a connection to the data source. Specify your credentials (or, for Databricks, the URL) when creating the connection.
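
Because a Databricks JDBC URL packs transport, path, and credential parameters into one string, assembling it programmatically can reduce typos. The following is a sketch based on the example URL format shown above; the host, HTTP path, and token are placeholders:

```python
# Assemble a Databricks-style JDBC URL from its parts.
# Parameter names follow the example URL shown above; every value is a placeholder.

def databricks_jdbc_url(host, http_path, token, port=443, database="default"):
    params = {
        "transportMode": "http",
        "ssl": "1",
        "httpPath": http_path,
        "AuthMech": "3",   # as in the example URL above
        "UID": "token",
        "PWD": token,
    }
    tail = ";".join(f"{k}={v}" for k, v in params.items())
    return f"jdbc:spark://{host}:{port}/{database};{tail}"

url = databricks_jdbc_url(
    host="example.cloud.databricks.com",
    http_path="sql/protocolv1/o/1234567890/0000-000000-example",
    token="personal-access-token",
)
```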

## Connect to data sources with OAuth
<a name="canvas-connecting-oauth"></a>

Canvas supports using OAuth as an authentication method for connecting to your data in Snowflake and Salesforce Data Cloud. [OAuth](https://oauth.net/2/) is a common authentication platform for granting access to resources without sharing passwords.

**Note**  
You can only establish one OAuth connection for each data source.

To authorize the connection, you must follow the initial setup described in [Set up connections to data sources with OAuth](canvas-setting-up-oauth.md).

After setting up the OAuth credentials, you can do the following to add a Snowflake or Salesforce Data Cloud connection with OAuth:

1. Sign in to the Canvas application.

1. Create a tabular dataset. When prompted to upload data, choose Snowflake or Salesforce Data Cloud as your data source.

1. Create a new connection to your Snowflake or Salesforce Data Cloud data source. Specify OAuth as the authentication method and enter your connection details.

You should now be able to import data from your databases in Snowflake or Salesforce Data Cloud.

## Connect to a SaaS platform
<a name="canvas-connecting-saas"></a>

You can import data from Snowflake and over 40 other external SaaS platforms. For a full list of the connectors, see the table on [Data import](canvas-importing-data.md).

**Note**  
You can only import tabular data, such as data tables, from SaaS platforms.

### Use Snowflake with Canvas
<a name="canvas-using-snowflake"></a>

Snowflake is a data storage and analytics service, and you can import your data from Snowflake into SageMaker Canvas. For more information about Snowflake, see the [Snowflake documentation](https://www.snowflake.com/en/).

You can import data from your Snowflake account by doing the following:

1. Create a connection to the Snowflake database.

1. Choose the data that you're importing by dragging and dropping the table from the left navigation menu into the editor.

1. Import the data.

You can use the Snowflake editor to drag datasets onto the import pane and import them into SageMaker Canvas. For more control over the values returned in the dataset, you can use the following:
+ SQL queries
+ Joins

With SQL queries, you can customize how you import the values in the dataset. For example, you can specify the columns returned in the dataset or the range of values for a column.

You can join multiple Snowflake datasets into a single dataset before you import into Canvas by using SQL or the Canvas interface. You can drag your datasets from Snowflake into the panel that gives you the ability to join them, or you can edit the joins in SQL and convert the SQL into a single node. You can then join other nodes to the converted node, combine the joined datasets into a single node, and join that node to a different Snowflake dataset. Finally, you can import the data that you've selected into Canvas.

Use the following procedure to import data from Snowflake to Amazon SageMaker Canvas.

1. In the SageMaker Canvas application, go to the **Datasets** page.

1. Choose **Import data**, and from the dropdown menu, choose **Tabular**.

1. Enter a name for the dataset and choose **Create**.

1. For **Data Source**, open the dropdown menu and choose **Snowflake**.

1. Choose **Add connection**.

1. In the **Add a new Snowflake connection** dialog box, specify your Snowflake credentials. For the **Authentication method**, choose one of the following:
   + **Basic - username password** – Provide your Snowflake account ID, username, and password.
   + **ARN** – For improved protection of your Snowflake credentials, provide the ARN of an AWS Secrets Manager secret that contains your credentials. For more information, see [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html) in the *AWS Secrets Manager User Guide*.

     Your secret should have your Snowflake credentials stored in the following JSON format:

     ```
     {"accountid": "ID",
     "username": "username",
     "password": "password"}
     ```
   + **OAuth** – OAuth lets you authenticate without providing a password but requires additional setup. For more information about setting up OAuth credentials for Snowflake, see [Set up connections to data sources with OAuth](canvas-setting-up-oauth.md).

1. Choose **Add connection**.

1. From the tab that has the name of your connection, drag the table that you're importing to the **Drag and drop table to import** pane.

1. Optional: Drag additional tables to the import pane. You can use the user interface to join the tables. For more specificity in your joins, choose **Edit in SQL**.

1. Optional: If you're using SQL to query the data, you can choose **Context** to add context to the connection by specifying values for the following:
   + **Warehouse**
   + **Database**
   + **Schema**

   Adding context to a connection makes it easier to specify future queries.

1. Choose **Import data**.
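
If you use the **ARN** authentication method, the Secrets Manager secret must contain the JSON keys shown earlier. The following is a hedged sketch of building and creating that secret; the secret name is a placeholder, and the boto3 call is commented out so that nothing is created accidentally:

```python
import json

# Build the JSON payload that Canvas expects in the Secrets Manager secret,
# using the key names shown above; all values are placeholders.
secret_value = json.dumps({
    "accountid": "ID",
    "username": "username",
    "password": "password",
})

# Creating the secret with boto3 (requires AWS credentials; name is a placeholder):
# import boto3
# secretsmanager = boto3.client("secretsmanager")
# secretsmanager.create_secret(
#     Name="canvas/snowflake-credentials",
#     SecretString=secret_value,
# )
```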

The following image shows an example of fields specified for a Snowflake connection.

![\[Screenshot of the Add a new Snowflake connection dialog box in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-snowflake-connection.png)


The following image shows the page used to add context to a connection.

![\[Screenshot of the Import page in Canvas, showing the Context dialog box.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-connection-context.png)


The following image shows the page used to join datasets in Snowflake.

![\[Screenshot of the Import page in Canvas, showing datasets being joined.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-snowflake-join.png)


The following image shows a SQL query being used to edit a join in Snowflake.

![\[Screenshot of a SQL query in the Edit SQL editor on the Import page in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-snowflake-edit-sql.png)


### Use SaaS connectors with Canvas
<a name="canvas-connecting-external-appflow"></a>

**Note**  
For SaaS platforms besides Snowflake, you can only have one connection per data source.

Before you can import data from a SaaS platform, your administrator must authenticate and create a connection to the data source. For more information about how administrators can create a connection with a SaaS platform, see [Managing Amazon AppFlow connections](https://docs.aws.amazon.com/appflow/latest/userguide/connections.html) in the *Amazon AppFlow User Guide*.

If you’re an administrator getting started with Amazon AppFlow for the first time, see [Getting started](https://docs.aws.amazon.com/appflow/latest/userguide/getting-started.html) in the *Amazon AppFlow User Guide*.

To import data from a SaaS platform, you can follow the standard [Import tabular data](canvas-import-dataset.md#canvas-import-dataset-tabular) procedure, which shows you how to import tabular datasets into Canvas.

# Sample datasets in Canvas
<a name="canvas-sample-datasets"></a>

SageMaker Canvas provides sample datasets that address unique use cases, so you can start building, training, and validating models quickly without writing any code. The use cases associated with these datasets highlight the capabilities of SageMaker Canvas, and you can use these datasets to get started with building models. You can find the sample datasets on the **Datasets** page of your SageMaker Canvas application.

The following datasets are the samples that SageMaker Canvas provides by default. These datasets cover use cases such as predicting house prices, loan defaults, and readmission for diabetic patients; forecasting sales; predicting machine failures to streamline predictive maintenance in manufacturing units; and generating supply chain predictions for transportation and logistics. The datasets are stored in the `sample_dataset` folder in the default Amazon S3 bucket that SageMaker AI creates for your account in a Region.
+ **canvas-sample-diabetic-readmission.csv:** This dataset contains historical data including over fifteen features with patient and hospital outcomes. You can use this dataset to predict whether high-risk diabetic patients are likely to get readmitted to the hospital within 30 days of discharge, after 30 days, or not at all. Use the **readmitted** column as the target column, and use the 3+ category prediction model type with this dataset. To learn more about how to build a model with this dataset, see the [SageMaker Canvas workshop page](https://catalog.us-east-1.prod.workshops.aws/workshops/80ba0ea5-7cf9-4b8c-9d3f-1cd988b6c071/en-US/zzz-legacy/1-use-cases/5-hcls). This dataset was obtained from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008).
+ **canvas-sample-housing.csv:** This dataset contains data on the characteristics tied to a given housing price. You can use this dataset to predict housing prices. Use the **median\_house\_value** column as the target column, and use the numeric prediction model type with this dataset. To learn more about building a model with this dataset, see the [SageMaker Canvas workshop page](https://catalog.us-east-1.prod.workshops.aws/workshops/80ba0ea5-7cf9-4b8c-9d3f-1cd988b6c071/en-US/zzz-legacy/1-use-cases/2-real-estate). This is the California housing dataset obtained from the [StatLib repository](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html).
+ **canvas-sample-loans.csv:** This dataset contains complete loan data for all loans issued from 2007–2011, including the current loan status and latest payment information. You can use this dataset to predict whether a customer will repay a loan. Use the **loan\_status** column as the target column, and use the 3+ category prediction model type with this dataset. To learn more about how to build a model with this dataset, see the [SageMaker Canvas workshop page](https://catalog.us-east-1.prod.workshops.aws/workshops/80ba0ea5-7cf9-4b8c-9d3f-1cd988b6c071/en-US/zzz-legacy/1-use-cases/4-finserv). This data uses the LendingClub data obtained from [Kaggle](https://www.kaggle.com/datasets/wordsforthewise/lending-club).
+ **canvas-sample-maintenance.csv:** This dataset contains data on the characteristics tied to a given maintenance failure type. You can use this dataset to predict which failure will occur in the future. Use the **Failure Type** column as the target column, and use the 3+ category prediction model type with this dataset. To learn more about how to build a model with this dataset, see the [SageMaker Canvas workshop page](https://catalog.us-east-1.prod.workshops.aws/workshops/80ba0ea5-7cf9-4b8c-9d3f-1cd988b6c071/en-US/zzz-legacy/1-use-cases/6-manufacturing). This dataset was obtained from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset).
+ **canvas-sample-shipping-logs.csv:** This dataset contains complete shipping data for all products delivered, including estimated time, shipping priority, carrier, and origin. You can use this dataset to predict the estimated time of arrival of the shipment in number of days. Use the **ActualShippingDays** column as the target column, and use the numeric prediction model type with this dataset. To learn more about how to build a model with this data, see the [SageMaker Canvas workshop page](https://catalog.us-east-1.prod.workshops.aws/workshops/80ba0ea5-7cf9-4b8c-9d3f-1cd988b6c071/en-US/zzz-legacy/1-use-cases/7-supply-chain). This is a synthetic dataset created by Amazon.
+ **canvas-sample-sales-forecasting.csv:** This dataset contains historical time series sales data for retail stores. You can use this dataset to forecast sales for a particular retail store. Use the **sales** column as the target column, and use the time series forecasting model type with this dataset. To learn more about how to build a model with this dataset, see the [SageMaker Canvas workshop page](https://catalog.us-east-1.prod.workshops.aws/workshops/80ba0ea5-7cf9-4b8c-9d3f-1cd988b6c071/en-US/zzz-legacy/1-use-cases/3-retail). This is a synthetic dataset created by Amazon.

# Re-import a deleted sample dataset
<a name="canvas-sample-datasets-reimport"></a>

Amazon SageMaker Canvas provides you with sample datasets for various use cases that highlight the capabilities of Canvas. To learn more about the sample datasets that are available, see [Sample datasets in Canvas](canvas-sample-datasets.md). If you no longer wish to use the sample datasets, you can delete them from the **Datasets** page of your SageMaker Canvas application. However, these datasets are still stored in the Amazon S3 bucket that you specified as the [Canvas storage location](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-storage-configuration.html), so you can always access them later. 

If you used the default Amazon S3 bucket, the bucket name follows the pattern `sagemaker-{region}-{account ID}`. You can find the sample datasets in the directory path `Canvas/sample_dataset`.
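
Putting that naming pattern together, the full Amazon S3 URI of a sample dataset can be derived as follows (a sketch; the Region and account ID are placeholders):

```python
# Build the S3 URI of a Canvas sample dataset in the default SageMaker AI bucket.
# The bucket pattern and directory path follow the text above; values are placeholders.

def sample_dataset_uri(region, account_id, filename):
    bucket = f"sagemaker-{region}-{account_id}"
    return f"s3://{bucket}/Canvas/sample_dataset/{filename}"

uri = sample_dataset_uri("us-east-1", "111122223333", "canvas-sample-housing.csv")
# -> s3://sagemaker-us-east-1-111122223333/Canvas/sample_dataset/canvas-sample-housing.csv
```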

If you delete a sample dataset from your SageMaker Canvas application and want to access the sample dataset again, use the following procedure.

1. Navigate to the **Datasets** page in your SageMaker Canvas application.

1. Choose **Import data**.

1. From the list of Amazon S3 buckets, select the bucket that is your Canvas storage location. If using the default SageMaker AI-created Amazon S3 bucket, it follows the naming pattern `sagemaker-{region}-{account ID}`.

1. Select the **Canvas** folder.

1. Select the **sample\_dataset** folder, which contains all of the sample datasets for SageMaker Canvas.

1. Select the dataset you want to import, and then choose **Import data**.

# Data preparation
<a name="canvas-data-prep"></a>

**Note**  
Previously, Amazon SageMaker Data Wrangler was part of the SageMaker Studio Classic experience. Now, if you update to using the new Studio experience, you must use SageMaker Canvas to access Data Wrangler and receive the latest feature updates. If you have been using Data Wrangler in Studio Classic until now and want to migrate to Data Wrangler in Canvas, you might have to grant additional permissions so that you can create and use a Canvas application. For more information, see [(Optional) Migrate from Data Wrangler in Studio Classic to SageMaker Canvas](studio-updated-migrate-ui.md#studio-updated-migrate-dw).  
To learn how to migrate your data flows from Data Wrangler in Studio Classic, see [(Optional) Migrate data from Studio Classic to Studio](studio-updated-migrate-data.md).

Use Amazon SageMaker Data Wrangler in Amazon SageMaker Canvas to prepare, featurize, and analyze your data. You can integrate a Data Wrangler data preparation flow into your machine learning (ML) workflows to simplify and streamline data pre-processing and feature engineering using little to no coding. You can also add your own Python scripts and transformations to customize workflows.
+ **Data Flow** – Create a data flow to define a series of ML data prep steps. You can use a flow to combine datasets from different data sources, identify the number and types of transformations you want to apply to datasets, and define a data prep workflow that can be integrated into an ML pipeline. 
+ **Transform** – Clean and transform your dataset using standard *transforms* like string, vector, and numeric data formatting tools. Featurize your data using transforms like text and date/time embedding and categorical encoding.
+ **Generate Data Insights** – Automatically verify data quality and detect abnormalities in your data with Data Wrangler Data Quality and Insights Report. 
+ **Analyze** – Analyze features in your dataset at any point in your flow. Data Wrangler includes built-in data visualization tools like scatter plots and histograms, as well as data analysis tools like target leakage analysis and quick modeling to understand feature correlation. 
+ **Export** – Export your data preparation workflow to a different location. The following are example locations: 
  + Amazon Simple Storage Service (Amazon S3) bucket
  + Amazon SageMaker Feature Store – Store the features and their data in a centralized store.
+ **Automate data preparation** – Create machine learning workflows from your data flow.
  + Amazon SageMaker Pipelines – Build workflows that manage your SageMaker AI data preparation, model training, and model deployment jobs.
  + Serial inference pipeline – Create a serial inference pipeline from your data flow. Use it to make predictions on new data.
  + Python script – Store the data and their transformations in a Python script for your custom workflows.

# Create a data flow
<a name="canvas-data-flow"></a>

Use a Data Wrangler flow in SageMaker Canvas, or *data flow*, to create and modify a data preparation pipeline. We recommend that you use Data Wrangler for datasets larger than 5 GB.

To get started, use the following procedure to import your data into a data flow.

1. Open SageMaker Canvas.

1. In the left-hand navigation, choose **Data Wrangler**.

1. Choose **Import and prepare**.

1. From the dropdown menu, choose either **Tabular** or **Image**.

1. For **Select a data source**, choose your data source and select the data that you want to import. You have the option to select up to 30 files or one folder. If you have a dataset already imported into Canvas, choose **Canvas dataset** as your source. Otherwise, connect to a data source such as Amazon S3 or Snowflake and browse through your data. For information about connecting to a data source or importing data, see the following pages:
   + [Data import](canvas-importing-data.md)
   + [Connect to data sources](canvas-connecting-external.md)

1. After selecting the data that you want to import, choose **Next**.

1. (Optional) For the **Import settings** section when importing a tabular dataset, expand the **Advanced** dropdown menu. You can specify the following advanced settings for data flow imports:
   + **Sampling method** – Select the sampling method and sample size you'd like to use. For more information about how to change your sample, see the section [Edit the data flow sampling configuration](canvas-data-flow-edit-sampling.md).
   + **File encoding (CSV)** – Select your dataset file’s encoding. `UTF-8` is the default.
   + **Skip first rows** – Enter the number of rows you’d like to skip importing if you have redundant rows at the beginning of your dataset.
   + **Delimiter** – Select the delimiter that separates each item in your data. You can also specify a custom delimiter.
   + **Multi-line detection** – Select this option if you’d like Canvas to scan your entire dataset for multi-line cells. Canvas determines whether to use multi-line support by taking a sample of your data, but the sample might not contain any multi-line cells. In that case, we recommend that you select the **Multi-line detection** option so that Canvas checks your entire dataset for multi-line cells.

1. Choose **Import**.

You should now have a new data flow, and you can begin adding transform steps and analyses.
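
The advanced import settings above have familiar CSV-parsing analogues. As a rough local sketch (not what Canvas runs internally), here is the same configuration expressed with Python's standard `csv` module; the sample data is made up:

```python
import csv

# Local analogue of the advanced import settings above (illustrative only):
# file encoding, skip first rows, and a custom delimiter.
raw = b"junk header row\nid;name\n1;alpha\n2;beta\n"

text = raw.decode("utf-8")                     # File encoding: UTF-8
lines = text.splitlines()[1:]                  # Skip first rows: 1
rows = list(csv.reader(lines, delimiter=";"))  # Delimiter: semicolon

# rows[0] is the real header; the rest are records
```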

# How the data flow UI works
<a name="canvas-data-flow-ui"></a>

To help you navigate your data flow, Data Wrangler has the following tabs in the top navigation pane:
+ **Data flow** – This tab provides you with a visual view of your data flow step where you can add or remove transforms, and export data.
+ **Data** – This tab gives you a preview of your data so that you can check the results of your transforms. You can also see an ordered list of your data flow steps and edit or reorder the steps.
**Note**  
In this tab, you can only preview data visualizations (such as the distribution of values per column) for Amazon S3 data sources. Visualizations for other data sources, such as Amazon Athena, aren't supported.
+ **Analyses** – In this tab, you can see separate sub-tabs for each analysis you create. For example, if you create a histogram and a Data Quality and Insights (DQI) report, Canvas creates a tab for each.

When you import a dataset, the original dataset appears on the data flow and is named **Source**. SageMaker Canvas automatically infers the types of each column in your dataset and creates a new dataframe named **Data types**. You can select this frame to update the inferred data types.

The datasets, transformations, and analyses that you use in the data flow are represented as *steps*. Each time you add a transform step, you create a new dataframe. When multiple transform steps (other than **Join** or **Concatenate**) are added to the same dataset, they are stacked.

Under the **Combine data** option, **Join** and **Concatenate** create standalone steps that contain the new joined or concatenated dataset.

# Edit the data flow sampling configuration
<a name="canvas-data-flow-edit-sampling"></a>

When importing tabular data into a Data Wrangler data flow, you can opt to take a sample of your dataset to speed up the data exploration and cleaning process. Running exploratory transforms on a sample of your dataset is often faster than running transforms on your entire dataset, and when you're ready to export your dataset and build a model, you can apply the transforms to the full dataset.

Canvas supports the following sampling methods:
+ **FirstK** – Canvas selects the first *K* items from your dataset, where *K* is a number you specify. This sampling method is simple but can introduce bias if your dataset isn't randomly ordered.
+ **Random** – Canvas selects items from the dataset at random, with each item having an equal probability of being chosen. This sampling method helps ensure that the sample is representative of the entire dataset.
+ **Stratified** – Canvas divides the dataset into groups (or *strata*) based on one or more attributes (for example, age and income level). Then, a proportional number of items are randomly selected from each group. This method ensures that all relevant subgroups are adequately represented in the sample.
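
As an illustration, the three sampling methods can be sketched with pandas on a toy dataset. The `label` column, sizes, and sample fraction below are hypothetical; this is not Canvas's implementation.

```python
import pandas as pd

# Toy dataset; "label" is a hypothetical column to stratify on.
df = pd.DataFrame({"value": range(100), "label": ["a", "b"] * 50})
k = 10

first_k = df.head(k)                            # FirstK: the first K rows
random_sample = df.sample(n=k, random_state=0)  # Random: uniform, without replacement

# Stratified: draw the same fraction from each "label" group, so every
# subgroup keeps its original proportion in the sample.
stratified = df.groupby("label").sample(frac=k / len(df), random_state=0)
```

Note how the stratified sample preserves the 50/50 split of `a` and `b` rows, while `FirstK` would return only the rows that happen to come first.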

You can edit your sampling configuration at any time to change the size of the sample used for data exploration.

To make changes to your sampling configuration, do the following:

1. In your data flow graph, select your data source node.

1. Choose **Sampling** on the bottom navigation bar.

1. The **Sampling** dialog box opens. For the **Sampling method** dropdown, select your desired sampling method.

1. For **Maximum sample size**, enter the number of rows you want to sample.

1. Choose **Update** to save your changes.

The changes to your sampling configuration should now be applied.

# Add a step to your data flow
<a name="canvas-data-flow-add-step"></a>

In your Data Wrangler data flows, you can add steps that represent data transformations and analyses.

To add a step to your data flow, select the plus icon next to any dataset node or previously added step. Then, select one of the following options:
+ **Edit data types** (For a **Data types** step only): If you have not added any transforms to a **Data types** step, you can double-click on the **Data types** step in your flow to open the **Data** tab and edit the data types that Data Wrangler inferred when importing your dataset. 
+ **Add transform**: Adds a new transform step. See [Transform data](canvas-transform.md) to learn more about the data transformations you can add. 
+ **Get data insights**: Add analyses, such as histograms or custom visualizations. You can use this option to analyze your data at any point in the data flow. See [Perform exploratory data analysis (EDA)](canvas-analyses.md) to learn more about the analyses you can add. 
+ **Join**: Find this option under **Combine data** to join two datasets and add the resulting dataset to the data flow. To learn more, see [Join Datasets](canvas-transform.md#canvas-transform-join).
+ **Concatenate**: Find this option under **Combine data** to concatenate two datasets and add the resulting dataset to the data flow. To learn more, see [Concatenate Datasets](canvas-transform.md#canvas-transform-concatenate).

# Edit data flow steps
<a name="canvas-data-flow-edit-steps"></a>

In Amazon SageMaker Canvas, you can edit individual steps in your data flows to transform your dataset without having to create a new data flow. The following page covers how to edit join and concatenate steps, as well as data source steps.

## Edit join and concatenate steps
<a name="canvas-data-flow-edit-join-concat"></a>

Within your data flows, you have the flexibility to edit your join and concatenate steps. You can make necessary adjustments to your data processing workflow, ensuring that your data is properly combined and transformed without having to redo your entire data flow.

To edit a join or concatenate step in your data flow, do the following:

1. Open your data flow.

1. Choose the plus icon next to the join or concatenate node that you want to edit.

1. From the context menu, choose **Edit**.

1. A side panel opens where you can edit the details of your join or concatenation. Modify your step fields, such as the type of join. To swap out a data node and select a different one to join or concatenate, choose the delete icon next to the node and then, in the data flow view, select the new node that you want to include in your transformation.
**Note**  
When swapping out a node during the editing process, you can only select steps that occur before the join or concatenate operation. You can swap either the left or right node, but you can only swap one node at a time. Additionally, you cannot select a source node as a replacement.

1. Choose **Preview** to view the result of the combining operation.

1. Choose **Update** to save your changes.

Your data flow should now be updated.

## Edit or replace a data source step
<a name="canvas-data-flow-edit-source"></a>

You might need to make changes to your data source or dataset without deleting the transforms and data flow steps applied to your original data. Within Data Wrangler, you can edit or replace your data source configuration while keeping the steps of your data flow. When editing a data source, you can change the import settings, such as the sampling size or method and any advanced settings. You can also add more files with the same schema, or for query-based data sources such as Amazon Athena, you can edit the query. When replacing a data source, you have the option to select a different dataset, or even import the data from a different data source altogether, as long as the schema of the new data matches the original data.

To edit a data source configuration, do the following:

1. In the Canvas application, go to the **Data Wrangler** page.

1. Choose your data flow to view it.

1. In the **Data flow** tab that shows your data flow steps, find the **Source** node that you want to edit.

1. Choose the ellipsis icon next to the **Source** node.

1. From the context menu, choose **Edit**.

1. For Amazon S3 data sources and local upload, you have the option to select or upload more files with the same schema as your original data. For query-based data sources such as Amazon Athena, you can remove and select different tables in the visual query builder, or you can edit the SQL query directly. When you're done, choose **Next**.

1. For the **Import settings**, make any desired changes.

1. When you're done, choose **Save changes**.

Your data source should now be updated.

To replace a data source, do the following:

1. In the Canvas application, go to the **Data Wrangler** page.

1. Choose your data flow to view it.

1. In the **Data flow** tab that shows your data flow steps, find the **Source** node that you want to edit.

1. Choose the ellipsis icon next to the **Source** node.

1. From the context menu, choose **Replace**.

1. Go through the [create a data flow experience](canvas-data-flow.md) to select another data source and data.

1. When you’ve selected your data and are ready to update the source node, choose **Save**.

You should now see the **Source** node updated in your data flow.

# Reorder steps in your data flow
<a name="canvas-data-flow-reorder-steps"></a>

After adding steps to your data flow, you have the option to reorder steps instead of deleting and re-adding them in the correct order. For example, you might decide to move a transform to impute missing values before a step to format strings.

**Note**  
You can’t change the order of certain step types, such as defining your data source, changing data types, joining, concatenating, or splitting. Steps that can’t be reordered are grayed out in the Canvas application UI.

To reorder your data flow steps, do the following:

1. While editing a data flow in Data Wrangler, choose the **Data** tab. A side panel called **Steps** lists your data flow steps in order.

1. Hover over a transform step and choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) next to that step.

1. From the context menu, choose **Reorder**.

1. Drag and drop your data flow steps into your desired order.

1. When you’ve finished, choose **Save**.

Your data flow steps and graph should now reflect the changes you’ve made.

# Delete a step from your data flow
<a name="canvas-data-flow-delete-step"></a>

Within your data flows, you have the flexibility to delete your join and concatenate steps and choose whether or not to still apply any subsequent transforms to your data.

To delete a join or concatenate step from your data flow, do the following:

1. Open your data flow.

1. Choose the plus icon next to the join or concatenate node that you want to delete.

1. In the context menu, choose **Delete**.

1. (Optional) If you have transformation steps following the join or concatenate step, then you can choose whether or not to keep the subsequent transformation steps and add them separately to each data node. In the **Delete join** side panel, choose a node to deselect it and remove any subsequent transformation steps. You can leave both nodes selected to keep all transformation steps, or you can deselect both nodes to discard all transformation steps.

   The following screenshot shows this step with only the second of two data nodes selected. After the join is deleted, the subsequent **Rename column** transform is kept only by the second data node.  
![\[Screenshot of a data flow in Data Wrangler showing the delete join view.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-data-flow-delete-step.png)

1. Choose **Delete**.

The join or concatenate step should now be removed from your data flow.

# Perform exploratory data analysis (EDA)
<a name="canvas-analyses"></a>

Data Wrangler includes built-in analyses that help you generate visualizations and data analyses in a few clicks. You can also create custom analyses using your own code. 

You add an analysis to a dataframe by selecting a step in your data flow, and then choosing **Add analysis**. To access an analysis you've created, select the step that contains the analysis, and select the analysis. 

Analyses are generated using a sample of up to 200,000 rows of your dataset, and you can configure the sample size. For more information about changing the sample size of your data flow, see [Edit the data flow sampling configuration](canvas-data-flow-edit-sampling.md).

**Note**  
Analyses are optimized for data with 1,000 or fewer columns. You might experience some latency when generating analyses for data with more columns.

You can add the following analyses to a dataframe:
+ Data visualizations, including histograms and scatter plots. 
+ A quick summary of your dataset, including number of entries, minimum and maximum values (for numeric data), and most and least frequent categories (for categorical data).
+ A quick model of the dataset, which can be used to generate an importance score for each feature. 
+ A target leakage report, which you can use to determine if one or more features are strongly correlated with your target feature.
+ A custom visualization using your own code. 

Use the following sections to learn more about these options.

## Get insights on data and data quality
<a name="canvas-data-insights"></a>

Use the **Data Quality and Insights Report** to perform an analysis of the data that you've imported into Data Wrangler. We recommend that you create the report after you import your dataset. You can use the report to help you clean and process your data. It gives you information such as the number of missing values and the number of outliers. If you have issues with your data, such as target leakage or imbalance, the insights report can bring those issues to your attention.

Use the following procedure to create a Data Quality and Insights report. It assumes that you've already imported a dataset into your Data Wrangler flow.

**To create a Data Quality and Insights report**

1. Choose the ellipsis icon next to a node in your Data Wrangler flow.

1. Select **Get data insights**.

1. For **Analysis type**, select **Data Quality and Insights Report**.

1. For **Analysis name**, specify a name for the insights report.

1. For **Problem type**, specify **Regression** or **Classification**.

1. For **Target column**, specify the target column.

1. For **Data size**, specify one of the following:
   + **Sampled dataset** – Uses the interactive sample from your data flow, which can contain up to 200,000 rows of your dataset. For information about how to edit the size of your sample, see [Edit the data flow sampling configuration](canvas-data-flow-edit-sampling.md).
   + **Full dataset** – Uses the full dataset from your data source to create the report.
**Note**  
Creating a Data Quality and Insights report on the full dataset uses an Amazon SageMaker processing job. A SageMaker Processing job provisions the additional compute resources required to get insights for all of your data. For more information about SageMaker Processing jobs, see [Data transformation workloads with SageMaker Processing](processing-job.md).

1. Choose **Create**.

The following topics show the sections of the report:

**Topics**
+ [Summary](#canvas-data-insights-summary)
+ [Target column](#canvas-data-insights-target-column)
+ [Quick model](#canvas-data-insights-quick-model)
+ [Feature summary](#canvas-data-insights-feature-summary)
+ [Samples](#canvas-data-insights-samples)
+ [Definitions](#canvas-data-insights-definitions)

You can either download the report or view it online. To download the report, choose the download button at the top right corner of the screen. 

### Summary
<a name="canvas-data-insights-summary"></a>

The insights report has a brief summary of the data that includes general information such as missing values, invalid values, feature types, outlier counts, and more. It can also include high severity warnings that point to probable issues with the data. We recommend that you investigate the warnings.

### Target column
<a name="canvas-data-insights-target-column"></a>

When you create the Data Quality and Insights Report, Data Wrangler gives you the option to select a target column. A target column is a column that you're trying to predict. When you choose a target column, Data Wrangler automatically creates a target column analysis. It also ranks the features in the order of their predictive power. When you select a target column, you must specify whether you’re trying to solve a regression or a classification problem.

For classification, Data Wrangler shows a table and a histogram of the most common classes. A class is a category. It also presents observations, or rows, with a missing or invalid target value.

For regression, Data Wrangler shows a histogram of all the values in the target column. It also presents observations, or rows, with a missing, invalid, or outlier target value.

### Quick model
<a name="canvas-data-insights-quick-model"></a>

The **Quick model** provides an estimate of the quality you can expect from a model trained on your data.

Data Wrangler splits your data into training and validation folds, using 80% of the samples for training and 20% for validation. For classification, the split is stratified, meaning that each fold has the same ratio of labels. Keeping the label ratio consistent between the training and validation folds is important for classification problems. Data Wrangler trains an XGBoost model with the default hyperparameters, applies early stopping on the validation data, and performs minimal feature preprocessing.
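
A stratified 80/20 split can be sketched with pandas. The dataset and `label` column below are hypothetical, and this is an illustration of the idea rather than Data Wrangler's actual implementation.

```python
import pandas as pd

# Hypothetical labeled dataset: 20% "yes", 80% "no".
df = pd.DataFrame({"feature": range(100), "label": ["yes"] * 20 + ["no"] * 80})

# Stratified split: sample 80% of each label group for training, so both
# folds keep the same label ratio as the full dataset.
train = df.groupby("label").sample(frac=0.8, random_state=0)
validation = df.drop(train.index)
```

Both folds end up with the same 20% share of `yes` labels as the original data.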

For classification models, Data Wrangler returns both a model summary and a confusion matrix.

 To learn more about the information that the classification model summary returns, see [Definitions](#canvas-data-insights-definitions).

A confusion matrix gives you the following information:
+ The number of times the predicted label matches the true label.
+ The number of times the predicted label doesn't match the true label.

The true label represents an actual observation in your data. For example, if you're using a model to detect fraudulent transactions, the true label represents a transaction that is actually fraudulent or non-fraudulent. The predicted label represents the label that your model assigns to the data.

You can use the confusion matrix to see how well the model predicts the presence or the absence of a condition. If you're predicting fraudulent transactions, you can use the confusion matrix to get a sense of both the sensitivity and the specificity of the model. The sensitivity refers to the model's ability to detect fraudulent transactions. The specificity refers to the model's ability to avoid detecting non-fraudulent transactions as fraudulent.
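
A minimal sketch of computing sensitivity and specificity from confusion-matrix counts. The counts below are made up for illustration.

```python
# Hypothetical confusion-matrix counts for a fraud-detection model.
tp, fn = 80, 20   # fraudulent transactions: correctly caught vs. missed
tn, fp = 950, 50  # legitimate transactions: correctly passed vs. wrongly flagged

sensitivity = tp / (tp + fn)  # how well the model detects fraud
specificity = tn / (tn + fp)  # how well it avoids flagging legitimate transactions
```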

### Feature summary
<a name="canvas-data-insights-feature-summary"></a>

When you specify a target column, Data Wrangler orders the features by their prediction power. Prediction power is measured on the data after it is split into 80% training and 20% validation folds. Data Wrangler fits a model for each feature separately on the training fold. It applies minimal feature preprocessing and measures prediction performance on the validation data.

Data Wrangler normalizes the scores to the range [0, 1]. Higher prediction scores indicate columns that are more useful for predicting the target on their own. Lower scores point to columns that aren’t predictive of the target column.

It’s uncommon for a column that isn’t predictive on its own to be predictive when it’s used in tandem with other columns. You can confidently use the prediction scores to determine whether a feature in your dataset is predictive.

A low score usually indicates the feature is redundant. A score of 1 implies perfect predictive abilities, which often indicates target leakage. Target leakage usually happens when the dataset contains a column that isn’t available at the prediction time. For example, it could be a duplicate of the target column.

### Samples
<a name="canvas-data-insights-samples"></a>

Data Wrangler provides information about whether your samples are anomalous or if there are duplicates in your dataset.

Data Wrangler detects anomalous samples using the *isolation forest algorithm*. The isolation forest associates an anomaly score with each sample (row) of the dataset. Low anomaly scores indicate anomalous samples. High scores are associated with non-anomalous samples. Samples with a negative anomaly score are usually considered anomalous and samples with positive anomaly score are considered non-anomalous.
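
Data Wrangler's implementation isn't exposed, but the same sign convention can be seen with scikit-learn's `IsolationForest`, assuming a toy one-dimensional dataset with a single obvious outlier.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# 100 points near 0 plus one obvious outlier at 8.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, size=(100, 1)), [[8.0]]])

forest = IsolationForest(random_state=0).fit(X)
scores = forest.decision_function(X)  # lower (negative) scores suggest anomalies
```

The outlier receives the lowest score in the dataset, matching the convention described above.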

When you look at a sample that might be anomalous, we recommend that you pay attention to unusual values. For example, anomalous values might result from errors in gathering and processing the data. We recommend using domain knowledge and business logic when you examine anomalous samples.

Data Wrangler detects duplicate rows and calculates the ratio of duplicate rows in your data. Some data sources could include valid duplicates. Other data sources could have duplicates that point to problems in data collection. Duplicate samples that result from faulty data collection could interfere with machine learning processes that rely on splitting the data into independent training and validation folds.

The following are elements of the insights report that can be impacted by duplicated samples:
+ Quick model
+ Prediction power estimation
+ Automatic hyperparameter tuning

You can remove duplicate samples from the dataset using the **Drop duplicates** transform under **Manage rows**. Data Wrangler shows you the most frequently duplicated rows.
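
With pandas, for example, you can estimate the duplicate ratio and drop duplicates in the same spirit. The dataset below is hypothetical.

```python
import pandas as pd

# Toy dataset with repeated rows.
df = pd.DataFrame({"id": [1, 2, 2, 3, 3, 3], "amount": [10, 20, 20, 30, 30, 30]})

duplicate_ratio = df.duplicated().mean()  # fraction of rows repeating an earlier row
deduplicated = df.drop_duplicates()       # analogous to the Drop duplicates transform
```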

### Definitions
<a name="canvas-data-insights-definitions"></a>

The following are definitions for the technical terms that are used in the data insights report.

------
#### [ Feature types ]

The following are the definitions for each of the feature types:
+ **Numeric** – Numeric values can be either floats or integers, such as age or income. The machine learning models assume that numeric values are ordered and a distance is defined over them. For example, 3 is closer to 4 than to 10 and 3 < 4 < 10.
+ **Categorical** – The column entries belong to a set of unique values, which is usually much smaller than the number of entries in the column. For example, a column of length 100 could contain the unique values `Dog`, `Cat`, and `Mouse`. The values could be numeric, text, or a combination of both. `Horse`, `House`, `8`, `Love`, and `3.1` would all be valid values and could be found in the same categorical column. The machine learning model does not assume order or distance on the values of categorical features, as opposed to numeric features, even when all the values are numbers.
+ **Binary** – Binary features are a special categorical feature type in which the cardinality of the set of unique values is 2.
+ **Text** – A text column contains many non-numeric unique values. In extreme cases, all the elements of the column are unique, so no two entries are the same.
+ **Datetime** – A datetime column contains information about the date or time. It can have information about both the date and time.

------
#### [ Feature statistics ]

The following are definitions for each of the feature statistics:
+ **Prediction power** – Prediction power measures how useful the column is in predicting the target.
+ **Outliers** (in numeric columns) – Data Wrangler detects outliers using two statistics that are robust to outliers: median and robust standard deviation (RSTD). RSTD is derived by clipping the feature values to the range [5 percentile, 95 percentile] and calculating the standard deviation of the clipped vector. All values larger than median + 5 * RSTD or smaller than median - 5 * RSTD are considered to be outliers.
+ **Skew** (in numeric columns) – Skew measures the symmetry of the distribution and is defined as the third moment of the distribution divided by the third power of the standard deviation. The skewness of the normal distribution or any other symmetric distribution is zero. Positive values imply that the right tail of the distribution is longer than the left tail. Negative values imply that the left tail of the distribution is longer than the right tail. As a rule of thumb, a distribution is considered skewed when the absolute value of the skew is larger than 3.
+ **Kurtosis** (in numeric columns) – Pearson's kurtosis measures the heaviness of the tail of the distribution. It's defined as the fourth moment of the distribution divided by the square of the second moment. The kurtosis of the normal distribution is 3. Kurtosis values lower than 3 imply that the distribution is concentrated around the mean and the tails are lighter than the tails of the normal distribution. Kurtosis values higher than 3 imply heavier tails or outliers.
+ **Missing values** – Null-like objects, empty strings, and strings composed of only white space are considered missing.
+ **Valid values for numeric features or regression target** – All values that you can cast to finite floats are valid. Missing values are not valid.
+ **Valid values for categorical, binary, or text features, or for classification target** – All values that are not missing are valid.
+ **Datetime features** – All values that you can cast to a datetime object are valid. Missing values are not valid.
+ **Invalid values** – Values that are either missing or that you can't properly cast. For example, in a numeric column, you can't cast the string `"six"` or a null value.
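
A rough sketch of the outlier, skew, and kurtosis definitions above, using NumPy on a made-up vector. This is an illustration, not Data Wrangler's implementation.

```python
import numpy as np

values = np.array(list(range(1, 21)) + [1000], dtype=float)  # 1000 is an extreme value

# Outlier rule: clip to the [5th, 95th] percentile range, take the standard
# deviation of the clipped values (RSTD), and flag anything outside
# median +/- 5 * RSTD.
low, high = np.percentile(values, [5, 95])
rstd = np.clip(values, low, high).std()
median = np.median(values)
outliers = values[(values > median + 5 * rstd) | (values < median - 5 * rstd)]

# Skew: third central moment / std**3. Kurtosis: fourth central moment / var**2.
centered = values - values.mean()
skew = (centered**3).mean() / values.std() ** 3
kurtosis = (centered**4).mean() / values.var() ** 2
```

Because clipping removes the extreme value before computing RSTD, the 1000 is flagged as an outlier; the long right tail also produces a positive skew and a kurtosis well above 3.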

------
#### [ Quick model metrics for regression ]

The following are the definitions for the quick model metrics:
+ **R2 or coefficient of determination** – R2 is the proportion of the variation in the target that is predicted by the model. R2 is in the range of [-infty, 1]. 1 is the score of the model that predicts the target perfectly and 0 is the score of the trivial model that always predicts the target mean.
+ **MSE or mean squared error** – MSE is in the range [0, infty]. 0 is the score of the model that predicts the target perfectly.
+ **MAE or mean absolute error** – MAE is in the range [0, infty] where 0 is the score of the model that predicts the target perfectly.
+ **RMSE or root mean square error** – RMSE is in the range [0, infty] where 0 is the score of the model that predicts the target perfectly.
+ **Max error** – The maximum absolute value of the error over the dataset. Max error is in the range [0, infty]. 0 is the score of the model that predicts the target perfectly.
+ **Median absolute error** – Median absolute error is in the range [0, infty]. 0 is the score of the model that predicts the target perfectly.
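
These metrics can be computed directly from their definitions. The true and predicted values below are hypothetical.

```python
import statistics

# Hypothetical true and predicted target values.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 3.0, 6.0]

errors = [t - p for t, p in zip(y_true, y_pred)]
mse = sum(e**2 for e in errors) / len(errors)
mae = sum(abs(e) for e in errors) / len(errors)
rmse = mse**0.5
max_error = max(abs(e) for e in errors)
median_ae = statistics.median(abs(e) for e in errors)

# R2: 1 minus the ratio of model error to the error of always predicting the mean.
mean = sum(y_true) / len(y_true)
r2 = 1 - sum(e**2 for e in errors) / sum((t - mean) ** 2 for t in y_true)
```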

------
#### [ Quick model metrics for classification ]

The following are the definitions for the quick model metrics:
+ **Accuracy** – Accuracy is the ratio of samples that are predicted accurately. Accuracy is in the range [0, 1]. 0 is the score of the model that predicts all samples incorrectly and 1 is the score of the perfect model.
+ **Balanced accuracy** – Balanced accuracy is the ratio of samples that are predicted accurately when the class weights are adjusted to balance the data. All classes are given the same importance, regardless of their frequency. Balanced accuracy is in the range [0, 1]. 0 is the score of the model that predicts all samples wrong. 1 is the score of the perfect model.
+ **AUC (binary classification)** – This is the area under the receiver operating characteristic curve. AUC is in the range [0, 1] where a random model returns a score of 0.5 and the perfect model returns a score of 1.
+ **AUC (OVR)** – For multiclass classification, this is the area under the receiver operating characteristic curve calculated separately for each label using one versus rest. Data Wrangler reports the average of the areas. AUC is in the range [0, 1] where a random model returns a score of 0.5 and the perfect model returns a score of 1.
+ **Precision** – Precision is defined for a specific class. Precision is the fraction of true positives out of all the instances that the model classified as that class. Precision is in the range [0, 1]. 1 is the score of the model that has no false positives for the class. For binary classification, Data Wrangler reports the precision of the positive class.
+ **Recall** – Recall is defined for a specific class. Recall is the fraction of the relevant class instances that are successfully retrieved. Recall is in the range [0, 1]. 1 is the score of the model that classifies all the instances of the class correctly. For binary classification, Data Wrangler reports the recall of the positive class.
+ **F1** – F1 is defined for a specific class. It's the harmonic mean of the precision and recall. F1 is in the range [0, 1]. 1 is the score of the perfect model. For binary classification, Data Wrangler reports the F1 for classes with positive values.
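
For binary classification, accuracy, precision, recall, and F1 follow directly from the confusion-matrix counts. The labels below are made up for illustration.

```python
# Hypothetical binary labels, where 1 is the positive class.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
precision = tp / (tp + fp)  # of everything flagged positive, how much was right
recall = tp / (tp + fn)     # of all actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
```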

------
#### [ Textual patterns ]

**Patterns** describe the textual format of a string using an easy to read format. The following are examples of textual patterns:
+ "`{digits:4-7}`" describes a sequence of digits that have a length between 4 and 7.
+ "`{alnum:5}`" describes an alphanumeric string with a length of exactly 5.

Data Wrangler infers the patterns by looking at samples of non-empty strings from your data. It can describe many of the commonly used patterns. The **confidence** expressed as a percentage indicates how much of the data is estimated to match the pattern. Using the textual pattern, you can see which rows in your data you need to correct or drop.

The following describes the patterns that Data Wrangler can recognize:


| Pattern | Textual Format | 
| --- | --- | 
|  `{alnum}`  |  Alphanumeric strings  | 
|  `{any}`  |  Any string of word characters  | 
|  `{digits}`  |  A sequence of digits  | 
|  `{lower}`  |  A lowercase word  | 
|  `{mixed}`  |  A mixed-case word  | 
|  `{name}`  |  A word beginning with a capital letter  | 
|  `{upper}`  |  An uppercase word  | 
|  `{whitespace}`  |  Whitespace characters  | 

A word character is either an underscore or a character that might appear in a word in any language. For example, the strings `'Hello_word'` and `'écoute'` both consist of word characters. 'H' and 'é' are both examples of word characters.
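
As an illustration, two of the patterns above can be approximated with regular expressions and used to estimate a confidence the same way. The regex equivalents and sample strings below are assumptions, not Data Wrangler's actual matching rules (for example, its word characters span all languages, while `[A-Za-z0-9]` does not).

```python
import re

# Hypothetical regex approximations of two textual patterns.
digits_4_7 = re.compile(r"\d{4,7}")      # roughly a "digits, length 4-7" pattern
alnum_5 = re.compile(r"[A-Za-z0-9]{5}")  # roughly an "alphanumeric, length 5" pattern

# Confidence: the fraction of sampled strings matching the pattern.
sample = ["12345", "9876543", "123", "007x"]
confidence = sum(bool(digits_4_7.fullmatch(s)) for s in sample) / len(sample)
```

Here two of the four sample strings match, so the estimated confidence is 50%; the non-matching rows are the ones you might correct or drop.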

------

## Bias report
<a name="canvas-bias-report"></a>

SageMaker Canvas provides the bias report in Data Wrangler to help uncover potential biases in your data. The bias report analyzes the relationship between the target column (label) and a column that you believe might contain bias (facet variable). For example, if you are trying to predict customer conversion, the facet variable may be the age of the customer. The bias report can help you determine whether or not your data is biased toward a certain age group.

To generate a bias report in Canvas, do the following:

1. In your data flow in Data Wrangler, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) next to a node in the flow.

1. From the context menu, choose **Get data insights**.

1. The **Create analysis** side panel opens. For the **Analysis type** dropdown menu, select **Bias Report**.

1.  In the **Analysis name** field, enter a name for the bias report.

1. For the **Select the column your model predicts (target)** dropdown menu, select your target column.

1. For **Is your predicted column a value or threshold?**, select **Value** if your target column has categorical values or **Threshold** if it has numerical values.

1. For **Predicted value** (or **Predicted threshold**, depending on your selection in the previous step), enter the target column value or values that correspond to a positive outcome. For example, if predicting customer conversion, your value might be `yes` to indicate that a customer was converted.

1. For the **Select the column to analyze for bias** dropdown menu, select the column that you believe might contain bias, also known as the facet variable.

1. For **Is your column a value or threshold?**, select **Value** if the facet variable has categorical values or **Threshold** if it has numerical values.

1. For **Column value(s) to analyze for bias** (or **Column threshold to analyze for bias**, depending on your selection in the previous step), enter the value or values that you want to analyze for potential bias. For example, if you're checking for bias against customers over a certain age, use the beginning of that age range as your threshold.

1. For **Choose bias metrics**, select the bias metrics you'd like to include in your bias report. Hover over the info icons for more information about each metric.

1. (Optional) When prompted with the option **Would you like to analyze additional metrics?**, select **Yes** to view and include more bias metrics.

1. When you're ready to create the bias report, choose **Add**.

Once generated, the report gives you an overview of the bias metrics you selected. You can view the bias report at any time from the **Analyses** tab of your data flow.
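Canvas computes the bias metrics for you. Conceptually, the simplest of them, class imbalance for the facet variable, can be sketched by hand as follows (the column names and data here are hypothetical, not a Canvas API).

```python
import pandas as pd

# Hypothetical customer data: conversion outcome and an age-group facet.
df = pd.DataFrame({
    "converted": ["yes", "no", "yes", "no", "no", "yes", "no", "no"],
    "age_group": ["<40", "<40", "<40", "<40", "<40", ">=40", ">=40", ">=40"],
})

# Class imbalance (CI) for the facet: (n_a - n_d) / (n_a + n_d), where
# n_a and n_d count the two facet groups being compared.
n_a = (df["age_group"] == "<40").sum()
n_d = (df["age_group"] == ">=40").sum()
ci = (n_a - n_d) / (n_a + n_d)
print(ci)  # 0.25: the <40 group is overrepresented in the data
```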

## Histogram
<a name="canvas-visualize-histogram"></a>

Use histograms to see the counts of feature values for a specific feature. You can inspect the relationships between features using the **Color by** option.

You can use the **Facet by** feature to create histograms of one column, for each value in another column. 

## Scatter plot
<a name="canvas-visualize-scatter-plot"></a>

Use the **Scatter Plot** feature to inspect the relationship between features. To create a scatter plot, select a feature to plot on the **X axis** and the **Y axis**. Both of these columns must have a numeric data type. 

You can color scatter plots by an additional column. 

Additionally, you can facet scatter plots by features.

## Table summary
<a name="canvas-table-summary"></a>

Use the **Table Summary** analysis to quickly summarize your data.

For columns with numerical data, including long and float data, a table summary reports the number of entries (count), minimum (min), maximum (max), mean, and standard deviation (stddev) for each column.

For columns with non-numerical data, including columns with string, Boolean, or date/time data, a table summary reports the number of entries (count), least frequent value (min), and most frequent value (max). 
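The following pandas sketch (with hypothetical data) shows how these summary statistics can be reproduced, including the least-frequent and most-frequent values reported for non-numeric columns.

```python
import pandas as pd

# Hypothetical dataset with one numeric and one non-numeric column.
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 28.0],
    "embarked": ["S", "C", "S", "S", "Q"],
})

# Numeric columns: count, min, max, mean, and standard deviation.
numeric_summary = df["age"].agg(["count", "min", "max", "mean", "std"])
print(numeric_summary)

# Non-numeric columns: count, plus the least and most frequent values.
counts = df["embarked"].value_counts()
non_numeric_summary = {
    "count": int(df["embarked"].count()),
    "min": counts.idxmin(),   # least frequent value
    "max": counts.idxmax(),   # most frequent value
}
print(non_numeric_summary)
```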

## Quick model
<a name="canvas-quick-model"></a>

Use the **Quick Model** visualization to quickly evaluate your data and produce importance scores for each feature. A [feature importance score](http://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTreeClassificationModel.featureImportances) indicates how useful a feature is at predicting a target label. The feature importance score is between [0, 1], and a higher number indicates that the feature is more important to the whole dataset. At the top of the quick model chart, there is a model score. A classification problem shows an F1 score. A regression problem has a mean squared error (MSE) score.

When you create a quick model chart, you select a dataset you want evaluated, and a target label against which you want feature importance to be compared. Data Wrangler does the following:
+ Infers the data types for the target label and each feature in the dataset selected. 
+ Determines the problem type. Based on the number of distinct values in the label column, Data Wrangler determines if this is a regression or classification problem type. Data Wrangler sets a categorical threshold to 100. If there are more than 100 distinct values in the label column, Data Wrangler classifies it as a regression problem; otherwise, it is classified as a classification problem. 
+ Pre-processes features and label data for training. The algorithm used requires encoding features to vector type and encoding labels to double type. 
+ Trains a random forest algorithm with 70% of data. Spark’s [RandomForestRegressor](https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-regression) is used to train a model for regression problems. The [RandomForestClassifier](https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier) is used to train a model for classification problems.
+ Evaluates a random forest model with the remaining 30% of data. Data Wrangler evaluates classification models using an F1 score and evaluates regression models using an MSE score.
+ Calculates feature importance for each feature using the Gini importance method. 
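Data Wrangler performs these steps internally with Spark ML. The following scikit-learn sketch, using a synthetic classification dataset, only illustrates the same recipe of a 70/30 split, an F1 score, and Gini-based feature importances; it is not the Canvas implementation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a prepared dataset with a categorical label.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# 70/30 train/evaluation split, mirroring the quick model recipe.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

# Random forest classifier; a regression problem would use a regressor
# and MSE instead of F1.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
score = f1_score(y_test, model.predict(X_test))

# Gini-based feature importances, normalized so that they sum to 1.
importances = model.feature_importances_
print(round(score, 3), importances.round(3))
```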

## Target leakage
<a name="canvas-analysis-target-leakage"></a>

Target leakage occurs when there is data in a machine learning training dataset that is strongly correlated with the target label, but is not available in real-world data. For example, you may have a column in your dataset that serves as a proxy for the column you want to predict with your model. 

When you use the **Target Leakage** analysis, you specify the following:
+ **Target**: This is the feature about which you want your ML model to be able to make predictions.
+ **Problem type**: This is the ML problem type on which you are working. Problem type can either be **classification** or **regression**. 
+  (Optional) **Max features**: This is the maximum number of features to present in the visualization, which shows features ranked by their risk of being target leakage.

For classification, the target leakage analysis uses the area under the receiver operating characteristic curve (AUC-ROC) for each column, up to **Max features**. For regression, it uses the coefficient of determination (R2) metric.

The AUC-ROC metric is computed individually for each column using cross-validation, on a sample of up to approximately 1,000 rows. A score of 1 indicates perfect predictive ability, which often indicates target leakage. A score of 0.5 or lower indicates that the information in the column could not, on its own, provide any useful information toward predicting the target. Although a column can be uninformative on its own but useful for predicting the target when used in tandem with other features, a low score could indicate that the feature is redundant.
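The following scikit-learn sketch (synthetic data, not the actual Canvas implementation) shows why a per-column cross-validated AUC near 1 flags leakage, while an uninformative column scores near 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, n)

# "leaky" is almost a copy of the label; "noise" is unrelated to it.
X_leaky = (y + rng.normal(0, 0.05, n)).reshape(-1, 1)
X_noise = rng.normal(0, 1, n).reshape(-1, 1)

def column_auc(col, y):
    # Cross-validated AUC of a one-feature model, as in the leakage report.
    return cross_val_score(
        LogisticRegression(), col, y, cv=5, scoring="roc_auc"
    ).mean()

auc_leaky = column_auc(X_leaky, y)
auc_noise = column_auc(X_noise, y)
print(round(auc_leaky, 2), round(auc_noise, 2))  # near 1.0 vs near 0.5
```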

## Multicollinearity
<a name="canvas-multicollinearity"></a>

Multicollinearity is a circumstance where two or more predictor variables are related to each other. The predictor variables are the features in your dataset that you're using to predict a target variable. When you have multicollinearity, the predictor variables are not only predictive of the target variable, but also predictive of each other.

You can use the **Variance Inflation Factor (VIF)**, **Principal Component Analysis (PCA)**, or **Lasso feature selection** as measures for the multicollinearity in your data. For more information, see the following.

------
#### [ Variance Inflation Factor (VIF) ]

The Variance Inflation Factor (VIF) is a measure of collinearity among variable pairs. Data Wrangler returns a VIF score as a measure of how closely the variables are related to each other. A VIF score is a positive number that is greater than or equal to 1.

A score of 1 means that the variable is uncorrelated with the other variables. Scores greater than 1 indicate higher correlation.

Theoretically, you can have a VIF score with a value of infinity. Data Wrangler clips high scores to 50. If you have a VIF score greater than 50, Data Wrangler sets the score to 50.

You can use the following guidelines to interpret your VIF scores:
+ A VIF score less than or equal to 5 indicates that the variable is moderately correlated with the other variables.
+ A VIF score greater than 5 indicates that the variable is highly correlated with the other variables.
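A VIF score can be sketched directly from its definition, VIF = 1 / (1 - R²), where R² comes from regressing one variable on all of the others. The following example uses scikit-learn and synthetic data, and applies the same clipping at 50 described above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = 2 * x1 + rng.normal(scale=0.1, size=n)  # strongly collinear with x1
x3 = rng.normal(size=n)                      # independent of the others
X = np.column_stack([x1, x2, x3])

def vif(X, i, cap=50.0):
    # VIF_i = 1 / (1 - R^2) from regressing column i on the other columns,
    # clipped at 50 as Data Wrangler does.
    others = np.delete(X, i, axis=1)
    r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
    return min(1.0 / (1.0 - r2), cap) if r2 < 1 else cap

scores = [round(vif(X, i), 1) for i in range(X.shape[1])]
print(scores)  # x1 and x2 score far above 5; x3 stays near 1
```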

------
#### [ Principal Component Analysis (PCA) ]

Principal Component Analysis (PCA) measures the variance of the data along different directions in the feature space. The feature space consists of all the predictor variables that you use to predict the target variable in your dataset.

For example, if you're trying to predict who survived on the *RMS Titanic* after it hit an iceberg, your feature space can include the passengers' age, gender, and the fare that they paid.

From the feature space, PCA generates an ordered list of variances. These variances are also known as singular values. The values in the list of variances are greater than or equal to 0. We can use them to determine how much multicollinearity there is in our data.

When the numbers are roughly uniform, the data has very few instances of multicollinearity. When there is a lot of variability among the values, we have many instances of multicollinearity. Before it performs PCA, Data Wrangler normalizes each feature to have a mean of 0 and a standard deviation of 1.

**Note**  
PCA in this circumstance can also be referred to as Singular Value Decomposition (SVD).
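The following NumPy sketch (synthetic data) shows the idea: after normalizing each feature, the singular values from SVD reveal multicollinearity when one of them is close to 0 while the others are much larger.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = a + rng.normal(scale=0.05, size=n)   # nearly a copy of a
c = rng.normal(size=n)                   # independent feature
X = np.column_stack([a, b, c])

# Normalize each feature to mean 0 and standard deviation 1, as Data
# Wrangler does before performing PCA.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# The squared singular values, divided by n, give the variance along each
# principal direction, in descending order.
singular_values = np.linalg.svd(Xs, compute_uv=False)
variances = singular_values**2 / n
print(np.round(variances, 3))  # large spread: the data is highly collinear
```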

------
#### [ Lasso feature selection ]

Lasso feature selection uses the L1 regularization technique to only include the most predictive features in your dataset.

For both classification and regression, the regularization technique generates a coefficient for each feature. The absolute value of the coefficient provides an importance score for the feature. A higher importance score indicates that it is more predictive of the target variable. A common feature selection method is to use all the features that have a non-zero lasso coefficient.
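A minimal scikit-learn sketch of lasso feature selection, using synthetic data in which only the first two of four features actually drive the target:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 4))
# Only the first two features drive the target; the rest are noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=n)

model = Lasso(alpha=0.1).fit(X, y)

# The absolute coefficient is the importance score; keep the features
# with a non-zero lasso coefficient.
importance = np.abs(model.coef_)
selected = [i for i, c in enumerate(importance) if c > 0]
print(np.round(importance, 2), selected)
```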

------

## Detect anomalies in time series data
<a name="canvas-time-series-anomaly-detection"></a>

You can use the anomaly detection visualization to see outliers in your time series data. To understand what determines an anomaly, it helps to know that we decompose the time series into a predicted term and an error term. We treat the seasonality and trend of the time series as the predicted term. We treat the residuals as the error term.

For the error term, you specify a threshold as the number of standard deviations that the residual can be away from the mean before it is considered an anomaly. For example, you can specify a threshold of 3 standard deviations. Any residual greater than 3 standard deviations away from the mean is an anomaly.
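The threshold rule can be sketched in a few lines of NumPy, assuming you already have the residuals (error term) of the series:

```python
import numpy as np

rng = np.random.default_rng(0)
# Residuals (error term) left over after removing trend and seasonality.
residuals = rng.normal(loc=0.0, scale=1.0, size=500)
residuals[100] = 6.0   # inject an obvious outlier

threshold = 3  # number of standard deviations
mean, std = residuals.mean(), residuals.std()

# Flag every residual more than `threshold` standard deviations from the mean.
anomalies = np.where(np.abs(residuals - mean) > threshold * std)[0]
print(anomalies)  # includes index 100
```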

You can use the following procedure to perform an **Anomaly detection** analysis.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), and select **Add analysis**.

1. For **Analysis type**, choose **Time Series**.

1. For **Visualization**, choose **Anomaly detection**.

1. For **Anomaly threshold**, choose the threshold above which a value is considered an anomaly.

1. Choose **Preview** to generate a preview of the analysis.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

## Seasonal trend decomposition in time series data
<a name="canvas-seasonal-trend-decomposition"></a>

You can determine whether there's seasonality in your time series data by using the Seasonal Trend Decomposition visualization. We use the STL (Seasonal Trend decomposition using LOESS) method to perform the decomposition. We decompose the time series into its seasonal, trend, and residual components. The trend reflects the long term progression of the series. The seasonal component is a signal that recurs in a time period. After removing the trend and the seasonal components from the time series, you have the residual.
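Canvas performs the decomposition with STL (LOESS). The following simplified pandas sketch uses only a centered moving average for the trend and per-phase means for the seasonal component, just to illustrate the three parts; it is not the STL algorithm itself.

```python
import numpy as np
import pandas as pd

# Monthly-like series: upward trend plus a 12-step seasonal cycle and noise.
t = np.arange(120)
noise = np.random.default_rng(0).normal(0, 1, 120)
s = pd.Series(0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + noise)

# Trend: centered moving average over one full seasonal period.
trend = s.rolling(12, center=True, min_periods=1).mean()

# Seasonal component: the average of the detrended series at each phase.
seasonal = (s - trend).groupby(t % 12).transform("mean")

# Residual: what remains after removing the trend and seasonal components.
residual = s - trend - seasonal
print(round(float(residual.abs().mean()), 2))
```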

You can use the following procedure to perform a **Seasonal-Trend decomposition** analysis.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), and select **Add analysis**.

1. For **Analysis type**, choose **Time Series**.

1. For **Visualization**, choose **Seasonal-Trend decomposition**.

1. For **Anomaly threshold**, choose the threshold above which a value is considered an anomaly.

1. Choose **Preview** to generate a preview of the analysis.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

## Create custom visualizations
<a name="canvas-visualize-custom"></a>

You can add an analysis to your Data Wrangler flow to create a custom visualization. Your dataset, with all the transformations you've applied, is available as a [Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Data Wrangler uses the `df` variable to store the dataframe. You access the dataframe by calling the variable.

You must provide the output variable, `chart`, to store an [Altair](https://altair-viz.github.io/) output chart. For example, you can use the following code block to create a custom histogram using the Titanic dataset.

```
import altair as alt
df = df.iloc[:30]
df = df.rename(columns={"Age": "value"})
df = df.assign(count=df.groupby('value').value.transform('count'))
df = df[["value", "count"]]
base = alt.Chart(df)
bar = base.mark_bar().encode(x=alt.X('value', bin=True, axis=None), y=alt.Y('count'))
rule = base.mark_rule(color='red').encode(
    x='mean(value):Q',
    size=alt.value(5))
chart = bar + rule
```

**To create a custom visualization:**

1. Next to the node containing the transformation that you'd like to visualize, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)).

1. Choose **Add analysis**.

1. For **Analysis type**, choose **Custom Visualization**.

1. For **Analysis name**, specify a name.

1. Enter your code in the code box. 

1. Choose **Preview** to preview your visualization.

1. Choose **Save** to add your visualization.

If you don’t know how to use the Altair visualization package in Python, you can use custom code snippets to help you get started.

Data Wrangler has a searchable collection of visualization snippets. To use a visualization snippet, choose **Search example snippets** and specify a query in the search bar.

The following example uses the **Binned scatterplot** code snippet, which plots a two-dimensional histogram.

The snippets have comments to help you understand the changes that you need to make to the code. You usually need to specify the column names of your dataset in the code.

```
import altair as alt

# Specify the number of top rows for plotting
rows_number = 1000
df = df.head(rows_number)  
# You can also choose bottom rows or randomly sampled rows
# df = df.tail(rows_number)
# df = df.sample(rows_number)


chart = (
    alt.Chart(df)
    .mark_circle()
    .encode(
        # Specify the column names for binning and number of bins for X and Y axis
        x=alt.X("col1:Q", bin=alt.Bin(maxbins=20)),
        y=alt.Y("col2:Q", bin=alt.Bin(maxbins=20)),
        size="count()",
    )
)

# :Q specifies that label column has quantitative type.
# For more details on Altair typing refer to
# https://altair-viz.github.io/user_guide/encoding.html#encoding-data-types
```

# Transform data
<a name="canvas-transform"></a>

Amazon SageMaker Data Wrangler provides numerous ML data transforms to streamline cleaning and featurizing your data. Using the interactive data preparation tools in Data Wrangler, you can sample datasets of any size with a variety of sampling techniques and start exploring your data in a matter of minutes. After finalizing your data transforms on the sampled data, you can then scale the data flow to apply those transformations to the entire dataset.

When you add a transform, it adds a step to the data flow. Each transform you add modifies your dataset and produces a new dataframe. All subsequent transforms apply to the resulting dataframe.

Data Wrangler includes built-in transforms, which you can use to transform columns without any code. If you know how you want to prepare your data but don't know how to get started or which transforms to use, you can use the chat for data prep feature to interact conversationally with Data Wrangler and apply transforms using natural language. For more information, see [Chat for data prep](canvas-chat-for-data-prep.md). 

You can also add custom transformations using PySpark, Python (User-Defined Function), pandas, and PySpark SQL. Some transforms operate in place, while others create a new output column in your dataset.

You can apply transforms to multiple columns at once. For example, you can delete multiple columns in a single step.

You can apply the **Process numeric** and **Handle missing** transforms only to a single column.

Use this page to learn more about the built-in and custom transforms offered by Data Wrangler.

## Join Datasets
<a name="canvas-transform-join"></a>

You can join datasets directly in your data flow. When you join two datasets, the resulting joined dataset appears in your flow. The following join types are supported by Data Wrangler.
+ **Left outer** – Include all rows from the left table. If the value for the column joined on a left table row does not match any right table row values, that row contains null values for all right table columns in the joined table.
+ **Left anti** – Include rows from the left table that do not contain values in the right table for the joined column.
+ **Left semi** – Include a single row from the left table for all identical rows that satisfy the criteria in the join statement. This excludes duplicate rows from the left table that match the criteria of the join.
+ **Right outer** – Include all rows from the right table. If the value for the joined column in a right table row does not match any left table row values, that row contains null values for all left table columns in the joined table.
+ **Inner** – Include rows from left and right tables that contain matching values in the joined column. 
+ **Full outer** – Include all rows from the left and right tables. If the row value for the joined column in either table does not match, separate rows are created in the joined table. If a row doesn’t contain a value for a column in the joined table, null is inserted for that column.
+ **Cartesian cross** – Include rows that combine each row from the first table with each row from the second table. This is a [Cartesian product](https://en.wikipedia.org/wiki/Cartesian_product) of rows from the tables in the join. The size of this product is the size of the left table times the size of the right table. Therefore, we recommend caution in using this join between very large datasets. 
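Data Wrangler executes joins with Spark. The following pandas sketch (hypothetical tables) only illustrates how many rows each of these join types produces.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "fare": [10, 20, 30]})

inner = left.merge(right, on="id", how="inner")       # ids 2, 3 only
left_outer = left.merge(right, on="id", how="left")   # ids 1, 2, 3; fare is NaN for 1
full_outer = left.merge(right, on="id", how="outer")  # ids 1, 2, 3, 4
cross = left.merge(right, how="cross")                # 3 x 3 = 9 rows

# Left anti: rows in the left table with no match in the right table.
left_anti = left[~left["id"].isin(right["id"])]

print(len(inner), len(left_outer), len(full_outer), len(cross), len(left_anti))
```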

Use the following procedure to join two datasets. You should have already imported two data sources into your data flow.

1. Select the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) next to the left node that you want to join. The first node you select is always the left table in your join. 

1. Hover over **Combine data**, and then choose **Join**.

1. Select the right node. The second node you select is always the right table in your join.

1. The **Join type** field is set to **Inner join** by default. Select the dropdown menu to change the join type.

1. For **Join keys**, verify the columns from the left and right tables that you want to use to join the data. You can add or remove additional join keys.

1. For **Name of join**, enter a name for the joined data, or use the default name.

1. (Optional) Choose **Preview** to preview the joined data.

1. Choose **Add** to complete the join.

**Note**  
If you receive a notice that Canvas didn't identify any matching rows when joining your data, we recommend that you either verify that you've selected the correct columns, or update your sample to try to find matching rows. You can choose a different sampling strategy or change the size of the sample. For information about how to edit the sample, see [Edit the data flow sampling configuration](canvas-data-flow-edit-sampling.md).

You should now see a join node added to your data flow.

## Concatenate Datasets
<a name="canvas-transform-concatenate"></a>

Concatenating combines two datasets by appending the rows from one dataset to another.

Use the following procedure to concatenate two datasets. You should have already imported two data sources into your data flow.

**To concatenate two datasets:**

1. Select the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) next to the left node that you want to concatenate. The first node you select is always the left table in your concatenate operation. 

1. Hover over **Combine data**, and then choose **Concatenate**.

1. Select the right node. The second node you select is always the right table in your concatenate.

1. (Optional) Select the checkbox next to **Remove duplicates after concatenation** to remove duplicate columns. 

1. (Optional) Select the checkbox next to **Add column to indicate source dataframe** to add a column to the resulting dataframe that lists the source dataset for each record.

   1. For **Indicator column name**, enter a name for the added column.

   1. For **First dataset indicating string**, enter the value you want to use to mark records from the first dataset (or the left node).

   1. For **Second dataset indicating string**, enter the value you want to use to mark records from the second dataset (or the right node).

1. For **Name of concatenate**, enter a name for the concatenation.

1. (Optional) Choose **Preview** to preview the concatenated data.

1. Choose **Add** to add the new dataset to your data flow. 

You should now see a concatenate node added to your data flow.

## Balance Data
<a name="canvas-transform-balance-data"></a>

You can balance the data for datasets with an underrepresented category. Balancing a dataset can help you create better models for binary classification.

**Note**  
You can't balance datasets containing column vectors.

You can use the **Balance data** operation to balance your data using one of the following operators:
+ *Random oversampling* – Randomly duplicates samples in the minority category. For example, if you're trying to detect fraud, you might only have cases of fraud in 10% of your data. For an equal proportion of fraudulent and non-fraudulent cases, this operator randomly duplicates fraud cases within the dataset 8 times.
+ *Random undersampling* – The counterpart of random oversampling. Randomly removes samples from the overrepresented category to achieve the proportion of samples that you desire.
+ *Synthetic Minority Oversampling Technique (SMOTE)* – Uses samples from the underrepresented category to interpolate new synthetic minority samples. For more information about SMOTE, see the following description.

You can use all transforms for datasets containing both numeric and non-numeric features. SMOTE interpolates values by using neighboring samples. Data Wrangler uses the R-squared distance to determine the neighborhood to interpolate the additional samples. Data Wrangler only uses numeric features to calculate the distances between samples in the underrepresented group.

For two real samples in the underrepresented group, Data Wrangler interpolates the numeric features by using a weighted average, randomly assigning the weights in the range of [0, 1]. For samples A and B, Data Wrangler could randomly assign a weight of 0.7 to A and 0.3 to B. The interpolated sample then has a value of 0.7A + 0.3B.

Data Wrangler interpolates non-numeric features by copying from either of the interpolated real samples. It copies the samples with a probability that it randomly assigns to each sample. For samples A and B, it can assign probabilities 0.8 to A and 0.2 to B. For the probabilities it assigned, it copies A 80% of the time.
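The interpolation described above can be sketched in plain Python (the sample values and feature names are hypothetical, not the Canvas implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two real minority-class samples: numeric features plus one category.
sample_a = {"amount": 100.0, "age": 30.0, "channel": "web"}
sample_b = {"amount": 200.0, "age": 50.0, "channel": "store"}

w = rng.uniform(0, 1)  # random weight in the range [0, 1]

synthetic = {
    # Numeric features: weighted average w*A + (1 - w)*B.
    "amount": w * sample_a["amount"] + (1 - w) * sample_b["amount"],
    "age": w * sample_a["age"] + (1 - w) * sample_b["age"],
    # Non-numeric features: copied from A with probability w, else from B.
    "channel": sample_a["channel"] if rng.uniform() < w else sample_b["channel"],
}
print(synthetic)
```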

## Custom Transforms
<a name="canvas-transform-custom"></a>

The **Custom Transforms** group allows you to use Python (User-Defined Function), PySpark, pandas, or PySpark (SQL) to define custom transformations. For each of these options, you use the variable `df` to access the dataframe to which you want to apply the transform. To apply your custom code to your dataframe, assign the dataframe with the transformations that you've made to the `df` variable. If you're not using Python (User-Defined Function), you don't need to include a return statement. Choose **Preview** to preview the result of the custom transform. Choose **Add** to add the custom transform to your list of **Previous steps**.

You can import the popular libraries with an `import` statement in the custom transform code block, such as the following:
+ NumPy version 1.19.0
+ scikit-learn version 0.23.2
+ SciPy version 1.5.4
+ pandas version 1.0.3
+ PySpark version 3.0.0

**Important**  
**Custom transform** doesn't support columns with spaces or special characters in the name. We recommend that you specify column names that only have alphanumeric characters and underscores. You can use the **Rename column** transform in the **Manage columns** transform group to remove spaces from a column's name. You can also add a **Python (Pandas)** **Custom transform** similar to the following to remove spaces from multiple columns in a single step. This example changes columns named `A column` and `B column` to `A_column` and `B_column` respectively.   

```
df = df.rename(columns={"A column": "A_column", "B column": "B_column"})
```

If you include print statements in the code block, the result appears when you select **Preview**. You can resize the custom code transformer panel. Resizing the panel provides more space to write code. 

The following sections provide additional context and examples for writing custom transform code.

**Python (User-Defined Function)**

The Python function gives you the ability to write custom transformations without needing to know Apache Spark or pandas. Data Wrangler is optimized to run your custom code quickly, so custom Python code gives you performance similar to that of an Apache Spark plugin.

To use the Python (User-Defined Function) code block, you specify the following:
+ **Input column** – The input column where you're applying the transform.
+ **Mode** – The scripting mode, either pandas or Python.
+ **Return type** – The data type of the value that you're returning.

Using the pandas mode gives better performance. The Python mode makes it easier for you to write transformations by using pure Python functions.

**PySpark**

The following example extracts date and time from a timestamp.

```
from pyspark.sql.functions import from_unixtime, to_date, date_format
df = df.withColumn('DATE_TIME', from_unixtime('TIMESTAMP'))
df = df.withColumn('EVENT_DATE', to_date('DATE_TIME')) \
    .withColumn('EVENT_TIME', date_format('DATE_TIME', 'HH:mm:ss'))
```

**pandas**

The following example provides an overview of the dataframe to which you are adding transforms. 

```
df.info()
```

**PySpark (SQL)**

The following example creates a new dataframe with four columns: *name*, *fare*, *pclass*, *survived*.

```
SELECT name, fare, pclass, survived FROM df
```

If you don’t know how to use PySpark, you can use custom code snippets to help you get started.

Data Wrangler has a searchable collection of code snippets. You can use the code snippets to perform tasks such as dropping columns, grouping by columns, or modeling.

To use a code snippet, choose **Search example snippets** and specify a query in the search bar. The text you specify in the query doesn’t have to match the name of the code snippet exactly.

The following example shows a **Drop duplicate rows** code snippet that deletes rows with identical values in the columns that you specify. You can find the code snippet by searching for one of the following:
+ Duplicates
+ Identical
+ Remove

The following snippet has comments to help you understand the changes that you need to make. For most snippets, you must specify the column names of your dataset in the code.

```
# Specify the subset of columns
# all rows having identical values in these columns will be dropped

subset = ["col1", "col2", "col3"]
df = df.dropDuplicates(subset)  

# to drop the full-duplicate rows run
# df = df.dropDuplicates()
```

To use a snippet, copy and paste its content into the **Custom transform** field. You can copy and paste multiple code snippets into the custom transform field.

## Custom Formula
<a name="canvas-transform-custom-formula"></a>

Use **Custom formula** to define a new column using a Spark SQL expression to query data in the current dataframe. The query must use the conventions of Spark SQL expressions.

**Important**  
**Custom formula** doesn't support columns with spaces or special characters in the name. We recommend that you specify column names that only have alphanumeric characters and underscores. You can use the **Rename column** transform in the **Manage columns** transform group to remove spaces from a column's name. You can also add a **Python (Pandas)** **Custom transform** similar to the following to remove spaces from multiple columns in a single step. This example changes columns named `A column` and `B column` to `A_column` and `B_column` respectively.   

```
df = df.rename(columns={"A column": "A_column", "B column": "B_column"})
```

You can use this transform to perform operations on columns, referencing the columns by name. For example, assuming the current dataframe contains columns named `col_a` and `col_b`, you can use the following expression to produce an **Output column** that is the product of these two columns:

```
col_a * col_b
```

Other common operations include the following, assuming a dataframe contains `col_a` and `col_b` columns:
+ Concatenate two columns: `concat(col_a, col_b)`
+ Add two columns: `col_a + col_b`
+ Subtract two columns: `col_a - col_b`
+ Divide two columns: `col_a / col_b`
+ Take the absolute value of a column: `abs(col_a)`

For more information, see the [Spark documentation](http://spark.apache.org/docs/latest/api/python) on selecting data. 

## Reduce Dimensionality within a Dataset
<a name="canvas-transform-dimensionality-reduction"></a>

Reduce the dimensionality in your data by using Principal Component Analysis (PCA). The dimensionality of your dataset corresponds to the number of features. When you use dimensionality reduction in Data Wrangler, you get a new set of features called components. Each component accounts for some variability in the data.

The first component accounts for the largest amount of variation in the data. The second component accounts for the second largest amount of variation in the data, and so on.

You can use dimensionality reduction to reduce the size of the datasets that you use to train models. Instead of using all of the features in your dataset, you can use the principal components.

To perform PCA, Data Wrangler creates axes for your data. An axis is an affine combination of columns in your dataset. The first principal component is the value on the axis that has the largest amount of variance. The second principal component is the value on the axis that has the second largest amount of variance. The nth principal component is the value on the axis that has the nth largest amount of variance.

You can configure the number of principal components that Data Wrangler returns. You can either specify the number of principal components directly or you can specify the variance threshold percentage. Each principal component explains an amount of variance in the data. For example, you might have a principal component with a value of 0.5. The component would explain 50% of the variation in the data. When you specify a variance threshold percentage, Data Wrangler returns the smallest number of components that meet the percentage that you specify.

The following are example principal components with the amount of variance that they explain in the data.
+ Component 1 – 0.5
+ Component 2 – 0.45
+ Component 3 – 0.05

If you specify a variance threshold percentage of `94` or `95`, Data Wrangler returns Component 1 and Component 2. If you specify a variance threshold percentage of `96`, Data Wrangler returns all three principal components.
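
The selection rule above can be sketched in a few lines of Python. This is a hand-rolled illustration of the cumulative-variance rule, not Data Wrangler's implementation, and the thresholds are expressed as fractions rather than percentages:

```python
def explained_variance_ratios(component_variances):
    """Normalize per-component variances so the ratios sum to 1."""
    total = sum(component_variances)
    return [v / total for v in component_variances]

def n_components_for(ratios, threshold):
    """Return the smallest number of principal components whose cumulative
    explained variance meets the threshold, per the rule described above."""
    cumulative = 0.0
    for i, ratio in enumerate(ratios, start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return i
    return len(ratios)

# Using the example components above (0.5, 0.45, 0.05): a threshold of
# 0.94 selects two components, while 0.96 requires all three.
```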

To run PCA on your dataset, do the following.

1. Open your Data Wrangler data flow.

1. Choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Dimensionality Reduction**.

1. For **Input Columns**, choose the features that you're reducing into the principal components.

1. (Optional) For **Number of principal components**, choose the number of principal components that Data Wrangler returns in your dataset. If you specify a value for the field, you can't specify a value for **Variance threshold percentage**.

1. (Optional) For **Variance threshold percentage**, specify the percentage of variation in the data that you want explained by the principal components. Data Wrangler uses the default value of `95` if you don't specify a value for the variance threshold. You can't specify a variance threshold percentage if you've specified a value for **Number of principal components**.

1. (Optional) Deselect **Center** to not use the mean of the columns as the center of the data. By default, Data Wrangler centers the data with the mean before scaling.

1. (Optional) Deselect **Scale** to not scale the data with the unit standard deviation.

1. (Optional) Choose **Columns** to output the components to separate columns. Choose **Vector** to output the components as a single vector.

1. (Optional) For **Output column**, specify a name for an output column. If you're outputting the components to separate columns, the name that you specify is a prefix. If you're outputting the components to a vector, the name that you specify is the name of the vector column.

1. (Optional) Select **Keep input columns**. We don't recommend selecting this option if you plan on only using the principal components to train your model.

1. Choose **Preview**.

1. Choose **Add**.

## Encode Categorical
<a name="canvas-transform-cat-encode"></a>

Categorical data is usually composed of a finite number of categories, where each category is represented with a string. For example, if you have a table of customer data, a column that indicates the country a person lives in is categorical. The categories would be *Afghanistan*, *Albania*, *Algeria*, and so on. Categorical data can be *nominal* or *ordinal*. Ordinal categories have an inherent order, and nominal categories do not. The highest degree obtained (*High school*, *Bachelors*, *Masters*, and so on) is an example of ordinal categories. 

Encoding categorical data is the process of creating a numerical representation for categories. For example, if your categories are *Dog* and *Cat*, you may encode this information into two vectors, `[1,0]` to represent *Dog*, and `[0,1]` to represent *Cat*.

When you encode ordinal categories, you may need to translate the natural order of categories into your encoding. For example, you can represent the highest degree obtained with the following map: `{"High school": 1, "Bachelors": 2, "Masters":3}`.

Use categorical encoding to encode categorical data that is in string format into arrays of integers. 
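
The two basic encodings can be sketched in plain Python. This is a minimal illustration of the idea, not Data Wrangler's implementation, and it omits options such as invalid-value handling and **Drop last category**:

```python
def ordinal_encode(values, order):
    """Map each category to an integer (starting at 0) according to an
    explicit category order, as in the degree example above."""
    mapping = {category: i for i, category in enumerate(order)}
    return [mapping[v] for v in values]

def one_hot_encode(values, categories):
    """Represent each category as an indicator vector,
    e.g. Dog -> [1, 0] and Cat -> [0, 1]."""
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0] * len(categories)
        vec[index[v]] = 1
        vectors.append(vec)
    return vectors
```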

The Data Wrangler categorical encoders create encodings for all categories that exist in a column at the time the step is defined. If new categories have been added to a column when you start a Data Wrangler job to process your dataset at time *t*, and this column was the input for a Data Wrangler categorical encoding transform at time *t*-1, these new categories are considered *missing* in the Data Wrangler job. The option you select for **Invalid handling strategy** is applied to these missing values. Examples of when this can occur are: 
+ When you use a .flow file to create a Data Wrangler job to process a dataset that was updated after the creation of the data flow. For example, you may use a data flow to regularly process sales data each month. If that sales data is updated weekly, new categories may be introduced into columns for which an encode categorical step is defined. 
+ When you select **Sampling** when you import your dataset, some categories may be left out of the sample. 

In these situations, these new categories are considered missing values in the Data Wrangler job.

You can choose from and configure *ordinal encode*, *one-hot encode*, and *similarity encode* transforms. Use the following sections to learn more about these options. 

The ordinal and one-hot transforms create a new column named **Output column name**. You specify the output format of this column with **Output style**:
+ Select **Vector** to produce a single column with a sparse vector. 
+ Select **Columns** to create a column for every category with an indicator variable for whether the text in the original column contains a value that is equal to that category.

### Ordinal Encode
<a name="canvas-transform-cat-encode-ordinal"></a>

Select **Ordinal encode** to encode categories into an integer between 0 and the total number of categories in the **Input column** you select.

**Invalid handling strategy**: Select a method to handle invalid or missing values. 
+ Choose **Skip** if you want to omit the rows with missing values.
+ Choose **Keep** to retain missing values as the last category.
+ Choose **Error** if you want Data Wrangler to throw an error if missing values are encountered in the **Input column**.
+ Choose **Replace with NaN** to replace missing with NaN. This option is recommended if your ML algorithm can handle missing values. Otherwise, the first three options in this list may produce better results.

### One-Hot Encode
<a name="canvas-transform-cat-encode-onehot"></a>

Select **One-hot encode** for **Transform** to use one-hot encoding. Configure this transform using the following: 
+ **Drop last category**: If `True`, the last category does not have a corresponding index in the one-hot encoding. When missing values are possible, a missing category is always the last one and setting this to `True` means that a missing value results in an all zero vector.
+ **Invalid handling strategy**: Select a method to handle invalid or missing values. 
  + Choose **Skip** if you want to omit the rows with missing values.
  + Choose **Keep** to retain missing values as the last category.
  + Choose **Error** if you want Data Wrangler to throw an error if missing values are encountered in the **Input column**.
+ **Is input ordinal encoded**: Select this option if the input vector contains ordinal encoded data. This option requires that input data contain non-negative integers. If **True**, input *i* is encoded as a vector with a non-zero in the *i*th location. 

### Similarity encode
<a name="canvas-transform-cat-encode-similarity"></a>

Use similarity encoding when you have the following:
+ A large number of categorical variables
+ Noisy data

The similarity encoder creates embeddings for columns with categorical data. An embedding is a mapping of discrete objects, such as words, to vectors of real numbers. It encodes similar strings to vectors containing similar values. For example, it creates very similar encodings for "California" and "Calfornia".

Data Wrangler converts each category in your dataset into a set of tokens using a 3-gram tokenizer. It converts the tokens into an embedding using min-hash encoding.

The similarity encodings that Data Wrangler creates:
+ Have low dimensionality
+ Are scalable to a large number of categories
+ Are robust and resistant to noise

For the preceding reasons, similarity encoding is more versatile than one-hot encoding.
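
The 3-gram plus min-hash idea can be sketched as follows. This is an illustration of the technique, not Data Wrangler's encoder; the hash construction and dimension are assumptions, and the sketch assumes input strings of at least three characters:

```python
import hashlib

def char_ngrams(text, n=3):
    """Split a string into its set of overlapping character 3-grams."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(tokens, dim=8):
    """One min-hash value per seeded hash function. Similar token sets
    produce signatures that agree in many positions."""
    signature = []
    for seed in range(dim):
        signature.append(min(
            int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in tokens
        ))
    return signature

def similarity(a, b, dim=8):
    """Fraction of signature positions on which the two strings agree."""
    sig_a = minhash_signature(char_ngrams(a), dim)
    sig_b = minhash_signature(char_ngrams(b), dim)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / dim
```

Because "California" and "Calfornia" share most of their 3-grams, their signatures agree in far more positions than "California" and an unrelated string.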

To add the similarity encoding transform to your dataset, do the following.

1. Sign in to the [Amazon SageMaker AI Console](https://console.aws.amazon.com/sagemaker/).

1. Choose **Open Studio Classic**.

1. Choose **Launch app**.

1. Choose **Studio**.

1. Specify your data flow.

1. Choose a step with a transformation.

1. Choose **Add step**.

1. Choose **Encode categorical**.

1. Specify the following:
   + **Transform** – **Similarity encode**
   + **Input column** – The column containing the categorical data that you're encoding.
   + **Target dimension** – (Optional) The dimension of the categorical embedding vector. The default value is 30. We recommend using a larger target dimension if you have a large dataset with many categories.
   + **Output style** – Choose **Vector** for a single vector with all of the encoded values. Choose **Column** to have the encoded values in separate columns.
   + **Output column** – (Optional) The name of the output column for a vector-encoded output. For a column-encoded output, the name that you specify is a prefix followed by a number.

## Featurize Text
<a name="canvas-transform-featurize-text"></a>

Use the **Featurize Text** transform group to inspect string-typed columns and use text embedding to featurize these columns. 

This feature group contains two features, *Character statistics* and *Vectorize*. Use the following sections to learn more about these transforms. For both options, the **Input column** must contain text data (string type).

### Character Statistics
<a name="canvas-transform-featurize-text-character-stats"></a>

Use **Character statistics** to generate statistics for each row in a column containing text data. 

This transform computes the following ratios and counts for each row, and creates a new column to report the result. The new column is named using the input column name as a prefix and a suffix that is specific to the ratio or count. 
+ **Number of words**: The total number of words in that row. The suffix for this output column is `-stats_word_count`.
+ **Number of characters**: The total number of characters in that row. The suffix for this output column is `-stats_char_count`.
+ **Ratio of upper**: The number of uppercase characters, from A to Z, divided by all characters in the column. The suffix for this output column is `-stats_capital_ratio`.
+ **Ratio of lower**: The number of lowercase characters, from a to z, divided by all characters in the column. The suffix for this output column is `-stats_lower_ratio`.
+ **Ratio of digits**: The ratio of digits in a single row over the sum of digits in the input column. The suffix for this output column is `-stats_digit_ratio`.
+ **Special characters ratio**: The ratio of non-alphanumeric characters (such as #$&%:@) to the sum of all characters in the input column. The suffix for this output column is `-stats_special_ratio`.
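
A per-row version of these statistics can be sketched in Python. This is a minimal illustration, not Data Wrangler's implementation; in particular, the exact denominators and the treatment of whitespace in the special-character ratio are assumptions:

```python
def character_statistics(text):
    """Compute word/character counts and character-class ratios for one
    row of text, mirroring the statistics listed above."""
    total = len(text)
    def ratio(count):
        return count / total if total else 0.0
    return {
        "word_count": len(text.split()),
        "char_count": total,
        "capital_ratio": ratio(sum(c.isupper() for c in text)),
        "lower_ratio": ratio(sum(c.islower() for c in text)),
        "digit_ratio": ratio(sum(c.isdigit() for c in text)),
        # Assumes "special" means neither alphanumeric nor whitespace.
        "special_ratio": ratio(sum(not c.isalnum() and not c.isspace() for c in text)),
    }
```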

### Vectorize
<a name="canvas-transform-featurize-text-vectorize"></a>

Text embedding involves mapping words or phrases from a vocabulary to vectors of real numbers. Use the Data Wrangler text embedding transform to tokenize and vectorize text data into term frequency–inverse document frequency (TF-IDF) vectors. 

When TF-IDF is calculated for a column of text data, each word in each sentence is converted to a real number that represents its semantic importance. Higher numbers are associated with less frequent words, which tend to be more meaningful. 

When you define a **Vectorize** transform step, Data Wrangler uses the data in your dataset to define the count vectorizer and TF-IDF methods. Running a Data Wrangler job uses these same methods.

You configure this transform using the following: 
+ **Output column name**: This transform creates a new column with the text embedding. Use this field to specify a name for this output column. 
+ **Tokenizer**: A tokenizer converts the sentence into a list of words, or *tokens*. 

  Choose **Standard** to use a tokenizer that splits by white space and converts each word to lowercase. For example, `"Good dog"` is tokenized to `["good","dog"]`.

  Choose **Custom** to use a customized tokenizer. If you choose **Custom**, you can use the following fields to configure the tokenizer:
  + **Minimum token length**: The minimum length, in characters, for a token to be valid. Defaults to `1`. For example, if you specify `3` for minimum token length, words like `a, at, in` are dropped from the tokenized sentence. 
  + **Should regex split on gaps**: If selected, **regex** splits on gaps. Otherwise, it matches tokens. Defaults to `True`. 
  + **Regex pattern**: Regex pattern that defines the tokenization process. Defaults to `'\\s+'`.
  + **To lowercase**: If chosen, Data Wrangler converts all characters to lowercase before tokenization. Defaults to `True`.

  To learn more, see the Spark documentation on [Tokenizer](https://spark.apache.org/docs/latest/ml-features#tokenizer).
+ **Vectorizer**: The vectorizer converts the list of tokens into a sparse numeric vector. Each token corresponds to an index in the vector and a non-zero indicates the existence of the token in the input sentence. You can choose from two vectorizer options, *Count* and *Hashing*.
  + **Count vectorize** allows customizations that filter infrequent or too common tokens. **Count vectorize parameters** include the following: 
    + **Minimum term frequency**: In each row, terms (tokens) with smaller frequency are filtered. If you specify an integer, this is an absolute threshold (inclusive). If you specify a fraction between 0 (inclusive) and 1, the threshold is relative to the total term count. Defaults to `1`.
    + **Minimum document frequency**: Minimum number of documents (rows) in which a term (token) must appear to be included. If you specify an integer, this is an absolute threshold (inclusive). If you specify a fraction between 0 (inclusive) and 1, the threshold is relative to the document count. Defaults to `1`.
    + **Maximum document frequency**: Maximum number of documents (rows) in which a term (token) can appear to be included. If you specify an integer, this is an absolute threshold (inclusive). If you specify a fraction between 0 (inclusive) and 1, the threshold is relative to the document count. Defaults to `0.999`.
    + **Maximum vocabulary size**: Maximum size of the vocabulary. The vocabulary is made up of all terms (tokens) in all rows of the column. Defaults to `262144`.
    + **Binary outputs**: If selected, the vector outputs do not include the number of appearances of a term in a document, but rather are a binary indicator of its appearance. Defaults to `False`.

    To learn more about this option, see the Spark documentation on [CountVectorizer](https://spark.apache.org/docs/latest/ml-features#countvectorizer).
  + **Hashing** is computationally faster. **Hash vectorize parameters** include the following:
    + **Number of features during hashing**: A hash vectorizer maps tokens to a vector index according to their hash value. This feature determines the number of possible hash values. Large values result in fewer collisions between hash values but a higher dimension output vector.

    To learn more about this option, see the Spark documentation on [FeatureHasher](https://spark.apache.org/docs/latest/ml-features#featurehasher).
+ **Apply IDF** applies an IDF transformation, which multiplies the term frequency with the standard inverse document frequency used for TF-IDF embedding. **IDF parameters** include the following: 
  + **Minimum document frequency**: Minimum number of documents (rows) in which a term (token) must appear to be included. If **count_vectorize** is the chosen vectorizer, we recommend that you keep the default value and only modify the **min_doc_freq** field in **Count vectorize parameters**. Defaults to `5`.
+ **Output format**: The output format of each row. 
  + Select **Vector** to produce a single column with a sparse vector. 
  + Select **Flattened** to create a column for every category with an indicator variable for whether the text in the original column contains a value that is equal to that category. You can only choose flattened when **Vectorizer** is set as **Count vectorizer**.
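
The count-vectorize plus IDF pipeline can be sketched in pure Python. This is an illustration under stated assumptions, not Data Wrangler's implementation: it uses the standard tokenizer described above and Spark's smoothed IDF formula, `log((n + 1) / (df + 1))`:

```python
import math
from collections import Counter

def tokenize(sentence):
    """Standard tokenizer: split on white space and lowercase each word."""
    return sentence.lower().split()

def count_vectorize(rows, min_doc_freq=1):
    """Build a sorted vocabulary (dropping rare terms), then a
    term-count vector per row."""
    docs = [tokenize(r) for r in rows]
    doc_freq = Counter(t for d in docs for t in set(d))
    vocab = sorted(t for t, df in doc_freq.items() if df >= min_doc_freq)
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for d in docs:
        vec = [0.0] * len(vocab)
        for t, count in Counter(d).items():
            if t in index:
                vec[index[t]] = float(count)
        vectors.append(vec)
    return vocab, vectors

def apply_idf(rows, vocab, vectors):
    """Multiply term counts by smoothed inverse document frequency,
    so terms that appear in every row get a weight of zero."""
    docs = [set(tokenize(r)) for r in rows]
    n = len(rows)
    idf = [math.log((n + 1) / (sum(t in d for d in docs) + 1)) for t in vocab]
    return [[v * w for v, w in zip(vec, idf)] for vec in vectors]
```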

## Transform Time Series
<a name="canvas-transform-time-series"></a>

In Data Wrangler, you can transform time series data. The values in a time series dataset are indexed to specific time. For example, a dataset that shows the number of customers in a store for each hour in a day is a time series dataset. The following table shows an example of a time series dataset.

Hourly number of customers in a store


| Number of customers | Time (hour) | 
| --- | --- | 
| 4 | 09:00 | 
| 10 | 10:00 | 
| 14 | 11:00 | 
| 25 | 12:00 | 
| 20 | 13:00 | 
| 18 | 14:00 | 

For the preceding table, the **Number of Customers** column contains the time series data. The time series data is indexed on the hourly data in the **Time (hour)** column.

You might need to perform a series of transformations on your data to get it in a format that you can use for your analysis. Use the **Time series** transform group to transform your time series data. For more information about the transformations that you can perform, see the following sections.

**Topics**
+ [Group by a Time Series](#canvas-group-by-time-series)
+ [Resample Time Series Data](#canvas-resample-time-series)
+ [Handle Missing Time Series Data](#canvas-transform-handle-missing-time-series)
+ [Validate the Timestamp of Your Time Series Data](#canvas-transform-validate-timestamp)
+ [Standardizing the Length of the Time Series](#canvas-transform-standardize-length)
+ [Extract Features from Your Time Series Data](#canvas-transform-extract-time-series-features)
+ [Use Lagged Features from Your Time Series Data](#canvas-transform-lag-time-series)
+ [Create a Datetime Range In Your Time Series](#canvas-transform-datetime-range)
+ [Use a Rolling Window In Your Time Series](#canvas-transform-rolling-window)

### Group by a Time Series
<a name="canvas-group-by-time-series"></a>

You can use the group by operation to group time series data for specific values in a column.

For example, you have the following table that tracks the average daily electricity usage in a household.

Average daily household electricity usage


| Household ID | Daily timestamp | Electricity usage (kWh) | Number of household occupants | 
| --- | --- | --- | --- | 
| household_0 | 1/1/2020 | 30 | 2 | 
| household_0 | 1/2/2020 | 40 | 2 | 
| household_0 | 1/4/2020 | 35 | 3 | 
| household_1 | 1/2/2020 | 45 | 3 | 
| household_1 | 1/3/2020 | 55 | 4 | 

If you choose to group by ID, you get the following table.

Electricity usage grouped by household ID


| Household ID | Electricity usage series (kWh) | Number of household occupants series | 
| --- | --- | --- | 
| household_0 | [30, 40, 35] | [2, 2, 3] | 
| household_1 | [45, 55] | [3, 4] | 

Each entry in the time series sequence is ordered by the corresponding timestamp. The first element of the sequence corresponds to the first timestamp of the series. For `household_0`, `30` is the first value of the **Electricity Usage Series**. The value of `30` corresponds to the first timestamp of `1/1/2020`.

You can include the starting timestamp and ending timestamp. The following table shows how that information appears.

Electricity usage grouped by household ID


| Household ID | Electricity usage series (kWh) | Number of household occupants series | Start_time | End_time | 
| --- | --- | --- | --- | --- | 
| household_0 | [30, 40, 35] | [2, 2, 3] | 1/1/2020 | 1/4/2020 | 
| household_1 | [45, 55] | [3, 4] | 1/2/2020 | 1/3/2020 | 
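
The grouping above can be sketched in plain Python. The column names (`id`, `ts`, and the value columns) are hypothetical, and ISO-formatted timestamp strings are assumed so that sorting by string matches chronological order:

```python
def group_time_series(rows, key, timestamp, value_columns):
    """Collect each value column into a time-ordered list per group,
    and track each group's start and end timestamps."""
    groups = {}
    for row in sorted(rows, key=lambda r: r[timestamp]):
        series = groups.setdefault(row[key], {c: [] for c in value_columns})
        for c in value_columns:
            series[c].append(row[c])
        series.setdefault("start_time", row[timestamp])  # first timestamp seen
        series["end_time"] = row[timestamp]              # last timestamp seen
    return groups
```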

You can use the following procedure to group by a time series column. 

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Time Series**.

1. Under **Transform**, choose **Group by**.

1. Specify a column in **Group by this column**.

1. For **Apply to columns**, specify a value.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Resample Time Series Data
<a name="canvas-resample-time-series"></a>

Time series data usually has observations that aren't taken at regular intervals. For example, a dataset could have some observations that are recorded hourly and other observations that are recorded every two hours.

Many analyses, such as forecasting algorithms, require the observations to be taken at regular intervals. Resampling gives you the ability to establish regular intervals for the observations in your dataset.

You can either upsample or downsample a time series. Downsampling increases the interval between observations in the dataset. For example, if you downsample observations that are taken either every hour or every two hours, each observation in your dataset is taken every two hours. The hourly observations are aggregated into a single value using an aggregation method such as the mean or median.

Upsampling reduces the interval between observations in the dataset. For example, if you upsample observations that are taken every two hours into hourly observations, you can use an interpolation method to infer hourly observations from the ones that have been taken every two hours. For information on interpolation methods, see [pandas.DataFrame.interpolate](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html).

You can resample both numeric and non-numeric data.

Use the **Resample** operation to resample your time series data. If you have multiple time series in your dataset, Data Wrangler standardizes the time interval for each time series.

The following table shows an example of downsampling time series data by using the mean as the aggregation method. The data is downsampled from every hour to every two hours.

Hourly temperature readings over a day before downsampling


| Timestamp | Temperature (Celsius) | 
| --- | --- | 
| 12:00 | 30 | 
| 1:00 | 32 | 
| 2:00 | 35 | 
| 3:00 | 32 | 
| 4:00 | 30 | 

Temperature readings downsampled to every two hours


| Timestamp | Temperature (Celsius) | 
| --- | --- | 
| 12:00 | 30 | 
| 2:00 | 33.5 | 
| 4:00 | 35 | 
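
Downsampling with a mean aggregation can be sketched as follows. This is a simplified illustration, not Data Wrangler's resampler: it assumes integer 24-hour timestamps and aligns buckets to multiples of the bucket size:

```python
from statistics import mean

def downsample(readings, bucket_hours=2, aggregate=mean):
    """Group (hour, value) readings into fixed-size hour buckets, then
    aggregate each bucket into a single observation."""
    buckets = {}
    for hour, value in readings:
        bucket_start = hour - hour % bucket_hours
        buckets.setdefault(bucket_start, []).append(value)
    return [(h, aggregate(values)) for h, values in sorted(buckets.items())]
```

Upsampling works in the opposite direction: it inserts new timestamps between the existing ones and fills their values with an interpolation method.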

You can use the following procedure to resample time series data.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Resample**.

1. For **Timestamp**, choose the timestamp column.

1. For **Frequency unit**, specify the frequency that you're resampling.

1. (Optional) Specify a value for **Frequency quantity**.

1. Configure the transform by specifying the remaining fields.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Handle Missing Time Series Data
<a name="canvas-transform-handle-missing-time-series"></a>

If you have missing values in your dataset, you can do one of the following:
+ For datasets that have multiple time series, drop the time series that have missing values that are greater than a threshold that you specify.
+ Impute the missing values in a time series by using other values in the time series.

Imputing a missing value involves replacing the data by either specifying a value or by using an inferential method. The following are the methods that you can use for imputation:
+ Constant value – Replace all the missing data in your dataset with a value that you specify.
+ Most common value – Replace all the missing data with the value that has the highest frequency in the dataset.
+ Forward fill – Use a forward fill to replace the missing values with the non-missing value that precedes the missing values. For the sequence: [2, 4, 7, NaN, NaN, NaN, 8], all of the missing values are replaced with 7. The sequence that results from using a forward fill is [2, 4, 7, 7, 7, 7, 8].
+ Backward fill – Use a backward fill to replace the missing values with the non-missing value that follows the missing values. For the sequence: [2, 4, 7, NaN, NaN, NaN, 8], all of the missing values are replaced with 8. The sequence that results from using a backward fill is [2, 4, 7, 8, 8, 8, 8]. 
+ Interpolate – Uses an interpolation function to impute the missing values. For more information on the functions that you can use for interpolation, see [pandas.DataFrame.interpolate](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html).
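
Forward and backward fill can be sketched in a few lines of Python, assuming a missing value is represented as `NaN` or `None`:

```python
def is_missing(v):
    return v is None or v != v  # NaN is the only value unequal to itself

def forward_fill(series):
    """Replace each missing value with the last non-missing value before
    it. Missing values at the start of the series stay missing."""
    out, last = [], None
    for v in series:
        if not is_missing(v):
            last = v
        out.append(v if not is_missing(v) else last)
    return out

def backward_fill(series):
    """Replace each missing value with the next non-missing value after
    it, by forward filling the reversed series."""
    return forward_fill(series[::-1])[::-1]
```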

Some of the imputation methods might not be able to impute all of the missing values in your dataset. For example, a **Forward fill** can't impute a missing value that appears at the beginning of the time series. In that case, you can impute the remaining values by using a backward fill.

You can either impute missing values within a cell or within a column.

The following example shows how values are imputed within a cell.

Electricity usage with missing values


| Household ID | Electricity usage series (kWh) | 
| --- | --- | 
| household_0 | [30, 40, 35, NaN, NaN] | 
| household_1 | [45, NaN, 55] | 

Electricity usage with values imputed using a forward fill


| Household ID | Electricity usage series (kWh) | 
| --- | --- | 
| household_0 | [30, 40, 35, 35, 35] | 
| household_1 | [45, 45, 55] | 

The following example shows how values are imputed within a column.

Average daily household electricity usage with missing values


| Household ID | Electricity usage (kWh) | 
| --- | --- | 
| household_0 | 30 | 
| household_0 | 40 | 
| household_0 | NaN | 
| household_1 | NaN | 
| household_1 | NaN | 

Average daily household electricity usage with values imputed using a forward fill


| Household ID | Electricity usage (kWh) | 
| --- | --- | 
| household_0 | 30 | 
| household_0 | 40 | 
| household_0 | 40 | 
| household_1 | 40 | 
| household_1 | 40 | 

You can use the following procedure to handle missing values.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Handle missing**.

1. For **Time series input type**, choose whether you want to handle missing values inside of a cell or along a column.

1. For **Impute missing values for this column**, specify the column that has the missing values.

1. For **Method for imputing values**, select a method.

1. Configure the transform by specifying the remaining fields.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Validate the Timestamp of Your Time Series Data
<a name="canvas-transform-validate-timestamp"></a>

You might have timestamp data that is invalid. You can use the **Validate timestamps** function to determine whether the timestamps in your dataset are valid. Your timestamps can be invalid for one or more of the following reasons:
+ Your timestamp column has missing values.
+ The values in your timestamp column are not formatted correctly.

If you have invalid timestamps in your dataset, you can't perform your analysis successfully. You can use Data Wrangler to identify invalid timestamps and understand where you need to clean your data.

You can configure Data Wrangler to do one of the following when it encounters missing or invalid values in your dataset:
+ Drop the rows that have the missing or invalid values.
+ Identify the rows that have the missing or invalid values.
+ Throw an error if it finds any missing or invalid values in your dataset.

You can validate the timestamps on columns that either have the `timestamp` type or the `string` type. If the column has the `string` type, Data Wrangler converts the type of the column to `timestamp` and performs the validation.

You can use the following procedure to validate the timestamps in your dataset.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Validate timestamps**.

1. For **Timestamp Column**, choose the timestamp column.

1. For **Policy**, choose how you want to handle missing or invalid timestamps.

1. (Optional) For **Output column**, specify a name for the output column.

1. If the datetime column is formatted for the string type, choose **Cast to datetime**.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Standardizing the Length of the Time Series
<a name="canvas-transform-standardize-length"></a>

If you have time series data stored as arrays, you can standardize each time series to the same length. Standardizing the length of the time series array might make it easier for you to perform your analysis on the data.

You can standardize your time series for data transformations that require the length of your data to be fixed.

Many ML algorithms require you to flatten your time series data before you use them. Flattening time series data means separating each value of the time series into its own column in a dataset. The number of columns in a dataset can't change, so the lengths of the time series need to be standardized before you flatten each array into a set of features.

Each time series is set to the length that you specify as a quantile or percentile of the time series set. For example, you can have three sequences that have the following lengths:
+ 3
+ 4
+ 5

You can set the length of all of the sequences as the length of the sequence that has the 50th percentile length.

Time series arrays that are shorter than the length you've specified have missing values added. For example, standardizing the time series [2, 4, 5] to a length of 6 produces [2, 4, 5, NaN, NaN, NaN].

You can use different approaches to handle the missing values. For information on those approaches, see [Handle Missing Time Series Data](#canvas-transform-handle-missing-time-series).

The time series arrays that are longer than the length that you specify are truncated.
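
Padding and truncating to a quantile length can be sketched as follows. This is an illustration under stated assumptions, not Data Wrangler's implementation; in particular, the nearest-rank quantile rule is an assumption:

```python
import math

def standardize_length(sequences, quantile=0.5):
    """Set every sequence to the quantile length of the set
    (nearest-rank): shorter sequences are padded with NaN,
    longer ones are truncated."""
    lengths = sorted(len(s) for s in sequences)
    rank = max(0, math.ceil(quantile * len(lengths)) - 1)
    target = lengths[rank]
    out = []
    for s in sequences:
        padded = list(s[:target]) + [float("nan")] * max(0, target - len(s))
        out.append(padded)
    return out
```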

You can use the following procedure to standardize the length of the time series.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Standardize length**.

1. For **Standardize the time series length for the column**, choose a column.

1. (Optional) For **Output column**, specify a name for the output column. If you don't specify a name, the transform is done in place.

1. If the datetime column is formatted for the string type, choose **Cast to datetime**.

1. Choose **Cutoff quantile** and specify a quantile to set the length of the sequence.

1. Choose **Flatten the output** to output the values of the time series into separate columns.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Extract Features from Your Time Series Data
<a name="canvas-transform-extract-time-series-features"></a>

If you're running a classification or a regression algorithm on your time series data, we recommend extracting features from the time series before running the algorithm. Extracting features might improve the performance of your algorithm.

Use the following options to choose how you want to extract features from your data:
+ Use **Minimal subset** to specify extracting 8 features that are known to be useful in downstream analyses. You can use a minimal subset when you need to perform computations quickly. You can also use it when your ML algorithm has a high risk of overfitting and you want to provide it with fewer features.
+ Use **Efficient subset** to specify extracting the most features possible without extracting features that are computationally intensive in your analyses.
+ Use **All features** to specify extracting all features from the time series.
+ Use **Manual subset** to choose a list of features that you think explain the variation in your data well.
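As a rough sketch of what a minimal subset of extracted features can look like, the following computes a handful of cheap summary statistics per series. The feature names here are illustrative assumptions, not the exact set that Data Wrangler extracts.

```python
import statistics

def extract_minimal_features(series):
    """Hypothetical minimal subset: cheap summary statistics that are
    often useful for downstream classification or regression."""
    return {
        "mean": statistics.fmean(series),
        "median": statistics.median(series),
        "stdev": statistics.stdev(series),
        "minimum": min(series),
        "maximum": max(series),
        "length": len(series),
        "sum": sum(series),
        "abs_energy": sum(x * x for x in series),  # sum of squared values
    }

features = extract_minimal_features([10, 14, 24, 40, 30, 20])
```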

Use the following procedure to extract features from your time series data.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **\$1**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Extract features**.

1. For **Extract features for this column**, choose a column.

1. (Optional) Select **Flatten** to output the features into separate columns.

1. For **Strategy**, choose a strategy to extract the features.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Use Lagged Features from Your Time Series Data
<a name="canvas-transform-lag-time-series"></a>

For many use cases, the best way to predict the future behavior of your time series is to use its most recent behavior.

The most common uses of lagged features are the following:
+ Collecting a handful of past values. For example, for time, t + 1, you collect t, t - 1, t - 2, and t - 3.
+ Collecting values that correspond to seasonal behavior in the data. For example, to predict the occupancy in a restaurant at 1:00 PM, you might want to use the features from 1:00 PM on the previous day. Using the features from 12:00 PM or 11:00 AM on the same day might not be as predictive as using the features from previous days.
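The "handful of past values" pattern can be sketched in plain Python. `lag_features` is a hypothetical helper; Data Wrangler performs the equivalent operation on Spark, and rows without enough history get missing values.

```python
def lag_features(values, lags):
    """For each row t, collect the values at t - lag for each requested lag.
    Rows without enough history get None (nothing to look back to)."""
    rows = []
    for t in range(len(values)):
        row = {"t": values[t]}
        for lag in lags:
            row[f"t-{lag}"] = values[t - lag] if t - lag >= 0 else None
        rows.append(row)
    return rows

# Lags of 1, 2, and 3 steps -- the "handful of past values" case above.
table = lag_features([10, 14, 24, 40, 30, 20], lags=[1, 2, 3])
```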

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **\$1**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Lag features**.

1. For **Generate lag features for this column**, choose a column.

1. For **Timestamp Column**, choose the column containing the timestamps.

1. For **Lag**, specify the duration of the lag.

1. (Optional) Configure the output using one of the following options:
   + **Include the entire lag window**
   + **Flatten the output**
   + **Drop rows without history**

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Create a Datetime Range In Your Time Series
<a name="canvas-transform-datetime-range"></a>

You might have time series data that don't have timestamps. If you know that the observations were taken at regular intervals, you can generate timestamps for the time series in a separate column. To generate timestamps, you specify the value for the start timestamp and the frequency of the timestamps.

For example, you might have the following time series data for the number of customers at a restaurant.

Time series data on the number of customers at a restaurant


| Number of customers | 
| --- | 
| 10 | 
| 14 | 
| 24 | 
| 40 | 
| 30 | 
| 20 | 

If you know that the restaurant opened at 1:00 PM and that the observations are taken hourly, you can add a timestamp column that corresponds to the time series data. You can see the timestamp column in the following table.

Time series data on the number of customers at a restaurant


| Number of customers | Timestamp | 
| --- | --- | 
| 10 | 1:00 PM | 
| 14 | 2:00 PM | 
| 24 | 3:00 PM | 
| 40 | 4:00 PM | 
| 30 | 5:00 PM | 
| 20 | 6:00 PM | 
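Generating the timestamp column in the preceding table reduces to a start timestamp plus a fixed frequency. The following standard-library sketch reproduces it; `datetime_range` is a hypothetical helper.

```python
from datetime import datetime, timedelta

def datetime_range(start, frequency, count):
    """Generate `count` timestamps starting at `start`, spaced by `frequency`."""
    return [start + i * frequency for i in range(count)]

customers = [10, 14, 24, 40, 30, 20]
# One timestamp per observation, hourly, starting at 1:00 PM.
timestamps = datetime_range(datetime(2024, 1, 1, 13, 0), timedelta(hours=1), len(customers))
```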

Use the following procedure to add a datetime range to your data.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **\$1**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Datetime range**.

1. For **Frequency type**, choose the unit used to measure the frequency of the timestamps.

1. For **Starting timestamp**, specify the start timestamp.

1. For **Output column**, specify a name for the output column.

1. (Optional) Configure the output using the remaining fields.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Use a Rolling Window In Your Time Series
<a name="canvas-transform-rolling-window"></a>

You can extract features over a time period. For example, for time *t* and a time window length of 3, the row that indicates the *t*th timestamp gets the features that are extracted from the time series at times *t* - 3, *t* - 2, and *t* - 1. For information on extracting features, see [Extract Features from Your Time Series Data](#canvas-transform-extract-time-series-features). 
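A minimal sketch of extracting features over a trailing window, assuming simple mean and max features. Data Wrangler supports richer extraction strategies, and `rolling_window_features` is a hypothetical helper.

```python
def rolling_window_features(values, window):
    """For the row at time t, extract features from the values at
    t - window through t - 1 (the trailing window)."""
    rows = []
    for t in range(len(values)):
        past = values[max(0, t - window):t]
        rows.append({
            "value": values[t],
            "window_mean": sum(past) / len(past) if past else None,
            "window_max": max(past) if past else None,
        })
    return rows

rows = rolling_window_features([10, 14, 24, 40, 30, 20], window=3)
```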

You can use the following procedure to extract features over a time period.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **\$1**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Rolling window features**.

1. For **Generate rolling window features for this column**, choose a column.

1. For **Timestamp Column**, choose the column containing the timestamps.

1. (Optional) For **Output Column**, specify the name of the output column.

1. For **Window size**, specify the window size.

1. For **Strategy**, choose the extraction strategy.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

## Featurize Datetime
<a name="canvas-transform-datetime-embed"></a>

Use **Featurize date/time** to create a vector embedding representing a datetime field. To use this transform, your datetime data must be in one of the following formats: 
+ Strings describing datetime: For example, `"January 1st, 2020, 12:44pm"`. 
+ A Unix timestamp: A Unix timestamp describes the number of seconds, milliseconds, microseconds, or nanoseconds from 1/1/1970. 

You can choose to **Infer datetime format** and provide a **Datetime format**. If you provide a datetime format, you must use the codes described in the [Python documentation](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes). The options you select for these two configurations have implications for the speed of the operation and the final results.
+ The most manual and computationally fastest option is to specify a **Datetime format** and select **No** for **Infer datetime format**.
+ To reduce manual labor, you can choose **Infer datetime format** and not specify a datetime format. This is also a computationally fast operation; however, the first datetime format encountered in the input column is assumed to be the format for the entire column. If the column contains other formats, those values are NaN in the final output, so inferring the datetime format can leave some strings unparsed. 
+ If you don't specify a format and select **No** for **Infer datetime format**, you get the most robust results. All the valid datetime strings are parsed. However, this operation can be an order of magnitude slower than the first two options in this list. 

When you use this transform, you specify an **Input column** which contains datetime data in one of the formats listed above. The transform creates an output column named **Output column name**. The format of the output column depends on your configuration using the following:
+ **Vector**: Outputs a single column as a vector. 
+ **Columns**: Creates a new column for every feature. For example, if the output contains a year, month, and day, three separate columns are created for year, month, and day. 

Additionally, you must choose an **Embedding mode**. For linear models and deep networks, we recommend choosing **cyclic**. For tree-based algorithms, we recommend choosing **ordinal**.
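The difference between the two embedding modes can be illustrated with a month feature. This is a sketch of one common way to build a cyclic embedding (sin/cos on the unit circle), not Data Wrangler's exact implementation.

```python
import math
from datetime import datetime

def embed_month(ts, mode):
    """Ordinal: the raw month number (suits tree-based algorithms).
    Cyclic: sin/cos on the unit circle, so December and January end up
    close together (suits linear models and deep networks)."""
    month = ts.month
    if mode == "ordinal":
        return [month]
    angle = 2 * math.pi * (month - 1) / 12
    return [math.sin(angle), math.cos(angle)]

dec = embed_month(datetime(2023, 12, 15), "cyclic")
jan = embed_month(datetime(2024, 1, 15), "cyclic")
```

With the cyclic embedding, December is closer to January than to June, which an ordinal month number cannot express.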

## Format String
<a name="canvas-transform-format-string"></a>

The **Format string** transforms contain standard string formatting operations. For example, you can use these operations to remove special characters, normalize string lengths, and update string casing.

This feature group contains the following transforms. All transforms return copies of the strings in the **Input column** and add the result to a new, output column.


| Name | Function | 
| --- | --- | 
| Left pad |  Left-pad the string with a given **Fill character** to the given **width**. If the string is longer than **width**, the return value is shortened to **width** characters.  | 
| Right pad |  Right-pad the string with a given **Fill character** to the given **width**. If the string is longer than **width**, the return value is shortened to **width** characters.  | 
| Center (pad on either side) |  Center-pad the string (add padding on both sides of the string) with a given **Fill character** to the given **width**. If the string is longer than **width**, the return value is shortened to **width** characters.  | 
| Prepend zeros |  Left-fill a numeric string with zeros, up to a given **width**. If the string is longer than **width**, the return value is shortened to **width** characters.  | 
| Strip left and right |  Returns a copy of the string with the leading and trailing characters removed.  | 
| Strip characters from left |  Returns a copy of the string with leading characters removed.  | 
| Strip characters from right |  Returns a copy of the string with trailing characters removed.  | 
| Lower case |  Convert all letters in text to lowercase.  | 
| Upper case |  Convert all letters in text to uppercase.  | 
| Capitalize |  Capitalize the first letter in each sentence.   | 
| Swap case | Converts all uppercase characters in the string to lowercase and all lowercase characters to uppercase, and returns the result. | 
| Add prefix or suffix |  Adds a prefix and a suffix to the string column. You must specify at least one of **Prefix** and **Suffix**.   | 
| Remove symbols |  Removes given symbols from a string. All listed characters are removed. Defaults to white space.   | 
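Most of these operations correspond directly to Python string methods, which is a convenient way to preview the expected behavior (Data Wrangler itself applies the transforms on Spark):

```python
s = "Data Wrangler"                   # 13 characters

left_padded = s.rjust(16, "*")        # left pad:      '***Data Wrangler'
right_padded = s.ljust(16, "*")       # right pad:     'Data Wrangler***'
centered = "abc".center(7, "-")       # center pad:    '--abc--'
zeros = "42".zfill(5)                 # prepend zeros: '00042'
stripped = "  hello  ".strip()        # strip left and right: 'hello'
swapped = s.swapcase()                # swap case:     'dATA wRANGLER'
upper, lower = s.upper(), s.lower()   # upper / lower case
```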

## Handle Outliers
<a name="canvas-transform-handle-outlier"></a>

Machine learning models are sensitive to the distribution and range of your feature values. Outliers, or rare values, can negatively impact model accuracy and lead to longer training times. Use this feature group to detect and update outliers in your dataset. 

When you define a **Handle outliers** transform step, the statistics used to detect outliers are generated on the data available in Data Wrangler when defining this step. These same statistics are used when running a Data Wrangler job. 

Use the following sections to learn more about the transforms this group contains. You specify an **Output name** and each of these transforms produces an output column with the resulting data. 

### Robust standard deviation numeric outliers
<a name="canvas-transform-handle-outlier-rstdev"></a>

This transform detects and fixes outliers in numeric features using statistics that are robust to outliers.

You must define an **Upper quantile** and a **Lower quantile** for the statistics used to calculate outliers. You must also specify the number of **Standard deviations** from which a value must vary from the mean to be considered an outlier. For example, if you specify 3 for **Standard deviations**, a value must fall more than 3 standard deviations from the mean to be considered an outlier. 

The **Fix method** is the method used to handle outliers when they are detected. You can choose from the following:
+ **Clip**: Use this option to clip the outliers to the corresponding outlier detection bound.
+ **Remove**: Use this option to remove rows with outliers from the dataframe.
+ **Invalidate**: Use this option to replace outliers with invalid values.

### Standard Deviation Numeric Outliers
<a name="canvas-transform-handle-outlier-sstdev"></a>

This transform detects and fixes outliers in numeric features using the mean and standard deviation.

You specify the number of **Standard deviations** a value must vary from the mean to be considered an outlier. For example, if you specify 3 for **Standard deviations**, a value must fall more than 3 standard deviations from the mean to be considered an outlier. 

The **Fix method** is the method used to handle outliers when they are detected. You can choose from the following:
+ **Clip**: Use this option to clip the outliers to the corresponding outlier detection bound.
+ **Remove**: Use this option to remove rows with outliers from the dataframe.
+ **Invalidate**: Use this option to replace outliers with invalid values.
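The detect-and-fix logic can be sketched in plain Python. `handle_outliers` is a hypothetical helper, and it uses the population standard deviation, which may differ from the exact statistic Data Wrangler computes.

```python
import math
import statistics

def handle_outliers(values, n_std=2, fix="clip"):
    """Detect values more than n_std standard deviations from the mean,
    then clip, remove, or invalidate them."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    lower, upper = mean - n_std * std, mean + n_std * std
    if fix == "clip":
        return [min(max(v, lower), upper) for v in values]
    if fix == "remove":
        return [v for v in values if lower <= v <= upper]
    # "invalidate": replace outliers with NaN
    return [v if lower <= v <= upper else math.nan for v in values]

data = [1, 2, 3, 2, 1, 100]           # 100 is far outside 2 standard deviations
clipped = handle_outliers(data, fix="clip")
```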

### Quantile Numeric Outliers
<a name="canvas-transform-handle-outlier-quantile-numeric"></a>

Use this transform to detect and fix outliers in numeric features using quantiles. You can define an **Upper quantile** and a **Lower quantile**. All values that fall above the upper quantile or below the lower quantile are considered outliers. 

The **Fix method** is the method used to handle outliers when they are detected. You can choose from the following:
+ **Clip**: Use this option to clip the outliers to the corresponding outlier detection bound.
+ **Remove**: Use this option to remove rows with outliers from the dataframe.
+ **Invalidate**: Use this option to replace outliers with invalid values. 

### Min-Max Numeric Outliers
<a name="canvas-transform-handle-outlier-minmax-numeric"></a>

This transform detects and fixes outliers in numeric features using upper and lower thresholds. Use this method if you know the threshold values that demarcate outliers.

You specify an **Upper threshold** and a **Lower threshold**, and if values fall above or below those thresholds respectively, they are considered outliers. 

The **Fix method** is the method used to handle outliers when they are detected. You can choose from the following:
+ **Clip**: Use this option to clip the outliers to the corresponding outlier detection bound.
+ **Remove**: Use this option to remove rows with outliers from the dataframe.
+ **Invalidate**: Use this option to replace outliers with invalid values. 

### Replace Rare
<a name="canvas-transform-handle-outlier-replace-rare"></a>

When you use the **Replace rare** transform, you specify a threshold and Data Wrangler finds all values that meet that threshold and replaces them with a string that you specify. For example, you may want to use this transform to categorize all outliers in a column into an "Others" category. 
+ **Replacement string**: The string with which to replace outliers.
+ **Absolute threshold**: A category is rare if the number of instances is less than or equal to this absolute threshold.
+ **Fraction threshold**: A category is rare if the number of instances is less than or equal to this fraction threshold multiplied by the number of rows.
+ **Max common categories**: Maximum not-rare categories that remain after the operation. If the threshold does not filter enough categories, those with the top number of appearances are classified as not rare. If set to 0 (default), there is no hard limit to the number of categories.
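A sketch of the replacement using an absolute threshold; `replace_rare` is a hypothetical helper that covers only the **Absolute threshold** case.

```python
from collections import Counter

def replace_rare(values, replacement="Others", absolute_threshold=1):
    """Replace categories whose count is <= the absolute threshold."""
    counts = Counter(values)
    return [replacement if counts[v] <= absolute_threshold else v for v in values]

cities = ["NYC", "NYC", "NYC", "LA", "LA", "Boise"]
categorized = replace_rare(cities)    # 'Boise' appears once, so it becomes 'Others'
```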

## Handle Missing Values
<a name="canvas-transform-handle-missing"></a>

Missing values are a common occurrence in machine learning datasets. In some situations, it is appropriate to impute missing data with a calculated value, such as an average or categorically common value. You can process missing values using the **Handle missing values** transform group. This group contains the following transforms. 

### Fill Missing
<a name="canvas-transform-fill-missing"></a>

Use the **Fill missing** transform to replace missing values with a **Fill value** you define. 

### Impute Missing
<a name="canvas-transform-impute"></a>

Use the **Impute missing** transform to create a new column that contains imputed values where missing values were found in input categorical and numerical data. The configuration depends on your data type.

For numeric data, choose an imputing strategy, the strategy used to determine the new value to impute. You can choose to impute the mean or the median over the values that are present in your dataset. Data Wrangler uses the value that it computes to impute the missing values.

For categorical data, Data Wrangler imputes missing values using the most frequent value in the column. To impute a custom string, use the **Fill missing** transform instead.
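Both imputation paths can be sketched in plain Python. These are hypothetical helpers, not Data Wrangler APIs.

```python
import math
import statistics
from collections import Counter

def impute_numeric(values, strategy="mean"):
    """Replace NaNs with the mean or median of the values that are present."""
    present = [v for v in values if not math.isnan(v)]
    fill = statistics.fmean(present) if strategy == "mean" else statistics.median(present)
    return [fill if math.isnan(v) else v for v in values]

def impute_categorical(values):
    """Replace None with the most frequent value in the column."""
    most_frequent = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [most_frequent if v is None else v for v in values]
```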

### Add Indicator for Missing
<a name="canvas-transform-missing-add-indicator"></a>

Use the **Add indicator for missing** transform to create a new indicator column, which contains a Boolean `"false"` if a row contains a value, and `"true"` if a row contains a missing value. 

### Drop Missing
<a name="canvas-transform-drop-missing"></a>

Use the **Drop missing** option to drop rows that contain missing values from the **Input column**.

## Manage Columns
<a name="canvas-manage-columns"></a>

You can use the following transforms to quickly update and manage columns in your dataset: 



| Name | Function | 
| --- | --- | 
| Drop Column | Delete a column.  | 
| Duplicate Column | Duplicate a column. | 
| Rename Column | Rename a column. | 
| Move Column |  Move a column's location in the dataset. Choose to move your column to the start or end of the dataset, before or after a reference column, or to a specific index.   | 

## Manage Rows
<a name="canvas-transform-manage-rows"></a>

Use this transform group to quickly perform sort and shuffle operations on rows. This group contains the following:
+ **Sort**: Sort the entire dataframe by a given column. Select the check box next to **Ascending order** for this option; otherwise, deselect the check box and descending order is used for the sort. 
+ **Shuffle**: Randomly shuffle all rows in the dataset. 

## Manage Vectors
<a name="canvas-transform-manage-vectors"></a>

Use this transform group to combine or flatten vector columns. This group contains the following transforms. 
+ **Assemble**: Use this transform to combine Spark vectors and numeric data into a single column. For example, you can combine three columns: two containing numeric data and one containing vectors. Add all the columns you want to combine in **Input columns** and specify an **Output column name** for the combined data. 
+ **Flatten**: Use this transform to flatten a single column containing vector data. The input column must contain PySpark vectors or array-like objects. You can control the number of columns created by specifying a **Method to detect number of outputs**. For example, if you select **Length of first vector**, the number of elements in the first valid vector or array found in the column determines the number of output columns that are created. All other input vectors with too many items are truncated. Inputs with too few items are filled with NaNs.

  You also specify an **Output prefix**, which is used as the prefix for each output column. 

## Process Numeric
<a name="canvas-transform-process-numeric"></a>

Use the **Process Numeric** feature group to process numeric data. Each scaler in this group is defined using the Spark library. The following scalers are supported:
+ **Standard Scaler**: Standardize the input column by subtracting the mean from each value and scaling to unit variance. To learn more, see the Spark documentation for [StandardScaler](https://spark.apache.org/docs/latest/ml-features#standardscaler).
+ **Robust Scaler**: Scale the input column using statistics that are robust to outliers. To learn more, see the Spark documentation for [RobustScaler](https://spark.apache.org/docs/latest/ml-features#robustscaler).
+ **Min Max Scaler**: Transform the input column by scaling each feature to a given range. To learn more, see the Spark documentation for [MinMaxScaler](https://spark.apache.org/docs/latest/ml-features#minmaxscaler).
+ **Max Absolute Scaler**: Scale the input column by dividing each value by the maximum absolute value. To learn more, see the Spark documentation for [MaxAbsScaler](https://spark.apache.org/docs/latest/ml-features#maxabsscaler).
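Three of the four scalers reduce to simple arithmetic, sketched here in plain Python. Note one assumption: this sketch uses the population standard deviation, while Spark's `StandardScaler` uses the corrected sample standard deviation.

```python
import statistics

def standard_scale(values):
    """Standard Scaler: subtract the mean, divide by the standard deviation."""
    mean, std = statistics.fmean(values), statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max_scale(values, lo=0.0, hi=1.0):
    """Min Max Scaler: rescale each value into the [lo, hi] range."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

def max_abs_scale(values):
    """Max Absolute Scaler: divide by the maximum absolute value."""
    m = max(abs(v) for v in values)
    return [v / m for v in values]
```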

## Sampling
<a name="canvas-transform-sampling"></a>

After you've imported your data, you can use the **Sampling** transformer to take one or more samples of it. When you use the sampling transformer, Data Wrangler samples your original dataset.

You can choose one of the following sample methods:
+ **Limit**: Samples the dataset starting from the first row up to the limit that you specify.
+ **Randomized**: Takes a random sample of a size that you specify.
+ **Stratified**: Takes a stratified random sample.

You can stratify a randomized sample to make sure that it represents the original distribution of the dataset.

You might be performing data preparation for multiple use cases. For each use case, you can take a different sample and apply a different set of transformations.
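Stratified sampling can be sketched by sampling the same fraction within each stratum, so the sample preserves the original distribution of the stratify column. `stratified_sample` is a hypothetical helper, not Data Wrangler's exact algorithm.

```python
import random
from collections import defaultdict

def stratified_sample(rows, strat_column, fraction, seed=0):
    """Sample `fraction` of the rows within each stratum."""
    rng = random.Random(seed)        # fixed seed -> reproducible sample
    groups = defaultdict(list)
    for row in rows:
        groups[row[strat_column]].append(row)
    sample = []
    for group in groups.values():
        sample.extend(rng.sample(group, round(len(group) * fraction)))
    return sample

# 80/20 class balance in the original data is preserved in the sample.
rows = [{"label": "a"}] * 80 + [{"label": "b"}] * 20
sample = stratified_sample(rows, "label", 0.5)
```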

The following procedure describes the process of creating a random sample. 

To take a random sample from your data.

1. Choose the **\$1** to the right of the dataset that you've imported. The name of your dataset is located below the **\$1**.

1. Choose **Add transform**.

1. Choose **Sampling**.

1. For **Sampling method**, choose the sampling method.

1. For **Approximate sample size**, choose the approximate number of observations that you want in your sample.

1. (Optional) Specify an integer for **Random seed** to create a reproducible sample.

The following procedure describes the process of creating a stratified sample.

To take a stratified sample from your data.

1. Choose the **\$1** to the right of the dataset that you've imported. The name of your dataset is located below the **\$1**.

1. Choose **Add transform**.

1. Choose **Sampling**.

1. For **Sampling method**, choose the sampling method.

1. For **Approximate sample size**, choose the approximate number of observations that you want in your sample.

1. For **Stratify column**, specify the name of the column that you want to stratify on.

1. (Optional) Specify an integer for **Random seed** to create a reproducible sample.

## Search and Edit
<a name="canvas-transform-search-edit"></a>

Use this section to search for and edit specific patterns within strings. For example, you can find and update strings within sentences or documents, split strings by delimiters, and find occurrences of specific strings. 

The following transforms are supported under **Search and edit**. All transforms return copies of the strings in the **Input column** and add the result to a new output column.



| Name | Function | 
| --- | --- | 
|  Find substring  |  Returns the index of the first occurrence of the **Substring** for which you searched. You can start and end the search at **Start** and **End** respectively.   | 
|  Find substring (from right)  |  Returns the index of the last occurrence of the **Substring** for which you searched. You can start and end the search at **Start** and **End** respectively.   | 
|  Matches prefix  |  Returns a Boolean value indicating whether the string starts with a given **Pattern**. A pattern can be a character sequence or regular expression. Optionally, you can make the pattern case sensitive.   | 
|  Find all occurrences  |  Returns an array with all occurrences of a given pattern. A pattern can be a character sequence or regular expression.   | 
|  Extract using regex  |  Returns a string that matches a given Regex pattern.  | 
|  Extract between delimiters  |  Returns a string with all characters found between **Left delimiter** and **Right delimiter**.   | 
|  Extract from position  |  Returns a string, starting from **Start position** in the input string, that contains all characters up to the start position plus **Length**.   | 
|  Find and replace substring  |  Returns a string with all matches of a given **Pattern** (regular expression) replaced by **Replacement string**.  | 
|  Replace between delimiters  |  Returns a string with the substring found between the first appearance of a **Left delimiter** and the last appearance of a **Right delimiter** replaced by **Replacement string**. If no match is found, nothing is replaced.   | 
|  Replace from position  |  Returns a string with the substring between **Start position** and **Start position** plus **Length** replaced by **Replacement string**. If **Start position** plus **Length** is greater than the length of the replacement string, the output contains **…**.  | 
|  Convert regex to missing  |  Converts a string to `None` if invalid and returns the result. Validity is defined with a regular expression in **Pattern**.  | 
|  Split string by delimiter  |  Returns an array of strings from the input string, split by **Delimiter**, with up to **Max number of splits** (optional). The delimiter defaults to white space.   | 
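Several of these transforms have direct counterparts in Python's `re` module and string methods, which is a convenient way to preview their behavior:

```python
import re

s = "order-1234, order-5678"

first = s.find("order-5678")              # find substring: index of first occurrence
last = s.rfind("order")                   # find substring (from right)
ids = re.findall(r"order-(\d+)", s)       # find all occurrences of a pattern
extracted = re.search(r"\d+", s).group()  # extract using regex
masked = re.sub(r"\d+", "####", s)        # find and replace substring
parts = s.split(", ")                     # split string by delimiter
```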

## Split data
<a name="canvas-transform-split-data"></a>

Use the **Split data** transform to split your dataset into two or three datasets. For example, you can split your dataset into a dataset used to train your model and a dataset used to test it. You can determine the proportion of the dataset that goes into each split. For example, if you’re splitting one dataset into two datasets, the training dataset can have 80% of the data while the testing dataset has 20%.

Splitting your data into three datasets gives you the ability to create training, validation, and test datasets. You can see how well the model performs on the test dataset by dropping the target column.

Your use case determines how much of the original dataset each of your datasets get and the method you use to split the data. For example, you might want to use a stratified split to make sure that the distribution of the observations in the target column are the same across datasets. You can use the following split transforms:
+ Randomized split — Each split is a random, non-overlapping sample of the original dataset. For larger datasets, using a randomized split might be computationally expensive and take longer than an ordered split.
+ Ordered split – Splits the dataset based on the sequential order of the observations. For example, for an 80/20 train-test split, the first observations that make up 80% of the dataset go to the training dataset. The last 20% of the observations go to the testing dataset. Ordered splits are effective in keeping the existing order of the data between splits.
+ Stratified split – Splits the dataset to make sure that the number of observations in the input column have proportional representation. For an input column that has the observations 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, an 80/20 split on the column would mean that approximately 80% of the 1s, 80% of the 2s, and 80% of the 3s go to the training set. About 20% of each type of observation go to the testing set.
+ Split by key – Avoids data with the same key occurring in more than one split. For example, if you have a dataset with the column 'customer_id' and you're using it as a key, no customer id is in more than one split.
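Splitting by key is commonly implemented by hashing each key value into a bucket, so all rows that share a key land in the same split. This sketch illustrates the idea, not Data Wrangler's exact algorithm.

```python
import zlib

def split_by_key(rows, key, train_fraction=0.8):
    """Hash each key value into one of 100 buckets; all rows sharing a key
    land in the same split, with expected proportions near train_fraction."""
    train, test = [], []
    for row in rows:
        bucket = zlib.crc32(str(row[key]).encode()) % 100
        (train if bucket < train_fraction * 100 else test).append(row)
    return train, test

rows = [{"customer_id": c, "amount": a} for c in ("c1", "c2", "c3") for a in (1, 2)]
train, test = split_by_key(rows, "customer_id")
```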

After you split the data, you can apply additional transformations to each dataset. For most use cases, they aren't necessary.

For performance, Data Wrangler calculates the proportions of the splits approximately. You can choose an error threshold to set the accuracy of the splits. Lower error thresholds more accurately reflect the proportions that you specify for the splits. If you set a higher error threshold, you get better performance, but lower accuracy.

For perfectly split data, set the error threshold to 0. You can specify a threshold between 0 and 1 for better performance. If you specify a value greater than 1, Data Wrangler interprets that value as 1.

If you have 10000 rows in your dataset and you specify an 80/20 split with an error of 0.001, you would get observations approximating one of the following results:
+ 8010 observations in the training set and 1990 in the testing set
+ 7990 observations in the training set and 2010 in the testing set

In the preceding example, the number of observations in the training set falls in the interval between 7990 and 8010, and the number in the testing set between 1990 and 2010.

By default, Data Wrangler uses a random seed to make the splits reproducible. You can specify a different value for the seed to create a different reproducible split.

------
#### [ Randomized split ]

Use the following procedure to perform a randomized split on your dataset.

To split your dataset randomly, do the following.

1. Choose the **\$1** next to the node containing the dataset that you're splitting.

1. Choose **Add transform**.

1. Choose **Split data**.

1. (Optional) For **Splits**, specify the names and proportions of each split. The proportions must sum to 1.

1. (Optional) Choose the **\$1** to create an additional split.

   1. Specify the names and proportions of all the splits. The proportions must sum to 1.

1. (Optional) Specify a value for **Error threshold** other than the default value.

1. (Optional) Specify a value for **Random seed**.

1. Choose **Preview**.

1. Choose **Add**.

------
#### [ Ordered split ]

Use the following procedure to perform an ordered split on your dataset.

To make an ordered split in your dataset, do the following.

1. Choose the **\$1** next to the node containing the dataset that you're splitting.

1. Choose **Add transform**.

1. Choose **Split data**.

1. For **Transform**, choose **Ordered split**.

1. (Optional) For **Splits**, specify the names and proportions of each split. The proportions must sum to 1.

1. (Optional) Choose the **\$1** to create an additional split.

   1. Specify the names and proportions of all the splits. The proportions must sum to 1.

1. (Optional) Specify a value for **Error threshold** other than the default value.

1. (Optional) For **Input column**, specify a column with numeric values. Data Wrangler uses the values of the column to infer which records are in each split. The smaller values are in one split, with the larger values in the other splits.

1. (Optional) Select **Handle duplicates** to add noise to duplicate values and create a dataset of entirely unique values.

1. (Optional) Specify a value for **Random seed**.

1. Choose **Preview**.

1. Choose **Add**.

------
#### [ Stratified split ]

Use the following procedure to perform a stratified split on your dataset.

To make a stratified split in your dataset, do the following.

1. Choose the **\$1** next to the node containing the dataset that you're splitting.

1. Choose **Add transform**.

1. Choose **Split data**.

1. For **Transform**, choose **Stratified split**.

1. (Optional) For **Splits**, specify the names and proportions of each split. The proportions must sum to 1.

1. (Optional) Choose the **\$1** to create an additional split.

   1. Specify the names and proportions of all the splits. The proportions must sum to 1.

1. For **Input column**, specify a column with up to 100 unique values. Data Wrangler can't stratify a column with more than 100 unique values.

1. (Optional) Specify a value for **Error threshold** other than the default value.

1. (Optional) Specify a value for **Random seed**.

1. Choose **Preview**.

1. Choose **Add**.

------
#### [ Split by column keys ]

Use the following procedure to split by the column keys in your dataset.

To split by the column keys in your dataset, do the following.

1. Choose the **\$1** next to the node containing the dataset that you're splitting.

1. Choose **Add transform**.

1. Choose **Split data**.

1. For **Transform**, choose **Split by key**.

1. (Optional) For **Splits**, specify the names and proportions of each split. The proportions must sum to 1.

1. (Optional) Choose the **\$1** to create an additional split.

   1. Specify the names and proportions of all the splits. The proportions must sum to 1.

1. For **Key columns**, specify the columns with values that you don't want to appear in both datasets.

1. (Optional) Specify a value for **Error threshold** other than the default value.

1. Choose **Preview**.

1. Choose **Add**.

------

## Parse Value as Type
<a name="canvas-transform-cast-type"></a>

Use this transform to cast a column to a new type. The supported Data Wrangler data types are:
+ Long
+ Float
+ Boolean
+ Date, in the format dd-MM-yyyy, representing day, month, and year respectively. 
+ String
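As a rough pandas equivalent (a sketch of the same casts, not what Canvas runs), converting string columns to these types looks like the following; note that dd-MM-yyyy corresponds to the strftime pattern `%d-%m-%Y`:

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["1", "2", "3"],                # -> Long
    "price": ["19.99", "5.00", "7.25"],   # -> Float
    "active": ["true", "false", "true"],  # -> Boolean
    "signup": ["01-02-2023", "15-11-2022", "30-06-2023"],  # -> Date
})

df["id"] = df["id"].astype("int64")
df["price"] = df["price"].astype("float64")
df["active"] = df["active"] == "true"
# dd-MM-yyyy (day, month, year) maps to %d-%m-%Y
df["signup"] = pd.to_datetime(df["signup"], format="%d-%m-%Y")
```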

## Validate String
<a name="canvas-transform-validate-string"></a>

Use the **Validate string** transforms to create a new column that indicates whether a row of text data meets a specified condition. For example, you can use a **Validate string** transform to verify that a string only contains lowercase characters.

The following transforms are included in this transform group. If a transform outputs a Boolean value, `True` is represented with a `1` and `False` is represented with a `0`.


****  

| Name | Function | 
| --- | --- | 
|  String length  |  Returns `True` if a string's length equals a specified length. Otherwise, returns `False`.   | 
|  Starts with  |  Returns `True` if a string starts with a specified prefix. Otherwise, returns `False`.  | 
|  Ends with  |  Returns `True` if a string ends with a specified suffix. Otherwise, returns `False`.  | 
|  Is alphanumeric  |  Returns `True` if a string only contains numbers and letters. Otherwise, returns `False`.  | 
|  Is alpha (letters)  |  Returns `True` if a string only contains letters. Otherwise, returns `False`.  | 
|  Is digit  |  Returns `True` if a string only contains digits. Otherwise, returns `False`.  | 
|  Is space  |  Returns `True` if a string only contains white space. Otherwise, returns `False`.  | 
|  Is title  |  Returns `True` if a string is in title case (each word begins with an uppercase letter). Otherwise, returns `False`.  | 
|  Is lowercase  |  Returns `True` if a string only contains lower case letters. Otherwise, returns `False`.  | 
|  Is uppercase  |  Returns `True` if a string only contains upper case letters. Otherwise, returns `False`.  | 
|  Is numeric  |  Returns `True` if a string only contains numbers. Otherwise, returns `False`.  | 
|  Is decimal  |  Returns `True` if a string only contains decimal numbers. Otherwise, returns `False`.  | 
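Most of these checks line up with Python's built-in string predicates. The following pandas sketch (an illustration, not the Canvas implementation) reproduces a few of them with the 1/0 output convention:

```python
import pandas as pd

s = pd.Series(["Hello", "abc123", "   ", "WORLD"])

# Each column holds 1 for True and 0 for False, per the convention above
checks = pd.DataFrame({
    "is_alphanumeric": s.str.isalnum().astype(int),
    "is_alpha": s.str.isalpha().astype(int),
    "is_space": s.str.isspace().astype(int),
    "is_uppercase": s.str.isupper().astype(int),
    "starts_with_a": s.str.startswith("a").astype(int),
})
```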

## Unnest JSON Data
<a name="canvas-transform-flatten-column"></a>

If you have a .csv file, you might have values in your dataset that are JSON strings. Similarly, you might have nested data in columns of either a Parquet file or a JSON document.

Use the **Flatten structured** operator to separate the first level keys into separate columns. A first level key is a key that isn't nested within a value.

For example, you might have a dataset that has a *person* column with demographic information on each person stored as JSON strings. A JSON string might look like the following.

```
 "{"seq": 1,"name": {"first": "Nathaniel","last": "Ferguson"},"age": 59,"city": "Posbotno","state": "WV"}"
```

The **Flatten structured** operator converts the following first level keys into additional columns in your dataset:
+ seq
+ name
+ age
+ city
+ state

Data Wrangler puts the values of the keys as values under the columns. The following shows the column names and values of the JSON.

```
seq, name,                                    age, city, state
1, {"first": "Nathaniel","last": "Ferguson"}, 59, Posbotno, WV
```

For each value in your dataset containing JSON, the **Flatten structured** operator creates columns for the first-level keys. To create columns for nested keys, call the operator again. For the preceding example, calling the operator again creates the following columns:
+ name_first
+ name_last

The following example shows the dataset that results from calling the operation again.

```
seq, name,                                    age, city, state, name_first, name_last
1, {"first": "Nathaniel","last": "Ferguson"}, 59, Posbotno, WV, Nathaniel, Ferguson
```

Choose **Keys to flatten on** to specify the first-level keys that you want to extract as separate columns. If you don't specify any keys, Data Wrangler extracts all the keys by default.
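A pandas sketch of the same idea (illustrative only): parse each JSON string and expand only the first-level keys, leaving nested values such as `name` as dictionaries until a second pass.

```python
import json
import pandas as pd

df = pd.DataFrame({"person": [
    '{"seq": 1, "name": {"first": "Nathaniel", "last": "Ferguson"}, '
    '"age": 59, "city": "Posbotno", "state": "WV"}',
]})

# max_level=0 expands only first-level keys; nested dicts are kept intact
flattened = pd.json_normalize(df["person"].map(json.loads).tolist(), max_level=0)
```

Running `pd.json_normalize` again over the `name` dictionaries would surface the nested first and last values, analogous to calling the operator a second time.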

## Explode Array
<a name="canvas-transform-explode-array"></a>

Use **Explode array** to expand the values of an array into separate output rows. For example, the operation can take each value in the array [[1, 2, 3], [4, 5, 6], [7, 8, 9]] and create a new column with the following rows:

```
                [1, 2, 3]
                [4, 5, 6]
                [7, 8, 9]
```

Data Wrangler names the new column `input_column_name_flatten`.

You can call the **Explode array** operation multiple times to get the nested values of the array into separate output columns. The following example shows the result of calling the operation multiple times on a dataset with a nested array.

Putting the values of a nested array into separate columns


| id | array | id | array_items | id | array_items_items | 
| --- | --- | --- | --- | --- | --- | 
| 1 | [ [cat, dog], [bat, frog] ] | 1 | [cat, dog] | 1 | cat | 
| 2 | [ [rose, petunia], [lily, daisy] ] | 1 | [bat, frog] | 1 | dog | 
|  |  | 2 | [rose, petunia] | 1 | bat | 
|  |  | 2 | [lily, daisy] | 1 | frog | 
|  |  |  |  | 2 | rose | 
|  |  |  |  | 2 | petunia | 
|  |  |  |  | 2 | lily | 
|  |  |  |  | 2 | daisy | 
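The same two-pass expansion can be sketched with the pandas `explode` method. This is an illustration only; the column names below are chosen to mirror the table, not generated by Canvas.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "array": [[["cat", "dog"], ["bat", "frog"]],
              [["rose", "petunia"], ["lily", "daisy"]]],
})

# First pass: one row per inner list
step1 = df.explode("array").rename(columns={"array": "array_items"})
# Second pass: one row per element
step2 = step1.explode("array_items").rename(
    columns={"array_items": "array_items_items"})
```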

## Transform Image Data
<a name="canvas-transform-image"></a>

Use Data Wrangler to import and transform the images that you're using for your machine learning (ML) pipelines. After you've prepared your image data, you can export it from your Data Wrangler flow to your ML pipeline.

You can use the information provided here to familiarize yourself with importing and transforming image data in Data Wrangler. Data Wrangler uses OpenCV to import images. For more information about supported image formats, see [Image file reading and writing](https://docs.opencv.org/3.4/d4/da8/group__imgcodecs.html#ga288b8b3da0892bd651fce07b3bbd3a56).

After you've familiarized yourself with the concepts of transforming your image data, go through the following tutorial, [Prepare image data with Amazon SageMaker Data Wrangler](https://aws.amazon.com/blogs/machine-learning/prepare-image-data-with-amazon-sagemaker-data-wrangler/).

The following industries and use cases are examples where applying machine learning to transformed image data can be useful:
+ Manufacturing – Identifying defects in items from the assembly line
+ Food – Identifying spoiled or rotten food
+ Medicine – Identifying lesions in tissues

When you work with image data in Data Wrangler, you go through the following process:

1. Import – Select the images by choosing the directory containing them in your Amazon S3 bucket.

1. Transform – Use the built-in transformations to prepare the images for your machine learning pipeline.

1. Export – Export the images that you’ve transformed to a location that can be accessed from the pipeline.

Use the following procedure to import your image data.

**To import your image data**

1. Navigate to the **Create connection** page.

1. Choose **Amazon S3**.

1. Specify the Amazon S3 file path that contains the image data.

1. For **File type**, choose **Image**.

1. (Optional) Choose **Import nested directories** to import images from multiple Amazon S3 paths.

1. Choose **Import**.

Data Wrangler uses the open-source [imgaug](https://imgaug.readthedocs.io/en/latest/) library for its built-in image transformations. You can use the following built-in transformations:
+ **ResizeImage**
+ **EnhanceImage**
+ **CorruptImage**
+ **SplitImage**
+ **DropCorruptedImages**
+ **DropImageDuplicates**
+ **Brightness**
+ **ColorChannels**
+ **Grayscale**
+ **Rotate**

Use the following procedure to transform your images without writing code.

**To transform the image data without writing code**

1. From your Data Wrangler flow, choose the ellipsis icon next to the node representing the images that you've imported.

1. Choose **Add transform**.

1. Choose **Add step**.

1. Choose the transform and configure it.

1. Choose **Preview**.

1. Choose **Add**.

In addition to using the transformations that Data Wrangler provides, you can also use your own custom code snippets. For more information about using custom code snippets, see [Custom Transforms](#canvas-transform-custom). You can import the OpenCV and imgaug libraries within your code snippets and use the transforms associated with them. The following is an example of a code snippet that detects edges within the images.

```
# A table with your image data is stored in the `df` variable
import cv2
import numpy as np
from pyspark.sql.functions import column

from sagemaker_dataprep.compute.operators.transforms.image.constants import DEFAULT_IMAGE_COLUMN, IMAGE_COLUMN_TYPE
from sagemaker_dataprep.compute.operators.transforms.image.decorators import BasicImageOperationDecorator, PandasUDFOperationDecorator


@BasicImageOperationDecorator
def my_transform(image: np.ndarray) -> np.ndarray:
    # To use the code snippet on your image data, modify the following lines within the function
    HYST_THRLD_1, HYST_THRLD_2 = 100, 200
    edges = cv2.Canny(image, HYST_THRLD_1, HYST_THRLD_2)
    return edges
    

@PandasUDFOperationDecorator(IMAGE_COLUMN_TYPE)
def custom_image_udf(image_row):
    return my_transform(image_row)
    

df = df.withColumn(DEFAULT_IMAGE_COLUMN, custom_image_udf(column(DEFAULT_IMAGE_COLUMN)))
```

When you apply transformations in your Data Wrangler flow, Data Wrangler applies them only to a sample of the images in your dataset. To optimize your experience with the application, Data Wrangler doesn't apply the transforms to all of your images.

## Filter data
<a name="canvas-transform-filter-data"></a>

Use Data Wrangler to filter the data in your columns. When you filter the data in a column, you specify the following fields:
+ **Column name** – The name of the column that you're using to filter the data.
+ **Condition** – The type of filter that you're applying to values in the column.
+ **Value** – The value or category in the column to which you're applying the filter.

You can filter on the following conditions:
+ **=** – Returns values that match the value or category that you specify.
+ **!=** – Returns values that don't match the value or category that you specify.
+ **>=** – For **Long** or **Float** data, filters for values that are greater than or equal to the value that you specify.
+ **<=** – For **Long** or **Float** data, filters for values that are less than or equal to the value that you specify.
+ **>** – For **Long** or **Float** data, filters for values that are greater than the value that you specify.
+ **<** – For **Long** or **Float** data, filters for values that are less than the value that you specify.

For example, for a column that has the categories `male` and `female`, you can either filter out all the `male` values or filter for all the `female` values. Because the column contains only `male` and `female` values, either filter returns a column that has only `female` values.

You can also add multiple filters. The filters can be applied across multiple columns or the same column. For example, if you're creating a column that only has values within a certain range, you add two different filters. One filter specifies that the column must have values greater than the value that you provide. The other filter specifies that the column must have values less than the value that you provide.
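The range example in the preceding paragraph can be sketched in pandas (illustrative only; Canvas applies the equivalent filters through the UI): two conditions on the same column, combined so that only values inside the range remain.

```python
import pandas as pd

df = pd.DataFrame({"price": [3.0, 12.5, 48.0, 75.0, 120.0]})

# Two filters combined: greater than 10 AND less than 100
in_range = df[(df["price"] > 10) & (df["price"] < 100)]
```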

Use the following procedure to add the filter transform to your data.

**To filter your data**

1. From your Data Wrangler flow, choose the ellipsis icon next to the node with the data that you're filtering.

1. Choose **Add transform**.

1. Choose **Add step**.

1. Choose **Filter data**.

1. Specify the following fields:
   + **Column name** – The column that you're filtering.
   + **Condition** – The condition of the filter.
   + **Value** – The value or category in the column to which you're applying the filter.

1. (Optional) Choose **+** following the filter that you've created.

1. Configure the filter.

1. Choose **Preview**.

1. Choose **Add**.

# Chat for data prep
<a name="canvas-chat-for-data-prep"></a>

**Important**  
For administrators:  
Chat for data prep requires the `AmazonSageMakerCanvasAIServicesAccess` policy. For more information, see [AWS managed policy: AmazonSageMakerCanvasAIServicesAccess](security-iam-awsmanpol-canvas.md#security-iam-awsmanpol-AmazonSageMakerCanvasAIServicesAccess).  
Chat for data prep requires access to Amazon Bedrock and the **Anthropic Claude** model within it. For more information, see [Add model access](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html#add-model-access).
You must run SageMaker Canvas data prep in the same AWS Region as the Region where you're running your model. Chat for data prep is available in the US East (N. Virginia), US West (Oregon), and Europe (Frankfurt) AWS Regions.

In addition to using the built-in transforms and analyses, you can use natural language to explore, visualize, and transform your data in a conversational interface. Within the conversational interface, you can use natural language queries to understand and prepare your data to build ML models.

The following are examples of some prompts that you can use:
+ Summarize my data
+ Drop column `example-column-name`
+ Replace missing values with median
+ Plot histogram of prices
+ What is the most expensive item sold?
+ How many distinct items were sold?
+ Sort data by region

When you’re transforming your data using your prompts, you can view a preview that shows how the data is being transformed. You can choose to add it as a step in your Data Wrangler flow based on what you see in the preview.

The responses to your prompts generate code for your transformations and analyses. You can modify the code to update the output from the prompt. For example, you can modify the code for an analysis to change the values of the axes of a graph.

Use the following procedure to start chatting with your data:

**To chat with your data**

1. Open the SageMaker Canvas data flow.

1. Choose the speech bubble.  
![\[Chat for data prep is at the top of the screen\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/chat-for-data-prep-welcome-step.png)

1. Specify a prompt.

1. (Optional) If an analysis has been generated by your query, choose **Add to analyses** to reference it for later.  
![\[The view of an editable and copyable code block.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/encanto-query-for-visualization.png)

1. (Optional) If you've transformed your data using a prompt, do the following.

   1. Choose **Preview** to view the results.

   1. (Optional) Modify the code in the transform and choose **Update**.

   1. (Optional) If you're happy with the results of the transform, choose **Add to steps** to add it to the steps panel on the right-hand navigation.  
![\[Added to steps shows confirmation that the transform has been added to the flow.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/transform-added-to-steps-panel.png)

After you’ve prepared your data using natural language, you can create a model using your transformed data. For more information about creating a model, see [How custom models work](canvas-build-model.md).

# How data processing works in Data Wrangler
<a name="canvas-data-processing"></a>

While working with data interactively in an Amazon SageMaker Data Wrangler data flow, Amazon SageMaker Canvas only applies the transformations to a sample dataset for you to preview. After finishing your data flow in SageMaker Canvas, you can process all of your data and save it in a location that is suitable for your machine learning workflows.

There are several options for how to proceed after you've finished transforming your data in Data Wrangler:
+ [Create a model](canvas-processing-export-model.md). You can create a Canvas model, where you directly start creating a model with your prepared data. You can create a model either after processing your entire dataset, or by exporting just the sample data you worked with in Data Wrangler. Canvas saves your processed data (either the entire dataset or the sample data) as a Canvas dataset.

  We recommend that you use your sample data for quick iterations, but that you use your entire dataset when you want to train your final model. When building tabular models, datasets larger than 5 GB are automatically downsampled to 5 GB, and for time series forecasting models, datasets larger than 30 GB are downsampled to 30 GB.

  To learn more about creating a model, see [How custom models work](canvas-build-model.md).
+ [Export the data](canvas-export-data.md). You can export your data for use in machine learning workflows. When you choose to export your data, you have several options:
  + You can save your data in the Canvas application as a dataset. For more information about the supported file types for Canvas datasets and additional requirements when importing data into Canvas, see [Create a dataset](canvas-import-dataset.md).
  + You can save your data to Amazon S3. Depending on the Canvas memory availability, your data is processed in the application and then exported to Amazon S3. If the size of your dataset exceeds what Canvas can process, then by default, Canvas uses an EMR Serverless job to scale to multiple compute instances, process your full dataset, and export it to Amazon S3. You can also manually configure a SageMaker Processing job to have more granular control over the compute resources used to process your data.
+ [Export a data flow](canvas-export-data-flow.md). You might want to save the code for your data flow so that you can modify or run your transformations outside of Canvas. Canvas provides you with the option to save your data flow transformations as Python code in a Jupyter notebook, which you can then export to Amazon S3 for use elsewhere in your machine learning workflows.

When you export your data from a data flow and save it either as a Canvas dataset or to Amazon S3, Canvas creates a new destination node in your data flow, which is a final node that shows you where your processed data is stored. You can add additional destination nodes to your flow if you'd like to perform multiple export operations. For example, you can export the data from different points in your data flow to only apply some of the transformations, or you can export transformed data to different Amazon S3 locations. For more information about how to add or edit a destination node, see [Add destination nodes](canvas-destination-nodes-add.md) and [Edit a destination node](canvas-destination-nodes-edit.md).

For more information about setting up a schedule with Amazon EventBridge to automatically process and export your data on a schedule, see [Create a schedule to automatically process new data](canvas-data-export-schedule-job.md).

# Export to create a model
<a name="canvas-processing-export-model"></a>

In just a few clicks from your data flow, you can export your transformed data and start creating an ML model in Canvas. Canvas saves your data as a Canvas dataset, and you're taken to the model build configuration page for a new model.

To create a Canvas model with your transformed data:

1. Navigate to your data flow.

1. Choose the ellipsis icon next to the node that you're exporting.

1. From the context menu, choose **Create model**.

1. In the **Export to create a model** side panel, enter a **Dataset name** for the new dataset.

1. Leave the **Process entire dataset** option selected to process and export your entire dataset before proceeding with building a model. Turn this option off to train your model using the interactive sample data you are working with in your data flow.

1. Enter a **Model name** to name the new model.

1. Select a **Problem type**, or the type of model that you want to build. For more information about the supported model types in SageMaker Canvas, see [How custom models work](canvas-build-model.md).

1. Select the **Target column**, or the value that you want the model to predict.

1. Choose **Export and create model**.

The **Build** tab for a new Canvas model should open, and you can finish configuring and training your model. For more information about how to build a model, see [Build a model](canvas-build-model-how-to.md).

# Export data
<a name="canvas-export-data"></a>

Export data to apply the transforms from your data flow to the full imported dataset. You can export any node in your data flow to the following locations:
+ SageMaker Canvas dataset
+ Amazon S3

If you want to train models in Canvas, you can export your full, transformed dataset as a Canvas dataset. If you want to use your transformed data in machine learning workflows external to SageMaker Canvas, you can export your dataset to Amazon S3.

## Export to a Canvas dataset
<a name="canvas-export-data-canvas"></a>

Use the following procedure to export a SageMaker Canvas dataset from a node in your data flow.

**To export a node in your flow as a SageMaker Canvas dataset**

1. Navigate to your data flow.

1. Choose the ellipsis icon next to the node that you're exporting.

1. In the context menu, hover over **Export**, and then select **Export data to Canvas dataset**.

1. In the **Export to Canvas dataset** side panel, enter a **Dataset name** for the new dataset.

1. Leave the **Process entire dataset** option selected if you want SageMaker Canvas to process and save your full dataset. Turn this option off to only apply the transforms to the sample data you are working with in your data flow.

1. Choose **Export**.

You should now be able to go to the **Datasets** page of the Canvas application and see your new dataset.

## Export to Amazon S3
<a name="canvas-export-data-s3"></a>

When exporting your data to Amazon S3, you can scale to transform and process data of any size. Canvas automatically processes your data locally if the application's memory can handle the size of your dataset. If your dataset size exceeds the local memory capacity of 5 GB, then Canvas initiates a remote job on your behalf to provision additional compute resources and process the data more quickly. By default, Canvas uses Amazon EMR Serverless to run these remote jobs. However, you can manually configure Canvas to use either EMR Serverless or a SageMaker Processing job with your own settings.

**Note**  
When running an EMR Serverless job, by default the job inherits the IAM role, KMS key settings, and tags of your Canvas application.

The following summarizes the options for remote jobs in Canvas:
+ **EMR Serverless**: This is the default option that Canvas uses for remote jobs. EMR Serverless automatically provisions and scales compute resources to process your data so that you don't have to worry about choosing the right compute resources for your workload. For more information about EMR Serverless, see the [EMR Serverless User Guide](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html).
+ **SageMaker Processing**: SageMaker Processing jobs offer more advanced options and granular control over the compute resources used to process your data. For example, you can specify the type and count of the compute instances, configure the job in your own VPC and control network access, automate processing jobs, and more. For more information about automating processing jobs, see [Create a schedule to automatically process new data](canvas-data-export-schedule-job.md). For more general information about SageMaker Processing jobs, see [Data transformation workloads with SageMaker Processing](processing-job.md).

The following file types are supported when exporting to Amazon S3:
+ CSV
+ Parquet

To get started, review the following prerequisites.

### Prerequisites for EMR Serverless jobs
<a name="canvas-export-data-emr-prereqs"></a>

To create a remote job that uses EMR Serverless resources, you must have the necessary permissions. You can grant permissions either through the Amazon SageMaker AI domain or user profile settings, or you can manually configure your user's AWS IAM role. For instructions on how to grant users permissions to perform large data processing, see [Grant Users Permissions to Use Large Data across the ML Lifecycle](canvas-large-data-permissions.md).

If you don't want to configure these policies but still need to process large datasets through Data Wrangler, you can alternatively use a SageMaker Processing job.

Use the following procedures to export your data to Amazon S3. To configure a remote job, follow the optional advanced steps.

**To export a node in your flow to Amazon S3**

1. Navigate to your data flow.

1. Choose the ellipsis icon next to the node that you're exporting.

1. In the context menu, hover over **Export**, and then select **Export data to Amazon S3**.

1. In the **Export to Amazon S3** side panel, you can change the **Dataset name** for the new dataset.

1. For the **S3 location**, enter the Amazon S3 location to which you want to export the dataset. You can enter the S3 URI, alias, or ARN of the S3 location or S3 access point. For more information about access points, see [Managing data access with Amazon S3 access points](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points.html) in the *Amazon S3 User Guide*.

1. (Optional) For the **Advanced settings**, specify values for the following fields:

   1. **File type** – The file format of your exported data.

   1. **Delimiter** – The delimiter used to separate values in the file.

   1. **Compression** – The compression method used to reduce the file size.

   1. **Number of partitions** – The number of dataset files that Canvas writes as the output of the job.

   1. **Choose columns** – You can choose a subset of columns from the data to include in the partitions.

1. Leave the **Process entire dataset** option selected if you want Canvas to apply your data flow transforms to your entire dataset and export the result. If you deselect this option, Canvas only applies the transforms to the sample of your dataset used in the interactive Data Wrangler data flow.
**Note**  
If you only export a sample of your data, Canvas processes your data in the application and doesn't create a remote job for you.

1. Leave the **Auto job configuration** option selected if you want Canvas to automatically determine whether to run the job using Canvas application memory or an EMR Serverless job. If you deselect this option and manually configure your job, then you can choose to use either an EMR Serverless or a SageMaker Processing job. For instructions on how to configure an EMR Serverless or a SageMaker Processing job, see the section after this procedure before you export your data.

1. Choose **Export**.

The following procedures show how to manually configure the remote job settings for either EMR Serverless or SageMaker Processing when exporting your full dataset to Amazon S3.

------
#### [ EMR Serverless ]

To configure an EMR Serverless job while exporting to Amazon S3, do the following:

1. In the Export to Amazon S3 side panel, turn off the **Auto job configuration** option.

1. Select **EMR Serverless**.

1. For **Job name**, enter a name for your EMR Serverless job. The name can contain letters, numbers, hyphens, and underscores.

1. For **IAM role**, enter the user's IAM execution role. This role should have the required permissions to run EMR Serverless applications. For more information, see [Grant Users Permissions to Use Large Data across the ML Lifecycle](canvas-large-data-permissions.md).

1. (Optional) For **KMS key**, specify the key ID or ARN of an AWS KMS key to encrypt the job logs. If you don't enter a key, Canvas uses a default key for EMR Serverless.

1. (Optional) For **Monitoring configuration**, enter the name of an Amazon CloudWatch Logs log group to which you want to publish your logs.

1. (Optional) For **Tags**, add metadata tags to the EMR Serverless job consisting of key-value pairs. These tags can be used to categorize and search for jobs.

1. Choose **Export** to start the job.

------
#### [ SageMaker Processing ]

To configure a SageMaker Processing job while exporting to Amazon S3, do the following:

1. In the **Export to Amazon S3** side panel, turn off the **Auto job configuration** option.

1. Select **SageMaker Processing**.

1. For **Job name**, enter a name for your SageMaker AI Processing job.

1. For **Instance type**, select the type of compute instance to run the processing job.

1. For **Instance count**, specify the number of compute instances to launch.

1. For **IAM role**, enter the user's IAM execution role. This role should have the required permissions for SageMaker AI to create and run processing jobs on your behalf. These permissions are granted if you have the [AmazonSageMakerFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html) policy attached to your IAM role.

1. For **Volume size**, enter the storage size in GB for the ML storage volume that is attached to each processing instance. Choose the size based on your expected input and output data size.

1. (Optional) For **Volume KMS key**, specify a KMS key to encrypt the storage volume. If you don't specify a key, the default Amazon EBS encryption key is used.

1. (Optional) For **KMS key**, specify a KMS key to encrypt input and output Amazon S3 data sources used by the processing job.

1. (Optional) For **Spark memory configuration**, do the following:

   1. Enter **Driver memory in MB** for the Spark driver node that handles job coordination and scheduling.

   1. Enter **Executor memory in MB** for the Spark executor nodes that run individual tasks in the job.

1. (Optional) For **Network configuration**, do the following:

   1. For **Subnet configuration**, enter the IDs of the VPC subnets for the processing instances to be launched in. By default, the job uses the settings of your default VPC.

   1. For **Security group configuration**, enter the IDs of the security groups to control inbound and outbound connectivity rules.

   1. Turn on the **Enable inter-container traffic encryption** option to encrypt network communication between processing containers during the job.

1. (Optional) For **Associate schedules**, you can choose to create an Amazon EventBridge schedule to have the processing job run on recurring intervals. Choose **Create new schedule** and fill out the dialog box. For more information about filling out this section and running processing jobs on a schedule, see [Create a schedule to automatically process new data](canvas-data-export-schedule-job.md).

1. (Optional) Add **Tags** as key-value pairs so that you can categorize and search for processing jobs.

1. Choose **Export** to start the processing job.

------

After exporting your data, you should find the fully processed dataset in the specified Amazon S3 location.

# Export a data flow
<a name="canvas-export-data-flow"></a>

Exporting your data flow translates the operations that you've performed in Data Wrangler into a Jupyter notebook of Python code that you can modify and run. This can be helpful for integrating the code for your data transformations into your machine learning pipelines.

You can choose any data node in your data flow and export it. Exporting the data node exports the transformation that the node represents and the transformations that precede it.

**To export a data flow as a Jupyter notebook**

1. Navigate to your data flow.

1. Choose the ellipsis icon next to the node that you want to export.

1. In the context menu, hover over **Export**, and then hover over **Export via Jupyter notebook**.

1. Choose one of the following:
   + **SageMaker Pipelines**
   + **Amazon S3**
   + **SageMaker AI Inference Pipeline**
   + **SageMaker AI Feature Store**
   + **Python Code**

1. The **Export data flow as notebook** dialog box opens. Select one of the following:
   + **Download a local copy**
   + **Export to S3 location**

1. If you selected **Export to S3 location**, enter the Amazon S3 location to which you want to export the notebook.

1. Choose **Export**.

Your Jupyter notebook should either download to your local machine, or you can find it saved in the Amazon S3 location you specified.

# Add destination nodes
<a name="canvas-destination-nodes-add"></a>

A destination node in SageMaker Canvas specifies where to store your processed and transformed data. When you choose to export your transformed data to Amazon S3, Canvas uses the specified destination node location, applying all the transformations you've configured in your data flow. For more information about export jobs to Amazon S3, see the preceding section [Export to Amazon S3](canvas-export-data.md#canvas-export-data-s3).

By default, choosing to export your data to Amazon S3 adds a destination node to your data flow. However, you can add multiple destination nodes to your flow, allowing you to simultaneously export different sets of transformations or variations of your data to different Amazon S3 locations. For example, you can create one destination node that exports the data after applying all transformations, and another destination node that exports the data after only certain initial transformations, such as a join operation. This flexibility enables you to export and store different versions or subsets of your transformed data in separate S3 locations for various use cases.

Use the following procedure to add a destination node to your data flow.

**To add a destination node**

1. Navigate to your data flow.

1. Choose the ellipsis icon next to the node where you want to place the destination node.

1. In the context menu, hover over **Export**, and then select **Add destination**.

1. In the **Export destination** side panel, enter a **Dataset name** to name the output.

1. For **Amazon S3 location**, enter the Amazon S3 location to which you want to export the output. You can enter the S3 URI, alias, or ARN of the S3 location or S3 access point. For more information about access points, see [Managing data access with Amazon S3 access points](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points.html) in the *Amazon S3 User Guide*.

1. For **Export settings**, specify the following fields:

   1. **File type** – The file format of the exported data.

   1. **Delimiter** – The delimiter used to separate values in the file.

   1. **Compression** – The compression method used to reduce the file size.

1. For **Partitioning**, specify the following fields:

   1. **Number of partitions** – The number of dataset files that SageMaker Canvas writes as the output of the job.

   1. **Choose columns** – You can choose a subset of columns from the data to include in the partitions.

1. Choose **Add** if you want to simply add a destination node to your data flow, or choose **Add** and then choose **Export** if you want to add the node and initiate an export job.

You should now see a new destination node in your flow.

# Edit a destination node
<a name="canvas-destination-nodes-edit"></a>

A *destination node* in an Amazon SageMaker Canvas data flow specifies the Amazon S3 location where your processed and transformed data is stored, applying all the configured transformations in your data flow. You can edit the configuration of an existing destination node and then choose to re-run the job to overwrite the data in the specified Amazon S3 location. For more information about adding a new destination node, see [Add destination nodes](canvas-destination-nodes-add.md).

Use the following procedure to edit a destination node in your data flow and initiate an export job.

**To edit a destination node**

1. Navigate to your data flow.

1. Choose the ellipsis icon next to the destination node that you want to edit.

1. In the context menu, choose **Edit**.

1. The **Edit destination** side panel opens. From this panel, you can edit details such as the dataset name, the Amazon S3 location, and the export and partitioning settings.

1. (Optional) In **Additional nodes to export**, you can select more destination nodes to process when you run the export job.

1. Leave the **Process entire dataset** option selected if you want Canvas to apply your data flow transforms to your entire dataset and export the result. If you deselect this option, Canvas only applies the transforms to the sample of your dataset used in the interactive Data Wrangler data flow.

1. Leave the **Auto job configuration** option selected if you want Canvas to automatically determine whether to run the job using Canvas application memory or an EMR Serverless job. If you deselect this option and manually configure your job, then you can choose to use either an EMR Serverless or a SageMaker Processing job. For instructions on how to configure an EMR Serverless or a SageMaker Processing job, see the preceding section [Export to Amazon S3](canvas-export-data.md#canvas-export-data-s3).

1. When you're done making changes, choose **Update**.

Saving changes to your destination node configuration doesn't automatically re-run a job or overwrite data that has already been processed and exported. Export your data again to run a job with the new configuration. If you decide to export your data again with a job, Canvas uses the updated destination node configuration to transform and output the data to the specified location, overwriting any existing data.

# Create a schedule to automatically process new data
<a name="canvas-data-export-schedule-job"></a>

**Note**  
The following section only applies to SageMaker Processing jobs. If you used the default Canvas settings or EMR Serverless to create a remote job to apply transforms to your full dataset, this section doesn’t apply.

If you're processing data periodically, you can create a schedule to run the processing job automatically. For example, you can create a schedule that runs a processing job automatically when you get new data. For more information about processing jobs, see [Export to Amazon S3](canvas-export-data.md#canvas-export-data-s3).

When you create a job, you must specify an IAM role that has permissions to create the job. You can use the [AmazonSageMakerCanvasDataPrepFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasDataPrepFullAccess.html) policy to add permissions.

Add the following trust policy to the role to allow EventBridge to assume it.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "events.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```
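If you manage the role programmatically, you can attach the trust policy with the IAM API. The following is a minimal sketch, assuming you have `iam:UpdateAssumeRolePolicy` permission; the role name and helper names are our own, not from SageMaker AI:

```python
import json

def trust_policy_document(statement):
    # Wrap a single trust statement in a standard policy document.
    return json.dumps({"Version": "2012-10-17", "Statement": [statement]})

def allow_eventbridge_to_assume(role_name, statement):
    # role_name is hypothetical; requires iam:UpdateAssumeRolePolicy permission.
    import boto3  # deferred so trust_policy_document is testable without AWS access
    boto3.client("iam").update_assume_role_policy(
        RoleName=role_name,
        PolicyDocument=trust_policy_document(statement),
    )
```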

**Important**  
When you create a schedule, Data Wrangler creates an `eventRule` in EventBridge. You incur charges for both the event rules that you create and the instances used to run the processing job.  
For information about EventBridge pricing, see [Amazon EventBridge pricing](https://aws.amazon.com/eventbridge/pricing/). For information about processing job pricing, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

You can set a schedule using one of the following methods:
+ [CRON expressions](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-rule-schedule.html)
**Note**  
Data Wrangler doesn't support the following expressions:  
LW\$1
Abbreviations for days
Abbreviations for months
+ [RATE expressions](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-rule-schedule.html#eb-rate-expressions)
+ Recurring – Set an hourly or daily interval to run the job.
+ Specific time – Set specific days and times to run the job.
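For reference, the following sketch shows what valid EventBridge schedule expressions look like. The specific schedules are illustrative, and numeric day-of-week fields are used because Data Wrangler doesn't support day abbreviations:

```python
# Illustrative EventBridge schedule expressions. Numeric day-of-week values
# (1 = Sunday ... 7 = Saturday) avoid the unsupported day abbreviations.
CRON_WEEKDAYS_6AM = "cron(0 6 ? * 2-6 *)"    # 06:00 UTC, Monday through Friday
CRON_FIRST_OF_MONTH = "cron(30 0 1 * ? *)"   # 00:30 UTC on the 1st of each month
RATE_TWICE_DAILY = "rate(12 hours)"          # every 12 hours
```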

The following sections provide procedures on scheduling jobs when filling out the SageMaker AI Processing job settings while [exporting your data to Amazon S3](canvas-export-data.md#canvas-export-data-s3). All of the following instructions begin in the **Associate schedules** section of the SageMaker Processing job settings.

------
#### [ CRON ]

Use the following procedure to create a schedule with a CRON expression.

1. In the **Export to Amazon S3** side panel, make sure you've turned off the **Auto job configuration** toggle and have the **SageMaker Processing** option selected.

1. In the **SageMaker Processing** job settings, open the **Associate schedules** section and choose **Create new schedule**.

1. The **Create new schedule** dialog box opens. For **Schedule Name**, specify the name of the schedule.

1. For **Run Frequency**, choose **CRON**.

1. For each of the **Minutes**, **Hours**, **Days of month**, **Month**, and **Day of week** fields, enter valid CRON expression values.

1. Choose **Create**.

1. (Optional) Choose **Add another schedule** to run the job on an additional schedule.
**Note**  
You can associate a maximum of two schedules. The schedules are independent and don't affect each other unless the times overlap.

1. Choose one of the following:
   + **Schedule and run now** – The job runs immediately and subsequently runs on the schedules.
   + **Schedule only** – The job only runs on the schedules that you specify.

1. Choose **Export** after you've filled out the rest of the export job settings.

------
#### [ RATE ]

Use the following procedure to create a schedule with a RATE expression.

1. In the **Export to Amazon S3** side panel, make sure you've turned off the **Auto job configuration** toggle and have the **SageMaker Processing** option selected.

1. In the **SageMaker Processing** job settings, open the **Associate schedules** section and choose **Create new schedule**.

1. The **Create new schedule** dialog box opens. For **Schedule Name**, specify the name of the schedule.

1. For **Run Frequency**, choose **Rate**.

1. For **Value**, specify an integer.

1. For **Unit**, select one of the following:
   + **Minutes**
   + **Hours**
   + **Days**

1. Choose **Create**.

1. (Optional) Choose **Add another schedule** to run the job on an additional schedule.
**Note**  
You can associate a maximum of two schedules. The schedules are independent and don't affect each other unless the times overlap.

1. Choose one of the following:
   + **Schedule and run now** – The job runs immediately and subsequently runs on the schedules.
   + **Schedule only** – The job only runs on the schedules that you specify.

1. Choose **Export** after you've filled out the rest of the export job settings.

------
#### [ Recurring ]

Use the following procedure to create a schedule that runs a job on a recurring basis.

1. In the **Export to Amazon S3** side panel, make sure you've turned off the **Auto job configuration** toggle and have the **SageMaker Processing** option selected.

1. In the **SageMaker Processing** job settings, open the **Associate schedules** section and choose **Create new schedule**.

1. The **Create new schedule** dialog box opens. For **Schedule Name**, specify the name of the schedule.

1. For **Run Frequency**, choose **Recurring**.

1. For **Every x hours**, specify the hourly frequency that the job runs during the day. Valid values are integers from **1** to **23**, inclusive.

1. For **On days**, select one of the following options:
   + **Every Day**
   + **Weekends**
   + **Weekdays**
   + **Select Days**

   1. (Optional) If you've selected **Select Days**, choose the days of the week to run the job.
**Note**  
The schedule resets every day. If you schedule a job to run every five hours, it runs at the following times during the day:  
00:00
05:00
10:00
15:00
20:00

1. Choose **Create**.

1. (Optional) Choose **Add another schedule** to run the job on an additional schedule.
**Note**  
You can associate a maximum of two schedules. The schedules are independent and don't affect each other unless the times overlap.

1. Choose one of the following:
   + **Schedule and run now** – The job runs immediately and subsequently runs on the schedules.
   + **Schedule only** – The job only runs on the schedules that you specify.

1. Choose **Export** after you've filled out the rest of the export job settings.

------
#### [ Specific time ]

Use the following procedure to create a schedule that runs a job at specific times.

1. In the **Export to Amazon S3** side panel, make sure you've turned off the **Auto job configuration** toggle and have the **SageMaker Processing** option selected.

1. In the **SageMaker Processing** job settings, open the **Associate schedules** section and choose **Create new schedule**.

1. The **Create new schedule** dialog box opens. For **Schedule Name**, specify the name of the schedule.

1. For **Run Frequency**, choose **Start time**.

1. For **Start time**, enter a time in 24-hour **HH:MM** format (for example, **09:00**). The start time defaults to the time zone where you are located.

1. For **On days**, select one of the following options:
   + **Every Day**
   + **Weekends**
   + **Weekdays**
   + **Select Days**

   1. (Optional) If you've selected **Select Days**, choose the days of the week to run the job.

1. Choose **Create**.

1. (Optional) Choose **Add another schedule** to run the job on an additional schedule.
**Note**  
You can associate a maximum of two schedules. The schedules are independent and don't affect each other unless the times overlap.

1. Choose one of the following:
   + **Schedule and run now** – The job runs immediately and subsequently runs on the schedules.
   + **Schedule only** – The job only runs on the schedules that you specify.

1. Choose **Export** after you've filled out the rest of the export job settings.

------

You can use the SageMaker AI console in the AWS Management Console to view the jobs that are scheduled to run. Your processing jobs run within Pipelines: each processing job has its own pipeline and runs as a processing step within that pipeline. You can view the schedules that you've created within a pipeline. For information about viewing a pipeline, see [View the details of a pipeline](pipelines-studio-list.md).

Use the following procedure to view the jobs that you've scheduled.

1. Open Amazon SageMaker Studio Classic.

1. Open **Pipelines**.

1. View the pipelines for the jobs that you've created.

   The pipeline running the job uses the job name as a prefix. For example, if you've created a job named `housing-data-feature-engineering`, the name of the pipeline is `canvas-data-prep-housing-data-feature-engineering`.

1. Choose the pipeline containing your job.

1. View the status of the pipelines. Pipelines with a **Status** of **Succeeded** have run the processing job successfully.

To stop a processing job from running, delete the event rule that specifies the schedule. Deleting an event rule stops all the jobs associated with the schedule from running. For information about deleting a rule, see [Disabling or deleting an Amazon EventBridge rule](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-delete-rule.html).

You can stop and delete the pipelines associated with the schedules as well. For information about stopping a pipeline, see [StopPipelineExecution](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_StopPipelineExecution.html). For information about deleting a pipeline, see [DeletePipeline](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeletePipeline.html#API_DeletePipeline_RequestSyntax).
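If you prefer to clean up a schedule programmatically, the following sketch uses the EventBridge and SageMaker AI APIs. The rule name, execution ARN, and pipeline name are hypothetical, and you need the corresponding permissions:

```python
def delete_schedule_rule(rule_name):
    # A rule's targets must be removed before the rule itself can be deleted.
    import boto3  # deferred; only needed when the functions are actually called
    events = boto3.client("events")
    targets = events.list_targets_by_rule(Rule=rule_name)["Targets"]
    if targets:
        events.remove_targets(Rule=rule_name, Ids=[t["Id"] for t in targets])
    events.delete_rule(Name=rule_name)

def stop_and_delete_pipeline(execution_arn, pipeline_name):
    # Stop any in-progress execution, then delete the pipeline.
    import boto3
    sm = boto3.client("sagemaker")
    sm.stop_pipeline_execution(PipelineExecutionArn=execution_arn)
    sm.delete_pipeline(PipelineName=pipeline_name)
```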

# Automate data preparation in SageMaker Canvas
<a name="canvas-data-export"></a>

After you transform your data in a data flow, you can export the transforms to your machine learning workflows. When you export your transforms, SageMaker Canvas creates a Jupyter notebook. You must run the notebook within Amazon SageMaker Studio Classic. For information about getting started with Studio Classic, contact your administrator.

## Automate data preparation using Pipelines
<a name="canvas-data-export-pipelines"></a>

When you want to build and deploy large-scale machine learning (ML) workflows, you can use Pipelines to create workflows that manage and deploy SageMaker AI jobs. With Pipelines, you can build workflows that manage your SageMaker AI data preparation, model training, and model deployment jobs, and you can use the first-party algorithms that SageMaker AI offers. For more information about Pipelines, see [SageMaker Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html).

When you export one or more steps from your data flow to Pipelines, Data Wrangler creates a Jupyter notebook that you can use to define, instantiate, run, and manage a pipeline.

### Use a Jupyter notebook to create a pipeline
<a name="canvas-pipelines-notebook"></a>

Use the following procedure to generate a Jupyter notebook and run it to export your Data Wrangler flow to Pipelines.

1. Choose the ellipsis icon next to the node that you want to export.

1. Choose **Export data flow**.

1. Choose **Pipelines (via Jupyter Notebook)**.

1. Download the Jupyter notebook or copy it to an Amazon S3 location. We recommend copying it to an Amazon S3 location that you can access within Studio Classic. Contact your administrator if you need guidance on a suitable location.

1. Run the Jupyter notebook.

You can use the Jupyter notebook that Data Wrangler produces to define a pipeline. The pipeline includes the data processing steps that are defined by your Data Wrangler flow. 

You can add additional steps to your pipeline by adding steps to the `steps` list in the following code in the notebook:

```
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[instance_type, instance_count],
    steps=[step_process], #Add more steps to this list to run in your Pipeline
)
```

For more information on defining pipelines, see [Define SageMaker AI Pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/define-pipeline.html).
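As a rough sketch of what happens when you run the notebook, you can register the pipeline definition and start an execution with the SageMaker AI Python SDK. The helper name and role ARN here are our own assumptions, not part of the generated notebook:

```python
def run_pipeline(pipeline, role_arn):
    # pipeline is a sagemaker.workflow.pipeline.Pipeline object, such as the
    # one defined in the generated notebook; role_arn is a hypothetical
    # SageMaker AI execution role with permissions to create and run pipelines.
    pipeline.upsert(role_arn=role_arn)  # create or update the pipeline definition
    execution = pipeline.start()        # begin an execution
    execution.wait()                    # block until the execution finishes
    return execution
```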

## Automate data preparation using an inference endpoint
<a name="canvas-data-export-inference"></a>

Use your Data Wrangler flow to process data at the time of inference by creating a SageMaker AI serial inference pipeline from your Data Wrangler flow. An inference pipeline is a series of steps that results in a trained model making predictions on new data. A serial inference pipeline within Data Wrangler transforms the raw data and provides it to the machine learning model for a prediction. You create, run, and manage the inference pipeline from a Jupyter notebook within Studio Classic. For more information about accessing the notebook, see [Use a Jupyter notebook to create an inference endpoint](#canvas-inference-notebook).

Within the notebook, you can either train a machine learning model or specify one that you've already trained. To train the model, you can use either Amazon SageMaker Autopilot or XGBoost with the data that you've transformed in your Data Wrangler flow.

The pipeline provides the ability to perform either batch or real-time inference. You can also add the Data Wrangler flow to SageMaker Model Registry. For more information about hosting models, see [Multi-model endpoints](multi-model-endpoints.md).

**Important**  
You can't export your Data Wrangler flow to an inference endpoint if it includes any of the following transformations:
+ Join
+ Concatenate
+ Group by

If you must use the preceding transforms to prepare your data, use the following procedure:

1. Create a Data Wrangler flow.

1. Apply the transforms that aren't supported.

1. Export the data to an Amazon S3 bucket.

1. Create a separate Data Wrangler flow.

1. Import the data that you've exported from the preceding flow.

1. Apply the remaining transforms.

1. Create a serial inference pipeline using the Jupyter notebook that we provide.

For information about exporting your data to an Amazon S3 bucket, see [Export data](canvas-export-data.md). For information about opening the Jupyter notebook used to create the serial inference pipeline, see [Use a Jupyter notebook to create an inference endpoint](#canvas-inference-notebook).

Data Wrangler ignores transforms that remove data at the time of inference. For example, Data Wrangler ignores the [Handle Missing Values](canvas-transform.md#canvas-transform-handle-missing) transform if you use the **Drop missing** configuration.

If you've refit transforms to your entire dataset, the transforms carry over to your inference pipeline. For example, if you used the median value to impute missing values, the median value from refitting the transform is applied to your inference requests. You can refit the transforms from your Data Wrangler flow either when you're using the Jupyter notebook or when you're exporting your data to an inference pipeline.

The serial inference pipeline supports the following data types for the input and output strings. Each data type has a set of requirements.

**Supported datatypes**
+ `text/csv` – the datatype for CSV strings
  + The string can't have a header.
  + Features used for the inference pipeline must be in the same order as features in the training dataset.
  + There must be a comma delimiter between features.
  + Records must be delimited by a newline character.

  The following is an example of a validly formatted CSV string that you can provide in an inference request.

  ```
  abc,0.0,"Doe, John",12345\ndef,1.1,"Doe, Jane",67890                    
  ```
+ `application/json` – the datatype for JSON strings
  + The features used in the dataset for the inference pipeline must be in the same order as the features in the training dataset.
  + The data must have a specific schema. You define the schema as a single `instances` object that contains a list of `features` objects. Each `features` object represents an observation.

  The following is an example of a validly formatted JSON string that you can provide in an inference request.

  ```
  {
      "instances": [
          {
              "features": ["abc", 0.0, "Doe, John", 12345]
          },
          {
              "features": ["def", 1.1, "Doe, Jane", 67890]
          }
      ]
  }
  ```
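To illustrate both formats, the following sketch builds request bodies that satisfy these requirements and sends one with the SageMaker AI Runtime API. The endpoint name and helper names are our own, and the quoting logic is a simplification of full CSV escaping:

```python
import json

def to_csv_payload(records):
    # No header; comma-delimited features; newline-delimited records.
    # Values containing commas are quoted (simplified CSV escaping).
    def fmt(value):
        text = str(value)
        return f'"{text}"' if "," in text else text
    return "\n".join(",".join(fmt(v) for v in record) for record in records)

def to_json_payload(records):
    # A single "instances" object; each "features" list is one observation.
    return json.dumps({"instances": [{"features": list(r)} for r in records]})

def invoke(endpoint_name, body, content_type):
    # endpoint_name is hypothetical; requires AWS credentials and a deployed endpoint.
    import boto3  # deferred so the payload builders are testable without AWS access
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name, ContentType=content_type, Body=body
    )
    return response["Body"].read()
```

For example, `invoke("my-endpoint", to_csv_payload(records), "text/csv")` would send the CSV form of the request.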

### Use a Jupyter notebook to create an inference endpoint
<a name="canvas-inference-notebook"></a>

Use the following procedure to generate a Jupyter notebook and run it to create an inference pipeline from your Data Wrangler flow.

1. Choose the ellipsis icon next to the node that you want to export.

1. Choose **Export data flow**.

1. Choose **SageMaker AI Inference Pipeline (via Jupyter Notebook)**.

1. Download the Jupyter notebook or copy it to an Amazon S3 location. We recommend copying it to an Amazon S3 location that you can access within Studio Classic. Contact your administrator if you need guidance on a suitable location.

1. Run the Jupyter notebook.

When you run the Jupyter notebook, it creates an inference flow artifact. An inference flow artifact is a Data Wrangler flow file with additional metadata used to create the serial inference pipeline. The node that you're exporting encompasses all of the transforms from the preceding nodes.

**Important**  
Data Wrangler needs the inference flow artifact to run the inference pipeline. You can't use your own flow file as the artifact. You must create it by using the preceding procedure.

## Automate data preparation using Python Code
<a name="canvas-data-export-python-code"></a>

Use the following procedure to generate a Jupyter notebook and run it to export all steps in your data flow to a Python file that you can manually integrate into any data processing workflow.

1. Choose the ellipsis icon next to the node that you want to export.

1. Choose **Export data flow**.

1. Choose **Python Code**.

1. Download the Jupyter notebook or copy it to an Amazon S3 location. We recommend copying it to an Amazon S3 location that you can access within Studio Classic. Contact your administrator if you need guidance on a suitable location.

1. Run the Jupyter notebook.

You might need to configure the Python script to make it run in your pipeline. For example, if you run the script in a Spark environment, make sure that the environment has permission to access AWS resources.

# Generative AI foundation models in SageMaker Canvas
<a name="canvas-fm-chat"></a>

Amazon SageMaker Canvas provides generative AI foundation models that you can use to start conversational chats. These content generation models are trained on large amounts of text data to learn the statistical patterns and relationships between words, and they can produce coherent text that is statistically similar to the text on which they were trained. You can use this capability to increase your productivity by doing the following:
+ Generate content, such as document outlines, reports, and blogs
+ Summarize text from large corpuses of text, such as earnings call transcripts, annual reports, or chapters of user manuals
+ Extract insights and key takeaways from large passages of text, such as meeting notes or narratives
+ Improve text and catch grammatical errors or typos

The foundation models are a combination of Amazon SageMaker JumpStart and [Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-service.html) large language models (LLMs). Canvas offers the following models:


| Model | Type | Description | 
| --- | --- | --- | 
|  Amazon Titan  | Amazon Bedrock model |  Amazon Titan is a powerful, general-purpose language model that you can use for tasks such as summarization, text generation (such as creating a blog post), classification, open-ended Q&A, and information extraction. It is pretrained on large datasets, making it suitable for complex tasks and reasoning. To continue supporting best practices in the responsible use of AI, Amazon Titan foundation models are built to detect and remove harmful content in the data, reject inappropriate content in the user input, and filter model outputs that contain inappropriate content (such as hate speech, profanity, and violence).  | 
|  Anthropic Claude Instant  | Amazon Bedrock model |  Anthropic's Claude Instant is a faster and more cost-effective yet still very capable model. This model can handle a range of tasks including casual dialogue, text analysis, summarization, and document question answering. Just like Claude-2, Claude Instant can support up to 100,000 tokens in each prompt, equivalent to about 200 pages of information.  | 
|  Anthropic Claude-2  | Amazon Bedrock model |  Claude-2 is Anthropic's most powerful model, which excels at a wide range of tasks from sophisticated dialogue and creative content generation to detailed instruction following. Claude-2 can take up to 100,000 tokens in each prompt, equivalent to about 200 pages of information. It can generate longer responses compared to its prior version. It supports use cases such as question answering, information extraction, removing PII, content generation, multiple-choice classification, roleplay, comparing text, summarization, and document Q&A with citation.  | 
|  Falcon-7B-Instruct  | JumpStart model |  Falcon-7B-Instruct has 7 billion parameters and was fine-tuned on a mixture of chat and instruct datasets. It is suitable as a virtual assistant and performs best when following instructions or engaging in conversation. Since the model was trained on large amounts of English-language web data, it carries the stereotypes and biases commonly found online and is not suitable for languages other than English. Compared to Falcon-40B-Instruct, Falcon-7B-Instruct is a slightly smaller and more compact model.  | 
|  Falcon-40B-Instruct  | JumpStart model |  Falcon-40B-Instruct has 40 billion parameters and was fine-tuned on a mixture of chat and instruct datasets. It is suitable as a virtual assistant and performs best when following instructions or engaging in conversation. Since the model was trained on large amounts of English-language web data, it carries the stereotypes and biases commonly found online and is not suitable for languages other than English. Compared to Falcon-7B-Instruct, Falcon-40B-Instruct is a slightly larger and more powerful model.  | 
|  Jurassic-2 Mid  | Amazon Bedrock model |  Jurassic-2 Mid is a high-performance text generation model trained on a massive corpus of text (current up to mid 2022). It is highly versatile, general-purpose, and capable of composing human-like text and solving complex tasks such as question answering, text classification, and many others. This model offers zero-shot instruction capabilities, allowing it to be directed with only natural language and without the use of examples. It performs up to 30% faster than its predecessor, the Jurassic-1 model. Jurassic-2 Mid is AI21’s mid-sized model, carefully designed to strike the right balance between exceptional quality and affordability.  | 
|  Jurassic-2 Ultra  | Amazon Bedrock model |  Jurassic-2 Ultra is a high-performance text generation model trained on a massive corpus of text (current up to mid 2022). It is highly versatile, general-purpose, and capable of composing human-like text and solving complex tasks such as question answering, text classification, and many others. This model offers zero-shot instruction capabilities, allowing it to be directed with only natural language and without the use of examples. It performs up to 30% faster than its predecessor, the Jurassic-1 model. Compared to Jurassic-2 Mid, Jurassic-2 Ultra is a slightly larger and more powerful model.  | 
|  Llama-2-7b-Chat  | JumpStart model |  Llama-2-7b-Chat is a foundation model by Meta that is suitable for engaging in meaningful and coherent conversations, generating new content, and extracting answers from existing notes. Since the model was trained on large amounts of English-language internet data, it carries the biases and limitations commonly found online and is best-suited for tasks in English.  | 
|  Llama-2-13B-Chat  | Amazon Bedrock model |  Llama-2-13B-Chat by Meta was fine-tuned on conversational data after initial training on internet data. It is optimized for natural dialog and engaging chat abilities, making it well-suited as a conversational agent. Compared to the smaller Llama-2-7b-Chat, Llama-2-13B-Chat has nearly twice as many parameters, allowing it to remember more context and produce more nuanced conversational responses. Like Llama-2-7b-Chat, Llama-2-13B-Chat was trained on English-language data and is best-suited for tasks in English.  | 
|  Llama-2-70B-Chat  | Amazon Bedrock model |  Like Llama-2-7b-Chat and Llama-2-13B-Chat, the Llama-2-70B-Chat model by Meta is optimized for engaging in natural and meaningful dialog. With 70 billion parameters, this large conversational model can remember more extensive context and produce highly coherent responses when compared to the more compact model versions. However, this comes at the cost of slower responses and higher resource requirements. Llama-2-70B-Chat was trained on large amounts of English-language internet data and is best-suited for tasks in English.  | 
|  Mistral-7B  | JumpStart model |  Mistral-7B by Mistral.AI is an excellent general purpose language model suitable for a wide range of natural language (NLP) tasks like text generation, summarization, and question answering. It utilizes grouped-query attention (GQA) which allows for faster inference speeds, making it perform comparably to models with twice or three times as many parameters. It was trained on a mixture of text data including books, websites, and scientific papers in the English language, so it is best-suited for tasks in English.  | 
|  Mistral-7B-Chat  | JumpStart model |  Mistral-7B-Chat is a conversational model by Mistral AI based on Mistral-7B. While Mistral-7B is best for general NLP tasks, Mistral-7B-Chat has been further fine-tuned on conversational data to optimize its abilities for natural, engaging chat. As a result, Mistral-7B-Chat generates more human-like responses and remembers the context of previous responses. Like Mistral-7B, this model is best-suited for English language tasks.  | 
|  MPT-7B-Instruct  | JumpStart model |  MPT-7B-Instruct is a model for long-form instruction following tasks and can assist you with writing tasks including text summarization and question-answering to save you time and effort. This model was trained on large amounts of fine-tuned data and can handle larger inputs, such as complex documents. Use this model when you want to process large bodies of text or want the model to generate long responses.  | 

The foundation models from Amazon Bedrock are currently only available in the US East (N. Virginia) and US West (Oregon) Regions. Additionally, when using foundation models from Amazon Bedrock, you are charged based on the volume of input tokens and output tokens, as specified by each model provider. For more information, see the [Amazon Bedrock pricing page](https://aws.amazon.com/bedrock/pricing/). The JumpStart foundation models are deployed on SageMaker AI Hosting instances, and you are charged for the duration of usage based on the instance type used. For more information about the cost of different instance types, see the Amazon SageMaker AI Hosting: Real-Time Inference section on the [SageMaker pricing page](https://aws.amazon.com/sagemaker/pricing/).

Document querying is an additional feature that you can use to query and get insights from documents stored in indexes using Amazon Kendra. With this functionality, you can generate content from the context of those documents and receive responses that are specific to your business use case, as opposed to responses that are generic to the large amounts of data on which the foundation models were trained. For more information about indexes in Amazon Kendra, see the [Amazon Kendra Developer Guide](https://docs.aws.amazon.com/kendra/latest/dg/what-is-kendra.html).

If you would like to get responses from any of the foundation models that are customized to your data and use case, you can fine-tune foundation models. To learn more, see [Fine-tune foundation models](canvas-fm-chat-fine-tune.md).

If you'd like to get predictions from an Amazon SageMaker JumpStart foundation model through an application or website, you can deploy the model to a SageMaker AI *endpoint*. SageMaker AI endpoints host your model, and you can send requests to the endpoint through your application code to receive predictions from the model. For more information, see [Deploy your models to an endpoint](canvas-deploy-model.md).
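Once a model is deployed, your application code can invoke the endpoint with the AWS SDK for Python (Boto3). The following sketch shows the general pattern; the JSON request schema (the `inputs` and `parameters` fields), the parameter names, and the endpoint name are illustrative assumptions that vary by model, so check the schema that your specific model expects:

```python
import json

def build_payload(prompt, max_new_tokens=256, temperature=0.7):
    """Build a JSON request body. The field names here are an
    assumption for illustration; check your model's expected schema."""
    return json.dumps({
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    })

def query_endpoint(endpoint_name, prompt):
    """Send a prompt to a deployed SageMaker AI endpoint.
    Requires AWS credentials and the boto3 SDK at runtime."""
    import boto3
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=build_payload(prompt),
    )
    return json.loads(response["Body"].read())

# Example (hypothetical endpoint name):
# result = query_endpoint("my-canvas-llm-endpoint", "Summarize this text: ...")
```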

# Complete the prerequisites for foundation models in SageMaker Canvas
<a name="canvas-fm-chat-prereqs"></a>

The following sections outline the prerequisites for interacting with foundation models and using the document query feature in Canvas. The rest of the content on this page assumes that you’ve met the prerequisites for foundation models. The document query feature requires additional permissions.

## Prerequisites for foundation models
<a name="canvas-fm-chat-prereqs-fm"></a>

The permissions you need for interacting with models are included in the Canvas Ready-to-use models permissions. To use the generative AI-powered models in Canvas, you must turn on the **Canvas Ready-to-use models configuration** permissions when setting up your Amazon SageMaker AI domain. For more information, see [Prerequisites for setting up Amazon SageMaker Canvas](canvas-getting-started.md#canvas-prerequisites). The **Canvas Ready-to-use models configuration** attaches the [AmazonSageMakerCanvasAIServicesAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasAIServicesAccess) policy to your Canvas user's AWS Identity and Access Management (IAM) execution role. If you encounter any issues with granting permissions, see the topic [Troubleshooting issues with granting permissions through the SageMaker AI console](canvas-limits.md#canvas-troubleshoot-trusted-services).

If you’ve already set up your domain, you can edit your domain settings and turn on the permissions. For instructions on how to edit your domain settings, see [Edit domain settings](domain-edit.md). When editing the settings for your domain, go to the **Canvas settings** and turn on the **Enable Canvas Ready-to-use models** option.

Certain JumpStart foundation models also require that you request a SageMaker AI instance quota increase. Canvas hosts the models that you’re currently interacting with on these instances, but the default quota for your account may be insufficient. If you run into an error while running any of the following models, request a quota increase for the associated instance types:
+ Falcon-40B – `ml.g5.12xlarge`, `ml.g5.24xlarge`
+ Falcon-7B – `ml.g5.2xlarge`, `ml.g5.4xlarge`, `ml.g5.8xlarge`
+ MPT-7B-Instruct – `ml.g5.2xlarge`, `ml.g5.4xlarge`, `ml.g5.8xlarge`

For the preceding instance types, request an increase from 0 to 1 for the endpoint usage quota. For more information about how to increase an instance quota for your account, see [Requesting a quota increase](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html) in the *Service Quotas User Guide*.

## Prerequisites for document querying
<a name="canvas-fm-chat-prereqs-kendra"></a>

**Note**  
Document querying is supported in the following AWS Regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), and Asia Pacific (Mumbai).

The document querying feature requires that you already have an Amazon Kendra index that stores your documents and document metadata. For more information about Amazon Kendra, see the [Amazon Kendra Developer Guide](https://docs.aws.amazon.com/kendra/latest/dg/what-is-kendra.html). To learn more about the quotas for querying indexes, see [Quotas](https://docs.aws.amazon.com/kendra/latest/dg/quotas.html) in the *Amazon Kendra Developer Guide*.

You must also make sure that your Canvas user profile has the necessary permissions for document querying. The [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasFullAccess.html) policy must be attached to the AWS IAM execution role for the SageMaker AI domain that hosts your Canvas application (this policy is attached by default to all new and existing Canvas user profiles). You must also specifically grant document querying permissions and specify access to one or more Amazon Kendra indexes.

If your Canvas administrator is setting up a new domain or user profile, have them set up the domain by following the instructions in [Prerequisites for setting up Amazon SageMaker Canvas](canvas-getting-started.md#canvas-prerequisites). While setting up the domain, they can turn on the document querying permissions through the **Canvas Ready-to-use models configuration**.

The Canvas administrator can manage document querying permissions at the user profile level as well. For example, if the administrator wants to grant document querying permissions to some user profiles but remove permissions for others, they can edit the permissions for a specific user.

The following procedure shows how to turn on document querying permissions for a specific user profile:

1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. From the list of domains, select the user profile’s domain.

1. On the **Domain details** page, choose the user profile whose permissions you want to edit.

1. On the **User Details** page, choose **Edit**.

1. In the left navigation pane, choose **Canvas settings**.

1. In the **Canvas Ready-to-use models configuration** section, turn on the **Enable document query using Amazon Kendra** toggle.

1. In the dropdown, select one or more Amazon Kendra indexes to which you want to grant access.

1. Choose **Submit** to save the changes to your domain settings.

You should now be able to use Canvas foundation models to query documents in the specified Amazon Kendra indexes.

# Start a new conversation to generate, extract, or summarize content
<a name="canvas-fm-chat-new"></a>

To get started with generative AI foundation models in Canvas, you can initiate a new chat session with one of the models. For JumpStart models, you are charged while the model is active, so you must start up models when you want to use them and shut them down when you are done interacting. If you do not shut down a JumpStart model, Canvas shuts it down after 2 hours of inactivity. For Amazon Bedrock models (such as Amazon Titan), you are charged by prompt; the models are already active and don’t need to be started up or shut down. You are charged directly for use of these models by Amazon Bedrock.

To open a chat with a model, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **Ready-to-use models**.

1. Choose **Generate, extract and summarize content**.

1. On the welcome page, you’ll receive a recommendation to start up the default model. You can start the recommended model, or you can choose **Select another model** from the dropdown to choose a different one.

1. If you selected a JumpStart foundation model, you must start it up before you can use it. Choose **Start up the model** to deploy the model to a SageMaker AI instance. This might take several minutes. When the model is ready, you can enter prompts and ask the model questions.

   If you selected a foundation model from Amazon Bedrock, you can start using it instantly by entering a prompt and asking questions.

Depending on the model, you can perform various tasks. For example, you can enter a passage of text and ask the model to summarize it. Or, you can ask the model to come up with a short summary of the market trends in your domain.

The model’s responses in a chat are based on the context of your previous prompts. If you want to ask a new question in the chat that is unrelated to the previous conversation topic, we recommend that you start a new chat with the model.

# Extract information from documents with document querying
<a name="canvas-fm-chat-query"></a>

**Note**  
This section assumes that you’ve completed the section above [Prerequisites for document querying](canvas-fm-chat-prereqs.md#canvas-fm-chat-prereqs-kendra).

Document querying is a feature that you can use while interacting with foundation models in Canvas. With document querying, you can access a corpus of documents stored in an Amazon Kendra *index*, which holds the contents of your documents and is structured in a way to make documents searchable. You can ask specific questions that are targeted to the data in your Amazon Kendra index, and the foundation model returns answers to your questions. For example, you can query an internal knowledge base of IT information and ask questions such as “How do I connect to my company’s network?” For more information about setting up an index, see the [Amazon Kendra Developer Guide](https://docs.aws.amazon.com/kendra/latest/dg/what-is-kendra.html).

When using the document query feature, the foundation models restrict their responses to the content of the documents in your index with a technique called Retrieval Augmented Generation (RAG). This technique bundles the most relevant information from the index along with the user's prompt and sends it to the foundation model to get a response. Responses are limited to what can be found in your index, preventing the model from giving you incorrect responses based on external data. For more information about this process, see the blog post [Quickly build high-accuracy Generative AI applications on enterprise data](https://aws.amazon.com/blogs/machine-learning/quickly-build-high-accuracy-generative-ai-applications-on-enterprise-data-using-amazon-kendra-langchain-and-large-language-models/).
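Canvas performs this retrieval and bundling for you, but the underlying RAG pattern can be sketched in a few lines. The function below is a simplified illustration, not the Canvas implementation; the function name, instruction text, and character budget are all assumptions:

```python
def build_rag_prompt(question, passages, max_context_chars=4000):
    """Bundle the most relevant retrieved passages (for example, results
    from an Amazon Kendra index query) with the user's question, and
    instruct the model to answer only from that context."""
    context = ""
    for passage in passages:  # assumed to be sorted by relevance
        if len(context) + len(passage) > max_context_chars:
            break  # stay within the model's context budget
        context += passage.strip() + "\n\n"
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"Context:\n{context}"
        f"Question: {question}\n"
        "Answer:"
    )
```

Because the prompt instructs the model to rely only on the supplied context, responses stay grounded in your indexed documents rather than in the model's general training data.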

To get started, in a chat with a foundation model in Canvas, turn on the **Document query** toggle at the top of the page. From the dropdown, select the Amazon Kendra index that you want to query. Then, you can begin asking questions related to the documents in your index.

**Important**  
Document querying supports the [Compare model outputs](canvas-fm-chat-compare.md) feature. Any existing chat history is overwritten when you start a new chat to compare model outputs.

# Start up models
<a name="canvas-fm-chat-manage"></a>

**Note**  
The following section describes starting up models, which applies only to JumpStart foundation models, such as Falcon-40B-Instruct. You can access Amazon Bedrock models, such as Amazon Titan, instantly at any time.

You can start up as many JumpStart models as you like. Each active JumpStart model incurs charges on your account, so we recommend that you don’t start up more models than you are currently using.

To start up another model, you can do the following:

1. On the **Generate, extract and summarize content** page, choose **New chat**.

1. Choose the model from the dropdown menu. If you want to choose a model not displayed in the dropdown, choose **Start up another model**, and then select the model that you want to start up.

1. Choose **Start up model**.

The model should begin starting up, and within a few minutes you can chat with the model.

# Shut down models
<a name="canvas-fm-chat-shut-down"></a>

We highly recommend that you shut down models that you aren’t using. The models automatically shut down after 2 hours of inactivity. However, to manually shut down a model, you can do the following:

1. On the **Generate, extract and summarize content** page, open the chat for the model that you want to shut down.

1. On the chat page, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)).

1. Choose **Shut down model**.

1. In the **Shut down model** confirmation box, choose **Shut down**.

The model begins shutting down. If your chat compares two or more models, you can shut down an individual model from the chat page by choosing the model’s **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) and then choosing **Shut down model**.

# Compare model outputs
<a name="canvas-fm-chat-compare"></a>

You might want to compare the output of different models side by side to see which model output you prefer. This can help you decide which model is best suited to your use case. You can compare up to three models in chats.

**Note**  
Each individual model incurs charges on your account.

You must start a new chat to add models for comparison. To compare the output of models side by side in a chat, do the following:

1. In a chat, choose **New chat**.

1. Choose **Compare**, and use the dropdown menu to select the model that you want to add. To add a third model, choose **Compare** again to add another model.
**Note**  
If you want to use a JumpStart model that isn’t currently active, you are prompted to start up the model.

When the models are active, you see them side by side in the chat. You can submit your prompt, and each model responds in the same chat, as shown in the following screenshot.

![\[Screenshot of the Canvas interface with the output of two models shown side by side.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-chat-compare-outputs.png)


When you’re done interacting, make sure to shut down any JumpStart models individually to avoid incurring further charges.

# Fine-tune foundation models
<a name="canvas-fm-chat-fine-tune"></a>

The foundation models that you can access through Amazon SageMaker Canvas can help you with a range of general-purpose tasks. However, if you have a specific use case and would like responses customized to your own data, you can *fine-tune* a foundation model.

To fine-tune a foundation model, you provide a dataset that consists of sample prompts and model responses. Then, you train the foundation model on the data. Finally, the fine-tuned foundation model is able to provide you with more specific responses.

The following list contains the foundation models that you can fine-tune in Canvas:
+ Titan Express
+ Falcon-7B
+ Falcon-7B-Instruct
+ Falcon-40B-Instruct
+ Falcon-40B
+ Flan-T5-Large
+ Flan-T5-Xl
+ Flan-T5-Xxl
+ MPT-7B
+ MPT-7B-Instruct

You can access more detailed information about each foundation model in the Canvas application while fine-tuning a model. For more information, see [Fine-tune the model](#canvas-fm-chat-fine-tune-procedure-model).

This topic describes how to fine-tune foundation models in Canvas.

## Before you begin
<a name="canvas-fm-chat-fine-tune-prereqs"></a>

Before fine-tuning a foundation model, make sure that you have the permissions for Ready-to-use models in Canvas and an AWS Identity and Access Management execution role that has a trust relationship with Amazon Bedrock, which allows Amazon Bedrock to assume your role while fine-tuning foundation models.

While setting up or editing your Amazon SageMaker AI domain, you must 1) turn on the Canvas Ready-to-use models configuration permissions, and 2) create or specify an Amazon Bedrock role, which is an IAM execution role to which SageMaker AI attaches a trust relationship with Amazon Bedrock. For more information about configuring these settings, see [Prerequisites for setting up Amazon SageMaker Canvas](canvas-getting-started.md#canvas-prerequisites).

You can configure the Amazon Bedrock role manually if you would rather use your own IAM execution role (instead of letting SageMaker AI create one on your behalf). For more information about configuring your own IAM execution role’s trust relationship with Amazon Bedrock, see [Grant Users Permissions to Use Amazon Bedrock and Generative AI Features in Canvas](canvas-fine-tuning-permissions.md).

You must also have a dataset that is formatted for fine-tuning large language models (LLMs). The following is a list of requirements for your dataset:
+ The dataset must be tabular and contain at least two columns of text data: one input column (which contains example prompts to the model) and one output column (which contains example responses from the model).

  An example is the following:     
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/canvas-fm-chat-fine-tune.html)
+ We recommend that the dataset has at least 100 text pairs (rows of corresponding input and output items). This ensures that the foundation model has enough data for fine-tuning and increases the accuracy of its responses.
+ Each input and output item should contain a maximum of 512 characters. Anything longer is reduced to 512 characters when fine-tuning the foundation model.
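To avoid silent truncation, you can pre-process your dataset before importing it into Canvas. The sketch below trims each item to the 512-character limit and flags datasets smaller than the recommended 100 pairs; the column names `input` and `output` are placeholders for whatever your dataset uses:

```python
MAX_ITEM_CHARS = 512       # longer items are truncated during fine-tuning
RECOMMENDED_MIN_PAIRS = 100

def prepare_pairs(rows, input_col="input", output_col="output"):
    """Truncate each prompt/response pair to the character limit and
    return the cleaned rows plus a flag for an undersized dataset."""
    prepared = [
        {
            input_col: row[input_col][:MAX_ITEM_CHARS],
            output_col: row[output_col][:MAX_ITEM_CHARS],
        }
        for row in rows
    ]
    enough_data = len(prepared) >= RECOMMENDED_MIN_PAIRS
    return prepared, enough_data
```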

When fine-tuning an Amazon Bedrock model, you must adhere to the Amazon Bedrock quotas. For more information, see [Model customization quotas](https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html#model-customization-quotas) in the *Amazon Bedrock User Guide*.

For more information about general dataset requirements and limitations in Canvas, see [Create a dataset](canvas-import-dataset.md).

## Fine-tune a foundation model
<a name="canvas-fm-chat-fine-tune-procedure"></a>

You can fine-tune a foundation model by using any of the following methods in the Canvas application:
+ While in a **Generate, extract and summarize content** chat with a foundation model, choose the **Fine-tune model** icon (![\[Wrench icon representing the fine-tune model option.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/wrench-icon-small.png)).
+ While in a chat with a foundation model, if you’ve re-generated the response two or more times, then Canvas offers you the option to **Fine-tune model**. The following screenshot shows you what this looks like.  
![\[Screenshot of the Fine-tune foundation model option shown in a chat.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/fine-tuning-ingress.png)
+ On the **My models** page, you can create a new model by choosing **New model**, and then select **Fine-tune foundation model**.
+ On the **Ready-to-use models** home page, you can choose **Create your own model**, and then in the **Create new model** dialog box, choose **Fine-tune foundation model**.
+ While browsing your datasets in the **Data Wrangler** tab, you can select a dataset and choose **Create a model**. Then, choose **Fine-tune foundation model**.

After you’ve begun to fine-tune a model, do the following:

### Select a dataset
<a name="canvas-fm-chat-fine-tune-procedure-select"></a>

On the **Select** tab of the fine-tuning workflow, choose the data on which you’d like to train the foundation model.

Either select an existing dataset or create a new dataset that meets the requirements listed in the [Before you begin](#canvas-fm-chat-fine-tune-prereqs) section. For more information about how to create a dataset, see [Create a dataset](canvas-import-dataset.md).

When you’ve selected or created a dataset and you’re ready to move on, choose **Select dataset**.

### Fine-tune the model
<a name="canvas-fm-chat-fine-tune-procedure-model"></a>

After selecting your data, you’re ready to begin fine-tuning the model.

On the **Fine-tune** tab, do the following:

1. (Optional) Choose **Learn more about our foundation models** to access more information about each model and help you decide which foundation model or models to deploy.

1. For **Select up to 3 base models**, open the dropdown menu and check up to 3 foundation models (up to 2 JumpStart models and 1 Amazon Bedrock model) that you’d like to fine-tune during the training job. By fine-tuning multiple foundation models, you can compare their performance and ultimately choose the one best suited to your use case as the default model. For more information about default models, see [View model candidates in the model leaderboard](canvas-evaluate-model-candidates.md).

1. For **Select Input column**, select the column of text data in your dataset that contains the example model prompts.

1. For **Select Output column**, select the column of text data in your dataset that contains the example model responses.

1. (Optional) To configure advanced settings for the training job, choose **Configure model**. For more information about the advanced model building settings, see [Advanced model building configurations](canvas-advanced-settings.md).

   In the **Configure model** pop-up window, do the following:

   1. For **Hyperparameters**, you can adjust the **Epoch count**, **Batch size**, **Learning rate**, and **Learning rate warmup steps** for each model you selected. For more information about these parameters, see the [Hyperparameters section in the JumpStart documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-fine-tune.html#jumpstart-hyperparameters).

   1. For **Data split**, you can specify percentages for how to divide your data between the **Training set** and **Validation set**.

   1. For **Max job runtime**, you can set the maximum amount of time that Canvas runs the build job. This feature is only available for JumpStart foundation models.

   1. After configuring the settings, choose **Save**.

1. Choose **Fine-tune** to begin training the foundation models you selected.

After the fine-tuning job begins, you can leave the page. When the model shows as **Ready** on the **My models** page, it’s ready for use, and you can now analyze the performance of your fine-tuned foundation model.

### Analyze the fine-tuned foundation model
<a name="canvas-fm-chat-fine-tune-procedure-analyze"></a>

On the **Analyze** tab of your fine-tuned foundation model, you can see the model’s performance.

The **Overview** tab on this page shows you the perplexity and loss scores, along with analyses that visualize the model’s improvement over time during training. The following screenshot shows the **Overview** tab.

![\[The Analyze tab of a fine-tuned foundation model in Canvas, showing the perplexity and loss curves.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-fine-tune-analyze-2.png)


On this page, you can see the following visualizations:
+ The **Perplexity Curve** measures how well the model predicts the next word in a sequence, or how grammatical the model’s output is. Ideally, as the model improves during training, the score decreases and results in a curve that lowers and flattens over time.
+ The **Loss Curve** quantifies the difference between the correct output and the model’s predicted output. A loss curve that decreases and flattens over time indicates that the model is improving its ability to make accurate predictions.
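The two curves are directly related: perplexity is the exponential of the average cross-entropy loss, which is why both fall and flatten together as the model improves. A minimal illustration:

```python
import math

def perplexity(avg_cross_entropy_loss):
    """Perplexity is exp(loss); a perfect model (loss 0) has perplexity 1,
    and any reduction in loss lowers perplexity as well."""
    return math.exp(avg_cross_entropy_loss)
```

For example, as the training loss drops from 2.0 to 1.0, perplexity falls from about 7.4 to about 2.7.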

The **Advanced metrics** tab shows you the hyperparameters and additional metrics for your model. It looks like the following screenshot:

![\[Screenshot of the Advanced metrics tab of a fine-tuned foundation model in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-fine-tune-metrics.png)


The **Advanced metrics** tab contains the following information:
+ The **Explainability** section contains the **Hyperparameters**, which are the values set before the job to guide the model’s fine-tuning. If you didn’t specify custom hyperparameters in the model’s advanced settings in the [Fine-tune the model](#canvas-fm-chat-fine-tune-procedure-model) section, then Canvas selects default hyperparameters for you.

  For JumpStart models, you can also see the advanced metric [ROUGE (Recall-Oriented Understudy for Gisting Evaluation)](https://en.wikipedia.org/wiki/ROUGE_(metric)), which evaluates the quality of summaries generated by the model. It measures how well the model can summarize the main points of a passage.
+ The **Artifacts** section provides you with links to artifacts generated during the fine-tuning job. You can access the training and validation data saved in Amazon S3, as well as the link to the model evaluation report (to learn more, see the following paragraph).
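The ROUGE metric mentioned above scores a model-generated summary by its n-gram overlap with a reference text. A toy ROUGE-1 recall, which counts unigram overlap only (real ROUGE implementations also score longer n-grams and longest common subsequences), looks like this:

```python
def rouge1_recall(reference, candidate):
    """Fraction of the reference's unigrams that appear in the
    candidate summary (a simplified, illustrative ROUGE-1 recall)."""
    reference_words = reference.lower().split()
    candidate_words = set(candidate.lower().split())
    if not reference_words:
        return 0.0
    hits = sum(1 for word in reference_words if word in candidate_words)
    return hits / len(reference_words)
```

A score closer to 1 means the candidate summary covers more of the reference's content.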

To get more model evaluation insights, you can download a report that is generated using [SageMaker Clarify](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-configure-processing-jobs.html), which is a feature that can help you detect bias in your model and data. First, generate the report by choosing **Generate evaluation report** at the bottom of the page. After the report has been generated, you can download the full report by choosing **Download report** or by returning to the **Artifacts** section.

You can also access a Jupyter notebook that shows you how to replicate your fine-tuning job in Python code. You can use this to replicate or make programmatic changes to your fine-tuning job or get a deeper understanding of how Canvas fine-tunes your model. To learn more about model notebooks and how to access them, see [Download a model notebook](canvas-notebook.md).

For more information about how to interpret the information in the **Analyze** tab of your fine-tuned foundation model, see the topic [Model evaluation](canvas-evaluate-model.md).

After analyzing the **Overview** and **Advanced metrics** tabs, you can also choose to open the **Model leaderboard**, which shows you the list of the base models trained during the build. The model with the lowest loss score is considered the best performing model and is selected as the **Default model**, which is the model whose analysis you see in the **Analyze** tab. You can only test and deploy the default model. For more information about the model leaderboard and how to change the default model, see [View model candidates in the model leaderboard](canvas-evaluate-model-candidates.md).

### Test a fine-tuned foundation model in a chat
<a name="canvas-fm-chat-fine-tune-procedure-test"></a>

After analyzing the performance of a fine-tuned foundation model, you might want to test it out or compare its responses with the base model. You can test a fine-tuned foundation model in a chat in the **Generate, extract and summarize content** feature.

Start a chat with a fine-tuned model by choosing one of the following methods:
+ On the fine-tuned model’s **Analyze** tab, choose **Test in Ready-to-use foundation models**.
+ On the Canvas **Ready-to-use models** page, choose **Generate, extract and summarize content**. Then, choose **New chat** and select the version of the model that you want to test.

The model starts up in a chat, and you can interact with it like any other foundation model. You can add more models to the chat and compare their outputs. For more information about the functionality of chats, see [Generative AI foundation models in SageMaker Canvas](canvas-fm-chat.md).

## Operationalize fine-tuned foundation models
<a name="canvas-fm-chat-fine-tune-mlops"></a>

After fine-tuning your model in Canvas, you can do the following:
+ Register the model to the SageMaker Model Registry for integration into your organization's MLOps processes. For more information, see [Register a model version in the SageMaker AI model registry](canvas-register-model.md).
+ Deploy the model to a SageMaker AI endpoint and send requests to the model from your application or website to get predictions (or *inference*). For more information, see [Deploy your models to an endpoint](canvas-deploy-model.md).

**Important**  
You can only register and deploy JumpStart-based fine-tuned foundation models, not Amazon Bedrock-based models.

# Ready-to-use models
<a name="canvas-ready-to-use-models"></a>

With Amazon SageMaker Canvas Ready-to-use models, you can make predictions on your data without writing a single line of code or having to build a model—all you have to bring is your data. The Ready-to-use models use pre-built models to generate predictions without requiring you to spend the time, expertise, or cost required to build a model, and you can choose from a variety of use cases ranging from language detection to expense analysis.

Canvas integrates with existing AWS services, such as [Amazon Textract](https://docs.aws.amazon.com/textract/latest/dg/what-is.html), [Amazon Rekognition](https://docs.aws.amazon.com/rekognition/latest/dg/what-is.html), and [Amazon Comprehend](https://docs.aws.amazon.com/comprehend/latest/dg/what-is.html), to analyze your data and make predictions or extract insights. You can use the predictive power of these services from within the Canvas application to get high quality predictions for your data.
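These same underlying services are also callable directly from application code. For example, the language detection model is backed by Amazon Comprehend's `DetectDominantLanguage` API; the following Boto3 sketch shows the pattern (the helper names are illustrative assumptions):

```python
def top_language(languages):
    """Pick the highest-scoring language code from the Languages list
    returned by Comprehend's DetectDominantLanguage API."""
    return max(languages, key=lambda lang: lang["Score"])["LanguageCode"]

def detect_language(text):
    """Detect the dominant language of a string. Requires AWS
    credentials and the boto3 SDK at runtime."""
    import boto3
    response = boto3.client("comprehend").detect_dominant_language(Text=text)
    return top_language(response["Languages"])
```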

Canvas supports the following Ready-to-use model types:


| Ready-to-use model | Description | Supported data type | 
| --- | --- | --- | 
| Sentiment analysis | Detect sentiment in lines of text, which can be positive, negative, neutral, or mixed. Currently, you can only do sentiment analysis for English language text. | Plain text or tabular (CSV, Parquet) | 
| Entities extraction | Extract entities, which are real-world objects such as people, places, and commercial items, or units such as dates and quantities, from text. | Plain text or tabular (CSV, Parquet) | 
| Language detection | Determine the dominant language in text such as English, French, or German. | Plain text or tabular (CSV, Parquet) | 
| Personal information detection | Detect personal information that could be used to identify an individual, such as addresses, bank account numbers, and phone numbers, from text. | Plain text or tabular (CSV, Parquet) | 
| Object detection in images | Detect objects, concepts, scenes, and actions in your images. | Image (JPG, PNG) | 
| Text detection in images | Detect text in your images. | Image (JPG, PNG) | 
| Expense analysis | Extract information from invoices and receipts, such as date, number, item prices, total amount, and payment terms. | Document (PDF, JPG, PNG, TIFF) | 
| Identity document analysis | Extract information from passports, driver's licenses, and other identity documents issued by the US government. | Document (PDF, JPG, PNG, TIFF) | 
| Document analysis | Analyze documents and forms for relationships among detected text. | Document (PDF, JPG, PNG, TIFF) | 
| Document queries | Extract information from structured documents such as paystubs, bank statements, W-2s, and mortgage application forms by asking questions using natural language. | Document (PDF) | 

## Get started
<a name="canvas-ready-to-use-get-started"></a>

To get started with Ready-to-use models, review the following information.

**Prerequisites**

To use Ready-to-use models in Canvas, you must turn on the **Canvas Ready-to-use models configuration** permissions when [setting up your Amazon SageMaker AI domain](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-getting-started.html#canvas-prerequisites). The **Canvas Ready-to-use models configuration** attaches the [AmazonSageMakerCanvasAIServicesAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasAIServicesAccess) policy to your Canvas user's AWS Identity and Access Management (IAM) execution role. If you encounter any issues with granting permissions, see the topic [Troubleshooting issues with granting permissions through the SageMaker AI console](canvas-limits.md#canvas-troubleshoot-trusted-services).
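
If you manage roles outside the console, you could attach the managed policy to an existing Canvas execution role with the AWS CLI. This is a sketch; the role name below is a placeholder, and the console workflow linked above handles this for you.

```shell
# Attach the Canvas AI services managed policy to an execution role.
# Replace MyCanvasExecutionRole with your role's actual name.
aws iam attach-role-policy \
  --role-name MyCanvasExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerCanvasAIServicesAccess
```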

If you’ve already set up your domain, you can edit your domain settings and turn on the permissions. For instructions on how to edit your domain settings, see [Edit domain settings](https://docs.aws.amazon.com/sagemaker/latest/dg/domain-edit.html). When editing the settings for your domain, go to the **Canvas settings** and turn on the **Enable Canvas Ready-to-use models** option.

**(Optional) Opt out of AI services data storage**

Certain AWS AI services store and use your data to make improvements to the service. You can opt out of having your data stored or used for service improvements. To learn more about how to opt out, see [ AI services opt-out policies](https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_ai-opt-out.html) in the *AWS Organizations User Guide*.
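
As a sketch, an Organizations AI services opt-out policy that opts all accounts out of data storage for every AWS AI service looks like the following; see the *AWS Organizations User Guide* linked above for the full syntax.

```json
{
  "services": {
    "default": {
      "opt_out_policy": {
        "@@assign": "optOut"
      }
    }
  }
}
```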

**How to use Ready-to-use models**

To get started with Ready-to-use models, do the following:

1. **(Optional) Import your data.** You can import a tabular, image, or document dataset to generate batch predictions (a dataset of predictions) with Ready-to-use models. To get started with importing a dataset, see [Create a data flow](canvas-data-flow.md).

1. **Generate predictions.** You can generate single or batch predictions with your chosen Ready-to-use model. To get started with making predictions, see [Make predictions for text data](canvas-ready-to-use-predict-text.md).

# Make predictions for text data
<a name="canvas-ready-to-use-predict-text"></a>

The following procedures describe how to make both single and batch predictions for text datasets. Each Ready-to-use model supports both **Single predictions** and **Batch predictions** for your dataset. A **Single prediction** is when you only need to make one prediction. For example, you might have one paragraph of text for which you want to detect the dominant language. A **Batch prediction** is when you’d like to make predictions for an entire dataset. For example, you might have a CSV file of customer reviews for which you’d like to analyze the customer sentiment.

You can use these procedures for the following Ready-to-use model types: sentiment analysis, entities extraction, language detection, and personal information detection.

**Note**  
For sentiment analysis, you can only use English language text.

## Single predictions
<a name="canvas-ready-to-use-predict-text-single"></a>

To make a single prediction for Ready-to-use models that accept text data, do the following:

1. In the left navigation pane of the Canvas application, choose **Ready-to-use models**.

1. On the **Ready-to-use models** page, choose the Ready-to-use model for your use case. For text data, it should be one of the following: **Sentiment analysis**, **Entities extraction**, **Language detection**, or **Personal information detection**.

1. On the **Run predictions** page for your chosen Ready-to-use model, choose **Single prediction**.

1. For **Text field**, enter the text for which you’d like to get a prediction.

1. Choose **Generate prediction results** to get your prediction.

In the **Prediction results** pane on the right, you receive an analysis of your text along with a **Confidence** score for each result or label. For example, if you chose language detection and entered a passage of text in French, you might get French with a 95% confidence score and traces of other languages, such as English, with a 5% confidence score.
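
To make the confidence scores concrete, the following sketch ranks languages from a payload shaped like Amazon Comprehend's `DetectDominantLanguage` response. The sample mirrors the French/English example above and is hand-written, not a real API response.

```python
# Illustrative sketch: ranking candidate languages by confidence from a
# payload shaped like Amazon Comprehend's DetectDominantLanguage response.
# The sample is hand-written, not a real API response.

sample = {
    "Languages": [
        {"LanguageCode": "fr", "Score": 0.95},
        {"LanguageCode": "en", "Score": 0.05},
    ],
}

# Sort candidate languages by confidence, highest first.
ranked = sorted(sample["Languages"], key=lambda lang: lang["Score"], reverse=True)
dominant = ranked[0]
print(f"{dominant['LanguageCode']} ({dominant['Score']:.0%})")  # fr (95%)
```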

The following screenshot shows the results for a single prediction using language detection where the model is 100% confident that the passage is English.

![\[Screenshot of the results of a single prediction with the language detection Ready-to-use model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-ready-to-use/ai-solutions-text-prediction.png)


## Batch predictions
<a name="canvas-ready-to-use-predict-text-batch"></a>

To make batch predictions for Ready-to-use models that accept text data, do the following:

1. In the left navigation pane of the Canvas application, choose **Ready-to-use models**.

1. On the **Ready-to-use models** page, choose the Ready-to-use model for your use case. For text data, it should be one of the following: **Sentiment analysis**, **Entities extraction**, **Language detection**, or **Personal information detection**.

1. On the **Run predictions** page for your chosen Ready-to-use model, choose **Batch prediction**.

1. Choose **Select dataset** if you’ve already imported your dataset. If not, choose **Import new dataset**, and then you are directed through the import data workflow.

1. From the list of available datasets, select your dataset and choose **Generate predictions** to get your predictions.

After the prediction job finishes running, on the **Run predictions** page, you see an output dataset listed under **Predictions**. This dataset contains your results, and if you select the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), you can **Preview** the output data. Then, you can choose **Download** to download the results.

# Make predictions for image data
<a name="canvas-ready-to-use-predict-image"></a>

The following procedures describe how to make both single and batch predictions for image datasets. Each Ready-to-use model supports both **Single predictions** and **Batch predictions** for your dataset. A **Single prediction** is when you only need to make one prediction. For example, you might have one image in which you want to detect objects or extract text. A **Batch prediction** is when you’d like to make predictions for an entire dataset. For example, you might have a dataset of image files in which you’d like to detect objects.

You can use these procedures for the following Ready-to-use model types: object detection in images and text detection in images.

## Single predictions
<a name="canvas-ready-to-use-predict-image-single"></a>

To make a single prediction for Ready-to-use models that accept image data, do the following:

1. In the left navigation pane of the Canvas application, choose **Ready-to-use models**.

1. On the **Ready-to-use models** page, choose the Ready-to-use model for your use case. For image data, it should be one of the following: **Object detection in images** or **Text detection in images**.

1. On the **Run predictions** page for your chosen Ready-to-use model, choose **Single prediction**.

1. Choose **Upload image**.

1. You are prompted to select an image to upload from your local computer. Select the image from your local files, and then Canvas generates the prediction results.

In the **Prediction results** pane on the right, you receive an analysis of your image along with a **Confidence** score for each object or text detected. For example, if you chose object detection in images, you receive a list of objects in the image along with a confidence score of how certain the model is that each object was accurately detected, such as 93%.
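
A common post-processing step is keeping only high-confidence detections. The sketch below filters a payload shaped like Amazon Rekognition's `DetectLabels` response, where confidence is a 0–100 percentage; the sample is hand-written, not a real API response.

```python
# Illustrative sketch: filtering detections from a payload shaped like
# Amazon Rekognition's DetectLabels response (confidence is a 0-100
# percentage). The sample is hand-written, not a real API response.

def confident_labels(response, min_confidence=90.0):
    """Return label names at or above the confidence threshold."""
    return [
        label["Name"]
        for label in response["Labels"]
        if label["Confidence"] >= min_confidence
    ]

sample = {
    "Labels": [
        {"Name": "Clock Tower", "Confidence": 99.8},
        {"Name": "Bus", "Confidence": 97.2},
        {"Name": "Bicycle", "Confidence": 54.1},
    ],
}

print(confident_labels(sample))  # ['Clock Tower', 'Bus']
```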

The following screenshot shows the results for a single prediction using the object detection in images solution, where the model predicts objects such as a clock tower and bus with 100% confidence.

![\[The results of a single prediction with the object detection solution in images Ready-to-use model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-ready-to-use/ai-solutions-image-prediction.png)


## Batch predictions
<a name="canvas-ready-to-use-predict-image-batch"></a>

To make batch predictions for Ready-to-use models that accept image data, do the following:

1. In the left navigation pane of the Canvas application, choose **Ready-to-use models**.

1. On the **Ready-to-use models** page, choose the Ready-to-use model for your use case. For image data, it should be one of the following: **Object detection in images** or **Text detection in images**.

1. On the **Run predictions** page for your chosen Ready-to-use model, choose **Batch prediction**.

1. Choose **Select dataset** if you’ve already imported your dataset. If not, choose **Import new dataset**, and then you are directed through the import data workflow.

1. From the list of available datasets, select your dataset and choose **Generate predictions** to get your predictions.

After the prediction job finishes running, on the **Run predictions** page, you see an output dataset listed under **Predictions**. This dataset contains your results, and if you select the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), you can choose **View prediction results** to preview the output data. Then, you can choose **Download prediction** and download the results as a CSV or a ZIP file.

# Make predictions for document data
<a name="canvas-ready-to-use-predict-document"></a>

The following procedures describe how to make both single and batch predictions for document datasets. Each Ready-to-use model supports both **Single predictions** and **Batch predictions** for your dataset. A **Single prediction** is when you only need to make one prediction. For example, you might have one receipt from which you want to extract the total amount. A **Batch prediction** is when you’d like to make predictions for an entire dataset. For example, you might have a set of invoices from which you’d like to extract payment information.

You can use these procedures for the following Ready-to-use model types: expense analysis, identity document analysis, and document analysis.

**Note**  
For document queries, only single predictions are currently supported.

## Single predictions
<a name="canvas-ready-to-use-predict-document-single"></a>

To make a single prediction for Ready-to-use models that accept document data, do the following:

1. In the left navigation pane of the Canvas application, choose **Ready-to-use models**.

1. On the **Ready-to-use models** page, choose the Ready-to-use model for your use case. For document data, it should be one of the following: **Expense analysis**, **Identity document analysis**, or **Document analysis**.

1. On the **Run predictions** page for your chosen Ready-to-use model, choose **Single prediction**.

1. If your Ready-to-use model is identity document analysis or document analysis, complete the following actions. If you’re doing expense analysis or document queries, skip this step and go to Step 5 or Step 6, respectively.

   1. Choose **Upload document**.

   1. You are prompted to upload a PDF, JPG, or PNG file from your local computer. Select the document from your local files, and then Canvas generates the prediction results.

1. If your Ready-to-use model is expense analysis, do the following:

   1. Choose **Upload invoice or receipt**.

   1. You are prompted to upload a PDF, JPG, PNG, or TIFF file from your local computer. Select the document from your local files, and then Canvas generates the prediction results.

1. If your Ready-to-use model is document queries, do the following:

   1. Choose **Upload document**.

   1. You are prompted to upload a PDF file from your local computer. Select the document from your local files. Your PDF must be 1–100 pages long.
**Note**  
If you're in the Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), or Europe (Frankfurt) Regions, then the maximum PDF length for document queries is 20 pages.

   1. In the right side pane, enter queries to search for information in the document. A single query can be 1–200 characters long, and you can add up to 15 queries at a time.

   1. Choose **Submit queries**, and then the results generate with answers to your queries. You are billed once for each submission of queries that you make.

In the **Prediction results** pane on the right, you receive an analysis of your document.

The following information describes the results for each type of solution:
+ For expense analysis, the results are categorized into **Summary fields**, which include fields such as the total on a receipt, and **Line item fields**, which include fields such as individual items on a receipt. The identified fields are highlighted on the document image in the output.
+ For identity document analysis, the output shows you the fields that the Ready-to-use model identified, such as first and last name, address, or date of birth. The identified fields are highlighted on the document image in the output.
+ For document analysis, the results are categorized into **Raw text**, **Forms**, **Tables**, and **Signatures**. **Raw text** includes all of the extracted text, while **Forms**, **Tables**, and **Signatures** only include information on the form that falls into those categories. For example, **Tables** only includes information extracted from tables in the document. The identified fields are highlighted on the document image in the output.
+ For document queries, Canvas returns answers to each of your queries. You can open the collapsible query dropdown to view a result, along with a confidence score for the prediction. If Canvas finds multiple answers in the document, then you might have more than one result for each query.
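
For document queries specifically, the underlying Amazon Textract `AnalyzeDocument` API returns `QUERY` blocks linked to `QUERY_RESULT` blocks through `ANSWER` relationships. The sketch below pairs queries with answers from a payload shaped like that response; the sample is hand-written, not a real API response.

```python
# Illustrative sketch: pairing queries with answers from a payload shaped
# like Amazon Textract's AnalyzeDocument response when the QUERIES feature
# is enabled. The sample is hand-written, not a real API response.

def query_answers(response):
    """Map each query's text to its answer text via ANSWER relationships."""
    blocks = {block["Id"]: block for block in response["Blocks"]}
    answers = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "QUERY":
            continue
        for rel in block.get("Relationships", []):
            if rel["Type"] == "ANSWER":
                for answer_id in rel["Ids"]:
                    answers[block["Query"]["Text"]] = blocks[answer_id]["Text"]
    return answers

sample = {
    "Blocks": [
        {
            "Id": "q1",
            "BlockType": "QUERY",
            "Query": {"Text": "What is the net pay?"},
            "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}],
        },
        {
            "Id": "a1",
            "BlockType": "QUERY_RESULT",
            "Text": "$1,532.90",
            "Confidence": 98.1,
        },
    ],
}

print(query_answers(sample))  # {'What is the net pay?': '$1,532.90'}
```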

The following screenshot shows the results for a single prediction using the document analysis solution.

![\[Screenshot of the results of a single prediction with the document analysis Ready-to-use model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-ready-to-use/ai-solutions-document-analysis.png)


## Batch predictions
<a name="canvas-ready-to-use-predict-document-batch"></a>

To make batch predictions for Ready-to-use models that accept document data, do the following:

1. In the left navigation pane of the Canvas application, choose **Ready-to-use models**.

1. On the **Ready-to-use models** page, choose the Ready-to-use model for your use case. For document data, it should be one of the following: **Expense analysis**, **Identity document analysis**, or **Document analysis**.

1. On the **Run predictions** page for your chosen Ready-to-use model, choose **Batch prediction**.

1. Choose **Select dataset** if you’ve already imported your dataset. If not, choose **Import new dataset**, and then you are directed through the import data workflow.

1. From the list of available datasets, select your dataset and choose **Generate predictions**. If your use case is document analysis, continue to the next step.

1. (Optional) If your use case is document analysis, another dialog box called **Select features to include in batch prediction** appears. You can select **Forms**, **Tables**, and **Signatures** to group the results by those features. Then, choose **Generate predictions**.

After the prediction job finishes running, on the **Run predictions** page, you see an output dataset listed under **Predictions**. This dataset contains your results, and if you select the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), you can choose **View prediction results** to preview the analysis of your document data.

The following information describes the results for each type of solution:
+ For expense analysis, the results are categorized into **Summary fields**, which include fields such as the total on a receipt, and **Line item fields**, which include fields such as individual items on a receipt. The identified fields are highlighted on the document image in the output.
+ For identity document analysis, the output shows you the fields that the Ready-to-use model identified, such as first and last name, address, or date of birth. The identified fields are highlighted on the document image in the output.
+ For document analysis, the results are categorized into **Raw text**, **Forms**, **Tables**, and **Signatures**. **Raw text** includes all of the extracted text, while **Forms**, **Tables**, and **Signatures** only include information on the form that falls into those categories. For example, **Tables** only includes information extracted from tables in the document. The identified fields are highlighted on the document image in the output.

After previewing your results, you can choose **Download prediction** and download the results as a ZIP file.

# Custom models
<a name="canvas-custom-models"></a>

In Amazon SageMaker Canvas, you can train custom machine learning models tailored to your specific data and use case. By training a custom model on your data, you are able to capture characteristics and trends that are specific and most representative of your data. For example, you might want to create a custom time series forecasting model that you train on inventory data from your warehouse to manage your logistics operations.

Canvas supports training a range of model types. After training a custom model, you can evaluate the model's performance and accuracy. Once satisfied with a model, you can make predictions on new data, and you also have the option to share the custom model with data scientists for further analysis or to deploy it to a SageMaker AI hosted endpoint for real-time inference, all from within the Canvas application.

You can train a Canvas custom model on the following types of datasets:
+ Tabular (including numeric, categorical, time series, and text data)
+ Image

The following table shows the types of custom models that you can build in Canvas, along with their supported data types and data sources.


| Model type | Example use case | Supported data types | Supported data sources | 
| --- | --- | --- | --- | 
| Numeric prediction | Predicting house prices based on features like square footage | Numeric | Local upload, Amazon S3, SaaS connectors | 
| 2 category prediction | Predicting whether or not a customer is likely to churn | Binary or categorical | Local upload, Amazon S3, SaaS connectors | 
| 3+ category prediction | Predicting patient outcomes after being discharged from the hospital | Categorical | Local upload, Amazon S3, SaaS connectors | 
| Time series forecasting | Predicting your inventory for the next quarter | Time series | Local upload, Amazon S3, SaaS connectors | 
| Single-label image prediction | Predicting types of manufacturing defects in images | Image (JPG, PNG) | Local upload, Amazon S3 | 
| Multi-category text prediction | Predicting categories of products, such as clothing, electronics, or household goods, based on product descriptions |  Source column: text Target column: binary or categorical | Local upload, Amazon S3 | 

**Get started**

To get started with building and generating predictions from a custom model, do the following:
+ Determine your use case and type of model that you want to build. For more information about the custom model types, see [How custom models work](canvas-build-model.md). For more information about the data types and sources supported for custom models, see [Data import](canvas-importing-data.md).
+ [Import your data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-importing-data.html) into Canvas. You can build a custom model with any tabular or image dataset that meets the input requirements. For more information about the input requirements, see [Create a dataset](canvas-import-dataset.md).

  To learn more about sample datasets provided by SageMaker AI with which you can experiment, see [Sample datasets in Canvas](canvas-sample-datasets.md).
+ [Build](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-build-model.html) your custom model. You can do a **Quick build** to get your model and start making predictions more quickly, or you can do a **Standard build** for greater accuracy.

  For numeric, categorical, and time series forecasting model types, you can clean and prepare your data with the [Data Wrangler feature](canvas-data-prep.md). In Data Wrangler, you can create a data flow and use various data preparation techniques, such as applying advanced transforms or joining datasets. For image prediction models, you can [Edit an image dataset](canvas-edit-image.md) to update your labels or add and delete images. Note that you can't use these features for multi-category text prediction models.
+ [Evaluate your model's performance](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-evaluate-model.html) and determine how well it might perform on real-world data.
+ [Make single or batch predictions](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-make-predictions.html) with your model.

# How custom models work
<a name="canvas-build-model"></a>

Use Amazon SageMaker Canvas to build a custom model on the dataset that you've imported. Use the model that you've built to make predictions on new data. SageMaker Canvas uses the information in the dataset to build up to 250 models and choose the one that performs the best.

When you begin building a model, Canvas automatically recommends one or more *model types*. Model types fall into one of the following categories:
+ **Numeric prediction** – This is known as *regression* in machine learning. Use the numeric prediction model type when you want to make predictions for numeric data. For example, you might want to predict the price of houses based on features such as the house’s square footage.
+ **Categorical prediction** – This is known as *classification* in machine learning. When you want to categorize data into groups, use the categorical prediction model types:
  + **2 category prediction** – Use the 2 category prediction model type (also known as *binary classification* in machine learning) when you have two categories that you want to predict for your data. For example, you might want to determine whether a customer is likely to churn.
  + **3+ category prediction** – Use the 3+ category prediction model type (also known as *multi-class classification* in machine learning) when you have three or more categories that you want to predict for your data. For example, you might want to predict a customer's loan status based on features such as previous payments.
+ **Time series forecasting** – Use time series forecasts when you want to make predictions over a period of time. For example, you might want to predict the number of items you’ll sell in the next quarter. For information about time series forecasts, see [Time Series Forecasts in Amazon SageMaker Canvas](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-time-series.html).
+ **Image prediction** – Use the single-label image prediction model type (also known as *single-label image classification* in machine learning) when you want to assign labels to images. For example, you might want to classify different types of manufacturing defects in images of your product.
+ **Text prediction** – Use the multi-category text prediction model type (also known as *multi-class text classification* in machine learning) when you want to assign labels to passages of text. For example, you might have a dataset of customer reviews for a product, and you want to determine whether customers liked or disliked the product. You might have your model predict whether a given passage of text is `Positive`, `Negative`, or `Neutral`.

For a table of the supported input data types for each model type, see [Custom models](canvas-custom-models.md).

For each tabular data model that you build (which includes numeric, categorical, time series forecasting, and text prediction models), you choose the **Target column**. The **Target column** is the column that contains the information that you want to predict. For example, if you're building a model to predict whether people have cancelled their subscriptions, the **Target column** contains data points that are either a `yes` or a `no` about someone's cancellation status.

For image prediction models, you build the model with a dataset of images that have been assigned labels. For the unlabeled images that you provide, the model predicts a label. For example, if you’re building a model to predict whether an image is a cat or a dog, you provide images labeled as cats or dogs when building the model. Then, the model can accept unlabeled images and predict them as either cats or dogs.

**What happens when you build a model**

To build your model, you can choose either a **Quick build** or a **Standard build**. The **Quick build** has a shorter build time, but the **Standard build** generally has a higher accuracy.

For tabular and time series forecasting models, Canvas uses *downsampling* to reduce the size of datasets larger than 5 GB or 30 GB, respectively. Canvas downsamples with the stratified sampling method. The table below lists the size of the downsample by model type. To control the sampling process, you can use Data Wrangler in Canvas to sample using your preferred sampling technique. For time series data, you can resample to aggregate data points. For more information about sampling, see [Sampling](canvas-transform.md#canvas-transform-sampling). For more information about resampling time series data, see [Resample Time Series Data](canvas-transform.md#canvas-resample-time-series).
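
The idea behind stratified sampling is to draw the same fraction from each class so that class proportions survive the downsample. Canvas's internal implementation isn't documented; the following stdlib sketch only illustrates the technique.

```python
# Illustrative sketch of stratified sampling: draw the same fraction from
# each class so class proportions are preserved. Canvas's internal
# implementation isn't documented; this only shows the idea.
import random
from collections import defaultdict

def stratified_sample(rows, label_key, fraction, seed=0):
    """Sample `fraction` of the rows from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    sampled = []
    for members in by_class.values():
        k = round(len(members) * fraction)
        sampled.extend(rng.sample(members, k))
    return sampled

rows = [{"label": "churn"} for _ in range(100)] + \
       [{"label": "stay"} for _ in range(900)]
sampled = stratified_sample(rows, "label", fraction=0.1)
counts = {label: sum(r["label"] == label for r in sampled)
          for label in ("churn", "stay")}
print(counts)  # {'churn': 10, 'stay': 90}
```

Note that the 10%/90% class split in the original data is preserved exactly in the sample.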

If you choose to do a **Quick build** on a dataset with more than 50,000 rows, then Canvas samples your data down to 50,000 rows for a shorter model training time.

The following table summarizes key characteristics of the model building process, including average build times for each model and build type, the size of the downsample when building models with large datasets, and the minimum and maximum number of data points you should have for each build type.


| Limit | Numeric and categorical prediction | Time series forecasting | Image prediction | Text prediction | 
| --- | --- | --- | --- | --- | 
| **Quick build** time | 2–20 minutes | 2–20 minutes | 15–30 minutes | 15–30 minutes | 
| **Standard build** time | 2–4 hours | 2–4 hours | 2–5 hours | 2–5 hours | 
| Downsample size (the reduced size of a large dataset after Canvas downsamples) | 5 GB | 30 GB | N/A | N/A | 
| Minimum number of entries (rows) for **Quick builds** |  2 category: 500 rows 3+ category, numeric, time series: N/A  | N/A | N/A | N/A | 
| Minimum number of entries (rows, images, or documents) for **Standard builds** | 250 | 50 | 50 | N/A | 
| Maximum number of entries (rows, images, or documents) for **Quick builds** | N/A | N/A | 5,000 | 7,500 | 
| Maximum number of entries (rows, images, or documents) for **Standard builds** | N/A | 150,000 | 180,000 | N/A | 
| Maximum number of columns | 1,000 | 1,000 | N/A | N/A | 

Canvas predicts values by using the information in the rest of the dataset, depending on the model type:
+ For categorical prediction, Canvas puts each row into one of the categories listed in the **Target column**.
+ For numeric prediction, Canvas uses the information in the dataset to predict the numeric values in the **Target column**.
+ For time series forecasting, Canvas uses historical data to predict values for the **Target column** in the future.
+ For image prediction, Canvas uses images that have been assigned labels to predict labels for unlabeled images.
+ For text prediction, Canvas analyzes text data that has been assigned labels to predict labels for passages of unlabeled text.

**Additional features to help you build your model**

Before building your model, you can use Data Wrangler in Canvas to prepare your data using 300+ built-in transforms and operators. Data Wrangler supports transforms for both tabular and image datasets. Additionally, you can connect to data sources outside of Canvas, create jobs to apply transforms to your entire dataset, and export your fully prepared and cleaned data for use in ML workflows outside of Canvas. For more information, see [Data preparation](canvas-data-prep.md).

To see visualizations and analytics to explore your data and determine which features to include in your model, you can use Data Wrangler’s built-in analyses. You can also access a **Data Quality and Insights Report** that highlights potential issues with your dataset and provides recommendations for how to fix them. For more information, see [Perform exploratory data analysis (EDA)](canvas-analyses.md).

In addition to the more advanced data preparation and exploration functionality provided through Data Wrangler, Canvas provides some basic features that you can use:
+ To filter your data and access a set of basic data transforms, see [Prepare data for model building](canvas-prepare-data.md).
+ To access simple visualizations and analytics for feature exploration, see [Data exploration and analysis](canvas-explore-data.md).
+ To learn more about additional features such as previewing your model, validating your dataset, and changing the size of the random sample used to build your model, see [Preview your model](canvas-preview-model.md).

For tabular datasets with multiple columns (such as datasets for building categorical, numeric, or time series forecasting model types), you might have rows with missing data points. While Canvas builds the model, it automatically imputes the missing values. Canvas uses the values in your dataset to perform a mathematical approximation for the missing values. For the highest model accuracy, we recommend adding the missing data if you can find it. Note that missing value imputation is not supported for text prediction or image prediction models.
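Conceptually, this imputation resembles filling each missing entry from the values that are present in the same column. The following is an illustrative sketch using pandas with simple mean imputation; it is not the approximation Canvas actually performs:

```python
import pandas as pd

# Hypothetical tabular dataset with missing numeric values.
df = pd.DataFrame({
    "price": [10.0, None, 14.0, 12.0],
    "units": [3, 4, None, 5],
})

# A simple mean-based approximation for the missing values. Canvas's
# internal method may differ, but the idea is the same: infer each
# missing entry from the values that are present in the column.
imputed = df.fillna(df.mean(numeric_only=True))
```

In this sketch, the missing `price` is filled with 12.0 (the mean of 10, 14, and 12), which is why supplying the real value, if you can find it, gives the model more accurate information than any approximation.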

**Get started**

To get started with building a custom model, see [Build a model](canvas-build-model-how-to.md) and follow the procedure for the type of model that you want to build.

# Preview your model
<a name="canvas-preview-model"></a>

**Note**  
The following functionality is only available for custom models built with tabular datasets. Multi-category text prediction models are also excluded.

SageMaker Canvas provides you with a tool to preview your model before you begin building. The preview gives you an estimated accuracy score and a preliminary idea of how each column might impact the model.

To preview the model score, when you're on the **Build** tab of your model, choose **Preview model**.

The model preview generates an **Estimated accuracy** prediction of how well the model might analyze your data. The accuracy of a **Quick build** or a **Standard build** represents how well the model can perform on real data and is generally higher than the **Estimated accuracy**.

The model preview also provides you with the **Column Impact** scores, which can indicate the importance of each column to the model's predictions.

The following screenshot shows a model preview in the Canvas application.

![\[Screenshot of the Build tab for a model in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-build/canvas-build-preview-model.png)


Amazon SageMaker Canvas automatically handles missing values in your dataset while it builds the model. It infers the missing values by using adjacent values that are present in the dataset.

If you're satisfied with your model preview and want to proceed with building a model, then see [Build a model](canvas-build-model-how-to.md).

# Data validation
<a name="canvas-dataset-validation"></a>

Before you build your model, SageMaker Canvas checks your dataset for issues that might cause your build to fail. If SageMaker Canvas finds any issues, then it warns you on the **Build** page before you attempt to build a model.

You can choose **Validate data** to see a list of the issues with your dataset. You can then use the SageMaker Canvas [Data Wrangler data preparation features](canvas-data-prep.md), or your own tools, to fix your dataset before starting a build. If you don’t fix the issues with your dataset, then your build fails.

If you make changes to your dataset to fix the issues, we recommend that you re-validate your dataset before attempting a build.

The following table shows the issues that SageMaker Canvas checks for in your dataset and how to resolve them.


| Issue | Resolution | 
| --- | --- | 
|  Wrong model type for your data  |  Try another model type or use a different dataset.  | 
|  Missing values in your target column  |  Replace the missing values, drop rows with missing values, or use a different dataset.  | 
|  Too many unique labels in your target column  |  Verify that you've used the correct column for your target column, or use a different dataset.  | 
|  Too many non-numeric values in your target column  |  Choose a different target column, select another model type, or use a different dataset.  | 
|  One or more column names contain double underscores  |  Rename the columns to remove any double underscores, and try again.  | 
|  None of the rows in your dataset are complete  |  Replace the missing values, or use a different dataset.  | 
|  Too many unique labels for the number of rows in your data  |  Check that you're using the right target column, increase the number of rows in your dataset, consolidate similar labels, or use a different dataset.  | 
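If you want to catch some of these issues locally before uploading, a few of the checks can be approximated with pandas. This is an illustrative sketch; the `validate_for_build` helper is hypothetical, not part of Canvas:

```python
import pandas as pd

def validate_for_build(df: pd.DataFrame, target: str) -> list[str]:
    """Hypothetical pre-build checks mirroring a few issues from the table."""
    issues = []
    if df[target].isna().any():
        issues.append("Missing values in your target column")
    if any("__" in col for col in df.columns):
        issues.append("One or more column names contain double underscores")
    if df.dropna().empty:
        issues.append("None of the rows in your dataset are complete")
    return issues

# Example dataset that triggers two of the checks.
df = pd.DataFrame({"feature__a": [1, 2], "label": [None, "yes"]})
problems = validate_for_build(df, target="label")
```

Running checks like these before importing your data can save you a failed build, but Canvas's own **Validate data** step remains the authoritative check.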

# Random sample
<a name="canvas-random-sample"></a>

SageMaker Canvas uses random sampling to sample your dataset, meaning that each row has an equal chance of being picked for the sample. You can choose a column in the preview to get summary statistics for the random sample, such as the mean and the mode.

By default, SageMaker Canvas uses a random sample size of 20,000 rows from your dataset for datasets with more than 20,000 rows. For datasets smaller than 20,000 rows, the default sample size is the number of rows in your dataset. You can increase or decrease the sample size by choosing **Random sample** in the **Build** tab of the SageMaker Canvas application. You can use the slider to select your desired sample size, and then choose **Update** to change the sample size. The maximum sample size you can choose for a dataset is 40,000 rows, and the minimum sample size is 500 rows. If you choose a large sample size, the dataset preview and summary statistics might take a few moments to reload.

The **Build** page shows a preview of 100 rows from your dataset. If the sample size is the same size as your dataset, then the preview uses the first 100 rows of your dataset. Otherwise, the preview uses the first 100 rows of the random sample.
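The defaults described above can be sketched with pandas, where `sample` picks rows uniformly at random (this is a conceptual illustration, not Canvas code):

```python
import pandas as pd

df = pd.DataFrame({"value": range(50_000)})  # hypothetical large dataset

# Default behavior described above: sample up to 20,000 rows uniformly
# at random (each row equally likely), then preview the first 100 rows
# of the sample and compute summary statistics on it.
sample = df.sample(n=min(len(df), 20_000), random_state=0)
preview = sample.head(100)
sample_mean = sample["value"].mean()
```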

# Build a model
<a name="canvas-build-model-how-to"></a>

The following sections show you how to build a model for each of the main types of custom models.
+ To build numeric prediction, 2 category prediction, or 3+ category prediction models, see [Build a custom numeric or categorical prediction model](#canvas-build-model-numeric-categorical).
+ To build single-label image prediction models, see [Build a custom image prediction model](#canvas-build-model-image).
+ To build multi-category text prediction models, see [Build a custom text prediction model](#canvas-build-model-text).
+ To build time series forecasting models, see [Build a time series forecasting model](#canvas-build-model-forecasting).

**Note**  
If you encounter an error during post-building analysis that tells you to increase your quota for `ml.m5.2xlarge` instances, see [Request a Quota Increase](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-requesting-quota-increases.html).

## Build a custom numeric or categorical prediction model
<a name="canvas-build-model-numeric-categorical"></a>

Numeric and categorical prediction models support both **Quick builds** and **Standard builds**.

To build a numeric or categorical prediction model, use the following procedure:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **My models**.

1. Choose **New model**.

1. In the **Create new model** dialog box, do the following:

   1. Enter a name in the **Model name** field.

   1. Select the **Predictive analysis** problem type.

   1. Choose **Create**.

1. For **Select dataset**, select your dataset from the list of datasets. If you haven’t already imported your data, choose **Import** to be directed through the import data workflow.

1. When you’re ready to begin building your model, choose **Select dataset**.

1. On the **Build** tab, for the **Target column** dropdown list, select the target for your model that you would like to predict.

1. For **Model type**, Canvas automatically detects the problem type for you. If you want to change the type or configure advanced model settings, choose **Configure model**.

   When the **Configure model** dialog box opens, do the following:

   1. For **Model type**, choose the model type that you want to build.

   1. After you choose the model type, there are additional **Advanced settings**. For more information about each of the advanced settings, see [Advanced model building configurations](canvas-advanced-settings.md). To configure the advanced settings, do the following:

      1. (Optional) For the **Objective metric** dropdown menu, select the metric that you want Canvas to optimize while building your model. If you don’t select a metric, Canvas chooses one for you by default. For descriptions of the available metrics, see [Metrics reference](canvas-metrics.md).

      1. For **Training method**, choose **Auto**, **Ensemble**, or **Hyperparameter optimization (HPO) mode**.

      1. For **Algorithms**, select the algorithms that you want to include for building model candidates.

      1. For **Data split**, specify in percentages how you want to split your data between the **Training set** and the **Validation set**. The training set is used for building the model, while the validation set is used for testing accuracy of model candidates.

      1. For **Max candidates and runtime**, do the following:

         1. Set the **Max candidates** value, or the maximum number of model candidates that Canvas can generate. Note that **Max candidates** is only available in HPO mode.

         1. Set the hour and minute values for **Max job runtime**, or the maximum amount of time that Canvas can spend building your model. After the maximum time, Canvas stops building and selects the best model candidate.

   1. After configuring the advanced settings, choose **Save**.

1. Select or deselect columns in your data to include or drop them from your build.
**Note**  
If you make batch predictions with your model after building, Canvas adds dropped columns to your prediction results. However, Canvas does not add the dropped columns to your batch predictions for time series models.

1. (Optional) Use the visualization and analytics tools that Canvas provides to visualize your data and determine which features you might want to include in your model. For more information, see [Explore and analyze your data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-explore-data.html).

1. (Optional) Use data transformations to clean, transform, and prepare your data for model building. For more information, see [Prepare your data with advanced transformations](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-prepare-data.html). You can view and remove your transforms by choosing **Model recipe** to open the **Model recipe** side panel.

1. (Optional) For additional features such as previewing the accuracy of your model, validating your dataset, and changing the size of the random sample that Canvas takes from your dataset, see [Preview your model](canvas-preview-model.md).

1. After reviewing your data and making any changes to your dataset, choose **Quick build** or **Standard build** to begin a build for your model. The following screenshot shows the **Build** page and the **Quick build** and **Standard build** options.  
![\[The Build page for a 2 category model showing the Quick build and Standard build options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/build-page-tabular-quick-standard-options.png)

After your model begins building, you can leave the page. When the model shows as **Ready** on the **My models** page, it’s ready for analysis and predictions.

## Build a custom image prediction model
<a name="canvas-build-model-image"></a>

Single-label image prediction models support both **Quick builds** and **Standard builds**.

To build a single-label image prediction model, use the following procedure:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **My models**.

1. Choose **New model**.

1. In the **Create new model** dialog box, do the following:

   1. Enter a name in the **Model name** field.

   1. Select the **Image analysis** problem type.

   1. Choose **Create**.

1. For **Select dataset**, select your dataset from the list of datasets. If you haven’t already imported your data, choose **Import** to be directed through the import data workflow.

1. When you’re ready to begin building your model, choose **Select dataset**.

1. On the **Build** tab, you see the **Label distribution** for the images in your dataset. The **Model type** is set to **Single-label image prediction**.

1. On this page, you can preview your images and edit the dataset. If you have any unlabeled images, choose **Edit dataset** and [Assign labels to unlabeled images](canvas-edit-image.md#canvas-edit-image-assign). You can also perform other tasks when you [Edit an image dataset](canvas-edit-image.md), such as renaming labels and adding images to the dataset.

1. After reviewing your data and making any changes to your dataset, choose **Quick build** or **Standard build** to begin a build for your model. The following screenshot shows the **Build** page of an image prediction model that is ready to be built.  
![\[The Build page for a single-label image prediction model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/build-page-image-model.png)

After your model begins building, you can leave the page. When the model shows as **Ready** on the **My models** page, it’s ready for analysis and predictions.

## Build a custom text prediction model
<a name="canvas-build-model-text"></a>

Multi-category text prediction models support both **Quick builds** and **Standard builds**.

To build a text prediction model, use the following procedure:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **My models**.

1. Choose **New model**.

1. In the **Create new model** dialog box, do the following:

   1. Enter a name in the **Model name** field.

   1. Select the **Text analysis** problem type.

   1. Choose **Create**.

1. For **Select dataset**, select your dataset from the list of datasets. If you haven’t already imported your data, choose **Import** to be directed through the import data workflow.

1. When you’re ready to begin building your model, choose **Select dataset**.

1. On the **Build** tab, for the **Target column** dropdown list, select the target for your model that you would like to predict. The target column must have a binary or categorical data type, and there must be at least 25 entries (or rows of data) for each unique label in the target column.

1. For **Model type**, confirm that the model type is automatically set to **Multi-category text prediction**.

1. For the training column, select your source column of text data. This should be the column containing the text that you want to analyze.

1. Choose **Quick build** or **Standard build** to begin building your model. The following screenshot shows the **Build** page of a text prediction model that is ready to be built.  
![\[The Build page for a multi-category text prediction model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/build-page-text-model.png)

After your model begins building, you can leave the page. When the model shows as **Ready** on the **My models** page, it’s ready for analysis and predictions.

## Build a time series forecasting model
<a name="canvas-build-model-forecasting"></a>

Time series forecasting models support both **Quick builds** and **Standard builds**.

To build a time series forecasting model, use the following procedure:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **My models**.

1. Choose **New model**.

1. In the **Create new model** dialog box, do the following:

   1. Enter a name in the **Model name** field.

   1. Select the **Time series forecasting** problem type.

   1. Choose **Create**.

1. For **Select dataset**, select your dataset from the list of datasets. If you haven’t already imported your data, choose **Import** to be directed through the import data workflow.

1. When you’re ready to begin building your model, choose **Select dataset**.

1. On the **Build** tab, for the **Target column** dropdown list, select the target for your model that you would like to predict.

1. In the **Model type** section, choose **Configure model**.

1. The **Configure model** box opens. For the **Time series configuration** section, fill out the following fields:

   1. For **Item ID column**, choose a column in your dataset that uniquely identifies each row. The column should have a data type of `Text`.

   1. (Optional) For **Group column**, choose one or more categorical columns (with a data type of `Text`) that you want to use for grouping your forecasting values.

   1. For **Time stamp column**, select the column with timestamps (in datetime format). For more information about the accepted datetime formats, see [Time Series Forecasts in Amazon SageMaker Canvas](canvas-time-series.md).

   1. For the **Forecast length** field, enter the period of time for which you want to forecast values. Canvas automatically detects the units of time in your data.

   1. (Optional) Turn on the **Use holiday schedule** toggle to select a holiday schedule from various countries and make your forecasts with holiday data more accurate.

1. In the **Configure model** box, there are additional settings in the **Advanced** section. For more information about each of the advanced settings, see [Advanced model building configurations](canvas-advanced-settings.md). To configure the **Advanced** settings, do the following:

   1. For the **Objective metric** dropdown menu, select the metric that you want Canvas to optimize while building your model. If you don’t select a metric, Canvas chooses one for you by default. For descriptions of the available metrics, see [Metrics reference](canvas-metrics.md).

   1. If you’re running a standard build, you’ll see the **Algorithms** section. This section is for selecting the time series forecasting algorithms that you’d like to use for building your model. You can select a subset of the available algorithms, or you can select all of them if you aren’t sure which ones to try.

      When you run your standard build, Canvas builds an ensemble model that combines all of the algorithms together to optimize prediction accuracy.
**Note**  
If you’re running a quick build, Canvas uses a single tree-based learning algorithm to train your model, and you don’t have to select any algorithms.

   1. For **Forecast quantiles**, enter up to 5 comma-separated quantile values to specify the upper and lower bounds of your forecast.

   1. After configuring the **Advanced** settings, choose **Save**.

1. Select or deselect columns in your data to include or drop them from your build.
**Note**  
If you make batch predictions with your model after building, Canvas adds dropped columns to your prediction results. However, Canvas does not add the dropped columns to your batch predictions for time series models.

1. (Optional) Use the visualization and analytics tools that Canvas provides to visualize your data and determine which features you might want to include in your model. For more information, see [Explore and analyze your data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-explore-data.html).

1. (Optional) Use data transformations to clean, transform, and prepare your data for model building. For more information, see [Prepare your data with advanced transformations](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-prepare-data.html). You can view and remove your transforms by choosing **Model recipe** to open the **Model recipe** side panel.

1. (Optional) For additional features such as previewing the accuracy of your model, validating your dataset, and changing the size of the random sample that Canvas takes from your dataset, see [Preview your model](canvas-preview-model.md).

1. After reviewing your data and making any changes to your dataset, choose **Quick build** or **Standard build** to begin a build for your model.

After your model begins building, you can leave the page. When the model shows as **Ready** on the **My models** page, it’s ready for analysis and predictions.

# Advanced model building configurations
<a name="canvas-advanced-settings"></a>

Amazon SageMaker Canvas supports various advanced settings that you can configure when building a model. The following page lists all of the advanced settings along with additional information about their options and configurations.

**Note**  
The following advanced settings are currently only supported for numeric, categorical, and time series forecasting model types.

## Advanced numeric and categorical prediction model settings
<a name="canvas-advanced-settings-predictive"></a>

Canvas supports the following advanced settings for numeric and categorical prediction model types.

### Objective metric
<a name="canvas-advanced-settings-predictive-obj-metric"></a>

The objective metric is the metric that you want Canvas to optimize while building your model. If you don’t select a metric, Canvas chooses one for you by default. For descriptions of the available metrics, see the [Metrics reference](canvas-metrics.md).

### Training method
<a name="canvas-advanced-settings-predictive-method"></a>

Canvas can automatically select the training method based on the dataset size, or you can select it manually. The following training methods are available for you to choose from:
+ **Ensembling** – SageMaker AI leverages the AutoGluon library to train several base models. To find the best combination for your dataset, ensemble mode runs 5–10 trials with different model and meta parameter settings. Then, these models are combined using a stacking ensemble method to create an optimal predictive model. For a list of algorithms supported by ensemble mode for tabular data, see the following [Algorithms](#canvas-advanced-settings-predictive-algos) section.
+ **Hyperparameter optimization (HPO)** – SageMaker AI finds the best version of a model by tuning hyperparameters using Bayesian optimization or multi-fidelity optimization while running training jobs on your dataset. HPO mode selects the algorithms that are most relevant to your dataset and selects the best range of hyperparameters to tune your models. To tune your models, HPO mode runs up to 100 trials (default) to find the optimal hyperparameter settings within the selected range. If your dataset size is less than 100 MB, SageMaker AI uses Bayesian optimization. SageMaker AI chooses multi-fidelity optimization if your dataset is larger than 100 MB.

  For a list of algorithms supported by HPO mode for tabular data, see the following [Algorithms](#canvas-advanced-settings-predictive-algos) section.
+ **Auto** – SageMaker AI automatically chooses either ensembling mode or HPO mode based on your dataset size. If your dataset is larger than 100 MB, SageMaker AI chooses HPO mode. Otherwise, it chooses ensembling mode.
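Conceptually, a stacking ensemble trains several base models and then trains a final estimator to blend their predictions. The following scikit-learn sketch illustrates the idea; it is not the AutoGluon pipeline that SageMaker AI actually runs:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for a tabular dataset.
X, y = make_regression(n_samples=200, n_features=5, random_state=0)

# Base models are trained on the data; a final estimator learns how
# to combine their predictions (the "stacking ensemble method" above).
stack = StackingRegressor(
    estimators=[
        ("forest", RandomForestRegressor(n_estimators=20, random_state=0)),
        ("tree", DecisionTreeRegressor(random_state=0)),
    ],
    final_estimator=Ridge(),
)
stack.fit(X, y)
score = stack.score(X, y)  # R^2 on the training data
```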

### Algorithms
<a name="canvas-advanced-settings-predictive-algos"></a>

In **Ensembling** mode, Canvas supports the following machine learning algorithms:
+ [LightGBM](https://docs.aws.amazon.com/sagemaker/latest/dg/lightgbm.html) – An optimized framework that uses tree-based algorithms with gradient boosting. This algorithm uses trees that grow in breadth, rather than depth, and is highly optimized for speed.
+ [CatBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/catboost.html) – A framework that uses tree-based algorithms with gradient boosting. Optimized for handling categorical variables.
+ [XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) – A framework that uses tree-based algorithms with gradient boosting that grows in depth, rather than breadth.
+ [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) – A tree-based algorithm that uses several decision trees on random sub-samples of the data with replacement. The trees are split into optimal nodes at each level. The decisions of each tree are averaged together to prevent overfitting and improve predictions.
+ [Extra Trees](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier) – A tree-based algorithm that uses several decision trees on the entire dataset. The trees are split randomly at each level. The decisions of each tree are averaged to prevent overfitting and to improve predictions. Extra trees add a degree of randomization in comparison to the random forest algorithm.
+ [Linear Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) – A framework that uses a linear equation to model the relationship between two variables in observed data.
+ Neural network torch – A neural network model that's implemented using [PyTorch](https://pytorch.org/).
+ Neural network fast.ai – A neural network model that's implemented using [fast.ai](https://www.fast.ai/).

In **HPO mode**, Canvas supports the following machine learning algorithms:
+ [XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) – A supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models.
+ Deep learning algorithm – A multilayer perceptron (MLP) and feedforward artificial neural network. This algorithm can handle data that is not linearly separable.

### Data split
<a name="canvas-advanced-settings-predictive-split"></a>

You have the option to specify how you want to split your dataset between the training set (the portion of your dataset used for building the model) and the validation set (the portion of your dataset used for verifying the model’s accuracy). For example, a common split ratio is 80% training and 20% validation, where 80% of your data is used to build the model while 20% is saved for measuring model performance. If you don’t specify a custom ratio, then Canvas splits your dataset automatically.
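For example, the common 80/20 ratio can be sketched with scikit-learn (an illustration of the ratio described above, not what Canvas runs internally):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)  # hypothetical feature matrix
y = np.arange(1000)                 # hypothetical target column

# 80% of rows build the model; 20% are held out to measure the
# accuracy of model candidates.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, random_state=0
)
```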

### Max candidates
<a name="canvas-advanced-settings-predictive-candidates"></a>

**Note**  
This feature is only available in the HPO training mode.

You can specify the maximum number of model candidates that Canvas generates while building your model. We recommend that you use the default number of candidates, which is 100, to build the most accurate models. The maximum number you can specify is 250. Decreasing the number of model candidates may impact your model’s accuracy.

### Max job runtime
<a name="canvas-advanced-settings-predictive-runtime"></a>

You can specify the maximum job runtime, or the maximum amount of time that Canvas spends building your model. After the time limit, Canvas stops building and selects the best model candidate.

The maximum time that you can specify is 720 hours. We highly recommend that you keep the maximum job runtime greater than 30 minutes to ensure that Canvas has enough time to generate model candidates and finish building your model.

## Advanced time series forecasting model settings
<a name="canvas-advanced-settings-time-series"></a>

For time series forecasting models, Canvas supports the Objective metric, which is listed in the previous section.

Time series forecasting models also support the following advanced setting:

### Algorithm selection
<a name="canvas-advanced-settings-time-series-algos"></a>

When you build a time series forecasting model, Canvas uses an *ensemble* (or a combination) of statistical and machine learning algorithms to deliver highly accurate time series forecasts. By default, Canvas selects the optimal combination of all the available algorithms based on the time series in your dataset. However, you have the option to specify one or more algorithms to use for your forecasting model. In this case, Canvas determines the best blend using only your selected algorithms. If you're uncertain about which algorithm to select for training your model, we recommend that you choose all of the available algorithms.

**Note**  
Algorithm selection is only supported for standard builds. If you don’t select any algorithms in the advanced settings, then by default SageMaker AI runs a quick build and trains model candidates using a single tree-based learning algorithm. For more information about the difference between quick builds and standard builds, see [How custom models work](canvas-build-model.md).

Canvas supports the following time series forecasting algorithms:
+ [Autoregressive Integrated Moving Average (ARIMA)](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average) – A simple stochastic time series model that uses statistical analysis to interpret the data and make future predictions. This algorithm is useful for simple datasets with fewer than 100 time series.
+ [Convolutional Neural Network - Quantile Regression (CNN-QR)](https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-algo-cnnqr.html) – A proprietary, supervised learning algorithm that trains one global model from a large collection of time series and uses a quantile decoder to make predictions. CNN-QR works best with large datasets containing hundreds of time series.
+ [DeepAR+](https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-recipe-deeparplus.html) – A proprietary, supervised learning algorithm for forecasting scalar time series using recurrent neural networks (RNNs) to train a single model jointly over all of the time series. DeepAR+ works best with large datasets containing hundreds of feature time series.
+ [Non-Parametric Time Series (NPTS)](https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-recipe-npts.html) – A scalable, probabilistic baseline forecaster that predicts the future value distribution of a given time series by sampling from past observations. NPTS is useful when working with sparse or intermittent time series (for example, forecasting demand for individual items where the time series has many 0s or low counts).
+ [Exponential Smoothing (ETS)](https://en.wikipedia.org/wiki/Exponential_smoothing) – A forecasting method that produces forecasts which are weighted averages of past observations where the weights of older observations exponentially decrease. The algorithm is useful for simple datasets with fewer than 100 time series and datasets with seasonality patterns.
+ [Prophet](https://facebook.github.io/prophet/) – An additive regression model that works best with time series that have strong seasonal effects and several seasons of historical data. The algorithm is useful for datasets with non-linear growth trends that approach a limit.

### Forecast quantiles
<a name="canvas-advanced-settings-time-series-quantiles"></a>

For time series forecasting, SageMaker AI trains 6 model candidates with your target time series. Then, SageMaker AI combines these models using a stacking ensemble method to create an optimal forecasting model for a given objective metric. Each forecasting model generates a probabilistic forecast by producing forecasts at quantiles between P1 and P99. These quantiles are used to account for forecast uncertainty. By default, forecasts are generated for 0.1 (`p10`), 0.5 (`p50`), and 0.9 (`p90`). You can choose to specify up to five of your own quantiles from 0.01 (`p1`) to 0.99 (`p99`), by increments of 0.01 or higher.
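To illustrate what quantile forecasts represent, the following sketch computes the default p10, p50, and p90 quantiles from hypothetical sample forecasts using NumPy. Canvas's models produce these quantiles directly; this is only a conceptual example:

```python
import numpy as np

# Hypothetical sample forecasts for one future time step, standing in
# for draws from a model's predictive distribution.
rng = np.random.default_rng(0)
samples = rng.normal(loc=100.0, scale=10.0, size=10_000)

# The default quantiles described above: p10, p50, and p90. The true
# value is expected to fall below p90 about 90% of the time.
p10, p50, p90 = np.quantile(samples, [0.10, 0.50, 0.90])
```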

# Edit an image dataset
<a name="canvas-edit-image"></a>

In Amazon SageMaker Canvas, you can edit your image datasets and review your labels before building a model. You might want to perform tasks such as assigning labels to unlabeled images or adding more images to the dataset. These tasks can all be done in the Canvas application, providing you with one place to modify your dataset and build a model.

**Note**  
Before building a model, you must assign labels to all images in your dataset. Also, you must have at least 25 images per label and a minimum of two labels. For more information about assigning labels, see the section on this page called **Assign labels to unlabeled images**. If you can’t determine a label for an image, you should delete it from your dataset. For more information about deleting images, see the section on this page [Add or delete images from the dataset](#canvas-edit-image-add-delete).
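As an illustration of these minimums, the following sketch checks hypothetical per-label image counts; the `meets_minimums` helper and the counts are made up for this example:

```python
# Hypothetical label counts for an image dataset. Canvas requires at
# least 25 images per label and a minimum of two labels.
label_counts = {"cat": 30, "dog": 27}

def meets_minimums(counts: dict[str, int]) -> bool:
    # At least two distinct labels, and at least 25 images per label.
    return len(counts) >= 2 and all(n >= 25 for n in counts.values())

ok = meets_minimums(label_counts)
```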

To begin editing your image dataset, you should be on the **Build** tab while building your single-label image prediction model.

A new page opens that shows the images in your dataset along with their labels. This page categorizes your image dataset into **Total images**, **Labeled images**, and **Unlabeled images**. You can also review the **Dataset preparation guide** for best practices on building a more accurate image prediction model.

The following screenshot shows the page for editing your image dataset.

![\[Screenshot of the image dataset management page in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/dataset-management-page.png)


From this page, you can perform the following actions.

## View the properties for each image (label, size, dimensions)
<a name="canvas-edit-image-view"></a>

To view an individual image, you can search for it by file name in the search bar. Then, choose the image to open the full view. You can view the image properties and reassign the image’s label. Choose **Save** when you’re done viewing the image.

## Add, rename, or delete labels in the dataset
<a name="canvas-edit-image-labels"></a>

Canvas lists the labels for your dataset in the left navigation pane. You can add new labels to the dataset by entering a label in the **Add label** text field.

To rename or delete a label from your dataset, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) next to the label and select either **Rename** or **Delete**. If you rename the label, you can enter the new label name and choose **Confirm**. If you delete the label, the label is removed from all images in your dataset that have that label. Any images with that label are left unlabeled.

## Assign labels to unlabeled images
<a name="canvas-edit-image-assign"></a>

To view the unlabeled images in your dataset, choose **Unlabeled** in the left navigation pane. Select an image, open the dropdown titled **Unlabeled**, and choose a label to assign to the image. You can also select multiple images at once, and all of the selected images are assigned the label you choose.

## Reassign labels to images
<a name="canvas-edit-image-reassign"></a>

You can reassign labels to images by selecting the image (or multiple images at a time) and opening the dropdown titled with the current label. Select your desired label, and the image or images are updated with the new label.

## Sort your images by label
<a name="canvas-edit-image-sort"></a>

You can view all the images for a given label by choosing the label in the left navigation pane.

## Add or delete images from the dataset
<a name="canvas-edit-image-add-delete"></a>

You can add more images to your dataset by choosing **Add images** in the top navigation pane. You’ll be taken through the workflow to import more images. The images you import are added to your existing dataset.

You can delete images from your dataset by selecting them and then choosing **Delete** in the top navigation pane.

**Note**  
After making any changes to your dataset, choose **Save dataset** to make sure that you don’t lose your changes.

# Data exploration and analysis
<a name="canvas-explore-data"></a>

**Note**  
You can only use SageMaker Canvas visualizations and analytics for models built on tabular datasets. Multi-category text prediction models are also excluded.

In Amazon SageMaker Canvas, you can explore the variables in your dataset using visualizations and analytics and create in-application visualizations and analytics. You can use these explorations to uncover relationships between your variables before building your model.

For more information about visualization techniques in Canvas, see [Explore your data using visualization techniques](canvas-explore-data-visualization.md).

For more information about analytics in Canvas, see [Explore your data using analytics](canvas-explore-data-analytics.md).

# Explore your data using visualization techniques
<a name="canvas-explore-data-visualization"></a>

**Note**  
You can only use SageMaker Canvas visualizations for models built on tabular datasets. Multi-category text prediction models are also excluded.

With Amazon SageMaker Canvas, you can explore and visualize your data to gain advanced insights into your data before building your ML models. You can visualize using scatter plots, bar charts, and box plots, which can help you understand your data and discover the relationships between features that could affect the model accuracy.

In the **Build** tab of the SageMaker Canvas application, choose **Data visualizer** to begin creating your visualizations.

You can change the visualization sample size to adjust the size of the random sample taken from your dataset. A sample size that is too large might affect the performance of your data visualizations, so we recommend that you choose an appropriate sample size. To change the sample size, use the following procedure.

1. Choose **Visualization sample**.

1. Use the slider to select your desired sample size.

1. Choose **Update** to confirm the change to your sample size.

**Note**  
Certain visualization techniques require columns of a specific data type. For example, you can only use numeric columns for the x and y-axes of scatter plots.

## Scatter plot
<a name="canvas-explore-data-scatterplot"></a>

To create a scatter plot with your dataset, choose **Scatter plot** in the **Visualization** panel. Choose the features you want to plot on the x and y-axes from the **Columns** section. You can drag and drop columns onto the axes, or choose a column for each axis from the list of supported columns.

You can use **Color by** to color the data points on the plot with a third feature. You can also use **Group by** to group the data into separate plots based on a fourth feature.

The following image shows a scatter plot that uses **Color by** and **Group by**. In this example, each data point is colored by the `MaritalStatus` feature, and grouping by the `Department` feature results in a scatter plot for the data points of each department.

![\[Screenshot of a scatter plot in the Data visualizer view of the Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-eda-scatter-plot.png)


## Bar chart
<a name="canvas-explore-data-barchart"></a>

To create a bar chart with your dataset, choose **Bar chart** in the **Visualization** panel. Choose the features you want to plot on the x and y-axes from the **Columns** section. You can drag and drop columns onto the axes, or choose a column for each axis from the list of supported columns.

You can use **Group by** to group the bar chart by a third feature. You can use **Stack by** to vertically shade each bar based on the unique values of a fourth feature.

The following image shows a bar chart that uses **Group by** and **Stack by**. In this example, the bar chart is grouped by the `MaritalStatus` feature and stacked by the `JobLevel` feature. For each `JobRole` on the x axis, there is a separate bar for the unique categories in the `MaritalStatus` feature, and every bar is vertically stacked by the `JobLevel` feature.

![\[Screenshot of a bar chart in the Data visualizer view of the Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-eda-bar-chart.png)


## Box plot
<a name="canvas-explore-data-boxplot"></a>

To create a box plot with your dataset, choose **Box plot** in the **Visualization** panel. Choose the features you want to plot on the x and y-axes from the **Columns** section. You can drag and drop columns onto the axes, or choose a column for each axis from the list of supported columns.

You can use **Group by** to group the box plots by a third feature.

The following image shows a box plot that uses **Group by**. In this example, the x and y-axes show `JobLevel` and `JobSatisfaction`, respectively, and the colored box plots are grouped by the `Department` feature.

![\[Screenshot of a box plot in the Data visualizer view of the Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-eda-box-plot.png)


# Explore your data using analytics
<a name="canvas-explore-data-analytics"></a>

**Note**  
You can only use SageMaker Canvas analytics for models built on tabular datasets. Multi-category text prediction models are also excluded.

With analytics in Amazon SageMaker Canvas, you can explore your dataset and gain insight on all of your variables before building a model. You can determine the relationships between features in your dataset using correlation matrices. You can use this technique to summarize your dataset into a matrix that shows the correlations between two or more values. This helps you identify and visualize patterns in a given dataset for advanced data analysis.

The matrix shows the correlation between each feature as positive, negative, or neutral. You might want to include features that have a high correlation with each other when building your model. Features that have little to no correlation might be irrelevant to your model, and you can drop those features when building your model.
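
Outside of Canvas, the same kind of matrix can be computed with pandas, which may help connect the values on screen to the underlying calculation. The dataset and column names below are hypothetical.

```python
import pandas as pd

# A small hypothetical dataset with one informative pair and one arbitrary ID.
df = pd.DataFrame({
    "hours_worked": [35, 40, 45, 50, 55],
    "salary":       [30, 42, 48, 55, 60],
    "employee_id":  [3, 1, 5, 2, 4],  # arbitrary: weak correlation with the rest
})

# Pairwise Pearson correlations for every numeric column, as in the matrix view.
corr = df.corr(method="pearson")  # method="spearman" is also supported
print(corr.round(2))
```

Here `hours_worked` and `salary` correlate strongly, while the arbitrary `employee_id` column correlates weakly with both and could be dropped before building a model.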

To get started with correlation matrices in SageMaker Canvas, see the following section.

## Create a correlation matrix
<a name="canvas-explore-data-analytics-correlation-matrix"></a>

You can create a correlation matrix when you are preparing to build a model in the **Build** tab of the SageMaker Canvas application.

For instructions on how to begin creating a model, see [Build a model](canvas-build-model-how-to.md).

After you’ve started preparing a model in the SageMaker Canvas application, do the following:

1. In the **Build** tab, choose **Data visualizer**.

1. Choose **Analytics**.

1. Choose **Correlation matrix**.

You should see a visualization similar to the following screenshot, which shows up to 15 columns of the dataset organized into a correlation matrix.

![\[Screenshot of a correlation matrix in the Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-correlation-matrix-2.png)


After you’ve created the correlation matrix, you can customize it by doing the following:

### 1. Choose your columns
<a name="canvas-explore-data-analytics-correlation-matrix-columns"></a>

For **Columns**, you can select the columns that you want to include in the matrix. You can compare up to 15 columns from your dataset.

**Note**  
You can use numeric, categorical, or binary column types for a correlation matrix. The correlation matrix doesn’t support datetime or text data column types.

To add or remove columns from the correlation matrix, select and deselect columns from the **Columns** panel. You can also drag and drop columns from the panel directly onto the matrix. If your dataset has a lot of columns, you can search for the columns you want in the **Search columns** bar.

To filter the columns by data type, choose the dropdown list and select **All**, **Numeric**, or **Categorical**. Selecting **All** shows you all of the columns from your dataset, whereas the **Numeric** and **Categorical** filters only show you the numeric or categorical columns in your dataset. Note that binary column types are included in the numeric or categorical filters.

For the best data insights, include your target column in the correlation matrix. When you include your target column in the correlation matrix, it appears as the last feature on the matrix with a target symbol.

### 2. Choose your correlation type
<a name="canvas-explore-data-analytics-correlation-matrix-cor-type"></a>

SageMaker Canvas supports different *correlation types*, or methods for calculating the correlation between your columns.

To change the correlation type, use the **Columns** filter mentioned in the preceding section to filter for your desired column type and columns. You should see the **Correlation type** in the side panel. For numeric comparisons, you have the option to select either **Pearson** or **Spearman**. For categorical comparisons, the correlation type is set to **MI**. For mixed numeric and categorical comparisons, the correlation type is set to **Spearman & MI**.

For matrices that only compare numeric columns, the correlation type is either Pearson or Spearman. The Pearson measure evaluates the linear relationship between two continuous variables. The Spearman measure evaluates the monotonic relationship between two variables. For both Pearson and Spearman, the scale of correlation ranges from -1 to 1, with either end of the scale indicating a perfect correlation (a direct 1:1 relationship) and 0 indicating no correlation. You might want to select Pearson if your data has more linear relationships (as revealed by a [scatter plot visualization](#canvas-explore-data-scatterplot)). If your data is not linear, or contains a mixture of linear and monotonic relationships, then you might want to select Spearman.
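
The difference between the two measures is easy to see on a monotonic but non-linear series. This sketch assumes SciPy is available; the data is hypothetical.

```python
from scipy.stats import pearsonr, spearmanr

# y grows monotonically with x, but the relationship is cubic, not linear.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

r_pearson, _ = pearsonr(x, y)
r_spearman, _ = spearmanr(x, y)

# Spearman sees a perfect monotonic relationship (1.0); Pearson is high but
# below 1 because the relationship is not a straight line.
print(round(r_pearson, 3), round(r_spearman, 3))
```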

For matrices that only compare categorical columns, the correlation type is set to Mutual Information Classification (MI). The MI value is a measure of the mutual dependence between two random variables. The MI measure is on a scale of 0 to 1, with 0 indicating no correlation and 1 indicating a perfect correlation.
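
A mutual information score normalized to the same 0 to 1 scale can be computed with scikit-learn, which may help you interpret the MI values in the matrix. The two hypothetical categorical columns below show the two extremes; this assumes scikit-learn is available.

```python
from sklearn.metrics import normalized_mutual_info_score

# `floor` is fully determined by `dept`; `shift` is independent of it.
dept  = ["sales", "sales", "hr", "hr", "it", "it"]
floor = ["2", "2", "1", "1", "3", "3"]
shift = ["day", "night", "day", "night", "day", "night"]

print(normalized_mutual_info_score(dept, floor))  # near 1: perfect dependence
print(normalized_mutual_info_score(dept, shift))  # near 0: no dependence
```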

For matrices that compare a mix of numeric and categorical columns, the correlation type **Spearman & MI** is a combination of the Spearman and MI correlation types. For correlations between two numeric columns, the matrix shows the Spearman value. For correlations between a numeric and categorical column or two categorical columns, the matrix shows the MI value.

Lastly, remember that correlation does not necessarily indicate causation. A strong correlation value only indicates that there is a relationship between two variables, but the variables might not have a causal relationship. Carefully review your columns of interest to avoid bias when building your model.

### 3. Filter your correlations
<a name="canvas-explore-data-analytics-correlation-matrix-filter"></a>

In the side panel, you can use the **Filter correlations** feature to filter for the range of correlation values that you want to include in the matrix. For example, if you want to filter for features that only have positive or neutral correlation, you can set the **Min** to 0 and the **Max** to 1 (valid values are -1 to 1).

For Spearman and Pearson comparisons, you can set the **Filter correlations** range anywhere from -1 to 1, with 0 meaning that there is no correlation. -1 and 1 mean that the variables have a strong negative or positive correlation, respectively.

For MI comparisons, the correlation range only goes from 0 to 1, with 0 meaning that there is no correlation and 1 meaning that the variables have a strong correlation, either positive or negative.

Each feature has a perfect correlation (1) with itself. Therefore, you might notice that the top row of the correlation matrix is always 1. If you want to exclude these values, you can use the filter to set the **Max** to a value less than 1.

Keep in mind that if your matrix compares a mix of numeric and categorical columns and uses the **Spearman & MI** correlation type, then the *categorical x numeric* and *categorical x categorical* correlations (which use the MI measure) are on a scale of 0 to 1, whereas the *numeric x numeric* correlations (which use the Spearman measure) are on a scale of -1 to 1. Review your correlations of interest carefully to ensure that you know the correlation type being used to calculate each value.

### 4. Choose the visualization method
<a name="canvas-explore-data-analytics-correlation-matrix-viz-method"></a>

In the side panel, you can use **Visualize by** to change the visualization method of the matrix. Choose the **Numeric** visualization method to show the correlation (Pearson, Spearman, or MI) value, or choose the **Size** visualization method to visualize the correlation with differently sized and colored dots. If you choose **Size**, you can hover over a specific dot on the matrix to see the actual correlation value.

### 5. Choose a color palette
<a name="canvas-explore-data-analytics-correlation-matrix-color"></a>

In the side panel, you can use **Color selection** to change the color palette used for the scale of negative to positive correlation in the matrix. Select one of the alternative color palettes to change the colors used in the matrix.

# Prepare data for model building
<a name="canvas-prepare-data"></a>

**Note**  
You can now do advanced data preparation in SageMaker Canvas with Data Wrangler, which provides you with a natural language interface and over 300 built-in transformations. For more information, see [Data preparation](canvas-data-prep.md).

Your machine learning dataset might require data preparation before you build your model. You might want to clean your data due to various issues, which might include missing values or outliers, and perform feature engineering to improve the accuracy of your model. Amazon SageMaker Canvas provides ML data transforms with which you can clean, transform, and prepare your data for model building. You can use these transforms on your datasets without any code. SageMaker Canvas adds the transforms you use to the **Model recipe**, which is a record of the data preparation done on your data before building the model. Any data transforms you use only modify the input data for model building and do not modify your original data source.

The preview of your dataset shows the first 100 rows of the dataset. If your dataset has more than 20,000 rows, Canvas takes a random sample of 20,000 rows and previews the first 100 rows from that sample. You can only search for and specify values from the previewed rows, and the filter functionality only filters the previewed rows and not the entire dataset.
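
The preview logic described above can be mimicked with pandas to see why searches and filters only cover part of a large dataset. The 50,000-row frame is hypothetical, and this mirrors the described behavior rather than the Canvas internals.

```python
import pandas as pd

# A hypothetical dataset larger than the 20,000-row sampling threshold.
df = pd.DataFrame({"value": range(50_000)})

# Canvas-style preview: random sample of up to 20,000 rows, then the first 100.
sample = df.sample(n=min(20_000, len(df)), random_state=42)
preview = sample.head(100)
print(len(sample), len(preview))  # → 20000 100
```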

The following transforms are available in SageMaker Canvas for you to prepare your data for building.

**Note**  
You can only use advanced transformations for models built on tabular datasets. Multi-category text prediction models are also excluded.

## Drop columns
<a name="canvas-prepare-data-drop"></a>

You can exclude a column from your model build by dropping it in the **Build** tab of the SageMaker Canvas application. Deselect the column you want to drop, and it isn't included when building the model.

**Note**  
If you drop columns and then make [batch predictions](canvas-make-predictions.md) with your model, SageMaker Canvas adds the dropped columns back to the output dataset available for you to download. However, SageMaker Canvas does not add the dropped columns back for time series models.

## Filter rows
<a name="canvas-prepare-data-filter"></a>

The filter functionality filters the previewed rows (the first 100 rows of your dataset) according to conditions that you specify. Filtering rows creates a temporary preview of the data and does not impact the model building. You can filter to preview rows that have missing values, contain outliers, or meet custom conditions in a column you choose.

### Filter rows by missing values
<a name="canvas-prepare-data-filter-missing"></a>

Missing values are a common occurrence in machine learning datasets. If you have rows with null or empty values in certain columns, you might want to filter for and preview those rows.

To filter missing values from your previewed data, do the following.

1. In the **Build** tab of the SageMaker Canvas application, choose **Filter by rows** (![\[Filter icon in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/filter-icon.png)).

1. Choose the **Column** you want to check for missing values.

1. For the **Operation**, choose **Is missing**.

SageMaker Canvas filters for rows that contain missing values in the **Column** you selected and provides a preview of the filtered rows.

![\[Screenshot of the filter by missing values operation in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-filter-missing.png)
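
In pandas terms, the **Is missing** operation corresponds to a null check on the chosen column. The columns below are hypothetical.

```python
import pandas as pd

# Rows with a null price, as the "Is missing" filter would preview them.
df = pd.DataFrame({"name": ["desk", "lamp", "chair"],
                   "price": [120.0, 80.0, None]})

missing_price = df[df["price"].isna()]
print(missing_price["name"].tolist())  # → ['chair']
```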


### Filter rows by outliers
<a name="canvas-prepare-data-filter-outliers"></a>

Outliers, or rare values in the distribution and range of your data, can negatively impact model accuracy and lead to longer building times. SageMaker Canvas enables you to detect and filter rows that contain outliers in numeric columns. You can choose to define outliers with either standard deviations or a custom range.

To filter for outliers in your data, do the following.

1. In the **Build** tab of the SageMaker Canvas application, choose **Filter by rows** (![\[Filter icon in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/filter-icon.png)).

1. Choose the **Column** you want to check for outliers.

1. For the **Operation**, choose **Is outlier**.

1. Set the **Outlier range** to either **Standard deviation** or **Custom range**.

1. If you choose **Standard deviation**, specify a **SD** (standard deviation) value from 1–3. If you choose **Custom range**, select either **Percentile** or **Number**, and then specify the **Min** and **Max** values.

The **Standard deviation** option detects and filters for outliers in numeric columns using the mean and standard deviation. You specify the number of standard deviations a value must vary from the mean to be considered an outlier. For example, if you specify `3` for **SD**, a value must fall more than 3 standard deviations from the mean to be considered an outlier.

The **Custom range** option detects and filters for outliers in numeric columns using minimum and maximum values. Use this method if you know your threshold values that delimit outliers. You can set the **Type** of the range to either **Percentile** or **Number**. If you choose **Percentile**, the **Min** and **Max** values should be the minimum and maximum of the percentile range (0-100) that you want to allow. If you choose **Number**, the **Min** and **Max** values should be the minimum and maximum numeric values that you want to filter in the data.

![\[Screenshot of the filter by outliers operation in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-filter-outlier.png)
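
Both outlier definitions translate directly into column arithmetic. The sketch below uses pandas on a hypothetical `prices` column; the thresholds (2 standard deviations, and a custom 5 to 50 range) are illustrative choices.

```python
import pandas as pd

prices = pd.Series([10, 11, 9, 10, 12, 11, 10, 95])  # 95 is the obvious outlier

# Standard deviation method: flag values more than 2 standard deviations
# from the mean (Canvas lets you choose an SD value from 1 to 3).
mean, sd = prices.mean(), prices.std()
outliers_sd = prices[(prices - mean).abs() > 2 * sd]

# Custom range method: flag values outside an explicit [min, max] range.
outliers_range = prices[(prices < 5) | (prices > 50)]

print(outliers_sd.tolist(), outliers_range.tolist())  # → [95] [95]
```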


### Filter rows by custom values
<a name="canvas-prepare-data-filter-custom"></a>

You can filter for rows with values that meet custom conditions. For example, you might want to preview rows that have a price value greater than 100 before removing them. With this functionality, you can filter rows that exceed the threshold you set and preview the filtered data.

To use the custom filter functionality, do the following.

1. In the **Build** tab of the SageMaker Canvas application, choose **Filter by rows** (![\[Filter icon in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/filter-icon.png)).

1. Choose the **Column** you want to check.

1. Select the type of **Operation** you want to use, and then specify the values for the selected condition.

For the **Operation**, you can choose one of the following options. Note that the available operations depend on the data type of the column you choose. For example, you can't create an `Is greater than` operation for a column containing text values.


| Operation | Supported data type | Supported feature type | Function | 
| --- | --- | --- | --- | 
|  Is equal to  |  Numeric, Text  | Binary, Categorical |  Filters rows where the value in **Column** equals the values you specify.  | 
|  Is not equal to  |  Numeric, Text  | Binary, Categorical |  Filters rows where the value in **Column** doesn't equal the values you specify.  | 
|  Is less than  |  Numeric  | N/A |  Filters rows where the value in **Column** is less than the value you specify.  | 
|  Is less than or equal to  |  Numeric  | N/A |  Filters rows where the value in **Column** is less than or equal to the value you specify.  | 
|  Is greater than  |  Numeric  | N/A |  Filters rows where the value in **Column** is greater than the value you specify.  | 
|  Is greater than or equal to  |  Numeric  | N/A |  Filters rows where the value in **Column** is greater than or equal to the value you specify.  | 
|  Is between  |  Numeric  | N/A |  Filters rows where the value in **Column** is between, or equal to, the two values you specify.  | 
|  Contains  |  Text  | Categorical |  Filters rows where the value in **Column** contains a value you specify.  | 
|  Starts with  |  Text  | Categorical |  Filters rows where the value in **Column** begins with a value you specify.  | 
|  Ends with  |  Text  | Categorical |  Filters rows where the value in **Column** ends with a value you specify.  | 

After you set the filter operation, SageMaker Canvas updates the preview of the dataset to show you the filtered data.

![\[Screenshot of the filter by custom values operation in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-filter-custom.png)
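
As a point of reference, several of the operations in the table map to one-line pandas filters. The dataset below is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"price": [50, 120, 80, 200],
                   "name": ["mat", "desk", "mouse", "monitor"]})

greater = df[df["price"] > 100]                # Is greater than
between = df[df["price"].between(60, 150)]     # Is between (inclusive)
starts  = df[df["name"].str.startswith("m")]   # Starts with

print(len(greater), len(between), len(starts))  # → 2 2 3
```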


## Functions and operators
<a name="canvas-prepare-data-custom-formula"></a>

You can use mathematical functions and operators to explore and transform your data. You can use the functions that SageMaker Canvas supports, or create your own formula from your existing data and save the result of the formula to a new column. For example, you can add the corresponding values of two columns and save the result to a new column.

You can nest statements to create more complex functions. The following are some examples of nested functions that you might use.
+ To calculate BMI, you could use the function `weight / (height ^ 2)`.
+ To classify ages, you could use the function `Case(age < 18, 'child', age < 65, 'adult', 'senior')`.
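
For comparison, the two nested formulas above can be written against a hypothetical pandas DataFrame (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"weight": [70.0, 90.0], "height": [1.75, 1.80], "age": [15, 40]})

# weight / (height ^ 2)
df["bmi"] = df["weight"] / (df["height"] ** 2)

# Case(age < 18, 'child', age < 65, 'adult', 'senior'): the first condition
# that holds determines the result, with 'senior' as the fallback.
def age_group(age):
    if age < 18:
        return "child"
    if age < 65:
        return "adult"
    return "senior"

df["age_group"] = df["age"].apply(age_group)
print(df[["bmi", "age_group"]])
```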

You can specify functions in the data preparation stage before you build your model. To use a function, do the following.
1. In the **Build** tab of the SageMaker Canvas application, choose **View all** and then choose **Custom formula** to open the **Custom formula** panel.

1. In the **Custom formula** panel, you can choose a **Formula** to add to your **Model Recipe**. Each formula is applied to all of the values in the columns you specify. For formulas that accept two or more columns as arguments, use columns with matching data types; otherwise, you get an error or `null` values in the new column.

1. After you’ve specified a **Formula**, add a column name in the **New Column Name** field. SageMaker Canvas uses this name for the new column that is created.

1. (Optional) Choose **Preview** to preview your transform.

1. To add the function to your **Model Recipe**, choose **Add**.

SageMaker Canvas saves the result of your function to a new column using the name you specified in **New Column Name**. You can view or remove functions from the **Model Recipe** panel.

SageMaker Canvas supports the following operators for functions. You can use either the text format or the in-line format to specify your function.


| Operator | Description | Supported data types | Text format | In-line format | 
| --- | --- | --- | --- | --- | 
|  Add  |  Returns the sum of the values  |  Numeric  | Add(sales1, sales2) | sales1 + sales2 | 
|  Subtract  |  Returns the difference between the values  |  Numeric  | Subtract(sales1, sales2) | sales1 - sales2 | 
|  Multiply  |  Returns the product of the values  |  Numeric  | Multiply(sales1, sales2) | sales1 \* sales2 | 
|  Divide  |  Returns the quotient of the values  |  Numeric  | Divide(sales1, sales2) | sales1 / sales2 | 
|  Mod  |  Returns the result of the modulo operator (the remainder after dividing the two values)  |  Numeric  | Mod(sales1, sales2) | sales1 % sales2 | 
|  Abs  | Returns the absolute value of the value |  Numeric  | Abs(sales1) | N/A | 
|  Negate  | Returns the negative of the value |  Numeric  | Negate(c1) | -c1 | 
|  Exp  |  Returns e (Euler's number) raised to the power of the value  |  Numeric  | Exp(sales1) | N/A | 
|  Log  |  Returns the logarithm (base 10) of the value  |  Numeric  | Log(sales1) | N/A | 
|  Ln  |  Returns the natural logarithm (base e) of the value  |  Numeric  | Ln(sales1) | N/A | 
|  Pow  |  Returns the value raised to a power  |  Numeric  | Pow(sales1, 2) | sales1 ^ 2 | 
|  If  |  Returns a true or false label based on a condition you specify  |  Boolean, Numeric, Text  | If(sales1 > 7000, 'truelabel', 'falselabel') | N/A | 
|  Or  |  Returns a Boolean value of whether one of the specified values or conditions is true or not  |  Boolean  | Or(fullprice, discount) | fullprice \|\| discount | 
|  And  |  Returns a Boolean value of whether two of the specified values or conditions are true or not  |  Boolean  | And(sales1, sales2) | sales1 && sales2 | 
|  Not  |  Returns a Boolean value that is the opposite of the specified value or conditions  |  Boolean  | Not(sales1) | !sales1 | 
|  Case  |  Returns a value based on conditional statements (returns c1 if cond1 is true, returns c2 if cond2 is true, else returns c3)  |  Boolean, Numeric, Text  | Case(cond1, c1, cond2, c2, c3) | N/A | 
|  Equal  |  Returns a Boolean value of whether two values are equal  |  Boolean, Numeric, Text  | N/A | c1 = c2 or c1 == c2 | 
|  Not equal  |  Returns a Boolean value of whether two values are not equal  |  Boolean, Numeric, Text  | N/A | c1 != c2 | 
|  Less than  |  Returns a Boolean value of whether c1 is less than c2  |  Boolean, Numeric, Text  | N/A | c1 < c2 | 
|  Greater than  |  Returns a Boolean value of whether c1 is greater than c2  |  Boolean, Numeric, Text  | N/A | c1 > c2 | 
|  Less than or equal  |  Returns a Boolean value of whether c1 is less than or equal to c2  |  Boolean, Numeric, Text  | N/A | c1 <= c2 | 
|  Greater than or equal  |  Returns a Boolean value of whether c1 is greater than or equal to c2  |  Boolean, Numeric, Text  | N/A | c1 >= c2 | 

SageMaker Canvas also supports aggregate operators, which can perform operations such as calculating the sum of all the values or finding the minimum value in a column. You can use aggregate operators in combination with standard operators in your functions. For example, to calculate the difference of values from the mean, you could use the function `Abs(height - avg(height))`. SageMaker Canvas supports the following aggregate operators.


| Aggregate operator | Description | Format | Example | 
| --- | --- | --- | --- | 
|  sum  |  Returns the sum of all the values in a column  | sum | sum(c1) | 
|  minimum  |  Returns the minimum value of a column  | min | min(c2) | 
|  maximum  |  Returns the maximum value of a column  | max | max(c3) | 
|  average  |  Returns the average value of a column  | avg | avg(c4) | 
|  std  | Returns the sample standard deviation of a column | std | std(c1) | 
|  stddev  | Returns the standard deviation of the values in a column | stddev | stddev(c1) | 
|  variance  | Returns the unbiased variance of the values in a column | variance | variance(c1) | 
|  approx\_count\_distinct  | Returns the approximate number of distinct items in a column | approx\_count\_distinct | approx\_count\_distinct(c1) | 
|  count  | Returns the number of items in a column | count | count(c1) | 
|  first  |  Returns the first value of a column  | first | first(c1) | 
|  last  |  Returns the last value of a column  | last | last(c1) | 
|  stddev\_pop  | Returns the population standard deviation of a column | stddev\_pop | stddev\_pop(c1) | 
|  variance\_pop  |  Returns the population variance of the values in a column  | variance\_pop | variance\_pop(c1) | 
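
The combined aggregate-and-standard-operator example above, `Abs(height - avg(height))`, has a direct pandas equivalent, shown here on a hypothetical `height` column:

```python
import pandas as pd

height = pd.Series([150.0, 160.0, 170.0, 180.0])

# Abs(height - avg(height)): distance of each value from the column mean (165.0).
deviation = (height - height.mean()).abs()
print(deviation.tolist())  # → [15.0, 5.0, 5.0, 15.0]
```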

## Manage rows
<a name="canvas-prepare-data-manage"></a>

With the Manage rows transform, you can sort, randomly shuffle, and remove rows of data from the dataset.

### Sort rows
<a name="canvas-prepare-data-manage-sort"></a>

To sort the rows in a dataset by a given column, do the following.

1. In the **Build** tab of the SageMaker Canvas application, choose **Manage rows** and then choose **Sort rows**.

1. For **Sort Column**, choose the column you want to sort by.

1. For **Sort Order**, choose either **Ascending** or **Descending**.

1. Choose **Add** to add the transform to the **Model recipe**.

### Shuffle rows
<a name="canvas-prepare-data-manage-shuffle"></a>

To randomly shuffle the rows in a dataset, do the following.

1. In the **Build** tab of the SageMaker Canvas application, choose **Manage rows** and then choose **Shuffle rows**.

1. Choose **Add** to add the transform to the **Model recipe**.

### Drop duplicate rows
<a name="canvas-prepare-data-manage-drop-duplicate"></a>

To remove duplicate rows in a dataset, do the following.

1. In the **Build** tab of the SageMaker Canvas application, choose **Manage rows** and then choose **Drop duplicate rows**.

1. Choose **Add** to add the transform to the **Model recipe**.
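The sort, shuffle, and drop-duplicates transforms above map onto familiar dataframe operations. The following pandas sketch is an outside-Canvas analogy with hypothetical columns and values.

```python
import pandas as pd

# Hypothetical dataset; the (30, "b") row appears twice.
df = pd.DataFrame({"price": [10, 30, 20, 30], "item": ["a", "b", "c", "b"]})

sorted_df = df.sort_values("price", ascending=True)   # Sort rows
shuffled_df = df.sample(frac=1, random_state=0)       # Shuffle rows (seeded here)
deduped_df = df.drop_duplicates()                     # Drop duplicate rows

print(len(deduped_df))  # the duplicate row is removed → 3
```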

### Remove rows by missing values
<a name="canvas-prepare-data-remove-missing"></a>

Missing values are a common occurrence in machine learning datasets and can impact model accuracy. Use this transform if you want to drop rows with null or empty values in certain columns.

To remove rows that contain missing values in a specified column, do the following.

1. In the **Build** tab of the SageMaker Canvas application, choose **Manage rows**.

1. Choose **Drop rows by missing values**.

1. Choose **Add** to add the transform to the **Model recipe**.

SageMaker Canvas drops rows that contain missing values in the **Column** you selected. After removing the rows from the dataset, SageMaker Canvas adds the transform in the **Model recipe** section. If you remove the transform from the **Model recipe** section, the rows return to your dataset.

![\[Screenshot of the remove rows by missing values operation in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-remove-missing.png)
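The effect of this transform can be sketched with pandas. This is an outside-Canvas analogy; the `age` and `city` columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset; "age" is the column checked for missing values.
df = pd.DataFrame({"age": [25, np.nan, 40], "city": ["NYC", "LA", None]})

# Keep only rows with a non-missing value in the selected column.
cleaned = df.dropna(subset=["age"])

print(len(cleaned))  # 2
```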


### Remove rows by outliers
<a name="canvas-prepare-data-remove-outliers"></a>

Outliers, or rare values in the distribution and range of your data, can negatively impact model accuracy and lead to longer building times. With SageMaker Canvas, you can detect and remove rows that contain outliers in numeric columns. You can choose to define outliers with either standard deviations or a custom range.

To remove outliers from your data, do the following.

1. In the **Build** tab of the SageMaker Canvas application, choose **Manage rows**.

1. Choose **Drop rows by outlier values**.

1. Choose the **Column** you want to check for outliers.

1. Set the **Operator** to **Standard deviation**, **Custom numeric range**, or **Custom quantile range**.

1. If you choose **Standard deviation**, specify a **Standard deviations** value from 1–3. If you choose **Custom numeric range** or **Custom quantile range**, specify the **Min** and **Max** values (numbers for numeric ranges, or percentiles between 0–100% for quantile ranges).

1. Choose **Add** to add the transform to the **Model recipe**.

The **Standard deviation** option detects and removes outliers in numeric columns using the mean and standard deviation. You specify the number of standard deviations a value must vary from the mean to be considered an outlier. For example, if you specify `3` for **Standard deviations**, a value must fall more than 3 standard deviations from the mean to be considered an outlier.

The **Custom numeric range** and **Custom quantile range** options detect and remove outliers in numeric columns using minimum and maximum values. Use this method if you know your threshold values that delimit outliers. If you choose a numeric range, the **Min** and **Max** values should be the minimum and maximum numeric values that you want to allow in the data. If you choose a quantile range, the **Min** and **Max** values should be the minimum and maximum of the percentile range (0–100) that you want to allow.

After removing the rows from the dataset, SageMaker Canvas adds the transform in the **Model recipe** section. If you remove the transform from the **Model recipe** section, the rows return to your dataset.

![\[Screenshot of the remove rows by outliers operation in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-remove-outlier.png)
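Both ways of defining outliers can be sketched with pandas. This is an outside-Canvas analogy under assumed values; the column and the choice of 2 standard deviations are hypothetical.

```python
import pandas as pd

# Hypothetical numeric column; 100 is the outlier.
s = pd.Series([10, 11, 9, 10, 12, 10, 11, 9, 10, 100])

# Standard-deviation rule: keep values within k standard deviations of the mean.
k = 2
mean, std = s.mean(), s.std()
within_std = s[(s - mean).abs() <= k * std]

# Custom quantile range: keep values between the 5th and 95th percentiles.
lo, hi = s.quantile(0.05), s.quantile(0.95)
within_quantiles = s[(s >= lo) & (s <= hi)]

print(len(within_std), len(within_quantiles))  # the outlier is dropped: 9 9
```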


### Remove rows by custom values
<a name="canvas-prepare-data-remove-custom"></a>

You can remove rows with values that meet custom conditions. For example, you might want to exclude all of the rows with a price value greater than 100 when building your model. With this transform, you can create a rule that removes all rows that exceed the threshold you set.

To use the custom remove transform, do the following.

1. In the **Build** tab of the SageMaker Canvas application, choose **Manage rows**.

1. Choose **Drop rows by formula**.

1. Choose the **Column** you want to check.

1. Select the type of **Operation** you want to use, and then specify the values for the selected condition.

1. Choose **Add** to add the transform to the **Model recipe**.

For the **Operation**, you can choose one of the following options. Note that the available operations depend on the data type of the column you choose. For example, you can't create an `Is greater than` operation for a column containing text values.


| Operation | Supported data type | Supported feature type | Function | 
| --- | --- | --- | --- | 
|  Is equal to  |  Numeric, Text  |  Binary, Categorical  |  Removes rows where the value in **Column** equals the values you specify.  | 
|  Is not equal to  |  Numeric, Text  |  Binary, Categorical  |  Removes rows where the value in **Column** doesn't equal the values you specify.  | 
|  Is less than  |  Numeric  | N/A |  Removes rows where the value in **Column** is less than the value you specify.  | 
|  Is less than or equal to  |  Numeric  | N/A |  Removes rows where the value in **Column** is less than or equal to the value you specify.  | 
|  Is greater than  |  Numeric  | N/A |  Removes rows where the value in **Column** is greater than the value you specify.  | 
|  Is greater than or equal to  | Numeric | N/A |  Removes rows where the value in **Column** is greater than or equal to the value you specify.  | 
|  Is between  | Numeric | N/A |  Removes rows where the value in **Column** is between or equal to two values you specify.  | 
|  Contains  |  Text  | Categorical |  Removes rows where the value in **Column** contains a value you specify.  | 
|  Starts with  |  Text  | Categorical |  Removes rows where the value in **Column** begins with a value you specify.  | 
|  Ends with  |  Text  | Categorical |  Removes rows where the value in **Column** ends with a value you specify.  | 

After removing the rows from the dataset, SageMaker Canvas adds the transform in the **Model recipe** section. If you remove the transform from the **Model recipe** section, the rows return to your dataset.

![\[Screenshot of the remove rows by custom values operation in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-remove-custom.png)
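The operations in the table correspond to boolean row filters. The following pandas sketch is an outside-Canvas analogy; the `price` and `name` columns and their thresholds are hypothetical.

```python
import pandas as pd

# Hypothetical dataset with a numeric and a text column.
df = pd.DataFrame({"price": [50, 120, 80], "name": ["chair", "desk", "lamp"]})

# "Is greater than": remove rows where price > 100.
df = df[~(df["price"] > 100)]

# "Starts with": remove rows where name begins with "la".
df = df[~df["name"].str.startswith("la")]

print(df["name"].tolist())  # ['chair']
```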


## Rename columns
<a name="canvas-prepare-data-rename"></a>

With the rename columns transform, you can rename columns in your data. When you rename a column, SageMaker Canvas changes the column name in the model input.

You can rename a column in your dataset by double-clicking on the column name in the **Build** tab of the SageMaker Canvas application and entering a new name. Pressing the **Enter** key submits the change, and clicking anywhere outside the input cancels the change. You can also rename a column by clicking the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), located at the end of the row in list view or at the end of the header cell in grid view, and choosing **Rename**.

Your column name can’t be longer than 32 characters or have double underscores (\_\_), and you can’t rename a column to the same name as another column. You also can’t rename a dropped column.

The following screenshot shows how to rename a column by double-clicking the column name.

![\[Screenshot of renaming a column with the double-click method in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-rename-column.png)


When you rename a column, SageMaker Canvas adds the transform in the **Model recipe** section. If you remove the transform from the **Model recipe** section, the column reverts to its original name.

## Manage columns
<a name="canvas-prepare-data-manage-cols"></a>

With the following transforms, you can change the data type of columns and replace missing values or outliers for specific columns. SageMaker Canvas uses the updated data types or values when building your model but doesn’t change your original dataset. Note that if you've dropped a column from your dataset using the [Drop columns](#canvas-prepare-data-drop) transform, you can't replace values in that column.

### Replace missing values
<a name="canvas-prepare-data-replace-missing"></a>

Missing values are a common occurrence in machine learning datasets and can impact model accuracy. You can choose to drop rows that have missing values, but your model is more accurate if you replace the missing values instead. With this transform, you can replace missing values in numeric columns with the mean or median of the data in the column, or with a custom value that you specify. For non-numeric columns, you can replace missing values with the mode (most common value) of the column or a custom value.

Use this transform if you want to replace the null or empty values in certain columns. To replace missing values in a specified column, do the following. 

1. In the **Build** tab of the SageMaker Canvas application, choose **Manage columns**.

1. Choose **Replace missing values**.

1. Choose the **Column** in which you want to replace missing values.

1. Set **Mode** to **Manual** to replace missing values with values that you specify. With the **Automatic (default)** setting, SageMaker Canvas replaces missing values with imputed values that best fit your data. This imputation method is done automatically for each model build, unless you specify the **Manual** mode.

1. Set the **Replace with** value:
   + If your column is numeric, then select **Mean**, **Median**, or **Custom**. **Mean** replaces missing values with the mean for the column, and **Median** replaces missing values with the median for the column. If you choose **Custom**, then you must specify a custom value that you want to use to replace missing values.
   + If your column is non-numeric, then select **Mode** or **Custom**. **Mode** replaces missing values with the mode, or the most common value, for the column. If you choose **Custom**, specify a custom value that you want to use to replace missing values.

1. Choose **Add** to add the transform to the **Model recipe**.

After replacing the missing values in the dataset, SageMaker Canvas adds the transform in the **Model recipe** section. If you remove the transform from the **Model recipe** section, the missing values return to the dataset.

![\[Screenshot of the replace missing values operation in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-replace-missing.png)
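The manual replacement options can be sketched with pandas. This is an outside-Canvas analogy; the `score` and `color` columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical columns with missing values.
df = pd.DataFrame({
    "score": [80.0, np.nan, 90.0, np.nan],   # numeric
    "color": ["red", None, "red", "blue"],   # non-numeric
})

df["score"] = df["score"].fillna(df["score"].mean())     # Mean (use .median() for Median)
df["color"] = df["color"].fillna(df["color"].mode()[0])  # Mode (most common value)

print(df["score"].tolist())  # [80.0, 85.0, 90.0, 85.0]
```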


### Replace outliers
<a name="canvas-prepare-data-replace-outliers"></a>

Outliers, or rare values in the distribution and range of your data, can negatively impact model accuracy and lead to longer building times. SageMaker Canvas enables you to detect outliers in numeric columns and replace the outliers with values that lie within an accepted range in your data. You can choose to define outliers with either standard deviations or a custom range, and you can replace outliers with the minimum and maximum values in the accepted range.

To replace outliers in your data, do the following.

1. In the **Build** tab of the SageMaker Canvas application, choose **Manage columns**.

1. Choose **Replace outlier values**.

1. Choose the **Column** in which you want to replace outliers.

1. For **Define outliers**, choose **Standard deviation**, **Custom numeric range**, or **Custom quantile range**.

1. If you choose **Standard deviation**, specify a **Standard deviations** value from 1–3. If you choose **Custom numeric range** or **Custom quantile range**, specify the **Min** and **Max** values (numbers for numeric ranges, or percentiles between 0–100% for quantile ranges).

1. For **Replace with**, select **Min/max range**.

1. Choose **Add** to add the transform to the **Model recipe**.

The **Standard deviation** option detects outliers in numeric columns using the mean and standard deviation. You specify the number of standard deviations a value must vary from the mean to be considered an outlier. For example, if you specify 3 for **Standard deviations**, a value must fall more than 3 standard deviations from the mean to be considered an outlier. SageMaker Canvas replaces outliers with the minimum value or maximum value in the accepted range. For example, if you configure the standard deviations to only include values from 200–300, then SageMaker Canvas changes a value of 198 to 200 (the minimum).

The **Custom numeric range** and **Custom quantile range** options detect outliers in numeric columns using minimum and maximum values. Use this method if you know your threshold values that delimit outliers. If you choose a numeric range, the **Min** and **Max** values should be the minimum and maximum numeric values that you want to allow. SageMaker Canvas replaces any values that fall outside of the minimum and maximum to the minimum and maximum values. For example, if your range only allows values from 1–100, then SageMaker Canvas changes a value of 102 to 100 (the maximum). If you choose a quantile range, the **Min** and **Max** values should be the minimum and maximum of the percentile range (0–100) that you want to allow.

After replacing the values in the dataset, SageMaker Canvas adds the transform in the **Model recipe** section. If you remove the transform from the **Model recipe** section, the original values return to the dataset.

![\[Screenshot of the replace outliers operation in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-replace-outlier.png)
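Replacing out-of-range values with the range bounds is a clipping operation. The following pandas sketch is an outside-Canvas analogy using the 1–100 range from the text; the data values are hypothetical.

```python
import pandas as pd

# Hypothetical values; 102 and -3 fall outside the accepted range.
s = pd.Series([102, 50, -3, 75])

# Custom numeric range 1–100: out-of-range values are replaced with the
# nearest range bound (102 becomes 100, as in the example above).
clipped = s.clip(lower=1, upper=100)

# Custom quantile range: clip to percentile values instead.
q_clipped = s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))

print(clipped.tolist())  # [100, 50, 1, 75]
```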


### Change data type
<a name="canvas-prepare-data-change-type"></a>

SageMaker Canvas provides you with the ability to change the *data type* of your columns between numeric, text, and datetime, while also displaying the associated *feature type* for that data type. A *data type* refers to the format of the data and how it is stored, while the *feature type* refers to the characteristic of the data used in machine learning algorithms, such as binary or categorical. This gives you the flexibility to manually change the type of data in your columns based on the features. The ability to choose the right data type ensures data integrity and accuracy prior to building models. These data types are used when building models.

**Note**  
Currently, changing the feature type (for example, from binary to categorical) is not supported.

The following table lists all of the supported data types in Canvas.


| Data type | Description | Example | 
| --- | --- | --- | 
| Numeric | Numeric data represents numerical values | 1, 2, 3; 1.1, 1.2, 1.3 | 
| Text | Text data represents sequences of characters, like names or descriptions | A, B, C, D; apple, banana, orange; 1A\_, 2A\_, 3A\_ | 
| Datetime | Datetime data represents dates and times in timestamp format | 2019-07-01 01:00:00, 2019-07-01 02:00:00, 2019-07-01 03:00:00 | 

The following table lists all of the supported feature types in Canvas.


| Feature type | Description | Example | 
| --- | --- | --- | 
| Binary | Binary features represent two possible values | 0, 1, 0, 1, 0 (2 distinct values); true, false, true (2 distinct values) | 
| Categorical | Categorical features represent distinct categories or groups | apple, banana, orange, apple (3 distinct values); A, B, C, D, E, A, D, C (5 distinct values) | 

To modify the data type of a column in a dataset, do the following.

1. In the **Build** tab of the SageMaker Canvas application, go to the **Column view** or **Grid view** and select the **Data type** dropdown for the specific column.

1. In the **Data type** dropdown, choose the data type to convert to. The following screenshot shows the dropdown menu.  
![\[The data type conversion dropdown menu for a column, shown in the Build tab.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-prepare-data-change.png)

1. For **Column**, choose or verify the column you want to change the data type for.

1. For **New data type**, choose or verify the new data type you want to convert to.

1. If the **New data type** is `Datetime` or `Numeric`, choose one of the following options under **Handle invalid values**:

   1. **Replace with empty value** – Invalid values are substituted with an empty value

   1. **Delete rows** – Rows with an invalid value are removed from the dataset

   1. **Replace with custom value** – Invalid values are substituted with the **Custom Value** that you specify.

1. Choose **Add** to add the transform to the **Model recipe**.

The data type for your column should now be updated.
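The invalid-value handling options above can be sketched with pandas conversions. This is an outside-Canvas analogy; the column values and the custom value `0` are hypothetical.

```python
import pandas as pd

raw = pd.Series(["12", "not a number", "7"])  # hypothetical text column

# Convert to numeric. errors="coerce" turns invalid values into NaN,
# mirroring "Replace with empty value".
num = pd.to_numeric(raw, errors="coerce")

num_dropped = num.dropna()   # "Delete rows"
num_custom = num.fillna(0)   # "Replace with custom value" (custom value 0)

# Datetime conversion handles invalid values the same way.
ts = pd.to_datetime(pd.Series(["2019-07-01 01:00:00", "bad"]), errors="coerce")

print(num_custom.tolist())  # [12.0, 0.0, 7.0]
```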

## Prepare time series data
<a name="canvas-prepare-data-timeseries"></a>

Use the following functionalities to prepare your time series data for building time series forecasting models.

### Resample time series data
<a name="canvas-prepare-data-resample"></a>

By resampling time series data, you can establish regular intervals for the observations in your time series dataset. This is particularly useful when working with time series data containing irregularly spaced observations. For instance, you can use resampling to transform a dataset with observations recorded at irregular one-hour, two-hour, and three-hour intervals into a dataset with a regular one-hour interval between observations. Forecasting algorithms require observations to be taken at regular intervals.

To resample time series data, do the following.

1. In the **Build** tab of the SageMaker Canvas application, choose **Time series**.

1. Choose **Resample**.

1. For **Timestamp column**, choose the column you want to apply the transform to. You can only select columns of the **Datetime** type.

1. In the **Frequency settings** section, choose a **Frequency** and **Rate**. **Frequency** is the unit of frequency and **Rate** is the interval of that unit to be applied to the column. For example, choosing `Calendar Day` for **Frequency** and `1` for **Rate** sets the interval to increase every 1 calendar day, such as `2023-03-26 00:00:00`, `2023-03-27 00:00:00`, `2023-03-28 00:00:00`. See the table after this procedure for a complete list of **Frequency** values. 

1. Choose **Add** to add the transform to the **Model recipe**.

The following table lists all of the **Frequency** types you can select when resampling time series data.


| Frequency | Description | Example values (assuming Rate is 1) | 
| --- | --- | --- | 
|  Business Day  |  Resample observations in the datetime column to 5 business days of the week (Monday, Tuesday, Wednesday, Thursday, Friday)  |  2023-03-24 00:00:00 2023-03-27 00:00:00 2023-03-28 00:00:00 2023-03-29 00:00:00 2023-03-30 00:00:00 2023-03-31 00:00:00 2023-04-03 00:00:00  | 
|  Calendar Day  |  Resample observations in the datetime column to all 7 days of the week (Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday)  |  2023-03-26 00:00:00 2023-03-27 00:00:00 2023-03-28 00:00:00 2023-03-29 00:00:00 2023-03-30 00:00:00 2023-03-31 00:00:00 2023-04-01 00:00:00  | 
|  Week  |  Resample observations in the datetime column to the first day of each week  |  2023-03-13 00:00:00 2023-03-20 00:00:00 2023-03-27 00:00:00 2023-04-03 00:00:00  | 
|  Month  |  Resample observations in the datetime column to the first day of each month  |  2023-03-01 00:00:00 2023-04-01 00:00:00 2023-05-01 00:00:00 2023-06-01 00:00:00  | 
|  Annual Quarter  |  Resample observations in the datetime column to the last day of each quarter  |  2023-03-31 00:00:00 2023-06-30 00:00:00 2023-09-30 00:00:00 2023-12-31 00:00:00  | 
|  Year  |  Resample observations in the datetime column to the last day of each year  |  2022-12-31 0:00:00 2023-12-31 00:00:00 2024-12-31 00:00:00  | 
|  Hour  |  Resample observations in the datetime column to each hour of each day  |  2023-03-24 00:00:00 2023-03-24 01:00:00 2023-03-24 02:00:00 2023-03-24 03:00:00  | 
|  Minute  |  Resample observations in the datetime column to each minute of each hour  |  2023-03-24 00:00:00 2023-03-24 00:01:00 2023-03-24 00:02:00 2023-03-24 00:03:00  | 
|  Second  |  Resample observations in the datetime column to each second of each minute  |  2023-03-24 00:00:00 2023-03-24 00:00:01 2023-03-24 00:00:02 2023-03-24 00:00:03  | 

When applying the resampling transform, you can use the **Advanced** option to specify how the values of the remaining columns (other than the timestamp column) in your dataset are computed. You do this by choosing a resampling methodology, either downsampling or upsampling, for both numeric and non-numeric columns.

*Downsampling* increases the interval between observations in the dataset. For example, if you downsample hourly observations to an interval of every two hours, each observation in your dataset then represents two hours. The values of the hourly observations in the other columns are aggregated into a single value using a combination method. The following tables show an example of downsampling time series data by using mean as the combination method. The data is downsampled from every hour to every two hours.

The following table shows the hourly temperature readings over a day before downsampling.


| Timestamp | Temperature (Celsius) | 
| --- | --- | 
| 12:00 am | 30 | 
| 1:00 am | 32 | 
| 2:00 am | 35 | 
| 3:00 am | 32 | 
| 4:00 am | 30 | 

The following table shows the temperature readings after downsampling to every two hours.


| Timestamp | Temperature (Celsius) | 
| --- | --- | 
| 12:00 am | 31 | 
| 2:00 am | 33.5 | 
| 4:00 am | 30 | 
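This downsampling can be sketched with pandas, using the same hypothetical hourly temperature readings and mean as the combination method (bin labels fall at the start of each two-hour window).

```python
import pandas as pd

# Hourly temperature readings, as in the example above.
idx = pd.date_range("2023-01-01 00:00", periods=5, freq="h")
temps = pd.Series([30, 32, 35, 32, 30], index=idx)

# Downsample to a two-hour interval, aggregating each bin with the mean.
two_hourly = temps.resample("2h").mean()

print(two_hourly.tolist())  # [31.0, 33.5, 30.0]
```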

To downsample time series data, do the following:

1. Expand the **Advanced** section under the **Resample** transform.

1. Choose **Non-numeric combination** to specify the combination method for non-numeric columns. See the table below for a complete list of combination methods.

1. Choose **Numeric combination** to specify the combination method for numeric columns. See the table below for a complete list of combination methods.

If you don’t specify combination methods, the default values are `Most Common` for **Non-numeric combination** and `Mean` for **Numeric combination**. The following table lists the methods for numeric and non-numeric combination.


| Downsampling methodology | Combination method | Description | 
| --- | --- | --- | 
| Non-numeric combination | Most Common | Aggregate values in the non-numeric column by the most commonly occurring value | 
| Non-numeric combination | Last | Aggregate values in the non-numeric column by the last value in the column | 
| Non-numeric combination | First | Aggregate values in the non-numeric column by the first value in the column | 
| Numeric combination | Mean | Aggregate values in the numeric column by taking the mean of all the values in the column | 
| Numeric combination | Median | Aggregate values in the numeric column by taking the median of all the values in the column | 
| Numeric combination | Min | Aggregate values in the numeric column by taking the minimum of all the values in the column | 
| Numeric combination | Max | Aggregate values in the numeric column by taking the maximum of all the values in the column | 
| Numeric combination | Sum | Aggregate values in the numeric column by adding all the values in the column | 
| Numeric combination | Quantile | Aggregate values in the numeric column by taking the quantile of all the values in the column | 

*Upsampling* reduces the interval between observations in the dataset. For example, if you upsample observations that are taken every two hours into hourly observations, the values of other columns of the hourly observations are interpolated from the ones that have been taken every two hours.

To upsample time series data, do the following:

1. Expand the **Advanced** section under the **Resample** transform.

1. Choose **Non-numeric estimation** to specify the estimation method for non-numeric columns. See the table after this procedure for a complete list of methods.

1. Choose **Numeric estimation** to specify the estimation method for numeric columns. See the table below for a complete list of methods.

1. (Optional) Choose **ID Column** to specify the column that has the IDs of the observations of the time series. Specify this option if your dataset contains more than one time series. If your dataset contains only one time series, don't specify a value for this field. For example, you can have a dataset that has the columns `id` and `purchase`. The `id` column has the following values: `[1, 2, 2, 1]`. The `purchase` column has the following values: `[$2, $3, $4, $1]`. Therefore, the dataset has two time series: `1: [$2, $1]` and `2: [$3, $4]`.

If you don’t specify estimation methods, the default values are `Forward Fill` for **Non-numeric estimation** and `Linear` for **Numeric estimation**. The following table lists the methods for estimation.


| Upsampling methodology | Estimation method | Description | 
| --- | --- | --- | 
| Non-numeric estimation | Forward Fill | Fill the new rows in the non-numeric column by carrying the last observed value forward | 
| Non-numeric estimation | Backward Fill | Fill the new rows in the non-numeric column by carrying the next observed value backward | 
| Non-numeric estimation | Keep Missing | Leave the new rows in the non-numeric column empty | 
| Numeric estimation | Linear, Time, Index, Zero, S-Linear, Nearest, Quadratic, Cubic, Barycentric, Polynomial, Krogh, Piecewise Polynomial, Spline, P-chip, Akima, Cubic Spline, From Derivatives | Interpolate values in the numeric column by using the specified interpolator. For information on interpolation methods, see [pandas.DataFrame.interpolate](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html) in the pandas documentation. | 
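Upsampling with linear estimation can be sketched with pandas. This is an outside-Canvas analogy; the two-hour observations are hypothetical.

```python
import pandas as pd

# Hypothetical observations taken every two hours.
idx = pd.date_range("2023-01-01 00:00", periods=3, freq="2h")
series = pd.Series([10.0, 14.0, 18.0], index=idx)

# Upsample to hourly rows; linear interpolation (the default Numeric
# estimation) fills the new values between observations.
hourly = series.resample("h").interpolate(method="linear")

print(hourly.tolist())  # [10.0, 12.0, 14.0, 16.0, 18.0]
```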

The following screenshot shows the **Advanced** settings with the fields for downsampling and upsampling filled out.

![\[The Canvas application, with the time series resampling side panel showing the advanced options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-prepare-data-resampling.png)


### Use datetime extraction
<a name="canvas-prepare-data-datetime"></a>

With the datetime extraction transform, you can extract values from a datetime column to a separate column. For example, if you have a column containing dates of purchases, you can extract the month value to a separate column and use the new column when building your model. You can also extract multiple values to separate columns with a single transform.

Your datetime column must use a supported timestamp format. For a list of the formats that SageMaker Canvas supports, see [Time Series Forecasts in Amazon SageMaker Canvas](canvas-time-series.md). If your dataset does not use one of the supported formats, update your dataset to use a supported timestamp format and re-import it to Amazon SageMaker Canvas before building your model.

To perform a datetime extraction, do the following.

1. In the **Build** tab of the SageMaker Canvas application, on the transforms bar, choose **View all**.

1. Choose **Extract features**.

1. Choose the **Timestamp column** from which you want to extract values.

1. For **Values**, select one or more values to extract from the column. The values you can extract from a timestamp column are **Year**, **Month**, **Day**, **Hour**, **Week of year**, **Day of year**, and **Quarter**.

1. (Optional) Choose **Preview** to preview the transform results.

1. Choose **Add** to add the transform to the **Model recipe**.

SageMaker Canvas creates a new column in the dataset for each of the values you extract. Except for **Year** values, SageMaker Canvas uses a 0-based encoding for the extracted values. For example, if you extract the **Month** value, January is extracted as 0, and February is extracted as 1.

![\[Screenshot of the datetime extraction box in the SageMaker Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-datetime-extract.png)


You can see the transform listed in the **Model recipe** section. If you remove the transform from the **Model recipe** section, the new columns are removed from the dataset.
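The extraction, including the 0-based encoding, can be sketched with pandas. This is an outside-Canvas analogy; the `purchased_at` column and its dates are hypothetical, and the subtraction is only needed because pandas months and quarters are 1-based.

```python
import pandas as pd

# Hypothetical purchase dates.
df = pd.DataFrame({"purchased_at": pd.to_datetime(
    ["2023-01-15", "2023-02-20", "2023-07-04"])})

# Extract values into new columns. Subtract 1 from month and quarter to
# mirror the 0-based encoding Canvas applies (Year stays as-is).
df["year"] = df["purchased_at"].dt.year
df["month"] = df["purchased_at"].dt.month - 1
df["quarter"] = df["purchased_at"].dt.quarter - 1

print(df["month"].tolist())  # January → 0, February → 1, July → 6
```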

# Model evaluation
<a name="canvas-evaluate-model"></a>

After you’ve built your model, you can evaluate how well your model performed on your data before using it to make predictions. You can use information, such as the model’s accuracy when predicting labels and advanced metrics, to determine whether your model can make sufficiently accurate predictions for your data.

The section [Evaluate your model's performance](canvas-scoring.md) describes how to view and interpret the information on your model's **Analyze** page. The section [Use advanced metrics in your analyses](canvas-advanced-metrics.md) contains more detailed information about the **Advanced metrics** used to quantify your model’s accuracy.

You can also view more advanced information for specific *model candidates*, which are all of the model iterations that Canvas runs through while building your model. Based on the advanced metrics for a given model candidate, you can select a different candidate to be the default, or the version that is used for making predictions and deploying. For each model candidate, you can view the **Advanced metrics** information to help you decide which model candidate you’d like to select as the default. You can view this information by selecting the model candidate from the **Model leaderboard**. For more information, see [View model candidates in the model leaderboard](canvas-evaluate-model-candidates.md).

Canvas also provides the option to download a Jupyter notebook so that you can view and run the code used to build your model. This is useful if you’d like to make adjustments to the code or learn more about how your model was built. For more information, see [Download a model notebook](canvas-notebook.md).

# Evaluate your model's performance
<a name="canvas-scoring"></a>

Amazon SageMaker Canvas provides overview and scoring information for the different types of models. Your model’s score can help you determine how accurate your model is when it makes predictions. The additional scoring insights can help you quantify the differences between the actual and predicted values.

To view the analysis of your model, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **My models**.

1. Choose the model that you built.

1. In the top navigation pane, choose the **Analyze** tab.

1. Within the **Analyze** tab, you can view the overview and scoring information for your model.

The following sections describe how to interpret the scoring for each model type.

## Evaluate categorical prediction models
<a name="canvas-scoring-categorical"></a>

The **Overview** tab shows you the column impact for each column. **Column impact** is a percentage score that indicates how much weight a column has in making predictions in relation to the other columns. For a column impact of 25%, Canvas weighs the prediction as 25% for the column and 75% for the other columns.

The following screenshot shows the **Accuracy** score for the model, along with the **Optimization metric**, which is the metric that you choose to optimize when building the model. In this case, the **Optimization metric** is **Accuracy**. You can specify a different optimization metric if you build a new version of your model.

![\[Screenshot of the accuracy score and optimization metric on the Analyze tab in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/analyze-tab-2-category.png)


The **Scoring** tab for a categorical prediction model gives you the ability to visualize all the predictions. Line segments extend from the left of the page, indicating all the predictions the model has made. In the middle of the page, the line segments converge on a perpendicular segment to indicate the proportion of each prediction to a single category. From the predicted category, the segments branch out to the actual category. You can get a visual sense of how accurate the predictions were by following each line segment from the predicted category to the actual category.

The following image gives you an example **Scoring** section for a **3+ category prediction** model.

![\[Screenshot of the Scoring tab for a 3+ category prediction model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-analyze/canvas-multiclass-classification.png)


You can also view the **Advanced metrics** tab for more detailed information about your model’s performance, such as the advanced metrics, error density plots, or confusion matrices. To learn more about the **Advanced metrics** tab, see [Use advanced metrics in your analyses](canvas-advanced-metrics.md).

## Evaluate numeric prediction models
<a name="canvas-scoring-numeric"></a>

The **Overview** tab shows you the column impact for each column. **Column impact** is a percentage score that indicates how much weight a column has in making predictions in relation to the other columns. For a column impact of 25%, Canvas weighs the prediction as 25% for the column and 75% for the other columns.

The following screenshot shows the **RMSE** score for the model on the **Overview** tab, which in this case is the **Optimization metric**. The **Optimization metric** is the metric that you choose to optimize when building the model. You can specify a different optimization metric if you build a new version of your model.

![\[Screenshot of the RMSE optimization metric on the Analyze tab in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/analyze-tab-2-numeric.png)


The **Scoring** tab for numeric prediction shows a line to indicate the model's predicted value in relation to the data used to make predictions. The predicted values are often within +/- the RMSE (root mean squared error) value of the actual values. The width of the purple band around the line indicates the RMSE range, and the predicted values often fall within that range.

The following image shows the **Scoring** section for numeric prediction.

![\[Screenshot of the Scoring tab for a numeric prediction model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-analyze/canvas-analyze-regression-scoring.png)


You can also view the **Advanced metrics** tab for more detailed information about your model’s performance, such as the advanced metrics, error density plots, or confusion matrices. To learn more about the **Advanced metrics** tab, see [Use advanced metrics in your analyses](canvas-advanced-metrics.md).

## Evaluate time series forecasting models
<a name="canvas-scoring-time-series"></a>

On the **Analyze** page for time series forecasting models, you can see an overview of the model’s metrics. You can hover over each metric for more information, or you can see [Use advanced metrics in your analyses](canvas-advanced-metrics.md) for more information about each metric.

In the **Column impact** section, you can see the score for each column. **Column impact** is a percentage score that indicates how much weight a column has in making predictions in relation to the other columns. For a column impact of 25%, Canvas weighs the prediction as 25% for the column and 75% for the other columns.

The following screenshot shows the time series metrics scores for the model, along with the **Optimization metric**, which is the metric that you choose to optimize when building the model. In this case, the **Optimization metric** is **RMSE**. You can specify a different optimization metric if you build a new version of your model. These metrics scores are taken from your backtest results, which are available for download in the **Artifacts** tab.

![\[Screenshot of the RMSE optimization metric on the Analyze tab in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/analyze-tab-2-time-series.png)


The **Artifacts** tab provides access to several key resources that you can use to dive deeper into your model’s performance and continue iterating upon it:
+ **Shuffled training and validation splits** – This section includes links to the artifacts generated when your dataset was split into training and validation sets, enabling you to review the data distribution and potential biases.
+ **Backtest results** – This section includes a link to the forecasted values for your validation dataset, which is used to generate accuracy metrics and evaluation data for your model.
+ **Accuracy metrics** – This section lists the advanced metrics that evaluate your model's performance, such as Root Mean Squared Error (RMSE). For more information about each metric, see [Metrics for time series forecasts](canvas-metrics.md#canvas-time-series-forecast-metrics).
+ **Explainability report** – This section provides a link to download the explainability report, which offers insights into the model's decision-making process and the relative importance of input columns. This report can help you identify potential areas for improvement.

On the **Analyze** page, you can also choose the **Download** button to directly download the backtest results, accuracy metrics, and explainability report artifacts to your local machine.

## Evaluate image prediction models
<a name="canvas-scoring-image"></a>

The **Overview** tab shows you the **Per label performance**, which gives you an overall accuracy score for the images predicted for each label. You can choose a label to see more specific details, such as the **Correctly predicted** and **Incorrectly predicted** images for the label.

You can turn on the **Heatmap** toggle to see a heatmap for each image. The heatmap shows you the areas of interest that have the most impact when your model is making predictions. For more information about heatmaps and how to use them to improve your model, choose the **More info** icon next to the **Heatmap** toggle.

The **Scoring** tab for single-label image prediction models shows you a comparison of what the model predicted as the label versus what the actual label was. You can select up to 10 labels at a time. You can change the labels in the visualization by choosing the labels dropdown menu and selecting or deselecting labels.

You can also view insights for individual labels or groups of labels, such as the three labels with the highest or lowest accuracy, by choosing the **View scores for** dropdown menu in the **Model accuracy insights** section.

The following screenshot shows the **Scoring** information for a single-label image prediction model.

![\[The actual versus predicted labels on the Scoring page for a single-label image prediction model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/analyze-image-scoring.png)


## Evaluate text prediction models
<a name="canvas-scoring-text"></a>

The **Overview** tab shows you the **Per label performance**, which gives you an overall accuracy score for the passages of text predicted for each label. You can choose a label to see more specific details, such as the **Correctly predicted** and **Incorrectly predicted** passages for the label.

The **Scoring** tab for multi-category text prediction models shows you a comparison of what the model predicted as the label versus what the actual label was.

In the **Model accuracy insights** section, you can see the **Most frequent category**, which tells you the category that the model predicted most frequently and how accurate those predictions were. If your model predicts a label of **Positive** correctly 99% of the time, then you can be fairly confident that your model is good at predicting positive sentiment in text.

The following screenshot shows the **Scoring** information for a multi-category text prediction model.

![\[The actual versus predicted labels on the Scoring page for a multi-category text prediction model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/analyze-text-scoring.png)


# Use advanced metrics in your analyses
<a name="canvas-advanced-metrics"></a>

The following section describes how to find and interpret the advanced metrics for your model in Amazon SageMaker Canvas.

**Note**  
Advanced metrics are currently available only for numeric and categorical prediction models.

To find the **Advanced metrics** tab, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **My models**.

1. Choose the model that you built.

1. In the top navigation pane, choose the **Analyze** tab.

1. Within the **Analyze** tab, choose the **Advanced metrics** tab.

In the **Advanced metrics** tab, you can find the **Performance** tab. The page looks like the following screenshot.

![\[Screenshot of the advanced metrics tab for a categorical prediction model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-analyze-performance.png)


At the top, you can see an overview of the metrics scores, including the **Optimization metric**, which is the metric that you selected (or that Canvas selected by default) to optimize when building the model.

The following sections describe more detailed information for the **Performance** tab within the **Advanced metrics**.

## Performance
<a name="canvas-advanced-metrics-performance"></a>

In the **Performance** tab, you’ll see a **Metrics table**, along with visualizations that Canvas creates based on your model type. For categorical prediction models, Canvas provides a *confusion matrix*, whereas for numeric prediction models, Canvas provides you with *residuals* and *error density* charts.

In the **Metrics table**, you are provided with a full list of your model’s scores for each advanced metric, which is more comprehensive than the scores overview at the top of the page. The metrics shown here depend on your model type. For a reference to help you understand and interpret each metric, see [Metrics reference](canvas-metrics.md).

To understand the visualizations that might appear based on your model type, see the following options:
+ **Confusion matrix** – Canvas uses confusion matrices to help you visualize when a model makes predictions correctly. In a confusion matrix, your results are arranged to compare the predicted values against the actual values. The following example explains how a confusion matrix works for a 2 category prediction model that predicts positive and negative labels:
  + True positive – The model correctly predicted positive when the true label was positive.
  + True negative – The model correctly predicted negative when the true label was negative.
  + False positive – The model incorrectly predicted positive when the true label was negative.
  + False negative – The model incorrectly predicted negative when the true label was positive.
+ **Precision recall curve** – The precision recall curve is a visualization of the model’s precision score plotted against the model’s recall score. Generally, a model that can make perfect predictions would have precision and recall scores that are both 1. The precision recall curve for a decently accurate model is fairly high in both precision and recall.
+ **Residuals** – Residuals are the difference between the actual values and the values predicted by the model. A residuals chart plots the residuals against the corresponding values to visualize their distribution and any patterns or outliers. A normal distribution of residuals around zero indicates that the model is a good fit for the data. However, if the residuals are significantly skewed or have outliers, it may indicate that the model is overfitting the data or that there are other issues that need to be addressed.
+ **Error density** – An error density plot is a representation of the distribution of errors made by a model. It shows the probability density of the errors at each point, helping you to identify any areas where the model may be overfitting or making systematic errors.
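
To make the confusion-matrix terms above concrete, the following sketch derives the four cells and the precision, recall, and F1 scores they imply from a small set of invented labels. The labels and counts are illustrative only and don't come from any Canvas model:

```python
# Invented example labels for a 2-category model; "pos" is the positive class.
actual    = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos"]
predicted = ["pos", "neg", "neg", "pos", "pos", "neg", "neg", "pos"]

# The four confusion-matrix cells are counts over the (actual, predicted) pairs.
tp = sum(a == "pos" and p == "pos" for a, p in zip(actual, predicted))
tn = sum(a == "neg" and p == "neg" for a, p in zip(actual, predicted))
fp = sum(a == "neg" and p == "pos" for a, p in zip(actual, predicted))
fn = sum(a == "pos" and p == "neg" for a, p in zip(actual, predicted))

precision = tp / (tp + fp)                          # Precision = TP/(TP+FP)
recall    = tp / (tp + fn)                          # Recall = TP/(TP+FN)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
```

Each cell is just a count over the label pairs; precision, recall, and F1 then follow directly from the definitions in [Metrics reference](canvas-metrics.md).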

# View model candidates in the model leaderboard
<a name="canvas-evaluate-model-candidates"></a>

When you do a [Standard build](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-build-model.html) for tabular and time series forecasting models in Amazon SageMaker Canvas, SageMaker AI trains multiple *model candidates* (different iterations of the model) and by default selects the one with the highest value for the optimization metric. For tabular models, Canvas builds up to 250 different model candidates using various algorithms and hyperparameter settings. For time series forecasting models, Canvas builds 7 different models—one for each of the [supported forecasting algorithms](canvas-advanced-settings.md#canvas-advanced-settings-time-series) and one ensemble model that averages the predictions of the other models to try to optimize accuracy.

The default model candidate is the only version that you can use in Canvas for actions like making predictions, registering to the model registry, or deploying to an endpoint. However, you might want to review all of the model candidates and select a different candidate to be the default model. You can view all of the model candidates and more details about each candidate on the **Model leaderboard** in Canvas.

To view the **Model leaderboard**, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **My models**.

1. Choose the model that you built.

1. In the top navigation pane, choose the **Analyze** tab.

1. Within the **Analyze** tab, choose **Model leaderboard**.

The **Model leaderboard** page opens, which for tabular models looks like the following screenshot.

![\[The model leaderboard, which lists all of the model candidates that Canvas trained.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-model-leaderboard.png)


For time series forecasting models, you see 7 models, which include one for each of the time series forecasting algorithms supported by Canvas and one ensemble model. For more information about the algorithms, see [Advanced time series forecasting model settings](canvas-advanced-settings.md#canvas-advanced-settings-time-series).

In the preceding screenshot, you can see that the first model candidate listed is marked as the **Default model**. This is the model candidate with which you can make predictions or deploy to endpoints.

To view more detailed metrics information about the model candidates to compare them, you can choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) and choose **View model details**.

**Important**  
Loading the model details for non-default model candidates may take a few minutes (typically less than 10 minutes), and SageMaker AI Hosting charges apply. For more information, see [SageMaker AI Pricing](https://aws.amazon.com/sagemaker/pricing/).

The model candidate opens in the **Analyze** tab, and the metrics shown are specific to that model candidate. When you’re done reviewing the model candidate’s metrics, you can go back or exit the view to return to the **Model leaderboard**.

If you’d like to set the **Default model** to a different candidate, you can choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) and choose **Change to default model**. Changing the default model for a model trained using HPO mode might take several minutes.

**Note**  
If your model is already deployed in production, [registered to the model registry](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-register-model.html), or has [automations](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-manage-automations.html) set up, you must delete your deployment, model registration, or automations before changing the default model.

# Metrics reference
<a name="canvas-metrics"></a>

The following sections describe the metrics that are available in Amazon SageMaker Canvas for each model type.

## Metrics for numeric prediction
<a name="canvas-numeric-metrics"></a>

The following list defines the metrics for numeric prediction in SageMaker Canvas and gives you information about how you can use them.
+ InferenceLatency – The approximate amount of time between making a request for a model prediction and receiving it from a real-time endpoint to which the model is deployed. This metric is measured in seconds and is only available for models built with the **Ensembling** mode.
+ MAE – Mean absolute error. On average, the prediction for the target column is +/- *MAE* from the actual value.

  Measures how different the predicted and actual values are when they're averaged over all values. MAE is commonly used in numeric prediction to understand model prediction error. If the predictions are linear, MAE represents the average distance from a predicted line to the actual value. MAE is defined as the sum of absolute errors divided by the number of observations. Values range from 0 to infinity, with smaller numbers indicating a better model fit to the data.
+ MAPE – Mean absolute percent error. On average, the prediction for the target column is +/- *MAPE*% from the actual value.

  MAPE is the mean of the absolute differences between the actual values and the predicted or estimated values, divided by the actual values and expressed as a percentage. A lower MAPE indicates better performance, as it means that the predicted or estimated values are closer to the actual values.
+ MSE – Mean squared error, or the average of the squared differences between the predicted and actual values.

  MSE values are always positive. The better a model is at predicting the actual values, the smaller the MSE value is.
+ R2 – The percentage of the difference in the target column that can be explained by the input column.

  Quantifies how much a model can explain the variance of a dependent variable. Values range from one (1) to negative one (-1). Higher numbers indicate a higher fraction of explained variability. Values close to zero (0) indicate that very little of the dependent variable can be explained by the model. Negative values indicate a poor fit and that the model is outperformed by a constant function (or a horizontal line).
+ RMSE – Root mean squared error, or the standard deviation of the errors.

  Measures the square root of the squared difference between predicted and actual values, and is averaged over all values. It is used to understand model prediction error, and it's an important metric to indicate the presence of large model errors and outliers. Values range from zero (0) to infinity, with smaller numbers indicating a better model fit to the data. RMSE is dependent on scale, and should not be used to compare datasets of different types.
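
As a rough illustration of how these definitions relate, the following sketch computes MAE, MAPE, MSE, RMSE, and R2 for a few invented values. It mirrors the standard formulas, not Canvas's internal implementation:

```python
import math

# Invented actual and predicted values for a numeric prediction model.
actual    = [10.0, 12.0, 8.0, 15.0]
predicted = [11.0, 11.0, 9.0, 13.0]
n = len(actual)
errors = [p - a for a, p in zip(actual, predicted)]

mae  = sum(abs(e) for e in errors) / n                              # mean absolute error
mape = sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n * 100  # percent error
mse  = sum(e * e for e in errors) / n                               # mean squared error
rmse = math.sqrt(mse)                                               # root of MSE

# R2 = 1 - (residual sum of squares / total sum of squares)
mean_actual = sum(actual) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot
```

Note how MSE and RMSE penalize the single larger error (15 vs. 13) more heavily than MAE does, which is why RMSE is useful for flagging outliers.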

## Metrics for categorical prediction
<a name="canvas-categorical-metrics"></a>

This section defines the metrics for categorical prediction in SageMaker Canvas and gives you information about how you can use them.

The following is a list of available metrics for 2-category prediction:
+ Accuracy – The percentage of correct predictions.

  Or, the ratio of the number of correctly predicted items to the total number of predictions. Accuracy measures how close the predicted class values are to the actual values. Values for accuracy metrics vary between zero (0) and one (1). A value of 1 indicates perfect accuracy, and 0 indicates complete inaccuracy.
+ AUC – A value between 0 and 1 that indicates how well your model is able to separate the categories in your dataset. A value of 1 indicates that it was able to separate the categories perfectly.
+ BalancedAccuracy – Measures the ratio of accurate predictions to all predictions.

  This ratio is calculated after normalizing true positives (TP) and true negatives (TN) by the total number of positive (P) and negative (N) values. It is defined as follows: `0.5*((TP/P)+(TN/N))`, with values ranging from 0 to 1. The balanced accuracy metric gives a better measure of accuracy when the number of positives or negatives differ greatly from each other in an imbalanced dataset, such as when only 1% of email is spam.
+ F1 – A balanced measure of accuracy that takes class balance into account.

  It is the harmonic mean of the precision and recall scores, defined as follows: `F1 = 2 * (precision * recall) / (precision + recall)`. F1 scores vary between 0 and 1. A score of 1 indicates the best possible performance, and 0 indicates the worst.
+ InferenceLatency – The approximate amount of time between making a request for a model prediction and receiving it from a real-time endpoint to which the model is deployed. This metric is measured in seconds and is only available for models built with the **Ensembling** mode.
+ LogLoss – Log loss, also known as cross-entropy loss, is a metric used to evaluate the quality of the probability outputs, rather than the outputs themselves. Log loss is an important metric to indicate when a model makes incorrect predictions with high probabilities. Values range from 0 to infinity. A value of 0 represents a model that perfectly predicts the data.
+ Precision – Of all the times that *category x* was predicted, the prediction was correct *precision*% of the time.

  Precision measures how well an algorithm predicts the true positives (TP) out of all of the positives that it identifies. It is defined as follows: `Precision = TP/(TP+FP)`, with values ranging from zero (0) to one (1). Precision is an important metric when the cost of a false positive is high. For example, the cost of a false positive is very high if an airplane safety system is falsely deemed safe to fly. A false positive (FP) reflects a positive prediction that is actually negative in the data.
+ Recall – The model correctly predicted *recall*% to be *category x* when *target column* was actually *category x*.

  Recall measures how well an algorithm correctly predicts all of the true positives (TP) in a dataset. A true positive is a positive prediction that is also an actual positive value in the data. Recall is defined as follows: `Recall = TP/(TP+FN)`, with values ranging from 0 to 1. Higher scores reflect a better ability of the model to predict true positives (TP) in the data. Note that it is often insufficient to measure only recall, because predicting every output as a true positive yields a perfect recall score.
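
To illustrate how BalancedAccuracy and LogLoss behave on imbalanced data, the sketch below uses an invented set of labels and predicted probabilities in which only 1 of 6 observations is positive:

```python
import math

# Invented labels (1 = positive) and predicted probabilities of the positive class.
actual = [1, 0, 0, 0, 0, 0]
prob   = [0.9, 0.2, 0.1, 0.6, 0.4, 0.1]
pred   = [1 if p >= 0.5 else 0 for p in prob]   # threshold at 0.5

tp = sum(a == 1 and y == 1 for a, y in zip(actual, pred))
tn = sum(a == 0 and y == 0 for a, y in zip(actual, pred))
p_total = sum(a == 1 for a in actual)
n_total = sum(a == 0 for a in actual)

# 0.5*((TP/P)+(TN/N)): each class contributes equally despite the imbalance.
balanced_accuracy = 0.5 * (tp / p_total + tn / n_total)

# Cross-entropy over the predicted probabilities; confident wrong
# predictions are penalized most heavily.
log_loss = -sum(a * math.log(p) + (1 - a) * math.log(1 - p)
                for a, p in zip(actual, prob)) / len(actual)
```

Plain accuracy here would be dominated by the negative class, which is exactly the situation BalancedAccuracy is designed to handle.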

The following is a list of available metrics for 3+ category prediction:
+ Accuracy – The percentage of correct predictions.

  Or, the ratio of the number of correctly predicted items to the total number of predictions. Accuracy measures how close the predicted class values are to the actual values. Values for accuracy metrics vary between zero (0) and one (1). A value of 1 indicates perfect accuracy, and 0 indicates complete inaccuracy.
+ BalancedAccuracy – Measures the ratio of accurate predictions to all predictions.

  This ratio is calculated after normalizing true positives (TP) and true negatives (TN) by the total number of positive (P) and negative (N) values. It is defined as follows: `0.5*((TP/P)+(TN/N))`, with values ranging from 0 to 1. The balanced accuracy metric gives a better measure of accuracy when the number of positives or negatives differ greatly from each other in an imbalanced dataset, such as when only 1% of email is spam.
+ F1macro – The F1macro score applies F1 scoring by calculating the precision and recall, and then taking their harmonic mean to calculate the F1 score for each class. Then, the F1macro averages the individual scores to obtain the F1macro score. F1macro scores vary between 0 and 1. A score of 1 indicates the best possible performance, and 0 indicates the worst.
+ InferenceLatency – The approximate amount of time between making a request for a model prediction and receiving it from a real-time endpoint to which the model is deployed. This metric is measured in seconds and is only available for models built with the **Ensembling** mode.
+ LogLoss – Log loss, also known as cross-entropy loss, is a metric used to evaluate the quality of the probability outputs, rather than the outputs themselves. Log loss is an important metric to indicate when a model makes incorrect predictions with high probabilities. Values range from 0 to infinity. A value of 0 represents a model that perfectly predicts the data.
+ PrecisionMacro – Measures precision by calculating precision for each class and averaging scores to obtain precision for several classes. Scores range from zero (0) to one (1). Higher scores reflect the model's ability to predict true positives (TP) out of all of the positives that it identifies, averaged across multiple classes.
+ RecallMacro – Measures recall by calculating recall for each class and averaging scores to obtain recall for several classes. Scores range from 0 to 1. Higher scores reflect the model's ability to predict true positives (TP) in a dataset, whereas a true positive reflects a positive prediction that is also an actual positive value in the data. It is often insufficient to measure only recall, because predicting every output as a true positive will yield a perfect recall score.

Note that for 3+ category prediction, you also receive the average F1, Accuracy, Precision, and Recall metrics. The scores for these metrics are just the metric scores averaged across all categories.
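
Macro averaging, as used by F1macro, PrecisionMacro, and RecallMacro, can be sketched as follows. The three-category labels are invented for illustration; each class is treated as the "positive" class in turn, and the per-class scores are averaged:

```python
# Invented labels for a 3-category model.
actual    = ["a", "a", "b", "b", "c", "c"]
predicted = ["a", "b", "b", "b", "c", "a"]
classes = sorted(set(actual))

precisions, recalls = [], []
for c in classes:
    # One-vs-rest counts, treating class c as the positive class.
    tp = sum(a == c and p == c for a, p in zip(actual, predicted))
    fp = sum(a != c and p == c for a, p in zip(actual, predicted))
    fn = sum(a == c and p != c for a, p in zip(actual, predicted))
    precisions.append(tp / (tp + fp) if tp + fp else 0.0)
    recalls.append(tp / (tp + fn) if tp + fn else 0.0)

# Macro average: the unweighted mean of the per-class scores.
precision_macro = sum(precisions) / len(classes)
recall_macro    = sum(recalls) / len(classes)
```

Because every class contributes equally to the average, macro metrics surface poor performance on rare classes that a plain accuracy score would hide.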

## Metrics for image and text prediction
<a name="canvas-cv-nlp-metrics"></a>

The following is a list of available metrics for image prediction and text prediction.
+ Accuracy – The percentage of correct predictions.

  Or, the ratio of the number of correctly predicted items to the total number of predictions. Accuracy measures how close the predicted class values are to the actual values. Values for accuracy metrics vary between zero (0) and one (1). A value of 1 indicates perfect accuracy, and 0 indicates complete inaccuracy.
+ F1 – A balanced measure of accuracy that takes class balance into account.

  It is the harmonic mean of the precision and recall scores, defined as follows: `F1 = 2 * (precision * recall) / (precision + recall)`. F1 scores vary between 0 and 1. A score of 1 indicates the best possible performance, and 0 indicates the worst.
+ Precision – Of all the times that *category x* was predicted, the prediction was correct *precision*% of the time.

  Precision measures how well an algorithm predicts the true positives (TP) out of all of the positives that it identifies. It is defined as follows: `Precision = TP/(TP+FP)`, with values ranging from zero (0) to one (1). Precision is an important metric when the cost of a false positive is high. For example, the cost of a false positive is very high if an airplane safety system is falsely deemed safe to fly. A false positive (FP) reflects a positive prediction that is actually negative in the data.
+ Recall – The model correctly predicted *recall*% to be *category x* when *target column* was actually *category x*.

  Recall measures how well an algorithm correctly predicts all of the true positives (TP) in a dataset. A true positive is a positive prediction that is also an actual positive value in the data. Recall is defined as follows: `Recall = TP/(TP+FN)`, with values ranging from 0 to 1. Higher scores reflect a better ability of the model to predict true positives (TP) in the data. Note that it is often insufficient to measure only recall, because predicting every output as a true positive yields a perfect recall score.

Note that for image and text prediction models where you are predicting 3 or more categories, you also receive the *average* F1, Accuracy, Precision, and Recall metrics. The scores for these metrics are just the metric scores averaged across all categories.

## Metrics for time series forecasts
<a name="canvas-time-series-forecast-metrics"></a>

The following defines the advanced metrics for time series forecasts in Amazon SageMaker Canvas and gives you information about how you can use them.
+ Average Weighted Quantile Loss (wQL) – Evaluates the forecast by averaging the accuracy at the P10, P50, and P90 quantiles. A lower value indicates a more accurate model.
+ Weighted Absolute Percent Error (WAPE) – The sum of the absolute error normalized by the sum of the absolute target, which measures the overall deviation of forecasted values from observed values. A lower value indicates a more accurate model, where WAPE = 0 is a model with no errors.
+ Root Mean Square Error (RMSE) – The square root of the average squared errors. A lower RMSE indicates a more accurate model, where RMSE = 0 is a model with no errors.
+ Mean Absolute Percent Error (MAPE) – The percentage error (percent difference of the mean forecasted value versus the actual value) averaged over all time points. A lower value indicates a more accurate model, where MAPE = 0 is a model with no errors.
+ Mean Absolute Scaled Error (MASE) – The mean absolute error of the forecast normalized by the mean absolute error of a simple baseline forecasting method. A lower value indicates a more accurate model, where MASE < 1 is estimated to be better than the baseline and MASE > 1 is estimated to be worse than the baseline.
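
For illustration, WAPE and MASE can be computed by hand for a short invented series. The naive previous-value forecast used here is one common choice of MASE baseline, not necessarily the baseline Canvas uses internally:

```python
# Invented observed values and forecasted values for four time points.
actual   = [100.0, 120.0, 110.0, 130.0]
forecast = [ 95.0, 125.0, 105.0, 135.0]

# WAPE: sum of absolute errors normalized by the sum of absolute actuals.
wape = (sum(abs(a - f) for a, f in zip(actual, forecast))
        / sum(abs(a) for a in actual))

# MASE: forecast MAE divided by the MAE of a naive baseline that always
# predicts the previous observed value.
mae_forecast = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
mae_naive = (sum(abs(actual[i] - actual[i - 1]) for i in range(1, len(actual)))
             / (len(actual) - 1))
mase = mae_forecast / mae_naive
```

Here the forecast's MAE is well below the naive baseline's, so MASE is below 1, meaning the model beats the baseline.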

# Predictions with custom models
<a name="canvas-make-predictions"></a>

Use the custom model that you've built in SageMaker Canvas to make predictions for your data. The following sections show you how to make predictions for numeric and categorical prediction models, time series forecasts, image prediction models, and text prediction models.

Numeric and categorical prediction, image prediction, and text prediction custom models support making the following types of predictions for your data:
+ **Single predictions** — A **Single prediction** is when you only need to make one prediction. For example, you have one image or passage of text that you want to classify.
+ **Batch predictions** — A **Batch prediction** is when you’d like to make predictions for an entire dataset. You can make batch predictions for datasets up to terabytes (TBs) in size. For example, you have a CSV file of customer reviews for which you’d like to predict the customer sentiment, or you have a folder of image files that you'd like to classify. You should make predictions with a dataset whose schema matches the dataset you used to build the model. Canvas provides you with the ability to do manual batch predictions, or you can configure automatic batch predictions that run whenever you update a dataset.

For each prediction or set of predictions, SageMaker Canvas returns the following:
+ The predicted values
+ The probability of the predicted value being correct
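For illustration (the label names and probabilities below are hypothetical), the returned prediction is the value with the highest probability, reported together with that probability:

```python
# Hypothetical class probabilities for a single prediction.
probabilities = {"Positive": 0.85, "Neutral": 0.10, "Negative": 0.05}

# The predicted value is the most probable label; its probability is
# surfaced as the confidence that the prediction is correct.
predicted_value = max(probabilities, key=probabilities.get)
confidence = probabilities[predicted_value]
print(predicted_value, confidence)  # Positive 0.85
```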

**Get started**

Choose one of the following workflows to make predictions with your custom model:
+ [Batch predictions in SageMaker Canvas](canvas-make-predictions-batch.md)
+ [Make single predictions](canvas-make-predictions-single.md)

After generating predictions with your model, you can also do the following:
+ [Update your model by adding versions.](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-update-model.html) If you want to try to improve the prediction accuracy of your model, you can build new versions of your model. You can choose to clone your original model building configuration and dataset, or you can change your configuration and select a different dataset. After adding a new version, you can review and compare versions to choose the best one.
+ [Register a model version in the SageMaker AI model registry](canvas-register-model.md). You can register versions of your model to the SageMaker Model Registry, which is a feature for tracking and managing the status of model versions and machine learning pipelines. A data scientist or MLOps team user with access to the SageMaker Model Registry can review your model versions and approve or reject them before deploying them to production.
+ [Send your batch predictions to Quick.](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-send-predictions.html) In Quick, you can build and publish dashboards with your batch prediction datasets. This can help you analyze and share results generated by your custom model.

# Make single predictions
<a name="canvas-make-predictions-single"></a>

**Note**  
This section describes how to get single predictions from your model inside the Canvas application. For information about making real-time invocations in a production environment by deploying your model to an endpoint, see [Deploy your models to an endpoint](canvas-deploy-model.md).

Make single predictions if you want to get a prediction for a single data point. You can use this feature to get real-time predictions or to experiment with changing individual values to see how they impact the prediction outcome. Note that single predictions rely on an Asynchronous Inference endpoint, which shuts down after being idle (or not receiving any prediction requests) for two hours.

Choose one of the following procedures based on your model type.

## Make single predictions with numeric and categorical prediction models
<a name="canvas-make-predictions-numeric-categorical"></a>

To make a single prediction for a numeric or categorical prediction model, do the following:

1. In the left navigation pane of the Canvas application, choose **My models**.

1. On the **My models** page, choose your model.

1. After opening your model, choose the **Predict** tab.

1. On the **Run predictions** page, choose **Single prediction**.

1. For each **Column** field, which represents the columns of your input data, you can change the **Value**. Select the dropdown menu for the **Value** you want to change. For numeric fields, you can enter a new number. For fields with labels, you can select a different label.

1. When you’re ready to generate the prediction, in the right **Prediction** pane, choose **Update**.

In the right **Prediction** pane, you’ll see the prediction result. You can choose **Copy** to copy the prediction result chart, or choose **Download** to download the prediction result chart as an image or the values and prediction as a CSV file.

## Make single predictions with time series forecasting models
<a name="canvas-make-predictions-forecast"></a>

To make a single prediction for a time series forecasting model, do the following:

1. In the left navigation pane of the Canvas application, choose **My models**.

1. On the **My models** page, choose your model.

1. After opening your model, choose the **Predict** tab.

1. Choose **Single prediction**.

1. For **Item**, select the item for which you want to forecast values.

1. If you used a group by column to train the model, then select the group by category for the item.

The prediction result loads in the pane below, showing you a chart with the forecast for each quantile. Choose **Schema view** to see the numeric predicted values. You can also choose **Download** to download the prediction results as either an image or a CSV file.

## Make single predictions with image prediction models
<a name="canvas-make-predictions-image"></a>

To make a single prediction for a single-label image prediction model, do the following:

1. In the left navigation pane of the Canvas application, choose **My models**.

1. On the **My models** page, choose your model.

1. After opening your model, choose the **Predict** tab.

1. On the **Run predictions** page, choose **Single prediction**.

1. Choose **Import image**.

1. You’ll be prompted to upload an image. You can upload an image from your local computer or from an Amazon S3 bucket.

1. Choose **Import** to import your image and generate the prediction.

In the right **Prediction results** pane, the model lists the possible labels for the image along with a **Confidence** score for each label. For example, the model might predict the label **Sea** for an image with a confidence score of 96%, and the label **Glacier** with a confidence score of only 4%. From these scores, you can determine that your model is fairly confident the image shows the sea.

## Make single predictions with text prediction models
<a name="canvas-make-predictions-text"></a>

To make a single prediction for a multi-category text prediction model, do the following:

1. In the left navigation pane of the Canvas application, choose **My models**.

1. On the **My models** page, choose your model.

1. After opening your model, choose the **Predict** tab.

1. On the **Run predictions** page, choose **Single prediction**.

1. For the **Text field**, enter the text for which you’d like to get a prediction.

1. Choose **Generate prediction results** to get your prediction.

In the right **Prediction results** pane, you receive an analysis of your text in addition to a **Confidence** score for each possible label. For example, if you entered a good review for a product, you might get **Positive** with a confidence score of 85%, while the confidence score for **Neutral** might be 10% and the confidence score for **Negative** only 5%.

# Batch predictions in SageMaker Canvas
<a name="canvas-make-predictions-batch"></a>

Make batch predictions when you have an entire dataset for which you’d like to generate predictions. Amazon SageMaker Canvas supports batch predictions for datasets up to terabytes (TBs) in size.

There are two types of batch predictions you can make:
+ [Manual batch predictions](canvas-make-predictions-batch-manual.md) are when you have a dataset for which you want to make one-time predictions.
+ [Automatic batch predictions](canvas-make-predictions-batch-auto.md) are when you set up a configuration that runs whenever a specific dataset is updated. For example, if you’ve configured weekly updates to a SageMaker Canvas dataset of inventory data, you can set up automatic batch predictions that run whenever you update the dataset. After setting up an automated batch predictions workflow, see [How to manage automations](canvas-manage-automations.md) for more information about viewing and editing the details of your configuration. For more information about setting up automatic dataset updates, see [Configure automatic updates for a dataset](canvas-update-dataset-auto.md).

**Note**  
Time series forecasting models don't support automatic batch predictions.  
You can only set up automatic batch predictions for datasets imported through local upload or Amazon S3. Additionally, automatic batch predictions can only run while you’re logged in to the Canvas application. If you log out of Canvas, the automatic batch prediction job resumes when you log back in.

To get started, review the [Batch prediction dataset requirements](canvas-make-predictions-batch-preqreqs.md), and then choose one of the following manual or automatic batch prediction workflows.

**Topics**
+ [Batch prediction dataset requirements](canvas-make-predictions-batch-preqreqs.md)
+ [Make manual batch predictions](canvas-make-predictions-batch-manual.md)
+ [Make automatic batch predictions](canvas-make-predictions-batch-auto.md)
+ [Edit your automatic batch prediction configuration](canvas-make-predictions-batch-auto-edit.md)
+ [Delete your automatic batch prediction configuration](canvas-make-predictions-batch-auto-delete.md)
+ [View your batch prediction jobs](canvas-make-predictions-batch-auto-view.md)

# Batch prediction dataset requirements
<a name="canvas-make-predictions-batch-preqreqs"></a>

For batch predictions, make sure that your datasets meet the requirements outlined in [Create a dataset](canvas-import-dataset.md). If your dataset is larger than 5 GB, then Canvas uses Amazon EMR Serverless to process your data and split it into smaller batches. After your data has been split, Canvas uses SageMaker AI Batch Transform to make predictions. You may see charges from both of these services after running batch predictions. For more information, see [Canvas pricing](https://aws.amazon.com/sagemaker/canvas/pricing/).
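The routing described above can be sketched conceptually; the 5 GB threshold comes from this section, while the function and step names are illustrative:

```python
FIVE_GB = 5 * 1000 ** 3  # the 5 GB threshold described above

def batch_prediction_steps(dataset_size_bytes):
    # Conceptual sketch only: datasets over 5 GB get an extra splitting
    # step (Amazon EMR Serverless) before SageMaker AI Batch Transform
    # generates the predictions.
    steps = []
    if dataset_size_bytes > FIVE_GB:
        steps.append("split-into-batches (EMR Serverless)")
    steps.append("predict (SageMaker AI Batch Transform)")
    return steps

print(batch_prediction_steps(12 * 1000 ** 3))
```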

You might not be able to make predictions on some datasets if their *schemas* are incompatible. A *schema* is the organizational structure of a dataset. For a tabular dataset, the schema consists of the column names and the data type of the data in each column. An incompatible schema can occur for any of the following reasons:
+ The dataset that you're using to make predictions has fewer columns than the dataset that you're using to build the model.
+ The data types in the columns of the dataset that you used to build the model might be different from the data types in the dataset that you're using to make predictions.
+ The dataset that you're using to make predictions and the dataset that you've used to build the model have column names that don't match. The column names are case sensitive. `Column1` is not the same as `column1`.

To ensure that you can successfully generate batch predictions, match the schema of your batch predictions dataset to the dataset you used to train the model.
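A hypothetical pre-flight check for the three mismatch causes above (missing columns, mismatched types, and case-sensitive names) might look like the following sketch:

```python
def schema_mismatches(model_schema, prediction_schema):
    """Compare two {column_name: data_type} schemas and list problems.

    Column names are compared case-sensitively, so 'Column1' does not
    match 'column1'.
    """
    problems = []
    for column, data_type in model_schema.items():
        if column not in prediction_schema:
            problems.append(f"missing column: {column}")
        elif prediction_schema[column] != data_type:
            problems.append(f"type mismatch for {column}: "
                            f"{data_type} vs {prediction_schema[column]}")
    return problems

model_schema = {"Column1": "string", "price": "float"}
prediction_schema = {"column1": "string", "price": "int"}
print(schema_mismatches(model_schema, prediction_schema))
# ['missing column: Column1', 'type mismatch for price: float vs int']
```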

**Note**  
For batch predictions, if you dropped any columns when building your model, Canvas adds the dropped columns back to the prediction results. However, Canvas does not add the dropped columns to your batch predictions for time series models.
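To illustrate the note above with hypothetical data (the column and label names are invented), re-attaching dropped columns to prediction rows is conceptually a row-wise join:

```python
# Hypothetical input rows; customer_id was dropped during model building.
inputs = [{"customer_id": "A1", "age": 34}, {"customer_id": "B2", "age": 51}]
dropped_columns = ["customer_id"]
predictions = [{"prediction": "churn"}, {"prediction": "stay"}]

# Conceptually, each dropped column is added back to its prediction row.
results = [{**{c: row[c] for c in dropped_columns}, **pred}
           for row, pred in zip(inputs, predictions)]
print(results)
# [{'customer_id': 'A1', 'prediction': 'churn'}, {'customer_id': 'B2', 'prediction': 'stay'}]
```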

# Make manual batch predictions
<a name="canvas-make-predictions-batch-manual"></a>

Choose one of the following procedures to make manual batch predictions based on your model type.

## Make manual batch predictions with numeric, categorical, and time series forecasting models
<a name="canvas-make-predictions-batch-numeric-categorical"></a>

To make manual batch predictions for numeric, categorical, and time series forecasting model types, do the following:

1. In the left navigation pane of the Canvas application, choose **My models**.

1. On the **My models** page, choose your model.

1. After opening your model, choose the **Predict** tab.

1. On the **Run predictions** page, choose **Batch prediction**.

1. Choose **Select dataset** to pick a dataset for generating predictions.

1. From the list of available datasets, select your dataset, and then choose **Start Predictions** to get your predictions.

After the prediction job finishes running, there is an output dataset listed on the same page in the **Predictions** section. This dataset contains your results, and if you select the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), you can choose **Preview** to preview the output data. You can see the input data matched to the prediction and the probability that the prediction is correct. Then, you can choose **Download prediction** to download the results as a file.

## Make manual batch predictions with image prediction models
<a name="canvas-make-predictions-batch-image"></a>

To make manual batch predictions for a single-label image prediction model, do the following:

1. In the left navigation pane of the Canvas application, choose **My models**.

1. On the **My models** page, choose your model.

1. After opening your model, choose the **Predict** tab.

1. On the **Run predictions** page, choose **Batch prediction**.

1. Choose **Select dataset** if you’ve already imported your dataset. If not, choose **Import new dataset**, and then you’ll be directed through the import data workflow.

1. From the list of available datasets, select your dataset and choose **Generate predictions** to get your predictions.

After the prediction job finishes running, on the **Run predictions** page, you see an output dataset listed under **Predictions**. This dataset contains your results, and if you select the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), you can choose **View prediction results** to see the output data. You can see the images along with their predicted labels and confidence scores. Then, you can choose **Download prediction** to download the results as a CSV or a ZIP file.

## Make manual batch predictions with text prediction models
<a name="canvas-make-predictions-batch-text"></a>

To make manual batch predictions for a multi-category text prediction model, do the following:

1. In the left navigation pane of the Canvas application, choose **My models**.

1. On the **My models** page, choose your model.

1. After opening your model, choose the **Predict** tab.

1. On the **Run predictions** page, choose **Batch prediction**.

1. Choose **Select dataset** if you’ve already imported your dataset. If not, choose **Import new dataset**, and then you’ll be directed through the import data workflow. The dataset you choose must have the same source column as the dataset with which you built the model.

1. From the list of available datasets, select your dataset and choose **Generate predictions** to get your predictions.

After the prediction job finishes running, on the **Run predictions** page, you see an output dataset listed under **Predictions**. This dataset contains your results, and if you select the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), you can choose **Preview** to see the output data. You can see the text along with the predicted labels and confidence scores. Then, you can choose **Download prediction** to download the results.

# Make automatic batch predictions
<a name="canvas-make-predictions-batch-auto"></a>

**Note**  
Time series forecasting models don't support automatic batch predictions.

To set up a schedule for automatic batch predictions, do the following:

1. In the left navigation pane of Canvas, choose **My models**.

1. Choose your model.

1. Choose the **Predict** tab.

1. Choose **Batch prediction**.

1. For **Generate predictions**, choose **Automatic**.

1. The **Automate batch predictions** dialog box pops up. Choose **Select dataset** and choose the dataset for which you want to automate predictions. Note that you can only select a dataset that was imported through local upload or Amazon S3.

1. After selecting a dataset, choose **Set up**.

Canvas runs a batch prediction job for the dataset after you set up the configuration. Then, every time you [Update a dataset](canvas-update-dataset.md), either manually or automatically, another batch prediction job runs.

After the prediction job finishes running, on the **Run predictions** page, you see an output dataset listed under **Predictions**. This dataset contains your results, and if you select the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), you can choose **Preview** to preview the output data. You can see the input data matched to the prediction and the probability that the prediction is correct. Then, you can choose **Download** to download the results.

The following sections describe how to view, update, and delete your automatic batch prediction configuration through the **Datasets** page in the Canvas application. You can set up a maximum of 20 automatic configurations in Canvas. For more information about viewing your automated batch predictions job history or making changes to your automatic configuration through the **Automations** page, see [How to manage automations](canvas-manage-automations.md).

# Edit your automatic batch prediction configuration
<a name="canvas-make-predictions-batch-auto-edit"></a>

You might want to make changes to your automatic batch prediction configuration, such as changing the target dataset. You might also want to turn off the configuration to pause your automatic batch predictions.

When you edit a batch prediction configuration, you can change the target dataset but not the frequency (since automatic batch predictions occur whenever the dataset is updated).

To edit your automatic batch prediction configuration, do the following:

1. Go to the **Predict** tab of your model.

1. Under **Predictions**, choose the **Configuration** tab.

1. Find your configuration and choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)).

1. From the dropdown menu, choose **Update configuration**.

1. The **Automate batch prediction** dialog box opens. You can select another dataset and choose **Set up** to save your changes.

Your automatic batch predictions configuration is now updated.

To pause your automatic batch predictions, turn off your automatic configuration by doing the following:

1. Go to the **Predict** tab of your model.

1. Under **Predictions**, choose the **Configuration** tab.

1. Find your configuration from the list and turn off the **Auto update** toggle.

Automatic batch predictions are now paused. You can turn the toggle back on at any time to resume them.

# Delete your automatic batch prediction configuration
<a name="canvas-make-predictions-batch-auto-delete"></a>

To learn how to delete your automatic batch prediction configuration, see [Delete an automatic configuration](canvas-manage-automations-delete.md).

You can also delete your configuration by doing the following:

1. Go to the **Predict** tab of your model.

1. Under **Predictions**, choose the **Configuration** tab.

1. Find your configuration from the list and choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)).

1. From the dropdown menu, choose **Delete configuration**.

Your configuration should now be deleted.

# View your batch prediction jobs
<a name="canvas-make-predictions-batch-auto-view"></a>

To view the statuses and history of your batch prediction jobs, go to the **Predict** tab of your model.

Each batch prediction job shows up in the **Predict** tab of your model. Under **Predictions**, you can see the **All jobs** tab and the **Configuration** tabs:
+ **All jobs** – In this tab, you can see all of the manual and automatic batch prediction jobs for this model. You can filter the jobs by configuration name. For each job, you can see the following fields:
  + **Status** – The current status of your batch prediction job. If the status is **Failed** or **Partially failed**, you can hover over the status to view a more detailed error message to help you troubleshoot.
  + **Input dataset** – The name of your Canvas input dataset, including the dataset version.
  + **Prediction type** – Whether the prediction job was automatic or manual.
  + **Rows** – The number of rows predicted.
  + **Configuration name** – The name of the batch prediction job configuration.
  + **QuickSight** – Describes whether you've sent the batch predictions to Quick.
  + **Created** – The creation time of the batch prediction job.

  If you choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), you can choose **View details**, **Preview prediction**, **Download prediction**, or **Send to Quick**. If you choose **View details**, a page opens that shows you the full details of the batch prediction job, including the status, the input and output data configurations, information about the instances used to complete the job and access to the Amazon CloudWatch logs. The page looks like the following screenshot.  
![\[Batch prediction job details page showing all of the additional details about a job.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-view-batch-prediction-job-details.png)
+ **Configuration** – In this tab, you can see all of the automatic batch prediction configurations you’ve created for this model. For each configuration, you can see fields such as the timestamp for when it was **Created**, the **Input dataset** it tracks for updates, and the **Next job scheduled**, which is the time when the next automatic prediction job is scheduled to start. If you choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), you can choose **View all jobs** to see the job history and in progress jobs for the configuration.



# Send predictions to Quick
<a name="canvas-send-predictions"></a>

**Note**  
You can send batch predictions to Quick for numeric and categorical prediction and time series forecasting models. Single-label image prediction and multi-category text prediction models are excluded.

After you generate batch predictions with custom tabular models in SageMaker Canvas, you can send those predictions as CSV files to Quick, a business intelligence (BI) service that you can use to build and publish predictive dashboards.

For example, if you built a 2-category prediction model to determine whether a customer will churn, you can create a visual, predictive dashboard in Quick that shows the percentage of customers expected to churn. To learn more about Quick, see the [Quick User Guide](https://docs.aws.amazon.com/quicksight/latest/user/welcome.html).

The following sections show you how to send your batch predictions to Quick for analysis.

## Before you begin
<a name="canvas-send-predictions-prereqs"></a>

Your user must have the necessary AWS Identity and Access Management (IAM) permissions to send your predictions to Quick. Your administrator can set up the IAM permissions for your user. For more information, see [Grant Your Users Permissions to Send Predictions to Quick](canvas-quicksight-permissions.md).

Your Quick account must contain the `default` namespace, which is set up when you first create your Quick account. Contact your administrator to help you get access to Quick. For more information, see [Setting up for Quick](https://docs.aws.amazon.com/quicksight/latest/user/setting-up.html) in the *Quick User Guide*.

Your Quick account must be created in the same Region as your Canvas application. If your Quick account’s home Region differs from your Canvas application’s Region, you must either [close](https://docs.aws.amazon.com/quicksight/latest/user/closing-account.html) and recreate your Quick account, or [set up a Canvas application](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-getting-started.html#canvas-prerequisites) in the same Region as your Quick account. You can check your Quick home Region by doing the following (assuming you already have a Quick account):

1. Open your [Quick console](https://quicksight.aws.amazon.com/).

1. When the page loads, your Quick home Region is appended to the URL in the following format: `https://<your-home-region>.quicksight.aws.amazon.com/`.
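For example, a small helper (hypothetical, using Python's standard library) can pull the home Region out of a URL in that format:

```python
from urllib.parse import urlparse

def quicksight_home_region(url):
    # The home Region is the leading subdomain of the URL, for example
    # https://us-east-1.quicksight.aws.amazon.com/ -> us-east-1
    return urlparse(url).hostname.split(".")[0]

print(quicksight_home_region("https://us-east-1.quicksight.aws.amazon.com/"))
# us-east-1
```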

You must know the usernames of the Quick users to whom you want to send your predictions. You can send predictions to yourself or other users who have the right permissions. Any users to whom you send predictions must be in the `default` [namespace](https://docs.aws.amazon.com/quicksight/latest/user/namespaces.html) of your Quick account and have the `Author` or `Admin` role in Quick.

Additionally, Quick must have access to the SageMaker AI default Amazon S3 bucket for your domain, which is named with the following format: `sagemaker-{REGION}-{ACCOUNT_ID}`. The Region should be the same as your Quick account's home Region and your Canvas application’s Region. To learn how to give Quick access to the batch predictions stored in your Amazon S3 bucket, see the topic [I can’t connect to Amazon S3](https://docs.aws.amazon.com/quicksight/latest/user/troubleshoot-connect-S3.html) in the *Quick User Guide*.
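For example, given the naming format above, the default bucket name for a hypothetical Region and account ID resolves as:

```python
def default_sagemaker_bucket(region, account_id):
    # Default SageMaker AI bucket name format: sagemaker-{REGION}-{ACCOUNT_ID}
    return f"sagemaker-{region}-{account_id}"

print(default_sagemaker_bucket("us-east-1", "111122223333"))
# sagemaker-us-east-1-111122223333
```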

## Supported data formats
<a name="canvas-send-predictions-formatting"></a>

Before sending your predictions, check that the data format of your batch predictions is compatible with Quick.
+ To learn more about the accepted data formats for timeseries data, see [Supported date formats](https://docs.aws.amazon.com/quicksight/latest/user/supported-date-formats.html) in the *Quick User Guide*.
+ To learn more about data values that might prevent you from sending to Quick, see [Unsupported values in data](https://docs.aws.amazon.com/quicksight/latest/user/unsupported-data-values.html) in the *Quick User Guide*.

Also note that Quick uses the character `"` as a text qualifier, so if your Canvas data contains any `"` characters, make sure that you close all matching quotes. Any mismatching quotes can cause issues with sending your dataset to Quick.
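One way to hedge against mismatched quotes is to let a CSV writer apply the `"` text qualifier for you when you prepare data. This sketch uses Python's standard `csv` module, which doubles any embedded `"` characters so every quote has a matching close; the review text is hypothetical:

```python
import csv
import io

rows = [
    ["review", "sentiment"],
    ['She said "great product"', "Positive"],  # embedded " characters
]

buffer = io.StringIO()
# QUOTE_ALL wraps every field in the '"' text qualifier and escapes
# embedded quotes by doubling them ("" inside a quoted field).
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerows(rows)
print(buffer.getvalue())
```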

## Send your batch predictions to Quick
<a name="canvas-send-predictions-send"></a>

Use the following procedure to send your predictions to Quick:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **My models**.

1. On the **My models** page, choose your model.

1. Choose the **Predict** tab.

1. Under **Predictions**, select the dataset (or datasets) of batch predictions that you’d like to share. You can share up to 5 datasets of batch predictions at a time.

1. After you select your dataset, choose **Send to Quick**.
**Note**  
The **Send to Quick** button doesn’t activate unless you select one or more datasets.

   Alternatively, you can preview your predictions by choosing the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) and then **View prediction results**. From the dataset preview, you can choose **Send to Quick**. The following screenshot shows you the **Send to Quick** button in a dataset preview.  
![\[Screenshot of a dataset preview with the Send to Quick button at the bottom.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/send-to-quicksight-preview.png)

1. In the **Send to Quick** dialog box, do the following:

   1. For **QuickSight users**, enter the names of the Quick users to whom you want to send your predictions. If you want to send them to yourself, enter your own username. You can only send predictions to users in the `default` namespace of the Quick account, and the users must have the `Author` or `Admin` role in Quick.

   1. Choose **Send**.

   The following screenshot shows the **Send to Quick** dialog box:  
![\[The Send to Quick dialog box.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/send-to-quicksight.png)

After you send your batch predictions, the **QuickSight** field for the datasets you sent shows as `Sent`. In the confirmation box that appears, you can choose **Open Quick** to open your Quick application. If you’re done using Canvas, you should [log out](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-log-out.html) of the Canvas application.

The Quick users that you’ve sent datasets to can open their Quick application and view the Canvas datasets that have been shared with them. Then, they can create predictive dashboards with the data. For more information, see [Getting started with Quick data analysis](https://docs.aws.amazon.com/quicksight/latest/user/getting-started.html) in the *Quick User Guide*.

By default, all of the users to whom you send predictions have owner permissions for the dataset in Quick. Owners can create analyses and refresh, edit, delete, and re-share the dataset. Changes that owners make to a dataset affect the dataset for all users with access. To change the permissions, go to the dataset in Quick and manage its permissions. For more information, see [Viewing and editing the permissions of users that a dataset is shared with](https://docs.aws.amazon.com/quicksight/latest/user/sharing-data-sets.html#view-users-data-set) in the *Quick User Guide*.

# Download a model notebook
<a name="canvas-notebook"></a>

**Note**  
The model notebook feature is available for quick build and standard build tabular models, and fine-tuned foundation models. Model notebooks aren't supported for image prediction, text prediction, or time series forecasting models.  
If you'd like to generate a model notebook for a tabular model built before this feature was launched, you must rebuild the model to generate a notebook.

For eligible models that you successfully build in Amazon SageMaker Canvas, a Jupyter notebook containing a report of all the model building steps is generated. This Jupyter notebook contains Python code that you can run locally or run in an environment like Amazon SageMaker Studio Classic to replicate the steps necessary to build your model. The notebook can be useful if you’d like to experiment with the code or see the backend details of how Canvas builds models.

To access the model notebook, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **My models**.

1. Choose the model and version that you built.

1. On the model version’s page, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) in the header.

1. From the dropdown menu, choose **View Notebook**.

1. A popup appears with the notebook content. You can choose **Download** and then do one of the following:

   1. Choose **Download** to save the notebook content to your local device.

   1. Choose **Copy S3 URI** to copy the Amazon S3 location where the notebook is stored. The notebook is stored in the Amazon S3 bucket specified in your **Canvas storage configuration**, which is configured in the [Prerequisites for setting up Amazon SageMaker Canvas](canvas-getting-started.md#canvas-prerequisites) section.

You should now be able to view the notebook either locally or as an object in Amazon S3. You can upload the notebook to an IDE to edit and run the code, or you can share the notebook with others in your organization to review.
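If you copied the S3 URI, you can also fetch the notebook programmatically instead of through the browser. The following sketch uses the AWS SDK for Python (Boto3); the URI and file names are hypothetical placeholders for the values from your own Canvas storage configuration:

```python
def parse_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3://bucket/key URI into its bucket and key parts."""
    bucket, _, key = uri.removeprefix("s3://").partition("/")
    return bucket, key

def download_notebook(uri: str, destination: str) -> None:
    """Download the Canvas model notebook from its S3 location."""
    import boto3  # requires AWS credentials with access to the Canvas bucket

    bucket, key = parse_s3_uri(uri)
    boto3.client("s3").download_file(bucket, key, destination)

# Hypothetical URI; paste the one you copied from the Canvas popup:
# download_notebook(
#     "s3://sagemaker-us-east-1-111122223333/Canvas/model-notebook.ipynb",
#     "model-notebook.ipynb",
# )
```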

# Send your model to Amazon QuickSight
<a name="canvas-send-model-to-quicksight"></a>

If you use Amazon QuickSight and want to use SageMaker Canvas models in your QuickSight visualizations, you can build an Amazon SageMaker Canvas model and use it as a *predictive field* in your QuickSight dataset. A *predictive field* is a field in your QuickSight dataset that can make predictions for a given column in your dataset, similar to how Canvas users make single or batch predictions with a model. To learn more about how to integrate Canvas predictive abilities into your QuickSight datasets, see [SageMaker Canvas integration](https://docs.aws.amazon.com/quicksight/latest/user/sagemaker-canvas-integration.html) in the [Amazon QuickSight User Guide](https://docs.aws.amazon.com/quicksight/latest/user/welcome.html).

The following steps explain how you can add a predictive field to your QuickSight dataset using a Canvas model:

1. Open the Canvas application and build a model with your dataset.

1. After building the model in Canvas, send the model to QuickSight. A schema file automatically downloads to your local machine when you send the model to QuickSight. You upload this schema file to QuickSight in the next step.

1. Open QuickSight and choose a dataset with the same schema as the dataset you used to build your model. Add a predictive field to the dataset and do the following:

   1. Specify the model sent from Canvas.

   1. Upload the schema file that was downloaded in Step 2.

1. Save and publish your changes, and then generate predictions for the new dataset. QuickSight uses the model to fill in the target column with predictions.

To send a model from Canvas to QuickSight, you must meet the following prerequisites:
+ You must have both Canvas and QuickSight set up. Your QuickSight account must be created in the same AWS Region as your Canvas application. If your QuickSight account’s home Region differs from your Canvas application’s Region, you must either [close](https://docs.aws.amazon.com/quicksight/latest/user/closing-account.html) and recreate your QuickSight account, or [set up a Canvas application](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-getting-started.html#canvas-prerequisites) in the same Region as your QuickSight account. Your QuickSight account must also contain the default namespace, which you set up when you first create your QuickSight account. Contact your administrator to help you get access to QuickSight. For more information, see [Setting up for QuickSight](https://docs.aws.amazon.com/quicksight/latest/user/setting-up.html) in the *Amazon QuickSight User Guide*.
+ Your user must have the necessary AWS Identity and Access Management (IAM) permissions to send your predictions to QuickSight. Your administrator can set up the IAM permissions for your user. For more information, see [Grant Your Users Permissions to Send Predictions to QuickSight](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-quicksight-permissions.html).
+ QuickSight must have access to the Amazon S3 bucket that you’ve specified for Canvas application storage. For more information, see [Configure your Amazon S3 storage](canvas-storage-configuration.md).

# Time series forecasts in Amazon SageMaker Canvas
<a name="canvas-time-series"></a>

**Note**  
Time series forecasting models are only supported for tabular datasets.

Amazon SageMaker Canvas gives you the ability to use machine learning to make time series forecasts. Time series forecasts let you make predictions for values that vary with time.

You can make a time series forecast for examples such as the following:
+ Your inventory levels in the coming months.
+ The number of items sold in the next four months.
+ The effect of reducing the price on sales during the holiday season.
+ Item inventory in the next 12 months.
+ The number of customers entering a store in the next several hours.
+ How a 10% reduction in the price of a product affects sales over a time period.

To make a time series forecast, your dataset must have the following:
+ A timestamp column with all values having the `datetime` type.
+ A target column that has the values that you're using to forecast future values.
+ An item ID column that contains unique identifiers for each item in your dataset, such as SKU numbers.

The `datetime` values in the timestamp column must use one of the following formats:
+ `YYYY-MM-DD HH:MM:SS`
+ `YYYY-MM-DDTHH:MM:SSZ`
+ `YYYY-MM-DD`
+ `MM/DD/YY`
+ `MM/DD/YY HH:MM`
+ `MM/DD/YYYY`
+ `YYYY/MM/DD HH:MM:SS`
+ `YYYY/MM/DD`
+ `DD/MM/YYYY`
+ `DD/MM/YY`
+ `DD-MM-YY`
+ `DD-MM-YYYY`

You can make forecasts for the following intervals:
+ 1 min
+ 5 min
+ 15 min
+ 30 min
+ 1 hour
+ 1 day
+ 1 week
+ 1 month
+ 1 year

## Future values in your input dataset
<a name="canvas-time-series-future"></a>

Canvas automatically detects columns in your dataset that might contain future values. If present, these values can enhance the accuracy of predictions. Canvas marks these columns with a `Future values` label. Canvas infers the relationship between the data in these columns and the target column that you are trying to predict, and uses that relationship to generate more accurate forecasts.

For example, you can forecast the amount of ice cream sold by a grocery store. To make a forecast, you must have a timestamp column and a column that indicates how much ice cream the grocery store sold. For a more accurate forecast, your dataset can also include the price, the ambient temperature, the flavor of the ice cream, or a unique identifier for the ice cream.

Ice cream sales might increase when the weather is warmer. A decrease in the price of the ice cream might result in more units sold. Having a column with ambient temperature data and a column with pricing data can improve your ability to forecast the number of units of ice cream the grocery store sells.

While providing future values is optional, it helps you to perform what-if analyses directly in the Canvas application, showing you how changes in future values could alter your predictions.
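In practice, supplying future values amounts to appending rows whose timestamps fall in the forecast horizon, with the known columns (such as a planned price) filled in and the target left empty. A hypothetical sketch of such a dataset in pandas, using the ice cream example:

```python
import pandas as pd

# Historical rows: timestamp, item ID, price (a "future values" column), and the target.
history = pd.DataFrame({
    "timestamp": ["2023-06-01", "2023-06-02"],
    "item_id": ["vanilla", "vanilla"],
    "price": [3.50, 3.50],
    "units_sold": [40, 42],
})

# Future rows carry known prices (e.g. a planned price cut) but leave the
# target empty, so the forecast can be conditioned on them.
future = pd.DataFrame({
    "timestamp": ["2023-06-03", "2023-06-04"],
    "item_id": ["vanilla", "vanilla"],
    "price": [2.99, 2.99],
    "units_sold": [None, None],
})

dataset = pd.concat([history, future], ignore_index=True)
print(dataset)
```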

## Handling missing values
<a name="canvas-time-series-missing"></a>

You might have missing data for different reasons. The reason for your missing data might inform how you want Canvas to impute it. For example, your organization might use an automatic system that only tracks when a sale happens. If you're using a dataset that comes from this type of automatic system, you have missing values in the target column.

**Important**  
If you have missing values in the target column, we recommend using a dataset that doesn't have them. SageMaker Canvas uses the target column to forecast future values. Missing values in the target column can greatly reduce the accuracy of the forecast.

If your dataset has missing values, Canvas automatically imputes them for you by filling the target column with `0` and other numeric columns with the median value of the column.

However, you can select your own filling logic for the target column and other numeric columns in your datasets. Target columns have different filling guidelines and restrictions than the rest of the numeric columns. Target columns are filled up to the end of the historical period, whereas numeric columns are filled across both historical and future periods all the way to the end of the forecast horizon. Canvas only fills future values in a numeric column if your data has at least one record with a future timestamp and a value for that specific column.

You can choose one of the following filling logic options to impute missing values in your data:
+ `zero` – Fill with `0`.
+ `NaN` – Fill with NaN, or not a number. This is only supported for the target column.
+ `mean` – Fill with the mean value from the data series.
+ `median` – Fill with the median value from the data series.
+ `min` – Fill with the minimum value from the data series.
+ `max` – Fill with the maximum value from the data series.

When choosing a filling logic, you should consider how your model interprets the logic. For example, in a retail scenario, recording zero sales of an available item is different from recording zero sales of an unavailable item, as the latter scenario doesn’t necessarily imply a lack of customer interest in the unavailable item. In this case, filling with `0` in the target column of the dataset might cause the model to be under-biased in its predictions and infer a lack of customer interest in unavailable items. Conversely, filling with `NaN` might cause the model to ignore true occurrences of zero items being sold of available items.
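To see how the filling options behave on concrete data, the following sketch approximates them with pandas (this mirrors the option definitions above; it is not Canvas's internal implementation):

```python
import pandas as pd

# A target column with two missing values.
series = pd.Series([10.0, None, 14.0, None, 16.0])

# Approximations of the Canvas filling options:
filled = {
    "zero":   series.fillna(0),
    "mean":   series.fillna(series.mean()),
    "median": series.fillna(series.median()),
    "min":    series.fillna(series.min()),
    "max":    series.fillna(series.max()),
    # "NaN" corresponds to leaving the gaps as-is (target column only).
}
print(filled["median"].tolist())  # [10.0, 14.0, 14.0, 14.0, 16.0]
```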

## Types of forecasts
<a name="canvas-time-series-types"></a>

You can make one of the following types of forecasts:
+ **Single item**
+ **All items**

For a forecast on all the items in your dataset, SageMaker Canvas returns a forecast for the future values for each item in your dataset.

For a single item forecast, you specify the item and SageMaker Canvas returns a forecast for the future values. The forecast includes a line graph that plots the predicted values over time.

**Topics**
+ [Future values in your input dataset](#canvas-time-series-future)
+ [Handling missing values](#canvas-time-series-missing)
+ [Types of forecasts](#canvas-time-series-types)
+ [Additional options for forecasting insights](canvas-additional-insights.md)

# Additional options for forecasting insights
<a name="canvas-additional-insights"></a>

In Amazon SageMaker Canvas, you can use the following optional methods to get more insights from your forecast:
+ Group column
+ Holiday schedule
+ What-if scenario

You can specify a column in your dataset as a **Group column**. Amazon SageMaker Canvas groups the forecast by each value in the column. For example, you can group the forecast on columns containing price data or unique item identifiers. Grouping a forecast by a column lets you make more specific forecasts. For example, if you group a forecast on a column containing item identifiers, you can see the forecast for each item.

Overall sales of items might be affected by holidays. For example, in the United States, the number of items sold in November and December might differ greatly from the number sold in January. If you use the data from November and December to forecast the sales in January, your results might be inaccurate. Using a holiday schedule helps prevent inaccurate results. You can use a holiday schedule for 251 countries.

For a forecast on a single item in your dataset, you can use what-if scenarios. A what-if scenario gives you the ability to change values in your data and see how the forecast changes. For example, you can use a what-if scenario to answer questions such as, "What if I lowered prices? How would that affect the number of items sold?"

# Adding model versions in Amazon SageMaker Canvas
<a name="canvas-update-model"></a>

In Amazon SageMaker Canvas, you can update the models that you’ve built by adding *versions*. Each model that you build has a version number. The first model is version 1 or `V1`. You can use model versions to see changes in prediction accuracy when you update your data or use [advanced transformations](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-prepare-data.html).

When viewing your model, SageMaker Canvas shows you the model history so that you can compare all of the model versions that you built. You can also delete versions that are no longer useful to you. By creating multiple model versions and evaluating their accuracy, you can iteratively improve your model performance.

**Note**  
Text prediction and image prediction models only support one model version.

To add a model version, you can either clone an existing version or create a new version. 

Cloning an existing version copies over the current model configuration, including the model recipe and the input dataset. Alternatively, you can create a new version if you want to configure a new model recipe or choose a different dataset. 

If you create a new version and select a different dataset, you must choose a dataset with the same target column and schema as the dataset from version 1.

Before you can add a new version, you must successfully build at least one model version. Then, you can [register a model version in the SageMaker Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-register-model.html). Use the registry for tracking model versions and for collaborating with Studio Classic users on production model approvals.

If you did a quick build for your first model version, you have the option to run a standard build when you add a version. Standard builds generally have higher accuracy. Therefore, if you feel confident in your quick build configuration, you can run a standard build to create a final version of your model. To learn more about the differences between quick builds and standard builds, see [How custom models work](canvas-build-model.md).

The following procedures show you how to add model versions. The procedure differs depending on whether you're adding a version of the same build type or a different build type (quick versus standard). Use the procedure **To add a new model version** to add versions of the same build type. To add a standard build model version after running a quick build, follow the procedure **To run a standard build**.

**To add a new model version**

1. Open your SageMaker Canvas application. For more information, see [Getting started with using Amazon SageMaker Canvas](canvas-getting-started.md).

1. In the left navigation pane, choose **My models**.

1. On the **My models** page, choose your model. To find your model, you can choose **Filter by problem type**.

1. After your model opens, choose the **Add version** button in the top panel.

1. From the dropdown menu, select one of the following options:

   1. **Add a new version from scratch** – When you select this option, the **Build** tab opens with the draft for a new model version. You can select a different dataset (as long as the schema matches the schema of the first model version’s dataset) and configure a new model recipe. For more information about building a model version, see [Build a model](canvas-build-model-how-to.md).

   1. **Clone an existing version with configurations** – A dialog box prompts you to select the version that you want to clone. After you've selected your desired version, choose **Clone**. The **Build** tab opens with the draft for a new model version. Any model recipe configurations are copied over from the cloned version. For more information about building a model version, see [Build a model](canvas-build-model-how-to.md).

**To run a standard build**

1. Open your SageMaker Canvas application. For more information, see [Getting started with using Amazon SageMaker Canvas](canvas-getting-started.md).

1. In the left navigation pane, choose **My models**.

1. On the **My models** page, choose your model. You can choose **Filter by problem type** to find your model more easily.

1. After your model opens, choose the **Analyze** tab.

1. Choose **Standard build**.  
![\[The Analyze tab of a Canvas model showing the standard build button.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-add-version-quick-to-standard.png)

   On the model draft page that opens to the **Build** tab, you can modify your model configuration and start a build. For more information about building a model version, see [Build a model](canvas-build-model-how-to.md).

You should now have a new model version build in progress. For more information about building a model, see [How custom models work](canvas-build-model.md).

After building a model version, you can return to your model details page at any time to view all of the versions or add more versions. The following image shows the **Versions** page for a model.

![\[The model versions page for a model in Canvas.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/model-versions.png)


On the **Versions** page, you can view the following information for each of your model versions:
+ **Status** – This field tells you whether your model is currently building (`In building`), done building (`Ready`), failed to build (`Failed`), or still being edited (`In draft`).
+ **Model score**, **F1**, **Precision**, **Recall**, and **AUC** – If you turn on the **Show advanced metrics** toggle on this page, you can see these model metrics. These metrics indicate the accuracy and performance of your model. For more information, see [Evaluate your model](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-evaluate-model.html).
+ **Shared** – This field states whether you shared the model version with SageMaker Studio Classic users.
+ **Model registry** – This field states whether you registered the version to a model registry. For more information, see [Register a model version in the SageMaker AI model registry](canvas-register-model.md).
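The advanced metrics derive from the model's confusion-matrix counts. As a reference sketch (not Canvas's internal code), precision, recall, and F1 can be computed from true positives, false positives, and false negatives:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: 80 true positives, 20 false positives, 10 false negatives.
p, r, f = precision_recall_f1(80, 20, 10)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.8 0.889 0.842
```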

# MLOps
<a name="canvas-mlops"></a>

After building a model in SageMaker Canvas that you feel confident about, you might want to integrate your model with the machine learning operations (MLOps) processes in your organization. MLOps includes common tasks such as deploying a model for use in production or setting up continuous integration and continuous deployment (CI/CD) pipelines.

The following topics describe how you can use features within Canvas to use a Canvas-built model in production.

**Topics**
+ [Register a model version in the SageMaker AI model registry](canvas-register-model.md)
+ [Deploy your models to an endpoint](canvas-deploy-model.md)
+ [View your deployments](canvas-deploy-model-view.md)
+ [Update a deployment configuration](canvas-deploy-model-update.md)
+ [Test your deployment](canvas-deploy-model-test.md)
+ [Invoke your endpoint](canvas-deploy-model-invoke.md)
+ [Delete a model deployment](canvas-deploy-model-delete.md)

# Register a model version in the SageMaker AI model registry
<a name="canvas-register-model"></a>

With SageMaker Canvas, you can build multiple iterations, or versions, of your model to improve it over time. You might want to build a new version of your model if you acquire better training data or if you want to attempt to improve the model’s accuracy. For more information about adding versions to your model, see [Update a model](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-update-model.html).

After you’ve [built a model](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-build-model.html) that you feel confident about, you might want to evaluate its performance and have it reviewed by a data scientist or MLOps engineer in your organization before using it in production. To do this, you can register your model versions to the [SageMaker Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html). The SageMaker Model Registry is a repository that data scientists or engineers can use to catalog machine learning (ML) models and manage model versions and their associated metadata, such as training metrics. They can also manage and log the approval status of a model.

After you register your model versions to the SageMaker Model Registry, a data scientist or your MLOps team can access the SageMaker Model Registry through [SageMaker Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html), which is a web-based integrated development environment (IDE) for working with machine learning models. In the SageMaker Model Registry interface in Studio Classic, the data scientist or MLOps team can evaluate your model and update its approval status. If the model doesn’t perform to their requirements, the data scientist or MLOps team can update the status to `Rejected`. If the model does perform to their requirements, then the data scientist or MLOps team can update the status to `Approved`. Then, they can [deploy your model to an endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html#deploy-model-prereqs) or [automate model deployment](https://aws.amazon.com/blogs/machine-learning/building-automating-managing-and-scaling-ml-workflows-using-amazon-sagemaker-pipelines/) with CI/CD pipelines. You can use the SageMaker AI model registry feature to seamlessly integrate models built in Canvas with the MLOps processes in your organization.

The following diagram summarizes an example of registering a model version built in Canvas to the SageMaker Model Registry for integration into an MLOps workflow.

![\[The steps registering a model version built in Canvas for integration into an MLOps workflow.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-model-registration-diagram.jpg)


You can register tabular, image, and text model versions to the SageMaker Model Registry. This includes time series forecasting models and JumpStart-based [fine-tuned foundation models](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-fm-chat-fine-tune.html).

**Note**  
Currently, you can't register Amazon Bedrock-based fine-tuned foundation models built in Canvas to the SageMaker Model Registry.

The following sections show you how to register a model version to the SageMaker Model Registry from Canvas.

## Permissions management
<a name="canvas-register-model-prereqs"></a>

By default, you have permissions to register model versions to the SageMaker Model Registry. SageMaker AI grants these permissions for all new and existing Canvas user profiles through the [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasFullAccess.html) policy, which is attached to the AWS IAM execution role for the SageMaker AI domain that hosts your Canvas application.

When your Canvas administrator sets up a new domain or user profile and follows the prerequisite instructions in the [Getting started guide](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-getting-started.html#canvas-prerequisites), SageMaker AI turns on the model registration permissions through the **ML Ops permissions configuration** option, which is enabled by default.

The Canvas administrator can manage model registration permissions at the user profile level as well. For example, if the administrator wants to grant model registration permissions to some user profiles but remove permissions for others, they can edit the permissions for a specific user. The following procedure shows how to turn off model registration permissions for a specific user profile:

1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. From the list of domains, select the user profile’s domain.

1. On the **Domain details** page, choose the **User profile** whose permissions you want to edit.

1. On the **User Details** page, choose **Edit**.

1. In the left navigation pane, choose **Canvas settings**.

1. In the **ML Ops permissions configuration** section, turn off the **Enable Model Registry registration permissions** toggle.

1. Choose **Submit** to save the changes to your domain settings.

The user profile should no longer have model registration permissions.

## Register a model version to the SageMaker AI model registry
<a name="canvas-register-model-register"></a>

SageMaker Model Registry tracks all of the model versions that you build to solve a particular problem in a *model group*. When you build a SageMaker Canvas model and register it to SageMaker Model Registry, it gets added to a model group as a new model version. For example, if you build and register four versions of your model, then a data scientist or MLOps team working in the SageMaker Model Registry interface can view the model group and review all four versions of the model in one place.

When registering a Canvas model to the SageMaker Model Registry, a model group is automatically created and named after your Canvas model. Optionally, you can rename it to a name of your choice, or use an existing model group in the SageMaker Model Registry. For more information about creating a model group, see [Create a Model Group](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry-model-group.html).

**Note**  
Currently, you can only register models built in Canvas to the SageMaker Model Registry in the same account.

To register a model version to the SageMaker Model Registry from the Canvas application, use the following procedure:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **My models**.

1. On the **My models** page, choose your model. You can **Filter by problem type** to find your model more easily.

1. After you choose your model, the **Versions** page opens, listing all of the versions of your model. You can turn on the **Show advanced metrics** toggle to view advanced metrics, such as **Recall** and **Precision**, to compare your model versions and determine which one you’d like to register.

1. From the list of model versions, for the version that you want to register, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)). Alternatively, you can double-click the version that you want to register, and then, on the version details page, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)).

1. In the dropdown list, choose **Add to Model Registry**. The **Add to Model Registry** dialog box opens.

1. In the **Add to Model Registry** dialog box, do the following:

   1. (Optional) In the **SageMaker Studio Classic model group** section, for the **Model group name** field, enter the name of the model group to which you want to register your version. You can specify the name for a new model group that SageMaker AI creates for you, or you can specify an existing model group. If you don’t specify this field, Canvas registers your version to a default model group with the same name as your model.

   1. Choose **Add**.

Your model version should now be registered to the model group in the SageMaker Model Registry. When you register a model version to a model group in the SageMaker Model Registry, all subsequent versions of the Canvas model are registered to the same model group (if you choose to register them). If you want to register your versions to a different model group, you must go to the SageMaker Model Registry and [delete the existing model group](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry-delete-model-group.html). Then, you can re-register your model versions to the new model group.
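Outside the Canvas application, a data scientist can also inspect the group programmatically with the SageMaker `ListModelPackages` API. A sketch using the AWS SDK for Python (Boto3), with a hypothetical model group name:

```python
def summarize_versions(response: dict) -> list[tuple[int, str]]:
    """Extract (version, approval status) pairs from a ListModelPackages response."""
    return [
        (pkg["ModelPackageVersion"], pkg["ModelApprovalStatus"])
        for pkg in response["ModelPackageSummaryList"]
    ]

def list_registered_versions(group_name: str) -> list[tuple[int, str]]:
    """List the registered versions in a model group via the SageMaker API."""
    import boto3  # requires AWS credentials in the account that holds the registry

    sm = boto3.client("sagemaker")
    return summarize_versions(sm.list_model_packages(ModelPackageGroupName=group_name))

# Hypothetical group name; by default, the group is named after your Canvas model:
# print(list_registered_versions("my-canvas-model"))
```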

To view the status of your models, you can return to the **Versions** page for your model in the Canvas application. This page shows you the **Model Registry** status of each version. If the status is `Registered`, then the model has been successfully registered.

If you want to view the details of your registered model version, hover over the **Registered** field in the **Model Registry** status column to see the **Model registry details** pop-up box. These details contain more information, such as the following:
+ The **Model package group name** is the model group that your version is registered to in the SageMaker Model Registry.
+ The **Approval status**, which can be `Pending Approval`, `Approved`, or `Rejected`. If a Studio Classic user approves or rejects your version in the SageMaker Model Registry, then this status is updated on your model versions page when you refresh the page.

The following screenshot shows the **Model registry details** box, along with an **Approval status** of `Approved` for this particular model version.

![\[Screenshot of the SageMaker Model Registry details box in the Canvas application.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/approved-mr.png)
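Reviewers can also set the approval status programmatically with the SageMaker `UpdateModelPackage` API instead of the Studio Classic interface. A sketch using Boto3; the model package ARN below is a hypothetical placeholder:

```python
VALID_STATUSES = {"PendingManualApproval", "Approved", "Rejected"}

def approval_update_args(arn: str, status: str, note: str) -> dict:
    """Build and validate the arguments for an UpdateModelPackage call."""
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {sorted(VALID_STATUSES)}")
    return {
        "ModelPackageArn": arn,
        "ModelApprovalStatus": status,
        "ApprovalDescription": note,
    }

def set_approval(arn: str, status: str, note: str) -> None:
    """Update a registered model version's approval status."""
    import boto3  # requires credentials with sagemaker:UpdateModelPackage

    boto3.client("sagemaker").update_model_package(**approval_update_args(arn, status, note))

# Hypothetical ARN; copy the real one from the Model Registry in Studio Classic:
# set_approval(
#     "arn:aws:sagemaker:us-east-1:111122223333:model-package/my-canvas-model/1",
#     "Approved",
#     "Meets the accuracy bar for production use.",
# )
```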


# Deploy your models to an endpoint
<a name="canvas-deploy-model"></a>

In Amazon SageMaker Canvas, you can deploy your models to an endpoint to make predictions. SageMaker AI provides the ML infrastructure for you to host your model on an endpoint with the compute instances that you choose. Then, you can *invoke* the endpoint (send a prediction request) and get a real-time prediction from your model. With this functionality, you can use your model in production to respond to incoming requests, and you can integrate your model with existing applications and workflows.
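Once an endpoint is active, invoking it is a standard SageMaker AI runtime call. The following sketch assumes a hypothetical endpoint name and a tabular model that accepts CSV input; the exact content type and payload depend on your model's input schema:

```python
def csv_row(values: list) -> str:
    """Serialize one record as the CSV line a tabular endpoint expects."""
    return ",".join(str(v) for v in values) + "\n"

def predict(endpoint_name: str, values: list) -> str:
    """Send one record to a deployed model endpoint and return the raw response."""
    import boto3  # requires credentials with sagemaker:InvokeEndpoint

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=csv_row(values),
    )
    return response["Body"].read().decode("utf-8")

# Hypothetical endpoint name and feature values; match your model's input schema:
# print(predict("canvas-my-model-endpoint", [34, 120.5, "store-7"]))
```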

To get started, you should have a model that you'd like to deploy. You can deploy custom model versions that you've built, Amazon SageMaker JumpStart foundation models, and fine-tuned JumpStart foundation models. For more information about building a model in Canvas, see [How custom models work](canvas-build-model.md). For more information about JumpStart foundation models in Canvas, see [Generative AI foundation models in SageMaker Canvas](canvas-fm-chat.md).

Review the following **Permissions management** section, and then begin creating new deployments in the **Deploy a model** section.

## Permissions management
<a name="canvas-deploy-model-prereqs"></a>

By default, you have permissions to deploy models to SageMaker AI Hosting endpoints. SageMaker AI grants these permissions for all new and existing Canvas user profiles through the [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasFullAccess.html) policy, which is attached to the AWS IAM execution role for the SageMaker AI domain that hosts your Canvas application.

When your Canvas administrator sets up a new domain or user profile and follows the prerequisite instructions in [Prerequisites for setting up Amazon SageMaker Canvas](canvas-getting-started.md#canvas-prerequisites), SageMaker AI turns on the model deployment permissions through the **Enable direct deployment of Canvas models** option, which is enabled by default.

The Canvas administrator can manage model deployment permissions at the user profile level as well. For example, if the administrator doesn't want to grant model deployment permissions to all user profiles when setting up a domain, they can grant permissions to specific users after creating the domain.

The following procedure shows how to modify the model deployment permissions for a specific user profile:

1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. From the list of domains, select the user profile’s domain.

1. On the **Domain details** page, select the **User profiles** tab.

1. Choose your **User profile**.

1. On the user profile's page, select the **App Configurations** tab.

1. In the **Canvas** section, choose **Edit**.

1. In the **ML Ops configuration** section, turn on the **Enable direct deployment of Canvas models** toggle to enable deployment permissions.

1. Choose **Submit** to save the changes to your domain settings.

The user profile should now have model deployment permissions.

After granting permissions to the domain or user profile, make sure that the user logs out of their Canvas application and logs back in to apply the permission changes.
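If you administer many user profiles, you might also want to verify these permissions from code. The following is a minimal sketch (not an official procedure; the domain ID is a placeholder) that uses boto3 to check whether the `AmazonSageMakerCanvasFullAccess` policy is attached to a domain's default execution role:

```
CANVAS_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonSageMakerCanvasFullAccess"

def has_canvas_policy(attached_policy_arns):
    # Pure check: is the Canvas managed policy among the attached policy ARNs?
    return CANVAS_POLICY_ARN in attached_policy_arns

def domain_has_canvas_policy(domain_id):
    # boto3 is imported lazily so the helper above works without the AWS SDK.
    import boto3
    sm = boto3.client("sagemaker")
    iam = boto3.client("iam")
    # Look up the domain's default execution role, then its attached policies.
    role_arn = sm.describe_domain(DomainId=domain_id)["DefaultUserSettings"]["ExecutionRole"]
    role_name = role_arn.split("/")[-1]
    attached = iam.list_attached_role_policies(RoleName=role_name)["AttachedPolicies"]
    return has_canvas_policy([p["PolicyArn"] for p in attached])

# Example (domain ID is a placeholder):
# domain_has_canvas_policy("d-xxxxxxxxxxxx")
```

Note that a user profile can override the domain's execution role, so a per-profile check would read the role from `describe_user_profile` instead.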

## Deploy a model
<a name="canvas-deploy-model-deploy"></a>

To get started with deploying your model, you create a new deployment in Canvas and specify the model version that you want to deploy along with the ML infrastructure, such as the type and number of compute instances that you would like to use for hosting the model.

Canvas suggests a default instance type and number of instances based on your model type. To learn more about the various SageMaker AI instance types, see the [Amazon SageMaker pricing page](https://aws.amazon.com/sagemaker/pricing/). You are charged based on SageMaker AI instance pricing while your endpoint is active.

When deploying JumpStart foundation models, you also have the option to specify the length of the deployment time. You can deploy the model to an endpoint indefinitely (meaning the endpoint is active until you delete the deployment). Or, if you only need the endpoint for a short period of time and would like to reduce costs, you can deploy the model to an endpoint for a specified amount of time, after which SageMaker AI shuts down the endpoint for you.

**Note**  
If you deploy a model for a specified amount of time, stay logged in to the Canvas application for as long as the endpoint is active. If you log out of or delete the application, then Canvas can't shut down the endpoint at the specified time.

After your model is deployed to a SageMaker AI Hosting [real-time inference endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html), you can begin making predictions by *invoking* the endpoint.

There are several different ways for you to deploy a model from the Canvas application. You can access the model deployment option through any of the following methods:
+ On the **My models** page of the Canvas application, choose the model that you want to deploy. Then, from the model’s **Versions** page, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) next to a model version and select **Deploy**.
+ When on the details page for a model version, on the **Analyze** tab, choose the **Deploy** option.
+ When on the details page for a model version, on the **Predict** tab, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) at the top of the page and select **Deploy**.
+ On the **ML Ops** page of the Canvas application, choose the **Deployments** tab and then choose **Create deployment**.
+ For JumpStart foundation models and fine-tuned foundation models, go to the **Ready-to-use models** page of the Canvas application. Choose **Generate, extract and summarize content**. Then, find the JumpStart foundation model or fine-tuned foundation model that you want to deploy. Choose the model, and on the model's chat page, choose the **Deploy** button.

All of these methods open the **Deploy model** side panel, where you specify the deployment configuration for your model. To deploy the model from this panel, do the following:

1. (Optional) If you’re creating a deployment from the **ML Ops** page, you’ll have the option to **Select model and version**. Use the dropdown menus to select the model and model version that you want to deploy.

1. Enter a name in the **Deployment name** field.

1. (For JumpStart foundation models and fine-tuned foundation models only) Choose a **Deployment length**. Select **Indefinite** to leave the endpoint active until you shut it down, or select **Specify length** and then enter the period of time for which you want the endpoint to remain active.

1. For **Instance type**, SageMaker AI detects a default instance type and number that is suitable for your model. However, you can change the instance type that you would like to use for hosting your model.
**Note**  
If you run out of the instance quota for the chosen instance type on your AWS account, you can request a quota increase. For more information about the default quotas and how to request an increase, see [Amazon SageMaker AI endpoints and quotas](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html) in the *AWS General Reference guide*.

1. For **Instance count**, you can set the number of active instances that are used for your endpoint. SageMaker AI detects a default number that is suitable for your model, but you can change this number.

1. When you’re ready to deploy your model, choose **Deploy**.

Your model should now be deployed to an endpoint.
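Deployment takes a few minutes to complete. If you want to confirm from code that the endpoint has finished creating, a minimal sketch using the SageMaker AI `DescribeEndpoint` waiter (the endpoint name is a placeholder) might look like this:

```
def is_terminal_status(status):
    # Endpoint creation ends in one of these two statuses.
    return status in ("InService", "Failed")

def wait_until_in_service(endpoint_name):
    # boto3 is imported lazily so the helper above works without the AWS SDK.
    import boto3
    sm = boto3.client("sagemaker")
    # The built-in waiter polls DescribeEndpoint until the endpoint is ready.
    sm.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)
    return sm.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]

# Example (endpoint name is a placeholder):
# wait_until_in_service("canvas-deployment-endpoint")
```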

# View your deployments
<a name="canvas-deploy-model-view"></a>

You might want to check the status or details of a model deployment in Amazon SageMaker Canvas. For example, if your deployment failed, you might want to check the details to troubleshoot.

You can view your Canvas model deployments from the Canvas application or from the Amazon SageMaker AI console.

To view deployment details from Canvas, use one of the following procedures:

To view your deployment details from the **ML Ops** page, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation pane, choose **ML Ops**.

1. Choose the **Deployments** tab.

1. Choose your deployment by name from the list.

To view your deployment details from a model version’s page, do the following:

1. In the SageMaker Canvas application, go to your model version’s details page.

1. Choose the **Deploy** tab.

1. In the **Deployments** section, which lists all of the deployment configurations associated with that model version, find your deployment.

1. Choose the **More options** icon (![\[More options icon for the output CSV file.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), and then select **View details** to open the details page.

The details page for your deployment opens, and you can view information such as the time of the most recent prediction, the endpoint’s status and configuration, and the model version that is currently deployed to the endpoint.

You can also view your currently active Canvas workspace instances and active endpoints from the **SageMaker AI dashboard** in the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/). Your Canvas endpoints are listed alongside any other SageMaker AI Hosting endpoints that you’ve created, and you can filter them by searching for endpoints with the Canvas tag.

The following screenshot shows the SageMaker AI dashboard. In the **Canvas** section, you can see that one workspace instance is in service and four endpoints are active.

![\[Screenshot of the SageMaker AI dashboard showing the active Canvas workspace instances and endpoints.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-sagemaker-dashboard.png)
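You can also list your endpoints programmatically. The sketch below pages through all endpoints in the account and keeps the ones carrying a given tag key; the exact tag key that Canvas applies is an assumption here, so check the tags on one of your Canvas endpoints in the console first:

```
def has_tag(tags, key):
    # Pure check against a list of {"Key": ..., "Value": ...} tag dicts.
    return any(t.get("Key") == key for t in tags)

def list_endpoints_with_tag(tag_key):
    # boto3 is imported lazily so the helper above works without the AWS SDK.
    import boto3
    sm = boto3.client("sagemaker")
    matched = []
    for page in sm.get_paginator("list_endpoints").paginate():
        for endpoint in page["Endpoints"]:
            tags = sm.list_tags(ResourceArn=endpoint["EndpointArn"])["Tags"]
            if has_tag(tags, tag_key):
                matched.append(endpoint["EndpointName"])
    return matched
```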


# Update a deployment configuration
<a name="canvas-deploy-model-update"></a>

You can update the deployment configuration for models that you've deployed to endpoints in Amazon SageMaker Canvas. For example, you can deploy an updated model version to the endpoint, or you can update the instance type or number of instances behind the endpoint based on your capacity needs.

There are several different ways for you to update your deployment from the Canvas application. You can use any of the following methods:
+ On the **ML Ops** page of the Canvas application, you can choose the **Deployments** tab and select the deployment that you want to update. Then, choose **Update configuration**.
+ When on the details page for a model version, on the **Deploy** tab, you can view the deployments for that version. Next to the deployment, choose the **More options** icon (![\[More options icon for the output CSV file.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) and then choose **Update configuration**.

Both of the preceding methods open the **Update configuration** side panel, where you can make changes to your deployment configuration. To update the configuration, do the following:

1. For the **Select version** dropdown menu, you can select a different model version to deploy to the endpoint.
**Note**  
When updating a deployment configuration, you can only choose a different model version to deploy. To deploy a different model, create a new deployment.

1. For **Instance type**, you can select a different instance type for hosting your model.

1. For **Instance count**, you can change the number of active instances that are used for your endpoint.

1. Choose **Save**.

Your deployment configuration should now be updated.

# Test your deployment
<a name="canvas-deploy-model-test"></a>

You can test a model deployment by invoking the endpoint, or making single prediction requests, through the Amazon SageMaker Canvas application. You can use this functionality to confirm that your endpoint responds to requests before invoking your endpoint programmatically in a production environment.

## Test a custom model deployment
<a name="canvas-deploy-model-test-custom"></a>

You can test a custom model deployment by accessing it through the **ML Ops** page and making a single invocation, which returns a prediction along with the probability that the prediction is correct.

**Note**  
Execution length is an estimate of the time taken to invoke and get a response from the endpoint in Canvas. For detailed latency metrics, see [SageMaker AI Endpoint Invocation Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-endpoint-invocation).

To test your endpoint through the Canvas application, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation panel, choose **ML Ops**.

1. Choose the **Deployments** tab.

1. From the list of deployments, choose the one with the endpoint that you want to invoke.

1. On the deployment’s details page, choose the **Test deployment** tab.

1. On the deployment testing page, you can modify the **Value** fields to specify a new data point. For time series forecasting models, you specify the **Item ID** for which you want to make a forecast.

1. After modifying the values, choose **Update** to get the prediction result.

The prediction loads, along with the **Invocation result** fields, which indicate whether the invocation was successful and how long the request took to process.

The following screenshot shows a prediction performed in the Canvas application on the **Test deployment** tab.

![\[The Canvas application showing a test prediction for a deployed model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-test-deployments.png)


For all model types except numeric prediction and time series forecasting, the prediction returns the following fields:
+  **predicted\_label** – the predicted output
+  **probability** – the probability that the predicted label is correct
+  **labels** – the list of all the possible labels
+  **probabilities** – the probabilities corresponding to each label (the order of this list matches the order of the labels)

For numeric prediction models, the prediction only contains the **score** field, which is the predicted output of the model, such as the predicted price of a house.

For time series forecasting models, the prediction is a graph showing the forecasts by quantile. You can choose **Schema view** to see the forecasted numeric values for each quantile.
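As a concrete illustration, the label fields described above can be combined to recover the top prediction; the `sample` payload below is fabricated for illustration:

```
def top_prediction(payload):
    # Pair each label with its probability (the two lists share an order)
    # and return the most likely label together with its probability.
    pairs = sorted(zip(payload["labels"], payload["probabilities"]),
                   key=lambda pair: pair[1], reverse=True)
    return pairs[0]

# Fabricated example payload matching the fields described above:
sample = {
    "predicted_label": "yes",
    "probability": 0.82,
    "labels": ["yes", "no"],
    "probabilities": [0.82, 0.18],
}
label, probability = top_prediction(sample)  # ("yes", 0.82)
```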

You can continue making single predictions through the deployment testing page, or you can see the following section [Invoke your endpoint](canvas-deploy-model-invoke.md) to learn how to invoke your endpoint programmatically from applications.

## Test a JumpStart foundation model deployment
<a name="canvas-deploy-model-test-js"></a>

You can chat with a deployed JumpStart foundation model through the Canvas application to test its functionality before invoking it through code.

To chat with a deployed JumpStart foundation model, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation panel, choose **ML Ops**.

1. Choose the **Deployments** tab.

1. From the list of deployments, find the one that you want to invoke and choose its **More options** icon (![\[More options icon for a model deployment.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)).

1. From the context menu, choose **Test deployment**.

1. A new **Generate, extract and summarize content** chat opens with the JumpStart foundation model, and you can begin typing prompts. Note that prompts from this chat are sent as requests to your SageMaker AI Hosting endpoint.

# Invoke your endpoint
<a name="canvas-deploy-model-invoke"></a>

**Note**  
We recommend that you [test your model deployment in Amazon SageMaker Canvas](canvas-deploy-model-test.md) before invoking a SageMaker AI endpoint programmatically.

You can use your Amazon SageMaker Canvas models that you've deployed to a SageMaker AI endpoint in production with your applications. Invoke the endpoint programmatically the same way that you invoke any other [SageMaker AI real-time endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html). Invoking an endpoint programmatically returns a response object which contains the same fields described in [Test your deployment](canvas-deploy-model-test.md).

For more detailed information about how to programmatically invoke endpoints, see [Invoke models for real-time inference](realtime-endpoints-test-endpoints.md).

The following Python examples show you how to invoke your endpoint based on the model type.

## JumpStart foundation models
<a name="canvas-invoke-js-example"></a>

The following example shows you how to invoke a JumpStart foundation model that you've deployed to an endpoint.

```
import boto3
import pandas as pd

client = boto3.client("runtime.sagemaker")
# This body is illustrative. The payload format and content type that a
# JumpStart foundation model accepts vary by model, so check the model's
# documentation for the expected request schema.
body = pd.DataFrame(
    [['feature_column1', 'feature_column2'], 
    ['feature_column1', 'feature_column2']]
).to_csv(header=False, index=False).encode("utf-8")
    
response = client.invoke_endpoint(
    EndpointName="endpoint_name",
    ContentType="text/csv",
    Body=body,
    Accept="application/json"
)
```

## Numeric and categorical prediction models
<a name="canvas-invoke-tabular-example"></a>

The following example shows you how to invoke numeric or categorical prediction models.

```
import boto3
import pandas as pd

client = boto3.client("runtime.sagemaker")
body = pd.DataFrame(
    [['feature_column1', 'feature_column2'],
    ['feature_column1', 'feature_column2']]
).to_csv(header=False, index=False).encode("utf-8")
    
response = client.invoke_endpoint(
    EndpointName="endpoint_name",
    ContentType="text/csv",
    Body=body,
    Accept="application/json"
)
```

## Time series forecasting models
<a name="canvas-invoke-forecast-example"></a>

The following example shows you how to invoke time series forecasting models. For a complete example of how to invoke a time series forecasting model, see [Time-Series Forecasting with Amazon SageMaker Autopilot](https://github.com/aws/amazon-sagemaker-examples/blob/eef13dae197a6e588a8bc111aba3244f99ee0fbb/autopilot/autopilot_time_series.ipynb).

```
import boto3
import pandas as pd

csv_path = './real-time-payload.csv'
data = pd.read_csv(csv_path)

client = boto3.client("runtime.sagemaker")

body = data.to_csv(index=False).encode("utf-8")
    
response = client.invoke_endpoint(
    EndpointName="endpoint_name",
    ContentType="text/csv",
    Body=body,
    Accept="application/json"
)
```

## Image prediction models
<a name="canvas-invoke-cv-example"></a>

The following example shows you how to invoke image prediction models.

```
import boto3
client = boto3.client("runtime.sagemaker")
with open("example_image.jpg", "rb") as file:
    body = file.read()
    response = client.invoke_endpoint(
        EndpointName="endpoint_name",
        ContentType="application/x-image",
        Body=body,
        Accept="application/json"
    )
```

## Text prediction models
<a name="canvas-invoke-nlp-example"></a>

The following example shows you how to invoke text prediction models.

```
import boto3
import pandas as pd

client = boto3.client("runtime.sagemaker")
body = pd.DataFrame([["Example text 1"], ["Example text 2"]]).to_csv(header=False, index=False).encode("utf-8")
    
response = client.invoke_endpoint(
    EndpointName="endpoint_name",
    ContentType="text/csv",
    Body=body,
    Accept="application/json"
)
```
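The examples above build and send requests but don't show reading the result. Each `invoke_endpoint` response carries the prediction in a streaming `Body` field; assuming the endpoint returns JSON (as requested through `Accept`), a minimal sketch of decoding it is:

```
import io
import json

def parse_invoke_response(response):
    # invoke_endpoint returns Body as a streaming object; read it once,
    # then decode the JSON document that the endpoint produced.
    return json.loads(response["Body"].read().decode("utf-8"))

# Stand-in for a real invoke_endpoint response (fabricated for illustration):
fake_response = {"Body": io.BytesIO(b'{"score": 3.5}')}
result = parse_invoke_response(fake_response)  # {"score": 3.5}
```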

# Delete a model deployment
<a name="canvas-deploy-model-delete"></a>

You can delete your model deployments from the Amazon SageMaker Canvas application. This action also deletes the endpoint from the SageMaker AI console and shuts down any endpoint-related resources.

**Note**  
Optionally, you can delete your endpoint through the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/) or using the SageMaker AI `DeleteEndpoint` API. For more information, see [Delete Endpoints and Resources](realtime-endpoints-delete-resources.md). However, when you delete the endpoint through the SageMaker AI console or APIs instead of the Canvas application, the list of deployments in Canvas isn’t automatically updated. You must also delete the deployment from the Canvas application to remove it from the list.
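The `DeleteEndpoint` call mentioned in the note can be sketched with boto3. Note that deleting an endpoint does not delete its endpoint configuration, so the sketch removes both; whether Canvas cleans up the configuration for you isn't specified here, so treat this as an assumption:

```
def delete_endpoint_and_config(endpoint_name, sm=None):
    # sm can be injected (for example, in tests); defaults to a real client.
    if sm is None:
        import boto3
        sm = boto3.client("sagemaker")
    # Look up the endpoint configuration before deleting the endpoint,
    # because DeleteEndpoint does not remove the configuration itself.
    config_name = sm.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
    sm.delete_endpoint(EndpointName=endpoint_name)
    sm.delete_endpoint_config(EndpointConfigName=config_name)
    return config_name
```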

To delete a deployment in Canvas, do the following:

1. Open the SageMaker Canvas application.

1. In the left navigation panel, choose **ML Ops**.

1. Choose the **Deployments** tab.

1. From the list of deployments, choose the one that you want to delete.

1. At the top of the deployment details page, choose the **More options** icon (![\[More options icon for the output CSV file.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)).

1. Choose **Delete deployment**.

1. In the **Delete deployment** dialog box, choose **Delete**.

Your deployment and SageMaker AI Hosting endpoint should now be deleted from both Canvas and the SageMaker AI console.

# How to manage automations
<a name="canvas-manage-automations"></a>

In SageMaker Canvas, you can create automations that update your dataset or generate predictions from your model on a schedule. For example, you might receive new shipping data on a daily basis. You can set up an automatic update for your dataset and automatic batch predictions that run whenever the dataset is updated. Using these features, you can set up an automated workflow and reduce the amount of time you spend manually updating datasets and making predictions.

**Note**  
You can only set up a maximum of 20 automatic configurations in your Canvas application. Automations are only active while you’re logged in to the Canvas application. If you log out of Canvas, your automatic jobs pause until you log back in.

The following sections describe how to view, edit, and delete configurations for existing automations. To learn how to set up automations, see the following topics:
+ To set up automatic dataset updates, see [Update a dataset](canvas-update-dataset.md).
+ To set up automatic batch predictions, see [Batch predictions in SageMaker Canvas](canvas-make-predictions-batch.md).

**Topics**
+ [View your automations](canvas-manage-automations-view.md)
+ [Edit your automatic configurations](canvas-manage-automations-edit.md)
+ [Delete an automatic configuration](canvas-manage-automations-delete.md)

# View your automations
<a name="canvas-manage-automations-view"></a>

You can view all of your automation jobs by going to the left navigation pane of Canvas and choosing **ML Ops**. The **ML Ops** page combines automations for both automatic dataset updates and automatic batch predictions. On the **Automations** tab, you can see the following sub-tabs:
+ **All jobs** – You can see every instance of a **Dataset update** or **Batch prediction** job that Canvas has done. For each job, you can see fields such as the associated **Input dataset**, the **Configuration name** of the associated auto update configuration, and the **Status** showing whether the job was successful or not. You can filter the jobs by configuration name:
  + For dataset update jobs, you can choose the latest version of the dataset, or the most recent job, to preview the dataset.
  + For batch prediction jobs, you can choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) to preview or download the predictions for that job. You can also choose **View details** to see more details about your prediction job. For more information about batch prediction job details, see [View your batch prediction jobs](canvas-make-predictions-batch-auto-view.md).
+ **Configuration** – You can see all of the **Dataset update** and **Batch prediction** configurations you’ve created. For each configuration, you can see fields such as the associated **Input dataset** and the **Frequency** of the jobs. You can also turn off or turn on the **Auto update** toggle to pause or resume automatic updates. If you choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) for a specific configuration, you can choose to **View all jobs** for the configuration, **Update configuration**, or **Delete configuration**.

# Edit your automatic configurations
<a name="canvas-manage-automations-edit"></a>

After setting up a configuration, you might want to make changes to it. For automatic dataset updates, you can update the Amazon S3 location for Canvas to import data, the frequency of the updates, and the starting time. For automatic batch predictions, you can change the dataset that the configuration tracks for updates. You can also turn off the automation to temporarily pause updates until you choose to resume them.

The following sections show you how to update each type of configuration.

**Note**  
You can’t change the frequency for automatic batch predictions because automatic batch predictions run every time the target dataset is updated.

**Topics**
+ [Edit your automatic dataset update configuration](canvas-manage-automations-edit-dataset.md)
+ [Edit your automatic batch prediction configuration](canvas-manage-automations-edit-batch.md)

# Edit your automatic dataset update configuration
<a name="canvas-manage-automations-edit-dataset"></a>

You might want to make changes to your auto update configuration for a dataset, such as changing the frequency of the updates. You might also want to turn off your automatic update configuration to pause the updates to your dataset.

To make changes to your auto update configuration for a dataset, do the following:

1. In the left navigation pane of Canvas, choose **ML Ops**.

1. Choose the **Automations** tab.

1. Choose the **Configuration** tab.

1. For your auto update configuration, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)).

1. In the dropdown menu, choose **Update configuration**. You are taken to the **Auto updates** tab of the dataset.

1. Make your changes to the configuration. When you’re done making changes, choose **Save**.

To pause your dataset updates, turn off your automatic configuration. One way to turn off auto updates is by doing the following:

1. In the left navigation pane of Canvas, choose **ML Ops**.

1. Choose the **Automations** tab.

1. Choose the **Configuration** tab.

1. Find your configuration from the list and turn off the **Auto update** toggle.

Automatic updates for your dataset are now paused. You can turn this toggle back on at any time to resume the update schedule.

# Edit your automatic batch prediction configuration
<a name="canvas-manage-automations-edit-batch"></a>

When you edit a batch prediction configuration, you can change the target dataset but not the frequency (since automatic batch predictions occur whenever the dataset is updated).

To make changes to your automatic batch predictions configuration, do the following:

1. In the left navigation pane of Canvas, choose **ML Ops**.

1. Choose the **Automations** tab.

1. Choose the **Configuration** tab.

1. For your auto update configuration, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)).

1. In the dropdown menu, choose **Update configuration**. The **Automate batch prediction** dialog box opens.

1. Select a different dataset, and then choose **Set up** to save your changes.

Your automatic batch predictions configuration is now updated.

To pause your automatic batch predictions, turn off your automatic configuration. Use the following procedure to turn off your configuration:

1. In the left navigation pane of Canvas, choose **ML Ops**.

1. Choose the **Automations** tab.

1. Choose the **Configuration** tab.

1. Find your configuration from the list and turn off the **Auto update** toggle.

Automatic batch predictions for your dataset are now paused. You can turn this toggle back on at any time to resume the update schedule.

# Delete an automatic configuration
<a name="canvas-manage-automations-delete"></a>

You might want to delete a configuration to stop your automated workflow in SageMaker Canvas.

To delete a configuration for automatic dataset updates or automatic batch predictions, do the following:

1. In the left navigation pane of Canvas, choose **ML Ops**.

1. Choose the **Automations** tab.

1. Choose the **Configuration** tab.

1. Find your auto update configuration, and choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)).

1. Choose **Delete configuration**.

1. In the dialog box that pops up, choose **Delete**.

Your auto update configuration is now deleted.

# Logging out of Amazon SageMaker Canvas
<a name="canvas-log-out"></a>

After completing your work in Amazon SageMaker Canvas, you can log out or configure your application to automatically terminate the *workspace instance*. A workspace instance is dedicated for your use every time you launch a Canvas application, and you are billed for as long as the instance runs. Logging out or terminating the workspace instance stops the workspace instance billing. For more information, see [SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

The following sections describe how to log out of your Canvas application and how to configure your application to automatically shut down on a schedule.

## Log out of Canvas
<a name="canvas-log-out-how-to"></a>

When you log out of Canvas, your models and datasets aren't affected. Any quick or standard model builds or [large data processing jobs](canvas-export-data.md#canvas-export-data-s3) continue running even if you log out.

To log out, choose the **Log out** button (![\[Log out icon in the SageMaker Canvas app.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/logout-icon.png)) on the left panel of the SageMaker Canvas application.

You can also log out from the SageMaker Canvas application by closing your browser tab and then [deleting the application](canvas-manage-apps-delete.md) in the console.

After you log out, SageMaker Canvas prompts you to relaunch the application in a different tab. Logging in takes around 1 minute. If you have an administrator who set up SageMaker Canvas for you, use the instructions they gave you to log back in. If you don't have an administrator, see the procedure for accessing SageMaker Canvas in [Prerequisites for setting up Amazon SageMaker Canvas](canvas-getting-started.md#canvas-prerequisites).

## Automatically shut down Canvas
<a name="canvas-auto-shutdown"></a>

If you’re a Canvas administrator, you might want to regularly shut down applications to reduce costs. You can either create a schedule to shut down active Canvas applications, or you can create an automation to shut down Canvas applications as soon as they’re *idle* (meaning the user hasn’t been active for 2 hours).

You can create these solutions using AWS Lambda functions that call the `DeleteApp` API and delete Canvas applications given certain conditions. For more information about these solutions and access to CloudFormation templates that you can use, see the blog [Optimizing costs for Amazon SageMaker Canvas with automatic shutdown of idle apps ](https://aws.amazon.com/blogs/machine-learning/optimizing-costs-for-amazon-sagemaker-canvas-with-automatic-shutdown-of-idle-apps/).
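The automation described in the blog is more complete (it checks idleness through CloudWatch metrics before deleting anything); the sketch below shows only the core `DeleteApp` loop a Lambda function could run, deleting every in-service Canvas application it finds. It assumes apps were launched from user profiles rather than spaces:

```
def canvas_apps(apps):
    # Pure filter: keep only in-service Canvas apps from a ListApps result.
    return [a for a in apps
            if a.get("AppType") == "Canvas" and a.get("Status") == "InService"]

def lambda_handler(event, context):
    # boto3 is imported lazily so the filter above works without the AWS SDK.
    import boto3
    sm = boto3.client("sagemaker")
    apps = []
    for page in sm.get_paginator("list_apps").paginate():
        apps.extend(page["Apps"])
    for app in canvas_apps(apps):
        # Deleting the app terminates the workspace instance and its billing.
        sm.delete_app(
            DomainId=app["DomainId"],
            UserProfileName=app["UserProfileName"],
            AppType=app["AppType"],
            AppName=app["AppName"],
        )
```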

**Note**  
You might experience missing [Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html) metrics if there was an error when setting up your idle shut down schedule or a CloudWatch error. We recommend that you add a CloudWatch alarm that monitors for missing metrics. If you encounter this issue, reach out to Support for help.

# Limitations and troubleshooting
<a name="canvas-limits"></a>

The following sections outline limitations and troubleshooting help that apply when you use Amazon SageMaker Canvas. You can use this topic to help troubleshoot any issues that you encounter.

## Troubleshooting issues with granting permissions through the SageMaker AI console
<a name="canvas-troubleshoot-trusted-services"></a>

If you’re having trouble granting Canvas base permissions or Ready-to-use models permissions to your user, your user might have an AWS IAM execution role with more than one trust relationship to other AWS services. A trust relationship is a policy attached to your role that defines which principals (users, roles, accounts, or services) can assume the role. For example, you might encounter an issue granting additional Canvas permissions to your user if their execution role has a trust relationship to both Amazon SageMaker AI and Amazon Forecast.

You can fix this problem by choosing one of the following options.

### 1. Remove all but one trusted service from the role.
<a name="canvas-troubleshoot-trusted-services-remove"></a>

This solution requires you to edit the trust relationship for your user profile’s IAM role and remove all AWS services except SageMaker AI.

To edit the trust relationship for your IAM execution role, do the following:

1. Go to the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. In the navigation pane of the IAM console, choose **Roles**. The console displays the roles for your account.

1. Choose the name of the role that you want to modify, and select the **Trust relationships** tab on the details page.

1. Choose **Edit trust policy**.

1. In the **Edit trust policy editor**, paste the following, and then choose **Update Policy**.

------
#### [ JSON ]


   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Principal": {
                   "Service": [
                       "sagemaker.amazonaws.com"
                   ]
               },
               "Action": "sts:AssumeRole"
           }
       ]
   }
   ```

------

You can also update this policy document using the AWS CLI. For more information, see [update-assume-role-policy](https://docs.aws.amazon.com/cli/latest/reference/iam/update-assume-role-policy.html) in the *AWS CLI Command Reference*.
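
The same update can also be scripted with the SDK for Python (Boto3). The following sketch wraps the IAM `UpdateAssumeRolePolicy` operation; the role name is a placeholder:

```python
import json

# Trust policy that allows only SageMaker AI to assume the role
SAGEMAKER_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": ["sagemaker.amazonaws.com"]},
            "Action": "sts:AssumeRole",
        }
    ],
}

def reset_trust_policy(role_name, iam_client=None):
    """Replace a role's trust policy so SageMaker AI is the only trusted service."""
    if iam_client is None:
        # boto3 is imported lazily so the helper can be unit tested offline
        import boto3
        iam_client = boto3.client("iam")
    iam_client.update_assume_role_policy(
        RoleName=role_name,
        PolicyDocument=json.dumps(SAGEMAKER_TRUST_POLICY),
    )
```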

You can now retry granting the Canvas base permissions or the Ready-to-use models permissions to your user.

### 2. Use a different role with only one trusted service.
<a name="canvas-troubleshoot-trusted-services-alternate"></a>

This solution requires you to specify a different IAM role for your user profile. Use this option if you already have an IAM role that you can substitute.

To specify a different execution role for your user, do the following:

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**.

1. From the list of domains, select the domain that you want to view a list of user profiles for.

1. On the **Domain details** page, choose the **User profiles** tab.

1. Choose the user whose permissions you want to edit. On the **User details** page, choose **Edit**.

1. On the **General settings** page, choose the **Execution role** dropdown list and select the role that you want to use.

1. Choose **Submit** to save your changes to the user profile.

Your user should now be using an execution role with only one trusted service (SageMaker AI).

You can retry granting the Canvas base permissions or the Ready-to-use models permissions to your user.

### 3. Manually attach the AWS managed policy to the execution role instead of using the toggle in the SageMaker AI domain settings.
<a name="canvas-troubleshoot-trusted-services-manual"></a>

Instead of using the toggle in the domain or user profile settings, you can manually attach the AWS managed policies that grant a user the correct permissions.

To grant a user Canvas base permissions, attach the [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasFullAccess) policy. To grant a user Ready-to-use models permissions, attach the [AmazonSageMakerCanvasAIServicesAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasAIServicesAccess) policy.

Use the following procedure to attach an AWS managed policy to your role:

1. Go to the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. Choose **Roles**.

1. In the search box, search for the user's IAM role by name and select it.

1. On the page for the user's role, under **Permissions**, choose **Add permissions**.

1. From the dropdown menu, choose **Attach policies**.

1. Search for and select the policy or policies that you want to attach to the user’s execution role:

   1. To grant the Canvas base permissions, search for and select the [AmazonSageMakerCanvasFullAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasFullAccess) policy.

   1. To grant the Ready-to-use models permissions, search for and select the [AmazonSageMakerCanvasAIServicesAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-canvas.html#security-iam-awsmanpol-AmazonSageMakerCanvasAIServicesAccess) policy.

1. Choose **Add permissions** to attach the policy to the role.

After attaching an AWS managed policy to the user’s role through the IAM console, your user should now have the Canvas base permissions or Ready-to-use models permissions.
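
The same attachment can be scripted with the IAM `AttachRolePolicy` operation. The following is a minimal sketch; the role name is a placeholder:

```python
# AWS managed policy ARNs for the two Canvas permission sets
CANVAS_POLICY_ARNS = {
    "base": "arn:aws:iam::aws:policy/AmazonSageMakerCanvasFullAccess",
    "ready_to_use": "arn:aws:iam::aws:policy/AmazonSageMakerCanvasAIServicesAccess",
}

def grant_canvas_permissions(role_name, permission_set="base", iam_client=None):
    """Attach the managed policy for the requested Canvas permission set."""
    if iam_client is None:
        # boto3 is imported lazily so the helper can be unit tested offline
        import boto3
        iam_client = boto3.client("iam")
    policy_arn = CANVAS_POLICY_ARNS[permission_set]
    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    return policy_arn
```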

## Troubleshooting issues with creating a Canvas application due to space failure
<a name="canvas-troubleshoot-spaces"></a>

When creating a new Canvas application, if you encounter an error stating `Unable to create app <app-arn> because space <space-arn> is not in InService state`, this indicates that the underlying Amazon SageMaker Studio space creation has failed. A Studio *space* is the underlying storage that hosts your Canvas application data. For more general information about Studio spaces, see [Amazon SageMaker Studio spaces](studio-updated-spaces.md). For more information about configuring spaces in Canvas, see [Store SageMaker Canvas application data in your own SageMaker AI space](canvas-spaces-setup.md).

To determine why space creation failed, you can use the [DescribeSpace](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeSpace.html) API to check the `FailureReason` field. For more information about the possible statuses of spaces and what they mean, see [Amazon SageMaker AI domain entities and statuses](sm-domain.md).
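
For example, a small helper around `DescribeSpace` can surface the failure reason; the domain ID and space name you pass in are the ones from the error message:

```python
def space_failure_reason(domain_id, space_name, sm_client=None):
    """Return a space's FailureReason, or None when the space is InService."""
    if sm_client is None:
        # boto3 is imported lazily so the helper can be unit tested offline
        import boto3
        sm_client = boto3.client("sagemaker")
    response = sm_client.describe_space(DomainId=domain_id, SpaceName=space_name)
    if response.get("Status") == "InService":
        return None
    return response.get("FailureReason", "No failure reason reported")
```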

To resolve this issue, find your domain in the SageMaker AI console and delete the failed space listed in the error message you received. For detailed steps on how to find and delete a space, see the page [Stop and delete your Studio running applications and spaces](studio-updated-running-stop.md) and follow the instructions to **Delete a Studio space**. Deleting the space also deletes any applications associated with the space. After deleting the space, you can try to create your Canvas application again. The space should now provision successfully, allowing Canvas to launch.

# Billing and cost in SageMaker Canvas
<a name="canvas-manage-cost"></a>

To track the costs associated with your SageMaker Canvas application, you can use the AWS Billing and Cost Management service. Billing and Cost Management provides tools to help you gather information related to your cost and usage, analyze your cost drivers and usage trends, and take action to budget your spending. For more information, see [What is AWS Billing and Cost Management?](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/billing-what-is.html)

Billing in SageMaker Canvas consists of the following components:
+ Workspace instance charges – You are charged for the number of hours that you are logged in to or using SageMaker Canvas. We recommend that you log out or create a schedule to shut down any Canvas applications that you’re not actively using to reduce costs. For more information, see [Logging out of Amazon SageMaker Canvas](canvas-log-out.md).
+ AWS service charges – You are charged for building and making predictions with custom models, or for making predictions with Ready-to-use models:
  + Training charges – For all model types, you are charged based on your resource usage while the model builds. These resources include any compute instances that Canvas spins up. You may see these charges on your account as Hosting, Training, Processing, or Batch Transform jobs.
  + Prediction charges – You are charged for the resources used to generate predictions, depending on the type of custom model that you built or the type of Ready-to-use model you used.

The [Ready-to-use models](canvas-ready-to-use-models.md) in Canvas use other AWS services to generate predictions. When you use a Ready-to-use model, you are charged by the respective service, and that service's pricing applies:
+ For sentiment analysis, entities extraction, language detection, and personal information detection, you’re charged with [Amazon Comprehend pricing](https://aws.amazon.com/comprehend/pricing/).
+ For object detection in images and text detection in images, you’re charged with [Amazon Rekognition pricing](https://aws.amazon.com/rekognition/pricing/).
+ For expense analysis, identity document analysis, and document analysis, you’re charged with [Amazon Textract pricing](https://aws.amazon.com/textract/pricing/).

For more information, see [SageMaker Canvas pricing](https://aws.amazon.com/sagemaker/canvas/pricing/).

To help you track your costs in Billing and Cost Management, you can assign custom tags to your SageMaker Canvas app and users. You can track the costs your apps incur, and by tagging individual user profiles, you can track costs based on the user profile. For more information about tags, see [Using Cost Allocation Tags](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/cost-alloc-tags.html).

You can add tags to your SageMaker Canvas app and users by doing the following:
+ If you are setting up your Amazon SageMaker AI domain and SageMaker Canvas for the first time, follow the [Getting Started](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-getting-started.html) instructions and add tags when creating your domain or users. You can add tags either through the **General settings** in the domain console setup, or through the APIs ([CreateDomain](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateDomain.html) or [CreateUserProfile](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateUserProfile.html)). SageMaker AI adds the tags specified in your domain or UserProfile to any SageMaker Canvas apps or users you create after you create the domain.
+ If you want to add tags to apps in an existing domain, you must add tags to either the domain or the UserProfile. You can add tags through either the console or the [AddTags](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AddTags.html) API. If you add tags through the console, then you must delete and relaunch your SageMaker Canvas app in order for the tags to propagate to the app. If you use the API, the tags are added directly to the app. For more information about deleting and relaunching a SageMaker Canvas app, see [Manage apps](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-manage-apps.html).
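
As a sketch of the API path, cost-allocation tags can be added to an existing resource with `AddTags`. The user-profile ARN shown in the usage note is an illustrative placeholder:

```python
def tag_resource(resource_arn, tags, sm_client=None):
    """Add cost-allocation tags (given as a dict) to a SageMaker AI resource."""
    if sm_client is None:
        # boto3 is imported lazily so the helper can be unit tested offline
        import boto3
        sm_client = boto3.client("sagemaker")
    # The AddTags API expects a list of {"Key": ..., "Value": ...} pairs
    tag_list = [{"Key": key, "Value": value} for key, value in tags.items()]
    sm_client.add_tags(ResourceArn=resource_arn, Tags=tag_list)
    return tag_list
```

For example, `tag_resource("arn:aws:sagemaker:us-west-2:111122223333:user-profile/d-example/analyst", {"team": "analytics"})` would tag that user profile so its Canvas costs can be grouped by team.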

After you add tags to your domain, it might take up to 24 hours for the tags to appear in the AWS Billing and Cost Management console for activation. After they appear in the console, it takes another 24 hours for the tags to activate.

On the **Cost explorer** page, you can group and filter your costs by tags and usage types to separate your Workspace instance charges from your Training charges. The charges for each are listed as the following:
+ Workspace instance charges: Charges show up under the usage type `REGION-Canvas:Session-Hrs (Hrs)`.
+ Training charges: Charges show up under the usage types for SageMaker AI Hosting, Training, Processing, or Batch Transform jobs.
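
To pull these numbers programmatically, one option is the Cost Explorer `GetCostAndUsage` API, filtering on the workspace-session usage type and grouping by a cost-allocation tag. In this sketch, the `USW2` prefix is a regional example (US West Oregon) and the `team` tag key is an assumption:

```python
def canvas_session_costs(start, end, tag_key="team", ce_client=None):
    """Return monthly Canvas workspace-session costs grouped by a tag."""
    if ce_client is None:
        # boto3 is imported lazily so the helper can be unit tested offline
        import boto3
        ce_client = boto3.client("ce")
    return ce_client.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},   # dates such as "2024-01-01"
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        # Region-prefixed usage type for Canvas session hours (us-west-2 shown)
        Filter={"Dimensions": {"Key": "USAGE_TYPE",
                               "Values": ["USW2-Canvas:Session-Hrs"]}},
        GroupBy=[{"Type": "TAG", "Key": tag_key}],
    )
```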

# Amazon SageMaker geospatial capabilities
<a name="geospatial"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. If you created an Amazon SageMaker AI domain prior to November 30, 2023, Studio Classic remains the default experience. Domains created after November 30, 2023 default to the new Studio experience.  
Amazon SageMaker geospatial features and resources are *only* available in Studio Classic. To learn more about setting up a domain and getting started with Studio, see [Getting started with Amazon SageMaker geospatial](geospatial-getting-started.md).

Amazon SageMaker geospatial capabilities make it easier for data scientists and machine learning (ML) engineers to build, train, and deploy ML models faster using geospatial data. You have access to open-source and third-party data, processing, and visualization tools that make it more efficient to prepare geospatial data for ML. You can increase your productivity by using purpose-built algorithms and pre-trained ML models to speed up model building and training. You can also use built-in visualization tools to explore prediction outputs on an interactive map, and then collaborate across teams on insights and results.

**Note**  
Currently, SageMaker geospatial capabilities are only supported in the US West (Oregon) Region.  
If you don't see the SageMaker geospatial UI available in your current Studio Classic instance, check to make sure that you are currently in the US West (Oregon) Region.
<a name="why-use-geo"></a>
**Why use SageMaker geospatial capabilities?**  
You can use SageMaker geospatial capabilities to make predictions on geospatial data faster than do-it-yourself solutions. SageMaker geospatial capabilities make it easier to access geospatial data from your existing customer data lakes, open-source datasets, and other SageMaker geospatial data providers. SageMaker geospatial capabilities minimize the need for building custom infrastructure and data preprocessing functions by offering purpose-built algorithms for efficient data preparation, model training, and inference. You can also create and share custom visualizations and data with your company from Amazon SageMaker Studio Classic. SageMaker geospatial capabilities offer pre-trained models for common uses in agriculture, real estate, insurance, and financial services.

## How can I use SageMaker geospatial capabilities?
<a name="how-use-geo"></a>

You can use SageMaker geospatial capabilities in two ways:
+ Through the SageMaker geospatial UI, as part of the Amazon SageMaker Studio Classic UI.
+ Through a Studio Classic notebook instance that uses the **Geospatial 1.0** image.

**SageMaker AI has the following geospatial capabilities:**
+ Use a purpose-built SageMaker geospatial image that supports both CPU- and GPU-based notebook instances and includes commonly used open-source libraries found in geospatial machine learning workflows.
+ Use Amazon SageMaker Processing and the SageMaker geospatial container to run large-scale workloads with your own datasets, including soil, weather, climate, LiDAR, and commercial aerial and satellite imagery.
+ Run an [Earth Observation job](https://docs.aws.amazon.com/sagemaker/latest/dg/geospatial-eoj.html) for raster data processing.
+ Run a [Vector Enrichment job](https://docs.aws.amazon.com/sagemaker/latest/dg/geospatial-vej.html) to convert latitude and longitude into human-readable addresses, and to match noisy GPS traces to specific roads.
+ Use built-in [visualization tools](https://docs.aws.amazon.com/sagemaker/latest/dg/geospatial-visualize.html) right in Studio Classic to interactively view geospatial data or model predictions on a map.

You can also use data from a collection of geospatial data providers. Currently, the data collections available include:
+ [USGS Landsat](https://www.usgs.gov/centers/eros/data-citation?qt-science_support_page_related_con=0#qt-science_support_page_related_con)
+ [Copernicus Sentinel data legal notice](https://sentinels.copernicus.eu/documents/247904/690755/Sentinel_Data_Legal_Notice)
+ [Sentinel-2](https://sentinel.esa.int/web/sentinel/missions/sentinel-2)
+ [Copernicus DEM](https://registry.opendata.aws/copernicus-dem/)
+ [NAIP: National Agriculture Imagery Program](https://registry.opendata.aws/naip/)

## Are you a first-time user of SageMaker geospatial?
<a name="first-time-geospatial-data"></a>

As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. New domains created after November 30, 2023 default to the Studio experience. Access to SageMaker geospatial is limited to Studio Classic. To learn more, see [Accessing SageMaker geospatial](access-studio-classic-geospatial.md).

If you're a first-time user of AWS or Amazon SageMaker AI, we recommend that you do the following:

1. **Create an AWS account.**

   To learn about setting up an AWS account and getting started with SageMaker AI, see [Complete Amazon SageMaker AI prerequisites](gs-set-up.md).

1. **Create a user role and execution role that work with SageMaker geospatial**.

   As a managed service, Amazon SageMaker geospatial capabilities perform operations on your behalf on the AWS hardware that SageMaker AI manages. A SageMaker AI execution role can perform only the operations that users grant it. To work with SageMaker geospatial capabilities, you must set up a user role and an execution role. For more information, see [SageMaker geospatial capabilities roles](sagemaker-geospatial-roles.md).

1. **Update your trust policy to include SageMaker geospatial**.

   SageMaker geospatial defines an additional service principal. To learn how to create or update your SageMaker AI execution role's trust policy, see [Adding the SageMaker geospatial service principal to an existing SageMaker AI execution role](sagemaker-geospatial-roles-pass-role.md).

1. **Set up an Amazon SageMaker AI domain to access Amazon SageMaker Studio Classic.**

   To use SageMaker geospatial, a domain is required. For domains created before November 30, 2023, the default experience is Studio Classic. Domains created after November 30, 2023 default to the Studio experience. To learn more about accessing Studio Classic from Studio, see [Accessing SageMaker geospatial](access-studio-classic-geospatial.md).

1. **Remember to shut down resources.**

   When you have finished using SageMaker geospatial capabilities, shut down the instance it runs on to avoid incurring additional charges. For more information, see [Shut Down Resources from Amazon SageMaker Studio Classic](notebooks-run-and-manage-shut-down.md). 

**Topics**
+ [How can I use SageMaker geospatial capabilities?](#how-use-geo)
+ [Are you a first-time user of SageMaker geospatial?](#first-time-geospatial-data)
+ [Getting started with Amazon SageMaker geospatial](geospatial-getting-started.md)
+ [Using processing jobs for custom geospatial workloads](geospatial-custom-operations.md)
+ [Earth Observation Jobs](geospatial-eoj.md)
+ [Vector Enrichment Jobs](geospatial-vej.md)
+ [Visualization Using SageMaker geospatial capabilities](geospatial-visualize.md)
+ [Amazon SageMaker geospatial Map SDK](geospatial-notebook-sdk.md)
+ [SageMaker geospatial capabilities FAQ](geospatial-faq.md)
+ [SageMaker geospatial Security and Permissions](geospatial-security-general.md)
+ [Types of compute instances](geospatial-instances.md)
+ [Data collections](geospatial-data-collections.md)

# Getting started with Amazon SageMaker geospatial
<a name="geospatial-getting-started"></a>

SageMaker geospatial provides a purpose-built **Image** and **Instance type** for Amazon SageMaker Studio Classic notebooks. You can use either CPU- or GPU-enabled notebooks with the SageMaker geospatial **Image**. You can also visualize your geospatial data using a purpose-built visualizer. In addition, SageMaker geospatial provides APIs that allow you to query raster data collections, and you can use pre-trained models for geospatial data analysis, reverse geocoding, and map matching.

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. If you created an Amazon SageMaker AI domain prior to November 30, 2023, Studio Classic remains the default experience. Domains created after November 30, 2023 default to the new Studio experience.

To access and get started using Amazon SageMaker geospatial, do the following:

**Topics**
+ [Accessing SageMaker geospatial](access-studio-classic-geospatial.md)
+ [Create an Amazon SageMaker Studio Classic notebook using the geospatial image](geospatial-launch-notebook.md)
+ [Access the Sentinel-2 raster data collection and create an earth observation job to perform land segmentation](geospatial-demo.md)

# Accessing SageMaker geospatial
<a name="access-studio-classic-geospatial"></a>

**Note**  
Currently, SageMaker geospatial capabilities are only supported in the US West (Oregon) Region and in Studio Classic.  
If you don't see the SageMaker geospatial UI available in your current Studio Classic instance, check to make sure that you are currently in the US West (Oregon) Region.

A domain is required to access SageMaker geospatial. If you created a domain prior to November 30, 2023, the default experience is Studio Classic.

If you created a domain after November 30, 2023, or if you have migrated to Studio, you can use the following procedure to activate Studio Classic from within Studio to use SageMaker geospatial features.

To learn more about creating a domain, see [Onboard to Amazon SageMaker AI domain](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-studio-onboard.html).

**To access Studio Classic from Studio**

1. Launch Amazon SageMaker Studio.

1. Under **Applications**, choose **Studio Classic**.

1. Then, choose **Create Studio Classic space**.

1. On the **Create Studio Classic space** page, enter a **Name**.

1. Disable the **Share with my domain** option. SageMaker geospatial is not available in shared domains.

1. Then choose **Create space**.

When successful, the **Status** changes to **Updating**. When your Studio Classic application is ready to use, the status changes to **Stopped**.

To start your Studio Classic application, choose **Run**.

# Create an Amazon SageMaker Studio Classic notebook using the geospatial image
<a name="geospatial-launch-notebook"></a>

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the Studio Classic application. For information about using the updated Studio experience, see [Amazon SageMaker Studio](studio-updated.md).  
Studio Classic is still maintained for existing workloads but is no longer available for onboarding. You can only stop or delete existing Studio Classic applications and cannot create new ones. We recommend that you [migrate your workload to the new Studio experience](studio-updated-migrate.md).

**Note**  
Currently, SageMaker geospatial is only supported in the US West (Oregon) Region.  
If you don't see SageMaker geospatial available in your current domain or notebook instance, make sure that you're currently in the US West (Oregon) Region.

Use the following procedure to create a Studio Classic notebook with the SageMaker geospatial image. If Studio is your default experience, see [Accessing SageMaker geospatial](access-studio-classic-geospatial.md) to learn about starting a Studio Classic application.

**To create a Studio Classic notebook with the SageMaker geospatial image**

1. Launch Studio Classic.

1. Choose **Home** in the menu bar.

1. Under **Quick actions**, choose **Open Launcher**.

1. When the **Launcher** dialog box opens, choose **Change environment** under **Notebooks and compute resources**.

1. When the **Change environment** dialog box opens, choose the **Image** dropdown and choose or type **Geospatial 1.0**.  
![\[A dialogue boxing showing the correct geospatial image and instance type selected.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/geospatial-environment-dialogue.png)

1. Next, choose an **Instance type** from the dropdown.

   SageMaker geospatial supports two types of notebook instances: CPU and GPU. The supported CPU instance is called **ml.geospatial.interactive**. Any of the G5 family of GPU instances can be used with the Geospatial 1.0 image.
**Note**  
If you receive a ResourceLimitExceeded error when attempting to start a GPU-based instance, you need to request a quota increase. To get started on a Service Quotas quota increase request, see [Requesting a quota increase](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html) in the *Service Quotas User Guide*.

1. Choose **Select**.

1. Choose **Create notebook**.

After creating a notebook, try the [SageMaker geospatial tutorial](geospatial-demo.md) to learn more about SageMaker geospatial. It shows you how to process Sentinel-2 image data and perform land segmentation on it using the earth observation jobs API.

# Access the Sentinel-2 raster data collection and create an earth observation job to perform land segmentation
<a name="geospatial-demo"></a>

This Python-based tutorial uses the SDK for Python (Boto3) and an Amazon SageMaker Studio Classic notebook. To complete this demo successfully, make sure that you have the required AWS Identity and Access Management (IAM) permissions to use SageMaker geospatial and Studio Classic. SageMaker geospatial requires that you have a user, group, or role that can access Studio Classic. You must also have a SageMaker AI execution role that specifies the SageMaker geospatial service principal, `sagemaker-geospatial.amazonaws.com`, in its trust policy.

To learn more about these requirements, see [SageMaker geospatial IAM roles](sagemaker-geospatial-roles.md).

This tutorial shows you how to use the SageMaker geospatial API to complete the following tasks:
+ Find the available raster data collections with `list_raster_data_collections`.
+ Search a specified raster data collection by using `search_raster_data_collection`.
+ Create an earth observation job (EOJ) by using `start_earth_observation_job`.

## Using `list_raster_data_collections` to find available data collections
<a name="demo-use-list-rdc"></a>

SageMaker geospatial supports multiple raster data collections. To learn more about the available data collections, see [Data collections](geospatial-data-collections.md).

This demo uses satellite data that's collected from [Sentinel-2 Cloud-Optimized GeoTIFF](https://registry.opendata.aws/sentinel-2-l2a-cogs/) satellites. These satellites provide global coverage of Earth's land surface every five days. In addition to collecting surface images of Earth, the Sentinel-2 satellites also collect data across a variety of spectral bands.

To search an area of interest (AOI), you need the ARN that's associated with the Sentinel-2 satellite data. To find the available data collections and their associated ARNs in your AWS Region, use the `list_raster_data_collections` API operation.

Because the response can be paginated, you must use the `get_paginator` operation to return all of the relevant data:

```
import boto3
import sagemaker
import sagemaker_geospatial_map
import json

# SageMaker geospatial is currently only available in us-west-2
session = boto3.Session(region_name='us-west-2')
execution_role = sagemaker.get_execution_role()

# Create a SageMaker geospatial client instance
sm_geo_client = session.client(service_name="sagemaker-geospatial")

# Create a reusable paginator for the list_raster_data_collections API operation
paginator = sm_geo_client.get_paginator("list_raster_data_collections")

# Create a PageIterator from the paginator
page_iterator = paginator.paginate()

# Use the iterator to collect the results of list_raster_data_collections
results = []
for page in page_iterator:
    results.append(page['RasterDataCollectionSummaries'])

print(results)
```

This is a sample JSON response from the `list_raster_data_collections` API operation. It's truncated to include only the data collection (Sentinel-2) that's used in this code example. For more details about a specific raster data collection, use `get_raster_data_collection`:

```
{
    "Arn": "arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8",
    "Description": "Sentinel-2a and Sentinel-2b imagery, processed to Level 2A (Surface Reflectance) and converted to Cloud-Optimized GeoTIFFs",
    "DescriptionPageUrl": "https://registry.opendata.aws/sentinel-2-l2a-cogs",
    "Name": "Sentinel 2 L2A COGs",
    "SupportedFilters": [
        {
            "Maximum": 100,
            "Minimum": 0,
            "Name": "EoCloudCover",
            "Type": "number"
        },
        {
            "Maximum": 90,
            "Minimum": 0,
            "Name": "ViewOffNadir",
            "Type": "number"
        },
        {
            "Name": "Platform",
            "Type": "string"
        }
    ],
    "Tags": {},
    "Type": "PUBLIC"
}
```

After running the previous code sample, you get the ARN of the Sentinel-2 raster data collection, `arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8`. In the [next section](#demo-search-raster-data), you can query the Sentinel-2 data collection using the `search_raster_data_collection` API.
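
If you want the full metadata for a single collection first, a small sketch around `get_raster_data_collection` (using the ARN returned above) might look like the following:

```python
def describe_collection(arn, geo_client=None):
    """Return the metadata for one raster data collection."""
    if geo_client is None:
        # boto3 is imported lazily so the helper can be unit tested offline
        import boto3
        geo_client = boto3.client("sagemaker-geospatial", region_name="us-west-2")
    return geo_client.get_raster_data_collection(Arn=arn)
```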

## Searching the Sentinel-2 raster data collection using `search_raster_data_collection`
<a name="demo-search-raster-data"></a>

In the preceding section, you used `list_raster_data_collections` to get the ARN for the Sentinel-2 data collection. Now you can use that ARN to search the data collection over a given area of interest (AOI), time range, and properties, and to filter the available image bands.

To call the `search_raster_data_collection` API, you must pass a Python dictionary to the `RasterDataCollectionQuery` parameter. This example uses `AreaOfInterest`, `TimeRangeFilter`, `PropertyFilters`, and `BandFilter`. For convenience, you can use the variable `search_rdc_query` to hold the search query parameters:

```
search_rdc_query = {
    "AreaOfInterest": {
        "AreaOfInterestGeometry": {
            "PolygonGeometry": {
                "Coordinates": [
                    [
                        # coordinates are input as longitude followed by latitude
                        [-114.529, 36.142],
                        [-114.373, 36.142],
                        [-114.373, 36.411],
                        [-114.529, 36.411],
                        [-114.529, 36.142],
                    ]
                ]
            }
        }
    },
    "TimeRangeFilter": {
        "StartTime": "2022-01-01T00:00:00Z",
        "EndTime": "2022-07-10T23:59:59Z"
    },
    "PropertyFilters": {
        "Properties": [
            {
                "Property": {
                    "EoCloudCover": {
                        "LowerBound": 0,
                        "UpperBound": 1
                    }
                }
            }
        ],
        "LogicalOperator": "AND"
    },
    "BandFilter": [
        "visual"
    ]
}
```

In this example, you query an `AreaOfInterest` that includes [Lake Mead](https://en.wikipedia.org/wiki/Lake_Mead), on the border of Nevada and Arizona. Furthermore, Sentinel-2 supports multiple types of image bands. To measure the change in the surface of the water, you only need the `visual` band.

After you create the query parameters, you can use the `search_raster_data_collection` API to make the request. 

The following code sample implements a `search_raster_data_collection` API request. This API does not support pagination through the `get_paginator` operation, so the code sample uses a `while` loop and the response's `NextToken` value to gather every page of results, appending the satellite image URLs and other response metadata to `items_list` with `.extend()`.

To learn more about the `search_raster_data_collection`, see [SearchRasterDataCollection](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_geospatial_SearchRasterDataCollection.html) in the *Amazon SageMaker AI API Reference*.

```
search_rdc_response = sm_geo_client.search_raster_data_collection(
    Arn='arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8',
    RasterDataCollectionQuery=search_rdc_query
)


## Accumulates the Items from each page of the API response. 
items_list = []

## Append the current page of results, then use the Python .get() method to check
## whether 'NextToken' exists; when it is absent, the loop ends.
while True:
    items_list.extend(search_rdc_response['Items'])
    next_token = search_rdc_response.get('NextToken')
    if not next_token:
        break
    search_rdc_response = sm_geo_client.search_raster_data_collection(
        Arn='arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8',
        RasterDataCollectionQuery=search_rdc_query,
        NextToken=next_token
    )

## Print the number of observations returned by the query
print(len(items_list))
```

The following is a JSON response from your query, truncated for clarity. Only the `visual` band specified in the request's `"BandFilter": ["visual"]` is returned in the `Assets` key-value pair:

```
{
    'Assets': {
        'visual': {
            'Href': 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/15/T/UH/2022/6/S2A_15TUH_20220623_0_L2A/TCI.tif'
        }
    },
    'DateTime': datetime.datetime(2022, 6, 23, 17, 22, 5, 926000, tzinfo = tzlocal()),
    'Geometry': {
        'Coordinates': [
            [
                [-114.529, 36.142],
                [-114.373, 36.142],
                [-114.373, 36.411],
                [-114.529, 36.411],
                [-114.529, 36.142],
            ]
        ],
        'Type': 'Polygon'
    },
    'Id': 'S2A_15TUH_20220623_0_L2A',
    'Properties': {
        'EoCloudCover': 0.046519,
        'Platform': 'sentinel-2a'
    }
}
```
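You can collect the visual band URL from each accumulated result in the same way. The following is a minimal sketch; the entry below is a made-up example that follows the response shape shown above, and in the notebook you would iterate over `items_list` instead:

```python
# Sketch: extract the visual band URL from each search result item.
# example_items stands in for items_list; the Href value is made up.
example_items = [
    {
        "Assets": {"visual": {"Href": "https://example.com/S2A_15TUH_20220623_0_L2A/TCI.tif"}},
        "Id": "S2A_15TUH_20220623_0_L2A",
    }
]
tci_urls = [item["Assets"]["visual"]["Href"] for item in example_items]
print(tci_urls[0])
```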

Now that you have your query results, in the next section you can visualize them by using `matplotlib` to verify that they come from the correct geographical region. 

## Visualizing your `search_raster_data_collection` using `matplotlib`
<a name="demo-geospatial-visualize"></a>

Before you start the earth observation job (EOJ), you can visualize a result from your query with `matplotlib`. The following code sample takes the first item, `items_list[0]["Assets"]["visual"]["Href"]`, from the `items_list` variable created in the previous code sample and displays the image using `matplotlib`.

```
# Visualize an example image.
import os
from urllib import request
import tifffile
import matplotlib.pyplot as plt

image_dir = "./images/lake_mead"
os.makedirs(image_dir, exist_ok=True)

image_url = items_list[0]["Assets"]["visual"]["Href"]
img_id = image_url.split("/")[-2]
path_to_image = image_dir + "/" + img_id + "_TCI.tif"
response = request.urlretrieve(image_url, path_to_image)
print("Downloaded image: " + img_id)

tci = tifffile.imread(path_to_image)
plt.figure(figsize=(6, 6))
plt.imshow(tci)
plt.show()
```

After checking that the results are in the correct geographical region, you can start the earth observation job (EOJ) in the next step. You use the EOJ to identify the water bodies in the satellite images by using a process called land cover segmentation.

## Starting an earth observation job (EOJ) that performs land segmentation on a series of satellite images
<a name="demo-start-eoj"></a>

SageMaker geospatial provides multiple pre-trained models that you can use to process geospatial data from raster data collections. To learn more about the available pre-trained models and custom operations, see [Types of Operations](geospatial-eoj-models.md).

To calculate the change in the water surface area, you need to identify which pixels in the images correspond to water. Land cover segmentation is a semantic segmentation model supported by the `start_earth_observation_job` API. Semantic segmentation models associate a label with every pixel in each image. In the results, each pixel is assigned a label that's based on the class map for the model. The following is the class map for the land segmentation model:

```
{
    0: "No_data",
    1: "Saturated_or_defective",
    2: "Dark_area_pixels",
    3: "Cloud_shadows",
    4: "Vegetation",
    5: "Not_vegetated",
    6: "Water",
    7: "Unclassified",
    8: "Cloud_medium_probability",
    9: "Cloud_high_probability",
    10: "Thin_cirrus",
    11: "Snow_ice"
}
```

To start an earth observation job, use the `start_earth_observation_job` API. When you submit your request, you must specify the following:
+ `InputConfig` (*dict*) – Used to specify the coordinates of the area that you want to search, and other metadata that's associated with your search.
+ `JobConfig` (*dict*) – Used to specify the type of EOJ operation that you want performed on the data. This example uses **LandCoverSegmentationConfig**.
+ `ExecutionRoleArn` (*string*) – The ARN of the SageMaker AI execution role with the necessary permissions to run the job.
+ `Name` (*string*) – A name for the earth observation job.

The `InputConfig` is a Python dictionary. Use the following variable `eoj_input_config` to hold the search query parameters. Use this variable when you make the `start_earth_observation_job` API request.

```
# Perform land cover segmentation on images returned from the Sentinel-2 dataset.
eoj_input_config = {
    "RasterDataCollectionQuery": {
        "RasterDataCollectionArn": "arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8",
        "AreaOfInterest": {
            "AreaOfInterestGeometry": {
                "PolygonGeometry": {
                    "Coordinates":[
                        [
                            [-114.529, 36.142],
                            [-114.373, 36.142],
                            [-114.373, 36.411],
                            [-114.529, 36.411],
                            [-114.529, 36.142],
                        ]
                    ]
                }
            }
        },
        "TimeRangeFilter": {
            "StartTime": "2021-01-01T00:00:00Z",
            "EndTime": "2022-07-10T23:59:59Z",
        },
        "PropertyFilters": {
            "Properties": [{"Property": {"EoCloudCover": {"LowerBound": 0, "UpperBound": 1}}}],
            "LogicalOperator": "AND",
        },
    }
}
```

The `JobConfig` is a Python dictionary that is used to specify the EOJ operation that you want performed on your data:

```
eoj_config = {"LandCoverSegmentationConfig": {}}
```

With the dictionary elements now specified, you can submit your `start_earth_observation_job` API request using the following code sample:

```
# Gets the execution role arn associated with current notebook instance 
execution_role_arn = sagemaker.get_execution_role()

# Starts an earth observation job
response = sm_geo_client.start_earth_observation_job(
    Name="lake-mead-landcover",
    InputConfig=eoj_input_config,
    JobConfig=eoj_config,
    ExecutionRoleArn=execution_role_arn,
)
            
print(response)
```

The `start_earth_observation_job` API returns an ARN along with other metadata.

To get a list of all of your earth observation jobs, use the `list_earth_observation_jobs` API. To monitor the status of a single earth observation job, use the `get_earth_observation_job` API. To make this request, use the ARN created after submitting your EOJ request. To learn more, see [GetEarthObservationJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_geospatial_GetEarthObservationJob.html) in the *Amazon SageMaker AI API Reference*.
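Because an EOJ can run for a while, you typically poll its status until it reaches a terminal state. The following sketch shows that polling loop. The `StubGeoClient` class stands in for the `sm_geo_client` boto3 client so that the loop logic can be shown without calling AWS; the statuses and ARN below are made up:

```python
import time

class StubGeoClient:
    """Stands in for sm_geo_client; returns a made-up status sequence."""
    def __init__(self):
        self._statuses = ["INITIALIZING", "IN_PROGRESS", "COMPLETED"]

    def get_earth_observation_job(self, Arn):
        return {"Arn": Arn, "Status": self._statuses.pop(0)}

def wait_for_eoj(client, eoj_arn, poll_seconds=0):
    # Poll get_earth_observation_job until the EOJ reaches a terminal state.
    while True:
        job = client.get_earth_observation_job(Arn=eoj_arn)
        if job["Status"] in ("COMPLETED", "FAILED", "DELETED"):
            return job["Status"]
        time.sleep(poll_seconds)

status = wait_for_eoj(
    StubGeoClient(),
    "arn:aws:sagemaker-geospatial:us-west-2:111122223333:earth-observation-job/example",
)
print(status)
```

With a real client, you would pass `sm_geo_client` and the ARN returned by `start_earth_observation_job`, and use a longer polling interval.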

To find the ARNs associated with your EOJs, use the `list_earth_observation_jobs` API operation. To learn more, see [ListEarthObservationJobs](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_geospatial_ListEarthObservationJobs.html) in the *Amazon SageMaker AI API Reference*.

```
# List all jobs in the account
sm_geo_client.list_earth_observation_jobs()["EarthObservationJobSummaries"]
```

The following is an example JSON response:

```
{
    'Arn': 'arn:aws:sagemaker-geospatial:us-west-2:111122223333:earth-observation-job/futg3vuq935t',
    'CreationTime': datetime.datetime(2023, 10, 19, 4, 33, 54, 21481, tzinfo = tzlocal()),
    'DurationInSeconds': 3493,
    'Name': 'lake-mead-landcover',
    'OperationType': 'LAND_COVER_SEGMENTATION',
    'Status': 'COMPLETED',
    'Tags': {}
}, {
    'Arn': 'arn:aws:sagemaker-geospatial:us-west-2:111122223333:earth-observation-job/wu8j9x42zw3d',
    'CreationTime': datetime.datetime(2023, 10, 20, 0, 3, 27, 270920, tzinfo = tzlocal()),
    'DurationInSeconds': 1,
    'Name': 'mt-shasta-landcover',
    'OperationType': 'LAND_COVER_SEGMENTATION',
    'Status': 'INITIALIZING',
    'Tags': {}
}
```

After the status of your EOJ changes to `COMPLETED`, proceed to the next section to calculate the change in Lake Mead's surface area.

## Calculating the change in the Lake Mead surface area
<a name="demo-geospatial-calc"></a>

To calculate the change in Lake Mead's surface area, first export the results of the EOJ to Amazon S3 by using `export_earth_observation_job`:

```
sagemaker_session = sagemaker.Session()
s3_bucket_name = sagemaker_session.default_bucket()  # Replace with your own bucket if needed
s3_bucket = boto3.resource("s3").Bucket(s3_bucket_name)
prefix = "export-lake-mead-eoj"  # Replace with the S3 prefix desired
export_bucket_and_key = f"s3://{s3_bucket_name}/{prefix}/"

eoj_output_config = {"S3Data": {"S3Uri": export_bucket_and_key}}
export_response = sm_geo_client.export_earth_observation_job(
    Arn="arn:aws:sagemaker-geospatial:us-west-2:111122223333:earth-observation-job/7xgwzijebynp",
    ExecutionRoleArn=execution_role_arn,
    OutputConfig=eoj_output_config,
    ExportSourceImages=False,
)
```

To see the status of your export job, use `get_earth_observation_job`:

```
export_job_details = sm_geo_client.get_earth_observation_job(Arn=export_response["Arn"])
```

To calculate the changes in Lake Mead's water level, download the land cover masks to your local SageMaker notebook instance, along with the source images from the previous query. In the class map for the land segmentation model, water's class index is 6.

To extract the water mask from a Sentinel-2 image, first count the number of pixels marked as water (class index 6) in the image. Then multiply that count by the area that each pixel covers. Bands can differ in their spatial resolution. The land cover segmentation model downsamples all bands to a spatial resolution of 60 meters, so each pixel represents a 60 m × 60 m area.

```
import os
from glob import glob
import cv2
import numpy as np
import tifffile
import matplotlib.pyplot as plt
from urllib.parse import urlparse
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Download land cover masks
mask_dir = "./masks/lake_mead"
os.makedirs(mask_dir, exist_ok=True)
image_paths = []
for s3_object in s3_bucket.objects.filter(Prefix=prefix).all():
    path, filename = os.path.split(s3_object.key)
    if "output" in path:
        mask_name = mask_dir + "/" + filename
        s3_bucket.download_file(s3_object.key, mask_name)
        print("Downloaded mask: " + mask_name)

# Download source images for visualization.
# tci_urls is the list of visual band URLs gathered from the search results,
# for example: [item["Assets"]["visual"]["Href"] for item in items_list]
for tci_url in tci_urls:
    url_parts = urlparse(tci_url)
    img_id = url_parts.path.split("/")[-2]
    tci_download_path = image_dir + "/" + img_id + "_TCI.tif"
    cogs_bucket = boto3.resource(
        "s3", config=Config(signature_version=UNSIGNED, region_name="us-west-2")
    ).Bucket(url_parts.hostname.split(".")[0])
    cogs_bucket.download_file(url_parts.path[1:], tci_download_path)
    print("Downloaded image: " + img_id)

print("Downloads complete.")

image_files = glob("images/lake_mead/*.tif")
mask_files = glob("masks/lake_mead/*.tif")
image_files.sort(key=lambda x: x.split("S2A_")[1])
mask_files.sort(key=lambda x: x.split("S2A_")[1])
overlay_dir = "./masks/lake_mead_overlay"
os.makedirs(overlay_dir, exist_ok=True)
lake_areas = []
mask_dates = []

for image_file, mask_file in zip(image_files, mask_files):
    image_id = image_file.split("/")[-1].split("_TCI")[0]
    mask_id = mask_file.split("/")[-1].split(".tif")[0]
    mask_date = mask_id.split("_")[2]
    mask_dates.append(mask_date)
    assert image_id == mask_id
    image = tifffile.imread(image_file)
    image_ds = cv2.resize(image, (1830, 1830), interpolation=cv2.INTER_LINEAR)
    mask = tifffile.imread(mask_file)
    water_mask = np.isin(mask, [6]).astype(np.uint8)  # water has a class index 6
    lake_mask = water_mask[1000:, :1100]
    lake_area = lake_mask.sum() * 60 * 60 / (1000 * 1000)  # calculate the surface area
    lake_areas.append(lake_area)
    contour, _ = cv2.findContours(water_mask, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    combined = cv2.drawContours(image_ds, contour, -1, (255, 0, 0), 4)
    lake_crop = combined[1000:, :1100]
    cv2.putText(lake_crop, f"{mask_date}", (10,50), cv2.FONT_HERSHEY_SIMPLEX, 1.5, (0, 0, 0), 3, cv2.LINE_AA)
    cv2.putText(lake_crop, f"{lake_area} [sq km]", (10,100), cv2.FONT_HERSHEY_SIMPLEX, 1.5, (0, 0, 0), 3, cv2.LINE_AA)
    overlay_file = overlay_dir + '/' + mask_date + '.png'
    cv2.imwrite(overlay_file, cv2.cvtColor(lake_crop, cv2.COLOR_RGB2BGR))

# Plot water surface area vs. time.
plt.figure(figsize=(20,10))
plt.title('Lake Mead surface area for the 2021.02 - 2022.07 period.', fontsize=20)
plt.xticks(rotation=45)
plt.ylabel('Water surface area [sq km]', fontsize=14)
plt.plot(mask_dates, lake_areas, marker='o')
plt.grid('on')
plt.ylim(240, 320)
for i, v in enumerate(lake_areas):
    plt.text(i, v+2, "%d" %v, ha='center')
plt.show()
```
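The surface-area arithmetic used in the loop can be isolated in a minimal sketch. The mask below is synthetic, and the 60 m × 60 m pixel size is the assumed resolution of the land cover segmentation output:

```python
import numpy as np

# Build a synthetic 100 x 100 land cover mask and mark the top-left
# quadrant as water (class index 6); the rest is class 0 (No_data).
mask = np.zeros((100, 100), dtype=np.uint8)
mask[:50, :50] = 6

# Count water pixels, then convert pixel count to square kilometers:
# each pixel covers 60 m * 60 m, and 1 sq km = 1,000,000 sq m.
water_pixels = int(np.isin(mask, [6]).sum())
area_sq_km = water_pixels * 60 * 60 / (1000 * 1000)
print(water_pixels, area_sq_km)
```

A 2,500-pixel water mask at this resolution corresponds to 9 square kilometers of surface area.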

Using `matplotlib`, you can visualize the results with a graph. The graph shows that the surface area of Lake Mead decreased from January 2021 to July 2022.

![\[A bar graph showing the surface area of Lake Mead decreased from January 2021-July 2022\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/lake-mead-decrease.png)


# Using processing jobs for custom geospatial workloads
<a name="geospatial-custom-operations"></a>

With [Amazon SageMaker Processing](processing-job.md), you can use a simplified, managed experience on SageMaker AI to run your data processing workloads with the purpose-built geospatial container.

The underlying infrastructure for an Amazon SageMaker Processing job is fully managed by SageMaker AI. During a processing job, cluster resources are provisioned for the duration of your job, and cleaned up when the job completes.

![\[Running a processing job.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/Processing-1.png)


The preceding diagram shows how SageMaker AI runs a geospatial processing job. SageMaker AI takes your geospatial workload script, copies your geospatial data from Amazon Simple Storage Service (Amazon S3), and then pulls the specified geospatial container. Cluster resources are provisioned for the duration of your job, and cleaned up when the job completes. The output of the processing job is stored in the bucket that you specified. 

**Path naming constraints**  
The local paths inside a processing job's container must begin with **/opt/ml/processing/**.

SageMaker geospatial provides a purpose-built container, `081189585635.dkr.ecr.us-west-2.amazonaws.com/sagemaker-geospatial-v1-0:latest`, that you can specify when running a processing job.

**Topics**
+ [Overview: Run processing jobs using `ScriptProcessor` and a SageMaker geospatial container](geospatial-custom-operations-overview.md)
+ [Using `ScriptProcessor` to calculate the Normalized Difference Vegetation Index (NDVI) using Sentinel-2 satellite data](geospatial-custom-operations-procedure.md)

# Overview: Run processing jobs using `ScriptProcessor` and a SageMaker geospatial container
<a name="geospatial-custom-operations-overview"></a>

SageMaker geospatial provides a purpose-built processing container, `081189585635.dkr.ecr.us-west-2.amazonaws.com/sagemaker-geospatial-v1-0:latest`. You can use this container when running a job with Amazon SageMaker Processing. When you create an instance of the [https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ScriptProcessor](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ScriptProcessor) class that is available through the *Amazon SageMaker Python SDK for Processing*, specify this `image_uri`.

**Note**  
If you receive a `ResourceLimitExceeded` error when attempting to start a processing job, you need to request a quota increase. To get started on a Service Quotas quota increase request, see [Requesting a quota increase](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html) in the *Service Quotas User Guide*.

**Prerequisites for using `ScriptProcessor`**

1. You have created a Python script that specifies your geospatial ML workload.

1. You have granted the SageMaker AI execution role access to any Amazon S3 buckets that are needed.

1. You have prepared your data for import into the container. Amazon SageMaker Processing jobs support setting the `s3_data_type` to either `"ManifestFile"` or `"S3Prefix"`.

The following procedure shows you how to create an instance of `ScriptProcessor` and submit an Amazon SageMaker Processing job using the SageMaker geospatial container.

**To create a `ScriptProcessor` instance and submit an Amazon SageMaker Processing job using a SageMaker geospatial container**

1. Instantiate an instance of the `ScriptProcessor` class using the SageMaker geospatial image:

   ```
   from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput
   	
   sm_session = sagemaker.session.Session()
   execution_role_arn = sagemaker.get_execution_role()
   
   # purpose-built geospatial container
   image_uri = '081189585635.dkr.ecr.us-west-2.amazonaws.com/sagemaker-geospatial-v1-0:latest'
   
   script_processor = ScriptProcessor(
   	command=['python3'],
   	image_uri=image_uri,
   	role=execution_role_arn,
   	instance_count=4,
   	instance_type='ml.m5.4xlarge',
   	sagemaker_session=sm_session
   )
   ```

   Replace *execution_role_arn* with the ARN of the SageMaker AI execution role that has access to the input data stored in Amazon S3 and any other AWS services that you want to call in your processing job. You can update the `instance_count` and the `instance_type` to match the requirements of your processing job.

1. To start a processing job, use the `.run()` method:

   ```
   # Can be replaced with any S3 compliant string for the name of the folder.
   s3_folder = "geospatial-data-analysis"
   
   # Use .default_bucket() to get the name of the S3 bucket associated with your current SageMaker session
   s3_bucket = sm_session.default_bucket()
   					
   s3_manifest_uri = f's3://{s3_bucket}/{s3_folder}/manifest.json'
   s3_prefix_uri = f's3://{s3_bucket}/{s3_folder}/image-prefix'
   
   script_processor.run(
   	code='preprocessing.py',
   	inputs=[
   		ProcessingInput(
   			source=s3_manifest_uri | s3_prefix_uri ,
   			destination='/opt/ml/processing/input_data/',
   			s3_data_type= "ManifestFile" | "S3Prefix",
   			s3_data_distribution_type= "ShardedByS3Key" | "FullyReplicated"
   		)
   	],
   	outputs=[
           ProcessingOutput(
               source='/opt/ml/processing/output_data/',
               destination=s3_output_prefix_url
           )
       ]
   )
   ```
   + Replace *preprocessing.py* with the name of your own Python data processing script.
   + A processing job supports two methods for formatting your input data. You can either create a manifest file that points to all of the input data for your processing job, or you can use a common prefix on each individual data input. If you created a manifest file, set `s3_data_type` to `"ManifestFile"`. If you used a file prefix, set `s3_data_type` to `"S3Prefix"`. You specify the path to your data using `source`.
   + You can distribute your processing job data two ways:
     + Distribute your data to all processing instances by setting `s3_data_distribution_type` equal to `FullyReplicated`.
     + Distribute your data in shards based on the Amazon S3 key by setting `s3_data_distribution_type` equal to `ShardedByS3Key`. When you use `ShardedByS3Key` one shard of data is sent to each processing instance.

    You can use a script to process SageMaker geospatial data. That script can be found in [Step 3: Writing a script that can calculate the NDVI](geospatial-custom-operations-procedure.md#geospatial-custom-operations-script-mode). To learn more about the `.run()` API operation, see [https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ScriptProcessor.run](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ScriptProcessor.run) in the *Amazon SageMaker Python SDK for Processing*.
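The difference between the two distribution types can be roughly illustrated in plain Python. The round-robin split below only approximates how `ShardedByS3Key` divides input objects across instances (the service's exact assignment may differ), and the bucket and key names are made up:

```python
# Ten made-up input objects and four processing instances.
keys = [f"s3://amzn-s3-demo-bucket/input/image-{i:02d}.tif" for i in range(10)]
instance_count = 4

# ShardedByS3Key (approximate): each instance receives a distinct subset.
sharded = [keys[i::instance_count] for i in range(instance_count)]

# FullyReplicated: every instance receives the full list of objects.
fully_replicated = [list(keys) for _ in range(instance_count)]

print([len(shard) for shard in sharded])  # subsets that cover all ten objects
print(len(fully_replicated[0]))           # every instance sees all ten objects
```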

To monitor the progress of your processing job, the `ProcessingJobs` class supports a [https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ProcessingJob.describe](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ProcessingJob.describe) method. This method returns a response from the `DescribeProcessingJob` API call. To learn more, see [`DescribeProcessingJob` in the *Amazon SageMaker AI API Reference*](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeProcessingJob.html).

The next topic shows you how to create an instance of the `ScriptProcessor` class using the SageMaker geospatial container, and then how to use it to calculate the Normalized Difference Vegetation Index (NDVI) with Sentinel-2 images.



# Using `ScriptProcessor` to calculate the Normalized Difference Vegetation Index (NDVI) using Sentinel-2 satellite data
<a name="geospatial-custom-operations-procedure"></a>

The following code samples show you how to calculate the normalized difference vegetation index of a specific geographical area using the purpose-built geospatial image within a Studio Classic notebook and run a large-scale workload with Amazon SageMaker Processing using [https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ScriptProcessor](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ScriptProcessor) from the SageMaker AI Python SDK.
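The NDVI itself is computed from the near-infrared and red bands: NDVI = (NIR - Red) / (NIR + Red). For Sentinel-2, band B08 is near-infrared and band B04 is red. The following is a minimal sketch of that formula using made-up reflectance values:

```python
import numpy as np

# NDVI = (NIR - Red) / (NIR + Red). The reflectance arrays below are
# made-up example values, not real Sentinel-2 data.
nir = np.array([[0.60, 0.55], [0.62, 0.58]])
red = np.array([[0.10, 0.15], [0.12, 0.08]])

ndvi = (nir - red) / (nir + red)
print(ndvi.round(3))
```

Values close to 1 indicate dense vegetation, while values near 0 or below indicate bare soil, water, or clouds.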

This demo also uses an Amazon SageMaker Studio Classic notebook instance that uses the geospatial kernel and instance type. To learn how to create a Studio Classic geospatial notebook instance, see [Create an Amazon SageMaker Studio Classic notebook using the geospatial image](geospatial-launch-notebook.md).

You can follow along with this demo in your own notebook instance by copying and pasting the following code snippets:

1. [Use `search_raster_data_collection` to query a specific area of interest (AOI) over a given time range using a specific raster data collection, Sentinel-2.](#geospatial-custom-operations-procedure-search)

1. [Create a manifest file that specifies what data will be processed during the processing job.](#geospatial-custom-operations-procedure-manifest)

1. [Write a data processing Python script calculating the NDVI.](#geospatial-custom-operations-script-mode)

1. [Create a `ScriptProcessor` instance and start the Amazon SageMaker Processing job](#geospatial-custom-operations-create).

1. [Visualize the results of your processing job](#geospatial-custom-operations-visual).

## Query the Sentinel-2 raster data collection using `SearchRasterDataCollection`
<a name="geospatial-custom-operations-procedure-search"></a>

With `search_raster_data_collection` you can query supported raster data collections. This example uses data that's pulled from Sentinel-2 satellites. The area of interest (`AreaOfInterest`) specified is rural northern Iowa, and the time range (`TimeRangeFilter`) is January 1, 2022 to December 30, 2022. To see the available raster data collections in your AWS Region use `list_raster_data_collections`. To see a code example using this API, see [`ListRasterDataCollections`](geospatial-data-collections.md) in the *Amazon SageMaker AI Developer Guide*.

In the following code examples, you use the ARN associated with the Sentinel-2 raster data collection, `arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8`.

A `search_raster_data_collection` API request requires two parameters:
+ You need to specify an `Arn` parameter that corresponds to the raster data collection that you want to query.
+ You also need to specify a `RasterDataCollectionQuery` parameter, which takes in a Python dictionary.

The following code example contains the necessary key-value pairs needed for the `RasterDataCollectionQuery` parameter saved to the `search_rdc_query` variable.

```
search_rdc_query = {
    "AreaOfInterest": {
        "AreaOfInterestGeometry": {
            "PolygonGeometry": {
                "Coordinates": [
                    [
                        [-94.50938680498298, 43.22487436936203],
                        [-94.50938680498298, 42.843474642037194],
                        [-93.86520004156142, 42.843474642037194],
                        [-93.86520004156142, 43.22487436936203],
                        [-94.50938680498298, 43.22487436936203]
                    ]
                ]
            }
        }
    },
    "TimeRangeFilter": {"StartTime": "2022-01-01T00:00:00Z", "EndTime": "2022-12-30T23:59:59Z"}
}
```

To make the `search_raster_data_collection` request, you must specify the ARN of the Sentinel-2 raster data collection: `arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8`. You must also pass in the Python dictionary that was defined previously, which specifies the query parameters. 

```
import boto3

## Creates a boto3 session and a SageMaker geospatial client instance 
session = boto3.Session()
sm_geo_client = session.client(service_name="sagemaker-geospatial")

search_rdc_response1 = sm_geo_client.search_raster_data_collection(
    Arn='arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8',
    RasterDataCollectionQuery=search_rdc_query
)
```

The results of this API cannot be paginated using `get_paginator`. To collect all the satellite images returned by the `search_raster_data_collection` operation, you can implement a `while` loop that checks for `NextToken` in the API response:

```
## Holds the accumulated results from search_raster_data_collection
items_list = []
while True:
    items_list.extend(search_rdc_response1['Items'])
    next_token = search_rdc_response1.get('NextToken')
    if not next_token:
        break

    search_rdc_response1 = sm_geo_client.search_raster_data_collection(
        Arn='arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8',
        RasterDataCollectionQuery=search_rdc_query,
        NextToken=next_token
    )
```

The API response returns a list of URLs under the `Assets` key corresponding to specific image bands. The following is a truncated version of the API response. Some of the image bands were removed for clarity.

```
{
	'Assets': {
        'aot': {
            'Href': 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/15/T/UH/2022/12/S2A_15TUH_20221230_0_L2A/AOT.tif'
        },
        'blue': {
            'Href': 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/15/T/UH/2022/12/S2A_15TUH_20221230_0_L2A/B02.tif'
        },
        'swir22-jp2': {
            'Href': 's3://sentinel-s2-l2a/tiles/15/T/UH/2022/12/30/0/B12.jp2'
        },
        'visual-jp2': {
            'Href': 's3://sentinel-s2-l2a/tiles/15/T/UH/2022/12/30/0/TCI.jp2'
        },
        'wvp-jp2': {
            'Href': 's3://sentinel-s2-l2a/tiles/15/T/UH/2022/12/30/0/WVP.jp2'
        }
    },
    'DateTime': datetime.datetime(2022, 12, 30, 17, 21, 52, 469000, tzinfo = tzlocal()),
    'Geometry': {
        'Coordinates': [
            [
                [-95.46676936182894, 43.32623760511659],
                [-94.11293433656887, 43.347431265475954],
                [-94.09532154452742, 42.35884880571144],
                [-95.42776890002203, 42.3383710796791],
                [-95.46676936182894, 43.32623760511659]
            ]
        ],
        'Type': 'Polygon'
    },
    'Id': 'S2A_15TUH_20221230_0_L2A',
    'Properties': {
        'EoCloudCover': 62.384969,
        'Platform': 'sentinel-2a'
    }
}
```

In the [next section](#geospatial-custom-operations-procedure-manifest), you create a manifest file using the `'Id'` key from the API response.

## Create an input manifest file using the `Id` key from the `search_raster_data_collection` API response
<a name="geospatial-custom-operations-procedure-manifest"></a>

When you run a processing job, you must specify a data input from Amazon S3. The input data can either be a manifest file that points to the individual data files, or a common prefix shared by each file that you want processed. The following code examples define the folder where your manifest files will be generated.

Use SDK for Python (Boto3) to get the default bucket and the ARN of the execution role that is associated with your Studio Classic notebook instance:

```
sm_session = sagemaker.session.Session()
s3 = boto3.resource('s3')
# Gets the default execution role associated with the notebook
execution_role_arn = sagemaker.get_execution_role() 

# Gets the default bucket associated with the notebook
s3_bucket = sm_session.default_bucket() 

# Can be replaced with any name
s3_folder = "script-processor-input-manifest"
```

Next, you create a manifest file. It holds the URLs of the satellite images that you want processed when you run your processing job later in step 4.

```
# Format of a manifest file
manifest_prefix = {}
manifest_prefix['prefix'] = 's3://' + s3_bucket + '/' + s3_folder + '/'
manifest = [manifest_prefix]

print(manifest)
```

The preceding code sample prints the S3 prefix where your manifest files will be created:

```
[{'prefix': 's3://sagemaker-us-west-2-111122223333/script-processor-input-manifest/'}]
```

Not all of the response elements from the `search_raster_data_collection` response are needed to run the processing job. 

The following code snippet removes the unnecessary elements `'Properties'`, `'Geometry'`, and `'DateTime'`. The `'Id'` key-value pair, `'Id': 'S2A_15TUH_20221230_0_L2A'`, contains the year and the month. The following code example parses that data to create new keys in the Python dictionary `dict_month_items`. The values are the assets that are returned from the `SearchRasterDataCollection` query. 

```
# For each response get the month and year, and then remove the metadata not related to the satellite images.
dict_month_items = {}
for item in items_list:
    # Example ID being split: 'S2A_15TUH_20221230_0_L2A' 
    yyyymm = item['Id'].split("_")[2][:6]
    if yyyymm not in dict_month_items:
        dict_month_items[yyyymm] = []
    
    # Removes unneeded metadata elements for this demo 
    item.pop('Properties', None)
    item.pop('Geometry', None)
    item.pop('DateTime', None)

    # Appends the response from search_raster_data_collection to newly created key above
    dict_month_items[yyyymm].append(item)
```
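As a quick standalone check (plain Python, no AWS calls), you can verify how the year and month are parsed out of a scene ID like the one shown above:

```python
# The scene ID places the acquisition date in the third underscore-separated
# field; the first six characters of that field are yyyymm.
item_id = "S2A_15TUH_20221230_0_L2A"
yyyymm = item_id.split("_")[2][:6]
print(yyyymm)  # 202212
```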

This code example uploads `dict_month_items` to Amazon S3 as JSON objects using the [upload_file](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/upload_file.html) API operation:

```
import json
import os

## key_ is the yyyymm timestamp formatted above
## value_ is the reference to all the satellite images collected via our searchRDC query 
for key_, value_ in dict_month_items.items():
    filename = f'manifest_{key_}.json'
    with open(filename, 'w') as fp:
        json.dump(value_, fp)
    s3.meta.client.upload_file(filename, s3_bucket, s3_folder + '/' + filename)
    manifest.append(filename)
    os.remove(filename)
```

This code example uploads a parent `manifest.json` file that points to all the other manifests uploaded to Amazon S3. It also saves the path to a local variable, `s3_manifest_uri`. You'll use that variable again to specify the source of the input data when you run the processing job in step 4.

```
with open('manifest.json', 'w') as fp:
    json.dump(manifest, fp)
s3.meta.client.upload_file('manifest.json', s3_bucket, s3_folder + '/' + 'manifest.json')
os.remove('manifest.json')

s3_manifest_uri = f's3://{s3_bucket}/{s3_folder}/manifest.json'
```

Now that you created the input manifest files and uploaded them, you can write a script that processes your data in the processing job. It processes the data from the satellite images, calculates the NDVI, and then returns the results to a different Amazon S3 location.

## Write a script that calculates the NDVI
<a name="geospatial-custom-operations-script-mode"></a>

Amazon SageMaker Studio Classic supports the use of the `%%writefile` cell magic command. After you run a cell with this command, its contents are saved to your local Studio Classic directory. The following script is specific to calculating the NDVI. However, these points can be useful when you write your own script for a processing job:
+ In your processing job container, the local paths inside the container must begin with `/opt/ml/processing/`. In this example, `input_data_path = '/opt/ml/processing/input_data/'` and `processed_data_path = '/opt/ml/processing/output_data/'` are specified in that way.
+ With Amazon SageMaker Processing, a script that a processing job runs can upload your processed data directly to Amazon S3. To do so, make sure that the execution role associated with your `ScriptProcessor` instance has the necessary requirements to access the S3 bucket. You can also specify an outputs parameter when you run your processing job. To learn more, see the [`.run()` API operation ](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ScriptProcessor.run) in the *Amazon SageMaker Python SDK*. In this code example, the results of the data processing are uploaded directly to Amazon S3.
+ To manage the size of the Amazon EBS volume attached to your processing job container, use the `volume_size_in_gb` parameter. The volume's default size is 30 GB. You can also optionally use the Python library [Garbage Collector](https://docs.python.org/3/library/gc.html) to manage storage in your container.

  The following code example loads the arrays into the processing job container. When arrays build up and fill the memory, the processing job crashes. To prevent this crash, the following example contains commands that remove the arrays from the processing job's container.

```
%%writefile compute_ndvi.py

import os
import json
import rioxarray

if __name__ == "__main__":
    print("Starting processing")
    
    input_data_path = '/opt/ml/processing/input_data/'
    input_files = []
    
    for current_path, sub_dirs, files in os.walk(input_data_path):
        for file in files:
            if file.endswith(".json"):
                input_files.append(os.path.join(current_path, file))
    
    print("Received {} input_files: {}".format(len(input_files), input_files))

    items = []
    for input_file in input_files:
        full_file_path = os.path.join(input_data_path, input_file)
        print(full_file_path)
        with open(full_file_path, 'r') as f:
            items.append(json.load(f))
            
    items = [item for sub_items in items for item in sub_items]

    for item in items:
        red_uri = item["Assets"]["red"]["Href"]
        nir_uri = item["Assets"]["nir"]["Href"]

        red = rioxarray.open_rasterio(red_uri, masked=True)
        nir = rioxarray.open_rasterio(nir_uri, masked=True)

        ndvi = (nir - red)/ (nir + red)
        
        file_name = 'ndvi_' + item["Id"] + '.tif'
        output_path = '/opt/ml/processing/output_data'
        output_file_path = f"{output_path}/{file_name}"
        
        ndvi.rio.to_raster(output_file_path)
        print("Written output:", output_file_path)
```
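As noted above, you can optionally use Python's garbage collector to free memory inside the processing container between iterations. The following is a minimal standard-library sketch of that pattern:

```python
import gc

# Simulate a large intermediate array, drop the reference, and ask the
# garbage collector to reclaim memory before the next iteration.
big_array = [0.0] * 1_000_000
del big_array
unreachable = gc.collect()  # returns the number of unreachable objects found
print("collected:", unreachable)
```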

You now have a script that can calculate the NDVI. Next, you can create an instance of the `ScriptProcessor` class and run your processing job.

## Creating an instance of the `ScriptProcessor` class
<a name="geospatial-custom-operations-create"></a>

This demo uses the [ScriptProcessor](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ScriptProcessor) class that is available via the Amazon SageMaker Python SDK. First, you need to create an instance of the class, and then you can start your Processing job by using the `.run()` method.

```
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

image_uri = '081189585635.dkr.ecr.us-west-2.amazonaws.com/sagemaker-geospatial-v1-0:latest'

processor = ScriptProcessor(
	command=['python3'],
	image_uri=image_uri,
	role=execution_role_arn,
	instance_count=4,
	instance_type='ml.m5.4xlarge',
	sagemaker_session=sm_session
)

print('Starting processing job.')
```

When you start your processing job, you need to specify a [ProcessingInput](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ProcessingInput) object. In that object, you specify the following:
+ The path to the manifest file that you created in step 2, `s3_manifest_uri`. This is the source of the input data to the container.
+ The path to where you want the input data to be saved in the container. This must match the path that you specified in your script.
+ Use the `s3_data_type` parameter to specify the input as `"ManifestFile"`.

```
s3_output_prefix_url = f"s3://{s3_bucket}/{s3_folder}/output"

processor.run(
    code='compute_ndvi.py',
    inputs=[
        ProcessingInput(
            source=s3_manifest_uri,
            destination='/opt/ml/processing/input_data/',
            s3_data_type="ManifestFile",
            s3_data_distribution_type="ShardedByS3Key"
        ),
    ],
    outputs=[
        ProcessingOutput(
            source='/opt/ml/processing/output_data/',
            destination=s3_output_prefix_url,
            s3_upload_mode="Continuous"
        )
    ]
)
```

The following code example uses the [`.describe()` method](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ProcessingJob.describe) to get details of your Processing job.

```
preprocessing_job_descriptor = processor.jobs[-1].describe()
s3_output_uri = preprocessing_job_descriptor["ProcessingOutputConfig"]["Outputs"][0]["S3Output"]["S3Uri"]
print(s3_output_uri)
```

## Visualizing your results using `matplotlib`
<a name="geospatial-custom-operations-visual"></a>

With the [Matplotlib](https://matplotlib.org/stable/index.html) Python library, you can plot raster data. Before you plot the data, you need to calculate the NDVI using sample images from the Sentinel-2 satellites. The following code example opens the image arrays using the `.open_rasterio()` API operation, and then calculates the NDVI using the `nir` and `red` image bands from the Sentinel-2 satellite data. 

```
# Opens the python arrays 
import rioxarray

red_uri = items[25]["Assets"]["red"]["Href"]
nir_uri = items[25]["Assets"]["nir"]["Href"]

red = rioxarray.open_rasterio(red_uri, masked=True)
nir = rioxarray.open_rasterio(nir_uri, masked=True)

# Calculates the NDVI
ndvi = (nir - red)/ (nir + red)

# Common plotting library in Python 
import matplotlib.pyplot as plt

f, ax = plt.subplots(figsize=(18, 18))
ndvi.plot(cmap='viridis', ax=ax)
ax.set_title("NDVI for {}".format(items[25]["Id"]))
ax.set_axis_off()
plt.show()
```

The output of the preceding code example is a satellite image with the NDVI values overlaid on it. An NDVI value near 1 indicates that lots of vegetation is present, and values near 0 indicate that no vegetation is present.

![\[A satellite image of northern Iowa with the NDVI overlaid on top\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/ndvi-iowa.png)


This completes the demo of using `ScriptProcessor`.

# Earth Observation Jobs
<a name="geospatial-eoj"></a>

Using an Earth Observation job (EOJ), you can acquire, transform, and visualize geospatial data to make predictions. You can choose an operation based on your use case from a wide range of operations and models. You get the flexibility of choosing your area of interest, selecting the data providers, and setting time-range-based and cloud-cover-percentage-based filters. After SageMaker AI creates an EOJ for you, you can visualize the inputs and outputs of the job using the visualization functionality. An EOJ has various use cases, including comparing deforestation over time and diagnosing plant health. You can create an EOJ by using a SageMaker notebook with a SageMaker geospatial image. You can also access the SageMaker geospatial UI as a part of the Amazon SageMaker Studio Classic UI to view the list of all your jobs, and to pause or stop an ongoing job. You can choose a job from the list of available EOJs to view the **Job summary** and the **Job details**, as well as visualize the **Job output**.

**Topics**
+ [Create an Earth Observation Job Using an Amazon SageMaker Studio Classic Notebook with a SageMaker geospatial Image](geospatial-eoj-ntb.md)
+ [Types of Operations](geospatial-eoj-models.md)

# Create an Earth Observation Job Using an Amazon SageMaker Studio Classic Notebook with a SageMaker geospatial Image
<a name="geospatial-eoj-ntb"></a>

**To use a SageMaker Studio Classic notebook with a SageMaker geospatial image:**

1. From the **Launcher**, choose **Change environment** under **Notebooks and compute resources**.

1. In the **Change environment** dialog that opens, select the **Image** dropdown and choose **Geospatial 1.0**. The **Instance type** should be **ml.geospatial.interactive**. Do not change the default values for other settings.

1. Choose **Select**.

1. Choose **Create notebook**.

You can initiate an EOJ using an Amazon SageMaker Studio Classic notebook with a SageMaker geospatial image using the following code.

```
import boto3
import sagemaker
import sagemaker_geospatial_map

session = boto3.Session()
execution_role = sagemaker.get_execution_role()
sg_client = session.client(service_name="sagemaker-geospatial")
```

The following is an example showing how to create an EOJ in the US West (Oregon) Region.

```
#Query and Access Data
search_rdc_args = {
    "Arn": "arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8",  # sentinel-2 L2A COG
    "RasterDataCollectionQuery": {
        "AreaOfInterest": {
            "AreaOfInterestGeometry": {
                "PolygonGeometry": {
                    "Coordinates": [
                        [
                            [-114.529, 36.142],
                            [-114.373, 36.142],
                            [-114.373, 36.411],
                            [-114.529, 36.411],
                            [-114.529, 36.142],
                        ]
                    ]
                }
            }
        },
        "TimeRangeFilter": {
            "StartTime": "2021-01-01T00:00:00Z",
            "EndTime": "2022-07-10T23:59:59Z",
        },
        "PropertyFilters": {
            "Properties": [{"Property": {"EoCloudCover": {"LowerBound": 0, "UpperBound": 1}}}],
            "LogicalOperator": "AND",
        },
        "BandFilter": ["visual"],
    },
}

tci_urls = []
data_manifests = []
while search_rdc_args.get("NextToken", True):
    search_result = sg_client.search_raster_data_collection(**search_rdc_args)
    if search_result.get("NextToken"):
        data_manifests.append(search_result)
    for item in search_result["Items"]:
        tci_url = item["Assets"]["visual"]["Href"]
        print(tci_url)
        tci_urls.append(tci_url)

    search_rdc_args["NextToken"] = search_result.get("NextToken")
        
# Perform land cover segmentation on images returned from the sentinel dataset.
eoj_input_config = {
    "RasterDataCollectionQuery": {
        "RasterDataCollectionArn": "arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8",
        "AreaOfInterest": {
            "AreaOfInterestGeometry": {
                "PolygonGeometry": {
                    "Coordinates": [
                        [
                            [-114.529, 36.142],
                            [-114.373, 36.142],
                            [-114.373, 36.411],
                            [-114.529, 36.411],
                            [-114.529, 36.142],
                        ]
                    ]
                }
            }
        },
        "TimeRangeFilter": {
            "StartTime": "2021-01-01T00:00:00Z",
            "EndTime": "2022-07-10T23:59:59Z",
        },
        "PropertyFilters": {
            "Properties": [{"Property": {"EoCloudCover": {"LowerBound": 0, "UpperBound": 1}}}],
            "LogicalOperator": "AND",
        },
    }
}
eoj_config = {"LandCoverSegmentationConfig": {}}

response = sg_client.start_earth_observation_job(
    Name="lake-mead-landcover",
    InputConfig=eoj_input_config,
    JobConfig=eoj_config,
    ExecutionRoleArn=execution_role,
)
```

After your EOJ is created, the `Arn` is returned to you. You use the `Arn` to identify a job and perform further operations. To get the status of a job, you can run `sg_client.get_earth_observation_job(Arn = response['Arn'])`.

The following example shows how to query the status of an EOJ until it is completed.

```
eoj_arn = response["Arn"]
job_details = sg_client.get_earth_observation_job(Arn=eoj_arn)
{k: v for k, v in job_details.items() if k in ["Arn", "Status", "DurationInSeconds"]}
# List all jobs in the account
sg_client.list_earth_observation_jobs()["EarthObservationJobSummaries"]
```
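The snippet above fetches the status once. If you want to block until the job finishes, you can wrap `get_earth_observation_job` in a small polling loop. This is a sketch only: the terminal status values used here (`COMPLETED`, `FAILED`, `STOPPED`) are assumptions based on typical SageMaker job lifecycles, so check the `Status` values your jobs actually report.

```python
import time

def wait_for_eoj(client, arn, poll_seconds=30):
    """Poll an Earth Observation job until it reaches a terminal state."""
    # Assumed terminal statuses; adjust if your jobs report different values.
    terminal = {"COMPLETED", "FAILED", "STOPPED"}
    while True:
        details = client.get_earth_observation_job(Arn=arn)
        if details["Status"] in terminal:
            return details
        time.sleep(poll_seconds)
```

For example, `wait_for_eoj(sg_client, eoj_arn)` returns the final job details once the job is no longer running.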

After the EOJ is completed, you can visualize the EOJ outputs directly in the notebook. The following example shows you how an interactive map can be rendered.

```
map = sagemaker_geospatial_map.create_map({
    'is_raster': True
})
map.set_sagemaker_geospatial_client(sg_client)
# render the map
map.render()
```

The following example shows how the map can be centered on an area of interest and the input and output of the EOJ can be rendered as separate layers within the map.

```
# visualize the area of interest
config = {"label": "Lake Mead AOI"}
aoi_layer = map.visualize_eoj_aoi(Arn=eoj_arn, config=config)

# Visualize input.
time_range_filter = {
    "start_date": "2022-07-01T00:00:00Z",
    "end_date": "2022-07-10T23:59:59Z",
}
config = {"label": "Input"}

input_layer = map.visualize_eoj_input(
    Arn=eoj_arn, config=config, time_range_filter=time_range_filter
)
# Visualize output, EOJ needs to be in completed status.
time_range_filter = {
    "start_date": "2022-07-01T00:00:00Z",
    "end_date": "2022-07-10T23:59:59Z",
}
config = {"preset": "singleBand", "band_name": "mask"}
output_layer = map.visualize_eoj_output(
    Arn=eoj_arn, config=config, time_range_filter=time_range_filter
)
```

You can use the `export_earth_observation_job` function to export the EOJ results to your Amazon S3 bucket. The export function makes it convenient to share results across teams. SageMaker AI also simplifies dataset management. You can share the EOJ results using the job ARN, instead of crawling thousands of files in the S3 bucket. Each EOJ becomes an asset in the data catalog, as results can be grouped by the job ARN. The following example shows how you can export the results of an EOJ.

```
sagemaker_session = sagemaker.Session()
s3_bucket_name = sagemaker_session.default_bucket()  # Replace with your own bucket if needed
s3_bucket = session.resource("s3").Bucket(s3_bucket_name)
prefix = "eoj_lakemead"  # Replace with the S3 prefix desired
export_bucket_and_key = f"s3://{s3_bucket_name}/{prefix}/"

eoj_output_config = {"S3Data": {"S3Uri": export_bucket_and_key}}
export_response = sg_client.export_earth_observation_job(
    Arn=eoj_arn,
    ExecutionRoleArn=execution_role,
    OutputConfig=eoj_output_config,
    ExportSourceImages=False,
)
```

You can monitor the status of your export job using the following snippet.

```
# Monitor the export job status
export_job_details = sg_client.get_earth_observation_job(Arn=export_response["Arn"])
{k: v for k, v in export_job_details.items() if k in ["Arn", "Status", "DurationInSeconds"]}
```

You are not charged storage fees after you delete the EOJ.

For an example that showcases how to run an EOJ, see this [blog post](https://aws.amazon.com/blogs/machine-learning/monitoring-lake-mead-drought-using-the-new-amazon-sagemaker-geospatial-capabilities/).

For more example notebooks on SageMaker geospatial capabilities, see this [GitHub repository](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-geospatial).

# Types of Operations
<a name="geospatial-eoj-models"></a>

When you create an EOJ, you select an operation based on your use case. Amazon SageMaker geospatial capabilities provide a combination of purpose-built operations and pre-trained models. You can use these operations to understand the impact of environmental changes and human activities over time or identify cloud and cloud-free pixels.

**Cloud Masking**

Identifying clouds in satellite images is an essential preprocessing step in producing high-quality geospatial data. Ignoring cloud pixels can lead to errors in analysis, and over-detection of cloud pixels can decrease the number of valid observations. Cloud masking identifies cloudy and cloud-free pixels in satellite images. An accurate cloud mask helps prepare satellite images for processing and improves data generation. The following is the class map for cloud masking.

```
{
0: "No_cloud",
1: "cloud"
}
```

**Cloud Removal**

Cloud removal for Sentinel-2 data uses an ML-based semantic segmentation model to identify clouds in the image. Cloudy pixels can be replaced with pixels from other timestamps. USGS Landsat data contains Landsat metadata that is used for cloud removal.

**Temporal Statistics**

Temporal statistics calculate statistics for geospatial data through time. The temporal statistics currently supported include mean, median, and standard deviation. You can group the statistics by setting `GroupBy` to either `all` or `yearly`. You can also specify the `TargetBands`.
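As a sketch only, a temporal statistics job configuration might look like the following Python dictionary. The exact key names and allowed values here are assumptions; confirm them against the SageMaker geospatial API reference before use.

```python
# Hypothetical JobConfig for a temporal-statistics EOJ: group observations
# by year and compute the mean over the red band.
eoj_config = {
    "TemporalStatisticsConfig": {
        "GroupBy": "YEARLY",       # or "ALL"
        "Statistics": ["MEAN"],    # mean, median, and standard deviation supported
        "TargetBands": ["red"],
    }
}
print(sorted(eoj_config["TemporalStatisticsConfig"]))
```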

**Zonal Statistics**

Zonal statistics performs statistical operations over a specified area on the image. 

**Resampling**

Resampling is used to upscale and downscale the resolution of a geospatial image. The `value` attribute in resampling represents the length of a side of the pixel.
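A resampling configuration might be sketched as follows. The key names (`ResamplingConfig`, `OutputResolution`, `UserDefined`, `Unit`) are assumptions for illustration; only the role of `Value` as the pixel side length comes from the description above, so verify the shape against the API reference.

```python
# Hypothetical JobConfig for a resampling EOJ: output pixels 20 meters on a side.
eoj_config = {
    "ResamplingConfig": {
        "OutputResolution": {
            "UserDefined": {"Value": 20, "Unit": "METERS"}
        }
    }
}
print(eoj_config["ResamplingConfig"]["OutputResolution"]["UserDefined"]["Value"])
```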

**Geomosaic**

Geomosaic allows you to stitch smaller images into a large image.

**Band Stacking**

Band stacking takes more than one image band as input and stacks them into a single GeoTIFF. The `OutputResolution` attribute determines the resolution of the output image. Based on the resolutions of the input images, you can set it to `lowest`, `highest` or `average`.

**Band Math**

Band math, also known as spectral index, is a process of transforming observations from multiple spectral bands into a single band that indicates the relative abundance of features of interest. For instance, the Normalized Difference Vegetation Index (NDVI) and the Enhanced Vegetation Index (EVI) are helpful for observing the presence of green vegetation.
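For example, NDVI combines the near-infrared (NIR) and red bands as `(NIR - Red) / (NIR + Red)`. The following is a minimal per-pixel sketch on synthetic reflectance values (not real satellite data):

```python
red = [0.10, 0.20, 0.30]
nir = [0.50, 0.60, 0.30]

# NDVI per pixel: values near 1 suggest dense vegetation; near 0, little or none.
ndvi = [(n - r) / (n + r) for r, n in zip(red, nir)]
print([round(v, 3) for v in ndvi])  # [0.667, 0.5, 0.0]
```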

**Land Cover Segmentation**

Land cover segmentation is a semantic segmentation model that can identify the physical material, such as vegetation, water, and bare ground, at the earth's surface. Having an accurate way to map land cover patterns helps you understand the impact of environmental change and human activities over time. Land cover segmentation is often used for regional planning, disaster response, ecological management, and environmental impact assessment. The following is the class map for land cover segmentation.

```
{
0: "No_data",
1: "Saturated_or_defective",
2: "Dark_area_pixels",
3: "Cloud_shadows",
4: "Vegetation",
5: "Not_vegetated",
6: "Water",
7: "Unclassified",
8: "Cloud_medium_probability",
9: "Cloud_high_probability",
10: "Thin_cirrus",
11: "Snow_ice"
}
```

## Availability of EOJ Operations
<a name="geospatial-eoj-models-avail"></a>

The availability of operations depends on whether you are using the SageMaker geospatial UI or the Amazon SageMaker Studio Classic notebooks with a SageMaker geospatial image. Currently, notebooks support all functionalities. To summarize, the following geospatial operations are supported by SageMaker AI:


| Operations |  Description  |  Availability  | 
| --- | --- | --- | 
| Cloud Masking | Identify cloud and cloud-free pixels to get improved and accurate satellite imagery. | UI, Notebook | 
| Cloud Removal | Remove pixels containing parts of a cloud from satellite imagery. | Notebook | 
| Temporal Statistics | Calculate statistics through time for a given GeoTIFF. | Notebook | 
| Zonal Statistics | Calculate statistics on user-defined regions. | Notebook | 
| Resampling | Scale images to different resolutions. | Notebook | 
| Geomosaic | Combine multiple images for greater fidelity. | Notebook | 
| Band Stacking | Combine multiple spectral bands to create a single image. | Notebook | 
| Band Math / Spectral Index | Obtain a combination of spectral bands that indicate the abundance of features of interest. | UI, Notebook | 
| Land Cover Segmentation | Identify land cover types such as vegetation and water in satellite imagery. | UI, Notebook | 

# Vector Enrichment Jobs
<a name="geospatial-vej"></a>

A Vector Enrichment Job (VEJ) performs operations on your vector data. Currently, you can use a VEJ to do reverse geocoding or map matching.
<a name="geospatial-vej-rev-geo"></a>
**Reverse Geocoding**  
With a reverse geocoding VEJ, you can convert geographic coordinates (latitude, longitude) to human-readable addresses powered by Amazon Location Service. When you upload a CSV file containing the longitude and latitude coordinates, the job returns the address number, country, label, municipality, neighborhood, postal code, and region of that location. The output file consists of your input data with these values appended as columns at the end. These jobs are optimized to accept tens of thousands of GPS traces. 
<a name="geospatial-vej-map-match"></a>
**Map Matching**  
Map matching allows you to snap GPS coordinates to road segments. The input should be a CSV file containing the trace ID (route), longitude, latitude, and timestamp attributes. There can be multiple GPS coordinates per route, and the input can contain multiple routes. The output is a GeoJSON file that contains the links of the predicted route. It also has the snap points provided in the input. These jobs are optimized to accept tens of thousands of drives in one request. Map matching is supported by [OpenStreetMap](https://www.openstreetmap.org/). Map matching fails if the names in the input source field don't match the ones in `MapMatchingConfig`. The error message you receive contains the field names present in the input file and the expected field name that is not found in `MapMatchingConfig`. 

The input CSV file for a VEJ must contain the following:
+ A header row
+ Latitude and longitude in separate columns
+ The ID and timestamp columns can be in numeric or string format. All other column data must be in numeric format only.
+ No mismatched quotes
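You can check these requirements with plain Python. The column names in this snippet are illustrative only; map your actual header names in the job configuration:

```python
import csv
import io

# A minimal map-matching style input: header row, separate latitude and
# longitude columns, string IDs and timestamps, numeric coordinates.
csv_text = """trace_id,longitude,latitude,timestamp
route-1,-114.529,36.142,2022-07-01 00:00:00
route-1,-114.520,36.150,2022-07-01 00:00:30
"""
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Verify the coordinate columns parse as numbers.
for row in rows:
    float(row["longitude"]), float(row["latitude"])
print(len(rows))  # 2
```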

For the timestamp column, SageMaker geospatial capabilities support epoch time in seconds and milliseconds (long integer). The supported string formats are as follows:
+ "dd.MM.yyyy HH:mm:ss z"
+ "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
+ "yyyy-MM-dd'T'HH:mm:ss"
+ "yyyy-MM-dd hh:mm:ss a"
+ "yyyy-MM-dd HH:mm:ss"
+ "yyyyMMddHHmmss"

While you need to use an Amazon SageMaker Studio Classic notebook to execute a VEJ, you can view all the jobs you create using the UI. To use the visualization in the notebook, you first need to export your output to your S3 bucket. The VEJ actions you can perform are as follows.
+ [StartVectorEnrichmentJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_geospatial_StartVectorEnrichmentJob.html)
+ [GetVectorEnrichmentJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_geospatial_GetVectorEnrichmentJob.html)
+ [ListVectorEnrichmentJobs](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_geospatial_ListVectorEnrichmentJobs.html)
+ [StopVectorEnrichmentJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_geospatial_StopVectorEnrichmentJob.html)
+ [DeleteVectorEnrichmentJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_geospatial_DeleteVectorEnrichmentJob.html)
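As a sketch only, the job configurations for the two VEJ types might look like the following Python dictionaries. The key names (`ReverseGeocodingConfig`, `MapMatchingConfig`, and the attribute-name fields) are assumptions here; confirm them against the `StartVectorEnrichmentJob` API reference, and note that the attribute values must match your CSV header names exactly.

```python
# Hypothetical reverse-geocoding config: point at the coordinate columns.
reverse_geocoding_config = {
    "ReverseGeocodingConfig": {
        "XAttributeName": "longitude",
        "YAttributeName": "latitude",
    }
}

# Hypothetical map-matching config: also needs the route ID and timestamp columns.
map_matching_config = {
    "MapMatchingConfig": {
        "IdAttributeName": "trace_id",
        "XAttributeName": "longitude",
        "YAttributeName": "latitude",
        "TimestampAttributeName": "timestamp",
    }
}
print(sorted(map_matching_config["MapMatchingConfig"]))
```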

# Visualization Using SageMaker geospatial capabilities
<a name="geospatial-visualize"></a>

Using the visualization functionalities provided by Amazon SageMaker geospatial capabilities, you can visualize geospatial data: the inputs to your EOJ or VEJ, as well as the outputs exported from your Amazon S3 bucket. The visualization tool is powered by [Foursquare Studio](https://studio.foursquare.com/home). The following image depicts the visualization tool supported by SageMaker geospatial capabilities. 

![\[Visualization tool using SageMaker geospatial capabilities shows a map of the California coast.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/geospatial_vis.png)


You can use the left navigation panel to add data, layers, filters, and columns. You can also make modifications to how you interact with the map.

**Dataset**

The source of data used for visualization is called a **Dataset**. To add data for visualization, choose **Add Data** in the left navigation panel. You can either upload the data from your Amazon S3 bucket or your local machine. The data formats supported are CSV, JSON and GeoJSON. You can add multiple datasets to your map. After you upload the dataset, you can see it loaded on the map screen. 

**Layers**

In the layer panel, a layer is created and populated automatically when you add a dataset. If your map consists of more than one dataset, you can select which dataset belongs to a layer. You can create new layers and group them. SageMaker geospatial capabilities support various layer types, including point, arc, icon, and polygon. 

You can choose any data point in a layer to have an **Outline**. You can also further customize the data points. For example, you can choose the layer type as **Point** and then **Fill Color** based on any column of your dataset. You can also change the radius of the points. 

The following image shows the layers panel supported by SageMaker geospatial capabilities.

![\[The layers panel with data points on a USA map, supported by SageMaker geospatial capabilities.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/geospatial_vis_layer.png)


**Columns**

You can view the columns present in your dataset by using the **Columns** tab in the left navigation panel.

**Filters**

You can use filters to limit the data points that display on the map.

**Interactions**

In the **Interactions** panel, you can customize how you interact with the map. For example, you can choose what metrics to display when you hover the tooltip over a data point.

**Base map**

Currently, SageMaker AI only supports the Amazon Dark base map.

**Split Map Modes**

You can have a **Single Map**, **Dual Maps** or **Swipe Maps**. With **Dual Maps**, you can compare the same map side-by-side using different layers. Use **Swipe Maps** to overlay two maps on each other and use the sliding separator to compare them. You can choose the split map mode by choosing the **Split Mode** button on the top right corner of your map.

## Legends for EOJ in the SageMaker geospatial UI
<a name="geo-legends-eoj"></a>

The output visualization of an EOJ depends on the operation you choose to create it. The legend is based on the default color scale. You can view the legend by choosing the **Show legend** button on the top right corner of your map.

**Spectral Index**

When you visualize the output for an EOJ that uses the spectral index operation, you can map the category based on the color from the legend as shown.

![\[The legend for spectral index mapping.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/geo_spectral_index.png)


**Cloud Masking**

When you visualize the output for an EOJ that uses the cloud masking operation, you can map the category based on the color from the legend as shown.

![\[The legend for cloud masking mapping.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/geo_cloud_masking.png)


**Land Cover Segmentation**

When you visualize the output for an EOJ that uses the Land Cover Segmentation operation, you can map the category based on the color from the legend as shown.

![\[The legend for land cover segmentation mapping.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/geo_landcover_ss.png)


# Amazon SageMaker geospatial Map SDK
<a name="geospatial-notebook-sdk"></a>

You can use Amazon SageMaker geospatial capabilities to visualize maps within the SageMaker geospatial UI as well as SageMaker notebooks with a geospatial image. These visualizations are supported by the map visualization library [Foursquare Studio](https://studio.foursquare.com/home).

You can use the APIs provided by the SageMaker geospatial map SDK to visualize your geospatial data, including the input, output, and AoI for EOJ.

**Topics**
+ [add\_dataset API](#geo-add-dataset)
+ [update\_dataset API](#geo-update-dataset)
+ [add\_layer API](#geo-add-layer)
+ [update\_layer API](#geo-update-layer)
+ [visualize\_eoj\_aoi API](#geo-visualize-eoj-aoi)
+ [visualize\_eoj\_input API](#geo-visualize-eoj-input)
+ [visualize\_eoj\_output API](#geo-visualize-eoj-output)

## add\_dataset API
<a name="geo-add-dataset"></a>

Adds a raster or vector dataset object to the map.

**Request syntax**

```
Request = 
    add_dataset(
      self,
      dataset: Union[Dataset, Dict, None] = None,
      *,
      auto_create_layers: bool = True,
      center_map: bool = True,
      **kwargs: Any,
    ) -> Optional[Dataset]
```

**Request parameters**

The request accepts the following parameters.

Positional arguments


| Argument |  Type  |  Description  | 
| --- | --- | --- | 
| `dataset` | Union[Dataset, Dict, None] | Data used to create a dataset, in CSV, JSON, or GeoJSON format (for local datasets) or a UUID string. | 

Keyword arguments


| Argument |  Type  |  Description  | 
| --- | --- | --- | 
| `auto_create_layers` | Boolean | Whether to attempt to create new layers when adding a dataset. Default value is `True`. | 
| `center_map` | Boolean | Whether to center the map on the created dataset. Default value is `True`. | 
| `id` | String | Unique identifier of the dataset. If you do not provide it, a random ID is generated. | 
| `label` | String | Dataset label which is displayed. | 
| `color` | Tuple[float, float, float] | Color label of the dataset. | 
| `metadata` | Dictionary | Object containing tileset metadata (for tiled datasets). | 

**Response**

This API returns the [Dataset](https://location.foursquare.com/developer/docs/studio-map-sdk-types#dataset) object that was added to the map.
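
As an illustration, the keyword arguments above can be assembled in Python before calling the API. This is a sketch only: `embedded_map` (a name for the map object the SDK creates in a geospatial notebook) and `local_geojson` are assumed, so the actual call is shown commented out.

```python
# Sketch of an add_dataset call. The AWS-specific objects are hypothetical and
# the call itself is commented out; only the keyword arguments are built here.

dataset_kwargs = {
    "auto_create_layers": True,        # create map layers for the dataset automatically
    "center_map": True,                # recenter the map on the new dataset
    "label": "Flood-extent polygons",  # display label (hypothetical)
    "color": (0.2, 0.6, 0.9),          # RGB color label as floats
}

# In a SageMaker geospatial notebook this would typically look like:
# dataset = embedded_map.add_dataset(local_geojson, **dataset_kwargs)
```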

## update_dataset API
<a name="geo-update-dataset"></a>

Updates an existing dataset's settings.

**Request syntax**

```
Request = 
    update_dataset(
      self,
      dataset_id: str,
      values: Union[_DatasetUpdateProps, dict, None] = None,
      **kwargs: Any,
    ) -> Dataset
```

**Request parameters**

The request accepts the following parameters.

Positional arguments


| Argument |  Type  |  Description  | 
| --- | --- | --- | 
| `dataset_id` | String | The identifier of the dataset to be updated. | 
| `values` | Union[[_DatasetUpdateProps](https://location.foursquare.com/developer/docs/studio-map-sdk-types#datasetupdateprops), dict, None] | The values to update. | 

Keyword arguments


| Argument |  Type  |  Description  | 
| --- | --- | --- | 
| `label` | String | Dataset label which is displayed. | 
| `color` | [RGBColor](https://location.foursquare.com/developer/docs/studio-map-sdk-types#rgbcolor) | Color label of the dataset. | 

**Response**

This API returns the updated dataset object for interactive maps, or `None` for non-interactive HTML environments. 
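
As a sketch, the positional arguments above could be prepared like this; the dataset ID and the `embedded_map` object are hypothetical, so the call itself is commented out.

```python
# Sketch of an update_dataset call. The identifier is a placeholder for an ID
# returned earlier by add_dataset; the call is commented out.

dataset_id = "example-dataset-id"   # hypothetical identifier
values = {
    "label": "Renamed dataset",     # new display label
    "color": (255, 0, 0),           # new RGB color label
}

# updated = embedded_map.update_dataset(dataset_id, values)
```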

## add_layer API
<a name="geo-add-layer"></a>

Adds a new layer to the map. This function requires at least one valid layer configuration.

**Request syntax**

```
Request = 
    add_layer(
      self,
      layer: Union[LayerCreationProps, dict, None] = None,
      **kwargs: Any,
    ) -> Layer
```

**Request parameters**

The request accepts the following parameters.

Arguments


| Argument |  Type  |  Description  | 
| --- | --- | --- | 
| `layer` | Union[[LayerCreationProps](https://location.foursquare.com/developer/docs/studio-map-sdk-types#layercreationprops), dict, None] | A set of properties used to create a layer. | 

**Response**

This API returns the layer object that was added to the map.
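
As a hedged sketch, a layer configuration dictionary following the `LayerCreationProps` shape might look like the following; the layer type, dataset ID, and `embedded_map` object are assumptions, so the call is commented out.

```python
# Sketch of a layer configuration for add_layer. All identifiers are
# placeholders; the call itself is commented out.

layer_props = {
    "type": "point",                  # a supported LayerType value (assumed)
    "data_id": "example-dataset-id",  # dataset this layer visualizes (hypothetical)
    "label": "Sample points",         # label shown in the layer list
    "is_visible": True,               # show the layer immediately
}

# layer = embedded_map.add_layer(layer_props)
```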

## update_layer API
<a name="geo-update-layer"></a>

Update an existing layer with given values.

**Request syntax**

```
Request = 
    update_layer(
      self,
      layer_id: str,
      values: Union[LayerUpdateProps, dict, None],
      **kwargs: Any,
    ) -> Layer
```

**Request parameters**

The request accepts the following parameters.

Positional arguments


| Argument |  Type  |  Description  | 
| --- | --- | --- | 
| `layer_id` | String | The ID of the layer to be updated. | 
| `values` | Union[[LayerUpdateProps](https://location.foursquare.com/developer/docs/studio-map-sdk-types#layerupdateprops), dict, None] | The values to update. | 

Keyword arguments


| Argument |  Type  |  Description  | 
| --- | --- | --- | 
| `type` | [LayerType](https://location.foursquare.com/developer/docs/studio-map-sdk-types#layertype) | The type of layer. | 
| `data_id` | String | Unique identifier of the dataset this layer visualizes. | 
| `fields` | Dict [string, Optional[string]] | Dictionary that maps fields that the layer requires for visualization to appropriate dataset fields. | 
| `label` | String | Canonical label of this layer. | 
| `is_visible` | Boolean | Whether the layer is visible or not. | 
| `config` | [LayerConfig](https://location.foursquare.com/developer/docs/studio-map-sdk-types#layerconfig) | Layer configuration specific to its type.  | 

**Response**

Returns the updated layer object.
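
For illustration, an update that relabels a layer and toggles its visibility could be prepared as follows; the layer ID and `embedded_map` object are hypothetical, so the call is commented out.

```python
# Sketch of an update_layer call. The layer ID is a placeholder; the call is
# commented out so the snippet stays self-contained.

layer_id = "example-layer-id"
values = {
    "label": "Masked clouds",  # new canonical label for the layer
    "is_visible": False,       # hide the layer without removing it
}

# layer = embedded_map.update_layer(layer_id, values)
```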

## visualize_eoj_aoi API
<a name="geo-visualize-eoj-aoi"></a>

Visualizes the area of interest (AoI) of the given EOJ ARN.

**Request parameters**

The request accepts the following parameters.

Arguments


| Argument |  Type  |  Description  | 
| --- | --- | --- | 
|  `Arn`  |  String  |  The ARN of the job.  | 
|  `config`  |  Dictionary: `config = { label: <string> }`, where `label` is the custom label of the added AoI layer (default: `AoI`)  |  An option to pass layer properties.  | 

**Response**

Returns a reference to the added AoI layer object.

## visualize_eoj_input API
<a name="geo-visualize-eoj-input"></a>

Visualize the input of the given EOJ ARN.

**Request parameters**

The request accepts the following parameters.

Arguments


| Argument |  Type  |  Description  | 
| --- | --- | --- | 
| `Arn` | String | The ARN of the job. | 
| `time_range_filter` |  Dictionary: `time_range_filter = { start_date: <string>, end_date: <string> }`, with both dates in ISO format  | An option to provide the start and end time. Defaults to the raster data collection search start and end date. | 
| `config` |  Dictionary: `config = { label: <string> }`, where `label` is the custom label of the added input layer (default: `Input`)  | An option to pass layer properties. | 

**Response**

Returns a reference to the added input layer object.
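
As a sketch, the two dictionary parameters mirror the shapes in the table above. The job ARN and `embedded_map` object are assumptions, so the call itself is commented out.

```python
# Sketch of visualize_eoj_input arguments. Dates and labels are illustrative;
# the call is commented out.

time_range_filter = {
    "start_date": "2022-01-01T00:00:00Z",  # ISO-format start of the window
    "end_date": "2022-03-31T23:59:59Z",    # ISO-format end of the window
}
config = {"label": "Sentinel-2 input"}     # layer label; defaults to "Input" when omitted

# input_layer = embedded_map.visualize_eoj_input(
#     Arn=eoj_arn, time_range_filter=time_range_filter, config=config
# )
```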

## visualize_eoj_output API
<a name="geo-visualize-eoj-output"></a>

Visualize the output of the given EOJ ARN.

**Request parameters**

The request accepts the following parameters.

Arguments


| Argument |  Type  |  Description  | 
| --- | --- | --- | 
|  `Arn`  |  String  |  The ARN of the job.  | 
|  `time_range_filter`  |  Dictionary: `time_range_filter = { start_date: <string>, end_date: <string> }`, with both dates in ISO format  | An option to provide the start and end time. Defaults to the raster data collection search start and end date. | 
| `config` |  Dictionary: `config = { label: <string>, preset: <string>, band_name: <string> }`, where `label` is the custom label of the added output layer (default: `Output`), `preset` is `singleBand` or `trueColor`, and `band_name` is required only for the `singleBand` preset and must be one of the allowed bands for the EOJ  | An option to pass layer properties. | 

**Response**

Returns a reference to the added output layer object.
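
As a final sketch, a `config` using the `singleBand` preset might be assembled as follows. The band name and job ARN are hypothetical (check the EOJ's allowed bands first), so the call is commented out.

```python
# Sketch of visualize_eoj_output arguments with the singleBand preset.
# All names are illustrative; the call is commented out.

config = {
    "label": "NDVI output",  # defaults to "Output" when omitted
    "preset": "singleBand",  # "singleBand" or "trueColor"
    "band_name": "ndvi",     # required only for the "singleBand" preset (assumed band)
}

# output_layer = embedded_map.visualize_eoj_output(Arn=eoj_arn, config=config)
```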

To learn more about visualizing your geospatial data, refer to [Visualization Using Amazon SageMaker geospatial](https://docs.aws.amazon.com/sagemaker/latest/dg/geospatial-visualize.html).

# SageMaker geospatial capabilities FAQ
<a name="geospatial-faq"></a>

Use the following FAQ items to find answers to commonly asked questions about SageMaker geospatial capabilities.

1. **What regions are Amazon SageMaker geospatial capabilities available in?**

   Currently, SageMaker geospatial capabilities are only supported in the US West (Oregon) Region. To view SageMaker geospatial, choose the name of the currently displayed Region in the navigation bar of the console. Then choose the US West (Oregon) Region.

1. **What AWS Identity and Access Management permissions and policies are required to use SageMaker geospatial?**

   To use SageMaker geospatial you need a user, group, or role that can access SageMaker AI. You also need to create a SageMaker AI execution role so that SageMaker geospatial can perform operations on your behalf. To learn more, see [SageMaker geospatial capabilities roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-geospatial-roles.html).

1. **I have an existing SageMaker AI execution role. Do I need to update it?**

   Yes. To use SageMaker geospatial you must specify an additional service principal in your IAM trust policy: `sagemaker-geospatial.amazonaws.com`. To learn about specifying a service principal in a trust relationship, see [Adding  the SageMaker geospatial service principal to an existing SageMaker AI execution role](sagemaker-geospatial-roles-pass-role.md) in the *Amazon SageMaker AI Developer Guide*.

1. **Can I use SageMaker geospatial capabilities through my VPC environment?**

   Yes, you can use SageMaker geospatial from within your Amazon VPC. To learn more, see [Use Amazon SageMaker geospatial capabilities in Your Amazon Virtual Private Cloud](geospatial-notebooks-and-internet-access-vpc-requirements.md).

1. **Why can't I see the SageMaker geospatial map visualizer, image or instance type when I navigate to Amazon SageMaker Studio Classic?**

   Verify that you are launching Amazon SageMaker Studio Classic in the US West (Oregon) Region and that you are not using a shared space.

1. **Why can't I see the SageMaker geospatial image or instance type when I try to create a notebook instance in Studio Classic?**

   Verify that you are launching Amazon SageMaker Studio Classic in the US West (Oregon) Region and that you are not using a shared space. To learn more, see [Create an Amazon SageMaker Studio Classic notebook using the geospatial image](geospatial-launch-notebook.md).

1. **What bands are supported for the various raster data collections?**

   Use the `GetRasterDataCollection` API response and refer to the `ImageSourceBands` field to find the bands supported for that particular data collection.

# SageMaker geospatial Security and Permissions
<a name="geospatial-security-general"></a>

Use the topics on this page to learn about SageMaker geospatial capabilities security features. Additionally, learn how to use SageMaker geospatial capabilities in an Amazon Virtual Private Cloud as well as protect your data at rest using encryption.

For more information about IAM users and roles, see [Identities (Users, Groups, and Roles)](https://docs.aws.amazon.com/IAM/latest/UserGuide/id.html) in the IAM User Guide. 

To learn more about using IAM with SageMaker AI, see [AWS Identity and Access Management for Amazon SageMaker AI](security-iam.md).

**Topics**
+ [Configuration and Vulnerability Analysis in SageMaker geospatial](geospatial-config-vulnerability.md)
+ [Security Best Practices for SageMaker geospatial capabilities](geospatial-sec-best-practices.md)
+ [Use Amazon SageMaker geospatial capabilities in Your Amazon Virtual Private Cloud](geospatial-notebooks-and-internet-access-vpc-requirements.md)
+ [Use AWS KMS Permissions for Amazon SageMaker geospatial capabilities](geospatial-kms.md)

# Configuration and Vulnerability Analysis in SageMaker geospatial
<a name="geospatial-config-vulnerability"></a>

Configuration and IT controls are a shared responsibility between AWS and you, our customer. AWS handles basic security tasks like guest operating system (OS) and database patching, firewall configuration, and disaster recovery. These procedures have been reviewed and certified by the appropriate third parties. For more details, see the following resources: 
+ [Shared Responsibility Model](https://aws.amazon.com/compliance/shared-responsibility-model/).
+ [Amazon Web Services: Overview of Security Processes](https://d0.awsstatic.com/whitepapers/Security/AWS_Security_Whitepaper.pdf).

# Security Best Practices for SageMaker geospatial capabilities
<a name="geospatial-sec-best-practices"></a>

Amazon SageMaker geospatial capabilities provide a number of security features to consider as you develop and implement your own security policies. The following best practices are general guidelines and don't represent a complete security solution. Because these best practices might not be appropriate or sufficient for your environment, treat them as helpful considerations rather than prescriptions.
<a name="geospatial-least-privilege"></a>
**Apply principle of least privilege**  
Amazon SageMaker geospatial capabilities provide granular access policy for applications using IAM roles. We recommend that the roles be granted only the minimum set of privileges required by the job. We also recommend auditing the jobs for permissions on a regular basis and upon any change to your application.
<a name="geospatial-role-access"></a>
**Role-based access control (RBAC) permissions**  
Administrators should strictly control RBAC permissions for Amazon SageMaker geospatial capabilities.
<a name="geospatial-temp-creditentials"></a>
**Use temporary credentials whenever possible**  
Where possible, use temporary credentials instead of long-term credentials, such as access keys. For scenarios in which you need IAM users with programmatic access and long-term credentials, we recommend that you rotate access keys. Regularly rotating long-term credentials helps you familiarize yourself with the process. This is useful in case you are ever in a situation where you must rotate credentials, such as when an employee leaves your company. We recommend that you use IAM access last used information to rotate and remove access keys safely. For more information, see [Rotating access keys](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html#Using_RotateAccessKey) and [Security best practices in IAM](https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html).
<a name="geospatial-cloudtrail-log"></a>
**Use AWS CloudTrail to view and log API calls**  
AWS CloudTrail tracks anyone making API calls in your AWS account. API calls are logged whenever anyone uses the SageMaker geospatial capabilities API, console, or AWS CLI commands. Enable logging and specify an Amazon S3 bucket to store the logs.

Your trust, privacy, and the security of your content are our highest priorities. We implement responsible and sophisticated technical and physical controls designed to prevent unauthorized access to, or disclosure of, your content and ensure that our use complies with our commitments to you. For more information, see [AWS Data Privacy FAQ](https://aws.amazon.com/compliance/data-privacy-faq/).

# Use Amazon SageMaker geospatial capabilities in Your Amazon Virtual Private Cloud
<a name="geospatial-notebooks-and-internet-access-vpc-requirements"></a>

The following topic explains how to use SageMaker notebooks with a SageMaker geospatial image in an Amazon SageMaker AI domain in `VPC only` mode. For more information on VPCs in Amazon SageMaker Studio Classic, see [Choose an Amazon VPC](https://docs.aws.amazon.com/sagemaker/latest/dg/onboard-vpc.html).

## `VPC only` communication with the internet
<a name="studio-notebooks-and-internet-access-vpc-geospatial"></a>

By default, a SageMaker AI domain uses two Amazon VPCs. One VPC is managed by Amazon SageMaker AI and provides direct internet access. You specify the other VPC, which provides encrypted traffic between the domain and your Amazon Elastic File System (Amazon EFS) volume.

You can change this behavior so that SageMaker AI sends all traffic over your specified Amazon VPC. If you chose `VPC only` as the network access mode when creating the SageMaker AI domain, consider the following requirements to continue using SageMaker Studio Classic notebooks within that domain.

## Requirements to use `VPC only` mode
<a name="studio-notebooks-and-internet-access-vpc-geospatial-requirements"></a>

**Note**  
In order to use the visualization components of SageMaker geospatial capabilities, the browser you use to access the SageMaker Studio Classic UI needs to be connected to the internet.

When you choose `VpcOnly`, follow these steps:

1. You must use private subnets only. You cannot use public subnets in `VpcOnly` mode.

1. Ensure your subnets have the required number of IP addresses needed. The expected number of IP addresses needed per user can vary based on use case. We recommend between 2 and 4 IP addresses per user. The total IP address capacity for a Studio Classic domain is the sum of available IP addresses for each subnet provided when the domain is created. Ensure that your estimated IP address usage does not exceed the capacity supported by the number of subnets you provide. Additionally, using subnets distributed across many availability zones can aid in IP address availability. For more information, see [VPC and subnet sizing for IPv4](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Subnets.html#vpc-sizing-ipv4).
**Note**  
You can configure only subnets with a default tenancy VPC in which your instance runs on shared hardware. For more information on the tenancy attribute for VPCs, see [Dedicated Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/dedicated-instance.html).

1. Set up one or more security groups with inbound and outbound rules that together allow the following traffic:
   + [NFS traffic over TCP on port 2049](https://docs.aws.amazon.com/efs/latest/ug/network-access.html) between the domain and the Amazon EFS volume.
   + [TCP traffic within the security group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/security-group-rules-reference.html#sg-rules-other-instances). This is required for connectivity between the JupyterServer app and the KernelGateway apps. You must allow access to at least ports in the range `8192-65535`.

1. If you want to allow internet access, you must use a [NAT gateway](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html#nat-gateway-working-with) with access to the internet, for example through an [internet gateway](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Internet_Gateway.html).

1. If you don't want to allow internet access, [create interface VPC endpoints](https://docs.aws.amazon.com/vpc/latest/privatelink/vpce-interface.html) (AWS PrivateLink) to allow Studio Classic to access the following services with the corresponding service names. You must also associate the security groups for your VPC with these endpoints.
**Note**  
Currently, SageMaker geospatial capabilities are only supported in the US West (Oregon) Region.
   + SageMaker API : `com.amazonaws.us-west-2.sagemaker.api` 
   + SageMaker AI runtime: `com.amazonaws.us-west-2.sagemaker.runtime`. This is required to run Studio Classic notebooks with a SageMaker geospatial image.
   + Amazon S3: `com.amazonaws.us-west-2.s3`.
   + To use SageMaker Projects: `com.amazonaws.us-west-2.servicecatalog`.
   + SageMaker geospatial capabilities: `com.amazonaws.us-west-2.sagemaker-geospatial`

    If you use the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/) to run remote training jobs, you must also create the following Amazon VPC endpoints.
   + AWS Security Token Service: `com.amazonaws.region.sts`
   + Amazon CloudWatch: `com.amazonaws.region.logs`. This is required to allow SageMaker Python SDK to get the remote training job status from Amazon CloudWatch.
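
If you prefer to script the endpoint creation, the service names above can be passed to the EC2 `create_vpc_endpoint` API, for example through boto3. This is a sketch under stated assumptions: the VPC, subnet, and security group IDs are placeholders, so the loop that would actually create the endpoints is commented out.

```python
# Sketch: create the interface endpoints listed above with boto3.
# All resource IDs are placeholders; the create_vpc_endpoint calls are
# commented out so nothing is provisioned.
service_names = [
    "com.amazonaws.us-west-2.sagemaker.api",
    "com.amazonaws.us-west-2.sagemaker.runtime",
    "com.amazonaws.us-west-2.s3",
    "com.amazonaws.us-west-2.sagemaker-geospatial",
]

# import boto3
# ec2 = boto3.client("ec2", region_name="us-west-2")
# for name in service_names:
#     ec2.create_vpc_endpoint(
#         VpcEndpointType="Interface",
#         VpcId="vpc-0123456789abcdef0",               # placeholder
#         ServiceName=name,
#         SubnetIds=["subnet-0123456789abcdef0"],      # placeholder private subnet
#         SecurityGroupIds=["sg-0123456789abcdef0"],   # placeholder
#     )
```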

**Note**  
For a customer working within VPC mode, company firewalls can cause connection issues with SageMaker Studio Classic or between JupyterServer and the KernelGateway. Make the following checks if you encounter one of these issues when using SageMaker Studio Classic from behind a firewall.  
Check that the Studio Classic URL is in your network's allowlist.  
Check that WebSocket connections are not blocked. Jupyter uses WebSockets under the hood. Even if the KernelGateway application is `InService`, JupyterServer may not be able to connect to it; you would see this problem when opening the System Terminal as well. 

# Use AWS KMS Permissions for Amazon SageMaker geospatial capabilities
<a name="geospatial-kms"></a>

You can protect your data at rest using encryption for SageMaker geospatial capabilities. By default, it uses server-side encryption with an Amazon SageMaker geospatial owned key. SageMaker geospatial capabilities also supports an option for server-side encryption with a customer managed KMS key.

## Server-Side Encryption with Amazon SageMaker geospatial managed key (Default)
<a name="geospatial-managed-key"></a>

SageMaker geospatial capabilities encrypts all your data, including the computational results from your Earth Observation jobs (EOJs) and Vector Enrichment jobs (VEJs), along with all your service metadata. No data is stored unencrypted within SageMaker geospatial capabilities. By default, an AWS owned key is used to encrypt all your data.

## Server-Side Encryption with customer managed KMS key (Optional)
<a name="geospatial-customer-managed-key"></a>

SageMaker geospatial capabilities supports the use of a symmetric customer managed key that you create, own, and manage to add a second layer of encryption over the existing AWS owned encryption. Because you have full control of this layer of encryption, you can perform such tasks as:
+ Establishing and maintaining key policies
+ Establishing and maintaining IAM policies and grants
+ Enabling and disabling key policies
+ Rotating key cryptographic material
+ Adding tags
+ Creating key aliases
+ Scheduling keys for deletion

For more information, see [Customer managed keys](https://docs.aws.amazon.com/kms/latest/developerguide/concepts.html#customer-cmk) in the *AWS Key Management Service Developer Guide*.

## How SageMaker geospatial capabilities uses grants in AWS KMS
<a name="geospatial-grants-cmk"></a>

SageMaker geospatial capabilities requires a grant to use your customer managed key. When you create an EOJ or a VEJ encrypted with a customer managed key, SageMaker geospatial capabilities creates a grant on your behalf by sending a `CreateGrant` request to AWS KMS. Grants in AWS KMS give SageMaker geospatial capabilities access to a KMS key in a customer account. You can revoke the grant, or remove the service's access to the customer managed key, at any time. If you do, SageMaker geospatial capabilities can't access any of the data encrypted by the customer managed key, which affects operations that depend on that data.

## Create a customer managed key
<a name="geospatial-create-cmk"></a>

You can create a symmetric customer managed key by using the AWS Management Console or the AWS KMS APIs.

**To create a symmetric customer managed key**

Follow the steps for [Creating symmetric encryption KMS keys](https://docs.aws.amazon.com/kms/latest/developerguide/create-keys.html#create-symmetric-cmk) in the AWS Key Management Service Developer Guide.

**Key policy**

Key policies control access to your customer managed key. Every customer managed key must have exactly one key policy, which contains statements that determine who can use the key and how they can use it. When you create your customer managed key, you can specify a key policy. For more information, see [Determining access to AWS KMS keys](https://docs.aws.amazon.com/kms/latest/developerguide/determining-access.html) in the *AWS Key Management Service Developer Guide*.

To use your customer managed key with your SageMaker geospatial capabilities resources, the following API operations must be permitted in the key policy. The principal for these operations should be the Execution Role you provide in the SageMaker geospatial capabilities request. SageMaker geospatial capabilities assumes the provided Execution Role in the request to perform these KMS operations.
+ [`kms:CreateGrant`](https://docs.aws.amazon.com/kms/latest/APIReference/API_CreateGrant.html)
+ `kms:GenerateDataKey`
+ `kms:Decrypt`
+ `kms:GenerateDataKeyWithoutPlaintext`

The following are policy statement examples you can add for SageMaker geospatial capabilities:

**CreateGrant**

```
"Statement" : [
    {
      "Sid" : "Allow access to Amazon SageMaker geospatial capabilities",
      "Effect" : "Allow",
      "Principal" : {
        "AWS" : "<Customer provided Execution Role ARN>"
      },
      "Action" : [
          "kms:CreateGrant",
          "kms:Decrypt",
          "kms:GenerateDataKey",
          "kms:GenerateDataKeyWithoutPlaintext"
      ],
      "Resource" : "*"
    }
]

For more information about specifying permissions in a policy, see [AWS KMS permissions](https://docs.aws.amazon.com/kms/latest/developerguide/kms-api-permissions-reference.html) in the *AWS Key Management Service Developer Guide*. For more information about troubleshooting, see [Troubleshooting key access](https://docs.aws.amazon.com/kms/latest/developerguide/policy-evaluation.html) in the *AWS Key Management Service Developer Guide*. 

If your key policy does not have your account root as key administrator, you need to add the same KMS permissions on your execution role ARN. Here is a sample policy you can add to the execution role:

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "kms:CreateGrant",
                "kms:Decrypt",
                "kms:GenerateDataKey",
                "kms:GenerateDataKeyWithoutPlaintext"
            ],
            "Resource": [
              "arn:aws:kms:us-east-1:111122223333:key/key-id"
            ],
            "Effect": "Allow"
        }
    ]
}
```

------

## Monitoring your encryption keys for SageMaker geospatial capabilities
<a name="geospatial-monitor-cmk"></a>

When you use an AWS KMS customer managed key with your SageMaker geospatial capabilities resources, you can use AWS CloudTrail or Amazon CloudWatch Logs to track requests that SageMaker geospatial sends to AWS KMS.

Select a tab in the following table to see examples of AWS CloudTrail events to monitor KMS operations called by SageMaker geospatial capabilities to access data encrypted by your customer managed key.

------
#### [ CreateGrant ]

```
{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "AROAIGDTESTANDEXAMPLE:SageMaker-Geospatial-StartEOJ-KMSAccess",
        "arn": "arn:aws:sts::111122223333:assumed-role/SageMakerGeospatialCustomerRole/SageMaker-Geospatial-StartEOJ-KMSAccess",
        "accountId": "111122223333",
        "accessKeyId": "AKIAIOSFODNN7EXAMPLE3",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "AKIAIOSFODNN7EXAMPLE3",
                "arn": "arn:aws:sts::111122223333:assumed-role/SageMakerGeospatialCustomerRole",
                "accountId": "111122223333",
                "userName": "SageMakerGeospatialCustomerRole"
            },
            "webIdFederationData": {},
            "attributes": {
                "creationDate": "2023-03-17T18:02:06Z",
                "mfaAuthenticated": "false"
            }
        },
        "invokedBy": "arn:aws:iam::111122223333:root"
    },
    "eventTime": "2023-03-17T18:02:06Z",
    "eventSource": "kms.amazonaws.com",
    "eventName": "CreateGrant",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "172.12.34.56",
    "userAgent": "ExampleDesktop/1.0 (V1; OS)",
    "requestParameters": {
        "retiringPrincipal": "sagemaker-geospatial.us-west-2.amazonaws.com",
        "keyId": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE",
        "operations": [
            "Decrypt"
        ],
        "granteePrincipal": "sagemaker-geospatial.us-west-2.amazonaws.com"
    },
    "responseElements": {
        "grantId": "0ab0ac0d0b000f00ea00cc0a0e00fc00bce000c000f0000000c0bc0a0000aaafSAMPLE",
        "keyId": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE"
    },
    "requestID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "eventID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "readOnly": false,
    "resources": [
        {
            "accountId": "111122223333",
            "type": "AWS::KMS::Key",
            "ARN": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE"
        }
    ],
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "111122223333",
    "eventCategory": "Management"
}
```

------
#### [ GenerateDataKey ]

```
{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AWSService",
        "invokedBy": "sagemaker-geospatial.amazonaws.com"
    },
    "eventTime": "2023-03-24T00:29:45Z",
    "eventSource": "kms.amazonaws.com",
    "eventName": "GenerateDataKey",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "sagemaker-geospatial.amazonaws.com",
    "userAgent": "sagemaker-geospatial.amazonaws.com",
    "requestParameters": {
        "encryptionContext": {
            "aws:s3:arn": "arn:aws:s3:::axis-earth-observation-job-378778860802/111122223333/napy9eintp64/output/consolidated/32PPR/2022-01-04T09:58:03Z/S2B_32PPR_20220104_0_L2A_msavi.tif"
        },
        "keyId": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE",
        "keySpec": "AES_256"
    },
    "responseElements": null,
    "requestID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "eventID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "readOnly": true,
    "resources": [
        {
            "accountId": "111122223333",
            "type": "AWS::KMS::Key",
            "ARN": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE"
        }
    ],
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "111122223333",
    "eventCategory": "Management"
}
```

------
#### [ Decrypt ]

```
{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AWSService",
        "invokedBy": "sagemaker-geospatial.amazonaws.com"
    },
    "eventTime": "2023-03-28T22:04:24Z",
    "eventSource": "kms.amazonaws.com",
    "eventName": "Decrypt",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "sagemaker-geospatial.amazonaws.com",
    "userAgent": "sagemaker-geospatial.amazonaws.com",
    "requestParameters": {
        "encryptionAlgorithm": "SYMMETRIC_DEFAULT",
        "encryptionContext": {
            "aws:s3:arn": "arn:aws:s3:::axis-earth-observation-job-378778860802/111122223333/napy9eintp64/output/consolidated/32PPR/2022-01-04T09:58:03Z/S2B_32PPR_20220104_0_L2A_msavi.tif"
        }
    },
    "responseElements": null,
    "requestID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "eventID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "readOnly": true,
    "resources": [
        {
            "accountId": "111122223333",
            "type": "AWS::KMS::Key",
            "ARN": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE"
        }
    ],
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "111122223333",
    "eventCategory": "Management"
}
```

------
#### [ GenerateDataKeyWithoutPlainText ]

```
{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "AROAIGDTESTANDEXAMPLE:SageMaker-Geospatial-StartEOJ-KMSAccess",
        "arn": "arn:aws:sts::111122223333:assumed-role/SageMakerGeospatialCustomerRole/SageMaker-Geospatial-StartEOJ-KMSAccess",
        "accountId": "111122223333",
        "accessKeyId": "AKIAIOSFODNN7EXAMPLE3",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "AKIAIOSFODNN7EXAMPLE3",
                "arn": "arn:aws:sts::111122223333:assumed-role/SageMakerGeospatialCustomerRole",
                "accountId": "111122223333",
                "userName": "SageMakerGeospatialCustomerRole"
            },
            "webIdFederationData": {},
            "attributes": {
                "creationDate": "2023-03-17T18:02:06Z",
                "mfaAuthenticated": "false"
            }
        },
        "invokedBy": "arn:aws:iam::111122223333:root"
    },
    "eventTime": "2023-03-28T22:09:16Z",
    "eventSource": "kms.amazonaws.com",
    "eventName": "GenerateDataKeyWithoutPlaintext",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "172.12.34.56",
    "userAgent": "ExampleDesktop/1.0 (V1; OS)",
    "requestParameters": {
        "keySpec": "AES_256",
        "keyId": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE"
    },
    "responseElements": null,
    "requestID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "eventID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "readOnly": true,
    "resources": [
        {
            "accountId": "111122223333",
            "type": "AWS::KMS::Key",
            "ARN": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE"
        }
    ],
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "111122223333",
    "eventCategory": "Management"
}
```

------

# Types of compute instances
<a name="geospatial-instances"></a>

SageMaker geospatial capabilities offer three types of compute instances.
+ **SageMaker Studio Classic geospatial notebook instances** – SageMaker geospatial supports both CPU and GPU-based notebook instances in Studio Classic. Notebook instances are used to build, train, and deploy ML models. For a list of available notebook instance types that work with the geospatial image, see [SageMaker geospatial supported notebook instance types](#notebook-instances).
+ **SageMaker geospatial jobs instances** – Run processing jobs to transform satellite image data.
+ **SageMaker geospatial model inference instances** – Make predictions by using pre-trained ML models on satellite imagery.

The instance type is determined by the operations that you run.

The following table shows the available SageMaker geospatial-specific operations and the instance types that you can use.


|  Operations  |  Instance  | 
| --- | --- | 
| Temporal Statistics | ml.geospatial.jobs | 
| Zonal Statistics | ml.geospatial.jobs | 
| Resampling | ml.geospatial.jobs | 
| Geomosaic | ml.geospatial.jobs | 
| Band Stacking | ml.geospatial.jobs | 
| Band Math | ml.geospatial.jobs | 
| Cloud Removal with Landsat8 | ml.geospatial.jobs | 
| Cloud Removal with Sentinel-2 | ml.geospatial.models | 
| Cloud Masking | ml.geospatial.models | 
| Land Cover Segmentation | ml.geospatial.models | 
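
You don't choose these instance types directly; the operation you put in an Earth Observation Job's `JobConfig` determines them. The following sketch builds a `StartEarthObservationJob` request for a Band Math operation (an `ml.geospatial.jobs` operation in the table). The role ARN, data collection ARN, and time range are placeholders, and the request shape is an illustration of the `sagemaker-geospatial` API rather than a complete working job:

```python
# Sketch of a StartEarthObservationJob request; the ARNs and time range
# are placeholders, not real resources.
def build_band_math_eoj_request(name, role_arn, collection_arn, equation):
    return {
        "Name": name,
        "ExecutionRoleArn": role_arn,
        "InputConfig": {
            "RasterDataCollectionQuery": {
                "RasterDataCollectionArn": collection_arn,
                "TimeRangeFilter": {
                    "StartTime": "2022-01-01T00:00:00Z",
                    "EndTime": "2022-01-31T23:59:59Z",
                },
            }
        },
        # Band Math is listed as an ml.geospatial.jobs operation in the
        # table above, so this job runs on jobs instances.
        "JobConfig": {
            "BandMathConfig": {
                "CustomIndices": {
                    "Operations": [{"Name": "ndvi", "Equation": equation}]
                }
            }
        },
    }

request = build_band_math_eoj_request(
    "example-band-math-job",
    "arn:aws:iam::111122223333:role/SageMakerGeospatialCustomerRole",
    "arn:aws:sagemaker-geospatial:us-west-2:111122223333:raster-data-collection/public/example",
    "(nir - red) / (nir + red)",
)
# To submit, you would pass the request to the geospatial client:
# boto3.client("sagemaker-geospatial").start_earth_observation_job(**request)
```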

## SageMaker geospatial supported notebook instance types
<a name="notebook-instances"></a>

SageMaker geospatial supports both CPU and GPU-based notebook instances in Studio Classic. If you receive a `ResourceLimitExceeded` error when starting a GPU-enabled notebook instance, you need to request a quota increase. To get started with a Service Quotas quota increase request, see [Requesting a quota increase](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html) in the *Service Quotas User Guide*.
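
If you prefer to script the request, the Service Quotas API exposes `RequestServiceQuotaIncrease`. The quota code below is a placeholder; look up the real code for your instance type with `list_service_quotas` before submitting. A sketch:

```python
# Sketch: build the parameters for a Service Quotas increase request.
# "L-EXAMPLE123" is a placeholder quota code, not a real one.
def build_quota_increase_request(service_code, quota_code, desired_value):
    return {
        "ServiceCode": service_code,
        "QuotaCode": quota_code,
        "DesiredValue": desired_value,
    }

params = build_quota_increase_request("sagemaker", "L-EXAMPLE123", 2.0)
# To submit, you would call:
# boto3.client("service-quotas").request_service_quota_increase(**params)
```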

Supported Studio Classic notebook instance types


|  Name  |  Instance type  | 
| --- | --- | 
| ml.geospatial.interactive | CPU | 
| ml.g5.xlarge | GPU | 
| ml.g5.2xlarge | GPU | 
| ml.g5.4xlarge | GPU | 
| ml.g5.8xlarge | GPU | 
| ml.g5.16xlarge | GPU | 
| ml.g5.12xlarge | GPU | 
| ml.g5.24xlarge | GPU | 
| ml.g5.48xlarge | GPU | 

You are charged different rates for each type of compute instance that you use. For more information about pricing, see [Geospatial ML with Amazon SageMaker AI](https://aws.amazon.com/sagemaker/geospatial).

## SageMaker geospatial libraries
<a name="geospatial-notebook-libraries"></a>

The SageMaker geospatial-specific instance type, **ml.geospatial.interactive**, includes the following Python libraries.

Geospatial libraries available on the geospatial instance type


|  Library name  |  Version available  | 
| --- | --- | 
| numpy | 1.23.4 | 
| scipy | 1.11.2 | 
| pandas | 1.4.4 | 
| gdal | 3.2.2 | 
| fiona | 1.8.22 | 
| geopandas | 0.11.1 | 
| shapely | 1.8.4 | 
| seaborn | 0.11.2 | 
| notebook | 1.8.22 | 
| scikit-image | 0.11.2 | 
| rasterio | 6.4.12 | 
| scikit-learn | 0.19.2 | 
| ipyleaflet | 1.0.1 | 
| rtree | 0.17.2 | 
| opencv | 4.6.0.66 | 
| supy | 2022.4.7 | 
| SNAP toolbox | 9.0 | 
| cdsapi | 0.6.1 | 
| arosics | 1.8.1 | 
| rasterstats | 0.18.0 | 
| rioxarray | 0.14.1 | 
| pyroSAR | 0.20.0 | 
| eo-learn | 1.4.1 | 
| deepforest | 1.2.7 | 
| scrapy | 2.8.0 | 
| netCDF4 | 1.6.3 | 
| xarray[complete] | 0.20.1 | 
| Orfeotoolbox | OTB-8.1.1 | 
| pytorch | 2.0.1 | 
| pytorch-cuda | 11.8 | 
| torchvision | 0.15.2 | 
| torchaudio | 2.0.2 | 
| pytorch-lightning | 2.0.6 | 
| tensorflow | 2.13.0 | 
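
Most of these libraries exist to read rasters and run band math over them efficiently; the indices themselves are simple arithmetic. A minimal NDVI calculation, written with plain Python lists for clarity (in practice you would read bands with `rasterio` and compute over `numpy` arrays):

```python
# NDVI = (NIR - Red) / (NIR + Red), computed per pixel.
# Zero-denominator pixels are mapped to 0.0 to avoid division errors.
def ndvi(nir_band, red_band):
    out = []
    for nir, red in zip(nir_band, red_band):
        denom = nir + red
        out.append((nir - red) / denom if denom else 0.0)
    return out

nir = [0.60, 0.55, 0.10]
red = [0.10, 0.15, 0.08]
print([round(v, 3) for v in ndvi(nir, red)])  # [0.714, 0.571, 0.111]
```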

# Data collections
<a name="geospatial-data-collections"></a>

Amazon SageMaker geospatial supports the following raster data collections. Of these, you can use the USGS Landsat and the Sentinel-2 Cloud-Optimized GeoTIFF data collections when starting an Earth Observation Job (EOJ). To learn more about EOJs, see [Earth Observation Jobs](geospatial-eoj.md).
+ [Copernicus Digital Elevation Model (DEM) – GLO-30](https://registry.opendata.aws/copernicus-dem/)
+ [Copernicus Digital Elevation Model (DEM) – GLO-90](https://registry.opendata.aws/copernicus-dem/)
+ [Sentinel-2 Cloud-Optimized GeoTIFFs](https://registry.opendata.aws/sentinel-2-l2a-cogs/)
+ [Sentinel-1](https://registry.opendata.aws/sentinel-1/)
+ [National Agriculture Imagery Program (NAIP) on AWS](https://registry.opendata.aws/naip/)
+ [USGS Landsat](https://registry.opendata.aws/usgs-landsat/)

To find the list of available raster data collections in your AWS Region, use `ListRasterDataCollections`. The [`ListRasterDataCollections` response](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_geospatial_ListRasterDataCollections.html#API_geospatial_ListRasterDataCollections_ResponseSyntax) contains a list of `RasterDataCollectionSummaries` objects with details about the available raster data collections.

**Example – Calling the `ListRasterDataCollections` API using the AWS SDK for Python (Boto3)**  <a name="list-raster-data-collections"></a>
When you use the SDK for Python (Boto3) and SageMaker geospatial, you must create a geospatial client, `geospatial_client`. Use the following Python snippet to make a call to the `list_raster_data_collections` API:  

```
import boto3

# SageMaker geospatial capabilities are currently only available in us-west-2
session = boto3.Session(region_name="us-west-2")

# Create a SageMaker geospatial client instance
geospatial_client = session.client(service_name="sagemaker-geospatial")

# Create a reusable paginator for the list_raster_data_collections API operation
paginator = geospatial_client.get_paginator("list_raster_data_collections")

# Create a PageIterator from the paginator
page_iterator = paginator.paginate()

# Iterate through the pages to collect all data collection summaries
results = []
for page in page_iterator:
    results.extend(page["RasterDataCollectionSummaries"])

print(results)
```
The JSON response contains entries like the following, truncated for clarity:  

```
{
    "Arn": "arn:aws:sagemaker-geospatial:us-west-2:555555555555:raster-data-collection/public/dxxbpqwvu9041ny8",
    "Description": "Copernicus DEM is a Digital Surface Model which represents the surface of the Earth including buildings, infrastructure, and vegetation. GLO-30 is instance of Copernicus DEM that provides limited worldwide coverage at 30 meters.",
    "DescriptionPageUrl": "https://registry.opendata.aws/copernicus-dem/",
    "Name": "Copernicus DEM GLO-30",
    "Tags": {},
    "Type": "PUBLIC"
}
```
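
Given summaries like the sample entry above, selecting a collection ARN is a matter of matching on `Name`. A small helper, using the sample response as data:

```python
# Find a raster data collection ARN by (partial) name match.
def find_collection_arn(summaries, name_substring):
    """Return the ARN of the first collection whose Name contains name_substring."""
    for summary in summaries:
        if name_substring in summary["Name"]:
            return summary["Arn"]
    return None

# Sample entry from the ListRasterDataCollections response shown above.
summaries = [
    {
        "Arn": "arn:aws:sagemaker-geospatial:us-west-2:555555555555:raster-data-collection/public/dxxbpqwvu9041ny8",
        "Name": "Copernicus DEM GLO-30",
        "Type": "PUBLIC",
    }
]
print(find_collection_arn(summaries, "GLO-30"))
```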

## Image band information from the USGS Landsat and Sentinel-2 data collections
<a name="image-band-information"></a>

Image band information from the USGS Landsat 8 and Sentinel-2 data collections is provided in the following tables.

USGS Landsat


| Band name | Wavelength range (nm) | Units | Valid range | Fill value | Spatial resolution | 
| --- | --- | --- | --- | --- | --- | 
| coastal | 435 - 451 | Unitless | 1 - 65455 | 0 (No Data) | 30m | 
| blue | 452 - 512 | Unitless | 1 - 65455 | 0 (No Data) | 30m | 
| green | 533 - 590 | Unitless | 1 - 65455 | 0 (No Data) | 30m | 
| red | 636 - 673 | Unitless | 1 - 65455 | 0 (No Data) | 30m | 
| nir | 851 - 879 | Unitless | 1 - 65455 | 0 (No Data) | 30m | 
| swir16 | 1566 - 1651 | Unitless | 1 - 65455 | 0 (No Data) | 30m | 
| swir22 | 2107 - 2294 | Unitless | 1 - 65455 | 0 (No Data) | 30m | 
| qa_aerosol | NA | Bit Index | 0 - 255 | 1 | 30m | 
| qa_pixel | NA | Bit Index | 1 - 65455 | 1 (bit 0) | 30m | 
| qa_radsat | NA | Bit Index | 1 - 65455 | NA | 30m | 
| t | 10600 - 11190 | Scaled Kelvin | 1 - 65455 | 0 (No Data) | 30m (scaled from 100m) | 
| atran | NA | Unitless | 0 - 10000 | -9999 (No Data) | 30m | 
| cdist | NA | Kilometers | 0 - 24000 | -9999 (No Data) | 30m | 
| drad | NA | W/(m^2 sr µm)/DN | 0 - 28000 | -9999 (No Data) | 30m | 
| urad | NA | W/(m^2 sr µm)/DN | 0 - 28000 | -9999 (No Data) | 30m | 
| trad | NA | W/(m^2 sr µm)/DN | 0 - 28000 | -9999 (No Data) | 30m | 
| emis | NA | Emissivity coefficient | 1 - 10000 | -9999 (No Data) | 30m | 
| emsd | NA | Emissivity coefficient | 1 - 10000 | -9999 (No Data) | 30m | 

Sentinel-2


| Band name | Central wavelength (nm) | Scale | Valid range | Fill value | Spatial resolution | 
| --- | --- | --- | --- | --- | --- | 
| coastal | 443 | 0.0001 | NA | 0 (No Data) | 60m | 
| blue | 490 | 0.0001 | NA | 0 (No Data) | 10m | 
| green | 560 | 0.0001 | NA | 0 (No Data) | 10m | 
| red | 665 | 0.0001 | NA | 0 (No Data) | 10m | 
| rededge1 | 705 | 0.0001 | NA | 0 (No Data) | 20m | 
| rededge2 | 740 | 0.0001 | NA | 0 (No Data) | 20m | 
| rededge3 | 783 | 0.0001 | NA | 0 (No Data) | 20m | 
| nir | 842 | 0.0001 | NA | 0 (No Data) | 10m | 
| nir08 | 865 | 0.0001 | NA | 0 (No Data) | 20m | 
| nir09 | 940 | 0.0001 | NA | 0 (No Data) | 60m | 
| swir16 | 1610 | 0.0001 | NA | 0 (No Data) | 20m | 
| swir22 | 2190 | 0.0001 | NA | 0 (No Data) | 20m | 
| aot | Aerosol optical thickness | 0.001 | NA | 0 (No Data) | 10m | 
| wvp | Scene-average water vapor | 0.001 | NA | 0 (No Data) | 10m | 
| scl | Scene classification data | NA | 1 - 11 | 0 (No Data) | 20m | 
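
Per the table above, converting raw Sentinel-2 digital numbers to physical values means multiplying by the band's scale factor and treating the fill value 0 as missing data. A minimal sketch:

```python
# Apply a band's scale factor, mapping fill-value pixels to None (no data).
def to_reflectance(digital_numbers, scale=0.0001, fill_value=0):
    return [dn * scale if dn != fill_value else None for dn in digital_numbers]

# 0 is the fill value; 1500 and 3200 are example digital numbers.
print(to_reflectance([0, 1500, 3200]))
```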

# RStudio on Amazon SageMaker AI
<a name="rstudio"></a>

RStudio is an integrated development environment (IDE) for R, with a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging, and workspace management. Amazon SageMaker AI supports RStudio as a fully managed IDE, integrated with the Amazon SageMaker AI domain through Posit Workbench. RStudio lets you create data science insights using an R environment. With RStudio integration, you can launch an RStudio environment in the domain to run your RStudio workflows on SageMaker AI resources. For more information about Posit Workbench, see the [Posit website](https://posit.co/products/enterprise/workbench/). This page describes important RStudio concepts.

SageMaker AI integrates RStudio through the creation of an RStudioServerPro app.

RStudio on SageMaker AI supports the following:
+ R developers use the RStudio IDE interface with popular developer tools from the R ecosystem. Users can launch new RStudio sessions, write R code, install dependencies from RStudio Package Manager, and publish Shiny apps using RStudio Connect. 
+ R developers can quickly scale underlying compute resources to run large scale data processing and statistical analysis.  
+ Platform administrators can set up user identities, authorization, networking, storage, and security for their data science teams through AWS IAM Identity Center and AWS Identity and Access Management integration. This includes connection to private Amazon Virtual Private Cloud (Amazon VPC) resources and internet-free mode with AWS PrivateLink.
+ Integration with AWS License Manager. 

 For information on the onboarding steps to create a domain with RStudio enabled, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md).

## Region availability
<a name="rstudio-region"></a>

The following table lists the AWS Regions where RStudio on SageMaker AI is supported.


|  Region name  |  Region  | 
| --- | --- | 
|  US East (Ohio)  |  us-east-2  | 
|  US East (N. Virginia)  |  us-east-1  | 
|  US West (N. California)  |  us-west-1  | 
|  US West (Oregon)  |  us-west-2  | 
|  Asia Pacific (Mumbai)  |  ap-south-1  | 
|  Asia Pacific (Seoul)  |  ap-northeast-2  | 
|  Asia Pacific (Singapore)  |  ap-southeast-1  | 
|  Asia Pacific (Sydney)  |  ap-southeast-2  | 
|  Asia Pacific (Tokyo)  |  ap-northeast-1  | 
|  Canada (Central)  |  ca-central-1  | 
|  Europe (Frankfurt)  |  eu-central-1  | 
|  Europe (Ireland)  |  eu-west-1  | 
|  Europe (London)  |  eu-west-2  | 
|  Europe (Paris)  |  eu-west-3  | 
|  Europe (Stockholm)  |  eu-north-1  | 
|  South America (São Paulo)  |  sa-east-1  | 

## RStudio components
<a name="rstudio-components"></a>
+ *RStudioServerPro*: The RStudioServerPro app is a multiuser app that is a shared resource among all user profiles in the domain. Once an RStudio app is created in a domain, the admin can give permissions to users in the domain.  
+ *RStudio user*: RStudio users are users within the domain that are authorized to use the RStudio license.
+ *RStudio admin*: An RStudio on Amazon SageMaker AI admin can access the RStudio administrative dashboard. RStudio on Amazon SageMaker AI admins differ from "stock" Posit Workbench admins because they do not have root access to the instance running the RStudioServerPro app and can't modify the RStudio configuration file.
+ *RStudio Server*: The RStudio Server instance is responsible for serving the RStudio UI to all authorized Users. This instance is launched on an Amazon SageMaker AI instance.
+ *RSession*: An RSession is a browser-based interface to the RStudio IDE running on an Amazon SageMaker AI instance. Users can create and interact with their RStudio projects through the RSession.
+ *RSessionGateway*: The RSessionGateway app is used to support an RSession. 
+ *RStudio administrative dashboard*: This dashboard gives information on the RStudio users in the Amazon SageMaker AI domain and their sessions. This dashboard can only be accessed by users that have RStudio admin authorization.

## Differences from Posit Workbench
<a name="rstudio-differences"></a>

RStudio on Amazon SageMaker AI has some significant differences from [Posit Workbench](https://posit.co/products/enterprise/workbench/).
+ When using RStudio on SageMaker AI, users don’t have access to the RStudio configuration files. Amazon SageMaker AI manages the configuration file and sets defaults. You can modify the RStudio Connect and RStudio Package Manager URLs when creating your RStudio-enabled Amazon SageMaker AI domain.
+ Project sharing, realtime collaboration, and Job Launcher are not currently supported when using RStudio on Amazon SageMaker AI.
+ When using RStudio on SageMaker AI, the RStudio IDE runs on Amazon SageMaker AI instances for on-demand containerized compute resources. 
+ RStudio on SageMaker AI only supports the RStudio IDE and does not support other IDEs supported by a Posit Workbench installation.
+ RStudio on SageMaker AI only supports the RStudio version specified in [RStudio Versioning](rstudio-version.md).

# RStudio on Amazon SageMaker AI management
<a name="rstudio-manage"></a>

 The following topics give information on managing RStudio on Amazon SageMaker AI. This includes information about your RStudio environment configuration, user sessions, and necessary resources. For information on how to use RStudio on SageMaker AI, see [RStudio on Amazon SageMaker AI user guide](rstudio-use.md). 

 For information about creating a Amazon SageMaker AI domain with RStudio enabled, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md).  

 For information about the AWS Regions that RStudio on SageMaker AI is supported in, see [Supported Regions and Quotas](regions-quotas.md).  

**Topics**
+ [Get an RStudio license](rstudio-license.md)
+ [RStudio Versioning](rstudio-version.md)
+ [Network and Storage](rstudio-network.md)
+ [RStudioServerPro instance type](rstudio-select-instance.md)
+ [Add an RStudio Connect URL](rstudio-configure-connect.md)
+ [Update the RStudio Package Manager URL](rstudio-configure-pm.md)
+ [Create an Amazon SageMaker AI domain with RStudio using the AWS CLI](rstudio-create-cli.md)
+ [Add RStudio support to an existing domain](rstudio-add-existing.md)
+ [Custom images with RStudio on SageMaker AI](rstudio-byoi.md)
+ [Create a user to use RStudio](rstudio-create-user.md)
+ [Log in to RStudio as another user](rstudio-login-another.md)
+ [Terminate sessions for another user](rstudio-terminate-another.md)
+ [Use the RStudio administrative dashboard](rstudio-admin.md)
+ [Shut down RStudio](rstudio-shutdown.md)
+ [Billing and cost](rstudio-billing.md)
+ [Diagnose issues and get support](rstudio-troubleshooting.md)

# Get an RStudio license
<a name="rstudio-license"></a>

RStudio on Amazon SageMaker AI is a paid product and requires that each user is appropriately licensed. Licenses for RStudio on Amazon SageMaker AI may be obtained from RStudio PBC directly, or by purchasing a subscription to Posit Workbench on AWS Marketplace. For existing customers of Posit Workbench Enterprise, licenses are issued at no additional cost. To use an RStudio license with Amazon SageMaker AI, you must first have a valid RStudio license registered with AWS License Manager. For licenses purchased directly through RStudio PBC, a license grant for your AWS account must be created. Contact RStudio for direct license purchases or to enable existing licenses in AWS License Manager. For more information about registering a license with AWS License Manager, see [Seller issued licenses in AWS License Manager](https://docs.aws.amazon.com/license-manager/latest/userguide/seller-issued-licenses.html). 

The following topics show how to acquire and validate a license granted by RStudio PBC.

 **Get an RStudio license** 

1. If you don't have an RStudio license, you may purchase one from the AWS Marketplace or from RStudio PBC directly.
   + To purchase a subscription from the AWS Marketplace, complete the steps to [subscribe with a SaaS contract](https://docs.aws.amazon.com/marketplace/latest/buyerguide/buyer-saas-products.html) by searching for **Posit Team**. To fulfill the license, you will be redirected to an external form outside the AWS Marketplace. You must provide additional information, including your company name and email address. If you can’t access that form to provide a company name and a contact email, create a ticket with Posit Support at [https://support.posit.co/hc/en-us/requests/new](https://support.posit.co/hc/en-us/requests/new) with details about your purchase.
   + To purchase from RStudio PBC directly, navigate to [RStudio Pricing](https://www.rstudio.com/pricing/) or contact [sales@rstudio.com](mailto:sales@rstudio.com). When buying or updating an RStudio license, you must provide the AWS Account that will host your Amazon SageMaker AI domain. 

   If you have an existing RStudio license, contact your RStudio Sales representative or [sales@rstudio.com](mailto:sales@rstudio.com) to add RStudio on Amazon SageMaker AI to your existing Posit Workbench Enterprise license, or to convert your Posit Workbench Standard license. The RStudio Sales representative will send you the appropriate electronic order form.

1. RStudio grants a Posit Workbench license to your AWS Account through AWS License Manager in the US East (N. Virginia) Region. Although the RStudio license is granted in the US East (N. Virginia) Region, your license can be consumed in any AWS Region that RStudio on Amazon SageMaker AI is supported in. You can expect the license grant process to complete within three business days after you share your AWS account ID with RStudio.

1. When this license is granted, you receive an email from your RStudio Sales representative with instructions to accept your license grant.

 **Validate your RStudio license to be used with Amazon SageMaker AI** 

1. Log in to the AWS License Manager console in the same Region as your Amazon SageMaker AI domain. If you are using AWS License Manager for the first time, AWS License Manager prompts you to grant permission to use it. 

1. Select **Start using AWS License Manager**. 

1. Select **I grant AWS License Manager the required permissions**, and then select **Grant Permissions**. 

1. Navigate to **Granted Licenses** on the left panel. 

1. Select the license grant with `RSW-SageMaker` as the `Product name` and select **View**.

1. From the license detail page, select **Accept & activate license**. 

 **RStudio administrative dashboard** 

You can use the RStudio administrative dashboard to see the number of users on the license following the steps in [Use the RStudio administrative dashboard](rstudio-admin.md).

# RStudio Versioning
<a name="rstudio-version"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

This guide provides information about the `2025.05.1+513.pro3` version update for RStudio on SageMaker AI. Starting October 31, 2025, new domains with RStudio support are created with Posit Workbench version `2025.05.1+513.pro3`. This applies to the `RStudioServerPro` applications and default `RSessionGateway` applications.

The following sections provide information about the `2025.05.1+513.pro3` release.

## Latest version updates
<a name="rstudio-version-latest"></a>

The latest RStudio version is `2025.05.1+513.pro3`. 
+ R versions supported:
  + 4.5.1
  + 4.4.3
  + 4.4.0
  + 4.3.3
  + 4.2.3
  + 4.2.1
  + 4.1.3
  + 4.0.2

For more information about the changes in this release, see [https://docs.posit.co/ide/news/](https://docs.posit.co/ide/news/). 

**Note**  
To ensure compatibility, we recommend using RSessions with a prefix that matches the current Posit Workbench version.  
If you see the following warning, there is a version mismatch between the `RSession` and the Posit Workbench version used in RStudio on SageMaker AI. To resolve this issue, update the RStudio version for the domain. For information about updating the RStudio version, see [Upgrade to the new version](rstudio-version-upgrade.md).  

```
Session version 2024.04.2+764.pro1 does not match server version 2025.05.1+513.pro3 - this is an unsupported configuration, and you may experience unexpected issues as a result.
```
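
The compatibility rule from the note above, that an RSession's version prefix should match the Posit Workbench version, can be checked with simple string parsing. This is a sketch that compares only the `YYYY.MM` release prefix of version strings in the `YYYY.MM.patch+build` format shown above:

```python
# Compare the YYYY.MM release prefixes of two Posit Workbench version strings.
def versions_compatible(session_version, server_version):
    def prefix(version):
        return ".".join(version.split(".")[:2])
    return prefix(session_version) == prefix(server_version)

print(versions_compatible("2024.04.2+764.pro1", "2025.05.1+513.pro3"))  # False
print(versions_compatible("2025.05.1+513.pro3", "2025.05.1+513.pro3"))  # True
```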

## Versioning
<a name="rstudio-version-new"></a>

There are currently two versions of Posit Workbench supported by SageMaker AI. 
+ Latest version: `2025.05.1+513.pro3`

  Deprecation Date: December 5, 2026
+ Previous version: `2024.04.2+764.pro1`

  Deprecation Date: April 30, 2026

**Note**  
While you can continue creating new domains with the older version `2024.04.2+764.pro1` until April 30, 2026 by explicitly pinning the version when you create the domain using the AWS CLI, we strongly recommend that you begin using the `2025.05` version in all domains. Posit has stopped providing vulnerability fixes for `2024.04.2+764.pro1`.  
Versions `2023.03.2-547.pro5` and `2022.02.2-485.pro2` are deprecated and are no longer supported. We recommend updating to the latest version.

The default Posit Workbench version that SageMaker AI selects depends on the creation date of the domain. 
+ For domains created after October 31, 2025, version `2025.05.1+513.pro3` is the default selected version. 
+ For domains created after September 04, 2024 and before October 31, 2025, version `2024.04.2+764.pro1` is the default selected version. You can update your domains to the latest version (`2025.05.1+513.pro3`) by setting it as the default version for the domain. For more information, see [Upgrade to the new version](rstudio-version-upgrade.md).

**Note**  
The default `RSessionGateway` application version matches the current version of the `RStudioServerPro` application.

The following table lists the image ARNs for both versions for each AWS Region. These ARNs are passed as part of an `update-domain` command to set the desired version.


|  Region | `2024.04.2+764.pro1` Image ARN  | `2025.05.1+513.pro3` Image ARN  | 
| --- | --- | --- | 
| us-east-1 |  arn:aws:sagemaker:us-east-1:081325390199:image/rstudio-workbench-2024.04-sagemaker-1.1  |  arn:aws:sagemaker:us-east-1:081325390199:image/rstudio-workbench-2025.05-sagemaker-1.0  | 
| us-east-2 |  arn:aws:sagemaker:us-east-2:429704687514:image/rstudio-workbench-2024.04-sagemaker-1.1  |  arn:aws:sagemaker:us-east-2:429704687514:image/rstudio-workbench-2025.05-sagemaker-1.0  | 
| us-west-1 |  arn:aws:sagemaker:us-west-1:742091327244:image/rstudio-workbench-2024.04-sagemaker-1.1  |  arn:aws:sagemaker:us-west-1:742091327244:image/rstudio-workbench-2025.05-sagemaker-1.0  | 
| us-west-2 |  arn:aws:sagemaker:us-west-2:236514542706:image/rstudio-workbench-2024.04-sagemaker-1.1  |  arn:aws:sagemaker:us-west-2:236514542706:image/rstudio-workbench-2025.05-sagemaker-1.0  | 
| af-south-1 |  arn:aws:sagemaker:af-south-1:559312083959:image/rstudio-workbench-2024.04-sagemaker-1.1  |  arn:aws:sagemaker:af-south-1:559312083959:image/rstudio-workbench-2025.05-sagemaker-1.0  | 
| ap-east-1 |  arn:aws:sagemaker:ap-east-1:493642496378:image/rstudio-workbench-2024.04-sagemaker-1.1  |  arn:aws:sagemaker:ap-east-1:493642496378:image/rstudio-workbench-2025.05-sagemaker-1.0  | 
| ap-south-1 |  arn:aws:sagemaker:ap-south-1:394103062818:image/rstudio-workbench-2024.04-sagemaker-1.1  |  arn:aws:sagemaker:ap-south-1:394103062818:image/rstudio-workbench-2025.05-sagemaker-1.0  | 
| ap-northeast-2 |  arn:aws:sagemaker:ap-northeast-2:806072073708:image/rstudio-workbench-2024.04-sagemaker-1.1  |  arn:aws:sagemaker:ap-northeast-2:806072073708:image/rstudio-workbench-2025.05-sagemaker-1.0  | 
| ap-southeast-1 |  arn:aws:sagemaker:ap-southeast-1:492261229750:image/rstudio-workbench-2024.04-sagemaker-1.1  |  arn:aws:sagemaker:ap-southeast-1:492261229750:image/rstudio-workbench-2025.05-sagemaker-1.0  | 
| ap-southeast-2 |  arn:aws:sagemaker:ap-southeast-2:452832661640:image/rstudio-workbench-2024.04-sagemaker-1.1  |  arn:aws:sagemaker:ap-southeast-2:452832661640:image/rstudio-workbench-2025.05-sagemaker-1.0  | 
| ap-northeast-1 |  arn:aws:sagemaker:ap-northeast-1:102112518831:image/rstudio-workbench-2024.04-sagemaker-1.1  |  arn:aws:sagemaker:ap-northeast-1:102112518831:image/rstudio-workbench-2025.05-sagemaker-1.0  | 
| ca-central-1 |  arn:aws:sagemaker:ca-central-1:310906938811:image/rstudio-workbench-2024.04-sagemaker-1.1  |  arn:aws:sagemaker:ca-central-1:310906938811:image/rstudio-workbench-2025.05-sagemaker-1.0  | 
| eu-central-1 |  arn:aws:sagemaker:eu-central-1:936697816551:image/rstudio-workbench-2024.04-sagemaker-1.1  |  arn:aws:sagemaker:eu-central-1:936697816551:image/rstudio-workbench-2025.05-sagemaker-1.0  | 
| eu-west-1 |  arn:aws:sagemaker:eu-west-1:470317259841:image/rstudio-workbench-2024.04-sagemaker-1.1  |  arn:aws:sagemaker:eu-west-1:470317259841:image/rstudio-workbench-2025.05-sagemaker-1.0  | 
| eu-west-2 |  arn:aws:sagemaker:eu-west-2:712779665605:image/rstudio-workbench-2024.04-sagemaker-1.1  |  arn:aws:sagemaker:eu-west-2:712779665605:image/rstudio-workbench-2025.05-sagemaker-1.0  | 
| eu-west-3 |  arn:aws:sagemaker:eu-west-3:615547856133:image/rstudio-workbench-2024.04-sagemaker-1.1  |  arn:aws:sagemaker:eu-west-3:615547856133:image/rstudio-workbench-2025.05-sagemaker-1.0  | 
| eu-north-1 |  arn:aws:sagemaker:eu-north-1:243637512696:image/rstudio-workbench-2024.04-sagemaker-1.1  |  arn:aws:sagemaker:eu-north-1:243637512696:image/rstudio-workbench-2025.05-sagemaker-1.0  | 
| eu-south-1 |  arn:aws:sagemaker:eu-south-1:592751261982:image/rstudio-workbench-2024.04-sagemaker-1.1  |  arn:aws:sagemaker:eu-south-1:592751261982:image/rstudio-workbench-2025.05-sagemaker-1.0  | 
| sa-east-1 |  arn:aws:sagemaker:sa-east-1:782484402741:image/rstudio-workbench-2024.04-sagemaker-1.1  |  arn:aws:sagemaker:sa-east-1:782484402741:image/rstudio-workbench-2025.05-sagemaker-1.0  | 

### Changes to BYOI Images
<a name="rstudio-version-byoi"></a>

If you use a BYOI image with RStudio and update your `RStudioServerPro` version to `2025.05.1+513.pro3`, you must upgrade your custom images to use the `2025.05.1+513.pro3` release and redeploy your existing RSessions. If you attempt to load a non-compatible image in an RSession of a domain using the `2025.05.1+513.pro3` version, the RSession fails because it cannot parse parameters that it receives. To prevent failure, update all of the deployed custom images in your existing `RStudioServerPro` application. 

The `RSW_VERSION` in the Dockerfile must be consistent with the Posit Workbench version used in RStudio on SageMaker AI. You can validate the current version in Posit Workbench. To do so, use the version name that's located in the lower left corner of the Posit Workbench launcher page.

```
ARG RSW_VERSION=2025.05.1+513.pro3
ENV RSTUDIO_FORCE_NON_ZERO_EXIT_CODE="1"
ARG RSW_NAME=rstudio-workbench
ARG OS_CODE_NAME=jammy
ARG RSW_DOWNLOAD_URL=https://s3.amazonaws.com/rstudio-ide-build/server/${OS_CODE_NAME}/amd64
RUN RSW_VERSION_URL=`echo -n "${RSW_VERSION}" | sed 's/+/-/g'` \
    && curl -o rstudio-workbench.deb ${RSW_DOWNLOAD_URL}/${RSW_NAME}-${RSW_VERSION_URL}-amd64.deb \
    && gdebi -n ./rstudio-workbench.deb
```
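
The `sed` substitution in the `RUN` step exists because the download file name uses `-` where the version string uses `+`. The same transformation, expressed in Python for reference (the URL pattern is taken from the Dockerfile above):

```python
# Build the Workbench .deb download URL from a version string;
# '+' in the version becomes '-' in the file name, as in the Dockerfile's sed step.
def rsw_download_url(version, os_code_name="jammy", name="rstudio-workbench"):
    base = f"https://s3.amazonaws.com/rstudio-ide-build/server/{os_code_name}/amd64"
    return f"{base}/{name}-{version.replace('+', '-')}-amd64.deb"

print(rsw_download_url("2025.05.1+513.pro3"))
```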

# Upgrade to the new version
<a name="rstudio-version-upgrade"></a>

Existing domains using version `2024.04.2+764.pro1` can upgrade to the `2025.05.1+513.pro3` version in one of two ways:
+ Create a new domain from the AWS CLI with RStudio enabled.
+ Update an existing domain to use the `2025.05.1+513.pro3` version.

The following procedure shows how to delete the RStudio application for an existing domain, set the default version to `2025.05.1+513.pro3`, and then create an RStudio application.

1. Delete the `RStudioServerPro` application and all `RSessionGateway` applications associated with your existing domain. For information about how to find your domain ID, see [View domains](domain-view.md). For more information about deleting applications, see [Shut down RStudio](rstudio-shutdown.md).

   ```
   aws sagemaker delete-app \
       --region region \
       --domain-id domainId \
       --user-profile-name domain-shared \
       --app-type RStudioServerPro \
       --app-name default
   ```

1. If your domain is using RStudio version `2024.04.2+764.pro1`, update the domain to set `2025.05.1+513.pro3` as the default Posit Workbench version. The `SageMakerImageArn` value in the following `update-domain` command specifies the RStudio `2025.05.1+513.pro3` version as the default. This ARN must match the Region that your domain is in. For a list of all available ARNs, see [Versioning](rstudio-version.md#rstudio-version-new).

   Pass an execution role ARN for the domain that provides permissions to update the domain. 

   ```
   aws sagemaker update-domain \
       --region region \
       --domain-id domainId \
       --domain-settings-for-update "{\"RStudioServerProDomainSettingsForUpdate\":{\"DefaultResourceSpec\": {\"SageMakerImageArn\": \"arn-for-2025.05.1+513.pro3-version\", \"InstanceType\": \"system\"}, \"DomainExecutionRoleArn\": \"execution-role-arn\"}}"
   ```

1. Create a new `RStudioServerPro` application in the existing domain.

   ```
   aws sagemaker create-app \
       --region region \
       --domain-id domainId \
       --user-profile-name domain-shared \
       --app-type RStudioServerPro \
       --app-name default
   ```

Your `RStudioServerPro` application is now updated to version `2025.05.1+513.pro3`. You can now relaunch your `RSessionGateway` applications.
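The three steps above can be chained into one script. The following is a sketch, not part of the official procedure: the Region, domain ID, and ARNs are placeholders, the update settings are kept in a plain JSON file passed with `file://` instead of an escaped inline string, and each AWS CLI call is echoed as a dry run rather than executed.

```shell
# Dry-run sketch of the upgrade sequence. All IDs and ARNs are placeholders.
REGION="us-east-1"
DOMAIN_ID="d-xxxxxxxxxxxx"

# Keep the update settings as plain JSON instead of an escaped inline string.
cat > domain-settings.json <<'EOF'
{
  "RStudioServerProDomainSettingsForUpdate": {
    "DefaultResourceSpec": {
      "SageMakerImageArn": "arn-for-2025.05.1+513.pro3-version",
      "InstanceType": "system"
    },
    "DomainExecutionRoleArn": "execution-role-arn"
  }
}
EOF
python3 -m json.tool domain-settings.json > /dev/null  # fail early on malformed JSON

run() { echo "+ $*"; }  # dry run; replace the echo with "$@" to execute the calls

run aws sagemaker delete-app --region "$REGION" --domain-id "$DOMAIN_ID" \
    --user-profile-name domain-shared --app-type RStudioServerPro --app-name default
run aws sagemaker update-domain --region "$REGION" --domain-id "$DOMAIN_ID" \
    --domain-settings-for-update file://domain-settings.json
run aws sagemaker create-app --region "$REGION" --domain-id "$DOMAIN_ID" \
    --user-profile-name domain-shared --app-type RStudioServerPro --app-name default
```

Keeping the settings in a file makes the nested JSON easier to edit and lets you validate it before calling `update-domain`.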

# Downgrade to a previous version
<a name="rstudio-version-downgrade"></a>

You can manually downgrade the version of your existing RStudio application to the `2024.04.2+764.pro1` version. 

**To downgrade to a previous version**

1. Delete the `RStudioServerPro` application that's associated with your existing domain. For information about how to find your domain ID, see [View domains](domain-view.md).

   ```
   aws sagemaker delete-app \
       --domain-id domainId \
       --user-profile-name domain-shared \
       --app-type RStudioServerPro \
       --app-name default
   ```

1. Pass the corresponding `2024.04.2+764.pro1` ARN for your Region as part of the `update-domain` command. For a list of all available ARNs, see [Versioning](rstudio-version.md#rstudio-version-new). You must also pass an execution role ARN for the domain that provides permissions to update the domain. 

   ```
   aws sagemaker update-domain \
       --region region \
       --domain-id domainId \
       --domain-settings-for-update "{\"RStudioServerProDomainSettingsForUpdate\":{\"DefaultResourceSpec\": {\"SageMakerImageArn\": \"arn-for-2024.04.2+764.pro1-version\", \"InstanceType\": \"system\"}, \"DomainExecutionRoleArn\": \"execution-role-arn\"}}"
   ```

1. Create a new `RStudioServerPro` application in the existing domain. The RStudio version defaults to `2024.04.2+764.pro1`.

   ```
   aws sagemaker create-app \
       --domain-id domainId \
       --user-profile-name domain-shared \
       --app-type RStudioServerPro \
       --app-name default
   ```

Your `RStudioServerPro` application is now downgraded to version `2024.04.2+764.pro1`. 

# Network and Storage
<a name="rstudio-network"></a>

The following topic describes network access and data storage considerations for your RStudio instance. For general information about network access and data storage when using Amazon SageMaker AI, see [Data Protection in Amazon SageMaker AI](data-protection.md).

 **Amazon EFS volume**

RStudio on Amazon SageMaker AI shares an Amazon EFS volume with the Amazon SageMaker Studio Classic application in the domain. When the RStudio application is added to a domain, SageMaker AI creates a folder named `shared` in the Amazon EFS directory. If this `shared` folder is deleted or changed manually, then the RStudio application may no longer function. For more information about the Amazon EFS volume, see [Manage Your Amazon EFS Storage Volume in Amazon SageMaker Studio Classic](studio-tasks-manage-storage.md).

 **Installed packages and scripts**

Packages that you install from within RStudio are scoped to the user profile level. This means that installed packages persist through RSession shutdowns and restarts, and across RSessions, for each user profile in which they are installed. R scripts that are saved in RSessions behave the same way. Both packages and R scripts are saved in the user's Amazon EFS volume.

 **Encryption**

 RStudio on Amazon SageMaker AI supports encryption at rest.

 **Use RStudio in VPC-only mode**

RStudio on Amazon SageMaker AI supports [AWS PrivateLink](https://docs.aws.amazon.com/vpc/latest/userguide/endpoint-services-overview.html) integration. With this integration, you can use RStudio on SageMaker AI in VPC-only mode without direct access to the internet. When you use RStudio in VPC-only mode, your security groups are automatically managed by the service. This includes connectivity between your RServer and your RSessions.

The following are required to use RStudio in VPC-only mode. For more information on selecting a VPC, see [Choose an Amazon VPC](onboard-vpc.md).
+ A private subnet that can either access the internet to call Amazon SageMaker AI and AWS License Manager, or that has Amazon Virtual Private Cloud (Amazon VPC) endpoints for both services.
+ The domain cannot have more than two associated security groups.
+ A security group ID for use with the domain in the domain settings. This security group must allow all outbound access.
+ A security group ID for use with the Amazon VPC endpoints. This security group must allow inbound traffic from the domain security group.
+ Amazon VPC endpoints for `sagemaker.api` and AWS License Manager. These must be in the same Amazon VPC as the private subnet. 
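The interface endpoints in the last requirement use the standard AWS service-name format. The sketch below derives the two service names from a Region (`us-east-1` is an example; substitute your own); you would pass each name to `aws ec2 create-vpc-endpoint` along with your VPC, subnet, and endpoint security group IDs, which are placeholders here:

```shell
# Derive the VPC endpoint service names for the SageMaker API and License Manager.
# us-east-1 is an example Region; substitute the Region your domain is in.
REGION="us-east-1"
SAGEMAKER_API_SERVICE="com.amazonaws.${REGION}.sagemaker.api"
LICENSE_MANAGER_SERVICE="com.amazonaws.${REGION}.license-manager"
echo "$SAGEMAKER_API_SERVICE"
echo "$LICENSE_MANAGER_SERVICE"
# Example call (placeholder IDs):
# aws ec2 create-vpc-endpoint --vpc-id <VPC_ID> --vpc-endpoint-type Interface \
#     --service-name "$SAGEMAKER_API_SERVICE" --subnet-ids <SUBNET_ID> \
#     --security-group-ids <ENDPOINT_SECURITY_GROUP_ID>
```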

# RStudioServerPro instance type
<a name="rstudio-select-instance"></a>

When deciding which Amazon EC2 instance type to use for your RStudioServerPro app, the main factor to consider is bandwidth. Bandwidth is important because the RStudioServerPro instance is responsible for serving the RStudio UI to all users. This includes UI-heavy workflows, such as generating figures and animations and displaying many data rows. Therefore, there may be some UI performance degradation depending on the workload across all users. The following are the available instance types to use for your RStudioServerPro. For pricing information about these instances, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).
+ `system`: This instance type is recommended for domains with low UI use.
**Note**  
The `system` value is translated to `ml.t3.medium`.
+ `ml.c5.4xlarge`: This instance type is recommended for domains with moderate UI use.
+ `ml.c5.9xlarge`: This instance type is recommended for domains with heavy UI use.

 **Changing RStudio instance type** 

To change the instance type of your RStudioServerPro, pass the new instance type as part of a call to the `update-domain` CLI command. You then need to delete the existing RStudioServerPro app using the `delete-app` CLI command and create a new RStudioServerPro app using the `create-app` CLI command. 

# Add an RStudio Connect URL
<a name="rstudio-configure-connect"></a>

RStudio Connect is a publishing platform for Shiny applications, R Markdown reports, dashboards, plots, and more. RStudio Connect makes it easy to surface machine learning and data science insights by making hosting content simple and scalable. If you have an RStudio Connect server, then you can set the server as the default place where apps are published. For more information about RStudio Connect, see [RStudio Connect](https://www.rstudio.com/products/connect/).

When you onboard to RStudio on an Amazon SageMaker AI domain, an RStudio Connect server is not created. You can create an RStudio Connect server on an Amazon EC2 instance to use Connect with your Amazon SageMaker AI domain. For information about how to set up your RStudio Connect server, see [Host RStudio Connect and Package Manager for ML development in RStudio on Amazon SageMaker AI](https://aws.amazon.com/blogs/machine-learning/host-rstudio-connect-and-package-manager-for-ml-development-in-rstudio-on-amazon-sagemaker/). 

 **Add an RStudio Connect URL** 

If you have an RStudio Connect URL, you can update the default URL so that your RStudio users can publish to it. 

1. Navigate to the **domains** page. 

1. Select the desired domain.

1. Choose **Domain settings**.

1. Under **General Settings**, select **Edit**.

1.  From the new page, select **RStudio Settings** on the left side.  

1.  Under **RStudio Connect URL**, enter the RStudio Connect URL to add. 

1.  Select **Submit**. 

 **CLI** 

 You can set a default RStudio Connect URL when you create your domain. The only way to update your RStudio Connect URL from the AWS CLI is to delete your domain and create a new one with the updated RStudio Connect URL. 

# Update the RStudio Package Manager URL
<a name="rstudio-configure-pm"></a>

RStudio Package Manager is a repository management server used to organize and centralize packages across your organization. For more information on RStudio Package Manager, see [RStudio Package Manager](https://www.rstudio.com/products/package-manager/). If you don't supply your own Package Manager URL, Amazon SageMaker AI domain uses the default Package Manager repository when you onboard RStudio following the steps in [Amazon SageMaker AI domain overview](gs-studio-onboard.md). For more information, see [Host RStudio Connect and Package Manager for ML development in RStudio on Amazon SageMaker AI](https://aws.amazon.com/blogs/machine-learning/host-rstudio-connect-and-package-manager-for-ml-development-in-rstudio-on-amazon-sagemaker/). The following procedure shows how to update the Package Manager URL.

 **Update Package Manager URL** 

You can update the Package Manager URL used for your RStudio-enabled domain as follows.

1. Navigate to the **domains** page. 

1. Select the desired domain.

1. Choose **Domain settings**.

1. Under **General Settings**, select **Edit**.

1.  From the new page, select **RStudio Settings** on the left side.  

1.  Under **RStudio Package Manager**, enter your RStudio Package Manager URL. 

1.  Select **Submit**. 

 **CLI** 

The only way to update your Package Manager URL from the AWS CLI is to delete your domain and create a new one with the updated Package Manager URL. 

# Create an Amazon SageMaker AI domain with RStudio using the AWS CLI
<a name="rstudio-create-cli"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

The following topic shows how to onboard to Amazon SageMaker AI domain with RStudio enabled using the AWS CLI. To onboard using the AWS Management Console, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md). 

## Prerequisites
<a name="rstudio-create-cli-prerequisites"></a>
+  Install and configure [AWS CLI version 2](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) 
+  Configure the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-config) with IAM credentials 

## Create `DomainExecution` role
<a name="rstudio-create-cli-domainexecution"></a>

To launch the RStudio app, you must provide a `DomainExecution` role. This role determines whether RStudio is launched as part of Amazon SageMaker AI domain creation. Amazon SageMaker AI also uses this role to access the RStudio license and push RStudio logs.  

**Note**  
The `DomainExecution` role should have at least AWS License Manager permissions to access the RStudio license, and Amazon CloudWatch permissions to push logs in your account.

The following procedure shows how to create the `DomainExecution` role with the AWS CLI. 

1.  Create a file named `assume-role-policy.json` with the following content. 

------
#### [ JSON ]


   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Action": "sts:AssumeRole",
               "Effect": "Allow",
               "Principal": {
                   "Service": [
                       "sagemaker.amazonaws.com"
                   ]
               }
           }
       ]
   }
   ```

------

1.  Create the `DomainExecution` role. `<REGION>` should be the AWS Region to launch your domain in. 

   ```
   aws iam create-role --region <REGION> --role-name DomainExecution --assume-role-policy-document file://assume-role-policy.json
   ```

1. Create a file named `domain-setting-policy.json` with the following content. This policy allows the RStudioServerPro app to access necessary resources and allows Amazon SageMaker AI to automatically launch an RStudioServerPro app when the existing RStudioServerPro app is in a `Deleted` or `Failed` status.

------
#### [ JSON ]


   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Sid": "VisualEditor0",
               "Effect": "Allow",
               "Action": [
                   "license-manager:ExtendLicenseConsumption",
                   "license-manager:ListReceivedLicenses",
                   "license-manager:GetLicense",
                   "license-manager:CheckoutLicense",
                   "license-manager:CheckInLicense",
                   "logs:CreateLogDelivery",
                   "logs:CreateLogGroup",
                   "logs:CreateLogStream",
                   "logs:DeleteLogDelivery",
                   "logs:Describe*",
                   "logs:GetLogDelivery",
                   "logs:GetLogEvents",
                   "logs:ListLogDeliveries",
                   "logs:PutLogEvents",
                   "logs:PutResourcePolicy",
                   "logs:UpdateLogDelivery",
                   "sagemaker:CreateApp"
               ],
               "Resource": "*"
           }
       ]
   }
   ```

------

1.  Create the domain setting policy that is attached to the `DomainExecution` role. Note the `PolicyArn` in the response; you will need to enter that ARN in the following step. 

   ```
   aws iam create-policy --region <REGION> --policy-name domain-setting-policy --policy-document file://domain-setting-policy.json
   ```

1.  Attach `domain-setting-policy` to the `DomainExecution` role. Use the `PolicyArn` returned in the previous step.

   ```
   aws iam attach-role-policy --role-name DomainExecution --policy-arn <POLICY_ARN>
   ```
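With the real CLI, you can avoid copying the `PolicyArn` by hand by adding `--query Policy.Arn --output text` to the `create-policy` call and capturing the result in a variable. The sketch below simulates that extraction locally on a hard-coded sample response (the account ID is hypothetical), so it runs without AWS access:

```shell
# Extract Policy.Arn from a create-policy response. The response here is a
# hard-coded sample with a hypothetical account ID, so no AWS call is made.
RESPONSE='{"Policy": {"PolicyName": "domain-setting-policy", "Arn": "arn:aws:iam::111122223333:policy/domain-setting-policy"}}'
POLICY_ARN=$(printf '%s' "$RESPONSE" | python3 -c 'import json,sys; print(json.load(sys.stdin)["Policy"]["Arn"])')
echo "$POLICY_ARN"
# prints: arn:aws:iam::111122223333:policy/domain-setting-policy
# Then: aws iam attach-role-policy --role-name DomainExecution --policy-arn "$POLICY_ARN"
```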

## Create Amazon SageMaker AI domain with RStudio App
<a name="rstudio-create-cli-domain"></a>

The RStudioServerPro app is launched automatically when you create an Amazon SageMaker AI domain using the `create-domain` CLI command with the `RStudioServerProDomainSettings` parameter specified. When launching the RStudioServerPro app, Amazon SageMaker AI checks for a valid RStudio license in the account and fails domain creation if the license is not found. 

The creation of an Amazon SageMaker AI domain differs based on the authentication method and the network type. These options must be used together, with one authentication method and one network connection type selected. For more information about the requirements to create a new domain, see [CreateDomain](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateDomain.html). 

The following authentication methods are supported.
+  `IAM Auth` 
+  `SSO Auth` 

The following network connection types are supported:
+  `PublicInternet` 
+  `VPCOnly` 

### Authentication methods
<a name="rstudio-create-cli-domain-auth"></a>

 **IAM Auth Mode** 

The following shows how to create an Amazon SageMaker AI domain with RStudio enabled using the `IAM Auth` authentication method. For more information about AWS Identity and Access Management, see [What is IAM?](https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html).
+ `DomainExecutionRoleArn` should be the ARN for the role created in the previous step.
+ `ExecutionRole` is the ARN of the role given to users in the Amazon SageMaker AI domain.
+ `vpc-id` should be the ID of your Amazon Virtual Private Cloud. `subnet-ids` should be a space-separated list of subnet IDs. For information about `vpc-id` and `subnet-ids`, see [VPCs and subnets](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Subnets.html).
+ `RStudioPackageManagerUrl` and `RStudioConnectUrl` are optional and should be set to the URLs of your RStudio Package Manager and RStudio Connect server, respectively.
+ `app-network-access-type` should be either `PublicInternetOnly` or `VPCOnly`.

```
aws sagemaker create-domain --region <REGION> --domain-name <DOMAIN_NAME> \
    --auth-mode IAM \
    --default-user-settings ExecutionRole=<DEFAULT_USER_EXECUTIONROLE> \
    --domain-settings RStudioServerProDomainSettings={RStudioPackageManagerUrl=<PACKAGE_MANAGER_URL>,RStudioConnectUrl=<CONNECT_URL>,DomainExecutionRoleArn=<DOMAINEXECUTION_ROLE_ARN>} \
    --vpc-id <VPC_ID> \
    --subnet-ids <SUBNET_IDS> \
    --app-network-access-type <NETWORK_ACCESS_TYPE>
```

 **Authentication using IAM Identity Center** 

The following shows how to create an Amazon SageMaker AI domain with RStudio enabled using the `SSO Auth` authentication method. AWS IAM Identity Center must be enabled in the Region where the domain is launched. For more information about IAM Identity Center, see [What is AWS IAM Identity Center?](https://docs.aws.amazon.com/singlesignon/latest/userguide/what-is.html).
+ `DomainExecutionRoleArn` should be the ARN for the role created in the previous step.
+ `ExecutionRole` is the ARN of the role given to users in the Amazon SageMaker AI domain.
+ `vpc-id` should be the ID of your Amazon Virtual Private Cloud. `subnet-ids` should be a space-separated list of subnet IDs. For information about `vpc-id` and `subnet-ids`, see [VPCs and subnets](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Subnets.html).
+ `RStudioPackageManagerUrl` and `RStudioConnectUrl` are optional and should be set to the URLs of your RStudio Package Manager and RStudio Connect server, respectively.
+ `app-network-access-type` should be either `PublicInternetOnly` or `VPCOnly`.

```
aws sagemaker create-domain --region <REGION> --domain-name <DOMAIN_NAME> \
    --auth-mode SSO \
    --default-user-settings ExecutionRole=<DEFAULT_USER_EXECUTIONROLE> \
    --domain-settings RStudioServerProDomainSettings={RStudioPackageManagerUrl=<PACKAGE_MANAGER_URL>,RStudioConnectUrl=<CONNECT_URL>,DomainExecutionRoleArn=<DOMAINEXECUTION_ROLE_ARN>} \
    --vpc-id <VPC_ID> \
    --subnet-ids <SUBNET_IDS> \
    --app-network-access-type <NETWORK_ACCESS_TYPE>
```

### Connection types
<a name="rstudio-create-cli-domain-connection"></a>

 **PublicInternet/Direct Internet network type** 

The following shows how to create an Amazon SageMaker AI domain with RStudio enabled and a `PublicInternet` network type.
+ `DomainExecutionRoleArn` should be the ARN for the role created in the previous step.
+ `ExecutionRole` is the ARN of the role given to users in the Amazon SageMaker AI domain.
+ `vpc-id` should be the ID of your Amazon Virtual Private Cloud. `subnet-ids` should be a space-separated list of subnet IDs. For information about `vpc-id` and `subnet-ids`, see [VPCs and subnets](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Subnets.html).
+ `RStudioPackageManagerUrl` and `RStudioConnectUrl` are optional and should be set to the URLs of your RStudio Package Manager and RStudio Connect server, respectively.
+ `auth-mode` should be either `SSO` or `IAM`.

```
aws sagemaker create-domain --region <REGION> --domain-name <DOMAIN_NAME> \
    --auth-mode <AUTH_MODE> \
    --default-user-settings ExecutionRole=<DEFAULT_USER_EXECUTIONROLE> \
    --domain-settings RStudioServerProDomainSettings={RStudioPackageManagerUrl=<PACKAGE_MANAGER_URL>,RStudioConnectUrl=<CONNECT_URL>,DomainExecutionRoleArn=<DOMAINEXECUTION_ROLE_ARN>} \
    --vpc-id <VPC_ID> \
    --subnet-ids <SUBNET_IDS> \
    --app-network-access-type PublicInternetOnly
```

 **VPCOnly mode** 

The following shows how to launch an Amazon SageMaker AI domain with RStudio enabled and a `VPCOnly` network type. For more information about using the `VPCOnly` network access type, see [Connect Studio notebooks in a VPC to external resources](studio-notebooks-and-internet-access.md).
+ `DomainExecutionRoleArn` should be the ARN for the role created in the previous step.
+ `ExecutionRole` is the ARN of the role given to users in the Amazon SageMaker AI domain.
+ `vpc-id` should be the ID of your Amazon Virtual Private Cloud. `subnet-ids` should be a space-separated list of subnet IDs. Your private subnet must be able to either access the internet to call Amazon SageMaker AI and AWS License Manager, or have Amazon VPC endpoints for both services. For information about Amazon VPC endpoints, see [Interface Amazon VPC endpoints](https://docs.aws.amazon.com/vpc/latest/privatelink/vpce-interface.html). For information about `vpc-id` and `subnet-ids`, see [VPCs and subnets](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Subnets.html). 
+ `SecurityGroups` must allow outbound access to the Amazon SageMaker AI and AWS License Manager endpoints.
+ `auth-mode` should be either `SSO` or `IAM`.

**Note**  
When using Amazon Virtual Private Cloud endpoints, the security group attached to your Amazon Virtual Private Cloud endpoints must allow inbound traffic from the security group you pass as part of the `domain-setting` parameter of the `create-domain` CLI call.

With RStudio, Amazon SageMaker AI manages security groups for you. This means that Amazon SageMaker AI manages security group rules to ensure RSessions can access RStudioServerPro Apps. Amazon SageMaker AI creates one security group rule per user profile.

```
aws sagemaker create-domain --region <REGION> --domain-name <DOMAIN_NAME> \
    --auth-mode <AUTH_MODE> \
    --default-user-settings SecurityGroups=<USER_SECURITY_GROUP>,ExecutionRole=<DEFAULT_USER_EXECUTIONROLE> \
    --domain-settings SecurityGroupIds=<DOMAIN_SECURITY_GROUP>,RStudioServerProDomainSettings={DomainExecutionRoleArn=<DOMAINEXECUTION_ROLE_ARN>} \
    --vpc-id <VPC_ID> \
    --subnet-ids "<SUBNET_IDS>" \
    --app-network-access-type VPCOnly --app-security-group-management Service
```

**Note**  
The RStudioServerPro app is launched by a special user profile named `domain-shared`. As a result, this app is not returned as part of `list-app` API calls by any other user profiles. 

You may have to increase the Amazon VPC quota in your account to increase the number of users. For more information, see [Amazon VPC quotas](https://docs.aws.amazon.com/vpc/latest/userguide/amazon-vpc-limits.html#vpc-limits-security-groups). 

## Verify domain creation
<a name="rstudio-create-cli-domain-verify"></a>

Use the following command to verify that your domain has been created with a `Status` of `InService`. Your `domain-id` is appended to the domain's ARN. For example, `arn:aws:sagemaker:<REGION>:<ACCOUNT_ID>:domain/<DOMAIN_ID>`.

```
aws sagemaker describe-domain --domain-id <DOMAIN_ID> --region <REGION>
```

# Add RStudio support to an existing domain
<a name="rstudio-add-existing"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

 If you have added an RStudio License through AWS License Manager, you can create a new Amazon SageMaker AI domain with support for RStudio on SageMaker AI. If you have an existing domain that does not support RStudio, you can add RStudio support to that domain without having to delete and recreate the domain.  

 The following topic outlines how to add this support. 

## Prerequisites
<a name="rstudio-add-existing-prerequisites"></a>

 You must complete the following steps before you update your current domain to add support for RStudio on SageMaker AI.  
+  Install and configure [AWS CLI version 2](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) 
+  Configure the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-config) with IAM credentials 
+  Create a domain execution role following the steps in [Create a SageMaker AI Domain with RStudio using the AWS CLI](https://docs.aws.amazon.com/sagemaker/latest/dg/rstudio-create-cli.html#rstudio-create-cli-domainexecution). This domain-level IAM role is required by the RStudioServerPro app. The role requires access to AWS License Manager for verifying a valid Posit Workbench license and Amazon CloudWatch Logs for publishing server logs.  
+  Bring your RStudio license to AWS License Manager following the steps in [RStudio license](https://docs.aws.amazon.com/sagemaker/latest/dg/rstudio-license.html). 
+  (Optional) If you want to use RStudio in `VPCOnly` mode, complete the steps in [RStudio in VPC-Only](https://docs.aws.amazon.com/sagemaker/latest/dg/rstudio-network.html). 
+  Ensure that the security groups you have configured for each [UserProfile](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateUserProfile.html) in your domain meet the account-level quotas. When configuring the default user profile during domain creation, you can use the `DefaultUserSettings` parameter of the [CreateDomain](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateDomain.html) API to add `SecurityGroups` that are inherited by all the user profiles created in the domain. You can also provide additional security groups for a specific user as part of the `UserSettings` parameter of the [CreateUserProfile](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateUserProfile.html) API. If you have added security groups this way, you must ensure that the total number of security groups per user profile doesn’t exceed the maximum quota of 2 in `VPCOnly` mode and 4 in `PublicInternetOnly` mode. If the resulting total number of security groups for any user profile exceeds the quota, you can combine multiple security groups’ rules into one security group.  

## Add RStudio support to an existing domain
<a name="rstudio-add-existing-enable"></a>

After you have completed the prerequisites, you can add RStudio support to your existing domain. The following steps outline how to update your existing domain to add support for RStudio. 

### Step 1: Delete all apps in the domain
<a name="rstudio-add-existing-enable-step1"></a>

To add support for RStudio in your domain, SageMaker AI must update the underlying security groups for all existing user profiles. To complete this, you must delete and recreate all existing apps in the domain. The following procedure shows how to delete all of the apps. 

1.  List all of the apps in the domain. 

   ```
   aws sagemaker \
      list-apps \
      --domain-id-equals <DOMAIN_ID>
   ```

1.  Delete each app for each user profile in the domain. 

   ```
   // JupyterServer apps 
   aws sagemaker \
       delete-app \
       --domain-id <DOMAIN_ID> \
       --user-profile-name <USER_PROFILE> \
       --app-type JupyterServer \
       --app-name <APP_NAME>
   
   // KernelGateway apps
   aws sagemaker \
       delete-app \
       --domain-id <DOMAIN_ID> \
       --user-profile-name <USER_PROFILE> \
       --app-type KernelGateway \
       --app-name <APP_NAME>
   ```
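With many user profiles, deleting every app by hand is tedious. The sketch below prints a `delete-app` command for each app in a `list-apps` response. The response here is a hard-coded two-app sample so the loop runs locally; you could pipe in real `aws sagemaker list-apps` output instead, and the `<DOMAIN_ID>` placeholder stays as-is until you fill it in:

```shell
# Generate delete-app commands from a (sample) list-apps response.
# The JSON below is a hypothetical two-app response; pipe in real output to use it.
LIST_APPS='{"Apps":[
  {"UserProfileName":"user-a","AppType":"JupyterServer","AppName":"default"},
  {"UserProfileName":"user-a","AppType":"KernelGateway","AppName":"my-kernel"}]}'
printf '%s' "$LIST_APPS" | python3 -c '
import json, sys
for app in json.load(sys.stdin)["Apps"]:
    print("aws sagemaker delete-app --domain-id <DOMAIN_ID>"
          " --user-profile-name {UserProfileName}"
          " --app-type {AppType} --app-name {AppName}".format(**app))
'
```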

### Step 2: Update all user profiles with the new list of security groups
<a name="rstudio-add-existing-enable-step2"></a>

 This is a one-time action that you must complete for all of the existing user profiles in your domain after you have refactored your existing security groups. This prevents you from hitting the quota for the maximum number of security groups. The `UpdateUserProfile` API call fails if the user has any apps in [InService](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeApp.html#sagemaker-DescribeApp-response-Status) status. Delete all apps, then call the `UpdateUserProfile` API to update the security groups. 

**Note**  
The following requirement for `VPCOnly` mode outlined in [Connect Amazon SageMaker Studio Classic Notebooks in a VPC to External Resources](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-and-internet-access.html#studio-notebooks-and-internet-access-vpc-only) is no longer needed when adding RStudio support because `AppSecurityGroupManagement` is managed by the SageMaker AI service:  
“[TCP traffic within the security group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/security-group-rules-reference.html#sg-rules-other-instances). This is required for connectivity between the JupyterServer app and the KernelGateway apps. You must allow access to at least ports in the range `8192-65535`.” 

```
aws sagemaker \
    update-user-profile \
    --domain-id <DOMAIN_ID> \
    --user-profile-name <USER_PROFILE> \
    --user-settings "{\"SecurityGroups\": [\"<SECURITY_GROUP>\", \"<SECURITY_GROUP>\"]}"
```

### Step 3: Activate RStudio by calling the UpdateDomain API
<a name="rstudio-add-existing-enable-step3"></a>

1.  Call the [UpdateDomain](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateDomain.html) API to add support for RStudio on SageMaker AI. The `default-user-settings` parameter is only needed if you have refactored the default security groups for your user profiles. 
   +  For `VPCOnly` mode: 

     ```
     aws sagemaker \
         update-domain \
         --domain-id <DOMAIN_ID> \
         --app-security-group-management Service \
         --domain-settings-for-update RStudioServerProDomainSettingsForUpdate={DomainExecutionRoleArn=<DOMAIN_EXECUTION_ROLE_ARN>} \
         --default-user-settings "{\"SecurityGroups\": [\"<SECURITY_GROUP>\", \"<SECURITY_GROUP>\"]}"
     ```
   +  For `PublicInternetOnly` mode: 

     ```
     aws sagemaker \
         update-domain \
         --domain-id <DOMAIN_ID> \
         --domain-settings-for-update RStudioServerProDomainSettingsForUpdate={DomainExecutionRoleArn=<DOMAIN_EXECUTION_ROLE_ARN>} \
         --default-user-settings "{\"SecurityGroups\": [\"<SECURITY_GROUP>\", \"<SECURITY_GROUP>\"]}"
     ```

1. Verify that the domain status is `InService`. After the domain status is `InService`, support for RStudio on SageMaker AI is added.

   ```
   aws sagemaker \
       describe-domain \
       --domain-id <DOMAIN_ID>
   ```

1. Verify that the RStudioServerPro app’s status is `InService` using the following command.

   ```
   aws sagemaker list-apps --user-profile-name-equals domain-shared
   ```
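Rather than re-running the two verification commands by hand, the status check can be wrapped in a small polling loop. The following sketch assumes a configured AWS CLI; the `domain_ready` and `wait_for_domain` helper names are illustrative, not part of the SageMaker API.

```
# Map a SageMaker domain status to an exit code:
# 0 = ready, 2 = failed, 1 = still in progress.
domain_ready() {
  case "$1" in
    InService) return 0 ;;
    Update_Failed|Failed|Delete_Failed) return 2 ;;
    *) return 1 ;;
  esac
}

# Poll describe-domain every 30 seconds until the domain is InService.
# Usage: wait_for_domain <DOMAIN_ID>
wait_for_domain() {
  while :; do
    status=$(aws sagemaker describe-domain --domain-id "$1" \
        --query "Status" --output text)
    domain_ready "$status" && { echo "Domain $1 is InService"; return 0; }
    [ $? -eq 2 ] && { echo "Domain $1 entered status $status" >&2; return 1; }
    sleep 30
  done
}
```

For example, `wait_for_domain <DOMAIN_ID>` returns once the update from Step 3 has finished.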

### Step 4 - Add RStudio access for existing users
<a name="rstudio-add-existing-enable-step4"></a>

 As part of the update in Step 3, SageMaker AI marks the RStudio [AccessStatus](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_RStudioServerProAppSettings.html#sagemaker-Type-RStudioServerProAppSettings-AccessStatus) of all existing user profiles in the domain as `DISABLED` by default. This prevents exceeding the number of users allowed by your current license. To add access for existing users, there is a one-time opt-in step. Perform the opt-in by calling the [UpdateUserProfile](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateUserProfile.html) API with the following [RStudioServerProAppSettings](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UserSettings.html#sagemaker-Type-UserSettings-RStudioServerProAppSettings): 
+  `AccessStatus` = `ENABLED` 
+  *Optional* - `UserGroup` = `R_STUDIO_USER` or `R_STUDIO_ADMIN` 

```
aws sagemaker \
    update-user-profile \
    --domain-id <DOMAIN_ID>\
    --user-profile-name <USER_PROFILE> \
    --user-settings "{\"RStudioServerProAppSettings\": {\"AccessStatus\": \"ENABLED\"}}"
```

**Note**  
By default, the number of users that can have access to RStudio is 60.
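The opt-in call is per user profile. For a domain with many existing profiles, the calls can be scripted. This is a sketch, assuming a configured AWS CLI and that every profile in the domain should be opted in; `rstudio_access_json` and `enable_all_rstudio` are illustrative helper names, and the 60-user default limit still applies.

```
# Build the --user-settings JSON for a given access status.
rstudio_access_json() {
  printf '{"RStudioServerProAppSettings": {"AccessStatus": "%s"}}' "$1"
}

# Opt in every existing user profile in a domain.
# Usage: enable_all_rstudio <DOMAIN_ID>
enable_all_rstudio() {
  aws sagemaker list-user-profiles --domain-id-equals "$1" \
      --query "UserProfiles[].UserProfileName" --output text |
  tr '\t' '\n' |
  while read -r profile; do
    [ -n "$profile" ] || continue
    echo "Enabling RStudio access for $profile"
    aws sagemaker update-user-profile \
        --domain-id "$1" \
        --user-profile-name "$profile" \
        --user-settings "$(rstudio_access_json ENABLED)"
  done
}
```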

### Step 5 – Deactivate RStudio access for new users
<a name="rstudio-add-existing-enable-step5"></a>

 Unless otherwise specified when calling `UpdateDomain`, RStudio support is added by default for all new user profiles created after you have added support for RStudio on SageMaker AI. To deactivate access for a new user profile, you must explicitly set the `AccessStatus` parameter to `DISABLED` as part of the `CreateUserProfile` API call. If the `AccessStatus` parameter is not specified as part of the `CreateUserProfile` API, the default access status is `ENABLED`. 

```
aws sagemaker \
    create-user-profile \
    --domain-id <DOMAIN_ID>\
    --user-profile-name <USER_PROFILE> \
    --user-settings "{\"RStudioServerProAppSettings\": {\"AccessStatus\": \"DISABLED\"}}"
```

# Custom images with RStudio on SageMaker AI
<a name="rstudio-byoi"></a>

A SageMaker image is a file that identifies language packages and other dependencies that are required to run RStudio on Amazon SageMaker AI. SageMaker AI uses these images to create an environment where you run RStudio. Amazon SageMaker AI provides a built-in RStudio image for you to use. If you need different functionality, you can bring your own custom images. This page gives information about key concepts for using custom images with RStudio on SageMaker AI. The process to bring your own image to use with RStudio on SageMaker AI takes three steps:

1. Build a custom image from a Dockerfile and push it to a repository in Amazon Elastic Container Registry (Amazon ECR).

1. Create a SageMaker image that points to a container image in Amazon ECR and attach it to your Amazon SageMaker AI domain.

1. Launch a new session in RStudio with your custom image.

You can create images and image versions, and attach image versions to your domain, using the SageMaker AI control panel, the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html), and the [AWS Command Line Interface (AWS CLI)](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/). You can also create images and image versions using the SageMaker AI console, even if you haven't onboarded to a domain.

The following topics show how to bring your own image to RStudio on SageMaker AI by creating, attaching, and launching a custom image.

## Key terminology
<a name="rstudio-byoi-basics"></a>

The following section defines key terms for bringing your own image to use with RStudio on SageMaker AI.
+ **Dockerfile:** A Dockerfile is a file that identifies the language packages and other dependencies for your Docker image.
+ **Docker image:** The Docker image is a built Dockerfile. This image is checked into Amazon ECR and serves as the basis of the SageMaker AI image.
+ **SageMaker image:** A SageMaker image is a holder for a set of SageMaker image versions based on Docker images. 
+ **Image version:** An image version of a SageMaker image represents a Docker image that is compatible with RStudio and stored in an Amazon ECR repository. Each image version is immutable. These image versions can be attached to a domain and used with RStudio on SageMaker AI.

# Complete prerequisites
<a name="rstudio-byoi-prerequisites"></a>

You must complete the following prerequisites before bringing your own image to use with RStudio on Amazon SageMaker AI. 
+ If you have an existing domain with RStudio that was created before April 7, 2022, you must delete your RStudioServerPro application and recreate it. For information about how to delete an application, see [Shut Down and Update Amazon SageMaker Studio Classic](studio-tasks-update-studio.md).
+ Install the Docker application. For information about setting up Docker, see [Orientation and setup](https://docs.docker.com/get-started/).
+ Create a local copy of an RStudio-compatible Dockerfile that works with SageMaker AI. For information about creating a sample RStudio dockerfile, see [Use a custom image to bring your own development environment to RStudio on Amazon SageMaker AI](https://aws.amazon.com/blogs/machine-learning/use-a-custom-image-to-bring-your-own-development-environment-to-rstudio-on-amazon-sagemaker/).
+ Use an AWS Identity and Access Management execution role that has the [AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home?#/policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess) policy attached. If you have onboarded to a domain, you can get the role from the **Domain Summary** section of the SageMaker AI control panel.

  Add the following permissions to access the Amazon Elastic Container Registry (Amazon ECR) service to your execution role.

------
#### [ JSON ]

****  

  ```
  { 
      "Version":"2012-10-17",		 	 	  
      "Statement":[ 
          {
              "Sid": "VisualEditor0",
              "Effect":"Allow", 
              "Action":[ 
                  "ecr:CreateRepository", 
                  "ecr:BatchGetImage", 
                  "ecr:CompleteLayerUpload", 
                  "ecr:DescribeImages", 
                  "ecr:DescribeRepositories", 
                  "ecr:UploadLayerPart", 
                  "ecr:ListImages", 
                  "ecr:InitiateLayerUpload", 
                  "ecr:BatchCheckLayerAvailability", 
                  "ecr:PutImage" 
              ], 
              "Resource": "*" 
          }
      ]
  }
  ```

------
+ Install and configure the AWS CLI with at least one of the following versions. For information about installing the AWS CLI, see [Installing or updating the latest version of the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).

  ```
  AWS CLI v1 >= 1.23.6
  AWS CLI v2 >= 2.6.2
  ```

# Custom RStudio image specifications
<a name="rstudio-byoi-specs"></a>

This guide describes the specifications for custom RStudio images that you bring to Amazon SageMaker AI. Your custom image must satisfy two sets of requirements: one imposed by RStudio PBC and one imposed by the Amazon SageMaker Studio Classic platform. If either set of requirements isn't satisfied, your custom image won't function properly.

## RStudio PBC requirements
<a name="rstudio-byoi-specs-rstudio"></a>

RStudio PBC requirements are laid out in the [Using Docker images with RStudio Workbench / RStudio Server Pro, Launcher, and Kubernetes](https://support.rstudio.com/hc/en-us/articles/360019253393-Using-Docker-images-with-RStudio-Server-Pro-Launcher-and-Kubernetes) article. Follow the instructions in this article to create the base of your custom RStudio image. 

For instructions about how to install multiple R versions in your custom image, see [Installing multiple versions of R on Linux](https://support.rstudio.com/hc/en-us/articles/215488098).

## Amazon SageMaker Studio Classic requirements
<a name="rstudio-byoi-specs-studio"></a>

Amazon SageMaker Studio Classic imposes the following set of installation requirements for your RStudio image.
+ You must use an RStudio base image of at least `2025.05.1+513.pro3`. For more information, see [RStudio Versioning](rstudio-version.md).
+ You must install the following packages:

  ```
  yum install -y sudo \
  java-11-openjdk \
  libpng-devel \
  && yum clean all \
  && /opt/R/${R_VERSION}/bin/R -e "install.packages('reticulate', repos='https://packagemanager.rstudio.com/cran/__linux__/centos7/latest')" \
  && /opt/python/${PYTHON_VERSION}/bin/pip install --upgrade \
      'boto3>1.0,<2.0' \
      'awscli>1.0,<2.0' \
      'sagemaker[local]<3'
  ```
+ You must provide default values for the `RSTUDIO_CONNECT_URL` and `RSTUDIO_PACKAGE_MANAGER_URL` environment variables.

  ```
  ENV RSTUDIO_CONNECT_URL "YOUR_CONNECT_URL"
  ENV RSTUDIO_PACKAGE_MANAGER_URL "YOUR_PACKAGE_MANAGER_URL"
  ENV RSTUDIO_FORCE_NON_ZERO_EXIT_CODE 1
  ```

The following general specifications apply to the image that is represented by an RStudio image version.

**Running the image**  
`ENTRYPOINT` and `CMD` instructions are overridden so that the image is run as an RSession application.

**Stopping the image**  
The `DeleteApp` API issues the equivalent of a `docker stop` command. Other processes in the container won’t get the SIGKILL/SIGTERM signals.

**File system**  
The `/opt/.sagemakerinternal` and `/opt/ml` directories are reserved. Any data in these directories might not be visible at runtime.

**User data**  
Each user in a SageMaker AI domain gets a user directory on a shared Amazon Elastic File System volume in the image. The location of the current user’s directory on the Amazon Elastic File System volume is `/home/sagemaker-user`.

**Metadata**  
A metadata file is located at `/opt/ml/metadata/resource-metadata.json`. No additional environment variables are added to the variables defined in the image. For more information, see [Get App Metadata](notebooks-run-and-manage-metadata.md#notebooks-run-and-manage-metadata-app).

**GPU**  
On a GPU instance, the image is run with the `--gpus` option. Only the CUDA toolkit should be included in the image, not the NVIDIA drivers. For more information, see [NVIDIA User Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html).

**Metrics and logging**  
Logs from the RSession process are sent to Amazon CloudWatch in the customer’s account. The name of the log group is `/aws/sagemaker/studio`. The name of the log stream is `$domainID/$userProfileName/RSession/$appName`.

**Image size**  
Image size is limited to 25 GB. To view the size of your image, run `docker image ls`.
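Given the log group and stream naming above, recent RSession logs for a single user can be tailed from the command line. A sketch, assuming AWS CLI v2 (the `aws logs tail` command is not available in CLI v1); the placeholders are illustrative.

```
aws logs tail /aws/sagemaker/studio \
    --log-stream-name-prefix "<domain-id>/<user-profile-name>/RSession" \
    --follow
```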

# Create a custom RStudio image
<a name="rstudio-byoi-create"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

This topic describes how you can create a custom RStudio image using the SageMaker AI console and the AWS CLI. If you use the AWS CLI, you must run the steps from your local machine. The following steps do not work from within Amazon SageMaker Studio Classic.

When you create an image, SageMaker AI also creates an initial image version. The image version represents a container image in [Amazon Elastic Container Registry (ECR)](https://console.aws.amazon.com/ecr/). The container image must satisfy the requirements to be used in RStudio. For more information, see [Custom RStudio image specifications](rstudio-byoi-specs.md).

For information about testing your image locally and resolving common issues, see the [SageMaker Studio Custom Image Samples repo](https://github.com/aws-samples/sagemaker-studio-custom-image-samples/blob/main/DEVELOPMENT.md).

**Topics**
+ [Add a SageMaker AI-compatible RStudio Docker container image to Amazon ECR](#rstudio-byoi-sdk-add-container-image)
+ [Create a SageMaker image from the console](#rstudio-byoi-create-console)
+ [Create an image from the AWS CLI](#rstudio-byoi-create-cli)

## Add a SageMaker AI-compatible RStudio Docker container image to Amazon ECR
<a name="rstudio-byoi-sdk-add-container-image"></a>

Use the following steps to add a Docker container image to Amazon ECR:
+ Create an Amazon ECR repository.
+ Authenticate to Amazon ECR.
+ Build a SageMaker AI-compatible RStudio Docker image.
+ Push the image to the Amazon ECR repository.

**Note**  
The Amazon ECR repository must be in the same AWS Region as your domain.

**To build and add a Docker image to Amazon ECR**

1. Create an Amazon ECR repository using the AWS CLI. To create the repository using the Amazon ECR console, see [Creating a repository](https://docs.aws.amazon.com/AmazonECR/latest/userguide/repository-create.html).

   ```
   aws ecr create-repository \
       --repository-name rstudio-custom \
       --image-scanning-configuration scanOnPush=true
   ```

   Response:

   ```
   {
       "repository": {
           "repositoryArn": "arn:aws:ecr:us-east-2:acct-id:repository/rstudio-custom",
           "registryId": "acct-id",
           "repositoryName": "rstudio-custom",
           "repositoryUri": "acct-id.dkr.ecr.us-east-2.amazonaws.com/rstudio-custom",
           ...
       }
   }
   ```

1. Authenticate to Amazon ECR using the repository URI returned as a response from the `create-repository` command. Make sure that the Docker application is running. For more information, see [Registry Authentication](https://docs.aws.amazon.com/AmazonECR/latest/userguide/Registries.html#registry_auth).

   ```
   aws ecr get-login-password | \
       docker login --username AWS --password-stdin <repository-uri>
   ```

   Response:

   ```
   Login Succeeded
   ```

1. Build the Docker image. Run the following command from the directory that includes your Dockerfile.

   ```
   docker build .
   ```

1. Tag your built image with a unique tag.

   ```
   docker tag <image-id> "<repository-uri>:<tag>"
   ```

1. Push the container image to the Amazon ECR repository. For more information, see [ImagePush](https://docs.docker.com/engine/api/v1.40/#operation/ImagePush) and [Pushing an image](https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html).

   ```
   docker push <repository-uri>:<tag>
   ```

   Response:

   ```
   The push refers to repository [<account-id>.dkr.ecr.us-east-2.amazonaws.com/rstudio-custom]
   r: digest: <digest> size: 3066
   ```
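Steps 3 through 5 can be collapsed by tagging the image at build time instead of tagging afterward. A minimal sketch; `image_ref` and `build_and_push` are illustrative helper names, and Docker must be running with the Dockerfile in the current directory.

```
# Compose a fully qualified image reference from a repository URI and tag.
image_ref() {
  printf '%s:%s' "$1" "$2"
}

# Build from the Dockerfile in the current directory, then push.
# Usage: build_and_push <repository-uri> <tag>
build_and_push() {
  docker build -t "$(image_ref "$1" "$2")" . &&
  docker push "$(image_ref "$1" "$2")"
}
```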

## Create a SageMaker image from the console
<a name="rstudio-byoi-create-console"></a>

**To create an image**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Images**. 

1. On the **Custom images** page, choose **Create image**.

1. For **Image source**, enter the registry path to the container image in Amazon ECR. The path is in the following format:

   ` acct-id.dkr.ecr.region.amazonaws.com/repo-name[:tag] or [@digest] `

1. Choose **Next**.

1. Under **Image properties**, enter the following:
   + Image name – The name must be unique to your account in the current AWS Region.
   + (Optional) Image display name – The name displayed in the domain user interface. When not provided, `Image name` is displayed.
   + (Optional) Description – A description of the image.
   + IAM role – The role must have the [AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home?#/policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess) policy attached. Use the dropdown menu to choose one of the following options:
     + Create a new role – Specify any additional Amazon Simple Storage Service (Amazon S3) buckets that you want your notebooks users to access. If you don't want to allow access to additional buckets, choose **None**.

       SageMaker AI attaches the `AmazonSageMakerFullAccess` policy to the role. The role allows your notebook users to access the Amazon S3 buckets listed next to the check marks.
     + Enter a custom IAM role ARN – Enter the Amazon Resource Name (ARN) of your IAM role.
     + Use existing role – Choose one of your existing roles from the list.
   + (Optional) Image tags – Choose **Add new tag**. You can add up to 50 tags. Tags are searchable using the SageMaker AI console or the SageMaker AI `Search` API.

1. Under **Image type**, select RStudio image.

1. Choose **Submit**.

The new image is displayed in the **Custom images** list and briefly highlighted. After the image has been successfully created, you can choose the image name to view its properties or choose **Create version** to create another version.

**To create another image version**

1. Choose **Create version** on the same row as the image.

1. For **Image source**, enter the registry path to the Amazon ECR image. The image shouldn't be the same image as used in a previous version of the SageMaker AI image.

To use the custom image in RStudio, you must attach it to your domain. For more information, see [Attach a custom SageMaker image](rstudio-byoi-attach.md).

## Create an image from the AWS CLI
<a name="rstudio-byoi-create-cli"></a>

This section shows how to create a custom Amazon SageMaker image using the AWS CLI.

Use the following steps to create a SageMaker image:
+ Create an `Image`.
+ Create an `ImageVersion`.
+ Create a configuration file.
+ Create an `AppImageConfig`.

**To create the SageMaker image entities**

1. Create a SageMaker image. The role ARN must have at least the `AmazonSageMakerFullAccessPolicy` policy attached.

   ```
   aws sagemaker create-image \
       --image-name rstudio-custom-image \
       --role-arn arn:aws:iam::<acct-id>:role/service-role/<execution-role>
   ```

   Response:

   ```
   {
       "ImageArn": "arn:aws:sagemaker:us-east-2:acct-id:image/rstudio-custom-image"
   }
   ```

1. Create a SageMaker image version from the image. Pass the unique tag value that you chose when you pushed the image to Amazon ECR.

   ```
   aws sagemaker create-image-version \
       --image-name rstudio-custom-image \
       --base-image <repository-uri>:<tag>
   ```

   Response:

   ```
   {
       "ImageVersionArn": "arn:aws:sagemaker:us-east-2:acct-id:image-version/rstudio-image/1"
   }
   ```

1. Check that the image version was successfully created.

   ```
   aws sagemaker describe-image-version \
       --image-name rstudio-custom-image \
       --version 1
   ```

   Response:

   ```
   {
       "ImageVersionArn": "arn:aws:sagemaker:us-east-2:acct-id:image-version/rstudio-custom-image/1",
       "ImageVersionStatus": "CREATED"
   }
   ```
**Note**  
If the response is `"ImageVersionStatus": "CREATE_FAILED"`, the response also includes the failure reason. A permissions issue is a common cause of failure. You also can check your Amazon CloudWatch Logs. The name of the log group is `/aws/sagemaker/studio`. The name of the log stream is `$domainID/$userProfileName/KernelGateway/$appName`.
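Image version creation is asynchronous, so the status check can be repeated until it settles. A sketch, assuming a configured AWS CLI; `version_ready` and `wait_for_image_version` are illustrative helper names, not part of the SageMaker API.

```
# Map an ImageVersionStatus to an exit code:
# 0 = CREATED, 2 = failed, 1 = still creating.
version_ready() {
  case "$1" in
    CREATED) return 0 ;;
    CREATE_FAILED|DELETE_FAILED) return 2 ;;
    *) return 1 ;;
  esac
}

# Poll describe-image-version until the version is CREATED.
# Usage: wait_for_image_version rstudio-custom-image 1
wait_for_image_version() {
  while :; do
    status=$(aws sagemaker describe-image-version \
        --image-name "$1" --version "$2" \
        --query "ImageVersionStatus" --output text)
    version_ready "$status" && return 0
    [ $? -eq 2 ] && { echo "Image version failed: $status" >&2; return 1; }
    sleep 15
  done
}
```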

1. Create a configuration file, named `app-image-config-input.json`. The app image config specifies how a SageMaker image runs as a Kernel Gateway application.

   ```
   {
       "AppImageConfigName": "rstudio-custom-config"
   }
   ```

1. Create the AppImageConfig using the file that you created in the previous step.

   ```
   aws sagemaker create-app-image-config \
       --cli-input-json file://app-image-config-input.json
   ```

   Response:

   ```
   {
       "AppImageConfigArn": "arn:aws:sagemaker:us-east-2:acct-id:app-image-config/r-image-config"
   }
   ```

# Attach a custom SageMaker image
<a name="rstudio-byoi-attach"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

This guide shows how to attach a custom RStudio image to your Amazon SageMaker AI domain using the SageMaker AI console or the AWS Command Line Interface (AWS CLI). 

To use a custom SageMaker image, you must attach a custom RStudio image to your domain. When you attach an image version, it appears in the RStudio Launcher and is available in the **Select image** dropdown list. You use the dropdown to change the image used by RStudio.

There is a limit to the number of image versions that you can attach. After you reach the limit, you must first detach a version so that you can attach a different version of the image.

**Topics**
+ [Attach an image version to your domain using the console](#rstudio-byoi-attach-console)
+ [Attach an existing image version to your domain using the AWS CLI](#rstudio-byoi-attach-cli)

## Attach an image version to your domain using the console
<a name="rstudio-byoi-attach-console"></a>

You can attach a custom SageMaker image version to your domain using the SageMaker AI console's control panel. You can also create a custom SageMaker image and image version, and then attach that version to your domain.

**To attach an existing image**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **domains**. 

1. Select the desired domain.

1. Choose **Environment**.

1. Under **Custom SageMaker Studio Classic images attached to domain**, choose **Attach image**.

1. For **Image source**, choose **Existing image** or **New image**.

   If you select **Existing image**, choose an image from the Amazon SageMaker image store.

   If you select **New image**, provide the Amazon ECR registry path for your Docker image. The path must be in the same AWS Region as the domain. The Amazon ECR repository must be in the same account as your domain, or cross-account permissions for SageMaker AI must be enabled.

1. Choose an existing image from the list.

1. Choose a version of the image from the list.

1. Choose **Next**.

1. Enter values for **Image name**, **Image display name**, and **Description**.

1. Choose the IAM role. For more information, see [Create a custom RStudio image](rstudio-byoi-create.md).

1. (Optional) Add tags for the image.

1. (Optional) Choose **Add new tag**, then add a configuration tag.

1. For **Image type**, select **RStudio Image**.

1. Choose **Submit**.

Wait for the image version to be attached to the domain. After the version is attached, it appears in the **Custom images** list and is briefly highlighted.

## Attach an existing image version to your domain using the AWS CLI
<a name="rstudio-byoi-attach-cli"></a>

Two methods are presented to attach the image version to your domain using the AWS CLI. In the first method, you create a new domain with the version attached. This method is simpler, but you must specify the Amazon Virtual Private Cloud (Amazon VPC) information and execution role that's required to create the domain.

If you have already onboarded to the domain, you can use the second method to attach the image version to your current domain. In this case, you don't need to specify the Amazon VPC information and execution role. After you attach the version, delete all of the applications in your domain and relaunch RStudio.

### Attach the SageMaker image to a new domain
<a name="rstudio-byoi-cli-attach-new-domain"></a>

To use this method, you must specify an execution role that has the [AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home?#/policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess) policy attached.

Use the following steps to create the domain and attach the custom SageMaker AI image:
+ Get your default VPC ID and subnet IDs.
+ Create the configuration file for the domain, which specifies the image.
+ Create the domain with the configuration file.

**To add the custom SageMaker image to your domain**

1. Get your default VPC ID.

   ```
   aws ec2 describe-vpcs \
       --filters Name=isDefault,Values=true \
       --query "Vpcs[0].VpcId" --output text
   ```

   Response:

   ```
   vpc-xxxxxxxx
   ```

1. Get your default subnet IDs using the VPC ID from the previous step.

   ```
   aws ec2 describe-subnets \
       --filters Name=vpc-id,Values=<vpc-id> \
       --query "Subnets[*].SubnetId" --output json
   ```

   Response:

   ```
   [
       "subnet-b55171dd",
       "subnet-8a5f99c6",
       "subnet-e88d1392"
   ]
   ```

1. Create a configuration file named `create-domain-input.json`. Insert the VPC ID, subnet IDs, `ImageName`, and `AppImageConfigName` from the previous steps. Because `ImageVersionNumber` isn't specified, the latest version of the image is used, which is the only version in this case. Your execution role must satisfy the requirements in [Complete prerequisites](rstudio-byoi-prerequisites.md).

   ```
   {
     "DomainName": "domain-with-custom-r-image",
     "VpcId": "<vpc-id>",
     "SubnetIds": [
       "<subnet-ids>"
     ],
     "DomainSettings": {
       "RStudioServerProDomainSettings": {
         "DomainExecutionRoleArn": "<execution-role>"
       }
     },
     "DefaultUserSettings": {
       "ExecutionRole": "<execution-role>",
       "RSessionAppSettings": {
         "CustomImages": [
           {
            "AppImageConfigName": "rstudio-custom-config",
            "ImageName": "rstudio-custom-image"
           }
         ]
        }
     },
     "AuthMode": "IAM"
   }
   ```

1. Create the domain with the attached custom SageMaker image.

   ```
   aws sagemaker create-domain \
       --cli-input-json file://create-domain-input.json
   ```

   Response:

   ```
   {
       "DomainArn": "arn:aws:sagemaker:region:acct-id:domain/domain-id",
       "Url": "https://domain-id.studio.region.sagemaker.aws/..."
   }
   ```

### Attach the SageMaker image to an existing domain
<a name="rstudio-byoi-cli-attach-current-domain"></a>

This method assumes that you've already onboarded to a domain. For more information, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md).

**Note**  
You must delete all of the applications in your domain to update the domain with the new image version. For information about deleting these applications, see [Delete an Amazon SageMaker AI domain](gs-studio-delete-domain.md).
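Deleting every application by hand is tedious for a busy domain. The following sketch lists the non-deleted apps in a domain and deletes each one. It assumes a configured AWS CLI; `delete_domain_apps` is an illustrative name, and any apps that run in shared spaces would need `--space-name` instead of `--user-profile-name`.

```
# Delete every app in a domain that isn't already deleted.
# Usage: delete_domain_apps <d-xxxxxxxxxxxx>
delete_domain_apps() {
  aws sagemaker list-apps --domain-id-equals "$1" \
      --query "Apps[?Status!='Deleted'].[UserProfileName,AppType,AppName]" \
      --output text |
  while read -r profile app_type app_name; do
    [ -n "$app_name" ] || continue
    echo "Deleting $app_type app $app_name for $profile"
    aws sagemaker delete-app --domain-id "$1" \
        --user-profile-name "$profile" \
        --app-type "$app_type" --app-name "$app_name"
  done
}
```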

Use the following steps to add the SageMaker image to your current domain.
+ Get your `DomainID` from the SageMaker AI console.
+ Use the `DomainID` to get the `DefaultUserSettings` for the domain.
+ Add the `ImageName` and `AppImageConfig` as a `CustomImage` to the `DefaultUserSettings`.
+ Update your domain to include the custom image.

**To add the custom SageMaker image to your domain**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **domains**. 

1. Select the desired domain.

1. Choose **domain settings**.

1. Under **General Settings**, find the **domain ID**. The ID is in the following format: `d-xxxxxxxxxxxx`.

1. Use the domain ID to get the description of the domain.

   ```
   aws sagemaker describe-domain \
       --domain-id <d-xxxxxxxxxxxx>
   ```

   Response:

   ```
   {
       "DomainId": "d-xxxxxxxxxxxx",
       "DefaultUserSettings": {
         "KernelGatewayAppSettings": {
           "CustomImages": [
           ],
           ...
         }
       }
   }
   ```

1. Save the `DefaultUserSettings` section of the response to a file named `update-domain-input.json`.

1. Insert the `ImageName` and `AppImageConfigName` from the previous steps as a custom image. Because `ImageVersionNumber` isn't specified, the latest version of the image is used, which is the only version in this case.

   ```
   {
       "DefaultUserSettings": {
           "RSessionAppSettings": { 
              "CustomImages": [ 
                 { 
                    "ImageName": "rstudio-custom-image",
                    "AppImageConfigName": "rstudio-custom-config"
                 }
              ]
           }
       }
   }
   ```

1. Use the domain ID and default user settings file to update your domain.

   ```
   aws sagemaker update-domain \
       --domain-id <d-xxxxxxxxxxxx> \
       --cli-input-json file://update-domain-input.json
   ```

   Response:

   ```
   {
       "DomainArn": "arn:aws:sagemaker:region:acct-id:domain/domain-id"
   }
   ```

1. Delete the `RStudioServerPro` application. You must restart the `RStudioServerPro` domain-shared application for the RStudio Launcher UI to pick up the latest changes.

   ```
   aws sagemaker delete-app \
       --domain-id <d-xxxxxxxxxxxx> --user-profile-name domain-shared \
       --app-type RStudioServerPro --app-name default
   ```

1. Create a new `RStudioServerPro` application. You must create this application using the AWS CLI.

   ```
   aws sagemaker create-app \
       --domain-id <d-xxxxxxxxxxxx> --user-profile-name domain-shared \
       --app-type RStudioServerPro --app-name default
   ```
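After recreating the app, its status can be confirmed the same way as in the earlier verification step. A sketch with illustrative placeholders:

```
aws sagemaker describe-app \
    --domain-id <d-xxxxxxxxxxxx> --user-profile-name domain-shared \
    --app-type RStudioServerPro --app-name default \
    --query "Status" --output text
```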

# Launch a custom SageMaker image in RStudio
<a name="rstudio-byoi-launch"></a>

You can use your custom image when launching an RStudio application from the console. After you create your custom SageMaker image and attach it to your domain, the image appears in the image selector dialog box of the RStudio Launcher. To launch a new RStudio app, follow the steps in [Launch RSessions from the RStudio Launcher](rstudio-launcher.md) and select your custom image as shown in the following image.

![\[Screenshot of the RStudio launcher with image dropdown.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/rstudio-launcher-custom.png)


# Clean up image resources
<a name="rstudio-byoi-sdk-cleanup"></a>

This guide shows how to clean up the RStudio image resources that you created in the previous sections. To delete an image, complete the following steps using either the SageMaker AI console or the AWS CLI.
+ Detach the image and image versions from your Amazon SageMaker AI domain.
+ Delete the image, image version, and app image config.

After you've completed these steps, you can delete the container image and repository from Amazon ECR. For more information about how to delete the container image and repository, see [Deleting a repository](https://docs.aws.amazon.com/AmazonECR/latest/userguide/repository-delete.html).

## Clean up resources from the SageMaker AI console
<a name="rstudio-byoi-sdk-cleanup-console"></a>

When you detach an image from a domain, all versions of the image are detached. When an image is detached, all users of the domain lose access to the image versions.

**To detach an image**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**. 

1. Select the desired domain.

1. Choose **Environment**.

1. Under **Custom images attached to domain**, choose the image and then choose **Detach**.

1. (Optional) To delete the image and all versions from SageMaker AI, select **Also delete the selected images ...**. This does not delete the associated images from Amazon ECR.

1. Choose **Detach**.

## Clean up resources from the AWS CLI
<a name="rstudio-byoi-sdk-cleanup-cli"></a>

**To clean up resources**

1. Detach the image and image versions from your domain by passing an empty custom image list to the domain. Open the `update-domain-input.json` file that you created in [Attach the SageMaker image to your current domain](studio-byoi-attach.md#studio-byoi-sdk-attach-current-domain).

1. Delete the `RSessionAppSettings` custom images and then save the file. Do not modify the `KernelGatewayAppSettings` custom images.

   ```
   {
       "DomainId": "d-xxxxxxxxxxxx",
       "DefaultUserSettings": {
         "KernelGatewayAppSettings": {
            "CustomImages": [
            ],
            ...
         },
         "RSessionAppSettings": { 
           "CustomImages": [ 
           ],
           "DefaultResourceSpec": { 
           }
           ...
         }
       }
   }
   ```

1. Use the domain ID and default user settings file to update your domain.

   ```
   aws sagemaker update-domain \
       --domain-id <d-xxxxxxxxxxxx> \
       --cli-input-json file://update-domain-input.json
   ```

   Response:

   ```
   {
       "DomainArn": "arn:aws:sagemaker:us-east-2:acct-id:domain/d-xxxxxxxxxxxx"
   }
   ```

1. Delete the app image config.

   ```
   aws sagemaker delete-app-image-config \
       --app-image-config-name rstudio-image-config
   ```

1. Delete the SageMaker image, which also deletes all image versions. The container images in Amazon ECR that are represented by the image versions are not deleted.

   ```
   aws sagemaker delete-image \
       --image-name rstudio-image
   ```
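Before you run `update-domain`, it can help to sanity-check the edited file. The following is a minimal sketch: it recreates an `update-domain-input.json` with placeholder values, then verifies that the file parses as JSON and that both custom image lists are empty. It assumes `python3` is available on your path.

```
# Write a minimal update-domain-input.json (placeholder domain ID)
cat > update-domain-input.json <<'EOF'
{
    "DomainId": "d-xxxxxxxxxxxx",
    "DefaultUserSettings": {
        "KernelGatewayAppSettings": { "CustomImages": [] },
        "RSessionAppSettings": { "CustomImages": [], "DefaultResourceSpec": {} }
    }
}
EOF

# Confirm the file is valid JSON and that both CustomImages lists are empty
python3 - <<'PY'
import json

with open("update-domain-input.json") as f:
    settings = json.load(f)["DefaultUserSettings"]

assert settings["RSessionAppSettings"]["CustomImages"] == []
assert settings["KernelGatewayAppSettings"]["CustomImages"] == []
print("update-domain-input.json OK")
PY
```

If either assertion fails, fix the file before passing it to `aws sagemaker update-domain`.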

# Create a user to use RStudio
<a name="rstudio-create-user"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

After your RStudio-enabled Amazon SageMaker AI domain is running, you can add user profiles (UserProfiles) to the domain. The following topics show how to create user profiles that are authorized to use RStudio and how to update an existing user profile. For information about how to delete an RStudio app, UserProfile, or domain, follow the steps in [Delete an Amazon SageMaker AI domain](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-studio-delete-domain.html). 

**Note**  
The limit for the total number of UserProfiles in an Amazon SageMaker AI domain is 60.

 There are two types of users: 
+ Unauthorized: This user cannot access the RStudio app.
+ Authorized: This user can access the RStudio app and use one of the RStudio license seats. By default, a new user is `Authorized` if the domain is enabled for RStudio.

You can only change a user's authorization status from `Unauthorized` to `Authorized`, not the reverse. If a user is authorized, they can be given one of the following levels of access to RStudio. 
+  RStudio User: This is a standard RStudio user and can access RStudio. 
+  RStudio Admin: The admin of your Amazon SageMaker AI domain can create users, add existing users, and update the permissions of existing users. Admins can also access the RStudio administrative dashboard. However, admins can't update parameters that are managed by Amazon SageMaker AI.

## Methods to create a user
<a name="rstudio-create-user-methods"></a>

The following topics show how to create a user in your RStudio-enabled Amazon SageMaker AI domain.

**Create a user from the console** 

To create a user in your RStudio-enabled Amazon SageMaker AI domain from the console, complete the steps in [Add user profiles](domain-user-profile-add.md).

**Create a user from the AWS CLI** 

The following command shows how to add users to an Amazon SageMaker AI domain that uses IAM authentication. A user can belong to either the `R_STUDIO_USER` or `R_STUDIO_ADMIN` user group. 

```
aws sagemaker create-user-profile --region <REGION> \
    --domain-id <DOMAIN-ID> \
    --user-profile-name <USER_PROFILE_NAME-ID> \
    --user-settings RStudioServerProAppSettings={UserGroup=<USER-GROUP>}
```

The following command shows how to add users to an Amazon SageMaker AI domain that uses IAM Identity Center authentication. A user can belong to either the `R_STUDIO_USER` or `R_STUDIO_ADMIN` user group. 

```
aws sagemaker create-user-profile --region <REGION> \
    --domain-id <DOMAIN-ID> \
    --user-profile-name <USER_PROFILE_NAME-ID> \
    --user-settings RStudioServerProAppSettings={UserGroup=<USER-GROUP>} \
    --single-sign-on-user-identifier UserName \
    --single-sign-on-user-value <USER-NAME>
```

# Log in to RStudio as another user
<a name="rstudio-login-another"></a>

The following topic demonstrates how to log in to RStudio on Amazon SageMaker AI as another user.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**. 

1. Select the domain containing the user profile.

1.  Select a user name from the list of users. This opens a new page with details about the user profile and the apps that are running. 

1.  Select **Launch**. 

1.  From the dropdown, select **RStudio** to launch an RStudio instance. 

# Terminate sessions for another user
<a name="rstudio-terminate-another"></a>

The following topic demonstrates how to terminate sessions for another user in RStudio on Amazon SageMaker AI.

1.  From the list of running apps, identify the app you want to delete. 

1.  Choose **Delete app** for the app that you want to delete. 

# Use the RStudio administrative dashboard
<a name="rstudio-admin"></a>

This topic shows how to access and use the RStudio administrative dashboard. With it, admins can manage users and RSessions, and view information about RStudio Server instance utilization and Amazon CloudWatch Logs.

## Launch the RStudio administrative dashboard
<a name="rstudio-admin-launch"></a>

The `R_STUDIO_ADMIN` authorization allows a user to access the RStudio administrative dashboard. An `R_STUDIO_ADMIN` user can open the dashboard by manually replacing `workspaces` with `admin` in their RStudio URL. The following shows how to modify the URL to access the RStudio administrative dashboard.

For example, the following RStudio URL: 

```
https://<DOMAIN-ID>.studio.us-east-2.sagemaker.aws/rstudio/default/s/<SESSION-ID>/workspaces
```

Can be converted to: 

```
https://<DOMAIN-ID>.studio.us-east-2.sagemaker.aws/rstudio/default/s/<SESSION-ID>/admin
```
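The same edit can be scripted with shell parameter expansion, as the following sketch shows. The URL is a placeholder; substitute the RStudio URL from your browser.

```
# RStudio session URL copied from the browser (placeholder values)
RSTUDIO_URL="https://d-xxxxxxxxxxxx.studio.us-east-2.sagemaker.aws/rstudio/default/s/abc123/workspaces"

# Strip the trailing "/workspaces" segment and append "/admin"
ADMIN_URL="${RSTUDIO_URL%/workspaces}/admin"

echo "$ADMIN_URL"
```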

## Dashboard tab
<a name="rstudio-admin-dashboard"></a>

This tab gives an overview of your RStudio Server instance utilization, as well as information on the number of active RSessions.

## Sessions tab
<a name="rstudio-admin-sessions"></a>

This tab gives information about the active RSessions, such as the user who launched each RSession, how long it has been running, and its resource utilization.

## Users tab
<a name="rstudio-admin-users"></a>

This tab gives information about the RStudio authorized users in the domain, such as when each user last launched an RSession and their resource utilization.

## Stats tab
<a name="rstudio-admin-stats"></a>

This tab gives information on the utilization of your RStudio Server instance.

## Logs tab
<a name="rstudio-admin-logs"></a>

This tab displays Amazon CloudWatch Logs for the RStudio Server instance. For more information about logging events with Amazon CloudWatch Logs, see [What is Amazon CloudWatch Logs?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html).

# Shut down RStudio
<a name="rstudio-shutdown"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

To shut down and restart your Posit Workbench and the associated RStudioServerPro app, you must first shut down all of your existing RSessions. You can shut down the RSessionGateway apps from within RStudio. You can then shut down the RStudioServerPro app using the AWS CLI. After the RStudioServerPro app is shut down, you must reopen RStudio through the SageMaker AI console.

Any unsaved notebook information is lost in the process. The user data in the Amazon EFS volume isn't impacted.

**Note**  
If you are using a custom image with RStudio, ensure that your Docker image uses an RStudio version that is compatible with the version of Posit Workbench used by SageMaker AI after you restart your RStudioServerPro app.

The following topics show how to shut down the RSessionGateway and RStudioServerPro apps and restart them.

## Suspend your RSessions
<a name="rstudio-suspend"></a>

Complete the following procedure to suspend all of your RSessions.

1. From the RStudio Launcher, identify the RSession that you want to suspend. 

1. Select **Suspend** for the session. 

1. Repeat this for all RSessions.

## Delete your RSessions
<a name="rstudio-delete"></a>

Complete the following procedure to shut down all of your RSessions.

1. From the RStudio Launcher, identify the RSession that you want to delete. 

1. Select **Quit** for the session. This opens a new **Quit Session** window. 

1. From the **Quit Session** window, select **Force Quit** to end all child processes in the session.

1. Select **Quit Session** to confirm deletion of the session.

1. Repeat this for all RSessions.

## Delete your RStudioServerPro app
<a name="rstudio-delete-restart"></a>

Run the following commands from the AWS CLI to delete and restart your RStudioServerPro app.

1. Delete the RStudioServerPro application by using your current domain ID. 

   ```
   aws sagemaker delete-app \
       --domain-id <domainId> \
       --user-profile-name domain-shared \
       --app-type RStudioServerPro \
       --app-name default
   ```

1. Re-create the RStudioServerPro application. 

   ```
   aws sagemaker create-app \
       --domain-id <domainId> \
       --user-profile-name domain-shared \
       --app-type RStudioServerPro \
       --app-name default
   ```

# Billing and cost
<a name="rstudio-billing"></a>

 To track the costs associated with your RStudio environment, you can use the AWS Billing and Cost Management service. AWS Billing and Cost Management provides useful tools to help you gather information related to your cost and usage, analyze your cost drivers and usage trends, and take action to budget your spending. For more information, see [What is AWS Billing and Cost Management?](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/billing-what-is.html). The following describes components required to run RStudio on Amazon SageMaker AI and how each component factors into billing for your RStudio instance. 
+  RStudio License – You must purchase an RStudio license. There is no additional charge for using your RStudio license with Amazon SageMaker AI. For more information about your RStudio license, see [Get an RStudio license](rstudio-license.md).
+  RSession – These are RStudio working sessions launched by end users. You are charged while an RSession is running.
+  RStudio Server – A multi-tenant server that manages all RSessions. You choose the instance type that RStudio Server runs on and pay the related costs. The default instance type, `system`, is free, but you can choose to pay for larger instance types. For more information about the available instance types for your RStudio Server, see [RStudioServerPro instance type](rstudio-select-instance.md). 

 **Tracking billing at user level** 

 To track billing at the user level using Cost Allocation Tags, see [Using Cost Allocation Tags](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/cost-alloc-tags.html).

# Diagnose issues and get support
<a name="rstudio-troubleshooting"></a>

 The following sections describe how to diagnose issues with RStudio on Amazon SageMaker AI. To get support for RStudio on Amazon SageMaker AI, contact Amazon SageMaker AI support. For help with purchasing an RStudio license or modifying the number of license seats, contact [sales@rstudio.com](mailto:sales@rstudio.com).

## Upgrade your version
<a name="rstudio-troubleshooting-upgrade"></a>

If you receive a warning that there is a version mismatch between your RSession and RStudioServerPro apps, then you must upgrade the version of your RStudioServerPro app. For more information, see [RStudio Versioning](rstudio-version.md).

## View Metrics and Logs
<a name="rstudio-troubleshooting-view"></a>

You can monitor your workflow performance while using RStudio on Amazon SageMaker AI. You can view logs and metrics with the RStudio administrative dashboard or Amazon CloudWatch. 

### View your RStudio logs from the RStudio administrative dashboard
<a name="rstudio-troubleshooting-logs"></a>

 You can view metrics and logs directly from the RStudio administrative dashboard. 

1.  Log in to your **Amazon SageMaker AI domain**. 

1.  Navigate to the RStudio administrative dashboard following the steps in [Use the RStudio administrative dashboard](rstudio-admin.md). 

1.  Select the **Logs** tab. 

### View your RStudio logs from Amazon CloudWatch Logs
<a name="rstudio-troubleshooting-logs-cw"></a>

 Amazon CloudWatch monitors your AWS resources and the applications that you run on AWS in real time. You can use Amazon CloudWatch to collect and track metrics, which are variables that you can measure for your resources and applications. To ensure that your RStudio apps have permissions for Amazon CloudWatch, you must include the permissions described in [Amazon SageMaker AI domain overview](gs-studio-onboard.md). You don’t need to do any setup to gather Amazon CloudWatch Logs. 

 The following steps show how to view Amazon CloudWatch Logs for your RSession. 

These logs can be found in the `/aws/sagemaker/studio` log group in the Amazon CloudWatch console.

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. Select **Logs** from the left navigation pane. From the dropdown menu, select **Log groups**.

1. On the **Log groups** screen, search for `aws/sagemaker/studio`. Select the log group.

1. On the `aws/sagemaker/studio` log group screen, navigate to the **Log streams** tab.

1. To find the logs for your domain, search `Log streams` using the following format:

   ```
   <DomainId>/domain-shared/rstudioserverpro/default
   ```
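Putting the pieces together, the stream name can be built directly from your domain ID. The following sketch uses a placeholder domain ID; the commented `aws logs` command assumes you have CloudWatch Logs read permissions.

```
DOMAIN_ID="d-xxxxxxxxxxxx"   # placeholder domain ID

# Build the log stream name for the domain-shared RStudio Server app
LOG_STREAM="${DOMAIN_ID}/domain-shared/rstudioserverpro/default"
echo "$LOG_STREAM"

# With the stream name, you can also fetch events from the CLI, for example:
# aws logs get-log-events \
#     --log-group-name /aws/sagemaker/studio \
#     --log-stream-name "$LOG_STREAM"
```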

# RStudio on Amazon SageMaker AI user guide
<a name="rstudio-use"></a>

With RStudio support in Amazon SageMaker AI, you can put your production workflows in place and take advantage of SageMaker AI features. The following topics show how to launch an RStudio session and complete key workflows. For information about managing RStudio on SageMaker AI, see [RStudio on Amazon SageMaker AI management](rstudio-manage.md). 

For information about the onboarding steps to create an Amazon SageMaker AI domain with RStudio enabled, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md).  

For information about the AWS Regions that RStudio on SageMaker AI is supported in, see [Supported Regions and Quotas](regions-quotas.md).  

**Topics**
+ [Collaborate in RStudio](#rstudio-collaborate)
+ [Base R image](#rstudio-base-image)
+ [RSession application colocation](#rstudio-colocation)
+ [Launch RSessions from the RStudio Launcher](rstudio-launcher.md)
+ [Suspend your RSessions](rstudio-launcher-suspend.md)
+ [Delete your RSessions](rstudio-launcher-delete.md)
+ [RStudio Connect](rstudio-connect.md)
+ [Amazon SageMaker AI feature integration with RStudio on Amazon SageMaker AI](rstudio-sm-features.md)

## Collaborate in RStudio
<a name="rstudio-collaborate"></a>

To share your RStudio project, you can connect RStudio to your Git repository. For information about setting this up, see [Version Control with Git and SVN](https://support.rstudio.com/hc/en-us/articles/200532077-Version-Control-with-Git-and-SVN). 

Note: Project sharing and real-time collaboration are not currently supported when using RStudio on Amazon SageMaker AI.  

## Base R image
<a name="rstudio-base-image"></a>

 When launching your RStudio instance, the Base R image serves as the basis of your instance. This image extends the [r-session-complete](https://hub.docker.com/r/rstudio/r-session-complete) Docker image.  

 This Base R image includes the following: 
+  R v4.0 or higher
+  `awscli`, `sagemaker`, and `boto3` Python packages 
+  [Reticulate](https://rstudio.github.io/reticulate/) package for R SDK integration 

## RSession application colocation
<a name="rstudio-colocation"></a>

Users can create multiple RSession applications on the same instance. Each instance type supports up to four colocated RSession applications. This applies to each user independently. For example, if two users create applications, SageMaker AI allocates a different underlying instance to each user, and each of those instances supports up to four RSession applications.

You pay only for the instance type used, regardless of how many RSession applications are running on the instance. If a user creates an RSession with a different associated instance type, a new underlying instance is created.

# Launch RSessions from the RStudio Launcher
<a name="rstudio-launcher"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

 The following sections show how to use the RStudio Launcher to launch RSessions. They also include information about how to open the RStudio Launcher when using RStudio on Amazon SageMaker AI.

## Open RStudio Launcher
<a name="rstudio-launcher-open"></a>

Open the RStudio Launcher using the set of procedures that matches your environment.

### Open RStudio Launcher from the Amazon SageMaker AI Console
<a name="rstudio-launcher-console"></a>

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1.  From the left navigation, select **RStudio**.

1.  Under **Get Started**, select the domain and user profile to launch.

1.  Choose **Launch RStudio**.

### Open RStudio Launcher from Amazon SageMaker Studio
<a name="rstudio-launcher-studio"></a>

1. Navigate to Studio following the steps in [Launch Amazon SageMaker Studio](studio-updated-launch.md).

1. Under **Applications**, select **RStudio**.

1. From the RStudio landing page, choose **Launch application**.

### Open RStudio Launcher from the AWS CLI
<a name="rstudio-launcher-cli"></a>

The procedure to open the RStudio Launcher using the AWS CLI differs depending on the method used to manage your users. 

 **IAM Identity Center** 

1.  Use the AWS access portal to open your Amazon SageMaker AI domain. 

1.  Modify the URL path to `/rstudio/default` as follows. 

   ```
   #Studio URL
   https://<domain-id>.studio.<region>.sagemaker.aws/jupyter/default/lab
   
   #modified URL
   https://<domain-id>.studio.<region>.sagemaker.aws/rstudio/default
   ```
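If you have the Studio URL in a shell variable, the edit above can be scripted with parameter expansion. The domain ID and Region below are placeholders.

```
# Studio URL as opened from the AWS access portal (placeholder values)
STUDIO_URL="https://d-xxxxxxxxxxxx.studio.us-east-2.sagemaker.aws/jupyter/default/lab"

# Replace the Jupyter application path with the RStudio Launcher path
RSTUDIO_URL="${STUDIO_URL%/jupyter/default/lab}/rstudio/default"
echo "$RSTUDIO_URL"
```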

 **IAM** 

 To open the RStudio Launcher from the AWS CLI in IAM mode, complete the following procedure. 

1.  Create a presigned URL using the following command. 

   ```
   aws sagemaker create-presigned-domain-url --region <REGION> \
       --domain-id <DOMAIN-ID> \
       --user-profile-name <USER-PROFILE-NAME>
   ```

1.  Append `&redirect=RStudioServerPro` to the generated URL. 

1.  Navigate to the updated URL. 
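The append step can be scripted as follows. This is a sketch: the presigned URL below is a placeholder, and in practice you would capture the real value from the `create-presigned-domain-url` response (for example, with `--query AuthorizedUrl --output text`).

```
# Placeholder for the AuthorizedUrl value returned by create-presigned-domain-url.
# In practice, capture it with something like:
#   PRESIGNED_URL=$(aws sagemaker create-presigned-domain-url --region <REGION> \
#       --domain-id <DOMAIN-ID> --user-profile-name <USER-PROFILE-NAME> \
#       --query AuthorizedUrl --output text)
PRESIGNED_URL="https://d-xxxxxxxxxxxx.studio.us-east-2.sagemaker.aws/auth?token=EXAMPLE"

# Append the redirect parameter so the URL opens RStudio
RSTUDIO_URL="${PRESIGNED_URL}&redirect=RStudioServerPro"
echo "$RSTUDIO_URL"
```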

## Launch RSessions
<a name="rstudio-launcher-launch"></a>

 After you’ve launched the RStudio Launcher, you can create a new RSession. 

1.  Select **New Session**. 

1.  Enter a **Session Name**. 

1.  Select an instance type that your RSession runs on. This defaults to `ml.t3.medium`.

1.  Select an image that your RSession uses as the kernel. 

1.  Select **Start Session**. 

1.  After your session has been created, you can start it by selecting the name.  
**Note**  
If you receive a warning that there is a version mismatch between your RSession and RStudioServerPro apps, then you must upgrade the version of your RStudioServerPro app. For more information, see [RStudio Versioning](rstudio-version.md).

# Suspend your RSessions
<a name="rstudio-launcher-suspend"></a>

The following procedure demonstrates how to suspend an RSession from the RStudio Launcher when using RStudio on Amazon SageMaker AI. For information about accessing the RStudio Launcher, see [Launch RSessions from the RStudio Launcher](rstudio-launcher.md).

1. From the RStudio Launcher, identify the RSession that you want to suspend. 

1. Select **Suspend** for the session. 

# Delete your RSessions
<a name="rstudio-launcher-delete"></a>

The following procedure demonstrates how to delete an RSession from the RStudio Launcher when using RStudio on Amazon SageMaker AI. For information about accessing the RStudio Launcher, see [Launch RSessions from the RStudio Launcher](rstudio-launcher.md).

1. From the RStudio Launcher, identify the RSession that you want to delete. 

1. Select **Quit** for the session. This opens a new **Quit Session** window. 

1. From the **Quit Session** window, select **Force Quit** to end all child processes in the session.

1. Select **Quit Session** to confirm deletion of the session.

# RStudio Connect
<a name="rstudio-connect"></a>

RStudio Connect enables data scientists to publish insights, dashboards, and web applications from RStudio on Amazon SageMaker AI. For more information, see [Host RStudio Connect and Package Manager for ML development in RStudio on Amazon SageMaker AI](https://aws.amazon.com/blogs/machine-learning/host-rstudio-connect-and-package-manager-for-ml-development-in-rstudio-on-amazon-sagemaker/).

 For more information on RStudio Connect, see the [RStudio Connect User Guide](https://docs.rstudio.com/connect/user/). 

# Amazon SageMaker AI feature integration with RStudio on Amazon SageMaker AI
<a name="rstudio-sm-features"></a>

One of the benefits of using RStudio on Amazon SageMaker AI is the integration of Amazon SageMaker AI features. This includes integration with Amazon SageMaker Studio Classic and the reticulate package. The following gives information about these integrations and examples of using them.

 **Use Amazon SageMaker Studio Classic and RStudio on Amazon SageMaker AI** 

 Your Amazon SageMaker Studio Classic and RStudio instances share the same Amazon EFS file system. This means that files that you import and create using Studio Classic can be accessed using RStudio and vice versa. This allows you to work on the same files using both Studio Classic and RStudio without having to move your files between the two. For more information on this workflow, see the [Announcing Fully Managed RStudio on Amazon SageMaker AI for Data Scientists](https://aws.amazon.com/blogs/aws/announcing-fully-managed-rstudio-on-amazon-sagemaker-for-data-scientists) blog.

 **Use Amazon SageMaker SDK with reticulate** 

The [reticulate](https://rstudio.github.io/reticulate) package is used as an R interface to the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/) to make API calls to Amazon SageMaker AI. The reticulate package translates between R and Python objects, and Amazon SageMaker AI provides a serverless data science environment to train and deploy machine learning (ML) models at scale. For general information about the reticulate package, see [R Interface to Python](https://rstudio.github.io/reticulate/).

For a blog that outlines how to use the reticulate package with Amazon SageMaker AI, see [Using R with Amazon SageMaker AI](https://aws.amazon.com/blogs/machine-learning/using-r-with-amazon-sagemaker/).

The following examples show how to use reticulate for specific use cases.
+ For a notebook that describes how to use reticulate to do batch transform to make predictions, see [Batch Transform Using R with Amazon SageMaker AI](https://sagemaker-examples.readthedocs.io/en/latest/r_examples/r_batch_transform/r_xgboost_batch_transform.html).
+ For a notebook that describes how to use reticulate to conduct hyperparameter tuning and generate predictions, see [Hyperparameter Optimization Using R with Amazon SageMaker AI](https://sagemaker-examples.readthedocs.io/en/latest/r_examples/r_xgboost_hpo_batch_transform/r_xgboost_hpo_batch_transform.html).

# Code Editor in Amazon SageMaker Studio
<a name="code-editor"></a>

Code Editor, based on [Code-OSS, Visual Studio Code - Open Source](https://github.com/microsoft/vscode#visual-studio-code---open-source-code---oss), helps you write, test, debug, and run your analytics and machine learning code. Code Editor extends and is fully integrated with Amazon SageMaker Studio. It also supports integrated development environment (IDE) extensions available in the [Open VSX Registry](https://open-vsx.org/). The following page gives information about Code Editor and key details for using it.

Code Editor has the [AWS Toolkit for VS Code](https://docs.aws.amazon.com/toolkit-for-vscode/latest/userguide/welcome.html) extension pre-installed, which enables connections to AWS services such as [Amazon CodeWhisperer](https://docs.aws.amazon.com/toolkit-for-vscode/latest/userguide/codewhisperer.html), a general purpose, machine learning-powered code generator that provides code recommendations in real time. For more information about extensions, see [Code Editor Connections and Extensions](code-editor-use-connections-and-extensions.md).

**Important**  
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic. The following section is specific to using the updated Studio experience. For information about using the Studio Classic application, see [Amazon SageMaker Studio Classic](studio.md).

To launch Code Editor, create a Code Editor private space. The Code Editor space uses a single Amazon Elastic Compute Cloud (Amazon EC2) instance for your compute and a single Amazon Elastic Block Store (Amazon EBS) volume for your storage. Everything in your space, such as your code, Git profile, and environment variables, is stored on the same Amazon EBS volume. The volume has 3,000 IOPS and a throughput of 125 MBps. Your administrator has configured the default Amazon EBS storage settings for your space.

The default storage size is 5 GB, but your administrator can increase the amount of space you get. For more information, see [Change the default storage size](code-editor-admin-storage-size.md).

The working directory of your users within the storage volume is `/home/sagemaker-user`. If you specify your own AWS KMS key to encrypt the volume, everything in the working directory is encrypted using your customer managed key. If you don't specify an AWS KMS key, the data inside `/home/sagemaker-user` is encrypted with an AWS managed key. Regardless of whether you specify an AWS KMS key, all of the data outside of the working directory is encrypted with an AWS managed key.

You can scale your compute up or down by changing the Amazon EC2 instance type that runs your Code Editor application. Before you change the associated instance type, you must first stop your Code Editor space. For more information, see [Code Editor application instances and images](code-editor-use-instances.md).

Your administrator might provide you with a lifecycle configuration to customize your environment. You can specify the lifecycle configuration when you create the space. For more information, see [Code Editor lifecycle configurations](code-editor-use-lifecycle-configurations.md).

You can also bring your own file storage system if you have an Amazon EFS volume.

![\[The welcome page of the Code Editor application UI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/code-editor/code-editor-home.png)


**Topics**
+ [Using the Code Editor](code-editor-use.md)
+ [Code Editor administrator guide](code-editor-admin.md)

# Using the Code Editor
<a name="code-editor-use"></a>

The topics in this section provide guides for using Code Editor, including how to launch the application, add connections to AWS services, and shut down resources. After creating a Code Editor space, you can access your Code Editor session directly through the browser.

Within your Code Editor environment, you can do the following: 
+ Access all artifacts persisted in your home directory
+ Clone your GitHub repositories and commit changes
+ Access the SageMaker Python SDK

You can return to Studio to review any assets created in your Code Editor environment such as experiments, pipelines, or training jobs. 

**Topics**
+ [Check the version of Code Editor](code-editor-use-version.md)
+ [Code Editor application instances and images](code-editor-use-instances.md)
+ [Launch a Code Editor application in Studio](code-editor-use-studio.md)
+ [Launch a Code Editor application using the AWS CLI](code-editor-launch-cli.md)
+ [Clone a repository in Code Editor](code-editor-use-clone-a-repository.md)
+ [Code Editor Connections and Extensions](code-editor-use-connections-and-extensions.md)
+ [Shut down Code Editor resources](code-editor-use-log-out.md)

# Check the version of Code Editor
<a name="code-editor-use-version"></a>

The following steps show how to check the version of your Code Editor application.

**To check the Code Editor application version**

1. Launch and run a Code Editor space and navigate to the Code Editor application UI. For more information, see [Launch a Code Editor application in Studio](code-editor-use-studio.md).

1. In the upper-left corner of the Code Editor UI, choose the menu button (![\[Hamburger menu icon with three horizontal lines.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/code-editor/code-editor-menu-icon.png)). Then choose **Help**, and then choose **About**.

# Code Editor application instances and images
<a name="code-editor-use-instances"></a>

Only some instances are compatible with Code Editor applications. You can choose the instance type that is compatible with your use case from the **Instance** dropdown menu. 

The **Fast launch** instances start up much faster than the other instances. For more information about fast launch instance types in Studio, see [Instance Types Available for Use With Amazon SageMaker Studio Classic Notebooks](notebooks-available-instance-types.md).

**Note**  
If you use a GPU instance type when configuring your Code Editor application, you must also use a GPU-based image. The Code Editor space UI automatically selects a compatible image when you select your instance type.

Within a space, your data is stored in an Amazon EBS volume that persists independently from the life of an instance. You won't lose your data when you change instances. If your Code Editor space is `Running`, you must stop your space before changing instance types.

The following table lists the ARNs of the available Code Editor CPU and GPU images for each Region.


|  Region  |  CPU  |  GPU  | 
| --- | --- | --- | 
|  us-east-1  | arn:aws:sagemaker:us-east-1:885854791233:image/sagemaker-distribution-cpu |  arn:aws:sagemaker:us-east-1:885854791233:image/sagemaker-distribution-gpu | 
|  us-east-2  | arn:aws:sagemaker:us-east-2:37914896644:image/sagemaker-distribution-cpu | arn:aws:sagemaker:us-east-2:37914896644:image/sagemaker-distribution-gpu | 
|  us-west-1  | arn:aws:sagemaker:us-west-1:053634841547:image/sagemaker-distribution-cpu | arn:aws:sagemaker:us-west-1:053634841547:image/sagemaker-distribution-gpu | 
|  us-west-2  | arn:aws:sagemaker:us-west-2:542918446943:image/sagemaker-distribution-cpu |  arn:aws:sagemaker:us-west-2:542918446943:image/sagemaker-distribution-gpu | 
|  af-south-1  | arn:aws:sagemaker:af-south-1:238384257742:image/sagemaker-distribution-cpu | arn:aws:sagemaker:af-south-1:238384257742:image/sagemaker-distribution-gpu | 
|  ap-east-1  | arn:aws:sagemaker:ap-east-1:523751269255:image/sagemaker-distribution-cpu | arn:aws:sagemaker:ap-east-1:523751269255:image/sagemaker-distribution-gpu | 
|  ap-south-1  | arn:aws:sagemaker:ap-south-1:245090515133:image/sagemaker-distribution-cpu | arn:aws:sagemaker:ap-south-1:245090515133:image/sagemaker-distribution-gpu | 
|  ap-northeast-2  | arn:aws:sagemaker:ap-northeast-2:064688005998:image/sagemaker-distribution-cpu | arn:aws:sagemaker:ap-northeast-2:064688005998:image/sagemaker-distribution-gpu | 
|  ap-southeast-1  | arn:aws:sagemaker:ap-southeast-1:022667117163:image/sagemaker-distribution-cpu | arn:aws:sagemaker:ap-southeast-1:022667117163:image/sagemaker-distribution-gpu | 
|  ap-southeast-2  | arn:aws:sagemaker:ap-southeast-2:648430277019:image/sagemaker-distribution-cpu | arn:aws:sagemaker:ap-southeast-2:648430277019:image/sagemaker-distribution-gpu | 
|  ap-northeast-1  | arn:aws:sagemaker:ap-northeast-1:010972774902:image/sagemaker-distribution-cpu | arn:aws:sagemaker:ap-northeast-1:010972774902:image/sagemaker-distribution-gpu | 
|  ca-central-1  | arn:aws:sagemaker:ca-central-1:481561238223:image/sagemaker-distribution-cpu | arn:aws:sagemaker:ca-central-1:481561238223:image/sagemaker-distribution-gpu | 
|  eu-central-1  | arn:aws:sagemaker:eu-central-1:545423591354:image/sagemaker-distribution-cpu | arn:aws:sagemaker:eu-central-1:545423591354:image/sagemaker-distribution-gpu | 
|  eu-west-1  | arn:aws:sagemaker:eu-west-1:819792524951:image/sagemaker-distribution-cpu | arn:aws:sagemaker:eu-west-1:819792524951:image/sagemaker-distribution-gpu | 
|  eu-west-2  | arn:aws:sagemaker:eu-west-2:021081402939:image/sagemaker-distribution-cpu | arn:aws:sagemaker:eu-west-2:021081402939:image/sagemaker-distribution-gpu | 
|  eu-west-3  | arn:aws:sagemaker:eu-west-3:856416204555:image/sagemaker-distribution-cpu | arn:aws:sagemaker:eu-west-3:856416204555:image/sagemaker-distribution-gpu | 
|  eu-north-1  | arn:aws:sagemaker:eu-north-1:175620155138:image/sagemaker-distribution-cpu | arn:aws:sagemaker:eu-north-1:175620155138:image/sagemaker-distribution-gpu | 
|  eu-south-1  | arn:aws:sagemaker:eu-south-1:810671768855:image/sagemaker-distribution-cpu | arn:aws:sagemaker:eu-south-1:810671768855:image/sagemaker-distribution-gpu | 
|  sa-east-1  | arn:aws:sagemaker:sa-east-1:567556641782:image/sagemaker-distribution-cpu | arn:aws:sagemaker:sa-east-1:567556641782:image/sagemaker-distribution-gpu | 
|  ap-northeast-3  | arn:aws:sagemaker:ap-northeast-3:564864627153:image/sagemaker-distribution-cpu | arn:aws:sagemaker:ap-northeast-3:564864627153:image/sagemaker-distribution-gpu | 
|  ap-southeast-3  | arn:aws:sagemaker:ap-southeast-3:370607712162:image/sagemaker-distribution-cpu | arn:aws:sagemaker:ap-southeast-3:370607712162:image/sagemaker-distribution-gpu | 
|  me-south-1  | arn:aws:sagemaker:me-south-1:523774347010:image/sagemaker-distribution-cpu | arn:aws:sagemaker:me-south-1:523774347010:image/sagemaker-distribution-gpu | 
|  me-central-1  | arn:aws:sagemaker:me-central-1:358593528301:image/sagemaker-distribution-cpu | arn:aws:sagemaker:me-central-1:358593528301:image/sagemaker-distribution-gpu | 
|  il-central-1  | arn:aws:sagemaker:il-central-1:080319125002:image/sagemaker-distribution-cpu | arn:aws:sagemaker:il-central-1:080319125002:image/sagemaker-distribution-gpu | 
|  cn-north-1  | arn:aws:sagemaker:cn-north-1:674439102856:image/sagemaker-distribution-cpu |  arn:aws:sagemaker:cn-north-1:674439102856:image/sagemaker-distribution-gpu  | 
|  cn-northwest-1  | arn:aws:sagemaker:cn-northwest-1:651871951035:image/sagemaker-distribution-cpu |  arn:aws:sagemaker:cn-northwest-1:651871951035:image/sagemaker-distribution-gpu  | 
|  us-gov-west-1  | arn:aws:sagemaker:us-gov-west-1:300992924816:image/sagemaker-distribution-cpu | arn:aws:sagemaker:us-gov-west-1:300992924816:image/sagemaker-distribution-gpu | 
|  us-gov-east-1  | arn:aws:sagemaker:us-gov-east-1:300993876623:image/sagemaker-distribution-cpu | arn:aws:sagemaker:us-gov-east-1:300993876623:image/sagemaker-distribution-gpu | 

If you encounter instance limits, contact your administrator. To get more storage and compute for a user, administrators can request an increase to a user's AWS quotas. For more information about requesting a quota increase, see [Amazon SageMaker AI endpoints and quotas](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html).
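As a sketch, administrators can look up and request the relevant quota with the Service Quotas AWS CLI. The instance type is illustrative and `L-XXXXXXXX` is a placeholder, not a real quota code:

```shell
# Sketch only: find the quota for a given instance type, then request an
# increase. Replace L-XXXXXXXX with the quota code from the first call.
aws service-quotas list-service-quotas \
    --service-code sagemaker \
    --query "Quotas[?contains(QuotaName, 'ml.m5.xlarge')].[QuotaName,QuotaCode]"

aws service-quotas request-service-quota-increase \
    --service-code sagemaker \
    --quota-code L-XXXXXXXX \
    --desired-value 2
```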

# Launch a Code Editor application in Studio
<a name="code-editor-use-studio"></a>

To configure and access your Code Editor integrated development environment through Studio, you must create a Code Editor space. For more information about spaces in Studio, see [Amazon SageMaker Studio spaces](studio-updated-spaces.md).

![\[The Code Editor application button and overview tile in the Studio UI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/code-editor/code-editor-studio-home.png)


The following procedure shows how to create and run a Code Editor space.

**To create and run a Code Editor space**

1. Launch the updated Studio experience. For more information, see [Launch Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-launch.html).

1. Do one of the following:
   + Within the updated Amazon SageMaker Studio UI, select **Code Editor** from the **Applications** menu.
   + Within the updated Amazon SageMaker Studio UI, choose **View Code Editor spaces** in the **Overview** section of the Studio homepage.

1. In the upper-right corner of the Code Editor landing page, choose **Create Code Editor space**.

1. Enter a name for your Code Editor space. The name must be 1–62 characters in length using letters, numbers, and dashes only.

1. Choose **Create space**.

1. After the space is created, you have some options before you choose to run the space:
   + You can edit the **Storage (GB)**, **Lifecycle Configuration**, or **Attach custom EFS file system** settings. The options available for these settings depend on what your administrator has configured.
   + From the **Instance** dropdown menu, you can choose the instance type most compatible with your use case. From the **Image** dropdown menu, you can choose a SageMaker Distribution image or a custom image provided by your administrator.
**Note**  
Switching between sagemaker-distribution images changes the underlying version of Code Editor being used, which may cause incompatibilities due to browser caching. You should clear the browser cache when switching between images.

     If you use a GPU instance type when configuring your Code Editor application, you must also use a GPU-based image. Within a space, your data is stored in an Amazon EBS volume that persists independently from the life of an instance. You won't lose your data when you change instances.
**Important**  
Custom IAM policies that allow Studio users to create spaces must also grant permissions to list images (`sagemaker:ListImages`) to view custom images. To add the permission, see [Add or remove identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html) in the *AWS Identity and Access Management User Guide*.  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker AI resources already include permissions to list images while creating those resources.
**Note**  
To update space settings, you must first stop your space. If your Code Editor uses an instance with NVMe instance stores, any data stored on the NVMe store is deleted when the space is stopped.

1. After updating your settings, choose **Run Space** in the space detail page.

1. After the status of the space is `Running`, choose **Open Code Editor** to go to your Code Editor session. 

![\[The space detail page for a Code Editor application in the Studio UI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/code-editor/code-editor-open.png)


# Launch a Code Editor application using the AWS CLI
<a name="code-editor-launch-cli"></a>

To configure and access your Code Editor integrated development environment through the AWS Command Line Interface (AWS CLI), you must create a Code Editor space. Be sure to [complete the prerequisites](code-editor-admin-prerequisites.md) before going through the following steps. Use the following procedure to create and run a Code Editor space.

**To create and run a Code Editor space**

1. Access a space using AWS Identity and Access Management (IAM) or AWS IAM Identity Center authentication. For more information about accessing spaces using the AWS CLI, see *Accessing spaces using the AWS Command Line Interface* in [Amazon SageMaker Studio spaces](studio-updated-spaces.md). 

1. Create an application and specify `CodeEditor` as the `app-type` using the following command.

   If you use a GPU instance type when creating your Code Editor application, you must also use a GPU-based image.

   ```
   aws sagemaker create-app \
   --domain-id domain-id \
   --space-name space-name \
   --app-type CodeEditor \
   --app-name default \
   --resource-spec "SageMakerImageArn=arn:aws:sagemaker:region:account-id:image/sagemaker-distribution-cpu"
   ```

   For more information about available Code Editor image ARNs, see [Code Editor application instances and images](code-editor-use-instances.md).

1. After the Code Editor application is in service, launch the application using a presigned URL. You can use the `describe-app` API to check if your application is in service. Use the `create-presigned-domain-url` API to create a presigned URL:

   ```
   aws sagemaker create-presigned-domain-url \
   --domain-id domain-id \
   --space-name space-name \
   --user-profile-name user-profile-name \
   --session-expiration-duration-in-seconds 43200 \
   --landing-uri app:CodeEditor:
   ```

1. Open the generated URL to start working in your Code Editor application.
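The `describe-app` check in step 3 can be sketched as a small polling loop. The app name `default` and the 15-second interval are assumptions:

```shell
# Sketch only: wait until the Code Editor application reports InService
# before creating the presigned URL. IDs are placeholders.
until [ "$(aws sagemaker describe-app \
    --domain-id domain-id \
    --space-name space-name \
    --app-type CodeEditor \
    --app-name default \
    --query Status --output text)" = "InService" ]; do
  sleep 15
done
```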

# Clone a repository in Code Editor
<a name="code-editor-use-clone-a-repository"></a>

You can navigate through folders and clone a repository in the **Explorer** window of the Code Editor application UI. 

To clone a repository, go through the following steps:

**To clone a repository**

1. Open your Code Editor application in the browser, and choose the **Explorer** button (![\[Icon representing multiple documents or pages stacked on top of each other.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/code-editor/code-editor-exploration-icon.png)) in the left navigation pane.

1. Choose **Clone Repository** in the **Explorer** window. Then, provide a repository URL or pick a repository source in the prompt.

1. Choose a folder to clone your repository into. Note that the default Code Editor folder is `/home/sagemaker-user/`. Cloning your repository may take some time.

1. To open the cloned repository, choose either **Open in New Window** or **Open**.

1.  To return to the Code Editor application UI homepage, choose **Cancel**.

1. Within the repository, a prompt asks if you trust the authors of the files in your new repository. You have two choices:

   1. To trust the folder and enable all features, choose **Yes, I trust the authors**.

   1. To browse the repository content in *restricted mode*, choose **No, I don't trust the authors**.

      In restricted mode, tasks are not allowed to run, debugging is disabled, workspace settings are not applied, and extensions have limited functionality.

      To exit restricted mode and enable all features, choose **Manage** in the **Restricted Mode** banner, and then trust the authors of all files in your current folder or its parent folder.

# Code Editor Connections and Extensions
<a name="code-editor-use-connections-and-extensions"></a>

Code Editor supports IDE connections to AWS services as well as extensions available in the [Open VSX Registry](https://open-vsx.org/). 

## Connections to AWS
<a name="code-editor-use-connections"></a>

Code Editor environments are integrated with the [AWS Toolkit for VS Code](https://docs.aws.amazon.com/toolkit-for-vscode/latest/userguide/welcome.html) to add connections to AWS services. To get started with connections to AWS services, you must have valid AWS Identity and Access Management (IAM) credentials. For more information, see [Authentication and access for the AWS Toolkit for Visual Studio Code](https://docs.aws.amazon.com/toolkit-for-vscode/latest/userguide/establish-credentials.html).

Within your Code Editor environment, you can add connections to: 
+  [AWS Explorer ](https://docs.aws.amazon.com/toolkit-for-vscode/latest/userguide/aws-explorer.html) – View, modify, and deploy AWS resources in Amazon S3, CloudWatch, and more.

  Accessing certain features within AWS Explorer requires certain AWS permissions. For more information, see [Authentication and access for the AWS Toolkit for Visual Studio Code](https://docs.aws.amazon.com/toolkit-for-vscode/latest/userguide/establish-credentials.html).
+ [Amazon CodeWhisperer](https://docs.aws.amazon.com/toolkit-for-vscode/latest/userguide/codewhisperer.html) – Build applications faster with AI-powered code suggestions. 

  To use Amazon CodeWhisperer with Code Editor, you must add the following permissions to your SageMaker AI execution role.

------
#### [ JSON ]


  ```
  {
    "Version":"2012-10-17",		 	 	 
    "Statement": [
      {
        "Sid": "CodeWhispererPermissions",
        "Effect": "Allow",
        "Action": ["codewhisperer:GenerateRecommendations"],
        "Resource": "*"
      }
    ]
  }
  ```

------

  For more information, see [Creating IAM policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create.html) and [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html) in the *IAM User Guide*.
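As a sketch, the CodeWhisperer statement above could be added as an inline policy on the execution role. The role name and policy name below are placeholders:

```shell
# Sketch only: attach the permission statement as an inline policy.
# The role name and policy name are placeholders.
aws iam put-role-policy \
    --role-name MySageMakerExecutionRole \
    --policy-name CodeWhispererAccess \
    --policy-document '{
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "CodeWhispererPermissions",
          "Effect": "Allow",
          "Action": ["codewhisperer:GenerateRecommendations"],
          "Resource": "*"
        }
      ]
    }'
```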

## Extensions
<a name="code-editor-use-extensions"></a>

Code Editor supports IDE extensions available in the [Open VSX Registry](https://open-vsx.org/). 

To get started with extensions in your Code Editor environment, choose the **Extensions** icon (![\[Icon showing two overlapping squares representing multiple windows or instances.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/code-editor/code-editor-extensions-icon.png)) in the left navigation pane. Here, you can configure connections to AWS by installing the AWS Toolkit. For more information, see [Installing the AWS Toolkit for Visual Studio Code](https://docs.aws.amazon.com/toolkit-for-vscode/latest/userguide/setup-toolkit.html).

In the search bar, you can search directly for additional extensions through the [Open VSX Registry](https://open-vsx.org/), such as the AWS Toolkit, Jupyter, Python, and more.

# Shut down Code Editor resources
<a name="code-editor-use-log-out"></a>

When you're finished using a Code Editor space, you can use Studio to stop it. That way, you stop incurring costs for the space. 

Alternatively, you can delete unused Code Editor resources by using the AWS CLI.

## Stop your Code Editor space using Studio
<a name="code-editor-use-log-out-stop-space"></a>

To stop your Code Editor space in Studio, use the following steps:

**To stop your Code Editor space in Studio**

1. Return to the Code Editor landing page by doing one of the following: 

   1. In the navigation bar in the upper-left corner, choose **Code Editor**.

   1. Alternatively, in the left navigation pane, choose **Code Editor** in the **Applications** menu.

1. Find the name of the Code Editor space you created. If the status of your space is **Running**, choose **Stop** in the **Action** column. You can also stop your space directly in the space detail page by choosing **Stop space**. The space may take some time to stop.

![\[A space detail page in the Code Editor application UI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/code-editor/code-editor-stop-space.png)


Additional resources such as SageMaker AI endpoints, Amazon EMR clusters, and Amazon Simple Storage Service (Amazon S3) buckets created from Studio are not automatically deleted when your space instance shuts down. To stop accruing charges from resources, delete any additional resources. For more information, see [Delete unused resources](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-jl-admin-guide-clean-up.html).

## Delete Code Editor resources using the AWS CLI
<a name="code-editor-use-log-out-shut-down-resources"></a>

You can delete your Code Editor application and space using the AWS Command Line Interface (AWS CLI).
+ [DeleteApp](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteApp.html)
+ [DeleteSpace](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteSpace.html)
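As a sketch, the two calls might look like the following; the IDs are placeholders, and the running application must be deleted before the space that hosts it:

```shell
# Sketch only: delete the application first, then the space. All IDs
# are placeholders.
aws sagemaker delete-app \
    --domain-id d-xxxxxxxxxxxx \
    --space-name my-code-editor-space \
    --app-type CodeEditor \
    --app-name default

aws sagemaker delete-space \
    --domain-id d-xxxxxxxxxxxx \
    --space-name my-code-editor-space
```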

# Code Editor administrator guide
<a name="code-editor-admin"></a>

You can use Code Editor with an On-Demand Instance for faster start-up time and configurable storage. You can launch a Code Editor application through Amazon SageMaker Studio or through the AWS CLI. You can also edit Code Editor default settings within the domain console. For more information, see [Edit domain settings](domain-edit.md). The following topics outline how administrators can configure Code Editor, based on Code-OSS, Visual Studio Code - Open Source, by changing storage options, customizing environments, and managing user access. They also provide information about the prerequisites needed to use Code Editor.

**Topics**
+ [Complete prerequisites](code-editor-admin-prerequisites.md)
+ [Give your users access to private spaces](code-editor-admin-user-access.md)
+ [Change the default storage size](code-editor-admin-storage-size.md)
+ [Code Editor lifecycle configurations](code-editor-use-lifecycle-configurations.md)
+ [Custom images](code-editor-custom-images.md)

# Complete prerequisites
<a name="code-editor-admin-prerequisites"></a>

To use Code Editor, based on Code-OSS, Visual Studio Code - Open Source, you must complete the following prerequisites.

1. You must first onboard to an Amazon SageMaker AI domain and create a user profile. For more information, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md).

1. If you are interacting with your Code Editor application using the AWS CLI, you must also complete the following prerequisites.

   1.  Update the AWS CLI by following the steps in [Installing the current AWS CLI Version](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv1.html#install-tool-bundled). 

   1.  From your local machine, run `aws configure` and provide your AWS credentials. For information about AWS credentials, see [Understanding and getting your AWS credentials](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html). 

1. (Optional) To get more storage and compute for your application, you can request an increase to your AWS quotas. For more information about requesting a quota increase, see [Amazon SageMaker AI endpoints and quotas](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html).

# Give your users access to private spaces
<a name="code-editor-admin-user-access"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

This section provides a policy that grants users access to private spaces. The policy also restricts private spaces, and the applications associated with them, to the owner of the user profile. 

You must provide your users with permissions to the following:
+ Private spaces
+ The user profile required for access to the private spaces

To provide permissions, attach the following policy to the IAM roles of your users.

------
#### [ JSON ]


```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateApp",
        "sagemaker:DeleteApp"
      ],
      "Resource": "arn:aws:sagemaker:us-east-1:111122223333:app/*",
      "Condition": {
        "Null": {
          "sagemaker:OwnerUserProfileArn": "true"
        }
      }
    },
    {
      "Sid": "SMStudioCreatePresignedDomainUrlForUserProfile",
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreatePresignedDomainUrl"
      ],
      "Resource": "arn:aws:sagemaker:us-east-1:111122223333:user-profile/domain-id/user-profile-name"
    },
    {
      "Sid": "SMStudioAppPermissionsListAndDescribe",
      "Effect": "Allow",
      "Action": [
        "sagemaker:ListApps",
        "sagemaker:ListDomains",
        "sagemaker:ListUserProfiles",
        "sagemaker:ListSpaces",
        "sagemaker:DescribeApp",
        "sagemaker:DescribeDomain",
        "sagemaker:DescribeUserProfile",
        "sagemaker:DescribeSpace"
      ],
      "Resource": "*"
    },
    {
      "Sid": "SMStudioAppPermissionsTagOnCreate",
      "Effect": "Allow",
      "Action": [
        "sagemaker:AddTags"
      ],
      "Resource": "arn:aws:sagemaker:us-east-1:111122223333:*/*",
      "Condition": {
        "Null": {
          "sagemaker:TaggingAction": "false"
        }
      }
    },
    {
      "Sid": "SMStudioRestrictSharedSpacesWithoutOwners",
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateSpace",
        "sagemaker:UpdateSpace",
        "sagemaker:DeleteSpace"
      ],
      "Resource": "arn:aws:sagemaker:us-east-1:111122223333:space/domain-id/*",
      "Condition": {
        "Null": {
          "sagemaker:OwnerUserProfileArn": "true"
        }
      }
    },
    {
      "Sid": "SMStudioRestrictSpacesToOwnerUserProfile",
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateSpace",
        "sagemaker:UpdateSpace",
        "sagemaker:DeleteSpace"
      ],
      "Resource": "arn:aws:sagemaker:us-east-1:111122223333:space/domain-id/*",
      "Condition": {
        "ArnLike": {
        "sagemaker:OwnerUserProfileArn": "arn:aws:sagemaker:us-east-1:111122223333:user-profile/domain-id/user-profile-name"
        },
        "StringEquals": {
          "sagemaker:SpaceSharingType": [
            "Private",
            "Shared"
          ]
        }
      }
    },
    {
      "Sid": "SMStudioRestrictCreatePrivateSpaceAppsToOwnerUserProfile",
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateApp",
        "sagemaker:DeleteApp"
      ],
      "Resource": "arn:aws:sagemaker:us-east-1:111122223333:app/domain-id/*",
      "Condition": {
        "ArnLike": {
        "sagemaker:OwnerUserProfileArn": "arn:aws:sagemaker:us-east-1:111122223333:user-profile/domain-id/user-profile-name"
        },
        "StringEquals": {
          "sagemaker:SpaceSharingType": [
            "Private"
          ]
        }
      }
    }
  ]
}
```

------

# Change the default storage size
<a name="code-editor-admin-storage-size"></a>

You can change the default storage settings of your users based on your organizational requirements and their needs.

To change the storage size of your users, do the following:

1. Update the Amazon EBS storage settings in the domain. 

1. Create a user profile and specify the storage settings within it.

Use the following AWS Command Line Interface (AWS CLI) command to update the domain.

```
aws --region $REGION sagemaker update-domain \
--domain-id $DOMAIN_ID \
--default-user-settings '{
    "SpaceStorageSettings": {
        "DefaultEbsStorageSettings":{
            "DefaultEbsVolumeSizeInGb":5,
            "MaximumEbsVolumeSizeInGb":100
        }
    }
}'
```

Use the following AWS CLI command to create the user profile and specify the default storage settings.

```
aws --region $REGION sagemaker create-user-profile \
--domain-id $DOMAIN_ID \
--user-profile-name $USER_PROFILE_NAME \
--user-settings '{
    "SpaceStorageSettings": {
        "DefaultEbsStorageSettings":{
            "DefaultEbsVolumeSizeInGb":5,
            "MaximumEbsVolumeSizeInGb":100
        }
    }
}'
```

Use the following AWS CLI commands to update the default storage settings in the user profile.

```
aws --region $REGION sagemaker update-user-profile \
--domain-id $DOMAIN_ID \
--user-profile-name $USER_PROFILE_NAME \
--user-settings '{
    "SpaceStorageSettings": {
        "DefaultEbsStorageSettings":{
            "DefaultEbsVolumeSizeInGb":25,
            "MaximumEbsVolumeSizeInGb":200
        }
    }
}'
```

# Code Editor lifecycle configurations
<a name="code-editor-use-lifecycle-configurations"></a>

You can use Code Editor lifecycle configurations to automate customization for your Studio environment. This customization includes installing custom packages, configuring extensions, preloading datasets, and setting up source code repositories.

The following instructions use the AWS Command Line Interface (AWS CLI) to create, attach, debug, and detach lifecycle configurations for the `CodeEditor` application type:
+ [Create and attach lifecycle configurations in Studio](code-editor-use-lifecycle-configurations-studio-create.md)
+ [Debug lifecycle configurations in Studio](code-editor-use-lifecycle-configurations-studio-debug.md)
+ [Detach lifecycle configurations in Studio](code-editor-use-lifecycle-configurations-studio-detach.md)

# Create and attach lifecycle configurations in Studio
<a name="code-editor-use-lifecycle-configurations-studio-create"></a>

The following section provides AWS CLI commands to create a lifecycle configuration, attach a lifecycle configuration when creating a new user profile, and attach a lifecycle configuration when updating a user profile. For prerequisites and general steps on creating and attaching lifecycle configurations in Studio, see [Lifecycle configuration creation](jl-lcc-create.md). 

When creating your Studio lifecycle configuration with the `create-studio-lifecycle-config` command, be sure to specify that the `studio-lifecycle-config-app-type` is `CodeEditor`. The following example shows how to create a new Studio lifecycle configuration for your Code Editor application.

```
aws sagemaker create-studio-lifecycle-config \
--studio-lifecycle-config-name my-code-editor-lcc \
--studio-lifecycle-config-content $LCC_CONTENT \
--studio-lifecycle-config-app-type CodeEditor
```

Note the ARN of the newly created lifecycle configuration that is returned. When attaching a lifecycle configuration, provide this ARN within the `LifecycleConfigArns` list of `CodeEditorAppSettings`. 

You can attach a lifecycle configuration when creating a user profile or domain. The following example shows how to create a new user profile with the lifecycle configuration attached. You can also create a new domain with a lifecycle configuration attached by using the [create-domain](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/create-domain.html) command.

```
# Create a new UserProfile
aws sagemaker create-user-profile \
--domain-id domain-id \
--user-profile-name user-profile-name \
--user-settings '{
"CodeEditorAppSettings": {
  "LifecycleConfigArns":
    [lifecycle-configuration-arn-list]
  }
}'
```

You can alternatively attach a lifecycle configuration when updating a user profile or domain. The following example shows how to update a user profile with the lifecycle configuration attached. You can also update a domain with a lifecycle configuration attached by using the [update-domain](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/update-domain.html) command.


```
# Update a UserProfile
aws sagemaker update-user-profile \
--domain-id domain-id \
--user-profile-name user-profile-name \
--user-settings '{
"CodeEditorAppSettings": {
  "LifecycleConfigArns":
    [lifecycle-configuration-arn-list]
  }
}'
```

# Debug lifecycle configurations in Studio
<a name="code-editor-use-lifecycle-configurations-studio-debug"></a>

To debug lifecycle configuration scripts for Code Editor, you must use Studio. For instructions about debugging lifecycle configurations in Studio, see [Debug lifecycle configurations](jl-lcc-debug.md). To find the logs for a specific application, search the log streams using the following format:

```
domain-id/space-name/CodeEditor/default/LifecycleConfigOnStart
```
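When scripting log retrieval, the log stream name above can be assembled from the domain ID and space name. The following is a sketch; the helper name and the values are placeholders:

```
# Compose the CloudWatch log stream name for a Code Editor app's
# lifecycle configuration OnStart logs. Values are placeholders.
log_stream_name() {
  local domain_id=$1 space_name=$2
  echo "${domain_id}/${space_name}/CodeEditor/default/LifecycleConfigOnStart"
}
log_stream_name d-examplexxxx my-space
```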

# Detach lifecycle configurations in Studio
<a name="code-editor-use-lifecycle-configurations-studio-detach"></a>

To detach lifecycle configurations for Code Editor, you can use either the console or the AWS CLI. For steps on detaching lifecycle configurations from the Studio console, see [Detach lifecycle configurations](jl-lcc-delete.md).

To detach a lifecycle configuration using the AWS CLI, remove the desired lifecycle configuration from the list of lifecycle configurations attached to the resource. Then pass the list as part of the respective command:
+ [update-user-profile](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/update-user-profile.html)
+ [update-domain](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/update-domain.html)
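One way to build the trimmed list is to filter the current settings with `jq` before passing them back. The following sketch operates on an inline JSON document; in practice you would fetch the current settings with `describe-user-profile`, and the ARNs below are placeholders:

```
# Current user settings (placeholder ARNs); drop the ARN being detached
# and rebuild the payload to pass to update-user-profile.
SETTINGS='{"CodeEditorAppSettings":{"LifecycleConfigArns":["arn:aws:sagemaker:us-east-1:111122223333:studio-lifecycle-config/keep-me","arn:aws:sagemaker:us-east-1:111122223333:studio-lifecycle-config/detach-me"]}}'
REMOVE_ARN="arn:aws:sagemaker:us-east-1:111122223333:studio-lifecycle-config/detach-me"
NEW_SETTINGS=$(printf '%s' "$SETTINGS" | jq --arg rm "$REMOVE_ARN" \
  '.CodeEditorAppSettings.LifecycleConfigArns |= map(select(. != $rm))')
echo "$NEW_SETTINGS"
```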

For example, the following command removes all lifecycle configurations for the Code Editor application attached to the domain.

```
aws sagemaker update-domain --domain-id domain-id \
--default-user-settings '{
"CodeEditorAppSettings": {
  "LifecycleConfigArns":
    []
  }
}'
```

# Create a lifecycle configuration to clone repositories into a Code Editor application
<a name="code-editor-use-lifecycle-configurations-repositories"></a>

This section shows how to clone a repository and create a Code Editor application with the lifecycle configuration attached.

1. From your local machine, create a file named `my-script.sh` with the following content:

   ```
   #!/bin/bash
   set -eux
   ```

1. Clone the repository of your choice in your lifecycle configuration script. 

   ```
   export REPOSITORY_URL="https://github.com/aws-samples/sagemaker-studio-lifecycle-config-examples.git"
   git -C /home/sagemaker-user clone $REPOSITORY_URL
   ```

1. After finalizing your script, create and attach your lifecycle configuration. For more information, see [Create and attach lifecycle configurations in Studio](code-editor-use-lifecycle-configurations-studio-create.md).

1. Create your Code Editor application with the lifecycle configuration attached.

   ```
   aws sagemaker create-app \
   --domain-id domain-id \
   --space-name space-name \
   --app-type CodeEditor \
   --app-name default \
   --resource-spec "SageMakerImageArn=arn:aws:sagemaker:region:image-account-id:image/sagemaker-distribution-cpu,LifecycleConfigArn=arn:aws:sagemaker:region:user-account-id:studio-lifecycle-config/my-code-editor-lcc,InstanceType=ml.t3.large"
   ```

   For more information about available Code Editor image ARNs, see [Code Editor application instances and images](code-editor-use-instances.md).
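Lifecycle configuration scripts run each time the application starts, so a re-run guard keeps the clone step above idempotent. The following is a sketch; the `clone_once` helper is hypothetical:

```
# Clone only when the target directory is not already a git checkout, so
# repeated application starts don't fail on an existing directory.
clone_once() {
  local url=$1 dest=$2
  if [ -d "$dest/.git" ]; then
    echo "skip: $dest already cloned"
  else
    git clone "$url" "$dest"
  fi
}
```

In `my-script.sh`, this could be called as `clone_once "$REPOSITORY_URL" /home/sagemaker-user/sagemaker-studio-lifecycle-config-examples`.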

# Create a lifecycle configuration to install Code Editor extensions
<a name="code-editor-use-lifecycle-configurations-extensions"></a>

This section shows how to create a lifecycle configuration to install extensions from the [Open VSX Registry](https://open-vsx.org/) in your Code Editor environment.

1. From your local machine, create a file named `my-script.sh` with the following content:

   ```
   #!/bin/bash
   set -eux
   ```

1. Within the script, install the [Open VSX Registry](https://open-vsx.org/) extension of your choice:

   ```
   sagemaker-code-editor --install-extension AmazonEMR.emr-tools --extensions-dir /opt/amazon/sagemaker/sagemaker-code-editor-server-data/extensions
   ```

   You can retrieve the extension name from the URL of the extension in the [Open VSX Registry](https://open-vsx.org/). The extension name to use in the `sagemaker-code-editor` command should contain all text that follows `https://open-vsx.org/extension/` in the URL. Replace all instances of a slash (`/`) with a period (`.`). For example, `AmazonEMR/emr-tools` should be `AmazonEMR.emr-tools`.  
![\[The Amazon EMR extension page in the Open VSX Registry.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/code-editor/code-editor-emr-extension.png)

1. After finalizing your script, create and attach your lifecycle configuration. For more information, see [Create and attach lifecycle configurations in Studio](code-editor-use-lifecycle-configurations-studio-create.md).

1. Create your Code Editor application with the lifecycle configuration attached:

   ```
   aws sagemaker create-app \
   --domain-id domain-id \
   --space-name space-name \
   --app-type CodeEditor \
   --app-name default \
   --resource-spec "SageMakerImageArn=arn:aws:sagemaker:region:image-account-id:image/sagemaker-distribution-cpu,LifecycleConfigArn=arn:aws:sagemaker:region:user-account-id:studio-lifecycle-config/my-code-editor-lcc,InstanceType=ml.t3.large"
   ```

   For more information about available Code Editor image ARNs, see [Code Editor application instances and images](code-editor-use-instances.md). For more information about connections and extensions, see [Code Editor Connections and Extensions](code-editor-use-connections-and-extensions.md).
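The URL-to-name rule described above can be expressed as a small transform. The following is a sketch; the `ext_id_from_url` helper is hypothetical:

```
# Derive the extension identifier from an Open VSX URL: strip the
# site prefix, then replace the remaining slash with a period.
ext_id_from_url() {
  local url=$1
  printf '%s\n' "${url#https://open-vsx.org/extension/}" | tr '/' '.'
}
ext_id_from_url "https://open-vsx.org/extension/AmazonEMR/emr-tools"
```

For the Amazon EMR extension, this prints `AmazonEMR.emr-tools`, the identifier used with `sagemaker-code-editor --install-extension`.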

# Custom images
<a name="code-editor-custom-images"></a>

If you need functionality that is different from what's provided by the SageMaker Distribution, you can bring your own image with your custom extensions and packages. You can also use a custom image to personalize the Code Editor UI for your own branding or compliance needs.

This page provides Code Editor-specific information and templates for creating your own custom SageMaker AI images. It supplements the Amazon SageMaker Studio instructions on creating your own SageMaker AI image and bringing your own image to Studio. To learn about custom Amazon SageMaker AI images and how to bring your own image to Studio, see [Bring your own image (BYOI)](studio-updated-byoi.md).

**Topics**
+ [Health check and URL for applications](#code-editor-custom-images-app-healthcheck)
+ [Dockerfile examples](#code-editor-custom-images-dockerfile-templates)

## Health check and URL for applications
<a name="code-editor-custom-images-app-healthcheck"></a>
+ Base URL – The base URL for the BYOI application must be `CodeEditor/default`. You can have only one application, and it must always be named `default`.
+ Health check endpoint – You must host your Code Editor server at `0.0.0.0` on port `8888` for SageMaker AI to detect it.
+ Authentication – You must pass `--without-connection-token` when launching `sagemaker-code-editor` so that SageMaker AI can authenticate your users.

**Note**  
If you are using Amazon SageMaker Distribution as the base image, these requirements are already taken care of as part of the included `entrypoint-code-editor` script.

## Dockerfile examples
<a name="code-editor-custom-images-dockerfile-templates"></a>

The following examples are `Dockerfile`s that meet the preceding requirements and the [Custom image specifications](studio-updated-byoi-specs.md).

**Note**  
If you are bringing your own image to SageMaker Unified Studio, you will need to follow the [Dockerfile specifications](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/byoi-specifications.html) in the *Amazon SageMaker Unified Studio User Guide*.  
`Dockerfile` examples for SageMaker Unified Studio can be found in [Dockerfile example](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/byoi-specifications.html#byoi-specifications-example) in the *Amazon SageMaker Unified Studio User Guide*.

------
#### [ Example micromamba Dockerfile ]

The following is an example Dockerfile to create an image from scratch using a [micromamba](https://mamba.readthedocs.io/en/latest/user_guide/micromamba.html) base environment:

```
FROM mambaorg/micromamba:latest
ARG NB_USER="sagemaker-user"
ARG NB_UID=1000
ARG NB_GID=100

USER root

RUN micromamba install -y --name base -c conda-forge sagemaker-code-editor

USER $NB_UID

CMD eval "$(micromamba shell hook --shell=bash)"; \
    micromamba activate base; \
    sagemaker-code-editor --host 0.0.0.0 --port 8888 \
        --without-connection-token \
        --base-path "/CodeEditor/default"
```

------
#### [ Example SageMaker AI Distribution Dockerfile ]

The following is an example Dockerfile to create an image based on [Amazon SageMaker AI Distribution](https://github.com/aws/sagemaker-distribution/tree/main):

```
FROM public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu
ARG NB_USER="sagemaker-user"
ARG NB_UID=1000
ARG NB_GID=100
ENV MAMBA_USER=$NB_USER

USER root

# Install scrapy in the base environment
RUN micromamba install -y --name base -c conda-forge scrapy

# Download the VSCodeVim extension
RUN \
  wget https://github.com/VSCodeVim/Vim/releases/download/v1.27.2/vim-1.27.2.vsix \
  -P /tmp/exts/ --no-check-certificate

# Install the extension
RUN \
  extensionloc=/opt/amazon/sagemaker/sagemaker-code-editor-server-data/extensions \
  && sagemaker-code-editor \
    --install-extension "/tmp/exts/vim-1.27.2.vsix" \
    --extensions-dir "${extensionloc}"

USER $MAMBA_USER
ENTRYPOINT ["entrypoint-code-editor"]
```

------

# Amazon SageMaker HyperPod
<a name="sagemaker-hyperpod"></a>

SageMaker HyperPod helps you provision resilient clusters for running machine learning (ML) workloads and developing state-of-the-art models such as large language models (LLMs), diffusion models, and foundation models (FMs). It accelerates development of FMs by removing the undifferentiated heavy lifting involved in building and maintaining large-scale compute clusters powered by thousands of accelerators such as AWS Trainium and NVIDIA A100 and H100 Graphics Processing Units (GPUs). When accelerators fail, the resiliency features of SageMaker HyperPod monitor the cluster instances and automatically detect and replace the faulty hardware on the fly so that you can focus on running ML workloads.

To get started, check [Prerequisites for using SageMaker HyperPod](sagemaker-hyperpod-prerequisites.md), set up [AWS Identity and Access Management for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md), and choose one of the following orchestrator options supported by SageMaker HyperPod.

**Slurm support in SageMaker HyperPod**

SageMaker HyperPod provides support for running machine learning workloads on resilient clusters by integrating with Slurm, an open-source workload manager. Slurm support in SageMaker HyperPod enables seamless cluster orchestration through Slurm cluster configuration, allowing you to set up head, login, and worker nodes on SageMaker HyperPod clusters. This integration also facilitates Slurm-based job scheduling for running ML workloads on the cluster, as well as direct access to cluster nodes for job scheduling. With HyperPod's lifecycle configuration support, you can customize the computing environment of the clusters to meet your specific requirements. Additionally, by using the Amazon SageMaker AI distributed training libraries, you can optimize the clusters' performance on AWS compute and network resources. To learn more, see [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**Amazon EKS support in SageMaker HyperPod**

SageMaker HyperPod also integrates with Amazon EKS to enable large-scale training of foundation models on long-running and resilient compute clusters. This allows cluster admin users to provision HyperPod clusters and attach them to an EKS control plane, enabling dynamic capacity management, direct access to cluster instances, and resiliency capabilities. For data scientists, Amazon EKS support in HyperPod allows running containerized workloads for training foundation models, inference on the EKS cluster, and leveraging the job auto-resume capability for Kubeflow PyTorch training. The architecture involves a 1-to-1 mapping between an EKS cluster (control plane) and a HyperPod cluster (worker nodes) within a VPC, providing a tightly integrated solution for running large-scale ML workloads. To learn more, see [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**UltraServers with HyperPod**

HyperPod with UltraServers delivers AI computing power by integrating NVIDIA superchips into a cohesive, high-performance infrastructure. Each NVL72 UltraServer combines 18 instances with 72 NVIDIA Blackwell GPUs interconnected via NVLink, enabling faster inference and faster training performance compared to previous generation instances. This architecture is particularly valuable for organizations working with trillion-parameter foundation models, as the unified GPU memory allows entire models to remain within a single NVLink domain, eliminating cross-node networking bottlenecks. HyperPod enhances this hardware advantage with intelligent topology-aware scheduling that optimizes workload placement, automatic instance replacement to minimize disruptions, and flexible deployment options that support both dedicated and shared resource configurations. For teams pushing the boundaries of model size and performance, this integration provides the computational foundation needed to train and deploy the most advanced AI models with unprecedented efficiency.

SageMaker HyperPod automatically optimizes instance placement across your UltraServers. By default, HyperPod prioritizes all instances in one UltraServer before using a different one. For example, if you want 14 instances and have 2 UltraServers in your plan, SageMaker AI uses all of the instances in the first UltraServer. If you want 20 instances, SageMaker AI uses all 18 instances in the first UltraServer and then uses 2 more from the second.
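The default placement behavior described above amounts to filling 18-instance UltraServers in order. The following sketch illustrates that arithmetic; the `place_instances` helper is illustrative, not an API:

```
# Fill each 18-instance NVL72 UltraServer completely before starting
# the next one, mirroring the default placement described above.
place_instances() {
  local want=$1 per_server=18
  local full=$((want / per_server)) rem=$((want % per_server)) i
  for i in $(seq 1 "$full"); do
    echo "ultraserver $i: $per_server instances"
  done
  if [ "$rem" -gt 0 ]; then
    echo "ultraserver $((full + 1)): $rem instances"
  fi
}
place_instances 20
```

For 20 requested instances, this prints 18 instances on the first UltraServer and 2 on the second, matching the example above.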

## AWS Regions supported by SageMaker HyperPod
<a name="sagemaker-hyperpod-available-regions"></a>

SageMaker HyperPod is available in the following AWS Regions. 
+ us-east-1
+ us-east-2
+ us-west-1
+ us-west-2
+ eu-central-1
+ eu-north-1
+ eu-west-1
+ eu-west-2
+ eu-south-2
+ ap-south-1
+ ap-southeast-1
+ ap-southeast-2
+ ap-southeast-3
+ ap-southeast-4
+ ap-northeast-1
+ sa-east-1

**Topics**
+ [AWS Regions supported by SageMaker HyperPod](#sagemaker-hyperpod-available-regions)
+ [Amazon SageMaker HyperPod quickstart](sagemaker-hyperpod-quickstart.md)
+ [Prerequisites for using SageMaker HyperPod](sagemaker-hyperpod-prerequisites.md)
+ [AWS Identity and Access Management for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md)
+ [Customer managed AWS KMS key encryption for SageMaker HyperPod](smcluster-cmk.md)
+ [SageMaker HyperPod recipes](sagemaker-hyperpod-recipes.md)
+ [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md)
+ [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md)
+ [Using topology-aware scheduling in Amazon SageMaker HyperPod](sagemaker-hyperpod-topology.md)
+ [Deploying models on Amazon SageMaker HyperPod](sagemaker-hyperpod-model-deployment.md)
+ [HyperPod in Studio](sagemaker-hyperpod-studio.md)
+ [SageMaker HyperPod references](sagemaker-hyperpod-ref.md)
+ [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md)
+ [Amazon SageMaker HyperPod AMI](sagemaker-hyperpod-release-ami.md)

# Amazon SageMaker HyperPod quickstart
<a name="sagemaker-hyperpod-quickstart"></a>

This quickstart guides you through creating your first HyperPod cluster with Slurm or Amazon EKS orchestration. Choose the orchestration that best fits your infrastructure needs to get started with SageMaker HyperPod.

**Topics**
+ [Create a Slurm-orchestrated SageMaker HyperPod cluster](#sagemaker-hyperpod-quickstart-slurm)
+ [Create an EKS-orchestrated SageMaker HyperPod cluster](#sagemaker-hyperpod-quickstart-eks)
+ [Submit workloads](#sagemaker-hyperpod-quickstart-workload)

## Create a Slurm-orchestrated SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-quickstart-slurm"></a>

Follow these steps to create your first SageMaker HyperPod cluster with Slurm orchestration.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose **HyperPod Clusters** in the left navigation pane and then **Cluster Management**.

1. On the **SageMaker HyperPod Clusters** page, choose **Create HyperPod cluster**. 

1. On the **Create HyperPod cluster** drop-down, choose **Orchestrated by Slurm**.

1. On the cluster creation page, choose **Quick setup**. With this option, you get started immediately with default settings. SageMaker AI will create new resources such as VPC, subnets, security groups, Amazon S3 bucket, IAM role, and FSx for Lustre in the process of creating your cluster.

1. On **General settings**, specify a name for the new cluster. You can’t change the name after the cluster is created.

1. On **Instance groups**, choose **Add group**. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. To deploy a cluster, you must add at least one instance group. You can add one instance group at a time. To create multiple instance groups, repeat the process for each instance group.

   Follow these steps to add an instance group.

   1. For **Instance group type**, choose a type for your instance group. For this quickstart, choose **Controller (head)** for `my-controller-group`, **Login** for `my-login-group`, and **Compute (worker)** for `worker-group-1`. 

   1. For **Name**, specify a name for the instance group. For this quickstart, create three instance groups named `my-controller-group`, `my-login-group`, and `worker-group-1`.

   1.  For **Instance capacity**, choose either on-demand capacity or a training plan to reserve your compute resources.

   1. For **Instance type**, choose the instance for the instance group. For this quickstart, select `ml.c5.xlarge` for `my-controller-group`, `ml.m5.4xlarge` for `my-login-group`, and `ml.trn1.32xlarge` for `worker-group-1`. 

      Ensure that you choose the instance type with sufficient quotas in your account, or request additional quotas by following the instructions at [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

   1. For **Instance quantity**, specify an integer not exceeding the instance quota for cluster usage. For this quickstart, enter **1** for all three groups.

   1. For **Target Availability Zone**, choose the Availability Zone where your instances will be provisioned. The Availability Zone should correspond to the location of your accelerated compute capacity.

   1. For **Additional storage volume per instance (GB) - optional**, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group. The default mount path for the additional EBS volume is `/opt/sagemaker`. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify if the EBS volume is mounted correctly by running the `df -h` command. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the [Amazon EBS volumes](https://docs.aws.amazon.com//ebs/latest/userguide/ebs-volumes.html) section in the *Amazon Elastic Block Store User Guide*.

   1. Choose **Add instance group**.

1.  On **Quick configuration defaults**, review the default settings. This section lists all the default settings for your cluster creation, including all the new AWS resources that will be created during the cluster creation process.

1. Choose **Submit**.

For more information, see [Getting started with SageMaker HyperPod using the SageMaker AI console](smcluster-getting-started-slurm-console.md).

## Create an EKS-orchestrated SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-quickstart-eks"></a>

Follow these steps to create your first SageMaker HyperPod cluster with Amazon EKS orchestration.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose **HyperPod Clusters** in the left navigation pane and then **Cluster Management**.

1. On the **SageMaker HyperPod Clusters** page, choose **Create HyperPod cluster**. 

1. On the **Create HyperPod cluster** drop-down, choose **Orchestrated by Amazon EKS**.

1. On the cluster creation page, choose **Quick configuration**. With this option, you can get started immediately with default settings. SageMaker AI will create new resources such as VPC, subnets, security groups, Amazon S3 bucket, IAM role, and FSx for Lustre in the process of creating your cluster.

1. On **General settings**, specify a name for the new cluster. You can’t change the name after the cluster is created. 

1. On **Instance groups**, choose **Add group**. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. To deploy a cluster, you must add at least one instance group. You can add one instance group at a time. To create multiple instance groups, repeat the process for each instance group.

   Follow these steps to add an instance group.

   1. For **Instance group type**, choose **Standard** or **Restricted Instance Group (RIG)**. Typically, you will choose **Standard**, which provides a general purpose computing environment without additional security restrictions. **Restricted Instance Group (RIG)** is a specialized environment for foundational models customization such as Amazon Nova. For more information about setting up RIG for Amazon Nova model customization, see Amazon Nova customization on SageMaker HyperPod in the [Amazon Nova 1.0 user guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp.html) or the [Amazon Nova 2.0 user guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp.html).

   1. For **Name**, specify a name for the instance group.

   1.  For **Instance capacity**, choose either on-demand capacity or a training plan to reserve your compute resources.

   1. For **Instance type**, choose the instance for the instance group. Ensure that you choose an instance type with sufficient quotas in your account, or request additional quotas by following the instructions at [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

   1. For **Instance quantity**, specify an integer not exceeding the instance quota for cluster usage. For this quickstart, enter **1**.

   1. For **Target Availability Zone**, choose the Availability Zone where your instances will be provisioned. The Availability Zone should correspond to the location of your accelerated compute capacity.

   1. For **Additional storage volume per instance (GB) - optional**, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group. The default mount path for the additional EBS volume is `/opt/sagemaker`. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify if the EBS volume is mounted correctly by running the `df -h` command. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the [Amazon EBS volumes](https://docs.aws.amazon.com//ebs/latest/userguide/ebs-volumes.html) section in the *Amazon Elastic Block Store User Guide*.

   1. For **Instance deep health checks**, choose your option. Deep health checks monitor instance health during creation and after software updates, automatically recovering faulty instances through reboots or replacements when enabled.

   1. Choose **Add instance group**.

1.  On **Quick configuration defaults**, review the default settings. This section lists all the default settings for your cluster creation, including all the new AWS resources that will be created during the cluster creation process.

1. Choose **Submit**.

For more information, see [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md).

## Submit workloads
<a name="sagemaker-hyperpod-quickstart-workload"></a>

Follow these workshop tutorials to submit sample workloads.
+ [Amazon SageMaker HyperPod for Slurm](https://catalog.workshops.aws/sagemaker-hyperpod/en-US)
+ [Amazon SageMaker HyperPod for Amazon EKS](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US)

# Prerequisites for using SageMaker HyperPod
<a name="sagemaker-hyperpod-prerequisites"></a>

The following sections walk you through prerequisites before getting started with SageMaker HyperPod.

**Topics**
+ [SageMaker HyperPod quotas](#sagemaker-hyperpod-prerequisites-quotas)
+ [Setting up SageMaker HyperPod with a custom Amazon VPC](#sagemaker-hyperpod-prerequisites-optional-vpc)
+ [Setting up SageMaker HyperPod clusters across multiple AZs](#sagemaker-hyperpod-prerequisites-multiple-availability-zones)
+ [Setting up AWS Systems Manager and Run As for cluster user access control](#sagemaker-hyperpod-prerequisites-ssm)
+ [(Optional) Setting up SageMaker HyperPod with Amazon FSx for Lustre](#sagemaker-hyperpod-prerequisites-optional-fsx)

## SageMaker HyperPod quotas
<a name="sagemaker-hyperpod-prerequisites-quotas"></a>

You can create SageMaker HyperPod clusters given the quotas for *cluster usage* in your AWS account.

**Important**  
To learn more about SageMaker HyperPod pricing, see [SageMaker HyperPod pricing](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-pricing) and [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

### View Amazon SageMaker HyperPod quotas using the AWS Management Console
<a name="sagemaker-hyperpod-prerequisites-quotas-view"></a>

Look up the default and applied values of a *quota*, also referred to as a *limit*, for *cluster usage*, which is used for SageMaker HyperPod.

1. Open the [Service Quotas console](https://console.aws.amazon.com/servicequotas/).

1. In the left navigation pane, choose **AWS services**.

1. From the **AWS services** list, search for and select **Amazon SageMaker AI**.

1. In the **Service quotas** list, you can see the service quota name, applied value (if it's available), AWS default quota, and whether the quota value is adjustable. 

1. In the search bar, type **cluster usage**. This shows quotas for cluster usage, applied quotas, and the default quotas.

**List of common service quotas to create a HyperPod cluster and its prerequisites**

Before creating a new HyperPod cluster and its prerequisites in the SageMaker AI console, check whether you have requested service quota increases for the following quotas. Navigate to the **Service Quotas** console and search for the following terms.


****  

| No | Quota name | Search term | Description | 
| --- | --- | --- | --- | 
| 1 | Maximum number of instances allowed per SageMaker HyperPod cluster | Under SageMaker AI, search for “Maximum number of instances allowed per SageMaker HyperPod cluster” | Your account-level quota value must be greater than the number of instances you want to add to your cluster. | 
| 2 | Maximum size of EBS volume in GB for a SageMaker HyperPod cluster instance | Under SageMaker AI, search for “Maximum size of EBS volume in GB for a HyperPod cluster instance” | Your account-level quota value must be greater than the EBS volume size you want to add to your cluster. | 
| 3 | Total number of instances allowed across SageMaker HyperPod clusters | Under SageMaker AI, search for “Total number of instances allowed across SageMaker HyperPod clusters” | Your account-level quota value must be greater than the total number of instances you want to add across all clusters in your account in aggregate. | 
| 4 | Instance quotas | Under SageMaker AI, search for “ml.<instance type> for cluster usage”, for example, “ml.p5.48xlarge for cluster usage” | Your account-level quota value for the particular instance type (for example, ml.p5.48xlarge) must be greater than the number of instances of that type you want to add across all clusters in your account in aggregate. | 
| 5 | VPCs per Region | Under Amazon Virtual Private Cloud (Amazon VPC), search for “VPCs per Region” | Your account-level quota value must allow creating a new VPC in the account when setting up your HyperPod cluster. Check whether you have already exhausted this quota by checking the VPC console. This quota increase is only needed if you will create a new VPC through the Quick or Custom cluster setup option in the SageMaker HyperPod console. | 
| 6 | Internet gateways per Region | Under Amazon Virtual Private Cloud (Amazon VPC), search for “Internet gateways per Region” | Your account-level quota value must allow creating one additional internet gateway in the account when setting up your SageMaker HyperPod cluster. This quota increase is only needed if you will create a new VPC through the Quick or Custom cluster setup option in the SageMaker HyperPod console. | 
| 7 | Network interfaces per Region | Under Amazon Virtual Private Cloud (Amazon VPC), search for “Network interfaces per Region” | Your account must have enough network interface capacity when setting up your HyperPod cluster. | 
| 8 | EC2-VPC Elastic IPs | Under Amazon Elastic Compute Cloud (Amazon EC2), search for “EC2-VPC Elastic IPs” | Your account-level quota value must allow creating a new VPC in the account when setting up your HyperPod cluster. Check whether you have already exhausted this quota by checking the VPC console. This quota increase is only needed if you will create a new VPC through the Quick or Custom cluster setup option in the SageMaker HyperPod console. | 

### Request an Amazon SageMaker HyperPod quota increase using the AWS Management Console
<a name="sagemaker-hyperpod-prerequisites-quotas-increase"></a>

Increase your quotas at the account or resource level.

1. To increase the quota of instances for *cluster usage*, select the quota that you want to increase.

1. If the quota is adjustable, you can request a quota increase at either the account level or resource level based on the value listed in the **Adjustability** column.

1. For **Increase quota value**, enter the new value. The new value must be greater than the current value.

1. Choose **Request**.

1. To view any pending or recently resolved requests in the console, navigate to the **Request history** tab from the service's details page, or choose **Dashboard** from the navigation pane. For pending requests, choose the status of the request to open the request receipt. The initial status of a request is **Pending**. After the status changes to **Quota requested**, you see the case number with AWS Support. Choose the case number to open the ticket for your request.

To learn more about requesting a quota increase in general, see [Requesting a Quota Increase](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html) in the *AWS Service Quotas User Guide*.
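You can also make the same request programmatically through the Service Quotas API. The following Python sketch (using `boto3`) reads the current quota value and submits an increase only when the desired value exceeds it. The quota code is a placeholder: replace it with the code shown in the Service Quotas console for the quota you want to raise.

```python
def desired_quota_value(current_quota: int, instances_needed: int) -> int:
    """Return the quota value to request: the larger of the current
    quota and the number of instances you plan to run."""
    return max(current_quota, instances_needed)


def request_increase(quota_code: str, instances_needed: int) -> None:
    """Request a SageMaker quota increase via the Service Quotas API.
    Requires AWS credentials; quota_code comes from the console."""
    import boto3  # imported here so the helper above stays testable offline

    client = boto3.client("service-quotas")
    current = client.get_service_quota(
        ServiceCode="sagemaker", QuotaCode=quota_code
    )["Quota"]["Value"]
    new_value = desired_quota_value(int(current), instances_needed)
    if new_value > current:
        client.request_service_quota_increase(
            ServiceCode="sagemaker",
            QuotaCode=quota_code,
            DesiredValue=float(new_value),
        )
```

As with the console flow, the request opens a support case that you can track on the **Request history** tab.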

## Setting up SageMaker HyperPod with a custom Amazon VPC
<a name="sagemaker-hyperpod-prerequisites-optional-vpc"></a>

To set up a SageMaker HyperPod cluster with a custom Amazon VPC, review the following prerequisites.

**Note**  
VPC configuration is mandatory for Amazon EKS orchestration. For Slurm orchestration, VPC setup is optional.
+  Validate [Elastic Network Interface](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html) (ENI) capacity in your AWS account before creating a SageMaker HyperPod cluster with a custom VPC. The ENI limit is controlled by Amazon EC2 and varies by AWS Region. SageMaker HyperPod cannot automatically request quota increases. 

**To verify your current ENI quota:**

  1. Open the [Service Quotas console](https://console.aws.amazon.com/servicequotas/).

  1. In the **Manage quotas** section, use the **AWS services** drop-down list to search for **VPC**. 

  1. Choose **Amazon Virtual Private Cloud (Amazon VPC)** to view its quotas. 

  1. Look for the service quota **Network interfaces per Region** or the **Quota code** `L-DF5E4CA3`.

  If your current ENI limit is insufficient for your SageMaker HyperPod cluster needs, request a quota increase. Ensuring adequate ENI capacity beforehand helps prevent cluster deployment failures.
+ When using a custom VPC to connect a SageMaker HyperPod cluster with AWS resources, provide the VPC name, ID, AWS Region, subnet IDs, and security group IDs during cluster creation.
**Note**  
When your Amazon VPC and subnets support IPv6 in the [VpcConfig](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_CreateCluster.html#sagemaker-CreateCluster-request-VpcConfig) of the cluster, or at the instance group level using the `OverrideVPCConfig` attribute of [ClusterInstanceGroupSpecification](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_ClusterInstanceGroupSpecification.html), network communications differ based on the cluster orchestration platform:  
Slurm-orchestrated clusters automatically configure nodes with dual IPv6 and IPv4 addresses, allowing immediate IPv6 network communications. No additional configuration is required beyond the `VPCConfig` IPv6 settings.
In EKS-orchestrated clusters, nodes receive dual-stack addressing, but pods can only use IPv6 when the Amazon EKS cluster is explicitly IPv6-enabled. You must create a new IPv6 Amazon EKS cluster; existing IPv4 Amazon EKS clusters cannot be converted to IPv6. For information about deploying an IPv6 Amazon EKS cluster, see [Amazon EKS IPv6 Cluster Deployment](https://docs.aws.amazon.com/eks/latest/userguide/deploy-ipv6-cluster.html#_deploy_an_ipv6_cluster_with_eksctl).
Additional resources for IPv6 configuration:  
For information about adding IPv6 support to your VPC, see [IPv6 Support for VPC](https://docs.aws.amazon.com//vpc/latest/userguide/vpc-migrate-ipv6.html).
For information about creating a new IPv6-compatible VPC, see [Amazon VPC Creation Guide](https://docs.aws.amazon.com//vpc/latest/userguide/create-vpc.html).
To configure SageMaker HyperPod with a custom Amazon VPC, see [Custom Amazon VPC setup for SageMaker HyperPod](https://docs.aws.amazon.com//sagemaker/latest/dg/sagemaker-hyperpod-prerequisites.html#sagemaker-hyperpod-prerequisites-optional-vpc).
+ Make sure that all resources are deployed in the same AWS Region as the SageMaker HyperPod cluster. Configure security group rules to allow inter-resource communication within the VPC. For example, when creating a VPC in `us-west-2`, provision subnets across one or more Availability Zones (such as `us-west-2a` or `us-west-2b`), and create a security group allowing intra-group traffic.
**Note**  
SageMaker HyperPod supports multi-Availability Zone deployment. For more information, see [Setting up SageMaker HyperPod clusters across multiple AZs](#sagemaker-hyperpod-prerequisites-multiple-availability-zones).
+ Establish Amazon Simple Storage Service (Amazon S3) connectivity for VPC-deployed SageMaker HyperPod instance groups by creating a VPC endpoint. Without internet access, instance groups cannot store or retrieve lifecycle scripts, training data, or model artifacts. We recommend that you create a custom IAM policy restricting Amazon S3 bucket access to the private VPC. For more information, see [Endpoints for Amazon S3](https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-endpoints-s3.html) in the *AWS PrivateLink Guide*.
+ For HyperPod clusters using Elastic Fabric Adapter (EFA)-enabled instances, configure the security group to allow all inbound and outbound traffic to and from the security group itself. Specifically, avoid using `0.0.0.0/0` for outbound rules, as this may cause EFA health check failures. For more information about EFA security group preparation guidelines, see [Step 1: Prepare an EFA-enabled security group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security) in the *Amazon EC2 User Guide*.
+ Consider your subnet's Classless Inter-Domain Routing (CIDR) block size carefully before creating HyperPod clusters.
  + The subnet CIDR block size cannot be changed after creation. This is especially important when you use large accelerated instances like P5. Without sufficient block size, you must recreate your clusters when scaling up.
  + When choosing the appropriate subnet CIDR block size, consider these factors: your instance types, expected number of instances, and the number of IP addresses consumed by each instance.
  + For Slurm-orchestrated clusters, each P5 instance can create 32 IP addresses (one per network card). For EKS-orchestrated clusters, each P5 instance can create 81 IP addresses (50 from the primary card plus one from each of the remaining 31 cards). For detailed specifications, see [Network specifications](https://docs.aws.amazon.com/ec2/latest/instancetypes/ac.html#ac_network) in the *Amazon EC2 Instance Types Developer Guide*.
  + For examples of CloudFormation templates that specify the subnet CIDR block size, see the [HyperPod Slurm template](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/sagemaker-hyperpod.yaml) and [HyperPod Amazon EKS template](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/7.sagemaker-hyperpod-eks/cfn-templates/nested-stacks/private-subnet-stack.yaml) in the [awsome-distributed-training repository](https://github.com/aws-samples/awsome-distributed-training/tree/main).
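As a planning aid, the smallest workable subnet size can be estimated from the factors above. This is a sketch, not official sizing guidance: it uses the per-instance IP counts quoted above (32 for P5 on Slurm, 81 for P5 on EKS), subtracts the five addresses AWS reserves in every subnet, and applies a headroom multiplier that is an assumption you should tune for your growth plans.

```python
def smallest_cidr_prefix(instances: int, ips_per_instance: int,
                         headroom: float = 2.0) -> int:
    """Return the largest /prefix (smallest subnet) whose usable
    addresses (2^(32 - prefix), minus the 5 addresses AWS reserves
    per subnet) cover the planned instances with the given headroom."""
    needed = int(instances * ips_per_instance * headroom)
    for prefix in range(28, 15, -1):  # try /28 down to /16
        usable = 2 ** (32 - prefix) - 5
        if usable >= needed:
            return prefix
    raise ValueError("cluster too large for a single /16 subnet")
```

For example, 16 P5 instances on an EKS-orchestrated cluster (81 IPs each, 2x headroom) need at least a /20 subnet, so a /24 default would force you to recreate the cluster when scaling up.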

## Setting up SageMaker HyperPod clusters across multiple AZs
<a name="sagemaker-hyperpod-prerequisites-multiple-availability-zones"></a>

You can configure your SageMaker HyperPod clusters across multiple Availability Zones (AZs) to improve reliability and availability.

**Note**  
Elastic Fabric Adapter (EFA) traffic cannot cross AZs or VPCs. This does not apply to normal IP traffic from the ENA device of an EFA interface. For more information, see [EFA limitations](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html).
+ **Default behavior**

  HyperPod deploys all cluster instances in a single Availability Zone. The VPC configuration determines the deployment AZ:
  + For Slurm-orchestrated clusters, VPC configuration is optional. When no VPC configuration is provided, HyperPod defaults to one subnet from the platform VPC. 
  + For EKS-orchestrated clusters, VPC configuration is required.
  + For both Slurm and EKS orchestrators, when [VpcConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_VpcConfig.html) is provided, HyperPod selects a subnet from the provided `VpcConfig`'s subnet list. All instance groups inherit the subnet's AZ. 
**Note**  
Once you create a cluster, you cannot modify its `VpcConfig` settings.

  To learn more about configuring VPCs for HyperPod clusters, see the preceding section, [Setting up SageMaker HyperPod with a custom Amazon VPC](#sagemaker-hyperpod-prerequisites-optional-vpc).
+ **Multi-AZ configuration**

  You can set up your HyperPod cluster across multiple AZs when creating a cluster or when adding a new instance group to an existing cluster. To configure multi-AZ deployments, you can override the default VPC settings of the cluster by specifying different subnets and security groups, potentially across different Availability Zones, for individual instance groups within your cluster. 

  SageMaker HyperPod API users can use the `OverrideVpcConfig` property within the [ClusterInstanceGroupSpecification](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ClusterInstanceGroupSpecification.html) when working with the [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) or [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) APIs.

  The `OverrideVpcConfig` field:
  + Cannot be modified after the instance group is created.
  + Is optional. If not specified, the cluster-level [VpcConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_VpcConfig.html) is used as the default.
  + For Slurm-orchestrated clusters, can only be specified when cluster level `VpcConfig` is provided. If no `VpcConfig` is specified at cluster level, `OverrideVpcConfig` cannot be used for any instance group.
  + Contains two required fields:
    + `Subnets` - accepts between 1 and 16 subnet IDs
    + `SecurityGroupIds` - accepts between 1 and 5 security group IDs

  For more information about creating or updating a SageMaker HyperPod cluster using the SageMaker HyperPod console UI or the AWS CLI:
  + Slurm orchestration: See [Operating Slurm-orchestrated HyperPod clusters](sagemaker-hyperpod-operate-slurm.md).
  + EKS orchestration: See [Operating EKS-orchestrated HyperPod clusters](sagemaker-hyperpod-eks-operate.md).
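The `OverrideVpcConfig` limits above can be validated client-side before you call the APIs. This is a minimal sketch; the subnet and security group IDs are placeholders.

```python
def override_vpc_config(subnets: list, security_groups: list) -> dict:
    """Build an OverrideVpcConfig block, enforcing the documented
    limits: 1-16 subnet IDs and 1-5 security group IDs."""
    if not 1 <= len(subnets) <= 16:
        raise ValueError("Subnets must contain between 1 and 16 IDs")
    if not 1 <= len(security_groups) <= 5:
        raise ValueError("SecurityGroupIds must contain between 1 and 5 IDs")
    return {"Subnets": subnets, "SecurityGroupIds": security_groups}
```

The returned dictionary is what you would pass as the `OverrideVpcConfig` field of a `ClusterInstanceGroupSpecification` entry in a `CreateCluster` or `UpdateCluster` request.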

**Note**  
When running workloads across multiple AZs, be aware that network communication between AZs introduces additional latency. Consider this impact when designing latency-sensitive applications.

## Setting up AWS Systems Manager and Run As for cluster user access control
<a name="sagemaker-hyperpod-prerequisites-ssm"></a>

[SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami) comes with [AWS Systems Manager](https://aws.amazon.com/systems-manager/) (SSM) out of the box to help you manage access to your SageMaker HyperPod cluster instance groups. This section describes how to create operating system (OS) users in your SageMaker HyperPod clusters and associate them with IAM users and roles. This is useful to authenticate SSM sessions using the credentials of the OS user account.

**Note**  
Granting users access to HyperPod cluster nodes allows them to install and operate user-managed software on the nodes. Ensure that you maintain the principle of least-privilege permissions for users.

### Enabling Run As in your AWS account
<a name="sagemaker-hyperpod-prerequisites-ssm-enable-runas"></a>

As an AWS account admin or a cloud administrator, you can manage access to SageMaker HyperPod clusters at an IAM role or user level by using the [Run As feature in SSM](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-preferences-run-as.html). With this feature, you can start each SSM session using the OS user associated with the IAM role or user.

To enable Run As in your AWS account, follow the steps in [Turn on Run As support for Linux and macOS managed nodes](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-preferences-run-as.html). If you already created OS users in your cluster, make sure that you associate them with IAM roles or users by tagging them as guided in **Option 2** of step 5 under **To turn on Run As support for Linux and macOS managed nodes**.
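The association made in Option 2 of those steps is a tag on the IAM role or user: Systems Manager looks for the tag key `SSMSessionRunAs`, whose value is the OS user name. As a scripted alternative (a sketch; the role name and OS user below are placeholders), you can apply the tag with `boto3`:

```python
def run_as_tag(os_user: str) -> list:
    """Tag list associating an IAM principal with an OS user for
    SSM Run As sessions (tag key defined by Systems Manager)."""
    return [{"Key": "SSMSessionRunAs", "Value": os_user}]


def tag_role_for_run_as(role_name: str, os_user: str) -> None:
    """Apply the Run As tag to an IAM role. Requires AWS credentials."""
    import boto3  # imported here so the helper above stays testable offline

    boto3.client("iam").tag_role(RoleName=role_name, Tags=run_as_tag(os_user))
```

For an IAM user rather than a role, the equivalent call is `tag_user` with the same tag list.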

## (Optional) Setting up SageMaker HyperPod with Amazon FSx for Lustre
<a name="sagemaker-hyperpod-prerequisites-optional-fsx"></a>

To start using SageMaker HyperPod and mapping data paths between the cluster and your FSx for Lustre file system, select one of the AWS Regions supported by SageMaker HyperPod. After choosing the AWS Region you prefer, you also should determine which Availability Zone (AZ) to use. 

If you use SageMaker HyperPod compute nodes in AZs different from the AZs where your FSx for Lustre file system is set up within the same AWS Region, there might be communication and network overhead. We recommend that you use the same physical AZ as the one for the SageMaker HyperPod service account to avoid any cross-AZ traffic between SageMaker HyperPod clusters and your FSx for Lustre file system. Also, make sure that you have configured the file system with your VPC. If you want to use Amazon FSx as the main file system for storage, you must configure SageMaker HyperPod clusters with your VPC.

# AWS Identity and Access Management for SageMaker HyperPod
<a name="sagemaker-hyperpod-prerequisites-iam"></a>

AWS Identity and Access Management (IAM) is an AWS service that helps an administrator securely control access to AWS resources. IAM administrators control who can be *authenticated* (signed in) and *authorized* (have permissions) to use SageMaker HyperPod resources. IAM is an AWS service that you can use with no additional charge.

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

Let's assume that there are two main layers of SageMaker HyperPod users: *cluster admin users* and *data scientist users*.
+ **Cluster admin users** – Are responsible for creating and managing SageMaker HyperPod clusters. This includes configuring the HyperPod clusters and managing user access to them.
  + Create and configure SageMaker HyperPod clusters with Slurm or Amazon EKS.
  + Create and configure IAM roles for data scientist users and HyperPod cluster resources.
  + For SageMaker HyperPod orchestration with Amazon EKS, create and configure [EKS access entries](https://docs.aws.amazon.com/eks/latest/userguide/access-entries.html), [role-based access control (RBAC)](sagemaker-hyperpod-eks-setup-rbac.md), and Pod Identity to fulfill data science use cases.
+ **Data scientist users** – Focus on ML model training. They use the open-source orchestrator or the SageMaker HyperPod CLI to submit and manage training jobs.
  + Assume and use the IAM Role provided by cluster admin users.
  + Interact with the open-source orchestrator CLIs supported by SageMaker HyperPod (Slurm or Kubernetes) or the SageMaker HyperPod CLI to check cluster capacity, connect to clusters, and submit workloads.

Set up IAM roles for cluster admins by attaching the right permissions or policies to operate SageMaker HyperPod clusters. Cluster admins should also create IAM roles for SageMaker HyperPod resources to assume so that they can run and communicate with the necessary AWS resources, such as Amazon S3, Amazon CloudWatch, and AWS Systems Manager (SSM). Finally, the AWS account admin or the cluster admins should grant data scientists permissions to access the SageMaker HyperPod clusters and run ML workloads.
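For the roles that SageMaker HyperPod resources assume, the trust policy must name the SageMaker service principal. The following sketch shows one way to create such a role with `boto3`; the role name is a placeholder, and attaching the permissions policies the role needs (Amazon S3, CloudWatch, SSM, and so on) is omitted.

```python
import json


def hyperpod_trust_policy() -> dict:
    """Trust policy allowing the SageMaker service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def create_execution_role(role_name: str) -> None:
    """Create an IAM role for SageMaker HyperPod resources to assume.
    Requires AWS credentials; the role name is up to you."""
    import boto3  # imported here so the helper above stays testable offline

    boto3.client("iam").create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(hyperpod_trust_policy()),
    )
```

After creating the role, attach or inline the permissions policies it needs and pass its ARN when creating the cluster.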

Depending on which orchestrator you choose, the permissions needed for the cluster admin and data scientists may vary. You can also control the scope of permissions for various actions in the roles using the condition keys per service. Use the following Service Authorization References to add detailed scope for the services related to SageMaker HyperPod.
+ [Amazon Elastic Compute Cloud](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonec2.html)
+ [Amazon Elastic Container Registry](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonelasticcontainerregistry.html) (for SageMaker HyperPod cluster orchestration with Amazon EKS)
+ [Amazon Elastic Kubernetes Service](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonelastickubernetesservice.html) (for SageMaker HyperPod cluster orchestration with Amazon EKS)
+ [Amazon FSx](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonfsx.html)
+ [AWS IAM Identity Center (successor to AWS Single Sign-On)](https://docs.aws.amazon.com/service-authorization/latest/reference/list_awsiamidentitycentersuccessortoawssinglesign-on.html)
+ [AWS Identity and Access Management (IAM)](https://docs.aws.amazon.com/service-authorization/latest/reference/list_awsidentityandaccessmanagementiam.html)
+ [Amazon Simple Storage Service](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazons3.html)
+ [Amazon SageMaker AI](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonsagemaker.html)
+ [AWS Systems Manager](https://docs.aws.amazon.com/service-authorization/latest/reference/list_awssystemsmanager.html)

**Topics**
+ [IAM permissions for cluster creation](#sagemaker-hyperpod-prerequisites-iam-cluster-creation)
+ [IAM users for cluster admin](#sagemaker-hyperpod-prerequisites-iam-cluster-admin)
+ [IAM users for scientists](#sagemaker-hyperpod-prerequisites-iam-cluster-user)
+ [IAM role for SageMaker HyperPod](#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod)

## IAM permissions for cluster creation
<a name="sagemaker-hyperpod-prerequisites-iam-cluster-creation"></a>

Creating HyperPod clusters requires the IAM permissions outlined in the following policy example. If your AWS account has [AdministratorAccess](https://docs.aws.amazon.com//aws-managed-policy/latest/reference/AdministratorAccess.html) permissions, these permissions are granted by default.

------
#### [ JSON ]


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateCluster",
                "sagemaker:DeleteCluster",
                "sagemaker:UpdateCluster"
            ],
            "Resource": "arn:aws:sagemaker:*:*:cluster/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:AddTags"
            ],
            "Resource": "arn:aws:sagemaker:*:*:cluster/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListTags",
                "sagemaker:ListClusters",
                "sagemaker:ListClusterNodes",
                "sagemaker:ListComputeQuotas",
                "sagemaker:ListTrainingPlans",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudformation:CreateStack",
                "cloudformation:UpdateStack",
                "cloudformation:DeleteStack",
                "cloudformation:ContinueUpdateRollback",
                "cloudformation:SetStackPolicy",
                "cloudformation:ValidateTemplate",
                "cloudformation:DescribeStacks",
                "cloudformation:DescribeStackEvents",
                "cloudformation:Get*",
                "cloudformation:List*"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::*:role/sagemaker-*",
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": [
                        "sagemaker.amazonaws.com",
                        "eks.amazonaws.com",
                        "lambda.amazonaws.com"
                    ]
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:PassRole",
                "iam:GetRole"
            ],
            "Resource": "arn:aws:iam::*:role/*",
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": [
                        "sagemaker.amazonaws.com",
                        "eks.amazonaws.com",
                        "lambda.amazonaws.com",
                        "cloudformation.amazonaws.com"
                    ]
                }
            }
        },
        {
            "Sid": "AmazonVPCFullAccess",
            "Effect": "Allow",
            "Action": [
                "ec2:AcceptVpcPeeringConnection",
                "ec2:AcceptVpcEndpointConnections",
                "ec2:AllocateAddress",
                "ec2:AssignIpv6Addresses",
                "ec2:AssignPrivateIpAddresses",
                "ec2:AssociateAddress",
                "ec2:AssociateDhcpOptions",
                "ec2:AssociateRouteTable",
                "ec2:AssociateSecurityGroupVpc",
                "ec2:AssociateSubnetCidrBlock",
                "ec2:AssociateVpcCidrBlock",
                "ec2:AttachClassicLinkVpc",
                "ec2:AttachInternetGateway",
                "ec2:AttachNetworkInterface",
                "ec2:AttachVpnGateway",
                "ec2:AuthorizeSecurityGroupEgress",
                "ec2:AuthorizeSecurityGroupIngress",
                "ec2:CreateCarrierGateway",
                "ec2:CreateCustomerGateway",
                "ec2:CreateDefaultSubnet",
                "ec2:CreateDefaultVpc",
                "ec2:CreateDhcpOptions",
                "ec2:CreateEgressOnlyInternetGateway",
                "ec2:CreateFlowLogs",
                "ec2:CreateInternetGateway",
                "ec2:CreateLocalGatewayRouteTableVpcAssociation",
                "ec2:CreateNatGateway",
                "ec2:CreateNetworkAcl",
                "ec2:CreateNetworkAclEntry",
                "ec2:CreateNetworkInterface",
                "ec2:CreateNetworkInterfacePermission",
                "ec2:CreateRoute",
                "ec2:CreateRouteTable",
                "ec2:CreateSecurityGroup",
                "ec2:CreateSubnet",
                "ec2:CreateTags",
                "ec2:CreateVpc",
                "ec2:CreateVpcEndpoint",
                "ec2:CreateVpcEndpointConnectionNotification",
                "ec2:CreateVpcEndpointServiceConfiguration",
                "ec2:CreateVpcPeeringConnection",
                "ec2:CreateVpnConnection",
                "ec2:CreateVpnConnectionRoute",
                "ec2:CreateVpnGateway",
                "ec2:DeleteCarrierGateway",
                "ec2:DeleteCustomerGateway",
                "ec2:DeleteDhcpOptions",
                "ec2:DeleteEgressOnlyInternetGateway",
                "ec2:DeleteFlowLogs",
                "ec2:DeleteInternetGateway",
                "ec2:DeleteLocalGatewayRouteTableVpcAssociation",
                "ec2:DeleteNatGateway",
                "ec2:DeleteNetworkAcl",
                "ec2:DeleteNetworkAclEntry",
                "ec2:DeleteNetworkInterface",
                "ec2:DeleteNetworkInterfacePermission",
                "ec2:DeleteRoute",
                "ec2:DeleteRouteTable",
                "ec2:DeleteSecurityGroup",
                "ec2:DeleteSubnet",
                "ec2:DeleteTags",
                "ec2:DeleteVpc",
                "ec2:DeleteVpcEndpoints",
                "ec2:DeleteVpcEndpointConnectionNotifications",
                "ec2:DeleteVpcEndpointServiceConfigurations",
                "ec2:DeleteVpcPeeringConnection",
                "ec2:DeleteVpnConnection",
                "ec2:DeleteVpnConnectionRoute",
                "ec2:DeleteVpnGateway",
                "ec2:DescribeAccountAttributes",
                "ec2:DescribeAddresses",
                "ec2:DescribeAvailabilityZones",
                "ec2:DescribeCarrierGateways",
                "ec2:DescribeClassicLinkInstances",
                "ec2:DescribeCustomerGateways",
                "ec2:DescribeDhcpOptions",
                "ec2:DescribeEgressOnlyInternetGateways",
                "ec2:DescribeFlowLogs",
                "ec2:DescribeInstances",
                "ec2:DescribeInternetGateways",
                "ec2:DescribeIpv6Pools",
                "ec2:DescribeLocalGatewayRouteTables",
                "ec2:DescribeLocalGatewayRouteTableVpcAssociations",
                "ec2:DescribeKeyPairs",
                "ec2:DescribeMovingAddresses",
                "ec2:DescribeNatGateways",
                "ec2:DescribeNetworkAcls",
                "ec2:DescribeNetworkInterfaceAttribute",
                "ec2:DescribeNetworkInterfacePermissions",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribePrefixLists",
                "ec2:DescribeRouteTables",
                "ec2:DescribeSecurityGroupReferences",
                "ec2:DescribeSecurityGroupRules",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSecurityGroupVpcAssociations",
                "ec2:DescribeStaleSecurityGroups",
                "ec2:DescribeSubnets",
                "ec2:DescribeTags",
                "ec2:DescribeVpcAttribute",
                "ec2:DescribeVpcClassicLink",
                "ec2:DescribeVpcClassicLinkDnsSupport",
                "ec2:DescribeVpcEndpointConnectionNotifications",
                "ec2:DescribeVpcEndpointConnections",
                "ec2:DescribeVpcEndpoints",
                "ec2:DescribeVpcEndpointServiceConfigurations",
                "ec2:DescribeVpcEndpointServicePermissions",
                "ec2:DescribeVpcEndpointServices",
                "ec2:DescribeVpcPeeringConnections",
                "ec2:DescribeVpcs",
                "ec2:DescribeVpnConnections",
                "ec2:DescribeVpnGateways",
                "ec2:DetachClassicLinkVpc",
                "ec2:DetachInternetGateway",
                "ec2:DetachNetworkInterface",
                "ec2:DetachVpnGateway",
                "ec2:DisableVgwRoutePropagation",
                "ec2:DisableVpcClassicLink",
                "ec2:DisableVpcClassicLinkDnsSupport",
                "ec2:DisassociateAddress",
                "ec2:DisassociateRouteTable",
                "ec2:DisassociateSecurityGroupVpc",
                "ec2:DisassociateSubnetCidrBlock",
                "ec2:DisassociateVpcCidrBlock",
                "ec2:EnableVgwRoutePropagation",
                "ec2:EnableVpcClassicLink",
                "ec2:EnableVpcClassicLinkDnsSupport",
                "ec2:GetSecurityGroupsForVpc",
                "ec2:ModifyNetworkInterfaceAttribute",
                "ec2:ModifySecurityGroupRules",
                "ec2:ModifySubnetAttribute",
                "ec2:ModifyVpcAttribute",
                "ec2:ModifyVpcEndpoint",
                "ec2:ModifyVpcEndpointConnectionNotification",
                "ec2:ModifyVpcEndpointServiceConfiguration",
                "ec2:ModifyVpcEndpointServicePermissions",
                "ec2:ModifyVpcPeeringConnectionOptions",
                "ec2:ModifyVpcTenancy",
                "ec2:MoveAddressToVpc",
                "ec2:RejectVpcEndpointConnections",
                "ec2:RejectVpcPeeringConnection",
                "ec2:ReleaseAddress",
                "ec2:ReplaceNetworkAclAssociation",
                "ec2:ReplaceNetworkAclEntry",
                "ec2:ReplaceRoute",
                "ec2:ReplaceRouteTableAssociation",
                "ec2:ResetNetworkInterfaceAttribute",
                "ec2:RestoreAddressToClassic",
                "ec2:RevokeSecurityGroupEgress",
                "ec2:RevokeSecurityGroupIngress",
                "ec2:UnassignIpv6Addresses",
                "ec2:UnassignPrivateIpAddresses",
                "ec2:UpdateSecurityGroupRuleDescriptionsEgress",
                "ec2:UpdateSecurityGroupRuleDescriptionsIngress"
            ],
            "Resource": "*"
        },
        {
            "Sid": "CloudWatchPermissions",
            "Effect": "Allow",
            "Action": [
                "cloudwatch:*",
                "logs:*",
                "sns:CreateTopic",
                "sns:ListSubscriptions",
                "sns:ListSubscriptionsByTopic",
                "sns:ListTopics",
                "sns:Subscribe",
                "iam:GetPolicy",
                "iam:GetPolicyVersion",
                "iam:GetRole",
                "oam:ListSinks",
                "rum:*",
                "synthetics:*",
                "xray:*"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket",
                "s3:DeleteBucket",
                "s3:PutBucketPolicy",
                "s3:PutBucketTagging",
                "s3:PutBucketPublicAccessBlock",
                "s3:PutBucketLogging",
                "s3:DeleteBucketPolicy",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:PutEncryptionConfiguration",
                "s3:AbortMultipartUpload",
                "s3:Get*",
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::*",
                "arn:aws:s3:::*/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "eks:CreateCluster",
                "eks:DeleteCluster",
                "eks:CreateNodegroup",
                "eks:DeleteNodegroup",
                "eks:UpdateNodegroupConfig",
                "eks:UpdateNodegroupVersion",
                "eks:UpdateClusterConfig",
                "eks:UpdateClusterVersion",
                "eks:CreateFargateProfile",
                "eks:DeleteFargateProfile",
                "eks:CreateAddon",
                "eks:DeleteAddon",
                "eks:UpdateAddon",
                "eks:CreateAccessEntry",
                "eks:DeleteAccessEntry",
                "eks:UpdateAccessEntry",
                "eks:AssociateAccessPolicy",
                "eks:AssociateIdentityProviderConfig",
                "eks:DisassociateIdentityProviderConfig",
                "eks:TagResource",
                "eks:UntagResource",
                "eks:AccessKubernetesApi",
                "eks:Describe*",
                "eks:List*"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:GetParameter",
                "ssm:PutParameter",
                "ssm:DeleteParameter",
                "ssm:DescribeParameters"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:GenerateDataKey"
            ],
            "Resource": "*",
            "Condition": {
                "StringLike": {
                    "kms:ViaService": [
                        "sagemaker.*.amazonaws.com",
                        "ec2.*.amazonaws.com",
                        "s3.*.amazonaws.com",
                        "eks.*.amazonaws.com"
                    ]
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "lambda:CreateFunction",
                "lambda:DeleteFunction",
                "lambda:GetFunction",
                "lambda:UpdateFunctionCode",
                "lambda:UpdateFunctionConfiguration",
                "lambda:AddPermission",
                "lambda:RemovePermission",
                "lambda:PublishLayerVersion",
                "lambda:DeleteLayerVersion",
                "lambda:InvokeFunction",
                "lambda:Get*",
                "lambda:List*",
                "lambda:TagResource"
            ],
            "Resource": [
                "arn:aws:lambda:*:*:function:*",
                "arn:aws:lambda:*:*:layer:*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:DeleteRole",
                "iam:DeleteRolePolicy"
            ],
            "Resource": [
                "arn:aws:iam::*:role/*sagemaker*",
                "arn:aws:iam::*:role/*eks*",
                "arn:aws:iam::*:role/*hyperpod*",
                "arn:aws:iam::*:policy/*sagemaker*",
                "arn:aws:iam::*:policy/*hyperpod*",
                "arn:aws:iam::*:role/*LifeCycleScriptStack*",
                "arn:aws:iam::*:role/*LifeCycleScript*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:CreateRole",
                "iam:TagRole",
                "iam:PutRolePolicy",
                "iam:Get*",
                "iam:List*",
                "iam:AttachRolePolicy",
                "iam:DetachRolePolicy"
            ],
            "Resource": [
                "arn:aws:iam::*:role/*",
                "arn:aws:iam::*:policy/*"
            ]
        },
        {
            "Sid": "FullAccessToFSx",
            "Effect": "Allow",
            "Action": [
                "fsx:AssociateFileGateway",
                "fsx:AssociateFileSystemAliases",
                "fsx:CancelDataRepositoryTask",
                "fsx:CopyBackup",
                "fsx:CopySnapshotAndUpdateVolume",
                "fsx:CreateAndAttachS3AccessPoint",
                "fsx:CreateBackup",
                "fsx:CreateDataRepositoryAssociation",
                "fsx:CreateDataRepositoryTask",
                "fsx:CreateFileCache",
                "fsx:CreateFileSystem",
                "fsx:CreateFileSystemFromBackup",
                "fsx:CreateSnapshot",
                "fsx:CreateStorageVirtualMachine",
                "fsx:CreateVolume",
                "fsx:CreateVolumeFromBackup",
                "fsx:DetachAndDeleteS3AccessPoint",
                "fsx:DeleteBackup",
                "fsx:DeleteDataRepositoryAssociation",
                "fsx:DeleteFileCache",
                "fsx:DeleteFileSystem",
                "fsx:DeleteSnapshot",
                "fsx:DeleteStorageVirtualMachine",
                "fsx:DeleteVolume",
                "fsx:DescribeAssociatedFileGateways",
                "fsx:DescribeBackups",
                "fsx:DescribeDataRepositoryAssociations",
                "fsx:DescribeDataRepositoryTasks",
                "fsx:DescribeFileCaches",
                "fsx:DescribeFileSystemAliases",
                "fsx:DescribeFileSystems",
                "fsx:DescribeS3AccessPointAttachments",
                "fsx:DescribeSharedVpcConfiguration",
                "fsx:DescribeSnapshots",
                "fsx:DescribeStorageVirtualMachines",
                "fsx:DescribeVolumes",
                "fsx:DisassociateFileGateway",
                "fsx:DisassociateFileSystemAliases",
                "fsx:ListTagsForResource",
                "fsx:ManageBackupPrincipalAssociations",
                "fsx:ReleaseFileSystemNfsV3Locks",
                "fsx:RestoreVolumeFromSnapshot",
                "fsx:TagResource",
                "fsx:UntagResource",
                "fsx:UpdateDataRepositoryAssociation",
                "fsx:UpdateFileCache",
                "fsx:UpdateFileSystem",
                "fsx:UpdateSharedVpcConfiguration",
                "fsx:UpdateSnapshot",
                "fsx:UpdateStorageVirtualMachine",
                "fsx:UpdateVolume"
            ],
            "Resource": "*"
        }
    ]
}
```

------

## IAM users for cluster admin
<a name="sagemaker-hyperpod-prerequisites-iam-cluster-admin"></a>

Cluster administrators (admins) operate and configure SageMaker HyperPod clusters, performing the tasks in [SageMaker HyperPod Slurm cluster operations](sagemaker-hyperpod-operate-slurm.md). The following policy example includes the minimum set of permissions for cluster administrators to run the SageMaker HyperPod core APIs and manage SageMaker HyperPod clusters within your AWS account.

**Note**  
IAM users with cluster admin roles can use condition keys to provide granular access control when managing SageMaker HyperPod cluster resources, specifically for the `CreateCluster` and `UpdateCluster` actions. To find the condition keys supported for these actions, search for `CreateCluster` or `UpdateCluster` in [Actions defined by SageMaker AI](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonsagemaker.html#amazonsagemaker-actions-as-permissions).
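
For example, a statement like the following would limit `CreateCluster` to specific instance types. This is a sketch that assumes the multivalued `sagemaker:InstanceTypes` condition key applies to the action; confirm the supported keys in the reference linked above before using it.

```
{
    "Effect": "Allow",
    "Action": "sagemaker:CreateCluster",
    "Resource": "*",
    "Condition": {
        "ForAllValues:StringEquals": {
            "sagemaker:InstanceTypes": [
                "ml.g5.8xlarge",
                "ml.trn1.32xlarge"
            ]
        }
    }
}
```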

------
#### [ Slurm ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateCluster",
                "sagemaker:ListClusters"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:DeleteCluster",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:UpdateCluster",
                "sagemaker:UpdateClusterSoftware",
                "sagemaker:BatchDeleteClusterNodes"
            ],
            "Resource": "arn:aws:sagemaker:us-east-1:111122223333:cluster/*"
        }
    ]
}
```

------
#### [ Amazon EKS ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::111122223333:role/execution-role-name"
        },
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateCluster",
                "sagemaker:DeleteCluster",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:ListClusters",
                "sagemaker:UpdateCluster",
                "sagemaker:UpdateClusterSoftware",
                "sagemaker:BatchAddClusterNodes",
                "sagemaker:BatchDeleteClusterNodes",
                "sagemaker:ListComputeQuotas",
                "sagemaker:ListClusterSchedulerConfigs",
                "sagemaker:DeleteClusterSchedulerConfig",
                "sagemaker:DeleteComputeQuota",
                "eks:DescribeCluster",
                "eks:CreateAccessEntry",
                "eks:DescribeAccessEntry",
                "eks:DeleteAccessEntry",
                "eks:AssociateAccessPolicy",
                "iam:CreateServiceLinkedRole"
            ],
            "Resource": "*"
        }
    ]
}
```

------

To grant permissions to access the SageMaker AI console, use the sample policy provided at [Permissions required to use the Amazon SageMaker AI console](https://docs.aws.amazon.com/sagemaker/latest/dg/security_iam_id-based-policy-examples.html#console-permissions).

To grant permissions to access the AWS Systems Manager console, use the sample policy provided at [Using the AWS Systems Manager console](https://docs.aws.amazon.com/systems-manager/latest/userguide/security_iam_id-based-policy-examples.html#security_iam_id-based-policy-examples-console) in the *AWS Systems Manager User Guide*.

You might also consider attaching the [`AmazonSageMakerFullAccess`](security-iam-awsmanpol.md#security-iam-awsmanpol-AmazonSageMakerFullAccess) policy to the role; however, note that this policy grants permissions for all SageMaker API calls, features, and resources.

For guidance on IAM users in general, see [IAM users](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users.html) in the *AWS Identity and Access Management User Guide*.

## IAM users for scientists
<a name="sagemaker-hyperpod-prerequisites-iam-cluster-user"></a>

Scientists log in to and run ML workloads on SageMaker HyperPod cluster nodes provisioned by cluster admins. For scientists in your AWS account, grant the `"ssm:StartSession"` permission so that they can run the SSM `start-session` command. The following is an example policy for IAM users.

------
#### [ Slurm ]

Add the following policy to grant SSM session permissions to connect to an SSM target for all resources. This allows you to access HyperPod clusters.

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:StartSession",
                "ssm:TerminateSession"
            ],
            "Resource": "*"    
        }
    ]
}
```
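
With this policy in place, a scientist connects to a cluster node by passing an SSM target string to `aws ssm start-session`. The sketch below builds the `sagemaker-cluster:` target format documented for HyperPod SSM access from hypothetical placeholder IDs; substitute the values from your own cluster.

```
# Hypothetical cluster ID, instance group name, and instance ID.
CLUSTER_ID="abc123defg45"
INSTANCE_GROUP="controller-group"
INSTANCE_ID="i-0123456789abcdef0"

# HyperPod nodes are addressed as SSM targets in this form.
TARGET="sagemaker-cluster:${CLUSTER_ID}_${INSTANCE_GROUP}-${INSTANCE_ID}"
echo "${TARGET}"

# Then connect (requires the Session Manager plugin for the AWS CLI):
# aws ssm start-session --target "${TARGET}"
```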

------

------
#### [ Amazon EKS ]

Grant the following IAM permissions so that data scientists can run the `hyperpod list-clusters` and `hyperpod connect-cluster` HyperPod CLI commands. To learn more about the HyperPod CLI, see [Running jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-run-jobs.md). The policy also includes SSM session permissions to connect to an SSM target for all resources, which allows access to HyperPod clusters.

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DescribeHyperpodClusterPermissions",
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribeCluster"
            ],
            "Resource": "arn:aws:sagemaker:us-east-2:111122223333:cluster/hyperpod-cluster-name"
        },
        {
            "Sid": "UseEksClusterPermissions",
            "Effect": "Allow",
            "Action": [
                "eks:DescribeCluster"
            ],
            "Resource": "arn:aws:eks:us-east-2:111122223333:cluster/eks-cluster-name"
        },
        {
            "Sid": "ListClustersPermission",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListClusters"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:StartSession",
                "ssm:TerminateSession"
            ],
            "Resource": "*"    
        }
    ]
}
```

------

To grant data scientists IAM users or roles access to Kubernetes APIs in the cluster, see also [Grant IAM users and roles access to Kubernetes APIs](https://docs.aws.amazon.com/eks/latest/userguide/grant-k8s-access.html) in the *Amazon EKS User Guide*.

------

## IAM role for SageMaker HyperPod
<a name="sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod"></a>

For SageMaker HyperPod clusters to run and communicate with the necessary AWS resources, you need to create an IAM role for the HyperPod cluster to assume.

Start by attaching the [AWS managed policy: AmazonSageMakerHyperPodServiceRolePolicy](security-iam-awsmanpol-AmazonSageMakerHyperPodServiceRolePolicy.md). With this AWS managed policy, SageMaker HyperPod cluster instance groups assume the role to communicate with Amazon CloudWatch, Amazon S3, and the AWS Systems Manager Agent (SSM Agent). This managed policy is the minimum requirement for SageMaker HyperPod resources to run properly, so you must provide an IAM role with this policy to all instance groups.

**Tip**  
Depending on the level of permissions you want to grant to different instance groups, you can set up multiple IAM roles and attach them to different instance groups.  
When you set up access for scientists to specific cluster nodes through [AWS Systems Manager](https://aws.amazon.com/systems-manager/) (see also [Setting up AWS Systems Manager and Run As for cluster user access control](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-ssm)), the cluster nodes assume the role with the selective permissions you manually attach.

After you create the IAM roles, note their names and ARNs. You use the roles when creating a SageMaker HyperPod cluster to grant each instance group the permissions it needs to communicate with the necessary AWS resources.

------
#### [ Slurm ]

For HyperPod orchestrated with Slurm, you must attach the following managed policy to the SageMaker HyperPod IAM role.
+ [AmazonSageMakerClusterInstanceRolePolicy](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerClusterInstanceRolePolicy.html)

**(Optional) Additional permissions for using SageMaker HyperPod with Amazon Virtual Private Cloud**

If you want to use your own Amazon Virtual Private Cloud (VPC) instead of the default SageMaker AI VPC, you should add the following additional permissions to the IAM role for SageMaker HyperPod.

```
{
    "Effect": "Allow",
    "Action": [
        "ec2:CreateNetworkInterface",
        "ec2:CreateNetworkInterfacePermission",
        "ec2:DeleteNetworkInterface",
        "ec2:DeleteNetworkInterfacePermission",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeVpcs",
        "ec2:DescribeDhcpOptions",
        "ec2:DescribeSubnets",
        "ec2:DescribeSecurityGroups",
        "ec2:DetachNetworkInterface"
    ],
    "Resource": "*"
},
{
    "Effect": "Allow",
    "Action": "ec2:CreateTags",
    "Resource": [
        "arn:aws:ec2:*:*:network-interface/*"
    ]
}
```

The following list breaks down which permissions are needed to enable SageMaker HyperPod cluster functionalities when you configure the cluster with your own Amazon VPC.
+ The following `ec2` permissions are required to enable configuring a SageMaker HyperPod cluster with your VPC.

  ```
  {
      "Effect": "Allow",
      "Action": [
          "ec2:CreateNetworkInterface",
          "ec2:CreateNetworkInterfacePermission",
          "ec2:DeleteNetworkInterface",
          "ec2:DeleteNetworkInterfacePermission",
          "ec2:DescribeNetworkInterfaces",
          "ec2:DescribeVpcs",
          "ec2:DescribeDhcpOptions",
          "ec2:DescribeSubnets",
          "ec2:DescribeSecurityGroups"
      ],
      "Resource": "*"
  }
  ```
+ The following `ec2` permission is required to enable the [SageMaker HyperPod auto-resume functionality](sagemaker-hyperpod-resiliency-slurm-auto-resume.md).

  ```
  {
      "Effect": "Allow",
      "Action": [
          "ec2:DetachNetworkInterface"
      ],
      "Resource": "*"
  }
  ```
+ The following `ec2` permission allows SageMaker HyperPod to create tags on the network interfaces within your account.

  ```
  {
      "Effect": "Allow",
      "Action": "ec2:CreateTags",
      "Resource": [
          "arn:aws:ec2:*:*:network-interface/*"
      ]
  }
  ```

------
#### [ Amazon EKS ]

For HyperPod orchestrated with Amazon EKS, you must attach the following managed policy to the SageMaker HyperPod IAM role.
+ [AmazonSageMakerClusterInstanceRolePolicy](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerClusterInstanceRolePolicy.html)

In addition to the managed policy, attach the following permission policy to the role.

------
#### [ JSON ]

****  

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:AssignPrivateIpAddresses",
        "ec2:AttachNetworkInterface",
        "ec2:CreateNetworkInterface",
        "ec2:CreateNetworkInterfacePermission",
        "ec2:DeleteNetworkInterface",
        "ec2:DeleteNetworkInterfacePermission",
        "ec2:DescribeInstances",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeTags",
        "ec2:DescribeVpcs",
        "ec2:DescribeDhcpOptions",
        "ec2:DescribeSubnets",
        "ec2:DescribeSecurityGroups",
        "ec2:DetachNetworkInterface",
        "ec2:ModifyNetworkInterfaceAttribute",
        "ec2:UnassignPrivateIpAddresses",
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:GetAuthorizationToken",
        "ecr:GetDownloadUrlForLayer",
        "eks-auth:AssumeRoleForPodIdentity"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:CreateTags"
      ],
      "Resource": [
        "arn:aws:ec2:*:*:network-interface/*"
      ]
    }
  ]
}
```

------

**Note**  
The `"eks-auth:AssumeRoleForPodIdentity"` permission is optional. It's required only if you plan to use EKS Pod Identity.

**SageMaker HyperPod service-linked role**

For Amazon EKS support in SageMaker HyperPod, HyperPod creates a service-linked role with [AWS managed policy: AmazonSageMakerHyperPodServiceRolePolicy](security-iam-awsmanpol-AmazonSageMakerHyperPodServiceRolePolicy.md) to monitor and support resiliency on your EKS cluster such as replacing nodes and restarting jobs.

**Additional IAM policies for Amazon EKS cluster with restricted instance group (RIG)**

Workloads running in restricted instance groups rely on the execution role to load data from Amazon S3. Add the following Amazon S3 permissions to the execution role so that customization jobs running in restricted instance groups can fetch input data.

------
#### [ JSON ]

****  

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-bucket"      
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-bucket/*"
      ]
    }
  ]
}
```

------

------

# Customer managed AWS KMS key encryption for SageMaker HyperPod
<a name="smcluster-cmk"></a>

By default, the root Amazon EBS volume attached to your SageMaker HyperPod cluster is encrypted using an AWS KMS key owned by AWS. You now have the option to encrypt both the root Amazon EBS volume and the secondary volume with your own customer managed KMS keys. The following topic describes how customer managed keys (CMKs) work with volumes in HyperPod clusters.

**Note**  
The following exclusions apply when using customer managed keys for SageMaker HyperPod clusters:  
Customer managed key encryption is only supported for clusters using the continuous node provisioning mode. Restricted instance groups don't support customer managed keys.
HyperPod clusters don’t currently support passing AWS KMS encryption context in customer managed key encryption requests. Therefore, ensure your KMS key policy is not scoped down using encryption context conditions, as this prevents the cluster from using the key.
KMS key transition isn't currently supported, so you can't change the KMS key specified in your configuration. To use a different key, create a new instance group with the desired key and delete your old instance group.
Specifying customer managed keys for HyperPod clusters through the console isn't currently supported.

## Permissions
<a name="smcluster-cmk-permissions"></a>

Before you can use your customer managed key with HyperPod, you must complete the following prerequisites:
+ Ensure that the AWS IAM execution role that you're using for SageMaker AI has the following AWS KMS permissions added. The [`kms:CreateGrant`](https://docs.aws.amazon.com/kms/latest/APIReference/API_CreateGrant.html) permission allows HyperPod to use your KMS key when performing the following actions:
  + Scaling out your instance count (UpdateCluster operations)
  + Adding cluster nodes (BatchAddClusterNodes operations)
  + Patching software (UpdateClusterSoftware operations)

  For more information on updating your IAM role's permissions, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html) in the *IAM User Guide*.

------
#### [ JSON ]

****  

  ```
  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Action": [
                  "kms:CreateGrant",
                  "kms:DescribeKey"
              ],
              "Resource": "*"
          }
      ]
  }
  ```

------
+ Add the following permissions to your KMS key policy. For more information, see [Change a key policy](https://docs.aws.amazon.com/kms/latest/developerguide/key-policy-modifying.html) in the *AWS KMS Developer Guide*.

------
#### [ JSON ]

****  

  ```
  {
      "Version": "2012-10-17",
      "Id": "hyperpod-key-policy",
      "Statement": [
          {
              "Sid": "Enable IAM User Permissions",
              "Effect": "Allow",
              "Principal": {
                  "AWS": "arn:aws:iam::111122223333:root"
              },
              "Action": "kms:*",
              "Resource": "*"
          },
          {
              "Effect": "Allow",
              "Principal": {
                  "AWS": "arn:aws:iam::111122223333:role/<iam-role>"
              },
              "Action": "kms:CreateGrant",
              "Resource": "arn:aws:kms:us-east-1:111122223333:key/key-id",
              "Condition": {
                  "StringEquals": {
                      "kms:ViaService": "sagemaker.us-east-1.amazonaws.com"
                  },
                  "Bool": {
                      "kms:GrantIsForAWSResource": "true"
                  }
              }
          },
          {
              "Effect": "Allow",
              "Principal": {
                  "AWS": "arn:aws:iam::111122223333:role/<iam-role>"
              },
              "Action": "kms:DescribeKey",
              "Resource": "arn:aws:kms:us-east-1:111122223333:key/key-id",
              "Condition": {
                  "StringEquals": {
                      "kms:ViaService": "sagemaker.us-east-1.amazonaws.com"
                  }
              }
          }
      ]
  }
  ```

------

## How to use your KMS key
<a name="smcluster-cmk-usage"></a>

You can specify your customer managed keys when creating or updating a cluster using the [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) and [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) API operations. The `InstanceStorageConfigs` structure allows up to two `EbsVolumeConfig` configurations, in which you can configure the root Amazon EBS volume and, optionally, a secondary volume. You can use the same KMS key or a different KMS key for each volume, depending on your needs.

You can choose to specify a customer managed key for neither, both, or either volume. However, you can't specify two root volumes or two secondary volumes.

When configuring the root volume, the following requirements apply:
+ `RootVolume` must be set to `True`. The default value is `False`, which configures the secondary volume instead.
+ The `VolumeKmsKeyId` field is required; you must specify your customer managed key. The root volume is always encrypted, either with an AWS owned key (the default when you don't specify a key) or with the customer managed key that you specify.
+ You can't specify the `VolumeSizeInGB` field for root volumes since HyperPod determines the size of the root volume for you.

When configuring the secondary volume, the following requirements apply:
+ `RootVolume` must be `False` (the default value of this field is `False`).
+ The `VolumeKmsKeyId` field is optional. You can use the same customer managed key you specified for the root volume, or you can use a different key.
+ The `VolumeSizeInGB` field is required, since you must specify your desired size for the secondary volume.
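
As a sketch of the rules above, the following Python fragment builds an `InstanceStorageConfigs` list for one root and one secondary volume (the key ARNs are hypothetical placeholders) and checks it against those requirements:

```
# Hypothetical KMS key ARNs; replace with your own customer managed keys.
ROOT_KEY = "arn:aws:kms:us-east-1:111122223333:key/root-volume-key-id"
SECONDARY_KEY = "arn:aws:kms:us-east-1:111122223333:key/secondary-volume-key-id"

instance_storage_configs = [
    # Root volume: RootVolume must be True, VolumeKmsKeyId is required,
    # and VolumeSizeInGB must be omitted (HyperPod sizes it for you).
    {"EbsVolumeConfig": {"RootVolume": True, "VolumeKmsKeyId": ROOT_KEY}},
    # Secondary volume: RootVolume defaults to False, the size is required,
    # and the KMS key is optional.
    {"EbsVolumeConfig": {"VolumeSizeInGB": 100, "VolumeKmsKeyId": SECONDARY_KEY}},
]

def check(configs):
    """Check the root/secondary volume rules described above."""
    roots = [c["EbsVolumeConfig"] for c in configs
             if c["EbsVolumeConfig"].get("RootVolume", False)]
    secondaries = [c["EbsVolumeConfig"] for c in configs
                   if not c["EbsVolumeConfig"].get("RootVolume", False)]
    assert len(configs) <= 2, "at most two EbsVolumeConfig entries"
    assert len(roots) <= 1 and len(secondaries) <= 1, "no duplicate volume kinds"
    for r in roots:
        assert "VolumeKmsKeyId" in r and "VolumeSizeInGB" not in r
    for s in secondaries:
        assert "VolumeSizeInGB" in s
    return True
```

For instance, `check(instance_storage_configs)` passes for the fragment shown, but would raise if a root volume included `VolumeSizeInGB` or omitted `VolumeKmsKeyId`.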

**Important**  
When using customer managed keys, we strongly recommend that you use different KMS keys for each instance group in your cluster. Using the same customer managed key across multiple instance groups might lead to unintentional continued permissions even if you try to revoke a grant. For example, if you revoke an AWS KMS grant for one instance group's volumes, that instance group might still allow scaling and patching operations due to grants existing on other instance groups using the same key. To prevent this issue, ensure that you assign unique KMS keys to each instance group in your cluster. If you need to restrict permissions on instance groups, you can try one of the following options:  
Disable the KMS key.
Apply deny policies to the KMS key policy.
Revoke all instance group grants for the key (rather than revoking one grant).
Delete the instance group.
Delete the cluster.

The following examples show how to specify customer managed keys for both root and secondary volumes using the CreateCluster and UpdateCluster APIs. These examples show only the fields required for customer managed key integration. To configure a customer managed key for only one of the volumes, specify only one `EbsVolumeConfig`.

For more information about configuring cluster creation and update requests, see [Creating a SageMaker HyperPod cluster](sagemaker-hyperpod-eks-operate-cli-command-create-cluster.md) and [Updating SageMaker HyperPod cluster configuration](sagemaker-hyperpod-eks-operate-cli-command-update-cluster.md).

------
#### [ CreateCluster ]

The following example shows a [create-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-cluster.html) AWS CLI request with customer managed key encryption.

```
aws sagemaker create-cluster \
  --cluster-name <your-hyperpod-cluster> \
  --instance-groups '[{
    "ExecutionRole": "arn:aws:iam::111122223333:role/<your-SageMaker-Execution-Role>",
    "InstanceCount": 2,
    "InstanceGroupName": "<your-ig-name>",
    "InstanceStorageConfigs": [
            {
                "EbsVolumeConfig": {
                    "RootVolume": true,
                    "VolumeKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/root-volume-key-id"
                }
            },
            {
                "EbsVolumeConfig": {
                    "VolumeSizeInGB": 100,
                    "VolumeKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/secondary-volume-key-id"
                }
            }
    ],
    "InstanceType": "<desired-instance-type>"
  }]' \
  --vpc-config '{
    "SecurityGroupIds": ["<sg-id>"],
    "Subnets": ["<subnet-id>"]
  }'
```

------
#### [ UpdateCluster ]

The following example shows an [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) AWS CLI request with customer managed key encryption.

```
aws sagemaker update-cluster \
  --cluster-name <your-hyperpod-cluster> \
  --instance-groups '[{
    "InstanceGroupName": "<your-ig-name>",
    "InstanceStorageConfigs": [
            {
                "EbsVolumeConfig": {
                    "RootVolume": true,
                    "VolumeKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/root-volume-key-id"
                }
            },
            {
                "EbsVolumeConfig": {
                    "VolumeSizeInGB": 100,
                    "VolumeKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/secondary-volume-key-id"
                }
            }
    ]
  }]'
```

------

# SageMaker HyperPod recipes
<a name="sagemaker-hyperpod-recipes"></a>

Amazon SageMaker HyperPod recipes are pre-configured training stacks provided by AWS to help you quickly start training and fine-tuning publicly available foundation models (FMs) from various model families such as Llama, Mistral, Mixtral, or DeepSeek. Recipes automate the end-to-end training loop, including loading datasets, applying distributed training techniques, and managing checkpoints for faster recovery from faults. 

SageMaker HyperPod recipes are particularly beneficial for users who may not have deep machine learning expertise, as they abstract away much of the complexity involved in training large models.

You can run recipes within SageMaker HyperPod or as SageMaker training jobs.

The following tables are maintained in the SageMaker HyperPod GitHub repository and provide the most up-to-date information on the models supported for pre-training and fine-tuning, their respective recipes and launch scripts, supported instance types, and more.
+ For the most current list of supported models, recipes, and launch scripts for pre-training, see the [pre-training table](https://github.com/aws/sagemaker-hyperpod-recipes?tab=readme-ov-file#pre-training).
+ For the most current list of supported models, recipes, and launch scripts for fine-tuning, see the [fine-tuning table](https://github.com/aws/sagemaker-hyperpod-recipes?tab=readme-ov-file#fine-tuning).

For SageMaker HyperPod users, the automation of end-to-end training workflows comes from the integration of the training adapter with SageMaker HyperPod recipes. The training adapter is built on the [NVIDIA NeMo framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html) and the [Neuronx Distributed Training package](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/index.html). If you're familiar with using NeMo, the process of using the training adapter is the same. The training adapter runs the recipe on your cluster.

![\[Diagram showing SageMaker HyperPod recipe workflow. A "Recipe" icon at the top feeds into a "HyperPod recipe launcher" box. This box connects to a larger section labeled "Cluster: Slurm, K8s, ..." containing three GPU icons with associated recipe files. The bottom of the cluster section is labeled "Train with HyperPod Training Adapter".\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/sagemaker-hyperpod-recipes-overview.png)


You can also train your own model by defining your own custom recipe.

To get started with a tutorial, see [Tutorials](sagemaker-hyperpod-recipes-tutorials.md).

**Topics**
+ [Tutorials](sagemaker-hyperpod-recipes-tutorials.md)
+ [Default configurations](default-configurations.md)
+ [Cluster-specific configurations](cluster-specific-configurations.md)
+ [Considerations](cluster-specific-configurations-special-considerations.md)
+ [Advanced settings](cluster-specific-configurations-advanced-settings.md)
+ [Appendix](appendix.md)

# Tutorials
<a name="sagemaker-hyperpod-recipes-tutorials"></a>

The following quick-start tutorials help you get started with using the recipes for training:
+ SageMaker HyperPod with Slurm Orchestration
  + Pre-training
    + [HyperPod Slurm cluster pre-training tutorial (GPU)](hyperpod-gpu-slurm-pretrain-tutorial.md)
    + [Trainium Slurm cluster pre-training tutorial](hyperpod-trainium-slurm-cluster-pretrain-tutorial.md)
  + Fine-tuning
    + [HyperPod Slurm cluster PEFT-Lora tutorial (GPU)](hyperpod-gpu-slurm-peft-lora-tutorial.md)
    + [HyperPod Slurm cluster DPO tutorial (GPU)](hyperpod-gpu-slurm-dpo-tutorial.md)
+ SageMaker HyperPod with K8s Orchestration
  + Pre-training
    + [Kubernetes cluster pre-training tutorial (GPU)](sagemaker-hyperpod-gpu-kubernetes-cluster-pretrain-tutorial.md)
    + [Trainium Kubernetes cluster pre-training tutorial](sagemaker-hyperpod-trainium-kubernetes-cluster-pretrain-tutorial.md)
+ SageMaker training jobs
  + Pre-training
    + [SageMaker training jobs pre-training tutorial (GPU)](sagemaker-hyperpod-gpu-sagemaker-training-jobs-pretrain-tutorial.md)
    + [Trainium SageMaker training jobs pre-training tutorial](sagemaker-hyperpod-trainium-sagemaker-training-jobs-pretrain-tutorial.md)

# HyperPod Slurm cluster pre-training tutorial (GPU)
<a name="hyperpod-gpu-slurm-pretrain-tutorial"></a>

The following tutorial sets up a Slurm environment and starts a training job on a Llama 8 billion parameter model.

**Prerequisites**  
Before you start setting up your environment to run the recipe, make sure you have:  
Set up a HyperPod GPU Slurm cluster.  
Your HyperPod Slurm cluster must have Nvidia Enroot and Pyxis enabled (these are enabled by default).
A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
Data in one of the following formats:  
JSON
JSONGZ (Compressed JSON)
ARROW
(Optional) A HuggingFace token, if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## HyperPod GPU Slurm environment setup
<a name="hyperpod-gpu-slurm-environment-setup"></a>

To initiate a training job on a HyperPod GPU Slurm cluster, do the following:

1. SSH into the head node of your Slurm cluster.

1. After you log in, set up the virtual environment. Make sure you're using Python 3.9 or greater.

   ```
   #set up a virtual environment
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. Clone the SageMaker HyperPod recipes and SageMaker HyperPod adapter repositories to a shared storage location.

   ```
   git clone https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo.git
   git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
   cd sagemaker-hyperpod-recipes
   pip3 install -r requirements.txt
   ```

1. Create a squash file using Enroot. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md). To gain a deeper understanding of how to use the Enroot file, see [Build AWS-optimized Nemo-Launcher image](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/2.nemo-launcher#2-build-aws-optimized-nemo-launcher-image).

   ```
   REGION="<region>"
   IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
   aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com
   enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE}
   mv $PWD/smdistributed-modelparallel.sqsh "/fsx/<any-path-in-the-shared-filesystem>"
   ```

1. To use the Enroot squash file to start training, use the following example to modify the `recipes_collection/config.yaml` file.

   ```
   container: /fsx/path/to/your/smdistributed-modelparallel.sqsh
   ```

## Launch the training job
<a name="hyperpod-gpu-slurm-launch-training-job"></a>

After you install the dependencies, start a training job from the `sagemaker-hyperpod-recipes/launcher_scripts` directory. You get the dependencies by cloning the [SageMaker HyperPod recipes repository](https://github.com/aws/sagemaker-hyperpod-recipes) during environment setup.

First, pick your training recipe from GitHub; the model name is specified as part of the recipe name. In the following example, we use the `launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh` script to launch the Llama 8b pre-training recipe with a sequence length of 16,384 (16k), `llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain`. Set the following values in the script:
+ `IMAGE`: The container from the environment setup section.
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  recipes.model.hf_access_token=<your_hf_token>
  ```

```
#!/bin/bash
IMAGE="${YOUR_IMAGE}"
SAGEMAKER_TRAINING_LAUNCHER_DIR="${SAGEMAKER_TRAINING_LAUNCHER_DIR:-${PWD}}"

TRAIN_DIR="${YOUR_TRAIN_DIR}" # Location of training dataset
VAL_DIR="${YOUR_VAL_DIR}" # Location of validation dataset

# experiment output directory
EXP_DIR="${YOUR_EXP_DIR}"

HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
  recipes=training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
  base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
  recipes.run.name="hf_llama3_8b" \
  recipes.exp_manager.exp_dir="$EXP_DIR" \
  recipes.model.data.train_dir="$TRAIN_DIR" \
  recipes.model.data.val_dir="$VAL_DIR" \
  container="${IMAGE}" \
  +cluster.container_mounts.0="/fsx:/fsx"
```
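The `key=value` arguments passed to `main.py` are Hydra-style overrides: each dotted key addresses one field of the nested recipe configuration, and a leading `+` (as in `+cluster.container_mounts.0`) adds a key that isn't already present. The following is a minimal sketch of the idea only; the launcher itself uses Hydra, which also handles type coercion, list indexing, and config composition.

```python
def apply_override(cfg: dict, assignment: str) -> None:
    """Apply one 'a.b.c=value' override to a nested dict.

    Sketch only: real Hydra overrides also coerce types, index into
    lists, and distinguish '+' (add) from plain (replace) semantics.
    """
    key, _, value = assignment.partition("=")
    parts = key.lstrip("+").split(".")
    node = cfg
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    node[parts[-1]] = value

cfg = {}
for arg in [
    "recipes.run.name=hf_llama3_8b",
    "recipes.model.data.train_dir=/fsx/data/train",
    "+cluster.container_mounts.0=/fsx:/fsx",
]:
    apply_override(cfg, arg)

print(cfg)
```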

After you've configured all the required parameters in the launcher script, you can run the script using the following command.

```
bash launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh
```

For more information about the Slurm cluster configuration, see [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md).

# Trainium Slurm cluster pre-training tutorial
<a name="hyperpod-trainium-slurm-cluster-pretrain-tutorial"></a>

The following tutorial sets up a Trainium environment on a Slurm cluster and starts a training job on a Llama 8 billion parameter model.

**Prerequisites**  
Before you start setting up your environment, make sure you have:  
Set up a SageMaker HyperPod Trainium Slurm cluster.
A shared storage location. It can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.
Data in one of the following formats:  
JSON
JSONGZ (Compressed JSON)
ARROW
(Optional) A HuggingFace token, if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## Set up the Trainium environment on the Slurm cluster
<a name="hyperpod-trainium-slurm-cluster-pretrain-setup-trainium-environment"></a>

To initiate a training job on a Slurm cluster, do the following:
+ SSH into the head node of your Slurm cluster.
+ After you log in, set up the Neuron environment. For information about setting up Neuron, see [Neuron setup steps](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_SFT.html#setting-up-the-environment). We recommend using the Deep Learning AMIs that come with the Neuron drivers pre-installed, such as [Ubuntu 20 with DLAMI Pytorch](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu20-pytorch-dlami.html#setup-torch-neuronx-ubuntu20-dlami-pytorch).
+ Clone the SageMaker HyperPod recipes repository to a shared storage location in the cluster. The shared storage location can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.

  ```
  git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  ```
+ Go through the following tutorial: [HuggingFace Llama3-8B Pretraining](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_pretraining.html#)
+ Prepare a model configuration. The model configurations are available in the Neuron repo. For the model configuration used in this tutorial, see the [llama3 8b model config](https://github.com/aws-neuron/neuronx-distributed/blob/main/examples/training/llama/tp_zero1_llama_hf_pretrain/8B_config_llama3/config.json).

## Launch the training job in Trainium
<a name="hyperpod-trainium-slurm-cluster-pretrain-launch-training-job-trainium"></a>

To launch a training job in Trainium, specify a cluster configuration and a Neuron recipe. For example, to launch a llama3 8b pre-training job in Trainium, set the launch script, `launcher_scripts/llama/run_hf_llama3_8b_seq8k_trn1x4_pretrain.sh`, to the following:
+ `MODEL_CONFIG`: The model config from the environment setup section
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  recipes.model.hf_access_token=<your_hf_token>
  ```

```
#!/bin/bash

#Users should set up their cluster type in /recipes_collection/config.yaml

SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}

COMPILE=0
TRAIN_DIR="${TRAIN_DIR}" # Location of training dataset
MODEL_CONFIG="${MODEL_CONFIG}" # Location of config.json for the model

HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    instance_type="trn1.32xlarge" \
    recipes.run.compile="$COMPILE" \
    recipes.run.name="hf-llama3-8b" \
    recipes.trainer.num_nodes=4 \
    recipes=training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain \
    recipes.data.train_dir="$TRAIN_DIR" \
    recipes.model.model_config="$MODEL_CONFIG"
```

To launch the training job, run the following command:

```
bash launcher_scripts/llama/run_hf_llama3_8b_seq8k_trn1x4_pretrain.sh
```

For more information about the Slurm cluster configuration, see [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md).

# HyperPod Slurm cluster DPO tutorial (GPU)
<a name="hyperpod-gpu-slurm-dpo-tutorial"></a>

The following tutorial sets up a Slurm environment and starts a direct preference optimization (DPO) job on a Llama 8 billion parameter model.

**Prerequisites**  
Before you start setting up your environment, make sure you have:  
Set up a HyperPod GPU Slurm cluster.  
Your HyperPod Slurm cluster must have Nvidia Enroot and Pyxis enabled (these are enabled by default).
A shared storage location. It can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.
A tokenized binary preference dataset in one of the following formats:  
JSON
JSONGZ (Compressed JSON)
ARROW
(Optional) If you need the pre-trained weights from HuggingFace or if you're training a Llama 3.2 model, you must get the HuggingFace token before you start training. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## Set up the HyperPod GPU Slurm environment
<a name="hyperpod-gpu-slurm-dpo-hyperpod-gpu-slurm-environment"></a>

To initiate a training job on a Slurm cluster, do the following:
+ SSH into the head node of your Slurm cluster.
+ After you log in, set up the virtual environment. Make sure you're using Python 3.9 or greater.

  ```
  #set up a virtual environment
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  ```
+ Clone the SageMaker HyperPod recipes and SageMaker HyperPod adapter repositories to a shared storage location. The shared storage location can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.

  ```
  git clone https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo.git
  git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  ```
+ Create a squash file using Enroot. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md). For more information about using the Enroot file, see [Build AWS-optimized Nemo-Launcher image](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/2.nemo-launcher#2-build-aws-optimized-nemo-launcher-image).

  ```
  REGION="<region>"
  IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
  aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com
  enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE}
  mv $PWD/smdistributed-modelparallel.sqsh "/fsx/<any-path-in-the-shared-filesystem>"
  ```
+ To use the Enroot squash file to start training, use the following example to modify the `recipes_collection/config.yaml` file.

  ```
  container: /fsx/path/to/your/smdistributed-modelparallel.sqsh
  ```

## Launch the training job
<a name="hyperpod-gpu-slurm-dpo-launch-training-job"></a>

To launch a DPO job for the Llama 8 billion parameter model with a sequence length of 8192 on a single Slurm compute node, set the launch script, `launcher_scripts/llama/run_hf_llama3_8b_seq8k_gpu_dpo.sh`, to the following:
+ `IMAGE`: The container from the environment setup section.
+ `HF_MODEL_NAME_OR_PATH`: Define the name or the path of the pre-trained weights in the `hf_model_name_or_path` parameter of the recipe.
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  recipes.model.hf_access_token=${HF_ACCESS_TOKEN}
  ```

**Note**  
The reference model used for DPO in this setup is automatically derived from the base model being trained (no separate reference model is explicitly defined). DPO-specific hyperparameters are preconfigured with the following default values:  
`beta`: 0.1 (controls the strength of KL divergence regularization)
`label_smoothing`: 0.0 (no smoothing applied to preference labels)

```
recipes.dpo.beta=${BETA}
recipes.dpo.label_smoothing=${LABEL_SMOOTHING}
```
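These two values parameterize the standard DPO objective: `beta` scales the margin between the policy and reference log-probability ratios for the chosen and rejected responses, and `label_smoothing` softens the hard preference labels. The following toy scalar version is illustrative only; the training adapter computes the loss from token log-probabilities of the policy and the automatically derived reference model.

```python
import math

def dpo_loss(logratio_chosen: float, logratio_rejected: float,
             beta: float = 0.1, label_smoothing: float = 0.0) -> float:
    """Toy scalar DPO loss.

    logratio_* stand in for log pi(y|x) - log pi_ref(y|x) of the chosen
    and rejected responses; in training these come from the model.
    """
    log_sigmoid = lambda z: -math.log1p(math.exp(-z))
    u = beta * (logratio_chosen - logratio_rejected)
    # Label smoothing mixes in the loss of the flipped preference.
    return -((1.0 - label_smoothing) * log_sigmoid(u)
             + label_smoothing * log_sigmoid(-u))

# With no margin the loss sits at log(2); a larger beta sharpens the
# penalty whenever the rejected response outranks the chosen one.
print(dpo_loss(0.0, 0.0))
```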

```
#!/bin/bash
IMAGE="${YOUR_IMAGE}"
SAGEMAKER_TRAINING_LAUNCHER_DIR="${SAGEMAKER_TRAINING_LAUNCHER_DIR:-${PWD}}"

TRAIN_DIR="${YOUR_TRAIN_DIR}" # Location of training dataset
VAL_DIR="${YOUR_VAL_DIR}" # Location of validation dataset
# experiment output directory
EXP_DIR="${YOUR_EXP_DIR}"
HF_ACCESS_TOKEN="${YOUR_HF_TOKEN}"
HF_MODEL_NAME_OR_PATH="${HF_MODEL_NAME_OR_PATH}"
BETA="${BETA}"
LABEL_SMOOTHING="${LABEL_SMOOTHING}"

# Add hf_model_name_or_path and turn off synthetic_data
HYDRA_FULL_ERROR=1 python3 ${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py \
recipes=fine-tuning/llama/hf_llama3_8b_seq8k_gpu_dpo \
base_results_dir=${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results \
recipes.run.name="hf_llama3_dpo" \
recipes.exp_manager.exp_dir="$EXP_DIR" \
recipes.model.data.train_dir="$TRAIN_DIR" \
recipes.model.data.val_dir="$VAL_DIR" \
recipes.model.hf_model_name_or_path="$HF_MODEL_NAME_OR_PATH" \
container="${IMAGE}" \
+cluster.container_mounts.0="/fsx:/fsx" \
recipes.model.hf_access_token="${HF_ACCESS_TOKEN}" \
recipes.dpo.enabled=true \
recipes.dpo.beta="${BETA}" \
recipes.dpo.label_smoothing="${LABEL_SMOOTHING}"
```

After you've configured all the required parameters in the preceding script, you can initiate the training job by running it.

```
bash launcher_scripts/llama/run_hf_llama3_8b_seq8k_gpu_dpo.sh
```

For more information about the Slurm cluster configuration, see [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md).

# HyperPod Slurm cluster PEFT-Lora tutorial (GPU)
<a name="hyperpod-gpu-slurm-peft-lora-tutorial"></a>

The following tutorial sets up a Slurm environment and starts a parameter-efficient fine-tuning (PEFT) job with LoRA on a Llama 8 billion parameter model.

**Prerequisites**  
Before you start setting up your environment, make sure you have:  
Set up a HyperPod GPU Slurm cluster.  
Your HyperPod Slurm cluster must have Nvidia Enroot and Pyxis enabled (these are enabled by default).
A shared storage location. It can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.
Data in one of the following formats:  
JSON
JSONGZ (Compressed JSON)
ARROW
(Optional) If you need the pre-trained weights from HuggingFace or if you're training a Llama 3.2 model, you must get the HuggingFace token before you start training. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## Set up the HyperPod GPU Slurm environment
<a name="hyperpod-gpu-slurm-peft-lora-setup-hyperpod-gpu-slurm-environment"></a>

To initiate a training job on a Slurm cluster, do the following:
+ SSH into the head node of your Slurm cluster.
+ After you log in, set up the virtual environment. Make sure you're using Python 3.9 or greater.

  ```
  #set up a virtual environment
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  ```
+ Clone the SageMaker HyperPod recipes and SageMaker HyperPod adapter repositories to a shared storage location. The shared storage location can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.

  ```
  git clone https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo.git
  git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  ```
+ Create a squash file using Enroot. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md). For more information about using the Enroot file, see [Build AWS-optimized Nemo-Launcher image](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/2.nemo-launcher#2-build-aws-optimized-nemo-launcher-image).

  ```
  REGION="<region>"
  IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
  aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com
  enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE}
  mv $PWD/smdistributed-modelparallel.sqsh "/fsx/<any-path-in-the-shared-filesystem>"
  ```
+ To use the Enroot squash file to start training, use the following example to modify the `recipes_collection/config.yaml` file.

  ```
  container: /fsx/path/to/your/smdistributed-modelparallel.sqsh
  ```

## Launch the training job
<a name="hyperpod-gpu-slurm-peft-lora-launch-training-job"></a>

To launch a PEFT job for the Llama 8 billion parameter model with a sequence length of 8192 on a single Slurm compute node, set the launch script, `launcher_scripts/llama/run_hf_llama3_8b_seq8k_gpu_lora.sh`, to the following:
+ `IMAGE`: The container from the environment setup section.
+ `HF_MODEL_NAME_OR_PATH`: Define the name or the path of the pre-trained weights in the `hf_model_name_or_path` parameter of the recipe.
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  recipes.model.hf_access_token=${HF_ACCESS_TOKEN}
  ```

```
#!/bin/bash
IMAGE="${YOUR_IMAGE}"
SAGEMAKER_TRAINING_LAUNCHER_DIR="${SAGEMAKER_TRAINING_LAUNCHER_DIR:-${PWD}}"

TRAIN_DIR="${YOUR_TRAIN_DIR}" # Location of training dataset
VAL_DIR="${YOUR_VAL_DIR}" # Location of validation dataset

# experiment output directory
EXP_DIR="${YOUR_EXP_DIR}"
HF_ACCESS_TOKEN="${YOUR_HF_TOKEN}"
HF_MODEL_NAME_OR_PATH="${YOUR_HF_MODEL_NAME_OR_PATH}"

# Add hf_model_name_or_path and turn off synthetic_data
HYDRA_FULL_ERROR=1 python3 ${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py \
    recipes=fine-tuning/llama/hf_llama3_8b_seq8k_gpu_lora \
    base_results_dir=${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results \
    recipes.run.name="hf_llama3_lora" \
    recipes.exp_manager.exp_dir="$EXP_DIR" \
    recipes.model.data.train_dir="$TRAIN_DIR" \
    recipes.model.data.val_dir="$VAL_DIR" \
    recipes.model.hf_model_name_or_path="$HF_MODEL_NAME_OR_PATH" \
    container="${IMAGE}" \
    +cluster.container_mounts.0="/fsx:/fsx" \
    recipes.model.hf_access_token="${HF_ACCESS_TOKEN}"
```

After you've configured all the required parameters in the preceding script, you can initiate the training job by running it.

```
bash launcher_scripts/llama/run_hf_llama3_8b_seq8k_gpu_lora.sh
```
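For background, LoRA (the PEFT method this recipe configures) freezes the base weight matrix `W` and trains only a low-rank update, so the effective weight is `W + (alpha / r) * B @ A`, with `A` of shape `r x d_in` and `B` of shape `d_out x r` as the only trainable matrices. The pure-Python sketch below illustrates the forward path only; the actual implementation lives in the training adapter and operates on GPU tensors.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_forward(v, W, A, B, alpha=16.0, r=2):
    """Toy LoRA forward pass: W @ v plus the scaled low-rank update."""
    base = matvec(W, v)                 # frozen pre-trained projection
    update = matvec(B, matvec(A, v))    # trainable low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]
v = [3.0, 4.0]
# B starts at zero, so a freshly attached adapter leaves the model unchanged.
print(lora_forward(v, W, A=[[1.0, 0.0]], B=[[0.0], [0.0]], alpha=2.0, r=1))
```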

For more information about the Slurm cluster configuration, see [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md).

# Kubernetes cluster pre-training tutorial (GPU)
<a name="sagemaker-hyperpod-gpu-kubernetes-cluster-pretrain-tutorial"></a>

There are two ways to launch a training job in a GPU Kubernetes cluster:
+ (Recommended) [HyperPod command-line tool](https://github.com/aws/sagemaker-hyperpod-cli)
+ The NeMo style launcher

**Prerequisites**  
Before you start setting up your environment, make sure you have:  
Set up a HyperPod GPU Kubernetes cluster.
A shared storage location. It can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.
Data in one of the following formats:  
JSON
JSONGZ (Compressed JSON)
ARROW
(Optional) A HuggingFace token, if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## GPU Kubernetes environment setup
<a name="sagemaker-hyperpod-gpu-kubernetes-environment-setup"></a>

To set up a GPU Kubernetes environment, do the following:
+ Set up the virtual environment. Make sure you're using Python 3.9 or greater.

  ```
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  ```
+ Install dependencies using one of the following methods:
  + (Recommended): [HyperPod command-line tool](https://github.com/aws/sagemaker-hyperpod-cli) method:

    ```
    # install HyperPod command line tools
    git clone https://github.com/aws/sagemaker-hyperpod-cli
    cd sagemaker-hyperpod-cli
    pip3 install .
    ```
  + SageMaker HyperPod recipes method:

    ```
    # install SageMaker HyperPod Recipes.
    git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
    cd sagemaker-hyperpod-recipes
    pip3 install -r requirements.txt
    ```
+ [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)
+ [Install Helm](https://helm.sh/docs/intro/install/)
+ Connect to your Kubernetes cluster

  ```
  aws eks update-kubeconfig --region "CLUSTER_REGION" --name "CLUSTER_NAME"
  hyperpod connect-cluster --cluster-name "CLUSTER_NAME" [--region "CLUSTER_REGION"] [--namespace <namespace>]
  ```

## Launch the training job with the SageMaker HyperPod CLI
<a name="sagemaker-hyperpod-gpu-kubernetes-launch-training-job-cli"></a>

We recommend using the SageMaker HyperPod command-line interface (CLI) tool to submit your training job with your configurations. The following example submits a training job using the `hf_llama3_8b_seq16k_gpu_p5x16_pretrain` recipe.
+ `your_training_container`: A Deep Learning container. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md).
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  "recipes.model.hf_access_token": "<your_hf_token>"
  ```

```
hyperpod start-job --recipe training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
--persistent-volume-claims fsx-claim:data \
--override-parameters \
'{
"recipes.run.name": "hf-llama3-8b",
"recipes.exp_manager.exp_dir": "/data/<your_exp_dir>",
"container": "658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121",
"recipes.model.data.train_dir": "<your_train_data_dir>",
"recipes.model.data.val_dir": "<your_val_data_dir>",
"cluster": "k8s",
"cluster_type": "k8s"
}'
```

After you've submitted a training job, you can use the following command to verify if you submitted it successfully.

```
kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod name_of_pod
```

After the job `STATUS` changes to `Running`, you can examine the log by using the following command.

```
kubectl logs name_of_pod
```

When the job finishes, the `STATUS` shown by `kubectl get pods` changes to `Completed`.
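If you poll pod status from a script rather than by eye, you can parse the table output; the sketch below assumes the column layout shown in the example output earlier in this section, and the pod name in the sample is hypothetical. In practice, `kubectl get pod <name-of-pod> -o jsonpath='{.status.phase}'` avoids text parsing altogether.

```python
def pod_statuses(kubectl_output: str) -> dict:
    """Map pod name -> STATUS from plain `kubectl get pods` output."""
    lines = kubectl_output.strip().splitlines()
    status_col = lines[0].split().index("STATUS")
    return {row.split()[0]: row.split()[status_col] for row in lines[1:]}

# Sample output with a hypothetical pod name.
sample = """\
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-demo-worker-0           0/1     Running   0          36s"""

print(pod_statuses(sample))
```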

## Launch the training job with the recipes launcher
<a name="sagemaker-hyperpod-gpu-kubernetes-launch-training-job-recipes"></a>

Alternatively, you can use the SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating `k8s.yaml` and `config.yaml`, and then running the launch script.
+ In `k8s.yaml`, update `persistent_volume_claims`. The following example mounts the Amazon FSx claim to the `/data` directory of each compute pod.

  ```
  persistent_volume_claims:
    - claimName: fsx-claim
      mountPath: data
  ```
+ In `config.yaml`, update `repo_url_or_path` under `git`.

  ```
  git:
    repo_url_or_path: <training_adapter_repo>
    branch: null
    commit: null
    entry_script: null
    token: null
  ```
+ Update `launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh`
  + `IMAGE`: A Deep Learning container. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md).
  + (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

    ```
    recipes.model.hf_access_token=<your_hf_token>
    ```

  ```
  #!/bin/bash
  #Users should setup their cluster type in /recipes_collection/config.yaml
  REGION="<region>"
  IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
  SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
  EXP_DIR="<your_exp_dir>" # Location to save experiment info including logging, checkpoints, etc.
  TRAIN_DIR="<your_training_data_dir>" # Location of training dataset
  VAL_DIR="<your_val_data_dir>" # Location of validation dataset
  
  HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
      recipes=training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
      base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
      recipes.run.name="hf-llama3" \
      recipes.exp_manager.exp_dir="$EXP_DIR" \
      cluster=k8s \
      cluster_type=k8s \
      container="${IMAGE}" \
      recipes.model.data.train_dir=$TRAIN_DIR \
      recipes.model.data.val_dir=$VAL_DIR
  ```
+ Launch the training job

  ```
  bash launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh
  ```

After you've submitted the training job, you can use the following command to verify if you submitted it successfully.

```
kubectl get pods
```

```
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod <name-of-pod>
```

After the job `STATUS` changes to `Running`, you can examine the log by using the following command.

```
kubectl logs name_of_pod
```

When the job finishes, the `STATUS` shown by `kubectl get pods` changes to `Completed`.

For more information about the k8s cluster configuration, see [Running a training job on HyperPod k8s](cluster-specific-configurations-run-training-job-hyperpod-k8s.md).

# Trainium Kubernetes cluster pre-training tutorial
<a name="sagemaker-hyperpod-trainium-kubernetes-cluster-pretrain-tutorial"></a>

You can use one of the following methods to start a training job in a Trainium Kubernetes cluster.
+ (Recommended) [HyperPod command-line tool](https://github.com/aws/sagemaker-hyperpod-cli)
+ The NeMo style launcher

**Prerequisites**  
Before you start setting up your environment, make sure you have:  
Set up a HyperPod Trainium Kubernetes cluster
A shared storage location. It can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.
Data in one of the following formats:  
JSON
JSONGZ (Compressed JSON)
ARROW
(Optional) A HuggingFace token, if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## Set up your Trainium Kubernetes environment
<a name="sagemaker-hyperpod-trainium-setup-trainium-kubernetes-environment"></a>

To set up the Trainium Kubernetes environment, do the following:

1. Complete the steps in the following tutorial: [HuggingFace Llama3-8B Pretraining](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_pretraining.html#download-the-dataset) starting from **Download the dataset**. 

1. Prepare a model configuration. Model configurations are available in the Neuron repo. For this tutorial, you can use the [llama3 8b model config](https://github.com/aws-neuron/neuronx-distributed/blob/main/examples/training/llama/tp_zero1_llama_hf_pretrain/8B_config_llama3/config.json).

1. Set up the virtual environment. Make sure you're using Python 3.9 or greater.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. Install the dependencies
   + (Recommended) Install the HyperPod command-line tool

     ```
     # install HyperPod command line tools
     git clone https://github.com/aws/sagemaker-hyperpod-cli
     cd sagemaker-hyperpod-cli
     pip3 install .
     ```
   + If you're using the SageMaker HyperPod recipes launcher, install the following

     ```
     # install SageMaker HyperPod Recipes.
     git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
     cd sagemaker-hyperpod-recipes
     pip3 install -r requirements.txt
     ```

1. [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. [Install Helm](https://helm.sh/docs/intro/install/)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   hyperpod connect-cluster --cluster-name "${CLUSTER_NAME}" [--region "${CLUSTER_REGION}"] [--namespace <namespace>]
   ```

1. Use the [Neuron container](https://github.com/aws-neuron/deep-learning-containers?tab=readme-ov-file#pytorch-training-neuronx) as your training container.

## Launch the training job with the SageMaker HyperPod CLI
<a name="sagemaker-hyperpod-trainium-launch-training-job-cli"></a>

We recommend using the SageMaker HyperPod command-line interface (CLI) tool to submit your training job with your configurations. The following example submits a training job for the `hf_llama3_8b_seq8k_trn1x4_pretrain` Trainium recipe.
+ `your_neuron_container`: The [Neuron container](https://github.com/aws-neuron/deep-learning-containers?tab=readme-ov-file#pytorch-training-neuronx).
+ `your_model_config`: The model configuration from the environment setup section.
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  "recipes.model.hf_access_token": "<your_hf_token>"
  ```

```
hyperpod start-job --recipe training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain \
--persistent-volume-claims fsx-claim:data \
--override-parameters \
'{
 "cluster": "k8s",
 "cluster_type": "k8s",
 "container": "<your_neuron_container>",
 "recipes.run.name": "hf-llama3",
 "recipes.run.compile": 0,
 "recipes.model.model_config": "<your_model_config>",
 "instance_type": "trn1.32xlarge",
 "recipes.data.train_dir": "<your_train_data_dir>"
}'
```
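
Because `--override-parameters` takes a single JSON string, shell quoting mistakes are easy to make. If you build the overrides programmatically, `json.dumps` produces a valid argument; the following is a minimal sketch (placeholder values mirror the preceding example):

```python
import json

overrides = {
    "cluster": "k8s",
    "cluster_type": "k8s",
    "container": "<your_neuron_container>",
    "recipes.run.name": "hf-llama3",
    "recipes.run.compile": 0,
    "recipes.model.model_config": "<your_model_config>",
    "instance_type": "trn1.32xlarge",
    "recipes.data.train_dir": "<your_train_data_dir>",
}

# Serialize to the single JSON argument that the CLI expects.
override_json = json.dumps(overrides)
assert json.loads(override_json) == overrides
```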

After you've submitted a training job, you can use the following command to verify that you submitted it successfully.

```
kubectl get pods
NAME                              READY   STATUS             RESTARTS        AGE
hf-llama3-<your-alias>-worker-0   0/1     Running         0               36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod name_of_pod
```

After the job `STATUS` changes to `Running`, you can examine the log by using the following command.

```
kubectl logs name_of_pod
```

The `STATUS` will turn to `Completed` when you run `kubectl get pods`.
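
If you script these status checks, the `STATUS` column can be parsed from the `kubectl get pods` output. The following is a minimal sketch (the sample output and pod name are illustrative):

```python
def pod_statuses(kubectl_output):
    """Parse `kubectl get pods` output into {pod_name: status}."""
    lines = kubectl_output.strip().splitlines()
    header = lines[0].split()
    status_idx = header.index("STATUS")
    return {line.split()[0]: line.split()[status_idx] for line in lines[1:]}

sample = """\
NAME                       READY   STATUS    RESTARTS   AGE
hf-llama3-user-worker-0    1/1     Running   0          36s"""
statuses = pod_statuses(sample)
assert statuses["hf-llama3-user-worker-0"] == "Running"
```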

## Launch the training job with the recipes launcher
<a name="sagemaker-hyperpod-trainium-launch-training-job-recipes"></a>

Alternatively, use SageMaker HyperPod recipes to submit your training job. To submit the training job using a recipe, update `k8s.yaml` and `config.yaml`. Run the bash script for the model to launch it.
+ In `k8s.yaml`, update `persistent_volume_claims` to mount the Amazon FSx claim to the `/data` directory in the compute nodes

  ```
  persistent_volume_claims:
    - claimName: fsx-claim
      mountPath: data
  ```
+ Update `launcher_scripts/llama/run_hf_llama3_8b_seq8k_trn1x4_pretrain.sh`
  + `your_neuron_container`: The container from the environment setup section.
  + `your_model_config`: The model config from the environment setup section.

  (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  recipes.model.hf_access_token=<your_hf_token>
  ```

  ```
  #!/bin/bash
  #Users should set up their cluster type in /recipes_collection/config.yaml
  IMAGE="<your_neuron_container>"
  MODEL_CONFIG="<your_model_config>"
  SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
  TRAIN_DIR="<your_training_data_dir>" # Location of training dataset
  VAL_DIR="<your_val_data_dir>" # Location of validation dataset
  
  HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    recipes=training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    recipes.run.name="hf-llama3-8b" \
    instance_type=trn1.32xlarge \
    recipes.model.model_config="$MODEL_CONFIG" \
    cluster=k8s \
    cluster_type=k8s \
    container="${IMAGE}" \
    recipes.data.train_dir=$TRAIN_DIR \
    recipes.data.val_dir=$VAL_DIR
  ```
+ Launch the job

  ```
  bash launcher_scripts/llama/run_hf_llama3_8b_seq8k_trn1x4_pretrain.sh
  ```

After you've submitted a training job, you can use the following command to verify that you submitted it successfully.

```
kubectl get pods
NAME                             READY   STATUS             RESTARTS        AGE
hf-llama3-<your-alias>-worker-0   0/1     Running         0               36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod name_of_pod
```

After the job `STATUS` changes to `Running`, you can examine the log by using the following command.

```
kubectl logs name_of_pod
```

The `STATUS` will turn to `Completed` when you run `kubectl get pods`.

For more information about the k8s cluster configuration, see [Trainium Kubernetes cluster pre-training tutorial](#sagemaker-hyperpod-trainium-kubernetes-cluster-pretrain-tutorial).

# SageMaker training jobs pre-training tutorial (GPU)
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-pretrain-tutorial"></a>

This tutorial guides you through the process of setting up and running a pre-training job using SageMaker training jobs with GPU instances.
+ Set up your environment
+ Launch a training job using SageMaker HyperPod recipes

Before you begin, make sure you have the following prerequisites.

**Prerequisites**  
Before you start setting up your environment, make sure you have:  
+ An Amazon FSx file system or an Amazon S3 bucket where you can load the data and output the training artifacts.
+ A Service Quota for 1x ml.p4d.24xlarge and 1x ml.p5.48xlarge on Amazon SageMaker AI. To request a service quota increase, do the following:
  1. On the AWS Service Quotas console, navigate to **AWS services**.
  1. Choose **Amazon SageMaker AI**.
  1. Choose one ml.p4d.24xlarge and one ml.p5.48xlarge instance.
+ An AWS Identity and Access Management (IAM) role with the following managed policies to give SageMaker AI permissions to run the examples:
  + AmazonSageMakerFullAccess
  + AmazonEC2FullAccess
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) A HuggingFace token if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## GPU SageMaker training jobs environment setup
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-environment-setup"></a>

Before you run a SageMaker training job, configure your AWS credentials and preferred region by running the `aws configure` command. As an alternative to the configure command, you can provide your credentials through environment variables such as `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN`. For more information, see [SageMaker AI Python SDK](https://github.com/aws/sagemaker-python-sdk).

We strongly recommend using a SageMaker AI Jupyter notebook in SageMaker AI JupyterLab to launch a SageMaker training job. For more information, see [SageMaker JupyterLab](studio-updated-jl.md).
+ (Optional) Set up the virtual environment and dependencies. If you are using a Jupyter notebook in Amazon SageMaker Studio, you can skip this step. Make sure you're using Python 3.9 or greater.

  ```
  # set up a virtual environment
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  # install dependencies after git clone.
  
  git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  # Set the aws region.
  
  aws configure set region <your_region>
  ```
+ Install SageMaker AI Python SDK

  ```
  pip3 install --upgrade sagemaker
  ```
+ `Container`: The GPU container is set automatically by the SageMaker AI Python SDK. You can also provide your own container.
**Note**  
If you're running a Llama 3.2 multi-modal training job, the `transformers` version must be `4.45.2` or greater.

  Append `transformers==4.45.2` to `requirements.txt` in `source_dir` only when you're using the SageMaker AI Python SDK. For example, append it if you're using it in a notebook in SageMaker AI JupyterLab.

  If you're using HyperPod recipes to launch with cluster type `sm_jobs`, this is done automatically.
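
If you script the pin from the note above, the following is a minimal sketch that appends `transformers==4.45.2` to a `requirements.txt` only when no transformers pin is present (the file path is illustrative):

```python
from pathlib import Path

def pin_transformers(requirements_path, pin="transformers==4.45.2"):
    """Append the transformers pin unless the file already pins it."""
    path = Path(requirements_path)
    lines = path.read_text().splitlines() if path.exists() else []
    if not any(line.strip().startswith("transformers") for line in lines):
        lines.append(pin)
        path.write_text("\n".join(lines) + "\n")
    return lines

result = pin_transformers("requirements.txt")
assert "transformers==4.45.2" in result
```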

## Launch the training job using a Jupyter Notebook
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-launch-training-job-notebook"></a>

You can use the following Python code to run a SageMaker training job with your recipe. It leverages the PyTorch estimator from the [SageMaker AI Python SDK](https://sagemaker.readthedocs.io/en/stable/) to submit the recipe. The following example launches the llama3-8b recipe on the SageMaker AI Training platform.

```
import os
import sagemaker, boto3
from sagemaker.debugger import TensorBoardOutputConfig

from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

bucket = sagemaker_session.default_bucket() 
output = os.path.join(f"s3://{bucket}", "output")
output_path = "<s3-URI>"

overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "exp_dir": "",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },   
    "model": {
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/val",
        },
    },
}

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=os.path.join(output, 'tensorboard'),
    container_local_output_path=overrides["exp_manager"]["explicit_log_dir"]
)

estimator = PyTorch(
    output_path=output_path,
    base_job_name=f"llama-recipe",
    role=role,
    instance_type="ml.p5.48xlarge",
    training_recipe="training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain",
    recipe_overrides=overrides,
    sagemaker_session=sagemaker_session,
    tensorboard_output_config=tensorboard_output_config,
)

estimator.fit(inputs={"train": "s3 or fsx input", "val": "s3 or fsx input"}, wait=True)
```

The preceding code creates a PyTorch estimator object with the training recipe and then fits the model using the `fit()` method. Use the `training_recipe` parameter to specify the recipe you want to use for training.
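
For intuition, `recipe_overrides` applies nested keys on top of the recipe's defaults. The following is a minimal sketch of that merge semantics (the default values shown are illustrative, not the recipe's actual defaults):

```python
def deep_merge(base, overrides):
    """Recursively apply overrides on top of a base config dict."""
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {"run": {"results_dir": "/results", "name": "llama-8b"}}
overrides = {"run": {"results_dir": "/opt/ml/model"}}
merged = deep_merge(defaults, overrides)
assert merged == {"run": {"results_dir": "/opt/ml/model", "name": "llama-8b"}}
```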

**Note**  
If you're running a Llama 3.2 multi-modal training job, the `transformers` version must be `4.45.2` or greater.

Append `transformers==4.45.2` to `requirements.txt` in `source_dir` only when you're using the SageMaker AI Python SDK directly. For example, you must append the version to the text file when you're using a Jupyter notebook.

When you deploy the endpoint for a SageMaker training job, you must specify the image URI that you're using. If you don't provide the image URI, the estimator uses the training image as the image for the deployment. The training images that SageMaker HyperPod provides don't contain the dependencies required for inference and deployment. The following is an example of how an inference image can be used for deployment:

```
from sagemaker import image_uris

container = image_uris.retrieve(
    framework='pytorch',
    region='us-west-2',
    version='2.0',
    py_version='py310',
    image_scope='inference',
    instance_type='ml.p4d.24xlarge',
)
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.p4d.24xlarge',
    image_uri=container,
)
```

**Note**  
Running the preceding code on a SageMaker notebook instance might need more than the default 5 GB of storage that SageMaker AI JupyterLab provides. If you run into storage issues, create a new notebook instance with increased storage.

## Launch the training job with the recipes launcher
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-launch-training-job-recipes"></a>

Update the `./recipes_collection/cluster/sm_jobs.yaml` file to look like the following:

```
sm_jobs_config:
  output_path: <s3_output_path>
  tensorboard_config:
    output_path: <s3_output_path>
    container_logs_path: /opt/ml/output/tensorboard  # Path to logs on the container
  wait: True  # Whether to wait for training job to finish
  inputs:  # Inputs to call fit with. Set either s3 or file_system, not both.
    s3:  # Dictionary of channel names and s3 URIs. For GPUs, use channels for train and validation.
      train: <s3_train_data_path>
      val: null
  additional_estimator_kwargs:  # All other additional args to pass to estimator. Must be int, float or string.
    max_run: 180000
    enable_remote_debug: True
  recipe_overrides:
    exp_manager:
      explicit_log_dir: /opt/ml/output/tensorboard
    data:
      train_dir: /opt/ml/input/data/train
    model:
      model_config: /opt/ml/input/data/train/config.json
    compiler_cache_url: "<compiler_cache_url>"
```

Update `./recipes_collection/config.yaml` to specify `sm_jobs` for both `cluster` and `cluster_type`.

```
defaults:
  - _self_
  - cluster: sm_jobs  # set to `slurm`, `k8s` or `sm_jobs`, depending on the desired cluster
  - recipes: training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain
cluster_type: sm_jobs  # bcm, bcp, k8s or sm_jobs. If bcm, k8s or sm_jobs, it must match - cluster above.
```

Launch the job with the following command:

```
python3 main.py --config-path recipes_collection --config-name config
```

For more information about configuring SageMaker training jobs, see Run a training job on SageMaker training jobs.

# Trainium SageMaker training jobs pre-training tutorial
<a name="sagemaker-hyperpod-trainium-sagemaker-training-jobs-pretrain-tutorial"></a>

This tutorial guides you through the process of setting up and running a pre-training job using SageMaker training jobs with AWS Trainium instances.
+ Set up your environment
+ Launch a training job

Before you begin, make sure you have the following prerequisites.

**Prerequisites**  
Before you start setting up your environment, make sure you have:  
+ An Amazon FSx file system or an Amazon S3 bucket where you can load the data and output the training artifacts.
+ A Service Quota for the `ml.trn1.32xlarge` instance on Amazon SageMaker AI. To request a service quota increase, do the following:
  1. Navigate to the AWS Service Quotas console.
  1. Choose **AWS services**.
  1. Choose **Amazon SageMaker AI**.
  1. Specify one instance for `ml.trn1.32xlarge`.
+ An AWS Identity and Access Management (IAM) role with the `AmazonSageMakerFullAccess` and `AmazonEC2FullAccess` managed policies. These policies provide Amazon SageMaker AI with permissions to run the examples.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) If you need the pre-trained weights from HuggingFace or if you're training a Llama 3.2 model, you must get the HuggingFace token before you start training. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## Set up your environment for Trainium SageMaker training jobs
<a name="sagemaker-hyperpod-trainium-sagemaker-training-jobs-environment-setup"></a>

Before you run a SageMaker training job, use the `aws configure` command to configure your AWS credentials and preferred region. As an alternative, you can also provide your credentials through environment variables such as `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN`. For more information, see [SageMaker AI Python SDK](https://github.com/aws/sagemaker-python-sdk).

We strongly recommend using a SageMaker AI Jupyter notebook in SageMaker AI JupyterLab to launch a SageMaker training job. For more information, see [SageMaker JupyterLab](studio-updated-jl.md).
+ (Optional) Set up the virtual environment and dependencies. If you're using a Jupyter notebook in Amazon SageMaker Studio, you can skip this step. Make sure you're using Python 3.9 or greater.

  ```
  # set up a virtual environment
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  # install dependencies after git clone.
  
  git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  ```
+ Install SageMaker AI Python SDK

  ```
  pip3 install --upgrade sagemaker
  ```
+ If you're running a Llama 3.2 multi-modal training job, the `transformers` version must be `4.45.2` or greater.
  + Append `transformers==4.45.2` to `requirements.txt` in `source_dir` only when you're using the SageMaker AI Python SDK.
  + If you're using HyperPod recipes to launch with `sm_jobs` as the cluster type, you don't have to specify the transformers version.
+ `Container`: The Neuron container is set automatically by the SageMaker AI Python SDK.

## Launch the training job with a Jupyter Notebook
<a name="sagemaker-hyperpod-trainium-sagemaker-training-jobs-launch-training-job-notebook"></a>

You can use the following Python code to run a SageMaker training job using your recipe. It leverages the PyTorch estimator from the [SageMaker AI Python SDK](https://sagemaker.readthedocs.io/en/stable/) to submit the recipe. The following example launches the llama3-8b recipe as a SageMaker AI Training Job.
+ `compiler_cache_url`: The cache used to save the compiled artifacts, such as an Amazon S3 URL.

```
import os
import sagemaker, boto3
from sagemaker.debugger import TensorBoardOutputConfig

from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Define the output locations used by the TensorBoard config and estimator.
bucket = sagemaker_session.default_bucket()
output = os.path.join(f"s3://{bucket}", "output")
output_path = "<s3-URI>"

recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "explicit_log_dir": "/opt/ml/output/tensorboard",
    },
    "data": {
        "train_dir": "/opt/ml/input/data/train",
    },
    "model": {
        "model_config": "/opt/ml/input/data/train/config.json",
    },
    "compiler_cache_url": "<compiler_cache_url>"
} 

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=os.path.join(output, 'tensorboard'),
    container_local_output_path=recipe_overrides["exp_manager"]["explicit_log_dir"]
)

estimator = PyTorch(
    output_path=output_path,
    base_job_name=f"llama-trn",
    role=role,
    instance_type="ml.trn1.32xlarge",
    sagemaker_session=sagemaker_session,
    training_recipe="training/llama/hf_llama3_70b_seq8k_trn1x16_pretrain",
    recipe_overrides=recipe_overrides,
)

estimator.fit(inputs={"train": "your-inputs"}, wait=True)
```

The preceding code creates a PyTorch estimator object with the training recipe and then fits the model using the `fit()` method. Use the `training_recipe` parameter to specify the recipe you want to use for training.

## Launch the training job with the recipes launcher
<a name="sagemaker-hyperpod-trainium-sagemaker-training-jobs-launch-training-job-recipes"></a>
+ Update `./recipes_collection/cluster/sm_jobs.yaml`
  + `compiler_cache_url`: The URL used to save the artifacts. It can be an Amazon S3 URL.

  ```
  sm_jobs_config:
    output_path: <s3_output_path>
    tensorboard_config:
      output_path: <s3_output_path>
      container_logs_path: /opt/ml/output/tensorboard  # Path to logs on the container
    wait: True  # Whether to wait for training job to finish
    inputs:  # Inputs to call fit with. Set either s3 or file_system, not both.
      s3:  # Dictionary of channel names and s3 URIs. For GPUs, use channels for train and validation.
        train: <s3_train_data_path>
        val: null
    additional_estimator_kwargs:  # All other additional args to pass to estimator. Must be int, float or string.
      max_run: 180000
      image_uri: <your_image_uri>
      enable_remote_debug: True
      py_version: py39
    recipe_overrides:
      exp_manager:
        exp_dir: <exp_dir>
      data:
        train_dir: /opt/ml/input/data/train
        val_dir: /opt/ml/input/data/val
  ```
+ Update `./recipes_collection/config.yaml`

  ```
  defaults:
    - _self_
    - cluster: sm_jobs
    - recipes: training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain
  cluster_type: sm_jobs # bcm, bcp, k8s or sm_jobs. If bcm, k8s or sm_jobs, it must match - cluster above.
  
  instance_type: ml.trn1.32xlarge
  base_results_dir: ~/sm_job/hf_llama3_8B # Location to store the results, checkpoints and logs.
  ```
+ Launch the job with `main.py`

  ```
  python3 main.py --config-path recipes_collection --config-name config
  ```

For more information about configuring SageMaker training jobs, see [SageMaker training jobs pre-training tutorial (GPU)](sagemaker-hyperpod-gpu-sagemaker-training-jobs-pretrain-tutorial.md).

# Default configurations
<a name="default-configurations"></a>

This section outlines the essential components and settings required to initiate and customize your Large Language Model (LLM) training processes using SageMaker HyperPod. This section covers the key repositories, configuration files, and recipe structures that form the foundation of your training jobs. Understanding these default configurations is crucial for effectively setting up and managing your LLM training workflows, whether you're using pre-defined recipes or customizing them to suit your specific needs.

**Topics**
+ [GitHub repositories](github-repositories.md)
+ [General configuration](sagemaker-hyperpod-recipes-general-configuration.md)

# GitHub repositories
<a name="github-repositories"></a>

To launch a training job, you use files from two distinct GitHub repositories:
+ [SageMaker HyperPod recipes](https://github.com/aws/sagemaker-hyperpod-recipes)
+ [SageMaker HyperPod training adapter for NeMo](https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo)

These repositories contain essential components for initiating, managing, and customizing Large Language Model (LLM) training processes. You use the scripts from the repositories to set up and run the training jobs for your LLMs.

## HyperPod recipe repository
<a name="sagemaker-hyperpod-recipe-repository"></a>

Use the [SageMaker HyperPod recipes](https://github.com/aws/sagemaker-hyperpod-recipes) repository to get a recipe.

1. `main.py`: This file serves as the primary entry point for initiating the process of submitting a training job to either a cluster or a SageMaker training job.

1. `launcher_scripts`: This directory contains a collection of commonly used scripts designed to facilitate the training process for various Large Language Models (LLMs).

1. `recipes_collection`: This folder houses a compilation of pre-defined LLM recipes provided by the developers. Users can leverage these recipes in conjunction with their custom data to train LLM models tailored to their specific requirements.

You use the SageMaker HyperPod recipes to launch training or fine-tuning jobs. Regardless of the cluster you're using, the process of submitting the job is the same. For example, you can use the same script to submit a job to a Slurm or Kubernetes cluster. The launcher dispatches a training job based on three configuration files:

1. General Configuration (`config.yaml`): Includes common settings such as the default parameters or environment variables used in the training job.

1. Cluster Configuration (cluster): For training jobs using clusters only. If you're submitting a training job to a Kubernetes cluster, you might need to specify information such as volume, label, or restart policy. For Slurm clusters, you might need to specify the Slurm job name. All the parameters are related to the specific cluster that you're using.

1. Recipe (recipes): Recipes contain the settings for your training job, such as the model types, sharding degree, or dataset paths. For example, you can specify Llama as your training model and train it using model or data parallelism techniques like Fully Sharded Distributed Parallel (FSDP) across eight machines. You can also specify different checkpoint frequencies or paths for your training job.

After you've specified a recipe, you run the launcher script to specify an end-to-end training job on a cluster based on the configurations through the `main.py` entry point. For each recipe that you use, there are accompanying shell scripts located in the `launcher_scripts` folder. These examples guide you through submitting and initiating training jobs. The following figure illustrates how a SageMaker HyperPod recipe launcher submits a training job to a cluster based on the preceding configuration files. Currently, the SageMaker HyperPod recipe launcher is built on top of the NVIDIA NeMo Framework Launcher. For more information, see [NeMo Launcher Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html).

![\[Diagram illustrating the HyperPod recipe launcher workflow. On the left, inside a dashed box, are three file icons labeled "Recipe", "config.yaml", and "slurm.yaml or k8s.yaml or sm_job.yaml (Cluster config)". An arrow points from this box to a central box labeled "HyperPod recipe Launcher". From this central box, another arrow points right to "Training Job", with "main.py" written above the arrow.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/sagemaker-hyperpod-recipe-launcher.png)


## HyperPod recipe adapter repository
<a name="hyperpod-recipe-adapter"></a>

The SageMaker HyperPod training adapter is a training framework. You can use it to manage the entire lifecycle of your training jobs. Use the adapter to distribute the pre-training or fine-tuning of your models across multiple machines. The adapter uses different parallelism techniques to distribute the training. It also handles the implementation and management of saving the checkpoints. For more details, see [Advanced settings](cluster-specific-configurations-advanced-settings.md).

Use the [SageMaker HyperPod recipe adapter repository](https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo) to use the recipe adapter.

1. `src`: This directory contains the implementation of Large-scale Language Model (LLM) training, encompassing various features such as model parallelism, mixed-precision training, and checkpointing management.

1. `examples`: This folder provides a collection of examples demonstrating how to create an entry point for training an LLM model, serving as a practical guide for users.

# General configuration
<a name="sagemaker-hyperpod-recipes-general-configuration"></a>

The `config.yaml` file specifies the training recipe and the cluster. It also includes runtime configurations, such as environment variables, for the training job.

```
defaults:
  - _self_
  - cluster: slurm 
  - recipes: training/llama/hf_llama3_8b_seq8192_gpu
instance_type: p5.48xlarge
git:
  repo_url_or_path: null
  branch: null
  commit: null
  entry_script: null
  token: null
env_vars:
  NCCL_DEBUG: WARN
```

You can modify the following parameters in `config.yaml`:

1. `defaults`: Specify your default settings, such as the default cluster or default recipes.

1. `instance_type`: Modify the Amazon EC2 instance type to match the instance type that you're using.

1. `git`: Specify the location of the SageMaker HyperPod recipe adapter repository for the training job.

1. `env_vars`: You can specify the environment variables to be passed into your runtime training job. For example, you can adjust the logging level of NCCL by specifying the `NCCL_DEBUG` environment variable.

The recipe is the core configuration that defines your training job architecture. This file includes many important pieces of information for your training job, such as the following:
+ Whether to use model parallelism
+ The source of your datasets
+ Mixed precision training
+ Checkpointing-related configurations

You can use the recipes as-is. You can also use the following information to modify them.

## run
<a name="run"></a>

The following is the basic run information for running your training job.

```
run:
  name: llama-8b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "6-00:00:00"
  model_type: hf
```

1. `name`: Specify the name for your training job in the configuration file.

1. `results_dir`: You can specify the directory where the results of your training job are stored.

1. `time_limit`: You can set a maximum training time for your training job to prevent it from occupying hardware resources for too long.

1. `model_type`: You can specify the type of model you are using. For example, you can specify `hf` if your model is from HuggingFace.
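
Values such as `${base_results_dir}/${.name}` in the preceding snippet are resolved by interpolation at launch time (OmegaConf-style; the `.` prefix denotes a relative reference). The following is a simplified sketch of plain `${var}` resolution, for intuition only:

```python
import re

def interpolate(value, context):
    """Resolve ${key} references in a string against a context dict."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: str(context[m.group(1)]), value)

context = {"base_results_dir": "/results", "name": "llama-8b"}
resolved = interpolate("${base_results_dir}/${name}", context)
assert resolved == "/results/llama-8b"
```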

## exp_manager
<a name="exp-manager"></a>

The `exp_manager` section configures the experiment. With `exp_manager`, you can specify fields such as the output directory or checkpoint settings. The following is an example of how you can configure `exp_manager`.

```
exp_manager:
  exp_dir: null
  name: experiment
  create_tensorboard_logger: True
```

1. `exp_dir`: The experiment directory includes the standard output and standard error files for your training job. By default, it uses your current directory.

1. `name`: The experiment name used to identify your experiment under the `exp_dir`.

1. `create_tensorboard_logger`: Specify `True` or `False` to enable or disable the TensorBoard logger.

## Checkpointing
<a name="checkpointing"></a>

SageMaker HyperPod supports the following three types of checkpointing:
+ Auto checkpointing
+ Manual checkpointing
+ Full checkpointing

### Auto checkpointing
<a name="auto-checkpointing"></a>

If you're saving or loading checkpoints that are automatically managed by the SageMaker HyperPod recipe adapter, you can enable `auto_checkpoint`. To enable `auto_checkpoint`, set `enabled` to `True`. You can use auto checkpointing for both training and fine-tuning, and with both shared file systems and Amazon S3.

```
exp_manager:
  checkpoint_dir: ${recipes.exp_manager.exp_dir}/checkpoints/
  auto_checkpoint:
    enabled: True
```

Auto checkpointing saves the `local_state_dict` asynchronously, with an automatically computed optimal saving interval.

**Note**  
Under this checkpointing mode, auto-saved checkpoints don't support re-sharding between training runs. To resume from the latest auto-saved checkpoint, you must preserve the same shard degrees. You don't need to specify extra information to resume automatically.
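
For intuition about an automatically computed saving interval: one standard way to pick a checkpoint interval is the Young/Daly approximation, which balances checkpoint cost against expected recomputation after a failure. This is a hedged sketch with illustrative numbers, not the adapter's actual computation:

```python
import math

def checkpoint_interval_steps(checkpoint_secs, mtbf_secs, step_secs):
    """Young/Daly approximation: interval ~ sqrt(2 * C * MTBF), converted to steps."""
    interval_secs = math.sqrt(2 * checkpoint_secs * mtbf_secs)
    return max(1, round(interval_secs / step_secs))

# e.g. 30 s to write a checkpoint, 24 h mean time between failures, 2 s steps
steps = checkpoint_interval_steps(30, 24 * 3600, 2)
assert steps > 0
```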

### Manual checkpointing
<a name="manual-checkpointing"></a>

You can modify `checkpoint_callback_params` to asynchronously save an intermediate checkpoint in `sharded_state_dict`. For example, you can specify the following configuration to enable sharded checkpointing every 10 steps and keep the latest 3 checkpoints.

Sharded checkpointing allows you to change the shard degrees between training runs and load the checkpoint by setting `resume_from_checkpoint`.

**Note**  
If you're running PEFT fine-tuning, sharded checkpointing doesn't support Amazon S3.
Auto and manual checkpointing are mutually exclusive.
Only changes to the FSDP shard degrees and replication degrees are allowed.

```
exp_manager:
  checkpoint_callback_params:
    # Set save_top_k = 0 to disable sharded checkpointing
    save_top_k: 3
    every_n_train_steps: 10
    monitor: "step"
    mode: "max"
    save_last: False
  resume_from_checkpoint: ${recipes.exp_manager.exp_dir}/checkpoints/
```
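
Together, `every_n_train_steps: 10` and `save_top_k: 3` mean a sharded checkpoint is written every 10 steps and only the newest 3 are retained. The following is a minimal sketch of that retention policy, for intuition only:

```python
def checkpoints_to_keep(total_steps, every_n=10, save_top_k=3):
    """Steps at which checkpoints remain after training, keeping the newest k."""
    saved = [step for step in range(1, total_steps + 1) if step % every_n == 0]
    return saved[-save_top_k:]

assert checkpoints_to_keep(100) == [80, 90, 100]
```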

To learn more about checkpointing, see [Checkpointing using SMP](model-parallel-core-features-v2-checkpoints.md).

### Full checkpointing
<a name="full-checkpointing"></a>

The exported `full_state_dict` checkpoint can be used for inference or fine-tuning. You can load a full checkpoint through `hf_model_name_or_path`. Under this mode, only the model weights are saved.

To export the `full_state_dict` model, you can set the following parameters.

**Note**  
Currently, full checkpointing isn't supported for Amazon S3 checkpointing. You can't set the S3 path for `exp_manager.checkpoint_dir` if you're enabling full checkpointing. However, you can set `exp_manager.export_full_model.final_export_dir` to a specific directory on your local filesystem while setting `exp_manager.checkpoint_dir` to an Amazon S3 path.

```
exp_manager:
  export_full_model:
    # Set every_n_train_steps = 0 to disable full checkpointing
    every_n_train_steps: 0
    save_last: True
    final_export_dir : null
```
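One way to catch the constraint from the preceding note early is a pre-flight check on the recipe's `exp_manager` section. The `validate_full_export` helper below is a sketch of ours, not part of the recipe tooling:

```python
def validate_full_export(exp_manager: dict) -> None:
    """Raise if full checkpointing would target Amazon S3.

    Hypothetical pre-flight check: the exported full_state_dict model
    must land on a local filesystem path, not an s3:// URI.
    """
    export = exp_manager.get("export_full_model", {})
    enabled = export.get("every_n_train_steps", 0) > 0 or export.get("save_last", False)
    final_dir = str(export.get("final_export_dir") or "")
    if enabled and final_dir.startswith("s3://"):
        raise ValueError(
            "Full checkpointing doesn't support Amazon S3; "
            "set final_export_dir to a local directory."
        )
```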

## model
<a name="model"></a>

Use the `model` section to define various aspects of your model architecture and training process, including settings for model parallelism, precision, and data handling. The following are the key components you can configure within the `model` section:

### model parallelism
<a name="model-parallelism"></a>

After you've specified the recipe, you define the model that you're training, including the model parallelism degrees such as `tensor_model_parallel_degree`. You can also enable other features, like training with FP8 precision. For example, you can train a model with tensor parallelism and context parallelism:

```
model:
  model_type: llama_v3
  # Base configs
  train_batch_size: 4
  val_batch_size: 1
  seed: 12345
  grad_clip: 1.0

  # Model parallelism
  tensor_model_parallel_degree: 4
  expert_model_parallel_degree: 1
  context_parallel_degree: 2
```
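These degrees compose multiplicatively: in the example above, one model replica spans `tensor_model_parallel_degree` × `context_parallel_degree` = 4 × 2 = 8 accelerators, and the remaining devices provide data parallelism. The following back-of-the-envelope helper is a simplification of ours (it ignores interactions such as expert-parallel routing):

```python
def data_parallel_degree(world_size: int, tensor: int, context: int, expert: int = 1) -> int:
    """Accelerators left for data parallelism after model-parallel degrees are applied."""
    model_parallel = tensor * context * expert
    if world_size % model_parallel:
        raise ValueError("world_size must be divisible by the product of the parallel degrees")
    return world_size // model_parallel

# 32 GPUs with tensor=4 and context=2 leave 4-way data parallelism.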

For a better understanding of the different model parallelism techniques, see the following approaches:

1. [Tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md)

1. [Expert parallelism](model-parallel-core-features-v2-expert-parallelism.md)

1. [Context parallelism](model-parallel-core-features-v2-context-parallelism.md)

1. [Hybrid sharded data parallelism](model-parallel-core-features-v2-sharded-data-parallelism.md)

### FP8
<a name="fp8"></a>

To enable FP8 (8-bit floating-point precision), you can specify the FP8-related configuration in the following example:

```
model:
  # FP8 config
  fp8: True
  fp8_amax_history_len: 1024
  fp8_amax_compute_algo: max
```

The FP8 data format is currently supported only on the P5 instance type. If you're using an older instance type, such as P4, disable the FP8 feature for your model training process. For more information about FP8, see [Mixed precision training](model-parallel-core-features-v2-mixed-precision.md).
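Because FP8 support is tied to the instance family, a launch script can gate the flag before submitting the job. This check is a sketch of ours (the `fp8_supported` helper and the prefix match aren't part of the recipes):

```python
def fp8_supported(instance_type: str) -> bool:
    """FP8 requires the P5 family (NVIDIA H100 GPUs); P4 and earlier don't qualify."""
    return instance_type.startswith("ml.p5")

# Fall back to the default mixed precision when FP8 isn't available.
model_config = {"fp8": True}
if not fp8_supported("ml.p4d.24xlarge"):
    model_config["fp8"] = False
```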

### data
<a name="data"></a>

You can specify custom datasets for your training job by adding the data paths under `data`. The data module supports the following data formats:

1. JSON

1. JSONGZ (Compressed JSON)

1. ARROW

However, you are responsible for preparing your own pre-tokenized dataset. If you're an advanced user with specific requirements, there is also an option to implement and integrate a customized data module. For more information on HuggingFace datasets, see [Datasets](https://huggingface.co/docs/datasets/v3.1.0/en/index).

```
model:
  data:
    train_dir: /path/to/your/train/data
    val_dir: /path/to/your/val/data
    dataset_type: hf
    use_synthetic_data: False
```
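Because you supply the pre-tokenized dataset yourself, a typical preparation step serializes one record per line in one of the supported formats. The following is a minimal sketch for JSONGZ; the `input_ids` field name, the token IDs, and the file name are illustrative assumptions:

```python
import gzip
import json

def write_pretokenized(records, path):
    """Write pre-tokenized examples as gzip-compressed JSON lines (JSONGZ)."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

# Illustrative records; in practice the token IDs come from your tokenizer.
examples = [{"input_ids": [1, 15, 42, 2]}, {"input_ids": [1, 7, 9, 2]}]
write_pretokenized(examples, "train-00000.jsonl.gz")
```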

You can specify how you're training the model. By default, the recipe uses pre-training instead of fine-tuning. The following example configures the recipe to run a fine-tuning job with LoRA (Low-Rank Adaptation).

```
model:
  # Fine tuning config
  do_finetune: True
  # The path to resume from, needs to be HF compatible
  hf_model_name_or_path: null
  hf_access_token: null
  # PEFT config
  peft:
    peft_type: lora
    rank: 32
    alpha: 16
    dropout: 0.1
```
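With `peft_type: lora`, each adapted weight matrix stays frozen and learns a low-rank update B·A scaled by alpha/rank, so the trainable parameters per matrix drop from d_out·d_in to rank·(d_out + d_in). A quick count, using illustrative dimensions:

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for one LoRA-adapted matrix: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)

# One 4096x4096 projection has ~16.8M frozen weights, but only
# 32 * (4096 + 4096) = 262,144 trainable LoRA parameters at rank 32.
trainable = lora_trainable_params(4096, 4096, 32)
```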

For information about the recipes, see [SageMaker HyperPod recipes](https://github.com/aws/sagemaker-hyperpod-recipes).

# Cluster-specific configurations
<a name="cluster-specific-configurations"></a>

SageMaker HyperPod offers flexibility in running training jobs across different cluster environments. Each environment has its own configuration requirements and setup process. This section outlines the steps and configurations needed for running training jobs in SageMaker HyperPod Slurm, SageMaker HyperPod k8s, and SageMaker training jobs. Understanding these configurations is crucial for effectively leveraging the power of distributed training in your chosen environment.

You can use a recipe in the following cluster environments:
+ SageMaker HyperPod Slurm Orchestration
+ SageMaker HyperPod Amazon Elastic Kubernetes Service Orchestration
+ SageMaker training jobs

To launch a training job in a cluster, set and install the corresponding cluster configuration and environment.

**Topics**
+ [

# Running a training job on HyperPod Slurm
](cluster-specific-configurations-run-training-job-hyperpod-slurm.md)
+ [

# Running a training job on HyperPod k8s
](cluster-specific-configurations-run-training-job-hyperpod-k8s.md)
+ [

# Running a SageMaker training job
](cluster-specific-configurations-run-sagemaker-training-job.md)

# Running a training job on HyperPod Slurm
<a name="cluster-specific-configurations-run-training-job-hyperpod-slurm"></a>

SageMaker HyperPod Recipes supports submitting a training job to a GPU/Trainium slurm cluster. Before you submit the training job, update the cluster configuration. Use one of the following methods to update the cluster configuration:
+ Modify `slurm.yaml`
+ Override it through the command line

After you've updated the cluster configuration, install the environment.

## Configure the cluster
<a name="cluster-specific-configurations-configure-cluster-slurm-yaml"></a>

To submit a training job to a Slurm cluster, specify the Slurm-specific configuration. Modify `slurm.yaml` to configure the Slurm cluster. The following is an example of a Slurm cluster configuration. You can modify this file for your own training needs:

```
job_name_prefix: 'sagemaker-'
slurm_create_submission_file_only: False 
stderr_to_stdout: True
srun_args:
  # - "--no-container-mount-home"
slurm_docker_cfg:
  docker_args:
    # - "--runtime=nvidia" 
  post_launch_commands: 
container_mounts: 
  - "/fsx:/fsx"
```

1. `job_name_prefix`: Specify a job name prefix to easily identify your submissions to the Slurm cluster.

1. `slurm_create_submission_file_only`: Set this configuration to `True` for a dry run that generates the submission file without submitting the job, which can help you debug.

1. `stderr_to_stdout`: Specify whether you're redirecting your standard error (stderr) to standard output (stdout).

1. `srun_args`: Customize additional srun configurations, such as excluding specific compute nodes. For more information, see the srun documentation.

1. `slurm_docker_cfg`: The SageMaker HyperPod recipe launcher launches a Docker container to run your training job. You can specify additional Docker arguments within this parameter.

1. `container_mounts`: Specify the volumes you're mounting into the container for the recipe launcher, for your training jobs to access the files in those volumes.
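The second option, overriding the configuration through the command line, conceptually walks a dotted key path into the loaded configuration and replaces the value. The sketch below illustrates that mechanic only; it is not the launcher's actual parser, and the override syntax of the real CLI may differ:

```python
def apply_override(config: dict, override: str) -> None:
    """Apply one 'dotted.path=value' override to a nested config dict."""
    path, _, raw = override.partition("=")
    keys = path.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    # Interpret common boolean literals; leave everything else as a string.
    node[keys[-1]] = {"True": True, "False": False}.get(raw, raw)

cluster = {"stderr_to_stdout": True}
apply_override(cluster, "stderr_to_stdout=False")
apply_override(cluster, "slurm_docker_cfg.docker_args=--runtime=nvidia")
```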

# Running a training job on HyperPod k8s
<a name="cluster-specific-configurations-run-training-job-hyperpod-k8s"></a>

SageMaker HyperPod Recipes supports submitting a training job to a GPU/Trainium Kubernetes cluster. Before you submit the training job, do one of the following:
+ Modify the `k8s.yaml` cluster configuration file
+ Override the cluster configuration through the command line

After you've done either of the preceding steps, install the corresponding environment.

## Configure the cluster using `k8s.yaml`
<a name="cluster-specific-configurations-configure-cluster-k8s-yaml"></a>

To submit a training job to a Kubernetes cluster, you specify Kubernetes-specific configurations. The configurations include the cluster namespace or the location of the persistent volume.

```
pullPolicy: Always
restartPolicy: Never
namespace: default
persistent_volume_claims:
  - null
```

1. `pullPolicy`: You can specify the pull policy when you submit a training job. If you specify "Always," the Kubernetes cluster always pulls your image from the repository. For more information, see [Image pull policy](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy).

1. `restartPolicy`: Specify whether to restart your training job if it fails.

1. `namespace`: You can specify the Kubernetes namespace where you're submitting the training job.

1. `persistent_volume_claims`: You can specify a shared volume for your training job for all training processes to access the files in the volume.

# Running a SageMaker training job
<a name="cluster-specific-configurations-run-sagemaker-training-job"></a>

SageMaker HyperPod Recipes supports submitting a SageMaker training job. Before you submit the training job, you must update the cluster configuration file, `sm_job.yaml`, and install the corresponding environment.

## Use your recipe as a SageMaker training job
<a name="cluster-specific-configurations-cluster-config-sm-job-yaml"></a>

You can use your recipe as a SageMaker training job if you aren't hosting a cluster. You must modify the SageMaker training job configuration file, `sm_job.yaml`, to run your recipe.

```
sm_jobs_config:
  output_path: null 
  tensorboard_config:
    output_path: null 
    container_logs_path: null
  wait: True 
  inputs: 
    s3: 
      train: null
      val: null
    file_system:  
      directory_path: null
  additional_estimator_kwargs: 
    max_run: 1800
```

1. `output_path`: You can specify an Amazon S3 URL where you're saving your model.

1. `tensorboard_config`: You can specify a TensorBoard related configuration such as the output path or TensorBoard logs path.

1. `wait`: You can specify whether to wait for the job to complete after you submit your training job.

1. `inputs`: You can specify the paths for your training and validation data. The data source can be from a shared filesystem such as Amazon FSx or an Amazon S3 URL.

1. `additional_estimator_kwargs`: Additional estimator arguments for submitting a training job to the SageMaker training job platform. For more information, see [Algorithm Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/algorithm.html).

# Considerations
<a name="cluster-specific-configurations-special-considerations"></a>

When you're using Amazon SageMaker HyperPod recipes, there are some factors that can impact your model training process.
+ The `transformers` version must be `4.45.2` or greater for Llama 3.2. If you're using a Slurm or K8s workflow, the version is updated automatically.
+ Mixtral doesn't support 8-bit floating point precision (FP8).
+ Amazon EC2 P4 instances don't support FP8.

# Advanced settings
<a name="cluster-specific-configurations-advanced-settings"></a>

The SageMaker HyperPod recipe adapter is built on top of the NVIDIA NeMo and PyTorch Lightning frameworks. If you've already used these frameworks, integrating your custom models or features into the SageMaker HyperPod recipe adapter is a similar process. In addition to modifying the recipe adapter, you can change your own pre-training or fine-tuning script. For guidance on writing your custom training script, see [examples](https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo/tree/main/examples).

## Use the SageMaker HyperPod adapter to create your own model
<a name="cluster-specific-configurations-use-hyperpod-adapter-create-model"></a>

Within the recipe adapter, you can customize the following files in the following locations:

1. `collections/data`: Contains a module responsible for loading datasets. Currently, it only supports datasets from HuggingFace. If you have more advanced requirements, the code structure allows you to add custom data modules within the same folder.

1. `collections/model`: Includes the definitions of various language models. Currently, it supports common large language models like Llama, Mixtral, and Mistral. You have the flexibility to introduce your own model definitions within this folder.

1. `collections/parts`: This folder contains strategies for training models in a distributed manner. One example is the Fully Sharded Data Parallel (FSDP) strategy, which allows for sharding a large language model across multiple accelerators. Additionally, the strategies support various forms of model parallelism. You also have the option to introduce your own customized training strategies for model training.

1. `utils`: Contains various utilities that facilitate the management of a training job, and serves as a repository for your own tools, such as troubleshooting or benchmarking utilities. You can also add your own PyTorch Lightning callbacks within this folder to seamlessly integrate specific functionalities or operations into the training lifecycle.

1. `conf`: Contains the configuration schema definitions used to validate specific parameters in a training job. If you introduce new parameters or configurations, you can add your own customized schema to this folder to define validation rules, such as for data types, ranges, or any other parameter constraints.

# Appendix
<a name="appendix"></a>

Use the following information to monitor and analyze your training results.

## Monitor training results
<a name="monitor-training-results"></a>

Monitoring and analyzing training results is essential for developers to assess convergence and troubleshoot issues. SageMaker HyperPod recipes offer Tensorboard integration to analyze training behavior. To address the challenges of profiling large distributed training jobs, these recipes also incorporate VizTracer. VizTracer is a low-overhead tool for tracing and visualizing Python code execution. For more information about VizTracer, see [VizTracer](https://viztracer.readthedocs.io/en/latest/installation.html).

The following sections guide you through the process of implementing these features in your SageMaker HyperPod recipes.

### Tensorboard
<a name="tensorboard"></a>

Tensorboard is a powerful tool for visualizing and analyzing the training process. To enable Tensorboard, modify your recipe by setting the following parameter:

```
exp_manager:
  exp_dir: null
  name: experiment
  create_tensorboard_logger: True
```

After you enable the Tensorboard logger, the training logs are generated and stored within the experiment directory, which is defined in `exp_manager.exp_dir`. To access and analyze these logs locally, use the following procedure:

**To access and analyze logs**

1. Download the Tensorboard experiment folder from your training environment to your local machine.

1. Open a terminal or command prompt on your local machine.

1. Navigate to the directory containing the downloaded experiment folder.

1. Launch Tensorboard with the following command.

   ```
   tensorboard --port=<port> --bind_all --logdir experiment
   ```

1. Open your web browser and visit http://localhost:<port>, using the port that you specified when launching Tensorboard.

You can now see the status and visualizations of your training jobs within the Tensorboard interface. Monitoring and analyzing the training process this way helps you gain insights into the behavior and performance of your models. For more information about how to monitor and analyze training with Tensorboard, see the [NVIDIA NeMo Framework User Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/index.html).

### VizTracer
<a name="viztracer"></a>

To enable VizTracer, modify your recipe by setting the `model.viztracer.enabled` parameter to `true`. For example, you can update your Llama recipe to enable VizTracer by adding the following configuration:

```
model:
  viztracer:
    enabled: true
```

After the training has completed, your VizTracer profile is in the experiment folder at `exp_dir/result.json`. To analyze your profile, download it and open it using the vizviewer tool:

```
vizviewer --port <port> result.json
```

This command launches vizviewer on the port that you specify. You can view your VizTracer profile by opening http://localhost:<port> in your browser, and then begin analyzing the training run. For more information about using VizTracer, see the VizTracer documentation.

## SageMaker JumpStart versus SageMaker HyperPod
<a name="sagemaker-jumpstart-vs-hyperpod"></a>

While SageMaker JumpStart provides fine-tuning capabilities, the SageMaker HyperPod recipes provide the following:
+ Additional fine-grained control over the training loop
+ Recipe customization for your own models and data
+ Support for model parallelism

Use the SageMaker HyperPod recipes when you need access to the model's hyperparameters, multi-node training, and customization options for the training loop.

For more information about fine-tuning your models in SageMaker JumpStart, see [Fine-tune publicly available foundation models with the `JumpStartEstimator` class](jumpstart-foundation-models-use-python-sdk-estimator-class.md).

# Orchestrating SageMaker HyperPod clusters with Slurm
<a name="sagemaker-hyperpod-slurm"></a>

Slurm support in SageMaker HyperPod helps you provision resilient clusters for running machine learning (ML) workloads and developing state-of-the-art models such as large language models (LLMs), diffusion models, and foundation models (FMs). It accelerates development of FMs by removing the undifferentiated heavy lifting involved in building and maintaining large-scale compute clusters powered by thousands of accelerators such as AWS Trainium and NVIDIA A100 and H100 graphics processing units (GPUs). When accelerators fail, the resiliency features of SageMaker HyperPod monitor the cluster instances, and automatically detect and replace the faulty hardware on the fly so that you can focus on running ML workloads. Additionally, with lifecycle configuration support in SageMaker HyperPod, you can customize your computing environment to best suit your needs and configure it with the Amazon SageMaker AI distributed training libraries to achieve optimal performance on AWS.

**Operating clusters**

You can create, configure, and maintain SageMaker HyperPod clusters graphically through the console user interface (UI) and programmatically through the AWS Command Line Interface (CLI) or AWS SDK for Python (Boto3). With Amazon VPC, you can secure the cluster network and also take advantage of configuring your cluster with resources in your VPC, such as Amazon FSx for Lustre, which offers the fastest throughput. You can also give different IAM roles to cluster instance groups, and limit the actions that your cluster resources and users can perform. To learn more, see [SageMaker HyperPod Slurm cluster operations](sagemaker-hyperpod-operate-slurm.md).

**Configuring your ML environment**

SageMaker HyperPod runs [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami), which sets up an ML environment on the HyperPod clusters. You can configure additional customizations to the DLAMI by providing lifecycle scripts to support your use case. To learn more about how to set up lifecycle scripts, see [Getting started with SageMaker HyperPod](smcluster-getting-started-slurm.md) and [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md).

**Scheduling jobs**

After you successfully create a HyperPod cluster, cluster users can log into the cluster nodes (such as head or controller node, log-in node, and worker node) and schedule jobs for running machine learning workloads. To learn more, see [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md).

**Resiliency against hardware failures**

SageMaker HyperPod runs health checks on cluster nodes and provides a workload auto-resume functionality. With the cluster resiliency features of HyperPod, you can resume your workload from the last checkpoint you saved, after faulty nodes are replaced with healthy ones in clusters with more than 16 nodes. To learn more, see [SageMaker HyperPod cluster resiliency](sagemaker-hyperpod-resiliency-slurm.md).

**Logging and managing clusters**

You can find SageMaker HyperPod resource utilization metrics and lifecycle logs in Amazon CloudWatch, and manage SageMaker HyperPod resources by tagging them. Each `CreateCluster` API run creates a distinct log stream, named in `<cluster-name>-<timestamp>` format. In the log stream, you can check the host names, the name of failed lifecycle scripts, and outputs from the failed scripts such as `stdout` and `stderr`. For more information, see [SageMaker HyperPod cluster management](sagemaker-hyperpod-cluster-management-slurm.md).

**Compatible with SageMaker AI tools**

Using SageMaker HyperPod, you can configure clusters with AWS optimized collective communications libraries offered by SageMaker AI, such as the [SageMaker AI distributed data parallelism (SMDDP) library](data-parallel.md). The SMDDP library implements the `AllGather` operation optimized to the AWS compute and network infrastructure for the most performant SageMaker AI machine learning instances powered by NVIDIA A100 GPUs. To learn more, see [Running distributed training workloads with Slurm on HyperPod](sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload.md).

**Instance placement with UltraServers**

SageMaker AI automatically allocates jobs to instances within your UltraServer based on a best-effort strategy of using all of the instances in one UltraServer before using another one. For example, if you request 14 instances and have 2 UltraServers in your training plan, SageMaker AI fills the entire request from the first UltraServer. If you request 20 instances and have 2 UltraServers in your training plan, SageMaker AI uses all 17 instances in the first UltraServer and then uses 3 from the second UltraServer.
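This best-effort strategy amounts to filling one UltraServer completely before drawing on the next. The following is a sketch of that placement logic; the capacities are illustrative, and the actual scheduler is internal to SageMaker AI:

```python
def place_instances(requested: int, capacities: list[int]) -> list[int]:
    """Fill UltraServers in order, exhausting each one before using the next."""
    placement = []
    for capacity in capacities:
        take = min(requested, capacity)
        placement.append(take)
        requested -= take
    if requested > 0:
        raise ValueError("Not enough UltraServer capacity for the request")
    return placement

# Requesting 20 instances across two UltraServers with 17 available each
# places 17 on the first and 3 on the second.
```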

**Topics**
+ [

# Getting started with SageMaker HyperPod
](smcluster-getting-started-slurm.md)
+ [

# SageMaker HyperPod Slurm cluster operations
](sagemaker-hyperpod-operate-slurm.md)
+ [

# Customizing SageMaker HyperPod clusters using lifecycle scripts
](sagemaker-hyperpod-lifecycle-best-practices-slurm.md)
+ [

# SageMaker HyperPod multi-head node support
](sagemaker-hyperpod-multihead-slurm.md)
+ [

# Jobs on SageMaker HyperPod clusters
](sagemaker-hyperpod-run-jobs-slurm.md)
+ [

# SageMaker HyperPod cluster resources monitoring
](sagemaker-hyperpod-cluster-observability-slurm.md)
+ [

# SageMaker HyperPod cluster resiliency
](sagemaker-hyperpod-resiliency-slurm.md)
+ [

# Continuous provisioning for enhanced cluster operations with Slurm
](sagemaker-hyperpod-scaling-slurm.md)
+ [

# SageMaker HyperPod cluster management
](sagemaker-hyperpod-cluster-management-slurm.md)
+ [

# SageMaker HyperPod FAQs
](sagemaker-hyperpod-faq-slurm.md)

# Getting started with SageMaker HyperPod
<a name="smcluster-getting-started-slurm"></a>

Get started with creating your first SageMaker HyperPod cluster and learn the cluster operation functionalities of SageMaker HyperPod. You can create a SageMaker HyperPod cluster through the SageMaker AI console UI or the AWS CLI. This tutorial shows how to create a new SageMaker HyperPod cluster with Slurm, a popular workload scheduler. After you go through this tutorial, you will know how to log into the cluster nodes using AWS Systems Manager commands (`aws ssm`). After you complete this tutorial, see also [SageMaker HyperPod Slurm cluster operations](sagemaker-hyperpod-operate-slurm.md) to learn more about the basic SageMaker HyperPod operations, and [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md) to learn how to schedule jobs on the provisioned cluster.

**Tip**  
To find practical examples and solutions, see also the [SageMaker HyperPod workshop](https://catalog.workshops.aws/sagemaker-hyperpod).

**Topics**
+ [

# Getting started with SageMaker HyperPod using the SageMaker AI console
](smcluster-getting-started-slurm-console.md)
+ [

# Creating SageMaker HyperPod clusters using CloudFormation templates
](smcluster-getting-started-slurm-console-create-cluster-cfn.md)
+ [

# Getting started with SageMaker HyperPod using the AWS CLI
](smcluster-getting-started-slurm-cli.md)

# Getting started with SageMaker HyperPod using the SageMaker AI console
<a name="smcluster-getting-started-slurm-console"></a>

The following tutorial demonstrates how to create a new SageMaker HyperPod cluster and set it up with Slurm through the SageMaker AI console UI. Following the tutorial, you'll create a HyperPod cluster with three Slurm nodes, `my-controller-group`, `my-login-group`, and `worker-group-1`.

**Topics**
+ [

## Create cluster
](#smcluster-getting-started-slurm-console-create-cluster-page)
+ [

## Deploy resources
](#smcluster-getting-started-slurm-console-create-cluster-deploy)
+ [

## Delete the cluster and clean resources
](#smcluster-getting-started-slurm-console-delete-cluster-and-clean)

## Create cluster
<a name="smcluster-getting-started-slurm-console-create-cluster-page"></a>

To navigate to the **SageMaker HyperPod Clusters** page and choose **Slurm** orchestration, follow these steps.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose **HyperPod Clusters** in the left navigation pane and then **Cluster Management**.

1. On the **SageMaker HyperPod Clusters** page, choose **Create HyperPod cluster**. 

1. On the **Create HyperPod cluster** drop-down, choose **Orchestrated by Slurm**.

1. On the Slurm cluster creation page, you will see two options. Choose the option that best fits your needs.

   1. **Quick setup** - To get started immediately with default settings, choose **Quick setup**. With this option, SageMaker AI will create new resources such as VPC, subnets, security groups, Amazon S3 bucket, IAM role, and FSx for Lustre in the process of creating your cluster.

   1. **Custom setup** - To integrate with existing AWS resources or have specific networking, security, or storage requirements, choose **Custom setup**. With this option, you can choose to use the existing resources or create new ones, and you can customize the configuration that best fits your needs.

## Quick setup
<a name="smcluster-getting-started-slurm-console-create-cluster-default"></a>

On the **Quick setup** section, follow these steps to create your HyperPod cluster with Slurm orchestration.

### General settings
<a name="smcluster-getting-started-slurm-console-create-cluster-default-general"></a>

Specify a name for the new cluster. You can’t change the name after the cluster is created.

### Instance groups
<a name="smcluster-getting-started-slurm-console-create-cluster-default-instance-groups"></a>

To add an instance group, choose **Add group**. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. To deploy a cluster, you must add at least one instance group for Controller and Compute group types.

**Important**  
You can add one instance group at a time. To create multiple instance groups, repeat the process for each instance group.

Follow these steps to add an instance group.

1. For **Instance group type**, choose a type for your instance group. For this tutorial, choose **Controller (head)** for `my-controller-group`, **Login** for `my-login-group`, and **Compute (worker)** for `worker-group-1`.

1. For **Name**, specify a name for the instance group. For this tutorial, create three instance groups named `my-controller-group`, `my-login-group`, and `worker-group-1`.

1.  For **Instance capacity**, choose either on-demand capacity or a training plan to reserve your compute resources.

1. For **Instance type**, choose the instance for the instance group. For this tutorial, select `ml.c5.xlarge` for `my-controller-group`, `ml.m5.4xlarge` for `my-login-group`, and `ml.trn1.32xlarge` for `worker-group-1`. 
**Important**  
Ensure that you choose an instance type with sufficient quotas and enough unassigned IP addresses for your account. To view or request additional quotas, see [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

1. For **Instance quantity**, specify an integer not exceeding the instance quota for cluster usage. For this tutorial, enter **1** for all three groups.

1. For **Target Availability Zone**, choose the Availability Zone where your instances will be provisioned. The Availability Zone should correspond to the location of your accelerated compute capacity.

1. For **Additional storage volume per instance (GB) - optional**, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group. The default mount path for the additional EBS volume is `/opt/sagemaker`. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify if the EBS volume is mounted correctly by running the `df -h` command. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the [Amazon EBS volumes](https://docs.aws.amazon.com/ebs/latest/userguide/ebs-volumes.html) section in the *Amazon Elastic Block Store User Guide*.

1. Choose **Add instance group**.

### Quick setup defaults
<a name="smcluster-getting-started-slurm-console-create-cluster-default-settings"></a>

This section lists all the default settings for your cluster creation, including all the new AWS resources that will be created during the cluster creation process. Review the default settings.

## Custom setup
<a name="smcluster-getting-started-slurm-console-create-cluster-custom"></a>

On the **Custom setup** section, follow these steps to create your HyperPod cluster with Slurm orchestration.

### General settings
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-general"></a>

Specify a name for the new cluster. You can’t change the name after the cluster is created.

For **Instance recovery**, choose **Automatic - *recommended*** or **None**.

### Networking
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-network"></a>

Configure your network settings for the cluster creation. These settings can't be changed after the cluster is created.

1. For **VPC**, choose your own VPC if you already have one that gives SageMaker AI access to your VPC. To create a new VPC, follow the instructions at [Create a VPC](https://docs.aws.amazon.com/vpc/latest/userguide/create-vpc.html) in the *Amazon Virtual Private Cloud User Guide*. You can leave it as **None** to use the default SageMaker AI VPC.

1. For **VPC IPv4 CIDR block**, enter the starting IP of your VPC.

1. For **Availability Zones**, choose the Availability Zones (AZ) where HyperPod will create subnets for your cluster. Choose AZs that match the location of your accelerated compute capacity.

1. For **Security groups**, create a security group or choose up to five security groups configured with rules to allow inter-resource communication within the VPC.

### Instance groups
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-instance-groups"></a>

To add an instance group, choose **Add group**. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. To deploy a cluster, you must add at least one instance group.

**Important**  
You can add one instance group at a time. To create multiple instance groups, repeat the process for each instance group.

Follow these steps to add an instance group.

1. For **Instance group type**, choose a type for your instance group. For this tutorial, choose **Controller (head)** for `my-controller-group`, **Login** for `my-login-group`, and **Compute (worker)** for `worker-group-1`.

1. For **Name**, specify a name for the instance group. For this tutorial, create three instance groups named `my-controller-group`, `my-login-group`, and `worker-group-1`.

1. For **Instance capacity**, choose either on-demand capacity or a training plan to reserve your compute resources.

1. For **Instance type**, choose the instance for the instance group. For this tutorial, select `ml.c5.xlarge` for `my-controller-group`, `ml.m5.4xlarge` for `my-login-group`, and `ml.trn1.32xlarge` for `worker-group-1`. 
**Important**  
Ensure that you choose an instance type with sufficient quotas and enough unassigned IP addresses for your account. To view or request additional quotas, see [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

1. For **Instance quantity**, specify an integer not exceeding the instance quota for cluster usage. For this tutorial, enter **1** for all three groups.

1. For **Target Availability Zone**, choose the Availability Zone where your instances will be provisioned. The Availability Zone should correspond to the location of your accelerated compute capacity.

1. For **Additional storage volume per instance (GB) - optional**, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group. The default mount path for the additional EBS volume is `/opt/sagemaker`. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify if the EBS volume is mounted correctly by running the `df -h` command. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the [Amazon EBS volumes](https://docs.aws.amazon.com/ebs/latest/userguide/ebs-volumes.html) section in the *Amazon Elastic Block Store User Guide*.

1. Choose **Add instance group**.
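The EBS verification step described above can be sketched as a small script to run on a node once the cluster is in service. This is a sketch only, assuming the default mount path `/opt/sagemaker`; adjust the path if you changed it.

```shell
# Sketch: check whether df can report the additional EBS volume's mount path.
# On a cluster node, a successful df call shows the attached volume; on a
# machine where the path does not exist, the check reports "not mounted".
MOUNT_PATH=/opt/sagemaker
if df -h "$MOUNT_PATH" >/dev/null 2>&1; then
    STATUS="mounted"
else
    STATUS="not mounted"
fi
echo "Additional EBS volume at $MOUNT_PATH: $STATUS"
```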

### Lifecycle scripts
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-lifecycle"></a>

You can use the default lifecycle scripts or custom lifecycle scripts stored in your Amazon S3 bucket. You can view the default lifecycle scripts in the [Awesome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/7.sagemaker-hyperpod-eks/LifecycleScripts). To learn more about the lifecycle scripts, see [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md).

1. For **Lifecycle scripts**, choose to use default or custom lifecycle scripts.

1. For **S3 bucket for lifecycle scripts**, choose to create a new bucket or use an existing bucket to store the lifecycle scripts.
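To see the shape of a custom entry point before adopting the full default scripts, the following is a minimal placeholder `on_create.sh`. This is a sketch only; the default scripts in the repository linked above perform the actual Slurm and storage setup.

```shell
#!/bin/bash
# Minimal placeholder lifecycle entry point (sketch only). The real default
# scripts install and configure Slurm, mount storage, and more.
set -eu

MSG="Lifecycle configuration starting on $(hostname)"
echo "$MSG"

# Real setup steps would go here, for example installing packages or
# invoking the other scripts uploaded alongside this file.
```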

### Permissions
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-permissions"></a>

Choose or create an IAM role that allows HyperPod to run and access necessary AWS resources on your behalf.

### Storage
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-storage"></a>

Configure the FSx for Lustre file system to be provisioned on the HyperPod cluster.

1. For **File system**, choose an existing FSx for Lustre file system, create a new FSx for Lustre file system, or choose not to provision one.

1. For **Throughput per unit of storage**, choose the throughput that will be available per TiB of provisioned storage.

1. For **Storage capacity**, enter a capacity value in TB.

1. For **Data compression type**, choose **LZ4** to enable data compression.

1. For **Lustre version**, review the value that's recommended for new file systems.
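The per-unit throughput and storage capacity together determine the total throughput of the file system. A quick worked example with illustrative values:

```shell
# Worked example (values are illustrative only): total file system
# throughput is the per-TiB throughput multiplied by provisioned capacity.
PER_TIB_MBPS=250    # throughput per TiB of storage
CAPACITY_TIB=12     # provisioned capacity in TiB
TOTAL_MBPS=$((PER_TIB_MBPS * CAPACITY_TIB))
echo "Total throughput: ${TOTAL_MBPS} MB/s"
```

Because throughput scales with capacity, increasing either value raises the total available throughput.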

### Tags - optional
<a name="smcluster-getting-started-slurm-console-create-cluster-tags"></a>

For **Tags - *optional***, add key-value pairs to the new cluster so that you can manage the cluster as an AWS resource. To learn more, see [Tagging your AWS resources](https://docs.aws.amazon.com/tag-editor/latest/userguide/tagging.html).

## Deploy resources
<a name="smcluster-getting-started-slurm-console-create-cluster-deploy"></a>

After you complete the cluster configuration using either **Quick setup** or **Custom setup**, choose one of the following options to start resource provisioning and cluster creation.
+ **Submit** - SageMaker AI starts provisioning the default configuration resources and creating the cluster.
+ **Download CloudFormation template parameters** - Download the configuration parameter JSON file and run an AWS CLI command to deploy the CloudFormation stack that provisions the configured resources and creates the cluster. You can edit the downloaded parameter JSON file if needed. If you choose this option, see the instructions in [Creating SageMaker HyperPod clusters using CloudFormation templates](smcluster-getting-started-slurm-console-create-cluster-cfn.md).

## Delete the cluster and clean resources
<a name="smcluster-getting-started-slurm-console-delete-cluster-and-clean"></a>

After you finish testing SageMaker HyperPod cluster creation, the cluster continues running in the `InService` state until you delete it. We recommend that you delete any clusters created using on-demand SageMaker AI instances when they are not in use to avoid incurring continued service charges at on-demand pricing. In this tutorial, you created a cluster with multiple instance groups, including one that uses a C5 instance, so make sure that you delete the cluster by following the instructions at [Delete a SageMaker HyperPod cluster](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-delete-cluster).

However, if you have created a cluster with reserved compute capacity, the status of the clusters does not affect service billing.

To clean up the lifecycle scripts from the S3 bucket used for this tutorial, go to the S3 bucket you used during cluster creation and remove the files entirely.

If you tested running any workloads on the cluster, check whether you uploaded any data or whether your jobs saved any artifacts to other S3 buckets or file system services such as Amazon FSx for Lustre and Amazon Elastic File System. To prevent incurring charges, delete all artifacts and data from the storage or file system.

# Creating SageMaker HyperPod clusters using CloudFormation templates
<a name="smcluster-getting-started-slurm-console-create-cluster-cfn"></a>

You can create SageMaker HyperPod clusters using the CloudFormation templates for HyperPod. You must install the AWS CLI to proceed.

**Topics**
+ [Configure resources in the console and deploy using CloudFormation](#smcluster-getting-started-slurm-console-create-cluster-deploy-console)
+ [Configure resources and deploy using CloudFormation](#smcluster-getting-started-slurm-console-create-cluster-deploy-cfn)

## Configure resources in the console and deploy using CloudFormation
<a name="smcluster-getting-started-slurm-console-create-cluster-deploy-console"></a>

You can configure resources using the AWS Management Console and deploy using the CloudFormation templates. 

Follow these steps.

1. At the end of the tutorial in [Getting started with SageMaker HyperPod using the SageMaker AI console](smcluster-getting-started-slurm-console.md), choose **Download CloudFormation template parameters** instead of choosing **Submit**. The tutorial contains important configuration information that you need to create your cluster successfully.
**Important**  
If you choose **Submit**, you will not be able to deploy a cluster with the same name until you delete the cluster.

   After you choose **Download CloudFormation template parameters**, the **Using the configuration file to create the cluster using the AWS CLI** window will appear on the right side of the page.

1. On the **Using the configuration file to create the cluster using the AWS CLI** window, choose **Download configuration parameters file**. The file will be downloaded to your machine. You can edit the configuration JSON file based on your needs or leave it as-is, if no change is required.

1. In a terminal, navigate to the directory that contains the downloaded parameter file, referenced as `file://params.json` in the following command.

1. Run the [create-stack](https://docs.aws.amazon.com/cli/latest/reference/cloudformation/create-stack.html) AWS CLI command to deploy the CloudFormation stack that will provision the configured resources and create the HyperPod cluster.

   ```
   aws cloudformation create-stack \
       --stack-name my-stack \
       --template-url https://aws-sagemaker-hyperpod-cluster-setup.amazonaws.com/templates-slurm/main-stack-slurm-based-template.yaml \
       --parameters file://params.json \
       --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
   ```

1. To view the status of resource provisioning, navigate to the [CloudFormation console](https://console.aws.amazon.com/cloudformation).

   After cluster creation completes, view the new cluster under **Clusters** in the main pane of the SageMaker HyperPod console. You can check its status in the **Status** column.

1. After the status of the cluster turns to `InService`, you can start logging into the cluster nodes. To access the cluster nodes and start running ML workloads, see [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md).
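You can also monitor the deployment from the command line instead of the console. The following sketch extracts the stack status from a `describe-stacks` response; here `SAMPLE` stands in for the real output of `aws cloudformation describe-stacks --stack-name my-stack`.

```shell
# Sketch: read the stack status from a describe-stacks response.
# SAMPLE stands in for the output of:
#   aws cloudformation describe-stacks --stack-name my-stack
SAMPLE='{"Stacks": [{"StackName": "my-stack", "StackStatus": "CREATE_COMPLETE"}]}'
STACK_STATUS=$(printf '%s' "$SAMPLE" | python3 -c 'import json, sys; print(json.load(sys.stdin)["Stacks"][0]["StackStatus"])')
echo "Stack status: $STACK_STATUS"
```

Alternatively, `aws cloudformation wait stack-create-complete --stack-name my-stack` blocks until the stack finishes creating.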

## Configure resources and deploy using CloudFormation
<a name="smcluster-getting-started-slurm-console-create-cluster-deploy-cfn"></a>

You can configure resources and deploy using the CloudFormation templates for SageMaker HyperPod.

Follow these steps.

1. Download a CloudFormation template for SageMaker HyperPod from the [sagemaker-hyperpod-cluster-setup](https://github.com/aws/sagemaker-hyperpod-cluster-setup) GitHub repository.

1. Run the [create-stack](https://docs.aws.amazon.com/cli/latest/reference/cloudformation/create-stack.html) AWS CLI command to deploy the CloudFormation stack that will provision the configured resources and create the HyperPod cluster.

   ```
   aws cloudformation create-stack \
       --stack-name my-stack \
       --template-body file://path_to_the_downloaded_template_file \
       --parameters file://params.json \
       --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
   ```

1. To view the status of resource provisioning, navigate to the CloudFormation console.

   After cluster creation completes, view the new cluster under **Clusters** in the main pane of the SageMaker HyperPod console. You can check its status in the **Status** column.

1. After the status of the cluster turns to `InService`, you can start logging into the cluster nodes. To access the cluster nodes and start running ML workloads, see [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md).

# Getting started with SageMaker HyperPod using the AWS CLI
<a name="smcluster-getting-started-slurm-cli"></a>

Create your first SageMaker HyperPod cluster using the AWS CLI commands for HyperPod.

## Create your first SageMaker HyperPod cluster with Slurm
<a name="smcluster-getting-started-slurm-cli-create-cluster"></a>

The following tutorial demonstrates how to create a new SageMaker HyperPod cluster and set it up with Slurm through the [AWS CLI commands for SageMaker HyperPod](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-cli). Following the tutorial, you'll create a HyperPod cluster with three instance groups: `my-controller-group`, `my-login-group`, and `worker-group-1`.

With the API-driven configuration approach, you define Slurm node types and partition assignments directly in the CreateCluster API request using `SlurmConfig`. This eliminates the need for a separate `provisioning_parameters.json` file and provides built-in validation, drift detection, and per-instance-group FSx configuration.

1. First, prepare and upload lifecycle scripts to an Amazon S3 bucket. During cluster creation, HyperPod runs them in each instance group. Upload lifecycle scripts to Amazon S3 using the following command.

   ```
   aws s3 sync \
       ~/local-dir-to-lifecycle-scripts/ \
       s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src
   ```
**Note**  
The S3 bucket path must start with the prefix `sagemaker-`, because the [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod) with `AmazonSageMakerClusterInstanceRolePolicy` only allows access to Amazon S3 buckets whose names start with that prefix.

   If you are starting from scratch, use sample lifecycle scripts provided in the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/). The following sub-steps show how to download and upload the sample lifecycle scripts to an Amazon S3 bucket.

   1. Download a copy of the lifecycle script samples to a directory on your local computer.

      ```
      git clone https://github.com/aws-samples/awsome-distributed-training/
      ```

   1. Go into the [base-config](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config) directory, where you can find a set of lifecycle scripts.

      ```
      cd awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config
      ```

      To learn more about the lifecycle script samples, see [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md).

   1. Upload the scripts to `s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src`. You can do so by using the Amazon S3 console, or by running the following AWS CLI Amazon S3 command.

      ```
      aws s3 sync \
          ~/local-dir-to-lifecycle-scripts/ \
          s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src
      ```
**Note**  
With API-driven configuration, you do not need to create or upload a `provisioning_parameters.json` file. The Slurm configuration is defined directly in the CreateCluster API request in the next step.

1. Prepare a [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) request file in JSON format and save it as `create_cluster.json`.

   With API-driven configuration, you specify the Slurm node type and partition assignment for each instance group using the `SlurmConfig` field. You also configure the cluster-level Slurm settings using `Orchestrator.Slurm`.

   For `ExecutionRole`, provide the ARN of the IAM role you created with the managed `AmazonSageMakerClusterInstanceRolePolicy` in [Prerequisites for using SageMaker HyperPod](sagemaker-hyperpod-prerequisites.md).

   ```
   {
       "ClusterName": "my-hyperpod-cluster",
       "InstanceGroups": [
           {
               "InstanceGroupName": "my-controller-group",
               "InstanceType": "ml.c5.xlarge",
               "InstanceCount": 1,
               "SlurmConfig": {
                   "NodeType": "Controller"
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole",
               "InstanceStorageConfigs": [
                   {
                       "EbsVolumeConfig": {
                           "VolumeSizeInGB": 500
                       }
                   }
               ]
           },
           {
               "InstanceGroupName": "my-login-group",
               "InstanceType": "ml.m5.4xlarge",
               "InstanceCount": 1,
               "SlurmConfig": {
                   "NodeType": "Login"
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole"
           },
           {
               "InstanceGroupName": "worker-group-1",
               "InstanceType": "ml.trn1.32xlarge",
               "InstanceCount": 1,
               "SlurmConfig": {
                   "NodeType": "Compute",
                   "PartitionNames": ["partition-1"]
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole"
           }
       ],
       "Orchestrator": {
           "Slurm": {
               "SlurmConfigStrategy": "Managed"
           }
       }
   }
   ```

   **SlurmConfig fields:**    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/smcluster-getting-started-slurm-cli.html)

   **Orchestrator.Slurm fields:**    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/smcluster-getting-started-slurm-cli.html)

   **SlurmConfigStrategy options:**
   + `Managed` (recommended): HyperPod fully manages `slurm.conf` and detects unauthorized changes (drift detection). Updates fail if drift is detected.
   + `Overwrite`: HyperPod overwrites `slurm.conf` on updates, ignoring any manual changes.
   + `Merge`: HyperPod preserves manual changes and merges them with API configuration.

   **Adding FSx for Lustre (optional):**

   To mount an FSx for Lustre filesystem to your compute nodes, add `FsxLustreConfig` to the `InstanceStorageConfigs` for the instance group. This requires a Custom VPC configuration.

   ```
   {
       "InstanceGroupName": "worker-group-1",
       "InstanceType": "ml.trn1.32xlarge",
       "InstanceCount": 1,
       "SlurmConfig": {
           "NodeType": "Compute",
           "PartitionNames": ["partition-1"]
       },
       "InstanceStorageConfigs": [
           {
               "FsxLustreConfig": {
                    "DnsName": "fs-0abc123def456789.fsx.us-west-2.amazonaws.com",
                   "MountPath": "/fsx",
                   "MountName": "abcdefgh"
               }
           }
       ],
       "LifeCycleConfig": {
           "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
           "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole"
   }
   ```

   **Adding FSx for OpenZFS (optional):**

   You can also mount FSx for OpenZFS filesystems:

   ```
   "InstanceStorageConfigs": [
       {
           "FsxOpenZfsConfig": {
                "DnsName": "fs-0xyz789abc123456.fsx.us-west-2.amazonaws.com",
               "MountPath": "/shared"
           }
       }
   ]
   ```
**Note**  
Each instance group can have at most one FSx for Lustre and one FSx for OpenZFS configuration. Different instance groups can mount different filesystems.

   **Adding VPC configuration (required for FSx):**

   If using FSx, you must specify a Custom VPC configuration:

   ```
   {
       "ClusterName": "my-hyperpod-cluster",
       "InstanceGroups": [
           {
               "InstanceGroupName": "my-controller-group",
               "InstanceType": "ml.c5.xlarge",
               "InstanceCount": 1,
               "SlurmConfig": {
                   "NodeType": "Controller"
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole"
            }
       ],
       "Orchestrator": {
           "Slurm": {
               "SlurmConfigStrategy": "Managed"
           }
       },
       "VpcConfig": {
           "SecurityGroupIds": ["sg-0abc123def456789a"],
           "Subnets": ["subnet-0abc123def456789a"]
       }
   }
   ```

1. Run the following command to create the cluster.

   ```
   aws sagemaker create-cluster --cli-input-json file://complete/path/to/create_cluster.json
   ```

   This should return the ARN of the created cluster.

   ```
   {
       "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/my-hyperpod-cluster"
   }
   ```

   If you receive an error due to resource limits, ensure that you change the instance type to one with sufficient quotas in your account, or request additional quotas by following [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

   **Common validation errors:**    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/smcluster-getting-started-slurm-cli.html)

1. Run `describe-cluster` to check the status of the cluster.

   ```
   aws sagemaker describe-cluster --cluster-name my-hyperpod-cluster
   ```

   Example response:

   ```
   {
       "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/my-hyperpod-cluster",
       "ClusterName": "my-hyperpod-cluster",
       "ClusterStatus": "Creating",
       "InstanceGroups": [
           {
               "InstanceGroupName": "my-controller-group",
               "InstanceType": "ml.c5.xlarge",
               "InstanceCount": 1,
               "CurrentCount": 0,
               "TargetCount": 1,
               "SlurmConfig": {
                   "NodeType": "Controller"
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<bucket>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole"
           },
           {
               "InstanceGroupName": "my-login-group",
               "InstanceType": "ml.m5.4xlarge",
               "InstanceCount": 1,
               "CurrentCount": 0,
               "TargetCount": 1,
               "SlurmConfig": {
                   "NodeType": "Login"
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<bucket>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole"
           },
           {
               "InstanceGroupName": "worker-group-1",
               "InstanceType": "ml.trn1.32xlarge",
               "InstanceCount": 1,
               "CurrentCount": 0,
               "TargetCount": 1,
               "SlurmConfig": {
                   "NodeType": "Compute",
                   "PartitionNames": ["partition-1"]
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<bucket>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole"
           }
       ],
       "Orchestrator": {
           "Slurm": {
               "SlurmConfigStrategy": "Managed"
           }
       },
       "CreationTime": "2024-01-15T10:30:00Z"
   }
   ```

   After the status of the cluster turns to **InService**, proceed to the next step. Cluster creation typically takes 10-15 minutes.

1. Run `list-cluster-nodes` to check the details of the cluster nodes.

   ```
   aws sagemaker list-cluster-nodes --cluster-name my-hyperpod-cluster
   ```

   Example response:

   ```
   {
       "ClusterNodeSummaries": [
           {
               "InstanceGroupName": "my-controller-group",
               "InstanceId": "i-0abc123def456789a",
               "InstanceType": "ml.c5.xlarge",
               "InstanceStatus": {
                   "Status": "Running",
                   "Message": ""
               },
               "LaunchTime": "2024-01-15T10:35:00Z"
           },
           {
               "InstanceGroupName": "my-login-group",
               "InstanceId": "i-0abc123def456789b",
               "InstanceType": "ml.m5.4xlarge",
               "InstanceStatus": {
                   "Status": "Running",
                   "Message": ""
               },
               "LaunchTime": "2024-01-15T10:35:00Z"
           },
           {
               "InstanceGroupName": "worker-group-1",
               "InstanceId": "i-0abc123def456789c",
               "InstanceType": "ml.trn1.32xlarge",
               "InstanceStatus": {
                   "Status": "Running",
                   "Message": ""
               },
               "LaunchTime": "2024-01-15T10:36:00Z"
           }
       ]
   }
   ```

   The `InstanceId` is what cluster users need to log in to the nodes (through `aws ssm`). For more information about logging in to the cluster nodes and running ML workloads, see [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md).

1. Connect to your cluster using AWS Systems Manager Session Manager.

   ```
   aws ssm start-session \
       --target sagemaker-cluster:my-hyperpod-cluster_my-login-group-i-0abc123def456789b \
       --region us-west-2
   ```

   Once connected, verify Slurm is configured correctly:

   ```
   # Check Slurm nodes
   sinfo
   
   # Check Slurm partitions
   sinfo -p partition-1
   
   # Submit a test job
   srun -p partition-1 --nodes=1 hostname
   ```
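When scripting these steps end to end, you can poll `describe-cluster` until the cluster reaches `InService`. In the following sketch, `describe_cluster` is a stub standing in for the real AWS CLI call, so only the loop structure is shown.

```shell
# Sketch: poll cluster status until it reaches InService.
# describe_cluster is a stub; a real script would instead run:
#   aws sagemaker describe-cluster --cluster-name my-hyperpod-cluster
describe_cluster() {
    printf '%s' '{"ClusterName": "my-hyperpod-cluster", "ClusterStatus": "InService"}'
}

get_status() {
    describe_cluster | python3 -c 'import json, sys; print(json.load(sys.stdin)["ClusterStatus"])'
}

STATUS=$(get_status)
while [ "$STATUS" != "InService" ]; do
    sleep 30
    STATUS=$(get_status)
done
echo "Cluster status: $STATUS"
```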

## Delete the cluster and clean resources
<a name="smcluster-getting-started-slurm-cli-delete-cluster-and-clean"></a>

After you finish testing SageMaker HyperPod cluster creation, the cluster continues running in the `InService` state until you delete it. We recommend that you delete any clusters created using on-demand SageMaker AI capacity when they are not in use to avoid incurring continued service charges at on-demand pricing. In this tutorial, you created a cluster that consists of three instance groups. Make sure that you delete the cluster by running the following command.

```
aws sagemaker delete-cluster --cluster-name my-hyperpod-cluster
```

To clean up the lifecycle scripts from the Amazon S3 bucket used for this tutorial, go to the Amazon S3 bucket you used during cluster creation and remove the files entirely.

```
aws s3 rm s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src --recursive
```

If you have tested running any model training workloads on the cluster, also check if you have uploaded any data or if your job has saved any artifacts to different Amazon S3 buckets or file system services such as Amazon FSx for Lustre and Amazon Elastic File System. To prevent incurring charges, delete all artifacts and data from the storage or file system.

## Related topics
<a name="smcluster-getting-started-slurm-cli-related-topics"></a>
+ [SageMaker HyperPod Slurm configuration](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-slurm-configuration)
+ [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md)
+ [FSx configuration via InstanceStorageConfigs](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-slurm-fsx-config)
+ [SageMaker HyperPod Slurm cluster operations](sagemaker-hyperpod-operate-slurm.md)

# SageMaker HyperPod Slurm cluster operations
<a name="sagemaker-hyperpod-operate-slurm"></a>

This section provides guidance on managing SageMaker HyperPod through the SageMaker AI console UI or the AWS Command Line Interface (CLI). You'll learn how to perform various tasks related to SageMaker HyperPod, whether you prefer a visual interface or working with commands.

**Topics**
+ [Managing SageMaker HyperPod Slurm clusters using the SageMaker console](sagemaker-hyperpod-operate-slurm-console-ui.md)
+ [Managing SageMaker HyperPod Slurm clusters using the AWS CLI](sagemaker-hyperpod-operate-slurm-cli-command.md)

# Managing SageMaker HyperPod Slurm clusters using the SageMaker console
<a name="sagemaker-hyperpod-operate-slurm-console-ui"></a>

The following topics provide guidance on how to manage SageMaker HyperPod through the console UI.

**Topics**
+ [Create a SageMaker HyperPod cluster](#sagemaker-hyperpod-operate-slurm-console-ui-create-cluster)
+ [Browse your SageMaker HyperPod clusters](#sagemaker-hyperpod-operate-slurm-console-ui-browse-clusters)
+ [View details of each SageMaker HyperPod cluster](#sagemaker-hyperpod-operate-slurm-console-ui-view-details-of-clusters)
+ [Edit a SageMaker HyperPod cluster](#sagemaker-hyperpod-operate-slurm-console-ui-edit-clusters)
+ [Delete a SageMaker HyperPod cluster](#sagemaker-hyperpod-operate-slurm-console-ui-delete-cluster)

## Create a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-operate-slurm-console-ui-create-cluster"></a>

See the instructions in [Getting started with SageMaker HyperPod using the SageMaker AI console](smcluster-getting-started-slurm-console.md) to create a new SageMaker HyperPod cluster through the SageMaker HyperPod console UI.

## Browse your SageMaker HyperPod clusters
<a name="sagemaker-hyperpod-operate-slurm-console-ui-browse-clusters"></a>

On the SageMaker HyperPod console main page, all created clusters appear under the **Clusters** section, which provides a summary view of the clusters, their ARNs, status, and creation time.

## View details of each SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-operate-slurm-console-ui-view-details-of-clusters"></a>

Under **Clusters** on the console main page, the cluster names are rendered as links. Choose a cluster name link to see the details of that cluster.

## Edit a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-operate-slurm-console-ui-edit-clusters"></a>

1. Under **Clusters** in the main pane of the SageMaker HyperPod console, choose the cluster you want to update.

1. Select your cluster, and choose **Edit**.

1. In the **Edit <your-cluster>** page, you can edit the configurations of existing instance groups, add more instance groups, delete instance groups, and change tags for the cluster. After making changes, choose **Submit**. 

   1. In the **Configure instance groups** section, you can add more instance groups by choosing **Create instance group**.

   1. In the **Configure instance groups** section, you can choose **Edit** to change an instance group's configuration or **Delete** to remove the instance group permanently.
**Important**  
When deleting an instance group, consider the following points:  
Your SageMaker HyperPod cluster must always maintain at least one instance group.  
Ensure that all critical data is backed up before removal.  
The removal process cannot be undone.
**Note**  
Deleting an instance group terminates all compute resources associated with that group.

   1. In the **Tags** section, you can update tags for the cluster.

## Delete a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-operate-slurm-console-ui-delete-cluster"></a>

1. Under **Clusters** in the main pane of the SageMaker HyperPod console, choose the cluster you want to delete.

1. Select your cluster, and choose **Delete**.

1. In the pop-up window for cluster deletion, review the cluster information carefully to confirm that you chose the right cluster to delete.

1. After you have reviewed the cluster information, choose **Yes, delete cluster**.

1. In the text field to confirm this deletion, type **delete**.

1. Choose **Delete** in the lower-right corner of the pop-up window to send the cluster deletion request.

# Managing SageMaker HyperPod Slurm clusters using the AWS CLI
<a name="sagemaker-hyperpod-operate-slurm-cli-command"></a>

The following topics provide guidance on writing SageMaker HyperPod API request files in JSON format and running them using AWS CLI commands.

**Topics**
+ [Create a new cluster](#sagemaker-hyperpod-operate-slurm-cli-command-create-cluster)
+ [Describe a cluster](#sagemaker-hyperpod-operate-slurm-cli-command-describe-cluster)
+ [List details of cluster nodes](#sagemaker-hyperpod-operate-slurm-cli-command-list-cluster-nodes)
+ [Describe details of a cluster node](#sagemaker-hyperpod-operate-slurm-cli-command-describe-cluster-node)
+ [List clusters](#sagemaker-hyperpod-operate-slurm-cli-command-list-clusters)
+ [Update cluster configuration](#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster)
+ [Update the SageMaker HyperPod platform software of a cluster](#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software)
+ [Scale down a cluster](#sagemaker-hyperpod-operate-slurm-cli-command-scale-down)
+ [Delete a cluster](#sagemaker-hyperpod-operate-slurm-cli-command-delete-cluster)

## Create a new cluster
<a name="sagemaker-hyperpod-operate-slurm-cli-command-create-cluster"></a>

1. Prepare lifecycle configuration scripts and upload them to an S3 bucket, such as `s3://sagemaker-amzn-s3-demo-bucket/lifecycle-script-directory/src/`. Step 2 assumes that an entry point script named `on_create.sh` exists in the specified S3 location.
**Important**  
Make sure that you set the S3 path to start with `s3://sagemaker-`. The [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod) has the managed [AmazonSageMakerClusterInstanceRolePolicy](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-cluster.html) attached, which allows access to S3 buckets with the specific prefix `sagemaker-`.

1. Prepare a [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) API request file in JSON format. Configure the instance groups to match the Slurm cluster design in the `provisioning_parameters.json` file, which is used during cluster creation as part of running the set of lifecycle scripts. To learn more, see [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md). The following template has two instance groups to meet the minimum requirement for a Slurm cluster: one controller (head) node and one compute (worker) node. For `ExecutionRole`, provide the ARN of the IAM role that you created with the managed `AmazonSageMakerClusterInstanceRolePolicy` from the section [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod).

   ```
   // create_cluster.json
   {
       "ClusterName": "your-hyperpod-cluster",
       "InstanceGroups": [
           {
               "InstanceGroupName": "controller-group",
               "InstanceType": "ml.m5.xlarge",
               "InstanceCount": 1,
               "LifeCycleConfig": {
                    "SourceS3Uri": "s3://sagemaker-amzn-s3-demo-bucket/lifecycle-script-directory/src/",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster",
               // Optional: Configure an additional storage per instance group.
               "InstanceStorageConfigs": [
                   {
                      // Attach an additional EBS volume to each instance within the instance group.
                      // The default mount path for the additional EBS volume is /opt/sagemaker.
                      "EbsVolumeConfig":{
                         // Specify an integer between 1 and 16384 in gigabytes (GB).
                          "VolumeSizeInGB": integer
                      }
                   }
               ]
           }, 
           {
               "InstanceGroupName": "worker-group-1",
                "InstanceType": "ml.p4d.24xlarge",
               "InstanceCount": 1,
               "LifeCycleConfig": {
                    "SourceS3Uri": "s3://sagemaker-amzn-s3-demo-bucket/lifecycle-script-directory/src/",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster"
           }
       ],
       // Optional
       "Tags": [ 
           { 
              "Key": "string",
              "Value": "string"
           }
       ],
       // Optional
       "VpcConfig": { 
           "SecurityGroupIds": [ "string" ],
           "Subnets": [ "string" ]
       }
   }
   ```

   Depending on how you design the cluster structure through your lifecycle scripts, you can configure up to 20 instance groups under the `InstanceGroups` parameter.

   For the `Tags` request parameter, you can add custom tags for managing the SageMaker HyperPod cluster as an AWS resource. You can add tags to your cluster in the same way you add them in other AWS services that support tagging. To learn more about tagging AWS resources in general, see [Tagging AWS Resources User Guide](https://docs.aws.amazon.com/tag-editor/latest/userguide/tagging.html).

   For the `VpcConfig` request parameter, specify the information of a VPC you want to use. For more information, see [Setting up SageMaker HyperPod with a custom Amazon VPC](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-optional-vpc).

1. Run the [create-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-cluster.html) command as follows.

   ```
   aws sagemaker create-cluster \
       --cli-input-json file://complete/path/to/create_cluster.json
   ```

   This should return the ARN of the new cluster.
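If you script cluster creation, you can capture the returned ARN directly with the `--query` flag. The following is a minimal sketch; the request file path and the wrapper function name are assumptions for illustration.

```shell
# Create a HyperPod cluster and print only its ARN (sketch).
# CreateCluster returns a ClusterArn field, which --query extracts.
create_hyperpod_cluster() {
    local request_file="$1"
    aws sagemaker create-cluster \
        --cli-input-json "file://${request_file}" \
        --query ClusterArn \
        --output text
}

# Example usage (hypothetical path):
# CLUSTER_ARN=$(create_hyperpod_cluster ./create_cluster.json)
```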

## Describe a cluster
<a name="sagemaker-hyperpod-operate-slurm-cli-command-describe-cluster"></a>

Run [describe-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-cluster.html) to check the status of the cluster. You can specify either the name or the ARN of the cluster.

```
aws sagemaker describe-cluster --cluster-name your-hyperpod-cluster
```

After the status of the cluster changes to **InService**, proceed to the next step. You can also use this API to retrieve failure messages from running other HyperPod API operations.
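In a script, you can poll the `ClusterStatus` field until the cluster is ready. The following is a minimal sketch; the function name and the 30-second polling interval are assumptions, and you may want to add a timeout for production use.

```shell
# Poll DescribeCluster until the cluster reaches InService (sketch).
# Returns 1 if the cluster enters the Failed state.
wait_for_cluster_inservice() {
    local cluster_name="$1" status=""
    while true; do
        status=$(aws sagemaker describe-cluster \
            --cluster-name "$cluster_name" \
            --query ClusterStatus \
            --output text)
        echo "Cluster status: $status"
        if [ "$status" = "InService" ]; then return 0; fi
        if [ "$status" = "Failed" ]; then return 1; fi
        sleep 30
    done
}

# Example usage:
# wait_for_cluster_inservice your-hyperpod-cluster
```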

## List details of cluster nodes
<a name="sagemaker-hyperpod-operate-slurm-cli-command-list-cluster-nodes"></a>

Run [list-cluster-nodes](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-cluster-nodes.html) to check the key information of the cluster nodes.

```
aws sagemaker list-cluster-nodes --cluster-name your-hyperpod-cluster
```

The response lists the cluster nodes, and each node's `InstanceId` is what you need for logging in to that node (using `aws ssm`).
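With the instance ID, you can open a shell session on a node through AWS Systems Manager. The target string below follows the HyperPod SSM target convention (`sagemaker-cluster:<cluster-id>_<instance-group>-<instance-id>`); the helper function and the example IDs are illustrative assumptions, so verify the format against your cluster details.

```shell
# Build the SSM target string for a HyperPod cluster node (sketch).
# The cluster ID is the final segment of the cluster ARN.
build_ssm_target() {
    local cluster_id="$1" instance_group="$2" instance_id="$3"
    echo "sagemaker-cluster:${cluster_id}_${instance_group}-${instance_id}"
}

# Example usage (hypothetical IDs):
# aws ssm start-session \
#     --target "$(build_ssm_target abc123de worker-group-1 i-111222333444555aa)"
```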

## Describe details of a cluster node
<a name="sagemaker-hyperpod-operate-slurm-cli-command-describe-cluster-node"></a>

Run [describe-cluster-node](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-cluster-node.html) to retrieve details of a cluster node. You can get the cluster node ID from list-cluster-nodes output. You can specify either the name or the ARN of the cluster.

```
aws sagemaker describe-cluster-node \
    --cluster-name your-hyperpod-cluster \
    --node-id i-111222333444555aa
```

## List clusters
<a name="sagemaker-hyperpod-operate-slurm-cli-command-list-clusters"></a>

Run [list-clusters](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-clusters.html) to list all clusters in your account.

```
aws sagemaker list-clusters
```

You can also add flags to filter the list of clusters. To learn more about what this command runs at a low level and about additional flags for filtering, see the [ListClusters](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListClusters.html) API reference.
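For example, you can narrow the list to clusters whose names contain a substring. The sketch below assumes the `--name-contains` flag (the `NameContains` parameter of `ListClusters`) and extracts only the cluster names from the `ClusterSummaries` response field; the wrapper function name is an illustration.

```shell
# List only clusters whose names contain a given substring (sketch).
list_clusters_by_name() {
    aws sagemaker list-clusters \
        --name-contains "$1" \
        --query 'ClusterSummaries[].ClusterName' \
        --output text
}

# Example usage:
# list_clusters_by_name hyperpod
```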

## Update cluster configuration
<a name="sagemaker-hyperpod-operate-slurm-cli-command-update-cluster"></a>

Run [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) to update the configuration of a cluster.

**Note**  
You can use the `UpdateCluster` API to scale down or remove entire instance groups from your SageMaker HyperPod cluster. For additional instructions on how to scale down or delete instance groups, see [Scale down a cluster](#sagemaker-hyperpod-operate-slurm-cli-command-scale-down).

1. Create an `UpdateCluster` request file in JSON format. Make sure that you specify the right cluster name and instance group name to update. You can change the instance type, the number of instances, the lifecycle configuration entrypoint script, and the path to the script.

   1. For `ClusterName`, specify the name of the cluster you want to update.

   1. For `InstanceGroupName`

      1. To update an existing instance group, specify the name of the instance group you want to update.

      1. To add a new instance group, specify a new name not existing in your cluster.

   1. For `InstanceType`

      1. To update an existing instance group, you must specify the same instance type that you initially assigned to the group.

      1. To add a new instance group, specify an instance type you want to configure the group with.

   1. For `InstanceCount`

      1. To update an existing instance group, specify an integer that corresponds to your desired number of instances. You can provide a higher or lower value (down to 0) to scale the instance group up or down.

      1. To add a new instance group, specify an integer greater than or equal to 1. 

   1. For `LifeCycleConfig`, you can change both `SourceS3Uri` and `OnCreate` values as you want to update the instance group.

   1. For `ExecutionRole`

      1. For updating an existing instance group, keep using the same IAM role you attached during cluster creation.

      1. For adding a new instance group, specify an IAM role you want to attach.

   1. For `ThreadsPerCore`

      1. For updating an existing instance group, keep using the same value you specified during cluster creation.

      1. For adding a new instance group, you can choose any value from the allowed options per instance type. For more information, search the instance type and see the **Valid threads per core** column in the reference table at [CPU cores and threads per CPU core per instance type](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/cpu-options-supported-instances-values.html) in the *Amazon EC2 User Guide*.

   The following code snippet is a JSON request file template you can use. For more information about the request syntax and parameters of this API, see the [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) API reference.

   ```
   // update_cluster.json
   {
       // Required
       "ClusterName": "name-of-cluster-to-update",
       // Required
       "InstanceGroups": [
           {
               "InstanceGroupName": "name-of-instance-group-to-update",
               "InstanceType": "ml.m5.xlarge",
               "InstanceCount": 1,
               "LifeCycleConfig": {
                    "SourceS3Uri": "s3://sagemaker-amzn-s3-demo-bucket/lifecycle-script-directory/src/",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster",
               // Optional: Configure an additional storage per instance group.
               "InstanceStorageConfigs": [
                   {
                      // Attach an additional EBS volume to each instance within the instance group.
                      // The default mount path for the additional EBS volume is /opt/sagemaker.
                      "EbsVolumeConfig":{
                         // Specify an integer between 1 and 16384 in gigabytes (GB).
                          "VolumeSizeInGB": integer
                      }
                   }
               ]
           },
           // add more blocks of instance groups as needed
           { ... }
       ]
   }
   ```

1. Run the following `update-cluster` command to submit the request. 

   ```
   aws sagemaker update-cluster \
       --cli-input-json file://complete/path/to/update_cluster.json
   ```
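When the only change is scaling one instance group, you can generate the request file from a small helper instead of editing JSON by hand. The following sketch writes a minimal `UpdateCluster` request; the cluster name, group name, role ARN, and S3 path are placeholders that must match the values you used at cluster creation.

```shell
# Write a minimal UpdateCluster request that changes only the instance
# count of one existing group (sketch; all identifiers are placeholders).
write_scale_request() {
    local out_file="$1" count="$2"
    cat > "$out_file" <<EOF
{
    "ClusterName": "your-hyperpod-cluster",
    "InstanceGroups": [
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p4d.24xlarge",
            "InstanceCount": ${count},
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://sagemaker-amzn-s3-demo-bucket/lifecycle-script-directory/src/",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster"
        }
    ]
}
EOF
}

# Example usage:
# write_scale_request ./update_cluster.json 4
# aws sagemaker update-cluster --cli-input-json file://update_cluster.json
```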

## Update the SageMaker HyperPod platform software of a cluster
<a name="sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software"></a>

Run [update-cluster-software](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster-software.html) to update existing clusters with software and security patches provided by the SageMaker HyperPod service. For `--cluster-name`, specify either the name or the ARN of the cluster to update.

**Important**  
You must back up your work before running this API. The patching process replaces the root volume with the updated AMI, which means that any data stored on the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre. For more information, see [Use the backup script provided by SageMaker HyperPod](#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software-backup).

```
aws sagemaker update-cluster-software --cluster-name your-hyperpod-cluster
```

This command calls the [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API. After the API call, SageMaker HyperPod checks whether a newer DLAMI is available for the cluster instances. If a DLAMI update is required, SageMaker HyperPod updates the cluster instances to use the latest [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami) and runs your lifecycle scripts in the Amazon S3 bucket that you specified during cluster creation or update. If the cluster is already using the latest DLAMI, SageMaker HyperPod makes no changes to the cluster and doesn't run the lifecycle scripts again. The SageMaker HyperPod service team regularly rolls out new [SageMaker HyperPod DLAMIs](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami) to enhance security and improve user experience. We recommend that you always update to the latest SageMaker HyperPod DLAMI. For future SageMaker HyperPod DLAMI updates for security patching, see [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md).

**Tip**  
If the security patch fails, you can retrieve failure messages by running the [DescribeCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeCluster.html) API as instructed at [Describe a cluster](#sagemaker-hyperpod-operate-slurm-cli-command-describe-cluster).

**Note**  
You can only run this API programmatically. The patching functionality is not implemented in the SageMaker HyperPod console UI.

### Use the backup script provided by SageMaker HyperPod
<a name="sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software-backup"></a>

SageMaker HyperPod provides a script to back up and restore your data, [patching-backup.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/patching-backup.sh), in the *Awsome Distributed Training* GitHub repository. The script provides the following two functions.

**To back up data to an S3 bucket before patching**

```
sudo bash patching-backup.sh --create <s3-backup-bucket-path>
```

After you run the command, the script uses `squeue` to check for queued jobs, stops Slurm if the queue is empty, backs up `mariadb`, and copies the local items on disk defined under `LOCAL_ITEMS`. You can add more files and directories to `LOCAL_ITEMS`.

```
# Define files and directories to back up.
LOCAL_ITEMS=(
    "/var/spool/slurmd"
    "/var/spool/slurmctld"
    "/etc/systemd/system/slurmctld.service"
    "/home/ubuntu/backup_slurm_acct_db.sql"
    # ... Add more items as needed
)
```

Also, you can add custom code to the provided script to back up any applications for your use case.

**To restore data from an S3 bucket after patching**

```
sudo bash patching-backup.sh --restore <s3-backup-bucket-path>
```

## Scale down a cluster
<a name="sagemaker-hyperpod-operate-slurm-cli-command-scale-down"></a>

You can scale down the number of instances or delete instance groups in your SageMaker HyperPod cluster to optimize resource allocation or reduce costs.

You can scale down either by using the `UpdateCluster` API operation, which terminates randomly selected instances from an instance group down to a specified number, or by terminating specific instances using the `BatchDeleteClusterNodes` API operation. You can also completely remove entire instance groups using the `UpdateCluster` API. For more information about how to scale down using these methods, see [Scaling down a SageMaker HyperPod cluster](smcluster-scale-down.md).

**Note**  
You cannot remove instances that are configured as Slurm controller nodes. Attempting to delete a Slurm controller node results in a validation error with the error code `NODE_ID_IN_USE`.
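For targeted removal of specific worker nodes, the `BatchDeleteClusterNodes` operation takes the node IDs returned by `list-cluster-nodes`. The following is a minimal sketch; the wrapper function name and the example IDs are illustrative, and keep in mind that controller nodes cannot be deleted.

```shell
# Terminate specific cluster nodes by ID (sketch).
# Node IDs come from the list-cluster-nodes response.
delete_cluster_nodes() {
    local cluster_name="$1"; shift
    aws sagemaker batch-delete-cluster-nodes \
        --cluster-name "$cluster_name" \
        --node-ids "$@"
}

# Example usage (hypothetical node IDs):
# delete_cluster_nodes your-hyperpod-cluster i-111222333444555aa i-111222333444555bb
```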

## Delete a cluster
<a name="sagemaker-hyperpod-operate-slurm-cli-command-delete-cluster"></a>

Run [delete-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-cluster.html) to delete a cluster. You can specify either the name or the ARN of the cluster.

```
aws sagemaker delete-cluster --cluster-name your-hyperpod-cluster
```

# Customizing SageMaker HyperPod clusters using lifecycle scripts
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm"></a>

SageMaker HyperPod offers always up-and-running compute clusters that are highly customizable: you write lifecycle scripts to tell SageMaker HyperPod how to set up the cluster resources. The following topics cover best practices for preparing lifecycle scripts that set up Slurm configurations on SageMaker HyperPod clusters.

## High-level overview
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-highlevel-overview"></a>

The following procedure shows the main flow of provisioning a HyperPod cluster and setting it up with Slurm. The steps are ordered in a ***bottom-up*** approach.

1. Plan how you want to create Slurm nodes on a HyperPod cluster. For example, if you want to configure two Slurm nodes, you'll need to set up two instance groups in a HyperPod cluster.

1. Prepare Slurm configuration. Choose one of the following approaches:
   + **Option A: API-driven configuration (recommended)** – Define Slurm node types and partitions directly in the `CreateCluster` API payload using `SlurmConfig` within each instance group. With this approach:
     + No `provisioning_parameters.json` file is needed
     + Slurm topology is defined in the API payload alongside instance group definitions
     + FSx filesystems are configured per-instance-group via `InstanceStorageConfigs`
     + Configuration strategy is controlled via `Orchestrator.Slurm.SlurmConfigStrategy`

     Example `SlurmConfig` in an instance group:

     ```
     {
         "InstanceGroupName": "gpu-compute",
         "InstanceType": "ml.p4d.24xlarge",
         "InstanceCount": 8,
         "SlurmConfig": {
             "NodeType": "Compute",
             "PartitionNames": ["gpu-training"]
         }
     }
     ```
   + **Option B: Legacy configuration** – Prepare a `provisioning_parameters.json` file, which is a [Configuration form for provisioning_parameters.json](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-provisioning-forms-slurm). `provisioning_parameters.json` should contain the Slurm node configuration information to be provisioned on the HyperPod cluster, and should reflect the design of Slurm nodes from Step 1.

1. Prepare a set of lifecycle scripts to set up Slurm on HyperPod, install software packages, and set up an environment in the cluster for your use case. You should structure the lifecycle scripts to run collectively, in order, from a central Python script (`lifecycle_script.py`), and write an entrypoint shell script (`on_create.sh`) to run the Python script. The entrypoint shell script is what you provide to the HyperPod cluster creation request later in Step 6. 

   Also, note that you should write the scripts to expect the `resource_config.json` file that HyperPod generates during cluster creation. `resource_config.json` contains HyperPod cluster resource information, such as IP addresses, instance types, and ARNs, which you need for configuring Slurm.

1. Collect all the files from the previous steps into a folder. The folder structure depends on the configuration approach you selected in Step 2.

   If you selected Option A (API-driven configuration):

   Your folder only needs lifecycle scripts for custom setup tasks. Slurm configuration and FSx mounting are handled automatically by HyperPod based on the API payload.

   ```
   └── lifecycle_files // your local folder
   
       ├── on_create.sh
       ├── lifecycle_script.py
       └── ... // more setup scripts to be fed into lifecycle_script.py
   ```
**Note**  
The `provisioning_parameters.json` file is not required when using API-driven configuration.

   If you selected Option B (legacy configuration):

   Your folder must include `provisioning_parameters.json` and the full set of lifecycle scripts.

   ```
   └── lifecycle_files // your local folder
   
       ├── provisioning_parameters.json
       ├── on_create.sh
       ├── lifecycle_script.py
       └── ... // more setup scripts to be fed into lifecycle_script.py
   ```

1. Upload all the files to an S3 bucket. Copy and keep the S3 bucket path. Note that you should create an S3 bucket path starting with `sagemaker-` because you need to choose an [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod) attached with [`AmazonSageMakerClusterInstanceRolePolicy`](security-iam-awsmanpol-AmazonSageMakerClusterInstanceRolePolicy.md), which only allows S3 bucket paths starting with the prefix `sagemaker-`. The following command is an example command to upload all the files to an S3 bucket.

   ```
   aws s3 cp --recursive ./lifecycle_files s3://sagemaker-hyperpod-lifecycle/src
   ```

1. Prepare a HyperPod cluster creation request. 
   + Option 1: If you use the AWS CLI, write a cluster creation request in JSON format (`create_cluster.json`) following the instructions at [Create a new cluster](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-create-cluster).
   + Option 2: If you use the SageMaker AI console UI, fill the **Create a cluster** request form in the HyperPod console UI following the instructions at [Create a SageMaker HyperPod cluster](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-create-cluster).

   At this stage, make sure that you create instance groups in the same structure that you planned in Step 1 and 2. Also, make sure that you specify the S3 bucket from Step 5 in the request forms.

1. Submit the cluster creation request. HyperPod provisions a cluster based on the request, and then creates a `resource_config.json` file in the HyperPod cluster instances, and sets up Slurm on the cluster running the lifecycle scripts.
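Steps 5 through 7 above can be sketched as a single shell flow. This is a minimal sketch under stated assumptions: the folder, bucket, and request-file paths are placeholders, and the bucket name must start with `sagemaker-` to match the managed IAM policy.

```shell
# Upload the lifecycle files, then submit the cluster creation request (sketch).
deploy_hyperpod_cluster() {
    local lifecycle_dir="$1" s3_uri="$2" request_file="$3"
    # Step 5: upload lifecycle scripts (and provisioning_parameters.json, if used).
    aws s3 cp --recursive "$lifecycle_dir" "$s3_uri"
    # Steps 6-7: submit the CreateCluster request referencing that S3 location.
    aws sagemaker create-cluster --cli-input-json "file://${request_file}"
}

# Example usage (hypothetical paths):
# deploy_hyperpod_cluster ./lifecycle_files s3://sagemaker-hyperpod-lifecycle/src ./create_cluster.json
```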

The following topics walk you through and dive deep into details on how to organize configuration files and lifecycle scripts to work properly during HyperPod cluster creation.

**Topics**
+ [High-level overview](#sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-highlevel-overview)
+ [Base lifecycle scripts provided by HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md)
+ [What particular configurations HyperPod manages in Slurm configuration files](sagemaker-hyperpod-lifecycle-best-practices-slurm-what-hyperpod-overrides-in-slurm-conf.md)
+ [Slurm log rotations](sagemaker-hyperpod-slurm-log-rotation.md)
+ [Mounting Amazon FSx for Lustre and Amazon FSx for OpenZFS to a HyperPod cluster](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-setup-with-fsx.md)
+ [Validating the JSON configuration files before creating a Slurm cluster on HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-validate-json-files.md)
+ [Validating runtime before running production workloads on a HyperPod Slurm cluster](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-validate-runtime.md)
+ [Developing lifecycle scripts interactively on a HyperPod cluster node](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-develop-lifecycle-scripts.md)

# Base lifecycle scripts provided by HyperPod
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config"></a>

This section walks you through every component of the basic flow of setting up Slurm on HyperPod in a ***top-down*** approach. It starts from preparing a HyperPod cluster creation request to run the `CreateCluster` API, and dives deep into the hierarchical structure down to lifecycle scripts. Use the sample lifecycle scripts provided in the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/). Clone the repository by running the following command.

```
git clone https://github.com/aws-samples/awsome-distributed-training/
```

The base lifecycle scripts for setting up a Slurm cluster on SageMaker HyperPod are available at [https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config).

```
cd awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config
```

The following flowchart shows a detailed overview of how you should design the base lifecycle scripts. The descriptions below the diagram and the procedural guide explain how they work during the HyperPod `CreateCluster` API call.

![\[A detailed flow chart of HyperPod cluster creation and the structure of lifecycle scripts.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod-lifecycle-structure.png)


***Figure:** A detailed flow chart of HyperPod cluster creation and the structure of lifecycle scripts. (1) The dashed arrows are directed to where the boxes are "called into" and show the flow of preparing the configuration files and lifecycle scripts. It starts from preparing `provisioning_parameters.json` and the lifecycle scripts. These are then coded in `lifecycle_script.py` for a collective execution in order. The execution of the `lifecycle_script.py` script is done by the `on_create.sh` shell script, which is to be run in the HyperPod instance terminal. (2) The solid arrows show the main HyperPod cluster creation flow and how the boxes are "called into" or "submitted to". `on_create.sh` is required for the cluster creation request, either in `create_cluster.json` or the **Create a cluster** request form in the console UI. After you submit the request, HyperPod runs the `CreateCluster` API based on the given configuration information from the request and the lifecycle scripts. (3) The dotted arrow indicates that the HyperPod platform creates `resource_config.json` in the cluster instances during cluster resource provisioning. `resource_config.json` contains HyperPod cluster resource information such as the cluster ARN, instance types, and IP addresses. It is important to note that you should prepare the lifecycle scripts to expect the `resource_config.json` file during cluster creation. For more information, see the procedural guide below.*

The following procedural guide explains what happens during HyperPod cluster creation and how the base lifecycle scripts are designed.

1. `create_cluster.json` – To submit a HyperPod cluster creation request, you prepare a `CreateCluster` request file in JSON format. In this best practices example, we assume that the request file is named `create_cluster.json`. Write `create_cluster.json` to provision a HyperPod cluster with instance groups. The best practice is to add the same number of instance groups as the number of Slurm nodes you plan to configure on the HyperPod cluster. Make sure that you give distinctive names to the instance groups that you'll assign to Slurm nodes you plan to set up.

   Also, you must specify the S3 bucket path that stores your entire set of configuration files and lifecycle scripts in the `InstanceGroups.LifeCycleConfig.SourceS3Uri` field of the `CreateCluster` request form, and specify the file name of the entrypoint shell script (assumed to be named `on_create.sh`) in `InstanceGroups.LifeCycleConfig.OnCreate`.
**Note**  
If you are using the **Create a cluster** submission form in the HyperPod console UI, the console manages filling and submitting the `CreateCluster` request on your behalf, and runs the `CreateCluster` API in the backend. In this case, you don't need to create `create_cluster.json`; instead, make sure that you specify the correct cluster configuration information to the **Create a cluster** submission form.

1. `on_create.sh` – For each instance group, you need to provide an entrypoint shell script, `on_create.sh`, to run commands, run scripts that install software packages, and set up the HyperPod cluster environment with Slurm. The two things you need to prepare are a `provisioning_parameters.json` file, which HyperPod requires for setting up Slurm, and a set of lifecycle scripts for installing software packages. This script should be written to find and run the following files, as shown in the sample script [on_create.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/on_create.sh).
**Note**  
Make sure that you upload the entire set of lifecycle scripts to the S3 location you specify in `create_cluster.json`. You should also place your `provisioning_parameters.json` in the same location.

   1. `provisioning_parameters.json` – This is a [Configuration form for provisioning_parameters.json](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-provisioning-forms-slurm). The `on_create.sh` script finds this JSON file and defines an environment variable identifying the path to it. Through this JSON file, you can configure Slurm nodes and storage options, such as Amazon FSx for Lustre, for Slurm to communicate with. In `provisioning_parameters.json`, make sure that you assign the HyperPod cluster instance groups, using the names you specified in `create_cluster.json`, to the Slurm nodes appropriately, based on how you plan to set them up.

      The following diagram shows an example of how the two JSON configuration files `create_cluster.json` and `provisioning_parameters.json` should be written to assign HyperPod instance groups to Slurm nodes. In this example, we assume a case of setting up three Slurm nodes: a controller (management) node, a login node (which is optional), and a compute (worker) node.
**Tip**  
To help you validate these two JSON files, the HyperPod service team provides a validation script, [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/validate-config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/validate-config.py). To learn more, see [Validating the JSON configuration files before creating a Slurm cluster on HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-validate-json-files.md).  
![\[Direct comparison between .json files.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod-lifecycle-slurm-config.png)

      ***Figure:** Direct comparison between `create_cluster.json` for HyperPod cluster creation and `provisioning_parameters.json` for Slurm configuration. The number of instance groups in `create_cluster.json` should match the number of nodes you want to configure as Slurm nodes. In the example in the figure, three Slurm nodes are configured on a HyperPod cluster of three instance groups. You assign the HyperPod cluster instance groups to Slurm nodes by specifying the instance group names accordingly.*
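
      The following is an illustrative sketch of how `provisioning_parameters.json` might look for the three-node setup in the figure. The field names follow the sample configuration form in the repository linked above; the instance group names and FSx values are assumptions and must match your own `create_cluster.json` and file system.

      ```
      {
          "version": "1.0.0",
          "workload_manager": "slurm",
          "controller_group": "controller-machine",
          "login_group": "login-group",
          "worker_groups": [
              {
                  "instance_group_name": "compute-nodes",
                  "partition_name": "dev"
              }
          ],
          "fsx_dns_name": "fs-12345678a90b01cde.fsx.us-west-2.amazonaws.com",
          "fsx_mountname": "1abcdefg"
      }
      ```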

   1. `resource_config.json` – During cluster creation, the `lifecycle_script.py` script is written to expect a `resource_config.json` file from HyperPod. This file contains information about the cluster, such as instance types and IP addresses.

      When you run the `CreateCluster` API, HyperPod creates a resource configuration file at `/opt/ml/config/resource_config.json` based on the `create_cluster.json` file. The file path is saved to the environment variable named `SAGEMAKER_RESOURCE_CONFIG_PATH`. 
**Important**  
The `resource_config.json` file is auto-generated by the HyperPod platform; you do NOT need to create it. The following code shows an example of the `resource_config.json` that would be created from the cluster creation based on the `create_cluster.json` in the previous step, to help you understand what happens in the backend and how an auto-generated `resource_config.json` looks.

      ```
      {
          "ClusterConfig": {
              "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/abcde01234yz",
              "ClusterName": "your-hyperpod-cluster"
          },
          "InstanceGroups": [
              {
                  "Name": "controller-machine",
                  "InstanceType": "ml.c5.xlarge",
                  "Instances": [
                      {
                          "InstanceName": "controller-machine-1",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      }
                  ]
              },
              {
                  "Name": "login-group",
                  "InstanceType": "ml.m5.xlarge",
                  "Instances": [
                      {
                          "InstanceName": "login-group-1",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      }
                  ]
              },
              {
                  "Name": "compute-nodes",
                  "InstanceType": "ml.trn1.32xlarge",
                  "Instances": [
                      {
                          "InstanceName": "compute-nodes-1",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      },
                      {
                          "InstanceName": "compute-nodes-2",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      },
                      {
                          "InstanceName": "compute-nodes-3",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      },
                      {
                          "InstanceName": "compute-nodes-4",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      }
                  ]
              }
          ]
      }
      ```

   1. `lifecycle_script.py` – This is the main Python script that runs the lifecycle scripts to set up Slurm on the HyperPod cluster while it is being provisioned. This script reads `provisioning_parameters.json` and `resource_config.json` from the paths that are specified or identified in `on_create.sh`, passes the relevant information to each lifecycle script, and then runs the lifecycle scripts in order.

      Lifecycle scripts are a set of scripts that you have complete flexibility to customize to install software packages and set up necessary or custom configurations during cluster creation, such as setting up Slurm, creating users, and installing Conda or Docker. The sample [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py) script is prepared to run other base lifecycle scripts in the repository, such as launching Slurm daemons ([https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/start_slurm.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/start_slurm.sh)), mounting Amazon FSx for Lustre ([https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/mount_fsx.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/mount_fsx.sh)), and setting up MariaDB accounting ([https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/setup_mariadb_accounting.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/setup_mariadb_accounting.sh)) and RDS accounting ([https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/setup_rds_accounting.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/setup_rds_accounting.sh)). You can also add more scripts, package them under the same directory, and add code lines to `lifecycle_script.py` to let HyperPod run the scripts. For more information about the base lifecycle scripts, see also [3.1 Lifecycle scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod#31-lifecycle-scripts) in the *Awsome Distributed Training GitHub repository*.
**Note**  
HyperPod runs the [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami) on each instance of a cluster, and the AMI includes pre-installed software packages that are tested for compatibility with HyperPod functionalities. If you reinstall any of the pre-installed packages, you are responsible for installing compatible packages, and note that some HyperPod functionalities might not work as expected.

      In addition to the default setups, scripts for installing the following software are available under the [https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils) folder. The `lifecycle_script.py` file already includes the code lines for running the installation scripts; see the following items to find those lines and uncomment them to activate the installations.

      1. The following code lines are for installing [Docker](https://www.docker.com/), [Enroot](https://github.com/NVIDIA/enroot), and [Pyxis](https://github.com/NVIDIA/pyxis). These packages are required to run Docker containers on a Slurm cluster. 

         To enable this installation step, set the `enable_docker_enroot_pyxis` parameter to `True` in the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py) file.

         ```
         # Install Docker/Enroot/Pyxis
         if Config.enable_docker_enroot_pyxis:
             ExecuteBashScript("./utils/install_docker.sh").run()
             ExecuteBashScript("./utils/install_enroot_pyxis.sh").run(node_type)
         ```

      1. You can integrate your HyperPod cluster with [Amazon Managed Service for Prometheus](https://docs.aws.amazon.com/prometheus/latest/userguide/what-is-Amazon-Managed-Service-Prometheus.html) and [Amazon Managed Grafana](https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html) to export metrics about the HyperPod cluster and cluster nodes to Amazon Managed Grafana dashboards. To export metrics and use the [Slurm dashboard](https://grafana.com/grafana/dashboards/4323-slurm-dashboard/), the [NVIDIA DCGM Exporter dashboard](https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/), and the [EFA Metrics dashboard](https://grafana.com/grafana/dashboards/20579-efa-metrics-dev/) on Amazon Managed Grafana, you need to install the [Slurm exporter for Prometheus](https://github.com/vpenso/prometheus-slurm-exporter), the [NVIDIA DCGM exporter](https://github.com/NVIDIA/dcgm-exporter), and the [EFA node exporter](https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/3.efa-node-exporter/README.md). For more information about installing the exporter packages and using Grafana dashboards on an Amazon Managed Grafana workspace, see [SageMaker HyperPod cluster resources monitoring](sagemaker-hyperpod-cluster-observability-slurm.md). 

         To enable this installation step, set the `enable_observability` parameter to `True` in the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py) file.

         ```
         # Install metric exporting software and Prometheus for observability
         
         if Config.enable_observability:
             if node_type == SlurmNodeType.COMPUTE_NODE:
                 ExecuteBashScript("./utils/install_docker.sh").run()
                 ExecuteBashScript("./utils/install_dcgm_exporter.sh").run()
                 ExecuteBashScript("./utils/install_efa_node_exporter.sh").run()
             
             if node_type == SlurmNodeType.HEAD_NODE:
                 wait_for_scontrol()
                 ExecuteBashScript("./utils/install_docker.sh").run()
                 ExecuteBashScript("./utils/install_slurm_exporter.sh").run()
                 ExecuteBashScript("./utils/install_prometheus.sh").run()
         ```
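
As the steps above describe, lifecycle scripts can locate the auto-generated `resource_config.json` through the `SAGEMAKER_RESOURCE_CONFIG_PATH` environment variable. The following is a minimal, illustrative Python sketch of parsing that file, assuming the structure shown in the earlier example; it is not part of the HyperPod-provided scripts, and the function names are hypothetical.

```python
import json
import os

# HyperPod saves the generated file path in this environment variable;
# fall back to the documented default location.
CONFIG_PATH = os.environ.get(
    "SAGEMAKER_RESOURCE_CONFIG_PATH", "/opt/ml/config/resource_config.json"
)

def load_resource_config(path=None):
    """Load the resource configuration that HyperPod generates at cluster creation."""
    with open(path or CONFIG_PATH) as f:
        return json.load(f)

def list_node_ips(resource_config):
    """Map each instance group name to the CustomerIpAddress of each of its instances."""
    return {
        group["Name"]: [inst["CustomerIpAddress"] for inst in group["Instances"]]
        for group in resource_config["InstanceGroups"]
    }
```

A custom lifecycle script could call `list_node_ips(load_resource_config())` to, for example, build a host list for the cluster nodes.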

1. Make sure that you upload all configuration files and setup scripts from **Step 2** to the S3 bucket you provide in the `CreateCluster` request in **Step 1**. For example, assume that your `create_cluster.json` has the following.

   ```
   "LifeCycleConfig": { 
   
       "SourceS3URI": "s3://sagemaker-hyperpod-lifecycle/src",
       "OnCreate": "on_create.sh"
   }
   ```

   Then, your `"s3://sagemaker-hyperpod-lifecycle/src"` should contain `on_create.sh`, `lifecycle_script.py`, `provisioning_parameters.json`, and all other setup scripts. Assume that you have prepared the files in a local folder as follows.

   ```
   └── lifecycle_files // your local folder
       ├── provisioning_parameters.json
       ├── on_create.sh
       ├── lifecycle_script.py
        └── ... // more setup scripts to be fed into lifecycle_script.py
   ```

   To upload the files, use the S3 command as follows.

   ```
   aws s3 cp --recursive ./lifecycle_files s3://sagemaker-hyperpod-lifecycle/src
   ```
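
Before running the upload command, you can sanity-check that the local folder contains the minimum set of files. The following is a small, illustrative Python helper (not part of the HyperPod tooling); the required file names are the ones listed above.

```python
from pathlib import Path

# Minimum files that the S3 source location must contain, per the step above.
REQUIRED_FILES = ["on_create.sh", "lifecycle_script.py", "provisioning_parameters.json"]

def missing_lifecycle_files(folder):
    """Return the names of required lifecycle files absent from a local folder."""
    root = Path(folder)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

If the returned list is empty, the folder is ready to upload with `aws s3 cp --recursive`.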

# Configurations that HyperPod manages in Slurm configuration files
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm-what-hyperpod-overrides-in-slurm-conf"></a>

When you create a Slurm cluster on HyperPod, the HyperPod agent sets up the [https://slurm.schedmd.com/slurm.conf.html](https://slurm.schedmd.com/slurm.conf.html) and [https://slurm.schedmd.com/gres.conf.html](https://slurm.schedmd.com/gres.conf.html) files at `/opt/slurm/etc/` to manage the Slurm cluster based on your HyperPod cluster creation request and lifecycle scripts. The following list shows which specific parameters the HyperPod agent handles and overwrites. 

**Important**  
We strongly recommend that you **do not** change these parameters managed by HyperPod.
+ In [https://slurm.schedmd.com/slurm.conf.html](https://slurm.schedmd.com/slurm.conf.html), HyperPod sets up the following basic parameters: `ClusterName`, `SlurmctldHost`, `PartitionName`, and `NodeName`.

  Also, to enable the [Automatic node recovery and auto-resume](sagemaker-hyperpod-resiliency-slurm-auto-resume.md) functionality, HyperPod requires the `TaskPlugin` and `SchedulerParameters` parameters set as follows. The HyperPod agent sets up these two parameters with the required values by default.

  ```
  TaskPlugin=task/none
  SchedulerParameters=permit_job_expansion
  ```
+ In [https://slurm.schedmd.com/gres.conf.html](https://slurm.schedmd.com/gres.conf.html), HyperPod manages `NodeName` for GPU nodes.

# Slurm log rotations
<a name="sagemaker-hyperpod-slurm-log-rotation"></a>

SageMaker HyperPod provides automatic log rotation for Slurm daemon logs to help manage disk space usage and maintain system performance. Log rotation prevents logs from consuming excessive disk space by automatically archiving and removing old log files while keeping recent logging information. Slurm log rotation is enabled by default when you create a cluster.

## How log rotation works
<a name="sagemaker-hyperpod-slurm-log-rotation-how-it-works"></a>

When enabled, the log rotation configuration:
+ Monitors all Slurm log files with the `.log` extension in the `/var/log/slurm/` folder on the controller, login, and compute nodes.
+ Rotates logs when they reach 50 MB in size.
+ Keeps up to two rotated log files before deleting older ones.
+ Sends the SIGUSR2 signal to the Slurm daemons (`slurmctld`, `slurmd`, and `slurmdbd`) after rotation.

## List of log files rotated
<a name="sagemaker-hyperpod-slurm-log-rotation-log-files-list"></a>

Slurm logs are located in the `/var/log/slurm/` directory. Log rotation is enabled for all files that match `/var/log/slurm/*.log`. When rotation occurs, rotated files have numerical suffixes (such as `slurmd.log.1`). The following list is not exhaustive but shows some of the critical log files that rotate automatically:
+ `/var/log/slurm/slurmctld.log`
+ `/var/log/slurm/slurmd.log`
+ `/var/log/slurm/slurmdbd.log`
+ `/var/log/slurm/slurmrestd.log`

## Enable or disable log rotation
<a name="sagemaker-hyperpod-slurm-log-rotation-enable-disable"></a>

You can control the log rotation feature using the `enable_slurm_log_rotation` parameter in the `config.py` script of your cluster's lifecycle scripts, as shown in the following example:

```
class Config:
    # Set false if you want to disable log rotation of Slurm daemon logs
    enable_slurm_log_rotation = True  # Default value
```

To disable log rotation, set the parameter to `False`, as shown in the following example:

```
enable_slurm_log_rotation = False
```

**Note**  
Lifecycle scripts run on all Slurm nodes (controller, login, and compute nodes) during cluster creation, and also on new nodes when they are added to the cluster. After cluster creation, you must update the log rotation configuration manually; it is stored in `/etc/logrotate.d/sagemaker-hyperpod-slurm`. We recommend keeping log rotation enabled to prevent log files from consuming excessive disk space. To disable log rotation on a running node, delete the `sagemaker-hyperpod-slurm` file or comment out its contents by adding `#` at the start of each line.

## Default log rotation settings
<a name="sagemaker-hyperpod-slurm-log-rotation-default-settings"></a>

The following settings are configured automatically for each log file rotated:


| Setting | Value | Description | 
| --- | --- | --- | 
| rotate | 2 | Number of rotated log files to keep | 
| size | 50 MB | Maximum size before rotation | 
| copytruncate | enabled | Copies and truncates the original log file | 
| compress | disabled | Rotated logs are not compressed | 
| missingok | enabled | No error if log file is missing | 
| notifempty | enabled | Doesn't rotate empty files | 
| noolddir | enabled | Rotated files stay in same directory | 
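
For reference, a logrotate stanza consistent with the settings in the table might look like the following. This is an illustrative sketch, not the exact contents of `/etc/logrotate.d/sagemaker-hyperpod-slurm`; inspect that file on your cluster for the authoritative configuration.

```
/var/log/slurm/*.log {
    size 50M
    rotate 2
    copytruncate
    missingok
    notifempty
    noolddir
}
```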

# Mounting Amazon FSx for Lustre and Amazon FSx for OpenZFS to a HyperPod cluster
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-setup-with-fsx"></a>

To mount an Amazon FSx for Lustre shared file system to your HyperPod cluster, set up the following.

1. Use your Amazon VPC. 

   1. For HyperPod cluster instances to communicate within your VPC, make sure that you attach the additional permissions described in [Setting up SageMaker HyperPod with a custom Amazon VPC](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-optional-vpc) to the IAM role for SageMaker HyperPod. 

   1. In `create_cluster.json`, include the following VPC information.

      ```
      "VpcConfig": { 
          "SecurityGroupIds": [ "string" ],
          "Subnets": [ "string" ]
      }
      ```

      For more tips about setting up Amazon VPC, see [Prerequisites for using SageMaker HyperPod](sagemaker-hyperpod-prerequisites.md).

1. To finish configuring Slurm with Amazon FSx for Lustre, use one of the following approaches. You can find the Amazon FSx information either in the Amazon FSx for Lustre console in your account or by running the AWS CLI command `aws fsx describe-file-systems`.

   **Option A: API-Driven Configuration (Recommended)**

   Specify the Amazon FSx configuration directly in the `CreateCluster` API payload using `InstanceStorageConfigs` within each instance group. This approach supports both FSx for Lustre and FSx for OpenZFS, and allows per-instance-group FSx configuration.

   ```
   "InstanceStorageConfigs": [
       {
           "FsxLustreConfig": {
               "DnsName": "fs-12345678a90b01cde.fsx.us-west-2.amazonaws.com",
               "MountPath": "/fsx",
               "MountName": "1abcdefg"
           }
       }
   ]
   ```

   For FSx for OpenZFS, use `FsxOpenZfsConfig` instead:

   ```
   "InstanceStorageConfigs": [
       {
           "FsxOpenZfsConfig": {
               "DnsName": "fs-12345678a90b01cde.fsx.us-west-2.amazonaws.com",
               "MountPath": "/fsx-openzfs"
           }
       }
   ]
   ```

   For more details, see [Getting started with SageMaker HyperPod using the AWS CLI](sagemaker-hyperpod-quickstart.md).

   **Option B: Legacy Configuration**

   Specify the Amazon FSx DNS name and Amazon FSx mount name in `provisioning_parameters.json` as shown in the figure in the [Base lifecycle scripts provided by HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md) section.

   ```
   "fsx_dns_name": "fs-12345678a90b01cde.fsx.us-west-2.amazonaws.com",
   "fsx_mountname": "1abcdefg"
   ```

# Validating the JSON configuration files before creating a Slurm cluster on HyperPod
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-validate-json-files"></a>

To validate the JSON configuration files before submitting a cluster creation request, use the configuration validation script [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/validate-config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/validate-config.py). This script parses and compares your HyperPod cluster configuration JSON file and Slurm configuration JSON file, and identifies if there's any resource misconfiguration between the two files and also across Amazon EC2, Amazon VPC, and Amazon FSx resources. For example, to validate the `create_cluster.json` and `provisioning_parameters.json` files from the [Base lifecycle scripts provided by HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md) section, run the validation script as follows.

```
python3 validate-config.py --cluster-config create_cluster.json --provisioning-parameters provisioning_parameters.json
```

The following is an example output of a successful validation.

```
✔️  Validated instance group name worker-group-1 is correct ...

✔️  Validated subnet subnet-012345abcdef67890 ...
✔️  Validated security group sg-012345abcdef67890 ingress rules ...
✔️  Validated security group sg-012345abcdef67890 egress rules ...
✔️  Validated FSx Lustre DNS name fs-012345abcdef67890.fsx.us-east-1.amazonaws.com
✔️  Validated FSx Lustre mount name abcdefgh
✅ Cluster Validation succeeded
```

# Validating runtime before running production workloads on a HyperPod Slurm cluster
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-validate-runtime"></a>

To check the runtime before running any production workloads on a Slurm cluster on HyperPod, use the runtime validation script [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/hyperpod-precheck.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/hyperpod-precheck.py). This script checks whether the Slurm cluster has all of the packages required for running Docker, whether the cluster has a properly mounted FSx for Lustre file system and a user directory that shares the file system, and whether the Slurm daemon is running on all compute nodes.

To run the script on multiple nodes at once, use `srun`, as shown in the following example, which runs the script on a Slurm cluster of 8 nodes.

```
# The following command runs on 8 nodes
srun -N 8 python3 hyperpod-precheck.py
```

**Note**  
To learn more about the validation script such as what runtime validation functions the script provides and guidelines to resolve issues that don't pass the validations, see [Runtime validation before running workloads](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod#35-runtime-validation-before-running-workloads) in the *Awsome Distributed Training GitHub repository*.

# Developing lifecycle scripts interactively on a HyperPod cluster node
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-develop-lifecycle-scripts"></a>

This section explains how you can interactively develop lifecycle scripts without repeatedly creating and deleting a HyperPod cluster.

1. Create a HyperPod cluster with the base lifecycle scripts.

1. Log in to a cluster node.

1. Develop a script (`configure_xyz.sh`) by editing and running it repeatedly on the node.

   1. HyperPod runs the lifecycle scripts as the root user, so we recommend that you run `configure_xyz.sh` as the root user while developing, to make sure that the script is tested under the same conditions that HyperPod runs it in.

1. Integrate the script into `lifecycle_script.py` by adding a code line similar to the following.

   ```
   ExecuteBashScript("./utils/configure_xyz.sh").run()
   ```

1. Upload the updated lifecycle scripts to the S3 bucket that you initially used for uploading the base lifecycle scripts.

1. Test the integrated version of `lifecycle_script.py` by creating a new HyperPod cluster. You can also use manual instance replacement to test the updated lifecycle scripts by creating new instances. For detailed instructions, see [Manually replace a node](https://docs.aws.amazon.com//sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance.html#sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-replace). Note that only worker nodes are replaceable.

# SageMaker HyperPod multi-head node support
<a name="sagemaker-hyperpod-multihead-slurm"></a>

You can create multiple controller (head) nodes in a single SageMaker HyperPod Slurm cluster, with one serving as the primary controller node and the others serving as backup controller nodes. The primary controller node is responsible for controlling the compute (worker) nodes and handling Slurm operations. The backup controller nodes constantly monitor the primary controller node. If the primary controller node fails or becomes unresponsive, one of the backup controller nodes will automatically take over as the new primary controller node.

Configuring multiple controller nodes in SageMaker HyperPod Slurm clusters provides several key benefits. It removes the risk of a single controller node being a point of failure by providing backup controller nodes, enables automatic failover to backup controller nodes for faster recovery, and allows you to manage your own accounting databases and Slurm configuration independently.

## Key concepts
<a name="sagemaker-hyperpod-multihead-slurm-concepts"></a>

The following provides details about the concepts related to SageMaker HyperPod multiple controller (head) nodes support for Slurm clusters.

**Controller node**

A controller node is an Amazon EC2 instance within a cluster that runs critical Slurm services for managing and coordinating the cluster's operations. Specifically, it hosts the [Slurm controller daemon (slurmctld)](https://slurm.schedmd.com/slurmctld.html) and the [Slurm database daemon (slurmdbd)](https://slurm.schedmd.com/slurmdbd.html). A controller node is also known as a head node.

**Primary controller node**

A primary controller node is the active controller node in a Slurm cluster, identified by Slurm as the primary controller responsible for managing the cluster. It receives and executes user commands to control and allocate resources on the compute nodes for running jobs.

**Backup controller node**

A backup controller node is an inactive and standby controller node in a Slurm cluster. It is identified by Slurm as a backup controller node that is not currently managing the cluster. The backup controller node runs the [Slurm controller daemon (slurmctld)](https://slurm.schedmd.com/slurmctld.html) in standby mode. Any controller commands executed on the backup controller nodes will be propagated to the primary controller node for execution. Its primary purpose is to continuously monitor the primary controller node and take over its responsibilities if the primary controller node fails or becomes unresponsive.

**Compute node**

A compute node is an Amazon EC2 instance within a cluster that hosts the [Slurm worker daemon (slurmd)](https://slurm.schedmd.com/slurmd.html). The compute node's primary function is to execute jobs assigned by the [Slurm controller daemon (slurmctld)](https://slurm.schedmd.com/slurmctld.html) running on the primary controller node. When a job is scheduled, the compute node receives instructions from the Slurm controller daemon to carry out the necessary tasks and computations for that job within the node itself. A compute node is also known as a worker node.

## How it works
<a name="sagemaker-hyperpod-multihead-slurm-how"></a>

The following diagram illustrates how different AWS services work together to support the multiple controller (head) nodes architecture for SageMaker HyperPod Slurm clusters.

![\[SageMaker HyperPod multi-head nodes architecture diagram\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-multihead-architecture.png)


The AWS services that work together to support the SageMaker HyperPod multiple controller (head) nodes architecture include the following.


**AWS services that work together to support the SageMaker HyperPod multiple controller nodes architecture**  

| Service | Description | 
| --- | --- | 
| IAM (AWS Identity and Access Management) | Defines two IAM roles to control the access permissions: one role for the compute node instance group and the other for the controller node instance group. | 
| Amazon RDS for MariaDB | Stores accounting data for Slurm, which holds job records and metering data. | 
| AWS Secrets Manager | Stores and manages credentials that can be accessed by Amazon FSx for Lustre. | 
| Amazon FSx for Lustre  | Stores Slurm configurations and runtime state. | 
| Amazon VPC | Provides an isolated network environment where the HyperPod cluster and its resources are deployed. | 
| Amazon SNS  | Sends notifications to administrators when there are status changes (Slurm controller is ON or OFF) related to the primary controller (head) node. | 

The HyperPod cluster itself consists of controller nodes (primary and backup) and compute nodes. The controller nodes run the Slurm controller daemon (`slurmctld`) and the Slurm database daemon (`slurmdbd`), which manage and monitor the workload across the compute nodes.

The controller nodes access Slurm configurations and runtime state stored in the Amazon FSx for Lustre file system. The Slurm accounting data is stored in the Amazon RDS for MariaDB database. AWS Secrets Manager provides secure access to the database credentials for the controller nodes.

If there is a status change (Slurm controller is `ON` or `OFF`) in the Slurm controller nodes, Amazon SNS sends notifications to the admin for further action.

This multiple controller nodes architecture eliminates the single point of failure of a single controller (head) node, enables fast and automatic failover recovery, and gives you control over the Slurm accounting database and configurations.

# Setting up multiple controller nodes for a SageMaker HyperPod Slurm cluster
<a name="sagemaker-hyperpod-multihead-slurm-setup"></a>

This topic explains how to configure multiple controller (head) nodes in a SageMaker HyperPod Slurm cluster using lifecycle scripts. Before you start, review the prerequisites listed in [Prerequisites for using SageMaker HyperPod](sagemaker-hyperpod-prerequisites.md) and familiarize yourself with the lifecycle scripts in [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md). The instructions in this topic use AWS CLI commands in an Amazon Linux environment. Note that the environment variables used in these commands are available only in the current session unless you explicitly preserve them.

**Topics**
+ [Provisioning resources using CloudFormation stacks](sagemaker-hyperpod-multihead-slurm-cfn.md)
+ [Creating and attaching an IAM policy](sagemaker-hyperpod-multihead-slurm-iam.md)
+ [Preparing and uploading lifecycle scripts](sagemaker-hyperpod-multihead-slurm-scripts.md)
+ [Creating a SageMaker HyperPod cluster](sagemaker-hyperpod-multihead-slurm-create.md)
+ [Considering important notes](sagemaker-hyperpod-multihead-slurm-notes.md)
+ [Reviewing environment variables reference](sagemaker-hyperpod-multihead-slurm-variables-reference.md)

# Provisioning resources using CloudFormation stacks
<a name="sagemaker-hyperpod-multihead-slurm-cfn"></a>

To set up multiple controller nodes in a HyperPod Slurm cluster, provision AWS resources through two CloudFormation stacks: [Provision basic resources](#sagemaker-hyperpod-multihead-slurm-cfn-basic) and [Provision additional resources to support multiple controller nodes](#sagemaker-hyperpod-multihead-slurm-cfn-multihead).

## Provision basic resources
<a name="sagemaker-hyperpod-multihead-slurm-cfn-basic"></a>

Follow these steps to provision basic resources for your Amazon SageMaker HyperPod Slurm cluster.

1. Download the [sagemaker-hyperpod.yaml](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/sagemaker-hyperpod.yaml) template file to your machine. This YAML file is a CloudFormation template that defines the following resources for your Slurm cluster.
   + An execution IAM role for the compute node instance group
   + An Amazon S3 bucket to store the lifecycle scripts
   + Public and private subnets (private subnets have internet access through NAT gateways)
   + Internet Gateway/NAT gateways
   + Two Amazon EC2 security groups
   + An Amazon FSx volume to store configuration files

1. Run the following CLI command to create a CloudFormation stack named `sagemaker-hyperpod`. Define the Availability Zone (AZ) IDs for your cluster in `PrimarySubnetAZ` and `BackupSubnetAZ`. For example, *use1-az4* is an AZ ID for an Availability Zone in the `us-east-1` Region. For more information, see [Availability Zone IDs](https://docs.aws.amazon.com//ram/latest/userguide/working-with-az-ids.html) and [Setting up SageMaker HyperPod clusters across multiple AZs](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-multiple-availability-zones).

   ```
   aws cloudformation deploy \
   --template-file /path_to_template/sagemaker-hyperpod.yaml \
   --stack-name sagemaker-hyperpod \
   --parameter-overrides PrimarySubnetAZ=use1-az4 BackupSubnetAZ=use1-az1 \
   --capabilities CAPABILITY_IAM
   ```

   For more information, see [deploy](https://docs.aws.amazon.com//cli/latest/reference/cloudformation/deploy/) from the AWS Command Line Interface Reference. The stack creation can take a few minutes to complete. When it's complete, you will see the following in your command line interface.

   ```
   Waiting for changeset to be created..
   Waiting for stack create/update to complete
   Successfully created/updated stack - sagemaker-hyperpod
   ```

1. (Optional) Verify the stack in the [CloudFormation console](https://console.aws.amazon.com/cloudformation/home).
   + From the left navigation, choose **Stacks**.
   + On the **Stacks** page, find and choose **sagemaker-hyperpod**.
   + Choose the **Resources** and **Outputs** tabs to review the created resources and stack outputs.

1. Create environment variables from the outputs of the stack (`sagemaker-hyperpod`). You will use the values of these variables in [Provision additional resources to support multiple controller nodes](#sagemaker-hyperpod-multihead-slurm-cfn-multihead).

   ```
   source .env
   PRIMARY_SUBNET=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`PrimaryPrivateSubnet`].OutputValue' --output text)
   BACKUP_SUBNET=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`BackupPrivateSubnet`].OutputValue' --output text)
   EMAIL=$(bash -c 'read -p "INPUT YOUR SNSSubEmailAddress HERE: " && echo $REPLY')
   DB_USER_NAME=$(bash -c 'read -p "INPUT YOUR DB_USER_NAME HERE: " && echo $REPLY')
   SECURITY_GROUP=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`SecurityGroup`].OutputValue' --output text)
   ROOT_BUCKET_NAME=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`AmazonS3BucketName`].OutputValue' --output text)
   SLURM_FSX_DNS_NAME=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`FSxLustreFilesystemDNSname`].OutputValue' --output text)
   SLURM_FSX_MOUNT_NAME=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`FSxLustreFilesystemMountname`].OutputValue' --output text)
   COMPUTE_NODE_ROLE=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`AmazonSagemakerClusterExecutionRoleArn`].OutputValue' --output text)
   ```

   When you see prompts asking for your email address and database user name, enter values like the following.

   ```
   INPUT YOUR SNSSubEmailAddress HERE: Email_address_to_receive_SNS_notifications
   INPUT YOUR DB_USER_NAME HERE: Database_user_name_you_define
   ```

   To verify a variable's value, use the `echo` command.

   ```
   echo $REGION
   us-east-1
   ```
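Because these variables disappear with the session, it can help to confirm that they are all set before moving on. The following is a minimal POSIX shell sketch; the placeholder values are illustrative, and in practice the values come from the stack outputs and prompts above.

```shell
# Illustrative placeholder values; in practice these come from the
# sagemaker-hyperpod stack outputs and the interactive prompts.
REGION=us-east-1
PRIMARY_SUBNET=subnet-1111 BACKUP_SUBNET=subnet-2222 SECURITY_GROUP=sg-3333
ROOT_BUCKET_NAME=amzn-s3-demo-bucket1
COMPUTE_NODE_ROLE=arn:aws:iam::111122223333:role/example
SLURM_FSX_DNS_NAME=fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com
SLURM_FSX_MOUNT_NAME=abcdefgh

# Collect the names of any required variables that are unset or empty.
missing=""
for name in REGION PRIMARY_SUBNET BACKUP_SUBNET SECURITY_GROUP \
            ROOT_BUCKET_NAME SLURM_FSX_DNS_NAME SLURM_FSX_MOUNT_NAME \
            COMPUTE_NODE_ROLE; do
  eval "value=\${$name}"
  if [ -z "$value" ]; then
    missing="$missing $name"
  fi
done

if [ -z "$missing" ]; then
  echo "All required variables are set"
else
  echo "Missing:$missing" >&2
fi
```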

## Provision additional resources to support multiple controller nodes
<a name="sagemaker-hyperpod-multihead-slurm-cfn-multihead"></a>

Follow these steps to provision additional resources for your Amazon SageMaker HyperPod Slurm cluster with multiple controller nodes.

1. Download the [sagemaker-hyperpod-slurm-multi-headnode.yaml](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/sagemaker-hyperpod-slurm-multi-headnode.yaml) template file to your machine. This second YAML file is a CloudFormation template that defines the additional resources needed to support multiple controller nodes in your Slurm cluster.
   + An execution IAM role for the controller node instance group
   + An Amazon RDS for MariaDB instance
   + An Amazon SNS topic and subscription
   + AWS Secrets Manager credentials for Amazon RDS for MariaDB

1. Run the following CLI command to create a CloudFormation stack named `sagemaker-hyperpod-mh`. This second stack uses the CloudFormation template to create additional AWS resources to support the multiple controller nodes architecture.

   ```
   aws cloudformation deploy \
   --template-file /path_to_template/sagemaker-hyperpod-slurm-multi-headnode.yaml \
   --stack-name sagemaker-hyperpod-mh \
   --parameter-overrides \
   SlurmDBSecurityGroupId=$SECURITY_GROUP \
   SlurmDBSubnetGroupId1=$PRIMARY_SUBNET \
   SlurmDBSubnetGroupId2=$BACKUP_SUBNET \
   SNSSubEmailAddress=$EMAIL \
   SlurmDBUsername=$DB_USER_NAME \
   --capabilities CAPABILITY_NAMED_IAM
   ```

   For more information, see [deploy](https://docs.aws.amazon.com//cli/latest/reference/cloudformation/deploy/) from the AWS Command Line Interface Reference. The stack creation can take a few minutes to complete. When it's complete, you will see the following in your command line interface.

   ```
   Waiting for changeset to be created..
   Waiting for stack create/update to complete
   Successfully created/updated stack - sagemaker-hyperpod-mh
   ```

1. (Optional) Verify the stack in the [CloudFormation console](https://console.aws.amazon.com/cloudformation/home).
   + From the left navigation, choose **Stacks**.
   + On the **Stacks** page, find and choose **sagemaker-hyperpod-mh**.
   + Choose the **Resources** and **Outputs** tabs to review the created resources and stack outputs.

1. Create environment variables from the outputs of the stack (`sagemaker-hyperpod-mh`). You will use the values of these variables to update the configuration file (`provisioning_parameters.json`) in [Preparing and uploading lifecycle scripts](sagemaker-hyperpod-multihead-slurm-scripts.md).

   ```
   source .env
   SLURM_DB_ENDPOINT_ADDRESS=$(aws --region $REGION cloudformation describe-stacks --stack-name $MULTI_HEAD_SLURM_STACK --query 'Stacks[0].Outputs[?OutputKey==`SlurmDBEndpointAddress`].OutputValue' --output text)
   SLURM_DB_SECRET_ARN=$(aws --region $REGION cloudformation describe-stacks --stack-name $MULTI_HEAD_SLURM_STACK --query 'Stacks[0].Outputs[?OutputKey==`SlurmDBSecretArn`].OutputValue' --output text)
   SLURM_EXECUTION_ROLE_ARN=$(aws --region $REGION cloudformation describe-stacks --stack-name $MULTI_HEAD_SLURM_STACK --query 'Stacks[0].Outputs[?OutputKey==`SlurmExecutionRoleArn`].OutputValue' --output text)
   SLURM_SNS_FAILOVER_TOPIC_ARN=$(aws --region $REGION cloudformation describe-stacks --stack-name $MULTI_HEAD_SLURM_STACK --query 'Stacks[0].Outputs[?OutputKey==`SlurmFailOverSNSTopicArn`].OutputValue' --output text)
   ```

# Creating and attaching an IAM policy
<a name="sagemaker-hyperpod-multihead-slurm-iam"></a>

This section explains how to create an IAM policy and attach it to the execution role you created in [Provision additional resources to support multiple controller nodes](sagemaker-hyperpod-multihead-slurm-cfn.md#sagemaker-hyperpod-multihead-slurm-cfn-multihead).

1. Download the [IAM policy example](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/1.AmazonSageMakerClustersExecutionRolePolicy.json) to your machine from the GitHub repository.

1. Create an IAM policy with the downloaded example, using the [create-policy](https://docs.aws.amazon.com//cli/latest/reference/iam/create-policy.html) CLI command.

   ```
   aws --region us-east-1 iam create-policy \
       --policy-name AmazonSagemakerExecutionPolicy \
       --policy-document file://1.AmazonSageMakerClustersExecutionRolePolicy.json
   ```

   Example output of the command.

   ```
   {
       "Policy": {
           "PolicyName": "AmazonSagemakerExecutionPolicy",
           "PolicyId": "ANPAXISIWY5UYZM7WJR4W",
           "Arn": "arn:aws:iam::111122223333:policy/AmazonSagemakerExecutionPolicy",
           "Path": "/",
           "DefaultVersionId": "v1",
           "AttachmentCount": 0,
           "PermissionsBoundaryUsageCount": 0,
           "IsAttachable": true,
           "CreateDate": "2025-01-22T20:01:21+00:00",
           "UpdateDate": "2025-01-22T20:01:21+00:00"
       }
   }
   ```

1. Attach the policy `AmazonSagemakerExecutionPolicy` to the Slurm execution role you created in [Provision additional resources to support multiple controller nodes](sagemaker-hyperpod-multihead-slurm-cfn.md#sagemaker-hyperpod-multihead-slurm-cfn-multihead), using the [attach-role-policy](https://docs.aws.amazon.com//cli/latest/reference/iam/attach-role-policy.html) CLI command.

   ```
   aws --region us-east-1 iam attach-role-policy \
       --role-name AmazonSagemakerExecutionRole \
       --policy-arn arn:aws:iam::111122223333:policy/AmazonSagemakerExecutionPolicy
   ```

   This command doesn't produce any output.

   (Optional) If you use environment variables, here are the example commands.
   + To get the role name and policy name 

      ```
      POLICY=$(aws --region $REGION iam list-policies --query 'Policies[?PolicyName==`AmazonSagemakerExecutionPolicy`].Arn' --output text)
      ROLENAME=$(aws --region $REGION iam list-roles --query "Roles[?Arn=='${SLURM_EXECUTION_ROLE_ARN}'].RoleName" --output text)
      ```
   + To attach the policy

      ```
      aws --region $REGION iam attach-role-policy \
          --role-name $ROLENAME --policy-arn $POLICY
      ```

For more information, see [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod).

# Preparing and uploading lifecycle scripts
<a name="sagemaker-hyperpod-multihead-slurm-scripts"></a>

After creating all the required resources, set up the [lifecycle scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts) for your SageMaker HyperPod cluster. The scripts provide a [base configuration](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config) that you can use to create a basic HyperPod Slurm cluster.

## Prepare the lifecycle scripts
<a name="sagemaker-hyperpod-multihead-slurm-prepare-scripts"></a>

Follow these steps to get the lifecycle scripts.

1. Download the [lifecycle scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts) from the GitHub repository to your machine.

1. Upload the [lifecycle scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts) to the Amazon S3 bucket you created in [Provision basic resources](sagemaker-hyperpod-multihead-slurm-cfn.md#sagemaker-hyperpod-multihead-slurm-cfn-basic), using the [cp](https://docs.aws.amazon.com//cli/latest/reference/s3/cp.html) CLI command.

   ```
   aws s3 cp --recursive LifeCycleScripts/base-config s3://${ROOT_BUCKET_NAME}/LifeCycleScripts/base-config
   ```

## Create configuration file
<a name="sagemaker-hyperpod-multihead-slurm-update-config-file"></a>

Follow these steps to create the configuration file and upload it to the same Amazon S3 bucket where you store the lifecycle scripts.

1. Create a configuration file named `provisioning_parameters.json` with the following configuration. Note that `slurm_sns_arn` is optional. If you don't provide it, HyperPod doesn't set up Amazon SNS notifications.

   ```
   cat <<EOF > /tmp/provisioning_parameters.json
   {
     "version": "1.0.0",
     "workload_manager": "slurm",
     "controller_group": "$CONTOLLER_IG_NAME",
     "login_group": "my-login-group",
     "worker_groups": [
       {
         "instance_group_name": "$COMPUTE_IG_NAME",
         "partition_name": "dev"
       }
     ],
     "fsx_dns_name": "$SLURM_FSX_DNS_NAME",
     "fsx_mountname": "$SLURM_FSX_MOUNT_NAME",
     "slurm_configurations": {
       "slurm_database_secret_arn": "$SLURM_DB_SECRET_ARN",
       "slurm_database_endpoint": "$SLURM_DB_ENDPOINT_ADDRESS",
       "slurm_shared_directory": "/fsx",
       "slurm_database_user": "$DB_USER_NAME",
       "slurm_sns_arn": "$SLURM_SNS_FAILOVER_TOPIC_ARN"
     }
   }
   EOF
   ```

1. Upload the `provisioning_parameters.json` file to the same Amazon S3 bucket where you store the lifecycle scripts.

   ```
   aws s3 cp /tmp/provisioning_parameters.json s3://${ROOT_BUCKET_NAME}/LifeCycleScripts/base-config/provisioning_parameters.json
   ```
**Note**  
If you are using API-driven configuration, the `provisioning_parameters.json` file is not required. With API-driven configuration, you define Slurm node types, partitions, and FSx mounting directly in the CreateCluster API payload. For details, see [Getting started with SageMaker HyperPod using the AWS CLI](smcluster-getting-started-slurm-cli.md).
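Before uploading, you can sanity-check the generated file: it must be valid JSON, and no placeholder should have expanded to an empty string because of an unset environment variable. The following is a minimal sketch using a simplified stand-in file; it assumes `python3` is available, as it is on Amazon Linux.

```shell
# Simplified stand-in for /tmp/provisioning_parameters.json, for illustration.
cat <<EOF > /tmp/provisioning_parameters.json
{"version": "1.0.0", "workload_manager": "slurm", "fsx_mountname": "abcdefgh"}
EOF

# Is the file valid JSON?
python3 -m json.tool /tmp/provisioning_parameters.json > /dev/null && echo "valid JSON"

# Did any unset environment variable leave an empty value behind?
if grep -q '""' /tmp/provisioning_parameters.json; then
  echo "warning: empty value found; check your environment variables" >&2
else
  echo "no empty values"
fi
```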

## Verify files in Amazon S3 bucket
<a name="sagemaker-hyperpod-multihead-slurm-verify-s3"></a>

After you upload all the lifecycle scripts and the `provisioning_parameters.json` file, your Amazon S3 bucket should look like the following.

![\[Image showing all the lifecycle scripts uploaded to the Amazon S3 bucket in the Amazon Simple Storage Service console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-lifecycle-scripts-s3.png)


For more information, see [Start with base lifecycle scripts provided by HyperPod](https://docs.aws.amazon.com//sagemaker/latest/dg/sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.html).

# Creating a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-multihead-slurm-create"></a>

After setting up all the required resources and uploading the scripts to the Amazon S3 bucket, you can create a cluster.

1. To create a cluster, run the [create-cluster](https://docs.aws.amazon.com//cli/latest/reference/sagemaker/create-cluster.html) AWS CLI command. The creation process can take up to 15 minutes to complete.

   ```
   aws --region $REGION sagemaker create-cluster \
       --cluster-name $HP_CLUSTER_NAME \
       --vpc-config '{
           "SecurityGroupIds":["'$SECURITY_GROUP'"],
           "Subnets":["'$PRIMARY_SUBNET'", "'$BACKUP_SUBNET'"]
       }' \
       --instance-groups '[{
       "InstanceGroupName": "'$CONTOLLER_IG_NAME'",
       "InstanceType": "ml.t3.medium",
       "InstanceCount": 2,
       "LifeCycleConfig": {
           "SourceS3Uri": "s3://'$ROOT_BUCKET_NAME'/LifeCycleScripts/base-config",
           "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "'$SLURM_EXECUTION_ROLE_ARN'",
       "ThreadsPerCore": 1
   },
   {
       "InstanceGroupName": "'$COMPUTE_IG_NAME'",
       "InstanceType": "ml.c5.xlarge",
       "InstanceCount": 2,
       "LifeCycleConfig": {
           "SourceS3Uri": "s3://'$ROOT_BUCKET_NAME'/LifeCycleScripts/base-config",
           "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "'$COMPUTE_NODE_ROLE'",
       "ThreadsPerCore": 1
   }]'
   ```

   After successful execution, the command returns the cluster ARN like the following.

   ```
   {
       "ClusterArn": "arn:aws:sagemaker:us-east-1:111122223333:cluster/cluster_id"
   }
   ```

1. (Optional) To check the status of your cluster, you can use the SageMaker AI console ([https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/)). From the left navigation, choose **HyperPod Clusters**, then choose **Cluster Management**. Choose a cluster name to open the cluster details page. If your cluster is created successfully, you will see that the cluster status is **InService**.  
![\[Image showing a HyperPod Slurm cluster with multiple controller nodes in the Amazon SageMaker AI console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-lifecycle-multihead-cluster.png)
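The cluster ID that appears in later SSM commands is the final path component of the returned `ClusterArn`. The following sketch extracts it with shell parameter expansion; the ARN value here is illustrative.

```shell
# Extract the cluster ID (the last path component) from the ClusterArn
# returned by create-cluster. The ARN below is an illustrative example.
cluster_arn="arn:aws:sagemaker:us-east-1:111122223333:cluster/aa11bbbbb222"
cluster_id="${cluster_arn##*/}"
echo "$cluster_id"
# → aa11bbbbb222
```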

# Considering important notes
<a name="sagemaker-hyperpod-multihead-slurm-notes"></a>

This section provides several important notes that you might find helpful.

1. To migrate to a multi-controller Slurm cluster, complete these steps.

   1. Follow the instructions in [Provisioning resources using CloudFormation stacks](sagemaker-hyperpod-multihead-slurm-cfn.md) to provision all the required resources.

   1. Follow the instructions in [Preparing and uploading lifecycle scripts](sagemaker-hyperpod-multihead-slurm-scripts.md) to upload the updated lifecycle scripts. When updating the `provisioning_parameters.json` file, move your existing controller group to the `worker_groups` section, and add a new controller group name in the `controller_group` section.

   1. Run the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) CLI command to create the new controller group while keeping the original compute instance groups and controller group.

1. To scale down the number of controller nodes, use the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) CLI command. For each controller instance group, the minimum number of controller nodes you can scale down to is 1. This means that you cannot scale down the number of controller nodes to 0.
**Important**  
For clusters created before Jan 24, 2025, you must first update your cluster software using the [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API before running the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) CLI command.

   The following is an example CLI command to scale down the number of controller nodes.

   ```
   aws sagemaker update-cluster \
       --cluster-name my_cluster \
       --instance-groups '[{                  
       "InstanceGroupName": "controller_ig_name",
       "InstanceType": "ml.t3.medium",
       "InstanceCount": 3,
       "LifeCycleConfig": {
           "SourceS3Uri": "s3://amzn-s3-demo-bucket1",
           "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "slurm_execution_role_arn",
       "ThreadsPerCore": 1
   },
   {
       "InstanceGroupName": "compute-ig_name",       
       "InstanceType": "ml.c5.xlarge",
       "InstanceCount": 2,
       "LifeCycleConfig": {
           "SourceS3Uri": "s3://amzn-s3-demo-bucket1",
           "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "compute_node_role_arn",
       "ThreadsPerCore": 1
   }]'
   ```

1. To batch delete controller nodes, use the [batch-delete-cluster-nodes](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/batch-delete-cluster-nodes.html) CLI command. For each controller instance group, you must keep at least one controller node; the API operation fails if you try to batch delete all the controller nodes in a group.
**Important**  
For clusters created before Jan 24, 2025, you must first update your cluster software using the [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API before running the [batch-delete-cluster-nodes](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/batch-delete-cluster-nodes.html) CLI command.

   The following is an example CLI command to batch delete the controller nodes.

   ```
   aws sagemaker batch-delete-cluster-nodes --cluster-name my_cluster --node-ids instance_ids_to_delete
   ```

1. To troubleshoot cluster creation issues, check the failure message on the cluster details page in the SageMaker AI console. You can also use CloudWatch logs. From the CloudWatch console, choose **Log groups**, then search for `clusters` to see the list of log groups related to your cluster creation.  
![\[Image showing Amazon SageMaker HyperPod cluster log groups in the CloudWatch console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-lifecycle-multihead-logs.png)

# Reviewing environment variables reference
<a name="sagemaker-hyperpod-multihead-slurm-variables-reference"></a>

The following environment variables are defined and used in the tutorial [Setting up multiple controller nodes for a SageMaker HyperPod Slurm cluster](sagemaker-hyperpod-multihead-slurm-setup.md). These environment variables are available only in the current session unless explicitly preserved, and are referenced using the `$variable_name` syntax. Variables populated from CloudFormation stack outputs represent AWS-created resources, while the others are user-defined.


**Environment variables reference**  

| Variable | Description | 
| --- | --- | 
| $BACKUP_SUBNET | The ID of the backup private subnet, from the `BackupPrivateSubnet` output of the `sagemaker-hyperpod` stack. | 
| $COMPUTE_IG_NAME | The name you define for the compute node instance group. | 
| $COMPUTE_NODE_ROLE | The ARN of the execution role for the compute node instance group, from the `AmazonSagemakerClusterExecutionRoleArn` output of the `sagemaker-hyperpod` stack. | 
| $CONTOLLER_IG_NAME | The name you define for the controller node instance group. | 
| $DB_USER_NAME | The database user name you define for the Amazon RDS for MariaDB instance. | 
| $EMAIL | The email address you define to receive Amazon SNS notifications. | 
| $PRIMARY_SUBNET | The ID of the primary private subnet, from the `PrimaryPrivateSubnet` output of the `sagemaker-hyperpod` stack. | 
| $POLICY | The ARN of the `AmazonSagemakerExecutionPolicy` IAM policy. | 
| $REGION | The AWS Region where you create your cluster resources. | 
| $ROOT_BUCKET_NAME | The name of the Amazon S3 bucket that stores the lifecycle scripts, from the `AmazonS3BucketName` output of the `sagemaker-hyperpod` stack. | 
| $SECURITY_GROUP | The ID of the security group, from the `SecurityGroup` output of the `sagemaker-hyperpod` stack. | 
| $SLURM_DB_ENDPOINT_ADDRESS | The endpoint address of the Amazon RDS for MariaDB instance, from the `SlurmDBEndpointAddress` output of the `sagemaker-hyperpod-mh` stack. | 
| $SLURM_DB_SECRET_ARN | The ARN of the AWS Secrets Manager secret for the database credentials, from the `SlurmDBSecretArn` output of the `sagemaker-hyperpod-mh` stack. | 
| $SLURM_EXECUTION_ROLE_ARN | The ARN of the Slurm execution role for the controller node instance group, from the `SlurmExecutionRoleArn` output of the `sagemaker-hyperpod-mh` stack. | 
| $SLURM_FSX_DNS_NAME | The DNS name of the Amazon FSx for Lustre file system, from the `FSxLustreFilesystemDNSname` output of the `sagemaker-hyperpod` stack. | 
| $SLURM_FSX_MOUNT_NAME | The mount name of the Amazon FSx for Lustre file system, from the `FSxLustreFilesystemMountname` output of the `sagemaker-hyperpod` stack. | 
| $SLURM_SNS_FAILOVER_TOPIC_ARN | The ARN of the Amazon SNS topic for failover notifications, from the `SlurmFailOverSNSTopicArn` output of the `sagemaker-hyperpod-mh` stack. | 

# Jobs on SageMaker HyperPod clusters
<a name="sagemaker-hyperpod-run-jobs-slurm"></a>

The following topics provide procedures and examples of accessing compute nodes and running ML workloads on provisioned SageMaker HyperPod clusters. Depending on how you have set up the environment on your HyperPod cluster, there are many ways to run ML workloads on HyperPod clusters. Examples of running ML workloads on HyperPod clusters are also provided in the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/). The following topics walk you through how to log in to the provisioned HyperPod clusters and get you started with running sample ML workloads.

**Tip**  
To find practical examples and solutions, see also the [SageMaker HyperPod workshop](https://catalog.workshops.aws/sagemaker-hyperpod).

**Topics**
+ [Accessing your SageMaker HyperPod cluster nodes](sagemaker-hyperpod-run-jobs-slurm-access-nodes.md)
+ [Scheduling a Slurm job on a SageMaker HyperPod cluster](sagemaker-hyperpod-run-jobs-slurm-schedule-slurm-job.md)
+ [Running Docker containers on a Slurm compute node on HyperPod](sagemaker-hyperpod-run-jobs-slurm-docker.md)
+ [Running distributed training workloads with Slurm on HyperPod](sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload.md)

# Accessing your SageMaker HyperPod cluster nodes
<a name="sagemaker-hyperpod-run-jobs-slurm-access-nodes"></a>

You can access your **InService** cluster through AWS Systems Manager (SSM) by running the AWS CLI command `aws ssm start-session` with the SageMaker HyperPod cluster host name in the format `sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id]`. You can retrieve the cluster ID, the instance ID, and the instance group name from the [SageMaker HyperPod console](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-view-details-of-clusters) or by running `describe-cluster` and `list-cluster-nodes` from the [AWS CLI commands for SageMaker HyperPod](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-list-cluster-nodes). For example, if your cluster ID is `aa11bbbbb222`, the instance group name is `controller-group`, and the instance ID is `i-111222333444555aa`, the SSM `start-session` command should be the following.

**Note**  
Granting users access to HyperPod cluster nodes allows them to install and operate user-managed software on the nodes. Ensure that you maintain the principle of least-privilege permissions for users.  
If you haven't set up AWS Systems Manager, follow the instructions provided at [Setting up AWS Systems Manager and Run As for cluster user access control](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-ssm).

```
$ aws ssm start-session \
    --target sagemaker-cluster:aa11bbbbb222_controller-group-i-111222333444555aa \
    --region us-west-2
Starting session with SessionId: s0011223344aabbccdd
root@ip-111-22-333-444:/usr/bin#
```

Note that this initially connects you as the root user. Before running jobs, switch to the `ubuntu` user by running the following command.

```
root@ip-111-22-333-444:/usr/bin# sudo su - ubuntu
ubuntu@ip-111-22-333-444:/usr/bin#
```
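The target string above follows a fixed pattern, so you can assemble it from the cluster details before connecting. The following sketch uses the example values from this section.

```shell
# Assemble the SSM target name from the cluster details described above.
# These values are the examples used in this section.
cluster_id="aa11bbbbb222"
instance_group="controller-group"
instance_id="i-111222333444555aa"

target="sagemaker-cluster:${cluster_id}_${instance_group}-${instance_id}"
echo "$target"
# → sagemaker-cluster:aa11bbbbb222_controller-group-i-111222333444555aa

# Then connect with:
#   aws ssm start-session --target "$target" --region us-west-2
```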

For advanced settings for practical use of HyperPod clusters, see the following topics.

**Topics**
+ [Additional tips for accessing your SageMaker HyperPod cluster nodes](#sagemaker-hyperpod-run-jobs-slurm-access-nodes-tips)
+ [Set up a multi-user environment through the Amazon FSx shared space](#sagemaker-hyperpod-run-jobs-slurm-access-nodes-multi-user-with-fxs-shared-space)
+ [Set up a multi-user environment by integrating HyperPod clusters with Active Directory](#sagemaker-hyperpod-run-jobs-slurm-access-nodes-multi-user-with-active-directory)

## Additional tips for accessing your SageMaker HyperPod cluster nodes
<a name="sagemaker-hyperpod-run-jobs-slurm-access-nodes-tips"></a>

**Use the `easy-ssh.sh` script provided by HyperPod for simplifying the connection process**

To condense the previous process into a single command, the HyperPod team provides the [easy-ssh.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh) script, which retrieves your cluster information, assembles it into the SSM command, and connects to the compute node. You don't need to look up the required HyperPod cluster information manually, because the script runs the `describe-cluster` and `list-cluster-nodes` commands and parses the information needed to complete the SSM command. The following example commands show how to run the `easy-ssh.sh` script. If it runs successfully, you'll be connected to the cluster as the root user. The script also prints a code snippet for setting up SSH by adding the HyperPod cluster as a remote host through an SSM proxy. With SSH set up, you can connect your local development environment, such as Visual Studio Code, to the HyperPod cluster.

```
$ chmod +x easy-ssh.sh
$ ./easy-ssh.sh -c <node-group> <cluster-name>
Cluster id: <cluster_id>
Instance id: <instance_id>
Node Group: <node-group>
Add the following to your ~/.ssh/config to easily connect:

$ cat <<EOF >> ~/.ssh/config
Host <cluster-name>
  User ubuntu
  ProxyCommand sh -c "aws ssm start-session  --target sagemaker-cluster:<cluster_id>_<node-group>-<instance_id> --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"
EOF

Add your ssh keypair and then you can do:

$ ssh <cluster-name>

aws ssm start-session --target sagemaker-cluster:<cluster_id>_<node-group>-<instance_id>

Starting session with SessionId: s0011223344aabbccdd
root@ip-111-22-333-444:/usr/bin#
```

Note that this initially connects you as the root user. Before running jobs, switch to the `ubuntu` user by running the following command.

```
root@ip-111-22-333-444:/usr/bin# sudo su - ubuntu
ubuntu@ip-111-22-333-444:/usr/bin#
```

**Set up for easy access with SSH by using the HyperPod compute node as a remote host**

To further simplify SSH access to the compute node from a local machine, the `easy-ssh.sh` script outputs a code snippet that sets up the HyperPod cluster as a remote host, as shown in the previous section. The snippet is auto-generated so that you can add it directly to the `~/.ssh/config` file on your local device. The following procedure shows how to set up SSH access through the SSM proxy, so that you or your cluster users can run `ssh <cluster-name>` to connect directly to the HyperPod cluster node.

1. On your local device, add the HyperPod compute node as a remote host with a user name in the `~/.ssh/config` file. The following command shows how to append the auto-generated code snippet from the `easy-ssh.sh` script to the `~/.ssh/config` file. Make sure that you copy the snippet from the auto-generated output of the `easy-ssh.sh` script, which contains the correct cluster information.

   ```
   $ cat <<EOF >> ~/.ssh/config
   Host <cluster-name>
     User ubuntu
     ProxyCommand sh -c "aws ssm start-session  --target sagemaker-cluster:<cluster_id>_<node-group>-<instance_id> --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"
   EOF
   ```

1. On the HyperPod cluster node, add the public key from your local device to the `~/.ssh/authorized_keys` file.

   1. Print the public key file on your local machine.

      ```
      $ cat ~/.ssh/id_rsa.pub
      ```

      This should return your key. Copy the output of this command. 

      (Optional) If you don't have a public key, create one by running the following command.

      ```
      $ ssh-keygen -t rsa -q -f "$HOME/.ssh/id_rsa" -N ""
      ```

1. Connect to the cluster node and switch to the user for which you want to add the key. The following command is an example of accessing the node as the `ubuntu` user. Replace `ubuntu` with the user name for which you want to set up SSH access.

      ```
      $ ./easy-ssh.sh -c <node-group> <cluster-name>
      $ sudo su - ubuntu
      ubuntu@ip-111-22-333-444:/usr/bin#
      ```

   1. Open the `~/.ssh/authorized_keys` file and add the public key at the end of the file.

      ```
      ubuntu@ip-111-22-333-444:/usr/bin# vim ~/.ssh/authorized_keys
      ```

After you finish setting up, you can connect to the HyperPod cluster node as the user by running a simplified SSH command as follows.

```
$ ssh <cluster-name>
ubuntu@ip-111-22-333-444:/usr/bin#
```

Also, you can use the host for remote development from an IDE on your local device, such as [Visual Studio Code Remote - SSH](https://code.visualstudio.com/docs/remote/ssh).

## Set up a multi-user environment through the Amazon FSx shared space
<a name="sagemaker-hyperpod-run-jobs-slurm-access-nodes-multi-user-with-fxs-shared-space"></a>

You can use an Amazon FSx shared space to manage a multi-user environment in a Slurm cluster on SageMaker HyperPod. If you configured your Slurm cluster with Amazon FSx during HyperPod cluster creation, this is a good option for setting up workspaces for your cluster users. Create a new user and set up the user's home directory on the Amazon FSx shared file system.

**Tip**  
To allow users to access your cluster through their user name and dedicated directories, you should also associate them with IAM roles or users by tagging them as guided in **Option 2** of step 5 under the procedure **To turn on Run As support for Linux and macOS managed nodes** provided at [Turn on Run As support for Linux and macOS managed nodes](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-preferences-run-as.html) in the AWS Systems Manager User Guide. See also [Setting up AWS Systems Manager and Run As for cluster user access control](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-ssm).

**To set up a multi-user environment while creating a Slurm cluster on SageMaker HyperPod**

The SageMaker HyperPod service team provides the [add_users.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/add_users.sh) script as part of the base lifecycle script samples.

1. Prepare a text file named `shared_users.txt` in the following format. The first column is for user names, the second column is for unique user IDs, and the third column is for the user directories in the Amazon FSx shared space.

   ```
   username1,uid1,/fsx/username1
   username2,uid2,/fsx/username2
   ...
   ```

1. Make sure that you upload the `shared_users.txt` file and the [add_users.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/add_users.sh) script to the S3 bucket for your HyperPod lifecycle scripts. While cluster creation, a cluster update, or a cluster software update is in progress, `add_users.sh` reads `shared_users.txt` and sets up the user directories.
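As a sketch of what happens at provisioning time, a script like `add_users.sh` iterates over the records in `shared_users.txt` and creates each user with a fixed UID and an Amazon FSx home directory. The following hypothetical dry run only prints the commands that would be executed; the user names and UIDs are examples.

```shell
# Create an example shared_users.txt (names, UIDs, and paths are illustrative).
cat <<'EOF' > shared_users.txt
username1,2001,/fsx/username1
username2,2002,/fsx/username2
EOF

# Dry run: print the user-creation command for each record instead of running it.
while IFS=',' read -r name uid homedir; do
  echo "useradd -u ${uid} -d ${homedir} -m --shell /bin/bash ${name}"
done < shared_users.txt
```

Using the same UID for a user on every node is what keeps file ownership consistent on the shared Amazon FSx file system.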

**To create new users and add them to an existing Slurm cluster running on SageMaker HyperPod**

1. On the head node, run the following command to save a script that helps create users. Make sure that you run it with sudo permissions.

   ```
   $ cat > create-user.sh << EOL
   #!/bin/bash
   
   set -x
   
   # Prompt user to get the new user name.
   read -p "Enter the new user name, i.e. 'sean': 
   " USER
   
   # create home directory as /fsx/<user>
   # Create the new user on the head node
   sudo useradd \$USER -m -d /fsx/\$USER --shell /bin/bash;
   user_id=\$(id -u \$USER)
   
   # add user to docker group
   sudo usermod -aG docker \${USER}
   
   # setup SSH Keypair
   sudo -u \$USER ssh-keygen -t rsa -q -f "/fsx/\$USER/.ssh/id_rsa" -N ""
   sudo -u \$USER cat /fsx/\$USER/.ssh/id_rsa.pub | sudo -u \$USER tee /fsx/\$USER/.ssh/authorized_keys
   
   # add user to compute nodes
   read -p "Number of compute nodes in your cluster, i.e. 8: 
   " NUM_NODES
   srun -N \$NUM_NODES sudo useradd -u \$user_id \$USER -d /fsx/\$USER --shell /bin/bash;
   
   # add them as a sudoer
   read -p "Do you want this user to be a sudoer? (y/N):
   " SUDO
   if [ "\$SUDO" = "y" ]; then
           sudo usermod -aG sudo \$USER
           sudo srun -N \$NUM_NODES sudo usermod -aG sudo \$USER
           echo -e "If you haven't already you'll need to run:\n\nsudo visudo /etc/sudoers\n\nChange the line:\n\n%sudo   ALL=(ALL:ALL) ALL\n\nTo\n\n%sudo   ALL=(ALL:ALL) NOPASSWD: ALL\n\nOn each node."
   fi
   EOL
   ```

1. Run the script with the following command. You'll be prompted to enter the name of the user and the number of compute nodes that you want to allow the user to access.

   ```
   $ bash create-user.sh
   ```

1. Test the user by running the following commands. 

   ```
   $ sudo su - <user> && ssh $(srun hostname)
   ```

1. Add the user information to the `shared_users.txt` file so that the user is created on any new compute nodes or new clusters.
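To keep `shared_users.txt` in sync after creating a user interactively, you can append a matching record as follows. The user name and UID below are examples, and the location of `shared_users.txt` depends on where you store your lifecycle script inputs.

```shell
# Append the new user's record to shared_users.txt
# (user name, UID, and file location are examples).
NEW_USER=username3
NEW_UID=2003
echo "${NEW_USER},${NEW_UID},/fsx/${NEW_USER}" >> shared_users.txt
```

Remember to re-upload the updated file to the S3 bucket for your lifecycle scripts so that future cluster updates pick it up.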

## Set up a multi-user environment by integrating HyperPod clusters with Active Directory
<a name="sagemaker-hyperpod-run-jobs-slurm-access-nodes-multi-user-with-active-directory"></a>

In practical use cases, HyperPod clusters are typically used by multiple users: machine learning (ML) researchers, software engineers, data scientists, and cluster administrators. They edit their own files and run their own jobs without impacting each other's work. One way to set up a multi-user environment is to use the Linux user and group mechanism to statically create multiple users on each instance through lifecycle scripts. The drawback of this approach is that you need to duplicate user and group settings across multiple instances in the cluster to keep the configuration consistent across all instances when you make updates such as adding, editing, and removing users.

To solve this, you can use [Lightweight Directory Access Protocol (LDAP)](https://en.wikipedia.org/wiki/Lightweight_Directory_Access_Protocol) and [LDAP over TLS/SSL (LDAPS)](https://en.wikipedia.org/wiki/Lightweight_Directory_Access_Protocol) to integrate with a directory service such as [AWS Directory Service for Microsoft Active Directory](https://aws.amazon.com/directoryservice/). To learn more about setting up Active Directory and a multi-user environment in a HyperPod cluster, see the blog post [Integrate HyperPod clusters with Active Directory for seamless multi-user login](https://aws.amazon.com/blogs/machine-learning/integrate-hyperpod-clusters-with-active-directory-for-seamless-multi-user-login/).

# Scheduling a Slurm job on a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-run-jobs-slurm-schedule-slurm-job"></a>

You can launch training jobs using the standard Slurm `sbatch` or `srun` commands. For example, to launch an 8-node training job, you can run `srun -N 8 --exclusive train.sh`. SageMaker HyperPod supports training in a range of environments, including `conda`, `venv`, `docker`, and `enroot`. You can configure an ML environment by running lifecycle scripts on your SageMaker HyperPod clusters. You also have the option to attach a shared file system, such as Amazon FSx, which can also be used as a virtual environment.
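As a minimal sketch of the `sbatch` flow, the following writes a batch script for an 8-node exclusive job in the document's `cat <<EOF` style; the job name and the `train.sh` path are placeholders, not part of any sample in this guide.

```shell
# Write a minimal sbatch script for an 8-node exclusive job
# (job name and train.sh path are placeholders).
cat <<'EOF' > train.sbatch
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --exclusive
#SBATCH --job-name=my-training-job
srun train.sh
EOF
```

Submit it with `sbatch train.sbatch`; the `srun` inside the script then launches `train.sh` across all allocated nodes.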

The following example shows how to run a job for training Llama-2 with the Fully Sharded Data Parallelism (FSDP) technique on a SageMaker HyperPod cluster with an Amazon FSx shared file system. You can also find more examples from the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/).

**Tip**  
All SageMaker HyperPod examples are available in the `3.test_cases` folder of the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/).

1. Clone the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/), and copy the training job examples to your Amazon FSx file system. 

   ```
   $ TRAINING_DIR=/fsx/users/my-user/fsdp
   $ git clone https://github.com/aws-samples/awsome-distributed-training/
   ```

1. Use the [0.create_conda_env.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/10.FSDP/0.create_conda_env.sh) script to create a `conda` environment on your Amazon FSx file system. Make sure that the file system is accessible to all nodes in the cluster.

1. Build the virtual `conda` environment by launching a single-node Slurm job as follows.

   ```
   $ srun -N 1 /path_to/create_conda_env.sh
   ```

1. After the environment is built, you can launch a training job by pointing to the environment path on the shared volume. You can launch both single-node and multi-node training jobs with the same setup. To launch a job, create a job launcher script (also called an entry point script) as follows.

   ```
   #!/usr/bin/env bash
   set -ex
   
   ENV_PATH=/fsx/users/my_user/pytorch_env
   TORCHRUN=$ENV_PATH/bin/torchrun
   TRAINING_SCRIPT=/fsx/users/my_user/pt_train.py
   
   WORLD_SIZE_JOB=$SLURM_NTASKS
   RANK_NODE=$SLURM_NODEID
   PROC_PER_NODE=8
   MASTER_ADDR=(`scontrol show hostnames \$SLURM_JOB_NODELIST | head -n 1`)
   MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
   
   DIST_ARGS="--nproc_per_node=$PROC_PER_NODE \
              --nnodes=$WORLD_SIZE_JOB \
              --node_rank=$RANK_NODE \
              --master_addr=$MASTER_ADDR \
              --master_port=$MASTER_PORT \
             "
             
   $TORCHRUN $DIST_ARGS $TRAINING_SCRIPT
   ```
**Tip**  
If you want to make your training job more resilient against hardware failures by using the auto-resume capability of SageMaker HyperPod, you need to properly set up the environment variable `MASTER_ADDR` in the entrypoint script. To learn more, see [Automatic node recovery and auto-resume](sagemaker-hyperpod-resiliency-slurm-auto-resume.md).

   This tutorial assumes that this script is saved as `/fsx/users/my_user/train.sh`.

1. With this script in the shared volume at `/fsx/users/my_user/train.sh`, run the following `srun` command to schedule the Slurm job.

   ```
   $ cd /fsx/users/my_user/
   $ srun -N 8 train.sh
   ```

# Running Docker containers on a Slurm compute node on HyperPod
<a name="sagemaker-hyperpod-run-jobs-slurm-docker"></a>

To run Docker containers with Slurm on SageMaker HyperPod, use [Enroot](https://github.com/NVIDIA/enroot) and [Pyxis](https://github.com/NVIDIA/pyxis). The Enroot package converts Docker images into a runtime that Slurm can understand, while Pyxis enables scheduling that runtime as a Slurm job through an `srun` command, such as `srun --container-image=docker/image:tag`.

**Tip**  
The Docker, Enroot, and Pyxis packages should be installed during cluster creation as part of running the lifecycle scripts as guided in [Base lifecycle scripts provided by HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md). Use the [base lifecycle scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config) provided by the HyperPod service team when creating a HyperPod cluster. Those base scripts are set up to install the packages by default. In the [config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py) script, the `Config` class has a Boolean parameter for installing the packages that is set to `True` (`enable_docker_enroot_pyxis=True`). This parameter is parsed by the [lifecycle_script.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py) script, which calls the `install_docker.sh` and `install_enroot_pyxis.sh` scripts from the [utils](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils) folder. The installation scripts are where the actual installation of the packages takes place.
Additionally, the installation scripts detect whether NVMe store paths are available on the instances they run on and, if so, set the root paths for Docker and Enroot to `/opt/dlami/nvme`. The default root volume of any fresh instance is mounted to `/tmp` with only a 100 GB EBS volume, which runs out of space if your workload involves training LLMs and therefore large Docker containers. If you use instance families such as P and G with local NVMe storage, make sure that you use the NVMe storage attached at `/opt/dlami/nvme`; the installation scripts take care of the configuration.

**To check if the root paths are set up properly**

On a compute node of your Slurm cluster on SageMaker HyperPod, run the following commands to make sure that the lifecycle script worked properly and that the root volume of each node is set to `/opt/dlami/nvme/*`. The following commands show examples of checking the Enroot runtime path and the Docker data root path for 8 compute nodes of a Slurm cluster.

```
$ srun -N 8 cat /etc/enroot/enroot.conf | grep "ENROOT_RUNTIME_PATH"
ENROOT_RUNTIME_PATH        /opt/dlami/nvme/tmp/enroot/user-$(id -u)
... // The same or similar lines repeat 7 times
```

```
$ srun -N 8 cat /etc/docker/daemon.json
{
    "data-root": "/opt/dlami/nvme/docker/data-root"
}
... // The same or similar lines repeat 7 times
```

After you confirm that the runtime paths are properly set to `/opt/dlami/nvme/*`, you're ready to build and run Docker containers with Enroot and Pyxis.

**To test Docker with Slurm**

1. On your compute node, run the following commands to check whether Docker and Enroot are properly installed.

   ```
   $ docker --help
   $ enroot --help
   ```

1. Test whether Pyxis and Enroot are installed correctly by running one of the [NVIDIA CUDA Ubuntu](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda) images.

   ```
   $ srun --container-image=nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY nvidia-smi
   pyxis: importing docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
   pyxis: imported docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
   DAY MMM DD HH:MM:SS YYYY
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: XX.YY    |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
   | N/A   40C    P0    27W /  70W |      0MiB / 15109MiB |      0%      Default |
   |                               |                      |                  N/A |
   +-------------------------------+----------------------+----------------------+
   
   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   |  No running processes found                                                 |
   +-----------------------------------------------------------------------------+
   ```

   You can also test it by creating a script and running an `sbatch` command as follows.

   ```
   $ cat <<EOF >> container-test.sh
   #!/bin/bash
   #SBATCH --container-image=nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
   nvidia-smi
   EOF
   
   $ sbatch container-test.sh
   pyxis: importing docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
   pyxis: imported docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
   DAY MMM DD HH:MM:SS YYYY
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: XX.YY    |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
   | N/A   40C    P0    27W /  70W |      0MiB / 15109MiB |      0%      Default |
   |                               |                      |                  N/A |
   +-------------------------------+----------------------+----------------------+
   
   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   |  No running processes found                                                 |
   +-----------------------------------------------------------------------------+
   ```

**To run a test Slurm job with Docker**

After you complete the Slurm setup with Docker, you can bring any pre-built Docker image and run it using Slurm on SageMaker HyperPod. The following sample use case walks you through how to run a training job using Docker and Slurm on SageMaker HyperPod. It shows an example job of model-parallel training of the Llama 2 model with the SageMaker AI model parallelism (SMP) library.

1. If you want to use one of the pre-built ECR images distributed by SageMaker AI or the AWS Deep Learning Containers (DLC), make sure that you give your HyperPod cluster permission to pull ECR images through the [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod). If you use your own or an open source Docker image, you can skip this step. Add the following permissions to the [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod). In this tutorial, we use the [SMP Docker image](distributed-model-parallel-support-v2.md#distributed-model-parallel-supported-frameworks-v2) pre-packaged with the SMP library.

------
#### [ JSON ]


   ```
   {
    "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "ecr:BatchCheckLayerAvailability",
                   "ecr:BatchGetImage",
                   "ecr-public:*",
                   "ecr:GetDownloadUrlForLayer",
                   "ecr:GetAuthorizationToken",
                   "sts:*"
               ],
               "Resource": "*"
           }
       ]
   }
   ```

------

1. On the compute node, clone the repository and go to the folder that provides the example scripts of training with SMP.

   ```
   $ git clone https://github.com/aws-samples/awsome-distributed-training/
   $ cd awsome-distributed-training/3.test_cases/17.SM-modelparallelv2
   ```

1. In this tutorial, run the sample script [docker_build.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/17.SM-modelparallelv2/docker_build.sh), which pulls the SMP Docker image, builds a Docker container from it, and imports the container as an Enroot runtime. You can modify this script as needed.

   ```
   $ cat docker_build.sh
   #!/usr/bin/env bash
   
   region=us-west-2
   dlc_account_id=658645717510
   aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $dlc_account_id.dkr.ecr.$region.amazonaws.com
   
   docker build -t smpv2 .
   enroot import -o smpv2.sqsh  dockerd://smpv2:latest
   ```

   ```
   $ bash docker_build.sh
   ```

1. Create a batch script to launch a training job using `sbatch`. In this tutorial, the provided sample script [launch_training_enroot.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/17.SM-modelparallelv2/launch_training_enroot.sh) launches a model-parallel training job of the 70-billion-parameter Llama 2 model with a synthetic dataset on 8 compute nodes. A set of training scripts is provided in the [scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/17.SM-modelparallelv2/scripts) folder, and `launch_training_enroot.sh` takes `train_external.py` as the entry point script.
**Important**  
To use a Docker container on SageMaker HyperPod, you must mount the `/var/log` directory from the host machine, which in this case is the HyperPod compute node, onto the `/var/log` directory in the container. You can set this up by adding the following variable for Enroot.  

   ```
   "${HYPERPOD_PATH:="/var/log/aws/clusters":"/var/log/aws/clusters"}"
   ```

   ```
   $ cat launch_training_enroot.sh
   #!/bin/bash
   
   # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
   # SPDX-License-Identifier: MIT-0
   
   #SBATCH --nodes=8 # number of nodes to use, 2 p4d(e) = 16 A100 GPUs
   #SBATCH --job-name=smpv2_llama # name of your job
   #SBATCH --exclusive # job has exclusive use of the resource, no sharing
   #SBATCH --wait-all-nodes=1
   
   set -ex;
   
   ###########################
   ###### User Variables #####
   ###########################
   
   #########################
   model_type=llama_v2
   model_size=70b
   
   # Toggle this to use synthetic data
   use_synthetic_data=1
   
   
   # To run training on your own data  set Training/Test Data path  -> Change this to the tokenized dataset path in Fsx. Acceptable formats are huggingface (arrow) and Jsonlines.
   # Also change the use_synthetic_data to 0
   
   export TRAINING_DIR=/fsx/path_to_data
   export TEST_DIR=/fsx/path_to_data
   export CHECKPOINT_DIR=$(pwd)/checkpoints
   
   # Variables for Enroot
   : "${IMAGE:=$(pwd)/smpv2.sqsh}"
   : "${HYPERPOD_PATH:="/var/log/aws/clusters":"/var/log/aws/clusters"}" # This is needed for validating its hyperpod cluster
   : "${TRAIN_DATA_PATH:=$TRAINING_DIR:$TRAINING_DIR}"
   : "${TEST_DATA_PATH:=$TEST_DIR:$TEST_DIR}"
   : "${CHECKPOINT_PATH:=$CHECKPOINT_DIR:$CHECKPOINT_DIR}"   
   
   
   ###########################
   ## Environment Variables ##
   ###########################
   
   #export NCCL_SOCKET_IFNAME=en
   export NCCL_ASYNC_ERROR_HANDLING=1
   
   export NCCL_PROTO="simple"
   export NCCL_SOCKET_IFNAME="^lo,docker"
   export RDMAV_FORK_SAFE=1
   export FI_EFA_USE_DEVICE_RDMA=1
   export NCCL_DEBUG_SUBSYS=off
   export NCCL_DEBUG="INFO"
   export SM_NUM_GPUS=8
   export GPU_NUM_DEVICES=8
   export FI_EFA_SET_CUDA_SYNC_MEMOPS=0
   
   # async runtime error ...
   export CUDA_DEVICE_MAX_CONNECTIONS=1
   
   
   #########################
   ## Command and Options ##
   #########################
   
   if [ "$model_size" == "7b" ]; then
       HIDDEN_WIDTH=4096
       NUM_LAYERS=32
       NUM_HEADS=32
       LLAMA_INTERMEDIATE_SIZE=11008
       DEFAULT_SHARD_DEGREE=8
   # More Llama model size options
   elif [ "$model_size" == "70b" ]; then
       HIDDEN_WIDTH=8192
       NUM_LAYERS=80
       NUM_HEADS=64
       LLAMA_INTERMEDIATE_SIZE=28672
       # Reduce for better perf on p4de
       DEFAULT_SHARD_DEGREE=64
   fi
   
   
   if [ -z "$shard_degree" ]; then
       SHARD_DEGREE=$DEFAULT_SHARD_DEGREE
   else
       SHARD_DEGREE=$shard_degree
   fi
   
   if [ -z "$LLAMA_INTERMEDIATE_SIZE" ]; then
       LLAMA_ARGS=""
   else
       LLAMA_ARGS="--llama_intermediate_size $LLAMA_INTERMEDIATE_SIZE "
   fi
   
   
   if [ $use_synthetic_data == 1 ]; then
       echo "using synthetic data"
       declare -a ARGS=(
       --container-image $IMAGE
       --container-mounts $HYPERPOD_PATH,$CHECKPOINT_PATH
       )
   else
       echo "using real data...."
       declare -a ARGS=(
       --container-image $IMAGE
       --container-mounts $HYPERPOD_PATH,$TRAIN_DATA_PATH,$TEST_DATA_PATH,$CHECKPOINT_PATH
       )
   fi
   
   
   declare -a TORCHRUN_ARGS=(
       # change this to match the number of gpus per node:
       --nproc_per_node=8 \
       --nnodes=$SLURM_JOB_NUM_NODES \
       --rdzv_id=$SLURM_JOB_ID \
       --rdzv_backend=c10d \
       --rdzv_endpoint=$(hostname) \
   )
   
   srun -l "${ARGS[@]}" torchrun "${TORCHRUN_ARGS[@]}" /path_to/train_external.py \
               --train_batch_size 4 \
               --max_steps 100 \
               --hidden_width $HIDDEN_WIDTH \
               --num_layers $NUM_LAYERS \
               --num_heads $NUM_HEADS \
               ${LLAMA_ARGS} \
               --shard_degree $SHARD_DEGREE \
               --model_type $model_type \
               --profile_nsys 1 \
               --use_smp_implementation 1 \
               --max_context_width 4096 \
               --tensor_parallel_degree 1 \
               --use_synthetic_data $use_synthetic_data \
               --training_dir $TRAINING_DIR \
               --test_dir $TEST_DIR \
               --dataset_type hf \
               --checkpoint_dir $CHECKPOINT_DIR \
               --checkpoint_freq 100 \
   
   $ sbatch launch_training_enroot.sh
   ```

To find the downloadable code examples, see [Run a model-parallel training job using the SageMaker AI model parallelism library, Docker and Enroot with Slurm](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/17.SM-modelparallelv2#option-2----run-training-using-docker-and-enroot) in the *Awsome Distributed Training GitHub repository*. For more information about distributed training with a Slurm cluster on SageMaker HyperPod, proceed to the next topic at [Running distributed training workloads with Slurm on HyperPod](sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload.md).

# Running distributed training workloads with Slurm on HyperPod
<a name="sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload"></a>

SageMaker HyperPod is specialized for workloads of training large language models (LLMs) and foundation models (FMs). These workloads often require the use of multiple parallelism techniques and optimized operations for ML infrastructure and resources. Using SageMaker HyperPod, you can use the following SageMaker AI distributed training frameworks:
+ The [SageMaker AI distributed data parallelism (SMDDP) library](data-parallel.md) that offers collective communication operations optimized for AWS.
+ The [SageMaker AI model parallelism (SMP) library](model-parallel-v2.md) that implements various model parallelism techniques.

**Topics**
+ [Using SMDDP on a SageMaker HyperPod cluster](#sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload-smddp)
+ [Using SMP on a SageMaker HyperPod cluster](#sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload-smp)

## Using SMDDP on a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload-smddp"></a>

The [SMDDP library](data-parallel.md) is a collective communication library that improves compute performance of distributed data parallel training. The SMDDP library works with the following open source distributed training frameworks:
+ [PyTorch distributed data parallel (DDP)](https://pytorch.org/docs/stable/notes/ddp.html)
+ [PyTorch fully sharded data parallelism (FSDP)](https://pytorch.org/docs/stable/fsdp.html)
+ [DeepSpeed](https://github.com/microsoft/DeepSpeed)
+ [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed)

The SMDDP library addresses the communication overhead of key collective communication operations by offering the following for SageMaker HyperPod:
+ The library offers `AllGather` optimized for AWS. `AllGather` is a key operation used in sharded data parallel training, which is a memory-efficient data parallelism technique offered by popular libraries. These include the SageMaker AI model parallelism (SMP) library, DeepSpeed Zero Redundancy Optimizer (ZeRO), and PyTorch Fully Sharded Data Parallelism (FSDP).
+ The library performs optimized node-to-node communication by fully utilizing the AWS network infrastructure and the SageMaker AI ML instance topology. 
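
Conceptually (this is not SMDDP code), `AllGather` takes one shard per rank and leaves every rank holding a copy of all shards. A minimal pure-Python sketch of the collective's semantics:

```python
def all_gather(per_rank_shards):
    """Simulate AllGather: every rank ends up with every rank's shard."""
    gathered = list(per_rank_shards)
    # Each rank receives an identical copy of the full list of shards.
    return [list(gathered) for _ in per_rank_shards]

shards = [[1, 2], [3, 4], [5, 6]]  # e.g., parameter shards on ranks 0-2
result = all_gather(shards)
print(result[0])  # every rank sees [[1, 2], [3, 4], [5, 6]]
```

In sharded data parallel training, each rank materializes full parameters this way before a layer's forward or backward pass; SMDDP's contribution is an implementation of this collective optimized for the AWS network infrastructure.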

**To run sample data-parallel training jobs**

Explore the following distributed training samples implementing data parallelism techniques using the SMDDP library.
+ [https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/12.SM-dataparallel-FSDP](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/12.SM-dataparallel-FSDP)
+ [https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/13.SM-dataparallel-deepspeed](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/13.SM-dataparallel-deepspeed)

**To set up an environment for using the SMDDP library on SageMaker HyperPod**

The following are training environment requirements for using the SMDDP library on SageMaker HyperPod.
+ PyTorch v2.0.1 and later
+ CUDA v11.8 and later
+ `libstdc++` runtime version greater than 3
+ Python v3.10.x and later
+ `ml.p4d.24xlarge` and `ml.p4de.24xlarge`, which are the instance types supported by the SMDDP library
+ IMDSv2 enabled on the training host

Depending on how you want to run the distributed training job, there are two options to install the SMDDP library:
+ A direct installation using the SMDDP binary file.
+ Using the SageMaker AI Deep Learning Containers (DLCs) pre-installed with the SMDDP library.

Docker images pre-installed with the SMDDP library or the URLs to the SMDDP binary files are listed at [Supported Frameworks](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-data-parallel-support.html#distributed-data-parallel-supported-frameworks) in the SMDDP library documentation.

**To install the SMDDP library on the SageMaker HyperPod DLAMI**
+ `pip install --no-cache-dir https://smdataparallel.s3.amazonaws.com/binary/pytorch/<pytorch-version>/cuXYZ/YYYY-MM-DD/smdistributed_dataparallel-X.Y.Z-cp310-cp310-linux_x86_64.whl`
**Note**  
If you work in a Conda environment, ensure that you install PyTorch using `conda install` instead of `pip`.  

  ```
  conda install pytorch==X.Y.Z  torchvision==X.Y.Z torchaudio==X.Y.Z pytorch-cuda=X.Y.Z -c pytorch -c nvidia
  ```

**To use the SMDDP library on a Docker container**
+ The SMDDP library is pre-installed on the SageMaker AI Deep Learning Containers (DLCs). To find the list of SageMaker AI framework DLCs for PyTorch with the SMDDP library, see [Supported Frameworks](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-data-parallel-support.html#distributed-data-parallel-supported-frameworks) in the SMDDP library documentation. You can also bring your own Docker container with required dependencies installed to use the SMDDP library. To learn more about setting up a custom Docker container to use the SMDDP library, see also [Create your own Docker container with the SageMaker AI distributed data parallel library](data-parallel-bring-your-own-container.md).
**Important**  
To use the SMDDP library in a Docker container, mount the `/var/log` directory from the host machine onto `/var/log` in the container. This can be done by adding the following option when running your container.  

  ```
  docker run <OTHER_OPTIONS> -v /var/log:/var/log ...
  ```

To learn how to run data-parallel training jobs with SMDDP in general, see [Distributed training with the SageMaker AI distributed data parallelism library](data-parallel-modify-sdp.md).
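
In PyTorch scripts, SMDDP is activated by importing `smdistributed.dataparallel.torch.torch_smddp`, which registers `smddp` as a process-group backend, and then initializing the process group with that backend. A hedged helper sketch (the `nccl` fallback is an illustrative assumption, not part of the SMDDP setup instructions):

```python
def pick_backend():
    """Prefer the SMDDP backend when the library is available."""
    try:
        # Importing this module registers "smddp" as a PyTorch
        # process-group backend.
        import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401
        return "smddp"
    except ImportError:
        # Fall back to a stock PyTorch backend outside HyperPod nodes.
        return "nccl"
```

You would then call `torch.distributed.init_process_group(backend=pick_backend())` before constructing your DDP or FSDP modules.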

## Using SMP on a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload-smp"></a>

The [SageMaker AI model parallelism (SMP) library](model-parallel-v2.md) offers various [state-of-the-art model parallelism techniques](model-parallel-core-features-v2.md), including:
+ fully sharded data parallelism
+ expert parallelism
+ mixed precision training with FP16/BF16 and FP8 data types
+ tensor parallelism

The SMP library is also compatible with open source frameworks such as PyTorch FSDP, NVIDIA Megatron, and NVIDIA Transformer Engine.

**To run a sample model-parallel training workload**

The SageMaker AI service teams provide sample training jobs implementing model parallelism with the SMP library at [https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/17.SM-modelparallelv2](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/17.SM-modelparallelv2).

# SageMaker HyperPod cluster resources monitoring
<a name="sagemaker-hyperpod-cluster-observability-slurm"></a>

To achieve comprehensive observability into your SageMaker HyperPod cluster resources and software components, integrate the cluster with [Amazon Managed Service for Prometheus](https://docs.aws.amazon.com/prometheus/latest/userguide/what-is-Amazon-Managed-Service-Prometheus.html) and [Amazon Managed Grafana](https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html). The integration with Amazon Managed Service for Prometheus enables the export of metrics related to your HyperPod cluster resources, providing insights into their performance, utilization, and health. The integration with Amazon Managed Grafana enables the visualization of these metrics through various Grafana dashboards that offer an intuitive interface for monitoring and analyzing the cluster's behavior. By using these services, you gain a centralized and unified view of your HyperPod cluster, facilitating proactive monitoring, troubleshooting, and optimization of your distributed training workloads.

**Tip**  
To find practical examples and solutions, see also the [SageMaker HyperPod workshop](https://catalog.workshops.aws/sagemaker-hyperpod).

![\[An overview of configuring SageMaker HyperPod with Amazon Managed Service for Prometheus and Amazon Managed Grafana.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod-observability-architecture.png)


Figure: This architecture diagram shows an overview of configuring SageMaker HyperPod with Amazon Managed Service for Prometheus and Amazon Managed Grafana.

Proceed to the following topics to set up for SageMaker HyperPod cluster observability.

**Topics**
+ [Prerequisites for SageMaker HyperPod cluster observability](sagemaker-hyperpod-cluster-observability-slurm-prerequisites.md)
+ [Installing metrics exporter packages on your HyperPod cluster](sagemaker-hyperpod-cluster-observability-slurm-install-exporters.md)
+ [Validating Prometheus setup on the head node of a HyperPod cluster](sagemaker-hyperpod-cluster-observability-slurm-validate-prometheus-setup.md)
+ [Setting up an Amazon Managed Grafana workspace](sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws.md)
+ [Exported metrics reference](sagemaker-hyperpod-cluster-observability-slurm-exported-metrics-reference.md)
+ [Amazon SageMaker HyperPod Slurm metrics](smcluster-slurm-metrics.md)

# Prerequisites for SageMaker HyperPod cluster observability
<a name="sagemaker-hyperpod-cluster-observability-slurm-prerequisites"></a>

Before proceeding with the steps in [Installing metrics exporter packages on your HyperPod cluster](sagemaker-hyperpod-cluster-observability-slurm-install-exporters.md), ensure that the following prerequisites are met.

## Enable IAM Identity Center
<a name="sagemaker-hyperpod-cluster-observability-slurm-prerequisites-iam-id-center"></a>

To enable observability for your SageMaker HyperPod cluster, you must first enable IAM Identity Center. This is a prerequisite for deploying a CloudFormation stack that sets up the Amazon Managed Grafana workspace and Amazon Managed Service for Prometheus. Both of these services require IAM Identity Center for authentication and authorization, ensuring secure user access and management of the monitoring infrastructure.

For detailed guidance on enabling IAM Identity Center, see the [Enabling IAM Identity Center](https://docs.aws.amazon.com/singlesignon/latest/userguide/get-set-up-for-idc.html) section in the *AWS IAM Identity Center User Guide*. 

After successfully enabling IAM Identity Center, set up a user account that will serve as the administrative user throughout the following configuration procedures.

## Create and deploy a CloudFormation stack for SageMaker HyperPod observability
<a name="sagemaker-hyperpod-cluster-observability-slurm-prerequisites-cloudformation-stack"></a>

Create and deploy a CloudFormation stack for SageMaker HyperPod observability to monitor HyperPod cluster metrics in real time using Amazon Managed Service for Prometheus and Amazon Managed Grafana. Before you deploy the stack, make sure that you have enabled [IAM Identity Center](https://console.aws.amazon.com/singlesignon).

Use the sample CloudFormation script [https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/4.prometheus-grafana/cluster-observability.yaml](https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/4.prometheus-grafana/cluster-observability.yaml) that helps you set up Amazon VPC subnets, Amazon FSx for Lustre file systems, Amazon S3 buckets, and IAM roles required to create a HyperPod cluster observability stack.

# Installing metrics exporter packages on your HyperPod cluster
<a name="sagemaker-hyperpod-cluster-observability-slurm-install-exporters"></a>

The [base configuration lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md) that the SageMaker HyperPod team provides also include installation of various metric exporter packages. To activate the installation step, set the parameter `enable_observability=True` in the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py) file. The lifecycle scripts are designed to bootstrap your cluster with the following open-source metric exporter packages.


| Name | Script deployment target node | Exporter description | 
| --- | --- | --- | 
| [Slurm exporter for Prometheus](https://github.com/vpenso/prometheus-slurm-exporter) | Head (controller) node |  Exports Slurm Accounting metrics.  | 
|  [Elastic Fabric Adapter (EFA) node exporter](https://github.com/aws-samples/awsome-distributed-training/tree/main/4.validation_and_observability/3.efa-node-exporter)  |  Compute node  |  Exports metrics from cluster nodes and EFA. The package is a fork of the [Prometheus node exporter](https://github.com/prometheus/node_exporter).  | 
|  [NVIDIA Data Center GPU Management (DCGM) exporter](https://github.com/NVIDIA/dcgm-exporter)  | Compute node |  Exports NVIDIA DCGM metrics about health and performance of NVIDIA GPUs.  | 

With `enable_observability=True` in the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py) file, the following installation step is activated in the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py) script. 

```
# Install metric exporting software and Prometheus for observability
if Config.enable_observability:
    if node_type == SlurmNodeType.COMPUTE_NODE:
        ExecuteBashScript("./utils/install_docker.sh").run()
        ExecuteBashScript("./utils/install_dcgm_exporter.sh").run()
        ExecuteBashScript("./utils/install_efa_node_exporter.sh").run()

    if node_type == SlurmNodeType.HEAD_NODE:
        wait_for_scontrol()
        ExecuteBashScript("./utils/install_docker.sh").run()
        ExecuteBashScript("./utils/install_slurm_exporter.sh").run()
        ExecuteBashScript("./utils/install_prometheus.sh").run()
```
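
The flag above is read as `Config.enable_observability` by `lifecycle_script.py`. A minimal sketch of the relevant attribute in `config.py`, assuming it is a class attribute as the reference in `lifecycle_script.py` suggests (the rest of the class is omitted):

```python
class Config:
    # Set to True to activate installation of the exporter packages
    # and Prometheus during cluster bootstrap.
    enable_observability = True
```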

On the compute nodes, the script installs the NVIDIA Data Center GPU Management (DCGM) exporter and the Elastic Fabric Adapter (EFA) node exporter. The DCGM exporter is an exporter for Prometheus that collects metrics from NVIDIA GPUs, enabling monitoring of GPU usage, performance, and health. The EFA node exporter, on the other hand, gathers metrics related to the EFA network interface, which is essential for low-latency and high-bandwidth communication in HPC clusters.

On the head node, the script installs the Slurm exporter for Prometheus and the [Prometheus open-source software](https://prometheus.io/docs/introduction/overview/). The Slurm exporter provides Prometheus with metrics related to Slurm jobs, partitions, and node states.

Note that the lifecycle scripts are designed to install all of the exporter packages as Docker containers, so the Docker package must also be installed on both the head and compute nodes. The scripts for these components are provided in the [https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils) folder of the *Awsome Distributed Training GitHub repository*.

After you have set up your HyperPod cluster with the exporter packages installed, proceed to the next topic to finish setting up Amazon Managed Service for Prometheus and Amazon Managed Grafana.

# Validating Prometheus setup on the head node of a HyperPod cluster
<a name="sagemaker-hyperpod-cluster-observability-slurm-validate-prometheus-setup"></a>

After you have set up your HyperPod cluster with the exporter packages installed, check that Prometheus is properly set up on the head node of your HyperPod cluster.

1. Connect to the head node of your cluster. For instructions on accessing a node, see [Accessing your SageMaker HyperPod cluster nodes](sagemaker-hyperpod-run-jobs-slurm-access-nodes.md).

1. Run the following command to verify that the Prometheus service created by the lifecycle script `install_prometheus.sh` is running on the controller node. The output should show the **Active** status as **active (running)**.

   ```
   $ sudo systemctl status prometheus
   • prometheus.service - Prometheus Exporter
   Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; preset: disabled)
   Active: active (running) since DAY YYYY-MM-DD HH:MM:SS UTC; Ss ago
   Main PID: 12345 (prometheus)
   Tasks: 7 (limit: 9281)
   Memory: 35M
   CPU: 234ms
   CGroup: /system.slice/prometheus.service
           └─12345 /usr/bin/prometheus --config.file=/etc/prometheus/prometheus.yml
   ```

1. Validate the Prometheus configuration file as follows. The output should be similar to the following, with the three exporters configured with the correct compute node IP addresses.

   ```
   $ cat /etc/prometheus/prometheus.yml
   global:
     scrape_interval: 15s
     evaluation_interval: 15s
     scrape_timeout: 15s
   
   scrape_configs:
     - job_name: 'slurm_exporter'
       static_configs:
         - targets:
             - 'localhost:8080'
     - job_name: 'dcgm_exporter'
       static_configs:
         - targets:
             - '<ComputeNodeIP>:9400'
             - '<ComputeNodeIP>:9400'
     - job_name: 'efa_node_exporter'
       static_configs:
         - targets:
             - '<ComputeNodeIP>:9100'
             - '<ComputeNodeIP>:9100'
   
   remote_write:
     - url: <AMPRemoteWriteURL>
       queue_config:
         max_samples_per_send: 1000
         max_shards: 200
         capacity: 2500
       sigv4:
         region: <Region>
   ```

1. To test whether Prometheus is exporting Slurm, DCGM, and EFA metrics properly, run the following `curl` command against Prometheus on port `9090` on the head node.

   ```
   $ curl -s http://localhost:9090/metrics | grep -E 'slurm|dcgm|efa'
   ```

   With the metrics exported to Amazon Managed Service for Prometheus Workspace through the Prometheus remote write configuration from the controller node, you can proceed to the next topic to set up Amazon Managed Grafana dashboards to display the metrics.
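
As a programmatic alternative to the `curl` check, you can query Prometheus's HTTP API instant-query endpoint (`/api/v1/query`) on the head node. A minimal sketch using only the Python standard library; the example expression `up` simply reports which scrape targets are healthy:

```python
import urllib.parse

def build_query_url(expr, base_url="http://localhost:9090"):
    """Build an instant-query URL for Prometheus's HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})

# For example, check which scrape targets (slurm, dcgm, efa exporters) are up:
print(build_query_url("up"))
# http://localhost:9090/api/v1/query?query=up
```

Fetching that URL (for example with `urllib.request.urlopen`) returns a JSON document whose `data.result` list contains one entry per scrape target.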

# Setting up an Amazon Managed Grafana workspace
<a name="sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws"></a>

Create a new Amazon Managed Grafana workspace or update an existing Amazon Managed Grafana workspace with Amazon Managed Service for Prometheus as the data source.

**Topics**
+ [Create a Grafana workspace and set Amazon Managed Service for Prometheus as a data source](#sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws-create)
+ [Open the Grafana workspace and finish setting up the data source](#sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws-connect-data-source)
+ [Import open-source Grafana dashboards](#sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws-import-dashboards)

## Create a Grafana workspace and set Amazon Managed Service for Prometheus as a data source
<a name="sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws-create"></a>

To visualize metrics from Amazon Managed Service for Prometheus, create an Amazon Managed Grafana workspace and set it up to use Amazon Managed Service for Prometheus as a data source.

1. To create a Grafana workspace, follow the instructions at [Creating a workspace](https://docs.aws.amazon.com/grafana/latest/userguide/AMG-create-workspace.html#creating-workspace) in the *Amazon Managed Service for Prometheus User Guide*.

   1. In Step 13, select Amazon Managed Service for Prometheus as the data source.

   1. In Step 17, you can add the admin user and also other users in your IAM Identity Center.

For more information, see also the following resources.
+ [Set up Amazon Managed Grafana for use with Amazon Managed Service for Prometheus](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-amg.html) in the *Amazon Managed Service for Prometheus User Guide*
+ [Use AWS data source configuration to add Amazon Managed Service for Prometheus as a data source](https://docs.aws.amazon.com/grafana/latest/userguide/AMP-adding-AWS-config.html) in the *Amazon Managed Grafana User Guide*

## Open the Grafana workspace and finish setting up the data source
<a name="sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws-connect-data-source"></a>

After you have successfully created or updated an Amazon Managed Grafana workspace, select the workspace URL to open the workspace. You are prompted to enter the user name and password of a user that you set up in IAM Identity Center. Log in as the admin user to finish setting up the workspace.

1. In the workspace **Home** page, choose **Apps**, **AWS Data Sources**, and **Data sources**.

1. On the **Data sources** page, choose the **Data sources** tab.

1. For **Service**, choose Amazon Managed Service for Prometheus.

1. In the **Browse and provision data sources** section, choose the AWS region where you provisioned an Amazon Managed Service for Prometheus workspace.

1. From the list of data sources in the selected Region, choose the one for Amazon Managed Service for Prometheus. Make sure to check the resource ID and the resource alias of the Amazon Managed Service for Prometheus workspace that you set up for the HyperPod observability stack.

## Import open-source Grafana dashboards
<a name="sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws-import-dashboards"></a>

After you've successfully set up your Amazon Managed Grafana workspace with Amazon Managed Service for Prometheus as the data source, metrics begin flowing into Prometheus, and the dashboards start to populate with charts and information. The Grafana open source software provides various dashboards that you can import into Amazon Managed Grafana.

**To import open-source Grafana dashboards to Amazon Managed Grafana**

1. In the **Home** page of your Amazon Managed Grafana workspace, choose **Dashboards**.

1. Choose the **New** dropdown menu button, and select **Import**.

1. Paste the URL of the [Slurm Dashboard](https://grafana.com/grafana/dashboards/4323-slurm-dashboard/).

   ```
   https://grafana.com/grafana/dashboards/4323-slurm-dashboard/
   ```

1. Select **Load**.

1. Repeat the previous steps to import the following dashboards.

   1. [Node Exporter Full Dashboard](https://grafana.com/grafana/dashboards/1860-node-exporter-full/)

      ```
      https://grafana.com/grafana/dashboards/1860-node-exporter-full/
      ```

   1. [NVIDIA DCGM Exporter Dashboard](https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/)

      ```
      https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/
      ```

   1. [EFA Metrics Dashboard](https://grafana.com/grafana/dashboards/20579-efa-metrics-dev/)

      ```
      https://grafana.com/grafana/dashboards/20579-efa-metrics-dev/
      ```

   1. [FSx for Lustre Metrics Dashboard](https://grafana.com/grafana/dashboards/20906-fsx-lustre/)

      ```
      https://grafana.com/grafana/dashboards/20906-fsx-lustre/
      ```

# Exported metrics reference
<a name="sagemaker-hyperpod-cluster-observability-slurm-exported-metrics-reference"></a>

The following sections present comprehensive lists of metrics exported from SageMaker HyperPod to Amazon Managed Service for Prometheus upon the successful configuration of the CloudFormation stack for SageMaker HyperPod observability. You can start monitoring these metrics visualized in the Amazon Managed Grafana dashboards.

## Slurm exporter dashboard
<a name="sagemaker-hyperpod-cluster-observability-slurm-exported-metrics-reference-slurm-exporter"></a>

Provides visualized information of Slurm clusters on SageMaker HyperPod.

**Types of metrics**
+ **Cluster Overview:** Displaying the total number of nodes, jobs, and their states.
+ **Job Metrics:** Visualizing job counts and states over time.
+ **Node Metrics:** Showing node states, allocation, and available resources.
+ **Partition Metrics:** Monitoring partition-specific metrics such as CPU, memory, and GPU utilization.
+ **Job Efficiency:** Calculating job efficiency based on resources utilized.

**List of metrics**


| Metric name | Description | 
| --- | --- | 
| slurm\_job\_count | Total number of jobs in the Slurm cluster | 
| slurm\_job\_state\_count | Count of jobs in each state (e.g., running, pending, completed) | 
| slurm\_node\_count  | Total number of nodes in the Slurm cluster | 
| slurm\_node\_state\_count  | Count of nodes in each state (e.g., idle, alloc, mix) | 
| slurm\_partition\_node\_count  | Count of nodes in each partition | 
| slurm\_partition\_job\_count  | Count of jobs in each partition | 
| slurm\_partition\_alloc\_cpus  | Total number of allocated CPUs in each partition | 
| slurm\_partition\_free\_cpus  | Total number of available CPUs in each partition | 
| slurm\_partition\_alloc\_memory  | Total allocated memory in each partition | 
| slurm\_partition\_free\_memory  | Total available memory in each partition | 
| slurm\_partition\_alloc\_gpus  | Total allocated GPUs in each partition | 
| slurm\_partition\_free\_gpus  | Total available GPUs in each partition | 

## Node exporter dashboard
<a name="sagemaker-hyperpod-cluster-observability-slurm-exported-metrics-reference-node-exporter"></a>

Provides visualized information of system metrics collected by the [Prometheus node exporter](https://github.com/prometheus/node_exporter) from the HyperPod cluster nodes.

**Types of metrics**
+ **System overview:** Displaying CPU load averages and memory usage.
+ **Memory metrics:** Visualizing memory utilization including total memory, free memory, and swap space.
+ **Disk usage:** Monitoring disk space utilization and availability.
+ **Network traffic:** Showing network bytes received and transmitted over time.
+ **File system metrics:** Analyzing file system usage and availability.
+ **Disk I/O metrics:** Visualizing disk read and write activity.

**List of metrics**

For a complete list of exported metrics, see the [Node exporter](https://github.com/prometheus/node_exporter?tab=readme-ov-file#enabled-by-default) and [procfs](https://github.com/prometheus/procfs?tab=readme-ov-file) GitHub repositories. The following table shows a subset of the metrics that provide insights into system resource utilization such as CPU load, memory usage, disk space, and network activity.


| Metric name | Description | 
| --- | --- | 
|  node\_load1  | 1-minute load average | 
|  node\_load5  | 5-minute load average | 
|  node\_load15  | 15-minute load average | 
|  node\_memory\_MemTotal  | Total system memory | 
|  node\_memory\_MemFree  | Free system memory | 
|  node\_memory\_MemAvailable  | Available memory for allocation to processes | 
|  node\_memory\_Buffers  | Memory used by the kernel for buffering | 
|  node\_memory\_Cached  | Memory used by the kernel for caching file system data | 
|  node\_memory\_SwapTotal  | Total swap space available | 
|  node\_memory\_SwapFree  | Free swap space | 
|  node\_memory\_SwapCached  | Memory that was swapped out and swapped back in, but still resides in swap | 
|  node\_filesystem\_avail\_bytes  | Available disk space in bytes | 
|  node\_filesystem\_size\_bytes  | Total disk space in bytes | 
|  node\_filesystem\_free\_bytes  | Free disk space in bytes | 
|  node\_network\_receive\_bytes  | Network bytes received | 
|  node\_network\_transmit\_bytes  | Network bytes transmitted | 
|  node\_disk\_read\_bytes  | Disk bytes read | 
|  node\_disk\_written\_bytes  | Disk bytes written | 

## NVIDIA DCGM exporter dashboard
<a name="sagemaker-hyperpod-cluster-observability-slurm-exported-metrics-reference-nvidia-dcgm-exporter"></a>

Provides visualized information of NVIDIA GPU metrics collected by the [NVIDIA DCGM exporter](https://github.com/NVIDIA/dcgm-exporter).

**Types of metrics**
+ **GPU Overview:** Displaying GPU utilization, temperatures, power usage, and memory usage. 
+ **Temperature Metrics:** Visualizing GPU temperatures over time. 
+ **Power Usage:** Monitoring GPU power draw and power usage trends. 
+ **Memory Utilization:** Analyzing GPU memory usage including used, free, and total memory. 
+ **Fan Speed:** Showing GPU fan speeds and variations. 
+ **ECC Errors:** Tracking GPU memory ECC errors and pending errors.

**List of metrics**

The following table shows a list of the metrics that provide insights into NVIDIA GPU health and performance, including clock frequencies, temperatures, power usage, memory utilization, fan speeds, and error metrics.


| Metric name | Description | 
| --- | --- | 
|  DCGM\_FI\_DEV\_SM\_CLOCK  | SM clock frequency (in MHz) | 
|  DCGM\_FI\_DEV\_MEM\_CLOCK  | Memory clock frequency (in MHz) | 
|  DCGM\_FI\_DEV\_MEMORY\_TEMP  | Memory temperature (in C) | 
|  DCGM\_FI\_DEV\_GPU\_TEMP  | GPU temperature (in C) | 
|  DCGM\_FI\_DEV\_POWER\_USAGE  | Power draw (in W) | 
|  DCGM\_FI\_DEV\_TOTAL\_ENERGY\_CONSUMPTION  | Total energy consumption since boot (in mJ) | 
|  DCGM\_FI\_DEV\_PCIE\_REPLAY\_COUNTER  | Total number of PCIe retries | 
|  DCGM\_FI\_DEV\_MEM\_COPY\_UTIL  | Memory utilization (in %) | 
|  DCGM\_FI\_DEV\_ENC\_UTIL  | Encoder utilization (in %) | 
|  DCGM\_FI\_DEV\_DEC\_UTIL  | Decoder utilization (in %) | 
|  DCGM\_FI\_DEV\_XID\_ERRORS  | Value of the last XID error encountered | 
|  DCGM\_FI\_DEV\_FB\_FREE  | Frame buffer memory free (in MiB) | 
|  DCGM\_FI\_DEV\_FB\_USED  | Frame buffer memory used (in MiB) | 
|  DCGM\_FI\_DEV\_NVLINK\_BANDWIDTH\_TOTAL  | Total number of NVLink bandwidth counters for all lanes | 
|  DCGM\_FI\_DEV\_VGPU\_LICENSE\_STATUS  | vGPU license status | 
|  DCGM\_FI\_DEV\_UNCORRECTABLE\_REMAPPED\_ROWS  | Number of remapped rows for uncorrectable errors | 
|  DCGM\_FI\_DEV\_CORRECTABLE\_REMAPPED\_ROWS  | Number of remapped rows for correctable errors | 
|  DCGM\_FI\_DEV\_ROW\_REMAP\_FAILURE  | Whether remapping of rows has failed | 

## EFA metrics dashboard
<a name="sagemaker-hyperpod-cluster-observability-slurm-exported-metrics-reference-efa-exporter"></a>

Provides visualized information of the metrics from [Amazon Elastic Fabric Adapter (EFA)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) equipped on P instances collected by the [EFA node exporter](https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/3.efa-node-exporter/README.md).

**Types of metrics**
+ **EFA error metrics:** Visualizing errors such as allocation errors, command errors, and memory map errors.
+ **EFA network traffic:** Monitoring received and transmitted bytes, packets, and work requests.
+ **EFA RDMA performance:** Analyzing RDMA read and write operations, including bytes transferred and error rates.
+ **EFA port lifespan:** Displaying the lifespan of EFA ports over time.
+ **EFA keep-alive packets:** Tracking the number of keep-alive packets received.

**List of metrics**

The following table shows a list of the metrics that provide insights into various aspects of EFA operation, including errors, completed commands, network traffic, and resource utilization.


| Metric name | Description | 
| --- | --- | 
|  node\_amazonefa\_info  | Non-numeric data from /sys/class/infiniband/; value is always 1. | 
|  node\_amazonefa\_lifespan  | Lifespan of the port | 
|  node\_amazonefa\_rdma\_read\_bytes  | Number of bytes read with RDMA | 
|  node\_amazonefa\_rdma\_read\_resp\_bytes  | Number of read response bytes with RDMA | 
|  node\_amazonefa\_rdma\_read\_wr\_err  | Number of RDMA read work request errors | 
|  node\_amazonefa\_rdma\_read\_wrs  | Number of RDMA read work requests | 
|  node\_amazonefa\_rdma\_write\_bytes  | Number of bytes written with RDMA | 
|  node\_amazonefa\_rdma\_write\_recv\_bytes  | Number of RDMA write bytes received | 
|  node\_amazonefa\_rdma\_write\_wr\_err  | Number of RDMA write work request errors | 
|  node\_amazonefa\_rdma\_write\_wrs  | Number of RDMA write work requests | 
|  node\_amazonefa\_recv\_bytes  | Number of bytes received | 
|  node\_amazonefa\_recv\_wrs  | Number of receive work requests | 
|  node\_amazonefa\_rx\_bytes  | Number of bytes received | 
|  node\_amazonefa\_rx\_drops  | Number of packets dropped | 
|  node\_amazonefa\_rx\_pkts  | Number of packets received | 
|  node\_amazonefa\_send\_bytes  | Number of bytes sent | 
|  node\_amazonefa\_send\_wrs  | Number of send work requests | 
|  node\_amazonefa\_tx\_bytes  | Number of bytes transmitted | 
|  node\_amazonefa\_tx\_pkts  | Number of packets transmitted | 
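
In Grafana, panels over these metrics are typically rate-based, because the byte and packet counters are cumulative. For example, a panel showing per-device EFA receive throughput might use a query like the following (an illustrative PromQL sketch, assuming the exporter metric names listed above):

```
rate(node_amazonefa_rx_bytes[5m])
```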

## FSx for Lustre metrics dashboard
<a name="sagemaker-hyperpod-cluster-observability-slurm-exported-metrics-reference-fsx-exporter"></a>

Provides visualizations of the [metrics from your Amazon FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/monitoring-cloudwatch.html) file system, collected by [Amazon CloudWatch](https://docs.aws.amazon.com/fsx/latest/LustreGuide/monitoring-cloudwatch.html).

**Note**  
The Grafana FSx for Lustre dashboard utilizes Amazon CloudWatch as its data source, which differs from the other dashboards that you have configured to use Amazon Managed Service for Prometheus. To ensure accurate monitoring and visualization of metrics related to your FSx for Lustre file system, configure the FSx for Lustre dashboard to use Amazon CloudWatch as the data source, specifying the same AWS Region where your FSx for Lustre file system is deployed.

**Types of metrics**
+ **DataReadBytes:** The number of bytes for file system read operations.
+ **DataWriteBytes:** The number of bytes for file system write operations.
+ **DataReadOperations:** The number of read operations.
+ **DataWriteOperations:** The number of write operations.
+ **MetadataOperations:** The number of metadata operations.
+ **FreeDataStorageCapacity:** The amount of available storage capacity.
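
CloudWatch reports `DataReadBytes` and `DataWriteBytes` as sums over the sampling period, so converting a data point into throughput means dividing by the period length. A minimal sketch of that conversion (the byte count and period below are illustrative sample values, not real output):

```shell
# Convert a DataReadBytes sum over a 60-second period into MiB/s.
# SUM_BYTES stands in for a Sum data point retrieved from CloudWatch.
SUM_BYTES=3145728000
PERIOD=60
awk -v b="$SUM_BYTES" -v p="$PERIOD" \
    'BEGIN { printf "%.1f MiB/s\n", b / p / 1048576 }'
```

In practice, you would feed values retrieved with `aws cloudwatch get-metric-statistics` against the `AWS/FSx` namespace into this kind of conversion.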

# Amazon SageMaker HyperPod Slurm metrics
<a name="smcluster-slurm-metrics"></a>

Amazon SageMaker HyperPod provides a set of Amazon CloudWatch metrics that you can use to monitor the health and performance of your HyperPod clusters. These metrics are collected from the Slurm workload manager running on your HyperPod clusters and are available in the `/aws/sagemaker/Clusters` CloudWatch namespace.

## Cluster level metrics
<a name="smcluster-slurm-metrics-cluster"></a>

The following cluster-level metrics are available for HyperPod. These metrics use the `ClusterId` dimension to identify the specific HyperPod cluster.


| CloudWatch metric name | Notes | Amazon EKS Container Insights metric name | 
| --- | --- | --- | 
| cluster\_node\_count | Total number of nodes in the cluster | cluster\_node\_count | 
| cluster\_idle\_node\_count | Number of idle nodes in the cluster | N/A | 
| cluster\_failed\_node\_count | Number of failed nodes in the cluster | cluster\_failed\_node\_count | 
| cluster\_cpu\_count | Total CPU cores in the cluster | node\_cpu\_limit | 
| cluster\_idle\_cpu\_count | Number of idle CPU cores in the cluster | N/A | 
| cluster\_gpu\_count | Total GPUs in the cluster | node\_gpu\_limit | 
| cluster\_idle\_gpu\_count | Number of idle GPUs in the cluster | N/A | 
| cluster\_running\_task\_count | Number of running Slurm jobs in the cluster | N/A | 
| cluster\_pending\_task\_count | Number of pending Slurm jobs in the cluster | N/A | 
| cluster\_preempted\_task\_count | Number of preempted Slurm jobs in the cluster | N/A | 
| cluster\_avg\_task\_wait\_time | Average wait time for Slurm jobs in the cluster | N/A | 
| cluster\_max\_task\_wait\_time | Maximum wait time for Slurm jobs in the cluster | N/A | 

## Instance level metrics
<a name="smcluster-slurm-metrics-instance"></a>

The following instance-level metrics are available for HyperPod. These metrics also use the `ClusterId` dimension to identify the specific HyperPod cluster.


| CloudWatch metric name | Notes | Amazon EKS Container Insights metric name | 
| --- | --- | --- | 
| node\_gpu\_utilization | Average GPU utilization across all instances | node\_gpu\_utilization | 
| node\_gpu\_memory\_utilization | Average GPU memory utilization across all instances | node\_gpu\_memory\_utilization | 
| node\_cpu\_utilization | Average CPU utilization across all instances | node\_cpu\_utilization | 
| node\_memory\_utilization | Average memory utilization across all instances | node\_memory\_utilization | 

# SageMaker HyperPod cluster resiliency
<a name="sagemaker-hyperpod-resiliency-slurm"></a>

SageMaker HyperPod through Slurm orchestration provides the following cluster resiliency features.

**Topics**
+ [Health monitoring agent](sagemaker-hyperpod-resiliency-slurm-cluster-health-check.md)
+ [Automatic node recovery and auto-resume](sagemaker-hyperpod-resiliency-slurm-auto-resume.md)
+ [Manually replace or reboot a node using Slurm](sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance.md)

# Health monitoring agent
<a name="sagemaker-hyperpod-resiliency-slurm-cluster-health-check"></a>

This section describes the set of health checks that SageMaker HyperPod uses to regularly monitor cluster instance health for issues with devices such as accelerators (GPU and Trainium cores) and networking (EFA). The SageMaker HyperPod health monitoring agent (HMA) continuously monitors the health status of each GPU-based or Trainium-based instance. When it detects an instance or GPU failure, the agent marks the instance as unhealthy.

SageMaker HyperPod HMA performs the same health checks for both EKS and Slurm orchestrators. For more information about HMA, see [Health Monitoring System](sagemaker-hyperpod-eks-resiliency-health-monitoring-agent.md).

# Automatic node recovery and auto-resume
<a name="sagemaker-hyperpod-resiliency-slurm-auto-resume"></a>

**Note**  
As of September 11, 2025, HyperPod with Slurm orchestration now supports health monitoring agents. Run [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) and update to the latest version of the AMI in order to use this functionality.

This section describes Amazon SageMaker HyperPod's two complementary resilience features: automatic node recovery, which replaces faulty infrastructure without manual intervention, and auto-resume, which restarts training jobs from the last checkpoint after hardware failures.

## How automatic node recovery works
<a name="sagemaker-hyperpod-resiliency-slurm-auto-resume-how"></a>

During cluster creation or update, cluster admin users can select the node (instance) recovery option between `Automatic` (Recommended) and `None` at the cluster level. If set to `Automatic`, SageMaker HyperPod reboots or replaces faulty nodes automatically. 

**Important**  
We recommend setting the `Automatic` option. By default, the clusters are set up with Automatic node recovery.

Automatic node recovery runs when issues are found by the health monitoring agent, basic health checks, and deep health checks. If set to `None`, the health monitoring agent labels the instances when a fault is detected, but it does not automatically initiate any repair or recovery actions on the affected nodes. We do not recommend this option.
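
In `CreateCluster` and `UpdateCluster` API requests, this option corresponds to the `NodeRecovery` field. An abbreviated, illustrative request fragment (other required fields omitted):

```
{
    "ClusterName": "my-training-cluster",
    "NodeRecovery": "Automatic"
}
```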

## Running a training job with the Amazon SageMaker HyperPod auto-resume functionality
<a name="sagemaker-hyperpod-resiliency-slurm-auto-resume-job"></a>

This section describes how to run a training job with the SageMaker HyperPod auto-resume functionality, which provides a zero-touch resiliency infrastructure to automatically recover a training job from the last saved checkpoint in the event of a hardware failure.

With the auto-resume functionality, if a job fails due to a hardware failure or any transient issues in-between training, SageMaker HyperPod auto-resume starts the node replacement workflow and restarts the job after the faulty nodes are replaced. The following hardware checks are run whenever a job fails while using auto-resume:


| Category | Utility name | Instance type compatibility | Description | 
| --- | --- | --- | --- | 
| Accelerator | NVIDIA SMI | GPU | [nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface) utility is a well-known CLI to manage and monitor GPUs. The built-in health checker parses the output from nvidia-smi to determine the health of the instance. | 
| Accelerator | Neuron sysfs | Trainium | For Trainium-powered instances, the health of the Neuron devices is determined by reading counters from [Neuron sysfs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-sysfs-user-guide.html) propagated directly by the Neuron driver. | 
| Network | EFA | GPU and Trainium | To aid in the diagnosis of Elastic Fabric Adapter (EFA) devices, the EFA health checker runs a series of connectivity tests using all available EFA cards within the instance. | 

**Note**  
When [Generic Resources (GRES)](https://slurm.schedmd.com/gres.html) are attached to a Slurm node, Slurm typically doesn't permit changes in the node allocation, such as replacing nodes, and thus doesn't allow a failed job to resume. Unless explicitly forbidden, the HyperPod auto-resume functionality automatically re-queues any faulty job associated with GRES-enabled nodes. This process involves stopping the job, placing it back into the job queue, and then restarting the job from the beginning.
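
For reference, a GRES-enabled compute node is one declared with a `Gres` specification in `slurm.conf`, along the lines of the following illustrative fragment (node name and counts are placeholders):

```
GresTypes=gpu
NodeName=ip-10-0-0-1 Gres=gpu:8 CPUs=96
```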

**Using the SageMaker HyperPod auto-resume functionality with Slurm**

When you use SageMaker HyperPod auto-resume with Slurm, you should run the job inside an exclusive allocation acquired using either `salloc` or `sbatch`. In either case, you need to modify the entrypoint script so that all setup steps run in a single `srun` command when resuming the job. Through the entrypoint script, it is important to set up the environment on the replaced node so that it is consistent with the environment the job step was running in before it was stopped. The following procedure shows how to prepare an entrypoint script that keeps the environment consistent and runs it as a single `srun` command.

**Tip**  
If you use `sbatch`, you can keep the batch script simple by creating a separate script for setting up the environment and using a single `srun` command.

1. Create a script using the following code example and save it as `train_auto_resume.sh`. This script sets up the training environment, assuming that no manual configuration was previously made on the replaced node. This ensures that the environment is node-agnostic, so that when a node is replaced, the same environment is provisioned on the node before resuming the job.
**Note**  
The following code example shows how to discover the Slurm node list associated with the job. Do not use the `$SLURM_JOB_NODELIST` environment variable provided by Slurm, because its value might be outdated after SageMaker HyperPod auto-resumes the job. The following code example shows how to define a new `NODE_LIST` variable to replace `SLURM_JOB_NODELIST`, and then set up the `MASTER_NODE` and `MASTER_ADDR` variables off of the `NODE_LIST` variable.

   ```
   #!/bin/bash
   
   # Filename: train_auto_resume.sh
   # Sample containerized script to launch a training job with a single srun which can be auto-resumed.
   
   # Place your training environment setup here. 
   # Example: Install conda, docker, activate virtual env, etc.
   
   # Get the list of nodes for a given job: show the job details, extract
   # the NodeList field, and exclude nodes marked as excluded.
   NODE_LIST=$(scontrol show jobid=$SLURM_JOBID | \
               awk -F= '/NodeList=/{print $2}' | \
               grep -v Exc)
   
   # Determine the master node: convert the node list to hostnames and
   # select the first hostname as the master node.
   MASTER_NODE=$(scontrol show hostname $NODE_LIST | \
                 head -n 1)
   
   # Get the master node address: show the node information, extract the
   # NodeAddr field, and keep its first part.
   MASTER_ADDR=$(scontrol show node=$MASTER_NODE | \
                 awk -F= '/NodeAddr=/{print $2}' | \
                 awk '{print $1}')
   
   
   # Torchrun command to launch the training job
   torchrun_cmd="torchrun --nnodes=$SLURM_NNODES \
                          --nproc_per_node=1 \
                          --node_rank=$SLURM_NODEID \
                          --master_addr=$MASTER_ADDR \
                          --master_port=1234 \
                          <your_training_script.py>"
   
   # Execute the torchrun command in the 'pytorch' Conda environment, 
   # streaming output live
   /opt/conda/bin/conda run --live-stream -n pytorch $torchrun_cmd
   ```
**Tip**  
You can use the preceding script to add more commands for installing any additional dependencies for your job. However, we recommend that you keep the dependency installation scripts to the [set of lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md) that are used during cluster creation. If you use a virtual environment hosted on a shared directory, you can also utilize this script to activate the virtual environment.

1. Launch the job with SageMaker HyperPod auto-resume enabled by adding the flag `--auto-resume=1` to indicate that the `srun` command should be automatically retried in case of hardware failure. 
**Note**  
If you have set up a resource allocation using `sbatch` or `salloc`, you can run multiple `srun` commands within the allocation. In the event of a failure, the SageMaker HyperPod auto-resume functionality only operates in the current [job step](https://slurm.schedmd.com/job_launch.html#step_allocation) of the `srun` command with the flag `--auto-resume=1`. In other words, activating auto-resume in an `srun` command doesn't apply to other `srun` commands launched within a resource allocation session.

   The following are `srun` command examples with `auto-resume` enabled.

   **Using sbatch**

   Because most of the logic for setting up the environment is already in `train_auto_resume.sh`, the batch script should be simple and similar to the following code example. Assume that the following batch script is saved as `batch.sh`.

   ```
   #!/bin/bash
   #SBATCH --nodes 2
   #SBATCH --exclusive
   srun --auto-resume=1 train_auto_resume.sh
   ```

   Run the preceding batch script using the following command.

   ```
   sbatch batch.sh
   ```

   **Using salloc**

   Start by acquiring an exclusive allocation, and run the `srun` command with the `--auto-resume` flag and the entrypoint script.

   ```
   salloc -N 2 --exclusive
   srun --auto-resume=1 train_auto_resume.sh
   ```
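
Because auto-resume restarts the job step from your last saved checkpoint, the entrypoint script typically needs to pick up the newest checkpoint on shared storage before launching training. A minimal, hypothetical sketch of that selection logic follows; the directory layout and file names are assumptions for illustration, not a HyperPod API:

```shell
# Return the lexicographically newest entry in a checkpoint directory,
# so a resumed job continues from it instead of starting over.
latest_checkpoint() {
  ls -1 "$1" 2>/dev/null | sort | tail -n 1
}

# Demo with a throwaway directory standing in for shared storage (such as
# an FSx for Lustre mount); zero-padded step numbers keep the sort correct.
CKPT_DIR=$(mktemp -d)
touch "$CKPT_DIR/step-000100" "$CKPT_DIR/step-000200"
latest_checkpoint "$CKPT_DIR"
```

Your training script would then receive the selected checkpoint path, for example through a `--resume-from` argument of your own design.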

## How automatic node recovery and auto-resume work together
<a name="sagemaker-hyperpod-resiliency-slurm-auto-resume-node-recovery"></a>

When both automatic node recovery and auto-resume are active, they follow a coordinated approach to handling failures. If the HMA detects a hardware fault, the node is marked for drain regardless of job-level status. With automatic node recovery enabled, the nodes are automatically replaced once all jobs running on the nodes exit. In this scenario, for jobs with auto-resume enabled, if a job step exits with a non-zero status, auto-resume takes effect and the jobs resume once the nodes are replaced. Jobs without auto-resume enabled simply exit, requiring manual resubmission by administrators or users.

**Note**  
If you use auto-resume, the nodes are always replaced (no reboots) when hardware failures are detected.

# Manually replace or reboot a node using Slurm
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance"></a>

This section describes when you should manually reboot or replace a node, with instructions on how to do both.

## When to manually reboot or replace a node
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-when"></a>

The HyperPod auto-resume functionality monitors if the state of your Slurm nodes turns to `fail` or `down`. You can check the state of Slurm nodes by running `sinfo`.

If a node remains stuck or unresponsive and the auto-resume process does not recover it, you can manually initiate recovery. The choice between rebooting and replacing a node depends on the nature of the issue. Consider rebooting when facing temporary or software-related problems, such as system hangs, memory leaks, GPU driver issues, kernel updates, or hung processes. However, if you encounter persistent or hardware-related problems like failing GPUs, memory or networking faults, repeated health check failures, or nodes that remain unresponsive after multiple reboot attempts, node replacement is the more appropriate solution.
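
To narrow `sinfo` output down to problem nodes, you can filter on the short state column. The following sketch wraps that filter in a helper function; the sample input stands in for `sinfo -N -h -o "%N %t"` output, which you would pipe in on a real cluster:

```shell
# Print the names of nodes whose short Slurm state suggests a problem
# (down, fail, or drain variants, including states suffixed with * or ~).
flag_bad_nodes() {
  awk '$2 ~ /down|fail|drain/ { print $1 }'
}

# Sample input standing in for: sinfo -N -h -o "%N %t"
printf 'ip-10-0-0-1 idle\nip-10-0-0-2 fail*\n' | flag_bad_nodes
```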

## Ways to manually reboot or replace nodes
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-ways"></a>

SageMaker HyperPod offers two methods for manual node recovery. The preferred approach is using the SageMaker HyperPod reboot and replace APIs, which provide a faster and more transparent recovery process that works across all orchestrators. Alternatively, you can use traditional Slurm commands like `scontrol update`, though this legacy method requires direct access to the Slurm controller node. Both methods activate the same SageMaker HyperPod recovery processes.

## Manually reboot a node using reboot API
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-reboot-api"></a>

You can use the **BatchRebootClusterNodes** API to manually reboot a faulty node in your SageMaker HyperPod cluster.

Here is an example of running the reboot operation on two instances of a cluster using the AWS Command Line Interface (AWS CLI):

```
aws sagemaker batch-reboot-cluster-nodes \
    --cluster-name arn:aws:sagemaker:ap-northeast-1:123456789:cluster/test-cluster \
    --node-ids i-0123456789abcdef0 i-0fedcba9876543210
```
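
After submitting a reboot request, you can poll node status until every node returns to `Running`. The following sketch filters not-yet-running nodes from tab-separated `instance-id status` lines; in practice you would produce such lines by projecting `aws sagemaker list-cluster-nodes` output (the sample data below is illustrative):

```shell
# Print instance IDs whose status is anything other than Running.
not_running() {
  awk -F '\t' '$2 != "Running" { print $1 }'
}

# Sample input standing in for projected list-cluster-nodes output.
printf 'i-0123456789abcdef0\tRunning\ni-0fedcba9876543210\tPending\n' | not_running
```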

## Manually replace a node using replace API
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-replace-api"></a>

You can use the **BatchReplaceClusterNodes** API to manually replace a faulty node in your SageMaker HyperPod cluster.

Here is an example of running the replace operation on two instances of a cluster using the AWS Command Line Interface (AWS CLI):

```
aws sagemaker batch-replace-cluster-nodes \
    --cluster-name arn:aws:sagemaker:ap-northeast-1:123456789:cluster/test-cluster \
    --node-ids i-0123456789abcdef0 i-0fedcba9876543210
```

## Manually reboot a node using Slurm
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-reboot"></a>

You can also use `scontrol` Slurm commands to trigger node recovery. These commands interact directly with the Slurm control plane and invoke the same underlying SageMaker HyperPod recovery mechanisms.

In the following command, replace `<ip-ipv4>` with the Slurm node name (host name) of the faulty instance you want to reboot.

```
scontrol update node=<ip-ipv4> state=fail reason="Action:Reboot"
```

This marks the node as FAIL with the specified reason. SageMaker HyperPod detects this and reboots the instance. Avoid changing the node state or restarting the Slurm controller during the operation.

## Manually replace a node using Slurm
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-replace"></a>

You can use the `scontrol update` command as follows to replace a node.

In the following command, replace `<ip-ipv4>` with the Slurm node name (host name) of the faulty instance you want to replace.

```
scontrol update node=<ip-ipv4> state=fail reason="Action:Replace"
```

After running this command, the node goes into the `fail` state, waits for the currently running jobs to finish, is replaced with a healthy instance, and is recovered with the same host name. This process takes time depending on the available instances in your Availability Zone and the time it takes to run your lifecycle scripts. During the update and replacement processes, avoid changing the state of the node manually again or restarting the Slurm controller; doing so can lead to a replacement failure. If the node does not recover or return to the `idle` state after a long time, contact [AWS Support](https://console.aws.amazon.com/support/).

## Manually force change a node
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-force"></a>

If the faulty node is continuously stuck in the `fail` state, as a last resort you can manually force the node state to `down`. This requires administrator privileges (sudo permissions).

**Warning**  
Proceed carefully before you run the following command, as it forcefully kills all jobs, and you might lose all unsaved work.

```
scontrol update node=<ip-ipv4> state=down reason="Action:Replace"
```

# Continuous provisioning for enhanced cluster operations with Slurm
<a name="sagemaker-hyperpod-scaling-slurm"></a>

Amazon SageMaker HyperPod clusters created with Slurm orchestration now support continuous provisioning, a capability that enables greater flexibility and efficiency when running large-scale AI/ML workloads. Continuous provisioning lets you start training quickly, scale seamlessly, perform maintenance without disrupting operations, and have granular visibility into cluster operations.

**Note**  
Continuous provisioning is available as an optional configuration for new HyperPod clusters created with Slurm orchestration. Existing clusters using the previous scaling model cannot be migrated to continuous provisioning at this time.

## How it works
<a name="sagemaker-hyperpod-scaling-slurm-how"></a>

The continuous provisioning system introduces a desired-state architecture that replaces the traditional all-or-nothing scaling model. In the previous model, if any instance group could not be fully provisioned, the entire cluster creation or update operation failed and rolled back. With continuous provisioning, the system accepts partial capacity and continues to provision remaining instances asynchronously.

The continuous provisioning system:
+ **Accepts the request**: Records the target instance count for each instance group.
+ **Initiates provisioning**: Begins launching instances for all instance groups in parallel.
+ **Provisions priority nodes first**: The cluster transitions to `InService` after at least one controller node (and one login node, if a login instance group is specified) is successfully provisioned.
+ **Tracks progress**: Monitors each instance launch attempt and records the status.
+ **Handles failures**: Automatically retries failed launches for worker nodes asynchronously.

Continuous provisioning is disabled by default. To use this feature, set `NodeProvisioningMode` to `Continuous` in your `CreateCluster` request.

With continuous provisioning enabled, you can initiate multiple scaling operations simultaneously without waiting for previous operations to complete. This lets you scale different instance groups in the same cluster concurrently and submit multiple scaling requests to the same instance group.

## Priority-based provisioning
<a name="sagemaker-hyperpod-scaling-slurm-priority"></a>

Slurm clusters require a controller node to be operational before worker nodes can register and accept jobs. Continuous provisioning handles this automatically through priority-based provisioning:

1. The controller instance group is provisioned first.

1. Once one controller node is healthy, login nodes and worker nodes begin provisioning in parallel.

1. The cluster transitions to `InService` when one controller node is up and one login node is up (if a login instance group is specified). If no login instance group is specified, the cluster transitions to `InService` as soon as the controller node is provisioned.

1. Worker nodes that cannot be immediately provisioned due to capacity constraints enter an asynchronous retry loop and are added to the Slurm cluster automatically as they become available.

## Controller failure handling
<a name="sagemaker-hyperpod-scaling-slurm-controller-failure"></a>

During cluster creation, if the controller node fails to provision, the behavior depends on whether the error is retryable or non-retryable.

**Retryable errors** (for example, unhealthy instance or transient failures):
+ HyperPod continuously replaces the instance and retries provisioning until the controller comes up.
+ Worker and login nodes that have already been provisioned remain available, but the cluster does not transition to `InService` until the controller is healthy.

**Non-retryable errors** (for example, no capacity available for the controller instance type or lifecycle script failure):
+ The cluster is marked as `Failed`.
+ You are notified of the failure reason and must take corrective action, such as choosing a different instance type, fixing lifecycle scripts, or retrying in a different Availability Zone.

## Prerequisites
<a name="sagemaker-hyperpod-scaling-slurm-prerequisites"></a>

Continuous provisioning requires that Slurm provisioning parameters (node types, partition names) are provided via the API payload in each instance group's `SlurmConfig` field. Clusters that rely on the legacy `provisioning_parameters.json` file in Amazon S3 are not compatible with continuous provisioning.

**Note**  
The following features are not currently supported with continuous provisioning on Slurm clusters: migration of existing clusters, multi-head node configuration via API-based Slurm topology, and `SlurmConfigStrategy`. Continuous provisioning operates exclusively in merge mode for `slurm.conf` management.

## Usage metering
<a name="sagemaker-hyperpod-scaling-slurm-metering"></a>

HyperPod clusters with continuous provisioning use instance-level metering to provide accurate billing that reflects actual resource usage. This metering approach differs from traditional cluster-level billing by tracking each instance independently.

**Instance-level billing**

With continuous provisioning, billing starts and stops at the individual instance level rather than waiting for cluster-level state changes. This approach provides the following benefits:
+ **Precise billing accuracy**: Billing starts when the lifecycle script execution begins. If the lifecycle script fails, the instance provision will be retried and you are charged for the duration of the lifecycle script runtime.
+ **Independent metering**: Each instance's billing lifecycle is managed separately, preventing cascading billing errors.
+ **Real-time billing updates**: Billing starts when an instance begins executing its lifecycle configuration script and stops when the instance enters a terminating state.

**Billing lifecycle**

Each instance in your HyperPod cluster follows this billing lifecycle:
+ **Billing starts**: When the instance successfully launches and begins executing its lifecycle configuration script.
+ **Billing continues**: Throughout the instance's operational lifetime.
+ **Billing stops**: When the instance enters a terminating state, regardless of the reason for termination.

**Note**  
Billing does not start for instances that fail to launch. If an instance launch fails due to insufficient capacity or other issues, you are not charged for that failed attempt. Billing is calculated at the instance level and costs are aggregated and reported under your cluster's Amazon Resource Name (ARN).

## Create a cluster with continuous provisioning enabled
<a name="sagemaker-hyperpod-scaling-slurm-create"></a>

**Note**  
Prepare a lifecycle configuration script and upload it to an Amazon S3 bucket that your execution role can access. For more information, see [SageMaker HyperPod Slurm cluster operations](sagemaker-hyperpod-operate-slurm.md).

Prepare a `CreateCluster` API request file in JSON format. Set `NodeProvisioningMode` to `Continuous` and provide Slurm topology information in each instance group's `SlurmConfig` field.

```
// create_cluster.json
{
    "ClusterName": "my-training-cluster",
    "NodeProvisioningMode": "Continuous",
    "Orchestrator": {
        "Slurm": {}
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "controller-group",
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/src/",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster",
            "SlurmConfig": {
                "NodeType": "Controller"
            }
        },
        {
            "InstanceGroupName": "login-group",
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/src/",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster",
            "SlurmConfig": {
                "NodeType": "Login"
            }
        },
        {
            "InstanceGroupName": "worker-gpu-a",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 16,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/src/",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster",
            "SlurmConfig": {
                "NodeType": "Compute",
                "PartitionNames": ["gpu-training"]
            }
        }
    ],
    "VpcConfig": {
        "SecurityGroupIds": ["sg-12345678"],
        "Subnets": ["subnet-12345678"]
    }
}
```

Run the `create-cluster` command to submit the request.

```
aws sagemaker create-cluster \
    --cli-input-json file://complete/path/to/create_cluster.json
```

This returns the ARN of the new cluster.

```
{
    "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/abcde12345"
}
```

## Slurm configuration management
<a name="sagemaker-hyperpod-scaling-slurm-config"></a>

Continuous provisioning operates exclusively in merge mode for `slurm.conf` partition management. In merge mode, HyperPod applies its partition configuration changes additively on top of whatever you have modified in `slurm.conf`. HyperPod only updates the partition-related sections of `slurm.conf` (such as partition name and node name entries); other Slurm configuration parameters are not modified. This means:
+ Your manual edits to `slurm.conf` are preserved.
+ There is no automated drift detection or resolution of conflicts between your modifications and HyperPod's expected state.

The `SlurmConfigStrategy` parameter (`Managed`, `Merge`, `Overwrite`) is not supported with continuous provisioning. Passing any `SlurmConfigStrategy` value results in an API error.

# SageMaker HyperPod cluster management
<a name="sagemaker-hyperpod-cluster-management-slurm"></a>

The following topics discuss logging and managing SageMaker HyperPod clusters.

## Logging SageMaker HyperPod events
<a name="sagemaker-hyperpod-cluster-management-slurm-logging-hyperpod-events"></a>

All events and logs from SageMaker HyperPod are saved to Amazon CloudWatch under the log group name `/aws/sagemaker/Clusters/[ClusterName]/[ClusterID]`. Every call to the `CreateCluster` API creates a new log group. The following list contains all of the available log streams collected in each log group.


| Log Group Name | Log Stream Name | 
| --- | --- | 
| /aws/sagemaker/Clusters/[ClusterName]/[ClusterID] | LifecycleConfig/[instance-group-name]/[instance-id] | 

## Logging SageMaker HyperPod at instance level
<a name="sagemaker-hyperpod-cluster-management-slurm-logging-at-instance-level"></a>

You can access the LifecycleScript logs published to CloudWatch during cluster instance configuration. Every instance within the created cluster generates a separate log stream, distinguishable by the `LifecycleConfig/[instance-group-name]/[instance-id]` format. 

All logs that are written to `/var/log/provision/provisioning.log` are uploaded to the preceding CloudWatch stream. The sample LifecycleScripts at [https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config) redirect their `stdout` and `stderr` to this location. If you use your own custom scripts, write your logs to `/var/log/provision/provisioning.log` so that they are available in CloudWatch.
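
If your custom entrypoint doesn't already do this, a minimal sketch (the script and log messages are hypothetical) is to append all of its output to that path:

```
#!/bin/bash
# Hypothetical custom lifecycle entrypoint: append all output to the file
# that the CloudWatch agent on HyperPod instances tails by default.
# Lifecycle scripts run as root on the cluster instances.
PROVISION_LOG=/var/log/provision/provisioning.log
mkdir -p "$(dirname "$PROVISION_LOG")"
{
    echo "[custom-lifecycle] starting provisioning steps"
    # ... your provisioning commands here ...
    echo "[custom-lifecycle] provisioning steps finished"
} >> "$PROVISION_LOG" 2>&1
```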

**Lifecycle script log markers**

CloudWatch logs for lifecycle scripts include specific markers to help you track execution progress and identify issues:


| Marker | Description | 
| --- | --- | 
| START | Indicates the beginning of lifecycle script logs for the instance | 
| [SageMaker] Lifecycle scripts were provided, with S3 uri: [s3://bucket-name/] and entrypoint script: [script-name.sh] | Indicates the S3 location and entrypoint script that will be used | 
| [SageMaker] Downloading lifecycle scripts | Indicates scripts are being downloaded from the specified S3 location | 
| [SageMaker] Lifecycle scripts have been downloaded | Indicates scripts have been successfully downloaded from S3 | 
| [SageMaker] The lifecycle scripts succeeded | Indicates successful completion of all lifecycle scripts | 
| [SageMaker] The lifecycle scripts failed | Indicates failed execution of lifecycle scripts | 

These markers help you quickly identify where in the lifecycle script execution process an issue occurred. When troubleshooting failures, review the log entries to identify where the process stopped or failed.
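
For illustration, given a downloaded copy of an instance's provisioning log (the sample contents below are hypothetical), printing the last SageMaker marker shows how far lifecycle execution progressed:

```
# Illustrative only: write a sample log, then print the last marker line.
cat > /tmp/provisioning-sample.log <<'EOF'
START
[SageMaker] Downloading lifecycle scripts
[SageMaker] Lifecycle scripts have been downloaded
[SageMaker] The lifecycle scripts failed
EOF
grep '\[SageMaker\]' /tmp/provisioning-sample.log | tail -n 1
```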

**Lifecycle script failure messages**

If a lifecycle script fails during execution, you receive an error message that includes the CloudWatch log group name and log stream name. If lifecycle scripts fail across multiple instances, the error message indicates only one failed instance, but the log group contains streams for all instances.

You can view the error message by running the [DescribeCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeCluster.html) API or by viewing the cluster details page in the SageMaker console. In the console, a **View lifecycle script logs** button is provided that navigates directly to the CloudWatch log stream. The error message has the following format:

```
Instance [instance-id] failed to provision with the following error: "Lifecycle scripts did not run successfully. To view lifecycle script logs,
visit log group '/aws/sagemaker/Clusters/[cluster-name]/[cluster-id]' and log stream 'LifecycleConfig/[instance-group-name]/[instance-id]'.
If you cannot find corresponding lifecycle script logs in CloudWatch, please make sure you follow one of the options here:
https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-faq-slurm.html#hyperpod-faqs-q1." Note that multiple instances may be impacted.
```

## Tagging resources
<a name="sagemaker-hyperpod-cluster-management-slurm-tagging"></a>

The AWS tagging system helps you manage, identify, organize, search for, and filter resources. SageMaker HyperPod supports tagging, so you can manage clusters as AWS resources. You can add or edit tags for a cluster when you create it or when you edit an existing cluster. To learn more about tagging in general, see [Tagging your AWS resources](https://docs.aws.amazon.com/tag-editor/latest/userguide/tagging.html).

### Using the SageMaker HyperPod console UI
<a name="sagemaker-hyperpod-cluster-management-slurm-tagging-in-console"></a>

When you are [creating a new cluster](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-create-cluster) and [editing a cluster](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-edit-clusters), you can add, remove, or edit tags.

### Using the SageMaker HyperPod APIs
<a name="sagemaker-hyperpod-cluster-management-slurm-tagging-in-api-request"></a>

When you write a [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) or [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) API request file in JSON format, edit the `Tags` section.
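
For example, a partial request body with a `Tags` section might look like the following (the cluster name, tag keys, and values are placeholders):

```
{
    "ClusterName": "my-cluster",
    "Tags": [
        { "Key": "team", "Value": "ml-research" },
        { "Key": "cost-center", "Value": "1234" }
    ]
}
```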

### Using the AWS CLI tagging commands for SageMaker AI
<a name="sagemaker-hyperpod-cluster-management-slurm-tagging-using-cli"></a>

**To tag a cluster**

Use [https://docs.aws.amazon.com/cli/latest/reference/sagemaker/add-tags.html](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/add-tags.html) as follows.

```
aws sagemaker add-tags --resource-arn cluster_ARN --tags Key=string,Value=string
```

**To untag a cluster**

Use [https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-tags.html](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-tags.html) as follows.

```
aws sagemaker delete-tags --resource-arn cluster_ARN --tag-keys "tag_key"
```

**To list tags for a resource**

Use [https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-tags.html](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-tags.html) as follows.

```
aws sagemaker list-tags --resource-arn cluster_ARN
```

# SageMaker HyperPod FAQs
<a name="sagemaker-hyperpod-faq-slurm"></a>

Use the following frequently asked questions to troubleshoot problems with using SageMaker HyperPod.

**Topics**
+ [

## Why can't I find log groups of my SageMaker HyperPod cluster in Amazon CloudWatch?
](#hyperpod-faqs-q1)
+ [

## What particular configurations does HyperPod manage in Slurm configuration files such as `slurm.conf` and `gres.conf`?
](#hyperpod-faqs-q2)
+ [

## How do I run Docker on Slurm nodes on HyperPod?
](#hyperpod-faqs-q3)
+ [

## Why does my parallel training job fail when I use NVIDIA Collective Communications Library (NCCL) with Slurm on SageMaker HyperPod platform?
](#hyperpod-faqs-q4)
+ [

## How do I use local NVMe store of P instances for launching Docker or Enroot containers with Slurm?
](#hyperpod-faqs-q5)
+ [

## How do I set up EFA security groups?
](#hyperpod-faqs-q6)
+ [

## How do I monitor my HyperPod cluster nodes? Are there any CloudWatch metrics exported from HyperPod?
](#hyperpod-faqs-q7)
+ [

## Can I add additional storage to the HyperPod cluster nodes? The cluster instances have limited local instance store.
](#hyperpod-faqs-q8)
+ [

## Why are my compute nodes showing as "DOWN" or "DRAINED" after a reboot?
](#hyperpod-faqs-q9)
+ [

## Why do my nodes keep getting drained due to Out of Memory (OOM) issues?
](#hyperpod-faqs-q10)
+ [

## How can I ensure resources are properly cleaned up after jobs complete?
](#hyperpod-faqs-q11)

## Why can't I find log groups of my SageMaker HyperPod cluster in Amazon CloudWatch?
<a name="hyperpod-faqs-q1"></a>

By default, agent logs and instance start-up logs are sent to the HyperPod platform account's CloudWatch. For user lifecycle scripts, lifecycle configuration logs are sent to your account's CloudWatch.

If you use the [sample lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md) provided by the HyperPod service team, you can expect to find the lifecycle configuration logs written to `/var/log/provision/provisioning.log`, and you wouldn’t encounter this problem.

However, if you use custom paths for collecting logs from lifecycle provisioning and can't find the log groups in your account's CloudWatch, the cause might be a mismatch between the log file paths specified in your lifecycle scripts and the paths that the CloudWatch agent running on the HyperPod cluster instances looks for. In this case, you need to set up your lifecycle scripts to send logs to the CloudWatch agent, and also configure the CloudWatch agent accordingly. To resolve the problem, choose one of the following options.
+ **Option 1:** Update your lifecycle scripts to write logs to `/var/log/provision/provisioning.log`.
+ **Option 2:** Update the CloudWatch agent to look for your custom paths for logging lifecycle provisioning.

  1. Each HyperPod cluster instance contains a CloudWatch agent configuration file in JSON format at `/opt/aws/amazon-cloudwatch-agent/sagemaker_cwagent_config.json`. In the configuration file, find the field name `logs.logs_collected.files.collect_list.file_path`. With the default setup by HyperPod, the key-value pair should be `"file_path": "/var/log/provision/provisioning.log"` as documented at [Logging SageMaker HyperPod at instance level](sagemaker-hyperpod-cluster-management-slurm.md#sagemaker-hyperpod-cluster-management-slurm-logging-at-instance-level). The following code snippet shows how the JSON file looks with the HyperPod default configuration.

     ```
     "logs": {
         "logs_collected": {
             "files": {
                 "collect_list": [
                     {
                         "file_path": "/var/log/provision/provisioning.log",
                         "log_group_name": "/aws/sagemaker/Clusters/[ClusterName]/[ClusterID]",
                         "log_stream_name": "LifecycleConfig/[InstanceGroupName]/{instance_id}",
                         "retention_in_days": -1
                     }
                 ]
             }
         },
         "force_flush_interval": 3
     }
     ```

  1. Replace the value for the `"file_path"` field name with the custom path you use in your lifecycle scripts. For example, if you have set up your lifecycle scripts to write to `/var/log/custom-provision/custom-provisioning.log`, update the value to match with it as follows.

     ```
     "file_path": "/var/log/custom-provision/custom-provisioning.log"
     ```

  1. Restart the CloudWatch agent with the configuration file to finish applying the custom path. For example, the following CloudWatch command shows how to restart the CloudWatch agent with the CloudWatch agent configuration file from step 1. For more information, see also [Troubleshooting the CloudWatch agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/troubleshooting-CloudWatch-Agent.html).

     ```
     sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
         -a fetch-config -m ec2 -s -c \
         file:/opt/aws/amazon-cloudwatch-agent/sagemaker_cwagent_config.json
     ```

## What particular configurations does HyperPod manage in Slurm configuration files such as `slurm.conf` and `gres.conf`?
<a name="hyperpod-faqs-q2"></a>

When you create a Slurm cluster on HyperPod, the HyperPod agent sets up the [https://slurm.schedmd.com/slurm.conf.html](https://slurm.schedmd.com/slurm.conf.html) and [https://slurm.schedmd.com/gres.conf.html](https://slurm.schedmd.com/gres.conf.html) files at `/opt/slurm/etc/` to manage the Slurm cluster based on your HyperPod cluster creation request and lifecycle scripts. The following list shows what specific parameters the HyperPod agent handles and overwrites. 

**Important**  
We strongly recommend that you DON’T change these parameters managed by HyperPod.
+ In [https://slurm.schedmd.com/slurm.conf.html](https://slurm.schedmd.com/slurm.conf.html), HyperPod sets up the following basic parameters: `ClusterName`, `SlurmctldHost`, `PartitionName`, and `NodeName`.

  Also, to enable the [Automatic node recovery and auto-resume](sagemaker-hyperpod-resiliency-slurm-auto-resume.md) functionality, HyperPod requires the `TaskPlugin` and `SchedulerParameters` parameters set as follows. The HyperPod agent sets up these two parameters with the required values by default.

  ```
  TaskPlugin=task/none
  SchedulerParameters=permit_job_expansion
  ```
+ In [https://slurm.schedmd.com/gres.conf.html](https://slurm.schedmd.com/gres.conf.html), HyperPod manages `NodeName` for GPU nodes.

## How do I run Docker on Slurm nodes on HyperPod?
<a name="hyperpod-faqs-q3"></a>

To help you run Docker on your Slurm nodes running on HyperPod, the HyperPod service team provides setup scripts that you can include as part of the lifecycle configuration for cluster creation. To learn more, see [Base lifecycle scripts provided by HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md) and [Running Docker containers on a Slurm compute node on HyperPod](sagemaker-hyperpod-run-jobs-slurm-docker.md).

## Why does my parallel training job fail when I use NVIDIA Collective Communications Library (NCCL) with Slurm on SageMaker HyperPod platform?
<a name="hyperpod-faqs-q4"></a>

By default, the Linux OS sets `RemoveIPC=yes` in `/etc/systemd/logind.conf`. Slurm and mpirun jobs that use NCCL generate inter-process communication (IPC) resources under non-root user sessions. These user sessions might log out during the job process.

When you run jobs with Slurm or mpirun, if `systemd` detects that the user isn't logged in, it cleans up the IPC resources. Slurm and mpirun jobs can run without the user being logged in, but that requires that you disable cleanup at the systemd level and set it up at the Slurm level instead. For more information, see [Systemd in the NCCL documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#systemd).

To disable cleanup at the systemd level, complete the following steps.

1. Set `RemoveIPC=no` in the file `/etc/systemd/logind.conf` if you're running training jobs that use Slurm and NCCL.
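
   The following is an illustrative sketch of the edit, shown on a scratch copy of the file. On a real node, point `CONF` at `/etc/systemd/logind.conf` and then restart `systemd-logind` for the change to take effect.

   ```
   # Demo on a temporary copy; use CONF=/etc/systemd/logind.conf on a real node.
   CONF=$(mktemp)
   printf '#RemoveIPC=yes\n' > "$CONF"
   # Uncomment the setting if needed and force it to "no".
   sed -i 's/^#\?RemoveIPC=.*/RemoveIPC=no/' "$CONF"
   grep RemoveIPC "$CONF"
   # On a real node: sudo systemctl restart systemd-logind
   ```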

1. By default, Slurm doesn't clean up shared resources. We recommend that you set up a Slurm epilog script to clean up shared resources. This cleanup is useful when you have a lot of shared resources and want to clean them up after training jobs. The following is an example script.

   ```
   #!/bin/bash
   : <<'SUMMARY'
   Script: epilog.sh
   
   Use this script with caution, as it can potentially delete unnecessary resources and cause issues if you don't use it correctly.
   
   Note: You must save this script in a shared location that is accessible to all nodes in the cluster, such as an /fsx volume.
   Workers must be able to access the script to run the script after jobs.
   
   SUMMARY
   
   # Define the log directory and create it if it doesn't exist
   LOG_DIR="/<PLACEHOLDER>/epilogue" #NOTE: Update PLACEHOLDER to be a shared value path, such as /fsx/epilogue.
   mkdir -p "$LOG_DIR"
   
   # Name the log file using the Slurm job name and job ID
   log_file="$LOG_DIR/epilogue-${SLURM_JOB_NAME}_${SLURM_JOB_ID}.log"
   
   logging() {
       echo "[$(date)] $1" | tee -a "$log_file"
   }
   
   # Slurm epilogue script to clean up IPC resources
   logging "Starting IPC cleanup for Job $SLURM_JOB_ID"
   
   # Clean up shared memory segments by username
   for seg in $(ipcs -m | awk -v owner="$SLURM_JOB_USER" '$3 == owner {print $2}'); do
       if ipcrm -m "$seg"; then
           logging "Removed shared memory segment $seg"
       else
           logging "Failed to remove shared memory segment $seg"
       fi
   done
   
   # Clean up semaphores by username
   for sem in $(ipcs -s | awk -v user="$SLURM_JOB_USER" '$3 == user {print $2}'); do
       if ipcrm -s "$sem"; then
           logging "Removed semaphore $sem"
       else
           logging "Failed to remove semaphore $sem"
       fi
   done
   
   # Clean up NCCL IPC
   NCCL_IPC_PATH="/dev/shm/nccl-*"
   for file in $NCCL_IPC_PATH; do
       if [ -e "$file" ]; then
           if rm "$file"; then
               logging "Removed NCCL IPC file $file"
           else
               logging "Failed to remove NCCL IPC file $file"
           fi
       fi
   done
   logging "IPC cleanup completed for Job $SLURM_JOB_ID"
   exit 0
   ```

   For more information about the Epilog parameter, see [Slurm documentation](https://slurm.schedmd.com/slurm.conf.html#OPT_Epilog).

1. In the `slurm.conf` file from the controller node, add in a line to point to the epilog script you created.

   ```
   Epilog=/path/to/epilog.sh  #For example: /fsx/epilogue/epilog.sh
   ```

1. Run the following commands to change permissions of the script and to make it executable.

   ```
   chown slurm:slurm /path/to/epilog.sh
   chmod +x /path/to/epilog.sh
   ```

1. To apply all of your changes, run `scontrol reconfigure`.

## How do I use local NVMe store of P instances for launching Docker or Enroot containers with Slurm?
<a name="hyperpod-faqs-q5"></a>

Because the default root volume of your head node is typically limited to a 100 GB Amazon EBS volume, you need to set up Docker and Enroot to use the local NVMe instance store. To learn how to set up NVMe store and use it for launching Docker containers, see [Running Docker containers on a Slurm compute node on HyperPod](sagemaker-hyperpod-run-jobs-slurm-docker.md).
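
As a hedged sketch, one part of that setup is pointing Docker's `data-root` at the NVMe mount through `daemon.json` (the `/opt/dlami/nvme` path is an assumption; see the linked topic for the actual setup scripts). The demo below writes to a scratch directory; on a real node, use `/etc/docker` and restart Docker afterward.

```
# Demo in a scratch directory; on a real node, ETC_DIR would be /etc/docker.
ETC_DIR=$(mktemp -d)
cat > "$ETC_DIR/daemon.json" <<'EOF'
{
    "data-root": "/opt/dlami/nvme/docker"
}
EOF
cat "$ETC_DIR/daemon.json"
# On a real node: sudo systemctl restart docker
```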

## How do I set up EFA security groups?
<a name="hyperpod-faqs-q6"></a>

If you want to create a HyperPod cluster with EFA-enabled instances, make sure that you set up a security group to allow all inbound and outbound traffic to and from the security group itself. To learn more, see [Step 1: Prepare an EFA-enabled security group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security) in the *Amazon EC2 User Guide*.

## How do I monitor my HyperPod cluster nodes? Are there any CloudWatch metrics exported from HyperPod?
<a name="hyperpod-faqs-q7"></a>

To gain observability into the resource utilization of your HyperPod cluster, we recommend that you integrate the HyperPod cluster with Amazon Managed Grafana and Amazon Managed Service for Prometheus. With various open-source Grafana dashboards and exporter packages, you can export and visualize metrics related to the HyperPod cluster resources. To learn more about setting up SageMaker HyperPod with Amazon Managed Grafana and Amazon Managed Service for Prometheus, see [SageMaker HyperPod cluster resources monitoring](sagemaker-hyperpod-cluster-observability-slurm.md). Note that SageMaker HyperPod currently doesn't support the exportation of system metrics to Amazon CloudWatch.

## Can I add additional storage to the HyperPod cluster nodes? The cluster instances have limited local instance store.
<a name="hyperpod-faqs-q8"></a>

If the default instance storage is insufficient for your workload, you can configure additional storage per instance. Starting from the [release on June 20, 2024](sagemaker-hyperpod-release-notes.md#sagemaker-hyperpod-release-notes-20240620), you can add an additional Amazon Elastic Block Store (EBS) volume to each instance in your SageMaker HyperPod cluster. This capability cannot be applied to existing instance groups of clusters created before June 20, 2024; to use it with such clusters, patch them and add new instance groups. The capability is fully effective for any SageMaker HyperPod cluster created after June 20, 2024.
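
For example, a partial instance group specification in a `CreateCluster` or `UpdateCluster` request might request an additional volume through `InstanceStorageConfigs` as follows (the group name, instance type, and size are placeholders, and other required fields of the specification are omitted):

```
{
    "InstanceGroupName": "worker-group-1",
    "InstanceType": "ml.p5.48xlarge",
    "InstanceCount": 2,
    "InstanceStorageConfigs": [
        {
            "EbsVolumeConfig": {
                "VolumeSizeInGB": 500
            }
        }
    ]
}
```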

## Why are my compute nodes showing as "DOWN" or "DRAINED" after a reboot?
<a name="hyperpod-faqs-q9"></a>

This typically occurs when nodes are rebooted using `sudo reboot` instead of Slurm's control interface. To properly reboot nodes, use the Slurm command `scontrol reboot nextstate=resume <list_of_nodes>`. This ensures Slurm maintains proper control of the node state and resumes normal operation after reboot.

For GPU instances (such as NVIDIA P5), this can also happen if the node can't complete its boot process within Slurm's default time limit of 60 seconds. To resolve this, increase the `ResumeTimeout` parameter in `slurm.conf` to 300 seconds. This gives GPU instances sufficient time to boot and initialize drivers.

## Why do my nodes keep getting drained due to Out of Memory (OOM) issues?
<a name="hyperpod-faqs-q10"></a>

OOM issues occur when jobs exceed the node's memory capacity. To prevent this, implement `cgroups` to enforce memory limits per job. This prevents a single job from affecting the entire node and improves isolation and stability.

Example setup in `slurm.conf`: 

```
TaskPlugin=task/cgroup
```

Example setup in `cgroup.conf`:

```
CgroupAutomount=yes
ConstrainCores=yes
CgroupPlugin=autodetect
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
SignalChildrenProcesses=yes
MaxRAMPercent=99
MaxSwapPercent=80
MinRAMSpace=100
```

For more information, see [Control Group in Slurm](https://slurm.schedmd.com/cgroups.html), [Cgroup and PAM-based login control for Slurm compute nodes](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/pam_adopt_cgroup_wheel.sh#L197), and [Configure Cgroups for Slurm](https://catalog.workshops.aws/sagemaker-hyperpod/en-US/07-tips-and-tricks/16-enable-cgroups).

## How can I ensure resources are properly cleaned up after jobs complete?
<a name="hyperpod-faqs-q11"></a>

Implement epilogue scripts to automatically clean up resources after jobs complete. Resources might not be cleared correctly when jobs crash unexpectedly, contain bugs that prevent normal cleanup, or when shared memory buffers (including those shared between processes and GPU drivers) remain allocated.

Epilogue scripts can perform tasks such as clearing GPU memory, removing temporary files, and unmounting file systems. These scripts have limitations when resources are not exclusively allocated to a single job. For detailed instructions and sample scripts, see the second bullet point of the question [Why does my parallel training job fail when I use NVIDIA Collective Communications Library (NCCL) with Slurm on SageMaker HyperPod platform?](#hyperpod-faqs-q4). For more information, see [Enable Slurm epilog script](https://catalog.workshops.aws/sagemaker-hyperpod/en-US/07-tips-and-tricks/18-slurm-epilogue).

# Orchestrating SageMaker HyperPod clusters with Amazon EKS
<a name="sagemaker-hyperpod-eks"></a>

SageMaker HyperPod is a SageMaker AI-managed service that enables large-scale training of foundation models on long-running and resilient compute clusters, integrating with Amazon EKS for orchestrating the HyperPod compute resources. You can run uninterrupted training jobs spanning weeks or months at scale using Amazon EKS clusters with HyperPod resiliency features that check for various hardware failures and automatically recover faulty nodes. 

Key features for cluster admin users include the following.
+ Provisioning resilient HyperPod clusters and attaching them to an EKS control plane
+ Enabling dynamic capacity management, such as adding more nodes, updating software, and deleting clusters
+ Enabling access to the cluster instances directly through `kubectl` or SSM/SSH
+ Offering [resiliency capabilities](sagemaker-hyperpod-eks-resiliency.md), including basic health checks, deep health checks, a health-monitoring agent, and support for PyTorch job auto-resume
+ Integrating with observability tools such as [Amazon CloudWatch Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html), [Amazon Managed Service for Prometheus](https://docs.aws.amazon.com/prometheus/latest/userguide/what-is-Amazon-Managed-Service-Prometheus.html), and [Amazon Managed Grafana](https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html)

For data scientist users, EKS support in HyperPod enables the following.
+ Running containerized workloads for training foundation models on the HyperPod cluster
+ Running inference on the EKS cluster, leveraging the integration between HyperPod and EKS
+ Leveraging the job auto-resume capability for [Kubeflow PyTorch training (PyTorchJob)](https://www.kubeflow.org/docs/components/training/user-guides/pytorch/)

**Note**  
Amazon EKS enables user-managed orchestration of tasks and infrastructure on SageMaker HyperPod through the Amazon EKS Control Plane. Ensure that user access to the cluster through the Kubernetes API Server endpoint follows the principle of least-privilege, and that network egress from the HyperPod cluster is secured.  
To learn more about securing access to the Amazon EKS API Server, see [Control network access to cluster API server endpoint](https://docs.aws.amazon.com/eks/latest/userguide/cluster-endpoint.html).  
To learn more about securing network access on HyperPod, see [Setting up SageMaker HyperPod with a custom Amazon VPC](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-optional-vpc).

The high-level architecture of Amazon EKS support in HyperPod involves a 1-to-1 mapping between an EKS cluster (control plane) and a HyperPod cluster (worker nodes) within a VPC, as shown in the following diagram.

![\[EKS and HyperPod VPC architecture with control plane, cluster nodes, and AWS services.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod-eks-diagram.png)


# Managing SageMaker HyperPod clusters orchestrated by Amazon EKS
<a name="sagemaker-hyperpod-eks-operate"></a>

This section provides guidance on managing SageMaker HyperPod through the SageMaker AI console UI or the AWS Command Line Interface (CLI). It explains how to perform various tasks related to SageMaker HyperPod, whether you prefer a visual interface or working with commands.

**Topics**
+ [

# Getting started with Amazon EKS support in SageMaker HyperPod
](sagemaker-hyperpod-eks-prerequisites.md)
+ [

# Installing packages on the Amazon EKS cluster using Helm
](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md)
+ [

# Setting up Kubernetes role-based access control
](sagemaker-hyperpod-eks-setup-rbac.md)
+ [

# Custom Amazon Machine Images (AMIs) for SageMaker HyperPod clusters
](hyperpod-custom-ami-support.md)
+ [

# Managing SageMaker HyperPod EKS clusters using the SageMaker console
](sagemaker-hyperpod-eks-operate-console-ui.md)
+ [

# Creating SageMaker HyperPod clusters using CloudFormation templates
](smcluster-getting-started-eks-console-create-cluster-cfn.md)
+ [

# Managing SageMaker HyperPod EKS clusters using the AWS CLI
](sagemaker-hyperpod-eks-operate-cli-command.md)
+ [

# HyperPod managed tiered checkpointing
](managed-tier-checkpointing.md)
+ [

# SageMaker HyperPod task governance
](sagemaker-hyperpod-eks-operate-console-ui-governance.md)
+ [

# Usage reporting for cost attribution in SageMaker HyperPod
](sagemaker-hyperpod-usage-reporting.md)
+ [

# Configuring storage for SageMaker HyperPod clusters orchestrated by Amazon EKS
](sagemaker-hyperpod-eks-setup-storage.md)
+ [

# Using the Amazon EBS CSI driver on SageMaker HyperPod EKS clusters
](sagemaker-hyperpod-eks-ebs.md)
+ [

# Configuring custom Kubernetes labels and taints in Amazon SageMaker HyperPod
](sagemaker-hyperpod-eks-custom-labels-and-taints.md)

# Getting started with Amazon EKS support in SageMaker HyperPod
<a name="sagemaker-hyperpod-eks-prerequisites"></a>

In addition to the general [Prerequisites for using SageMaker HyperPod](sagemaker-hyperpod-prerequisites.md) for SageMaker HyperPod, check the following requirements and considerations for orchestrating SageMaker HyperPod clusters using Amazon EKS.

**Important**  
You can set up resources configuration for creating SageMaker HyperPod clusters using the AWS Management Console and CloudFormation. For more information, see [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md) and [Creating SageMaker HyperPod clusters using CloudFormation templates](smcluster-getting-started-eks-console-create-cluster-cfn.md).

**Requirements**

**Note**  
Before creating a HyperPod cluster, you need a running Amazon EKS cluster that is configured with a VPC and has the required packages installed using Helm.
+ If using the SageMaker AI console, you can create an Amazon EKS cluster within the HyperPod cluster console page. For more information, see [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md).
+ If using AWS CLI, you should create an Amazon EKS cluster before creating a HyperPod cluster to associate with. For more information, see [Create an Amazon EKS cluster](https://docs.aws.amazon.com/eks/latest/userguide/create-cluster.html) in the Amazon EKS User Guide.

When provisioning your Amazon EKS cluster, consider the following:

1. **Kubernetes version support**
   + SageMaker HyperPod supports Kubernetes versions 1.28, 1.29, 1.30, 1.31, 1.32, 1.33, and 1.34.

1. **Amazon EKS cluster authentication mode**
   + The authentication modes of an Amazon EKS cluster supported by SageMaker HyperPod are `API` and `API_AND_CONFIG_MAP`.

1. **Networking**
   + SageMaker HyperPod requires the Amazon VPC Container Network Interface (CNI) plug-in version 1.18.3 or later.
**Note**  
[AWS VPC CNI plugin for Kubernetes](https://github.com/aws/amazon-vpc-cni-k8s) is the only CNI supported by SageMaker HyperPod.
   + The [type of the subnet](https://docs.aws.amazon.com/vpc/latest/userguide/configure-subnets.html#subnet-types) in your VPC must be private for HyperPod clusters.

1. **IAM roles**
   + Ensure the necessary IAM roles for HyperPod are set up as guided in the [AWS Identity and Access Management for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md) section.

1. **Amazon EKS cluster add-ons**
   + You can continue using the various add-ons provided by Amazon EKS such as [Kube-proxy](https://docs.aws.amazon.com/eks/latest/userguide/add-ons-kube-proxy.html), [CoreDNS](https://docs.aws.amazon.com/eks/latest/userguide/add-ons-coredns.html), the [Amazon VPC Container Network Interface (CNI)](https://docs.aws.amazon.com/eks/latest/userguide/add-ons-vpc-cni.html) plugin, Amazon EKS pod identity, the GuardDuty agent, the Amazon FSx Container Storage Interface (CSI) driver, the Mountpoint for Amazon S3 CSI driver, the AWS Distro for OpenTelemetry, and the CloudWatch Observability agent.

**Considerations for configuring SageMaker HyperPod clusters with Amazon EKS**
+ You must use distinct IAM roles based on the type of your nodes. For HyperPod nodes, use a role based on [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod). For Amazon EKS nodes, see [Amazon EKS node IAM role](https://docs.aws.amazon.com/eks/latest/userguide/create-node-role.html).
+ You can provision and mount additional Amazon EBS volumes on SageMaker HyperPod nodes using two approaches: use [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ClusterInstanceGroupSpecification.html#sagemaker-Type-ClusterInstanceGroupSpecification-InstanceStorageConfigs](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ClusterInstanceGroupSpecification.html#sagemaker-Type-ClusterInstanceGroupSpecification-InstanceStorageConfigs) for cluster-level volume provisioning (available when creating or updating instance groups), or use the Amazon Elastic Block Store (Amazon EBS) Container Storage Interface (CSI) driver for dynamic pod-level volume management. With [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ClusterInstanceGroupSpecification.html#sagemaker-Type-ClusterInstanceGroupSpecification-InstanceStorageConfigs](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ClusterInstanceGroupSpecification.html#sagemaker-Type-ClusterInstanceGroupSpecification-InstanceStorageConfigs), set the [local path](https://kubernetes.io/docs/concepts/storage/volumes/#local) to `/opt/sagemaker` to properly mount the volumes to your Amazon EKS pods. For information about how to deploy the [Amazon EBS CSI](https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html) controller on HyperPod nodes, see [Using the Amazon EBS CSI driver on SageMaker HyperPod EKS clusters](sagemaker-hyperpod-eks-ebs.md).
+ If you use instance-type labels for defining scheduling constraints, ensure that you use the SageMaker AI ML instance types prefixed with `ml.`. For example, for P5 instances, use `ml.p5.48xlarge` instead of `p5.48xlarge`.
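As an illustration of the last point, a scheduling constraint using the `ml.`-prefixed instance type might look like the following sketch. The label key assumes the standard `node.kubernetes.io/instance-type` well-known label; adjust it if your cluster applies a different instance-type label.

```
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod
spec:
  nodeSelector:
    # Use the ml.-prefixed SageMaker AI instance type, not p5.48xlarge
    node.kubernetes.io/instance-type: ml.p5.48xlarge
  containers:
  - name: train
    image: <your-training-image>
```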

**Considerations for configuring network for SageMaker HyperPod clusters with Amazon EKS**
+ Each HyperPod cluster instance supports one Elastic Network Interface (ENI). For the maximum number of Pods per instance type, refer to the following table.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ Only Pods with `hostNetwork = true` have access to the Amazon EC2 Instance Metadata Service (IMDS) by default. Use the Amazon EKS Pod identity or the [IAM roles for service accounts (IRSA)](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html) to manage access to the AWS credentials for Pods.
+ EKS-orchestrated HyperPod clusters support two IP addressing modes: IPv4, or IPv6 for IPv6 Amazon EKS clusters running in IPv6-enabled VPC and subnet environments. For more information, see [Setting up SageMaker HyperPod with a custom Amazon VPC](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-optional-vpc).
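The IRSA approach mentioned above associates an IAM role with a Kubernetes service account. As a minimal sketch (the role ARN and service account name are placeholders), the service account carries the role through an annotation:

```
apiVersion: v1
kind: ServiceAccount
metadata:
  name: training-service-account
  namespace: kubeflow
  annotations:
    # Placeholder ARN: replace with the IAM role you created for IRSA
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/MyPodRole
```

Pods that reference this service account receive the role's AWS credentials without needing `hostNetwork = true` access to IMDS.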

**Considerations for using the HyperPod cluster resiliency features**
+ Node auto-replacement is not supported for CPU instances.
+ The HyperPod health monitoring agent needs to be installed for node auto-recovery to work. The agent can be installed using Helm. For more information, see [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md).
+ The HyperPod deep health checks and the health monitoring agent support GPU and Trn instances.
+ SageMaker AI applies the following taint to nodes when they are undergoing deep health checks:

  ```
  effect: NoSchedule
  key: sagemaker.amazonaws.com/node-health-status
  value: Unschedulable
  ```
**Note**  
You cannot add custom taints to nodes in instance groups with `DeepHealthChecks` turned on.
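
If you run system Pods (for example, monitoring daemons) that must remain scheduled on nodes undergoing deep health checks, a matching toleration would look like the following sketch. This is only appropriate for infrastructure daemons; training workloads should not tolerate this taint.

```
tolerations:
- key: sagemaker.amazonaws.com/node-health-status
  operator: Equal
  value: Unschedulable
  effect: NoSchedule
```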

 Once your Amazon EKS cluster is running, configure your cluster using the Helm package manager as instructed in [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md) before creating your HyperPod cluster.

# Installing packages on the Amazon EKS cluster using Helm
<a name="sagemaker-hyperpod-eks-install-packages-using-helm-chart"></a>

Before creating a SageMaker HyperPod cluster and attaching it to an Amazon EKS cluster, you should install packages using [Helm](https://helm.sh/), a package manager for Kubernetes. Helm is an open-source tool for setting up an installation process for Kubernetes clusters. It automates and streamlines dependency installations and simplifies the various setups needed to prepare the Amazon EKS cluster as the orchestrator (control plane) for a SageMaker HyperPod cluster.

The SageMaker HyperPod service team provides a Helm chart package, which bundles key dependencies such as device plug-ins (including the EFA plug-in), the [Kubeflow Training Operator](https://www.kubeflow.org/docs/components/training/), and associated permission configurations.

**Important**  
This Helm installation step is required. If you set up your Amazon EKS cluster using the [AWS Management Console](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md) or [CloudFormation](smcluster-getting-started-eks-console-create-cluster-cfn.md), you can skip this step because the installation is handled automatically during the setup process. If you set up the cluster directly using the APIs, use the provided Helm chart to configure your Amazon EKS cluster. Failure to configure your Amazon EKS cluster using the provided Helm chart might result in the SageMaker HyperPod cluster not functioning correctly or the creation process failing entirely. The `aws-hyperpod` namespace name cannot be modified.

1. [Install Helm](https://helm.sh/docs/intro/install/) on your local machine.

1. Download the Helm charts provided by SageMaker HyperPod located at `helm_chart/HyperPodHelmChart` in the [SageMaker HyperPod CLI repository](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart).

   ```
   git clone https://github.com/aws/sagemaker-hyperpod-cli.git
   cd sagemaker-hyperpod-cli/helm_chart
   ```

1. Update the dependencies of the Helm chart, preview the changes that will be made to your Kubernetes cluster, and install the Helm chart.

   ```
   helm dependencies update HyperPodHelmChart
   ```

   ```
   helm install hyperpod-dependencies HyperPodHelmChart --namespace kube-system --dry-run
   ```

   ```
   helm install hyperpod-dependencies HyperPodHelmChart --namespace kube-system
   ```

In summary, the Helm installation sets up various components for your Amazon EKS cluster, including job scheduling and queueing (Kueue), storage management, MLflow integration, and Kubeflow. Additionally, the charts install the following components, which are required for integrating with the SageMaker HyperPod cluster resiliency features.
+ **Health monitoring agent** – This installs the health-monitoring agent provided by SageMaker HyperPod, which is required if you want your HyperPod cluster to be monitored. The agent is provided as a Docker image, and the image is preset in the `values.yaml` file provided in the Helm charts. The agent supports GPU-based instances and Trainium-accelerator-based instances (`trn1`, `trn1n`, `inf2`). It is installed to the `aws-hyperpod` namespace. To find your supported URI, see [Supported Regions and their ECR URIs in the sagemaker-hyperpod-cli repository on GitHub](https://github.com/aws/sagemaker-hyperpod-cli/blob/main/helm_chart/readme.md#6-notes).
+ **Deep health check** – This sets up a `ClusterRole`, a `ServiceAccount` (`deep-health-check-service-account`) in the `aws-hyperpod` namespace, and a `ClusterRoleBinding` to enable the SageMaker HyperPod deep health check feature. For more information about the Kubernetes RBAC file for deep health check, see the configuration file [deep-health-check-rbac.yaml](https://github.com/aws/sagemaker-hyperpod-cli/blob/main/helm_chart/HyperPodHelmChart/charts/deep-health-check/templates/deep-health-check-rbac.yaml) in the SageMaker HyperPod CLI GitHub repository.
+ **`job-auto-restart`** – This sets up a `ClusterRole`, a `ServiceAccount` (`job-auto-restart`) in the `aws-hyperpod` namespace, and a `ClusterRoleBinding` to enable the auto-restart feature for PyTorch training jobs in SageMaker HyperPod. For more information about the Kubernetes RBAC file for `job-auto-restart`, see the configuration file [job-auto-restart-rbac.yaml](https://github.com/aws/sagemaker-hyperpod-cli/blob/main/helm_chart/HyperPodHelmChart/charts/job-auto-restart/templates/job-auto-restart-rbac.yaml) in the SageMaker HyperPod CLI GitHub repository.
+ **Kubeflow MPI operator** – The [MPI Operator](https://github.com/kubeflow/mpi-operator) is a Kubernetes operator that simplifies running distributed Machine Learning (ML) and High-Performance Computing (HPC) workloads using the Message Passing Interface (MPI) on Kubernetes clusters. It installs MPI Operator v0.5. It is installed to the `mpi-operator` namespace.
+ **`nvidia-device-plugin`** – This is a Kubernetes device plug-in that allows you to automatically expose NVIDIA GPUs for consumption by containers in your Amazon EKS cluster. It allows Kubernetes to allocate and provide access to the requested GPUs for that container. Required when using an instance type with GPU.
+ **`neuron-device-plugin`** – This is a Kubernetes device plug-in that allows you to automatically expose AWS Inferentia chips for consumption by containers in your Amazon EKS cluster. It allows Kubernetes to access and utilize the AWS Inferentia chips on the cluster nodes. Required when using a Neuron instance type.
+ **`aws-efa-k8s-device-plugin`** – This is a Kubernetes device plug-in that enables the use of AWS Elastic Fabric Adapter (EFA) on Amazon EKS clusters. EFA is a network device that provides low-latency and high-throughput communication between instances in a cluster. Required when using an EFA supported instance type.
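As a sketch of how workloads consume the devices exposed by these plug-ins, a Pod can request GPUs and EFA devices through resource limits. The resource names below follow the standard conventions of the NVIDIA and EFA device plug-ins; the image is a placeholder.

```
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
spec:
  containers:
  - name: train
    image: <your-training-image>
    resources:
      limits:
        nvidia.com/gpu: 8           # exposed by nvidia-device-plugin
        vpc.amazonaws.com/efa: 4    # exposed by aws-efa-k8s-device-plugin
```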

For more information about the installation procedure using the provided Helm charts, see the [README file in the SageMaker HyperPod CLI repository](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart).

# Setting up Kubernetes role-based access control
<a name="sagemaker-hyperpod-eks-setup-rbac"></a>

Cluster admin users also need to set up [Kubernetes role-based access control (RBAC)](https://kubernetes.io/docs/reference/access-authn-authz/rbac/) for data scientist users to use the [SageMaker HyperPod CLI](https://github.com/aws/sagemaker-hyperpod-cli) to run workloads on HyperPod clusters orchestrated with Amazon EKS.

## Option 1: Set up RBAC using Helm chart
<a name="sagemaker-hyperpod-eks-setup-rbac-helm"></a>

The SageMaker HyperPod service team provides a Helm sub-chart for setting up RBAC. To learn more, see [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md).

## Option 2: Set up RBAC manually
<a name="sagemaker-hyperpod-eks-setup-rbac-manual"></a>

Create `ClusterRole` and `ClusterRoleBinding` with the minimum privilege, and create `Role` and `RoleBinding` with mutation permissions.

**To create a `ClusterRole` and `ClusterRoleBinding` for the data scientist IAM role**

Create a cluster-level configuration file `cluster_level_config.yaml` as follows.

```
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: hyperpod-scientist-user-cluster-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["list"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hyperpod-scientist-user-cluster-role-binding
subjects:
- kind: Group
  name: hyperpod-scientist-user-cluster-level
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: hyperpod-scientist-user-cluster-role # this must match the name of the Role or ClusterRole you wish to bind to
  apiGroup: rbac.authorization.k8s.io
```

Apply the configuration to the EKS cluster.

```
kubectl apply -f cluster_level_config.yaml
```

**To create a `Role` and `RoleBinding` in a namespace**

This is the namespace where the training operator runs training jobs and where the resiliency features monitor by default. Job auto-resume is supported only in the `kubeflow` namespace or in namespaces prefixed with `aws-hyperpod`.

Create a role configuration file `namespace_level_role.yaml` as follows. This example creates a role in the `kubeflow` namespace.

```
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: kubeflow
  name: hyperpod-scientist-user-namespace-level-role
###
#  1) add/list/describe/delete pods
#  2) get/list/watch/create/patch/update/delete/describe kubeflow pytorch job
#  3) get pod log
###
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["create", "get"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["get", "create"]
- apiGroups: ["kubeflow.org"]
  resources: ["pytorchjobs", "pytorchjobs/status"]
  verbs: ["get", "list", "create", "delete", "update", "describe"]
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["create", "update", "get", "list", "delete"]
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["create", "get", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: kubeflow
  name: hyperpod-scientist-user-namespace-level-role-binding
subjects:
- kind: Group
  name: hyperpod-scientist-user-namespace-level
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: hyperpod-scientist-user-namespace-level-role # this must match the name of the Role or ClusterRole you wish to bind to
  apiGroup: rbac.authorization.k8s.io
```

Apply the configuration to the EKS cluster.

```
kubectl apply -f namespace_level_role.yaml
```

## Create an access entry for Kubernetes groups
<a name="sagemaker-hyperpod-eks-setup-rbac-access-entry"></a>

After you have set up RBAC using one of the two options above, use the following sample command, replacing the placeholder values with your own.

```
aws eks create-access-entry \
    --cluster-name <eks-cluster-name> \
    --principal-arn arn:aws:iam::<AWS_ACCOUNT_ID_SCIENTIST_USER>:role/ScientistUserRole \
    --kubernetes-groups '["hyperpod-scientist-user-namespace-level","hyperpod-scientist-user-cluster-level"]'
```

For the `principal-arn` parameter, use the IAM role for scientist users described in [IAM users for scientists](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-user).

# Custom Amazon Machine Images (AMIs) for SageMaker HyperPod clusters
<a name="hyperpod-custom-ami-support"></a>

Using base Amazon Machine Images (AMIs) provided and made public by Amazon SageMaker HyperPod, you can build custom AMIs. With a custom AMI, you can create specialized environments for AI workloads with pre-configured software stacks, driver customizations, proprietary dependencies, and security agents. This capability eliminates the need for complex post-launch bootstrapping using lifecycle configuration scripts.

With custom AMIs, you can standardize environments across different stages, accelerate startup times, and have full control over your runtime environment while leveraging SageMaker HyperPod's infrastructure capabilities and scaling advantages. This helps you maintain control over your AI infrastructure while still benefiting from SageMaker HyperPod's optimized base runtime.

You can build upon the SageMaker HyperPod performance-tuned base images by adding security agents, compliance tools, and specialized libraries while preserving all the distributed training benefits. This capability removes the previously required choice between infrastructure optimization and organizational security policies.

The custom AMI experience integrates seamlessly with established enterprise security workflows. Security teams build hardened images using SageMaker HyperPod's public AMIs as a base, and AI platform teams can specify these custom AMIs when creating or updating clusters through the SageMaker HyperPod APIs. The APIs validate image compatibility, handle necessary permissions, and maintain backwards compatibility so existing workflows continue functioning. Organizations with stringent security protocols can eliminate the error-prone alternative of installing security agents at runtime through lifecycle scripts. By aligning with enterprise security practices rather than forcing organizations to adapt their protocols to SageMaker HyperPod's limitations, custom AMIs remove a common barrier to adoption for security-conscious organizations running critical AI workloads.

For release notes on updates to the public AMIs, see [Public AMI releases](sagemaker-hyperpod-release-public-ami.md). To learn how to get started with building a custom AMI and using it in your HyperPod clusters, see the following topics.

**Topics**
+ [Build a custom AMI](hyperpod-custom-ami-how-to.md)
+ [Cluster management with custom AMIs](hyperpod-custom-ami-cluster-management.md)

# Build a custom AMI
<a name="hyperpod-custom-ami-how-to"></a>

The following page explains how to build a custom Amazon Machine Image (AMI) using Amazon SageMaker HyperPod base AMIs. You begin by selecting a base AMI, and then you create your own customized AMI using any of the common methods for creating new images, such as the AWS CLI.

## Select a SageMaker HyperPod base AMI
<a name="hyperpod-custom-ami-select-base"></a>

You can select a SageMaker HyperPod base AMI through one of the following methods.

### AWS console selection
<a name="hyperpod-custom-ami-console-selection"></a>

You can select public SageMaker HyperPod AMIs through the AWS console or by using the `DescribeImages` API call. SageMaker HyperPod AMIs are public and visible in every AWS account. You can find them in the Amazon EC2 AMI catalog by applying a filter to search for public AMIs owned by Amazon.

To find SageMaker HyperPod AMIs in the console:

1. Sign in to the Amazon EC2 console.

1. In the left navigation pane, choose **AMIs**.

1. For the **Image type** dropdown, select **Public images**.

1. In the search bar filters, set the **Owner alias** filter to **amazon**.

1. Search for AMIs prefixed with **HyperPod EKS** and select the AMI (preferably the latest) that works for your use case. For example, you can choose between AMIs built for Kubernetes 1.31 and Kubernetes 1.30.

### Fetch latest public AMI ID through the AWS CLI
<a name="hyperpod-custom-ami-cli-fetch"></a>

If you want to always use the latest public AMI release, it is more efficient to read the public SageMaker HyperPod SSM parameter, which contains the AMI ID of the latest release.

The following example shows how to retrieve the latest AMI ID using the AWS CLI:

```
aws ssm get-parameter \
  --name "/aws/service/sagemaker-hyperpod/ami/x86_64/eks-1.31-amazon-linux-2/latest/ami-id" \
  --region us-east-1 \
  --query "Parameter.Value" \
  --output text
```

**Note**  
Replace the parameter name with the corresponding Kubernetes version as required. For example, if you want to use Kubernetes 1.30, use the following parameter: `/aws/service/sagemaker-hyperpod/ami/x86_64/eks-1.30-amazon-linux-2/latest/ami-id`.
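
If you script this lookup, only the Kubernetes version segment of the parameter name changes. The following sketch assumes the naming scheme of the 1.31 parameter shown above and builds the parameter name from a version variable:

```
# Build the SSM parameter name for a given Kubernetes minor version.
# Assumption: the naming scheme matches the documented eks-1.31 example.
K8S_VERSION="1.30"
PARAM_NAME="/aws/service/sagemaker-hyperpod/ami/x86_64/eks-${K8S_VERSION}-amazon-linux-2/latest/ami-id"
echo "$PARAM_NAME"
```

You can then pass `$PARAM_NAME` to `aws ssm get-parameter --name "$PARAM_NAME"` as shown in the previous example.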

## Build your custom AMI
<a name="hyperpod-custom-ami-build"></a>

After you have selected a SageMaker HyperPod public AMI, use it as the base AMI to build your own custom AMI with one of the following methods. Note that this is not an exhaustive list; you can use any AMI-building method of your choice, and SageMaker HyperPod has no specific recommendation.
+ **AWS Management Console**: You can launch an Amazon EC2 instance using the SageMaker HyperPod AMI, make desired customizations, and then create an AMI from that instance.
+ **AWS CLI**: You can also use the `aws ec2 create-image` command to create an AMI from an existing Amazon EC2 instance after performing the customization.
+ **HashiCorp Packer**: Packer is an open-source tool from HashiCorp that enables you to create identical machine images for multiple platforms from a single source configuration. It supports creating AMIs for AWS, as well as images for other cloud providers and virtualization platforms.
+ **Image Builder**: EC2 Image Builder is a fully managed AWS service that makes it easier to automate the creation, maintenance, validation, sharing, and deployment of Linux or Windows Server images. For more information, see the [EC2 Image Builder User Guide](https://docs.aws.amazon.com/imagebuilder/latest/userguide/what-is-image-builder.html).

### Build a custom AMI with customer managed AWS KMS encryption
<a name="hyperpod-custom-ami-build-kms"></a>

The following sections describe how to build a custom AMI with a customer managed AWS KMS key to encrypt your HyperPod cluster volumes. For more information about customer managed keys in HyperPod and granting the required IAM and KMS key policy permissions, see [Customer managed AWS KMS key encryption for SageMaker HyperPod](smcluster-cmk.md). If you plan to use a custom AMI that is encrypted with a customer managed key, ensure that you also encrypt your HyperPod cluster's Amazon EBS root volume with the same key.

#### AWS CLI example: Create a new AMI using EC2 Image Builder and a HyperPod base image
<a name="hyperpod-custom-ami-cli-example"></a>

The following example shows how to create an AMI using Image Builder with AWS KMS encryption:

```
aws imagebuilder create-image-recipe \
    --name "hyperpod-custom-recipe" \
    --semantic-version "1.0.0" \
    --parent-image "<hyperpod-base-image-id>" \
    --block-device-mappings 'DeviceName=/dev/xvda,Ebs={VolumeSize=100,VolumeType=gp3,Encrypted=true,KmsKeyId=arn:aws:kms:us-east-1:111122223333:key/key-id,DeleteOnTermination=true}'
```

#### Amazon EC2 console: Create a new AMI from an Amazon EC2
<a name="hyperpod-custom-ami-console-example"></a>

To create an AMI from an Amazon EC2 instance using the Amazon EC2 console:

1. Right-click on your customized Amazon EC2 instance and choose **Create Image**.

1. In the **Encryption** section, select **Encrypt snapshots**.

1. Select your KMS key from the dropdown. For example: `arn:aws:kms:us-east-2:111122223333:key/<your-kms-key-id>` or use the key alias: `alias/<your-hyperpod-key>`.

#### AWS CLI example: Create a new AMI from an Amazon EC2 instance
<a name="hyperpod-custom-ami-cli-create-image"></a>

Use the `aws ec2 create-image` command with AWS KMS encryption:

```
aws ec2 create-image \
    --instance-id "<instance-id>" \
    --name "MyCustomHyperPodAMI" \
    --description "Custom HyperPod AMI" \
    --block-device-mappings '[
        {
            "DeviceName": "/dev/xvda",
            "Ebs": {
                "Encrypted": true,
                "KmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/key-id",
                "VolumeType": "gp2" 
            }
        }
    ]'
```

# Cluster management with custom AMIs
<a name="hyperpod-custom-ami-cluster-management"></a>

After the custom AMI is built, you can use it for creating or updating an Amazon SageMaker HyperPod cluster. You can also scale up or add instance groups that use the new AMI.

## Permissions required for cluster operations
<a name="hyperpod-custom-ami-permissions"></a>

Add the following permissions to the cluster admin user who operates and configures SageMaker HyperPod clusters. The following policy example includes the minimum set of permissions for cluster administrators to run the SageMaker HyperPod core APIs and manage SageMaker HyperPod clusters with custom AMI.

Note that AMI and AMI EBS snapshot sharing permissions are included through the `ModifyImageAttribute` and `ModifySnapshotAttribute` API permissions in the following policy. To scope down the sharing permissions, you can take the following steps:
+ Add tags to the AMI and its EBS snapshot to control sharing permissions. For example, tag the AMI with `AllowSharing` set to `true`.
+ Add a condition key in the policy that allows sharing only for AMIs with those tags.

The following policy is scoped down so that sharing is allowed only for AMIs tagged with `AllowSharing` set to `true`.

------
#### [ JSON ]

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::111122223333:role/your-execution-role-name"
        },
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateCluster",
                "sagemaker:DeleteCluster",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:ListClusters",
                "sagemaker:UpdateCluster",
                "sagemaker:UpdateClusterSoftware",
                "sagemaker:BatchDeleteClusterNodes",
                "eks:DescribeCluster",
                "eks:CreateAccessEntry",
                "eks:DescribeAccessEntry",
                "eks:DeleteAccessEntry",
                "eks:AssociateAccessPolicy",
                "iam:CreateServiceLinkedRole",
                "ec2:DescribeImages",
                "ec2:DescribeSnapshots"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:ModifyImageAttribute",
                "ec2:ModifySnapshotAttribute"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "ec2:ResourceTag/AllowSharing": "true"
                }
            }
        }
    ]
}
```

------

**Important**  
If you plan to use an encrypted custom AMI, then make sure that your KMS key meets the permissions described in [Customer managed AWS KMS key encryption for SageMaker HyperPod](smcluster-cmk.md). Additionally, ensure that your custom AMI's KMS key is also used to encrypt your cluster's Amazon EBS root volume.

## Create a cluster
<a name="hyperpod-custom-ami-api-create"></a>

You can specify your custom AMI in the `ImageId` field for the `CreateCluster` operation.

The following examples show how to create a cluster with a custom AMI, both with and without an AWS KMS customer managed key for encrypting the cluster volumes.

------
#### [ Standard example ]

The following example shows how to create a cluster with a custom AMI.

```
aws sagemaker create-cluster \
   --cluster-name <exampleClusterName> \
   --orchestrator 'Eks={ClusterArn='<eks_cluster_arn>'}' \
   --node-provisioning-mode Continuous \
   --instance-groups '{
   "InstanceGroupName": "<exampleGroupName>",
   "InstanceType": "ml.c5.2xlarge",
   "InstanceCount": 2,
   "LifeCycleConfig": {
      "SourceS3Uri": "<s3://amzn-s3-demo-bucket>",
      "OnCreate": "on_create_noop.sh"
   },
   "ImageId": "<your_custom_ami>",
   "ExecutionRole": "<arn:aws:iam::444455556666:role/Admin>",
   "ThreadsPerCore": 1,
   "InstanceStorageConfigs": [
   
        {
            "EbsVolumeConfig": {
                "VolumeSizeInGB": 200
            }
        }
   ]
}' --vpc-config '{
   "SecurityGroupIds": ["<security_group>"],
   "Subnets": ["<subnet>"]
}'
```

------
#### [ Customer managed key example ]

The following example shows how to create a cluster with a custom AMI while specifying your own AWS KMS customer managed key for encrypting the cluster's Amazon EBS volumes. It is possible to specify different customer managed keys for the root volume and the instance storage volume. If you don't use customer managed keys in the `InstanceStorageConfigs` field, then an AWS owned KMS key is used to encrypt the volumes. If you use different keys for the root volume and secondary instance storage volumes, then set the required KMS key policies on both of your keys.

```
aws sagemaker create-cluster \
   --cluster-name <exampleClusterName> \
   --orchestrator 'Eks={ClusterArn='<eks_cluster_arn>'}' \
   --node-provisioning-mode Continuous \
   --instance-groups '{
   "InstanceGroupName": "<exampleGroupName>",
   "InstanceType": "ml.c5.2xlarge",
   "InstanceCount": 2,
   "LifeCycleConfig": {
      "SourceS3Uri": "<s3://amzn-s3-demo-bucket>",
      "OnCreate": "on_create_noop.sh"
   },
   "ImageId": "<your_custom_ami>",
   "ExecutionRole": "<arn:aws:iam:us-east-1:444455556666:role/Admin>",
   "ThreadsPerCore": 1,
   "InstanceStorageConfigs": [
            {
                "EbsVolumeConfig": {
                    "RootVolume": true,
                    "VolumeKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/key-id"
                }
            },
            {
                "EbsVolumeConfig": {
                    "VolumeSizeInGB": 100,
                    "VolumeKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/key-id"
                }
            }
   ]
}' --vpc-config '{
   "SecurityGroupIds": ["<security_group>"],
   "Subnets": ["<subnet>"]
}'
```

------

## Update the cluster software
<a name="hyperpod-custom-ami-api-update"></a>

If you want to update an existing instance group on your cluster with your custom AMI, you can use the `UpdateClusterSoftware` operation and specify your custom AMI in the `ImageId` field. Unless you specify the name of a specific instance group in your request, the new image is applied to all of the instance groups in your cluster.

The following example shows how to update a cluster's platform software with a custom AMI:

```
aws sagemaker update-cluster-software \
   --cluster-name <exampleClusterName> \
   --instance-groups <instanceGroupToUpdate> \
   --image-id <customAmiId>
```

## Scale up an instance group
<a name="hyperpod-custom-ami-scale-up"></a>

The following examples show how to scale up an instance group for a cluster using a custom AMI, both with and without using an AWS KMS customer managed key for encryption.

------
#### [ Standard example ]

The following example shows how to scale up an instance group with a custom AMI.

```
aws sagemaker update-cluster \
    --cluster-name <exampleClusterName> --instance-groups '[{                  
    "InstanceGroupName": "<exampleGroupName>",
   "InstanceType": "ml.c5.2xlarge",
   "InstanceCount": 2,
   "LifeCycleConfig": {
      "SourceS3Uri": "<s3://amzn-s3-demo-bucket>",
      "OnCreate": "on_create_noop.sh"
   },
   "ExecutionRole": "<arn:aws:iam::444455556666:role/Admin>",
   "ThreadsPerCore": 1,
   "ImageId": "<your_custom_ami>"
}]'
```

------
#### [ Customer managed key example ]

The following example shows how to update and scale up your cluster with a custom AMI while specifying your own AWS KMS customer managed key for encrypting the cluster's Amazon EBS volumes. It is possible to specify different customer managed keys for the root volume and the instance storage volume. If you don't use customer managed keys in the `InstanceStorageConfigs` field, then an AWS owned KMS key is used to encrypt the volumes. If you use different keys for the root volume and secondary instance storage volumes, then set the required KMS key policies on both of your keys.

```
aws sagemaker update-cluster \
    --cluster-name <exampleClusterName> --instance-groups '[{                  
    "InstanceGroupName": "<exampleGroupName>",
   "InstanceType": "ml.c5.2xlarge",
   "InstanceCount": 2,
   "LifeCycleConfig": {
      "SourceS3Uri": "<s3://amzn-s3-demo-bucket>",
      "OnCreate": "on_create_noop.sh"
   },
   "ExecutionRole": "<arn:aws:iam::444455556666:role/Admin>",
   "ThreadsPerCore": 1,
   "ImageId": "<your_custom_ami>",
   "InstanceStorageConfigs": [
            {
                "EbsVolumeConfig": {
                    "RootVolume": true,
                    "VolumeKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/key-id"
                }
            },
            {
                "EbsVolumeConfig": {
                    "VolumeSizeInGB": 100,
                    "VolumeKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/key-id"
                }
            }
   ]
}]'
```

------

## Add an instance group
<a name="hyperpod-custom-ami-add-instance-group"></a>

The following example shows how to add an instance group to a cluster using a custom AMI:

```
aws sagemaker update-cluster \
   --cluster-name "<exampleClusterName>" \
   --instance-groups '{
   "InstanceGroupName": "<exampleGroupName>",
   "InstanceType": "ml.c5.2xlarge",
   "InstanceCount": 2,
   "LifeCycleConfig": {
      "SourceS3Uri": "<s3://amzn-s3-demo-bucket>",
      "OnCreate": "on_create_noop.sh"
   },
   "ExecutionRole": "<arn:aws:iam::444455556666:role/Admin>",
   "ThreadsPerCore": 1,
   "ImageId": "<your_custom_ami>"
}' '{
   "InstanceGroupName": "<exampleGroupName2>",
   "InstanceType": "ml.c5.2xlarge",
   "InstanceCount": 1,
   "LifeCycleConfig": {
      "SourceS3Uri": "<s3://amzn-s3-demo-bucket>",
      "OnCreate": "on_create_noop.sh"
   },
   "ExecutionRole": "<arn:aws:iam::444455556666:role/Admin>",
   "ThreadsPerCore": 1,
   "ImageId": "<your_custom_ami>"
}'
```

# Managing SageMaker HyperPod EKS clusters using the SageMaker console
<a name="sagemaker-hyperpod-eks-operate-console-ui"></a>

The following topics provide guidance on how to manage SageMaker HyperPod in the SageMaker AI console.

**Topics**
+ [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md)
+ [Browsing, viewing, and editing SageMaker HyperPod clusters](sagemaker-hyperpod-eks-operate-console-ui-browse-view-edit.md)
+ [Deleting a SageMaker HyperPod cluster](sagemaker-hyperpod-eks-operate-console-ui-delete-cluster.md)

# Creating a SageMaker HyperPod cluster with Amazon EKS orchestration
<a name="sagemaker-hyperpod-eks-operate-console-ui-create-cluster"></a>

The following tutorial demonstrates how to create a new SageMaker HyperPod cluster and set it up with Amazon EKS orchestration through the SageMaker AI console UI.

**Topics**
+ [Create cluster](#smcluster-getting-started-eks-console-create-cluster-page)
+ [Deploy resources](#smcluster-getting-started-eks-console-create-cluster-deploy)

## Create cluster
<a name="smcluster-getting-started-eks-console-create-cluster-page"></a>

To navigate to the **SageMaker HyperPod Clusters** page and choose Amazon EKS orchestration, follow these steps.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose **HyperPod Clusters** in the left navigation pane and then **Cluster Management**.

1. On the **SageMaker HyperPod Clusters** page, choose **Create HyperPod cluster**. 

1. On the **Create HyperPod cluster** drop-down, choose **Orchestrated by Amazon EKS**.

1. On the EKS cluster creation page, you see two options. Choose the option that best fits your needs.

   1. **Quick setup** - To get started immediately with default settings, choose **Quick setup**. With this option, SageMaker AI creates new resources such as a VPC, subnets, security groups, an Amazon S3 bucket, an IAM role, and an FSx for Lustre file system in the process of creating your cluster.

   1. **Custom setup** - To integrate with existing AWS resources or have specific networking, security, or storage requirements, choose **Custom setup**. With this option, you can choose to use the existing resources or create new ones, and you can customize the configuration that best fits your needs.

## Quick setup
<a name="smcluster-getting-started-eks-console-create-cluster-default"></a>

On the **Quick setup** section, follow these steps to create your HyperPod cluster with Amazon EKS orchestration.

### General settings
<a name="smcluster-getting-started-eks-console-create-cluster-default-general"></a>

Specify a name for the new cluster. You can’t change the name after the cluster is created.

### Instance groups
<a name="smcluster-getting-started-eks-console-create-cluster-default-instance-groups"></a>

To add an instance group, choose **Add group**. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. To deploy a cluster, you must add at least one instance group. Follow these steps to add an instance group.

1. For **Instance group type**, choose **Standard** or **Restricted Instance Group (RIG)**. Typically, you will choose **Standard**, which provides a general purpose computing environment without additional security restrictions. **Restricted Instance Group (RIG)** is a specialized environment for foundation model customization, such as Amazon Nova. For more information about setting up a RIG for Amazon Nova model customization, see Amazon Nova customization on SageMaker HyperPod in the [Amazon Nova 1.0 user guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp.html) or the [Amazon Nova 2.0 user guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp.html).

1. For **Name**, specify a name for the instance group.

1.  For **Instance capacity**, choose either on-demand capacity or a training plan to reserve your compute resources.

1. For **Instance type**, choose the instance for the instance group.
**Important**  
Ensure that you choose an instance type with sufficient quotas and enough unassigned IP addresses for your account. To view or request additional quotas, see [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

1. For **Instance quantity**, specify an integer that doesn't exceed your instance quota for cluster usage. For this tutorial, enter **1** for each instance group.

1. For **Target Availability Zone**, choose the Availability Zone where your instances will be provisioned. The Availability Zone should correspond to the location of your accelerated compute capacity.

1. For **Additional storage volume per instance (GB) - optional**, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group. The default mount path for the additional EBS volume is `/opt/sagemaker`. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify if the EBS volume is mounted correctly by running the `df -h` command. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the [Amazon EBS volumes](https://docs.aws.amazon.com/ebs/latest/userguide/ebs-volumes.html) section in the *Amazon Elastic Block Store User Guide*.

1. For **Instance deep health checks**, choose your option. Deep health checks monitor instance health during creation and after software updates, automatically recovering faulty instances through reboots or replacements when enabled.

1. If your instance type supports GPU partitioning with Multi-Instance GPU (MIG), you can enable GPU partition configuration for the instance group. GPU partitioning allows you to divide GPUs into smaller, isolated partitions for improved resource utilization. For more information, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).

   1. Toggle **Use GPU partition** to enable GPU partitioning for this instance group.

   1. Select a **GPU partition profile** from the available options for your instance type. Each profile defines the GPU slice configuration and memory allocation.

1. Choose **Add instance group**.
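
As a sketch of the verification mentioned in the storage step above, you can connect to a cluster node through AWS Systems Manager and check the additional EBS volume. The cluster ID, instance group name, and instance ID below are placeholders; replace them with your own values:

```
# Connect to a cluster node through an SSM session. The target follows the
# SageMaker HyperPod convention:
# sagemaker-cluster:<cluster-id>_<instance-group-name>-<instance-id>
aws ssm start-session \
   --target "sagemaker-cluster:<cluster-id>_<instance-group-name>-<instance-id>"

# On the node, confirm that the additional EBS volume is mounted
# at the default mount path.
df -h /opt/sagemaker
```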

### Quick setup defaults
<a name="smcluster-getting-started-eks-console-create-cluster-default-settings"></a>

This section lists all the default settings for your cluster creation, including all the new AWS resources that will be created during the cluster creation process. Review the default settings.

## Custom setup
<a name="smcluster-getting-started-eks-console-create-cluster-custom"></a>

On the **Custom setup** section, follow these steps to create your first HyperPod cluster with Amazon EKS orchestration.

### General settings
<a name="smcluster-getting-started-eks-console-create-cluster-custom-general"></a>

Specify a name for the new cluster. You can’t change the name after the cluster is created.

For **Instance recovery**, choose **Automatic - *recommended*** or **None**. 

### Networking
<a name="smcluster-getting-started-eks-console-create-cluster-custom-network"></a>

Configure network settings within the cluster and in-and-out of the cluster. For orchestration of SageMaker HyperPod cluster with Amazon EKS, the VPC is automatically set to the one configured with the EKS cluster you selected.

1. For **VPC**, choose your own VPC if you already have one that gives SageMaker AI access to your VPC. To create a new VPC, follow the instructions at [Create a VPC](https://docs.aws.amazon.com/vpc/latest/userguide/create-vpc.html) in the *Amazon Virtual Private Cloud User Guide*. You can leave it as **None** to use the default SageMaker AI VPC.

1. For **VPC IPv4 CIDR block**, enter the starting IP of your VPC.

1. For **Availability Zones**, choose the Availability Zones (AZ) where HyperPod will create subnets for your cluster. Choose AZs that match the location of your accelerated compute capacity.

1. For **Security group(s)**, choose security groups that are either attached to the Amazon EKS cluster or whose inbound traffic is permitted by the security group associated with the Amazon EKS cluster. To create new security groups, go to the Amazon VPC console.

### Orchestration
<a name="smcluster-getting-started-eks-console-create-cluster-custom-orchestration"></a>

Follow these steps to create or select an Amazon EKS cluster to use as an orchestrator. 

1. For **EKS cluster**, choose whether to create a new Amazon EKS cluster or use an existing one. 

   If you need to create a new EKS cluster, you can create it from the EKS cluster section without having to open the Amazon EKS console.
**Note**  
The VPC subnet you choose for HyperPod must be private.  
After you submit a new EKS cluster creation request, wait until the EKS cluster becomes `Active`.

1. For **Kubernetes version**, choose a version from the drop-down menu. For more information about Kubernetes versions, see [Understand the Kubernetes version lifecycle on EKS](https://docs.aws.amazon.com//eks/latest/userguide/kubernetes-versions.html) from the *Amazon EKS User Guide*.

1. For **Operators**, choose **Use default Helm charts and add-ons** or **Don't install operators**. The option defaults to **Use default Helm charts and add-ons**, which will be used to install operators on the EKS cluster. For more information about the default Helm charts and add-ons, see [https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart/HyperPodHelmChart](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart/HyperPodHelmChart) from the GitHub repository. For more information, see [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md).

1. For **Enabled operators**, view the list of enabled operators. To edit the operators, clear the checkbox at the top and choose the operators to enable for the EKS cluster. 
**Note**  
To use HyperPod with EKS, you must install Helm charts and add-ons that enable operators on the EKS cluster. These components configure EKS as the control plane for HyperPod and provide the necessary setup for workload management and orchestration.

### Instance groups
<a name="smcluster-getting-started-eks-console-create-cluster-custom-instance-groups"></a>

To add an instance group, choose **Add group**. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. To deploy a cluster, you must add at least one instance group. Follow these steps to add an instance group.

1. For **Instance group type**, choose **Standard** or **Restricted Instance Group (RIG)**. Typically, you will choose **Standard**, which provides a general purpose computing environment without additional security restrictions. **Restricted Instance Group (RIG)** is a specialized environment for foundation model customization, such as Amazon Nova. For more information about setting up a RIG for Amazon Nova model customization, see Amazon Nova customization on SageMaker HyperPod in the [Amazon Nova 1.0 user guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp.html) or the [Amazon Nova 2.0 user guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp.html).

1. For **Name**, specify a name for the instance group.

1.  For **Instance capacity**, choose either on-demand capacity or a training plan to reserve your compute resources.

1. For **Instance type**, choose the instance for the instance group.
**Important**  
Ensure that you choose an instance type with sufficient quotas and enough unassigned IP addresses for your account. To view or request additional quotas, see [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

1. For **Instance quantity**, specify an integer that doesn't exceed your instance quota for cluster usage. For this tutorial, enter **1** for each instance group.

1. For **Target Availability Zone**, choose the Availability Zone where your instances will be provisioned. The Availability Zone should correspond to the location of your accelerated compute capacity.

1. For **Additional storage volume per instance (GB) - optional**, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group. The default mount path for the additional EBS volume is `/opt/sagemaker`. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify if the EBS volume is mounted correctly by running the `df -h` command. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the [Amazon EBS volumes](https://docs.aws.amazon.com/ebs/latest/userguide/ebs-volumes.html) section in the *Amazon Elastic Block Store User Guide*.

1. For **Instance deep health checks**, choose your option. Deep health checks monitor instance health during creation and after software updates, automatically recovering faulty instances through reboots or replacements when enabled. To learn more, see [Deep health checks](sagemaker-hyperpod-eks-resiliency-deep-health-checks.md).

1. For **Use GPU partition - optional**, if your instance type supports GPU partitioning with Multi-Instance GPU (MIG), you can enable this option to configure the GPU partition profile for the instance group. GPU partitioning allows you to divide GPUs into smaller, isolated partitions for improved resource utilization. For more information, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).

   1. Toggle **Use GPU partition** to enable GPU partitioning for this instance group.

   1. Select a **GPU partition profile** from the available options for your instance type. Each profile defines the GPU slice configuration and memory allocation.

1. Choose **Add instance group**.

### Lifecycle scripts
<a name="smcluster-getting-started-eks-console-create-cluster-custom-lifecycle"></a>

You can choose to use the default lifecycle scripts or your own custom lifecycle scripts, which are stored in your Amazon S3 bucket. You can view the default lifecycle scripts in the [Awesome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/7.sagemaker-hyperpod-eks/LifecycleScripts). To learn more about lifecycle scripts, see [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md).

1. For **Lifecycle scripts**, choose to use default or custom lifecycle scripts.

1. For **S3 bucket for lifecycle scripts**, choose to create a new bucket or use an existing bucket to store the lifecycle scripts.

### Permissions
<a name="smcluster-getting-started-eks-console-create-cluster-custom-permissions"></a>

Choose or create an IAM role that allows HyperPod to run and access necessary AWS resources on your behalf. For more information, see [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod).

### Storage
<a name="smcluster-getting-started-eks-console-create-cluster-custom-storage"></a>

Configure the FSx for Lustre file system to be provisioned on the HyperPod cluster.

1. For **File system**, choose to use an existing FSx for Lustre file system, create a new FSx for Lustre file system, or not provision an FSx for Lustre file system.

1. For **Throughput per unit of storage**, choose the throughput that will be available per TiB of provisioned storage.

1. For **Storage capacity**, enter a capacity value in TB.

1. For **Data compression type**, choose **LZ4** to enable data compression.

1. For **Lustre version**, view the recommended value for new file systems.

### Tags - optional
<a name="smcluster-getting-started-eks-console-create-cluster-tags"></a>

For **Tags - *optional***, add key and value pairs to the new cluster and manage the cluster as an AWS resource. To learn more, see [Tagging your AWS resources](https://docs.aws.amazon.com/tag-editor/latest/userguide/tagging.html).

## Deploy resources
<a name="smcluster-getting-started-eks-console-create-cluster-deploy"></a>

After you complete the cluster configurations using either **Quick setup** or **Custom setup**, choose the following option to start resource provisioning and cluster creation.
+ **Submit** - SageMaker AI starts provisioning the default configuration resources and creating the cluster. 
+ **Download CloudFormation template parameters** - Download the configuration parameter JSON file and run an AWS CLI command to deploy the CloudFormation stack that provisions the configuration resources and creates the cluster. You can edit the downloaded parameter JSON file if needed. If you choose this option, see the instructions in [Creating SageMaker HyperPod clusters using CloudFormation templates](smcluster-getting-started-eks-console-create-cluster-cfn.md).

# Browsing, viewing, and editing SageMaker HyperPod clusters
<a name="sagemaker-hyperpod-eks-operate-console-ui-browse-view-edit"></a>

Use the following instructions to browse, view, and edit SageMaker HyperPod clusters orchestrated by Amazon EKS in the SageMaker AI console.

**Topics**
+ [To browse your SageMaker HyperPod clusters](#sagemaker-hyperpod-eks-operate-console-ui-browse-clusters)
+ [To view details of each SageMaker HyperPod cluster](#sagemaker-hyperpod-eks-operate-console-ui-view-details-of-clusters)
+ [To edit a SageMaker HyperPod cluster](#sagemaker-hyperpod-eks-operate-console-ui-edit-clusters)

## To browse your SageMaker HyperPod clusters
<a name="sagemaker-hyperpod-eks-operate-console-ui-browse-clusters"></a>

On the SageMaker HyperPod page in the SageMaker AI console, all created clusters are listed under the **Clusters** section, which provides a summary view of each cluster's name, ARN, status, and creation time.

## To view details of each SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-eks-operate-console-ui-view-details-of-clusters"></a>

Under **Clusters** on the SageMaker HyperPod page in the SageMaker AI console, each cluster name is a link. Choose a cluster name to see the details of that cluster.

## To edit a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-eks-operate-console-ui-edit-clusters"></a>

1. Under **Clusters** in the main pane of the SageMaker HyperPod console, choose the cluster you want to update.

1. Select your cluster, and choose **Edit**.

1. On the **Edit <your-cluster>** page, you can edit the configurations of existing instance groups, add more instance groups, delete instance groups, and change tags for the cluster. After making changes, choose **Submit**. 

   1. In the **Configure instance groups** section, you can add more instance groups by choosing **Create instance group**.

   1. In the **Configure instance groups** section, for an existing instance group, choose **Edit** to change its configuration or **Delete** to remove the instance group permanently.
**Important**  
When deleting an instance group, consider the following points:  
Your SageMaker HyperPod cluster must always maintain at least one instance group.
Ensure all critical data is backed up before removal.
The removal process cannot be undone.
**Note**  
Deleting an instance group will terminate all compute resources associated with that group.

   1. In the **Tags** section, you can update tags for the cluster.

# Deleting a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-eks-operate-console-ui-delete-cluster"></a>

Use the following instructions to delete SageMaker HyperPod clusters orchestrated by Amazon EKS in the SageMaker AI console.

1. Under **Clusters** in the main pane of the SageMaker HyperPod console, choose the cluster you want to delete.

1. Select your cluster, and choose **Delete**.

1. In the pop-up window for cluster deletion, review the cluster information carefully to confirm that you chose the right cluster to delete.

1. After you review the cluster information, choose **Yes, delete cluster**.

1. In the text field to confirm this deletion, type **delete**.

1. Choose **Delete** on the lower right corner of the pop-up window to finish sending the cluster deletion request.

**Note**  
When cluster deletion fails due to attached SageMaker HyperPod task governance policies, you will need to [Delete policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete.md).
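
If you prefer the AWS CLI, the same deletion can be requested with the `delete-cluster` operation. This is a sketch; the cluster name is a placeholder:

```
# Request deletion of the HyperPod cluster.
aws sagemaker delete-cluster \
   --cluster-name "<exampleClusterName>"

# Optionally confirm that the cluster no longer appears in the list.
aws sagemaker list-clusters \
   --name-contains "<exampleClusterName>"
```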

# Creating SageMaker HyperPod clusters using CloudFormation templates
<a name="smcluster-getting-started-eks-console-create-cluster-cfn"></a>

You can create SageMaker HyperPod clusters using the CloudFormation templates for HyperPod. You must install the AWS CLI to proceed.

**Topics**
+ [Configure resources in the console and deploy using CloudFormation](#smcluster-getting-started-eks-console-create-cluster-deploy-console)
+ [Configure and deploy resources using CloudFormation](#smcluster-getting-started-eks-console-create-cluster-deploy-cfn)

## Configure resources in the console and deploy using CloudFormation
<a name="smcluster-getting-started-eks-console-create-cluster-deploy-console"></a>

You can configure resources using the AWS Management Console and deploy using the CloudFormation templates.

Follow these steps.

1. Instead of choosing **Submit**, choose **Download CloudFormation template parameters** at the end of the tutorial in [Getting started with SageMaker HyperPod using the SageMaker AI console](smcluster-getting-started-slurm-console.md). The tutorial contains important configuration information that you need to create your cluster successfully.
**Important**  
If you choose **Submit**, you will not be able to deploy a cluster with the same name until you delete the cluster.

   After you choose **Download CloudFormation template parameters**, the **Using the configuration file to create the cluster using the AWS CLI** window will appear on the right side of the page.

1. On the **Using the configuration file to create the cluster using the AWS CLI** window, choose **Download configuration parameters file**. The file will be downloaded to your machine. You can edit the configuration JSON file based on your needs or leave it as-is, if no change is required.

1. In the terminal, navigate to the directory that contains the downloaded parameter file. The following command references it as `file://params.json`.

1. Run the [create-stack](https://docs.aws.amazon.com//cli/latest/reference/cloudformation/create-stack.html) AWS CLI command to deploy the CloudFormation stack that will provision the configured resources and create the HyperPod cluster.

   ```
   aws cloudformation create-stack \
       --stack-name my-stack \
       --template-url https://aws-sagemaker-hyperpod-cluster-setup.amazonaws.com/templates-slurm/main-stack-slurm-based-template.yaml \
       --parameters file://params.json \
       --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
   ```

1. To view the status of the resources provisioning, navigate to the [CloudFormation console](https://console.aws.amazon.com/cloudformation).

   After the cluster creation completes, view the new cluster under **Clusters** in the main pane of the SageMaker HyperPod console. You can check its status under the **Status** column.

1. After the cluster status changes to `InService`, you can start logging in to the cluster nodes. To access the cluster nodes and start running ML workloads, see [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md).
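
As an alternative to watching the CloudFormation console in the steps above, you can poll the stack from the CLI. The stack name matches the one used in the `create-stack` command:

```
# Block until stack creation finishes (returns an error on failure/rollback).
aws cloudformation wait stack-create-complete \
   --stack-name my-stack

# Print the current stack status.
aws cloudformation describe-stacks \
   --stack-name my-stack \
   --query "Stacks[0].StackStatus" \
   --output text
```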

## Configure and deploy resources using CloudFormation
<a name="smcluster-getting-started-eks-console-create-cluster-deploy-cfn"></a>

You can configure and deploy resources using the CloudFormation templates for SageMaker HyperPod.

Follow these steps.

1. Download a CloudFormation template for SageMaker HyperPod from the [sagemaker-hyperpod-cluster-setup](https://github.com/aws/sagemaker-hyperpod-cluster-setup) GitHub repository.

1. Run the [create-stack](https://docs.aws.amazon.com//cli/latest/reference/cloudformation/create-stack.html) AWS CLI command to deploy the CloudFormation stack that will provision the configured resources and create the HyperPod cluster.

   ```
   aws cloudformation create-stack \
       --stack-name my-stack \
       --template-url URL_of_the_file_that_contains_the_template_body \
       --parameters file://params.json \
       --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
   ```

1. To view the status of the resources provisioning, navigate to the CloudFormation console.

   After the cluster creation completes, view the new cluster under **Clusters** in the main pane of the SageMaker HyperPod console. You can check its status under the **Status** column.

1. After the cluster status changes to `InService`, you can start logging in to the cluster nodes.

# Managing SageMaker HyperPod EKS clusters using the AWS CLI
<a name="sagemaker-hyperpod-eks-operate-cli-command"></a>

The following topics provide guidance on writing SageMaker HyperPod API request files in JSON format and running them using AWS CLI commands.

**Topics**
+ [Creating a SageMaker HyperPod cluster](sagemaker-hyperpod-eks-operate-cli-command-create-cluster.md)
+ [Retrieving SageMaker HyperPod cluster details](sagemaker-hyperpod-eks-operate-cli-command-cluster-details.md)
+ [Updating SageMaker HyperPod cluster configuration](sagemaker-hyperpod-eks-operate-cli-command-update-cluster.md)
+ [Updating the SageMaker HyperPod platform software](sagemaker-hyperpod-eks-operate-cli-command-update-cluster-software.md)
+ [Accessing SageMaker HyperPod cluster nodes](sagemaker-hyperpod-eks-operate-access-through-terminal.md)
+ [Scaling down a SageMaker HyperPod cluster](smcluster-scale-down.md)
+ [Deleting a SageMaker HyperPod cluster](sagemaker-hyperpod-eks-operate-cli-command-delete-cluster.md)

# Creating a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-eks-operate-cli-command-create-cluster"></a>

Learn how to create SageMaker HyperPod clusters orchestrated by Amazon EKS using the AWS CLI.

1. Before creating a SageMaker HyperPod cluster:

   1. Ensure that you have an existing Amazon EKS cluster up and running. For detailed instructions about how to set up an Amazon EKS cluster, see [Create an Amazon EKS cluster](https://docs.aws.amazon.com/eks/latest/userguide/create-cluster.html) in the *Amazon EKS User Guide*.

   1. Install the Helm chart as instructed in [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md). If you create an [Amazon Nova SageMaker HyperPod cluster](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp-cluster.html), you need a separate Helm chart.

1. Prepare a lifecycle configuration script and upload to an Amazon S3 bucket, such as `s3://amzn-s3-demo-bucket/Lifecycle-scripts/base-config/`.

   For a quick start, download the sample script [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/7.sagemaker-hyperpod-eks/LifecycleScripts/base-config/on_create.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/7.sagemaker-hyperpod-eks/LifecycleScripts/base-config/on_create.sh) from the AWSome Distributed Training GitHub repository, and upload it to the S3 bucket. You can also include additional setup instructions, a series of setup scripts, or commands to be executed during the HyperPod cluster provisioning stage.
**Important**  
If you create an [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod) attaching only the managed [https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-cluster.html](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-cluster.html), your cluster has access only to Amazon S3 buckets with the specific prefix `sagemaker-`.

   If you create a restricted instance group, you don't need to download and run the lifecycle script. Instead, you need to run `install_rig_dependencies.sh`. 

   The prerequisites for running the `install_rig_dependencies.sh` script include:
   + AWS Node (CNI) and CoreDNS must both be enabled. These are standard EKS add-ons that aren't managed by the standard SageMaker HyperPod Helm chart, but you can enable them in the EKS console under **Add-ons**.
   + The standard SageMaker HyperPod Helm chart must be installed before running this script.

   The `install_rig_dependencies.sh` script performs the following actions. 
   + `aws-node` (CNI): Creates a new `rig-aws-node` DaemonSet and patches the existing `aws-node` DaemonSet to avoid RIG nodes.
   + `coredns`: Converts CoreDNS to a DaemonSet for RIGs to support multi-RIG use and prevent overloading.
   + `training-operators`: Updates the training operators with RIG worker taint tolerations and a nodeAffinity favoring non-RIG instances.
   + Elastic Fabric Adapter (EFA): Updates the EFA plugin to tolerate the RIG worker taint and use the correct container images for each Region.
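
   As a sketch of the upload described at the start of this step, you can copy the sample lifecycle script to your bucket with the AWS CLI. The bucket name and path here are placeholders; replace them with your own:

   ```
   # Upload the lifecycle script to the S3 prefix that the cluster
   # configuration references in SourceS3Uri.
   aws s3 cp on_create.sh s3://amzn-s3-demo-bucket/Lifecycle-scripts/base-config/on_create.sh
   ```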

1. Prepare a [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) API request file in JSON format. For `ExecutionRole`, provide the ARN of the IAM role you created with the managed `AmazonSageMakerClusterInstanceRolePolicy` from the section [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod).
**Note**  
Ensure that your SageMaker HyperPod cluster is deployed within the same Virtual Private Cloud (VPC) as your Amazon EKS cluster. The subnets and security groups specified in the SageMaker HyperPod cluster configuration must allow network connectivity and communication with the Amazon EKS cluster's API server endpoint.

   ```
   // create_cluster.json
   {
       "ClusterName": "string",
       "InstanceGroups": [{
           "InstanceGroupName": "string",
           "InstanceType": "string",
           "InstanceCount": number,
           "LifeCycleConfig": {
               "SourceS3Uri": "s3://amzn-s3-demo-bucket-sagemaker/lifecycle-script-directory/src/",
               "OnCreate": "on_create.sh"
           },
           "ExecutionRole": "string",
           "ThreadsPerCore": number,
           "OnStartDeepHealthChecks": [
               "InstanceStress", "InstanceConnectivity"
           ]
       }],
       "RestrictedInstanceGroups": [ 
         { 
            "EnvironmentConfig": { 
               "FSxLustreConfig": { 
                  "PerUnitStorageThroughput": number,
                  "SizeInGiB": number
               }
            },
            "ExecutionRole": "string",
            "InstanceCount": number,
            "InstanceGroupName": "string",
            "InstanceStorageConfigs": [ 
               { ... }
            ],
            "InstanceType": "string",
            "OnStartDeepHealthChecks": [ "string" ],
            "OverrideVpcConfig": { 
               "SecurityGroupIds": [ "string" ],
               "Subnets": [ "string" ]
            },
            "ScheduledUpdateConfig": { 
               "DeploymentConfig": { 
                  "AutoRollbackConfiguration": [ 
                     { 
                        "AlarmName": "string"
                     }
                  ],
                  "RollingUpdatePolicy": { 
                     "MaximumBatchSize": { 
                        "Type": "string",
                        "Value": number
                     },
                     "RollbackMaximumBatchSize": { 
                        "Type": "string",
                        "Value": number
                     }
                  },
                  "WaitIntervalInSeconds": number
               },
               "ScheduleExpression": "string"
            },
            "ThreadsPerCore": number,
            "TrainingPlanArn": "string"
         }
      ],
       "VpcConfig": {
           "SecurityGroupIds": ["string"],
           "Subnets": ["string"]
       },
       "Tags": [{
           "Key": "string",
           "Value": "string"
       }],
       "Orchestrator": {
           "Eks": {
               "ClusterArn": "string",
               "KubernetesConfig": {
                   "Labels": {
                       "nvidia.com/mig.config": "all-3g.40gb"
                   }
               }
           }
       },
       "NodeRecovery": "Automatic"
   }
   ```
**Flexible instance groups**  
Instead of specifying a single `InstanceType`, you can use the `InstanceRequirements` parameter to specify multiple instance types for an instance group. Note the following:  
`InstanceType` and `InstanceRequirements` are mutually exclusive. You must specify one or the other, but not both.
`InstanceRequirements.InstanceTypes` is an ordered list that determines provisioning priority. SageMaker HyperPod attempts to provision the first instance type in the list and falls back to subsequent types if capacity is unavailable. You can specify up to 20 instance types, and the list must not contain duplicates.
Flexible instance groups require continuous node provisioning mode.
The following example shows an instance group using `InstanceRequirements`:  

   ```
   {
       "InstanceGroupName": "flexible-ig",
       "InstanceRequirements": {
           "InstanceTypes": ["ml.p5.48xlarge", "ml.p4d.24xlarge", "ml.g6.48xlarge"]
       },
       "InstanceCount": 10,
       "LifeCycleConfig": {
           "SourceS3Uri": "s3://amzn-s3-demo-bucket-sagemaker/lifecycle-script-directory/src/",
           "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster"
   }
   ```
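The mutual-exclusion and list rules above can be checked client-side before submitting the request. The following is a minimal sketch; the helper name and messages are illustrative, not part of any SDK:

```python
# Illustrative client-side checks for a flexible instance group definition.
# The rules mirror the text above: exactly one of InstanceType or
# InstanceRequirements, 1-20 instance types, and no duplicates.
def validate_flexible_group(group: dict) -> list:
    problems = []
    has_type = "InstanceType" in group
    reqs = group.get("InstanceRequirements")
    if has_type == (reqs is not None):
        problems.append("Specify exactly one of InstanceType or InstanceRequirements")
    if reqs is not None:
        types = reqs.get("InstanceTypes", [])
        if not 1 <= len(types) <= 20:
            problems.append("InstanceTypes must list between 1 and 20 instance types")
        if len(types) != len(set(types)):
            problems.append("InstanceTypes must not contain duplicates")
    return problems
```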

   Note the following when configuring a new SageMaker HyperPod cluster to associate with an EKS cluster.
   + You can configure up to 20 instance groups under the `InstanceGroups` parameter.
   + For `Orchestrator.Eks.ClusterArn`, specify the ARN of the EKS cluster you want to use as the orchestrator.
   + For `OnStartDeepHealthChecks`, add `InstanceStress` and `InstanceConnectivity` to enable [Deep health checks](sagemaker-hyperpod-eks-resiliency-deep-health-checks.md).
   + For `NodeRecovery`, specify `Automatic` to enable automatic node recovery. SageMaker HyperPod replaces or reboots instances (nodes) when issues are found by the health-monitoring agent.
   + For the `Tags` parameter, you can add custom tags for managing the SageMaker HyperPod cluster as an AWS resource. You can add tags to your cluster in the same way you add them in other AWS services that support tagging. To learn more about tagging AWS resources in general, see [Tagging AWS Resources User Guide](https://docs.aws.amazon.com/tag-editor/latest/userguide/tagging.html).
   + For the `VpcConfig` parameter, specify the information of the VPC used in the EKS cluster. The subnets must be private.
   + For `Orchestrator.Eks.KubernetesConfig.Labels`, you can optionally specify Kubernetes labels to apply to the nodes. To enable GPU partitioning with Multi-Instance GPU (MIG), add the `nvidia.com/mig.config` label with the desired MIG profile. For example, `"nvidia.com/mig.config": "all-3g.40gb"` configures all GPUs with the 3g.40gb partition profile. For more information about GPU partitioning and available profiles, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).
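Several of the notes above can be verified programmatically before you submit the request. The following is a hypothetical pre-flight check over the parsed `create_cluster.json`; the function name and messages are illustrative:

```python
# Hypothetical pre-flight checks over a parsed create_cluster.json, mirroring
# the configuration notes above. Field names follow the CreateCluster request
# shape; the checks and messages are illustrative, not exhaustive.
def preflight(request: dict) -> list:
    problems = []
    groups = request.get("InstanceGroups", [])
    if not 1 <= len(groups) <= 20:
        problems.append("Configure between 1 and 20 instance groups")
    eks_arn = request.get("Orchestrator", {}).get("Eks", {}).get("ClusterArn", "")
    if not eks_arn.startswith("arn:aws:eks:"):
        problems.append("Orchestrator.Eks.ClusterArn must be an EKS cluster ARN")
    if request.get("NodeRecovery") not in ("Automatic", "None"):
        problems.append("NodeRecovery must be 'Automatic' or 'None'")
    for group in groups:
        unknown = set(group.get("OnStartDeepHealthChecks", [])) - {"InstanceStress", "InstanceConnectivity"}
        if unknown:
            problems.append(f"Unknown deep health checks: {sorted(unknown)}")
    return problems
```

For example, you could run `preflight(json.load(open("create_cluster.json")))` and only proceed to `create-cluster` when it returns an empty list.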

1. Run the [create-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-cluster.html) command as follows.
**Important**  
When running the `create-cluster` command with the `--cli-input-json` parameter, you must include the `file://` prefix before the complete path to the JSON file. This prefix is required to ensure that the AWS CLI recognizes the input as a file path. Omitting the `file://` prefix results in a parsing parameter error.

   ```
   aws sagemaker create-cluster \
       --cli-input-json file://complete/path/to/create_cluster.json
   ```

   This should return the ARN of the new cluster.
**Important**  
You can use the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) operation to remove a restricted instance group (RIG). When a RIG is scaled down to 0, the FSx for Lustre file system won't be deleted. To completely remove the FSx for Lustre file system, you must remove the RIG entirely.   
Removing a RIG will not delete any artifacts stored in the service-managed Amazon S3 bucket. However, you should ensure all artifacts in the FSx for Lustre file system are fully synchronized to Amazon S3 before removal. We recommend waiting at least 30 minutes after job completion to ensure complete synchronization of all artifacts from the FSx for Lustre file system to the service-managed Amazon S3 bucket.
**Important**  
When using an onboarded On-Demand Capacity Reservation (ODCR), you must map your instance group to the same Availability Zone ID (AZ ID) as the ODCR by setting `OverrideVpcConfig` with a subnet in the matching AZ ID.  
Verify your `OverrideVpcConfig` configuration before deployment to avoid incurring duplicate charges for both the ODCR and On-Demand capacity.

# Retrieving SageMaker HyperPod cluster details
<a name="sagemaker-hyperpod-eks-operate-cli-command-cluster-details"></a>

Learn how to retrieve SageMaker HyperPod cluster details using the AWS CLI.

## Describe a cluster
<a name="sagemaker-hyperpod-eks-operate-cli-command-describe-cluster"></a>

Run [describe-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-cluster.html) to check the status of the cluster. You can specify either the name or the ARN of the cluster.

```
aws sagemaker describe-cluster --cluster-name your-hyperpod-cluster
```

After the status of the cluster changes to **InService**, proceed to the next step. You can also use this API to retrieve failure messages from other HyperPod API operations.
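If you script the polling, you can interpret the parsed `describe-cluster` JSON along these lines. This sketch is illustrative and assumes only the documented `ClusterStatus` and `FailureMessage` response fields:

```python
# Sketch of interpreting a parsed describe-cluster response. Assumes only the
# documented ClusterStatus and FailureMessage fields; the helper is hypothetical.
def cluster_ready(response: dict) -> bool:
    status = response.get("ClusterStatus")
    if status == "Failed":
        raise RuntimeError(response.get("FailureMessage", "cluster entered Failed state"))
    return status == "InService"
```

In a polling loop, you would feed this function the parsed output of `aws sagemaker describe-cluster` (or the equivalent boto3 `describe_cluster` response) and retry until it returns `True`.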

## List details of cluster nodes
<a name="sagemaker-hyperpod-eks-operate-cli-command-list-cluster-nodes"></a>

Run [list-cluster-nodes](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-cluster-nodes.html) to check the key information of the cluster nodes.

```
aws sagemaker list-cluster-nodes --cluster-name your-hyperpod-cluster
```

The response includes an `InstanceId` for each node, which you need for logging in to the nodes using `aws ssm`.
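If you script this step, you can collect the instance IDs from the parsed `list-cluster-nodes` JSON. This sketch assumes the `ClusterNodeSummaries` response shape from the API reference:

```python
# Sketch: collect the instance IDs from a parsed list-cluster-nodes response.
# Assumes the ClusterNodeSummaries shape documented in the API reference.
def instance_ids(response: dict) -> list:
    return [node["InstanceId"] for node in response.get("ClusterNodeSummaries", [])]
```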

## Describe details of a cluster node
<a name="sagemaker-hyperpod-eks-operate-cli-command-describe-cluster-node"></a>

Run [describe-cluster-node](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-cluster-node.html) to retrieve details of a cluster node. You can get the cluster node ID from the `list-cluster-nodes` output. You can specify either the name or the ARN of the cluster.

```
aws sagemaker describe-cluster-node \
    --cluster-name your-hyperpod-cluster \
    --node-id i-111222333444555aa
```

## List clusters
<a name="sagemaker-hyperpod-eks-operate-cli-command-list-clusters"></a>

Run [list-clusters](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-clusters.html) to list all clusters in your account.

```
aws sagemaker list-clusters
```

You can also add flags to filter the list of clusters. To learn more about the underlying API operation and the flags available for filtering, see the [ListClusters](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListClusters.html) API reference.

# Updating SageMaker HyperPod cluster configuration
<a name="sagemaker-hyperpod-eks-operate-cli-command-update-cluster"></a>

Run [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) to update the configuration of a cluster.

**Note**  
Important considerations:  
You cannot change the EKS cluster that your HyperPod cluster is associated with after the cluster is created. 
If deep health checks are running on the cluster, this API doesn't function as expected and returns an error message stating that deep health checks are in progress. Wait until the deep health checks finish before updating the cluster.

1. Create an [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) API request file in JSON format. Make sure that you specify the right cluster name and instance group name to update. For each instance group, you can change the instance type, the number of instances, the lifecycle configuration entrypoint script, and the path to the script.
**Note**  
You can use the `UpdateCluster` API to scale down or remove entire instance groups from your SageMaker HyperPod cluster. For additional instructions on how to scale down or delete instance groups, see [Scaling down a SageMaker HyperPod cluster](smcluster-scale-down.md).

   1. For `ClusterName`, specify the name of the cluster you want to update.

   1. For `InstanceGroupName`

      1. To update an existing instance group, specify the name of the instance group you want to update.

      1. To add a new instance group, specify a new name not existing in your cluster.

   1. For `InstanceType`

      1. To update an existing instance group, specify the same instance type that you initially configured for the group.

      1. To add a new instance group, specify an instance type you want to configure the group with.

      For instance groups that use `InstanceRequirements` instead of `InstanceType`, you can add or remove instance types from the `InstanceTypes` list. However, you cannot remove an instance type that has active nodes running on it. You also cannot switch between `InstanceType` and `InstanceRequirements` when updating an existing instance group. `InstanceType` and `InstanceRequirements` are mutually exclusive.

   1. For `InstanceCount`

      1. To update an existing instance group, specify an integer that corresponds to your desired number of instances. You can provide a higher or lower value (down to 0) to scale the instance group up or down.

      1. To add a new instance group, specify an integer greater than or equal to 1.

   1. For `LifeCycleConfig`, you can change the values of both `SourceS3Uri` and `OnCreate` as needed to update the instance group.

   1. For `ExecutionRole`

      1. For updating an existing instance group, keep using the same IAM role you attached during cluster creation.

      1. For adding a new instance group, specify an IAM role you want to attach.

   1. For `ThreadsPerCore`

      1. For updating an existing instance group, keep using the same value you specified during cluster creation.

      1. For adding a new instance group, you can choose any value from the allowed options per instance type. For more information, search the instance type and see the **Valid threads per core** column in the reference table at [CPU cores and threads per CPU core per instance type](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/cpu-options-supported-instances-values.html) in the *Amazon EC2 User Guide*.

   1. For `OnStartDeepHealthChecks`, add `InstanceStress` and `InstanceConnectivity` to enable [Deep health checks](sagemaker-hyperpod-eks-resiliency-deep-health-checks.md).

   1. For `NodeRecovery`, specify `Automatic` to enable automatic node recovery. SageMaker HyperPod replaces or reboots instances (nodes) when issues are found by the health-monitoring agent.

   The following code snippet is a JSON request file template you can use. For more information about the request syntax and parameters of this API, see the [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) API reference.

   ```
   // update_cluster.json
   {
       // Required
       "ClusterName": "name-of-cluster-to-update",
       // Required
       "InstanceGroups": [{
           "InstanceGroupName": "string",
           "InstanceType": "string",
           "InstanceCount": number,
           "LifeCycleConfig": {
               "SourceS3Uri": "string",
               "OnCreate": "string"
           },
           "ExecutionRole": "string",
           "ThreadsPerCore": number,
           "OnStartDeepHealthChecks": [
               "InstanceStress", "InstanceConnectivity"
           ]
       }],
       "NodeRecovery": "Automatic"
   }
   ```

1. Run the following `update-cluster` command to submit the request. 

   ```
   aws sagemaker update-cluster \
       --cli-input-json file://complete/path/to/update_cluster.json
   ```

# Updating the SageMaker HyperPod platform software
<a name="sagemaker-hyperpod-eks-operate-cli-command-update-cluster-software"></a>

When you create your SageMaker HyperPod cluster, SageMaker HyperPod selects an Amazon Machine Image (AMI) corresponding to the Kubernetes version of your Amazon EKS cluster.

Run [update-cluster-software](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster-software.html) to update existing clusters with software and security patches provided by the SageMaker HyperPod service. For `--cluster-name`, specify either the name or the ARN of the cluster to update.

**Important**  
When this API is called, SageMaker HyperPod doesn’t drain or redistribute the jobs (Pods) running on the nodes. Make sure to check if there are any jobs running on the nodes before calling this API.
The patching process replaces the root volume with the updated AMI, which means that your previous data stored in the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre.
All cluster nodes experience downtime (nodes appear as `NotReady` in the output of `kubectl get node`) while the patching is in progress. We recommend that you terminate all workloads before patching and resume them after the patch completes.  
If the security patch fails, you can retrieve failure messages by running the [DescribeCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeCluster.html) API as instructed at [Describe a cluster](sagemaker-hyperpod-eks-operate-cli-command-cluster-details.md#sagemaker-hyperpod-eks-operate-cli-command-describe-cluster).

```
aws sagemaker update-cluster-software --cluster-name your-hyperpod-cluster
```

**Rolling upgrades with flexible instance groups**  
For instance groups that use `InstanceRequirements` with multiple instance types, rolling upgrades spread each instance type proportionally across batches. For example, if an instance group has 100 instances (10 P5 and 90 G6) and you configure a 10% batch size, each batch contains 1 P5 instance and 9 G6 instances.

 When calling the `UpdateClusterSoftware` API, SageMaker HyperPod updates the Kubernetes version of the nodes by selecting the latest [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami) based on the Kubernetes version of your Amazon EKS cluster. It then runs the lifecycle scripts in the Amazon S3 bucket that you specified during the cluster creation or update. 

You can verify the kubelet version of a node by running the `kubectl describe node` command.

The Kubernetes version of SageMaker HyperPod cluster nodes does not automatically update when you update your Amazon EKS cluster version. After updating the Kubernetes version for your Amazon EKS cluster, you must use the `UpdateClusterSoftware` API to update your SageMaker HyperPod cluster nodes to the same Kubernetes version.

We recommend updating your SageMaker HyperPod cluster after updating your Amazon EKS nodes, and avoiding more than one version difference between the Amazon EKS cluster version and the SageMaker HyperPod cluster node version.

The SageMaker HyperPod service team regularly rolls out new [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami)s to enhance security and improve the user experience. We recommend that you always update to the latest SageMaker HyperPod DLAMI. For future SageMaker HyperPod DLAMI security patching updates, see [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md).

**Note**  
You can only run this API programmatically. The patching functionality is not implemented in the SageMaker HyperPod console UI.

# Accessing SageMaker HyperPod cluster nodes
<a name="sagemaker-hyperpod-eks-operate-access-through-terminal"></a>

You can directly access the nodes of a SageMaker HyperPod cluster in service using the AWS CLI commands for AWS Systems Manager (SSM). Run `aws ssm start-session` with the host name of the node in the format `sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id]`. You can retrieve the cluster ID, the instance ID, and the instance group name from the [SageMaker HyperPod console](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-view-details-of-clusters) or by running `describe-cluster` and `list-cluster-nodes` from the [AWS CLI commands for SageMaker HyperPod](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-list-cluster-nodes). For example, if your cluster ID is `aa11bbbbb222`, the instance group name is `controller-group`, and the instance ID is `i-111222333444555aa`, the SSM `start-session` command should be the following.

**Note**  
If you haven't set up AWS Systems Manager, follow the instructions provided at [Setting up AWS Systems Manager and Run As for cluster user access control](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-ssm).

```
$ aws ssm start-session \
    --target sagemaker-cluster:aa11bbbbb222_controller-group-i-111222333444555aa \
    --region us-west-2
Starting session with SessionId: s0011223344aabbccdd
root@ip-111-22-333-444:/usr/bin#
```
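If you connect to nodes frequently, a small helper can assemble the target string in the documented format (the helper name is illustrative):

```python
# Build the SSM start-session target string in the documented format:
# sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id]
def ssm_target(cluster_id: str, instance_group: str, instance_id: str) -> str:
    return f"sagemaker-cluster:{cluster_id}_{instance_group}-{instance_id}"
```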

# Scaling down a SageMaker HyperPod cluster
<a name="smcluster-scale-down"></a>

You can scale down the number of instances running on your Amazon SageMaker HyperPod cluster. You might want to scale down a cluster for various reasons, such as reduced resource utilization or cost optimization.

The following page outlines two main approaches to scaling down:
+ **Scale down at the instance group level:** This approach uses the `UpdateCluster` API, with which you can:
  + Scale down the instance counts for specific instance groups independently. SageMaker AI handles the termination of nodes in a way that reaches the new target instance counts you've set for each group. See [Scale down an instance group](#smcluster-scale-down-updatecluster).
  + Completely delete instance groups from your cluster. See [Delete instance groups](#smcluster-remove-instancegroup).
+ **Scale down at the instance level:** This approach uses the `BatchDeleteClusterNodes` API, with which you can specify the individual nodes you want to terminate. See [Scale down at the instance level](#smcluster-scale-down-batchdelete).

**Note**  
When scaling down at the instance level with `BatchDeleteClusterNodes`, you can only terminate a maximum of 99 instances at a time. `UpdateCluster` supports terminating any number of instances.

## Important considerations
<a name="smcluster-scale-down-considerations"></a>
+ When scaling down a cluster, you should ensure that the remaining resources are sufficient to handle your workload and that any necessary data migration or rebalancing is properly handled to avoid disruptions. 
+ Make sure to back up your data to Amazon S3 or an FSx for Lustre file system before invoking the API on a worker node group. This can help prevent any potential data loss from the instance root volume. For more information about backup, see [Use the backup script provided by SageMaker HyperPod](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software-backup).
+ To invoke this API on an existing cluster, you must first patch the cluster by running the [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API. For more information about patching a cluster, see [Update the SageMaker HyperPod platform software of a cluster](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software).
+ Metering/billing for on-demand instances will automatically be stopped after scale down. To stop metering for scaled-down reserved instances, you should reach out to your AWS account team for support.
+ You can use the released capacity from the scaled-down reserved instances to scale up another SageMaker HyperPod cluster.

## Scale down at the instance group level
<a name="smcluster-scale-down-or-delete"></a>

The [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) operation allows you to make changes to the configuration of your SageMaker HyperPod cluster, such as scaling down the number of instances of an instance group or removing entire instance groups. This can be useful when you want to adjust the resources allocated to your cluster based on changes in your workload, optimize costs, or change the instance type of an instance group.

### Scale down an instance group
<a name="smcluster-scale-down-updatecluster"></a>

Use this approach when you have an instance group that is idle and it's safe to terminate any of the instances for scaling down. When you submit an `UpdateCluster` request to scale down, HyperPod randomly chooses instances for termination and scales down to the specified number of nodes for the instance group.

**Scale-down behavior with flexible instance groups**  
For instance groups that use `InstanceRequirements` with multiple instance types, HyperPod terminates the lowest-priority instance types first during scale-down. The priority is determined by the order of instance types in the `InstanceTypes` list, where the first type has the highest priority. This protects higher-priority instances, which are typically higher-performance, during scale-down operations.
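The priority-ordered selection described above can be sketched as follows. This is an illustration of the documented ordering only, not the service's actual selection algorithm:

```python
# Illustration of priority-ordered scale-down: instance types later in the
# InstanceTypes list (lower priority) are terminated first.
def select_for_termination(nodes: list, instance_types: list, count: int) -> list:
    rank = {t: i for i, t in enumerate(instance_types)}
    lowest_priority_first = sorted(
        nodes,
        key=lambda n: rank.get(n["InstanceType"], len(instance_types)),
        reverse=True,
    )
    return lowest_priority_first[:count]
```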

**Note**  
When you scale the number of instances in an instance group down to 0, all the instances within that group will be terminated. However, the instance group itself will still exist as part of the SageMaker HyperPod cluster. You can scale the instance group back up at a later time, using the same instance group configuration.   
Alternatively, you can choose to remove an instance group permanently. For more information, see [Delete instance groups](#smcluster-remove-instancegroup).

**To scale down with `UpdateCluster`**

1. Follow the steps outlined in [Updating SageMaker HyperPod cluster configuration](sagemaker-hyperpod-eks-operate-cli-command-update-cluster.md). When you reach step **1.d** where you specify the **InstanceCount** field, enter a number that is smaller than the current number of instances to scale down the cluster.

1. Run the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) AWS CLI command to submit your request.

The following is an example of an `UpdateCluster` JSON object. Consider the case where your instance group currently has 2 running instances. If you set the **InstanceCount** field to 1, as shown in the example, then HyperPod randomly selects one of the instances and terminates it.

```
{
  "ClusterName": "name-of-cluster-to-update",
  "InstanceGroups": [
    {
      "InstanceGroupName": "training-instances",
      "InstanceType": "instance-type",
      "InstanceCount": 1,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://amzn-s3-demo-bucket/training-script.py",
        "OnCreate": "s3://amzn-s3-demo-bucket/setup-script.sh"
      },
      "ExecutionRole": "arn:aws:iam::123456789012:role/SageMakerRole",
      "ThreadsPerCore": number-of-threads,
      "OnStartDeepHealthChecks": [
        "InstanceStress",
        "InstanceConnectivity"
      ]
    }
  ],
  "NodeRecovery": "Automatic"
}
```

### Delete instance groups
<a name="smcluster-remove-instancegroup"></a>

You can use the [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) operation to remove entire instance groups from your SageMaker HyperPod cluster when they are no longer needed. This goes beyond simple scaling down, allowing you to completely eliminate specific instance groups from your cluster's configuration. 

**Note**  
When removing an instance group:  
All instances within the targeted group are terminated.
The entire group configuration is deleted from the cluster.
Any workloads running on that instance group are stopped.

**To delete instance groups with `UpdateCluster`**

1. When following the steps outlined in [Updating SageMaker HyperPod cluster configuration](sagemaker-hyperpod-eks-operate-cli-command-update-cluster.md):

   1. Set the optional `InstanceGroupsToDelete` parameter in your `UpdateCluster` JSON and pass the comma-separated list of instance group names that you want to delete.

   1. When you specify the `InstanceGroups` list, ensure that the instance groups you are removing are no longer included in it.

1. Run the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) AWS CLI command to submit your request.

**Important**  
Your SageMaker HyperPod cluster must always maintain at least one instance group.
Ensure all critical data is backed up before removal.
The removal process cannot be undone.

The following is an example of an `UpdateCluster` JSON object. Consider the case where a cluster currently has 3 instance groups, a *training*, a *prototype-training*, and an *inference-serving* group. You want to delete the *prototype-training* group.

```
{
  "ClusterName": "name-of-cluster-to-update",
  "InstanceGroups": [
    {
      "InstanceGroupName": "training",
      "InstanceType": "instance-type",
      "InstanceCount": ,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://amzn-s3-demo-bucket/training-script.py",
        "OnCreate": "s3://amzn-s3-demo-bucket/setup-script.sh"
      },
      "ExecutionRole": "arn:aws:iam::123456789012:role/SageMakerRole",
      "ThreadsPerCore": number-of-threads,
      "OnStartDeepHealthChecks": [
        "InstanceStress",
        "InstanceConnectivity"
      ]
    },
    {
      "InstanceGroupName": "inference-serving",
      "InstanceType": "instance-type",
      "InstanceCount": 2,
      [...]
    }
  ],
  "InstanceGroupsToDelete": [ "prototype-training" ],
  "NodeRecovery": "Automatic"
}
```
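The steps above can be scripted. This illustrative sketch builds such a request body from the current group specifications; the helper itself is hypothetical:

```python
# Illustrative helper that builds an UpdateCluster request body removing the
# named instance groups: the remaining groups keep their specifications in
# InstanceGroups, and the removed groups are named in InstanceGroupsToDelete.
def build_delete_groups_request(cluster_name: str, current_groups: list, names_to_delete: list) -> dict:
    doomed = set(names_to_delete)
    return {
        "ClusterName": cluster_name,
        "InstanceGroups": [g for g in current_groups if g["InstanceGroupName"] not in doomed],
        "InstanceGroupsToDelete": sorted(doomed),
    }
```

You could then write the returned dictionary to a JSON file and submit it with `aws sagemaker update-cluster --cli-input-json file://...`.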

## Scale down at the instance level
<a name="smcluster-scale-down-batchdelete"></a>

The `BatchDeleteClusterNodes` operation allows you to scale down a SageMaker HyperPod cluster by specifying the individual nodes you want to terminate. `BatchDeleteClusterNodes` provides more granular control for targeted node removal and cluster optimization. For example, you might use `BatchDeleteClusterNodes` to delete targeted nodes for maintenance, rolling upgrades, or rebalancing resources geographically.

**API request and response**

When you submit a `BatchDeleteClusterNodes` request, SageMaker HyperPod deletes nodes by their instance IDs. The API accepts a request with the cluster name and a list of node IDs to be deleted. 

The response includes two sections: 
+ `Failed`: A list of errors of type [`BatchDeleteClusterNodesError`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_BatchDeleteClusterNodesError.html), one per failed instance ID.
+ `Successful`: The list of instance IDs that were successfully terminated.

**Validation and error handling**

The API performs various validations, such as:
+ Verifying the node ID format (prefix of `i-` and Amazon EC2 instance ID structure). 
+ Checking the node list length, with a limit of 99 or fewer node IDs in a single `BatchDeleteClusterNodes` request.
+ Ensuring that a valid SageMaker HyperPod cluster with the input cluster name exists and that no cluster-level operations (update, system update, patching, or deletion) are in progress.
+ Handling cases where instances are not found, have invalid status, or are in use.
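You can mirror the first two validations client-side to fail fast before calling the API. This sketch is illustrative; the exact server-side format rules may differ:

```python
import re

# Hypothetical client-side checks mirroring the first two validations above:
# node IDs use the i- prefix with an EC2-style hex suffix, and a single
# request carries at most 99 node IDs.
NODE_ID = re.compile(r"^i-[0-9a-f]+$")

def validate_node_ids(node_ids: list) -> list:
    problems = []
    if not 1 <= len(node_ids) <= 99:
        problems.append("Provide between 1 and 99 node IDs per request")
    problems += [f"Malformed node ID: {n}" for n in node_ids if not NODE_ID.match(n)]
    return problems
```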

**API Response Codes**
+ The API returns a `200` status code for fully successful requests (all input nodes pass validation) and for partially successful requests (some input nodes fail validation).
+ If all input nodes fail validation, the API returns a `400` Bad Request response with the appropriate error messages and error codes.

**Example**

The following is an example of **scaling down a cluster at the instance level** using the AWS CLI:

```
aws sagemaker batch-delete-cluster-nodes --cluster-name "cluster-name" --node-ids '["i-111112222233333", "i-444445555566666"]'
```

# Deleting a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-eks-operate-cli-command-delete-cluster"></a>

Run [delete-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-cluster.html) to delete a cluster. You can specify either the name or the ARN of the cluster.

```
aws sagemaker delete-cluster --cluster-name your-hyperpod-cluster
```

This API only cleans up the SageMaker HyperPod resources. It doesn't delete any resources associated with the EKS cluster, such as the Amazon EKS cluster itself, EKS Pod Identities, Amazon FSx volumes, EKS add-ons, or the initial configuration you added to your EKS cluster. To clean up all resources, make sure that you also delete the EKS resources separately. 

Make sure that you first delete the SageMaker HyperPod resources, followed by the EKS resources. Performing the deletion in the reverse order may result in lingering resources.

**Important**  
When this API is called, SageMaker HyperPod doesn’t drain or redistribute the jobs (Pods) running on the nodes. Make sure to check if there are any jobs running on the nodes before calling this API.

# HyperPod managed tiered checkpointing
<a name="managed-tier-checkpointing"></a>

This section explains how managed tiered checkpointing works and the benefits it provides for large-scale model training.

Amazon SageMaker HyperPod managed tiered checkpointing helps you train large-scale generative AI models more efficiently. It uses multiple storage tiers, including your cluster’s CPU memory, which reduces your time to recovery, minimizes loss of training progress, and makes use of underutilized memory resources in your training infrastructure.

Managed tiered checkpointing lets you save checkpoints to memory at a higher frequency, while periodically persisting them to durable storage. This maintains both performance and reliability during your training process.

This guide covers how to set up, configure, and use managed tiered checkpointing with PyTorch frameworks on Amazon EKS HyperPod clusters.

## How managed tiered checkpointing works
<a name="managed-tier-checkpointing-works"></a>

Managed tiered checkpointing uses a multi-tier storage approach. CPU memory serves as the primary tier to store model checkpoints. Secondary tiers include persistent storage options like Amazon S3.

When you save a checkpoint, the system stores it in allocated memory space across your cluster nodes. It automatically replicates data across adjacent compute nodes for enhanced reliability. This replication strategy protects against single or multiple node failures while providing fast access for recovery operations.

The system also periodically saves checkpoints to persistent storage according to your configuration. This ensures long-term durability of your training progress.

Key components include:
+ **Memory management system**: A memory management daemon that provides disaggregated memory as a service for checkpoint storage
+ **HyperPod Python library**: Interfaces with the disaggregated storage APIs and provides utilities for saving, loading, and managing checkpoints across tiers
+ **Checkpoint replication**: Automatically replicates checkpoints across multiple nodes for fault tolerance

The system integrates seamlessly with PyTorch training loops through simple API calls. It requires minimal changes to your existing code.

## Benefits
<a name="managed-tier-checkpointing-benefits"></a>

Managed tiered checkpointing delivers several advantages for large-scale model training:
+ **Improved usability**: Manages checkpoint save, replication, persistence, and recovery
+ **Faster checkpoint operations**: Memory-based storage provides faster save and load times compared to disk-based checkpointing, leading to faster recovery
+ **Fault tolerance**: Automatic checkpoint replication across nodes protects against hardware node failures
+ **Minimal code changes**: Simple API integration requires only minor modifications to existing training scripts
+ **Improved training throughput**: Reduced checkpoint overhead means more time spent on actual training

**Topics**
+ [How managed tiered checkpointing works](#managed-tier-checkpointing-works)
+ [Benefits](#managed-tier-checkpointing-benefits)
+ [Set up managed tiered checkpointing](managed-tier-checkpointing-setup.md)
+ [Removing managed tiered checkpointing](managed-tier-checkpointing-remove.md)
+ [Security considerations for managed tiered checkpointing](managed-tier-security-considerations.md)

# Set up managed tiered checkpointing
<a name="managed-tier-checkpointing-setup"></a>

This section describes the setup process for managed tiered checkpointing on Amazon SageMaker HyperPod. You'll learn how to enable the capability on your cluster and implement checkpointing in your training code.

**Topics**
+ [Prerequisites](#managed-tier-checkpointing-setup-prerequisites)
+ [Step 1: Enable managed tiered checkpointing for your cluster](#managed-tier-checkpointing-setup-step-enable-for-cluster)
+ [Step 2: Install the Python library in your training image](#managed-tier-checkpointing-setup-step-install-library)
+ [Step 3: Save checkpoints in your training loop](#managed-tier-checkpointing-setup-step-save-checkpoint-in-loop)
+ [Step 4: Load checkpoints for recovery](#managed-tier-checkpointing-setup-step-load-checkpoint)
+ [Validate your managed tiered checkpointing operations](#managed-tier-checkpointing-setup-validation)

## Prerequisites
<a name="managed-tier-checkpointing-setup-prerequisites"></a>

Before setting up managed tiered checkpointing, ensure you have:
+ An Amazon EKS HyperPod cluster with sufficient CPU memory available for checkpoint allocation
+ PyTorch training workloads or PyTorch Distributed Checkpoint (DCP) jobs (both are supported)
+ Appropriate IAM permissions for cluster management, including:
  + Amazon CloudWatch and Amazon S3 write permissions for the training pod to read/write checkpoints and push metrics
  + These permissions can be configured via [EKS OIDC setup](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html)

## Step 1: Enable managed tiered checkpointing for your cluster
<a name="managed-tier-checkpointing-setup-step-enable-for-cluster"></a>

**Important**  
You must opt in to use managed tiered checkpointing.

Enable managed tiered checkpointing through the HyperPod APIs when creating or updating your cluster. The service automatically installs the memory management system when you specify the `TieredStorageConfig` parameter.

For new clusters, use the [create-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-cluster.html) AWS CLI command.

```
aws sagemaker create-cluster \
    --cluster-name cluster-name \
    --orchestrator "Eks={ClusterArn=eks-cluster-arn}" \
    --instance-groups '{
        "InstanceGroupName": "instance-group-name",
        "InstanceType": "instance-type",
        "InstanceCount": instance-count,
        "LifeCycleConfig": {
            "SourceS3Uri": "s3-path-to-lifecycle-scripts",
            "OnCreate": "lifecycle-script-name"
        },
        "ExecutionRole": "instance-group-iam-role",
        "ThreadsPerCore": threads-per-core,
        "InstanceStorageConfigs": [
            { "EbsVolumeConfig": {"VolumeSizeInGB": volume-size} }
        ]
    }' \
    --vpc-config '{
        "SecurityGroupIds": ["security-group-ids"],
        "Subnets": ["subnets"]
    }' \
    --tiered-storage-config '{
        "Mode": "Enable",
        "InstanceMemoryAllocationPercentage": 50
    }'
```

The `InstanceMemoryAllocationPercentage` parameter specifies the percentage (an integer) of cluster memory to allocate for checkpointing. The valid range is 20-100.
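As an illustration, you can build and validate the request fragment in Python before passing it to the API. This is a sketch only; the 50 percent value is an arbitrary example within the documented range:

```python
def tiered_storage_config(mode="Enable", memory_pct=50):
    """Build a TieredStorageConfig request fragment.

    memory_pct is the percentage of cluster memory to allocate for
    checkpointing; the documented valid range is 20-100.
    """
    if not 20 <= memory_pct <= 100:
        raise ValueError("InstanceMemoryAllocationPercentage must be 20-100")
    return {"Mode": mode, "InstanceMemoryAllocationPercentage": memory_pct}

config = tiered_storage_config()
# Pass as --tiered-storage-config (AWS CLI) or TieredStorageConfig (SDK).
```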

## Step 2: Install the Python library in your training image
<a name="managed-tier-checkpointing-setup-step-install-library"></a>

Install the [Amazon SageMaker checkpointing library](https://pypi.org/project/amzn-sagemaker-checkpointing/) and its dependencies in your training image by adding it to your Dockerfile:

```
# Add this line to your training image Dockerfile
RUN pip install amzn-sagemaker-checkpointing s3torchconnector tenacity torch boto3
```

## Step 3: Save checkpoints in your training loop
<a name="managed-tier-checkpointing-setup-step-save-checkpoint-in-loop"></a>

In your training loop, you can asynchronously save checkpoints using PyTorch Distributed Checkpoint (DCP). The following is an example of how to do so.

```
import os
import time

import torch
import torch.distributed as dist
from torch.distributed.checkpoint import async_save, load
from amzn_sagemaker_checkpointing.checkpointing.filesystem.filesystem import (
    SageMakerTieredStorageWriter,
    SageMakerTieredStorageReader
)
# SageMakerCheckpointConfig is provided by the same library; import it from
# the module documented for your installed version.

# Initialize distributed training
dist.init_process_group(backend="nccl")

# Configure checkpointing
checkpoint_config = SageMakerCheckpointConfig(
    # Unique ID for your training job 
    # Allowed characters in ID include: alphanumeric, hyphens, and underscores
    namespace=os.environ.get('TRAINING_JOB_NAME', f'job-{int(time.time())}'),

    # Number of distributed processes/available GPUs
    world_size=dist.get_world_size(),

    # S3 storage location, required for SageMakerTieredStorageReader for read fallbacks
    # Required for SageMakerTieredStorageWriter when save_to_s3 is True
    s3_tier_base_path="s3://my-bucket/checkpoints"
)

# Your model and optimizer
model = MyModel()
optimizer = torch.optim.AdamW(model.parameters())

# Training loop
future = None
epoch = 0  # Example epoch counter; update it as your training loop progresses
in_memory_ckpt_freq = 10
s3_ckpt_freq = 50

for training_step in range(1000):
    # ... training code ...
    
    # Save checkpoint
    if (training_step % in_memory_ckpt_freq == 0 or 
        training_step % s3_ckpt_freq == 0):
        # Create state dictionary
        state_dict = {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": training_step,
            "epoch": epoch
        }
        
        # Create storage writer for current step
        checkpoint_config.save_to_s3 = training_step % s3_ckpt_freq == 0
        storage_writer = SageMakerTieredStorageWriter(
            checkpoint_config=checkpoint_config,
            step=training_step
        )

        # Wait for the previous checkpoint save to complete
        if future is not None:
            exc = future.exception()
            if exc:
                print(f"Failure in saving previous checkpoint:{str(exc)}")
                # Handle failures as required
            else:
                result = future.result()
                # Process results from save, if required
        
        # Async save checkpoint using PyTorch DCP
        future = async_save(state_dict=state_dict, storage_writer=storage_writer)
        
        # Continue training while checkpoint saves in background
```
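The frequency logic in the loop above can be factored into a small helper that decides, for a given step, whether to save at all and whether the save should also persist to Amazon S3. This is a sketch using the example's two frequencies:

```python
def checkpoint_action(step, in_memory_freq=10, s3_freq=50):
    """Return (should_save, save_to_s3) for a given training step."""
    save_to_s3 = step % s3_freq == 0
    should_save = save_to_s3 or step % in_memory_freq == 0
    return should_save, save_to_s3

# Step 20 hits only the in-memory tier; step 50 also persists to S3.
assert checkpoint_action(20) == (True, False)
assert checkpoint_action(50) == (True, True)
```

With a helper like this, the save branch in the training loop reduces to a single call per step.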

## Step 4: Load checkpoints for recovery
<a name="managed-tier-checkpointing-setup-step-load-checkpoint"></a>

The following is an example of loading a checkpoint.

```
# Create state dictionary template
state_dict = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "step": 0,
    "epoch": 0
}

# Load latest checkpoint
storage_reader = SageMakerTieredStorageReader(checkpoint_config=checkpoint_config)
load(state_dict, storage_reader=storage_reader)

# Load specific checkpoint step
storage_reader = SageMakerTieredStorageReader(
    checkpoint_config=checkpoint_config, 
    step=500  # Omit step to load the latest available checkpoint.
)
try:
    load(state_dict, storage_reader=storage_reader)
except Exception as e:
    print(f"Checkpoint load failed: {str(e)}")
    # Add additional exception handling
```
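When you restart training after a failure, resume from the step after the one you restored. The following is a minimal sketch, assuming the `step` key used in the example state dictionary:

```python
def resume_step(state_dict):
    """Return the training step to resume from: one past the restored step."""
    return state_dict.get("step", -1) + 1

# After load() populates state_dict, start the loop at resume_step(state_dict).
assert resume_step({"step": 500}) == 501
assert resume_step({}) == 0  # No checkpoint restored: start from step 0.
```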

## Validate your managed tiered checkpointing operations
<a name="managed-tier-checkpointing-setup-validation"></a>

You can validate your managed tiered checkpointing operations with logs.

**Custom logging (optional)**

You can integrate checkpointing logs with other logs by passing a custom logger to the library. For example, you can add a custom logger to your training code so that all logs from the library are also collected in the training logger.
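How the library accepts a custom logger depends on your installed version. As a hedged illustration using only the standard library, you can attach your training handler to a named library logger so both streams share one destination (the logger name `amzn_sagemaker_checkpointing` is an assumption):

```python
import logging

# Training logger with its own handler (console here; a file handler also works).
training_logger = logging.getLogger("training")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
training_logger.addHandler(handler)
training_logger.setLevel(logging.INFO)

# Route the checkpointing library's records into the same handler.
# The logger name below is assumed; check your installed library version.
ckpt_logger = logging.getLogger("amzn_sagemaker_checkpointing")
ckpt_logger.addHandler(handler)
ckpt_logger.setLevel(logging.INFO)
```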

**Enhanced service logging (optional)**

For enhanced debugging and service visibility, you can mount the checkpointing log path `/var/log/sagemaker_checkpointing` from within your pod to the path `/var/logs/sagemaker_checkpointing` on your host. This collects only the library-specific logs separately, which gives the service team enhanced visibility for debugging and support.
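In a Kubernetes Pod spec, that mount can be expressed with a `hostPath` volume. The following is a sketch only; the container and volume names are placeholders, and the paths follow the text above:

```yaml
spec:
  containers:
    - name: training-container        # placeholder container name
      volumeMounts:
        - name: ckpt-logs
          mountPath: /var/log/sagemaker_checkpointing   # path inside the pod
  volumes:
    - name: ckpt-logs
      hostPath:
        path: /var/logs/sagemaker_checkpointing         # path on the host
        type: DirectoryOrCreate
```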

# Removing managed tiered checkpointing
<a name="managed-tier-checkpointing-remove"></a>

This section explains how to disable managed tiered checkpointing when you no longer need it.

To disable managed tiered checkpointing, use the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) AWS CLI command to update your cluster configuration:

```
aws sagemaker update-cluster \
    --cluster-name cluster-name \
    --tiered-storage-config '{ "Mode": "Disable" }'
```

This removes the memory management daemon from your cluster. The daemon is implemented as a standard Kubernetes DaemonSet and follows standard Kubernetes lifecycle management.

# Security considerations for managed tiered checkpointing
<a name="managed-tier-security-considerations"></a>

This section covers important security considerations when using managed tiered checkpointing. It includes Python pickle usage, Amazon S3 encryption, and network endpoint security.

**Python pickle usage**

Managed tiered checkpointing uses Python’s pickle module to deserialize checkpoint data stored in Amazon S3. This implementation has important security implications:
+ **Extended trust boundary**: When using managed tiered checkpointing with Amazon S3, the Amazon S3 bucket becomes part of your cluster’s trust boundary.
+ **Code execution risk**: Python’s pickle module can execute arbitrary code during deserialization. If an unauthorized user gains write access to your checkpoint Amazon S3 bucket, they could potentially craft malicious pickle data that executes when loaded by managed tiered checkpointing.

**Best practices for Amazon S3 storage**

When using managed tiered checkpointing with Amazon S3 storage:
+ **Restrict Amazon S3 bucket access**: Ensure that only authorized users and roles associated with your training cluster have access to the Amazon S3 bucket used for checkpointing.
+ **Implement bucket policies**: Configure appropriate bucket policies to prevent unauthorized access or modifications.
+ **Validate access patterns**: Implement logging for validating access patterns to your checkpoint Amazon S3 buckets.
+ **Validate bucket names**: Use caution with bucket name selection to avoid potential bucket hijacking.

**Network endpoints**

Managed tiered checkpointing enables network endpoints on each of your compute nodes on the following ports: 9200/TCP, 9209/UDP, 9210/UDP, 9219/UDP, 9220/UDP, 9229/UDP, 9230/UDP, 9239/UDP, 9240/UDP. These ports are necessary for the checkpointing service to function and maintain data synchronization.
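When you audit security group or network ACL rules, it can help to keep the documented endpoint list in one place. The following is a small sketch enumerating the ports named above:

```python
# Ports that managed tiered checkpointing opens on each compute node,
# as documented above: one TCP port and eight UDP ports.
CHECKPOINTING_PORTS = {
    "tcp": [9200],
    "udp": [9209, 9210, 9219, 9220, 9229, 9230, 9239, 9240],
}

def is_checkpointing_port(port, protocol):
    """Return True if the port/protocol pair belongs to the checkpointing service."""
    return port in CHECKPOINTING_PORTS.get(protocol, [])

# 9200/TCP belongs to the service; 9200/UDP does not.
assert is_checkpointing_port(9200, "tcp")
assert not is_checkpointing_port(9200, "udp")
```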

By default, SageMaker’s network configuration restricts access to these endpoints for security purposes. We recommend that you maintain these default restrictions.

When configuring your network settings for your nodes and VPC, follow AWS best practices for VPCs, security groups, and ACLs. For more information, see the following:
+ [Amazon SageMaker HyperPod prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-prerequisites.html#sagemaker-hyperpod-prerequisites-optional-vpcCluster)
+ [VPC security best practices](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-security-best-practices.html)

# SageMaker HyperPod task governance
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance"></a>

SageMaker HyperPod task governance is a robust management system designed to streamline resource allocation and ensure efficient utilization of compute resources across teams and projects for your Amazon EKS clusters. It gives administrators the capability to set:
+ Priority levels for various tasks
+ Compute allocation for each team
+ How each team lends and borrows idle compute
+ Whether a team can preempt its own tasks

HyperPod task governance also provides Amazon EKS cluster observability, offering real-time visibility into cluster capacity. This includes compute availability and usage, team allocation and utilization, and task run and wait time information, setting you up for informed decision-making and proactive resource management.

The following sections cover how to set up, understand key concepts, and use HyperPod task governance for your Amazon EKS clusters.

**Topics**
+ [Setup for SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance-setup.md)
+ [Dashboard](sagemaker-hyperpod-eks-operate-console-ui-governance-metrics.md)
+ [Tasks](sagemaker-hyperpod-eks-operate-console-ui-governance-tasks.md)
+ [Policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies.md)
+ [Example HyperPod task governance AWS CLI commands](sagemaker-hyperpod-eks-operate-console-ui-governance-cli.md)
+ [Troubleshoot](sagemaker-hyperpod-eks-operate-console-ui-governance-troubleshoot.md)
+ [Attribution document for Amazon SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance-attributions.md)

# Setup for SageMaker HyperPod task governance
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-setup"></a>

The following section provides information on how to get set up with the Amazon CloudWatch Observability EKS and SageMaker HyperPod task governance add-ons.

Ensure that you have the minimum permission policy for HyperPod cluster administrators with Amazon EKS, in [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin). This includes permissions to run the SageMaker HyperPod core APIs, manage SageMaker HyperPod clusters within your AWS account, and perform the tasks in [Managing SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-operate.md).

**Topics**
+ [Dashboard setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-dashboard.md)
+ [Task governance setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-task-governance.md)

# Dashboard setup
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-setup-dashboard"></a>

Use the following information to set up the Amazon CloudWatch Observability EKS add-on for SageMaker HyperPod. This gives you a detailed visual dashboard with metrics for your EKS cluster hardware, team allocation, and tasks.

If you are having issues setting up, see [Troubleshoot](sagemaker-hyperpod-eks-operate-console-ui-governance-troubleshoot.md) for known troubleshooting solutions.

**Topics**
+ [HyperPod Amazon CloudWatch Observability EKS add-on prerequisites](#hp-eks-dashboard-prerequisites)
+ [HyperPod Amazon CloudWatch Observability EKS add-on setup](#hp-eks-dashboard-setup)

## HyperPod Amazon CloudWatch Observability EKS add-on prerequisites
<a name="hp-eks-dashboard-prerequisites"></a>

The following section includes the prerequisites for installing the Amazon EKS Observability add-on.
+ Ensure that you have the minimum permission policy for HyperPod cluster administrators, in [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin).
+ Attach the `CloudWatchAgentServerPolicy` IAM policy to your worker nodes. To do so, enter the following command. Replace `my-worker-node-role` with the IAM role used by your Kubernetes worker nodes.

  ```
  aws iam attach-role-policy \
  --role-name my-worker-node-role \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
  ```

## HyperPod Amazon CloudWatch Observability EKS add-on setup
<a name="hp-eks-dashboard-setup"></a>

Use the following options to set up the Amazon SageMaker HyperPod Amazon CloudWatch Observability EKS add-on.

------
#### [ Setup using the SageMaker AI console ]

The following permissions are required for setup and visualizing the HyperPod task governance dashboard. This section expands upon the permissions listed in [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin). 

To manage task governance, use the sample policy:

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListClusters",
                "sagemaker:DescribeCluster",
                "sagemaker:ListComputeQuotas",
                "sagemaker:CreateComputeQuota",
                "sagemaker:UpdateComputeQuota",
                "sagemaker:DescribeComputeQuota",
                "sagemaker:DeleteComputeQuota",
                "sagemaker:ListClusterSchedulerConfigs",
                "sagemaker:DescribeClusterSchedulerConfig",
                "sagemaker:CreateClusterSchedulerConfig",
                "sagemaker:UpdateClusterSchedulerConfig",
                "sagemaker:DeleteClusterSchedulerConfig",
                "eks:ListAddons",
                "eks:CreateAddon",
                "eks:DescribeAddon",
                "eks:DescribeCluster",
                "eks:DescribeAccessEntry",
                "eks:ListAssociatedAccessPolicies",
                "eks:AssociateAccessPolicy",
                "eks:DisassociateAccessPolicy"
            ],
            "Resource": "*"
        }
    ]
}
```

------

To grant permissions to manage Amazon CloudWatch Observability Amazon EKS and view the HyperPod cluster dashboard through the SageMaker AI console, use the sample policy below:

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "eks:ListAddons",
                "eks:CreateAddon",
                "eks:UpdateAddon",
                "eks:DescribeAddon",
                "eks:DescribeAddonVersions",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:ListClusters",
                "sagemaker:ListComputeQuotas",
                "sagemaker:DescribeComputeQuota",
                "sagemaker:ListClusterSchedulerConfigs",
                "sagemaker:DescribeClusterSchedulerConfig",
                "eks:DescribeCluster",
                "cloudwatch:GetMetricData",
                "eks:AccessKubernetesApi"
            ],
            "Resource": "*"
        }
    ]
}
```

------

Navigate to the **Dashboard** tab in the SageMaker HyperPod console to install the Amazon CloudWatch Observability EKS add-on. To ensure that task governance related metrics are included in the **Dashboard**, select the Kueue metrics checkbox. Enabling Kueue metrics incurs CloudWatch metrics costs after the free-tier limit is reached. For more information, see **Metrics** in [Amazon CloudWatch Pricing](https://aws.amazon.com/cloudwatch/pricing/).

------
#### [ Setup using the EKS AWS CLI ]

Use the following EKS AWS CLI command to install the add-on:

```
aws eks create-addon --cluster-name cluster-name \
--addon-name amazon-cloudwatch-observability \
--configuration-values "configuration json"
```

The following is an example of the configuration values JSON:

```
{
    "agent": {
        "config": {
            "logs": {
                "metrics_collected": {
                    "kubernetes": {
                        "kueue_container_insights": true,
                        "enhanced_container_insights": true
                    },
                    "application_signals": { }
                }
            },
            "traces": {
                "traces_collected": {
                    "application_signals": { }
                }
            }
        }
    }
}
```

------
#### [ Setup using the EKS Console UI ]

1. Navigate to the [EKS console](https://console.aws.amazon.com/eks/home#/clusters).

1. Choose your cluster.

1. Choose **Add-ons**.

1. Find the **Amazon CloudWatch Observability** add-on and install version 2.4.0 or later.

1. Include the following JSON in the add-on's configuration values:

   ```
   {
       "agent": {
           "config": {
               "logs": {
                   "metrics_collected": {
                       "kubernetes": {
                           "kueue_container_insights": true,
                           "enhanced_container_insights": true
                       },
                       "application_signals": { }
                   }
               },
               "traces": {
                   "traces_collected": {
                       "application_signals": { }
                   }
               }
           }
       }
   }
   ```

------

Once the EKS Observability add-on has been successfully installed, you can view your EKS cluster metrics under the HyperPod console **Dashboard** tab.

# Task governance setup
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-setup-task-governance"></a>

This section includes information on how to set up the Amazon SageMaker HyperPod task governance EKS add-on. This includes granting permissions that allow you to set task prioritization, compute allocation for teams, how idle compute is shared, and task preemption for teams.

If you are having issues setting up, see [Troubleshoot](sagemaker-hyperpod-eks-operate-console-ui-governance-troubleshoot.md) for known troubleshooting solutions.

**Topics**
+ [Kueue Settings](#hp-eks-task-governance-kueue-settings)
+ [HyperPod Task governance prerequisites](#hp-eks-task-governance-prerequisites)
+ [HyperPod task governance setup](#hp-eks-task-governance-setup)

## Kueue Settings
<a name="hp-eks-task-governance-kueue-settings"></a>

The HyperPod task governance EKS add-on installs [Kueue](https://github.com/kubernetes-sigs/kueue/tree/main/apis/kueue) on your HyperPod EKS clusters. Kueue is a Kubernetes-native system that manages quotas and how jobs consume them.


| EKS HyperPod task governance add-on version | Version of Kueue that is installed as part of the add-on | 
| --- | --- | 
|  v1.1.3  |  v0.12.0  | 

**Note**  
Kueue v0.12.0 and higher don't include kueue-rbac-proxy as part of the installation. Previous versions might have kueue-rbac-proxy installed. For example, if you're using Kueue v0.8.1, you might have kueue-rbac-proxy v0.18.1.

HyperPod task governance leverages Kueue for Kubernetes-native job queueing, scheduling, and quota management, and is installed with the HyperPod task governance EKS add-on. When installed, HyperPod creates and modifies SageMaker AI-managed Kubernetes resources such as `KueueManagerConfig`, `ClusterQueues`, `LocalQueues`, `WorkloadPriorityClasses`, `ResourceFlavors`, and `ValidatingAdmissionPolicies`. While Kubernetes administrators have the flexibility to modify the state of these resources, it is possible that any changes made to a SageMaker AI-managed resource may be updated and overwritten by the service.

The following information outlines the configuration settings utilized by the HyperPod task governance add-on for setting up Kueue.

```
  apiVersion: config.kueue.x-k8s.io/v1beta1
    kind: Configuration
    health:
      healthProbeBindAddress: :8081
    metrics:
      bindAddress: :8443
      enableClusterQueueResources: true
    webhook:
      port: 9443
    manageJobsWithoutQueueName: false
    leaderElection:
      leaderElect: true
      resourceName: c1f6bfd2.kueue.x-k8s.io
    controller:
      groupKindConcurrency:
        Job.batch: 5
        Pod: 5
        Workload.kueue.x-k8s.io: 5
        LocalQueue.kueue.x-k8s.io: 1
        ClusterQueue.kueue.x-k8s.io: 1
        ResourceFlavor.kueue.x-k8s.io: 1
    clientConnection:
      qps: 50
      burst: 100
    integrations:
      frameworks:
      - "batch/job"
      - "kubeflow.org/mpijob"
      - "ray.io/rayjob"
      - "ray.io/raycluster"
      - "jobset.x-k8s.io/jobset"
      - "kubeflow.org/mxjob"
      - "kubeflow.org/paddlejob"
      - "kubeflow.org/pytorchjob"
      - "kubeflow.org/tfjob"
      - "kubeflow.org/xgboostjob"
      - "pod"
      - "deployment"
      - "statefulset"
      - "leaderworkerset.x-k8s.io/leaderworkerset"
      podOptions:
        namespaceSelector:
          matchExpressions:
            - key: kubernetes.io/metadata.name
              operator: NotIn
              values: [ kube-system, kueue-system ]
    fairSharing:
      enable: true
      preemptionStrategies: [LessThanOrEqualToFinalShare, LessThanInitialShare]
    resources:
      excludeResourcePrefixes: []
```

For more information about each configuration entry, see [Configuration](https://kueue.sigs.k8s.io/docs/reference/kueue-config.v1beta1/#Configuration) in the Kueue documentation.

## HyperPod Task governance prerequisites
<a name="hp-eks-task-governance-prerequisites"></a>
+ Ensure that you have the minimum permission policy for HyperPod cluster administrators, in [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin). This includes permissions to run the SageMaker HyperPod core APIs, manage SageMaker HyperPod clusters within your AWS account, and perform the tasks in [Managing SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-operate.md).
+ You will need to have your Kubernetes version >= 1.30. For instructions, see [Update existing clusters to the new Kubernetes version](https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html).
+ If you already have Kueue installed in your cluster, uninstall it before installing the EKS add-on.
+ A HyperPod node must already exist in the EKS cluster before installing the HyperPod task governance add-on. 

## HyperPod task governance setup
<a name="hp-eks-task-governance-setup"></a>

The following provides information on how to get set up with HyperPod task governance.

------
#### [ Setup using the SageMaker AI console ]

The following provides information on how to get set up with HyperPod task governance using the SageMaker HyperPod console.

If you already granted permissions to manage Amazon CloudWatch Observability EKS and view the HyperPod cluster dashboard through the SageMaker AI console in the [HyperPod Amazon CloudWatch Observability EKS add-on setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-dashboard.md#hp-eks-dashboard-setup), you already have all of the required permissions attached. If you have not set this up, use the sample policy below to grant permissions to manage the HyperPod task governance add-on and view the HyperPod cluster dashboard through the SageMaker AI console.

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "eks:ListAddons",
                "eks:CreateAddon",
                "eks:UpdateAddon",
                "eks:DescribeAddon",
                "eks:DescribeAddonVersions",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:ListClusters",
                "eks:DescribeCluster",
                "eks:AccessKubernetesApi"
            ],
            "Resource": "*"
        }
    ]
}
```

------

Navigate to the **Dashboard** tab in the SageMaker HyperPod console to install the Amazon SageMaker HyperPod task governance add-on.

------
#### [ Setup using the Amazon EKS AWS CLI ]

Use the example [create-addon](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/eks/create-addon.html) EKS AWS CLI command to set up the HyperPod task governance Amazon EKS API and console UI:

```
aws eks create-addon --region region --cluster-name cluster-name --addon-name amazon-sagemaker-hyperpod-taskgovernance
```

------

You can view the **Policies** tab in the SageMaker HyperPod console if the install was successful. You can also use the following example [describe-addon](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/eks/describe-addon.html) EKS AWS CLI command to check the status.

```
aws eks describe-addon --region region --cluster-name cluster-name --addon-name amazon-sagemaker-hyperpod-taskgovernance
```

# Dashboard
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-metrics"></a>

Amazon SageMaker HyperPod task governance provides a comprehensive dashboard view of your Amazon EKS cluster utilization metrics, including hardware, team, and task metrics. The following provides information on your HyperPod EKS cluster dashboard.

You need to install the EKS add-on to view the dashboard. For more information, see [Dashboard setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-dashboard.md).

In the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/), under **HyperPod Clusters**, you can navigate to the HyperPod console and view your list of HyperPod clusters in your Region. Choose your cluster and navigate to the **Dashboard** tab. The dashboard contains the following metrics. You can download the data for a section by choosing the corresponding **Export**.

**Utilization**

Provides point-in-time and trend-based health metrics for critical compute resources in the EKS cluster. By default, **All Instance Groups** are shown. Use the dropdown menu to filter your instance groups. The metrics included in this section are:
+ Number of total, running, and pending recovery instances. Pending recovery instances are instances that need attention for recovery.
+ GPUs, GPU memory, vCPUs, and vCPU memory.
+ GPU utilization, GPU memory utilization, vCPU utilization, and vCPU memory utilization.
+ An interactive graph of your GPU and vCPU utilization. 

**Teams**

Provides information about team-specific resource management. This includes:
+ Instance and GPU allocation.
+ GPU utilization rates.
+ Borrowed GPU statistics.
+ Task status (running or pending).
+ A bar chart view of GPU utilization versus compute allocation across teams.
+ Detailed team-level GPU and vCPU information. By default, **All teams** are displayed. You can filter by team and by instance using the dropdown menus. In the interactive plot, you can filter by time.

**Tasks**

**Note**  
To view your HyperPod EKS cluster tasks in the dashboard:  
Configure Kubernetes Role-Based Access Control (RBAC) for data scientist users in the designated HyperPod namespace to authorize task execution on Amazon EKS-orchestrated clusters. Namespaces follow the format `hyperpod-ns-team-name`. To establish RBAC permissions, refer to the [team role creation instructions](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#5-create-team-role).
Ensure that your job is submitted with the appropriate namespace and priority class labels. For a comprehensive example, see [Submit a job to SageMaker AI-managed queue and namespace](sagemaker-hyperpod-eks-operate-console-ui-governance-cli.md#hp-eks-cli-start-job).

Provides task-related metrics, including the number of running, pending, and preempted tasks, and run and wait time statistics. By default, **All teams** are displayed. You can filter by team using the dropdown menu. In the interactive plot, you can filter by time.

# Tasks
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-tasks"></a>

The following provides information on Amazon SageMaker HyperPod EKS cluster tasks. Tasks are operations or jobs that are sent to the cluster, such as machine learning operations like training, running experiments, or inference. The viewable task details include status, run time, and how much compute is being used per task.

In the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/), under **HyperPod Clusters**, you can navigate to the HyperPod console and view your list of HyperPod clusters in your Region. Choose your cluster and navigate to the **Tasks** tab.

For the **Tasks** tab to be viewable by anyone other than the administrator, the administrator needs to [add an access entry to the EKS cluster for the IAM role](https://docs.aws.amazon.com/eks/latest/userguide/access-entries.html).

**Note**  
To view your HyperPod EKS cluster tasks in the dashboard:  
Configure Kubernetes Role-Based Access Control (RBAC) for data scientist users in the designated HyperPod namespace to authorize task execution on Amazon EKS-orchestrated clusters. Namespaces follow the format `hyperpod-ns-team-name`. To establish RBAC permissions, refer to the [team role creation instructions](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#5-create-team-role).
Ensure that your job is submitted with the appropriate namespace and priority class labels. For a comprehensive example, see [Submit a job to SageMaker AI-managed queue and namespace](sagemaker-hyperpod-eks-operate-console-ui-governance-cli.md#hp-eks-cli-start-job).
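
As a hedged sketch of the RBAC setup that the note describes, the following Role and RoleBinding grant a user permission to manage jobs in a team namespace. The role name, user name, API groups, and verbs are hypothetical placeholders; follow the linked team role instructions for the authoritative setup.

```
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: data-scientist-role        # hypothetical name
  namespace: hyperpod-ns-team-name
rules:
  - apiGroups: ["batch", "kubeflow.org"]
    resources: ["jobs", "pytorchjobs"]
    verbs: ["create", "get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: data-scientist-binding     # hypothetical name
  namespace: hyperpod-ns-team-name
subjects:
  - kind: User
    name: data-scientist-user      # hypothetical user mapped from an IAM access entry
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: data-scientist-role
  apiGroup: rbac.authorization.k8s.io
```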

For EKS clusters, kubeflow (PyTorch, MPI, TensorFlow) tasks are shown. By default, PyTorch tasks are shown. You can filter for PyTorch, MPI, or TensorFlow tasks by choosing the dropdown menu or using the search field. The information shown for each task includes the task name, status, namespace, priority class, and creation time.

# Using topology-aware scheduling in Amazon SageMaker HyperPod task governance
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-tasks-scheduling"></a>

Topology-aware scheduling in Amazon SageMaker HyperPod task governance optimizes the training efficiency of distributed machine learning workloads by placing pods based on the physical network topology of your Amazon EC2 instances. By considering the hierarchical structure of AWS infrastructure, including Availability Zones, network blocks, and physical racks, topology-aware scheduling ensures that pods requiring frequent communication are scheduled in close proximity to minimize network latency. This intelligent placement is particularly beneficial for large-scale machine learning training jobs that involve intensive pod-to-pod communication, resulting in reduced training times and more efficient resource utilization across your cluster.

**Note**  
To use topology-aware scheduling, make sure that your version of HyperPod task governance is v1.2.2-eksbuild.1 or higher.

Topology-aware scheduling supports the following instance types:
+ ml.p3dn.24xlarge
+ ml.p4d.24xlarge
+ ml.p4de.24xlarge
+ ml.p5.48xlarge
+ ml.p5e.48xlarge
+ ml.p5en.48xlarge
+ ml.p6e-gb200.36xlarge
+ ml.trn1.2xlarge
+ ml.trn1.32xlarge
+ ml.trn1n.32xlarge
+ ml.trn2.48xlarge
+ ml.trn2u.48xlarge

Topology-aware scheduling integrates with your existing HyperPod workflows while providing flexible topology preferences through both kubectl YAML files and the HyperPod CLI. HyperPod task governance automatically configures cluster nodes with topology labels and works with HyperPod task governance policies and resource borrowing mechanisms, ensuring that topology-aware scheduling doesn't disrupt your current operational processes. With built-in support for both preferred and required topology specifications, you can fine-tune workload placement to match your specific performance requirements while maintaining the flexibility to fall back to standard scheduling when topology constraints cannot be satisfied.

By leveraging topology-aware labels in HyperPod, you can enhance your machine learning workloads through intelligent pod placement that considers the physical network infrastructure. HyperPod task governance automatically optimizes pod scheduling based on the hierarchical data center topology, which directly translates to reduced network latency and improved training performance for distributed ML tasks. This topology awareness is particularly valuable for large-scale machine learning workloads, as it minimizes communication overhead by strategically placing related pods closer together in the network hierarchy. The result is optimized communication network latency between pods, more efficient resource utilization, and better overall performance for compute-intensive AI/ML applications, all achieved without you needing to manually manage complex network topology configurations.

The following are labels for the available topology network layers that HyperPod task governance can schedule pods in:
+ topology.k8s.aws/network-node-layer-1
+ topology.k8s.aws/network-node-layer-2
+ topology.k8s.aws/network-node-layer-3
+ topology.k8s.aws/ultraserver-id

To use topology-aware scheduling, include the following labels in your YAML file:
+ kueue.x-k8s.io/podset-required-topology - indicates that all pods of the job must be scheduled within the same topology layer. If this requirement cannot be satisfied, the job is not admitted.
+ kueue.x-k8s.io/podset-preferred-topology - indicates that scheduling all pods of the job within the same topology layer is preferred but not required. HyperPod task governance tries to schedule the pods within one layer before trying the next topology layer.

If the available resources don’t share the same topology label, the job is suspended and placed on the waitlist. When Kueue determines that enough resources are available, it admits and runs the job.

The following example demonstrates how to use the labels in your YAML files:

```
apiVersion: batch/v1
kind: Job
metadata:
  name: test-tas-job
  namespace: hyperpod-ns-team-name
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
    kueue.x-k8s.io/priority-class: PRIORITY_CLASS-priority
spec:
  parallelism: 10
  completions: 10
  suspend: true
  template:
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
      annotations:
        # Use either podset-required-topology or podset-preferred-topology, not both
        kueue.x-k8s.io/podset-required-topology: "topology.k8s.aws/network-node-layer-3"
        # kueue.x-k8s.io/podset-preferred-topology: "topology.k8s.aws/network-node-layer-3"
    spec:
      # Alternatively, use a nodeSelector instead of a topology annotation, not both
      # nodeSelector:
      #   topology.k8s.aws/network-node-layer-3: TOPOLOGY_LABEL_VALUE
      containers:
        - name: dummy-job
          image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
          args: ["3600s"]
          resources:
            requests:
              cpu: "100"
      restartPolicy: Never
```

The following table explains the new parameters you can use in the kubectl YAML file.


| Parameter | Description | 
| --- | --- | 
| kueue.x-k8s.io/queue-name | The name of the queue to use to run the job. The format of this queue-name must be hyperpod-ns-team-name-localqueue. | 
| kueue.x-k8s.io/priority-class | Lets you specify a priority for pod scheduling. This specification is optional. | 
| annotations | Contains the topology annotation that you attach to the job. Available topologies are kueue.x-k8s.io/podset-required-topology and kueue.x-k8s.io/podset-preferred-topology. You can use either an annotation or nodeSelector, but not both at the same time. | 
| nodeSelector | Specifies the network layer that represents the layer of Amazon EC2 instance placement. Use either this field or an annotation, but not both at the same time. In your YAML file, you can also use the nodeSelector parameter to choose the exact layer for your pods. To get the value of your label, use the [DescribeInstanceTopology](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeInstanceTopology.html) API operation. | 

You can also use the HyperPod CLI to run your job with topology-aware scheduling. For more information about the HyperPod CLI, see [SageMaker HyperPod CLI commands](sagemaker-hyperpod-eks-hyperpod-cli-reference.md).

```
hyp create hyp-pytorch-job \                                            
  --version 1.1 \
  --job-name sample-pytorch-job \
  --image 123456789012.dkr.ecr.us-west-2.amazonaws.com/ptjob:latest \
  --pull-policy "Always" \
  --tasks-per-node 1 \
  --max-retry 1 \
  --priority high-priority \
  --namespace hyperpod-ns-team-name \
  --queue-name hyperpod-ns-team-name-localqueue \
  --preferred-topology-label topology.k8s.aws/network-node-layer-1
```

The following is an example configuration file that you might use to run a PyTorchJob with topology labels. The file is largely similar for MPI and TensorFlow jobs; if you want to run those instead, change the configuration file accordingly, such as using the correct `kind` and image in place of those for the PyTorchJob. If you’re running a PyTorchJob, you can assign different topologies to the master and worker nodes. A PyTorchJob always has one master node, so we recommend that you apply topology constraints to the worker pods instead.

```
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  annotations: {}
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
  name: tas-test-pytorch-job
  namespace: hyperpod-ns-team-name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
        spec:
          containers:
          - command:
            - python3
            - /opt/pytorch-mnist/mnist.py
            - --epochs=1
            image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
            imagePullPolicy: Always
            name: pytorch
    Worker:
      replicas: 10
      restartPolicy: OnFailure
      template:
        metadata:
          # annotations:
            # kueue.x-k8s.io/podset-required-topology: "topology.k8s.aws/network-node-layer-3"
          labels:
            kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
        spec:
          containers:
          - command:
            - python3
            - /opt/pytorch-mnist/mnist.py
            - --epochs=1
            image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
            imagePullPolicy: Always
            name: pytorch
            resources:
              limits:
                cpu: 1
              requests:
                memory: 200Mi
                cpu: 1
          #nodeSelector:
          #  topology.k8s.aws/network-node-layer-3: xxxxxxxxxxx
```

To see the topologies for your cluster, use the [DescribeInstanceTopology](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeInstanceTopology.html) API operation. By default, the topologies are hidden in the AWS Management Console and Amazon SageMaker Studio. Follow these steps to see them in the interface that you’re using.

**SageMaker Studio**

1. In SageMaker Studio, navigate to your cluster.

1. In the Tasks view, choose the options menu in the Name column, then choose **Manage columns**.

1. Select **Requested topology** and **Topology constraint** to add the columns to see the topology information in the list of Kubernetes pods.

**AWS Management Console**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Under **HyperPod clusters**, choose **Cluster management**.

1. Choose the **Tasks** tab, then choose the gear icon.

1. Under instance attributes, toggle **Requested topology** and **Topology constraint**.

1. Choose **Confirm** to see the topology information in the table.

# Using gang scheduling in Amazon SageMaker HyperPod task governance
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-tasks-gang-scheduling"></a>

In distributed ML training, a job often requires multiple pods running concurrently across nodes with pod-to-pod communication. HyperPod task governance uses Kueue's `waitForPodsReady` feature to implement gang scheduling. When enabled, the workload is monitored by Kueue until all of its pods are ready, meaning scheduled, running, and passing the optional readiness probe. If not all pods of the workload are ready within the configured timeout, the workload is evicted and requeued.

Gang scheduling provides the following benefits:
+ **Prevents resource waste** — Kueue evicts and requeues the workload if all pods do not become ready, ensuring resources are not held indefinitely by partially running workloads.
+ **Avoids deadlocks** — Prevents jobs from holding partial resources and blocking each other indefinitely.
+ **Automatic recovery** — If pods aren't ready within the timeout, the workload is evicted and requeued with configurable exponential backoff, rather than hanging indefinitely.

## Activate gang scheduling
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-tasks-gang-scheduling-activate"></a>

To activate gang scheduling, you must have a HyperPod Amazon EKS cluster with the task governance Amazon EKS add-on installed. The add-on status must be `Active` or `Degraded`.

**Note**  
Gang scheduling can also be configured directly using `kubectl` by editing the Kueue configuration on the cluster.
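
As a sketch of what that direct configuration looks like, the following fragment uses Kueue's `waitForPodsReady` settings; the field names come from Kueue's `Configuration` API, and the values shown are illustrative rather than a recommended setup:

```
waitForPodsReady:
  enable: true
  timeout: 5m
  blockAdmission: false
  requeuingStrategy:
    timestamp: Eviction
    backoffLimitCount: 5
    backoffBaseSeconds: 60
    backoffMaxSeconds: 3600
```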

**Activate gang scheduling (SageMaker AI console)**

1. Open the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/) and navigate to your HyperPod cluster.

1. Choose the **Policy management** tab.

1. In the **Task governance** section, open **Actions**, then choose **Configure gang scheduling**.

1. Toggle gang scheduling on and configure the settings.

1. Choose **Save**. The Kueue controller restarts to apply the change.

## Gang scheduling configuration settings
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-tasks-gang-scheduling-settings"></a>

The following table describes the configuration settings for gang scheduling.


| Setting | Description | Default | 
| --- | --- | --- | 
| timeout | How long Kueue waits for all pods to become ready before evicting and requeuing the workload. | 5m | 
| recoveryTimeout | How long Kueue waits for a pod to recover after a node failure before requeuing the workload. Set to 0s to disable. Defaults to the value of timeout if not set. | 5m | 
| blockAdmission | When enabled, workloads are admitted sequentially. No new workload is admitted until all pods of the current one are ready. Prevents deadlocks on resource-constrained clusters. | Off | 
| requeuingStrategy timestamp | Whether requeue order uses Creation (original submission time, preserves queue position) or Eviction (time of last eviction, effectively deprioritizing repeatedly failing jobs). | Eviction | 
| requeuingStrategy backoffLimitCount | Maximum requeue attempts before Kueue permanently deactivates the workload. Leave empty for unlimited retries. | Unlimited | 
| requeuingStrategy backoffBaseSeconds | The base time in seconds for exponential backoff when requeuing a workload after each consecutive timeout. The exponent is 2. | 60s | 
| requeuingStrategy backoffMaxSeconds | Cap on the exponential backoff delay. Once reached, Kueue continues requeuing at this fixed interval. | 3600s | 
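
To make the backoff defaults concrete, the following sketch computes the delay before each requeue attempt, assuming the delay for attempt *n* is approximately `backoffBaseSeconds * 2^(n-1)`, capped at `backoffMaxSeconds`. The exact behavior is defined by Kueue, so treat this as an illustration:

```
base=60    # backoffBaseSeconds
max=3600   # backoffMaxSeconds
for n in 1 2 3 4 5 6 7; do
  delay=$(( base * (1 << (n - 1)) ))   # exponential backoff with exponent 2
  if [ "$delay" -gt "$max" ]; then delay=$max; fi
  echo "attempt ${n}: wait ${delay}s before requeue"
done
```

With these defaults, the delay doubles from 60s per attempt and reaches the 3600s cap at the seventh attempt.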

**Note**  
Modifying gang scheduling settings restarts the Kueue controller, which may temporarily delay job admission. This applies whether you are enabling, disabling, or updating any value. Running jobs are not interrupted.

**Note**  
Gang scheduling is cluster-wide. It applies to all Kueue-managed workloads on the cluster, not just specific teams or queues.

# Policies
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies"></a>

Amazon SageMaker HyperPod task governance simplifies how your Amazon EKS cluster resources are allocated and how tasks are prioritized. The following provides information on HyperPod EKS cluster policies. For information on how to set up task governance, see [Task governance setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-task-governance.md).

The policies are divided into **Compute prioritization** and **Compute allocation**. The policy concepts below are organized in the context of these two policies.

**Compute prioritization**, or cluster policy, determines how idle compute is borrowed and how tasks are prioritized by teams.
+ **Idle compute allocation** defines how idle compute is allocated across teams. That is, how unused compute can be borrowed from teams. When choosing an **Idle compute allocation**, you can choose between:
  + **First-come first-serve**: When applied, teams are not prioritized against each other and each incoming task is equally likely to obtain over-quota resources. Tasks are prioritized based on order of submission. This means a user may be able to use 100% of the idle compute if they request it first.
  + **Fair-share**: When applied, teams borrow idle compute based on their assigned **Fair-share weight**. These weights are defined in **Compute allocation**. For more information on how this can be used, see [Sharing idle compute resources examples](#hp-eks-task-governance-policies-examples).
+ **Task prioritization** defines how tasks are queued as compute becomes available. When choosing a **Task prioritization**, you can choose between:
  + **First-come first-serve**: When applied, tasks are queued in the order they are requested.
  + **Task ranking**: When applied, tasks are queued in the order defined by their prioritization. If this option is chosen, you must add priority classes along with the weights at which they should be prioritized. Tasks of the same priority class are executed on a first-come first-serve basis. When preemption is enabled in **Compute allocation**, higher priority tasks can preempt lower priority tasks within the team.

    When data scientists submit jobs to the cluster, they use the priority class name in the YAML file. The priority class is in the format `priority-class-name-priority`. For an example, see [Submit a job to SageMaker AI-managed queue and namespace](sagemaker-hyperpod-eks-operate-console-ui-governance-cli.md#hp-eks-cli-start-job).
  + **Priority classes**: These classes establish a relative priority for tasks when borrowing capacity. When a task is running using borrowed quota, it may be preempted by a task with a higher priority if no more capacity is available for the incoming task. If **Preemption** is enabled in the **Compute allocation**, a higher priority task may also preempt tasks within its own team.
+ **Unallocated resource sharing** enables teams to borrow compute resources that are not allocated to any team through compute quota. When enabled, unallocated cluster capacity becomes available for teams to borrow automatically. For more information, see [How unallocated resource sharing works](#sagemaker-hyperpod-eks-operate-console-ui-governance-policies-idle-resource-sharing-how-it-works).
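
To illustrate the `priority-class-name-priority` format, suppose an administrator defined a priority class named `training` under **Task ranking**. A data scientist's job metadata would then reference it as follows; the class name here is a hypothetical example:

```
metadata:
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
    kueue.x-k8s.io/priority-class: training-priority
```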

**Compute allocation**, or compute quota, defines a team’s compute allocation and what weight (or priority level) a team is given for fair-share idle compute allocation. 
+ **Team name**: The team name. A corresponding **Namespace** will be created, in the format `hyperpod-ns-team-name`. 
+ **Members**: Members of the team namespace. You will need to set up a Kubernetes role-based access control (RBAC) for data scientist users that you want to be part of this team, to run tasks on HyperPod clusters orchestrated with Amazon EKS. To set up a Kubernetes RBAC, use the instructions in [create team role](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#5-create-team-role).
+ **Fair-share weight**: This is the level of prioritization assigned to the team when **Fair-share** is applied for **Idle compute allocation**. The highest priority has a weight of 100 and the lowest priority has a weight of 0. Higher weight enables a team to access unutilized resources within shared capacity sooner. A zero weight signifies the lowest priority, implying this team will always be at a disadvantage compared to other teams. 

  The fair-share weight provides a comparative edge to this team when vying for available resources against others. Admission prioritizes scheduling tasks from teams with the highest weights and the lowest borrowing. For example, if Team A has a weight of 10 and Team B has a weight of 5, Team A would have priority in accessing unutilized resources and would have jobs scheduled earlier than Team B.
+ **Task preemption**: Compute is reclaimed from a task based on priority. By default, the team lending idle compute preempts the borrowing tasks from other teams when it needs the capacity back. 
+ **Lending and borrowing**: How idle compute is being lent by the team and if the team can borrow from other teams.
  + **Percentage-based borrow limit**: The limit of idle compute that a team is allowed to borrow, expressed as a percentage of their guaranteed quota. A team can borrow up to 10,000% of allocated compute. The value you provide here is interpreted as a percentage. For example, a value of 500 will be interpreted as 500%. This percentage applies uniformly across all resource types (CPU, GPU, Memory) and instance types in the team's quota.
  + **Absolute borrow limit**: The limit of idle compute that a team is allowed to borrow, defined as absolute resource values per instance type. This provides granular control over borrowing behavior for specific instance types. You need to specify absolute limits using the same schema as **Compute quota**, including instance count, accelerators, vCPU, memory, or accelerator partitions. You can specify absolute limits for one or more instance types in your team's quota.
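
As a quick illustration of the percentage-based limit, the following sketch computes a team's maximum borrowable GPUs; the quota and limit values are hypothetical:

```
quota_gpus=8          # guaranteed quota (hypothetical)
borrow_limit_pct=500  # percentage-based borrow limit, interpreted as 500%
max_borrow=$(( quota_gpus * borrow_limit_pct / 100 ))
max_usable=$(( quota_gpus + max_borrow ))
echo "guaranteed: ${quota_gpus} GPUs, borrowable: up to ${max_borrow}, peak usable: ${max_usable}"
```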

For information on how these concepts are used, such as priority classes and name spaces, see [Example HyperPod task governance AWS CLI commands](sagemaker-hyperpod-eks-operate-console-ui-governance-cli.md).

## Sharing idle compute resources examples
<a name="hp-eks-task-governance-policies-examples"></a>

The total reserved quota should not surpass the cluster's available capacity for that resource, to ensure proper quota management. For example, if a cluster comprises 20 `ml.c5.2xlarge` instances, the cumulative quota assigned to teams should remain under 20. 

If the **Compute allocation** policies for teams allow **Lend and Borrow** or **Lend**, idle capacity is shared between these teams. For example, suppose Team A and Team B both have **Lend and Borrow** enabled. Team A has a quota of 6 but is using only 2 for its jobs, and Team B has a quota of 5 and is using 4 for its jobs. If a job requiring 4 resources is submitted to Team B, Team B uses its 1 remaining unit of quota and borrows the other 3 from Team A. 
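
The arithmetic behind this example can be sketched as follows, using the quota and usage numbers above:

```
team_a_quota=6; team_a_used=2   # Team A has 4 idle units to lend
team_b_quota=5; team_b_used=4   # Team B has 1 free unit of its own quota
job_needs=4

team_b_free=$(( team_b_quota - team_b_used ))
borrowed=$(( job_needs - team_b_free ))
team_a_idle=$(( team_a_quota - team_a_used ))
echo "Team B uses ${team_b_free} of its own quota and borrows ${borrowed} from Team A (${team_a_idle} idle)"
```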

If any team's **Compute allocation** policy is set to **Don't Lend**, the team would not be able to borrow any additional capacity beyond its own allocations.

## How unallocated resource sharing works
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-idle-resource-sharing-how-it-works"></a>

Unallocated resource sharing automatically manages the pool of resources that are not allocated to any compute quota in your cluster. This means HyperPod continuously monitors your cluster state and automatically updates to the correct configuration over time.

**Initial Setup**
+ When you set `IdleResourceSharing` to `Enabled` in your ClusterSchedulerConfig (by default it is `Disabled`), HyperPod task governance begins monitoring your cluster and calculates available idle resources by subtracting team quotas from total node capacity.
+ Unallocated resource sharing ClusterQueues are created to represent the borrowable resource pool.
+ When you first enable unallocated resource sharing, infrastructure setup takes several minutes. You can monitor the progress through the policy `Status` and `DetailedStatus` fields in ClusterSchedulerConfig.

**Ongoing Reconciliation**
+ HyperPod task governance continuously monitors for changes such as node additions or removals and cluster queue quota updates.
+ When changes occur, unallocated resource sharing recalculates quotas and updates the ClusterQueues. Reconciliation typically completes within seconds.

**Monitoring**

 You can verify that unallocated resource sharing is fully configured by checking for unallocated resource sharing ClusterQueues: 

```
kubectl get clusterqueue | grep hyperpod-ns-idle-resource-sharing
```

When you see ClusterQueues with names like `hyperpod-ns-idle-resource-sharing-cq-1`, unallocated resource sharing is active. Note that multiple unallocated resource sharing ClusterQueues may exist depending on the number of resource flavors in your cluster. 

## Node eligibility for unallocated resource sharing
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-idle-resource-sharing-node-eligibility"></a>

Unallocated resource sharing only includes nodes that meet the following requirements:

1. **Node Ready Status**
   + Nodes must be in `Ready` status to contribute to the unallocated resource pool.
   + Nodes in `NotReady` or other non-ready states are excluded from capacity calculations.
   + When a node becomes `Ready`, it is automatically included in the next reconciliation cycle.

1. **Node Schedulable Status**
   + Nodes with `spec.unschedulable: true` are excluded from unallocated resource sharing.
   + When a node becomes schedulable again, it is automatically included in the next reconciliation cycle.

1. **MIG Configuration (GPU nodes only)**
   + For GPU nodes with MIG (Multi-Instance GPU) partitioning, the `nvidia.com/mig.config.state` label must show `success` for the node to contribute MIG profiles to unallocated resource sharing.
   + These nodes will be retried automatically once MIG configuration completes successfully.

1. **Supported Instance Types**
   + The instance must be a supported SageMaker HyperPod instance type.
   + See the list of supported instance types for SageMaker HyperPod clusters.

**Topics**
+ [Sharing idle compute resources examples](#hp-eks-task-governance-policies-examples)
+ [How unallocated resource sharing works](#sagemaker-hyperpod-eks-operate-console-ui-governance-policies-idle-resource-sharing-how-it-works)
+ [Node eligibility for unallocated resource sharing](#sagemaker-hyperpod-eks-operate-console-ui-governance-policies-idle-resource-sharing-node-eligibility)
+ [Create policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-create.md)
+ [Edit policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-edit.md)
+ [Delete policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete.md)
+ [Allocating compute quota in Amazon SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-compute-allocation.md)

# Create policies
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-create"></a>

You can create your **Cluster policy** and **Compute allocation** configurations in the **Policies** tab. The following provides instructions for creating these configurations.
+ Create your **Cluster policy** to define how tasks are prioritized and idle compute is allocated.
+ Create a **Compute allocation** to define a new compute allocation policy for a team.
**Note**  
When you create a **Compute allocation** you will need to set up a Kubernetes role-based access control (RBAC) for data scientist users in the corresponding namespace to run tasks on HyperPod clusters orchestrated with Amazon EKS. The namespaces have the format `hyperpod-ns-team-name`. To set up a Kubernetes RBAC, use the instructions in [create team role](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#5-create-team-role).

For information about the HyperPod task governance EKS cluster policy concepts, see [Policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies.md).

**Create HyperPod task governance policies**

This procedure assumes that you have already created an Amazon EKS cluster set up with HyperPod. If you have not already done so, see [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md).

1. Navigate to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, under **HyperPod Clusters**, choose **Cluster Management**.

1. Choose your Amazon EKS cluster listed under **SageMaker HyperPod clusters**.

1. Choose the **Policies** tab.

1. To create your **Cluster policy**:

   1. Choose the corresponding **Edit** to update how tasks are prioritized and idle compute is allocated.

   1. After you have made your changes, choose **Submit**.

1. To create a **Compute allocation**:

   1. Choose the corresponding **Create**. This takes you to the compute allocation creation page.

   1. After you have made your changes, choose **Submit**.

# Edit policies
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-edit"></a>

You can edit your **Cluster policy** and **Compute allocation** configurations in the **Policies** tab. The following provides instructions on how to edit these configurations.
+ Edit your **Cluster policy** to update how tasks are prioritized and how idle compute is allocated.
+ Edit a **Compute allocation** to update an existing compute allocation policy for a team.
**Note**  
When you create a **Compute allocation**, you need to set up Kubernetes role-based access control (RBAC) for data scientist users in the corresponding namespace so that they can run tasks on HyperPod clusters orchestrated with Amazon EKS. The namespaces have the format `hyperpod-ns-team-name`. To set up Kubernetes RBAC, use the instructions in [create team role](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#5-create-team-role).

For more information about the HyperPod task governance EKS cluster policy concepts, see [Policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies.md).

**Edit HyperPod task governance policies**

This procedure assumes that you have already created an Amazon EKS cluster set up with HyperPod. If you have not already done so, see [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md).

1. Navigate to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, under **HyperPod Clusters**, choose **Cluster Management**.

1. Choose your Amazon EKS cluster listed under **SageMaker HyperPod clusters**.

1. Choose the **Policies** tab.

1. To edit your **Cluster policy**:

   1. Choose the corresponding **Edit** to update how tasks are prioritized and idle compute is allocated.

   1. After you have made your changes, choose **Submit**.

1. To edit your **Compute allocation**:

   1. Choose the configuration you want to edit under **Compute allocation**. This takes you to the configuration details page.

   1. Choose **Edit** to modify the configuration.

   1. After you have made your changes, choose **Submit**.

# Delete policies
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete"></a>

You can delete your **Cluster policy** and **Compute allocation** configurations using the SageMaker AI console or AWS CLI. The following page provides instructions on how to delete your SageMaker HyperPod task governance policies and configurations.

For more information about the HyperPod task governance EKS cluster policy concepts, see [Policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies.md).

**Note**  
If you are having issues with listing or deleting task governance policies, you may need to update your cluster administrator's minimum set of permissions. See the **Amazon EKS** tab in the [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin) section. For additional information, see [Deleting clusters](sagemaker-hyperpod-eks-operate-console-ui-governance-troubleshoot.md#hp-eks-troubleshoot-delete-policies).

## Delete HyperPod task governance policies (console)
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete-console"></a>

The following uses the SageMaker AI console to delete your HyperPod task governance policies.

**Note**  
You cannot delete your **Cluster policy** (`ClusterSchedulerConfig`) using the SageMaker AI console. To learn how to do so using the AWS CLI, see [Delete HyperPod task governance policies (AWS CLI)](#sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete-cli).

**To delete task governance policies (console)**

1. Navigate to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, under **HyperPod Clusters**, choose **Cluster Management**.

1. Choose your Amazon EKS cluster listed under **SageMaker HyperPod clusters**.

1. Choose the **Policies** tab.

1. To delete your **Compute allocation** (`ComputeQuota`):

   1. In the **Compute allocation** section, select the configuration you want to delete.

   1. In the **Actions** dropdown menu, choose **Delete**.

   1. Follow the instructions in the UI to complete the task.

## Delete HyperPod task governance policies (AWS CLI)
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete-cli"></a>

The following uses the AWS CLI to delete your HyperPod task governance policies.

**Note**  
If you are having issues using the following commands, you may need to update your AWS CLI. For more information, see [Installing or updating to the latest version of the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).

**To delete task governance policies (AWS CLI)**

First, set the variable used in the AWS CLI commands that follow. Replace `aws-region` with your AWS Region.

```
REGION=aws-region
```

1. Get the *cluster-arn* associated with the policies you wish to delete. You can use the following AWS CLI command to list the clusters in your AWS Region.

   ```
   aws sagemaker list-clusters \
       --region ${REGION}
   ```

1. To delete your compute allocations (`ComputeQuota`):

   1. List all of the compute quotas associated with the HyperPod cluster.

      ```
      aws sagemaker list-compute-quotas \
          --cluster-arn cluster-arn \
          --region ${REGION}
      ```

   1. For each `compute-quota-id` you wish to delete, run the following command to delete the compute quota.

      ```
      aws sagemaker delete-compute-quota \
          --compute-quota-id compute-quota-id \
          --region ${REGION}
      ```

1. To delete your cluster policies (`ClusterSchedulerConfig`):

   1. List all of the cluster policies associated with the HyperPod cluster.

      ```
      aws sagemaker list-cluster-scheduler-configs \
          --cluster-arn cluster-arn \
          --region ${REGION}
      ```

   1. For each `cluster-scheduler-config-id` you wish to delete, run the following command to delete the cluster policy.

      ```
      aws sagemaker delete-cluster-scheduler-config \
          --cluster-scheduler-config-id cluster-scheduler-config-id \
          --region ${REGION}
      ```

# Allocating compute quota in Amazon SageMaker HyperPod task governance
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-compute-allocation"></a>

Cluster administrators can decide how the organization uses purchased compute. Doing so reduces waste and idle resources. You can allocate compute quota such that teams can borrow unused resources from each other. Compute quota allocation in HyperPod task governance lets administrators allocate resources at the instance level and at a more granular resource level. This capability provides flexible and efficient resource management for teams by allowing granular control over individual compute resources instead of requiring entire instance allocations. Allocating at a granular level eliminates inefficiencies of traditional instance-level allocation. Through this approach, you can optimize resource utilization and reduce idle compute.

Compute quota allocation supports three types of resource allocation: accelerators, vCPU, and memory. Accelerators are components in accelerated computing instances that perform functions such as floating point calculations, graphics processing, or data pattern matching. Accelerators include GPUs, Trainium accelerators, and Neuron cores. For multi-team GPU sharing, different teams can receive specific GPU allocations from the same instance type, maximizing utilization of accelerator hardware. For memory-intensive workloads that require additional RAM for data preprocessing or model caching, you can allocate memory quota beyond the default GPU-to-memory ratio. For CPU-heavy preprocessing tasks that need substantial CPU resources alongside GPU training, you can allocate vCPU independently.

Once you provide a value, HyperPod task governance calculates the ratio using the formula **allocated resource divided by the total amount of resources available in the instance**. HyperPod task governance then uses this ratio to apply default allocations to other resources, but you can override these defaults and customize them based on your use case. The following are sample scenarios of how HyperPod task governance allocates resources based on your values:
+ **Only accelerator specified** - HyperPod task governance applies the default ratio to vCPU and memory based on the accelerator values.
+ **Only vCPU specified** - HyperPod task governance calculates the ratio and applies it to memory. Accelerators are set to 0.
+ **Only memory specified** - HyperPod task governance calculates the ratio and applies it to vCPU because compute is required to run memory-specified workloads. Accelerators are set to 0.
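The scenarios above can be sketched as follows. This is an illustrative approximation of the ratio logic, not the service implementation; the instance totals (4 accelerators, 96 vCPU, 384 GiB memory) are example values for an ml.g5.24xlarge-like instance.

```python
# Illustrative totals for an ml.g5.24xlarge-like instance (example values).
INSTANCE_TOTALS = {
    "accelerators": 4,
    "vcpu": 96,
    "memory_gib": 384,
}

def default_allocations(specified: dict) -> dict:
    """Apply the specified resource's ratio to the unspecified resources."""
    # Ratio = allocated resource / total available on the instance.
    key, value = next(iter(specified.items()))
    ratio = value / INSTANCE_TOTALS[key]
    result = {}
    for name, total in INSTANCE_TOTALS.items():
        if name in specified:
            result[name] = specified[name]
        elif name == "accelerators":
            # Accelerators default to 0 when only vCPU or memory is specified.
            result[name] = 0
        else:
            result[name] = total * ratio
    return result

# Only accelerators specified: vCPU and memory follow the same 50% ratio.
print(default_allocations({"accelerators": 2}))
# Only vCPU specified: memory follows the ratio; accelerators are set to 0.
print(default_allocations({"vcpu": 48}))
```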

To programmatically control quota allocation, use the [ComputeQuotaResourceConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ComputeQuotaResourceConfig.html) object and specify your allocations.

```
{
    "ComputeQuotaConfig": {
        "ComputeQuotaResources": [{
            "InstanceType": "ml.g5.24xlarge",
            "Accelerators": "16",
            "vCpu": "200.0",
            "MemoryInGiB": "2.0"
        }]
    }
}
```

To see all of the allocations, including the defaults, use the [DescribeComputeQuota](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeComputeQuota.html) operation. To update your allocations, use the [UpdateComputeQuota](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateComputeQuota.html) operation.

You can also use the HyperPod CLI to allocate compute quotas. For more information about the HyperPod CLI, see [Running jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-run-jobs.md). The following example demonstrates how to set compute quotas using the HyperPod CLI.

```
hyp create hyp-pytorch-job --version 1.1 --job-name sample-job \
--image 123456789012.dkr.ecr.us-west-2.amazonaws.com/ptjob:latest \
--pull-policy "Always" \
--tasks-per-node 1 \
--max-retry 1 \
--priority high-priority \
--namespace hyperpod-ns-team-name \
--queue-name hyperpod-ns-team-name-localqueue \
--instance-type sample-instance-type \
--accelerators 1 \
--vcpu 3 \
--memory 1 \
--accelerators-limit 1 \
--vcpu-limit 4 \
--memory-limit 2
```

To allocate quotas using the AWS console, follow these steps.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Under HyperPod clusters, choose **Cluster management**.

1. Under **Compute allocations**, choose **Create**.

1. If you don’t already have instances, choose **Add allocation** to add an instance.

1. Under **Allocations**, choose whether to allocate by instances or by individual resources. If you allocate by individual resources, SageMaker AI automatically assigns allocations to the other resources based on the ratio that you chose. To override this ratio-based allocation for a given resource, use the corresponding toggle.

1. Repeat steps 4 and 5 to configure additional instances.

After allocating compute quota, you can then submit jobs through the HyperPod CLI or `kubectl`. HyperPod efficiently schedules workloads based on available quota. 

# Allocating GPU partition quota
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-compute-allocation-gpu-partitions"></a>

You can extend compute quota allocation to support GPU partitioning, enabling fine-grained resource sharing at the GPU partition level. When GPU partitioning is enabled on supported GPUs in the cluster, each physical GPU can be partitioned into multiple isolated GPUs with defined compute, memory, and streaming multiprocessor allocations. For more information about GPU partitioning, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md). You can allocate specific GPU partitions to teams, allowing multiple teams to share a single GPU while maintaining hardware-level isolation and predictable performance.

For example, an ml.p5.48xlarge instance with 8 H100 GPUs can be partitioned into GPU partitions, and you can allocate individual partitions to different teams based on their task requirements. When you specify GPU partition allocations, HyperPod task governance calculates proportional vCPU and memory quotas based on the GPU partition, similar to GPU-level allocation. This approach maximizes GPU utilization by eliminating idle capacity and enabling cost-effective resource sharing across multiple concurrent tasks on the same physical GPU.
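The proportional calculation described above can be sketched as follows. This is an illustrative approximation, not the service implementation; the instance totals (8 GPUs, 192 vCPU, 2048 GiB memory) and the partitions-per-GPU count are example values.

```python
# Illustrative totals for an ml.p5.48xlarge-like instance (example values).
TOTAL_GPUS, TOTAL_VCPU, TOTAL_MEM_GIB = 8, 192, 2048
PARTITIONS_PER_GPU = 7  # e.g. seven small MIG partitions per physical GPU

def partition_quota(partitions: int) -> dict:
    """Derive proportional vCPU/memory quota for a GPU partition allocation."""
    fraction = partitions / (TOTAL_GPUS * PARTITIONS_PER_GPU)
    return {
        "gpu_partitions": partitions,
        "vcpu": round(TOTAL_VCPU * fraction, 2),
        "memory_gib": round(TOTAL_MEM_GIB * fraction, 2),
    }

# Allocating one full GPU's worth of partitions to a team.
print(partition_quota(7))
```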

## Creating Compute Quotas
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-compute-allocation-gpu-partitions-creating"></a>

The following example creates a compute quota that allocates MIG GPU partitions to a team, with a lend-and-borrow sharing strategy:

```
aws sagemaker create-compute-quota \
  --name "fractional-gpu-quota" \
  --compute-quota-config '{
    "ComputeQuotaResources": [
      {
        "InstanceType": "ml.p4d.24xlarge",
        "AcceleratorPartition": {
            "Count": 4,
            "Type": "mig-1g.5gb"
        }
      }
    ],
    "ResourceSharingConfig": { 
      "Strategy": "LendAndBorrow", 
      "BorrowLimit": 100 
    }
  }'
```

## Verifying Quota Resources
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-compute-allocation-gpu-partitions-verifying"></a>

Use `kubectl` to verify that the corresponding Kueue resources were created:

```
# Check ClusterQueue
kubectl get clusterqueues
kubectl describe clusterqueue QUEUE_NAME

# Check ResourceFlavors
kubectl get resourceflavor
kubectl describe resourceflavor FLAVOR_NAME
```

# Example HyperPod task governance AWS CLI commands
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-cli"></a>

You can use HyperPod with EKS through `kubectl` or through the HyperPod CLI. You can run these commands from Studio or the AWS CLI. The following provides SageMaker HyperPod task governance examples of how to view cluster details using the HyperPod CLI commands. For more information, including how to install the CLI, see the [HyperPod CLI GitHub repository](https://github.com/aws/sagemaker-hyperpod-cli).

**Topics**
+ [Get cluster accelerator device quota information](#hp-eks-cli-get-clusters)
+ [Submit a job to SageMaker AI-managed queue and namespace](#hp-eks-cli-start-job)
+ [List jobs](#hp-eks-cli-list-jobs)
+ [Get job detailed information](#hp-eks-cli-get-job)
+ [Suspend and unsuspend jobs](#hp-eks-cli-patch-job)
+ [Debugging jobs](#hp-eks-cli-other)

## Get cluster accelerator device quota information
<a name="hp-eks-cli-get-clusters"></a>

The following example command gets the information on the cluster accelerator device quota.

```
hyperpod get-clusters -n hyperpod-ns-test-team
```

The namespace in this example, `hyperpod-ns-test-team`, is created in Kubernetes based on the team name provided, `test-team`, when the compute allocation is created. For more information, see [Edit policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-edit.md).

Example response:

```
[
    {
        "Cluster": "hyperpod-eks-test-cluster-id",
        "InstanceType": "ml.g5.xlarge",
        "TotalNodes": 2,
        "AcceleratorDevicesAvailable": 1,
        "NodeHealthStatus=Schedulable": 2,
        "DeepHealthCheckStatus=Passed": "N/A",
        "Namespaces": {
            "hyperpod-ns-test-team": {
                "TotalAcceleratorDevices": 1,
                "AvailableAcceleratorDevices": 1
            }
        }
    }
]
```
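The managed namespace name used above is derived from the team name supplied in the compute allocation. A minimal sketch of that convention:

```python
# Namespaces are created with the hyperpod-ns-<team-name> format, based on
# the team name provided when the compute allocation is created.
def team_namespace(team_name: str) -> str:
    return f"hyperpod-ns-{team_name}"

print(team_namespace("test-team"))  # hyperpod-ns-test-team
```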

## Submit a job to SageMaker AI-managed queue and namespace
<a name="hp-eks-cli-start-job"></a>

The following example command submits a job to your HyperPod cluster. If you have access to only one team, the HyperPod CLI automatically assigns the queue for you. If multiple queues are discovered, the CLI displays all viable options for you to select from.

```
hyperpod start-job --job-name hyperpod-cli-test \
    --job-kind kubeflow/PyTorchJob \
    --image docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-bc09cfd \
    --entry-script /opt/pytorch-mnist/mnist.py \
    --pull-policy IfNotPresent \
    --instance-type ml.g5.xlarge \
    --node-count 1 \
    --tasks-per-node 1 \
    --results-dir ./result \
    --priority training-priority
```

The priority classes are defined in the **Cluster policy**, which defines how tasks are prioritized and idle compute is allocated. When a data scientist submits a job, they use one of the priority class names with the format `priority-class-name-priority`. In this example, `training-priority` refers to the priority class named “training”. For more information on policy concepts, see [Policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies.md).

If a priority class is not specified, the job is treated as a low priority job, with a task ranking value of 0. 

If a priority class is specified, but does not correspond to one of the priority classes defined in the **Cluster policy**, the submission fails and an error message provides the defined set of priority classes.
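The `priority-class-name-priority` naming convention and the validation behavior described above can be sketched as follows. The set of defined classes here is illustrative; only `training` appears in this document.

```python
# Illustrative set of priority classes defined in a cluster policy; the
# "training" class comes from this document, the others are hypothetical.
DEFINED_CLASSES = {"training", "inference", "experimentation"}

def resolve_priority(flag: str) -> str:
    """Map a --priority flag value like 'training-priority' to its class name."""
    suffix = "-priority"
    if not flag.endswith(suffix):
        raise ValueError(f"expected '<class-name>-priority', got {flag!r}")
    name = flag[: -len(suffix)]
    if name not in DEFINED_CLASSES:
        # Mirrors the documented behavior: the error reports the defined set.
        raise ValueError(f"unknown priority class {name!r}; defined: {sorted(DEFINED_CLASSES)}")
    return name

print(resolve_priority("training-priority"))  # training
```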

You can also submit the job using a YAML configuration file using the following command: 

```
hyperpod start-job --config-file ./yaml-configuration-file-name.yaml
```

The following is an example YAML configuration file that is equivalent to submitting a job as discussed above.

```
defaults:
  - override hydra/job_logging: stdout
hydra:
  run:
    dir: .
  output_subdir: null
training_cfg:
  entry_script: /opt/pytorch-mnist/mnist.py
  script_args: []
  run:
    name: hyperpod-cli-test
    nodes: 1
    ntasks_per_node: 1
cluster:
  cluster_type: k8s
  instance_type: ml.g5.xlarge
  custom_labels:
    kueue.x-k8s.io/priority-class: training-priority
  cluster_config:
    label_selector:
      required:
        sagemaker.amazonaws.com/node-health-status:
          - Schedulable
      preferred:
        sagemaker.amazonaws.com/deep-health-check-status:
          - Passed
      weights:
        - 100
    pullPolicy: IfNotPresent
base_results_dir: ./result
container: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-bc09cfd
env_vars:
  NCCL_DEBUG: INFO
```

Alternatively, you can submit a job using `kubectl` to ensure the task appears in the **Dashboard** tab. The following is an example kubectl command.

```
kubectl apply -f ./yaml-configuration-file-name.yaml
```

When submitting the job, include your queue name and priority class labels. For example, with the queue name `hyperpod-ns-team-name-localqueue` and priority class `priority-class-name-priority`, you must include the following labels:
+ `kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue` 
+ `kueue.x-k8s.io/priority-class: priority-class-name-priority`

The following YAML configuration snippet demonstrates how to add labels to your original configuration file to ensure your task appears in the **Dashboard** tab:

```
metadata:
    name: job-name
    namespace: hyperpod-ns-team-name
    labels:
        kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
        kueue.x-k8s.io/priority-class: priority-class-name-priority
```

## List jobs
<a name="hp-eks-cli-list-jobs"></a>

The following command lists the jobs and their details.

```
hyperpod list-jobs
```

Example response:

```
{
    "jobs": [
        {
            "Name": "hyperpod-cli-test",
            "Namespace": "hyperpod-ns-test-team",
            "CreationTime": "2024-11-18T21:21:15Z",
            "Priority": "training",
            "State": "Succeeded"
        }
    ]
}
```

## Get job detailed information
<a name="hp-eks-cli-get-job"></a>

The following command provides a job’s details. If no namespace is specified, the HyperPod CLI fetches a namespace managed by SageMaker AI that you have access to.

```
hyperpod get-job --job-name hyperpod-cli-test
```

Example response:

```
{
    "Name": "hyperpod-cli-test",
    "Namespace": "hyperpod-ns-test-team",
    "Label": {
        "app": "hyperpod-cli-test",
        "app.kubernetes.io/managed-by": "Helm",
        "kueue.x-k8s.io/priority-class": "training"
    },
    "CreationTimestamp": "2024-11-18T21:21:15Z",
    "Status": {
        "completionTime": "2024-11-18T21:25:24Z",
        "conditions": [
            {
                "lastTransitionTime": "2024-11-18T21:21:15Z",
                "lastUpdateTime": "2024-11-18T21:21:15Z",
                "message": "PyTorchJob hyperpod-cli-test is created.",
                "reason": "PyTorchJobCreated",
                "status": "True",
                "type": "Created"
            },
            {
                "lastTransitionTime": "2024-11-18T21:21:17Z",
                "lastUpdateTime": "2024-11-18T21:21:17Z",
                "message": "PyTorchJob hyperpod-ns-test-team/hyperpod-cli-test is running.",
                "reason": "PyTorchJobRunning",
                "status": "False",
                "type": "Running"
            },
            {
                "lastTransitionTime": "2024-11-18T21:25:24Z",
                "lastUpdateTime": "2024-11-18T21:25:24Z",
                "message": "PyTorchJob hyperpod-ns-test-team/hyperpod-cli-test successfully completed.",
                "reason": "PyTorchJobSucceeded",
                "status": "True",
                "type": "Succeeded"
            }
        ],
            "replicaStatuses": {
                "Worker": {
                    "selector": "training.kubeflow.org/job-name=hyperpod-cli-test,training.kubeflow.org/operator-name=pytorchjob-controller,training.kubeflow.org/replica-type=worker",
                    "succeeded": 1
                }
            },
        "startTime": "2024-11-18T21:21:15Z"
    },
    "ConsoleURL": "https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/cluster-management/hyperpod-eks-test-cluster-id"
}
```

## Suspend and unsuspend jobs
<a name="hp-eks-cli-patch-job"></a>

If you want to remove a submitted job from the scheduler, the HyperPod CLI provides the `suspend` command to temporarily remove the job from orchestration. A suspended job is no longer scheduled unless it is manually unsuspended with the `unsuspend` command.

To temporarily suspend a job:

```
hyperpod patch-job suspend --job-name hyperpod-cli-test
```

To add a job back to the queue:

```
hyperpod patch-job unsuspend --job-name hyperpod-cli-test
```

## Debugging jobs
<a name="hp-eks-cli-other"></a>

The HyperPod CLI also provides other commands that help you debug job submission issues, such as `list-pods` and `get-logs`. For more information, see the [HyperPod CLI GitHub repository](https://github.com/aws/sagemaker-hyperpod-cli).

# Troubleshoot
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-troubleshoot"></a>

The following page contains known solutions for troubleshooting your HyperPod EKS clusters.

**Topics**
+ [Dashboard tab](#hp-eks-troubleshoot-dashboard)
+ [Tasks tab](#hp-eks-troubleshoot-tasks)
+ [Policies](#hp-eks-troubleshoot-policies)
+ [Deleting clusters](#hp-eks-troubleshoot-delete-policies)
+ [Unallocated resource sharing](#hp-eks-troubleshoot-unallocated-resource-sharing)

## Dashboard tab
<a name="hp-eks-troubleshoot-dashboard"></a>

**The EKS add-on fails to install**

For the EKS add-on installation to succeed, you need Kubernetes version 1.30 or later. To update, see [Update Kubernetes version](https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html).

For the EKS add-on installation to succeed, all of the nodes need to be in **Ready** status and all of the pods need to be in **Running** status. 

To check the status of your nodes, use the [list-cluster-nodes](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-cluster-nodes.html) AWS CLI command or navigate to your EKS cluster in the [EKS console](https://console.aws.amazon.com/eks/home#/clusters) and view the status of your nodes. Resolve the issue for each node or reach out to your administrator. If a node's status is **Unknown**, delete the node. Once all node statuses are **Ready**, retry installing the EKS add-on in HyperPod from the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

To check the status of your pods, use the [Kubernetes CLI](https://kubernetes.io/docs/reference/kubectl/) command `kubectl get pods -n cloudwatch-agent` or navigate to your EKS cluster in the [EKS console](https://console.aws.amazon.com/eks/home#/clusters) and view the status of the pods in the `cloudwatch-agent` namespace. Resolve the issues for the pods or reach out to your administrator. Once all pod statuses are **Running**, retry installing the EKS add-on in HyperPod from the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

For more troubleshooting, see [Troubleshooting the Amazon CloudWatch Observability EKS add-on](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Observability-EKS-addon.html#Container-Insights-setup-EKS-addon-troubleshoot).

## Tasks tab
<a name="hp-eks-troubleshoot-tasks"></a>

If you see an error message stating that the **Custom Resource Definition (CRD) is not configured on the cluster**, grant the `EKSAdminViewPolicy` and `ClusterAccessRole` policies to your domain execution role.
+ For information on how to get your execution role, see [Get your execution role](sagemaker-roles.md#sagemaker-roles-get-execution-role).
+ To learn how to attach policies to an IAM user or group, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html).

## Policies
<a name="hp-eks-troubleshoot-policies"></a>

The following lists solutions to errors related to policies when using the HyperPod APIs or console.
+ If the policy is in `CreateFailed` or `CreateRollbackFailed` status, you need to delete the failed policy and create a new one.
+ If the policy is in `UpdateFailed` status, retry the update with the same policy ARN.
+ If the policy is in `UpdateRollbackFailed` status, you need to delete the failed policy and then create a new one.
+ If the policy is in `DeleteFailed` or `DeleteRollbackFailed` status, retry the delete with the same policy ARN.
  + If you run into an error while trying to delete the **Compute prioritization** (cluster policy) using the HyperPod console, try deleting the `cluster-scheduler-config` using the API. To check the status of the resource, go to the details page of a compute allocation.

To see more details about the failure, use the corresponding describe API operation.
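The status-to-action mapping above can be summarized as a simple lookup (illustrative, not an API):

```python
# Recovery action for each policy failure status, per the list above.
RECOVERY_ACTIONS = {
    "CreateFailed": "delete the failed policy and create a new one",
    "CreateRollbackFailed": "delete the failed policy and create a new one",
    "UpdateFailed": "retry the update with the same policy ARN",
    "UpdateRollbackFailed": "delete the failed policy and create a new one",
    "DeleteFailed": "retry the delete with the same policy ARN",
    "DeleteRollbackFailed": "retry the delete with the same policy ARN",
}

print(RECOVERY_ACTIONS["UpdateFailed"])
```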

## Deleting clusters
<a name="hp-eks-troubleshoot-delete-policies"></a>

The following lists known solutions to errors related to deleting clusters.
+ When cluster deletion fails due to attached SageMaker HyperPod task governance policies, you need to [delete the policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete.md) first.
+ When cluster deletion fails due to missing the following permissions, you need to update your cluster administrator's minimum set of permissions. See the **Amazon EKS** tab in the [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin) section.
  + `sagemaker:ListComputeQuotas`
  + `sagemaker:ListClusterSchedulerConfig`
  + `sagemaker:DeleteComputeQuota`
  + `sagemaker:DeleteClusterSchedulerConfig`

## Unallocated resource sharing
<a name="hp-eks-troubleshoot-unallocated-resource-sharing"></a>

If your unallocated resource pool capacity is less than expected:

1. **Check node ready status**

   ```
   kubectl get nodes
   ```

   Verify all nodes show `Ready` status in the STATUS column.

1. **Check node schedulable status**

   ```
   kubectl get nodes -o custom-columns=NAME:.metadata.name,UNSCHEDULABLE:.spec.unschedulable
   ```

   Verify nodes show `<none>` or `false` (not `true`).

1. **List unallocated resource sharing ClusterQueues:**

   ```
   kubectl get clusterqueue | grep hyperpod-ns-idle-resource-sharing
   ```

   This shows all unallocated resource sharing ClusterQueues. If the ClusterQueues are not showing up, check the `FailureReason` under the ClusterSchedulerConfig policy for failure messages that can help you continue debugging.

1. **Verify unallocated resource sharing quota:**

   ```
   kubectl describe clusterqueue hyperpod-ns-idle-resource-sharing-<index>
   ```

   Check the `spec.resourceGroups[].flavors[].resources` section to see the quota allocated for each resource flavor.

   Multiple unallocated resource sharing ClusterQueues may exist depending on the number of resource flavors in your cluster. 

1. **Check MIG configuration status (GPU nodes):**

   ```
   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/mig\.config\.state}{"\n"}{end}'
   ```

   Verify MIG-enabled nodes show `success` state.

# Attribution document for Amazon SageMaker HyperPod task governance
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-attributions"></a>

The following lists attributions and third-party licenses for material used in Amazon SageMaker HyperPod task governance.

**Topics**
+ [base-files](#hp-eks-task-governance-attributions-base-files)
+ [netbase](#hp-eks-task-governance-attributions-netbase)
+ [golang-lru](#hp-eks-task-governance-attributions-golang-lru)

## [base-files](https://packages.debian.org/bookworm/base-files)
<a name="hp-eks-task-governance-attributions-base-files"></a>

```
This is the Debian prepackaged version of the Debian Base System
Miscellaneous files. These files were written by Ian Murdock
<imurdock@debian.org> and Bruce Perens <bruce@pixar.com>.

This package was first put together by Bruce Perens <Bruce@Pixar.com>,
from his own sources.

The GNU Public Licenses in /usr/share/common-licenses were taken from
ftp.gnu.org and are copyrighted by the Free Software Foundation, Inc.

The Artistic License in /usr/share/common-licenses is the one coming
from Perl and its SPDX name is "Artistic License 1.0 (Perl)".


Copyright © 1995-2011 Software in the Public Interest.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

On Debian systems, the complete text of the GNU General
Public License can be found in `/usr/share/common-licenses/GPL'.
```

## [netbase](https://packages.debian.org/source/stable/netbase)
<a name="hp-eks-task-governance-attributions-netbase"></a>

```
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Comment:
 This package was created by Peter Tobias tobias@et-inf.fho-emden.de on
 Wed, 24 Aug 1994 21:33:28 +0200 and maintained by Anthony Towns
 <ajt@debian.org> until 2001.
 It is currently maintained by Marco d'Itri <md@linux.it>.

Files: *
Copyright:
 Copyright © 1994-1998 Peter Tobias
 Copyright © 1998-2001 Anthony Towns
 Copyright © 2002-2022 Marco d'Itri
License: GPL-2
 This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License, version 2, as
 published by the Free Software Foundation.
 .
 This program is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 GNU General Public License for more details.
 .
 You should have received a copy of the GNU General Public License along
 with this program; if not, write to the Free Software Foundation,
 Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
 .
 On Debian systems, the complete text of the GNU General Public License
 version 2 can be found in '/usr/share/common-licenses/GPL-2'.
```

## [golang-lru](https://github.com/hashicorp/golang-lru)
<a name="hp-eks-task-governance-attributions-golang-lru"></a>

```
Copyright © 2014 HashiCorp, Inc.

Mozilla Public License, version 2.0

1. Definitions

1.1. "Contributor"

     means each individual or legal entity that creates, contributes to the
     creation of, or owns Covered Software.

1.2. "Contributor Version"

     means the combination of the Contributions of others (if any) used by a
     Contributor and that particular Contributor's Contribution.

1.3. "Contribution"

     means Covered Software of a particular Contributor.

1.4. "Covered Software"

     means Source Code Form to which the initial Contributor has attached the
     notice in Exhibit A, the Executable Form of such Source Code Form, and
     Modifications of such Source Code Form, in each case including portions
     thereof.

1.5. "Incompatible With Secondary Licenses"
     means

     a. that the initial Contributor has attached the notice described in
        Exhibit B to the Covered Software; or

     b. that the Covered Software was made available under the terms of
        version 1.1 or earlier of the License, but not also under the terms of
        a Secondary License.

1.6. "Executable Form"

     means any form of the work other than Source Code Form.

1.7. "Larger Work"

     means a work that combines Covered Software with other material, in a
     separate file or files, that is not Covered Software.

1.8. "License"

     means this document.

1.9. "Licensable"

     means having the right to grant, to the maximum extent possible, whether
     at the time of the initial grant or subsequently, any and all of the
     rights conveyed by this License.

1.10. "Modifications"

     means any of the following:

     a. any file in Source Code Form that results from an addition to,
        deletion from, or modification of the contents of Covered Software; or

     b. any new file in Source Code Form that contains any Covered Software.

1.11. "Patent Claims" of a Contributor

      means any patent claim(s), including without limitation, method,
      process, and apparatus claims, in any patent Licensable by such
      Contributor that would be infringed, but for the grant of the License,
      by the making, using, selling, offering for sale, having made, import,
      or transfer of either its Contributions or its Contributor Version.

1.12. "Secondary License"

      means either the GNU General Public License, Version 2.0, the GNU Lesser
      General Public License, Version 2.1, the GNU Affero General Public
      License, Version 3.0, or any later versions of those licenses.

1.13. "Source Code Form"

      means the form of the work preferred for making modifications.

1.14. "You" (or "Your")

      means an individual or a legal entity exercising rights under this
      License. For legal entities, "You" includes any entity that controls, is
      controlled by, or is under common control with You. For purposes of this
      definition, "control" means (a) the power, direct or indirect, to cause
      the direction or management of such entity, whether by contract or
      otherwise, or (b) ownership of more than fifty percent (50%) of the
      outstanding shares or beneficial ownership of such entity.


2. License Grants and Conditions

2.1. Grants

     Each Contributor hereby grants You a world-wide, royalty-free,
     non-exclusive license:

     a. under intellectual property rights (other than patent or trademark)
        Licensable by such Contributor to use, reproduce, make available,
        modify, display, perform, distribute, and otherwise exploit its
        Contributions, either on an unmodified basis, with Modifications, or
        as part of a Larger Work; and

     b. under Patent Claims of such Contributor to make, use, sell, offer for
        sale, have made, import, and otherwise transfer either its
        Contributions or its Contributor Version.

2.2. Effective Date

     The licenses granted in Section 2.1 with respect to any Contribution
     become effective for each Contribution on the date the Contributor first
     distributes such Contribution.

2.3. Limitations on Grant Scope

     The licenses granted in this Section 2 are the only rights granted under
     this License. No additional rights or licenses will be implied from the
     distribution or licensing of Covered Software under this License.
     Notwithstanding Section 2.1(b) above, no patent license is granted by a
     Contributor:

     a. for any code that a Contributor has removed from Covered Software; or

     b. for infringements caused by: (i) Your and any other third party's
        modifications of Covered Software, or (ii) the combination of its
        Contributions with other software (except as part of its Contributor
        Version); or

     c. under Patent Claims infringed by Covered Software in the absence of
        its Contributions.

     This License does not grant any rights in the trademarks, service marks,
     or logos of any Contributor (except as may be necessary to comply with
     the notice requirements in Section 3.4).

2.4. Subsequent Licenses

     No Contributor makes additional grants as a result of Your choice to
     distribute the Covered Software under a subsequent version of this
     License (see Section 10.2) or under the terms of a Secondary License (if
     permitted under the terms of Section 3.3).

2.5. Representation

     Each Contributor represents that the Contributor believes its
     Contributions are its original creation(s) or it has sufficient rights to
     grant the rights to its Contributions conveyed by this License.

2.6. Fair Use

     This License is not intended to limit any rights You have under
     applicable copyright doctrines of fair use, fair dealing, or other
     equivalents.

2.7. Conditions

     Sections 3.1, 3.2, 3.3, and 3.4 are conditions of the licenses granted in
     Section 2.1.


3. Responsibilities

3.1. Distribution of Source Form

     All distribution of Covered Software in Source Code Form, including any
     Modifications that You create or to which You contribute, must be under
     the terms of this License. You must inform recipients that the Source
     Code Form of the Covered Software is governed by the terms of this
     License, and how they can obtain a copy of this License. You may not
     attempt to alter or restrict the recipients' rights in the Source Code
     Form.

3.2. Distribution of Executable Form

     If You distribute Covered Software in Executable Form then:

     a. such Covered Software must also be made available in Source Code Form,
        as described in Section 3.1, and You must inform recipients of the
        Executable Form how they can obtain a copy of such Source Code Form by
        reasonable means in a timely manner, at a charge no more than the cost
        of distribution to the recipient; and

     b. You may distribute such Executable Form under the terms of this
        License, or sublicense it under different terms, provided that the
        license for the Executable Form does not attempt to limit or alter the
        recipients' rights in the Source Code Form under this License.

3.3. Distribution of a Larger Work

     You may create and distribute a Larger Work under terms of Your choice,
     provided that You also comply with the requirements of this License for
     the Covered Software. If the Larger Work is a combination of Covered
     Software with a work governed by one or more Secondary Licenses, and the
     Covered Software is not Incompatible With Secondary Licenses, this
     License permits You to additionally distribute such Covered Software
     under the terms of such Secondary License(s), so that the recipient of
     the Larger Work may, at their option, further distribute the Covered
     Software under the terms of either this License or such Secondary
     License(s).

3.4. Notices

     You may not remove or alter the substance of any license notices
     (including copyright notices, patent notices, disclaimers of warranty, or
     limitations of liability) contained within the Source Code Form of the
     Covered Software, except that You may alter any license notices to the
     extent required to remedy known factual inaccuracies.

3.5. Application of Additional Terms

     You may choose to offer, and to charge a fee for, warranty, support,
     indemnity or liability obligations to one or more recipients of Covered
     Software. However, You may do so only on Your own behalf, and not on
     behalf of any Contributor. You must make it absolutely clear that any
     such warranty, support, indemnity, or liability obligation is offered by
     You alone, and You hereby agree to indemnify every Contributor for any
     liability incurred by such Contributor as a result of warranty, support,
     indemnity or liability terms You offer. You may include additional
     disclaimers of warranty and limitations of liability specific to any
     jurisdiction.

4. Inability to Comply Due to Statute or Regulation

   If it is impossible for You to comply with any of the terms of this License
   with respect to some or all of the Covered Software due to statute,
   judicial order, or regulation then You must: (a) comply with the terms of
   this License to the maximum extent possible; and (b) describe the
   limitations and the code they affect. Such description must be placed in a
   text file included with all distributions of the Covered Software under
   this License. Except to the extent prohibited by statute or regulation,
   such description must be sufficiently detailed for a recipient of ordinary
   skill to be able to understand it.

5. Termination

5.1. The rights granted under this License will terminate automatically if You
     fail to comply with any of its terms. However, if You become compliant,
     then the rights granted under this License from a particular Contributor
     are reinstated (a) provisionally, unless and until such Contributor
     explicitly and finally terminates Your grants, and (b) on an ongoing
     basis, if such Contributor fails to notify You of the non-compliance by
     some reasonable means prior to 60 days after You have come back into
     compliance. Moreover, Your grants from a particular Contributor are
     reinstated on an ongoing basis if such Contributor notifies You of the
     non-compliance by some reasonable means, this is the first time You have
     received notice of non-compliance with this License from such
     Contributor, and You become compliant prior to 30 days after Your receipt
     of the notice.

5.2. If You initiate litigation against any entity by asserting a patent
     infringement claim (excluding declaratory judgment actions,
     counter-claims, and cross-claims) alleging that a Contributor Version
     directly or indirectly infringes any patent, then the rights granted to
     You by any and all Contributors for the Covered Software under Section
     2.1 of this License shall terminate.

5.3. In the event of termination under Sections 5.1 or 5.2 above, all end user
     license agreements (excluding distributors and resellers) which have been
     validly granted by You or Your distributors under this License prior to
     termination shall survive termination.

6. Disclaimer of Warranty

   Covered Software is provided under this License on an "as is" basis,
   without warranty of any kind, either expressed, implied, or statutory,
   including, without limitation, warranties that the Covered Software is free
   of defects, merchantable, fit for a particular purpose or non-infringing.
   The entire risk as to the quality and performance of the Covered Software
   is with You. Should any Covered Software prove defective in any respect,
   You (not any Contributor) assume the cost of any necessary servicing,
   repair, or correction. This disclaimer of warranty constitutes an essential
   part of this License. No use of  any Covered Software is authorized under
   this License except under this disclaimer.

7. Limitation of Liability

   Under no circumstances and under no legal theory, whether tort (including
   negligence), contract, or otherwise, shall any Contributor, or anyone who
   distributes Covered Software as permitted above, be liable to You for any
   direct, indirect, special, incidental, or consequential damages of any
   character including, without limitation, damages for lost profits, loss of
   goodwill, work stoppage, computer failure or malfunction, or any and all
   other commercial damages or losses, even if such party shall have been
   informed of the possibility of such damages. This limitation of liability
   shall not apply to liability for death or personal injury resulting from
   such party's negligence to the extent applicable law prohibits such
   limitation. Some jurisdictions do not allow the exclusion or limitation of
   incidental or consequential damages, so this exclusion and limitation may
   not apply to You.

8. Litigation

   Any litigation relating to this License may be brought only in the courts
   of a jurisdiction where the defendant maintains its principal place of
   business and such litigation shall be governed by laws of that
   jurisdiction, without reference to its conflict-of-law provisions. Nothing
   in this Section shall prevent a party's ability to bring cross-claims or
   counter-claims.

9. Miscellaneous

   This License represents the complete agreement concerning the subject
   matter hereof. If any provision of this License is held to be
   unenforceable, such provision shall be reformed only to the extent
   necessary to make it enforceable. Any law or regulation which provides that
   the language of a contract shall be construed against the drafter shall not
   be used to construe this License against a Contributor.


10. Versions of the License

10.1. New Versions

      Mozilla Foundation is the license steward. Except as provided in Section
      10.3, no one other than the license steward has the right to modify or
      publish new versions of this License. Each version will be given a
      distinguishing version number.

10.2. Effect of New Versions

      You may distribute the Covered Software under the terms of the version
      of the License under which You originally received the Covered Software,
      or under the terms of any subsequent version published by the license
      steward.

10.3. Modified Versions

      If you create software not governed by this License, and you want to
      create a new license for such software, you may create and use a
      modified version of this License if you rename the license and remove
      any references to the name of the license steward (except to note that
      such modified license differs from this License).

10.4. Distributing Source Code Form that is Incompatible With Secondary
      Licenses If You choose to distribute Source Code Form that is
      Incompatible With Secondary Licenses under the terms of this version of
      the License, the notice described in Exhibit B of this License must be
      attached.

Exhibit A - Source Code Form License Notice

      This Source Code Form is subject to the
      terms of the Mozilla Public License, v.
      2.0. If a copy of the MPL was not
      distributed with this file, You can
      obtain one at
      http://mozilla.org/MPL/2.0/.

If it is not possible or desirable to put the notice in a particular file,
then You may include the notice in a location (such as a LICENSE file in a
relevant directory) where a recipient would be likely to look for such a
notice.

You may add additional accurate notices of copyright ownership.

Exhibit B - "Incompatible With Secondary Licenses" Notice

      This Source Code Form is "Incompatible
      With Secondary Licenses", as defined by
      the Mozilla Public License, v. 2.0.
```

# Usage reporting for cost attribution in SageMaker HyperPod
<a name="sagemaker-hyperpod-usage-reporting"></a>

Usage reporting in SageMaker HyperPod EKS-orchestrated clusters provides granular visibility into compute resource consumption. The capability allows organizations to implement transparent cost attribution, allocating cluster costs to teams, projects, or departments based on their actual usage. By tracking metrics such as GPU hours, CPU hours, and Neuron Core utilization, captured in *both team-level aggregates and task-specific breakdowns*, usage reporting complements HyperPod's [Task Governance](sagemaker-hyperpod-eks-operate-console-ui-governance.md) functionality, ensuring fair cost distribution in shared multi-tenant clusters by:
+ Eliminating guesswork in cost allocation
+ Directly linking expenses to measurable resource consumption
+ Enforcing usage-based accountability in shared infrastructure environments

## Prerequisites
<a name="sagemaker-hyperpod-usage-reporting-prerequisites"></a>

To use this capability:
+ You need:
  + An active **SageMaker HyperPod environment** with a running EKS-orchestrated cluster.
  + (Strongly recommended) **Task Governance configured** with compute quotas and priority rules. For setup instructions, see [Task Governance setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup.md).
+ Familiarize yourself with these core concepts:
  + **Allocated compute quota:** Resources reserved for a team based on predefined quotas in their Task Governance policies. This is *guaranteed capacity* for their workloads.
  + **Borrowed compute:** Idle resources from the shared cluster pool that teams can temporarily use *beyond their allocated quota*. Borrowed compute is assigned dynamically based on priority rules in the Task Governance policies and availability of unused resources.
  + **Compute usage:** The measurement of resources (GPU, CPU, Neuron Core hours) consumed by a team, tracked as:
    + **Allocated utilization**: Usage within the team's quota.
    + **Borrowed utilization**: Usage beyond the quota, drawn from the shared pool.
  + **Cost attribution:** The process of allocating cluster costs to teams based on their *actual compute usage*, including both resources consumed within their predefined quota and resources temporarily used from the shared cluster pool beyond their quota.
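
The allocated/borrowed split above is what drives cost attribution. As a minimal Python sketch of the idea, the following bills a team for both allocated and borrowed hours; the hourly rate is a placeholder (not an actual AWS price) and the usage rows are made up for illustration:

```python
# Placeholder rate for illustration only; not an actual AWS price.
HOURLY_RATE = {"ml.g5.4xlarge": 1.50}

def attribute_cost(usage, rates):
    """Return per-team cost. Both allocated and borrowed hours are billed,
    since cost attribution covers all actual consumption."""
    costs = {}
    for team, rows in usage.items():
        total = 0.0
        for row in rows:
            hours = row["allocated_hours"] + row["borrowed_hours"]
            total += hours * rates[row["instance_type"]]
        costs[team] = round(total, 2)
    return costs

# Made-up usage mirroring the concepts above: 170 h within quota, 30 h borrowed.
usage = {
    "ml-team": [{"instance_type": "ml.g5.4xlarge",
                 "allocated_hours": 170, "borrowed_hours": 30}],
}
print(attribute_cost(usage, HOURLY_RATE))  # → {'ml-team': 300.0}
```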

## Report types
<a name="sagemaker-hyperpod-usage-reporting-report-types"></a>

HyperPod usage reports provide two levels of operational granularity:
+ **Summary reports** provide organization-wide visibility into compute usage, aggregating total GPU/CPU/Neuron Core hours per team (namespace) while distinguishing between *regular usage* (resources from a team's allocated quota) and *borrowed compute* (overflow capacity from shared pools).
+ **Detailed reports** offer task-level breakdowns by team, tracking exact compute hours spent running specific tasks – including preempted tasks, hourly utilization patterns, and namespace-specific allocations.

**Important**  
HyperPod usage reporting tracks compute utilization across *all Kubernetes namespaces* in a cluster—including those managed by Task Governance, default namespaces, and namespaces created **outside of Task Governance** (e.g., via direct Kubernetes API calls or external tools). This infrastructure-level monitoring ensures comprehensive usage-based accountability, preventing gaps in cost attribution for shared clusters regardless of how namespaces are managed.

## Report formats and time range
<a name="sagemaker-hyperpod-usage-reporting-formats"></a>

Using the Python script provided in [Generate reports](sagemaker-hyperpod-usage-reporting-generate.md), administrators can generate usage reports on demand in CSV or PDF formats, selecting time ranges from daily snapshots to 180-day (6-month) historical windows.

**Note**  
You can configure the historical window to extend beyond the default 180-day maximum when setting up the reporting infrastructure. For more information on configuring the data retention period, see [Install Usage Report Infrastructure using CloudFormation](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md#install-usage-report-infrastructure-using-cloudformation). 

## Illustrative use cases
<a name="sagemaker-hyperpod-usage-reporting-use-cases"></a>

This capability addresses critical scenarios in multi-tenant AI/ML environments such as:

1. **Cost allocation for shared clusters**: An administrator manages a HyperPod cluster shared by 20 teams training generative AI models. Using a *summary usage report*, they analyze daily GPU utilization over 180 days and discover Team A consumed 200 GPU hours of a specific instance type—170 from their allocated quota and 30 from borrowed compute. The administrator invoices Team A based on this reported usage.

1. **Auditing and dispute resolution**: A finance team questions cost attribution accuracy, citing inconsistencies. The administrator can export a *detailed task-level report* to audit discrepancies. By cross-referencing timestamps, instance types, and preempted jobs within the team's namespace, the report transparently reconciles the disputed usage data.

# Report details and data breakdown
<a name="sagemaker-hyperpod-usage-reporting-content"></a>

SageMaker HyperPod's usage reports provide two distinct lenses for analyzing compute resource consumption: **summary reports** for cost allocation and **detailed reports** for granular auditing. Summary reports aggregate cluster-wide usage by team or namespace, highlighting trends in allocated versus borrowed compute across GPU, CPU, and Neuron Core resources. Detailed reports drill into individual tasks, exposing metrics such as execution windows, task status, and priority-class utilization. In this section, we break down the structure of these reports, explain their key metrics, and demonstrate how administrators and finance teams can cross-reference summary trends with task-level data to validate cost attribution accuracy, resolve discrepancies, and optimize shared infrastructure.

## Common report headers
<a name="sagemaker-hyperpod-usage-reporting-content-headers"></a>

Both summary and detailed reports include the following metadata to contextualize the usage data:
+ **ClusterName:** The EKS-orchestrated HyperPod cluster name where resources were consumed.
+ **Type:** The report category (`Summary Utilization Report` or `Detailed Utilization Report`).
+ **Date Generated:** When the report was created (e.g., `2025-04-18`).
+ **Date Range (UTC):** The timeframe covered (e.g., `2025-04-16 to 2025-04-18`).
+ **Missing data periods:** Gaps in data collection due to cluster downtime or monitoring issues (e.g., `2025-04-16 00:00:00 to 2025-04-19 00:00:00`).

## Summary reports
<a name="sagemaker-hyperpod-usage-reporting-content-summary"></a>

Summary reports provide a high-level, per-day overview of compute resource consumption across teams/namespaces and instance types, distinguishing between allocated (reserved quota) and borrowed (shared pool) utilization. These reports are ideal for invoice generation, cost attribution statements, or capacity forecasting.

*Example: A summary report might show that Team A used 200 GPU hours—170 from their allocated quota and 30 borrowed.*

Here's a structured breakdown of the key columns in a summary report:
+ **Date:** The date of the reported usage (e.g., `2025-04-18`).
+ **Namespace:** The Kubernetes namespace associated with the team (e.g., `hyperpod-ns-ml-team`).
+ **Team:** The owning team/department (e.g., `ml-team`).
+ **Instance Type:** The compute instance used (e.g., `ml.g5.4xlarge`).
+ **Total/Allocated/Borrowed Utilization (Hours):** The breakdown of GPU, CPU, or Neuron Core usage by category.

  Where:
  + **Total utilization = Allocated utilization + Borrowed utilization**
  + **Allocated utilization** is the actual GPU, CPU, or Neuron Core hours a team has used, capped at 100% of their allocated quota.
  + **Borrowed utilization** is the actual GPU, CPU, or Neuron Core hours a team has used *beyond their allocated quota*, drawn from the shared cluster pool based on Task Governance priority rules and resource availability.

Example: 72 GPU hours total (48 allocated, 24 borrowed).

**Note**  
Only total utilization is displayed for namespaces not managed by Task Governance.
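
As a sketch of how the summary columns fit together, the following parses a small CSV and checks the identity *Total = Allocated + Borrowed*. The header names mirror the fields described above but are assumptions about the exact export format:

```python
import csv
import io

# Made-up sample row matching the 72 = 48 + 24 example above.
SAMPLE = """Date,Namespace,Team,Instance Type,Total Utilization (Hours),Allocated Utilization (Hours),Borrowed Utilization (Hours)
2025-04-18,hyperpod-ns-ml-team,ml-team,ml.g5.4xlarge,72,48,24
"""

def check_totals(csv_text):
    """Parse summary rows and verify total = allocated + borrowed per row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        total = float(row["Total Utilization (Hours)"])
        parts = (float(row["Allocated Utilization (Hours)"])
                 + float(row["Borrowed Utilization (Hours)"]))
        assert abs(total - parts) < 1e-9, f"inconsistent row: {row}"
    return rows

rows = check_totals(SAMPLE)
print(rows[0]["Team"], rows[0]["Total Utilization (Hours)"])  # → ml-team 72
```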

## Detailed reports
<a name="sagemaker-hyperpod-usage-reporting-content-detailed"></a>

Detailed reports provide forensic-level visibility into compute usage, breaking down resource consumption by task and exposing granular metrics such as task execution windows, status (e.g., Succeeded, Failed), and priority-class usage. These reports are ideal for validating billing discrepancies or ensuring compliance with governance policies.

Here's a structured breakdown of the key columns in a detailed report:
+ **Date:** The date of the reported usage (e.g., `2025-04-18`).
+ **Period Start/End:** The exact execution window (UTC) for the task (e.g., `19:54:34`).
+ **Namespace:** The Kubernetes namespace associated with the team (e.g., `hyperpod-ns-ml-team`).
+ **Team:** The owning team/department (e.g., `ml-team`).
+ **Task:** The identifier for the job/pod (e.g., `pytorchjob-ml-pytorch-job-2p5zt-db686`).
+ **Instance:** The compute instance used (e.g., `ml.g5.4xlarge`).
+ **Status:** The task outcome (`Succeeded`, `Failed`, or `Preempted`).
+ **Total Utilization:** Total consumption (hours and instance count) of GPU, CPU, or Neuron Core resources.
+ **Priority Class:** The priority tier assigned (e.g., `training-priority`).
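
To illustrate how detailed rows can be cross-checked against a summary report, here is a sketch that rolls task-level hours up to per-namespace totals. The dict layout is illustrative, not an official report schema:

```python
from collections import defaultdict

# Made-up task-level rows mirroring the detailed-report columns above.
detailed = [
    {"namespace": "hyperpod-ns-ml-team", "instance": "ml.g5.4xlarge",
     "status": "Succeeded", "hours": 2.5},
    {"namespace": "hyperpod-ns-ml-team", "instance": "ml.g5.4xlarge",
     "status": "Preempted", "hours": 0.5},
]

def rollup(rows):
    """Aggregate task hours per (namespace, instance type); the per-namespace
    totals should match the corresponding summary-report rows."""
    totals = defaultdict(float)
    for r in rows:
        totals[(r["namespace"], r["instance"])] += r["hours"]
    return dict(totals)

print(rollup(detailed))  # → {('hyperpod-ns-ml-team', 'ml.g5.4xlarge'): 3.0}
```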

# Generate reports
<a name="sagemaker-hyperpod-usage-reporting-generate"></a>

This guide provides step-by-step instructions to configure and manage usage reporting for your SageMaker HyperPod clusters. Follow these procedures to deploy infrastructure, generate custom reports, and remove resources when no longer needed.

## Set up usage reporting
<a name="sagemaker-hyperpod-usage-reporting-install"></a>

**Note**  
Before configuring the SageMaker HyperPod usage report infrastructure in your SageMaker HyperPod cluster, ensure you have met all requirements detailed in the [Prerequisites](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md#prerequisites) section of the usage report README.

Usage reporting in HyperPod requires:
+ Deploying SageMaker HyperPod usage report AWS resources using a CloudFormation stack
+ Installing the SageMaker HyperPod usage report Kubernetes operator via a Helm chart

You can find comprehensive installation instructions in the [SageMaker HyperPod usage report GitHub repository](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md). Specifically, follow the steps in the [Set up](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md#set-up-usage-reporting) section.

## Generate usage reports on demand
<a name="sagemaker-hyperpod-usage-reporting-use"></a>

Once the usage reporting infrastructure and Kubernetes operator are installed, job data for your SageMaker HyperPod cluster is automatically collected and stored in the S3 bucket you configured during setup. The operator continuously captures detailed usage metrics in the background, creating raw data files in the `raw` directory of your designated S3 bucket.

To generate an on-demand usage report, you can use the `run.py` script provided in the [SageMaker HyperPod usage report GitHub repository](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md) to extract and export usage metrics. Specifically, you can find the script and comprehensive instructions for generating a report in the [Generate Reports](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md#generate-reports) section.

The script allows you to:
+ Specify custom date ranges for report generation
+ Choose between detailed and summary report types
+ Export reports in CSV or PDF formats
+ Direct reports to a specific S3 location

## Clean up usage reporting resources
<a name="sagemaker-hyperpod-usage-reporting-cleanup"></a>

When you no longer need your SageMaker HyperPod usage reporting infrastructure, follow the steps in [Clean Up Resources](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md#clean-up-resources) to clean up the Kubernetes operator and AWS resources (in that order). Proper resource deletion helps prevent unnecessary costs.

# Configuring storage for SageMaker HyperPod clusters orchestrated by Amazon EKS
<a name="sagemaker-hyperpod-eks-setup-storage"></a>

Cluster admins need to configure storage so that data scientist users can manage input and output data and store checkpoints during training on SageMaker HyperPod clusters.

**Handling large datasets (input/output data)**
+ **Data access and management**: Data scientists often work with large datasets that are required for training machine learning models. Specifying storage parameters in the job submission allows them to define where these datasets are located (e.g., Amazon S3 buckets, persistent volumes in Kubernetes) and how they are accessed during the job execution.
+ **Performance optimization**: The efficiency of accessing input data can significantly impact the performance of the training job. By optimizing storage parameters, data scientists can ensure that data is read and written efficiently, reducing I/O bottlenecks.

**Storing checkpoints**
+ **Checkpointing in training**: During long-running training jobs, it’s common practice to save checkpoints—intermediate states of the model. This allows data scientists to resume training from a specific point in case of a failure, rather than starting from scratch.
+ **Data recovery and experimentation**: By specifying the storage location for checkpoints, data scientists can ensure that these checkpoints are securely stored, potentially in a distributed storage system that offers redundancy and high availability. This is crucial for recovering from interruptions and for experimenting with different training strategies.
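
The checkpointing practice described above can be sketched against a shared storage mount as follows. The directory layout and JSON state are assumptions for illustration; a real training job would typically save framework-native checkpoints to a path such as an FSx for Lustre or Mountpoint for Amazon S3 mount:

```python
import json
import os
import tempfile

def save_checkpoint(state, ckpt_dir, step):
    """Write a checkpoint atomically: write to a temp name, then rename,
    so partially written files are never mistaken for valid checkpoints."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"step-{step:06d}.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)
    return path

def latest_checkpoint(ckpt_dir):
    """Return the most recent checkpoint path, or None if none exist."""
    files = sorted(f for f in os.listdir(ckpt_dir) if f.endswith(".json"))
    return os.path.join(ckpt_dir, files[-1]) if files else None

# Simulate two checkpoints, then resume from the most recent one.
with tempfile.TemporaryDirectory() as d:  # stand-in for a shared mount path
    save_checkpoint({"step": 100, "loss": 0.42}, d, 100)
    save_checkpoint({"step": 200, "loss": 0.31}, d, 200)
    with open(latest_checkpoint(d)) as f:
        resumed = json.load(f)["step"]
print(resumed)  # → 200
```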

**Tip**  
For a hands-on experience and guidance on how to set up storage for a SageMaker HyperPod cluster orchestrated with Amazon EKS, see the following sections in the [Amazon EKS Support in SageMaker HyperPod workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/2433d39e-ccfe-4c00-9d3d-9917b729258e).  
[Set up Amazon FSx for Lustre on SageMaker HyperPod](https://catalog.us-east-1.prod.workshops.aws/workshops/2433d39e-ccfe-4c00-9d3d-9917b729258e/en-US/01-cluster/06-fsx-for-lustre)
[Set up a mountpoint for Amazon S3](https://catalog.us-east-1.prod.workshops.aws/workshops/2433d39e-ccfe-4c00-9d3d-9917b729258e/en-US/01-cluster/09-s3-mountpoint) using [Mountpoint for Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mountpoint.html)

# Using the Amazon EBS CSI driver on SageMaker HyperPod EKS clusters
<a name="sagemaker-hyperpod-eks-ebs"></a>

SageMaker HyperPod supports the Amazon Elastic Block Store (Amazon EBS) Container Storage Interface (CSI) driver, which manages the lifecycle of Amazon EBS volumes as storage for the Kubernetes volumes that you create. With the Amazon EBS CSI driver, you can create, attach, and manage your Amazon EBS volumes for your machine learning workloads running on SageMaker HyperPod clusters with Amazon EKS orchestration.

**Topics**
+ [Key storage capabilities](#sagemaker-hyperpod-eks-ebs-features)
+ [Use cases](#sagemaker-hyperpod-eks-ebs-use)
+ [Setting up the Amazon EBS CSI driver on SageMaker HyperPod EKS clusters](#sagemaker-hyperpod-eks-ebs-setup)
+ [Using the APIs](#sagemaker-hyperpod-eks-ebs-setup-apis)

## Key storage capabilities
<a name="sagemaker-hyperpod-eks-ebs-features"></a>

The Amazon EBS CSI driver on SageMaker HyperPod supports the following storage capabilities.
+ Static provisioning: Associates pre-created Amazon EBS volumes with Kubernetes [persistent volumes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) for use in your pods.
+ Dynamic provisioning: Automatically creates Amazon EBS volumes and associated persistent volumes from [persistent volume claims](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims). Parameters can be passed via [storage classes](https://kubernetes.io/docs/concepts/storage/storage-classes/) for fine-grained control over volume creation.
+ Volume resizing: Expands existing volumes by updating the [persistent volume claim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims) size specification without disrupting running workloads. This can be essential for handling growing model repositories or adapting to larger nodes without service disruption.
+ Volume snapshots: Creates point-in-time snapshots of volumes for backup, recovery, and data versioning.
+ Block volumes: Provides raw block device access for high-performance applications requiring direct storage access.
+ Volume modification: Changes volume properties such as type, input or output operations per second (IOPS), or throughput using [volume attributes classes](https://kubernetes.io/docs/concepts/storage/volume-attributes-classes/).
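
To make dynamic provisioning concrete, here is a sketch of the two Kubernetes objects involved, expressed as Python dicts and serialized to JSON, which `kubectl apply -f -` accepts. The class name, volume type, and requested size are illustrative assumptions, not required values:

```python
import json

# Hypothetical StorageClass: parameters map to EBS CSI driver volume options.
storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "ebs-gp3"},
    "provisioner": "ebs.csi.aws.com",
    "parameters": {"type": "gp3", "encrypted": "true"},
    "volumeBindingMode": "WaitForFirstConsumer",
}

# A claim against that class; the driver creates a matching EBS volume
# when a pod that uses the claim is first scheduled.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "training-data"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "ebs-gp3",
        "resources": {"requests": {"storage": "100Gi"}},
    },
}

# Bundle both objects so a single `kubectl apply -f -` creates them together.
manifest = {"apiVersion": "v1", "kind": "List", "items": [storage_class, pvc]}

if __name__ == "__main__":
    print(json.dumps(manifest, indent=2))
```

With `WaitForFirstConsumer`, volume creation is deferred until a pod references the claim, so the volume lands in the same Availability Zone as the node that runs the pod.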

For more information about the Amazon EBS CSI driver, see [Use Kubernetes volume storage with Amazon EBS](https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html) from the *Amazon EKS User Guide*.

For more information about storage to pods in your cluster, see [Storage](https://kubernetes.io/docs/concepts/storage/) from the *Kubernetes Documentation*.

## Use cases
<a name="sagemaker-hyperpod-eks-ebs-use"></a>

The Amazon EBS CSI driver integration enables several key use cases for both training and inference workloads on SageMaker HyperPod EKS clusters.

**Training workloads**
+ Dataset storage: Provision volumes for training datasets that persist across pod restarts
+ Checkpoint storage: Save model checkpoints and intermediate training results
+ Shared artifacts: Access common datasets and model artifacts across multiple training jobs

**Inference workloads**
+ Model storage: Dynamically provision appropriately sized volumes based on model requirements
+ Container caching: Create ephemeral storage for improved inference performance
+ Event logging: Store inference results and logs with persistent storage

## Setting up the Amazon EBS CSI driver on SageMaker HyperPod EKS clusters
<a name="sagemaker-hyperpod-eks-ebs-setup"></a>

The Amazon Elastic Block Store (Amazon EBS) Container Storage Interface (CSI) driver allows you to dynamically provision and manage Amazon EBS volumes for your containerized workloads running on SageMaker HyperPod clusters with EKS orchestration. This section walks you through installing and configuring the Amazon EBS CSI driver to enable persistent storage for your machine learning workloads.

### Prerequisites
<a name="sagemaker-hyperpod-eks-ebs-setup-prerequisite"></a>

Before you begin, do the following:
+ [Install and configure the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html)
+ [Create a SageMaker HyperPod cluster with Amazon EKS orchestration](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-create-cluster.html)
+ Install the Amazon EBS CSI driver, version [v1.47.0](https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/CHANGELOG.md#v1470) or later

### Additional permissions
<a name="sagemaker-hyperpod-eks-ebs-setup-permissions"></a>

To set up the Amazon EBS CSI driver add-on, follow the instructions in [Use Kubernetes volume storage with Amazon EBS](https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html) from the *Amazon EKS User Guide*. You should also add the following additional permissions to the IAM role used to run the driver add-on. Note that this is the IAM role specified in your service account configuration for the driver add-on, not the HyperPod cluster execution role.

------
#### [ JSON ]


```
{
    "Version":"2012-10-17",		 	 	 
    "Statement":
    [
        {
            "Effect": "Allow",
            "Action":
            [
                "sagemaker:AttachClusterNodeVolume",
                "sagemaker:DetachClusterNodeVolume"
            ],
            "Resource": "arn:aws:sagemaker:us-east-1:111122223333:cluster/*"
        },
        {
            "Effect": "Allow",
            "Action":
            [
                "eks:DescribeCluster"
            ],
            "Resource": "arn:aws:eks:us-east-1:111122223333:cluster/my-cluster-name"
        }
    ]
}
```

------

## Using the APIs
<a name="sagemaker-hyperpod-eks-ebs-setup-apis"></a>

As an alternative, you can use the [AttachClusterNodeVolume](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AttachClusterNodeVolume.html) and [DetachClusterNodeVolume](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DetachClusterNodeVolume.html) API operations to attach and detach your Amazon EBS volumes to SageMaker HyperPod EKS cluster instances.

**Key requirements for using these APIs include the following.**
+ Both the Amazon EBS volume and SageMaker HyperPod EKS cluster must be owned by the same AWS account.
+ The calling principal needs specific minimum permissions to successfully perform the attach or detach operation. For more information about the minimum permissions, see the following sections.
+ After attaching a volume to your HyperPod node, follow the instructions in [Accessing SageMaker HyperPod cluster nodes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-access-through-terminal.html) to access the cluster node, and [Make a volume available for use](https://docs.aws.amazon.com/ebs/latest/userguide/ebs-using-volumes.html) to mount the attached volume.

### Required permissions for `sagemaker:AttachClusterNodeVolume`
<a name="sagemaker-hyperpod-eks-ebs-setup-apis-attach"></a>

------
#### [ JSON ]


```
{
    "Version":"2012-10-17",		 	 	 
    "Statement":
    [
        {
            "Effect": "Allow",
            "Action":
            [
                "sagemaker:AttachClusterNodeVolume"
            ],
            "Resource": "arn:aws:sagemaker:us-east-1:111122223333:cluster/*"
        },
        {
            "Effect": "Allow",
            "Action":
            [
                "eks:DescribeCluster"
            ],
            "Resource": "arn:aws:eks:us-east-1:111122223333:cluster/my-cluster-name"
        },
        {
            "Effect": "Allow",
            "Action":
            [
                "ec2:AttachVolume",
                "ec2:DescribeVolumes"
            ],
            "Resource": "arn:aws:ec2:us-east-1:111122223333:volume/*"
        }
    ]
}
```

------

### Required permissions for `sagemaker:DetachClusterNodeVolume`
<a name="sagemaker-hyperpod-eks-ebs-setup-apis-detach"></a>

------
#### [ JSON ]


```
{
    "Version":"2012-10-17",		 	 	 
    "Statement":
    [
        {
            "Effect": "Allow",
            "Action":
            [
                "sagemaker:DetachClusterNodeVolume"
            ],
            "Resource": "arn:aws:sagemaker:us-east-1:111122223333:cluster/*"
        },
        {
            "Effect": "Allow",
            "Action":
            [
                "eks:DescribeCluster"
            ],
            "Resource": "arn:aws:eks:us-east-1:111122223333:cluster/my-cluster-name"
        },
        {
            "Effect": "Allow",
            "Action":
            [
                "ec2:DetachVolume",
                "ec2:DescribeVolumes"
            ],
            "Resource": "arn:aws:ec2:us-east-1:111122223333:volume/*"
        }
    ]
}
```

------

### Required permissions for AWS KMS keys
<a name="sagemaker-hyperpod-eks-ebs-setup-apis-kms"></a>

Add the following AWS KMS permissions only if you're using customer managed KMS keys to encrypt your Amazon EBS volumes attached to HyperPod cluster nodes. These permissions are not required if you're using AWS-managed KMS keys (the default encryption option).

------
#### [ JSON ]


```
{
    "Version":"2012-10-17",		 	 	 
    "Id": "key-default-1",
    "Statement":
    [
        {
            "Effect": "Allow",
            "Principal":
            {
                "AWS": "arn:aws:iam::111122223333:role/caller-role"
            },
            "Action": "kms:DescribeKey",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Principal":
            {
                "AWS": "arn:aws:iam::111122223333:role/caller-role"
            },
            "Action": "kms:CreateGrant",
            "Resource": "*",
            "Condition":
            {
                "StringEquals":
                {
                    "kms:CallerAccount": "111122223333",
                    "kms:ViaService": "ec2.us-east-1.amazonaws.com"
                },
                "ForAnyValue:StringEquals":
                {
                    "kms:EncryptionContextKeys": "aws:ebs:id"
                },
                "Bool":
                {
                    "kms:GrantIsForAWSResource": true
                },
                "ForAllValues:StringEquals":
                {
                    "kms:GrantOperations":
                    [
                        "Decrypt"
                    ]
                }
            }
        }
    ]
}
```

------

**Note**  
These AWS KMS permissions are not required for `sagemaker:DetachClusterNodeVolume` when detaching a Cluster Auto Volume Attachment (CAVA) volume encrypted with customer managed KMS keys.

# Configuring custom Kubernetes labels and taints in Amazon SageMaker HyperPod
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints"></a>

Amazon SageMaker HyperPod clusters with Amazon Elastic Kubernetes Service (Amazon EKS) orchestrator support custom Kubernetes labels and taints for nodes within instance groups. Labels and taints are fundamental scheduling and organization mechanisms in Kubernetes that give you fine-grained control over pod placement and resource utilization.

Labels are key-value pairs that can be attached to Kubernetes objects, allowing you to organize and select resources based on attributes. Taints, working in conjunction with tolerations, are node-specific properties that influence pod scheduling by repelling pods that don't have matching tolerations. Together, these mechanisms enable you to isolate workloads, assign them according to hardware specifications, and ensure optimal resource utilization.
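
To make the relationship concrete, the sketch below shows a pod that targets labeled GPU nodes and tolerates a `gpu=true:NoSchedule` taint, expressed as a Python dict for illustration; the pod name and container image are assumptions:

```python
import json

# A pod that schedules only onto nodes labeled gpu-type=a100 and tolerates
# the gpu=true:NoSchedule taint, so it can land where untolerating pods cannot.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "training-pod"},
    "spec": {
        "nodeSelector": {"gpu-type": "a100"},  # matches a custom node label
        "tolerations": [{
            "key": "gpu",
            "operator": "Equal",
            "value": "true",
            "effect": "NoSchedule",  # lets the pod schedule despite the taint
        }],
        "containers": [{"name": "train", "image": "my-training-image:latest"}],
    },
}

if __name__ == "__main__":
    print(json.dumps(pod, indent=2))
```

A pod without the toleration is repelled from the tainted nodes, while this one is both attracted (by the selector) and admitted (by the toleration).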

## Common use cases
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-use-cases"></a>

The following are common scenarios where custom labels and taints are beneficial:
+ **Preventing system pods on expensive instances** - Apply taints to GPU instances to prevent system pods and other non-critical workloads from consuming expensive compute resources
+ **Integration with existing tooling** - Apply labels that match your organization's established infrastructure patterns and node affinity configurations

## Configuring labels and taints
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-configure"></a>

You can configure custom Kubernetes labels and taints at the instance group level using the `KubernetesConfig` parameter in your cluster configuration. Labels and taints are applied to all nodes in the instance group and persist throughout the cluster's lifecycle.

The `KubernetesConfig` parameter is declarative, meaning you specify the complete desired state of labels and taints for an instance group. SageMaker HyperPod then reconciles the actual state of the nodes to match this desired state.
+ **Adding labels or taints** - Include the new labels or taints in the `KubernetesConfig` along with any existing ones you want to keep
+ **Updating labels or taints** - Modify the values in the `KubernetesConfig` for the labels or taints you want to change, and include all others you want to keep
+ **Removing labels or taints** - Omit the labels or taints you want to remove from the `KubernetesConfig`, keeping only those you want to retain
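
Because the parameter is declarative, a client must always send the complete desired set rather than a delta. The helper below (hypothetical, not part of any SageMaker SDK) illustrates add, update, and remove-by-omission in one payload:

```python
def desired_kubernetes_config(labels, taints):
    """Build the complete KubernetesConfig payload for an update.

    HyperPod reconciles nodes to exactly this state, so anything omitted
    here is removed from the instance group's nodes.
    """
    return {"Labels": dict(labels), "Taints": list(taints)}

# Current labels on the instance group (as returned by DescribeCluster).
current = {"env": "prod", "team": "ml-training", "gpu-type": "a100"}

# Add cost-center and drop gpu-type: edit a copy, then send the whole map.
updated = {**current, "cost-center": "ml-ops"}
del updated["gpu-type"]

config = desired_kubernetes_config(
    updated,
    [{"key": "gpu", "value": "true", "effect": "NoSchedule"}],
)
```

Sending only the new label (without the existing ones) would strip the existing labels from the nodes, which is the most common mistake with declarative APIs.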

### Creating a cluster with labels and taints
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-create"></a>

When creating a new SageMaker HyperPod cluster, include the `KubernetesConfig` parameter in your instance group configuration. The following example shows how to create a cluster with custom labels and taints:

```
{
    "ClusterName": "my-cluster",
    "InstanceGroups": [{
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.p4d.24xlarge",
        "InstanceCount": 4,
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://my-bucket/lifecycle-config.sh",
            "OnCreate": "on-create.sh"
        },
        "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodExecutionRole",
        "ThreadsPerCore": 1,
        "KubernetesConfig": { 
            "Labels": {
                "env": "prod",
                "team": "ml-training",
                "gpu-type": "a100"
            },
            "Taints": [{
                "key": "gpu",
                "value": "true",
                "effect": "NoSchedule"
            },
            {
                "key": "dedicated",
                "value": "ml-workloads",
                "effect": "NoExecute"
            }]
        }
    }],
    "VpcConfig": {
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
        "Subnets": ["subnet-0123456789abcdef0", "subnet-0123456789abcdef1"]
    },
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "arn:aws:eks:us-west-2:123456789012:cluster/my-eks-cluster"
        }
    }
}
```

In this example:
+ **Labels** - Three custom labels are applied: `env=prod`, `team=ml-training`, and `gpu-type=a100`
+ **Taints** - Two taints are configured to prevent unwanted pod scheduling

### Updating labels and taints on an existing cluster
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-update"></a>

You can modify labels and taints on an existing cluster using the `UpdateCluster` API. The following example shows how to update the `KubernetesConfig` for an instance group:

```
{
    "ClusterName": "my-cluster",
    "InstanceGroups": [{
        "InstanceGroupName": "worker-group-1",
        "KubernetesConfig": { 
            "Labels": {
                "env": "prod",
                "team": "ml-training",
                "gpu-type": "a100",
                "cost-center": "ml-ops"
            },
            "Taints": [{
                "key": "gpu",
                "value": "true",
                "effect": "NoSchedule"
            }]
        }
    }]
}
```

When you update labels and taints, SageMaker HyperPod applies the changes to all nodes in the instance group. The service manages the transition from current to desired state, which you can monitor using the `DescribeCluster` API.

## Monitoring label and taint application
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-monitor"></a>

SageMaker HyperPod provides APIs to monitor the status of labels and taints as they are applied to your cluster nodes.

### Checking cluster-level status
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-describe-cluster"></a>

Use the `DescribeCluster` API to view the current and desired states of labels and taints at the instance group level. The following example shows the response structure:

```
{
    "ClusterName": "my-cluster",
    "ClusterStatus": "InService",
    "InstanceGroups": [{
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.p4d.24xlarge",
        "CurrentInstanceCount": 4,
        "TargetInstanceCount": 4,
        "KubernetesConfig": {
            "CurrentLabels": {
                "env": "prod",
                "team": "ml-training",
                "gpu-type": "a100"
            },
            "DesiredLabels": {
                "env": "prod",
                "team": "ml-training",
                "gpu-type": "a100"
            },
            "CurrentTaints": [{
                "key": "gpu",
                "value": "true",
                "effect": "NoSchedule"
            }],
            "DesiredTaints": [{
                "key": "gpu",
                "value": "true",
                "effect": "NoSchedule"
            }]
        }
    }]
}
```

When the `CurrentLabels` match `DesiredLabels` and `CurrentTaints` match `DesiredTaints`, all nodes in the instance group have the specified configuration applied. If they differ, the cluster is still in the process of applying the changes.
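
As an example, a hypothetical polling helper could compare the current and desired fields from a `DescribeCluster` response to decide whether reconciliation has finished:

```python
def labels_and_taints_applied(instance_group):
    """Return True when an instance group from DescribeCluster has converged."""
    k8s = instance_group.get("KubernetesConfig", {})

    def normalize(taints):
        # Taint order is not significant, so compare as sorted tuples.
        return sorted((t["key"], t.get("value"), t["effect"]) for t in taints or [])

    return (
        k8s.get("CurrentLabels") == k8s.get("DesiredLabels")
        and normalize(k8s.get("CurrentTaints")) == normalize(k8s.get("DesiredTaints"))
    )
```

A caller would invoke this on each element of the `InstanceGroups` list and keep polling until every group reports `True`.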

### Checking individual node status
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-describe-node"></a>

For node-level details, use the `DescribeClusterNode` API to check the label and taint configuration of individual nodes. The following example shows the response structure:

```
{
    "NodeDetails": { 
        "InstanceId": "i-0123456789abcdef0",
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.p4d.24xlarge",
        "InstanceStatus": {
            "Status": "Running",
            "Message": "Node is healthy"
        },
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://my-bucket/lifecycle-config.sh",
            "OnCreate": "on-create.sh"
        },
        "LaunchTime": 1699564800.0,
        "KubernetesConfig": {
            "CurrentLabels": {
                "env": "prod",
                "team": "ml-training",
                "gpu-type": "a100"
            },
            "DesiredLabels": {
                "env": "prod",
                "team": "ml-training",
                "gpu-type": "a100"
            },
            "CurrentTaints": [{
                "key": "gpu",
                "value": "true",
                "effect": "NoSchedule"
            }],
            "DesiredTaints": [{
                "key": "gpu",
                "value": "true",
                "effect": "NoSchedule"
            }]
        }
    }
}
```

Node-level monitoring is useful for troubleshooting when labels or taints are not applying correctly to specific nodes, or when you need to verify the configuration of a particular instance.

## Reserved prefixes
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-reserved-prefixes"></a>

Certain prefixes are reserved for system use and should not be used for custom labels or taints. The following prefixes are reserved:
+ `kubernetes.io/` - Reserved for Kubernetes core components
+ `k8s.io/` - Reserved for Kubernetes core components
+ `sagemaker.amazonaws.com/` - Reserved for SageMaker HyperPod
+ `eks.amazonaws.com/` - Reserved for Amazon EKS
+ `k8s.aws/` - Reserved for Amazon EKS
+ `karpenter.sh/` - Reserved for Karpenter autoscaling

Labels and taints with these prefixes are managed by system components and should not be overwritten with custom values.
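
A small validation helper (hypothetical, not part of any SDK) can catch reserved prefixes before a create or update call:

```python
# System-reserved prefixes that custom labels and taints must not use.
RESERVED_PREFIXES = (
    "kubernetes.io/",
    "k8s.io/",
    "sagemaker.amazonaws.com/",
    "eks.amazonaws.com/",
    "k8s.aws/",
    "karpenter.sh/",
)

def find_reserved_keys(labels, taints):
    """Return label or taint keys that use a system-reserved prefix."""
    keys = list(labels) + [t["key"] for t in taints]
    return [k for k in keys if k.startswith(RESERVED_PREFIXES)]
```

Running this check client-side before calling the API avoids accidentally shadowing keys that system components manage.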

# Checkpointless training in Amazon SageMaker HyperPod
<a name="sagemaker-eks-checkpointless"></a>

Checkpointless training on Amazon SageMaker HyperPod enables faster recovery from training infrastructure faults. The following documentation helps you get started with checkpointless training and fine-tuning for NeMo-supported models.

Checkpointless training has the following prerequisites:
+ [Getting started with Amazon EKS support in SageMaker HyperPod](sagemaker-hyperpod-eks-prerequisites.md)
+ [Installing the training operator](sagemaker-eks-operator-install.md). You must install v1.2.0 or above.

Checkpointless training on SageMaker HyperPod is built on top of the [NVIDIA NeMo framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/core/exp_manager.html#experiment-manager). You can run checkpointless training with pre-created SageMaker HyperPod recipes. If you're familiar with NeMo, using the checkpointless training recipes is a similar process. With minor changes, you can start training a model using checkpointless training features that enable you to recover quickly from training faults.

The following HyperPod recipes are pre-configured with checkpointless training optimizations. You can specify your data paths as part of the recipe and use the associated launch script to run training (see the quick start guide below):


| Model | Method | Size | Nodes | Instance | Accelerator | Recipe | Script | Tutorial | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| GPT OSS | Full finetune example | 120b | 16 | p5.48xlarge | GPU H100 | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/fine-tuning/gpt_oss/checkpointless_gpt_oss_120b_full_fine_tuning.yaml) | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_full_fine_tuning.sh) | [link](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-finetune.html) | 
| GPT OSS | LoRA-example | 120b | 2 | p5.48xlarge | GPU H100 | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/fine-tuning/gpt_oss/checkpointless_gpt_oss_120b_lora.yaml) | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_lora.sh) | [link](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-peft.html) | 
| Llama3 | Pretrain example | 70b | 16 | p5.48xlarge | GPU H100 | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/training/llama/checkpointless_llama3_70b_pretrain.yaml) | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/launcher_scripts/llama/run_checkpointless_llama3_70b_pretrain.sh) | [link](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-pretraining-llama3.html) | 
| Llama3 | LoRA-example | 70b | 2 | p5.48xlarge | GPU H100 | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/fine-tuning/llama/checkpointless_llama3_70b_lora.yaml) | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/launcher_scripts/llama/run_checkpointless_llama3_70b_lora.sh) | [link](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-peft-llama.html) | 

The following quick-start guide provides tutorials for using checkpointless training recipes:

**Getting started examples**
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless Full Finetuning GPT OSS 120b](sagemaker-eks-checkpointless-recipes-finetune.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA GPT OSS 120b](sagemaker-eks-checkpointless-recipes-peft.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining Llama 3 70b](sagemaker-eks-checkpointless-recipes-pretraining-llama3.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA Llama 3 70b](sagemaker-eks-checkpointless-recipes-peft-llama.md)

If you’d like to pre-train or fine-tune custom models, see [Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining or Finetuning Custom Models](sagemaker-eks-checkpointless-recipes-custom.md).

To learn more about incorporating specific checkpointless training components, see [HyperPod checkpointless training features](sagemaker-eks-checkpointless-features.md).

# Amazon SageMaker HyperPod checkpointless training tutorials
<a name="sagemaker-eks-checkpointless-recipes"></a>

[HyperPod checkpointless training recipes](https://github.com/aws/sagemaker-hyperpod-checkpointless-training) are predefined job configurations with checkpointless training features enabled. These recipes make it easier to get started with checkpointless training on HyperPod.

**Topics**
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless Full Finetuning GPT OSS 120b](sagemaker-eks-checkpointless-recipes-finetune.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA GPT OSS 120b](sagemaker-eks-checkpointless-recipes-peft.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining Llama 3 70b](sagemaker-eks-checkpointless-recipes-pretraining-llama3.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA Llama 3 70b](sagemaker-eks-checkpointless-recipes-peft-llama.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining or Finetuning Custom Models](sagemaker-eks-checkpointless-recipes-custom.md)

# Tutorials - Amazon SageMaker HyperPod Checkpointless Full Finetuning GPT OSS 120b
<a name="sagemaker-eks-checkpointless-recipes-finetune"></a>

The following sequence of steps is required to run checkpointless training recipes on HyperPod.

## Prerequisites
<a name="sagemaker-eks-checkpointless-recipes-finetune-prereqs"></a>

Before you start setting up your environment, make sure you have:
+ [Enabled Amazon EKS support in Amazon SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [Set up the HyperPod training operator (v1.2.0 or later)](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html)
+ A shared storage location, such as an Amazon FSx file system or an NFS system that's accessible from the cluster nodes
+ Data in one of the following formats:
  + JSON
  + JSONGZ (compressed JSON)
  + ARROW
+ Pick a supported checkpointless training recipe for Llama 70B or GPT-OSS 120B from the [source](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection).
+ [Download the Hugging Face model weights](https://huggingface.co/docs/hub/models-downloading) and convert them to a [NeMo-supported format](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/features/hf-integration.html#importing-from-hugging-face).
+ Set up your environment, as described in the following section.

## Kubernetes environment setup
<a name="sagemaker-eks-checkpointless-finetune-recipes-kubernetes"></a>

To set up your Kubernetes environment, do the following:

1. Set up the virtual environment. Make sure your version of Python is greater than or equal to 3.10 and lower than 3.14.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. [Install Helm](https://helm.sh/docs/intro/install/)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   ```

1. Install dependencies using one of the following methods:

   1. Method 1: SageMaker HyperPod recipes method:

      ```
      # install SageMaker HyperPod Recipes.
      git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
      cd sagemaker-hyperpod-recipes
      pip3 install -r requirements.txt
      ```

   1. Method 2: kubectl with pre-defined job yaml method

      ```
      # install SageMaker HyperPod checkpointless training.
      git clone git@github.com:aws/sagemaker-hyperpod-checkpointless-training.git
      cd sagemaker-hyperpod-checkpointless-training
      ```

You can now launch the checkpointless training recipe using either the NeMo-style launcher or kubectl.

## Launch training jobs with the recipes launcher
<a name="sagemaker-eks-checkpointless-recipes-finetune-launcher"></a>

You can use the Amazon SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating `k8s.yaml` and `config.yaml`, and then running the launch script.

1. Update `launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_full_fine_tuning.sh`

   Set `CONTAINER` to a Deep Learning container image. To find the most recent release of the checkpointless training container, see the [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).

   ```
   #!/bin/bash
   
   SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
   TRAIN_DIR="${TRAIN_DIR}"
   VAL_DIR="${VAL_DIR}"
   EXP_DIR="${EXP_DIR}"
   LOG_DIR="${LOG_DIR}"
   CONTAINER_MOUNT="/data"
   CONTAINER="${CONTAINER}"
   MODEL_NAME_OR_PATH="${MODEL_NAME_OR_PATH}"
   
   HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
       recipes=fine-tuning/gpt_oss/checkpointless_gpt_oss_120b_full_fine_tuning \
       recipes.dataset.dataset_path="${TRAIN_DIR}" \
       recipes.exp_manager.exp_dir="${EXP_DIR}" \
       recipes.log_dir="${LOG_DIR}" \
       recipes.resume.restore_config.path="${MODEL_NAME_OR_PATH}" \
       base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
       git.use_default=false \
       cluster=k8s \
       cluster_type=k8s \
       container="${CONTAINER}" \
       +cluster.hostNetwork=true \
       +cluster.persistent_volume_claims.0.claimName=fsx-claim \
       +cluster.persistent_volume_claims.0.mountPath="${CONTAINER_MOUNT}" \
       +recipes.dataset.val_dataset_path="${VAL_DIR}" \
       ++recipes.callbacks.3.test_fault_config.fault_prob_between_lock=1
   ```

1. Launch the training job

   ```
   bash launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_full_fine_tuning.sh
   ```

After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods

NAME                             READY   STATUS             RESTARTS        AGE
gpt-oss-120b-worker-0            0/1     Running            0               36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod <name of pod>
```

After the job STATUS changes to Running, you can examine the log by using the following command.

```
kubectl logs <name of pod>
```

The `STATUS` turns to `Completed` when the job finishes, which you can see when you run `kubectl get pods`.

## Launch the training job with kubectl with pre-defined yaml
<a name="sagemaker-eks-checkpointless-recipes-finetune-kubectl"></a>

Another option is to launch the training through kubectl with a pre-defined job yaml.

1. Update `examples/gpt_oss/launch/full_finetune_gpt_oss_120b_checkpointless_p5.yaml`:
   + image: A Deep Learning container. To find the most recent release of the checkpointless training container, see the [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).
   + `resume.restore_config.path=<path_to_pretrained_weights>`: The path to the pretrained model weights in NeMo format that you downloaded in the [Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-finetune.html#sagemaker-eks-checkpointless-recipes-finetune-prereqs) step.
   + `dataset.dataset_path=<path_to_dataset>`: The path to the dataset stored in the shared storage.

1. Submit the job using kubectl with `full_finetune_gpt_oss_120b_checkpointless_p5.yaml`:

   ```
   kubectl apply -f examples/gpt_oss/launch/full_finetune_gpt_oss_120b_checkpointless_p5.yaml
   ```

After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods

NAME                    READY   STATUS    RESTARTS   AGE
gpt-oss-120b-worker-0   0/1     Running   0          36s
```

If the STATUS is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod <name of pod>
```

After the job STATUS changes to `Running`, you can examine the log by using the following command.

```
kubectl logs <name of pod>
```

The STATUS changes to `Completed` when you run `kubectl get pods` after the job finishes.

# Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA GPT OSS 120b
<a name="sagemaker-eks-checkpointless-recipes-peft"></a>

The following sequence of steps is required to run checkpointless training recipes on HyperPod.

## Prerequisites
<a name="sagemaker-eks-checkpointless-recipes-peft-prereqs"></a>

Before you start setting up your environment, make sure you have:
+ [Enabled Amazon EKS support in Amazon SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [Set up the HyperPod training operator (v1.2+)](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html)
+ A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ Pick a supported checkpointless training recipe for Llama 70B or GPT-OSS 120B from the [source](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection).
+ [Download the Hugging Face model weights](https://huggingface.co/docs/hub/models-downloading) and convert them to a [Nemo supported format](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/features/hf-integration.html#importing-from-hugging-face).
+ Set up your environment

## Kubernetes environment setup
<a name="sagemaker-eks-checkpointless-recipes-peft-kubernetes"></a>

To set up your Kubernetes environment, do the following:

1. Set up the virtual environment. Make sure you're using Python greater than or equal to 3.10 and lower than 3.14.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. [Install Helm](https://helm.sh/docs/intro/install/)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   ```

1. Install dependencies using one of the following methods:
   + SageMaker HyperPod recipes method:

     ```
     # install SageMaker HyperPod Recipes.
     git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
     cd sagemaker-hyperpod-recipes
     pip3 install -r requirements.txt
     ```
   + kubectl with a pre-defined job YAML method:

     ```
     # install SageMaker HyperPod checkpointless training.
     git clone git@github.com:aws/sagemaker-hyperpod-checkpointless-training.git
     cd sagemaker-hyperpod-checkpointless-training
     ```

You can now launch the checkpointless training recipe using either the NeMo-style launcher or using kubectl.
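
Before launching, you can sanity-check that the tools installed in the steps above are on your PATH and that your interpreter satisfies the Python version constraint. A minimal sketch:

```shell
# Check the required CLI tools from the setup steps above.
for tool in aws kubectl eksctl helm; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done

# Confirm Python is >= 3.10 and < 3.14, as required by the virtual environment step.
python3 -c 'import sys; v = sys.version_info[:2]; print("python %d.%d:" % v, "ok" if (3, 10) <= v < (3, 14) else "unsupported")'
```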

## Launch the training job with the recipes launcher
<a name="sagemaker-eks-checkpointless-recipes-peft-recipes-launcher"></a>

You can use the SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating `k8s.yaml` and `config.yaml`, and running the launch script.

1. Update `launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_lora.sh`

   `CONTAINER`: A Deep Learning container. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).

   ```
   #!/bin/bash
   SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
   TRAIN_DIR="${TRAIN_DIR}"
   VAL_DIR="${VAL_DIR}"
   EXP_DIR="${EXP_DIR}"
   LOG_DIR="${LOG_DIR}"
   CONTAINER_MOUNT="/data"
   CONTAINER="${CONTAINER}"
   MODEL_NAME_OR_PATH="${MODEL_NAME_OR_PATH}"
   
   HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
       recipes=fine-tuning/gpt_oss/checkpointless_gpt_oss_120b_lora \
       recipes.dataset.dataset_path="${TRAIN_DIR}" \
       recipes.exp_manager.exp_dir="${EXP_DIR}" \
       recipes.log_dir="${LOG_DIR}" \
       recipes.resume.restore_config.path="${MODEL_NAME_OR_PATH}" \
       base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
       git.use_default=false \
       cluster=k8s \
       cluster_type=k8s \
       container="${CONTAINER}" \
       +cluster.hostNetwork=true \
       +cluster.persistent_volume_claims.0.claimName=fsx-claim \
       +cluster.persistent_volume_claims.0.mountPath="${CONTAINER_MOUNT}" \
       +recipes.dataset.val_dataset_path="${VAL_DIR}" \
       ++recipes.callbacks.3.test_fault_config.fault_prob_between_lock=1 \
   ```

1. Launch the training job

   ```
   bash launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_lora.sh
   ```

After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods

NAME                    READY   STATUS    RESTARTS   AGE
gpt-oss-120b-worker-0   0/1     Running   0          36s
```

If the STATUS is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod <name of pod>
```

After the job STATUS changes to `Running`, you can examine the log by using the following command.

```
kubectl logs <name of pod>
```

The STATUS changes to `Completed` when you run `kubectl get pods` after the job finishes.

## Launch the training job with kubectl using a pre-defined YAML
<a name="sagemaker-eks-checkpointless-recipes-peft-kubectl"></a>

Another option is to launch the training job through kubectl with a pre-defined job YAML file.

1. Update the `examples/gpt_oss/launch/peft_gpt_oss_120b_checkpointless_p5.yaml` file.
   + `image`: A Deep Learning container. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).
   + `resume.restore_config.path=<path_to_pretrained_weights>`: The path to the pretrained model weights, in Nemo format, that you downloaded in the [Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-peft.html#sagemaker-eks-checkpointless-recipes-peft-prereqs) step.
   + `dataset.dataset_path=<path_to_dataset>`: The path to the dataset stored in the shared storage.

1. Submit the job using kubectl with `peft_gpt_oss_120b_checkpointless_p5.yaml`.

   ```
   kubectl apply -f examples/gpt_oss/launch/peft_gpt_oss_120b_checkpointless_p5.yaml
   ```

After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods

NAME                                    READY   STATUS    RESTARTS   AGE
gpt-120b-lora-checkpointless-worker-0   0/1     Running   0          36s
```

If the STATUS is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod <name of pod>
```

After the job STATUS changes to `Running`, you can examine the log by using the following command.

```
kubectl logs <name of pod>
```

The STATUS changes to `Completed` when you run `kubectl get pods` after the job finishes.

# Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining Llama 3 70b
<a name="sagemaker-eks-checkpointless-recipes-pretraining-llama3"></a>

The following sequence of steps is required to run checkpointless training recipes on HyperPod.

## Prerequisites
<a name="sagemaker-eks-checkpointless-recipes-pretraining-llama3-prereqs"></a>

Before you start setting up your environment, make sure you have:
+ [Enabled Amazon EKS support in Amazon SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [Set up the HyperPod training operator (v1.2+)](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html)
+ A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ Pick a supported checkpointless training recipe for Llama 70B or GPT-OSS 120B from the [source](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection).
+ [Download the Hugging Face model weights](https://huggingface.co/docs/hub/models-downloading) and convert them to a [Nemo supported format](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/features/hf-integration.html#importing-from-hugging-face).
+ Set up your environment

## Kubernetes environment setup
<a name="sagemaker-eks-checkpointless-recipes-pretraining-llama3-kubernetes"></a>

To set up your Kubernetes environment, do the following:

1. Set up the virtual environment. Make sure you're using Python greater than or equal to 3.10 and lower than 3.14.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. [Install Helm](https://helm.sh/docs/intro/install/)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   ```

1. Install dependencies using one of the following methods:

   1. Method 1: SageMaker HyperPod recipes:

      ```
      # install SageMaker HyperPod Recipes.
      git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
      cd sagemaker-hyperpod-recipes
      pip3 install -r requirements.txt
      ```

   1. Method 2: kubectl with a pre-defined job YAML:

      ```
      # install SageMaker HyperPod checkpointless training.
      git clone git@github.com:aws/sagemaker-hyperpod-checkpointless-training.git
      cd sagemaker-hyperpod-checkpointless-training
      ```

You can now launch the checkpointless training recipe using either the NeMo-style launcher or using kubectl.

## Method 1: Launch the training job with the recipes launcher
<a name="sagemaker-eks-checkpointless-recipes-pretraining-llama3-recipes-launcher"></a>

You can use the SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating `k8s.yaml` and `config.yaml`, and running the launch script.

1. Update `launcher_scripts/llama/run_checkpointless_llama3_70b_pretrain.sh`

   Set the container to a Deep Learning container. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).

   ```
   #!/bin/bash
   
   SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
   TRAIN_DIR="${TRAIN_DIR}"
   VAL_DIR="${VAL_DIR}"
   EXP_DIR="${EXP_DIR}"
   LOG_DIR="${LOG_DIR}"
   CONTAINER_MOUNT="/data"
   CONTAINER="${CONTAINER}"
   
   HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
       recipes=training/llama/checkpointless_llama3_70b_pretrain \
       recipes.dataset.dataset_path="${TRAIN_DIR}" \
       recipes.exp_manager.exp_dir="${EXP_DIR}" \
       recipes.log_dir="${LOG_DIR}" \
       recipes.data.global_batch_size=16 \
       recipes.data.micro_batch_size=4 \
       base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
       git.use_default=false \
       cluster=k8s \
       cluster_type=k8s \
       container="${CONTAINER}" \
       +cluster.hostNetwork=true \
       +cluster.persistent_volume_claims.0.claimName=fsx-claim \
       +cluster.persistent_volume_claims.0.mountPath="${CONTAINER_MOUNT}" \
       +recipes.dataset.val_dataset_path="${VAL_DIR}" \
       ++recipes.callbacks.3.test_fault_config.fault_prob_between_lock=1 \
   ```

1. Launch the training job

   ```
   bash launcher_scripts/llama/run_checkpointless_llama3_70b_pretrain.sh
   ```

1. After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

   ```
   kubectl get pods
   
   NAME                   READY   STATUS    RESTARTS   AGE
   llama-3-70b-worker-0   0/1     Running   0          36s
   ```

1. If the STATUS is `Pending` or `ContainerCreating`, run the following command to get more details.

   ```
   kubectl describe pod <name of pod>
   ```

1. After the job STATUS changes to `Running`, you can examine the log by using the following command.

   ```
   kubectl logs <name of pod>
   ```

   The STATUS changes to `Completed` when you run `kubectl get pods` after the job finishes.

## Method 2: Launch the training job with kubectl using a pre-defined YAML
<a name="sagemaker-eks-checkpointless-recipes-pretraining-llama3-kubectl"></a>

Another option is to launch the training job through kubectl with a pre-defined job YAML file.

1. Update the `examples/llama3/launch/pretrain_llama3_70b_checkpointless_p5.yaml` file.
   + `image`: A Deep Learning container. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).
   + `resume.restore_config.path=<path_to_pretrained_weights>`: The path to the pretrained model weights, in Nemo format, that you downloaded in the [Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-finetune.html#sagemaker-eks-checkpointless-recipes-finetune-prereqs) step.
   + `dataset.dataset_path=<path_to_dataset>`: The path to the dataset stored in the shared storage.

1. Submit the job using kubectl with `pretrain_llama3_70b_checkpointless_p5.yaml`

   ```
   kubectl apply -f examples/llama3/launch/pretrain_llama3_70b_checkpointless_p5.yaml
   ```

1. After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

   ```
   kubectl get pods
   
   NAME                                      READY   STATUS    RESTARTS   AGE
   llama3-pretrain-checkpointless-worker-0   0/1     Running   0          36s
   ```

1. If the STATUS is `Pending` or `ContainerCreating`, run the following command to get more details.

   ```
   kubectl describe pod <name of pod>
   ```

1. After the job STATUS changes to `Running`, you can examine the log by using the following command.

   ```
   kubectl logs <name of pod>
   ```

   The STATUS changes to `Completed` when you run `kubectl get pods` after the job finishes.

# Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA Llama 3 70b
<a name="sagemaker-eks-checkpointless-recipes-peft-llama"></a>

The following sequence of steps is required to run checkpointless training recipes on HyperPod.

## Prerequisites
<a name="sagemaker-eks-checkpointless-recipes-peft-llama-prereqs"></a>

Before you start setting up your environment, make sure you have:
+ [Enabled Amazon EKS support in Amazon SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [Set up the HyperPod training operator (v1.2+)](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html)
+ A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ Pick a supported checkpointless training recipe for Llama 70B or GPT-OSS 120B from the [source](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection).
+ [Download the Hugging Face model weights](https://huggingface.co/docs/hub/models-downloading) and convert them to a [Nemo supported format](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/features/hf-integration.html#importing-from-hugging-face).
+ Set up your environment

## Kubernetes environment setup
<a name="sagemaker-eks-checkpointless-recipes-peft-llama-kubernetes"></a>

To set up your Kubernetes environment, do the following:

1. Set up the virtual environment. Make sure you're using Python greater than or equal to 3.10 and lower than 3.14.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. [Install Helm](https://helm.sh/docs/intro/install/)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   ```

1. Install dependencies using one of the following methods:

   1. Method 1: SageMaker HyperPod recipes:

      ```
      # install SageMaker HyperPod Recipes.
      git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
      cd sagemaker-hyperpod-recipes
      pip3 install -r requirements.txt
      ```

   1. Method 2: kubectl with a pre-defined job YAML:

      ```
      # install SageMaker HyperPod checkpointless training.
      git clone git@github.com:aws/sagemaker-hyperpod-checkpointless-training.git
      cd sagemaker-hyperpod-checkpointless-training
      ```

You can now launch the checkpointless training recipe using either the NeMo-style launcher or using kubectl.

## Method 1: Launch the training job with the recipes launcher
<a name="sagemaker-eks-checkpointless-recipes-peft-llama-recipes-launcher"></a>

You can use the SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating `k8s.yaml` and `config.yaml`, and running the launch script.

1. Update `launcher_scripts/llama/run_checkpointless_llama3_70b_lora.sh`

   Set the container to a Deep Learning container. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).

   ```
   #!/bin/bash
   
   SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
   TRAIN_DIR="${TRAIN_DIR}"
   VAL_DIR="${VAL_DIR}"
   EXP_DIR="${EXP_DIR}"
   LOG_DIR="${LOG_DIR}"
   CONTAINER_MOUNT="/data"
   CONTAINER="${CONTAINER}"
   MODEL_NAME_OR_PATH="${MODEL_NAME_OR_PATH}"
   
   HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
       recipes=fine-tuning/llama/checkpointless_llama3_70b_lora \
       recipes.dataset.dataset_path="${TRAIN_DIR}" \
       recipes.exp_manager.exp_dir="${EXP_DIR}" \
       recipes.log_dir="${LOG_DIR}" \
       recipes.resume.restore_config.path="${MODEL_NAME_OR_PATH}" \
       base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
       git.use_default=false \
       cluster=k8s \
       cluster_type=k8s \
       container="${CONTAINER}" \
       +cluster.hostNetwork=true \
       +cluster.persistent_volume_claims.0.claimName=fsx-claim \
       +cluster.persistent_volume_claims.0.mountPath="${CONTAINER_MOUNT}" \
       +recipes.dataset.val_dataset_path="${VAL_DIR}" \
       ++recipes.callbacks.3.test_fault_config.fault_prob_between_lock=1 \
   ```

1. Launch the training job

   ```
   bash launcher_scripts/llama/run_checkpointless_llama3_70b_lora.sh
   ```

1. After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

   ```
   kubectl get pods
   
   NAME                   READY   STATUS    RESTARTS   AGE
   llama-3-70b-worker-0   0/1     Running   0          36s
   ```

1. If the STATUS is `Pending` or `ContainerCreating`, run the following command to get more details.

   ```
   kubectl describe pod <name of pod>
   ```

1. After the job STATUS changes to `Running`, you can examine the log by using the following command.

   ```
   kubectl logs <name of pod>
   ```

   The STATUS changes to `Completed` when you run `kubectl get pods` after the job finishes.

## Method 2: Launch the training job with kubectl using a pre-defined YAML
<a name="sagemaker-eks-checkpointless-recipes-peft-llama-kubectl"></a>

Another option is to launch the training job through kubectl with a pre-defined job YAML file.

1. Update the `examples/llama3/launch/peft_llama3_70b_checkpointless_p5.yaml` file.
   + `image`: A Deep Learning container. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).
   + `resume.restore_config.path=<path_to_pretrained_weights>`: The path to the pretrained model weights, in Nemo format, that you downloaded in the [Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-finetune.html#sagemaker-eks-checkpointless-recipes-finetune-prereqs) step.
   + `dataset.dataset_path=<path_to_dataset>`: The path to the dataset stored in the shared storage.

1. Submit the job using kubectl with `peft_llama3_70b_checkpointless_p5.yaml`

   ```
   kubectl apply -f examples/llama3/launch/peft_llama3_70b_checkpointless_p5.yaml
   ```

1. After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

   ```
   kubectl get pods
   
   NAME                                      READY   STATUS    RESTARTS   AGE
   llama3-70b-lora-checkpointless-worker-0   0/1     Running   0          36s
   ```

1. If the STATUS is `Pending` or `ContainerCreating`, run the following command to get more details.

   ```
   kubectl describe pod <name of pod>
   ```

1. After the job STATUS changes to `Running`, you can examine the log by using the following command.

   ```
   kubectl logs <name of pod>
   ```

   The STATUS changes to `Completed` when you run `kubectl get pods` after the job finishes.

# Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining or Finetuning Custom Models
<a name="sagemaker-eks-checkpointless-recipes-custom"></a>

The following sequence of steps is required to run checkpointless training with your custom model on HyperPod.

## Prerequisites
<a name="sagemaker-eks-checkpointless-recipes-custom-prereqs"></a>

Before you start setting up your environment, make sure you have:
+ [Enabled Amazon EKS support in Amazon SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [Set up the HyperPod training operator (v1.2+)](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html)
+ A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ [Download the Hugging Face model weights](https://huggingface.co/docs/hub/models-downloading) and convert them to a [Nemo supported format](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/features/hf-integration.html#importing-from-hugging-face).
+ Set up your environment

## Kubernetes environment setup
<a name="sagemaker-eks-checkpointless-recipes-custom-kubernetes"></a>

To set up your Kubernetes environment, do the following:

1. Set up the virtual environment. Make sure you're using Python greater than or equal to 3.10 and lower than 3.14.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   ```

1. Install dependencies

   ```
   # install SageMaker HyperPod checkpointless training.
   git clone git@github.com:aws/sagemaker-hyperpod-checkpointless-training.git
   cd sagemaker-hyperpod-checkpointless-training
   ```

## Checkpointless training modification instructions
<a name="sagemaker-eks-checkpointless-recipes-custom-modification-instructions"></a>

To incrementally adopt checkpointless training for custom models, follow the integration guide (here we use Llama 3 70b pretraining as an example), which involves:
+ Fast communicator creation
+ Memory-mapped dataloader (MMAP)
+ In-process & Checkpointless recovery

### Component 1: Fast communicator creation
<a name="sagemaker-eks-checkpointless-recipes-custom-component1"></a>

This optimizes the time needed to establish connections between the workers. No code changes are needed; it only requires setting environment variables.

```
# Enable Rootless features
export HPCT_USE_ROOTLESS=1 && \
sysctl -w net.ipv4.ip_local_port_range="20000 65535"

hyperpodrun --nproc_per_node=8 \
            ...
            --inprocess-restart \
            ...
```

The full change can be found in the [llama3 70b pretrain launch job config](https://github.com/aws/sagemaker-hyperpod-checkpointless-training/blob/main/examples/llama3/launch/pretrain_llama3_70b_checkpointless_p5.yaml).

### Component 2: Memory-mapped dataloader (MMAP)
<a name="sagemaker-eks-checkpointless-recipes-custom-component2"></a>

MMAP uses caches to store prefetched data samples and enables training to start immediately, without waiting for data preprocessing. Adopting it requires minimal code changes: wrap the existing dataloader.

```
# Wrap the existing data module so batches are served from the MMAP cache
data_module = MMAPDataModule(
    data_module=base_data_module,
    mmap_config=CacheResumeMMAPConfig(cache_dir=…),
)
```

### Components 3 and 4: In-process and checkpointless recovery
<a name="sagemaker-eks-checkpointless-recipes-custom-components3-4"></a>

This enables failure recovery without restarting training processes or loading from checkpoints. Additional code changes are needed (a strategy and training config update, and wrapping the existing main function).

```
# Wrap the existing entry point so it can recover in process on faults
@HPWrapper(
  health_check=CudaHealthCheck(),
  hp_api_factory=HPAgentK8sAPIFactory(),
  abort_timeout=60.0,
...)
def run_main(
  cfg,
  caller: Optional[HPCallWrapper] = None):
...


# Swap in the checkpointless Megatron strategy in the training config
CheckpointlessMegatronStrategy(
  **self.cfg.strategy,
  ddp=self.ddp,
)
```

The full change can be found in the [llama3 70b pretrain entry script](https://github.com/aws/sagemaker-hyperpod-checkpointless-training/blob/main/examples/llama3/llama3_70b_pretrain_checkpointless.py), and the corresponding training config change can be found in the [llama3 70b training config](https://github.com/aws/sagemaker-hyperpod-checkpointless-training/blob/main/examples/llama3/config/llama3_70b_peft_checkpointless.yaml).

### Launch training
<a name="sagemaker-eks-checkpointless-recipes-custom-launch"></a>

You can now launch the checkpointless training using kubectl.

```
kubectl apply -f your_job_config.yaml
```

# HyperPod checkpointless training features
<a name="sagemaker-eks-checkpointless-features"></a>

See the following pages to learn about the training features in checkpointless training.

**Topics**
+ [Amazon SageMaker HyperPod checkpointless training repositories](#sagemaker-eks-checkpointless-repositories)
+ [Collective communication initialization improvements](sagemaker-eks-checkpointless-features-communication.md)
+ [Memory mapped dataloader](sagemaker-eks-checkpointless-features-mmap.md)
+ [In-process recovery and checkpointless training](sagemaker-eks-checkpointless-in-process-recovery.md)

## Amazon SageMaker HyperPod checkpointless training repositories
<a name="sagemaker-eks-checkpointless-repositories"></a>

[HyperPod checkpointless training](https://github.com/aws/sagemaker-hyperpod-checkpointless-training) accelerates recovery from cluster faults in large-scale distributed training environments through framework-level optimizations. These optimizations are delivered via a base container image that includes enhanced NCCL initialization improvements, data loading optimizations, and in-process and checkpointless recovery components. The HyperPod checkpointless training package is built on this foundation.

Checkpointless training is enabled via three optimization tracks that run in concert:
+ **Communication initialization improvements (NCCL and Gloo)** - Eliminate communication bottlenecks by decentralizing rank peer and ring information (red box below).
+ **Data loading optimizations** - Reduce the time required to serve the first batch of data during restart operations (orange boxes below).
+ **Program restart overhead reduction** - Minimize restart costs and enable checkpointless replenishment through process recovery on healthy nodes (blue and green boxes below).

![\[Diagram of the three checkpointless training optimization tracks.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-checkpointless-optimization-tracks.png)


# Collective communication initialization improvements
<a name="sagemaker-eks-checkpointless-features-communication"></a>

NCCL and Gloo are fundamental communication libraries that enable collective operations (such as all-reduce and broadcast) across distributed training processes. However, traditional NCCL and Gloo initialization can create bottlenecks during fault recovery.

The standard recovery process requires all processes to connect to a centralized TCPStore and coordinate through a root process, introducing expensive overhead that becomes particularly problematic during restarts. This centralized design creates three critical issues: coordination overhead from mandatory TCPStore connections, recovery delays as each restart must repeat the full initialization sequence, and a single point of failure in the root process itself. The result is an expensive, centralized coordination step every time training initializes or restarts.

HyperPod checkpointless training eliminates these coordination bottlenecks, enabling faster recovery from faults by making initialization "rootless" and "TCPStoreless."

## Rootless configurations
<a name="sagemaker-eks-checkpointless-features-communication-rootless-config"></a>

To enable Rootless, set the following environment variable and system setting.

```
export HPCT_USE_ROOTLESS=1 && \
sysctl -w net.ipv4.ip_local_port_range="20000 65535"
```

`HPCT_USE_ROOTLESS`: 0 or 1. Use this to turn Rootless on or off.

`sysctl -w net.ipv4.ip_local_port_range="20000 65535"`: Sets the system port range.

See [the example](https://github.com/aws/sagemaker-hyperpod-checkpointless-training/blob/main/examples/llama3/launch/pretrain_llama3_70b_checkpointless_p5.yaml#L111-L113) for enabling Rootless.
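
To confirm the settings are in effect on a node, you can read them back. A minimal sketch; the `/proc` path for the port range is standard on Linux:

```shell
# Read back the rootless flag and the configured ephemeral port range.
echo "HPCT_USE_ROOTLESS=${HPCT_USE_ROOTLESS:-unset}"
cat /proc/sys/net/ipv4/ip_local_port_range 2>/dev/null || echo "port range unavailable"
```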

## Rootless
<a name="sagemaker-eks-checkpointless-features-communication-rootless"></a>

HyperPod checkpointless training offers novel initialization methods, Rootless and TCPStoreless, for NCCL and Gloo process groups.

The implementation of these optimizations involves modifying NCCL, Gloo, and PyTorch:
+ Extending third-party library APIs to enable Rootless and Storeless NCCL and Gloo optimizations while maintaining backward compatibility
+ Updating process group backends to conditionally use optimized paths and handle in-process recovery issues
+ Bypassing expensive TCPStore creation at the PyTorch distributed layer while maintaining symmetric address patterns through global group counters

The following graph shows the architecture of the distributed training libraries and the changes made in checkpointless training.

![\[The following graph shows the architecture of the distributed training libraries and the changes made in checkpointless training.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-checkpointless-training-libraries.png)


### NCCL and Gloo
<a name="sagemaker-eks-checkpointless-features-communication-nccl-gloo"></a>

These are independent packages that perform the core functionality of collective communications. They provide key APIs, such as `ncclCommInitRank`, to initialize communication networks, manage the underlying resources, and perform collective communications. With custom changes in NCCL and Gloo, the Rootless and Storeless optimizations streamline initialization of the communication network (for example, by skipping the connection to the TCPStore). You can flexibly switch between the original and optimized code paths.

### PyTorch process group backend
<a name="sagemaker-eks-checkpointless-features-communication-pytorch"></a>

The process group backends, specifically ProcessGroupNCCL and ProcessGroupGloo, implement the ProcessGroup APIs by invoking the APIs of their corresponding underlying libraries. Since we extend the third-party libraries' APIs, we have to invoke them properly and make code path switches based on customers' configurations.

In addition to optimization code paths, we also change the process group backend to support in-process recovery.

# Memory mapped dataloader
<a name="sagemaker-eks-checkpointless-features-mmap"></a>

Another restart overhead stems from data loading: the training cluster remains idle while the dataloader initializes, downloads data from remote file systems, and processes it into batches.

To address this, we introduce the Memory Mapped (MMAP) Dataloader, which caches prefetched batches in persistent memory, ensuring they remain available even after a fault-induced restart. This approach eliminates dataloader setup time and enables training to resume immediately using cached batches, while the dataloader concurrently reinitializes and fetches subsequent data in the background. The data cache resides on each rank that requires training data and maintains two types of batches: recently consumed batches that have been used for training, and prefetched batches ready for immediate use.

![\[This image illustrates the MMAP Dataloader, caches, and consumed batches.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-checkpointless-mmap-dataloader.png)


The MMAP dataloader offers the following two features:
+ **Data Prefetching** - Proactively fetches and caches data generated by the dataloader
+ **Persistent Caching** - Stores both consumed and prefetched batches in a temporary filesystem that survives process restarts

Using the cache, the training job will benefit from:
+ **Reduced Memory Footprint** - Leverages memory-mapped I/O to maintain a single shared copy of data in host CPU memory, eliminating redundant copies across GPU processes (e.g., reduces from 8 copies to 1 on a p5 instance with 8 GPUs)
+ **Faster Recovery** - Reduces Mean Time to Restart (MTTR) by enabling training to resume immediately from cached batches, eliminating the wait for dataloader reinitialization and first-batch generation
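The single-shared-copy benefit can be sketched with a plain `numpy.memmap` (illustrative only; the file name and shapes are made up, and the real MMAP dataloader manages its cache internally):

```python
import numpy as np
import os
import tempfile

# Illustrative: a memory-mapped file gives multiple readers one shared
# copy of batch data instead of a per-process copy in each GPU worker.
cache_path = os.path.join(tempfile.gettempdir(), "batch_cache.npy")

# Writer (e.g., the data-loading rank) materializes a batch once.
batch = np.arange(8, dtype=np.float32)
writer = np.memmap(cache_path, dtype=np.float32, mode="w+", shape=batch.shape)
writer[:] = batch
writer.flush()

# Readers (other ranks on the same node) map the same file; the OS page
# cache backs all mappings with a single physical copy.
reader = np.memmap(cache_path, dtype=np.float32, mode="r", shape=batch.shape)
print(reader.sum())  # 28.0
```

On a real cluster the cache would live under a path like `/dev/shm` so it survives process restarts on the same node.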

## MMAP configurations
<a name="sagemaker-eks-checkpointless-features-communication-mmap-config"></a>

To use MMAP, pass your original data module into `MMAPDataModule`:

```
data_module = MMAPDataModule(
    data_module=MY_DATA_MODULE(...),
    mmap_config=CacheResumeMMAPConfig(
        cache_dir=self.cfg.mmap.cache_dir,
        checkpoint_frequency=self.cfg.mmap.checkpoint_frequency),
)
```

`CacheResumeMMAPConfig`: MMAP Dataloader parameters control cache directory location, size limits, and data fetching delegation. By default, only TP rank 0 per node fetches data from the source, while other ranks in the same data replication group read from the shared cache, eliminating redundant transfers.

`MMAPDataModule`: Wraps the original data module and returns the MMAP dataloader for both training and validation.

See [the example](https://github.com/aws/sagemaker-hyperpod-checkpointless-training/blob/main/examples/gpt_oss/gpt_oss_120b_full_finetune_checkpointless.py#L101-L109) for enabling MMAP.

## API reference
<a name="sagemaker-eks-checkpointless-mmap-reference"></a>

### CacheResumeMMAPConfig
<a name="sagemaker-eks-checkpointless-mmap-reference-cacheresume"></a>

```
class hyperpod_checkpointless_training.dataloader.config.CacheResumeMMAPConfig(
  cache_dir='/dev/shm/pdl_cache',
  prefetch_length=10,
  val_prefetch_length=10,
  lookback_length=2,
  checkpoint_frequency=None,
  model_parallel_group=None,
  enable_batch_encryption=False)
```

Configuration class for cache-resume memory-mapped (MMAP) dataloader functionality in HyperPod checkpointless training.

This configuration enables efficient data loading with caching and prefetching capabilities, allowing training to resume quickly after failures by maintaining cached data batches in memory-mapped files.

**Parameters**
+ **cache\_dir** (str, optional) – Directory path for storing cached data batches. Default: "/dev/shm/pdl\_cache"
+ **prefetch\_length** (int, optional) – Number of batches to prefetch ahead during training. Default: 10
+ **val\_prefetch\_length** (int, optional) – Number of batches to prefetch ahead during validation. Default: 10
+ **lookback\_length** (int, optional) – Number of previously used batches to keep in cache for potential reuse. Default: 2
+ **checkpoint\_frequency** (int, optional) – Frequency of model checkpointing steps. Used for cache performance optimization. Default: None
+ **model\_parallel\_group** (object, optional) – Process group for model parallelism. If None, will be created automatically. Default: None
+ **enable\_batch\_encryption** (bool, optional) – Whether to enable encryption for cached batch data. Default: False

**Methods**

```
create(dataloader_init_callable,
    parallel_state_util,
    step,
    is_data_loading_rank,
    create_model_parallel_group_callable,
    name='Train',
    is_val=False,
    cached_len=0)
```

Creates and returns a configured MMAP dataloader instance.

**Parameters**
+ **dataloader\_init\_callable** (Callable) – Function to initialize the underlying dataloader
+ **parallel\_state\_util** (object) – Utility for managing parallel state across processes
+ **step** (int) – The data step to resume from during training
+ **is\_data\_loading\_rank** (Callable) – Function that returns True if current rank should load data
+ **create\_model\_parallel\_group\_callable** (Callable) – Function to create model parallel process group
+ **name** (str, optional) – Name identifier for the dataloader. Default: "Train"
+ **is\_val** (bool, optional) – Whether this is a validation dataloader. Default: False
+ **cached\_len** (int, optional) – Length of cached data if resuming from existing cache. Default: 0

Returns `CacheResumePrefetchedDataLoader` or `CacheResumeReadDataLoader` – Configured MMAP dataloader instance

Raises `ValueError` if the step parameter is `None`.

**Example**

```
from hyperpod_checkpointless_training.dataloader.config import CacheResumeMMAPConfig

# Create configuration
config = CacheResumeMMAPConfig(
    cache_dir="/tmp/training_cache",
    prefetch_length=20,
    checkpoint_frequency=100,
    enable_batch_encryption=False
)

# Create dataloader
dataloader = config.create(
    dataloader_init_callable=my_dataloader_init,
    parallel_state_util=parallel_util,
    step=current_step,
    is_data_loading_rank=lambda: rank == 0,
    create_model_parallel_group_callable=create_mp_group,
    name="TrainingData"
)
```

**Notes**
+ The cache directory should have sufficient space and fast I/O performance (e.g., /dev/shm for in-memory storage).
+ Setting `checkpoint_frequency` improves cache performance by aligning cache management with model checkpointing
+ For validation dataloaders (`is_val=True`), the step is reset to 0 and cold start is forced
+ Different dataloader implementations are used based on whether the current rank is responsible for data loading

### MMAPDataModule
<a name="sagemaker-eks-checkpointless-mmap-reference-mmapdatamodule"></a>

```
class hyperpod_checkpointless_training.dataloader.mmap_data_module.MMAPDataModule(  
    data_module,  
    mmap_config,  
    parallel_state_util=MegatronParallelStateUtil(),  
    is_data_loading_rank=None)
```

A PyTorch Lightning DataModule wrapper that applies memory-mapped (MMAP) data loading capabilities to existing DataModules for checkpointless training.

This class wraps an existing PyTorch Lightning DataModule and enhances it with MMAP functionality, enabling efficient data caching and fast recovery during training failures. It maintains compatibility with the original DataModule interface while adding checkpointless training capabilities.

**Parameters**

`data_module` (pl.LightningDataModule)  
The underlying DataModule to wrap (e.g., LLMDataModule)

`mmap_config` (MMAPConfig)  
The MMAP configuration object that defines caching behavior and parameters

`parallel_state_util` (MegatronParallelStateUtil, optional)  
Utility for managing parallel state across distributed processes. Default: MegatronParallelStateUtil()

`is_data_loading_rank` (Callable, optional)  
Function that returns True if the current rank should load data. If None, defaults to `parallel_state_util.is_tp_0`. Default: None

**Attributes**

`global_step` (int)  
Current global training step, used for resuming from checkpoints

`cached_train_dl_len` (int)  
Cached length of the training dataloader

`cached_val_dl_len` (int)  
Cached length of the validation dataloader

**Methods**

```
setup(stage=None)
```

Set up the underlying data module for the specified training stage.

`stage` (str, optional)  
Stage of training ('fit', 'validate', 'test', or 'predict'). Default: None

```
train_dataloader()
```

Create the training DataLoader with MMAP wrapping.

*Returns:* DataLoader – MMAP-wrapped training DataLoader with caching and prefetching capabilities

```
val_dataloader()
```

Create the validation DataLoader with MMAP wrapping.

*Returns:* DataLoader – MMAP-wrapped validation DataLoader with caching capabilities

```
test_dataloader()
```

Create the test DataLoader if the underlying data module supports it.

*Returns:* DataLoader or None – Test DataLoader from the underlying data module, or None if not supported

```
predict_dataloader()
```

Create the predict DataLoader if the underlying data module supports it.

*Returns:* DataLoader or None – Predict DataLoader from the underlying data module, or None if not supported

```
load_checkpoint(checkpoint)
```

Load checkpoint information to resume training from a specific step.

`checkpoint` (dict)  
Checkpoint dictionary containing the `global_step` key

```
get_underlying_data_module()
```

Get the underlying wrapped data module.

*Returns:* pl.LightningDataModule – The original data module that was wrapped

```
state_dict()
```

Get the state dictionary of the MMAP DataModule for checkpointing.

*Returns:* dict – Dictionary containing cached dataloader lengths

```
load_state_dict(state_dict)
```

Load the state dictionary to restore MMAP DataModule state.

`state_dict` (dict)  
State dictionary to load

**Properties**

```
data_sampler
```

Expose the underlying data module's data sampler to NeMo framework.

*Returns:* object or None – The data sampler from the underlying data module

**Example**

```
from hyperpod_checkpointless_training.dataloader.mmap_data_module import MMAPDataModule  
from hyperpod_checkpointless_training.dataloader.config import CacheResumeMMAPConfig  
from my_project import MyLLMDataModule  

# Create MMAP configuration  
mmap_config = CacheResumeMMAPConfig(  
    cache_dir="/tmp/training_cache",  
    prefetch_length=20,  
    checkpoint_frequency=100  
)  

# Create original data module  
original_data_module = MyLLMDataModule(  
    data_path="/path/to/data",  
    batch_size=32  
)  

# Wrap with MMAP capabilities  
mmap_data_module = MMAPDataModule(  
    data_module=original_data_module,  
    mmap_config=mmap_config  
)  

# Use in PyTorch Lightning Trainer  
trainer = pl.Trainer()  
trainer.fit(model, datamodule=mmap_data_module)

# Resume from checkpoint  
checkpoint = {"global_step": 1000}  
mmap_data_module.load_checkpoint(checkpoint)
```

**Notes**
+ The wrapper delegates most attribute access to the underlying data module using `__getattr__`
+ Only data loading ranks actually initialize and use the underlying data module; other ranks use fake dataloaders
+ Cached dataloader lengths are maintained to optimize performance during training resumption

# In-process recovery and checkpointless training
<a name="sagemaker-eks-checkpointless-in-process-recovery"></a>

HyperPod checkpointless training uses model redundancy to enable fault-tolerant training. The core principle is that model and optimizer states are fully replicated across multiple node groups, with weight updates and optimizer state changes synchronously replicated within each group. When a failure occurs, healthy replicas complete their optimizer steps and transmit the updated model/optimizer states to recovering replicas.

This model redundancy-based approach enables several fault handling mechanisms:
+ **In-process recovery:** processes remain active despite faults, keeping all model and optimizer states in GPU memory with the latest values
+ **Graceful abort handling:** controlled aborts and resource cleanup for affected operations
+ **Code block re-execution:** re-running only the affected code segments within a Re-executable Code Block (RCB)
+ **Checkpointless recovery with no lost training progress:** since processes persist and states remain in memory, no training progress is lost; when a fault occurs training resumes from the previous step, as opposed to resuming from the last saved checkpoint

**Checkpointless Configurations**

Here is the core snippet of checkpointless training.

```
from hyperpod_checkpointless_training.inprocess.train_utils import wait_rank

wait_rank()
def main():
    @HPWrapper(
        health_check=CudaHealthCheck(),
        hp_api_factory=HPAgentK8sAPIFactory(),
        abort_timeout=60.0,
        checkpoint_manager=PEFTCheckpointManager(enable_offload=True),
        abort=CheckpointlessAbortManager.get_default_checkpointless_abort(),
        finalize=CheckpointlessFinalizeCleanup(),
    )
    def run_main(cfg, caller: Optional[HPCallWrapper] = None):
        ...
        trainer = Trainer(
            strategy=CheckpointlessMegatronStrategy(...,
                num_distributed_optimizer_instances=2),
            callbacks=[..., CheckpointlessCallback(...)],
            )
        trainer.fresume = resume
        trainer._checkpoint_connector = CheckpointlessCompatibleConnector(trainer)
        trainer.wrapper = caller
```
+ `wait_rank`: All ranks will wait for the rank information from the HyperpodTrainingOperator infrastructure.
+ `HPWrapper`: Python function wrapper that enables restart capabilities for a Re-executable Code Block (RCB). The implementation uses a context manager rather than a Python decorator because decorators cannot determine the number of RCBs to monitor at runtime.
+ `CudaHealthCheck`: Ensures the CUDA context for the current process is in a healthy state by synchronizing with the GPU. Uses the device specified by the `LOCAL_RANK` environment variable, or defaults to the main thread's CUDA device if `LOCAL_RANK` is not set.
+ `HPAgentK8sAPIFactory`: This API enables checkpointless training to query the training status of other pods in the Kubernetes training cluster. It also provides an infrastructure-level barrier that ensures all ranks successfully complete abort and restart operations before proceeding.
+ `CheckpointManager`: Manages in-memory checkpoints and peer-to-peer recovery for checkpointless fault tolerance. It has the following core responsibilities:
  + **In-Memory Checkpoint Management**: Saves and manages NeMo model checkpoints in memory for fast recovery without disk I/O during checkpointless recovery scenarios.
  + **Recovery Feasibility Validation**: Determines if checkpointless recovery is possible by validating global step consistency, rank health, and model state integrity.
  + **Peer-to-Peer Recovery Orchestration**: Coordinates checkpoint transfer between healthy and failed ranks using distributed communication for fast recovery.
  + **RNG State Management**: Preserves and restores random number generator states across Python, NumPy, PyTorch, and Megatron for deterministic recovery.
  + **[Optional] Checkpoint Offload**: Offloads the in-memory checkpoint to CPU if the GPU does not have enough memory capacity.
+ `PEFTCheckpointManager`: Extends `CheckpointManager` by keeping the base model weights for PEFT fine-tuning.
+ `CheckpointlessAbortManager`: Manages abort operations in a background thread when an error is encountered. By default, it aborts TransformerEngine, Checkpointing, TorchDistributed, and DataLoader. Users can register custom abort handlers as needed. After the abort completes, all communication must cease and all processes and threads must terminate to prevent resource leaks.
+ `CheckpointlessFinalizeCleanup`: Handles final cleanup operations in the main thread for components that cannot be safely aborted or cleaned up in the background thread.
+ `CheckpointlessMegatronStrategy`: This inherits from `MegatronStrategy` in NeMo. Note that checkpointless training requires `num_distributed_optimizer_instances` to be at least 2 so that there is optimizer replication. The strategy also takes care of essential attribute registration and process group initialization (e.g., rootless).
+ `CheckpointlessCallback`: Lightning callback that integrates NeMo training with checkpointless training's fault tolerance system. It has the following core responsibilities:
  + **Training Step Lifecycle Management**: Tracks training progress and coordinates with ParameterUpdateLock to enable/disable checkpointless recovery based on training state (first step vs subsequent steps).
  + **Checkpoint State Coordination**: Manages in-memory PEFT base model checkpoint saving/restoring.
+ `CheckpointlessCompatibleConnector`: A PTL `CheckpointConnector` that attempts to pre-load the checkpoint file to memory, with the source determined in this priority:
  + Try checkpointless recovery first
  + If checkpointless recovery returns None, fall back to `parent.resume_start()`

See [the example](https://github.com/aws/sagemaker-hyperpod-checkpointless-training/blob/main/examples/gpt_oss/gpt_oss_120b_full_finetune.py) to add checkpointless training features to your code.

**Concepts**

This section introduces checkpointless training concepts. Checkpointless training on Amazon SageMaker HyperPod supports in-process recovery. This API interface follows a format similar to the NVRx APIs.

**Concept - Re-Executable Code Block (RCB)**

When a failure occurs, healthy processes remain alive, but a portion of the code must be re-executed to recover the training state and Python call stacks. A Re-executable Code Block (RCB) is a specific code segment that re-runs during failure recovery. In the configuration snippet shown earlier, the RCB encompasses the entire training script (i.e., everything under `main()`), meaning that each failure recovery restarts the training script while preserving the in-memory model and optimizer states.
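The RCB idea can be sketched in a few lines of plain Python (hypothetical names; the real mechanism is `HPWrapper`/`HPCallWrapper`): the process stays alive, in-memory state outside the block survives, and only the wrapped block re-runs.

```python
# Minimal sketch of a Re-executable Code Block (illustrative only).
class TransientFault(Exception):
    pass

def run_rcb(block, max_restarts=3):
    attempts = 0
    while True:
        try:
            return block(attempts)
        except TransientFault:
            attempts += 1  # state outside the block is untouched
            if attempts > max_restarts:
                raise

model_state = {"step": 0}  # persists in memory across re-executions

def training_block(attempt):
    model_state["step"] += 1
    if attempt == 0:
        raise TransientFault("simulated fault")  # fail once, then recover
    return model_state["step"]

result = run_rcb(training_block)
print(result)  # 2: the block ran twice, and in-memory state survived
```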

**Concept - Faults control**

A fault controller module receives notifications when failures occur during checkpointless training. This fault controller includes the following components:
+ **Fault detection module:** Receives infrastructure fault notifications
+ **RCB definition APIs:** Enables users to define the re-executable code block (RCB) in their code
+ **Restart module:** Terminates the RCB, cleans up resources, and restarts the RCB

![\[This image illustrates how a fault controller module receives notifications when failure occurs during checkpointless training.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-checkpointless-fault-controller-module.png)


**Concept - Model redundancy**

Large model training usually requires a large enough data parallel size to train models efficiently. In traditional data parallelism like PyTorch DDP and Horovod, the model is fully replicated. More advanced sharded data parallelism techniques like DeepSpeed ZeRO optimizer and FSDP also support hybrid sharding mode, which allows sharding the model/optimizer states within the sharding group and fully replicating across replication groups. NeMo also has this hybrid sharding feature through the `num_distributed_optimizer_instances` argument, which allows redundancy.

However, adding redundancy indicates that the model will not be fully sharded across the entire cluster, resulting in higher device memory usage. The amount of redundant memory will vary depending on the specific model sharding techniques implemented by the user. The low-precision model weights, gradients, and activation memory will not be affected, since they are sharded through model parallelism. The high-precision master model weights/gradients and optimizer states will be affected. Adding one redundant model replica increases device memory usage by roughly the equivalent of one DCP checkpoint size.
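As a rough, assumed illustration (an Adam-style optimizer with fp32 master weights, momentum, and variance, i.e., roughly 12 bytes per parameter; real layouts vary), the per-GPU cost of one redundant replica can be estimated as:

```python
# Back-of-envelope estimate of the extra device memory one redundant
# replica adds per GPU, spread across the replica's sharding group.
# The 12 bytes/param figure is an assumption, not a measured value.
def redundant_bytes_per_gpu(num_params, shard_group_size, bytes_per_param=12):
    return num_params * bytes_per_param / shard_group_size

# e.g., a hypothetical 70B-parameter model sharded over 256 GPUs per replica
extra_gib = redundant_bytes_per_gpu(70e9, 256) / 2**30
print(round(extra_gib, 1))  # ~3.1 GiB per GPU
```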

Hybrid sharding breaks the collectives across the entire DP groups into relatively smaller collectives. Previously there was a reduce-scatter and an all-gather across the entire DP group. After the hybrid sharding, the reduce-scatter is only running inside each model replica, and there will be an all-reduce across model replica groups. The all-gather is also running inside each model replica. As a result, the entire communication volume remains roughly unchanged, but collectives are running with smaller groups, so we expect better latency.
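The group layout described above can be sketched as follows (an assumed layout for illustration; actual group construction is handled by the training strategy):

```python
# Hybrid sharding splits the DP group into replica groups that shard
# internally (reduce-scatter/all-gather), plus cross-replica groups that
# all-reduce corresponding shards. Sizes here are made up for illustration.
world_size = 16
num_replicas = 2                      # cf. num_distributed_optimizer_instances
shard_size = world_size // num_replicas

# Each replica group runs reduce-scatter and all-gather internally.
replica_groups = [list(range(r * shard_size, (r + 1) * shard_size))
                  for r in range(num_replicas)]

# Ranks holding the same shard across replicas all-reduce with each other.
cross_replica_groups = [[r * shard_size + i for r in range(num_replicas)]
                        for i in range(shard_size)]

print(replica_groups[0])        # [0, 1, 2, 3, 4, 5, 6, 7]
print(cross_replica_groups[0])  # [0, 8]
```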

**Concept - Failure and Restart Types**

The following table records different failure types and associated recovery mechanisms. Checkpointless training first attempts failure recovery via in-process recovery, then via a process-level restart. It falls back to a job-level restart only in the event of a catastrophic failure (e.g., multiple nodes failing at the same time).


| Failure Type | Cause | Recovery Type | Recovery Mechanism | 
| --- | --- | --- | --- | 
| In-process failure | Code-level errors, exceptions | In-Process Recovery (IPR) | Rerun RCB within existing process; healthy processes remain active | 
| Process restart failure | Corrupted CUDA context, terminated process | Process Level Restart (PLR) | SageMaker HyperPod training operator restarts processes; skips K8s pod restart | 
| Node replacement failure | Permanent node/GPU hardware failure | Job Level Restart (JLR) | Replace failed node; restart entire training job | 

**Concept - Atomic lock protection for optimizer step**

Model execution is divided into three phases: forward propagation, backward propagation, and optimizer step. Recovery behavior varies based on the failure timing:
+ **Forward/backward propagation:** Roll back to the beginning of the current training step and broadcast model states to replacement node(s)
+ **Optimizer step:** Allow healthy replicas to complete the step under lock protection, then broadcast the updated model states to replacement node(s)

This strategy ensures completed optimizer updates are never discarded, helping reduce fault recovery time.
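A minimal sketch of the lock idea, using a plain `threading.Lock` (illustrative only; the actual mechanism is the ParameterUpdateLock coordinated by `CheckpointlessCallback`):

```python
import threading

# Fault handling must not interrupt a replica mid-optimizer-step: it
# waits on the lock, so a completed weight update is never discarded.
param_update_lock = threading.Lock()
state = {"step": 0, "rolled_back": False}

def optimizer_step():
    with param_update_lock:          # fault handler blocks until we finish
        state["step"] += 1           # atomic weight/optimizer update

def on_fault():
    with param_update_lock:          # acquired only between optimizer steps
        state["rolled_back"] = True  # roll back to start of current step

optimizer_step()
on_fault()
print(state)  # {'step': 1, 'rolled_back': True}
```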

![\[This image illustrates how a failure is handled depending on whether it occurs during forward/backward propagation or during the optimizer step.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-checkpointless-optimizer.png)


## Checkpointless Training Flow Diagram
<a name="sagemaker-eks-checkpointless-training-flow"></a>

![\[This diagram illustrates the checkpointless training flow.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-checkpointless-training-flow.png)


The following steps outline the failure detection and checkpointless recovery process:

1. Training loop starts

1. Fault occurs

1. Evaluate checkpointless resume feasibility

1. Check if it is feasible to do checkpointless resume
   + If feasible, attempt checkpointless resume
     + If resume fails, fall back to checkpoint loading from storage
     + If resume succeeds, training continues from the recovered state
   + If not feasible, fall back to checkpoint loading from storage

1. Clean up resources - abort all process groups and backends and free resources in preparation for restart.

1. Resume training loop - a new training loop begins, and the process returns to step 1.
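The resume decision above can be sketched as follows (hypothetical helper callables standing in for the checkpoint manager's feasibility check, checkpointless resume, and storage fallback):

```python
# Illustrative decision flow: try checkpointless resume first, and fall
# back to loading a checkpoint from storage if it is infeasible or fails.
def recover(checkpointless_feasible, checkpointless_resume, load_from_storage):
    if checkpointless_feasible():
        state = checkpointless_resume()
        if state is not None:            # resume succeeded
            return state, "checkpointless"
    return load_from_storage(), "storage"  # fallback path

state, path = recover(lambda: True,
                      lambda: {"step": 100},
                      lambda: {"step": 90})
print(path)  # checkpointless
```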

## API reference
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference"></a>

### wait\_rank
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-wait_rank"></a>

```
hyperpod_checkpointless_training.inprocess.train_utils.wait_rank()
```

Waits for and retrieves rank information from HyperPod, then updates the current process environment with distributed training variables.

This function obtains the correct rank assignment and environment variables for distributed training. It ensures that each process gets the appropriate configuration for its role in the distributed training job.

**Parameters**

None

**Returns**

**None**

**Behavior**
+ **Process Check**: Skips execution if called from a subprocess (only runs in MainProcess)
+ **Environment Retrieval**: Gets current `RANK` and `WORLD_SIZE` from environment variables
+ **HyperPod Communication**: Calls `hyperpod_wait_rank_info()` to retrieve rank information from HyperPod
+ **Environment Update**: Updates the current process environment with worker-specific environment variables received from HyperPod

**Environment Variables**

The function reads the following environment variables:
+ **RANK** (*int*) – Current process rank (default: -1 if not set)
+ **WORLD\_SIZE** (*int*) – Total number of processes in the distributed job (default: 0 if not set)

**Raises**
+ **AssertionError** – If the response from HyperPod is not in the expected format or if required fields are missing

**Example**

```
from hyperpod_checkpointless_training.inprocess.train_utils import wait_rank  

# Call before initializing distributed training  
wait_rank()  

# Now environment variables are properly set for this rank  
import torch.distributed as dist  
dist.init_process_group(backend='nccl')
```

**Notes**
+ Only executes in the main process; subprocess calls are automatically skipped
+ The function blocks until HyperPod provides the rank information

### HPWrapper
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-HPWrapper"></a>

```
class hyperpod_checkpointless_training.inprocess.wrap.HPWrapper(  
    *,  
    abort=Compose(HPAbortTorchDistributed()),  
    finalize=None,  
    health_check=None,  
    hp_api_factory=None,  
    abort_timeout=None,  
    enabled=True,  
    trace_file_path=None,  
    async_raise_before_abort=True,  
    early_abort_communicator=False,  
    checkpoint_manager=None,  
    check_memory_status=True)
```

*Python function wrapper that enables restart capabilities for a Re-executable Code Block (RCB) in HyperPod checkpointless training.*

*This wrapper provides fault tolerance and automatic recovery capabilities by monitoring training execution and coordinating restarts across distributed processes when failures occur. It uses a context manager approach rather than a decorator to maintain global resources throughout the training lifecycle.*

**Parameters**
+ **abort** (*Abort*, *optional*) – Asynchronously aborts execution when failures are detected. Default: `Compose(HPAbortTorchDistributed())`
+ **finalize** (*Finalize*, *optional*) – Rank-local finalize handler executed during restart. Default: `None`
+ **health\_check** (*HealthCheck*, *optional*) – Rank-local health check executed during restart. Default: `None`
+ **hp\_api\_factory** (*Callable*, *optional*) – Factory function for creating a HyperPod API to interact with HyperPod. Default: `None`
+ **abort\_timeout** (*float*, *optional*) – Timeout for abort call in fault controlling thread. Default: `None`
+ **enabled** (*bool*, *optional*) – Enables the wrapper functionality. When `False`, the wrapper becomes a pass-through. Default: `True`
+ **trace\_file\_path** (*str*, *optional*) – Path to the trace file for VizTracer profiling. Default: `None`
+ **async\_raise\_before\_abort** (*bool*, *optional*) – Enable raise before abort in fault controlling thread. Default: `True`
+ **early\_abort\_communicator** (*bool*, *optional*) – Abort communicator (NCCL/Gloo) before aborting dataloader. Default: `False`
+ **checkpoint\_manager** (*Any*, *optional*) – Manager for handling checkpoints during recovery. Default: `None`
+ **check\_memory\_status** (*bool*, *optional*) – Enable memory status checking and logging. Default: `True`

**Methods**

```
def __call__(self, fn)
```

*Wraps a function to enable restart capabilities.*

**Parameters:**
+ **fn** (*Callable*) – The function to wrap with restart capabilities

**Returns:**
+ **Callable** – Wrapped function with restart capabilities, or original function if disabled

**Example**

```
from hyperpod_checkpointless_training.nemo_plugins.checkpoint_manager import CheckpointManager  
from hyperpod_checkpointless_training.nemo_plugins.patches import patch_megatron_optimizer  
from hyperpod_checkpointless_training.nemo_plugins.checkpoint_connector import CheckpointlessCompatibleConnector  
from hyperpod_checkpointless_training.inprocess.train_utils import HPAgentK8sAPIFactory  
from hyperpod_checkpointless_training.inprocess.abort import CheckpointlessFinalizeCleanup, CheckpointlessAbortManager
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper
from hyperpod_checkpointless_training.inprocess.health_check import CudaHealthCheck
      
@HPWrapper(  
    health_check=CudaHealthCheck(),  
    hp_api_factory=HPAgentK8sAPIFactory(),  
    abort_timeout=60.0,  
    checkpoint_manager=CheckpointManager(enable_offload=False),  
    abort=CheckpointlessAbortManager.get_default_checkpointless_abort(),  
    finalize=CheckpointlessFinalizeCleanup(),  
)
def training_function():
    # Your training code here  
    pass
```

**Notes**
+ The wrapper requires `torch.distributed` to be available
+ When `enabled=False`, the wrapper becomes a pass-through and returns the original function unchanged
+ The wrapper maintains global resources like monitoring threads throughout the training lifecycle
+ Supports VizTracer profiling when `trace_file_path` is provided
+ Integrates with HyperPod for coordinated fault handling across distributed training

### HPCallWrapper
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-HPCallWrapper"></a>

```
class hyperpod_checkpointless_training.inprocess.wrap.HPCallWrapper(wrapper)
```

Monitors and manages the state of a Restart Code Block (RCB) during execution.

This class handles the lifecycle of RCB execution, including failure detection, coordination with other ranks for restarts, and cleanup operations. It manages distributed synchronization and ensures consistent recovery across all training processes.

**Parameters**
+ **wrapper** (*HPWrapper*) – The parent wrapper containing global in-process recovery settings

**Attributes**
+ **step\_upon\_restart** (*int*) – Counter that tracks steps since the last restart, used for determining restart strategy

**Methods**

```
def initialize_barrier()
```

Wait for HyperPod barrier synchronization after encountering an exception from RCB.

```
def start_hp_fault_handling_thread()
```

Start the fault handling thread for monitoring and coordinating failures.

```
def handle_fn_exception(call_ex)
```

Process exceptions from the execution function or RCB.

**Parameters:**
+ **call\_ex** (*Exception*) – Exception from the monitoring function

```
def restart(term_ex)
```

Execute restart handler including finalization, garbage collection, and health checks.

**Parameters:**
+ **term\_ex** (*RankShouldRestart*) – Termination exception triggering the restart

```
def launch(fn, *a, **kw)
```

Execute the RCB with proper exception handling.

**Parameters:**
+ **fn** (*Callable*) – Function to be executed
+ **a** – Function arguments
+ **kw** – Function keyword arguments

```
def run(fn, a, kw)
```

Main execution loop that handles restarts and barrier synchronization.

**Parameters:**
+ **fn** (*Callable*) – Function to be executed
+ **a** – Function arguments
+ **kw** – Function keyword arguments

```
def shutdown()
```

Shutdown fault handling and monitoring threads.

**Notes**
+ Automatically handles `RankShouldRestart` exceptions for coordinated recovery
+ Manages memory tracking, abort handling, and garbage collection during restarts
+ Supports both in-process recovery and PLR (Process-Level Restart) strategies based on failure timing
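The interplay between `run()` and `restart()` can be sketched as a generic retry loop. Everything below is illustrative: `RankShouldRestart`, `run_with_restarts`, and `flaky_step` are simplified stand-ins for the idea, not the library's actual internals.

```
class RankShouldRestart(Exception):
    """Signals that this rank should restart its Restart Code Block (RCB)."""

def run_with_restarts(fn, restart_handler, max_restarts=3):
    """Execute fn, restarting on RankShouldRestart up to max_restarts times."""
    restarts = 0
    while True:
        try:
            return fn()
        except RankShouldRestart:
            if restarts >= max_restarts:
                raise
            restart_handler()  # finalize, collect garbage, run health checks
            restarts += 1

# Example: a step that fails once, then succeeds after one restart.
attempts = {"n": 0}

def flaky_step():
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RankShouldRestart()
    return "done"

result = run_with_restarts(flaky_step, restart_handler=lambda: None)
```

In the real wrapper, the restart handler additionally coordinates with other ranks through the HyperPod barrier before re-entering the loop.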

### CudaHealthCheck
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-cudahealthcheck"></a>

```
class hyperpod_checkpointless_training.inprocess.health_check.CudaHealthCheck(timeout=datetime.timedelta(seconds=30))
```

Ensures that the CUDA context for the current process is in a healthy state during checkpointless training recovery.

This health check synchronizes with the GPU to verify that the CUDA context is not corrupted after a training failure. It performs GPU synchronization operations to detect any issues that might prevent successful training resumption. The health check is executed after distributed groups are destroyed and finalization is complete.

**Parameters**
+ **timeout** (*datetime.timedelta*, *optional*) – Timeout duration for GPU synchronization operations. Default: `datetime.timedelta(seconds=30)`

**Methods**

```
__call__(state, train_ex=None)
```

Execute the CUDA health check to verify GPU context integrity.

**Parameters:**
+ **state** (*HPState*) – Current HyperPod state containing rank and distributed information
+ **train\_ex** (*Exception*, *optional*) – The original training exception that triggered the restart. Default: `None`

**Returns:**
+ **tuple** – A tuple containing `(state, train_ex)` unchanged if health check passes

**Raises:**
+ **TimeoutError** – If GPU synchronization times out, indicating a potentially corrupted CUDA context

**State Preservation**: Returns the original state and exception unchanged if all checks pass

**Example**

```
import datetime  
from hyperpod_checkpointless_training.inprocess.health_check import CudaHealthCheck  
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper  
  
# Create CUDA health check with custom timeout  
cuda_health_check = CudaHealthCheck(  
    timeout=datetime.timedelta(seconds=60)  
)  
  
# Use with HPWrapper for fault-tolerant training  
@HPWrapper(  
    health_check=cuda_health_check,  
    enabled=True  
)  
def training_function():  
    # Your training code here  
    pass
```

**Notes**
+ Uses threading to implement timeout protection for GPU synchronization
+ Designed to detect corrupted CUDA contexts that could prevent successful training resumption
+ Should be used as part of the fault tolerance pipeline in distributed training scenarios
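The threading-based timeout protection noted above can be sketched in plain Python. This is an illustrative pattern only: `run_with_timeout` is a hypothetical helper, and a generic callable stands in for the GPU synchronization call so the example runs anywhere.

```
import datetime
import threading
import time

def run_with_timeout(check_fn, timeout=datetime.timedelta(seconds=30)):
    """Run check_fn on a worker thread; raise TimeoutError if it hangs."""
    errors = []

    def worker():
        try:
            check_fn()
        except Exception as exc:  # surface worker failures to the caller
            errors.append(exc)

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout.total_seconds())
    if t.is_alive():
        raise TimeoutError("health check did not complete in time")
    if errors:
        raise errors[0]

# A fast check completes normally.
run_with_timeout(lambda: None, timeout=datetime.timedelta(seconds=1))

# A hanging check (simulated with sleep) raises TimeoutError.
try:
    run_with_timeout(lambda: time.sleep(2),
                     timeout=datetime.timedelta(milliseconds=100))
    timed_out = False
except TimeoutError:
    timed_out = True
```

A timeout here indicates the check never returned, which for a CUDA synchronization call typically means a corrupted context.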

### HPAgentK8sAPIFactory
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-HPAgentK8sAPIFactory"></a>

```
class hyperpod_checkpointless_training.inprocess.train_utils.HPAgentK8sAPIFactory()
```

Factory class for creating HPAgentK8sAPI instances that communicate with HyperPod infrastructure for distributed training coordination.

This factory provides a standardized way to create and configure HPAgentK8sAPI objects that handle communication between training processes and the HyperPod control plane. It encapsulates the creation of the underlying socket client and API instance, ensuring consistent configuration across different parts of the training system.

**Methods**

```
__call__()
```

Create and return an HPAgentK8sAPI instance configured for HyperPod communication.

**Returns:**
+ **HPAgentK8sAPI** – Configured API instance for communicating with HyperPod infrastructure

**Example**

```
from hyperpod_checkpointless_training.inprocess.train_utils import HPAgentK8sAPIFactory  
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper  
from hyperpod_checkpointless_training.inprocess.health_check import CudaHealthCheck  
  
# Create the factory  
hp_api_factory = HPAgentK8sAPIFactory()  
  
# Use with HPWrapper for fault-tolerant training  
hp_wrapper = HPWrapper(  
    hp_api_factory=hp_api_factory,  
    health_check=CudaHealthCheck(),  
    abort_timeout=60.0,  
    enabled=True  
)  
  
@hp_wrapper  
def training_function():  
    # Your distributed training code here  
    pass
```

**Notes**
+ Designed to work seamlessly with HyperPod's Kubernetes-based infrastructure
+ Essential for coordinated fault handling and recovery in distributed training scenarios

### CheckpointManager
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-CheckpointManager"></a>

```
class hyperpod_checkpointless_training.nemo_plugins.checkpoint_manager.CheckpointManager(  
    enable_checksum=False,  
    enable_offload=False)
```

Manages in-memory checkpoints and peer-to-peer recovery for checkpointless fault tolerance in distributed training.

This class provides the core functionality for HyperPod checkpointless training by managing NeMo model checkpoints in memory, validating recovery feasibility, and orchestrating peer-to-peer checkpoint transfer between healthy and failed ranks. It eliminates the need for disk I/O during recovery, significantly reducing mean time to recovery (MTTR).

**Parameters**
+ **enable\_checksum** (*bool*, *optional*) – Enable model state checksum validation for integrity checks during recovery. Default: `False`
+ **enable\_offload** (*bool*, *optional*) – Enable checkpoint offloading from GPU to CPU memory to reduce GPU memory usage. Default: `False`

**Attributes**
+ **global\_step** (*int* or *None*) – Current training step associated with the saved checkpoint
+ **rng\_states** (*list* or *None*) – Stored random number generator states for deterministic recovery
+ **checksum\_manager** (*MemoryChecksumManager*) – Manager for model state checksum validation
+ **parameter\_update\_lock** (*ParameterUpdateLock*) – Lock for coordinating parameter updates during recovery

**Methods**

```
save_checkpoint(trainer)
```

Save NeMo model checkpoint in memory for potential checkpointless recovery.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance

**Notes:**
+ Called by CheckpointlessCallback at batch end or during exception handling
+ Creates recovery points without disk I/O overhead
+ Stores complete model, optimizer, and scheduler states

```
delete_checkpoint()
```

Delete the in-memory checkpoint and perform cleanup operations.

**Notes:**
+ Clears checkpoint data, RNG states, and cached tensors
+ Performs garbage collection and CUDA cache cleanup
+ Called after successful recovery or when checkpoint is no longer needed

```
try_checkpointless_load(trainer)
```

Attempt checkpointless recovery by loading state from peer ranks.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance

**Returns:**
+ **dict** or **None** – Restored checkpoint if successful, None if fallback to disk needed

**Notes:**
+ Main entry point for checkpointless recovery
+ Validates recovery feasibility before attempting P2P transfer
+ Always cleans up in-memory checkpoints after recovery attempt

```
checkpointless_recovery_feasible(trainer, include_checksum_verification=True)
```

Determine if checkpointless recovery is possible for the current failure scenario.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance
+ **include\_checksum\_verification** (*bool*, *optional*) – Whether to include checksum validation. Default: `True`

**Returns:**
+ **bool** – True if checkpointless recovery is feasible, False otherwise

**Validation Criteria:**
+ Global step consistency across healthy ranks
+ Sufficient healthy replicas available for recovery
+ Model state checksum integrity (if enabled)

```
store_rng_states()
```

Store all random number generator states for deterministic recovery.

**Notes:**
+ Captures Python, NumPy, PyTorch CPU/GPU, and Megatron RNG states
+ Essential for maintaining training determinism after recovery

```
load_rng_states()
```

Restore all RNG states for deterministic recovery continuation.

**Notes:**
+ Restores all previously stored RNG states
+ Ensures training continues with identical random sequences

```
maybe_offload_checkpoint()
```

Offload checkpoint from GPU to CPU memory if offload is enabled.

**Notes:**
+ Reduces GPU memory usage for large models
+ Only executes if `enable_offload=True`
+ Maintains checkpoint accessibility for recovery

**Example**

```
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper  
from hyperpod_checkpointless_training.nemo_plugins.checkpoint_manager import CheckpointManager  
# Use with HPWrapper for complete fault tolerance  
@HPWrapper(  
    checkpoint_manager=CheckpointManager(),  
    enabled=True  
)  
def training_function():  
    # Training code with automatic checkpointless recovery  
    pass
```

**Validation**: Verifies checkpoint integrity using checksums (if enabled)

**Notes**
+ Uses distributed communication primitives for efficient P2P transfer
+ Automatically handles tensor dtype conversions and device placement
+ **MemoryChecksumManager** – Handles model state integrity validation
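The RNG store/restore cycle behind `store_rng_states()` and `load_rng_states()` can be illustrated with Python's built-in `random` module. This is a minimal sketch of the concept only; the real CheckpointManager also captures NumPy, PyTorch CPU/GPU, and Megatron generator states.

```
import random

saved_state = None

def store_rng_states():
    """Capture the generator state before a recovery boundary."""
    global saved_state
    saved_state = random.getstate()

def load_rng_states():
    """Restore the captured state so the random sequence continues identically."""
    random.setstate(saved_state)

random.seed(1234)
store_rng_states()
expected = [random.random() for _ in range(3)]

load_rng_states()  # roll back to the stored state, as after a recovery
replayed = [random.random() for _ in range(3)]
assert replayed == expected  # identical sequence after "recovery"
```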

### PEFTCheckpointManager
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-PEFTCheckpointManager"></a>

```
class hyperpod_checkpointless_training.nemo_plugins.checkpoint_manager.PEFTCheckpointManager(  
    *args,  
    **kwargs)
```

Manages checkpoints for PEFT (Parameter-Efficient Fine-Tuning) with separate base and adapter handling for optimized checkpointless recovery.

This specialized checkpoint manager extends CheckpointManager to optimize PEFT workflows by separating base model weights from adapter parameters.

**Parameters**

Inherits all parameters from **CheckpointManager**:
+ **enable\_checksum** (*bool*, *optional*) – Enable model state checksum validation. Default: `False`
+ **enable\_offload** (*bool*, *optional*) – Enable checkpoint offloading to CPU memory. Default: `False`

**Additional Attributes**
+ **params\_to\_save** (*set*) – Set of parameter names that should be saved as adapter parameters
+ **base\_model\_weights** (*dict* or *None*) – Cached base model weights, saved once and reused
+ **base\_model\_keys\_to\_extract** (*list* or *None*) – Keys for extracting base model tensors during P2P transfer

**Methods**

```
maybe_save_base_model(trainer)
```

Save base model weights once, filtering out adapter parameters.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance

**Notes:**
+ Only saves base model weights on first call; subsequent calls are no-ops
+ Filters out adapter parameters to store only frozen base model weights
+ Base model weights are preserved across multiple training sessions

```
save_checkpoint(trainer)
```

Save NeMo PEFT adapter model checkpoint in memory for potential checkpointless recovery.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance

**Notes:**
+ Automatically calls `maybe_save_base_model()` if base model not yet saved
+ Filters checkpoint to include only adapter parameters and training state
+ Significantly reduces checkpoint size compared to full model checkpoints

```
try_base_model_checkpointless_load(trainer)
```

Attempt PEFT base model weights checkpointless recovery by loading state from peer ranks.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance

**Returns:**
+ **dict** or **None** – Restored base model checkpoint if successful, None if fallback needed

**Notes:**
+ Used during model initialization to recover base model weights
+ Does not clean up base model weights after recovery (preserves for reuse)
+ Optimized for model-weights-only recovery scenarios

```
try_checkpointless_load(trainer)
```

Attempt PEFT adapter weights checkpointless recovery by loading state from peer ranks.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance

**Returns:**
+ **dict** or **None** – Restored adapter checkpoint if successful, None if fallback needed

**Notes:**
+ Recovers only adapter parameters, optimizer states, and schedulers
+ Automatically loads optimizer and scheduler states after successful recovery
+ Cleans up adapter checkpoints after recovery attempt

```
is_adapter_key(key)
```

Check if state dict key belongs to adapter parameters.

**Parameters:**
+ **key** (*str* or *tuple*) – State dict key to check

**Returns:**
+ **bool** – True if key is adapter parameter, False if base model parameter

**Detection Logic:**
+ Checks if key is in `params_to_save` set
+ Identifies keys containing ".adapter." substring
+ Identifies keys ending with ".adapters"
+ For tuple keys, checks if parameter requires gradients
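A minimal re-implementation of the detection rules listed above makes the classification concrete. The key formats and the contents of `params_to_save` are illustrative assumptions; the actual `is_adapter_key` may differ in detail.

```
def is_adapter_key(key, params_to_save):
    """Return True if a state-dict key belongs to adapter parameters."""
    if isinstance(key, str):
        return (
            key in params_to_save          # explicitly registered adapter param
            or ".adapter." in key          # contains adapter module path
            or key.endswith(".adapters")   # adapter container suffix
        )
    # For tuple keys, assume (name, requires_grad) for illustration; the real
    # implementation inspects the parameter's requires_grad flag.
    name, requires_grad = key
    return bool(requires_grad)

params = {"model.layers.0.lora_a.weight"}
assert is_adapter_key("model.layers.0.lora_a.weight", params)
assert is_adapter_key("model.layers.1.adapter.down_proj", params)
assert not is_adapter_key("model.layers.1.mlp.weight", params)
```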

```
maybe_offload_checkpoint()
```

Offload base model weights from GPU to CPU memory.

**Notes:**
+ Extends parent method to handle base model weight offloading
+ Adapter weights are typically small and don't require offloading
+ Sets internal flag to track offload state

**Notes**
+ Designed specifically for Parameter-Efficient Fine-Tuning scenarios (LoRA, Adapters, etc.)
+ Automatically handles separation of base model and adapter parameters

**Example**

```
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper  
from hyperpod_checkpointless_training.nemo_plugins.checkpoint_manager import PEFTCheckpointManager  
# Use with HPWrapper for complete fault tolerance  
@HPWrapper(  
    checkpoint_manager=PEFTCheckpointManager(),  
    enabled=True  
)  
def training_function():  
    # Training code with automatic checkpointless recovery  
    pass
```

### CheckpointlessAbortManager
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-CheckpointlessAbortManager"></a>

```
class hyperpod_checkpointless_training.inprocess.abort.CheckpointlessAbortManager()
```

Factory class for creating and managing abort component compositions for checkpointless fault tolerance.

This utility class provides static methods to create, customize, and manage abort component compositions used during fault handling in HyperPod checkpointless training. It simplifies the configuration of abort sequences that handle cleanup of distributed training components, data loaders, and framework-specific resources during failure recovery.

**Parameters**

None (all methods are static)

**Static Methods**

```
get_default_checkpointless_abort()
```

Get the default abort compose instance containing all standard abort components.

**Returns:**
+ **Compose** – Default composed abort instance with all abort components

**Default Components:**
+ **AbortTransformerEngine()** – Cleans up TransformerEngine resources
+ **HPCheckpointingAbort()** – Handles checkpointing system cleanup
+ **HPAbortTorchDistributed()** – Aborts PyTorch distributed operations
+ **HPDataLoaderAbort()** – Stops and cleans up data loaders

```
create_custom_abort(abort_instances)
```

Create a custom abort compose with only the specified abort instances.

**Parameters:**
+ **abort\_instances** (*Abort*) – Variable number of abort instances to include in the compose

**Returns:**
+ **Compose** – New composed abort instance containing only the specified components

**Raises:**
+ **ValueError** – If no abort instances are provided

```
override_abort(abort_compose, abort_type, new_abort)
```

Replace a specific abort component in a Compose instance with a new component.

**Parameters:**
+ **abort\_compose** (*Compose*) – The original Compose instance to modify
+ **abort\_type** (*type*) – The type of abort component to replace (e.g., `HPCheckpointingAbort`)
+ **new\_abort** (*Abort*) – The new abort instance to use as replacement

**Returns:**
+ **Compose** – New Compose instance with the specified component replaced

**Raises:**
+ **ValueError** – If abort\_compose doesn't have an 'instances' attribute
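The compose-and-override pattern that `override_abort` implements can be sketched generically. The classes below are simplified stand-ins that reuse the default component names for readability; they are not the library's actual implementations.

```
class Compose:
    """Ordered collection of abort components, invoked in sequence."""
    def __init__(self, *instances):
        self.instances = list(instances)

    def __call__(self):
        for instance in self.instances:
            instance()

class HPAbortTorchDistributed:
    def __call__(self):
        pass  # would tear down torch.distributed state

class HPDataLoaderAbort:
    def __call__(self):
        pass  # would stop data loader workers

class MyDataLoaderAbort(HPDataLoaderAbort):
    def __call__(self):
        pass  # custom data loader cleanup

def override_abort(abort_compose, abort_type, new_abort):
    """Return a new Compose with components of abort_type replaced."""
    replaced = [new_abort if isinstance(inst, abort_type) else inst
                for inst in abort_compose.instances]
    return Compose(*replaced)

default = Compose(HPAbortTorchDistributed(), HPDataLoaderAbort())
custom = override_abort(default, HPDataLoaderAbort, MyDataLoaderAbort())
```

Returning a new Compose rather than mutating the original keeps the default composition reusable across configurations.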

**Example**

```
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper  
from hyperpod_checkpointless_training.inprocess.health_check import CudaHealthCheck  
from hyperpod_checkpointless_training.inprocess.abort import CheckpointlessFinalizeCleanup, CheckpointlessAbortManager  
  
# Compose the default abort handling with a health check and finalize cleanup  
@HPWrapper(  
    abort=CheckpointlessAbortManager.get_default_checkpointless_abort(),  
    health_check=CudaHealthCheck(),  
    finalize=CheckpointlessFinalizeCleanup(),  
    enabled=True  
)  
def training_function():  
    trainer.fit(...)
```

**Notes**
+ Custom configurations allow fine-tuned control over cleanup behavior
+ Abort operations are critical for proper resource cleanup during fault recovery

### CheckpointlessFinalizeCleanup
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-CheckpointlessFinalizeCleanup"></a>

```
class hyperpod_checkpointless_training.inprocess.abort.CheckpointlessFinalizeCleanup()
```

Performs comprehensive cleanup after fault detection to prepare for in-process recovery during checkpointless training.

This finalize handler executes framework-specific cleanup operations including Megatron/TransformerEngine abort, DDP cleanup, module reloading, and memory cleanup by destroying training component references. It ensures that the training environment is properly reset for successful in-process recovery without requiring full process termination.

**Parameters**

None

**Attributes**
+ **trainer** (*pytorch\_lightning.Trainer* or *None*) – Reference to the PyTorch Lightning trainer instance

**Methods**

```
__call__(*a, **kw)
```

Execute comprehensive cleanup operations for in-process recovery preparation.

**Parameters:**
+ **a** – Variable positional arguments (inherited from Finalize interface)
+ **kw** – Variable keyword arguments (inherited from Finalize interface)

**Cleanup Operations:**
+ **Megatron Framework Cleanup** – Calls `abort_megatron()` to clean up Megatron-specific resources
+ **TransformerEngine Cleanup** – Calls `abort_te()` to clean up TransformerEngine resources
+ **RoPE Cleanup** – Calls `cleanup_rope()` to clean up rotary position embedding resources
+ **DDP Cleanup** – Calls `cleanup_ddp()` to clean up DistributedDataParallel resources
+ **Module Reloading** – Calls `reload_megatron_and_te()` to reload framework modules
+ **Lightning Module Cleanup** – Optionally clears Lightning module to reduce GPU memory
+ **Memory Cleanup** – Destroys training component references to free memory

```
register_attributes(trainer)
```

Register the trainer instance for use during cleanup operations.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance to register

**Integration with CheckpointlessCallback**

```
from hyperpod_checkpointless_training.inprocess.abort import CheckpointlessFinalizeCleanup  
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper  
  
# Register the finalize cleanup handler with HPWrapper  
@HPWrapper(  
    ...  
    finalize=CheckpointlessFinalizeCleanup(),   
)  
def training_function():  
    trainer.fit(...)
```

**Notes**
+ Cleanup operations are executed in a specific order to avoid dependency issues
+ Memory cleanup uses garbage collection introspection to find target objects
+ All cleanup operations are designed to be idempotent and safe to retry

### CheckpointlessMegatronStrategy
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-CheckpointlessMegatronStrategy"></a>

```
class hyperpod_checkpointless_training.nemo_plugins.megatron_strategy.CheckpointlessMegatronStrategy(*args, **kwargs)
```

NeMo Megatron strategy with integrated checkpointless recovery capabilities for fault-tolerant distributed training.

Note that checkpointless training requires `num_distributed_optimizer_instances` to be at least 2 so that optimizer state is replicated. The strategy also handles essential attribute registration and process group initialization.

**Parameters**

Inherits all parameters from **MegatronStrategy**:
+ Standard NeMo MegatronStrategy initialization parameters
+ Distributed training configuration options
+ Model parallelism settings

**Attributes**
+ **base\_store** (*torch.distributed.TCPStore* or *None*) – Distributed store for process group coordination

**Methods**

```
setup(trainer)
```

Initialize the strategy and register fault tolerance components with the trainer.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance

**Setup Operations:**
+ **Parent Setup** – Calls parent MegatronStrategy setup
+ **Fault Injection Registration** – Registers HPFaultInjectionCallback hooks if present
+ **Finalize Registration** – Registers trainer with finalize cleanup handlers
+ **Abort Registration** – Registers trainer with abort handlers that support it

```
setup_distributed()
```

Initialize process group using either TCPStore with prefix or rootless connection.

```
load_model_state_dict(checkpoint, strict=True)
```

Load model state dict with checkpointless recovery compatibility.

**Parameters:**
+ **checkpoint** (*Mapping[str, Any]*) – Checkpoint dictionary containing model state
+ **strict** (*bool*, *optional*) – Whether to strictly enforce state dict key matching. Default: `True`

```
get_wrapper()
```

Get the HPCallWrapper instance for fault tolerance coordination.

**Returns:**
+ **HPCallWrapper** – The wrapper instance attached to the trainer for fault tolerance

```
is_peft()
```

Check whether PEFT (Parameter-Efficient Fine-Tuning) is enabled in the training configuration by looking for PEFT callbacks.

**Returns:**
+ **bool** – True if PEFT callback is present, False otherwise

```
teardown()
```

Override PyTorch Lightning native teardown to delegate cleanup to abort handlers.

**Example**

```
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper  
from hyperpod_checkpointless_training.nemo_plugins.checkpoint_manager import CheckpointManager  
from hyperpod_checkpointless_training.nemo_plugins.megatron_strategy import CheckpointlessMegatronStrategy  
import pytorch_lightning as pl  
  
# The strategy automatically integrates with HPWrapper  
@HPWrapper(  
    checkpoint_manager=CheckpointManager(),  
    enabled=True  
)  
def training_function():  
    trainer = pl.Trainer(strategy=CheckpointlessMegatronStrategy())  
    trainer.fit(model, datamodule)
```

### CheckpointlessCallback
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-CheckpointlessCallback"></a>

```
class hyperpod_checkpointless_training.nemo_plugins.callbacks.CheckpointlessCallback(  
    enable_inprocess=False,  
    enable_checkpointless=False,  
    enable_checksum=False,  
    clean_tensor_hook=False,  
    clean_lightning_module=False)
```

Lightning callback that integrates NeMo training with checkpointless training's fault tolerance system.

This callback manages step tracking, checkpoint saving, and parameter update coordination for in-process recovery capabilities. It serves as the primary integration point between PyTorch Lightning training loops and HyperPod checkpointless training mechanisms, coordinating fault tolerance operations throughout the training lifecycle.

**Parameters**
+ **enable\_inprocess** (*bool*, *optional*) – Enable in-process recovery capabilities. Default: `False`
+ **enable\_checkpointless** (*bool*, *optional*) – Enable checkpointless recovery (requires `enable_inprocess=True`). Default: `False`
+ **enable\_checksum** (*bool*, *optional*) – Enable model state checksum validation (requires `enable_checkpointless=True`). Default: `False`
+ **clean\_tensor\_hook** (*bool*, *optional*) – Clear tensor hooks from all GPU tensors during cleanup (expensive operation). Default: `False`
+ **clean\_lightning\_module** (*bool*, *optional*) – Enable Lightning module cleanup to free GPU memory after each restart. Default: `False`

**Attributes**
+ **tried\_adapter\_checkpointless** (*bool*) – Flag to track if adapter checkpointless restore has been attempted

**Methods**

```
get_wrapper_from_trainer(trainer)
```

Get the HPCallWrapper instance from the trainer for fault tolerance coordination.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance

**Returns:**
+ **HPCallWrapper** – The wrapper instance for fault tolerance operations

```
on_train_batch_start(trainer, pl_module, batch, batch_idx, *args, **kwargs)
```

Called at the start of each training batch to manage step tracking and recovery.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance
+ **pl\_module** (*pytorch\_lightning.LightningModule*) – Lightning module being trained
+ **batch** – Current training batch data
+ **batch\_idx** (*int*) – Index of the current batch
+ **args** – Additional positional arguments
+ **kwargs** – Additional keyword arguments

```
on_train_batch_end(trainer, pl_module, outputs, batch, batch_idx)
```

Release the parameter update lock at the end of each training batch.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance
+ **pl\_module** (*pytorch\_lightning.LightningModule*) – Lightning module being trained
+ **outputs** (*STEP\_OUTPUT*) – Training step outputs
+ **batch** (*Any*) – Current training batch data
+ **batch\_idx** (*int*) – Index of the current batch

**Notes:**
+ Lock release timing ensures checkpointless recovery can proceed after parameter updates complete
+ Only executes when both `enable_inprocess` and `enable_checkpointless` are True

```
get_peft_callback(trainer)
```

Retrieve the PEFT callback from the trainer's callback list.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance

**Returns:**
+ **PEFT** or **None** – PEFT callback instance if found, None otherwise

```
_try_adapter_checkpointless_restore(trainer, params_to_save)
```

Attempt checkpointless restore for PEFT adapter parameters.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance
+ **params\_to\_save** (*set*) – Set of parameter names to save as adapter parameters

**Notes:**
+ Only executes once per training session (controlled by `tried_adapter_checkpointless` flag)
+ Configures checkpoint manager with adapter parameter information

**Example**

```
from hyperpod_checkpointless_training.nemo_plugins.callbacks import CheckpointlessCallback  
from hyperpod_checkpointless_training.nemo_plugins.checkpoint_manager import CheckpointManager  
from hyperpod_checkpointless_training.nemo_plugins.megatron_strategy import CheckpointlessMegatronStrategy  
import pytorch_lightning as pl  
  
# Create checkpoint manager  
checkpoint_manager = CheckpointManager(  
    enable_checksum=True,  
    enable_offload=True  
)  
  
# Create checkpointless callback with full fault tolerance  
checkpointless_callback = CheckpointlessCallback(  
    enable_inprocess=True,  
    enable_checkpointless=True,  
    enable_checksum=True,  
    clean_tensor_hook=True,  
    clean_lightning_module=True  
)  
  
# Use with PyTorch Lightning trainer  
trainer = pl.Trainer(  
    callbacks=[checkpointless_callback],  
    strategy=CheckpointlessMegatronStrategy()  
)  
  
# Training with fault tolerance  
trainer.fit(model, datamodule=data_module)
```

**Memory Management**
+ **clean\_tensor\_hook**: Removes tensor hooks during cleanup (expensive but thorough)
+ **clean\_lightning\_module**: Frees Lightning module GPU memory during restarts
+ Both options help reduce memory footprint during fault recovery
+ Coordinates with ParameterUpdateLock for thread-safe parameter update tracking

### CheckpointlessCompatibleConnector
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-CheckpointlessCompatibleConnector"></a>

```
class hyperpod_checkpointless_training.nemo_plugins.checkpoint_connector.CheckpointlessCompatibleConnector()
```

PyTorch Lightning checkpoint connector that integrates checkpointless recovery with traditional disk-based checkpoint loading.

This connector extends PyTorch Lightning's `_CheckpointConnector` to provide seamless integration between checkpointless recovery and standard checkpoint restoration. It attempts checkpointless recovery first, then falls back to disk-based checkpoint loading if checkpointless recovery is not feasible or fails.

**Parameters**

Inherits all parameters from **\_CheckpointConnector**

**Methods**

```
resume_start(checkpoint_path=None)
```

Attempt to pre-load checkpoint with checkpointless recovery priority.

**Parameters:**
+ **checkpoint\_path** (*str* or *None*, *optional*) – Path to disk checkpoint for fallback. Default: `None`

```
resume_end()
```

Complete the checkpoint loading process and perform post-load operations.

**Notes**
+ Extends PyTorch Lightning's internal `_CheckpointConnector` class with checkpointless recovery support
+ Maintains full compatibility with standard PyTorch Lightning checkpoint workflows
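The recovery-first, disk-fallback flow that `resume_start()` implements can be sketched as follows. The function and parameter names below are illustrative placeholders, not the connector's actual API.

```
def resume_start(try_checkpointless_load, load_from_disk, checkpoint_path=None):
    """Return (checkpoint, source), preferring peer-to-peer recovery."""
    # 1. Attempt checkpointless recovery from peer ranks first.
    checkpoint = try_checkpointless_load()
    if checkpoint is not None:
        return checkpoint, "checkpointless"
    # 2. Fall back to the disk checkpoint if one was provided.
    if checkpoint_path is not None:
        return load_from_disk(checkpoint_path), "disk"
    # 3. No checkpoint available: start fresh.
    return None, "fresh"

# Peer recovery unavailable -> fall back to the disk checkpoint.
ckpt, source = resume_start(
    try_checkpointless_load=lambda: None,
    load_from_disk=lambda path: {"global_step": 100},
    checkpoint_path="/checkpoints/last.ckpt",
)
```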

### CheckpointlessAutoResume
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-CheckpointlessAutoResume"></a>

```
class hyperpod_checkpointless_training.nemo_plugins.resume.CheckpointlessAutoResume()
```

Extends NeMo's AutoResume with delayed setup to enable checkpointless recovery validation before checkpoint path resolution.

This class implements a two-phase initialization strategy that allows checkpointless recovery validation to occur before falling back to traditional disk-based checkpoint loading. It conditionally delays AutoResume setup to prevent premature checkpoint path resolution, enabling the CheckpointManager to first validate whether checkpointless peer-to-peer recovery is feasible.

**Parameters**

Inherits all parameters from **AutoResume**

**Methods**

```
setup(trainer, model=None, force_setup=False)
```

Conditionally delay AutoResume setup to enable checkpointless recovery validation.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer* or *lightning.fabric.Fabric*) – PyTorch Lightning trainer or Fabric instance
+ **model** (*optional*) – Model instance for setup. Default: `None`
+ **force\_setup** (*bool*, *optional*) – If True, bypass delay and execute AutoResume setup immediately. Default: `False`

**Example**

```
from hyperpod_checkpointless_training.nemo_plugins.resume import CheckpointlessAutoResume  
from hyperpod_checkpointless_training.nemo_plugins.megatron_strategy import CheckpointlessMegatronStrategy  
import pytorch_lightning as pl  
  
# Create trainer with checkpointless auto-resume  
trainer = pl.Trainer(  
    strategy=CheckpointlessMegatronStrategy()  
)  
  
resume = CheckpointlessAutoResume()  
resume.setup(trainer)
```

**Notes**
+ Extends NeMo's AutoResume class with delay mechanism for enabling checkpointless recovery
+ Works in conjunction with `CheckpointlessCompatibleConnector` for complete recovery workflow

# Special considerations
<a name="sagemaker-eks-checkpointless-considerations"></a>

We collect certain routine aggregated and anonymized operational metrics to provide essential service availability. The creation of these metrics is fully automated and does not involve human review of the underlying model training workload. These metrics relate to job operations, resource management, and essential service functionality. 

HyperPod checkpointless training is currently incompatible with HyperPod managed tiered checkpointing and with elastic training.

Checkpointless training recipes for GPT OSS 120B and Llama models are provided to simplify getting started. These recipes have been verified on ml.p5 instances; using other instance types may require additional modifications to the underlying recipes. The recipes can also be adapted to full fine-tuning workflows. For custom models, we recommend reviewing the [getting started examples](https://docs.aws.amazon.com/sagemaker-eks-checkpointless-recipes-custom).

# Appendix
<a name="sagemaker-eks-checkpointless-appendix"></a>

**Monitor training results via HyperPod recipes**

SageMaker HyperPod recipes offer TensorBoard integration to analyze training behavior. The recipes also incorporate VizTracer, a low-overhead tool for tracing and visualizing Python code execution. For more information, see [VizTracer](https://github.com/gaogaotiantian/viztracer).

The TensorBoard logs are generated and stored within the `log_dir`. To access and analyze these logs locally, use the following procedure:

1. Download the TensorBoard experiment folder from your training environment to your local machine.

1. Open a terminal or command prompt on your local machine.

1. Navigate to the directory containing the downloaded experiment folder.

1. Launch TensorBoard by running the following command:

   ```
   tensorboard --port=<port> --bind_all --logdir experiment
   ```

1. Open your web browser and visit `http://localhost:<port>`.

You can now see the status and visualizations of your training jobs within the TensorBoard interface, which helps you monitor and analyze the training process and gain insights into the behavior and performance of your models. For more information about monitoring and analyzing training with TensorBoard, see the [NVIDIA NeMo Framework User Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/core/exp_manager.html#experiment-manager).

**VizTracer**

To enable VizTracer, modify your recipe by setting the environment variable `ENABLE_VIZTRACER` to `1`. After training completes, your VizTracer profile is written to the experiment folder as `log_dir/viztracer_xxx.json`. To analyze the profile, download it and open it with the **vizviewer** tool:

```
vizviewer --port <port> viztracer_xxx.json
```

This command launches vizviewer on the port you specify. You can view your VizTracer profile by going to `http://localhost:<port>` in your browser and begin analyzing the training run. For more information about using VizTracer, see the [VizTracer documentation](https://viztracer.readthedocs.io/en/latest/installation.html).

# Release notes
<a name="sagemaker-eks-checkpointless-release-notes"></a>

See the following release notes to track the latest updates for SageMaker HyperPod checkpointless training.

**SageMaker HyperPod checkpointless training v1.0.1**

Date: April 10, 2026

**Bug Fixes**
+ Fixed incorrect CUDA device binding in the fault handling thread. The fault handling thread now correctly sets the CUDA device context by using `LOCAL_RANK`. This fix prevents device mismatch errors during in-process fault recovery.

**SageMaker HyperPod checkpointless training v1.0.0**

Date: December 03, 2025

**SageMaker HyperPod checkpointless training Features**
+ **Collective Communication Initialization Improvements**: Offers novel initialization methods, Rootless and TCPStoreless, for NCCL and Gloo.
+ **Memory-mapped (MMAP) Dataloader**: Caches (persists) prefetched batches so that they remain available even when a fault causes a restart of the training job.
+ **Checkpointless**: Enables faster recovery from cluster training faults in large-scale distributed training environments by making framework-level optimizations.
+ **Built on NVIDIA NeMo and PyTorch Lightning**: Leverages these frameworks for efficient and flexible model training.
  + [NVIDIA NeMo](https://github.com/NVIDIA-NeMo/NeMo)
  + [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/)
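
The MMAP dataloader idea above can be illustrated with a minimal, stdlib-only sketch. This is not the shipped implementation; the file name, record layout, and helper names are hypothetical, and it only shows the principle that batches persisted to a memory-mapped file survive a process restart.

```python
import mmap
import os
import struct

# Hypothetical cache file and record layout for illustration only.
CACHE_PATH = "batch_cache.bin"
RECORD_FMT = "4f"  # four float32 values per cached batch record
RECORD_SIZE = struct.calcsize(RECORD_FMT)

def persist_batches(batches):
    # Write prefetched batches to a file so that a restarted training
    # process can re-read them without returning to the data source.
    with open(CACHE_PATH, "wb") as f:
        for batch in batches:
            f.write(struct.pack(RECORD_FMT, *batch))

def load_batches():
    # Memory-map the cache read-only; the OS pages records in on demand.
    with open(CACHE_PATH, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        return [
            struct.unpack_from(RECORD_FMT, mm, off)
            for off in range(0, len(mm), RECORD_SIZE)
        ]

persist_batches([(1.0, 2.0, 3.0, 4.0), (5.0, 6.0, 7.0, 8.0)])
cached = load_batches()  # available even after a process restart
os.remove(CACHE_PATH)
```

Because the cache lives on disk rather than in process memory, a replacement worker can resume iterating from the cached batches instead of re-prefetching them.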

**SageMaker HyperPod checkpointless training Docker container**

Checkpointless training on HyperPod is built on top of the [NVIDIA NeMo framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html). HyperPod checkpointless training recovers faster from cluster training faults in large-scale distributed training environments through framework-level optimizations, delivered in a container built on a base image with NCCL and PyTorch optimizations.

**Availability**

Images are currently available only in the following Regions:

```
eu-north-1
ap-south-1
us-east-2
eu-west-1
eu-central-1
sa-east-1
us-east-1
eu-west-2
ap-northeast-1
us-west-2
us-west-1
ap-southeast-1
ap-southeast-2
```

They are not available in the following three opt-in Regions:

```
ap-southeast-3
ap-southeast-4
eu-south-2
```

**Container details**

Checkpointless training Docker container for PyTorch v2.6.0 with CUDA v12.9

```
963403601044.dkr.ecr.eu-north-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
423350936952.dkr.ecr.ap-south-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
556809692997.dkr.ecr.us-east-2.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
942446708630.dkr.ecr.eu-west-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
391061375763.dkr.ecr.eu-central-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
311136344257.dkr.ecr.sa-east-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
327873000638.dkr.ecr.us-east-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
016839105697.dkr.ecr.eu-west-2.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
356859066553.dkr.ecr.ap-northeast-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
920498770698.dkr.ecr.us-west-2.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
827510180725.dkr.ecr.us-west-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
885852567298.dkr.ecr.ap-southeast-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
304708117039.dkr.ecr.ap-southeast-2.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
```

**Pre-installed packages**

```
PyTorch: v2.6.0
CUDA: v12.9
NCCL: v2.27.5
EFA: v1.43.0
AWS-OFI-NCCL: v1.16.0
Libfabric: v2.1
Megatron: v0.15.0
NeMo: v2.6.0rc0
```

# Using GPU partitions in Amazon SageMaker HyperPod
<a name="sagemaker-hyperpod-eks-gpu-partitioning"></a>

Cluster administrators can choose how to maximize GPU utilization across their organization. You can enable GPU partitioning with NVIDIA Multi-Instance GPU (MIG) technology to partition GPU resources into smaller, isolated instances for better resource utilization. This capability lets you run multiple smaller tasks concurrently on a single GPU instead of dedicating the entire device to a single, often underutilized task, eliminating wasted compute power and memory.

GPU partitioning with MIG technology lets you partition a single supported GPU into up to seven separate GPU partitions. Each GPU partition has dedicated memory, cache, and compute resources, providing predictable isolation.

## Benefits
<a name="sagemaker-hyperpod-eks-gpu-partitioning-benefits"></a>
+ **Improved GPU utilization** - Maximize compute efficiency by partitioning GPUs based on compute and memory requirements
+ **Task isolation** - Each GPU partition operates independently with dedicated memory, cache, and compute resources
+ **Task flexibility** - Support a mix of tasks on a single physical GPU, all running in parallel
+ **Flexible setup management** - Support both Do-it-yourself (DIY) Kubernetes configurations using Kubernetes command-line client `kubectl`, and a managed solution with custom labels to easily configure and apply your labels associated with GPU partitions

**Important**  
GPU partitioning with MIG is not supported with flexible instance groups (instance groups that use `InstanceRequirements`). To use MIG, create an instance group with a single `InstanceType`.

## Supported Instance Types
<a name="sagemaker-hyperpod-eks-gpu-partitioning-instance-types"></a>

GPU partitioning with MIG technology is supported on the following HyperPod instance types:

**A100 GPU Instances** - [https://aws.amazon.com/ec2/instance-types/p4/](https://aws.amazon.com/ec2/instance-types/p4/)
+ **ml.p4d.24xlarge** - 8 NVIDIA A100 GPUs (40GB HBM2 per GPU)
+ **ml.p4de.24xlarge** - 8 NVIDIA A100 GPUs (80GB HBM2e per GPU)

**H100 GPU Instances** - [https://aws.amazon.com/ec2/instance-types/p5/](https://aws.amazon.com/ec2/instance-types/p5/)
+ **ml.p5.48xlarge** - 8 NVIDIA H100 GPUs (80GB HBM3 per GPU)

**H200 GPU Instances** - [https://aws.amazon.com/ec2/instance-types/p5/](https://aws.amazon.com/ec2/instance-types/p5/)
+ **ml.p5e.48xlarge** - 8 NVIDIA H200 GPUs (141GB HBM3e per GPU)
+ **ml.p5en.48xlarge** - 8 NVIDIA H200 GPUs (141GB HBM3e per GPU)

**B200 GPU Instances** - [https://aws.amazon.com/ec2/instance-types/p6/](https://aws.amazon.com/ec2/instance-types/p6/)
+ **ml.p6-b200.48xlarge** - 8 NVIDIA B200 GPUs

## GPU partitions
<a name="sagemaker-hyperpod-eks-gpu-partitioning-profiles"></a>

NVIDIA MIG profiles define how GPUs are partitioned. Each profile specifies the compute and memory allocation per MIG instance. The following are the MIG profiles associated with each GPU type:

**A100 GPU (ml.p4d.24xlarge)**


| Profile | Memory (GB) | Instances per GPU | Total per ml.p4d.24xlarge | 
| --- | --- | --- | --- | 
| `1g.5gb` | 5 | 7 | 56 | 
| `2g.10gb` | 10 | 3 | 24 | 
| `3g.20gb` | 20 | 2 | 16 | 
| `4g.20gb` | 20 | 1 | 8 | 
| `7g.40gb` | 40 | 1 | 8 | 

**H100 GPU (ml.p5.48xlarge)**


| Profile | Memory (GB) | Instances per GPU | Total per ml.p5.48xlarge | 
| --- | --- | --- | --- | 
| `1g.10gb` | 10 | 7 | 56 | 
| `1g.20gb` | 20 | 4 | 32 | 
| `2g.20gb` | 20 | 3 | 24 | 
| `3g.40gb` | 40 | 2 | 16 | 
| `4g.40gb` | 40 | 1 | 8 | 
| `7g.80gb` | 80 | 1 | 8 | 

**H200 GPU (ml.p5e.48xlarge and ml.p5en.48xlarge)**


| Profile | Memory (GB) | Instances per GPU | Total per ml.p5en.48xlarge | 
| --- | --- | --- | --- | 
| `1g.18gb` | 18 | 7 | 56 | 
| `1g.35gb` | 35 | 4 | 32 | 
| `2g.35gb` | 35 | 3 | 24 | 
| `3g.71gb` | 71 | 2 | 16 | 
| `4g.71gb` | 71 | 1 | 8 | 
| `7g.141gb` | 141 | 1 | 8 | 
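
The profile tables above can be turned into a small lookup, for example to pick the smallest profile that satisfies a memory requirement. This is an illustrative helper only; the function and variable names are not part of any SageMaker API, and the data is taken from the H100 (ml.p5.48xlarge) table.

```python
# H100 (ml.p5.48xlarge) MIG profiles: profile -> (memory_gb, instances_per_gpu)
H100_PROFILES = {
    "1g.10gb": (10, 7),
    "1g.20gb": (20, 4),
    "2g.20gb": (20, 3),
    "3g.40gb": (40, 2),
    "4g.40gb": (40, 1),
    "7g.80gb": (80, 1),
}

def smallest_profile(min_memory_gb):
    # Pick the profile with the least memory that still satisfies the
    # requirement, preferring higher per-GPU instance counts on ties.
    candidates = [
        (mem, -count, name)
        for name, (mem, count) in H100_PROFILES.items()
        if mem >= min_memory_gb
    ]
    return min(candidates)[2] if candidates else None

def partitions_per_node(profile, gpus_per_node=8):
    # Total MIG instances across all 8 GPUs in an ml.p5.48xlarge node.
    return H100_PROFILES[profile][1] * gpus_per_node

print(smallest_profile(15))            # -> 1g.20gb
print(partitions_per_node("1g.10gb"))  # -> 56
```

A model needing 15 GB fits in a `1g.20gb` partition, and the smallest profile yields 7 × 8 = 56 partitions per node, matching the table.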

**Topics**
+ [Benefits](#sagemaker-hyperpod-eks-gpu-partitioning-benefits)
+ [Supported Instance Types](#sagemaker-hyperpod-eks-gpu-partitioning-instance-types)
+ [GPU partitions](#sagemaker-hyperpod-eks-gpu-partitioning-profiles)
+ [Setting up GPU partitions on Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning-setup.md)
+ [Node Lifecycle and Labels](sagemaker-hyperpod-eks-gpu-partitioning-labels.md)
+ [Task Submission with MIG](sagemaker-hyperpod-eks-gpu-partitioning-task-submission.md)

# Setting up GPU partitions on Amazon SageMaker HyperPod
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup"></a>

**Topics**
+ [Prerequisites](#sagemaker-hyperpod-eks-gpu-partitioning-setup-prerequisites)
+ [Creating a Cluster with MIG Configuration](#sagemaker-hyperpod-eks-gpu-partitioning-setup-create-cluster)
+ [Adding GPU operator to an existing cluster](#sagemaker-hyperpod-eks-gpu-partitioning-setup-add-operator)
+ [Updating MIG Configuration](#sagemaker-hyperpod-eks-gpu-partitioning-setup-update)
+ [Verifying MIG Configuration](#sagemaker-hyperpod-eks-gpu-partitioning-setup-verify)
+ [Common Commands for Debugging MIG Configuration](#sagemaker-hyperpod-eks-gpu-partitioning-setup-debug-commands)
+ [Using SageMaker AI Console](#sagemaker-hyperpod-eks-gpu-partitioning-setup-console)

## Prerequisites
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-prerequisites"></a>
+ HyperPod Amazon EKS cluster with supported GPU instances
+ NVIDIA GPU Operator installed
+ Appropriate IAM permissions for cluster management

## Creating a Cluster with MIG Configuration
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-create-cluster"></a>

### Using AWS CLI
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-create-cluster-cli"></a>

```
aws sagemaker create-cluster \
  --cluster-name my-mig-cluster \
  --orchestrator 'Eks={ClusterArn=arn:aws:eks:region:account:cluster/cluster-name}' \
  --instance-groups '{
    "InstanceGroupName": "gpu-group",
    "InstanceType": "ml.p4d.24xlarge",
    "InstanceCount": 1,
    "LifeCycleConfig": {
       "SourceS3Uri": "s3://my-bucket",
       "OnCreate": "on_create_script.sh"
    },
    "KubernetesConfig": {
       "Labels": {
          "nvidia.com/mig.config": "all-1g.5gb"
       }
    },
    "ExecutionRole": "arn:aws:iam::account:role/execution-role",
    "ThreadsPerCore": 1
  }' \
  --vpc-config '{
     "SecurityGroupIds": ["sg-12345"],
     "Subnets": ["subnet-12345"]
  }' \
  --node-provisioning-mode Continuous
```

### Using CloudFormation
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-create-cluster-cfn"></a>

```
{
  "ClusterName": "my-mig-cluster",
  "InstanceGroups": [
    {
      "InstanceGroupName": "gpu-group",
      "InstanceType": "ml.p4d.24xlarge",
      "InstanceCount": 1,
      "KubernetesConfig": {
        "Labels": {
          "nvidia.com/mig.config": "all-2g.10gb"
        }
      },
      "ExecutionRole": "arn:aws:iam::account:role/execution-role"
    }
  ],
  "Orchestrator": {
    "Eks": {
      "ClusterArn": "arn:aws:eks:region:account:cluster/cluster-name"
    }
  },
  "NodeProvisioningMode": "Continuous"
}
```
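
Before submitting a create or update request, it can help to sanity-check the MIG label value. The examples in this guide set `nvidia.com/mig.config` to values of the form `all-<profile>`. The following is a hypothetical helper (names and the restriction to A100 profiles are assumptions for illustration) that validates that shape against the A100 table from the previous section.

```python
import re

# Profile names from the A100 (ml.p4d.24xlarge) table; hypothetical helper.
A100_PROFILES = {"1g.5gb", "2g.10gb", "3g.20gb", "4g.20gb", "7g.40gb"}

def valid_mig_label(value):
    # Check both the "all-<profile>" shape and that the profile exists
    # for this GPU type.
    m = re.fullmatch(r"all-(\d+g\.\d+gb)", value)
    return bool(m) and m.group(1) in A100_PROFILES

print(valid_mig_label("all-1g.5gb"))   # -> True
print(valid_mig_label("all-9g.99gb"))  # -> False
```

Catching a mistyped profile locally avoids waiting for the GPU Operator to report a failed MIG configuration on the node.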

## Adding GPU operator to an existing cluster
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-add-operator"></a>

### Install GPU Operator
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-add-operator-install"></a>

Replace `{$AWS_REGION}` with your cluster's Region (for example, `us-east-1` or `us-west-2`).

```
helm install gpuo helm_chart/HyperPodHelmChart/charts/gpu-operator \
-f helm_chart/HyperPodHelmChart/charts/gpu-operator/regional-values/values-{$AWS_REGION}.yaml \
-n kube-system
```

### Verify Installation (Wait 2-3 minutes)
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-add-operator-verify"></a>

Check that all GPU operator pods are running:

```
kubectl get pods -n kube-system | grep -E "(gpu-operator|nvidia-)"
```

**Expected pods:**
+ gpu-operator-* - 1 instance (cluster controller)
+ nvidia-device-plugin-daemonset-* - 1 per GPU node (all GPU instances)
+ nvidia-mig-manager-* - 1 per MIG-capable node (A100/H100)

### Remove Old Device Plugin
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-add-operator-remove"></a>

Disable the existing nvidia-device-plugin:

```
helm upgrade dependencies helm_chart/HyperPodHelmChart \
--set nvidia-device-plugin.devicePlugin.enabled=false \
-n kube-system
```

### Verify GPU Resources
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-add-operator-verify-gpu"></a>

Confirm that the nodes show GPU capacity. The output should display `nvidia.com/gpu: 8` (or your actual GPU count).

```
kubectl describe nodes | grep "nvidia.com/gpu"
```

## Updating MIG Configuration
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update"></a>

**Preparing Nodes Before MIG Updates**  
Before updating MIG configurations on your instance group, you must prepare the nodes to prevent workload disruption. Follow these steps to safely drain workloads from the nodes that will be reconfigured.

### Step 1: Identify Nodes in the Instance Group
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update-identify"></a>

First, identify all nodes that belong to the instance group you want to update:

```
# List all nodes in the instance group
kubectl get nodes -l sagemaker.amazonaws.com/instance-group-name=INSTANCE_GROUP_NAME

# Example:
kubectl get nodes -l sagemaker.amazonaws.com/instance-group-name=p4d-group
```

This command returns a list of all nodes in the specified instance group. Make note of each node name for the following steps.

### Step 2: Cordon and Drain Each Node
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update-drain"></a>

For each node identified in Step 1, perform the following actions:

#### Cordon the Node
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update-drain-cordon"></a>

Cordoning prevents new pods from being scheduled on the node:

```
# Cordon a single node
kubectl cordon NODE_NAME

# Example:
kubectl cordon hyperpod-i-014a41a7001adca60
```

#### Drain Workload Pods from the Node
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update-drain-evict"></a>

Drain the node to evict all workload pods while preserving system pods:

```
# Drain the node (ignore DaemonSets and evict pods)
kubectl drain NODE_NAME \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force \
  --grace-period=300

# Example:
kubectl drain hyperpod-i-014a41a7001adca60 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force \
  --grace-period=300
```

**Command Options Explained:**
+ `--ignore-daemonsets` - Allows the drain operation to proceed even if DaemonSet pods are present
+ `--delete-emptydir-data` - Deletes pods using emptyDir volumes (required for draining to succeed)
+ `--force` - Forces deletion of pods not managed by a controller (use with caution)
+ `--grace-period=300` - Gives pods 5 minutes to terminate gracefully

**Important**  
The drain operation may take several minutes depending on the number of pods and their termination grace periods.  
System pods in the following namespaces will remain running: `kube-system`, `cert-manager`, `kubeflow`, `hyperpod-inference-system`, `kube-public`, `mpi-operator`, `gpu-operator`, `aws-hyperpod`, `jupyter-k8s-system`, `hyperpod-observability`, `kueue-system`, and `keda`.  
DaemonSet pods will remain on the node (they are ignored by design).

### Step 3: Verify No Workload Pods are Running
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update-verify"></a>

After draining, verify that no workload pods remain on the nodes (excluding system namespaces):

```
# Check for any remaining pods outside system namespaces
kubectl get pods --all-namespaces --field-selector spec.nodeName=NODE_NAME \
  | grep -v "kube-system" \
  | grep -v "cert-manager" \
  | grep -v "kubeflow" \
  | grep -v "hyperpod-inference-system" \
  | grep -v "kube-public" \
  | grep -v "mpi-operator" \
  | grep -v "gpu-operator" \
  | grep -v "aws-hyperpod" \
  | grep -v "jupyter-k8s-system" \
  | grep -v "hyperpod-observability" \
  | grep -v "kueue-system" \
  | grep -v "keda"

# Example:
kubectl get pods --all-namespaces --field-selector spec.nodeName=hyperpod-i-014a41a7001adca60 \
  | grep -v "kube-system" \
  | grep -v "cert-manager" \
  | grep -v "kubeflow" \
  | grep -v "hyperpod-inference-system" \
  | grep -v "kube-public" \
  | grep -v "mpi-operator" \
  | grep -v "gpu-operator" \
  | grep -v "aws-hyperpod" \
  | grep -v "jupyter-k8s-system" \
  | grep -v "hyperpod-observability" \
  | grep -v "kueue-system" \
  | grep -v "keda"
```

**Expected Output:** If the node is properly drained, this command should return no results (or only show the header row). If any pods are still running, investigate why they weren't evicted and manually delete them if necessary.
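
The namespace filter in the preceding `grep` chain can also be expressed as a short script, which is slightly stricter because it matches the namespace column exactly rather than anywhere in the line. This is an illustrative sketch (the function name and sample output are hypothetical); the namespace list comes from the Important note above.

```python
# System namespaces that are expected to keep pods on a drained node.
SYSTEM_NAMESPACES = {
    "kube-system", "cert-manager", "kubeflow", "hyperpod-inference-system",
    "kube-public", "mpi-operator", "gpu-operator", "aws-hyperpod",
    "jupyter-k8s-system", "hyperpod-observability", "kueue-system", "keda",
}

def workload_pods(kubectl_output):
    # Parse `kubectl get pods --all-namespaces` output and keep only rows
    # whose namespace (first column) is not a system namespace.
    rows = kubectl_output.strip().splitlines()[1:]  # drop the header row
    return [row for row in rows if row.split()[0] not in SYSTEM_NAMESPACES]

# Hypothetical kubectl output for demonstration.
sample = """\
NAMESPACE     NAME                       READY   STATUS    RESTARTS   AGE
kube-system   coredns-5d78c9869d-abcde   1/1     Running   0          2d
default       training-job-worker-0      1/1     Running   0          1h
"""
remaining = workload_pods(sample)
print(len(remaining))  # -> 1 (the worker pod still needs to be evicted)
```

An empty result means the node is fully drained of workload pods and ready for the MIG update.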

### Step 4: Verify Node Readiness Status
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update-readiness"></a>

Before proceeding with the MIG update, confirm that all nodes are cordoned:

```
# Check node status - should show "SchedulingDisabled"
kubectl get nodes -l sagemaker.amazonaws.com/instance-group-name=INSTANCE_GROUP_NAME
```

Nodes should show `SchedulingDisabled` in the STATUS column, indicating they are cordoned and ready for the MIG update.

### Update MIG Profile on Existing Cluster
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update-change"></a>

You can change MIG profiles on existing clusters:

```
aws sagemaker update-cluster \
  --cluster-name my-mig-cluster \
  --instance-groups '{
    "InstanceGroupName": "gpu-group",
    "InstanceType": "ml.p4d.24xlarge",
    "InstanceCount": 1,
    "KubernetesConfig": {
       "Labels": {
          "nvidia.com/mig.config": "all-3g.20gb"
       }
    },
    "ExecutionRole": "arn:aws:iam::account:role/execution-role"
  }'
```

**Note**  
If jobs are already running on a node, MIG partitioning will fail. You will receive an error message asking you to drain the nodes before reattempting MIG partitioning.

## Verifying MIG Configuration
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-verify"></a>

After cluster creation or update, verify the MIG configuration:

```
# Update kubeconfig
aws eks update-kubeconfig --name your-eks-cluster --region us-east-2

# Check MIG labels
kubectl get node NODE_NAME -o=jsonpath='{.metadata.labels}' | grep mig

# Check available MIG resources
kubectl describe node NODE_NAME | grep -A 10 "Allocatable:"
```

## Common Commands for Debugging MIG Configuration
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-debug-commands"></a>

Use the following commands to troubleshoot and validate MIG configuration in your cluster:

```
# Check GPU Operator status
kubectl get pods -n gpu-operator-resources

# View MIG configuration
kubectl exec -n gpu-operator-resources nvidia-driver-XXXXX -- nvidia-smi mig -lgi

# Check device plugin configuration
kubectl logs -n gpu-operator-resources nvidia-device-plugin-XXXXX

# Monitor node events
kubectl get events --field-selector involvedObject.name=NODE_NAME
```

**Note**  
Replace `nvidia-driver-XXXXX` and `nvidia-device-plugin-XXXXX` with the actual pod names from your cluster, and `NODE_NAME` with your node's name.

## Using SageMaker AI Console
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-console"></a>

### Creating a New Cluster with MIG
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-console-create"></a>

1. Navigate to **Amazon SageMaker AI** > **HyperPod Clusters** > **Cluster Management** > **Create HyperPod cluster**

1. Select **Orchestrated by EKS**

1. Choose **Custom setup** and verify **GPU Operator** is enabled by default

1. Under **Instance groups** section, click **Add group**

1. Configure the instance group, then navigate to **Advanced Configuration**, enable the **Use GPU partition** toggle, and choose your desired **MIG configuration** from the dropdown

1. Click **Add Instance group** and complete the remaining cluster configuration

1. Click **Submit** to create the cluster

### Updating MIG Configuration on Existing Cluster
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-console-update"></a>

1. Navigate to **Amazon SageMaker AI** > **HyperPod Clusters** > **Cluster Management**

1. Select your existing cluster and click **Edit** on the instance group you want to modify

1. In **Advanced configuration**, toggle **Use GPU partition** if not already enabled and select a different **MIG configuration** from the dropdown

1. Click **Save changes**

# Node Lifecycle and Labels
<a name="sagemaker-hyperpod-eks-gpu-partitioning-labels"></a>

Amazon SageMaker HyperPod performs deep health checks on cluster instances during HyperPod cluster creation and updates, before GPU partitioning begins. The HyperPod health-monitoring agent continuously monitors the health status of GPU-partitioned instances.

## MIG Configuration States
<a name="sagemaker-hyperpod-eks-gpu-partitioning-labels-states"></a>

Nodes with GPU partition configuration go through several states:
+ **Pending** - Node is being configured with a MIG profile
+ **Configuring** - GPU Operator is applying MIG partitioning
+ **Success** - GPU partitioning completed successfully
+ **Failed** - GPU partitioning encountered an error

## Monitoring Node States
<a name="sagemaker-hyperpod-eks-gpu-partitioning-labels-monitoring"></a>

```
# Check node health status
kubectl get nodes -l sagemaker.amazonaws.com/node-health-status=Schedulable

# Monitor MIG configuration progress
kubectl get node NODE_NAME -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'

# Check for configuration errors
kubectl describe node NODE_NAME | grep -A 5 "Conditions:"
```

## Custom Labels and Taints
<a name="sagemaker-hyperpod-eks-gpu-partitioning-labels-custom"></a>

You can manage MIG configuration with custom labels and taints to label your GPU partitions and apply them across instances:

```
{
  "KubernetesConfig": {
    "Labels": {
      "nvidia.com/mig.config": "all-2g.10gb",
      "task-type": "inference",
      "environment": "production"
    },
    "Taints": [
      {
        "Key": "gpu-task",
        "Value": "mig-enabled",
        "Effect": "NoSchedule"
      }
    ]
  }
}
```

# Task Submission with MIG
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission"></a>

**Topics**
+ [Using Kubernetes YAML](#sagemaker-hyperpod-eks-gpu-partitioning-task-submission-kubectl)
+ [Using HyperPod CLI](#sagemaker-hyperpod-eks-gpu-partitioning-task-submission-cli)
+ [Model Deployment with MIG](#sagemaker-hyperpod-eks-gpu-partitioning-task-submission-deployment)
+ [Using HyperPod CLI](#sagemaker-hyperpod-eks-gpu-partitioning-task-submission-hyperpod-cli)

## Using Kubernetes YAML
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-kubectl"></a>

```
apiVersion: batch/v1
kind: Job
metadata:
  name: mig-job
  namespace: default
spec:
  template:
    spec:
      containers:
      - name: pytorch
        image: pytorch/pytorch:latest
        resources:
          requests:
            nvidia.com/mig-1g.5gb: 1
            cpu: "100m"
            memory: "128Mi"
          limits:
            nvidia.com/mig-1g.5gb: 1
      restartPolicy: Never
```

## Using HyperPod CLI
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-cli"></a>

Use the HyperPod CLI to deploy JumpStart models with MIG support. The following example demonstrates the new CLI parameters for GPU partitioning:

```
# Deploy JumpStart model with MIG
hyp create hyp-jumpstart-endpoint \
  --model-id deepseek-llm-r1-distill-qwen-1-5b \
  --instance-type ml.p5.48xlarge \
  --accelerator-partition-type mig-2g.10gb \
  --accelerator-partition-validation True \
  --endpoint-name my-endpoint \
  --tls-certificate-output-s3-uri s3://certificate-bucket/ \
  --namespace default
```

## Model Deployment with MIG
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-deployment"></a>

HyperPod Inference lets you deploy models on MIG profiles via Studio Classic, `kubectl`, and the HyperPod CLI. To deploy JumpStart models with `kubectl`, the CRDs provide a field, `spec.server.acceleratorPartitionType`, for deploying the model to the desired MIG profile. Validations run to ensure that models can be deployed on the MIG profile selected in the CRD. To disable the MIG validation checks, set `spec.server.validations.acceleratorPartitionValidation` to `False`.

### JumpStart Models
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-jumpstart"></a>

```
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: JumpStartModel
metadata:
  name: deepseek-model
  namespace: default
spec:
  sageMakerEndpoint:
    name: deepseek-endpoint
  model:
    modelHubName: SageMakerPublicHub
    modelId: deepseek-llm-r1-distill-qwen-1-5b
  server:
    acceleratorPartitionType: mig-7g.40gb
    instanceType: ml.p4d.24xlarge
```

### Deploy model from Amazon S3 using InferenceEndpointConfig
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-s3"></a>

InferenceEndpointConfig lets you deploy a custom model from Amazon S3. To deploy a model on MIG, specify the MIG profile in the `requests` and `limits` of `spec.worker.resources`. The following is a simple deployment example:

```
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
metadata:
  name: custom-model
  namespace: default
spec:
  replicas: 1
  modelName: my-model
  endpointName: my-endpoint
  instanceType: ml.p4d.24xlarge
  modelSourceConfig:
    modelSourceType: s3
    s3Storage:
      bucketName: my-model-bucket
      region: us-east-2
    modelLocation: model-path
  worker:
    resources:
      requests:
        nvidia.com/mig-3g.20gb: 1
        cpu: "5600m"
        memory: "10Gi"
      limits:
        nvidia.com/mig-3g.20gb: 1
```

### Deploy model from FSx for Lustre using InferenceEndpointConfig
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-fsx"></a>

InferenceEndpointConfig lets you deploy a custom model from FSx for Lustre. To deploy a model on MIG, specify the MIG profile in the `requests` and `limits` of `spec.worker.resources`. The following is a simple deployment example:

```
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
metadata:
  name: custom-model
  namespace: default
spec:
  replicas: 1
  modelName: my-model
  endpointName: my-endpoint
  instanceType: ml.p4d.24xlarge
  modelSourceConfig:
    modelSourceType: fsx
    fsxStorage:
      fileSystemId: fs-xxxxx
    modelLocation: location-on-fsx
  worker:
    resources:
      requests:
        nvidia.com/mig-3g.20gb: 1
        cpu: "5600m"
        memory: "10Gi"
      limits:
        nvidia.com/mig-3g.20gb: 1
```

### Using Studio Classic UI
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-studio"></a>

#### Deploying JumpStart Models with MIG
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-studio-deploy"></a>

1. Open **Studio Classic** and navigate to **JumpStart**

1. Browse or search for your desired model (e.g., "DeepSeek", "Llama", etc.)

1. Click on the model card and select **Deploy**

1. In the deployment configuration:
   + Choose **HyperPod** as the deployment target
   + Select your MIG-enabled cluster from the dropdown
   + Under **Instance configuration**:
     + Select instance type (e.g., `ml.p4d.24xlarge`)
     + Choose **GPU Partition Type** from available options
     + Configure **Instance count** and **Auto-scaling** settings

1. Review and click **Deploy**

1. Monitor deployment progress in the **Endpoints** section

#### Model Configuration Options
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-studio-config"></a>

**Endpoint Settings:**
+ **Endpoint name** - Unique identifier for your deployment
+ **Variant name** - Configuration variant (default: AllTraffic)
+ **Instance type** - Must support GPU partitioning (P-series instances)
+ **MIG profile** - GPU partition profile to apply
+ **Initial instance count** - Number of instances to deploy
+ **Auto-scaling** - Enable for dynamic scaling based on traffic

**Advanced Configuration:**
+ **Model data location** - Amazon S3 path for custom models
+ **Container image** - Custom inference container (optional)
+ **Environment variables** - Model-specific configurations
+ **Amazon VPC configuration** - Network isolation settings

#### Monitoring Deployed Models
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-studio-monitor"></a>

1. Navigate to **Studio Classic** > **Deployments** > **Endpoints**

1. Select your MIG-enabled endpoint

1. View metrics including:
   + **MIG utilization** - Per GPU partition usage
   + **Memory consumption** - Per GPU partition
   + **Inference latency** - Request processing time
   + **Throughput** - Requests per second

1. Set up **Amazon CloudWatch alarms** for automated monitoring

1. Configure **auto-scaling policies** based on MIG utilization

## Using HyperPod CLI
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-hyperpod-cli"></a>

### JumpStart Deployment
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-hyperpod-cli-jumpstart"></a>

The HyperPod CLI JumpStart command includes two new fields for MIG support:
+ `--accelerator-partition-type` - Specifies the MIG configuration (e.g., mig-4g.20gb)
+ `--accelerator-partition-validation` - Validates compatibility between the model and the MIG profile (default: `true`)

```
hyp create hyp-jumpstart-endpoint \
  --version 1.1 \
  --model-id deepseek-llm-r1-distill-qwen-1-5b \
  --instance-type ml.p4d.24xlarge \
  --endpoint-name js-test \
  --accelerator-partition-type "mig-4g.20gb" \
  --accelerator-partition-validation true \
  --tls-certificate-output-s3-uri s3://my-bucket/certs/
```
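The accelerator partition type follows the MIG profile naming shown in the example (`mig-4g.20gb`). As a minimal sketch, the value can be validated client-side before submitting the command; the `parse_mig_profile` helper below is a hypothetical illustration, not part of the HyperPod CLI:

```python
import re

# Hypothetical helper, not part of the HyperPod CLI: validates that a value
# passed to --accelerator-partition-type looks like a MIG profile string
# such as "mig-4g.20gb" before the command is submitted.
MIG_PROFILE_RE = re.compile(r"^mig-(\d+)g\.(\d+)gb$")

def parse_mig_profile(value):
    """Return (compute_slices, memory_gb), or raise ValueError."""
    match = MIG_PROFILE_RE.match(value)
    if not match:
        raise ValueError(f"not a MIG profile: {value!r}")
    return int(match.group(1)), int(match.group(2))

print(parse_mig_profile("mig-4g.20gb"))  # (4, 20)
```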

### Custom Endpoint Deployment
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-hyperpod-cli-custom"></a>

For deploying via custom endpoint, use the existing fields `--resources-requests` and `--resources-limits` to enable MIG profile functionality:

```
hyp create hyp-custom-endpoint \
  --namespace default \
  --metadata-name deepseek15b-mig-10-14-v2 \
  --endpoint-name deepseek15b-mig-endpoint \
  --instance-type ml.p4d.24xlarge \
  --model-name deepseek15b-mig \
  --model-source-type s3 \
  --model-location deep-seek-15b \
  --prefetch-enabled true \
  --tls-certificate-output-s3-uri s3://sagemaker-bucket \
  --image-uri lmcache/vllm-openai:v0.3.7 \
  --container-port 8080 \
  --model-volume-mount-path /opt/ml/model \
  --model-volume-mount-name model-weights \
  --s3-bucket-name model-storage-123456789 \
  --s3-region us-east-2 \
  --invocation-endpoint invocations \
  --resources-requests '{"cpu":"5600m","memory":"10Gi","nvidia.com/mig-3g.20gb":"1"}' \
  --resources-limits '{"nvidia.com/mig-3g.20gb":"1"}' \
  --env '{
    "OPTION_ROLLING_BATCH":"vllm",
    "SERVING_CHUNKED_READ_TIMEOUT":"480",
    "DJL_OFFLINE":"true",
    "NUM_SHARD":"1",
    "SAGEMAKER_PROGRAM":"inference.py",
    "SAGEMAKER_SUBMIT_DIRECTORY":"/opt/ml/model/code",
    "MODEL_CACHE_ROOT":"/opt/ml/model",
    "SAGEMAKER_MODEL_SERVER_WORKERS":"1",
    "SAGEMAKER_MODEL_SERVER_TIMEOUT":"3600",
    "OPTION_TRUST_REMOTE_CODE":"true",
    "OPTION_ENABLE_REASONING":"true",
    "OPTION_REASONING_PARSER":"deepseek_r1",
    "SAGEMAKER_CONTAINER_LOG_LEVEL":"20",
    "SAGEMAKER_ENV":"1"
  }'
```
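The MIG profile is requested through Kubernetes extended resources in `--resources-requests` and `--resources-limits`. As a sketch, the JSON payloads from the command above can be sanity-checked before deployment; this check is an illustration, not part of the CLI:

```python
import json

# Illustrative pre-deployment check (an assumption, not part of the CLI):
# confirm that the --resources-requests and --resources-limits JSON from the
# command above carry a matching MIG extended-resource key.
requests = json.loads('{"cpu":"5600m","memory":"10Gi","nvidia.com/mig-3g.20gb":"1"}')
limits = json.loads('{"nvidia.com/mig-3g.20gb":"1"}')

mig_keys = [key for key in requests if key.startswith("nvidia.com/mig-")]
assert mig_keys, "no MIG resource requested"
for key in mig_keys:
    # Kubernetes extended resources must have request == limit.
    assert limits.get(key) == requests[key], f"request/limit mismatch for {key}"
print(mig_keys)  # ['nvidia.com/mig-3g.20gb']
```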

# Cluster resiliency features for SageMaker HyperPod cluster orchestration with Amazon EKS
<a name="sagemaker-hyperpod-eks-resiliency"></a>

SageMaker HyperPod provides the following cluster resiliency features. 

**Topics**
+ [Health Monitoring System](sagemaker-hyperpod-eks-resiliency-health-monitoring-agent.md)
+ [Basic health checks](sagemaker-hyperpod-eks-resiliency-basic-health-check.md)
+ [Deep health checks](sagemaker-hyperpod-eks-resiliency-deep-health-checks.md)
+ [Automatic node recovery](sagemaker-hyperpod-eks-resiliency-node-recovery.md)
+ [Resilience-related Kubernetes labels by SageMaker HyperPod](sagemaker-hyperpod-eks-resiliency-node-labels.md)
+ [Manually quarantine, replace, or reboot a node](sagemaker-hyperpod-eks-resiliency-manual.md)
+ [Suggested resilience configurations](sagemaker-hyperpod-eks-resiliency-config-tips.md)

# Health Monitoring System
<a name="sagemaker-hyperpod-eks-resiliency-health-monitoring-agent"></a>

The SageMaker HyperPod health monitoring system includes two components:

1. Monitoring agents installed on your nodes, which include the Health Monitoring Agent (HMA) that serves as an on-host health monitor, and a set of out-of-node health monitors.

1. The Node Recovery System managed by SageMaker HyperPod. The health monitoring system continuously monitors node health through the monitoring agents and, when a fault is detected, takes action automatically through the Node Recovery System.

![\[This image illustrates how the health monitoring system integrates with a HyperPod cluster.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-resilience-event.png)


## Health checks done by the SageMaker HyperPod health-monitoring agent
<a name="sagemaker-hyperpod-eks-resiliency-health-monitoring-agent-list-of-checks"></a>

The SageMaker HyperPod health-monitoring agent checks the following.

**NVIDIA GPUs**
+ [DCGM policy violation notifications](https://docs.nvidia.com/datacenter/dcgm/3.0/user-guide/feature-overview.html#notifications)
+ Errors in the `nvidia-smi` output
+ Various errors in the logs generated by the Amazon Elastic Compute Cloud (EC2) platform
+ GPU Count validation — if there’s a mismatch between the expected number of GPUs in a particular instance type (for example: 8 GPUs in ml.p5.48xlarge instance type) and the count returned by `nvidia-smi`, then HMA reboots the node 

**AWS Trainium**
+ Errors in the output from the [AWS Neuron monitor](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html)
+ Outputs generated by the Neuron node problem detector (For more information about the AWS Neuron node problem detector, see [Node problem detection and recovery for AWS Neuron nodes within Amazon EKS clusters](https://aws.amazon.com/blogs/machine-learning/node-problem-detection-and-recovery-for-aws-neuron-nodes-within-amazon-eks-clusters/).)
+ Various errors in the logs generated by the Amazon EC2 platform
+ Neuron Device Count validation — if there’s a mismatch between the expected number of Neuron devices in a particular instance type and the count returned by `neuron-ls`, then HMA reboots the node

The above checks are passive background health checks that HyperPod runs continuously on your nodes. In addition to these checks, HyperPod also runs deep (or active) health checks during the creation and update of HyperPod clusters. Learn more about [Deep health checks](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-resiliency-deep-health-checks.html).

## Fault Detection
<a name="sagemaker-hyperpod-eks-resiliency-health-monitoring-fault-detection"></a>

When SageMaker HyperPod detects a fault, it implements a four-part response:

1. **Node Labels**

   1. Health Status: `sagemaker.amazonaws.com/node-health-status`

   1. Fault Type: `sagemaker.amazonaws.com/fault-types` label for high-level categorization

   1. Fault Reason: `sagemaker.amazonaws.com/fault-reasons` label for detailed fault information

1. **Node Taint**

   1. `sagemaker.amazonaws.com/node-health-status=Unschedulable:NoSchedule`

1. **Node Annotation**

   1. Fault details: `sagemaker.amazonaws.com/fault-details`

   1. Records up to 20 faults with timestamps that occurred on the node

1. **Node Conditions**([Kubernetes Node Condition](https://kubernetes.io/docs/reference/node/node-status/#condition))

   1. Reflects current health status in node conditions:
      + Type: Same as fault type
      + Status: `True`
      + Reason: Same as fault reason
      + LastTransitionTime: Fault occurrence time

![\[This image illustrates how the health monitoring system works when it detects a fault.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-resilience-workflow.png)


## Logs generated by the SageMaker HyperPod health-monitoring agent
<a name="sagemaker-hyperpod-eks-resiliency-health-monitoring-agent-health-check-results"></a>

The SageMaker HyperPod health-monitoring agent is an out-of-the-box health check feature that runs continuously on all HyperPod clusters. The agent publishes detected health events on GPU or Trainium (Trn) instances to CloudWatch under the cluster log group `/aws/sagemaker/Clusters/`.

The detection logs from the HyperPod health monitoring agent are created as separate log streams named `SagemakerHealthMonitoringAgent` for each node. You can query the detection logs using CloudWatch Logs Insights as follows.

```
fields @timestamp, @message
| filter @message like /HealthMonitoringAgentDetectionEvent/
```

This should return an output similar to the following.

```
2024-08-21T11:35:35.532-07:00
    {"level":"info","ts":"2024-08-21T18:35:35Z","msg":"NPD caught event: %v","details: ":{"severity":"warn","timestamp":"2024-08-22T20:59:29Z","reason":"XidHardwareFailure","message":"Node condition NvidiaErrorReboot is now: True, reason: XidHardwareFailure, message: \"NVRM: Xid (PCI:0000:b9:00): 71, pid=<unknown>, name=<unknown>, NVLink: fatal error detected on link 6(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)\""},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
2024-08-21T11:35:35.532-07:00
    {"level":"info","ts":"2024-08-21T18:35:35Z","msg":"NPD caught event: %v","details: ":{"severity":"warn","timestamp":"2024-08-22T20:59:29Z","reason":"XidHardwareFailure","message":"Node condition NvidiaErrorReboot is now: True, reason: XidHardwareFailure, message: \"NVRM: Xid (PCI:0000:b9:00): 71, pid=<unknown>, name=<unknown>, NVLink: fatal error detected on link 6(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)\""},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
```
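Once exported from CloudWatch, each detection record is plain JSON and can be post-processed. A minimal sketch based on the sample entry above (the field names, including the oddly spelled `"details: "` key, mirror the sample):

```python
import json

# Minimal sketch of post-processing a HealthMonitoringAgentDetectionEvent
# record exported from CloudWatch. The field names, including the oddly
# spelled "details: " key, mirror the sample log entry above.
record = json.loads(
    '{"level":"info","ts":"2024-08-21T18:35:35Z","msg":"NPD caught event: %v",'
    '"details: ":{"severity":"warn","timestamp":"2024-08-22T20:59:29Z",'
    '"reason":"XidHardwareFailure","message":"NVLink: fatal error detected on link 6"},'
    '"HealthMonitoringAgentDetectionEvent":"HealthEvent"}'
)

details = record["details: "]  # note the trailing ": " in the key name
print(details["reason"], details["severity"])  # XidHardwareFailure warn
```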

# Basic health checks
<a name="sagemaker-hyperpod-eks-resiliency-basic-health-check"></a>

SageMaker HyperPod performs a set of *basic health checks* on cluster instances during the creation and update of HyperPod clusters. These basic health checks are orchestrator-agnostic, so these checks are applicable regardless of the underlying orchestration platforms supported by SageMaker HyperPod (Amazon EKS or Slurm).

The basic health checks monitor cluster instances for issues related to devices such as accelerators (GPU and Trainium cores) and network devices (Elastic Fabric Adapter, or EFA). To find the list of basic cluster health checks, see [Cluster health checks](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm.html#sagemaker-hyperpod-resiliency-slurm-cluster-health-check).

# Deep health checks
<a name="sagemaker-hyperpod-eks-resiliency-deep-health-checks"></a>

SageMaker HyperPod performs *deep health checks* on cluster instances during the creation and update of HyperPod clusters. You can also request deep health checks on-demand for a SageMaker HyperPod cluster using [StartClusterHealthCheck](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_StartClusterHealthCheck.html) API. The deep health checks ensure the reliability and stability of the SageMaker HyperPod clusters by testing the underlying hardware and infrastructure components. This proactive approach helps identify and mitigate potential issues early in the cluster lifecycle.

## List of deep health checks done by SageMaker HyperPod
<a name="sagemaker-hyperpod-eks-resiliency-deep-health-checks-list"></a>

SageMaker HyperPod runs the following deep health checks.

**Instance-level deep health checks**


| Category | Utility name | Instance type compatibility | Description | 
| --- | --- | --- | --- | 
| Accelerator | GPU/NVLink count | GPU | Verifies GPU/NVLink counts. | 
| Accelerator | [DCGM diagnostics](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html) level 4 | GPU | Assesses the health and functionality of NVIDIA GPUs by running DCGM (NVIDIA Data Center GPU Manager) diagnostics at level 4, including additional memory tests. | 
| Accelerator | Neuron sysfs | Trainium | For Trainium-powered instances, the health of the Neuron devices is determined by reading counters from [Neuron sysfs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-sysfs-user-guide.html) propagated directly by the Neuron driver. | 
| Accelerator | Neuron hardware check | Trainium | Runs a training workload and verifies the results to test the hardware. | 
| Accelerator | NCCOM local test | Trainium | Evaluates the performance of collective communication operations on single Trainium nodes. | 
| Network | EFA | GPU and Trainium | Runs latency and bandwidth benchmarking on the attached EFA device. | 

**Cluster-level deep health checks**


| Category | Utility name | Instance type compatibility | Description | 
| --- | --- | --- | --- | 
| Accelerator | NCCL test | GPU | Verifies the performance of collective communication operations on multiple NVIDIA GPUs. | 
| Accelerator | NCCOM cluster test | Trainium | Verifies the performance of collective communication operations on multiple Trainium nodes. | 

**Deep health checks with flexible instance groups**  
For instance groups that use `InstanceRequirements` with multiple instance types, deep health checks behave as follows:  
Instance-level deep health checks run only on eligible GPU instance types. CPU instance types within a flexible instance group are skipped.
Cluster-level connectivity tests (such as NCCL AllReduce) run only between instances of the same type within the instance group. This ensures accurate test results that reflect the networking capabilities of each instance type.
If deep health checks are enabled, at least one instance type in the flexible instance group must support deep health checks.

## Logs from the deep health checks
<a name="sagemaker-hyperpod-eks-resiliency-deep-health-checks-log"></a>

The following are example logs from the SageMaker HyperPod deep health checks.

**Cluster-level logs** 

The cluster-level deep health check logs are stored in your CloudWatch log group at `/aws/sagemaker/Clusters/<cluster_name>/<cluster_id>`

The log streams are logged at `DeepHealthCheckResults/<log_stream_id>`.

As an example shown below, the deep health check output logs show the instance ID that failed the checks with the cause of the failure.

```
{
    "level": "error",
    "ts": "2024-06-18T21:15:22Z",
    "msg": "Encountered FaultyInstance. Replace the Instance. Region: us-west-2, InstanceType: p4d.24xlarge. ERROR:Bandwidth has less than threshold: Expected minimum threshold :80,NCCL Test output Bw: 30"
}
```

**Instance-level logs** 

The instance-level deep health check logs are stored at `/var/log/aws/clusters/sagemaker-deep-health-check.log` on each node. SSH into the node and open the log file by running the following command.

```
cat /var/log/aws/clusters/sagemaker-deep-health-check.log
```

The following is an example output of the hardware stress, [NVIDIA DCGM](https://developer.nvidia.com/dcgm) stress, and EFA connectivity test.

```
# Hardware Stress Test output

2024-08-20T21:53:58Z info Executing Hardware stress check with command: stress-ng, and args: [--cpu 32 --vm 2 --hdd 1 --fork 8 --switch 4 --timeout 60 --metrics]

2024-08-20T21:54:58Z info stress-ng success

2024-08-20T21:54:58Z    info    GpuPci Count check success

# DCGM Stress Test

2024-08-20T22:25:02Z    info    DCGM diagnostic health summary: dcgmCheckLevel: 0 dcgmVersion: 3.3.7 gpuDriverVersion: 535.183.01, gpuDeviceIds: [2237] replacementRequired: false rebootRequired:false

# EFA Loopback Test

2024-08-20T22:26:28Z    info    EFA Loopback check passed for device: rdmap0s29 . Output summary is MaxBw: 58.590000, AvgBw: 32.420000, MaxTypicalLat: 30.870000, MinTypicalLat: 20.080000, AvgLat: 21.630000
```

The following is an example output of the NCCL connectivity test.

```
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong

#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       

           8             2     float     sum      -1    353.9    0.00    0.00      0    304.2    0.00    0.00      0
          16             4     float     sum      -1    352.8    0.00    0.00      0    422.9    0.00    0.00      0
          32             8     float     sum      -1    520.0    0.00    0.00      0    480.3    0.00    0.00      0
          64            16     float     sum      -1    563.0    0.00    0.00      0    416.1    0.00    0.00      0
         128            32     float     sum      -1    245.1    0.00    0.00      0    308.4    0.00    0.00      0
         256            64     float     sum      -1    310.8    0.00    0.00      0    304.9    0.00    0.00      0
         512           128     float     sum      -1    304.9    0.00    0.00      0    300.8    0.00    0.00      0
        1024           256     float     sum      -1    509.3    0.00    0.00      0    495.4    0.00    0.00      0
        2048           512     float     sum      -1    530.3    0.00    0.00      0    420.0    0.00    0.00      0
        4096          1024     float     sum      -1    391.2    0.01    0.01      0    384.5    0.01    0.01      0
        8192          2048     float     sum      -1    328.5    0.02    0.02      0    253.2    0.03    0.03      0
       16384          4096     float     sum      -1    497.6    0.03    0.03      0    490.9    0.03    0.03      0
       32768          8192     float     sum      -1    496.7    0.07    0.07      0    425.0    0.08    0.08      0
       65536         16384     float     sum      -1    448.0    0.15    0.15      0    501.0    0.13    0.13      0
      131072         32768     float     sum      -1    577.4    0.23    0.23      0    593.4    0.22    0.22      0
      262144         65536     float     sum      -1    757.8    0.35    0.35      0    721.6    0.36    0.36      0
      524288        131072     float     sum      -1   1057.1    0.50    0.50      0   1019.1    0.51    0.51      0
     1048576        262144     float     sum      -1   1460.5    0.72    0.72      0   1435.6    0.73    0.73      0
     2097152        524288     float     sum      -1   2450.6    0.86    0.86      0   2583.1    0.81    0.81      0
     4194304       1048576     float     sum      -1   4344.5    0.97    0.97      0   4419.3    0.95    0.95      0
     8388608       2097152     float     sum      -1   8176.5    1.03    1.03      0   8197.8    1.02    1.02      0
    16777216       4194304     float     sum      -1    15312    1.10    1.10      0    15426    1.09    1.09      0
    33554432       8388608     float     sum      -1    30149    1.11    1.11      0    29941    1.12    1.12      0
    67108864      16777216     float     sum      -1    57819    1.16    1.16      0    58635    1.14    1.14      0
   134217728      33554432     float     sum      -1   115699    1.16    1.16      0   115331    1.16    1.16      0
   268435456      67108864     float     sum      -1   227507    1.18    1.18      0   228047    1.18    1.18      0
   536870912     134217728     float     sum      -1   453751    1.18    1.18      0   456595    1.18    1.18      0
  1073741824     268435456     float     sum      -1   911719    1.18    1.18      0   911808    1.18    1.18      0
  2147483648     536870912     float     sum      -1  1804971    1.19    1.19      0  1806895    1.19    1.19      0

2024-08-20T16:22:43.831-07:00

# Out of bounds values : 0 OK

2024-08-20T16:22:43.831-07:00

# Avg bus bandwidth    : 0.488398 

2024-08-20T23:22:43Z    info    Nccl test successful. Summary: NcclMaxAlgoBw: 1.190000, NcclAvgAlgoBw: 0.488398, NcclThresholdAlgoBw: 1.180000, NcclOutOfBoundError: OK, NcclOperations: all_reduce_perf, NcclTotalDevices: 2, NcclNodes: 2, NcclClusterMessage:
```
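The final summary line is machine-parseable. As a sketch, the bandwidth figures can be extracted and the threshold comparison the log implies can be reproduced; the parsing helper is illustrative, not an AWS tool:

```python
import re

# Sketch: extract the bandwidth figures from the NCCL summary line shown
# above and reproduce the pass/fail comparison the log implies (max achieved
# algorithm bandwidth against the threshold).
summary = ("Nccl test successful. Summary: NcclMaxAlgoBw: 1.190000, "
           "NcclAvgAlgoBw: 0.488398, NcclThresholdAlgoBw: 1.180000, "
           "NcclOutOfBoundError: OK, NcclTotalDevices: 2, NcclNodes: 2")

fields = dict(re.findall(r"(Nccl\w+): ([\d.]+|OK)", summary))
passed = float(fields["NcclMaxAlgoBw"]) >= float(fields["NcclThresholdAlgoBw"])
print(fields["NcclAvgAlgoBw"], passed)  # 0.488398 True
```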

# Automatic node recovery
<a name="sagemaker-hyperpod-eks-resiliency-node-recovery"></a>

During cluster creation or update, cluster admin users can select the node (instance) recovery option between `Automatic` (Recommended) and `None` at the cluster level. If set to `Automatic`, SageMaker HyperPod reboots or replaces faulty nodes automatically. 

**Important**  
We recommend setting the `Automatic` option.

Automatic node recovery runs when issues are found by the health-monitoring agent, basic health checks, or deep health checks. If set to `None`, the health monitoring agent labels the instances when a fault is detected, but does not automatically initiate any repair or recovery actions on the affected nodes. This option is not recommended.

# Resilience-related Kubernetes labels by SageMaker HyperPod
<a name="sagemaker-hyperpod-eks-resiliency-node-labels"></a>

*Labels* are key-value pairs that are attached to [Kubernetes objects](https://kubernetes.io/docs/concepts/overview/working-with-objects/#kubernetes-objects). SageMaker HyperPod introduces the following labels for the health checks it provides.

## Node health status labels
<a name="sagemaker-hyperpod-eks-resiliency-node-labels-health-status"></a>

The `node-health-status` labels represent the health status of a node and can be used as a node selector filter to target healthy nodes.


| Label | Description | 
| --- | --- | 
| sagemaker.amazonaws.com/node-health-status: Schedulable | The node has passed basic health checks and is available for running workloads. This health check is the same as the [currently available SageMaker HyperPod resiliency features for Slurm clusters](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm.html). | 
| sagemaker.amazonaws.com/node-health-status: Unschedulable | The node is running deep health checks and is not available for running workloads. | 
| sagemaker.amazonaws.com/node-health-status: UnschedulablePendingReplacement | The node has failed deep health checks or health-monitoring agent checks and requires a replacement. If automatic node recovery is enabled, the node will be automatically replaced by SageMaker HyperPod. | 
| sagemaker.amazonaws.com/node-health-status: UnschedulablePendingReboot | The node has failed deep health checks or health-monitoring agent checks and requires a reboot. If automatic node recovery is enabled, the node will be automatically rebooted by SageMaker HyperPod. | 
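
The statuses in the table above reduce to a simple scheduling rule, which can be sketched as:

```python
# Minimal sketch of the scheduling rule implied by the table above: only
# nodes labeled Schedulable should receive workloads; every other status
# means deep health checks are in progress or recovery is pending.
LABEL = "sagemaker.amazonaws.com/node-health-status"

def is_schedulable(node_labels):
    return node_labels.get(LABEL) == "Schedulable"

nodes = {
    "node-a": {LABEL: "Schedulable"},
    "node-b": {LABEL: "UnschedulablePendingReplacement"},
}
healthy = [name for name, labels in nodes.items() if is_schedulable(labels)]
print(healthy)  # ['node-a']
```

The same label key is what the taint described under Fault Detection is keyed on, so a node selector on this label and the taint work together.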

## Deep health check labels
<a name="sagemaker-hyperpod-eks-resiliency-node-labels-deep-health-check"></a>

The `deep-health-check-status` labels represent the progress of deep health checks on a specific node. They help Kubernetes users quickly filter nodes by the progress of the overall deep health checks.


| Label | Description | 
| --- | --- | 
| sagemaker.amazonaws.com/deep-health-check-status: InProgress | The node is running deep health checks and is not available for running workloads. | 
| sagemaker.amazonaws.com/deep-health-check-status: Passed | The node has successfully completed deep health checks and health-monitoring agent checks, and is available for running workloads. | 
| sagemaker.amazonaws.com/deep-health-check-status: Failed | The node has failed deep health checks or health-monitoring agent checks and requires a reboot or replacement. If automatic node recovery is enabled, the node will be automatically rebooted or replaced by SageMaker HyperPod. | 

## Fault type and reason labels
<a name="sagemaker-hyperpod-eks-resiliency-node-labels-fault-type-and-reason"></a>

The following describes the `fault-type` and `fault-reason` labels.
+ `fault-type` labels represent high-level fault categories when health checks fail. These are populated for failures identified during both deep health and health-monitoring agent checks.
+ `fault-reason` labels represent the detailed fault reason associated with a `fault-type`.

## How SageMaker HyperPod labels
<a name="sagemaker-hyperpod-eks-resiliency-node-how-it-labels"></a>

The following topics cover how labeling is done depending on various cases.

**Topics**
+ [When a node is added to a SageMaker HyperPod cluster with deep health check config disabled](#sagemaker-hyperpod-eks-resiliency-node-how-it-labels-when-dhc-is-off)
+ [When a node is added to a SageMaker HyperPod cluster with deep health check config enabled](#sagemaker-hyperpod-eks-resiliency-node-how-it-labels-when-dhc-is-on)
+ [When there are any compute failures on nodes](#sagemaker-hyperpod-eks-resiliency-node-how-it-labels-when-node-fails)

### When a node is added to a SageMaker HyperPod cluster with deep health check config disabled
<a name="sagemaker-hyperpod-eks-resiliency-node-how-it-labels-when-dhc-is-off"></a>

When a new node is added into a cluster, and if deep health check is not enabled for the instance group, SageMaker HyperPod runs the same health checks as the [currently available SageMaker HyperPod health checks for Slurm clusters](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm.html). 

If the health check passes, the nodes will be marked with the following label.

```
sagemaker.amazonaws.com/node-health-status: Schedulable
```

If the health check doesn't pass, the nodes will be terminated and replaced. This behavior is the same as the way SageMaker HyperPod health check works for Slurm clusters. 

### When a node is added to a SageMaker HyperPod cluster with deep health check config enabled
<a name="sagemaker-hyperpod-eks-resiliency-node-how-it-labels-when-dhc-is-on"></a>

When a new node is added to a SageMaker HyperPod cluster, and if the deep health check test is enabled for the instance group, HyperPod first taints the node and starts the deep health check/stress test on the node, which takes about a couple of hours. There are three possible node label outcomes after the deep health check. 

1. When the deep health check test passes

   ```
   sagemaker.amazonaws.com/node-health-status: Schedulable
   ```

1. When the deep health check test fails, and the instance needs to be replaced

   ```
   sagemaker.amazonaws.com/node-health-status: UnschedulablePendingReplacement
   ```

1. When the deep health check test fails, and the instance needs to be rebooted to rerun the deep health check

   ```
   sagemaker.amazonaws.com/node-health-status: UnschedulablePendingReboot
   ```

If an instance fails the deep health check test, the instance will always be replaced. If the deep health check tests succeed, the taint on the node is removed.

### When there are any compute failures on nodes
<a name="sagemaker-hyperpod-eks-resiliency-node-how-it-labels-when-node-fails"></a>

The SageMaker HyperPod health monitor agent also continuously monitors the health status of each node. When it detects any failures (such as GPU failure and driver crash), the agent marks the node with one of the following labels.

1. When the node is unhealthy and needs to be replaced

   ```
   sagemaker.amazonaws.com/node-health-status: UnschedulablePendingReplacement
   ```

1. When the node is unhealthy and needs to be rebooted

   ```
   sagemaker.amazonaws.com/node-health-status: UnschedulablePendingReboot
   ```

The health monitor agent also taints the node when it detects any node health issues.

# Manually quarantine, replace, or reboot a node
<a name="sagemaker-hyperpod-eks-resiliency-manual"></a>

Learn how to manually quarantine, replace, and reboot a faulty node in SageMaker HyperPod clusters orchestrated with Amazon EKS.

**To quarantine a node and force delete a training pod**

```
kubectl cordon <node-name>
```

After quarantining the node, force delete the pod. This is useful when a pod is stuck in termination for more than 30 minutes, or when `kubectl describe pod` shows `Node is not ready` in Events.

```
kubectl delete pods <pod-name> --grace-period=0 --force
```

SageMaker HyperPod offers two methods for manual node recovery. The preferred approach is to use the SageMaker HyperPod reboot and replace APIs, which provide a faster and more transparent recovery process that works across all orchestrators. Alternatively, you can use kubectl commands to label nodes for reboot and replace operations. Both methods activate the same SageMaker HyperPod recovery processes.

**To reboot a node using the Reboot API**

To reboot a node, use the `BatchRebootClusterNodes` API. The following example runs the reboot operation on two instances of a cluster using the AWS Command Line Interface:

```
aws sagemaker batch-reboot-cluster-nodes \
    --cluster-name arn:aws:sagemaker:ap-northeast-1:123456789:cluster/test-cluster \
    --node-ids i-0123456789abcdef0 i-0fedcba9876543210
```

**To replace a node using the Replace API**

To replace a node, use the `BatchReplaceClusterNodes` API. The following example runs the replace operation on two instances of a cluster using the AWS Command Line Interface:

```
aws sagemaker batch-replace-cluster-nodes \
    --cluster-name arn:aws:sagemaker:ap-northeast-1:123456789:cluster/test-cluster \
    --node-ids i-0123456789abcdef0 i-0fedcba9876543210
```

**Karpenter-managed clusters**  
For SageMaker HyperPod clusters using Karpenter for node provisioning, the `BatchReplaceClusterNodes` API does not guarantee that a replacement node will be created. The specified node *will* be terminated, but replacement depends on Karpenter's pod-demand-based provisioning model. Karpenter only creates new nodes when there are pods in a `Pending` state that cannot be scheduled on existing nodes.  
If the workload from the deleted node can be rescheduled onto remaining nodes in the cluster (for example, if those nodes have sufficient capacity), Karpenter does not provision a replacement. To ensure a replacement node is created, verify that your workload configuration (such as pod anti-affinity rules or resource requests) requires a new node for the displaced pods.  
We are aware of this limitation and are actively working on a solution to enforce node replacement when requested through the API.

**To replace a node using kubectl**

Label the node to replace with `sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplacement`, which triggers the SageMaker HyperPod [Automatic node recovery](sagemaker-hyperpod-eks-resiliency-node-recovery.md). Note that you also need to activate automatic node recovery during cluster creation or update.

```
kubectl label nodes <node-name> \
   sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplacement
```

**To reboot a node using kubectl**

Label the node to reboot with `sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReboot`, which triggers the SageMaker HyperPod [Automatic node recovery](sagemaker-hyperpod-eks-resiliency-node-recovery.md). Note that you also need to activate automatic node recovery during cluster creation or update.

```
kubectl label nodes <node-name> \
   sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReboot
```

After the `UnschedulablePendingReplacement` or `UnschedulablePendingReboot` label is applied, you should see the node terminated or rebooted within a few minutes. 
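The two kubectl flows above differ only in the label value. As an illustration, the command can be composed programmatically before it runs; this helper is hypothetical, not an AWS tool:

```python
# Hypothetical helper (not an AWS tool): compose, without executing, the
# kubectl label command used by the manual reboot and replace flows above,
# so the command can be reviewed or logged before it runs.
LABEL_KEY = "sagemaker.amazonaws.com/node-health-status"
ACTIONS = {
    "replace": "UnschedulablePendingReplacement",
    "reboot": "UnschedulablePendingReboot",
}

def label_command(node_name, action):
    return ["kubectl", "label", "nodes", node_name,
            f"{LABEL_KEY}={ACTIONS[action]}"]

print(" ".join(label_command("node-1", "reboot")))
```

The returned list form can be passed directly to a process runner such as `subprocess.run` once reviewed.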

# Suggested resilience configurations
<a name="sagemaker-hyperpod-eks-resiliency-config-tips"></a>

When deep health checks are enabled, whenever a new instance is added to the HyperPod cluster (either during cluster creation or through automatic node replacement), the new instance goes through the deep health check process (instance-level stress tests), which takes about a couple of hours. The following are suggested resilience configuration combinations for common cases.

1. **Case**: You have additional spare nodes within a cluster as backup resources (not using the full capacity), or you can wait about 2 hours for the deep health check process to get less error-prone instances.

   **Recommendation**: Enable the deep health check config throughout the cluster lifecycle. Node auto-recovery config is enabled by default.

1. **Case**: You don't have additional backup nodes (capacity is fully used for a training workload) and you want to get replacement nodes as soon as possible to resume the training job. 

   **Recommendation**: Enable the deep health check during cluster creation, then turn off the deep health check config after the cluster is created. Node auto-recovery config is enabled by default.

1. **Case**: You don't have additional backup nodes and you don't want to wait for the deep health check process, which takes about a couple of hours (small clusters).

   **Recommendation**: Disable the deep health check config throughout the cluster lifecycle. Node auto-recovery config is enabled by default.

If you want to resume the training job from a failure immediately, make sure that you have additional spare nodes as backup resources in the cluster.

# Spot instances in Amazon SageMaker HyperPod
<a name="sagemaker-hyperpod-spot"></a>

Amazon SageMaker HyperPod supports Amazon EC2 Spot Instances, enabling significant cost savings for fault-tolerant and stateless AI/ML workloads. Use cases include batch inference and training jobs, hyperparameter tuning, and experimental workloads. You can also use Spot Instances to automatically scale your compute capacity when this low-cost capacity is available and scale back to On-Demand capacity when the added Spot capacity is reclaimed.

By default, Spot Instances on HyperPod work with HyperPod’s [continuous provisioning feature](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-scaling-eks.html), which enables SageMaker HyperPod to automatically provision remaining capacity in the background while workloads start immediately on available instances. When node provisioning encounters failures due to capacity constraints or other issues, SageMaker HyperPod automatically retries in the background until clusters reach their desired scale, so your autoscaling operations remain resilient and non-blocking. You can also use Spot Instances with [Karpenter-based](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-autoscaling.html) autoscaling.

**Key capabilities and concepts to consider**
+ Capture up to 90% cost savings compared to On-Demand instances
+ Use Spot Instances for jobs that can handle interruptions and where job start and completion times are flexible
+ When using Karpenter for automatic scaling, you can configure HyperPod to automatically fallback to On-Demand when Spot capacity is interrupted or unavailable
+ Access a wide range of CPU, GPU, and accelerator instance types supported by HyperPod
+ Capacity availability depends on supply from EC2 and varies by region and instance type
+ Assess the likelihood of obtaining desired instances, or of being interrupted, with tools such as the EC2 [Spot Instance Advisor](https://aws.amazon.com/ec2/spot/instance-advisor/)
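For example, in addition to the Spot Instance Advisor, you can inspect recent Spot pricing for an instance type from the CLI (the instance type and Region here are illustrative):

```
# List current Spot prices for c5.2xlarge Linux instances in us-west-2, by AZ
aws ec2 describe-spot-price-history \
  --region us-west-2 \
  --instance-types c5.2xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --query 'SpotPriceHistory[*].[AvailabilityZone,SpotPrice]' \
  --output table
```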

## Getting started
<a name="sagemaker-hyperpod-spot-instance-getstart"></a>

### Prerequisites
<a name="sagemaker-hyperpod-spot-instance-getstart-prereq"></a>

Before you begin, ensure you have:

#### AWS CLI installed and configured
<a name="sagemaker-hyperpod-spot-instance-getstart-prereq-cli"></a>

Set up your AWS credentials and region:

```
aws configure
```

Refer to the [AWS credentials documentation](https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html) for detailed instructions.
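After configuring, you can confirm that your credentials resolve to the expected account:

```
# Print the account ID and ARN that the CLI will use
aws sts get-caller-identity
```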

#### IAM Role for SageMaker HyperPod execution
<a name="sagemaker-hyperpod-spot-instance-getstart-prereq-iam"></a>

To update the cluster, you must first create [AWS Identity and Access Management](https://aws.amazon.com/iam/) (IAM) permissions for Karpenter. For instructions, see [Create an IAM role for HyperPod autoscaling with Karpenter](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-autoscaling-iam.html).

#### VPC and EKS Cluster Setup
<a name="sagemaker-hyperpod-spot-instance-getstart-prereq-cluster"></a>

**2.1 Create VPC and EKS Cluster**

Follow the [HyperPod EKS setup guide](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-install-packages-using-helm-chart.html) to:

1. Create a VPC with subnets in multiple Availability Zones

1. Create an EKS cluster

1. Install [required dependencies](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-install-packages-using-helm-chart.html) using Helm charts

**2.2 Set Environment Variables**

```
export EKS_CLUSTER_ARN="arn:aws:eks:REGION:ACCOUNT_ID:cluster/CLUSTER_NAME"
export EXECUTION_ROLE="arn:aws:iam::ACCOUNT_ID:role/SageMakerExecutionRole"
export BUCKET_NAME="your-s3-bucket-name"
export SECURITY_GROUP="sg-xxxxx"
export SUBNET="subnet-xxxxx"
export SUBNET1="subnet-xxxxx"
export SUBNET2="subnet-xxxxx"
export SUBNET3="subnet-xxxxx"
```

#### Service quotas for Spot Instances
<a name="sagemaker-hyperpod-spot-instance-getstart-prereq-quota"></a>

Verify that you have the required quotas for the instances that you will create in the SageMaker HyperPod cluster. To review your quotas, on the Service Quotas console, choose AWS services in the navigation pane, then choose SageMaker. For example, the following screenshot shows the available quota for c5 instances.

![\[An image containing cost region information.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/Screenshot-cluster-quota.png)


#### Check Spot Availability
<a name="sagemaker-hyperpod-spot-instance-getstart-prereq-availability"></a>

Before creating Spot instance groups, check availability in different Availability Zones:

```
aws ec2 get-spot-placement-scores \
  --region us-west-2 \
  --instance-types c5.2xlarge \
  --target-capacity 10 \
  --single-availability-zone \
  --region-names us-west-2
```

**Tip**: Target Availability Zones with higher placement scores for better availability. You can also check the Spot Instance Advisor and EC2 Spot pricing. Select an Availability Zone with a better availability score and configure the instance group with the associated subnet to launch instances in that AZ.

### Creating an Instance Group (No Autoscaling)
<a name="sagemaker-hyperpod-spot-instance-getstart-create"></a>

**CreateCluster (Spot)**

```
aws sagemaker create-cluster \
    --cluster-name clusterNameHere \
    --orchestrator 'Eks={ClusterArn='$EKS_CLUSTER_ARN'}' \
    --node-provisioning-mode "Continuous" \
    --cluster-role 'arn:aws:iam::YOUR-ACCOUNT-ID:role/SageMakerHyperPodRole' \
    --instance-groups '[{
        "InstanceGroupName": "auto-spot-c5-2x-az1",
        "InstanceType": "ml.c5.2xlarge",
        "InstanceCount": 2,
        "CapacityRequirements": { "Spot": {} },
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://'$BUCKET_NAME'",
            "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET1'"]
        }
    }]' \
    --vpc-config '{
        "SecurityGroupIds": ["'$SECURITY_GROUP'"],
        "Subnets": ["'$SUBNET'"]
    }'
```

**Update Cluster (Spot + On-Demand)**

```
aws sagemaker update-cluster \
   --cluster-name "my-cluster" \
   --instance-groups '[{
        "InstanceGroupName": "auto-spot-c5-x-az3",
        "InstanceType": "ml.c5.xlarge",
        "InstanceCount": 2,
        "CapacityRequirements": { "Spot": {} },
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://'$BUCKET_NAME'",
            "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET3'"]
        }
    },
    {
        "InstanceGroupName": "auto-spot-c5-2x-az2",
        "InstanceType": "ml.c5.2xlarge",
        "InstanceCount": 2,
        "CapacityRequirements": { "Spot": {} },
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://'$BUCKET_NAME'",
            "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET2'"]
         }
    },
    {   
        "InstanceGroupName": "auto-ondemand-c5-2x-az1",
        "InstanceType": "ml.c5.2xlarge",
        "InstanceCount": 2,
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://'$BUCKET_NAME'",
            "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET1'"]
         }
    }]'
```

`CapacityRequirements` cannot be modified once an Instance Group is created.

**Describe Cluster**

```
aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --region us-west-2
```

```
## Sample Response
{
    "ClusterName": "my-cluster",
    "InstanceGroups": [
        {
            "InstanceGroupName": "ml.c5.2xlarge",
            "InstanceType": "ml.c5.2xlarge",
            "InstanceCount": 5,
            "CurrentCount": 3,
            "CapacityRequirements": { "Spot": {} },
            "ExecutionRole": "arn:aws:iam::account:role/SageMakerExecutionRole",
            "InstanceStorageConfigs": [...],
            "OverrideVpcConfig": {...}
        }
        // Other IGs
    ]
}
```

**DescribeClusterNode**

```
aws sagemaker describe-cluster-node --cluster-name $HP_CLUSTER_NAME --node-id i-1234567890abcdef1 --region us-west-2
```

```
## Sample Response
{
  "NodeDetails": {
    "InstanceId": "i-1234567890abcdef1",
    "InstanceGroupName": "ml.c5.2xlarge",
    "CapacityType": "Spot",
    "InstanceStatus": {...}
  }
}
```

### Using Console
<a name="sagemaker-hyperpod-spot-instance-getstart-console"></a>

#### Create and configure a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-spot-instance-getstart-console-create"></a>

To begin, launch and configure your SageMaker HyperPod EKS cluster and verify that continuous provisioning mode is enabled on cluster creation. Complete the following steps:

1. On the SageMaker AI console, choose HyperPod clusters in the navigation pane.

1. Choose Create HyperPod cluster and Orchestrated on Amazon EKS.

1. For Setup options, select Custom setup.

1. For Name, enter a name.

1. For Instance recovery, select Automatic.

1. For Instance provisioning mode, select Use continuous provisioning.

1. For Capacity type, select Spot.

1. Choose Submit.

The following screenshot shows the cluster creation flow in the console:

![\[An image containing the creation cluster flow.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/Screenshot-create-cluster.png)


This setup creates the necessary configuration, such as the virtual private cloud (VPC), subnets, security groups, and EKS cluster, and installs operators in the cluster. You can also provide existing resources, such as an EKS cluster, if you want to use an existing cluster instead of creating a new one. This setup takes around 20 minutes.
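After you submit, you can poll the cluster status from the CLI until it reaches `InService` (the cluster name here is illustrative):

```
# Check the overall cluster status (for example Creating, InService, or Failed)
aws sagemaker describe-cluster \
  --cluster-name my-cluster \
  --query 'ClusterStatus' \
  --output text
```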

#### Adding a new Spot Instance Group to the same cluster
<a name="sagemaker-hyperpod-spot-instance-getstart-console-add"></a>

To add a Spot instance group to your existing HyperPod EKS cluster, complete the following steps:

1. On the SageMaker AI console, choose HyperPod clusters in the navigation pane.

1. Select an existing HyperPod cluster with Amazon EKS orchestration (ensure that continuous provisioning is enabled).

1. Choose Edit.

1. On the Edit Cluster page, choose Create instance group.

1. In the instance group configuration, for Capacity type, select Spot instance.

1. Choose Create instance group.

1. Choose Submit.

The following screenshot shows the instance group creation flow in the console:

![\[An image containing the instance group creation flow.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/Screenshot-instance-group.png)


### Using CloudFormation
<a name="sagemaker-hyperpod-spot-instance-getstart-cfn"></a>

```
Resources:
  TestCluster:
    Type: AWS::SageMaker::Cluster
    Properties:
      ClusterName: "SampleCluster"
      InstanceGroups:
        - InstanceGroupName: group1
          InstanceType: ml.c5.2xlarge
          InstanceCount: 1
          LifeCycleConfig:
            SourceS3Uri: "s3://'$BUCKET_NAME'"
            OnCreate: "on_create_noop.sh"
          ExecutionRole: "'$EXECUTION_ROLE'"
          ThreadsPerCore: 1
          CapacityRequirements:
            Spot: {}
      VpcConfig:
        Subnets:
          - "'$SUBNET1'"
        SecurityGroupIds:
          - "'$SECURITY_GROUP'"
      Orchestrator:
        Eks:
          ClusterArn:
            '$EKS_CLUSTER_ARN'
      NodeProvisioningMode: "Continuous"
      NodeRecovery: "Automatic"
```

For details, see [https://docs.aws.amazon.com/sagemaker/latest/dg/smcluster-getting-started-eks-console-create-cluster-cfn.html](https://docs.aws.amazon.com/sagemaker/latest/dg/smcluster-getting-started-eks-console-create-cluster-cfn.html).
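A template like the preceding one can be deployed with the AWS CLI; for example (the file and stack names are illustrative):

```
# Deploy the HyperPod cluster template as a CloudFormation stack
aws cloudformation deploy \
  --template-file hyperpod-cluster.yaml \
  --stack-name hyperpod-spot-cluster \
  --capabilities CAPABILITY_NAMED_IAM
```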

### Karpenter based Autoscaling
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter"></a>

#### Create cluster role
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-role"></a>

**Step 1: Navigate to IAM Console**

1. Go to the **AWS Management Console** → **IAM** service

1. Click **Roles** in the left sidebar

1. Click **Create role**

**Step 2: Set up Trust Policy**

1. Select Custom trust policy (instead of AWS service)

1. Replace the default JSON with this trust policy:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "hyperpod.sagemaker.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```

Choose **Next**.

**Step 3: Create Custom Permissions Policy**

Since these are specific SageMaker permissions, you'll need to create a custom policy:

1. Click **Create policy** (opens new tab)

1. Click the **JSON** tab

1. Enter this policy:

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Action": [
           "sagemaker:BatchAddClusterNodes",
           "sagemaker:BatchDeleteClusterNodes"
         ],
         "Resource": "*"
       }
     ]
   }
   ```

1. Click **Next**

1. Give it a name like `SageMakerHyperPodRolePolicy`

1. Click **Create policy**

**Step 4: Attach the Policy to Role**

1. Go back to your role creation tab

1. Refresh the policies list

1. Search for and select your newly created policy

1. Click **Next**

**Step 5: Name and Create Role**

1. Enter a role name (e.g., `SageMakerHyperPodRole`)

1. Add a description if desired

1. Review the trust policy and permissions

1. Click **Create role**

#### Verification
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-verify"></a>

After creation, you can verify by:
+ Checking the Trust relationships tab shows the hyperpod service
+ Checking the Permissions tab shows your custom policy
+ The role ARN will be available for use with HyperPod

The role ARN format will be:

```
arn:aws:iam::YOUR-ACCOUNT-ID:role/SageMakerHyperPodRole
```
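Optionally, confirm the role and its trust policy from the CLI (using the example role name from the previous steps):

```
# Show the role ARN and its trust policy document
aws iam get-role \
  --role-name SageMakerHyperPodRole \
  --query 'Role.{Arn:Arn,Trust:AssumeRolePolicyDocument}'
```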

#### Create Cluster with AutoScaling
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-create-cluster"></a>

For better availability, create instance groups in multiple Availability Zones by configuring subnets. You can also include On-Demand instance groups for fallback. For autoscaling, keep the instance count at 0.

```
aws sagemaker create-cluster \
    --cluster-name clusterNameHere \
    --orchestrator 'Eks={ClusterArn='$EKS_CLUSTER_ARN'}' \
    --node-provisioning-mode "Continuous" \
    --cluster-role 'arn:aws:iam::YOUR-ACCOUNT-ID:role/SageMakerHyperPodRole' \
    --instance-groups '[{
        "InstanceGroupName": "auto-spot-c5-2x-az1",
        "InstanceType": "ml.c5.2xlarge",
        "InstanceCount": 0,
        "CapacityRequirements": { "Spot": {} },
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://'$BUCKET_NAME'",
            "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET1'"]
        }
    }]' \
    --vpc-config '{
        "SecurityGroupIds": ["'$SECURITY_GROUP'"],
        "Subnets": ["'$SUBNET'"]
    }' \
    --auto-scaling '{
        "Mode": "Enable",
        "AutoScalerType": "Karpenter"
    }'
```

#### Update Cluster (Spot + On-Demand)
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-update-cluster"></a>

```
aws sagemaker update-cluster \
   --cluster-name "my-cluster" \
   --instance-groups '[{
        "InstanceGroupName": "auto-spot-c5-x-az3",
        "InstanceType": "ml.c5.xlarge",
        "InstanceCount": 2,
        "CapacityRequirements": { "Spot": {} },
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://'$BUCKET_NAME'",
            "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET3'"]
        }
    },
    {
        "InstanceGroupName": "auto-spot-c5-2x-az2",
        "InstanceType": "ml.c5.2xlarge",
        "InstanceCount": 2,
        "CapacityRequirements": { "Spot": {} },
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://'$BUCKET_NAME'",
            "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET2'"]
         }
    },
    {   
        "InstanceGroupName": "auto-ondemand-c5-2x-az1",
        "InstanceType": "ml.c5.2xlarge",
        "InstanceCount": 2,
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://'$BUCKET_NAME'",
            "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET1'"]
         }
    }]'
```

#### Create HyperpodNodeClass
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-create-class"></a>

`HyperpodNodeClass` is a custom resource that maps to pre-created instance groups in SageMaker HyperPod, defining constraints around which instance types and Availability Zones are supported for Karpenter’s auto scaling decisions. To use `HyperpodNodeClass`, simply specify the names of the `InstanceGroups` of your SageMaker HyperPod cluster that you want to use as the source for the AWS compute resources to use to scale up your pods in your NodePools. The `HyperpodNodeClass` name that you use here is carried over to the NodePool in the next section where you reference it. This tells the NodePool which `HyperpodNodeClass` to draw resources from. To create a `HyperpodNodeClass`, complete the following steps:

1. Create a YAML file (for example, nodeclass.yaml) similar to the following code. Add `InstanceGroup` names that you used at the time of the SageMaker HyperPod cluster creation. You can also add new instance groups to an existing SageMaker HyperPod EKS cluster.

1. Reference the `HyperPodNodeClass` name in your NodePool configuration.

The following is a sample `HyperpodNodeClass` :

```
apiVersion: karpenter.sagemaker.amazonaws.com/v1
kind: HyperpodNodeClass
metadata:
  name: multiazg6
spec:
  instanceGroups:
    # Names of instance groups in the HyperPod cluster. Each instance group must
    # be created before this step can be completed.
    # MaxItems: 10
    - auto-spot-c5-2x-az1
    - auto-spot-c5-2x-az2
    - auto-spot-c5-x-az3
    - auto-ondemand-c5-2x-az1
```

Karpenter prioritizes Spot instance groups over On-Demand instances, using On-Demand as a fallback when specified in the configuration. Instance selection is sorted by EC2 Spot Placement Scores associated with each subnet's availability zone.

**Apply the configuration to your EKS cluster using `kubectl`:**

```
kubectl apply -f nodeclass.yaml
```

The HyperPod cluster must have AutoScaling enabled, and the AutoScaling status must change to `InService`, before the `HyperpodNodeClass` can be applied. The `HyperpodNodeClass` status also shows each instance group's capacity as Spot or On-Demand. For more information and key considerations, see [Autoscaling on SageMaker HyperPod EKS](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-autoscaling.html).

The following is an example `HyperpodNodeClass` with its status:

```
apiVersion: karpenter.sagemaker.amazonaws.com/v1
kind: HyperpodNodeClass
metadata:
  creationTimestamp: "2025-11-30T03:25:04Z"
  name: multiazc6
  uid: ef5609be-15dd-4700-89ea-a3370e023690
spec:
  instanceGroups:
  - spot1
status:
  conditions:
  # true when all IGs in the spec are present in the SageMaker cluster, false otherwise
  - lastTransitionTime: "2025-11-20T03:25:04Z"
    message: ""
    observedGeneration: 3
    reason: InstanceGroupReady
    status: "True"
    type: InstanceGroupReady
  # true if subnets of IGs are discoverable, false otherwise
  - lastTransitionTime: "2025-11-20T03:25:04Z"
    message: ""
    observedGeneration: 3
    reason: SubnetsReady
    status: "True"
    type: SubnetsReady
  # true when all dependent resources are Ready [InstanceGroup, Subnets]
  - lastTransitionTime: "2025-11-30T05:47:55Z"
    message: ""
    observedGeneration: 3
    reason: Ready
    status: "True"
    type: Ready
  instanceGroups:
  - instanceTypes:
    - ml.c5.2xlarge
    name: auto-spot-c5-2x-az2
    subnets:
    - id: subnet-03ecc649db2ff20d2
      zone: us-west-2a
      zoneId: usw2-az2
    capacities: {"Spot": {}}
```

#### Create NodePool
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-create-nodepool"></a>

The NodePool sets constraints on the nodes that can be created by Karpenter and the pods that can run on those nodes. The NodePool can be set to perform various actions, such as: 
+ Define labels and taints to limit the pods that can run on nodes Karpenter creates
+ Limit node creation to certain zones, instance types, CPU architectures, and so on

For more information about NodePool, refer to [NodePools](https://karpenter.sh/docs/concepts/nodepools/). SageMaker HyperPod managed Karpenter supports a limited set of well-known Kubernetes and Karpenter requirements, which are described in this guide.

To create a NodePool, complete the following steps:

Create a YAML file named `nodepool.yaml` with your desired NodePool configuration. The following code is a sample configuration that creates a NodePool. We specify the NodePool to include the ml.c5.2xlarge SageMaker instance type, and we additionally restrict it to one zone. Refer to [NodePools](https://karpenter.sh/docs/concepts/nodepools/) for more customizations.

```
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
 name: gpunodepool
spec:
 template:
   spec:
     nodeClassRef:
      group: karpenter.sagemaker.amazonaws.com
      kind: HyperpodNodeClass
      name: multiazg6
     expireAfter: Never
     requirements:
        - key: node.kubernetes.io/instance-type
          operator: Exists
        # Optional; otherwise Karpenter decides based on the job's resource requirements
        - key: "node.kubernetes.io/instance-type"
          operator: In
          values: ["ml.c5.2xlarge"]
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["us-west-2a"]
```

**Tip**: On an EC2 Spot interruption, HyperPod taints the node to trigger pod eviction. Karpenter's **consolidation** process respects pod disruption budgets and performs normal Kubernetes eviction, but if you set `consolidateAfter: 0`, consolidation can happen **immediately**, leaving very little time for graceful pod eviction. Set it to a nonzero value (up to 2 minutes) to allow graceful pod eviction for any checkpointing needs.
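As a sketch, this setting can be applied to the NodePool created above with a merge patch (the 2-minute value is illustrative):

```
# Give pods up to 2 minutes to checkpoint before consolidation
kubectl patch nodepool gpunodepool --type merge \
  -p '{"spec":{"disruption":{"consolidateAfter":"2m"}}}'
```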

**Apply the NodePool to your cluster:**

```
kubectl apply -f nodepool.yaml
```

**Monitor the NodePool status to ensure the Ready condition in the status is set to True:**

```
kubectl get nodepool gpunodepool -oyaml
```

This example shows how a NodePool can be used to specify the hardware (instance type) and placement (Availability Zone) for pods.

**Launch a simple workload**

The following workload runs a Kubernetes deployment whose pods request 1 CPU and 256 MB of memory per replica. The pods have not been spun up yet.

```
kubectl apply -f https://raw.githubusercontent.com/aws/karpenter-provider-aws/refs/heads/main/examples/workloads/inflate.yaml
```

When we apply this, we can see a deployment and a single node launch in our cluster.

**To scale this component, use the following command:**

```
kubectl scale deployment inflate --replicas 10
```
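After scaling, you can watch Karpenter provision the additional capacity (NodeClaims are Karpenter's record of requested nodes; the `karpenter.sh/nodepool` label identifies which NodePool launched a node):

```
# Watch node claims and the nodes launched for the NodePool
kubectl get nodeclaims
kubectl get nodes -l karpenter.sh/nodepool=gpunodepool
```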

See [https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-autoscaling.html](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-autoscaling.html) for more details.

### Managing Node Interruption
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-interrupt"></a>

Spot Instances can be reclaimed at any time. EC2 provides a best-effort 2-minute interruption notice in most cases, but this notice is not guaranteed. In some situations, EC2 may terminate Spot Instances immediately without any advance warning. HyperPod automatically handles both scenarios:
+ With 2-minute notice: HyperPod attempts graceful pod eviction and controlled capacity replacement when Spot capacity becomes available.
+ Without notice (immediate termination): HyperPod attempts node replacement (when Spot capacity becomes available) without graceful eviction.

**How it works**

When EC2 sends a Spot interruption notice, HyperPod automatically:

1. Detects interruption signal 

1. Taints the node: Prevents new pods from being scheduled on the interrupted instance

1. Gracefully evicts pods: Gives running pods time to complete or checkpoint their work (respecting Kubernetes `terminationGracePeriodSeconds`)

1. Replaces capacity: Automatically attempts to provision the replacement instances (Spot or On-Demand based on availability). 

   Capacity replacement works by automatically provisioning replacement instances. When capacity is not immediately available, the system continues checking until resources become accessible. In the case of non-autoscaling instance groups, HyperPod attempts to scale up within the same instance group until the required capacity becomes available. For Karpenter-based instance groups, Karpenter implements a fallback mechanism to other instance groups configured in the Node class when the primary group cannot accommodate the demand. Additionally, you can configure On-Demand as a fallback option, allowing Karpenter to automatically switch to On-Demand instances if it cannot successfully scale up Spot instance groups.

1. Reschedules workloads: Kubernetes automatically reschedules evicted pods on healthy nodes

### Finding your Usage and Bill
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-bill"></a>

To check your usage and billing for Spot Instances on HyperPod, you can use the AWS Billing and Cost Management console. Go to Billing and Cost Management > Bills.

![\[An image containing cost region information.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/Screenshot-cost-region.png)


**To explore usage and billing on Console, go to Billing and Cost Management > Cost Explorer**

![\[An image containing cost and usage.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/Screenshot-cost-usage.png)


# Using UltraServers in Amazon SageMaker HyperPod
<a name="sagemaker-hyperpod-ultraserver"></a>

SageMaker HyperPod support for UltraServers provides high-performance GPU computing capabilities for AI and machine learning workloads. Built on the NVIDIA GB200 NVL72 architecture, these UltraServers provide NVLink connectivity across 18 GB200 instances in a dual-rack configuration, totaling 72 B200 GPUs. This NVLink fabric allows workloads to use GPU communications that increase usable GPU capacity and addressable memory beyond what's possible with discrete instances, supporting more complex and resource-intensive AI models. The NVLink connectivity is enabled by NVIDIA IMEX technology, which handles the low-level configuration for secure GPU fabric connections across instances within the same rack.

HyperPod simplifies the deployment and management of these GPU clusters through intelligent topology awareness and automated configuration. The platform automatically discovers and labels nodes with their physical location and capacity block information, which supports topology-aware scheduling for distributed workloads. HyperPod abstracts the complex IMEX configuration requirements, allowing you to focus on workload deployment rather than low-level GPU fabric setup. You can choose flexible deployment options including both self-managed nodes and EKS managed node groups. Amazon EKS provides optimized AMIs that include pre-configured NVIDIA drivers, Fabric Manager, IMEX drivers, and all necessary system software for seamless operation.

The integration includes pod placement capabilities that ensure distributed workloads are scheduled optimally across NVL72 domains using standard Kubernetes topology labels. Built-in monitoring and automated recovery features provide operational support, where the AMI health agent detects GPU errors from kernel logs and can automatically remediate issues or replace faulty nodes in managed node groups. This combination of GPU scale, intelligent workload placement, and automated operations helps you focus on your AI/ML innovations rather than infrastructure complexity, while achieving maximum performance from your GPU investments.

To get set up using UltraServers with your HyperPod cluster, see the following steps:

1. Create an [EKS-based HyperPod cluster](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-create-cluster.html). When you choose an instance group, make sure you choose an UltraServer.

1. After your cluster is created, use the following commands to install operational plugins:

   NVIDIA device plugin v0.17.2

   ```
   kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.2/deployments/static/nvidia-device-plugin.yml
   ```

   Node Feature Discovery (NFD) DaemonSet v0.17.3

   ```
   kubectl apply -k "https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.17.3"
   ```

   GPU feature discovery

   ```
   kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.2/deployments/static/gpu-feature-discovery-daemonset.yaml
   ```

You can now run jobs. The following example demonstrates how to create a domain, configure an IMEX domain, and enable channel allocation. These steps also let you create a pod to provision a channel for NCCL communication.

1. Create a resource specification file to use with kubectl.

   ```
   cat <<EOF > imex-channel-injection.yaml
   ---
   apiVersion: resource.nvidia.com/v1beta1
   kind: ComputeDomain
   metadata:
     name: imex-channel-injection
   spec:
     numNodes: 1
     channel:
       resourceClaimTemplate:
         name: imex-channel-0
   ---
   apiVersion: v1
   kind: Pod
   metadata:
     name: imex-channel-injection
   spec:
     affinity:
       nodeAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
           nodeSelectorTerms:
           - matchExpressions:
             - key: nvidia.com/gpu.clique
               operator: Exists
             - key: topology.k8s.aws/ultraserver-id
               operator: In
               values: 
               - <UltraServer-ID>
     containers:
     - name: ctr
       image: ubuntu:22.04
       command: ["bash", "-c"]
       args: ["ls -la /dev/nvidia-caps-imex-channels; trap 'exit 0' TERM; sleep 9999 & wait"]
       resources:
         claims:
         - name: imex-channel-0
     resourceClaims:
     - name: imex-channel-0
       resourceClaimTemplateName: imex-channel-0
   EOF
   ```

1. Apply the configuration that you created.

   ```
   kubectl apply -f imex-channel-injection.yaml
   ```

1. To verify that your pod is created, run the `get pods` commands.

   ```
   kubectl get pods
   kubectl get pods -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain
   ```

1. You can also check the logs from the pod to see if it allocated a communication channel.

   ```
   kubectl logs imex-channel-injection
   ```

   ```
   total 0
   drwxr-xr-x 2 root root     60 Feb 19 10:43 .
   drwxr-xr-x 6 root root    380 Feb 19 10:43 ..
   crw-rw-rw- 1 root root 507, 0 Feb 19 10:43 channel0
   ```

1. You can also check the logs to verify that the automated IMEX configuration is running with an allocated channel.

   ```
   kubectl logs -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain --tail=-1
   /etc/nvidia-imex/nodes_config.cfg:
   ```

   ```
   IMEX Log initializing at: 8/8/2025 14:23:12.081
   [Aug 8 2025 14:23:12] [INFO] [tid 39] IMEX version 570.124.06 is running with the following configuration options
   
   [Aug 8 2025 14:23:12] [INFO] [tid 39] Logging level = 4
   
   [Aug 8 2025 14:23:12] [INFO] [tid 39] Logging file name/path = /var/log/nvidia-imex.log
   
   [Aug 8 2025 14:23:12] [INFO] [tid 39] Append to log file = 0
   
   [Aug 8 2025 14:23:12] [INFO] [tid 39] Max Log file size = 1024 (MBs)
   
   [Aug 8 2025 14:23:12] [INFO] [tid 39] Use Syslog file = 0
   
   [Aug 8 2025 14:23:12] [INFO] [tid 39] IMEX Library communication bind interface =
   
    [Aug 8 2025 14:23:12] [INFO] [tid 39] IMEX library communication bind port = 50000
    
    [Aug 8 2025 14:23:12] [INFO] [tid 39] Identified this node as ID 0, using bind IP of '10.115.131.8', and network interface of enP5p9s0
    [Aug 8 2025 14:23:12] [INFO] [tid 39] nvidia-imex persistence file /var/run/nvidia-imex/persist.dat does not exist.  Assuming no previous importers.
   [Aug 8 2025 14:23:12] [INFO] [tid 39] NvGpu Library version matched with GPU Driver version
   [Aug 8 2025 14:23:12] [INFO] [tid 63] Started processing of incoming messages.
   [Aug 8 2025 14:23:12] [INFO] [tid 64] Started processing of incoming messages.
   [Aug 8 2025 14:23:12] [INFO] [tid 65] Started processing of incoming messages.
   [Aug 8 2025 14:23:12] [INFO] [tid 39] Creating gRPC channels to all peers (nPeers = 1).
   [Aug 8 2025 14:23:12] [INFO] [tid 66] Started processing of incoming messages.
   [Aug 8 2025 14:23:12] [INFO] [tid 39] IMEX_WAIT_FOR_QUORUM != FULL, continuing initialization without waiting for connections to all nodes.
   [Aug 8 2025 14:23:12] [INFO] [tid 67] Connection established to node 0 with ip address 10.115.131.8. Number of times connected: 1
   [Aug 8 2025 14:23:12] [INFO] [tid 39] GPU event successfully subscribed
   ```

1. After you've verified everything, delete the workload and remove the configuration.

   ```
   kubectl delete -f imex-channel-injection.yaml
   ```

# IDEs and Notebooks
<a name="sagemaker-hyperpod-eks-cluster-ide"></a>

Amazon SageMaker provides a capability for SageMaker HyperPod EKS clusters that allows AI developers to run their interactive machine learning workloads directly on the HyperPod EKS cluster. This capability introduces a new add-on, Amazon SageMaker Spaces, which enables AI developers to create and manage self-contained environments for running notebooks.

Administrators can use the SageMaker HyperPod console to install the add-on on their cluster and define default space configurations such as images, compute resources, local storage for notebook settings (additional storage attached to dev spaces), file systems, and initialization scripts. A one-click installation option with default settings simplifies the admin experience. Admins can use the SageMaker HyperPod console, kubectl, or the HyperPod CLI to install the operator, create default settings, and manage all spaces in a centralized location.

AI developers can use the HyperPod CLI to create, update, and delete dev spaces. They have the flexibility to use the default configurations provided by admins or to customize settings. AI developers can access their spaces on HyperPod from their local VS Code IDE, or from a web browser that hosts their JupyterLab or Code Editor IDE on a custom DNS domain configured by their admins. They can also use Kubernetes port forwarding to access spaces in their web browsers.
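
The port-forwarding path can be sketched as follows. The namespace and service names are illustrative placeholders, not actual resource names created by the add-on; list the services in your space's namespace first to find the right target.

```shell
# List the services in the namespace that hosts your space (names are illustrative)
kubectl get svc -n <namespace>

# Forward the space's JupyterLab port to your local machine,
# then open http://localhost:8888 in a browser
kubectl port-forward -n <namespace> svc/<space-service-name> 8888:8888
```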

## Admin
<a name="admin-cx"></a>
+ [Set up permissions](permission-setup.md)
+ [Install SageMaker AI Spaces Add-on](operator-install.md)
+ [Customize add-on](customization.md)
+ [Add users and set up service accounts](add-user.md)
+ [Limits](ds-limits.md)
+ [Task governance for Interactive Spaces on HyperPod](task-governance.md)
+ [Observability](observability.md)

## Data scientist
<a name="data-scientist-cx"></a>
+ [Create and manage spaces](create-manage-spaces.md)
+ [Web browser access](browser-access.md)
+ [Remote access to SageMaker Spaces](vscode-access.md)

## SageMaker Spaces Managed Instance Pricing
<a name="spaces-managed-instance-pricing"></a>

The SageMaker Spaces add-on (operator) does not incur any additional charge. However, to support the SSH-over-SSM tunneling required for the *Remote IDE Connection* feature, SageMaker Spaces uses an AWS-managed instance. This instance is registered as an advanced on-premises instance in Systems Manager, and is therefore billed per compute hour.

Refer to the “On-Premises Instance Management” rate on the [AWS Systems Manager pricing page](https://aws.amazon.com/systems-manager/pricing/).

# Set up permissions
<a name="permission-setup"></a>

## Roles required for Add-on and its dependencies
<a name="permission-setup-addon"></a>

### IAM Roles Required for SageMaker Spaces on SageMaker HyperPod
<a name="role-hyperpod"></a>

When enabling **SageMaker Spaces** (also known as SageMaker IDE / Notebooks) features on a SageMaker HyperPod (EKS) cluster, several IAM roles must be created and assigned. These roles support secure access, routing, remote IDE sessions, and EBS storage provisioning. The following table summarizes the roles and when they are required.

### Role Summary Table
<a name="role-table"></a>


| IAM Role | Required? | Purpose | Who Uses It? | Customization allowed by SageMaker Console? | 
| --- | --- | --- | --- | --- | 
|  Spaces Add-on Execution Role  |  Always required  |  Allows the Spaces controller to manage Spaces, generate presigned URLs, manage SSM sessions  |  Add-on controller pod (privileged)  |  ✔ Yes  | 
|  In-Cluster Router Role  |  Required for WebUI access  |  Allows router pod to perform KMS operations for JWT signing (WebUI authentication)  |  In-cluster router pod (privileged)  |  ✔ Yes  | 
|  SSM Managed Instance Role  |  Required for Remote IDE access  |  Used by SSM agent sidecar for SSH-over-SSM remote IDE sessions  |  SSM Agent in Space IDE Pods (not an add-on pod)  |  ✔ Yes  | 
|  IAM Role for EBS CSI Driver Add-on  |  Always required  |  Allows EBS CSI Driver to create/attach/modify volumes for Spaces workloads  |  EBS CSI Driver Add-on  |  Auto created  | 
|  IAM Role for External DNS Add-on  |  Required for WebUI access  |  It ensures that Space endpoints and in-cluster components can be automatically assigned DNS names in the customer’s Route 53 hosted zones.  |  External DNS Add-on  |  Auto created  | 

### 1. Spaces Add-on Execution Role (Required)
<a name="add-n-execution-role"></a>

The Spaces Add-on Execution Role is always required because it is used by the SageMaker Spaces add-on controller pod, an administrative component installed through the EKS add-on. This role allows the controller to manage Spaces, provision resources, interact with SSM, and generate presigned URLs for both Remote IDE and WebUI access. It also supports KMS access used to sign requests when authenticating WebUI HTTPS requests. This role can be created automatically when the SageMaker Spaces add-on is installed through the SageMaker Console. For manual creation, AWS provides the `AmazonSageMakerSpacesControllerPolicy` managed policy.

**Reference Trust Policy**

```
{
  "Version": "2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "pods.eks.amazonaws.com"
      },
      "Action": [
          "sts:AssumeRole",
          "sts:TagSession"
      ],
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "{{accountId}}",
          "aws:SourceArn": "arn:aws:eks:{{region}}:{{accountId}}:cluster/{{eksClusterName}}"
        }
      }
    }
  ]
}
```

### 2. In-Cluster Router Role (Required for WebUI Authentication)
<a name="in-cluster-role"></a>

The In-Cluster Router Role is used by the **router pod**, a privileged component that authenticates Spaces WebUI sessions. The router uses a KMS key to create and sign JWT tokens that authorize user access to specific Spaces. This role allows the router pod to generate data keys and decrypt them. Similar to the controller role, it enforces security using tag- and cluster-based scope restrictions. This role can be generated automatically when the Spaces add-on is installed via the SageMaker Console, but customers may also create it manually.

**Reference Trust Policy**

```
{
  "Version": "2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "pods.eks.amazonaws.com"
      },
      "Action": [
          "sts:AssumeRole",
          "sts:TagSession"
      ],
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "{{accountId}}",
          "aws:SourceArn": "arn:aws:eks:{{region}}:{{accountId}}:cluster/{{eksClusterName}}"
        }
      }
    }
  ]
}
```

**Reference Permission Policy**

```
{
    "Version": "2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "KMSDescribeKey",
            "Effect": "Allow",
            "Action": [
                "kms:DescribeKey"
            ],
            "Resource": "arn:aws:kms:{{region}}:{{accountId}}:key/{{kmsKeyId}}"
        },
        {
            "Sid": "KMSKeyOperations",
            "Effect": "Allow",
            "Action": [
                "kms:GenerateDataKey",
                "kms:Decrypt"
            ],
            "Resource": "arn:aws:kms:{{region}}:{{accountId}}:key/{{kmsKeyId}}",
            "Condition": {
                "StringEquals": {
                    "kms:EncryptionContext:sagemaker:component": "amazon-sagemaker-spaces",
                    "kms:EncryptionContext:sagemaker:eks-cluster-arn": "${aws:PrincipalTag/eks-cluster-arn}"
                }
            }
        }
    ]
}
```

### 3. SSM Managed Instance Role (Required for Remote IDE Access)
<a name="ssm-role"></a>

The SSM Managed Instance Role is passed when registering the SSM managed instance that enables remote IDE access. This role allows the SSM agent to register the pod as an SSM managed instance and use the SSM Session Manager channels for Remote IDE (SSH-over-SSM) connectivity. It can be created automatically when using the SageMaker Console. For manual deployments, customers must create this role and provide it to the Spaces add-on. The controller pod itself does not assume this role; it only provides it when calling `ssm:CreateActivation`.

**Reference Trust Policy**

```
{
    "Version": "2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ssm.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "{{account}}"
                },
                "ArnEquals": {
                    "aws:SourceArn": "arn:aws:ssm:{{region}}:{{account}}:*"
                }
            }
        }
    ]
}
```

**Reference Permissions Policy**

```
{
  "Version": "2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssm:DescribeAssociation"
      ],
      "Resource": [
        "arn:aws:ssm:{{region}}:{{account}}:association/*",
        "arn:aws:ssm:{{region}}:{{account}}:document/*",
        "arn:aws:ec2:{{region}}:{{account}}:instance/*",
        "arn:aws:ssm:{{region}}:{{account}}:managed-instance/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetDocument",
        "ssm:DescribeDocument"
      ],
      "Resource": "arn:aws:ssm:{{region}}:{{account}}:document/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetParameter",
        "ssm:GetParameters"
      ],
      "Resource": "arn:aws:ssm:{{region}}:{{account}}:parameter/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:ListInstanceAssociations"
      ],
      "Resource": [
        "arn:aws:ec2:{{region}}:{{account}}:instance/*",
        "arn:aws:ssm:{{region}}:{{account}}:managed-instance/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:PutComplianceItems"
      ],
      "Resource": [
        "arn:aws:ec2:{{region}}:{{account}}:instance/*",
        "arn:aws:ssm:{{region}}:{{account}}:managed-instance/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:UpdateAssociationStatus"
      ],
      "Resource": [
        "arn:aws:ssm:{{region}}:{{account}}:document/*",
        "arn:aws:ec2:{{region}}:{{account}}:instance/*",
        "arn:aws:ssm:{{region}}:{{account}}:managed-instance/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:UpdateInstanceAssociationStatus"
      ],
      "Resource": [
        "arn:aws:ssm:{{region}}:{{account}}:association/*",
        "arn:aws:ec2:{{region}}:{{account}}:instance/*",
        "arn:aws:ssm:{{region}}:{{account}}:managed-instance/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:UpdateInstanceInformation"
      ],
      "Resource": [
        "arn:aws:ec2:{{region}}:{{account}}:instance/*",
        "arn:aws:ssm:{{region}}:{{account}}:managed-instance/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetDeployablePatchSnapshotForInstance",
        "ssm:GetManifest",
        "ssm:ListAssociations",
        "ssm:PutInventory",
        "ssm:PutConfigurePackageResult"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2messages:AcknowledgeMessage",
        "ec2messages:DeleteMessage",
        "ec2messages:FailMessage",
        "ec2messages:GetEndpoint"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2messages:GetMessages",
        "ec2messages:SendReply"
      ],
      "Resource": "*",
      "Condition": {
        "ArnLike": {
          "ssm:SourceInstanceARN": "arn:aws:ssm:{{region}}:{{account}}:managed-instance/*"
        }
      }
    }
  ]
}
```

### 4. IAM Role for EBS CSI Driver Add-on
<a name="role-ebs-csi"></a>

The IAM role for the EBS CSI Driver is required because the EBS CSI Driver provisions persistent volumes for Spaces workloads. While the AWS-managed [AmazonEBSCSIDriverPolicy](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonEBSCSIDriverPolicy.html) provides baseline permissions, SageMaker HyperPod clusters require [additional capabilities](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-ebs.html#sagemaker-hyperpod-eks-ebs-setup) such as creating fast snapshot restores, tagging cluster-owned volumes, and attaching/detaching volumes for HyperPod-managed nodes. These permissions also include SageMaker-specific APIs such as `sagemaker:AttachClusterNodeVolume`. If the EBS CSI Driver is not installed, this role is created automatically by the SageMaker Console during Spaces add-on installation, **requiring no customer action**.
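
As an illustration, the HyperPod-specific additions include the SageMaker volume-attachment APIs. A hypothetical policy fragment might look like the following; the `sagemaker:DetachClusterNodeVolume` action name is an assumption, so follow the linked setup guide for the authoritative list of required permissions.

```json
{
    "Effect": "Allow",
    "Action": [
        "sagemaker:AttachClusterNodeVolume",
        "sagemaker:DetachClusterNodeVolume"
    ],
    "Resource": "arn:aws:sagemaker:{{region}}:{{accountId}}:cluster/*"
}
```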

### 5. IAM Role for External DNS Add-on
<a name="role-external-nds"></a>

The External DNS add-on manages DNS records for Services and Ingress resources on the HyperPod cluster. It ensures that Space endpoints and in-cluster components can be automatically assigned DNS names in the customer’s Route 53 hosted zones. Customers can install External DNS manually via the one-click option in the EKS console; otherwise, this role is created automatically by the SageMaker Console during Spaces add-on installation, **requiring no customer action**.

## Permission setup for AWS Toolkit to Access SageMaker Spaces
<a name="permission-for-toolkitl"></a>

To allow the AWS VS Code Toolkit resource explorer side panel to discover and connect to SageMaker Spaces, the following IAM permissions are required. These permissions allow the Toolkit to list available SageMaker HyperPod clusters, retrieve cluster details, and obtain a connection token for the associated Amazon EKS cluster.

**Required IAM Policy**

```
{
    "Version": "2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "SageMakerListClusters",
            "Effect": "Allow",
            "Action": "sagemaker:ListClusters",
            "Resource": "*"
        },
        {
            "Sid": "SageMakerDescribeCluster",
            "Effect": "Allow",
            "Action": "sagemaker:DescribeCluster",
            "Resource": "arn:aws:sagemaker:{{region}}:{{account}}:cluster/cluster-name"
        },
        {
            "Sid": "EksDescribeCluster",
            "Effect": "Allow",
            "Action": "eks:DescribeCluster",
            "Resource": "arn:aws:eks:{{region}}:{{account}}:cluster/cluster-name"
        },
        {
            "Sid": "EksGetToken",
            "Effect": "Allow",
            "Action": "eks:GetToken",
            "Resource": "*"
        }
    ]
}
```

**Scoping Recommendations**
+ Replace cluster-name with the specific SageMaker HyperPod cluster(s) your users need to access.
+ The `eks:GetToken` action currently does not support resource-level restrictions and must use `"Resource": "*"`. This is an AWS service limitation. Client-side authentication is performed through [EKS access entries](https://docs.aws.amazon.com/eks/latest/userguide/access-entries.html).
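
For example, to scope the policy to specific clusters, the `Resource` elements of the `sagemaker:DescribeCluster` and `eks:DescribeCluster` statements can list each cluster ARN. The cluster names below are illustrative:

```json
"Resource": [
    "arn:aws:sagemaker:us-west-2:111122223333:cluster/team-a-cluster",
    "arn:aws:sagemaker:us-west-2:111122223333:cluster/team-b-cluster"
]
```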

# Install SageMaker AI Spaces Add-on
<a name="operator-install"></a>

## Dependencies
<a name="dependencies"></a>

**Amazon EKS Pod Identity Agent add-on**
+ Required for the operator to obtain AWS credentials
+ **Typically pre-installed** on most EKS clusters
+ Installation: Via EKS add-ons

**Cert-manager**
+ Required for TLS certificate management
+ **Pre-installed** if using HyperPod quick cluster create
+ Installation: Via EKS add-ons

**EBS CSI Driver**
+ Required for Space persistent storage (EBS volumes)
+ **Automatically installed** when using SageMaker console to install
+ Requires an IAM role with `AmazonEBSCSIDriverPolicy` plus HyperPod-specific permissions
+ Installation: Via EKS add-ons. However, make sure to follow the guide to install the additional permissions needed for HyperPod.
+ Reference: [Using the Amazon EBS CSI driver on HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-ebs.html)

## Additional dependencies for WebUI Access
<a name="-additional-dependencies"></a>

**AWS Load Balancer Controller**
+ **Pre-installed** if using HyperPod quick cluster create
+ Installation: Via Helm
+ Manual installation guide: [Installing the AWS Load Balancer Controller](https://docs.aws.amazon.com/eks/latest/userguide/lbc-helm.html)

**External DNS**
+ Required when using custom domain for WebUI access
+ Manages Route53 DNS records automatically
+ Requires IAM role with Route53 permissions
+ Installation: Via EKS add-ons

## Installation
<a name="installation"></a>

Before you begin, ensure that you have:
+ An active SageMaker HyperPod cluster with at least one worker node running Kubernetes version 1.30 or later
+ At least one worker node with minimum instance type (XX vCPU, YY GiB memory)

### Installing the Amazon SageMaker Spaces add-on
<a name="space-add-on"></a>

You can install the SageMaker Spaces add-on using either quick install for default settings or custom install for advanced configuration.

#### Quick install
<a name="quick-install"></a>

1. Open the Amazon SageMaker console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose your cluster from the clusters list.

1. On the IDE and Notebooks tab, locate Amazon SageMaker Spaces, then choose Quick install.

Quick install automatically:
+ Creates the required IAM roles for the add-on
+ Enables remote access mode with required IAM roles for Systems Manager
+ Installs the add-on and configures pod identity association

#### Custom install
<a name="custom-install"></a>

1. Open the Amazon SageMaker console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose your cluster from the clusters list.

1. On the IDE and Notebooks tab, locate Amazon SageMaker Spaces, then choose Custom install.

1. Configure the following options:

   **IAM roles needed by add-on**
   + Choose whether to create new IAM roles with recommended permissions or use existing roles with the required permissions (refer to the Set up permissions section)

   **Remote access configuration**
   + Enable to allow users to connect to spaces from local Visual Studio Code using AWS Systems Manager
   + For SSM managed instance role:
     + **Create new role** – The add-on creates and manages the role with required Systems Manager permissions
     + **Use existing role** – Select a pre-configured role with necessary Systems Manager permissions
   + Ensure the Spaces Add-on Execution Role has PassRole permissions for the SSM managed instance role
**Note**  
Enabling remote access activates the AWS Systems Manager advanced-instances tier, which incurs additional per-instance charges. For pricing information, see Systems Manager pricing.

   **Web browser access configuration**
   + Enable to allow users to access spaces through a web browser using Route 53 DNS and SSL certificates
   + **Prerequisites:** Install AWS Load Balancer Controller before enabling browser access
   + **Route 53 hosted zone:** Select an existing hosted zone for a domain or subdomain that you own. The domain or subdomain must be registered and under your control to enable DNS management and SSL certificate validation.

     For more details on domain registration, see [Registering a new domain](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/domain-register.html#domain-register-procedure-section) in the Route 53 Developer Guide.
   + **Subdomain:** Enter subdomain prefix (alphanumeric and hyphens only, maximum 63 characters)
   + **SSL certificate:** Select an existing SSL certificate from AWS Certificate Manager. The certificate must be valid and cover both your subdomain (e.g., subdomain.domain.com) and wildcard subdomains (e.g., `*.subdomain.domain.com`) to support individual space access URLs.
   +  **Token signing key:** Select an AWS KMS asymmetric key for JWT token signing. The key is used to encrypt authentication tokens for secure WebUI access. You can create a new asymmetric key in KMS or select an existing one that your account has access to.
**Note**  
Standard Route 53 charges apply for hosted zones and DNS queries. For pricing information, see Route 53 pricing.

#### EKS Addon Installation - Jupyter K8s with WebUI
<a name="webui-install"></a>

##### Configuration File
<a name="configure-file"></a>

Create `addon-config.yaml`:

```
jupyter-k8s:
  workspacePodWatching:
    enable: true

jupyter-k8s-aws-hyperpod:
  clusterWebUI:
    enabled: true
    domain: "<DOMAIN_NAME>"
    awsCertificateArn: "<ACM_CERTIFICATE_ARN>"
    kmsEncryptionContext:
      enabled: true
    traefik:
      shouldInstall: true
    auth:
      kmsKeyId: "<KMS_KEY_ARN>"
```

**Replace the following placeholders:**
+ `<DOMAIN_NAME>`: Your domain name (e.g., `jupyter.example.com`)
+ `<ACM_CERTIFICATE_ARN>`: Your ACM certificate ARN (e.g., `arn:aws:acm:us-west-2:111122223333:certificate/12345678-1234-1234-1234-123456789012`)
+ `<KMS_KEY_ARN>`: Your KMS key ARN (e.g., `arn:aws:kms:us-west-2:111122223333:key/12345678-1234-1234-1234-123456789012`)
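
As a sketch, you can generate the configuration file with the placeholders already filled in. The domain, certificate ARN, and key ARN below are illustrative values; substitute your own.

```shell
# Illustrative values; replace with your own domain, certificate, and key
DOMAIN_NAME="jupyter.example.com"
ACM_CERTIFICATE_ARN="arn:aws:acm:us-west-2:111122223333:certificate/12345678-1234-1234-1234-123456789012"
KMS_KEY_ARN="arn:aws:kms:us-west-2:111122223333:key/12345678-1234-1234-1234-123456789012"

# Write addon-config.yaml with the values expanded
cat > addon-config.yaml <<EOF
jupyter-k8s:
  workspacePodWatching:
    enable: true

jupyter-k8s-aws-hyperpod:
  clusterWebUI:
    enabled: true
    domain: "${DOMAIN_NAME}"
    awsCertificateArn: "${ACM_CERTIFICATE_ARN}"
    kmsEncryptionContext:
      enabled: true
    traefik:
      shouldInstall: true
    auth:
      kmsKeyId: "${KMS_KEY_ARN}"
EOF
```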

##### Installation via AWS CLI
<a name="install-via-cli"></a>

```
aws eks create-addon \
  --cluster-name <CLUSTER_NAME> \
  --addon-name amazon-sagemaker-spaces \
  --configuration-values file://addon-config.yaml \
  --resolve-conflicts OVERWRITE \
  --region <AWS_REGION>
```

**To update existing addon:**

```
aws eks update-addon \
  --cluster-name <CLUSTER_NAME> \
  --addon-name amazon-sagemaker-spaces \
  --configuration-values file://addon-config.yaml \
  --resolve-conflicts OVERWRITE \
  --region <AWS_REGION>
```

##### Installation via AWS Management Console
<a name="install-via-console"></a>

1. Go to **EKS Console** → Select your cluster

1. Click **Add-ons** tab → **Add new**

1. Select **SageMaker Spaces** addon

1. Paste the YAML config above in **Optional configuration settings**

1. Click **Next**, then review the addon settings

1. Click **Create**

##### Verify Installation
<a name="install-verify"></a>

```
# Check addon status
aws eks describe-addon \
  --cluster-name <CLUSTER_NAME> \
  --addon-name amazon-sagemaker-spaces \
  --region <AWS_REGION>
```

##### Customizing ALB Attributes
<a name="customize-alb"></a>

By default, the addon creates a public load balancer for use with the web UI. You can customize the load balancer attributes using the EKS addon properties.

To create an internal ALB, set the scheme to `internal`:

```
jupyter-k8s-aws-hyperpod:
  clusterWebUI:
    enabled: true
    domain: "<DOMAIN_NAME>"
    awsCertificateArn: "<ACM_CERTIFICATE_ARN>"
    alb:
      scheme: "internal"  # Default is "internet-facing"
```

You can also use the `alb.annotations` field to customize ALB settings:

```
jupyter-k8s-aws-hyperpod:
  clusterWebUI:
    enabled: true
    domain: "<DOMAIN_NAME>"
    awsCertificateArn: "<ACM_CERTIFICATE_ARN>"
    alb:
      scheme: "internal"
      annotations:
        alb.ingress.kubernetes.io/security-groups: "<SECURITY_GROUP_ID>"
        alb.ingress.kubernetes.io/subnets: "<SUBNET_ID_1>,<SUBNET_ID_2>"
        alb.ingress.kubernetes.io/load-balancer-attributes: "idle_timeout.timeout_seconds=60"
```

**Common ALB annotations:**
+ `alb.ingress.kubernetes.io/security-groups`: Specify security groups for the ALB
+ `alb.ingress.kubernetes.io/subnets`: Specify subnets for the ALB
+ `alb.ingress.kubernetes.io/load-balancer-attributes`: Set ALB attributes (idle timeout, access logs, etc.)

See [AWS Load Balancer Controller documentation](https://kubernetes-sigs.github.io/aws-load-balancer-controller/latest/guide/ingress/annotations/) for all available annotations.

### Upgrade / versioning of add-on
<a name="upgrade-add-on"></a>

```
aws eks update-addon \
  --cluster-name <CLUSTER_NAME> \
  --addon-name amazon-sagemaker-spaces \
  --configuration-values file://addon-config.yaml \
  --resolve-conflicts OVERWRITE \
  --region <AWS_REGION>
```

# Customize add-on
<a name="customization"></a>

## Template
<a name="customization-template"></a>

Templates are reusable workspace configurations that serve as admin-controlled blueprints for workspace creation. They provide defaults for workspace configuration values, and guardrails to control what data scientists can do. Templates exist at a cluster level, and can be re-used across namespaces. 

SageMaker Spaces creates two system templates as a starting point for data scientists, one for Code Editor and one for JupyterLab. These system templates are managed by the add-on and cannot be edited directly. Instead, admins can create new templates and set them as default.

## Task Governance
<a name="customization-governabce"></a>

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: WorkspaceTemplate
metadata:
  name: my-jupyter-template
  namespace: my-namespace
  labels:
    kueue.x-k8s.io/priority-class: <user-input>-priority
spec:
  displayName: "My Custom Jupyter Lab"
  description: "Custom Jupyter Lab with specific configurations"
  defaultImage: "public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu"
  allowedImages:
    - "public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu"
    - "public.ecr.aws/sagemaker/sagemaker-distribution:latest-gpu"
  defaultResources:
    requests:
      cpu: "1"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "16Gi"
  primaryStorage:
    defaultSize: "10Gi"
    minSize: "5Gi"
    maxSize: "50Gi"
    defaultStorageClassName: "sagemaker-spaces-default-storage-class"
    defaultMountPath: "/home/sagemaker-user"
  defaultContainerConfig:
    command: ["/opt/amazon/sagemaker/workspace/bin/entrypoint-workspace-jupyterlab"]
  defaultPodSecurityContext:
    fsGroup: 1000
  defaultOwnershipType: "Public"
  defaultAccessStrategy:
    name: "hyperpod-access-strategy"
  allowSecondaryStorages: true
  appType: "jupyterlab"
```
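
A custom template like the one above can be applied with kubectl. This assumes kubectl access to the cluster and that the CRD exposes a `workspacetemplates` resource name, which is an assumption based on the `WorkspaceTemplate` kind.

```shell
# Apply the template manifest, then list templates in the namespace
kubectl apply -f my-jupyter-template.yaml
kubectl get workspacetemplates -n my-namespace
```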

## SMD / Custom images
<a name="customization-image"></a>

Customers can configure image policies through templates by providing a default image and a list of allowed images. Additionally, administrators can choose whether to allow data scientists to bring their own custom images. The system defaults to using the latest SageMaker Distribution, but if you wish to pin to a particular version, you can specify the exact SMD version to use in a template.
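
For example, a template fragment that pins to a specific SageMaker Distribution version might look like the following. The exact tag is illustrative; check the SageMaker Distribution image repository for the versions that are actually available.

```yaml
# Hypothetical fragment: pin the template to one SMD version
defaultImage: "public.ecr.aws/sagemaker/sagemaker-distribution:2.7.0-cpu"
allowedImages:
  - "public.ecr.aws/sagemaker/sagemaker-distribution:2.7.0-cpu"
  - "public.ecr.aws/sagemaker/sagemaker-distribution:2.7.0-gpu"
```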

Custom image requirements:
+ `curl`, if you want to use idle shutdown
+ The IDE server must listen on port 8888
+ The dependencies required for remote access (see the Image requirements section below)

## Remote IDE Requirement
<a name="remote-ide-requirement"></a>

### VS Code version requirement
<a name="remote-ide-requirement-vscode"></a>

VS Code version [v1.90](https://code.visualstudio.com/updates/v1_90) or greater is required. We recommend using the [latest stable version of VS Code](https://code.visualstudio.com/updates).

### Operating system requirements
<a name="remote-ide-requirement-operate"></a>

You need one of the following operating systems to remotely connect to Studio spaces:
+ macOS 13 or later
+ Windows 10
  + [Windows 10 support ends on October 14, 2025](https://support.microsoft.com/en-us/windows/windows-10-support-ends-on-october-14-2025-2ca8b313-1946-43d3-b55c-2b95b107f281)
+ Windows 11
+ Linux
  + Install the official [Microsoft VS Code for Linux](https://code.visualstudio.com/docs/setup/linux) (not an open-source build)

### Local machine prerequisites
<a name="remote-ide-requirement-machine"></a>

Before connecting your local Visual Studio Code to Studio spaces, ensure your local machine has the required dependencies and network access.

**Note**  
Environments with software installation restrictions may prevent users from installing required dependencies. The AWS Toolkit for Visual Studio Code automatically searches for these dependencies when initiating remote connections and will prompt for installation if any are missing. Coordinate with your IT department to ensure these components are available.

**Required local dependencies**

Your local machine must have the following components installed:
+ **[Remote - SSH extension](https://code.visualstudio.com/docs/remote/ssh)** — Standard VS Code Marketplace extension for remote development
+ **[Session Manager plugin](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html)** — Required for secure session management
+ **SSH Client** — Standard component on most machines ([OpenSSH recommended for Windows](https://learn.microsoft.com/en-us/windows-server/administration/openssh/openssh_install_firstuse))
+ **[VS Code command-line interface](https://code.visualstudio.com/docs/configure/command-line)** — Typically included with VS Code installation

**Platform-specific requirements**
+ **Windows users** — PowerShell 5.1 or later is required for SSH terminal connections
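
As a quick sanity check, a shell loop like the following reports whether the required tools are on your `PATH`. The command names `code` and `session-manager-plugin` are the defaults installed by VS Code and the Session Manager plugin; verify the names for your platform.

```shell
# Report which remote-access prerequisites are available on PATH.
report=""
for tool in code ssh session-manager-plugin; do
  if command -v "$tool" >/dev/null 2>&1; then
    status="found"
  else
    status="MISSING"
  fi
  report="${report}${tool}: ${status}"$'\n'
done
printf '%s' "$report"
```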

**Network connectivity requirements**

Your local machine must have network access to [Session Manager endpoints](https://docs.aws.amazon.com/general/latest/gr/ssm.html). For example, in US East (N. Virginia) (us-east-1) these include:
+ `ssm.us-east-1.amazonaws.com`
+ `ssm.us-east-1.api.aws`
+ `ssmmessages.us-east-1.amazonaws.com`
+ `ec2messages.us-east-1.amazonaws.com`
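
The standard `.amazonaws.com` endpoint names follow a per-Region pattern, so you can derive them for your own Region (the dual-stack `ssm.<region>.api.aws` variant follows a different pattern). The commented `curl` probe is a network-dependent sketch:

```shell
# Derive the regional Session Manager endpoint names.
region="us-east-1"   # substitute your Region
for svc in ssm ssmmessages ec2messages; do
  endpoint="${svc}.${region}.amazonaws.com"
  echo "$endpoint"
  # Network-dependent reachability probe (uncomment to use):
  # curl -sS --connect-timeout 5 -o /dev/null "https://${endpoint}"
done
```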

### Image requirements
<a name="remote-ide-requirement-image"></a>

**SageMaker Distribution images**

When using SageMaker Distribution with remote access, use [SageMaker Distribution](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-distribution.html) version 2.7 or later.

**Custom images**

When you [Bring your own image (BYOI)](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-byoi.html) with remote access, follow the [custom image specifications](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-byoi-specs.html) and make sure the following dependencies are installed:
+ `curl` or `wget` — Required for downloading AWS CLI components
+ `unzip` — Required for extracting AWS CLI installation files
+ `tar` — Required for archive extraction
+ `gzip` — Required for compressed file handling
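
To verify a custom image before using it with remote access, you can run a check like the following inside a container started from the image. This is a sketch that mirrors the list above; either `curl` or `wget` satisfies the download requirement.

```shell
# Check for the tools that remote access needs inside the image.
missing=""
command -v curl >/dev/null 2>&1 || command -v wget >/dev/null 2>&1 || missing=" curl-or-wget"
for tool in unzip tar gzip; do
  command -v "$tool" >/dev/null 2>&1 || missing="${missing} ${tool}"
done
if [ -z "$missing" ]; then
  status="all dependencies present"
else
  status="missing:${missing}"
fi
echo "$status"
```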

### Instance requirements
<a name="remote-ide-requirement-instance"></a>
+ **Memory** — 8 GB or more. The following instance types are *not* supported because they have less than 8 GB of memory: `ml.t3.medium`, `ml.c7i.large`, `ml.c6i.large`, `ml.c6id.large`, and `ml.c5.large`. For the memory available on each instance type, see the [Amazon EC2 On-Demand Pricing page](https://aws.amazon.com/ec2/pricing/on-demand/).

## Optimizing Kubernetes Startup Time by Pre-Warming Container Images
<a name="remote-ide-optimize-image"></a>

Container image pulling performance has become a significant bottleneck for many EKS customers, especially as AI/ML workloads rely on increasingly large container images. Pulling and unpacking these large images typically takes several minutes the first time they are used on each EKS node. This delay adds substantial latency when launching SageMaker Spaces and directly impacts the user experience, particularly in environments where fast startup is essential, such as notebooks and interactive development jobs.

Image pre-warming is a technique used to preload specific container images onto every node in the EKS/HyperPod cluster before they are needed. Instead of waiting for a pod to trigger the first pull of a large image, the cluster proactively downloads and caches images across all nodes. This ensures that when workloads launch, the required images are already available locally, eliminating long cold-start delays. Image pre-warming improves SageMaker Spaces startup speed and provides a more predictable and responsive experience for end users.

### Pre-Warming via DaemonSet
<a name="remote-ide-optimize-image-dae"></a>

We recommend using a DaemonSet to preload images. A DaemonSet ensures that one pod runs on every node in the cluster. Each container inside the DaemonSet pod references an image you want to cache. When Kubernetes starts the pod, it automatically pulls the images, warming the cache on each node.

The following example shows how to create a DaemonSet that preloads two GPU images. Each container runs a lightweight `sleep infinity` command to keep the pod active with minimal overhead.

```
cat <<EOF | kubectl apply -n "namespace_1" -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-preload-ds
spec:
  selector:
    matchLabels:
      app: image-preloader
  template:
    metadata:
      labels:
        app: image-preloader
    spec:
      containers:
      - name: preloader-3-4-2
        image: public.ecr.aws/sagemaker/sagemaker-distribution:3.4.2-gpu
        command: ["sleep"]
        args: ["infinity"]
        resources:
          requests:
            cpu: 1m
            memory: 16Mi
          limits:
            cpu: 5m
            memory: 32Mi
      - name: preloader-3-3-2
        image: public.ecr.aws/sagemaker/sagemaker-distribution:3.3.2-gpu
        command: ["sleep"]
        args: ["infinity"]
        resources:
          requests:
            cpu: 1m
            memory: 16Mi
          limits:
            cpu: 5m
            memory: 32Mi
EOF
```

### How It Works
<a name="remote-ide-optimize-image-how"></a>
+ Each container references one image.
+ Kubernetes must download each image before starting the container.
+ Once the pod is running on every node, the images are cached locally.
+ Any workload using these images now starts much faster.
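
To confirm that the images are cached, you can list the images each node reports in its status (the kubelet publishes these under `.status.images`). This `jsonpath` query is one way to view them:

```
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .status.images[*]}{"  "}{.names[0]}{"\n"}{end}{end}'
```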

## Space default storage (EBS)
<a name="space-storage"></a>

By default, the system uses the EBS CSI driver to provision an EBS storage volume for each workspace. SageMaker creates an EBS storage class for use with workspaces, and administrators can customize the default and maximum size of these volumes through template settings. Advanced users working with CLI tools can also customize the workspace's storage class, which allows the use of other storage classes, including ones configured with customer-managed KMS keys for EBS volumes.

Note that EBS volumes are bound to a particular Availability Zone (AZ), which means workspaces can only be scheduled on nodes in the same AZ as their storage volume. This can lead to scheduling failures when cluster capacity exists, but not in the right AZ.
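
To see which AZ a workspace volume is pinned to, you can inspect the PersistentVolume's node affinity. This assumes the EBS CSI driver's usual layout, where the zone appears as the first match expression:

```
kubectl get pv -o custom-columns='NAME:.metadata.name,ZONE:.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]'
```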

## Additional storage
<a name="space-additional-storage"></a>

SageMaker Spaces supports attaching additional storage volumes such as Amazon EFS, FSx for Lustre, or S3 Mountpoint to your development spaces. This allows you to access shared datasets, collaborate on projects, or use high-performance storage for your workloads.

### Prerequisites
<a name="space-additional-storage-prereq"></a>

Before attaching additional storage to spaces, you must:

1. **Install the appropriate CSI driver add-on** via [EKS add-ons](https://docs.aws.amazon.com/eks/latest/userguide/workloads-add-ons-available-eks.html) (Amazon EFS CSI Driver, Amazon FSx for Lustre CSI Driver, or Mountpoint for Amazon S3 CSI Driver)

1. **Set up storage resources and PersistentVolumeClaims** following the CSI driver documentation for your specific storage type

1. **Ensure the PVC is available** in the same namespace where you plan to create your space
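
As an illustration of steps 2 and 3, the following statically provisions an EFS file system as a PersistentVolume with a matching PersistentVolumeClaim. The file system ID and namespace are placeholders; see the EFS CSI driver documentation for the authoritative procedure.

```
apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-efs-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-0123456789abcdef0   # placeholder EFS file system ID
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-efs-pvc
  namespace: my-namespace   # same namespace as the space
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 5Gi
```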

### Attaching storage to spaces
<a name="space-additional-storage-attach"></a>

Once you have a PersistentVolumeClaim configured, you can attach it to a space using either the HyperPod CLI or kubectl.

**HyperPod CLI**

```
hyp create hyp-space \
    --name my-space \
    --display-name "My Space with FSx" \
    --memory 8Gi \
    --volume name=shared-fsx,mountPath=/shared,persistentVolumeClaimName=my-fsx-pvc
```

**kubectl**

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: my-space
spec:
  displayName: "My Space with FSx"
  desiredStatus: Running
  volumes:
  - name: shared-fsx
    mountPath: /shared
    persistentVolumeClaimName: my-fsx-pvc
```

### Multiple volumes
<a name="space-additional-storage-multiple"></a>

You can attach multiple additional storage volumes to a single space by specifying multiple `--volume` flags with the CLI or multiple entries in the `volumes` array with kubectl.

**HyperPod CLI**

```
hyp create hyp-space \
    --name my-space \
    --display-name "My Space with Multiple Storage" \
    --memory 8Gi \
    --volume name=shared-efs,mountPath=/shared,persistentVolumeClaimName=my-efs-pvc \
    --volume name=datasets,mountPath=/datasets,persistentVolumeClaimName=my-s3-pvc
```

**kubectl**

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: my-space
spec:
  displayName: "My Space with Multiple Storage"
  desiredStatus: Running
  volumes:
  - name: shared-efs
    mountPath: /shared
    persistentVolumeClaimName: my-efs-pvc
  - name: datasets
    mountPath: /datasets
    persistentVolumeClaimName: my-s3-pvc
```

## Resource configuration
<a name="space-resource-configuration"></a>

SageMaker Spaces allows you to configure compute resources for your development environments, including CPU, memory, and GPU resources to match your workload requirements.

### GPU configuration
<a name="space-gpu-configuration"></a>

SageMaker Spaces supports both whole GPU allocation and GPU partitioning using NVIDIA Multi-Instance GPU (MIG) technology. This allows you to optimize GPU utilization for different types of machine learning workloads.

#### Whole GPU allocation
<a name="space-gpu-whole"></a>

**HyperPod CLI**

```
hyp create hyp-space \
    --name gpu-space \
    --display-name "GPU Development Space" \
    --image public.ecr.aws/sagemaker/sagemaker-distribution:latest-gpu \
    --memory 16Gi \
    --gpu 1 \
    --gpu-limit 1
```

**kubectl**

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: gpu-space
spec:
  displayName: "GPU Development Space"
  image: "public.ecr.aws/sagemaker/sagemaker-distribution:latest-gpu"
  desiredStatus: Running
  resources:
    requests:
      memory: "16Gi"
      nvidia.com/gpu: "1"
    limits:
      memory: "16Gi"
      nvidia.com/gpu: "1"
```

#### GPU partitioning (MIG)
<a name="space-gpu-mig"></a>

GPU partitioning using NVIDIA Multi-Instance GPU (MIG) technology allows you to partition a single GPU into smaller, isolated instances. Your HyperPod cluster must have GPU nodes that support MIG and have MIG profiles configured. For more information on setting up MIG on your HyperPod cluster, see [GPU partitioning using NVIDIA MIG](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-gpu-partitioning-setup.html).

**HyperPod CLI**

```
hyp create hyp-space \
    --name mig-space \
    --display-name "MIG GPU Space" \
    --image public.ecr.aws/sagemaker/sagemaker-distribution:latest-gpu \
    --memory 8Gi \
    --accelerator-partition-type mig-3g.20gb \
    --accelerator-partition-count 1
```

**kubectl**

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: mig-space
spec:
  displayName: "MIG GPU Space"
  image: "public.ecr.aws/sagemaker/sagemaker-distribution:latest-gpu"
  desiredStatus: Running
  resources:
    requests:
      memory: "8Gi"
      nvidia.com/mig-3g.20gb: "1"
    limits:
      memory: "8Gi"
      nvidia.com/mig-3g.20gb: "1"
```

## Lifecycle
<a name="space-lifecycle"></a>

Lifecycle configuration provides startup scripts that run when a workspace is created or started, allowing administrators to customize the workspace environment. These are bash scripts with a maximum size of 1 KB. If you need a larger setup, we recommend adding a script to the container image and triggering it from the lifecycle configuration.

We leverage Kubernetes [container lifecycle hooks](https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/) to provide this functionality. Note that Kubernetes does not guarantee when the startup script runs relative to the container's entrypoint.
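
For reference, this is what the underlying mechanism looks like on a plain pod: a `postStart` hook runs a command when the container starts. This is shown only to illustrate the Kubernetes mechanism, not the Workspace lifecycle field syntax:

```
apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-hook-demo
spec:
  containers:
  - name: app
    image: public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu
    lifecycle:
      postStart:
        exec:
          command: ["/bin/sh", "-c", "echo started > /tmp/startup-marker"]
```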

## Idle shutdown
<a name="space-idle-shutdown"></a>

Configure automatic shutdown of idle workspaces to optimize resource usage.

### Idle shutdown
<a name="space-idle-shutdown-spec"></a>

```
idleShutdown:
  enabled: true
  idleShutdownTimeoutMinutes: 30
  detection:
    httpGet:
      path: /api/idle
      port: 8888
      scheme: HTTP
```

### Parameters
<a name="space-idle-shutdown-parameter"></a>

**enabled** (boolean, required) - Enables or disables idle shutdown for the workspace.

**idleShutdownTimeoutMinutes** (integer, required) - Number of minutes of inactivity before the workspace shuts down. Minimum value is 1.

**detection** (object, required) - Defines how to detect workspace idle state.

**detection.httpGet** (object, optional) - HTTP endpoint configuration for idle detection. Uses Kubernetes HTTPGetAction specification.
+ **path** - HTTP path to request
+ **port** - Port number or name
+ **scheme** - HTTP or HTTPS (default: HTTP)

### Configuration Locations
<a name="space-idle-shutdown-configure"></a>

**Workspace Configuration**

Define idle shutdown directly in the workspace specification:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: my-workspace
spec:
  displayName: "Development Workspace"
  image: jupyter/scipy-notebook:latest
  idleShutdown:
    enabled: true
    idleShutdownTimeoutMinutes: 30
    detection:
      httpGet:
        path: /api/idle
        port: 8888
```

**Template Configuration**

Define default idle shutdown behavior in a WorkspaceTemplate:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: WorkspaceTemplate
metadata:
  name: jupyter-template
spec:
  displayName: "Jupyter Template"
  defaultImage: jupyter/scipy-notebook:latest
  defaultIdleShutdown:
    enabled: true
    idleShutdownTimeoutMinutes: 30
    detection:
      httpGet:
        path: /api/idle
        port: 8888
  idleShutdownOverrides:
    allow: true
    minTimeoutMinutes: 60
    maxTimeoutMinutes: 240
```

### Template Inheritance and Overrides
<a name="space-idle-shutdown-inherit"></a>

Workspaces using a template automatically inherit the template's `defaultIdleShutdown` configuration. Workspaces can override this configuration if the template allows it.

**Override Policy**

Templates control override behavior through `idleShutdownOverrides`:

**allow** (boolean, default: true) - Whether workspaces can override the default idle shutdown configuration.

**minTimeoutMinutes** (integer, optional) - Minimum allowed timeout value for workspace overrides.

**maxTimeoutMinutes** (integer, optional) - Maximum allowed timeout value for workspace overrides.

**Inheritance Example**

Workspace inherits template defaults:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: my-workspace
spec:
  displayName: "My Workspace"
  templateRef:
    name: jupyter-template
  # Inherits defaultIdleShutdown from template
```

**Override Example**

Workspace overrides template defaults:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: my-workspace
spec:
  displayName: "My Workspace"
  templateRef:
    name: jupyter-template
  idleShutdown:
    enabled: true
    idleShutdownTimeoutMinutes: 60  # Must be within template bounds
    detection:
      httpGet:
        path: /api/idle
        port: 8888
```

**Locked Configuration**

Prevent workspace overrides:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: WorkspaceTemplate
metadata:
  name: locked-template
spec:
  displayName: "Locked Template"
  defaultImage: jupyter/scipy-notebook:latest
  defaultIdleShutdown:
    enabled: true
    idleShutdownTimeoutMinutes: 30
    detection:
      httpGet:
        path: /api/idle
        port: 8888
  idleShutdownOverrides:
    allow: false  # Workspaces cannot override
```

### Behavior
<a name="space-idle-shutdown-behavior"></a>

When idle shutdown is enabled, the system periodically checks the workspace for activity using the configured HTTP endpoint. If the endpoint indicates the workspace is idle for the specified timeout duration, the workspace automatically stops. You can manually restart the workspace when needed.

## Template updates
<a name="customization-template-updates"></a>

Client tools such as kubectl and the HyperPod CLI and SDK can be used to manage Spaces within the EKS cluster. Administrators can provision Space Templates with default Space configurations, while data scientists can customize their integrated development environments without needing to understand the underlying Kubernetes complexity. For detailed usage instructions, see the [HyperPod CLI and SDK documentation](https://sagemaker-hyperpod-cli.readthedocs.io/en/latest/index.html).

Administrators can perform CRUD operations on Space Templates, which serve as the base configurations when creating a Space. Data Scientists can perform CRUD operations on Spaces and override various parameters, including the Multi-Instance GPU profiles for specific compute nodes. They can start, stop, and connect to the Spaces via remote VSCode access and the Web UI. When a Space Template is updated, any subsequently created Space will be configured with the settings in the updated template. Compliance checks will be performed when existing Spaces are updated or started. If any settings are out of bounds or mismatched, the Spaces will fail to update or start.

## Using the hyp CLI and kubectl
<a name="customization-hyp-cli"></a>

Users can perform CRUD operations on templates with the HyperPod CLI:

```
# 1. Create a Space Template
hyp create hyp-space-template --file template.yaml

# 2. List Space Templates
hyp list hyp-space-template
hyp list hyp-space-template --output json

# 3. Describe a Space Template
hyp describe hyp-space-template --name my-template
hyp describe hyp-space-template --name my-template --output json

# 4. Update a Space Template
hyp update hyp-space-template --name my-template --file updated-template.yaml

# 5. Delete a Space Template
hyp delete hyp-space-template --name my-template
```

To create custom templates, you can use our system templates as a starting point. The following templates work for SageMaker Distribution-like images, but you can customize them for the images your admins use.

Example custom JupyterLab template:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: WorkspaceTemplate
metadata:
  name: my-jupyter-template
  namespace: my-namespace
spec:
  displayName: "My Custom Jupyter Lab"
  description: "Custom Jupyter Lab with specific configurations"
  defaultImage: "public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu"
  allowedImages:
    - "public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu"
    - "public.ecr.aws/sagemaker/sagemaker-distribution:latest-gpu"
  defaultResources:
    requests:
      cpu: "1"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "16Gi"
  primaryStorage:
    defaultSize: "10Gi"
    minSize: "5Gi"
    maxSize: "50Gi"
    defaultStorageClassName: "sagemaker-spaces-default-storage-class"
    defaultMountPath: "/home/sagemaker-user"
  defaultContainerConfig:
    command: ["/opt/amazon/sagemaker/workspace/bin/entrypoint-workspace-jupyterlab"]
  defaultPodSecurityContext:
    fsGroup: 1000
  defaultOwnershipType: "Public"
  defaultAccessStrategy:
    name: "hyperpod-access-strategy"
  allowSecondaryStorages: true
  appType: "jupyterlab"
```

Example custom Code Editor template:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: WorkspaceTemplate
metadata:
  name: my-code-editor-template
  namespace: my-namespace
spec:
  displayName: "My Custom Code Editor"
  description: "Custom Code Editor with specific configurations"
  defaultImage: "public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu"
  allowedImages:
    - "public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu"
    - "public.ecr.aws/sagemaker/sagemaker-distribution:latest-gpu"
  defaultResources:
    requests:
      cpu: "1"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "16Gi"
  primaryStorage:
    defaultSize: "10Gi"
    minSize: "5Gi"
    maxSize: "50Gi"
    defaultStorageClassName: "sagemaker-spaces-default-storage-class"
    defaultMountPath: "/home/sagemaker-user"
  defaultContainerConfig:
    command: ["/opt/amazon/sagemaker/workspace/bin/entrypoint-workspace-code-editor"]
  defaultPodSecurityContext:
    fsGroup: 1000
  defaultOwnershipType: "Public"
  defaultAccessStrategy:
    name: "hyperpod-access-strategy"
  allowSecondaryStorages: true
  appType: "code-editor"
```

# Add users and set up service accounts
<a name="add-user"></a>

## Fine grained access control - our recommendation
<a name="add-user-access-control"></a>

Users are differentiated based on their Kubernetes username, which is defined in their access entry. To ensure that two human users have distinct usernames, there are two options:

1. Recommended - Multiple human users can share the same IAM role as long as each has a distinct session name that persists between sessions. By default, Kubernetes usernames for IAM roles use the format `arn:aws:sts::{ACCOUNT_ID}:assumed-role/{ROLE_NAME}/{SESSION_NAME}`, so users are already differentiated by session name. An admin can enforce unique session names per user in a few ways:
   + SSO login - Users who sign in through SSO get a session name tied to their AWS username by default.
   + Central credential vending service - Enterprise customers may have an internal credential vending service that users call to get credentials tied to their identity.
   + Role-based enforcement - Require users to set their `aws:username` as the role session name when they assume an IAM role in your AWS account. For details, see [Easily control naming individual IAM role sessions](https://aws.amazon.com/blogs/security/easily-control-naming-individual-iam-role-sessions/).

1. If two data scientists use different access entries (different IAM roles or users), they are always counted as different users.

**Creating access entry**

Required IAM permissions for the data scientist role:
+ `eks:DescribeCluster`

Required access entry policies:
+ `AmazonSagemakerHyperpodSpacePolicy` - scoped to the namespace where the data scientist creates spaces
+ `AmazonSagemakerHyperpodSpaceTemplatePolicy` - scoped to the `jupyter-k8s-shared` namespace
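
A sketch of creating an access entry and attaching a policy with the AWS CLI follows. The policy ARN format is an assumption; list the access policies available in your account with `aws eks list-access-policies` and substitute the real ARN, cluster name, role, and namespaces.

```
# Create the access entry for the data scientist role (session names differentiate users).
aws eks create-access-entry \
    --cluster-name my-hyperpod-cluster \
    --principal-arn arn:aws:iam::111122223333:role/DataScientistRole

# Attach the Space policy, scoped to the data scientist's namespace (ARN is illustrative).
aws eks associate-access-policy \
    --cluster-name my-hyperpod-cluster \
    --principal-arn arn:aws:iam::111122223333:role/DataScientistRole \
    --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonSagemakerHyperpodSpacePolicy \
    --access-scope type=namespace,namespaces=team-namespace
```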

## Private and Public spaces
<a name="add-user-spaces"></a>

We support two sharing patterns: “Public” and “OwnerOnly”. Both the “AccessType” and “OwnershipType” fields use these two values.
+ AccessType: Public spaces can be accessed by anyone with permissions in the namespace, while OwnerOnly spaces can be accessed only by the space creator and administrator users.
+ OwnershipType: Public spaces can be modified or deleted by anyone with permissions in the namespace, while OwnerOnly spaces can be modified or deleted only by the creator or an admin.

Admin users are defined by:

1. Membership in the `system:masters` Kubernetes group

1. Membership in the Kubernetes group defined in the `CLUSTER_ADMIN_GROUP` environment variable in the Helm chart

A user’s groups can be configured using EKS access entries. A space can be defined as “Public” or “OwnerOnly” by configuring the spec in the object:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  labels:
    app.kubernetes.io/name: jupyter-k8s
  name: example-workspace
spec:
  displayName: "Example Workspace"
  image: "public.ecr.aws/sagemaker/sagemaker-distribution:3.4.2-cpu"
  desiredStatus: "Running"
  ownershipType: "Public"  # or "OwnerOnly"
  accessType: "Public"  # or "OwnerOnly"
  # more fields here
```

# Limits
<a name="ds-limits"></a>

Spaces run as pods on HyperPod EKS nodes with attached EBS volumes. The number of Spaces that can be deployed per node is constrained by AWS infrastructure limits.

**EBS Volume Limits per Node**

Reference: [https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/volume\_limits.html](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/volume_limits.html)

EC2 nodes have a maximum number of EBS volumes that can be attached. Since each Space typically uses one EBS volume, this limits how many Spaces with dedicated EBS storage can run on a single node.

**Maximum Pods per HyperPod Node**

Reference: [https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)

Each HyperPod instance type supports a maximum number of pods based on available IP addresses from the VPC CNI plugin. Since each Space runs as a pod, this directly caps the number of Spaces per node.

**Impact**

The effective limit for Spaces per node is whichever constraint is reached first. 
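
In other words, the cap is the minimum of the two limits. With illustrative numbers only (look up the real values for your instance type in the references above):

```shell
# Hypothetical per-node limits; substitute values for your instance type.
ebs_volume_limit=26   # max attachable EBS volumes
max_pods=58           # max pods from the VPC CNI plugin
effective_limit=$(( ebs_volume_limit < max_pods ? ebs_volume_limit : max_pods ))
echo "Spaces per node capped at: ${effective_limit}"
```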

# Task governance for Interactive Spaces on HyperPod
<a name="task-governance"></a>

This section covers how to optimize your shared Amazon SageMaker HyperPod EKS clusters for Interactive Spaces workloads. You'll learn to configure Kueue's task governance features—including quota management, priority scheduling, and resource sharing policies—to ensure your development workloads run without interruption while maintaining fair allocation across your teams' training, evaluation, and batch processing activities.

## How Interactive Space management works
<a name="task-governance-how"></a>

To effectively manage Interactive Spaces in shared HyperPod EKS clusters, implement the following task governance strategies using Kueue's existing capabilities.

**Priority class configuration**

Define dedicated priority classes for Interactive Spaces with high weights (such as 100) to ensure development pods are admitted and scheduled before other task types. This configuration enables Interactive Spaces to preempt lower-priority jobs during cluster load, which is critical for maintaining uninterrupted development workflows.
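
One way to express this in Kueue is a `WorkloadPriorityClass`; the name and the value of 100 are illustrative, and HyperPod task governance may manage priority classes on your behalf:

```
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: interactive-space-priority
value: 100
description: "High priority so Interactive Space pods are admitted first"
```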

**Quota sizing and allocation**

Reserve sufficient compute resources in your team's ClusterQueue to handle expected development workloads. During periods when development resources are idle, unused quota resources can be temporarily allocated to other teams' tasks. When development demand increases, these borrowed resources can be reclaimed to prioritize pending Interactive Space pods.

**Resource Sharing Strategies**

Choose between two quota sharing approaches based on your requirements:

*Strict Resource Control*: Disable quota lending and borrowing to guarantee reserved compute capacity is always available for your Interactive Spaces. This approach requires sizing quotas large enough to independently handle peak development demand and may result in idle nodes during low-usage periods.

*Flexible Resource Sharing*: Enable quota lending to allow other teams to utilize idle development resources when needed. However, disable borrowing to ensure Interactive Spaces never run on borrowed, reclaimable resources that could lead to unexpected evictions.
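
A sketch of the flexible-sharing variant as a Kueue `ClusterQueue`: setting `borrowingLimit: 0` keeps Spaces off reclaimable borrowed capacity, while omitting `lendingLimit` leaves idle quota lendable within the cohort. Names, flavor, and quota values are illustrative.

```
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: dev-team-queue
spec:
  cohort: shared-pool
  resourceGroups:
  - coveredResources: ["cpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: 8
        borrowingLimit: 0   # never schedule Spaces on borrowed, reclaimable capacity
        # lendingLimit omitted: idle quota can be lent to other queues in the cohort
```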

**Intra-Team Preemption**

Enable intra-team preemption when running mixed workloads (training, evaluation, and Interactive Spaces) under the same quota. This allows Kueue to preempt lower-priority jobs within your team to accommodate high-priority Interactive Space pods, ensuring development work can proceed without depending on external quota borrowing.

## Sample Interactive Space setup
<a name="task-governance-space-setup"></a>

The following example shows how Kueue manages compute resources for Interactive Spaces in a shared Amazon SageMaker HyperPod cluster.

**Cluster configuration and policy setup**

Your cluster has the following configuration:
+ *Team Alpha (Dev Team)*: 8 CPU quota for Interactive Spaces
+ *Team Beta (ML Team)*: 16 CPU quota for training and evaluation
+ *Team Gamma (Research)*: 6 CPU quota for experimentation
+ *Static provisioning*: No autoscaling
+ *Total capacity*: 30 CPUs

The shared CPU pool uses this priority policy:
+ *Interactive Spaces*: Priority 100
+ *Training*: Priority 75
+ *Evaluation*: Priority 50
+ *Batch Processing*: Priority 25

Kueue enforces team quotas and priority classes, with preemption enabled and borrowing disabled for the dev team.

**Initial state: Normal cluster utilization**

In normal operations:
+ *Team Alpha*: Runs 6 Interactive Spaces using 6 CPUs, 2 CPUs idle
+ *Team Beta*: Runs training jobs (12 CPUs) and evaluation (4 CPUs) within its 16 CPU quota
+ *Team Gamma*: Runs research workloads on all 6 CPUs
+ *Resource sharing*: Team Beta borrows Team Alpha's 2 idle CPUs for additional training

**Development spike: Team Alpha requires additional resources**

When Team Alpha's developers need to scale up development work, additional Interactive Space pods require 4 more CPUs. Kueue detects that the new pods are:
+ Within Team Alpha's namespace
+ Priority 100 (Interactive Spaces)
+ Pending admission due to quota constraints

**Kueue's response process**

Kueue follows a three-step process to allocate resources:

1. **Quota check**

   Question: Does Team Alpha have unused quota?
   + *Current usage*: 6 CPUs used, 2 CPUs available
   + *New requirement*: 4 CPUs needed
   + *Result*: Insufficient quota → Proceed to Step 2

1. **Self-preemption within Team Alpha**

   Question: Can lower-priority Team Alpha jobs be preempted?
   + *Available targets*: No lower-priority jobs in Team Alpha
   + *Result*: No preemption possible → Proceed to Step 3

1. **Reclaim borrowed resources**

   Question: Are Team Alpha resources being borrowed by other teams?
   + *Borrowed resources*: Team Beta using 2 CPUs from Team Alpha
   + *Action*: Kueue evicts Team Beta's borrowed training pods, freeing 2 CPUs
   + *Remaining need*: Still need 2 more CPUs → Interactive Spaces remain in NotAdmitted state until resources become available

This approach prioritizes Interactive Spaces while maintaining team quota boundaries and preventing development work from running on unstable borrowed resources.

# Observability
<a name="observability"></a>

## Standard Kubernetes Monitoring
<a name="observability-monitor"></a>

You can monitor Spaces using standard Kubernetes tools such as `kubectl describe` and `kubectl logs`.

**Monitoring Space Status**

```
# List all Spaces with status
kubectl get workspace -A

# Get detailed information about a specific Space
kubectl describe workspace <workspace-name>
```

**Viewing Space Logs**

```
# View workspace container logs
kubectl logs -l workspace.jupyter.org/workspace-name=<workspace-name> -c workspace

# View SSM agent sidecar logs (for remote IDE connectivity)
kubectl logs -l workspace.jupyter.org/workspace-name=<workspace-name> -c ssm-agent-sidecar

# Follow logs in real-time
kubectl logs -l workspace.jupyter.org/workspace-name=<workspace-name> -c workspace -f
```

**Understanding Space Conditions**

Spaces report four condition types in their status:
+ **Available**: `True` when the Space is ready for use. All required resources (pods, services, storage) are running and healthy.
+ **Progressing**: `True` when the Space is being created, updated, or reconciled. Transitions to `False` once stable.
+ **Degraded**: `True` when errors are detected with the Space resources. Check the condition message for details.
+ **Stopped**: `True` when the Space desired status is set to `Stopped`. The pods are terminated but storage and configuration are preserved.

## CloudWatch Logs Integration
<a name="observability-cw"></a>

You can install the CloudWatch logging add-on to send Space logs to Amazon CloudWatch Logs for centralized log management and retention. This enables log aggregation across multiple clusters and integration with CloudWatch Logs Insights for querying and analysis. With this add-on, all of the logs shown in the `kubectl logs` examples above are queryable in CloudWatch.

**Reference:** [https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-cluster-observability-cluster-cloudwatch-ci.html](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-cluster-observability-cluster-cloudwatch-ci.html)

## HyperPod Observability Add-on
<a name="observability-addon"></a>

The SageMaker HyperPod observability add-on provides comprehensive dashboards for monitoring Space resource utilization. After installing the add-on, you can view Space memory and CPU usage in the **Tasks** tab of the HyperPod console, which displays metrics in Amazon Managed Grafana dashboards.

**Reference:** [https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-observability-addon.html](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-observability-addon.html)

**Key metrics available:**
+ CPU and memory utilization per Space
+ GPU metrics (if applicable)

# Create and manage spaces
<a name="create-manage-spaces"></a>

Data scientists can list the spaces they have access to, create a space from one of the templates, update a space (for example, to change its image, file system, or other configuration attributes), and delete a space. As a prerequisite, you must install the HyperPod CLI or use kubectl to create and manage spaces. For details on the HyperPod CLI, see the [HyperPod CLI README](https://github.com/aws/sagemaker-hyperpod-cli/blob/main/README.md#space). To use kubectl commands, see [Install kubectl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html).

## Create space
<a name="create-manage-spaces-create"></a>

**HyperPod CLI**

Create a Jupyter space

```
hyp create hyp-space \
    --name myspace \
    --display-name "My Space" \
    --memory 8Gi \
    --template-ref name=sagemaker-jupyter-template,namespace=jupyter-k8s-system
```

Create a Code Editor space

```
hyp create hyp-space \
    --name myspace \
    --display-name "My Space" \
    --memory 8Gi \
    --template-ref name=sagemaker-code-editor-template,namespace=jupyter-k8s-system
```

**kubectl**

```
kubectl apply -f - <<EOF
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: my-space
spec:
  displayName: my-space
  desiredStatus: Running
EOF
```

Alternatively, apply a saved YAML file directly:

```
kubectl apply -f my-workspace.yaml
```
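
If you generate manifests programmatically, you can write them with only the standard library: JSON is a subset of YAML, so `kubectl apply -f` accepts a JSON-formatted file as-is. This sketch builds the same minimal Workspace shown above; only the fields from that example are used.

```python
import json

def workspace_manifest(name: str, desired_status: str = "Running") -> dict:
    """Build a minimal Workspace object like the example above."""
    return {
        "apiVersion": "workspace.jupyter.org/v1alpha1",
        "kind": "Workspace",
        "metadata": {"name": name},
        "spec": {"displayName": name, "desiredStatus": desired_status},
    }

# JSON is valid YAML, so kubectl accepts this file directly
with open("my-workspace.yaml", "w") as f:
    json.dump(workspace_manifest("my-space"), f, indent=2)
```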

## List spaces
<a name="create-manage-spaces-list"></a>

**HyperPod CLI**

```
hyp list hyp-space
```

**kubectl**

```
kubectl get workspaces -n <workspace-namespace> 
```

## Describe a space
<a name="create-manage-spaces-describe"></a>

**HyperPod CLI**

```
hyp describe hyp-space --name myspace
```

**kubectl**

```
# Basic Status reporting
kubectl get workspace my-workspace -n <workspace-namespace>

# Enhanced Workspace Information Retrieval 
kubectl get workspace my-workspace -n <workspace-namespace> -o wide

# Complete Workspace Information Retrieval
kubectl get workspace my-workspace -n <workspace-namespace> -o json
kubectl get workspace my-workspace -n <workspace-namespace> -o yaml
```

## Update a space
<a name="create-manage-spaces-update"></a>

**HyperPod CLI**

```
hyp update hyp-space \
    --name myspace \
    --display-name "Updated My Space"
```

**kubectl**

Update the original workspace YAML file as needed, then re-apply it. Be sure not to modify the metadata name. You can also use these kubectl commands to modify fields without reapplying the entire workspace YAML:

```
# Open the Workspace in your default editor and modify it
kubectl edit workspace <workspace-name> -n <workspace-namespace>

# Patch a Workspace
kubectl patch workspace <workspace-name> --type='merge' -p \
    '{"spec":{"<field name>":"<desired value>"}}' -n <workspace-namespace>
```

## Start/Stop a space
<a name="create-manage-spaces-stop"></a>

**HyperPod CLI**

```
hyp start hyp-space --name myspace
hyp stop hyp-space --name myspace
```

**kubectl**

You can update the `desiredStatus` field in the Workspace spec to start or stop a space.

```
# Start a Workspace
kubectl patch workspace <workspace-name> --type='merge' -p \
    '{"spec":{"desiredStatus":"Running"}}' -n <workspace-namespace>
    
# Stop a Workspace
kubectl patch workspace <workspace-name> --type='merge' -p \
    '{"spec":{"desiredStatus":"Stopped"}}' -n <workspace-namespace>
```
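
When driving start/stop from a script, it can help to build and validate the merge-patch body before it reaches the cluster. The following is a minimal sketch of that idea, assuming (as in the condition list earlier) that `Running` and `Stopped` are the supported desired statuses.

```python
import json

def desired_status_patch(status: str) -> str:
    """Return the merge-patch body for `kubectl patch workspace`."""
    if status not in ("Running", "Stopped"):
        raise ValueError(f"unsupported desiredStatus: {status}")
    return json.dumps({"spec": {"desiredStatus": status}})

print(desired_status_patch("Stopped"))
```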

## Get Logs
<a name="create-manage-spaces-log"></a>

**HyperPod CLI**

```
hyp get-logs hyp-space --name myspace
```

**kubectl**

```
# Check Pod Logs
kubectl logs -l workspace.jupyter.org/workspace-name=<workspace-metadata-name>

# Check Pod Events
kubectl describe pod -l workspace.jupyter.org/workspace-name=<workspace-metadata-name>

# Check Operator Logs
kubectl logs -n jupyter-k8s-system deployment/jupyter-k8s-controller-manager
```

## Delete a space
<a name="create-manage-spaces-delete"></a>

**HyperPod CLI**

```
hyp delete hyp-space --name myspace
```

**kubectl**

```
# Delete a Workspace
kubectl delete workspace <workspace-name> -n <namespace>
```

# Web browser access
<a name="browser-access"></a>

Web UI access allows you to connect directly to development spaces running on your SageMaker HyperPod cluster through a secure web browser interface. This provides immediate access to Jupyter Lab and other web-based development environments without requiring local software installation.

## Prerequisites
<a name="browser-access-prereq"></a>

Before setting up web UI access, ensure you have completed the following:
+ *SageMaker Spaces add-on installation*: Follow the [SageMaker Spaces add-on installation](https://docs.aws.amazon.com/sagemaker/latest/dg/operator-install.html) instructions and enable web UI access during installation
+ *User access to EKS cluster*: Users need an EKS Access Entry configured with appropriate permissions. See [Add users and set up service accounts](https://docs.aws.amazon.com/sagemaker/latest/dg/add-user.html) for EKS Access Entry setup details
+ *Development spaces*: Create and start development spaces on your HyperPod cluster
+ *kubectl access*: Ensure kubectl is configured to access your EKS cluster

## Generate Web UI Access URL
<a name="browser-access-url"></a>

**Using HyperPod CLI**

If you have the HyperPod CLI installed, you can use this simplified command:

```
hyp create hyp-space-access --name <space-name> --connection-type web-ui
```

**Using kubectl**

You can also use the `kubectl` command line to create a connection request.

```
kubectl create -f - -o yaml <<EOF
apiVersion: connection.workspace.jupyter.org/v1alpha1
kind: WorkspaceConnection
metadata:
  namespace: <space-namespace>
spec:
  workspaceName: <space-name>
  workspaceConnectionType: web-ui
EOF
```

The URL is returned in the `status.workspaceConnectionUrl` field of the command output.
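
To extract the URL in a script without extra dependencies, create the WorkspaceConnection with `-o json` instead of `-o yaml` and parse the result with the standard library. The sample payload below is illustrative; only the `status.workspaceConnectionUrl` field is assumed from the text above.

```python
import json

def connection_url(kubectl_json_output: str) -> str:
    """Extract status.workspaceConnectionUrl from `-o json` output."""
    obj = json.loads(kubectl_json_output)
    return obj["status"]["workspaceConnectionUrl"]

sample = '{"status": {"workspaceConnectionUrl": "https://example.com/space"}}'
print(connection_url(sample))  # https://example.com/space
```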

## Accessing Your Development Space
<a name="browser-access-develop"></a>

1. *Generate the web UI URL* using one of the methods above

1. *Copy the URL* from the response

1. *Open the URL* in your web browser

1. *Access your development environment* through the web interface

## Supported Development Environments
<a name="browser-access-develop-env"></a>

The web UI provides access to:
+ *Jupyter Lab*
+ *Code Editor*

## Troubleshooting
<a name="browser-access-troubleshooting"></a>

**Cannot generate access URLs**

Check the following:
+ SageMaker Spaces add-on is running: `kubectl get pods -n sagemaker-spaces-system`
+ Development space is running and healthy
+ User has appropriate EKS Access Entry permissions

# Remote access to SageMaker Spaces
<a name="vscode-access"></a>

Remote access allows you to connect your local Visual Studio Code directly to development spaces running on your SageMaker HyperPod cluster. Remote connections use SSM to establish secure, encrypted tunnels between your local machine and the development spaces.

## Prerequisites
<a name="vscode-access-prereq"></a>

Before setting up remote access, ensure you have completed the following:
+ *SageMaker Spaces add-on installation*: Follow [SageMaker Spaces add-on installation](https://docs.aws.amazon.com/sagemaker/latest/dg/operator-install.html) and enable remote access during installation (either Quick install or Custom install with remote access configuration enabled).
+ *User access to EKS cluster*: Users need an EKS Access Entry configured with appropriate permissions. See [Add users and set up service accounts](https://docs.aws.amazon.com/sagemaker/latest/dg/add-user.html) for EKS Access Entry setup details
+ *Development spaces*: Create and start development spaces on your HyperPod cluster
+ *kubectl access*: Ensure kubectl is configured to access your EKS cluster

## Generate VS Code remote connection
<a name="vscode-access-remote"></a>

### Using HyperPod CLI
<a name="vscode-access-remote-cli"></a>

If you have the HyperPod CLI installed, you can use this simplified command:

```
hyp create hyp-space-access --name <space-name> --connection-type vscode-remote
```

### Using kubectl
<a name="vscode-access-remote-kubectl"></a>

You can also use the `kubectl` command line to create a connection request.

```
kubectl create -f - -o yaml <<EOF
apiVersion: connection.workspace.jupyter.org/v1alpha1
kind: WorkspaceConnection
metadata:
  namespace: <space-namespace>
spec:
  workspaceName: <space-name>
  workspaceConnectionType: vscode-remote
EOF
```

The URL is returned in the `status.workspaceConnectionUrl` field of the command output.

## Connecting with VS Code
<a name="vscode-access-remote-vscode"></a>

1. Generate the VS Code connection URL using one of the methods above

1. Copy the VS Code URL from the response

1. Click the URL or paste it into your browser

1. VS Code will prompt to open the remote connection

1. Confirm the connection to establish the remote development environment

## Supported Development Environments
<a name="vscode-access-remote-dev-env"></a>

Remote access provides connectivity to:
+ *Jupyter Lab*
+ *Code Editor*

## Troubleshooting
<a name="troubleshooting"></a>

**Cannot generate connection URLs**

*Check the following:*
+ SageMaker Spaces add-on is running: `kubectl get pods -n sagemaker-spaces-system`
+ Development space is running and healthy
+ Remote access was enabled during add-on installation
+ User has appropriate EKS Access Entry permissions

# Train and deploy models with HyperPod CLI and SDK
<a name="getting-started-hyperpod-training-deploying-models"></a>

Amazon SageMaker HyperPod helps you train and deploy machine learning models at scale. The AWS HyperPod CLI is a unified command-line interface that simplifies machine learning (ML) workflows on AWS. It abstracts infrastructure complexities and provides a streamlined experience for submitting, monitoring, and managing ML training jobs. The CLI is designed for data scientists and ML engineers who want to focus on model development rather than infrastructure management. This topic walks you through three key scenarios: training a PyTorch model, deploying a custom model using trained artifacts, and deploying a JumpStart model. The tutorial is aimed at first-time users and shows how to set up, train, and deploy models using either the HyperPod CLI or the SDK. The handoff of model artifacts from training to inference is a manual step, described in each scenario.

## Prerequisites
<a name="prerequisites"></a>

Before you begin using Amazon SageMaker HyperPod, make sure you have:
+ An AWS account with access to Amazon SageMaker HyperPod
+ Python 3.9, 3.10, or 3.11 installed
+ AWS CLI configured with appropriate credentials

## Install the HyperPod CLI and SDK
<a name="install-cli-sdk"></a>

Install the required package to access the CLI and SDK:

```
pip install sagemaker-hyperpod
```

This command sets up the tools needed to interact with HyperPod clusters.
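
To confirm the installation from Python, you can query the installed distribution metadata. This assumes the distribution name matches the `pip install` package name above.

```python
from importlib.metadata import version, PackageNotFoundError

try:
    # Distribution name assumed to match the pip package above
    print("sagemaker-hyperpod", version("sagemaker-hyperpod"))
except PackageNotFoundError:
    print("sagemaker-hyperpod is not installed")
```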

## Configure your cluster context
<a name="configure-cluster"></a>

HyperPod operates on clusters optimized for machine learning. Start by listing available clusters to select one for your tasks.

1. List all available clusters:

   ```
   hyp list-cluster
   ```

1. Choose and set your active cluster:

   ```
   hyp set-cluster-context your-eks-cluster-name
   ```

1. Verify the configuration:

   ```
   hyp get-cluster-context
   ```

**Note**  
All subsequent commands target the cluster you've set as your context.

## Choose your scenario
<a name="choose-scenario"></a>

For detailed instructions on each scenario, see the following topics:

**Topics**
+ [Prerequisites](#prerequisites)
+ [Install the HyperPod CLI and SDK](#install-cli-sdk)
+ [Configure your cluster context](#configure-cluster)
+ [Choose your scenario](#choose-scenario)
+ [Train a PyTorch model](train-models-with-hyperpod.md)
+ [Deploy a custom model](deploy-trained-model.md)
+ [Deploy a JumpStart model](deploy-jumpstart-model.md)

# Train a PyTorch model
<a name="train-models-with-hyperpod"></a>

This topic walks you through the process of training a PyTorch model using HyperPod.

In this scenario, let's train a PyTorch model using the `hyp-pytorch-job` template, which simplifies job creation by exposing commonly used parameters. The model artifacts will be stored in an S3 bucket for later use in inference. However, this is optional, and you can choose your preferred storage location.

## Create a training job
<a name="create-training-job"></a>

You can train the model using either the CLI or Python SDK.

### Using the CLI
<a name="using-cli"></a>

Create a training job with the following command:

```
hyp create hyp-pytorch-job \
    --version 1.0 \
    --job-name test-pytorch-job \
    --image pytorch/pytorch:latest \
    --command '["python", "train.py"]' \
    --args '["--epochs", "10", "--batch-size", "32"]' \
    --environment '{"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:32"}' \
    --pull-policy "IfNotPresent" \
    --instance-type ml.p4d.24xlarge \
    --tasks-per-node 8 \
    --label-selector '{"accelerator": "nvidia", "network": "efa"}' \
    --deep-health-check-passed-nodes-only true \
    --scheduler-type "kueue" \
    --queue-name "training-queue" \
    --priority "high" \
    --max-retry 3 \
    --volumes '["data-vol", "model-vol", "checkpoint-vol"]' \
    --persistent-volume-claims '["shared-data-pvc", "model-registry-pvc"]' \
    --output-s3-uri s3://my-bucket/model-artifacts
```

**Key required parameters explained**:
+ `--job-name`: Unique identifier for your training job
+ `--image`: Docker image containing your training environment

This command starts a training job named `test-pytorch-job`. The `--output-s3-uri` specifies where the trained model artifacts will be stored, for example, `s3://my-bucket/model-artifacts`. Note this location, as you’ll need it for deploying the custom model.

### Using the Python SDK
<a name="using-python-sdk"></a>

For programmatic control, use the SDK. Create a Python script to launch the same training job.

```
from sagemaker.hyperpod import HyperPodPytorchJob
from sagemaker.hyperpod.job import ReplicaSpec, Template, Spec, Container, Resources, RunPolicy, Metadata

# Define job specifications
nproc_per_node = "1"  # Number of processes per node

replica_specs = [
    ReplicaSpec(
        name="pod",  # Replica name
        template=Template(
            spec=Spec(
                containers=[
                    Container(
                        name="container-name",  # Container name
                        image="448049793756.dkr.ecr.us-west-2.amazonaws.com/ptjob:mnist",  # Training image
                        image_pull_policy="Always",  # Always pull image
                        resources=Resources(
                            requests={"nvidia.com/gpu": "0"},  # No GPUs requested
                            limits={"nvidia.com/gpu": "0"},    # No GPU limit
                        ),
                        command=["python", "train.py"],  # Command to run
                        args=["--epochs", "10", "--batch-size", "32"],  # Script arguments
                    )
                ]
            )
        ),
    )
]

# Keep pods after completion
run_policy = RunPolicy(clean_pod_policy="None")

# Create and start the PyTorch job
pytorch_job = HyperPodPytorchJob(
    metadata=Metadata(name="demo"),  # Job name
    nproc_per_node=nproc_per_node,   # Processes per node
    replica_specs=replica_specs,     # Replica specifications
    run_policy=run_policy,           # Run policy
    output_s3_uri="s3://my-bucket/model-artifacts",  # S3 location for artifacts
)

# Launch the job
pytorch_job.create()
```

## Monitor your training job
<a name="monitor-training-job"></a>

Monitor your job's progress with these commands:

### Using the CLI
<a name="monitor-cli"></a>

```
# Check job status
hyp list hyp-pytorch-job

# Get detailed information
hyp describe hyp-pytorch-job --job-name test-pytorch-job

# View logs
hyp get-logs hyp-pytorch-job \
    --pod-name test-pytorch-job-pod-0 \
    --job-name test-pytorch-job
```

**Note**: Training time varies based on model complexity and instance type. Monitor the logs to track progress.

These commands help you verify the job’s status and troubleshoot issues. Once the job completes successfully, the model artifacts are saved to `s3://my-bucket/model-artifacts`.

### Using the Python SDK
<a name="monitor-python-sdk"></a>

Add the following code to your Python script:

```
print("List all pods created for this job:")
print(pytorch_job.list_pods())

print("Check the logs from pod0:")
print(pytorch_job.get_logs_from_pod(pod_name="demo-pod-0"))

print("List all HyperPodPytorchJobs:")
print(HyperPodPytorchJob.list())

print("Describe job:")
print(HyperPodPytorchJob.get(name="demo").model_dump())

pytorch_job.refresh()
print(pytorch_job.status.model_dump())
```

## Next steps
<a name="next-steps"></a>

After training, the model artifacts are stored in the S3 bucket you specified (`s3://my-bucket/model-artifacts`). You can use these artifacts to deploy a model. Currently, you must manually manage the transition from training to inference. This involves:
+ **Locating artifacts**: Check the S3 bucket (`s3://my-bucket/model-artifacts`) to confirm the trained model files are present.
+ **Recording the path**: Note the exact S3 path (e.g., `s3://my-bucket/model-artifacts/test-pytorch-job/model.tar.gz`) for use in the inference setup.
+ **Referencing in deployment**: Provide this S3 path when configuring the custom endpoint to ensure the correct model is loaded.
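
The path-recording step above can be captured in a small helper. The layout (`s3://<bucket>/model-artifacts/<job-name>/model.tar.gz`) mirrors the example paths in this topic and is an assumption about your job's output structure; adjust the prefix and artifact name to match what actually lands in your bucket.

```python
def artifact_uri(bucket: str, job_name: str,
                 prefix: str = "model-artifacts",
                 artifact: str = "model.tar.gz") -> str:
    """Compose the S3 path to record for the deployment step."""
    return f"s3://{bucket}/{prefix}/{job_name}/{artifact}"

print(artifact_uri("my-bucket", "test-pytorch-job"))
# s3://my-bucket/model-artifacts/test-pytorch-job/model.tar.gz
```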

# Deploy a custom model
<a name="deploy-trained-model"></a>

After training completes, deploy your model for inference. You can deploy a custom model using either the CLI or the SDK.

## Locate your model artifacts
<a name="locate-model-artifacts"></a>
+ **Check your S3 bucket**: Verify that model artifacts are saved at `s3://my-bucket/model-artifacts/`
+ **Note the exact path**: You'll need the full path (for example, `s3://my-bucket/model-artifacts/test-pytorch-job/model.tar.gz`)

## Deploy using the CLI
<a name="deploy-using-cli"></a>

Run the following command to deploy your custom model:

```
hyp create hyp-custom-endpoint \
    --version 1.0 \
    --env '{"HF_MODEL_ID":"/opt/ml/model", "SAGEMAKER_PROGRAM":"inference.py"}' \
    --model-source-type s3 \
    --model-location test-pytorch-job \
    --s3-bucket-name my-bucket \
    --s3-region us-east-2 \
    --prefetch-enabled true \
    --image-uri 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:latest \
    --model-volume-mount-name model-weights \
    --container-port 8080 \
    --resources-requests '{"cpu": "30000m", "nvidia.com/gpu": 1, "memory": "100Gi"}' \
    --resources-limits '{"nvidia.com/gpu": 1}' \
    --tls-output-s3-uri s3://<bucket_name> \
    --instance-type ml.g5.8xlarge \
    --endpoint-name endpoint-custom-pytorch \
    --model-name pytorch-custom-model
```

This command deploys the trained model as an endpoint named `endpoint-custom-pytorch`. The `--model-location` references the artifact path from the training job.

## Deploy using the Python SDK
<a name="deploy-using-sdk"></a>

Create a Python script with the following content:

```
from sagemaker.hyperpod.inference.config.hp_custom_endpoint_config import Model, Server, SageMakerEndpoint, TlsConfig, EnvironmentVariables
from sagemaker.hyperpod.inference.hp_custom_endpoint import HPCustomEndpoint

model = Model(
    model_source_type="s3",
    model_location="test-pytorch-job",
    s3_bucket_name="my-bucket",
    s3_region="us-east-2",
    prefetch_enabled=True
)

server = Server(
    instance_type="ml.g5.8xlarge",
    image_uri="763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.3.1-gpu-py311-cu124-ubuntu22.04-v2.0",
    container_port=8080,
    model_volume_mount_name="model-weights"
)

resources = {
    "requests": {"cpu": "30000m", "nvidia.com/gpu": 1, "memory": "100Gi"},
    "limits": {"nvidia.com/gpu": 1}
}

env = EnvironmentVariables(
    HF_MODEL_ID="/opt/ml/model",
    SAGEMAKER_PROGRAM="inference.py",
    SAGEMAKER_SUBMIT_DIRECTORY="/opt/ml/model/code",
    MODEL_CACHE_ROOT="/opt/ml/model",
    SAGEMAKER_ENV="1"
)

endpoint_name = SageMakerEndpoint(name="endpoint-custom-pytorch")

tls_config = TlsConfig(tls_certificate_output_s3_uri="s3://<bucket_name>")

custom_endpoint = HPCustomEndpoint(
    model=model,
    server=server,
    resources=resources,
    environment=env,
    sage_maker_endpoint=endpoint_name,
    tls_config=tls_config
)

custom_endpoint.create()
```

## Invoke the endpoint
<a name="invoke-endpoint"></a>

### Using the CLI
<a name="invoke-using-cli"></a>

Test the endpoint with a sample input:

```
hyp invoke hyp-custom-endpoint \
    --endpoint-name endpoint-custom-pytorch \
    --body '{"inputs":"What is the capital of USA?"}'
```

This returns the model’s response, such as “The capital of the USA is Washington, D.C.”

### Using the SDK
<a name="invoke-using-sdk"></a>

Add the following code to your Python script:

```
data = '{"inputs":"What is the capital of USA?"}'
response = custom_endpoint.invoke(body=data).body.read()
print(response)
```
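
The response body is raw bytes; decode it before use. The exact payload shape depends on the serving container, so the parsing below is a hedged sketch that falls back to plain text when the body is not JSON.

```python
import json

def parse_response(raw: bytes):
    """Decode an endpoint response; fall back to plain text if not JSON."""
    text = raw.decode("utf-8")
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text

print(parse_response(b'[{"generated_text": "Washington, D.C."}]'))
```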

## Manage the endpoint
<a name="manage-endpoint"></a>

### Using the CLI
<a name="manage-using-cli"></a>

List and inspect the endpoint:

```
hyp list hyp-custom-endpoint
hyp get hyp-custom-endpoint --name endpoint-custom-pytorch
```

### Using the SDK
<a name="manage-using-sdk"></a>

Add the following code to your Python script:

```
logs = custom_endpoint.get_logs()
print(logs)
```

## Clean up resources
<a name="cleanup-resources"></a>

When you're done, delete the endpoint to avoid unnecessary costs.

### Using the CLI
<a name="cleanup-using-cli"></a>

```
hyp delete hyp-custom-endpoint --name endpoint-custom-pytorch
```

### Using the SDK
<a name="cleanup-using-sdk"></a>

```
custom_endpoint.delete()
```

## Next steps
<a name="next-steps"></a>

You've successfully deployed and tested a custom model using SageMaker HyperPod. You can now use this endpoint for inference in your applications.

# Deploy a JumpStart model
<a name="deploy-jumpstart-model"></a>

You can deploy a pre-trained JumpStart model for inference using either the CLI or the SDK.

## Using the CLI
<a name="deploy-jumpstart-cli"></a>

Run the following command to deploy a JumpStart model:

```
hyp create hyp-jumpstart-endpoint \
  --version 1.0 \
  --model-id deepseek-llm-r1-distill-qwen-1-5b \
  --instance-type ml.g5.8xlarge \
  --endpoint-name endpoint-test-jscli
```

## Using the SDK
<a name="deploy-jumpstart-sdk"></a>

Create a Python script with the following content:

```
from sagemaker.hyperpod.inference.config.hp_jumpstart_endpoint_config import Model, Server, SageMakerEndpoint, TlsConfig
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint

model=Model(
    model_id='deepseek-llm-r1-distill-qwen-1-5b'
)

server=Server(
    instance_type='ml.g5.8xlarge',
)

endpoint_name=SageMakerEndpoint(name='<endpoint-name>')

# create spec
js_endpoint=HPJumpStartEndpoint(
    model=model,
    server=server,
    sage_maker_endpoint=endpoint_name
)

# deploy the endpoint
js_endpoint.create()
```

## Invoke the endpoint
<a name="invoke-jumpstart-endpoint"></a>

### Using the CLI
<a name="invoke-jumpstart-cli"></a>

Test the endpoint with a sample input:

```
hyp invoke hyp-jumpstart-endpoint \
    --endpoint-name endpoint-jumpstart \
    --body '{"inputs":"What is the capital of USA?"}'
```

### Using the SDK
<a name="invoke-jumpstart-sdk"></a>

Add the following code to your Python script:

```
data = '{"inputs":"What is the capital of USA?"}'
response = js_endpoint.invoke(body=data).body.read()
print(response)
```

## Manage the endpoint
<a name="manage-jumpstart-endpoint"></a>

### Using the CLI
<a name="manage-jumpstart-cli"></a>

List and inspect the endpoint:

```
hyp list hyp-jumpstart-endpoint
hyp get hyp-jumpstart-endpoint --name endpoint-jumpstart
```

### Using the SDK
<a name="manage-jumpstart-sdk"></a>

Add the following code to your Python script:

```
endpoint_iterator = HPJumpStartEndpoint.list()
for endpoint in endpoint_iterator:
    print(endpoint.name, endpoint.status)

logs = js_endpoint.get_logs()
print(logs)
```

## Clean up resources
<a name="cleanup-jumpstart-resources"></a>

When you're done, delete the endpoint to avoid unnecessary costs.

### Using the CLI
<a name="cleanup-jumpstart-cli"></a>

```
hyp delete hyp-jumpstart-endpoint --name endpoint-jumpstart
```

### Using the SDK
<a name="cleanup-jumpstart-sdk"></a>

```
js_endpoint.delete()
```

## Next steps
<a name="jumpstart-next-steps"></a>

Now that you've trained a PyTorch model, deployed it as a custom endpoint, and deployed a JumpStart model using HyperPod's CLI and SDK, explore advanced features:
+ **Multi-node training**: Scale training across multiple instances
+ **Custom containers**: Build specialized training environments
+ **Integration with SageMaker Pipelines**: Automate your ML workflows
+ **Advanced monitoring**: Set up custom metrics and alerts

For more examples and advanced configurations, visit the [SageMaker HyperPod GitHub repository](https://github.com/aws/amazon-sagemaker-examples).

# Running jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS
<a name="sagemaker-hyperpod-eks-run-jobs"></a>

The following topics provide procedures and examples of accessing compute nodes and running ML workloads on provisioned SageMaker HyperPod clusters orchestrated with Amazon EKS. Depending on how you have set up the environment on your HyperPod cluster, there are many ways to run ML workloads on HyperPod clusters.

**Note**  
When running jobs via the SageMaker HyperPod CLI or kubectl, HyperPod can track compute utilization (GPU/CPU hours) across namespaces (teams). These metrics power usage reports, which provide:  
Visibility into allocated vs. borrowed resource consumption
Team-level resource utilization for auditing (up to 180 days)
Cost attribution aligned with Task Governance policies
To use usage reports, you must install the usage report infrastructure. We strongly recommend configuring [Task Governance](sagemaker-hyperpod-eks-operate-console-ui-governance.md) to enforce compute quotas and enable granular cost attribution.  
For more information about setting up and generating usage reports, see [Reporting Compute Usage in HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-usage-reporting.html).

**Tip**  
For a hands-on experience and guidance on how to set up and use a SageMaker HyperPod cluster orchestrated with Amazon EKS, we recommend taking this [Amazon EKS Support in SageMaker HyperPod](https://catalog.us-east-1.prod.workshops.aws/workshops/2433d39e-ccfe-4c00-9d3d-9917b729258e) workshop.

Data scientist users can train foundational models using the EKS cluster set as the orchestrator for the SageMaker HyperPod cluster. Scientists leverage the [SageMaker HyperPod CLI](https://github.com/aws/sagemaker-hyperpod-cli) and the native `kubectl` commands to find available SageMaker HyperPod clusters, submit training jobs (Pods), and manage their workloads. The SageMaker HyperPod CLI enables job submission using a training job schema file, and provides capabilities for job listing, description, cancellation, and execution. Scientists can use [Kubeflow Training Operator](https://www.kubeflow.org/docs/components/training/overview/) according to compute quotas managed by HyperPod, and [SageMaker AI-managed MLflow](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html) to manage ML experiments and training runs. 

**Topics**
+ [Installing the SageMaker HyperPod CLI](sagemaker-hyperpod-eks-run-jobs-access-nodes.md)
+ [SageMaker HyperPod CLI commands](sagemaker-hyperpod-eks-hyperpod-cli-reference.md)
+ [Running jobs using the SageMaker HyperPod CLI](sagemaker-hyperpod-eks-run-jobs-hyperpod-cli.md)
+ [Running jobs using `kubectl`](sagemaker-hyperpod-eks-run-jobs-kubectl.md)

# Installing the SageMaker HyperPod CLI
<a name="sagemaker-hyperpod-eks-run-jobs-access-nodes"></a>

SageMaker HyperPod provides the [SageMaker HyperPod command line interface](https://github.com/aws/sagemaker-hyperpod-cli) (CLI) package. 

1. Verify that the Python version on your local machine is between 3.8 and 3.11.

1. Check the prerequisites in the `README` markdown file in the [SageMaker HyperPod CLI](https://github.com/aws/sagemaker-hyperpod-cli) package.

1. Clone the SageMaker HyperPod CLI package from GitHub.

   ```
   git clone https://github.com/aws/sagemaker-hyperpod-cli.git
   ```

1. Install the SageMaker HyperPod CLI.

   ```
   cd sagemaker-hyperpod-cli && pip install .
   ```

1. Test if the SageMaker HyperPod CLI is successfully installed by running the following command. 

   ```
   hyperpod --help
   ```

**Note**  
If you are a data scientist and want to use the SageMaker HyperPod CLI, make sure that your IAM role is set up properly by your cluster admins following the instructions at [IAM users for scientists](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-user) and [Setting up Kubernetes role-based access control](sagemaker-hyperpod-eks-setup-rbac.md).

# SageMaker HyperPod CLI commands
<a name="sagemaker-hyperpod-eks-hyperpod-cli-reference"></a>

The following table summarizes the SageMaker HyperPod CLI commands.

**Note**  
For a complete CLI reference, see [README](https://github.com/aws/sagemaker-hyperpod-cli?tab=readme-ov-file#sagemaker-hyperpod-command-line-interface) in the [SageMaker HyperPod CLI GitHub repository](https://github.com/aws/sagemaker-hyperpod-cli).


| SageMaker HyperPod CLI command | Entity  | Description | 
| --- | --- | --- | 
| hyperpod get-clusters | cluster/access | Lists all clusters that the user has IAM permissions to submit training workloads to. Also provides a snapshot of the available instances that are not running any workloads or jobs, along with maximum capacity, grouped by health check status (for example, BurnInPassed). | 
| hyperpod connect-cluster | cluster/access | Configures kubectl to operate on the specified HyperPod cluster and namespace | 
| hyperpod start-job  | job | Submits a job to the targeted cluster. Job names are unique at the namespace level. Users can override values in the YAML spec by passing them as CLI arguments. | 
| hyperpod get-job | job | Displays metadata of the submitted job | 
| hyperpod list-jobs | job | Lists all jobs in the connected cluster and namespace that the user has IAM permissions to submit training workloads in | 
| hyperpod cancel-job | job | Stops and deletes the job, releasing the underlying compute resources. A canceled job cannot be resumed; start a new job if needed. | 
| hyperpod list-pods | pod | Lists all the pods of a given job in a namespace | 
| hyperpod get-log | pod | Retrieves the logs of a particular pod in a specified job | 
| hyperpod exec | pod | Runs a bash command in the shell of the specified pod(s) and publishes the output | 
| hyperpod --help | utility | Lists all supported commands | 

# Running jobs using the SageMaker HyperPod CLI
<a name="sagemaker-hyperpod-eks-run-jobs-hyperpod-cli"></a>

To run jobs, make sure that you installed Kubeflow Training Operator in the EKS clusters. For more information, see [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md).

Run the `hyperpod get-clusters` command to get the list of available HyperPod clusters.

```
hyperpod get-clusters
```

Run the `hyperpod connect-cluster` command to configure the SageMaker HyperPod CLI with the EKS cluster orchestrating the HyperPod cluster.

```
hyperpod connect-cluster --cluster-name <hyperpod-cluster-name>
```

Use the `hyperpod start-job` command to run a job. The following shows the command with its required options.

```
hyperpod start-job \
    --job-name <job-name> \
    --image <docker-image-uri> \
    --entry-script <entrypoint-script> \
    --instance-type <ml.instance.type> \
    --node-count <integer>
```

The `hyperpod start-job` command also comes with various options such as job auto-resume and job scheduling.

## Enabling job auto-resume
<a name="sagemaker-hyperpod-eks-run-jobs-hyperpod-cli-enable-auto-resume"></a>

The `hyperpod start-job` command also has the following options to specify job auto-resume. To enable job auto-resume to work with the SageMaker HyperPod node resiliency features, you must set the value for the `restart-policy` option to `OnFailure`. The job must be running under the `kubeflow` namespace or a namespace prefixed with `hyperpod`.
+ `[--auto-resume <bool>]` – Optional. Enables job auto-resume after failures. The default is `false`.
+ `[--max-retry <int>]` – Optional. If `auto-resume` is set to `true`, the default value of `max-retry` is 1.
+ `[--restart-policy <enum>]` – Optional. The PyTorchJob restart policy. Available values are `Always`, `OnFailure`, `Never`, or `ExitCode`. The default value is `OnFailure`.

```
hyperpod start-job \
    ... // required options \
    --auto-resume true \
    --max-retry 3 \
    --restart-policy OnFailure
```

## Running jobs with scheduling options
<a name="sagemaker-hyperpod-eks-run-jobs-hyperpod-cli-scheduling"></a>

The `hyperpod start-job` command has the following options to set up the job with queuing mechanisms. 

**Note**  
You need [Kueue](https://kueue.sigs.k8s.io/docs/overview/) installed in the EKS cluster. If you haven't installed it, follow the instructions in [Setup for SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance-setup.md).
+ `[--scheduler-type <enum>]` – Optional. Specifies the scheduler type. The default is `Kueue`.
+ `[--queue-name <string>]` – Optional. Specifies the name of the [Local Queue](https://kueue.sigs.k8s.io/docs/concepts/local_queue/) or [Cluster Queue](https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/) that you want to submit the job to. The queue should be created by cluster admins using `CreateComputeQuota`.
+ `[--priority <string>]` – Optional. Specifies the name of the [Workload Priority Class](https://kueue.sigs.k8s.io/docs/concepts/workload_priority_class/), which should be created by cluster admins.

```
hyperpod start-job \
    ... // required options
    --scheduler-type Kueue \
    --queue-name high-priority-queue \
    --priority high
```

## Running jobs from a configuration file
<a name="sagemaker-hyperpod-eks-run-jobs-hyperpod-cli-from-config"></a>

As an alternative, you can create a job configuration file that contains all the parameters required by the job, and then pass this config file to the `hyperpod start-job` command using the `--config-file` option. In this case:

1. Create your job configuration file with the required parameters. Refer to the job configuration file in the SageMaker HyperPod CLI GitHub repository for a [baseline configuration file](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-run-jobs-hyperpod-cli.html#sagemaker-hyperpod-eks-hyperpod-cli-from-config).

1. Start the job using the configuration file as follows.

   ```
   hyperpod start-job --config-file /path/to/test_job.yaml
   ```

**Tip**  
For a complete list of parameters of the `hyperpod start-job` command, see the [Submitting a Job](https://github.com/aws/sagemaker-hyperpod-cli?tab=readme-ov-file#submitting-a-job) section in the `README.md` of the SageMaker HyperPod CLI GitHub repository.

# Running jobs using `kubectl`
<a name="sagemaker-hyperpod-eks-run-jobs-kubectl"></a>

**Note**  
Training job auto resume requires Kubeflow Training Operator release version `1.7.0`, `1.8.0`, or `1.8.1`.

You must install the Kubeflow Training Operator in the clusters using a Helm chart. For more information, see [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md). Verify that the Kubeflow Training Operator control plane is properly set up by running the following command.

```
kubectl get pods -n kubeflow
```

This should return an output similar to the following.

```
NAME                                             READY   STATUS    RESTARTS   AGE
training-operator-658c68d697-46zmn               1/1     Running   0          90s
```

**To submit a training job**

To run a training job, prepare the job configuration file and run the [kubectl apply](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#apply) command as follows.

```
kubectl apply -f /path/to/training_job.yaml
```

**To describe a training job**

To retrieve the details of the job submitted to the EKS cluster, use the following command. It returns job information such as the job submission time, completion time, job status, and configuration details.

```
kubectl get pytorchjob training-job -n kubeflow -o yaml
```

**To stop a training job and delete EKS resources**

To stop a training job, use `kubectl delete`. The following is an example of stopping the training job created from the configuration file `pytorch_job_simple.yaml`.

```
kubectl delete -f /path/to/training_job.yaml 
```

This should return the following output.

```
pytorchjob.kubeflow.org "training-job" deleted
```

**To enable job auto-resume**

SageMaker HyperPod supports job auto-resume functionality for Kubernetes jobs, integrating with the Kubeflow Training Operator control plane.

Ensure that there are sufficient nodes in the cluster that have passed the SageMaker HyperPod health check. Healthy nodes have the label `sagemaker.amazonaws.com/node-health-status` set to `Schedulable`. We recommend that you include a node selector in the job YAML file to select nodes with the appropriate configuration, as follows.

```
sagemaker.amazonaws.com/node-health-status: Schedulable
```

The following code snippet is an example of how to modify a Kubeflow PyTorch job YAML configuration to enable the job auto-resume functionality. You need to add two annotations and set `restartPolicy` to `OnFailure` as follows.

```
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob 
metadata:
    name: pytorch-simple
    namespace: kubeflow
    annotations: # config for job auto-resume
      sagemaker.amazonaws.com/enable-job-auto-resume: "true"
      sagemaker.amazonaws.com/job-max-retry-count: "2"
spec:
  pytorchReplicaSpecs:
  ......
  Worker:
      replicas: 10
      restartPolicy: OnFailure
      template:
          spec:
              nodeSelector:
                  sagemaker.amazonaws.com/node-health-status: Schedulable
```

**To check the job auto-resume status**

Run the following command to check the status of job auto-resume.

```
kubectl describe pytorchjob -n kubeflow <job-name>
```

Depending on the failure pattern, you might see one of the following two restart patterns for the Kubeflow training job.

**Pattern 1**:

```
Start Time:    2024-07-11T05:53:10Z
Events:
  Type     Reason                   Age                    From                   Message
  ----     ------                   ----                   ----                   -------
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-0
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-1
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m59s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Master replica(s) failed.
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-0
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-1
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m58s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Worker replica(s) failed.
```

**Pattern 2**: 

```
Events:
  Type    Reason                   Age    From                   Message
  ----    ------                   ----   ----                   -------
  Normal  SuccessfulCreatePod      19m    pytorchjob-controller  Created pod: pt-job-2-worker-0
  Normal  SuccessfulCreateService  19m    pytorchjob-controller  Created service: pt-job-2-worker-0
  Normal  SuccessfulCreatePod      19m    pytorchjob-controller  Created pod: pt-job-2-master-0
  Normal  SuccessfulCreateService  19m    pytorchjob-controller  Created service: pt-job-2-master-0
  Normal  SuccessfulCreatePod      4m48s  pytorchjob-controller  Created pod: pt-job-2-worker-0
  Normal  SuccessfulCreatePod      4m48s  pytorchjob-controller  Created pod: pt-job-2-master-0
```

# Using the HyperPod training operator
<a name="sagemaker-eks-operator"></a>

 The Amazon SageMaker HyperPod training operator helps you accelerate generative AI model development by efficiently managing distributed training across large GPU clusters. It introduces intelligent fault recovery, hang job detection, and process-level management capabilities that minimize training disruptions and reduce costs. Unlike traditional training infrastructure that requires complete job restarts when failures occur, this operator implements surgical process recovery to keep your training jobs running smoothly. 

 The operator also works with HyperPod's health monitoring and observability functions, providing real-time visibility into training execution and automatic monitoring of critical metrics like loss spikes and throughput degradation. You can define recovery policies through simple YAML configurations without code changes, allowing you to quickly respond to and recover from unrecoverable training states. These monitoring and recovery capabilities work together to maintain optimal training performance while minimizing operational overhead.

 While Kueue is not required for this training operator, your cluster administrator can install and configure it for enhanced job scheduling capabilities. For more information, see the [official documentation for Kueue](https://kueue.sigs.k8s.io/docs/overview/).

**Note**  
To use the training operator, you must use the latest [ HyperPod AMI release](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-release-ami-eks.html). To upgrade, use the [ UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API operation. If you use [ HyperPod task governance](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance.html), it must also be the latest version.

## Supported versions
<a name="sagemaker-eks-operator-supported-versions"></a>

The HyperPod training operator only works with specific versions of Kubernetes, Kueue, and HyperPod. The following is the complete list of compatible versions.
+ Supported Kubernetes versions – 1.28, 1.29, 1.30, 1.31, 1.32, and 1.33
+ Suggested Kueue versions – [v0.12.2](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.12.2) and [v0.12.3](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.12.3)
+ The latest HyperPod AMI release. To upgrade to the latest AMI release, use the [ UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API.
+ [PyTorch 2.4.0 – 2.7.1](https://github.com/pytorch/pytorch/releases)

**Note**  
We collect certain routine aggregated and anonymized operational metrics to provide essential service availability. The creation of these metrics is fully automated and doesn't involve human review of the underlying model training workload. These metrics relate to job operations, resource management, and essential service functionality.

# Installing the training operator
<a name="sagemaker-eks-operator-install"></a>

See the following sections to learn about how to install the training operator.

## Prerequisites
<a name="sagemaker-eks-operator-prerequisites"></a>

 Before you use the HyperPod training operator, you must have completed the following prerequisites: 
+  [ Created a HyperPod cluster with Amazon EKS orchestration](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-create-cluster.html). 
+ Installed the latest AMI on your HyperPod cluster. For more information, see [SageMaker HyperPod AMI releases for Amazon EKS](sagemaker-hyperpod-release-ami-eks.md).
+ [Installed cert-manager](https://cert-manager.io/docs/installation/).
+  [ Set up the EKS Pod Identity Agent using the console](https://docs.aws.amazon.com/eks/latest/userguide/pod-id-agent-setup.html). If you want to use the AWS CLI, use the following command: 

  ```
  aws eks create-addon \
   --cluster-name my-eks-cluster \
   --addon-name eks-pod-identity-agent \
   --region <aws-region>
  ```
+ (Optional) If you run your HyperPod cluster nodes in a private VPC, you must set up AWS PrivateLink VPC endpoints for the Amazon SageMaker AI API (`com.amazonaws.aws-region.sagemaker.api`) and the Amazon EKS Auth service (`com.amazonaws.aws-region.eks-auth`). You must also make sure that your cluster nodes run in subnets and security groups that allow traffic to route through the VPC endpoints to communicate with SageMaker AI and Amazon EKS. If these aren't properly set up, the add-on installation can fail. To learn more about setting up VPC endpoints, see [Create a VPC endpoint](https://docs.aws.amazon.com/vpc/latest/privatelink/create-interface-endpoint.html#create-interface-endpoint-aws).

## Installing the training operator
<a name="sagemaker-eks-operator-install-operator"></a>

You can install the HyperPod training operator through the SageMaker AI console, the Amazon EKS console, or with the AWS CLI. The console methods offer simplified experiences that help you install the operator. The AWS CLI offers a programmatic approach that lets you customize more of your installation.

Between the two console experiences, the SageMaker AI console provides a one-click installation that creates the IAM execution role, creates the pod identity association, and installs the operator. The Amazon EKS console installation is similar, but this method doesn't automatically create the IAM execution role. During this process, you can choose to create a new IAM execution role with information that the console pre-populates. By default, these created roles only have access to the current cluster that you're installing the operator in. Unless you edit the role's permissions to include other clusters, if you remove and reinstall the operator, you must create a new role.

------
#### [ SageMaker AI console (recommended) ]

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Go to your cluster's details page.

1. On the **Dashboard** tab, locate the add-on named **Amazon SageMaker HyperPod training operator**, and choose **install**. During the installation process, SageMaker AI creates an IAM execution role with permissions similar to the [ AmazonSageMakerHyperPodTrainingOperatorAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerHyperPodTrainingOperatorAccess.html) managed policy and creates a pod identity association between your Amazon EKS cluster and your new execution role.

------
#### [ Amazon EKS console ]

**Note**  
If you install the add-on through the Amazon EKS console, first make sure that you've tagged your HyperPod cluster with the key-value pair `SageMaker:true`. Otherwise, the installation will fail.

1. Open the Amazon EKS console at [https://console.aws.amazon.com/eks/home#/clusters](https://console.aws.amazon.com/eks/home#/clusters).

1. Go to your EKS cluster, choose **Add-ons**, then choose **Get more add-ons**.

1. Choose **Amazon SageMaker HyperPod training operator**, then choose **Next**.

1. Under **Version**, the console defaults to the latest version, which we recommend that you use.

1. Under **Add-on access**, choose a pod identity IAM role to use with the training operator add-on. If you don't already have a role, choose **Create recommended role** to create one.

1. During this role creation process, the IAM console pre-populates all of the necessary information, such as the use case, the [ AmazonSageMakerHyperPodTrainingOperatorAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerHyperPodTrainingOperatorAccess.html) managed policy and other required permissions, the role name, and the description. As you go through the steps, review the information, and choose **Create role**.

1. In the EKS console, review your add-on's settings, and then choose **Create**.

------
#### [ CLI ]

1. Make sure that the IAM execution role for your HyperPod cluster has a trust relationship that allows EKS Pod Identity to assume the role, or [create a new IAM role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create.html) with the following trust policy. Alternatively, you can use the Amazon EKS console to install the add-on, which creates a recommended role.

------
#### [ JSON ]


   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
         "Effect": "Allow",
         "Principal": {
           "Service": "pods.eks.amazonaws.com"
         },
         "Action": [
           "sts:AssumeRole",
           "sts:TagSession",
           "eks-auth:AssumeRoleForPodIdentity"
         ]
       }
     ]
   }
   ```

------

1.  Attach the [ AmazonSageMakerHyperPodTrainingOperatorAccess managed policy](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerHyperPodTrainingOperatorAccess.html) to your created role. 

1. [Create a pod identity association](https://docs.aws.amazon.com/eks/latest/userguide/pod-identities.html) between your EKS cluster and your new IAM role.

   ```
   aws eks create-pod-identity-association \
   --cluster-name my-eks-cluster \
   --role-arn <execution-role-arn> \
   --namespace aws-hyperpod \
   --service-account hp-training-operator-controller-manager \
   --region <aws-region>
   ```

1. After you finish the process, you can use the `list-pod-identity-associations` operation to see the association that you created. The following is a sample response of what it might look like.

   ```
   aws eks list-pod-identity-associations --cluster-name my-eks-cluster
   {
       "associations": [{
           "clusterName": "my-eks-cluster",
           "namespace": "aws-hyperpod",
           "serviceAccount": "hp-training-operator-controller-manager",
           "associationArn": "arn:aws:eks:us-east-2:123456789012:podidentityassociation/my-hyperpod-cluster/a-1a2b3c4d5e6f7g8h9",
           "associationId": "a-1a2b3c4d5e6f7g8h9"
       }]
   }
   ```

1. To install the training operator, use the `create-addon` operation. The `--addon-version` parameter is optional. If you don’t provide one, the default is the latest version. To get the possible versions, use the [ DescribeAddonVersions](https://docs.aws.amazon.com/eks/latest/APIReference/API_DescribeAddonVersions.html) operation.

   ```
   aws eks create-addon \
     --cluster-name my-eks-cluster \
     --addon-name amazon-sagemaker-hyperpod-training-operator \
     --resolve-conflicts OVERWRITE
   ```

------

If you already have the training operator installed on your HyperPod cluster, you can update the EKS add-on to the version that you want. If you want to use [ checkpointless training](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless.html) or [ elastic training](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-elastic-training.html), consider the following:
+ Both checkpointless training and elastic training require the EKS add-on to be on version 1.2.0 or above.
+ The Amazon SageMaker HyperPod training operator maintains backwards compatibility for any EKS add-on version, so you can upgrade from any add-on version to 1.2.0 or above.
+ If you downgrade from versions 1.2.0 or above to a lower version, you must first delete the existing jobs before the downgrade and resubmit the jobs after the downgrade is complete.

------
#### [ Amazon EKS Console ]

1. Open the Amazon EKS console at [https://console.aws.amazon.com/eks/home#/clusters](https://console.aws.amazon.com/eks/home#/clusters).

1. Go to your EKS cluster, and choose **Add-ons**. Then, choose the Amazon SageMaker HyperPod training operator add-on and choose **Edit**.

1. In the **Version** menu, choose the version of the add-on that you want, then choose **Save changes**.

------
#### [ CLI ]

1. First, get the list of supported versions of the add-on for your cluster.

   ```
   aws eks describe-addon-versions \
     --kubernetes-version $(aws eks describe-cluster --name my-eks-cluster --query 'cluster.version' --output text) \
     --addon-name amazon-sagemaker-hyperpod-training-operator \
     --query 'addons[0].addonVersions[].addonVersion' \
     --output table
   ```

1. Then update the add-on to the version that you want.

   ```
   aws eks update-addon \
     --cluster-name my-eks-cluster \
     --addon-name amazon-sagemaker-hyperpod-training-operator \
     --addon-version target-version \
     --resolve-conflicts OVERWRITE
   ```

------

The training operator comes with a number of options whose default values might fit your use case. We recommend that you try the training operator with the default values before changing them. The following table describes all of the parameters and when you might want to configure each one.


| Parameter | Description | Default | 
| --- | --- | --- | 
| hpTrainingControllerManager.manager.resources.requests.cpu | How many processors to allocate for the controller | 1 | 
| hpTrainingControllerManager.manager.resources.requests.memory | How much memory to allocate to the controller | 2Gi | 
| hpTrainingControllerManager.manager.resources.limits.cpu | The CPU limit for the controller | 2 | 
| hpTrainingControllerManager.manager.resources.limits.memory | The memory limit for the controller | 4Gi | 
| hpTrainingControllerManager.nodeSelector | Node selector for the controller pods | Default behavior is to select nodes with the label sagemaker.amazonaws.com/compute-type: "HyperPod" | 

## HyperPod elastic agent
<a name="sagemaker-eks-operator-elastic-agent"></a>

The HyperPod elastic agent is an extension of [PyTorch's ElasticAgent](https://docs.pytorch.org/docs/stable/elastic/agent.html). It orchestrates the lifecycles of training workers on each container and communicates with the HyperPod training operator. To use the HyperPod training operator, you must first install the HyperPod elastic agent into your training image before you can submit and run jobs using the operator. The following is a Dockerfile that installs the HyperPod elastic agent and uses `hyperpodrun` to create the job launcher.

**Note**  
Both [ checkpointless training](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless.html) and [ elastic training](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-elastic-training.html) require that you use HyperPod elastic agent version 1.1.0 or above.

```
RUN pip install hyperpod-elastic-agent

ENTRYPOINT ["entrypoint.sh"]
# entrypoint.sh
...
hyperpodrun --nnodes=node_count --nproc-per-node=proc_count \
            --rdzv-backend hyperpod \ # Optional
            --inprocess-restart \ # Optional (in-process fault recovery with checkpointless training)
            ... # Other torchrun args
            # pre-train arg_group
            --pre-train-script pre.sh --pre-train-args "pre_1 pre_2 pre_3" \
            # post-train arg_group
            --post-train-script post.sh --post-train-args "post_1 post_2 post_3" \
            training.py --script-args
```

You can now submit jobs with `kubectl`.

### HyperPod elastic agent arguments
<a name="sagemaker-eks-operator-elastic-agent-args"></a>

The HyperPod elastic agent supports all of the original arguments and adds some additional ones. The following table lists all of the arguments available in the HyperPod elastic agent. For more information about PyTorch's ElasticAgent, see the [official documentation](https://docs.pytorch.org/docs/stable/elastic/agent.html).


| Argument | Description | Default Value | 
| --- | --- | --- | 
| --shutdown-signal | Signal to send to workers for shutdown (SIGTERM or SIGKILL) | "SIGKILL" | 
| --shutdown-timeout | Timeout in seconds between shutdown-signal and SIGKILL signals | 15 | 
| --server-host | Agent server address | "0.0.0.0" | 
| --server-port | Agent server port | 8080 | 
| --server-log-level | Agent server log level | "info" | 
| --server-shutdown-timeout | Server shutdown timeout in seconds | 300 | 
| --pre-train-script | Path to pre-training script | None | 
| --pre-train-args | Arguments for pre-training script | None | 
| --post-train-script | Path to post-training script | None | 
| --post-train-args | Arguments for post-training script | None | 
| --inprocess-restart | Flag specifying whether to use the in-process restart feature | FALSE | 
| --inprocess-timeout | Time in seconds that the agent waits for workers to reach a synchronization barrier before triggering a process-level restart. | None | 

## Task governance (optional)
<a name="sagemaker-eks-operator-task-governance"></a>

The training operator is integrated with [ HyperPod task governance](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance), a robust management system designed to streamline resource allocation and ensure efficient utilization of compute resources across teams and projects for your Amazon EKS clusters. To set up HyperPod task governance, see [Setup for SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance-setup.md). 

**Note**  
When installing the HyperPod task governance add-on, you must use version v1.3.0-eksbuild.1 or higher.

When submitting a job, make sure that you include the queue name and priority class labels, which follow the patterns `hyperpod-ns-team-name-localqueue` and `priority-class-name-priority`. For example, if you're using Kueue, your labels become the following:
+ kueue.x-k8s.io/queue-name: hyperpod-ns-*team-name*-localqueue
+ kueue.x-k8s.io/priority-class: *priority-class-name*-priority
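As a quick illustration of these naming patterns, the following Python sketch composes both label values from a team name and a priority class name. Both values (`research` and `high`) are hypothetical; substitute the names your cluster admins created.

```python
# Hypothetical values; substitute your own team and Workload Priority Class names.
team_name = "research"
priority_class_name = "high"

# Compose the Kueue label values following the documented patterns.
queue_label = f"hyperpod-ns-{team_name}-localqueue"
priority_label = f"{priority_class_name}-priority"

print(queue_label)     # hyperpod-ns-research-localqueue
print(priority_label)  # high-priority
```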

The following is an example of what your configuration file might look like:

```
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPytorchJob
metadata:
  name: hp-task-governance-sample
  namespace: hyperpod-ns-team-name
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
    kueue.x-k8s.io/priority-class: priority-class-priority
spec:
  nprocPerNode: "1"
  runPolicy:
    cleanPodPolicy: "None"
  replicaSpecs: 
    - name: pods
      replicas: 4
      spares: 2
      template:
        spec:
          containers:
            - name: ptjob
              image: XXXX
              imagePullPolicy: Always
              ports:
                - containerPort: 8080
              resources:
                requests:
                  cpu: "2"
```

Then use the following kubectl command to apply the YAML file.

```
kubectl apply -f task-governance-job.yaml
```

## Kueue (optional)
<a name="sagemaker-eks-operator-kueue"></a>

While you can run jobs directly, your organization can also integrate the training operator with Kueue to allocate resources and schedule jobs. Follow the steps below to install Kueue into your HyperPod cluster.

1. Follow the installation guide in the [ official Kueue documentation](https://kueue.sigs.k8s.io/docs/installation/#install-a-custom-configured-released-version). When you reach the step of configuring `controller_manager_config.yaml`, add the following configuration:

   ```
   externalFrameworks:
   - "HyperPodPytorchJob.v1.sagemaker.amazonaws.com"
   ```

1. Follow the rest of the steps in the official installation guide. After you finish installing Kueue, you can create some sample queues with the `kubectl apply -f sample-queues.yaml` command. Use the following YAML file.

   ```
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: ClusterQueue
   metadata:
     name: cluster-queue
   spec:
     namespaceSelector: {}
     preemption:
       withinClusterQueue: LowerPriority
     resourceGroups:
     - coveredResources:
       - cpu
       - nvidia.com/gpu
       - pods
       flavors:
       - name: default-flavor
         resources:
         - name: cpu
           nominalQuota: 16
         - name: nvidia.com/gpu
           nominalQuota: 16
         - name: pods
           nominalQuota: 16
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: LocalQueue
   metadata:
     name: user-queue
     namespace: default
   spec:
     clusterQueue: cluster-queue
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: ResourceFlavor
   metadata:
     name: default-flavor
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   description: High priority
   kind: WorkloadPriorityClass
   metadata:
     name: high-priority-class
   value: 1000
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   description: Low Priority
   kind: WorkloadPriorityClass
   metadata:
     name: low-priority-class
   value: 500
   ```

# Using the training operator to run jobs
<a name="sagemaker-eks-operator-usage"></a>

To use `kubectl` to run a job, you must create a `job.yaml` file that specifies the job configuration, and then run `kubectl apply -f job.yaml` to submit the job. In this YAML file, you can specify custom configurations in the `logMonitoringConfiguration` argument to define automated monitoring rules that analyze log output from the distributed training job to detect problems and recover from them.
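The `logPattern` values in these monitoring rules are ordinary regular expressions matched against the job's log lines. The following standalone Python sketch shows how a hang-detection pattern of this form captures the loss value from a per-batch log line; the log line itself is illustrative.

```python
import re

# A hang-detection pattern of the same shape as a JobHangingDetection rule:
# it matches per-batch log lines and captures the numeric training loss.
# (In the job YAML, the backslashes are doubled because the pattern sits
# inside a double-quoted YAML string.)
pattern = re.compile(
    r".*\[Epoch 0 Batch \d+.*'training_loss_step': (\d+(\.\d+)?).*"
)

# Illustrative training log line.
line = "[Epoch 0 Batch 42] {'training_loss_step': 0.731, 'lr': 0.0003}"

match = pattern.match(line)
print(match is not None)  # True: the rule sees training progress
print(match.group(1))     # 0.731: the captured loss value
```

Note that a line such as `'training_loss_step': nan` does not match the numeric capture group, so a rule like this also flags non-numeric loss values as a hang.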

```
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  labels:
    app.kubernetes.io/name: HyperPod
    app.kubernetes.io/managed-by: kustomize
  name: &jobname xxx
  annotations:
    XXX: XXX
    ......
spec:
  nprocPerNode: "X"
  replicaSpecs:
    - name: 'XXX'
      replicas: 16
      template:
        spec:
          nodeSelector:
            beta.kubernetes.io/instance-type: ml.p5.48xlarge
          containers:
            - name: XXX
              image: XXX
              imagePullPolicy: Always
              ports:
                - containerPort: 8080 # This is the port that HyperPodElasticAgent listens to
              resources:
                limits:
                  nvidia.com/gpu: 8
                  hugepages-2Mi: 5120Mi
                requests:
                  nvidia.com/gpu: 8
                  hugepages-2Mi: 5120Mi
                  memory: 32000Mi
          ......        
  runPolicy:
    jobMaxRetryCount: 50
    restartPolicy:
      numRestartBeforeFullJobRestart: 3 
      evalPeriodSeconds: 21600 
      maxFullJobRestarts: 1
    cleanPodPolicy: "All"
    logMonitoringConfiguration: 
      - name: "JobStart"
        logPattern: ".*Experiment configuration.*" # This is the start of the training script
        expectedStartCutOffInSeconds: 120 # Expected match in the first 2 minutes
      - name: "JobHangingDetection"
        logPattern: ".*\\[Epoch 0 Batch \\d+.*'training_loss_step': (\\d+(\\.\\d+)?).*"
        expectedRecurringFrequencyInSeconds: 300 # If the next batch is not printed within 5 minutes, or the loss is not a decimal value (for example, nan), consider the job hanging.
        expectedStartCutOffInSeconds: 600 # Allow 10 minutes of job startup time
      - name: "NoS3CheckpointingDetection"
        logPattern: ".*The checkpoint is finalized. All shards is written.*"
        expectedRecurringFrequencyInSeconds: 600 # If next checkpoint s3 upload doesn't happen within 10 mins, mark it hang.
        expectedStartCutOffInSeconds: 1800 # Allow 30 minutes for first checkpoint upload
      - name: "LowThroughputDetection"
        logPattern: ".*\\[Epoch 0 Batch \\d+.*'samples\\/sec': (\\d+(\\.\\d+)?).*"
        metricThreshold: 80 # 80 samples/sec
        operator: "lteq"
        metricEvaluationDataPoints: 25 # if throughput lower than threshold for 25 datapoints, kill the job
```

If you want to use the log monitoring options, make sure that you're emitting the training logs to `sys.stdout`. The HyperPod elastic agent monitors training logs in `sys.stdout`, which is saved at `/tmp/hyperpod/`. You can use the following logging configuration to emit training logs.

```
import logging
import sys

logging.basicConfig(format="%(asctime)s [%(levelname)s] %(name)s: %(message)s", level=logging.INFO, stream=sys.stdout)
```

The following table describes the possible run policy and log monitoring configurations:


| Parameter | Usage | 
| --- | --- | 
| jobMaxRetryCount | Maximum number of restarts at the process level. | 
| restartPolicy: numRestartBeforeFullJobRestart | Maximum number of restarts at the process level before the operator restarts at the job level. | 
| restartPolicy: evalPeriodSeconds | The period for evaluating the restart limit, in seconds. | 
| restartPolicy: maxFullJobRestarts | Maximum number of full job restarts before the job fails. | 
| cleanPodPolicy | Specifies the pods that the operator should clean. Accepted values are All, OnlyComplete, and None. | 
| logMonitoringConfiguration | The log monitoring rules for slow and hanging job detection. | 
| expectedRecurringFrequencyInSeconds | Time interval between two consecutive LogPattern matches after which the rule evaluates to HANGING. If not specified, no time constraint exists between consecutive LogPattern matches. | 
| expectedStartCutOffInSeconds | Time to first LogPattern match after which the rule evaluates to HANGING. If not specified, no time constraint exists for the first LogPattern match. | 
| logPattern | Regular expression that identifies the log lines that the rule applies to when the rule is active. | 
| metricEvaluationDataPoints | Number of consecutive times a rule must evaluate to SLOW before marking a job as SLOW. If not specified, the default is 1. | 
| metricThreshold | Threshold for value extracted by LogPattern with a capturing group. If not specified, metric evaluation is not performed. | 
| operator | The inequality to apply to the monitoring configuration. Accepted values are gt, gteq, lt, lteq, and eq. | 
| stopPattern | Regular expression that identifies the log line at which to deactivate the rule. If not specified, the rule will always be active. | 
| faultOnMatch | Indicates whether a match of LogPattern should immediately trigger a job fault. When true, the job will be marked as faulted as soon as the LogPattern is matched, regardless of other rule parameters. When false or not specified, the rule will evaluate to SLOW or HANGING based on other parameters. | 
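To make the metric-based parameters concrete, the following Python sketch shows how a rule like `LowThroughputDetection` could combine `logPattern`, `metricThreshold`, `operator`, and `metricEvaluationDataPoints`. This is an illustration of the evaluation logic only, not the operator's actual implementation.

```
import re

# Illustrative only: a simplified evaluator for a metric-based log
# monitoring rule. The real evaluation is performed by the operator.
OPERATORS = {
    "gt": lambda v, t: v > t,
    "gteq": lambda v, t: v >= t,
    "lt": lambda v, t: v < t,
    "lteq": lambda v, t: v <= t,
    "eq": lambda v, t: v == t,
}

def evaluate_rule(log_lines, log_pattern, metric_threshold, operator,
                  metric_evaluation_data_points):
    """Return True (job is SLOW) if the metric extracted by the first
    capturing group breaches the threshold for N consecutive data points."""
    compare = OPERATORS[operator]
    consecutive = 0
    for line in log_lines:
        match = re.search(log_pattern, line)
        if not match:
            continue  # The rule only applies to matching log lines
        value = float(match.group(1))
        consecutive = consecutive + 1 if compare(value, metric_threshold) else 0
        if consecutive >= metric_evaluation_data_points:
            return True
    return False

# Example: throughput at or below 80 samples/sec for 3 consecutive points
logs = [
    "[Epoch 0 Batch 1] {'samples/sec': 95.2}",
    "[Epoch 0 Batch 2] {'samples/sec': 70.1}",
    "[Epoch 0 Batch 3] {'samples/sec': 65.0}",
    "[Epoch 0 Batch 4] {'samples/sec': 60.3}",
]
pattern = r".*\[Epoch 0 Batch \d+.*'samples\/sec': (\d+(\.\d+)?).*"
print(evaluate_rule(logs, pattern, 80, "lteq", 3))  # True
```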

 For more training resiliency, specify spare node configuration details. If your job fails, the operator works with Kueue to use nodes reserved in advance to continue running the job. Spare node configurations require Kueue, so if you try to submit a job with spare nodes but don’t have Kueue installed, the job will fail. The following example is a sample `job.yaml` file that contains spare node configurations.

```
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue # Specify the queue to run the job.
  name: hyperpodpytorchjob-sample
spec:
  nprocPerNode: "1"
  runPolicy:
    cleanPodPolicy: "None"
  replicaSpecs: 
    - name: pods
      replicas: 1
      spares: 1 # Specify how many spare nodes to reserve.
      template:
        spec:
          containers:
            - name: XXX
              image: XXX
              
              imagePullPolicy: Always
              ports:
                - containerPort: 8080
              resources:
                requests:
                  nvidia.com/gpu: "0"
                limits:
                  nvidia.com/gpu: "0"
```

## Monitoring
<a name="sagemaker-eks-operator-usage-monitoring"></a>

Amazon SageMaker HyperPod is integrated with [observability with Amazon Managed Grafana and Amazon Managed Service for Prometheus](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-observability-addon.html), so you can set up monitoring to collect and feed metrics into these observability tools.

Alternatively, you can scrape metrics through Amazon Managed Service for Prometheus without managed observability. To do so, include the metrics that you want to monitor into your `job.yaml` file when you run jobs with `kubectl`.

```
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hyperpod-training-operator
  namespace: aws-hyperpod
spec:
  ......
  endpoints:
    - port: 8081
      path: /metrics
      interval: 15s
```

The following are metrics that the training operator emits that you can feed into Amazon Managed Service for Prometheus to monitor your training jobs.


| Metric | Description | 
| --- | --- | 
| hyperpod_training_operator_jobs_created_total | Total number of jobs that the training operator has run | 
| hyperpod_training_operator_jobs_restart_latency | Current job restart latency | 
| hyperpod_training_operator_jobs_fault_detection_latency | Fault detection latency | 
| hyperpod_training_operator_jobs_deleted_total | Total number of deleted jobs | 
| hyperpod_training_operator_jobs_successful_total | Total number of completed jobs | 
| hyperpod_training_operator_jobs_failed_total | Total number of failed jobs | 
| hyperpod_training_operator_jobs_restarted_total | Total number of auto-restarted jobs | 

## Sample docker configuration
<a name="sagemaker-eks-operator-usage-docker"></a>

The following is a sample Docker entrypoint script that launches training with the `hyperpodrun` command.

```
export AGENT_CMD="--backend=nccl"
exec hyperpodrun --server-host=${AGENT_HOST} --server-port=${AGENT_PORT} \
    --tee=3 --log_dir=/tmp/hyperpod \
    --nnodes=${NNODES} --nproc-per-node=${NPROC_PER_NODE} \
    --pre-train-script=/workspace/echo.sh --pre-train-args='Pre-training script' \
    --post-train-script=/workspace/echo.sh --post-train-args='Post-training script' \
    /workspace/mnist.py --epochs=1000 ${AGENT_CMD}
```

## Sample log monitoring configurations
<a name="sagemaker-eks-operator-usage-log-monitoring"></a>

**Job hang detection**

To detect hanging jobs, use the following configuration. It uses the following parameters:
+ expectedStartCutOffInSeconds – how long the monitor should wait before expecting the first logs
+ expectedRecurringFrequencyInSeconds – the time interval to wait for the next batch of logs

With these settings, the log monitor expects to see a log line matching the regex pattern `.*Train Epoch.*` within 60 seconds after the training job starts. After the first appearance, the monitor expects to see matching log lines every 10 seconds. If the first logs don't appear within 60 seconds or subsequent logs don't appear every 10 seconds, the HyperPod elastic agent treats the container as stuck and coordinates with the training operator to restart the job.

```
runPolicy:
    jobMaxRetryCount: 10
    cleanPodPolicy: "None"
    logMonitoringConfiguration:
      - name: "JobStartGracePeriod"
        # Sample log line: [default0]:2025-06-17 05:51:29,300 [INFO] __main__: Train Epoch: 5 [0/60000 (0%)]       loss=0.8470
        logPattern: ".*Train Epoch.*"  
        expectedStartCutOffInSeconds: 60 
      - name: "JobHangingDetection"
        logPattern: ".*Train Epoch.*"
        expectedRecurringFrequencyInSeconds: 10 # if the next batch is not printed within 10 seconds
```
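The interplay of these two timing parameters can be sketched as follows. This is a simplified illustration of the hang-detection logic, not the elastic agent's actual code.

```
def detect_hang(match_timestamps, job_start_time, now,
                expected_start_cutoff=None, expected_recurring_frequency=None):
    """Return True if the rule evaluates to HANGING.

    match_timestamps: sorted times (in seconds) at which a log line
    matched the rule's logPattern.
    """
    if not match_timestamps:
        # No match yet: hanging only if the first-match cutoff has passed
        return (expected_start_cutoff is not None
                and now - job_start_time > expected_start_cutoff)
    # Matches exist: hanging if the most recent match is too old
    return (expected_recurring_frequency is not None
            and now - match_timestamps[-1] > expected_recurring_frequency)

# First log must appear within 60 seconds; then one match every 10 seconds
print(detect_hang([], job_start_time=0, now=70, expected_start_cutoff=60))  # True
print(detect_hang([5, 15, 25], job_start_time=0, now=30,
                  expected_recurring_frequency=10))                         # False
print(detect_hang([5, 15, 25], job_start_time=0, now=40,
                  expected_recurring_frequency=10))                         # True
```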

**Training loss spike**

The following monitoring configuration watches training logs for the pattern `xxx training_loss_step xx`. It uses the `metricEvaluationDataPoints` parameter, which lets you specify a threshold of data points before the operator restarts the job. If the training loss value is more than 2.0 for the specified number of data points, the operator restarts the job.

```
runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "LossSpikeDetection"
      logPattern: ".*training_loss_step (\\d+(?:\\.\\d+)?).*"   # training_loss_step 5.0
      metricThreshold: 2.0
      operator: "gt"
      metricEvaluationDataPoints: 5 # if loss higher than threshold for 5 data points, restart the job
```

**Low TFLOPs detection**

The following monitoring configuration watches training logs for the pattern `xx TFLOPs xx`, expected every five seconds. If TFLOPs is less than 100 for 5 data points, the operator restarts the training job.

```
runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "TFLOPs"
      logPattern: ".* (.+)TFLOPs.*"    # Training model, speed: X TFLOPs...
      expectedRecurringFrequencyInSeconds: 5        
      metricThreshold: 100       # if Tflops is less than 100 for 5 data points, restart the job       
      operator: "lt"
      metricEvaluationDataPoints: 5
```

**Training script error log detection**

The following monitoring configuration detects whether the pattern specified in `logPattern` is present in the training logs. As soon as the training operator encounters the error pattern, it treats it as a fault and restarts the job.

```
runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "GPU Error"
      logPattern: ".*RuntimeError.*out of memory.*"
      faultOnMatch: true
```

# Troubleshooting
<a name="sagemaker-eks-operator-troubleshooting"></a>

See the following sections to learn how to troubleshoot errors when using the training operator.

## I can't install the training operator
<a name="sagemaker-eks-operator-troubleshooting-installation-error"></a>

If you can't install the training operator, make sure that you're using the [supported versions of components](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html#sagemaker-eks-operator-supported-versions). For example, if you get an error that your HyperPod AMI release is incompatible with the training operator, [update to the latest version](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html).

## Incompatible HyperPod task governance version
<a name="sagemaker-eks-operator-troubleshooting-task-governance-version"></a>

During installation, you might get an error message that the version of HyperPod task governance is incompatible. The training operator works only with version v1.3.0-eksbuild.1 or higher. Update your HyperPod task governance add-on and try again. 

## Missing permissions
<a name="sagemaker-eks-operator-troubleshooting-task-missing-permissions"></a>

 While you're setting up the training operator or running jobs, you might receive errors that you're not authorized to run certain operations, such as `DescribeClusterNode`. To resolve these errors, make sure you correctly set up IAM permissions while you're [setting up the Amazon EKS Pod Identity Agent](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator-install.html#sagemaker-eks-operator-install-pod-identity).

# Using elastic training in Amazon SageMaker HyperPod
<a name="sagemaker-eks-elastic-training"></a>

 Elastic training is a new Amazon SageMaker HyperPod capability that automatically scales training jobs based on compute resource availability and workload priority. Elastic training jobs can start with minimum compute resources required for model training and dynamically scale up or down through automatic checkpointing and resumption across different node configurations (world size). Scaling is achieved by automatically adjusting the number of data-parallel replicas. During high cluster utilization periods, elastic training jobs can be configured to automatically scale down in response to resource requests from higher-priority jobs, freeing up compute for critical workloads. When resources free up during off-peak periods, elastic training jobs automatically scale back up to accelerate training, then scale back down when higher-priority workloads need resources again. 

Elastic training is built on top of the HyperPod training operator and integrates the following components:
+ [Amazon EKS for Kubernetes orchestration](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks.html)
+ [Amazon SageMaker HyperPod Task Governance](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance.html) for job queuing, prioritization, and scheduling
+ [PyTorch Distributed Checkpoint (DCP)](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html) for scalable state and checkpoint management

**Supported frameworks**
+ PyTorch with Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP)
+ PyTorch Distributed Checkpoint (DCP)

## Prerequisites
<a name="sagemaker-eks-elastic-prereqs"></a>

### SageMaker HyperPod EKS Cluster
<a name="sagemaker-eks-elastic-hyperpod-cluster"></a>

You must have a running SageMaker HyperPod cluster with Amazon EKS orchestration. For information on creating a HyperPod EKS cluster, see:
+ [Getting started with Amazon EKS in SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-create-cluster.html)

### SageMaker HyperPod Training Operator
<a name="sagemaker-eks-elastic-training-operator"></a>

Elastic training is supported in training operator v1.2 and later.

To install the training operator as an EKS add-on, see [https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator-install.html](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator-install.html).

### (Recommended) Install and configure Task Governance and Kueue
<a name="sagemaker-eks-elastic-task-governance"></a>

We recommend installing and configuring Kueue via [HyperPod Task Governance](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance.html) to specify workload priorities with elastic training. Kueue provides stronger workload management with queuing, prioritization, gang scheduling, resource tracking, and graceful preemption, which are essential for operating in multi-tenant training environments.
+ Gang scheduling ensures that all required pods of a training job start together. This prevents situations where some pods start while others remain pending, which could cause wasted resources.
+ Graceful preemption allows lower-priority elastic jobs to yield resources to higher-priority workloads. Elastic jobs can scale down gracefully without being forcibly evicted, improving overall cluster stability.

We recommend configuring the following Kueue components:
+ PriorityClasses to define relative job importance
+ ClusterQueues to manage global resource sharing and quotas across teams or workloads
+ LocalQueues to route jobs from individual namespaces into the appropriate ClusterQueue

For more advanced setups, you can also incorporate:
+ Fair-share policies to balance resource usage across multiple teams
+ Custom preemption rules to enforce organizational SLAs or cost controls

For more information, see:
+ [https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance.html](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance.html)
+ [Kueue Documentation](https://kueue.sigs.k8s.io/)

### (Recommended) Set up user namespaces and resource quotas
<a name="sagemaker-eks-elastic-namespaces-quotas"></a>

When deploying this feature on Amazon EKS, we recommend applying a set of foundational cluster-level configurations to ensure isolation, resource fairness, and operational consistency across teams.

#### Namespace and Access Configuration
<a name="sagemaker-eks-elastic-namespace-access"></a>

Organize your workloads using separate namespaces for each team or project. This allows you to apply fine-grained isolation and governance. We also recommend configuring AWS IAM to Kubernetes RBAC mapping to associate individual IAM users or roles with their corresponding namespaces.

Key practices include:
+ Map IAM roles to Kubernetes service accounts using IAM Roles for Service Accounts (IRSA) when workloads need AWS permissions. [https://docs.aws.amazon.com/eks/latest/userguide/access-entries.html](https://docs.aws.amazon.com/eks/latest/userguide/access-entries.html)
+ Apply RBAC policies to restrict users to only their designated namespaces (e.g., `Role`/`RoleBinding` rather than cluster-wide permissions).

#### Resource and Compute Constraints
<a name="sagemaker-eks-elastic-resource-constraints"></a>

To prevent resource contention and ensure fair scheduling across teams, apply quotas and limits at the namespace level:
+ ResourceQuotas to cap aggregate CPU, memory, storage, and object counts (pods, PVCs, services, etc.).
+ LimitRanges to enforce default and maximum per-pod or per-container CPU and memory limits.
+ PodDisruptionBudgets (PDBs) as needed to define resiliency expectations.
+ Optional: Namespace-level queueing constraints (e.g., via Task Governance or Kueue) to prevent users from over-submitting jobs.

These constraints help maintain cluster stability and support predictable scheduling for distributed training workloads.

#### Auto-scaling
<a name="sagemaker-eks-elastic-autoscaling"></a>

SageMaker HyperPod on EKS supports cluster autoscaling through Karpenter. When Karpenter or a similar resource provisioner is used together with elastic training, both the cluster and the elastic training job may scale up automatically after a job is submitted. This is because the elastic training operator takes a greedy approach: as part of elastic job execution, it continuously requests more than the available compute resources until it reaches the maximum limit set by the job, which can trigger node provisioning. Continuous resource provisioners like Karpenter serve these requests by scaling up the compute cluster.

To keep these scale-ups predictable and under control, we recommend configuring namespace-level ResourceQuotas in the namespaces where elastic training jobs are created. ResourceQuotas help limit the maximum resources that jobs can request, preventing unbounded cluster growth while still allowing elastic behavior within defined limits.

For example, a ResourceQuota for 8 ml.p5.48xlarge instances will have the following form:

```
apiVersion: v1
kind: ResourceQuota
metadata:
  name: <quota-name>
  namespace: <namespace-name>
spec:
  hard:
    nvidia.com/gpu: "64"
    vpc.amazonaws.com/efa: "256"
    requests.cpu: "1536"
    requests.memory: "5120Gi"
    limits.cpu: "1536"
    limits.memory: "5120Gi"
```

## Build Training Container
<a name="sagemaker-eks-elastic-build-container"></a>

The HyperPod training operator works with a custom PyTorch launcher provided by the HyperPod elastic agent Python package ([https://www.piwheels.org/project/hyperpod-elastic-agent/](https://www.piwheels.org/project/hyperpod-elastic-agent/)). You must install the elastic agent and replace the `torchrun` command with `hyperpodrun` to launch training. For more details, see:

[https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator-install.html#sagemaker-eks-operator-elastic-agent](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator-install.html#sagemaker-eks-operator-elastic-agent)

An example training container:

```
FROM ...

...

RUN pip install hyperpod-elastic-agent
ENTRYPOINT ["entrypoint.sh"]

# entrypoint.sh ...
hyperpodrun --nnodes=node_count --nproc-per-node=proc_count \
  --rdzv-backend hyperpod \
 # Optional ...
 # Other torchrun args
 # pre-train arg_group
 --pre-train-script pre.sh --pre-train-args "pre_1 pre_2 pre_3" \
 # post-train arg_group
 --post-train-script post.sh --post-train-args "post_1 post_2 post_3" \
 training.py --script-args
```

## Training code modification
<a name="sagemaker-eks-elastic-training-code"></a>

SageMaker HyperPod provides a set of recipes that are already configured to run with an elastic policy.

To enable elastic training for custom PyTorch training scripts, you need to make minor modifications to your training loop. This guide walks you through the modifications needed to ensure that your training job responds to elastic scaling events that occur when compute resource availability changes. During all elastic events (for example, when nodes become available or nodes get preempted), the training job receives an elastic event signal that is used to coordinate a graceful shutdown by saving a checkpoint, and to resume training by restarting from that saved checkpoint with a new world configuration. To enable elastic training with custom training scripts, you need to:

### Detect Elastic Scaling Events
<a name="sagemaker-eks-elastic-detect-events"></a>

In your training loop, check for elastic events during each iteration:

```
from hyperpod_elastic_agent.elastic_event_handler import elastic_event_detected

def train_epoch(model, dataloader, optimizer, scheduler, args, global_step):
    for batch_idx, batch_data in enumerate(dataloader):
        # Forward and backward pass
        loss = model(batch_data).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        global_step += 1

        # Handle checkpointing and elastic scaling
        should_checkpoint = (batch_idx + 1) % args.checkpoint_freq == 0
        elastic_event = elastic_event_detected()

        # Save a checkpoint on the regular schedule, or immediately when the
        # job is scaling up or down
        if should_checkpoint or elastic_event:
            save_checkpoint(model, optimizer, scheduler,
                            checkpoint_dir=args.checkpoint_dir,
                            step=global_step)

            if elastic_event:
                print("Elastic scaling event detected. Checkpoint saved.")
                return global_step

    return global_step
```

### Implement Checkpoint Saving and Checkpoint Loading
<a name="sagemaker-eks-elastic-checkpoint-implementation"></a>

Note: We recommend using PyTorch Distributed Checkpoint (DCP) for saving model and optimizer states, as DCP supports resuming from a checkpoint with different world sizes. Other checkpointing formats may not support checkpoint loading on different world sizes, in which case you'll need to implement custom logic to handle dynamic world size changes.

```
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

def save_checkpoint(model, optimizer, lr_scheduler, user_content, checkpoint_path):
    """Save checkpoint using DCP for elastic training."""
    state_dict = {
        "model": model,
        "optimizer": optimizer,
        "lr_scheduler": lr_scheduler,
        **user_content
    }
      
    dcp.save(
        state_dict=state_dict,
        storage_writer=dcp.FileSystemWriter(checkpoint_path)
    )

def load_checkpoint(model, optimizer, lr_scheduler, checkpoint_path):
    """Load checkpoint using DCP with automatic resharding."""
    state_dict = {
        "model": model,
        "optimizer": optimizer,
        "lr_scheduler": lr_scheduler
    }
      
    dcp.load(
        state_dict=state_dict,
        storage_reader=dcp.FileSystemReader(checkpoint_path)
    )
      
    return model, optimizer, lr_scheduler
```

### (Optional) Use stateful dataloaders
<a name="sagemaker-eks-elastic-stateful-dataloaders"></a>

If you're training for only a single epoch (that is, one pass through the entire dataset), the model must see each data sample exactly once. If the training job stops mid-epoch and resumes with a different world size, previously processed data samples will be repeated if the dataloader state is not persisted. A stateful dataloader prevents this by saving and restoring the dataloader's position, ensuring that resumed runs continue from the elastic scaling event without reprocessing any samples. We recommend using [StatefulDataLoader](https://meta-pytorch.org/data/main/torchdata.stateful_dataloader.html), which is a drop‑in replacement for `torch.utils.data.DataLoader` that adds `state_dict()` and `load_state_dict()` methods, enabling mid‑epoch checkpointing of the data loading process.
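To illustrate the semantics, the following minimal sketch uses a hypothetical pure-Python `ResumableLoader` wrapper (not `StatefulDataLoader` itself) to show how saving and restoring the dataloader position avoids reprocessing samples after a restart:

```
class ResumableLoader:
    """Minimal illustration of a stateful dataloader: tracks how many
    samples have been yielded so a resumed run can skip past them."""
    def __init__(self, dataset):
        self.dataset = dataset
        self.position = 0

    def __iter__(self):
        while self.position < len(self.dataset):
            sample = self.dataset[self.position]
            self.position += 1
            yield sample

    def state_dict(self):
        return {"position": self.position}

    def load_state_dict(self, state):
        self.position = state["position"]

loader = ResumableLoader(list(range(6)))
it = iter(loader)
seen = [next(it), next(it)]           # Process two samples, then "interrupt"
saved = loader.state_dict()           # Checkpoint the dataloader state

resumed = ResumableLoader(list(range(6)))
resumed.load_state_dict(saved)        # Restore position after restart
seen += list(resumed)                 # Remaining samples, no repeats
print(seen)  # [0, 1, 2, 3, 4, 5]
```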

## Submitting elastic training jobs
<a name="sagemaker-eks-elastic-submit-job"></a>

The [HyperPod training operator](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator-usage.html) defines a new resource type, `hyperpodpytorchjob`. Elastic training extends this resource type and adds the fields shown below:

```
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  name: elastic-training-job
spec:
  elasticPolicy:
    minReplicas: 1
    maxReplicas: 4
    # Increment the number of pods in fixed-size groups.
    # The number of pods equals minReplicas + N * replicaIncrementStep
    replicaIncrementStep: 1           
    # ... or provide the exact pod counts that are required for training
    replicaDiscreteValues: [2,4,8]     

    # How long the training operator waits for the job to save a checkpoint
    # and exit during scaling events. The job is force-stopped after this
    # period of time
    gracefulShutdownTimeoutInSeconds: 600

    # When a scaling event is detected:
    # how long the job controller waits before initiating a scale-up.
    # A delay can prevent frequent scale-ups and scale-downs
    scalingTimeoutInSeconds: 60

    # In case of faults, specify how long elastic training should wait for
    # recovery, before triggering a scale-down
    faultyScaleDownTimeoutInSeconds: 30
  ...
  replicaSpecs:
    - name: pods
      replicas: 4           # Initial replica count
      maxReplicas: 8        # Max for this replica spec (should match elasticPolicy.maxReplicas)
      ...
```
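The set of pod counts that the operator can scale between follows from these fields. The following Python sketch illustrates the two mutually exclusive options; it is an illustration of the semantics, not operator code.

```
def allowed_replica_counts(min_replicas, max_replicas,
                           replica_increment_step=None,
                           replica_discrete_values=None):
    """Compute valid pod counts from an elasticPolicy (illustrative only)."""
    if replica_discrete_values is not None:
        # Exact pod counts are listed explicitly
        return [n for n in replica_discrete_values
                if min_replicas <= n <= max_replicas]
    # Fixed-size increments: minReplicas + N * replicaIncrementStep
    return list(range(min_replicas, max_replicas + 1, replica_increment_step))

print(allowed_replica_counts(1, 4, replica_increment_step=1))           # [1, 2, 3, 4]
print(allowed_replica_counts(1, 8, replica_discrete_values=[2, 4, 8]))  # [2, 4, 8]
```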

### Using kubectl
<a name="sagemaker-eks-elastic-kubectl-apply"></a>

You can subsequently launch elastic training with the following command.

```
kubectl apply -f elastic-training-job.yaml
```

### Using SageMaker Recipes
<a name="sagemaker-eks-elastic-sagemaker-recipes"></a>

Elastic training jobs can be launched through [SageMaker HyperPod recipes](https://github.com/aws/sagemaker-hyperpod-recipes).

**Note**  
We have included **46** elastic recipes for **SFT** and **DPO** jobs in the HyperPod recipes repository. You can launch those jobs with a one-line change on top of an existing static launcher script:  
`++recipes.elastic_policy.is_elastic=true`

In addition to static recipes, elastic recipes add the following fields to define the elastic behaviors:

#### Elastic Policy
<a name="sagemaker-eks-elastic-policy"></a>

The `elastic_policy` field defines the job-level configuration for the elastic training job. It has the following configurations:
+ `is_elastic` : `bool` - whether this job is an elastic job
+ `min_nodes` : `int` - the minimum number of nodes used for elastic training
+ `max_nodes`: `int` - the maximum number of nodes used for elastic training
+ `replica_increment_step` : `int` - the increment amount of pods in fixed-size groups. This field is mutually exclusive with the `scale_config` field defined later.
+ `use_graceful_shutdown` : `bool` - whether to use graceful shutdown during scaling events. Defaults to `true`.
+ `scaling_timeout`: `int` - the waiting time in seconds during a scaling event before timing out
+ `graceful_shutdown_timeout`: `int` - the waiting time in seconds for graceful shutdown

The following is a sample definition of this field. You can also find it in the HyperPod recipes repository in the recipe `recipes_collection/recipes/fine-tuning/llama/llmft_llama3_1_8b_instruct_seq4k_gpu_sft_lora.yaml`.

```
<static recipe>
...
elastic_policy:
  is_elastic: true
  min_nodes: 1
  max_nodes: 16
  use_graceful_shutdown: true
  scaling_timeout: 600
  graceful_shutdown_timeout: 600
```

#### Scale Config
<a name="sagemaker-eks-elastic-scale-config"></a>

The `scale_config` field defines overriding configurations at each specific scale. It's a key-value dictionary, where the key is an integer representing the target scale and the value is a subset of the base recipe. At the `<key>` scale, the `<value>` is used to update the specific configurations in the base (static) recipe. The following shows an example of this field:

```
scale_config:   
...
  2:
    trainer:
      num_nodes: 2
    training_config:
      training_args:
        train_batch_size: 128
        micro_train_batch_size: 8
        learning_rate: 0.0004
  3:
    trainer:
      num_nodes: 3
    training_config:
      training_args:
        train_batch_size: 128
        learning_rate: 0.0004
        uneven_batch:
          use_uneven_batch: true
          num_dp_groups_with_small_batch_size: 16
          small_local_batch_size: 5
          large_local_batch_size: 6
 ...
```

The above configuration defines the training configuration at scales 2 and 3. In both cases, we use a learning rate of `4e-4` and a batch size of `128`. But at scale 2, we use a `micro_train_batch_size` of 8, while at scale 3, we use an uneven batch size because the train batch size cannot be evenly divided across 3 nodes.
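Conceptually, applying a `scale_config` entry is a recursive overlay of the `<value>` onto the base recipe. The following Python sketch illustrates the merge semantics with a simplified example; it is not the recipe launcher's actual code, and the field values are illustrative.

```
def apply_scale_config(base_recipe, scale_config, scale):
    """Overlay the scale_config entry for the target scale onto the base
    recipe using a recursive dictionary merge (illustrative only)."""
    def merge(base, override):
        merged = dict(base)
        for key, value in override.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge(merged[key], value)  # Merge nested sections
            else:
                merged[key] = value                      # Override leaf values
        return merged
    return merge(base_recipe, scale_config.get(scale, {}))

base = {"trainer": {"num_nodes": 1},
        "training_config": {"training_args": {"train_batch_size": 64,
                                              "learning_rate": 0.0002}}}
scale_config = {2: {"trainer": {"num_nodes": 2},
                    "training_config": {"training_args": {"train_batch_size": 128,
                                                          "learning_rate": 0.0004}}}}
merged = apply_scale_config(base, scale_config, 2)
print(merged["trainer"]["num_nodes"])  # 2
```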

**Uneven Batch Size**

This field defines the batch distribution behavior when the global batch size cannot be evenly divided by the number of ranks. It's not specific to elastic training, but it enables finer scaling granularity.
+ `use_uneven_batch` : `bool` - whether to use uneven batch distribution
+ `num_dp_groups_with_small_batch_size` : `int` - in uneven batch distribution, some ranks use a smaller local batch size while the others use a larger one. The global batch size should equal `small_local_batch_size * num_dp_groups_with_small_batch_size + (world_size - num_dp_groups_with_small_batch_size) * large_local_batch_size`
+ `small_local_batch_size` : `int` - the smaller local batch size
+ `large_local_batch_size` : `int` - the larger local batch size
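As a worked check of the identity above, the scale-3 recipe values (`train_batch_size: 128`, 16 groups at local batch size 5, the rest at 6) are consistent with 24 data parallel groups; the group count of 24 is an assumption here (for example, 8 GPUs per node across 3 nodes):

```python
def uneven_batch_sizes(num_dp_groups, num_small, small_bs, large_bs):
    """Per-group local batch sizes under uneven batch distribution (sketch)."""
    return [small_bs] * num_small + [large_bs] * (num_dp_groups - num_small)

# Scale-3 recipe values: 16 groups at local batch 5, the remaining 8 at 6.
sizes = uneven_batch_sizes(num_dp_groups=24, num_small=16, small_bs=5, large_bs=6)
# 16 * 5 + 8 * 6 = 128, the global train_batch_size
```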

**Monitor training with MLflow**

HyperPod recipe jobs support observability through MLflow. You can specify MLflow configurations in the recipe:

```
training_config:
  mlflow:
    tracking_uri: "<local_file_path or MLflow server URL>"
    run_id: "<MLflow run ID>"
    experiment_name: "<MLflow experiment name, e.g. llama_exps>"
    run_name: "<run name, e.g. llama3.1_8b>"
```

These configurations map to the corresponding [MLflow setup](https://mlflow.org/docs/latest/ml/tracking/tracking-api/#setup--configuration). The following is a sample MLflow dashboard for an elastic training job.

![\[The following is a sample MLflow dashboard for an elastic training job.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-elastic-sample-dashboard.png)


After defining the elastic recipe, you can use the launcher scripts, such as `launcher_scripts/llama/run_llmft_llama3_1_8b_instruct_seq4k_gpu_sft_lora.sh`, to launch an elastic training job. This is similar to launching a static job using a HyperPod recipe.

**Note**  
Elastic training jobs launched from recipes support automatically resuming from the latest checkpoint. However, by default, every restart creates a new training directory. To resume from the last checkpoint correctly, make sure that the same training directory is reused by setting  
`recipes.training_config.training_args.override_training_dir=true`

## Use-case examples and limitations
<a name="sagemaker-eks-elastic-use-cases"></a>

### Scale-up when more resources are available
<a name="sagemaker-eks-elastic-scale-up"></a>

When more resources become available on the cluster (for example, when other workloads complete), the training controller automatically scales up the training job. This behavior is explained below.

To simulate a situation where more resources become available, you can submit a high-priority job, and then release the resources by deleting that job.

```
# Submit a high-priority job on your cluster. As a result of this command,
# resources will not be available for elastic training
kubectl apply -f high_priority_job.yaml

# Submit an elastic job with normal priority
kubectl apply -f hyperpod_job_with_elasticity.yaml

# Wait for training to start....

# Delete the high-priority job. This command makes additional resources
# available for elastic training
kubectl delete -f high_priority_job.yaml

# Observe the scale-up of elastic job
```

Expected behavior:
+ The training operator creates a Kueue Workload. When an elastic training job requests a change in world size, the training operator generates an additional Kueue Workload object representing the new resource requirements.
+ Kueue admits the Workload. Kueue evaluates the request based on available resources, priorities, and queue policies. Once approved, the Workload is admitted.
+ The training operator creates the additional Pods. Upon admission, the operator launches the additional pods required to reach the new world size.
+ When the new pods become ready, the training operator sends a special elastic event signal to the training script.
+ The training job performs checkpointing to prepare for a graceful shutdown. The training process periodically checks for the elastic event signal by calling the `elastic_event_detected()` function. Once detected, it initiates a checkpoint. After the checkpoint is successfully completed, the training process exits cleanly.
+ The training operator restarts the job with the new world size. The operator waits for all processes to exit, then restarts the training job using the updated world size and the latest checkpoint.

**Note:** When Kueue is not used, the training operator skips the first two steps. It immediately attempts to create the additional pods required for the new world size. If sufficient resources are not available in the cluster, these pods will remain in a **Pending** state until capacity becomes available.
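The cooperative shutdown described in the steps above can be sketched as a training loop that polls for the elastic event signal between steps, checkpoints, and exits cleanly. The callbacks below are stand-ins for the actual operator signal and checkpoint APIs, which this sketch does not reproduce:

```python
def run_until_elastic_event(max_steps, elastic_event_detected, save_checkpoint):
    """Train until completion or until the elastic event signal is seen,
    then checkpoint and exit cleanly (illustrative sketch)."""
    for step in range(max_steps):
        # ... one training step would run here ...
        if elastic_event_detected(step):   # periodic check for the signal
            save_checkpoint(step)          # checkpoint before shutdown
            return step                    # clean exit; the operator restarts
    return max_steps

checkpoints = []
stopped_at = run_until_elastic_event(
    max_steps=100,
    elastic_event_detected=lambda step: step == 40,  # simulated signal
    save_checkpoint=checkpoints.append,
)
# The job checkpoints at step 40 and exits; the operator then restarts it
# with the new world size from that checkpoint.
```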

![\[The diagram illustrates the resizing and resource timeline.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-elastic-resize-timeline.png)


### Preemption by high priority job
<a name="sagemaker-eks-elastic-preemption"></a>

Elastic jobs can be scaled down automatically when a high-priority job needs resources. To simulate this behavior, you can submit an elastic training job that uses the maximum number of available resources from the start of training, then submit a high-priority job and observe the preemption behavior.

```
# Submit an elastic job with normal priority
kubectl apply -f hyperpod_job_with_elasticity.yaml

# Submit a high-priority job on your cluster. As a result of this command,
# some resources will be preempted from the elastic job
kubectl apply -f high_priority_job.yaml

# Observe scale-down behaviour
```

When a high-priority job needs resources, Kueue can preempt lower-priority elastic training Workloads (there can be more than one Workload object associated with an elastic training job). The preemption process follows this sequence:

1. A high-priority job is submitted. The job creates a new Kueue Workload, but the Workload cannot be admitted due to insufficient cluster resources.

1. Kueue preempts one of the elastic training job's Workloads. Elastic jobs may have multiple active Workloads (one per world-size configuration). Kueue selects one to preempt based on priority and queue policies.

1. The training operator sends an elastic event signal. Once preemption is triggered, the training operator notifies the running training process to stop gracefully.

1. The training process performs checkpointing. The training job periodically checks for elastic event signals. When detected, it begins a coordinated checkpoint to preserve progress before shutting down.

1. The training operator cleans up pods and Workloads. The operator waits for checkpoint completion, then deletes the training pods that were part of the preempted Workload. It also removes the corresponding Workload object from Kueue.

1. The high-priority workload is admitted. With resources freed, Kueue admits the high-priority job, allowing it to start execution.  
![\[Preemption timeline for elastic training workloads.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-elastic-preemption-timeline.png)

Preemption can cause the entire training job to pause, which may not be desirable for all workflows. To avoid full-job suspension while still allowing elastic scaling, you can configure two different priority levels within the same training job by defining two `replicaSpec` sections:
+ A primary (fixed) replicaSpec with normal or high priority
  + Contains the minimum required number of replicas needed to keep the training job running.
  + Uses a higher PriorityClass, ensuring these replicas are *never* preempted.
  + Maintains baseline progress even when the cluster is under resource pressure.
+ An elastic (scalable) replicaSpec with lower priority
  + Contains the additional optional replicas that provide extra compute during elastic scaling.
  + Uses a lower PriorityClass, allowing Kueue to preempt these replicas when higher-priority jobs need resources.
  + Ensures only the elastic portion is reclaimed, while the core training continues uninterrupted.

This configuration enables partial preemption, where only the elastic capacity is reclaimed, maintaining training continuity while still supporting fair resource sharing in multi-tenant environments. Example:

```
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  name: elastic-training-job
spec:
  elasticPolicy:
    minReplicas: 2
    maxReplicas: 8
    replicaIncrementStep: 2
  ...
  replicaSpecs:
    - name: base
      replicas: 2
      template:
        spec:
          priorityClassName: high-priority # set high-priority to avoid evictions
           ...
    - name: elastic
      replicas: 0
      maxReplicas: 6
      template:
        spec:
          priorityClassName: low-priority # Set low-priority for elastic part
           ...
```

### Handling pod eviction, pod crashes, and hardware degradation
<a name="sagemaker-eks-elastic-pod-eviction"></a>

The HyperPod training operator includes built-in mechanisms to recover the training process when it is unexpectedly interrupted. Interruptions can occur for various reasons, such as training code failures, pod evictions, node failures, hardware degradation, and other runtime issues.

When this happens, the operator automatically attempts to recreate the affected pods and resume training from the latest checkpoint. If recovery is not immediately possible, for example, due to insufficient spare capacity, the operator can continue making progress by temporarily reducing the world size and scaling down the elastic training job.

When an elastic training job crashes or loses replicas, the system behaves as follows:
+ Recovery phase (using spare nodes): The training controller waits up to `faultyScaleDownTimeoutInSeconds` for resources to become available and attempts to recover the failed replicas by redeploying pods on spare capacity.
+ Elastic scale-down: If recovery is not possible within the timeout window, the training operator scales the job down to a smaller world size (if the job's elastic policy permits it). Training then resumes with fewer replicas.
+ Elastic scale-up: When additional resources become available again, the operator automatically scales the training job back up to the preferred world size.

This mechanism ensures that training can continue with minimal downtime, even under resource pressure or partial infrastructure failures, while still taking advantage of elastic scaling.
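The recovery timeout above is set with `faultyScaleDownTimeoutInSeconds`. The following fragment is illustrative only; the exact placement of this field in the `HyperPodPyTorchJob` spec, alongside the other elastic policy settings, is an assumption:

```
spec:
  elasticPolicy:
    minReplicas: 2
    maxReplicas: 8
    faultyScaleDownTimeoutInSeconds: 300 # wait up to 5 minutes for spare capacity before scaling down
```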

### Use elastic training with other HyperPod features
<a name="sagemaker-eks-elastic-other-features"></a>

Elastic training does not currently support checkpointless training capabilities, HyperPod managed tiered checkpointing, or Spot instances.

**Note**  
We collect certain routine aggregated and anonymized operational metrics to provide essential service availability. The creation of these metrics is fully automated and does not involve human review of the underlying model training workload. These metrics relate to job and scaling operations, resource management, and essential service functionality.

# Observability for Amazon SageMaker HyperPod cluster orchestrated by Amazon EKS
<a name="sagemaker-hyperpod-eks-cluster-observability"></a>

To achieve comprehensive observability into your Amazon SageMaker HyperPod (SageMaker HyperPod) cluster resources and software components, integrate the cluster with [Amazon CloudWatch Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html), [Amazon Managed Service for Prometheus](https://docs.aws.amazon.com/prometheus/latest/userguide/what-is-Amazon-Managed-Service-Prometheus.html), and [Amazon Managed Grafana](https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html). These tools provide visibility into cluster health, performance metrics, and resource utilization.

The integration with Amazon Managed Service for Prometheus enables the export of metrics related to your HyperPod cluster resources, providing insights into their performance, utilization, and health. The integration with Amazon Managed Grafana enables the visualization of these metrics through various Grafana dashboards that offer an intuitive interface for monitoring and analyzing the cluster's behavior. By using these services, you gain a centralized and unified view of your HyperPod cluster, facilitating proactive monitoring, troubleshooting, and optimization of your distributed training workloads.

**Note**  
While CloudWatch, Amazon Managed Service for Prometheus, and Amazon Managed Grafana focus on operational metrics (for example, system health, training job performance), SageMaker HyperPod Usage Reports complement [Task Governance](sagemaker-hyperpod-eks-operate-console-ui-governance.md) to provide financial and resource accountability insights. These reports track:  
Compute utilization (GPU/CPU/Neuron Core hours) across namespaces/teams
Cost attribution for allocated vs. borrowed resources
Historical trends (up to 180 days) for auditing and optimization
For more information about setting up and generating usage reports, see [Reporting Compute Usage in HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-usage-reporting.html). 

**Tip**  
To find practical examples and solutions, see also the [Observability](https://catalog.us-east-1.prod.workshops.aws/workshops/2433d39e-ccfe-4c00-9d3d-9917b729258e/en-US/06-observability) section in the [Amazon EKS Support in SageMaker HyperPod workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/2433d39e-ccfe-4c00-9d3d-9917b729258e).

Proceed to the following topics to set up for SageMaker HyperPod cluster observability.

**Topics**
+ [Model observability for training jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-cluster-observability-model.md)
+ [Cluster and task observability](sagemaker-hyperpod-eks-cluster-observability-cluster.md)

# Model observability for training jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS
<a name="sagemaker-hyperpod-eks-cluster-observability-model"></a>

SageMaker HyperPod clusters orchestrated with Amazon EKS can integrate with the [MLflow application on Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html). Cluster admins set up the MLflow server and connect it to the SageMaker HyperPod clusters, and data scientists can then use MLflow to gain insights into model training.

**To set up an MLflow server using AWS CLI**

A cluster admin must create an MLflow tracking server.

1. Create a SageMaker AI MLflow tracking server, following the instructions at [Create a tracking server using the AWS CLI](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow-create-tracking-server-cli.html#mlflow-create-tracking-server-cli-infra-setup).

1. Make sure that the [AssumeRoleForPodIdentity](https://docs.aws.amazon.com/eks/latest/APIReference/API_auth_AssumeRoleForPodIdentity.html) permission exists in the IAM execution role for SageMaker HyperPod.

1. If the `eks-pod-identity-agent` add-on is not already installed on your EKS cluster, install the add-on on the EKS cluster.

   ```
   aws eks create-addon \
       --cluster-name <eks_cluster_name> \
       --addon-name eks-pod-identity-agent \
       --addon-version vx.y.z-eksbuild.1
   ```

1. Create a `trust-relationship.json` file for a new role for Pod to call MLflow APIs.

   ```
   cat >trust-relationship.json <<EOF
   {
        "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
               "Effect": "Allow",
                "Principal": {
                    "Service": "pods.eks.amazonaws.com"
                },
               "Action": [
                   "sts:AssumeRole",
                   "sts:TagSession"
               ]
           }
       ]
   }
   EOF
   ```

   Run the following code to create the role and attach the trust relationship.

   ```
   aws iam create-role --role-name hyperpod-mlflow-role \
       --assume-role-policy-document file://trust-relationship.json \
       --description "allow pods to emit mlflow metrics and put data in s3"
   ```

1. Create the following policy that grants the Pod access to call all `sagemaker-mlflow` operations and to put model artifacts in Amazon S3. The tracking server already has S3 permissions, but if the model artifacts are too large, the MLflow code calls Amazon S3 directly to upload them.

   ```
   cat >hyperpod-mlflow-policy.json <<EOF
   {
        "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "sagemaker-mlflow:AccessUI",
                   "sagemaker-mlflow:CreateExperiment",
                   "sagemaker-mlflow:SearchExperiments",
                   "sagemaker-mlflow:GetExperiment",
                   "sagemaker-mlflow:GetExperimentByName",
                   "sagemaker-mlflow:DeleteExperiment",
                   "sagemaker-mlflow:RestoreExperiment",
                   "sagemaker-mlflow:UpdateExperiment",
                   "sagemaker-mlflow:CreateRun",
                   "sagemaker-mlflow:DeleteRun",
                   "sagemaker-mlflow:RestoreRun",
                   "sagemaker-mlflow:GetRun",
                   "sagemaker-mlflow:LogMetric",
                   "sagemaker-mlflow:LogBatch",
                   "sagemaker-mlflow:LogModel",
                   "sagemaker-mlflow:LogInputs",
                   "sagemaker-mlflow:SetExperimentTag",
                   "sagemaker-mlflow:SetTag",
                   "sagemaker-mlflow:DeleteTag",
                   "sagemaker-mlflow:LogParam",
                   "sagemaker-mlflow:GetMetricHistory",
                   "sagemaker-mlflow:SearchRuns",
                   "sagemaker-mlflow:ListArtifacts",
                   "sagemaker-mlflow:UpdateRun",
                   "sagemaker-mlflow:CreateRegisteredModel",
                   "sagemaker-mlflow:GetRegisteredModel",
                   "sagemaker-mlflow:RenameRegisteredModel",
                   "sagemaker-mlflow:UpdateRegisteredModel",
                   "sagemaker-mlflow:DeleteRegisteredModel",
                   "sagemaker-mlflow:GetLatestModelVersions",
                   "sagemaker-mlflow:CreateModelVersion",
                   "sagemaker-mlflow:GetModelVersion",
                   "sagemaker-mlflow:UpdateModelVersion",
                   "sagemaker-mlflow:DeleteModelVersion",
                   "sagemaker-mlflow:SearchModelVersions",
                   "sagemaker-mlflow:GetDownloadURIForModelVersionArtifacts",
                   "sagemaker-mlflow:TransitionModelVersionStage",
                   "sagemaker-mlflow:SearchRegisteredModels",
                   "sagemaker-mlflow:SetRegisteredModelTag",
                   "sagemaker-mlflow:DeleteRegisteredModelTag",
                   "sagemaker-mlflow:DeleteModelVersionTag",
                   "sagemaker-mlflow:DeleteRegisteredModelAlias",
                   "sagemaker-mlflow:SetRegisteredModelAlias",
                   "sagemaker-mlflow:GetModelVersionByAlias"
               ],
               "Resource": "arn:aws:sagemaker:us-west-2:111122223333:mlflow-tracking-server/<ml tracking server name>"
           },
           {
               "Effect": "Allow",
               "Action": [
                   "s3:PutObject"
               ],
                "Resource": "arn:aws:s3:::<mlflow-s3-bucket_name>/*"
           }
       ]
   }
   EOF
   ```
**Note**  
The ARNs are those of the MLflow tracking server and the S3 bucket that you set up when you created the server by following the instructions in [Set up MLflow infrastructure](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow-create-tracking-server-cli.html#mlflow-create-tracking-server-cli-infra-setup).

1. Attach the `mlflow-metrics-emit-policy` policy to the `hyperpod-mlflow-role` using the policy document saved in the previous step.

   ```
   aws iam put-role-policy \
     --role-name hyperpod-mlflow-role \
     --policy-name mlflow-metrics-emit-policy \
     --policy-document file://hyperpod-mlflow-policy.json
   ```

1. Create a Kubernetes service account for Pods to access the MLflow server.

   ```
   cat >mlflow-service-account.yaml <<EOF
   apiVersion: v1
   kind: ServiceAccount
   metadata:
     name: mlflow-service-account
     namespace: kubeflow
   EOF
   ```

   Run the following command to apply to the EKS cluster.

   ```
   kubectl apply -f mlflow-service-account.yaml
   ```

1. Create a Pod identity association.

   ```
   aws eks create-pod-identity-association \
       --cluster-name EKS_CLUSTER_NAME \
       --role-arn arn:aws:iam::111122223333:role/hyperpod-mlflow-role \
       --namespace kubeflow \
       --service-account mlflow-service-account
   ```

**To collect metrics from training jobs to the MLflow server**

Data scientists need to set up the training script and docker image to emit metrics to the MLflow server.

1. Add the following lines at the beginning of your training script.

   ```
   import os
   import mlflow
   
   # Set the Tracking Server URI using the ARN of the Tracking Server you created
   mlflow.set_tracking_uri(os.environ['MLFLOW_TRACKING_ARN'])
   # Enable autologging in MLflow
   mlflow.autolog()
   ```

1. Build a Docker image with the training script and push to Amazon ECR. Get the ARN of the ECR container. For more information about building and pushing a Docker image, see [Pushing a Docker image](https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html) in the *ECR User Guide*.
**Tip**  
Make sure that you add installation of mlflow and sagemaker-mlflow packages in the Docker file. To learn more about the installation of the packages, requirements, and compatible versions of the packages, see [Install MLflow and the SageMaker AI MLflow plugin](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow-track-experiments.html#mlflow-track-experiments-install-plugin).

1. Add a service account in the training job Pods to give them access to `hyperpod-mlflow-role`. This allows the Pods to call MLflow APIs. Use the following SageMaker HyperPod CLI job submission template, and save it with the file name `mlflow-test.yaml`.

   ```
   defaults:
    - override hydra/job_logging: stdout
   
   hydra:
    run:
     dir: .
    output_subdir: null
   
   training_cfg:
    entry_script: ./train.py
    script_args: []
    run:
     name: test-job-with-mlflow # Current run name
     nodes: 2 # Number of nodes to use for current training
     # ntasks_per_node: 1 # Number of devices to use per node
   cluster:
    cluster_type: k8s # currently k8s only
    instance_type: ml.c5.2xlarge
    cluster_config:
     # name of service account associated with the namespace
     service_account_name: mlflow-service-account
     # persistent volume, usually used to mount FSx
     persistent_volume_claims: null
     namespace: kubeflow
     # required node affinity to select nodes with SageMaker HyperPod
     # labels and passed health check if burn-in enabled
     label_selector:
         required:
             sagemaker.amazonaws.com/node-health-status:
                 - Schedulable
         preferred:
             sagemaker.amazonaws.com/deep-health-check-status:
                 - Passed
         weights:
             - 100
     pullPolicy: IfNotPresent # policy to pull container, can be Always, IfNotPresent and Never
     restartPolicy: OnFailure # restart policy
   
   base_results_dir: ./result # Location to store the results, checkpoints and logs.
   container: 111122223333.dkr.ecr.us-west-2.amazonaws.com/tag # container to use
   
   env_vars:
    NCCL_DEBUG: INFO # Logging level for NCCL. Set to "INFO" for debug information
    MLFLOW_TRACKING_ARN: arn:aws:sagemaker:us-west-2:111122223333:mlflow-tracking-server/tracking-server-name
   ```

1. Start the job using the YAML file as follows.

   ```
   hyperpod start-job --config-file /path/to/mlflow-test.yaml
   ```

1. Generate a pre-signed URL for the MLflow tracking server. You can open the link on your browser and start tracking your training job.

   ```
   aws sagemaker create-presigned-mlflow-tracking-server-url \
       --tracking-server-name "tracking-server-name" \
       --session-expiration-duration-in-seconds 1800 \
       --expires-in-seconds 300 \
       --region region
   ```

# Cluster and task observability
<a name="sagemaker-hyperpod-eks-cluster-observability-cluster"></a>

There are two options for monitoring SageMaker HyperPod clusters:

**The SageMaker HyperPod observability add-on**—SageMaker HyperPod provides a comprehensive, out-of-the-box dashboard that gives you insights into foundation model (FM) development tasks and cluster resources. This unified observability solution automatically publishes key metrics to Amazon Managed Service for Prometheus and displays them in Amazon Managed Grafana dashboards. The dashboards are optimized specifically for FM development with deep coverage of hardware health, resource utilization, and task-level performance. With this add-on, you can consolidate health and performance data from NVIDIA DCGM, instance-level Kubernetes node exporters, Elastic Fabric Adapter, integrated file systems, Kubernetes APIs, Kueue, and SageMaker HyperPod task operators.

**Amazon CloudWatch Container Insights**—Amazon CloudWatch Container Insights collects metrics for compute resources, such as CPU, memory, disk, and network. Container Insights also provides diagnostic information, such as container restart failures, to help you isolate issues and resolve them quickly. You can also set CloudWatch alarms on metrics that Container Insights collects.

**Topics**
+ [Amazon SageMaker HyperPod observability with Amazon Managed Grafana and Amazon Managed Service for Prometheus](sagemaker-hyperpod-observability-addon.md)
+ [Observability with Amazon CloudWatch](sagemaker-hyperpod-eks-cluster-observability-cluster-cloudwatch-ci.md)

# Amazon SageMaker HyperPod observability with Amazon Managed Grafana and Amazon Managed Service for Prometheus
<a name="sagemaker-hyperpod-observability-addon"></a>

Amazon SageMaker HyperPod (SageMaker HyperPod) provides a comprehensive, out-of-the-box dashboard that gives you insights into foundation model (FM) development tasks and cluster resources. This unified observability solution automatically publishes key metrics to Amazon Managed Service for Prometheus and displays them in Amazon Managed Grafana dashboards. The dashboards are optimized specifically for FM development with deep coverage of hardware health, resource utilization, and task-level performance. With this add-on, you can consolidate health and performance data from NVIDIA DCGM, instance-level Kubernetes node exporters, Elastic Fabric Adapter, integrated file systems, Kubernetes APIs, Kueue, and SageMaker HyperPod task operators.

## Restricted Instance Group (RIG) support
<a name="hyperpod-observability-addon-rig-support"></a>

The observability add-on also supports clusters that contain Restricted Instance Groups. In RIG clusters, the add-on automatically adapts its deployment strategy to comply with the network isolation and security constraints of restricted nodes. DaemonSet components (node exporter, DCGM exporter, EFA exporter, Neuron monitor, and node collector) run on both standard and restricted nodes. Deployment components (central collector, Kube State Metrics, and Training Metrics Agent) are scheduled with boundary-aware logic to respect network isolation between instance groups. Container log collection with Fluent Bit is not available on restricted nodes.

For information about setting up the add-on on clusters with Restricted Instance Groups, see [Setting up the SageMaker HyperPod observability add-on](hyperpod-observability-addon-setup.md).

**Topics**
+ [Restricted Instance Group (RIG) support](#hyperpod-observability-addon-rig-support)
+ [Setting up the SageMaker HyperPod observability add-on](hyperpod-observability-addon-setup.md)
+ [Amazon SageMaker HyperPod observability dashboards](hyperpod-observability-addon-viewing-dashboards.md)
+ [Exploring SageMaker HyperPod cluster metrics in Amazon Managed Grafana](hyperpod-observability-addon-exploring-metrics.md)
+ [Customizing SageMaker HyperPod cluster metrics dashboards and alerts](hyperpod-observability-addon-customizing.md)
+ [Creating custom SageMaker HyperPod cluster metrics](hyperpod-observability-addon-custom-metrics.md)
+ [SageMaker HyperPod cluster metrics](hyperpod-observability-cluster-metrics.md)
+ [Preconfigured alerts](hyperpod-observability-addon-alerts.md)
+ [Troubleshooting the Amazon SageMaker HyperPod observability add-on](hyperpod-observability-addon-troubleshooting.md)

# Setting up the SageMaker HyperPod observability add-on
<a name="hyperpod-observability-addon-setup"></a>

The following list describes the prerequisites for setting up the observability add-on.

To have metrics for your Amazon SageMaker HyperPod (SageMaker HyperPod) cluster sent to an Amazon Managed Service for Prometheus workspace, and to optionally view them in Amazon Managed Grafana, first complete the following prerequisites.
+ To use Amazon Managed Grafana, enable AWS IAM Identity Center (IAM Identity Center) in an AWS Region where Amazon Managed Grafana is available. For instructions, see [Getting started with IAM Identity Center](https://docs.aws.amazon.com/singlesignon/latest/userguide/getting-started.html) in the *AWS IAM Identity Center User Guide*. For a list of AWS Regions where Amazon Managed Grafana is available, see [Supported Regions](https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html#AMG-supported-Regions) in the *Amazon Managed Grafana User Guide*.
+ Create at least one user in IAM Identity Center.
+ Ensure that the [Amazon EKS Pod Identity Agent](https://docs.aws.amazon.com/eks/latest/userguide/workloads-add-ons-available-eks.html#add-ons-pod-id) add-on is installed in your Amazon EKS cluster. The Amazon EKS Pod Identity Agent add-on makes it possible for the SageMaker HyperPod observability add-on to get the credentials to interact with Amazon Managed Service for Prometheus and CloudWatch Logs. To check whether your Amazon EKS cluster has the add-on, go to the Amazon EKS console, and check your cluster's **Add-ons** tab. For information about how to install the add-on if it's not installed, see [Create add-on (AWS Management Console)](https://docs.aws.amazon.com/eks/latest/userguide/creating-an-add-on.html#_create_add_on_console) in the *Amazon EKS User Guide*.
**Note**  
The Amazon EKS Pod Identity Agent is required for standard instance groups. For Restricted Instance Groups (RIG), the Pod Identity Agent is not available due to network isolation constraints. The cluster's instance group execution IAM role is used to interact with Amazon Managed Service for Prometheus. For information about how to configure that role, see [Additional prerequisites for Restricted Instance Groups](#hyperpod-observability-addon-rig-prerequisites).
+ Ensure that you have at least one node in your SageMaker HyperPod cluster before installing the SageMaker HyperPod observability add-on. The smallest Amazon EC2 instance size that works in this case is `4xlarge`. This minimum node size requirement ensures that the node can accommodate all the pods that the SageMaker HyperPod observability add-on creates, alongside any other pods already running on the cluster.
+ Add the following policies and permissions to your role.
  + [AWS managed policy: AmazonSageMakerHyperPodObservabilityAdminAccess](security-iam-awsmanpol-AmazonSageMakerHyperPodObservabilityAdminAccess.md)
  + [AWS managed policy: AWSGrafanaWorkspacePermissionManagementV2](https://docs.aws.amazon.com/grafana/latest/userguide/security-iam-awsmanpol.html#security-iam-awsmanpol-AWSGrafanaWorkspacePermissionManagementV2)
  + [AWS managed policy: AmazonSageMakerFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html)
  + Additional permissions to set up required IAM roles for Amazon Managed Grafana and Amazon Elastic Kubernetes Service add-on access:

------
#### [ JSON ]

****  

    ```
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CreateRoleAccess",
                "Effect": "Allow",
                "Action": [
                    "iam:CreateRole",
                    "iam:CreatePolicy",
                    "iam:AttachRolePolicy",
                    "iam:ListRoles"
                ],
                "Resource": [
                    "arn:aws:iam::*:role/service-role/AmazonSageMakerHyperPodObservabilityGrafanaAccess*",
                    "arn:aws:iam::*:role/service-role/AmazonSageMakerHyperPodObservabilityAddonAccess*",
                    "arn:aws:iam::*:policy/service-role/HyperPodObservabilityAddonPolicy*",
                    "arn:aws:iam::*:policy/service-role/HyperPodObservabilityGrafanaPolicy*"
                ]
            }
        ]
    }
    ```

------
  + Additional permissions needed to manage IAM Identity Center users for Amazon Managed Grafana:

------
#### [ JSON ]

****  

    ```
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "SSOAccess",
                "Effect": "Allow",
                "Action": [
                    "sso:ListProfileAssociations",
                    "sso-directory:SearchUsers",
                    "sso-directory:SearchGroups",
                    "sso:AssociateProfile",
                    "sso:DisassociateProfile"
                ],
                "Resource": [
                    "*"
                ]
            }
        ]
    }
    ```

------
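If you script your role setup, the two inline statements above can be combined into a single policy document. The following is a minimal sketch; the statements are copied verbatim from this page, and attaching the resulting document to a role (for example, with `iam:PutRolePolicy`) is left out so the snippet runs without AWS credentials:

```python
import json

# Statement granting access to create the observability IAM roles and policies
create_role_statement = {
    "Sid": "CreateRoleAccess",
    "Effect": "Allow",
    "Action": [
        "iam:CreateRole",
        "iam:CreatePolicy",
        "iam:AttachRolePolicy",
        "iam:ListRoles",
    ],
    "Resource": [
        "arn:aws:iam::*:role/service-role/AmazonSageMakerHyperPodObservabilityGrafanaAccess*",
        "arn:aws:iam::*:role/service-role/AmazonSageMakerHyperPodObservabilityAddonAccess*",
        "arn:aws:iam::*:policy/service-role/HyperPodObservabilityAddonPolicy*",
        "arn:aws:iam::*:policy/service-role/HyperPodObservabilityGrafanaPolicy*",
    ],
}

# Statement granting access to manage IAM Identity Center users for Grafana
sso_statement = {
    "Sid": "SSOAccess",
    "Effect": "Allow",
    "Action": [
        "sso:ListProfileAssociations",
        "sso-directory:SearchUsers",
        "sso-directory:SearchGroups",
        "sso:AssociateProfile",
        "sso:DisassociateProfile",
    ],
    "Resource": ["*"],
}

policy_document = {
    "Version": "2012-10-17",
    "Statement": [create_role_statement, sso_statement],
}

print(json.dumps(policy_document, indent=4))
```

Combining the statements this way keeps both `Sid` values intact, so individual permissions remain easy to audit.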

## Additional prerequisites for Restricted Instance Groups
<a name="hyperpod-observability-addon-rig-prerequisites"></a>

If your cluster contains Restricted Instance Groups, the instance group execution role must have permissions to write metrics to Amazon Managed Service for Prometheus. When you use **Quick setup** to create your cluster with observability enabled, these permissions are added to the execution role automatically.

If you are using **Custom setup** or adding observability to an existing RIG cluster, ensure that the execution role for each Restricted Instance Group has the following permissions:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PrometheusAccess",
            "Effect": "Allow",
            "Action": "aps:RemoteWrite",
            "Resource": "arn:aws:aps:us-east-1:account_id:workspace/workspace-ID"
        }
    ]
}
```

Replace *us-east-1*, *account\_id*, and *workspace-ID* with your AWS Region, AWS account ID, and Amazon Managed Service for Prometheus workspace ID.
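If you generate this policy programmatically, a minimal sketch follows. The helper name and the placeholder values are our own; the ARN format matches the policy example above:

```python
def rig_prometheus_policy(region: str, account_id: str, workspace_id: str) -> dict:
    """Build the aps:RemoteWrite policy statement for a Restricted
    Instance Group execution role."""
    workspace_arn = f"arn:aws:aps:{region}:{account_id}:workspace/{workspace_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PrometheusAccess",
                "Effect": "Allow",
                "Action": "aps:RemoteWrite",
                "Resource": workspace_arn,
            }
        ],
    }

# Example with placeholder values; substitute your own Region, account ID,
# and Amazon Managed Service for Prometheus workspace ID.
policy = rig_prometheus_policy("us-east-1", "111122223333", "ws-EXAMPLE1234")
print(policy["Statement"][0]["Resource"])
```

Scoping `Resource` to a single workspace ARN, rather than `*`, keeps the execution role limited to the workspace that the observability add-on actually writes to.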

After you have met these prerequisites, you can install the observability add-on.

**To quickly install the observability add-on**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Go to your cluster's details page.

1. On the **Dashboard** tab, locate the add-on named **HyperPod Monitoring & Observability**, and choose **Quick install**.

**To do a custom install of the observability add-on**

1. Go to your cluster's details page.

1. On the **Dashboard** tab, locate the add-on named **HyperPod Monitoring & Observability**, and choose **Custom install**.

1. Specify the metrics categories that you want to see. For more information about these metrics categories, see [SageMaker HyperPod cluster metrics](hyperpod-observability-cluster-metrics.md).

1. Specify whether you want to enable Amazon CloudWatch Logs.

1. Specify whether you want the service to create a new Amazon Managed Service for Prometheus workspace.

1. To be able to view the metrics in Amazon Managed Grafana dashboards, check the box labeled **Use an Amazon Managed Grafana workspace**. You can specify your own workspace or let the service create a new one for you. 
**Note**  
Amazon Managed Grafana isn't available in all AWS Regions in which Amazon Managed Service for Prometheus is available. However, you can set up a Grafana workspace in any AWS Region and configure it to get metrics data from a Prometheus workspace that resides in a different AWS Region. For information, see [Use AWS data source configuration to add Amazon Managed Service for Prometheus as a data source](https://docs.aws.amazon.com/grafana/latest/userguide/AMP-adding-AWS-config.html) and [Connect to Amazon Managed Service for Prometheus and open-source Prometheus data sources](https://docs.aws.amazon.com/grafana/latest/userguide/prometheus-data-source.html). 

# Amazon SageMaker HyperPod observability dashboards
<a name="hyperpod-observability-addon-viewing-dashboards"></a>

This topic describes how to view metrics dashboards for your Amazon SageMaker HyperPod (SageMaker HyperPod) clusters and how to add new users to a dashboard. The topic also describes the different types of dashboards.

## Accessing dashboards
<a name="hyperpod-observability-addon-accessing-dashboards"></a>

To view your SageMaker HyperPod cluster's metrics in Amazon Managed Grafana, perform the following steps:

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Go to your cluster's details page.

1. On the **Dashboard** tab, locate the **HyperPod Observability** section, and choose **Open dashboard in Grafana**.

## Adding new users to an Amazon Managed Grafana workspace
<a name="hyperpod-observability-addon-adding-users"></a>

For information about how to add users to an Amazon Managed Grafana workspace, see [Use AWS IAM Identity Center with your Amazon Managed Grafana workspace](https://docs.aws.amazon.com/grafana/latest/userguide/authentication-in-AMG-SSO.html) in the *Amazon Managed Grafana User Guide*.

## Observability dashboards
<a name="hyperpod-observability-addon-dashboards.title"></a>

The SageMaker HyperPod observability add-on provides six interconnected dashboards in your default Amazon Managed Grafana workspace. Each dashboard provides in-depth insights about different resources and tasks in the clusters for various users such as data scientists, machine learning engineers, and administrators.

### Task dashboard
<a name="hyperpod-observability-addon-task-dashboard"></a>

The Task dashboard provides comprehensive monitoring and visualization of resource utilization metrics for SageMaker HyperPod tasks. The main panel displays a detailed table grouping resource usage by parent tasks, showing CPU, GPU, and memory utilization across pods. Interactive time-series graphs track CPU usage, system memory consumption, GPU utilization percentages, and GPU memory usage for selected pods, allowing you to monitor performance trends over time. The dashboard features powerful filtering capabilities through variables like cluster name, namespace, task type, and specific pods, making it easy to drill down into specific workloads. This monitoring solution is essential for optimizing resource allocation and maintaining performance of machine learning workloads on SageMaker HyperPod.

### Training dashboard
<a name="hyperpod-observability-addon-training-dashboard"></a>

The training dashboard provides comprehensive monitoring of training task health, reliability, and fault management metrics. The dashboard features key performance indicators including task creation counts, success rates, and uptime percentages, along with detailed tracking of both automatic and manual restart events. It offers detailed visualizations of fault patterns through pie charts and heatmaps that break down incidents by type and remediation latency, enabling you to identify recurring issues and optimize task reliability. The interface includes real-time monitoring of critical metrics like system recovery times and fault detection latencies, making it an essential tool for maintaining high availability of training workloads. Additionally, the dashboard's 24-hour trailing window provides historical context for analyzing trends and patterns in training task performance, helping teams proactively address potential issues before they impact production workloads.

### Inference dashboard
<a name="hyperpod-observability-addon-inference-dashboard"></a>

The inference dashboard provides comprehensive monitoring of model deployment performance and health metrics across multiple dimensions. It features a detailed overview of active deployments, real-time monitoring of request rates, success percentages, and latency metrics, enabling you to track model serving performance and identify potential bottlenecks. The dashboard includes specialized panels for both general inference metrics and token-specific metrics for language models, such as time to first token (TTFT) and token throughput, making it particularly valuable for monitoring large language model deployments. Additionally, it provides infrastructure insights through pod and node allocation tracking, while offering detailed error analysis capabilities to help maintain high availability and performance of inference workloads.

### Cluster dashboard
<a name="hyperpod-observability-addon-cluster-dashboard"></a>

The cluster dashboard provides a comprehensive view of cluster health and performance, offering real-time visibility into compute, memory, network, and storage resources across your Amazon SageMaker HyperPod (SageMaker HyperPod) environment. At a glance, you can view critical metrics including total instances, GPU utilization, memory usage, and network performance through an intuitive interface that automatically updates data every few seconds. The dashboard is organized into logical sections, starting with a high-level cluster overview that displays key metrics such as healthy instance percentage and total resource counts, followed by detailed sections for GPU performance, memory utilization, network statistics, and storage metrics. Each section features interactive graphs and panels that allow you to drill down into specific metrics, with customizable time ranges and filtering options by cluster name, instance, or GPU ID.

### File system dashboard
<a name="hyperpod-observability-addon-filesystem-dashboard"></a>

The file system dashboard provides comprehensive visibility into Amazon FSx for Lustre file system performance and health metrics. The dashboard displays critical storage metrics including free capacity, deduplication savings, CPU/memory utilization, disk IOPS, throughput, and client connections across multiple visualizations. It lets you monitor both system-level performance indicators like CPU and memory usage, and storage-specific metrics such as read/write operations and disk utilization patterns. The interface includes alert monitoring capabilities and detailed time-series graphs for tracking performance trends over time, making it valuable for proactive maintenance and capacity planning. Additionally, through its comprehensive metrics coverage, the dashboard helps identify potential bottlenecks, optimize storage performance, and ensure reliable file system operations for SageMaker HyperPod workloads.

### GPU partition dashboard
<a name="hyperpod-observability-addon-gpu-partition-dashboard"></a>

To monitor GPU partition-specific metrics when using Multi-Instance GPU (MIG) configurations, you need to install or upgrade to the latest version of the SageMaker HyperPod observability add-on. This add-on provides comprehensive monitoring capabilities, including MIG-specific metrics such as partition count, memory usage, and compute utilization per GPU partition.

If you already have the SageMaker HyperPod observability add-on installed but need MIG metrics support, update the add-on to the latest version. This process is non-disruptive and maintains your existing monitoring configuration.

SageMaker HyperPod automatically exposes MIG-specific metrics, including:
+ `nvidia_mig_instance_count`: Number of MIG instances per profile
+ `nvidia_mig_memory_usage`: Memory utilization per MIG instance
+ `nvidia_mig_compute_utilization`: Compute utilization per MIG instance
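Once these metrics are flowing into your Prometheus workspace, you can query them per partition. A sketch of illustrative PromQL expressions over the metric names listed above; the label names (`profile`, `node`) are assumptions, so check the actual labels on your metrics before using them:

```python
# Example PromQL expressions over the MIG metrics exposed by SageMaker
# HyperPod. The "profile" and "node" label names are hypothetical;
# verify them against your Amazon Managed Service for Prometheus workspace.
mig_queries = {
    # How many MIG instances exist for each partition profile
    "instances_per_profile": "sum by (profile) (nvidia_mig_instance_count)",
    # Raw memory utilization per MIG instance
    "memory_per_instance": "nvidia_mig_memory_usage",
    # Average compute utilization of MIG instances on each node
    "avg_compute_per_node": "avg by (node) (nvidia_mig_compute_utilization)",
}

for name, expr in mig_queries.items():
    print(f"{name}: {expr}")
```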

### Cluster Logs dashboard
<a name="hyperpod-observability-addon-cluster-logs-dashboard"></a>

The Cluster Logs dashboard provides a centralized view of CloudWatch Logs for your SageMaker HyperPod cluster. The dashboard queries the `/aws/sagemaker/Clusters/{cluster-name}/{cluster-id}` log group and displays log events with filtering capabilities by instance ID, log stream name, log level (ERROR, WARN, INFO, DEBUG), and free-text search. The dashboard includes an events timeline showing log event distribution over time, a total events counter, a searched events timeline for filtered results, and a detailed logs panel with full log messages, timestamps, and log stream metadata. This dashboard uses CloudWatch as its data source and is useful for debugging cluster issues, monitoring instance health events, and investigating training job failures.
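If you want to query the same logs outside Grafana, the log group name follows the pattern described above. A minimal sketch; the cluster name and ID are placeholders, and the CloudWatch Logs call is commented out so the snippet runs without AWS credentials:

```python
def cluster_log_group(cluster_name: str, cluster_id: str) -> str:
    """Build the CloudWatch Logs log group name that the Cluster Logs
    dashboard queries for a SageMaker HyperPod cluster."""
    return f"/aws/sagemaker/Clusters/{cluster_name}/{cluster_id}"

# Placeholder values; substitute your own cluster name and cluster ID.
log_group = cluster_log_group("my-hyperpod-cluster", "abc123example")

# import boto3
# logs = boto3.client("logs")
# events = logs.filter_log_events(
#     logGroupName=log_group,
#     filterPattern="ERROR",  # mirrors the dashboard's log-level filter
# )

print(log_group)
```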

# Exploring SageMaker HyperPod cluster metrics in Amazon Managed Grafana
<a name="hyperpod-observability-addon-exploring-metrics"></a>

After you connect Amazon Managed Grafana to your Amazon Managed Service for Prometheus workspace, you can use Grafana's query editor and visualization tools to explore your metrics data. Amazon Managed Grafana provides multiple ways to interact with Prometheus data, including a comprehensive query editor for building PromQL expressions, a metrics browser for discovering available metrics and labels, and templating capabilities for creating dynamic dashboards. You can perform range queries to visualize time series data over periods and instant queries to retrieve the latest values, with options to format results as time series graphs, tables, or heatmaps. For detailed information about configuring query settings, using the metrics browser, and leveraging templating features, see [Using the Prometheus data source](https://docs.aws.amazon.com/grafana/latest/userguide/using-prometheus-datasource.html).
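Behind the scenes, a range query is an HTTP request to the Prometheus-compatible API of your Amazon Managed Service for Prometheus workspace. The following sketch only constructs the request URL and parameters; the workspace ID is a placeholder, `DCGM_FI_DEV_GPU_UTIL` is one of the DCGM GPU-utilization metrics, and the required SigV4 signing step (for example, with botocore) is omitted:

```python
import datetime

# Placeholder workspace ID and Region; substitute your own values.
workspace_id = "ws-EXAMPLE1234"
region = "us-east-1"

# Prometheus-compatible query_range endpoint of the workspace
url = (
    f"https://aps-workspaces.{region}.amazonaws.com"
    f"/workspaces/{workspace_id}/api/v1/query_range"
)

# Query the last hour of per-GPU utilization at 60-second resolution
end = datetime.datetime(2025, 1, 1, 12, 0, tzinfo=datetime.timezone.utc)
start = end - datetime.timedelta(hours=1)
params = {
    "query": "avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)",
    "start": start.timestamp(),
    "end": end.timestamp(),
    "step": "60s",
}

print(url)
print(params["query"])
```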

# Customizing SageMaker HyperPod cluster metrics dashboards and alerts
<a name="hyperpod-observability-addon-customizing"></a>

Amazon Managed Grafana makes it possible for you to create comprehensive dashboards that visualize your data through panels containing queries connected to your data sources. You can build dashboards from scratch, import existing ones, or export your creations for sharing and backup purposes. Grafana dashboards support dynamic functionality through variables that replace hard-coded values in queries, making your visualizations more flexible and interactive. You can also enhance your dashboards with features like annotations, library panels for reusability, version history management, and custom links to create a complete monitoring and observability solution. For step-by-step guidance on creating, importing, configuring, and managing dashboards, see [Building dashboards](https://docs.aws.amazon.com/grafana/latest/userguide/v10-dash-building-dashboards.html).

# Creating custom SageMaker HyperPod cluster metrics
<a name="hyperpod-observability-addon-custom-metrics"></a>

The Amazon SageMaker HyperPod (SageMaker HyperPod) observability add-on provides hundreds of health, performance, and efficiency metrics out-of-the-box. In addition to those metrics, you might need to monitor custom metrics specific to your applications or business needs that aren't captured by default metrics, such as model-specific performance indicators, data processing statistics, or application-specific measurements. To address this need, you can implement custom metrics collection using OpenTelemetry by integrating a Python code snippet into your application.

To create custom metrics, first run the following shell command to install the core OpenTelemetry components needed to instrument Python applications for observability. This installation makes it possible for Python applications that run on SageMaker HyperPod clusters to emit custom telemetry data. That data gets collected by the OpenTelemetry collector and forwarded to the observability infrastructure.

```
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc
```

The following example script configures an OpenTelemetry metrics pipeline that automatically tags metrics with pod and node information, ensuring proper attribution within your cluster, and sends these metrics to the SageMaker HyperPod built-in observability stack every second. The script establishes a connection to the SageMaker HyperPod metrics collector, sets up appropriate resource attributes for identification, and provides a meter interface through which you can create various types of metrics (counters, gauges, or histograms) to track any aspect of your application's performance. Custom metrics integrate with the SageMaker HyperPod monitoring dashboards alongside system metrics. This integration allows for comprehensive observability through a single interface where you can create custom alerts, visualizations, and reports to monitor your workload's complete performance profile.

```
import os
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

# Get hostname/pod name
hostname = os.uname()[1]
node_name = os.getenv('NODE_NAME', 'unknown')

collector_endpoint = "hyperpod-otel-collector.hyperpod-observability:4317"

# Configure the OTLP exporter
exporter = OTLPMetricExporter(
    endpoint=collector_endpoint,
    insecure=True,
    timeout=5  # 5-second request timeout (value is in seconds)
)

reader = PeriodicExportingMetricReader(
    exporter,
    export_interval_millis=1000
)

resource = Resource.create({
    "service.name": "metric-test",
    "pod.name": hostname,
    "node.name": node_name
})

meter_provider = MeterProvider(
    metric_readers=[reader],
    resource=resource
)
metrics.set_meter_provider(meter_provider)

# Create a meter
meter = metrics.get_meter("test-meter")

# Create a counter
counter = meter.create_counter(
    name="test.counter",
    description="A test counter"
)

counter.add(1, {"pod": hostname, "node": node_name})
```

# SageMaker HyperPod cluster metrics
<a name="hyperpod-observability-cluster-metrics"></a>

Amazon SageMaker HyperPod (SageMaker HyperPod) publishes various metrics across 9 distinct categories to your Amazon Managed Service for Prometheus workspace. Not all metrics are enabled by default or displayed in your Amazon Managed Grafana workspace. The following table shows which metrics are enabled by default when you install the observability add-on, which categories have additional metrics that can be enabled for more granular cluster information, and where they appear in the Amazon Managed Grafana workspace.


| Metric category | Enabled by default? | Additional advanced metrics available? | Available under which Grafana dashboards? | 
| --- | --- | --- | --- | 
| Training metrics | Yes | Yes | Training | 
| Inference metrics | Yes | No | Inference | 
| Task governance metrics | No | Yes | None. Query your Amazon Managed Service for Prometheus workspace to build your own dashboard. | 
| Scaling metrics | No | Yes | None. Query your Amazon Managed Service for Prometheus workspace to build your own dashboard. | 
| Cluster metrics | Yes | Yes | Cluster | 
| Instance metrics | Yes | Yes | Cluster | 
| Accelerated compute metrics | Yes | Yes | Task, Cluster | 
| Network metrics | No | Yes | Cluster | 
| File system | Yes | No | File system | 

The following tables describe the metrics available for monitoring your SageMaker HyperPod cluster, organized by category.

## Metrics availability on Restricted Instance Groups
<a name="hyperpod-observability-rig-metrics-availability"></a>

When your cluster contains Restricted Instance Groups, most metric categories are available on restricted nodes, with the following exceptions and considerations. You can also set up alerting on any metric of your choice.


| Metric category | Available on RIG nodes? | Notes | 
| --- | --- | --- | 
| Training metrics | Yes | Kubeflow and Kubernetes pod metrics are collected. Advanced training KPI metrics (from Training Metrics Agent) are not available from the RIG nodes. | 
| Inference metrics | No | Inference workloads are not supported on Restricted Instance Groups. | 
| Task governance metrics | No | Kueue metrics are collected from the standard nodes only, if any. | 
| Scaling metrics | No | KEDA metrics are collected from the standard nodes only, if any. | 
| Cluster metrics | Yes | Kube State Metrics and API server metrics are available. Kube State Metrics is preferentially scheduled on standard nodes but can run on restricted nodes in RIG-only clusters. | 
| Instance metrics | Yes | Node Exporter and cAdvisor metrics are collected on all nodes including restricted nodes. | 
| Accelerated compute metrics | Yes | DCGM Exporter runs on GPU-enabled restricted nodes. Neuron Monitor runs on Neuron-enabled restricted nodes when advanced mode is enabled. | 
| Network metrics | Yes | EFA Exporter runs on EFA-enabled restricted nodes when advanced mode is enabled. | 
| File system metrics | Yes | FSx for Lustre cluster utilization metrics are supported on Restricted Instance Groups. | 

**Note**  
Container log collection with Fluent Bit is not deployed on restricted nodes. Cluster logs from restricted nodes are available through the SageMaker HyperPod platform independently of the observability add-on. You can view these logs in the Cluster Logs dashboard.

## Training metrics
<a name="hyperpod-observability-training-metrics"></a>

Use these metrics to track the performance of training tasks executed on the SageMaker HyperPod cluster.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| Kubeflow metrics | [https://github.com/kubeflow/trainer](https://github.com/kubeflow/trainer) | Yes | Kubeflow | 
| Kubernetes pod metrics | [https://github.com/kubernetes/kube-state-metrics](https://github.com/kubernetes/kube-state-metrics) | Yes | Kubernetes | 
| training\_uptime\_percentage | Percentage of training time out of the total window size | No | SageMaker HyperPod training operator | 
| training\_manual\_recovery\_count | Total number of manual restarts performed on the job | No | SageMaker HyperPod training operator | 
| training\_manual\_downtime\_ms | Total time in milliseconds the job was down due to manual interventions | No | SageMaker HyperPod training operator | 
| training\_auto\_recovery\_count | Total number of automatic recoveries | No | SageMaker HyperPod training operator | 
| training\_auto\_recovery\_downtime | Total infrastructure overhead time in milliseconds during fault recovery | No | SageMaker HyperPod training operator | 
| training\_fault\_count | Total number of faults encountered during training | No | SageMaker HyperPod training operator | 
| training\_fault\_type\_count | Distribution of faults by type | No | SageMaker HyperPod training operator | 
| training\_fault\_recovery\_time\_ms | Recovery time in milliseconds for each type of fault | No | SageMaker HyperPod training operator | 
| training\_time\_ms | Total time in milliseconds spent in actual training | No | SageMaker HyperPod training operator | 

## Inference metrics
<a name="hyperpod-observability-inference-metrics"></a>

Use these metrics to track the performance of inference tasks on the SageMaker HyperPod cluster.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| model\_invocations\_total | Total number of invocation requests to the model | Yes | SageMaker HyperPod inference operator | 
| model\_errors\_total | Total number of errors during model invocation | Yes | SageMaker HyperPod inference operator | 
| model\_concurrent\_requests | Active concurrent model requests | Yes | SageMaker HyperPod inference operator | 
| model\_latency\_milliseconds | Model invocation latency in milliseconds | Yes | SageMaker HyperPod inference operator | 
| model\_ttfb\_milliseconds | Model time to first byte latency in milliseconds | Yes | SageMaker HyperPod inference operator | 
| TGI | These metrics can be used to monitor the performance of TGI and auto-scaling deployments, and to help identify bottlenecks. For a detailed list of metrics, see [https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md](https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md). | Yes | Model container | 
| LMI | These metrics can be used to monitor the performance of LMI, and to help identify bottlenecks. For a detailed list of metrics, see [https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md](https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md). | Yes | Model container | 

## Task governance metrics
<a name="hyperpod-observability-task-governance-metrics"></a>

Use these metrics to monitor task governance and resource allocation on the SageMaker HyperPod cluster.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| Kueue | See [https://kueue.sigs.k8s.io/docs/reference/metrics/](https://kueue.sigs.k8s.io/docs/reference/metrics/). | No | Kueue | 

## Scaling metrics
<a name="hyperpod-observability-scaling-metrics"></a>

Use these metrics to monitor auto-scaling behavior and performance on the SageMaker HyperPod cluster.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| KEDA Operator Metrics | See [https://keda.sh/docs/2.17/integrations/prometheus/#operator](https://keda.sh/docs/2.17/integrations/prometheus/#operator). | No | Kubernetes Event-driven Autoscaler (KEDA) | 
| KEDA Webhook Metrics | See [https://keda.sh/docs/2.17/integrations/prometheus/#admission-webhooks](https://keda.sh/docs/2.17/integrations/prometheus/#admission-webhooks). | No | Kubernetes Event-driven Autoscaler (KEDA) | 
| KEDA Metrics server Metrics | See [https://keda.sh/docs/2.17/integrations/prometheus/#metrics-server](https://keda.sh/docs/2.17/integrations/prometheus/#metrics-server). | No | Kubernetes Event-driven Autoscaler (KEDA) | 

## Cluster metrics
<a name="hyperpod-observability-cluster-health-metrics"></a>

Use these metrics to monitor overall cluster health and resource allocation.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| Cluster health | Kubernetes API server metrics. See [https://kubernetes.io/docs/reference/instrumentation/metrics/](https://kubernetes.io/docs/reference/instrumentation/metrics/). | Yes | Kubernetes | 
| Kubestate | See [https://github.com/kubernetes/kube-state-metrics/tree/main/docs#default-resources](https://github.com/kubernetes/kube-state-metrics/tree/main/docs#default-resources). | Limited | Kubernetes | 
| KubeState Advanced | See [https://github.com/kubernetes/kube-state-metrics/tree/main/docs#optional-resources](https://github.com/kubernetes/kube-state-metrics/tree/main/docs#optional-resources). | No | Kubernetes | 

## Instance metrics
<a name="hyperpod-observability-instance-metrics"></a>

Use these metrics to monitor individual instance performance and health.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| Node Metrics | See [https://github.com/prometheus/node\_exporter?tab=readme-ov-file#enabled-by-default](https://github.com/prometheus/node_exporter?tab=readme-ov-file#enabled-by-default). | Yes | Kubernetes | 
| Container Metrics | Container metrics exposed by cAdvisor. See [https://github.com/google/cadvisor](https://github.com/google/cadvisor). | Yes | Kubernetes | 

## Accelerated compute metrics
<a name="hyperpod-observability-accelerated-compute-metrics"></a>

Use these metrics to monitor the performance, health, and utilization of individual accelerated compute devices in your cluster.

**Note**  
When GPU partitioning with MIG (Multi-Instance GPU) is enabled on your cluster, DCGM metrics automatically provide partition-level granularity for monitoring individual MIG instances. Each MIG partition is exposed as a separate GPU device with its own metrics for temperature, power, memory utilization, and compute activity. This allows you to track resource usage and health for each GPU partition independently, enabling precise monitoring of workloads running on fractional GPU resources. For more information about configuring GPU partitioning, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| NVIDIA GPU | DCGM metrics. See [https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv](https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv). | Limited |  NVIDIA Data Center GPU Manager (DCGM)  | 
|  NVIDIA GPU (advanced)  | DCGM metrics that are commented out in the following CSV file: [https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv](https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv). | No |  NVIDIA Data Center GPU Manager (DCGM)  | 
| AWS Trainium | Neuron metrics. See [https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html#neuron-monitor-nc-counters](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html#neuron-monitor-nc-counters). | No | AWS Neuron Monitor | 

## Network metrics
<a name="hyperpod-observability-network-metrics"></a>

Use these metrics to monitor the performance and health of the Elastic Fabric Adapters (EFA) in your cluster.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| EFA | See [https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation\_and\_observability/3.efa-node-exporter/README.md](https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/3.efa-node-exporter/README.md). | No | Elastic Fabric Adapter | 

## File system metrics
<a name="hyperpod-observability-file-system-metrics"></a>


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| File system | Amazon FSx for Lustre metrics from Amazon CloudWatch. See [Monitoring with Amazon CloudWatch](https://docs.aws.amazon.com/fsx/latest/LustreGuide/monitoring-cloudwatch.html). | Yes | Amazon FSx for Lustre | 

# Preconfigured alerts
<a name="hyperpod-observability-addon-alerts"></a>

The Amazon SageMaker HyperPod (SageMaker HyperPod) observability add-on enables default alerts for your cluster and workloads to notify you when the system detects common early indicators of cluster under-performance. These alerts are defined within the Amazon Managed Grafana built-in alerting system. For information about how to modify these pre-configured alerts or create new ones, see [Alerts in Grafana version 10](https://docs.aws.amazon.com/grafana/latest/userguide/v10-alerts.html) in the *Amazon Managed Grafana User Guide*. The following YAML shows the default alerts.

```
groups:
- name: sagemaker_hyperpod_alerts
  rules:
  # GPU_TEMP_ABOVE_80C
  - alert: GPUHighTemperature
    expr: DCGM_FI_DEV_GPU_TEMP > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "GPU Temperature Above 80C"
      description: "GPU {{ $labels.gpu }} temperature is {{ $value }}°C."

  # GPU_TEMP_ABOVE_85C  
  - alert: GPUCriticalTemperature  
    expr: DCGM_FI_DEV_GPU_TEMP > 85
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "GPU Temperature Above 85C"
      description: "GPU {{ $labels.gpu }} temperature is {{ $value }}°C."

  # GPU_MEMORY_ERROR
  # Any ECC double-bit errors indicate serious memory issues requiring immediate attention
  - alert: GPUMemoryErrorDetected
    expr: DCGM_FI_DEV_ECC_DBE_VOL_TOTAL > 0 or DCGM_FI_DEV_ECC_DBE_AGG_TOTAL > DCGM_FI_DEV_ECC_DBE_AGG_TOTAL offset 5m
    labels:
      severity: critical
    annotations:
      summary: "GPU ECC Double-Bit Error Detected"
      description: "GPU {{ $labels.gpu }} has detected ECC double-bit errors."

  # GPU_POWER_WARNING
  # Sustained power limit violations can impact performance and stability
  - alert: GPUPowerViolation
    expr: DCGM_FI_DEV_POWER_VIOLATION > 100
    for: 5m
    labels:
      severity: warning  
    annotations:
      summary: "GPU Power Violation"
      description: "GPU {{ $labels.gpu }} has been operating at power limit for extended period."

  # GPU_NVLINK_ERROR
  # NVLink errors above threshold indicate interconnect stability issues
  - alert: NVLinkErrorsDetected
    expr: DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL > 0 or DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL > 10
    labels:
      severity: warning
    annotations:
      summary: "NVLink Errors Detected" 
      description: "GPU {{ $labels.gpu }} has detected NVLink errors."

  # GPU_THERMAL_VIOLATION  
  # Immediate alert on thermal violations to prevent hardware damage
  - alert: GPUThermalViolation
    expr: increase(DCGM_FI_DEV_THERMAL_VIOLATION[5m]) > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "GPU Thermal Violation Detected"
      description: "GPU {{ $labels.gpu }} has thermal violations on node {{ $labels.Hostname }}"

  # GPU_XID_ERROR
  # XID errors indicate driver or hardware level GPU issues requiring investigation
  - alert: GPUXidError
    expr: DCGM_FI_DEV_XID_ERRORS > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "GPU XID Error Detected"
      description: "GPU {{ $labels.gpu }} experienced XID error {{ $value }} on node {{ $labels.Hostname }}"

  # MIG_CONFIG_FAILURE
  # MIG configuration failures indicate issues with GPU partitioning setup
  - alert: MIGConfigFailure
    expr: kubelet_node_name{nvidia_com_mig_config_state="failed"} > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "MIG Configuration Failed"
      description: "MIG configuration failed on node {{ $labels.instance }}"

  # DISK_SPACE_WARNING
  # 90% threshold ensures time to respond before complete disk exhaustion
  - alert: NodeDiskSpaceWarning
    expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High Disk Usage"
      description: "Node {{ $labels.instance }} disk usage is above 90%"

  # FSX_STORAGE_WARNING
  # 80% FSx utilization allows buffer for burst workloads
  - alert: FsxLustreStorageWarning
    expr: fsx_lustre_storage_used_bytes / fsx_lustre_storage_capacity_bytes * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High FSx Lustre Usage"
      description: "FSx Lustre storage usage is above 80% on file system {{ $labels.filesystem_id }}"
```

# Troubleshooting the Amazon SageMaker HyperPod observability add-on
<a name="hyperpod-observability-addon-troubleshooting"></a>

Use the following guidance to resolve common issues with the Amazon SageMaker HyperPod (SageMaker HyperPod) observability add-on.

## Troubleshooting missing metrics in Amazon Managed Grafana
<a name="troubleshooting-missing-metrics"></a>

If metrics don't appear in your Amazon Managed Grafana dashboards, perform the following steps to identify and resolve the issue.

### Verify the Amazon Managed Service for Prometheus-Amazon Managed Grafana connection
<a name="verify-amp-grafana-connection"></a>

1. Sign in to the Amazon Managed Grafana console.

1. In the left pane, choose **All workspaces**.

1. In the **Workspaces** table, choose your workspace.

1. In the details page of the workspace, choose the **Data sources** tab.

1. Verify that the Amazon Managed Service for Prometheus data source exists.

1. Check the connection settings:
   + Confirm that the endpoint URL is correct.
   + Verify that IAM authentication is properly configured.
   + Choose **Test connection**. Verify that the status is **Data source is working**.

### Verify the Amazon EKS add-on status
<a name="verify-eks-addon-status"></a>

1. Open the Amazon EKS console at [https://console.aws.amazon.com/eks/home#/clusters](https://console.aws.amazon.com/eks/home#/clusters).

1. Select your cluster.

1. Choose the **Add-ons** tab.

1. Verify that the SageMaker HyperPod observability add-on is listed and that its status is **ACTIVE**.

1. If the status isn't **ACTIVE**, see [Troubleshooting add-on installation failures](#troubleshooting-addon-installation-failures).

### Verify Pod Identity association
<a name="verify-pod-identity-association"></a>

1. Open the Amazon EKS console at [https://console.aws.amazon.com/eks/home#/clusters](https://console.aws.amazon.com/eks/home#/clusters).

1. Select your cluster.

1. On the cluster details page, choose the **Access** tab.

1. In the **Pod Identity associations** table, choose the association that has the following property values:
   + **Namespace**: `hyperpod-observability`
   + **Service account**: `hyperpod-observability-operator-otel-collector`
   + **Add-on**: `amazon-sagemaker-hyperpod-observability`

1. Ensure that the IAM role that is attached to this association has the following permissions.

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Sid": "PrometheusAccess",
               "Effect": "Allow",
               "Action": "aps:RemoteWrite",
               "Resource": "arn:aws:aps:us-east-1:111122223333:workspace/workspace-ID"
           },
           {
               "Sid": "CloudwatchLogsAccess",
               "Effect": "Allow",
               "Action": [
                   "logs:CreateLogGroup",
                   "logs:CreateLogStream",
                   "logs:DescribeLogGroups",
                   "logs:DescribeLogStreams",
                   "logs:PutLogEvents",
                   "logs:GetLogEvents",
                   "logs:FilterLogEvents",
                   "logs:GetLogRecord",
                   "logs:StartQuery",
                   "logs:StopQuery",
                   "logs:GetQueryResults"
               ],
               "Resource": [
                   "arn:aws:logs:us-east-1:111122223333:log-group:/aws/sagemaker/Clusters/*",
                   "arn:aws:logs:us-east-1:111122223333:log-group:/aws/sagemaker/Clusters/*:log-stream:*"
               ]
           }
       ]
   }
   ```

------

1. Ensure that the IAM role that is attached to this association has the following trust policy. Verify that the source ARN and source account are correct.

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
               "Effect": "Allow",
               "Principal": {
                   "Service": "pods.eks.amazonaws.com"
               },
               "Action": [
                   "sts:AssumeRole",
                   "sts:TagSession"
               ],
               "Condition": {
                   "StringEquals": {
                       "aws:SourceArn": "arn:aws:eks:us-east-1:111122223333:cluster/cluster-name",
                       "aws:SourceAccount": "111122223333"
                   }
               }
           }
       ]
   }
   ```

------

### Check Amazon Managed Service for Prometheus throttling
<a name="check-amp-throttling"></a>

1. Sign in to the AWS Management Console and open the Service Quotas console at [https://console.aws.amazon.com/servicequotas/](https://console.aws.amazon.com/servicequotas/).

1. In the **Managed quotas** box, search for and select Amazon Managed Service for Prometheus.

1. Choose the **Active series per workspace** quota.

1. In the **Resource-level quotas** tab, select your Amazon Managed Service for Prometheus workspace.

1. Ensure that the utilization is less than your current quota.

1. If you've reached the quota limit, select your workspace by choosing the radio button to its left, and then choose **Request increase at resource level**.

### Verify KV caching and intelligent routing are enabled
<a name="verify-caching-routing"></a>

If the `KVCache Metrics` dashboard is missing, the feature is either not enabled or the port isn't specified in the `modelMetrics` configuration. For more information about how to enable this feature, see steps 1 and 3 in [Configure KV caching and intelligent routing for improved performance](sagemaker-hyperpod-model-deployment-deploy-ftm.md#sagemaker-hyperpod-model-deployment-deploy-ftm-cache-route). 

If the `Intelligent Router Metrics` dashboard is missing, the feature isn't enabled. For more information about how to enable it, see [Configure KV caching and intelligent routing for improved performance](sagemaker-hyperpod-model-deployment-deploy-ftm.md#sagemaker-hyperpod-model-deployment-deploy-ftm-cache-route). 

## Troubleshooting add-on installation failures
<a name="troubleshooting-addon-installation-failures"></a>

If the observability add-on fails to install, use the following steps to diagnose and resolve the issue.

### Check health probe status
<a name="check-health-probe-status"></a>

1. Open the Amazon EKS console at [https://console.aws.amazon.com/eks/home#/clusters](https://console.aws.amazon.com/eks/home#/clusters).

1. Select your cluster.

1. Choose the **Add-ons** tab.

1. Choose the failed add-on.

1. Review the **Health issues** section.

1. If the health issue is related to credentials or pod identity, see [Verify Pod Identity association](#verify-pod-identity-association). Also ensure that the pod identity agent add-on is running in the cluster.

1. Check for errors in the manager logs. For instructions, see [Review manager logs](#review-manager-logs).

1. Contact AWS Support with the issue details.

### Review manager logs
<a name="review-manager-logs"></a>

1. Get the logs of the add-on manager pod:

   ```
   kubectl logs -n hyperpod-observability -l control-plane=hyperpod-observability-controller-manager
   ```

1. For urgent issues, contact Support.

## Review all observability pods
<a name="review-all-observability-pods"></a>

All the pods that the SageMaker HyperPod observability add-on creates are in the `hyperpod-observability` namespace. To get the status of these pods, run the following command.

```
kubectl get pods -n hyperpod-observability
```

Look for pods whose status is `Pending` or `CrashLoopBackOff`. Run the following command to get the logs of these pending or failing pods.

```
kubectl logs -n hyperpod-observability pod-name
```

If you don't find errors in the logs, run the following command to describe the pods and look for errors.

```
kubectl describe -n hyperpod-observability pod pod-name
```

To get more context, run the following two commands to describe the deployments and daemonsets for these pods.

```
kubectl describe -n hyperpod-observability deployment deployment-name
```

```
kubectl describe -n hyperpod-observability daemonset daemonset-name
```

## Troubleshooting pods that are stuck in the Pending status
<a name="pods-stuck-in-pending"></a>

If pods are stuck in the `Pending` status, make sure that the node is large enough to fit all the pods. To verify this, perform the following steps.

1. Open the Amazon EKS console at [https://console.aws.amazon.com/eks/home#/clusters](https://console.aws.amazon.com/eks/home#/clusters).

1. Choose your cluster.

1. Choose the cluster's **Compute** tab.

1. Choose the node with the smallest instance type.

1. In the capacity allocation section, look for available pods.

1. If there are no available pods, then you need a larger instance type.

For urgent issues, contact AWS Support.

## Troubleshooting observability on Restricted Instance Groups
<a name="troubleshooting-rig-observability"></a>

Use the following guidance to resolve issues specific to clusters with Restricted Instance Groups.

### Observability pods not starting on restricted nodes
<a name="troubleshooting-rig-pods-not-starting"></a>

If observability pods are not starting on restricted nodes, check the pod status and events:

```
kubectl get pods -n hyperpod-observability -o wide
kubectl describe pod pod-name -n hyperpod-observability
```

Common causes include:
+ **Image pull failures:** The pod events may show image pull errors if the observability container images are not yet allowlisted on the restricted nodes. Ensure that you are running the latest version of the observability add-on. If the issue persists after upgrading, contact Support.
+ **Taint tolerations:** Verify that the pod spec includes the required toleration for restricted nodes. Starting with version `v1.0.5-eksbuild.1`, the add-on automatically adds this toleration when RIG support is enabled. If you are using an older version, upgrade to the latest version.

### Viewing logs for pods on restricted nodes
<a name="troubleshooting-rig-viewing-logs"></a>

The `kubectl logs` command does not work for pods running on restricted nodes. This is an expected limitation because the communication path required for log streaming is not available on restricted nodes.

To view logs from restricted nodes, use the **Cluster Logs** dashboard in Amazon Managed Grafana, which queries CloudWatch Logs directly. You can filter by instance ID, log stream, log level, and free-text search to find relevant log entries.

### DNS resolution failures in clusters with both standard and restricted nodes
<a name="troubleshooting-rig-dns-resolution"></a>

In hybrid clusters (clusters with both standard and restricted instance groups), pods on standard nodes may experience DNS resolution timeouts when trying to reach AWS service endpoints such as Amazon Managed Service for Prometheus or CloudWatch.

**Cause:** The `kube-dns` service has endpoints from both standard CoreDNS pods and RIG CoreDNS pods. Standard node pods cannot reach RIG CoreDNS endpoints due to network isolation. When `kube-proxy` load-balances a DNS request from a standard node pod to a RIG CoreDNS endpoint, the request times out.

**Resolution:** Set `internalTrafficPolicy: Local` on the `kube-dns` service so that pods only reach CoreDNS on their local node:

```
kubectl patch svc kube-dns -n kube-system -p '{"spec":{"internalTrafficPolicy":"Local"}}'
```

After applying this patch, restart the affected observability pods:

```
kubectl delete pods -n hyperpod-observability -l app.kubernetes.io/name=hyperpod-node-collector
```

### Metrics from restricted nodes not reaching Amazon Managed Service for Prometheus
<a name="troubleshooting-rig-metrics-not-reaching-amp"></a>

If metrics from restricted nodes are not appearing in your Amazon Managed Service for Prometheus workspace:

1. **Verify the execution role permissions.** Ensure that the execution role for the Restricted Instance Group has `aps:RemoteWrite` permission for your Prometheus workspace. For more information, see [Additional prerequisites for Restricted Instance Groups](hyperpod-observability-addon-setup.md#hyperpod-observability-addon-rig-prerequisites).

1. **Check the node collector pod status.** Run the following command and verify that node collector pods are running on restricted nodes:

   ```
   kubectl get pods -n hyperpod-observability | grep node-collector
   ```

1. **Check the central collector deployments.** In clusters with restricted nodes, the add-on deploys one central collector per network boundary. Verify that a central collector exists for each boundary:

   ```
   kubectl get deployments -n hyperpod-observability | grep central-collector
   ```

1. **Check pod events for errors.** Use `kubectl describe` on the collector pods to look for error events:

   ```
   kubectl describe pod collector-pod-name -n hyperpod-observability
   ```

If the issue persists after verifying the above, contact Support.

### Pod Identity verification does not apply to restricted instance group nodes
<a name="troubleshooting-rig-pod-identity"></a>

The [Verify Pod Identity association](#verify-pod-identity-association) troubleshooting steps apply only to standard nodes. On restricted nodes, the add-on uses the cluster instance group execution role for AWS authentication instead of Amazon EKS Pod Identity. If metrics are missing from restricted nodes, verify the execution role permissions instead of the Pod Identity association.

### Fluent Bit not running on restricted nodes
<a name="troubleshooting-rig-fluent-bit"></a>

This is expected behavior. Fluent Bit is intentionally not deployed on restricted nodes. Logs from restricted nodes are published to CloudWatch through the SageMaker HyperPod platform independently of the observability add-on. Use the **Cluster Logs** dashboard in Amazon Managed Grafana to view these logs.

# Observability with Amazon CloudWatch
<a name="sagemaker-hyperpod-eks-cluster-observability-cluster-cloudwatch-ci"></a>

Use [Amazon CloudWatch Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html) to collect, aggregate, and summarize metrics and logs from the containerized applications and micro-services on the EKS cluster associated with a HyperPod cluster.

Container Insights collects metrics for compute resources, such as CPU, memory, disk, and network. It also provides diagnostic information, such as container restart failures, to help you isolate and resolve issues quickly. You can also set CloudWatch alarms on metrics that Container Insights collects.

To find a complete list of metrics, see [Amazon EKS and Kubernetes Container Insights metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-EKS.html) in the *Amazon CloudWatch User Guide*.

## Install CloudWatch Container Insights
<a name="sagemaker-hyperpod-eks-cluster-observability-cluster-cloudwatch-ci-setup"></a>

Cluster admin users must set up CloudWatch Container Insights by following the instructions at [Install the CloudWatch agent by using the Amazon CloudWatch Observability EKS add-on or the Helm chart](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Observability-EKS-addon.html) in the *Amazon CloudWatch User Guide*. For more information about the add-on, see [Install the Amazon CloudWatch Observability EKS add-on](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-EKS-addon.html).

After the installation is complete, verify that the CloudWatch Observability add-on appears on the EKS cluster's **Add-ons** tab. It might take a couple of minutes for the dashboard to load.

**Note**  
SageMaker HyperPod requires version v2.0.1-eksbuild.1 or later of the CloudWatch Observability add-on.

![\[CloudWatch Observability service card showing status, version, and IAM role information.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod-eks-CIaddon.png)


## Access the CloudWatch Container Insights dashboard
<a name="sagemaker-hyperpod-eks-cluster-observability-cluster-cloudwatch-ci-access-dashboard"></a>

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. Choose **Insights**, and then choose **Container Insights**.

1. Select the EKS cluster set up with the HyperPod cluster you're using.

1. View the Pod/Cluster level metrics.

![\[Performance monitoring dashboard for EKS cluster showing node status, resource utilization, and pod metrics.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod-eks-CIdashboard.png)


## Access CloudWatch Container Insights logs
<a name="sagemaker-hyperpod-eks-cluster-observability-cluster-cloudwatch-ci-access-log"></a>

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. Choose **Logs**, and then choose **Log groups**.

When your HyperPod cluster is integrated with Amazon CloudWatch Container Insights, you can access the relevant log groups in the following format: `/aws/containerinsights/<eks-cluster-name>/*`. Within these log groups, you can find and explore various types of logs, such as performance logs, host logs, application logs, and data plane logs.
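As an illustration of this naming convention, the following sketch builds candidate log group names for a given EKS cluster. The specific suffixes (`performance`, `host`, `application`, `dataplane`) are examples of the log types mentioned above, not an exhaustive list:

```python
# Illustrative sketch: construct Container Insights log group names for an
# EKS cluster. The suffixes are examples of the log types mentioned above.
def container_insights_log_groups(eks_cluster_name):
    prefix = f"/aws/containerinsights/{eks_cluster_name}"
    return [f"{prefix}/{kind}" for kind in ("performance", "host", "application", "dataplane")]

print(container_insights_log_groups("my-eks-cluster")[0])
# → /aws/containerinsights/my-eks-cluster/performance
```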

# Continuous provisioning for enhanced cluster operations on Amazon EKS
<a name="sagemaker-hyperpod-scaling-eks"></a>

Amazon SageMaker HyperPod clusters created with Amazon EKS orchestration now support continuous provisioning, a capability that enables greater flexibility and efficiency when running large-scale AI/ML workloads. Continuous provisioning lets you start training quickly, scale seamlessly, perform maintenance without disrupting operations, and gain granular visibility into cluster operations. 

**Note**  
Continuous provisioning is available as an optional configuration for HyperPod clusters created with EKS orchestration. Clusters created with Slurm orchestration use a different scaling model.

## How it works
<a name="sagemaker-hyperpod-scaling-eks-how"></a>

The continuous provisioning system introduces a desired-state architecture that replaces the traditional request-based model. This new architecture enables parallel, non-blocking operations across different resource levels while maintaining system stability and performance. The continuous provisioning system:
+ **Accepts the request**: Records the target instance count for each instance group
+ **Initiates provisioning**: Begins launching instances to meet the target count
+ **Tracks progress**: Monitors each instance launch attempt and records the status
+ **Handles failures**: Automatically retries failed launches
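The desired-state loop above can be sketched as follows. This is an illustrative model only (the function and instance-ID scheme are hypothetical), not the actual HyperPod control-plane implementation:

```python
# Hypothetical sketch of a desired-state provisioning loop: record the target,
# launch instances toward it, track each attempt, and retry failures.
def reconcile(target_count, running, launch, max_attempts=25):
    """Close the gap between running instances and the target count.

    `launch` is a callable returning True when an instance launch succeeds.
    Returns the running instance IDs and the status of every attempt.
    """
    statuses = []
    attempt = 0
    while len(running) < target_count and attempt < max_attempts:
        attempt += 1
        instance_id = f"i-{attempt:04d}"
        if launch(instance_id):
            running.append(instance_id)
            statuses.append((instance_id, "Running"))
        else:
            # Failed launches are recorded; the loop retries with a new instance.
            statuses.append((instance_id, "Failure"))
    return running, statuses
```

The key property is that a failed launch never blocks the loop: the target count is the durable state, and the system keeps working toward it.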

Continuous provisioning is disabled by default. To use this feature, set `--node-provisioning-mode` to `Continuous`.

With continuous provisioning enabled, you can initiate multiple scaling operations simultaneously without waiting for previous operations to complete. This lets you scale different instance groups in the same cluster concurrently and submit multiple scaling requests to the same instance group. 

Continuous provisioning also gives you access to [DescribeClusterEvent](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeClusterEvent.html) and [ListClusterEvents](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListClusterEvents.html) for detailed event monitoring and operational visibility. 

## Usage metering
<a name="sagemaker-hyperpod-scaling-eks-metering"></a>

HyperPod clusters with continuous provisioning use instance-level metering to provide accurate billing that reflects actual resource usage. This metering approach differs from traditional cluster-level billing by tracking each instance independently.

**Instance-level billing**

With continuous provisioning, billing starts and stops at the individual instance level rather than waiting for cluster-level state changes. This approach provides the following benefits:
+ **Precise billing accuracy**: Billing starts when the lifecycle script execution begins. If the lifecycle script fails, the instance provisioning is retried, and you are charged for the duration of the lifecycle script runtime.
+ **Independent metering**: Each instance's billing lifecycle is managed separately, preventing cascading billing errors
+ **Real-time billing updates**: Billing starts when an instance begins executing its lifecycle script and stops when the instance enters a terminating state

**Billing lifecycle**

Each instance in your HyperPod cluster follows this billing lifecycle:
+ **Billing starts**: When the instance successfully launches and begins executing its lifecycle configuration script
+ **Billing continues**: Throughout the instance's operational lifetime
+ **Billing stops**: When the instance enters a terminating state, regardless of the reason for termination

**Note**  
Billing does not start for instances that fail to launch. If an instance launch fails due to insufficient capacity or other issues, you are not charged for that failed attempt. Billing is calculated at the instance level and costs are aggregated and reported under your cluster's Amazon Resource Name (ARN). 
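The billing lifecycle above can be expressed as a small calculation. This is a sketch under stated assumptions: the dictionary keys are hypothetical, not SageMaker API fields, and actual billing is performed by the service:

```python
# Sketch of the instance-level billing rule: billing starts when the lifecycle
# script begins executing and stops when the instance enters a terminating
# state. Failed launches (no lifecycle script start) accrue no charge.
# The dictionary keys here are hypothetical, not SageMaker API fields.
def billable_seconds(instances, now):
    total = 0.0
    for inst in instances:
        start = inst.get("lifecycle_script_started")
        if start is None:
            continue  # launch failed before the lifecycle script ran: not billed
        stop = inst.get("terminating_at") or now  # still running: bill up to now
        total += (stop - start).total_seconds()
    return total
```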

## Create a cluster with continuous provisioning enabled
<a name="sagemaker-hyperpod-scaling-eks-create"></a>

**Note**  
You must have an existing Amazon EKS cluster configured with VPC networking and the required Helm chart installed. Additionally, prepare a lifecycle configuration script and upload it to an Amazon S3 bucket that your execution role can access. For more information, see [Managing SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-operate.md).

The following AWS CLI operation creates a HyperPod cluster with one instance group and continuous provisioning enabled.

```
aws sagemaker create-cluster \
--cluster-name $HP_CLUSTER_NAME \
--orchestrator 'Eks={ClusterArn='$EKS_CLUSTER_ARN'}' \
--vpc-config '{
   "SecurityGroupIds": ["'$SECURITY_GROUP'"],
   "Subnets": ["'$SUBNET'"]
}' \
--instance-groups '{
   "InstanceGroupName": "ig-1",
   "InstanceType": "ml.c5.2xlarge",
   "InstanceCount": 2,
   "LifeCycleConfig": {
      "SourceS3Uri": "s3://'$BUCKET_NAME'",
      "OnCreate": "on_create_noop.sh"
   },
   "ExecutionRole": "'$EXECUTION_ROLE'",
   "ThreadsPerCore": 1,
   "TrainingPlanArn": ""
}' \
--node-provisioning-mode Continuous


# Expected output:
{
    "ClusterArn": "arn:aws:sagemaker:us-west-2:<account-id>:cluster/<cluster-id>"
}
```

After you’ve created your cluster, you can use [ListClusterNodes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListClusterNodes.html) or [DescribeClusterNode](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeClusterNode.html) to find out more information about the nodes in the cluster. 

Calling these operations will return a [ClusterInstanceStatusDetails](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ClusterInstanceStatusDetails.html) object with one of the following values: 
+  **Running**: The node is healthy and registered with the cluster orchestrator (EKS). 
+  **Failure**: The node provisioning failed but the system will automatically retry provisioning with a new EC2 instance. 
+  **Pending**: The node is being provisioned or rebooted. 
+  **ShuttingDown**: The node termination is in progress. The node will either transition to Failure status if termination encounters issues, or will be successfully removed from the cluster. 
+  **SystemUpdating**: The node is undergoing AMI patching, either triggered manually or as part of patching cronjobs. 
+  **DeepHealthCheckInProgress**: [Deep health checks (DHCs)](sagemaker-hyperpod-eks-resiliency-deep-health-checks.md) are being conducted. This could take anywhere from a few minutes to several hours, depending on the nature of the tests. Bad nodes are replaced, and healthy nodes switch to Running. 
+  **NotFound**: Used in the [BatchAddClusterNodes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_BatchAddClusterNodes.html) response to indicate that a node has been deleted during idempotent replay. 
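For example, a `ListClusterNodes` response can be summarized locally to spot nodes that need attention. The following sketch assumes only the documented `InstanceId` and `InstanceStatus.Status` fields of each node summary; treat it as illustrative, not a complete client:

```python
from collections import Counter

# Sketch: tally node statuses from a ListClusterNodes-style response and flag
# nodes that may need attention. The response shape is simplified here.
def summarize_nodes(node_summaries):
    counts = Counter(n["InstanceStatus"]["Status"] for n in node_summaries)
    needs_attention = [
        n["InstanceId"]
        for n in node_summaries
        if n["InstanceStatus"]["Status"] in ("Failure", "ShuttingDown")
    ]
    return dict(counts), needs_attention
```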

## Minimum capacity requirements (MinCount)
<a name="sagemaker-hyperpod-scaling-eks-mincount"></a>

The MinCount feature allows you to specify the minimum number of instances that must be successfully provisioned before an instance group transitions to the `InService` status. This feature provides better control over scaling operations and helps prevent scenarios where partially provisioned instance groups cannot be used effectively for training workloads.

**Important**  
MinCount is not a permanent guarantee of minimum capacity. It only ensures that the specified minimum number of instances are available when the instance group first becomes `InService`. Brief dips below MinCount may occur during normal operations such as unhealthy instance replacements or maintenance activities.

### How MinCount works
<a name="sagemaker-hyperpod-scaling-eks-mincount-how"></a>

When you create or update an instance group with MinCount enabled, the following behavior occurs:
+ **New instance groups**: The instance group remains in `Creating` status until at least MinCount instances are successfully provisioned and ready. Once this threshold is met, the instance group transitions to `InService`.
+ **Existing instance groups**: When updating MinCount on an existing instance group, the status changes to `Updating` until the new MinCount requirement is satisfied.
+ **Continuous scaling**: If TargetCount is greater than MinCount, the continuous scaling system continues attempting to launch additional instances until TargetCount is reached.
+ **Timeout and rollback**: If MinCount cannot be satisfied within 3 hours, the system automatically rolls back the instance group to its last known good state. For more information about rollback behavior, see [Automatic rollback behavior](#sagemaker-hyperpod-scaling-eks-mincount-rollback).

### Instance group status during MinCount operations
<a name="sagemaker-hyperpod-scaling-eks-mincount-status"></a>

Instance groups with MinCount configured exhibit the following status behavior:

Creating  
For new instance groups when CurrentCount < MinCount. The instance group remains in this status until the minimum capacity requirement is met.

Updating  
For existing instance groups when MinCount is modified and CurrentCount < MinCount. The instance group remains in this status until the new minimum capacity requirement is satisfied.

InService  
When MinCount ≤ CurrentCount ≤ TargetCount. The instance group is ready for use and all mutating operations are unblocked.

During `Creating` or `Updating` status, the following restrictions apply:
+ Mutating operations such as `BatchAddClusterNodes`, `BatchDeleteClusterNodes`, or `UpdateClusterSoftware` are blocked
+ You can still modify MinCount and TargetCount values to correct configuration errors
+ Cluster and instance group deletion is always permitted
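The status rules above can be summarized in a small decision function. This is an illustrative sketch of the documented behavior, not service code:

```python
# Sketch of the MinCount status rules: a group stays in Creating (new groups)
# or Updating (existing groups) until CurrentCount reaches MinCount, after
# which it is InService even while scaling continues toward TargetCount.
def instance_group_status(current_count, min_count, is_new_group):
    if current_count < min_count:
        return "Creating" if is_new_group else "Updating"
    return "InService"
```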

### Automatic rollback behavior
<a name="sagemaker-hyperpod-scaling-eks-mincount-rollback"></a>

If an instance group cannot reach its MinCount within 3 hours, the system automatically initiates a rollback to prevent indefinite waiting:
+ **New instance groups**: MinCount and TargetCount are reset to (0, 0)
+ **Existing instance groups**: MinCount and TargetCount are restored to their values from the last `InService` state
+ **Instance selection for termination**: If instances need to be terminated during rollback, the system selects the unhealthy instances first, then those that were most recently provisioned.
+ **Status transition**: The instance group immediately transitions to `InService` status after rollback initiation, allowing the continuous scaling system to manage capacity according to the rollback settings

The 3-hour timeout resets each time MinCount is updated. For example, if you update MinCount multiple times, the timeout period starts fresh from the most recent update.

### MinCount events
<a name="sagemaker-hyperpod-scaling-eks-mincount-events"></a>

The system emits specific events to help you track MinCount operations:
+ **Minimum capacity reached**: Emitted when an instance group successfully reaches its MinCount and transitions to `InService`
+ **Rollback initiated**: Emitted when the 3-hour timeout expires and automatic rollback begins

You can monitor these events using [ListClusterEvents](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListClusterEvents.html) to track the progress of your MinCount operations.

### API usage
<a name="sagemaker-hyperpod-scaling-eks-mincount-api"></a>

MinCount is specified using the `MinInstanceCount` parameter in instance group configurations:

```
aws sagemaker create-cluster \
--cluster-name $HP_CLUSTER_NAME \
--orchestrator 'Eks={ClusterArn='$EKS_CLUSTER_ARN'}' \
--vpc-config '{
   "SecurityGroupIds": ["'$SECURITY_GROUP'"],
   "Subnets": ["'$SUBNET'"]
}' \
--instance-groups '{
   "InstanceGroupName": "worker-group",
   "InstanceType": "ml.p4d.24xlarge",
   "InstanceCount": 64,
   "MinInstanceCount": 50,
   "LifeCycleConfig": {
      "SourceS3Uri": "s3://'$BUCKET_NAME'",
      "OnCreate": "on_create.sh"
   },
   "ExecutionRole": "'$EXECUTION_ROLE'"
}' \
--node-provisioning-mode Continuous
```

Key considerations for MinCount usage:
+ `MinInstanceCount` must be between 0 and the `InstanceCount` value (inclusive) of the instance group specified in the [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) or [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) request
+ Setting `MinInstanceCount` to 0 (default) preserves standard continuous scaling behavior
+ Setting `MinInstanceCount` equal to `InstanceCount` provides all-or-nothing scaling behavior
+ MinCount is only available for clusters with `NodeProvisioningMode` set to `Continuous`
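
As a sketch of the bounds check above (a client-side sanity check only; the service performs its own validation):

```shell
# Validate MinInstanceCount against InstanceCount before calling CreateCluster; illustrative only.
valid_min_count() {
  local min=$1 total=$2
  if (( min >= 0 && min <= total )); then
    echo "ok"
  else
    echo "invalid"
  fi
}

valid_min_count 50 64   # prints ok: 0 <= 50 <= 64
valid_min_count 70 64   # prints invalid: exceeds InstanceCount
```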

## Flexible instance groups
<a name="sagemaker-hyperpod-scaling-eks-flexible-ig"></a>

Flexible instance groups allow you to specify multiple instance types within a single instance group. This simplifies cluster management by reducing the number of instance groups you need to create and manage, especially for inference workloads that use autoscaling.

With flexible instance groups, HyperPod:
+ Attempts to provision instances using the first instance type in your list
+ Falls back to subsequent instance types if capacity is unavailable
+ Terminates instances of the lowest-priority instance type first during scale-down

**Note**  
Flexible instance groups are only available for clusters with `NodeProvisioningMode` set to `Continuous`. The `InstanceType` and `InstanceRequirements` properties are mutually exclusive—you can specify one or the other, but not both.
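
The list-order fallback described above can be sketched in shell (illustrative only; the capacity numbers are made up and the service's actual capacity checks are internal):

```shell
# Pick the first instance type in priority order that has capacity; illustrative only.
declare -A available=( [ml.p5.48xlarge]=0 [ml.p4d.24xlarge]=3 [ml.g6.48xlarge]=8 )
priority=(ml.p5.48xlarge ml.p4d.24xlarge ml.g6.48xlarge)

pick_instance_type() {
  local t
  for t in "${priority[@]}"; do
    if (( ${available[$t]} > 0 )); then
      echo "$t"
      return 0
    fi
  done
  return 1   # no capacity for any listed type
}

pick_instance_type   # prints ml.p4d.24xlarge: the first-priority type has no capacity
```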

### Create a cluster with a flexible instance group
<a name="sagemaker-hyperpod-scaling-eks-flexible-ig-create"></a>

Use `InstanceRequirements` instead of `InstanceType` to create a flexible instance group. The order of instance types in the list determines the priority for provisioning.

```
aws sagemaker create-cluster \
--cluster-name $HP_CLUSTER_NAME \
--orchestrator 'Eks={ClusterArn='$EKS_CLUSTER_ARN'}' \
--vpc-config '{
   "SecurityGroupIds": ["'$SECURITY_GROUP'"],
   "Subnets": ["'$SUBNET_AZ1'", "'$SUBNET_AZ2'"]
}' \
--instance-groups '[{
   "InstanceGroupName": "flexible-ig",
   "InstanceRequirements": {
      "InstanceTypes": ["ml.p5.48xlarge", "ml.p4d.24xlarge", "ml.g6.48xlarge"]
   },
   "InstanceCount": 10,
   "LifeCycleConfig": {
      "SourceS3Uri": "s3://'$BUCKET_NAME'",
      "OnCreate": "on_create.sh"
   },
   "ExecutionRole": "'$EXECUTION_ROLE'"
}]' \
--node-provisioning-mode Continuous
```

### Targeted scaling with BatchAddClusterNodes
<a name="sagemaker-hyperpod-scaling-eks-flexible-ig-targeted"></a>

When using flexible instance groups, you can use [BatchAddClusterNodes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_BatchAddClusterNodes.html) to add nodes with specific instance types and availability zones. This is particularly useful when Karpenter autoscaling determines the optimal instance type and availability zone for your workload.

```
aws sagemaker batch-add-cluster-nodes \
--cluster-name $HP_CLUSTER_NAME \
--nodes-to-add '[
   {
      "InstanceGroupName": "flexible-ig",
      "IncrementTargetCountBy": 1,
      "InstanceTypes": ["ml.p5.48xlarge"],
      "AvailabilityZones": ["us-west-2a"]
   }
]'
```

### View flexible instance group details
<a name="sagemaker-hyperpod-scaling-eks-flexible-ig-describe"></a>

Use [DescribeCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeCluster.html) to view the instance types and per-type breakdown of your flexible instance group. The response includes:
+ `InstanceRequirements` — The current and desired instance types for the instance group
+ `InstanceTypeDetails` — A per-instance-type breakdown showing the count and configuration of each instance type in the group

### Using flexible instance groups with Karpenter autoscaling
<a name="sagemaker-hyperpod-scaling-eks-flexible-ig-autoscaling"></a>

Flexible instance groups integrate with HyperPod's managed Karpenter autoscaling. For more information about setting up Karpenter, see [Autoscaling on SageMaker HyperPod EKS](sagemaker-hyperpod-eks-autoscaling.md). When you reference a flexible instance group in a `HyperPodNodeClass` configuration, Karpenter automatically:
+ Detects the supported instance types from the flexible instance group
+ Selects the optimal instance type and availability zone based on pod requirements and pricing
+ Scales the flexible instance group using targeted `BatchAddClusterNodes` calls with the selected instance type and availability zone

**Note**  
When Karpenter manages scaling, it uses its own selection logic based on pod requirements and pricing to determine which instance type to provision. This differs from the list-order priority used by HyperPod's native provisioning (such as `CreateCluster` and `UpdateCluster`), where the first instance type in the list is always attempted first.

This eliminates the need to create separate instance groups for each instance type and manually configure Karpenter to reference multiple groups.

# Autoscaling on SageMaker HyperPod EKS
<a name="sagemaker-hyperpod-eks-autoscaling"></a>

Amazon SageMaker HyperPod provides a managed, Karpenter-based node autoscaling solution for clusters created with EKS orchestration. [Karpenter](https://karpenter.sh/) is an open-source Kubernetes node lifecycle manager built by AWS that optimizes cluster scaling and cost efficiency. Unlike self-managed Karpenter deployments, SageMaker HyperPod's managed implementation eliminates the operational overhead of installing, configuring, and maintaining Karpenter controllers while providing integrated resilience and fault tolerance. This managed autoscaling solution is built on HyperPod's [continuous provisioning](sagemaker-hyperpod-scaling-eks.md) capabilities and enables you to efficiently scale compute resources for training and inference workloads with automatic failure handling and recovery.

You pay only for what you use. You're responsible for paying for all compute instances that are automatically provisioned through autoscaling according to standard SageMaker HyperPod pricing. For detailed pricing information, see [Amazon SageMaker AI](https://aws.amazon.com/sagemaker/ai/pricing/).

By enabling Karpenter-based autoscaling with HyperPod, you have access to:
+ **Service-managed lifecycle** - HyperPod handles Karpenter installation, updates, and maintenance, eliminating operational overhead.
+ **Just-in-time provisioning** - Karpenter observes your pending pods and provisions the required compute for your workloads from the on-demand pool.
+ **Scale to zero** - Scale down to zero nodes without maintaining dedicated controller infrastructure.
+ **Workload-aware node selection** - Karpenter chooses optimal instance types based on pod requirements, availability zones, and pricing to minimize costs.
+ **Automatic node consolidation** - Karpenter regularly evaluates the cluster for optimization opportunities, shifting workloads to eliminate underutilized nodes.
+ **Integrated resilience** - Leverages HyperPod's built-in fault tolerance and node recovery mechanisms.

The following topics explain how to enable HyperPod autoscaling with Karpenter.

**Topics**
+ [Prerequisites](#sagemaker-hyperpod-eks-autoscaling-prereqs)
+ [Create an IAM role for HyperPod autoscaling with Karpenter](sagemaker-hyperpod-eks-autoscaling-iam.md)
+ [Create and configure a HyperPod cluster with Karpenter autoscaling](sagemaker-hyperpod-eks-autoscaling-cluster.md)
+ [Create a NodeClass](sagemaker-hyperpod-eks-autoscaling-nodeclass.md)
+ [Create a NodePool](sagemaker-hyperpod-eks-autoscaling-nodepool.md)
+ [Deploy a workload](sagemaker-hyperpod-eks-autoscaling-workload.md)

## Prerequisites
<a name="sagemaker-hyperpod-eks-autoscaling-prereqs"></a>
+ Continuous provisioning enabled on your HyperPod cluster. Enable continuous provisioning by setting `--node-provisioning-mode` to `Continuous` when creating your SageMaker HyperPod cluster. For more information, see [Continuous provisioning for enhanced cluster operations on Amazon EKS](sagemaker-hyperpod-scaling-eks.md).
+ Health Monitoring Agent version 1.0.742.0 or later installed. Required for HyperPod cluster operations and monitoring. The agent must be configured before enabling Karpenter autoscaling to ensure proper cluster health reporting and node lifecycle management. For more information, see [Health Monitoring System](sagemaker-hyperpod-eks-resiliency-health-monitoring-agent.md).
+ If your Amazon EKS cluster already has Karpenter running on it, the Karpenter `NodePool` and `NodeClaim` versions must be v1.
+ `NodeRecovery` set to automatic. For more information, see [Automatic node recovery](sagemaker-hyperpod-eks-resiliency-node-recovery.md).

# Create an IAM role for HyperPod autoscaling with Karpenter
<a name="sagemaker-hyperpod-eks-autoscaling-iam"></a>

In the following steps, you'll create an IAM role that allows SageMaker HyperPod to manage Kubernetes nodes in your cluster through Karpenter-based autoscaling. This role provides the necessary permissions for HyperPod to add and remove cluster nodes automatically based on workload demand.

**Open the IAM console**

1. Sign in to the AWS Management Console and open the IAM console at console.aws.amazon.com.

1. In the navigation pane, choose **Roles**.

1. Choose **Create role**.

**Configure the trust policy**

1. For **Trusted entity type**, choose **Custom trust policy**.

1. In the **Custom trust policy** editor, replace the default policy with the following:

------
#### [ JSON ]


   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Principal": {
                   "Service": [
                       "hyperpod.sagemaker.amazonaws.com"
                   ]
               },
               "Action": "sts:AssumeRole"
           }
       ]
   }
   ```

------

1. Choose **Next**.

**Create and attach the permissions policy**

Because SageMaker HyperPod requires specific permissions that aren't available in AWS managed policies, you must create a custom policy.

1. Choose **Create policy**. This opens a new browser tab.

1. Choose the **JSON** tab.

1. Replace the default policy with the following:

------
#### [ JSON ]


   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "sagemaker:BatchAddClusterNodes",
                   "sagemaker:BatchDeleteClusterNodes"
               ],
               "Resource": "arn:aws:sagemaker:*:*:cluster/*",
               "Condition": {
                   "StringEquals": {
                       "aws:ResourceAccount": "${aws:PrincipalAccount}"
                   }
               }
           },
           {
               "Effect": "Allow",
               "Action": [
                   "kms:CreateGrant",
                   "kms:DescribeKey"
               ],
               "Resource": "arn:aws:kms:*:*:key/*",
               "Condition": {
                   "StringLike": {
                       "kms:ViaService": "sagemaker.*.amazonaws.com"
                   },
                   "Bool": {
                       "kms:GrantIsForAWSResource": "true"
                   },
                   "ForAllValues:StringEquals": {
                       "kms:GrantOperations": [
                           "CreateGrant",
                           "Decrypt",
                           "DescribeKey",
                           "GenerateDataKeyWithoutPlaintext",
                           "ReEncryptTo",
                           "ReEncryptFrom",
                           "RetireGrant"
                       ]
                   }
               }
           }
       ]
   }
   ```

------

1. Choose **Next**.

1. For **Policy name**, enter **SageMakerHyperPodKarpenterPolicy**.

1. (Optional) For **Description**, enter a description for the policy.

1. Choose **Create policy**.

1. Return to the role creation tab and refresh the policy list.

1. Search for and select the **SageMakerHyperPodKarpenterPolicy** that you just created.

1. Choose **Next**.

**Name and create the role**

1. For **Role name**, enter `SageMakerHyperPodKarpenterRole`.

1. (Optional) For **Description**, enter a description for the role.

1. In the **Step 1: Select trusted entities** section, verify that the trust policy shows the correct service principals.

1. In the **Step 2: Add permissions** section, verify that `SageMakerHyperPodKarpenterPolicy` is attached.

1. Choose **Create role**.

**Record the role ARN**

After the role is created successfully:

1. In the **Roles** list, choose the role name `SageMakerHyperPodKarpenterRole`.

1. Copy the **Role ARN** from the **Summary** section. You'll need this ARN when creating your HyperPod cluster.

The role ARN follows this format: `arn:aws:iam::ACCOUNT-ID:role/SageMakerHyperPodKarpenterRole`.
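
As a quick sketch, you can construct or verify the ARN from your account ID (the ID below is a placeholder; the commented command assumes the role already exists):

```shell
# Build the expected role ARN from an account ID; the ID here is a placeholder.
role_arn() { echo "arn:aws:iam::$1:role/SageMakerHyperPodKarpenterRole"; }

role_arn 123456789012   # prints arn:aws:iam::123456789012:role/SageMakerHyperPodKarpenterRole

# Or fetch it directly in a live environment:
# aws iam get-role --role-name SageMakerHyperPodKarpenterRole --query Role.Arn --output text
```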

# Create and configure a HyperPod cluster with Karpenter autoscaling
<a name="sagemaker-hyperpod-eks-autoscaling-cluster"></a>

In the following steps, you'll create a SageMaker HyperPod cluster with continuous provisioning enabled and configure it to use Karpenter-based autoscaling.

**Create a HyperPod cluster**

1. Load your environment configuration and extract values from CloudFormation stacks.

   ```
   source .env
   SUBNET1=$(cfn-output $VPC_STACK_NAME PrivateSubnet1)
   SUBNET2=$(cfn-output $VPC_STACK_NAME PrivateSubnet2)
   SUBNET3=$(cfn-output $VPC_STACK_NAME PrivateSubnet3)
   SECURITY_GROUP=$(cfn-output $VPC_STACK_NAME NoIngressSecurityGroup)
   EKS_CLUSTER_ARN=$(cfn-output $EKS_STACK_NAME ClusterArn)
   EXECUTION_ROLE=$(cfn-output $SAGEMAKER_STACK_NAME ExecutionRole)
   SERVICE_ROLE=$(cfn-output $SAGEMAKER_STACK_NAME ServiceRole)
   BUCKET_NAME=$(cfn-output $SAGEMAKER_STACK_NAME Bucket)
   HP_CLUSTER_NAME="hyperpod-eks-test-$(date +%s)"
   EKS_CLUSTER_NAME=$(cfn-output $EKS_STACK_NAME ClusterName)
   HP_CLUSTER_ROLE=$(cfn-output $SAGEMAKER_STACK_NAME ClusterRole)
   ```

1. Upload the node initialization script to your Amazon S3 bucket.

   ```
   aws s3 cp lifecyclescripts/on_create_noop.sh s3://$BUCKET_NAME
   ```

1. Create a cluster configuration file with your environment variables.

   ```
   cat > cluster_config.json << EOF
   {
       "ClusterName": "$HP_CLUSTER_NAME",
       "InstanceGroups": [
           {
               "InstanceCount": 1,
               "InstanceGroupName": "system",
               "InstanceType": "ml.c5.xlarge",
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://$BUCKET_NAME",
                   "OnCreate": "on_create_noop.sh"
               },
               "ExecutionRole": "$EXECUTION_ROLE"
           },
           {
               "InstanceCount": 0,
               "InstanceGroupName": "auto-c5-az1",
               "InstanceType": "ml.c5.xlarge",
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://$BUCKET_NAME",
                   "OnCreate": "on_create_noop.sh"
               },
               "ExecutionRole": "$EXECUTION_ROLE"
           },
           {
               "InstanceCount": 0,
               "InstanceGroupName": "auto-c5-4xaz2",
               "InstanceType": "ml.c5.4xlarge",
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://$BUCKET_NAME",
                   "OnCreate": "on_create_noop.sh"
               },
               "ExecutionRole": "$EXECUTION_ROLE",
               "OverrideVpcConfig": {
                   "SecurityGroupIds": [
                       "$SECURITY_GROUP"
                   ],
                   "Subnets": [
                       "$SUBNET2"
                   ]
               }
           },
           {
               "InstanceCount": 0,
               "InstanceGroupName": "auto-g5-az3",
               "InstanceType": "ml.g5.xlarge",
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://$BUCKET_NAME",
                   "OnCreate": "on_create_noop.sh"
               },
               "ExecutionRole": "$EXECUTION_ROLE",
               "OverrideVpcConfig": {
                   "SecurityGroupIds": [
                       "$SECURITY_GROUP"
                   ],
                   "Subnets": [
                       "$SUBNET3"
                   ]
               }
           }
       ],
       "VpcConfig": {
           "SecurityGroupIds": [
               "$SECURITY_GROUP"
           ],
           "Subnets": [
               "$SUBNET1"
           ]
       },
       "Orchestrator": {
           "Eks": {
               "ClusterArn": "$EKS_CLUSTER_ARN"
           }
       },
       "ClusterRole": "$HP_CLUSTER_ROLE",
       "AutoScaling": {
           "Mode": "Enable",
           "AutoScalerType": "Karpenter"
       },
       "NodeProvisioningMode": "Continuous"
   }
   EOF
   ```

1. Run the following command to create your HyperPod cluster.

   ```
   aws sagemaker create-cluster --cli-input-json file://./cluster_config.json
   ```

1. The cluster creation process takes approximately 20 minutes. Monitor the cluster status until both `ClusterStatus` and `AutoScaling.Status` show `InService`.

1. Save the cluster ARN for subsequent operations.

   ```
   HP_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME \
      --output text --query ClusterArn)
   ```

**Enable Karpenter autoscaling**

1. Run the following command to enable Karpenter-based autoscaling on any pre-existing cluster that has continuous node provisioning mode enabled.

   ```
   aws sagemaker update-cluster \
       --cluster-name $HP_CLUSTER_NAME \
       --auto-scaling Mode=Enable,AutoScalerType=Karpenter \
       --cluster-role $HP_CLUSTER_ROLE
   ```

1. Verify that Karpenter has been successfully enabled:

   ```
   aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --query 'AutoScaling'
   ```

1. Expected output:

   ```
   {
       "Mode": "Enable",
       "AutoScalerType": "Karpenter",
       "Status": "InService"
   }
   ```

Wait for the `Status` to show `InService` before proceeding to configure NodeClass and NodePool.
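
That wait can be scripted as a simple polling loop. The sketch below is illustrative; the commented live query assumes the `AutoScaling.Status` field shown in the expected output above:

```shell
# Poll a status command until it reports InService; illustrative only.
wait_in_service() {
  local get_status="$1" status
  while true; do
    status=$(eval "$get_status")
    if [ "$status" = "InService" ]; then
      echo "InService"
      return 0
    fi
    sleep 30
  done
}

# Live usage (uncomment in your environment):
# wait_in_service 'aws sagemaker describe-cluster --cluster-name "$HP_CLUSTER_NAME" --query AutoScaling.Status --output text'
```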

# Create a NodeClass
<a name="sagemaker-hyperpod-eks-autoscaling-nodeclass"></a>

**Important**  
You must start with 0 nodes in your instance group and let Karpenter handle the autoscaling. If you start with more than 0 nodes, Karpenter will scale them down to 0.

A node class (`NodeClass`) defines infrastructure-level settings that apply to groups of nodes in your Amazon EKS cluster, including network configuration, storage settings, and resource tagging. A `HyperPodNodeClass` is a custom `NodeClass` that maps to pre-created instance groups in SageMaker HyperPod, defining constraints around which instance types and Availability Zones are supported for Karpenter's autoscaling decisions.

**Considerations for creating a node class**
+ You can specify up to 10 instance groups in a `NodeClass`.
+ Instance groups that use `InstanceRequirements` (flexible instance groups) can contain multiple instance types within a single instance group. This simplifies your `NodeClass` configuration because you can reference fewer instance groups to cover the same set of instance types and Availability Zones. For example, instead of creating 6 instance groups (3 instance types × 2 AZs), you can create a single flexible instance group that covers all combinations. Note that `InstanceType` and `InstanceRequirements` are mutually exclusive—you must specify one or the other for each instance group.
+ When using GPU partitioning with MIG (Multi-Instance GPU), Karpenter can automatically provision nodes with MIG-enabled instance groups. Ensure your instance groups include MIG-supported instance types (ml.p4d.24xlarge, ml.p5.48xlarge, or ml.p5e/p5en.48xlarge) and configure the appropriate MIG labels during cluster creation. For more information about configuring GPU partitioning, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).
+ If custom labels are applied to instance groups, you can view them in the `desiredLabels` field when querying the `HyperpodNodeClass` status. This includes MIG configuration labels such as `nvidia.com/mig.config`. When incoming jobs request MIG resources, Karpenter will automatically scale instances with the appropriate MIG labels applied.
+ If you choose to delete an instance group, we recommend removing it from your `NodeClass` before deleting it from your HyperPod cluster. If an instance group is deleted while it is used in a `NodeClass`, the `NodeClass` will be marked as not `Ready` for provisioning and won't be used for subsequent scaling operations until the instance group is removed from `NodeClass`.
+ When you remove instance groups from a `NodeClass`, Karpenter will detect a drift on the nodes that were managed by Karpenter in the instance group(s) and disrupt the nodes based on your disruption budget controls.
+ Subnets used by an instance group should belong to the same AZ. Subnets are specified using `OverrideVpcConfig` at the instance group level; otherwise, the cluster-level `VpcConfig` is used by default.
+ Only on-demand capacity is supported at this time. Instance groups with Training plan or reserved capacity are not supported.
+ Instance groups with `DeepHealthChecks` (DHC) are not supported. A DHC takes around 60-90 minutes to complete, and pods remain in a pending state during that time, which can cause over-provisioning.

The following steps cover how to create a `NodeClass`.

1. Create a YAML file (for example, nodeclass.yaml) with your `NodeClass` configuration.

1. Apply the configuration to your cluster using kubectl.

1. Reference the `NodeClass` in your `NodePool` configuration.

1. Here's a sample `NodeClass` that uses the ml.c5.xlarge and ml.c5.4xlarge instance types:

   ```
   apiVersion: karpenter.sagemaker.amazonaws.com/v1
   kind: HyperpodNodeClass
   metadata:
     name: sample-nc
   spec:
     instanceGroups:
       # Names of instance groups in the HyperPod cluster. Instance groups must be pre-created.
       # MaxItems: 10
       - auto-c5-az1
       - auto-c5-4xaz2
   ```

1. Apply the configuration:

   ```
   kubectl apply -f nodeclass.yaml
   ```

1. Monitor the `NodeClass` status to ensure the `Ready` condition in status is set to `True`:

   ```
   kubectl get hyperpodnodeclass sample-nc -o yaml
   ```

   ```
   apiVersion: karpenter.sagemaker.amazonaws.com/v1
   kind: HyperpodNodeClass
   metadata:
     creationTimestamp: "<timestamp>"
     name: sample-nc
     uid: <resource-uid>
   spec:
     instanceGroups:
     - auto-c5-az1
     - auto-c5-4xaz2
   status:
     conditions:
     # true when all IGs in the spec are present in SageMaker cluster, false otherwise
     - lastTransitionTime: "<timestamp>"
       message: ""
       observedGeneration: 3
       reason: InstanceGroupReady
       status: "True"
       type: InstanceGroupReady
     # true if subnets of IGs are discoverable, false otherwise
     - lastTransitionTime: "<timestamp>"
       message: ""
       observedGeneration: 3
       reason: SubnetsReady
       status: "True"
       type: SubnetsReady
     # true when all dependent resources are Ready [InstanceGroup, Subnets]
     - lastTransitionTime: "<timestamp>"
       message: ""
       observedGeneration: 3
       reason: Ready
       status: "True"
       type: Ready
     instanceGroups:
     - desiredLabels:
       - key: <custom_label_key>
         value: <custom_label_value>
       - key: nvidia.com/mig.config
         value: all-1g.5gb
       instanceTypes:
       - ml.c5.xlarge
       name: auto-c5-az1
       subnets:
       - id: <subnet-id>
         zone: <availability-zone-a>
         zoneId: <zone-id-a>
     - instanceTypes:
       - ml.c5.4xlarge
       name: auto-c5-4xaz2
       subnets:
       - id: <subnet-id>
         zone: <availability-zone-b>
         zoneId: <zone-id-b>
     # Flexible instance group with multiple instance types
     - instanceTypes:
       - ml.p5.48xlarge
       - ml.p4d.24xlarge
       - ml.g6.48xlarge
       name: inference-workers
       subnets:
       - id: <subnet-id>
         zone: <availability-zone-a>
         zoneId: <zone-id-a>
       - id: <subnet-id>
         zone: <availability-zone-b>
         zoneId: <zone-id-b>
   ```

# Create a NodePool
<a name="sagemaker-hyperpod-eks-autoscaling-nodepool"></a>

The `NodePool` sets constraints on the nodes that can be created by Karpenter and the pods that can run on those nodes. The `NodePool` can be configured to do things like:
+ Limit node creation to certain zones, instance types, and CPU architectures.
+ Define labels or taints to limit the pods that can run on nodes Karpenter creates.

**Note**  
The HyperPod provider supports a limited set of well-known Kubernetes and Karpenter requirements, explained below.

The following steps cover how to create a `NodePool`.

1. Create a YAML file named nodepool.yaml with your desired `NodePool` configuration.

1. You can use the sample configuration below.

   ```
   apiVersion: karpenter.sh/v1
   kind: NodePool
   metadata:
     name: sample-np
   spec:
     template:
       spec:
         nodeClassRef:
           group: karpenter.sagemaker.amazonaws.com
           kind: HyperpodNodeClass
           name: sample-nc
         expireAfter: Never
         requirements:
           - key: node.kubernetes.io/instance-type
             operator: Exists
   ```

1. Apply the `NodePool` to your cluster:

   ```
   kubectl apply -f nodepool.yaml
   ```

1. Monitor the `NodePool` status to ensure the `Ready` condition in status is set to `True`:

   ```
   kubectl get nodepool sample-np -o yaml
   ```

   ```
   apiVersion: karpenter.sh/v1
   kind: NodePool
   metadata:
     name: <nodepool-name>
     uid: <resource-uid>
     ...
   spec:
     disruption:
       budgets:
       - nodes: 90%
       consolidateAfter: 0s
       consolidationPolicy: WhenEmptyOrUnderutilized
     template:
       spec:
         expireAfter: 720h
         nodeClassRef:
           group: karpenter.sagemaker.amazonaws.com
           kind: HyperpodNodeClass
           name: <nodeclass-name>
         requirements:
         - key: node.kubernetes.io/instance-type
           operator: Exists
   status:
     conditions:
     - lastTransitionTime: "<timestamp>"
       message: ""
       observedGeneration: 2
       reason: ValidationSucceeded
       status: "True"
       type: ValidationSucceeded
     - lastTransitionTime: "<timestamp>"
       message: ""
       observedGeneration: 2
       reason: NodeClassReady
       status: "True"
       type: NodeClassReady
     - lastTransitionTime: "<timestamp>"
       message: ""
       observedGeneration: 2
       reason: Ready
       status: "True"
       type: Ready
   ```

**Supported Labels for Karpenter HyperPod Provider**

These are the optional constraints and requirements you can specify in your `NodePool` configuration.


|  Requirement Type  |  Purpose  |  Use Case/Supported Values  |  Recommendation  | 
| --- | --- | --- | --- | 
|  Instance Types (`node.kubernetes.io/instance-type`)  |  Controls which SageMaker instance types Karpenter can choose from  |  Instead of restricting to only ml.c5.xlarge, let Karpenter pick from all available types in your instance groups  |  Leave this undefined or use Exists operator to give Karpenter maximum flexibility in choosing cost-effective instance types  | 
|  Availability Zones (`topology.kubernetes.io/zone`)  |  Controls which AWS availability zones nodes can be created in  |  Specific zone names like us-east-1c. Use when you need pods to run in specific zones for latency or compliance reasons  | n/a | 
|  Architecture (`kubernetes.io/arch`)  |  Specifies CPU architecture  |  Only amd64 (no ARM support currently)  |  n/a  | 
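
For example, a `NodePool` that pins nodes to one Availability Zone and the amd64 architecture could declare requirements along these lines (fragment only; the zone name is a placeholder):

```
requirements:
  - key: topology.kubernetes.io/zone
    operator: In
    values: ["us-east-1c"]
  - key: kubernetes.io/arch
    operator: In
    values: ["amd64"]
```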

# Deploy a workload
<a name="sagemaker-hyperpod-eks-autoscaling-workload"></a>

The following examples demonstrate how HyperPod autoscaling with Karpenter automatically provisions nodes in response to workload demands. These examples show basic scaling behavior and multi-availability zone distribution patterns.

**Deploy a simple workload**

1. The following Kubernetes deployment requests 1 CPU and 256Mi of memory per pod. At this point, the pods aren't spun up yet.

   ```
   kubectl apply -f https://raw.githubusercontent.com/aws/karpenter-provider-aws/refs/heads/main/examples/workloads/inflate.yaml
   ```

1. To test the scale-up process, run the following command. Karpenter will add new nodes to the cluster.

   ```
   kubectl scale deployment inflate --replicas 10
   ```

1. To test the scale-down process, run the following command. Karpenter will remove nodes from the cluster.

   ```
   kubectl scale deployment inflate --replicas 0
   ```

**Deploy a workload across multiple AZs**

1. Run the following command to deploy a Kubernetes deployment whose pods must spread evenly across Availability Zones with a maximum skew of 1.

   ```
   kubectl apply -f https://raw.githubusercontent.com/aws/karpenter-provider-aws/refs/heads/main/examples/workloads/spread-zone.yaml
   ```

1. Run the following command to adjust the number of pods:

   ```
   kubectl scale deployment zone-spread --replicas 15
   ```

   Karpenter will add new nodes to the cluster, with at least one node in a different Availability Zone.
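
The even spread in this example is driven by a pod-level topology spread constraint; the fragment below sketches the relevant stanza (the label selector is an assumption about the example's pod labels):

```
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: zone-spread
```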

For more examples, see [Karpenter example workloads](https://github.com/aws/karpenter-provider-aws/tree/main/examples/workloads) on GitHub.

# Using topology-aware scheduling in Amazon SageMaker HyperPod
<a name="sagemaker-hyperpod-topology"></a>

Data transfer efficiency is a critical factor in high-performance computing (HPC) and machine learning workloads. When using UltraServers with Amazon SageMaker HyperPod, SageMaker HyperPod automatically applies topology labels to your resources. Topology-aware scheduling helps allocate resources to minimize data transfer overheads by considering both instance topology (how resources are connected within an instance) and network topology (how instances are connected with each other). For more information about instance topology, see [Amazon EC2 instance topology](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-topology.html).

Topology-aware scheduling works with both clusters on Slurm and Amazon EKS. For general information about how topology works with Slurm, see the [Topology guide in the Slurm documentation](https://slurm.schedmd.com/topology.html).

In Amazon SageMaker HyperPod, data transfer overheads typically come from three main sources:
+ **GPU-to-GPU data transfer**: Modern technologies like NVLink and NVLink switches allow high-throughput data transfer between GPUs without involving other compute resources. This is extremely efficient but usually limited to a single instance.
+ **GPU-to-CPU data transfer**: Non-uniform memory access (NUMA) systems have multiple system buses on a single motherboard. In a typical EC2 instance architecture like p5.48xlarge, there are two different system buses, each with a CPU and 4 GPUs. For optimal performance, processes that load or read data to/from GPUs should execute on a CPU connected to the same system bus as the GPU.
+ **Network communications between instances**: Instances transfer data through a chain of network switches. The shortest path typically corresponds to the lowest latency.
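
The GPU-to-CPU consideration can be made concrete with a short sketch. The following Python snippet is illustrative only (the function name is not from any SageMaker API); it assumes the p5.48xlarge-style layout described above, where two system buses each connect one CPU and 4 GPUs:

```python
# Minimal sketch: map GPU indices to NUMA nodes for a p5.48xlarge-like
# layout (2 system buses, 4 GPUs attached to each CPU's bus).
GPUS_PER_NUMA_NODE = 4

def gpu_to_numa_node(gpu_index: int) -> int:
    """Return the NUMA node whose CPU shares a system bus with this GPU."""
    return gpu_index // GPUS_PER_NUMA_NODE

# A data-loading process that feeds GPU 5 should be pinned to NUMA node 1,
# for example with numactl or os.sched_setaffinity, to avoid cross-bus traffic.
print([gpu_to_numa_node(g) for g in range(8)])  # [0, 0, 0, 0, 1, 1, 1, 1]
```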

## UltraServer architecture
<a name="sagemaker-hyperpod-topology-ultraserver-architecture"></a>

SageMaker HyperPod supports UltraServer architecture with p6e-gb200.36xlarge instances. An UltraServer contains up to 18 p6e-gb200.36xlarge instances, with 4 GPUs on each instance. All GPUs across all nodes are interconnected through NVLink switches, enabling data transfer between any two GPUs without using network interfaces.

This architecture provides a significant performance boost compared to individual instances. To leverage this architecture effectively, jobs should be submitted to compute nodes from a single UltraServer.

## EKS topology label
<a name="sagemaker-hyperpod-topology-eks-scheduling"></a>

In accordance with EC2 instance topology, HyperPod automatically labels your nodes with the following labels:
+ **topology.kubernetes.io/region** - the AWS Region that the node resides in.
+ **topology.kubernetes.io/zone** - the Availability Zone that the node resides in.
+ **topology.k8s.aws/network-node-layer** - NetworkNodes describes the network node set of an instance. In each network node set, the network nodes are listed in hierarchical order from top to bottom. The network node that is connected to the instance is the last network node in the list. There are up to three network node layers, and each layer is tagged with a label. Available labels are `topology.k8s.aws/network-node-layer-1`, `topology.k8s.aws/network-node-layer-2`, and `topology.k8s.aws/network-node-layer-3`.
+ **topology.k8s.aws/ultraserver-id** - An identifier used to label each of the instances belonging to the same NVLink domain in an Ultraserver. To learn more about using UltraServers with SageMaker HyperPod, see [Using UltraServers in Amazon SageMaker HyperPod](sagemaker-hyperpod-ultraserver.md).

Using these labels, you can use topology-aware scheduling in HyperPod task governance to apply topology labels and annotations to optimize training efficiency of your workloads. For more information, see [Using topology-aware scheduling in Amazon SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance-tasks-scheduling.md).
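
These labels also work directly in standard Kubernetes scheduling constraints. The following manifest is an illustrative sketch (the deployment name and container image are placeholders, not from this guide) that restricts pods to nodes carrying an UltraServer ID, so all replicas land inside NVLink domains:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ultraserver-workload        # placeholder name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: ultraserver-workload
  template:
    metadata:
      labels:
        app: ultraserver-workload
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  # Schedule only onto nodes that belong to an UltraServer
                  # NVLink domain; HyperPod applies this label automatically.
                  - key: topology.k8s.aws/ultraserver-id
                    operator: Exists
      containers:
        - name: app
          image: public.ecr.aws/docker/library/busybox:latest   # placeholder
          command: ["sleep", "infinity"]
```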

## Slurm network topology plugins
<a name="sagemaker-hyperpod-topology-slurm-plugins"></a>

Slurm provides built-in plugins for network topology awareness. UltraServer architecture in SageMaker HyperPod supports the block plugin.

### Using the topology/block plugin
<a name="w2aac13c35c39c15b5"></a>

NVIDIA developed a topology/block plugin that provides hierarchical scheduling across blocks of nodes with the following characteristics:
+ A block is a consecutive range of nodes
+ Blocks cannot overlap with each other
+ All nodes in a block are allocated to a job before the next block is used
+ The planning block size is the smallest block size configured
+ Each higher block level size is a power-of-two multiple of the previous one

This plugin allocates nodes based on the defined network topology.

#### Configuration
<a name="w2aac13c35c39c15b5b9"></a>

To configure topology-aware scheduling with the topology/block plugin:
+ SageMaker HyperPod automatically configures the topology/block plugin. To configure the plugin yourself, specify the following in the `topology.conf` file in your Slurm configuration directory:

  ```
  BlockName=us1 Nodes=ultraserver1-[0-17]
  BlockName=us2 Nodes=ultraserver2-[0-17]
  BlockSizes=18
  ```
+ Ensure your `slurm.conf` includes:

  ```
  TopologyPlugin=topology/block
  ```

#### Usage
<a name="w2aac13c35c39c15b5c11"></a>

When submitting jobs, you can use the following additional arguments with `sbatch` and `srun` commands:
+ `--segment=N`: Specify the number of nodes to group together. The size of the segment must be less than or equal to the planning block size.
+ `--exclusive=topo`: Request that no other jobs be placed on the same block. This is useful for benchmarking and performance-sensitive applications.

The following sample scenarios show how block allocation behaves.

**Allocate a whole block of nodes on an empty system**

```
sbatch -N18
```

**Allocate two blocks of nodes on an empty system**

```
sbatch -N36
```

**Allocate 18 nodes on one block and 6 nodes on another block**

```
sbatch -N24
```

**Allocate 12 nodes on one block and 12 nodes on another block**

```
sbatch -N24 --segment=12
```

**With `--exclusive=topo`, the job must be placed on a block with no other jobs**

```
sbatch -N12 --exclusive=topo
```

## Best practices for UltraServer topology
<a name="sagemaker-hyperpod-topology-best-practices"></a>

For optimal performance with UltraServer architecture in SageMaker HyperPod:
+ **Set appropriate block sizes**: Configure `BlockSizes=18` (or 17 if one node is spare) to match the UltraServer architecture.
+ **Use segments for better availability**: Use `--segment=16`, `--segment=8`, or `--segment=9` with `srun` and `sbatch` commands to improve job scheduling flexibility.
+ **Consider job size and segment size**:
  + If `BlockSizes=18`, jobs with up to 18 instances will always run on a single UltraServer.
  + If `BlockSizes=16`, jobs with 16 or fewer instances will always run on a single UltraServer, while jobs with 18 instances may run on one or two UltraServers.

When thinking about segmenting, consider the following:
+ With `--segment=1`, each instance can run on a separate UltraServer.
+ With `-N 18 --segment 9`, 9 nodes will be placed on one UltraServer, and another 9 nodes can be placed on the same or another UltraServer.
+ With `-N 24 --segment 8`, the job can run on 2 or 3 UltraServers, with every 8 nodes placed together on the same server.
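
The segment arithmetic above can be sketched in a few lines of Python. This is an illustrative model, not part of Slurm: it assumes segments are placed whole, each block holds `block_size // segment` segments, and nothing else is running on the cluster.

```python
import math

def ultraserver_span(num_nodes: int, segment: int, block_size: int = 18):
    """Return (min, max) UltraServers a job can span, assuming whole-segment
    placement on otherwise empty blocks of block_size nodes each."""
    num_segments = math.ceil(num_nodes / segment)
    segments_per_block = block_size // segment
    min_servers = math.ceil(num_segments / segments_per_block)
    max_servers = num_segments  # worst case: each segment on its own server
    return min_servers, max_servers

print(ultraserver_span(18, 9))   # (1, 2): 9 + 9 on one or two UltraServers
print(ultraserver_span(24, 8))   # (2, 3): every 8 nodes kept together
```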

## Limitations in SageMaker HyperPod topology-aware scheduling
<a name="sagemaker-hyperpod-topology-limitations"></a>

The `topology/block` plugin has limitations with heterogeneous clusters (clusters with different instance types):
+ Only nodes listed in blocks are schedulable by Slurm
+ Every block must have at least `BlockSizes[0]` nodes

For heterogeneous clusters, consider these alternatives:
+ Do not use the block plugin with heterogeneous clusters. Instead, isolate UltraServer nodes in a different partition.
+ Create a separate cluster with UltraServers only in the same VPC and use Slurm's multicluster setup.

# Deploying models on Amazon SageMaker HyperPod
<a name="sagemaker-hyperpod-model-deployment"></a>

Amazon SageMaker HyperPod now extends beyond training to deliver a comprehensive inference platform that combines the flexibility of Kubernetes with the operational excellence of AWS managed services. Deploy, scale, and optimize your machine learning models with enterprise-grade reliability using the same HyperPod compute throughout the entire model lifecycle.

Amazon SageMaker HyperPod offers flexible deployment interfaces that allow you to deploy models through multiple methods including kubectl, Python SDK, Amazon SageMaker Studio UI, or HyperPod CLI. The service provides advanced autoscaling capabilities with dynamic resource allocation that automatically adjusts based on demand. Additionally, it includes comprehensive observability and monitoring features that track critical metrics such as time-to-first-token, latency, and GPU utilization to help you optimize performance.

**Note**  
When deploying on GPU-enabled instances, you can use GPU partitioning with Multi-Instance GPU (MIG) technology to run multiple inference workloads on a single GPU. This allows for better GPU utilization and cost optimization. For more information about configuring GPU partitioning, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).

**Unified infrastructure for training and inference**

Maximize your GPU utilization by seamlessly transitioning compute resources between training and inference workloads. This reduces the total cost of ownership while maintaining operational continuity.

**Enterprise-ready deployment options**

Deploy models from multiple sources including open-weights and gated models from Amazon SageMaker JumpStart and custom models from Amazon S3 and Amazon FSx with support for both single-node and multi-node inference architectures.

**Managed tiered Key-value (KV) caching and intelligent routing**

KV caching saves the precomputed key-value vectors after processing previous tokens. When the next token is processed, the vectors don't need to be recalculated. Through a two-tier caching architecture, you can configure an L1 cache that uses CPU memory for low-latency local reuse, and an L2 cache that leverages Redis to enable scalable, node-level cache sharing.
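
The two-tier lookup can be sketched as follows. This is an illustrative model only: plain dictionaries stand in for CPU memory and the shared Redis backend, and the class and method names are not part of any HyperPod API.

```python
class TieredKVCache:
    """Sketch of a two-tier KV cache: L1 is instance-local CPU memory,
    and a shared dict stands in for the Redis-backed L2 tier."""

    def __init__(self, l2_store: dict):
        self.l1 = {}          # local, lowest latency
        self.l2 = l2_store    # shared across nodes

    def get(self, token_prefix: str):
        if token_prefix in self.l1:      # L1 hit: low-latency local reuse
            return self.l1[token_prefix]
        if token_prefix in self.l2:      # L2 hit: promote to L1
            self.l1[token_prefix] = self.l2[token_prefix]
            return self.l1[token_prefix]
        return None                      # miss: KV vectors must be recomputed

    def put(self, token_prefix: str, kv_vectors):
        self.l1[token_prefix] = kv_vectors
        self.l2[token_prefix] = kv_vectors   # share with other nodes

shared_l2 = {}
node_a, node_b = TieredKVCache(shared_l2), TieredKVCache(shared_l2)
node_a.put("The quick brown", "kv-123")
print(node_b.get("The quick brown"))  # kv-123, served via the shared L2 tier
```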

Intelligent routing analyzes incoming requests and directs them to the inference instance most likely to have relevant cached key-value pairs. The system examines the request and then routes it based on one of the following routing strategies:

1. `prefixaware` — Subsequent requests with the same prompt prefix are routed to the same instance.

1. `kvaware` — Incoming requests are routed to the instance with the highest KV cache hit rate.

1. `session` — Requests from the same user session are routed to the same instance.

1. `roundrobin` — Requests are distributed evenly without considering the state of the KV cache.
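
A hash of a fixed-length prompt prefix is one plausible way to implement prefix-aware routing. The following sketch is illustrative only (the function name and prefix length are assumptions, not HyperPod's actual algorithm):

```python
import hashlib

def route_prefix_aware(prompt: str, instances: list, prefix_len: int = 32) -> str:
    """Sketch of prefix-aware routing: hash a fixed-length prompt prefix so
    that requests sharing a prefix land on the same inference instance."""
    prefix = prompt[:prefix_len]
    digest = hashlib.sha256(prefix.encode()).hexdigest()
    return instances[int(digest, 16) % len(instances)]

instances = ["instance-a", "instance-b", "instance-c"]
r1 = route_prefix_aware("Summarize this document: chapter one", instances)
r2 = route_prefix_aware("Summarize this document: chapter two", instances)
# Both prompts share the same 32-character prefix, so they hit the same
# instance and can reuse its cached key-value pairs.
print(r1 == r2)  # True
```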

For more information on how to enable this feature, see [Configure KV caching and intelligent routing for improved performance](sagemaker-hyperpod-model-deployment-deploy-ftm.md#sagemaker-hyperpod-model-deployment-deploy-ftm-cache-route).

**Built-in tiered storage support for the L2 KV cache**

Building upon the existing KV cache infrastructure, HyperPod integrates SageMaker managed tiered storage as an additional L2 backend option alongside Redis. This provides a more scalable and efficient option for cache offloading, which is particularly beneficial for high-throughput LLM inference workloads, while maintaining compatibility with existing vLLM model servers and routing capabilities.

**Note**  
**Data encryption:** KV cache data (attention keys and values) is stored unencrypted at rest to optimize inference latency and improve performance. For workloads with strict encryption-at-rest requirements, consider application-layer encryption of prompts and responses, or disable caching.  
**Data isolation:** When using managed tiered storage as the L2 cache backend, multiple inference deployments within a cluster share cache storage with no isolation. L2 KV cache data (attention keys and values) from different deployments is not separated. For workloads requiring data isolation (multi-tenant scenarios, different data classification levels), deploy to separate clusters or use dedicated Redis instances.

**Multi-instance type deployment with automatic failover**

HyperPod Inference supports multi-instance type deployment to improve deployment reliability and resource utilization. Specify a prioritized list of instance types in your deployment configuration, and the system automatically selects from available alternatives when your preferred instance type lacks capacity. The Kubernetes scheduler uses `preferredDuringSchedulingIgnoredDuringExecution` node affinity to evaluate instance types in priority order, placing workloads on the highest-priority available instance type while ensuring deployment even when preferred resources are unavailable. This capability prevents deployment failures due to capacity constraints while maintaining your cost and performance preferences, ensuring continuous service availability even during cluster capacity fluctuations.
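
At the Kubernetes level, such a prioritized list translates into weighted node-affinity preferences. The following fragment is an illustrative sketch (the instance types and weights are placeholders, not a HyperPod-specific schema):

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100            # first choice
        preference:
          matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values: ["ml.g5.8xlarge"]
      - weight: 50             # fallback when the first choice lacks capacity
        preference:
          matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values: ["ml.g5.4xlarge"]
```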

**Custom node affinity for granular scheduling control**

HyperPod Inference supports custom node affinity to control workload placement beyond instance type selection. Specify node selection criteria such as availability zone distribution, capacity type filtering (on-demand vs. spot), or custom node labels through the `nodeAffinity` field. The system supports mandatory placement constraints using `requiredDuringSchedulingIgnoredDuringExecution` and optional preferences through `preferredDuringSchedulingIgnoredDuringExecution`, providing full control over pod scheduling decisions while maintaining deployment flexibility.

**Note**  
We collect certain routine operational metrics to provide essential service availability. The creation of these metrics is fully automated and does not involve human review of the underlying model inference workload. These metrics relate to deployment operations, resource management, and endpoint registration.

**Topics**
+ [Setting up your HyperPod clusters for model deployment](sagemaker-hyperpod-model-deployment-setup.md)
+ [Deploy foundation models and custom fine-tuned models](sagemaker-hyperpod-model-deployment-deploy.md)
+ [Autoscaling policies for your HyperPod inference model deployment](sagemaker-hyperpod-model-deployment-autoscaling.md)
+ [Implementing inference observability on HyperPod clusters](sagemaker-hyperpod-model-deployment-observability.md)
+ [Task governance for model deployment on HyperPod](sagemaker-hyperpod-model-deployment-task-gov.md)
+ [HyperPod inference troubleshooting](sagemaker-hyperpod-model-deployment-ts.md)
+ [Amazon SageMaker HyperPod Inference release notes](sagemaker-hyperpod-inference-release-notes.md)

# Setting up your HyperPod clusters for model deployment
<a name="sagemaker-hyperpod-model-deployment-setup"></a>

This guide shows you how to enable inference capabilities on Amazon SageMaker HyperPod clusters. You'll set up the infrastructure, permissions, and operators that machine learning engineers need to deploy and manage inference endpoints.

**Note**  
To create a cluster with the inference operator pre-installed, see [Create an EKS-orchestrated SageMaker HyperPod cluster](sagemaker-hyperpod-quickstart.md#sagemaker-hyperpod-quickstart-eks). To install the inference operator on an existing cluster, continue with the following procedures.

You can install the inference operator using the SageMaker AI console for a streamlined experience, or use the AWS CLI for more control. This guide covers both installation methods.

## Method 1: Install HyperPod Inference Add-on through SageMaker AI console (Recommended)
<a name="sagemaker-hyperpod-model-deployment-setup-ui"></a>

The SageMaker AI console provides the most streamlined experience with two installation options:
+ **Quick Install:** Automatically creates all required resources with optimized defaults, including IAM roles, Amazon S3 buckets, and dependency add-ons. A new Studio domain will be created with required permissions to deploy a JumpStart model to the relevant cluster. This option is ideal for getting started quickly with minimal configuration decisions.
+ **Custom Install:** Provides flexibility to specify existing resources or customize configurations while maintaining the one-click experience. Customers can choose to reuse existing IAM roles, Amazon S3 buckets, or dependency add-ons based on their organizational requirements.

### Prerequisites
<a name="sagemaker-hyperpod-model-deployment-setup-ui-prereqs"></a>
+ An existing HyperPod cluster with Amazon EKS orchestration
+ IAM permissions for Amazon EKS cluster administration
+ kubectl configured for cluster access

### Installation steps
<a name="sagemaker-hyperpod-model-deployment-setup-ui-steps"></a>

1. Navigate to the SageMaker AI console and go to **HyperPod Clusters** → **Cluster Management**.

1. Select your cluster where you want to install the Inference Operator.

1. Navigate to the **Inference** tab. Select **Quick Install** for automated setup or **Custom Install** for configuration flexibility.

1. If choosing Custom Install, specify existing resources or customize settings as needed.

1. Choose **Install** to begin the automated installation process.

1. Verify the installation status through the console, or by running the following commands:

   ```
   kubectl get pods -n hyperpod-inference-system
   ```

   ```
   aws eks describe-addon --cluster-name CLUSTER-NAME --addon-name amazon-sagemaker-hyperpod-inference --region REGION
   ```

After the add-on is successfully installed, you can deploy models using the model deployment documentation or navigate to [Verify the inference operator is working](#sagemaker-hyperpod-model-deployment-setup-verify).

## Method 2: Installing the Inference Operator using the AWS CLI
<a name="sagemaker-hyperpod-model-deployment-setup-addon"></a>

The AWS CLI installation method provides more control over the installation process and is suitable for automation and advanced configurations.

### Prerequisites
<a name="sagemaker-hyperpod-model-deployment-setup-prereq-addon"></a>

The inference operator enables deployment and management of machine learning inference endpoints on your Amazon EKS cluster. Before installation, ensure your cluster has the required security configurations and supporting infrastructure. Complete these steps to configure IAM roles, install the AWS Load Balancer Controller, set up Amazon S3 and Amazon FSx CSI drivers, and deploy KEDA and cert-manager:

1. [Connect to your cluster and set up environment variables](#sagemaker-hyperpod-model-deployment-setup-connect-addon)

1. [Configure IAM roles for inference operator](#sagemaker-hyperpod-model-deployment-setup-prepare-addon)

1. [Create the ALB Controller role](#sagemaker-hyperpod-model-deployment-setup-alb-addon)

1. [Create the KEDA operator role](#sagemaker-hyperpod-model-deployment-setup-keda-addon)

1. [Install the dependency EKS Add-Ons](#sagemaker-hyperpod-model-deployment-setup-install-dependencies)

**Note**  
Alternatively, you can use CloudFormation templates to automate the prerequisite setup. For more information, see [Using CloudFormation templates to create the prerequisite stack](#sagemaker-hyperpod-model-deployment-setup-cfn).

### Connect to your cluster and set up environment variables
<a name="sagemaker-hyperpod-model-deployment-setup-connect-addon"></a>

Before proceeding, verify that your AWS credentials are properly configured and have the necessary permissions. Run the following steps using an IAM principal with administrator privileges and cluster admin access to an Amazon EKS cluster. Ensure you've created a HyperPod cluster by following [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md). Install the helm, eksctl, and kubectl command line utilities.

For Kubernetes administrative access to the Amazon EKS cluster, open the Amazon EKS console and select your cluster. In the **Access** tab, select **IAM Access Entries**. If no entry exists for your IAM principal, select **Create Access Entry**. Select the desired IAM principal and associate the `AmazonEKSClusterAdminPolicy` with it.

1. Configure kubectl to connect to the newly created HyperPod cluster orchestrated by Amazon EKS cluster. Specify the Region and HyperPod cluster name.

   ```
   export HYPERPOD_CLUSTER_NAME=<hyperpod-cluster-name>
   export REGION=<region>
   
   # S3 bucket where tls certificates will be uploaded
   export BUCKET_NAME="hyperpod-tls-<your-bucket-suffix>" # Bucket should have prefix: hyperpod-tls-*
   
   export EKS_CLUSTER_NAME=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \
   --query 'Orchestrator.Eks.ClusterArn' --output text | \
   cut -d'/' -f2)
   aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION
   ```
**Note**  
If using a custom bucket name that doesn't start with `hyperpod-tls-`, attach the following policy to your execution role:  

   ```
   {
       "Version": "2012-10-17",		 	 	 
       "Statement": [
           {
               "Sid": "TLSBucketDeleteObjectsPermission",
               "Effect": "Allow",
               "Action": ["s3:DeleteObject"],
               "Resource": ["arn:aws:s3:::${BUCKET_NAME}/*"],
               "Condition": {
                   "StringEquals": {
                       "aws:ResourceAccount": "${aws:PrincipalAccount}"
                   }
               }
           },
           {
               "Sid": "TLSBucketGetObjectAccess",
               "Effect": "Allow",
               "Action": ["s3:GetObject"],
               "Resource": ["arn:aws:s3:::${BUCKET_NAME}/*"]
           },
           {
               "Sid": "TLSBucketPutObjectAccess",
               "Effect": "Allow",
               "Action": ["s3:PutObject", "s3:PutObjectTagging"],
               "Resource": ["arn:aws:s3:::${BUCKET_NAME}/*"],
               "Condition": {
                   "StringEquals": {
                       "aws:ResourceAccount": "${aws:PrincipalAccount}"
                   }
               }
           }
       ]
   }
   ```

1. Set default env variables.

   ```
   HYPERPOD_INFERENCE_ROLE_NAME="SageMakerHyperPodInference-$HYPERPOD_CLUSTER_NAME"
   HYPERPOD_INFERENCE_NAMESPACE="hyperpod-inference-system"
   ```

1. Verify connectivity to the cluster by listing all pods across namespaces.

   ```
   kubectl get pods --all-namespaces
   ```

1. (Optional) Install the NVIDIA device plugin to enable GPU support on the cluster.

   ```
   # Install nvidia device plugin
   kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
   # Verify that GPUs are visible to k8s
   kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia.com/gpu
   ```

### Configure IAM roles for inference operator
<a name="sagemaker-hyperpod-model-deployment-setup-prepare-addon"></a>

1. Gather essential AWS resource identifiers and ARNs required for configuring service integrations between Amazon EKS, SageMaker AI, and IAM components.

   ```
   export ACCOUNT_ID=$(aws --region $REGION sts get-caller-identity --query 'Account' --output text)
   export OIDC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5)
   export EKS_CLUSTER_ROLE=$(aws eks --region $REGION describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.roleArn' --output text)
   ```

1. Associate an IAM OIDC identity provider with your Amazon EKS cluster.

   ```
   eksctl utils associate-iam-oidc-provider --region=$REGION --cluster=$EKS_CLUSTER_NAME --approve
   ```

1. Create the trust policy required for the HyperPod inference operator IAM role. These policies enable secure cross-service communication between Amazon EKS, SageMaker AI, and other AWS services.

   ```
   # Create trust policy JSON
   cat << EOF > trust-policy.json
   {
   "Version": "2012-10-17",		 	 	 
   "Statement": [
       {
           "Effect": "Allow",
           "Principal": {
               "Service": [
                   "sagemaker.amazonaws.com"
               ]
           },
           "Action": "sts:AssumeRole"
       },
       {
           "Effect": "Allow",
           "Principal": {
               "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}"
           },
           "Action": "sts:AssumeRoleWithWebIdentity",
           "Condition": {
               "StringLike": {
                   "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com",
                   "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:hyperpod-inference-system:hyperpod-inference-controller-manager"
               }
           }
       }
   ]
   }
   EOF
   ```

1. Create execution Role for the inference operator.

   ```
   aws iam create-role --role-name $HYPERPOD_INFERENCE_ROLE_NAME --assume-role-policy-document file://trust-policy.json
   aws iam attach-role-policy --role-name $HYPERPOD_INFERENCE_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerHyperPodInferenceAccess
   ```

1. Create a namespace for inference operator resources

   ```
   kubectl create namespace $HYPERPOD_INFERENCE_NAMESPACE
   ```

### Create the ALB Controller role
<a name="sagemaker-hyperpod-model-deployment-setup-alb-addon"></a>

1. Create the trust policy and permissions policy.

   ```
   # Create trust policy
   cat <<EOF > /tmp/alb-trust-policy.json
   {
   "Version": "2012-10-17",		 	 	 
   "Statement": [
       {
           "Effect": "Allow",
           "Principal": {
               "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID"
           },
           "Action": "sts:AssumeRoleWithWebIdentity",
           "Condition": {
               "StringLike": {
                   "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:hyperpod-inference-system:aws-load-balancer-controller",
                   "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com"
               }
           }
       }
   ]
   }
   EOF
   
   # Create permissions policy
   export ALBController_IAM_POLICY_NAME=HyperPodInferenceALBControllerIAMPolicy
   curl -o AWSLoadBalancerControllerIAMPolicy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.13.0/docs/install/iam_policy.json
   
   # Create the role
   aws iam create-role \
       --role-name alb-role \
       --assume-role-policy-document file:///tmp/alb-trust-policy.json 
   
   # Create the policy
   ALB_POLICY_ARN=$(aws iam create-policy \
       --policy-name $ALBController_IAM_POLICY_NAME \
       --policy-document file://AWSLoadBalancerControllerIAMPolicy.json \
       --query 'Policy.Arn' \
       --output text)
   
   # Attach the policy to the role
   aws iam attach-role-policy \
       --role-name alb-role \
       --policy-arn $ALB_POLICY_ARN
   ```

1. Apply the `kubernetes.io/role/elb` tag to all subnets in the Amazon EKS cluster (both public and private).

   ```
   export VPC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.resourcesVpcConfig.vpcId' --output text)
   
   # Add Tags
   aws ec2 describe-subnets \
   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \
   --query 'Subnets[*].SubnetId' --output text | \
   tr '\t' '\n' | \
   xargs -I{} aws ec2 create-tags --resources {} --tags Key=kubernetes.io/role/elb,Value=1
   
   # Verify Tags are added
   aws ec2 describe-subnets \
   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \
   --query 'Subnets[*].SubnetId' --output text | \
   tr '\t' '\n' |
   xargs -n1 -I{} aws ec2 describe-tags --filters "Name=resource-id,Values={}" "Name=key,Values=kubernetes.io/role/elb" --query "Tags[0].Value" --output text
   ```

1. Create an Amazon S3 VPC endpoint.

   ```
   aws ec2 create-vpc-endpoint \
       --region ${REGION} \
       --vpc-id ${VPC_ID} \
       --vpc-endpoint-type Gateway \
       --service-name "com.amazonaws.${REGION}.s3" \
       --route-table-ids $(aws ec2 describe-route-tables --region $REGION --filters "Name=vpc-id,Values=${VPC_ID}" --query 'RouteTables[].Associations[].RouteTableId' --output text | tr ' ' '\n' | sort -u | tr '\n' ' ')
   ```

### Create the KEDA operator role
<a name="sagemaker-hyperpod-model-deployment-setup-keda-addon"></a>

1. Create the trust policy and permissions policy.

   ```
   # Create trust policy
   cat <<EOF > /tmp/keda-trust-policy.json
   {
   "Version": "2012-10-17",		 	 	 
   "Statement": [
       {
           "Effect": "Allow",
           "Principal": {
               "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID"
           },
           "Action": "sts:AssumeRoleWithWebIdentity",
           "Condition": {
               "StringLike": {
                   "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:hyperpod-inference-system:keda-operator",
                   "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com"
               }
           }
       }
   ]
   }
   EOF
   
   # Create permissions policy
   cat <<EOF > /tmp/keda-policy.json
   {
   "Version": "2012-10-17",		 	 	 
   "Statement": [
       {
           "Effect": "Allow",
           "Action": [
               "cloudwatch:GetMetricData",
               "cloudwatch:GetMetricStatistics",
               "cloudwatch:ListMetrics"
           ],
           "Resource": "*"
       },
       {
           "Effect": "Allow",
           "Action": [
               "aps:QueryMetrics",
               "aps:GetLabels",
               "aps:GetSeries",
               "aps:GetMetricMetadata"
           ],
           "Resource": "*"
       }
   ]
   }
   EOF
   
   # Create the role
   aws iam create-role \
       --role-name keda-operator-role \
       --assume-role-policy-document file:///tmp/keda-trust-policy.json
   
   # Create the policy
   KEDA_POLICY_ARN=$(aws iam create-policy \
       --policy-name KedaOperatorPolicy \
       --policy-document file:///tmp/keda-policy.json \
       --query 'Policy.Arn' \
       --output text)
   
   # Attach the policy to the role
   aws iam attach-role-policy \
       --role-name keda-operator-role \
       --policy-arn $KEDA_POLICY_ARN
   ```

1. If you're using gated models, create an IAM role to access the gated models.

   1. Create an IAM policy.

      ```
      JUMPSTART_GATED_ROLE_NAME="JumpstartGatedRole-${REGION}-${HYPERPOD_CLUSTER_NAME}"
      
      cat <<EOF > /tmp/trust-policy.json
      {
      "Version": "2012-10-17",		 	 	 
      "Statement": [
          {
              "Effect": "Allow",
              "Principal": {
                  "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID"
              },
              "Action": "sts:AssumeRoleWithWebIdentity",
              "Condition": {
                  "StringLike": {
                      "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:*:hyperpod-inference-service-account*",
                      "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com"
                  }
              }
          },
              {
              "Effect": "Allow",
              "Principal": {
                  "Service": "sagemaker.amazonaws.com"
              },
              "Action": "sts:AssumeRole"
          }
      ]
      }
      EOF
      ```

   1. Create an IAM role.

      ```
      # Create the role using existing trust policy
      aws iam create-role \
      --role-name $JUMPSTART_GATED_ROLE_NAME \
      --assume-role-policy-document file:///tmp/trust-policy.json
      
      aws iam attach-role-policy \
      --role-name $JUMPSTART_GATED_ROLE_NAME \
      --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerHyperPodGatedModelAccess
      ```

      ```
      JUMPSTART_GATED_ROLE_ARN=$(aws iam get-role --role-name $JUMPSTART_GATED_ROLE_NAME --query "Role.Arn" --output text)
      echo $JUMPSTART_GATED_ROLE_ARN
      ```

### Install the required EKS add-ons
<a name="sagemaker-hyperpod-model-deployment-setup-install-dependencies"></a>

Before installing the inference operator, you must install the following EKS add-ons on your cluster. The inference operator fails to install if any of these dependencies is missing. Each add-on has a minimum version required for compatibility with the inference operator add-on.

**Important**  
Install all dependency add-ons before attempting to install the inference operator. Missing dependencies will cause installation failures with specific error messages.

#### Required Add-ons
<a name="sagemaker-hyperpod-model-deployment-setup-required-addons"></a>

1. **Amazon S3 Mountpoint CSI Driver** (minimum version: v1.14.1-eksbuild.1)

   Required for mounting S3 buckets as persistent volumes in inference workloads.

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name aws-mountpoint-s3-csi-driver \
       --region $REGION \
       --service-account-role-arn $S3_CSI_ROLE_ARN
   ```

   For detailed installation instructions including required IAM permissions, see [Mountpoint for Amazon S3 CSI driver](https://docs.aws.amazon.com/eks/latest/userguide/workloads-add-ons-available-eks.html#mountpoint-for-s3-add-on).

1. **Amazon FSx CSI Driver** (minimum version: v1.6.0-eksbuild.1)

   Required for mounting FSx file systems for high-performance model storage.

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name aws-fsx-csi-driver \
       --region $REGION \
       --service-account-role-arn $FSX_CSI_ROLE_ARN
   ```

   For detailed installation instructions including required IAM permissions, see [Amazon FSx for Lustre CSI driver](https://docs.aws.amazon.com/eks/latest/userguide/workloads-add-ons-available-eks.html#add-ons-aws-fsx-csi-driver).

1. **Metrics Server** (minimum version: v0.7.2-eksbuild.4)

   Required for autoscaling functionality and resource metrics collection.

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name metrics-server \
       --region $REGION
   ```

   For detailed installation instructions, see [Metrics Server](https://docs.aws.amazon.com/eks/latest/userguide/metrics-server.html).

1. **Cert Manager** (minimum version: v1.18.2-eksbuild.2)

   Required for TLS certificate management for secure inference endpoints.

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name cert-manager \
       --region $REGION
   ```

   For detailed installation instructions, see [cert-manager](https://docs.aws.amazon.com/eks/latest/userguide/community-addons.html#addon-cert-manager).

#### Verify Add-on Installation
<a name="sagemaker-hyperpod-model-deployment-setup-verify-dependencies"></a>

After installing the required add-ons, verify they are running correctly:

```
# Check add-on status
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-mountpoint-s3-csi-driver --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-fsx-csi-driver --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name metrics-server --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name cert-manager --region $REGION

# Verify pods are running
kubectl get pods -n kube-system | grep -E "(mountpoint|fsx|metrics-server)"
kubectl get pods -n cert-manager
```

All add-ons should show status "ACTIVE" and all pods should be in "Running" state before proceeding with inference operator installation.
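
If you prefer to script this check, you can parse the `addon.status` field from the `describe-addon` response. The following sketch runs against an abbreviated sample response (the JSON here is illustrative, not real API output):

```
# Abbreviated sample of a describe-addon response; a real response from
#   aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name metrics-server
# contains many more fields.
RESPONSE='{"addon":{"addonName":"metrics-server","status":"ACTIVE"}}'

# Extract addon.status; any value other than "ACTIVE" means the add-on is not ready yet
STATUS=$(echo "$RESPONSE" | python3 -c 'import sys, json; print(json.load(sys.stdin)["addon"]["status"])')
echo "$STATUS"
```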

**Note**  
If you created your HyperPod cluster using the quick setup or custom setup options, the FSx CSI Driver and Cert Manager may already be installed. Verify their presence using the commands above.

### Installing the Inference Operator with EKS add-on
<a name="sagemaker-hyperpod-model-deployment-setup-install-inference-operator-addon"></a>

The EKS add-on installation method provides a managed experience with automatic updates and integrated dependency validation. This is the recommended approach for installing the inference operator.

**Install the inference operator add-on**

1. Prepare the add-on configuration by gathering all required ARNs and creating the configuration file:

   ```
   # Gather required ARNs
   export EXECUTION_ROLE_ARN=$(aws iam get-role --role-name $HYPERPOD_INFERENCE_ROLE_NAME --query "Role.Arn" --output text)
   export HYPERPOD_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME --region $REGION --query "ClusterArn" --output text)
   export KEDA_ROLE_ARN=$(aws iam get-role --role-name keda-operator-role --query 'Role.Arn' --output text)
   export ALB_ROLE_ARN=$(aws iam get-role --role-name alb-role --query 'Role.Arn' --output text)
   
   # Verify all ARNs are set correctly
   echo "Execution Role ARN: $EXECUTION_ROLE_ARN"
   echo "HyperPod Cluster ARN: $HYPERPOD_CLUSTER_ARN"
   echo "KEDA Role ARN: $KEDA_ROLE_ARN"
   echo "ALB Role ARN: $ALB_ROLE_ARN"
   echo "TLS S3 Bucket: $BUCKET_NAME"
   ```

1. Create the add-on configuration file with all required settings:

   ```
   cat > addon-config.json << EOF
   {
     "executionRoleArn": "$EXECUTION_ROLE_ARN",
     "tlsCertificateS3Bucket": "$BUCKET_NAME",
     "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
     "jumpstartGatedModelDownloadRoleArn": "$JUMPSTART_GATED_ROLE_ARN",
     "alb": {
       "serviceAccount": {
         "create": true,
         "roleArn": "$ALB_ROLE_ARN"
       }
     },
     "keda": {
       "auth": {
         "aws": {
           "irsa": {
             "roleArn": "$KEDA_ROLE_ARN"
           }
         }
       }
     }
   }
   EOF
   
   # Verify the configuration file
   cat addon-config.json
   ```
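
   Because the configuration file interpolates several shell variables, a quick well-formedness check before running `create-addon` catches problems such as an unset variable producing invalid JSON. A minimal sketch with placeholder values (the ARN and bucket name below are illustrative, not real resources):

   ```
   # Placeholder values for illustration only; use your real ARNs and bucket name
   EXECUTION_ROLE_ARN="arn:aws:iam::123456789012:role/HyperpodInferenceRole"
   BUCKET_NAME="my-tls-bucket"

   cat > /tmp/addon-config-sample.json << EOF
   {
     "executionRoleArn": "$EXECUTION_ROLE_ARN",
     "tlsCertificateS3Bucket": "$BUCKET_NAME"
   }
   EOF

   # json.tool exits non-zero on malformed JSON, so this fails fast
   python3 -m json.tool /tmp/addon-config-sample.json > /dev/null && echo "config OK"
   ```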

1. Install the inference operator add-on (minimum version: v1.0.0-eksbuild.1):

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --configuration-values file://addon-config.json \
       --region $REGION
   ```

1. Monitor the installation progress and verify successful completion:

   ```
   # Check installation status (repeat until status shows "ACTIVE")
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health}" \
       --output table
   
   # Verify pods are running
   kubectl get pods -n hyperpod-inference-system
   
   # Check operator logs for any issues
   kubectl logs -n hyperpod-inference-system deployment/hyperpod-inference-controller-manager --tail=50
   ```

For detailed troubleshooting of installation issues, see [HyperPod inference troubleshooting](sagemaker-hyperpod-model-deployment-ts.md).

To verify the inference operator is working correctly, continue to [Verify the inference operator is working](#sagemaker-hyperpod-model-deployment-setup-verify).

### Using CloudFormation templates to create the prerequisite stack
<a name="sagemaker-hyperpod-model-deployment-setup-cfn"></a>

As an alternative to manually configuring the prerequisites, you can use CloudFormation templates to automate the creation of required IAM roles and policies for the inference operator.

1. Set up input variables. Replace the placeholder values with your own:

   ```
   #!/bin/bash
   set -e
   
   # ===== INPUT VARIABLES =====
   HP_CLUSTER_NAME="my-hyperpod-cluster"  # Replace with your HyperPod cluster name
   REGION="us-east-1"  # Replace with your AWS region
   PREFIX="my-prefix"  # Replace with your resource prefix
   SHORT_PREFIX="12a34d56"  # Replace with your short prefix (maximum 8 characters)
   CREATE_DOMAIN="true"  # Set to "false" if you don't need a SageMaker Studio domain
   STACK_NAME="hyperpod-inference-prerequisites"  # Replace with your stack name
   TEMPLATE_URL="https://aws-sagemaker-hyperpod-cluster-setup-${REGION}-prod.s3.${REGION}.amazonaws.com/templates/main-stack-inference-operator-addon-template.yaml"
   ```

1. Derive cluster and network information:

   ```
   # ===== DERIVE EKS CLUSTER NAME =====
   EKS_CLUSTER_NAME=$(aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --region $REGION --query 'Orchestrator.Eks.ClusterArn' --output text | awk -F'/' '{print $NF}')
   echo "EKS_CLUSTER_NAME=$EKS_CLUSTER_NAME"
   
   # ===== GET VPC AND OIDC =====
   VPC_ID=$(aws eks describe-cluster --name $EKS_CLUSTER_NAME --region $REGION --query 'cluster.resourcesVpcConfig.vpcId' --output text)
   echo "VPC_ID=$VPC_ID"
   
   OIDC_PROVIDER=$(aws eks describe-cluster --name $EKS_CLUSTER_NAME --region $REGION --query 'cluster.identity.oidc.issuer' --output text | sed 's|https://||')
   echo "OIDC_PROVIDER=$OIDC_PROVIDER"
   
   # ===== GET PRIVATE ROUTE TABLES =====
   ALL_ROUTE_TABLES=$(aws ec2 describe-route-tables --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" --query 'RouteTables[].RouteTableId' --output text)
   EKS_PRIVATE_ROUTE_TABLES=""
   for rtb in $ALL_ROUTE_TABLES; do
       HAS_IGW=$(aws ec2 describe-route-tables --region $REGION --route-table-ids $rtb --query 'RouteTables[0].Routes[?GatewayId && starts_with(GatewayId, `igw-`)]' --output text 2>/dev/null)
       if [ -z "$HAS_IGW" ]; then
           EKS_PRIVATE_ROUTE_TABLES="${EKS_PRIVATE_ROUTE_TABLES:+$EKS_PRIVATE_ROUTE_TABLES,}$rtb"
       fi
   done
   echo "EKS_PRIVATE_ROUTE_TABLES=$EKS_PRIVATE_ROUTE_TABLES"
   
   # ===== CHECK S3 VPC ENDPOINT =====
   S3_ENDPOINT_EXISTS=$(aws ec2 describe-vpc-endpoints --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" "Name=service-name,Values=com.amazonaws.$REGION.s3" --query 'VpcEndpoints[0].VpcEndpointId' --output text)
   CREATE_S3_ENDPOINT_STACK=$([ "$S3_ENDPOINT_EXISTS" == "None" ] && echo "true" || echo "false")
   echo "CREATE_S3_ENDPOINT_STACK=$CREATE_S3_ENDPOINT_STACK"
   
   # ===== GET HYPERPOD DETAILS =====
   HYPERPOD_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --region $REGION --query 'ClusterArn' --output text)
   echo "HYPERPOD_CLUSTER_ARN=$HYPERPOD_CLUSTER_ARN"
   
   # ===== GET DEFAULT VPC FOR DOMAIN =====
   DOMAIN_VPC_ID=$(aws ec2 describe-vpcs --region $REGION --filters "Name=isDefault,Values=true" --query 'Vpcs[0].VpcId' --output text)
   echo "DOMAIN_VPC_ID=$DOMAIN_VPC_ID"
   
   DOMAIN_SUBNET_IDS=$(aws ec2 describe-subnets --region $REGION --filters "Name=vpc-id,Values=$DOMAIN_VPC_ID" --query 'Subnets[0].SubnetId' --output text)
   echo "DOMAIN_SUBNET_IDS=$DOMAIN_SUBNET_IDS"
   
   # ===== GET INSTANCE GROUPS =====
   INSTANCE_GROUPS=$(aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --region $REGION --query 'InstanceGroups[].InstanceGroupName' --output json | python3 -c "import sys, json; groups = json.load(sys.stdin); print('[' + ','.join([f'\\\\\\\"' + g + '\\\\\\\"' for g in groups]) + ']')")
   echo "INSTANCE_GROUPS=$INSTANCE_GROUPS"
   ```
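
   The route-table loop above builds a comma-separated list with the `${VAR:+$VAR,}` expansion, which inserts a separator only when the variable is already non-empty. A standalone sketch of the idiom with sample route table IDs (the IDs are illustrative):

   ```
   # ${VAR:+$VAR,} expands to "$VAR," only when VAR is non-empty, so the result
   # has no leading or trailing comma
   EKS_PRIVATE_ROUTE_TABLES=""
   for rtb in rtb-0aaa rtb-0bbb rtb-0ccc; do   # sample IDs
       EKS_PRIVATE_ROUTE_TABLES="${EKS_PRIVATE_ROUTE_TABLES:+$EKS_PRIVATE_ROUTE_TABLES,}$rtb"
   done
   echo "$EKS_PRIVATE_ROUTE_TABLES"
   ```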

1. Create parameters file and deploy stack:

   ```
   # ===== CREATE PARAMETERS JSON =====
   cat > /tmp/cfn-params.json << EOF
   [
     {"ParameterKey":"ResourceNamePrefix","ParameterValue":"$PREFIX"},
     {"ParameterKey":"ResourceNameShortPrefix","ParameterValue":"$SHORT_PREFIX"},
     {"ParameterKey":"VpcId","ParameterValue":"$VPC_ID"},
     {"ParameterKey":"EksPrivateRouteTableIds","ParameterValue":"$EKS_PRIVATE_ROUTE_TABLES"},
     {"ParameterKey":"EKSClusterName","ParameterValue":"$EKS_CLUSTER_NAME"},
     {"ParameterKey":"OIDCProviderURLWithoutProtocol","ParameterValue":"$OIDC_PROVIDER"},
     {"ParameterKey":"HyperPodClusterArn","ParameterValue":"$HYPERPOD_CLUSTER_ARN"},
     {"ParameterKey":"HyperPodClusterName","ParameterValue":"$HP_CLUSTER_NAME"},
     {"ParameterKey":"CreateDomain","ParameterValue":"$CREATE_DOMAIN"},
     {"ParameterKey":"DomainVpcId","ParameterValue":"$DOMAIN_VPC_ID"},
     {"ParameterKey":"DomainSubnetIds","ParameterValue":"$DOMAIN_SUBNET_IDS"},
     {"ParameterKey":"CreateS3EndpointStack","ParameterValue":"$CREATE_S3_ENDPOINT_STACK"},
     {"ParameterKey":"TieredStorageConfig","ParameterValue":"{\"Mode\":\"Enable\",\"InstanceMemoryAllocationPercentage\":20}"},
     {"ParameterKey":"TieredKVCacheConfig","ParameterValue":"{\"KVCacheMode\":\"Enable\",\"InstanceGroup\":$INSTANCE_GROUPS,\"NVMeMode\":\"Enable\"}"}
   ]
   EOF
   
   echo -e "\n===== CREATING CLOUDFORMATION STACK ====="
   aws cloudformation create-stack \
       --region $REGION \
       --stack-name $STACK_NAME \
       --template-url $TEMPLATE_URL \
       --parameters file:///tmp/cfn-params.json \
       --capabilities CAPABILITY_NAMED_IAM
   ```

1. Monitor the stack creation status:

   ```
   aws cloudformation describe-stacks \
       --stack-name $STACK_NAME \
       --region $REGION \
       --query 'Stacks[0].StackStatus'
   ```
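
   The `StackStatus` value tells you whether to keep polling. A small classifier sketch (the sample status is hard-coded for illustration; in practice, feed in the `describe-stacks` output):

   ```
   STATUS="CREATE_COMPLETE"   # sample value; read this from describe-stacks in practice

   # Rollback and failure states must be matched before *_COMPLETE, because
   # ROLLBACK_COMPLETE also ends in _COMPLETE yet indicates a failed deployment
   case "$STATUS" in
       *ROLLBACK*|*FAILED*) RESULT="stack failed" ;;
       *_COMPLETE)          RESULT="stack ready" ;;
       *)                   RESULT="still in progress" ;;
   esac
   echo "$RESULT"
   ```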

1. Once the stack is created successfully, retrieve the output values for use in the inference operator installation:

   ```
   aws cloudformation describe-stacks \
       --stack-name $STACK_NAME \
       --region $REGION \
       --query 'Stacks[0].Outputs'
   ```

After the CloudFormation stack is created, continue with [Installing the Inference Operator with EKS add-on](#sagemaker-hyperpod-model-deployment-setup-install-inference-operator-addon) to install the inference operator.

## Method 3: Helm chart installation
<a name="sagemaker-hyperpod-model-deployment-setup-helm"></a>

**Note**  
For the simplest installation experience, we recommend using [Method 1: Install HyperPod Inference Add-on through SageMaker AI console (Recommended)](#sagemaker-hyperpod-model-deployment-setup-ui) or [Method 2: Installing the Inference Operator using the AWS CLI](#sagemaker-hyperpod-model-deployment-setup-addon). Helm chart installation may be deprecated in a future release.

### Prerequisites
<a name="sagemaker-hyperpod-model-deployment-setup-prereq"></a>

Before proceeding, verify that your AWS credentials are properly configured and have the necessary permissions. The following steps must be run by an IAM principal with administrator privileges and cluster admin access to the Amazon EKS cluster. Verify that you've created a HyperPod cluster by following [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md), and that you've installed the helm, eksctl, and kubectl command line utilities.

For Kubernetes administrative access to the Amazon EKS cluster, go to the Amazon EKS console and select the cluster you are using. On the **Access** tab, review the IAM access entries. If there isn't an entry for your IAM principal, choose **Create access entry**, select the desired IAM principal, and associate the `AmazonEKSClusterAdminPolicy` with it.

1. Configure kubectl to connect to the newly created HyperPod cluster orchestrated by Amazon EKS cluster. Specify the Region and HyperPod cluster name.

   ```
   export HYPERPOD_CLUSTER_NAME=<hyperpod-cluster-name>
   export REGION=<region>
   
   # S3 bucket where tls certificates will be uploaded
   BUCKET_NAME="<Enter name of your s3 bucket>" # This should be bucket name, not URI
   
   export EKS_CLUSTER_NAME=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \
   --query 'Orchestrator.Eks.ClusterArn' --output text | \
   cut -d'/' -f2)
   aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION
   ```
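
   The `cut -d'/' -f2` step extracts the EKS cluster name from the cluster ARN, which has the form `arn:aws:eks:<region>:<account>:cluster/<name>`. A standalone sketch with a sample ARN (the account ID and names are illustrative):

   ```
   # Sample EKS cluster ARN
   ARN="arn:aws:eks:us-west-2:123456789012:cluster/my-eks-cluster"

   # Everything after the first '/' is the cluster name
   EKS_CLUSTER_NAME=$(echo "$ARN" | cut -d'/' -f2)
   echo "$EKS_CLUSTER_NAME"
   ```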

1. Set default env variables.

   ```
   LB_CONTROLLER_POLICY_NAME="AWSLoadBalancerControllerIAMPolicy-$HYPERPOD_CLUSTER_NAME"
   LB_CONTROLLER_ROLE_NAME="aws-load-balancer-controller-$HYPERPOD_CLUSTER_NAME"
   S3_MOUNT_ACCESS_POLICY_NAME="S3MountpointAccessPolicy-$HYPERPOD_CLUSTER_NAME"
   S3_CSI_ROLE_NAME="SM_HP_S3_CSI_ROLE-$HYPERPOD_CLUSTER_NAME"
   KEDA_OPERATOR_POLICY_NAME="KedaOperatorPolicy-$HYPERPOD_CLUSTER_NAME"
   KEDA_OPERATOR_ROLE_NAME="keda-operator-role-$HYPERPOD_CLUSTER_NAME"
   HYPERPOD_INFERENCE_ROLE_NAME="HyperpodInferenceRole-$HYPERPOD_CLUSTER_NAME"
   HYPERPOD_INFERENCE_SA_NAME="hyperpod-inference-operator-controller"
   HYPERPOD_INFERENCE_SA_NAMESPACE="hyperpod-inference-system"
   JUMPSTART_GATED_ROLE_NAME="JumpstartGatedRole-$HYPERPOD_CLUSTER_NAME"
   FSX_CSI_ROLE_NAME="AmazonEKSFSxLustreCSIDriverFullAccess-$HYPERPOD_CLUSTER_NAME"
   ```

1. Verify connectivity to the cluster by listing all pods across namespaces.

   ```
   kubectl get pods --all-namespaces
   ```

1. (Optional) Install the NVIDIA device plugin to enable GPU support on the cluster.

   ```
   #Install nvidia device plugin
   kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
   # Verify that GPUs are visible to k8s
   kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia.com/gpu
   ```

### Prepare your environment for inference operator installation
<a name="sagemaker-hyperpod-model-deployment-setup-prepare"></a>

1. Gather essential AWS resource identifiers and ARNs required for configuring service integrations between Amazon EKS, SageMaker AI, and IAM components.

   ```
   %%bash -x
   
   export ACCOUNT_ID=$(aws --region $REGION sts get-caller-identity --query 'Account' --output text)
   export OIDC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5)
   export EKS_CLUSTER_ROLE=$(aws eks --region $REGION describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.roleArn' --output text)
   ```

1. Associate an IAM OIDC identity provider with your EKS cluster.

   ```
   eksctl utils associate-iam-oidc-provider --region=$REGION --cluster=$EKS_CLUSTER_NAME --approve
   ```

1. Create the trust policy required for the HyperPod inference operator IAM role. This policy enables secure cross-service communication between Amazon EKS, SageMaker AI, and other AWS services.

   ```
   %%bash -x
   
   # Create trust policy JSON
   cat << EOF > trust-policy.json
   {
   "Version": "2012-10-17",
   "Statement": [
   {
       "Effect": "Allow",
       "Principal": {
           "Service": [
               "sagemaker.amazonaws.com"
           ]
       },
       "Action": "sts:AssumeRole"
   },
   {
       "Effect": "Allow",
       "Principal": {
           "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}"
       },
       "Action": "sts:AssumeRoleWithWebIdentity",
       "Condition": {
           "StringLike": {
               "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com",
               "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:hyperpod-inference-system:hyperpod-inference-controller-manager"
           }
       }
   }
   ]
   }
   EOF
   ```

1. Create the execution role for the inference operator and attach the managed policy.

   ```
   aws iam create-role --role-name $HYPERPOD_INFERENCE_ROLE_NAME --assume-role-policy-document file://trust-policy.json
   aws iam attach-role-policy --role-name $HYPERPOD_INFERENCE_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerHyperPodInferenceAccess
   ```

1. Download and create the IAM policy required for the AWS Load Balancer Controller to manage Application Load Balancers and Network Load Balancers in your EKS cluster.

   ```
   %%bash -x 
   
   export ALBController_IAM_POLICY_NAME=HyperPodInferenceALBControllerIAMPolicy
   
   curl -o AWSLoadBalancerControllerIAMPolicy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.13.0/docs/install/iam_policy.json
   aws iam create-policy --policy-name $ALBController_IAM_POLICY_NAME --policy-document file://AWSLoadBalancerControllerIAMPolicy.json
   ```

1. Create an IAM service account that links the Kubernetes service account with the IAM policy, enabling the AWS Load Balancer Controller to assume the necessary AWS permissions through IRSA (IAM Roles for Service Accounts).

   ```
   %%bash -x 
   
   export ALB_POLICY_ARN="arn:aws:iam::$ACCOUNT_ID:policy/$ALBController_IAM_POLICY_NAME"
   
   # Create IAM service account with gathered values
   eksctl create iamserviceaccount \
   --approve \
   --override-existing-serviceaccounts \
   --name=aws-load-balancer-controller \
   --namespace=kube-system \
   --cluster=$EKS_CLUSTER_NAME \
   --attach-policy-arn=$ALB_POLICY_ARN \
   --region=$REGION
   
   # Print the values for verification
   echo "Cluster Name: $EKS_CLUSTER_NAME"
   echo "Region: $REGION"
   echo "Policy ARN: $ALB_POLICY_ARN"
   ```

1. Apply the `kubernetes.io/role/elb` tag to the public subnets in the Amazon EKS cluster's VPC so that the load balancer controller can discover them. The following commands tag the subnets that assign public IP addresses on launch, then verify that the tag was applied.

   ```
   export VPC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.resourcesVpcConfig.vpcId' --output text)
   
   # Add Tags
   aws ec2 describe-subnets \
   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \
   --query 'Subnets[*].SubnetId' --output text | \
   tr '\t' '\n' | \
   xargs -I{} aws ec2 create-tags --resources {} --tags Key=kubernetes.io/role/elb,Value=1
   
   # Verify Tags are added
   aws ec2 describe-subnets \
   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \
   --query 'Subnets[*].SubnetId' --output text | \
   tr '\t' '\n' |
   xargs -n1 -I{} aws ec2 describe-tags --filters "Name=resource-id,Values={}" "Name=key,Values=kubernetes.io/role/elb" --query "Tags[0].Value" --output text
   ```
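
   The tagging pipeline works because `describe-subnets --output text` separates subnet IDs with tabs; `tr '\t' '\n'` puts one ID per line so that `xargs -I{}` can issue one `create-tags` call per subnet. A local sketch of that fan-out with sample IDs (`echo` stands in for the real `create-tags` call):

   ```
   # Tab-separated IDs, as returned by describe-subnets --output text
   SUBNET_IDS=$(printf 'subnet-0aaa\tsubnet-0bbb\tsubnet-0ccc')

   # One command invocation per subnet ID
   TAG_OUTPUT=$(echo "$SUBNET_IDS" | tr '\t' '\n' | xargs -I{} echo "tagging {}")
   echo "$TAG_OUTPUT"
   ```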

1. Create namespaces for KEDA and cert-manager.

   ```
   kubectl create namespace keda
   kubectl create namespace cert-manager
   ```

1. Create an Amazon S3 VPC endpoint.

   ```
   aws ec2 create-vpc-endpoint \
   --vpc-id ${VPC_ID} \
   --vpc-endpoint-type Gateway \
   --service-name "com.amazonaws.${REGION}.s3" \
   --route-table-ids $(aws ec2 describe-route-tables --filters "Name=vpc-id,Values=${VPC_ID}" --query 'RouteTables[].Associations[].RouteTableId' --output text | tr ' ' '\n' | sort -u | tr '\n' ' ')
   ```
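
   The `--route-table-ids` argument is built by deduplicating the route table IDs from all subnet associations, since the same table can appear more than once. The dedup pipeline in isolation, with sample IDs:

   ```
   # Sample list with a duplicate, as describe-route-tables might return
   ROUTE_TABLE_IDS=$(echo "rtb-0111 rtb-0222 rtb-0111" | tr ' ' '\n' | sort -u | tr '\n' ' ')
   echo "$ROUTE_TABLE_IDS"
   ```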

1. Configure S3 storage access:

   1. Create an IAM policy that grants the necessary S3 permissions for using Mountpoint for Amazon S3, which enables file system access to S3 buckets from within the cluster.

      ```
      %%bash -x
      
      export S3_CSI_BUCKET_NAME="<bucketname_for_mounting_through_filesystem>"
      
      cat <<EOF> s3accesspolicy.json
      {
      "Version": "2012-10-17",
      "Statement": [
          
          {
              "Sid": "MountpointAccess",
              "Effect": "Allow",
              "Action": [
                  "s3:ListBucket",
                  "s3:GetObject",
                  "s3:PutObject",
                  "s3:AbortMultipartUpload",
                  "s3:DeleteObject"
              ],
              "Resource": [
                      "arn:aws:s3:::${S3_CSI_BUCKET_NAME}",
                      "arn:aws:s3:::${S3_CSI_BUCKET_NAME}/*"
              ]
          }
      ]
      }
      EOF
      
      aws iam create-policy \
      --policy-name S3MountpointAccessPolicy \
      --policy-document file://s3accesspolicy.json
      
      cat <<EOF> s3accesstrustpolicy.json
      {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Principal": {
                  "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/${OIDC_ID}"
              },
              "Action": "sts:AssumeRoleWithWebIdentity",
              "Condition": {
                  "StringEquals": {
                      "oidc.eks.$REGION.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com",
                      "oidc.eks.$REGION.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:kube-system:s3-csi-driver-sa"
                  }
              }
          }
      ]
      }
      EOF
      
      aws iam create-role --role-name $S3_CSI_ROLE_NAME --assume-role-policy-document file://s3accesstrustpolicy.json
      
      aws iam attach-role-policy --role-name $S3_CSI_ROLE_NAME --policy-arn "arn:aws:iam::$ACCOUNT_ID:policy/S3MountpointAccessPolicy"
      ```

   1. (Optional) Create an IAM service account for the Amazon S3 CSI driver. The Amazon S3 CSI driver requires an IAM service account with appropriate permissions to mount S3 buckets as persistent volumes in your Amazon EKS cluster. This step creates the necessary IAM role and Kubernetes service account with the required S3 access policy.

      ```
      %%bash -x 
      
      export S3_CSI_ROLE_NAME="SM_HP_S3_CSI_ROLE-$REGION"
      export S3_CSI_POLICY_ARN=$(aws iam list-policies --query 'Policies[?PolicyName==`S3MountpointAccessPolicy`]' | jq '.[0].Arn' |  tr -d '"')
      
      eksctl create iamserviceaccount \
      --name s3-csi-driver-sa \
      --namespace kube-system \
      --cluster $EKS_CLUSTER_NAME \
      --attach-policy-arn $S3_CSI_POLICY_ARN \
      --approve \
      --role-name $S3_CSI_ROLE_NAME \
      --region $REGION 
      
      kubectl label serviceaccount s3-csi-driver-sa app.kubernetes.io/component=csi-driver app.kubernetes.io/instance=aws-mountpoint-s3-csi-driver app.kubernetes.io/managed-by=EKS app.kubernetes.io/name=aws-mountpoint-s3-csi-driver -n kube-system --overwrite
      ```

   1. (Optional) Install the Amazon S3 CSI driver add-on. This driver enables your pods to mount S3 buckets as persistent volumes, providing direct access to S3 storage from within your Kubernetes workloads.

      ```
      %%bash -x
      
      export S3_CSI_ROLE_ARN=$(aws iam get-role --role-name $S3_CSI_ROLE_NAME  --query 'Role.Arn' --output text)
      eksctl create addon --name aws-mountpoint-s3-csi-driver --cluster $EKS_CLUSTER_NAME --service-account-role-arn $S3_CSI_ROLE_ARN --force
      ```

   1. (Optional) Create a Persistent Volume Claim (PVC) for S3 storage. This PVC enables your pods to request and use S3 storage as if it were a traditional file system.

      ```
      %%bash -x 
      
      cat <<EOF> pvc_s3.yaml
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: s3-claim
      spec:
        accessModes:
          - ReadWriteMany # supported options: ReadWriteMany / ReadOnlyMany
        storageClassName: "" # required for static provisioning
        resources:
          requests:
            storage: 1200Gi # ignored, required
        volumeName: s3-pv
      EOF
      
      kubectl apply -f pvc_s3.yaml
      ```

1. (Optional) Configure FSx storage access. Create an IAM service account for the Amazon FSx CSI driver. This service account will be used by the FSx CSI driver to interact with the Amazon FSx service on behalf of your cluster.

   ```
   %%bash -x 
   
   
   eksctl create iamserviceaccount \
   --name fsx-csi-controller-sa \
   --namespace kube-system \
   --cluster $EKS_CLUSTER_NAME \
   --attach-policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess \
   --approve \
   --role-name FSXLCSI-${EKS_CLUSTER_NAME}-${REGION} \
   --region $REGION
   ```

### Create the KEDA operator role
<a name="sagemaker-hyperpod-model-deployment-setup-keda"></a>

1. Create the trust policy and permissions policy.

   ```
   # Create trust policy
   cat <<EOF > /tmp/keda-trust-policy.json
   {
   "Version": "2012-10-17",
   "Statement": [
       {
           "Effect": "Allow",
           "Principal": {
               "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID"
           },
           "Action": "sts:AssumeRoleWithWebIdentity",
           "Condition": {
               "StringLike": {
                   "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:kube-system:keda-operator",
                   "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com"
               }
           }
       }
   ]
   }
   EOF
   # Create permissions policy
   cat <<EOF > /tmp/keda-policy.json
   {
   "Version": "2012-10-17",
   "Statement": [
       {
           "Effect": "Allow",
           "Action": [
               "cloudwatch:GetMetricData",
               "cloudwatch:GetMetricStatistics",
               "cloudwatch:ListMetrics"
           ],
           "Resource": "*"
       },
       {
           "Effect": "Allow",
           "Action": [
               "aps:QueryMetrics",
               "aps:GetLabels",
               "aps:GetSeries",
               "aps:GetMetricMetadata"
           ],
           "Resource": "*"
       }
   ]
   }
   EOF
   # Create the role
   aws iam create-role \
   --role-name keda-operator-role \
   --assume-role-policy-document file:///tmp/keda-trust-policy.json
   # Create the policy
   KEDA_POLICY_ARN=$(aws iam create-policy \
   --policy-name KedaOperatorPolicy \
   --policy-document file:///tmp/keda-policy.json \
   --query 'Policy.Arn' \
   --output text)
   # Attach the policy to the role
   aws iam attach-role-policy \
   --role-name keda-operator-role \
   --policy-arn $KEDA_POLICY_ARN
   ```

1. If you're using gated models, create an IAM role to access the gated models.

   1. Create the trust policy and IAM role for gated model access.

      ```
      %%bash -s $REGION $HYPERPOD_CLUSTER_NAME $ACCOUNT_ID $OIDC_ID
      REGION=$1
      HYPERPOD_CLUSTER_NAME=$2
      ACCOUNT_ID=$3
      OIDC_ID=$4
      
      JUMPSTART_GATED_ROLE_NAME="JumpstartGatedRole-${REGION}-${HYPERPOD_CLUSTER_NAME}"
      
      cat <<EOF > /tmp/trust-policy.json
      {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Principal": {
                  "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID"
              },
              "Action": "sts:AssumeRoleWithWebIdentity",
              "Condition": {
                  "StringLike": {
                      "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:*:hyperpod-inference-service-account*",
                      "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com"
                  }
              }
          },
              {
              "Effect": "Allow",
              "Principal": {
                  "Service": "sagemaker.amazonaws.com"
              },
              "Action": "sts:AssumeRole"
          }
      ]
      }
      EOF
      
      # Create the role and attach the managed policy
      aws iam create-role \
      --role-name $JUMPSTART_GATED_ROLE_NAME \
      --assume-role-policy-document file:///tmp/trust-policy.json
      
      aws iam attach-role-policy \
      --role-name $JUMPSTART_GATED_ROLE_NAME \
      --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerHyperPodGatedModelAccess
      ```

      ```
      JUMPSTART_GATED_ROLE_ARN=$(aws iam get-role --role-name $JUMPSTART_GATED_ROLE_NAME --query "Role.Arn" --output text)
      echo $JUMPSTART_GATED_ROLE_ARN
      ```
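
      Before creating the role, you can sanity-check the trust policy locally. The following is a minimal sketch that validates a sample policy; the account ID and OIDC ID are placeholders, and the same check could be pointed at the `/tmp/trust-policy.json` written above.

      ```
      # Placeholder trust policy with the two statements the steps above generate.
      policy='{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Federated":"arn:aws:iam::111122223333:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE"},"Action":"sts:AssumeRoleWithWebIdentity"},{"Effect":"Allow","Principal":{"Service":"sagemaker.amazonaws.com"},"Action":"sts:AssumeRole"}]}'

      # Assert both required actions are present before calling create-role.
      echo "$policy" | python3 -c 'import json,sys; a={s["Action"] for s in json.load(sys.stdin)["Statement"]}; assert "sts:AssumeRoleWithWebIdentity" in a and "sts:AssumeRole" in a; print("trust policy OK")'
      ```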

### Install the inference operator
<a name="sagemaker-hyperpod-model-deployment-setup-install"></a>

1. Install the HyperPod inference operator. This step gathers the required AWS resource identifiers and generates the Helm installation command with the appropriate configuration parameters.

   Access the helm chart from [https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart).

   ```
   git clone https://github.com/aws/sagemaker-hyperpod-cli
   cd sagemaker-hyperpod-cli
   cd helm_chart/HyperPodHelmChart
   helm dependencies update charts/inference-operator
   ```

   ```
   %%bash -x
   
   HYPERPOD_INFERENCE_ROLE_ARN=$(aws iam get-role --role-name=$HYPERPOD_INFERENCE_ROLE_NAME --query "Role.Arn" --output text)
   echo $HYPERPOD_INFERENCE_ROLE_ARN
   
   S3_CSI_ROLE_ARN=$(aws iam get-role --role-name=$S3_CSI_ROLE_NAME --query "Role.Arn" --output text)
   echo $S3_CSI_ROLE_ARN
   
   HYPERPOD_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME --query "ClusterArn" --output text)
   
   # Verify values
   echo "Cluster Name: $EKS_CLUSTER_NAME"
   echo "Execution Role: $HYPERPOD_INFERENCE_ROLE_ARN"
   echo "Hyperpod ARN: $HYPERPOD_CLUSTER_ARN"
   # Run the HyperPod inference operator installation.
   
   helm install hyperpod-inference-operator charts/inference-operator \
   -n kube-system \
   --set region=$REGION \
   --set eksClusterName=$EKS_CLUSTER_NAME \
   --set hyperpodClusterArn=$HYPERPOD_CLUSTER_ARN \
   --set executionRoleArn=$HYPERPOD_INFERENCE_ROLE_ARN \
   --set s3.serviceAccountRoleArn=$S3_CSI_ROLE_ARN \
   --set s3.node.serviceAccount.create=false \
   --set keda.podIdentity.aws.irsa.roleArn="arn:aws:iam::$ACCOUNT_ID:role/keda-operator-role" \
   --set tlsCertificateS3Bucket="s3://$BUCKET_NAME" \
   --set alb.region=$REGION \
   --set alb.clusterName=$EKS_CLUSTER_NAME \
   --set alb.vpcId=$VPC_ID
   
   # For JumpStart gated model usage, add:
   # --set jumpstartGatedModelDownloadRoleArn=$JUMPSTART_GATED_ROLE_ARN
   ```

1. Configure the service account annotations for IAM integration. This annotation enables the operator's service account to assume the necessary IAM permissions for managing inference endpoints and interacting with AWS services.

   ```
   %%bash -x 
   
   EKS_CLUSTER_ROLE_NAME=$(echo $EKS_CLUSTER_ROLE | sed 's/.*\///')
   
   # Annotate service account
   kubectl annotate serviceaccount hyperpod-inference-operator-controller-manager \
   -n hyperpod-inference-system \
   eks.amazonaws.com/role-arn=arn:aws:iam::${ACCOUNT_ID}:role/${EKS_CLUSTER_ROLE_NAME} \
   --overwrite
   ```

## Verify the inference operator is working
<a name="sagemaker-hyperpod-model-deployment-setup-verify"></a>

Follow these steps to verify that your inference operator installation is working correctly by deploying and testing a simple model.

**Deploy a test model to verify the operator**

1. Create a model deployment configuration file. This creates a Kubernetes manifest file that defines a JumpStart model deployment for the HyperPod inference operator.

   ```
   cat <<EOF > simple_model_install.yaml
   ---
   apiVersion: inference.sagemaker.aws.amazon.com/v1
   kind: JumpStartModel
   metadata:
     name: testing-deployment-bert
     namespace: default
   spec:
     model:
       modelId: "huggingface-eqa-bert-base-cased"
     sageMakerEndpoint:
       name: "hp-inf-ep-for-testing"
     server:
       instanceType: "ml.c5.2xlarge"
     environmentVariables:
       - name: SAMPLE_ENV_VAR
         value: "sample_value"
     maxDeployTimeInSeconds: 1800
   EOF
   ```

1. Deploy the model and clean up the configuration file.

   ```
   kubectl create -f simple_model_install.yaml
   rm -f simple_model_install.yaml
   ```

1. Verify the service account configuration to ensure the operator can assume AWS permissions.

   ```
   # Get the service account details
   kubectl get serviceaccount -n hyperpod-inference-system
   
   # Check if the service account has the AWS annotations
   kubectl describe serviceaccount hyperpod-inference-operator-controller-manager -n hyperpod-inference-system
   ```
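
   For reference, this is a hedged sketch of the shape a correctly annotated service account takes (single-line JSON similar to `kubectl get sa -o json` output; the account ID and role name are placeholders) and how to extract the annotation that IRSA keys off:

   ```
   # Minimal sample of an annotated service account; IRSA reads the
   # eks.amazonaws.com/role-arn annotation. Account ID is a placeholder.
   sa='{"metadata":{"name":"hyperpod-inference-operator-controller-manager","annotations":{"eks.amazonaws.com/role-arn":"arn:aws:iam::111122223333:role/eks-cluster-role"}}}'
   echo "$sa" | python3 -c 'import json,sys; print(json.load(sys.stdin)["metadata"]["annotations"]["eks.amazonaws.com/role-arn"])'
   ```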

**Configure deployment settings (if using Studio UI)**

1. Review the recommended instance type under **Deployment settings**.

1. If modifying the **Instance type**, ensure compatibility with your HyperPod cluster. Contact your admin if compatible instances aren't available.

1. For GPU-partitioned instances with MIG enabled, select an appropriate **GPU partition** from available MIG profiles to optimize GPU utilization. For more information, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).

1. If using task governance, configure priority settings for model deployment preemption capabilities.

1. Enter the namespace provided by your admin. Contact your admin for the correct namespace if needed.

## (Optional) Set up user access through the JumpStart UI in SageMaker AI Studio Classic
<a name="sagemaker-hyperpod-model-deployment-setup-optional-js"></a>

For more background on setting up SageMaker HyperPod access for Studio Classic users and configuring fine-grained Kubernetes RBAC permissions for data scientist users, read [Setting up an Amazon EKS cluster in Studio](sagemaker-hyperpod-studio-setup-eks.md) and [Setting up Kubernetes role-based access control](sagemaker-hyperpod-eks-setup-rbac.md).

1. Identify the IAM role that Data Scientist users will use to manage and deploy models to SageMaker HyperPod from SageMaker AI Studio Classic. This is typically the User Profile Execution Role or Domain Execution Role for the Studio Classic user.

   ```
   %%bash -x
   
   export DATASCIENTIST_ROLE_NAME="<Execution Role Name used in SageMaker Studio Classic>"
   
   export DATASCIENTIST_POLICY_NAME="HyperPodUIAccessPolicy"
   export EKS_CLUSTER_ARN=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \
     --query 'Orchestrator.Eks.ClusterArn' --output text)
   
   export DATASCIENTIST_HYPERPOD_NAMESPACE="team-namespace"
   ```

1. Attach an Identity Policy enabling Model Deployment access.

   ```
   %%bash -x
   
   # Create access policy
   cat << EOF > hyperpod-deployment-ui-access-policy.json
   {
        "Version": "2012-10-17",
       "Statement": [
           {
                "Sid": "DescribeHyperpodClusterPermissions",
               "Effect": "Allow",
               "Action": [
                   "sagemaker:DescribeCluster"
               ],
               "Resource": "$HYPERPOD_CLUSTER_ARN"
           },
           {
               "Sid": "UseEksClusterPermissions",
               "Effect": "Allow",
               "Action": [
                   "eks:DescribeCluster",
                   "eks:AccessKubernetesApi",
                   "eks:MutateViaKubernetesApi",
                   "eks:DescribeAddon"
               ],
               "Resource": "$EKS_CLUSTER_ARN"
           },
           {
               "Sid": "ListPermission",
               "Effect": "Allow",
               "Action": [
                   "sagemaker:ListClusters",
                   "sagemaker:ListEndpoints"
               ],
               "Resource": "*"
           },
           {
               "Sid": "SageMakerEndpointAccess",
               "Effect": "Allow",
               "Action": [
                   "sagemaker:DescribeEndpoint",
                   "sagemaker:InvokeEndpoint"
               ],
               "Resource": "arn:aws:sagemaker:$REGION:$ACCOUNT_ID:endpoint/*"
           }
       ]
   }
   EOF
   
   aws iam put-role-policy --role-name $DATASCIENTIST_ROLE_NAME --policy-name $DATASCIENTIST_POLICY_NAME --policy-document file://hyperpod-deployment-ui-access-policy.json
   ```

1. Create an EKS access entry for the user, mapping them to Kubernetes groups.

   ```
   %%bash -x
   
   aws eks create-access-entry --cluster-name $EKS_CLUSTER_NAME \
       --principal-arn "arn:aws:iam::$ACCOUNT_ID:role/$DATASCIENTIST_ROLE_NAME" \
       --kubernetes-groups '["hyperpod-scientist-user-namespace-level","hyperpod-scientist-user-cluster-level"]'
   ```

1. Create Kubernetes RBAC policies for the user.

   ```
   %%bash -x
   
   cat << EOF > cluster_level_config.yaml
   kind: ClusterRole
   apiVersion: rbac.authorization.k8s.io/v1
   metadata:
     name: hyperpod-scientist-user-cluster-role
   rules:
   - apiGroups: [""]
     resources: ["pods"]
     verbs: ["list"]
   - apiGroups: [""]
     resources: ["nodes"]
     verbs: ["list"]
   - apiGroups: [""]
     resources: ["namespaces"]
     verbs: ["list"]
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRoleBinding
   metadata:
     name: hyperpod-scientist-user-cluster-role-binding
   subjects:
   - kind: Group
     name: hyperpod-scientist-user-cluster-level
     apiGroup: rbac.authorization.k8s.io
   roleRef:
     kind: ClusterRole
     name: hyperpod-scientist-user-cluster-role
     apiGroup: rbac.authorization.k8s.io
   EOF
   
   
   kubectl apply -f cluster_level_config.yaml
   
   
   cat << EOF > namespace_level_role.yaml
   kind: Role
   apiVersion: rbac.authorization.k8s.io/v1
   metadata:
     namespace: $DATASCIENTIST_HYPERPOD_NAMESPACE
     name: hyperpod-scientist-user-namespace-level-role
   rules:
   - apiGroups: [""]
     resources: ["pods"]
     verbs: ["create", "get"]
   - apiGroups: [""]
     resources: ["nodes"]
     verbs: ["get", "list"]
   - apiGroups: [""]
     resources: ["pods/log"]
     verbs: ["get", "list"]
   - apiGroups: [""]
     resources: ["pods/exec"]
     verbs: ["get", "create"]
   - apiGroups: ["kubeflow.org"]
     resources: ["pytorchjobs", "pytorchjobs/status"]
     verbs: ["get", "list", "create", "delete", "update", "describe"]
   - apiGroups: [""]
     resources: ["configmaps"]
     verbs: ["create", "update", "get", "list", "delete"]
   - apiGroups: [""]
     resources: ["secrets"]
     verbs: ["create", "get", "list", "delete"]
   - apiGroups: [ "inference.sagemaker.aws.amazon.com" ]
     resources: [ "inferenceendpointconfig", "inferenceendpoint", "jumpstartmodel" ]
     verbs: [ "get", "list", "create", "delete", "update", "describe" ]
   - apiGroups: [ "autoscaling" ]
     resources: [ "horizontalpodautoscalers" ]
     verbs: [ "get", "list", "watch", "create", "update", "patch", "delete" ]
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: RoleBinding
   metadata:
     namespace: $DATASCIENTIST_HYPERPOD_NAMESPACE
     name: hyperpod-scientist-user-namespace-level-role-binding
   subjects:
   - kind: Group
     name: hyperpod-scientist-user-namespace-level
     apiGroup: rbac.authorization.k8s.io
   roleRef:
     kind: Role
     name: hyperpod-scientist-user-namespace-level-role
     apiGroup: rbac.authorization.k8s.io
   EOF
   
   
   kubectl apply -f namespace_level_role.yaml
   ```

# Deploy foundation models and custom fine-tuned models
<a name="sagemaker-hyperpod-model-deployment-deploy"></a>

Whether you're deploying pre-trained open-weights or gated foundation models from Amazon SageMaker JumpStart, or your own custom or fine-tuned models stored in Amazon S3 or Amazon FSx, SageMaker HyperPod provides the flexible, scalable infrastructure you need for production inference workloads.


|  | Deploy open-weights and gated foundation models from JumpStart | Deploy custom and fine-tuned models from Amazon S3 and Amazon FSx | 
| --- | --- | --- | 
| Description |  Deploy from a comprehensive catalog of pre-trained foundation models with automatic optimization and scaling policies tailored to each model family.  | Bring your own custom and fine-tuned models and leverage SageMaker HyperPod's enterprise infrastructure for production-scale inference. Choose between cost-effective storage with Amazon S3 or a high-performance file system with Amazon FSx. | 
| Key benefits | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-model-deployment-deploy.html) |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-model-deployment-deploy.html)  | 
| Deployment options |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-model-deployment-deploy.html)  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-model-deployment-deploy.html)  | 

The following sections step you through deploying models from Amazon SageMaker JumpStart and from Amazon S3 and Amazon FSx.

**Topics**
+ [Deploy models from JumpStart using Amazon SageMaker Studio](sagemaker-hyperpod-model-deployment-deploy-js-ui.md)
+ [Deploy models from JumpStart using kubectl](sagemaker-hyperpod-model-deployment-deploy-js-kubectl.md)
+ [Deploy custom fine-tuned models from Amazon S3 and Amazon FSx using kubectl](sagemaker-hyperpod-model-deployment-deploy-ftm.md)
+ [Deploy custom fine-tuned models using the Python SDK and HPCLI](deploy-trained-model.md) 
+ [Deploy models from Amazon SageMaker JumpStart using the Python SDK and HPCLI](deploy-jumpstart-model.md) 

# Deploy models from JumpStart using Amazon SageMaker Studio
<a name="sagemaker-hyperpod-model-deployment-deploy-js-ui"></a>

The following steps show you how to deploy models from JumpStart using Amazon SageMaker Studio.

## Prerequisites
<a name="sagemaker-hyperpod-model-deployment-deploy-js-ui-prereqs"></a>

Verify that you've set up inference capabilities on your Amazon SageMaker HyperPod clusters. For more information, see [Setting up your HyperPod clusters for model deployment](sagemaker-hyperpod-model-deployment-setup.md). 

## Create a HyperPod deployment
<a name="sagemaker-hyperpod-model-deployment-deploy-js-ui-create"></a>

1. In Amazon SageMaker Studio, open the **JumpStart** landing page from the left navigation pane. 

1. Under **All public models**, choose a model you want to deploy.
**Note**  
If you’ve selected a gated model, you’ll have to accept the End User License Agreement (EULA).

1. Choose **SageMaker HyperPod**.

1. Under **Deployment settings**, JumpStart will recommend an instance for deployment. You can modify these settings if necessary.

   1. If you modify **Instance type**, ensure it’s compatible with the chosen **HyperPod cluster**. If there aren’t any compatible instances, you’ll need to select a new **HyperPod cluster** or contact your admin to add compatible instances to the cluster.

   1. To prioritize the model deployment, install the task governance addon, create compute allocations, and set up task rankings for the cluster policy. Once this is done, you should see an option to select a priority for the model deployment which can be used for preemption of other deployments and tasks on the cluster. 

   1. Enter the namespace to which your admin has provided you access. You may have to directly reach out to your admin to get the exact namespace. Once a valid namespace is provided, the **Deploy** button should be enabled to deploy the model.

   1. If your instance type is partitioned (MIG enabled), select a **GPU partition type**.

   1. To speed up LLM inference, you can enable L2 KV cache or intelligent routing. By default, only the L1 KV cache is enabled. For more details on KV cache and intelligent routing, see [SageMaker HyperPod model deployment](sagemaker-hyperpod-model-deployment.md).

1. Choose **Deploy** and wait for the **Endpoint** to be created.

1. After the **Endpoint** has been created, select **Test inference**.

## Edit a HyperPod deployment
<a name="sagemaker-hyperpod-model-deployment-deploy-js-ui-edit"></a>

1. In Amazon SageMaker Studio, select **Compute** and then **HyperPod clusters** from the left navigation pane. 

1. Under **Deployments**, choose the HyperPod cluster deployment you want to modify.

1. From the vertical ellipsis icon (⋮), choose **Edit**.

1. Under **Deployment settings**, you can enable or disable **Auto-scaling**, and change the number of **Max replicas**.

1. Select **Save**.

1. The **Status** will change to **Updating**. Once it changes back to **In service**, your changes are complete and you’ll see a message confirming it.

## Delete a HyperPod deployment
<a name="sagemaker-hyperpod-model-deployment-deploy-js-ui-delete"></a>

1. In Amazon SageMaker Studio, select **Compute** and then **HyperPod clusters** from the left navigation pane. 

1. Under **Deployments**, choose the HyperPod cluster deployment you want to modify.

1. From the vertical ellipsis icon (⋮), choose **Delete**.

1. In the **Delete HyperPod deployment** window, select the checkbox.

1. Choose **Delete**.

1. The **Status** will change to **Deleting**. Once the HyperPod deployment has been deleted, you’ll see a message confirming it.

# Deploy models from JumpStart using kubectl
<a name="sagemaker-hyperpod-model-deployment-deploy-js-kubectl"></a>

The following steps show you how to deploy a JumpStart model to a HyperPod cluster using kubectl.

The following instructions contain code cells and commands designed to run in a terminal. Ensure you have configured your environment with AWS credentials before executing these commands. 

## Prerequisites
<a name="kubectl-prerequisites"></a>

Before you begin, verify that you've: 
+ Set up inference capabilities on your Amazon SageMaker HyperPod clusters. For more information, see [Setting up your HyperPod clusters for model deployment](sagemaker-hyperpod-model-deployment-setup.md).
+ Installed [kubectl](https://kubernetes.io/docs/reference/kubectl/) utility and configured [jq](https://jqlang.org/) in your terminal.

## Setup and configuration
<a name="kubectl-prerequisites-setup-and-configuration"></a>

1. Select your Region.

   ```
   export REGION=<region>
   ```

1. View all SageMaker public hub models and HyperPod clusters.

1. Select a `JumpStartModel` from the SageMaker public hub. The public hub has a large number of models available, so you can use `NextToken` to iteratively list them all.

   ```
   aws sagemaker list-hub-contents --hub-name SageMakerPublicHub --hub-content-type Model --query '{Models: HubContentSummaries[].{ModelId:HubContentName,Version:HubContentVersion}, NextToken: NextToken}' --output json
   ```

   ```
   export MODEL_ID="deepseek-llm-r1-distill-qwen-1-5b"
   export MODEL_VERSION="2.0.4"
   ```

1. Configure the model ID and cluster name you’ve selected into the variables below.
**Note**  
Check with your cluster admin to ensure permissions are granted for this role or user. You can run `!aws sts get-caller-identity --query "Arn"` to check which role or user you are using in your terminal.

   ```
   aws sagemaker list-clusters --output table
   
   # Select the cluster name where you want to deploy the model.
   export HYPERPOD_CLUSTER_NAME="<insert cluster name here>"
   
   # Select the instance that is relevant for your model deployment and exists within the selected cluster.
   # List available instances in your HyperPod cluster
   aws sagemaker describe-cluster --cluster-name=$HYPERPOD_CLUSTER_NAME --query "InstanceGroups[].{InstanceType:InstanceType,Count:CurrentCount}" --output table
   
   # List supported instance types for the selected model
   aws sagemaker describe-hub-content --hub-name SageMakerPublicHub --hub-content-type Model --hub-content-name "$MODEL_ID" --output json | jq -r '.HubContentDocument | fromjson | {Default: .DefaultInferenceInstanceType, Supported: .SupportedInferenceInstanceTypes}'
   
   
   # Select an instance type from the cluster that is compatible with the model.
   # Make sure that the selected instance type is either the default or a supported type for the JumpStart model.
   export INSTANCE_TYPE="<Instance_type_In_cluster>"
   ```

1. Confirm with the cluster admin which namespace you are permitted to use. The admin should have created a `hyperpod-inference` service account in your namespace.

   ```
   export CLUSTER_NAMESPACE="default"
   ```

1. Set a name for the endpoint and the custom object to be created.

   ```
   export SAGEMAKER_ENDPOINT_NAME="deepseek-qwen-1-5b-test"
   ```

1. The following is an example for a `deepseek-llm-r1-distill-qwen-1-5b` model deployment from JumpStart. Create a similar deployment yaml file based on the model you selected in the step above.
**Note**  
If your cluster uses GPU partitioning with MIG, you can request specific MIG profiles by adding the `acceleratorPartitionType` field to the server specification. For more information, see [Task Submission with MIG](sagemaker-hyperpod-eks-gpu-partitioning-task-submission.md).

   ```
   cat << EOF > jumpstart_model.yaml
   ---
   apiVersion: inference.sagemaker.aws.amazon.com/v1
   kind: JumpStartModel
   metadata:
     name: $SAGEMAKER_ENDPOINT_NAME
     namespace: $CLUSTER_NAMESPACE 
   spec:
     sageMakerEndpoint:
       name: $SAGEMAKER_ENDPOINT_NAME
     model:
       modelHubName: SageMakerPublicHub
       modelId: $MODEL_ID
       modelVersion: $MODEL_VERSION
     server:
       instanceType: $INSTANCE_TYPE
       # Optional: Specify GPU partition profile for MIG-enabled instances
       # acceleratorPartitionType: "1g.10gb"
     metrics:
       enabled: true
     environmentVariables:
       - name: SAMPLE_ENV_VAR
         value: "sample_value"
     maxDeployTimeInSeconds: 1800
     autoScalingSpec:
       cloudWatchTrigger:
         name: "SageMaker-Invocations"
         namespace: "AWS/SageMaker"
         useCachedMetrics: false
         metricName: "Invocations"
         targetValue: 10
         minValue: 0.0
         metricCollectionPeriod: 30
         metricStat: "Sum"
         metricType: "Average"
         dimensions:
           - name: "EndpointName"
             value: "$SAGEMAKER_ENDPOINT_NAME"
           - name: "VariantName"
             value: "AllTraffic"
   EOF
   ```
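
The `NextToken` pagination from the model-listing step above can be wrapped in a small loop. This is a hedged sketch: `fetch_page` and `list_all_models` are our own helper names around the same CLI call shown earlier, and `python3` is assumed to be available for JSON parsing.

```
# fetch_page prints one JSON page of hub contents; an optional token argument
# requests the next page.
fetch_page() {
  aws sagemaker list-hub-contents --hub-name SageMakerPublicHub \
    --hub-content-type Model --output json ${1:+--next-token "$1"}
}

# list_all_models follows NextToken until the service stops returning one,
# printing one model ID per line.
list_all_models() {
  local token="" page
  while :; do
    page=$(fetch_page "$token") || return 1
    echo "$page" | python3 -c 'import json,sys; [print(m["HubContentName"]) for m in json.load(sys.stdin).get("HubContentSummaries", [])]'
    token=$(echo "$page" | python3 -c 'import json,sys; print(json.load(sys.stdin).get("NextToken", ""))')
    [ -z "$token" ] && break
  done
}

# list_all_models   # uncomment to print every model ID in the hub
```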

## Deploy your model
<a name="kubectl-deploy-your-model"></a>

**Update your kubernetes configuration and deploy your model**

1. Configure kubectl to connect to the HyperPod cluster orchestrated by Amazon EKS.

   ```
   export EKS_CLUSTER_NAME=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \
     --query 'Orchestrator.Eks.ClusterArn' --output text | \
     cut -d'/' -f2)
   aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION
   ```
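
   The `cut` in the command above relies on the EKS cluster ARN format, where the cluster name is everything after the slash. For example (placeholder ARN):

   ```
   # EKS cluster ARNs have the form arn:aws:eks:REGION:ACCOUNT:cluster/NAME.
   arn="arn:aws:eks:us-west-2:111122223333:cluster/my-eks-cluster"
   echo "$arn" | cut -d'/' -f2
   # → my-eks-cluster
   ```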

1. Deploy your JumpStart model.

   ```
   kubectl apply -f jumpstart_model.yaml
   ```

**Monitor the status of your model deployment**

1. Verify that the model is successfully deployed.

   ```
   kubectl describe JumpStartModel $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Verify that the endpoint is successfully created.

   ```
   aws sagemaker describe-endpoint --endpoint-name=$SAGEMAKER_ENDPOINT_NAME --output table
   ```
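
   Endpoint creation can take several minutes. Rather than re-running the describe call by hand, you can poll until the status settles. This is a hedged sketch: `get_endpoint_status` and `wait_for_endpoint` are our own helper names, and the retry count and interval are arbitrary.

   ```
   # get_endpoint_status wraps the describe call shown above.
   get_endpoint_status() {
     aws sagemaker describe-endpoint --endpoint-name "$SAGEMAKER_ENDPOINT_NAME" \
       --query "EndpointStatus" --output text
   }

   # Poll until InService (success), Failed (failure), or attempts run out.
   wait_for_endpoint() {
     local tries=${1:-60} status
     while [ "$tries" -gt 0 ]; do
       status=$(get_endpoint_status) || return 1
       echo "status: $status"
       case "$status" in
         InService) return 0 ;;
         Failed) return 1 ;;
       esac
       tries=$((tries - 1))
       sleep "${POLL_INTERVAL:-30}"
     done
     return 1
   }

   # wait_for_endpoint 60   # poll for up to ~30 minutes
   ```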

1. Invoke your model endpoint. You can programmatically retrieve example payloads from the `JumpStartModel` object.

   ```
   aws sagemaker-runtime invoke-endpoint \
     --endpoint-name $SAGEMAKER_ENDPOINT_NAME \
     --content-type "application/json" \
     --body '{"inputs": "What is AWS SageMaker?"}' \
     --region $REGION \
     --cli-binary-format raw-in-base64-out \
     /dev/stdout
   ```
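
   The response body written to `/dev/stdout` is JSON whose exact shape depends on the serving container. For a text-generation container it often looks like the sample below; the `generated_text` field name is an assumption about the container, not something this guide guarantees.

   ```
   # Parse a sample response body and pull out the generated text.
   response='[{"generated_text": "Amazon SageMaker is a managed machine learning service."}]'
   echo "$response" | python3 -c 'import json,sys; print(json.load(sys.stdin)[0]["generated_text"])'
   ```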

## Manage your deployment
<a name="kubectl-manage-your-deployment"></a>

Delete your JumpStart model deployment once you no longer need it.

```
kubectl delete JumpStartModel $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
```

**Troubleshooting**

Use these debugging commands if your deployment isn't working as expected.

1. Check the status of Kubernetes deployment. This command inspects the underlying Kubernetes deployment object that manages the pods running your model. Use this to troubleshoot pod scheduling, resource allocation, and container startup issues.

   ```
   kubectl describe deployment $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Check the status of your JumpStart model resource. This command examines the custom `JumpStartModel` resource that manages the high-level model configuration and deployment lifecycle. Use this to troubleshoot model-specific issues like configuration errors or SageMaker AI endpoint creation problems.

   ```
   kubectl describe JumpStartModel $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Check the status of all Kubernetes objects. This command provides a comprehensive overview of all related Kubernetes resources in your namespace. Use this for a quick health check to see the overall state of pods, services, deployments, and custom resources associated with your model deployment.

   ```
   kubectl get pods,svc,deployment,JumpStartModel,sagemakerendpointregistration -n $CLUSTER_NAMESPACE
   ```

# Deploy custom fine-tuned models from Amazon S3 and Amazon FSx using kubectl
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm"></a>

The following steps show you how to deploy models stored on Amazon S3 or Amazon FSx to an Amazon SageMaker HyperPod cluster using kubectl.

The following instructions contain code cells and commands designed to run in a terminal. Ensure you have configured your environment with AWS credentials before executing these commands.

## Prerequisites
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-prereqs"></a>

Before you begin, verify that you've: 
+ Set up inference capabilities on your Amazon SageMaker HyperPod clusters. For more information, see [Setting up your HyperPod clusters for model deployment](sagemaker-hyperpod-model-deployment-setup.md).
+ Installed [kubectl](https://kubernetes.io/docs/reference/kubectl/) utility and configured [jq](https://jqlang.org/) in your terminal.

## Setup and configuration
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-setup"></a>

Replace all placeholder values with your actual resource identifiers.

1. Select your Region in your environment.

   ```
   export REGION=<region>
   ```

1. Initialize your cluster name. This identifies the HyperPod cluster where your model will be deployed.
**Note**  
Check with your cluster admin to ensure permissions are granted for this role or user. You can run `!aws sts get-caller-identity --query "Arn"` to check which role or user you are using in your terminal.

   ```
   # Specify your hyperpod cluster name here
   HYPERPOD_CLUSTER_NAME="<Hyperpod_cluster_name>"
   
   # NOTE: For sample deployment, we use g5.8xlarge for deepseek-r1 1.5b model which has sufficient memory and GPU
   instance_type="ml.g5.8xlarge"
   ```

1. Initialize your cluster namespace. Your cluster admin should've already created a hyperpod-inference service account in your namespace.

   ```
   export CLUSTER_NAMESPACE="<namespace>"
   ```

1. Create a CRD using one of the following options:

------
#### [ Using Amazon FSx as the model source ]

   1. Set up a SageMaker endpoint name.

      ```
      export SAGEMAKER_ENDPOINT_NAME="deepseek15b-fsx"
      ```

   1. Configure the Amazon FSx file system ID to be used.

      ```
      export FSX_FILE_SYSTEM_ID="fs-1234abcd"
      ```

   1. The following is an example yaml file for creating an endpoint with Amazon FSx and a DeepSeek model.
**Note**  
For clusters with GPU partitioning enabled, replace `nvidia.com/gpu` with the appropriate MIG resource name such as `nvidia.com/mig-1g.10gb`. For more information, see [Task Submission with MIG](sagemaker-hyperpod-eks-gpu-partitioning-task-submission.md).

      ```
      cat <<EOF> deploy_fsx_cluster_inference.yaml
      ---
      apiVersion: inference.sagemaker.aws.amazon.com/v1
      kind: InferenceEndpointConfig
      metadata:
        name: $SAGEMAKER_ENDPOINT_NAME
        namespace: $CLUSTER_NAMESPACE
      spec:
        modelName: deepseek-1-5b
        endpointName: $SAGEMAKER_ENDPOINT_NAME
        instanceType: ml.g5.8xlarge
        invocationEndpoint: v1/chat/completions
        replicas: 2
        modelSourceConfig:
          fsxStorage:
            fileSystemId: $FSX_FILE_SYSTEM_ID
          modelLocation: deepseek-1-5b
          modelSourceType: fsx
        worker:
          environmentVariables:
          - name: HF_MODEL_ID
            value: /opt/ml/model
          - name: SAGEMAKER_PROGRAM
            value: inference.py
          - name: SAGEMAKER_SUBMIT_DIRECTORY
            value: /opt/ml/model/code
          - name: MODEL_CACHE_ROOT
            value: /opt/ml/model
          - name: SAGEMAKER_ENV
            value: '1'
          image: 763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.3.1-gpu-py311-cu124-ubuntu22.04-v2.0
          modelInvocationPort:
            containerPort: 8080
            name: http
          modelVolumeMount:
            mountPath: /opt/ml/model
            name: model-weights
          resources:
            limits:
              nvidia.com/gpu: 1
              # For MIG-enabled instances, use: nvidia.com/mig-1g.10gb: 1
            requests:
              cpu: 30000m
              memory: 100Gi
              nvidia.com/gpu: 1
              # For MIG-enabled instances, use: nvidia.com/mig-1g.10gb: 1
      EOF
      ```

------
#### [ Using Amazon S3 as the model source ]

   1. Set up a SageMaker endpoint name.

      ```
      export SAGEMAKER_ENDPOINT_NAME="deepseek15b-s3"
      ```

   1. Configure the Amazon S3 bucket location where the model is located.

      ```
      export S3_MODEL_LOCATION="deepseek-qwen-1-5b"
      ```

   1. The following is an example yaml file for creating an endpoint with Amazon S3 and a DeepSeek model.
**Note**  
For clusters with GPU partitioning enabled, replace `nvidia.com/gpu` with the appropriate MIG resource name such as `nvidia.com/mig-1g.10gb`. For more information, see [Task Submission with MIG](sagemaker-hyperpod-eks-gpu-partitioning-task-submission.md).

      ```
      cat <<EOF> deploy_s3_inference.yaml
      ---
      apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1
      kind: InferenceEndpointConfig
      metadata:
        name: $SAGEMAKER_ENDPOINT_NAME
        namespace: $CLUSTER_NAMESPACE
      spec:
        modelName: deepseek15b
        endpointName: $SAGEMAKER_ENDPOINT_NAME
        instanceType: ml.g5.8xlarge
        invocationEndpoint: invocations
        modelSourceConfig:
          modelSourceType: s3
          s3Storage:
            bucketName: $S3_MODEL_LOCATION
            region: $REGION
          modelLocation: deepseek15b
          prefetchEnabled: true
        worker:
          resources:
            limits:
              nvidia.com/gpu: 1
              # For MIG-enabled instances, use: nvidia.com/mig-1g.10gb: 1
            requests:
              nvidia.com/gpu: 1
              # For MIG-enabled instances, use: nvidia.com/mig-1g.10gb: 1
              cpu: 25600m
              memory: 102Gi
          image: 763104351884.dkr.ecr.us-east-2.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu124
          modelInvocationPort:
            containerPort: 8000
            name: http
          modelVolumeMount:
            name: model-weights
            mountPath: /opt/ml/model
          environmentVariables:
            - name: PYTHONHASHSEED
              value: "123"
            - name: OPTION_ROLLING_BATCH
              value: "vllm"
            - name: SERVING_CHUNKED_READ_TIMEOUT
              value: "480"
            - name: DJL_OFFLINE
              value: "true"
            - name: NUM_SHARD
              value: "1"
            - name: SAGEMAKER_PROGRAM
              value: "inference.py"
            - name: SAGEMAKER_SUBMIT_DIRECTORY
              value: "/opt/ml/model/code"
            - name: MODEL_CACHE_ROOT
              value: "/opt/ml/model"
            - name: SAGEMAKER_MODEL_SERVER_WORKERS
              value: "1"
            - name: SAGEMAKER_MODEL_SERVER_TIMEOUT
              value: "3600"
            - name: OPTION_TRUST_REMOTE_CODE
              value: "true"
            - name: OPTION_ENABLE_REASONING
              value: "true"
            - name: OPTION_REASONING_PARSER
              value: "deepseek_r1"
            - name: SAGEMAKER_CONTAINER_LOG_LEVEL
              value: "20"
            - name: SAGEMAKER_ENV
              value: "1"
            - name: MODEL_SERVER_TYPE
              value: "vllm"
            - name: SESSION_KEY
              value: "x-user-id"
      EOF
      ```

------
#### [ Using Amazon S3 as the model source ]

   1. Set up a SageMaker endpoint name.

      ```
      export SAGEMAKER_ENDPOINT_NAME="deepseek15b-s3"
      ```

   1. Configure the Amazon S3 bucket location where the model is located.

      ```
      export S3_MODEL_LOCATION="deepseek-qwen-1-5b"
      ```
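      Optionally, before applying the manifest, confirm that the model artifacts exist under the configured bucket and prefix. The following helper is a minimal sketch; the `check_model_prefix` name is illustrative, and it assumes your AWS CLI credentials can list the bucket.

      ```
      # check_model_prefix BUCKET PREFIX
      # Returns 0 if the prefix contains at least one object, 1 otherwise.
      check_model_prefix() {
        local bucket="$1" prefix="$2"
        if aws s3 ls "s3://${bucket}/${prefix}/" | grep -q .; then
          echo "model artifacts found under s3://${bucket}/${prefix}/"
        else
          echo "no objects under s3://${bucket}/${prefix}/" >&2
          return 1
        fi
      }
      ```

      For example, run `check_model_prefix "$S3_MODEL_LOCATION" <model-prefix>` before applying the deployment YAML.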

   1. The following is an example YAML file for creating an endpoint with Amazon S3 as the model source.

      ```
      cat <<EOF> deploy_s3_inference.yaml
      ---
      apiVersion: inference.sagemaker.aws.amazon.com/v1
      kind: InferenceEndpointConfig
      metadata:
        name: $SAGEMAKER_ENDPOINT_NAME
        namespace: $CLUSTER_NAMESPACE
      spec:
        modelName: Llama-3.1-8B-Instruct
        instanceType: ml.g5.24xlarge
        invocationEndpoint: v1/chat/completions
        replicas: 2
        modelSourceConfig:
          modelSourceType: s3
          s3Storage:
            bucketName: $S3_MODEL_LOCATION
            region: $REGION
          modelLocation: models/Llama-3.1-8B-Instruct
          prefetchEnabled: false
        kvCacheSpec:
          enableL1Cache: true
      #    enableL2Cache: true
      #    l2CacheSpec:
      #      l2CacheBackend: redis/sagemaker
      #      l2CacheLocalUrl: redis://redis.redis-system.svc.cluster.local:6379
        intelligentRoutingSpec:
          enabled: true
        tlsConfig:
          tlsCertificateOutputS3Uri: s3://sagemaker-lmcache-fceb9062-tls-6f6ee470
        metrics:
          enabled: true
          modelMetrics:
            port: 8000
        loadBalancer:
          healthCheckPath: /health
        worker:
          resources:
            limits:
              nvidia.com/gpu: "4"
            requests:
              cpu: "6"
              memory: 30Gi
              nvidia.com/gpu: "4"
          image: lmcache/vllm-openai:latest
          args:
            - "/opt/ml/model"
            - "--max-model-len"
            - "20000"
            - "--tensor-parallel-size"
            - "4"
          modelInvocationPort:
            containerPort: 8000
            name: http
          modelVolumeMount:
            name: model-weights
            mountPath: /opt/ml/model
          environmentVariables:
            - name: PYTHONHASHSEED
              value: "123"
            - name: OPTION_ROLLING_BATCH
              value: "vllm"
            - name: SERVING_CHUNKED_READ_TIMEOUT
              value: "480"
            - name: DJL_OFFLINE
              value: "true"
            - name: NUM_SHARD
              value: "1"
            - name: SAGEMAKER_PROGRAM
              value: "inference.py"
            - name: SAGEMAKER_SUBMIT_DIRECTORY
              value: "/opt/ml/model/code"
            - name: MODEL_CACHE_ROOT
              value: "/opt/ml/model"
            - name: SAGEMAKER_MODEL_SERVER_WORKERS
              value: "1"
            - name: SAGEMAKER_MODEL_SERVER_TIMEOUT
              value: "3600"
            - name: OPTION_TRUST_REMOTE_CODE
              value: "true"
            - name: OPTION_ENABLE_REASONING
              value: "true"
            - name: OPTION_REASONING_PARSER
              value: "deepseek_r1"
            - name: SAGEMAKER_CONTAINER_LOG_LEVEL
              value: "20"
            - name: SAGEMAKER_ENV
              value: "1"
            - name: MODEL_SERVER_TYPE
              value: "vllm"
            - name: SESSION_KEY
              value: "x-user-id"
      EOF
      ```

------

## Configure KV caching and intelligent routing for improved performance
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-cache-route"></a>

1. Enable KV caching by setting `enableL1Cache` and `enableL2Cache` to `true`. Then, set `l2CacheBackend` to `redis` or `tieredstorage`, and if you use `redis`, update `l2CacheLocalUrl` with the Redis cluster URL.

   ```
     kvCacheSpec:
       enableL1Cache: true
       enableL2Cache: true
       l2CacheSpec:
         l2CacheBackend: <redis | tieredstorage>
         l2CacheLocalUrl: <redis cluster URL if l2CacheBackend is redis >
   ```
**Note**  
If the Redis cluster is not within the same Amazon VPC as the HyperPod cluster, encryption of the data in transit is not guaranteed.
**Note**  
The `l2CacheLocalUrl` field is not needed if `tieredstorage` is selected.

1. Enable intelligent routing by setting `enabled` to `true` under `intelligentRoutingSpec`. You can specify which routing strategy to use under `routingStrategy`. If no routing strategy is specified, it defaults to `prefixaware`.

   ```
   intelligentRoutingSpec:
       enabled: true
       routingStrategy: <routing strategy to use>
   ```

1. Enable router metrics and caching metrics by setting `enabled` to `true` under `metrics`. The `port` value needs to be the same as the `containerPort` value under `modelInvocationPort`.

   ```
   metrics:
       enabled: true
       modelMetrics:
         port: <port value>
       ...
       modelInvocationPort:
         containerPort: <port value>
   ```

## Deploy your model from Amazon S3 or Amazon FSx
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-deploy"></a>

1. Get the Amazon EKS cluster name from the HyperPod cluster ARN for kubectl authentication.

   ```
   export EKS_CLUSTER_NAME=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \
     --query 'Orchestrator.Eks.ClusterArn' --output text | \
     cut -d'/' -f2)
   aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION
   ```

1. Deploy your InferenceEndpointConfig model with one of the following options:

------
#### [ Deploy with Amazon FSx as a source ]

   ```
   kubectl apply -f deploy_fsx_luster_inference.yaml
   ```

------
#### [ Deploy with Amazon S3 as a source ]

   ```
   kubectl apply -f deploy_s3_inference.yaml
   ```

------

## Verify the status of your deployment
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-verify"></a>

1. Check if the model successfully deployed.

   ```
   kubectl describe InferenceEndpointConfig $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Check that the endpoint is successfully created.

   ```
   kubectl describe SageMakerEndpointRegistration $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```
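   Endpoint creation can take several minutes. The following polling helper is a sketch (the `wait_for_endpoint` name, 30-second interval, and 40-attempt default are illustrative); it uses the standard `DescribeEndpoint` API to block until the endpoint reports `InService`:

   ```
   # wait_for_endpoint NAME REGION [MAX_ATTEMPTS]
   # Polls DescribeEndpoint until the endpoint is InService; stops early on Failed.
   wait_for_endpoint() {
     local name="$1" region="$2" max="${3:-40}" status i
     for ((i = 1; i <= max; i++)); do
       status=$(aws sagemaker describe-endpoint \
         --endpoint-name "$name" --region "$region" \
         --query 'EndpointStatus' --output text 2>/dev/null) || status="Pending"
       echo "attempt ${i}: ${status}"
       case "$status" in
         InService) return 0 ;;
         Failed)    return 1 ;;
       esac
       sleep 30
     done
     echo "timed out waiting for endpoint ${name}" >&2
     return 1
   }
   ```

   For example, run `wait_for_endpoint "$SAGEMAKER_ENDPOINT_NAME" "$REGION"` before invoking the endpoint in the next step.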

1. Test the deployed endpoint to verify it's working correctly. This step confirms that your model is successfully deployed and can process inference requests.

   ```
   aws sagemaker-runtime invoke-endpoint \
     --endpoint-name $SAGEMAKER_ENDPOINT_NAME \
     --content-type "application/json" \
     --body '{"inputs": "What is AWS SageMaker?"}' \
     --region $REGION \
     --cli-binary-format raw-in-base64-out \
     /dev/stdout
   ```

## Manage your deployment
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-manage"></a>

When you're finished testing your deployment, use the following commands to clean up your resources.

**Note**  
Verify that you no longer need the deployed model or stored data before proceeding.

**Clean up your resources**

1. Delete the inference deployment and associated Kubernetes resources. This stops the running model containers and removes the SageMaker endpoint.

   ```
   kubectl delete inferenceendpointconfig $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Verify the cleanup was done successfully.

   ```
   # Check that Kubernetes resources are removed
   kubectl get pods,svc,deployment,InferenceEndpointConfig,sagemakerendpointregistration -n $CLUSTER_NAMESPACE
   ```

   ```
   # Verify SageMaker endpoint is deleted (should return error or empty)
   aws sagemaker describe-endpoint --endpoint-name $SAGEMAKER_ENDPOINT_NAME --region $REGION
   ```
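   Because `describe-endpoint` returns an error once the endpoint is gone, you can wrap the check in a small helper (the `endpoint_deleted` name is illustrative) that succeeds only after deletion completes:

   ```
   # endpoint_deleted NAME REGION
   # Succeeds (returns 0) only when DescribeEndpoint no longer finds the endpoint.
   endpoint_deleted() {
     ! aws sagemaker describe-endpoint \
         --endpoint-name "$1" --region "$2" >/dev/null 2>&1
   }
   ```

   For example, `endpoint_deleted "$SAGEMAKER_ENDPOINT_NAME" "$REGION" && echo "cleanup complete"`.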

**Troubleshooting**

Use these debugging commands if your deployment isn't working as expected.

1. Check the Kubernetes deployment status.

   ```
   kubectl describe deployment $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Check the InferenceEndpointConfig status to see the high-level deployment state and any configuration issues.

   ```
   kubectl describe InferenceEndpointConfig $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Check the status of all Kubernetes objects to get a comprehensive view of all related resources in your namespace. This gives you a quick overview of what's running and what might be missing.

   ```
   kubectl get pods,svc,deployment,InferenceEndpointConfig,sagemakerendpointregistration -n $CLUSTER_NAMESPACE
   ```

# Autoscaling policies for your HyperPod inference model deployment
<a name="sagemaker-hyperpod-model-deployment-autoscaling"></a>

The following information provides practical examples and configurations for implementing autoscaling policies on Amazon SageMaker HyperPod inference model deployments. 

You'll learn how to configure automatic scaling using the built-in `autoScalingSpec` in your deployment YAML files, as well as how to create standalone KEDA `ScaledObject` configurations for advanced scaling scenarios. The examples cover scaling triggers based on CloudWatch metrics, Amazon SQS queue lengths, Prometheus queries, and resource utilization metrics like CPU and memory. 

## Using autoScalingSpec in deployment YAML
<a name="sagemaker-hyperpod-model-deployment-autoscaling-yaml"></a>

The Amazon SageMaker HyperPod inference operator provides built-in autoscaling capabilities for model deployments using metrics from CloudWatch and Amazon Managed Prometheus (AMP). The following deployment YAML example includes an `autoScalingSpec` section that defines the configuration values for scaling your model deployment.

```
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: JumpStartModel
metadata:
  name: deepseek-sample624
  namespace: ns-team-a
spec:
  sageMakerEndpoint:
    name: deepsek7bsme624
  model:
    modelHubName: SageMakerPublicHub
    modelId: deepseek-llm-r1-distill-qwen-1-5b
    modelVersion: 2.0.4
  server:
    instanceType: ml.g5.8xlarge
  metrics:
    enabled: true
  environmentVariables:
    - name: SAMPLE_ENV_VAR
      value: "sample_value"
  maxDeployTimeInSeconds: 1800
  tlsConfig:
    tlsCertificateOutputS3Uri: "s3://{USER}-tls-bucket-{REGION}/certificates"
  autoScalingSpec:
    minReplicaCount: 0
    maxReplicaCount: 5
    pollingInterval: 15
    initialCooldownPeriod: 60
    cooldownPeriod: 120
    scaleDownStabilizationTime: 60
    scaleUpStabilizationTime: 0
    cloudWatchTrigger:
        name: "SageMaker-Invocations"
        namespace: "AWS/SageMaker"
        useCachedMetrics: false
        metricName: "Invocations"
        targetValue: 10.5
        activationTargetValue: 5.0
        minValue: 0.0
        metricCollectionStartTime: 300
        metricCollectionPeriod: 30
        metricStat: "Sum"
        metricType: "Average"
        dimensions:
          - name: "EndpointName"
            value: "deepsek7bsme624"
          - name: "VariantName"
            value: "AllTraffic"
    prometheusTrigger: 
        name: "Prometheus-Trigger"
        useCachedMetrics: false
        serverAddress: http://<prometheus-host>:9090
        query: sum(rate(http_requests_total{deployment="my-deployment"}[2m]))
        targetValue: 10.0
        activationTargetValue: 5.0
        namespace: "namespace"
        customHeaders: "X-Client-Id=cid"
        metricType: "Value"
```

### Explanation of fields used in deployment YAML
<a name="sagemaker-hyperpod-model-deployment-autoscaling-fields"></a>

`minReplicaCount` (Optional, Integer)  
Specifies the minimum number of model deployment replicas to maintain in the cluster. During scale-down events, the deployment scales down to this minimum number of pods. Must be greater than or equal to 0. Default: 1.

`maxReplicaCount` (Optional, Integer)  
Specifies the maximum number of model deployment replicas to maintain in the cluster. Must be greater than or equal to `minReplicaCount`. During scale-up events, the deployment scales up to this maximum number of pods. Default: 5.

`pollingInterval` (Optional, Integer)  
The time interval in seconds for querying metrics. Minimum: 0. Default: 30 seconds.

`cooldownPeriod` (Optional, Integer)  
The time interval in seconds to wait before scaling down from 1 to 0 pods during a scale-down event. Only applies when `minReplicaCount` is set to 0. Minimum: 0. Default: 300 seconds.

`initialCooldownPeriod` (Optional, Integer)  
The time interval in seconds to wait before scaling down from 1 to 0 pods during initial deployment. Only applies when `minReplicaCount` is set to 0. Minimum: 0. Default: 300 seconds.

`scaleDownStabilizationTime` (Optional, Integer)  
The stabilization time window in seconds after a scale-down trigger activates before scaling down occurs. Minimum: 0. Default: 300 seconds.

`scaleUpStabilizationTime` (Optional, Integer)  
The stabilization time window in seconds after a scale-up trigger activates before scaling up occurs. Minimum: 0. Default: 0 seconds.

`cloudWatchTrigger`  
The trigger configuration for CloudWatch metrics used in autoscaling decisions. The following fields are available in `cloudWatchTrigger`:  
+ `name` (Optional, String) - Name for the CloudWatch trigger. If not provided, uses the default format: <model-deployment-name>-scaled-object-cloudwatch-trigger.
+ `useCachedMetrics` (Optional, Boolean) - Determines whether to cache metrics queried by KEDA. KEDA queries metrics using the pollingInterval, while the Horizontal Pod Autoscaler (HPA) requests metrics from KEDA every 15 seconds. When set to true, queried metrics are cached and used to serve HPA requests. Default: true.
+ `namespace` (Required, String) - The CloudWatch namespace for the metric to query.
+ `metricName` (Required, String) - The name of the CloudWatch metric.
+ `dimensions` (Optional, List) - The list of dimensions for the metric. Each dimension includes a name (dimension name - String) and value (dimension value - String).
+ `targetValue` (Required, Float) - The target value for the CloudWatch metric used in autoscaling decisions.
+ `activationTargetValue` (Optional, Float) - The target value for the CloudWatch metric used when scaling from 0 to 1 pod. Only applies when `minReplicaCount` is set to 0. Default: 0.
+ `minValue` (Optional, Float) - The value to use when the CloudWatch query returns no data. Default: 0.
+ `metricCollectionStartTime` (Optional, Integer) - The start time for the metric query, calculated as T-metricCollectionStartTime. Must be greater than or equal to metricCollectionPeriod. Default: 300 seconds.
+ `metricCollectionPeriod` (Optional, Integer) - The duration for the metric query in seconds. Must be a CloudWatch-supported value (1, 5, 10, 30, or a multiple of 60). Default: 300 seconds.
+ `metricStat` (Optional, String) - The statistic type for the CloudWatch query. Default: `Average`.
+ `metricType` (Optional, String) - Defines how the metric is used for scaling calculations. Default: `Average`. Allowed values: `Average`, `Value`.
  + **Average**: Desired replicas = ceil(metric value / targetValue)
  + **Value**: Desired replicas = ceil(current replicas × metric value / targetValue)

`prometheusTrigger`  
The trigger configuration for Amazon Managed Prometheus (AMP) metrics used in autoscaling decisions. The following fields are available in `prometheusTrigger`:  
+ `name` (Optional, String) - Name for the Prometheus trigger. If not provided, uses the default format: <model-deployment-name>-scaled-object-prometheus-trigger.
+ `useCachedMetrics` (Optional, Boolean) - Determines whether to cache metrics queried by KEDA. KEDA queries metrics using the pollingInterval, while the Horizontal Pod Autoscaler (HPA) requests metrics from KEDA every 15 seconds. When set to true, queried metrics are cached and used to serve HPA requests. Default: true.
+ `serverAddress` (Required, String) - The address of the AMP server. Must use the format: `https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace-id>`
+ `query` (Required, String) - The PromQL query used for the metric. Must return a scalar value.
+ `targetValue` (Required, Float) - The target value for the Prometheus metric used in autoscaling decisions.
+ `activationTargetValue` (Optional, Float) - The target value for the Prometheus metric used when scaling from 0 to 1 pod. Only applies when `minReplicaCount` is set to 0. Default: 0.
+ `namespace` (Optional, String) - The namespace to use for namespaced queries. Default: empty string (`""`).
+ `customHeaders` (Optional, String) - Custom headers to include when querying the Prometheus endpoint. Default: empty string ("").
+ `metricType` (Optional, String) - Defines how the metric is used for scaling calculations. Default: `Average`. Allowed values: `Average`, `Value`.
  + **Average**: Desired replicas = ceil(metric value / targetValue)
  + **Value**: Desired replicas = ceil(current replicas × metric value / targetValue)
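The `metricType` formulas above can be checked numerically. The following snippet is only an illustration (metric value 42, `targetValue` 10.5, and 2 current replicas are made-up numbers):

```
# Desired replica count under each metricType, per the formulas above.
replicas=$(awk -v metric=42 -v target=10.5 -v current=2 'BEGIN {
  avg = metric / target            # Average: ceil(metric / targetValue)
  val = current * metric / target  # Value:   ceil(current * metric / targetValue)
  avg = (avg == int(avg)) ? avg : int(avg) + 1   # ceil for positive values
  val = (val == int(val)) ? val : int(val) + 1
  printf "Average=%d Value=%d", avg, val
}')
echo "$replicas"   # Average=4 Value=8
```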

## Using KEDA ScaledObject yaml definitions through kubectl
<a name="sagemaker-hyperpod-model-deployment-autoscaling-kubectl"></a>

In addition to configuring autoscaling through the autoScalingSpec section in your deployment YAML, you can create and apply standalone KEDA `ScaledObject` YAML definitions using kubectl.

This approach provides greater flexibility for complex scaling scenarios and allows you to manage autoscaling policies independently from your model deployments. KEDA `ScaledObject` configurations support a [wide range of scaling triggers](https://keda.sh/docs/2.17/scalers/) including CloudWatch metrics, Amazon SQS queue lengths, Prometheus queries, and resource-based metrics like CPU and memory utilization. You can apply these configurations to existing model deployments by referencing the deployment name in the scaleTargetRef section of the ScaledObject specification.

**Note**  
Ensure that the KEDA operator role provided during the HyperPod inference operator installation has adequate permissions to query the metrics defined in the `ScaledObject` triggers.

### CloudWatch metrics
<a name="sagemaker-hyperpod-model-deployment-autoscaling-kubectl-cw"></a>

The following KEDA YAML policy uses CloudWatch metrics as a trigger to perform autoscaling on a Kubernetes deployment. The policy queries the number of invocations for a SageMaker endpoint and scales the number of deployment pods. The complete list of parameters supported by KEDA for the `aws-cloudwatch` trigger can be found at [https://keda.sh/docs/2.17/scalers/aws-cloudwatch/](https://keda.sh/docs/2.17/scalers/aws-cloudwatch/).

```
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 1 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  triggers:
  - type: aws-cloudwatch
    metadata:
      namespace: AWS/SageMaker
      metricName: Invocations
      targetMetricValue: "1"
      minMetricValue: "1"
      awsRegion: "us-west-2"
      dimensionName: EndpointName;VariantName
      dimensionValue: $ENDPOINT_NAME;$VARIANT_NAME
      metricStatPeriod: "30" # seconds
      metricStat: "Sum"
      identityOwner: operator
```

### Amazon SQS metrics
<a name="sagemaker-hyperpod-model-deployment-autoscaling-kubectl-sqs"></a>

The following KEDA YAML policy uses the Amazon SQS queue length as a trigger to perform autoscaling on a Kubernetes deployment. The policy queries the number of messages in an SQS queue and scales the number of deployment pods. The complete list of parameters supported by KEDA for the `aws-sqs-queue` trigger can be found at [https://keda.sh/docs/2.17/scalers/aws-sqs/](https://keda.sh/docs/2.17/scalers/aws-sqs/).

```
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 1 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.eu-west-1.amazonaws.com/account_id/QueueName
      queueLength: "5"  # Default: "5"
      awsRegion: "eu-west-1"
      scaleOnInFlight: true
      identityOwner: operator
```

### Prometheus metrics
<a name="sagemaker-hyperpod-model-deployment-autoscaling-kubectl-prometheus"></a>

The following KEDA YAML policy uses Prometheus metrics as a trigger to perform autoscaling on a Kubernetes deployment. The policy queries the request rate for a deployment and scales the number of deployment pods. The complete list of parameters supported by KEDA for the `prometheus` trigger can be found at [https://keda.sh/docs/2.17/scalers/prometheus/](https://keda.sh/docs/2.17/scalers/prometheus/).

```
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 1 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://<prometheus-host>:9090
      query: avg(rate(http_requests_total{deployment="$DEPLOYMENT_NAME"}[2m])) # Note: query must return a vector/scalar single element response
      threshold: '100.50'
      namespace: example-namespace  # for namespaced queries, eg. Thanos
      customHeaders: X-Client-Id=cid,X-Tenant-Id=tid,X-Organization-Id=oid # Optional. Custom headers to include in query. In case of auth header, use the custom authentication or relevant authModes.
      unsafeSsl: "false" #  Default is `false`, Used for skipping certificate check when having self-signed certs for Prometheus endpoint    
      timeout: 1000 # Custom timeout for the HTTP client used in this scaler
      identityOwner: operator
```

### CPU metrics
<a name="sagemaker-hyperpod-model-deployment-autoscaling-kubectl-cpu"></a>

The following KEDA YAML policy uses the CPU utilization metric as a trigger to perform autoscaling on a Kubernetes deployment. The policy scales the number of deployment pods based on average CPU utilization across the pods. The complete list of parameters supported by KEDA for the `cpu` trigger can be found at [https://keda.sh/docs/2.17/scalers/cpu/](https://keda.sh/docs/2.17/scalers/cpu/).

```
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 1 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  triggers:
  - type: cpu
    metricType: Utilization # Allowed types are 'Utilization' or 'AverageValue'
    metadata:
        value: "60"
        containerName: "" # Optional. You can use this to target a specific container
```

### Memory metrics
<a name="sagemaker-hyperpod-model-deployment-autoscaling-kubectl-memory"></a>

The following KEDA YAML policy uses the memory utilization metric as a trigger to perform autoscaling on a Kubernetes deployment. The policy scales the number of deployment pods based on average memory utilization across the pods. The complete list of parameters supported by KEDA for the `memory` trigger can be found at [https://keda.sh/docs/2.17/scalers/memory/](https://keda.sh/docs/2.17/scalers/memory/).

```
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 1 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  triggers:
  - type: memory
    metricType: Utilization # Allowed types are 'Utilization' or 'AverageValue'
    metadata:
        value: "60"
        containerName: "" # Optional. You can use this to target a specific container in a pod
```

## Sample Prometheus policy for scaling down to 0 pods
<a name="sagemaker-hyperpod-model-deployment-autoscaling-kubectl-sample"></a>

The following KEDA YAML policy uses a Prometheus metrics query as a trigger to perform autoscaling on a Kubernetes deployment. This policy sets `minReplicaCount` to 0, which enables KEDA to scale the deployment down to 0 pods. When `minReplicaCount` is set to 0, you must provide an activation criterion to bring up the first pod after the pods scale down to 0. For the Prometheus trigger, this value is provided by `activationThreshold`. For the SQS queue, it comes from `activationQueueLength`.

**Note**  
While using `minReplicaCount` of 0, make sure the activation does not depend on a metric that is being generated by the pods. When the pods scale down to 0, that metric will never be generated and the pods will not scale up again.

```
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 0 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  cooldownPeriod:  30
  initialCooldownPeriod:  180 # time before scaling down the pods after initial deployment
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://<prometheus-host>:9090
      query: sum(rate(http_requests_total{deployment="my-deployment"}[2m])) # Note: query must return a vector/scalar single element response
      threshold: '100.50'
      activationThreshold: '5.5' # Required if minReplicaCount is 0 for initial scaling
      namespace: example-namespace
      timeout: 1000
      identityOwner: operator
```
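The activation check when scaling from 0 is strictly greater-than: with the policy above, the first pod comes up only when the query result exceeds `activationThreshold` (5.5). The following `decide` helper is a hypothetical illustration of that rule, not part of KEDA:

```
# decide METRIC_VALUE -> "activate" or "stay at 0", per activationThreshold 5.5
decide() {
  awk -v m="$1" -v act=5.5 'BEGIN { print (m > act) ? "activate" : "stay at 0" }'
}
decide 5.5   # stay at 0 (the metric must strictly exceed the threshold)
decide 6     # activate
```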

**Note**  
The CPU and memory triggers can scale to 0 only when you define at least one additional scaler that is not CPU or memory (for example, SQS and CPU, or Prometheus and CPU). 

# Implementing inference observability on HyperPod clusters
<a name="sagemaker-hyperpod-model-deployment-observability"></a>

Amazon SageMaker HyperPod provides comprehensive inference observability capabilities that enable data scientists and machine learning engineers to monitor and optimize their deployed models. This solution is enabled through SageMaker HyperPod Observability and automatically collects performance metrics for inference workloads, delivering production-ready monitoring through integrated [Prometheus](https://prometheus.io/) and [Grafana](https://grafana.com/oss/) dashboards.

With metrics enabled by default, the platform captures essential model performance data including invocation latency, concurrent requests, error rates, and token-level metrics, while providing standard Prometheus endpoints for customers who prefer to implement custom observability solutions.

**Note**  
This topic contains a deep dive into implementing inference observability on HyperPod clusters. For a more general reference, see [Cluster and task observability](sagemaker-hyperpod-eks-cluster-observability-cluster.md).

This guide provides step-by-step instructions for implementing and using inference observability on your HyperPod clusters. You'll learn how to configure metrics in your deployment YAML files, access monitoring dashboards based on your role (administrator, data scientist, or machine learning engineer), integrate with custom observability solutions using Prometheus endpoints, and troubleshoot common monitoring issues.

## Supported inference metrics
<a name="sagemaker-hyperpod-model-deployment-observability-metrics"></a>

**Invocation metrics**

These metrics capture model inference request and response data, providing universal visibility regardless of your model type or serving framework. When inference metrics are enabled, these metrics are calculated at invocation time and exported to your monitoring infrastructure.
+ `model_invocations_total` - Total number of invocation requests to the model 
+ `model_errors_total` - Total number of errors during model invocation
+ `model_concurrent_requests` - Active concurrent model requests
+ `model_latency_milliseconds` - Model invocation latency in milliseconds
+ `model_ttfb_milliseconds` - Model time to first byte latency in milliseconds

**Model container metrics**

These metrics provide insights into the internal operations of your model containers, including token processing, queue management, and framework-specific performance indicators. The metrics available depend on your model serving framework:
+ [TGI container metrics](https://huggingface.co/docs/text-generation-inference/en/reference/metrics) 
+ [LMI container metrics](https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md) 

**Metric dimensions**

All inference metrics include comprehensive labels that enable detailed filtering and analysis across your deployments:
+ **Cluster Identity:**
  + `cluster_id` - The unique ID of the HyperPod cluster
  + `cluster_name` - The name of the HyperPod cluster
+ **Resource Identity:**
  + `resource_name` - Deployment name (For example, "jumpstart-model-deployment")
  + `resource_type` - Type of deployment (jumpstart, inference-endpoint)
  + `namespace` - Kubernetes namespace for multi-tenancy
+ **Model Characteristics:**
  + `model_name` - Specific model identifier (For example, "llama-2-7b-chat")
  + `model_version` - Model version for A/B testing and rollbacks
  + `model_container_type` - Serving framework (TGI, LMI, -)
+ **Infrastructure Context:**
  + `pod_name` - Individual pod identifier for debugging
  + `node_name` - Kubernetes node for resource correlation
  + `instance_type` - EC2 instance type for cost analysis
+ **Operational Context:**
  + `metric_source` - Collection point (reverse-proxy, model-container)
  + `task_type` - Workload classification (inference)
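Together, these metric names and labels support standard PromQL queries. The following are illustrative sketches (label values are examples; the `_bucket` suffix in the last query assumes the latency metric is exported as a histogram):

```
# Per-model invocation rate in one namespace
sum by (model_name) (rate(model_invocations_total{namespace="ns-team-a"}[5m]))

# Error ratio for a single deployment
sum(rate(model_errors_total{resource_name="jumpstart-model-deployment"}[5m]))
  / sum(rate(model_invocations_total{resource_name="jumpstart-model-deployment"}[5m]))

# p90 invocation latency by instance type
histogram_quantile(0.9,
  sum by (le, instance_type) (rate(model_latency_milliseconds_bucket[5m])))
```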

## Configure metrics in deployment YAML
<a name="sagemaker-hyperpod-model-deployment-observability-yaml"></a>

Amazon SageMaker HyperPod enables inference metrics by default for all model deployments, providing immediate observability without additional configuration. You can customize metrics behavior by modifying the deployment YAML configuration to enable or disable metrics collection based on your specific requirements.

**Deploy a model from JumpStart**

Use the following YAML configuration to deploy a JumpStart model with metrics enabled:

```
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: JumpStartModel
metadata:
  name: mistral-model
  namespace: ns-team-a
spec:
  model:
    modelId: "huggingface-llm-mistral-7b-instruct"
    modelVersion: "3.19.0"
  metrics:
    enabled: true # Default: true (can be set to false to disable)
  replicas: 2
  sageMakerEndpoint:
    name: "mistral-model-sm-endpoint"
  server:
    instanceType: "ml.g5.12xlarge"
    executionRole: "arn:aws:iam::123456789:role/SagemakerRole"
  tlsConfig:
    tlsCertificateOutputS3Uri: s3://hyperpod/mistral-model/certs/
```

**Deploy custom and fine-tuned models from Amazon S3 or Amazon FSx**

Configure custom inference endpoints with detailed metrics settings using the following YAML:

```
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
metadata:
  name: inferenceendpoint-deepseeks
  namespace: ns-team-a
spec:
  modelName: deepseeks
  modelVersion: 1.0.1
  metrics:
    enabled: true # Default: true (can be set to false to disable)
    metricsScrapeIntervalSeconds: 30 # Optional: if overriding the default 15s
    modelMetricsConfig:
        port: 8000 # Optional: defaults to the WorkerConfig.ModelInvocationPort.ContainerPort in this spec (8080)
        path: "/custom-metrics" # Optional: if overriding the default "/metrics"
  endpointName: deepseek-sm-endpoint
  instanceType: ml.g5.12xlarge
  modelSourceConfig:
    modelSourceType: s3
    s3Storage:
      bucketName: model-weights
      region: us-west-2
    modelLocation: deepseek
    prefetchEnabled: true
  invocationEndpoint: invocations
  worker:
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1
        cpu: 25600m
        memory: 102Gi
    image: 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu124
    modelInvocationPort:
      containerPort: 8080
      name: http
    modelVolumeMount:
      name: model-weights
      mountPath: /opt/ml/model
    environmentVariables: ...
  tlsConfig:
    tlsCertificateOutputS3Uri: s3://hyperpod/inferenceendpoint-deepseeks4/certs/
```

**Note**  
To disable metrics for specific deployments, set `metrics.enabled: false` in your YAML configuration.

## Monitor and troubleshoot inference workloads by role
<a name="sagemaker-hyperpod-model-deployment-observability-role"></a>

Amazon SageMaker HyperPod provides comprehensive observability capabilities that support different user workflows, from initial cluster setup to advanced performance troubleshooting. Use the following guidance based on your role and monitoring requirements.

**HyperPod admin**

**Your responsibility:** Enable observability infrastructure and ensure system health across the entire cluster.

**What you need to know:**
+ Cluster-wide observability provides infrastructure metrics for all workloads
+ One-click setup deploys monitoring stack with pre-configured dashboards
+ Infrastructure metrics are separate from model-specific inference metrics

**What you need to do:**

1. Navigate to the HyperPod console.

1. Select your cluster.

1. On the cluster details page, find the option to install the HyperPod observability add-on.

1. Choose **Quick install**. After 1-2 minutes, the installation steps complete and the Grafana dashboard and Prometheus workspace details appear.

This single action automatically deploys the EKS Add-on, configures observability operators, and provisions pre-built dashboards in Grafana.

**Data scientist**

**Your responsibility:** Deploy models efficiently and monitor their basic performance.

**What you need to know:**
+ Metrics are automatically enabled when you deploy models
+ Grafana dashboards provide immediate visibility into model performance
+ You can filter dashboards to focus on your specific deployments

**What you need to do:**

1. Deploy your model using your preferred method:

   1. Amazon SageMaker Studio UI

   1. HyperPod CLI commands

   1. Python SDK in notebooks

   1. kubectl with YAML configurations

1. Access your model metrics:

   1. Open Amazon SageMaker Studio

   1. Navigate to HyperPod Cluster and open Grafana Dashboard

   1. Select Inference Dashboard

   1. Apply filters to view your specific model deployment

1. Monitor key performance indicators:

   1. Track model latency and throughput

   1. Monitor error rates and availability

   1. Review resource utilization trends

After this is complete, you'll have immediate visibility into your model's performance without additional configuration, enabling quick identification of deployment issues or performance changes.

**Machine learning engineer (MLE)**

**Your responsibility:** Maintain production model performance and resolve complex performance issues.

**What you need to know:**
+ Advanced metrics include model container details like queue depths and token metrics
+ Correlation analysis across multiple metric types reveals root causes
+ Auto-scaling configurations directly impact performance during traffic spikes

**Hypothetical scenario:** A customer's chat model experiences intermittent slow responses. Users are complaining about 5-10 second delays. The MLE can leverage inference observability for systematic performance investigation.

**What you need to do:**

1. Examine the Grafana dashboard to understand the scope and severity of the performance issue:

   1. High latency alert active since 09:30

   1. P99 latency: 8.2s (normal: 2.1s)

   1. Affected time window: 09:30-10:15 (45 minutes)

1. Correlate multiple metrics to understand the system behavior during the incident:

   1. Concurrent requests: Spiked to 45 (normal: 15-20)

   1. Pod scaling: KEDA scaled 2→5 pods during incident

   1. GPU utilization: Remained normal (85-90%)

   1. Memory usage: Normal (24GB/32GB)

1. Examine the distributed system behavior since the infrastructure metrics appear normal:

   1. Node-level view: All pods concentrated on same node (poor distribution)

   1. Model container metrics: TGI queue depth shows 127 requests (normal: 5-10)

   ```
   Available in Grafana dashboard under "Model Container Metrics" panel
   Metric: tgi_queue_size{resource_name="customer-chat-llama"}
   Current value: 127 requests queued (indicates backlog)
   ```

1. Identify interconnected configuration issues:

   1. KEDA scaling policy: Too slow (30s polling interval)

   1. Scaling timeline: Scaling response lagged behind the traffic spike by 45 seconds

1. Implement targeted fixes based on the analysis:

   1. Updated KEDA polling interval: 30s → 15s

   1. Increased maxReplicas in scaling configuration

   1. Adjusted scaling thresholds to scale earlier (15 vs 20 concurrent requests)

You can systematically diagnose complex performance issues using comprehensive metrics, implement targeted fixes, and establish preventive measures to maintain consistent production model performance.

## Implement your own observability integration
<a name="sagemaker-hyperpod-model-deployment-observability-diy"></a>

Amazon SageMaker HyperPod exposes inference metrics through industry-standard Prometheus endpoints, enabling integration with your existing observability infrastructure. Use this approach when you prefer to implement custom monitoring solutions or integrate with third-party observability platforms instead of using the built-in Grafana and Prometheus stack.

**Access inference metrics endpoints**

**What you need to know:**
+ Inference metrics are automatically exposed on standardized Prometheus endpoints
+ Metrics are available regardless of your model type or serving framework
+ Standard Prometheus scraping practices apply for data collection

**Inference metrics endpoint configuration:**
+ **Port:** 9113
+ **Path:** /metrics
+ **Full endpoint:** http://pod-ip:9113/metrics

**Available inference metrics:**
+ `model_invocations_total` - Total number of invocation requests to the model
+ `model_errors_total` - Total number of errors during model invocation
+ `model_concurrent_requests` - Active concurrent requests per model
+ `model_latency_milliseconds` - Model invocation latency in milliseconds
+ `model_ttfb_milliseconds` - Model time to first byte latency in milliseconds
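As an illustration, the following self-contained Python sketch parses a small sample of Prometheus text exposition output, similar to what the `/metrics` endpoint returns (the sample values here are fabricated):

```python
import re

def parse_prometheus_text(text):
    """Parse Prometheus text exposition format into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = re.match(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$', line)
        if not m:
            continue
        name, raw_labels, value = m.groups()
        labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples

# Fabricated sample of what http://pod-ip:9113/metrics might return
sample = """
# HELP model_invocations_total Total number of invocation requests to the model
# TYPE model_invocations_total counter
model_invocations_total{model_name="llama-2-7b-chat",namespace="ns-team-a"} 1523
model_latency_milliseconds{model_name="llama-2-7b-chat",quantile="0.99"} 2100
"""

for name, labels, value in parse_prometheus_text(sample):
    print(name, labels, value)
```

In practice a Prometheus server handles this scraping for you; a hand-rolled parser like this is only useful for quick spot checks or custom exporters.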

**Access model container metrics**

**What you need to know:**
+ Model containers expose additional metrics specific to their serving framework
+ These metrics provide internal container insights like token processing and queue depths
+ Endpoint configuration varies by model container type

**For JumpStart model deployments using Text Generation Inference (TGI) containers:**
+ **Port:** 8080 (model container port)
+ **Path:** /metrics
+ **Documentation:** [https://huggingface.co/docs/text-generation-inference/en/reference/metrics](https://huggingface.co/docs/text-generation-inference/en/reference/metrics)

**For JumpStart model deployments using Large Model Inference (LMI) containers:**
+ **Port:** 8080 (model container port)
+ **Path:** /server/metrics
+ **Documentation:** [https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md](https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md)

**For custom inference endpoints (BYOD):**
+ **Port:** Customer-configured (defaults to the `WorkerConfig.ModelInvocationPort.ContainerPort` value in the `InferenceEndpointConfig` spec, typically 8080)
+ **Path:** Customer-configured (default /metrics)

**Implement custom observability integration**

With a custom observability integration, you're responsible for:

1. **Metrics Scraping:** Implement Prometheus-compatible scraping from the endpoints above

1. **Data Export:** Configure export to your chosen observability platform

1. **Alerting:** Set up alerting rules based on your operational requirements

1. **Dashboards:** Create visualization dashboards for your monitoring needs
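As one possible starting point for the scraping step, a Prometheus scrape job for the inference metrics endpoint might look like the following sketch (the job name and the pod-discovery relabeling are assumptions; adapt them to how your pods are labeled):

```
scrape_configs:
  - job_name: hyperpod-inference-metrics   # assumed job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only containers exposing the inference metrics port (9113)
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "9113"
        action: keep
    metrics_path: /metrics
```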

## Troubleshoot inference observability issues
<a name="sagemaker-hyperpod-model-deployment-observability-troubleshoot"></a>

**The dashboard shows no data**

If the Grafana dashboard is empty and all panels show "No data," perform the following steps to investigate:

1. Verify Administrator has inference observability installed:

   1. Navigate to HyperPod Console > Select cluster > Check if "Observability" status shows "Enabled"

   1. Verify Grafana workspace link is accessible from cluster overview

   1. Confirm Amazon Managed Prometheus workspace is configured and receiving data

1. Verify HyperPod Observability is enabled:

   ```
   hyp observability view      
   ```

1. Verify model metrics are enabled:

   ```
   kubectl get jumpstartmodel -n <namespace> customer-chat-llama -o jsonpath='{.status.metricsStatus}'
   # Expected: enabled: true, state: Enabled
   ```

1. Check the metrics endpoint:

   ```
   kubectl port-forward pod/customer-chat-llama-xxx 9113:9113
   curl localhost:9113/metrics | grep model_invocations_total
   # Expected: model_invocations_total{...} metrics
   ```

1. Check the logs:

   ```
   # Model container
   kubectl logs customer-chat-llama-xxx -c customer-chat-llama
   # Look for: OOM errors, CUDA errors, model loading failures
   
   # Proxy/sidecar
   kubectl logs customer-chat-llama-xxx -c sidecar-reverse-proxy
   # Look for: DNS resolution issues, upstream connection failures
   
   # Metrics exporter sidecar
   kubectl logs customer-chat-llama-xxx -c otel-collector
   # Look for: metrics collection issues, export failures
   ```

**Other common issues**


| Issue | Solution | Action | 
| --- | --- | --- | 
|  Inference observability is not installed  |  Install inference observability through the console  |  "Enable Observability" in HyperPod console  | 
|  Metrics disabled in model  |  Update model configuration  |  Add `metrics: {enabled: true}` to model spec  | 
|  AMP workspace not configured  |  Fix data source connection  |  Verify AMP workspace ID in Grafana data sources  | 
|  Network connectivity  |  Check security groups/NACLs  |  Ensure pods can reach AMP endpoints  | 

# Task governance for model deployment on HyperPod
<a name="sagemaker-hyperpod-model-deployment-task-gov"></a>

This section covers how to optimize your shared Amazon SageMaker HyperPod EKS clusters for real-time inference workloads. You'll learn to configure Kueue's task governance features—including quota management, priority scheduling, and resource sharing policies—to ensure your inference workloads get the GPU resources they need during traffic spikes while maintaining fair allocation across your teams' training, evaluation, and testing activities. For more general information on task governance, see [SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance.md).

## How inference workload management works
<a name="sagemaker-hyperpod-model-deployment-task-gov-how"></a>

To effectively manage real-time inference traffic spikes in shared HyperPod EKS clusters, implement the following task governance strategies using Kueue's existing capabilities.

**Priority class configuration**

Define dedicated priority classes for inference workloads with high weights (such as 100) to ensure inference pods are admitted and scheduled before other task types. This configuration enables inference workloads to preempt lower-priority jobs during cluster load, which is critical for maintaining low-latency requirements during traffic surges.
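For example, a Kubernetes PriorityClass along these lines gives inference pods the high weight described above (the name and description are illustrative):

```
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-priority   # illustrative name
value: 100                   # high weight so inference is admitted and scheduled first
globalDefault: false
description: "Priority class for real-time inference workloads"
```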

**Quota sizing and allocation**

Reserve sufficient GPU resources in your team's `ClusterQueue` to handle expected inference spikes. During periods of low inference traffic, unused quota resources can be temporarily allocated to other teams' tasks. When inference demand increases, these borrowed resources can be reclaimed to prioritize pending inference pods. For more information, see [Cluster Queue](https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/).

**Resource sharing strategies**

Choose between two quota sharing approaches based on your requirements:

1. **Strict Resource Control:** Disable quota lending and borrowing to guarantee reserved GPU capacity is always available for your workloads. This approach requires sizing quotas large enough to independently handle peak demand and may result in idle nodes during low-traffic periods.

1. **Flexible Resource Sharing:** Enable quota borrowing to utilize idle resources from other teams when needed. Borrowed pods are marked as preemptible and may be evicted if the lending team reclaims capacity.

**Intra-team preemption**

Enable intra-team preemption when running mixed workloads (evaluation, training, and inference) under the same quota. This allows Kueue to preempt lower-priority jobs within your team to accommodate high-priority inference pods, ensuring real-time inference can run without depending on external quota borrowing. For more information, see [Preemption](https://kueue.sigs.k8s.io/docs/concepts/preemption/).
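A Kueue `ClusterQueue` combining these strategies might look like the following sketch (queue, cohort, and flavor names are illustrative; the quota values mirror the example in the next section):

```
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-b-queue            # illustrative name
spec:
  namespaceSelector: {}
  cohort: shared-gpu-pool       # teams in the same cohort can lend/borrow quota
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: p4-gpus         # illustrative ResourceFlavor
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 20
              borrowingLimit: 10   # may borrow up to 10 idle GPUs from the cohort
  preemption:
    withinClusterQueue: LowerPriority   # intra-team preemption by priority
    reclaimWithinCohort: Any            # reclaim lent quota when demand returns
```

Setting `borrowingLimit: 0` and `reclaimWithinCohort: Never` would instead implement the strict resource control approach.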

## Sample inference workload setup
<a name="sagemaker-hyperpod-model-deployment-task-gov-example"></a>

The following example shows how Kueue manages GPU resources in a shared Amazon SageMaker HyperPod cluster.

**Cluster configuration and policy setup**  
Your cluster has the following configuration:
+ **Team A**: 10 P4 GPU quota
+ **Team B**: 20 P4 GPU quota
+ **Static provisioning**: No autoscaling
+ **Total capacity**: 30 P4 GPUs

The shared GPU pool uses this priority policy:

1. **Real-time inference**: Priority 100

1. **Training**: Priority 75

1. **Evaluation**: Priority 50

Kueue enforces team quotas and priority classes, with preemption and quota borrowing enabled.

**Initial state: Normal cluster utilization**  
In normal operations:
+ Team A runs training and evaluation jobs on all 10 P4 GPUs
+ Team B runs real-time inference (10 P4s) and evaluation (10 P4s) within its 20 GPU quota
+ The cluster is fully utilized with all jobs admitted and running

**Inference spike: Team B requires additional GPUs**  
When Team B experiences a traffic spike, additional inference pods require 5 more P4 GPUs. Kueue detects that the new pods are:
+ Within Team B's namespace
+ Priority 100 (real-time inference)
+ Pending admission due to quota constraints

**Kueue's response process chooses between two options:**  
**Option 1: Quota borrowing** - If Team A uses only 6 of its 10 P4s, Kueue can admit Team B's pods using the idle 4 P4s. However, these borrowed resources are preemptible—if Team A submits jobs to reach its full quota, Kueue evicts Team B's borrowed inference pods.

**Option 2: Self-preemption (Recommended)** - Team B runs low-priority evaluation jobs (priority 50). When high-priority inference pods are waiting, Kueue preempts the evaluation jobs within Team B's quota and admits the inference pods. This approach provides safe resource allocation with no external eviction risk.

Kueue follows a three-step process to allocate resources:

1. **Quota check**

   Question: Does Team B have unused quota?
   + Yes → Admit the pods
   + No → Proceed to Step 2

1. **Self-preemption within Team B**

   Question: Can lower-priority Team B jobs be preempted?
   + Yes → Preempt evaluation jobs (priority 50), free 5 P4s, and admit inference pods
   + No → Proceed to Step 3

   This approach keeps workloads within Team B's guaranteed quota, avoiding external eviction risks.

1. **Borrowing from other teams**

   Question: Is there idle, borrowable quota from other teams?
   + Yes → Admit using borrowed quota (marked as preemptible)
   + No → Pod remains in `NotAdmitted` state
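The three-step flow above can be sketched as a simplified decision function (this is an illustration of the ordering, not Kueue's actual implementation):

```python
def admit(pods_needed, team_quota, team_used, preemptible_gpus, idle_borrowable):
    """Simplified sketch of Kueue's three-step admission decision (GPU units)."""
    # Step 1: quota check - admit if unused quota covers the request
    if team_used + pods_needed <= team_quota:
        return "admitted"
    shortfall = team_used + pods_needed - team_quota
    # Step 2: self-preemption of lower-priority jobs within the team's quota
    if preemptible_gpus >= shortfall:
        return "admitted-after-self-preemption"
    # Step 3: borrow idle quota from other teams (borrowed pods are preemptible)
    if idle_borrowable >= shortfall:
        return "admitted-borrowed"
    return "not-admitted"

# Team B spike: needs 5 more GPUs, 20-GPU quota fully used, and 10 GPUs of
# priority-50 evaluation jobs are preemptible
print(admit(pods_needed=5, team_quota=20, team_used=20,
            preemptible_gpus=10, idle_borrowable=4))
# admitted-after-self-preemption
```

As in the scenario above, self-preemption wins here because Team B's own evaluation jobs can free the needed GPUs without borrowing.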

# HyperPod inference troubleshooting
<a name="sagemaker-hyperpod-model-deployment-ts"></a>

This troubleshooting guide addresses common issues that can occur during Amazon SageMaker HyperPod inference deployment and operation. These problems typically involve VPC networking configuration, IAM permissions, Kubernetes resource management, and operator connectivity issues that can prevent successful model deployment or cause deployments to fail or remain in pending states.

This troubleshooting guide uses the following terminology: **Troubleshooting steps** are diagnostic procedures to identify and investigate problems, **Resolution** provides the specific actions to fix identified issues, and **Verification** confirms that the solution worked correctly.

**Topics**
+ [Inference operator installation failures through SageMaker AI console](sagemaker-hyperpod-model-deployment-ts-console-cfn-failures.md)
+ [Inference operator installation failures through AWS CLI](sagemaker-hyperpod-model-deployment-ts-cli.md)
+ [Certificate download timeout](sagemaker-hyperpod-model-deployment-ts-certificate.md)
+ [Model deployment issues](sagemaker-hyperpod-model-deployment-ts-deployment-issues.md)
+ [VPC ENI permission issue](sagemaker-hyperpod-model-deployment-ts-permissions.md)
+ [IAM trust relationship issue](sagemaker-hyperpod-model-deployment-ts-trust.md)
+ [Missing NVIDIA GPU plugin error](sagemaker-hyperpod-model-deployment-ts-gpu.md)
+ [Inference operator fails to start](sagemaker-hyperpod-model-deployment-ts-startup.md)

# Inference operator installation failures through SageMaker AI console
<a name="sagemaker-hyperpod-model-deployment-ts-console-cfn-failures"></a>

**Overview:** When installing the inference operator through the SageMaker AI console using Quick Install or Custom Install, the underlying CloudFormation stacks may fail due to various issues. This section covers common failure scenarios and their resolutions.

## Inference operator add-on installation failure through Quick or Custom install
<a name="sagemaker-hyperpod-model-deployment-ts-console-cfn-stack-failed"></a>

**Problem:** The HyperPod cluster creation completes successfully, but the inference operator add-on installation fails.

**Common causes:**
+ Pod capacity limits exceeded on cluster nodes. The inference operator installation requires a minimum of 13 pods. The minimum recommended instance type is `ml.c5.4xlarge`.
+ IAM permission issues
+ Resource quota constraints
+ Network or VPC configuration problems

### Symptoms and diagnosis
<a name="sagemaker-hyperpod-model-deployment-ts-console-cfn-symptoms"></a>

**Symptoms:**
+ Inference operator add-on shows CREATE_FAILED or DEGRADED status in console
+ CloudFormation stack associated with the add-on is in CREATE_FAILED state
+ Installation progress stops or shows error messages

**Diagnostic steps:**

1. Check the inference operator add-on status:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health,Issues:issues}" \
       --output json
   ```

1. Check for pod limit issues:

   ```
   # Check current pod count per node
   kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, allocatable: .status.allocatable.pods, capacity: .status.capacity.pods}'
   
   # Check pods running on each node
   kubectl get pods --all-namespaces -o wide | awk '{print $8}' | sort | uniq -c
   
   # Check for pod evictions or failures
   kubectl get events --all-namespaces --sort-by='.lastTimestamp' | grep -i "pod\|limit\|quota"
   ```

1. Check CloudFormation stack status (if using console installation):

   ```
   # List CloudFormation stacks related to the cluster
   aws cloudformation list-stacks \
       --region $REGION \
       --query "StackSummaries[?contains(StackName, '$EKS_CLUSTER_NAME') && StackStatus=='CREATE_FAILED'].{Name:StackName,Status:StackStatus,Reason:StackStatusReason}" \
       --output table
   
   # Get detailed stack events
   aws cloudformation describe-stack-events \
       --stack-name <stack-name> \
       --region $REGION \
       --query "StackEvents[?ResourceStatus=='CREATE_FAILED']" \
       --output table
   ```

### Resolution
<a name="sagemaker-hyperpod-model-deployment-ts-console-cfn-resolution"></a>

To resolve the installation failure, save the current configuration, delete the failed add-on, fix the underlying issue, and then reinstall the inference operator through the SageMaker AI console (recommended) or the AWS CLI.

**Step 1: Save the current configuration**
+ Extract and save the add-on configuration before deletion:

  ```
  # Save the current configuration
  aws eks describe-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION \
      --query 'addon.configurationValues' \
      --output text > addon-config-backup.json
  
  # Verify the configuration was saved
  cat addon-config-backup.json
  
  # Pretty print for readability
  cat addon-config-backup.json | jq '.'
  ```

**Step 2: Delete the failed add-on**
+ Delete the inference operator add-on:

  ```
  aws eks delete-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION
  
  # Wait for deletion to complete
  echo "Waiting for add-on deletion..."
  aws eks wait addon-deleted \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION 2>/dev/null || sleep 60
  ```

**Step 3: Fix the underlying issue**

Choose the appropriate resolution based on the failure cause:

If the issue is pod limit exceeded:

```
# The inference operator requires a minimum of 13 pods.
# The minimum recommended instance type is ml.c5.4xlarge.
#
# Option 1: Add instance group with higher pod capacity
# Different instance types support different maximum pod counts
# For example: m5.large (29 pods), m5.xlarge (58 pods), m5.2xlarge (58 pods)
aws sagemaker update-cluster \
    --cluster-name $HYPERPOD_CLUSTER_NAME \
    --region $REGION \
    --instance-groups '[{"InstanceGroupName":"worker-group-2","InstanceType":"ml.m5.xlarge","InstanceCount":2}]'

# Option 2: Scale existing node group to add more nodes
aws eks update-nodegroup-config \
    --cluster-name $EKS_CLUSTER_NAME \
    --nodegroup-name <nodegroup-name> \
    --scaling-config minSize=2,maxSize=10,desiredSize=5 \
    --region $REGION

# Option 3: Clean up unused pods
kubectl delete pods --field-selector status.phase=Failed --all-namespaces
kubectl delete pods --field-selector status.phase=Succeeded --all-namespaces
```

**Step 4: Reinstall the inference operator**

After fixing the underlying issue, reinstall the inference operator using one of the following methods:
+ **SageMaker AI console with Custom Install (recommended):** Reuse existing IAM roles and TLS bucket from your previous installation. For steps, see [Method 1: Install HyperPod Inference Add-on through SageMaker AI console (Recommended)](sagemaker-hyperpod-model-deployment-setup.md#sagemaker-hyperpod-model-deployment-setup-ui).
+ **AWS CLI with saved configuration:** Use the configuration you backed up in Step 1 to reinstall the add-on. For the full CLI installation steps, see [Method 2: Installing the Inference Operator using the AWS CLI](sagemaker-hyperpod-model-deployment-setup.md#sagemaker-hyperpod-model-deployment-setup-addon).

  ```
  aws eks create-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --addon-version v1.0.0-eksbuild.1 \
      --configuration-values file://addon-config-backup.json \
      --region $REGION
  ```
+ **SageMaker AI console with Quick Install:** Creates new IAM roles, TLS bucket, and dependency add-ons automatically. For steps, see [Method 1: Install HyperPod Inference Add-on through SageMaker AI console (Recommended)](sagemaker-hyperpod-model-deployment-setup.md#sagemaker-hyperpod-model-deployment-setup-ui).

**Step 5: Verify successful installation**

```
# Check add-on status
aws eks describe-addon \
    --cluster-name $EKS_CLUSTER_NAME \
    --addon-name amazon-sagemaker-hyperpod-inference \
    --region $REGION \
    --query "addon.{Status:status,Health:health}" \
    --output table

# Verify pods are running
kubectl get pods -n hyperpod-inference-system

# Check operator logs
kubectl logs -n hyperpod-inference-system deployment/hyperpod-inference-controller-manager --tail=50
```

## Cert-manager installation failed due to Kueue webhook not ready
<a name="sagemaker-hyperpod-model-deployment-ts-console-kueue-webhook-race"></a>

**Problem:** The cert-manager add-on installation fails with a webhook error because the Task Governance (Kueue) webhook service has no available endpoints. This is a race condition that occurs when cert-manager tries to create resources before the Task Governance webhook pods are fully running. This can happen when Task Governance add-on is being installed along with the Inference operator during cluster creation.

### Symptoms and diagnosis
<a name="sagemaker-hyperpod-model-deployment-ts-console-kueue-symptoms"></a>

**Error message:**

```
AdmissionRequestDenied
Internal error occurred: failed calling webhook "mdeployment.kb.io": failed to call webhook: 
Post "https://kueue-webhook-service.kueue-system.svc:443/mutate-apps-v1-deployment?timeout=10s": 
no endpoints available for service "kueue-webhook-service"
```

**Root cause:**
+ Task Governance add-on installs and registers a mutating webhook that intercepts all Deployment creations
+ Cert-manager add-on tries to create Deployment resources before Task Governance webhook pods are ready
+ Kubernetes admission control calls the Task Governance webhook, but it has no endpoints (pods not running yet)

**Diagnostic step:**

1. Check cert-manager add-on status:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name cert-manager \
       --region $REGION \
       --query "addon.{Status:status,Health:health,Issues:issues}" \
       --output json
   ```

### Resolution
<a name="sagemaker-hyperpod-model-deployment-ts-console-kueue-resolution"></a>

**Solution: Delete and reinstall cert-manager**

The Task Governance webhook becomes ready within 60 seconds. Simply delete and reinstall the cert-manager add-on:

1. Delete the failed cert-manager add-on:

   ```
   aws eks delete-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name cert-manager \
       --region $REGION
   ```

1. Wait 30-60 seconds for the Task Governance webhook to become ready, then reinstall the cert-manager add-on:

   ```
   sleep 60
   
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name cert-manager \
       --region $REGION
   ```

# Inference operator installation failures through AWS CLI
<a name="sagemaker-hyperpod-model-deployment-ts-cli"></a>

**Overview:** When installing the inference operator through the AWS CLI, add-on installation may fail due to missing dependencies. This section covers common CLI installation failure scenarios and their resolutions.

## Inference add-on installation failed due to missing CSI drivers
<a name="sagemaker-hyperpod-model-deployment-ts-missing-csi-drivers"></a>

**Problem:** The inference operator add-on creation fails because required CSI driver dependencies are not installed on the EKS cluster.

**Symptoms and diagnosis:**

**Error messages:**

The following errors appear in the add-on creation logs or inference operator logs:

```
S3 CSI driver not installed (missing CSIDriver s3.csi.aws.com). 
Please install the required CSI driver and see the troubleshooting guide for more information.

FSx CSI driver not installed (missing CSIDriver fsx.csi.aws.com). 
Please install the required CSI driver and see the troubleshooting guide for more information.
```

**Diagnostic steps:**

1. Check if CSI drivers are installed:

   ```
   # Check for S3 CSI driver
   kubectl get csidriver s3.csi.aws.com
   kubectl get pods -n kube-system | grep mountpoint
   
   # Check for FSx CSI driver  
   kubectl get csidriver fsx.csi.aws.com
   kubectl get pods -n kube-system | grep fsx
   ```

1. Check EKS add-on status:

   ```
   # List all add-ons
   aws eks list-addons --cluster-name $EKS_CLUSTER_NAME --region $REGION
   
   # Check specific CSI driver add-ons
   aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-mountpoint-s3-csi-driver --region $REGION 2>/dev/null || echo "S3 CSI driver not installed"
   aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-fsx-csi-driver --region $REGION 2>/dev/null || echo "FSx CSI driver not installed"
   ```

1. Check inference operator add-on status:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health,Issues:issues}" \
       --output json
   ```

**Resolution:**

**Step 1: Install missing S3 CSI driver**

1. Create IAM role for S3 CSI driver (if not already created):

   ```
   # Set up service account role ARN (from installation steps)
   export S3_CSI_ROLE_ARN=$(aws iam get-role --role-name $S3_CSI_ROLE_NAME --query 'Role.Arn' --output text 2>/dev/null || echo "Role not found")
   echo "S3 CSI Role ARN: $S3_CSI_ROLE_ARN"
   ```

1. Install S3 CSI driver add-on:

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name aws-mountpoint-s3-csi-driver \
       --addon-version v1.14.1-eksbuild.1 \
       --service-account-role-arn $S3_CSI_ROLE_ARN \
       --region $REGION
   ```

1. Verify S3 CSI driver installation:

   ```
   # Wait for add-on to be active
   aws eks wait addon-active --cluster-name $EKS_CLUSTER_NAME --addon-name aws-mountpoint-s3-csi-driver --region $REGION
   
   # Verify CSI driver is available
   kubectl get csidriver s3.csi.aws.com
   kubectl get pods -n kube-system | grep mountpoint
   ```

**Step 2: Install missing FSx CSI driver**

1. Create IAM role for FSx CSI driver (if not already created):

   ```
   # Set up service account role ARN (from installation steps)
   export FSX_CSI_ROLE_ARN=$(aws iam get-role --role-name $FSX_CSI_ROLE_NAME --query 'Role.Arn' --output text 2>/dev/null || echo "Role not found")
   echo "FSx CSI Role ARN: $FSX_CSI_ROLE_ARN"
   ```

1. Install FSx CSI driver add-on:

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name aws-fsx-csi-driver \
       --addon-version v1.6.0-eksbuild.1 \
       --service-account-role-arn $FSX_CSI_ROLE_ARN \
       --region $REGION
   
   # Wait for add-on to be active
   aws eks wait addon-active --cluster-name $EKS_CLUSTER_NAME --addon-name aws-fsx-csi-driver --region $REGION
   
   # Verify FSx CSI driver is running
   kubectl get pods -n kube-system | grep fsx
   ```

**Step 3: Verify all dependencies**

After installing the missing dependencies, verify they are running correctly before retrying the inference operator installation:

```
# Check all required add-ons are active
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-mountpoint-s3-csi-driver --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-fsx-csi-driver --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name metrics-server --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name cert-manager --region $REGION

# Verify all pods are running
kubectl get pods -n kube-system | grep -E "(mountpoint|fsx|metrics-server)"
kubectl get pods -n cert-manager
```

## Inference Custom Resource Definitions are missing during model deployment
<a name="sagemaker-hyperpod-model-deployment-ts-crd-not-exist"></a>

**Problem:** Custom Resource Definitions (CRDs) are missing when you attempt to create model deployments. This issue occurs when you previously installed and deleted the inference add-on without cleaning up model deployments that have finalizers.

**Symptoms and diagnosis:**

**Root cause:**

If you delete the inference add-on without first removing all model deployments, custom resources with finalizers remain in the cluster. These finalizers must complete before you can delete the CRDs. The add-on deletion process doesn't wait for CRD deletion to complete, which causes the CRDs to remain in a terminating state and prevents new installations.
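
You can confirm this state directly: a CRD blocked by lingering custom resources carries a deletion timestamp but never finishes terminating. The following sketch checks for that condition; it assumes `kubectl` access to the cluster and uses the JumpStartModel CRD name as an example.

```shell
# Check whether the JumpStartModel CRD is stuck terminating; an empty result
# means the CRD is not marked for deletion (or does not exist).
crd=jumpstartmodels.inference.sagemaker.aws.amazon.com
ts=$(kubectl get crd "$crd" -o jsonpath='{.metadata.deletionTimestamp}' 2>/dev/null)
if [ -n "$ts" ]; then
  echo "CRD $crd has been terminating since $ts"
else
  echo "CRD $crd is not marked for deletion (or does not exist)"
fi
```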

**To diagnose this issue**

1. Check whether CRDs exist.

   ```
   kubectl get crd | grep inference.sagemaker.aws.amazon.com
   ```

1. Check for stuck custom resources.

   ```
   # Check for JumpStartModel resources
   kubectl get jumpstartmodels -A
   
   # Check for InferenceEndpointConfig resources
   kubectl get inferenceendpointconfigs -A
   ```

1. Inspect finalizers on stuck resources.

   ```
   # Example for a specific JumpStartModel
   kubectl get jumpstartmodels <model-name> -n <namespace> -o jsonpath='{.metadata.finalizers}'
   
   # Example for a specific InferenceEndpointConfig
   kubectl get inferenceendpointconfigs <config-name> -n <namespace> -o jsonpath='{.metadata.finalizers}'
   ```

**Resolution:**

Manually remove the finalizers from all model deployments that weren't deleted when you removed the inference add-on. Complete the following steps for each stuck custom resource.
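
If many resources are stuck, patching each one by hand is tedious. The following sketch loops over every JumpStartModel in the cluster and clears its finalizers; the same pattern works for InferenceEndpointConfig resources. This is an illustration that assumes `kubectl` access to the cluster.

```shell
# Clear finalizers on every JumpStartModel across all namespaces.
for entry in $(kubectl get jumpstartmodels -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' 2>/dev/null); do
  ns=${entry%%/*}     # text before the first slash is the namespace
  name=${entry#*/}    # text after the first slash is the resource name
  kubectl patch jumpstartmodels "$name" -n "$ns" \
      -p '{"metadata":{"finalizers":[]}}' --type=merge
done
```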

**To remove finalizers from JumpStartModel resources**

1. List all JumpStartModel resources across all namespaces.

   ```
   kubectl get jumpstartmodels -A
   ```

1. For each JumpStartModel resource, remove the finalizers by patching the resource to set metadata.finalizers to an empty array.

   ```
   kubectl patch jumpstartmodels <model-name> -n <namespace> -p '{"metadata":{"finalizers":[]}}' --type=merge
   ```

   The following example shows how to patch a resource named kv-l1-only.

   ```
   kubectl patch jumpstartmodels kv-l1-only -n default -p '{"metadata":{"finalizers":[]}}' --type=merge
   ```

1. Verify that the model instance is deleted.

   ```
   kubectl get jumpstartmodels -A
   ```

   When all resources are cleaned up, you should see the following output.

   ```
   Error from server (NotFound): Unable to list "inference.sagemaker.aws.amazon.com/v1, Resource=jumpstartmodels": the server could not find the requested resource (get jumpstartmodels.inference.sagemaker.aws.amazon.com)
   ```

1. Verify that the JumpStartModel CRD is removed.

   ```
   kubectl get crd | grep jumpstartmodels.inference.sagemaker.aws.amazon.com
   ```

   If the CRD is successfully removed, this command returns no output.

**To remove finalizers from InferenceEndpointConfig resources**

1. List all InferenceEndpointConfig resources across all namespaces.

   ```
   kubectl get inferenceendpointconfigs -A
   ```

1. For each InferenceEndpointConfig resource, remove the finalizers.

   ```
   kubectl patch inferenceendpointconfigs <config-name> -n <namespace> -p '{"metadata":{"finalizers":[]}}' --type=merge
   ```

   The following example shows how to patch a resource named my-inference-config.

   ```
   kubectl patch inferenceendpointconfigs my-inference-config -n default -p '{"metadata":{"finalizers":[]}}' --type=merge
   ```

1. Verify that the config instance is deleted.

   ```
   kubectl get inferenceendpointconfigs -A
   ```

   When all resources are cleaned up, you should see the following output.

   ```
   Error from server (NotFound): Unable to list "inference.sagemaker.aws.amazon.com/v1, Resource=inferenceendpointconfigs": the server could not find the requested resource (get inferenceendpointconfigs.inference.sagemaker.aws.amazon.com)
   ```

1. Verify that the InferenceEndpointConfig CRD is removed.

   ```
   kubectl get crd | grep inferenceendpointconfigs.inference.sagemaker.aws.amazon.com
   ```

   If the CRD is successfully removed, this command returns no output.

**To reinstall the inference add-on**

After you clean up all stuck resources and verify that the CRDs are removed, reinstall the inference add-on. For more information, see [Installing the Inference Operator with EKS add-on](sagemaker-hyperpod-model-deployment-setup.md#sagemaker-hyperpod-model-deployment-setup-install-inference-operator-addon).

**Verification:**

1. Verify that the inference add-on is successfully installed.

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health}" \
       --output table
   ```

   The `Status` should be `ACTIVE` and the `Health` should be `HEALTHY`.

1. Verify that CRDs are properly installed.

   ```
   kubectl get crd | grep inference.sagemaker.aws.amazon.com
   ```

   You should see the inference-related CRDs listed in the output.

1. Test creating a new model deployment to confirm that the issue is resolved.

   ```
   # Create a test deployment using your preferred method
   kubectl apply -f <your-model-deployment.yaml>
   ```

**Prevention:**

To prevent this issue, complete the following steps before you uninstall the inference add-on.

1. Delete all model deployments.

   ```
   # Delete all JumpStartModel resources
   kubectl delete jumpstartmodels --all -A
   
   # Delete all InferenceEndpointConfig resources
   kubectl delete inferenceendpointconfigs --all -A
   
   # Wait for all resources to be fully deleted
   kubectl get jumpstartmodels -A
   kubectl get inferenceendpointconfigs -A
   ```

1. Verify that all custom resources are deleted.

1. After you confirm that all resources are cleaned up, delete the inference add-on.

## Inference add-on installation failed due to missing cert-manager
<a name="sagemaker-hyperpod-model-deployment-ts-missing-cert-manager"></a>

**Problem:** The inference operator add-on creation fails because the cert-manager EKS Add-On is not installed, resulting in missing Custom Resource Definitions (CRDs).

**Symptoms and diagnosis:**

**Error messages:**

The following errors appear in the add-on creation logs or inference operator logs:

```
Missing required CRD: certificaterequests.cert-manager.io. 
The cert-manager add-on is not installed. Please install cert-manager and see the troubleshooting guide for more information.
```

**Diagnostic steps:**

1. Check if cert-manager is installed:

   ```
   # Check for cert-manager CRDs
   kubectl get crd | grep cert-manager
   kubectl get pods -n cert-manager
   
   # Check EKS add-on status
   aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name cert-manager --region $REGION 2>/dev/null || echo "Cert-manager not installed"
   ```

1. Check inference operator add-on status:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health,Issues:issues}" \
       --output json
   ```

**Resolution:**

**Step 1: Install cert-manager add-on**

1. Install the cert-manager EKS add-on:

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name cert-manager \
       --addon-version v1.18.2-eksbuild.2 \
       --region $REGION
   ```

1. Verify cert-manager installation:

   ```
   # Wait for add-on to be active
   aws eks wait addon-active --cluster-name $EKS_CLUSTER_NAME --addon-name cert-manager --region $REGION
   
   # Verify cert-manager pods are running
   kubectl get pods -n cert-manager
   
   # Verify CRDs are installed
   kubectl get crd | grep cert-manager | wc -l
   # Expected: Should show multiple cert-manager CRDs
   ```

**Step 2: Retry inference operator installation**

1. After cert-manager is installed, retry the inference operator installation:

   ```
   # Delete the failed add-on if it exists
   aws eks delete-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION 2>/dev/null || echo "Add-on not found, proceeding with installation"
   
   # Wait for deletion to complete
   sleep 30
   
   # Reinstall the inference operator add-on
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --addon-version v1.0.0-eksbuild.1 \
       --configuration-values file://addon-config.json \
       --region $REGION
   ```

1. Monitor the installation:

   ```
   # Check installation status
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health}" \
       --output table
   
   # Verify inference operator pods are running
   kubectl get pods -n hyperpod-inference-system
   ```

## Inference add-on installation failed due to missing ALB Controller
<a name="sagemaker-hyperpod-model-deployment-ts-missing-alb"></a>

**Problem:** The inference operator add-on creation fails because the AWS Load Balancer Controller is not installed or not properly configured for the inference add-on.

**Symptoms and diagnosis:**

**Error messages:**

The following errors appear in the add-on creation logs or inference operator logs:

```
ALB Controller not installed (missing aws-load-balancer-controller pods). 
Please install the Application Load Balancer Controller and see the troubleshooting guide for more information.
```

**Diagnostic steps:**

1. Check if ALB Controller is installed:

   ```
   # Check for ALB Controller pods
   kubectl get pods -n kube-system | grep aws-load-balancer-controller
   kubectl get pods -n hyperpod-inference-system | grep aws-load-balancer-controller
   
   # Check ALB Controller service account
   kubectl get serviceaccount aws-load-balancer-controller -n kube-system 2>/dev/null || echo "ALB Controller service account not found"
   kubectl get serviceaccount aws-load-balancer-controller -n hyperpod-inference-system 2>/dev/null || echo "ALB Controller service account not found in inference namespace"
   ```

1. Check inference operator add-on configuration:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health,ConfigurationValues:configurationValues}" \
       --output json
   ```

**Resolution:**

Choose one of the following options based on your setup:

**Option 1: Let the inference add-on install ALB Controller (Recommended)**
+ Ensure the ALB role is created and properly configured in your add-on configuration:

  ```
  # Verify ALB role exists
  export ALB_ROLE_ARN=$(aws iam get-role --role-name alb-role --query 'Role.Arn' --output text 2>/dev/null || echo "Role not found")
  echo "ALB Role ARN: $ALB_ROLE_ARN"
  
  # Update your addon-config.json to enable ALB
  cat > addon-config.json << EOF
  {
    "executionRoleArn": "$EXECUTION_ROLE_ARN",
    "tlsCertificateS3Bucket": "$BUCKET_NAME",
    "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
    "alb": {
      "enabled": true,
      "serviceAccount": {
        "create": true,
        "roleArn": "$ALB_ROLE_ARN"
      }
    },
    "keda": {
      "auth": {
        "aws": {
          "irsa": {
            "roleArn": "$KEDA_ROLE_ARN"
          }
        }
      }
    }
  }
  EOF
  ```

**Option 2: Use existing ALB Controller installation**
+ If you already have ALB Controller installed, configure the add-on to use the existing installation:

  ```
  # Update your addon-config.json to disable ALB installation
  cat > addon-config.json << EOF
  {
    "executionRoleArn": "$EXECUTION_ROLE_ARN",
    "tlsCertificateS3Bucket": "$BUCKET_NAME",
    "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
    "alb": {
      "enabled": false
    },
    "keda": {
      "auth": {
        "aws": {
          "irsa": {
            "roleArn": "$KEDA_ROLE_ARN"
          }
        }
      }
    }
  }
  EOF
  ```

**Next step: Retry the inference operator installation**

1. Reinstall the inference operator add-on with the updated configuration:

   ```
   # Delete the failed add-on if it exists
   aws eks delete-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION 2>/dev/null || echo "Add-on not found, proceeding with installation"
   
   # Wait for deletion to complete
   sleep 30
   
   # Reinstall with updated configuration
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --addon-version v1.0.0-eksbuild.1 \
       --configuration-values file://addon-config.json \
       --region $REGION
   ```

1. Verify ALB Controller is working:

   ```
   # Check ALB Controller pods
   kubectl get pods -n hyperpod-inference-system | grep aws-load-balancer-controller
   kubectl get pods -n kube-system | grep aws-load-balancer-controller
   
   # Check service account annotations
   kubectl describe serviceaccount aws-load-balancer-controller -n hyperpod-inference-system 2>/dev/null
   kubectl describe serviceaccount aws-load-balancer-controller -n kube-system 2>/dev/null
   ```

## Inference add-on installation failed due to missing KEDA operator
<a name="sagemaker-hyperpod-model-deployment-ts-missing-keda"></a>

**Problem:** The inference operator add-on creation fails because the KEDA (Kubernetes Event Driven Autoscaler) operator is not installed or not properly configured for the inference add-on.

**Symptoms and diagnosis:**

**Error messages:**

The following errors appear in the add-on creation logs or inference operator logs:

```
KEDA operator not installed (missing keda-operator pods). 
KEDA can be installed separately in any namespace or via the Inference addon.
```

**Diagnostic steps:**

1. Check if KEDA operator is installed:

   ```
   # Check for KEDA operator pods in common namespaces
   kubectl get pods -n keda-system | grep keda-operator 2>/dev/null || echo "KEDA not found in keda-system namespace"
   kubectl get pods -n kube-system | grep keda-operator 2>/dev/null || echo "KEDA not found in kube-system namespace"
   kubectl get pods -n hyperpod-inference-system | grep keda-operator 2>/dev/null || echo "KEDA not found in inference namespace"
   
   # Check for KEDA CRDs
   kubectl get crd | grep keda 2>/dev/null || echo "KEDA CRDs not found"
   
   # Check KEDA service account
   kubectl get serviceaccount keda-operator -A 2>/dev/null || echo "KEDA service account not found"
   ```

1. Check inference operator add-on configuration:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health,ConfigurationValues:configurationValues}" \
       --output json
   ```

**Resolution:**

Choose one of the following options based on your setup:

**Option 1: Let the inference add-on install KEDA (Recommended)**
+ Ensure the KEDA role is created and properly configured in your add-on configuration:

  ```
  # Verify KEDA role exists
  export KEDA_ROLE_ARN=$(aws iam get-role --role-name keda-operator-role --query 'Role.Arn' --output text 2>/dev/null || echo "Role not found")
  echo "KEDA Role ARN: $KEDA_ROLE_ARN"
  
  # Update your addon-config.json to enable KEDA
  cat > addon-config.json << EOF
  {
    "executionRoleArn": "$EXECUTION_ROLE_ARN",
    "tlsCertificateS3Bucket": "$BUCKET_NAME",
    "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
    "alb": {
      "serviceAccount": {
        "create": true,
        "roleArn": "$ALB_ROLE_ARN"
      }
    },
    "keda": {
      "enabled": true,
      "auth": {
        "aws": {
          "irsa": {
            "roleArn": "$KEDA_ROLE_ARN"
          }
        }
      }
    }
  }
  EOF
  ```

**Option 2: Use existing KEDA installation**
+ If you already have KEDA installed, configure the add-on to use the existing installation:

  ```
  # Update your addon-config.json to disable KEDA installation
  cat > addon-config.json << EOF
  {
    "executionRoleArn": "$EXECUTION_ROLE_ARN",
    "tlsCertificateS3Bucket": "$BUCKET_NAME",
    "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
    "alb": {
      "serviceAccount": {
        "create": true,
        "roleArn": "$ALB_ROLE_ARN"
      }
    },
    "keda": {
      "enabled": false
    }
  }
  EOF
  ```

**Next step: Retry the inference operator installation**

1. Reinstall the inference operator add-on with the updated configuration:

   ```
   # Delete the failed add-on if it exists
   aws eks delete-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION 2>/dev/null || echo "Add-on not found, proceeding with installation"
   
   # Wait for deletion to complete
   sleep 30
   
   # Reinstall with updated configuration
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --addon-version v1.0.0-eksbuild.1 \
       --configuration-values file://addon-config.json \
       --region $REGION
   ```

1. Verify KEDA is working:

   ```
   # Check KEDA pods
   kubectl get pods -n hyperpod-inference-system | grep keda
   kubectl get pods -n kube-system | grep keda
   kubectl get pods -n keda-system | grep keda 2>/dev/null
   
   # Check KEDA CRDs
   kubectl get crd | grep scaledobjects
   kubectl get crd | grep scaledjobs
   
   # Check KEDA service account annotations
   kubectl describe serviceaccount keda-operator -n hyperpod-inference-system 2>/dev/null
   kubectl describe serviceaccount keda-operator -n kube-system 2>/dev/null
   kubectl describe serviceaccount keda-operator -n keda-system 2>/dev/null
   ```

# Certificate download timeout
<a name="sagemaker-hyperpod-model-deployment-ts-certificate"></a>

When you deploy a SageMaker AI endpoint, the creation process fails because the certificate authority (CA) certificate can't be downloaded in a VPC environment. For detailed configuration steps, refer to the [Admin guide](https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/SageMakerHyperpod/hyperpod-inference/Hyperpod_Inference_Admin_Notebook.ipynb).

**Error message:**

The following error appears in the SageMaker AI endpoint CloudWatch logs: 

```
Error downloading CA certificate: Connect timeout on endpoint URL: "https://****.s3.<REGION>.amazonaws.com/****/***.pem"
```

**Root cause:**
+ This issue occurs when the inference operator cannot access the self-signed certificate in Amazon S3 within your VPC
+ Proper configuration of the Amazon S3 VPC endpoint is essential for certificate access

**Resolution:**

1. If you don't have an Amazon S3 VPC endpoint:
   + Create an Amazon S3 VPC endpoint following the configuration in section 5.3 of the [Admin guide](https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/SageMakerHyperpod/hyperpod-inference/Hyperpod_Inference_Admin_Notebook.ipynb).

1. If you already have an Amazon S3 VPC endpoint:
   + Ensure that the subnet route table points to the VPC endpoint (for a gateway endpoint) or that private DNS is enabled (for an interface endpoint).
   + Confirm that the Amazon S3 VPC endpoint matches the configuration described in the endpoint creation step in section 5.3 of the Admin guide.
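
If you need to create the gateway endpoint yourself, a minimal sketch looks like the following. The VPC and route table IDs are placeholders; your exact configuration should follow section 5.3 of the Admin guide.

```shell
# Create an Amazon S3 gateway endpoint and attach it to the private subnet's
# route table so the operator can reach the certificate bucket.
REGION=us-west-2                              # example Region
S3_SERVICE="com.amazonaws.${REGION}.s3"       # gateway endpoint service name

aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --vpc-endpoint-type Gateway \
    --service-name "$S3_SERVICE" \
    --route-table-ids rtb-0123456789abcdef0 \
    --region "$REGION" 2>/dev/null || echo "create-vpc-endpoint failed; verify the IDs and your credentials"
```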

# Model deployment issues
<a name="sagemaker-hyperpod-model-deployment-ts-deployment-issues"></a>

**Overview:** This section covers common issues that occur during model deployment, including pending states, failed deployments, and monitoring deployment progress.

## Model deployment stuck in pending state
<a name="sagemaker-hyperpod-model-deployment-ts-pending"></a>

When deploying a model, the deployment remains in a "Pending" state for an extended period. This indicates that the inference operator is unable to initiate the model deployment in your HyperPod cluster.

**Components affected:**

During normal deployment, the inference operator should:
+ Deploy model pod
+ Create load balancer
+ Create SageMaker AI endpoint

**Troubleshooting steps:**

1. Check the inference operator pod status:

   ```
   kubectl get pods -n hyperpod-inference-system
   ```

   Expected output example:

   ```
   NAME                                                           READY   STATUS    RESTARTS   AGE
   hyperpod-inference-operator-controller-manager-65c49967f5-894fg   1/1     Running   0         6d13h
   ```

1. Review the inference operator logs for error messages:

   ```
   kubectl logs <operator-pod-name> -n hyperpod-inference-system
   ```

   Replace `<operator-pod-name>` with the name of the pod from the previous step.

**What to look for:**
+ Error messages in the operator logs
+ Status of the operator pod
+ Any deployment-related warnings or failures

**Note**  
A healthy deployment should progress beyond the "Pending" state within a reasonable time. If issues persist, review the inference operator logs for specific error messages to determine the root cause.

## Model deployment failed state troubleshooting
<a name="sagemaker-hyperpod-model-deployment-ts-failed"></a>

When a model deployment enters a "Failed" state, the failure could occur in one of three components:
+ Model pod deployment
+ Load balancer creation
+ SageMaker AI endpoint creation

**Troubleshooting steps:**

1. Check the inference operator status:

   ```
   kubectl get pods -n hyperpod-inference-system
   ```

   Expected output:

   ```
   NAME                                                           READY   STATUS    RESTARTS   AGE
   hyperpod-inference-operator-controller-manager-65c49967f5-894fg   1/1     Running   0         6d13h
   ```

1. Review the operator logs:

   ```
   kubectl logs <operator-pod-name> -n hyperpod-inference-system
   ```

**What to look for:**

The operator logs will indicate which component failed:
+ Model pod deployment failures
+ Load balancer creation issues
+ SageMaker AI endpoint errors

## Checking model deployment progress
<a name="sagemaker-hyperpod-model-deployment-ts-progress"></a>

To monitor the progress of your model deployment and identify potential issues, you can use kubectl commands to check the status of various components. This helps determine whether the deployment is progressing normally or has encountered problems during the model pod creation, load balancer setup, or SageMaker AI endpoint configuration phases.

**Method 1: Check the JumpStart model status**

```
kubectl describe jumpstartmodel.inference.sagemaker.aws.amazon.com/<model-name> -n <namespace>
```

**Key status indicators to monitor:**

1. Deployment Status
   + Look for `Status.State`: Should show `DeploymentComplete`
   + Check `Status.Deployment Status.Available Replicas`
   + Monitor `Status.Conditions` for deployment progress

1. SageMaker AI Endpoint Status
   + Check `Status.Endpoints.Sagemaker.State`: Should show `CreationCompleted`
   + Verify `Status.Endpoints.Sagemaker.Endpoint Arn`

1. TLS Certificate Status
   + View `Status.Tls Certificate` details
   + Check certificate expiration in `Last Cert Expiry Time`

**Method 2: Check the inference endpoint configuration**

```
kubectl describe inferenceendpointconfig.inference.sagemaker.aws.amazon.com/<deployment_name> -n <namespace>
```

**Common status states:**
+ `DeploymentInProgress`: Initial deployment phase
+ `DeploymentComplete`: Successful deployment
+ `Failed`: Deployment failed

**Note**  
Monitor the Events section for any warnings or errors. Check replica count matches expected configuration. Verify all conditions show `Status: True` for a healthy deployment.
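
The checks the note describes can be scripted. In the following sketch, the resource name and namespace are examples, and the `.status.state` JSONPath is an assumption based on the `Status.State` field shown in the describe output above.

```shell
ns=default                    # example namespace
name=my-inference-config      # example InferenceEndpointConfig name

# Current deployment state (DeploymentInProgress, DeploymentComplete, or Failed).
kubectl get inferenceendpointconfigs "$name" -n "$ns" \
    -o jsonpath='{.status.state}' 2>/dev/null || echo "resource not found"

# Recent events in the namespace, newest last, to spot warnings and errors.
kubectl get events -n "$ns" --sort-by=.lastTimestamp 2>/dev/null | tail -n 10
```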

# VPC ENI permission issue
<a name="sagemaker-hyperpod-model-deployment-ts-permissions"></a>

SageMaker AI endpoint creation fails due to insufficient permissions for creating network interfaces in VPC.

**Error message:**

```
Please ensure that the execution role for variant AllTraffic has sufficient permissions for creating an endpoint variant within a VPC
```

**Root cause:**

The inference operator's execution role lacks the required Amazon EC2 permission to create network interfaces (ENI) in VPC.

**Resolution:**

Add the following IAM permission to the inference operator's execution role:

```
{
    "Effect": "Allow",
    "Action": [
        "ec2:CreateNetworkInterfacePermission",
        "ec2:DeleteNetworkInterfacePermission"
     ],
    "Resource": "*"
}
```
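
One way to apply the statement above is as an inline policy on the execution role. In this sketch, the role name variable and the policy name `AllowEniManagement` are placeholders.

```shell
# Write the permission statement to a file, then attach it as an inline policy.
cat > eni-permissions.json << 'EOF'
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateNetworkInterfacePermission",
                "ec2:DeleteNetworkInterfacePermission"
            ],
            "Resource": "*"
        }
    ]
}
EOF

aws iam put-role-policy \
    --role-name "$EXECUTION_ROLE_NAME" \
    --policy-name AllowEniManagement \
    --policy-document file://eni-permissions.json 2>/dev/null || echo "put-role-policy failed; check the role name and your credentials"
```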

**Verification:**

After adding the permission:

1. Delete the failed endpoint (if exists)

1. Retry the endpoint creation

1. Monitor the deployment status for successful completion

**Note**  
This permission is essential for SageMaker AI endpoints running in VPC mode. Ensure the execution role has all other necessary VPC-related permissions as well.

# IAM trust relationship issue
<a name="sagemaker-hyperpod-model-deployment-ts-trust"></a>

HyperPod inference operator fails to start with an STS AssumeRoleWithWebIdentity error, indicating an IAM trust relationship configuration problem.

**Error message:**

```
failed to enable inference watcher for HyperPod cluster *****: operation error SageMaker: UpdateClusterInference, 
get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, 
operation error STS: AssumeRoleWithWebIdentity, https response error StatusCode: 403, RequestID: ****, 
api error AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
```

**Resolution:**

Update the trust relationship of the inference operator's IAM execution role with the following configuration.

Replace the following placeholders:
+ `<ACCOUNT_ID>`: Your AWS account ID
+ `<REGION>`: Your AWS region
+ `<OIDC_ID>`: Your Amazon EKS cluster's OIDC provider ID

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::<ACCOUNT_ID>:oidc-provider/oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringLike": {
                    "oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>:sub": "system:serviceaccount:<namespace>:<service-account-name>",
                    "oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>:aud": "sts.amazonaws.com"
                }
            }
        },
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "sagemaker.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```

**Verification:**

After updating the trust relationship:

1. Verify the role configuration in IAM console

1. Restart the inference operator if necessary

1. Monitor operator logs for successful startup
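
You can also confirm the change from the command line. In this sketch, the role name is a placeholder; the check simply looks for the web-identity action in the role's trust policy.

```shell
EXECUTION_ROLE_NAME=hyperpod-inference-operator-role   # example role name

# Fetch the role's trust policy and check for the web-identity action.
policy=$(aws iam get-role --role-name "$EXECUTION_ROLE_NAME" \
    --query 'Role.AssumeRolePolicyDocument' --output json 2>/dev/null)
case "$policy" in
  *AssumeRoleWithWebIdentity*) echo "trust policy includes sts:AssumeRoleWithWebIdentity" ;;
  *) echo "could not verify the trust policy; check the role name and your credentials" ;;
esac
```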

# Missing NVIDIA GPU plugin error
<a name="sagemaker-hyperpod-model-deployment-ts-gpu"></a>

Model deployment fails with GPU insufficiency error despite having available GPU nodes. This occurs when the NVIDIA device plugin is not installed in the HyperPod cluster.

**Error message:**

```
0/15 nodes are available: 10 node(s) didn't match Pod's node affinity/selector, 
5 Insufficient nvidia.com/gpu. preemption: 0/15 nodes are available: 
10 Preemption is not helpful for scheduling, 5 No preemption victims found for incoming pod.
```

**Root cause:**
+ Kubernetes cannot detect GPU resources without the NVIDIA device plugin
+ Results in scheduling failures for GPU workloads

**Resolution:**

Install the NVIDIA GPU plugin by running:

```
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/refs/tags/v0.17.1/deployments/static/nvidia-device-plugin.yml
```

**Verification steps:**

1. Check the plugin deployment status:

   ```
   kubectl get pods -n kube-system | grep nvidia-device-plugin
   ```

1. Verify GPU resources are now visible:

   ```
   kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\\.com/gpu
   ```

1. Retry model deployment

**Note**  
Ensure NVIDIA drivers are installed on GPU nodes. Plugin installation is a one-time setup per cluster. May require cluster admin privileges to install.

# Inference operator fails to start
<a name="sagemaker-hyperpod-model-deployment-ts-startup"></a>

The inference operator pod fails to start with the following error message. This occurs when the permission policy on the operator's execution role doesn't grant the access the operator needs, so the operator component that runs on the control plane can't start.

**Error message:**

```
Warning Unhealthy 5m46s (x22 over 49m) kubelet Startup probe failed: Get "http://10.1.100.59:8081/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
```

**Root cause:**
+ The permission policy of the inference operator's execution role doesn't grant access to the required Amazon S3 and Amazon ECR resources (for example, the Amazon ECR authorization token).

**Resolution:**

Update the `HyperpodInferenceAccessPolicy-ml-cluster` policy on the execution role (`EXECUTION_ROLE_ARN`) of the HyperPod inference operator to include all resources:

------
#### [ JSON ]


```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken"
            ],
            "Resource": "*"
        }
    ]
}
```

------

**Verification steps:**

1. Update the policy.

1. Terminate the HyperPod inference operator pod.

1. Verify that the pod restarts without throwing any exceptions.

# Amazon SageMaker HyperPod Inference release notes
<a name="sagemaker-hyperpod-inference-release-notes"></a>

This topic covers release notes that track updates, fixes, and new features for Amazon SageMaker HyperPod Inference. SageMaker HyperPod Inference enables you to deploy and scale machine learning models on your HyperPod clusters with enterprise-grade reliability. For general Amazon SageMaker HyperPod platform releases, updates, and improvements, see [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md).

For information about SageMaker HyperPod Inference capabilities and deployment options, see [Deploying models on Amazon SageMaker HyperPod](sagemaker-hyperpod-model-deployment.md).

## SageMaker HyperPod Inference release notes: v3.1
<a name="sagemaker-hyperpod-inference-release-notes-20260403"></a>

**Release Date:** April 3, 2026

**Summary**

Inference Operator v3.1 introduces custom Kubernetes pod configuration, custom certificate support, and per-pod request limits.

**Key Features**
+ **Custom Kubernetes Pod Configuration** – Added a new `kubernetes` field to the `InferenceEndpointConfig` CRD that allows users to customize inference pod configurations:
  + **Custom init containers** – Run user-defined init containers before the inference server starts (for example, cache warming, GDS setup). Init containers are injected after the operator's prefetch container.
  + **Custom volumes** – Add additional volumes (`emptyDir`, `hostPath`, `configMap`, etc.) to the pod spec, which can be referenced by init containers via `volumeMounts`.
  + **Custom scheduler name** – Specify a custom Kubernetes scheduler for pod placement.
+ **Custom Certificates** – Use your own ACM certificates for inference endpoints instead of operator-generated self-signed certificates, configured via `customCertificateConfig`. Supports publicly trusted ACM certificates, AWS Private CA certificates, and certificates imported from external CAs. The operator monitors certificate health and supports automatic renewal detection.
+ **Request Limits** – Control request handling per pod via the new `RequestLimits` configuration under `Worker`, with the following configurable fields:
  + `maxConcurrentRequests` – Maximum concurrent in-flight requests per pod.
  + `maxQueueSize` – Requests to queue when the concurrency limit is reached before rejecting.
  + `overflowStatusCode` – HTTP status code returned when limits are exceeded (default: 429).
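
Taken together, the new fields might appear in an `InferenceEndpointConfig` spec roughly as follows. This is an illustrative sketch only; the API version, field casing, and nesting are assumptions based on the feature descriptions above, so consult the CRD reference before use.

```
apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1   # assumed API group/version
kind: InferenceEndpointConfig
metadata:
  name: my-endpoint                      # hypothetical endpoint name
spec:
  # Custom pod configuration (new in v3.1)
  kubernetes:
    schedulerName: my-custom-scheduler   # hypothetical scheduler
    initContainers:
      - name: cache-warmer               # runs after the operator's prefetch container
        image: public.ecr.aws/example/cache-warmer:latest   # placeholder image
        volumeMounts:
          - name: model-cache
            mountPath: /cache
    volumes:
      - name: model-cache
        emptyDir: {}
  # Bring-your-own ACM certificate (new in v3.1)
  customCertificateConfig:
    certificateArn: arn:aws:acm:us-east-1:111122223333:certificate/EXAMPLE
  worker:
    # Per-pod request limits (new in v3.1)
    requestLimits:
      maxConcurrentRequests: 32
      maxQueueSize: 64
      overflowStatusCode: 429
```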

For detailed information including prerequisites and upgrade instructions, see the sections below.

### Prerequisites
<a name="sagemaker-hyperpod-inference-v3-1-prerequisites"></a>

To use the Custom Certificates feature, add the following permissions to your Inference Operator execution role:

```
{
    "Sid": "ACMCertificateAccess",
    "Effect": "Allow",
    "Action": [
        "acm:DescribeCertificate",
        "acm:GetCertificate"
    ],
    "Resource": "arn:aws:acm:*:*:certificate/*"
}

### Upgrade to v3.1
<a name="sagemaker-hyperpod-inference-v3-1-upgrade"></a>

If you already have the Inference Operator installed via Helm, use the following commands to upgrade:

```
helm get values -n kube-system hyperpod-inference-operator \
> current-values.yaml

cd sagemaker-hyperpod-cli/helm_chart/HyperPodHelmChart/\
charts/inference-operator

helm upgrade hyperpod-inference-operator . -n kube-system \
  -f current-values.yaml --set image.tag=v3.1
    
# Verification
kubectl get deployment hyperpod-inference-operator-controller-manager \
  -n hyperpod-inference-system \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```

## SageMaker HyperPod Inference release notes: v3.0
<a name="sagemaker-hyperpod-inference-release-notes-20260223"></a>

**Release Date:** February 23, 2026

**Summary**

Inference Operator 3.0 introduces EKS Add-on integration for simplified lifecycle management, Node Affinity support for granular scheduling control, and improved resource tagging. Existing Helm-based installations can be migrated to the EKS Add-on using the provided migration script. Update your Inference Operator execution role with new tagging permissions before upgrading.

**Key Features**
+ **EKS Add-on Integration** – Enterprise-grade lifecycle management with simplified installation experience
+ **Node Affinity** – Granular scheduling control for excluding spot instances, preferring availability zones, or targeting nodes with custom labels

For detailed information including prerequisites, upgrade instructions, and migration guidance, see the sections below.

### Prerequisites
<a name="sagemaker-hyperpod-inference-v3-0-prerequisites"></a>

Before upgrading the Helm chart to version 3.0, add additional tagging permissions to your Inference Operator execution role. To improve resource tagging and security, the Inference Operator now tags ALB, S3, and ACM resources, which requires additional permissions. Add the following statements to your Inference Operator execution role:

```
{
    "Sid": "CertificateTaggingPermission",
    "Effect": "Allow",
    "Action": [
        "acm:AddTagsToCertificate"
    ],
    "Resource": "arn:aws:acm:*:*:certificate/*"
},
{
    "Sid": "S3PutObjectTaggingAccess",
    "Effect": "Allow",
    "Action": [
        "s3:PutObjectTagging"
    ],
    "Resource": [
        "arn:aws:s3:::<TLS_BUCKET>/*"
    ]
}
```

Replace `<TLS_BUCKET>` with the name of your TLS bucket.
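
The statements above are fragments, not a complete policy document. One way to sanity-check the merged document before attaching it to the role is sketched below; the file name is arbitrary, and `<TLS_BUCKET>` remains a placeholder that you must replace.

```shell
# Write the complete policy document, with the new statements merged in.
cat > tagging-permissions.json <<'EOF'
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CertificateTaggingPermission",
            "Effect": "Allow",
            "Action": ["acm:AddTagsToCertificate"],
            "Resource": "arn:aws:acm:*:*:certificate/*"
        },
        {
            "Sid": "S3PutObjectTaggingAccess",
            "Effect": "Allow",
            "Action": ["s3:PutObjectTagging"],
            "Resource": ["arn:aws:s3:::<TLS_BUCKET>/*"]
        }
    ]
}
EOF

# Validate that the document is well-formed JSON before attaching it.
python3 -m json.tool tagging-permissions.json > /dev/null && echo "policy JSON is valid"
```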

### Upgrade to v3.0
<a name="sagemaker-hyperpod-inference-v3-0-upgrade"></a>

If you already have the Inference Operator installed via Helm, use the following commands to upgrade:

```
helm get values -n kube-system hyperpod-inference-operator \
> current-values.yaml

cd sagemaker-hyperpod-cli/helm_chart/HyperPodHelmChart/\
charts/inference-operator

helm upgrade hyperpod-inference-operator . -n kube-system \
  -f current-values.yaml --set image.tag=v3.0
    
# Verification
kubectl get deployment hyperpod-inference-operator-controller-manager \
  -n hyperpod-inference-system \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```

### Helm to EKS Add-on Migration
<a name="sagemaker-hyperpod-inference-v3-0-migration"></a>

If the Inference Operator was installed through Helm before version 3.0, we recommend migrating to the EKS Add-on so that you receive timely updates as new Inference Operator features are released. The following script migrates the SageMaker HyperPod Inference Operator from a Helm-based installation to an EKS Add-on installation.

**Overview:** The script takes a cluster name and region as parameters, retrieves the existing Helm installation configuration, and migrates to EKS Add-on deployment. It creates new IAM roles for the Inference Operator, ALB Controller, and KEDA Operator.

Before migrating the Inference Operator, the script ensures that the required dependencies (S3 CSI driver, FSx CSI driver, cert-manager, and metrics-server) exist. If they don't exist, it deploys them as Add-ons.

After the Inference Operator Add-on migration completes, the script also migrates S3, FSx, and other dependencies (ALB, KEDA, cert-manager, metrics-server) if they were originally installed via the Inference Operator Helm chart. Use `--skip-dependencies-migration` to skip this step for S3 CSI driver, FSx CSI driver, cert-manager, and metrics-server. Note that ALB and KEDA are installed as part of the Add-on in the same namespace as Inference Operator, and will be migrated as part of the Inference Operator Add-on.

**Important**  
During the migration, do not deploy new models; they will not be deployed until the migration is completed. Once the Inference Operator Add-on is in the ACTIVE state, you can deploy new models again. Migration typically takes 15 to 20 minutes and should complete within 30 minutes if only a few models are currently deployed.

**Migration Prerequisites:**
+ AWS CLI configured with appropriate credentials
+ kubectl configured with access to your EKS cluster
+ Helm installed
+ Existing Helm installation of hyperpod-inference-operator

**Note**  
Endpoints that are already running will not be interrupted during the migration process. Existing endpoints will continue to serve traffic without disruption throughout the migration.

**Getting the Migration Script:**

```
git clone https://github.com/aws/sagemaker-hyperpod-cli.git
cd sagemaker-hyperpod-cli/helm_chart/HyperPodHelmChart/\
charts/inference-operator/migration
```

**Usage:**

```
./helm_to_addon.sh [OPTIONS] \
  --cluster-name <cluster-name> (Required) \
  --region <region> (Required) \
  --helm-namespace kube-system (Optional) \
  --auto-approve (Optional) \
  --skip-dependencies-migration (Optional) \
  --s3-mountpoint-role-arn <s3-mountpoint-role-arn> (Optional) \
  --fsx-role-arn <fsx-role-arn> (Optional)
```

**Options:**
+ `--cluster-name NAME` – EKS cluster name (required)
+ `--region REGION` – AWS region (required)
+ `--helm-namespace NAMESPACE` – Namespace where Helm chart is installed (default: kube-system) (optional)
+ `--s3-mountpoint-role-arn ARN` – S3 Mountpoint CSI driver IAM role ARN (optional)
+ `--fsx-role-arn ARN` – FSx CSI driver IAM role ARN (optional)
+ `--auto-approve` – Skip confirmation prompts. Mutually exclusive with `--step-by-step` (optional)
+ `--step-by-step` – Pause after each major step for review. Mutually exclusive with `--auto-approve` (optional)
+ `--skip-dependencies-migration` – Skip migration of Helm-installed dependencies to Add-ons. Use this flag if the dependencies were not installed through the Inference Operator Helm chart, or if you want to manage them separately (optional)

**Examples:**

Basic migration (migrates dependencies):

```
./helm_to_addon.sh \
  --cluster-name my-cluster \
  --region us-east-1
```

Auto-approve without prompts:

```
./helm_to_addon.sh \
  --cluster-name my-cluster \
  --region us-east-1 \
  --auto-approve
```

Skip dependency migration for FSx, S3 mountpoint, cert manager and Metrics server:

```
./helm_to_addon.sh \
  --cluster-name my-cluster \
  --region us-east-1 \
  --skip-dependencies-migration
```

Provide existing S3 and FSx IAM roles:

```
./helm_to_addon.sh \
  --cluster-name my-cluster \
  --region us-east-1 \
  --s3-mountpoint-role-arn arn:aws:iam::123456789012:role/s3-csi-role \
  --fsx-role-arn arn:aws:iam::123456789012:role/fsx-csi-role
```

**Backup Location:**

Backups are stored in `/tmp/hyperpod-migration-backup-<timestamp>/`

Backups enable safe migration and recovery:
+ **Rollback on Failure** – If migration fails, the script can automatically restore your cluster to its pre-migration state using the backed up configurations
+ **Audit Trail** – Provides a complete record of what existed before migration for troubleshooting and compliance
+ **Configuration Reference** – Allows you to compare pre-migration and post-migration configurations
+ **Manual Recovery** – If needed, you can manually inspect and restore specific resources from the backup directory

**Rollback:**

If migration fails, the script prompts for user confirmation before initiating rollback to restore the previous state.

## SageMaker HyperPod Inference release notes: v2.3
<a name="sagemaker-hyperpod-inference-release-notes-20260203"></a>

**What's new**

This release introduces new optional fields in the Custom Resource Definitions (CRDs) to enhance deployment configuration flexibility.

**Features**
+ **Multi Instance Types**
  + **Enhanced deployment reliability** – Supports multi-instance type configurations with automatic failover to alternative instance types when preferred options lack capacity
  + **Intelligent resource scheduling** – Uses Kubernetes node affinity to prioritize instance types while guaranteeing deployment even when preferred resources are unavailable
  + **Optimized cost and performance** – Maintains your instance type preferences and prevents capacity-related failures during cluster fluctuations

**Bug Fixes**

Changes to the `invocationEndpoint` field in the spec of the `InferenceEndpointConfig` now take effect:
+ If the `invocationEndpoint` field is patched or updated, dependent resources, such as the `Ingress`, the Load Balancer, the `SageMakerEndpointRegistration`, and the SageMaker endpoint, are updated with normalization.
+ The `invocationEndpoint` value that you provide is stored as-is in the `InferenceEndpointConfig` spec itself. When this value is used to create a Load Balancer and, if enabled, a SageMaker endpoint, it is normalized to have exactly one leading forward slash.
  + `v1/chat/completions` is normalized to `/v1/chat/completions` for the `Ingress`, AWS Load Balancer, and SageMaker endpoint. For the `SageMakerEndpointRegistration`, it is displayed in its spec as `v1/chat/completions`.
  + `///invoke` is normalized to `/invoke` for the `Ingress`, AWS Load Balancer, and SageMaker endpoint. For the `SageMakerEndpointRegistration`, it is displayed in its spec as `invoke`.
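
The normalization rule can be sketched as a small shell function: strip any leading slashes, then add exactly one. This is an illustration of the behavior described above, not the operator's actual implementation, and the function name is hypothetical.

```shell
# Collapse any number of leading slashes to exactly one.
normalize_invocation_endpoint() {
  printf '/%s\n' "$(printf '%s' "$1" | sed 's|^/*||')"
}

normalize_invocation_endpoint "v1/chat/completions"   # /v1/chat/completions
normalize_invocation_endpoint "///invoke"             # /invoke
```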

**Installing Helm:**

Follow: [https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart)

If you want to install only the inference operator, after step 1 (`Set Up Your Helm Environment`), run `cd HyperPodHelmChart/charts/inference-operator`. Because you are then in the inference operator chart directory itself, replace `helm_chart/HyperPodHelmChart` with `.` wherever it appears in the commands.

**Upgrade Operator to v2.3 in case already installed:**

```
cd sagemaker-hyperpod-cli/helm_chart/HyperPodHelmChart/\
charts/inference-operator

helm get values -n kube-system hyperpod-inference-operator \
> current-values.yaml

helm upgrade hyperpod-inference-operator . \
  -n kube-system \
  -f current-values.yaml \
  --set image.tag=v2.3
```

# HyperPod in Studio
<a name="sagemaker-hyperpod-studio"></a>

You can launch machine learning workloads on Amazon SageMaker HyperPod clusters and view HyperPod cluster information in Amazon SageMaker Studio. The increased visibility into cluster details and hardware metrics can help your team identify the right candidate for your pre-training or fine-tuning workloads. 

A set of commands are available to help you get started when you launch Studio IDEs on a HyperPod cluster. You can work on your training scripts, use Docker containers for the training scripts, and submit jobs to the cluster, all from within the Studio IDEs. The following sections provide information on how to set this up, how to discover clusters and monitor their tasks, how to view cluster information, and how to connect to HyperPod clusters in IDEs within Studio.

**Topics**
+ [Setting up HyperPod in Studio](sagemaker-hyperpod-studio-setup.md)
+ [HyperPod tabs in Studio](sagemaker-hyperpod-studio-tabs.md)
+ [Connecting to HyperPod clusters and submitting tasks to clusters](sagemaker-hyperpod-studio-open.md)
+ [Troubleshooting](sagemaker-hyperpod-studio-troubleshoot.md)

# Setting up HyperPod in Studio
<a name="sagemaker-hyperpod-studio-setup"></a>

To access your clusters through Amazon SageMaker Studio, you need to set them up according to your choice of cluster orchestrator. In the following sections, choose the setup that matches your orchestrator.

The instructions assume that you already have your cluster set up. For information on the cluster orchestrators and how to set up, start with the HyperPod orchestrator pages:
+ [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md)
+ [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md)

**Topics**
+ [Setting up a Slurm cluster in Studio](sagemaker-hyperpod-studio-setup-slurm.md)
+ [Setting up an Amazon EKS cluster in Studio](sagemaker-hyperpod-studio-setup-eks.md)

# Setting up a Slurm cluster in Studio
<a name="sagemaker-hyperpod-studio-setup-slurm"></a>

The following instructions describe how to set up a HyperPod Slurm cluster in Studio.

1. Create a domain or have one ready. For information on creating a domain, see [Guide to getting set up with Amazon SageMaker AI](gs.md).

1. (Optional) Create and attach a custom FSx for Lustre volume to your domain. 

   1. Ensure that your FSx for Lustre file system exists in the same VPC as your intended domain and is in one of the subnets present in the domain.

   1. You can follow the instructions in [Adding a custom file system to a domain](domain-custom-file-system.md). 

1. (Optional) We recommend that you add tags to your clusters to ensure a smoother workflow. For information on how to add tags, see [Edit a SageMaker HyperPod cluster](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-edit-clusters) to update your cluster using the SageMaker AI console.

   1. Tag your cluster with your FSx for Lustre file system ID, `fs-id`. This helps you identify the file system when launching your Studio spaces. To do so, add the following tag to your cluster:

      Tag Key = “`hyperpod-cluster-filesystem`”, Tag Value = “`fs-id`”.

   1. Tag your cluster with your [Amazon Managed Grafana](https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html) workspace ID, `ws-id`. This tag is used to link directly to your Grafana workspace from your cluster in Studio. To do so, add the following tag to your cluster:

      Tag Key = “`grafana-workspace`”, Tag Value = “`ws-id`”.

1. Add the following permission to your execution role. 

   For information on SageMaker AI execution roles and how to edit them, see [Understanding domain space permissions and execution roles](execution-roles-and-spaces.md). 

   To learn how to attach policies to an IAM user or group, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html).

------
#### [ JSON ]


   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "ssm:StartSession",
                   "ssm:TerminateSession"
               ],
               "Resource": "*"
           },
           {
               "Effect": "Allow",
               "Action": [
                   "sagemaker:CreateCluster",
                   "sagemaker:ListClusters"
               ],
               "Resource": "*"
           },
           {
               "Effect": "Allow",
               "Action": [
                   "cloudwatch:PutMetricData",
                   "cloudwatch:GetMetricData"
               ],
               "Resource": "*"
           },
           {
               "Effect": "Allow",
               "Action": [
                   "sagemaker:DescribeCluster",
                   "sagemaker:DescribeClusterNode",
                   "sagemaker:ListClusterNodes",
                   "sagemaker:UpdateCluster",
                   "sagemaker:UpdateClusterSoftware"
               ],
               "Resource": "arn:aws:sagemaker:us-east-1:111122223333:cluster/*"
           }
       ]
   }
   ```

------

1. Add a tag to this IAM role, with Tag Key = “`SSMSessionRunAs`” and Tag Value = “`os user`”. The `os user` here is the same user that you set up for the Slurm cluster. Manage access to SageMaker HyperPod clusters at an IAM role or user level by using the Run As feature in [AWS Systems Manager Agent (SSM Agent)](https://docs.aws.amazon.com/systems-manager/latest/userguide/ssm-agent.html). With this feature, you can start each SSM session using the operating system (OS) user associated with the IAM role or user. 

   For information on how to add tags to your execution role, see [Tag IAM roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_tags_roles.html).

1. [Turn on Run As support for Linux and macOS managed nodes](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-preferences-run-as.html). The Run As settings are account-wide and are required for all SSM sessions to start successfully.

1. (Optional) [Restrict task view in Studio for Slurm clusters](#sagemaker-hyperpod-studio-setup-slurm-restrict-tasks-view). For information on viewable tasks in Studio, see [Tasks](sagemaker-hyperpod-studio-tabs.md#sagemaker-hyperpod-studio-tabs-tasks).

In Amazon SageMaker Studio, you can view your clusters by navigating to **HyperPod clusters** (under **Compute**).

## Restrict task view in Studio for Slurm clusters
<a name="sagemaker-hyperpod-studio-setup-slurm-restrict-tasks-view"></a>

You can restrict users to viewing only the Slurm tasks that they are authorized to view, without requiring manual input of namespaces or additional permission checks. The restriction is applied based on the user’s IAM role, providing a streamlined and secure user experience. The following section provides information on how to restrict task view in Studio for Slurm clusters. For information on viewable tasks in Studio, see [Tasks](sagemaker-hyperpod-studio-tabs.md#sagemaker-hyperpod-studio-tabs-tasks). 

All Studio users can view, manage, and interact with all Slurm cluster tasks by default. To restrict this, you can manage access to SageMaker HyperPod clusters at an IAM role or user level by using the **Run As** feature in [AWS Systems Manager Agent (SSM Agent)](https://docs.aws.amazon.com/systems-manager/latest/userguide/ssm-agent.html).

You can do this by tagging IAM roles with specific identifiers, such as their username or group. When a user accesses Studio, the Session Manager uses the Run As feature to execute commands as a specific Slurm user account that matches their IAM role tags. The Slurm configuration can be set up to limit task visibility based on the user account. The Studio UI will automatically filter tasks visible to that specific user account when commands are executed through the Run As feature. Once set up, each user assuming the role with the specified identifiers will have those Slurm tasks filtered based on the Slurm configuration. For information on how to add tags to your execution role, see [Tag IAM roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_tags_roles.html).

# Setting up an Amazon EKS cluster in Studio
<a name="sagemaker-hyperpod-studio-setup-eks"></a>

The following instructions describe how to set up an Amazon EKS cluster in Studio.

1. Create a domain or have one ready. For information on creating a domain, see [Guide to getting set up with Amazon SageMaker AI](gs.md).

1. Add the following permission to your execution role. 

   For information on SageMaker AI execution roles and how to edit them, see [Understanding domain space permissions and execution roles](execution-roles-and-spaces.md). 

   To learn how to attach policies to an IAM user or group, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html).

------
#### [ JSON ]


   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Sid": "DescribeHyerpodClusterPermissions",
               "Effect": "Allow",
               "Action": [
                   "sagemaker:DescribeCluster"
               ],
               "Resource": "arn:aws:sagemaker:us-east-1:111122223333:cluster/cluster-name"
           },
           {
               "Effect": "Allow",
               "Action": "ec2:Describe*",
               "Resource": "*"
           },
           {
               "Effect": "Allow",
               "Action": [
                   "ecr:CompleteLayerUpload",
                   "ecr:GetAuthorizationToken",
                   "ecr:UploadLayerPart",
                   "ecr:InitiateLayerUpload",
                   "ecr:BatchCheckLayerAvailability",
                   "ecr:PutImage"
               ],
               "Resource": "*"
           },
           {
               "Effect": "Allow",
               "Action": [
                   "cloudwatch:PutMetricData",
                   "cloudwatch:GetMetricData"
               ],
               "Resource": "*"
           },
           {
               "Sid": "UseEksClusterPermissions",
               "Effect": "Allow",
               "Action": [
                   "eks:DescribeCluster",
                   "eks:AccessKubernetesApi",
                   "eks:DescribeAddon"
               ],
               "Resource": "arn:aws:eks:us-east-1:111122223333:cluster/cluster-name"
           },
           {
               "Sid": "ListClustersPermission",
               "Effect": "Allow",
               "Action": [
                   "sagemaker:ListClusters"
               ],
               "Resource": "*"
           },
           {
               "Effect": "Allow",
               "Action": [
                   "ssm:StartSession",
                   "ssm:TerminateSession"
               ],
               "Resource": "*"
           }
       ]
   }
   ```

------

1. [Grant IAM users access to Kubernetes with EKS access entries](https://docs.aws.amazon.com/eks/latest/userguide/access-entries.html).

   1. Navigate to the Amazon EKS cluster associated with your HyperPod cluster.

   1. Choose the **Access** tab and [create an access entry](https://docs.aws.amazon.com/eks/latest/userguide/creating-access-entries.html) for the execution role you created. 

      1. In step 1, select the execution role that you created above in the **IAM principal** dropdown.

      1. In step 2, select a policy name and select an access scope that you want the users to have access to. 

1. (Optional) To ensure a smoother experience, we recommend that you add tags to your clusters. For information on how to add tags, see [Edit a SageMaker HyperPod cluster](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-edit-clusters) to update your cluster using the SageMaker AI console.

   1. Tag your cluster with your [Amazon Managed Grafana](https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html) workspace ID, `ws-id`. This tag is used to link directly to your Grafana workspace from your cluster in Studio. To do so, add the following tag to your cluster:

     Tag Key = “`grafana-workspace`”, Tag Value = “`ws-id`”.

1. (Optional) [Restrict task view in Studio for EKS clusters](#sagemaker-hyperpod-studio-setup-eks-restrict-tasks-view). For information on viewable tasks in Studio, see [Tasks](sagemaker-hyperpod-studio-tabs.md#sagemaker-hyperpod-studio-tabs-tasks).

## Restrict task view in Studio for EKS clusters
<a name="sagemaker-hyperpod-studio-setup-eks-restrict-tasks-view"></a>

You can restrict Kubernetes namespace permissions for users, so that they will only have access to view tasks belonging to a specified namespace. The following provides information on how to restrict the task view in Studio for EKS clusters. For information on viewable tasks in Studio, see [Tasks](sagemaker-hyperpod-studio-tabs.md#sagemaker-hyperpod-studio-tabs-tasks). 

By default, users have visibility into all EKS cluster tasks. You can restrict users’ visibility of EKS cluster tasks to specified namespaces, ensuring that users can access the resources they need while maintaining strict access controls.

Once the restriction is applied, you will need to provide the namespace to the users assuming the role. In the **Tasks** tab, Studio only displays the jobs of a namespace after the user inputs a namespace that they have permission to view. 

The following configuration allows administrators to grant specific, limited access to data scientists for viewing tasks within the cluster. This configuration grants the following permissions:
+ List and get pods
+ List and get events
+ Get Custom Resource Definitions (CRDs)

**YAML configuration:**

```
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pods-events-crd-cluster-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["get", "list"]
- apiGroups: ["apiextensions.k8s.io"]
  resources: ["customresourcedefinitions"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pods-events-crd-cluster-role-binding
subjects:
- kind: Group
  name: pods-events-crd-cluster-level
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: pods-events-crd-cluster-role
  apiGroup: rbac.authorization.k8s.io
```

1. Save the YAML configuration to a file named `cluster-role.yaml`.

1. Apply the configuration using [https://kubernetes.io/docs/reference/kubectl/](https://kubernetes.io/docs/reference/kubectl/):

   ```
   kubectl apply -f cluster-role.yaml
   ```

1. Verify the configuration:

   ```
   kubectl get clusterrole pods-events-crd-cluster-role
   kubectl get clusterrolebinding pods-events-crd-cluster-role-binding
   ```

1. Assign users to the `pods-events-crd-cluster-level` group through your identity provider or IAM.

# HyperPod tabs in Studio
<a name="sagemaker-hyperpod-studio-tabs"></a>

In Amazon SageMaker Studio, you can navigate to one of your clusters in **HyperPod clusters** (under **Compute**) and view your list of clusters. The displayed clusters contain information like tasks, hardware metrics, settings, and metadata details. This visibility can help your team identify the right candidate for your pre-training or fine-tuning workloads. The following sections provide information on each type of information.

## Tasks
<a name="sagemaker-hyperpod-studio-tabs-tasks"></a>

Amazon SageMaker HyperPod provides a view of your cluster tasks. Tasks are operations or jobs that are sent to the cluster. These can be machine learning operations, like training, running experiments, or inference. The following section provides information on your HyperPod cluster tasks.

In Amazon SageMaker Studio, you can navigate to one of your clusters in **HyperPod clusters** (under **Compute**) and view the **Tasks** information on your cluster. If you are having any issues with viewing tasks, see [Troubleshooting](sagemaker-hyperpod-studio-troubleshoot.md).

The task table includes:

------
#### [ For Slurm clusters ]

For Slurm clusters, the tasks currently in the Slurm job scheduler queue are shown in the table. The information shown for each task includes the task name, status, job ID, partition, run time, nodes, created by, and actions.

For a list and details about past jobs, use the [sacct](https://slurm.schedmd.com/sacct.html) command in JupyterLab or a Code Editor terminal. The `sacct` command displays *historical* information about jobs that have *finished* in the system. It provides accounting information, including job resource usage, such as memory, and exit status.
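The accounting output can also be post-processed programmatically. The following is a minimal sketch, assuming `sacct --parsable2` pipe-delimited output; the sample records below are illustrative, not real cluster data.

```python
# Hypothetical sketch: summarize `sacct --parsable2` output by job state.
import csv
import io

def summarize_sacct(parsable2_text):
    """Count job records from `sacct --parsable2` output, grouped by State."""
    reader = csv.DictReader(io.StringIO(parsable2_text), delimiter="|")
    summary = {}
    for row in reader:
        # States like "CANCELLED by 0" are collapsed to their first word.
        state = row["State"].split()[0]
        summary[state] = summary.get(state, 0) + 1
    return summary

# Illustrative sample, as if captured with: sacct --parsable2 -o JobID,JobName,State,ExitCode,MaxRSS
sample = """JobID|JobName|State|ExitCode|MaxRSS
101|train-llm|COMPLETED|0:0|512000K
102|eval|FAILED|1:0|24000K
103|train-llm|COMPLETED|0:0|498000K"""

print(summarize_sacct(sample))  # {'COMPLETED': 2, 'FAILED': 1}
```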

By default, all Studio users can view, manage, and interact with all available Slurm tasks. To restrict which tasks Studio users can view, see [Restrict task view in Studio for Slurm clusters](sagemaker-hyperpod-studio-setup-slurm.md#sagemaker-hyperpod-studio-setup-slurm-restrict-tasks-view).

------
#### [ For Amazon EKS clusters ]

For Amazon EKS clusters, Kubeflow (PyTorch, MPI, TensorFlow) tasks are shown in the table. PyTorch tasks are shown by default. You can filter by PyTorch, MPI, and TensorFlow under **Task type**. The information shown for each task includes the task name, status, namespace, priority class, and creation time.

By default, all users can view jobs across all namespaces. To restrict the viewable Kubernetes namespaces available to Studio users, see [Restrict task view in Studio for EKS clusters](sagemaker-hyperpod-studio-setup-eks.md#sagemaker-hyperpod-studio-setup-eks-restrict-tasks-view). If a user cannot view the tasks and is asked to provide a namespace, they need to get that information from the administrator. 

------

## Metrics
<a name="sagemaker-hyperpod-studio-tabs-metrics"></a>

Amazon SageMaker HyperPod provides a view of your Slurm or Amazon EKS cluster utilization metrics. The following provides information on your HyperPod cluster metrics. 

For Amazon EKS clusters, you must install the Amazon CloudWatch Observability EKS add-on to view the following metrics. For more information, see [Install the Amazon CloudWatch Observability EKS add-on](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-EKS-addon.html).

In Amazon SageMaker Studio, you can navigate to one of your clusters in **HyperPod clusters** (under **Compute**) and view the **Metrics** details on your cluster. Metrics provides a comprehensive view of cluster utilization metrics, including hardware, team, and task metrics. This includes compute availability and usage, team allocation and utilization, and task run and wait time information. 

## Settings
<a name="sagemaker-hyperpod-studio-tabs-settings"></a>

Amazon SageMaker HyperPod provides a view of your cluster settings. The following provides information on your HyperPod cluster settings.

In Amazon SageMaker Studio you can navigate to one of your clusters in **HyperPod clusters** (under **Compute**) and view the **Settings** information on your cluster. The information includes the following:
+ **Instances** details, including instance ID, status, instance type, and instance group
+ **Instance groups** details, including instance group name, type, counts, and compute information
+ **Orchestration** details, including the orchestrator, version, and certification authority
+ **Cluster resiliency** details
+ **Security** details, including subnets and security groups
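The same instance and orchestration details shown in the **Settings** tab can be retrieved programmatically from a DescribeCluster response. The following is an illustrative sketch; the trimmed response below is a hypothetical example of the API's shape, not real cluster data.

```python
# Hypothetical, trimmed DescribeCluster-style response for illustration.
response = {
    "ClusterName": "ml-training-cluster",
    "ClusterStatus": "InService",
    "InstanceGroups": [
        {"InstanceGroupName": "controller", "InstanceType": "ml.c5.xlarge",
         "CurrentCount": 1, "TargetCount": 1},
        {"InstanceGroupName": "gpu-compute", "InstanceType": "ml.p4d.24xlarge",
         "CurrentCount": 8, "TargetCount": 8},
    ],
}

def instance_group_summary(resp):
    """Return (group name, instance type, current count, target count) tuples."""
    return [
        (g["InstanceGroupName"], g["InstanceType"], g["CurrentCount"], g["TargetCount"])
        for g in resp["InstanceGroups"]
    ]

for name, itype, cur, tgt in instance_group_summary(response):
    print(f"{name}: {cur}/{tgt} x {itype}")
```

With credentials configured, a real response would come from `boto3.client("sagemaker").describe_cluster(ClusterName=...)`.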

## Details
<a name="sagemaker-hyperpod-studio-tabs-details"></a>

Amazon SageMaker HyperPod provides a view of your cluster metadata details. The following paragraph provides information on how to get your HyperPod cluster details.

In Amazon SageMaker Studio, you can navigate to one of your clusters in **HyperPod clusters** (under **Compute**) and view the **Details** on your cluster. This includes the tags, logs, and metadata.

# Connecting to HyperPod clusters and submitting tasks to clusters
<a name="sagemaker-hyperpod-studio-open"></a>

You can launch machine learning workloads on HyperPod clusters within Amazon SageMaker Studio IDEs. When you launch Studio IDEs on a HyperPod cluster, a set of commands is available to help you get started. You can work on your training scripts, use Docker containers for the training scripts, and submit jobs to the cluster, all from within the Studio IDEs. The following section provides information on how to connect your cluster to Studio IDEs.

In Amazon SageMaker Studio you can navigate to one of your clusters in **HyperPod clusters** (under **Compute**) and view your list of clusters. You can connect your cluster to an IDE listed under **Actions**. 

You can also choose your custom file system from the list of options. For information on how to set this up, see [Setting up HyperPod in Studio](sagemaker-hyperpod-studio-setup.md).

Alternatively, you can create a space and launch an IDE using the AWS CLI. Use the following commands to do so. The following example creates a `Private` `JupyterLab` space for `user-profile-name` with the `fs-id` FSx for Lustre file system attached.

1. Create a space using the [create-space](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/create-space.html) AWS CLI command.

   ```
   aws sagemaker create-space \
   --region your-region \
   --ownership-settings "OwnerUserProfileName=user-profile-name" \
   --space-sharing-settings "SharingType=Private" \
   --space-settings "AppType=JupyterLab,CustomFileSystems=[{FSxLustreFileSystem={FileSystemId=fs-id}}]"
   ```

1. Create the app using the [create-app](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/create-app.html) AWS CLI command.

   ```
   aws sagemaker create-app \
   --region your-region \
   --space-name space-name \
   --resource-spec '{"ec2InstanceType":"'"instance-type"'","appEnvironmentArn":"'"image-arn"'"}'
   ```

Once you have your applications open, you can submit tasks directly to the clusters you are connected to. 
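The same two-step flow can be expressed as AWS SDK for Python (Boto3) request payloads. The following is a hypothetical sketch; the domain ID, space name, app name, file system ID, and instance type are placeholder values, not real resources.

```python
# Hypothetical Boto3 request payloads mirroring the CLI steps above.
space_request = {
    "DomainId": "d-example123",          # placeholder domain ID
    "SpaceName": "my-hyperpod-space",    # placeholder space name
    "OwnershipSettings": {"OwnerUserProfileName": "user-profile-name"},
    "SpaceSharingSettings": {"SharingType": "Private"},
    "SpaceSettings": {
        "AppType": "JupyterLab",
        "CustomFileSystems": [
            {"FSxLustreFileSystem": {"FileSystemId": "fs-id"}}
        ],
    },
}

app_request = {
    "DomainId": "d-example123",
    "SpaceName": "my-hyperpod-space",
    "AppType": "JupyterLab",
    "AppName": "default",                # placeholder app name
    "ResourceSpec": {"InstanceType": "ml.t3.medium"},  # placeholder type
}

# With AWS credentials configured, these would be sent as:
#   sagemaker = boto3.client("sagemaker")
#   sagemaker.create_space(**space_request)
#   sagemaker.create_app(**app_request)
print(space_request["SpaceSettings"]["AppType"])  # JupyterLab
```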

# Troubleshooting
<a name="sagemaker-hyperpod-studio-troubleshoot"></a>

The following section lists troubleshooting solutions for HyperPod in Studio.

**Topics**
+ [Tasks tab](#sagemaker-hyperpod-studio-troubleshoot-tasks)
+ [Metrics tab](#sagemaker-hyperpod-studio-troubleshoot-metrics)

## Tasks tab
<a name="sagemaker-hyperpod-studio-troubleshoot-tasks"></a>

If you get a "Custom Resource Definition (CRD) is not configured on the cluster" error in the **Tasks** tab:
+ Grant `EKSAdminViewPolicy` and `ClusterAccessRole` policies to your domain execution role. 

  For information on how to add tags to your execution role, see [Tag IAM roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_tags_roles.html).

  To learn how to attach policies to an IAM user or group, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html).

If the tasks grid for Slurm metrics doesn't stop loading in the **Tasks** tab:
+ Ensure that `RunAs` is enabled in your [AWS Session Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager.html) preferences and that the role you are using has the `SSMSessionRunAs` tag attached.
  + To enable `RunAs`, navigate to the **Preferences** tab in the [Systems Manager console](https://console.aws.amazon.com/systems-manager/session-manager).
  + For more information, see [Turn on Run As support for Linux and macOS managed nodes](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-preferences-run-as.html).

For restricted task view in Studio for EKS clusters:
+ If your execution role doesn’t have permissions to list namespaces for EKS clusters.
  + See [Restrict task view in Studio for EKS clusters](sagemaker-hyperpod-studio-setup-eks.md#sagemaker-hyperpod-studio-setup-eks-restrict-tasks-view).
+ If users are experiencing issues with access for EKS clusters.

  1. Verify RBAC is enabled by running the following command.

     ```
     kubectl api-versions | grep rbac
     ```

     This should return `rbac.authorization.k8s.io/v1`.

  1. Check if the `ClusterRole` and `ClusterRoleBinding` exist by running the following commands.

     ```
     kubectl get clusterrole pods-events-crd-cluster-role
     kubectl get clusterrolebinding pods-events-crd-cluster-role-binding
     ```

  1. Verify user group membership. Ensure the user is correctly assigned to the `pods-events-crd-cluster-level` group in your identity provider or IAM.
+ If user can't see any resources.
  + Verify group membership and ensure the `ClusterRoleBinding` is correctly applied.
+ If users can see resources in all namespaces.
  + If namespace restriction is required, consider using `Role` and `RoleBinding` instead of `ClusterRole` and `ClusterRoleBinding`.
+ If configuration appears correct, but permissions aren't applied.
  + Check if there are any `NetworkPolicies` or `PodSecurityPolicies` interfering with access.

## Metrics tab
<a name="sagemaker-hyperpod-studio-troubleshoot-metrics"></a>

If no Amazon CloudWatch metrics are displayed in the **Metrics** tab:
+ The **Metrics** section of the HyperPod cluster details uses CloudWatch to fetch the data. To see the metrics in this section, you must have enabled [Cluster and task observability](sagemaker-hyperpod-eks-cluster-observability-cluster.md). Contact your administrator to configure metrics.

# SageMaker HyperPod references
<a name="sagemaker-hyperpod-ref"></a>

Find more information and references about using SageMaker HyperPod in the following topics.

**Topics**
+ [SageMaker HyperPod pricing](#sagemaker-hyperpod-ref-pricing)
+ [SageMaker HyperPod APIs](#sagemaker-hyperpod-ref-api)
+ [SageMaker HyperPod Slurm configuration](#sagemaker-hyperpod-ref-slurm-configuration)
+ [SageMaker HyperPod DLAMI](#sagemaker-hyperpod-ref-hyperpod-ami)
+ [SageMaker HyperPod API permissions reference](#sagemaker-hyperpod-ref-api-permissions)
+ [SageMaker HyperPod commands in AWS CLI](#sagemaker-hyperpod-ref-cli)
+ [SageMaker HyperPod Python modules in AWS SDK for Python (Boto3)](#sagemaker-hyperpod-ref-boto3)

## SageMaker HyperPod pricing
<a name="sagemaker-hyperpod-ref-pricing"></a>

The following topics provide information about SageMaker HyperPod pricing. For details on the price per hour for SageMaker HyperPod instances, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

**Capacity requests**

You can allocate on-demand or reserved compute capacity with SageMaker AI for use on SageMaker HyperPod. On-demand cluster creation allocates available capacity from the SageMaker AI on-demand capacity pool. Alternatively, you can request reserved capacity to ensure access by submitting a ticket for a quota increase. Inbound capacity requests are prioritized by SageMaker AI and you receive an estimated time for capacity allocation.

**Service billing**

When you provision compute capacity on SageMaker HyperPod, you are billed for the duration of the capacity allocation. SageMaker HyperPod billing appears in your anniversary bills with a line item for the type of capacity allocation (on-demand or reserved), the instance type, and the time spent using the instance.
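The duration-based billing model above can be sketched as simple arithmetic. The hourly rate below is made up for illustration; see Amazon SageMaker Pricing for actual per-hour rates.

```python
# Hypothetical billing arithmetic for a capacity allocation. The rate is
# illustrative only, not a real SageMaker HyperPod price.
def allocation_cost(instance_count, hours_allocated, hourly_rate_per_instance):
    """Billed cost = instances x allocated hours x per-instance hourly rate."""
    return instance_count * hours_allocated * hourly_rate_per_instance

# e.g. 8 instances held for a 72-hour allocation at a made-up $10/hour rate
print(allocation_cost(8, 72, 10.0))  # 5760.0
```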

To submit a ticket for a quota increase, see [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

## SageMaker HyperPod APIs
<a name="sagemaker-hyperpod-ref-api"></a>

The following is the full set of SageMaker HyperPod APIs for submitting action requests in JSON format to SageMaker AI through the AWS CLI or the AWS SDK for Python (Boto3).
+ [BatchDeleteClusterNodes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_BatchDeleteClusterNodes.html)
+ [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html)
+ [DeleteCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteCluster.html)
+ [DescribeCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeCluster.html)
+ [DescribeClusterNode](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeClusterNode.html)
+ [ListClusterNodes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListClusterNodes.html)
+ [ListClusters](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListClusters.html)
+ [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html)
+ [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html)

## SageMaker HyperPod Slurm configuration
<a name="sagemaker-hyperpod-ref-slurm-configuration"></a>

HyperPod supports two approaches for configuring Slurm on your cluster. Choose the approach that best fits your needs.


| Approach | Description | Recommended for |
| --- | --- | --- |
| API-driven configuration | Define Slurm configuration directly in the CreateCluster and UpdateCluster API requests | New clusters; simplified management |
| Legacy configuration | Use a separate provisioning_parameters.json file stored in Amazon S3 | Existing clusters; backward compatibility |

### API-driven Slurm configuration (Recommended)
<a name="sagemaker-hyperpod-ref-slurm-api-driven"></a>

With API-driven configuration, you define Slurm node types, partition assignments, and filesystem mounts directly in the CreateCluster and UpdateCluster API requests. This approach provides:
+ **Single source of truth** – All configuration in the API request
+ **No S3 file management** – No need to create or maintain `provisioning_parameters.json`
+ **Built-in validation** – API validates Slurm topology before cluster creation
+ **Drift detection** – Detects unauthorized changes to `slurm.conf`
+ **Per-instance-group storage** – Configure different FSx filesystems for different instance groups
+ **FSx for OpenZFS support** – Mount OpenZFS filesystems in addition to FSx for Lustre

#### SlurmConfig (per instance group)
<a name="sagemaker-hyperpod-ref-slurm-config"></a>

Add `SlurmConfig` to each instance group to define the Slurm node type and partition assignment.

```
"SlurmConfig": {
    "NodeType": "Controller | Login | Compute",
    "PartitionNames": ["string"]
}
```

**Parameters:**
+ `NodeType` – Required. The Slurm node type for this instance group. Valid values:
  + `Controller` – Slurm controller (head) node. Runs the `slurmctld` daemon. Exactly one instance group must have this node type.
  + `Login` – Login node for user access. Optional. At most one instance group can have this node type.
  + `Compute` – Worker nodes that run jobs. Multiple instance groups can have this node type.
**Important**  
`NodeType` is immutable. Once set during cluster creation, it cannot be changed. To use a different node type, create a new instance group.
+ `PartitionNames` – Conditional. An array of Slurm partition names. Required for `Compute` node types; not allowed for `Controller` or `Login` node types. Currently supports a single partition name per instance group.
**Note**  
All nodes are automatically added to the universal `dev` partition in addition to their specified partition.

**Example:**

```
{
    "InstanceGroupName": "gpu-compute",
    "InstanceType": "ml.p4d.24xlarge",
    "InstanceCount": 8,
    "SlurmConfig": {
        "NodeType": "Compute",
        "PartitionNames": ["gpu-training"]
    },
    "LifeCycleConfig": {
        "SourceS3Uri": "s3://sagemaker-bucket/lifecycle/src/",
        "OnCreate": "on_create.sh"
    },
    "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodRole"
}
```
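The topology rules above can be sketched as a local check. This is an illustrative sketch only, not an official validator; the API performs its own validation, and the group definitions below are placeholders.

```python
# Minimal sketch of the SlurmConfig topology rules: exactly one Controller,
# at most one Login, and PartitionNames required on (and only on) Compute groups.
def validate_slurm_topology(instance_groups):
    errors = []
    node_types = [g["SlurmConfig"]["NodeType"] for g in instance_groups]
    if node_types.count("Controller") != 1:
        errors.append("exactly one Controller instance group is required")
    if node_types.count("Login") > 1:
        errors.append("at most one Login instance group is allowed")
    for g in instance_groups:
        cfg = g["SlurmConfig"]
        has_partitions = bool(cfg.get("PartitionNames"))
        if cfg["NodeType"] == "Compute" and not has_partitions:
            errors.append(f"{g['InstanceGroupName']}: Compute groups require PartitionNames")
        if cfg["NodeType"] != "Compute" and has_partitions:
            errors.append(f"{g['InstanceGroupName']}: PartitionNames not allowed on this node type")
    return errors

groups = [
    {"InstanceGroupName": "controller", "SlurmConfig": {"NodeType": "Controller"}},
    {"InstanceGroupName": "gpu-compute",
     "SlurmConfig": {"NodeType": "Compute", "PartitionNames": ["gpu-training"]}},
]
print(validate_slurm_topology(groups))  # []
```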

#### Orchestrator.Slurm (cluster level)
<a name="sagemaker-hyperpod-ref-slurm-orchestrator"></a>

Add `Orchestrator.Slurm` to the cluster configuration to specify how HyperPod manages the `slurm.conf` file.

```
"Orchestrator": {
    "Slurm": {
        "SlurmConfigStrategy": "Managed | Overwrite | Merge"
    }
}
```

**Parameters:**
+ `SlurmConfigStrategy` – Required when `Orchestrator.Slurm` is provided. Controls how HyperPod manages the `slurm.conf` file on the controller node. Valid values:
  + `Managed` (default) – HyperPod fully controls the partition-node mappings in `slurm.conf`. Drift detection is enabled: if the current `slurm.conf` differs from the expected configuration, UpdateCluster fails with an error. Use this strategy when you want HyperPod to be the single source of truth for Slurm configuration.
  + `Overwrite` – HyperPod forces the API configuration to be applied, overwriting any manual changes to `slurm.conf`. Drift detection is disabled. Use this strategy to recover from drift or reset the cluster to a known state.
  + `Merge` – HyperPod preserves manual `slurm.conf` changes and merges them with the API configuration. Drift detection is disabled. Use this strategy if you need to make manual Slurm configuration changes that should persist across updates.

**Note**  
If `Orchestrator.Slurm` is omitted from the request, the default behavior is the `Managed` strategy.

**Tip**  
You can change `SlurmConfigStrategy` at any time using UpdateCluster. There is no lock-in to a specific strategy.

**Example:**

```
{
    "ClusterName": "my-hyperpod-cluster",
    "InstanceGroups": [...],
    "Orchestrator": {
        "Slurm": {
            "SlurmConfigStrategy": "Managed"
        }
    }
}
```

#### SlurmConfigStrategy comparison
<a name="sagemaker-hyperpod-ref-slurm-strategy-comparison"></a>


| Strategy | Drift detection | Manual changes | Use case |
| --- | --- | --- | --- |
| Managed | Enabled – blocks updates if drift is detected | Blocked | HyperPod managed |
| Overwrite | Disabled | Overwritten | Recovery from drift; reset to a known state |
| Merge | Disabled | Preserved | Advanced users with custom slurm.conf needs |
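Switching strategies with UpdateCluster can be sketched as the following request payload. This is a hypothetical sketch: the cluster name is a placeholder, and a real UpdateCluster request also carries your instance group definitions.

```python
# Hypothetical UpdateCluster payload that switches the strategy to Overwrite,
# e.g. to recover from detected drift. Cluster name is a placeholder.
update_request = {
    "ClusterName": "my-hyperpod-cluster",
    "Orchestrator": {
        "Slurm": {
            # Overwrite applies the API configuration, discarding manual
            # slurm.conf changes; drift detection is disabled.
            "SlurmConfigStrategy": "Overwrite"
        }
    },
}

# With credentials configured, this would be sent as:
#   boto3.client("sagemaker").update_cluster(**update_request)
print(update_request["Orchestrator"]["Slurm"]["SlurmConfigStrategy"])  # Overwrite
```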

#### FSx configuration via InstanceStorageConfigs
<a name="sagemaker-hyperpod-ref-slurm-fsx-config"></a>

With API-driven configuration, you can configure FSx filesystems per instance group using `InstanceStorageConfigs`. This allows different instance groups to mount different filesystems.

**Prerequisites:**
+ Your cluster must use a custom VPC (via `VpcConfig`). FSx filesystems reside in your VPC, and the platform-managed VPC cannot reach them.
+ At least one instance group must have `SlurmConfig` with `NodeType: Controller`.

##### FsxLustreConfig
<a name="sagemaker-hyperpod-ref-slurm-fsx-lustre"></a>

Configure FSx for Lustre filesystem mounting for an instance group.

```
"InstanceStorageConfigs": [
    {
        "FsxLustreConfig": {
            "DnsName": "string",
            "MountPath": "string",
            "MountName": "string"
        }
    }
]
```

**Parameters:**
+ `DnsName` – Required. The DNS name of the FSx for Lustre filesystem. Example: `fs-0abc123def456789.fsx.us-west-2.amazonaws.com`
+ `MountPath` – Optional. The local mount path on the instance. Default: `/fsx`
+ `MountName` – Required. The mount name of the FSx for Lustre filesystem. You can find this in the Amazon FSx console or by running `aws fsx describe-file-systems`.

##### FsxOpenZfsConfig
<a name="sagemaker-hyperpod-ref-slurm-fsx-openzfs"></a>

Configure FSx for OpenZFS filesystem mounting for an instance group.

```
"InstanceStorageConfigs": [
    {
        "FsxOpenZfsConfig": {
            "DnsName": "string",
            "MountPath": "string"
        }
    }
]
```

**Parameters:**
+ `DnsName` – Required. The DNS name of the FSx for OpenZFS filesystem. Example: `fs-0xyz987654321.fsx.us-west-2.amazonaws.com`
+ `MountPath` – Optional. The local mount path on the instance. Default: `/home`

**Note**  
Each instance group can have at most one `FsxLustreConfig` and one `FsxOpenZfsConfig`.

**Example with multiple filesystems:**

```
{
    "InstanceGroupName": "gpu-compute",
    "InstanceType": "ml.p4d.24xlarge",
    "InstanceCount": 4,
    "SlurmConfig": {
        "NodeType": "Compute",
        "PartitionNames": ["gpu-training"]
    },
    "InstanceStorageConfigs": [
        {
            "FsxLustreConfig": {
                "DnsName": "fs-0abc123def456789.fsx.us-west-2.amazonaws.com",
                "MountPath": "/fsx",
                "MountName": "abcdefgh"
            }
        },
        {
            "FsxOpenZfsConfig": {
                "DnsName": "fs-0xyz987654321.fsx.us-west-2.amazonaws.com",
                "MountPath": "/shared"
            }
        },
        {
            "EbsVolumeConfig": {
                "VolumeSizeInGB": 500
            }
        }
    ],
    "LifeCycleConfig": {
        "SourceS3Uri": "s3://sagemaker-bucket/lifecycle/src/",
        "OnCreate": "on_create.sh"
    },
    "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodRole"
}
```

**Important**  
FSx configuration changes only apply during node provisioning. Existing nodes retain their original FSx configuration. To apply new FSx configuration to all nodes, scale down the instance group to 0, then scale back up.
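The at-most-one constraint on each FSx config type can be sketched as a local check. This is illustrative only; the DNS names below reuse the placeholder values from the example above.

```python
# Sketch of the constraint: each instance group can have at most one
# FsxLustreConfig and at most one FsxOpenZfsConfig. Illustrative only.
def check_storage_configs(instance_storage_configs):
    lustre = sum(1 for c in instance_storage_configs if "FsxLustreConfig" in c)
    openzfs = sum(1 for c in instance_storage_configs if "FsxOpenZfsConfig" in c)
    return lustre <= 1 and openzfs <= 1

configs = [
    {"FsxLustreConfig": {"DnsName": "fs-0abc123def456789.fsx.us-west-2.amazonaws.com",
                         "MountPath": "/fsx", "MountName": "abcdefgh"}},
    {"FsxOpenZfsConfig": {"DnsName": "fs-0xyz987654321.fsx.us-west-2.amazonaws.com",
                          "MountPath": "/shared"}},
    {"EbsVolumeConfig": {"VolumeSizeInGB": 500}},
]
print(check_storage_configs(configs))  # True
```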

#### Complete API-driven configuration example
<a name="sagemaker-hyperpod-ref-slurm-complete-example"></a>

The following example shows a complete CreateCluster request using API-driven Slurm configuration:

```
{
    "ClusterName": "ml-training-cluster",
    "InstanceGroups": [
        {
            "InstanceGroupName": "controller",
            "InstanceType": "ml.c5.xlarge",
            "InstanceCount": 1,
            "SlurmConfig": {
                "NodeType": "Controller"
            },
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://sagemaker-us-west-2-111122223333/lifecycle/src/",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodRole",
            "ThreadsPerCore": 2
        },
        {
            "InstanceGroupName": "login",
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "SlurmConfig": {
                "NodeType": "Login"
            },
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://sagemaker-us-west-2-111122223333/lifecycle/src/",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodRole",
            "ThreadsPerCore": 2
        },
        {
            "InstanceGroupName": "gpu-compute",
            "InstanceType": "ml.p4d.24xlarge",
            "InstanceCount": 8,
            "SlurmConfig": {
                "NodeType": "Compute",
                "PartitionNames": ["gpu-training"]
            },
            "InstanceStorageConfigs": [
                {
                    "FsxLustreConfig": {
                        "DnsName": "fs-0abc123def456789.fsx.us-west-2.amazonaws.com",
                        "MountPath": "/fsx",
                        "MountName": "abcdefgh"
                    }
                }
            ],
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://sagemaker-us-west-2-111122223333/lifecycle/src/",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodRole",
            "ThreadsPerCore": 2,
            "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"]
        },
        {
            "InstanceGroupName": "cpu-compute",
            "InstanceType": "ml.c5.18xlarge",
            "InstanceCount": 4,
            "SlurmConfig": {
                "NodeType": "Compute",
                "PartitionNames": ["cpu-preprocessing"]
            },
            "InstanceStorageConfigs": [
                {
                    "FsxLustreConfig": {
                        "DnsName": "fs-0abc123def456789.fsx.us-west-2.amazonaws.com",
                        "MountPath": "/fsx",
                        "MountName": "abcdefgh"
                    }
                }
            ],
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://sagemaker-us-west-2-111122223333/lifecycle/src/",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodRole",
            "ThreadsPerCore": 2
        }
    ],
    "Orchestrator": {
        "Slurm": {
            "SlurmConfigStrategy": "Managed"
        }
    },
    "VpcConfig": {
        "SecurityGroupIds": ["sg-0abc123def456789a"],
        "Subnets": ["subnet-0abc123def456789a", "subnet-0abc123def456789b"]
    },
    "Tags": [
        {
            "Key": "Project",
            "Value": "ML-Training"
        }
    ]
}
```

To learn more about using API-driven configuration, see [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md).

### Legacy configuration: provisioning_parameters.json
<a name="sagemaker-hyperpod-ref-provisioning-forms"></a>

**Note**  
The `provisioning_parameters.json` approach is the legacy method for configuring Slurm on HyperPod. For new clusters, we recommend using the API-driven configuration approach described above. The legacy approach remains fully supported for backward compatibility.

With the legacy approach, you create a Slurm configuration file named `provisioning_parameters.json` and upload it to Amazon S3 as part of your lifecycle scripts. HyperPod reads this file during cluster creation to configure Slurm nodes.

#### Configuration form for provisioning_parameters.json
<a name="sagemaker-hyperpod-ref-provisioning-forms-slurm"></a>

The following code is the Slurm configuration form that you should prepare to properly set up Slurm nodes on your HyperPod cluster. Complete this form and upload it as part of a set of lifecycle scripts during cluster creation. To learn how to prepare this form throughout the HyperPod cluster creation process, see [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md).

```
// Save as provisioning_parameters.json.
{
    "version": "1.0.0",
    "workload_manager": "slurm",
    "controller_group": "string",
    "login_group": "string",
    "worker_groups": [
        {
            "instance_group_name": "string",
            "partition_name": "string"
        }
    ],
    "fsx_dns_name": "string",
    "fsx_mountname": "string"
}
```

**Parameters:**
+ `version` – Required. The version of the HyperPod provisioning parameter form. Keep it set to `1.0.0`.
+ `workload_manager` – Required. Specifies which workload manager to configure on the HyperPod cluster. Keep it set to `slurm`.
+ `controller_group` – Required. The name of the HyperPod cluster instance group to assign to the Slurm controller (head) node.
+ `login_group` – Optional. The name of the HyperPod cluster instance group to assign to the Slurm login node.
+ `worker_groups` – Required. Sets up the Slurm worker (compute) nodes on the HyperPod cluster.
  + `instance_group_name` – Required. The name of the HyperPod instance group to assign to a Slurm worker (compute) node.
  + `partition_name` – Required. The partition name to assign to the node.
+ `fsx_dns_name` – Optional. The Amazon FSx DNS name, if you want your Slurm nodes on the HyperPod cluster to communicate with Amazon FSx.
+ `fsx_mountname` – Optional. The Amazon FSx mount name, if you want your Slurm nodes on the HyperPod cluster to communicate with Amazon FSx.
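Assembling and sanity-checking the legacy file can be sketched as follows. The instance group names and FSx values are placeholders taken from the examples in this section.

```python
# Sketch: assemble and validate a legacy provisioning_parameters.json.
# All values below are illustrative placeholders.
import json

params = {
    "version": "1.0.0",
    "workload_manager": "slurm",
    "controller_group": "controller",
    "login_group": "login",
    "worker_groups": [
        {"instance_group_name": "gpu-compute", "partition_name": "gpu-training"}
    ],
    "fsx_dns_name": "fs-0abc123def456789.fsx.us-west-2.amazonaws.com",
    "fsx_mountname": "abcdefgh",
}

# Minimal check for the keys this section marks as Required.
required = {"version", "workload_manager", "controller_group", "worker_groups"}
missing = required - params.keys()
assert not missing, f"missing required keys: {missing}"

# Write the file to upload to S3 alongside your lifecycle scripts.
with open("provisioning_parameters.json", "w") as f:
    json.dump(params, f, indent=4)
print(params["workload_manager"])  # slurm
```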

### Comparison: API-driven vs. legacy configuration
<a name="sagemaker-hyperpod-ref-slurm-comparison"></a>


| Feature | API-driven (Recommended) | Legacy (provisioning_parameters.json) |
| --- | --- | --- |
| Configuration location | CreateCluster API request | S3 file |
| FSx for Lustre | Yes – Per instance group | Yes – Cluster-wide only |
| FSx for OpenZFS | Yes – Per instance group | No – Not supported |
| Built-in validation | Yes | No |
| Drift detection | Yes – (Managed strategy) | No |
| S3 file management | Not required | Required |
| Lifecycle script complexity | Simplified | Full Slurm setup required |

## SageMaker HyperPod DLAMI
<a name="sagemaker-hyperpod-ref-hyperpod-ami"></a>

SageMaker HyperPod runs a DLAMI based on:
+ [AWS Deep Learning Base GPU AMI (Ubuntu 20.04)](https://aws.amazon.com/releasenotes/aws-deep-learning-base-gpu-ami-ubuntu-20-04/) for orchestration with Slurm.
+ Amazon Linux 2 based AMI for orchestration with Amazon EKS.

The SageMaker HyperPod DLAMI is bundled with additional packages to support open source tools, such as Slurm and Kubernetes, along with their dependencies, and with SageMaker HyperPod cluster software packages that support resiliency features such as cluster health checks and auto-resume. To keep up with HyperPod software updates that the HyperPod service team distributes through DLAMIs, see [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md).

## SageMaker HyperPod API permissions reference
<a name="sagemaker-hyperpod-ref-api-permissions"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

When you set up access control for running SageMaker HyperPod API operations, or write a permissions policy that you can attach to IAM identities for cloud administrators, use the following table as a reference.


| Amazon SageMaker API operation | Required permissions (API actions) | Resources |
| --- | --- | --- |
| CreateCluster | sagemaker:CreateCluster | arn:aws:sagemaker:region:account-id:cluster/cluster-id |
| DeleteCluster | sagemaker:DeleteCluster | arn:aws:sagemaker:region:account-id:cluster/cluster-id |
| DescribeCluster | sagemaker:DescribeCluster | arn:aws:sagemaker:region:account-id:cluster/cluster-id |
| DescribeClusterNode | sagemaker:DescribeClusterNode | arn:aws:sagemaker:region:account-id:cluster/cluster-id |
| ListClusterNodes | sagemaker:ListClusterNodes | arn:aws:sagemaker:region:account-id:cluster/cluster-id |
| ListClusters | sagemaker:ListClusters | arn:aws:sagemaker:region:account-id:cluster/cluster-id |
| UpdateCluster | sagemaker:UpdateCluster | arn:aws:sagemaker:region:account-id:cluster/cluster-id |
| UpdateClusterSoftware | sagemaker:UpdateClusterSoftware | arn:aws:sagemaker:region:account-id:cluster/cluster-id |

For a complete list of permissions and resource types for SageMaker APIs, see [Actions, resources, and condition keys for Amazon SageMaker AI](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonsagemaker.html) in the *AWS Service Authorization Reference*.
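A policy document granting a subset of these actions on a single cluster can be sketched as follows. The region, account ID, and cluster ID are placeholders; this is an illustration of the table above, not a recommended production policy.

```python
# Illustrative sketch: build an IAM policy document granting the listed
# HyperPod actions on one cluster ARN. All identifiers are placeholders.
import json

def hyperpod_policy(region, account_id, cluster_id, actions):
    """Return an IAM policy dict scoped to a single HyperPod cluster ARN."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": [f"sagemaker:{a}" for a in actions],
            "Resource": f"arn:aws:sagemaker:{region}:{account_id}:cluster/{cluster_id}",
        }],
    }

policy = hyperpod_policy("us-west-2", "111122223333", "abcd1234",
                         ["DescribeCluster", "ListClusterNodes"])
print(json.dumps(policy, indent=2))
```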

## SageMaker HyperPod commands in AWS CLI
<a name="sagemaker-hyperpod-ref-cli"></a>

The following are the AWS CLI commands for SageMaker HyperPod to run the core [HyperPod API operations](#sagemaker-hyperpod-ref-api).
+ [batch-delete-cluster-nodes](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/batch-delete-cluster-nodes.html)
+ [create-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-cluster.html)
+ [delete-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-cluster.html)
+ [describe-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-cluster.html)
+ [describe-cluster-node](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-cluster-node.html)
+ [list-cluster-nodes](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-cluster-nodes.html)
+ [list-clusters](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-clusters.html)
+ [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html)
+ [update-cluster-software](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster-software.html)

## SageMaker HyperPod Python modules in AWS SDK for Python (Boto3)
<a name="sagemaker-hyperpod-ref-boto3"></a>

The following are the methods of the AWS SDK for Python (Boto3) client for SageMaker AI to run the core [HyperPod API operations](#sagemaker-hyperpod-ref-api).
+ [batch\_delete\_cluster\_nodes](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/batch_delete_cluster_nodes.html)
+ [create\_cluster](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_cluster.html)
+ [delete\_cluster](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/delete_cluster.html)
+ [describe\_cluster](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/describe_cluster.html)
+ [describe\_cluster\_node](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/describe_cluster_node.html)
+ [list\_cluster\_nodes](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/list_cluster_nodes.html)
+ [list\_clusters](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/list_clusters.html)
+ [update\_cluster](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/update_cluster.html)
+ [update\_cluster\_software](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/update_cluster_software.html)
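
As an illustration of how the `create_cluster` method is called, the request parameters can be sketched as a plain Python dictionary. The parameter names below follow the CreateCluster API reference; every value (cluster name, instance type, S3 URI, role ARN) is a placeholder, not a working configuration.

```python
# Sketch of the parameters you might pass to the boto3 SageMaker client's
# create_cluster method. Parameter names follow the CreateCluster API
# reference; all values are placeholders.
create_cluster_params = {
    "ClusterName": "my-hyperpod-cluster",
    "InstanceGroups": [
        {
            "InstanceGroupName": "controller-group",
            "InstanceType": "ml.c5.xlarge",
            "InstanceCount": 1,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole",
            "ThreadsPerCore": 1,
        },
    ],
}

# With AWS credentials configured, you would send this request as:
# import boto3
# sagemaker = boto3.client("sagemaker")
# response = sagemaker.create_cluster(**create_cluster_params)
```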

# Amazon SageMaker HyperPod release notes
<a name="sagemaker-hyperpod-release-notes"></a>

This topic covers release notes that track updates, fixes, and new features for Amazon SageMaker HyperPod, including general feature releases, updates, and improvements.

The HyperPod AMI releases are documented separately and include information about key components, such as general AMI releases, versions, and dependencies. For information related to HyperPod AMI releases, see [Amazon SageMaker HyperPod AMI](sagemaker-hyperpod-release-ami.md).

## SageMaker HyperPod release notes: April 16, 2026
<a name="sagemaker-hyperpod-release-notes-20260416"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features**
+ **Flexible instance groups** – You can now create instance groups with multiple instance types using the new `InstanceRequirements` parameter. This enables priority-based provisioning, where HyperPod attempts to provision the highest-priority instance type first and falls back to lower-priority types if capacity is unavailable. Flexible instance groups simplify Karpenter auto-scaling configurations by reducing the number of instance groups needed. You can specify up to 20 instance types per instance group. For more information, see [Flexible instance groups](sagemaker-hyperpod-scaling-eks.md#sagemaker-hyperpod-scaling-eks-flexible-ig).

## SageMaker HyperPod release notes: January 25, 2026
<a name="sagemaker-hyperpod-release-notes-20260125"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features**
+ Released the new SageMaker HyperPod AMI for Amazon EKS 1.34. For more information, see [SageMaker Hyperpod AMI releases for Amazon EKS: January 25, 2026](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20260125).

For more information, see [Kubernetes v1.34](https://kubernetes.io/blog/2025/08/27/kubernetes-v1-34-release/).

## SageMaker HyperPod release notes: November 07, 2025
<a name="sagemaker-hyperpod-release-notes-20251107"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features**
+ Upgraded security patches. For more information, see [SageMaker HyperPod AMI releases for Amazon EKS: November 07, 2025](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20251107).

## SageMaker HyperPod release notes: September 29, 2025
<a name="sagemaker-hyperpod-release-notes-20250929"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features**
+ Released the new SageMaker HyperPod AMI for Amazon EKS 1.33. For more information, see [SageMaker HyperPod AMI releases for Amazon EKS: September 29, 2025](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20250929).
**Important**  
The Dynamic Resource Allocation beta Kubernetes API is enabled by default in this release.  
This API improves scheduling and monitoring of workloads that require resources such as GPUs.
This API was developed by the open source Kubernetes community and might change in future versions of Kubernetes. Before you use the API, review the [Kubernetes documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/) and understand how it affects your workloads.
HyperPod is not releasing a HyperPod Amazon Linux 2 AMI for Kubernetes 1.33. AWS recommends that you migrate to AL2023. For more information, see [Upgrade from Amazon Linux 2 to AL2023](https://docs.aws.amazon.com/eks/latest/userguide/al2023.html).

For more information, see [Kubernetes v1.33](https://kubernetes.io/blog/2025/04/23/kubernetes-v1-33-release/).

## SageMaker HyperPod release notes: August 4, 2025
<a name="sagemaker-hyperpod-release-notes-20250804"></a>

SageMaker HyperPod releases new public AMIs for EKS orchestration. Public AMIs can be used by themselves, or they can be used to create custom AMIs. For more information about the public AMIs, see [Public AMI releases](sagemaker-hyperpod-release-public-ami.md). For more information about creating a custom AMI, see [Custom Amazon Machine Images (AMIs) for SageMaker HyperPod clusters](hyperpod-custom-ami-support.md). 

## SageMaker HyperPod release notes: July 31, 2025
<a name="sagemaker-hyperpod-release-notes-20250731"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features and improvements**
+ Released a new AMI that updates the operating system from Amazon Linux 2 to Amazon Linux 2023 for EKS clusters. Key upgrades include Linux Kernel 6.1, Python 3.10, NVIDIA Driver 560.35.03, and DNF package manager replacing YUM.
**Important**  
The update from Amazon Linux 2 to AL2023 introduces significant changes that might affect compatibility with software and configurations designed for AL2. We strongly recommend testing your applications with AL2023 before fully upgrading your clusters.

  For more information about the new AMI and how to upgrade your clusters, see [SageMaker HyperPod AMI releases for Amazon EKS: July 31, 2025](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20250731).

## SageMaker HyperPod release notes: May 13, 2025
<a name="sagemaker-hyperpod-release-notes-20250513"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features and improvements**
+ Released an updated AMI that supports Ubuntu 22.04 LTS for Slurm clusters. This release includes several system and software component upgrades to provide improved performance, updated features, and enhanced security.
**Important**  
The update from Ubuntu 20.04 LTS to Ubuntu 22.04 LTS introduces changes that might affect compatibility with software and configurations designed for Ubuntu 20.04.

  For more information, see:
  + [Key updates in the Ubuntu 22.04 AMI](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-ami-slurm-ubuntu22-updates)
  + [Upgrading to the Ubuntu 22.04 AMI](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-ami-slurm-ubuntu22-upgrade)
  + [Troubleshooting upgrade failures](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-ami-slurm-ubuntu22-troubleshoot)

## SageMaker HyperPod release notes: May 1, 2025
<a name="sagemaker-hyperpod-release-notes-20250501"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features**
+ Added usage reporting for EKS-orchestrated clusters, allowing organizations to implement transparent, usage-based cost allocation across teams, projects, or departments. This feature complements HyperPod’s [Task Governance](sagemaker-hyperpod-eks-operate-console-ui-governance.md) functionality to ensure fair cost distribution in shared multi-tenant AI/ML environments. For more information, see [Reporting Compute Usage in HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-usage-reporting.html).

## SageMaker HyperPod release notes: April 28, 2025
<a name="sagemaker-hyperpod-release-notes-20250428"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md) and [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features and improvements**
+ Upgraded NVIDIA driver from version 550.144.03 to 550.163.01. This upgrade is to address Common Vulnerabilities and Exposures (CVEs) present in the [NVIDIA GPU Display Security Bulletin for April 2025](https://nvidia.custhelp.com/app/answers/detail/a_id/5630).

For information about related AMI releases, see [SageMaker HyperPod AMI releases for Slurm: April 28, 2025](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-release-ami-slurm-20250428) and [SageMaker HyperPod AMI releases for Amazon EKS: April 28, 2025](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20250428).

## SageMaker HyperPod release notes: April 18, 2025
<a name="sagemaker-hyperpod-release-notes-20250418"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features**
+ Released new SageMaker HyperPod AMI for Amazon EKS 1.32.1. For more information, see [SageMaker HyperPod AMI releases for Amazon EKS: April 18, 2025](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20250418).

## SageMaker HyperPod release notes: April 10, 2025
<a name="sagemaker-hyperpod-release-notes-20250410"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features and improvements**
+ Added a Direct Preference Optimization (DPO) recipe tutorial for SageMaker HyperPod with Slurm orchestration. This fine-tuning tutorial provides step-by-step guidance for optimizing model alignment using the DPO method on GPU-powered SageMaker HyperPod Slurm clusters. For more information, see [HyperPod Slurm cluster DPO tutorial (GPU)](hyperpod-gpu-slurm-dpo-tutorial.md).

## SageMaker HyperPod release notes: April 03, 2025
<a name="sagemaker-hyperpod-release-notes-20250403"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md) and [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features and improvements**
+ Added a [Quickstart](sagemaker-hyperpod-quickstart.md) page for deploying SageMaker HyperPod clusters. The page leverages streamlined setup workflows from SageMaker HyperPod’s specialized workshops and automates deployment using prebuilt AWS CloudFormation templates. It supports infrastructure preferences like Slurm or Amazon EKS, for easy configuration and deployment of baseline clusters.
+ SageMaker HyperPod now supports the following instance types for both Slurm and Amazon EKS clusters.
  + New instance types: I3en, M7i, R7i instances. For the full list of supported instances, see the `InstanceType` field in [ClusterInstanceGroupDetails](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ClusterInstanceGroupDetails.html).

## SageMaker HyperPod release notes: March 16, 2025
<a name="sagemaker-hyperpod-release-notes-20250316"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md) and [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features and improvements**
+ Added the following IAM condition keys for more granular access control in the [CreateCluster](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_CreateCluster.html) and [UpdateCluster](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_UpdateCluster.html) API operations.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-release-notes.html)

## SageMaker HyperPod release notes: February 20, 2025
<a name="sagemaker-hyperpod-release-notes-20250220"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md) and [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features and improvements**
+ Added support for deleting instance groups from your SageMaker HyperPod cluster. For more information, see [Delete instance groups](smcluster-scale-down.md#smcluster-remove-instancegroup) from EKS-orchestrated clusters and [Scale down a cluster](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-scale-down) for Slurm-orchestrated clusters. 

## SageMaker HyperPod release notes: February 18, 2025
<a name="sagemaker-hyperpod-release-notes-20250218"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md) and [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features**
+ This release of SageMaker HyperPod incorporates a security update from the Nvidia container toolkit (from version 1.17.3 to version 1.17.4). For more information, see [v1.17.4 release note](https://github.com/NVIDIA/nvidia-container-toolkit/releases/tag/v1.17.4). 
**Note**  
For all container workloads in the Nvidia container toolkit version 1.17.4, the mounting of CUDA compatibility libraries is now disabled. To ensure compatibility with multiple CUDA versions on container workflows, update your `LD_LIBRARY_PATH` to include your CUDA compatibility libraries. You can find the specific steps in [If you use a CUDA compatibility layer](inference-gpu-drivers.md#collapsible-cuda-compat).
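
As a sketch of the `LD_LIBRARY_PATH` update described in the note, the following helper prepends a compatibility-library directory to a process environment before launching a workload. The directory path and helper name are assumptions for illustration; use the actual location of your CUDA compatibility libraries.

```python
import os

# Hypothetical example path; substitute the directory that holds your
# CUDA compatibility libraries.
CUDA_COMPAT_DIR = "/usr/local/cuda/compat"

def with_cuda_compat(env: dict, compat_dir: str = CUDA_COMPAT_DIR) -> dict:
    """Return a copy of env with compat_dir prepended to LD_LIBRARY_PATH."""
    new_env = dict(env)
    existing = new_env.get("LD_LIBRARY_PATH", "")
    new_env["LD_LIBRARY_PATH"] = (
        compat_dir if not existing else f"{compat_dir}{os.pathsep}{existing}"
    )
    return new_env

# Example: build the environment you would pass to a child process
# (for instance via subprocess.Popen(..., env=child_env)).
child_env = with_cuda_compat({"LD_LIBRARY_PATH": "/usr/lib/x86_64-linux-gnu"})
```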

For information about related AMI releases, see [SageMaker HyperPod AMI releases for Slurm: February 18, 2025](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-release-ami-slurm-20250218) and [SageMaker HyperPod AMI releases for Amazon EKS: February 18, 2025](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20250218).

## SageMaker HyperPod release notes: February 06, 2025
<a name="sagemaker-hyperpod-release-notes-20250206"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md) and [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features and improvements**
+ Enhanced SageMaker HyperPod multi-AZ support: You can specify different subnets and security groups, cutting across different Availability Zones, for individual instance groups within your cluster. For more information about SageMaker HyperPod multi-AZ support, see [Setting up SageMaker HyperPod clusters across multiple AZs](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-multiple-availability-zones).

## SageMaker HyperPod release notes: January 22, 2025
<a name="sagemaker-hyperpod-release-notes-20250122"></a>

**AMI releases**
+ [SageMaker HyperPod AMI releases for Amazon EKS: January 22, 2025](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20250122)

## SageMaker HyperPod release notes: January 09, 2025
<a name="sagemaker-hyperpod-release-notes-20250109"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md) and [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features and improvements**
+ Added IPv6 support: Clusters can use IPv6 addressing when configured with IPv6-enabled VPC and subnets. For more information, see [Setting up SageMaker HyperPod with a custom Amazon VPC](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-optional-vpc).

## SageMaker HyperPod release notes: December 21, 2024
<a name="sagemaker-hyperpod-release-notes-20241221"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md) and [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features**
+ SageMaker HyperPod now supports the following instance types for both Slurm and Amazon EKS clusters.
  + New instance types: C6gn, C6i, M6i, R6i.
  + New Trainium instance types: Trn1 and Trn1n.

**Improvements**
+ Enhanced error logging visibility when Slurm interrupts jobs, and prevented unnecessary job step termination during Slurm-initiated job cancellations.
+ Updated base DLAMI for p5en for both Slurm and Amazon EKS clusters.

**AMI releases**
+ [SageMaker HyperPod AMI releases for Slurm: December 21, 2024](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-release-ami-slurm-20241221)
+ [SageMaker HyperPod AMI releases for Amazon EKS: December 21, 2024](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20241221)

## SageMaker HyperPod release notes: December 13, 2024
<a name="sagemaker-hyperpod-release-notes-20241213"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md) and [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New feature**
+ SageMaker HyperPod releases a set of Amazon CloudWatch metrics to monitor the health and performance of SageMaker HyperPod Slurm clusters. These metrics cover CPU, GPU, and memory utilization, and cluster instance information such as node counts and failed nodes. This monitoring feature is enabled by default, and the metrics can be accessed under the `/aws/sagemaker/Clusters` CloudWatch namespace. You can also set up CloudWatch alarms based on these metrics to proactively detect and address potential issues within your Slurm-based HyperPod clusters. For more information, see [Amazon SageMaker HyperPod Slurm metrics](smcluster-slurm-metrics.md).
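
As an illustration, an alarm on a metric in the `/aws/sagemaker/Clusters` namespace might be defined with parameters like the following. The namespace comes from this release note, but the metric name, dimension, and threshold are hypothetical placeholders; check the actual metric names in [Amazon SageMaker HyperPod Slurm metrics](smcluster-slurm-metrics.md) before using them.

```python
# Sketch of CloudWatch put_metric_alarm parameters for a HyperPod Slurm
# cluster metric. The namespace is from the release note; the metric
# name, dimension value, and thresholds are hypothetical placeholders.
alarm_params = {
    "AlarmName": "hyperpod-failed-nodes",
    "Namespace": "/aws/sagemaker/Clusters",
    "MetricName": "ExampleFailedNodeCount",  # hypothetical metric name
    "Dimensions": [{"Name": "ClusterName", "Value": "my-hyperpod-cluster"}],
    "Statistic": "Maximum",
    "Period": 300,
    "EvaluationPeriods": 1,
    "Threshold": 0,
    "ComparisonOperator": "GreaterThanThreshold",
}

# With AWS credentials configured:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
```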

**AMI releases**
+ [SageMaker HyperPod AMI releases for Amazon EKS: December 13, 2024](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20241213)

## SageMaker HyperPod release notes: November 24, 2024
<a name="sagemaker-hyperpod-release-notes-20241124"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md) and [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features**
+ Added support for configuring SageMaker HyperPod clusters across multiple Availability Zones. For more information about SageMaker HyperPod multi-AZ support, see [Setting up SageMaker HyperPod clusters across multiple AZs](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-multiple-availability-zones).

**AMI releases**
+ [SageMaker HyperPod AMI releases for Slurm: November 24, 2024](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-release-ami-slurm-20241124)
+ [SageMaker HyperPod AMI releases for Amazon EKS: November 24, 2024](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20241124)

## SageMaker HyperPod release notes: November 15, 2024
<a name="sagemaker-hyperpod-release-notes-20241115"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md) and [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md). For more information, see [SageMaker HyperPod AMI releases for Amazon EKS: November 15, 2024](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20241115).

**New features and improvements**
+ Added support for trn1 and trn1n instance types for both Amazon EKS and Slurm orchestrated clusters.
+ Improved log management for Slurm clusters:
  +  Implemented log rotation: weekly or daily based on size.
  +  Set log retention to 3 weeks.
  +  Compressed logs to reduce storage impact.
  +  Continued uploading logs to CloudWatch for long-term retention.
**Note**  
Some logs are still stored in syslogs.
+ Adjusted Fluent Bit settings to prevent tracking issues with files containing long lines.

**Bug fixes**
+ Prevented unintended truncation with Slurm controller node updates in configuration file `slurm.config`.

**AMI releases**
+ [SageMaker HyperPod AMI releases for Slurm: November 15, 2024](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-release-ami-slurm-20241115)
+ [SageMaker HyperPod AMI releases for Amazon EKS: November 15, 2024](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20241115)

## SageMaker HyperPod release notes: November 11, 2024
<a name="sagemaker-hyperpod-release-notes-20241111"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md) and [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New feature**
+ SageMaker HyperPod AMI now supports G6e instance types.

**AMI releases**
+ [SageMaker HyperPod AMI releases for Slurm: November 11, 2024](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-release-ami-slurm-20241111)
+ [SageMaker HyperPod AMI releases for Amazon EKS: November 11, 2024](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20241111)

## SageMaker HyperPod release notes: October 31, 2024
<a name="sagemaker-hyperpod-release-notes-20241031"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md) and [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features**
+ Added support for scaling down SageMaker HyperPod clusters at the instance group level and instance level for both Amazon EKS and Slurm orchestrated clusters. For more information about scaling down Amazon EKS clusters, see [Scaling down a SageMaker HyperPod cluster](smcluster-scale-down.md). For more information about scaling down Slurm clusters, see *Scale down a cluster* in [Managing SageMaker HyperPod Slurm clusters using the AWS CLI](sagemaker-hyperpod-operate-slurm-cli-command.md).
+ SageMaker HyperPod now supports the P5e instance type for both Amazon EKS and Slurm orchestrated clusters. 

## SageMaker HyperPod release notes: October 21, 2024
<a name="sagemaker-hyperpod-release-notes-20241021"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md) and [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New feature**
+ SageMaker HyperPod now supports the P5e[n], G6, Gr6, and Trn2[n] instance types for both Slurm and Amazon EKS clusters.

**AMI releases**
+ [SageMaker HyperPod AMI releases for Slurm: October 21, 2024](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-release-ami-slurm-20241021)
+ [SageMaker HyperPod AMI releases for Amazon EKS: October 21, 2024](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20241021)

## SageMaker HyperPod release notes: September 10, 2024
<a name="sagemaker-hyperpod-release-notes-20240910"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md) and [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features**
+ Added Amazon EKS support in SageMaker HyperPod. To learn more, see [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).
+ Added support for managing SageMaker HyperPod clusters through CloudFormation and Terraform. For more information about managing HyperPod clusters through CloudFormation, see [CloudFormation documentation for `AWS::SageMaker::Cluster`](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-sagemaker-cluster.html). To learn about managing HyperPod clusters through Terraform, see [Terraform documentation for `awscc_sagemaker_cluster`](https://registry.terraform.io/providers/hashicorp/awscc/latest/docs/data-sources/sagemaker_cluster).

**AMI releases**
+ [SageMaker HyperPod AMI releases for Slurm: September 10, 2024](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-release-ami-slurm-20240910)
+ [SageMaker HyperPod AMI releases for Amazon EKS: September 10, 2024](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20240910)

## SageMaker HyperPod release notes: August 20, 2024
<a name="sagemaker-hyperpod-release-notes-20240820"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features**
+ Enhanced the [SageMaker HyperPod auto-resume functionality](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm.html#sagemaker-hyperpod-resiliency-slurm-auto-resume), extending the resiliency capability for Slurm nodes attached with Generic RESources (GRES).

  When [Generic Resources (GRES)](https://slurm.schedmd.com/gres.html) are attached to a Slurm node, Slurm typically doesn't permit changes in the node allocation, such as replacing nodes, and thus doesn't allow a failed job to resume. Unless explicitly forbidden, the HyperPod auto-resume functionality automatically re-queues any faulty job associated with the GRES-enabled nodes. This process involves stopping the job, placing it back into the job queue, and then restarting the job from the beginning.

**Other changes**
+ Pre-packaged [slurmrestd](https://slurm.schedmd.com/slurmrestd.html) in the SageMaker HyperPod AMI.
+ Changed the default values for `ResumeTimeout` and `UnkillableStepTimeout` from 60 seconds to 300 seconds in `slurm.conf` to improve system responsiveness and job handling.
+ Made minor improvements on health checks for NVIDIA Data Center GPU Manager (DCGM) and the NVIDIA System Management Interface (nvidia-smi).

**Bug fixes**
+ Fixed the HyperPod auto-resume plug-in so that it can use idle nodes to resume a job.

## SageMaker HyperPod release notes: June 20, 2024
<a name="sagemaker-hyperpod-release-notes-20240620"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features**
+ Added a new capability of attaching additional storage to SageMaker HyperPod cluster instances. With this capability, you can configure supplementary storage at the instance group configuration level during the cluster creation or update processes, either through the SageMaker HyperPod console or the [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) and [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) APIs. The additional EBS volume is attached to each instance within a SageMaker HyperPod cluster and mounted to `/opt/sagemaker`. To learn more about implementing it in your SageMaker HyperPod cluster, see the updated documentation on the following pages.
  + [Getting started with SageMaker HyperPod](smcluster-getting-started-slurm.md)
  + [SageMaker HyperPod Slurm cluster operations](sagemaker-hyperpod-operate-slurm.md)

  Note that you need to update the HyperPod cluster software to use this capability. After patching the HyperPod cluster software, you can use this capability for existing SageMaker HyperPod clusters created before June 20, 2024 by adding new instance groups. This capability is fully effective for any SageMaker HyperPod clusters created after June 20, 2024.

**Upgrade steps**
+ Run the following command to call the [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API to update your existing HyperPod clusters with the latest HyperPod DLAMI. To find more instructions, see [Update the SageMaker HyperPod platform software of a cluster](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software). 
**Important**  
Back up your work before running this API. The patching process replaces the root volume with the updated AMI, which means that your previous data stored in the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre. For more information, see [Use the backup script provided by SageMaker HyperPod](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software-backup).

  ```
   aws sagemaker update-cluster-software --cluster-name your-cluster-name
  ```
**Note**  
Run the AWS CLI command to update your HyperPod cluster. Updating the HyperPod software through the SageMaker HyperPod console UI is currently not available.

## SageMaker HyperPod release notes: April 24, 2024
<a name="sagemaker-hyperpod-release-notes-20240424"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**Bug fixes**
+ Fixed a bug with the `ThreadsPerCore` parameter in the [ClusterInstanceGroupSpecification](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ClusterInstanceGroupSpecification.html) API. With the fix, the [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) and [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) APIs properly take and apply the user input through `ThreadsPerCore`. This fix is effective on HyperPod clusters created after April 24, 2024. If you encountered this bug and want the fix applied to your cluster, you need to create a new cluster. Make sure that you back up and restore your work while moving to a new cluster by following the instructions at [Use the backup script provided by SageMaker HyperPod](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software-backup).

## SageMaker HyperPod release notes: March 27, 2024
<a name="sagemaker-hyperpod-release-notes-20240327"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**HyperPod software patch**

The HyperPod service team distributes software patches through [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami). See the following details about the latest HyperPod DLAMI.
+ In this release of the HyperPod DLAMI, Slurm is built with the REST service (`slurmrestd`) with JSON, YAML, and JWT support.
+ Upgraded [Slurm](https://slurm.schedmd.com/documentation.html) to v23.11.3.

**Improvements**
+ Increased auto-resume service timeout to 60 minutes.
+ Improved instance replacement process to not restart the Slurm controller.
+ Improved error messages from running lifecycle scripts, such as download errors and instance health check errors on instance start-up.

**Bug fixes**
+ Fixed a bug with chrony service that caused an issue with time synchronization.
+ Fixed a bug with parsing `slurm.conf`.
+ Fixed an issue with [NVIDIA `go-dcgm`](https://github.com/NVIDIA/go-dcgm) library.

## SageMaker HyperPod release notes: March 14, 2024
<a name="sagemaker-hyperpod-release-notes-20240314"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**Improvements**
+ HyperPod now properly supports passing partition names provided through `provisioning_parameters.json` and creates partitions appropriately based on provided inputs. For more information about `provisioning_parameters.json`, see [Legacy configuration: provisioning\_parameters.json](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-provisioning-forms) and [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md).

**AMI releases**
+ [SageMaker HyperPod AMI releases for Slurm: March 14, 2024](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-release-ami-slurm-20240314)

## SageMaker HyperPod release notes: February 15, 2024
<a name="sagemaker-hyperpod-release-notes-20240215"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features**
+ Added a new `UpdateClusterSoftware` API for SageMaker HyperPod security patching. When security patches become available, we recommend that you update existing SageMaker HyperPod clusters in your account by running `aws sagemaker update-cluster-software --cluster-name your-cluster-name`. To stay current with future security patches, keep track of this Amazon SageMaker HyperPod release notes page. To learn how the `UpdateClusterSoftware` API works, see [Update the SageMaker HyperPod platform software of a cluster](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software).

## SageMaker HyperPod release notes: November 29, 2023
<a name="sagemaker-hyperpod-release-notes-20231129"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features**
+ Launched Amazon SageMaker HyperPod at AWS re:Invent 2023.

**AMI releases**
+ [SageMaker HyperPod AMI release for Slurm: November 29, 2023](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-release-ami-slurm-20231129)

# Amazon SageMaker HyperPod AMI
<a name="sagemaker-hyperpod-release-ami"></a>

Amazon SageMaker HyperPod Amazon Machine Images (AMIs) are specialized machine images for distributed machine learning workloads and high-performance computing. These AMIs enhance base images with essential components including GPU drivers and AWS Neuron accelerator support.

Key components added to HyperPod AMIs include:
+ [Public AMIs](sagemaker-hyperpod-release-public-ami.md) with support for [building custom AMIs](hyperpod-custom-ami-support.md)
+ Advanced orchestration tools:
  + [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md)
  + [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md)
+ Cluster management dependencies
+ Built-in resiliency features:
  + cluster health check
  + auto-resume capabilities
+ Support for HyperPod cluster management and configuration

These enhancements are built upon the following base Deep Learning AMIs (DLAMIs):
+ [AWS Deep Learning Base GPU AMI (Ubuntu 20.04)](https://aws.amazon.com/releasenotes/aws-deep-learning-base-gpu-ami-ubuntu-20-04/) for orchestration with Slurm.
+ Amazon Linux 2 or Amazon Linux 2023 based AMI for orchestration with Amazon EKS.

Choose your HyperPod AMIs based on your orchestration preference:
+ For Slurm orchestration, see [SageMaker HyperPod AMI releases for Slurm](sagemaker-hyperpod-release-ami-slurm.md).
+ For Amazon EKS orchestration, see [SageMaker HyperPod AMI releases for Amazon EKS](sagemaker-hyperpod-release-ami-eks.md).

For information about Amazon SageMaker HyperPod feature releases, see [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md).

# Update your AMI version in your SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-release-ami-update"></a>

Amazon SageMaker HyperPod Amazon Machine Images (AMIs) are specialized machine images for distributed machine learning workloads and high-performance computing. Each AMI comes pre-loaded with drivers, machine learning frameworks, training libraries, and performance monitoring tools. By updating the AMI version in your cluster, you can use the latest versions of these components and packages for your training jobs and workflows.

When updating the AMI version within your cluster, you can process the update immediately, schedule a one-time update, or use a cron expression to create a recurring schedule. You can also choose to update all of the instances in an instance group or update instances in batches. If you choose to update in batches, you set the percentage or number of instances that SageMaker AI should upgrade at a time, and you set an interval for how long SageMaker AI should wait between batches.

If you choose to update in batches, you can also include a list of alarms and metrics. During the wait interval, SageMaker AI observes these metrics; if any metric exceeds its threshold, the corresponding alarm goes into the ALARM state and SageMaker AI rolls back the AMI update. To use automatic rollbacks, your IAM execution role must have the `cloudwatch:DescribeAlarms` permission.

**Note**  
Updating your cluster in batches is available only for HyperPod clusters integrated with Amazon EKS. Also, if you're creating multiple schedules, we recommend leaving a time buffer between them. If schedules overlap, updates might fail.
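As a sketch of a batch update with automatic rollback, a deployment configuration might look like the following. The field names follow the `UpdateClusterSoftware` API reference; the alarm name, batch percentage, and wait interval are placeholder values chosen for illustration.

```shell
# Hypothetical deployment config for update-cluster-software: update 20%
# of instances per batch, wait 10 minutes between batches, and roll back
# if the named CloudWatch alarm goes into the ALARM state during the wait.
cat > deployment-config.json <<'EOF'
{
  "RollingUpdatePolicy": {
    "MaximumBatchSize": { "Type": "CAPACITY_PERCENTAGE", "Value": 20 },
    "RollbackMaximumBatchSize": { "Type": "CAPACITY_PERCENTAGE", "Value": 100 }
  },
  "WaitIntervalInSeconds": 600,
  "AutoRollbackConfiguration": [
    { "AlarmName": "hyperpod-gpu-error-rate" }
  ]
}
EOF

# Validate the JSON locally before passing it to the CLI.
python3 -m json.tool deployment-config.json > /dev/null && echo "deployment-config.json OK"
```

You would then pass the file with, for example, `aws sagemaker update-cluster-software --cluster-name your-cluster-name --deployment-config file://deployment-config.json`.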

For more information about each AMI release for your HyperPod cluster, see [Amazon SageMaker HyperPod AMI](sagemaker-hyperpod-release-ami.md). For more information about general HyperPod releases, see [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md).

You can use the SageMaker AI API or CLI operations to update your cluster or see scheduled updates for a specific cluster. If you're using the AWS console, follow these steps:

**Note**  
Updating your AMI with the AWS console is available only for clusters integrated with Amazon EKS. If you have a Slurm cluster, you must use the SageMaker AI API or CLI operations.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left, expand **HyperPod Clusters**, and choose **Cluster Management**.

1. Choose the cluster that you want to update, then choose **Details**, and **Update AMI**.



To create and manage update schedules programmatically, use the following API operations:
+ [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) – create a cluster and specify an update schedule
+ [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) – update a cluster to add an update schedule
+ [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) – update the platform software of a cluster
+ [DescribeCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeCluster.html) – see an update schedule that you created for a cluster
+ [DescribeClusterNode](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeClusterNode.html) and [ListClusterNodes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListClusterNodes.html) – see when the cluster was last updated

## Required permissions
<a name="sagemaker-hyperpod-release-ami-update-permissions"></a>

Depending on how you configured the [Pod Disruption Budget](https://kubernetes.io/docs/tasks/run-application/configure-pdb/) in your Amazon EKS cluster, HyperPod evicts pods, releases nodes, and prevents any update scheduling during the AMI update process. If any constraints within the budget are violated, HyperPod skips that node during the AMI update. For SageMaker HyperPod to correctly evict pods, you must add the necessary permissions to the HyperPod service-linked role. The following YAML file contains the necessary permissions.

```
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: hyperpod-patching
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["list"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hyperpod-patching
subjects:
- kind: User
  name: hyperpod-service-linked-role
roleRef:
  kind: ClusterRole
  name: hyperpod-patching
  apiGroup: rbac.authorization.k8s.io
```

Use the following commands to apply the permissions.

```
git clone https://github.com/aws/sagemaker-hyperpod-cli.git 

cd sagemaker-hyperpod-cli/helm_chart

helm upgrade hyperpod-dependencies HyperPodHelmChart --namespace kube-system --install
```

## Cron expressions
<a name="sagemaker-hyperpod-release-ami-update-cron"></a>

To configure a one-time update at a certain time or a recurring schedule, use cron expressions. Cron expressions have six fields separated by white space. All six fields are required.

```
cron(Minutes Hours Day-of-month Month Day-of-week Year)
```


| **Fields** | **Values** | **Wildcards** | 
| --- | --- | --- | 
|  Minutes  |  00 – 59  |  N/A  | 
|  Hours  |  00 – 23  |  N/A  | 
|  Day-of-month  |  01 – 31  | ? | 
|  Month  |  01 – 12  | \* / | 
|  Day-of-week  |  1 – 7 or MON-SUN  | ? \# L | 
|  Year  |  Current year – 2099  | \* | 

**Wildcards**
+ The **\*** (asterisk) wildcard includes all values in the field. In the `Hours` field, **\*** would include every hour.
+ The **/** (forward slash) wildcard specifies increments. In the `Month` field, you could enter **\*/3** to specify every 3rd month.
+ The **?** (question mark) wildcard specifies one or another. In the `Day-of-month` field you could enter **7**, and if you didn't care what day of the week the seventh was, you could enter **?** in the `Day-of-week` field.
+ The **L** wildcard in the `Day-of-month` or `Day-of-week` field specifies the last day of the month or week. For example, `5L` means the last Friday of the month.
+ The **\#** wildcard in the `Day-of-week` field specifies a certain instance of the specified day of the week within a month. For example, **3\#2** would be the second Tuesday of the month: the 3 refers to Tuesday because it is the third day of each week, and the 2 refers to the second day of that type within the month.

You can use cron expressions for the following scenarios:
+ One-time schedule that runs at a certain time and day. You can use the `?` wildcard to denote that day-of-month or day-of-week don't matter.

  ```
  cron(30 14 ? 12 MON 2024)
  ```

  ```
  cron(30 14 15 12 ? 2024)
  ```
+ A weekly schedule that runs at a certain time and day. The following example creates a schedule that runs at 12:00 PM every Monday, regardless of the day of the month.

  ```
  cron(00 12 ? * 1 *)
  ```
+ Monthly schedule that runs every month regardless of the day-of-week. The following schedule runs at 12:30pm on the 15th of every month.

  ```
  cron(30 12 15 * ? *)
  ```
+ A monthly schedule that uses day-of-week.

  ```
  cron(30 12 ? * MON *)
  ```
+ To create a schedule that runs every Nth month, use the `/` wildcard. The following two examples create schedules that run every 3 months, using day-of-month and day-of-week respectively.

  ```
  cron(30 12 15 */3 ? *)
  ```

  ```
  cron(30 12 ? */3 MON *)
  ```
+ A schedule that runs on a certain instance of the specified day of the week. The following example creates a schedule that runs at 12:30pm on the second Monday of every month.

  ```
  cron(30 12 ? * 1#2 *)
  ```
+ A schedule that runs on the last instance of the specified day of the week. The following schedule runs at 12:30pm on the last Monday of every month.

  ```
  cron(30 12 ? * 1L *)
  ```
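All of the expressions above share the same six-field shape. A quick local sanity check (a sketch, not a full cron validator) can catch a missing field before you submit a schedule:

```shell
# Count the fields inside a cron(...) expression; the format used for
# HyperPod update schedules always has exactly six.
check_cron() {
  fields=$(printf '%s\n' "$1" | sed 's/^cron(//; s/)$//' | wc -w)
  if [ "$fields" -eq 6 ]; then
    echo "ok: $1"
  else
    echo "invalid ($fields fields): $1"
  fi
}

check_cron "cron(30 12 ? * 1#2 *)"   # six fields
check_cron "cron(30 12 15 */3)"      # too few fields
```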

# SageMaker HyperPod AMI releases for Slurm
<a name="sagemaker-hyperpod-release-ami-slurm"></a>

The following release notes track the latest updates for Amazon SageMaker HyperPod AMI releases for Slurm orchestration. These HyperPod AMIs are built upon [AWS Deep Learning Base GPU AMI (Ubuntu 22.04)](https://aws.amazon.com/releasenotes/aws-deep-learning-base-gpu-ami-ubuntu-22-04/). The HyperPod service team distributes software patches through [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami). For HyperPod AMI releases for Amazon EKS orchestration, see [SageMaker HyperPod AMI releases for Amazon EKS](sagemaker-hyperpod-release-ami-eks.md). For information about Amazon SageMaker HyperPod feature releases, see [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md).

**Note**  
To update existing HyperPod clusters with the latest DLAMI, see [Update the SageMaker HyperPod platform software of a cluster](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software).

## SageMaker HyperPod AMI releases for Slurm: March 01, 2026
<a name="sagemaker-hyperpod-release-ami-slurm-20260301"></a>

 **AMI general updates** 
+ Released updates for SageMaker HyperPod AMI for Slurm versions 24.11.
+ Base DLAMI release note is available [here](https://docs.aws.amazon.com//dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).

 **SageMaker HyperPod DLAMI for Slurm support** 

This release includes the following updates:

------
#### [ Slurm v24.11 ]
+ Slurm 24.11 (ARM64):
  + Linux Kernel version: 6.8
  + Glibc version: 2.35
  + OpenSSL version: 3.0.2
  + FSx Lustre Client version: 2.15.6-1fsx26
  + Runc version: 1.3.4
  + Containerd version: containerd containerd.io v2.2.1
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.6, 12.8, 12.9, 13.0
  + EFA Installer version: 1.45.1
  + Python version: 3.10.12
  + Slurm version: 24.11.0
  + nvme-cli version: 1.16
  + collectd version: 5.12.0.
  + lustre-client version: 2.15.6-1fsx26
  + nvidia-imex version: 580.126.09-1
  + systemd version: 249
  + openssh version: 8.9
  + sudo version: 1.9.9
  + ufw version: 0.36.1
  + gcc version: 11.4.0
  + cmake version: 3.22.1
  + git version: 2.34.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1b1344-1
  + nfs-utils version: 1:2.6.1-1ubuntu1.2
  + iscsi-initiator-utils version: 2.1.5-1ubuntu1.1
  + lvm2 version: 2.03.11
  + ec2-instance-connect version: 1.1.14-0ubuntu1.1
  + rdma-core version: 60.0-1
+ Slurm 24.11 (x86\_64):
  + Linux Kernel version: 6.8
  + Glibc version: 2.35
  + OpenSSL version: 3.0.2
  + FSx Lustre Client version: 2.15.6-1fsx26
  + Runc version: 1.3.4
  + Containerd version: containerd containerd.io v2.2.1
  + aws Neuronx DKMS version: 2.26.5.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.6, 12.8, 12.9, 13.0
  + EFA Installer version: 1.45.0
  + Python version: 3.10.12
  + Slurm version: 24.11.0
  + nvme-cli version: 1.16
  + stress version: 1.0.5
  + collectd version: 5.12.0.
  + lustre-client version: 2.15.6-1fsx26
  + systemd version: 249
  + openssh version: 8.9
  + sudo version: 1.9.9
  + ufw version: 0.36.1
  + gcc version: 11.4.0
  + cmake version: 3.22.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1b1344-1
  + nfs-utils version: 1:2.6.1-1ubuntu1.2
  + iscsi-initiator-utils version: 2.1.5-1ubuntu1.1
  + lvm2 version: 2.03.11
  + ec2-instance-connect version: 1.1.14-0ubuntu1.1
  + rdma-core version: 60.0-1

------

## SageMaker HyperPod AMI releases for Slurm: February 12, 2026
<a name="sagemaker-hyperpod-release-ami-slurm-20260212"></a>

 **AMI general updates** 
+ Released updates for SageMaker HyperPod AMI for Slurm versions 24.11.
+ Base DLAMI release note is available [here](https://docs.aws.amazon.com//dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).

 **SageMaker HyperPod DLAMI for Slurm support** 

This release includes the following updates:

------
#### [ Slurm v24.11 ]
+ Slurm 24.11 (ARM64):
  + Linux Kernel version: 6.8
  + Glibc version: 2.35
  + OpenSSL version: 3.0.2
  + FSx Lustre Client version: 2.15.6-1fsx25
  + Runc version: 1.3.4
  + Containerd version: containerd containerd.io v2.2.1
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.6, 12.8, 12.9, 13.0
  + EFA Installer version: 1.45.1
  + Python version: 3.10.12
  + Slurm version: 24.11.0
  + nvme-cli version: 1.16
  + collectd version: 5.12.0.
  + lustre-client version: 2.15.6-1fsx25
  + nvidia-imex version: 580.126.09-1
  + systemd version: 249
  + openssh version: 8.9
  + sudo version: 1.9.9
  + ufw version: 0.36.1
  + gcc version: 11.4.0
  + cmake version: 3.22.1
  + git version: 2.34.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.0b1337-1
  + nfs-utils version: 1:2.6.1-1ubuntu1.2
  + iscsi-initiator-utils version: 2.1.5-1ubuntu1.1
  + lvm2 version: 2.03.11
  + ec2-instance-connect version: 1.1.14-0ubuntu1.1
  + rdma-core version: 60.0-1
+ Slurm 24.11 (x86\_64):
  + Linux Kernel version: 6.8
  + Glibc version: 2.35
  + OpenSSL version: 3.0.2
  + FSx Lustre Client version: 2.15.6-1fsx25
  + Runc version: 1.3.4
  + Containerd version: containerd containerd.io v2.2.1
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.6, 12.8, 12.9, 13.0
  + EFA Installer version: 1.45.0
  + Python version: 3.10.12
  + Slurm version: 24.11.0
  + nvme-cli version: 1.16
  + stress version: 1.0.5
  + collectd version: 5.12.0.
  + lustre-client version: 2.15.6-1fsx25
  + systemd version: 249
  + openssh version: 8.9
  + sudo version: 1.9.9
  + ufw version: 0.36.1
  + gcc version: 11.4.0
  + cmake version: 3.22.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.0b1337-1
  + nfs-utils version: 1:2.6.1-1ubuntu1.2
  + iscsi-initiator-utils version: 2.1.5-1ubuntu1.1
  + lvm2 version: 2.03.11
  + ec2-instance-connect version: 1.1.14-0ubuntu1.1
  + rdma-core version: 60.0-1

------

## SageMaker HyperPod AMI releases for Slurm: January 25, 2026
<a name="sagemaker-hyperpod-release-ami-slurm-20260125"></a>

 **AMI general updates** 
+ Released updates for SageMaker HyperPod AMI for Slurm versions 24.11.
+ Base DLAMI release note is available [here](https://docs.aws.amazon.com//dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).

 **SageMaker HyperPod DLAMI for Slurm support** 

This release includes the following updates:

------
#### [ Slurm v24.11 ]
+ Slurm 24.11 (ARM64):
  + Linux Kernel version: 6.8
  + Glibc version: 2.35
  + OpenSSL version: 3.0.2
  + FSx Lustre Client version: 2.15.6-1fsx25
  + Runc version: 1.3.4
  + Containerd version: containerd containerd.io v2.2.1
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.6, 12.8, 12.9, 13.0
  + EFA Installer version: 2.3.1amzn3.0
  + Python version: 3.10.12
  + Slurm version: 24.11.0
  + nvme-cli version: 1.16
  + collectd version: 5.12.0.
  + lustre-client version: 2.15.6-1fsx25
  + nvidia-imex version: 580.126.09-1
  + systemd version: 249
  + openssh version: 8.9
  + sudo version: 1.9.9
  + ufw version: 0.36.1
  + gcc version: 11.4.0
  + cmake version: 3.22.1
  + git version: 2.34.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300063.0b1323-1
  + nfs-utils version: 1:2.6.1-1ubuntu1.2
  + iscsi-initiator-utils version: 2.1.5-1ubuntu1.1
  + lvm2 version: 2.03.11
  + ec2-instance-connect version: 1.1.14-0ubuntu1.1
  + rdma-core version: 60.0-1
+ Slurm 24.11 (x86\_64):
  + Linux Kernel version: 6.8
  + Glibc version: 2.35
  + OpenSSL version: 3.0.2
  + FSx Lustre Client version: 2.15.6-1fsx25
  + Runc version: 1.3.4
  + Containerd version: containerd containerd.io v2.2.1
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.6, 12.8, 12.9, 13.0
  + EFA Installer version: 2.3.1amzn2.0
  + Python version: 3.10.12
  + Slurm version: 24.11.0
  + nvme-cli version: 1.16
  + stress version: 1.0.5
  + collectd version: 5.12.0.
  + lustre-client version: 2.15.6-1fsx25
  + systemd version: 249
  + openssh version: 8.9
  + sudo version: 1.9.9
  + ufw version: 0.36.1
  + gcc version: 11.4.0
  + cmake version: 3.22.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300063.0b1323-1
  + nfs-utils version: 1:2.6.1-1ubuntu1.2
  + iscsi-initiator-utils version: 2.1.5-1ubuntu1.1
  + lvm2 version: 2.03.11
  + ec2-instance-connect version: 1.1.14-0ubuntu1.1
  + rdma-core version: 60.0-1

------

## SageMaker HyperPod AMI releases for Slurm: December 29, 2025
<a name="sagemaker-hyperpod-release-ami-slurm-20251229"></a>

 **AMI general updates** 
+ Released updates for SageMaker HyperPod AMI for Slurm versions 24.11.
+ Base DLAMI release note is available [here](https://docs.aws.amazon.com//dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).

 **SageMaker HyperPod DLAMI for Slurm support** 

This release includes the following updates:

------
#### [ Slurm v24.11 ]
+ Slurm 24.11 (ARM64):
  + Linux Kernel version: 6.8
  + Glibc version: 2.35
  + OpenSSL version: 3.0.2
  + FSx Lustre Client version: 2.15.6-1fsx25
  + Runc version: 1.3.4
  + Containerd version: containerd containerd.io v2.2.1
  + NVIDIA Driver version: 580.105.08
  + CUDA version: 12.6, 12.8, 12.9, 13.0
  + EFA Installer version: 2.3.1amzn3.0
  + Python version: 3.10.12
  + Slurm version: 24.11.0
  + nvme-cli version: 1.16
  + collectd version: 5.12.0.
  + lustre-client version: 2.15.6-1fsx25
  + nvidia-imex version: 580.105.08-1
  + systemd version: 249
  + openssh version: 8.9
  + sudo version: 1.9.9
  + ufw version: 0.36.1
  + gcc version: 11.4.0
  + cmake version: 3.22.1
  + git version: 2.34.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.0b1304-1
  + nfs-utils version: 1:2.6.1-1ubuntu1.2
  + iscsi-initiator-utils version: 2.1.5-1ubuntu1.1
  + lvm2 version: 2.03.11
  + ec2-instance-connect version: 1.1.14-0ubuntu1.1
  + rdma-core version: 60.0-1
+ Slurm 24.11 (x86\_64):
  + Linux Kernel version: 6.8
  + Glibc version: 2.35
  + OpenSSL version: 3.0.2
  + FSx Lustre Client version: 2.15.6-1fsx25
  + Runc version: 1.3.4
  + Containerd version: containerd containerd.io v2.2.1
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.105.08
  + CUDA version: 12.6, 12.8, 12.9, 13.0
  + EFA Installer version: 2.3.1amzn2.0
  + Python version: 3.10.12
  + Slurm version: 24.11.0
  + nvme-cli version: 1.16
  + stress version: 1.0.5
  + collectd version: 5.12.0.
  + lustre-client version: 2.15.6-1fsx25
  + systemd version: 249
  + openssh version: 8.9
  + sudo version: 1.9.9
  + ufw version: 0.36.1
  + gcc version: 11.4.0
  + cmake version: 3.22.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.0b1304-1
  + nfs-utils version: 1:2.6.1-1ubuntu1.2
  + iscsi-initiator-utils version: 2.1.5-1ubuntu1.1
  + lvm2 version: 2.03.11
  + ec2-instance-connect version: 1.1.14-0ubuntu1.1
  + rdma-core version: 60.0-1

------

## SageMaker HyperPod AMI releases for Slurm: November 22, 2025
<a name="sagemaker-hyperpod-release-ami-slurm-20251128"></a>

 **AMI general updates** 
+ Released updates for SageMaker HyperPod AMI for Slurm versions 24.11.
+ Base DLAMI release note is available [here](https://docs.aws.amazon.com//dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).

 **SageMaker HyperPod DLAMI for Slurm support** 

This release includes the following updates:

------
#### [ Slurm (arm64) ]
+ Linux Kernel version: 6.8
+ Glibc version: 2.35
+ OpenSSL version: 3.0.2
+ FSx Lustre Client version: 2.15.6-1fsx21
+ Runc version: 1.3.3
+ Containerd version: containerd containerd.io v2.1.5
+ NVIDIA Driver version: 580.95.05
+ CUDA version: 12.6, 12.8, 12.9, 13.0
+ EFA Installer version: 2.1.0amzn5.0
+ Python version: 3.10.12
+ Slurm version: 24.11.0
+ nvme-cli version: 1.16
+ collectd version: 5.12.0.
+ lustre-client version: 2.15.6-1fsx21
+ nvidia-imex version: 580.95.05-1
+ systemd version: 249
+ openssh version: 8.9
+ sudo version: 1.9.9
+ ufw version: 0.36.1
+ gcc version: 11.4.0
+ cmake version: 3.22.1
+ git version: 2.34.1
+ make version: 4.3
+ cloudwatch-agent version: 1.300062.0b1304-1
+ nfs-utils version: 1:2.6.1-1ubuntu1.2
+ iscsi-initiator-utils version: 2.1.5-1ubuntu1.1
+ lvm2 version: 2.03.11
+ ec2-instance-connect version: 1.1.14-0ubuntu1.1
+ rdma-core version: 58.amzn0-1

------
#### [ Slurm (x86\_64) ]
+ Linux Kernel version: 6.8
+ Glibc version: 2.35
+ OpenSSL version: 3.0.2
+ FSx Lustre Client version: 2.15.6-1fsx21
+ Runc version: 1.3.3
+ Containerd version: containerd containerd.io v2.1.5
+ aws Neuronx DKMS version: 2.24.7.0
+ NVIDIA Driver version: 580.95.05
+ CUDA version: 12.6, 12.8, 12.9, 13.0
+ EFA Installer version: 2.3.1amzn1.0
+ Python version: 3.10.12
+ Slurm version: 24.11.0
+ nvme-cli version: 1.16
+ stress version: 1.0.5
+ collectd version: 5.12.0.
+ lustre-client version: 2.15.6-1fsx21
+ systemd version: 249
+ openssh version: 8.9
+ sudo version: 1.9.9
+ ufw version: 0.36.1
+ gcc version: 11.4.0
+ cmake version: 3.22.1
+ make version: 4.3
+ cloudwatch-agent version: 1.300062.0b1304-1
+ nfs-utils version: 1:2.6.1-1ubuntu1.2
+ iscsi-initiator-utils version: 2.1.5-1ubuntu1.1
+ lvm2 version: 2.03.11
+ ec2-instance-connect version: 1.1.14-0ubuntu1.1
+ rdma-core version: 59.amzn0-1

------

## SageMaker HyperPod AMI releases for Slurm: November 07, 2025
<a name="sagemaker-hyperpod-release-notes-20251107"></a>

**The AMI includes the following:**
+ Supported AWS service: Amazon EC2
+ Operating System: Ubuntu 22.04
+ Compute Architecture: ARM64
+ Updated packages: NVIDIA Driver: 580.95.05
+ CUDA Versions: cuda-12.6, cuda-12.8, cuda-12.9, cuda-13.0
+ Security fixes: [ Runc Security patch](https://aws.amazon.com/security/security-bulletins/rss/aws-2025-024/)

## SageMaker HyperPod AMI releases for Slurm: September 29, 2025
<a name="sagemaker-hyperpod-release-notes-20250929"></a>

**The AMI includes the following:**
+ Supported AWS service: Amazon EC2
+ Operating System: Ubuntu 22.04
+ Compute Architecture: ARM64
+ Updated packages: NVIDIA Driver: 570.172.08
+ Security fixes

## SageMaker HyperPod AMI releases for Slurm: August 12, 2025
<a name="sagemaker-hyperpod-release-notes-20250812"></a>

**The AMI includes the following:**
+ Supported AWS service: Amazon EC2
+ Operating System: Ubuntu 22.04
+ Compute Architecture: ARM64
+ Latest available version is installed for the following packages:
  + Linux Kernel: 6.8
  + FSx Lustre
  + Docker
  + AWS CLI v2 at `/usr/bin/aws`
  + NVIDIA DCGM
  + Nvidia container toolkit:
    + Version command: `nvidia-container-cli -V`
  + Nvidia-docker2:
    + Version command: `nvidia-docker version`
  + Nvidia-IMEX: v570.172.08-1
+ NVIDIA Driver: 570.158.01
+ NVIDIA CUDA 12.4, 12.5, 12.6, 12.8 stack:
  + CUDA, NCCL, and cuDNN installation directories: `/usr/local/cuda-xx.x/`
    + Example: `/usr/local/cuda-12.8/`
  + Compiled NCCL version:
    + For CUDA directory 12.4, compiled NCCL version 2.22.3+CUDA12.4
    + For CUDA directory 12.5, compiled NCCL version 2.22.3+CUDA12.5
    + For CUDA directory 12.6, compiled NCCL version 2.24.3+CUDA12.6
    + For CUDA directory 12.8, compiled NCCL version 2.27.5+CUDA12.8
  + Default CUDA: 12.8
    + PATH `/usr/local/cuda` points to CUDA 12.8
    + Updated below env vars:
      + `LD_LIBRARY_PATH` to have `/usr/local/cuda-12.8/lib:/usr/local/cuda-12.8/lib64:/usr/local/cuda-12.8:/usr/local/cuda-12.8/targets/sbsa-linux/lib:/usr/local/cuda-12.8/nvvm/lib64:/usr/local/cuda-12.8/extras/CUPTI/lib64`
      + `PATH` to have `/usr/local/cuda-12.8/bin/:/usr/local/cuda-12.8/include/`
      + For a different CUDA version, update `LD_LIBRARY_PATH` accordingly.
+ EFA installer: 1.42.0
+ Nvidia GDRCopy: 2.5.1
+ AWS OFI NCCL plugin comes with EFA installer
  + Paths `/opt/amazon/ofi-nccl/lib/aarch64-linux-gnu` and `/opt/amazon/ofi-nccl/efa` are added to `LD_LIBRARY_PATH`.
+ AWS CLI v2 at `/usr/local/bin/aws2` and AWS CLI v1 at `/usr/bin/aws`
+ EBS volume type: gp3
+ Python: `/usr/bin/python3.10`
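Building on the environment-variable notes above, switching the default CUDA toolkit to another installed version (12.6 here, as an example) amounts to repointing the same variables. This is a minimal sketch assuming the `/usr/local/cuda-xx.x` layout listed above; consult the full `LD_LIBRARY_PATH` entries above for the complete set of library directories.

```shell
# Point the toolchain at CUDA 12.6 instead of the default 12.8.
# Paths assume the /usr/local/cuda-xx.x layout described in this release.
export CUDA_HOME=/usr/local/cuda-12.6
export PATH="${CUDA_HOME}/bin:${PATH}"
export LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH:-}"

echo "Using CUDA at: ${CUDA_HOME}"
```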

## SageMaker HyperPod release notes: May 27, 2025
<a name="sagemaker-hyperpod-release-notes-20250527"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features and improvements**
+ Updated base AMI to `Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 22.04) 20250523` with the following key components:
  + NVIDIA Driver: 570.133.20
  + CUDA: 12.8 (default), with support for CUDA 12.4-12.6
  + NCCL Version: 2.26.5
  + EFA Installer: 1.40.0
  + AWS OFI NCCL: 1.14.2-aws
+ Updated Neuron SDK packages:
  + aws-neuronx-collectives: 2.25.65.0-9858ac9a1 (from 2.24.59.0-838c7fc8b)
  + aws-neuronx-dkms: 2.21.37.0 (from 2.20.28.0)
  + aws-neuronx-runtime-lib: 2.25.57.0-166c7a468 (from 2.24.53.0-f239092cc)
  + aws-neuronx-tools: 2.23.9.0 (from 2.22.61.0)

**Important notes**
+ NVIDIA Container Toolkit 1.17.4 disables the mounting of CUDA compatibility libraries.
+ Updated EFA configuration from 1.37 to 1.38, and EFA now includes the AWS OFI NCCL plugin, which is located in the `/opt/amazon/ofi-nccl` directory instead of the original `/opt/aws-ofi-nccl/` path. (Released on February 18, 2025)
+ Kernel version is pinned for stability and driver compatibility.

## SageMaker HyperPod AMI releases for Slurm: May 13, 2025
<a name="sagemaker-hyperpod-release-ami-slurm-20250513"></a>

Amazon SageMaker HyperPod released an updated AMI that supports Ubuntu 22.04 LTS for Slurm clusters. AWS regularly updates AMIs to ensure you have access to the most current software stack. Upgrading to the latest AMI provides enhanced security through comprehensive package updates, improved performance and stability for your workloads, and compatibility with new instance types and latest kernel features.

**Important**  
The update from Ubuntu 20.04 LTS to Ubuntu 22.04 LTS introduces changes that might affect compatibility with software and configurations designed for Ubuntu 20.04.

**Topics**
+ [Key updates in the Ubuntu 22.04 AMI](#sagemaker-hyperpod-ami-slurm-ubuntu22-updates)
+ [Upgrading to the Ubuntu 22.04 AMI](#sagemaker-hyperpod-ami-slurm-ubuntu22-upgrade)
+ [Troubleshooting upgrade failures](#sagemaker-hyperpod-ami-slurm-ubuntu22-troubleshoot)

### Key updates in the Ubuntu 22.04 AMI
<a name="sagemaker-hyperpod-ami-slurm-ubuntu22-updates"></a>

The following table lists the component versions of the Ubuntu 22.04 AMI compared to the previous AMI.


**Component versions of the Ubuntu 22.04 AMI compared to the previous AMI**  

| Component | Previous version | Updated version | 
| --- | --- | --- | 
|  **Ubuntu OS**  |  20.04 LTS  |  22.04 LTS  | 
|  **Slurm**  |  24.11  |  24.11 (unchanged)  | 
|  **Python**  |  3.8 (default)  |  3.10 (default)  | 
|  **Elastic Fabric Adapter (EFA) on Amazon FSx**  |  Not supported  |  Supported  | 
|  **Linux kernel**  |  5.15  |  6.8  | 
|  **GNU C Library (glibc)**  |  2.31  |  2.35  | 
|  **GNU Compiler Collection (GCC)**  |  9.4.0  |  11.4.0  | 
|  **libc6**  |  ≤ 2.31  |  ≥ 2.35 supported  | 
|  **Network File System (NFS)**  |  1:1.3.4  |  1:2.6.1  | 

**Note**  
Although the Slurm version (24.11) remains unchanged, the underlying OS and library updates in this AMI may affect your system behavior and workload compatibility. You must test your workloads before upgrading production clusters.

### Upgrading to the Ubuntu 22.04 AMI
<a name="sagemaker-hyperpod-ami-slurm-ubuntu22-upgrade"></a>

Before upgrading your cluster to the Ubuntu 22.04 AMI, complete these preparation steps and review the upgrade requirements. To troubleshoot upgrade failures, see [Troubleshooting upgrade failures](#sagemaker-hyperpod-ami-slurm-ubuntu22-troubleshoot).

#### Review Python compatibility
<a name="sagemaker-hyperpod-ami-slurm-ubuntu22-python-compatibility"></a>

The Ubuntu 22.04 AMI uses Python 3.10 as the default version, upgraded from Python 3.8. Although Python 3.10 maintains compatibility with most Python 3.8 code, you should test your existing workloads before upgrading. If your workloads require Python 3.8, you can install it using the following command in your lifecycle script:

```
apt-get install -y python3.8
```

Before upgrading your cluster, make sure to do the following:

1. Test your code compatibility with Python 3.10.

1. Verify your lifecycle scripts work in the new environment.

1. Check that all dependencies are compatible with the new Python version.

1. If you created your HyperPod cluster by copying the default lifecycle script from GitHub, add the following command to your `setup_mariadb_accounting.sh` file before upgrading to Ubuntu 22. For the complete script, see [setup\_mariadb\_accounting.sh on GitHub](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/setup_mariadb_accounting.sh).

   ```
   apt-get -y -o DPkg::Lock::Timeout=120 update && apt-get -y -o DPkg::Lock::Timeout=120 install apg
   ```
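The compatibility checks above can also be enforced at cluster setup time. A hedged sketch of a guard you might add near the top of a lifecycle script (the 3.8 floor is illustrative; raise it to `(3, 10)` once you have validated your workloads against Python 3.10):

```shell
# Sketch: abort a lifecycle script early if the default python3 is older
# than the version the workload was validated against (3.8 here as an example).
if ! python3 -c 'import sys; sys.exit(0 if sys.version_info[:2] >= (3, 8) else 1)'; then
  echo "python3 >= 3.8 required" >&2
  exit 1
fi
echo "python3 version check passed"
```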

#### Upgrade your Slurm cluster
<a name="sagemaker-hyperpod-ami-slurm-ubuntu22-upgrade-cluster"></a>

You can upgrade your Slurm cluster to use the new AMI in two ways:

1. Create a new cluster using the [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) API.

1. Update an existing cluster's software using the [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API.

#### Validated configurations
<a name="sagemaker-hyperpod-ami-slurm-ubuntu22-validation"></a>

AWS has tested a wide range of distributed training workloads and infrastructure features on G5, G6, G6e, P4d, P5, and Trn1 instances, including:
+ Distributed training with PyTorch (e.g., FSDP, NeMo, LLaMA, MNIST).
+ Accelerator testing across instance types with Nvidia (P/G series) and AWS Neuron (Trn1).
+ Resiliency features that include [auto-resume](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm.html#sagemaker-hyperpod-resiliency-slurm-auto-resume) and [deep health checks](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-resiliency-deep-health-checks.html).

#### Cluster downtime and availability
<a name="sagemaker-hyperpod-ami-slurm-ubuntu22-downtime-availability"></a>

During the upgrade process, the cluster will be unavailable. To minimize disruption, do the following:
+ Test the upgrade process on smaller clusters.
+ Create checkpoints before the upgrade, then restart training workloads from existing checkpoints after the upgrade completes.

### Troubleshooting upgrade failures
<a name="sagemaker-hyperpod-ami-slurm-ubuntu22-troubleshoot"></a>

When an upgrade fails, first determine if the failure is related to lifecycle scripts. These scripts commonly fail due to syntax errors, missing dependencies, or incorrect configurations.

To investigate failures related to lifecycle scripts, check CloudWatch logs. All SageMaker HyperPod events and logs are stored under the log group: `/aws/sagemaker/Clusters/[ClusterName]/[ClusterID]`. Look specifically at the log stream `LifecycleConfig/[instance-group-name]/[instance-id]`, which provides detailed information about any errors during script execution.
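The log group and stream layout above can be assembled programmatically before querying CloudWatch. A small sketch (the helper name and all sample identifiers are placeholders):

```python
# Sketch: build the CloudWatch log group and log stream names for a
# SageMaker HyperPod cluster, following the layout described above.
def lifecycle_log_paths(cluster_name: str, cluster_id: str,
                        instance_group: str, instance_id: str) -> tuple[str, str]:
    log_group = f"/aws/sagemaker/Clusters/{cluster_name}/{cluster_id}"
    log_stream = f"LifecycleConfig/{instance_group}/{instance_id}"
    return log_group, log_stream

group, stream = lifecycle_log_paths("my-cluster", "abc123",
                                    "worker-group-1", "i-0123456789abcdef0")
print(group)   # /aws/sagemaker/Clusters/my-cluster/abc123
print(stream)  # LifecycleConfig/worker-group-1/i-0123456789abcdef0
```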

If the upgrade failure is unrelated to lifecycle scripts, collect relevant information including the cluster ARN, error logs, and timestamps, then contact [AWS support](https://aws.amazon.com/premiumsupport/) for further assistance.

## SageMaker HyperPod AMI releases for Slurm: May 07, 2025
<a name="sagemaker-hyperpod-release-ami-slurm-20250507"></a>

Amazon SageMaker HyperPod for Slurm released a major OS version upgrade to Ubuntu 22.04 (from Ubuntu 20.04). For more information, see the DLAMI Ubuntu 22.04 [release notes](https://aws.amazon.com/releasenotes/aws-deep-learning-base-gpu-ami-ubuntu-22-04/) for `Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 22.04) 20250503`.

Key package upgrades:
+ Ubuntu 22.04 LTS (from 20.04)
+ Python Version:
  + Python 3.10 is now the default Python version in the Slurm AMI for Ubuntu 22.04
  + This upgrade provides access to the latest features, performance improvements, and bug fixes introduced in Python 3.10
+ Support for EFA on FSx
+ New Linux Kernel version 6.8 (updated from 5.15)
+ Glibc version: 2.35 (updated from 2.31)
+ GCC version: 11.4.0 (updated from 9.4.0)
+ Newer libc6 version support: >= 2.35 (previously <= 2.31)
+ NFS version: 1:2.6.1 (updated from 1:1.3.4)

## SageMaker HyperPod AMI releases for Slurm: April 28, 2025
<a name="sagemaker-hyperpod-release-ami-slurm-20250428"></a>

**Improvements for Slurm**
+ Upgraded the NVIDIA driver from version 550.144.03 to 550.163.01 to address Common Vulnerabilities and Exposures (CVEs) in the [NVIDIA GPU Display Security Bulletin for April 2025](https://nvidia.custhelp.com/app/answers/detail/a_id/5630).

**Amazon SageMaker HyperPod DLAMI for Slurm support**

------
#### [ Installed the latest version of AWS Neuron SDK ]
+ **aws-neuronx-collectives:** 2.24.59.0-838c7fc8b
+ **aws-neuronx-dkms:** 2.20.28.0
+ **aws-neuronx-runtime-lib:** 2.24.53.0-f239092cc
+ **aws-neuronx-tools:** 2.22.61.0

------

## SageMaker HyperPod AMI releases for Slurm: February 18, 2025
<a name="sagemaker-hyperpod-release-ami-slurm-20250218"></a>

**Improvements for Slurm**
+ Upgraded Slurm version to 24.11.
+ Upgraded Elastic Fabric Adapter (EFA) version from 1.37.0 to 1.38.0.
+ The EFA now includes the AWS OFI NCCL plugin. You can find this plugin in the `/opt/amazon/ofi-nccl` directory, rather than the original `/opt/aws-ofi-nccl/` location. If you need to update your `LD_LIBRARY_PATH` environment variable, make sure to modify the path to point to the new `/opt/amazon/ofi-nccl` location for the OFI NCCL plugin.
+ Removed the emacs package from these DLAMIs. You can install Emacs from the GNU Emacs distribution.
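If existing job scripts still reference the old `/opt/aws-ofi-nccl/` path, prepending the new location is a one-line change. A sketch (the `lib` subdirectory is an assumption; confirm the exact layout on your instances):

```shell
# Sketch: prepend the new OFI NCCL plugin location to LD_LIBRARY_PATH.
# The exact subdirectory under /opt/amazon/ofi-nccl may vary by architecture.
export LD_LIBRARY_PATH="/opt/amazon/ofi-nccl/lib:${LD_LIBRARY_PATH:-}"
```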

**Amazon SageMaker HyperPod DLAMI for Slurm support**

------
#### [ Installed the latest version of AWS Neuron SDK 2.19 ]
+ **aws-neuronx-collectives/unknown:** 2.23.135.0-3e70920f2 amd64
+ **aws-neuronx-dkms/unknown:** 2.19.64.0 amd64
+ **aws-neuronx-runtime-lib/unknown:** 2.23.112.0-9b5179492 amd64
+ **aws-neuronx-tools/unknown:** 2.20.204.0 amd64

------

## SageMaker HyperPod AMI releases for Slurm: December 21, 2024
<a name="sagemaker-hyperpod-release-ami-slurm-20241221"></a>

**SageMaker HyperPod DLAMI for Slurm support**

------
#### [ Deep Learning Slurm AMI ]
+ **NVIDIA driver:** 550.127.05
+ **EFA driver:** 2.13.0-1
+ Installed the latest version of AWS Neuron SDK
  + **aws-neuronx-collectives:** 2.22.33.0
  + **aws-neuronx-dkms:** 2.18.20.0
  + **aws-neuronx-oci-hook:** 2.5.8.0
  + **aws-neuronx-runtime-lib:** 2.22.19.0
  + **aws-neuronx-tools:** 2.19.0.0

------

## SageMaker HyperPod AMI releases for Slurm: November 24, 2024
<a name="sagemaker-hyperpod-release-ami-slurm-20241124"></a>

**AMI general updates**
+ Released in `MEL` (Melbourne) Region.
+ Updated SageMaker HyperPod base DLAMI to the following versions:
  + Slurm: 2024-11-22.

## SageMaker HyperPod AMI releases for Slurm: November 15, 2024
<a name="sagemaker-hyperpod-release-ami-slurm-20241115"></a>

**AMI general updates**
+ Installed latest `libnvidia-nscq-xxx` package.

**SageMaker HyperPod DLAMI for Slurm support**

------
#### [ Deep Learning Slurm AMI ]
+ **NVIDIA driver:** 550.127.05
+ **EFA driver:** 2.13.0-1
+ Installed the latest version of AWS Neuron SDK
  + **aws-neuronx-collectives:** v2.22.33.0-d2128d1aa
  + **aws-neuronx-dkms:** v2.17.17.0
  + **aws-neuronx-oci-hook:** v2.4.4.0
  + **aws-neuronx-runtime-lib:** v2.21.41.0
  + **aws-neuronx-tools:** v2.18.3.0

------

## SageMaker HyperPod AMI releases for Slurm: November 11, 2024
<a name="sagemaker-hyperpod-release-ami-slurm-20241111"></a>

**AMI general updates**
+ Updated SageMaker HyperPod base DLAMI to the following version:
  + Slurm: 2024-10-23.

## SageMaker HyperPod AMI releases for Slurm: October 21, 2024
<a name="sagemaker-hyperpod-release-ami-slurm-20241021"></a>

**AMI general updates**
+ Updated SageMaker HyperPod base DLAMI to the following versions:
  + Slurm: 2024-09-27.

## SageMaker HyperPod AMI releases for Slurm: September 10, 2024
<a name="sagemaker-hyperpod-release-ami-slurm-20240910"></a>

**SageMaker HyperPod DLAMI for Slurm support**

------
#### [ Deep Learning Slurm AMI ]
+ Installed the NVIDIA driver v550.90.07
+ Installed the EFA driver v2.10
+ Installed the latest version of AWS Neuron SDK
  + **aws-neuronx-collectives:** v2.21.46.0
  + **aws-neuronx-dkms:** v2.17.17.0
  + **aws-neuronx-oci-hook:** v2.4.4.0
  + **aws-neuronx-runtime-lib:** v2.21.41.0
  + **aws-neuronx-tools:** v2.18.3.0

------

## SageMaker HyperPod AMI releases for Slurm: March 14, 2024
<a name="sagemaker-hyperpod-release-ami-slurm-20240314"></a>

**HyperPod DLAMI for Slurm software patch**
+ Upgraded [Slurm](https://slurm.schedmd.com/documentation.html) to v23.11.1
+ Added [OpenPMIx](https://openpmix.github.io/code/getting-the-reference-implementation) v4.2.6 for enabling [Slurm with PMIx](https://slurm.schedmd.com/mpi_guide.html#pmix).
+ Built upon the [AWS Deep Learning Base GPU AMI (Ubuntu 20.04)](https://aws.amazon.com/releasenotes/aws-deep-learning-base-gpu-ami-ubuntu-20-04/) released on 2023-10-26
+ A complete list of pre-installed packages in this HyperPod DLAMI in addition to the base AMI
  + [Slurm](https://slurm.schedmd.com/documentation.html): v23.11.1
  + [OpenPMIx ](https://openpmix.github.io/code/getting-the-reference-implementation): v4.2.6
  + Munge: v0.5.15
  + `aws-neuronx-dkms`: v2.\*
  + `aws-neuronx-collectives`: v2.\*
  + `aws-neuronx-runtime-lib`: v2.\*
  + `aws-neuronx-tools`: v2.\*
  + SageMaker HyperPod software packages to support features such as cluster health check and auto-resume

**Upgrade steps**
+ Run the following command to call the [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API to update your existing HyperPod clusters with the latest HyperPod DLAMI. To find more instructions, see [Update the SageMaker HyperPod platform software of a cluster](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software).
**Important**  
Back up your work before running this API. The patching process replaces the root volume with the updated AMI, which means that your previous data stored in the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre. For more information, see [Use the backup script provided by SageMaker HyperPod](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software-backup).

  ```
   aws sagemaker update-cluster-software --cluster-name your-cluster-name
  ```
**Note**  
You must run the AWS CLI command to update your HyperPod cluster. Updating the HyperPod software through the SageMaker HyperPod console UI is currently not available.

## SageMaker HyperPod AMI release for Slurm: November 29, 2023
<a name="sagemaker-hyperpod-release-ami-slurm-20231129"></a>

**HyperPod DLAMI for Slurm software patch**

The HyperPod service team distributes software patches through [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami). See the following details about the latest HyperPod DLAMI.
+ Built upon the [AWS Deep Learning Base GPU AMI (Ubuntu 20.04)](https://aws.amazon.com/releasenotes/aws-deep-learning-base-gpu-ami-ubuntu-20-04/) released on 2023-10-18
+ A complete list of pre-installed packages in this HyperPod DLAMI in addition to the base AMI
  + [Slurm](https://slurm.schedmd.com/documentation.html): v23.02.3
  + Munge: v0.5.15
  + `aws-neuronx-dkms`: v2.\*
  + `aws-neuronx-collectives`: v2.\*
  + `aws-neuronx-runtime-lib`: v2.\*
  + `aws-neuronx-tools`: v2.\*
  + SageMaker HyperPod software packages to support features such as cluster health check and auto-resume

# SageMaker HyperPod AMI releases for Amazon EKS
<a name="sagemaker-hyperpod-release-ami-eks"></a>

The following release notes track the latest updates for Amazon SageMaker HyperPod AMI releases for Amazon EKS orchestration. Each release note includes a summarized list of packages pre-installed or pre-configured in the SageMaker HyperPod DLAMIs for Amazon EKS support. Each DLAMI is built on AL2023 and supports a specific Kubernetes version. For HyperPod DLAMI releases for Slurm orchestration, see [SageMaker HyperPod AMI releases for Slurm](sagemaker-hyperpod-release-ami-slurm.md). For information about Amazon SageMaker HyperPod feature releases, see [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md).

## SageMaker HyperPod AMI releases for Amazon EKS: March 01, 2026
<a name="sagemaker-hyperpod-release-ami-eks-20260301"></a>

 **AMI general updates** 
+ Released updates for SageMaker HyperPod AMI for Amazon EKS versions 1.28, 1.29, 1.30, 1.31, 1.32, 1.33, and 1.34.
+ The base DLAMI release notes are available in the [AWS Deep Learning AMI release notes](https://docs.aws.amazon.com//dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).

**SageMaker HyperPod DLAMI for Amazon EKS support**

This release includes the following updates:

------
#### [ Kubernetes v1.28 ]
+ **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.44 Python/3.10.17 Linux/5.10.248-247.988.amzn2.x86\_64 botocore/1.42.54
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.7.16
  + Kubernetes version: v1.28.15-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.28.15-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.29 ]
+ **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.44 Python/3.10.17 Linux/5.10.248-247.988.amzn2.x86\_64 botocore/1.42.54
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.7.16
  + Kubernetes version: v1.29.15-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.29.15-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.30 ]
+ **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.44 Python/3.10.17 Linux/5.10.248-247.988.amzn2.x86\_64 botocore/1.42.54
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.7.16
  + Kubernetes version: v1.30.14-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.30.14-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.31 ]
+ **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.44 Python/3.10.17 Linux/5.10.248-247.988.amzn2.x86\_64 botocore/1.42.54
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.7.16
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.43.3
  + Python version: 3.9.25
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58.

------
#### [ Kubernetes v1.32 ]
+ **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.46 Python/3.10.17 Linux/5.10.248-247.988.amzn2.x86\_64 botocore/1.42.56
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.7.16
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.43.3
  + Python version: 3.9.25
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58.

------
#### [ Kubernetes v1.33 ]
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.33.5-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.43.3
  + Python version: 3.9.25
  + Kubernetes version: v1.33.5-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------
#### [ Kubernetes v1.34 ]
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.34.2-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.43.3
  + Python version: 3.9.25
  + Kubernetes version: v1.34.2-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------

## SageMaker Hyperpod AMI releases for Amazon EKS: February 12, 2026
<a name="sagemaker-hyperpod-release-ami-eks-20260212"></a>

 **AMI general updates** 
+ Released updates for the SageMaker Hyperpod AMI for Amazon EKS, Kubernetes versions 1.28 through 1.34.
+ The base DLAMI release notes are available [here](https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).
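
As a sketch of how these listings can be used, the snippet below compares a node's reported component versions against a handful of the expected values from this release. The `EXPECTED` table is a hand-copied excerpt for the AL2023 (x86\_64) image, and `check_node` is a hypothetical helper for illustration only, not part of any AWS tooling.

```python
# Illustrative sketch: verify a node's reported component versions against
# the versions listed in these release notes. EXPECTED is a hand-copied
# excerpt for the AL2023 (x86_64) image; check_node is a hypothetical
# helper, not part of any AWS tooling.

def parse_version(v: str) -> tuple:
    """Turn a dotted version string into a tuple of ints, ignoring any
    non-numeric suffix (e.g. '2.16.1g' -> (2, 16, 1))."""
    parts = []
    for piece in v.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        if digits:
            parts.append(int(digits))
    return tuple(parts)

# Excerpt of expected versions for the AL2023 (x86_64) image in this release.
EXPECTED = {
    "glibc": "2.34",
    "openssl": "3.2.2",
    "lustre-client": "2.15.6",
    "systemd": "252",
}

def check_node(reported: dict) -> list:
    """Return the names of components whose reported version is missing
    or differs from the expected release-note version."""
    return [name for name, want in EXPECTED.items()
            if name not in reported
            or parse_version(reported[name]) != parse_version(want)]

# A node matching the release notes reports no mismatches.
print(check_node({"glibc": "2.34", "openssl": "3.2.2",
                  "lustre-client": "2.15.6", "systemd": "252"}))  # → []
```

In practice the `reported` dictionary would be built from the node itself (for example, from `rpm -q` output); tuple comparison sidesteps the string-comparison pitfall where "2.9" sorts after "2.15".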

 **SageMaker Hyperpod DLAMI for Amazon EKS support** 

This release includes the following updates:

------
#### [ Kubernetes v1.28 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.31 Python/3.10.17 Linux/5.10.247-246.989.amzn2.x86\_64 botocore/1.42.41
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.7.16
  + Kubernetes version: v1.28.15-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.28.15-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.1
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.29 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.31 Python/3.10.17 Linux/5.10.247-246.989.amzn2.x86\_64 botocore/1.42.41
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.7.16
  + Kubernetes version: v1.29.15-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.29.15-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.1
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.30 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.31 Python/3.10.17 Linux/5.10.247-246.989.amzn2.x86\_64 botocore/1.42.41
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.7.16
  + Kubernetes version: v1.30.14-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.30.14-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.1
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.31 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.31 Python/3.10.17 Linux/5.10.247-246.989.amzn2.x86\_64 botocore/1.42.41
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.7.16
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.1
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.43.3
  + Python version: 3.9.25
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.1
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------
#### [ Kubernetes v1.32 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.31 Python/3.10.17 Linux/5.10.247-246.989.amzn2.x86\_64 botocore/1.42.41
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.7.16
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.1
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.43.3
  + Python version: 3.9.25
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.1
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------
#### [ Kubernetes v1.33 ]
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.33.5-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.1
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.43.3
  + Python version: 3.9.25
  + Kubernetes version: v1.33.5-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.1
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------
#### [ Kubernetes v1.34 ]
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.34.2-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.1
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.43.3
  + Python version: 3.9.25
  + Kubernetes version: v1.34.2-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.1
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------

## SageMaker Hyperpod AMI releases for Amazon EKS: January 25, 2026
<a name="sagemaker-hyperpod-release-ami-eks-20260125"></a>

 **AMI general updates** 
+ Released updates for the SageMaker Hyperpod AMI for Amazon EKS, Kubernetes versions 1.28 through 1.34.
+ The base DLAMI release notes are available [here](https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).

 **SageMaker Hyperpod DLAMI for Amazon EKS support** 

This release includes the following updates:

------
#### [ Kubernetes v1.28 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.21 Python/3.10.17 Linux/5.10.247-246.989.amzn2.x86\_64 botocore/1.42.31
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.28.15-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.28.15-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.29 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.21 Python/3.10.17 Linux/5.10.247-246.989.amzn2.x86\_64 botocore/1.42.31
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.29.15-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.29.15-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.30 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.21 Python/3.10.17 Linux/5.10.247-246.989.amzn2.x86\_64 botocore/1.42.31
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.30.14-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.30.14-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.31 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.21 Python/3.10.17 Linux/5.10.247-246.989.amzn2.x86\_64 botocore/1.42.31
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 (libnvme 1.13)
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 (libnvme 1.13)
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58.0

------
#### [ Kubernetes v1.32 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.21 Python/3.10.17 Linux/5.10.247-246.989.amzn2.x86\_64 botocore/1.42.31
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 (libnvme 1.13)
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 (libnvme 1.13)
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58.0

------
#### [ Kubernetes v1.33 ]
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.33.5-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 (libnvme 1.13)
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.33.5-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 (libnvme 1.13)
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58.0

------
#### [ Kubernetes v1.34 ]
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.34.2-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 (libnvme 1.13)
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.34.2-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 (libnvme 1.13)
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58.0

------

## SageMaker HyperPod AMI releases for Amazon EKS: December 29, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20251229"></a>

 **AMI general updates** 
+ Released updates for the SageMaker HyperPod AMIs for Amazon EKS versions 1.28, 1.29, 1.30, 1.31, 1.32, and 1.33.
+ The base DLAMI release notes are available [here](https://docs.aws.amazon.com//dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).
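
To confirm that a running node actually carries the package versions listed in these release notes, one approach is to compare the versions the node reports against a manifest subset. The following is a minimal, hypothetical sketch; the `EXPECTED` mapping and helper names are illustrative and not part of any AMI tooling:

```python
# Hypothetical sketch: compare reported package versions against a manifest
# subset taken from these release notes. Not part of the HyperPod tooling.

def parse_version(v):
    # "2.1.5" -> (2, 1, 5); tolerates suffixes such as "1.0.2k-fips".
    parts = []
    for token in v.split("."):
        digits = ""
        for ch in token:
            if ch.isdigit():
                digits += ch
            else:
                break
        if digits:
            parts.append(int(digits))
    return tuple(parts)

# Illustrative subset of the AL2023 (x86_64) manifest above.
EXPECTED = {
    "runc": "1.3.4",
    "systemd": "252",
    "git": "2.50.1",
}

def check(reported):
    # Return the names of packages whose reported version differs.
    return [name for name, want in EXPECTED.items()
            if parse_version(reported.get(name, "0")) != parse_version(want)]
```

For example, `check({"runc": "1.3.4", "systemd": "252", "git": "2.47.3"})` flags only `git`, whose reported version predates the one in the manifest.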

 **SageMaker HyperPod DLAMI for Amazon EKS support** 

This release includes the following updates:

------
#### [ Kubernetes v1.28 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.13, build 0bab007
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.4 Python/3.10.17 Linux/5.10.245-245.983.amzn2.x86\_64 botocore/1.42.14
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.195.03
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.28.15-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.105.08
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.28.15-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 (libnvme 1.13)
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.29 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.13, build 0bab007
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.4 Python/3.10.17 Linux/5.10.245-245.983.amzn2.x86\_64 botocore/1.42.14
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.195.03
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.29.15-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.105.08
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.29.15-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 (libnvme 1.13)
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.30 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.13, build 0bab007
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.4 Python/3.10.17 Linux/5.10.245-245.983.amzn2.x86\_64 botocore/1.42.14
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.195.03
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.30.14-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.105.08
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.30.14-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 (libnvme 1.13)
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.31 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.13, build 0bab007
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.4 Python/3.10.17 Linux/5.10.245-245.983.amzn2.x86\_64 botocore/1.42.14
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.195.03
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.4
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.105.08
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 (libnvme 1.13)
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.4
  + NVIDIA Driver version: 580.105.08
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 (libnvme 1.13)
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.105.08
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58.0

------
#### [ Kubernetes v1.32 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.13, build 0bab007
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.4 Python/3.10.17 Linux/5.10.245-245.983.amzn2.x86\_64 botocore/1.42.14
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.195.03
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.4
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.105.08
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 (libnvme 1.13)
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.4
  + NVIDIA Driver version: 580.105.08
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 (libnvme 1.13)
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.105.08
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58.0

------
#### [ Kubernetes v1.33 ]
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.4
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.105.08
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.33.5-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 (libnvme 1.13)
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.4
  + NVIDIA Driver version: 580.105.08
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.33.5-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 (libnvme 1.13)
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.105.08
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58.0

------

## SageMaker HyperPod AMI releases for Amazon EKS: November 28, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20251128"></a>

 **AMI general updates** 
+ Released updates for the SageMaker HyperPod AMIs for Amazon EKS versions 1.28, 1.29, 1.30, 1.31, 1.32, and 1.33.
+ The base DLAMI release notes are available [here](https://docs.aws.amazon.com//dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).

 **SageMaker HyperPod DLAMI for Amazon EKS support** 

This release includes the following updates:

------
#### [ Kubernetes v1.28 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.13, build 0bab007
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws CLI v2 version: aws-cli/1.42.71 Python/3.10.17 Linux/5.10.245-241.978.amzn2.x86\_64 botocore/1.40.71
  + aws Neuronx DKMS version: 2.24.7.0
  + NVIDIA Driver version: 570.195.03
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.28.15-eks-473151a
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 59.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws Neuronx DKMS version: 2.24.7.0
  + NVIDIA Driver version: 580.95.05
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.24
  + Kubernetes version: v1.28.15-eks-473151a
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 (libnvme 1.13)
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 59.0

------
#### [ Kubernetes v1.29 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.13, build 0bab007
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws CLI v2 version: aws-cli/1.42.71 Python/3.10.17 Linux/5.10.245-241.978.amzn2.x86\_64 botocore/1.40.71
  + aws Neuronx DKMS version: 2.24.7.0
  + NVIDIA Driver version: 570.195.03
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.29.15-eks-473151a
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 59.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws Neuronx DKMS version: 2.24.7.0
  + NVIDIA Driver version: 580.95.05
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.24
  + Kubernetes version: v1.29.15-eks-473151a
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 (libnvme 1.13)
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 59.0

------
#### [ Kubernetes v1.30 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.13, build 0bab007
  + Runc version: 1.3.2
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws CLI v2 version: aws-cli/1.42.69 Python/3.10.17 Linux/5.10.245-241.976.amzn2.x86\_64 botocore/1.40.69
  + aws Neuronx DKMS version: 2.24.7.0
  + NVIDIA Driver version: 570.195.03
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.30.11-eks-473151a
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58.
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws Neuronx DKMS version: 2.24.7.0
  + NVIDIA Driver version: 580.95.05
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.24
  + Kubernetes version: v1.30.11-eks-473151a
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0.
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 59.

------
#### [ Kubernetes v1.31 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: 25.0.13, build 0bab007
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws CLI version: aws-cli/1.42.71 Python/3.10.17 Linux/5.10.245-241.978.amzn2.x86\_64 botocore/1.40.71
  + aws Neuronx DKMS version: 2.24.7.0
  + NVIDIA Driver version: 570.195.03
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.31.7-eks-473151a
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 59.
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws Neuronx DKMS version: 2.24.7.0
  + NVIDIA Driver version: 580.95.05
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.24
  + Kubernetes version: v1.31.13-eks-113cf36
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0.
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 59.
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + NVIDIA Driver version: 580.95.05
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.24
  + Kubernetes version: v1.31.13-eks-113cf36
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0.
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.95.05
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58.

------
#### [ Kubernetes v1.32 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: 25.0.13, build 0bab007
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws CLI version: aws-cli/1.42.74 Python/3.10.17 Linux/5.10.245-241.978.amzn2.x86\_64 botocore/1.40.74
  + aws Neuronx DKMS version: 2.24.7.0
  + NVIDIA Driver version: 570.195.03
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.32.3-eks-473151a
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 59.
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws Neuronx DKMS version: 2.24.7.0
  + NVIDIA Driver version: 580.95.05
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.24
  + Kubernetes version: v1.32.9-eks-113cf36
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0.
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 59.
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + NVIDIA Driver version: 580.95.05
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.24
  + Kubernetes version: v1.32.9-eks-113cf36
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0.
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.95.05
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58.

------
#### [ Kubernetes v1.33 ]
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws Neuronx DKMS version: 2.24.7.0
  + NVIDIA Driver version: 580.95.05
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.24
  + Kubernetes version: v1.33.5-eks-113cf36
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0.
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 59.
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + NVIDIA Driver version: 580.95.05
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.24
  + Kubernetes version: v1.33.5-eks-113cf36
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0.
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.95.05
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58.

------
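The Kubernetes versions in the lists above (for example, `v1.33.5-eks-113cf36`) combine the upstream Kubernetes release with an EKS build identifier. A minimal sketch of splitting such a string into its parts; the function name, regex, and returned field names are illustrative assumptions, not an AWS-provided API:

```python
import re

# Hypothetical helper: split a kubelet version string such as
# "v1.33.5-eks-113cf36" into the upstream Kubernetes version tuple
# and the EKS build identifier.
def parse_eks_version(s: str) -> dict:
    m = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)-eks-([0-9a-f]+)", s)
    if m is None:
        raise ValueError(f"unrecognized version string: {s!r}")
    major, minor, patch, build = m.groups()
    return {"kubernetes": (int(major), int(minor), int(patch)), "eks_build": build}

print(parse_eks_version("v1.33.5-eks-113cf36"))
# → {'kubernetes': (1, 33, 5), 'eks_build': '113cf36'}
```

The same pattern matches the other strings in this section, such as `v1.29.15-eks-473151a`.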

## SageMaker HyperPod AMI releases for Amazon EKS: November 07, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20251107"></a>

**AMI general updates**
+ Released updates for SageMaker HyperPod AMI for Amazon EKS versions 1.28, 1.29, 1.30, 1.31, 1.32, and 1.33. 
+ Base DLAMI release note is available [here](https://docs.aws.amazon.com//dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).

**SageMaker HyperPod DLAMI for Amazon EKS support**

This release includes the following updates:

------
#### [ Kubernetes v1.28 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + NVIDIA driver version: 570.195.03
  + CUDA version: 12.8
  + Kubernetes version: 1.28.15
+ AL2023 (x86\_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.28.15
+ Package updates include boto3, botocore, pip, regex, psutil, and nvidia container toolkit components.
+ Added package: annotated-doc 0.0.3

------
#### [ Kubernetes v1.29 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + NVIDIA driver version: 570.195.03
  + CUDA version: 12.8
  + Kubernetes version: 1.29.15
+ AL2023 (x86\_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.29.15
+ Package updates include kernel updates, glibc updates, and various system libraries.
+ Added package: annotated-doc 0.0.3

------
#### [ Kubernetes v1.30 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + NVIDIA driver version: 570.195.03
  + CUDA version: 12.8
  + Kubernetes version: 1.30.11
+ AL2023 (x86\_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.30.11
+ Package updates include kernel livepatch updates and system library updates.
+ Added package: annotated-doc 0.0.3

------
#### [ Kubernetes v1.31 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + NVIDIA driver version: 570.195.03
  + CUDA version: 12.8
  + Kubernetes version: 1.31.7
+ AL2023 (x86\_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.31.13
+ AL2023 (ARM64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.31.13
  + Kernel version: 6.12.46-66.121.amzn2023.aarch64
+ Package updates include extensive system library updates, kernel updates, and boost library updates.
+ Added packages: apr-util-lmdb, kernel-livepatch-6.1.156-177.286

------
#### [ Kubernetes v1.32 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + NVIDIA driver version: 570.195.03
  + CUDA version: 12.8
  + Kubernetes version: 1.32.3
  + AWS IAM Authenticator version: v0.6.29
+ AL2023 (x86\_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.32.9
+ AL2023 (ARM64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.32.9
  + Kernel version: 6.12.46-66.121.amzn2023.aarch64
+ Package updates include kernel livepatch updates and system library updates.
+ Added package: annotated-doc 0.0.3

------
#### [ Kubernetes v1.33 ]
+ AL2023 (x86\_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.33.5
  + Kernel version: 6.1.155-176.282.amzn2023.x86\_64
+ AL2023 (ARM64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.33.5
  + Kernel version: 6.12.46-66.121.amzn2023.aarch64
+ Package updates include extensive system library updates, kernel updates, and boost library updates.
+ Added packages: apr-util-lmdb, kernel-livepatch updates

------

**Note**  
The runc version has been upgraded to 1.3.2. For more information, see the [security bulletin](https://aws.amazon.com/security/security-bulletins/rss/aws-2025-024/).
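To confirm that a node carries the patched runc, you can compare the version reported by `runc --version` against the minimum named above. A minimal sketch, assuming plain dotted version strings (pre-release suffixes are not handled):

```python
# Hypothetical check: is a reported runc version at or above the
# minimum patched version (1.3.2) named in the security bulletin?
def runc_is_patched(version: str, minimum: str = "1.3.2") -> bool:
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(version) >= as_tuple(minimum)

print(runc_is_patched("1.3.3"))  # True
print(runc_is_patched("1.2.6"))  # False
```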

## SageMaker HyperPod AMI releases for Amazon EKS: October 29, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20251029"></a>

**AMI general updates**
+ Released updates for SageMaker HyperPod AMI for Amazon EKS versions 1.28, 1.29, 1.30, 1.31, 1.32, and 1.33. 
+ Base DLAMI release note is available [here](https://docs.aws.amazon.com//dlami/latest/devguide/aws-deep-learning-ami-baseoss-aml2-2025-10-14.html).

**SageMaker HyperPod DLAMI for Amazon EKS support**

This release includes the following updates:

------
#### [ Kubernetes v1.28 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + NVIDIA driver version: 570.195.03
  + CUDA version: 12.8
  + Kubernetes version: 1.28.15
+ AL2023 (x86\_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.28.15
+ Package updates include boto3, botocore, pip, regex, psutil, and nvidia container toolkit components.
+ Added package: annotated-doc 0.0.3

------
#### [ Kubernetes v1.29 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + NVIDIA driver version: 570.195.03
  + CUDA version: 12.8
  + Kubernetes version: 1.29.15
+ AL2023 (x86\_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.29.15
+ Package updates include kernel updates, glibc updates, and various system libraries.
+ Added package: annotated-doc 0.0.3

------
#### [ Kubernetes v1.30 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + NVIDIA driver version: 570.195.03
  + CUDA version: 12.8
  + Kubernetes version: 1.30.11
+ AL2023 (x86\_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.30.11
+ Package updates include kernel livepatch updates and system library updates.
+ Added package: annotated-doc 0.0.3

------
#### [ Kubernetes v1.31 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + NVIDIA driver version: 570.195.03
  + CUDA version: 12.8
  + Kubernetes version: 1.31.7
+ AL2023 (x86\_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.31.13
+ AL2023 (ARM64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.31.13
  + Kernel version: 6.12.46-66.121.amzn2023.aarch64
+ Package updates include extensive system library updates, kernel updates, and boost library updates.
+ Added packages: apr-util-lmdb, kernel-livepatch-6.1.156-177.286

------
#### [ Kubernetes v1.32 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + NVIDIA driver version: 570.195.03
  + CUDA version: 12.8
  + Kubernetes version: 1.32.3
+ AL2023 (x86\_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.32.9
+ AL2023 (ARM64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.32.9
  + Kernel version: 6.12.46-66.121.amzn2023.aarch64
+ Package updates include kernel livepatch updates and system library updates.
+ Added package: annotated-doc 0.0.3

------
#### [ Kubernetes v1.33 ]
+ AL2023 (x86\_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.33.5
  + Kernel version: 6.1.155-176.282.amzn2023.x86\_64
+ AL2023 (ARM64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.33.5
  + Kernel version: 6.12.46-66.121.amzn2023.aarch64
+ Package updates include extensive system library updates, kernel updates, and boost library updates.
+ Added packages: apr-util-lmdb, kernel-livepatch updates

------

## SageMaker HyperPod AMI releases for Amazon EKS: October 22, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20251022"></a>

**AL2 x86\_64**

**Note**  
Amazon Linux 2 is now deprecated. The Kubernetes AMI is based on AL2023.

Base DLAMI release note is available [here](https://docs.aws.amazon.com//dlami/latest/devguide/aws-deep-learning-ami-baseoss-aml2-2025-10-14.html).
+ EKS versions 1.28 - 1.32
+ This release contains CVE patches for affected NVIDIA Driver packages found in the [Nvidia October Security Bulletin](https://nvidia.custhelp.com/app/answers/detail/a_id/5703).
+ NVIDIA SMI

  ```
  NVIDIA-SMI 570.195.03             
  Driver Version: 570.195.03     
  CUDA Version: 12.8
  ```
+ Major versions  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-release-ami-eks.html)
+ Added packages: No packages were added in this release.
+ Updated packages  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-release-ami-eks.html)
+ Removed packages: No packages were removed in this release.
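The NVIDIA SMI banner shown above can be checked programmatically against the versions this release ships. A minimal sketch; in practice the banner text would come from running `nvidia-smi` on a node, but here it is pasted as a literal for illustration:

```python
import re

# Extract the driver and CUDA versions from an NVIDIA SMI banner.
# The banner below is copied from this release note's sample output.
banner = """
NVIDIA-SMI 570.195.03
Driver Version: 570.195.03
CUDA Version: 12.8
"""

driver = re.search(r"Driver Version:\s*([\d.]+)", banner).group(1)
cuda = re.search(r"CUDA Version:\s*([\d.]+)", banner).group(1)
print(driver, cuda)  # 570.195.03 12.8
```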

**AL2023 x86\_64**

Base DLAMI release note is available [here](https://docs.aws.amazon.com//dlami/latest/devguide/aws-deep-learning-ami-gpubaseoss-al2023-2025-10-14.html).
+ EKS versions 1.28 - 1.32. No release for EKS version 1.33.
+ This release contains CVE patches for affected NVIDIA Driver packages found in the [Nvidia October Security Bulletin](https://nvidia.custhelp.com/app/answers/detail/a_id/5703).
+ NVIDIA SMI

  ```
  NVIDIA-SMI 580.95.05             
  Driver Version: 580.95.05  
  CUDA Version: 13.0
  ```
+ Major versions  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-release-ami-eks.html)
+ Added packages: No packages were added in this release.
+ Updated packages  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-release-ami-eks.html)
+ Removed packages: No packages were removed in this release.

**AL2023 ARM64**

Base DLAMI release note is available [here](https://docs.aws.amazon.com//dlami/latest/devguide/aws-deep-learning-ami-gpubaseossarm64-al2023-2025-10-14.html).
+ EKS versions 1.31 - 1.33.
+ This release contains CVE patches for affected NVIDIA Driver packages found in the [Nvidia October Security Bulletin](https://nvidia.custhelp.com/app/answers/detail/a_id/5703).
+ NVIDIA SMI

  ```
  NVIDIA-SMI 580.95.05        
  Driver Version: 580.95.05    
  CUDA Version: 13.0
  ```
+ Major versions  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-release-ami-eks.html)
+ Added packages: No packages were added in this release.
+ Updated packages  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-release-ami-eks.html)
+ Removed packages: No packages were removed in this release.

## SageMaker HyperPod AMI releases for Amazon EKS: September 29, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250929"></a>

**AMI general updates**
+ Released the new SageMaker HyperPod AMI for Amazon EKS 1.33.
**Important**  
The Dynamic Resource Allocation beta Kubernetes API is enabled by default in this release.  
This API improves scheduling and monitoring workloads that require resources such as GPUs.
This API was developed by the open source Kubernetes community and might change in future versions of Kubernetes. Before you use the API, review the [Kubernetes documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/) and understand how it affects your workloads.
HyperPod is not releasing a HyperPod Amazon Linux 2 AMI for Kubernetes 1.33. AWS recommends that you migrate to AL2023. For more information, see [Upgrade from Amazon Linux 2 to AL2023](https://docs.aws.amazon.com/eks/latest/userguide/al2023.html).

For more information, see [Kubernetes v1.33](https://kubernetes.io/blog/2025/04/23/kubernetes-v1-33-release/).
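Because the DRA beta API is enabled by default only on the Kubernetes 1.33 AMIs in this release, you may want to gate workloads on the node's kubelet version. A minimal sketch; the function name and threshold reflect only what this release note states, not a general Kubernetes feature-gate query:

```python
# Hypothetical gate: per this release note, the DRA beta API is enabled
# by default on the Kubernetes 1.33 AMIs. Kubelet versions look like
# "v1.33.5-eks-113cf36".
def dra_enabled_by_default(kubelet_version: str) -> bool:
    core = kubelet_version.lstrip("v").split("-")[0]
    major, minor = (int(x) for x in core.split(".")[:2])
    return (major, minor) >= (1, 33)

print(dra_enabled_by_default("v1.33.5-eks-113cf36"))  # True
print(dra_enabled_by_default("v1.32.9-eks-113cf36"))  # False
```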

**SageMaker HyperPod DLAMI for Amazon EKS support**

This release includes the following updates:

------
#### [ Kubernetes v1.28 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ NVIDIA SMI:
  + NVIDIA driver version: 570.172.08
  + CUDA version: 12.8
+ Packages:
  + Languages and core libraries:
    + GCC: 11.5.0-5.amzn2023.0.5
    + GCC 14: 14.2.1-7.amzn2023.0.1
    + Java: 17.0.16\_8-1.amzn2023.1
    + Perl: 5.32.1-477.amzn2023.0.7
    + Python: 3.9.23-1.amzn2023.0.3
    + Go: 3.2.0-37.amzn2023
    + Rust: 1.89.0-1.amzn2023.0.2
  + Core Libraries:
    + GlibC: 2.34-196.amzn2023.0.1
    + OpenSSL: 3.2.2-1.amzn2023.0.1
    + Zlib: 1.2.11-33.amzn2023.0.5
    + XZ Utils: 5.2.5-9.amzn2023.0.2
    + Util-linux: 2.37.4-1.amzn2023.0.4
  + Neuron:
    + aws-neuronx-dkms: 2.23.9.0-dkms
    + aws-neuronx-tools: 2.25.145.0-1
  + EFA:
    + efa driver: 2.17.2-1.amzn2023
    + efa config: 1.18-1.amzn2023
    + efa nv peermem: 1.2.2-1.amzn2023
    + efa profile: 1.7-1.amzn2023
  + kernel:
    + kernel: 6.1.148-173.267.amzn2023
    + kernel development: 6.1.148-173.267.amzn2023
    + kernel headers: 6.1.148-173.267.amzn2023
    + kernel tools: 6.1.148-173.267.amzn2023
    + kernel modules extra: 6.1.148-173.267.amzn2023
    + kernel livepatch: 1.0-0.amzn2023
  + Nvidia:
    + nvidia container toolkit: 1.17.8-1
    + nvidia container toolkit base: 1.17.8-1
    + libnvidia-container: 1.17.8-1 (with tools)
    + nvidia fabric manager: 570.172.08-1
    + libnvidia-nscq: 570.172.08-1

------
#### [ Kubernetes v1.29 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ NVIDIA SMI:
  + NVIDIA driver version: 570.172.08
  + CUDA version: 12.8
+ Packages:
  + Languages and core libraries:
    + GCC: 11.5.0-5.amzn2023.0.5
    + GCC 14: 14.2.1-7.amzn2023.0.1
    + Java: 17.0.16\_8-1.amzn2023.1
    + Perl: 5.32.1-477.amzn2023.0.7
    + Python: 3.9.23-1.amzn2023.0.3
    + Go: 3.2.0-37.amzn2023
    + Rust: 1.89.0-1.amzn2023.0.2
  + Core Libraries:
    + GlibC: 2.34-196.amzn2023.0.1
    + OpenSSL: 3.2.2-1.amzn2023.0.1
    + Zlib: 1.2.11-33.amzn2023.0.5
    + XZ Utils: 5.2.5-9.amzn2023.0.2
    + Util-linux: 2.37.4-1.amzn2023.0.4
  + Neuron:
    + aws-neuronx-dkms: 2.23.9.0-dkms
    + aws-neuronx-tools: 2.25.145.0-1
  + EFA:
    + efa driver: 2.17.2-1.amzn2023
    + efa config: 1.18-1.amzn2023
    + efa nv peermem: 1.2.2-1.amzn2023
    + efa profile: 1.7-1.amzn2023
  + kernel:
    + kernel: 6.1.148-173.267.amzn2023
    + kernel development: 6.1.148-173.267.amzn2023
    + kernel headers: 6.1.148-173.267.amzn2023
    + kernel tools: 6.1.148-173.267.amzn2023
    + kernel modules extra: 6.1.148-173.267.amzn2023
    + kernel livepatch: 1.0-0.amzn2023
  + Nvidia:
    + nvidia container toolkit: 1.17.8-1
    + nvidia container toolkit base: 1.17.8-1
    + libnvidia-container: 1.17.8-1 (with tools)
    + nvidia fabric manager: 570.172.08-1
    + libnvidia-nscq: 570.172.08-1

------
#### [ Kubernetes v1.30 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ NVIDIA SMI:
  + NVIDIA driver version: 570.172.08
  + CUDA version: 12.8
+ Packages:
  + Languages and core libraries:
    + GCC: 11.5.0-5.amzn2023.0.5
    + GCC 14: 14.2.1-7.amzn2023.0.1
    + Java: 17.0.16\_8-1.amzn2023.1
    + Perl: 5.32.1-477.amzn2023.0.7
    + Python: 3.9.23-1.amzn2023.0.3
    + Go: 3.2.0-37.amzn2023
    + Rust: 1.89.0-1.amzn2023.0.2
  + Core Libraries:
    + GlibC: 2.34-196.amzn2023.0.1
    + OpenSSL: 3.2.2-1.amzn2023.0.1
    + Zlib: 1.2.11-33.amzn2023.0.5
    + XZ Utils: 5.2.5-9.amzn2023.0.2
    + Util-linux: 2.37.4-1.amzn2023.0.4
  + Neuron:
    + aws-neuronx-dkms: 2.23.9.0-dkms
    + aws-neuronx-tools: 2.25.145.0-1
  + EFA:
    + efa driver: 2.17.2-1.amzn2023
    + efa config: 1.18-1.amzn2023
    + efa nv peermem: 1.2.2-1.amzn2023
    + efa profile: 1.7-1.amzn2023
  + kernel:
    + kernel: 6.1.148-173.267.amzn2023
    + kernel development: 6.1.148-173.267.amzn2023
    + kernel headers: 6.1.148-173.267.amzn2023
    + kernel tools: 6.1.148-173.267.amzn2023
    + kernel modules extra: 6.1.148-173.267.amzn2023
    + kernel livepatch: 1.0-0.amzn2023
  + Nvidia:
    + nvidia container toolkit: 1.17.8-1
    + nvidia container toolkit base: 1.17.8-1
    + libnvidia-container: 1.17.8-1 (with tools)
    + nvidia fabric manager: 570.172.08-1
    + libnvidia-nscq: 570.172.08-1

------
#### [ Kubernetes v1.31 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ NVIDIA SMI:
  + NVIDIA driver version: 570.172.08
  + CUDA version: 12.8
+ Packages:
  + Languages and core libraries:
    + GCC: 11.5.0-5.amzn2023.0.5
    + GCC 14: 14.2.1-7.amzn2023.0.1
    + Java: 17.0.16\_8-1.amzn2023.1
    + Perl: 5.32.1-477.amzn2023.0.7
    + Python: 3.9.23-1.amzn2023.0.3
    + Go: 3.2.0-37.amzn2023
    + Rust: 1.89.0-1.amzn2023.0.2
  + Core Libraries:
    + GlibC: 2.34-196.amzn2023.0.1
    + OpenSSL: 3.2.2-1.amzn2023.0.1
    + Zlib: 1.2.11-33.amzn2023.0.5
    + XZ Utils: 5.2.5-9.amzn2023.0.2
    + Util-linux: 2.37.4-1.amzn2023.0.4
  + Neuron:
    + aws-neuronx-dkms: 2.23.9.0-dkms
    + aws-neuronx-tools: 2.25.145.0-1
  + EFA:
    + efa driver: 2.17.2-1.amzn2023
    + efa config: 1.18-1.amzn2023
    + efa nv peermem: 1.2.2-1.amzn2023
    + efa profile: 1.7-1.amzn2023
  + kernel:
    + kernel: 6.1.148-173.267.amzn2023
    + kernel development: 6.1.148-173.267.amzn2023
    + kernel headers: 6.1.148-173.267.amzn2023
    + kernel tools: 6.1.148-173.267.amzn2023
    + kernel modules extra: 6.1.148-173.267.amzn2023
    + kernel livepatch: 1.0-0.amzn2023
  + Nvidia:
    + nvidia container toolkit: 1.17.8-1
    + nvidia container toolkit base: 1.17.8-1
    + libnvidia-container: 1.17.8-1 (with tools)
    + nvidia fabric manager: 570.172.08-1
    + libnvidia-nscq: 570.172.08-1

------
#### [ Kubernetes v1.32 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ NVIDIA SMI:
  + NVIDIA driver version: 570.172.08
  + CUDA version: 12.8
+ Packages:
  + Languages and core libraries:
    + GCC: 11.5.0-5.amzn2023.0.5
    + GCC 14: 14.2.1-7.amzn2023.0.1
    + Java: 17.0.16\_8-1.amzn2023.1
    + Perl: 5.32.1-477.amzn2023.0.7
    + Python: 3.9.23-1.amzn2023.0.3
    + Go: 3.2.0-37.amzn2023
    + Rust: 1.89.0-1.amzn2023.0.2
  + Core Libraries:
    + GlibC: 2.34-196.amzn2023.0.1
    + OpenSSL: 3.2.2-1.amzn2023.0.1
    + Zlib: 1.2.11-33.amzn2023.0.5
    + XZ Utils: 5.2.5-9.amzn2023.0.2
    + Util-linux: 2.37.4-1.amzn2023.0.4
  + Neuron:
    + aws-neuronx-dkms: 2.23.9.0-dkms
    + aws-neuronx-tools: 2.25.145.0-1
  + EFA:
    + efa driver: 2.17.2-1.amzn2023
    + efa config: 1.18-1.amzn2023
    + efa nv peermem: 1.2.2-1.amzn2023
    + efa profile: 1.7-1.amzn2023
  + kernel:
    + kernel: 6.1.148-173.267.amzn2023
    + kernel development: 6.1.148-173.267.amzn2023
    + kernel headers: 6.1.148-173.267.amzn2023
    + kernel tools: 6.1.148-173.267.amzn2023
    + kernel modules extra: 6.1.148-173.267.amzn2023
    + kernel livepatch: 1.0-0.amzn2023
  + Nvidia:
    + nvidia container toolkit: 1.17.8-1
    + nvidia container toolkit base: 1.17.8-1
    + libnvidia-container: 1.17.8-1 (with tools)
    + nvidia fabric manager: 570.172.08-1
    + libnvidia-nscq: 570.172.08-1

------
#### [ Kubernetes v1.33 ]

The following table contains information about components within this AMI release and the corresponding versions.


| Component | AL2023 x86\_64 | AL2023 ARM64 | 
| --- | --- | --- | 
| EKS | v1.33.4 | v1.33.4 | 
| amazon-ssm-agent | 3.3.2299.0-1.amzn2023 | 3.3.2299.0-1.amzn2023 | 
| aws-neuronx-dkms | 2.23.9.0-dkms | N/A | 
| containerd | 1.7.27-1.eks.amzn2023.0.4 | 1.7.27-1.eks.amzn2023.0.4 | 
| efa | 2.17.2-1.amzn2023 | 2.17.2-1.amzn2023 | 
| ena | 2.14.1g | 2.14.1g | 
| kernel | 6.12.40-64.114.amzn2023 | N/A | 
| kernel6.12 | N/A | 6.12.40-64.114.amzn2023 | 
| kmod-nvidia-latest-dkms | 570.172.08-1.amzn2023 | 570.172.08-1.el9 | 
| nvidia-container-toolkit | 1.17.8-1 | 1.17.8-1 | 
| runc | 1.2.6-1.amzn2023.0.1 | 1.2.6-1.amzn2023.0.1 | 

------

## SageMaker HyperPod AMI releases for Amazon EKS: August 25, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250825"></a>

**SageMaker HyperPod DLAMI for Amazon EKS support**

This release includes the following updates:

------
#### [ Kubernetes v1.28 ]

**NVIDIA SMI:**
+ Nvidia Driver Version: 570.172.08
+ CUDA Version: 12.8

**Added Packages:**
+ kernel-livepatch-5.10.240-238.955.x86\_64 1.0-0.amzn2 amzn2extra-kernel-5.10

**Updated Packages:**
+ gdk-pixbuf2.x86\_64: 2.36.12-3.amzn2 → 2.36.12-3.amzn2.0.2
+ kernel.x86\_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-devel.x86\_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-headers.x86\_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-tools.x86\_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ libgs.x86\_64: 9.54.0-9.amzn2.0.11 → 9.54.0-9.amzn2.0.12
+ microcode\_ctl.x86\_64: 2:2.1-47.amzn2.4.24 → 2:2.1-47.amzn2.4.25
+ pam.x86\_64: 1.1.8-23.amzn2.0.2 → 1.1.8-23.amzn2.0.4

**Removed Packages:**
+ kernel-livepatch-5.10.239-236.958.x86\_64 1.0-0.amzn2 amzn2extra-kernel-5.10

**Repository Changed:**
+ libnvidia-container-tools.x86\_64: cuda-rhel8-x86\_64 → nvidia-container-toolkit
+ libnvidia-container1.x86\_64: cuda-rhel8-x86\_64 → nvidia-container-toolkit
+ nvidia-container-toolkit.x86\_64: cuda-rhel8-x86\_64 → nvidia-container-toolkit
+ nvidia-container-toolkit-base.x86\_64: cuda-rhel8-x86\_64 → nvidia-container-toolkit

------
#### [ Kubernetes v1.29 ]

**NVIDIA SMI:**
+ Nvidia Driver Version: 570.172.08
+ CUDA Version: 12.8

**Added Packages:**
+ kernel-livepatch-5.10.240-238.955.x86\_64 1.0-0.amzn2 amzn2extra-kernel-5.10

**Updated Packages:**
+ gdk-pixbuf2.x86\$164: 2.36.12-3.amzn2 → 2.36.12-3.amzn2.0.2
+ kernel.x86\$164: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-devel.x86\$164: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-headers.x86\$164: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-tools.x86\$164: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ libgs.x86\$164: 9.54.0-9.amzn2.0.11 → 9.54.0-9.amzn2.0.12
+ microcode\$1ctl.x86\$164: 2:2.1-47.amzn2.4.24 → 2:2.1-47.amzn2.4.25
+ pam.x86\$164: 1.1.8-23.amzn2.0.2 → 1.1.8-23.amzn2.0.4

**Removed Packages:**
+ kernel-livepatch-5.10.239-236.958.x86\$164 1.0-0.amzn2 amzn2extra-kernel-5.10

**Repository Changed:**
+ libnvidia-container-tools.x86\$164: cuda-rhel8-x86\$164 → nvidia-container-toolkit
+ libnvidia-container1.x86\$164: cuda-rhel8-x86\$164 → nvidia-container-toolkit
+ nvidia-container-toolkit.x86\$164: cuda-rhel8-x86\$164 → nvidia-container-toolkit
+ nvidia-container-toolkit-base.x86\$164: cuda-rhel8-x86\$164 → nvidia-container-toolkit

------
#### [ Kubernetes v1.30 ]

**NVIDIA SMI:**
+ Nvidia Driver Version: 570.172.08
+ CUDA Version: 12.8

**Added Packages:**
+ kernel-livepatch-5.10.240-238.955.x86_64 1.0-0.amzn2 amzn2extra-kernel-5.10

**Updated Packages:**
+ aws-neuronx-dkms.noarch: 2.22.2.0-dkms → 2.23.9.0-dkms
+ efa.x86_64: 2.15.3-1.amzn2 → 2.17.2-1.amzn2
+ efa-nv-peermem.x86_64: 1.2.1-1.amzn2 → 1.2.2-1.amzn2
+ gdk-pixbuf2.x86_64: 2.36.12-3.amzn2 → 2.36.12-3.amzn2.0.2
+ ibacm.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ infiniband-diags.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ kernel.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-devel.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-headers.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-tools.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ libfabric-aws.x86_64: 2.1.0amzn3.0-1.amzn2 → 2.1.0amzn5.0-1.amzn2
+ libfabric-aws-devel.x86_64: 2.1.0amzn3.0-1.amzn2 → 2.1.0amzn5.0-1.amzn2
+ libgs.x86_64: 9.54.0-9.amzn2.0.11 → 9.54.0-9.amzn2.0.12
+ libibumad.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ libibverbs.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ libibverbs-core.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ libibverbs-utils.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ libnccl-ofi.x86_64: 1.15.0-1.amzn2 → 1.16.2-1.amzn2
+ librdmacm.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ librdmacm-utils.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ microcode_ctl.x86_64: 2:2.1-47.amzn2.4.24 → 2:2.1-47.amzn2.4.25
+ pam.x86_64: 1.1.8-23.amzn2.0.2 → 1.1.8-23.amzn2.0.4
+ rdma-core.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ rdma-core-devel.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2

**Removed Packages:**
+ kernel-livepatch-5.10.239-236.958.x86_64 1.0-0.amzn2 amzn2extra-kernel-5.10

**Repository Changed:**
+ libnvidia-container-tools.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ libnvidia-container1.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ nvidia-container-toolkit.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ nvidia-container-toolkit-base.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit

------
#### [ Kubernetes v1.31 ]

**NVIDIA SMI:**
+ Nvidia Driver Version: 570.172.08
+ CUDA Version: 12.8

**Added Packages:**
+ kernel-livepatch-5.10.240-238.955.x86_64 1.0-0.amzn2 amzn2extra-kernel-5.10

**Updated Packages:**
+ gdk-pixbuf2.x86_64: 2.36.12-3.amzn2 → 2.36.12-3.amzn2.0.2
+ kernel.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-devel.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-headers.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-tools.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ libgs.x86_64: 9.54.0-9.amzn2.0.11 → 9.54.0-9.amzn2.0.12
+ microcode_ctl.x86_64: 2:2.1-47.amzn2.4.24 → 2:2.1-47.amzn2.4.25
+ pam.x86_64: 1.1.8-23.amzn2.0.2 → 1.1.8-23.amzn2.0.4

**Removed Packages:**
+ kernel-livepatch-5.10.239-236.958.x86_64 1.0-0.amzn2 amzn2extra-kernel-5.10

**Repository Changed:**
+ libnvidia-container-tools.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ libnvidia-container1.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ nvidia-container-toolkit.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ nvidia-container-toolkit-base.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit

------
#### [ Kubernetes v1.32 ]

**NVIDIA SMI:**
+ Nvidia Driver Version: 570.172.08
+ CUDA Version: 12.8

**Added Packages:**
+ kernel-livepatch-5.10.240-238.955.x86_64 1.0-0.amzn2 amzn2extra-kernel-5.10

**Updated Packages:**
+ aws-neuronx-dkms.noarch: 2.22.2.0-dkms → 2.23.9.0-dkms
+ efa.x86_64: 2.15.3-1.amzn2 → 2.17.2-1.amzn2
+ efa-nv-peermem.x86_64: 1.2.1-1.amzn2 → 1.2.2-1.amzn2
+ gdk-pixbuf2.x86_64: 2.36.12-3.amzn2 → 2.36.12-3.amzn2.0.2
+ ibacm.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ infiniband-diags.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ kernel.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-devel.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-headers.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-tools.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ libfabric-aws.x86_64: 2.1.0amzn3.0-1.amzn2 → 2.1.0amzn5.0-1.amzn2
+ libfabric-aws-devel.x86_64: 2.1.0amzn3.0-1.amzn2 → 2.1.0amzn5.0-1.amzn2
+ libgs.x86_64: 9.54.0-9.amzn2.0.11 → 9.54.0-9.amzn2.0.12
+ libibumad.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ libibverbs.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ libibverbs-core.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ libibverbs-utils.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ libnccl-ofi.x86_64: 1.15.0-1.amzn2 → 1.16.2-1.amzn2
+ librdmacm.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ librdmacm-utils.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ microcode_ctl.x86_64: 2:2.1-47.amzn2.4.24 → 2:2.1-47.amzn2.4.25
+ pam.x86_64: 1.1.8-23.amzn2.0.2 → 1.1.8-23.amzn2.0.4
+ rdma-core.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ rdma-core-devel.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2

**Removed Packages:**
+ kernel-livepatch-5.10.239-236.958.x86_64 1.0-0.amzn2 amzn2extra-kernel-5.10

**Repository Changed:**
+ libnvidia-container-tools.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ libnvidia-container1.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ nvidia-container-toolkit.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ nvidia-container-toolkit-base.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit

------

## SageMaker HyperPod AMI releases for Amazon EKS: August 12, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250812"></a>

**The AMI includes the following:**
+ Supported AWS Service: Amazon EC2
+ Operating System: Amazon Linux 2023
+ Compute Architecture: ARM64
+ Latest available version is installed for the following packages:
  + Linux Kernel: 6.12
  + FSx Lustre
  + Docker
  + AWS CLI v2 at `/usr/bin/aws`
  + NVIDIA DCGM
  + Nvidia container toolkit:
    + Version command: `nvidia-container-cli -V`
  + Nvidia-docker2:
    + Version command: `nvidia-docker version`
  + Nvidia-IMEX: v570.172.08-1
+ NVIDIA Driver: 570.158.01
+ NVIDIA CUDA 12.4, 12.5, 12.6, 12.8 stack:
  + CUDA, NCCL, and cuDNN installation directories: `/usr/local/cuda-xx.x/`
    + Example: `/usr/local/cuda-12.4/`, `/usr/local/cuda-12.8/`
  + Compiled NCCL version:
    + For the CUDA 12.4 directory, compiled NCCL version 2.22.3+CUDA12.4
    + For the CUDA 12.5 directory, compiled NCCL version 2.22.3+CUDA12.5
    + For the CUDA 12.6 directory, compiled NCCL version 2.24.3+CUDA12.6
    + For the CUDA 12.8 directory, compiled NCCL version 2.27.5+CUDA12.8
  + Default CUDA: 12.8
    + `/usr/local/cuda` points to CUDA 12.8
    + The following environment variables are updated:
      + `LD_LIBRARY_PATH` includes `/usr/local/cuda-12.8/lib:/usr/local/cuda-12.8/lib64:/usr/local/cuda-12.8:/usr/local/cuda-12.8/targets/sbsa-linux/lib:/usr/local/cuda-12.8/nvvm/lib64:/usr/local/cuda-12.8/extras/CUPTI/lib64`
      + `PATH` includes `/usr/local/cuda-12.8/bin/:/usr/local/cuda-12.8/include/`
      + To use a different CUDA version, update `LD_LIBRARY_PATH` and `PATH` accordingly.
+ EFA installer: 1.42.0
+ Nvidia GDRCopy: 2.5.1
+ AWS OFI NCCL plugin comes with EFA installer
  + Paths `/opt/amazon/ofi-nccl/lib` and `/opt/amazon/ofi-nccl/efa` are added to `LD_LIBRARY_PATH`.
+ AWS CLI v2 at `/usr/local/bin/aws`
+ EBS volume type: gp3
+ Python: `/usr/bin/python3.9`
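The note above about re-pointing `LD_LIBRARY_PATH` and `PATH` at a different CUDA version can be sketched as a small shell snippet. This is a minimal illustration, not part of the AMI: the version `12.6` is an assumed choice among the installed toolkits, and the directory list mirrors the CUDA 12.8 defaults shown above.

```shell
# Choose the CUDA toolkit version to activate (assumed installed under /usr/local).
CUDA_VERSION="12.6"
CUDA_HOME="/usr/local/cuda-${CUDA_VERSION}"

# Rebuild the library search path against the chosen toolkit, mirroring the
# directories listed for the default CUDA 12.8 installation above.
export LD_LIBRARY_PATH="${CUDA_HOME}/lib:${CUDA_HOME}/lib64:${CUDA_HOME}:${CUDA_HOME}/targets/sbsa-linux/lib:${CUDA_HOME}/nvvm/lib64:${CUDA_HOME}/extras/CUPTI/lib64"

# Put the matching compiler and tools first on PATH.
export PATH="${CUDA_HOME}/bin:${PATH}"
```

In a real environment you would also confirm the toolkit directory exists before exporting, for example with `[ -d "$CUDA_HOME" ]`.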

## SageMaker HyperPod AMI releases for Amazon EKS: August 6, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250806"></a>

**SageMaker HyperPod DLAMI for Amazon EKS support**

The AMIs include the following updates:

------
#### [ K8s v1.28 ]
+ **Neuron packages:**
  + **aws-neuronx-collectives:** 2.27.34.0_ec8cd5e8b-1
  + **aws-neuronx-dkms:** 2.23.9.0-dkms
  + **aws-neuronx-runtime-lib:** 2.27.23.0_8deec4dbf-1
  + **aws-neuronx-k8-plugin:** 2.27.7.0-1
  + **aws-neuronx-k8-scheduler:** 2.27.7.0-1
  + **aws-neuronx-tools:** 2.25.145.0-1

------
#### [ K8s v1.29 ]
+ **Neuron packages:**
  + **aws-neuronx-collectives:** 2.27.34.0_ec8cd5e8b-1
  + **aws-neuronx-dkms:** 2.23.9.0-dkms
  + **aws-neuronx-runtime-lib:** 2.27.23.0_8deec4dbf-1
  + **aws-neuronx-k8-plugin:** 2.27.7.0-1
  + **aws-neuronx-k8-scheduler:** 2.27.7.0-1
  + **aws-neuronx-tools:** 2.25.145.0-1

------
#### [ K8s v1.30 ]
+ **Neuron packages:**
  + **aws-neuronx-collectives:** 2.27.34.0_ec8cd5e8b-1
  + **aws-neuronx-dkms:** 2.23.9.0-dkms
  + **aws-neuronx-runtime-lib:** 2.27.23.0_8deec4dbf-1
  + **aws-neuronx-k8-plugin:** 2.27.7.0-1
  + **aws-neuronx-k8-scheduler:** 2.27.7.0-1
  + **aws-neuronx-tools:** 2.25.145.0-1

------
#### [ K8s v1.31 ]
+ **Neuron packages:**
  + **aws-neuronx-collectives:** 2.27.34.0_ec8cd5e8b-1
  + **aws-neuronx-dkms:** 2.23.9.0-dkms
  + **aws-neuronx-runtime-lib:** 2.27.23.0_8deec4dbf-1
  + **aws-neuronx-k8-plugin:** 2.27.7.0-1
  + **aws-neuronx-k8-scheduler:** 2.27.7.0-1
  + **aws-neuronx-tools:** 2.25.145.0-1

------
#### [ K8s v1.32 ]
+ **Neuron packages:**
  + **aws-neuronx-collectives:** 2.27.34.0_ec8cd5e8b-1
  + **aws-neuronx-dkms:** 2.23.9.0-dkms
  + **aws-neuronx-runtime-lib:** 2.27.23.0_8deec4dbf-1
  + **aws-neuronx-k8-plugin:** 2.27.7.0-1
  + **aws-neuronx-k8-scheduler:** 2.27.7.0-1
  + **aws-neuronx-tools:** 2.25.145.0-1

------

**Important**  
Deep Learning Base OSS Nvidia Driver AMI (Amazon Linux 2) Version 70.3
Deep Learning Base Proprietary Nvidia Driver AMI (Amazon Linux 2) Version 68.4
Latest CUDA 12.8 support
Upgraded the NVIDIA driver from 570.158.01 to 570.172.08 to address CVEs listed in the NVIDIA Security Bulletin for July

## SageMaker HyperPod AMI releases for Amazon EKS: July 31, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250731"></a>

Amazon SageMaker HyperPod now supports a new AMI for Amazon EKS clusters that updates the base operating system to Amazon Linux 2023. This release provides several improvements from Amazon Linux 2 (AL2). HyperPod releases new AMIs regularly, and we recommend that you run all of your HyperPod clusters on the latest and most secure versions of AMIs to address vulnerabilities and phase out outdated software and libraries.

### Key upgrades
<a name="sagemaker-hyperpod-release-ami-eks-20250731-specs"></a>
+ **Operating System**: Amazon Linux 2023 (updated from Amazon Linux 2, or AL2)
+ **Package Manager**: DNF is the default package management tool, replacing YUM used in AL2
+ **Networking Service**: `systemd-networkd` manages network interfaces, replacing ISC `dhclient` used in AL2
+ **Linux Kernel**: Version 6.1, updated from the kernel used in AL2
+ **Glibc**: Version 2.34, updated from the version in AL2
+ **GCC**: Version 11.5.0, updated from the version in AL2
+ **NFS**: Version 1:2.6.1, updated from version 1:1.3.4 in AL2
+ **NVIDIA Driver**: Version 570.172.08, a newer driver version
+ **Python**: Version 3.9, replacing Python 2.7 used in AL2
+ **NVME**: Version 1.11.1, a newer version of the NVMe driver

### Before you upgrade
<a name="sagemaker-hyperpod-release-ami-eks-20250731-prereqs"></a>

There are a few important things to know before upgrading. With AL2023, several packages have been added, upgraded, or removed compared to AL2. We strongly recommend that you test your applications with AL2023 before upgrading your clusters. For a comprehensive list of all package changes in AL2023, see [Package changes in Amazon Linux 2023](https://docs.aws.amazon.com/linux/al2023/release-notes/compare-packages.html).

The following are some of the significant changes between AL2 and AL2023:
+ **Python 3.10**: The most significant update, apart from the operating system, is the Python version upgrade. After upgrading, clusters have Python 3.10 as default. While some Python 3.8 distributed training workloads might be compatible with Python 3.10, we strongly recommend that you test your specific workloads separately. If migration to Python 3.10 proves challenging but you still want to upgrade your cluster for other new features, you can install an older Python version by using the command `yum install python-xx.x` with [lifecycle scripts](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-lifecycle-best-practices-slurm.html) before running any workloads. Ensure you test both your existing lifecycle scripts and application code for compatibility.
+ **NVIDIA runtime enforcement**: AL2023 strictly enforces the NVIDIA container runtime requirements, causing containers with hard-coded NVIDIA environment variables (such as `NVIDIA_VISIBLE_DEVICES: "all"`) to fail on CPU-only nodes (whereas AL2 ignored these settings when no GPU drivers are present). You can override the enforcement by setting `NVIDIA_VISIBLE_DEVICES: "void"` in your pod specification or by using CPU-only images.
+ **cgroup v2**: AL2023 features the next generation of unified control group hierarchy (cgroup v2). cgroup v2 is used for container runtimes and is also used by `systemd`. While AL2023 still includes code that can make the system run using cgroup v1, this isn't a recommended configuration.
+ **Amazon VPC CNI and `eksctl` versions**: AL2023 also requires your Amazon VPC CNI version to be 1.16.2 or greater and your `eksctl` version to be 0.176.0 or greater.
+ **EFA on FSx for Lustre**: You can now use EFA on FSx for Lustre, which enables you to achieve application performance comparable to on-premises AI/ML or HPC (high performance computing) clusters, while benefiting from the scalability, flexibility and elasticity of cloud computing.

Additionally, upgrading to AL2023 requires at minimum version `1.0.643.0_1.0.192.0` of Health Monitoring Agent. Complete the following procedure to update the Health Monitoring Agent:

1. If you use HyperPod lifecycle scripts from the GitHub repository [awsome-distributed-training](https://github.com/aws-samples/awsome-distributed-training), make sure to pull the latest version. Earlier versions are not compatible with AL2023. The new lifecycle script ensures that `containerd` uses the additional mounted storage for pulling container images in AL2023.

1. Pull in the latest version of the [HyperPod CLI git repository](https://github.com/aws/sagemaker-hyperpod-cli/tree/main).

1. Update dependencies with the following command: `helm dependencies update helm_chart/HyperPodHelmChart`

1. As mentioned in step 4 of the [README of HyperPodHelmChart](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#step-four-whenever-you-want-to-upgrade-the-installation-of-helm-charts), run the following command to upgrade the version of the dependencies running on the cluster: `helm upgrade dependencies helm_chart/HyperPodHelmChart --namespace kube-system`
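The steps above can be consolidated into a single sketch. This is a hedged outline, not an exact procedure: the `DRY_RUN` guard and the assumption that you run it from the repository root are illustrative, while the `helm` commands themselves are the ones given in the steps.

```shell
# Outline of the Health Monitoring Agent update (steps 2-4). With DRY_RUN=1,
# the default here, commands are only printed so you can review them first.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Step 2: pull the latest HyperPod CLI repository.
run git clone https://github.com/aws/sagemaker-hyperpod-cli.git

# Steps 3 and 4 (run from inside the repository): refresh chart dependencies,
# then upgrade the dependencies release on the cluster.
run helm dependencies update helm_chart/HyperPodHelmChart
run helm upgrade dependencies helm_chart/HyperPodHelmChart --namespace kube-system
```

Set `DRY_RUN=0` only after the printed commands match what you intend to run against your cluster.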

### Workloads that have been tested on upgraded EKS clusters
<a name="sagemaker-hyperpod-release-ami-eks-20250731-tested"></a>

The following are some use cases where the upgrade has been tested:
+ **Backwards compatibility**: Popular distributed training jobs involving PyTorch should be backwards compatible on the new AMI. However, since your workloads may depend on specific Python or Linux libraries, we recommend first testing on a smaller scale or subset of nodes before upgrading your larger clusters.
+ **Accelerator testing**: Jobs across various instance types, utilizing both NVIDIA accelerators (for the P and G instance families) and AWS Neuron accelerators (for Trn instances) have been tested.

### How to upgrade your AMI and associated workloads
<a name="sagemaker-hyperpod-release-ami-eks-20250731-upgrade"></a>

You can upgrade to the new AMI using one of the following methods:
+ Use the [create-cluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) API to create a new cluster with the latest AMI.
+ Use the [update-cluster-software](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API to upgrade your existing cluster. Note that this option re-runs any lifecycle scripts.
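Assuming the AWS CLI is installed and configured for your account, the second option might be invoked as in the following sketch. The cluster name is a placeholder, and the `APPLY` guard is an illustrative safety check, not part of the service.

```shell
# Upgrade an existing HyperPod cluster to the latest AMI via UpdateClusterSoftware.
# By default the command is only printed; set APPLY=1 to actually call the API.
CLUSTER_NAME="my-hyperpod-cluster"  # placeholder: use your cluster's name
CMD="aws sagemaker update-cluster-software --cluster-name ${CLUSTER_NAME}"

if [ "${APPLY:-0}" = "1" ]; then
  $CMD
else
  echo "would run: $CMD"
fi
```

Because the cluster is unavailable during the update, schedule this for a window where training jobs can be restarted from a checkpoint afterward.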

The cluster is unavailable during the update process. We recommend planning for this downtime and restarting the training workload from an existing checkpoint after the upgrade completes. As a best practice, we recommend that you perform testing on a smaller cluster before upgrading your larger clusters.

If the update command fails, first identify the cause of the failure. For lifecycle script failures, make the necessary corrections to your scripts and retry. For any other issues that cannot be resolved, contact [AWS Support](https://aws.amazon.com/premiumsupport/).

### Troubleshooting
<a name="sagemaker-hyperpod-release-ami-eks-20250731-troubleshooting"></a>

Use the following section to help with troubleshooting any issues you encounter when upgrading to AL2023.

**How do I fix errors such as `"nvml error: driver not loaded: unknown"` on CPU-only cluster nodes?**

If containers that worked on CPU AL2 Amazon EKS nodes now fail on AL2023, your container image may have hard-coded NVIDIA environment variables. You can check for hard-coded environment variables with the following command:

```
docker inspect image:tag | grep -i nvidia
```

AL2023 strictly enforces these requirements whereas AL2 was more lenient on CPU-only nodes. One solution is to override the AL2023 enforcement by setting certain NVIDIA environment variables in your Amazon EKS pod specification, as shown in the following example:

```yaml
containers:
- name: your-container
  image: your-image:tag
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "void"
  - name: NVIDIA_DRIVER_CAPABILITIES
    value: ""
```

Another alternative is to use CPU-only container images (such as `pytorch/pytorch:latest-cpu`) or build custom images without NVIDIA dependencies.

## SageMaker HyperPod AMI releases for Amazon EKS: July 15, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250715"></a>

**SageMaker HyperPod DLAMI for Amazon EKS support**

The AMIs include the following updates:

------
#### [ K8s v1.28 ]
+ **Latest NVIDIA Driver:** 550.163.01
+ **Default CUDA:** 12.4
+ **EFA Installer:** 1.38.0
+ **Neuron packages:**
  + **aws-neuronx-dkms.noarch:** 2.22.2.0-dkms
  + **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
  + **aws-neuronx-tools.x86_64:** 2.18.3.0-1
  + **aws-neuron-dkms.noarch:** 2.3.26.0-dkms
  + **aws-neuron-k8-plugin.x86_64:** 1.9.3.0-1
  + **aws-neuron-k8-scheduler.x86_64:** 1.9.3.0-1
  + **aws-neuron-runtime.x86_64:** 1.6.24.0-1
  + **aws-neuron-runtime-base.x86_64:** 1.6.21.0-1
  + **aws-neuron-tools.x86_64:** 2.1.4.0-1
  + **aws-neuronx-collectives.x86_64:** 2.26.43.0_47cc904ea-1
  + **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
  + **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.16.2.0-1
  + **aws-neuronx-gpsimd-tools.x86_64:** 0.16.1.0_0a6506a47-1
  + **aws-neuronx-k8-plugin.x86_64:** 2.26.26.0-1
  + **aws-neuronx-k8-scheduler.x86_64:** 2.26.26.0-1
  + **aws-neuronx-runtime-lib.x86_64:** 2.26.42.0_2ff3b5c7d-1
  + **aws-neuronx-tools.x86_64:** 2.24.54.0-1
  + **tensorflow-model-server-neuron.x86_64:** 2.8.0.2.3.0.0-0
  + **tensorflow-model-server-neuronx.x86_64:** 2.10.1.2.12.2.0-0

------
#### [ K8s v1.29 ]
+ **Nvidia Driver Version:** 550.163.01
+ **CUDA Version:** 12.4
+ **EFA Installer:** 1.38.0
+ **Neuron packages:**
  + **aws-neuronx-dkms.noarch:** 2.22.2.0-dkms
  + **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
  + **aws-neuronx-tools.x86_64:** 2.18.3.0-1
  + **aws-neuron-dkms.noarch:** 2.3.26.0-dkms
  + **aws-neuron-k8-plugin.x86_64:** 1.9.3.0-1
  + **aws-neuron-k8-scheduler.x86_64:** 1.9.3.0-1
  + **aws-neuron-runtime.x86_64:** 1.6.24.0-1
  + **aws-neuron-runtime-base.x86_64:** 1.6.21.0-1
  + **aws-neuron-tools.x86_64:** 2.1.4.0-1
  + **aws-neuronx-collectives.x86_64:** 2.26.43.0_47cc904ea-1
  + **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
  + **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.16.2.0-1
  + **aws-neuronx-gpsimd-tools.x86_64:** 0.16.1.0_0a6506a47-1
  + **aws-neuronx-k8-plugin.x86_64:** 2.26.26.0-1
  + **aws-neuronx-k8-scheduler.x86_64:** 2.26.26.0-1
  + **aws-neuronx-runtime-lib.x86_64:** 2.26.42.0_2ff3b5c7d-1
  + **aws-neuronx-tools.x86_64:** 2.24.54.0-1
  + **tensorflow-model-server-neuron.x86_64:** 2.8.0.2.3.0.0-0
  + **tensorflow-model-server-neuronx.x86_64:** 2.10.1.2.12.2.0-0

------
#### [ K8s v1.30 ]
+ **Nvidia Driver Version:** 550.163.01
+ **CUDA Version:** 12.4
+ **EFA installer version:** 1.38.0
+ **Neuron packages:**
  + **aws-neuronx-dkms.noarch:** 2.22.2.0-dkms
  + **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
  + **aws-neuronx-tools.x86_64:** 2.18.3.0-1
  + **aws-neuron-dkms.noarch:** 2.3.26.0-dkms
  + **aws-neuron-k8-plugin.x86_64:** 1.9.3.0-1
  + **aws-neuron-k8-scheduler.x86_64:** 1.9.3.0-1
  + **aws-neuron-runtime.x86_64:** 1.6.24.0-1
  + **aws-neuron-runtime-base.x86_64:** 1.6.21.0-1
  + **aws-neuron-tools.x86_64:** 2.1.4.0-1
  + **aws-neuronx-collectives.x86_64:** 2.26.43.0_47cc904ea-1
  + **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
  + **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.16.2.0-1
  + **aws-neuronx-gpsimd-tools.x86_64:** 0.16.1.0_0a6506a47-1
  + **aws-neuronx-k8-plugin.x86_64:** 2.26.26.0-1
  + **aws-neuronx-k8-scheduler.x86_64:** 2.26.26.0-1
  + **aws-neuronx-runtime-lib.x86_64:** 2.26.42.0_2ff3b5c7d-1
  + **aws-neuronx-tools.x86_64:** 2.24.54.0-1
  + **tensorflow-model-server-neuron.x86_64:** 2.8.0.2.3.0.0-0
  + **tensorflow-model-server-neuronx.x86_64:** 2.10.1.2.12.2.0-0

------
#### [ K8s v1.31 ]
+ **Nvidia Driver Version:** 550.163.01
+ **CUDA Version:** 12.4
+ **EFA installer version:** 1.38.0
+ **Neuron packages:**
  + **aws-neuronx-dkms.noarch:** 2.22.2.0-dkms
  + **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
  + **aws-neuronx-tools.x86_64:** 2.18.3.0-1
  + **aws-neuron-dkms.noarch:** 2.3.26.0-dkms
  + **aws-neuron-k8-plugin.x86_64:** 1.9.3.0-1
  + **aws-neuron-k8-scheduler.x86_64:** 1.9.3.0-1
  + **aws-neuron-runtime.x86_64:** 1.6.24.0-1
  + **aws-neuron-runtime-base.x86_64:** 1.6.21.0-1
  + **aws-neuron-tools.x86_64:** 2.1.4.0-1
  + **aws-neuronx-collectives.x86_64:** 2.26.43.0_47cc904ea-1
  + **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
  + **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.16.2.0-1
  + **aws-neuronx-gpsimd-tools.x86_64:** 0.16.1.0_0a6506a47-1
  + **aws-neuronx-k8-plugin.x86_64:** 2.26.26.0-1
  + **aws-neuronx-k8-scheduler.x86_64:** 2.26.26.0-1
  + **aws-neuronx-runtime-lib.x86_64:** 2.26.42.0_2ff3b5c7d-1
  + **aws-neuronx-tools.x86_64:** 2.24.54.0-1
  + **tensorflow-model-server-neuron.x86_64:** 2.8.0.2.3.0.0-0
  + **tensorflow-model-server-neuronx.x86_64:** 2.10.1.2.12.2.0-0

------
#### [ K8s v1.32 ]
+ **Nvidia Driver Version:** 550.163.01
+ **CUDA Version:** 12.4
+ **EFA installer version:** 1.38.0
+ **Neuron packages:**
  + **aws-neuronx-dkms.noarch:** 2.22.2.0-dkms
  + **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
  + **aws-neuronx-tools.x86_64:** 2.18.3.0-1
  + **aws-neuron-dkms.noarch:** 2.3.26.0-dkms
  + **aws-neuron-k8-plugin.x86_64:** 1.9.3.0-1
  + **aws-neuron-k8-scheduler.x86_64:** 1.9.3.0-1
  + **aws-neuron-runtime.x86_64:** 1.6.24.0-1
  + **aws-neuron-runtime-base.x86_64:** 1.6.21.0-1
  + **aws-neuron-tools.x86_64:** 2.1.4.0-1
  + **aws-neuronx-collectives.x86_64:** 2.26.43.0_47cc904ea-1
  + **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
  + **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.16.2.0-1
  + **aws-neuronx-gpsimd-tools.x86_64:** 0.16.1.0_0a6506a47-1
  + **aws-neuronx-k8-plugin.x86_64:** 2.26.26.0-1
  + **aws-neuronx-k8-scheduler.x86_64:** 2.26.26.0-1
  + **aws-neuronx-runtime-lib.x86_64:** 2.26.42.0_2ff3b5c7d-1
  + **aws-neuronx-tools.x86_64:** 2.24.54.0-1
  + **tensorflow-model-server-neuron.x86_64:** 2.8.0.2.3.0.0-0
  + **tensorflow-model-server-neuronx.x86_64:** 2.10.1.2.12.2.0-0

------

## SageMaker HyperPod AMI releases for Amazon EKS: June 09, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250609"></a>

**SageMaker HyperPod DLAMI for Amazon EKS support**

------
#### [ Neuron SDK Updates ]
+ **aws-neuronx-dkms.noarch:** 2.21.37.0 (from 2.20.74.0)

------

## SageMaker HyperPod AMI releases for Amazon EKS: May 22, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250522"></a>

**AMI general updates**

**SageMaker HyperPod DLAMI for Amazon EKS support**

------
#### [ Deep Learning Base AMI AL2 ]
+ **Latest NVIDIA Driver:** 550.163.01
+ **CUDA Stack updates:**
  + **Default CUDA:** 12.1
  + **NCCL Version:** 2.22.3
+ **EFA Installer:** 1.38.0
+ **AWS OFI NCCL:** 1.13.2
+ **Linux Kernel:** 5.10
+ **GDRCopy:** 2.4

**Important**  
**NVIDIA Container Toolkit 1.17.4 update:** Mounting of CUDA compatibility libraries is now disabled.  
**EFA updates from 1.37 to 1.38:**  
The AWS OFI NCCL plugin is now located in `/opt/amazon/ofi-nccl`.  
The previous location, `/opt/aws-ofi-nccl/`, is deprecated.

------
#### [ Neuron SDK Updates ]
+ **aws-neuronx-dkms.noarch:** 2.20.74.0 (from 2.20.28.0)
+ **aws-neuronx-collectives.x86_64:** 2.25.65.0_9858ac9a1-1 (from 2.24.59.0_838c7fc8b-1)
+ **aws-neuronx-runtime-lib.x86_64:** 2.25.57.0_166c7a468-1 (from 2.24.53.0_f239092cc-1)
+ **aws-neuronx-tools.x86_64:** 2.23.9.0 (from 2.22.61.0)
+ **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.15.12.0 (from 0.14.12.0)
+ **aws-neuronx-gpsimd-tools.x86_64:** 0.15.1.0_5d31b6a3f (from 0.14.6.0_241eb69f4)
+ **aws-neuronx-k8-plugin.x86_64:** 2.25.24.0 (from 2.24.23.0)
+ **aws-neuronx-k8-scheduler.x86_64:** 2.25.24.0 (from 2.24.23.0)

**Support notes:**
+ AMI components, including CUDA versions, may be removed or changed based on the framework support policy.
+ The kernel version is pinned for compatibility. Avoid kernel updates unless they are required for security patches.
+ For EC2 instances with multiple network cards, refer to the EFA configuration guide for proper setup.

------

## SageMaker HyperPod AMI releases for Amazon EKS: May 07, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250507"></a>

------
#### [ Installed the latest version of AWS Neuron SDK ]
+ **tensorflow-model-server-neuron.x86_64:** 2.8.0.2.3.0.0-0 neuron

------

## SageMaker HyperPod AMI releases for Amazon EKS: April 28, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250428"></a>

**Improvements for K8s**
+ Upgraded NVIDIA driver from version 550.144.03 to 550.163.01. This upgrade is to address Common Vulnerabilities and Exposures (CVEs) present in the [NVIDIA GPU Display Security Bulletin for April 2025](https://nvidia.custhelp.com/app/answers/detail/a_id/5630).

**SageMaker HyperPod DLAMI for Amazon EKS support**

------
#### [ Installed the latest version of AWS Neuron SDK ]
+ **aws-neuronx-dkms.noarch:** 2.20.28.0-dkms
+ **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
+ **aws-neuronx-tools.x86_64:** 2.18.3.0-1
+ **aws-neuron-dkms.noarch:** 2.3.26.0-dkms
+ **aws-neuron-k8-plugin.x86_64:** 1.9.3.0-1
+ **aws-neuron-k8-scheduler.x86_64:** 1.9.3.0-1
+ **aws-neuron-runtime.x86_64:** 1.6.24.0-1
+ **aws-neuron-runtime-base.x86_64:** 1.6.21.0-1
+ **aws-neuron-tools.x86_64:** 2.1.4.0-1
+ **aws-neuronx-collectives.x86_64:** 2.24.59.0_838c7fc8b-1
+ **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
+ **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.14.12.0-1
+ **aws-neuronx-gpsimd-tools.x86_64:** 0.14.6.0_241eb69f4-1
+ **aws-neuronx-k8-plugin.x86_64:** 2.24.23.0-1
+ **aws-neuronx-k8-scheduler.x86_64:** 2.24.23.0-1
+ **aws-neuronx-runtime-lib.x86_64:** 2.24.53.0_f239092cc-1
+ **aws-neuronx-tools.x86_64:** 2.22.61.0-1
+ **tensorflow-model-server-neuronx.x86_64:** 2.10.1.2.12.2.0-0

------

## SageMaker HyperPod AMI releases for Amazon EKS: April 18, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250418"></a>

**AMI general updates**
+ New SageMaker HyperPod AMI for Amazon EKS 1.32.1.

**SageMaker HyperPod DLAMI for Amazon EKS support**

The AMIs include the following:

------
#### [ Deep Learning EKS AMI 1.32.1 ]
+ **Amazon EKS Components**
  + Kubernetes Version: 1.32.1
  + Containerd Version: 1.7.27
  + Runc Version: 1.1.14
  + AWS IAM Authenticator: 0.6.29
+ **Amazon SSM Agent:** 3.3.1611.0 
+ **Linux Kernel:** 5.10.235
+ **OSS Nvidia driver:** 550.163.01
+ **NVIDIA CUDA:** 12.4
+ **EFA Installer:** 1.38.0
+ **GDRCopy:** 2.4.1-1
+ **Nvidia container toolkit:** 1.17.6
+ **AWS OFI NCCL:** 1.13.2
+ **aws-neuronx-tools:** 2.18.3.0
+ **aws-neuronx-runtime-lib:** 2.24.53.0
+ **aws-neuronx-oci-hook:** 2.4.4.0-1
+ **aws-neuronx-dkms:** 2.20.28.0
+ **aws-neuronx-collectives:** 2.24.59.0

------

## SageMaker HyperPod AMI releases for Amazon EKS: February 18, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250218"></a>

**Improvements for K8s**
+ Upgraded Nvidia container toolkit from version 1.17.3 to version 1.17.4.
+ Fixed the issue where customers were unable to connect to nodes after a reboot.
+ Upgraded Elastic Fabric Adapter (EFA) version from 1.37.0 to 1.38.0.
+ The EFA now includes the AWS OFI NCCL plugin, which is located in the `/opt/amazon/ofi-nccl` directory instead of the original `/opt/aws-ofi-nccl/` path. If you need to update your `LD_LIBRARY_PATH` environment variable, make sure to modify the path to point to the new `/opt/amazon/ofi-nccl` location for the OFI NCCL plugin.
+ Removed the emacs package from these DLAMIs. You can install Emacs from the GNU Emacs project.
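The `LD_LIBRARY_PATH` change described above can be applied to an existing value with a small sketch like the following; the starting value is illustrative, and in practice you would read the variable from your own environment or job configuration.

```shell
# Illustrative starting value that still contains the deprecated plugin path.
LD_LIBRARY_PATH="/opt/aws-ofi-nccl/lib:/usr/local/cuda/lib64"

# Rewrite the deprecated /opt/aws-ofi-nccl prefix to the new /opt/amazon/ofi-nccl
# location introduced with the EFA 1.38 installer.
LD_LIBRARY_PATH="$(printf '%s' "$LD_LIBRARY_PATH" | sed 's|/opt/aws-ofi-nccl|/opt/amazon/ofi-nccl|g')"
export LD_LIBRARY_PATH

echo "$LD_LIBRARY_PATH"
```

The same rewrite should be applied anywhere the old path is hard-coded, such as lifecycle scripts or container entrypoints.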

**SageMaker HyperPod DLAMI for Amazon EKS support**

------
#### [ Installed the latest version of neuron SDK ]
+ **aws-neuronx-dkms.noarch:** 2.19.64.0-dkms @neuron
+ **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1 @neuron
+ **aws-neuronx-tools.x86_64:** 2.18.3.0-1 @neuron
+ **aws-neuronx-collectives.x86_64:** 2.23.135.0_3e70920f2-1 neuron
+ **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1 neuron
+ **aws-neuronx-gpsimd-customop-lib.x86_64**
+ **aws-neuronx-gpsimd-tools.x86_64:** 0.13.2.0_94ba34927-1 neuron
+ **aws-neuronx-k8-plugin.x86_64:** 2.23.45.0-1 neuron
+ **aws-neuronx-k8-scheduler.x86_64:** 2.23.45.0-1 neuron
+ **aws-neuronx-runtime-lib.x86_64:** 2.23.112.0_9b5179492-1 neuron
+ **aws-neuronx-tools.x86_64:** 2.20.204.0-1 neuron
+ **tensorflow-model-server-neuronx.x86_64**

------

## SageMaker HyperPod AMI releases for Amazon EKS: January 22, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250122"></a>

**AMI general updates**
+ New SageMaker HyperPod AMI for Amazon EKS 1.31.2.

**SageMaker HyperPod DLAMI for Amazon EKS support**

The AMIs include the following:

------
#### [ Deep Learning EKS AMI 1.31 ]
+ **Amazon EKS Components**
  + Kubernetes Version: 1.31.2
  + Containerd Version: 1.7.23
  + Runc Version: 1.1.14
  + AWS IAM Authenticator: 0.6.26
+ **Amazon SSM Agent:** 3.3.987
+ **Linux Kernel:** 5.10.230
+ **OSS NVIDIA driver:** 550.127.05
+ **NVIDIA CUDA:** 12.4
+ **EFA Installer:** 1.37.0
+ **GDRCopy:** 2.4.1-1
+ **NVIDIA container toolkit:** 1.17.3
+ **AWS OFI NCCL:** 1.13.0
+ **aws-neuronx-tools:** 2.18.3
+ **aws-neuronx-runtime-lib:** 2.23.112.0
+ **aws-neuronx-oci-hook:** 2.4.4.0-1
+ **aws-neuronx-dkms:** 2.18.20.0
+ **aws-neuronx-collectives:** 2.23.133.0

------

## SageMaker HyperPod AMI releases for Amazon EKS: December 21, 2024
<a name="sagemaker-hyperpod-release-ami-eks-20241221"></a>

**SageMaker HyperPod DLAMI for Amazon EKS support**

The AMIs include the following:

------
#### [ K8s v1.28 ]
+ **Amazon EKS Components**
  + Kubernetes Version: 1.28.15
  + Containerd Version: 1.7.23
  + Runc Version: 1.1.14
  + AWS IAM Authenticator: 0.6.26
+ **Amazon SSM Agent:** 3.3.987
+ **Linux Kernel:** 5.10.228
+ **OSS NVIDIA driver:** 550.127.05
+ **NVIDIA CUDA:** 12.4
+ **EFA Installer:** 1.37.0
+ **GDRCopy:** 2.4
+ **NVIDIA container toolkit:** 1.17.3
+ **AWS OFI NCCL:** 1.13.0
+ **aws-neuronx-tools:** 2.18.3.0-1
+ **aws-neuronx-runtime-lib:** 2.23.112.0
+ **aws-neuronx-oci-hook:** 2.4.4.0-1
+ **aws-neuronx-dkms:** 2.18.20.0
+ **aws-neuronx-collectives:** 2.23.135.0

------
#### [ K8s v1.29 ]
+ **Amazon EKS Components**
  + Kubernetes Version: 1.29.10
  + Containerd Version: 1.7.23
  + Runc Version: 1.1.14
  + AWS IAM Authenticator: 0.6.26
+ **Amazon SSM Agent:** 3.3.987
+ **Linux Kernel:** 5.15.0
+ **OSS NVIDIA driver:** 550.127.05
+ **NVIDIA CUDA:** 12.4
+ **EFA Installer:** 1.37.0
+ **GDRCopy:** 2.4
+ **NVIDIA container toolkit:** 1.17.3
+ **AWS OFI NCCL:** 1.13.0
+ **aws-neuronx-tools:** 2.18.3.0-1
+ **aws-neuronx-runtime-lib:** 2.23.112.0
+ **aws-neuronx-oci-hook:** 2.4.4.0-1
+ **aws-neuronx-dkms:** 2.18.20.0
+ **aws-neuronx-collectives:** 2.23.135.0

------
#### [ K8s v1.30 ]
+ **Amazon EKS Components**
  + Kubernetes Version: 1.30.6
  + Containerd Version: 1.7.23
  + Runc Version: 1.1.14
  + AWS IAM Authenticator: 0.6.26
+ **Amazon SSM Agent:** 3.3.987.0
+ **Linux Kernel:** 5.10.228
+ **OSS NVIDIA driver:** 550.127.05
+ **NVIDIA CUDA:** 12.4
+ **EFA Installer:** 1.37.0
+ **GDRCopy:** 2.4
+ **NVIDIA container toolkit:** 1.17.3
+ **AWS OFI NCCL:** 1.13.0
+ **aws-neuronx-tools:** 2.18.3.0-1
+ **aws-neuronx-runtime-lib:** 2.23.112.0
+ **aws-neuronx-oci-hook:** 2.4.4.0-1
+ **aws-neuronx-dkms:** 2.18.20.0
+ **aws-neuronx-collectives:** 2.23.135.0

------

## SageMaker HyperPod AMI releases for Amazon EKS: December 13, 2024
<a name="sagemaker-hyperpod-release-ami-eks-20241213"></a>

**SageMaker HyperPod DLAMI for Amazon EKS upgrade**
+ Updated SSM Agent to version `3.3.1311.0`.

## SageMaker HyperPod AMI releases for Amazon EKS: November 24, 2024
<a name="sagemaker-hyperpod-release-ami-eks-20241124"></a>

**AMI general updates**
+ Released in `MEL` (Melbourne) Region.
+ Updated SageMaker HyperPod base DLAMI to the following versions:
  + Kubernetes: 2024-11-01.

## SageMaker HyperPod AMI releases for Amazon EKS: November 15, 2024
<a name="sagemaker-hyperpod-release-ami-eks-20241115"></a>

**SageMaker HyperPod DLAMI for Amazon EKS support**

The AMIs include the following:

------
#### [ Deep Learning EKS AMI 1.28 ]
+ **Amazon EKS Components**
  + Kubernetes Version: 1.28.15
  + Containerd Version: 1.7.23
  + Runc Version: 1.1.14
  + AWS IAM Authenticator: 0.6.26
+ **Amazon SSM Agent:** 3.3.987
+ **Linux Kernel:** 5.10.228
+ **OSS NVIDIA driver:** 550.127.05
+ **NVIDIA CUDA:** 12.4
+ **EFA Installer:** 1.34.0
+ **GDRCopy:** 2.4
+ **NVIDIA container toolkit:** 1.17.3
+ **AWS OFI NCCL:** 1.11.0
+ **aws-neuronx-tools:** 2.18.3.0-1
+ **aws-neuronx-runtime-lib:** 2.22.19.0
+ **aws-neuronx-oci-hook:** 2.4.4.0-1
+ **aws-neuronx-dkms:** 2.18.20.0
+ **aws-neuronx-collectives:** 2.22.33.0

------
#### [ Deep Learning EKS AMI 1.29 ]
+ **Amazon EKS Components**
  + Kubernetes Version: 1.29.10
  + Containerd Version: 1.7.23
  + Runc Version: 1.1.14
  + AWS IAM Authenticator: 0.6.26
+ **Amazon SSM Agent:** 3.3.987
+ **Linux Kernel:** 5.10.228
+ **OSS NVIDIA driver:** 550.127.05
+ **NVIDIA CUDA:** 12.4
+ **EFA Installer:** 1.34.0
+ **GDRCopy:** 2.4
+ **NVIDIA container toolkit:** 1.17.3
+ **AWS OFI NCCL:** 1.11.0
+ **aws-neuronx-tools:** 2.18.3.0-1
+ **aws-neuronx-runtime-lib:** 2.22.19.0
+ **aws-neuronx-oci-hook:** 2.4.4.0-1
+ **aws-neuronx-dkms:** 2.18.20.0
+ **aws-neuronx-collectives:** 2.22.33.0

------
#### [ Deep Learning EKS AMI 1.30 ]
+ **Amazon EKS Components**
  + Kubernetes Version: 1.30.6
  + Containerd Version: 1.7.23
  + Runc Version: 1.1.14
  + AWS IAM Authenticator: 0.6.26
+ **Amazon SSM Agent:** 3.3.987
+ **Linux Kernel:** 5.10.228
+ **OSS NVIDIA driver:** 550.127.05
+ **NVIDIA CUDA:** 12.4
+ **EFA Installer:** 1.34.0
+ **GDRCopy:** 2.4
+ **NVIDIA container toolkit:** 1.17.3
+ **AWS OFI NCCL:** 1.11.0
+ **aws-neuronx-tools:** 2.18.3.0-1
+ **aws-neuronx-runtime-lib:** 2.22.19.0
+ **aws-neuronx-oci-hook:** 2.4.4.0-1
+ **aws-neuronx-dkms:** 2.18.20.0
+ **aws-neuronx-collectives:** 2.22.33.0

------

## SageMaker HyperPod AMI releases for Amazon EKS: November 11, 2024
<a name="sagemaker-hyperpod-release-ami-eks-20241111"></a>

**AMI general updates**
+ Updated SageMaker HyperPod DLAMI with Amazon EKS versions 1.28.13, 1.29.8, 1.30.4.

## SageMaker HyperPod AMI releases for Amazon EKS: October 21, 2024
<a name="sagemaker-hyperpod-release-ami-eks-20241021"></a>

**AMI general updates**
+ Updated SageMaker HyperPod base DLAMI to the following versions:
  + Amazon EKS: 1.28.11, 1.29.6, 1.30.2.

## SageMaker HyperPod AMI releases for Amazon EKS: September 10, 2024
<a name="sagemaker-hyperpod-release-ami-eks-20240910"></a>

**SageMaker HyperPod DLAMI for Amazon EKS support**

The AMIs include the following:

------
#### [ Deep Learning EKS AMI 1.28 ]
+ **Amazon EKS Components**
  + Kubernetes Version: 1.28.11
  + Containerd Version: 1.7.20
  + Runc Version: 1.1.11
  + AWS IAM Authenticator: 0.6.21
+ **Amazon SSM Agent:** 3.3.380
+ **Linux Kernel:** 5.10.223
+ **OSS NVIDIA driver:** 535.183.01
+ **NVIDIA CUDA:** 12.2
+ **EFA Installer:** 1.32.0
+ **GDRCopy:** 2.4
+ **NVIDIA container toolkit:** 1.16.1
+ **AWS OFI NCCL:** 1.9.1
+ **aws-neuronx-tools:** 2.18.3.0-1
+ **aws-neuronx-runtime-lib:** 2.21.41.0
+ **aws-neuronx-oci-hook:** 2.4.4.0-1
+ **aws-neuronx-dkms:** 2.17.17.0
+ **aws-neuronx-collectives:** 2.21.46.0

------
#### [ Deep Learning EKS AMI 1.29 ]
+ **Amazon EKS Components**
  + Kubernetes Version: 1.29.6
  + Containerd Version: 1.7.20
  + Runc Version: 1.1.11
  + AWS IAM Authenticator: 0.6.21
+ **Amazon SSM Agent:** 3.3.380
+ **Linux Kernel:** 5.10.223
+ **OSS NVIDIA driver:** 535.183.01
+ **NVIDIA CUDA:** 12.2
+ **EFA Installer:** 1.32.0
+ **GDRCopy:** 2.4
+ **NVIDIA container toolkit:** 1.16.1
+ **AWS OFI NCCL:** 1.9.1
+ **aws-neuronx-tools:** 2.18.3.0-1
+ **aws-neuronx-runtime-lib:** 2.21.41.0
+ **aws-neuronx-oci-hook:** 2.4.4.0-1
+ **aws-neuronx-dkms:** 2.17.17.0
+ **aws-neuronx-collectives:** 2.21.46.0

------
#### [ Deep Learning EKS AMI 1.30 ]
+ **Amazon EKS Components**
  + Kubernetes Version: 1.30.2
  + Containerd Version: 1.7.20
  + Runc Version: 1.1.11
  + AWS IAM Authenticator: 0.6.21
+ **Amazon SSM Agent:** 3.3.380
+ **Linux Kernel:** 5.10.223
+ **OSS NVIDIA driver:** 535.183.01
+ **NVIDIA CUDA:** 12.2
+ **EFA Installer:** 1.32.0
+ **GDRCopy:** 2.4
+ **NVIDIA container toolkit:** 1.16.1
+ **AWS OFI NCCL:** 1.9.1
+ **aws-neuronx-tools:** 2.18.3.0-1
+ **aws-neuronx-runtime-lib:** 2.21.41.0
+ **aws-neuronx-oci-hook:** 2.4.4.0-1
+ **aws-neuronx-dkms:** 2.17.17.0
+ **aws-neuronx-collectives:** 2.21.46.0

------

# Public AMI releases
<a name="sagemaker-hyperpod-release-public-ami"></a>

The following release notes track the latest updates for Amazon SageMaker HyperPod public AMI releases for Amazon EKS orchestration. Each release note includes a summarized list of packages pre-installed or pre-configured in the SageMaker HyperPod DLAMIs for Amazon EKS support. Each DLAMI is built on AL2023 and supports a specific Kubernetes version. For information about Amazon SageMaker HyperPod feature releases, see [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md).

This page is regularly updated to provide comprehensive AMI lifecycle management information, including security vulnerabilities, deprecation announcements, and patching recommendations. As part of a commitment to maintaining secure and up-to-date infrastructure, SageMaker AI continuously monitors all HyperPod public AMIs for critical vulnerabilities using automated scanning workflows. When critical security issues are identified, AMIs are systematically deprecated with appropriate migration guidance. Regular updates include Common Vulnerabilities and Exposures (CVE) remediation status, compliance findings, and recommended actions so that you can maintain secure HyperPod environments while minimizing operational disruption during AMI transitions.

## SageMaker HyperPod public AMI releases: August 04, 2025
<a name="sagemaker-hyperpod-release-public-ami-2025-08-04"></a>

Amazon SageMaker HyperPod now supports new public AMIs for Amazon EKS clusters. The AMIs include the following:

------
#### [ K8s v1.32 ]

AMI Name: HyperPod EKS 1.32 x86_64 AMI Amazon Linux 2 2025080407
+ **Amazon EKS Components**
  + Kubernetes Version: 1.32.3
  + Containerd Version: 1.7.23
  + Runc Version: 1.2.6
  + AWS IAM Authenticator: 0.6.29
+ **Amazon SSM Agent:** 3.3.2299.0
+ **Linux Kernel:** 5.10.238-234.956.amzn2.x86_64
+ **OSS NVIDIA driver:** 550.163.01
+ **NVIDIA CUDA:** 12.2
+ **EFA Installer:** 1.38.0
+ **GDRCopy:** 2.4.1
+ **NVIDIA container toolkit:** 1.17.8
+ **AWS OFI NCCL:** 1.13.0-aws
+ **Neuron packages:**
  + **aws-neuronx-dkms.noarch:** 2.22.2.0-dkms
  + **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
  + **aws-neuronx-tools.x86_64:** 2.18.3.0-1
  + **aws-neuron-dkms.noarch:** 2.3.26.0-dkms
  + **aws-neuron-k8-plugin.x86_64:** 1.9.3.0-1
  + **aws-neuron-k8-scheduler.x86_64:** 1.9.3.0-1
  + **aws-neuron-runtime.x86_64:** 1.6.24.0-1
  + **aws-neuron-runtime-base.x86_64:** 1.6.21.0-1
  + **aws-neuron-tools.x86_64:** 2.1.4.0-1
  + **aws-neuronx-collectives.x86_64:** 2.27.34.0_ec8cd5e8b-1
  + **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
  + **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.17.1.0-1
  + **aws-neuronx-gpsimd-tools.x86_64:** 0.17.0.0_aacc27699-1
  + **aws-neuronx-k8-plugin.x86_64:** 2.27.7.0-1
  + **aws-neuronx-k8-scheduler.x86_64:** 2.27.7.0-1
  + **aws-neuronx-runtime-lib.x86_64:** 2.27.23.0_8deec4dbf-1
  + **aws-neuronx-tools.x86_64:** 2.25.145.0-1
  + **tensorflow-model-server-neuron.x86_64:** 2.8.0.2.3.0.0-0
  + **tensorflow-model-server-neuronx.x86_64:** 2.10.1.2.12.2.0-0

------
#### [ K8s v1.30 ]

AMI Name: HyperPod EKS 1.30 x86_64 AMI Amazon Linux 2 2025080407
+ **Amazon EKS Components**
  + Kubernetes Version: 1.30.11
  + Containerd Version: 1.7.\$1
  + Runc Version: 1.2.6
  + AWS IAM Authenticator: 0.6.28
+ **Amazon SSM Agent:** 3.3.2299.0
+ **Linux Kernel:** 5.10.238-234.956.amzn2.x86_64
+ **OSS NVIDIA driver:** 550.163.01
+ **NVIDIA CUDA:** 12.2
+ **EFA Installer:** 1.38.0
+ **GDRCopy:** 2.4.1
+ **NVIDIA container toolkit:** 1.17.8
+ **AWS OFI NCCL:** 1.13.0-aws
+ **Neuron packages:**
  + **aws-neuronx-dkms.noarch:** 2.22.2.0-dkms
  + **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
  + **aws-neuronx-tools.x86_64:** 2.18.3.0-1
  + **aws-neuron-dkms.noarch:** 2.3.26.0-dkms
  + **aws-neuron-k8-plugin.x86_64:** 1.9.3.0-1
  + **aws-neuron-k8-scheduler.x86_64:** 1.9.3.0-1
  + **aws-neuron-runtime.x86_64:** 1.6.24.0-1
  + **aws-neuron-runtime-base.x86_64:** 1.6.21.0-1
  + **aws-neuron-tools.x86_64:** 2.1.4.0-1
  + **aws-neuronx-collectives.x86_64:** 2.27.34.0_ec8cd5e8b-1
  + **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
  + **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.17.1.0-1
  + **aws-neuronx-gpsimd-tools.x86_64:** 0.17.0.0_aacc27699-1
  + **aws-neuronx-k8-plugin.x86_64:** 2.27.7.0-1
  + **aws-neuronx-k8-scheduler.x86_64:** 2.27.7.0-1
  + **aws-neuronx-runtime-lib.x86_64:** 2.27.23.0_8deec4dbf-1
  + **aws-neuronx-tools.x86_64:** 2.25.145.0-1
  + **tensorflow-model-server-neuron.x86_64:** 2.8.0.2.3.0.0-0
  + **tensorflow-model-server-neuronx.x86_64:** 2.10.1.2.12.2.0-0

------
#### [ K8s v1.31 ]

AMI Name: HyperPod EKS 1.31 x86_64 AMI Amazon Linux 2 2025080407
+ **Amazon EKS Components**
  + Kubernetes Version: 1.31.7
  + Containerd Version: 1.7.\$1
  + Runc Version: 1.2.6
  + AWS IAM Authenticator: 0.6.28
+ **Amazon SSM Agent:** 3.3.2299.0
+ **Linux Kernel:** 5.10.238-234.956.amzn2.x86_64
+ **OSS NVIDIA driver:** 550.163.01
+ **NVIDIA CUDA:** 12.2
+ **EFA Installer:** 1.38.0
+ **GDRCopy:** 2.4.1
+ **NVIDIA container toolkit:** 1.17.8
+ **AWS OFI NCCL:** 1.13.0-aws
+ **Neuron packages:**
  + **aws-neuronx-dkms.noarch:** 2.22.2.0-dkms
  + **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
  + **aws-neuronx-tools.x86_64:** 2.18.3.0-1
  + **aws-neuron-dkms.noarch:** 2.3.26.0-dkms
  + **aws-neuron-k8-plugin.x86_64:** 1.9.3.0-1
  + **aws-neuron-k8-scheduler.x86_64:** 1.9.3.0-1
  + **aws-neuron-runtime.x86_64:** 1.6.24.0-1
  + **aws-neuron-runtime-base.x86_64:** 1.6.21.0-1
  + **aws-neuron-tools.x86_64:** 2.1.4.0-1
  + **aws-neuronx-collectives.x86_64:** 2.27.34.0_ec8cd5e8b-1
  + **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
  + **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.17.1.0-1
  + **aws-neuronx-gpsimd-tools.x86_64:** 0.17.0.0_aacc27699-1
  + **aws-neuronx-k8-plugin.x86_64:** 2.27.7.0-1
  + **aws-neuronx-k8-scheduler.x86_64:** 2.27.7.0-1
  + **aws-neuronx-runtime-lib.x86_64:** 2.27.23.0_8deec4dbf-1
  + **aws-neuronx-tools.x86_64:** 2.25.145.0-1
  + **tensorflow-model-server-neuron.x86_64:** 2.8.0.2.3.0.0-0
  + **tensorflow-model-server-neuronx.x86_64:** 2.10.1.2.12.2.0-0

------
#### [ K8s v1.29 ]

AMI Name: HyperPod EKS 1.29 x86_64 AMI Amazon Linux 2 2025080407
+ **Amazon EKS Components**
  + Kubernetes Version: 1.29.15
  + Containerd Version: 1.7.\$1
  + Runc Version: 1.2.6
  + AWS IAM Authenticator: 0.6.28
+ **Amazon SSM Agent:** 3.3.2299.0
+ **Linux Kernel:** 5.10.238-234.956.amzn2.x86_64
+ **OSS NVIDIA driver:** 550.163.01
+ **NVIDIA CUDA:** 12.2
+ **EFA Installer:** 1.38.0
+ **GDRCopy:** 2.4.1
+ **NVIDIA container toolkit:** 1.17.8
+ **AWS OFI NCCL:** 1.13.0-aws
+ **Neuron packages:**
  + **aws-neuronx-dkms.noarch:** 2.22.2.0-dkms
  + **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
  + **aws-neuronx-tools.x86_64:** 2.18.3.0-1
  + **aws-neuron-dkms.noarch:** 2.3.26.0-dkms
  + **aws-neuron-k8-plugin.x86_64:** 1.9.3.0-1
  + **aws-neuron-k8-scheduler.x86_64:** 1.9.3.0-1
  + **aws-neuron-runtime.x86_64:** 1.6.24.0-1
  + **aws-neuron-runtime-base.x86_64:** 1.6.21.0-1
  + **aws-neuron-tools.x86_64:** 2.1.4.0-1
  + **aws-neuronx-collectives.x86_64:** 2.27.34.0_ec8cd5e8b-1
  + **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
  + **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.17.1.0-1
  + **aws-neuronx-gpsimd-tools.x86_64:** 0.17.0.0_aacc27699-1
  + **aws-neuronx-k8-plugin.x86_64:** 2.27.7.0-1
  + **aws-neuronx-k8-scheduler.x86_64:** 2.27.7.0-1
  + **aws-neuronx-runtime-lib.x86_64:** 2.27.23.0_8deec4dbf-1
  + **aws-neuronx-tools.x86_64:** 2.25.145.0-1
  + **tensorflow-model-server-neuron.x86_64:** 2.8.0.2.3.0.0-0
  + **tensorflow-model-server-neuronx.x86_64:** 2.10.1.2.12.2.0-0

------
#### [ K8s v1.28 ]

AMI Name: HyperPod EKS 1.28 x86_64 AMI Amazon Linux 2 2025080407
+ **Amazon EKS Components**
  + Kubernetes Version: 1.28.15
  + Containerd Version: 1.7.\$1
  + Runc Version: 1.2.6
  + AWS IAM Authenticator: 0.6.28
+ **Amazon SSM Agent:** 3.3.2299.0
+ **Linux Kernel:** 5.10.238-234.956.amzn2.x86_64
+ **OSS NVIDIA driver:** 550.163.01
+ **NVIDIA CUDA:** 12.2
+ **EFA Installer:** 1.38.0
+ **GDRCopy:** 2.4.1
+ **NVIDIA container toolkit:** 1.17.8
+ **AWS OFI NCCL:** 1.13.0-aws
+ **Neuron packages:**
  + **aws-neuronx-dkms.noarch:** 2.22.2.0-dkms
  + **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
  + **aws-neuronx-tools.x86_64:** 2.18.3.0-1
  + **aws-neuron-dkms.noarch:** 2.3.26.0-dkms
  + **aws-neuron-k8-plugin.x86_64:** 1.9.3.0-1
  + **aws-neuron-k8-scheduler.x86_64:** 1.9.3.0-1
  + **aws-neuron-runtime.x86_64:** 1.6.24.0-1
  + **aws-neuron-runtime-base.x86_64:** 1.6.21.0-1
  + **aws-neuron-tools.x86_64:** 2.1.4.0-1
  + **aws-neuronx-collectives.x86_64:** 2.27.34.0_ec8cd5e8b-1
  + **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
  + **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.17.1.0-1
  + **aws-neuronx-gpsimd-tools.x86_64:** 0.17.0.0_aacc27699-1
  + **aws-neuronx-k8-plugin.x86_64:** 2.27.7.0-1
  + **aws-neuronx-k8-scheduler.x86_64:** 2.27.7.0-1
  + **aws-neuronx-runtime-lib.x86_64:** 2.27.23.0_8deec4dbf-1
  + **aws-neuronx-tools.x86_64:** 2.25.145.0-1
  + **tensorflow-model-server-neuron.x86_64:** 2.8.0.2.3.0.0-0
  + **tensorflow-model-server-neuronx.x86_64:** 2.10.1.2.12.2.0-0

------

# Generative AI in SageMaker notebook environments
<a name="jupyterai"></a>

[Jupyter AI](https://github.com/jupyterlab/jupyter-ai) is an open-source JupyterLab extension that integrates generative AI capabilities into Jupyter notebooks. Through the Jupyter AI chat interface and magic commands, users can experiment with code generated from natural language instructions, explain existing code, ask questions about their local files, generate entire notebooks, and more. The extension connects Jupyter notebooks with large language models (LLMs) that users can use to generate text, code, or images, and to ask questions about their own data. Jupyter AI supports generative model providers such as AI21, Anthropic, AWS (JumpStart and Amazon Bedrock), Cohere, and OpenAI.

You can also use Amazon Q Developer as an out-of-the-box solution. Instead of manually setting up a connection to a model, you can start using Amazon Q Developer with minimal configuration. When you enable Amazon Q Developer, it becomes the default solution provider within Jupyter AI. For more information about using Amazon Q Developer, see [SageMaker JupyterLab](studio-updated-jl.md).

The extension's package is included in [Amazon SageMaker Distribution](https://github.com/aws/sagemaker-distribution) [version 1.2 and onwards](https://github.com/aws/sagemaker-distribution/tree/main/build_artifacts/v1). Amazon SageMaker Distribution is a Docker environment for data science and scientific computing used as the default image of JupyterLab notebook instances. Users of different IPython environments can install Jupyter AI manually.

In this section, we provide an overview of Jupyter AI capabilities and demonstrate how to configure models provided by JumpStart or Amazon Bedrock from [JupyterLab](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-jl.html) or [Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html) notebooks. For more in-depth information on the Jupyter AI project, refer to its [documentation](https://jupyter-ai.readthedocs.io/en/latest/). Alternatively, you can refer to the blog post *[Generative AI in Jupyter](https://blog.jupyter.org/generative-ai-in-jupyter-3f7174824862)* for an overview and examples of key Jupyter AI capabilities.

Before using Jupyter AI and interacting with your LLMs, make sure that you satisfy the following prerequisites:
+ For models hosted by AWS, you should have the ARN of your SageMaker AI endpoint or have access to Amazon Bedrock. For other model providers, you should have the API key used to authenticate and authorize requests to your model. Jupyter AI supports a wide range of model providers and language models; refer to the list of its [supported models](https://jupyter-ai.readthedocs.io/en/latest/users/index.html#model-providers) to stay updated on the latest available models. For information on how to deploy a model in JumpStart, see [Deploy a Model](https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-deploy.html) in the JumpStart documentation. You need to request access to [Amazon Bedrock](https://aws.amazon.com/bedrock/) to use it as your model provider.
+ Ensure that Jupyter AI libraries are present in your environment. If not, install the required package by following the instructions in [Jupyter AI installation](sagemaker-jupyterai-installation.md).
+ Familiarize yourself with the capabilities of Jupyter AI in [Access Jupyter AI Features](sagemaker-jupyterai-overview.md).
+ Configure the target models you wish to use by following the instructions in [Configure your model provider](sagemaker-jupyterai-model-configuration.md).

After completing the prerequisite steps, you can proceed to [Use Jupyter AI in JupyterLab or Studio Classic](sagemaker-jupyterai-use.md).

**Topics**
+ [Jupyter AI installation](sagemaker-jupyterai-installation.md)
+ [Access Jupyter AI Features](sagemaker-jupyterai-overview.md)
+ [Configure your model provider](sagemaker-jupyterai-model-configuration.md)
+ [Use Jupyter AI in JupyterLab or Studio Classic](sagemaker-jupyterai-use.md)

# Jupyter AI installation
<a name="sagemaker-jupyterai-installation"></a>

To use Jupyter AI, you must first install the Jupyter AI package. For [Amazon SageMaker AI Distribution](https://github.com/aws/sagemaker-distribution/tree/main/build_artifacts/v1) users, we recommend selecting the SageMaker Distribution image version 1.2 or later. No further installation is necessary. Users of JupyterLab in Studio can choose the version of their Amazon SageMaker Distribution when creating a space.

For users of other IPython environments, the recommended Jupyter AI package version depends on the version of JupyterLab they are using.

The Jupyter AI distribution consists of two packages.
+ `jupyter_ai`: This package provides a JupyterLab extension and a native chat user interface (UI). It acts as a conversational assistant using the large language model of your choice.
+ `jupyter_ai_magics`: This package provides the IPython `%%ai` and `%ai` magic commands with which you can invoke a large language model (LLM) from your notebook cells.

**Note**  
Installing `jupyter_ai` also installs `jupyter_ai_magics`. However, you can install `jupyter_ai_magics` independently without JupyterLab or `jupyter_ai`. The magic commands `%%ai` and `%ai` work in any IPython kernel environment. If you only install `jupyter_ai_magics`, you can't use the chat UI.

For users of JupyterLab 3, in particular Studio Classic users, we recommend installing `jupyter-ai` [version 1.5.x](https://pypi.org/project/jupyter-ai/#history) or any later 1.x version. However, we highly recommend using Jupyter AI with JupyterLab 4. The `jupyter-ai` version compatible with JupyterLab 3 may not allow users to set additional model parameters such as temperature, top-k and top-p sampling, max tokens or max length, or user acceptance license agreements.

For users of JupyterLab 4 environments that do not use SageMaker Distribution, we recommend installing `jupyter-ai` [version 2.5.x](https://pypi.org/project/jupyter-ai/#history) or any later 2.x version.

See the installation instructions in the *Installation* section of [Jupyter AI documentation](https://jupyter-ai.readthedocs.io/en/latest/users/index.html#installation-via-pip).

# Access Jupyter AI Features
<a name="sagemaker-jupyterai-overview"></a>

You can access Jupyter AI capabilities through two distinct methods: using the chat UI or using magic commands within notebooks.

## From the chat user interface AI assistant
<a name="sagemaker-jupyterai-overview-chatui"></a>

The chat interface connects you with Jupyternaut, a conversational agent that uses the language model of your choice. 

After launching a JupyterLab application installed with Jupyter AI, you can access the chat interface by choosing the chat icon (![\[Icon of a rectangular shape with a curved arrow pointing to the upper right corner.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/jupyterai/jupyterai-chat-ui.png)) in the left navigation panel. First-time users are prompted to configure their model. See [Configure your model provider in the chat UI](sagemaker-jupyterai-model-configuration.md#sagemaker-jupyterai-model-configuration-chatui) for configuration instructions.

**Using the chat UI, you can:**
+ **Answer questions**: For instance, you can ask Jupyternaut to create a Python function that adds CSV files to an Amazon S3 bucket. Subsequently, you can refine your answer with a follow-up question, such as adding a parameter to the function to choose the path where the files are written. 
+ **Interact with files in JupyterLab**: You can include a portion of your notebook in your prompt by selecting it. Then, you can either replace it with the model's suggested answer or manually copy the answer to your clipboard.
+ **Generate entire notebooks** from prompts: By starting your prompt with `/generate`, you trigger a notebook generation process in the background without interrupting your use of Jupyternaut. A message containing the link to the new file is displayed upon completion of the process.
+ **Learn from and ask questions about local files**: Using the `/learn` command, you can teach an embedding model of your choice about local files and then ask questions about those files using the `/ask` command. Jupyter AI stores the embedded content in a local [FAISS vector database](https://github.com/facebookresearch/faiss), then uses retrieval-augmented generation (RAG) to provide answers based on what it has learned. To erase all previously learned information from your embedding model, use `/learn -d`.
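For example, a short chat session using these commands might look like the following. The directory and question are hypothetical placeholders:

```
/learn ./docs
/ask What does the notebook described in docs/getting-started.md cover?
/learn -d
```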

**Note**  
Amazon Q Developer can't generate notebooks from scratch.

For a complete list of features and detailed instructions on their usage, see the [Jupyter AI chat interface](https://jupyter-ai.readthedocs.io/en/latest/users/index.html#the-chat-interface) documentation. To learn about how to configure access to a model in Jupyternaut, see [Configure your model provider in the chat UI](sagemaker-jupyterai-model-configuration.md#sagemaker-jupyterai-model-configuration-chatui).

## From notebook cells
<a name="sagemaker-jupyterai-overview-magic-commands"></a>

Using the `%%ai` and `%ai` magic commands, you can interact with the language model of your choice from your notebook cells or any IPython command line interface. The `%%ai` command applies your instructions to the entire cell, whereas `%ai` applies them to a single line.

The following example illustrates an `%%ai` magic command invoking an Anthropic Claude model to output an HTML file containing the image of a white square with black borders.

```
%%ai anthropic:claude-v1.2 -f html
Create a square using SVG with a black border and white fill.
```

To learn about the syntax of each command, use `%ai help`. To list the providers and models supported by the extension, run `%ai list`.

For a complete list of features and detailed instructions on their usage, see the Jupyter AI [magic commands](https://jupyter-ai.readthedocs.io/en/latest/users/index.html#the-ai-and-ai-magic-commands) documentation. In particular, you can customize the output format of your model using the `-f` or `--format` parameter, allow variable interpolation in prompts, including special `In` and `Out` variables, and more.

To learn about how to configure the access to a model, see [Configure your model provider in a notebook](sagemaker-jupyterai-model-configuration.md#sagemaker-jupyterai-model-configuration-magic-commands). 

# Configure your model provider
<a name="sagemaker-jupyterai-model-configuration"></a>

**Note**  
In this section, we assume that the language and embedding models that you plan to use are already deployed. For models provided by AWS, you should already have the ARN of your SageMaker AI endpoint or access to Amazon Bedrock. For other model providers, you should have the API key used to authenticate and authorize requests to your model.  
Jupyter AI supports a wide range of model providers and language models; refer to the list of its [supported models](https://jupyter-ai.readthedocs.io/en/latest/users/index.html#model-providers) to stay updated on the latest available models. For information on how to deploy a model provided by JumpStart, see [Deploy a Model](https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-deploy.html) in the JumpStart documentation. You need to request access to [Amazon Bedrock](https://aws.amazon.com/bedrock/) to use it as your model provider.

The configuration of Jupyter AI varies depending on whether you are using the chat UI or magic commands.

## Configure your model provider in the chat UI
<a name="sagemaker-jupyterai-model-configuration-chatui"></a>

**Note**  
You can configure several LLMs and embedding models following the same instructions. However, you must configure at least one **Language model**.

**To configure your chat UI**

1. In JupyterLab, access the chat interface by choosing the chat icon (![\[Icon of a rectangular shape with a curved arrow pointing to the upper right corner.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/jupyterai/jupyterai-chat-ui.png)) in the left navigation panel.

1. Choose the configuration icon (![\[Gear or cog icon representing settings or configuration options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/jupyterai/jupyterai-configure-models.png)) in the top right corner of the left pane. This opens the Jupyter AI configuration panel.

1. Fill out the fields related to your service provider.
   + **For models provided by JumpStart or Amazon Bedrock**
     + In the **language model** dropdown list, select `sagemaker-endpoint` for models deployed with JumpStart or `bedrock` for models managed by Amazon Bedrock.
     + The parameters differ based on whether your model is deployed on SageMaker AI or Amazon Bedrock.
       + For models deployed with JumpStart:
          + Enter the name of your endpoint in **Endpoint name**, and the AWS Region in which your model is deployed in [**Region name**](sagemaker-jupyterai-use.md#sagemaker-jupyterai-region-name). To retrieve the names of your SageMaker AI endpoints, navigate to [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/), then choose **Inference** and **Endpoints** in the left menu.
         + Paste the JSON of the [**Request schema**](sagemaker-jupyterai-use.md#sagemaker-jupyterai-request-schema) tailored to your model, and the corresponding [**Response path**](sagemaker-jupyterai-use.md#sagemaker-jupyterai-response-path) for parsing the model's output.
**Note**  
You can find the request and response formats of various JumpStart foundation models in the following [example notebooks](https://github.com/aws/amazon-sagemaker-examples/tree/main/introduction_to_amazon_algorithms/jumpstart-foundation-models). Each notebook is named after the model it demonstrates.
       + For models managed by Amazon Bedrock: Add the AWS profile storing your AWS credentials on your system (optional), and then the AWS Region in which your model is deployed in [**Region name**](sagemaker-jupyterai-use.md#sagemaker-jupyterai-region-name).
     + (Optional) Select an [embedding model](sagemaker-jupyterai-overview.md#sagemaker-jupyterai-embedding-model) to which you have access. Embedding models are used to capture additional information from local documents, enabling the text generation model to respond to questions within the context of those documents.
     + Choose **Save Changes** and navigate to the left arrow icon (![\[Left-pointing arrow icon, typically used for navigation or returning to a previous page.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/jupyterai/jupyterai-return-to-chat.png)) in the top left corner of the left pane. This opens the Jupyter AI chat UI. You can start interacting with your model.
   + **For models hosted by third-party providers**
     + In the **language model** dropdown list, select your provider ID. You can find the details of each provider, including their ID, in Jupyter AI [list of model providers](https://jupyter-ai.readthedocs.io/en/latest/users/index.html#model-providers).
     + (Optional) Select an [embedding model](sagemaker-jupyterai-overview.md#sagemaker-jupyterai-embedding-model) to which you have access. Embedding models are used to capture additional information from local documents, enabling the text generation model to respond to questions within the context of those documents.
      + Insert your model's API key.
     + Choose **Save Changes** and navigate to the left arrow icon (![\[Left-pointing arrow icon, typically used for navigation or returning to a previous page.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/jupyterai/jupyterai-return-to-chat.png)) in the top left corner of the left pane. This opens the Jupyter AI chat UI. You can start interacting with your model.

The following screenshot shows the chat UI configuration panel set to invoke a Flan-t5-small model provided by JumpStart and deployed in SageMaker AI.

![\[Chat UI configuration panel set to invoke a Flan-t5-small model provided by JumpStart.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/jupyterai/jupyterai-chatui-configuration.png)


### Pass extra model parameters and custom parameters to your request
<a name="sagemaker-jupyterai-configuration-model-parameters"></a>

Your model may need extra parameters, such as a custom attribute for user agreement approval, or adjustments to other model parameters such as temperature or response length. We recommend configuring these settings as a startup option of your JupyterLab application using a lifecycle configuration (LCC). For information on how to create a lifecycle configuration and attach it to your domain, or to a user profile, from the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/), see [Create and associate a lifecycle configuration](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-lcc.html). You can choose your LCC script when creating a space for your JupyterLab application.

Use the following JSON schema to configure your [extra parameters](sagemaker-jupyterai-use.md#sagemaker-jupyterai-extra-model-params):

```
{
  "AiExtension": {
    "model_parameters": {
      "<provider_id>:<model_id>": { <dictionary of model parameters, unpacked and passed as-is to the provider> }
    }
  }
}
```
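
The schema above can also be generated programmatically. The following Python sketch builds the `AiExtension` dictionary for the Jurassic-2 on Amazon Bedrock example used later in this section and serializes it to JSON; the parameter values shown are illustrative, not required:

```python
import json

# Provider and model IDs for the Amazon Bedrock Jurassic-2 example.
provider_id = "bedrock"
model_id = "ai21.j2-mid-v1"

# Model parameters are unpacked and passed as-is to the provider.
config = {
    "AiExtension": {
        "model_parameters": {
            f"{provider_id}:{model_id}": {"model_kwargs": {"maxTokens": 200}}
        }
    }
}

print(json.dumps(config, indent=2))
```

The resulting JSON is what you write to `~/.jupyter/jupyter_jupyter_ai_config.json` in the LCC scripts that follow.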

The following script is an example of a JupyterLab application LCC script that writes a JSON configuration file setting the maximum response length of an [AI21 Labs Jurassic-2 model](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-jurassic2.html) deployed on Amazon Bedrock. Increasing the length of the model's generated response can prevent the systematic truncation of the response.

```
#!/bin/bash
set -eux

mkdir -p /home/sagemaker-user/.jupyter

json='{"AiExtension": {"model_parameters": {"bedrock:ai21.j2-mid-v1": {"model_kwargs": {"maxTokens": 200}}}}}'
# equivalent to %%ai bedrock:ai21.j2-mid-v1 -m {"model_kwargs":{"maxTokens":200}}

# File path
file_path="/home/sagemaker-user/.jupyter/jupyter_jupyter_ai_config.json"

#jupyter --paths

# Write JSON to file
echo "$json" > "$file_path"

# Confirmation message
echo "JSON written to $file_path"

restart-jupyter-server

# Waiting for 30 seconds to make sure the Jupyter Server is up and running
sleep 30
```

The following script is an example of a JupyterLab application LCC script that writes a JSON configuration file setting additional model parameters for an [Anthropic Claude model](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-claude.html) deployed on Amazon Bedrock.

```
#!/bin/bash
set -eux

mkdir -p /home/sagemaker-user/.jupyter

json='{"AiExtension": {"model_parameters": {"bedrock:anthropic.claude-v2":{"model_kwargs":{"temperature":0.1,"top_p":0.5,"top_k":250,"max_tokens_to_sample":2000}}}}}'
# equivalent to %%ai bedrock:anthropic.claude-v2 -m {"model_kwargs":{"temperature":0.1,"top_p":0.5,"top_k":250,"max_tokens_to_sample":2000}}

# File path
file_path="/home/sagemaker-user/.jupyter/jupyter_jupyter_ai_config.json"

#jupyter --paths

# Write JSON to file
echo "$json" > "$file_path"

# Confirmation message
echo "JSON written to $file_path"

restart-jupyter-server

# Waiting for 30 seconds to make sure the Jupyter Server is up and running
sleep 30
```

Once you have attached your LCC to your domain or user profile, add the LCC to your space when launching your JupyterLab application. To confirm that your configuration file was updated by the LCC, run `more ~/.jupyter/jupyter_jupyter_ai_config.json` in a terminal. The content of the file should match the JSON passed to the LCC.
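
You can also verify a specific parameter programmatically. The following Python sketch writes the Jurassic-2 example configuration to a temporary file so that it is self-contained; in your JupyterLab space you would instead read `~/.jupyter/jupyter_jupyter_ai_config.json` directly, as noted in the comments:

```python
import json
import tempfile
from pathlib import Path

# In a JupyterLab space, point at the real file instead:
# config_path = Path.home() / ".jupyter" / "jupyter_jupyter_ai_config.json"
config_path = Path(tempfile.mkdtemp()) / "jupyter_jupyter_ai_config.json"

# Same JSON the first LCC script writes for the Jurassic-2 example.
config_path.write_text(
    '{"AiExtension": {"model_parameters": '
    '{"bedrock:ai21.j2-mid-v1": {"model_kwargs": {"maxTokens": 200}}}}}'
)

# Load the file and check the parameter the LCC was supposed to set.
config = json.loads(config_path.read_text())
max_tokens = config["AiExtension"]["model_parameters"][
    "bedrock:ai21.j2-mid-v1"]["model_kwargs"]["maxTokens"]
print(f"maxTokens is set to {max_tokens}")
```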

## Configure your model provider in a notebook
<a name="sagemaker-jupyterai-model-configuration-magic-commands"></a>

**To invoke a model via Jupyter AI within JupyterLab or Studio Classic notebooks using the `%%ai` and `%ai` magic commands**

1. Install the client libraries specific to your model provider in your notebook environment. For example, when using OpenAI models, you need to install the `openai` client library. You can find the list of the client libraries required per provider in the *Python package(s)* column of the Jupyter AI [Model providers list](https://jupyter-ai.readthedocs.io/en/latest/users/index.html#model-providers).
**Note**  
For models hosted by AWS, `boto3` is already installed in the SageMaker AI Distribution image used by JupyterLab, or any Data Science image used with Studio Classic.

1. 
   + **For models hosted by AWS**

     Ensure that your execution role has the permission to invoke your SageMaker AI endpoint for models provided by JumpStart or that you have access to Amazon Bedrock.
   + **For models hosted by third-party providers**

     Export your provider's API key in your notebook environment using environment variables. You can use the following magic command. Replace the `provider_API_key` in the command by the environment variable found in the *Environment variable* column of the Jupyter AI [Model providers list](https://jupyter-ai.readthedocs.io/en/latest/users/index.html#model-providers) for your provider.

     ```
     %env provider_API_key=your_API_key
     ```
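
Equivalently, you can set the provider's environment variable from Python before loading the magics. The following sketch uses `OPENAI_API_KEY`, the variable the OpenAI provider reads; substitute the variable listed for your provider, and replace the placeholder value with your real key:

```python
import os

# Placeholder value -- replace with your actual API key.
# Set this before running %load_ext jupyter_ai_magics.
os.environ["OPENAI_API_KEY"] = "your_API_key"

print("OPENAI_API_KEY" in os.environ)
```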

# Use Jupyter AI in JupyterLab or Studio Classic
<a name="sagemaker-jupyterai-use"></a>

You can use Jupyter AI in JupyterLab or Studio Classic by invoking language models from either the chat UI or from notebook cells. The following sections describe the steps for each approach.

## Use language models from the chat UI
<a name="sagemaker-jupyterai-use-chatui"></a>

Compose your message in the chat UI text box to start interacting with your model. To clear the message history, use the `/clear` command.

**Note**  
Clearing the message history does not erase the chat context with the model provider.

## Use language models from notebook cells
<a name="sagemaker-jupyterai-use-magic-commands"></a>

Before using the `%%ai` and `%ai` commands to invoke a language model, load the IPython extension by running the following command in a JupyterLab or Studio Classic notebook cell.

```
%load_ext jupyter_ai_magics
```
+ **For models hosted by AWS:**
  + To invoke a model deployed in SageMaker AI, pass the string `sagemaker-endpoint:endpoint-name` to the `%%ai` magic command with the required parameters below, then add your prompt in the following lines.

    The following table lists the required and optional parameters when invoking models hosted by SageMaker AI or Amazon Bedrock.<a name="sagemaker-jupyterai-jumpstart-inference-params"></a>    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-jupyterai-use.html)

    The following command invokes a [Llama2-7b](https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html) model hosted by SageMaker AI.

    ```
    %%ai sagemaker-endpoint:jumpstart-dft-meta-textgeneration-llama-2-7b -q {"inputs":"<prompt>","parameters":{"max_new_tokens":64,"top_p":0.9,"temperature":0.6,"return_full_text":false}} -n us-east-2 -p [0].generation -m {"endpoint_kwargs":{"CustomAttributes":"accept_eula=true"}} -f text
    Translate English to French:
    sea otter => loutre de mer
    peppermint => menthe poivrée
    plush girafe => girafe peluche
    cheese =>
    ```

    The following example invokes a Flan-t5-small model hosted by SageMaker AI.

    ```
    %%ai sagemaker-endpoint:hf-text2text-flan-t5-small --request-schema={"inputs":"<prompt>","parameters":{"num_return_sequences":4}} --region-name=us-west-2 --response-path=[0]["generated_text"] -f text
    What is the atomic number of Hydrogen?
    ```
  + To invoke a model deployed in Amazon Bedrock, pass the string `bedrock:model-name` to the `%%ai` magic command with any optional parameter defined in the list of [parameters for invoking models hosted by JumpStart or Amazon Bedrock](#sagemaker-jupyterai-jumpstart-inference-params), then add your prompt in the following lines.

    The following example invokes an [AI21 Labs Jurassic-2 model](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-jurassic2.html) hosted by Amazon Bedrock.

    ```
    %%ai bedrock:ai21.j2-mid-v1 -m {"model_kwargs":{"maxTokens":256}} -f code
    Write a function in python implementing a bubble sort.
    ```
+ **For models hosted by third-party providers**

  To invoke a model hosted by third-party providers, pass the string `provider-id:model-name` to the `%%ai` magic command with an optional [`Output format`](#sagemaker-jupyterai-output-format-params), then add your prompt in the following lines. You can find the details of each provider, including their ID, in the Jupyter AI [list of model providers](https://jupyter-ai.readthedocs.io/en/latest/users/index.html#model-providers).

  The following command asks an Anthropic Claude model to output an HTML file containing the image of a white square with black borders.

  ```
  %%ai anthropic:claude-v1.2 -f html
  Create a square using SVG with a black border and white fill.
  ```
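
To make the `-q` (request schema) and `-p` (response path) options above concrete, the following Python sketch approximates what the `sagemaker-endpoint` magic does: send the schema to the endpoint, then follow a `[0].generation`-style path into the reply. The `extract` helper and the sample payload are illustrative simplifications, and the boto3 `invoke_endpoint` call is shown as a comment because it requires a live endpoint:

```python
import json

def extract(payload, path):
    """Follow a minimal '[0].generation'-style response path."""
    result = payload
    for part in path.replace("[", ".").replace("]", "").split("."):
        if not part:
            continue
        result = result[int(part)] if part.isdigit() else result[part]
    return result

# With a live endpoint, the payload would come from SageMaker AI runtime:
# import boto3
# runtime = boto3.client("sagemaker-runtime", region_name="us-east-2")
# response = runtime.invoke_endpoint(
#     EndpointName="jumpstart-dft-meta-textgeneration-llama-2-7b",
#     ContentType="application/json",
#     CustomAttributes="accept_eula=true",
#     Body=json.dumps({"inputs": "cheese =>",
#                      "parameters": {"max_new_tokens": 64}}),
# )
# payload = json.loads(response["Body"].read())

# Hypothetical payload in the Llama2 response format.
payload = [{"generation": "fromage"}]

print(extract(payload, "[0].generation"))
```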

# Amazon Q Developer
<a name="studio-updated-amazon-q"></a>

Amazon Q Developer is a generative AI conversational assistant that helps you write better code. Amazon Q Developer is available in the following IDEs within Amazon SageMaker Studio:
+ JupyterLab
+ Code Editor, based on Code-OSS, Visual Studio Code - Open Source

Use the following sections to set up Amazon Q Developer and use it within your environment.

**Topics**
+ [Set up Amazon Q Developer for your users](studio-updated-amazon-q-admin-guide-set-up.md)
+ [Use Amazon Q to Expedite Your Machine Learning Workflows](studio-updated-user-guide-use-amazon-q.md)
+ [Customize Amazon Q Developer in Amazon SageMaker Studio applications](q-customizations.md)

# Set up Amazon Q Developer for your users
<a name="studio-updated-amazon-q-admin-guide-set-up"></a>

Amazon Q Developer is a generative AI conversational assistant. You can set up Amazon Q Developer within a new domain or an existing domain. Use the following information to set up Amazon Q Developer.

With Amazon Q Developer, your users can:
+ Receive step-by-step guidance on using SageMaker AI features independently or in combination with other AWS services.
+ Get sample code to get started on your ML tasks such as data preparation, training, inference, and MLOps.
+ Receive troubleshooting assistance to debug and resolve errors encountered while running code.

**Note**  
Amazon Q Developer in Studio doesn't use user content to improve the service, regardless of whether you use the Free-tier or Pro-tier subscription. For IDE-level telemetry sharing, Amazon Q might track your users' usage, such as the number of questions asked and whether recommendations were accepted or rejected. This telemetry data doesn't include personally identifiable information such as the users' IP address. For more information on data protection and instructions for opting out, see [Opt out of data sharing in the IDE](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/opt-out-IDE.html).

You can set up Amazon Q Developer with either a Pro or Free tier subscription. The Pro tier is a paid subscription service with higher usage limits and other features. For more information about the differences between the tiers, see [Understanding tiers of service for Amazon Q Developer](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/q-tiers.html).

For information about subscribing to Amazon Q Developer Pro, see [Subscribing to Amazon Q Developer Pro](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/q-admin-setup-subscribe-general.html).

## Set up instructions for Amazon Q Developer Free Tier
<a name="studio-updated-amazon-q-developer-free-tier-set-up"></a>

To set up Amazon Q Developer Free Tier, use the following procedure:

**To set up Amazon Q Developer Free Tier**

1. Add the following policy to the IAM role that you've used to create your JupyterLab or Code Editor space:


   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "q:SendMessage"
               ],
               "Resource": [
                   "*"
               ]
           },
           {
               "Sid": "AmazonQDeveloperPermissions",
               "Effect": "Allow",
               "Action": [
                   "codewhisperer:GenerateRecommendations"
               ],
               "Resource": "*"
           }
       ]
   }
   ```


1. Navigate to Amazon SageMaker Studio.

1. Open your JupyterLab or Code Editor space.

1. Navigate to the **Launcher** and choose **Terminal**.

1. In JupyterLab, do the following:

   1. Run `restart-jupyter-server`.

   1. Restart your browser and navigate back to Amazon SageMaker Studio.
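
If you prefer to attach the policy from step 1 programmatically, the following Python sketch uses boto3's `put_role_policy`. The role name and policy name are hypothetical placeholders, and the IAM call is commented out because it requires AWS credentials:

```python
import json

# Inline policy from step 1, granting the Free Tier Q Developer actions.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["q:SendMessage"], "Resource": ["*"]},
        {
            "Sid": "AmazonQDeveloperPermissions",
            "Effect": "Allow",
            "Action": ["codewhisperer:GenerateRecommendations"],
            "Resource": "*",
        },
    ],
}

# With credentials configured, attach it to your space's execution role:
# import boto3
# iam = boto3.client("iam")
# iam.put_role_policy(
#     RoleName="example-space-execution-role",   # hypothetical role name
#     PolicyName="AmazonQDeveloperFreeTier",     # hypothetical policy name
#     PolicyDocument=json.dumps(policy_document),
# )

print(json.dumps(policy_document))
```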

## Set up instructions for Amazon Q Developer Pro tier
<a name="studio-updated-amazon-q-developer-pro-set-up"></a>

**Prerequisites**  
To set up Amazon Q Developer Pro, you must have:  
+ An Amazon SageMaker AI domain set up for your organization with IAM Identity Center configured as the means of access.
+ An Amazon Q Developer Pro subscription.

If you've already set up a domain for your organization, you must update it to use Amazon Q Developer. You can use either the AWS Management Console or the AWS Command Line Interface to update the domain.

You must use the ARN of your Amazon Q Developer profile. You can find the Q Profile ARN on the [Q Developer Settings](https://console.aws.amazon.com/amazonq/developer/settings) page.

You can use the following AWS Command Line Interface command to update your domain:

```
aws --region AWS Region sagemaker update-domain --domain-id domain-id --domain-settings-for-update "AmazonQSettings={Status=ENABLED,QProfileArn=Q-Profile-ARN}"           
```
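
The same update can be made from Python with boto3. The following sketch builds the request parameters; the domain ID and Q Profile ARN are placeholders, and the `update_domain` call is commented out because it requires AWS credentials and a boto3 version recent enough to support `AmazonQSettings`:

```python
# Request parameters for enabling Amazon Q Developer on a domain.
# Both values below are placeholders -- substitute your own.
params = {
    "DomainId": "d-examplexxxxx",
    "DomainSettingsForUpdate": {
        "AmazonQSettings": {
            "Status": "ENABLED",
            "QProfileArn": "Q-Profile-ARN",
        }
    },
}

# With credentials configured:
# import boto3
# sagemaker = boto3.client("sagemaker", region_name="us-east-1")
# sagemaker.update_domain(**params)

print(params["DomainSettingsForUpdate"]["AmazonQSettings"]["Status"])
```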

You can also use the following procedure to update the domain within the AWS Management Console.

1. Navigate to the [Amazon SageMaker AI](https://console.aws.amazon.com/sagemaker) console.

1. Choose **Domains**, and then select your domain.

1. Select **App Configurations**.

1. For **Amazon Q Developer for SageMaker AI Applications**, choose **Edit**.

1. Select **Enable Amazon Q Developer on this domain**.

1. Provide the Q Profile ARN.

1. Choose **Submit**.

You must use the ARN of your Amazon Q Developer profile. You can find the ARN of the Q Profile on the **Amazon Q account details** page of the [Amazon Q Developer](https://console.aws.amazon.com/amazonq/developer) console.

**Set up for organizations** is an advanced setup option for the Amazon SageMaker AI domain that lets you use IAM Identity Center. For information about how you can set up the domain and information about setting up IAM Identity Center, see [Use custom setup for Amazon SageMaker AI](onboard-custom.md).

When setting up Amazon Q Developer in a new domain, you can either use the AWS Management Console or the following AWS Command Line Interface command from your local machine:

```
aws --region AWS Region sagemaker create-domain --domain-name "example-domain-name" --vpc-id example-vpc-id --subnet-ids example-subnet-ids --auth-mode SSO --default-user-settings "ExecutionRole=arn:aws:iam::111122223333:role/IAM-role" --domain-settings "AmazonQSettings={Status=ENABLED,QProfileArn=Q-Profile-ARN}" --query DomainArn --output text
```

You can use the following AWS CLI command to disable Amazon Q Developer:

```
aws --region AWS Region sagemaker update-domain --domain-id domain-id --domain-settings-for-update "AmazonQSettings={Status=DISABLED,QProfileArn=Q-Profile-ARN}"           
```

We recommend using the latest version of the AWS Command Line Interface. For information about updating the AWS CLI, see [Install or update to the latest version of the AWS Command Line Interface](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).

If you need to establish a connection between Amazon Q Developer and your VPC, see [Creating an interface VPC endpoint for Amazon Q ](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/vpc-interface-endpoints.html#vpc-endpoint-create).

**Note**  
Amazon Q Developer has the following limitations:  
It doesn't support shared spaces.
Amazon Q Developer detects whether a code suggestion might be too similar to publicly available code. The reference tracker can flag suggestions with repository URLs and licenses, or filter them out, so you can review the referenced code and its usage before you adopt it. All references are logged for later review, so your coding flow isn't disturbed and you can keep coding without interruption.  
For more information about code references, see [Using code references - Amazon Q Developer](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/code-reference.html) and [AI Coding Assistant - Amazon Q Developer FAQs](https://aws.amazon.com/q/developer/faqs/?refid=255ccf7b-4a76-4dcb-9b07-68709e2b636b#:~:text=Can%20I%20prevent%20Amazon%20Q%20Developer%20from%20recommending%20code%20with%20code%20references%3F).
Amazon Q processes all user interaction data within the US East (N. Virginia) AWS Region. For more information about how Amazon Q processes data and the AWS Regions it supports, see [Supported Regions for Amazon Q Developer](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/regions.html).
Amazon Q only works within Amazon SageMaker Studio. It is not supported within Amazon SageMaker Studio Classic.
On JupyterLab, Amazon Q works within SageMaker AI Distribution Images version 2.0 and above. On Code Editor, Amazon Q works within SageMaker AI Distribution Images version 2.2.1 and above.
Amazon Q Developer in JupyterLab works within the Jupyter AI extension. You can't use other third-party models within the extension while you're using Amazon Q.

## Amazon Q customizations in Amazon SageMaker AI
<a name="q-customizations-in-sagemaker"></a>

If you use Amazon Q Developer Pro, you have the option to create *customizations*. With customizations, Amazon Q Developer provides suggestions based on your company's codebase. If you create customizations in Amazon Q Developer, they become available for you to use in JupyterLab and Code Editor in Amazon SageMaker Studio. For more information about setting up customizations, see [Customizing suggestions](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/customizations.html) in the *Amazon Q Developer User Guide*.

# Use Amazon Q to Expedite Your Machine Learning Workflows
<a name="studio-updated-user-guide-use-amazon-q"></a>

Amazon Q Developer is your AI-powered companion for machine learning development. With Amazon Q Developer, you can:
+ Receive step-by-step guidance on using SageMaker AI features independently or in combination with other AWS services.
+ Get sample code to get started on your ML tasks such as data preparation, training, inference, and MLOps.

To use Amazon Q Developer, choose the **Q** icon in the left-hand navigation of your JupyterLab or Code Editor environment.

If you don't see the **Q** icon, your administrator needs to set it up for you. For more information about setting up Amazon Q Developer, see [Set up Amazon Q Developer for your users](studio-updated-amazon-q-admin-guide-set-up.md).

Amazon Q automatically provides suggestions to help you write your code. You can also ask for suggestions through the chat interface.

# Customize Amazon Q Developer in Amazon SageMaker Studio applications
<a name="q-customizations"></a>

You can customize Amazon Q Developer in the JupyterLab and Code Editor applications in Amazon SageMaker Studio. When you customize Q Developer, it provides suggestions and answers based on examples from your codebase. If you use Amazon Q Developer Pro, you can load any customizations that you've created with that service. 

## Customize in JupyterLab
<a name="q-customizations-jupyterlab"></a>

In JupyterLab, you can load any customizations that you've created with Amazon Q Developer Pro. Or, in your JupyterLab space, you can customize Q Developer locally with files that you upload to the space.

### To use customizations that you've created in Amazon Q Developer Pro
<a name="use-q-customizations-jupyterlab"></a>

When you load a customization, Q Developer provides suggestions based on the codebase that you used to create the customization. Also, when you use the chat in the **Amazon Q** panel, you interact with your customization.

For more information about setting up customizations, see [Customizing suggestions](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/customizations.html) in the *Amazon Q Developer User Guide*.

**To load your customization**

Open your JupyterLab space and complete the following steps.

1. In the status bar at the bottom of JupyterLab, choose **Amazon Q**. A menu opens.

1. In the menu, choose **Other Features**. The **Amazon Q Features** tab opens in the main work area.

1. In the **Amazon Q Features** tab, under **Select Customization**, choose your Q Developer customization.

1. Interact with your customization in either of the following ways:
   + Create a notebook, and write code in it. As you do, Q Developer automatically provides tailored inline suggestions based on your customization.
   + Chat with Q Developer in the **Amazon Q** panel by following these steps:

     1. In the left sidebar in JupyterLab, choose the **Jupyter AI Chat** icon. The **Amazon Q** panel opens.

     1. Use the **Ask Amazon Q** chat box to interact with your customization.

### To customize Amazon Q Developer with files in your JupyterLab space
<a name="customize-q-in-jupyterlab"></a>

In JupyterLab, you can customize Q Developer with files that you upload to your space. Then, in the chat in the **Amazon Q** panel, you can use a command to ask Q Developer about those files.

When you customize Q Developer with files in your space, the customization exists only in your space. You can't load the customization elsewhere, such as in other spaces or in the Amazon Q Developer console.

You can customize Q Developer with files in JupyterLab if you use either Amazon Q Developer Pro or Amazon Q Developer at the Free tier.

**To customize with your files**

Open your JupyterLab space and complete the following steps.

1. Check whether your space is configured with the required embedding model. You can customize Q Developer in JupyterLab only if you use the default embedding model, which is **CodeSage :: codesage-small**. To check, do the following:

   1. In the left sidebar in JupyterLab, choose the **Jupyter AI Chat** icon. The **Amazon Q** panel opens.

   1. Choose the settings icon in the upper-right corner of the panel.

   1. For **Embedding model**, if necessary, choose **CodeSage :: codesage-small**, and choose **Save Changes**.

   1. In the upper-right corner of the panel, choose the back icon. 

1. To upload files that you want to customize Q Developer with, in the **File Browser** panel, choose the **Upload Files** icon.

1. After you upload your files, in the **Ask Amazon Q** chat box, type `/learn file path/`. Replace *file path/* with the path to your files in your JupyterLab space. When Amazon Q finishes processing your files, it confirms with a chat message in the Amazon Q panel.

1. To ask Q Developer a question about your files, type `/ask` in the chat box, and follow the command with your question. Amazon Q generates an answer based on your files, and it responds in the chat.

For more information about the `/learn` and `/ask` commands, such as their options and supported arguments, see [Learning about local data](https://jupyter-ai.readthedocs.io/en/latest/users/index.html#learning-about-local-data) in the Jupyter AI user documentation. That page explains how to use the commands with the Jupyternaut AI chatbot. JupyterLab in Amazon SageMaker Studio supports the same command syntax.

## Customize in Code Editor
<a name="q-customizations-code-editor"></a>

If you've created a customization in Amazon Q Developer Pro, you can load it in Code Editor. Then, when Q Developer provides suggestions for your code, it bases them on the codebase that you used to create the customization. Also, when you use the chat in the **Amazon Q: Chat** panel, you interact with your customization.

**To use customizations that you've created in Amazon Q Developer Pro**

Open your Code Editor space and complete the following steps.

1. In the Code Editor menu, choose **View**, and then choose **Command Palette**.

1. In the command palette, begin typing **>Amazon Q: Select Customization**, and choose that option in the filtered list of commands when it appears. The command palette shows your Q Developer customizations.

1. Choose your customization.

1. Interact with your customization in either of the following ways:
   + Create a Python file or a Jupyter notebook, and write code in it. As you do, Q Developer automatically provides tailored inline suggestions based on your customization.
   + Chat with Q Developer in the **Amazon Q** panel by following these steps:

     1. In the left sidebar in Code Editor, choose the **Amazon Q** icon. The **Amazon Q: Chat** panel opens.

     1. Use the chat box to interact with your customization.

For more information about the capabilities of Q Developer, see [Using Amazon Q Developer in the IDE](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/q-in-IDE.html) in the *Amazon Q Developer User Guide*.

# Amazon SageMaker Partner AI Apps overview
<a name="partner-apps"></a>

With Amazon SageMaker Partner AI Apps, users get access to generative AI and machine learning (ML) development applications built, published, and distributed by industry-leading application providers. Partner AI Apps are certified to run on SageMaker AI. With Partner AI Apps, users can accelerate and improve how they build solutions based on foundation models (FM) and classic ML models without compromising the security of their sensitive data. The data stays completely within their trusted security configuration and is never shared with a third party.  

## How it works
<a name="partner-apps-how-works"></a>

Partner AI Apps are full application stacks that include an Amazon Elastic Kubernetes Service cluster and an array of accompanying services that can include Application Load Balancer, Amazon Relational Database Service, Amazon Simple Storage Service buckets, Amazon Simple Queue Service queues, and Redis caches. 

These service applications can be shared across all users in a SageMaker AI domain and are provisioned by an admin. After provisioning the application by purchasing a subscription through the AWS Marketplace, the admin can give users in the SageMaker AI domain permissions to access the Partner AI App directly from Amazon SageMaker Studio, Amazon SageMaker Unified Studio (preview), or using a pre-signed URL. For information about launching an application from Studio, see [Launch Amazon SageMaker Studio](studio-updated-launch.md). 

Partner AI Apps offers the following benefits for administrators and users.  
+  Administrators use the SageMaker AI console to browse, discover, select, and provision the Partner AI Apps for use by their data science and ML teams. After the Partner AI Apps are deployed, SageMaker AI runs them on service-managed AWS accounts. This significantly reduces the operational overhead associated with building and operating these applications, and contributes to the security and privacy of customer data. 
+  Data scientists and ML developers can access Partner AI Apps from within their ML development environment in Amazon SageMaker Studio or Amazon SageMaker Unified Studio (preview). They can use the Partner AI Apps to analyze their data, experiments, and models created on SageMaker AI. This minimizes context switching and helps accelerate building foundation models and bringing new generative AI capabilities to market. 

## Integration with AWS services
<a name="partner-apps-integration"></a>

Partner AI Apps uses the existing AWS Identity and Access Management (IAM) configuration for authorization and authentication. As a result, users don’t need to provide separate credentials to access each Partner AI App from Amazon SageMaker Studio. For more information about authorization and authentication with Partner AI Apps, see [Set up Partner AI Apps](partner-app-onboard.md). 

Partner AI Apps also integrates with Amazon CloudWatch to provide operational monitoring and management. Customers can also browse Partner AI Apps, and get details about them, such as features, customer experience, and pricing, from the AWS Management Console. For information about Amazon CloudWatch, see [How Amazon CloudWatch works](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_architecture.html). 

Partner AI applications such as Deepchecks support integration with Amazon Bedrock to enable LLM-based evaluation features such as "LLM as a judge" evaluations and automated annotation capabilities. When Amazon Bedrock integration is enabled, the Partner AI App uses your customer-managed Amazon Bedrock account to access foundation models, ensuring that your data remains within your trusted security configuration. For more information about configuring Amazon Bedrock integration, see [Configure Amazon Bedrock integration](partner-app-onboard.md#partner-app-onboard-admin-bedrock).

## Supported types
<a name="partner-apps-supported"></a>

Partner AI Apps support the following types: 
+ Comet 
+  Deepchecks 
+  Fiddler 
+  Lakera Guard 

 When the admin launches a Partner AI App, they must select the configuration of the instance cluster that the Partner AI App is launched with. This configuration is known as the Partner AI App's tier. A Partner AI App's tier can be one of the following values: 
+  `small` 
+  `medium` 
+  `large` 

 The following sections give information about each of the Partner AI App types, and details about the Partner AI App's tier values. 

### Comet overview
<a name="partner-apps-supported-comet"></a>

 Comet provides an end-to-end model evaluation platform for AI developers, with LLM evaluations, experiment tracking, and production monitoring. 

 We recommend the following Partner AI App tiers based on the workload: 
+  `small` – Recommended for up to 5 users and 20 running jobs. 
+  `medium` – Recommended for up to 50 users and 100 running jobs. 
+  `large` – Recommended for up to 500 users and more than 100 running jobs. 

**Note**  
SageMaker AI does not support viewing the Comet UI as part of the output of a Jupyter notebook. 

### Deepchecks overview
<a name="partner-apps-supported-deepchecks"></a>

AI application developers and stakeholders can use Deepchecks to continuously validate LLM-based applications, including their characteristics, performance metrics, and potential pitfalls, throughout the entire lifecycle, from pre-deployment and internal experimentation to production. 

 We recommend the following Partner AI App tiers based on the speed desired for the workload: 
+  `small` – Processes 200 tokens per second. 
+  `medium` – Processes 500 tokens per second. 
+  `large` – Processes 1300 tokens per second. 

### Fiddler overview
<a name="partner-apps-supported-fiddler"></a>

 The Fiddler AI Observability Platform facilitates validating, monitoring, and analyzing ML models in production, including tabular, deep learning, computer vision, and natural language processing models. 

 We recommend the following Partner AI App tiers based on the speed desired for the workload: 
+  `small` – Processing 10 million events across 5 models, 100 features, and 20 iterations takes about 53 minutes. 
+  `medium` – Processing 10 million events across 5 models, 100 features, and 20 iterations takes about 23 minutes. 
+  `large` – Processing 10 million events across 5 models, 100 features, and 100 iterations takes about 27 minutes. 

### Lakera Guard overview
<a name="partner-apps-supported-lakera-guard"></a>

Lakera Guard is a low-latency AI application firewall that secures generative AI applications against generative AI-specific threats. 

 We recommend the following Partner AI App tiers based on the workload: 
+  `small` – Recommended for up to 20 Robotic Process Automations (RPAs). 
+  `medium` – Recommended for up to 100 RPAs. 
+  `large` – Recommended for up to 200 RPAs. 

# Set up Partner AI Apps
<a name="partner-app-onboard"></a>

 The following topics describe the permissions needed to start using Amazon SageMaker Partner AI Apps. The permissions required are split into two parts, depending on the user permissions level: 
+  **Administrative permissions** – Permissions for administrators setting up data scientist and machine learning (ML) developer environments.
  + AWS Marketplace
  +  Partner AI Apps management 
  +  AWS License Manager 
+  **User permissions** – Permissions for data scientists and machine learning developers. 
  +  User authorization 
  +  Identity propagation 
  +  SDK access 

## Prerequisites
<a name="partner-app-onboard-prereq"></a>

 Admins can complete the following prerequisites to set up Partner AI Apps. 
+ (Optional) Onboard to a SageMaker AI domain. Partner AI Apps can be accessed directly from a SageMaker AI domain. For more information, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md). 
  + If using Partner AI Apps in a SageMaker AI domain in VPC-only mode, admins must create an endpoint with the following format to connect to the Partner AI Apps. For more information about using Studio in VPC-only mode, see [Connect Amazon SageMaker Studio in a VPC to External Resources](studio-updated-and-internet-access.md). 

    ```
    aws.sagemaker.region.partner-app
    ```
+ (Optional) If admins are interacting with the domain using the AWS CLI, they must also complete the following prerequisites. 

  1. Update the AWS CLI by following the steps in [Installing the current AWS CLI Version](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv1.html#install-tool-bundled).  

  1. From the local machine, run `aws configure` and provide AWS credentials. For information about AWS credentials, see [Understanding and getting your AWS credentials](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html).
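For the VPC-only prerequisite above, the interface endpoint can also be created programmatically. The following is a minimal sketch using boto3; the VPC, subnet, and security group IDs are placeholders, and only the `aws.sagemaker.region.partner-app` service name format comes from this guide.

```python
def partner_app_service_name(region):
    """Build the Partner AI Apps endpoint service name for a Region."""
    return f"aws.sagemaker.{region}.partner-app"

def create_partner_app_endpoint(region, vpc_id, subnet_ids, security_group_ids):
    """Create the interface VPC endpoint (requires AWS credentials when called)."""
    import boto3  # deferred so the helper above stays usable without boto3
    ec2 = boto3.client("ec2", region_name=region)
    return ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId=vpc_id,  # for example, "vpc-0123456789abcdef0"
        ServiceName=partner_app_service_name(region),
        SubnetIds=subnet_ids,
        SecurityGroupIds=security_group_ids,
        PrivateDnsEnabled=True,  # assumption: the service supports private DNS
    )
```

The subnets and security group must allow HTTPS traffic from the Studio environment, per the VPC-only mode guidance linked above.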

## Administrative permissions
<a name="partner-app-onboard-admin"></a>

 The administrator must add the following permissions to enable Partner AI Apps in SageMaker AI. 
+  Permission to complete AWS Marketplace subscription for Partner AI Apps 
+  Set up Partner AI App execution role 

### AWS Marketplace subscription for Partner AI Apps
<a name="partner-app-onboard-admin-marketplace"></a>

Admins must complete the following steps to add permissions for AWS Marketplace. For information about using AWS Marketplace, see [Getting started as a buyer using AWS Marketplace](https://docs.aws.amazon.com/marketplace/latest/buyerguide/buyer-getting-started.html).

1. Grant permissions for AWS Marketplace. Partner AI Apps administrators require these permissions to purchase subscriptions to Partner AI Apps from AWS Marketplace. To get access to AWS Marketplace, admins must attach the `AWSMarketplaceManageSubscriptions` managed policy to the IAM role that they're using to access the SageMaker AI console and purchase the app. For details about the `AWSMarketplaceManageSubscriptions` managed policy, see [AWS managed policies for AWS Marketplace buyers](https://docs.aws.amazon.com/marketplace/latest/buyerguide/buyer-security-iam-awsmanpol.html#security-iam-awsmanpol-awsmarketplacemanagesubscriptions). For information about attaching managed policies, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html). 

1. Grant permissions for SageMaker AI to run operations on the admin's behalf using other AWS services. Admins must grant SageMaker AI permissions to use these services and the resources that they act upon. The following policy definition demonstrates how to grant the required Partner AI Apps permissions. These permissions are needed in addition to the existing permissions for the admin role. For more information, see [How to use SageMaker AI execution roles](sagemaker-roles.md).

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "sagemaker:CreatePartnerApp",
                   "sagemaker:DeletePartnerApp",
                   "sagemaker:UpdatePartnerApp",
                   "sagemaker:DescribePartnerApp",
                   "sagemaker:ListPartnerApps",
                    "sagemaker:CreatePartnerAppPresignedUrl",
                   "sagemaker:AddTags",
                   "sagemaker:ListTags",
                   "sagemaker:DeleteTags"
               ],
               "Resource": "*"
           },
           {
               "Effect": "Allow",
               "Action": [
                   "iam:PassRole"
               ],
               "Resource": "arn:aws:iam::*:role/*",
               "Condition": {
                   "StringEquals": {
                        "iam:PassedToService": "sagemaker.amazonaws.com"
                    } 
               }
           }
       ]
   }
   ```

------

### Set up Partner AI App execution role
<a name="partner-app-onboard-admin-role"></a>

1. Partner AI Apps require an execution role to interact with resources in the AWS account. Admins can create this execution role using the AWS CLI. The Partner AI App uses this role to complete actions related to Partner AI App functionality. 

   ```
   aws iam create-role --role-name PartnerAiAppExecutionRole --assume-role-policy-document '{
     "Version": "2012-10-17",		 	 	 
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "Service": [
             "sagemaker.amazonaws.com"
           ]
         },
         "Action": "sts:AssumeRole"
       }
     ]
   }'
   ```

1.  Create the AWS License Manager service-linked role by following the steps in [Create a service-linked role for License Manager](https://docs.aws.amazon.com/license-manager/latest/userguide/license-manager-role-core.html#create-slr-core).  

1.  Grant permissions for the Partner AI App to access License Manager using the AWS CLI. These permissions allow the Partner AI App to verify access to its license.

   ```
   aws iam put-role-policy --role-name PartnerAiAppExecutionRole --policy-name LicenseManagerPolicy --policy-document '{
     "Version": "2012-10-17",		 	 	 
     "Statement": {
       "Effect": "Allow",
       "Action": [
         "license-manager:CheckoutLicense",
         "license-manager:CheckInLicense",
         "license-manager:ExtendLicenseConsumption",
         "license-manager:GetLicense",
         "license-manager:GetLicenseUsage"
       ],
       "Resource": "*"
     }
   }'
   ```

1.  If the Partner AI App requires access to an Amazon S3 bucket, then add Amazon S3 permissions to the execution role. For more information, see [Required permissions for Amazon S3 API operations](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-with-s3-policy-actions.html). 
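When scoping these Amazon S3 permissions, a minimal sketch of an inline policy limited to a single bucket follows. The bucket name and action list are illustrative assumptions, not requirements of any specific Partner AI App.

```python
import json

# Hypothetical bucket; scope the policy to the bucket the Partner AI App actually needs.
BUCKET = "amzn-s3-demo-bucket"

s3_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",    # bucket-level actions such as s3:ListBucket
                f"arn:aws:s3:::{BUCKET}/*",  # object-level actions
            ],
        }
    ],
}

# Attach as an inline policy, for example:
#   aws iam put-role-policy --role-name PartnerAiAppExecutionRole \
#       --policy-name S3AccessPolicy --policy-document file://s3-policy.json
print(json.dumps(s3_policy, indent=2))
```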

### Configure Amazon Bedrock integration
<a name="partner-app-onboard-admin-bedrock"></a>

Partner AI applications such as Deepchecks support integration with Amazon Bedrock to enable LLM-based evaluation features. When configuring a Partner AI App with Amazon Bedrock support, administrators can specify which foundation models and inference profiles are available for use within the application. If you need to increase the quota limit for your Amazon Bedrock models, see [Request an increase for Amazon Bedrock quotas](https://docs.aws.amazon.com/bedrock/latest/userguide/quotas-increase.html).

1. Ensure the Partner AI App execution role has the required Amazon Bedrock permissions. Add the following permissions to enable Amazon Bedrock model access:

   ```
   aws iam put-role-policy --role-name PartnerAiAppExecutionRole --policy-name BedrockInferencePolicy --policy-document '{
   	   "Version": "2012-10-17",		 	 	 
   	   "Statement": {
   	     "Effect": "Allow",
   	     "Action": [
   	       "bedrock:InvokeModel",
   	       "bedrock:GetFoundationModel",
   	       "bedrock:GetInferenceProfile"
   	     ],
   	     "Resource": "*"
   	   }
   	 }'
   ```

1. Identify the Amazon Bedrock models that your organization wants to make available to the Partner AI App. You can view available models in your region using the Amazon Bedrock console. For information about model availability across regions, see [Model support by AWS Region](https://docs.aws.amazon.com/bedrock/latest/userguide/models-regions.html).

1. (Optional) Create customer-managed inference profiles for cost tracking and model management. Inference profiles allow you to track Amazon Bedrock usage specifically for the Partner AI App and can enable cross-region inference when models are not available in your current region. For more information, see [ Using inference profiles in Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/inference-profiles.html).

1. When creating or updating the Partner AI App, specify the allowed models and inference profiles using the `CreatePartnerApp` or `UpdatePartnerApp` API. The Partner AI App will only be able to access the models and inference profiles that you explicitly configure.
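Rather than `"Resource": "*"` as in the policy in step 1, the execution role can be scoped to the specific models identified above. The following sketch builds foundation model ARNs from the documented `arn:aws:bedrock:<region>::foundation-model/<model-id>` format; the model ID shown is a placeholder.

```python
def bedrock_model_arns(region, model_ids):
    """Foundation model ARNs use the format arn:aws:bedrock:<region>::foundation-model/<model-id>."""
    return [f"arn:aws:bedrock:{region}::foundation-model/{model_id}" for model_id in model_ids]

# Placeholder model ID; substitute the models your organization approved.
scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["bedrock:InvokeModel", "bedrock:GetFoundationModel"],
            "Resource": bedrock_model_arns(
                "us-east-1", ["anthropic.claude-3-haiku-20240307-v1:0"]
            ),
        }
    ],
}
```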

**Important**  
Amazon Bedrock usage through Partner AI Apps is billed directly to your AWS account using your existing Amazon Bedrock pricing. The Partner AI App infrastructure costs are separate from Amazon Bedrock model inference costs.

#### Deepchecks Amazon Bedrock integration
<a name="partner-app-onboard-admin-bedrock-deepchecks"></a>

Deepchecks supports Amazon Bedrock integration for LLM-based evaluation capabilities, including:
+ *LLM as a judge evaluations* - Use foundation models to automatically evaluate model outputs for quality, relevance, and other criteria
+ *Automated annotation* - Generate labels and annotations for datasets using foundation models
+ *Content analysis* - Analyze text data for bias, toxicity, and other quality metrics using LLM capabilities

For detailed information about Deepchecks Amazon Bedrock features and configuration, see the Deepchecks documentation within the application.

## User permissions
<a name="partner-app-onboard-user"></a>

 After admins have completed the administrative permissions settings, they must make sure that users have the permissions needed to access the Partner AI Apps.

1. Grant permissions for SageMaker AI to run operations on your behalf using other AWS services. Admins must grant SageMaker AI permissions to use these services and the resources that they act upon. Admins grant SageMaker AI these permissions using an IAM execution role. For more information about IAM roles, see [IAM roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html). The following policy definition demonstrates how to grant the required Partner AI Apps permissions. This policy can be added to the execution role of the user profile.  For more information, see [How to use SageMaker AI execution roles](sagemaker-roles.md). 

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "sagemaker:DescribePartnerApp",
                   "sagemaker:ListPartnerApps",
                   "sagemaker:CreatePartnerAppPresignedUrl"
               ],
               "Resource": "arn:aws:sagemaker:*:*:partner-app/app-*"
           }
       ]
   }
   ```

------

1.  (Optional) If launching Partner AI Apps from Studio, add the `sts:TagSession` trust policy to the role used to launch Studio or the Partner AI Apps directly as follows. This makes sure that the identity can be propagated properly.

   ```
   {
       "Effect": "Allow",
       "Principal": {
           "Service": "sagemaker.amazonaws.com"
       },
       "Action": [
                   "sts:AssumeRole",
                   "sts:TagSession"
                ]
   }
   ```

1.  (Optional) If using the SDK of a Partner AI App to access functionality in SageMaker AI, add the following `CallPartnerAppApi` permission to the role used to run the SDK code. If running the SDK code from Studio, add the permission to the Studio execution role. If running the code from anywhere other than Studio, add the permission to the IAM role used with the notebook. This gives the user access to the Partner AI App functionality from the Partner AI App’s SDK. 

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Sid": "Statement1",
               "Effect": "Allow",
               "Action": [
                   "sagemaker:CallPartnerAppApi"
               ],
               "Resource": [
                   "arn:aws:sagemaker:us-east-1:111122223333:partner-app/app"
               ]
           }
       ]
   }
   ```

------

### Manage user authorization and authentication
<a name="partner-app-onboard-user-auth"></a>

To provide access to Partner AI Apps to members of their team, admins must make sure that the identity of their users is propagated to the Partner AI Apps. This propagation makes sure users can properly access the Partner AI Apps' UI and perform authorized Partner AI App actions. 

 Partner AI Apps support the following identity sources: 
+  AWS IAM Identity Center 
+  External identity providers (IdPs)  
+  IAM Session-based identity 

 The following sections give information about each identity source that Partner AI Apps support, along with important details related to it. 

#### IAM Identity Center
<a name="partner-app-onboard-user-auth-idc"></a>

If a user is authenticated into Studio using IAM Identity Center and launches an application from Studio, the IAM Identity Center `UserName` is automatically propagated as the user identity for a Partner AI App. This is not the case if the user launches the Partner AI App directly using the `CreatePartnerAppPresignedUrl` API.

#### External identity providers (IdPs)
<a name="partner-app-onboard-user-auth-idps"></a>

If using SAML for AWS account federation, admins have two options to carry over the IdP identity as the user identity for a Partner AI App. For information about setting up AWS account federation, see [How to Configure SAML 2.0 for AWS account Federation](https://saml-doc.okta.com/SAML_Docs/How-to-Configure-SAML-2.0-for-Amazon-Web-Service).  
+ **Principal Tag** – Admins can configure the IdP-specific IAM Identity Center application to pass identity information from the landing session using the AWS session `PrincipalTag` with the following `Name` attribute. When using SAML, the landing role session uses an IAM role. To use the `PrincipalTag`, admins must add the `sts:TagSession` permission to this landing role, as well as the Studio execution role. For more information about `PrincipalTag`, see [Configure SAML assertions for the authentication response](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_create_saml_assertions.html#saml_role-session-tags). 

  ```
  https://aws.amazon.com/SAML/Attributes/PrincipalTag:SageMakerPartnerAppUser
  ```
+ **Landing session name** – Admins can propagate the landing session name as the identity for the Partner AI App. To do this, they must set the `EnableIamSessionBasedIdentity` opt-in flag for each Partner AI App. For more information, see [`EnableIamSessionBasedIdentity`](#partner-app-onboard-user-iam-session).

#### IAM session-based identity
<a name="partner-app-onboard-user-auth-iam"></a>

**Important**  
We do not recommend using this method for production accounts. For production accounts, use an identity provider for increased security.

 SageMaker AI supports the following options for identity propagation when using an IAM session-based identity. All of the options, except using a session tag with AWS STS, require setting the `EnableIamSessionBasedIdentity` opt-in flag for each application. For more information, see [`EnableIamSessionBasedIdentity`](#partner-app-onboard-user-iam-session).

When propagating identities, SageMaker AI verifies whether an AWS STS Session tag is being used. If one is not used, then SageMaker AI propagates the IAM username or AWS STS session name. 
+  **AWS STS Session tag** – Admins can set a `SageMakerPartnerAppUser` session tag for the launcher IAM session. When admins launch a Partner AI App using the SageMaker AI console or the AWS CLI, the `SageMakerPartnerAppUser` session tag is automatically passed as the user identity for the Partner AI App. The following example shows how to set the `SageMakerPartnerAppUser` session tag using the AWS CLI. The value of the key is added as a principal tag.

  ```
  aws sts assume-role \
      --role-arn arn:aws:iam::account:role/iam-role-used-to-launch-partner-ai-app \
      --role-session-name session_name \
      --tags Key=SageMakerPartnerAppUser,Value=user-name
  ```

   When giving users access to a Partner AI App using `CreatePartnerAppPresignedUrl`, we recommend verifying the value for the `SageMakerPartnerAppUser` key. This helps to prevent unintended access to Partner AI App resources. The following trust policy verifies that the session tag matches the associated IAM username. Admins can use any principal tag for this purpose. It should be configured on the role that is launching Studio or the Partner AI App.

------
#### [ JSON ]

****  

  ```
  {
      "Version":"2012-10-17",		 	 	 
      "Statement": [
          {
              "Sid": "RoleTrustPolicyRequireUsernameForSessionName",
              "Effect": "Allow",
              "Action": [
                  "sts:AssumeRole",
                  "sts:TagSession"
              ],
              "Principal": {
                  "AWS": "arn:aws:iam::111122223333:root"
              },
              "Condition": {
                  "StringLike": {
                      "aws:RequestTag/SageMakerPartnerAppUser": "prefix${aws:username}"
                  }
              }
          }
      ]
  }
  ```

------
+  **Authenticated IAM user** – The username of the user is automatically propagated as the Partner AI App user. 
+  **AWS STS session name** – If no `SageMakerPartnerAppUser` session tag is configured when using AWS STS, SageMaker AI returns an error when users launch a Partner AI App. To avoid this error, admins must set the `EnableIamSessionBasedIdentity` opt-in flag for each Partner AI App. For more information, see [`EnableIamSessionBasedIdentity`](#partner-app-onboard-user-iam-session).

   When the `EnableIamSessionBasedIdentity` opt-in flag is enabled, use the [IAM role trust policy](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_iam-condition-keys.html#ck_rolesessionname) to make sure that the IAM session name is or contains the IAM username. This makes sure that users don't gain access by impersonating other users. The following trust policy verifies that the session name exactly matches the associated IAM user. Admins can use any principal tag for this purpose. It should be configured on the role that is launching Studio or the Partner AI App.

------
#### [ JSON ]

****  

  ```
  {
      "Version":"2012-10-17",		 	 	 
      "Statement": [
          {
              "Sid": "RoleTrustPolicyRequireUsernameForSessionName",
              "Effect": "Allow",
              "Action": "sts:AssumeRole",
              "Principal": {
                  "AWS": "arn:aws:iam::111122223333:root"
              },
              "Condition": {
                  "StringEquals": {
                      "sts:RoleSessionName": "${aws:username}"
                  }
              }
          }
      ]
  }
  ```

------

  Admins must also add the `sts:TagSession` trust policy to the role that is launching Studio or the Partner AI App. This makes sure that the identity can be propagated properly.

  ```
  {
      "Effect": "Allow",
      "Principal": {
          "Service": "sagemaker.amazonaws.com"
      },
      "Action": [
                  "sts:AssumeRole",
                  "sts:TagSession"
               ]
  }
  ```

 After setting the credentials, admins can give their users access to Studio or the Partner AI App from the AWS CLI using either the `CreatePresignedDomainUrl` or `CreatePartnerAppPresignedUrl` API calls, respectively.

Users can also then launch Studio from the SageMaker AI console, and launch Partner AI Apps from Studio.
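The session-tag and presigned URL steps above can also be scripted together. The following is a hedged boto3 sketch; the role ARN, user name, and app ARN are placeholders, and the `Arn` parameter name for `CreatePartnerAppPresignedUrl` is an assumption to verify against the API reference.

```python
def partner_app_session_tags(user_name):
    """Session tag that SageMaker AI propagates as the Partner AI App user identity."""
    return [{"Key": "SageMakerPartnerAppUser", "Value": user_name}]

def launch_partner_app_url(role_arn, user_name, app_arn):
    """Assume the launch role with the session tag, then mint a presigned URL.

    Requires AWS credentials and the trust policies described above when actually run.
    """
    import boto3  # deferred so the tag helper stays usable without boto3
    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName=user_name,
        Tags=partner_app_session_tags(user_name),
    )["Credentials"]
    sagemaker = boto3.client(
        "sagemaker",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    # `Arn` parameter name assumed; check the CreatePartnerAppPresignedUrl API reference.
    return sagemaker.create_partner_app_presigned_url(Arn=app_arn)["Url"]
```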

### `EnableIamSessionBasedIdentity`
<a name="partner-app-onboard-user-iam-session"></a>

`EnableIamSessionBasedIdentity` is an opt-in flag. When the `EnableIamSessionBasedIdentity` flag is set, SageMaker AI passes IAM session information as the Partner AI App user identity. For more information about AWS STS sessions, see [Use temporary credentials with AWS resources](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_use-resources.html).

### Access control
<a name="partner-app-onboard-user-access"></a>

 To control access to Partner AI Apps, use an IAM policy attached to the user profile’s execution role. To launch a Partner AI App directly from Studio or using the AWS CLI, the user profile’s execution role must have a policy that gives permissions for the `CreatePartnerAppPresignedUrl` API. Remove this permission from the user profile’s execution role to prevent those users from launching Partner AI Apps. 

### Root admin users
<a name="partner-app-onboard-user-root"></a>

 The Comet and Fiddler Partner AI Apps require at least one root admin user. Root admin users have permissions to add both normal and admin users and manage resources. The usernames provided as root admin users must be consistent with the usernames from the identity source. 

 While root admin users are persisted in SageMaker AI, normal admin users are not and exist only within the Partner AI App until the Partner AI App is terminated. 

 Admins can update root admin users using the `UpdatePartnerApp` API call. When root admin users are updated, the updated list of root admin users is passed to the Partner AI App. The Partner AI App makes sure that all usernames in the list are granted root admin privileges. If a root admin user is removed from the list, the user still retains normal admin permissions until either:
+ The user is removed from the application.
+ Another admin user revokes admin permissions for the user.

**Note**  
Fiddler doesn't support updating admin users. Only Comet supports updates to root admin users.  

 To delete a root admin user, you must first update the list of root admin users using the `UpdatePartnerApp` API. Then, remove or revoke the admin permissions through the Partner AI App's UI.

 If you remove a root admin user from the Partner AI App's UI without updating the list of root admin users with the `UpdatePartnerApp` API, the change is temporary. When SageMaker AI sends the next Partner AI App update request, SageMaker AI sends the root admin list that still includes the user to the Partner AI App. This overrides the deletion completed from the Partner AI App UI. 

# Partner AI App provisioning
<a name="partner-apps-provision"></a>

After admins have set up the required permissions, they can explore and provision Amazon SageMaker Partner AI Apps for users in the domain.

Admins can view all available Partner AI Apps, as well as the Partner AI Apps that they have provisioned, from the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/). From the **Partner AI Apps** page, admins can view details about the pricing model for each Partner AI App and make them available to users. Admins can make them available by navigating to the AWS Marketplace to subscribe to that Partner AI App.

 Admins can provision new apps from the Partner AI Apps page. They can also view the Partner AI Apps that they have already provisioned from the **My Apps** tab.

**Note**  
Applications that admins provision can be accessed by all users that admins give proper permissions to in an AWS account. Partner AI Apps are not restricted to a specific domain or user.

## Status
<a name="partner-apps-provision-status"></a>

 When admins view a Partner AI App that they have provisioned, they can also see the status of their application with one of the following values.
+  **Deployed** – The application is ready for use. Admins can update the application configuration and delete the application.
+ **Error** – There was an issue with the application deployment. Admins can troubleshoot and configure the application again to deploy it.
+ **Not deployed** – The application has been subscribed to, but not deployed. Admins can configure the application to deploy it.
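Admins can also check these statuses programmatically with the `ListPartnerApps` API. The following sketch filters for apps that are ready for use; the `Summaries` and `Status` response field names are assumptions for illustration, so verify them against the API reference.

```python
def deployed_app_arns(summaries):
    """Return ARNs of Partner AI Apps whose status is Deployed (ready for use)."""
    return [s["Arn"] for s in summaries if s.get("Status") == "Deployed"]

# In an AWS session, the summaries would come from something like:
#   import boto3
#   summaries = boto3.client("sagemaker").list_partner_apps()["Summaries"]
sample = [
    {"Arn": "arn:aws:sagemaker:us-east-1:111122223333:partner-app/app-abc", "Status": "Deployed"},
    {"Arn": "arn:aws:sagemaker:us-east-1:111122223333:partner-app/app-def", "Status": "Error"},
]
print(deployed_app_arns(sample))
```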

## Options
<a name="partner-apps-provision-options"></a>

 When admins configure an application, they can decide the following options: 
+  **App name** – A unique name for the application. 
+  **App maintenance schedule** – Partner AI Apps undergo maintenance on a weekly basis. With this option, admins choose both the day of the week and the time that this maintenance happens. 
+  **STS identity propagation** – Use this option to pass the AWS Security Token Service (AWS STS) launcher IAM session name as the Partner AI App user identity. For more information, see [Set up Partner AI Apps](partner-app-onboard.md). 
+  **Admin management** – Some Partner AI Apps support adding up to five admins that have full rights to manage the Partner AI App functionality. This only applies to Comet and Fiddler. For more information, see [Set up Partner AI Apps](partner-app-onboard.md). 
+  **Execution role** – The role that the Partner AI App uses to access resources and perform actions. For more information, see [Set up Partner AI Apps](partner-app-onboard.md). 
+  **App version** – The version of the Partner AI App that admins want to use.  
+  **Tier selection** – The infrastructure deployment tier for the Partner AI App. The tier size impacts the speed and capabilities of the application. For more information, see [Set up Partner AI Apps](partner-app-onboard.md). 
+  **Lakera S3 bucket policy** – This is only required by the Lakera Guard app to access an Amazon S3 bucket.

# Set up the Amazon SageMaker Partner AI Apps SDKs
<a name="partner-apps-sdk"></a>

 The following topics outline the process for installing and using the application-specific SDKs with Amazon SageMaker Partner AI Apps. To install and use an application's SDK, you must set environment variables specific to Partner AI Apps so that the SDK can pick them up and trigger authorization. The following sections describe the steps needed to complete this for each supported application type. 

## Comet
<a name="partner-apps-sdk-comet"></a>

 Comet offers two products: 
+  Opik is an open source LLM evaluation framework. 
+  Comet’s ML platform can be used to track, compare, explain, and optimize models across the complete ML lifecycle. 

Comet supports the use of two different SDKs based on the product that you are interacting with. Complete the following procedure to install and use the Comet or Opik SDKs. For more information about the Comet SDK, see [Quickstart](https://www.comet.com/docs/v2/guides/quickstart/). For more information about the Opik SDK, see [Open source LLM evaluation framework](https://github.com/comet-ml/opik).

1. Launch the environment that you are using the Comet or Opik SDKs with Partner AI Apps in. For information about launching a JupyterLab application, see [Create a space](studio-updated-jl-user-guide-create-space.md). For information about launching a Code Editor, based on Code-OSS, Visual Studio Code - Open Source application, see [Launch a Code Editor application in Studio](code-editor-use-studio.md).

1.  Launch a Jupyter notebook or Code Editor space. 

1.  From the development environment, install the compatible Comet, Opik, and SageMaker Python SDK versions. To be compatible: 
   +  The SageMaker Python SDK version must be at least `2.237.0`.
   +  The Comet SDK version must be the latest version.
   +  The Opik SDK version must match the version used by your Opik application. Verify the Opik version used in the Opik web application UI. The exception to this is that the Opik SDK version must be at least `1.2.0` when the Opik application version is `1.1.5`.
**Note**  
SageMaker JupyterLab comes with SageMaker Python SDK installed. However, you may need to upgrade the SageMaker Python SDK if the version is lower than `2.237.0`.

   ```
   %pip install "sagemaker>=2.237.0" comet_ml
   
   ## or
   
   %pip install "sagemaker>=2.237.0" "opik==<compatible-version>"
   ```

1.  Set the following environment variables for the application resource ARN. These environment variables are used to communicate with the Comet and Opik SDKs. To retrieve these values, navigate to the details page for the application in Amazon SageMaker Studio.

   ```
   os.environ['AWS_PARTNER_APP_AUTH'] = 'true'
   os.environ['AWS_PARTNER_APP_ARN'] = '<partner-app-ARN>'
   ```

1.  For the Comet application, the SDK URL is automatically included as part of the API key set in the following step. You may instead set the `COMET_URL_OVERRIDE` environment variable to manually override the SDK URL.

   ```
   os.environ['COMET_URL_OVERRIDE'] = '<comet-url>'
   ```

1.  For the Opik application, the SDK URL is automatically included as part of the API key set in the following step. You may instead set the `OPIK_URL_OVERRIDE` environment variable to manually override the SDK URL. To get the Opik workspace name, see the Opik application and navigate to the user's workspace.

   ```
   os.environ['OPIK_URL_OVERRIDE'] = '<opik-url>'
   os.environ['OPIK_WORKSPACE'] = '<workspace-name>'
   ```

1.  Set the environment variable that identifies the API key for Comet or Opik. This is used to verify the connection from SageMaker to the application when the Comet and Opik SDKs are used. This API key is application-specific and is not managed by SageMaker. To get this key, you must log into the application and retrieve the API key. The Opik API key is the same as the Comet API key.

   ```
   os.environ['COMET_API_KEY'] = '<API-key>'
   os.environ["OPIK_API_KEY"] = os.environ["COMET_API_KEY"]
   ```
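The environment variables from the steps above can be grouped into a small helper so a notebook sets them consistently. The `configure_partner_app` function and the placeholder values are illustrative only, not part of any SDK; the variable names are the ones documented above.

```python
import os

def configure_partner_app(app_arn, api_key, opik_url=None, opik_workspace=None):
    """Illustrative helper: set the Partner AI App variables described above."""
    os.environ["AWS_PARTNER_APP_AUTH"] = "true"
    os.environ["AWS_PARTNER_APP_ARN"] = app_arn
    os.environ["COMET_API_KEY"] = api_key
    os.environ["OPIK_API_KEY"] = api_key  # the Opik API key is the Comet API key
    if opik_url:
        os.environ["OPIK_URL_OVERRIDE"] = opik_url
    if opik_workspace:
        os.environ["OPIK_WORKSPACE"] = opik_workspace

# Placeholder values for illustration
configure_partner_app("arn:aws:sagemaker:region:account:partner-app/example",
                      "example-api-key")
```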

## Fiddler
<a name="partner-apps-sdk-fiddler"></a>

 Complete the following procedure to install and use the Fiddler Python Client. For information about the Fiddler Python Client, see [About Client 3.x](https://docs.fiddler.ai/python-client-3-x/about-client-3x). 

1.  Launch the notebook environment that you are using the Fiddler Python Client with Partner AI Apps in. For information about launching a JupyterLab application, see [Create a space](studio-updated-jl-user-guide-create-space.md). For information about launching a Code Editor, based on Code-OSS, Visual Studio Code - Open Source application, see [Launch a Code Editor application in Studio](code-editor-use-studio.md).

1.  Launch a Jupyter notebook or Code Editor space. 

1.  From the development environment, install compatible versions of the Fiddler Python Client and the SageMaker Python SDK. To be compatible: 
   +  The SageMaker Python SDK version must be at least `2.237.0`. 
   +  The Fiddler Python Client version must be compatible with the version of Fiddler used in the application. After verifying the Fiddler version from the UI, see the Fiddler [Compatibility Matrix](https://docs.fiddler.ai/history/compatibility-matrix) for the compatible Fiddler Python Client version. 
**Note**  
SageMaker JupyterLab comes with SageMaker Python SDK installed. However, you may need to upgrade the SageMaker Python SDK if the version is lower than `2.237.0`. 

   ```
   %pip install "sagemaker>=2.237.0" fiddler-client==<compatible-version>
   ```

1.  Set the following environment variables for the application resource ARN and the SDK URL. These environment variables are used to communicate with the Fiddler Python Client. To retrieve these values, navigate to the details page for the Fiddler application in Amazon SageMaker Studio.   

   ```
   import os

   os.environ['AWS_PARTNER_APP_AUTH'] = 'true'
   os.environ['AWS_PARTNER_APP_ARN'] = '<partner-app-ARN>'
   os.environ['AWS_PARTNER_APP_URL'] = '<partner-app-URL>'
   ```

1.  Set the environment variable that identifies the API key for the Fiddler application. This is used to verify the connection from SageMaker to the Fiddler application when the Fiddler Python Client is used. This API key is application-specific and is not managed by SageMaker. To get this key, you must log into the Fiddler application and retrieve the API key. 

   ```
   os.environ['FIDDLER_KEY'] = '<API-key>'
   ```

## Deepchecks
<a name="partner-apps-sdk-deepchecks"></a>

 Complete the following procedure to install and use the Deepchecks Python SDK. 

1.  Launch the notebook environment that you are using the Deepchecks Python SDK with Partner AI Apps in. For information about launching a JupyterLab application, see [Create a space](studio-updated-jl-user-guide-create-space.md). For information about launching a Code Editor, based on Code-OSS, Visual Studio Code - Open Source application, see [Launch a Code Editor application in Studio](code-editor-use-studio.md).

1.  Launch a Jupyter notebook or Code Editor space. 

1.  From the development environment, install the compatible Deepchecks Python SDK and SageMaker Python SDK versions. Partner AI Apps runs version `0.21.15` of Deepchecks. To be compatible: 
   +  The SageMaker Python SDK version must be at least `2.237.0`. 
   +  The Deepchecks Python SDK must use the minor version `0.21`. 
**Note**  
SageMaker JupyterLab comes with SageMaker Python SDK installed. However, you may need to upgrade the SageMaker Python SDK if the version is lower than `2.237.0`. 

   ```
   %pip install "sagemaker>=2.237.0" "deepchecks-llm-client>=0.21,<0.22"
   ```
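To confirm that the installed client stays on the `0.21` minor line, a string comparison like the following works. `minor_matches` is an illustrative helper; the installed version string could be obtained with `importlib.metadata.version("deepchecks-llm-client")`.

```python
# Illustrative sketch: check that a version string is on the required
# major.minor line (0.21), matching the >=0.21,<0.22 pin above.
def minor_matches(installed: str, required_minor: str = "0.21") -> bool:
    """Return True when installed is on the required major.minor line."""
    return ".".join(installed.split(".")[:2]) == required_minor

print(minor_matches("0.21.15"))  # True: same minor line as the application
print(minor_matches("0.22.0"))   # False: outside the supported range
```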

1.  Set the following environment variables for the application resource ARN and the SDK URL. These environment variables are used to communicate with the Deepchecks Python SDK. To retrieve these values, navigate to the details page for the application in Amazon SageMaker Studio.   

   ```
   import os

   os.environ['AWS_PARTNER_APP_AUTH'] = 'true'
   os.environ['AWS_PARTNER_APP_ARN'] = '<partner-app-ARN>'
   os.environ['AWS_PARTNER_APP_URL'] = '<partner-app-URL>'
   ```

1.  Set the environment variable that identifies the API key for the Deepchecks application. This is used to verify the connection from SageMaker to the Deepchecks application when the Deepchecks Python SDK is used. This API key is application-specific and is not managed by SageMaker. To get this key, see [Setup: Python SDK Installation & API Key Retrieval](https://llmdocs.deepchecks.com/docs/setup-sdk-installation-api-key#generate-an-api-key-via-the-ui). 

   ```
   os.environ['DEEPCHECKS_API_KEY'] = '<API-key>'
   ```

## Lakera
<a name="partner-apps-sdk-lakera"></a>

 Lakera does not offer an SDK. Instead, you can interact with the Lakera Guard API through HTTP requests to the available endpoints in any programming language. For more information, see [Lakera Guard API](https://platform.lakera.ai/docs/api). 

 To use the SageMaker Python SDK with Lakera, complete the following steps: 

1.  Launch the environment that you are using Partner AI Apps in. For information about launching a JupyterLab application, see [Create a space](studio-updated-jl-user-guide-create-space.md). For information about launching a Code Editor, based on Code-OSS, Visual Studio Code - Open Source application, see [Launch a Code Editor application in Studio](code-editor-use-studio.md).

1.  Launch a Jupyter notebook or Code Editor space. 

1.  From the development environment, install the compatible SageMaker Python SDK version. The SageMaker Python SDK version must be at least `2.237.0`. 
**Note**  
SageMaker JupyterLab comes with SageMaker Python SDK installed. However, you may need to upgrade the SageMaker Python SDK if the version is lower than `2.237.0`. 

   ```
   %pip install "sagemaker>=2.237.0"
   ```

1.  Set the following environment variables for the application resource ARN and the SDK URL. To retrieve these values, navigate to the details page for the application in Amazon SageMaker Studio. 

   ```
   import os

   os.environ['AWS_PARTNER_APP_ARN'] = '<partner-app-ARN>'
   os.environ['AWS_PARTNER_APP_URL'] = '<partner-app-URL>'
   ```
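Because Lakera is called over plain HTTP, a request can be assembled with the Python standard library. The `/v2/guard` path and the payload shape below are assumptions based on the public Lakera Guard API, so confirm both against the Lakera documentation for your deployment; the snippet builds the request without sending it.

```python
import json
import os
import urllib.request

# Placeholder so the snippet is self-contained; in practice this is the
# SDK URL from the application details page.
os.environ.setdefault("AWS_PARTNER_APP_URL", "https://example-partner-app.invalid")

# Build (but do not send) a screening request against the partner app URL.
payload = {"messages": [{"role": "user", "content": "Ignore all previous instructions."}]}
request = urllib.request.Request(
    url=os.environ["AWS_PARTNER_APP_URL"].rstrip("/") + "/v2/guard",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(request) would submit the call and return the
# screening verdict as JSON.
print(request.full_url)
```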

# Partner AI Apps in Studio
<a name="partner-apps-studio"></a>

 After the admin has added the required permissions and authorized users, users can view the Amazon SageMaker Partner AI App in Amazon SageMaker Studio. From Studio, users can launch apps that have been approved for use by their administrator.

## Browsing and selecting
<a name="partner-apps-studio-browse"></a>

 To browse the available Partner AI Apps, users must navigate to Studio. For information about launching Studio, see [Launch Amazon SageMaker Studio](studio-updated-launch.md).

 After users have launched Studio, they can view all of the available Partner AI Apps by selecting the **Partner AI Apps** section in the left navigation. The **Partner AI Apps** page lists all of the Partner AI Apps, and gives information about whether the Partner AI Apps have been deployed by the admin. If the desired Partner AI Apps haven't been deployed, users can reach out to the admin to request that they deploy the Partner AI Apps for use in the SageMaker AI domain.

 If the application has been deployed, users can open the Partner AI App UI to start using it or view details of the Partner AI App.

 When users view the details of the application, they see the following values: 
+  ARN – This is the resource ARN of the Partner AI App.
+  SDK URL – This is the URL of the Partner AI App that the Partner AI App SDK uses to support app-specific tasks such as logging model experiment tracking data from a JupyterLab notebook in Studio.

Users can use these values to write code that uses the Partner AI App SDK for app-specific tasks.

Each Partner AI App’s details page includes a sample notebook. To get started, users can launch the sample notebook in a JupyterLab space in the Studio environment.

# Use AWS KMS Permissions for Amazon SageMaker Partner AI Apps
<a name="partner-apps-kms"></a>

You can protect your data at rest using encryption for Amazon SageMaker Partner AI Apps. By default, it uses server-side encryption with a SageMaker owned key. SageMaker also supports an option for server-side encryption with a customer managed KMS key.

## Server-side encryption with SageMaker managed keys (Default)
<a name="partner-apps-managed-key"></a>

Partner AI Apps encrypt all your data at rest using an AWS managed key by default.

## Server-side encryption with customer managed KMS keys (Optional)
<a name="partner-apps-customer-managed-key"></a>

Partner AI Apps support the use of a symmetric customer managed key that you create, own, and manage to replace the existing AWS owned encryption. Because you have full control of this layer of encryption, you can perform such tasks as:
+ Establishing and maintaining key policies
+ Establishing and maintaining IAM policies and grants
+ Enabling and disabling key policies
+ Rotating key cryptographic material
+ Adding tags
+ Creating key aliases
+ Scheduling keys for deletion

For more information, see [Customer managed keys](https://docs.aws.amazon.com/kms/latest/developerguide/concepts.html#customer-cmk) in the *AWS Key Management Service Developer Guide*.

## How Partner AI Apps use grants in AWS KMS
<a name="partner-apps-grants-cmk"></a>

Partner AI Apps require a grant to use your customer managed key. When you create an application encrypted with a customer managed key, Partner AI Apps creates a grant on your behalf by sending a CreateGrant request to AWS KMS. Grants in AWS KMS are used to give Partner AI Apps access to a KMS key in a customer account.

You can revoke access to the grant, or remove the service's access to the customer managed key, at any time. If you do, Partner AI Apps won't be able to access any of the data encrypted by the customer managed key, which affects operations that depend on that data. The application will not operate properly and will become irrecoverable.

## Create a customer managed key
<a name="partner-apps-create-cmk"></a>

You can create a symmetric customer managed key by using the AWS Management Console or the AWS KMS APIs.

**To create a symmetric customer managed key**

Follow the steps for [Creating symmetric encryption KMS keys](https://docs.aws.amazon.com/kms/latest/developerguide/create-keys.html#create-symmetric-cmk) in the *AWS Key Management Service Developer Guide*.

**Key policy**

Key policies control access to your customer managed key. Every customer managed key must have exactly one key policy, which contains statements that determine who can use the key and how they can use it. When you create your customer managed key, you can specify a key policy. For more information, see [Determining access to AWS KMS keys](https://docs.aws.amazon.com/kms/latest/developerguide/determining-access.html) in the *AWS Key Management Service Developer Guide*.

To use your customer managed key with your Partner AI App resources, the following API operations must be permitted in the key policy. The principal for these operations depends on whether the role is used to create or use the application. 
+ Creating the application:
  + [kms:CreateGrant](https://docs.aws.amazon.com/kms/latest/APIReference/API_CreateGrant.html)
  + [kms:DescribeKey](https://docs.aws.amazon.com/kms/latest/APIReference/API_DescribeKey.html)
+ Using the application:
  + [kms:Decrypt](https://docs.aws.amazon.com/kms/latest/APIReference/API_Decrypt.html)
  + [kms:GenerateDataKey](https://docs.aws.amazon.com/kms/latest/APIReference/API_GenerateDataKey.html)

The following are policy statement examples you can add for Partner AI Apps based on whether the persona is an administrator or user. For more information about specifying permissions in a policy, see [AWS KMS permissions](https://docs.aws.amazon.com/kms/latest/developerguide/kms-api-permissions-reference.html) in the *AWS Key Management Service Developer Guide*. For more information about troubleshooting, see [Troubleshooting key access](https://docs.aws.amazon.com/kms/latest/developerguide/policy-evaluation.html) in the *AWS Key Management Service Developer Guide*.

**Administrator**

The following policy statement is used for the administrator who is creating Partner AI Apps.

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",
    "Id": "example-key-policy",
    "Statement": [
        {
            "Sid": "Allow use of the key",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::111122223333:role/<admin-role>"
            },
            "Action": [
                "kms:CreateGrant",
                "kms:DescribeKey"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "kms:ViaService": "sagemaker.us-east-1.amazonaws.com"
                }
            }
        }
    ]
}
```

------

**User**

The following policy statement is for the user of the Partner AI Apps.

------
#### [ JSON ]

****  

```
{
  "Version":"2012-10-17",
  "Id":"example-key-policy",
  "Statement":[
    {
      "Effect":"Allow",
      "Principal":{
        "AWS":"arn:aws:iam::111122223333:role/user-role"
      },
      "Action":[
        "kms:Decrypt",
        "kms:GenerateDataKey"
      ],
      "Resource":"*",
      "Condition":{
        "StringEquals":{
          "kms:ViaService":"sagemaker.us-east-1.amazonaws.com"
        }
      }
    }
  ]
}
```

------

# Setting up cross-account sharing for Amazon SageMaker AI partner AI apps
<a name="partner-app-resource-sharing-ram"></a>

Amazon SageMaker AI integrates with AWS Resource Access Manager (AWS RAM) to enable resource sharing. AWS RAM is a service that enables you to share some Amazon SageMaker AI resources with other AWS accounts or through AWS Organizations. With AWS RAM, you share resources that you own by creating a *resource share*. A resource share specifies the resources to share and the consumers with whom to share them. Consumers can be specific AWS accounts inside or outside of your organization in AWS Organizations.

For more information about AWS RAM, see the *[AWS RAM User Guide](https://docs.aws.amazon.com/ram/latest/userguide/)*.

This topic explains how to share resources that you own, and how to use resources that are shared with you.

**Topics**
+ [Prerequisites for sharing an Amazon SageMaker Partner AI App](#partner-app-resource-sharing-ram-prereqs)
+ [Sharing an Amazon SageMaker Partner AI App](#partner-app-resource-sharing-share)
+ [Accepting resource share invitations](#partner-app-resource-sharing-responses)
+ [Identifying a shared Amazon SageMaker Partner AI App](#sharing-identify)
+ [Responsibilities and permissions for shared Amazon SageMaker Partner AI Apps](#sharing-perms)

## Prerequisites for sharing an Amazon SageMaker Partner AI App
<a name="partner-app-resource-sharing-ram-prereqs"></a>
+ To share an Amazon SageMaker Partner AI App, you must own it in your AWS account. This means that the resource must be allocated or provisioned in your account. You cannot share an Amazon SageMaker Partner AI App that has been shared with you.
+ To share an Amazon SageMaker Partner AI App with your organization or an organizational unit in AWS Organizations, you must enable sharing with AWS Organizations. For more information, see [Enable Sharing with AWS Organizations](https://docs.aws.amazon.com/ram/latest/userguide/getting-started-sharing.html#getting-started-sharing-orgs) in the *AWS RAM User Guide*.

## Sharing an Amazon SageMaker Partner AI App
<a name="partner-app-resource-sharing-share"></a>

To share an Amazon SageMaker Partner AI App, you must add it to a resource share. A resource share is an AWS RAM resource that lets you share your resources across AWS accounts. A resource share specifies the resources to share, and the consumers with whom they are shared. When you share an Amazon SageMaker Partner AI App using the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker), you add it to an existing resource share. To add the Amazon SageMaker Partner AI App to a new resource share, you must first create the resource share by using the [AWS RAM console](https://console.aws.amazon.com/ram).

You can share an Amazon SageMaker Partner AI App that you own using the Amazon SageMaker AI console, AWS RAM console, or the AWS CLI.

**To share an Amazon SageMaker Partner AI App that you own using the Amazon SageMaker AI console**

1. Sign in to the AWS Management Console and open the AWS RAM console at [https://console.aws.amazon.com/ram/home](https://console.aws.amazon.com/ram/home).

1. In the main pane, choose **Create a resource share**.

1. Enter a name for the resource share that you want to create.

1. In the **Resources** section, for **Resource type** select **SageMaker AI Partner Apps**. The partner apps that you can share appear in the table.

1. Select the partner apps that you want to share.

1. Optionally specify tags, and then choose **Next**.

1. Specify the AWS accounts with which you want to share your partner apps.

1. Review your resource share configuration and choose **Create resource share**. It might take the service a few minutes to finish creating the resource share.

**To share an Amazon SageMaker Partner AI App that you own using the AWS RAM console**  
See [Creating a Resource Share](https://docs.aws.amazon.com/ram/latest/userguide/working-with-sharing.html#working-with-sharing-create) in the *AWS RAM User Guide*.

**To share an Amazon SageMaker Partner AI App that you own using the AWS CLI**  
Use the [create-resource-share](https://docs.aws.amazon.com/cli/latest/reference/ram/create-resource-share.html) command.

## Accepting resource share invitations
<a name="partner-app-resource-sharing-responses"></a>

When a resource owner sets up a resource share, each consumer AWS account receives an invitation to join the resource share. The consumer AWS accounts must accept the invitation to gain access to any shared resources.

For more information on accepting a resource share invitation through AWS RAM, see [Using shared AWS resources](https://docs.aws.amazon.com/ram/latest/userguide/getting-started-shared.html) in the *AWS Resource Access Manager User Guide*.

## Identifying a shared Amazon SageMaker Partner AI App
<a name="sharing-identify"></a>

Owners and consumers can identify shared Amazon SageMaker Partner AI Apps using the Amazon SageMaker AI console and AWS CLI.

**To identify a shared Amazon SageMaker Partner AI App by using the Amazon SageMaker AI console**  
See [Partner AI Apps in Studio](partner-apps-studio.md).

**To identify a shared Amazon SageMaker Partner AI App by using the AWS CLI**  
Use the [list-partner-apps](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-partner-apps.html) command. The command returns the Amazon SageMaker Partner AI Apps that you own and Amazon SageMaker Partner AI Apps that are shared with you. `OwnerId` shows the AWS account ID of the Amazon SageMaker Partner AI App owner.
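The `OwnerId` field makes it straightforward to separate apps you own from apps shared with you once the summaries are in hand. The sketch below assumes each summary is a dict carrying `OwnerId`, as described above, and uses hypothetical sample data; verify the exact response shape against the `list-partner-apps` reference.

```python
def split_by_ownership(summaries, my_account_id):
    """Partition partner app summaries into (owned, shared) by OwnerId."""
    owned = [s for s in summaries if s.get("OwnerId") == my_account_id]
    shared = [s for s in summaries if s.get("OwnerId") != my_account_id]
    return owned, shared

# Hypothetical summaries, illustrating the split
apps = [
    {"Name": "comet-app", "OwnerId": "111122223333"},
    {"Name": "fiddler-app", "OwnerId": "444455556666"},
]
owned, shared = split_by_ownership(apps, "111122223333")
print([a["Name"] for a in owned])   # ['comet-app']
print([a["Name"] for a in shared])  # ['fiddler-app']
```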

## Responsibilities and permissions for shared Amazon SageMaker Partner AI Apps
<a name="sharing-perms"></a>

The account with which an Amazon SageMaker Partner AI App is shared needs to have the following AWS Identity and Access Management policy.

------
#### [ JSON ]

****  

```
{
  "Version":"2012-10-17",
  "Statement" : [
    {
      "Sid" : "AmazonSageMakerPartnerListAppsPermission",
      "Effect" : "Allow",
      "Action" : "sagemaker:ListPartnerApps",
      "Resource" : "*"
    },
    {
      "Sid" : "AmazonSageMakerPartnerAppsPermission",
      "Effect" : "Allow",
      "Action" : [
        "sagemaker:CreatePartnerAppPresignedUrl",
        "sagemaker:DescribePartnerApp",
        "sagemaker:CallPartnerAppApi"
      ],
      "Condition" : {
        "StringEquals" : {
          "aws:ResourceAccount" : [
            "App-owner AWS account-1",
            "App-owner AWS account-2"
          ]
        }
      },
      "Resource" : "arn:aws:sagemaker:*:*:partner-app/*"
    }
  ]
}
```

------